Projects
Assessing credit default risk with supervised learning > link
- Data preprocessing and feature engineering of data from the Home Credit Default Risk dataset.
- Evaluating, comparing and tuning the performance of several supervised learning models.
- Models tested: DecisionTreeClassifier, RandomForestClassifier, GradientBoostingClassifier, XGBClassifier, LGBMClassifier, LinearSVC, RidgeClassifier
- Applying oversampling and undersampling techniques, cross-validation and curve plotting
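The resampling and cross-validation steps above can be sketched as follows. This is a minimal illustration on synthetic data (a stand-in for the Home Credit table), using plain random undersampling with NumPy; the project may well use imbalanced-learn's samplers instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the imbalanced credit data (~8% defaults).
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.92, 0.08], random_state=0)

# Random undersampling: keep every minority row, draw an equal
# number of majority rows without replacement.
rng = np.random.default_rng(0)
minority = np.where(y == 1)[0]
majority = rng.choice(np.where(y == 0)[0], size=len(minority), replace=False)
idx = np.concatenate([minority, majority])
X_bal, y_bal = X[idx], y[idx]

# 5-fold cross-validated ROC AUC on the balanced sample.
model = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(model, X_bal, y_bal, cv=5, scoring="roc_auc")
print(f"ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Undersampling throws data away, which is why it is usually compared against oversampling (e.g. SMOTE) rather than used alone.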
E-commerce customer segmentation with unsupervised learning > link
- Data exploration, analysis, visualization and preprocessing: notebook link
- Feature engineering
- Choosing the number of clusters with KMeans and assigning each client to a cluster
- Testing the temporal stability of the model
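Choosing the number of clusters can be sketched like this, scoring a range of k values with the silhouette coefficient on synthetic data (a stand-in for the engineered customer features); the actual feature set and criterion used in the project may differ.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the engineered customer features.
X, _ = make_blobs(n_samples=600, centers=4, random_state=0)

# Score each candidate cluster count with the silhouette coefficient.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print("best k:", best_k)

# Assign each client to a cluster with the chosen k.
clusters = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)
```

The elbow method on inertia is a common alternative criterion; silhouette has the advantage of a bounded score that peaks at the best separation.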
Natural Language Processing and Computer Vision for the Yelp reviews database > link
Natural Language Processing
- Filtering a sample (5000 reviews) down to its negative reviews
- Preprocessing reviews to a format compatible with the NLP model
- Extraction of topics from negative reviews with Gensim's Latent Dirichlet Allocation
- Results analysis and visualization
Computer Vision: Image Classification
- Equalizing the histograms for each photo in the sample (100 photos per label, 500 photos total)
- Testing ORB for feature extraction
- Dimensionality reduction and KMeans clustering
- Using transfer learning with VGG16 for feature extraction
- Dimensionality reduction and KMeans clustering
- Visualizing and analyzing results
- Analyzing some examples of mislabeled photos
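The histogram-equalization step above can be sketched in plain NumPy. The project most likely uses OpenCV on color photos; this is the grayscale version of the same idea, shown on a synthetic low-contrast image.

```python
import numpy as np

def equalize_histogram(img: np.ndarray) -> np.ndarray:
    """Spread a grayscale image's intensity histogram over [0, 255]."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]          # first non-zero CDF value
    # Classic equalization formula, mapped back to 0..255 as a lookup table.
    lut = np.clip(np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255),
                  0, 255).astype(np.uint8)
    return lut[img]

# Low-contrast test image: intensities squeezed into 100..150.
rng = np.random.default_rng(0)
img = rng.integers(100, 151, size=(64, 64), dtype=np.uint8)
out = equalize_histogram(img)
print(out.min(), out.max())   # contrast stretched to the full 0..255 range
```

Equalizing contrast before feature extraction (ORB or VGG16) makes the descriptors less sensitive to lighting differences between photos.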
A summary of the project can be found here: https://katrinmisel.github.io/project_synthesis.html
Comparing deep learning model performances on binary sentiment classification > link
- Exploratory data analysis of Tweet data
- Evaluating two pretrained embedding dictionaries: Wiki2Vec and GloVe
- Evaluating the performance of several models:
- Simple neural network vs. bidirectional LSTM
- GloVe vs. Wiki2Vec embedding
- Tweet preprocessor library vs. simple text prep function vs. no text prep
- Deploying two apps to the cloud
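A "simple text prep function" of the kind compared above can be sketched like this; the exact rules used in the project may differ, but the idea is to strip tweet-specific noise before embedding.

```python
import re

def clean_tweet(text: str) -> str:
    """Minimal tweet cleaner: drop URLs and mentions, keep hashtag words,
    remove non-letters, lowercase, collapse whitespace."""
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"@\w+", " ", text)           # @mentions
    text = re.sub(r"#", " ", text)              # keep the hashtag word, drop '#'
    text = re.sub(r"[^a-zA-Z\s]", " ", text)    # punctuation, digits, emoji
    return re.sub(r"\s+", " ", text).strip().lower()

print(clean_tweet("LOVED it!! @friend check https://t.co/xyz #BestDay"))
```

The tweet-preprocessor library automates the same categories (URLs, mentions, hashtags, emoji) with configurable options, which is what makes the three-way comparison interesting.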
Image segmentation for autonomous vehicles > link
- Creating a custom data generator with the Keras Sequence class
- Image augmentation with the Albumentations library
- Benchmarking multiple architectures, backbones and metrics:
- Mini U-net as baseline > architecture
- Architectures tested: U-net, PSPnet, Linknet
- Backbones tested: VGG16, ResNet34
- Loss functions: Categorical Focal Dice Loss, Categorical Focal Jaccard Loss
- Performance obtained for best model (U-net with ResNet34 backbone, Categorical Focal Dice Loss, augmented data):
- MeanIoU = 0.73
- Loss = 0.20
- Creating a webapp with Flask and Streamlit, deploying baseline > link
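The custom data generator above can be sketched as follows. A Keras `Sequence` subclass needs exactly two methods, `__len__` and `__getitem__`; this stand-alone version shows that batching logic with placeholder loaders (real code would subclass `tensorflow.keras.utils.Sequence` and read the Cityscapes-style images and masks from disk; all paths and shapes here are hypothetical).

```python
import math
import numpy as np

class BatchGenerator:
    """Sketch of the batching logic behind a Keras Sequence-style generator."""

    def __init__(self, image_paths, mask_paths, batch_size=8, augment=None):
        self.image_paths = image_paths   # hypothetical lists of file paths
        self.mask_paths = mask_paths
        self.batch_size = batch_size
        self.augment = augment           # e.g. an Albumentations transform

    def __len__(self):
        # Number of batches per epoch (last partial batch included).
        return math.ceil(len(self.image_paths) / self.batch_size)

    def __getitem__(self, idx):
        lo = idx * self.batch_size
        hi = lo + self.batch_size
        # Stand-in loader: real code would read and resize images/masks here.
        imgs = [np.zeros((128, 256, 3)) for _ in self.image_paths[lo:hi]]
        masks = [np.zeros((128, 256), dtype=np.int32)
                 for _ in self.mask_paths[lo:hi]]
        if self.augment is not None:
            pairs = [self.augment(image=i, mask=m)
                     for i, m in zip(imgs, masks)]
            imgs = [p["image"] for p in pairs]
            masks = [p["mask"] for p in pairs]
        return np.stack(imgs), np.stack(masks)

gen = BatchGenerator([f"img_{i}.png" for i in range(20)],
                     [f"mask_{i}.png" for i in range(20)], batch_size=8)
xb, yb = gen[0]
print(len(gen), xb.shape)
```

Passing an Albumentations transform through `augment` applies image and mask augmentation jointly, which is essential for segmentation: both must receive the same geometric transform.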
Deploying a content recommendation app with Azure > link
- Exploring recommendation algorithm options:
- Content-Based Filtering with article embeddings
- Collaborative Filtering, comparing two libraries: Surprise and Implicit
- Deploying a webapp using Azure functions (HTTP triggered) to Streamlit > link
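The content-based option can be sketched like this: build a user profile by averaging the embeddings of clicked articles, then rank everything else by cosine similarity. The embedding matrix here is random, a stand-in for the dataset's precomputed article embeddings.

```python
import numpy as np

def recommend(user_clicked, article_embeddings, n=3):
    """Content-based filtering sketch: average the embeddings of clicked
    articles, rank all other articles by cosine similarity to that profile."""
    profile = article_embeddings[user_clicked].mean(axis=0)
    # Cosine similarity between the user profile and every article.
    norms = np.linalg.norm(article_embeddings, axis=1) * np.linalg.norm(profile)
    sims = article_embeddings @ profile / norms
    sims[user_clicked] = -np.inf          # never re-recommend clicked items
    return np.argsort(sims)[::-1][:n]

# Toy embedding matrix standing in for the article embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 50))
recs = recommend([3, 17], emb, n=5)
print(recs)
```

Wrapped in an HTTP-triggered Azure Function, a routine like this takes a user id, looks up their click history, and returns the top-n article ids to the Streamlit front end.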
Chatbot, built with the following tools:
- Microsoft Bot Framework SDK v4 for Python
- Azure Cognitive Services LUIS
- Web App with Azure
- Bot Framework Emulator