DS - Bootcamp - DEC22 - Rakuten Challenge

Presentation

This repository contains the code for the Rakuten project, based on data from the Rakuten Challenge and developed during the Data Scientist training at DataScientest.

The cataloging of product listings through title and image categorization is a fundamental problem for any e-commerce marketplace. The traditional way of categorizing is to do it manually. However, this takes up a lot of employees’ time and can be expensive.

The goal of this project is to predict a product’s type code through its description and image for the e-commerce platform Rakuten.

This project was developed by the following team :

Charly LAGRESLE (GitHub / LinkedIn)
Olga (GitHub / LinkedIn)
Mohamed BACHKAT (GitHub / LinkedIn)

You can browse and run the notebooks. To run them, first install the dependencies (preferably in a dedicated environment) :

pip install -r src/requirements.txt

A summary presentation of the project is also available in PDF (presentation.pdf) and PowerPoint (presentation.pptx) formats.

Application

The application consists of the following elements :

Frontend: the web interface, built with Streamlit
Backend: the API, built with FastAPI
Monitoring: a TensorBoard instance

The application is meant to be run in Docker containers [Streamlit + FastAPI + Docker = ♥].
To build and start the containers, use the following command:

docker-compose up --build

The app should then be available at localhost:8501, and the API documentation at localhost:8111/docs.
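To confirm that both services came up, a small check like the following can help. This is only a sketch: the helper name `is_reachable` and the `SERVICES` dictionary are hypothetical, and the URLs are simply the two addresses quoted above.

```python
import http.client
import urllib.request

# Addresses taken from this README; adjust them if the port mapping changes.
SERVICES = {
    "streamlit": "http://localhost:8501",
    "fastapi-docs": "http://localhost:8111/docs",
}


def is_reachable(url, timeout=2.0):
    """Return True if the URL answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, HTTP error status, etc.
        return False
```

Calling `is_reachable(SERVICES["fastapi-docs"])` after `docker-compose up --build` should return `True` once the backend has finished starting.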

Overview

The Rakuten Challenge dataset contains :

84 916 observations
27 categories to predict
No duplicate records
One color image per product
Images of 500x500 px in JPG format

A sample of the data:

The challenge presents several interesting research aspects due to :

the intrinsically noisy nature of the product labels and images
the unbalanced distribution of the categories
the large size of the data
product descriptions written in different languages

We treat this as a supervised, single-label classification problem with an imbalanced distribution of labels. Therefore, the metric used in this challenge to rank model performance is the weighted-F1 score, i.e. the per-category F1 scores averaged with weights proportional to the number of true samples in each category.
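As an illustration of the metric, here is a minimal pure-Python sketch of the weighted-F1 score. The function name `weighted_f1` and the toy labels are made up for the example; in practice one would more likely call scikit-learn's `f1_score(y_true, y_pred, average="weighted")`.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighted by each class's support."""
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label in set(y_true) | set(y_pred):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Classes that never occur in y_true have support 0 and add nothing.
        score += support[label] / total * f1
    return score


# Toy example: class "a" has F1 = 0.8 (support 3), class "b" has F1 = 2/3 (support 1).
example = weighted_f1(["a", "a", "a", "b"], ["a", "a", "b", "b"])
```

Because each class is weighted by its support, frequent categories dominate the score, which is why the per-category scores are also worth inspecting.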

The best model was developed in the following steps :

Creation of the text classifier
Creation of the image classifier
Fusion of the text classifier and the image classifier

Text Classifier :

Includes text preprocessing and text vectorization using Natural Language Processing (NLP) techniques.
Based on the Neural_Embedder text model.
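The exact preprocessing used by the project is not described here, so the following is only a sketch of what text cleaning and vectorization for an embedding-based model could look like. The function names (`clean_text`, `build_vocab`, `encode`) and the reserved indices 0 and 1 are assumptions for illustration, not the project's actual code.

```python
import html
import re
from collections import Counter


def clean_text(text):
    """Decode HTML entities, strip tags, lowercase, and split into tokens."""
    text = html.unescape(text)
    text = re.sub(r"<[^>]+>", " ", text)   # remove HTML tags
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)   # punctuation becomes whitespace
    return text.split()


def build_vocab(token_lists, max_words=10_000):
    """Map the most frequent tokens to integer ids (0 = padding, 1 = unknown)."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return {tok: i + 2 for i, (tok, _) in enumerate(counts.most_common(max_words))}


def encode(tokens, vocab, max_len):
    """Turn tokens into a fixed-length list of ids, truncating or zero-padding."""
    ids = [vocab.get(tok, 1) for tok in tokens][:max_len]
    return ids + [0] * (max_len - len(ids))
```

Sequences encoded this way have a fixed length, which is the input shape an embedding layer typically expects.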

Image Classifier :

Based on the CNN image model
Uses transfer learning with the MobileNetV2 model, loaded with weights pre-trained on ImageNet.
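A transfer-learning setup along these lines could be sketched with Keras as follows. The classification head, the 224x224 input size, the dropout rate, and the function name `build_image_classifier` are assumptions for illustration; only MobileNetV2, the ImageNet weights, and the 27 categories come from this README.

```python
IMG_SIZE = (224, 224)   # assumed input size; the source images are 500x500
NUM_CLASSES = 27        # number of product categories in the challenge


def build_image_classifier(num_classes=NUM_CLASSES, weights="imagenet"):
    """Frozen MobileNetV2 backbone with a small trainable softmax head."""
    # Imported here so this module can be loaded without TensorFlow installed.
    import tensorflow as tf

    base = tf.keras.applications.MobileNetV2(
        input_shape=IMG_SIZE + (3,), include_top=False, weights=weights
    )
    base.trainable = False  # keep the pre-trained features fixed

    inputs = tf.keras.Input(shape=IMG_SIZE + (3,))
    # MobileNetV2 expects pixel values scaled to [-1, 1].
    x = tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1.0)(inputs)
    x = base(x, training=False)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    x = tf.keras.layers.Dropout(0.2)(x)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Freezing the backbone means only the final dense layer is trained at first; unfreezing the top layers for fine-tuning is a common follow-up step.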

The fusion model has the following architecture :

The fusion model uses the image model to categorize products where the text model underperformed. The global weighted-F1 score is 82.2%, and every category scores above 55%.
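The actual fusion architecture is not reproduced in this text, so the snippet below only illustrates one simple way to read the idea: fall back to the image model whenever the text model predicts a category it is known to handle poorly. The rule itself, the function name `fuse_predictions`, and the `weak_text_classes` set are all hypothetical.

```python
def argmax(values):
    """Index of the largest value in a sequence."""
    return max(range(len(values)), key=values.__getitem__)


def fuse_predictions(text_probs, image_probs, weak_text_classes):
    """Trust the text model unless it predicts a class it is weak on.

    text_probs / image_probs: per-class probabilities for one product.
    weak_text_classes: class indices where the text model underperforms
    (for example, picked from per-class F1 scores on a validation set).
    """
    text_class = argmax(text_probs)
    if text_class in weak_text_classes:
        return argmax(image_probs)
    return text_class


# The text model predicts class 1, which is flagged as weak, so the image
# model's choice (class 0) is used instead.
fused = fuse_predictions([0.1, 0.7, 0.2], [0.6, 0.3, 0.1], weak_text_classes={1})
```

Other fusion strategies, such as averaging the two probability vectors or training a small model on their concatenation, would fit the same description.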

Model density analysis :
