Skip to content

Predicting Rossmann sales six weeks in advance. Feel free to access the Telegram Bot in the link below.

License

Notifications You must be signed in to change notification settings

brunodifranco/project-rossmann-sales

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

36 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Creating a Bot that Predicts Rossmann Future Sales

A Regression Project

drawing

Obs: The business problem is fictitious, although both company and data are real.

The in-depth Python code explanation is available in this Jupyter Notebook.

1. Rossmann and Business Problem

Rossmann is one of the largest drug store chains in Europe, with operations in Germany, Poland, Hungary, the Czech Republic, Turkey, Albania, Kosovo and Spain. Their sales can be influenced by promotions, competition, school and state holidays, seasonality, locality, etc.

This Data Science project is focused on solving one problem:

  • Rossmann CEO is requiring a sales prediction of the next six weeks for each store, in order to determine the best resource allocation for each store renovation.

2. Data Overview

The data was collected from Kaggle. This dataset contains historical sales data for 1,115 Rossmann stores. The initial features descriptions are available below:

Feature Definition
Id an Id that represents a (Store, Date) duple within the dataset.
Store a unique Id for each store.
Sales the turnover for any given day.
DayOfWeek day of week on which the sale was made (e.g. DayOfWeek=1 -> monday, DayOfWeek=2 -> tuesday, etc).
Date date on which the sale was made.
Customers the number of customers on a given day.
Open an indicator for whether the store was open: 0 = closed, 1 = open.
StateHoliday indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None.
SchoolHoliday indicates if the (Store, Date) was affected by the closure of public schools.
StoreType differentiates between 4 different store models: a, b, c, d.
Assortment describes an assortment level: a = basic, b = extra, c = extended.
CompetitionDistance distance in meters to the nearest competitor store.
CompetitionOpenSince(Month/Year) gives the approximate year and month of the time the nearest competitor was opened.
Promo indicates whether a store is running a promo on that day.
Promo2 Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating.
Promo2Since(Year/Week) describes the year and calendar week when the store started participating in Promo2.
PromoInterval describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store.

3. Assumptions

  • Customers column was dropped, because for now there's no information about the amount of customers six weeks into the future.
  • The NaN's in CompetitionDistance were replaced by 3 times the maximum CompetitionDistance in the dataset, because the observations with NaN's are likely stores that are too far, which means there's no competition.
  • Some new features were created in order to best describe the problem:
New Feature Definition
day/week_of_year/year_week/month/year day/week_of_year/year_week/month/year extracted from 'date' column.
day/day_of_week/week_of_year/month(sin/cos) sin/cos component of each period, to capture their cyclical behavior.
competition_time_month amount of months from competition start.
promo_time_week time in weeks from when the promotion was active.
state_holiday(christmas/easter_holiday/public_holiday/regular_day) indicates wheter the sale was made in christmas, easter, public holiday or regular day.

4. Solution Plan

4.1. How was the problem solved?

To predict sales values for each store (six weeks in advance) a Machine Learning model was applied. To achieve that, the following steps were performed:

  • Understanding the Business Problem : Understanding the reasons why Rossmann's CEO was requiring that task, and plan the solution.

  • Collecting Data : Collecting Rossmann store and sales data from Kaggle.

  • Data Cleaning : Renaming columns, changing data types and filling NaN's.

  • Feature Engineering : Creating new features from the original ones, so that those could be used in the ML model.

  • Exploratory Data Analysis (EDA) : Exploring the data in order to obtain business experience, look for useful business insights and find important features for the ML model. The top business insights found are available in Section 5 .

  • Data Preparation : Applying Normalization and Rescaling Techniques in the data, as well as Enconding Methods and Response Variable Transformation.

  • Feature Selection : Selecting the best features to use in the ML model by applying the Boruta Algorithm.

  • Machine Learning Modeling : Training Regression Algorithms with time series cross-validation. The best model was selected to be improved via Hyperparameter Tuning. More information in Section 6 .

  • Model Evaluation : Evaluating the model using four metrics: MAE, MAPE, RMSE and R2.

  • Financial Results : Translating the ML model's statistical performance to financial and business performance.

  • Model Deployment (Telegram Bot) : Implementation of a Telegram Bot that will give you the prediction of any given available store number. This is the project's Data Science Product, and it can be accessed from anywhere. More information in Section 7 .

4.2. Tools and techniques used:

5. Top Business Insights

  • 1st - Stores with basic assortment level are the ones that sell more.

drawing


  • 2nd - Stores with higher number of close competitors sell more.

drawing


  • 3rd - Easter Holiday has the highest average sales, in comparison to other periods.

drawing


  • 4th - Stores sell less during the second semester of each year.

drawing


  • 5th - Stores Sell more after the 10th day of each month.

drawing

6. Machine Learning Models

This was the most fundamental part of this project, since it's in ML modeling where the sales predictions for each store can be made. Six models were trained using time series cross-validation:

  • Average Model (used as a baseline model)
  • Linear Regression
  • Lasso Regression (Regularized Linear Regression)
  • Random Forest Regressor
  • XGBoost Regressor
  • Light GBM Regressor

The initial performance for all six algorithms are displayed below (sorted by RMSE):

Model Name MAE MAPE RMSE R2
LGBM Regressor 833.23 +/- 121.0 0.1178 +/- 0.0093 1197.68 +/- 176.39 0.8467 +/- 0.0237
Random Forest Regressor 838.84 +/- 219.91 0.1162 +/- 0.0233 1257.62 +/- 321.14 0.8409 +/- 0.0527
XGBoost Regressor 900.29 +/- 152.53 0.1273 +/- 0.0155 1293.45 +/- 214.31 0.8322 +/- 0.0334
Average Model (Baseline) 1354.8 0.2064 1835.14 0.6366
Linear Regression 2081.72 +/- 295.57 0.3026 +/- 0.0166 2953.15 +/- 468.22 0.1353 +/- 0.0721
Lasso Regression 2116.42 +/- 341.46 0.292 +/- 0.0118 3058.12 +/- 504.18 0.0742 +/- 0.0834

Both Linear Regression and Lasso Regression have worst performances in comparison to the simple Average Model. This shows a nonlinear behavior in our dataset, hence the use of more complex models, such as Random Forest, XGBoost and Light GBM.

The LGBM model was chosen for Hyperparameter Tuning, since it has the lowest RMSE. Even if we look into other metrics, such as MAPE (on which Random Forest has the best performance), LGBM would still be better to use, because it's much faster to train and tune .

After tuning LGBM's hyperparameters using Random Search the model performance has improved:

Model Name MAE MAPE RMSE R2
LGBM Regressor 617.54000 0.08940 921.52000 0.90840

Metrics Definition and Interpretation

Metric Definition
MAE Mean Absolute Error
MAPE Mean Absolute Percentage Error
RMSE Root Mean Squared Error
R2 Coefficient of Determination

R2 basically show how well the sales are being predicted by the model, and alongside RMSE isn't the best metric to translate into financial performance, despite being key to check statistical performance.

Both MAE and MAPE are really useful in explaining the model's business performance. MAE shows how much the model prediction is wrong on average, while MAPE shows how much the model prediction is wrong on average percentage-wise.

6.1. Brief Financial Results:

Below there are displayed two tables with brief financial results given by the LGBM model, as the complete financial results will be explained in the next section .

A couple interesting metrics to evaluate the financial performance of this solution (LGBM Model) is the MAE and MAPE. Below there's a table with a few stores metrics:

Store Predictions (€) Worst Scenario (€) Best Scenario (€) MAE (€) MAPE
1 161,274.69 160,988.99 161,560.39 285.69937 0.06472
2 175,549.76 175,192.06 175,907.47 357.70668 0.07338
3 259,576.04 259,077.04 260,075.03 498.99756 0.07215
... ... ... ... ... ...
1113 238,353.17 237,841.04 238,865.30 512.12600 0.07852
1114 769,997.75 767,598.11 772,397.39 2399.63754 0.10164
1115 254,766.52 254,227.68 255,305.36 538.83848 0.07576

According to this model, the sales sum for all stores over the next six weeks is:

Scenario (€) Total Sales of the Next 6 Weeks (€)
Prediction 283,786,860.62
Worst Scenario 283,094,186.26
Best Scenario 284,479,534.97

7. Model Deployment

As previously mentioned, the complete financial results can be consulted by using the Telegram Bot. The idea behind this is to facilitate the access of any store sales prediction, as those can be checked from anywhere and from any electronic device, as long as internet connection is available. The bot will return you a sales prediction over the next six weeks for any available store, all you have to do is send him the store number in this format "/store_number" (e.g. /12, /23, /41, etc) . If a store number if non existent the message "Store not available" will be returned, and if you provide a text that isn't a number the bot will ask you to enter a valid store id.

Click below to chat with the Rossmann Bot
image

Because the deployment was made in a free cloud (Render) it could take a few minutes for the bot to respond, in the first request. In the following requests it should respond instantly.

8. Conclusion

In this project the main objective was accomplished:

A model that can provide good sales predictions for each store over the next six weeks was successfully trained and deployed in a Telegram Bot, which fulfilled CEO' s requirement, for now it's possible to determine the best resource allocation for each store renovation. In addition to that, five interesting and useful insights were found through Exploratory Data Analysis (EDA), so that those can be properly used by Rossmann CEO.

9. Next Steps

Further on, this solution could be improved by a few strategies:

  • Using ARIMA to predict the amount of customers over the next six weeks, so that the customers column could be added to the final model.

  • Tune even more the regression algorithm, by applying a Bayesian Optimization for instance.

  • Try other regression algorithms to predict the sales for each store.

  • Use different models for the stores on which it's more difficult (higher MAE and MAPE) to predict the sales.

Contact