Reproducible Machine Learning Pipeline for Short-Term Rental Price Prediction

This repository contains a Machine Learning (ML) pipeline that predicts short-term rental property prices in New York City. It is designed to be retrained easily with new data, which arrives frequently and in bulk, under the assumption that prices (and thus the model) change constantly. The pipeline is composed of the typical steps or components of short/medium-size projects; these are explained in the section Introduction.

The following tools are used:

  • MLflow for the orchestration and reproducible execution of the pipeline components.
  • Weights & Biases for artifact and experiment tracking.
  • Hydra for configuration management.
  • conda for environment management.
  • cookiecutter for generating new step templates.

The Weights & Biases project which collects the tracked experiments can be found here: nyc_airbnb.

The starter code of the repository was originally forked from a project in the Udacity repository build-ml-pipeline-for-short-term-rental-prices. The instructions of that source project can be found in the file Instructions.md. If you would like to know more about why reproducible ML pipelines matter and how the tools used here interact, you can have a look at my ML pipeline project boilerplate.

The dataset used is a modified AirBnB dataset for New York City, contained in the forked source repository. AirBnB provides a data dictionary which explains each feature. A subset of 15 independent variables has been taken.

Table of contents:

  • Introduction
  • How to Use This Project
  • Dependencies
  • How to Run the Pipeline
  • How to Modify the Pipeline
  • Issues / Notes
  • Possible Improvements
  • Interesting Links
  • Authorship

Introduction

The following figure summarizes all the steps/components and the artifacts of the pipeline:

Reproducible ML Pipeline Diagram

The final goal of the pipeline is to produce the optimal inference artifact, which regresses the price of a rental unit given its features.

The step related to the Exploratory Data Analysis (EDA), in light green, is not part of the re-trainable pipeline, i.e., it needs to be called explicitly; additionally, the component contains research notebooks that are crafted manually to gather insights into the dataset. Note that the data processing and modeling are quite simple; the focus of the project lies on the MLOps aspects.

Some components and artifacts have been labeled as remote. Even though they are also available locally in this repository, they are fetched from the original repository, showcasing MLflow's ability to run components hosted in remote git repositories.

The file and folder structure reflects the diagram above:

.
├── README.md                       # This file
├── MLproject                       # High-level pipeline project
├── main.py                         # Pipeline main file
├── environment.yml                 # Environment to run the project
├── conda.yml                       # Environment of main pipeline
├── config.yaml                     # Configuration of pipeline
├── Instructions.md                 # Original instructions by Udacity
├── LICENSE.txt
├── images/                         # Markdown assets
│   └── ...
├── cookie-mlflow-step/             # Template for (empty) step generation
│   └── ...
├── components/                     # Remote components
│   ├── README.md
│   ├── conda.yml
│   ├── get_data/                   # Download/get data step/component
│   │   ├── MLproject               # MLflow project file
│   │   ├── conda.yml               # Environment for step
│   │   ├── data/                   # Dataset samples (retrieved with URL)
│   │   │   ├── sample1.csv
│   │   │   └── sample2.csv
│   │   └── run.py                  # Step implementation
│   ├── test_regression_model/      # Evaluation step/component
│   │   ├── MLproject
│   │   ├── conda.yml
│   │   └── run.py
│   └── train_val_test_split/       # Segregation step/component
│       ├── MLproject
│       ├── conda.yml
│       └── run.py
└── src/                            # Local components
    ├── README.md
    ├── basic_cleaning/             # Cleaning step/component
    │   ├── MLproject               # MLflow project file
    │   ├── conda.yml               # Environment for step
    │   └── run.py                  # Step implementation
    ├── data_check/                 # Deterministic & non-deterministic data tests step/component
    │   ├── MLproject
    │   ├── conda.yml
    │   ├── conftest.py
    │   └── test_data.py
    ├── eda/                        # Exploratory Data Analysis & Modeling (research env.)
    │   ├── EDA.ipynb               # Basic EDA notebook
    │   ├── MLproject
    │   ├── Modeling.ipynb          # Modeling trials, transferred to train_random_forest
    │   └── conda.yml
    └── train_random_forest/        # Inference pipeline
        ├── MLproject
        ├── conda.yml
        ├── feature_engineering.py
        └── run.py

Each step/component has a dedicated folder with the following files (a skeleton sketch follows the list):

  • run.py: The implementation of the step's function/goal.
  • MLproject: The MLflow project file, which executes run.py with the proper arguments.
  • conda.yml: The conda environment required by run.py, which is set up by MLflow.
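
For illustration, here is a minimal, hypothetical run.py skeleton; the argument names (input_artifact, output_artifact), the job type, and the artifact types are placeholders, and the actual components in this repository implement more logic:

# Minimal, hypothetical skeleton of a step implementation (run.py).
# Argument and artifact names are placeholders, not the real ones.
import argparse
import logging

import wandb

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def go(args):
    # Start a W&B run; job_type groups all runs of this step
    run = wandb.init(job_type="example_step")

    # Download the input artifact tracked in W&B and get its local path
    artifact_path = run.use_artifact(args.input_artifact).file()
    logger.info("Input artifact downloaded to %s", artifact_path)

    # ... step-specific processing would go here ...

    # Upload the result as a new artifact for the downstream steps
    artifact = wandb.Artifact(args.output_artifact, type="processed_data")
    artifact.add_file(artifact_path)  # placeholder: add the real output file
    run.log_artifact(artifact)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="An example pipeline step")
    parser.add_argument("--input_artifact", type=str, required=True)
    parser.add_argument("--output_artifact", type=str, required=True)
    go(parser.parse_args())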

All the components are controlled by the root-level script main.py; that file executes all the steps in order, with the parameters specified in config.yaml.
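
A simplified, illustrative sketch of that orchestration (not the verbatim implementation) could look as follows; the configuration field and parameter names are assumptions modeled on config.yaml and the commands shown later in this README:

# Simplified, illustrative sketch of the orchestration in main.py.
import os

import hydra
import mlflow
from omegaconf import DictConfig


@hydra.main(config_name="config")
def go(config: DictConfig):
    # Steps to execute, e.g., "download,basic_cleaning,...", read from config.yaml
    active_steps = config["main"]["steps"].split(",")

    if "download" in active_steps:
        # Remote component: MLflow fetches and runs it from its git repository
        mlflow.run(
            config["main"]["components_repository"] + "/get_data",  # assumed field
            "main",
            parameters={
                "sample": config["etl"]["sample"],
                "artifact_name": "sample.csv",  # assumed value
                "artifact_type": "raw_data",
                "artifact_description": "Raw dataset",
            },
        )

    if "basic_cleaning" in active_steps:
        # Local component under src/
        mlflow.run(
            os.path.join(hydra.utils.get_original_cwd(), "src", "basic_cleaning"),
            "main",
            parameters={
                # Hypothetical parameter names; the real ones come from config.yaml
                "input_artifact": "sample.csv:latest",
                "output_artifact": "clean_sample.csv",
            },
        )


if __name__ == "__main__":
    go()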

How to Use This Project

  1. If you're unfamiliar with the tools used in this project, I recommend having a look at my ML pipeline project boilerplate.
  2. Install the dependencies.
  3. Run the pipeline as explained in the dedicated section.
  4. You can modify the pipeline following these guidelines and taking into account the original Instructions.md.

Dependencies

To set up the main environment from which everything is launched, you need to install conda and create a Weights & Biases account; then, the following commands set everything up:

# Clone repository
git clone [email protected]:mxagar/ml_pipeline_rental_prices.git
cd ml_pipeline_rental_prices

# Create new environment
conda env create -f environment.yml

# Activate environment
conda activate nyc_airbnb_dev

# Log in to Weights & Biases (the API key is obtained via the web interface)
wandb login

All step/component dependencies are handled by MLflow using the dedicated conda.yml environment definition files.

How to Run the Pipeline

There are multiple ways of running this pipeline, for instance:

  • Local execution of the complete pipeline from the cloned source code.
  • Local execution of selected pipeline steps.
  • Remote execution of a release.

The following example commands show how these approaches work:

# Go to the root project level, where main.py is
cd ml_pipeline_rental_prices

# Local execution of the entire pipeline
mlflow run . 

# Step execution: Download or get the data
# Step name can be found in main.py
# This step uses the code in a remote repository
mlflow run . -P steps="download"

# Step execution: Clean / pre-process the raw dataset
# Step name can be found in main.py
# Note that any upstream artifacts need to be available
mlflow run . -P steps="basic_cleaning"

# Step execution: data check + segregation
# Step names can be found in main.py
# Note that any upstream artifacts need to be available
# The step data_split uses the code in a remote repository
mlflow run . -P steps="data_check,data_split"

# Hyperparameter variation with hydra sweeps
# Step execution: Train the random forest model
# Step names can be found in main.py
# Note that any upstream artifacts need to be available
mlflow run . \
-P steps=train_random_forest \
-P hydra_options="modeling.max_tfidf_features=10,15,30 modeling.random_forest.max_features=0.1,0.33,0.5,0.75,1.0 -m"

# Execution of a remote release hosted in my repository,
# using a different dataset/sample than the default one
# Release: 1.0.1
mlflow run https://github.com/mxagar/ml_pipeline_rental_prices.git \
-v 1.0.1 \
-P hydra_options="etl.sample='sample2.csv'"

Note that any execution generates, among other things:

  • Auxiliary tracking files and folders that we can ignore: wandb, mlruns, artifacts, etc.
  • Entries in the associated Weights and Biases project.

The entries in the Weights and Biases web interface cover extensive information related to each run of a component or uploaded/downloaded artifacts:

  • Time, duration and resources.
  • Logged information: parameters, metrics, images, etc.
  • Git commit and branch information.
  • Relationship between components and artifacts.
  • Etc.

As an example, here is a graph view of the lineage of the pipeline, which resembles the pipeline flow diagram shown above:

Lineage / Graph View of the exported model

How to Modify the Pipeline

We can easily generate new steps/components by creating a new folder under src/ with the following three files, properly implemented:

  • MLproject
  • conda.yml
  • run.py

Then, we need to add the step and its parameters to the MLflow pipeline in main.py, as is done with the other steps.

Instead of creating the folder and the files manually, we can use the tool cookiecutter, which is installed with the general requirements; to that end, we run the following command:

cookiecutter cookie-mlflow-step -o src

That prompts a form that we need to fill in via the CLI; for instance, if we'd like to create a new component that trains an XGBoost model, we could write this:

step_name [step_name]: train_xgboost
script_name [run.py]: run.py
job_type [my_step]: train_model
short_description [My step]: Training of an XGBoost model
long_description [An example of a step using MLflow and Weights & Biases]: Training of an XGBoost model
parameters [parameter1,parameter2]: trainval_artifact,val_size,random_seed,stratify_by,xgb_config,max_tfidf_features,output_artifact  

All that creates the component folder with the three files, ready to be implemented.
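
After implementing the three files, the new step still needs to be registered in main.py, as mentioned above. For the train_xgboost example, the registration could hypothetically look like this (the parameter names come from the cookiecutter form; the artifact names and config.yaml fields are assumptions):

# Hypothetical registration of the new train_xgboost step in main.py
if "train_xgboost" in active_steps:
    mlflow.run(
        os.path.join(hydra.utils.get_original_cwd(), "src", "train_xgboost"),
        "main",
        parameters={
            "trainval_artifact": "trainval_data.csv:latest",  # assumed artifact name
            "val_size": config["modeling"]["val_size"],
            "random_seed": config["modeling"]["random_seed"],
            "stratify_by": config["modeling"]["stratify_by"],
            "xgb_config": "xgb_config.yml",  # assumed: serialized model configuration
            "max_tfidf_features": config["modeling"]["max_tfidf_features"],
            "output_artifact": "xgboost_export",  # assumed export name
        },
    )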

Issues / Notes

There are some deviations from the original Instructions.md that might be worth considering:

  • I changed the remote component folder/URL; additionally, we might need to specify the branch with the parameter version in the mlflow.run() call, because mlflow defaults to the master branch (see the sketch after this list).
  • I had to change several conda.yml files to avoid versioning/dependency issues (e.g., with protobuf).
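
As a reference for the first point, here is a minimal sketch of pinning the branch when running a remote component; the URL and parameters are illustrative:

# Illustrative: pinning the git branch/commit of a remote MLflow component
import mlflow

mlflow.run(
    "https://github.com/mxagar/ml_pipeline_rental_prices#components/get_data",
    "main",
    version="main",  # branch name or commit hash; mlflow defaults to master
    parameters={"sample": "sample1.csv"},  # hypothetical parameters
)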

Possible Improvements

  • Extend the Exploratory Data Analysis (EDA) and the Feature Engineering (FE). See my Guide on EDA, Data Cleaning and Feature Engineering.
  • Extend the current data modeling and the hyperparameter tuning.
  • Try other models than the random forest: create new components/steps for those and compare them.

Interesting Links

Authorship

Mikel Sagardia, 2022.
No guarantees.

If you find this repository helpful and use it, please link to the original source.
