
MLOps project that recommends movies to watch, implementing Data Engineering and MLOps best practices.


Next Watch: E2E MLOps Pipelines with Spark!

Prerequisites | Quick Start | Service Endpoints | Architecture | Project Organization | UI Showcase

Prerequisites

  • Python
  • Conda or Venv
  • Docker

Installation and Quick Start

  1. Clone the repo: `git clone https://github.com/brnaguiar/mlops-next-watch.git`
  2. Create the environment: `make env`
  3. Activate the conda environment: `source activate nwenv`
  4. Install requirements, dependencies, and assets: `make dependencies`
  5. Pull the datasets: `make datasets`
  6. Configure containers and secrets: `make init`
  7. Run Docker Compose: `make run`
  8. Populate the production database with users: `make users`
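
For convenience, the same steps can be chained as a single shell sequence (a minimal sketch mirroring the steps above; the `cd` into the cloned repo is an assumed extra step):

```sh
git clone https://github.com/brnaguiar/mlops-next-watch.git
cd mlops-next-watch   # assumed: run the remaining commands from the repo root
make env              # create the conda environment
source activate nwenv # activate it
make dependencies     # install requirements, dependencies, and assets
make datasets         # pull the datasets
make init             # configure containers and secrets
make run              # start the Docker Compose stack
make users            # populate the production database with users
```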

Useful Service Endpoints

- Jupyter `http://localhost:8888`
- MLFlow `http://localhost:5000`
- Minio Console `http://localhost:9001`
- Airflow `http://localhost:8080`
- Streamlit Frontend `http://localhost:8501`
- FastAPI Backend `http://localhost:8000`
- Grafana Dashboard `http://localhost:3000`
- Prometheus `http://localhost:9090`
- Pushgateway `http://localhost:9091`
- Spark UI `http://localhost:8081`
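
Once the stack is up, a few quick liveness checks can confirm the services are reachable (a minimal sketch using the ports above; `/docs`, `/health`, and `/-/healthy` are the FastAPI, Airflow, and Prometheus defaults and assume this project's configuration does not override them):

```sh
curl -s -o /dev/null -w "MLflow:     %{http_code}\n" http://localhost:5000
curl -s -o /dev/null -w "FastAPI:    %{http_code}\n" http://localhost:8000/docs       # FastAPI's auto-generated API docs
curl -s -o /dev/null -w "Airflow:    %{http_code}\n" http://localhost:8080/health     # Airflow webserver health endpoint
curl -s -o /dev/null -w "Prometheus: %{http_code}\n" http://localhost:9090/-/healthy  # Prometheus health endpoint
```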

Architecture

Note: in the "Monitoring and Analytics" block of the architecture diagram, it should be Grafana instead of Streamlit.

Project Organization


├── LICENSE
│
├── Makefile             <- Makefile with commands like `make env` or `make run`
│
├── README.md            <- The top-level README for developers using this project
│
├── data
│   ├── 01-external      <- Data from third party sources
│   ├── 01-raw           <- Data in a raw format
│   ├── 02-processed     <- The pre-processed data for modeling
│   └── 03-train         <- Split pre-processed data for model training
├── airflow
│   ├── dags             <- Airflow Dags
│   ├── logs             <- Airflow logging
│   ├── plugins          <- Airflow's default directory for plugins such as custom Operators, Sensors, etc. (we use the `include` dir inside `dags` for this purpose instead)
│   └── config           <- Airflow Configurations and Settings
│
├── assets               <- Project assets like jar files used in Spark Sessions
│
├── models               <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks            <- Jupyter notebooks used in experimentation 
│
├── docker               <- Docker data and configurations
│
├── images               <- Project images
│
├── requirements.local   <- Required Site-Packages 
│                         
├── requirements.minimal <- Required Dist-Packages 
│                         
├── setup.py             <- Makes project pip installable (pip install -e .) so src can be imported 
│
├── src                  <- Source code for use in this project.
│   │
│   ├── collaborative    <- Source code for the collaborative recommendation strategy
│   │   ├── models       <- Collaborative models
│   │   ├── nodes        <- Data processing, validation, training, etc. functions (or nodes) that represent units of work
│   │   └── pipelines    <- Collections of orchestrated nodes (data processing, validation, training, etc.), arranged in a sequence or a directed acyclic graph (DAG)
│   │
│   ├── conf           <- Configuration files and parameters for the project
│   │
│   ├── main.py        <- Main script, mostly to run pipelines
│   │
│   ├── scripts        <- Scripts, e.g., to create credential files and populate databases
│   │
│   ├── frontend       <- Source code for the Application Interface
│   │
│   └── utils          <- Project utils like Handlers and Controllers
│
├── tox.ini            <- Settings for flake8
│
└── pyproject.toml     <- Settings for the project and tools like isort, black, pytest, etc.

UI Showcase

Streamlit Frontend App

MLflow UI

Minio UI

Airflow UI

Grafana UI

Prometheus UI

Prometheus Drift Detection Example


Project based on the cookiecutter data science project template.