A useful template to enable simple and efficient machine learning projects
Explore the docs »
View Demo
·
Report Bug
·
Request Feature
While developing our machine learning pipeline template, we wanted to create an efficient and purposeful environment that could resolve many of the annoyances and issues we have faced in the past.
Our goal with this ML pipeline template is to create a user-friendly utility that drastically speeds up the development and implementation of a machine learning model across a wide range of problems. Many of our past experiences with other templates and machine learning projects left us wanting a better working environment and a more efficient process.
This template enables fast experimentation, easy execution, and simple debugging for all components.
.
├── run.py # Entry point that runs the Flask server
├── static
│ └── img
│ ├── example_image.jpg # Example image for README
│ ├── iris_setosa.jpeg
│ ├── iris_versicolor.jpeg
│ └── iris_virginica.jpeg
└── templates
├── go.html
└── master.html # Main html file for front end
The app component of the directory contains the front-end Flask service, which provides the user-friendly environment for interacting with the model.
.
├── config.yaml # Main global configuration file
├── data_acquisition
│ └── config.yaml # Data acquisition configuration
├── data_processing
│ └── config.yaml # Data processing configuration
├── model_training
│ └── config.yaml # Model training configuration
└── model_validation
└── config.yaml # Model validation configuration
The config component of the directory is where most of the controls for this pipeline template reside. There is a config file for each of the main sections:
- Data Acquisition
- Data Processing
- Model Training
- Model Validation
There is also a top-level configuration file for general settings that are shared across these sections.
The configuration files are intended to be the primary point of access and control for this pipeline. Any changes or utility additions should be controlled from their corresponding configuration file in order to keep an organized and properly modularized codebase.
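As an illustration, a model-training configuration might look like the following. The key names here are hypothetical, shown only to convey the idea; they are not the template's actual schema:

```yaml
# config/model_training/config.yaml — illustrative sketch, not the real schema
train_test_split:
  test_size: 0.2
  random_seed: 42
models:
  - name: logistic_regression
    hyperparameters:
      C: [0.1, 1.0, 10.0]
  - name: random_forest
    hyperparameters:
      n_estimators: [50, 100]
```

Keeping choices like these in YAML means a reader can see every knob the pipeline exposes without opening a single source file.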
.
├── 1_data_acquisition
│ └── main.py # Main file for data acquisition step
├── 2_data_processing
│ └── main.py # Main file for data processing step
├── 3_model_training
│ └── main.py # Main file for model training step
├── 4_model_validation
│ └── main.py # Main file for model validation step
└── 5_model_registration
└── main.py # Main file for model registration step (Optional)
The pipeline_components folder hosts the main file for each step in the pipeline flow. Each step contains only a main.py, and the numeric prefixes indicate the order in which the steps run. These main files should not be altered unless an additional utility function or similar task requires it; changes to the pipeline should otherwise remain within the utility functions in the /src/ directory and the configuration files.
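To make that pattern concrete, a step's main.py typically does little more than read its config and delegate to utility functions. The sketch below is illustrative only; the function and config key names are hypothetical, not the template's actual API:

```python
# Illustrative sketch of a pipeline step's main.py (all names are hypothetical).

def drop_incomplete_rows(rows):
    """Stand-in for a utility that would live in src/data/processing/utils.py."""
    return [row for row in rows if all(value is not None for value in row.values())]

def main(config):
    # In the real template this would be parsed from config/data_processing/config.yaml.
    rows = [
        {"sepal_length": 5.1, "species": "setosa"},
        {"sepal_length": None, "species": "versicolor"},
    ]
    if config.get("drop_incomplete_rows", False):
        rows = drop_incomplete_rows(rows)
    return rows

if __name__ == "__main__":
    cleaned = main({"drop_incomplete_rows": True})
    print(len(cleaned))  # 1
```

The main file stays a thin coordinator; the behavior lives in the utility function and the decision lives in the config.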
.
├── __init__.py
├── data
│ ├── __init__.py
│ ├── acquisition
│ │ ├── __init__.py
│ │ └── utils.py # Data acquisition utility functions
│ ├── processing
│ │ └── utils.py # Data processing utility functions
│ └── utils.py # General utility functions related to data
├── model
│ ├── __init__.py
│ ├── training
│ │ ├── __init__.py
│ │ └── utils.py # Model training utility functions
│ └── utils.py # General utility functions related to models
└── utils.py # Main general utility functions
The src component of the directory is the core of our pipeline's functionality. This directory stores the utility functions for each of the pipeline steps. When running the pipeline, these utility functions are built as a package that can be imported and used in the main functions at runtime.
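For a feel of what such a utility might contain, here is a toy stand-in for a function that could live in src/model/training/utils.py. It is a hypothetical nearest-centroid example on one feature, not the template's actual code:

```python
# Hypothetical sketch of a training utility (not the template's real implementation).
from collections import defaultdict

def train_nearest_centroid(samples, labels):
    """Compute one centroid per class; a toy stand-in for real model training."""
    sums = defaultdict(lambda: [0.0, 0])
    for x, y in zip(samples, labels):
        sums[y][0] += x
        sums[y][1] += 1
    return {label: total / count for label, (total, count) in sums.items()}

def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))

# Petal lengths loosely inspired by the bundled iris sample data.
centroids = train_nearest_centroid([1.4, 1.5, 4.5, 4.7],
                                   ["setosa", "setosa", "versicolor", "versicolor"])
print(predict(centroids, 1.6))  # setosa
```

Because functions like this are packaged under src, each step's main file can import them instead of redefining logic inline.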
Here we will describe the necessary actions and steps that should be followed in order to run this pipeline.
There are only two prerequisites for running this pipeline: Conda / Anaconda must be installed, and you must be able to run Makefiles (i.e. have GNU Make available).
Follow the steps below to install and set up the pipeline. The template does not rely on any external dependencies or services.
-
Clone the repo
git clone https://github.com/zamaniali1995/ml-pipeline.git
-
Setup the conda environment using MakeFile
make create-env
-
Activate the newly created conda environment
conda activate ml-env
-
Create package
make create-package
Here we will describe how to use this ML pipeline template, as well as how to run each component and build the front end display at the end.
This pipeline was designed so that configuration files are the primary means of controlling and altering the pipeline. These configuration files control the paths to the data, what kind of data processing to perform, how to split the training and testing data, which models to train, the range of potential hyper parameters to search through, which evaluation methods to use on the models, and many other similar selections.
These configuration files allow for changes to be made in one place, not requiring someone to dig through code and alter each place where some variable could exist.
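For instance, changing the train/test split ratio should mean editing a single config value rather than hunting through code. A minimal sketch of how a utility might consume that value, using hypothetical key names and the standard library:

```python
# Sketch of a config-driven train/test split (key names are hypothetical).
import random

def split_data(rows, config):
    """Split rows into train/test sets using a ratio taken from the config."""
    test_size = config["train_test_split"]["test_size"]
    rng = random.Random(config["train_test_split"]["random_seed"])
    shuffled = rows[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

config = {"train_test_split": {"test_size": 0.2, "random_seed": 42}}
train, test = split_data(list(range(100)), config)
print(len(train), len(test))  # 80 20
```

Adjusting `test_size` in one YAML file would then flow through to every run, with the seed keeping splits reproducible.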
If there is a desire to implement some additional processing method or some specific functionality for a given dataset, we have created a simple process to add utility functions that can be used and connected with the configuration files easily.
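One common way to wire a new utility to the configuration is a dispatch table keyed by the name the config uses. The registry and function names below are illustrative of that pattern, not the template's actual mechanism:

```python
# Hypothetical registry mapping config strings to processing utilities.
PROCESSORS = {}

def register(name):
    """Decorator that makes a utility selectable from the config by name."""
    def wrap(fn):
        PROCESSORS[name] = fn
        return fn
    return wrap

@register("lowercase_labels")
def lowercase_labels(rows):
    return [{**row, "species": row["species"].lower()} for row in rows]

def run_processing(rows, config):
    # "processing_steps" would come from config/data_processing/config.yaml.
    for step in config.get("processing_steps", []):
        rows = PROCESSORS[step](rows)
    return rows

rows = [{"species": "Setosa"}]
print(run_processing(rows, {"processing_steps": ["lowercase_labels"]}))
# [{'species': 'setosa'}]
```

With this shape, adding a dataset-specific processing step is a new decorated function plus one line in the config, and no edits to the step's main file.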
To validate that our template is working, we have included a sample dataset which can be used to run each component of the pipeline and which will produce a usable front-end local server. If everything is working as intended, the following steps should produce a functioning predictor.
-
Acquire the data
make acquire-data
-
Process the data
make process-data
-
Train the model
make train-model
-
Evaluate the model
make evaluate-model
-
Generate the local Flask front-end
make run-server
-
Access the local Flask server
http://localhost:3001/
or
http://0.0.0.0:3001/
- Develop Base Pipeline Template
- Implement Example Dataset and Functional Front-End
- Add more data processing utility functions
- Implement test cases to validate the different pipeline steps
- Run the whole pipeline with a single command (e.g. make run-pipeline)
- Add more ways to load data
- AWS
- Google Cloud
- Microsoft Azure
See the open issues for a full list of proposed features (and known issues).
Ali Zamani - LinkedIn - [email protected]
Jacob Mish - LinkedIn - [email protected]
Project Link: https://github.com/zamaniali1995/ml-pipeline