
Montreal Road Collisions

Performing basic data cleaning, EDA and prediction of the number of collisions in Montreal on a given day, knowing the lighting conditions, day of the week, month and weather conditions.

Introduction

The purpose of this project is to apply the techniques we acquired for creating Docker images, working with volumes, containers and stacks, creating clusters, and performing deployment inside a cluster, both locally and on GCP. We also use HDFS, MongoDB and Parquet files in the project. Since we work in Spark, we use PySpark and work with RDDs.

Contributors: @Arwasheraky, @Nastazya and @Clercy. Link to Presentation


Project Structure

  • Code:

    • Scripts:

      • data_cleaning.ipynb - Data cleaning.
      • data_EDA.ipynb - Exploration data analysis.
      • csv-to-parquet.py - Making use of parquet and hadoop.
      • Write_features_MongoDB.ipynb - Making use of MongoDB.
      • Prediction_Subset.ipynb - Restructuring dataset for further prediction.
      • Prediction_and_model_export.ipynb - Prediction and model export.
      • pred.py - Script that runs the prediction with specific features.
      • model.joblib - Exported prediction model used by the script.
      • model.py - Script that builds the prediction model; we run it and save the results to HDFS (optional, used for training).
    • docker-compose:

      • jupyter-compose.yml - Running Jupyter, using the external volume jupyter-data.
      • jupyter_bind-compose.yml - Running Jupyter, using volume mounting and external network.
      • jupyter_local-compose.yml - Running Jupyter, using volume mounting.
      • spark-compose.yml - Running Spark based on jupyter/pyspark-notebook image, no volumes, using external network.
      • spark_bind-compose.yml - Running Spark based on jupyter/pyspark-notebook image, using volume mounting and external network.
      • spark_bind_hdfs-compose.yml - Running Spark based on mjhea0/spark:2.4.1 image, using volume mounting and external network.
      • spark_hdfs-compose.yml - Running Spark based on mjhea0/spark:2.4.1 image, no volumes, using external network.
      • spark_mongo-compose.yml - Running Spark and MongoDB.
      • Dockerfile - Used to create a custom image that copies script and model then runs the script.
  • Data:

    • accidents.csv - Municipalities dataset, merged with the original collisions data.
    • accidents_new.csv - Clean data, ready for EDA.
    • pred-1.csv - Restructured temporary dataset, not yet ready for prediction.
    • final_part_1.csv - Restructured dataset, ready for prediction.

Running instructions

  • Clone this repository locally

  • Make sure you have the latest version of Docker installed

  • Deploy any of the code files locally, in a Spark cluster or in GCP.

    • Locally:

      env token=cebd1261 docker-compose -f jupyter_local-compose.yml up
      
    • Spark Cluster:

      docker network create spark-network
      docker-compose -f spark_bind-compose.yml up --scale spark-worker=2
      env token=cebd1261 docker-compose -f jupyter_bind-compose.yml up
      
    • GCP

      • Create a bucket with the files you would need
      • Create a dataproc cluster which includes Jupyter and Anaconda
      • Run Jupyter from the cluster
  • To run the prediction script pred.py, which takes a number of features as input and calculates the predicted number of collisions, follow these steps (a minimal sketch of the script's idea follows this list):

    • Run the environment using these commands from code folder:
    docker network create spark-network
    docker-compose --file spark-compose.yml up
    
    • Run this script on a cluster using this command:
      docker run -ti  --network=spark-network -v /${PWD}/script://app jupyter/pyspark-notebook:latest //usr/local/spark/bin/spark-submit --master spark://master:7077 //app/pred.py <input features>
    

    where <input features> is a sequence of 4 parameters, separated by commas. Example: day,DI,2,neige.

    Possible parameters:

      - lighting conditions (day or night)
      - day of the week (DI, LU, MA, ME, JU, VE, SA)
      - month (1 to 12)
      - weather conditions (normal, pluie, neige, verglas)
    
    • You can also do the same thing using a custom image:
    docker build -t <image_name> .
    

    or use the existing image nastazya/image_for_pred:latest

    • Run the image using this format:
    docker run -ti  --network=spark-network <image name> <input features>
    or
    docker run -ti  --network=spark-network <image name>
    

    If no features are provided, this default input is used: day,DI,2,neige
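
    For orientation, here is a minimal sketch of the idea behind pred.py (not the actual script): parse the comma-separated input, one-hot encode it into the columns the model was trained on, and feed it to the exported model. The column list below is a placeholder and would have to match the training columns from the notebooks.

      import sys

      import pandas as pd
      from joblib import load

      # Placeholder column list -- the real one must match the training columns
      # used in Prediction_and_model_export.ipynb.
      TRAIN_COLUMNS = (["day", "night"]
                       + ["DI", "LU", "MA", "ME", "JU", "VE", "SA"]
                       + [str(m) for m in range(1, 13)]
                       + ["normal", "pluie", "neige", "verglas"])

      # Default input, matching the documented fallback: day,DI,2,neige
      raw = sys.argv[1] if len(sys.argv) > 1 else "day,DI,2,neige"
      light, weekday, month, meteo = raw.split(",")

      # Build a single one-hot encoded row from the four input values
      row = pd.DataFrame([{col: 0 for col in TRAIN_COLUMNS}])
      for value in (light, weekday, month, meteo):
          row.loc[0, value] = 1

      model = load("model.joblib")  # model exported with joblib by the notebook
      print("Predicted number of collisions:", float(model.predict(row)[0]))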


Packaging and deployment

In this section, we briefly describe some of the techniques we used for packaging and deploying the project.

Below is an approximate diagram of all the processes used in this project (for clarity, we presented it as a team of two co-workers).

Project packaging and deployment plan


Jupyter + Spark + Bind Mount.

This setup was used the most in this project:

  • Run this command from the code folder:
docker-compose -f spark_bind-compose.yml up --scale spark-worker=2

  • Run Jupyter from another terminal using this command:
env token=cebd1261 docker-compose -f jupyter_bind-compose.yml up

  • Proceed with coding in one of the .ipynb files (a minimal session-setup sketch follows below)

spark+jupyter
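
Inside a notebook attached to this setup, the Spark session is typically created against the compose master. A minimal sketch, assuming the master is reachable as spark://master:7077 on the shared spark-network and the data file sits in the mounted folder:

    from pyspark.sql import SparkSession

    # Connect to the standalone master started by spark_bind-compose.yml
    spark = (SparkSession.builder
             .appName("montreal-collisions")
             .master("spark://master:7077")
             .getOrCreate())

    # Read one of the mounted data files and check its structure
    accidents = spark.read.csv("accidents.csv", header=True, inferSchema=True)
    accidents.printSchema()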


Jupyter + Spark + Volumes.

In the middle of the project's development, we needed to access data_cleaning.ipynb and run some additional code without changing the existing file and its output. For this purpose, we ran jupyter-compose.yml, which uses the volume jupyter-data. When we opened the Jupyter notebook there were already some files, but not the ones we needed:

Jupyter-data Volume

So we created a temporary folder with two files and copied them into the jupyter-data volume through a busybox image:

jupyter-data volume files_after_adding_volume

Now we have a script folder with the files we need inside our volume, and whatever changes we make won't affect the original files. Later we will remove the jupyter-data volume.


Jupyter + Spark + MongoDB.

To execute a jupyter-spark-mongo session:

  1. Run spark_mongo-compose.yml in the following format:
  • docker-compose -f spark_mongo-compose.yml up --scale worker=2
  2. In another terminal, run jupyter-compose.yml in the following format:
  • env TOKEN=cebd1261 docker-compose --file jupyter-compose.yml up

The session should now be available. We used Write_features_MongoDB.ipynb for this step (a minimal write sketch follows this list).

  3. Once the session is available, you can upload a test .csv file by using the Upload function.

  4. The database file can be viewed at http://localhost:8181
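
Inside Write_features_MongoDB.ipynb, the idea is to write a feature table into MongoDB through the Spark connector. A minimal sketch, assuming the connector package is pulled in via spark.jars.packages and the mongo service from the compose file is reachable at mongodb://mongo:27017 (the service, database and collection names here are assumptions):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("write-features-mongodb")
             .master("spark://master:7077")
             # Connector version and Mongo URI are assumptions -- adjust to the compose setup
             .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.4.1")
             .config("spark.mongodb.output.uri", "mongodb://mongo:27017/collisions.features")
             .getOrCreate())

    features = spark.read.csv("final_part_1.csv", header=True, inferSchema=True)

    # Write the feature table into the collisions.features collection
    features.write.format("mongo").mode("overwrite").save()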


Spark + Parquet + HDFS.

  1. Run spark_hdfs-compose.yml with 2 workers:

    • docker-compose --file spark_hdfs-compose.yml up --scale spark-worker=2

    • Run Pyspark Script

  2. From another terminal, run the PySpark script csv-to-parquet.py, mounting the current directory:

    •   docker run -t --rm \
        -v "$(pwd)":/script \
        --network=spark-network \
            mjhea0/spark:2.4.1 \
            bin/spark-submit \
                --master spark://master:7077 \
                --class endpoint \
                /script/csv-to-parquet.py
      
    • The script reads the CSV file, writes it to HDFS in Parquet format, reads the file back from HDFS, and tries some SQL queries (a rough sketch follows this list):

    • Run PySpark Query

  3. Now that we have our data in Parquet format uploaded to HDFS, we can browse it at http://localhost:50070/explorer.html

    • Write parquet to hdfs

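
The core of csv-to-parquet.py boils down to something like the following sketch (the paths and the HDFS namenode address are assumptions based on this compose setup):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

    # Read the CSV mounted into /script
    accidents = spark.read.csv("/script/accidents.csv", header=True, inferSchema=True)

    # Write it to HDFS in Parquet format (namenode host/port are assumptions)
    accidents.write.mode("overwrite").parquet("hdfs://master:9000/data/accidents.parquet")

    # Read it back from HDFS and try a simple SQL query
    parquet_df = spark.read.parquet("hdfs://master:9000/data/accidents.parquet")
    parquet_df.createOrReplaceTempView("accidents")
    spark.sql("SELECT COUNT(*) AS nb_collisions FROM accidents").show()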


GCP Dataproc cluster

Running data cleaning and EDA on a local cluster was fairly easy. Once we reached the stage where we needed to restructure the data for prediction, it became impossible due to insufficient resources, so we had to switch to GCP.

  • Create a bucket

bucket

  • Create a Dataproc cluster, connecting our bucket to it and adding the Anaconda and Jupyter components

create_cluster

  • Go to the cluster and open the Jupyter window.

  • Upload your files

cluster_jupyter_tree_running

  • Run the file without creating a Spark session (in Dataproc it is created automatically)

spark session running


Running script on a Spark cluster

Our script takes a number of features as input and outputs the predicted number of collisions based on the information we provide.

  • Creating spark environment:
    docker network create spark-network
    docker-compose --file spark-compose.yml up
  • In another terminal run this command:
$ docker run -ti  --network=spark-network -v /${PWD}/script://app jupyter/pyspark-notebook:latest //usr/local/spark/bin/spark-submit --master spark://master:7077 //app/pred.py 'day','DI',2,'neige'

running_on_the_cluster


Running script on a Spark cluster from an image

  • Creating the spark environment:
    docker network create spark-network
    docker-compose --file spark-compose.yml up
  • Build a custom image:

docker build -t <image_name> .

or use the existing image nastazya/image_for_pred:latest from Docker Hub:

docker_hub_image

  • Run the image using this format:
docker run -ti  --network=spark-network <image name> <input features>

or without features (they will be taken from the image by default: day,DI,2,neige):

docker run -ti  --network=spark-network <image name>

running_from_image


Data Cleaning

The following steps were done (a condensed PySpark sketch follows below):

  • Reading collisions dataset and checking its structure.
  • Choosing some columns and renaming them.
  • Adjusting the types.
  • Reading municipalities dataset and merging it with collisions.
  • Exploring collisions in each municipality and doing some other basic explorations to better understand the data.
  • Dealing with nulls: rows with null streets were removed; rows with empty codes were removed too (we could have replaced them with 99, but we remove all rows with unknown categories anyway).
  • Writing the "clean" data to another file.

Please check this file, which contains the code, results and comments.
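
For orientation only, the cleaning steps above reduce to something like the following PySpark sketch (the file names, column names and join key are placeholders; the real ones are in data_cleaning.ipynb):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("data-cleaning").getOrCreate()

    # Read the raw collisions data, keep and rename a few columns, adjust types
    # (all column names below are placeholders)
    collisions = (spark.read.csv("collisions_raw.csv", header=True, inferSchema=True)
                  .select(F.col("DT_ACCDN").cast("date").alias("DATE"),
                          F.col("CD_ECLRM").alias("LIGHT"),
                          F.col("CD_COND_METEO").alias("METEO"),
                          F.col("RUE_ACCDN").alias("STREET"),
                          F.col("CD_MUNCP").alias("MUNCP_CODE")))

    # Merge with the municipalities dataset and drop rows with missing values
    municipalities = spark.read.csv("municipalities.csv", header=True, inferSchema=True)
    clean = collisions.join(municipalities, on="MUNCP_CODE", how="left").dropna()

    # Write the "clean" data out for the EDA step
    clean.write.csv("accidents_new.csv", header=True, mode="overwrite")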


Exploratory Data Analysis

  • The exploratory data analysis summarizes a number of values from our dataset to help isolate the values required for our prediction model. It was conducted after the original data had been refined to its present form. To move towards a prediction model, the data was filtered and aggregated in several forms, allowing for various views and insights.

  • The analysis includes the following:

    • A data schema is provided outlining the fields and number of records that were available for our investigation.
    • Aggregates have been made available outlining the number of accidents and victims by method of transportation, severity, municipality, street, day of the week and weather conditions (a short aggregation sketch appears after this section).
    • Charts to visualize and facilitate the understanding of the data for decision making.
  • Based on our initial analysis, what does the data tell us?

    1. The majority of accidents took place in the city of Montreal area. Our first thought is that this is attributable to population density, and to Montreal being the region's business hub, which raises the probability of accidents.

    Victims by municipality

    2. We can also see that weather, road surface, time of day and day of the week, amongst other factors, contribute to accidents as well. Friday has the largest number of accidents among the weekdays. In addition, most accidents happen in clear weather!

    Collisions by day

    Meteo conditions

    3. Fortunately, material damage is the most common outcome. In contrast, fatal accidents are the least common.

    Gravite levels

EDA Notebook
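
For illustration, the aggregates mentioned above come down to simple groupBy counts in PySpark; a short sketch with placeholder column names:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("collisions-eda").getOrCreate()
    accidents = spark.read.csv("accidents_new.csv", header=True, inferSchema=True)

    # Collisions per day of the week (column names are placeholders)
    accidents.groupBy("WEEK_DAY").count().orderBy(F.desc("count")).show()

    # Collisions per weather condition
    accidents.groupBy("METEO").count().orderBy(F.desc("count")).show()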


Prediction

We want to predict the number of collisions in Montreal based on the lighting conditions, day of the week, month and weather conditions.

The following steps were done:

  • Restructuring of the dataset in order to be able to make a prediction:

    The original dataset has a collision-oriented structure (one row = one collision):

    original_structure

    We have to find a way to restructure it to be day-oriented (one row = one day) with the number of collisions per day. On top of that, we want to divide each day into day and night in terms of lighting.

    After some manipulation, we have an intermediate result where the data is day-oriented:

    restructured_1

    Feature LIGHT has 4 possible values. We will group them into 2 (day and night) and create a new column for each option with possible values of 1 or 0.

    We will do the same with METEO, which has 9 states. We will group them into 4 (normal, rain, snow, ice) and create the discrete columns.

    We will also create discrete columns with week days and month.

    This is the final structure of prediction dataset:

    final_part_1

    Please check this complete file which contains the code, results and comments

  • Splitting the dataset into test and train

    Please check this complete file which contains the code, results and comments

  • Building and running a model

    We used a Random Forest Regressor, which is known to perform well on discrete features. We didn't try other methods or Grid Search, as the prediction itself wasn't the core objective of this project.

  • Checking the results

    As we expected, the results are not great, with a cross-validation score (CVS) of 0.6 and an MAE of 6.68; the model still has to be optimized. Ideally, we would add features from external sources or develop new features based on ratios or dependencies between existing features.

    Strongest feature importance: Day, Night, Sunday, Saturday, Snow, January, December, Friday:

    feature_importances

  • Exporting the model and running the script

    We exported the model using the joblib library, so that we could run a script that takes a number of features as input and calculates the predicted number of collisions based on the information provided (you can check the code here and the instructions on how to run it in the Running instructions section above). A condensed sketch of the whole flow follows below.

    spark_white
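
Putting the prediction steps together, here is a condensed sketch of the flow (the target column name, hyperparameters and file paths are assumptions; the complete version with results lives in the notebooks linked above):

    import pandas as pd
    from joblib import dump
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    # Day-oriented dataset with discrete (one-hot) columns for light, weather,
    # week day and month; NB_COLLISIONS is an assumed name for the target column.
    # If the discrete columns were not created yet, something like
    # pd.get_dummies(data, columns=["LIGHT", "METEO", "WEEK_DAY", "MONTH"]) produces them.
    data = pd.read_csv("final_part_1.csv")
    X = data.drop(columns=["NB_COLLISIONS"])
    y = data["NB_COLLISIONS"]

    # Split the dataset into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Random Forest Regressor, as in the notebook (hyperparameters are assumptions)
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))

    # Export the model so pred.py can load it later
    dump(model, "model.joblib")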