vspatil/citibike-data-pipeline

NYC Citibike-Data-Pipeline

Overview:

This project was developed as part of the 2023 Data Engineering Zoomcamp. Its goal is to implement a batch pipeline for NYC's Citibike data: the pipeline extracts data from the Citibike dataset and stores the raw data in Google Cloud Storage and Google BigQuery. The data in BigQuery is then transformed using dbt, and the transformed dataset is used by Google Looker Studio to build visualizations for analytics purposes.
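The extract step can be sketched as below. This is a hedged illustration, not the project's actual code: the S3 bucket URL and the per-month file-naming pattern (e.g. 202301-citibike-tripdata.csv.zip) follow the public Citibike tripdata bucket, but verify them against the dataset index before relying on them.

```python
# Illustrative sketch: enumerate the monthly Citibike source files for a year.
# BASE and the file pattern are assumptions about the public tripdata bucket.
BASE = "https://s3.amazonaws.com/tripdata"

def monthly_urls(year: int, months=range(1, 13)) -> list:
    """Build the download URL for each requested month of a year."""
    return [f"{BASE}/{year}{m:02d}-citibike-tripdata.csv.zip" for m in months]

urls = monthly_urls(2023)
```

The batch flow then downloads each monthly file, writes the raw data to GCS, and loads it into BigQuery.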

Citibike Pipeline Architecture:

(architecture diagram)

Problem description:

Citi Bike is NYC’s official bike share program, designed to give residents and visitors a fun, affordable and convenient alternative to walking, taxis, buses and subways. Citi Bike believes that biking is the best way to see NYC: it's a quick and affordable way to get around the city, and it even lets you sightsee along the way. This project answers the questions below and helps riders explore NYC.

  • Where do Citi Bikers ride?
  • Which stations are most popular?
  • What days of the week are most rides taken on?
  • What is the total number of trips?
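These questions map onto simple aggregations over the trip data. A minimal sketch with pandas, using a toy sample and illustrative column names (start_station_name, started_at) in place of the real dataset:

```python
import pandas as pd

# Toy stand-in for the trip data; the real dataset's column names may differ.
trips = pd.DataFrame({
    "start_station_name": ["W 21 St & 6 Ave", "W 21 St & 6 Ave", "Broadway & E 14 St"],
    "started_at": pd.to_datetime(["2023-01-02 08:00", "2023-01-03 18:10", "2023-01-02 09:05"]),
})

total_trips = len(trips)                                   # total number of trips
top_stations = trips["start_station_name"].value_counts()  # most popular stations
rides_by_weekday = trips["started_at"].dt.day_name().value_counts()  # busiest weekdays
```

In the project itself these aggregations are expressed as dbt models over BigQuery and surfaced in the Looker report.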

Dashboard Samples:

You can find the report here

(dashboard screenshots)

Technologies:

The following technologies are used to implement this pipeline:

  • Python
  • Prefect (orchestration)
  • Terraform (infrastructure as code)
  • Google Cloud Storage (data lake)
  • Google BigQuery (data warehouse)
  • dbt (transformations)
  • Google Looker Studio (visualization)

Setup to run the project:

  1. Clone the git repo to your system

    git clone <your-repo-url>
  2. Install the necessary packages/prerequisites for the project with the following command

      pip install -r requirements.txt
  3. Next you need to set up your Google Cloud environment

  4. Set up the infrastructure of the project using Terraform
  • If you do not have Terraform installed, you can install it from here and then add it to your PATH

  • Once downloaded, navigate to the terraform folder:

     cd terraform/
  • Then run the following commands to create your project infrastructure:

     terraform init
     terraform plan -var="project=<your-gcp-project-id>"
     terraform apply -var="project=<your-gcp-project-id>"
  5. Run the Python code in the prefect folder
  • You installed the required Python packages in step 2, and Prefect is installed with them. Confirm the Prefect installation with the following command

      prefect --version
  • Start the Prefect server so that you can access the UI using the command below:

    prefect orion start
  • Access the UI at: http://127.0.0.1:4200/

  • Then change out the blocks so that they are registered to your credentials for GCS and BigQuery. This can be done in the Blocks options

  • You can keep the blocks under the same names as in the code or change them. If you do change them, make sure to update the code to reference the new block names

  • Go back to the terminal and run:

     cd prefect/
  • Then run

     python citibike_data_pipeline.py
  • The Python script will then store the Citibike data both in your GCS bucket and in BigQuery
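The shape of the pipeline script can be sketched in plain Python as below. This is a hedged outline, not the repository's actual code: the real citibike_data_pipeline.py wraps steps like these in Prefect @task/@flow decorators and uses the prefect-gcp blocks for the GCS and BigQuery uploads, and all names here are illustrative.

```python
from pathlib import Path
import pandas as pd

def fetch(month: str) -> pd.DataFrame:
    """Download one month of trip data (stubbed with a tiny frame here).
    The real task would read the zipped CSV for `month` from the tripdata bucket."""
    return pd.DataFrame({
        "ride_id": ["a", "b"],
        "started_at": ["2023-01-01 08:00:00", "2023-01-01 09:30:00"],
    })

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Fix dtypes before loading."""
    df = df.copy()
    df["started_at"] = pd.to_datetime(df["started_at"])
    return df

def write_local(df: pd.DataFrame, month: str) -> Path:
    """Persist the cleaned data locally before uploading."""
    path = Path(f"data/{month}-citibike.csv")
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return path

def run(month: str) -> Path:
    df = clean(fetch(month))
    path = write_local(df, month)
    # The real flow then uploads `path` to the GCS bucket and loads the
    # table into BigQuery via the credentials stored in the Prefect blocks.
    return path
```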

  6. Run the dbt flow
  • Create a dbt account and log in using dbt Cloud here
  • Once logged in, clone the repo for use
  • In the CLI at the bottom, run the following command:

     dbt run
  • This will run all the models and create the final dataset called "fact_citibike"

  7. On a successful run, the lineage of fact_citibike looks as below:

(lineage graph)

  8. Visualization
  • You can now utilize the fact_citibike dataset within Looker Studio for visualizations.
  • You can find the report here