vspatil/citibike-data-pipeline

NYC Citibike-Data-Pipeline

Overview:

This project was developed as part of the 2023 Data Engineering Zoomcamp. Its goal is to implement a batch pipeline for NYC's Citibike data: the pipeline extracts data from the Citibike dataset and stores the raw data in Google Cloud Storage and Google BigQuery. The data in BigQuery is then transformed using dbt, and the transformed dataset is used by Google Looker Studio to build visualizations for analytics purposes.
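The extract step can be sketched as below. This is a hedged illustration, not the project's actual code: the S3 bucket URL and the per-month file-naming pattern (e.g. 202301-citibike-tripdata.csv.zip) follow the public Citibike tripdata bucket, but verify them against the dataset index before relying on them.

```python
# Illustrative sketch: enumerate the monthly Citibike source files for a year.
# BASE and the file pattern are assumptions about the public tripdata bucket.
BASE = "https://s3.amazonaws.com/tripdata"

def monthly_urls(year: int, months=range(1, 13)) -> list:
    """Build the download URL for each requested month of a year."""
    return [f"{BASE}/{year}{m:02d}-citibike-tripdata.csv.zip" for m in months]

urls = monthly_urls(2023)
```

The batch flow then downloads each monthly file, writes the raw data to GCS, and loads it into BigQuery.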

Citibike Pipeline Architecture:

(architecture diagram)

Problem description:

Citi Bike is NYC’s official bike share program, designed to give residents and visitors a fun, affordable and convenient alternative to walking, taxis, buses and subways. Citi Bike believes that biking is the best way to see NYC: it's a quick and affordable way to get around the city, and it even lets you sightsee along the way. This project answers the questions below and helps riders explore NYC.

  • Where do Citi Bikers ride?
  • Which stations are most popular?
  • What days of the week are most rides taken on?
  • What is the total number of trips?
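These questions map onto simple aggregations over the trip data. A minimal sketch with pandas, using a toy sample and illustrative column names (start_station_name, started_at) in place of the real dataset:

```python
import pandas as pd

# Toy stand-in for the trip data; the real dataset's column names may differ.
trips = pd.DataFrame({
    "start_station_name": ["W 21 St & 6 Ave", "W 21 St & 6 Ave", "Broadway & E 14 St"],
    "started_at": pd.to_datetime(["2023-01-02 08:00", "2023-01-03 18:10", "2023-01-02 09:05"]),
})

total_trips = len(trips)                                   # total number of trips
top_stations = trips["start_station_name"].value_counts()  # most popular stations
rides_by_weekday = trips["started_at"].dt.day_name().value_counts()  # busiest weekdays
```

In the project itself these aggregations are expressed as dbt models over BigQuery and surfaced in the Looker report.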

Dashboard Samples:

You can find the report here

(dashboard screenshots)

Technologies:

The following technologies are used to implement this pipeline:

  • Python
  • Prefect (orchestration)
  • Terraform (infrastructure as code)
  • Google Cloud Storage (data lake)
  • Google BigQuery (data warehouse)
  • dbt (transformations)
  • Google Looker Studio (visualization)

Setup to run the project:

  1. Clone the git repo to your system

    git clone <your-repo-url>
  2. Install the necessary packages/prerequisites for the project with the following command

      pip install -r requirements.txt
  3. Next you need to set up your Google Cloud environment

  4. Set up the infrastructure of the project using Terraform
  • If you do not have Terraform installed, you can install it from here and then add it to your PATH

  • Once downloaded, navigate to the terraform folder:

     cd terraform/
  • Then run the following commands to create your project infrastructure:

     terraform init
     terraform plan -var="project=<your-gcp-project-id>"
     terraform apply -var="project=<your-gcp-project-id>"
  5. Run the Python code in the prefect folder
  • You installed the required Python packages in step 2, and Prefect is installed with them. Confirm the Prefect installation with the following command

      prefect --version
  • Start the Prefect server so that you can access the UI using the command below:

    prefect orion start
  • Access the UI at: http://127.0.0.1:4200/

  • Then change out the blocks so that they are registered to your credentials for GCS and BigQuery. This can be done in the Blocks options

  • You can keep the blocks under the same names as in the code or change them. If you do change them, make sure to update the code to reference the new block names

  • Go back to the terminal and run:

     cd prefect/
  • Then run

     python citibike_data_pipeline.py
  • The Python script will then store the Citibike data both in your GCS bucket and in BigQuery
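The shape of the pipeline script can be sketched in plain Python as below. This is a hedged outline, not the repository's actual code: the real citibike_data_pipeline.py wraps steps like these in Prefect @task/@flow decorators and uses the prefect-gcp blocks for the GCS and BigQuery uploads, and all names here are illustrative.

```python
from pathlib import Path
import pandas as pd

def fetch(month: str) -> pd.DataFrame:
    """Download one month of trip data (stubbed with a tiny frame here).
    The real task would read the zipped CSV for `month` from the tripdata bucket."""
    return pd.DataFrame({
        "ride_id": ["a", "b"],
        "started_at": ["2023-01-01 08:00:00", "2023-01-01 09:30:00"],
    })

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Fix dtypes before loading."""
    df = df.copy()
    df["started_at"] = pd.to_datetime(df["started_at"])
    return df

def write_local(df: pd.DataFrame, month: str) -> Path:
    """Persist the cleaned data locally before uploading."""
    path = Path(f"data/{month}-citibike.csv")
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return path

def run(month: str) -> Path:
    df = clean(fetch(month))
    path = write_local(df, month)
    # The real flow then uploads `path` to the GCS bucket and loads the
    # table into BigQuery via the credentials stored in the Prefect blocks.
    return path
```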

  6. Run the dbt flow
  • Create a dbt account and log in using dbt Cloud here
  • Once logged in, clone the repo for use
  • In the CLI at the bottom, run the following command:

     dbt run
  • This will run all the models and create the final dataset called "fact_citibike"

  7. On a successful run, the lineage of fact_citibike looks as below:

(lineage graph)

  8. Visualization
  • You can now utilize the fact_citibike dataset within Looker Studio for visualizations.
  • You can find the report here