vladyslavyaloveha/etl_platform

πŸš– ETL Platform: Analyzing NYC Yellow Taxi Trips with Airflow, FastAPI, and Cloud Integration

The aim of this project is to build a modern, end-to-end ETL platform for Big Data analysis ❀️!
It offers hands-on experience with the latest library versions in a fully dockerized environment for
NYC Yellow Taxi Trips analytics.
With Airflow and PySpark at its core, you'll explore the power of large-scale data processing using DAGs.
Choose between GCP and AWS for cloud services and manage your infrastructure with Terraform.
Enjoy the FastAPI web application for easy access to trip analytics.

πŸ† The ETL platform does simple task:

The system fetches trip data from a specified URL, conducts fundamental daily analytics including passenger count, distance traveled, and maximum trip distance. Subsequently, it uploads the computed analytics results onto cloud storage provided by the chosen cloud provider. ETL platform facilitates seamless data transfer from the storage platform to a designated database.
Furthermore, the system offers the functionality to retrieve insights via an accessible API endpoint.
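The daily metrics described above can be sketched in plain Python. This is only an illustration of what is computed, not the platform's implementation (which runs the aggregation in PySpark); the field names follow the public NYC Yellow Taxi schema (`tpep_pickup_datetime`, `passenger_count`, `trip_distance`):

```python
from collections import defaultdict

def daily_trip_analytics(trips):
    """Aggregate per-day passenger count, total distance, and max trip distance.

    `trips` is an iterable of dicts with keys `tpep_pickup_datetime`
    (ISO timestamp string), `passenger_count`, and `trip_distance`,
    mirroring columns of the NYC Yellow Taxi dataset.
    """
    stats = defaultdict(lambda: {"passenger_count": 0,
                                 "total_distance": 0.0,
                                 "max_trip_distance": 0.0})
    for trip in trips:
        day = trip["tpep_pickup_datetime"][:10]  # "YYYY-MM-DD" prefix of the timestamp
        day_stats = stats[day]
        day_stats["passenger_count"] += trip["passenger_count"]
        day_stats["total_distance"] += trip["trip_distance"]
        day_stats["max_trip_distance"] = max(day_stats["max_trip_distance"],
                                             trip["trip_distance"])
    return dict(stats)
```

In the platform itself this runs as a PySpark job inside the DAG, roughly a group-by on the pickup date with sum and max aggregations.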

🍰 ETL Platform Features

  • πŸ’š Open-source, flexible playground for Big Data analysis based on the standalone classic πŸš• Yellow Taxi Trip Data.
  • πŸ“¦ Fully dockerized via Docker Compose with the latest library versions (🐍 Python 3.10+).
  • πŸ’ͺ Harnesses the power of Airflow and PySpark for efficient processing of large datasets.
  • πŸ” Offers Google Cloud Platform (Google Cloud Storage, BigQuery) or Amazon Web Services (S3, Redshift) cloud solutions, based on user preference.
  • ☁️ Cloud infrastructure managed through Terraform.
  • 🌟 Includes a user-friendly FastAPI web application behind Traefik for easy access to trip analytics.
  • πŸ”§ Uses Poetry for dependency management.
  • πŸ“„ Provides basic pre-commit hooks, Ruff formatting, and Checkov scanning for security and compliance issues.

πŸš€ Getting Started

🎌 Installation

  1. Clone the ETL platform project.
  2. Check that the required tools are installed: Docker with Docker Compose, and Terraform.
  3. Get credentials for your cloud(s) (AWS, GCP, or both) and place them under the credentials folder.
  4. Apply the cloud infrastructure via Terraform from the root folder:
```shell
cd terraform/<cloud-provider>
terraform init
terraform plan
terraform apply -auto-approve
```

❗ For the AWS cloud provider you must pass `aws_access_key_id`, `aws_secret_access_key`, and `redshift_master_password`.
The output includes `redshift_cluster_endpoint`, from which the Redshift host can be extracted and set in the `.env` file below.
❗ For the GCP cloud provider you must pass `credentials_path=../../credentials/gcp/<filename>.json` and `project_name`.
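As a sketch, these variables can also be supplied via a `terraform.tfvars` file instead of interactive prompts or `-var` flags (the file placement is an assumption; the variable names come from the notes above):

```hcl
# terraform/aws/terraform.tfvars (hypothetical placement)
aws_access_key_id        = "<access-key-id>"
aws_secret_access_key    = "<secret-access-key>"
redshift_master_password = "<redshift-master-password>"
```

```hcl
# terraform/gcp/terraform.tfvars (hypothetical placement)
credentials_path = "../../credentials/gcp/<filename>.json"
project_name     = "<project-name>"
```

Terraform loads `terraform.tfvars` automatically on `plan` and `apply`. Keep such files out of version control, since they contain secrets.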

🌞 If the previous steps completed successfully, Terraform prints an `Apply complete!` message and the cloud infrastructure is ready to use!

  5. Update the `.env` file under the build folder with actual credentials:
```shell
# GCP
GCP_CREDENTIALS_PATH=/opt/airflow/credentials/gcp/<filename>.json
GCP_PROJECT_NAME=<project-name>

# AWS
AWS_ACCESS_KEY_ID=<access-key-id>
AWS_SECRET_ACCESS_KEY=<secret-access-key>

REDSHIFT_HOST=<redshift-host>
REDSHIFT_MASTER_PASSWORD=<redshift-master-password>
```

⚠️ Replace the default values (such as passwords) in the `.env` file.
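A minimal sketch of how a service might read these settings at runtime and fail fast when one is missing (the helper name is hypothetical; the variable names match the `.env` keys above):

```python
import os

# e.g. the keys needed when the GCP provider is selected
REQUIRED_VARS = ["GCP_CREDENTIALS_PATH", "GCP_PROJECT_NAME"]

def load_settings(required=REQUIRED_VARS):
    """Read settings from the environment, raising on missing or empty keys."""
    missing = [name for name in required if not os.getenv(name)]
    if missing:
        raise RuntimeError(f"Missing required environment variables: {missing}")
    return {name: os.environ[name] for name in required}
```

Failing at startup with an explicit list of missing keys is easier to debug than a cloud client error deep inside a DAG run.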

  6. Bring the project up via docker-compose from the build folder:

```shell
docker-compose up --build
```

  7. Check that the build completed successfully πŸŽ‰ (authenticate with `_AIRFLOW_WWW_USER_USERNAME` and `_AIRFLOW_WWW_USER_PASSWORD` from `.env`): http://localhost:8080.

πŸ’‘ Usage

  1. Your environment contains a DAG called `ETL_trip_data`, which you can trigger with default parameters (and the selected cloud provider).

  2. The DAG graph is available at http://localhost:8080/dags/ETL_trip_data/grid?tab=graph.

  3. The FastAPI web application docs are available at http://localhost:8009/docs.

  4. Spark interface: http://localhost:8082.

  5. Traefik monitoring: http://localhost:8085.

πŸ’» Tech Stack

Python · Apache Airflow · PySpark · Docker · AWS S3 · Redshift · GCP · Google Cloud Storage · BigQuery · Terraform · FastAPI · Traefik · Poetry · pre-commit · Checkov · Ruff · Parquet · Git · GitHub · GitHub Actions

πŸ˜€ Enjoying this project? Support it with a GitHub star ⭐

✨ Adjust and improve the project for your needs.

πŸ“ˆ Metrics
