
Data Engineering ZoomCamp Final Project: "fully automated online store data pipeline"


Online Store Data Pipeline

Description:

A fully automated pipeline that manages the process of getting raw data from the source systems, ingesting it into the data lake (Google Cloud Storage), preparing it, moving it to the data warehouse (BigQuery), and transforming it to create aggregation layers and an RFM analysis that segments the customers.

Objective:

Transform the raw transactional data that arrives daily from a Postgres database into aggregated and analytical layers that can be used to segment and target customers and to analyse product success.
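To give a feel for the customer segmentation step, here is a minimal, hedged RFM (recency, frequency, monetary) sketch in Python. The column names (customer_id, order_date, amount) and the quartile-based scoring are assumptions for illustration only; the project's actual segmentation lives in the transformation layer described in the final tables README.

```python
# Minimal RFM sketch (assumed schema: customer_id, order_date, amount).
import pandas as pd

def rfm_scores(orders: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    """Score each customer 1 (low) to 4 (high) on recency, frequency and monetary value."""
    rfm = orders.groupby("customer_id").agg(
        recency=("order_date", lambda d: (as_of - d.max()).days),
        frequency=("order_date", "count"),
        monetary=("amount", "sum"),
    )
    # rank(method="first") avoids duplicate bin edges; lower recency is better, so reverse its labels.
    rfm["r_score"] = pd.qcut(rfm["recency"].rank(method="first"), 4, labels=[4, 3, 2, 1]).astype(int)
    rfm["f_score"] = pd.qcut(rfm["frequency"].rank(method="first"), 4, labels=[1, 2, 3, 4]).astype(int)
    rfm["m_score"] = pd.qcut(rfm["monetary"].rank(method="first"), 4, labels=[1, 2, 3, 4]).astype(int)
    rfm["segment"] = rfm["r_score"].astype(str) + rfm["f_score"].astype(str) + rfm["m_score"].astype(str)
    return rfm
```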

For more information about the data source, check the Data Source README.

For more information about the final models and the structure of the production tables, check the final tables README.

What Technologies are being Used?

Final Dashboard:

Visit Dashboard-Link

How to Make it Work?

  1. Set up your Google Cloud environment
export GOOGLE_APPLICATION_CREDENTIALS=<path_to_your_credentials>.json
gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS
gcloud auth application-default login
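As an optional sanity check, a short Python sketch (assuming the google-auth package that ships with the Google Cloud client libraries is installed) can confirm that application default credentials resolve:

```python
# Sketch: confirm application default credentials resolve (assumes google-auth is installed).
import google.auth

credentials, project_id = google.auth.default()
print(f"Authenticated against project: {project_id}")
```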
  2. Set up your infrastructure
  • Assuming you are using Linux AMD64, run the following commands to install Terraform. If you are using a different OS, please choose the correct version here and exchange the download link and zip file name.
sudo apt-get install unzip
cd ~/bin
wget https://releases.hashicorp.com/terraform/1.4.1/terraform_1.4.1_linux_amd64.zip
unzip terraform_1.4.1_linux_amd64.zip
rm terraform_1.4.1_linux_amd64.zip
  3. To initialize, plan, and apply the infrastructure, adjust and run the following Terraform commands
cd terraform/
terraform init
terraform plan -var="project=<your-gcp-project-id>"
terraform apply -var="project=<your-gcp-project-id>"
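Once the apply finishes, the Terraform configuration should have provisioned the data lake bucket in Google Cloud Storage and the BigQuery resources (the exact names depend on your Terraform variables). A quick, hedged check from Python, assuming the google-cloud-storage and google-cloud-bigquery packages are installed, could look like this:

```python
# Sketch: list the resources Terraform just created (assumes google-cloud-storage and
# google-cloud-bigquery are installed and GOOGLE_APPLICATION_CREDENTIALS is set).
from google.cloud import bigquery, storage

storage_client = storage.Client()
bq_client = bigquery.Client()

print("Buckets:", [b.name for b in storage_client.list_buckets()])
print("Datasets:", [d.dataset_id for d in bq_client.list_datasets()])
```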
  4. Go to the fake-data-generator directory and run the following two commands
  • Run make run_img to create and run the Postgres Docker image (the source system).
  • Run make run to create fake data and insert it into Postgres.
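The make targets wrap the generator; conceptually it does something like the sketch below. The table name, columns, and connection settings are illustrative assumptions (using Faker and psycopg2), not the project's exact code.

```python
# Illustrative sketch of a fake-order generator (assumed table, columns and credentials).
import random
from faker import Faker
import psycopg2

fake = Faker()

conn = psycopg2.connect(
    host="localhost", port=5432, dbname="online_store", user="postgres", password="postgres"
)
with conn, conn.cursor() as cur:
    for _ in range(1000):
        cur.execute(
            "INSERT INTO orders (customer_name, order_date, amount) VALUES (%s, %s, %s)",
            (
                fake.name(),
                fake.date_between(start_date="-30d", end_date="today"),
                round(random.uniform(5, 500), 2),
            ),
        )
conn.close()
```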
  5. Set up your orchestration
  • Go to the prefect directory.
  • To set up the Python virtual environment and install all dependencies, run make venv.
  • Check the prefect README to set up the blocks and dependencies before running the flow.
  • Run the flow using make run.
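For orientation, the flow executed by make run is essentially an extract-to-lake-to-warehouse pipeline. The sketch below shows the general shape using Prefect tasks with the Google Cloud client libraries; the connection string, bucket, dataset, and table names are assumptions for illustration, and the real flow and its blocks are defined in the prefect directory.

```python
# Rough sketch of the orchestration (assumed names; see the prefect directory for the real flow).
import pandas as pd
from google.cloud import bigquery, storage
from prefect import flow, task

@task
def extract_from_postgres() -> pd.DataFrame:
    # Assumed connection URI (requires SQLAlchemy and a Postgres driver);
    # the real flow reads its connection details from Prefect blocks instead.
    return pd.read_sql("SELECT * FROM orders", "postgresql://postgres:postgres@localhost:5432/online_store")

@task
def load_to_gcs(df: pd.DataFrame, bucket_name: str, blob_name: str) -> str:
    # Write the extract to the data lake as Parquet.
    storage.Client().bucket(bucket_name).blob(blob_name).upload_from_string(
        df.to_parquet(), content_type="application/octet-stream"
    )
    return f"gs://{bucket_name}/{blob_name}"

@task
def load_to_bigquery(gcs_uri: str, table_id: str) -> None:
    # Load the Parquet file from GCS into the warehouse (uses the client's default project).
    client = bigquery.Client()
    job = client.load_table_from_uri(
        gcs_uri,
        table_id,
        job_config=bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET),
    )
    job.result()  # wait for the load job to finish

@flow
def online_store_pipeline():
    df = extract_from_postgres()
    uri = load_to_gcs(df, "online-store-datalake", "raw/orders.parquet")
    load_to_bigquery(uri, "online_store_data.orders")

if __name__ == "__main__":
    online_store_pipeline()
```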
  6. The final tables will be created in the online_store_data dataset in BigQuery.
  7. Build the dashboard.
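Any BI tool that can read from BigQuery works for the dashboard (the published one is linked above). As a hedged example of the kind of query a dashboard tile might sit on top of, assuming a customer segments table named rfm_segments in the online_store_data dataset (the table name is an assumption; check the final tables README for the real names):

```python
# Sketch: pull segment counts for a dashboard tile (table name is an assumption).
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT segment, COUNT(*) AS customers
    FROM `online_store_data.rfm_segments`
    GROUP BY segment
    ORDER BY customers DESC
"""
for row in client.query(query).result():
    print(row.segment, row.customers)
```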