A data pipeline that extracts data from an AWS S3 bucket, loads it into an embedded database, and finally applies transformations to the data.

MALL SHOPPING DATA ANALYSIS DATA PIPELINE

Tech stack: dbt · AWS S3 · Streamlit · DuckDB · SQL · Docker

Project Overview

ETL and ELT data pipelines offer simplicity and automation for organizations that use cloud storage, where data scales exponentially. A data pipeline ensures that data is readily available for analysts and for modelling. Automating the pipeline also enables real-time tracking of a business's key performance indicators. Exponentially growing data can bring storage constraints, and embedded databases solve this efficiently.

Objectives

  1. Develop and deploy an efficient ELT data pipeline that extracts data from an AWS S3 bucket using the data load tool (dlt), loads it into an embedded DuckDB database, and finally transforms it with dbt.

  2. Prepare data for analysis using queries, which can be found here. The queries seek to find:

  • Average age of all shoppers across various shopping malls.

  • Average expenditure of each category of items by gender.

  • Most profitable shopping malls by year and month.

  • Shopping patterns across various malls.

The data extracted by these queries can be found here.
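
As an illustration (not the repository's exact queries), a question such as the average shopper age per mall can be answered with a short DuckDB query from Python. The database file name, table name and column names below (`filesystem.duckdb`, `shopping_data`, `shopping_mall`, `age`) are assumptions for this sketch:

```python
import duckdb

# Connect to the DuckDB file produced by the pipeline
# (the file name is a placeholder for this sketch).
con = duckdb.connect("filesystem.duckdb")

# Average age of shoppers per shopping mall; table and
# column names are hypothetical placeholders.
result = con.sql("""
    SELECT shopping_mall,
           AVG(age) AS avg_age
    FROM shopping_data
    GROUP BY shopping_mall
    ORDER BY avg_age DESC
""").df()

print(result)
```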

Data Extraction

The data load tool (dlt) is an open-source library that builds very efficient data pipelines, consuming data from various sources including Google Sheets, cloud storage services, APIs and scraping scripts. The tool comes in handy for e-commerce organizations because it can track streaming data in near real time, and it offers scalability and customization depending on the needs at hand.

Data for this project was ingested from an AWS S3 bucket. To set up the credentials that allow third-party applications to access your data, the dlt documentation provides a clear step-by-step guide.

After setting everything up, run the following commands to get started with dlt:

  1. pip install -r requirements.txt

  2. dlt init filesystem duckdb to scaffold a pipeline that reads from a filesystem or S3 source and loads into DuckDB.

  3. To run your pipeline, run python filesystem.py in the terminal (a sketch of such a script appears after this list). This creates a DuckDB database file in the working directory containing the loaded data.

  4. To interact with the data, run dlt pipeline filesystem show, which opens a Streamlit app in the local browser.
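
For orientation, a minimal filesystem.py might look roughly like the sketch below. It assumes dlt's built-in filesystem source (available in recent dlt versions) reading CSV files; the bucket URL, glob, table name and dataset name are placeholders, and the S3 credentials are expected in .dlt/secrets.toml as described in the dlt documentation:

```python
import dlt
from dlt.sources.filesystem import filesystem, read_csv

# Read CSV files from the S3 bucket; the bucket URL and glob are
# placeholders. AWS credentials are picked up from .dlt/secrets.toml.
files = filesystem(bucket_url="s3://your-bucket/path", file_glob="*.csv")
source = (files | read_csv()).with_name("shopping_data")  # placeholder table name

pipeline = dlt.pipeline(
    pipeline_name="filesystem",   # matches `dlt pipeline filesystem show`
    destination="duckdb",         # loads into an embedded DuckDB file
    dataset_name="mall_shopping", # placeholder dataset name
)

load_info = pipeline.run(source)
print(load_info)
```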

For a broader treatment, the dlt documentation on data sources provides more information on how to connect to several other data sources.

Data Transformation

Transformation is necessary because the data must be error-free so the data team can extract meaningful insights from it, far faster than with traditional manual approaches. Transformation is done using the data build tool (dbt), an open-source platform that offers great flexibility for transforming data within pipelines.

To open the Streamlit dashboard for a sneak peek of the data and to run simple queries, run dlt pipeline filesystem show.

dbt leverages Jinja to simplify the process of creating SQL queries for various data objectives. It implements the DRY (Don't Repeat Yourself) approach, which allows SQL models and queries to build on one another instead of being duplicated.
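
To illustrate the idea outside of dbt itself, the snippet below uses the jinja2 package to render one parameterized SQL template for several groupings, which is essentially what dbt's Jinja templating does for its models; the table and column names are hypothetical:

```python
from jinja2 import Template

# One parameterized template instead of several near-identical
# queries -- the DRY idea dbt applies to its SQL models.
template = Template("""
SELECT {{ group_col }},
       SUM(amount) AS total_spend
FROM {{ table }}
GROUP BY {{ group_col }}
""")

# Render the same template for different groupings; the table
# and column names are placeholders for this sketch.
for group_col in ("gender", "shopping_mall", "category"):
    print(template.render(table="shopping_data", group_col=group_col))
```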

Data Loading

DuckDB is an embedded database with strong computational performance on big datasets; it can create and query tables on the fly without first loading the data into a traditional RDBMS. It also compresses big files, which comes in handy when storage is constrained.
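
For example, DuckDB can build a table directly from a raw file with no server and no separate load step. A minimal sketch, assuming a hypothetical sales.csv and database file name:

```python
import duckdb

# An embedded database is just a single file on disk; no server needed.
con = duckdb.connect("mall_shopping.duckdb")  # placeholder file name

# Create a table on the fly straight from a raw CSV file --
# no separate RDBMS load step. 'sales.csv' is a placeholder.
con.execute("""
    CREATE TABLE IF NOT EXISTS sales AS
    SELECT * FROM read_csv_auto('sales.csv')
""")

# DuckDB stores the table in its own compressed columnar format.
print(con.sql("SELECT COUNT(*) FROM sales").fetchone())
```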

DuckDB can also be connected to visualization and dashboarding tools once the required drivers and connections are installed.

Streamlit dashboard

To create an interactive end-to-end project, a Streamlit dashboard provides a simple, navigable user interface. The dashboard has filters for:

  1. Shopping mall

  2. Year

  3. Month

  4. Product category

  5. Gender & Payment method

With each filter applied, the visuals adjust accordingly. The application can be found here: https://dtl-dbt-pipeline-gyz2smtjzi8bdoqqfptsi8.streamlit.app/
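
A minimal sketch of how such a filter can drive the visuals (this is not the deployed app's code; the database file, table and column names are placeholders):

```python
import duckdb
import streamlit as st

con = duckdb.connect("mall_shopping.duckdb")  # placeholder file name

# Sidebar filter; table and column names are hypothetical.
malls = [row[0] for row in
         con.execute("SELECT DISTINCT shopping_mall FROM sales").fetchall()]
mall = st.sidebar.selectbox("Shopping mall", malls)

# The query re-runs with the selected filter, so the chart
# adjusts to whatever the user picks.
spend = con.execute("""
    SELECT category, SUM(price * quantity) AS total_spend
    FROM sales
    WHERE shopping_mall = ?
    GROUP BY category
""", [mall]).df()

st.bar_chart(spend, x="category", y="total_spend")
```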

Future steps

Integrate authorization and a login page so that different users can access different content and analyses within the application.
