An end-to-end ETL data pipeline that leverages PySpark parallel processing to process about 25 million rows of data exported from a SaaS application. Apache Airflow serves as the orchestration tool, the transformed data is loaded into a data warehouse, and Apache Superset connects to the warehouse to generate BI dashboards for weekly reports.
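The extract-transform-load flow described above can be sketched as plain Python functions wired in the order the Airflow DAG would enforce. This is a hypothetical illustration only: the actual repo runs the transform step as a PySpark job and orchestrates the stages with Airflow operators, and every function, field, and table name below is an assumption, not taken from the repo.

```python
# Hypothetical sketch of the weekly ETL flow; the real pipeline runs the
# transform in PySpark and sequences these stages with an Airflow DAG.
# All names (extract, transform, load, "amount", "weekly_report") are
# illustrative stand-ins, not the repo's actual identifiers.

def extract(source_rows):
    """Pull raw records from the SaaS export (stand-in for an S3/Blob read)."""
    return [r for r in source_rows if r is not None]

def transform(rows):
    """Clean and aggregate (stand-in for the PySpark job over ~25M rows)."""
    return {"row_count": len(rows), "total": sum(r["amount"] for r in rows)}

def load(summary, warehouse):
    """Write the weekly summary to the warehouse (stand-in for PostgreSQL)."""
    warehouse["weekly_report"] = summary
    return warehouse

def run_pipeline(source_rows, warehouse):
    """Extract -> transform -> load, in the order the DAG would enforce."""
    return load(transform(extract(source_rows)), warehouse)
```

For example, `run_pipeline([{"amount": 10}, None, {"amount": 5}], {})` drops the null record, aggregates the remaining two rows, and writes the summary under the `weekly_report` key.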
airflow
pyspark
datawarehouse
airflow-docker
dataengineering
amazon-s3
postgresql
azure-blob-storage
etl-pipeline
apache-superset
bi-dashboards
Updated Dec 7, 2022 - Python