Data Warehousing ETL Demo with Apache Iceberg on EMR Local Environment

jaehyeon-kim/iceberg-etl-demo

Iceberg ETL Demo

Data Warehousing ETL Demo with Apache Iceberg on EMR Local Environment

  • Unlike traditional data lakes, the newer table formats (Iceberg, Hudi and Delta Lake) support features that make it possible to apply data warehousing patterns, which offers a way out of the data swamp. In this post, we'll discuss how to implement ETL using retail analytics data. The data consists of two dimension data sets (user and product) and a single fact data set (order). The dimension data sets use different ETL strategies depending on whether historical changes are tracked. For the fact data, the primary keys of the dimension data are added to facilitate later queries. We'll use Iceberg for data storage and management and Spark for data processing. Instead of provisioning an EMR cluster, a local development environment is used. Finally, the ETL results are queried with Athena for verification.
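
The two dimension strategies and the fact enrichment described above can be sketched in plain Python. This is a minimal illustration of the logic only, not the Spark/Iceberg code used in this repository (on Iceberg, the same idea would typically be expressed with Spark SQL `MERGE INTO`). All names here are hypothetical: the columns `sk`, `valid_from`, `valid_to`, `ordered_at`, `price` and `email`, the helper functions, and the choice of which dimension keeps history are assumptions made for the example.

```python
from datetime import datetime

# Hypothetical far-future timestamp marking the current version of a record.
OPEN_END = datetime(9999, 12, 31)


def upsert_overwrite(dim, incoming, key):
    """No history: keep one row per natural key, newest values win."""
    merged = {row[key]: dict(row) for row in dim}
    merged.update({row[key]: dict(row) for row in incoming})
    return list(merged.values())


def upsert_with_history(dim, incoming, key, tracked, load_ts):
    """History kept: expire the current row and append a new version
    whenever one of the tracked columns changes."""
    result = [dict(row) for row in dim]
    current = {row[key]: row for row in result if row["valid_to"] == OPEN_END}
    next_sk = max((row["sk"] for row in result), default=0) + 1
    for new in incoming:
        old = current.get(new[key])
        if old is not None and all(old[c] == new[c] for c in tracked):
            continue  # nothing changed, keep the current version
        if old is not None:
            old["valid_to"] = load_ts  # expire the previous version
        version = {**new, "sk": next_sk, "valid_from": load_ts, "valid_to": OPEN_END}
        result.append(version)
        current[new[key]] = version
        next_sk += 1
    return result


def add_dimension_key(orders, dim, key, order_ts="ordered_at"):
    """Attach the surrogate key of the dimension version valid at order time."""
    enriched = []
    for order in orders:
        match = next(
            (d for d in dim
             if d[key] == order[key]
             and d["valid_from"] <= order[order_ts] < d["valid_to"]),
            None,
        )
        enriched.append({**order, f"{key}_sk": match["sk"] if match else None})
    return enriched


# Assumed for illustration: users are overwritten, products keep history.
users = upsert_overwrite(
    [{"user_id": 1, "email": "old@example.com"}],
    [{"user_id": 1, "email": "new@example.com"}],
    "user_id",
)

products = upsert_with_history(
    [], [{"product_id": 1, "price": 10.0}], "product_id", ["price"], datetime(2022, 1, 1)
)
products = upsert_with_history(
    products, [{"product_id": 1, "price": 12.0}], "product_id", ["price"], datetime(2022, 2, 1)
)

orders = [
    {"order_id": 100, "product_id": 1, "ordered_at": datetime(2022, 1, 15)},
    {"order_id": 101, "product_id": 1, "ordered_at": datetime(2022, 2, 15)},
]
facts = add_dimension_key(orders, products, "product_id")
```

In this sketch, each order row carries the surrogate key of the product version that was valid when the order was placed, so later queries can join the fact to the right dimension version on a single key instead of comparing date ranges.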
