The analytics team at the music streaming startup Sparkify wants to analyze the song-listening activity of its users. The analysis is based on JSON user activity logs from the mobile app and song metadata that currently reside in Amazon S3.
The goal of this project is to build an ETL pipeline that extracts the user activity and song data from S3, stages the data in Amazon Redshift, and transforms it into a set of dimensional tables on which the analytics team can run queries for song play analysis.
- sql_queries.py - queries for creating tables and loading data
- create_tables.py - drops and creates tables; used to reset tables prior to running the ETL script
- etl.py - runs the ETL pipeline that stages the S3 data in Redshift and populates the dimensional tables
- create_cluster.ipynb - notebook for creating the Redshift cluster and IAM role
- data_quality_checks.ipynb - notebook for running queries to perform sanity checks on the tables and inserted data
- dwh.cfg - configuration values for AWS services
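To show how these files fit together, the sketch below outlines the typical flow of create_tables.py: read dwh.cfg, connect to the cluster, and replay the DROP and CREATE statements defined in sql_queries.py. It is a minimal sketch only; psycopg2, the [CLUSTER] config section, and the query-list names drop_table_queries / create_table_queries are assumptions for the example and may not match this repo exactly.

```python
# Minimal sketch of the create_tables.py flow (assumed structure, not a copy
# of this repo's code): connect to Redshift and rebuild all tables.
import configparser
import psycopg2

# Assumed names for the query lists exported by sql_queries.py
from sql_queries import create_table_queries, drop_table_queries


def main():
    config = configparser.ConfigParser()
    config.read("dwh.cfg")

    # Connection settings are assumed to live in a [CLUSTER] section of dwh.cfg
    cluster = config["CLUSTER"]
    conn = psycopg2.connect(
        host=cluster["HOST"],
        dbname=cluster["DB_NAME"],
        user=cluster["DB_USER"],
        password=cluster["DB_PASSWORD"],
        port=cluster["DB_PORT"],
    )
    cur = conn.cursor()

    # Drop any existing tables, then recreate them from scratch
    for query in drop_table_queries + create_table_queries:
        cur.execute(query)
        conn.commit()

    conn.close()


if __name__ == "__main__":
    main()
```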
In order to run the Jupyter notebooks and Python scripts, an AWS IAM user with the following policies (or equivalent permissions) is required:
- AmazonRedshiftFullAccess
- AmazonS3ReadOnlyAccess
- IAMFullAccess
- AmazonEC2FullAccess
The access key and secret key need to be added to the [AWS] section of the dwh.cfg file:
[AWS]
KEY=YOURACCESSKEYGOESHERE
SECRET=PUTyourSECRETaccessKEYhereTHISisREQUIRED
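To verify the credentials are being picked up, the values can be read back with Python's configparser (a minimal sketch; the section and key names match the snippet above):

```python
import configparser

# Read dwh.cfg from the project root and pull the AWS credentials
config = configparser.ConfigParser()
config.read("dwh.cfg")

KEY = config.get("AWS", "KEY")
SECRET = config.get("AWS", "SECRET")

# Print only a prefix so the secret never ends up in logs or notebook output
print("Access key loaded:", KEY[:4] + "...")
```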
- Create the Redshift cluster and the IAM role granting the cluster access to S3 by running the cells in the notebook create_cluster.ipynb (a boto3 sketch of this step follows this list)
- Run the Python file to create the tables: python create_tables.py
- Run the Python file to execute the ETL pipeline: python etl.py
- Check that the tables were created and populated with data by running the cells in the notebook data_quality_checks.ipynb
- Delete the cluster and IAM role by running the cells at the bottom of the notebook create_cluster.ipynb, in the section Deleting the cluster and IAM role
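For reference, the cluster-creation step in create_cluster.ipynb boils down to boto3 calls along these lines. This is a hedged sketch, not the notebook's exact code: the region, role name, cluster identifier, node type, and database credentials shown here are assumptions for illustration.

```python
import json
import configparser
import boto3

config = configparser.ConfigParser()
config.read("dwh.cfg")
KEY = config.get("AWS", "KEY")
SECRET = config.get("AWS", "SECRET")
REGION = "us-west-2"  # assumed region

iam = boto3.client("iam", region_name=REGION,
                   aws_access_key_id=KEY, aws_secret_access_key=SECRET)
redshift = boto3.client("redshift", region_name=REGION,
                        aws_access_key_id=KEY, aws_secret_access_key=SECRET)

# Create an IAM role that Redshift can assume, then let it read from S3
iam.create_role(
    RoleName="dwhRole",  # assumed role name
    AssumeRolePolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "redshift.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }),
)
iam.attach_role_policy(
    RoleName="dwhRole",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
)
role_arn = iam.get_role(RoleName="dwhRole")["Role"]["Arn"]

# Launch a small Redshift cluster with the role attached
redshift.create_cluster(
    ClusterType="multi-node",
    NodeType="dc2.large",            # assumed node type
    NumberOfNodes=4,                 # assumed cluster size
    DBName="dwh",                    # assumed database name
    ClusterIdentifier="dwhCluster",  # assumed identifier
    MasterUsername="dwhuser",        # assumed credentials
    MasterUserPassword="Passw0rd123",
    IamRoles=[role_arn],
)
```

The teardown section of the notebook is essentially the reverse: a call like redshift.delete_cluster with SkipFinalClusterSnapshot=True, followed by iam.detach_role_policy and iam.delete_role for the role created above.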