Spotify's-End-to-End-Data-Pipeline

Leveraging AWS Services for Enhanced Music Streaming Insights specifically tailored for extracting, transforming, and loading (ETL) Spotify's data to gain profound insights into the dynamics of the Music artists, albums and tracks of Top 100 Global playlists. The data pipeline commences by retrieving information from the Spotify API, then processing the data to make it suitable for analysis. The refined data is subsequently loaded into Amazon S3, and Amazon Athena is employed to query the processed data. AWS Glue plays a pivotal role in cataloging and managing metadata throughout the process.

AWS Serverless Services Used

AWS S3 (Simple Service Storage) : Amazon S3 (Simple Storage Service) is a highly scalable and versatile object storage service offered by Amazon Web Services (AWS).
AWS Lambda : AWS Lambda is a serverless compute service provided by Amazon Web Services (AWS) that enables developers to run code without provisioning or managing servers. It allows you to execute code in response to events and automatically scales the infrastructure based on the workload. AWS Lambda code can be triggered to the response to events like changes in S3, CloudWatch or other AWS services
AWS Cloud Watch : Amazon CloudWatch, provided by Amazon Web Services (AWS), is a robust monitoring and observability service. It empowers users to gather and monitor diverse metrics, oversee log files, establish alarms, and automatically take actions in response to modifications in AWS resources.
AWS Glue : AWS Glue is a fully managed extract, transform, and load (ETL) service provided by Amazon Web Services (AWS). It simplifies the process of preparing and loading data for analytics, allowing developers and data engineers to build scalable and efficient data pipelines. AWS Glue Data Catalg is a metadata repository that makes it easy to discover and manage data in AWS.
AWS Athena : Amazon Athena is an interactive query service provided by Amazon Web Services (AWS) that allows users to analyze data directly in Amazon Simple Storage Service (S3) using standard SQL queries. Athena is serverless, meaning there is no infrastructure to manage, and it scales automatically to handle large datasets.

Languages/Packages Used

Python Pandas
SQL - Structured Query Langugae
SpotifyAPI - spotipy python package.

Architecture

Project WorkFlow

Step 1: In this project the data from Spotify API is feteched and stored in S3 through the lambda function code. Two buckets namely, raw_data and transformed data are created accordingly.

Step 2: Upon loading the data into the S3, the transformation lambda function is triggered and the .json file is converted into the tables to do 2 tasks: - Creating the transformed data ( artists, albums, songs) into the transformed S3 bucket using pandas in lambda functions. - Once the transformation is done, the older files are moved to the processed bucket inside the transformation bucket.

Step 3: Once the transformed data is available in S3 bucket, AWS Glue crawlers are created to crawl the repective S3 buckets and fetch the data into the specified database.

Step 4: The tables from AWS DataCatalog are accessed through the Athena to get the meaning insights of the different tables built. Please refer to the .sql files attached for the insights found on songs, album and artists data.

In a nutshell, Its execution flow is :

Extract Data from Spotify API ----> Triggering Lambda Functions( Every 1 day) ----> Run the extract code ----> Store the raw data in S3 bucket ----> Trigger tranform function whenever new data is in S3 raw data bucket ----> Transform data and Load it in AWS Data Catalog tables ----> Finding the insights about the Top 100 songs using Athena.

Data Source:

Spotify API developers account : https://developer.spotify.com/ , For access to the Spotify Playlists.

Things to Note Down

Make sure to user the IAM user roles for accessing the AWS services and Allocate IAM policies for the AWS services interacting with each other.
I faced an issue while loading the data from the AWS Glue crawler into Athena, the issue was the data feteched was not in proper format, I figured out that, the SerDe in AWS Glue by defalut is of the format "org.apache.hadoop.mapred.TextInputFormat", But instead to resolve the comma separated issue, you can use "org.apache.hadoop.hive.serde2.OpenCSVSerde" parameter.
Second issue was the AWS Glue was not reading the proper column names for the artist table created through crawler. One has to edit the column names manually from the Edit JSON option.
Even though the transformation code in the Lambda Function is written to get the datetime format for date related columns. AWS Glue crawler has considered it as a STRING type. Convert this STRING type to date format using date_parse in Athena for Date time related SQL Analysis.
This Project can further be enhanced by importing the Athena results/tables in to the Visualization Tools like Tableau or Amazon Quicksight.

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
ETL Code		ETL Code
Architecture.png		Architecture.png
README.md		README.md
Spotify_Data_Pipeline.ipynb		Spotify_Data_Pipeline.ipynb
albums_athena_analysis.sql		albums_athena_analysis.sql
artist_athena_results.sql		artist_athena_results.sql
songs_athena_analysis.sql		songs_athena_analysis.sql

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Spotify's-End-to-End-Data-Pipeline

AWS Serverless Services Used

Languages/Packages Used

Architecture

Project WorkFlow

Data Source:

Things to Note Down

About

Releases

Packages

Languages

sarutlaa/Spotify-End-to-End-Data-Pipeline

Folders and files

Latest commit

History

Repository files navigation

Spotify's-End-to-End-Data-Pipeline

AWS Serverless Services Used

Languages/Packages Used

Architecture

Project WorkFlow

Data Source:

Things to Note Down

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages