An automated ETL (Extract, Transform, Load) pipeline that takes in new data, performs the appropriate transformations, and loads the data into existing tables.


Movies-ETL

Overview

I was tasked with creating an automated ETL pipeline that builds a movie database from both structured and unstructured data. The code extracts Wikipedia JSON data, Kaggle metadata, and MovieLens ratings data; transforms the data with regular expressions and cleaning functions; and finally loads the data into a PostgreSQL database.
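The extract step described above can be sketched as a single function that reads all three raw sources. This is a minimal sketch, not the notebook code itself: the function name `extract` is my own, and the file paths assume the data files sit next to the script (the file names come from the Resources list below).

```python
import json

import pandas as pd


def extract(wiki_file, kaggle_file, ratings_file):
    """Extract step: read the three raw sources into Python/pandas objects."""
    # The Wikipedia data is semi-structured JSON -- a list of dicts, one per movie.
    with open(wiki_file, mode="r") as file:
        wiki_movies_raw = json.load(file)
    # The Kaggle metadata and MovieLens ratings are flat CSV files.
    kaggle_metadata = pd.read_csv(kaggle_file, low_memory=False)
    ratings = pd.read_csv(ratings_file)
    return wiki_movies_raw, kaggle_metadata, ratings


if __name__ == "__main__":
    # File names from the Resources section; local paths are an assumption.
    wiki, kaggle, ratings = extract(
        "wikipedia.movies.json", "movies_metadata.csv", "ratings.csv"
    )
```

Keeping extraction behind one function makes it easy to swap in small test fixtures when validating the downstream transform and load steps.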

Resources

  • Extract - ETL_function_test.ipynb

  • Transform - ETL_clean_wiki_movies.ipynb & ETL_clean_kaggle_data.ipynb

  • Load - ETL_create_database.ipynb

  • Languages: Python, SQL

  • Tools: Jupyter Notebook, JSON, Pandas, NumPy, sqlalchemy, psycopg2

  • Data Source(s): wikipedia.movies.json, movies_metadata.csv, ratings.csv (note: due to large file size, I am unable to push ratings.csv to repo)
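The transform notebooks listed above clean the data "with regex and functions". As one hedged illustration of that idea (the function name and patterns here are mine, not necessarily the ones in the notebooks), a parser can normalize messy monetary strings like Wikipedia budget and box-office values into plain floats:

```python
import re

# Matches values written as e.g. "$123.4 million".
DOLLARS_MILLIONS = re.compile(r"\$\s*(\d+(?:\.\d+)?)\s*million", re.IGNORECASE)
# Matches plain dollar amounts such as "$1,234,567".
DOLLARS_PLAIN = re.compile(r"\$\s*([\d,]+)")


def parse_dollars(value):
    """Convert a raw budget/box-office string to a float, or None if unparseable."""
    if not isinstance(value, str):
        return None
    match = DOLLARS_MILLIONS.search(value)
    if match:
        return float(match.group(1)) * 1_000_000
    match = DOLLARS_PLAIN.search(value)
    if match:
        return float(match.group(1).replace(",", ""))
    return None
```

Applied column-wide with `Series.apply(parse_dollars)`, this turns free-text currency fields into a numeric column suitable for the database.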

Results

The resulting database consists of two tables that are highlighted below.

A movies table with 6,052 rows of data, each representing a unique movie title. The table also includes 31 columns of production-related information, such as movie budget, producers, actors, and release date.

movies

A ratings table with 26,024,289 rows of data, each representing a viewer rating of 1-5. The table also includes 5 columns of data, such as movie title, rating, and date.

ratings
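Because the ratings table holds over 26 million rows, loading it in one shot is impractical. A common approach with the pandas/SQLAlchemy stack named in the Resources is chunked inserts; the sketch below assumes that approach, and the connection string is a hypothetical placeholder, not the project's actual credentials.

```python
import pandas as pd
from sqlalchemy import create_engine


def load_ratings(ratings_file, engine, chunksize=1_000_000):
    """Load step: append ratings.csv to the `ratings` table in chunks,
    so the full 26M-row file never has to fit in memory at once."""
    rows_imported = 0
    for chunk in pd.read_csv(ratings_file, chunksize=chunksize):
        chunk.to_sql("ratings", engine, if_exists="append", index=False)
        rows_imported += len(chunk)
    return rows_imported


if __name__ == "__main__":
    # Hypothetical PostgreSQL connection string -- substitute real credentials.
    engine = create_engine("postgresql://postgres:password@localhost:5432/movie_data")
    load_ratings("ratings.csv", engine)
```

Returning the running row count also makes it easy to print progress between chunks while the load runs.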
