Data Engineering GitHub Trending Repository Project

Overview

In this updated version of the project, I embarked on an exciting journey of exploring and analyzing trending repositories on GitHub. I aimed to extract valuable insights from the repositories' data, rank them based on various criteria, and create a meaningful schema for further analysis.

Workflow

Here's a high-level overview of the workflow I followed in this project:

  1. Data Extraction: I fetched trending repositories' data from GitHub using web scraping techniques and the GitHub API.

  2. Data Transformation: The extracted data was cleaned, transformed, and organized into a structured format suitable for analysis (see the pandas sketch after this list).

  3. Schema Design: I designed a star schema that included dimension tables for users, repositories, time, and ranks, along with a fact table for repository information.

  4. Database Creation: I set up a PostgreSQL database to store the structured data using the designed star schema.

  5. Triggers and Functions: I implemented triggers and functions in PostgreSQL to handle the dynamic updates and insertions in the user dimension.

  6. Ranking Algorithm: I developed a Python function to calculate rank values for repositories based on customizable weights.

  7. Data Analysis: With the data in place, I conducted insightful analyses, such as identifying top repositories and understanding user behaviors.

  8. Documentation and Sharing: I documented the entire project to capture challenges, solutions, workflow, and outcomes. This documentation serves as a valuable reference for both personal reflection and sharing with others.
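
To make the transformation step concrete, here is a minimal sketch of how cleaned records could be split into dimension and fact DataFrames with pandas. The column names (`owner`, `repo_name`, `stars`, `forks`, `contributions`, `extracted_at`, `trend_period`) are illustrative assumptions, not the pipeline's exact fields.

```python
import pandas as pd

# Minimal sketch of the transform step: split cleaned records into dimension
# and fact DataFrames. Column names here are illustrative assumptions.
def build_star_schema_frames(raw_records: list[dict]) -> dict[str, pd.DataFrame]:
    df = pd.DataFrame(raw_records)
    df["extracted_at"] = pd.to_datetime(df["extracted_at"])

    # Dimensions: one row per distinct user / repository / extraction time.
    dim_user = df[["owner"]].drop_duplicates().reset_index(drop=True)
    dim_repo = df[["repo_name", "language"]].drop_duplicates().reset_index(drop=True)
    dim_time = (
        df[["extracted_at", "trend_period"]]  # trend_period: daily / weekly / monthly
        .drop_duplicates()
        .reset_index(drop=True)
    )

    # Fact table keeps the measures plus the natural keys used to join back
    # to the dimensions (surrogate keys are assigned later, in PostgreSQL).
    fact_repo = df[["repo_name", "owner", "extracted_at", "stars", "forks", "contributions"]]

    return {"dim_user": dim_user, "dim_repo": dim_repo, "dim_time": dim_time, "fact_repo": fact_repo}
```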

Challenges Faced and Solutions

Throughout the project, I encountered several challenges that pushed me to think creatively and problem-solve effectively:

Data Extraction and Processing:

Challenge: Fetching data from GitHub and processing it efficiently for analysis.

Solution: I used Python with libraries like BeautifulSoup and the GitHub API to gather and process repository and user information.
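
As a rough illustration of this step (not the project's exact code), the sketch below scrapes repository names from the trending page with BeautifulSoup and enriches them through the GitHub API with PyGithub. The CSS selector and field names are assumptions and may need adjusting if GitHub changes its markup.

```python
import requests
from bs4 import BeautifulSoup
from github import Github  # PyGithub

def fetch_trending(since: str = "daily") -> list[str]:
    """Scrape the full names ("owner/repo") of trending repositories."""
    html = requests.get(f"https://github.com/trending?since={since}", timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # Each trending repository is listed inside an <article class="Box-row"> element.
    return [
        a["href"].strip("/")
        for a in soup.select("article.Box-row h2 a")
        if a.get("href")
    ]

def enrich_with_api(full_names: list[str], token: str) -> list[dict]:
    """Pull extra details (owner, language, stars, forks) for each repository via the API."""
    gh = Github(token)
    rows = []
    for name in full_names:
        repo = gh.get_repo(name)
        rows.append({
            "repo_name": repo.name,
            "owner": repo.owner.login,
            "language": repo.language,
            "stars": repo.stargazers_count,
            "forks": repo.forks_count,
        })
    return rows
```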

Schema Design and Database Management:

Challenge: Designing a star schema to organize data effectively and managing primary and foreign keys.

Solution: I carefully designed the schema with proper relationships between dimensions and the fact table, ensuring data integrity.
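
For illustration, here is a trimmed-down version of what the star-schema DDL could look like, created from Python with psycopg2. The table and column names are assumptions based on the dimensions described above (the actual schema also includes a rank dimension), and the connection string is a placeholder.

```python
import psycopg2

# Illustrative subset of the star-schema DDL; names are assumptions.
DDL = """
CREATE TABLE IF NOT EXISTS dim_user (
    user_id   SERIAL PRIMARY KEY,
    username  TEXT UNIQUE NOT NULL,
    last_seen TIMESTAMP
);

CREATE TABLE IF NOT EXISTS dim_repo (
    repo_id   SERIAL PRIMARY KEY,
    repo_name TEXT NOT NULL,
    language  TEXT
);

CREATE TABLE IF NOT EXISTS dim_time (
    time_id      SERIAL PRIMARY KEY,
    extracted_at TIMESTAMP NOT NULL,
    trend_period TEXT  -- 'daily', 'weekly', or 'monthly'
);

CREATE TABLE IF NOT EXISTS fact_repository (
    fact_id       SERIAL PRIMARY KEY,
    user_id       INT REFERENCES dim_user(user_id),
    repo_id       INT REFERENCES dim_repo(repo_id),
    time_id       INT REFERENCES dim_time(time_id),
    stars         INT,
    forks         INT,
    contributions INT
);
"""

with psycopg2.connect("dbname=github_trending user=postgres") as conn:  # placeholder DSN
    with conn.cursor() as cur:
        cur.execute(DDL)
```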

Trigger and Function Implementation:

Challenge: Implementing triggers and functions to update user information while avoiding recursive loops.

Solution: I crafted triggers and functions in PostgreSQL to ensure smooth updates in the user dimension while maintaining control over the insertion process.
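
As a simplified illustration of the idea (not the project's exact triggers): modifying `NEW` inside a `BEFORE` trigger, instead of issuing a second `UPDATE` on the same table, is one way to keep a trigger from firing itself recursively. The function and column names below are assumptions.

```python
import psycopg2

# Illustrative trigger/function pair (names are assumptions). Assigning to NEW
# in a BEFORE trigger avoids issuing another UPDATE on dim_user, which is what
# would otherwise cause a recursive trigger loop.
TRIGGER_SQL = """
CREATE OR REPLACE FUNCTION touch_user_dimension() RETURNS trigger AS $$
BEGIN
    NEW.last_seen := NOW();   -- refresh bookkeeping directly on the incoming row
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

DROP TRIGGER IF EXISTS trg_touch_user ON dim_user;
CREATE TRIGGER trg_touch_user
    BEFORE INSERT OR UPDATE ON dim_user
    FOR EACH ROW
    EXECUTE FUNCTION touch_user_dimension();  -- PostgreSQL 11+ syntax
"""

with psycopg2.connect("dbname=github_trending user=postgres") as conn:  # placeholder DSN
    with conn.cursor() as cur:
        cur.execute(TRIGGER_SQL)
```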

Ranking Algorithm Development:

Challenge: Developing an algorithm to rank repositories based on stars, forks, and contributions.

Solution: I created a Python function to calculate rank values using customizable weights and implemented the function in the PostgreSQL database.
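
A minimal sketch of the ranking idea, assuming a simple weighted sum of stars, forks, and contributions (the weights shown are illustrative defaults; the project treats them as customizable):

```python
# Weighted-sum ranking sketch; the default weights are illustrative only.
def repo_rank(stars: int, forks: int, contributions: int,
              w_stars: float = 0.5, w_forks: float = 0.3, w_contrib: float = 0.2) -> float:
    """Return a score used to order repositories; a higher score means a higher rank."""
    return w_stars * stars + w_forks * forks + w_contrib * contributions

# Example: repo_rank(stars=1200, forks=300, contributions=45) -> 699.0
```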

Tools and Technologies

  1. Python (pandas, psycopg2, PyGithub, BeautifulSoup).
  2. PostgreSQL.

Star Schema Diagram

Workflow Diagram (Local)

Workflow Diagram (Cloud)

Important Links

  1. https://github.com/trending?since=daily
  2. https://github.com/trending?since=weekly
  3. https://github.com/trending?since=monthly

About

In this project I built an ETL pipeline that scrapes GitHub's trending repositories live (daily, weekly, and monthly), extracts related information using the GitHub API, transforms it into a star schema, loads it into PostgreSQL, and analyzes it in Power BI. Deployment repo link:
