seokyim8/Steam_data_pipeline

Creator: Seok Yim (Noah)

Do you want to be the PIONEER of soon-to-POP-OFF games? Then you're gonna like this...

Title: Steam Data Pipeline

Project Summary:

A data pipeline that regularly scrapes, cleans, stores, and publishes data for newly released games on Steam. The data visualization is taken care of by Apache Superset (publicly accessible).

*** Preview ***
DASHBOARD
Website link:
http://18.212.126.33:8080/superset/dashboard/1/?standalone=3&show_filters=1

Authentication for anonymous users (Anyone can view it with these credentials):
ID: public
password: public

Description:

I frequently saw websites and projects with Steam-related data for popular (top 100) games, but never one focused primarily on new releases on Steam. So I decided to make one myself.

Technologies Used:

  • Python, MySQL, AWS (EC2, RDS), Docker, Scrapy, Apache Superset, Selenium

Steps Taken:

  1. Created a Scrapy project that scrapes data from the official Steam website (https://store.steampowered.com/search/?sort_by=Released_DESC&supportedlang=english).
  2. Added Selenium to deal with infinite scrolling. Created a Python scheduler with APScheduler along with Python's asyncio.
  3. Launched an EC2 instance to keep the program running and an RDS instance to host the MySQL database.
  4. Created a Docker image that installs the Python dependencies along with the Chrome browser.
  5. On EC2, initialized the containerized project along with the containerized Apache Superset image.
  6. Made the dashboard publicly available.
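
For step 2, here is a minimal sketch of how an infinite-scroll page can be loaded until no new results appear. It is not the project's actual code: the function name `scroll_until_stable` and its parameters are hypothetical, and the page is abstracted behind two callables so the loop can be read (and tried) without a browser.

```python
import time
from typing import Callable


def scroll_until_stable(
    get_height: Callable[[], int],
    scroll_to_bottom: Callable[[], None],
    pause_seconds: float = 1.0,
    max_rounds: int = 50,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Scroll until the page height stops growing or max_rounds is reached.

    Returns the number of scrolls that loaded new content.
    All names here are illustrative, not taken from the project.
    """
    last_height = get_height()
    for round_number in range(1, max_rounds + 1):
        scroll_to_bottom()
        sleep(pause_seconds)  # give the page time to fetch the next batch
        new_height = get_height()
        if new_height == last_height:
            # Nothing new was loaded, so the end of the list was reached.
            return round_number - 1
        last_height = new_height
    return max_rounds


# With Selenium, the two callables could be wired up roughly like this
# (assumes `driver` is an already-created WebDriver instance):
#
#   get_height = lambda: driver.execute_script(
#       "return document.body.scrollHeight")
#   scroll_to_bottom = lambda: driver.execute_script(
#       "window.scrollTo(0, document.body.scrollHeight);")
#   scroll_until_stable(get_height, scroll_to_bottom)
```

The `max_rounds` cap is there so a page that never stops growing cannot keep the scraper busy forever; the right pause and cap depend on how quickly the page responds.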

Final Product:

- A dashboard/BI tool that updates every day at 7:30 am EST (with a couple of extra updates during the day) with 1,000 entries from Steam.
- Contains visualizations of the data that help users understand the latest trends in games.
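
To illustrate the daily 7:30 am update, here is a small standard-library sketch that computes when the next run is due. It is only an illustration, not the project's scheduler (which uses APScheduler): the function name `next_daily_run` is hypothetical, and "EST" is taken literally as a fixed UTC-5 offset, which is an assumption (a deployment that follows daylight saving time would need a real time zone database instead).

```python
from datetime import datetime, timedelta, timezone

# Assumption: "EST" is read literally as a fixed UTC-5 offset (no DST).
EST = timezone(timedelta(hours=-5), "EST")


def next_daily_run(now: datetime, hour: int = 7, minute: int = 30) -> datetime:
    """Return the next time the daily job is due, as an EST datetime.

    `now` must be timezone-aware. The name and defaults are illustrative.
    """
    local_now = now.astimezone(EST)
    candidate = local_now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= local_now:
        # Today's slot has already passed, so schedule for tomorrow.
        candidate += timedelta(days=1)
    return candidate


if __name__ == "__main__":
    print(next_daily_run(datetime.now(timezone.utc)))
```

The extra updates during the day could be handled the same way by calling the function with other `hour` and `minute` values and taking the earliest result.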