
Filtering and analyzing Twitter users across 500M tweets using PySpark on Google Cloud Platform


tahonick/Big-Data-Twitter-Profile-Analysis-500M-Tweets


Big Data Analysis: Using PySpark and Google Cloud Platform to Find Relevant Twitterers and Profile Users

The project objective was to identify the profiles of Twitterers who tweet about the University of Chicago and compare them to the profiles of Twitterers who tweet about other universities. Ultimately, the goal is to make actionable business recommendations that help the University improve its social media outreach programs.

A Twitterer is a person who posts on Twitter, that is, a Twitter user: https://www.merriam-webster.com/dictionary/twitterer

Project sub-goals:

  • Identify tweets related to UChicago and 3-4 universities of your choice
  • Discard irrelevant tweets (95%+ of the data)
  • Complete a thorough EDA to identify which variables can be used to profile the Twitterers (the JSON structure is very sparse)
  • Identify the most prolific / influential Twitterers (by message volume and by retweet count)
  • Do you see any relationship between university locations and Twitterers’ locations?
  • What distinguishes University of Chicago Twitterers from Twitterers who tweet about other universities?
  • What are the timelines of these tweets? Do you see significant peaks and valleys?
  • How unique are the messages for each of these universities?
  • Are they mostly unique, or are people mostly copy-pasting the same text? (measured with one of the following: Jaccard similarity / cosine similarity / SimHash / MinHash)
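
The message-uniqueness question can be illustrated with a minimal sketch of Jaccard similarity over word shingles. This is not the code used in the project; it is a plain-Python illustration, and the function names (`shingles`, `jaccard`, `near_duplicate_share`), the 3-word shingle size, and the 0.8 threshold are all assumptions chosen for the example.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased, tokenized text."""
    # Hypothetical tokenizer: keeps letters, digits, '#', '@' and apostrophes.
    tokens = re.findall(r"[a-z0-9#@']+", text.lower())
    if len(tokens) < k:
        # Very short texts become a single shingle (or an empty set).
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicate_share(texts, threshold=0.8, k=3):
    """Fraction of texts with at least one other text at or above the threshold."""
    sets = [shingles(t, k) for t in texts]
    flagged = 0
    for i, s in enumerate(sets):
        # Pairwise comparison: quadratic, so only suitable for small samples.
        if any(i != j and jaccard(s, other) >= threshold
               for j, other in enumerate(sets)):
            flagged += 1
    return flagged / len(texts) if texts else 0.0


if __name__ == "__main__":
    sample = [
        "Go Maroons! UChicago wins the game tonight",
        "go maroons! uchicago wins the game tonight",
        "Applying to grad school at UChicago this fall",
    ]
    print(near_duplicate_share(sample))
```

The pairwise loop is quadratic in the number of tweets, so it only makes sense on a small, already-filtered sample. At the scale described here, an approximate method such as MinHash (for example `pyspark.ml.feature.MinHashLSH`) would likely be the more practical choice.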

For findings, please see the PowerPoint document in the repository.
