Big Data Analysis: Using PySpark and Google Cloud Platform to Find Relevant Twitterers and Profile Users

The project objective was to identify the profiles of Twitterers who tweet about the University of Chicago and compare them with the profiles of Twitterers who tweet about other universities. Ultimately, the goal is to make actionable business recommendations that help the University improve its social media outreach programs.

A Twitterer is a person who posts on Twitter, i.e. a Twitter user: https://www.merriam-webster.com/dictionary/twitterer

Project sub-goals:

Identify tweets related to UChicago and 3-4 universities of your choice
Discard irrelevant tweets (95%+ of the data)
Complete a thorough EDA to identify which variables can be used to profile the Twitterers (the JSON structure is very sparse)
Identify the most prolific / influential Twitterers (by message volume and by retweet count)
Do you see any relationship between university locations and Twitterers’ locations?
What distinguishes University of Chicago Twitterers from Twitterers who tweet about other universities?
What are the timelines of these tweets? Do you see significant peaks and valleys?
How unique are the messages for each of these universities?
Are they mostly unique, or are people mostly copy-pasting the same text? (Use one of the following: Jaccard similarity / cosine similarity / SimHash / MinHash)
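
The filtering and uniqueness sub-goals above can be sketched in plain Python. This is only a minimal illustration, not the notebooks' actual PySpark code: the keyword patterns, the sample tweets, the 0.8 threshold, and the helper names (`match_university`, `jaccard`, `near_duplicate_share`) are all assumptions made for the example.

```python
import re
from itertools import combinations

# Hypothetical keyword patterns; the project's real filter terms are not
# listed in this README.
UNIVERSITY_PATTERNS = {
    "UChicago": re.compile(r"\buchicago\b|\buniversity of chicago\b", re.IGNORECASE),
}


def match_university(text):
    """Return the label of the first pattern that matches, or None if irrelevant."""
    for label, pattern in UNIVERSITY_PATTERNS.items():
        if pattern.search(text):
            return label
    return None


def word_set(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| over the word sets of two texts."""
    sa, sb = word_set(a), word_set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def near_duplicate_share(texts, threshold=0.8):
    """Fraction of texts with at least one other text at or above the threshold."""
    flagged = set()
    for (i, a), (j, b) in combinations(enumerate(texts), 2):
        if jaccard(a, b) >= threshold:
            flagged.update((i, j))
    return len(flagged) / len(texts) if texts else 0.0


# Made-up sample messages, for illustration only.
tweets = [
    "Excited to visit UChicago this fall!",
    "excited to visit uchicago this fall",
    "Best pizza in town tonight",
    "University of Chicago researchers publish new study",
]

# Discard tweets that mention none of the tracked universities.
relevant = [t for t in tweets if match_university(t)]
print(len(relevant), near_duplicate_share(relevant))
```

The pairwise comparison is quadratic in the number of messages, so it is only practical on small samples. At the scale of the full data set, an approximate method such as MinHash or SimHash would be the more likely choice.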

For findings, please see the PowerPoint document in the repository.

Repository files:

BDP Final Project Stage 1 and 3_ETL_EDA_Filter_Export_Similarity.ipynb
BDP Final Project Stage 2_Analysis.ipynb
Big Data Tweeter Analysis.pdf
Combine and Clean files.yxmd
README.md

tahonick/Big-Data-Twitter-Profile-Analysis-500M-Tweets