Twitter Feed Analysis using Spark with Hadoop

An academic project as a part of course, "Principles of Big Data Management", to develop a system to store, process, analyse, and visualize Twitter’s data using Apache Spark

Phase 1: Hadoop and Spark MapReduce word count of URLs and hashtags from tweets collected through the Twitter API using twarc.

Documentation

Tweet Collection & Extraction of URLs and Hashtags

1. A Python script collects tweets through the Twitter API and then extracts URLs and hashtags from them (a minimal sketch of such a script is shown after this list).
2. It asks the user for keyword(s), searches for the corresponding tweets, and stores them in a JSON file.
3. The twarc 'search' command is used to collect the matching tweets, with a timeout of 15 minutes (i.e., collection
    is stopped if the search has not completed within 15 minutes).
4. The collected tweets are stored in a JSON file, 'tweets_keywords'.
5. URLs and hashtags are extracted from 'tweets_keywords' into a text file, 'twitter_out.txt':
    i.   While reading tweets, empty tweets are ignored; only tweets containing at least one URL or one hashtag are extracted.
    ii.  A tweet's url entity can contain several kinds of URLs; only the main URL is written to the text file and
         the remaining kinds are ignored.
    iii. Similarly, all hashtags of each tweet are written to the output text file.
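
The following is a minimal sketch of how such a collection-and-extraction script might look. The API credentials, the output file names, and the exact entity fields used are assumptions for illustration (the 15-minute timeout is omitted for brevity); the actual twitter_extraction.py may differ.

    # Sketch of tweet collection and URL/hashtag extraction (assumptions noted above).
    import json
    from twarc import Twarc

    # Placeholder credentials for the Twitter API.
    twarc = Twarc("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

    keywords = input("Enter keyword(s) to search: ")

    # Collect matching tweets and store them as JSON lines.
    with open("tweets_keywords", "w") as tweets_file:
        for tweet in twarc.search(keywords):
            tweets_file.write(json.dumps(tweet) + "\n")

    # Extract the main URL and all hashtags from each non-empty tweet.
    with open("tweets_keywords") as tweets_file, open("twitter_out.txt", "w") as out:
        for line in tweets_file:
            if not line.strip():
                continue                          # ignore empty tweets
            tweet = json.loads(line)
            entities = tweet.get("entities", {})
            urls = [u["url"] for u in entities.get("urls", []) if u.get("url")]
            hashtags = [h["text"] for h in entities.get("hashtags", [])]
            if not urls and not hashtags:
                continue                          # keep only tweets with a URL or hashtag
            for token in urls + hashtags:
                out.write(token + "\n")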

Deployment (Step-by-Step Execution)

1. Extract URLs and hashtags from the collected tweets for the given user keyword(s):
    python twitter_extraction.py
2. Move the extracted text file 'twitter_out.txt' from the local folder to an HDFS folder:
    $HADOOP_HOME/bin/hdfs dfs -put '/local/path/twitter_out.txt' /your_hdfs_folder 
3. Run the Hadoop MapReduce WordCount on 'twitter_out.txt' in 'your_hdfs_folder' and place the generated output under 'output':
    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.1.jar wordcount /your_hdfs_folder/twitter_out.txt /your_hdfs_folder/output
4. The generated output can be found in the HDFS folder named 'output'.
5. Run the Spark word count example on 'twitter_out.txt' and redirect the word-count output (with Spark's INFO log lines filtered out) into a Spark_Output.txt file stored in the local folder (a PySpark equivalent is sketched after this list):
    $SPARK_HOME/bin/run-example JavaWordCount /your_hdfs_folder/twitter_out.txt | grep -v INFO >> Spark_Output.txt
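
As an alternative to the bundled JavaWordCount example, the same word count can be expressed directly in PySpark. The sketch below is an illustration rather than part of the original deployment; the HDFS input path and the local output file name are assumptions carried over from the steps above.

    # wordcount.py -- PySpark sketch equivalent to the JavaWordCount example.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("TwitterWordCount").getOrCreate()
    lines = spark.sparkContext.textFile("/your_hdfs_folder/twitter_out.txt")

    counts = (lines.flatMap(lambda line: line.split())  # split each line into tokens
                   .map(lambda word: (word, 1))         # pair each token with a count of 1
                   .reduceByKey(lambda a, b: a + b))    # sum counts per token

    # Write "word: count" lines to a local file, mirroring Spark_Output.txt.
    with open("Spark_Output.txt", "w") as out:
        for word, count in counts.collect():
            out.write("%s: %d\n" % (word, count))

    spark.stop()

It can be submitted with $SPARK_HOME/bin/spark-submit wordcount.py, which avoids piping and filtering the example program's console output.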

Authors
