
HN/Reddit social news website big data analytics with Hadoop, Hive, and MapReduce. Discover correlations between a website's content and the number of votes it receives on social news sites such as Reddit and Hacker News.


mutaphore/social-news-bigdata


CONTENTS
--------

/crawler
Python Scrapy web crawler responsible for crawling URLs retrieved from HN/Reddit
and saving the content of those web pages, as well as various metadata.
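
A minimal sketch of what such a spider could look like. The item fields, spider name, and the seed file name are assumptions for illustration, not taken from the crawler's actual code:

    import scrapy

    # Sketch of a spider in the spirit of /crawler: fetch each seed URL and
    # record simple page metadata useful for later correlation analysis.
    class PageSpider(scrapy.Spider):
        name = "page_spider"

        def start_requests(self):
            # Seed file name is hypothetical.
            with open("urls.txt") as f:
                for url in f:
                    url = url.strip()
                    if url:
                        yield scrapy.Request(url, callback=self.parse)

        def parse(self, response):
            yield {
                "url": response.url,
                "title": (response.css("title::text").get() or "").strip(),
                "num_links": len(response.css("a")),
                "num_scripts": len(response.css("script")),
                "num_images": len(response.css("img")),
                "num_styles": len(response.css("link[rel=stylesheet]")),
            }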

/datumbox
Machine learning framework from datumbox.com. The framework can be used to
perform many types of analysis, such as sentiment analysis, topic
classification, and keyword extraction, on a large dataset.
Our project is not currently using this framework; however, it can serve as
an additional analysis toolbox later if we want to do a more in-depth study.

/hackernews
hackernews_api.py - makes requests to the HN API and downloads all the data from its RESTful service (see the sketch after this list).
extract_url_from_csv.py - extracts URLs from the CSV file generated by the HN API. Used for testing only.
get_urls.py - extracts URLs but also checks whether they are valid by making a request. Also for testing.
urls.txt - example of some URLs retrieved
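
A minimal sketch of the downloader idea, using the public HN Firebase API at hacker-news.firebaseio.com. The selected fields, story limit, and output file name are illustrative assumptions, not the exact behavior of hackernews_api.py:

    import csv
    import requests

    HN_API = "https://hacker-news.firebaseio.com/v0"

    def fetch_item(item_id):
        """Fetch a single HN item (story, comment, etc.) as a dict."""
        resp = requests.get(f"{HN_API}/item/{item_id}.json", timeout=10)
        resp.raise_for_status()
        return resp.json()

    def dump_top_stories(out_path="hn_items.csv", limit=100):
        """Download the current top stories and write selected fields to CSV."""
        ids = requests.get(f"{HN_API}/topstories.json", timeout=10).json()[:limit]
        with open(out_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["id", "title", "url", "score", "descendants", "time"])
            for item_id in ids:
                item = fetch_item(item_id) or {}  # deleted items come back as null
                writer.writerow([
                    item.get("id"), item.get("title"), item.get("url"),
                    item.get("score"), item.get("descendants"), item.get("time"),
                ])

    if __name__ == "__main__":
        dump_top_stories()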


/hive
process_hn_api_output.hql - cleans the HN API output file to remove commas, NULL fields, etc. for easier processing later
process_hn_crawl_output.hql - cleans the HN crawl output file to remove commas, NULL fields, etc. for easier processing later
filter_hn_api_fields.py - contains the transform function used by process_hn_api_output.hql
filter_hn_crawl_fields.py - contains the transform function used by process_hn_crawl_output.hql
join_output_and_crawl.hql - joins the HN API dataset with the crawled dataset so we can get further correlation info
get_reddit_output.hql - cleans and exports the reddit dataset to a local directory
remove_fields_comma.py - a UDF used by Hive to remove commas from the title field so that the data can be processed by the MapReduce jobs (see the sketch after this list)
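
A sketch of how a Hive TRANSFORM script of this kind usually works: it reads tab-separated rows on stdin, rewrites one column, and prints the rows back out. The column position below is a hypothetical placeholder, not the layout used by the real scripts:

    import sys

    # Position of the title field in the tab-separated input; assumed for
    # illustration only.
    TITLE_COL = 1

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > TITLE_COL:
            # Strip commas so the row survives later comma-separated parsing.
            fields[TITLE_COL] = fields[TITLE_COL].replace(",", " ")
        print("\t".join(fields))

Inside an .hql file, a script like this is typically wired in with Hive's SELECT TRANSFORM (...) USING 'python remove_fields_comma.py' AS (...) clause.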

/mapreduce
hn_num_comments - MapReduce program to get the relationship between the number of comments received on a post and the average score received
hn_page_content - MapReduce program to get the relationship between the number of links, scripts, images, and styles and the average score received
reddit_post_type - MapReduce program to get the relationship between the subreddit of a post on Reddit and the average score received
reddit_post_hour - MapReduce program to get the relationship between the hour of day and the average votes received (a streaming-style sketch follows below)
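
To illustrate the pattern these jobs share (group by some key, average the score), here is a hedged Hadoop Streaming sketch of the reddit_post_hour idea in Python. The actual programs in this directory may be plain Java MapReduce, and the input column positions are assumptions:

    import sys

    def mapper():
        """Emit: hour_of_day \t score  for every post."""
        from datetime import datetime, timezone
        for line in sys.stdin:
            fields = line.rstrip("\n").split("\t")
            try:
                created_utc, score = float(fields[0]), int(fields[1])
            except (ValueError, IndexError):
                continue  # skip malformed rows
            hour = datetime.fromtimestamp(created_utc, tz=timezone.utc).hour
            print(f"{hour}\t{score}")

    def reducer():
        """Average the scores for each hour (input arrives sorted by key)."""
        current_hour, total, count = None, 0, 0
        for line in sys.stdin:
            hour, score = line.rstrip("\n").split("\t")
            if hour != current_hour:
                if current_hour is not None:
                    print(f"{current_hour}\t{total / count:.2f}")
                current_hour, total, count = hour, 0, 0
            total += int(score)
            count += 1
        if current_hour is not None:
            print(f"{current_hour}\t{total / count:.2f}")

    if __name__ == "__main__":
        # e.g. hadoop streaming -mapper "python avg_score.py map" \
        #                       -reducer "python avg_score.py reduce"
        if len(sys.argv) > 1 and sys.argv[1] == "map":
            mapper()
        else:
            reducer()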

/pig
filter.pig - filters the HN output files to keep only the "story" type URLs.


/reddit
cleaned_data - the final reddit dataset, ready to be consumed by the reddit_post_hour program
getposts.py - deprecated
output.csv - raw reddit post data collected by calling the reddit API from the reddit.py script
post_datastructure.txt - descriptor for the reddit object returned by the reddit API
reddit.py - python script that calls the reddit API to get the raw reddit dataset (output.csv); a sketch follows below
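
A minimal sketch of the reddit.py idea, using Reddit's public JSON listing endpoint. The subreddit, selected columns, and file name are illustrative assumptions rather than the real script's behavior:

    import csv
    import requests

    # Reddit asks API clients to send a descriptive User-Agent.
    HEADERS = {"User-Agent": "social-news-bigdata research script"}

    def fetch_posts(subreddit="all", limit=100):
        """Fetch recent posts from a subreddit's public JSON listing."""
        url = f"https://www.reddit.com/r/{subreddit}/new.json"
        resp = requests.get(url, headers=HEADERS, params={"limit": limit}, timeout=10)
        resp.raise_for_status()
        return [child["data"] for child in resp.json()["data"]["children"]]

    def write_csv(posts, out_path="output.csv"):
        """Write the fields we care about for later Hive/MapReduce processing."""
        with open(out_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["subreddit", "title", "url", "score",
                             "num_comments", "created_utc"])
            for p in posts:
                writer.writerow([p.get("subreddit"), p.get("title"), p.get("url"),
                                 p.get("score"), p.get("num_comments"),
                                 p.get("created_utc")])

    if __name__ == "__main__":
        write_csv(fetch_posts())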
