Predicting popularity of a Reddit post

What is Reddit?

Reddit is a social sharing website where users can post links, pictures, and text, and other users can upvote or downvote a particular post depending on whether they like it. If a post gets a high score, it moves up so that it is visible to everyone. Reddit is a huge site, but it is divided into thousands of smaller communities called subreddits. In this project, posts from the "Popular" subreddit were used to prepare the dataset.

About the Project:

This project is about predicting the popularity of a Reddit post. The popularity of a post is determined by its score, which is the number of upvotes minus the number of downvotes. Because the score is a numeric value, the task was treated as a regression problem.

About the Dataset:

The dataset for this project was created by web scraping with the help of the praw library. The features extracted are:

Title - title of the post
Gilded - number of times the post was awarded Reddit gold
Over_18 - True if the post has adult content, else False
Ups - number of upvotes for the post
Downs - number of downvotes for the post
Num_of_comments - number of comments on the post
Upvote_ratio - ratio of upvotes to total votes on the post
Score - total score (upvotes - downvotes)
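
The features listed above map closely onto attributes of a praw Submission object. The following is a minimal sketch of how such rows could be collected, not the project's actual scraping code: the helper names (submission_to_row, fetch_rows), the use of the "hot" listing, and the limit are all assumptions made for illustration. The stand-in object at the end only shows the shape of one row, and its values are made up.

```python
from types import SimpleNamespace

# Column name used in this README -> attribute name on a praw Submission.
FEATURE_ATTRS = {
    "Title": "title",
    "Gilded": "gilded",
    "Over_18": "over_18",
    "Ups": "ups",
    "Downs": "downs",
    "Num_of_comments": "num_comments",
    "Upvote_ratio": "upvote_ratio",
    "Score": "score",
}


def submission_to_row(submission):
    """Turn one submission-like object into a dict keyed by dataset column."""
    return {col: getattr(submission, attr) for col, attr in FEATURE_ATTRS.items()}


def fetch_rows(client_id, client_secret, user_agent, limit=100):
    """Collect rows from r/popular (needs network access and API credentials).

    Hypothetical helper: the listing ("hot") and the limit are assumptions.
    """
    import praw  # imported lazily so submission_to_row works without praw

    reddit = praw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent=user_agent,
    )
    listing = reddit.subreddit("popular").hot(limit=limit)
    return [submission_to_row(s) for s in listing]


# Offline illustration with a stand-in object (values are made up).
fake = SimpleNamespace(
    title="Example post", gilded=0, over_18=False, ups=10,
    downs=0, num_comments=3, upvote_ratio=0.95, score=10,
)
row = submission_to_row(fake)
```

A list of such rows can be passed to pandas.DataFrame and written to a CSV file.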

Google Drive link for dataset (ScrappedPostsData.csv)

Sentiment Analysis:

Extra features were derived from the title of each post using the vaderSentiment analyzer. It produces four scores: neg, neu, pos and compound, which indicate how negative or positive the text is. These were combined into a single column, Predicted_value, based on the compound score:

positive sentiment: compound score >= 0.05
neutral sentiment: -0.05 < compound score < 0.05
negative sentiment: compound score <= -0.05
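
The thresholds above can be written as a small function. This is a sketch rather than the project's code: the function names and the label strings are assumptions. The call to SentimentIntensityAnalyzer().polarity_scores follows the vaderSentiment package's documented usage and returns a dict with the keys neg, neu, pos and compound.

```python
def label_sentiment(compound):
    """Map a VADER compound score to a sentiment label using the thresholds above."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"


def title_sentiment(title):
    """Score a title with VADER and label it (requires vaderSentiment)."""
    # Imported lazily so label_sentiment can be used without the package.
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    scores = SentimentIntensityAnalyzer().polarity_scores(title)
    return label_sentiment(scores["compound"])
```

When scoring many titles, it is cheaper to create the analyzer once and reuse it than to build it on every call as this sketch does.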

Text preprocessing:

The title of each post was preprocessed by removing punctuation and stop words, and by applying stemming and lemmatization. To convert the title to numeric form, GloVe embeddings were used, which give a 100-dimensional numeric vector for each unique word in the text. To get a single vector for each title, the word vectors were averaged: the mean of all word vectors in a title is taken as the representation of that title. One-hot encoding was applied to the Predicted_value and Over_18 columns.
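
The averaging step can be sketched as follows. This is an illustration under assumptions, not the project's code: the function name, skipping words that have no embedding, and returning a zero vector for a title with no known words are all choices made here. The toy embeddings use 2 dimensions instead of 100 to keep the example short.

```python
import numpy as np


def title_vector(tokens, embeddings, dim=100):
    """Average the word vectors of a title's tokens into one vector.

    tokens: list of preprocessed words from the title.
    embeddings: dict mapping a word to a NumPy vector of length dim.
    Words missing from embeddings are skipped (an assumption).
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # No known words: fall back to a zero vector (an assumption).
        return np.zeros(dim)
    return np.mean(vectors, axis=0)


# Toy 2-dimensional embeddings for illustration only.
toy = {"cat": np.array([1.0, 3.0]), "dog": np.array([3.0, 5.0])}
vec = title_vector(["cat", "dog", "unknown"], toy, dim=2)
```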

Machine Learning:

As this is a regression problem, regression models were used: Linear regression, Decision tree regressor, Random forest regressor, KNN regressor, Lasso, Ridge, ElasticNet and XGBoost regressor. The models were trained on 60% of the data, and predictions were made on the remaining 40% held out as test data. Model performance was measured by test accuracy. Of these models, the XGBoost regressor and the Random forest regressor performed best, with around 50% accuracy on the test dataset.
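
The 60/40 split and evaluation can be sketched with scikit-learn. Everything here is illustrative: the data is synthetic, the hyperparameters are arbitrary, and the metric is an assumption. The text above does not name the metric behind "accuracy", so this sketch uses R², which is what the score method of scikit-learn regressors returns by default.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature matrix and score column.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

# 60% of the rows for training, 40% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# R^2 on the held-out data (assumed metric, see the note above).
test_r2 = r2_score(y_test, model.predict(X_test))
```

The other models listed above follow the same fit and predict pattern, so they can be swapped in for RandomForestRegressor without changing the rest of the sketch.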

Deployment:

The application was deployed on Heroku. It takes a Reddit post URL as input and extracts the required features from that URL. The deployed application was tested with different Reddit post URLs. As the accuracy of the model is around 50%, the predictions differed somewhat from the expected values. In future work, more data can be used to train the model and improve its accuracy.

Link to Deployed Application : https://reddit-post-score.herokuapp.com/

Installing required libraries:

Installing keras and tensorflow:

pip install keras==2.2.4
pip install tensorflow==2.0

Downloading nltk data (run in Python, after pip install nltk):

import nltk
nltk.download('popular')

Installing xgboost:

pip install xgboost

Installing vaderSentiment:

pip install vaderSentiment

Installing praw:

pip install praw


alinasahoo/Reddit-Post-Scores
