
E4571 Personalisation Theory Class Project -- Fall 2017


Report for Part 2 of the project can be found in Part2/report/final_project_report.pdf.

Team Members:

| Name | GitHub | UNI |
| --- | --- | --- |
| Tejas Dharamsi | https://github.com/Dharamsitejas | td2520 |
| Abhay S Pawar | https://github.com/abhayspawar | asp2197 |
| Janak A Jain | https://github.com/janakajain | jaj2186 |
| Vijayraghavan Balaji | https://github.com/vijaybalaji30 | vb2428 |


Steps to run the code
  • Clone/download the repository
  • Install the dependencies: pip3 install -r requirements.txt
  • Move to the folder Part2/analysis
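The steps above, as a shell sketch (assuming git, Python 3 and Jupyter are available; launching `jupyter notebook` is our assumption about how the notebooks in Part2/analysis are run):

```shell
# Clone the repository and enter it
git clone https://github.com/tjdharamsi/E4571-Personalisation-Theory-Project.git
cd E4571-Personalisation-Theory-Project

# Install the Python dependencies
pip3 install -r requirements.txt

# Move to the Part 2 analysis folder and open the notebooks
cd Part2/analysis
jupyter notebook
```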

Report for Part 1 of the project can be found in Part1/documents/report_part1.pdf

Note: The main file containing the code for Part 1 is CF-Data.ipynb

File Structure

Top

  • Part2

    • analysis

      • DatasetCreation_Benchmark_ContentBased.ipynb: contains the code for combination of dataset, Naïve baseline model, item-item collaborative filtering model and content based model
  • Hybrid.ipynb: contains the code for the Hybrid Model (LSH + Content-based) and validates serendipity for books recommended by our best model, LSH
      • LSH_Complete.ipynb: contains the code for LSH model
      • book_features.ipynb: contains the code for generating word2vec features for books
  • feature_extraction_from_api.ipynb: contains the code to get book metadata from the Goodreads API using the book ISBN
      • tree_based_ann.ipynb: contains the code for Tree Based ANN model
    • created_datasets

      • Combine.csv : contains the combined dataset of BX and Amazon dataset
      • book_features.csv: contains the data with features generated using word2vec
  • ibsn_features_new_batch.pickle: contains the data with features extracted from the Goodreads API and enriched using word2vec
    • figures: Contains Plots generated by our code.

    • raw-data: Contains Book Crossing Dataset, amazon book dataset can be downloaded from here

    • Final_Project_Outline.pdf

  • Part1

  • analysis: CF-Data.ipynb, the main Part 1 file, along with exploratory notebooks
  • clean-data: Contains smaller subset datasets
    • raw-data: Contains book-crossing raw datasets.
    • documents: instructions and report
    • figures: Contains Plots for visualisation
  • License

  • Readme

  • requirements.txt

About the Project

Book Shelf Image
Image Courtesy: WellBuiltStyle.com

The project is part of the course on Personalization Theory and Applications by Prof. Brett Vintch. The aim of this project is to create a recommender system for books that is capable of offering customized recommendations to book readers based on the books they have already read.

Motivation

There is no friend as loyal as a book - Ernest Hemingway

Thanks to Gutenberg and, now, the digital boom, we have access to a huge amount of collective intelligence, wisdom and stories. Indeed, humans perish, but their voices continue to resonate through human brains and minds long after they are gone - sometimes provoking us to think, making us part of revolutions, and sometimes confiding their secrets in us. Books have the ability to make us laugh, cry, think - think hard - and, most importantly, change our lives the way perhaps nobody else can. In this sense, books are truly our loyal friends.

Can the importance of books as loyal friends ever be overestimated? We think not, which is why we believe that creating just the 'right' recommendations for readers is a noble objective. Consider it a quieter (shh... no noise in this library! :)) Facebook, or a classier Tinder, for those who like to read and listen patiently.


Part II - Summary of findings

We have implemented four different types of algorithms from scratch and have compared them with a naïve model. These four models are Tree-based Approximate Nearest Neighbor (ANN), Locality Sensitive Hashing (LSH), item-item collaborative filtering (CF) and a content-based model. We also created a hybrid model that combines LSH and the content-based model.
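As a minimal illustration of the item-item CF approach (a toy sketch with a made-up ratings matrix and k value, not the project's actual notebook code), a rating can be predicted as a similarity-weighted average over the most similar items the user has already rated:

```python
import numpy as np

# Toy user x item ratings matrix (0 = unrated)
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)

def cosine_sim(a, b):
    """Cosine similarity between two item columns, over co-rated entries only."""
    mask = (a > 0) & (b > 0)
    if not mask.any():
        return 0.0
    den = np.linalg.norm(a[mask]) * np.linalg.norm(b[mask])
    return float(np.dot(a[mask], b[mask]) / den) if den else 0.0

def predict(user, item, k=2):
    """Similarity-weighted average over the k most similar items the user rated."""
    sims = [(cosine_sim(R[:, item], R[:, j]), R[user, j])
            for j in range(R.shape[1]) if j != item and R[user, j] > 0]
    sims.sort(reverse=True)
    top = sims[:k]
    den = sum(abs(s) for s, _ in top)
    return sum(s * r for s, r in top) / den if den else 0.0

print(round(predict(0, 2), 2))  # predicted rating for user 0, item 2 -> 2.0
```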

We used five-fold cross-validation for all of our developed models, which helped us in selecting the best model for comparison against the benchmark.

We have evaluated each of the developed models on the following evaluation metrics:

  • Training time
  • RMSE
  • MAE
  • Coverage
  • Novelty

Results:

Comparison of several models on various comparison metrics

| Model Name | Training Time (hours) | Best K | Average Test MAE | Average Test RMSE | Coverage |
| --- | --- | --- | --- | --- | --- |
| Naïve | N/A | N/A | 0.763 | 0.944 | N/A |
| Item-item CF | 4.1 | 15 | 0.553 | 0.759 | 76.0% |
| Tree-based ANN | 1.927 | 20 | 0.55 | 0.76 | |
| LSH | 1.29 | 15 | 0.573 | 0.796 | 65.6% |
| Content-based | 0.6 (approx.) | 25 | 0.593 | 0.8031 | 31.55% |
| Hybrid (LSH + Content) | 1.89 | 15 | 0.5834 | 0.799 | 46.54% |

Tweaking the Hybrid model

After developing the Hybrid model from scratch, the next step was to evaluate it at different values of its hyper-parameter - the distribution of weights over the two underlying models. Given below is a summary of the MAE and RMSE metrics for the Hybrid model for various combinations of these weights.

Performance of the Hybrid model for various weight combinations of the underlying models

| W_LSH | W_Content | MAE | RMSE |
| --- | --- | --- | --- |
| 0.9 | 0.1 | 0.587 | 0.813 |
| 0.8 | 0.2 | 0.585 | 0.806 |
| 0.7 | 0.3 | 0.583 | 0.799 |
| 0.6 | 0.4 | 0.583 | 0.796 |
| 0.5 | 0.5 | 0.583 | 0.792 |
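The hybrid prediction itself is just a weighted average of the two underlying models' predicted ratings. A minimal sketch (the function and argument names are hypothetical, not from the project code):

```python
def hybrid_predict(lsh_pred, content_pred, w_lsh=0.7, w_content=0.3):
    """Blend the LSH and content-based predicted ratings with fixed weights."""
    assert abs(w_lsh + w_content - 1.0) < 1e-9, "weights must sum to 1"
    return w_lsh * lsh_pred + w_content * content_pred

print(hybrid_predict(4.0, 3.0))  # 0.7*4.0 + 0.3*3.0 = 3.7
```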

Interpretations

We selected the Hybrid model with a W_LSH to W_Content weight ratio of 7:3 in order to strike the right blend of coverage and serendipity. However, we observed that even at this level, the coverage of the model was significantly lower than that of the LSH model that we implemented from scratch. Hence, we would recommend using the LSH model for making recommendations.

A special note on Serendipity of the best model

Our best model is LSH - it has MAE and RMSE values comparable to the traditional item-based CF model. Moreover, LSH trains in about a third of the time taken to train the item-based CF model. Another evaluation metric is serendipity, or novelty, of recommendations.

An example recommendation is shown in Figures 8 and 9 in the report. An interesting recommendation that can be observed in Figure 9 is "Don Quixote". It belongs to a genre that is not currently present in the user's repertoire of genres. What's more, Don Quixote is considered one of the most influential works of the Spanish Golden Age.

Upon closer observation, we find that Don Quixote contains several thematic plots and stylistic elements which are very similar to other books that the user has read. Moreover, such a serendipitous result is also likely to be liked by the user given the higher chances of similarity in stylistic and thematic patterns.

Future Scope of Work

In the future, we would like to extend this study to convert our code into a Python package. We invite members of the larger academic community to contribute to this project.


Part I - Summary of findings

We have implemented two different types of algorithms from scratch and have compared them with competitive models available from other packages. These two algorithms are Item-Item Collaborative Filtering and Non-negative Matrix Factorization (NMF).

We implemented our models using two approaches:

  • Collaborative filtering based (Approach 1)
  • Non-negative Matrix Factorization (NMF) based (Approach 2)

We used cross-validation for all of our developed models, which helped us in selecting the best model for comparison against the benchmark.

For both these approaches, we implemented two separate models for this study - one developed from scratch, and one developed using the Surprise library.

Results:

  • For Approach 1, our model performed better than the Surprise model by a significant margin on Average MAE.
  • For Approach 2, our model did not fare as well as the Surprise model.

For each approach, the results are described below for each of the similarity measures, viz. Euclidean distance, cosine distance and Pearson correlation coefficient:
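For reference, the three similarity measures can be written in a few lines of NumPy (a toy illustration on dense vectors, not the project's implementation):

```python
import numpy as np

def euclidean(a, b):
    """Euclidean distance between two rating vectors."""
    return float(np.linalg.norm(a - b))

def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point in the same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pearson(a, b):
    """Pearson correlation coefficient: 1.0 means a perfect linear relation."""
    return float(np.corrcoef(a, b)[0, 1])

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
print(euclidean(a, b))  # sqrt(14) ~ 3.742
print(cosine(a, b))     # ~1.0 (same direction)
print(pearson(a, b))    # ~1.0 (perfect linear relation)
```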

Approach 1: Item-Item Collaborative filtering based

Euclidean distance

| Model Name | Average RMSE | Average MAE |
| --- | --- | --- |
| Our model | 1.54 | 0.96 |
| Surprise | 1.58 | 1.13 |

Cosine similarity

| Model Name | Average RMSE | Average MAE |
| --- | --- | --- |
| Our model | 1.57 | 1.06 |
| Surprise | 1.64 | 1.22 |

Pearson correlation coefficient

| Model Name | Average RMSE | Average MAE |
| --- | --- | --- |
| Our model | 1.53 | 1.01 |
| Surprise | 1.61 | 1.20 |

Approach 2: Non-negative Matrix Factorization

NMF

| Model Name | Average RMSE | Average MAE |
| --- | --- | --- |
| Our model | 2.97 | 2 |
| Surprise | 1.53 | 0.98 |
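As a sketch of the NMF idea (a toy illustration using the classic Lee-Seung multiplicative updates on a made-up matrix; the project's actual implementation may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy non-negative ratings matrix
R = np.array([[5, 3, 0],
              [4, 0, 1],
              [1, 1, 5]], dtype=float)

k = 2  # number of latent factors
W = rng.random((R.shape[0], k))  # user factors
H = rng.random((k, R.shape[1]))  # item factors

# Multiplicative updates minimising the Frobenius error ||R - WH||;
# both factors stay non-negative because every update is a ratio of
# non-negative terms (the small constant avoids division by zero).
for _ in range(500):
    H *= (W.T @ R) / (W.T @ W @ H + 1e-9)
    W *= (R @ H.T) / (W @ H @ H.T + 1e-9)

print(np.round(W @ H, 1))  # low-rank reconstruction of R
```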

Feedback

We look forward to your feedback and comments on this project. Our email IDs follow the rule {UNI}@columbia.edu, where UNI is the code listed above (e.g. 'td2520').