Unexpected RMSE Differences in SVD Models with Almost the Same Training Data #472

Open

Gsj49 opened this issue Dec 22, 2023 · 0 comments

Gsj49 commented Dec 22, 2023

Description

Issue Summary

I am getting significantly different RMSE values when evaluating two SVD models from the Surprise library. The two models have identical configurations, and their training data differ by only one sample: model_full is trained on the entire dataset, while model_cv is trained on the entire dataset minus a single rating.

Steps to Reproduce

  1. Generate artificial datasets train_ratings and test_ratings using a function generate_dataset. The function uses the prediction rule of surprise.prediction_algorithms.SVD to generate the artificial ratings (a minimal sketch of such a generator is given after this list):
    $r_{ui} = \mu + b_u + b_i + q_i^T p_u$
  2. Train two SVD models:
    • model_full on the entire train_ratings.
    • model_cv on train_ratings minus one sample.
  3. Evaluate both models on test_ratings.
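
The generator itself was not included above and is not part of the Surprise library; the following is a minimal, hypothetical sketch of what generate_dataset might look like, assuming it draws random biases and factors, computes $r_{ui} = \mu + b_u + b_i + q_i^T p_u$, clips the result to the rating bounds, and holds out a sparsity_ratio fraction of the entries as the test set. Everything except the parameter names of the call below is an assumption.

python code

import numpy as np
import pandas as pd

def generate_dataset(num_users, num_items, num_factors, global_mean,
                     upper_bound, lower_bound, sparsity_ratio, seed=0):
    """Hypothetical generator: builds ratings from the SVD model
    r_ui = mu + b_u + b_i + q_i^T p_u, then splits them into train and
    test sets according to sparsity_ratio."""
    rng = np.random.default_rng(seed)
    b_u = rng.normal(0, 0.5, num_users)                 # user biases
    b_i = rng.normal(0, 0.5, num_items)                 # item biases
    p = rng.normal(0, 0.3, (num_users, num_factors))    # user factors
    q = rng.normal(0, 0.3, (num_items, num_factors))    # item factors

    # Full rating matrix, clipped to the rating scale.
    r = global_mean + b_u[:, None] + b_i[None, :] + p @ q.T
    r = np.clip(r, lower_bound, upper_bound)

    users, items = np.meshgrid(np.arange(num_users), np.arange(num_items),
                               indexing="ij")
    df = pd.DataFrame({"user_id": users.ravel(),
                       "item_id": items.ravel(),
                       "rating": r.ravel()})

    # Hold out a sparsity_ratio fraction of the entries as the test set.
    df = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_test = int(len(df) * sparsity_ratio)
    test_ratings = df.iloc[:n_test].reset_index(drop=True)
    train_ratings = df.iloc[n_test:].reset_index(drop=True)
    return train_ratings, test_ratings, (b_u, b_i, p, q)

With a generator of this shape in place, the reproduction script is as follows.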

python code

import surprise.model_selection
from surprise import Dataset, Reader, SVD, accuracy

train_ratings, test_ratings, _ = generate_dataset(num_users=400,
                                                  num_items=400,
                                                  num_factors=7,
                                                  global_mean=3.5,
                                                  upper_bound=5,
                                                  lower_bound=1,
                                                  sparsity_ratio=0.8,
                                                  # train_ratings gets 400*400*0.2 of the user-item ratings; test_ratings gets the remaining 400*400*0.8
                                                  seed=0)

# train_ratings, test_ratings are both dataframes that consist of 3 columns: 'user_id', 'item_id', and 'rating'.

testset = [tuple(row) for row in test_ratings.itertuples(index=False)]

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(train_ratings, reader)
trainset_cv, valset_cv = surprise.model_selection.train_test_split(data, test_size=0.0000001)
# valset_cv contains only a single rating
trainset_full = data.build_full_trainset()

model_cv = SVD(n_factors=7, random_state=0, reg_all=0)
model_cv.fit(trainset_cv)
pred_by_cvmodel = model_cv.test(testset)
accuracy.rmse(pred_by_cvmodel, verbose=True)

model_full = SVD(n_factors=7, random_state=0, reg_all=0)
model_full.fit(trainset_full)
pred_by_fullmodel = model_full.test(testset)
accuracy.rmse(pred_by_fullmodel, verbose=True)

output

RMSE: 1.2256
RMSE: 0.6395

The RMSE of model_cv (1.2256) is almost double that of model_full (0.6395), and I cannot figure out why. I have tried other cross-validation iterators such as surprise.model_selection.KFold and observed the same behavior. Is there perhaps a problem with the way the cross-validation iterators handle the training data?
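
As a way to narrow this down, here is a hypothetical sanity check (not part of the original report) that compares the contents and the internal id mappings of the two trainsets; it assumes trainset_cv and trainset_full from the snippet above are still in scope.

python code

# Hypothetical diagnostic, assuming trainset_cv and trainset_full from above.
def raw_ratings(trainset):
    # Map (raw user id, raw item id) -> rating for a Surprise Trainset.
    return {(trainset.to_raw_uid(u), trainset.to_raw_iid(i)): r
            for (u, i, r) in trainset.all_ratings()}

full = raw_ratings(trainset_full)
cv = raw_ratings(trainset_cv)

# Both should contain the same ratings except for the single held-out sample.
print(trainset_full.n_ratings - trainset_cv.n_ratings)   # expected: 1
print(set(full) - set(cv))                                # the held-out (user, item) pair

# train_test_split shuffles the ratings before building the trainset, so the
# raw -> inner id mapping (and hence the SGD update order and the factor rows
# each user/item is initialised with) need not match build_full_trainset().
some_uid = next(iter(cv))[0]
print(trainset_full.to_inner_uid(some_uid), trainset_cv.to_inner_uid(some_uid))

If the inner ids differ, the two fits do not assign the same initial factor rows to the same users and items even with random_state=0, which may account for at least part of the gap when reg_all=0.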

The issue can also be reproduced with the MovieLens 100k dataset instead of simulated data, although there the RMSE difference is much smaller.

python code

import pandas as pd
from sklearn.model_selection import train_test_split

import surprise.model_selection
from surprise import Dataset, Reader, SVD, accuracy

data_file_path = './data/ml-100k/u.data'
ratings = pd.read_csv(data_file_path, sep='\t', names=['user_id', 'item_id', 'rating', 'timestamp'])

train_ratings, test_ratings = train_test_split(ratings.iloc[:, :3], test_size=0.2, random_state=0)

testset = [tuple(row) for row in test_ratings.itertuples(index=False)]

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(train_ratings, reader)
trainset_cv, valset_cv = surprise.model_selection.train_test_split(data, test_size=0.000001)
trainset_full = data.build_full_trainset()

model_cv = SVD(n_factors=100, random_state=0, reg_all=0)
model_cv.fit(trainset_cv)
pred_by_cvmodel = model_cv.test(testset)
accuracy.rmse(pred_by_cvmodel, verbose=True)

model_full = SVD(n_factors=100, random_state=0, reg_all=0)
model_full.fit(trainset_full)
pred_by_fullmodel = model_full.test(testset)
accuracy.rmse(pred_by_fullmodel, verbose=True)

output

RMSE: 0.9550
RMSE: 0.9516

Any suggestions or explanations for this behavior would be greatly appreciated!
