Wrong mapping of the raw IDs to the internal IDs #465

Open
benhaf opened this issue Mar 27, 2023 · 0 comments
benhaf commented Mar 27, 2023

Hi,

Description

The mapping of raw user IDs to internal IDs is incorrect when the dataset contains more than 25,000 rows. I tried reading the ratings both from a file and from a DataFrame, and the user IDs are mapped incorrectly in both cases. I tested several datasets.
In the code below, I save the training set with its internal IDs to SupriseTrainingSet.csv and then compare Train.txt with SupriseTrainingSet.csv.

Steps/Code to Reproduce

from surprise import Dataset, KNNBasic, Reader
import pandas as pd
import csv

train_file = files_dir + folder + "Train.txt"

reader = Reader(line_format="user item rating", sep="\t")

data = Dataset.load_from_file(train_file, reader=reader)

trainset = data.build_full_trainset() #creates the training set from the whole dataset

with open(files_dir + folder + "SupriseTrainingSet.csv", 'w', newline='') as file:
    writer = csv.writer(file)
    # write each (inner user ID, inner item ID, rating) tuple to the CSV file
    for row in trainset.all_ratings():
        writer.writerow(row)

algo = KNNBasic()
algo.fit(trainset)
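
Instead of comparing the two files by eye, the mapping can be checked by converting every internal ID back to its raw ID and comparing the result with the rows of Train.txt. The sketch below assumes the trainset exposes `all_ratings()`, `to_raw_uid()` and `to_raw_iid()` (Surprise's `Trainset` provides these) and that raw IDs loaded from a file are kept as strings. The helper names `read_raw_rows` and `round_trip_mismatches` are hypothetical and not part of Surprise.

```python
import csv
from collections import Counter


def read_raw_rows(path, sep="\t"):
    """Read (user, item, rating) rows from a delimited ratings file without a header."""
    with open(path, newline="") as f:
        return [
            (row[0], row[1], float(row[2]))
            for row in csv.reader(f, delimiter=sep)
            if row  # skip blank lines
        ]


def round_trip_mismatches(trainset, raw_rows):
    """Map inner IDs back to raw IDs and compare with the raw rows.

    Returns two Counters: rows found only in the file, and rows found
    only in the trainset. Both are empty when the mapping is consistent.
    """
    from_file = Counter((str(u), str(i), float(r)) for u, i, r in raw_rows)
    from_trainset = Counter(
        (str(trainset.to_raw_uid(u)), str(trainset.to_raw_iid(i)), float(r))
        for u, i, r in trainset.all_ratings()
    )
    # Counter subtraction keeps only the rows with a positive surplus.
    return from_file - from_trainset, from_trainset - from_file
```

With the objects from the code above, `round_trip_mismatches(trainset, read_raw_rows(train_file))` should return two empty Counters if every rating still belongs to the right raw user and item. Rows are compared as multisets, so the order in which `all_ratings()` yields them does not matter.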

Expected Results

#### Original dataset
User Item Rating
1 225 2
1 154 5
1 73 3
1 43 4
1 199 4
1 34 2
1 227 4
1 94 2
1 74 1
1 76 4
1 181 5
1 105 2
1 253 5
1 200 3
1 61 4
1 93 5
1 272 3
1 53 3
1 174 5
1 193 4
1 161 4
1 129 5
1 195 5
1 9 5
1 156 4
1 262 3
1 99 3
1 21 1
1 35 1
1 123 4
1 104 1
1 148 2
1 184 4
1 249 4
1 54 3
1 66 4
1 107 4
1 8 1
1 145 2
1 102 2
1 134 4
1 125 3
1 165 5
1 49 3
1 114 5
1 32 5
1 252 2
1 209 4
1 153 3
1 26 3
1 137 5
1 133 4
1 217 3
1 245 2
1 24 3
2 286 4
2 292 4
2 313 5
2 272 5
2 290 3
2 10 2
2 312 3
2 280 3
2 281 3
2 14 4
2 296 3
2 1 4
2 279 4
3 332 1
3 339 3
3 350 3
3 319 2
3 352 2
3 260 4
3 336 1
3 348 4
3 345 3
3 271 3
3 346 5
4 327 5
4 357 4
4 329 5
4 288 4
4 300 5
5 457 1
5 2 3

#### Internal IDs of Surprise
User Item Rating
0 0 2
0 1 5
0 2 3
0 3 4
0 4 4
0 5 2
0 6 4
0 7 2
0 8 1
0 9 4
0 10 5
0 11 2
0 12 5
0 13 3
0 14 4
0 15 5
0 16 3
0 17 3
0 18 5
0 19 4
0 20 4
0 21 5
0 22 5
0 23 5
0 24 4
0 25 3
0 26 3
0 27 1
0 28 1
0 29 4
0 30 1
0 31 2
0 32 4
0 33 4
0 34 3
0 35 4
0 36 4
0 37 1
0 38 2
0 39 2
0 40 4
0 41 3
0 42 5
0 43 3
0 44 5
0 45 5
0 46 2
0 47 4
0 48 3
0 49 3
0 50 5
0 51 4
0 52 3
0 53 2
0 54 3
1 369 4
1 533 5
1 503 3
1 451 1
1 239 4
1 314 4
1 110 4
1 956 4
1 714 4
1 134 4
1 674 4
1 227 5
1 471 1
2 180 5
2 382 5
2 264 4
2 213 3
2 517 1
2 86 1
2 351 5
2 162 5
2 272 2
2 410 4
2 822 2
3 1328 1
3 401 5
3 807 3
3 84 3
3 1074 5
4 415 5
4 589 4

Actual Results

#### Original dataset
User Item Rating
1 225 2
1 154 5
1 73 3
1 43 4
1 199 4
1 34 2
1 227 4
1 94 2
1 74 1
1 76 4
1 181 5
1 105 2
1 253 5
1 200 3
1 61 4
1 93 5
1 272 3
1 53 3
1 174 5
1 193 4
1 161 4
1 129 5
1 195 5
1 9 5
1 156 4
1 262 3
1 99 3
1 21 1
1 35 1
1 123 4
1 104 1
1 148 2
1 184 4
1 249 4
1 54 3
1 66 4
1 107 4
1 8 1
1 145 2
1 102 2
1 134 4
1 125 3
1 165 5
1 49 3
1 114 5
1 32 5
1 252 2
1 209 4
1 153 3
1 26 3
1 137 5
1 133 4
1 217 3
1 245 2
1 24 3
2 286 4
2 292 4
2 313 5
2 272 5
2 290 3
2 10 2
2 312 3
2 280 3
2 281 3
2 14 4
2 296 3
2 1 4
2 279 4
3 332 1
3 339 3
3 350 3
3 319 2
3 352 2
3 260 4
3 336 1
3 348 4
3 345 3
3 271 3
3 346 5
4 327 5
4 357 4
4 329 5
4 288 4
4 300 5
5 457 1
5 2 3

#### Internal IDs of Surprise
User Item Rating
0 0 2
0 1 5
0 2 3
0 3 4
0 4 4
0 5 2
0 6 4
0 7 2
0 8 1
0 9 4
0 10 5
0 11 2
0 12 5
0 13 3
0 14 4
0 15 5
0 16 3
0 17 3
0 18 5
0 19 4
0 20 4
0 21 5
0 22 5
0 23 5
0 24 4
0 25 3
0 26 3
0 27 1
0 28 1
0 29 4
0 30 1
0 31 2
0 32 4
0 33 4
0 34 3
0 35 4
0 36 4
0 37 1
0 38 2
0 39 2
0 40 4
0 41 3
0 42 5
0 43 3
0 44 5
0 45 5
0 46 2
0 47 4
0 48 3
0 49 3
0 50 5
0 51 4
0 52 3
0 53 2
0 54 3
0 369 4
0 533 5
0 503 3
0 451 1
0 239 4
0 314 4
0 110 4
0 956 4
0 714 4
0 134 4
0 674 4
0 227 5
0 471 1
0 180 5
0 382 5
0 264 4
0 213 3
0 517 1
0 86 1
0 351 5
0 162 5
0 272 2
0 410 4
0 822 2
0 1328 1
0 401 5
0 807 3
0 84 3
0 1074 5
0 415 5
0 589 4
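
The symptom shown here, where every rating ends up under internal user 0, can also be detected without looking at individual rows by counting ratings per user on both sides. This is a minimal sketch, assuming both files have already been read into lists of (user, item, rating) rows. The helper names `ratings_per_user` and `count_profiles_match` are hypothetical, and the sample rows are only a small illustration.

```python
from collections import Counter


def ratings_per_user(rows):
    """Count how many ratings each user ID has."""
    return Counter(user for user, _item, _rating in rows)


def count_profiles_match(raw_rows, inner_rows):
    """Compare per-user rating counts of raw and inner rows.

    Counts are compared as sorted lists, so the check does not depend
    on which inner ID a given raw user was assigned.
    """
    raw_counts = sorted(ratings_per_user(raw_rows).values())
    inner_counts = sorted(ratings_per_user(inner_rows).values())
    return raw_counts == inner_counts


raw = [("1", "225", 2), ("1", "154", 5), ("2", "286", 4)]
consistent = [(0, 0, 2), (0, 1, 5), (1, 2, 4)]   # two users, as in the raw rows
collapsed = [(0, 0, 2), (0, 1, 5), (0, 2, 4)]    # everything under user 0

print(count_profiles_match(raw, consistent))  # True
print(count_profiles_match(raw, collapsed))   # False
```

A mismatch in these counts only shows that users were merged or split somewhere. It does not say whether the problem is in the mapping itself or in how the CSV was written or read back.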


Versions

Windows-10-10.0.22621-SP0
Python 3.8.3 (default, Jul 2 2020, 17:30:36) [MSC v.1916 64 bit (AMD64)]
surprise 1.1.3

benhaf changed the title from "Wron mapping of the raw IDs to the internal IDs" to "Wrong mapping of the raw IDs to the internal IDs" on Mar 28, 2023