Cross-validation kNN wrong results on custom dataset #462

Open
julx134 opened this issue Feb 15, 2023 · 0 comments

julx134 commented Feb 15, 2023

Description

I am working on a capstone project that fits item-based kNN on a custom Amazon appliance 100K dataset. I wanted cross-validation metrics for this dataset, but the results I get are wildly incorrect. To rule out a mistake in my code, I ran the built-in MovieLens 100k dataset through the same function, and it returned valid results.

I've attached the datasets for your reference.
amazon_appliance_100k.csv
ml_100k.csv

Steps/Code to Reproduce

Here is the code to run and cross-validate a custom dataset on Google Colab:

import csv
import os

import pandas as pd
from surprise import Dataset, KNNBasic, Reader
from surprise.model_selection import cross_validate


def trainCustomDataset(path, num_folds):
  # path to custom dataset
  file_path = os.path.expanduser(path)

  # convert csv to dictionary (columns 0, 2 and 4 hold user, item and rating)
  rating_dict = {'user_id': [], 'item_id': [], 'rating': []}
  with open(file_path, 'r') as dataset:
    for line in csv.reader(dataset):
      rating_dict['user_id'].append(line[0])
      rating_dict['item_id'].append(line[2])
      rating_dict['rating'].append(line[4])

  # convert dictionary to dataframe
  rating_df = pd.DataFrame.from_dict(rating_dict)

  # group duplicate (user, item) pairs into one mean rating
  rating_df = rating_df.groupby(['user_id', 'item_id']).agg({'rating': 'mean'}).reset_index()

  # define surprise reader object
  reader = Reader(rating_scale=(1, 5))

  # convert dataframe into surprise dataset object
  data = Dataset.load_from_df(rating_df[['user_id', 'item_id', 'rating']], reader)

  # We'll use the item-based collaborative filtering algorithm
  sim_options = {
      "name": "cosine",
      "user_based": False,  # compute similarities between items
  }
  # define IBCFRS; cross_validate fits it on each fold, so no separate fit call is needed
  algo = KNNBasic(sim_options=sim_options)

  # Run num_folds-fold cross-validation and print results
  print(cross_validate(algo, data, measures=["RMSE", "MAE"], cv=num_folds, verbose=True))
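
One thing that may be worth checking in the loading step above, offered as a possibility rather than a confirmed diagnosis: `csv.reader` yields every field as a string, so the `rating` values are text when they reach the `groupby(...).agg({'rating': 'mean'})` call, and a header row (if the file has one) would be read as a rating too. The sketch below is a dependency-free way to parse the ratings as floats and average duplicates before building the dataframe. The helper name `load_ratings` is hypothetical, and the column positions 0, 2 and 4 are taken from the code above and assumed to match the file.

```python
import csv
from collections import defaultdict


def load_ratings(lines, user_col=0, item_col=2, rating_col=4):
    """Parse CSV lines into (user_id, item_id, mean_rating) tuples.

    Column positions default to the ones used in the issue's code; they are
    an assumption about the file layout, not something verified here.
    Returns the ratings and the number of rows that were skipped.
    """
    totals = defaultdict(lambda: [0.0, 0])  # (user, item) -> [sum, count]
    skipped = 0
    for row in csv.reader(lines):
        try:
            rating = float(row[rating_col])  # csv.reader yields strings
        except (ValueError, IndexError):
            skipped += 1  # header row, blank line, or non-numeric rating
            continue
        key = (row[user_col], row[item_col])
        totals[key][0] += rating
        totals[key][1] += 1
    # Average duplicate (user, item) pairs into a single numeric rating
    ratings = [(u, i, s / n) for (u, i), (s, n) in totals.items()]
    return ratings, skipped
```

The returned tuples could then be passed to `pd.DataFrame(ratings, columns=['user_id', 'item_id', 'rating'])` in place of the dictionary-based dataframe, and a non-zero `skipped` count would point at rows that are not numeric ratings. It may also help to confirm that the parsed ratings fall inside the `rating_scale=(1, 5)` given to `Reader`.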

Expected Results

I expected results similar to these:
ML_100k_results

Actual Results

Here are my actual results:
amazon_100k_result

Versions

Linux-5.10.147+-x86_64-with-glibc2.29
Python 3.8.10 (default, Nov 14 2022, 12:59:47) 
[GCC 9.4.0]
surprise 1.1.3