Improve similarities performance #448

Open · wants to merge 1 commit into master
Conversation

mihaiblidaru

I'm testing multiple recommendation algorithms on a very large dataset (6M ratings, 69k users, 17k items), and I noticed that for algorithms that compute similarities the fit time is very high.

I found some redundant operations that can be eliminated to speed up the similarity computation.

In similarities.pyx, when computing auxiliary values such as the freq and prods arrays, y_ratings is not sorted, so we fully iterate over it twice, resulting in N² operations for each y_ratings in yr:

    for y, y_ratings in yr.items():
        for xi, ri in y_ratings:
            for xj, rj in y_ratings:  # every ordered pair, including xi == xj
                freq[xi, xj] += 1     # number of common ys for the pair (xi, xj)
                prods[xi, xj] += ri * rj
                sqi[xi, xj] += ri**2
                sqj[xi, xj] += rj**2

We can take advantage of the fact that all auxiliary matrices are symmetric (M[x][y] == M[y][x]), so filling the half above the diagonal is enough to compute the similarities.
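For illustration, the final similarity pass can then read only the upper triangle and mirror each value into the lower half. A minimal cosine-style sketch, assuming freq, prods, sqi and sqj are the numpy accumulators from the snippet above and n_x is the number of items (an illustration, not the exact library code):

    import numpy as np

    sim = np.zeros((n_x, n_x))
    for xi in range(n_x):
        sim[xi, xi] = 1  # an item is fully similar to itself
        for xj in range(xi + 1, n_x):
            if freq[xi, xj] == 0:
                sim[xi, xj] = 0  # no common ys, no similarity information
            else:
                sim[xi, xj] = prods[xi, xj] / np.sqrt(sqi[xi, xj] * sqj[xi, xj])
            sim[xj, xi] = sim[xi, xj]  # mirror: the matrix is symmetric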

My change sorts the y_ratings lists before filling the auxiliary arrays and restricts the second loop over y_ratings so that only the upper half of each auxiliary matrix is filled. This reduces the number of operations for each element in yr from N² to N(N-1)/2:

    sorted_yr = {y: sorted(y_ratings, key=lambda x: x[0]) for y, y_ratings in yr.items()}

    for y, y_ratings in sorted_yr.items():
        for i, (xi, ri) in enumerate(y_ratings):
            for xj, rj in y_ratings[i + 1:]:  # only pairs with xi < xj
                freq[xi, xj] += 1
                prods[xi, xj] += ri * rj
                sqi[xi, xj] += ri**2
                sqj[xi, xj] += rj**2
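As a quick sanity check, a small self-contained script (my hypothetical example, not part of the patch) confirms that both loops produce the same values above the diagonal, here for the prods accumulator only:

    import numpy as np

    # Tiny hypothetical ratings: yr maps y -> list of (x, rating) pairs.
    yr = {0: [(0, 3.0), (2, 1.0), (1, 4.0)],
          1: [(1, 2.0), (0, 5.0)]}
    n_x = 3

    full = np.zeros((n_x, n_x))
    half = np.zeros((n_x, n_x))

    # Original: every ordered pair, N**2 updates per y.
    for y_ratings in yr.values():
        for xi, ri in y_ratings:
            for xj, rj in y_ratings:
                full[xi, xj] += ri * rj

    # Patched: sort by inner id, fill xi < xj only, N*(N-1)/2 updates per y.
    for y_ratings in yr.values():
        s = sorted(y_ratings, key=lambda x: x[0])
        for i, (xi, ri) in enumerate(s):
            for xj, rj in s[i + 1:]:
                half[xi, xj] += ri * rj

    # Entries strictly above the diagonal must agree.
    assert np.allclose(np.triu(full, 1), np.triu(half, 1))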

The sorting might reduce performance for small datasets, but for larger datasets like movielens-1m the performance gains are very noticeable.

Here are some performance tests I ran on my laptop (Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz). The times in the tables below are the time it took to train a KNNBasic algorithm (just the .fit method call), averaged over 50 runs.
Dataset: movielens-1m

| Similarity | Before (seconds) | After (seconds) | After/Before (%) |
|---|---|---|---|
| cosine | 32.58 | 13.78 | 42.29 % |
| msd | 29.94 | 11.14 | 37.20 % |
| pearson | 38.21 | 16.98 | 44.43 % |
| pearson_baseline | 38.26 | 17.86 | 46.68 % |

Dataset: movielens-100k

| Similarity | Before (seconds) | After (seconds) | After/Before (%) |
|---|---|---|---|
| cosine | 0.46 | 0.20 | 43.47 % |
| msd | 0.28 | 0.15 | 53.57 % |
| pearson | 0.58 | 0.27 | 46.55 % |
| pearson_baseline | 0.66 | 0.42 | 63.63 % |

Of course, all tests pass successfully.
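For anyone who wants to reproduce these numbers, a harness along these lines (my sketch, not the exact script behind the tables) times only the .fit() call:

    import time

    from surprise import Dataset, KNNBasic

    data = Dataset.load_builtin('ml-100k')  # or 'ml-1m' for the larger run
    trainset = data.build_full_trainset()

    times = []
    for _ in range(5):  # the tables above average over 50 runs
        algo = KNNBasic(sim_options={'name': 'cosine'})
        start = time.perf_counter()
        algo.fit(trainset)  # only the training step is timed
        times.append(time.perf_counter() - start)

    print('average fit time: %.2f s' % (sum(times) / len(times)))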

… computation

For each processed user or item, the number of operations is reduced from N² to N(N-1)/2, roughly half.

The sorting might reduce performance for small datasets, but for larger datasets like movielens-1m the performance gains are very noticeable.
@mihaiblidaru (Author)

I just discovered that there are official benchmarks. I ran them using GitHub workflows, which was a terrible idea because the VMs they run on are very slow.

There's not much improvement when benchmarking on the ml-100k dataset:
Before

| Movielens 100k | RMSE | MAE | Time |
|---|---|---|---|
| k-NN | 0.98 | 0.774 | 0:00:17 |
| Centered k-NN | 0.951 | 0.749 | 0:00:18 |
| k-NN Baseline | 0.931 | 0.733 | 0:00:21 |

After

| Movielens 100k | RMSE | MAE | Time |
|---|---|---|---|
| k-NN | 0.98 | 0.774 | 0:00:16 |
| Centered k-NN | 0.951 | 0.749 | 0:00:17 |
| k-NN Baseline | 0.931 | 0.733 | 0:00:19 |

With the ml-1m dataset my implementation is slower for some unknown reason, which made me doubt the consistency of the GitHub VMs' performance.

Before

| Movielens 1M | RMSE | MAE | Time |
|---|---|---|---|
| k-NN | 0.923 | 0.727 | 0:11:51 |
| Centered k-NN | 0.929 | 0.738 | 0:12:03 |
| k-NN Baseline | 0.895 | 0.706 | 0:11:36 |

After

| Movielens 1M | RMSE | MAE | Time |
|---|---|---|---|
| k-NN | 0.923 | 0.727 | 0:13:31 |
| Centered k-NN | 0.929 | 0.738 | 0:13:28 |
| k-NN Baseline | 0.895 | 0.706 | 0:13:40 |

I ran the benchmarks again on a c5.2xlarge AWS EC2 instance, and there the results make more sense.
Before

| Movielens 1M | RMSE | MAE | Time |
|---|---|---|---|
| k-NN | 0.923 | 0.727 | 0:09:22 |

After

| Movielens 1M | RMSE | MAE | Time |
|---|---|---|---|
| k-NN | 0.923 | 0.727 | 0:07:45 |

My guess is that the improvement is lower when running the full benchmark because most of the time is spent testing the algorithm and generating predictions rather than training it.
