
Improve and add Spearman #227

Open · wants to merge 15 commits into master

Conversation

@ghost commented Nov 15, 2018

This PR improves on the existing work from #167.

I will revise the code and make it ready to integrate.

I'm opening the PR now so you can follow the progress.

@ghost (Author) commented Nov 15, 2018

The tests are not passing yet, and I'm still checking whether the Cython function really works.

@ghost (Author) commented Nov 16, 2018

Hey @NicolasHug,

I am unsure whether the rankings are calculated correctly.

I think the rankings are calculated by the columns and not by the rows of yr.

Or am I wrong?

@NicolasHug (Owner)

What do you mean by "the rows of yr"? yr is a dictionary, so it doesn't have rows or columns.

yr is equal to ur or ir (from the trainset object), depending on the user_based parameter.
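For illustration, here is a toy example of the structure (my sketch, not actual library output). With user_based=True, the x's are users and the y's are items, so yr is trainset.ir:

# toy yr with user_based=True (so yr == trainset.ir):
# each item y maps to the list of (user x, rating) pairs
yr = {
    0: [(0, 3.0), (2, 4.0)],  # item 0 rated by user 0 (3.0) and user 2 (4.0)
    1: [(1, 5.0)],            # item 1 rated by user 1 (5.0)
}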

@ghost (Author) commented Nov 16, 2018

I've split this into several comments so that it stays clear ^^

@ghost (Author) commented Nov 16, 2018

I refer to the dict yr from test_similarities.py:

n_x = 8
yr_global = {
    0: [(0, 3), (1, 3), (2, 3), (5, 1), (6, 1.5), (7, 3)],
    1: [(0, 4), (1, 4), (2, 4)],
    2: [(2, 5), (3, 2), (4, 3)],
    3: [(1, 1), (2, 4), (3, 2), (4, 3), (5, 3), (6, 3.5), (7, 2)],
    4: [(1, 5), (2, 1), (5, 2), (6, 2.5), (7, 2.5)],
}

The current code would return this result:

sim = sims.spearman(n_x, yr, min_support=0)
[[ 1. 1. 1. 0. 0. 0. 0. 0. ]
 [ 1. 1. -0.595 0. 0. -0.999 -0.902 0.746]
 [ 1. -0.595 1. 1. 0. 0.789 0.412 -0.143]
 [ 0. 0. 1. 1. 0. 0. 0. 0. ]
 [ 0. 0. 0. 0. 1. 0. 0. 0. ]
 [ 0. -0.999 0.789 0. 0. 1. 0.885 -0.721]
 [ 0. -0.902 0.412 0. 0. 0.885 1. -0.961]
 [ 0. 0.746 -0.143 0. 0. -0.721 -0.961 1. ]]

I rounded it to three decimal places.

But it should actually have been:

[[ 1. 0.335 -0.057 -0.645 -0.645 -0.516 -0.516 -0.057]
 [ 0.335 1. -0.821 -0.866 -0.866 0.154 0.154 0.359]
 [-0.057 -0.821 1. 0.74 0.74 -0.5 -0.5 -0.816]
 [-0.645 -0.866 0.74 1. 1. 0.148 0.148 -0.444]
 [-0.645 -0.866 0.74 1. 1. 0.148 0.148 -0.444]
 [-0.516 0.154 -0.5 0.148 0.148 1. 1. 0.579]
 [-0.516 0.154 -0.5 0.148 0.148 1. 1. 0.579]
 [-0.057 0.359 -0.816 -0.444 -0.444 0.579 0.579 1. ]]
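As a cross-check (not part of the PR), scipy's spearmanr applied to the 0-filled matrix representation of yr_global reproduces this second matrix:

import numpy as np
from scipy.stats import spearmanr

# yr_global as a dense matrix, with missing ratings as 0
# (the same representation used in the program further down)
m = np.array([[3., 3., 3., 0., 0., 1., 1.5, 3.],
              [4., 4., 4., 0., 0., 0., 0., 0.],
              [0., 0., 5., 2., 3., 0., 0., 0.],
              [0., 1., 4., 2., 3., 3., 3.5, 2.],
              [0., 5., 1., 0., 0., 2., 2.5, 2.5]])

# spearmanr treats each column as a variable, i.e. it ranks down the
# columns and applies Pearson -- this reproduces the second matrix
rho, _ = spearmanr(m)
print(np.round(rho, 3))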

@ghost (Author) commented Nov 16, 2018

The second matrix is calculated according to the formula used in the code.

That formula corresponds to this one:
https://mathcracker.com/spearman-correlation-calculator.php

(Sorry, I just can't find a better source.)

@ghost (Author) commented Nov 16, 2018

@gautamramk used this to calculate the ranks:

....
    rows = np.zeros(n_x, np.double)

    for y, y_ratings in iteritems(yr):
        for xi, ri in y_ratings:
            rows[xi] = ri
        ranks = rankdata(rows)
....

But the ranks should actually be determined by the columns of yr, because these represent e.g. the users and their choices.

Therefore I think the current code is not correct: it does not calculate the Spearman correlation.

@ghost (Author) commented Nov 16, 2018

I have written the following program to show how I think the Spearman correlation should be calculated for min_support == 0.

This code was also used to generate the second matrix.

import numpy as np
from scipy.stats import rankdata

# yr_global as a matrix representation (missing ratings as 0)
matrix = np.array([[3., 3., 3., 0., 0., 1., 1.5, 3.],
                   [4., 4., 4., 0., 0., 0., 0., 0.],
                   [0., 0., 5., 2., 3., 0., 0., 0.],
                   [0., 1., 4., 2., 3., 3., 3.5, 2.],
                   [0., 5., 1., 0., 0., 2., 2.5, 2.5]])

n = len(matrix)        # number of y's (rows)
dim = len(matrix[0])   # number of x's (columns)

result = np.zeros((dim, dim))

for x in range(dim):
    rank_x = rankdata(matrix[:, x])  # rank down the column
    for y in range(x, dim):
        rank_y = rankdata(matrix[:, y])

        prod = np.dot(rank_x, rank_y)
        sum_rx = sum(rank_x)
        sum_ry = sum(rank_y)
        sum_rxs = sum(rank_x ** 2)
        sum_rys = sum(rank_y ** 2)

        # numerator and denominator of the Spearman (Pearson-on-ranks) formula
        num = n * prod - (sum_rx * sum_ry)
        denom_l = np.sqrt(n * sum_rxs - sum_rx ** 2)
        denom_r = np.sqrt(n * sum_rys - sum_ry ** 2)

        result[x, y] = round(num / (denom_r * denom_l), 3)
        result[y, x] = result[x, y]

print(result)

@ghost (Author) commented Nov 16, 2018

So I wonder if I'm totally wrong.

If so, where is my mistake?

If not, I would completely rework the spearman method and model it on the example code.

Many thanks
Marc

@ghost (Author) commented Nov 17, 2018

Hey @NicolasHug,
I think I fixed the bug by introducing a ranking matrix.

The calculation in the previous version does not seem to be correct: the ranks were computed in the wrong direction.

They must be computed over the columns of yr, not the rows.

For example, take the matrix representation of yr_global again:

[[3., 3., 3., 0., 0., 1., 1.5, 3.],
 [4., 4., 4., 0., 0., 0., 0., 0.],
 [0., 0., 5., 2., 3., 0., 0., 0.],
 [0., 1., 4., 2., 3., 3., 3.5, 2.],
 [0., 5., 1., 0., 0., 2., 2.5, 2.5]]

The old version ranked along the rows; for the first row it computes:

[6.5, 6.5, 6.5, 1.5, 1.5, 3., 4., 6.5]

But the ranks have to be computed down the columns; for the first column the new version computes:

[4., 5., 2., 2., 2.]
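Both rank vectors can be reproduced with scipy (illustrative snippet, not PR code):

import numpy as np
from scipy.stats import rankdata

m = np.array([[3., 3., 3., 0., 0., 1., 1.5, 3.],
              [4., 4., 4., 0., 0., 0., 0., 0.],
              [0., 0., 5., 2., 3., 0., 0., 0.],
              [0., 1., 4., 2., 3., 3., 3.5, 2.],
              [0., 5., 1., 0., 0., 2., 2.5, 2.5]])

print(rankdata(m[0, :]))  # along the first row (old): [6.5 6.5 6.5 1.5 1.5 3. 4. 6.5]
print(rankdata(m[:, 0]))  # down the first column (new): [4. 5. 2. 2. 2.]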

The existing tests now pass without changes.

I also adjusted the documentation.

Excuse the wall of text above ^^

But now it should be correct.

@NicolasHug (Owner)

Hey @MaFeg100, sorry for the slow reply.

I haven't looked in great detail, but I think you are probably correct. Here is what I understand; let me know if you agree.

Spearman is like Pearson correlation but instead of using raw ratings we use the ranks.

Considering a rating matrix U x I (users are rows, items are columns), a user-user Spearman similarity would first compute the ranks of the ratings in a row-wise fashion (I think that's what you mean by "the columns of yr", but I'm not comfortable speaking about rows or columns for yr because it's not really a matrix), and then apply a Pearson sim.

Note however that this is a simplified view: in reality we want to compute the rankings over the common ratings only, not over the whole rows. (Maybe this actually has no impact? I haven't thought about it much.)
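For intuition, here is a minimal check (an illustration, not code from the PR) that on fully observed vectors Spearman is exactly Pearson applied to the ranks:

import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

rng = np.random.RandomState(0)
u, v = rng.rand(10), rng.rand(10)

print(spearmanr(u, v).correlation)            # Spearman directly
print(pearsonr(rankdata(u), rankdata(v))[0])  # Pearson on ranks: same value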

For now maybe the most important thing is to run a quick benchmark (more thorough benchmarks can be done later) to make sure the computation runs in a decent time compared to the other sims. I wouldn't want you to waste your time fixing this if in the end we won't merge the PR because the spearman sim is too slow :/

Then of course, we should fix the bugs if there are any.

Hope this does not add confusion ^^, thanks for looking into it anyway.

@ghost (Author) commented Nov 19, 2018

Hey @NicolasHug,

that's right. Spearman calculates the ranks within the user or item vectors; that's what I meant by "columns".

My change converts yr directly into ranks and then proceeds exactly like Pearson.

The old version calculated the ranks in the wrong direction.

The current version also takes only the common elements into account; this is handled exactly as in Pearson, via the freq matrix (sketched below).

So the only difference is that yr must be transformed into a representation that contains the ranks instead of the ratings.
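A minimal sketch of that gating, following the pattern of the existing Pearson code (variable names are assumptions, not the PR's actual code):

import numpy as np

def finalize_sim(n_x, freq, prods, si, sj, sqi, sqj, min_support):
    # Turn the accumulated sums into similarities, zeroing out pairs
    # with fewer than min_support common ratings.
    sim = np.zeros((n_x, n_x), np.double)
    for xi in range(n_x):
        sim[xi, xi] = 1
        for xj in range(xi + 1, n_x):
            if freq[xi, xj] < min_support:
                sim[xi, xj] = 0
            else:
                n = freq[xi, xj]
                num = n * prods[xi, xj] - si[xi, xj] * sj[xi, xj]
                denum = np.sqrt((n * sqi[xi, xj] - si[xi, xj]**2) *
                                (n * sqj[xi, xj] - sj[xi, xj]**2))
                sim[xi, xj] = 0 if denum == 0 else num / denum
            sim[xj, xi] = sim[xi, xj]
    return sim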

@ghost (Author) commented Nov 19, 2018

Next I'll post cross-validation results for Spearman and the other similarity measures.

The interesting quantities are running time and error.

@ghost (Author) commented Nov 19, 2018

Example 1: Spearman: item-based KNNBasic; MovieLens100k; 5-fold cross validation

from surprise import Dataset, KNNBasic
from surprise.model_selection import cross_validate

data = Dataset.load_builtin('ml-100k')
sim_options = {'name': 'spearman',
               'user_based': False}
algo = KNNBasic(sim_options=sim_options)
cross_validate(algo, data, verbose=True)
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.0370  1.0417  1.0520  1.0359  1.0446  1.0422  0.0058  
MAE (testset)     0.8324  0.8359  0.8423  0.8289  0.8366  0.8352  0.0045  
Fit time          1.99    2.00    1.99    2.04    2.06    2.02    0.03    
Test time         2.72    2.81    2.88    2.77    2.83    2.80    0.05 

@ghost (Author) commented Nov 19, 2018

Example 2: Cosine: item-based KNNBasic; MovieLens100k; 5-fold cross validation

data = Dataset.load_builtin('ml-100k')
sim_options = {'name': 'cosine',
               'user_based': False}
algo = KNNBasic(sim_options=sim_options)
cross_validate(algo, data, verbose=True)
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.0214  1.0236  1.0327  1.0243  1.0297  1.0263  0.0042  
MAE (testset)     0.8098  0.8109  0.8158  0.8072  0.8133  0.8114  0.0030  
Fit time          1.22    1.27    1.24    1.27    1.18    1.24    0.03    
Test time         2.83    2.77    2.82    2.85    2.80    2.81    0.03  

@ghost (Author) commented Nov 19, 2018

Example 3: Pearson: item-based KNNBasic; MovieLens100k; 5-fold cross validation

data = Dataset.load_builtin('ml-100k')
sim_options = {'name': 'pearson',
               'user_based': False}
algo = KNNBasic(sim_options=sim_options)
cross_validate(algo, data, verbose=True)
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.0428  1.0423  1.0380  1.0405  1.0457  1.0419  0.0025  
MAE (testset)     0.8354  0.8369  0.8291  0.8330  0.8394  0.8347  0.0035  
Fit time          1.79    1.80    1.78    1.79    1.78    1.79    0.01    
Test time         2.73    2.80    2.75    2.81    2.88    2.80    0.05  

@ghost (Author) commented Nov 19, 2018

First conclusion:

Examples 1 to 3 show that, for the item-based approach, Spearman is close to Pearson in RMSE and MAE.

You can also see that Spearman takes longer than Cosine and Pearson. This is because yr is first transformed into a "rank representation".

Nevertheless, the time is not dramatically worse than with Cosine or Pearson.

@ghost (Author) commented Nov 19, 2018

Example 4: Spearman: user-based KNNBasic; MovieLens100k; 5-fold cross validation

data = Dataset.load_builtin('ml-100k')
sim_options = {'name': 'spearman',
               'user_based': True}
algo = KNNBasic(sim_options=sim_options)
cross_validate(algo, data, verbose=True)
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.0096  1.0119  1.0051  1.0242  1.0108  1.0123  0.0064  
MAE (testset)     0.8018  0.8034  0.7943  0.8143  0.8038  0.8035  0.0064  
Fit time          1.12    1.18    1.21    1.16    1.75    1.29    0.23    
Test time         2.43    2.46    2.45    3.11    2.65    2.62    0.26  

@ghost (Author) commented Nov 19, 2018

Example 5: Cosine: user-based KNNBasic; MovieLens100k; 5-fold cross validation

data = Dataset.load_builtin('ml-100k')
sim_options = {'name': 'cosine',
               'user_based': True}
algo = KNNBasic(sim_options=sim_options)
cross_validate(algo, data, verbose=True)
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.0095  1.0166  1.0234  1.0145  1.0188  1.0165  0.0046  
MAE (testset)     0.7981  0.8035  0.8125  0.8025  0.8024  0.8038  0.0047  
Fit time          0.70    0.78    0.71    0.68    0.68    0.71    0.04    
Test time         2.65    2.57    2.48    2.43    2.52    2.53    0.07    

@ghost (Author) commented Nov 19, 2018

Example 6: Pearson: user-based KNNBasic; MovieLens100k; 5-fold cross validation

data = Dataset.load_builtin('ml-100k')
sim_options = {'name': 'pearson',
               'user_based': True}
algo = KNNBasic(sim_options=sim_options)
cross_validate(algo, data, verbose=True)
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.0054  1.0088  1.0122  1.0150  1.0165  1.0116  0.0041  
MAE (testset)     0.8002  0.8006  0.8023  0.8045  0.8068  0.8029  0.0025  
Fit time          0.96    0.99    0.98    0.97    0.99    0.98    0.01    
Test time         2.45    2.45    2.57    2.52    2.53    2.50    0.04     

@ghost (Author) commented Nov 19, 2018

Second conclusion:

Here, too, Spearman is close to Pearson in RMSE and MAE.

But again, Spearman needs a bit more time in comparison.

@ghost (Author) commented Nov 19, 2018

Conclusion of the quick benchmark

I think Spearman may well be worth considering.

In programming effort, the method differs only slightly from the Pearson correlation.

It also produces similar RMSE and MAE values for both the item-based and the user-based approach.

In addition, the user-based approach even runs faster than the item-based approach.

Nevertheless, Spearman lags behind the other measures in fit time.
This is explained by the fact that the ranks are not directly available, so yr must first be converted into a "rank representation".

It could make sense to work on how the ranks are determined, since that accounts for a large share of the time.

I hope my contribution is understandable ^^

@NicolasHug (Owner)

Ok, thanks a lot for the benchmark. I agree that the computation time is quite reasonable in comparison.

I'll try to review it in more detail soon.

Instead of converting yr, would it be possible to avoid it by also passing xr? That is, instead of passing only ur or only ir, we could maybe just pass both?
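A sketch of what that could look like (ranks_from_xr is a hypothetical helper, not code from this PR, and it ranks only the observed ratings):

from scipy.stats import rankdata

def ranks_from_xr(xr):
    # Compute, for each x, the rank of each of its ratings,
    # without building a dense ratings matrix.
    ranks = {}
    for x, x_ratings in xr.items():
        r = rankdata([rating for _, rating in x_ratings])
        ranks[x] = {y: rk for (y, _), rk in zip(x_ratings, r)}
    return ranks

# toy xr: x -> list of (y, rating)
xr = {0: [(0, 3.0), (1, 4.0)], 1: [(0, 5.0)]}
print(ranks_from_xr(xr))  # {0: {0: 1.0, 1: 2.0}, 1: {0: 1.0}}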

@ghost (Author) commented Nov 19, 2018

I've considered that too.

I think you could avoid the time for building the matrix by additionally passing xr.

Nevertheless, I don't think it would save much more time.
If xr were passed, the code would be similar to the old one from @gautamramk.

To check this, I compared the two versions.

As an example:

Old Spearman; MovieLens100k; 5-fold cross validation

data = Dataset.load_builtin('ml-100k')
sim_options = {'name': 'spearman_old',
               'user_based': True}
algo = KNNBasic(sim_options=sim_options)
cross_validate(algo, data, verbose=True)

This resulted in:


                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.0110  1.0130  1.0068  1.0111  1.0064  1.0097  0.0026  
MAE (testset)     0.8019  0.8037  0.8007  0.8047  0.7969  0.8016  0.0027  
Fit time          1.32    1.49    1.42    1.31    1.39    1.39    0.07    
Test time         2.77    2.71    2.78    2.70    2.73    2.74    0.03    

New Spearman; MovieLens100k; 5-fold cross validation

data = Dataset.load_builtin('ml-100k')
sim_options = {'name': 'spearman',
               'user_based': True}
algo = KNNBasic(sim_options=sim_options)
cross_validate(algo, data, verbose=True)

This resulted in:

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.0156  1.0066  1.0147  1.0116  1.0132  1.0123  0.0032  
MAE (testset)     0.8059  0.7994  0.8069  0.8023  0.8049  0.8039  0.0027  
Fit time          1.40    1.47    1.26    1.27    1.21    1.32    0.10    
Test time         2.62    2.74    2.94    2.80    2.68    2.75    0.11 

You can see that the old and the new version give comparable times.

The errors (RMSE/MAE) are also similar.
Nevertheless, the new version passes the existing tests, whereas the old version does not.

Furthermore, I think the additional parameter xr would not change the timing much.
Most of the extra time is spent calculating the ranks.

To see this, I timed the individual code sections of the pre-processing phase.

(1) Old Spearman pre-processing

The ranks are also calculated in this section.

....
    start = time.time()
    for y, y_ratings in iteritems(yr):
        for xi, ri in y_ratings:
            rows[xi] = ri
        ranks = rankdata(rows)
        for xi, _ in y_ratings:
            for xj, _ in y_ratings:
                prods[xi, xj] += ranks[xi] * ranks[xj]
                freq[xi, xj] += 1
                sqi[xi, xj] += ranks[xi]**2
                sqj[xi, xj] += ranks[xj]**2
                si[xi, xj] += ranks[xi]
                sj[xi, xj] += ranks[xj]
    end = time.time()
    t1 = end - start
    print(t1)
....

This section takes about 0.9549877643585205 s.

(2) New Spearman pre-processing: building the matrix

....
    start = time.time()
    # turn yr into a matrix
    for y, y_ratings in iteritems(yr):
        for x_i, r_i in y_ratings:
            matrix[y, x_i] = r_i
    end = time.time()
    t1 = end-start
    print(t1)
....

This section takes about 0.030748605728149414 s.

(3) New Spearman pre-processing: building the rank matrix

....
    start = time.time()
    # turn the yr matrix into a matrix containing the ranks of the elements
    for x_i in range(n_x):
        matrix[:, x_i] = rankdata(matrix[:, x_i])
    end = time.time()
    t2 = end - start
    print(t2)
....

This section takes about 0.13120055198669434 s.

(4) New Spearman pre-processing

....
    start = time.time()
    for y, y_ratings in iteritems(yr):
        for xi, ri in y_ratings:
            # use the ranking matrix to get the elements row by row
            ranks[xi] = matrix[y, xi]
        for xi, _ in y_ratings:
            for xj, _ in y_ratings:
                prods[xi, xj] += ranks[xi] * ranks[xj]
                freq[xi, xj] += 1
                sqi[xi, xj] += ranks[xi]**2
                sqj[xi, xj] += ranks[xj]**2
                si[xi, xj] += ranks[xi]
                sj[xi, xj] += ranks[xj]
    end = time.time()
    t3 = end - start
    print(t3)
....

This section takes about 0.5745222568511963 s.

So I think that introducing xr would not change much about the fact that the ranks have to be calculated.

You can also see that the additional conversion of yr doesn't take much time (see (2)).
In addition, no more changes than necessary would have to be made.

@ghost (Author) commented Nov 19, 2018

I hope the short analysis is helpful.

I'd be happy to get a review of the code.

Thanks in advance! ^^

@NicolasHug (Owner) left a comment

Thanks a lot for the feedback. I did a more thorough review. This looks good in general, but I have 2 important concerns:

  • I'd like to avoid building the matrix.
  • The ranks seem to be computed on all ratings instead of on the common ratings only.

Sorry for the delay!

@@ -33,7 +33,7 @@ def test_cosine_sim():

     sim = sims.cosine(n_x, yr, min_support=1)

-    # check symetry and bounds (as ratings are > 0, cosine sim must be >= 0)
+    # check symmetry and bounds (as ratings are > 0, cosine sim must be >= 0)
@NicolasHug (Owner):

lol thanks for correcting the typos

@ghost (Author):

Always leave the place cleaner than you found it. ^^

(or items).

Only **common** users (or items) are taken into account. The Spearman
correlation coefficient can be seen as a non parametric Pearson's
@NicolasHug (Owner):

What do you mean by non-parametric?

I'd like to add something like: "The Spearman correlation coefficient is equivalent to the Pearson correlation coefficient, where the ratings are replaced by their rankings."

@ghost (Author):

Okay, I've improved that.


.. math ::
\\text{spearman_sim}(u, v) = \\frac{ \\sum\\limits_{i \\in I_{uv}} (rank(r_{ui}) - \\overline{rank(u)}) \\cdot (rank(r_{vi}) - \\overline{rank(v)})} {\\sqrt{\\sum\\limits_{i \\in I_{uv}} (rank(r_{ui}) - \\overline{rank(u)})^2} \\cdot \\sqrt{\\sum\\limits_{i \\in I_{uv}} (rank(r_{vi}) - \\overline{rank(v)})^2}}
@NicolasHug (Owner):

Please avoid lines longer than 79 characters

@ghost (Author):

This as well.

-1).

For details on Spearman coefficient, see in chapter 4, page 126 of: `Recommender Systems Handbook
<http://www.cs.ubbcluj.ro/~gabis/DocDiplome/SistemeDeRecomandare/Recommender_systems_handbook.pdf>`__.
@NicolasHug (Owner):

Don't add the link, I doubt it's very legal ;)

@ghost (Author):

Is the change better?

sj = np.zeros((n_x, n_x), np.double)
sim = np.zeros((n_x, n_x), np.double)
ranks = np.zeros(n_x, np.double)
matrix = np.zeros((len(yr), n_x), np.double)
@NicolasHug (Owner):

This is going to be huge (n_users * n_items).

Passing xr as well would avoid the need to create matrix, right? If that's the case then we should do it.

for y, y_ratings in iteritems(yr):
    for xi, ri in y_ratings:
        # use the ranking matrix to get the elements row by row
        ranks[xi] = matrix[y, xi]
@NicolasHug (Owner):

I think there might be a problem here:

ranks[xi] contains the ranks computed over all the ys, right?

But when we compare 2 xs, we only want to do that on the basis of their common ys. In the subsequent code you will compare them on the basis of all the ys.

Say we have 5 items and 2 users

ratings:
user 1: 1, 2, X, 4, 5
user 2: X, X, 1, 5, 2

The ranks are:

ranks:
user 1: 1, 2, X, 3, 4
user 2: X, X, 1, 3, 2

But on the common items the ratings are

ratings:
user 1: X, X, X, 4, 5
user 2: X, X, X, 5, 2

and the ranks are then

ranks:
user 1: X, X, X, 1, 2
user 2: X, X, X, 2, 1

So your code will consider the ranks

ranks:
user 1: 3, 4
user 2: 3, 2

while it should actually be considering

ranks:
user 1: 1, 2
user 2: 2, 1

Maybe this has no impact because the relative order of each rank will stay the same, and it has no effect on pearson? I don't know what would happen if there are ties though...

@ghost (Author) commented Dec 1, 2018

Hey,

thanks for the review.

I will look at it in more detail tomorrow or next week.

It's interesting to see how the ranking behaves. I'm sure I'll come up with something.

I don't think the ties are a problem. They are taken into account when the rankings are created, because averaged ("fractional") ranks are used (see the small example below).

But I will take a closer look at your concerns.
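For example, rankdata assigns tied values the average of the ranks they would occupy (illustrative snippet):

from scipy.stats import rankdata

# the two 2's share ranks 2 and 3, so both get 2.5
print(rankdata([1, 2, 2, 3]))  # [1.  2.5 2.5 4. ]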

I have also taken a closer look at the Spearman rank correlation.

I used the Webscope R3 dataset from Yahoo.
It contains 311,704 ratings from 15,400 users for 1,000 songs.

The dataset can be requested here:
https://webscope.sandbox.yahoo.com/catalog.php?datatype=r&did=3

For the comparison I used 5-fold cross validation and looked at the fit time relative to the other similarity measures.

@ghost (Author) commented Dec 1, 2018

[figure: knn_fit_time — fit time vs. neighbourhood size]

Here you can compare the time spent against the size of the neighbourhood.

@ghost (Author) commented Dec 2, 2018

[figures: spearman001, spearman002]

@ghost (Author) commented Dec 2, 2018

My code works according to this formula, which I wrote down myself:

[screenshot: Spearman similarity formula]

@ghost (Author) commented Dec 2, 2018

After some initial thought, I think it makes no difference how the ranks are calculated.

Since the ranks are normalized, it should not matter whether they are calculated over the entire sample or only over a subsample.

But this relationship can be examined more closely.
Are you interested in a formal proof or a counterexample?

If the assumption is true, xr can actually be used instead of yr.

However, if it turns out to be wrong, we have to investigate whether the additional effort of computing the common items and the corresponding ranks is actually faster than the current solution.

@NicolasHug (Owner)

@MaFeg100 let me know if I made an error in the code, but I think this snippet yields an example where computing the ranks on the common items gives different results than computing the ranks on the whole vectors.

import numpy as np
from numpy.testing import assert_almost_equal
from scipy.stats import rankdata

def spearman(u, v, ranks_on_common):
    # u = vectors of ratings of user u
    # v = vectors of ratings of user v
    # ranks_on_common: whether to compute the ranks on the common items only,
    # or on the whole ratings u and v (which is the PR's version)

    assert len(u) == len(v)

    common_items = [i for (i, (r_ui, r_vi)) in enumerate(zip(u, v))
                    if r_ui and r_vi]
    if not common_items:
        return 0

    print('ranks on common:', ranks_on_common)
    print('ratings u:', u)
    print('ratings v:', v)
    print('common items:', common_items)

    if ranks_on_common:
        # compute ranks on common items
        u_commons = [u[i] for i in common_items]
        v_commons = [v[i] for i in common_items]
        rank_u = rankdata(u_commons)
        rank_v = rankdata(v_commons)
    else:
        # compute ranks on whole vectors (treating missing ratings as 0),
        # and then only keep the ranks for the common items
        rank_u = rankdata(u)
        rank_v = rankdata(v)
        rank_u = [rank_u[i] for i in common_items]
        rank_v = [rank_v[i] for i in common_items]

    print('ranks u:', rank_u)
    print('ranks v:', rank_v)

    assert len(rank_u) == len(rank_v) == len(common_items)

    # Then compute pearson sim as usual, on common items
    mu_u = np.mean(rank_u)
    mu_v = np.mean(rank_v)

    num = sum((r_ui - mu_u) * (r_uv - mu_v)
              for (r_ui, r_uv) in zip(rank_u, rank_v))
    a = sum((r_ui - mu_u)**2 for r_ui in rank_u)
    b = sum((r_vi - mu_v)**2 for r_vi in rank_v)
    denom = np.sqrt(a * b)

    if denom == 0:
        return 0

    return num / denom


rng = np.random.RandomState(0)
size = 4
for _ in range(1000):
    # generate random ratings vectors between [0, 5]
    u = rng.randint(0, 6, size)
    v = rng.randint(0, 6, size)

    a = spearman(u, v, ranks_on_common=True)
    print('-' * 5)
    b = spearman(u, v, ranks_on_common=False)
    print(a, b)
    print('-' * 10)
    assert_almost_equal(a, b)
...
----------
ranks on common: True
ratings u: [4 5 0 4]
ratings v: [3 5 3 4]
common items: [0, 1, 3]
ranks u: [1.5 3.  1.5]
ranks v: [1. 3. 2.]
-----
ranks on common: False
ratings u: [4 5 0 4]
ratings v: [3 5 3 4]
common items: [0, 1, 3]
ranks u: [2.5, 4.0, 2.5]
ranks v: [1.5, 4.0, 3.0]
0.8660254037844387 0.8029550685469661
----------
Traceback (most recent call last):
  File "lol.py", line 70, in <module>
    assert_almost_equal(a, b)
  File "/home/nico/.virtualenvs/bordel_36/lib/python3.6/site-packages/numpy/testing/nose_tools/utils.py", line 581, in assert_almost_equal
    raise AssertionError(_build_err_msg())
AssertionError: 
Arrays are not almost equal to 7 decimals
 ACTUAL: 0.8660254037844387
 DESIRED: 0.8029550685469661

> If the assumption is true, xr can actually be used instead of yr.

I think that whether we should compute the ranks on the common items or on the whole ratings is totally independent of whether we should pass xr or just yr. If passing xr allows us not to compute the matrix of ratings, then xr should be passed, because even if computing this matrix can be fast, it uses a lot of memory.

@ghost (Author) commented Dec 10, 2018

Hey @NicolasHug, you're right.

The discrepancy can be fixed by setting the ratings that are not shared by both users to 0 before ranking.

I have changed your code so that both calculations agree.

import numpy as np
from numpy.testing import assert_almost_equal
from scipy.stats import rankdata

def spearman(u, v, ranks_on_common):
    # u = vectors of ratings of user u
    # v = vectors of ratings of user v
    # ranks_on_common: whether to compute the ranks on the common items only,
    # or on the whole ratings u and v (which is the PR's version)

    assert len(u) == len(v)

    common_items = [i for (i, (r_ui, r_vi)) in enumerate(zip(u, v))
                    if r_ui and r_vi]
    if not common_items:
        return 0

    print('ranks on common:', ranks_on_common)
    print('ratings u:', u)
    print('ratings v:', v)
    print('common items:', common_items)

    if ranks_on_common:
        # compute ranks on common items
        u_commons = [u[i] for i in common_items]
        v_commons = [v[i] for i in common_items]

        rank_u = rankdata(u_commons)
        rank_v = rankdata(v_commons)

    else:
        # compute ranks on whole vectors (treating missing ratings as 0),
        # and then only keep the ranks for the common items

        u_commons = [u[i] if i in common_items else 0 for i in range(len(u))]
        v_commons = [v[i] if i in common_items else 0 for i in range(len(v))]

        rank_u = rankdata(u_commons)
        rank_v = rankdata(v_commons)
        rank_u = [rank_u[i] for i in common_items]
        rank_v = [rank_v[i] for i in common_items]

    print('ranks u:', rank_u)
    print('ranks v:', rank_v)

    assert len(rank_u) == len(rank_v) == len(common_items)

    # Then compute pearson sim as usual, on common items
    mu_u = np.mean(rank_u)
    mu_v = np.mean(rank_v)

    num = sum((r_ui - mu_u) * (r_uv - mu_v)
              for (r_ui, r_uv) in zip(rank_u, rank_v))
    a = sum((r_ui - mu_u)**2 for r_ui in rank_u)
    b = sum((r_vi - mu_v)**2 for r_vi in rank_v)
    denom = np.sqrt(a * b)

    if denom == 0:
        return 0

    return num / denom


rng = np.random.RandomState(0)
size = 4
for _ in range(1000):
    # generate random ratings vectors between [0, 5]
    u = rng.randint(0, 6, size)
    v = rng.randint(0, 6, size)

    a = spearman(u, v, ranks_on_common=True)
    print('-' * 5)
    b = spearman(u, v, ranks_on_common=False)
    print(a, b)
    print('-' * 10)
    assert_almost_equal(a, b)

The same procedure can be used on matrix to compute the ranks over the common elements.

Is that right?

@NicolasHug (Owner)

Hmm, I guess that could work, but:

  • Imputing with 0 only works if 0 is not in the rating scale. E.g. for the jester dataset, which uses (-10, 10), this would not be correct. We need to impute with something that is not in the rating scale, and I don't know how to do that in general (computing a min or a max could do it, but there might be something better? See the sketch below.)
  • I'm afraid the bottleneck is now computing the common items.
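A minimal sketch of the min-based imputation idea (an illustration, not code from the PR):

import numpy as np
from scipy.stats import rankdata

# jester-style ratings in (-10, 10); np.nan marks a missing rating
ratings = np.array([-10., 4., np.nan, 7.])

# impute with a value strictly below every real rating, so that all
# imputed entries rank below the observed ones
fill = np.nanmin(ratings) - 1
filled = np.where(np.isnan(ratings), fill, ratings)
print(rankdata(filled))  # [2. 3. 1. 4.]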
