
Cosine full vector implementation #141

Open
@ODemidenko wants to merge 1 commit into master from cosine_full_vector
Conversation

@ODemidenko (Contributor):

Implemented cosine on full vectors (as requested; the adjusted cosine PR will be made separately, after this one). Tests and docs added.
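A minimal usage sketch of the new option (``'common_ratings_only'`` is the key added by this PR; the other ``sim_options`` keys already exist in the library):

```python
from surprise import Dataset, KNNBasic

# 'common_ratings_only' is the option introduced by this PR; setting it to
# False makes the cosine similarity use the full rating vectors instead of
# only the commonly rated items.
sim_options = {'name': 'cosine',
               'user_based': True,
               'common_ratings_only': False}

data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()

algo = KNNBasic(sim_options=sim_options)
algo.fit(trainset)
```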

@ODemidenko force-pushed the cosine_full_vector branch 2 times, most recently from 3fd53d4 to 6c4a819 on February 5, 2018, 11:02
@NicolasHug (Owner) left a comment:

Thanks for the PR, I made a quick review (without diving into the code).
Could you please provide a few benchmarks (RMSE, MAE, maybe recall / precision as in the FAQ, computation time...)?

@@ -130,6 +130,9 @@ argument is a dictionary with the following (all optional) keys:
``'False'``) for the similarity not to be zero. Simply put, if
:math:`|I_{uv}| < \text{min_support}` then :math:`\text{sim}(u, v) = 0`. The
same goes for items.
- ``'common_ratings_only'``: Determines whether only common user/item ratings are
  taken into account, or whether the full rating vectors are considered
  (only relevant for cosine-based similarity). Default is True.
@NicolasHug (Owner):

Should be 'Default is ``True``', i.e. True surrounded by two backticks.


Depending on ``common_ratings_only`` field of ``sim_options``
only common users (or items) are taken into account, or full rating
vectors (default: True).
@NicolasHug (Owner):

Same here

sqi[xi, xj] += ri**2
sqj[xi, xj] += rj**2
sqi[xi, xj] += ri ** 2
sqj[xi, xj] += rj ** 2
@NicolasHug (Owner):

No need to change this?

xi_iter = iter(sorted_y_ratings)
try:
xi_non_missing, ri_non_missing = next(xi_iter)
except StopIteration:
@NicolasHug (Owner):

Could all the StopIteration be avoided?
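(Not the PR's code, just an illustrative sketch of how the full-vector variant could be written without the iterator merging and StopIteration handling: with missing ratings treated as zero, only co-rated items contribute to the numerator, so the pairwise accumulation can stay as in the existing cosine and only the per-x squared norms change.)

```python
import numpy as np

def cosine_full_vector(n_x, yr, min_support):
    # prods[xi, xj]: dot product over co-rated ys (missing ratings contribute 0)
    prods = np.zeros((n_x, n_x))
    # freq[xi, xj]: number of common ys, used for the min_support check
    freq = np.zeros((n_x, n_x))
    # sq[xi]: squared norm of xi's full rating vector
    sq = np.zeros(n_x)

    for y, y_ratings in yr.items():
        for xi, ri in y_ratings:
            sq[xi] += ri ** 2
            for xj, rj in y_ratings:
                prods[xi, xj] += ri * rj
                freq[xi, xj] += 1

    sim = np.zeros((n_x, n_x))
    for xi in range(n_x):
        sim[xi, xi] = 1
        for xj in range(xi + 1, n_x):
            if freq[xi, xj] >= min_support and sq[xi] and sq[xj]:
                sim[xi, xj] = prods[xi, xj] / np.sqrt(sq[xi] * sq[xj])
            sim[xj, xi] = sim[xi, xj]
    return sim
```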

@@ -149,7 +244,7 @@ def msd(n_x, yr, min_support):
for y, y_ratings in iteritems(yr):
for xi, ri in y_ratings:
for xj, rj in y_ratings:
sq_diff[xi, xj] += (ri - rj)**2
sq_diff[xi, xj] += (ri - rj) ** 2
@NicolasHug (Owner):

Not relevant to the PR

sqi[xi, xj] += ri**2
sqj[xi, xj] += rj**2
sqi[xi, xj] += ri ** 2
sqj[xi, xj] += rj ** 2
@NicolasHug (Owner):

Same

sq_diff_i[xi, xj] += diff_i**2
sq_diff_j[xi, xj] += diff_j**2
sq_diff_i[xi, xj] += diff_i ** 2
sq_diff_j[xi, xj] += diff_j ** 2
@NicolasHug (Owner):

Same

3: [(1, 1), (2, 4), (3, 2), (4, 3), (5, 3), (6, 3.5), (7, 2)], # noqa
4: [(1, 5), (2, 1), (5, 2), (6, 2.5), (7, 2.5)], # noqa
3: [ (1, 1), (2, 4), (3, 2), (4, 3), (5, 3), (6, 3.5), (7, 2)], # noqa
4: [ (1, 5), (2, 1), (5, 2), (6, 2.5), (7, 2.5)], # noqa
@NicolasHug (Owner):

Same

@ODemidenko (Contributor, Author):

> Could you please provide a few benchmarks (RMSE, MAE, maybe recall / precision as in the FAQ, computation time...)?

No, sorry, I don't have time for this measurement. I added full cosine only to cover all the possible options and to prepare common ground for adjusted cosine (basically, I am just adding the features needed to pass the "Recommender Systems" specialization on Coursera, as I have seen complaints there about the lack of Python libraries supporting this course).

Regarding the reformatting in the test examples: with the previous formatting they were awfully confusing, being improperly aligned (I probably broke that in my previous PR). That certainly should be fixed.

The other issues are fixed.

@NicolasHug (Owner):

> No, sorry, I don't have time for this measurement.

I'm sorry but I can't accept a new algorithm / similarity measure without even having a vague idea of its performance.

@ODemidenko (Contributor, Author):

> I'm sorry but I can't accept a new algorithm / similarity measure without even having a vague idea of its performance.

For the existing similarity metrics, you haven't made such measurements yourself; at least, I have only found measurements for a single default similarity metric. So perhaps you would allow skipping this for a new similarity metric as well?

Otherwise, I kindly ask you to provide detailed requirements on what you are expecting of me.
(It would make sense to add this to the contributors' guide as well, so that prospective contributors understand all your requirements in advance.)
Regarding your request to have computation time reported: I have a different machine than yours, so this value will be misleading when compared to the other algorithms. I propose to skip this requirement as well.

@NicolasHug (Owner):

> For the existing similarity metrics, you haven't made such measurements yourself

Oh, trust me, I have.

> So perhaps you would allow skipping this for a new similarity metric as well?

No, please don't ask me this, really.

> Otherwise, I kindly ask you to provide detailed requirements on what you are expecting of me.

I'm not asking for much. A few CV procedures on ml-100k and ml-1m, comparing performance between only_common_ratings=True and only_common_ratings=False, would be a good start.
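For concreteness, a sketch of the kind of comparison being asked for (assuming the ``'common_ratings_only'`` key from this PR and the ``cross_validate`` helper from ``surprise.model_selection``; the actual numbers would need to be run and reported by the contributor):

```python
from surprise import Dataset, KNNBasic
from surprise.model_selection import cross_validate

data = Dataset.load_builtin('ml-100k')  # repeat with 'ml-1m' as well

for common_only in (True, False):
    sim_options = {'name': 'cosine',
                   'user_based': True,
                   'common_ratings_only': common_only}  # key added by this PR
    algo = KNNBasic(sim_options=sim_options)
    results = cross_validate(algo, data, measures=['rmse', 'mae'], cv=5,
                             verbose=False)
    print('common_ratings_only={}: RMSE={:.4f}, MAE={:.4f}'.format(
        common_only,
        results['test_rmse'].mean(),
        results['test_mae'].mean()))
```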

> (It would make sense to add this to the contributors' guide as well, so that prospective contributors understand all your requirements in advance.)

Good idea

> Regarding your request to have computation time reported: I have a different machine than yours, so this value will be misleading when compared to the other algorithms.

Absolutely. Benchmarking is about comparing; see my point above.

@ODemidenko (Contributor, Author):

Could you give a link to an existing similarity-metric comparison, as an example of what you are asking for? Or just let me know when you will update the contributors' guideline on this.
Perhaps you would agree to do this comparison after we add the adjusted cosine metric as well? It seems a waste of time to do this work twice.

@NicolasHug (Owner):

I don't have any link. I'm just asking for what I've already described in my previous post.

@NicolasHug (Owner):

Any update / benchmark on this?
