
Cosine full vector implementation #141

Open
@ODemidenko wants to merge 1 commit into master from cosine_full_vector
Conversation

@ODemidenko (Contributor):

Implemented cosine on full vectors (as requested; the adjusted cosine PR will be made separately, after this one). Tests and docs added.
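A minimal usage sketch of the new option (``'common_ratings_only'`` is the key added by this PR; the other ``sim_options`` keys already exist in the library):

```python
from surprise import Dataset, KNNBasic

# 'common_ratings_only' is the option introduced by this PR; setting it to
# False makes the cosine similarity use the full rating vectors instead of
# only the commonly rated items.
sim_options = {'name': 'cosine',
               'user_based': True,
               'common_ratings_only': False}

data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()

algo = KNNBasic(sim_options=sim_options)
algo.fit(trainset)
```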

@ODemidenko force-pushed the cosine_full_vector branch 2 times, most recently from 3fd53d4 to 6c4a819 on February 5, 2018, 11:02
@NicolasHug (Owner) left a comment:

Thanks for the PR, I made a quick review (without diving into the code).
Could you please provide a few benchmarks (RMSE, MAE, maybe recall / precision as in the FAQ, computation time...)?

@@ -130,6 +130,9 @@ argument is a dictionary with the following (all optional) keys:
``'False'``) for the similarity not to be zero. Simply put, if
:math:`|I_{uv}| < \text{min_support}` then :math:`\text{sim}(u, v) = 0`. The
same goes for items.
- ``'common_ratings_only'``: Determines whether only common user/item ratings are
  taken into account, or whether the full rating vectors are considered
  (only relevant for cosine-based similarity). Default is True.
@NicolasHug (Owner):

Should be 'Default is ``True``', i.e. True surrounded by two backticks.


Depending on ``common_ratings_only`` field of ``sim_options``
only common users (or items) are taken into account, or full rating
vectors (default: True).
@NicolasHug (Owner):

Same here

sqi[xi, xj] += ri**2
sqj[xi, xj] += rj**2
sqi[xi, xj] += ri ** 2
sqj[xi, xj] += rj ** 2
@NicolasHug (Owner):

No need to change this?

xi_iter = iter(sorted_y_ratings)
try:
xi_non_missing, ri_non_missing = next(xi_iter)
except StopIteration:
@NicolasHug (Owner):

Could all the StopIteration be avoided?
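(Not the PR's code, just an illustrative sketch of how the full-vector variant could be written without the iterator merging and StopIteration handling: with missing ratings treated as zero, only co-rated items contribute to the numerator, so the pairwise accumulation can stay as in the existing cosine and only the per-x squared norms change.)

```python
import numpy as np

def cosine_full_vector(n_x, yr, min_support):
    # prods[xi, xj]: dot product over co-rated ys (missing ratings contribute 0)
    prods = np.zeros((n_x, n_x))
    # freq[xi, xj]: number of common ys, used for the min_support check
    freq = np.zeros((n_x, n_x))
    # sq[xi]: squared norm of xi's full rating vector
    sq = np.zeros(n_x)

    for y, y_ratings in yr.items():
        for xi, ri in y_ratings:
            sq[xi] += ri ** 2
            for xj, rj in y_ratings:
                prods[xi, xj] += ri * rj
                freq[xi, xj] += 1

    sim = np.zeros((n_x, n_x))
    for xi in range(n_x):
        sim[xi, xi] = 1
        for xj in range(xi + 1, n_x):
            if freq[xi, xj] >= min_support and sq[xi] and sq[xj]:
                sim[xi, xj] = prods[xi, xj] / np.sqrt(sq[xi] * sq[xj])
            sim[xj, xi] = sim[xi, xj]
    return sim
```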

@@ -149,7 +244,7 @@ def msd(n_x, yr, min_support):
for y, y_ratings in iteritems(yr):
for xi, ri in y_ratings:
for xj, rj in y_ratings:
sq_diff[xi, xj] += (ri - rj)**2
sq_diff[xi, xj] += (ri - rj) ** 2
@NicolasHug (Owner):

Not relevant to the PR

sqi[xi, xj] += ri**2
sqj[xi, xj] += rj**2
sqi[xi, xj] += ri ** 2
sqj[xi, xj] += rj ** 2
@NicolasHug (Owner):

Same

sq_diff_i[xi, xj] += diff_i**2
sq_diff_j[xi, xj] += diff_j**2
sq_diff_i[xi, xj] += diff_i ** 2
sq_diff_j[xi, xj] += diff_j ** 2
@NicolasHug (Owner):

Same

3: [(1, 1), (2, 4), (3, 2), (4, 3), (5, 3), (6, 3.5), (7, 2)], # noqa
4: [(1, 5), (2, 1), (5, 2), (6, 2.5), (7, 2.5)], # noqa
3: [ (1, 1), (2, 4), (3, 2), (4, 3), (5, 3), (6, 3.5), (7, 2)], # noqa
4: [ (1, 5), (2, 1), (5, 2), (6, 2.5), (7, 2.5)], # noqa
@NicolasHug (Owner):

Same

@ODemidenko (Contributor, Author):

> Could you please provide a few benchmarks (RMSE, MAE, maybe recall / precision as in the FAQ, computation time...)?

No, sorry, I don't have time for this measurement. I added full cosine only to cover all the possible options and to prepare common ground for adjusted cosine (basically, I am just adding the features needed to pass the "Recommender Systems" specialization on Coursera, as I have seen complaints there about the lack of Python libraries supporting this course).

Regarding the reformatting in the test examples: with the previous formatting they were awfully confusing, being improperly aligned (I probably broke that in my previous PR). That certainly should be fixed.

The other issues are fixed.

@NicolasHug (Owner):

> No, sorry, I don't have time for this measurement.

I'm sorry but I can't accept a new algorithm / similarity measure without even having a vague idea of its performance.

@ODemidenko (Contributor, Author):

> I'm sorry but I can't accept a new algorithm / similarity measure without even having a vague idea of its performance.

For the existing similarity metrics, you haven't made such measurements yourself; at least, I have only found measurements for a single default similarity metric. So perhaps you would allow skipping this for a new similarity metric as well?

Otherwise, I kindly ask you to provide detailed requirements on what you are expecting of me.
(It would make sense to add this to the contributors' guide as well, so that prospective contributors understand all your requirements in advance.)
Regarding your request to have computation time reported: I have a different machine than yours, so this value will be misleading when compared to the other algorithms. I propose to skip this requirement as well.

@NicolasHug (Owner):

> For the existing similarity metrics, you haven't made such measurements yourself

Oh, trust me, I have.

> So perhaps you would allow skipping this for a new similarity metric as well?

No, please don't ask me this, really.

> Otherwise, I kindly ask you to provide detailed requirements on what you are expecting of me.

I'm not asking for much. A few CV procedures on ml-100k and ml-1m, comparing performance between only_common_ratings=True and only_common_ratings=False, would be a good start.
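For concreteness, a sketch of the kind of comparison being asked for (assuming the ``'common_ratings_only'`` key from this PR and the ``cross_validate`` helper from ``surprise.model_selection``; the actual numbers would need to be run and reported by the contributor):

```python
from surprise import Dataset, KNNBasic
from surprise.model_selection import cross_validate

data = Dataset.load_builtin('ml-100k')  # repeat with 'ml-1m' as well

for common_only in (True, False):
    sim_options = {'name': 'cosine',
                   'user_based': True,
                   'common_ratings_only': common_only}  # key added by this PR
    algo = KNNBasic(sim_options=sim_options)
    results = cross_validate(algo, data, measures=['rmse', 'mae'], cv=5,
                             verbose=False)
    print('common_ratings_only={}: RMSE={:.4f}, MAE={:.4f}'.format(
        common_only,
        results['test_rmse'].mean(),
        results['test_mae'].mean()))
```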

> (It would make sense to add this to the contributors' guide as well, so that prospective contributors understand all your requirements in advance.)

Good idea

> Regarding your request to have computation time reported: I have a different machine than yours, so this value will be misleading when compared to the other algorithms.

Absolutely. Benchmarking is about comparing; see my point above.

@ODemidenko (Contributor, Author):

Could you give a link to an existing similarity-metric comparison, as an example of what you are asking for? Or just let me know when you will update the contributors' guideline on this.
Perhaps you would agree to do this comparison after we add the adjusted cosine metric as well? It seems a waste of time to do this work twice.

@NicolasHug (Owner):

I don't have any link. I'm just asking for what I've already described in my previous post.

@NicolasHug (Owner):

Any update / benchmark on this?
