Gensim Word2Vec produces different most_similar results through final epoch than end of training #3429
Comments
I believe this is the same as, or related to, #2260, but the removal of the (at-risk-of-staleness) full cache of unit-normalized vectors in Gensim 4.0 should have addressed that. There is still a much smaller cache of each vector's own magnitude, in `KeyedVectors.norms`. If there's truly still an issue in Gensim 4.0+, it'd be good to have a small, fully self-contained example that vividly demonstrates it. That would rule out something idiosyncratic in @Joshkking's setup, and likely point to some fix, new workaround, or new warning we could show.
@gojomo The instigating code is work-related and has moved on to an updated environment with that class stripped out. I'll see if I can get the chance to replicate this on a smaller, publicly accessible corpus, though it may be a while.
Problem description

Word2Vec callbacks produce greatly different `most_similar()` results at the last end-of-epoch callback than immediately after training. The expectation would be that the final end-of-epoch similarity results are identical to, or a close approximation of, the post-training results.

Steps/code/corpus to reproduce
I'm using gensim's Word2Vec for a recommendation-like task, with part of my evaluation being the use of callbacks and the `most_similar()` method. However, I am noticing a huge disparity between the final few epoch callbacks and the results immediately post-training. In fact, the last epoch callback may often appear worthless, while the post-training result is as good as could be desired.

My during-training tracking of most-similar entries uses gensim's `CallbackAny2Vec` class. It follows the doc example fairly directly and roughly looks like:
As the epochs progress, the `most_similar()` results given by the callbacks do not seem to indicate an advancement of learning, and seem erratic. In fact, the callback from the first epoch often shows the best result.

Counterintuitively, I also have an additional process (not shown) built into the callback that does indicate gradual learning. Following the similarity print, I take the current model's vectors and evaluate them against a downstream task. In brief, this process is a scikit-learn `GridSearchCV` logistic-regression check against some known labels.

I find that the last `on_epoch_end` callback often indicates little learning in my particular use case. However, if I try the similarity call again directly after training the model, I tend to get beautiful results that agree with the downstream evaluation task also used in the callbacks, or that are at least vastly different from those of the final epoch's end.
I suspect `most_similar()` behaves unusually when called from during-training epoch-end callbacks, but I would be happy to learn instead that my approach is flawed.

Versions
```
macOS-10.16-x86_64-i386-64bit
Python 3.9.7 (default, Sep 16 2021, 08:50:36)
[Clang 10.0.0]
Bits 64
NumPy 1.21.2
SciPy 1.7.3
gensim 4.1.2
FAST_VERSION 1
```