add functions to reproduce preprocessing matching GoogleNews, GloVe, etc. pretrained word-vectors
#3485
Comments
My thoughts: A desire for help here has come up a lot, & at times I've shared my observations about what can be deduced from the limited statements, & observable contents, of pretrained vector sets like the 'GoogleNews' release. However, without disclosures (or better yet, code) from the original researchers who prepared such pretrained vectors, all such efforts will only ever gradually approximate their practices, with lingering exceptions & caveats generating more questions.

Also: it often seems to be beginner & small-data projects that are most eager to reuse pretrained vectors from elsewhere, under the assumption that those must be the "right" thing, or better than what they'd achieve themselves. But many times that's not the case. For example,

So while I'd see some value in a "best guess" function to mimic the tokenizing choices of those commonly used pretrained sets, as a research effort or contribution, I'd also prefer it prominently disclaimered as non-official, & not necessarily an endorsement of preferring those vectors, and that tokenization, for anyone's particular purpose.

At this point, devising such helpers would be a sort of software-archeology/mystery project, and I'd not see it as any sort of urgent priority. But it might make a good new-contributor, student, or hackathon project, especially if eventual integration includes good surrounding docs/discussion/demos of the limits/considerations involved in reusing another project's vectors/preprocessing choices.
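
To make the "deduce it from observable contents" idea a bit more concrete, here is a minimal sketch of the kind of vocabulary survey such an archeology effort might start from. It only counts surface features of tokens (case, underscores, digits, `#` characters, punctuation-only tokens) that could hint at what the original preprocessing did. The function name `summarize_vocab_conventions` and the toy token list are hypothetical and not part of gensim or any pretrained release; with gensim 4.x, the token list would presumably come from a loaded `KeyedVectors` object's `index_to_key`.

```python
from collections import Counter


def summarize_vocab_conventions(tokens):
    """Count surface features of vocabulary tokens that may hint at
    the (undocumented) preprocessing behind a pretrained vector set.

    Hypothetical helper for illustration only; not a gensim API.
    """
    counts = Counter()
    for tok in tokens:
        counts["total"] += 1
        if any(ch.isupper() for ch in tok):
            counts["has_uppercase"] += 1      # suggests case was preserved
        if "_" in tok:
            counts["has_underscore"] += 1     # may indicate joined phrases
        if any(ch.isdigit() for ch in tok):
            counts["has_digit"] += 1          # digits left as-is
        if "#" in tok:
            counts["has_hash"] += 1           # may indicate masked digits
        if tok and not any(ch.isalnum() for ch in tok):
            counts["punctuation_only"] += 1   # punctuation kept as tokens
    return dict(counts)


# Made-up toy vocabulary, NOT taken from any real pretrained set.
sample = ["the", "New_York", "####", "3.14", ",", "iPhone"]
print(summarize_vocab_conventions(sample))
```

Counts like these only narrow down the possibilities; they don't reveal the actual rules the original researchers applied, so anything built on them would still need the "best guess, non-official" disclaimer.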
Suggested on project discussion list (https://groups.google.com/g/gensim/c/CsER2XBs8P4/m/f2EntuXRAgAJ):