Provide probabilities for lemmatized forms #1105
Comments
I like your idea of preferring lemmas by probability, and taking nouns over verbs. In the short term, we may be able to approximate this with a Counter (a counting dictionary of tokenized forms); the form with the higher count is probably more often the noun form. Of course, a more robust metric will probably use embeddings and context to pull the lemma forms in the right direction.
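A rough sketch of that Counter heuristic (the corpus, the candidate forms, and the `prefer_by_frequency` helper are all hypothetical, not CLTK code):

```python
from collections import Counter

# Hypothetical tokenized corpus: surface-form counts stand in for
# how often each lemma form appears.
tokens = ["amor", "amo", "amor", "amor", "amat"]
form_counts = Counter(tokens)

def prefer_by_frequency(candidates):
    """Pick the candidate lemma whose form is most frequent in the
    corpus; unseen forms count as zero."""
    return max(candidates, key=lambda form: form_counts.get(form, 0))

# 'amor' (noun lemma) beats 'amo' (verb lemma), 3 occurrences vs. 1.
print(prefer_by_frequency(["amor", "amo"]))  # -> 'amor'
```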
An intermediate step would be to have a function that takes a token as input and returns its candidate lemmas, each paired with a probability.
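A minimal sketch of such an interface, assuming per-token probabilities that sum to one (the name `lemma_probabilities` and the toy lookup table are assumptions, not an existing API):

```python
from typing import List, Tuple

def lemma_probabilities(token: str) -> List[Tuple[str, float]]:
    """Hypothetical interface: return candidate lemmas for `token`,
    each with a probability; probabilities for a token sum to 1.0."""
    # Toy lookup table standing in for a trained model.
    table = {"est": [("sum", 0.9), ("edo", 0.1)]}
    return table.get(token, [(token, 1.0)])

print(lemma_probabilities("est"))  # [('sum', 0.9), ('edo', 0.1)]
```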
The Ensemble Lemmatizer will return the likelihood of possible lemmas, depending on which of the sublemmatizers is used (e.g., frequency for the training-data-based EnsembleUnigramLemmatizer). Here is the example included in https://github.com/cltk/cltk/blob/v0.1.x/cltk/lemmatize/ensemble.py:
returns...
The 'scores' can be averaged across lemmatizers and a max taken (or a threshold set). But also, using just the Unigram lemmatizer with the returned frequencies would accomplish what @clemsciences mentions in the previous comment. It needs further development and testing/evaluation, but the code is there. (Cf. also https://www.studiesaggilinguistici.it/index.php/ssl/article/view/273)
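A minimal sketch of that averaging step, assuming each sublemmatizer yields a dict of lemma → score for a token (the input shape and the `combine_scores` helper are illustrative, not the exact structure in ensemble.py):

```python
from collections import defaultdict

def combine_scores(per_lemmatizer_scores, threshold=0.0):
    """Average each candidate lemma's score over the sublemmatizers
    that proposed it, then return candidates at or above `threshold`,
    best first. Input: one {lemma: score} dict per lemmatizer."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for scores in per_lemmatizer_scores:
        for lemma, score in scores.items():
            totals[lemma] += score
            counts[lemma] += 1
    ranked = sorted(
        ((lemma, totals[lemma] / counts[lemma]) for lemma in totals),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [(lemma, s) for lemma, s in ranked if s >= threshold]

# Two hypothetical sublemmatizers scoring the token 'est'.
print(combine_scores([{"sum": 0.8, "edo": 0.2}, {"sum": 1.0}]))
# -> [('sum', 0.9), ('edo', 0.2)]
```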
@todd-cook If helpful, I could write a sublemmatizer that uses POS info in addition to token/lemma info.
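Such a POS-aware sublemmatizer might look something like this (entirely hypothetical: the (token, pos, lemma) training format and `train_pos_lemmatizer` are assumptions, not an existing CLTK interface):

```python
from collections import Counter, defaultdict

def train_pos_lemmatizer(training_data):
    """Build a (token, pos) -> {lemma: probability} table from
    (token, pos, lemma) triples. The training format is hypothetical."""
    counts = defaultdict(Counter)
    for token, pos, lemma in training_data:
        counts[(token, pos)][lemma] += 1
    return {
        key: {lemma: n / sum(c.values()) for lemma, n in c.items()}
        for key, c in counts.items()
    }

model = train_pos_lemmatizer([
    ("amor", "NOUN", "amor"),
    ("amor", "VERB", "amo"),  # passive 'amor' = 'I am loved'
    ("amor", "NOUN", "amor"),
])
print(model[("amor", "NOUN")])  # {'amor': 1.0}
print(model[("amor", "VERB")])  # {'amo': 1.0}
```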
@todd-cook Been meaning to implement this as well—perhaps now is the time. |
from SLang: