
Provide probabilities for lemmatized forms #1105

Open
todd-cook opened this issue Jun 2, 2021 · 5 comments

@todd-cook
Collaborator

from SLang:

Is there an out-of-the-box option to tell the lemmatisation (for Latin) to prefer the noun form over the verb? 

With many forms (such as "materias" or forms of "ars"), I get verb lemmata (from extremely rare verbs) instead of the much more common noun, which would usually be the correct lemma on my data, whenever one of the noun forms is also a possible verb form.

Can I prevent this in any already existing way or do I need to implement something myself?

It would be better for me to always err on the noun side.
todd-cook self-assigned this Jun 2, 2021
@todd-cook
Collaborator Author

I like your idea of preferring by probability, and of taking nouns over verbs. In the short term, we may be able to approximate this with a Counter (a counting dictionary of tokenized forms): the candidate whose forms have the higher count is probably more often the noun. Of course a more robust metric will probably use embeddings and context to pull the lemma forms in the right direction.
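
A minimal sketch of that Counter heuristic, assuming we already have a tokenized corpus and a per-form candidate table; the corpus, the candidate mapping, and the rare-verb lemma shown here are placeholders for illustration, not part of the CLTK API:

from collections import Counter

# Hypothetical tokenized Latin corpus (placeholder data).
corpus_tokens = ["ars", "artis", "artem", "materia", "materias", "materiae"]

# Count how often each surface form occurs in the corpus.
form_counts = Counter(corpus_tokens)

def prefer_frequent_lemma(candidates):
    """Pick the candidate lemma whose attested surface forms are most frequent.

    `candidates` maps each candidate lemma to the surface forms that would
    realize it; both lemma and form lists are assumptions for illustration.
    """
    return max(
        candidates,
        key=lambda lemma: sum(form_counts[form] for form in candidates[lemma]),
    )

# 'materias' could belong to the common noun 'materia' or to a rare verb;
# the noun's forms dominate the counts, so the noun lemma wins.
candidates_for_materias = {
    "materia": ["materia", "materias", "materiae"],
    "materio": ["materias"],  # rare verb, illustrative only
}
print(prefer_frequent_lemma(candidates_for_materias))  # -> 'materia'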

@clemsciences
Member

An intermediate step would be to have a function that takes as input:

  • the token to lemmatize, and
  • a probability threshold,

and returns:

  • a set of lemmata with their probabilities (according to the model); all probabilities must be above the given threshold (see the sketch below).
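
A minimal sketch of such a function, assuming the model exposes per-lemma probabilities as a dict; the `lemma_probabilities` helper and its toy data are placeholders, not an existing CLTK function:

from typing import Dict, Set, Tuple

def lemma_probabilities(token: str) -> Dict[str, float]:
    """Placeholder: return candidate lemmata with model probabilities.

    In practice this would query a trained model, e.g. relative frequencies
    from the EnsembleUnigramLemmatizer's training data.
    """
    toy_model = {
        "materias": {"materia": 0.95, "materio": 0.05},
    }
    return toy_model.get(token, {})

def lemmata_above_threshold(token: str, threshold: float) -> Set[Tuple[str, float]]:
    """Return (lemma, probability) pairs whose probability exceeds the threshold."""
    return {
        (lemma, p)
        for lemma, p in lemma_probabilities(token).items()
        if p > threshold
    }

print(lemmata_above_threshold("materias", 0.5))  # {('materia', 0.95)}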

@diyclassics
Collaborator

The Ensemble Lemmatizer will return the likelihood of possible lemmata, depending on which of the sublemmatizers is used (e.g. relative frequency for the training-data-based EnsembleUnigramLemmatizer).

Here is the example included in https://github.com/cltk/cltk/blob/v0.1.x/cltk/lemmatize/ensemble.py:

test = "arma virumque cano qui".split()
patterns = [
(r'\b(.+)(o|is|it|imus|itis|unt)\b', r'\1o'),
(r'\b(.+)(o|as|at|amus|atis|ant)\b', r'\1o'),
]
EDL = EnsembleDictLemmatizer(lemmas = {'cano': 'cano'}, source='EDL', verbose=True)
EUL = EnsembleUnigramLemmatizer(train=[
        [('arma', 'arma'), ('virumque', 'vir'), ('cano', 'cano')],
        [('arma', 'arma'), ('virumque', 'virus'), ('cano', 'canus')],
        [('arma', 'arma'), ('virumque', 'vir'), ('cano', 'canis')],
        [('arma', 'arma'), ('virumque', 'vir'), ('cano', 'cano')],
        ], verbose=True, backoff=EDL)
ERL = EnsembleRegexpLemmatizer(regexps=patterns, source='Latin Regex Patterns', verbose=True, backoff=EUL)
ensemble_lemmas = ERL.lemmatize(test, lemmas_only=False)
for lemma in ensemble_lemmas:
    print(lemma)

returns...

('arma', [{"<EnsembleUnigramLemmatizer: [[('arma', 'arma'), ...], ...]>": [('arma', 1.0)]}])
('virumque', [{"<EnsembleUnigramLemmatizer: [[('arma', 'arma'), ...], ...]>": [('vir', 0.75), ('virus', 0.25)]}])
('cano', [{'<EnsembleRegexpLemmatizer: Latin Regex Patterns>': [('cano', 1.0)]}, {"<EnsembleUnigramLemmatizer: [[('arma', 'arma'), ...], ...]>": [('cano', 0.5), ('canus', 0.25), ('canis', 0.25)]}, {'<EnsembleDictLemmatizer: EDL>': [('cano', 100)]}])
('qui', [])

The 'scores' can be averaged across lemmatizers and a max taken (or threshold set). But also, using just the Unigram lemmatizer with the returned frequencies would accomplish what @clemsciences mentions in the previous comment.
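A minimal sketch of that averaging step, working directly over the (token, results) structure printed above; this aggregation is not part of the Ensemble Lemmatizer itself, and capping the dict lemmatizer's score of 100 at 1.0 is an assumption:

from collections import defaultdict

def best_lemma(token, sublemmatizer_results, threshold=0.0):
    """Average each lemma's scores across sublemmatizers and take the max.

    `sublemmatizer_results` is the list of per-lemmatizer dicts that
    ERL.lemmatize(..., lemmas_only=False) returns for each token.
    """
    scores = defaultdict(list)
    for result in sublemmatizer_results:
        for lemma_scores in result.values():
            for lemma, score in lemma_scores:
                # Treat scores > 1 (e.g. the dict lemmatizer's 100) as certainty.
                scores[lemma].append(min(score, 1.0))
    averaged = {lemma: sum(s) / len(s) for lemma, s in scores.items()}
    candidates = {l: p for l, p in averaged.items() if p >= threshold}
    return max(candidates, key=candidates.get) if candidates else None

for token, results in ensemble_lemmas:
    print(token, best_lemma(token, results))  # e.g. cano -> 'cano', qui -> None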

It needs further development and testing/evaluation, but the code is there. (Cf. also https://www.studiesaggilinguistici.it/index.php/ssl/article/view/273)

@diyclassics
Collaborator

@todd-cook If helpful, I could write a sublemmatizer that uses POS info in addition to token/lemma info.
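
A rough standalone sketch of what such a POS-aware preference could look like; the POS tags, the candidate table, and the noun-over-verb priority are all assumptions for illustration, and a real sublemmatizer would plug into the ensemble backoff chain rather than stand alone:

# Candidate lemmata for a form, annotated with a hypothetical POS tag.
CANDIDATES = {
    "materias": [("materia", "NOUN"), ("materio", "VERB")],
}

# Preference order requested in this issue: err on the noun side.
POS_PRIORITY = {"NOUN": 0, "VERB": 1}

def lemmatize_with_pos(token):
    """Return the candidate lemma with the most-preferred POS tag."""
    candidates = CANDIDATES.get(token)
    if not candidates:
        return None
    lemma, _pos = min(candidates, key=lambda c: POS_PRIORITY.get(c[1], 99))
    return lemma

print(lemmatize_with_pos("materias"))  # -> 'materia'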

@diyclassics
Collaborator

> Of course a more robust metric will probably use embeddings and context to pull the lemma forms in the right direction.

@todd-cook Been meaning to implement this as well—perhaps now is the time.
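
A minimal sketch of the embedding-and-context idea, using plain numpy and a toy vector table; the vectors and candidate lists are placeholders, and in practice they would come from pretrained Latin embeddings and the lemmatizer's candidate set:

import numpy as np

# Toy word vectors; real ones would come from pretrained Latin embeddings.
VECTORS = {
    "ars":     np.array([0.9, 0.1]),
    "materia": np.array([0.8, 0.2]),
    "materio": np.array([0.1, 0.9]),
    "scribo":  np.array([0.2, 0.8]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def disambiguate(candidates, context_words):
    """Pick the candidate lemma closest to the averaged context vector."""
    context = [VECTORS[w] for w in context_words if w in VECTORS]
    if not context:
        return candidates[0]
    centroid = np.mean(context, axis=0)
    return max(candidates, key=lambda lemma: cosine(VECTORS[lemma], centroid))

# In a noun-heavy context, the noun lemma 'materia' wins over the rare verb 'materio'.
print(disambiguate(["materia", "materio"], ["ars"]))  # -> 'materia'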
