
Provide probabilities for lemmatized forms #1105

Open
todd-cook opened this issue Jun 2, 2021 · 5 comments

@todd-cook
Collaborator

from SLang:

Is there an out-of-the-box option to tell the lemmatisation (for Latin) to prefer the noun form over the verb? 

With many forms (such as "materias" or forms of "ars"), I get verb lemmata (from extremely rare verbs) instead of the much more common noun, which would usually be the correct lemma on my data, whenever one of the noun forms is also a possible verb form.

Can I prevent this in any already existing way or do I need to implement something myself?

It would be better for me to always err on the noun side.
todd-cook self-assigned this Jun 2, 2021
@todd-cook
Collaborator Author

I like your idea of preferring by probability, and of taking nouns over verbs. In the short term, we may be able to approximate this with a Counter (a counting dictionary of tokenized forms): the candidate whose forms have the higher count is probably more often the noun. Of course a more robust metric will probably use embeddings and context to pull the lemma forms in the right direction.
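
A minimal sketch of that Counter heuristic, assuming we already have a tokenized corpus and a per-form candidate table; the corpus, the candidate mapping, and the rare-verb lemma shown here are placeholders for illustration, not part of the CLTK API:

from collections import Counter

# Hypothetical tokenized Latin corpus (placeholder data).
corpus_tokens = ["ars", "artis", "artem", "materia", "materias", "materiae"]

# Count how often each surface form occurs in the corpus.
form_counts = Counter(corpus_tokens)

def prefer_frequent_lemma(candidates):
    """Pick the candidate lemma whose attested surface forms are most frequent.

    `candidates` maps each candidate lemma to the surface forms that would
    realize it; both lemma and form lists are assumptions for illustration.
    """
    return max(
        candidates,
        key=lambda lemma: sum(form_counts[form] for form in candidates[lemma]),
    )

# 'materias' could belong to the common noun 'materia' or to a rare verb;
# the noun's forms dominate the counts, so the noun lemma wins.
candidates_for_materias = {
    "materia": ["materia", "materias", "materiae"],
    "materio": ["materias"],  # rare verb, illustrative only
}
print(prefer_frequent_lemma(candidates_for_materias))  # -> 'materia'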

@clemsciences
Member

An intermediate step would be to have a function that takes as input:

  • the token to lemmatize, and
  • a probability threshold,

and returns:

  • a set of lemmata with their probabilities (according to the model); all probabilities must be above the given threshold (see the sketch below).
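
A minimal sketch of such a function, assuming the model exposes per-lemma probabilities as a dict; the `lemma_probabilities` helper and its toy data are placeholders, not an existing CLTK function:

from typing import Dict, Set, Tuple

def lemma_probabilities(token: str) -> Dict[str, float]:
    """Placeholder: return candidate lemmata with model probabilities.

    In practice this would query a trained model, e.g. relative frequencies
    from the EnsembleUnigramLemmatizer's training data.
    """
    toy_model = {
        "materias": {"materia": 0.95, "materio": 0.05},
    }
    return toy_model.get(token, {})

def lemmata_above_threshold(token: str, threshold: float) -> Set[Tuple[str, float]]:
    """Return (lemma, probability) pairs whose probability exceeds the threshold."""
    return {
        (lemma, p)
        for lemma, p in lemma_probabilities(token).items()
        if p > threshold
    }

print(lemmata_above_threshold("materias", 0.5))  # {('materia', 0.95)}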

@diyclassics
Collaborator

The Ensemble Lemmatizer will return the likelihood of possible lemmata, depending on which of the sublemmatizers is used (e.g. relative frequency for the training-data-based EnsembleUnigramLemmatizer).

Here is the example included in https://github.com/cltk/cltk/blob/v0.1.x/cltk/lemmatize/ensemble.py:

test = "arma virumque cano qui".split()
patterns = [
(r'\b(.+)(o|is|it|imus|itis|unt)\b', r'\1o'),
(r'\b(.+)(o|as|at|amus|atis|ant)\b', r'\1o'),
]
EDL = EnsembleDictLemmatizer(lemmas = {'cano': 'cano'}, source='EDL', verbose=True)
EUL = EnsembleUnigramLemmatizer(train=[
        [('arma', 'arma'), ('virumque', 'vir'), ('cano', 'cano')],
        [('arma', 'arma'), ('virumque', 'virus'), ('cano', 'canus')],
        [('arma', 'arma'), ('virumque', 'vir'), ('cano', 'canis')],
        [('arma', 'arma'), ('virumque', 'vir'), ('cano', 'cano')],
        ], verbose=True, backoff=EDL)
ERL = EnsembleRegexpLemmatizer(regexps=patterns, source='Latin Regex Patterns', verbose=True, backoff=EUL)
ensemble_lemmas = ERL.lemmatize(test, lemmas_only=False)
for lemma in ensemble_lemmas:
    print(lemma)

returns...

('arma', [{"<EnsembleUnigramLemmatizer: [[('arma', 'arma'), ...], ...]>": [('arma', 1.0)]}])
('virumque', [{"<EnsembleUnigramLemmatizer: [[('arma', 'arma'), ...], ...]>": [('vir', 0.75), ('virus', 0.25)]}])
('cano', [{'<EnsembleRegexpLemmatizer: Latin Regex Patterns>': [('cano', 1.0)]}, {"<EnsembleUnigramLemmatizer: [[('arma', 'arma'), ...], ...]>": [('cano', 0.5), ('canus', 0.25), ('canis', 0.25)]}, {'<EnsembleDictLemmatizer: EDL>': [('cano', 100)]}])
('qui', [])

The 'scores' can be averaged across lemmatizers and a max taken (or threshold set). But also, using just the Unigram lemmatizer with the returned frequencies would accomplish what @clemsciences mentions in the previous comment.
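A minimal sketch of that averaging step, working directly over the (token, results) structure printed above; this aggregation is not part of the Ensemble Lemmatizer itself, and capping the dict lemmatizer's score of 100 at 1.0 is an assumption:

from collections import defaultdict

def best_lemma(token, sublemmatizer_results, threshold=0.0):
    """Average each lemma's scores across sublemmatizers and take the max.

    `sublemmatizer_results` is the list of per-lemmatizer dicts that
    ERL.lemmatize(..., lemmas_only=False) returns for each token.
    """
    scores = defaultdict(list)
    for result in sublemmatizer_results:
        for lemma_scores in result.values():
            for lemma, score in lemma_scores:
                # Treat scores > 1 (e.g. the dict lemmatizer's 100) as certainty.
                scores[lemma].append(min(score, 1.0))
    averaged = {lemma: sum(s) / len(s) for lemma, s in scores.items()}
    candidates = {l: p for l, p in averaged.items() if p >= threshold}
    return max(candidates, key=candidates.get) if candidates else None

for token, results in ensemble_lemmas:
    print(token, best_lemma(token, results))  # e.g. cano -> 'cano', qui -> None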

It needs further development and testing/evaluation, but the code is there. (Cf. also https://www.studiesaggilinguistici.it/index.php/ssl/article/view/273)

@diyclassics
Collaborator

@todd-cook If helpful, I could write a sublemmatizer that uses POS info in addition to token/lemma info.
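
A rough standalone sketch of what such a POS-aware preference could look like; the POS tags, the candidate table, and the noun-over-verb priority are all assumptions for illustration, and a real sublemmatizer would plug into the ensemble backoff chain rather than stand alone:

# Candidate lemmata for a form, annotated with a hypothetical POS tag.
CANDIDATES = {
    "materias": [("materia", "NOUN"), ("materio", "VERB")],
}

# Preference order requested in this issue: err on the noun side.
POS_PRIORITY = {"NOUN": 0, "VERB": 1}

def lemmatize_with_pos(token):
    """Return the candidate lemma with the most-preferred POS tag."""
    candidates = CANDIDATES.get(token)
    if not candidates:
        return None
    lemma, _pos = min(candidates, key=lambda c: POS_PRIORITY.get(c[1], 99))
    return lemma

print(lemmatize_with_pos("materias"))  # -> 'materia'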

@diyclassics
Collaborator

> Of course a more robust metric will probably use embeddings and context to pull the lemma forms in the right direction.

@todd-cook Been meaning to implement this as well—perhaps now is the time.
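
A minimal sketch of the embedding-and-context idea, using plain numpy and a toy vector table; the vectors and candidate lists are placeholders, and in practice they would come from pretrained Latin embeddings and the lemmatizer's candidate set:

import numpy as np

# Toy word vectors; real ones would come from pretrained Latin embeddings.
VECTORS = {
    "ars":     np.array([0.9, 0.1]),
    "materia": np.array([0.8, 0.2]),
    "materio": np.array([0.1, 0.9]),
    "scribo":  np.array([0.2, 0.8]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def disambiguate(candidates, context_words):
    """Pick the candidate lemma closest to the averaged context vector."""
    context = [VECTORS[w] for w in context_words if w in VECTORS]
    if not context:
        return candidates[0]
    centroid = np.mean(context, axis=0)
    return max(candidates, key=lambda lemma: cosine(VECTORS[lemma], centroid))

# In a noun-heavy context, the noun lemma 'materia' wins over the rare verb 'materio'.
print(disambiguate(["materia", "materio"], ["ars"]))  # -> 'materia'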
