Strange behaviour of `LatinBackoffLemmatizer` with plural nouns of the second declension #1198

DavideMassidda · 2023-01-17T15:16:51Z

When processing plural Latin nouns of the second declension, the LatinBackoffLemmatizer sometimes appends a trailing digit to the lemma.

I observed this strange behaviour with the term "lupus":

from cltk.lemmatize.lat import LatinBackoffLemmatizer
lemmatizer = LatinBackoffLemmatizer()

lupus = ['lupi','luporum','lupis','lupos','lupi','lupis']

lemmatizer.lemmatize(lupus)
[('lupi', 'lupus'), ('luporum', 'lupus1'), ('lupis', 'lupus1'), ('lupos', 'lupus1'), ('lupi', 'lupus'), ('lupis', 'lupus1')]

On the other hand, the term "amicus" does not present this bug:

amicus = ['amici','amicorum','amicis','amicos','amici','amicis']

lemmatizer.lemmatize(amicus)
[('amici', 'amicus'), ('amicorum', 'amicus'), ('amicis', 'amicus'), ('amicos', 'amicus'), ('amici', 'amicus'), ('amicis', 'amicus')]

I guess the fault lies with the DictLemmatizer:

lemmatizer = LatinBackoffLemmatizer(verbose=True)
lemmatizer.lemmatize(lupus)

[('lupi', 'lupus', '<UnigramLemmatizer: CLTK Sentence Training Data>'), ('luporum', 'lupus1', '<DictLemmatizer: Morpheus Lemmas>'), ('lupis', 'lupus1', '<DictLemmatizer: Morpheus Lemmas>'), ('lupos', 'lupus1', '<DictLemmatizer: Morpheus Lemmas>'), ('lupi', 'lupus', '<UnigramLemmatizer: CLTK Sentence Training Data>'), ('lupis', 'lupus1', '<DictLemmatizer: Morpheus Lemmas>')]

lemmatizer.lemmatize(amicus)

[('amici', 'amicus', '<UnigramLemmatizer: CLTK Sentence Training Data>'), ('amicorum', 'amicus', '<UnigramLemmatizer: CLTK Sentence Training Data>'), ('amicis', 'amicus', '<UnigramLemmatizer: CLTK Sentence Training Data>'), ('amicos', 'amicus', '<UnigramLemmatizer: CLTK Sentence Training Data>'), ('amici', 'amicus', '<UnigramLemmatizer: CLTK Sentence Training Data>'), ('amicis', 'amicus', '<UnigramLemmatizer: CLTK Sentence Training Data>')]
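To make the pattern in the verbose output easier to see, the triples can be grouped by the backoff stage that produced each lemma. This is only a sketch: `lemmas_by_source` is a hypothetical helper (not part of cltk), and the data below is copied from the "lupus" output above rather than produced by a live call.

```python
from collections import defaultdict

# Triples copied from the verbose output above: (token, lemma, source).
verbose_output = [
    ("lupi", "lupus", "<UnigramLemmatizer: CLTK Sentence Training Data>"),
    ("luporum", "lupus1", "<DictLemmatizer: Morpheus Lemmas>"),
    ("lupis", "lupus1", "<DictLemmatizer: Morpheus Lemmas>"),
    ("lupos", "lupus1", "<DictLemmatizer: Morpheus Lemmas>"),
]


def lemmas_by_source(triples):
    """Hypothetical helper: map each backoff stage to the set of lemmas it returned."""
    grouped = defaultdict(set)
    for _token, lemma, source in triples:
        grouped[source].add(lemma)
    return dict(grouped)


for source, lemmas in lemmas_by_source(verbose_output).items():
    print(source, sorted(lemmas))
```

In this sample, every lemma with a trailing digit comes from the DictLemmatizer stage, while the UnigramLemmatizer stage returns the plain form.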

Environment: Windows 10 + Python 3.9.15 + cltk 1.1.6


clemsciences · 2023-01-19T22:26:58Z

Different lemmas can share an identical form. For example, jus is the form of one lemma meaning "law" or "right" and of another lemma meaning "gravy" or "juice". To distinguish them, ambiguous lemmas get a trailing number: here, jus1 and jus2.
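If the homograph index is not needed downstream, it can be stripped after lemmatization. The following is a minimal sketch, assuming the index is always a run of digits at the very end of the lemma; `strip_homograph_index` is a hypothetical helper, not a cltk function, and the sample pairs are taken from the output reported in this issue.

```python
import re

# Assumption: the disambiguating index is one or more digits at the end of the lemma.
_HOMOGRAPH_SUFFIX = re.compile(r"\d+$")


def strip_homograph_index(lemma: str) -> str:
    """Hypothetical helper: remove a trailing numeric index such as the '1' in 'lupus1'."""
    return _HOMOGRAPH_SUFFIX.sub("", lemma)


# (token, lemma) pairs as reported in this issue.
pairs = [("lupi", "lupus"), ("luporum", "lupus1"), ("lupis", "lupus1")]
normalized = [(token, strip_homograph_index(lemma)) for token, lemma in pairs]
print(normalized)
# [('lupi', 'lupus'), ('luporum', 'lupus'), ('lupis', 'lupus')]
```

Stripping the index discards the distinction between homographs such as jus1 and jus2, so it only makes sense when that distinction does not matter for the task at hand.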

The rule-based lemmatizer is this one (https://github.com/cltk/lat_models_cltk/blob/master/lemmata/latin_lemmata_cltk.py), as far as I know.

clemsciences · 2023-01-19T22:29:06Z

@diyclassics can probably give you more details on how to know which meaning is attached to which lemma.

DavideMassidda · 2023-01-20T09:44:18Z

Thank you very much, Clément! So, this isn't a bug, but a precise choice: the final number is used to disambiguate. Good to know!

clemsciences · 2023-01-21T14:59:10Z

This is not a bug, but it should be better documented.

DavideMassidda added the bug label Jan 17, 2023

DavideMassidda changed the title ~~Strange behavior of LatinBackoffLemmatizer with plural nouns of the second declension~~ Strange behaviour of LatinBackoffLemmatizer with plural nouns of the second declension Jan 17, 2023

clemsciences added acknowledged latin labels Jan 17, 2023

clemsciences assigned diyclassics Jan 19, 2023

clemsciences removed the bug label Jan 21, 2023

clemsciences mentioned this issue Mar 5, 2023

Improve docs for Latin lemmas #1211

Open
