Add functionality to return the lemmas of words used in a corpus. #3256

Sion1225 · 2024-05-21T16:09:35Z

I have been working on natural language processing and often need to know which words are used in a given corpus. Many dictionaries are built from word stems, so looking words up requires extracting the stems from sentences. For example, specific words can be key clues or carry important information, which makes it useful to extract the sentences that contain those words or to process sentence information based on them.

In this context, I have developed a class called AutoLemmatizer in stem/wordnet.py that automatically performs tokenization and part-of-speech-based lemmatization, returning the lemmas of all words used in a sentence.

I also considered converting "n't" to "not", but have not implemented it because I am not sure it is a good idea.

>>> from nltk.stem import AutoLemmatizer
>>> auto_wnl = AutoLemmatizer()
>>> print(auto_wnl.auto_lemmatize('Proverbs are short sentences drawn from long experience.'))
['Proverbs', 'be', 'short', 'sentence', 'draw', 'from', 'long', 'experience', '.']
>>> print(auto_wnl.auto_lemmatize('proverbs are short sentences drawn from long experience.'))
['proverb', 'be', 'short', 'sentence', 'draw', 'from', 'long', 'experience', '.']
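
To illustrate the part-of-speech-based step described above, here is a minimal sketch. It is not the AutoLemmatizer code from the pull request; the names `penn_to_wordnet_pos` and `lemmatize_tagged` are hypothetical, and the sketch assumes the tagger emits Penn Treebank tags (as `nltk.pos_tag` does by default). The lemmatizer is passed in as a callable so the example runs without any NLTK data installed.

```python
from typing import Callable, Iterable, List, Tuple

# Penn Treebank tag prefixes mapped to the single-letter POS codes
# that WordNet-style lemmatizers expect (assumed mapping for this sketch).
_PENN_PREFIX_TO_WORDNET = {
    "JJ": "a",  # adjectives: JJ, JJR, JJS
    "VB": "v",  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    "NN": "n",  # nouns: NN, NNS, NNP, NNPS
    "RB": "r",  # adverbs: RB, RBR, RBS
}


def penn_to_wordnet_pos(tag: str, default: str = "n") -> str:
    """Map a Penn Treebank tag to a WordNet POS code, falling back to `default`."""
    return _PENN_PREFIX_TO_WORDNET.get(tag[:2].upper(), default)


def lemmatize_tagged(
    tagged_tokens: Iterable[Tuple[str, str]],
    lemmatize: Callable[[str, str], str],
) -> List[str]:
    """Lemmatize (token, tag) pairs, passing each token its mapped POS code."""
    return [lemmatize(token, penn_to_wordnet_pos(tag)) for token, tag in tagged_tokens]
```

With NLTK and its tokenizer, tagger, and WordNet data installed, one would presumably pass `nltk.pos_tag(nltk.word_tokenize(sentence))` as `tagged_tokens` and `WordNetLemmatizer().lemmatize` as `lemmatize`. Falling back to the noun code mirrors the default `pos` of `WordNetLemmatizer.lemmatize`, so tokens with other tags (prepositions, punctuation) are usually returned unchanged.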

Resolves: nltk:#3257

Sion1225 · 2024-05-30T09:19:56Z

This issue is addressed by pull request #3257.

Sion1225 mentioned this issue May 21, 2024

Develop text lemmatize function #3257

Open
