Python implementation of 4-gram language models that use either Witten-Bell or Kneser-Ney Smoothing

tanmay-pro/Smoothing-in-LM

Smoothing in Language Modelling

File Structure

  • Run the file "kn.ipyb" to build a language model with Kneser-Ney smoothing.
  • Run the file "wb.ipyb" to build a language model with Witten-Bell smoothing.
  • The corpus directory contains two corpora that differ in size.
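The smoothing step that the two notebooks perform can be sketched in a few lines. The following is a minimal, hypothetical implementation of interpolated Witten-Bell smoothing (not the repo's actual code); the class name, padding tokens, and add-one unigram base case are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

class WittenBellLM:
    """Sketch of an n-gram LM with interpolated Witten-Bell smoothing."""

    def __init__(self, sentences, n=4):
        self.n = n
        self.counts = Counter()            # k-gram tuple -> frequency, k = 1..n
        self.followers = defaultdict(set)  # history -> distinct next words
        self.vocab = set()
        for sent in sentences:
            padded = ["<s>"] * (n - 1) + sent + ["</s>"]
            self.vocab.update(padded)
            for k in range(1, n + 1):
                for i in range(len(padded) - k + 1):
                    gram = tuple(padded[i:i + k])
                    self.counts[gram] += 1
                    if k > 1:
                        self.followers[gram[:-1]].add(gram[-1])
        self.unigram_total = sum(c for g, c in self.counts.items() if len(g) == 1)

    def prob(self, word, history=()):
        history = tuple(history)[-(self.n - 1):]
        if not history:
            # Base case: add-one smoothed unigram (an assumption of this sketch).
            return (self.counts[(word,)] + 1) / (self.unigram_total + len(self.vocab))
        t = len(self.followers[history])  # distinct types seen after this history
        c_h = sum(self.counts[history + (w,)] for w in self.followers[history])
        if c_h == 0:
            return self.prob(word, history[1:])  # unseen history: back off
        lam = c_h / (c_h + t)  # Witten-Bell interpolation weight
        return (lam * self.counts[history + (word,)] / c_h
                + (1 - lam) * self.prob(word, history[1:]))
```

Because each level interpolates with a properly normalized lower-order distribution, the probabilities at any history sum to one over the vocabulary.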

Key Points

  • Unknown words have been handled by adding `<UNK>` tags to the language models. If a word appears in the training set fewer times than a particular frequency threshold (which can be changed in the files), it is replaced by the `<UNK>` token. Similarly, while calculating perplexities on the test sets, any word not seen during training is replaced by the `<UNK>` token. This solves the open-vocabulary problem.

  • `<s>` and `</s>` tags have been added to take the 4-gram context into account at sentence boundaries. Not using these tokens would lose information from the start and end of each sentence.
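Concretely, a 4-gram model needs three start tokens so that even the first word has a full history. A minimal padding helper (hypothetical names, assuming `<s>`/`</s>` boundary tokens):

```python
def pad_sentence(tokens, n=4, sos="<s>", eos="</s>"):
    # Prepend n-1 start tokens so the first word has a full (n-1)-word
    # history, and append one end token so sentence endings are modelled.
    return [sos] * (n - 1) + tokens + [eos]
```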

  • NLTK has not been used for tokenization; instead, a custom function has been written for it.
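A custom tokenizer of this kind might look like the following regex-based sketch; this is an illustration under assumed rules (lowercasing, splitting off punctuation), not the notebooks' actual function.

```python
import re

def tokenize(text):
    """Lowercase, then split into words (keeping internal apostrophes)
    and standalone punctuation marks."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?|[^\w\s]", text.lower())
```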
