
Turkish-NLP-Preprocessing-module

Preprocessing tool for Turkish NLP that contains a tokenizer, normalizer, stop-word eliminator and stemmer. Developed by Melikşah Türker (https://github.com/meliksahturker) and Büşra Oğuzoğlu (https://github.com/busraoguzoglu) for the CMPE561 NLP class project.

The Sentence Splitter and Tokenizer modules each come in two versions: rule-based and machine-learning-based. The machine-learning version offers a Naive Bayes classifier and a Logistic Regression classifier. We implemented Naive Bayes from scratch and used the scikit-learn implementation for Logistic Regression.
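To illustrate the machine-learning approach, sentence splitting can be framed as binary classification of each '.', '?' or '!' candidate. The sketch below is illustrative only; the feature set and helper names are assumptions, not the repository's exact code:

```python
# Illustrative sketch: sentence splitting as binary classification
# of punctuation candidates, using scikit-learn's Logistic Regression.
from sklearn.linear_model import LogisticRegression

def boundary_features(text, i):
    """Simple hand-crafted features for the punctuation candidate at text[i]."""
    left = text[:i].split()
    prev_word = left[-1] if left else ""
    rest = text[i + 1:].split()
    next_word = rest[0] if rest else ""
    return [
        int(bool(next_word) and next_word[0].isupper()),  # capitalized continuation
        int(prev_word.isdigit()),                         # "3." in a list, likely not a boundary
        len(prev_word),                                   # short tokens are often abbreviations
    ]

# Given labeled candidates (X = feature vectors, y = 1 for true boundaries):
# clf = LogisticRegression().fit(X, y)
# clf.predict([boundary_features(text, i)])  # 1 -> split the sentence here
```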

The stop-word eliminator also has two versions: static and dynamic. The static version uses a predefined stop-word list, while the dynamic version detects stop-words automatically by choosing a frequency threshold from the word frequency distribution using the second derivative (elbow rule).
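The elbow rule amounts to sorting word frequencies in descending order and placing the cutoff where the discrete second derivative of the frequency curve peaks, i.e. where the curve bends most sharply. A minimal sketch of the idea (assumed details, not the module's exact code):

```python
# Elbow-rule sketch: the most frequent words above the sharpest bend
# of the frequency curve are treated as stop-words.
from collections import Counter

def dynamic_stopwords(tokens):
    freqs = Counter(tokens).most_common()          # [(word, count)] descending
    counts = [c for _, c in freqs]
    if len(counts) < 3:
        return set()
    # Discrete second derivative: f[i-1] - 2*f[i] + f[i+1]
    second = [counts[i - 1] - 2 * counts[i] + counts[i + 1]
              for i in range(1, len(counts) - 1)]
    elbow = second.index(max(second)) + 1          # index of sharpest bend
    return {w for w, _ in freqs[:elbow]}           # words above the elbow
```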

The normalizer uses a predefined normalization lexicon together with Levenshtein distance, computed both over the whole word and over its consonants only, and combines the two scores.
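The idea can be sketched as follows (the `normalize` helper and lexicon here are illustrative assumptions, not the module's API): lexicon candidates are scored by Levenshtein distance over both the full word and its consonant skeleton, which helps with informal spellings that drop vowels:

```python
# Sketch: combine full-word and consonant-only Levenshtein distances
# to pick the best normalization candidate from a lexicon.
def levenshtein(a, b):
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

VOWELS = set("aeıioöuüAEIİOÖUÜ")

def consonants(word):
    return "".join(ch for ch in word if ch not in VOWELS)

def normalize(word, lexicon):
    # Score each lexicon entry by the sum of both distances; take the best.
    return min(lexicon, key=lambda w: levenshtein(word, w)
                                      + levenshtein(consonants(word), consonants(w)))
```

For example, `normalize("slm", ["selam", "sonra"])` resolves to "selam", since the consonant skeletons match exactly even though the vowels are missing.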

The stemmer is rule-based: it strips the many suffixes of the Turkish language in a fixed order, with extra rules handling irregularities. It can also tell whether a given word is a noun or a verb based on its suffixes. It covers almost all inflectional suffixes and some derivational suffixes.
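A toy sketch of the ordered suffix-stripping idea (the suffix list below is a tiny illustrative subset for nouns only; the module's real tables and irregularity rules are far more extensive):

```python
# Toy ordered suffix stripping: try suffixes longest-first, repeat until
# nothing more can be removed, and never shrink below a minimum stem length.
NOUN_SUFFIXES = ["lerin", "ların", "ler", "lar", "in", "ın", "de", "da", "i", "ı"]

def stem_noun(word, min_stem=2):
    changed = True
    while changed:
        changed = False
        for suffix in NOUN_SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                word = word[: -len(suffix)]
                changed = True
                break
    return word

# e.g. stem_noun("evlerinde") -> "evlerin" -> "ev"
```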

The Data folder contains lexicons for multi-word expressions, normalization, prefixes, abbreviations (non-breaking prefixes), stop-words, and more.
