Project outline

This is an automatic extractive text summarization algorithm.

Extractive text summarization builds a summary using only words that already appear in the original text. Compared with abstractive summarization algorithms, extractive ones are easier to implement and do not necessarily require training a network, but they are generally less accurate and less useful.

Extractive document summarization algorithms rank the pre-processed sentences of the original text according to selected features and build the summary solely from these ranked sentences. This project follows the TextRank algorithm, a graph-based summarization algorithm inspired by PageRank. Sentences are the nodes of the graph, and the connections between them are the edges. After the text documents are pre-processed, features are extracted and used to compute a cosine similarity matrix, which is then used to build the graph and rank the sentences.
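The ranking idea described above can be sketched with the standard library alone. This is a minimal illustration, not the project's code: the function names (`cosine`, `textrank`, `summarize`), the plain word-count vectors, the damping factor of 0.85, and the fixed iteration count are all assumptions made for the example.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences by weighted PageRank over a cosine-similarity graph."""
    vectors = [Counter(s.lower().split()) for s in sentences]
    n = len(vectors)
    if n == 0:
        return []
    # Edge weights: pairwise cosine similarity, with no self-loops.
    sim = [[cosine(vectors[i], vectors[j]) if i != j else 0.0
            for j in range(n)] for i in range(n)]
    out_weight = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on its score in proportion to edge weight.
            incoming = sum(scores[j] * sim[j][i] / out_weight[j]
                           for j in range(n) if out_weight[j] > 0)
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


def summarize(sentences, k=2):
    """Return the k top-ranked sentences in their original order."""
    scores = textrank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

A sentence that shares no words with any other sentence receives only the baseline score `(1 - damping) / n`, so it is unlikely to be selected for the summary.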

Project outline

The main dependencies are NLTK, whose tokenizers and lemmatizers are used in the pre-processing step, and scikit-learn, which is used for feature extraction.

The pre-processing step includes special character and punctuation removal, case conversion, tokenization, stop-word removal, and lemmatization.
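The steps listed above could be chained roughly as follows. This sketch uses only the standard library so that it runs without downloading NLTK data: the `preprocess` name, the tiny `STOP_WORDS` set, and the whitespace tokenizer are illustrative assumptions, and lemmatization is left as a pluggable callable because the specific NLTK lemmatizer used by the project is not stated here.

```python
import re

# Tiny illustrative stop-word list (an assumption); NLTK ships a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "and", "to"}


def preprocess(sentence, lemmatize=lambda token: token):
    """Clean one sentence and return its list of normalized tokens."""
    # 1. Remove special characters and punctuation (keep letters, digits, spaces).
    cleaned = re.sub(r"[^A-Za-z0-9\s]", " ", sentence)
    # 2. Case conversion.
    cleaned = cleaned.lower()
    # 3. Tokenization (a whitespace split stands in for an NLTK tokenizer).
    tokens = cleaned.split()
    # 4. Stop-word removal.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 5. Lemmatization via the supplied callable (identity by default).
    return [lemmatize(t) for t in tokens]
```

With NLTK and its WordNet data installed, a real lemmatizer such as `nltk.stem.WordNetLemmatizer().lemmatize` could be passed as the `lemmatize` argument.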

Pre-processing is followed by feature extraction, which has three sub-sections: N-gram bag of words, word frequency vectorizer, and TF-IDF vectorizer. Finally, sentence ranks are calculated using the PageRank algorithm, and summaries are generated for the News category of the Brown corpus.
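To show what these three kinds of features contain, here is a small standard-library sketch. It is not the project's scikit-learn code: the function names are hypothetical and the input is assumed to be already tokenized. The smoothed IDF formula follows the one scikit-learn documents for its default TF-IDF settings, but tokenization is simplified, so values may differ from the library's output.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(tokens, max_n=2):
    """Count all 1..max_n-grams in one tokenized sentence.

    With max_n=1 this reduces to a plain word-frequency vector.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


def tfidf(tokenized_sentences):
    """Return one L2-normalized TF-IDF dict per sentence (unigrams only)."""
    n_docs = len(tokenized_sentences)
    term_counts = [Counter(tokens) for tokens in tokenized_sentences]
    # Document frequency: number of sentences that contain each term.
    doc_freq = Counter(term for counts in term_counts for term in counts)
    vectors = []
    for counts in term_counts:
        # Smoothed IDF: ln((1 + N) / (1 + df)) + 1
        weights = {t: tf * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
                   for t, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors
```

Terms that occur in every sentence get the lowest IDF weight, so the similarity between sentences is driven mostly by the rarer terms they share.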

Data

The Brown and Reuters corpora are accessed through the NLTK library. The Brown corpus is the main data set from which the model generates summaries; the Reuters corpus is used only for trial purposes.
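For orientation, the News category of the Brown corpus can be read through NLTK's corpus reader roughly as sketched below. The helper names are hypothetical, and treating each corpus file as one document to summarize is an assumption, not something stated above. NLTK is imported inside the function so that the small joining helper can be used without it.

```python
def join_tokens(tokens):
    """Join a list of word tokens back into a single sentence string."""
    return " ".join(tokens)


def load_brown_news():
    """Return {file id: list of sentence strings} for the Brown News category."""
    import nltk  # imported lazily so join_tokens works without NLTK installed
    from nltk.corpus import brown

    nltk.download("brown", quiet=True)  # skipped if the corpus is up to date
    return {
        fileid: [join_tokens(sent) for sent in brown.sents(fileids=fileid)]
        for fileid in brown.fileids(categories="news")
    }
```

Calling `load_brown_news()` needs NLTK installed and, on first use, network access to fetch the corpus.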

BatuhanKursatUnal/Extractive-Text-Summarizer-Octopus
