Skip to content

An open-source package for python to clean raw text data

License

Notifications You must be signed in to change notification settings

prasanthg3/cleantext

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

17 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

cleantext

Downloads

cleantext is a an open-source python package to clean raw text data. Source code for the library can be found here.

Features

cleantext has two main methods,

  • clean: to clean raw text and return the cleaned text
  • clean_words: to clean raw text and return a list of clean words

cleantext can apply all, or a selected combination of the following cleaning operations:

  • Remove extra white spaces
  • Convert the entire text into a uniform lowercase
  • Remove digits from the text
  • Remove punctuations from the text
  • Remove or replace the part of text with custom regex
  • Remove stop words, and choose a language for stop words ( Stop words are generally the most common words in a language with no significant meaning such as is, am, the, this, are etc.)
  • Stem the words (Stemming is a process of converting words with similar meaning into a single word. For example, stemming of words run, runs, running will result run, run, run)

Installation

cleantext requires Python 3 and NLTK to execute.

To install using pip, use

pip install cleantext

Usage

  • Import the library:
import cleantext
  • Choose a method:

To return the text in a string format,

cleantext.clean("your_raw_text_here") 

To return a list of words from the text,

cleantext.clean_words("your_raw_text_here") 

To choose a specific set of cleaning operations,

cleantext.clean_words("your_raw_text_here",
clean_all= False # Execute all cleaning operations
extra_spaces=True ,  # Remove extra white spaces 
stemming=True , # Stem the words
stopwords=True ,# Remove stop words
lowercase=True ,# Convert to lowercase
numbers=True ,# Remove all digits 
punct=True ,# Remove all punctuations
reg: str = '<regex>', # Remove parts of text based on regex
reg_replace: str = '<replace_value>', # String to replace the regex used in reg
stp_lang='english'  # Language for stop words
)

Examples

import cleantext
cleantext.clean('This is A s$ample !!!! tExt3% to   cleaN566556+2+59*/133', extra_spaces=True, lowercase=True, numbers=True, punct=True)

returns,

'this is a sample text to clean'

import cleantext
cleantext.clean_words('This is A s$ample !!!! tExt3% to   cleaN566556+2+59*/133')

returns,

['sampl', 'text', 'clean']

from cleantext import clean
text = "my id, [email protected] and your, [email protected]"
clean(text, reg=r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", reg_replace='email', clean_all=False)

returns,

"my id, email and your, email"

License

MIT

For any questions, issues, bugs, and suggestions please visit here