Data used in this project: UCI Machine Learning Repository
The TSV file contains over 5,000 text messages and has two columns:
- message_type
- text_message
The objective of this project is to create a simple spam detector for text messages.
It consists of four parts:
- Introduction with some data transformations, quick exploratory data analysis and visualizations.
- Training a model and using machine learning to predict whether a text message is spam or not.
- Optimizing the process to increase performance of the model.
- Conclusions
Introduction
To handle the collected data, I used Jupyter Notebook with the following libraries:
- Pandas
- String
- Seaborn
- Matplotlib
- SKLearn
- NLTK
The collected data is a TSV file, which can be read with Pandas and turned into a DataFrame.
There are no missing values in any of the columns.
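The loading step and the missing-value check could look roughly like the sketch below. It uses a hypothetical two-row in-memory sample in place of the real file, and it assumes the file has no header row, so the column names are passed explicitly.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for the real TSV file.
sample_tsv = "ham\tSee you at lunch?\nspam\tWIN a FREE prize now!!!\n"

# Assumption: the file has no header row, so column names are supplied here.
messages = pd.read_csv(
    io.StringIO(sample_tsv),  # replace with the path to the real TSV file
    sep="\t",
    header=None,
    names=["message_type", "text_message"],
)

# Count missing values per column; all zeros means nothing is missing.
missing_per_column = messages.isnull().sum()
print(missing_per_column)
```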
Exploratory Data Analysis
- adding a column with the length of each message
- boxplot - visualizing the spread of message lengths
- histogram - distribution of the message length
- histograms - distribution of the message length by the message type
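A minimal sketch of these steps follows, using a hypothetical four-row sample. The `length` column name and the `plot_lengths` helper are illustrative assumptions, not names taken from the notebook. The plotting calls are wrapped in a function so that the numeric part runs even without a display.

```python
import pandas as pd

# Hypothetical sample; in the project this would be the full DataFrame.
messages = pd.DataFrame({
    "message_type": ["ham", "spam", "ham", "spam"],
    "text_message": [
        "See you at lunch?",
        "WIN a FREE prize now!!!",
        "Ok",
        "Call now to claim your reward",
    ],
})

# New column holding the number of characters in each message.
messages["length"] = messages["text_message"].str.len()

# Numeric summary of what the boxplot and histograms visualize.
length_by_type = messages.groupby("message_type")["length"].describe()
print(length_by_type)


def plot_lengths(df):
    """Draw the boxplot and histograms (requires Matplotlib and Seaborn)."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    plt.figure()
    sns.boxplot(x=df["length"])  # spread of message lengths
    plt.figure()
    df["length"].plot.hist(bins=50)  # overall distribution
    df.hist(column="length", by="message_type", bins=50)  # one panel per type
    plt.show()
```

Calling `plot_lengths(messages)` in a notebook cell would render the three figures.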
Vectorizing
- removing the punctuation from the messages
- using CountVectorizer from the SKLearn package
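The two vectorizing steps can be sketched as follows. The `remove_punctuation` helper is a hypothetical name, and the notebook may have wired the cleaning into `CountVectorizer` differently. Here the cleaning is applied first and the vectorizer is left at its defaults.

```python
import string

from sklearn.feature_extraction.text import CountVectorizer


def remove_punctuation(text):
    """Return text with every character in string.punctuation removed."""
    return text.translate(str.maketrans("", "", string.punctuation))


# Hypothetical sample messages.
texts = ["See you at lunch?", "WIN a FREE prize now!!!"]
cleaned = [remove_punctuation(t) for t in texts]

# Default settings: lowercases the text and keeps tokens of two or more characters.
vectorizer = CountVectorizer()
bag_of_words = vectorizer.fit_transform(cleaned)

# Sparse matrix with one row per message and one column per vocabulary word.
print(bag_of_words.shape)
print(sorted(vectorizer.vocabulary_))
```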
Training a model
- using the Naive Bayes model
- evaluating the model
This method reached 96.53% accuracy.
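As a rough illustration of the training and evaluation steps, the sketch below fits a Naive Bayes classifier on a hypothetical toy dataset. The choice of `MultinomialNB`, the 75/25 split, and the random seed are assumptions, since the source only names "Naive Bayes". The toy data will not reproduce the accuracy reported above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data; the real project uses the full message set.
texts = [
    "Are we still meeting for lunch today",
    "Can you call me when you get home",
    "Thanks for the notes from class",
    "Happy birthday, see you tonight",
    "WIN a FREE prize, claim now",
    "URGENT: your account has won cash",
    "Free entry to win tickets, text now",
    "Claim your free cash reward today",
]
labels = ["ham", "ham", "ham", "ham", "spam", "spam", "spam", "spam"]

# Hold out part of the data so the model is scored on unseen messages.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# Learn the vocabulary from the training messages only.
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)

model = MultinomialNB()
model.fit(X_train_counts, y_train)

predictions = model.predict(X_test_counts)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.2%}")
```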
Vectorizing
- removing the punctuation from the messages
- removing the stopwords using the NLTK package
- using CountVectorizer from the SKLearn package
- applying TF-IDF
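The stopword and TF-IDF steps might be sketched as below. Note the swap: the project removes stopwords with NLTK, whose stopword corpus has to be downloaded first, while this sketch uses scikit-learn's built-in English stopword list so that it runs without any download. The sample messages are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical sample messages (punctuation already removed).
texts = [
    "See you at the lunch",
    "Win a free prize now",
    "Claim the free prize",
]

# Swapped-in technique: scikit-learn's built-in English stopword list
# stands in for the NLTK stopwords used in the project.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(texts)

# Reweight raw counts so that words appearing in many messages count for less.
tfidf = TfidfTransformer().fit_transform(counts)

print(sorted(vectorizer.vocabulary_))
print(tfidf.shape)
```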
Training a model
- using the Naive Bayes model
- evaluating the model
This method reached 95.75% accuracy.
Changing the model to Random Forest
- using the Random Forest model
- evaluating the model
This method reached 97.37% accuracy.
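Swapping in a Random Forest could look like the sketch below, which chains the vectorizing steps and the classifier in a scikit-learn `Pipeline`. The use of a pipeline, the built-in stopword list, `n_estimators=100`, and the toy data are all assumptions for illustration. The toy data will not reproduce the accuracy reported above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy data; the real project uses the full message set.
texts = [
    "Are we still meeting for lunch today",
    "Can you call me when you get home",
    "Thanks for the notes from class",
    "Happy birthday, see you tonight",
    "WIN a FREE prize, claim now",
    "URGENT: your account has won cash",
    "Free entry to win tickets, text now",
    "Claim your free cash reward today",
]
labels = ["ham", "ham", "ham", "ham", "spam", "spam", "spam", "spam"]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# Same preprocessing as before; only the final classifier changes.
pipeline = Pipeline([
    ("counts", CountVectorizer(stop_words="english")),
    ("tfidf", TfidfTransformer()),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=0)),
])
pipeline.fit(X_train, y_train)

predictions = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.2%}")
```

Keeping the steps in one pipeline means the classifier can be replaced by changing a single entry, which makes it easier to compare models on identical preprocessing.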
Conclusions
There are many optimization possibilities to try with this project.
Firstly, there are many other text pre-processing techniques that can be applied to the messages before vectorizing them (such as stemming).
Secondly, there are many other classifiers that may prove more effective in this specific situation.
Also, a deeper understanding of the data can help establish what results should be expected.