GitHub - maciej-ziaja/Spam-Detector: Simple anti-spam detector for text messages

👨‍💻 Built with

🔍 Description

Data used in this project: UCI Machine Learning Repository

The TSV file contains over 5000 text messages. The file has two columns:

message_type
text_message

The objective of this project is to create a simple anti-spam detector for text messages.

It consists of four parts:

Introduction with some data transformations, quick exploratory data analysis and visualizations.
Training a machine learning model to predict whether a text message is spam or not.
Optimizing the process to increase performance of the model.
Conclusions

📝 Introduction and Analysis:

Introduction

To handle the collected data, I used Jupyter Notebook with the following libraries:

Pandas
String
Seaborn
Matplotlib
SKLearn
NLTK

The collected data is a TSV file. It can be read with Pandas and turned into a DataFrame.
There are no missing values in any of the columns.
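A minimal sketch of this loading step follows. The real file path is not given in this README, so the sketch reads a tiny inline sample instead; the `ham`/`spam` label values, the absence of a header row, and the helper name `load_messages` are assumptions made for illustration.

```python
import csv
import io

import pandas as pd

# Tiny inline stand-in for the real TSV file (hypothetical messages).
SAMPLE_TSV = (
    "ham\tOk, see you at lunch then.\n"
    "spam\tWINNER!! Claim your free prize now\n"
    "ham\tCan you pick up some milk on the way home?\n"
)


def load_messages(source):
    """Read a headerless two-column TSV into a DataFrame."""
    return pd.read_csv(
        source,
        sep="\t",
        header=None,  # assumption: the file has no header row
        names=["message_type", "text_message"],
        quoting=csv.QUOTE_NONE,  # treat quote characters in messages as plain text
    )


messages = load_messages(io.StringIO(SAMPLE_TSV))

# Count missing values per column; all zeros means nothing is missing.
missing_per_column = messages.isnull().sum()
print(messages.shape)
print(missing_per_column)
```

With the real data, `load_messages` would be given the path to the TSV file instead of the `StringIO` object.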

Exploratory Data Analysis

adding a column with length of the message
boxplot - visualizing the spread of message length
histogram - distribution of the message length
histograms - distribution of the message length by the message type
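The steps above could look roughly like the sketch below. It uses plain Matplotlib rather than Seaborn to keep the example small, and the sample messages, the `length` column name, and the `ham`/`spam` labels are assumptions; the notebook itself may plot things differently.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; drop this line inside a notebook

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample standing in for the real DataFrame.
messages = pd.DataFrame(
    {
        "message_type": ["ham", "spam", "ham"],
        "text_message": ["See you at lunch", "WIN a FREE prize now", "Call me later"],
    }
)

# New column with the number of characters in each message.
messages["length"] = messages["text_message"].str.len()

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Boxplot: spread of message length.
axes[0].boxplot(messages["length"].to_numpy())
axes[0].set_title("Spread of message length")

# Histogram: overall distribution of message length.
axes[1].hist(messages["length"].to_numpy(), bins=20)
axes[1].set_title("Message length")

# Overlaid histograms: distribution of message length per message type.
for label, group in messages.groupby("message_type"):
    axes[2].hist(group["length"].to_numpy(), bins=20, alpha=0.6, label=label)
axes[2].set_title("Message length by type")
axes[2].legend()

fig.tight_layout()
```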

📊 Prediction with machine learning model:

Vectorizing

removing the punctuation from the messages
using CountVectorizer from SKLearn package

Training a model

using the Naive Bayes model
evaluating the model

This method reached 96.53% accuracy.
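As an illustration of this baseline, the sketch below strips punctuation, builds word counts with `CountVectorizer`, and trains a multinomial Naive Bayes classifier. The README does not say which Naive Bayes variant or train/test split was used, so `MultinomialNB`, the 25% test split, and the toy messages are assumptions; the accuracy it prints has no relation to the figure reported above.

```python
import string

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


def remove_punctuation(text):
    """Drop every character listed in string.punctuation."""
    return text.translate(str.maketrans("", "", string.punctuation))


# Toy stand-in for the real data set (hypothetical messages).
texts = [
    "Ok, see you at lunch!", "Can you call me later?",
    "I'll be home by six.", "Thanks, that works for me.",
    "Are we still on for tonight?", "Running late, sorry!",
    "WIN a FREE prize now!!!", "Claim your free cash reward today!",
    "URGENT: call now to win $1000", "Free entry! Text WIN to claim.",
    "You have won a free holiday, call now!", "Cheap loans!!! Reply now for cash",
]
labels = ["ham"] * 6 + ["spam"] * 6

cleaned = [remove_punctuation(t) for t in texts]
X_train, X_test, y_train, y_test = train_test_split(
    cleaned, labels, test_size=0.25, stratify=labels, random_state=42
)

# Learn the vocabulary on the training split only, then reuse it for the test split.
vectorizer = CountVectorizer()
train_counts = vectorizer.fit_transform(X_train)
test_counts = vectorizer.transform(X_test)

model = MultinomialNB()
model.fit(train_counts, y_train)

accuracy = accuracy_score(y_test, model.predict(test_counts))
print(f"Accuracy on the toy test split: {accuracy:.2%}")
```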

🛍️ Optimizing the process:

Vectorizing

removing the punctuation from the messages
removing the stopwords using the NLTK package
using CountVectorizer from SKLearn package
applying TF-IDF

Training a model

using the Naive Bayes model
evaluating the model

This method reached 95.75% accuracy.
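One way to chain these steps is a scikit-learn `Pipeline`, sketched below. The README lists NLTK among its libraries for stop word handling; this sketch substitutes scikit-learn's built-in English stop word list (`stop_words="english"`) so that no corpus download is needed, which means the removed words may differ from the notebook's. The toy messages and step names are also assumptions.

```python
import string

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def remove_punctuation(text):
    """Drop every character listed in string.punctuation."""
    return text.translate(str.maketrans("", "", string.punctuation))


# Toy stand-in for the real data set (hypothetical messages).
texts = [
    "Ok, see you at lunch!", "Can you pick up the milk?",
    "Thanks, that works for me.", "Running late, sorry!",
    "WIN a FREE prize now!!!", "Claim your free cash reward today!",
    "Free entry! Text WIN to claim the prize.", "Cheap loans!!! Reply now for cash",
]
labels = ["ham"] * 4 + ["spam"] * 4

pipeline = Pipeline(
    [
        # Word counts with common English words removed (swapped in for NLTK stopwords).
        ("counts", CountVectorizer(stop_words="english")),
        # Reweight raw counts by TF-IDF.
        ("tfidf", TfidfTransformer()),
        ("classifier", MultinomialNB()),
    ]
)
pipeline.fit([remove_punctuation(t) for t in texts], labels)

vocabulary = pipeline.named_steps["counts"].vocabulary_
prediction = pipeline.predict([remove_punctuation("Free prize, reply now!")])[0]
print(prediction)
```

Keeping the vectorizer, the TF-IDF step, and the classifier in one pipeline ensures that new messages pass through exactly the same transformations as the training data.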

Changing the model to Random Forest

using the Random Forest model
evaluating the model

This method reached 97.37% accuracy.
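Swapping the classifier can be as small as changing the last pipeline step, as in the sketch below. Whether the notebook kept the TF-IDF step in front of the Random Forest is not stated here, so that choice, the hyperparameters (`n_estimators=100`, `random_state=42`), and the toy messages are all assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Toy stand-in for the real data set (hypothetical messages).
texts = [
    "See you at lunch", "Can you pick up the milk",
    "Thanks that works for me", "Running late sorry",
    "WIN a FREE prize now", "Claim your free cash reward today",
    "Free entry text WIN to claim the prize", "Cheap loans reply now for cash",
]
labels = ["ham"] * 4 + ["spam"] * 4

forest_pipeline = Pipeline(
    [
        ("counts", CountVectorizer(stop_words="english")),
        ("tfidf", TfidfTransformer()),
        # Only this step differs from a Naive Bayes pipeline.
        ("classifier", RandomForestClassifier(n_estimators=100, random_state=42)),
    ]
)
forest_pipeline.fit(texts, labels)

predictions = forest_pipeline.predict(["Win free cash now", "See you at home"])
print(list(predictions))
```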

📋 Conclusions

There are many optimization possibilities to try with this project.

Firstly, there are many other text pre-processing techniques that can be applied to the messages before vectorizing them (such as stemming).

Secondly, there are many other classifiers that may prove more effective in this specific situation.

Also, a deeper understanding of the data can help establish what results are realistic to aim for.
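As a pointer for the stemming idea mentioned above, here is a minimal sketch using NLTK's `PorterStemmer`. Stemming is not part of the current notebook; the helper name `stem_message` and the whitespace tokenization are assumptions chosen for brevity.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()


def stem_message(text):
    """Reduce each whitespace-separated word to its Porter stem."""
    return " ".join(stemmer.stem(word) for word in text.split())


# Inflected forms of the same word collapse to one token,
# which shrinks the vocabulary the vectorizer has to learn.
print(stem_message("running runs"))
```

The stemmed text could then be passed to the vectorizer in place of the raw message.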
