maciej-ziaja/Spam-Detector

👨‍💻 Built with

Python, Jupyter Notebook, Pandas, Scikit-Learn, Matplotlib

🔍 Description

Data used in this project: UCI Machine Learning Repository

The TSV file contains over 5,000 text messages. It has two columns:

message_type
text_message

The objective of this project is to create a simple spam detector for text messages.

It consists of four parts:

  1. Introduction with some data transformations, quick exploratory data analysis and visualizations.
  2. Training a machine learning model to predict whether a text message is spam or not.
  3. Optimizing the process to increase performance of the model.
  4. Conclusions

📝 Introduction and Analysis:

Introduction

To handle the collected data, I used Jupyter Notebook with the following libraries:

  • Pandas
  • String
  • Seaborn
  • Matplotlib
  • SKLearn
  • NLTK

The collected data is a TSV file. It can be read with Pandas and turned into a DataFrame.
None of the columns contain missing values.
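A minimal sketch of this loading step is given here. It assumes the file is tab-separated with no header row and that the labels are `ham` and `spam`; the in-memory `sample_tsv` string is a hypothetical stand-in for the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the TSV file (assumed: tab-separated, no header row).
sample_tsv = (
    "ham\tSee you at lunch?\n"
    "spam\tWIN a FREE prize now!!!\n"
    "ham\tOk, call me later.\n"
)

# sep="\t" marks the file as tab-separated; names= supplies the two column names.
df = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    header=None,
    names=["message_type", "text_message"],
)

# Count missing values per column to confirm the data is complete.
missing = df.isnull().sum()

print(df.shape)
print(missing)
```

With the real file, the `io.StringIO(...)` argument would be replaced by the path to the TSV file.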

Exploratory Data Analysis

  • adding a column with the length of each message
  • boxplot - visualizing the spread of message lengths
  • histogram - distribution of message length
  • histograms - distribution of message length by message type
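The steps above could look roughly like the sketch below. The toy DataFrame, the `length` column name and the `plot_lengths` helper are assumptions made for illustration; the notebook may use Seaborn for the plots, while this sketch relies on Matplotlib only.

```python
import pandas as pd

# Toy data standing in for the real messages (hypothetical examples).
df = pd.DataFrame({
    "message_type": ["ham", "spam", "ham", "spam"],
    "text_message": [
        "See you soon",
        "WIN a FREE prize now!!! Call 0800 to claim",
        "Ok",
        "URGENT! Reply YES to win cash",
    ],
})

# New column holding the number of characters in each message.
df["length"] = df["text_message"].str.len()

# Summary statistics of message length per message type.
length_by_type = df.groupby("message_type")["length"].describe()
print(length_by_type)


def plot_lengths(frame):
    """Draw a boxplot and per-type histograms of message length."""
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    # Boxplot of length, one box per message type.
    frame.boxplot(column="length", by="message_type", ax=axes[0])
    # Overlaid histograms, one per message type.
    for label, group in frame.groupby("message_type"):
        axes[1].hist(group["length"], bins=20, alpha=0.6, label=label)
    axes[1].set_xlabel("length")
    axes[1].legend()
    return fig
```

In a notebook, calling `plot_lengths(df)` would render both plots side by side.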

📊 Prediction with machine learning model:

Vectorizing

  • removing the punctuation from the messages
  • using CountVectorizer from the SKLearn package
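As an illustration of these two steps, the sketch below strips punctuation with the standard `string` module and then builds a bag-of-words matrix. The `remove_punctuation` helper and the sample messages are hypothetical, and `CountVectorizer` is used with its default settings, which is an assumption.

```python
import string

from sklearn.feature_extraction.text import CountVectorizer


def remove_punctuation(text):
    """Return text with every character in string.punctuation removed."""
    return text.translate(str.maketrans("", "", string.punctuation))


# Hypothetical sample messages.
messages = ["WIN a FREE prize now!!!", "Ok, see you at lunch."]
cleaned = [remove_punctuation(m) for m in messages]

# Default CountVectorizer lowercases text and keeps tokens of two or more characters.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(cleaned)

print(cleaned)
print(sorted(vectorizer.vocabulary_))
print(counts.shape)
```

Each row of `counts` is one message and each column is one vocabulary word, so the matrix is mostly zeros and is stored in sparse form.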

Training a model

  • using the Naive Bayes model
  • evaluating the model

This method reached 96.53% accuracy.
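A rough sketch of training and evaluating such a model follows. It assumes the multinomial variant of Naive Bayes (`MultinomialNB`), a 75/25 train/test split and a tiny hypothetical dataset, so its accuracy says nothing about the figure reported above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Tiny hypothetical dataset, for illustration only.
texts = [
    "Win a free prize now",
    "Free entry claim your cash prize",
    "Urgent call now to claim your reward",
    "You have won free tickets text now",
    "Cash prize waiting claim free now",
    "Free ringtone text win to claim",
    "Are we still meeting for lunch",
    "See you at home tonight",
    "Can you call me later",
    "I will be home late tonight",
    "Thanks for lunch see you soon",
    "Meeting moved to later today",
]
labels = ["spam"] * 6 + ["ham"] * 6

# Hold out a quarter of the messages for evaluation, keeping the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# Learn the vocabulary on the training messages only, then reuse it for the test set.
vectorizer = CountVectorizer()
train_counts = vectorizer.fit_transform(X_train)
test_counts = vectorizer.transform(X_test)

model = MultinomialNB()
model.fit(train_counts, y_train)

predictions = model.predict(test_counts)
accuracy = accuracy_score(y_test, predictions)
print(f"accuracy: {accuracy:.4f}")
```

Fitting the vectorizer on the training split alone keeps words that appear only in the test messages from leaking into the model.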

🛍️ Optimizing the process:

Vectorizing

  • removing the punctuation from the messages
  • removing the stopwords using the NLTK package
  • using CountVectorizer from the SKLearn package
  • applying TF-IDF
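The extended vectorizing steps can be sketched as a small pipeline. To stay self-contained, this example swaps in scikit-learn's built-in English stopword list (`stop_words="english"`) instead of an external stopword corpus, so the removed words may differ from those in the notebook. The sample messages and the `features` pipeline name are hypothetical.

```python
import string

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Hypothetical sample messages.
messages = [
    "You have won a free prize!",
    "Are you free for lunch?",
    "Claim the prize, now!",
]

# Strip punctuation before vectorizing.
table = str.maketrans("", "", string.punctuation)
cleaned = [m.translate(table) for m in messages]

features = Pipeline([
    # Bag-of-words counts with English stopwords dropped (scikit-learn's built-in list).
    ("counts", CountVectorizer(stop_words="english")),
    # Reweight counts so that words common to many messages contribute less.
    ("tfidf", TfidfTransformer()),
])
weights = features.fit_transform(cleaned)

vocabulary = sorted(features.named_steps["counts"].vocabulary_)
# TfidfTransformer L2-normalizes each row by default.
row_norms = np.linalg.norm(weights.toarray(), axis=1)

print(vocabulary)
print(row_norms)
```

Wrapping both steps in a `Pipeline` means the same transformation can later be applied to unseen messages with a single `transform` call.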

Training a model

  • using the Naive Bayes model
  • evaluating the model

This method reached 95.75% accuracy, slightly lower than the 96.53% reached without stopword removal and TF-IDF.

Changing the model to Random Forest

  • using the Random Forest model
  • evaluating the model

This method reached 97.37% accuracy.
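Swapping the classifier could look like the sketch below. It assumes the Random Forest is trained on the same TF-IDF features, uses 100 trees with a fixed seed, and runs on a tiny hypothetical dataset, so the printed accuracy is not comparable to the figure above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Tiny hypothetical dataset, for illustration only.
texts = [
    "Win a free prize now",
    "Free entry claim your cash prize",
    "Urgent call now to claim your reward",
    "You have won free tickets text now",
    "Cash prize waiting claim free now",
    "Free ringtone text win to claim",
    "Are we still meeting for lunch",
    "See you at home tonight",
    "Can you call me later",
    "I will be home late tonight",
    "Thanks for lunch see you soon",
    "Meeting moved to later today",
]
labels = ["spam"] * 6 + ["ham"] * 6

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# Vectorizing and classification chained together, so the classifier is the only part to swap.
classifier = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
])
classifier.fit(X_train, y_train)

predictions = classifier.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"accuracy: {accuracy:.4f}")
```

Because the model sits at the end of a pipeline, trying another classifier only requires replacing the last step.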

📋 Conclusions

There are many optimization possibilities to try in this project.

Firstly, there are many other text pre-processing techniques that can be applied to the messages before vectorizing them (such as stemming).

Secondly, there are many other classifiers that may prove more effective in this specific situation.

Also, a deep understanding of the data helps in judging what results can realistically be achieved.