GitHub - falaybeg/SparkStreaming-Network-Anomaly-Detection: This repository includes supervised and unsupervised machine learning methods which are used to detect anomalies on network datasets. Decision Tree, Random Forest, Gradient Boost Tree, Naive Bayes, and Logistic Regression were used for supervised learning. K-Means was used for unsupervised learning.

Summary

Anomaly detection (also known as outlier detection) is the process of identifying objects that deviate from normal expectations. Anomalies and noise are not the same thing. Noise is data that lies far from the mean or median of a distribution, whereas an anomaly is generated by a different process than the one that generated the rest of the data. When large volumes of data must be processed in near real time to gain insight, stream processing is a natural fit. Analyzing this data can provide valuable insight for future actions.

This repository contains a machine-learning-based real-time network anomaly detection project built with Spark Streaming. Decision Tree, Random Forest, Gradient Boost Tree, Naive Bayes, and Logistic Regression were used for supervised learning. The K-Means clustering algorithm was used for unsupervised learning.

Dataset: The experiments use NSL-KDD, an improved version of the KDD dataset. Each record has 41 attributes, and the 42nd field is the class label, which is either normal or anomaly. Principal Component Analysis (PCA) was used for feature reduction.
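The record layout described above (41 attributes followed by a class label) can be sketched in plain Python. This is only an illustration, not code from the notebooks: the function name `parse_record`, the comma-separated layout, the choice to keep non-numeric attributes as strings, and the rule that any label other than `normal` counts as an anomaly are all assumptions.

```python
import csv
import io

NUM_FEATURES = 41  # attribute count stated in the README


def parse_record(line):
    """Split one comma-separated record into (features, is_anomaly).

    Hypothetical helper: numeric fields become floats, anything else
    (assumed to be a categorical attribute) is kept as a string.
    """
    fields = next(csv.reader(io.StringIO(line)))
    if len(fields) < NUM_FEATURES + 1:
        raise ValueError(
            f"expected at least {NUM_FEATURES + 1} fields, got {len(fields)}"
        )
    features = []
    for value in fields[:NUM_FEATURES]:
        try:
            features.append(float(value))
        except ValueError:
            features.append(value)
    # Assumption: every label other than "normal" is treated as an anomaly.
    label = fields[NUM_FEATURES].strip().lower()
    return features, label != "normal"


# Synthetic record for illustration only, not real NSL-KDD data.
synthetic = ",".join(["0"] * NUM_FEATURES + ["normal"])
features, is_anomaly = parse_record(synthetic)
print(len(features), is_anomaly)
```

Categorical attributes would still need to be encoded numerically before PCA or any of the algorithms listed here could use them; that step is left out of the sketch.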

Used Algorithms

Supervised Learning Process

In supervised learning, the data is first prepared for the ML algorithms. Next, a model is created with one of the supervised algorithms and trained on batch data. The streaming process then relies on this trained model: when new data arrives, it is parsed and feature reduction is applied. The model then predicts a class for each record within a sliding window. Finally, the results are printed to the console as a confusion matrix.
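The windowed evaluation step can be sketched without Spark as follows. This is a plain-Python stand-in, not the Spark Streaming code from the notebooks: `windowed_confusion`, the `predict` callable, and the window length of three batches for the supervised stream are assumptions made for illustration.

```python
from collections import Counter, deque


def confusion_matrix(pairs):
    """Count (actual, predicted) pairs, where True means anomaly."""
    counts = Counter(pairs)
    return {
        "tp": counts[(True, True)],
        "fn": counts[(True, False)],
        "fp": counts[(False, True)],
        "tn": counts[(False, False)],
    }


def windowed_confusion(batches, predict, window_length=3):
    """Yield one confusion matrix per slide over the last `window_length` batches.

    `batches` is an iterable of lists of (features, actual_label) tuples.
    `predict` is any callable mapping features to True (anomaly) or False.
    """
    window = deque(maxlen=window_length)  # oldest batch drops out automatically
    for batch in batches:
        window.append([(actual, predict(features)) for features, actual in batch])
        pairs = [pair for scored in window for pair in scored]
        yield confusion_matrix(pairs)


# Hypothetical stand-in for a trained model: flag a record when its
# first feature exceeds 0.5. The batches below are synthetic.
def toy_predict(features):
    return features[0] > 0.5


toy_batches = [
    [([0.9], True), ([0.1], False)],
    [([0.7], False)],
    [([0.2], True)],
    [([0.8], True)],
]
for matrix in windowed_confusion(toy_batches, toy_predict):
    print(matrix)
```

Each yielded dictionary covers only the batches currently inside the window, so older predictions stop influencing the matrix once they slide out.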

Unsupervised Learning Process

In unsupervised learning, the data is first prepared for the K-Means clustering algorithm. Next, k is chosen randomly, and after prediction (clustering) the Silhouette score is calculated to find the best k value. For every data point, the distance to its assigned cluster is calculated (the feature distance). For every cluster, the maximum distance among normal data points is calculated (the max distance). In the streaming process, when new data arrives, its cluster is predicted by the trained model and its feature distance is calculated. A sliding window with a 3-second window length and a 1-second slide interval is used to process the data. The feature distance is then compared with the max distance of the predicted cluster. If the feature distance is larger than the max distance, the system labels the data point as an anomaly. Otherwise, it is labeled as normal. Finally, the results are printed to the console as a confusion matrix.
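The distance rule described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the notebook implementation: it assumes the feature distance is the Euclidean distance from a point to the centroid of its nearest cluster, that the centroids come from an already trained K-Means model, and the function names are hypothetical.

```python
import math


def nearest_cluster(point, centroids):
    """Return (cluster_index, distance) for the centroid closest to `point`."""
    distances = [math.dist(point, centroid) for centroid in centroids]
    cluster = min(range(len(centroids)), key=distances.__getitem__)
    return cluster, distances[cluster]


def max_normal_distances(normal_points, centroids):
    """Per cluster, the largest distance of any normal training point."""
    thresholds = [0.0] * len(centroids)
    for point in normal_points:
        cluster, distance = nearest_cluster(point, centroids)
        thresholds[cluster] = max(thresholds[cluster], distance)
    return thresholds


def is_anomaly(point, centroids, thresholds):
    """Flag a point whose distance exceeds the max distance of its cluster."""
    cluster, distance = nearest_cluster(point, centroids)
    return distance > thresholds[cluster]


# Synthetic two-dimensional example; centroids and points are made up.
centroids = [(0.0, 0.0), (10.0, 10.0)]
normal_points = [(1.0, 0.0), (0.0, 2.0), (10.0, 11.0)]
thresholds = max_normal_distances(normal_points, centroids)
print(is_anomaly((0.0, 1.5), centroids, thresholds))
print(is_anomaly((10.0, 13.0), centroids, thresholds))
```

With this rule, a cluster that received no normal training points keeps a threshold of 0.0, so every new point assigned to it is flagged; whether that is the desired behavior depends on the data.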

Results

Gradient Boost Tree Streaming Results

K-Means Streaming Results

Logistic Regression Streaming Results

Keywords: Spark Streaming, Real-Time, Network, Anomaly Detection, Machine Learning, Supervised Learning, Unsupervised Learning, Classification, Clustering

Folders and files

.ipynb_checkpoints
data
images
README.md
TestDf.csv
TestDf1.csv
TestDf2.csv
TrainDf.csv
anomaly_detection-classification.ipynb
anomaly_detection-clustering.ipynb
decision_tree_network_anomaly.ipynb
gradient_boost_tree_network-anomaly.ipynb
k-means_network-anomaly.ipynb
logistic_regression_network-anomaly.ipynb
naive_bayes_network-anomaly.ipynb
random_forest_network-anomaly.ipynb
