
This repository includes supervised and unsupervised machine learning methods which are used to detect anomalies on network datasets. Decision Tree, Random Forest, Gradient Boost Tree, Naive Bayes, and Logistic Regression were used for supervised learning. K-Means was used for unsupervised learning.


falaybeg/SparkStreaming-Network-Anomaly-Detection


Summary


Anomaly detection (also known as outlier detection) is the process of identifying objects that deviate from normal expectations. Anomalies and noise are not the same thing. Noise is data that lies far from the mean or median of a distribution, whereas an anomaly is generated by a different process than the one that produced the rest of the data. When large volumes of data must be processed in near real time to gain insight, stream processing is a natural fit. Analyzing such data can provide valuable insight for future actions.

This repository contains a machine learning based, real-time network anomaly detection project built with Spark Streaming. Decision Tree, Random Forest, Gradient Boost Tree, Naive Bayes, and Logistic Regression are used for supervised learning. The K-Means clustering algorithm is used for unsupervised learning.

Dataset: The experiments use NSL-KDD, an improved version of the KDD dataset. Each record has 41 attributes plus a 42nd field, the class label, which is either normal or anomaly. Principal Component Analysis (PCA) was used for feature reduction.
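The record layout described above can be sketched in plain Python. This is only an illustration, not the repository's parser: it assumes comma-separated records, the function name `parse_record` is hypothetical, and the sample record is synthetic. Non-numeric fields are kept as strings so they can be encoded later, and any label other than `normal` is treated as an anomaly.

```python
from typing import List, Tuple, Union

NUM_FEATURES = 41  # per the dataset description; the 42nd field is the class label

Field = Union[float, str]


def parse_record(line: str) -> Tuple[List[Field], int]:
    """Split one comma-separated record into (features, label).

    Numeric fields become floats; non-numeric fields are kept as strings
    for later encoding. Label: 0 = normal, 1 = anomaly.
    """
    fields = [f.strip() for f in line.split(",")]
    if len(fields) < NUM_FEATURES + 1:
        raise ValueError(
            f"expected at least {NUM_FEATURES + 1} fields, got {len(fields)}"
        )
    features: List[Field] = []
    for raw in fields[:NUM_FEATURES]:
        try:
            features.append(float(raw))
        except ValueError:
            features.append(raw)  # categorical value, left as text
    label = 0 if fields[NUM_FEATURES].lower() == "normal" else 1
    return features, label


# Synthetic record for illustration only (not real dataset values).
sample = ",".join(["0", "tcp", "http", "SF"] + ["1"] * 37 + ["normal"])
features, label = parse_record(sample)
print(len(features), label)
```

In a Spark pipeline the same split would typically happen inside a map over incoming lines, followed by encoding of the categorical fields and the PCA step.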

Algorithms Used

Supervised Learning Process


In supervised learning, the data is first prepared for the ML algorithms. Next, a model object is created from one of the supervised algorithms and trained on batch data. The streaming process then relies on this trained model: when new data arrives, it is parsed and feature reduction is applied. The data is then classified by the trained model over a sliding window. Finally, the results are printed to the console as a confusion matrix.
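The windowed evaluation step can be sketched without Spark as follows. This is a minimal sketch under stated assumptions, not the repository's code: the window covers the last three micro-batches and slides by one batch, `toy_predict` is a hypothetical stand-in for a trained classifier, and the batches are synthetic.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, List, Sequence, Tuple

Record = Tuple[Sequence[float], int]  # (features, true label: 0 normal, 1 anomaly)


def confusion_matrix(pairs: Iterable[Tuple[int, int]]) -> List[List[int]]:
    """Rows are true labels, columns are predicted labels."""
    matrix = [[0, 0], [0, 0]]
    for truth, predicted in pairs:
        matrix[truth][predicted] += 1
    return matrix


def windowed_confusion(
    batches: Iterable[List[Record]],
    predict: Callable[[Sequence[float]], int],
    window_length: int = 3,  # assumed: three micro-batches per window
) -> Iterator[List[List[int]]]:
    """Yield one confusion matrix per slide over the most recent batches."""
    window: Deque[List[Record]] = deque(maxlen=window_length)
    for batch in batches:
        window.append(batch)  # the oldest batch drops out automatically
        pairs = [(label, predict(x)) for b in window for x, label in b]
        yield confusion_matrix(pairs)


def toy_predict(features: Sequence[float]) -> int:
    """Hypothetical stand-in for a trained model."""
    return 1 if features[0] > 0.5 else 0


# Synthetic micro-batches, one per slide interval.
batches = [
    [([0.1], 0), ([0.9], 1)],
    [([0.7], 0)],
    [([0.2], 1), ([0.8], 1)],
    [([0.3], 0)],
]
results = list(windowed_confusion(batches, toy_predict))
for matrix in results:
    print(matrix)
```

Because the window overlaps consecutive slides, a record is counted in every window that still contains its batch, so the matrices are not meant to be summed.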

Unsupervised Learning Process


In unsupervised learning, the data is first prepared for the K-Means clustering algorithm. Next, k is chosen randomly and, after prediction (clustering), the Silhouette score is calculated to find the best k value. The distance between each feature vector and its assigned cluster (the feature distance) is then calculated, and for every cluster the maximum distance among normal values (the max distance) is recorded. In the streaming process, when new data arrives, its cluster is predicted from the trained model and its distance is calculated. A sliding window processes the data with a 3-second window and a 1-second slide. The feature distance is then compared with the max distance: if the feature distance is larger than the max distance, the system labels the value as an anomaly; otherwise, it is labeled as normal. Finally, the results are printed to the console as a confusion matrix.
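The distance comparison above can be sketched in plain Python. The sketch assumes Euclidean distance to the cluster centroid and assigns each point to its nearest centroid. The centroids and training points are synthetic, and the function names are hypothetical rather than taken from the repository.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Sequence[float]


def nearest_cluster(point: Point, centroids: List[Point]) -> Tuple[int, float]:
    """Return (cluster index, distance) for the closest centroid."""
    distances = [math.dist(point, c) for c in centroids]
    index = min(range(len(centroids)), key=distances.__getitem__)
    return index, distances[index]


def max_normal_distance(
    training: List[Tuple[Point, int]], centroids: List[Point]
) -> Dict[int, float]:
    """Per cluster, the largest centroid distance seen among normal points."""
    thresholds = {i: 0.0 for i in range(len(centroids))}
    for point, label in training:
        if label != 0:  # only normal records define the threshold
            continue
        index, distance = nearest_cluster(point, centroids)
        thresholds[index] = max(thresholds[index], distance)
    return thresholds


def classify(point: Point, centroids: List[Point], thresholds: Dict[int, float]) -> int:
    """Return 1 (anomaly) if the point lies beyond its cluster's max distance."""
    index, distance = nearest_cluster(point, centroids)
    return 1 if distance > thresholds[index] else 0


# Hypothetical centroids, as if taken from a trained K-Means model.
centroids = [(0.0, 0.0), (10.0, 10.0)]
# Synthetic labeled points (0 = normal, 1 = anomaly).
training = [((0.0, 1.0), 0), ((3.0, 4.0), 0), ((10.0, 12.0), 0), ((6.0, 6.0), 1)]

thresholds = max_normal_distance(training, centroids)
print(thresholds)
print(classify((1.0, 1.0), centroids, thresholds))
print(classify((10.0, 14.0), centroids, thresholds))
```

One threshold per cluster keeps the check cheap at streaming time: each incoming record needs only a nearest-centroid lookup and a single comparison.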

Results


Gradient Boost Tree Streaming Results

K-Means Streaming Results

Logistic Regression Streaming Results


Keywords: Spark Streaming, Real-Time, Network, Anomaly Detection, Machine Learning, Supervised Learning, Unsupervised Learning, Classification, Clustering
