Anomaly Detection with PCA and Deep Autoencoder

This repository hosts my work on the performance evaluation of four types of anomaly detectors on different datasets.

Methods of Anomaly Detection

Four anomaly detection methods are implemented and evaluated on each of the datasets:

PCA-based Reconstruction Error Method: The dataset is reconstructed with Principal Component Analysis (PCA), and anomalous data points are detected based on the reconstruction error, that is, the orthogonal distance between the original and the reconstructed representation.
PCA-based Multivariate Gaussian Method: The dataset is encoded with PCA, and anomalous data points are detected by fitting a multivariate Gaussian distribution to the encoded dataset.
Deep Autoencoder-based Reconstruction Error Method: A deep autoencoder is trained on the normal data with the goal of fully representing the original dataset. The dataset is then reconstructed with the trained autoencoder, and anomalous data points are detected based on the distance between the original and the reconstructed representation.
Deep Autoencoder-based Multivariate Gaussian Method: The dataset is encoded with the trained deep autoencoder, and anomalous data points are detected by fitting a multivariate Gaussian distribution to the encoded dataset.
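
To make the two scoring schemes concrete, here is a minimal NumPy sketch of the PCA variants on toy data. It is not the code in this repository: the function names (fit_pca, encode, reconstruction_error, gaussian_score), the toy data, and the small ridge term added to the covariance are all assumptions made for illustration. The autoencoder variants follow the same pattern, with the trained encoder and decoder taking the place of the PCA projection.

```python
import numpy as np


def fit_pca(X, n_components):
    """Return the data mean and the top principal directions (as rows)."""
    mean = X.mean(axis=0)
    # SVD of the centered data; the rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]


def encode(X, mean, components):
    """Project the data onto the principal subspace."""
    return (X - mean) @ components.T


def reconstruction_error(X, mean, components):
    """Squared distance between each point and its PCA reconstruction."""
    X_hat = encode(X, mean, components) @ components + mean
    return np.sum((X - X_hat) ** 2, axis=1)


def gaussian_score(Z_train, Z):
    """Squared Mahalanobis distance under a Gaussian fitted to Z_train."""
    mu = Z_train.mean(axis=0)
    cov = np.atleast_2d(np.cov(Z_train, rowvar=False))
    # Small ridge term keeps the covariance invertible (an assumption).
    cov = cov + 1e-6 * np.eye(cov.shape[0])
    diff = Z - mu
    return np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)


rng = np.random.default_rng(0)

# Toy "normal" data: points close to a 2-D plane inside a 5-D space.
basis = rng.normal(size=(2, 5))
normal = rng.normal(size=(200, 2)) @ basis + 0.01 * rng.normal(size=(200, 5))

# One anomaly pushed off the plane, one far away but still on the plane.
_, _, Vb = np.linalg.svd(basis)          # last row is orthogonal to the plane
off_plane = normal[0] + 3.0 * Vb[-1]
in_plane_far = 10.0 * basis[0]
X_test = np.vstack([normal, off_plane, in_plane_far])

mean, comps = fit_pca(normal, n_components=2)
rec_scores = reconstruction_error(X_test, mean, comps)
gauss_scores = gaussian_score(encode(normal, mean, comps),
                              encode(X_test, mean, comps))

print("highest reconstruction error at index", int(np.argmax(rec_scores)))
print("highest Gaussian score at index", int(np.argmax(gauss_scores)))
```

In this toy setup the two scores flag different points: the reconstruction error reacts to the point that leaves the principal subspace, while the Gaussian score reacts to the point that is unusual within it.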

Datasets

Yale Faces Dataset: A faces dataset from the Computer Vision Lab at Yale University. It contains photos of different subjects under 9 poses and 64 illumination conditions. The anomalies are all photos of people with a mustache.
MNIST: Images of handwritten digits released by Yann LeCun. The anomalies are all images of the digit 2.
Synthetic Dataset 1: An artificial dataset that I synthesized myself. It is a binary dataset of dimension 16, generated with a multivariate Gaussian distribution. The anomalies are vectors in which the total number of 1s is less than the threshold (4).
Synthetic Dataset 2: Also a binary dataset of dimension 16, generated with a multivariate Gaussian distribution. The anomalies are vectors in which the sum of the right n-1 digits is even and the left-most digit is odd (1).
Synthetic Dataset 3: A binary dataset of dimension 16, generated with three multivariate Gaussian distributions. The data generated by one distribution is treated as normal, and the data generated by the other two distributions is treated as anomalous.
Synthetic Dataset 4: A binary dataset of dimension 16, generated with two multivariate Gaussian distributions. The anomalies are vectors in which the total number of 1s is less than the threshold (4).
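
The two rule-based anomaly definitions above can be written as small predicates. This is only an illustrative sketch in plain Python: the function names are hypothetical, and the sampling and binarization steps are left out because they are not described in this section.

```python
def is_anomaly_few_ones(vector, threshold=4):
    """Rule for Synthetic Datasets 1 and 4: fewer than `threshold` ones."""
    return sum(vector) < threshold


def is_anomaly_parity(vector):
    """Rule for Synthetic Dataset 2: the left-most digit is 1 and the
    remaining n-1 digits sum to an even number."""
    return vector[0] == 1 and sum(vector[1:]) % 2 == 0


# Three hand-written 16-dimensional binary vectors.
a = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]  # three ones
b = [0] + [1] * 15                                     # fifteen ones
c = [1, 1] + [0] * 14                                  # two ones

for name, v in [("a", a), ("b", b), ("c", c)]:
    print(name, is_anomaly_few_ones(v), is_anomaly_parity(v))
```

Vector a is anomalous under both rules, b under neither, and c only under the count rule, since its right-hand digits sum to an odd number.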

Examples of the Datasets

Below is an example from the Yale Faces Dataset. Subjects with a mustache are defined as anomalies, like the one in the bottom-right photo:

Below is an example from the MNIST Dataset. Images of the digit 2 are defined as anomalies:

The table below shows the characteristics of the synthetic datasets:

The following shows how Synthetic Dataset 1 was generated:

The following shows how Synthetic Dataset 2 was generated:

Results for each type of anomaly detector

I applied each of the anomaly detectors to all of the datasets, using two evaluation metrics:

R-Precision: The precision of the top R results, where R is the total number of true anomalies (positives) in the dataset.
Precision @ K: The precision of the top K results.
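
As a small worked example, both metrics can be computed from a list of anomaly scores and binary ground-truth labels. The function names and toy numbers below are illustrative assumptions, not taken from this repository.

```python
def precision_at_k(scores, labels, k):
    """Fraction of true anomalies among the k highest-scoring points."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Indices sorted from the most to the least anomalous score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(labels[i] for i in ranked[:k]) / k


def r_precision(scores, labels):
    """Precision at R, where R is the number of true anomalies."""
    return precision_at_k(scores, labels, sum(labels))


# Toy example: five points, two of which are true anomalies (label 1).
scores = [0.9, 0.1, 0.8, 0.4, 0.7]
labels = [1, 0, 0, 0, 1]

print(r_precision(scores, labels))        # R = 2, top two are indices 0 and 2
print(precision_at_k(scores, labels, 3))  # top three are indices 0, 2 and 4
```

Here R-Precision is 0.5, because only one of the two highest-scoring points is a true anomaly, and Precision @ 3 is 2/3.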

Some findings:

No single anomaly detector excels on all of the datasets.
Autoencoder-based methods generally work better than PCA-based methods.
Gaussian-based methods fail on most of the datasets, partly due to the correlation between latent dimensions after encoding. However, the Autoencoder + Gaussian method succeeded on Synthetic Dataset 1.

Autoencoder-based methods work better than PCA-based methods in general

Autoencoder + Gaussian method succeeds in synthetic #1

Some Visualization

The project is not finished yet, so I only show a few visualizations here for presentation purposes. The results and conclusions are coming soon.

PCA - Eigenfaces generated from the Face Database

PCA - Reconstructed photos from the Face Database

Deep Autoencoder - Reconstructed photos from the MNIST Database

Deep Autoencoder - Result with the Reconstruction Error Method on the MNIST Database

Comparison between PCA and Deep Autoencoder - Reconstructed Data from the Synthetic Dataset 3

Reconstruction with PCA

Reconstruction with Deep Autoencoder

PCA - Result with the Reconstruction Error Method on the Synthetic Dataset 3
