Ivan-Zhou/Anomaly_Detection

Anomaly Detection with PCA and Deep Autoencoder

This repository hosts my work on the performance evaluation of four types of anomaly detectors on different datasets.

Methods of Anomaly Detection

Four anomaly detection methods are implemented and evaluated on each of the datasets:

  • PCA based Reconstruction Error Method: The dataset is reconstructed with Principal Component Analysis (PCA), and the anomalous data points are detected based on the reconstruction error, i.e. the orthogonal distance between the original and reconstructed representations.
  • PCA based Multivariate Gaussian Method: The dataset is encoded with PCA, and the anomalous data points are detected by fitting a Multivariate Gaussian Distribution to the encoded dataset.
  • Deep Autoencoder based Reconstruction Error Method: A deep autoencoder is trained on the normal data with the goal of representing the original dataset as fully as possible. The dataset is then reconstructed with the trained Deep Autoencoder model, and the anomalous data points are detected based on the distance between the original and reconstructed representations.
  • Deep Autoencoder based Multivariate Gaussian Method: The dataset is encoded with the trained Deep Autoencoder model, and the anomalous data points are detected by fitting a Multivariate Gaussian Distribution to the encoded dataset.
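
The two PCA-based scores described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the code used in this repository: the function names (`fit_pca`, `encode`, `reconstruction_error`, `gaussian_score`) are hypothetical, and the use of the squared Mahalanobis distance as the Gaussian score plus the small ridge added to the covariance are assumptions made for the sketch. The same two scoring functions would apply to an autoencoder by swapping in its encoder and decoder.

```python
import numpy as np


def fit_pca(X, n_components):
    """Return the data mean and the top principal directions (one per row)."""
    mean = X.mean(axis=0)
    # SVD of the centered data; the rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def encode(X, mean, components):
    """Project the data onto the principal directions."""
    return (X - mean) @ components.T


def reconstruction_error(X, mean, components):
    """Euclidean distance between each row and its PCA reconstruction."""
    X_hat = encode(X, mean, components) @ components + mean
    return np.linalg.norm(X - X_hat, axis=1)


def gaussian_score(Z_train, Z):
    """Squared Mahalanobis distance of Z under a Gaussian fitted to Z_train.

    A larger distance means a lower Gaussian density, so a higher value is
    read as "more anomalous". The 1e-6 ridge is an assumption of this sketch
    to keep the covariance invertible.
    """
    mu = Z_train.mean(axis=0)
    cov = np.atleast_2d(np.cov(Z_train, rowvar=False))
    cov = cov + 1e-6 * np.eye(cov.shape[0])
    diff = Z - mu
    return np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
```

With either score, the data points are ranked from highest to lowest and the top of the ranking is treated as the predicted anomalies.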

Datasets

  • Yale Faces Dataset: Faces dataset from the Computer Vision Lab at Yale University. The dataset contains photos of different subjects under 9 poses and 64 illumination conditions. The anomalies in the dataset are all photos of people with a mustache.
  • MNIST: the images of handwritten digits released by Yann LeCun. The anomalies in the dataset are all images of the digit 2.
  • Synthetic Dataset 1: an artificial dataset that I synthesized. It is a binary dataset with dimension 16, generated with a multivariate Gaussian distribution. The anomalies are vectors in which the total number of '1s' is less than the threshold (4).
  • Synthetic Dataset 2: It is also a binary dataset with dimension 16, generated with a multivariate Gaussian distribution. The anomalies are vectors in which the sum of the right n-1 digits is even and the left-most digit is odd (1).
  • Synthetic Dataset 3: It is a binary dataset with dimension 16, generated with three multivariate Gaussian distributions. The data generated with one distribution is treated as normal, and the data generated by the other two distributions is treated as anomalous.
  • Synthetic Dataset 4: It is a binary dataset with dimension 16, generated with two multivariate Gaussian distributions. The anomalies are vectors in which the total number of '1s' is less than the threshold (4).
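
The labeling rules stated above for the synthetic datasets can be written out as small predicates. This is only an illustrative sketch: the function names are hypothetical, the vectors are assumed to be sequences of 0/1 integers, and how the binary vectors themselves are sampled is not covered here.

```python
def is_anomaly_low_ones(vector, threshold=4):
    """Synthetic Datasets 1 and 4: anomalous if the count of 1s is below the threshold."""
    return sum(vector) < threshold


def is_anomaly_parity(vector):
    """Synthetic Dataset 2: anomalous if the left-most digit is 1 and the
    remaining n-1 digits sum to an even number."""
    return vector[0] == 1 and sum(vector[1:]) % 2 == 0
```

For example, a 16-dimensional vector with exactly three 1s is an anomaly under the first rule, while one with four 1s is not.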

Examples of the Datasets

Below is an example of the Yale Faces Dataset. Subjects with a mustache are defined as anomalies, such as the one in the bottom-right photo: Dataset_Faces

Below is an example of the MNIST Dataset. Images of the digit 2 are defined as anomalies: Dataset_MNIST

The table below shows the characteristics of the synthetic datasets: Table_Dataset_Synthetic

Below is an explanation of how Synthetic Dataset 1 was generated: Dataset_Synthetic1

Below is an explanation of how Synthetic Dataset 2 was generated: Dataset_Synthetic2

Results for each type of anomaly detector

I applied each of the anomaly detectors to all of the datasets, using two evaluation metrics:

  • R-Precision: The precision of the top R results, where R is the total number of positive (anomalous) examples in the dataset.
  • Precision @ K: The precision of the top K results.
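
Both metrics can be computed from a list of anomaly scores and ground-truth labels. The sketch below assumes that a higher score means "more anomalous" and that labels are 1 for anomalies and 0 for normal points; the function names are illustrative and not taken from this repository.

```python
def precision_at_k(scores, labels, k):
    """Fraction of true anomalies among the k highest-scoring data points."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Indices sorted from the highest score to the lowest.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(labels[i] for i in ranked[:k]) / k


def r_precision(scores, labels):
    """Precision at R, where R is the number of true anomalies."""
    r = sum(labels)
    if r == 0:
        raise ValueError("R-Precision is undefined without any anomalies")
    return precision_at_k(scores, labels, r)
```

R-Precision needs no tuning parameter, since the cutoff is set by the number of anomalies in the dataset, whereas Precision @ K lets the cutoff be chosen freely.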

Results_Evaluation_Table

Some findings:

  • No single anomaly detector excels on all the datasets.
  • Autoencoder-based methods work better than PCA-based methods in general.
  • Gaussian-based methods fail on most of the datasets, partially due to the correlation between latent dimensions after encoding. However, the Autoencoder + Gaussian method succeeded on Synthetic Dataset 1.

Autoencoder-based methods work better than PCA-based methods in general

Results_Evaluation_1

Autoencoder + Gaussian method succeeds in synthetic #1

Results_Evaluation_2

Some Visualizations

The project is not finished yet, so I only show a few visualizations here for presentation purposes. The results and conclusions are coming soon.

PCA - Eigenfaces generated from the Face Database

Faces_PCA_EIGEN

PCA - Reconstructed photos from the Face Database

Faces_PCA_RC

Deep Autoencoder - Reconstructed photos from the MNIST Database

MNIST_DA_RC

Deep Autoencoder - Result with the Reconstruction Error Method on the MNIST Database

MNIST_DA_RE

Comparison between PCA and Deep Autoencoder - Reconstructed Data from the Synthetic Dataset 3

Reconstruction with PCA

S3_PCA_RE

Reconstruction with Deep Autoencoder

S3_DA_RE

PCA - Result with the Reconstruction Error Method on the Synthetic Dataset 3

S3_PCA_RE_RESULT