LSTM-AE SageMaker Algorithm

The Time Series Anomaly Detection (LSTM-AE) Algorithm from AWS Marketplace performs time series anomaly detection with a Long Short-Term Memory Network Autoencoder (LSTM-AE). It implements both training and inference from CSV data and supports both CPU and GPU instances. The training and inference Docker images were built by extending the PyTorch 2.1.0 Python 3.10 SageMaker containers.

Model Description

The LSTM-AE model reconstructs the time series with an LSTM autoencoder. The encoder and decoder consist of a single LSTM layer and have the same number of hidden units. The encoder takes as input the time series and returns the hidden states. The hidden states of the encoder are used for initializing the hidden states of the decoder, which reconstructs the time series in reversed order. The autoencoder parameters are learned on a training set containing only normal data (i.e. without anomalies) by minimizing the mean squared error (MSE) between the actual and reconstructed values of the time series.
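
As a rough illustration of this architecture (not the repository's actual code), the minimal PyTorch sketch below builds a single-layer LSTM encoder and decoder with the same number of hidden units and trains on the reconstruction MSE. Feeding the reversed input to the decoder is a teacher-forcing simplification; the class and variable names are assumptions made for the example.

```python
import torch
import torch.nn as nn


class LSTMAE(nn.Module):
    """Minimal LSTM autoencoder: single-layer encoder and decoder, same hidden size."""

    def __init__(self, num_series: int, hidden_size: int):
        super().__init__()
        self.encoder = nn.LSTM(input_size=num_series, hidden_size=hidden_size, batch_first=True)
        self.decoder = nn.LSTM(input_size=num_series, hidden_size=hidden_size, batch_first=True)
        self.output = nn.Linear(hidden_size, num_series)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, sequence_length, num_series)
        _, (h, c) = self.encoder(x)                    # encoder hidden states summarize the sequence
        x_rev = torch.flip(x, dims=[1])                # decoder reconstructs the sequence in reversed order
        out, _ = self.decoder(x_rev, (h, c))           # decoder initialized with the encoder's hidden states
        return torch.flip(self.output(out), dims=[1])  # flip back to the original time order


# Training minimizes the MSE between the actual and reconstructed sequences (normal data only).
model = LSTMAE(num_series=3, hidden_size=64)
x = torch.randn(8, 100, 3)                             # (batch, sequence_length, num_series)
loss = nn.MSELoss()(model(x), x)
```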

After the model has been trained, a multivariate Gaussian distribution is fitted to the model's reconstruction errors on an independent validation set (also without anomalies) using Maximum Likelihood Estimation (MLE). At inference time, the model reconstructs the values of all the time series (which can now include anomalies) and calculates the squared Mahalanobis distance between the reconstruction errors and the Gaussian distribution previously estimated on normal data. The squared Mahalanobis distance is then used as the anomaly score: the larger the squared Mahalanobis distance at a given time step, the more likely that time step is to be an anomaly.
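
The sketch below illustrates this scoring step in NumPy, assuming the per-time-step reconstruction errors are already available as arrays; the variable names and the random placeholder data are purely illustrative.

```python
import numpy as np

# errors_valid: reconstruction errors on normal data, shape (num_steps, num_series)
# errors_test:  reconstruction errors at inference time, same layout
rng = np.random.default_rng(0)
errors_valid = rng.normal(size=(500, 3))
errors_test = rng.normal(size=(100, 3))

# MLE fit of a multivariate Gaussian to the reconstruction errors on normal data
# (bias=True gives the MLE covariance, which divides by N rather than N - 1).
mu = errors_valid.mean(axis=0)
sigma = np.cov(errors_valid, rowvar=False, bias=True)
sigma_inv = np.linalg.inv(sigma)

# Squared Mahalanobis distance of each test-time error vector from the fitted Gaussian,
# used as the anomaly score: the larger the distance, the more likely an anomaly.
diff = errors_test - mu
scores = np.einsum("ij,jk,ik->i", diff, sigma_inv, diff)
```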

LSTM-AE architecture (source: doi: 10.48550/arXiv.1607.00148)

Model Resources: [Paper]

SageMaker Algorithm Description

The algorithm implements the model as described above with no changes.

Notes:

  • The algorithm splits the training data into two independent subsets: one subset is used for training the LSTM autoencoder, while the other subset is used for calculating the reconstruction errors to which the parameters of the multivariate Gaussian distribution are fitted. The (optional) validation data accepted by the algorithm is only used for scoring the model, i.e. for calculating the mean squared error (MSE) and mean absolute error (MAE) between the actual values of the time series in the validation dataset and their reconstructed values generated by the previously trained LSTM autoencoder.

  • The algorithm views the multivariate time series as different measurements on the same system. An anomaly is understood as an abnormal behavior of the entire system, not of a single individual measurement. As a result, the algorithm outputs only one anomaly score for each time step, representing the likelihood that the overall system is in an abnormal state at that time step. The algorithm can also be applied to a univariate time series (i.e. to a single time series). Consider fitting the model to each individual time series if the time series are not similar or related to each other, or if you need to identify the anomalies in each time series separately.

Training

The training algorithm has two input data channels: training and validation. The training channel is mandatory, while the validation channel is optional.

The training and validation datasets should be provided as CSV files and should only contain normal data (i.e. without anomalies). Each column of the CSV file represents a time series, while each row represents a time step. All the time series should have the same length and should not contain missing values. The CSV file should not contain any index column or column headers. See the sample input files train.csv and valid.csv.
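
The following pandas sketch shows the expected CSV layout (columns = time series, rows = time steps, no index column, no headers); the synthetic data and the 800/200 split are illustrative only, while the file names match the sample files mentioned above.

```python
import numpy as np
import pandas as pd

# Two synthetic time series (columns) observed over 1000 time steps (rows), all normal data.
rng = np.random.default_rng(42)
steps = np.arange(1000)
data = np.column_stack([
    np.sin(0.02 * steps) + 0.1 * rng.normal(size=steps.size),
    np.cos(0.02 * steps) + 0.1 * rng.normal(size=steps.size),
])

# The CSV files must contain no index column and no header row.
pd.DataFrame(data[:800]).to_csv("train.csv", index=False, header=False)
pd.DataFrame(data[800:]).to_csv("valid.csv", index=False, header=False)
```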

See notebook.ipynb for an example of how to launch a training job.
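
As a hedged sketch of what such a training job might look like with the SageMaker Python SDK (the notebook remains the authoritative example), the snippet below uses an AlgorithmEstimator for the Marketplace algorithm. The algorithm ARN, S3 paths, instance type, and hyperparameter values are placeholders, not recommendations.

```python
import sagemaker
from sagemaker.algorithm import AlgorithmEstimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = session.default_bucket()

# Placeholder: substitute the ARN of the algorithm subscribed to from AWS Marketplace.
algorithm_arn = "arn:aws:sagemaker:<region>:<account>:algorithm/<lstm-ae-algorithm>"

estimator = AlgorithmEstimator(
    algorithm_arn=algorithm_arn,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",        # CPU instance; GPU instances are also supported
    hyperparameters={
        "sequence-length": 100,          # illustrative values; see the Hyperparameters section below
        "sequence-stride": 10,
        "hidden-size": 64,
        "lr": 0.001,
        "batch-size": 32,
        "epochs": 100,
    },
)

# The training channel is mandatory, the validation channel is optional.
estimator.fit({
    "training": TrainingInput(f"s3://{bucket}/lstm-ae/train.csv", content_type="text/csv"),
    "validation": TrainingInput(f"s3://{bucket}/lstm-ae/valid.csv", content_type="text/csv"),
})
```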

Distributed Training

The algorithm supports multi-GPU training on a single instance, which is implemented through torch.nn.DataParallel. The algorithm does not support multi-node (or distributed) training across multiple instances.

Hyperparameters

The training algorithm takes as input the following hyperparameters:

  • sequence-length: int. The length of the sequences.
  • sequence-stride: int. The period between consecutive sequences.
  • hidden-size: int. The number of hidden units of each LSTM layer.
  • lr: float. The learning rate used for training.
  • batch-size: int. The batch size used for training.
  • epochs: int. The number of training epochs.

Metrics

The training algorithm logs the following metrics:

  • train_mse: float. Training mean squared error.
  • train_mae: float. Training mean absolute error.

If the validation channel is provided, the training algorithm also logs the following additional metrics:

  • valid_mse: float. Validation mean squared error.
  • valid_mae: float. Validation mean absolute error.

See notebook.ipynb for an example of how to launch a hyperparameter tuning job.
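
Continuing from the training sketch above, a hyperparameter tuning job might be set up as follows. The choice of valid_mse as the objective metric assumes the validation channel is provided, the parameter ranges are illustrative, and no metric definitions are passed because for a Marketplace algorithm they are typically taken from the algorithm's training specification.

```python
from sagemaker.inputs import TrainingInput
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

# Tune a subset of the hyperparameters; ranges are illustrative, not recommendations.
hyperparameter_ranges = {
    "hidden-size": IntegerParameter(32, 256),
    "lr": ContinuousParameter(1e-4, 1e-2),
}

tuner = HyperparameterTuner(
    estimator=estimator,                 # the AlgorithmEstimator defined in the training sketch
    objective_metric_name="valid_mse",   # requires the validation channel
    objective_type="Minimize",
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=10,
    max_parallel_jobs=2,
)

tuner.fit({
    "training": TrainingInput(f"s3://{bucket}/lstm-ae/train.csv", content_type="text/csv"),
    "validation": TrainingInput(f"s3://{bucket}/lstm-ae/valid.csv", content_type="text/csv"),
})
```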

Inference

The inference algorithm takes as input a CSV file containing the time series. Each column of the CSV file represents a time series, while each row represents a time step. The CSV file should not contain any index column or column headers. All the time series should have the same length and should not contain missing values. See the sample input file test.csv.

The inference algorithm outputs the anomaly scores and the reconstructed values of the time series. The anomaly scores are included in the first column, while the reconstructed values of the time series are included in the subsequent columns. See the sample output files batch_predictions.csv and real_time_predictions.csv.

See notebook.ipynb for an example of how to launch a batch transform job.
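
A hedged sketch of a batch transform job, continuing from the training sketch above: the S3 paths, instance type, and the parsing of the output file (no header row, scores in the first column) are assumptions for illustration.

```python
import pandas as pd

# Create a transformer from the trained estimator and run batch inference on test.csv.
transformer = estimator.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path=f"s3://{bucket}/lstm-ae/output",
)
transformer.transform(
    data=f"s3://{bucket}/lstm-ae/test.csv",
    content_type="text/csv",
)
transformer.wait()

# The output CSV has the anomaly scores in the first column and the reconstructed
# time series in the subsequent columns (reading from S3 requires s3fs).
predictions = pd.read_csv(f"s3://{bucket}/lstm-ae/output/test.csv.out", header=None)
anomaly_scores = predictions.iloc[:, 0]
reconstructions = predictions.iloc[:, 1:]
```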

Note: The algorithm does not support variable length sequences, and therefore the length of the input time series should be a multiple of the sequence length.
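
If the input length is not already a multiple of the sequence length, it can be truncated beforehand; the small sketch below is illustrative (the file name, in-place overwrite, and sequence length are assumptions).

```python
import pandas as pd

sequence_length = 100  # must match the sequence-length hyperparameter used at training time

data = pd.read_csv("test.csv", header=None)
usable_steps = (len(data) // sequence_length) * sequence_length
data.iloc[:usable_steps].to_csv("test.csv", index=False, header=False)
```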

Endpoints

The algorithm supports only real-time inference endpoints. The inference image is too large to be uploaded to a serverless inference endpoint.

See notebook.ipynb for an example of how to deploy the model to an endpoint, invoke the endpoint and process the response.
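
As a hedged sketch, continuing from the training sketch above, deployment and invocation might look as follows; the instance type, serializer choice, and response parsing (no header row, scores in the first column) are assumptions for illustration.

```python
import io
import pandas as pd
from sagemaker.serializers import CSVSerializer

# Deploy the trained model to a real-time inference endpoint.
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    serializer=CSVSerializer(),
)

# Send the time series as CSV and parse the CSV response: anomaly scores in the
# first column, reconstructed values in the subsequent columns.
data = pd.read_csv("test.csv", header=None)
response = predictor.predict(data.values)
predictions = pd.read_csv(io.BytesIO(response), header=None)
anomaly_scores = predictions.iloc[:, 0]

# Delete the endpoint when it is no longer needed.
predictor.delete_endpoint()
```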

Additional Resources: [Sample Notebook] [Blog Post]

References

  • P. Malhotra, A. Ramakrishnan, G. Anand, L. Vig, P. Agarwal, and G. Shroff, "LSTM-based encoder-decoder for multi-sensor anomaly detection", 2016, arXiv preprint, doi: 10.48550/arXiv.1607.00148.