Attentioned Convolutional LSTM Inpainting Network for Anomaly Detection in Videos
We propose a semi-supervised model for detecting anomalies in videos, inspired by the Video Pixel Network (VPN) [Kalchbrenner et al., 2016]. The VPN is a probabilistic generative model based on a deep neural network that estimates the discrete joint distribution of raw pixels in video frames. Our model extends the Convolutional-LSTM video encoder of the VPN with a novel convolution-based attention mechanism. We also adapt the Pixel-CNN decoder of the VPN to a frame inpainting task, in which a partially masked version of the frame to predict is given as input. The frame reconstruction error is used as an anomaly indicator. We test our model on a modified version of the Moving MNIST dataset [Srivastava et al., 2015], where it detects anomalies more accurately than the baselines we compare against. This approach could be a component in applications requiring visual common sense.
Real-time anomaly detection in videos has significant value across many domains, such as robot patrolling [Chakravarty et al., 2007] and visual inspection of manufacturing processes. The task remains challenging due to the complexity and variability of the data and the high computational cost in an edge-device setting. Current approaches include supervised models based on Convolutional Neural Network (CNN) architectures [Sabokrou et al., 2018], long-term temporal dynamic models such as Recurrent Neural Networks (RNNs) [Radford et al., 2018], and unsupervised models for learning video features [Zhang et al., 2016, Pham et al., 2011, Zhao et al., 2011]. In this paper we propose an encoder-decoder network whose input is a sequence of frames with the last frame partially masked and whose output is a reconstruction of that frame. We use the reconstruction error as an indicator of an anomaly in the sequence: we assume the model will reconstruct a masked pixel with a typical value, in accordance with that pixel's spatial and temporal context, so a pixel containing an unexpected or out-of-context value will be poorly reconstructed and will indicate an anomaly. Our model is inspired by the Video Pixel Network (VPN) [Kalchbrenner et al., 2016], with two main differences: 1. We add a convolution-based attention mechanism in which the filter weights are dynamic and input dependent. This mechanism utilizes the local structure of images better than a standard global weighting attention mechanism. 2. A partially masked version of the frame to predict is given as input to the model (see figure 1(b)). This eliminates the need for masked convolutions and enables the predicted distributions of all pixels to be computed in parallel. The model can also utilize information from the unmasked neighborhood of a predicted pixel, which makes the prediction task tractable.
With the modifications proposed above, our model is able to find anomalies in videos in an unsupervised manner and in real time.
2 Related work
2.1 Anomaly Detection
Unsupervised and semi-supervised video anomaly detection models can be classified into three main categories: 1. Representation learning for reconstruction: encoder-decoder methods that transform the input into a hidden representation and then try to reconstruct it. Anomalies are represented by poorly reconstructed deviations from the source. Principal Component Analysis (PCA) and Auto-Encoders (AEs) are examples of such models. 2. Predictive modeling: the sequence of frames is viewed as a time series and the model's task is to predict the distribution of the next frame's pixel values. Anomalies are represented by pixels with low likelihood values. Auto-regressive models and Convolutional-LSTMs are examples of such models. 3. Generative models: e.g., Generative Adversarial Networks (GANs) and Variational Auto-Encoders (VAEs), which can compute a measure of frame abnormality.
2.2 Video Pixel Network
The VPN is a frame prediction model shown to give state-of-the-art results on the Moving MNIST and Robotic Pushing datasets [Finn et al., 2016]. Its architecture consists of two parts: a resolution-preserving CNN encoder and a Pixel-CNN decoder [van den Oord et al., 2016]. The CNN encoder output is aggregated over time by a Convolutional-LSTM in order to capture temporal dependencies. The Pixel-CNN decoder uses masked convolutions to model space and color dependencies in the predicted frame (by allowing information to flow from previously predicted pixels to the pixel currently being predicted). The last layer of the Pixel-CNN decoder is a softmax layer over 256 intensity values for each color channel in each pixel.
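To make the masked convolution idea concrete, the following is a minimal sketch of the binary mask that a Pixel-CNN style decoder multiplies into a square kernel under a top-to-bottom, left-to-right pixel order. The function name `causal_kernel_mask` and the `include_center` flag are illustrative assumptions, and the sketch covers a single channel only (it ignores the ordering between color channels).

```python
import numpy as np

def causal_kernel_mask(kernel_size: int, include_center: bool) -> np.ndarray:
    """Binary mask for a square kernel under raster-scan pixel order.

    Entries set to 1 mark kernel positions that lie before the center pixel
    (rows above it, or to its left in the same row). The center itself is
    kept only when include_center is True.
    """
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd")
    mask = np.zeros((kernel_size, kernel_size), dtype=np.float32)
    center = kernel_size // 2
    mask[:center, :] = 1.0        # every row above the center row
    mask[center, :center] = 1.0   # pixels to the left in the center row
    if include_center:
        mask[center, center] = 1.0
    return mask

# Hypothetical usage: multiply the mask into the kernel weights before convolving.
first_layer_mask = causal_kernel_mask(3, include_center=False)
later_layer_mask = causal_kernel_mask(3, include_center=True)
```

Multiplying the kernel by such a mask is what prevents a predicted pixel from seeing its own true value or any pixel that follows it in the ordering.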
3 Our model
3.1 Convolutional based attention mechanism
The relevant context window for frame prediction may vary in size and in how importance is distributed across its frames. An attention mechanism is a popular tool for overcoming the memory limitations of recurrent models and bringing the relevant parts of a context window into focus. Since current attention mechanisms do not leverage the local structure of images, we propose using a convolution with input-dependent filter weights to produce an attention-like mechanism [Shen et al., 2017]. We use a small meta-network to output context-sensitive convolution filters, which are then applied to a tensor of concatenated Convolutional-LSTM outputs (representing the context window). The Convolutional-LSTM and convolutional attention output tensors preserve the spatial dimensions and local structure of the video frames. This allows us to concatenate the partially masked frame as an additional channel of the attention output tensor and forward it to the inpainting network for reconstruction (see figure 1(a)).
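The mechanism described above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the architecture used in the paper: the meta-network here is a hypothetical global average pooling followed by a single dense layer with a tanh, and all names, shapes and sizes (`generate_filters`, `conv2d_same`, `meta_w`, `meta_b`) are invented for the example.

```python
import numpy as np

def generate_filters(context, meta_w, meta_b, c_out, k):
    """Hypothetical meta-network: global average pooling plus one dense layer.

    context: (c_in, H, W) tensor of concatenated Convolutional-LSTM outputs.
    Returns filters of shape (c_out, c_in, k, k) that depend on the input.
    """
    c_in = context.shape[0]
    summary = context.mean(axis=(1, 2))          # (c_in,) summary of the window
    flat = np.tanh(meta_w @ summary + meta_b)    # (c_out * c_in * k * k,)
    return flat.reshape(c_out, c_in, k, k)

def conv2d_same(x, w):
    """Cross-correlate x (c_in, H, W) with w (c_out, c_in, k, k), zero padded
    so that the spatial size is preserved (k is assumed to be odd)."""
    c_out, _, k, _ = w.shape
    pad = k // 2
    _, height, width = x.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((c_out, height, width))
    for dy in range(k):
        for dx in range(k):
            patch = xp[:, dy:dy + height, dx:dx + width]
            out += np.einsum("oc,chw->ohw", w[:, :, dy, dx], patch)
    return out

# Toy context window: T Convolutional-LSTM outputs with C channels each.
rng = np.random.default_rng(0)
T, C, H, W = 4, 2, 8, 8
context = rng.normal(size=(T, C, H, W)).reshape(T * C, H, W)

c_in, c_out, k = T * C, 3, 3
meta_w = rng.normal(scale=0.1, size=(c_out * c_in * k * k, c_in))
meta_b = np.zeros(c_out * c_in * k * k)

filters = generate_filters(context, meta_w, meta_b, c_out, k)
attended = conv2d_same(context, filters)                 # (c_out, H, W)

# The partially masked frame is appended as one more channel.
masked_frame = rng.random((1, H, W))
decoder_input = np.concatenate([attended, masked_frame], axis=0)
```

Because the filters are recomputed from each context window, two different windows are convolved with different weights, while the output keeps the spatial layout needed to stack the masked frame on top of it.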
3.2 Convolutions with masked frames for image inpainting
In the VPN model the frame to be predicted is given as input in the training phase. The Pixel-CNN decoder uses masked convolutions to ensure the predicted pixel does not "see" its label (i.e. its true value). The masked convolution only uses information from pixels preceding the predicted pixel (for a top-to-bottom, left-to-right pixel order), enabling the network to model some of the spatial dependencies in the predicted frame. At inference time the pixels are predicted sequentially. In our anomaly detection reconstruction approach the frame to be predicted is also given as input but is partially masked, blocking the flow of information from a label to a masked pixel. The spatial dependencies of a pixel are modeled using information from non-masked pixels in its neighborhood. We use a grid mask with random shifts in which 95% of the pixels in the frame are masked (see figure 1(b)). This way the model learns a general structure of the frame and must rely on temporal dependencies. At inference time the same procedure is applied, so the pixels are predicted in parallel, resulting in real-time detection.
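One way to realize a grid mask with random shifts is sketched below. The grid periods (one visible pixel per 4 x 5 cell, i.e. 1 in 20 pixels visible), the zero fill value and the function names are assumptions made for the example; the text only specifies that the mask is a randomly shifted grid hiding 95% of the pixels.

```python
import numpy as np

def grid_mask(height, width, period_y=4, period_x=5, rng=None):
    """Boolean mask that is True where a pixel is hidden.

    One pixel per (period_y x period_x) cell stays visible, and the grid is
    shifted by a random offset on each call. With the default periods about
    1 pixel in 20 is visible, so roughly 95% of the frame is masked.
    """
    rng = np.random.default_rng() if rng is None else rng
    shift_y = int(rng.integers(period_y))
    shift_x = int(rng.integers(period_x))
    visible_rows = (np.arange(height) % period_y) == shift_y
    visible_cols = (np.arange(width) % period_x) == shift_x
    visible = visible_rows[:, None] & visible_cols[None, :]
    return ~visible

def apply_mask(frame, mask, fill_value=0.0):
    """Replace hidden pixels with fill_value (an assumed encoding)."""
    return np.where(mask, fill_value, frame)

# Hypothetical usage on a random frame.
rng = np.random.default_rng(0)
frame = rng.random((64, 64))
mask = grid_mask(64, 64, rng=rng)
masked_frame = apply_mask(frame, mask)
masked_fraction = mask.mean()   # close to 0.95 for a 64 x 64 frame
```

Drawing a new shift for every training example means that every pixel position is hidden in most samples and visible in a few, so the network cannot memorize which positions it will be asked to fill in.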
3.3 Loss function as an anomaly measure
We use the log-likelihood of the pixel values under the distribution predicted by the network as a loss function. The average pixel log-likelihood is used as a global score for frame abnormality, where we assume the pixels are independently distributed given the unmasked pixels and the context window frames. The loss is defined as:
$$\mathcal{L}(x_t \mid \tilde{x}_t, x_{<t}; \theta) = \sum_{i} \sum_{c} \log p(x_t^{i,c} \mid \tilde{x}_t, x_{<t}; \theta)$$
where $x_t$ are the pixels of the frame to reconstruct at time $t$, $\tilde{x}_t$ is the masked frame, $x_{<t}$ are all the frames prior to the $t$-th frame, $\theta$ are the network parameters, $x_t^{i,c}$ is the value of channel $c$ of the $i$-th pixel of frame $t$ and $p(x_t^{i,c} \mid \tilde{x}_t, x_{<t}; \theta)$ is the predicted distribution for that value.
In the training phase we train our network only on anomaly-free videos. This way the network learns to predict a distribution over pixel values that reflects normal behavior, and it assigns low probability to abnormal values at inference time. We use the log-likelihood as an anomaly measure: pixel values with low likelihood are more likely to show an anomaly.
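As a small sketch of how such a score could be computed, assume the decoder outputs one vector of 256 logits per pixel and channel, matching the 256-way softmax of the Pixel-CNN decoder. The array layout `(H, W, C, 256)` and the function names are assumptions made for the example.

```python
import numpy as np

def pixel_log_likelihood(logits, targets):
    """Log-probability of each observed pixel value.

    logits:  (H, W, C, 256) unnormalized scores from the decoder.
    targets: (H, W, C) observed intensities in [0, 255].
    Returns an (H, W, C) array of log p(observed value).
    """
    # Numerically stable log-softmax over the 256 intensity values.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    index = targets.astype(np.int64)[..., None]
    return np.take_along_axis(log_probs, index, axis=-1)[..., 0]

def frame_score(logits, targets):
    """Average pixel log-likelihood; lower values suggest an anomaly."""
    return float(pixel_log_likelihood(logits, targets).mean())

# Toy example: the model strongly expects intensity 0 everywhere.
logits = np.zeros((4, 4, 1, 256))
logits[..., 0] = 20.0
expected_frame = np.zeros((4, 4, 1), dtype=np.int64)
unexpected_frame = np.full((4, 4, 1), 255, dtype=np.int64)

normal_score = frame_score(logits, expected_frame)
anomalous_score = frame_score(logits, unexpected_frame)
```

The per-pixel array returned by `pixel_log_likelihood` can also be inspected directly to localize which pixels the model finds unlikely, while `frame_score` gives the single number used to rank frames.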
We evaluate each contribution proposed in this paper, convolution-based attention and masked frame reconstruction, on a modified version of the Moving MNIST dataset [Srivastava et al., 2015]. We show that our model can learn both the temporal and the spatial aspects of the videos and automatically detect anomalies without explicit supervision. We compare against two baselines, the original VPN model and a Conv-LSTM network (which detects abnormal frames based on the reconstruction error [Medel and Savakis, 2016]), together with two variants of our model: the first omits the masked frame from the input and the second does not use attention.
Dataset - Moving MNIST is a common dataset consisting of two digits moving independently in a frame (potentially overlapping) with constant velocity. It consists of sequences of 20 frames of size $64 \times 64$. The training sequences are generated on-the-fly by sampling MNIST digits and generating trajectories with randomly sampled velocity and angle. The training set was downloaded from [Srivastava et al., 2015] and consists of 10000 sequences. Our test set consists of both normal and corrupted sequences. In order to generate a corrupted sequence, we replace the last frame with the first frame and paint a "corruption" of black pixels on a digit (see figure 1). These corruptions are in two dimensions: temporal (changing the frame order) and spatial (painting the black square).
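The corruption procedure can be sketched as follows. The square size and the way its position is chosen (centered on a randomly selected non-zero pixel, so that it overlaps a digit) are assumptions for the example, as is the function name `corrupt_sequence`.

```python
import numpy as np

def corrupt_sequence(seq, square=6, rng=None):
    """Return a corrupted copy of a (T, H, W) sequence.

    Temporal corruption: the last frame is replaced by the first frame.
    Spatial corruption: a black square is painted over part of a digit,
    here centered on a randomly chosen non-zero pixel (an assumption).
    """
    rng = np.random.default_rng() if rng is None else rng
    corrupted = seq.copy()
    last = seq[0].copy()
    ys, xs = np.nonzero(last)
    if ys.size > 0:
        j = int(rng.integers(ys.size))
        top = max(int(ys[j]) - square // 2, 0)
        left = max(int(xs[j]) - square // 2, 0)
        last[top:top + square, left:left + square] = 0
    corrupted[-1] = last
    return corrupted

# Toy sequence: one bright block drifting diagonally across 20 frames.
sequence = np.zeros((20, 64, 64), dtype=np.uint8)
for t in range(20):
    sequence[t, 10 + t:30 + t, 5 + t:25 + t] = 255

corrupted = corrupt_sequence(sequence, rng=np.random.default_rng(0))
```

Only the final frame differs from the original sequence, so a detector has to notice both that the frame is out of temporal order and that part of a digit is missing.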
Evaluation Metric - We use the Equal Error Rate (EER), defined here as the accuracy at the threshold where precision equals recall, a standard metric in abnormal event detection.
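A minimal sketch of the metric as described above: precision equals recall exactly when the number of false positives equals the number of false negatives, which holds if the detector flags as many sequences as there are true anomalies. The function name `break_even_accuracy`, the convention that higher scores mean "more anomalous" (for example the negated average log-likelihood), and the toy numbers are assumptions for the example.

```python
def break_even_accuracy(scores, labels):
    """Accuracy at the operating point where precision equals recall.

    scores: anomaly scores, higher means more anomalous.
    labels: 1 for anomalous sequences, 0 for normal ones.
    Flagging exactly as many items as there are true anomalies makes the
    false positive and false negative counts equal.
    """
    if len(scores) != len(labels) or not labels:
        raise ValueError("scores and labels must be non-empty and aligned")
    n_anomalous = sum(labels)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    flagged = set(ranked[:n_anomalous])
    correct = sum((i in flagged) == bool(labels[i]) for i in range(len(labels)))
    return correct / len(labels)

# Toy example: six sequences, three of them anomalous.
toy_scores = [0.9, 0.8, 0.3, 0.2, 0.7, 0.1]
toy_labels = [1, 1, 0, 0, 0, 1]
toy_accuracy = break_even_accuracy(toy_scores, toy_labels)
```

In the toy example the three highest scores include one normal sequence and miss one anomalous sequence, so four of the six decisions are correct. Ties between scores are broken by input order in this sketch.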
Results - Table 4 shows the EER for the different models tested. Our model outperforms both the baseline models (VPN and Conv-LSTM) and the partial variations of our model, showing the importance of each contribution. Replacing the last frame with the first tests the ability of the models to detect temporal anomalies in the sequence. In such anomalies the attention mechanism improves the model’s ability to capture the abnormal frame-to-frame changes. The black square corruption tests the ability of the model to capture spatial dependencies. Our masked frame approach captures the dependencies between a masked pixel and its unmasked neighborhood, resulting in the reconstruction of the original values of the blackened pixels, i.e. predicting low probability for zero values.
|Model|EER|
|---|---|
|Our model w/o the masked convolutions|84.6|
|Our model w/o the conv-attention mechanism|85.7|
- [van den Oord et al., 2016] Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pages 4790–4798, 2016.
- [Srivastava et al., 2015] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning, pages 843–852, 2015.
- [Chakravarty et al., 2007] Punarjay Chakravarty, Alan M Zhang, Ray Jarvis, and Lindsay Kleeman. Anomaly detection and tracking for a patrolling robot. In Australasian Conference on Robotics and Automation (ACRA). Citeseer, 2007.
- [Sabokrou et al., 2018] Mohammad Sabokrou, Mohsen Fayyaz, Mahmood Fathy, Zahra Moayed, and Reinhard Klette. Deep-anomaly: Fully convolutional neural network for fast anomaly detection in crowded scenes. Computer Vision and Image Understanding, 2018.
- [Radford et al., 2018] Benjamin J Radford, Leonardo M Apolonio, Antonio J Trias, and Jim A Simpson. Network traffic anomaly detection using recurrent neural networks. arXiv preprint arXiv:1803.10769, 2018.
- [Zhang et al., 2016] Ying Zhang, Huchuan Lu, Lihe Zhang, and Xiang Ruan. Combining motion and appearance cues for anomaly detection. Pattern Recognition, 51:443–452, 2016.
- [Pham et al., 2011] Duc Son Pham, Budhaditya Saha, Dinh Q Phung, and Svetha Venkatesh. Detection of cross-channel anomalies from multiple data channels. In Data Mining (ICDM), 2011 IEEE 11th International Conference on, pages 527–536. IEEE, 2011.
- [Zhao et al., 2011] Bin Zhao, Li Fei-Fei, and Eric P Xing. Online detection of unusual events in videos via dynamic sparse coding. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 3313–3320. IEEE, 2011.
- [Kalchbrenner et al., 2016] Nal Kalchbrenner, Aaron van den Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Video pixel networks. arXiv preprint arXiv:1610.00527, 2016.
- [Finn et al., 2016] Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. In Advances in Neural Information Processing Systems, pages 64–72, 2016.
- [Shen et al., 2017] Dinghan Shen, Martin Renqiang Min, Yitong Li, and Lawrence Carin. Adaptive convolutional filter generation for natural language understanding. CoRR, abs/1709.08294, 2017. URL http://arxiv.org/abs/1709.08294.
- [Medel and Savakis, 2016] Jefferson Ryan Medel and Andreas Savakis. Anomaly detection in video using predictive convolutional long short-term memory networks. arXiv preprint arXiv:1612.00390, 2016.