AND: Autoregressive Novelty Detectors

Davide Abati  Angelo Porrello  Simone Calderara  Rita Cucchiara

University of Modena and Reggio Emilia,
Via P.Vivarelli, 10, Modena, Italy

We propose an unsupervised model for novelty detection. The subject is treated as a density estimation problem, in which a deep neural network is employed to learn a parametric function that maximizes probabilities of training samples. This is achieved by equipping an autoencoder with a novel module, responsible for the maximization of compressed codes’ likelihood by means of autoregression. We illustrate design choices and proper layers to perform autoregressive density estimation when dealing with both image and video inputs. Despite a very general formulation, our model shows promising results in diverse one-class novelty detection and video anomaly detection benchmarks.

1 Introduction

Novelty and anomaly detection have been extensively studied in computer vision for many years. Nonetheless, the cornerstone of research in the field is the definition of novelty itself. The concept of discovering novelty dates back to Greek philosophy, in Plato’s well-known paradox of inquiry, where the philosopher raised the question “how can I know something that I didn’t know before?”. Hence, the problem of novelty can be understood as the problem of knowledge [8]. Accordingly, novelty detection in computer science was defined as “the identification of unknown data or signals that a machine learning system is not aware of during training” [35, 2]. In computer vision this has been mostly associated with events or actions that differ from known models, e.g. referring to an object, a person or a group of targets which behave in a surprising manner [11].
In this work we tackle this problem by modelling knowledge by means of a probability distribution, for which “novel” elements are the most inexplicable ones in a probabilistic sense. This formulation abides by both the conventional computer science definition and classical interpretations of novelty. Technically, this is achievable through an unsupervised knowledge distiller that summarizes the actual informative content of the data and simultaneously models its probability distribution. To this aim, we leverage the power of deep autoencoders as effective knowledge compressors, while forcing autoregressive properties on distilled codes in order to estimate their density function. Although autoregressive density estimation has been profitably applied to deep learning [30, 56, 18], our contribution resides in a holistic approach where the extracted knowledge is conditioned to follow an autoregressive scheme, so that its probability can be captured during training. We evaluate our Autoregressive Novelty Detector (AND) on both image and video novelty and anomaly detection tasks, achieving very encouraging results. To the best of our knowledge, this is the first model employing autoregression for the estimation of density functions in complex, high dimensional feature spaces. We believe the proposed procedure can foster many possible refinements in several different applications processing highly structured, high-dimensional data.

2 Related work

Novelty and anomaly detection in vision. Due to the complexity in defining and labeling novel events, nearly all methods in the literature are based on unsupervised learning. A few models detect unusual patterns in data streams by employing statistical tests paired with either compressed sensing [9] or dimensionality reduction [41]. A significant line of work focuses on learning a parametric compression and reconstruction of normal data, motivated by the assumption that outlier samples lead to worse reconstructions. Among these approaches, sparse-coding algorithms model normal patterns as a linear combination of a set of basis components, imposing a sparsity constraint on scalar coefficients [65, 13, 32, 62]. On a different note, other models rely on learning the probability distribution of normal patterns by proposing both non-parametric [1] and parametric [7, 34, 31] estimators to approximate the density of low-level appearance and motion features. At a higher abstraction level, graphical modeling has proven successful in estimating the distribution of video events and their causality, by means of Hidden Markov Models [15], Markov Random Fields [23] and energy-based models [29]. Other models, such as social force [37] and spectral graph analysis [10], specifically address video surveillance scenarios by tackling the detection of abnormalities in crowd trajectory patterns. Finally, different directions involve clustering of spatio-temporal video volumes [66, 48], Gaussian Process Regression [12], scan statistics [22], foreground explainability [5] and discriminability of normal frames with respect to context [14].
Nevertheless, all aforementioned methods are decoupled from the feature extraction process and strongly depend on the efficacy of the chosen feature set. Modern novelty detection leverages the representation power of deep neural networks at the expense of hand-crafted features. Algorithms such as one-class support vector machines (OC-SVM) [52] or Gaussian classifiers have been successfully applied to features extracted by deep networks [53, 63, 49, 50]. Reconstruction-based models, instead, including autoencoders [19], convolutional long short-term memories [36] and sparse coding with temporal constraints [33], rely on the magnitude of the reconstruction error. Nevertheless, this metric has been shown to be a proxy of the gradient of the log-density of hidden representations, rather than representing the density itself [3]. A few works address novelty detection by assessing the temporal regularity of deep representations, by slow feature analysis [21] or binary temporal patterns [43]. Probabilistic models typically aim at estimating densities in deep feature space by means of GMMs [16, 68], or by minimizing energy functions [64, 60]. In practice, expectation-maximization for GMMs in high dimensions is troublesome, mostly due to the estimation of covariance matrices [67, 61, 28]. Notably, the method in [38] illustrates how CNNs can directly estimate density ratios of normalcy over novelty. Motivated by their effectiveness in learning complex distributions, some recent works train generative adversarial networks (GANs) on normal samples. However, since the underlying data probability distribution is embedded in the network parameters and cannot be explicitly queried for scoring samples, these methods employ different heuristics for the evaluation. Specifically, [51, 44] employ reconstruction measures to monitor abnormalities, whereas [45] directly queries the GAN discriminator for a normalcy score.
Finally, [40] addresses a related task, namely one-class classification, by learning CNN features that are both compact and descriptive using supervision from an additional external dataset.
For a comprehensive review of anomaly detection techniques in computer vision, we refer the reader to [42, 26].

Autoregressive density estimation. A popular deep model for density estimation is the Restricted Boltzmann Machine [54, 17], which can be trained with contrastive divergence [20] to model the density of binary inputs. The problem with this approach is that exact inference is troublesome due to an intractable partition function. To overcome this limitation, the first autoregressive models were introduced: the binary NADE [30] and its real-valued version RNADE [56] employ a deep network to estimate conditional probabilities of unobserved variables with sigmoid units and Gaussian mixtures, respectively. Further improvements were presented in [57, 18], which extend previous works by introducing a training strategy that copes with all possible autoregressive orderings among variables. Powerful generative models such as Pixel-RNN [59] and Pixel-CNN [58] condition the probability distribution of pixels on previous ones, following a preset ordering. Recently, autoregression techniques have been merged with attention and meta-learning techniques for few-shot density estimation [46], as well as with the parallel framework of normalizing flows [47] in [39].

3 Proposed model

Figure 1: An example of autoregression over the joint probability distribution of four variables, represented as a graphical model. According to this model, estimating the joint probability of $x_1$, $x_2$, $x_3$, $x_4$ reduces to the estimation of the conditionals $p(x_2 \mid x_1)$, $p(x_3 \mid x_1, x_2)$, $p(x_4 \mid x_1, x_2, x_3)$ and the marginal $p(x_1)$.

In this section we describe the proposed model, according to which the scoring function is a probability density estimator, that directly learns to assign high probabilities to normal samples within the training set. The learned density function evaluates representations produced by a deep autoencoder, and its estimation proceeds following an autoregressive procedure, in which elements of the feature vector get observed gradually to predict the value of unobserved ones. First, we briefly review the technique of autoregression and its application in density estimation. Afterwards, the model’s architecture and objective function will be illustrated in detail.

Figure 2: The structure of the proposed autoencoder. Paired with a standard compression-reconstruction network, a density estimation module learns the distribution of latent codes, via autoregression.

3.1 Background: autoregressive models

Autoregression is a well known technique that factorizes the probability distribution over the joint assignment $\mathbf{x} = (x_1, \dots, x_d)$ of random variables using the chain rule of probability,

$$p(\mathbf{x}) = \prod_{i=1}^{d} p(x_i \mid \mathbf{x}_{<i}), \qquad (1)$$

so that estimating $p(\mathbf{x})$ reduces to the estimation of each single conditional probability density (CPD) in the form $p(x_i \mid \mathbf{x}_{<i})$. Notably, Eq. (1) assumes a given order over random variables (Fig. 1). Although the chain rule holds in general, estimating CPDs can be easier for some variable orderings and harder for others, depending on the nature of the joint variable space. Prior models such as Pixel-RNN [59] and Pixel-CNN [58] employ autoregression to generate images one pixel at a time, and the ordering in which the process advances is either from corners to center or top-left to bottom-right, row-wise. Other autoregressive models [57, 18] perform order agnostic training by shuffling the order at each mini-batch update. A lot of effort has been spent in this direction, because all mentioned models estimate the density of raw data, for which the best autoregressive order is unknown. On the contrary, we apply this technique to representations extracted by a deep network, which we constrain to provide autoregressive feature vectors.
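The chain-rule factorization can be checked numerically on a toy example. The sketch below is only an illustration: the joint table over three binary variables is randomly generated and the name `chain_rule_prob` is hypothetical. It evaluates the joint probability as a marginal times two conditionals, in the fixed order $x_1, x_2, x_3$.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint distribution over three binary variables (x1, x2, x3).
joint = rng.random((2, 2, 2))
joint /= joint.sum()

# Marginals needed by the chain rule.
p1 = joint.sum(axis=(1, 2))   # p(x1)
p12 = joint.sum(axis=2)       # p(x1, x2)

def chain_rule_prob(x1, x2, x3):
    """Evaluate p(x1, x2, x3) as p(x1) * p(x2 | x1) * p(x3 | x1, x2)."""
    p_x2_given_x1 = p12[x1, x2] / p1[x1]
    p_x3_given_x12 = joint[x1, x2, x3] / p12[x1, x2]
    return p1[x1] * p_x2_given_x1 * p_x3_given_x12

# Largest discrepancy between the factorized and the tabulated joint.
max_gap = max(abs(chain_rule_prob(*x) - joint[x])
              for x in itertools.product([0, 1], repeat=3))
```

Any other ordering of the three variables would give the same joint value; what changes is how easy each conditional is to estimate from data.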

3.2 Autoregressive novelty detection

The successful application of autoregression in density estimation motivates us to employ it in a novelty detection setting. However, in real world scenarios this requires modeling extremely high dimensional data (e.g. images and videos). This can be troublesome for two reasons. First, as the dimensionality of the feature space grows, the estimation of a density function becomes increasingly difficult, since the number of required samples rises significantly. Moreover, since each autoregressive layer requires at least the same number of hidden units as the input layer [18], scaling the whole model becomes unfeasible due to memory limitations and the exploding number of parameters.
We deal with the aforementioned problems by shifting the analysis to a compressed domain, i.e. the feature space learned by a deep autoencoder. This way, data dimensionality problems are alleviated, since autoencoders can significantly compress input data. More importantly, by forcing the autoregression property during the autoencoder training, it is possible to guide the feature space creation, such that the encoded representation is autoregressive in any preset order. More formally, our architecture is composed of three building blocks (Fig. 2): the encoder $f(\mathbf{x}; \theta_f)$, the decoder $g(\mathbf{z}; \theta_g)$ and the density estimator $h(\mathbf{z}; \theta_h)$. The encoder processes input $\mathbf{x}$ from an unknown training distribution $p(\mathbf{x})$ and maps it into a compressed representation $\mathbf{z} = f(\mathbf{x}; \theta_f)$, having lower dimensionality $d$, whereas the decoder provides a reconstructed version of the input $\tilde{\mathbf{x}} = g(\mathbf{z}; \theta_g)$. Importantly, the representation is sigmoid-activated, so that $\mathbf{z} \in [0, 1]^d$. The estimator $h(\mathbf{z}; \theta_h)$ provides the density of the representation space at $\mathbf{z}$ with autoregression. Specifically, it produces as output $d$ probability distributions $h_1, \dots, h_d$, represented as multinomials over a linear quantization of the space $[0, 1]$ in $B$ bins. In order to represent a proper probability distribution, a softmax activation is employed, ensuring that probabilities sum up to 1 along the quantization axis. As will be further detailed in the remainder of this section, all layers of the estimation network follow a precise connectivity scheme, ensuring that each output distribution $h_i$ is only connected to inputs $z_1, \dots, z_{i-1}$, thus resembling a valid autoregressive estimate of the CPD $p(z_i \mid \mathbf{z}_{<i})$.

Figure 3: Each block within the encoder and decoder network is composed of a convolutional path and an identity path, merged by a residual connection. For each convolutional layer, the figure indicates the stride and the kernel size.

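The objective of the complete architecture is to minimize:

$$\mathcal{L} = \mathcal{L}_{rec} + \lambda \mathcal{L}_{arg} = \sum_{k} \Big( \lVert \mathbf{x}^{(k)} - \tilde{\mathbf{x}}^{(k)} \rVert_2^2 - \lambda \sum_{i=1}^{d} \big\langle Q(z_i^{(k)}), \log h_i(\mathbf{z}^{(k)}) \big\rangle \Big), \qquad (2)$$

where $k$ spans over all training examples, $\lambda$ weights the autoregression loss, $h_i(\mathbf{z}^{(k)})$ is the distribution estimated over the $B$ quantization bins for the $i$-th element of the $d$-dimensional code, and $Q$ is a utility function transforming each element $z_i$ into a $B$-dimensional one-hot encoding, highlighting the correct quantization bin to which $z_i$ belongs. Importantly, while the reconstruction loss $\mathcal{L}_{rec}$ guides the autoencoder towards data compression, the autoregression loss $\mathcal{L}_{arg}$ trains the estimator sub-network to regress the distribution of compressed representations, through maximum likelihood estimation. Thus, the encoder’s parametrization is forced to encode input samples such that representations are both descriptive (reconstruction loss is low) and lie in a high density region of the feature space (autoregression loss is low). Similar to the case of variational autoencoders [25], the autoregression loss acts as a regularizer, avoiding the need for other regularization techniques such as denoising and contraction.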
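The two terms of the objective can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the squared-error reconstruction term, the weight `lam` and the array layout (`cpd_logits` holding one row of pre-softmax values per code element) are all hypothetical choices made for the example.

```python
import numpy as np

def quantize(z, n_bins):
    """Map each code element in [0, 1] to the index of its quantization bin."""
    return np.minimum((z * n_bins).astype(int), n_bins - 1)

def softmax(logits, axis=-1):
    shifted = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def and_loss(x, x_rec, z, cpd_logits, lam=1.0):
    """Per-sample objective (assumed form: squared error + lam * autoregression NLL).

    x, x_rec:   (n, ...) inputs and reconstructions
    z:          (n, d) sigmoid-activated codes
    cpd_logits: (n, d, B) estimator outputs before the softmax
    """
    n, d, n_bins = cpd_logits.shape
    rec = ((x - x_rec) ** 2).reshape(n, -1).sum(axis=1)
    probs = softmax(cpd_logits, axis=-1)
    bins = quantize(z, n_bins)
    # Selecting the probability of the correct bin is equivalent to the
    # inner product with the one-hot encoding of that bin.
    picked = np.take_along_axis(probs, bins[..., None], axis=-1)[..., 0]
    nll = -np.log(picked).sum(axis=1)
    return rec + lam * nll, nll
```

The second returned value, the negative log-probability of the code alone, is the quantity that can serve as a per-sample novelty score once the decoder is no longer needed.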

Figure 4: (a) Representation of a masked fully connected layer, described by Eq. 3. The connectivity between the input and output tensor is highlighted by different colors. We illustrate a layer of type A in this figure. (b) The Masked Stacked Convolutional layer (Eq. 5). It exploits translation invariance only along the time axis. Different kernel colors represent different parametrizations.

Once the model is trained, we detach the decoder and interpret the remaining network (the encoder followed by the estimator) as a deep density estimator, which can be queried with new samples for their plausibility under the learned distribution. Indeed, the autoregression loss of a single test sample estimates the negative log-probability of its representation in feature space, and hence can be used as a score for novelty versus regularity.

General architecture. The proposed model is represented in Fig. 2. The encoder and decoder networks are composed of several residual convolutional blocks, whose routing patterns require two paths. The standard path processes input by three stacked convolutions, the first of which changes the resolution of the input tensor (spatially and/or temporally) by striding, either down-sampling (encoder block) or up-sampling (decoder block, in which the first convolution of the block is transposed) feature maps. An identity path proceeds by applying a strided unary kernel convolution. Two densely connected sub-networks map the encoder output to the compressed feature vector and bring it back to a spatial feature map fed as input to the decoder. Further details about blocks are given in Fig. 3.

Image model. The network estimating image densities is composed of 2D blocks (Fig. 3, image model), except for the innermost layers that are fully connected and provide a compressed representation. This compressed feature vector has dimensionality $d$ and, as mentioned, is fed to an autoregressive module that estimates its probability. Each layer within this module features a Masked Fully Connection (MFC, Fig. 4a), that can follow two different patterns, namely A and B. Formally, it computes the output feature map $\mathbf{o} \in \mathbb{R}^{d \times c_o}$ given the input $\mathbf{h} \in \mathbb{R}^{d \times c_i}$. The connection between the input element $h_i^k$ in position $i$, channel $k$ and the output element $o_j^l$ in position $j$, channel $l$ is parametrized by

$$\begin{cases} w_{i,j}^{k,l} & \text{if } i < j \\ w_{i,j}^{k,l} & \text{if } i = j \text{ and type} = B \\ 0 & \text{if } i = j \text{ and type} = A \\ 0 & \text{if } i > j \end{cases} \qquad (3)$$

The feature vector entering the first autoregressive layer has no feature channel axis, thus it can be thought of as a tensor in $\mathbb{R}^{d \times 1}$. On the contrary, the output of the final autoregressive layer provides probability estimates for the $B$ bins that compose the space quantization (see previous section), so it lives in $\mathbb{R}^{d \times B}$. Intermediate layers can have an arbitrary number of output channels. The first layer within the estimation network has type A (thus strictly depending on previous elements), whereas all the remaining ones have type B (thus masking only successive elements). All hidden layers are leaky ReLU activated.
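The masking scheme of the MFC layers can be sketched as below. This is an illustrative sketch rather than the authors' code: `mfc_mask` and `mfc_forward` are hypothetical names, biases are omitted, the leaky ReLU slope of 0.01 is an assumption, and tensors are flattened so that the flat index equals position times channels plus channel.

```python
import numpy as np

def mfc_mask(d, c_in, c_out, mask_type):
    """Binary mask of shape (d * c_in, d * c_out) for a masked fully connected layer."""
    pos = np.arange(d)
    # allowed[i, j] is True when input position i may feed output position j.
    if mask_type == "A":
        allowed = pos[:, None] < pos[None, :]    # strictly previous elements
    else:
        allowed = pos[:, None] <= pos[None, :]   # previous elements and the current one
    # Every channel pair (k, l) shares the same positional rule.
    return np.kron(allowed.astype(float), np.ones((c_in, c_out)))

def mfc_forward(z, weights, masks):
    """Apply a stack of masked layers; hidden layers use a leaky ReLU."""
    h = z
    for n, (w, m) in enumerate(zip(weights, masks)):
        h = h @ (w * m)
        if n < len(weights) - 1:
            h = np.where(h > 0, h, 0.01 * h)
    return h

rng = np.random.default_rng(0)
d, channels = 5, [1, 4, 3]          # toy sizes: code length 5, 3 output bins
masks = [mfc_mask(d, channels[0], channels[1], "A"),
         mfc_mask(d, channels[1], channels[2], "B")]
weights = [rng.normal(size=m.shape) for m in masks]

z = rng.random(d)
z_perturbed = z.copy()
z_perturbed[2] += 0.5               # change the third code element only
out = mfc_forward(z, weights, masks).reshape(d, channels[-1])
out_perturbed = mfc_forward(z_perturbed, weights, masks).reshape(d, channels[-1])
# Outputs for positions 0..2 must not depend on the third code element.
unchanged = np.allclose(out[:3], out_perturbed[:3])
```

Perturbing one code element leaves the outputs at that position and at all earlier positions untouched, which is the property that makes each output a valid conditional estimate.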

Video model. When processing video inputs, the network consumes temporal clips of 16 consecutive frames. All convolutions are 3D, so they stride along the temporal axis [55]. The structure of 3D blocks is represented in Fig. 3, video model. In order to obtain a temporally-ordered representation of the input clip, each 3D convolution within encoding blocks is causal [6], so that the output cannot access information from future frames (i.e. each connection to future units is zero-masked). The last fully connected layers within the encoder process each feature map along the temporal axis in parallel, so that maps from different time-steps undergo the same parametrization. This way, the encoding procedure does not shuffle information across time-steps, ensuring temporal ordering. The decoder architecture mirrors the encoder to restore the input data.
The compressed representation of video clips has dimensionality $t \times d$, $t$ being the number of temporal time-steps and $d$ the length of the code. Accordingly, the estimation network is designed to capture two dimensional patterns within observed elements of the code in order to provide an estimate over the values of unobserved ones. However, naively plugging in 2D convolutional layers would assume translation invariance in the patterns that characterize the code. Due to the way the compressed representation is built, this assumption holds only along the temporal axis, whereas it seems wrong along the code axis (as mentioned above, the code is the output of a fully connected layer). To address these constraints, our proposal is to apply different convolutional kernels to each element along the code dimension, allowing the observation of the whole feature vector in the previous time-step and a portion of the current time-step (Fig. 4b). Every convolution is free to stride along the time axis, and captures temporal patterns. We name this operation Masked Stacked Convolution (MSC).
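Specifically, the $i$-th convolution is equipped with a kernel $w^{(i)} \in \mathbb{R}^{3 \times d}$, that gets multiplied by the binary mask $M^{(i)}$, defined as

$$m^{(i)}_{j,k} \in M^{(i)} = \begin{cases} 1 & \text{if } j < 1 \\ 1 & \text{if } j = 1 \text{ and } k < i \text{ and type} = A \\ 1 & \text{if } j = 1 \text{ and } k \le i \text{ and type} = B \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$

where $j$ indexes the temporal axis (with $j = 0$ the previous time-step and $j = 1$ the current one) and $k$ the code axis.
Each single convolution yields a column vector, as a result of its stride along time. The set of column vectors resulting from the application of the $d$ convolutions to the input tensor $\mathbf{h}$ are horizontally stacked to build the output tensor $\mathbf{o}$, as follows:

$$\mathbf{o} = \big\Vert_{i=1}^{d} \left[ (M^{(i)} \odot w^{(i)}) * \mathbf{h} \right], \qquad (5)$$

where $\Vert$ represents the horizontal concatenation operation.
Similarly to the model discussed for images, each MSC layer features a leaky ReLU activation and provides for one of two types of masking, namely A and B, the former employed only in the first layer and the latter employed in all successive layers within the estimation network. The output of the estimation network is a tensor having shape $t \times d \times B$, providing autoregressive CPDs for each element of the representation of the input video clip.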
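A single-channel version of the Masked Stacked Convolution can be sketched as follows. Several simplifications are assumptions made for brevity: the names `msc_mask` and `msc_layer` are hypothetical, each kernel spans only the previous and the current time-step (a masked row for future steps would contribute nothing anyway), the sequence is zero-padded in the past, code positions are 0-based, and activations are omitted.

```python
import numpy as np

def msc_mask(i, d, mask_type):
    """Mask for the i-th kernel: row 0 is the previous time-step, row 1 the current one."""
    m = np.zeros((2, d))
    m[0, :] = 1.0                               # whole code of the previous time-step
    limit = i if mask_type == "A" else i + 1    # type A excludes element i itself
    m[1, :limit] = 1.0                          # observed part of the current time-step
    return m

def msc_layer(h, kernels, mask_type):
    """h: (t, d) code; kernels: (d, 2, d), one kernel per code position. Returns (t, d)."""
    t, d = h.shape
    padded = np.vstack([np.zeros((1, d)), h])   # zero-pad one step in the past
    columns = []
    for i in range(d):
        w = kernels[i] * msc_mask(i, d, mask_type)
        # Slide the masked kernel along time; each step yields one scalar.
        col = np.array([(w * padded[s:s + 2]).sum() for s in range(t)])
        columns.append(col)
    return np.stack(columns, axis=1)            # stack the d column vectors side by side

rng = np.random.default_rng(0)
t, d = 4, 6
k1 = rng.normal(size=(d, 2, d))
k2 = rng.normal(size=(d, 2, d))

def estimator(h):
    # First layer of type A, second layer of type B.
    return msc_layer(msc_layer(h, k1, "A"), k2, "B")

h = rng.random((t, d))
h_perturbed = h.copy()
h_perturbed[2, 3] += 0.5                        # change one element at time 2, code position 3
out, out_perturbed = estimator(h), estimator(h_perturbed)
```

In this toy run, outputs at earlier time-steps and outputs at time-step 2 up to code position 3 are unaffected by the perturbation, while later positions and later time-steps are free to change.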

| | MNIST | CIFAR10 | DR(eye)VE | ShanghaiTech |
|---|---|---|---|---|
| E-blocks | [32,64] | [64,128,256] | [8,16,32,64,64] | [8,16,32,64,64] |
| D-blocks | [64,32] | [256,128,64] | [64,32,16,8,8] | [64,32,16,8,8] |
| Spatial stride | [2,2] | [2,2,2] | [2,2,2,2,2] | [2,2,2,2,2] |
| Temporal stride | - | - | [2,2,1,1,1] | [2,2,1,1,1] |
| FC-layers | [64] | [256,64] | [64] | [512,64] |
| MFC layers | [32,32,32,32,100] | [32,32,32,32,100] | - | - |
| MSC layers | - | - | [4,4,100] | [4,4,100] |
| Code dim | 64 | 64 | 4×64 | 4×64 |
| Mini-batch | 64 | 64 | 12 | 8 |

Table 1: Details regarding each implemented model. We report the number of encoder (E) and decoder (D) blocks along with the number of filter banks applied by each convolution, the stride of the first convolution within each encoder block (the decoder mirrors the architecture, so strides are applied in reverse), the fully connected layers constructing the code, the number of channels of hidden autoregressive (MFC, MSC) layers, the code dimensionality, and the mini-batch size employed during training.
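MNIST

| Class | OC-SVM | KDE | DAE | VAE | Pix-CNN | GAN | AND |
|---|---|---|---|---|---|---|---|
| 0 | 0.988 | 0.885 | 0.894 | 0.997 | 0.531 | 0.926 | 0.984 |
| 1 | 0.999 | 0.996 | 0.999 | 0.999 | 0.995 | 0.995 | 0.995 |
| 2 | 0.902 | 0.710 | 0.792 | 0.936 | 0.476 | 0.805 | 0.947 |
| 3 | 0.950 | 0.693 | 0.851 | 0.959 | 0.517 | 0.818 | 0.952 |
| 4 | 0.955 | 0.844 | 0.888 | 0.973 | 0.739 | 0.823 | 0.960 |
| 5 | 0.968 | 0.776 | 0.819 | 0.964 | 0.542 | 0.803 | 0.971 |
| 6 | 0.978 | 0.861 | 0.944 | 0.993 | 0.592 | 0.890 | 0.991 |
| 7 | 0.965 | 0.884 | 0.922 | 0.976 | 0.789 | 0.898 | 0.970 |
| 8 | 0.853 | 0.669 | 0.740 | 0.923 | 0.340 | 0.817 | 0.922 |
| 9 | 0.955 | 0.825 | 0.917 | 0.976 | 0.662 | 0.887 | 0.979 |
| avg | 0.951 | 0.814 | 0.877 | 0.969 | 0.618 | 0.866 | 0.967 |

CIFAR10

| Class | OC-SVM | KDE | DAE | VAE | Pix-CNN | GAN | AND |
|---|---|---|---|---|---|---|---|
| 0 | 0.630 | 0.658 | 0.411 | 0.700 | 0.788 | 0.708 | 0.717 |
| 1 | 0.440 | 0.520 | 0.478 | 0.386 | 0.428 | 0.458 | 0.494 |
| 2 | 0.649 | 0.657 | 0.616 | 0.679 | 0.617 | 0.664 | 0.662 |
| 3 | 0.487 | 0.497 | 0.562 | 0.535 | 0.574 | 0.510 | 0.527 |
| 4 | 0.735 | 0.727 | 0.728 | 0.748 | 0.511 | 0.722 | 0.736 |
| 5 | 0.500 | 0.496 | 0.513 | 0.523 | 0.571 | 0.505 | 0.504 |
| 6 | 0.725 | 0.758 | 0.688 | 0.687 | 0.422 | 0.707 | 0.726 |
| 7 | 0.533 | 0.564 | 0.497 | 0.493 | 0.454 | 0.471 | 0.560 |
| 8 | 0.649 | 0.680 | 0.487 | 0.696 | 0.715 | 0.713 | 0.680 |
| 9 | 0.508 | 0.540 | 0.378 | 0.386 | 0.426 | 0.458 | 0.566 |
| avg | 0.586 | 0.610 | 0.536 | 0.583 | 0.551 | 0.592 | 0.617 |

Table 2: ROC-AUC results for novelty detection on MNIST and CIFAR10. Each row represents a different class on which baselines and ANDs are trained.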

4 Experiments

In this section we evaluate quantitatively the proposed AND model in two contexts: one class novelty detection and video anomaly detection.

Implementation details. All experiments were performed following the general model illustrated in Sec. 3. However, some architectural choices were specialized for each dataset, in order to cope with different data complexities, overfitting or other issues. These modifications mainly involve the encoder and decoder capacities, in terms of convolutional blocks and number of filter banks within each block. We refer the reader to Tab. 1 for more details. All models were trained minimizing Eq. 2 using . As for the optimizer, we choose Adam [24], with a learning rate of 0.001.

4.1 Novelty detection: MNIST and CIFAR10

We first assess the ability of our model to discriminate between normal samples seen during training and ones drawn from a different distribution. To this end, we perform several one-class experiments on MNIST and CIFAR10. Specifically, for both datasets we isolate images from each class and consider them as the normal distribution. For each class, we train a different AND network and, at test time, the whole test set is presented: the model is tasked to assign a higher score (probability) to samples sharing the class with training images. We randomly pick 20% of the images from the training set of each class, and use them for validation purposes. Importantly, no novel samples are employed within the training and validation sets.

Figure 5: For each class in the CIFAR10 test set, we report images assigned to a low probability density (left image) and a high probability density (right image) by our model. High density images are clearly more regular than other ones, in terms of object size (airplanes, boats), context (deer) and orientation (horses).

We consider the following baselines:

  • standard methods such as OC-SVM [52] and Kernel Density Estimator (KDE), employed out of features extracted by PCA-whitening. Hyperparameters such as the number of principal components, the KDE kernel bandwidth and OC-SVM tolerance are tuned on the validation set.

  • a denoising autoencoder (DAE), sharing the same encoder-decoder structure as our proposal, but lacking the density estimation module. The reconstruction error is employed as a measure of normalcy vs novelty.

  • a variational autoencoder (VAE) [25], also sharing the same capacity as our model, in which the Evidence Lower Bound (ELBO) is employed as score.

  • an autoregressive model such as Pix-CNN [58], that estimates the density directly in the pixel space.

  • the GAN-based approach illustrated in [51].

Both baselines and our model feature unbounded scores, thus choosing a threshold to get a categorical prediction is often problematic. Therefore, Tab. 2 reports results of this experiment in terms of ROC-AUC. This metric can be interpreted as the expectation that a uniformly drawn normal image is scored higher than a uniformly drawn novel image, and is free from the choice of a fixed threshold.
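The probabilistic reading of ROC-AUC given above can be computed directly by comparing every normal score with every novel score. The sketch below is illustrative only: the function name `roc_auc` and the toy scores are hypothetical, and ties are counted as half a win, which is the usual convention.

```python
import numpy as np

def roc_auc(normal_scores, novel_scores):
    """Probability that a random normal sample is scored higher than a random novel one."""
    normal = np.asarray(normal_scores, dtype=float)[:, None]
    novel = np.asarray(novel_scores, dtype=float)[None, :]
    wins = (normal > novel).sum()
    ties = (normal == novel).sum()
    return (wins + 0.5 * ties) / (normal.size * novel.size)

# Toy scores: five of the six (normal, novel) pairs are ranked correctly.
auc = roc_auc([0.9, 0.8, 0.4], [0.5, 0.3])
```

No threshold appears anywhere in the computation, and a scorer that ranks pairs at random lands around 0.5.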

In a simple setting such as MNIST, most methods perform favorably. In particular, VAE and AND yield the best results, and exhibit very similar performances. This is reasonable: with such clean image content, the code distribution collapses to a simple distribution in the latent space, for which the naive Gaussian posterior imposed by VAE is a good proxy. Surprisingly, Pix-CNN excels in modeling the distribution of ones, but struggles on other classes. This finding suggests that autoregression within the joint space of pixels is complex, while employing it in representation space can be beneficial. Notably, OC-SVM outperforms most deep models in this setting.
On the contrary, CIFAR10 exhibits a strong intra-class variability and represents a significant challenge for most methods. Many baselines score below or near 0.5 on many classes, which means similar to or even worse than random scoring. In this setting, AND outperforms all baselines on average. The gap with VAE increases, possibly due to the Gaussian posterior VAE assumes for latent representations, which represents a more severe constraint in the setting of real world images. DAE performances suggest that reconstruction errors can indeed be interpreted as a proxy for sample density, but require extra care. Specifically, autoencoders are very likely to encode low-level features within the code (since they are useful for reconstruction), resulting in precise restorations of novel unseen samples. On the contrary, AND models directly model the density of extracted representations, regardless of their level of abstraction, without making assumptions about their underlying distribution. Notably, KDE yields a better performance than all deep learning based approaches, except for AND. Nevertheless, KDE is a non-parametric estimator, and therefore needs to store all training examples in memory in order to evaluate the normalcy of new images, which makes it unfeasible as the dataset grows in size.
A visualization of AND behavior is represented in Fig. 5, illustrating CIFAR images scored with low and high densities.

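| Class | ConvAE [19] P@20 | ConvAE [19] P@50 | ConvAE [19] P@100 | ConvAE [19] MAP | ConvAE [19] AUC | AND P@20 | AND P@50 | AND P@100 | AND MAP | AND AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| Morning | 0.85 | 0.72 | 0.58 | 0.513 | 0.713 | 0.95 | 0.88 | 0.87 | 0.570 | 0.758 |
| Evening | 0.50 | 0.68 | 0.65 | 0.466 | 0.700 | 0.00 | 0.00 | 0.00 | 0.366 | 0.662 |
| Night | 0.75 | 0.90 | 0.95 | 0.937 | 0.977 | 1.00 | 1.00 | 1.00 | 0.965 | 0.981 |
| Sunny | 0.60 | 0.50 | 0.44 | 0.368 | 0.529 | 0.95 | 0.48 | 0.40 | 0.409 | 0.585 |
| Cloudy | 0.60 | 0.62 | 0.59 | 0.470 | 0.583 | 0.85 | 0.80 | 0.71 | 0.507 | 0.626 |
| Rainy | 0.75 | 0.62 | 0.46 | 0.290 | 0.472 | 1.00 | 1.00 | 0.96 | 0.286 | 0.494 |
| Downtown | 0.15 | 0.14 | 0.21 | 0.344 | 0.414 | 0.15 | 0.20 | 0.19 | 0.336 | 0.441 |
| Countryside | 0.30 | 0.42 | 0.34 | 0.326 | 0.506 | 1.00 | 1.00 | 0.98 | 0.561 | 0.721 |
| Highway | 0.60 | 0.36 | 0.48 | 0.367 | 0.632 | 1.00 | 1.00 | 1.00 | 0.525 | 0.718 |
| avg | 0.57 | 0.56 | 0.52 | 0.453 | 0.614 | 0.77 | 0.71 | 0.68 | 0.473 | 0.665 |

Table 3: Performances on the DR(eye)VE novelty detection task.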
Figure 6: ROC curves representing AND and ConvAE [19] performances on the DR(eye)VE dataset, aggregated by time of day, weather and landscape.
Figure 7: Analysis of AND errors in the DR(eye)VE novelty detection task. Each column illustrates high scored clips of a model trained on a single scenario. The topmost image of each column reports the first frame of the normal clip yielding the highest score. The bottom image reports the first frame of the clip yielding the highest score among the novel classes. Columns, with the class of the highest scored novel clip in parentheses: Morning (Evening), Evening (Night), Night (Evening), Sunny (Rainy), Cloudy (Rainy), Rainy (Cloudy), Downtown (Highway), Countryside (Highway), Highway (Countryside). Best viewed on screen and zoomed in.

4.2 Novelty detection: DR(eye)VE

We now stress AND performances in a more challenging video novelty detection context. We employ the recently proposed DR(eye)VE dataset [4], which features 74 sequences of driving videos, from the perspective of a car, each of which is 5 minutes long. The dataset was originally proposed for attention prediction but provides, for each sequence, annotations concerning the type of landscape (downtown, countryside, highway), the time of acquisition (morning, evening, night) and weather (sunny, cloudy, rainy). Therefore, we proceed in a similar fashion as presented in Sec. 4.1, training several AND models by isolating each class and successively assessing their ability to recognize normal and novel samples. The network is fed with 16-frame clips, with images down-sampled to match a resolution of pixels. For training, we employ clips randomly drawn from the first 38 driving sequences (except for the central 500 frames), while the test set is composed of 10000 random clips from the remaining ones. 200 random clips from the 500 central frames of each training sequence constitute the validation set. We report in Tab. 3 and Fig. 6 AND results compared to the convolutional autoencoder (ConvAE) approach presented in [19], which scored favorably in several video benchmarks. Evaluation is carried out by sorting test samples according to the predicted normalcy score, and computing precision@k (P@k), mean average precision (MAP) and ROC-AUC. As shown in the table, AND outperforms the competitor model on nearly all classes. In particular, it clearly emerges how “night” clips are easy to distinguish from “morning” and “evening” ones, due to global changes in illumination. On the other hand, classes such as “rainy” and “downtown” yield poor performances.

In particular, two degenerate cases need further reasoning. Indeed, “evening” sequences score positively globally, but have zero precision in the first 100 test samples. This is again due to global illumination, in that AND tends to positively score darker images, which correspond to “night” within the test set. The opposite case is represented by “rainy” sequences, where scores are not discriminative globally but very precise in terms of top precisions. This suggests the presence of highly distinctive clips within rainy sequences (e.g. where the camera itself is covered by water drops), alongside many others that are very hard to distinguish, i.e. when light rain falls and drops are not visible in the clip. To represent the task complexity and provide a qualitative analysis of AND errors, we report in Fig. 7 the best-scoring normal clip and the best-scoring novel clip for each class. The figure illustrates, column-wise, that the boundary separating a class from other ones is often mild. As an example, “night” can be easily mistaken for “evening” when the illumination condition is similar, and discriminating between “cloudy” and “rainy” is generally hard.
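The ranking metrics used in this evaluation can be sketched as follows. The function names and the toy scores are hypothetical, and mean average precision is assumed here to be the standard average precision of a ranked list, averaged over classes by the caller.

```python
import numpy as np

def precision_at_k(scores, is_normal, k):
    """Fraction of truly normal samples among the k highest-scored ones."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    return float(np.asarray(is_normal, dtype=float)[order][:k].mean())

def average_precision(scores, is_normal):
    """Mean of the precision values measured at the rank of each normal sample."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    hits = np.asarray(is_normal, dtype=float)[order]
    precisions = np.cumsum(hits) / np.arange(1, hits.size + 1)
    return float((precisions * hits).sum() / hits.sum())

# Toy ranking in which the two top-scored samples are novel ones.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
is_normal = [0, 0, 1, 1, 1, 1]
p_at_2 = precision_at_k(scores, is_normal, 2)
ap = average_precision(scores, is_normal)
```

The toy ranking shows how a handful of highly scored novel samples can drive the top precision to zero while the average precision over the whole list stays well above zero.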

Figure 8: Representation of AND behavior on four sequences of the Shanghai Tech test set. We plot the negative log-probability (NLP) predicted by our model for each frame. It can be considered a measure of anomaly (the higher, the more abnormal) and succeeds in detecting different unusual behaviors. Best viewed on screen and zoomed in.

4.3 Video anomaly detection

A further domain of interest for the proposed autoregressive model is video anomaly detection. This topic has a great impact on many computer vision applications, including surveillance, where it is often cast as the detection of uncommon pedestrian motion patterns. We are interested in evaluating whether our density estimation approach transfers to this peculiar setting, where the anomaly is triggered by the temporal evolution of small scene regions (i.e. where pedestrians act).
For this purpose, we choose the recent Shanghai Tech Campus benchmark [33]. The dataset features 13 different cameras, delivering more than 250,000 frames for training, and 130 clips representing abnormal events for testing. The latter comprise appearances of bicycles, cars and other vehicles, as well as hasty changes in motion patterns due to episodes of chasing, brawling, fighting, and pickpocketing.
At test time, we evaluate 16-frame clips while seeking a frame-level score. Thus, we compute the score of a frame as the mean score among the 16 clips containing it.
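The frame-level scoring rule can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `frame_scores_from_clips` is hypothetical, and the sketch assumes clips are extracted with stride 1, so that `clip_scores[i]` covers frames `i` to `i + 15`.

```python
import numpy as np

def frame_scores_from_clips(clip_scores, clip_len=16):
    """Average the scores of all clips that contain each frame.

    Assumption of this sketch: clips are extracted with stride 1, so
    clip_scores[i] is the score of the clip covering frames i .. i+clip_len-1.
    """
    clip_scores = np.asarray(clip_scores, dtype=float)
    n_frames = len(clip_scores) + clip_len - 1
    totals = np.zeros(n_frames)
    counts = np.zeros(n_frames)
    for start, score in enumerate(clip_scores):
        # Every frame inside the clip receives the clip score once.
        totals[start:start + clip_len] += score
        counts[start:start + clip_len] += 1
    return totals / counts
```

Frames close to the beginning or the end of a sequence belong to fewer than 16 clips; in this sketch they are simply averaged over the clips that do contain them.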

Method        ROC-AUC
ConvAE [19]   0.609
TSC [33]      0.679
sRNN [33]     0.680
AND           0.685
Table 4: Results on Shanghai Tech Campus anomaly detection benchmark, where AND performs on par with other state of the art models.

We report our performance in Tab. 4, compared to the ConvAE model by Hasan et al. [19], and to Temporally-coherent Sparse Coding (TSC) and stacked RNN (sRNN) from [33]. During testing, we noticed that sequences from different cameras were scored differently due to the distribution of training clips. Consistently with its objective, our model tends to assign a higher score to cameras delivering more sequences. Nevertheless, within a given sequence abnormal behaviors are generally detected. We compute the ROC-AUC separately for each sequence and report the mean. Notably, our model outperforms ConvAE by a significant margin, and performs on par with other state-of-the-art models. This result is remarkable for two reasons. First, we do not make any assumption about surveillance-specific feature sets (e.g. optical flow as in [63]); instead, the autoencoder seamlessly focuses on motion features. Second, anomalies emerge directly from the modeling of latent distributions, without any assumption about either their structure or their domain.
We report in Fig. 8 some success cases, in which AND identifies abnormal entities such as bicycles and cars, as well as uncommon behaviors such as brawling.
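The evaluation protocol described above (one ROC-AUC per test sequence, then the mean) can be sketched as follows. The helper names `roc_auc` and `mean_sequence_auc` are hypothetical, and the AUC is computed with the rank (Mann-Whitney) formulation so that the example depends only on NumPy. The sketch assumes that every sequence contains both normal and abnormal frames; otherwise the per-sequence AUC is undefined.

```python
import numpy as np

def roc_auc(labels, scores):
    """ROC-AUC via the rank (Mann-Whitney U) formulation; ties get average ranks."""
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores, kind="mergesort")
    sorted_scores = scores[order]
    ranks = np.empty(len(scores))
    i = 0
    while i < len(scores):
        j = i
        # Extend j over the run of scores tied with position i.
        while j + 1 < len(scores) and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0  # average 1-based rank
        i = j + 1
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    u = ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)

def mean_sequence_auc(sequences):
    """sequences: iterable of (frame_labels, frame_scores) pairs, one per test sequence."""
    return float(np.mean([roc_auc(y, s) for y, s in sequences]))
```

Averaging per-sequence AUCs, instead of pooling all frames, keeps score offsets between cameras from affecting the metric, since only the ranking of frames within each sequence matters.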

Order  Split  0        1        2        3        4        5        6        7        8        9
ARG    Train  -201.60  -161.60  -171.43  -172.73  -174.17  -186.48  -158.22  -162.37  -171.65  -154.11
ARG    Val    -200.96  -160.38  -170.10  -172.29  -173.85  -185.25  -157.22  -162.20  -171.42  -154.02
ARG    Test   -200.89  -159.73  -169.64  -170.75  -172.40  -184.27  -157.74  -161.65  -170.10  -152.70
RDM    Train  -496.33  -456.34  -466.16  -467.47  -468.90  -481.21  -452.95  -457.10  -466.39  -448.84
RDM    Val    -495.69  -455.11  -464.83  -467.02  -468.58  -479.98  -451.95  -456.93  -466.15  -448.75
RDM    Test   -495.62  -454.47  -464.37  -465.48  -467.13  -479.00  -452.48  -456.38  -464.83  -447.43
INV    Train  -791.06  -751.07  -760.89  -762.20  -763.63  -775.94  -747.68  -751.83  -761.12  -743.57
INV    Val    -790.42  -749.84  -759.56  -761.75  -763.31  -774.71  -746.68  -751.66  -760.88  -743.48
INV    Test   -790.35  -749.20  -759.11  -760.22  -761.86  -773.73  -747.21  -751.12  -759.56  -742.16
Table 5: Average log-likelihood numerical results. Each BN is trained on AND latent codes from the training set of a single class, following either the autoregression order (ARG), a random order (RDM) or the order inverse to autoregression (INV). We report the log-likelihood also on the validation and test set. For train-val-test split, see Sec 4.1 of the paper. Only “normal” test samples are used in this evaluation.

4.4 On the causal structure of representations

In this section we investigate the capability of the AND encoder to produce representations that follow the causal structure imposed by the autoregression loss during training. To this aim, we extract representations from the 10 AND models trained on MNIST digits (Sec. 4.1) and fit different graphical models to their distribution. Specifically, we train several Bayesian Networks (BNs) with different autoregressive structures. Each BN is modeled with Linear Gaussian CPDs [27], such that the CPD is modeled as

$$p(x_i \mid \mathrm{Pa}(x_i)) = \mathcal{N}\Big(x_i \;\Big|\; w_0^{(i)} + \sum_{x_j \in \mathrm{Pa}(x_i)} w_j^{(i)} x_j,\ \sigma_i^2\Big),$$

with the exception of root nodes that are modeled with a Gaussian distribution. Concerning the ordering, we test:

  • ARG order: the BN structure follows the autoregressive order imposed during training.

  • RDM order: the BN structure follows a random autoregressive order.

  • INV order: the BN structure follows an autoregressive order which is the inverse with respect to the one imposed during training.

It is worth noticing that the three structures exhibit the same number of edges and independent parameters, so that the difference in fitting capabilities is only due to the causal order imposed over variables.
Fig. 9 reports the average training log-likelihood (i.e. the training log-likelihood divided by the number of training samples) of all BN models. Remarkably, the autoregressive order is clearly a better fit, supporting the capability of the encoder network to extract features with known autoregressive properties. Moreover, to show that this result is not due to overfitting or other confounding effects, we report in Tab. 5 the log-likelihoods for the training, validation and test sets (as defined in Sec. 4.1).
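The comparison between orderings can be illustrated with a small sketch that fits a linear Gaussian BN by least squares and returns the average log-likelihood. Several choices here are assumptions of the sketch and are not taken from the paper: the function name, the rule that the parents of a variable are the `n_parents` variables immediately preceding it in the given order, and the synthetic chain-structured data that stands in for AND latent codes.

```python
import numpy as np

def avg_loglik_linear_gaussian_bn(train, order, n_parents=2, test=None):
    """Fit a linear Gaussian BN on `train` and return the average log-likelihood.

    Assumption of this sketch: the parents of each variable are the (at most)
    `n_parents` variables that immediately precede it in `order`.
    """
    test = train if test is None else test
    total = np.zeros(len(test))
    for pos, var in enumerate(order):
        parents = list(order[max(0, pos - n_parents):pos])
        # Design matrices with a bias column; root nodes only get the bias.
        a_train = np.column_stack([np.ones(len(train))] + [train[:, p] for p in parents])
        a_test = np.column_stack([np.ones(len(test))] + [test[:, p] for p in parents])
        weights, *_ = np.linalg.lstsq(a_train, train[:, var], rcond=None)
        var_hat = np.mean((train[:, var] - a_train @ weights) ** 2)  # ML variance
        resid = test[:, var] - a_test @ weights
        total += -0.5 * (np.log(2 * np.pi * var_hat) + resid ** 2 / var_hat)
    return total.mean()

# Synthetic stand-in for latent codes: each element depends on the previous one.
rng = np.random.default_rng(0)
n, d = 5000, 8
codes = np.zeros((n, d))
codes[:, 0] = rng.normal(size=n)
for i in range(1, d):
    codes[:, i] = 0.9 * codes[:, i - 1] + 0.3 * rng.normal(size=n)

generating_order = list(range(d))
shuffled_order = [0, 4, 1, 5, 2, 6, 3, 7]  # fixed, arbitrary permutation
ll_generating = avg_loglik_linear_gaussian_bn(codes, generating_order)
ll_shuffled = avg_loglik_linear_gaussian_bn(codes, shuffled_order)
```

Both structures have the same number of edges and parameters, so any gap between `ll_generating` and `ll_shuffled` comes from the ordering alone, mirroring the argument made for the ARG, RDM and INV structures.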

Figure 9: Average training log-likelihood of a Bayesian Network modeling the distribution of latent codes produced by the AND encoder (trained on MNIST digits). When the network structure resembles the autoregressive order imposed during AND training, a much higher likelihood is achieved. This behavior is consistent in all classes, and supports the capability of the AND encoder to produce codes that respect a pre-imposed autoregressive structure.

5 Conclusions

In this paper, we propose an autoregressive framework for novelty detection. Our main contribution consists of an end-to-end model, trainable in a fully unsupervised fashion, that can learn complex high-dimensional distributions directly from data. Once trained, it serves as a deep parametric density estimator, and can be queried for the probability of new samples. The proposed model is data agnostic, and we discuss how to specialize its structure to cope with image and video inputs. Experimental results show promising performance in both one-class and anomaly detection settings, supporting the flexibility of our framework in tackling tasks of different nature without making data-related assumptions.

References


  • [1] A. Adam, E. Rivlin, I. Shimshoni, and D. Reinitz. Robust real-time unusual event detection using multiple fixed-location monitors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(3):555–560, 2008.
  • [2] M. Ahmed, A. N. Mahmood, and J. Hu. A survey of network anomaly detection techniques. Journal of Network and Computer Applications, 60:19 – 31, 2016.
  • [3] G. Alain and Y. Bengio. What regularized auto-encoders learn from the data-generating distribution. The Journal of Machine Learning Research, 15(1):3563–3593, 2014.
  • [4] S. Alletto, A. Palazzi, F. Solera, S. Calderara, and R. Cucchiara. Dr(eye)ve: a dataset for attention-based tasks with applications to autonomous and assisted driving. In IEEE International Conference on Computer Vision and Pattern Recognition Workshops, 2016.
  • [5] B. Antić and B. Ommer. Video parsing for abnormality detection. In IEEE International Conference on Computer Vision, pages 2415–2422. IEEE, 2011.
  • [6] S. Bai, J. Z. Kolter, and V. Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv:1803.01271, 2018.
  • [7] A. Basharat, A. Gritai, and M. Shah. Learning object motion patterns for anomaly detection and improved object detection. In IEEE International Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2008.
  • [8] M. Brioschi et al. The problem of novelty according to C.S. Peirce and A.N. Whitehead. PhD thesis, 2015.
  • [9] S. Budhaditya, D.-S. Pham, M. Lazarescu, and S. Venkatesh. Effective anomaly detection in sensor networks data streams. In Data Mining, 2009. ICDM’09. Ninth IEEE International Conference on, pages 722–727. IEEE, 2009.
  • [10] S. Calderara, U. Heinemann, A. Prati, R. Cucchiara, and N. Tishby. Detecting anomalies in people’s trajectories using spectral graph analysis. Computer Vision and Image Understanding, 115(8):1099–1111, 2011.
  • [11] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Comput. Surv., 41(3):15:1–15:58, July 2009.
  • [12] K.-W. Cheng, Y.-T. Chen, and W.-H. Fang. Video anomaly detection and localization using hierarchical feature representation and gaussian process regression. In IEEE International Conference on Computer Vision and Pattern Recognition, pages 2909–2917. IEEE, 2015.
  • [13] Y. Cong, J. Yuan, and J. Liu. Sparse reconstruction cost for abnormal event detection. In IEEE International Conference on Computer Vision and Pattern Recognition, pages 3449–3456. IEEE, 2011.
  • [14] A. Del Giorno, J. A. Bagnell, and M. Hebert. A discriminative framework for anomaly detection in large videos. In European Conference on Computer Vision, pages 334–349. Springer, 2016.
  • [15] T. V. Duong, H. H. Bui, D. Q. Phung, and S. Venkatesh. Activity recognition and abnormality detection with the switching hidden semi-markov model. In IEEE International Conference on Computer Vision and Pattern Recognition, volume 1, pages 838–845. IEEE, 2005.
  • [16] Y. Feng, Y. Yuan, and X. Lu. Learning deep event models for crowd anomaly detection. Neurocomputing, 219:548–556, 2017.
  • [17] Y. Freund and D. Haussler. A fast and exact learning rule for a restricted class of boltzmann machines. Neural Information Processing Systems, 4:912–919, 1992.
  • [18] M. Germain, K. Gregor, I. Murray, and H. Larochelle. Made: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, pages 881–889, 2015.
  • [19] M. Hasan, J. Choi, J. Neumann, A. K. Roy-Chowdhury, and L. S. Davis. Learning temporal regularity in video sequences. In IEEE International Conference on Computer Vision and Pattern Recognition, pages 733–742. IEEE, 2016.
  • [20] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural computation, 14(8):1771–1800, 2002.
  • [21] X. Hu, S. Hu, Y. Huang, H. Zhang, and H. Wu. Video anomaly detection using deep incremental slow feature analysis network. IET Computer Vision, 10(4):258–267, 2016.
  • [22] Y. Hu, Y. Zhang, and L. S. Davis. Unsupervised abnormal crowd activity detection using semiparametric scan statistic. In IEEE International Conference on Computer Vision and Pattern Recognition Workshops, pages 767–774. IEEE, 2013.
  • [23] J. Kim and K. Grauman. Observe locally, infer globally: a space-time mrf for detecting abnormal activities with incremental updates. In IEEE International Conference on Computer Vision and Pattern Recognition, pages 2921–2928. IEEE, 2009.
  • [24] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015.
  • [25] D. P. Kingma and M. Welling. Auto-encoding variational bayes. International Conference on Learning Representations, 2014.
  • [26] B. R. Kiran, D. M. Thomas, and R. Parakkal. An overview of deep learning based methods for unsupervised and semi-supervised anomaly detection in videos. arXiv preprint arXiv:1801.03149, 2018.
  • [27] D. Koller and N. Friedman. Probabilistic graphical models: principles and techniques. Technical report, 2009.
  • [28] A. Krishnamurthy. High-dimensional clustering with sparse gaussian mixture models. Unpublished paper, pages 191–192, 2011.
  • [29] J. Kwon and K. M. Lee. A unified framework for event summarization and rare event detection from multiple views. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1737–1750, 2015.
  • [30] H. Larochelle and I. Murray. The neural autoregressive distribution estimator. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 29–37, 2011.
  • [31] W. Li, V. Mahadevan, and N. Vasconcelos. Anomaly detection and localization in crowded scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(1):18–32, 2014.
  • [32] C. Lu, J. Shi, and J. Jia. Abnormal event detection at 150 fps in matlab. In IEEE International Conference on Computer Vision, pages 2720–2727. IEEE, 2013.
  • [33] W. Luo, W. Liu, and S. Gao. A revisit of sparse coding based anomaly detection in stacked rnn framework. IEEE International Conference on Computer Vision, 2017.
  • [34] V. Mahadevan, W. Li, V. Bhalodia, and N. Vasconcelos. Anomaly detection in crowded scenes. In IEEE International Conference on Computer Vision and Pattern Recognition, pages 1975–1981. IEEE, 2010.
  • [35] M. Markou and S. Singh. Novelty detection: a review—part 2:: neural network based approaches. Signal Processing, 83(12):2499 – 2521, 2003.
  • [36] J. R. Medel and A. Savakis. Anomaly detection in video using predictive convolutional long short-term memory networks. arXiv preprint arXiv:1612.00390, 2016.
  • [37] R. Mehran, A. Oyama, and M. Shah. Abnormal crowd behavior detection using social force model. In IEEE International Conference on Computer Vision and Pattern Recognition, pages 935–942. IEEE, 2009.
  • [38] H. Nam and M. Sugiyama. Direct density ratio estimation with convolutional neural networks with application in outlier detection. IEICE Transactions on Information and Systems, 98(5):1073–1079, 2015.
  • [39] G. Papamakarios, I. Murray, and T. Pavlakou. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pages 2335–2344, 2017.
  • [40] P. Perera and V. M. Patel. Learning deep features for one-class classification. arXiv preprint arXiv:1801.05365, 2018.
  • [41] D. S. Pham, B. Saha, D. Q. Phung, and S. Venkatesh. Detection of cross-channel anomalies from multiple data channels. In Data Mining (ICDM), 2011 IEEE 11th International Conference on, pages 527–536. IEEE, 2011.
  • [42] O. P. Popoola and K. Wang. Video-based abnormal human behavior recognition—a review. IEEE Transactions on Systems, Man and Cybernetics, 42(6):865–878, 2012.
  • [43] M. Ravanbakhsh, M. Nabi, H. Mousavi, E. Sangineto, and N. Sebe. Plug-and-play cnn for crowd motion analysis: An application in abnormal event detection. IEEE Winter Conference on Applications of Computer Vision, 2018.
  • [44] M. Ravanbakhsh, M. Nabi, E. Sangineto, L. Marcenaro, C. Regazzoni, and N. Sebe. Abnormal event detection in videos using generative adversarial nets. IEEE International Conference on Image Processing, 2017.
  • [45] M. Ravanbakhsh, E. Sangineto, M. Nabi, and N. Sebe. Training adversarial discriminators for cross-channel abnormal event detection in crowds. arXiv preprint arXiv:1706.07680, 2017.
  • [46] S. Reed, Y. Chen, T. Paine, A. v. d. Oord, S. Eslami, D. Rezende, O. Vinyals, and N. de Freitas. Few-shot autoregressive density estimation: Towards learning to learn distributions. International Conference on Learning Representations, 2018.
  • [47] D. J. Rezende and S. Mohamed. Variational inference with normalizing flows. International Conference on Machine Learning, 2015.
  • [48] M. J. Roshtkhari and M. D. Levine. Online dominant and anomalous behavior detection in videos. In IEEE International Conference on Computer Vision and Pattern Recognition, pages 2611–2618. IEEE, 2013.
  • [49] M. Sabokrou, M. Fathy, M. Hoseini, and R. Klette. Real-time anomaly detection and localization in crowded scenes. In IEEE International Conference on Computer Vision and Pattern Recognition Workshops, pages 56–62, 2015.
  • [50] M. Sabokrou, M. Fayyaz, M. Fathy, Z. Moayed, and R. Klette. Deep-anomaly: Fully convolutional neural network for fast anomaly detection in crowded scenes. Computer Vision and Image Understanding, 2018.
  • [51] T. Schlegl, P. Seeböck, S. M. Waldstein, U. Schmidt-Erfurth, and G. Langs. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International Conference on Information Processing in Medical Imaging, pages 146–157. Springer, 2017.
  • [52] B. Schölkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt. Support vector method for novelty detection. In Neural Information Processing Systems, 2000.
  • [53] P. Seeböck, S. Waldstein, S. Klimscha, B. S. Gerendas, R. Donner, T. Schlegl, U. Schmidt-Erfurth, and G. Langs. Identifying and categorizing anomalies in retinal imaging data. arXiv preprint arXiv:1612.00686, 2016.
  • [54] P. Smolensky. Information processing in dynamical systems: Foundations of harmony theory. Technical report, Department of Computer Science, Colorado University, 1986.
  • [55] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In IEEE International Conference on Computer Vision, pages 4489–4497. IEEE, 2015.
  • [56] B. Uria, I. Murray, and H. Larochelle. Rnade: The real-valued neural autoregressive density-estimator. In Advances in Neural Information Processing Systems, pages 2175–2183, 2013.
  • [57] B. Uria, I. Murray, and H. Larochelle. A deep and tractable density estimator. In International Conference on Machine Learning, pages 467–475, 2014.
  • [58] A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, and A. Graves. Conditional image generation with pixelcnn decoders. In Neural Information Processing Systems.
  • [59] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. International Conference on Machine Learning, 2016.
  • [60] H. Vu, D. Phung, T. D. Nguyen, A. Trevors, and S. Venkatesh. Energy-based models for video anomaly detection. arXiv preprint arXiv:1708.05211, 2017.
  • [61] Z. Wang, Q. Gu, Y. Ning, and H. Liu. High dimensional em algorithm: Statistical optimization and asymptotic normality. In Advances in neural information processing systems, pages 2521–2529, 2015.
  • [62] T. Xiao, C. Zhang, and H. Zha. Learning to detect anomalies in surveillance video. IEEE Signal Processing Letters, 22(9):1477–1481, 2015.
  • [63] D. Xu, Y. Yan, E. Ricci, and N. Sebe. Detecting anomalous events in videos by learning deep representations of appearance and motion. Computer Vision and Image Understanding, 156:117–127, 2017.
  • [64] S. Zhai, Y. Cheng, W. Lu, and Z. Zhang. Deep structured energy based models for anomaly detection. In International Conference on Machine Learning, pages 1100–1109, 2016.
  • [65] B. Zhao, L. Fei-Fei, and E. P. Xing. Online detection of unusual events in videos via dynamic sparse coding. In IEEE International Conference on Computer Vision and Pattern Recognition, pages 3313–3320. IEEE, 2011.
  • [66] H. Zhong, J. Shi, and M. Visontai. Detecting unusual activity in video. In IEEE International Conference on Computer Vision and Pattern Recognition, volume 2, pages II–II. IEEE, 2004.
  • [67] R. Zhu, L. Wang, C. Zhai, and Q. Gu. High-dimensional variance-reduced stochastic gradient expectation-maximization algorithm. In International Conference on Machine Learning, 2017.
  • [68] B. Zong, Q. Song, M. R. Min, W. Cheng, C. Lumezanu, D. Cho, and H. Chen. Deep autoencoding gaussian mixture model for unsupervised anomaly detection. In International Conference on Learning Representations, 2018.

Appendix A Supplementary Material

A.1 Loss function ablation study

In this section we study the contribution of the two terms of the loss function in greater detail.
As discussed in Sec. 3, the reconstruction and the autoregression terms act jointly during training. While the former encourages representations that are descriptive of input samples, the latter ensures that they are assigned a high probability by the autoregressive procedure. This implies that the former tries to separate representations, whereas the latter pulls them towards the nearest mode of the density function, thus having a collapsing effect. This conflicting behavior makes the importance of their joint optimization even clearer. To support this intuition, we perform an ablation study on the model employed for anomaly detection on Shanghai Tech, training it while turning off each term in turn, and report the results in Tab. 6. When the reconstruction loss is absent (AR only), the encoder is tasked with producing representations assigned to high-density regions, regardless of the sample presented as input. This is a severe issue during the test phase, in which samples from the novel class are assigned to the same high-density representations. Moreover, naively removing the autoregression term from the loss function would not make sense within the whole framework, since it is the term employed to score test samples. For this reason, the “REC only” entry of Tab. 6 refers to a training setting in which the autoregression loss only optimizes the parameters of the density estimation network (i.e. its gradient does not reach the encoder network). The low performance delivered by this training strategy reinforces the importance of the autoregression objective in forcing autoregressive properties onto representations.
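The three training settings can be illustrated with a deliberately simplified scalar toy, which is not the AND architecture. Everything in it is an assumption made for illustration: the encoder is a single weight `w` (code `z = w * x`), the decoder is fixed to the identity, and the density estimator is a unit-variance Gaussian with a learnable mean `m`. The sketch only shows which loss terms send gradients to the encoder in each setting.

```python
import numpy as np

def train_toy(mode, steps=500, lr=0.05, seed=0):
    """Toy scalar model: encoder z = w*x, identity decoder, density N(z; m, 1).

    mode is one of "joint", "ar_only", "rec_only" and decides which loss
    terms send gradients to the encoder weight w.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=1000)
    w, m = 0.5, 0.5
    use_rec = mode in ("joint", "rec_only")
    ar_reaches_encoder = mode in ("joint", "ar_only")
    for _ in range(steps):
        z = w * x
        # Gradient of the mean squared reconstruction error (z - x)^2 w.r.t. w.
        g_rec_w = np.mean(2 * (z - x) * x)
        # Gradients of the mean negative log-likelihood 0.5 * (z - m)^2.
        g_nll_w = np.mean((z - m) * x)
        g_nll_m = np.mean(m - z)
        w -= lr * (use_rec * g_rec_w + ar_reaches_encoder * g_nll_w)
        m -= lr * g_nll_m  # the density estimator is trained in every setting
    return w, m

for mode in ("joint", "ar_only", "rec_only"):
    w, m = train_toy(mode)
    print(f"{mode:8s}  w={w:.3f}  m={m:.3f}")
```

In this toy, the "ar_only" setting drives `w` towards zero, so every input is mapped to the same code, which matches the collapsing effect described above. The "rec_only" setting keeps the code fully descriptive (`w` close to 1) while the estimator merely fits whatever codes the encoder produces, and the joint setting settles in between.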

Loss term   AR only   REC only   Joint
ROC-AUC     0.515     0.480      0.685
Table 6: Ablation study assessing the importance of optimizing the reconstruction and autoregression terms of the loss function. See text for details.
Figure 10 (bottom panel), inference times for two AND models:
          AND       AND
AE        0.29 ms   17.10 ms
All       0.32 ms   18.42 ms
Increase  3.80 %    7.72 %
Figure 10: Top and center, trainable parameters of MFC and MSC as a function of code length (dashed red line at code_length=64; the remaining settings follow Tab. 1 of the paper). Bottom, inference times of the autoencoder only (AE) and the complete architecture (All). The relative increase is low since the proposed autoregressive layers share massive optimizations with standard convolutional and fully connected layers.

A.2 On the complexity of autoregressive layers

In this section, we illustrate the complexity and scalability of Masked Fully Connected (MFC) and Masked Stacked Convolution (MSC) layers (Fig. 4 of the main paper). We refer to type ‘B’ of both layers, since it is an upper bound on type ‘A’. Adhering to the notation introduced in Sec. 3, MFC exhibits trainable parameters and a computational complexity . MSC, instead, features free parameters and a time complexity . Fig. 10 reports the quadratic trend of the number of trainable parameters as a function of the code length, and a study assessing the impact of the density estimation network on inference times.
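The quadratic trend, and the fact that type ‘B’ upper-bounds type ‘A’, can be checked with a small sketch of a masked fully connected layer. The masking rule used here is an assumption of the sketch (output element i sees input elements j < i for type ‘A’ and j <= i for type ‘B’, with `c_in` and `c_out` channels per code element), and the names `mfc_mask` and `free_parameters` are hypothetical.

```python
import numpy as np

def mfc_mask(code_length, c_in, c_out, mask_type="B"):
    """Binary weight mask of a masked fully connected layer (illustrative).

    Output element i may see input elements j < i (type 'A') or j <= i
    (type 'B'); each code element carries c_in input and c_out output channels.
    """
    idx = np.arange(code_length)
    if mask_type == "A":
        allowed = idx[:, None] > idx[None, :]
    else:
        allowed = idx[:, None] >= idx[None, :]
    # Expand every (i, j) entry to a c_out x c_in block of channel weights.
    return np.kron(allowed.astype(int), np.ones((c_out, c_in), dtype=int))

def free_parameters(code_length, c_in, c_out, mask_type="B"):
    """Number of weights left trainable by the mask (biases are ignored)."""
    return int(mfc_mask(code_length, c_in, c_out, mask_type).sum())

for d in (16, 32, 64, 128):
    print(d, free_parameters(d, 1, 1, "A"), free_parameters(d, 1, 1, "B"))
```

Under these assumptions the count is `c_in * c_out * d * (d + 1) / 2` for type ‘B’ and `c_in * c_out * d * (d - 1) / 2` for type ‘A’, so doubling the code length roughly quadruples the number of trainable weights.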
