AND: Autoregressive Novelty Detectors
Abstract
We propose an unsupervised model for novelty detection. The task is treated as a density estimation problem, in which a deep neural network is employed to learn a parametric function that maximizes the likelihood of training samples. This is achieved by equipping an autoencoder with a novel module, responsible for maximizing the likelihood of compressed codes by means of autoregression. We illustrate design choices and proper layers to perform autoregressive density estimation when dealing with both image and video inputs. Despite its very general formulation, our model shows promising results on diverse one-class novelty detection and video anomaly detection benchmarks.
1 Introduction
Novelty and anomaly detection have been extensively studied in computer vision for many years. Nonetheless, the cornerstone of research in the field is the definition of novelty itself. The concept of discovering novelty dates back to Greek philosophy, notably Plato’s paradox of inquiry, in which the philosopher raises the question “how can I know something that I didn’t know before?”. Hence, the problem of novelty can be apprehended as the problem of knowledge [8].
Accordingly, novelty detection in computer science was defined as “the identification of unknown data or signals that a machine learning system is not aware of during training” [35, 2].
In computer vision this has been mostly associated with events or actions that are different with respect to known models, e.g. referring to an object, a person or a group of targets which behave in a surprising manner [11].
In this work we tackle this problem by modelling knowledge by means of a probability distribution, for which “novel” elements are the most inexplicable ones in a probabilistic sense. This formulation abides by both the conventional computer science definition and classical interpretations of novelty.
Technically, this is achievable through an unsupervised knowledge distiller that summarizes the actual informative content about the data and simultaneously models its probability distribution.
To this aim, we leverage the power of deep autoencoders as effective knowledge compressors, while forcing autoregressive properties on distilled codes in order to estimate their density function.
Although autoregressive density estimation has been profitably applied to deep learning [30, 56, 18], our contribution resides in a holistic approach in which the extracted knowledge is conditioned to follow an autoregressive scheme, so that its probability can be captured during training.
We evaluate our Autoregressive Novelty Detector (AND) on both image and video novelty and anomaly detection tasks, achieving very encouraging results.
To the best of our knowledge, this is the first model employing autoregression for the estimation of density functions in complex, high-dimensional feature spaces. We believe the proposed procedure can foster many refinements in applications processing highly structured, high-dimensional data.
2 Related work
Novelty and anomaly detection in vision.
Due to the complexity of defining and labeling novel events, nearly all methods in the literature are based on unsupervised learning. A few models detect unusual patterns in data streams by employing statistical tests paired with either compressed sensing [9] or dimensionality reduction [41].
A significant line of work focuses on learning a parametric compression and reconstruction of normal data, motivated by the assumption that outlier samples lead to worse reconstructions. Among these approaches, sparse-coding algorithms model normal patterns as linear combinations of a set of basis components, imposing a sparsity constraint on the scalar coefficients [65, 13, 32, 62].
On a different note, other models rely on learning the probability distribution of normal patterns by proposing both non-parametric [1] and parametric [7, 34, 31] estimators to approximate the density of low-level appearance and motion features.
At a higher abstraction level, graphical modeling has proven successful in estimating the distribution of video events and their causality, by means of Hidden Markov Models [15], Markov Random Fields [23] and energy-based models [29].
Other models, such as social force [37] and spectral graph analysis [10], specifically address video surveillance scenarios by tackling the detection of abnormalities in crowd trajectory patterns.
Finally, different directions involve clustering of spatio-temporal video volumes [66, 48], Gaussian Process Regression [12], scan statistics [22], foreground explainability [5] and discriminability of normal frames with respect to context [14].
Nevertheless, all the aforementioned methods are decoupled from the feature extraction process and strongly depend on the efficacy of the chosen feature set.
Modern novelty detection leverages the representation power of deep neural networks at the expense of hand-crafted features.
Algorithms such as one-class support vector machines (OC-SVM) [52] or Gaussian classifiers have been successfully applied to features extracted by deep networks [53, 63, 49, 50].
Reconstruction-based models, instead, which include autoencoders [19], convolutional long short-term memories [36] and sparse coding with temporal constraints [33], leverage the magnitude of the reconstruction error. Nevertheless, this metric has proven to be a proxy of the gradient of the log-density of hidden representations, rather than of the density itself [3].
A few works address novelty detection by assessing the temporal regularity of deep representations, via slow feature analysis [21] or binary temporal patterns [43].
Probabilistic models typically aim at estimating densities in deep feature space by means of GMMs [16, 68], or minimizing energy functions [64, 60].
Indeed, expectation-maximization for GMMs in high dimensions is troublesome, mostly due to the estimation of covariance matrices [67, 61, 28].
Notably, the method in [38] illustrates how CNNs can directly estimate density ratios of normalcy over novelty.
Motivated by their effectiveness in learning complex distributions, some recent works train generative adversarial networks (GANs) on normal samples. However, since the underlying data probability distribution is embedded in the network parameters and cannot be explicitly queried for scoring samples, these methods employ different heuristics for the evaluation. Specifically, [51, 44] employ reconstruction measures to monitor abnormalities, whereas [45] directly queries the GAN discriminator for a normalcy score.
Finally, [40] addresses a related task, namely oneclass classification, by learning CNN features that are both compact and descriptive using supervision from an additional external dataset.
For a comprehensive review of anomaly detection techniques in computer vision, we refer the reader to [42, 26].
Autoregressive density estimation.
A popular deep model for density estimation is the Restricted Boltzmann Machine [54, 17], which can be trained with contrastive divergence [20] to model the density of binary inputs. The problem with this approach is that exact inference is troublesome due to an intractable partition function. To overcome this limitation, the first autoregressive models were introduced: the binary NADE [30] and its real-valued version RNADE [56] employ a deep network to estimate conditional probabilities of unobserved variables, with sigmoid units and Gaussian mixtures respectively. Further improvements were presented in [57, 18], which extend the previous works by introducing a training strategy that copes with all possible autoregressive orderings among variables. Powerful generative models such as PixelRNN [59] and PixelCNN [58] condition the probability distribution of each pixel on the previous ones, following a preset ordering. Recently, autoregression has been merged with attention and meta-learning for few-shot density estimation [46], as well as with the parallel framework of normalizing flows [47] in [39].
3 Proposed model
In this section we describe the proposed model, according to which the scoring function is a probability density estimator, that directly learns to assign high probabilities to normal samples within the training set. The learned density function evaluates representations produced by a deep autoencoder, and its estimation proceeds following an autoregressive procedure, in which elements of the feature vector get observed gradually to predict the value of unobserved ones. First, we briefly review the technique of autoregression and its application in density estimation. Afterwards, the model’s architecture and objective function will be illustrated in detail.
3.1 Background: autoregressive models
Autoregression is a well-known technique that factorizes the probability distribution over a joint assignment x = (x_1, …, x_d) of random variables using the chain rule of probability,
(1)  p(x) = ∏_{i=1}^{d} p(x_i | x_<i)
so that estimating p(x) reduces to the estimation of each single conditional probability density (CPD) of the form p(x_i | x_<i). Notably, Eq. (1) assumes a given order over the random variables (Fig. 1). Although the chain rule holds in general, estimating CPDs can be easier for some variable orderings and harder for others, depending on the nature of the joint variable space. Prior models such as PixelRNN [59] and PixelCNN [58] employ autoregression to generate images one pixel at a time, and the ordering in which the process advances is either from the corners to the center or from top-left to bottom-right, row-wise. Other autoregressive models [57, 18] perform order-agnostic training by shuffling the order at each minibatch update. Considerable effort has been spent in this direction, because all the mentioned models estimate the density of raw data, for which the best autoregressive order is unknown. On the contrary, we employ this technique on representations extracted by a deep network, which we constrain to provide autoregressive feature vectors.
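As a toy numerical illustration (the probability values below are hypothetical, not drawn from any model), the factorization in Eq. (1) turns density estimation into a product of CPDs, so the log-likelihood of a sample is simply the sum of conditional log-probabilities:

```python
import numpy as np

def chain_rule_log_likelihood(conditional_probs):
    """Given the values p(x_i | x_<i) for each i, return log p(x)
    via the chain rule of Eq. (1)."""
    return float(np.sum(np.log(conditional_probs)))

# Hypothetical CPD values for a 3-dimensional sample.
p = [0.5, 0.25, 0.8]
ll = chain_rule_log_likelihood(p)   # equals log(0.5 * 0.25 * 0.8)
```

Summing in log-space avoids the numerical underflow that multiplying many small probabilities would cause.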
3.2 Autoregressive novelty detection
The successful application of autoregression to density estimation motivates us to employ it in a novelty detection setting. However, in real-world scenarios this requires modeling extremely high-dimensional data (e.g. images and videos). This can be troublesome for two reasons. First, as the dimensionality of the feature space grows, the estimation of a density function gets more and more difficult, as the number of required samples rises significantly. Moreover, since each autoregressive layer requires at least the same number of hidden units as the input layer [18], the scalability of the whole model becomes unfeasible due to memory limitations and the exploding number of parameters.
We deal with the aforementioned problems by shifting the analysis to a compressed domain, i.e. the feature space learned by a deep autoencoder. This way, data dimensionality problems are alleviated, since autoencoders can significantly compress input data. More importantly, by enforcing the autoregression property during autoencoder training, it is possible to guide the creation of the feature space, such that the encoded representation is autoregressive in a preset order.
More formally, our architecture is composed of three building blocks (Fig. 2): the encoder f, the decoder g and the density estimator h. The encoder processes an input x from an unknown training distribution and maps it into a compressed representation z = f(x) of lower dimensionality d, whereas the decoder provides a reconstructed version of the input, x̃ = g(z). Importantly, the representation is sigmoid-activated, so that z ∈ [0, 1]^d. The estimator h models the density of the representation space through autoregression. Specifically, it produces as output d probability distributions p(z_i | z_<i), each represented as a multinomial over a linear quantization of [0, 1] into B bins. In order to represent a proper probability distribution, a softmax activation is employed, ensuring that probabilities sum up to 1 along the quantization axis. As will be further detailed in the remainder of this section, all layers of the estimation network follow a precise connectivity scheme, ensuring that each output distribution is only connected to inputs z_1, …, z_{i−1}, thus resembling a valid autoregressive estimate of the CPD p(z_i | z_<i).
The objective of the complete architecture is to minimize:
(2)  L(θ) = Σ_x [ ‖x − x̃‖² − λ Σ_{i=1}^{d} V(z_i)^T log p(z_i | z_<i) ]
where x spans over all training examples and V is a utility function transforming each element z_i into a B-dimensional one-hot encoding, highlighting the correct quantization bin to which z_i belongs. Importantly, while the reconstruction loss guides the autoencoder towards data compression, the autoregression loss trains the subnetwork h to regress the distribution of compressed representations through maximum likelihood estimation. Thus, the encoder’s parametrization is forced to encode input samples such that representations are both descriptive (the reconstruction loss is low) and lie in a high-density region of the feature space (the autoregression loss is low). Similar to the case of variational autoencoders [25], the autoregression loss acts as a regularizer, avoiding the need for other regularization techniques such as denoising and contraction.
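A minimal numpy sketch of this objective, assuming the quantization and one-hot encoding described above (function names, the λ weighting and the bin assignment are illustrative, not the authors’ implementation):

```python
import numpy as np

def quantize_one_hot(z, n_bins):
    """One-hot encoding of each code element z_i in [0, 1] over n_bins
    equally sized bins (the utility function V of Eq. 2)."""
    bins = np.clip((z * n_bins).astype(int), 0, n_bins - 1)
    one_hot = np.zeros((len(z), n_bins))
    one_hot[np.arange(len(z)), bins] = 1.0
    return one_hot

def and_loss(x, x_rec, z, cpd, lam=1.0):
    """Reconstruction error plus the autoregression (negative log-likelihood)
    term; cpd[i] is the predicted multinomial over the bins for element z_i."""
    rec = np.sum((x - x_rec) ** 2)
    target = quantize_one_hot(z, cpd.shape[1])
    llk = -np.sum(target * np.log(cpd + 1e-12))   # cross-entropy per element
    return rec + lam * llk
```

Minimizing the second term pushes the estimator’s multinomials to put mass on the bins the codes actually fall into, i.e. maximum likelihood over the quantized representation.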
Once the model is trained, we detach the decoder and interpret the remaining encoder-estimator pair as a deep density estimator, which can be queried with new samples for their plausibility under the learned distribution.
Indeed, the autoregression loss of a single test sample estimates the negative log-probability of its representation in feature space, and can hence be used as a score for novelty versus regularity.
General architecture.
The proposed model is represented in Fig. 2. The encoder and decoder networks are composed of several residual convolutional blocks, each routing its input along two paths. The standard path processes the input with three stacked convolutions, the first of which changes the resolution of the input tensor (spatially and/or temporally) by striding, either downsampling (encoder block) or upsampling (decoder block, in which the first convolution is transposed) feature maps.
An identity path applies a single strided convolution with a unitary (1×1) kernel.
Two densely connected subnetworks map the encoder output to the compressed feature vector and bring it back to a spatial feature map fed as input to the decoder. Further details about blocks are given in Fig. 3.
Image model.
The network estimating image densities is composed of 2D blocks (Fig. 3, image model), except for the innermost layers that are fully connected and provide a compressed representation.
This compressed feature vector has dimensionality d and, as mentioned, is fed to an autoregressive module that estimates its probability. Each layer within this module features a Masked Fully Connection (MFC, Fig. 4a), which can follow two different masking patterns, namely A and B.
Formally, it computes an output feature map o given the input v. The connection between the input element in position i, channel c, and the output element in position j, channel k, is parametrized by
(3)  w_{(i,c),(j,k)} = θ_{(i,c),(j,k)} if i < j (type A) / i ≤ j (type B), and 0 otherwise,
The feature vector entering the first autoregressive layer does not have a feature channel axis, so it can be thought of as a vector in R^{d×1}. On the contrary, the output of the final autoregressive layer provides probability estimates for the B bins that compose the space quantization (see previous section), so it lives in R^{d×B}. Intermediate layers can have an arbitrary number of output channels. The first layer within the estimation network has type A (thus strictly depending on previous elements only), whereas all the remaining ones have type B (thus masking only successive elements). All hidden layers are leaky ReLU activated.
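The MFC connectivity can be realized by multiplying a dense weight matrix with a binary mask. The sketch below is a simplified single-channel version with hypothetical helper names, meant only to illustrate the two masking types:

```python
import numpy as np

def mfc_mask(d, mask_type):
    """Binary autoregressive mask for a Masked Fully Connected layer (Eq. 3).
    Type 'A' lets output j see only inputs i < j; type 'B' also allows i == j."""
    i = np.arange(d)[:, None]   # input index
    j = np.arange(d)[None, :]   # output index
    return (i < j if mask_type == 'A' else i <= j).astype(float)

def masked_linear(v, W, mask):
    """Fully connected layer whose weights are zeroed out by the mask,
    so each output only depends on the permitted inputs."""
    return v @ (W * mask)
```

Because the mask is fixed, stacking a type-A layer followed by type-B layers preserves the autoregressive property end to end.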
Video model.
When processing video inputs, the network consumes temporal clips of 16 consecutive frames.
All convolutions are 3D, so they stride along the temporal axis [55]. The structure of 3D blocks is represented in Fig. 3, video model.
In order to obtain a temporallyordered representation of the input clip, each 3D convolution within encoding blocks is causal [6], so that the output cannot access information from future frames (i.e. each connection to future units is zeromasked).
The last fully connected layers within the encoder process each feature map along the temporal axis in parallel, so that maps from different timesteps undergo the same parametrization. This way, the encoding procedure does not shuffle information across timesteps, ensuring temporal ordering. The decoder architecture mirrors the encoder to restore the input data.
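A causal temporal filter of this kind can be sketched in numpy by left-padding the sequence, so each output only sees present and past inputs. This is a single-channel, 1D simplification of the 3D causal convolutions used in the encoder:

```python
import numpy as np

def causal_temporal_conv(x, kernel):
    """Causal filtering along the temporal axis: the output at time t only
    depends on inputs at times <= t, obtained by zero-padding on the left."""
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])
    # Each output taps the current sample and the k-1 preceding ones.
    return np.array([np.dot(padded[t:t + k], kernel) for t in range(len(x))])
```

In the full model the same left-padding idea applies to the temporal dimension of 3D kernels, with connections to future frames zero-masked.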
The compressed representation of video clips has dimensionality t × d, t being the number of temporal timesteps and d the length of the code at each timestep. Accordingly, the estimation network is designed to capture two-dimensional patterns within observed elements of the code, in order to provide an estimate over the values of unobserved ones.
However, naively plugging 2D convolutional layers would assume translation invariance in patterns that characterize the code.
Due to the way the compressed representation is built, this assumption holds only along the temporal axis, whereas it is unjustified along the code axis (as mentioned above, each timestep’s code is the output of a fully connected layer). To address these constraints, our proposal is to apply a different convolutional kernel to each element along the code dimension, allowing the observation of the whole feature vector in the previous timestep and of a portion of the current timestep (Fig. 4b). Every convolution is free to stride along the time axis and captures temporal patterns. We name this operation Masked Stacked Convolution (MSC).
Specifically, the t-th convolution is equipped with a kernel W^(t), which gets multiplied by the binary mask M^(t), defined as

(4)  M^(t)_{i,j} = 1 if row i covers a previous timestep, or if row i covers the current timestep and j < t (type A) / j ≤ t (type B); 0 otherwise,

where i indexes the temporal axis and j the code axis.
Each single convolution yields a column vector, as a result of its stride along time. The column vectors resulting from the application of the d convolutions to the input tensor v are horizontally stacked to build the output tensor o, as follows:

(5)  o = [ (M^(1) ⊙ W^(1)) ∗ v ∥ … ∥ (M^(d) ⊙ W^(d)) ∗ v ],

where ∥ represents the horizontal concatenation operation, ⊙ the Hadamard product and ∗ the convolution along the temporal axis.
Similarly to the model discussed for images, each MSC layer features a leaky ReLU activation and employs one of two types of masking, namely A and B, the former used only in the first layer and the latter in all successive layers within the estimation network. The output of the estimation network is a tensor of shape t × d × B, providing autoregressive CPDs for each element of the representation of the input video clip.
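Under the simplifying assumption of a kernel spanning only the previous and current timesteps (the actual kernel extent is an assumption of this sketch), the MSC masking of Eq. (4) can be written as:

```python
import numpy as np

def msc_mask(t, d, mask_type):
    """Binary mask for the t-th Masked Stacked Convolution kernel (Eq. 4),
    assuming a 2 x d kernel: row 0 covers the previous timestep (fully
    visible), row 1 the current timestep (visible only up to element t)."""
    mask = np.zeros((2, d))
    mask[0, :] = 1.0                        # whole previous timestep observed
    limit = t if mask_type == 'A' else t + 1
    mask[1, :limit] = 1.0                   # portion of the current timestep
    return mask
```

As in the MFC case, multiplying the kernel by this mask before convolving guarantees that the prediction for code element t never peeks at itself (type A) or at elements to its right within the current timestep.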
Table 1: Architectural hyperparameters employed for each dataset.

                 MNIST              CIFAR-10           DR(eye)VE          ShanghaiTech
E-blocks         [32,64]            [64,128,256]       [8,16,32,64,64]    [8,16,32,64,64]
D-blocks         [64,32]            [256,128,64]       [64,32,16,8,8]     [64,32,16,8,8]
Spatial stride   [2,2]              [2,2,2]            [2,2,2,2,2]        [2,2,2,2,2]
Temporal stride  -                  -                  [2,2,1,1,1]        [2,2,1,1,1]
FC layers        [64]               [256,64]           [64]               [512,64]
MFC layers       [32,32,32,32,100]  [32,32,32,32,100]  -                  -
MSC layers       -                  -                  [4,4,100]          [4,4,100]
Code dim         64                 64                 464                464
Minibatch        64                 64                 12                 8
Table 2: ROC-AUC for one-class novelty detection on MNIST (left, first seven columns) and CIFAR-10 (right, last seven columns).

Class  OC-SVM  KDE  DAE  VAE  PixCNN  GAN  AND  OC-SVM  KDE  DAE  VAE  PixCNN  GAN  AND
0  0.988  0.885  0.894  0.997  0.531  0.926  0.984  0.630  0.658  0.411  0.700  0.788  0.708  0.717 
1  0.999  0.996  0.999  0.999  0.995  0.995  0.995  0.440  0.520  0.478  0.386  0.428  0.458  0.494 
2  0.902  0.710  0.792  0.936  0.476  0.805  0.947  0.649  0.657  0.616  0.679  0.617  0.664  0.662 
3  0.950  0.693  0.851  0.959  0.517  0.818  0.952  0.487  0.497  0.562  0.535  0.574  0.510  0.527 
4  0.955  0.844  0.888  0.973  0.739  0.823  0.960  0.735  0.727  0.728  0.748  0.511  0.722  0.736 
5  0.968  0.776  0.819  0.964  0.542  0.803  0.971  0.500  0.496  0.513  0.523  0.571  0.505  0.504 
6  0.978  0.861  0.944  0.993  0.592  0.890  0.991  0.725  0.758  0.688  0.687  0.422  0.707  0.726 
7  0.965  0.884  0.922  0.976  0.789  0.898  0.970  0.533  0.564  0.497  0.493  0.454  0.471  0.560 
8  0.853  0.669  0.740  0.923  0.340  0.817  0.922  0.649  0.680  0.487  0.696  0.715  0.713  0.680 
9  0.955  0.825  0.917  0.976  0.662  0.887  0.979  0.508  0.540  0.378  0.386  0.426  0.458  0.566 
avg  0.951  0.814  0.877  0.969  0.618  0.866  0.967  0.586  0.610  0.536  0.583  0.551  0.592  0.617 
4 Experiments
In this section we quantitatively evaluate the proposed AND model in two contexts: one-class novelty detection and video anomaly detection.
Implementation details. All experiments follow the general model illustrated in Sec. 3. However, some architectural choices were specialized for each dataset, in order to cope with different data complexities, overfitting or other issues. These modifications mainly involve the encoder and decoder capacities, in terms of convolutional blocks and the number of filter banks within each block. We refer the reader to Tab. 1 for more details. All models were trained by minimizing Eq. 2. As for the optimizer, we chose Adam [24] with a learning rate of 0.001.
4.1 Novelty detection: MNIST and CIFAR-10
We first assess the ability of our model to discriminate between normal samples seen during training and samples drawn from a different distribution. To this end, we perform several one-class experiments on MNIST and CIFAR-10. Specifically, for both datasets we isolate the images of each class and consider them as the normal distribution. For each class we train a different AND network; at test time, the whole test set is presented, and the model is tasked to assign a higher score (probability) to samples sharing the class of the training images. We randomly pick 20% of the images from the training set of each class and use them for validation. Importantly, no novel samples are employed within the training and validation sets.
(Fig. 5: test samples scored by AND with low density (left) and high density (right).)
We consider the following baselines:

- standard methods such as OC-SVM [52] and a Kernel Density Estimator (KDE), applied to features extracted by PCA whitening. Hyperparameters such as the number of principal components, the KDE kernel bandwidth and the OC-SVM tolerance are tuned on the validation set;
- a denoising autoencoder (DAE), sharing the same encoder-decoder structure as our proposal, but lacking the density estimation module. The reconstruction error is employed as a measure of normalcy vs. novelty;
- a variational autoencoder (VAE) [25], also sharing the same capacity as our model, in which the Evidence Lower Bound (ELBO) is employed as the score;
- an autoregressive model, PixCNN [58], which estimates the density directly in pixel space;
- the GAN-based approach illustrated in [51].
Both the baselines and our model produce unbounded scores, so choosing a threshold to obtain a categorical prediction is often problematic. Therefore, Tab. 2 reports the results of this experiment in terms of ROC-AUC. This metric can be interpreted as the probability that a uniformly drawn normal image is scored higher than a uniformly drawn novel image, and is free from the choice of a fixed threshold.
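This rank-based interpretation can be computed directly, without choosing any threshold. A brute-force O(nm) sketch (real evaluations would use a library routine, but the pairwise definition is the point here):

```python
import numpy as np

def roc_auc(normal_scores, novel_scores):
    """ROC-AUC computed as the probability that a randomly drawn normal
    sample is scored higher than a randomly drawn novel one (ties 0.5)."""
    n = np.asarray(normal_scores, dtype=float)[:, None]
    m = np.asarray(novel_scores, dtype=float)[None, :]
    wins = (n > m).sum() + 0.5 * (n == m).sum()   # compare every pair
    return float(wins / (n.shape[0] * m.shape[1]))
```

A score of 1.0 means every normal sample outranks every novel one; 0.5 is random scoring, which is why values near or below 0.5 in Tab. 2 indicate failure.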
Considering a simple setting such as MNIST, most methods perform favorably.
In particular, VAE and AND yield the best results, with very similar performances. This is reasonable: with such clean image content, the code distribution collapses to a simple distribution in the latent space, for which the naive Gaussian posterior imposed by the VAE is a good proxy.
Surprisingly, PixCNN excels in modeling the distribution of ones, but struggles on the other classes. This finding suggests that autoregression within the joint space of pixels is hard, whereas employing it in representation space can be beneficial. Notably, OC-SVM outperforms most deep models in this setting.
On the contrary, CIFAR-10 exhibits a strong intra-class variability and represents a significant challenge for most methods. Many baselines score near or below 0.5 on several classes, which means on par with or even worse than random scoring. In this setting, AND outperforms all baselines.
The gap with VAE increases, possibly due to the Gaussian posterior VAE assumes for latent representations, which is a more severe constraint in the setting of real-world images.
The DAE results suggest that reconstruction errors can indeed be interpreted as a proxy for sample density, but require extra care. Specifically, autoencoders are very likely to encode low-level features within the code (since they are useful for reconstruction), resulting in accurate restorations even of novel, unseen samples.
On the contrary, AND directly models the density of the extracted representations, regardless of their level of abstraction, without making assumptions about their underlying distribution.
Notably, KDE yields better performance than all deep learning based approaches except AND. Nevertheless, KDE is a non-parametric estimator and therefore needs to store all training examples in memory in order to evaluate the normalcy of new images, making it unfeasible as the dataset grows in size.
A visualization of AND behavior is represented in Fig. 5, illustrating CIFAR images scored with low and high densities.
Table 3: Novelty detection results on DR(eye)VE for ConvAE [19] (left) and AND (right).

         ConvAE [19]                          AND
Class    P@20  P@50  P@100  MAP    AUC        P@20  P@50  P@100  MAP    AUC
Morning  0.85  0.72  0.58  0.513  0.713  0.95  0.88  0.87  0.570  0.758 
Evening  0.50  0.68  0.65  0.466  0.700  0.00  0.00  0.00  0.366  0.662 
Night  0.75  0.90  0.95  0.937  0.977  1.00  1.00  1.00  0.965  0.981 
Sunny  0.60  0.50  0.44  0.368  0.529  0.95  0.48  0.40  0.409  0.585 
Cloudy  0.60  0.62  0.59  0.470  0.583  0.85  0.80  0.71  0.507  0.626 
Rainy  0.75  0.62  0.46  0.290  0.472  1.00  1.00  0.96  0.286  0.494 
Downtown  0.15  0.14  0.21  0.344  0.414  0.15  0.20  0.19  0.336  0.441 
Countryside  0.30  0.42  0.34  0.326  0.506  1.00  1.00  0.98  0.561  0.721 
Highway  0.60  0.36  0.48  0.367  0.632  1.00  1.00  1.00  0.525  0.718 
avg  0.57  0.56  0.52  0.453  0.614  0.77  0.71  0.68  0.473  0.665 
(Fig. 7: best-scoring normal clip and best-scoring novel clip for each DR(eye)VE class; the class of the best-scoring novel clip is, per model: Morning→Evening, Evening→Night, Night→Evening, Sunny→Rainy, Cloudy→Rainy, Rainy→Cloudy, Downtown→Highway, Countryside→Highway, Highway→Countryside.)
4.2 Novelty detection: DR(eye)VE
We now stress AND performance in a more challenging video novelty detection context. We employ the recently proposed DR(eye)VE dataset [4], which features 74 driving sequences, each 5 minutes long, captured from the perspective of a car. The dataset was originally proposed for attention prediction, but provides, for each sequence, annotations concerning the type of landscape (downtown, countryside, highway), the time of acquisition (morning, evening, night) and the weather (sunny, cloudy, rainy). Therefore, we proceed in a similar fashion as in Sec. 4.1, training several AND models by isolating each class and then assessing their ability to recognize normal and novel samples. The network is fed with 16-frame clips, with spatially downsampled images. For training, we employ clips randomly drawn from the first 38 driving sequences (except for the central 500 frames), while the test set is composed of 10000 random clips from the remaining ones. 200 random clips from the 500 central frames of each training sequence constitute the validation set. We report in Tab. 3 and Fig. 6 the results of AND compared to the convolutional autoencoder (ConvAE) approach presented in [19], which scored favorably on several video benchmarks. Evaluation is carried out by sorting test samples according to the predicted normalcy score and computing precision@k (P@K), mean average precision (MAP) and ROC-AUC. As the table shows, AND outperforms the competitor on nearly all classes. In particular, it clearly emerges that “night” clips are easy to distinguish from “morning” and “evening” ones, due to global changes in illumination. On the other hand, classes such as “rainy” and “downtown” yield poor performances.
In particular, two degenerate cases need further reasoning. “Evening” sequences score well globally, but have zero precision among the first 100 test samples. This is again due to global illumination, in that AND tends to positively score darker images, which within the test set correspond to “night”. The opposite case is represented by “rainy” sequences, whose scores are not discriminative globally but very precise in terms of top precisions. This suggests the presence of highly distinctive clips within rainy sequences (e.g. when the camera itself is covered by water drops), alongside many others that are very hard to distinguish, i.e. when light rain falls and drops are not visible in the clip. To represent the task complexity and provide a qualitative analysis of AND errors, we report in Fig. 7 the best-scoring normal clip and the best-scoring novel clip for each class. The figure illustrates, column-wise, that the boundary separating a class from the others is often mild. As an example, “night” can easily be mistaken for “evening” when the illumination conditions are similar, and discriminating between “cloudy” and “rainy” is generally hard.
4.3 Video anomaly detection
A further domain for the proposed autoregressive model is video anomaly detection. This topic has a great impact on many computer vision applications, including surveillance, where it is often cast as the detection of uncommon pedestrian motion patterns. We are interested in evaluating whether our density estimation approach transfers to this peculiar setting, where the anomaly is triggered by the temporal evolution of small scene regions (i.e. the regions where pedestrians act).
To this end, we choose the recent ShanghaiTech Campus benchmark [33]. The dataset features 13 different cameras, delivering more than 250,000 frames for training, and 130 clips representing abnormal events for testing. The latter comprise appearances of bicycles, cars and other vehicles, as well as hasty changes in motion patterns due to episodes of chasing, brawling, fighting and pickpocketing.
At test time, we evaluate 16-frame clips, seeking a frame-level score. Thus, we compute the score of a frame as the mean score of the 16 clips containing it.
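This frame-level aggregation can be sketched as follows, assuming stride-1 sliding clips so that a frame in the interior of the video belongs to exactly clip_len overlapping clips (frames near the boundaries belong to fewer):

```python
import numpy as np

def frame_scores(clip_scores, clip_len=16):
    """Turn per-clip scores into per-frame scores: each frame receives the
    mean score of all stride-1 sliding clips that contain it.
    clip_scores[i] scores frames i .. i+clip_len-1."""
    n_frames = len(clip_scores) + clip_len - 1
    sums = np.zeros(n_frames)
    counts = np.zeros(n_frames)
    for i, s in enumerate(clip_scores):
        sums[i:i + clip_len] += s      # spread the clip score over its frames
        counts[i:i + clip_len] += 1    # how many clips cover each frame
    return sums / counts
```

Averaging over all covering clips smooths the score signal along time, which is desirable when anomalies persist across consecutive frames.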
Table 4: Frame-level ROC-AUC on the ShanghaiTech benchmark.

Method          ROC-AUC
ConvAE [19]  0.609 
TSC [33]  0.679 
sRNN [33]  0.680 
AND  0.685 
We report our performance in Tab. 4, compared to the ConvAE model by Hasan et al. [19], and to Temporally-coherent Sparse Coding (TSC) and stacked RNN (sRNN) from [33]. During testing, we noticed that sequences from different cameras were scored differently, due to the distribution of training clips: coherently with its objective, our model tends to assign a higher score to cameras delivering more sequences. Nevertheless, within a given sequence, abnormal behaviors are generally detected. We compute the ROC-AUC separately for each sequence and report the mean. Notably, our model outperforms ConvAE by a significant margin, and performs on par with other state-of-the-art models. This result is remarkable for two reasons: first, we make no assumption about a surveillance-specific feature set (e.g. optical flow, as in [63]); instead, the autoencoder seamlessly focuses on motion features. Second, anomalies emerge directly from the modeling of latent distributions, without any assumption about their structure or domain.
We report in Fig. 8 some success cases, in which AND correctly identifies abnormal entities such as bicycles and cars, as well as uncommon behaviors such as brawling.
Table 5: Log-likelihood results of Bayesian Networks with ARG, RDM and INV structures, fit to MNIST representations, for the training, validation and test splits.

             Class:  0       1       2       3       4       5       6       7       8       9
ARG  Train  201.60  161.60  171.43  172.73  174.17  186.48  158.22  162.37  171.65  154.11 
Val  200.96  160.38  170.10  172.29  173.85  185.25  157.22  162.20  171.42  154.02  
Test  200.89  159.73  169.64  170.75  172.40  184.27  157.74  161.65  170.10  152.70  
RDM  Train  496.33  456.34  466.16  467.47  468.90  481.21  452.95  457.10  466.39  448.84 
Val  495.69  455.11  464.83  467.02  468.58  479.98  451.95  456.93  466.15  448.75  
Test  495.62  454.47  464.37  465.48  467.13  479.00  452.48  456.38  464.83  447.43  
INV  Train  791.06  751.07  760.89  762.20  763.63  775.94  747.68  751.83  761.12  743.57 
Val  790.42  749.84  759.56  761.75  763.31  774.71  746.68  751.66  760.88  743.48  
Test  790.35  749.20  759.11  760.22  761.86  773.73  747.21  751.12  759.56  742.16 
4.4 On the causal structure of representations
In this section we investigate the capability of the AND encoder to produce representations that follow the causal structure imposed by the autoregression loss during training. To this aim, we extract representations from the 10 AND models trained on MNIST digits (Sec. 4.1) and fit different graphical models to their distribution. Specifically, we train several Bayesian Networks (BNs) with different autoregressive structures. Each BN is modeled with Linear Gaussian CPDs [27], such that the CPD of each non-root variable $z_j$ is

$p(z_j \mid \mathrm{Pa}(z_j)) = \mathcal{N}(z_j;\ \mathbf{w}_j^\top \mathrm{Pa}(z_j) + b_j,\ \sigma_j^2)$, (6)

with the exception of root nodes, which are modeled with an unconditional Gaussian. Concerning the ordering, we test:

ARG order: the BN structure follows the autoregressive order imposed during training.

RDM order: the BN structure follows a random autoregressive order.

INV order: the BN structure follows an autoregressive order which is the inverse with respect to the one imposed during training.
It is worth noting that the three structures exhibit the same number of edges and independent parameters, so that any difference in fitting capability is due solely to the causal order imposed over the variables.
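The fitting procedure can be sketched with least-squares Linear Gaussian CPDs; the toy chain-structured data below stands in for the actual MNIST representations and is purely an assumption of this sketch:

```python
import numpy as np

def fit_linear_gaussian_bn(Z, order):
    """Fit a fully-autoregressive BN with Linear Gaussian CPDs: each variable
    is regressed on all variables preceding it in `order` (roots get a plain
    Gaussian). Returns the average (per-sample) log-likelihood."""
    n, _ = Z.shape
    ll = 0.0
    for k, i in enumerate(order):
        parents = list(order[:k])
        if parents:
            X = np.column_stack([Z[:, parents], np.ones(n)])
            w, *_ = np.linalg.lstsq(X, Z[:, i], rcond=None)
            resid = Z[:, i] - X @ w
        else:
            resid = Z[:, i] - Z[:, i].mean()
        var = resid.var() + 1e-8
        ll += -0.5 * np.sum(resid ** 2 / var + np.log(2 * np.pi * var))
    return ll / n

# toy latent codes with a chain structure z0 -> z1 -> z2
rng = np.random.default_rng(0)
z0 = rng.normal(size=1000)
z1 = 0.9 * z0 + 0.1 * rng.normal(size=1000)
z2 = 0.9 * z1 + 0.1 * rng.normal(size=1000)
Z = np.column_stack([z0, z1, z2])
ll = fit_linear_gaussian_bn(Z, [0, 1, 2])  # autoregressive (ARG-style) order
```

In the experiment above, the same routine would be run under the ARG, RDM and INV orderings and the resulting average log-likelihoods compared.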
Fig. 9 reports the average training log-likelihood (i.e., the training log-likelihood divided by the number of training samples) of all BN models. Remarkably, the autoregressive order is clearly a better fit, supporting the capability of the encoder network to extract features with known autoregressive properties. Moreover, to show that this result is not due to overfitting or other confounding effects, we report in Tab. 5 log-likelihoods for the training, validation and test sets (as defined in Sec. 4.1).
5 Conclusions
In this paper, we propose an autoregressive framework for novelty detection. Our main contribution is an end-to-end model, trainable in a fully unsupervised fashion, that can learn complex high-dimensional distributions directly from data. Once trained, it serves as a deep parametric density estimator and can be queried for the probability of new samples. The proposed model is data agnostic, and we discuss how to specialize its structure to cope with image and video inputs. Experimental results show promising performance in both one-class and anomaly detection settings, demonstrating the flexibility of our framework in tackling tasks of different nature without making data-related assumptions.
References
 [1] A. Adam, E. Rivlin, I. Shimshoni, and D. Reinitz. Robust real-time unusual event detection using multiple fixed-location monitors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(3):555–560, 2008.
 [2] M. Ahmed, A. N. Mahmood, and J. Hu. A survey of network anomaly detection techniques. Journal of Network and Computer Applications, 60:19–31, 2016.
 [3] G. Alain and Y. Bengio. What regularized autoencoders learn from the data-generating distribution. The Journal of Machine Learning Research, 15(1):3563–3593, 2014.
 [4] S. Alletto, A. Palazzi, F. Solera, S. Calderara, and R. Cucchiara. Dr(eye)ve: a dataset for attention-based tasks with applications to autonomous and assisted driving. In IEEE International Conference on Computer Vision and Pattern Recognition Workshops, 2016.
 [5] B. Antić and B. Ommer. Video parsing for abnormality detection. In IEEE International Conference on Computer Vision, pages 2415–2422. IEEE, 2011.
 [6] S. Bai, J. Z. Kolter, and V. Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv:1803.01271, 2018.
 [7] A. Basharat, A. Gritai, and M. Shah. Learning object motion patterns for anomaly detection and improved object detection. In IEEE International Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2008.
 [8] M. Brioschi et al. The problem of novelty according to C.S. Peirce and A.N. Whitehead. PhD thesis, 2015.
 [9] S. Budhaditya, D.S. Pham, M. Lazarescu, and S. Venkatesh. Effective anomaly detection in sensor networks data streams. In Data Mining, 2009. ICDM’09. Ninth IEEE International Conference on, pages 722–727. IEEE, 2009.
 [10] S. Calderara, U. Heinemann, A. Prati, R. Cucchiara, and N. Tishby. Detecting anomalies in people's trajectories using spectral graph analysis. Computer Vision and Image Understanding, 115(8):1099–1111, 2011.
 [11] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Comput. Surv., 41(3):15:1–15:58, July 2009.
 [12] K.W. Cheng, Y.T. Chen, and W.H. Fang. Video anomaly detection and localization using hierarchical feature representation and gaussian process regression. In IEEE International Conference on Computer Vision and Pattern Recognition, pages 2909–2917. IEEE, 2015.
 [13] Y. Cong, J. Yuan, and J. Liu. Sparse reconstruction cost for abnormal event detection. In IEEE International Conference on Computer Vision and Pattern Recognition, pages 3449–3456. IEEE, 2011.
 [14] A. Del Giorno, J. A. Bagnell, and M. Hebert. A discriminative framework for anomaly detection in large videos. In European Conference on Computer Vision, pages 334–349. Springer, 2016.
 [15] T. V. Duong, H. H. Bui, D. Q. Phung, and S. Venkatesh. Activity recognition and abnormality detection with the switching hidden semi-Markov model. In IEEE International Conference on Computer Vision and Pattern Recognition, volume 1, pages 838–845. IEEE, 2005.
 [16] Y. Feng, Y. Yuan, and X. Lu. Learning deep event models for crowd anomaly detection. Neurocomputing, 219:548–556, 2017.
 [17] Y. Freund and D. Haussler. A fast and exact learning rule for a restricted class of boltzmann machines. Neural Information Processing Systems, 4:912–919, 1992.
 [18] M. Germain, K. Gregor, I. Murray, and H. Larochelle. Made: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, pages 881–889, 2015.
 [19] M. Hasan, J. Choi, J. Neumann, A. K. RoyChowdhury, and L. S. Davis. Learning temporal regularity in video sequences. In IEEE International Conference on Computer Vision and Pattern Recognition, pages 733–742. IEEE, 2016.
 [20] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural computation, 14(8):1771–1800, 2002.
 [21] X. Hu, S. Hu, Y. Huang, H. Zhang, and H. Wu. Video anomaly detection using deep incremental slow feature analysis network. IET Computer Vision, 10(4):258–267, 2016.
 [22] Y. Hu, Y. Zhang, and L. S. Davis. Unsupervised abnormal crowd activity detection using semiparametric scan statistic. In IEEE International Conference on Computer Vision and Pattern Recognition Workshops, pages 767–774. IEEE, 2013.
 [23] J. Kim and K. Grauman. Observe locally, infer globally: a space-time MRF for detecting abnormal activities with incremental updates. In IEEE International Conference on Computer Vision and Pattern Recognition, pages 2921–2928. IEEE, 2009.
 [24] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015.
 [25] D. P. Kingma and M. Welling. Autoencoding variational bayes. International Conference on Learning Representations, 2014.
 [26] B. R. Kiran, D. M. Thomas, and R. Parakkal. An overview of deep learning based methods for unsupervised and semi-supervised anomaly detection in videos. arXiv preprint arXiv:1801.03149, 2018.
 [27] D. Koller and N. Friedman. Probabilistic graphical models: principles and techniques. MIT Press, 2009.
 [28] A. Krishnamurthy. Highdimensional clustering with sparse gaussian mixture models. Unpublished paper, pages 191–192, 2011.
 [29] J. Kwon and K. M. Lee. A unified framework for event summarization and rare event detection from multiple views. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1737–1750, 2015.
 [30] H. Larochelle and I. Murray. The neural autoregressive distribution estimator. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 29–37, 2011.
 [31] W. Li, V. Mahadevan, and N. Vasconcelos. Anomaly detection and localization in crowded scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(1):18–32, 2014.
 [32] C. Lu, J. Shi, and J. Jia. Abnormal event detection at 150 fps in matlab. In IEEE International Conference on Computer Vision, pages 2720–2727. IEEE, 2013.
 [33] W. Luo, W. Liu, and S. Gao. A revisit of sparse coding based anomaly detection in stacked rnn framework. IEEE International Conference on Computer Vision, 2017.
 [34] V. Mahadevan, W. Li, V. Bhalodia, and N. Vasconcelos. Anomaly detection in crowded scenes. In IEEE International Conference on Computer Vision and Pattern Recognition, pages 1975–1981. IEEE, 2010.
 [35] M. Markou and S. Singh. Novelty detection: a review, part 2: neural network based approaches. Signal Processing, 83(12):2499–2521, 2003.
 [36] J. R. Medel and A. Savakis. Anomaly detection in video using predictive convolutional long short-term memory networks. arXiv preprint arXiv:1612.00390, 2016.
 [37] R. Mehran, A. Oyama, and M. Shah. Abnormal crowd behavior detection using social force model. In IEEE International Conference on Computer Vision and Pattern Recognition, pages 935–942. IEEE, 2009.
 [38] H. Nam and M. Sugiyama. Direct density ratio estimation with convolutional neural networks with application in outlier detection. IEICE Transactions on Information and Systems, 98(5):1073–1079, 2015.
 [39] G. Papamakarios, I. Murray, and T. Pavlakou. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pages 2335–2344, 2017.
 [40] P. Perera and V. M. Patel. Learning deep features for one-class classification. arXiv preprint arXiv:1801.05365, 2018.
 [41] D. S. Pham, B. Saha, D. Q. Phung, and S. Venkatesh. Detection of crosschannel anomalies from multiple data channels. In Data Mining (ICDM), 2011 IEEE 11th International Conference on, pages 527–536. IEEE, 2011.
 [42] O. P. Popoola and K. Wang. Video-based abnormal human behavior recognition: a review. IEEE Transactions on Systems, Man and Cybernetics, 42(6):865–878, 2012.
 [43] M. Ravanbakhsh, M. Nabi, H. Mousavi, E. Sangineto, and N. Sebe. Plug-and-play cnn for crowd motion analysis: An application in abnormal event detection. IEEE Winter Conference on Applications of Computer Vision, 2018.
 [44] M. Ravanbakhsh, M. Nabi, E. Sangineto, L. Marcenaro, C. Regazzoni, and N. Sebe. Abnormal event detection in videos using generative adversarial nets. IEEE International Conference on Image Processing, 2017.
 [45] M. Ravanbakhsh, E. Sangineto, M. Nabi, and N. Sebe. Training adversarial discriminators for cross-channel abnormal event detection in crowds. arXiv preprint arXiv:1706.07680, 2017.
 [46] S. Reed, Y. Chen, T. Paine, A. v. d. Oord, S. Eslami, D. Rezende, O. Vinyals, and N. de Freitas. Few-shot autoregressive density estimation: Towards learning to learn distributions. International Conference on Learning Representations, 2018.
 [47] D. J. Rezende and S. Mohamed. Variational inference with normalizing flows. International Conference on Machine Learning, 2015.
 [48] M. J. Roshtkhari and M. D. Levine. Online dominant and anomalous behavior detection in videos. In IEEE International Conference on Computer Vision and Pattern Recognition, pages 2611–2618. IEEE, 2013.
 [49] M. Sabokrou, M. Fathy, M. Hoseini, and R. Klette. Real-time anomaly detection and localization in crowded scenes. In IEEE International Conference on Computer Vision and Pattern Recognition Workshops, pages 56–62, 2015.
 [50] M. Sabokrou, M. Fayyaz, M. Fathy, Z. Moayed, and R. Klette. Deep-anomaly: Fully convolutional neural network for fast anomaly detection in crowded scenes. Computer Vision and Image Understanding, 2018.
 [51] T. Schlegl, P. Seeböck, S. M. Waldstein, U. SchmidtErfurth, and G. Langs. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International Conference on Information Processing in Medical Imaging, pages 146–157. Springer, 2017.
 [52] B. Schölkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt. Support vector method for novelty detection. In Neural Information Processing Systems, 2000.
 [53] P. Seeböck, S. Waldstein, S. Klimscha, B. S. Gerendas, R. Donner, T. Schlegl, U. SchmidtErfurth, and G. Langs. Identifying and categorizing anomalies in retinal imaging data. arXiv preprint arXiv:1612.00686, 2016.
 [54] P. Smolensky. Information processing in dynamical systems: Foundations of harmony theory. Technical report, Department of Computer Science, Colorado University, 1986.
 [55] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In IEEE International Conference on Computer Vision, pages 4489–4497. IEEE, 2015.
 [56] B. Uria, I. Murray, and H. Larochelle. Rnade: The real-valued neural autoregressive density-estimator. In Advances in Neural Information Processing Systems, pages 2175–2183, 2013.
 [57] B. Uria, I. Murray, and H. Larochelle. A deep and tractable density estimator. In International Conference on Machine Learning, pages 467–475, 2014.
 [58] A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, and A. Graves. Conditional image generation with pixelcnn decoders. In Neural Information Processing Systems, 2016.
 [59] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. International Conference on Machine Learning, 2016.
 [60] H. Vu, D. Phung, T. D. Nguyen, A. Trevors, and S. Venkatesh. Energybased models for video anomaly detection. arXiv preprint arXiv:1708.05211, 2017.
 [61] Z. Wang, Q. Gu, Y. Ning, and H. Liu. High dimensional em algorithm: Statistical optimization and asymptotic normality. In Advances in neural information processing systems, pages 2521–2529, 2015.
 [62] T. Xiao, C. Zhang, and H. Zha. Learning to detect anomalies in surveillance video. IEEE Signal Processing Letters, 22(9):1477–1481, 2015.
 [63] D. Xu, Y. Yan, E. Ricci, and N. Sebe. Detecting anomalous events in videos by learning deep representations of appearance and motion. Computer Vision and Image Understanding, 156:117–127, 2017.
 [64] S. Zhai, Y. Cheng, W. Lu, and Z. Zhang. Deep structured energy based models for anomaly detection. In International Conference on Machine Learning, pages 1100–1109, 2016.
 [65] B. Zhao, L. FeiFei, and E. P. Xing. Online detection of unusual events in videos via dynamic sparse coding. In IEEE International Conference on Computer Vision and Pattern Recognition, pages 3313–3320. IEEE, 2011.
 [66] H. Zhong, J. Shi, and M. Visontai. Detecting unusual activity in video. In IEEE International Conference on Computer Vision and Pattern Recognition, volume 2, pages II–II. IEEE, 2004.
 [67] R. Zhu, L. Wang, C. Zhai, and Q. Gu. Highdimensional variancereduced stochastic gradient expectationmaximization algorithm. In International Conference on Machine Learning, 2017.
 [68] B. Zong, Q. Song, M. R. Min, W. Cheng, C. Lumezanu, D. Cho, and H. Chen. Deep autoencoding gaussian mixture model for unsupervised anomaly detection. In International Conference on Learning Representations, 2018.
Appendix A Supplementary Material
a.1 Loss function ablation study
In this section we study the contribution of the two terms of the loss function in greater detail.
As discussed in Sec. 3, the reconstruction and the autoregression terms intervene jointly during training. While the former encourages representations that are descriptive of input samples, the latter ensures they are assigned a high probability by the autoregressive procedure. This implies that the former tries to separate representations, whereas the latter pulls them towards the nearest mode of the density function, thus having a collapsing effect. Given this conflicting behavior, the importance of their joint optimization becomes even clearer.
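A minimal sketch of this joint objective follows; the MSE reconstruction term and the weighting factor `lam` are assumptions of this sketch, and the exact formulation is the one given in Sec. 3:

```python
import numpy as np

def joint_loss(x, x_rec, code_nll, lam=1.0):
    """Joint objective (sketch): a reconstruction term that keeps codes
    descriptive of the input, plus the negative log-likelihood of the code
    under the autoregressive estimator, which pulls codes toward high-density
    regions. `lam` balances the two conflicting forces."""
    rec = float(np.mean((x - x_rec) ** 2))
    return rec + lam * code_nll
```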
To advocate for this intuition, we perform an ablation study on the model employed for anomaly detection on Shanghai Tech, training it by turning off each term, and report the results in Tab. 6.
When the reconstruction loss is absent (AR only), the encoder is tasked with producing representations assigned to high-density regions, regardless of the input sample. This is obviously a severe issue during the test phase, in which samples from the novel class are assigned to the same high-density representations.
Moreover, naively removing the autoregression term from the loss function would not make sense within the framework, since it is the term employed to score test samples. For this reason, the “REC only” entry of Tab. 6 refers to a training setting in which the autoregression loss only optimizes the parameters of the density estimation network (i.e., its gradient does not reach the encoder network). The low performance delivered by such a training strategy reinforces the importance of the autoregression objective in forcing autoregressive properties onto the representations.
Loss term   AR only   REC only   Joint
ROC-AUC     0.515     0.480      0.685
a.2 On the complexity of autoregressive layers
In this section, we illustrate the complexity and scalability of Masked Fully Connected (MFC) and Masked Stacked Convolution (MSC) layers (Fig. 4 of the main paper); we refer to the type ‘B’ variant of both layers, since it is an upper bound to type ‘A’. Adhering to the notation introduced in Sec. 3, for both layers the number of trainable parameters and the computational cost grow quadratically with the code length. Fig. 10 reports the quadratic trend of the number of trainable parameters as a function of the code length, along with a study assessing the impact of the density estimation network on inference times.
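The quadratic trend can be illustrated by counting active weights in a MADE-style masked linear layer; the m-feature block structure below is a hypothetical simplification for illustration, not the exact AND layer:

```python
def mfc_type_b_params(d, m):
    """Active parameters of a hypothetical masked fully connected (type 'B')
    layer mapping d code variables with m features each: output block i may
    attend to input blocks j <= i, giving d*(d+1)/2 active m-by-m weight
    blocks plus d*m biases -- hence quadratic growth in the code length d."""
    return m * m * d * (d + 1) // 2 + d * m

# doubling the code length roughly quadruples the weight count
counts = [mfc_type_b_params(d, m=4) for d in (16, 32, 64)]
```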