Image Anomalies: a Review and Synthesis of Detection Methods
Abstract
We review the broad variety of methods that have been proposed for anomaly detection in images. Most methods found in the literature have in mind a particular application. Yet we show that the methods can be classified mainly by the structural assumption they make on the “normal” image. Five different structural assumptions emerge. Our analysis leads us to reformulate the best representative algorithms by attaching to them an a contrario detection that controls the number of false positives and thus derive universal detection thresholds. By combining the most general structural assumptions expressing the background’s normality with the best proposed statistical detection tools, we end up proposing generic algorithms that seem to generalize or reconcile most methods. We compare the six best representatives of our proposed classes of algorithms on anomalous images taken from classic papers on the subject, and on a synthetic database. Our conclusion is that it is possible to perform automatic anomaly detection on a single image.
1 Introduction
The automatic detection of anomalous structures in arbitrary images addresses the problem of finding nonconforming patterns with respect to image normality. This is a challenging problem in computer vision, since there is no clear and straightforward definition of what is (ab)normal for a given arbitrary image. Automatic anomaly detection has high stakes in industry, remote sensing and medicine (Figure 1). It is crucial to be able to handle massive data automatically, to detect for example anomalous masses in mammograms [70, 36], chemical targets in multispectral and hyperspectral satellite images [4, 69, 66], sea mines in side-scan sonar images [51], or defects in industrial monitoring applications [81, 76]. This detection may use any imaging device, from cameras to scanning electron microscopes [14].
Our goal here is to review the huge variety of methods that have been proposed for this problem in the realm of image processing. This review is constructive: it will not only aim at classifying the methods, but also at deciding whether one or several general anomaly detection frameworks emerge from the analysis. Indeed, most methods found in the literature have a particular application in mind (though all claim some degree of generality) and so may overlap without their authors being aware of it.
We found that all anomaly detection methods make a general structural assumption on the “normal” background that actually characterizes the method. By combining the most general structural assumption with the best proposed statistical detection tools, we shall converge to a few generic algorithms that seem to generalize or reconcile most methods. The experiments will be both qualitative and quantitative.
To prove this, we shall compare improved representatives of the main algorithmic classes on classic diversified examples. This will illustrate how well anomaly detection algorithms work, regardless of their application field.
Before proceeding to our analysis, we shall discuss in the next sections of this introduction the very definition of anomaly detection. This will also establish the plan of this paper, given at the end of the introduction. More than 1000 papers on Google Scholar contain the three key words “anomaly detection” and “image”. It is virtually impossible to analyze them all, but this excessive number and the excellent existing reviews call for a synthesis. Our assumption here will be that, if the subject makes any sense, there must exist generic algorithms performing parameterless anomaly detection on any image. This review starts with an analysis of the two most celebrated recent reviews on the topic. It will also confirm that no synthesis has so far been proposed.
1.1 A quick review of reviews
The 2009 review paper by Chandola et al. [16] on anomaly detection is arguably the most complete review so far. It considered allegedly all existing techniques and all application fields, reviewing 361 papers. The review establishes a distinction between point anomalies, contextual anomalies and collective anomalies, depending on whether the background is steady or evolving and on whether the anomaly has a larger scale than the initial samples. It also distinguishes between supervised, mildly supervised and unsupervised anomaly detection. It reviews the main domains where anomalies are sought (images, text, material, machines, networks, health, trading, banking operations, etc.) and lists the preferred techniques in each domain. It finally proposes the following classification of all involved techniques.

Classification techniques, e.g., SVM or neural networks. Their main assumption is that it is possible to train a classifier to distinguish between normal and anomalous data in the given feature space. Classification is either multi-class (normal versus abnormal) or one-class (training only to detect normality, that is, learning a discriminative boundary around normal data). Among the one-class detection methods we have the replicator neural networks (autoencoders).
These are multilayer feed-forward neural networks with the same number of input and output neurons. The training involves compressing normal data into hidden layers. Given a new sample, the reconstruction error is directly used as an anomaly score for the test instance.

Nearest-neighbor-based anomaly detection. The basic assumption is that normal data instances occur in dense neighborhoods, while anomalies occur far from their closest neighbors. This can be measured by the distance to the nearest neighbor or by a relative density.

Clustering-based anomaly detection. Normal data instances are assumed to belong to a cluster in the data, while anomalies are defined as those lying far from the centroid of their closest cluster.

Statistical anomaly detection. Anomalies are defined as observations unlikely to be generated by the “background” stochastic model. Thus, anomalies occur in the low probability regions of the background model. Here the background models can be: parametric (Gaussian, Gaussian mixture, regression), or nonparametric and built, e.g., by a kernel method.

Spectral anomaly detection. The main tool here is actually principal component analysis (PCA) and its generalizations. Its principle is that an anomaly has deviant coordinates with respect to normal PCA coordinates.

Information-theoretic anomaly detection. These techniques analyze the information content of a data set using information-theoretic measures such as the Kolmogorov complexity, the entropy, or the relative entropy.
The above review by Chandola et al. [16] is usefully complemented by the more recent review by Pimentel et al. [62]. The latter presents a complete survey of novelty detection methods and introduces a classification into five groups:

Probabilistic novelty detection (these methods are based on a stochastic background model)

Distance-based methods (we shall later name them self-similarity-based methods)

Reconstruction-based methods (generally leading to a background subtraction)

Domain-based methods: determine the location of the novelty boundary using only the data that lie closest to it, without making any assumption about the data distribution (we shall call them center-surround methods)

Information-theoretic methods: require a measure that is sensitive enough to detect the effects of novel points in the dataset.
Discussion.
Both highly cited reviews did an excellent job of considering countless papers and proposing a categorization of methods. Nevertheless, their final map of the methods is a Prévert-style inventory where methods are classified according to what they do, rather than to what they assume. Indeed, they did not discuss the structural assumptions underlying the various methods, and they did not conclude on a unified statistical decision framework. Thus, while using most of their proposed categories of methods, we shall attempt to reorganize their panorama according to two main questions:

What is the structural assumption made on the background: in other terms what is “normal”?

How is the anomaly detection threshold defined and computed, and what guarantees does it deliver?
Our ideal goal would be to find out the weakest (and therefore most general) structural assumption on normal data, and to apply to it the most rigorous statistical test. Before proceeding to a classification of anomaly detection methods, we shall examine several related questions which share some of their tools with anomaly detection. They will be discarded but the tools are nevertheless relevant.
1.2 What anomaly detection isn’t
1.2.1 Not a classification problem
Most papers and reviews on anomaly detection agree that multi-class classification techniques like SVM can be discarded, because anomalies are generally not observed in sufficient number and lack statistical coherence. There are exceptions, like the recent method introduced by Ding et al. [24]. After defining anomalies as exceptional events that cannot be learned, this paper ends up treating anomaly detection as a classification problem in which enough anomalous samples are available to learn classification parameters from the data themselves. Given several datasets of moderate size (a few hundred to a few thousand samples) with dimensions from 8 to 50, this paper applies classic density estimators to sizable extracts of the normal set (k-means, SVM, Gaussian mixture), then learns the optimal threshold for each classifier, and finally compares the performance of these classifiers.
Another neural-network-based method was proposed by Kumar [44]. This paper on the detection of local fabric defects first performs a PCA dimension reduction on windows, followed by the training of a neural network on a base of defect / non-defect examples, thus again performing two-class classification.
Conclusions.
These papers rather exemplify what anomaly detection generally is not. As we shall see, if a faithful stochastic background model is built, modeling the anomalies is not absolutely necessary. They can be unambiguously modeled a contrario, namely as unlikely to belong to the background model.
1.2.2 Not a saliency measure
A broad related literature exists on saliency measures. Such saliency measures may be learned from average fixation maps by humans. For example, Tavakoli et al. [71] designed an anomaly detector trained on average human fixation maps, learning both the anomalies and their surround vectors as Gaussian vectors. This reduces the problem to a two-class Bayesian classification problem.
Yet in general the goal of saliency detectors is only to deliver a fuzzy saliency map, in contrast to anomaly detectors, which are requested to signal the anomalous regions. Saliency detectors try to mimic human visual perception and in general introduce semantic prior knowledge related to the perceptual system (e.g., face detectors). This approach works particularly well with neural networks, because attention maps obtained by gaze trackers can be used as a ground truth for the training step. SALICON by Huang et al. [40] is one such deep neural network architecture, achieving state-of-the-art performance.
Methods providing a saliency map often assign an anomaly score to each tested feature based on the inverse of the height of the bin to which it belongs. For example, in [64] a saliency map is obtained by combining 32 multiscale oriented features obtained by convolving the image with oriented Gabor functions. A weighted combination of the most contrasted channels yields a unique multiscale channel for each orientation. The histograms of these channels are then computed, and each pixel is given a weight which is roughly inversely proportional to the frequency of its value in the histogram. The same rarity measurement is applied to the colors after PCA. Summing all of these saliency maps, one obtains something similar to what is observed with gaze trackers: the salient regions are the most visited.
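The rarity weighting just described can be sketched with a histogram on a single channel: each pixel receives a score inversely proportional to the normalized height of its histogram bin. This is a minimal illustration, not the multiscale pipeline of [64]; the number of bins is an arbitrary choice.

```python
import numpy as np

def rarity_map(channel, n_bins=64):
    """Assign to each pixel a score inversely proportional to the
    frequency of its histogram bin: rare values score high."""
    hist, edges = np.histogram(channel, bins=n_bins)
    freq = hist / channel.size
    # Map each pixel value to its bin index.
    bins = np.clip(np.digitize(channel, edges[1:-1]), 0, n_bins - 1)
    return 1.0 / (freq[bins] + 1e-12)
```

Such a map enhances rare pixels but, as discussed below, it provides no principled detection threshold by itself.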
While such saliency methods yield impressive visual results, they do not easily provide the thresholds that would permit singling out anomalies.
Similarly, image patches are represented by Borji and Itti [9] using their coefficients on a patch dictionary learned on natural images. Local and global image patch rarities are considered as two “complementary processes”. Each patch is first represented by a vector of coefficients that linearly reconstruct it from a learned dictionary of patches from natural scenes (“normal” data). Two saliency measures (one local and one global) are calculated and fused to indicate the saliency of each patch. The local saliency is computed as the distinctiveness of a patch from its surrounding patches, while the global saliency is the inverse of a patch’s probability of occurring over the entire image. The final saliency map is built by normalizing and fusing the local and global saliency maps of all channels from both color systems (patch rarity is measured both in RGB and Lab color spaces).
One can consider the work by Murray et al. [55], “Saliency Estimation Using a Non-Parametric Low-Level Vision Model”, as a good representative of the multiscale center-surround saliency methods. Its idea is to:

apply a multiscale multiorientation wavelet pyramid to the image;

measure the local wavelet energy for each wavelet channel at each scale and orientation;

compute a centersurround ratio for this energy;

obtain in that way wavelet contrast coefficients that have the same spatial multiscale sampling as the wavelet pyramid itself;

apply the reverse wavelet pyramid to the contrast coefficients to obtain a saliency map.
This is a typical saliency-only model, for which it is not clear how to build an adequate detection threshold.
Conclusions.
Saliency detection methods learned from human gaze tracking are semantic methods that fall outside the scope of our inquiry. In contrast, the last three generic saliency methods listed are tantalizing. Indeed, they seem to do a very good job of enhancing anomalies by measuring rarity. Yet they come with no clear mechanism to transform the saliency map into a probabilistic one that might allow hypothesis testing and eventually statistically motivated detection thresholds.
1.2.3 A sketch of our proposed classification
The anomaly detection problem has generally been handled as a “one-class” classification problem. The very complete 2003 review by Markou and Singh [50] concluded that most research on anomaly detection was driven by modeling background data distributions, in order to estimate the probability that test data do not belong to such distributions. Hence the mainstream methods can be classified by their approach to background modeling. Every detection method has to do two things:

(a) to model the anomaly-free “background”. This background model may be constructed for samples of various sizes extracted from the given image (or an image database): pixels (e.g. in hyperspectral images), patches, or local features (e.g. wavelet coefficients).

(b) to define a measure on the observed data evaluating how far its samples are from their background model. Generally, this measure is a probability of false alarm (or, even better, as we shall see, an expectation of the number of false alarms) associated with each sample.
The choice of the background model is clearly the most important decision to take. Hence we shall primarily classify the methods by step (a), namely by their background model (and the ways it is learned). Secondarily, we shall classify them by the way a final decision is taken on step (b).
We review in the rest of this section the main classes of background models that arise in the literature. These classes can be characterized by their structural assumption on the background. We found five generic structural assumptions. These assumptions drive the choice of the estimation-detection tool. Hence a classification based on the background structures will be more transparent. The five structural assumptions are:

the background can be modeled by a probability density function (pdf), which is either parametric, such as a Gaussian or a Gaussian mixture, or is obtained by interpolation from samples by a kernel density estimation method; this structure leads to detecting anomalies by hypothesis testing on the pdf;

the background is globally homogeneous (leading to a global Fourier or neural network model and to background subtraction);

the background is locally spatially homogeneous (leading to center-surround methods);

the background is sparse on a given dictionary or basis (leading to variational decomposition models). A different implementation of the same principle is to train a nonparametric model (e.g., a neural network) to reconstruct normal data, and then check how good the reconstruction is;

the background is self-similar (in the nonlocal sense).
Plan of the paper.
For each of the above listed background models, we shall examine in Section 2 examples and variants, see how they are estimated, and finally how they are used for obtaining a detection threshold. An important observation is that the choice of a detection threshold is problematic for all background structures except the first one, which is the only one providing a probability distribution. Hence in Section 3 we shall examine ways to incorporate a rigorous probabilistic detection threshold in the most relevant methods spotted in Section 2. Section 4 is dedicated to two comparison protocols for the six most representative and analyzed methods. We finally conclude in Section 5.
2 Detailed analysis of the main anomaly detection families
The main anomaly detection families can be analyzed from their structural assumptions on the background model. In what follows we present and discuss the five different families that we announced.
2.1 Stochastic background models
Their principle is that anomalies occur in the low probability regions of the background model. The stochastic model can be parametric (Gaussian, Gaussian mixture, regression), or nonparametric. For example in “spectral anomaly detection” as presented by Chandola et al. [16], an anomaly is defined by having deviant coordinates with respect to normal PCA coordinates.
Du and Zhang [25] propose to build a Gaussian background model from random image patches in a hyperspectral image. Once this background model is obtained, the anomalous patches are detected using a threshold on their Mahalanobis distance to the background Gaussian model. The selection of the image blocks used to estimate the Gaussian patch model is performed by a RANSAC procedure [28], randomly picking patches in the image and progressively excluding the anomalous ones.
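A minimal sketch of such a Gaussian patch model with a Mahalanobis detection rule, assuming flattened patches stacked as rows of a matrix; the RANSAC refinement of [25] and the choice of threshold are left out (the threshold below is an assumed parameter):

```python
import numpy as np

def mahalanobis_scores(patches):
    """Fit a Gaussian model to flattened patches (one per row) and return
    the squared Mahalanobis distance of each patch to that model."""
    mu = patches.mean(axis=0)
    centered = patches - mu
    cov = np.cov(patches, rowvar=False)
    cov += 1e-6 * np.eye(cov.shape[0])  # regularize for invertibility
    inv_cov = np.linalg.inv(cov)
    # Per-row quadratic form x^T Sigma^{-1} x.
    return np.einsum('ij,jk,ik->i', centered, inv_cov, centered)

def detect_anomalous_patches(patches, threshold):
    """Flag patches whose squared Mahalanobis distance exceeds threshold."""
    return mahalanobis_scores(patches) > threshold
```

In the RANSAC variant, the model would then be re-fitted after excluding the flagged patches, and the procedure iterated.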
Goldman and Cohen [33], aiming at sea-mine detection, propose a detection scheme that does not rely on a statistical model of the targets. It performs a background estimation in a local feature space of principal components (this again amounts to building a Gaussian model). Then hypothesis testing is used for the detection of anomalous pixels, namely those with an exceedingly high Mahalanobis distance to the Gaussian. This detects potentially anomalous pixels, which are thereafter grouped and filtered by morphological operators. This subsequent filtering suggests that the first stage may yield many false alarms.
Tarassenko et al. [70] identify abnormal masses in mammograms by assuming that abnormalities are uniformly distributed outside the boundaries of normality (defined using an estimation of the probability density function from training data). If a test vector falls in a low probability region (using a predetermined threshold), then the test vector is considered to be novel. The process to build the background model is complex and involves selecting five local features, equalizing their means and variances to give them the same importance, clustering the data set into four classes, and estimating for each cluster its pdf by a nonparametric method (i.e., Parzen window interpolation). Finally, a feature vector is considered anomalous if it has low probability for each estimated pdf. Such a nonparametric pdf estimate of course carries an over- or under-fitting risk, since training data are limited.
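The final decision rule can be illustrated with a one-dimensional Parzen-window (kernel density) estimate; the bandwidth and probability threshold below are hypothetical parameters, and the feature-selection and clustering stages of [70] are left out.

```python
import numpy as np

def parzen_pdf(train, x, bandwidth):
    """Parzen-window (Gaussian kernel) density estimate at points x,
    built from 1-D training samples 'train'."""
    train = np.asarray(train)[None, :]        # shape (1, n)
    x = np.asarray(x, dtype=float)[:, None]   # shape (m, 1)
    k = np.exp(-0.5 * ((x - train) / bandwidth) ** 2)
    return k.mean(axis=1) / (bandwidth * np.sqrt(2 * np.pi))

def is_novel(train, x, bandwidth=0.3, threshold=1e-3):
    """A test value is declared novel if its estimated density falls
    below the (predetermined) probability threshold."""
    return parzen_pdf(train, x, bandwidth) < threshold
```

The over-/under-fitting risk mentioned above corresponds here to the bandwidth choice: too small and every test point looks novel, too large and true novelties are smoothed over.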
The idea introduced by Xie and Mirmehdi [80] is to learn a texture model based on Julesz’ texton theory [43]. The textons are interpreted as image patches following a Gaussian model. Thus a random image patch is assumed to follow a Gaussian mixture model (GMM), which is therefore estimated from exemplar images by the expectation-maximization (EM) algorithm. The method works at several scales in a Gaussian pyramid with fixed-size patches. The right threshold values for detecting anomalies are learned on a few images without defects. At each scale, the minimum probability in the GMM over all patches is computed. These probabilities serve as detection thresholds. A patch is then considered anomalous if its probability is lower than the minimum learned on the faultless textures at two consecutive scales. A saliency map is obtained by summing up these consecutive probability excesses. Clearly, this model can be transformed from a saliency map into an anomaly detector by using hypothesis testing on the background Gaussian mixture model.
Grosjean and Moisan [36] propose a method that models the background image as a Gaussian stationary process, namely the result of convolving a white Gaussian noise with an arbitrary kernel, in other terms a colored noise. This background model is rather restrictive, but it is precise and simple to estimate. The Gaussian model is first estimated. Then the image is convolved either with low-pass filters (to detect global peaks in the texture) or with center-surround filters (to detect locally contrasted peaks in the texture). The Gaussian probability density function of each of these convolved images is easily computed. Finally, a probabilistic detection threshold for the filtered images is determined using the a contrario framework introduced by Desolneux et al. [21], [22]. This framework will be introduced in Section 3.1. It makes it possible to compute detection thresholds guaranteeing a targeted number of false alarms, under the null hypothesis that there are no anomalies.
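The a contrario threshold can be sketched in the simplest setting: the filtered image is, under the null hypothesis, centered Gaussian noise of known standard deviation, and a pixel is detected when the expected number of false alarms over all tested pixels falls below a bound. This is a toy, single-scale, single-filter version in the spirit of [36]; the unit-variance null model and the NFA bound are assumptions of this sketch.

```python
import math
import numpy as np

def gaussian_tail(z):
    """P(N(0,1) > z), via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def a_contrario_detections(filtered, sigma, nfa_max=1.0):
    """Detect pixels of a filtered image whose values are so large that
    the expected number of false alarms (number of tests times the tail
    probability) stays below nfa_max, under a centered Gaussian null
    model of standard deviation sigma."""
    n_tests = filtered.size
    z = (filtered / sigma).ravel()
    nfa = np.array([n_tests * gaussian_tail(v) for v in z])
    return (nfa < nfa_max).reshape(filtered.shape)
```

By construction, on pure noise the expected number of detections is below nfa_max, which is exactly the false-alarm control mentioned above.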
Conclusions.
To summarize, in the above methods relying on probabilistic background models, outliers are detected as incoherent with respect to a probability distribution estimated from the input image(s). The anomaly detection threshold is a statistical likelihood test on the learned background model. In all cases, it gives, or can give, a p-value for each detection. Nevertheless, this does not yield a control on the number of false alarms, with the notable exception of the method of Grosjean and Moisan [36].
2.2 Homogeneous background models
These methods estimate and subtract the background from the image to get a residual image on which detection is eventually performed. We shall examine three ways to do so: by Fourier modeling, by autoencoder or by subtraction of a smooth background.
Fourier background model.
Perhaps the most successful background-based methods address the detection of anomalies in the periodic patterns of textile [77, 78, 61]. This can be done naturally by cutting specific frequencies in the Fourier domain and thresholding the residual to find the defects. For example, Tsai and Hsieh [77] remove the background by a frequency cutoff. Then a detection threshold using a combination of the mean and the variance of the residual yields a detection map.
Similarly, Tsai and Huang [78] propose an automatic inspection of defects in randomly textured surfaces, which arise in sandpaper, castings, leather, and other industrial materials. The proposed method does not rely on local texture features, but on a background subtraction scheme in the Fourier domain. It assumes that the spread of frequency components in the power spectrum space is isotropic, with a shape that is close to a circle. By finding an adequate radius in the spectrum space, and setting to zero the frequency components outside the selected circle, the periodic, repetitive patterns of statistical textures are removed. In the restored image, the homogeneous regions of the original image become approximately flat, but the defective region is preserved. According to the authors, this converts defect detection in textures into a simple thresholding problem on non-textured images.
Perng et al. [61] focus on anomaly detection during the production of bolts and nuts. The method starts by creating normalized unwrapped images of the pattern on which the detection is performed. The first step consists in removing the “background” by setting to zero some Fourier coefficients. Indeed, the background pattern, being extremely periodic, is almost entirely removed by canceling the large Fourier coefficients. The mean and the variance of the residual are then computed. This residual is then thresholded using a statistical process control (SPC) binarization method,
(1)  r̃(x, y) = 255 if μ − γσ ≤ r(x, y) ≤ μ + γσ, and r̃(x, y) = 0 otherwise,

where r is the residual, μ and σ are its mean and standard deviation, and γ is a control parameter. Regions set to zero are then detected.
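The generic pipeline of these Fourier methods, canceling the dominant (periodic) Fourier coefficients and then applying an SPC binarization of the residual, can be sketched as follows. The relative magnitude threshold is an assumed parameter of this sketch, not a value from [61].

```python
import numpy as np

def remove_periodic_background(image, rel_threshold=0.1):
    """Cancel the largest Fourier coefficients (which carry the periodic
    background pattern) and return the residual image."""
    f = np.fft.fft2(image)
    mag = np.abs(f)
    f[mag >= rel_threshold * mag.max()] = 0.0  # zero the dominant peaks
    return np.real(np.fft.ifft2(f))

def spc_binarize(residual, gamma=3.0):
    """SPC binarization: pixels within the control limits mean +/-
    gamma * std are set to 255; the remaining pixels are set to 0
    and constitute the detections."""
    mu, sigma = residual.mean(), residual.std()
    inside = np.abs(residual - mu) <= gamma * sigma
    return np.where(inside, 255, 0)
```

On a strongly periodic pattern, the residual is essentially flat except at the defects, which is what makes the simple gamma-sigma rule workable.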
Aiger and Talbot [2] propose to learn a Gaussian background model of the image Fourier phase directly from the input image. The method assumes that only a few sparse defects are present in the provided image. First a “phase-only transform” (PHOT) is applied to the image. This amounts to normalizing each Fourier coefficient (dividing it by its magnitude) before inverting the Fourier transform. If the phase of the image were uniformly random, then the result of the PHOT would be a random phase noise (RPN). RPNs are well-known models for a wide class of “microtextures”, as explained in Galerne et al. [29]. An RPN is a random image whose Fourier coefficients have deterministic moduli and random, uniform, independent phases. An RPN with unitary Fourier coefficients is actually very close to a Gaussian white noise. Indeed, a Gaussian white noise has uniform random phase, and the expectation of the moduli of its Fourier coefficients is constant. A very local anomaly is expected to stand out as an excessive value in the PHOT result. Anomalous pixels are therefore detected as peaks of the Mahalanobis distance of their values to the background Gaussian model. Hence, a probability of false alarm can be directly computed in this ideal case. The detection can also be applied after convolving the PHOT-transformed image with a Gaussian, to detect blobs instead of single pixels.
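The PHOT itself is a few lines; below is a sketch with a simple one-dimensional Mahalanobis-style test on the result. The deviation factor k is an assumed parameter, and the optional Gaussian smoothing for blob detection is omitted.

```python
import numpy as np

def phot(image):
    """Phase-only transform: normalize every Fourier coefficient to unit
    modulus (keeping only the phase), then invert the transform."""
    f = np.fft.fft2(image)
    eps = 1e-12  # avoid division by zero on null coefficients
    return np.real(np.fft.ifft2(f / (np.abs(f) + eps)))

def phot_anomalies(image, k=5.0):
    """Flag pixels whose PHOT value deviates from the mean by more than
    k standard deviations (a 1-D Mahalanobis-style test against the
    near-Gaussian background of the PHOT result)."""
    r = phot(image)
    return np.abs(r - r.mean()) > k * r.std()
```

The normalization flattens the texture's dominant coefficients, so a regular pattern collapses to near-noise while a localized defect survives as a sharp peak.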
Neuralnetworkbased background model.
The general idea is to learn the background model by using a neural network trained on normal data. Under the assumption that the background is homogeneous, the “replicator” neural networks proposed by Hawkins et al. [37] can be used to learn this model. These are multilayer feed-forward neural networks with the same number of input and output neurons. The training involves compressing data into hidden layers. The testing phase reconstructs each data sample, and its reconstruction error is used as an anomaly score.
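The compress-and-reconstruct principle can be illustrated without a deep-learning framework by a linear stand-in: a PCA basis plays the role of the bottleneck, and the squared reconstruction error is the anomaly score. This is an assumption-laden simplification of the replicator network idea, not the architecture of [37].

```python
import numpy as np

def fit_linear_autoencoder(train, n_components):
    """Learn a linear 'autoencoder' (a PCA basis) from normal samples,
    one flattened sample per row."""
    mu = train.mean(axis=0)
    _, _, vt = np.linalg.svd(train - mu, full_matrices=False)
    return mu, vt[:n_components]          # mean and encoding basis

def reconstruction_error(model, samples):
    """Anomaly score: squared error between each sample and its
    reconstruction from the compressed code."""
    mu, basis = model
    centered = samples - mu
    codes = centered @ basis.T            # encode (compress)
    recon = codes @ basis                 # decode (reconstruct)
    return ((centered - recon) ** 2).sum(axis=1)
```

Normal samples, lying close to the learned subspace, reconstruct well; anomalies do not, and their score is large.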
An and Cho [3] proposed to train a variational autoencoder (VAE), and to compute from it an average reconstruction probability, which is a different measure than just looking at the difference between the input and output. Given a new data point, a number of samples are drawn from the trained probabilistic encoder. For each code sample, the probabilistic decoder outputs the corresponding mean and variance parameters. Then, the probability of the original data being generated from a Gaussian distribution having these parameters is calculated. The average probability, named reconstruction probability, among all drawn samples is used as an anomaly score.
The work of Schlegl et al. [65] goes in the same direction, using an autoencoder and looking at the norm of the difference between the original and the output. A Generative Adversarial Network (GAN) [34] is trained (generator + discriminator) using anomaly-free data. Then, given a new test image, a representation in latent space is computed (by backpropagation), and the GAN reconstruction is compared to the input. The discriminator cost is then used alongside the representation of the input by the network to find the anomalies. There is, however, no guarantee that the latent representation found would perform well for anomaly-free examples. Hence, it is not clear why the discriminator cost would detect anomalies.
Smooth background model.
Perhaps the most important application of anomaly detection in industry is defect detection on smooth materials. A very recent and exemplary method is proposed by Tout et al. [75]. In this paper, the authors develop a method for the fully automatic detection of anomalies on wheel surfaces. First, the wheel images are registered to a fixed position. For each wheel patch in a given position, a linear deterministic background model is designed. Its basis is made of a few low-degree polynomials combined with a small number of basis functions learned as the first basis vectors of a PCA applied to exemplar data. The acquisition noise is accurately modeled by a two-parameter Poisson noise, whose parameters are easily estimated from the data. The background estimation is a mere projection of each observed patch on the background subspace. The residual, computed as the difference between the input and the projection, can contain only noise and anomalies. Thus, classic hypothesis testing on the norm of the residual of each patch yields an automatic detection threshold. This method is clearly adapted to defect detection on smooth surfaces.
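The projection-and-test step can be sketched as follows, keeping only the polynomial part of the background basis (the learned PCA vectors are omitted) and replacing the Poisson noise model with white Gaussian noise of known standard deviation; the chi-square law of the residual norm is approximated by its Gaussian tail, with an assumed deviation factor k.

```python
import numpy as np

def polynomial_basis(patch_size, max_degree=2):
    """Orthonormal basis of low-degree 2-D polynomials over a patch
    (the deterministic part of the background subspace)."""
    coords = np.linspace(-1.0, 1.0, patch_size)
    x, y = np.meshgrid(coords, coords)
    monomials = [x**i * y**j
                 for i in range(max_degree + 1)
                 for j in range(max_degree + 1 - i)]
    m = np.stack([p.ravel() for p in monomials], axis=1)
    q, _ = np.linalg.qr(m)                # orthonormalize the columns
    return q                              # shape (patch_size**2, n_basis)

def residual_test(patch, basis, sigma, k=4.0):
    """Project the flattened patch on the background subspace and test
    whether the residual energy exceeds what noise alone explains,
    via a Gaussian approximation of the chi-square law with d degrees
    of freedom of the residual squared norm."""
    v = patch.ravel()
    residual = v - basis @ (basis.T @ v)
    d = v.size - basis.shape[1]
    threshold = sigma**2 * (d + k * np.sqrt(2.0 * d))
    return (residual @ residual) > threshold
```

A smooth patch leaves only noise in the residual and passes the test; a defect injects extra energy orthogonal to the smooth basis and is flagged.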
Conclusions.
With the exception of the last examined method (Tout et al. [75]), background subtraction methods kick the can down the road: their good point is that they avoid modeling the background by subtracting it. But then a stochastic model is nevertheless required on the residual! The last considered method does a good job by directly interpreting the residual as noise. This points to the important fact that a stochastic model of the difference between background and noise might be far simpler to estimate. It boils down to a noise model.
2.3 Local homogeneity models: center-surround detection
These methods are often used for creating saliency maps. Their rationale is that anomalies (or saliency) occur as local events contrasting with their surroundings.
In one of the early papers on this topic, Itti et al. [42] propose to compute a set of center-surround linear filters based on color, orientation and intensity. The filters are chosen to have only positive output values. The resulting maps are normalized by stretching their response so that the maximum is at a prespecified value. These positive feature maps are then summed up to produce a final saliency map. Detection is then done with a simple winner-takes-all scheme on the maximum of the response maps. This method is applied in Itti and Koch [41] to detect vehicles via their saliency in huge natural or urban images.
The method was expanded by Gao et al. [31]. The features in this paper are basically the same as those proposed by Itti and Koch [41], that is, simple color features, intensity features, and a few orientation filters (Gabor functions, wavelets). This last paper performs detection in images and videos with a center-surround saliency detector. It directly compares its results to those of Itti and Koch [41] and takes similar features, but works differently with them. In particular, it computes center-surround discrimination scores for the features, and puts in doubt the linearity of center-surround filters and the need for computing a (necessarily nonlinear) probability of false alarm in the background model. In fact, they claim [31]:
“In particular, it is hypothesized that, in the absence of highlevel goals, the most salient locations of the visual field are those that enable the discrimination between center and surround with smallest expected probability of error.”
The difficulty of center-surround anomaly detection is faced by Honda and Nayar [39], who introduced a generic method which tentatively works on all types of images. The main idea is to estimate a probability density for subregions in an image, conditioned upon the areas surrounding these subregions. The estimation method employs independent component analysis and the Karhunen-Loève transform (KLT) to reduce dimensionality and find a compact representation of the region space and its surroundings, with elements as independent as possible. An anomaly is again defined as a subregion with low conditional probability with respect to its surrounding. This is both a coarse-grained and complex method.
The method by Mahadevan et al. [48], developed essentially for videos and inspired from Itti et al. [42], similarly models both center and background by Gaussian mixtures of space-time patches (hence called MDT, mixtures of dynamic textures). This leads to a discrimination between “background” and relevant information. The method requires learning two distributions of the patch features in each region, one for the “center” and one for the “surround”. A patch is considered anomalous if the Kullback-Leibler divergence between both distributions is large.
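The center/surround discrimination idea can be sketched with single Gaussians in place of the mixtures of dynamic textures used in [48] (a deliberate simplification of ours); the closed-form Gaussian KL divergence then plays the role of the anomaly score. All names and sampling choices below are illustrative, not from the paper.

```python
import numpy as np

def gaussian_kl(mu0, var0, mu1, var1):
    """KL divergence KL(N(mu0,var0) || N(mu1,var1)) for 1D Gaussians."""
    return 0.5 * (np.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

def center_surround_score(center, surround):
    """Fit a 1D Gaussian to the center and surround samples;
    a large divergence between the two fits flags a salient center."""
    return gaussian_kl(center.mean(), center.var() + 1e-12,
                       surround.mean(), surround.var() + 1e-12)

rng = np.random.default_rng(0)
surround = rng.normal(0.0, 1.0, 500)
normal_center = rng.normal(0.0, 1.0, 100)     # same law as the surround
anomalous_center = rng.normal(3.0, 1.0, 100)  # shifted mean: an anomaly

s_normal = center_surround_score(normal_center, surround)
s_anom = center_surround_score(anomalous_center, surround)
```

A center following the same law as its surround yields a near-zero divergence, while the shifted center stands out.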
Conclusions.
We mentioned in Section 1.2.2 that saliency detectors are tantalizing: they propose simple and efficient rarity measurements, but no detection mechanism (threshold value) is provided. The center-surround methods reviewed above attempt to remedy that. But the method then becomes quite heavy, as it requires estimating a local stochastic model for both the center and the surround. Hence we are forced back to two-class classification, with fewer samples and a far more complex methodology.
2.4 Sparsity-based background models and their variational implementations
One recent nonparametric trend is to learn a sparse dictionary representing the background (i.e., normality) and to characterize outliers by their nonsparsity.
Margolin et al. [49] propose a method for building saliency maps by a conjunction of pattern distinctness and color distinctness. They claim that for pattern distinctness, patch sparsity is enough to characterize visual saliency. They proceed by:

Computing the PCA of all patches (of fixed size) in the image;

Computing the pattern saliency of a patch $p$ as $P(p) = \|p\|_1$, where the $\ell^1$ norm is computed on the PCA coordinates of the patch.

The pattern saliency measure $P$ is combined (by multiplication) with a color distinctness measure $C$, which measures the distance of each color superpixel to its closest color cluster. The final map therefore is $S = P \cdot C$.

The final result is a product of this saliency map with (roughly) a Gaussian centered in the center of mass of the previous saliency map.
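The two pattern-distinctness steps can be sketched numerically as follows. Color distinctness and the final center-of-mass Gaussian are omitted; the patch size and the use of non-overlapping patches are simplifying choices of ours, not prescriptions of [49].

```python
import numpy as np

def pattern_distinctness(image, patch=8):
    """Margolin-style pattern score: L1 norm of each patch in PCA coordinates.

    Patches close to the dominant image patterns have small PCA coordinates,
    hence a small L1 norm; distinct (salient) patches get a large one."""
    h, w = image.shape
    rows = []
    # collect all (non-overlapping, for brevity) patches as row vectors
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            rows.append(image[i:i + patch, j:j + patch].ravel())
    X = np.array(rows, dtype=float)
    X -= X.mean(axis=0)                      # center before PCA
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    coords = X @ Vt.T                        # PCA coordinates of each patch
    return np.abs(coords).sum(axis=1)        # L1 norm per patch

rng = np.random.default_rng(1)
img = rng.normal(0, 0.1, (64, 64))
img[24:32, 24:32] += 5.0                     # one distinct square patch
scores = pattern_distinctness(img)           # patch index 27 is the odd one
```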
We now pass to sparsity models that learn the background model as a dictionary on which “normal” patches would have to be represented by a sparse linear combination of the elements of the dictionary (and anomalous patches tentatively would not).
For Boracchi et al. [8], the background model is a patch dictionary learned from a database of anomaly-free data. The abnormality of a patch is measured as the Mahalanobis distance to a 2D Gaussian learned on the parameter pairs composed of the $\ell^1$ norm of the coefficients and of their reconstruction error. In what follows we detail this method.
Although the method looks general, the initial question addressed by Boracchi et al. [8] is how to detect anomalies in complex homogeneous textures like microfibers. A model is built as a dictionary learned from all patches by minimizing
(2) $\min_{D, A} \; \|X - DA\|_F^2 + \lambda \|A\|_1,$

where $X$ is the matrix whose columns are the reference patches, the dictionary $D$ is represented as a matrix whose columns are the elements of the dictionary, $A$ is a matrix whose $i$-th column represents the coefficients of patch $i$ on $D$, and the data-fitting error is measured by the Frobenius norm of the first term. The $\ell^1$ norm on $A$ must be understood as the sum of the absolute values of all of its coefficients. Once a minimizer $D$ is obtained, the same functional can be used to find a sparse representation of any given patch $s$ by minimizing

(3) $\min_{\alpha} \; \|s - D\alpha\|_2^2 + \lambda \|\alpha\|_1.$
The question then arises: how to decide from this minimization that a patch is anomalous? The authors propose to associate to each patch $s$ the pair of values $\big(\|s - D\hat{\alpha}\|_2,\ \|\hat{\alpha}\|_1\big)$, where $\hat{\alpha}$ is the minimizer of (3). The first component is a data-fidelity term measuring how well the patch is represented in $D$. The second component measures the sparsity (and therefore the adequacy) of this representation. An empirical 2D Gaussian model is then estimated for these pairs computed over all patches of the reference anomaly-free dataset. Under this Gaussian assumption, the normality region can be defined for the patch model as
(4) $\mathcal{N}_\gamma = \big\{ v \in \mathbb{R}^2 \;:\; (v - \mu)^T \Sigma^{-1} (v - \mu) < \gamma \big\},$

where $\mu$ and $\Sigma$ denote the mean and covariance of the estimated Gaussian. The question now is what threshold $\gamma$ should be used to separate normal from abnormal patches.
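A toy sketch of this pipeline, under assumptions of ours: the dictionary is drawn at random instead of being learned, the sparse codes are computed with a basic ISTA solver (the paper does not mandate a particular solver), and the sizes are arbitrary.

```python
import numpy as np

def ista(D, s, lam, n_iter=200):
    """Minimize ||s - D a||^2 + lam * ||a||_1 by iterative soft-thresholding."""
    L = 2.0 * np.linalg.norm(D, 2) ** 2      # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        g = a + 2.0 * D.T @ (s - D @ a) / L
        a = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)
    return a

def feature_pair(D, s, lam=0.1):
    """Boracchi-style pair: (reconstruction error, L1 sparsity) of a patch."""
    a = ista(D, s, lam)
    return np.array([np.linalg.norm(s - D @ a), np.abs(a).sum()])

def mahalanobis2(v, mu, cov):
    return float((v - mu) @ np.linalg.inv(cov) @ (v - mu))

rng = np.random.default_rng(2)
D = rng.normal(size=(16, 32))
D /= np.linalg.norm(D, axis=0)               # unit-norm atoms
# "normal" patches: sparse combinations of 2 atoms plus small noise
normal = [D[:, rng.choice(32, 2)] @ rng.normal(size=2)
          + 0.01 * rng.normal(size=16) for _ in range(200)]
pairs = np.array([feature_pair(D, s) for s in normal])
mu, cov = pairs.mean(axis=0), np.cov(pairs.T) + 1e-9 * np.eye(2)

d_normal = mahalanobis2(feature_pair(D, normal[0]), mu, cov)
d_anom = mahalanobis2(feature_pair(D, 3.0 * rng.normal(size=16)), mu, cov)
```

The dense patch cannot be sparsely coded, so its (error, sparsity) pair falls far outside the Gaussian normality region.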
The Boracchi et al. [8] method is directly related to the sparse texture modeling previously introduced by Elhamifar et al. [27], where a “row sparsity index” is defined to distinguish outliers in a dataset. The outliers are added to the dictionary. Hence, in any variational sparse decomposition, they will primarily be used to represent themselves, as they cannot be sparsely decomposed over the inlier dictionary. In the words of the authors [27],
“We use the fact that outliers are often incoherent with respect to the collection of the true data. Hence, an outlier prefers to write itself as an affine combination of itself, while true data points choose points among themselves as representatives as they are more coherent with each other.”
As we saw, the Boracchi et al. [8] method is extremely well formalized. It was completed in Carrera et al. [14] by adding a multiscale detection framework measuring the anomaly's non-sparsity at several scales. The 2015 variant by Carrera et al. [13] of the above models introduces the tempting idea of building a convolutional sparse dictionary. This is done by minimizing
(5) $\min_{\{d_m\},\{\alpha_{i,m}\}} \; \sum_i \Big\| x_i - \sum_m d_m * \alpha_{i,m} \Big\|_2^2 + \lambda \sum_{i,m} \|\alpha_{i,m}\|_1,$

subject to $\|d_m\|_2 = 1$ for all $m$, where $\{d_m\}$ and $\{\alpha_{i,m}\}$ denote a collection of filters and coefficient maps respectively, and $*$ denotes convolution. As usual in such sparse dictionary models, the minimization can be done on both the filters and the coefficients, summing over a learning set of patches $x_i$. Deprived of the sum over $i$, the same functional can be minimized for a given input patch to compute its coefficients and evaluate its sparsity.
Defining anomaly detection as a variational problem, where anomalies are detected as nonsparse, is also the core of the method proposed by Adler et al. [1]. In a nutshell, the norm of the coefficients on a learned background dictionary is used as an anomaly measure. More precisely, assuming a dictionary on which normal data would be sparse, the method performs the minimization
(6) $\min_{A, E} \; \|X - DA - E\|_F^2 + \lambda \|A\|_1 + \beta \|E\|_{2,1},$

where the last term reduces to $\beta\|e\|_2$ for a “single measurement vector” and reads $\beta\|E\|_{2,1}$, the sum of the $\ell^2$ norms of the columns of $E$, for a “multiple measurement vector”. Here $X$ is the data matrix where each column is a distinct data vector. Similarly, $D$ is a matrix whose columns are the dictionary's components. $A$ is the matrix of the coefficients of these data vectors on $D$, which is forced by the term $\lambda\|A\|_1$ to become sparse. Yet anomalies, which are not sparse on $D$, make up a residual $E$ whose norm is measured as $\|E\|_{2,1}$; therefore their number should be moderated. Of course this functional, depending on the two parameters $\lambda$ and $\beta$, raises the question of their adequate values. The final result is a decomposition in which the difference between $X$ and $DA + E$ should be mainly noise, and we can therefore write

(7) $X = DA + E + N,$

where $N$ is the noisy residual, $DA$ the sparse part of $X$ and $E$ its anomalies.
In Appendix A we prove that the dual variational method amounts to finding the anomalies directly. Furthermore, we have seen that these methods cleverly solve the decision problem by applying very simple hypothesis testing to the low-dimensional variables formed by the values of the terms of the functional. Hence the method is generic and applicable to all images, and it can be completed by computing a number of false alarms, as we shall see. Indeed, we interpret the apparent overdetection as a neglect of the multiple testing; this can be fixed by the a contrario method, and we shall do so in Section 3.5.
Dual interpretation of sparsity models.
Sparsity-based variational methods lack the direct interpretation enjoyed by other methods as to the proper definition of an anomaly. By reviewing the first and simplest method of this kind, proposed by Boracchi et al. [8], we shall see that its dual interpretation points to the detection of the most deviant anomaly. Let $D$ be a dictionary representing “normal” patches. Given a new patch $s$, we compute its representation $\hat{\alpha}$ using the dictionary,

(8) $\hat{\alpha} = \arg\min_{\alpha} \; \tfrac{1}{2}\|s - D\alpha\|_2^2 + \lambda \|\alpha\|_1,$

and then build the “normal” component of the patch as $\hat{s} = D\hat{\alpha}$.

One can derive the following Lagrangian dual formulation (see Appendix A),

(9) $\hat{u} = \arg\min_{u} \; \tfrac{1}{2}\|s - u\|_2^2 \quad \text{subject to} \quad \|D^T u\|_{\infty} \le \lambda,$

where the vector $u$ contains the Lagrange multipliers.

While $\hat{s}$ represents the “normal” part of the patch $s$, $\hat{u}$ represents the anomaly. Indeed, the condition $\|D^T u\|_{\infty} \le \lambda$ imposes $u$ to be quasi-orthogonal to, hence far from, the patches represented by $D$. Moreover, for a solution of the dual to exist (and for the duality gap to vanish), it is required that $\hat{u} = s - D\hat{\alpha}$, i.e. $s = \hat{s} + \hat{u}$, which confirms the previous observation. Notice that the solution of (9) exists by an obvious compactness argument and is unique by the strict convexity of the dual functional.
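The primal-dual relation is easy to check numerically. The sketch below assumes the lasso written as $\min_\alpha \frac{1}{2}\|s - D\alpha\|_2^2 + \lambda\|\alpha\|_1$, for which standard duality gives $\hat{u} = s - D\hat{\alpha}$ together with the feasibility condition $\|D^T\hat{u}\|_\infty \le \lambda$ at the optimum; the ISTA solver and the random dictionary are our own choices.

```python
import numpy as np

rng = np.random.default_rng(3)
D = rng.normal(size=(20, 40))
D /= np.linalg.norm(D, axis=0)
s = rng.normal(size=20)
lam = 0.5

# ISTA for the primal problem: min 0.5*||s - D a||^2 + lam*||a||_1
L = np.linalg.norm(D, 2) ** 2
a = np.zeros(40)
for _ in range(5000):
    g = a + D.T @ (s - D @ a) / L
    a = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)

s_hat = D @ a          # "normal" part of the patch
u_hat = s - s_hat      # candidate anomaly (dual solution)

# At the optimum the dual constraint ||D^T u||_inf <= lam must hold,
# i.e. the anomaly is quasi-orthogonal to every dictionary atom.
max_corr = np.abs(D.T @ u_hat).max()
```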
Conclusions.
The great advantage of the background models assuming sparsity is that they make a generic structural assumption and derive a variational model that depends on only one or two parameters, namely the relative weights given to the terms of the energy to be minimized.
2.5 Non-local self-similar background models
The non-local self-similarity principle is invoked as a qualitative regularity prior in many image restoration methods, and particularly for image denoising methods such as the bilateral filter [73] or non-local means [11]. It was first introduced for texture synthesis in the pioneering work of Efros and Leung [26].
The basic assumption of this generic background model, applicable to most images, is that in normal data, each image patch belongs to a dense cluster in the patch space. Anomalies instead occur far from their closest neighbors. This definition of an anomaly can be implemented by clustering the image patches (anomalies being detected as far away from the centroid of their own cluster), or by a nearest neighbor search (NNS) leading to a direct rarity measurement.
For example, in their paper “Static and Space-time Visual Saliency Detection by Self-Resemblance” [67], Seo and Milanfar propose to measure rarity directly as an inverse function of resemblance. At each pixel, a descriptor measures the likeness of a pixel (or voxel) to its surroundings. Then this descriptor is compared to the corresponding descriptors of the pixels in a wider neighborhood. The saliency at pixel $i$ is measured by
(10) $S_i = \left( \sum_{j=1}^{N} \exp\!\left( \frac{-1 + \rho(F_i, F_j)}{\sigma^2} \right) \right)^{-1},$

where $\rho$ is the cosine similarity between two descriptors, $F_i$ is the local feature at pixel $i$, the $F_j$ for $j = 1, \dots, N$ are the $N$ closest features to $F_i$ in the surrounding, and $\sigma$ is a parameter.
The formula reads as follows: if no $F_j$ is aligned with $F_i$, the exponentials in (10) will all be small and therefore the saliency will be high. If instead only one $F_j$ correlates well with $F_i$, the saliency will be close to one, and if $k$ different $F_j$'s correlate well with $F_i$, $S_i$ will be approximately equal to $1/k$. This method cannot yield better than a saliency measure, as no clear detection mechanism emerges: how do we set a detection threshold?
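This reading of the self-resemblance score is worth checking numerically. The sketch below is ours: random 8-dimensional features, $\sigma = 0.1$, and the score written as the inverse of a sum of exponentials of cosine similarities.

```python
import numpy as np

def self_resemblance_saliency(f, neighbors, sigma=0.1):
    """Saliency of feature f: high when no surrounding feature resembles it,
    approximately 1/k when k neighbors are near-copies of f."""
    f = f / np.linalg.norm(f)
    sims = [f @ (g / np.linalg.norm(g)) for g in neighbors]  # cosine similarity
    return 1.0 / sum(np.exp((r - 1.0) / sigma ** 2) for r in sims)

rng = np.random.default_rng(4)
f = rng.normal(size=8)

# surroundings where k features are copies of f and the rest are random
one_match = [f.copy()] + [rng.normal(size=8) for _ in range(9)]
three_match = [f.copy()] * 3 + [rng.normal(size=8) for _ in range(7)]

s1 = self_resemblance_saliency(f, one_match)     # close to 1
s3 = self_resemblance_saliency(f, three_match)   # close to 1/3
```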
The algorithm in Zontak and Cohen [81] is also inspired from NL-means. In short, each patch of the test image is compared to the patches of a reference image (in practice a local “non-local” region, so that the computation time stays reasonable); if the sum of the non-local weights is smaller than a given threshold, then the weights are set to zero. An image is then reconstructed with these NL-means-type weights. In this image the anomalies should stand out as not being reconstructed. More precisely, the method applies NL-means to all patches of the source image, NL-means being computed with respect to a reference image which is not itself anomalous. Here the NL-means similarity parameter is crucial: it determines the decay of the weights of the patches as a function of their distance to the original, and this decay is faster when the parameter is small. Since a threshold is put on the sum of weights, if this sum is too small the patch is not reconstructed and is detected as an anomaly. Thus the parameter must be fixed large enough to ensure that any patch in the reference image can be reconstructed with nonzero weights. On second thought, one is led to the conclusion that the algorithm can be summarized much more simply as: a) fix a similarity threshold and a similarity coefficient, and b) compare each patch of the source image to the patches of the reference; if the sum of the distances is higher than the similarity threshold, then the patch is an anomaly. But, again, how to fix this anomaly threshold? In the paper it is fixed arbitrarily. The anomaly detection is applied to strongly self-similar wafers, and the authors actually display the difference between their denoised source image and an equally denoised reference image. Hence the displayed experiments, if not the method, perform a form of background subtraction followed by a detection threshold on the residual (even though the detection could be done directly on the weights, as suggested, instead of on the residual).
A similar idea was proposed by Tax and Duin [72]:
“The distance of the new object and its nearest neighbor in the training set is found and the distance of this nearest neighbor and its nearest neighbor in the training set is also found. The quotient between the first and the second distance is taken as indication of the novelty of the object.”
As demonstrated more recently by the SIFT method [47] this ratio is a powerful tool. In SIFT a descriptor in a first image is compared to all other descriptors in a target image. If the ratio between the closest descriptor and the second closest one is below a certain threshold, the match between both descriptors is considered meaningful. Otherwise, it is considered casual.
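The ratio test is easy to sketch. The 2-D descriptors and the 0.8 ratio below are illustrative values of ours, not taken from [47].

```python
import numpy as np

def ratio_test_match(desc, candidates, ratio=0.8):
    """Accept the nearest candidate only if it is clearly closer than the
    second nearest (distance ratio below `ratio`); otherwise the match is
    deemed casual."""
    d = np.linalg.norm(candidates - desc, axis=1)
    order = np.argsort(d)
    best, second = d[order[0]], d[order[1]]
    return (order[0], bool(best < ratio * second))

cands = np.array([[1.0, 0.0], [0.98, 0.05], [10.0, 10.0]])
q_good = np.array([10.0, 10.1])    # clearly closest to candidate 2
q_ambig = np.array([0.99, 0.025])  # candidates 0 and 1 are equally close
```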
More recently, the self-similarity measurement proposed by Goferman et al. [32] finds for each patch $p_i$ its $K$ most similar patches $q_1, \dots, q_K$ in a spatial neighborhood, and computes its saliency as

(11) $S_i = 1 - \exp\!\left( -\frac{1}{K} \sum_{k=1}^{K} d(p_i, q_k) \right).$

The distance $d$ between patches is a combination of the Euclidean distance of the color maps in LAB coordinates and of the Euclidean distance of the patch positions,

(12) $d(p_i, q_k) = \frac{d_{\mathrm{color}}(p_i, q_k)}{1 + c \cdot d_{\mathrm{position}}(p_i, q_k)},$

where $d_{\mathrm{color}}$ is the Euclidean distance between the patch color vectors, $d_{\mathrm{position}}$ the Euclidean distance between the patch positions, and $c$ a constant.
The algorithm computes saliency measures at four different scales and then averages them to produce the final patch saliency. This is a rough measure: all images are scaled to the same size of 250 pixels (largest dimension) and patches of fixed size are taken. The four scales are 100%, 80%, 50% and 30%. A pixel is considered salient if its saliency value exceeds a certain threshold (fixed empirically in the examples shown in the paper).
The patch distance (12) used in Goferman et al. [32] is almost identical to the descriptor distance proposed by Mishne and Cohen [52]. As in their previous paper [51], the authors first perform a dimension reduction of the patches. To that aim, a nearest neighbor graph on the set of patches is built, where the weights on the edges between patches are decreasing functions of their Euclidean distances. These positive weights allow one to define a graph Laplacian, and the basis of eigenvectors of this Laplacian is computed. The first coordinates of each patch in this basis yield a low-dimensional embedding of the patch space. (There is an equivalence between this representation of the patches and the application of the NL-means algorithm to the patches, as pointed out in [68].)
The anomaly score involves the distance of each patch to its $K$ nearest neighbors, computed in the new patch coordinates. This yields the following anomaly score for a given patch $p$:

(13) $\mathrm{score}(p) = \frac{1}{K} \sum_{k=1}^{K} \big\| \Psi(p) - \Psi(p_k) \big\|_2,$

where $\Psi(p)$ denotes the embedded coordinates of $p$ and $p_1, \dots, p_K$ its $K$ nearest neighbors.
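The embedding-plus-nearest-neighbors pipeline can be sketched as follows. The unnormalized graph Laplacian, the Gaussian kernel bandwidth, and the toy patch set are our choices for illustration, not those of [52].

```python
import numpy as np

def laplacian_embedding(patches, dim=2, h=1.0):
    """Embed patches with the first non-trivial eigenvectors of a graph
    Laplacian built from Gaussian weights on pairwise distances."""
    d2 = ((patches[:, None, :] - patches[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / h ** 2)
    L = np.diag(W.sum(1)) - W                  # unnormalized graph Laplacian
    vals, vecs = np.linalg.eigh(L)             # eigenvalues in ascending order
    return vecs[:, 1:dim + 1]                  # skip the constant eigenvector

def knn_anomaly_score(emb, k=3):
    """Mean distance to the k nearest neighbors in the embedded coordinates."""
    d = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    d_sorted = np.sort(d, axis=1)[:, 1:k + 1]  # drop the zero self-distance
    return d_sorted.mean(axis=1)

rng = np.random.default_rng(5)
patches = rng.normal(0, 0.05, (30, 16))        # a dense cluster of patches
patches[7] += 0.5                              # one isolated (anomalous) patch
scores = knn_anomaly_score(laplacian_embedding(patches))
```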
None of the methods mentioned so far gives a clear specification of its anomaly threshold. This comes from the fact that the self-similarity principle is merely qualitative: it does not fix a rule to decide whether two patches are alike or not.
While background probabilistic modeling is very powerful, it only works on a very restricted class of images (namely homogeneous objects like textiles or smooth painted surfaces). Only in such restrictive cases can it give rigorous detection thresholds based on the estimated probability density function of the background. Thus, in Davy et al. [20] we proposed to perform background modeling on the residual image obtained by background subtraction. As in other self-similarity based methods, the background is assumed self-similar; to remove it, a variant of the NL-means algorithm is applied.
The background modeling consists in replacing each image patch by an average of the most similar ones, with the additional restriction that similar patches must be found outside a square region centered at the query patch. This prevents an anomaly with some internal structure from being matched to itself and thus considered a valid structure.
For each patch $p$ of the input image $u$, the $N$ most similar patches, denoted by $p_1, \dots, p_N$, are searched and averaged to produce a self-similar estimate,

(14) $\hat{p} = \frac{1}{C} \sum_{i=1}^{N} e^{-\frac{\|p - p_i\|_2^2}{h^2}} \, p_i,$

where $C = \sum_{i=1}^{N} e^{-\frac{\|p - p_i\|_2^2}{h^2}}$ is a normalizing constant, and $h$ is a parameter.
Since each pixel belongs to several different patches, it receives several distinct estimates, which are averaged. This yields an image $\hat{u}$ of the self-similar part, and the residual image is finally built as $r = u - \hat{u}$. The idea is that anomalies have no similarities in the image and thus should remain in the residual $r$. In the absence of anomalies, the residual should be unstructured and therefore akin to noise, which is much simpler to model stochastically than the input image.
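A deliberately simplified sketch of this background-subtraction step: non-overlapping patches, no multiscale pyramid, no aggregation of overlapping estimates, and arbitrary values of ours for the patch size, exclusion window and similarity parameter.

```python
import numpy as np

def self_similar_residual(img, patch=4, exclude=8, n_sim=5, h=0.5):
    """Replace each patch by a weighted mean of its most similar patches
    found OUTSIDE a square exclusion window (Chebyshev radius `exclude`),
    then return the residual, which keeps what is not self-similar."""
    H, W = img.shape
    coords = [(y, x) for y in range(0, H - patch + 1, patch)
                     for x in range(0, W - patch + 1, patch)]
    P = {c: img[c[0]:c[0] + patch, c[1]:c[1] + patch] for c in coords}
    out = np.zeros_like(img)
    for (y, x) in coords:
        # candidate patches outside the exclusion window around (y, x)
        cands = [(np.sum((P[(y, x)] - P[c]) ** 2), c) for c in coords
                 if max(abs(c[0] - y), abs(c[1] - x)) > exclude]
        cands.sort(key=lambda t: t[0])
        w = np.array([np.exp(-d / h ** 2) for d, _ in cands[:n_sim]])
        est = sum(wi * P[c] for wi, (_, c) in zip(w, cands[:n_sim])) / w.sum()
        out[y:y + patch, x:x + patch] = est
    return img - out

rng = np.random.default_rng(6)
img = rng.normal(0, 0.05, (32, 32))
img[12:16, 12:16] += 1.0                      # an anomaly with no lookalike
res = self_similar_residual(img)              # anomaly survives in the residual
```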
Then the method uses the Grosjean and Moisan [36] a contrario method to detect fine-scale anomalies on the residual. A pyramid of images is used to detect anomalies at all scales. The method is shown to deliver similar results when producing the residual from features obtained from convolutional neural networks instead of the raw RGB values (see [20]). Still, there is something unsatisfactory in the method: it assumes, like Grosjean and Moisan [36], that the background is a uniform Gaussian random field, but no evidence is given that the residual obeys such a model.
In the next paragraph we make a brief excursion into the case of video, where the self-similarity model is particularly well founded.
Selfsimilarity in video anomaly detection
In [6], image regions are matched (allowing deformations) to others in the same image or video. The probability of the deformation is estimated and gives a saliency map. Given a query image and a reference image, the anomaly detection works as follows:

match patches from the query image to the reference one;

match a query patch group to a reference patch group, up to an elastic deformation (modeled by a Gaussian). This amounts to matching two probabilistic graph models for two ensembles of patches.

a “probability” measure is then computed which can be understood as a measurement of the deformation from the query to the reference ensemble of patches. This probability gives a saliency map.
Nevertheless none of the above “probability” measures is given by an explicit probabilistic model that could be learned from data. Hence the method seems hard to extend from the few examples where it seems to give good results to a general method.
Boracchi and Roveri [7] also proposed to detect anomalies (or changes) in time series by exploiting self-similarity. Their general idea is that a normal patch should have at least one very similar patch along the sequence. Given a temporal patch (a small temporal window), the residual with respect to the most similar patch in the sequence is computed. This leads to a new residual sequence (i.e., a change indicator sequence). The final step is to apply a traditional change detection test (CDT) on the residual sequence. CDTs are statistical tests to detect rare events in sequences, that is, samples that do not conform to an independent and identically distributed model. CDTs run in an online, sequential fashion. The very recent method [56] is similar to [7] commented above; its main difference is the use of convolutional neural network features instead of image patches.
Conclusions on selfsimilarity
Like sparsity, self-similarity is a powerful qualitative model, but we have pointed out that in all of its applications except one it lacks a rigorous mechanism to fix an anomaly detection threshold. The only exception is [20], which extends the Grosjean and Moisan [36] method and therefore obtains a rigorous detection threshold under the assumption that the residual image is a Gaussian random field. That the residual is more akin to a random noise than the background image is believable, but not formalized.
2.6 Conclusions, selection of the methods, and their synthesis
Stochastic background
Gaussian model
pdf estimation
Gaussian mixture
Gaussian stationary process
Homogeneous background
Fourier background
Neural-network background
Smooth background
Local homogeneity
Sparsity-based
Non-local self-similar background

Du and Zhang [25]  •  •  
Goldman and Cohen [33]  •  •  
Tarassenko et al. [70]  •  •  
Xie and Mirmehdi [80]  •  •  
Grosjean and Moisan [36]  •  •  •  
Tsai and Hsieh [77]  •  •  
Tsai and Huang [78]  •  •  
Perng et al. [61]  •  •  
Aiger and Talbot [2]  •  •  
Hawkins et al. [37]  •  •  
An and Cho [3]  •  •  
Schlegl et al. [65]  •  •  
Tout et al. [75]  •  
Itti et al. [42]  •  
Itti and Koch [41]  •  
Gao et al. [31]  •  
Honda and Nayar [39]  •  •  
Mahadevan et al. [48]  •  •  •  
Margolin et al. [49]  •  
Boracchi et al. [8]  •  •  
Elhamifar et al. [27]  •  
Carrera et al. [13]  •  •  
Carrera et al. [14]  •  •  
Adler et al. [1]  •  
Seo and Milanfar [67]  •  
Zontak and Cohen [81]  •  •  
Tax and Duin [72]  •  
Goferman et al. [32]  •  
Mishne and Cohen [51]  •  
Mishne and Cohen [52]  •  
Boiman and Irani [6]  •  •  
Boracchi and Roveri [7]  •  
Napoletano et al. [56]  •  •  •  
Davy et al. [20]  •  •  •  •  • 
Table 1 recapitulates the different methods presented in this section. Our classification in the table is finer-grained than the one used in our analysis. In particular, among the stochastic background models, we point out those which are Gaussian, a mixture of Gaussians, a Gaussian field, or which model the background by a kernel method. We observed that the methods giving a background probabilistic model are powerful when the images belong to a restricted class of homogeneous objects, like textiles or smooth painted surfaces. Indeed, such a method furnishes rigorous detection thresholds based on the estimated probability density function. But, regrettably, stochastic background modeling is not applicable to generic images. For the same reason, CNN-based background reconstruction models are restrictive and do not rely on provable detection thresholds. We saw that center-surround contrast methods are successful for saliency enhancement, but again generally lack a detection mechanism. We also saw that the center-surround methods proposing a detection threshold have to estimate two stochastic models, one for the center and one for the surround, and are therefore quite complex and coarse-grained. The last two columns of the table, namely the sparsity and the self-similarity models, are tempting and thriving. Their big advantage is their universality: they can be applied to all background images, homogeneous or not, stochastic or not. But again, the self-similarity model lacks a rigorous detection mechanism, because it works on a feature space that is not easily modeled. Nevertheless, several sparsity models that we examined do propose a hypothesis testing method based on a pair of parameters derived from the variational method. But these parameters have no justifiable model and, in any case, do not take into account the multiple testing. This last objection can be fixed, though, by computing a number of false alarms as proposed in [36], and we shall do it in the next section.
As pointed out in Davy et al. [20], abandoning the goal of building a stochastic background model does not imply abandoning the idea of a well-founded probabilistic threshold. Their work hints that background subtraction is a powerful way to get rid of the hard constraint of modeling the background, and to work only on the residual. But in [20] no final argument is given demonstrating that the residual can be modeled as a simple noise. Nevertheless, this paper shows that the parametric Grosjean and Moisan [36] detection works better on the residual than on the original image (see Section 3.2).
We noticed that at least one paper (Aiger and Talbot [2]) has proposed a form of background whitening. It seems therefore advisable to improve background subtracting methods by applying the PHOT to the residual.
Our conclusion is that we might be closer to a fully generic anomaly detection by combining the best advances that we have listed. To summarize, we see two different combinations of these advances that might give a competitive result:

Sparse background modeling, with the decision taken on the pair formed by the norm of the residual and a sparsity measure;

Background subtraction by self-similarity and residual whitening.

These two proposals have the advantage of taking into account all the advances in anomaly detection that we pointed out. They cannot be united; sparsity and self-similarity are akin but different regularity models. We notice that both methods actually work on a residual. In the second proposed method the residual is computed explicitly. In the first one, the decision is taken on a Gaussian model for a pair of parameters, one of which is actually the norm of the residual, the other a sparsity measure. In Section 3 we develop the tools necessary to compare the selected methods. We need a unified anomaly detection criterion, and we have seen that the a contrario framework provides one.
3 Extending the a contrario detection mechanism to all compared methods
In Section 2, we classified anomaly detection methods into several families based on their background models: stochastic, homogeneous, locally homogeneous, sparsity-based and non-local self-similar models. Our final goal is to compare the results of these families by selecting state-of-the-art representatives of each family.
All methods presented in Section 2 require a detection threshold. These thresholds are not always explicit and remain empirical in many papers: instead of a universal threshold, most methods propose a range from which to choose, depending on the application or even on the image.
To perform a fair comparison of the selected methods, we must automatically set their detection thresholds, based on a uniform criterion. This will be done by computing for each method a Number of False Alarms (NFA), using the a contrario framework introduced by Desolneux et al. [21, 22]. This detection criterion is already used in two of the examined papers, [36] and [20].
3.1 Computing a number of false alarms in the a contrario framework
The a contrario framework is classical in many detection or estimation computer vision tasks, such as line segment detection [79, 35], ellipse detection [58], spot detection [36], vanishing point detection [45, 46], fundamental matrix estimation [53], image registration [54], mirror-symmetry detection [59], and cloud detection [19], among others. It relies on the following simple definition.
Definition 3.1
[36] Given a set of random variables $(X_i)_{1 \le i \le N}$ with known distribution under a null hypothesis $\mathcal{H}_0$, a test function $\mathrm{NFA}(i, x_i)$ is called an NFA if it guarantees a bound on the expectation of its number of false alarms under $\mathcal{H}_0$, namely:

(15) $\forall \varepsilon > 0, \quad \mathbb{E}_{\mathcal{H}_0}\big[\,\#\{i \;:\; \mathrm{NFA}(i, X_i) \le \varepsilon\}\,\big] \le \varepsilon.$

To put it in words, raising a detection every time the test function is below $\varepsilon$ should give, under $\mathcal{H}_0$, an expectation of fewer than $\varepsilon$ false alarms. An observation $x_i$ is said to be “meaningful” if it satisfies $\mathrm{NFA}(i, x_i) \le \varepsilon$, where $\varepsilon$ is the predefined target for the expected number of false alarms. The lower the NFA, the “stronger” the detection.
While the definition of the background model doesn't contain any a priori information on what should be detected, the design of the test function reflects expectations on what an anomaly is. A common way to build an NFA is to take

(16) $\mathrm{NFA}(i, x_i) = N \, \mathbb{P}_{\mathcal{H}_0}(X_i \ge x_i)$

or

(17) $\mathrm{NFA}(i, x_i) = N \, \mathbb{P}_{\mathcal{H}_0}(|X_i| \ge |x_i|),$

where $N$ is the number of tests, $i$ goes over all tests, and the $x_i$ are the observations whose excess should raise an alarm. These test functions are typically used when anomalies are expected to have higher values than the background (first case), or higher modulus than the background (second case). If for example the $x_i$ represent the pixels of an image, there would be one test per pixel and per channel; hence $N$ would be the product of the image dimension by the number of image channels.
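As an illustration, under the simplest i.i.d. Gaussian background (an assumption of this sketch, not a requirement of the framework), the test function (17) becomes a two-sided Gaussian tail:

```python
import numpy as np
from scipy.stats import norm

def nfa_two_sided(x, mu, sigma):
    """NFA of the form N * P(|X - mu| >= |x_i - mu|) for pixels x under an
    i.i.d. N(mu, sigma^2) background model; N is the number of tests."""
    N = x.size
    tail = 2.0 * norm.sf(np.abs(x - mu) / sigma)   # two-sided tail probability
    return N * tail

rng = np.random.default_rng(7)
pixels = rng.normal(0.0, 1.0, 10000)
pixels[123] = 6.0                                  # a clear outlier
nfa = nfa_two_sided(pixels, 0.0, 1.0)
detections = np.flatnonzero(nfa < 1.0)             # target epsilon = 1
```

By construction, the expected number of casual detections under the background model is at most $\varepsilon = 1$.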
Grosjean and Moisan [36] proved that the test function (16) satisfies Definition 3.1. Since the only requirement of their proof is that $X_i$ be a real-valued random variable, a more general result can be derived for any function $f$ and multidimensional $X_i$, provided $f(X_i)$ is a real-valued random variable. Under these conditions, the following function

(18) $\mathrm{NFA}(i, x_i) = N \, \mathbb{P}_{\mathcal{H}_0}\big(f(X_i) \ge f(x_i)\big)$

is also an NFA.
In short, applying the a contrario framework just requires a stochastic background model giving the laws of the random variables $X_i$, and a test function.
In Davy et al. [20] for example, the $x_i$ denote the pixels of the residual image, which presumably follows a colored Gaussian noise model. This Gaussian model defines the null hypothesis $\mathcal{H}_0$, $N$ is the total number of tested pixels (considering all the scales and channels), and the test function is given by (17).
Proposition 3.1.1
Consider the simplest case where all tested variables are identically distributed under $\mathcal{H}_0$, and assume that the cumulative distribution function $F$ of the $|X_i|$ is invertible. Assume that the test function is given by (17). Then testing whether $|x_i|$ is above the threshold $\kappa$ defined by

(19) $\kappa = F^{-1}\!\left(1 - \frac{\varepsilon}{N}\right)$

ensures a number of false alarms lower than $\varepsilon$.
In the particular a contrario setting given by Eq. (19), the number of false alarms gives a result similar to the Bonferroni correction [5], used to compensate for multiple testing. It is also interpretable as a per-family error rate [38]. Deeper results can be found in [22].
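When the $X_i$ are standard Gaussians (our simplifying assumption), the threshold of the proposition has a closed form, and the Bonferroni analogy becomes concrete: each of the $N$ tests runs at level $\varepsilon / N$.

```python
from scipy.stats import norm

def nfa_threshold(eps, N):
    """Threshold kappa for the two-sided test (17) when the X_i are i.i.d.
    N(0,1): solve N * P(|X| >= kappa) = eps, i.e. kappa = Phi^{-1}(1 - eps/(2N))."""
    return norm.isf(eps / (2.0 * N))

kappa = nfa_threshold(eps=1.0, N=10**6)
# Bonferroni analogy: the per-test significance level is eps/N.
per_test_level = 2.0 * norm.sf(kappa)
```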
In the next sections we specify the a contrario framework for the methods that we will be comparing.
3.2 The Grosjean and Moisan [36] stochastic parametric background model and the Davy et al. [20] self-similarity model
Grosjean and Moisan [36] proposed to model the input image as a colored Gaussian stationary process. The method is designed to detect bright local spots in textured images, for example mammograms. Three different ways to compute an NFA are proposed, by locally assuming (i) no context, (ii) a contrast related to the context, or (iii) a conditional context. Method (i) comes down to convolving the image with disk kernels and testing the tails of the obtained Gaussian distributions, while method (ii) comes down to convolving with center-surround kernels. Their second method is preferred since, with strong noise correlation, the local average in their background model can be far from 0. In Davy et al. [20], a residual image is produced by a self-similarity removal step, which contains a normalization step to make the noise more Gaussian. The residual is then supposed to behave as colored Gaussian noise, and the method comes down to convolving the residual with disk kernels and testing the tails of the obtained Gaussian distributions. Both methods combine the detections at several scales of the input image. Thus, both methods share a similar detection mechanism and can be expressed in the same terms.
Under $\mathcal{H}_0$, the results of the convolutions of the residual with the testing kernels are colored Gaussian noise, whose mean and variance can be estimated accurately from the filtered image itself. Hence the NFA test function applied to all the residual values (pixel/channel/residual) is exactly the function (17).
3.3 The Fourier homogeneous background model of Aiger and Talbot [2]
In the Aiger and Talbot [2] method, a residual is obtained by setting the modulus of all Fourier coefficients of the image to 1 (the PHOT). The residual is then modeled a contrario as a simple Gaussian white noise whose mean and variance are estimated from the image. Anomalous pixels are therefore detected by using a threshold on the Mahalanobis distance between the pixel value and the background Gaussian model. Let $\mathcal{H}_0$ be the null hypothesis under which the residual values $r(x)$ follow a Gaussian distribution with mean $\mu$ and variance $\sigma^2$. Then we have

(20) $\mathbb{P}_{\mathcal{H}_0}\big(|R(x) - \mu| \ge |r(x) - \mu|\big) \;=\; 2\left(1 - \Phi\!\left(\frac{|r(x) - \mu|}{\sigma}\right)\right) \;=\; \operatorname{erfc}\!\left(\frac{|r(x) - \mu|}{\sqrt{2}\,\sigma}\right),$

where $R(x)$ denotes a residual value under $\mathcal{H}_0$ and $\Phi$ the standard normal cumulative distribution function.
Thus, the associated function
(25) 
is an NFA of the form (18), where the number of tests corresponds to the number of pixels in the image. This NFA leads to detecting an anomalous pixel when its Mahalanobis distance exceeds a threshold verifying
(26) 
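A minimal numpy sketch of this pipeline (the 64x64 periodic texture, the defect amplitude and the NFA target eps are illustrative assumptions, not parameters from [2]):

```python
import numpy as np
from math import erfc

def phot_detector(img, eps=1.0):
    """Phase only transform (PHOT) sketch: set the Fourier modulus to 1,
    then test each residual pixel against a Gaussian background model."""
    F = np.fft.fft2(img)
    F = F / np.maximum(np.abs(F), 1e-12)        # keep only the phase
    residual = np.real(np.fft.ifft2(F))
    d = np.abs(residual - residual.mean()) / residual.std()  # Mahalanobis distance
    tail = np.vectorize(erfc)(d / np.sqrt(2))   # two-sided Gaussian tail
    return img.size * tail < eps                # NFA < eps => anomalous pixel

rng = np.random.default_rng(1)
t = np.arange(64)
periodic = np.sin(2 * np.pi * t[:, None] / 8) + np.sin(2 * np.pi * t[None, :] / 8)
img = periodic + 0.05 * rng.normal(size=(64, 64))
img[20, 40] += 3.0                              # point defect in the texture
mask = phot_detector(img)
```

The flattened modulus cancels the periodic background, so the point defect dominates the residual while almost no background pixel passes the NFA test.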
3.4 The Zontak and Cohen [81] nonlocal self-similar model
In this method, the detection test is based on the NL-means weights. If the sum of these weights (before normalization) is smaller than a threshold, the patch is considered anomalous. In what follows, we discuss how to choose this threshold by computing an NFA. We restrict ourselves to the case where the distance between patches is the Euclidean distance.
Let us recall that for a reference patch , a similarity parameter , and a set of neighboring patches , an anomaly is detected when
(27) 
Under the null hypothesis, every patch of the image is associated with spatially close patches. At least one of these patches is similar and only differs by the realization of the noise, the noise-free content being assumed identical. The noise is supposed to be, for each pixel, an independent centered Gaussian noise of fixed variance. We know that
(28) 
verifies the NFA property (this is just equation (18) with a well-chosen test function).
By hypothesis, at least one of these neighboring patches is a realization of the same content as the reference patch, but with a different noise realization.
By event inclusion,
(29) 
Moreover
(30) 
Here we suppose that the candidate patch is indeed the same as the reference patch up to the noise. Therefore the squared distance between them follows a chi-squared law whose number of degrees of freedom is the size of the patch.
That is,
(31) 
where chi2 denotes the cumulative distribution function of the chi-squared distribution with degrees of freedom equal to the patch size.
Thus, by bounding (28) from above, and using the fact that a function whose value is always above an NFA is also an NFA (it yields fewer or equally many detections), the following test function also is an NFA:
(32) 
Thus, by definition of an NFA, a detection is raised if
(33) 
which leads to a threshold such that
(34) 
To get precise enough weights, all computations for this method are done using the log-sum-exp trick. Otherwise, the sum of the weights is so close to zero that standard double precision cannot distinguish the different detections. To fit the noise hypothesis, we estimate the noise variance with a noise level estimation method such as the one presented by Ponomarenko et al. [63] and extended by Colom and Buades [18].
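The derivation above can be sketched numerically. In the following toy code, the squared distances, noise variance, patch size and number of tests are all hypothetical; the chi-squared survival function is written in closed form for even degrees of freedom. It illustrates both the bound (the NFA is driven by the best neighbor through the chi-squared tail, as in (29)-(32)) and why the log-sum-exp trick is needed:

```python
import numpy as np
from math import exp

def chi2_sf(x, k):
    """Survival function of the chi-squared law; closed form for even k:
    exp(-x/2) * sum_{i < k/2} (x/2)^i / i!."""
    term, total = 1.0, 1.0
    for i in range(1, k // 2):
        term *= (x / 2) / i
        total += term
    return exp(-x / 2) * total

def log_weight_sum(d2, h2):
    """log of sum_j exp(-d2_j / h2) via the log-sum-exp trick, so that
    weight sums far below double precision stay distinguishable."""
    a = -np.asarray(d2) / h2
    m = a.max()
    return float(m + np.log(np.exp(a - m).sum()))

# Hypothetical setting: 16-pixel patches, noise variance 4, 20 neighbors,
# 10^4 tested patches in the image.
sigma2, k, n_tests = 4.0, 16, 10_000
rng = np.random.default_rng(0)
normal_d2 = sigma2 * rng.chisquare(k, 20)   # one neighbor matches: chi2 distances
anomal_d2 = normal_d2 + 6000.0              # no similar patch in the window

def nfa(d2):
    # Under H0, the best neighbor's normalized squared distance is
    # chi-squared with k degrees of freedom.
    return n_tests * chi2_sf(float(np.min(d2)) / sigma2, k)

# The direct weight sum underflows for the anomaly; its log does not:
direct = np.exp(-anomal_d2 / (2 * sigma2)).sum()
stable = log_weight_sum(anomal_d2, 2 * sigma2)
```

Here `nfa(normal_d2)` stays far above 1 (no detection), `nfa(anomal_d2)` falls below 1 (detection), and `direct` is exactly 0.0 in double precision while `stable` still carries the information.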
3.5 The Boracchi et al. [8] sparsity-based background model
In this method the detection is performed by thresholding the Mahalanobis distance. Chen [17] has shown, as a generalization of Chebyshev's inequality, that for a random vector of given dimension and covariance matrix we have
(35) 
Moreover, it has been shown in [57] that this inequality is sharp if no other assumptions are made on the random vector. Therefore, in the case of this method, for a candidate and a reference set,
(36) 
Hence the function
(37) 
is clearly an NFA associated with the method. Using (36) and the fact that a function whose value is always above an NFA is also an NFA, we deduce that the test function
(38) 
also is an NFA. Thus, a detection is made if
(39) 
which leads to a threshold such that
(40) 
While the method was originally presented using an external database of anomaly-free data, we use it on the image itself, i.e., the dictionary is learned on the image containing the anomalies. Our intuition is that, because anomalies are supposed to be rare (hence sparse in the image), using the image itself should not bias the dictionary much, and the detections should remain accurate.
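A small numerical sketch of this distribution-free test (the 2-D features, their covariance, and the number of tests are hypothetical stand-ins; the bound is the multivariate Chebyshev inequality of Chen [17] stated above):

```python
import numpy as np

def mahalanobis2(x, mean, cov):
    """Squared Mahalanobis distance of a feature vector."""
    d = x - mean
    return float(d @ np.linalg.solve(cov, d))

def chebyshev_nfa(m2, dim, n_tests):
    """Distribution-free NFA: P(M^2 >= t) <= dim/t (Chen [17]),
    so n_tests * min(1, dim/m2) bounds the expected false alarms."""
    return n_tests * min(1.0, dim / m2)

# Hypothetical 2-D features of "normal" patches (e.g. reconstruction
# error and code sparsity), used to estimate the background statistics.
rng = np.random.default_rng(0)
normal = rng.normal(size=(500, 2)) @ np.array([[1.0, 0.3], [0.0, 0.5]])
mean, cov = normal.mean(axis=0), np.cov(normal.T)
n_tests, eps = 500, 1.0

candidate = mean + np.array([30.0, -20.0])       # far-off candidate feature
m2 = mahalanobis2(candidate, mean, cov)
detected = chebyshev_nfa(m2, 2, n_tests) < eps   # test of the form (39)
threshold = n_tests * 2 / eps                    # equivalent threshold on M^2
```

Because the bound makes no distributional assumption, the threshold on the squared Mahalanobis distance grows with the number of tests, which is exactly the price paid for universality.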
3.6 The Mishne and Cohen [51] nonlocal self-similar model
There is no obvious way to formalize this method within the a contrario framework. For the experiments presented in Section 4, we use the detection threshold suggested in the original paper, even though it lacks a theoretical justification.
4 Experiments
In this section we shall compare the six methods analyzed in Section 3. In what follows, we detail the different variants that we finally compare:

The nonlocal self-similar model of Mishne and Cohen [52], with the detection threshold detailed in the original publication. We denote this method by Mishne.
We propose two types of experimental comparison.

The first comparison is a qualitative sanity check. For this qualitative analysis we tested synthetic examples having obvious anomalies of different types (color, shape, cluster), or no anomaly at all (white noise). These toy examples provide a sanity check, since one would expect all algorithms to perform perfectly on them. We also examine the results of the competitors on challenging examples taken from anomaly detection articles.

The second protocol is a quantitative evaluation. We generated anomaly-free images as samples of colored random Gaussian noise. Being spatially homogeneous random processes, such images should remain neutral for an anomaly detector. We then introduced small anomalies into these images and evaluated whether these synthetic anomalies were detected by the competitors. This allows evaluating a true positive detection rate (TP) for each method on these images. We also evaluated how much of the anomaly-free background was wrongly detected, namely the false positive detection rate (FP). The resulting TP-FP pairs yield ROC curves, which are discussed below. Undoubtedly, the random Gaussian noise used in this experiment could be replaced by any other spatially homogeneous random process. We varied the background texture by strongly varying the process's power spectrum.
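This protocol can be sketched as follows. The 1/f^alpha power spectrum, the anomaly amplitude and the trivial threshold detector below are illustrative stand-ins for the actual spectra and competitors:

```python
import numpy as np

def colored_noise(shape, alpha, rng):
    """Anomaly-free background: white Gaussian noise filtered so that its
    power spectrum decays roughly as 1/f^alpha (alpha varies the texture)."""
    fy = np.fft.fftfreq(shape[0])[:, None]
    fx = np.fft.fftfreq(shape[1])[None, :]
    f = np.sqrt(fy ** 2 + fx ** 2)
    f[0, 0] = 1.0                                   # avoid dividing by zero at DC
    white = rng.normal(size=shape)
    field = np.real(np.fft.ifft2(np.fft.fft2(white) / f ** alpha))
    return (field - field.mean()) / field.std()

def tp_fp_rates(detection, truth):
    """Pixel-wise true and false positive rates against a ground-truth mask."""
    tp = (detection & truth).sum() / max(truth.sum(), 1)
    fp = (detection & ~truth).sum() / max((~truth).sum(), 1)
    return tp, fp

rng = np.random.default_rng(0)
bg = colored_noise((128, 128), alpha=1.0, rng=rng)
truth = np.zeros((128, 128), dtype=bool)
truth[60:68, 60:68] = True
img = bg + 5.0 * truth                              # small synthetic anomaly
detection = img > 3.0                               # trivial detector, for the demo
tp, fp = tp_fp_rates(detection, truth)
```

Sweeping the detector's threshold and collecting the (FP, TP) pairs produces the ROC curves used in Section 4.2.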
4.1 Qualitative evaluation
The toy examples are probably the easiest to analyze. We show the results in Figure 5. We generated images in the classic form used in anomaly detection benchmarks like [64], where the anomaly is the shape or the color that is unique in the figure. In the third toy example most rectangles are well spaced except in a small region; the anomaly therefore is a change in spatial density. Even though these examples are extremely simple to analyze, they appear to challenge several methods, as can be seen in Figure 5. Only Davy et al. [20] is able to detect the anomaly accurately in all three examples. This is explained in the second row, where the residual after background subtraction is shown: in the residual, the details of the anomalies stand out on a noise-like background. While Aiger and Talbot [2] works well with the color and the shape, it fails to detect the spatial density anomaly. The other methods, Zontak and Cohen [81], Mishne and Cohen [52], Grosjean and Moisan [36] and Boracchi et al. [8], overdetect the contours of the non-anomalous shapes, thus leading to many false positives. We also tried a sanity check with a pure white Gaussian noise image, in the last two examples of Figure 5. Davy et al. [20], Zontak and Cohen [81] and Grosjean and Moisan [36] soundly detect no anomaly in white noise, as expected. However, a few detections are made by Boracchi et al. [8] and almost everything is detected by Mishne and Cohen [52]. It can be noted that the background model of the first three papers is directly respected in the case of white Gaussian noise, which explains the perfect result. (In the case of the model of Davy et al. [20], it has to be noted that nonlocal means asymptotically transforms white Gaussian noise into white Gaussian noise [12].) The overdetection in Mishne and Cohen [52] can be explained by the lack of an automatic statistical threshold. The few spurious detections in Boracchi et al. [8] show that the feature used for the detection does not follow a Gaussian distribution, contrary to the method's testing assumption. It is also clear that one cannot build a sound sparse dictionary for white noise.
The same test was done after adding a small anomalous spot to the noise, and the conclusion is similar: [20, 36] perform well, [8] has a couple of false detections and does not detect the anomaly, and Zontak and Cohen [81] does not detect anything. Finally, Mishne and Cohen [52] overdetects. Both noise images were taken from Grosjean and Moisan [36].
We then analyze three examples coming from previous papers. The first one (first column of Figure 6) is a sidescan sonar image of an undersea mine borrowed from Mishne and Cohen [52]. The mine is detected by Davy et al. [20] and Grosjean and Moisan [36] without any false detections. Both Mishne and Cohen [52] and Boracchi et al. [8] have false detections; Zontak and Cohen [81] overdetects and Aiger and Talbot [2] misses the mine. The second example (second column of Figure 6) shows a near-periodic texture. This is one of the examples for which Fourier based methods are ideally suited. It was therefore important to check whether more generic methods were still able to detect the anomaly. Three methods, Aiger and Talbot [2], Zontak and Cohen [81] and Grosjean and Moisan [36], fail to detect the anomaly, while the other three methods perform very well. This makes the case for self-similarity and sparsity based methods, which nicely generalize the background's periodicity assumption. The final example (third column of Figure 6) is a real medical imaging example borrowed from Grosjean and Moisan [36], where the goal is to detect the tumor (the large white region). Aiger and Talbot [2], Zontak and Cohen [81] and Boracchi et al. [8] fail to detect the tumor. A strong detection is given by Mishne and Cohen [52], but the false alarms are also strong and numerous. Finally, Davy et al. [20] gives stronger tumor detections than Grosjean and Moisan [36], but it has several false alarms as well.
Finally, we tested the methods on real photographs taken from the Toronto dataset [10]. The main problem when using real photographs is that anomalies detected by humans may be semantic. None of the methods we consider was made to detect semantic anomalies, which can only be learned on human-annotated images. Nevertheless, the tests' results are still enlightening. Detections differ strongly from one method to another. The fourth example in Figure 6 shows a man walking in front of some trees. Aiger and Talbot [2], Grosjean and Moisan [36] and Mishne and Cohen [52] do not detect anything. Both Zontak and Cohen [81] and Boracchi et al. [8] detect mostly the trees and the transition between the road and the sidewalk. Surprisingly, Davy et al. [20] only detects the man; indeed, in the noise-like residual one can check that the man stands out. The second example shows a garage door as well as a brick wall. This time the algorithms tend to agree more. The conspicuous sign on the door is well detected by all methods, as well as the lens flare. A gap at the bottom, between the brick wall and the door, is detected by Davy et al. [20], Zontak and Cohen [81], Mishne and Cohen [52], Grosjean and Moisan [36] and Boracchi et al. [8]. The methods Mishne and Cohen [52] and Boracchi et al. [8] also detect the transition between the wall and the brick wall. Finally, some detections on the brick wall are made by Davy et al. [20] and Boracchi et al. [8]. The residuals of Davy et al. [20] on the second row are much closer to noise than the background, which amply justifies the interest of detecting on the residual rather than on the background. Nevertheless, the residual has no reason to be uniform, as is apparent in the garage's residual. Even though the detections still look acceptable, this nonuniformity of the residual noise suggests that center-surround detectors based on a local variance (as done in [36]) might ultimately be preferable.
Fixing a target NFA means that, under the background model, only that number of false positives should occur per image. Yet many of the shown examples exhibit several false positives. Given the mathematical justification of these thresholds, false positives come from discrepancies between the hypothetical model and the image. In the case of Zontak and Cohen [81], the overdetection in the trees of the picture with a man can be explained by the limited self-similarity of the trees: for this region, the nearest patches are not close enough to the patch to reconstruct to fit the model, which requires at least one would-be-identical-except-for-the-noise patch in the neighborhood. The overdetection in the case of the undersea mine is likely a mismatch between the noise model and the picture noise. The many false alarms of this method on the other examples make us wonder whether the model hypothesis is too strong. The Boracchi et al. [8] method triggers many false detections in almost all tested examples. As we mentioned, this suggests that the Gaussian model for the detection pairs is inaccurate. This is not necessarily a problem for specific fault detection applications where the false alarm curves can be learned.
4.2 Quantitative evaluation
Estimating how well an anomaly detector works "in general" is a challenging evaluation task. Qualitative experiments such as the ones presented in Section 4.1 give no final decision. Our goal now is to address the performance evaluation in terms of true positive rate (TP) and false positive rate (FP). To that aim, we generated a set of ten RPN textures [30], which are deprived of any statistical anomaly. We then introduced artificial anomalies by merging a small piece of another image inside each of them. This was done by simple blending or by Poisson editing [60], using the implementation of [23]. This procedure provides a set of images for which a ground truth is known; hence the detection quality measure can be clearly defined. Figure 2 shows one of the generated RPN images with an anomaly added, together with the anomaly's ground truth locus. Table 2 shows the results of our six methods on this dataset.
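The RPN construction [30] can be sketched with a few FFT lines. In this sketch, the input texture is synthetic, and drawing the random phase as the phase of a real white-noise image is one standard way to enforce the Hermitian symmetry needed for a real output:

```python
import numpy as np

def rpn(texture, rng):
    """Random Phase Noise sketch: keep the Fourier modulus of the input
    texture and redraw the phases. Taking the phase of a real white-noise
    image enforces the Hermitian symmetry of a real output image."""
    mean = texture.mean()
    F = np.fft.fft2(texture - mean)
    phase = np.angle(np.fft.fft2(rng.normal(size=texture.shape)))
    G = np.abs(F) * np.exp(1j * phase)
    return np.real(np.fft.ifft2(G)) + mean

rng = np.random.default_rng(0)
tex = rng.normal(size=(64, 64)).cumsum(axis=1)   # some correlated input texture
sample = rpn(tex, rng)                           # same spectrum, no anomaly
```

The sample keeps the power spectrum (hence the second-order statistics) of the input texture while destroying any localized structure, which is why such images are anomaly-free by construction.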
TP pixels (in %)  FP pixels (in %)  TP anomalies (in %)  FP anomalies (in %)  

Aiger and Talbot [2]  90  40  
Zontak and Cohen [81]  0  0  
Mishne and Cohen [52]  90  90  
Boracchi et al. [8]  100  100  
Grosjean and Moisan [36]  30  20  
Davy et al. [20]  80  10 
TP pixels (in %)  FP pixels (in %)  TP anomalies (in %)  FP anomalies (in %)  
Aiger and Talbot [2]  100  100  
Zontak and Cohen [81]  60  60  
Mishne and Cohen [52]  50  90  
Boracchi et al. [8]  100  100  
Grosjean and Moisan [36]  70  100  
Davy et al. [20]  100  100 
Table 2 demonstrates that the predicted number of false positives (namely the theoretical NFA) is not always achieved. Indeed, the threshold for Table 2 was chosen so as to fix the theoretical number of false detections per image. Taking into account the total number of pixels, this means that only a handful of false detections should be made by any method in this table. Only two methods are close to this number, [2] and [20], while the other compared methods make too many false detections. Such a false positive target might seem too strict. Yet, minimizing the false alarm rate is an important requirement of anomaly detectors in fault detection: excessive false alarms may put a production chain in jeopardy. Since images generally contain a large number of pixels, if one wants to limit the false detection rate over a series of tested images, the false positive rate needs to be very small. The compared methods, except Mishne and Cohen [52], used the NFA framework as seen in Section 3. Therefore the discrepancy between the theoretical target and the obtained number of false alarms is explained by a null model inadequate for the images. In fact, only the background model of Aiger and Talbot [2] completely matches these very specific textures that are RPNs.
To better compare the methods, we also computed ROC curves for all methods (Figures 3 and 4), as well as the table of true positive and false positive areas for a fixed false positive rate (Table 3). The ROC curves are not affected by the choice of thresholds. Figure 3 is shown with a log scale for the number of false positives, because its low or very low false positive section is much more relevant for anomaly detection than the rest. From these ROC curves and tables we can conclude, for this specific example, that [20] performs best, followed closely by [2] (which theoretically should be optimal for this problem) and [36]. The trailing methods are [52], [8] and [81]. Nevertheless, if a moderate number of false positives can be tolerated, then [8] becomes quite attractive because of its high detection precision. Figure 4 illustrates the problem of false detections. Most methods require many false detections to achieve a reasonable detection rate. Only Aiger and Talbot [2] and Davy et al. [20] detect well while keeping a zero false detection rate. This confirms the results of Table 2. Table 3 also shows that relaxing the detection threshold helps precision, but leads to almost all images getting false positives; in practice, such a tolerance is too large.
5 Discussion and conclusions
Our analysis and experiments seem to confirm the view that generic anomaly detection methods can be built on purely qualitative assumptions. Such methods do not require a learning database for the background or the anomalies, but can learn normality directly from a single image in which anomalies may be present. This property is of course plausible only under the assumption that anomalies occupy a minor part of the image.
Since anomalies cannot be modeled, the focus of attention of all methods is the background model. Methods giving a stochastic model to the background, parametric or not, can only be applied to restricted classes of backgrounds. For this reason, our attention has been drawn to the thriving qualitative background models. Any assumption about a kind of global or local background homogeneity is a priori acceptable. The most restrictive models assume that the background is periodic, smooth, or even low-dimensional. This kind of strong regularity assumption does not extend to arbitrary images.
Another common sense principle is put forward by local contrast center-surround detectors, namely that anomalies generate anomalous local contrast. Yet center-surround methods suffer from the difficulty of defining a universal detection rule.
A more clever idea emerged with the Aiger and Talbot [2] method: transform the background into a homogeneous texture while the anomalies still stand out.
Meanwhile, the old idea of performing a background subtraction remains quite valid. Indeed, as pointed out very recently in [75], background subtraction may be used to return to an elementary background model for the residual, which might contain only noise.
The most general background models are merely qualitative. We singled out two of them as the most recent and powerful ones: the sparsity assumption and the selfsimilarity assumption. We found that two recent exponents use these assumptions to perform a sort of background subtraction: Carrera et al. [15] for sparsity and Davy et al. [20] for selfsimilarity.
Furthermore, we found that all methods required a strict control of the number of false alarms to become universal. Indeed, most methods were originally presented with at best an empirical threshold and at worst a comment saying that the threshold depends on the application. The first method proposing this control is the one by Grosjean and Moisan [36], and it was recently extended in Davy et al. [20]. Since [36] requires a stochastic background model, we concluded that a good universal model should:

subtract a background model that is merely qualitative (selfsimilar, sparse);

handle the residual as a stochastic process and detect anomalies as anomalies in colored noise;

possibly also whiten the residual before detecting the anomaly.
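The three steps above can be combined in a toy pipeline. In this sketch, the local median background, the PHOT-style whitening and the pixel-wise Gaussian test are simple stand-ins for the self-similar or sparse models and NFA tests discussed above; the image, the anomaly and the NFA target eps are synthetic choices:

```python
import numpy as np
from math import erfc

def subtract_background(img, k=5):
    """Step 1 (stand-in): subtract a qualitative background estimate.
    Here a local median; a self-similar or sparse model could be plugged in."""
    r = k // 2
    shifts = [np.roll(img, (dy, dx), axis=(0, 1))
              for dy in range(-r, r + 1) for dx in range(-r, r + 1)]
    return img - np.median(shifts, axis=0)

def whiten(residual):
    """Step 3 (optional): flatten the residual's Fourier modulus."""
    F = np.fft.fft2(residual)
    return np.real(np.fft.ifft2(F / np.maximum(np.abs(F), 1e-12)))

def detect(residual, eps=1.0):
    """Step 2: treat the residual as Gaussian noise and apply a pixel-wise
    a contrario test: detect where NFA = n_tests * tail < eps."""
    z = np.abs(residual - residual.mean()) / residual.std()
    tail = np.vectorize(erfc)(z / np.sqrt(2))
    return residual.size * tail < eps

# Synthetic smooth background (periodic, so the circular shifts are harmless)
rng = np.random.default_rng(0)
noise = rng.normal(size=(64, 64))
img = np.mean([np.roll(noise, (dy, dx), axis=(0, 1))
               for dy in (-1, 0, 1) for dx in (-1, 0, 1)], axis=0)
img[40, 10] += 6.0                     # point anomaly
mask = detect(whiten(subtract_background(img)))
```

Each stage is pluggable: a better background model only changes `subtract_background`, while the a contrario control of false alarms stays the same.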
This way, most methods are generalized in a common framework. We tested three such syncretic methods and compared them favorably with the three other most relevant methods taken from the main classes of background models. Our comparative tests were made on very diverse images. Our quantitative comparison tests were made on simulated ground truths with stochastic background.
Both tests seem to validate the possibility of detecting anomalies with very few false alarms using a merely qualitative background model. This fact is both surprising and exciting. It confirms that there has been significant progress in the past decade. We hope that this study, at the very least, provides users with useful generic tools that can be combined for any detection task.
6 Acknowledgments
Work supported by IDEX Paris-Saclay IDI 2016, ANR-11-IDEX-0003-02, ONR grant N00014-17-1-2552, CNES MISS project, Agencia Nacional de Investigación e Innovación (ANII, Uruguay) grant FCE_1_2017_135458, DGA Astrid ANR-17-ASTR-0013-01, DGA ANR-16-DEFA-0004-01, Programme ECOS Sud UdelaR - Paris Descartes U17E04, and MENRT.
Appendix A Appendix: Dual formulation of sparsity models
Sparsity based variational methods lack the direct interpretation enjoyed by other methods as to the proper definition of an anomaly. By reviewing the first and simplest method of this kind, proposed in [8], we shall see that its dual interpretation points to the detection of the worst anomaly. Let a dictionary represent "normal" image patches. For a given patch, the corresponding normal patch is the solution of the minimization problem
(41) 
One can derive the following dual optimization problem: Let ,
(42) 
The Lagrangian is in this case
(43)  
(44) 
The dual problem is then
(45)  
(46) 
Consider first the smooth term: this part is differentiable, so that
(47) 
therefore the infimum is achieved at the critical point. The infimum is in this case
(48) 
As for the l1 term: this part is not differentiable, but its subgradient exists. The subgradient of the l1 norm yields, componentwise (for all i),
(49)  
(50) 
A necessary condition to attain the infimum then follows. It provides an expression which, injected into the previous equation, gives
(51)  
(52)  
(53)  
(54) 
Finally
(55) 
Therefore the dual problem is
(56) 
which is equivalent to
(57) 
It can be reformulated in a penalized version as
(58) 
While one variable represents the "normal" part of the patch, the other represents the anomaly. Indeed, the constraint imposes the anomalous part to be far from the patches represented by the dictionary. Moreover, for a solution of the dual to exist (and for the duality gap to vanish), a compatibility condition is required, which confirms the previous observation. Notice that the solution of (58) exists by an obvious compactness argument and is unique by the strict convexity of the dual functional.
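Since the displayed equations of this appendix were lost in extraction, the following is a hedged LaTeX reconstruction of the duality argument in its standard l1 (lasso) form; the symbols x (patch), D (dictionary), s (code) and lambda are assumptions matching the text, not a verbatim restoration of (41)-(58):

```latex
% Assumed primal (sparse coding of a patch x over the dictionary D):
\min_{s}\;\tfrac12\|x - Ds\|_2^2 + \lambda\|s\|_1 .
% Dualizing through the Lagrangian, the smooth term attains its infimum at a
% critical point, while the subgradient of the l1 norm yields the constraint
% on D^T u. The dual is then the projection problem
\max_{u}\;\tfrac12\|x\|_2^2 - \tfrac12\|x - u\|_2^2
\quad \text{s.t.} \quad \|D^{\top}u\|_{\infty} \le \lambda ,
% with the primal-dual link u^* = x - D s^*: the dual variable u is the
% residual ("anomalous") part of the patch, constrained to be nearly
% orthogonal to the atoms of D.
```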
References
 Adler et al. [2015] Amir Adler, Michael Elad, Yacov Hel-Or, and Ehud Rivlin. Sparse coding with anomaly detection. Journal of Signal Processing Systems, 79(2):179–188, 2015.
 Aiger and Talbot [2010] Dror Aiger and Hugues Talbot. The phase only transform for unsupervised surface defect detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
 An and Cho [2015] Jinwon An and Sungzoon Cho. Variational autoencoder based anomaly detection using reconstruction probability. Special Lecture on IE, 2:1–18, 2015.
 Ashton [1998] Edward A Ashton. Detection of subpixel anomalies in multispectral infrared imagery using an adaptive bayesian classifier. IEEE Transactions on Geoscience and Remote Sensing, 36(2):506–517, 1998.
 Bland and Altman [1995] J Martin Bland and Douglas G Altman. Multiple significance tests: the Bonferroni method. British Medical Journal, 310(6973):170, 1995.
 Boiman and Irani [2007] Oren Boiman and Michal Irani. Detecting irregularities in images and in video. International Journal of Computer Vision, 74(1):17–31, 2007.
 Boracchi and Roveri [2014] Giacomo Boracchi and Manuel Roveri. Exploiting selfsimilarity for change detection. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), 2014.
 Boracchi et al. [2014] Giacomo Boracchi, Diego Carrera, and Brendt Wohlberg. Novelty detection in images by sparse representations. In Proceedings of the IEEE Symposium on Intelligent Embedded Systems (IES), 2014.
 Borji and Itti [2012] Ali Borji and Laurent Itti. Exploiting local and global patch rarities for saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
 Bruce and Tsotsos [2006] Neil Bruce and John Tsotsos. Saliency based on information maximization. In Advances in Neural Information Processing Systems, 2006.
 Buades et al. [2005] Antoni Buades, Bartomeu Coll, and JM Morel. A nonlocal algorithm for image denoising. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005.
 Buades et al. [2008] Antoni Buades, Bartomeu Coll, and Jean-Michel Morel. Nonlocal image and movie denoising. International Journal of Computer Vision, 76(2):123–139, 2008.
 Carrera et al. [2015] Diego Carrera, Giacomo Boracchi, Alessandro Foi, and Brendt Wohlberg. Detecting anomalous structures by convolutional sparse models. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), 2015.
 Carrera et al. [2016] Diego Carrera, Giacomo Boracchi, Alessandro Foi, and Brendt Wohlberg. Scaleinvariant anomaly detection with multiscale groupsparse models. In Proceedings of the IEEE International Conference on Image Processing (ICIP), 2016.
 Carrera et al. [2017] Diego Carrera, Fabio Manganini, Giacomo Boracchi, and Ettore Lanzarone. Defect detection in sem images of nanofibrous materials. IEEE Transactions on Industrial Informatics, 13(2):551–561, 2017.
 Chandola et al. [2009] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM computing surveys (CSUR), 41(3):15, 2009.
 Chen [2007] Xinjia Chen. A new generalization of Chebyshev inequality for random vectors. arXiv preprint arXiv:0707.0805, 2007.
 Colom and Buades [2013] Miguel Colom and Antoni Buades. Analysis and Extension of the Ponomarenko et al. Method, Estimating a Noise Curve from a Single Image. Image Processing On Line, 3:173–197, 2013.
 Dagobert [2017] Tristan Dagobert. Evaluation of high precision low baseline stereo vision algorithms. PhD thesis, Université Paris-Saclay, December 2017.
 Davy et al. [2018] Axel Davy, Thibaud Ehret, Mauricio Delbracio, and Jean-Michel Morel. Reducing anomaly detection in images to detection in noise. In Proceedings of the IEEE International Conference on Image Processing (ICIP), 2018.
 Desolneux et al. [2004] Agnès Desolneux, Lionel Moisan, and Jean-Michel Morel. Gestalt Theory and Computer Vision. Springer Netherlands, Dordrecht, 2004. ISBN 9781402020810.
 Desolneux et al. [2007] Agnès Desolneux, Lionel Moisan, and Jean-Michel Morel. From Gestalt theory to image analysis: a probabilistic approach, volume 34. Springer Science & Business Media, 2007.
 Di Martino et al. [2016] J. Matías Di Martino, Gabriele Facciolo, and Enric Meinhardt-Llopis. Poisson Image Editing. Image Processing On Line, 6:300–325, 2016.
 Ding et al. [2014] Xuemei Ding, Yuhua Li, Ammar Belatreche, and Liam P Maguire. An experimental evaluation of novelty detection methods. Neurocomputing, 135:313–327, 2014.
 Du and Zhang [2011] Bo Du and Liangpei Zhang. Randomselectionbased anomaly detector for hyperspectral imagery. IEEE Transactions on Geoscience and Remote sensing, 49(5):1578–1589, 2011.
 Efros and Leung [1999] Alexei A Efros and Thomas K Leung. Texture synthesis by nonparametric sampling. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 1999.
 Elhamifar et al. [2012] Ehsan Elhamifar, Guillermo Sapiro, and Rene Vidal. See all by looking at a few: Sparse modeling for finding representative objects. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
 Fischler and Bolles [1987] Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. In Readings in computer vision, pages 726–740. Elsevier, 1987.
 Galerne et al. [2011a] Bruno Galerne, Yann Gousseau, and Jean-Michel Morel. Random phase textures: Theory and synthesis. IEEE Transactions on Image Processing, 20(1):257–267, 2011a.
 Galerne et al. [2011b] Bruno Galerne, Yann Gousseau, and Jean-Michel Morel. Micro-Texture Synthesis by Phase Randomization. Image Processing On Line, 1:213–237, 2011b.
 Gao et al. [2008] Dashan Gao, Vijay Mahadevan, and Nuno Vasconcelos. The discriminant center-surround hypothesis for bottom-up saliency. In Advances in Neural Information Processing Systems, 2008.
 Goferman et al. [2012] Stas Goferman, Lihi Zelnik-Manor, and Ayellet Tal. Context-aware saliency detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(10):1915–1926, 2012.
 Goldman and Cohen [2004] Arnon Goldman and Israel Cohen. Anomaly detection based on an iterative local statistics approach. Signal Processing, 84(7):1225–1229, 2004.
 Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.
 Grompone von Gioi et al. [2012] Rafael Grompone von Gioi, Jérémie Jakubowicz, Jean-Michel Morel, and Gregory Randall. LSD: a Line Segment Detector. Image Processing On Line, 2:35–55, 2012.
 Grosjean and Moisan [2009] Bénédicte Grosjean and Lionel Moisan. Acontrario detectability of spots in textured backgrounds. Journal of Mathematical Imaging and Vision, 33(3):313–337, 2009. ISSN 09249907.
 Hawkins et al. [2002] Simon Hawkins, Hongxing He, Graham Williams, and Rohan Baxter. Outlier detection using replicator neural networks. In DaWaK, 2002.
 Hochberg and Tamhane [1987] Yosef Hochberg and Ajit Tamhane. Multiple comparison procedures. John Wiley, 1987.
 Honda and Nayar [2001] Toshifumi Honda and Shree K Nayar. Finding "anomalies" in an arbitrary image. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), volume 2, 2001.
 Huang et al. [2015] Xun Huang, Chengyao Shen, Xavier Boix, and Qi Zhao. Salicon: Reducing the semantic gap in saliency prediction by adapting deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
 Itti and Koch [2000] Laurent Itti and Christof Koch. A saliencybased search mechanism for overt and covert shifts of visual attention. Vision Research, 40(10):1489–1506, 2000.
 Itti et al. [1998] Laurent Itti, Christof Koch, and Ernst Niebur. A model of saliencybased visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254–1259, 1998.
 Julesz [1981] Bela Julesz. Textons, the elements of texture perception, and their interactions. Nature, 290(5802):91, 1981.
 Kumar [2003] Ajay Kumar. Neural network based detection of local textile defects. Pattern Recognition, 36(7):1645–1659, 2003.
 Lezama et al. [2014] José Lezama, Rafael Grompone von Gioi, Gregory Randall, and JeanMichel Morel. Finding vanishing points via point alignments in image primal and dual domains. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
 Lezama et al. [2017] JosÃ© Lezama, Gregory Randall, and Rafael Grompone von Gioi. Vanishing Point Detection in Urban Scenes Using Point Alignments. Image Processing On Line, 7:131–164, 2017.
 Lowe [1999] David G Lowe. Object recognition from local scaleinvariant features. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 1999.
 Mahadevan et al. [2010] Vijay Mahadevan, Weixin Li, Viral Bhalodia, and Nuno Vasconcelos. Anomaly detection in crowded scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
 Margolin et al. [2013] Ran Margolin, Ayellet Tal, and Lihi ZelnikManor. What makes a patch distinct? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
 Markou and Singh [2003] Markos Markou and Sameer Singh. Novelty detection: a review –part 1: statistical approaches. Signal processing, 83(12):2481–2497, 2003.
 Mishne and Cohen [2013] Gal Mishne and Israel Cohen. Multiscale anomaly detection using diffusion maps. IEEE Journal of Selected Topics in Signal Processing, 7(1):111–123, 2013.
 Mishne and Cohen [2014] Gal Mishne and Israel Cohen. Multiscale anomaly detection using diffusion maps and saliency score. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.
 Moisan and Stival [2004] Lionel Moisan and Bérenger Stival. A probabilistic criterion to detect rigid point matches between two images and estimate the fundamental matrix. International Journal of Computer Vision, 57(3):201–218, 2004.
 Moisan et al. [2012] Lionel Moisan, Pierre Moulon, and Pascal Monasse. Automatic Homographic Registration of a Pair of Images, with A Contrario Elimination of Outliers. Image Processing On Line, 2:56–73, 2012.
 Murray et al. [2011] Naila Murray, Maria Vanrell, Xavier Otazu, and C Alejandro Parraga. Saliency estimation using a non-parametric low-level vision model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
 Napoletano et al. [2018] Paolo Napoletano, Flavio Piccoli, and Raimondo Schettini. Anomaly detection in nanofibrous materials by CNN-based self-similarity. Sensors, 18(1):209, 2018.
 Navarro [2014] Jorge Navarro. Can the bounds in the multivariate chebyshev inequality be attained? Statistics & Probability Letters, 91:1–5, 2014.
 Patraucean et al. [2012] Viorica Patraucean, Pierre Gurdjos, and Rafael Grompone von Gioi. A parameterless ellipse and line segment detector with enhanced ellipse fitting. In Proceedings of the European Conference on Computer Vision (ECCV), 2012.
 Patraucean et al. [2013] Viorica Patraucean, Rafael Grompone von Gioi, and Maks Ovsjanikov. Detection of mirror-symmetric image patches. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
 Pérez et al. [2003] Patrick Pérez, Michel Gangnet, and Andrew Blake. Poisson image editing. ACM Transactions on Graphics (TOG), 22(3):313–318, 2003.
 Perng et al. [2010] Der-Baau Perng, Ssu-Han Chen, and Yuan-Shuo Chang. A novel internal thread defect auto-inspection system. The International Journal of Advanced Manufacturing Technology, 47(5-8):731–743, 2010.
 Pimentel et al. [2014] Marco AF Pimentel, David A Clifton, Lei Clifton, and Lionel Tarassenko. A review of novelty detection. Signal Processing, 99:215–249, 2014.
 Ponomarenko et al. [2007] Nikolay N Ponomarenko, Vladimir V Lukin, MS Zriakhov, Arto Kaarna, and Jaakko Astola. An automatic approach to lossy compression of AVIRIS images. In IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 2007.
 Riche et al. [2013] Nicolas Riche, Matei Mancas, Matthieu Duvinage, Makiese Mibulumukini, Bernard Gosselin, and Thierry Dutoit. RARE2012: A multi-scale rarity-based saliency detection with its comparative statistical analysis. Signal Processing: Image Communication, 28(6):642–658, 2013.
 Schlegl et al. [2017] Thomas Schlegl, Philipp Seeböck, Sebastian M. Waldstein, Ursula Schmidt-Erfurth, and Georg Langs. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. arXiv preprint arXiv:1703.05921, 2017.
 Schweizer and Moura [2000] Susan M Schweizer and José MF Moura. Hyperspectral imagery: Clutter adaptation in anomaly detection. IEEE Transactions on Information Theory, 46(5):1855–1871, 2000.
 Seo and Milanfar [2009] Hae Jong Seo and Peyman Milanfar. Static and space-time visual saliency detection by self-resemblance. Journal of Vision, 9(12):15–15, 2009.
 Singer et al. [2009] Amit Singer, Yoel Shkolnisky, and Boaz Nadler. Diffusion interpretation of nonlocal neighborhood filters for signal denoising. SIAM Journal on Imaging Sciences, 2(1):118–139, 2009.
 Stein et al. [2002] David WJ Stein, Scott G Beaven, Lawrence E Hoff, Edwin M Winter, Alan P Schaum, and Alan D Stocker. Anomaly detection from hyperspectral imagery. IEEE Signal Processing Magazine, 19(1):58–69, 2002.
 Tarassenko et al. [1995] Lionel Tarassenko, Paul Hayton, Nicholas Cerneaz, and Michael Brady. Novelty detection for the identification of masses in mammograms. In Fourth International Conference on Artificial Neural Networks, 1995.
 Tavakoli et al. [2011] Hamed Rezazadegan Tavakoli, Esa Rahtu, and Janne Heikkilä. Fast and efficient saliency detection using sparse sampling and kernel density estimation. In Scandinavian Conference on Image Analysis, 2011.
 Tax and Duin [1998] David MJ Tax and Robert PW Duin. Outlier detection using classifier instability. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), 1998.
 Tomasi and Manduchi [1998] Carlo Tomasi and Roberto Manduchi. Bilateral filtering for gray and color images. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 1998.
 Tout [2018] Karim Tout. Automatic vision system for surface inspection and monitoring: Application to wheel inspection. PhD thesis, Troyes University of Technology (UTT), April 2018.
 Tout et al. [2016] Karim Tout, Rémi Cogranne, and Florent Retraint. Fully automatic detection of anomalies on wheels surface using an adaptive accurate model and hypothesis testing theory. In 24th European Signal Processing Conference (EUSIPCO), 2016.
 Tout et al. [2017] Karim Tout, Florent Retraint, and Remi Cogranne. Automatic vision system for wheel surface inspection and monitoring. In ASNT Annual Conference 2017, 2017.
 Tsai and Hsieh [1999] DM Tsai and CY Hsieh. Automated surface inspection for directional textures. Image and Vision Computing, 18(1):49–62, 1999.
 Tsai and Huang [2003] Du-Ming Tsai and Tse-Yun Huang. Automated surface inspection for statistical textures. Image and Vision Computing, 21(4):307–323, 2003.
 Von Gioi et al. [2010] Rafael Grompone von Gioi, Jérémie Jakubowicz, Jean-Michel Morel, and Gregory Randall. LSD: A fast line segment detector with a false detection control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(4):722–732, 2010.
 Xie and Mirmehdi [2007] Xianghua Xie and Majid Mirmehdi. TEXEMS: Texture exemplars for defect detection on random textured surfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(8):1454–1464, 2007.
 Zontak and Cohen [2010] Maria Zontak and Israel Cohen. Defect detection in patterned wafers using anisotropic kernels. Machine Vision and Applications, 21(2):129–141, 2010.