Perceptual Image Anomaly Detection
Abstract
We present a novel method for image anomaly detection, where algorithms that use samples drawn from some distribution of “normal” data aim to detect out-of-distribution (abnormal) samples. Our approach combines an encoder and a generator for mapping an image distribution to a predefined latent distribution and vice versa. It leverages Generative Adversarial Networks to learn these data distributions and uses perceptual loss for the detection of image abnormality. To accomplish this goal, we introduce a new similarity metric, which expresses the perceived similarity between images and is robust to changes in image contrast. Secondly, we introduce a novel approach for the selection of weights of a multi-objective loss function (image reconstruction and distribution mapping) in the absence of a validation dataset for hyperparameter tuning. After training, our model measures the abnormality of the input image as the perceptual dissimilarity between it and the closest generated image of the modeled data distribution. The proposed approach is extensively evaluated on several publicly available image benchmarks and achieves state-of-the-art performance.
1 Introduction
Anomaly detection is one of the most important problems in a range of real-world settings, including medical applications [litjens2017survey], cyber-intrusion detection [kwon2017survey], fraud detection [abdallah2016fraud], anomalous event detection in videos [kiran2018overview] and over-generalization problems of neural networks [spigler2019denoising]. Anomaly detection tasks generally involve the use of samples of a “normal” class, drawn from some distribution, to build a classifier that is able to detect “abnormal” samples, i.e. outliers with respect to the aforementioned distribution. Although anomaly detection is well-studied in a range of domains, image anomaly detection is still a challenge due to the complexity of distributions over images.
Generative Adversarial Networks (GANs) [goodfellow2014generative] are among the most promising new deep anomaly detection approaches. One network, called the generator, is trained to transform latent vectors, drawn from a latent distribution, to images in such a way that the second network, the discriminator, cannot distinguish between real images and generated ones. Thus, after training, the generator performs a mapping of the latent distribution to the data distribution. This property has been used [schlegl2017unsupervised, deecke2018image, perera2019ocgan] to estimate the likelihood of abnormality of an input: if there is a vector in latent space which, after passing through the generator, reconstructs the input object, the object is normal; otherwise it is not. The difference between an input and its closest reconstruction (the reconstruction error) is used as an anomaly score for this object.
Although there is a range of methods that use GANs for anomaly detection, none of them were developed specifically for anomaly detection on images. Usually, they apply the L1-norm or Mean Squared Error (MSE) between pixels to compute a reconstruction error, which does not correspond to the human understanding of the similarity between two images. Another problem of GAN-based approaches is how to find the latent vector that, after passing through the generator, recovers the input object. Previously, this was performed by a gradient descent optimization procedure [schlegl2017unsupervised, deecke2018image], or by co-training the generator and an encoder that recovers the latent vector [zenati2018efficient, zenati2018adversarially, perera2019ocgan]. However, existing techniques are either time-consuming [schlegl2017unsupervised, deecke2018image], difficult to train [zenati2018efficient, zenati2018adversarially], or consist of complex multi-step learning procedures [perera2019ocgan]. A further problem is that the complete loss function consists of a sum of many components with weighting coefficients as hyperparameters. The lack of a validation set (we do not have any anomaly examples during training) makes it difficult to choose these coefficients.
[Figure 1: three image pairs (a), (b), (c) with relative-perceptual-L1 values 0.52, 0.49, and 0.47, respectively.]
In our work we propose solutions for each of these three problems:

We developed a new metric that measures the similarity between the perception of two images. Our metric, called relative-perceptual-L1 loss, is based on perceptual loss [johnson2016perceptual], but is more robust to noise and changes in image contrast (Figure 1).

We propose a new technique for training an encoder that predicts a latent vector jointly with the generator. We construct a loss function in such a way that the encoder predicts a vector belonging to the latent distribution, and that the image reconstructed from this vector by the generator is similar to the input.

We propose a way to choose the weighting coefficients in the complete loss functions for the encoder and the generator. We base our solution on the norm of the gradients (with respect to network parameters) of each loss function, to balance the contribution of all losses during the training process.
The proposed approach, called Perceptual Image Anomaly Detection (PIAD), allows us to improve performance on several well-known datasets. We experimented with MNIST, Fashion MNIST, COIL-100, CIFAR-10, LSUN and CelebA, and made an extensive comparison with a wide range of anomaly detection approaches of different paradigms.
2 Related Work
Anomaly detection has been extensively studied in a wide range of domains [chalapathy2019deep]. However, anomaly detection on image data is still challenging. Classical approaches, such as explicit modeling of the latent space using KDE [parzen1962estimation] or the One-Class SVM [chen2001one], which learns a boundary around samples of a normal class, show poor quality when applied to image anomaly detection tasks. Due to the curse of dimensionality, these algorithms are weak at modeling complex high-dimensional distributions.
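As a concrete illustration, both shallow baselines can be run with a few lines of scikit-learn. The data here is a synthetic stand-in for flattened image vectors; hyperparameter values are illustrative, not those used in any of the cited works.

```python
# Sketch: shallow anomaly-detection baselines on flattened feature vectors.
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
normal_train = rng.normal(0.0, 1.0, size=(200, 64))  # stand-in for flattened images
test_batch = rng.normal(0.0, 1.0, size=(10, 64))

# One-Class SVM: learns a boundary around the normal class;
# negated decision_function so that higher = more anomalous.
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(normal_train)
svm_scores = -ocsvm.decision_function(test_batch)

# KDE: explicit density model; low log-density = anomalous.
kde = KernelDensity(kernel="gaussian", bandwidth=1.0).fit(normal_train)
kde_scores = -kde.score_samples(test_batch)

print(svm_scores.shape, kde_scores.shape)
```

Both estimators operate directly on raw pixel vectors, which is exactly where the curse of dimensionality hurts them on images.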
Deep autoencoders play an important role among anomaly detection methods [sakurada2014anomaly, an2015variational, zhou2017anomaly]. Autoencoders that perform dimension reduction for normal samples learn some common factors inherent in normal data. Abnormal samples do not contain these factors and thus cannot be accurately reconstructed by autoencoders. However, image anomaly detection is still challenging for autoencoders, and usually they are applied only on simple abnormal samples, when the variability of normal images is low.
There are also “mixed” approaches that use autoencoders or other deep models for representation learning. GPND [pidhorskyi2018generative] leverages an adversarial autoencoder to create a low-dimensional representation and then uses a probabilistic interpretation of the latent space to obtain an anomaly score. The method described in [abati2019latent] models a latent distribution obtained from a deep autoencoder using an autoregressive network. In Deep SVDD [ruff2018deep], Ruff et al. show how to train a one-class classification objective together with deep feature representation.
GANs [goodfellow2014generative] created a new branch in the development of image anomaly detection. GAN-based approaches [schlegl2017unsupervised, deecke2018image, zenati2018efficient, zenati2018adversarially, perera2019ocgan] differ in two parts: (i) how to find latent vectors that correspond to the input images, and (ii) how to estimate abnormality based on the input image and the reconstructed one. For the second problem, these methods use a linear combination of the L1-norm or the MSE between the input image and the reconstruction (the reconstruction error), and the discriminator’s prediction of the reality of the reconstructed image. For the first problem, the approaches AnoGAN [schlegl2017unsupervised] and ADGAN [deecke2018image] propose to use time-consuming gradient descent optimization of a latent vector. Other approaches train an encoder to predict a latent vector for each image. Figure 2 demonstrates the differences between the existing approaches. ALAD [zenati2018adversarially] and [zenati2018efficient] train the encoder adversarially: the adversarial loss computed by the discriminator, which takes pairs (image, vector), forces the encoder to predict a latent vector that reconstructs the input image. However, the discriminators of such models are trained with a cross-entropy loss function, which causes an unstable training process. The OCGAN model trains a denoising autoencoder. To improve the quality of the mapping, the authors added two discriminators, one in latent space and one in image space, and a classifier which searches for hard negative examples (badly generated images).
3 Perceptual Image Anomaly Detection
Conceptually, the idea of PIAD follows OCGAN. We apply the power of GANs twice: once for building a mapping from the latent space to the image space, and again to create an inverse mapping. A generator and an encoder are trained jointly to satisfy three conditions (see Figure 2(d)):

The generator G performs a mapping from the latent distribution p_z to the data distribution p_x;

The encoder E performs a mapping from p_x to p_z;

The image which the generator recovers from the latent vector predicted by the encoder must be close to the original image (reconstruction term): G(E(x)) ≈ x.
To accomplish conditions 1 and 2, we train the generator and the encoder with adversarial losses. Therefore, two discriminators, D_x and D_z, are required. To evaluate the reconstruction term, we propose to use our new relative-perceptual-L1 loss.
At its core, however, our approach differs from OCGAN. OCGAN is a denoising autoencoder with a highly constrained latent space. On top of the reconstruction loss, it uses an adversarial loss to ensure that the decoder (the generator in our notation) can reproduce only normal examples. Our approach, in contrast, is based on the power of adversarial loss for mapping two distributions. In practice, OCGAN differs in the classifier component, which helps to find weak places in the latent space that produce non-“normal” images, but makes the whole training process considerably more complicated and multi-step. Also, we do not add noise to the image before passing it through the encoder.
In order to train the encoder and generator to minimize a multi-objective loss function, we propose a new way of setting the weights of the loss functions that equalizes the contribution of each loss in the training process. Since our approach relies on the gradients of the losses with respect to the network parameters, we call it the gradient-normalizing weight policy.
After training the proposed model on samples of a normal class, we suggest predicting the abnormality of a new example x by evaluating the relative-perceptual-L1 loss between the input x and its reconstruction G(E(x)):
(1) A(x) = L_{rel-perc-L1}(x, G(E(x)))
We consider the relative-perceptual-L1 loss in more detail in Section 3.1, the procedure for training the models in Section 3.2, and the gradient-normalizing weight policy in Section 3.3.
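The scoring pipeline can be sketched as follows. The encoder, generator, and dissimilarity here are trivial stand-ins (not the trained networks or the loss of Section 3.1); only the composition A(x) = L(x, G(E(x))) mirrors the method.

```python
# Minimal numpy sketch of the anomaly score: encode, decode, compare.
import numpy as np

def encoder(x):            # stand-in E: image -> latent vector (per-channel mean)
    return x.mean(axis=(0, 1))

def generator(z):          # stand-in G: latent vector -> constant image
    return np.broadcast_to(z, (8, 8, z.shape[0])).copy()

def rel_perc_l1(x, y):     # placeholder dissimilarity (the real one uses VGG features)
    return float(np.abs(x - y).mean())

def anomaly_score(x):
    return rel_perc_l1(x, generator(encoder(x)))

x = np.ones((8, 8, 3))
print(anomaly_score(x))    # a perfectly reconstructable input scores 0.0
```

Inputs that the toy encoder/generator pair cannot reproduce (any non-constant image here) receive a strictly positive score, which is the behavior the real model exhibits for abnormal images.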
3.1 Relative-perceptual-L1 Loss
Features obtained by a neural network, trained on a large dataset for the task of object recognition, can capture high-level image content without binding to exact pixel locations [gatys2015texture]. In [johnson2016perceptual] Johnson et al. proposed a content distance between two images, called perceptual loss: this metric computes the MSE between features taken at a deep level of a neural network that has been pre-trained on an object classification task.
Let f(x) be the feature map obtained from some deep layer of the network on image x, and C × H × W the shape of this feature map. Then the perceptual loss between images x and y is determined as:
(2) L_{perc}(x, y) = \frac{1}{CHW} \| f(x) - f(y) \|_2^2
However, perceptual loss is very sensitive to changes in image contrast. Figure 1 shows three pairs of images: pairs (b) and (c) have lower contrast than (a). Perceptual loss drops by 22% for pair (b) compared to (a), although to a human observer pair (b) differs from pair (a) very little. Thus, if we used perceptual loss for computing the anomaly score, the model would tend to predict lower-contrast images as less abnormal. Another problem is that perceptual loss applies the MSE over features, but the MSE penalizes noise in the obtained feature values very heavily.
We tackled these problems and propose relative-perceptual-L1 loss, which is robust to contrast and noise. First of all, we noticed that features obtained by different filters can have a different scatter of values. As an example, Figure 3 (left) shows the standard deviations of filter responses of some deep layer of VGG-19 [simonyan2014very], computed over ImageNet [russakovsky2015imagenet]. We visualize the standard deviations since they indicate the overall magnitude of the features, which are themselves distributed around zero. The standard deviations differ by a factor of 2-3, which means that the contributions per filter vary by a factor of 2-3 as well. Therefore, as the first step of relative-perceptual-L1, we propose to normalize the obtained deep features by the mean and std of the filter responses, precalculated over a large dataset such as ImageNet. Secondly, we propose to use the L1-norm instead of the MSE, since the L1-norm is more robust to noise. Thirdly, to make the loss more resistant to contrast, we investigated how feature values behave under changes of contrast. Figure 3 (right) illustrates this behavior: as image contrast is reduced, the average feature value decreases. Consequently, the absolute error (the difference between features), which is used in perceptual loss, decreases as well for each pair of lower-contrast images. Therefore, we propose to use not the absolute error, but the relative error, which measures the ratio of the absolute error of the features to the average magnitude of these features.
Consequently, relative-perceptual-L1 is calculated as follows:
(3) \hat{f}(x) = (f(x) - \mu) / \sigma
(4) L_{rel-perc-L1}(x, y) = \frac{\| \hat{f}(x) - \hat{f}(y) \|_1}{\| \hat{f}(x) \|_1}
where \mu and \sigma are the precalculated mean and std of the filter responses.
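A minimal numpy sketch of this loss, assuming the deep features have already been extracted (in the paper, from a deep layer of VGG-19) and that `mu` and `sigma` are the precomputed per-filter statistics. The exact form of Eqs. (3)-(4) is our reconstruction from the surrounding text and should be checked against the original.

```python
# Sketch of relative-perceptual-L1 on precomputed C x H x W feature maps.
import numpy as np

def relative_perceptual_l1(feat_x, feat_y, mu, sigma, eps=1e-8):
    # (3): normalize each filter response by its precomputed mean/std
    fx = (feat_x - mu[:, None, None]) / sigma[:, None, None]
    fy = (feat_y - mu[:, None, None]) / sigma[:, None, None]
    # (4): relative error -- L1 distance scaled by the magnitude of the reference
    return np.abs(fx - fy).sum() / (np.abs(fx).sum() + eps)

C, H, W = 4, 3, 3
rng = np.random.default_rng(0)
fx = rng.normal(size=(C, H, W))
mu = fx.mean(axis=(1, 2))            # per-filter statistics (illustrative)
sigma = fx.std(axis=(1, 2)) + 1e-8

print(relative_perceptual_l1(fx, fx, mu, sigma))        # identical features -> 0.0
print(relative_perceptual_l1(fx, 0.9 * fx, mu, sigma))  # lower-contrast copy -> small positive value
```

Dividing by the magnitude of the reference features is what makes the loss tolerant to a uniform contrast reduction: both numerator and denominator shrink together.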
3.2 Training Objective
To train both discriminators we use the Wasserstein GAN with Gradient Penalty objective (WGAN-GP) [gulrajani2017improved]. The training of a Wasserstein GAN is more stable than that of a classical GAN [goodfellow2014generative] (which was used in [zenati2018efficient, zenati2018adversarially, perera2019ocgan]): it prevents mode collapse and does not require a carefully tuned schedule of generator/discriminator training. Thus, the discriminator D_x learns by minimizing the following loss:
(5) L_{D_x} = \mathbb{E}_{z \sim p_z}[D_x(G(z))] - \mathbb{E}_{x \sim p_x}[D_x(x)] + \lambda \cdot GP
where GP is the Gradient Penalty regularization [gulrajani2017improved] and \lambda is a weighting parameter. In the same way, D_z minimizes L_{D_z}. The adversarial loss of the generator is
(6) L_{adv-G} = -\mathbb{E}_{z \sim p_z}[D_x(G(z))]
The adversarial loss of the encoder, L_{adv-E}, is computed in the same way. The reconstruction loss is measured using the proposed relative-perceptual-L1 loss:
(7) L_{rec} = L_{rel-perc-L1}(x, G(E(x)))
Thus, the total objectives for the encoder and generator are as follows:
(8) L_E = L_{adv-E} + \alpha \cdot L_{rec}
(9) L_G = L_{adv-G} + \beta \cdot L_{rec}
where \alpha and \beta are weighting parameters. The training process alternates several steps of optimization of the discriminators with one step of joint optimization of the generator and the encoder. The parameters \alpha and \beta are updated at regular intervals following our gradient-normalizing weight policy. The full training procedure is summarized in Algorithm 1. (The steps “update gradient history” and “select weighting parameters” are explained in detail in the next section.)
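The structure of these objectives can be illustrated on a toy *linear* critic D(x) = w·x, whose input gradient is exactly w, so the gradient penalty has a closed form. All tensors, the penalty weight, and the reconstruction term below are synthetic stand-ins, not values from the paper.

```python
# Numpy sketch of the WGAN-GP-style objectives for a toy linear critic.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=8)                  # linear critic parameters
real = rng.normal(size=(16, 8))         # batch of real samples
fake = rng.normal(size=(16, 8))         # batch G(z) of generated samples

def critic(x):
    return x @ w

# Discriminator loss with gradient penalty: for a linear critic the input
# gradient is w everywhere, so ||grad_x D|| = ||w|| at every interpolate.
lambda_gp = 10.0
gp = lambda_gp * (np.linalg.norm(w) - 1.0) ** 2
d_loss = critic(fake).mean() - critic(real).mean() + gp

# Adversarial loss of the generator.
g_adv = -critic(fake).mean()

# Total objective: adversarial term plus a weighted reconstruction term
# (alpha plays the same role in the encoder objective).
rec = np.abs(real - fake).mean()        # placeholder reconstruction loss
beta = 1.0
g_total = g_adv + beta * rec
print(d_loss, g_total)
```

In the real model the critic is a deep network and the penalty is evaluated at random interpolates between real and generated samples; the linear case is only meant to make the loss arithmetic inspectable.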
3.3 Gradient-normalizing Weight Policy
Our objective function consists of the sum of multiple losses. To find weighting parameters for these losses, we cannot use cross-validation, because no anomaly examples are available to calculate anomaly detection quality. The work [perera2019ocgan] chooses the weights empirically based on reconstruction quality. However, this requires a person to manually select the coefficients for each experiment, which is neither objective nor reproducible.
In order to choose the weighting parameters automatically, we need to base our solution on measured values of an experiment. Let w be the vector of network parameters, let L_1 and L_2 be losses calculated for this network, and let
(10) L = L_1 + \lambda L_2
Then
(11) \frac{\partial L}{\partial w_i} = \frac{\partial L_1}{\partial w_i} + \lambda \frac{\partial L_2}{\partial w_i}
Depending on the nature of the loss functions L_1 and L_2, the norms of \partial L_1 / \partial w and \partial L_2 / \partial w can differ by a factor of ten or even a hundred. The coefficient \lambda regulates the relative influence of the loss functions in the total gradient with respect to a parameter w_i. To make the contributions of the loss functions equal, we can choose the coefficient \lambda_i in the following way:
(12) \lambda_i = \left| \frac{\partial L_1}{\partial w_i} \right| \Big/ \left| \frac{\partial L_2}{\partial w_i} \right|
However, due to the use of stochastic optimization, the gradients are very noisy during training. To make this process more robust and stable, we propose to average the coefficients over all network parameters and over their previous values (history information). Our approach is summarized in Algorithm 2 and Algorithm 3.
In short: for each loss, we calculate the derivative (by backpropagating the loss) with respect to each network weight w_i. Then, for each convolutional layer, we compute the L2-norm of the derivative with respect to the layer's weight matrix and store it. This is done at regular intervals during training, and all previously calculated values are kept, thus creating a gradient history per loss, per layer. We calculate the L2-norm per layer (and not per weight w_i) to reduce the size of the stored information; computing a single norm over all network parameters would lose too much information, since the last layers usually have more parameters, and hence information about the gradients of the first layers would be lost. First, the coefficient \lambda is calculated per layer: we perform exponential smoothing of the history values of each loss (to make the values robust to noise), and then calculate the average ratio between the last N entries in the gradient history of L_1 and the same entries for L_2. The final value of \lambda is computed as the average over the per-layer \lambdas.
Our approach generalizes straightforwardly to a loss function consisting of more than two contributions. It also leaves room for research on which norm to use and how to compute the final weights.
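The steps above can be sketched compactly for two losses. The smoothing factor, history length, and layer names are illustrative assumptions, not values from the paper.

```python
# Sketch of the gradient-normalizing weight policy: per-layer L2 gradient
# norms are kept as a history, exponentially smoothed, and the weight is the
# average per-layer ratio between the two losses.
import numpy as np

def smooth(history, factor=0.7):
    # simple exponential smoothing over a list of gradient norms
    s = history[0]
    for v in history[1:]:
        s = factor * s + (1 - factor) * v
    return s

def gradient_normalizing_weight(hist_l1, hist_l2, last_n=5):
    # hist_l*: dict layer_name -> list of stored L2 norms of that loss's gradient
    ratios = []
    for layer in hist_l1:
        a = smooth(hist_l1[layer][-last_n:])
        b = smooth(hist_l2[layer][-last_n:])
        ratios.append(a / b)               # per-layer analogue of Eq. (12)
    return float(np.mean(ratios))          # average over layers

# Toy histories: L1's gradients are ~4x larger than L2's on both layers.
hist_rec = {"conv1": [4.0, 4.2, 3.8], "conv2": [8.0, 7.9, 8.1]}
hist_adv = {"conv1": [1.0, 1.1, 0.9], "conv2": [2.0, 2.1, 1.9]}
print(gradient_normalizing_weight(hist_rec, hist_adv))  # roughly 4
```

The resulting weight multiplies the smaller-gradient loss so that both terms pull on the parameters with comparable strength.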
4 Experiments
We show the effectiveness of the proposed approach by evaluation on six publicly available datasets and compare our method with a diverse collection of state-of-the-art methods for out-of-distribution detection, including state-of-the-art GAN-based approaches.
Datasets. For evaluation we use the following well-known datasets (Table 1): MNIST [lecun2010mnist] and Fashion MNIST (fMNIST) [xiao2017fashion], COIL-100 [nene1996columbia] (images of 100 different objects against a black background, where views of each object are taken at pose intervals of 5 degrees), CIFAR-10 [krizhevsky2009learning], LSUN [xiao2010sun] (we used only the bedroom and conference room classes), and the aligned & cropped face attributes dataset CelebA [liu2015faceattributes].
In all experiments, images of MNIST, Fashion MNIST and COIL-100 were resized to a common resolution, examples of the LSUN dataset were downscaled, and images of CelebA were centrally cropped and then downscaled.
|             | MNIST  | fMNIST | COIL-100 | CIFAR-10 | LSUN (bedr.) | CelebA     |
| # classes   | 10     | 10     | 100      | 10       | 1            | 40 attrib. |
| # instances | 70,000 | 70,000 | 7,200    | 60,000   | 3,033,342    | 202,599    |
Competing Methods. As shallow baselines we consider standard methods such as OC-SVM [chen2001one] and KDE [parzen1962estimation]. We also test the performance of our approach against four state-of-the-art GAN-based methods: AnoGAN [schlegl2017unsupervised], ADGAN [deecke2018image], OCGAN [perera2019ocgan] and ALAD [zenati2018adversarially]. Finally, we report the performance of three deep learning approaches from different paradigms: Deep SVDD [ruff2018deep], GPND [pidhorskyi2018generative], and the Latent Space Autoregression approach [abati2019latent] (whose results will be reported under the name LSA). All these methods have been briefly described in Section 2.
For ADGAN [deecke2018image], OCGAN [perera2019ocgan], ALAD [zenati2018adversarially], GPND [pidhorskyi2018generative] and LSA [abati2019latent] we used the results as reported in the corresponding publications. Results for OC-SVM, KDE and AnoGAN were obtained from [abati2019latent].
Evaluation Protocol. To test the methods on classification datasets, we use a one-vs-all evaluation scheme, which has recently been increasingly used in anomaly detection papers [deecke2018image, perera2019ocgan, zenati2018adversarially, abati2019latent, ruff2018deep, pidhorskyi2018generative]: to simulate an out-of-distribution condition, one class of a dataset is considered as normal data, while images of the other classes are considered as abnormal. We evaluate results quantitatively using the area under the ROC curve (ROC AUC), which is a standard metric for this task.
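Computing the metric for one one-vs-all experiment reduces to a single scikit-learn call; the anomaly scores below are synthetic stand-ins for a model's output.

```python
# Sketch: ROC AUC for a one-vs-all split (0 = normal class, 1 = anomaly).
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
labels = np.array([0] * 100 + [1] * 100)
scores = np.concatenate([rng.normal(0.3, 0.1, 100),   # normal: low anomaly scores
                         rng.normal(0.7, 0.1, 100)])  # abnormal: high anomaly scores
print(roc_auc_score(labels, scores))                  # close to 1.0 for well-separated scores
```

ROC AUC is threshold-free, which is why it is the standard choice here: no decision boundary on the anomaly score needs to be fixed.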
Implementation Details. In all experiments we used pre-activation ResNet blocks to build our generator, encoder, and discriminators; for computing the relative-perceptual-L1 loss we used a VGG-19 [simonyan2014very] network, pre-trained on ImageNet [russakovsky2015imagenet]. Other implementation details are presented in the supplementary material.
4.1 Results
Shallow methods: OC-SVM, KDE; GAN-based methods: AnoGAN, ADGAN, OCGAN, ALAD; deep methods: LSA, Deep SVDD.

MNIST
| Class | OC-SVM | KDE   | AnoGAN | ADGAN | OCGAN | ALAD  | LSA   | Deep SVDD | PIAD (our) |
| 0     | 0.988  | 0.885 | 0.926  | 0.999 | 0.998 | –     | 0.993 | 0.980     | 0.996      |
| 1     | 0.999  | 0.996 | 0.995  | 0.992 | 0.999 | –     | 0.999 | 0.997     | 0.999      |
| 2     | 0.902  | 0.710 | 0.805  | 0.968 | 0.942 | –     | 0.959 | 0.917     | 0.985      |
| 3     | 0.950  | 0.693 | 0.818  | 0.953 | 0.963 | –     | 0.966 | 0.919     | 0.981      |
| 4     | 0.955  | 0.844 | 0.823  | 0.960 | 0.975 | –     | 0.956 | 0.949     | 0.960      |
| 5     | 0.968  | 0.776 | 0.803  | 0.955 | 0.980 | –     | 0.964 | 0.885     | 0.976      |
| 6     | 0.978  | 0.861 | 0.890  | 0.980 | 0.991 | –     | 0.994 | 0.983     | 0.995      |
| 7     | 0.965  | 0.884 | 0.898  | 0.950 | 0.981 | –     | 0.980 | 0.946     | 0.984      |
| 8     | 0.853  | 0.669 | 0.817  | 0.959 | 0.939 | –     | 0.953 | 0.939     | 0.982      |
| 9     | 0.955  | 0.825 | 0.887  | 0.965 | 0.981 | –     | 0.981 | 0.965     | 0.989      |
| avg   | 0.951  | 0.814 | 0.866  | 0.968 | 0.975 | –     | 0.975 | 0.948     | 0.985      |

CIFAR-10
| Class    | OC-SVM | KDE   | AnoGAN | ADGAN | OCGAN | ALAD  | LSA   | Deep SVDD | PIAD (our) |
| airplane | 0.630  | 0.658 | 0.708  | 0.661 | 0.757 | –     | 0.735 | 0.617     | 0.837      |
| car      | 0.440  | 0.520 | 0.458  | 0.435 | 0.531 | –     | 0.580 | 0.659     | 0.876      |
| bird     | 0.649  | 0.657 | 0.664  | 0.636 | 0.640 | –     | 0.690 | 0.508     | 0.753      |
| cat      | 0.487  | 0.497 | 0.510  | 0.488 | 0.620 | –     | 0.542 | 0.591     | 0.602      |
| deer     | 0.735  | 0.727 | 0.722  | 0.794 | 0.723 | –     | 0.761 | 0.609     | 0.808      |
| dog      | 0.500  | 0.496 | 0.505  | 0.640 | 0.620 | –     | 0.546 | 0.657     | 0.713      |
| frog     | 0.725  | 0.758 | 0.707  | 0.685 | 0.723 | –     | 0.751 | 0.677     | 0.839      |
| horse    | 0.533  | 0.564 | 0.471  | 0.559 | 0.575 | –     | 0.535 | 0.673     | 0.842      |
| ship     | 0.649  | 0.680 | 0.713  | 0.798 | 0.820 | –     | 0.717 | 0.759     | 0.867      |
| truck    | 0.508  | 0.540 | 0.458  | 0.643 | 0.554 | –     | 0.548 | 0.731     | 0.849      |
| avg      | 0.586  | 0.610 | 0.592  | 0.634 | 0.657 | 0.607 | 0.641 | 0.648     | 0.799      |
MNIST and CIFAR-10. Following [deecke2018image, perera2019ocgan, zenati2018adversarially, abati2019latent, ruff2018deep], we evaluated our approach on MNIST and CIFAR-10 using a train-test dataset split, where the training split contains part of the known class and the test split contains the unknown classes and the remainder of the known class. We ran each experiment 3 times with different initializations and present the averaged ROC AUC in Table 2.
Since the MNIST dataset is easy, all methods work well. However, our approach improves performance on several tricky digits (like 3 and 8) and outperforms all other approaches on average over the dataset. On the more diverse and complex CIFAR-10, the superiority of the proposed method is even more noticeable. This dataset contains 10 image classes with extremely high intra-class variation, and the images are so small that even a human cannot always distinguish the kind of object in the image. Our approach, based on the perceptual similarity of images, can better capture class-specific information of an image, and hence improves the performance of anomaly detection in almost all experiments.
fMNIST and COIL-100. The work on GPND [pidhorskyi2018generative] uses a different train-test separation to evaluate performance. For a fair comparison, we repeat their evaluation scheme. The model trains on 80% randomly sampled instances of a normal class. The remaining 20% of normal data and an equal number of randomly selected anomaly instances are used for testing. We report the average performance of GPND and PIAD on the fMNIST and COIL-100 datasets in Table 3 (left), along with OCGAN, since it came out second best on the previous datasets and its fMNIST/COIL-100 results are reported in the same evaluation scheme. For the COIL-100 dataset we randomly selected one class to be used as normal and repeated this procedure 30 times (as was done in [pidhorskyi2018generative]). The comparison shows that PIAD excels on both datasets.
LSUN. We also compare PIAD with the ADGAN approach on the LSUN dataset, training a model on images of bedrooms and treating images of the conference room class as anomalies. We achieve a ROC AUC of 0.781, against 0.641 for ADGAN.
CelebA. In order to test our approach in conditions that are closer to a real-world use case, we experimented on the CelebA dataset, where we use the attributes (Bald, Mustache, Bangs, Eyeglasses, Wearing_Hat) to split the data into normal/anomaly cases. We train our model on “normal” images where one of these attributes is not present, and test against anomaly images where that same attribute is present. Table 3 (center) shows the results.
The anomaly attributes Eyeglasses and Wearing_Hat are the easiest for PIAD. As shown in Figure 3(c) (4th and 5th examples), passing an image through the encoder and the generator removes glasses and hats from the images. The anomalies Mustache and Bangs are more of a challenge, but we noticed that our model removes the mustache as well, and makes bangs more transparent. However, our model failed to recognize the Bald anomaly. This may be a result of the complexity of the anomaly (see Figure 3(c), first image, where indeed the man is not completely bald) and also of inexact annotation (in Figure 3(c) the first image is annotated as bald, but the second is not).
4.2 Ablation Study
We also performed an ablation study on the CIFAR-10 dataset to show the effectiveness of each proposed component of PIAD. We considered 5 scenarios: baseline, i.e. our model with the MSE as reconstruction error during training and as anomaly score, with weighting parameters chosen empirically by human inspection of the quality of generated examples and reconstructions; + gr-norm w, the same, but with the gradient-normalizing weight policy; + perc, where we further changed the MSE to perceptual loss; + perc-L1, where we added the per-filter normalization in perceptual loss and used the L1-norm over features instead of the MSE; and + rel-perc-L1, where we used the proposed loss of Section 3.1. We present the average ROC AUC over CIFAR-10 in Table 3 (right).
We note that our proposed gradient-normalizing weight policy achieves the same result as weights carefully selected by a human after running the model several times. Each further modification improved the results as well. Figure 5 shows examples of images that were seen as the least and the most likely to be an anomaly, for different reconstruction losses. We note that only relative-perceptual-L1 loss is not prone to selecting monochrome images as the least anomalous; furthermore, this loss selected the classes that are closest to the car class as less anomalous: truck and ship.



5 Conclusion
We introduced a deep anomaly detection approach, built directly for the image domain and exploiting knowledge of perceptual image similarity. For the latter, we proposed a new metric that is based on perceptual loss, but is more robust to noise and changes in image contrast. As part of our work, we proposed an approach for selecting the weights of a multi-objective loss function, which equalizes the contribution of all losses in the training process. We demonstrated the superiority of our approach against state-of-the-art GAN-based methods and deep approaches of other paradigms on a diverse collection of image benchmarks. In the future, we plan to perform a more extensive evaluation of our method on higher-resolution image data, such as medical images.