Perceptual Image Anomaly Detection
We present a novel method for image anomaly detection, a task in which algorithms use samples drawn from some distribution of “normal” data to detect out-of-distribution (abnormal) samples. Our approach combines an encoder and a generator to map an image distribution to a predefined latent distribution and vice versa. It leverages Generative Adversarial Networks to learn these data distributions and uses perceptual loss for the detection of image abnormality. To accomplish this goal, we first introduce a new similarity metric, which expresses the perceived similarity between images and is robust to changes in image contrast. Secondly, we introduce a novel approach for the selection of weights of a multi-objective loss function (image reconstruction and distribution mapping) in the absence of a validation dataset for hyperparameter tuning. After training, our model measures the abnormality of the input image as the perceptual dissimilarity between it and the closest generated image of the modeled data distribution. The proposed approach is extensively evaluated on several publicly available image benchmarks and achieves state-of-the-art performance.
1 Introduction
Anomaly detection is one of the most important problems in a range of real-world settings, including medical applications [litjens2017survey], cyber-intrusion detection [kwon2017survey], fraud detection [abdallah2016fraud], anomaly event detection in videos [kiran2018overview] and overgeneralization problems of neural networks [spigler2019denoising]. Anomaly detection tasks generally involve the use of samples of a “normal” class, drawn from some distribution, to build a classifier that is able to detect “abnormal” samples, i.e. outliers with respect to the aforementioned distribution. Although anomaly detection is well-studied in a range of domains, image anomaly detection is still a challenge due to the complexity of distributions over images.
Generative Adversarial Networks (GANs) [goodfellow2014generative] present one of the new promising deep anomaly detection approaches. One network called the generator is trained to transform latent vectors, drawn from a latent distribution, to images in such a way that the second network, the discriminator, cannot distinguish between real images and generated ones. Thus after training, the generator performs a mapping of the latent distribution to the data distribution. This property has been used [schlegl2017unsupervised, deecke2018image, perera2019ocgan] to estimate the likelihood of abnormality for an input: if there is a vector in latent space, which after passing through the generator could reconstruct the input object, the object is normal, otherwise it is not. The difference between an input and its closest reconstruction (reconstruction error) is used as an anomaly score for this object.
Although there is a range of methods that use GANs for anomaly detection, none of them were developed specifically for anomaly detection on images. Usually, they apply the L1-norm or Mean Squared Error (MSE) between the pixels to compute a reconstruction error, which does not correspond to human understanding of the similarity between two images. Another problem of GAN-based approaches is how to find the latent vector that, after passing through the generator, recovers the input object. Previously, this was done either by a gradient descent optimization procedure [schlegl2017unsupervised, deecke2018image] or by co-training the generator with an encoder that recovers the latent vector [zenati2018efficient, zenati2018adversarially, perera2019ocgan]. However, existing techniques are either time-consuming [schlegl2017unsupervised, deecke2018image] or difficult to train [zenati2018efficient, zenati2018adversarially], or consist of complex multi-step learning procedures [perera2019ocgan]. A third problem is that the complete loss function consists of a sum of many components with weighting coefficients as hyper-parameters. The lack of a validation set (we do not have any anomaly examples during training) makes it difficult to choose these coefficients.
In our work we propose solutions for each of these three problems:
We developed a new metric that measures the similarity between the perception of two images. Our metric, called relative-perceptual-L1 loss, is based on perceptual loss [johnson2016perceptual], but is more robust to noise and changes of contrast of images (Figure 1).
We propose a new technique for training an encoder that predicts a latent vector jointly with the generator. We construct a loss function in such a way that the encoder predicts a vector belonging to the latent distribution, and that the image reconstructed from this vector by the generator is similar to the input.
We propose a way to choose the weighting coefficients in the complete loss functions for the encoder and the generator. We base our solution on the norm of the gradients (with respect to network parameters) of each loss function, to balance the contribution of all losses during the training process.
The proposed approach, called Perceptual Image Anomaly Detection (PIAD), allows us to improve performance on several well-known datasets. We experimented with MNIST, Fashion MNIST, COIL-100, CIFAR-10, LSUN and CelebA and made an extensive comparison with a wide range of anomaly detection approaches of different paradigms.
2 Related Work
Anomaly detection has been extensively studied in a wide range of domains [chalapathy2019deep]. However, anomaly detection on image data is still challenging. Classical approaches, such as explicit modeling of latent space using KDE [parzen1962estimation] or One-Class SVM [chen2001one], which learns a boundary around samples of a normal class, perform poorly when applied to image anomaly detection tasks. Because of the curse of dimensionality, these algorithms are weak at modeling complex high-dimensional distributions.
Deep autoencoders play an important role among anomaly detection methods [sakurada2014anomaly, an2015variational, zhou2017anomaly]. Autoencoders that perform dimension reduction for normal samples learn some common factors inherent in normal data. Abnormal samples do not contain these factors and thus cannot be accurately reconstructed by autoencoders. However, image anomaly detection is still challenging for autoencoders, and usually they are applied only on simple abnormal samples, when the variability of normal images is low.
There are also “mixed” approaches that use autoencoders or other deep models for representation learning. GPND [pidhorskyi2018generative] leverages an adversarial autoencoder to create a low-dimensional representation and then uses a probabilistic interpretation of the latent space to obtain an anomaly score. The method described in [abati2019latent] models a latent distribution obtained from a deep autoencoder using an auto-regressive network. In Deep SVDD [ruff2018deep], Ruff et al. show how to train a one-class classification objective together with deep feature representation.
GANs [goodfellow2014generative] created a new branch in the development of image anomaly detection. GAN-based approaches [schlegl2017unsupervised, deecke2018image, zenati2018efficient, zenati2018adversarially, perera2019ocgan] differ in two respects: (i) how to find latent vectors that correspond to the input images, and (ii) how to estimate abnormality based on the input image and the reconstructed one. For the second problem, these methods use a linear combination of the L1-norm or the MSE between the input image and the reconstruction (reconstruction error), and the discriminator’s prediction of the reality of the reconstructed image. For the first problem, AnoGAN [schlegl2017unsupervised] and ADGAN [deecke2018image] use time-consuming gradient descent optimization of a latent vector. Other approaches train an encoder to predict a latent vector for each image. Figure 2 demonstrates the differences between the existing approaches. ALAD [zenati2018adversarially] and [zenati2018efficient] train the encoder adversarially: the adversarial loss computed by the discriminator, which takes pairs (image, vector), forces the encoder to predict a latent vector that reconstructs the input image. However, the discriminators of such models are trained with a cross-entropy loss function, which makes the training process unstable. The OCGAN model [perera2019ocgan] trains a denoising autoencoder. To improve the quality of the mapping, its authors added two discriminators, $D_l$ and $D_v$ (acting on the latent space and the image space, respectively), and a classifier which searches for hard negative examples (poorly generated images).
3 Perceptual Image Anomaly Detection
Conceptually, the idea of PIAD follows OCGAN. We apply the power of GANs twice: once to build a mapping from the latent space to the image space, and again to create the inverse mapping. A generator and an encoder are trained jointly to satisfy three conditions (see Figure 2(d)):
Generator $G$ performs a mapping from the latent distribution $p_Z$ to the data distribution $p_X$;
Encoder $E$ performs a mapping from $p_X$ to $p_Z$;
The image which the generator recovers from the latent vector predicted by the encoder must be close to the original image (reconstruction term): $G(E(x)) \approx x$.
To accomplish conditions 1 and 2 we train the generator and the encoder with adversarial losses. Therefore, two discriminators, $D_x$ (on images) and $D_z$ (on latent vectors), are required. To evaluate the reconstruction term we propose to use our new relative-perceptual-L1 loss.
Ideologically, our approach differs from OCGAN. OCGAN is a denoising autoencoder with a highly constrained latent space. On top of the reconstruction loss, it uses an adversarial loss to ensure that the decoder (the generator in our notation) can reproduce only normal examples. Our approach, however, is based on the power of adversarial loss for mapping two distributions. In practice, OCGAN also differs in its classifier component, which helps to find weak regions of the latent space that produce images that are not “normal”, but which makes the whole training process much more complicated and multi-step. Also, we do not add noise to an image before passing it through the encoder.
In order to train the encoder and the generator to minimize a multi-objective loss function, we propose a new way of setting the weights of the loss functions that equalizes the contribution of each loss in the training process. Because our approach relies on the gradients of the loss functions with respect to the network parameters, we call it the gradient-normalizing weights policy.
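After training the proposed model on samples of a normal class, we propose to predict the abnormality $A(x)$ of a new example $x$ by evaluating the relative-perceptual-L1 loss between the input and its reconstruction by the generator $G$ from the latent vector predicted by the encoder $E$:
\[ A(x) = L_{\text{rel-perc-L1}}\big(x, G(E(x))\big). \]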
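The scoring rule can be sketched in a few lines. The snippet below is only an illustration: the toy encoder, the toy generator, and the plain L1 distance are hypothetical stand-ins, not the trained networks or the relative-perceptual-L1 loss used in this work. The toy "normal" data lie on the line x2 = 2·x1, so a point on that line is reconstructed exactly while a point off the line is not.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def anomaly_score(
    x: Vector,
    encoder: Callable[[Vector], Vector],
    generator: Callable[[Vector], Vector],
    dissimilarity: Callable[[Vector, Vector], float],
) -> float:
    """Score x by its dissimilarity to the reconstruction generator(encoder(x))."""
    reconstruction = generator(encoder(x))
    return dissimilarity(x, reconstruction)


# Hypothetical toy model: "normal" points lie on the line x2 = 2 * x1.
def toy_encoder(x: Vector) -> Vector:
    # Least-squares projection of (x1, x2) onto the direction (1, 2).
    return [(x[0] + 2.0 * x[1]) / 5.0]


def toy_generator(z: Vector) -> Vector:
    # Maps the 1-D latent value back onto the line.
    return [z[0], 2.0 * z[0]]


def l1_distance(a: Vector, b: Vector) -> float:
    # Placeholder dissimilarity; the paper uses relative-perceptual-L1 instead.
    return sum(abs(p - q) for p, q in zip(a, b))


normal_score = anomaly_score([1.0, 2.0], toy_encoder, toy_generator, l1_distance)
abnormal_score = anomaly_score([2.0, -1.0], toy_encoder, toy_generator, l1_distance)
```

Passing the dissimilarity as an argument keeps the scoring rule independent of the particular loss, so the same function applies to any of the reconstruction losses compared later.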
3.1 Relative-perceptual-L1 Loss
Features obtained by a neural network, trained on a large dataset for the task of object recognition, can capture high-level image content without binding to exact pixel locations [gatys2015texture]. In [johnson2016perceptual] Johnson et al. proposed content distance between two images, called perceptual loss: this metric computes the MSE between features taken at a deep level of a neural network that has been pre-trained on an object classification task.
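Let $f(x)$ be a feature map obtained from some deep layer of the network on image $x$, and $C \times H \times W$ the shape of this feature map. Then the perceptual loss between images $x$ and $y$ is defined as:
\[ L_{\text{perc}}(x, y) = \frac{\lVert f(x) - f(y) \rVert_2^2}{C \cdot H \cdot W}. \]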
However, perceptual loss is very sensitive to changes in image contrast. Figure 1 shows three pairs of images: pairs 1(b) and 1(c) have lower contrast than 1(a). Perceptual loss drops by 22% for pair 1(b) compared to 1(a), although to a human observer the pair 1(b) differs very little from the pair 1(a). Consequently, if we used perceptual loss to compute the anomaly score, the model would tend to predict lower-contrast images as less abnormal. Another problem is that perceptual loss applies the MSE over features, and the MSE penalizes noise in the obtained feature values very heavily.
We tackle these problems by proposing the relative-perceptual-L1 loss, which is robust to contrast and noise. First of all, we noticed that features obtained from different filters can have different spreads of values. As an example, Figure 3 (left) shows the standard deviations of filter responses of some deep layer of VGG-19 [simonyan2014very], computed over ImageNet [russakovsky2015imagenet]. We visualize the standard deviations since they indicate the overall magnitude of the features, which are themselves distributed around zero. Standard deviations differ by a factor of 2-3, which means that the contributions per filter vary by a factor of 2-3 as well. Therefore, as the first step of relative-perceptual-L1, we propose to normalize the obtained deep features by the mean and std of filter responses, which are pre-calculated over a large dataset such as ImageNet. Secondly, we propose to use the L1-norm instead of the MSE, since the L1-norm is more robust to noise. Thirdly, to make the loss more resistant to contrast, we examined how feature values behave under changes of contrast. Figure 3 (right) illustrates this behavior: as image contrast is reduced, the average feature value decreases. As a result, the absolute error (the difference between features), which is used in perceptual loss, decreases as well for each pair of lower-contrast images. Therefore, we propose to use not the absolute error but the relative error, which measures the ratio of the absolute error of features to the average values of these features.
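Consequently, relative-perceptual-L1 is calculated as follows:
\[ \tilde{f}(x) = \frac{f(x) - \mu}{\sigma}, \qquad L_{\text{rel-perc-L1}}(x, y) = \frac{\lVert \tilde{f}(x) - \tilde{f}(y) \rVert_1}{\lVert \tilde{f}(x) \rVert_1}, \]
where $f(x)$ is the deep feature map of image $x$ and $\mu$, $\sigma$ are the pre-calculated mean and std of filter responses.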
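A minimal sketch of this loss on already-extracted features may help make the three steps concrete (per-filter normalization, L1 error, division by the feature magnitude). It assumes the feature map is given as one flat list of responses per filter; the feature values, the statistics `mu` and `sigma`, and the small `eps` guard against division by zero are illustrative assumptions, and in practice the features would come from a pre-trained network such as VGG-19.

```python
from typing import List, Sequence

FeatureMap = Sequence[Sequence[float]]  # one flat list of responses per filter


def normalize(
    features: FeatureMap, mu: Sequence[float], sigma: Sequence[float]
) -> List[float]:
    """Standardize each filter's responses with its pre-computed mean and std."""
    return [
        (value - m) / s
        for channel, m, s in zip(features, mu, sigma)
        for value in channel
    ]


def relative_perceptual_l1(
    fx: FeatureMap,
    fy: FeatureMap,
    mu: Sequence[float],
    sigma: Sequence[float],
    eps: float = 1e-12,  # assumption: guard against an all-zero feature map
) -> float:
    """L1 error between normalized features, relative to the magnitude of fx."""
    nx = normalize(fx, mu, sigma)
    ny = normalize(fy, mu, sigma)
    abs_error = sum(abs(a - b) for a, b in zip(nx, ny))
    magnitude = sum(abs(a) for a in nx)
    return abs_error / max(magnitude, eps)


# Toy features for two filters with two spatial positions each.
mu, sigma = [0.0, 0.0], [1.0, 2.0]
fx = [[1.0, -1.0], [2.0, -2.0]]
fy = [[0.5, -1.0], [2.0, -1.0]]
loss = relative_perceptual_l1(fx, fy, mu, sigma)

# The same pair with all feature values halved, mimicking lower contrast.
fx_low = [[0.5 * v for v in channel] for channel in fx]
fy_low = [[0.5 * v for v in channel] for channel in fy]
loss_low = relative_perceptual_l1(fx_low, fy_low, mu, sigma)
```

In this toy case the filter means are zero, so halving all feature values leaves the relative loss unchanged, whereas an MSE over the same features would shrink by a factor of four. With non-zero means the invariance holds only approximately.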
3.2 Training Objective
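To train both discriminators we used the Wasserstein GAN with Gradient Penalty objective (WGAN-GP) [gulrajani2017improved]. The training of a Wasserstein GAN is more stable than that of a classical GAN [goodfellow2014generative] (which was used in [zenati2018efficient, zenati2018adversarially, perera2019ocgan]), it prevents mode collapse, and it does not require careful tuning of the generator/discriminator training schedule. Thus, the discriminator $D_x$, which acts on images, learns by minimizing the following loss (with $G$ the generator, $p_X$ the data distribution and $p_Z$ the latent distribution):
\[ L_{dis}(D_x) = \mathbb{E}_{z \sim p_Z}\big[D_x(G(z))\big] - \mathbb{E}_{x \sim p_X}\big[D_x(x)\big] + \lambda \cdot GP(D_x), \]
where GP is the Gradient Penalty Regularization [gulrajani2017improved] and $\lambda$ is a weighting parameter. In the same way, the discriminator $D_z$, which acts on latent vectors, minimizes $L_{dis}(D_z)$. The adversarial loss of the generator is
\[ L_{adv}(G) = \mathbb{E}_{x \sim p_X}\big[D_x(x)\big] - \mathbb{E}_{z \sim p_Z}\big[D_x(G(z))\big]. \]
The adversarial loss of the encoder $E$, $L_{adv}(E)$, is computed in the same way. The reconstruction loss is measured using the proposed relative-perceptual-L1 loss:
\[ L_{rec}(G, E) = \mathbb{E}_{x \sim p_X}\big[L_{\text{rel-perc-L1}}\big(x, G(E(x))\big)\big]. \]
Thus, the total objectives for the encoder and generator are as follows:
\[ L_{total}(G) = L_{adv}(G) + \gamma_G L_{rec}(G, E), \qquad L_{total}(E) = L_{adv}(E) + \gamma_E L_{rec}(G, E), \]
where $\gamma_G$ and $\gamma_E$ are weighting parameters. The training process consists of alternating $n_{dis}$ steps of optimization of the discriminators and one step of optimization of the generator together with the encoder. Parameters $\gamma_G$ and $\gamma_E$ change every $n_{weight}$ iterations following our gradient-normalizing weights policy. The full training procedure is summarized in Algorithm 1. (Steps “update gradient history” and “select weighting parameters” are explained in detail in the next Section).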
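The objectives and the alternating schedule described in this section can be sketched numerically as follows. This is an illustration under stated assumptions rather than a reproduction of Algorithm 1: critic scores and gradient norms are passed in as plain lists (in practice an autodiff framework would supply them), the default `lam = 10.0` follows the common WGAN-GP setting rather than a value stated here, and the names `n_dis`, `critic_loss`, `generator_adv_loss`, `total_objective` and `training_schedule` are hypothetical.

```python
from typing import Iterator, Sequence


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def critic_loss(
    fake_scores: Sequence[float],
    real_scores: Sequence[float],
    grad_norms: Sequence[float],
    lam: float = 10.0,  # assumption: common WGAN-GP default
) -> float:
    """WGAN-GP critic loss: E[D(fake)] - E[D(real)] + lam * gradient penalty.

    grad_norms holds the L2-norms of the critic's input gradients at points
    interpolated between real and generated samples.
    """
    gradient_penalty = mean([(g - 1.0) ** 2 for g in grad_norms])
    return mean(fake_scores) - mean(real_scores) + lam * gradient_penalty


def generator_adv_loss(
    fake_scores: Sequence[float], real_scores: Sequence[float]
) -> float:
    """Adversarial loss of the generator: E[D(real)] - E[D(fake)]."""
    return mean(real_scores) - mean(fake_scores)


def total_objective(adv_loss: float, rec_loss: float, gamma: float) -> float:
    """Adversarial loss plus the gamma-weighted reconstruction loss."""
    return adv_loss + gamma * rec_loss


def training_schedule(num_iterations: int, n_dis: int) -> Iterator[str]:
    """Yield the update order: n_dis critic steps, then one joint G+E step."""
    for _ in range(num_iterations):
        for _ in range(n_dis):
            yield "discriminators"
        yield "generator+encoder"


d_loss = critic_loss([0.2, 0.4], [1.0, 0.6], grad_norms=[1.0, 1.5])
g_adv = generator_adv_loss([0.2, 0.4], [1.0, 0.6])
g_total = total_objective(g_adv, rec_loss=0.2, gamma=2.0)
steps = list(training_schedule(num_iterations=2, n_dis=3))
```

The term E[D(real)] in the generator loss does not depend on the generator, so dropping it would give the same gradients; it is kept here only to mirror the form of the critic loss.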
3.3 Gradient-normalizing Weight Policy
Our objective function consists of the sum of multiple losses. To find weighting parameters for these losses, we cannot use cross-validation, because no anomaly examples are available to calculate anomaly detection quality. The work [perera2019ocgan] chooses weights empirically based on reconstruction quality. However, it requires a person to manually select the coefficients for each experiment, and it is not objective and reproducible.
In order to choose weighting parameters automatically, we need to base our solution on measured values of an experiment. Let $w$ be a vector of network parameters, $L_1$ and $L_2$ be losses calculated for this network, and
\[ L_{total} = L_1 + \gamma L_2. \]
Depending on the nature of the loss functions $L_1$ and $L_2$, the norms of $\partial L_1 / \partial w$ and $\partial L_2 / \partial w$ can differ by a factor of ten or even a hundred. Coefficient $\gamma$ regulates the relative influence of the loss functions in the total gradient with respect to the parameters $w$. To make the contributions of the loss functions equal, we can choose coefficient $\gamma$ in the following way:
\[ \gamma = \frac{\lVert \partial L_1 / \partial w \rVert_2}{\lVert \partial L_2 / \partial w \rVert_2}. \]
However, due to using stochastic optimization, gradients are very noisy during training. To make this process more robust and stable, we propose to average the coefficients over all network parameters and over their previous values (history information). Our approach is summarized in Algorithm 2 and Algorithm 3.
In short: for each loss, we calculate the derivative (backpropagate the loss) with respect to each network weight. Then for each convolutional layer we compute the L2-norm of the derivative with respect to the weight matrix of that layer and store it. This is done every $n_{weight}$ iterations in training, and all previously calculated values are kept, thus creating a gradient history per loss, per layer. We calculate the L2-norm per layer (and not per individual weight) to reduce the size of the stored information. Computing the norm over all network parameters would lose too much information, since the last layers usually have more parameters, and hence information about gradients from the first layers would be lost. Firstly, the coefficient $\gamma$ is calculated per layer: we perform exponential smoothing of the history values of each loss (to make values robust to noise), and then calculate the average ratio between the last N entries in the gradient history for $L_1$ and the same for $L_2$. The final value for $\gamma$ is computed as the average over the per-layer $\gamma$-s.
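The policy described above can be sketched as follows. This is one possible reading of the description, not Algorithms 2 and 3 themselves: the gradient histories are assumed to be given as dictionaries mapping a layer name to the list of stored gradient L2-norms, the "average ratio" is taken as the mean of element-wise ratios of the smoothed histories, and the smoothing factor `alpha`, the window `last_n`, and the function names are hypothetical choices.

```python
from typing import Dict, List, Sequence


def exp_smooth(values: Sequence[float], alpha: float = 0.5) -> List[float]:
    """Exponential smoothing: s_t = alpha * v_t + (1 - alpha) * s_{t-1}."""
    smoothed = [values[0]]
    for v in values[1:]:
        smoothed.append(alpha * v + (1.0 - alpha) * smoothed[-1])
    return smoothed


def select_gamma(
    history_1: Dict[str, Sequence[float]],
    history_2: Dict[str, Sequence[float]],
    last_n: int = 3,
    alpha: float = 0.5,
) -> float:
    """Weight for L2 in L1 + gamma * L2 that balances the gradient norms.

    history_k maps a layer name to the stored L2-norms of the gradient of
    loss k with respect to that layer's weights (oldest entry first).
    """
    per_layer = []
    for layer in history_1:
        s1 = exp_smooth(history_1[layer], alpha)[-last_n:]
        s2 = exp_smooth(history_2[layer], alpha)[-last_n:]
        ratios = [a / b for a, b in zip(s1, s2)]
        per_layer.append(sum(ratios) / len(ratios))  # gamma for this layer
    return sum(per_layer) / len(per_layer)  # average over layers


# Toy histories: gradients of L1 are 2x (conv1) and 4x (conv2) larger.
history_l1 = {"conv1": [4.0, 4.0, 4.0], "conv2": [1.0, 1.0, 1.0]}
history_l2 = {"conv1": [2.0, 2.0, 2.0], "conv2": [0.25, 0.25, 0.25]}
gamma = select_gamma(history_l1, history_l2)
```

Averaging per-layer ratios, rather than taking one ratio of norms over all parameters, keeps layers with few parameters from being drowned out by layers with many.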
Our approach generalizes straightforwardly to a loss function consisting of more than two contributions. It also leaves room for research on which norm to use and how to compute the final weights.
4 Experiments
We show the effectiveness of the proposed approach by evaluation on six publicly available datasets and compare our method with a diverse collection of state-of-the-art methods for out-of-distribution detection, including state-of-the-art GAN-based approaches.
Datasets. For evaluation we use the following well-known datasets (Table 1): MNIST [lecun2010mnist] and Fashion MNIST (fMNIST) [xiao2017fashion], COIL-100 [nene1996columbia] (images of 100 different objects against a black background, where views of each object are taken at pose intervals of 5 degrees), CIFAR-10 [krizhevsky2009learning], LSUN [xiao2010sun] (we used only the bedrooms and conference room classes), and the aligned & cropped face image attributes dataset CelebA [liu2015faceattributes].
In all experiments, images of MNIST, Fashion MNIST and COIL-100 were resized to , examples of the LSUN dataset were downscaled to size and for images of CelebA we made a central crop and then downscaled to size .
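Table 1: Datasets used for evaluation.
| | MNIST | fMNIST | COIL-100 | CIFAR-10 | LSUN | CelebA |
|---|---|---|---|---|---|---|
| # classes | 10 | 10 | 100 | 10 | 1 | 40 attrib. |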
Competing Methods. As shallow baselines we consider standard methods such as OC-SVM [chen2001one] and KDE [parzen1962estimation]. We also test the performance of our approach against four state-of-the-art GAN-based methods: AnoGAN [schlegl2017unsupervised], ADGAN [deecke2018image], OCGAN [perera2019ocgan] and ALAD [zenati2018adversarially]. Finally, we report the performance of three deep learning approaches from different paradigms: Deep SVDD [ruff2018deep], GPND [pidhorskyi2018generative], and the Latent Space Autoregression approach [abati2019latent] (results will be reported under the name LSA). All these methods have been briefly described in Section 2.
For ADGAN [deecke2018image], OCGAN [perera2019ocgan], ALAD [zenati2018adversarially], GPND [pidhorskyi2018generative], LSA [abati2019latent] we used results as reported in the corresponding publications. Results for OC-SVM, KDE, AnoGAN were obtained from [abati2019latent].
Evaluation Protocol. To test the methods on classification datasets, we use a one-vs-all evaluation scheme, which has recently been increasingly used in anomaly detection papers [deecke2018image, perera2019ocgan, zenati2018adversarially, abati2019latent, ruff2018deep, pidhorskyi2018generative]: to simulate out-of-distribution condition, one class of a dataset is considered as normal data while images of other classes are considered as abnormal. We evaluate results quantitatively using the area under the ROC curve (ROC AUC), which is a standard metric for this task.
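The protocol can be illustrated with a short sketch that builds one-vs-all labels and computes ROC AUC from anomaly scores. The function names, class ids and scores are made up for illustration. The AUC is computed here through its pairwise (Mann-Whitney) formulation, which is quadratic in the number of samples and therefore only suitable for small examples.

```python
from typing import List, Sequence


def one_vs_all_labels(class_ids: Sequence[int], normal_class: int) -> List[int]:
    """Label 1 marks an anomaly (any other class), label 0 marks the normal class."""
    return [int(c != normal_class) for c in class_ids]


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random anomaly gets a higher score than a random
    normal sample, counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one normal and one abnormal sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy test set: class 3 is treated as normal, all other classes as anomalies.
class_ids = [3, 3, 7, 1, 3]
anomaly_scores = [0.1, 0.4, 0.9, 0.3, 0.2]
labels = one_vs_all_labels(class_ids, normal_class=3)
auc = roc_auc(labels, anomaly_scores)
```

Because ROC AUC depends only on the ranking of the scores, it needs no decision threshold, which suits a setting where no anomalies are available for tuning one.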
Implementation Details. In all experiments we used pre-activation resnet blocks to build our generator, encoder, and discriminator; for computing relative-perceptual-L1 loss we used VGG-19 [simonyan2014very] network, pre-trained on Imagenet [russakovsky2015imagenet]. Other implementation details are presented in the supplementary material.
Table 2: ROC AUC on MNIST and CIFAR-10. Column groups: Shallow, GAN-based methods, Deep methods, PIAD.
MNIST and CIFAR-10. Following [deecke2018image, perera2019ocgan, zenati2018adversarially, abati2019latent, ruff2018deep] we evaluated our approach on MNIST and CIFAR-10 using a train-test dataset split, where the training split contains part of the known class and the test split contains the unknown classes and the remainder of the known class. We run each experiment 3 times with different initializations and present the averaged ROC AUC in Table 2.
Since the MNIST dataset is easy, all methods work well. However, our approach improves performance on several tricky digits (like 3 and 8) and outperforms all other approaches on average over the dataset. On the more diverse and complex CIFAR-10, the superiority of the proposed method is even more noticeable. This dataset contains 10 image classes with extremely high intra-class variation and the images are so small that even a human cannot always distinguish the kind of object in the image. Our approach, based on the perceptual similarity of images, can better capture class-specific information of an image and hence improves the performance of anomaly detection in almost all experiments.
fMNIST and COIL-100. The work of GPND [pidhorskyi2018generative] uses another train-test separation to evaluate performance. For a fair comparison, we repeat their evaluation scheme. The model trains on 80% randomly sampled instances of a normal class. The remaining 20% of normal data and the same number of randomly selected anomaly instances are used for testing. We report the average performance of GPND and PIAD on the fMNIST and COIL-100 datasets in Table 3 (left), along with OCGAN since they came out the second best on the previous datasets, and they report fMNIST/COIL-100 results in the same evaluation scheme as well. For the COIL-100 dataset we randomly selected one class to be used as normal, and repeated this procedure 30 times (as it was done in [pidhorskyi2018generative]). The comparison shows that PIAD excels on both datasets.
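The split described above can be sketched as follows. The function name, the fixed seed and the integer "samples" are illustrative assumptions; the sketch only mirrors the 80%/20% division of normal data and the matching number of sampled anomalies stated in the text.

```python
import random
from typing import List, Sequence, Tuple


def split_normal_and_anomalies(
    normal: Sequence[int],
    anomalies: Sequence[int],
    train_fraction: float = 0.8,
    seed: int = 0,
) -> Tuple[List[int], List[int], List[int]]:
    """Return (train, test_normal, test_anomalies).

    Training uses a random train_fraction of the normal samples. The test set
    holds the remaining normal samples plus the same number of randomly
    selected anomalies.
    """
    rng = random.Random(seed)
    shuffled = list(normal)
    rng.shuffle(shuffled)
    n_train = int(round(train_fraction * len(shuffled)))
    train = shuffled[:n_train]
    test_normal = shuffled[n_train:]
    # Sample without replacement; requires enough anomalies to match.
    test_anomalies = rng.sample(list(anomalies), len(test_normal))
    return train, test_normal, test_anomalies


train, test_normal, test_anomalies = split_normal_and_anomalies(
    normal=range(10), anomalies=range(100, 150)
)
```

Matching the number of test anomalies to the number of held-out normal samples yields a balanced test set.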
LSUN. We also compare PIAD with the ADGAN approach on the LSUN dataset, training a model on images of bedrooms and treating images of the conference room class as anomaly. We achieve a ROC AUC of 0.781 against 0.641 with ADGAN.
CelebA. In order to test our approach in conditions that are closer to a real-world use case, we experimented on the CelebA dataset, where we use attributes (Bald, Mustache, Bangs, Eyeglasses, Wearing_Hat) to split the data into normal/anomaly cases. We train our model on ’normal’ images where one of these attributes is not present and test against anomaly images, where that same attribute is present. Table 3 (center) shows the results.
Anomaly attributes Eyeglasses and Wearing_Hat are the easiest for PIAD. As shown in Figure 4(c) (4th and 5th examples), passing an image through the encoder and the generator removes glasses and hats from the images. Anomalies Mustache and Bangs are more of a challenge, but we noticed that our model removes the mustache as well, and makes bangs more transparent. However, our model failed to recognize the Bald anomaly. This may be a result of the complexity of the anomaly (see Figure 4(c), first image, where indeed the man is not completely bald) and also of inexact annotation (in Figure 4(c) the first image is annotated as bald, but the second is not).
4.2 Ablation Study
We also performed an ablation study on the CIFAR-10 dataset to show the effectiveness of each proposed component of PIAD. We considered 5 scenarios: baseline, i.e. our model with MSE as reconstruction error during training and as anomaly score, with weighting parameters chosen empirically by human inspection of the quality of generated examples and reconstructions; + gr-norm w, the same, but with the gradient-normalizing weights policy; + perc, where we further changed from MSE to perceptual loss; + perc-L1, where we added normalization by $\mu$, $\sigma$ in perceptual loss and used the L1-norm over features instead of MSE; and + rel-perc-L1, where we used the proposed loss of Section 3.1. We present the average ROC AUC over CIFAR-10 in Table 3 (right).
We note that our proposed gradient-normalizing weight policy achieves the same result as weights carefully selected by a human after running the model several times. Each further modification improved results as well. Figure 5 shows examples of images that were seen as the least and the most likely to be an anomaly, for different reconstruction losses. We note that only relative-perceptual-L1 loss is not prone to select monochrome images as the least anomalous, and furthermore, this loss selected classes that are closest to the car class as less anomalous: truck, ship.
5 Conclusion
We introduced a deep anomaly detection approach, built directly for the image domain and exploiting knowledge of perceptual image similarity. For the latter, we proposed a new metric that is based on perceptual loss, but is more robust to noise and changes of contrast of images. As a part of our work we proposed an approach for selecting weights of a multi-objective loss function, which makes the contributions of all losses equal in the training process. We demonstrated the superiority of our approach against state-of-the-art GAN-based methods and deep approaches of other paradigms on a diverse collection of image benchmarks. In the future, we plan to perform a more extensive evaluation of our method on higher resolution image data, like medical images.