Anomaly Detection for Skin Disease Images Using Variational Autoencoder

Code and data available at https://github.com/QuindiTech/VAE_ISIC2018
In this paper, we demonstrate the potential of applying the Variational Autoencoder (VAE) [10] to anomaly detection in skin disease images. The VAE is a class of deep generative models trained by maximizing the evidence lower bound of the data distribution. When trained only on normal data, the resulting model can perform efficient inference and determine whether a test image is normal. We perform experiments on the ISIC2018 Challenge Disease Classification dataset (Task 3) and compare different ways of using a VAE to detect anomalies. The model detects all diseases with 0.779 AUCROC. For specific diseases, the model detects melanoma with 0.864 AUCROC and actinic keratosis with 0.872 AUCROC, even though it has only seen images of nevus. To the best of our knowledge, this is the first applied work on deep generative models for anomaly detection in dermatology.
Keywords: Deep Generative Models · Variational Autoencoder · Anomaly Detection.
Automatic skin disease detection would be valuable for both patients and doctors, and deep supervised learning with CNNs has been applied successfully in dermatology [6]. These models have a large number of parameters and require large-scale labeled datasets covering the different kinds of diseases. Nevertheless, human beings seem to be able to detect an abnormal skin lesion even without training, provided that they have enough experience of what a healthy mole looks like. Giving a machine this ability is interesting in itself, and it also provides practical advantages. By only observing normal skin images, the algorithm can generalize to multiple diseases or even rare diseases, which saves time and money in data collection. Motivated by these aspects, we focus on the problem of unsupervised anomaly detection for skin disease. Unsupervised learning over the space of images is challenging because of the curse of dimensionality, but recent developments in deep generative models could address this issue.
Two related families of deep generative models are the Generative Adversarial Network (GAN) and the Variational Autoencoder (VAE). Both VAE and GAN have been applied to anomaly detection. An and Cho [1] propose using a direct “reconstruction probability” for detection and show that a VAE outperforms a PCA baseline on the MNIST dataset. Chen and Konukoglu [3] apply an adversarial autoencoder to the unsupervised detection of lesions in brain MRI and improve the detection AUC on the BRATS challenge dataset. AnoGAN [13] trains a GAN on healthy retinal optical coherence tomography images and uses gradient descent in latent space to find reconstructed images. The difference between the original and reconstructed images is used as a metric for abnormality. Creswell et al. [4] propose using a denoising adversarial autoencoder to help classify skin lesions; however, this is done in a semi-supervised setting, which uses labelled data.
Our major contribution is not to propose any fundamentally new method, but to emphasize the potential usefulness of deep generative models in dermatology. We investigate VAE-based methods instead of GAN-based ones for the following reasons: 1) Even with recent tricks like the gradient penalty [7], GAN training is still unstable and highly dynamic. By contrast, VAE training is more stable and is therefore more suitable as a proof of concept. 2) Most GAN-based methods require training an additional network that maps from image space to noise space in order to obtain the reconstruction [11, 13], but the theoretical justification for this additional network is unclear. By contrast, the VAE has a well-defined mathematical framework and is therefore more interpretable.
We first give a brief introduction to the background of VAE and generative models. Then we propose different methods of using a trained VAE for anomaly detection.
2.1 Variational Autoencoder
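VAE can be viewed as a directed probabilistic graphical model with the joint distribution defined as $p_\theta(x, z) = p_\theta(x \mid z)\,p(z)$, where $x$ is the data, $z$ is the latent variable and $p(z)$ is the prior. We choose the prior to be $\mathcal{N}(0, I)$ in this work. When the true posterior $p_\theta(z \mid x)$ is intractable, one can use a parametric distribution $q_\phi(z \mid x)$ to approximate the posterior. Then, in order to perform maximum likelihood estimation (MLE), it is sufficient to maximize the evidence lower bound:

$$\log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big) \tag{1}$$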
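We choose $q_\phi(z \mid x)$ to be a Gaussian distribution with diagonal covariance, $\mathcal{N}\big(\mu_\phi(x), \operatorname{diag}(\sigma_\phi^2(x))\big)$, where $\mu_\phi(x)$ and $\sigma_\phi(x)$ are the output of a neural network. Taking the likelihood $p_\theta(x \mid z)$ to be a Gaussian $\mathcal{N}\big(f_\theta(z), \sigma^2 I\big)$ with decoder mean $f_\theta(z)$ and fixed variance $\sigma^2$, the reparameterization trick $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$ turns the evidence lower bound, up to an additive constant, into

$$\mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\left[-\frac{1}{2\sigma^2}\,\big\|x - f_\theta\big(\mu_\phi(x) + \sigma_\phi(x) \odot \epsilon\big)\big\|^2\right] - \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big) \tag{2}$$

One can observe that the function of $\sigma$ here is just adjusting the relative weight between the reconstruction term and the KL term. As a result, the final loss function to be minimized looks like

$$\mathcal{L}(x) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\Big[\big\|x - f_\theta\big(\mu_\phi(x) + \sigma_\phi(x) \odot \epsilon\big)\big\|^2\Big] + \beta\,\mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big) \tag{3}$$

where $\beta = 2\sigma^2$ is the weight of the KL term. The resulting training objective can be viewed as a specific case of $\beta$-VAE [9], but our derivation is not from an optimization perspective like in [9].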
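The weighted objective described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the squared-error reconstruction term, the function names, and the toy shapes are assumptions, and `beta` stands for the KL weight listed among the hyperparameters (0.01).

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def beta_vae_loss(x, x_recon, mu, log_var, beta=0.01):
    """Per-example loss: squared reconstruction error (assumed form) + beta * KL."""
    recon = np.sum((x - x_recon) ** 2, axis=-1)
    return recon + beta * kl_to_standard_normal(mu, log_var)

# Toy usage: 4 flattened "images" of 8 pixels and a 2-dimensional latent space.
rng = np.random.default_rng(0)
x = rng.random((4, 8))
mu, log_var = np.zeros((4, 2)), np.zeros((4, 2))   # stand-in encoder outputs
z = reparameterize(mu, log_var, rng)               # would be fed to a decoder
x_recon = np.zeros_like(x)                         # stand-in decoder output
loss = beta_vae_loss(x, x_recon, mu, log_var)
```

With a small `beta`, the loss is dominated by the reconstruction term, which is the regime the reported hyperparameters place the model in.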
2.2 Anomaly Score
The degree of anomaly can be characterized by the probability of observing $x$ under the distribution $p_\theta(x)$. Therefore, computing the anomaly score is essentially estimating $p_\theta(x)$. Once we have a trained VAE, there are several ways to use it to generate an anomaly score for a new image $x$.
2.2.1 VAE Based Score
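One choice is to use the negative of Eqn. (1) as an anomaly score. That is

$$s_{\text{vae}}(x) = -\frac{1}{L}\sum_{i=1}^{L} \log p_\theta(x \mid z_i) + \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big) \tag{4}$$

where $z_i \sim q_\phi(z \mid x)$ for $i = 1, \dots, L$. If $s_{\text{vae}}(x)$ is larger, then $x$ has higher loss and thus is more likely to be an outlier. Since we can decompose the loss into a reconstruction term and a KL term, we can define the corresponding anomaly scores:

$$s_{\text{recon}}(x) = -\frac{1}{L}\sum_{i=1}^{L} \log p_\theta(x \mid z_i) \tag{5}$$

$$s_{\text{kl}}(x) = \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big) \tag{6}$$

The motivation for the decomposition is to investigate how useful each term of the VAE loss is for anomaly detection.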
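As a rough illustration of how the three VAE-based scores could be computed for one image, the sketch below draws several latent samples from the approximate posterior and averages the reconstruction error. The squared-error stand-in for the negative log-likelihood (which ignores constants), the `decode` callable, and the toy linear decoder are all assumptions made for the example.

```python
import numpy as np

def vae_scores(x, mu, log_var, decode, n_samples=15, rng=None):
    """Return reconstruction, KL and total anomaly scores for a single input x.

    mu, log_var: encoder outputs for x (diagonal Gaussian posterior).
    decode:      hypothetical callable mapping a latent vector to a reconstruction.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    sigma = np.exp(0.5 * log_var)
    recon_terms = []
    for _ in range(n_samples):
        z = mu + sigma * rng.standard_normal(mu.shape)   # z_i ~ q(z | x)
        recon_terms.append(np.sum((x - decode(z)) ** 2))  # assumed squared error
    s_recon = float(np.mean(recon_terms))
    s_kl = float(0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var))
    return {"recon": s_recon, "kl": s_kl, "vae": s_recon + s_kl}

# Toy usage with a hypothetical linear decoder from 2 latent dims to 3 "pixels".
W = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
scores = vae_scores(np.array([0.2, -0.1, 0.3]), np.zeros(2), np.zeros(2),
                    decode=lambda z: W @ z)
```

A larger value of any of the three entries marks the input as more likely to be abnormal.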
2.2.2 Importance Weighted Autoencoder (IWAE) Based Score
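Importance Weighted Autoencoder [2] proposes a tighter lower bound on $\log p_\theta(x)$, which is

$$\log p_\theta(x) \;\ge\; \mathbb{E}_{z_1, \dots, z_k \sim q_\phi(z \mid x)}\left[\log \frac{1}{k}\sum_{i=1}^{k} \frac{p_\theta(x, z_i)}{q_\phi(z_i \mid x)}\right] \tag{7}$$

When $k = 1$, we recover the ELBO used by VAE. When $k$ becomes larger, it is proved in [2] that Eqn. (7) becomes a tighter bound than Eqn. (1), resulting in more accurate inference. Similarly, we can use the negative of Eqn. (7) to compute the anomaly score as

$$s_{\text{iwae}}(x) = -\log \frac{1}{L}\sum_{i=1}^{L} \frac{p_\theta(x, z_i)}{q_\phi(z_i \mid x)} \tag{8}$$

where $z_i \sim q_\phi(z \mid x)$. The corresponding KL score and reconstruction score are

$$s_{\text{iwae-kl}}(x) = -\log \frac{1}{L}\sum_{i=1}^{L} \frac{p(z_i)}{q_\phi(z_i \mid x)} \tag{9}$$

$$s_{\text{iwae-recon}}(x) = -\log \frac{1}{L}\sum_{i=1}^{L} p_\theta(x \mid z_i) \tag{10}$$

Although it is unclear whether a tighter lower bound estimate would help with outlier detection, we introduce these scores for the sake of comparison.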
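An importance-weighted score of this kind is usually evaluated in log space to avoid underflow. The following is a minimal sketch, assuming the per-sample log-densities have already been computed by the model; the function and argument names are hypothetical.

```python
import numpy as np

def log_mean_exp(a):
    """Numerically stable log(mean(exp(a))) for a 1-D array."""
    a = np.asarray(a, dtype=float)
    m = np.max(a)
    return float(m + np.log(np.mean(np.exp(a - m))))

def iwae_score(log_p_x_given_z, log_p_z, log_q_z_given_x):
    """Negative importance-weighted bound estimated from samples z_i ~ q(z | x).

    Each argument is a 1-D array with one entry per sample z_i.
    """
    log_w = (np.asarray(log_p_x_given_z) + np.asarray(log_p_z)
             - np.asarray(log_q_z_given_x))          # log importance weights
    return -log_mean_exp(log_w)

# Toy usage: two samples with importance weights 1 and 3.
log_lik = np.log(np.array([1.0, 3.0]))
score = iwae_score(log_lik, np.zeros(2), np.zeros(2))
elbo_style = float(np.mean(-log_lik))   # sample average of the negative log weights
```

By Jensen's inequality the importance-weighted score is never larger than the plain sample average of the negative log weights, and with a single sample the two coincide.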
3.1 Model Architecture
We use an architecture similar to DCGAN [12]. For the encoder, we avoid using a linear layer to produce the mean and log variance, and instead use two separate convolution layers. This architecture is fully convolutional, and the number of convolution blocks depends on the input image size. In our implementation, the image size is 128, so the encoder consists of 5 convolutional blocks and the decoder consists of 5 deconvolutional blocks. Adam is used as the optimizer with default settings. Hyperparameters are set as below.
$\beta$ (weight for KL term): 0.01
learning rate: 1e-4
$L$ (number of samples for calculating scores): 15
batch size: 32
training epochs: 40
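The settings above, together with the dependence of network depth on image size, can be captured in a small configuration sketch. The field names are hypothetical, and the depth rule assumes that each block halves the spatial resolution and that the encoder stops at a 4×4 feature map; neither assumption is stated in the text, but together they reproduce the 5 blocks reported for 128-pixel inputs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hyperparameters:
    kl_weight: float = 0.01        # weight for the KL term
    learning_rate: float = 1e-4
    num_score_samples: int = 15    # samples used when computing anomaly scores
    batch_size: int = 32
    epochs: int = 40
    image_size: int = 128

def num_stride2_blocks(image_size, final_size=4):
    """Number of resolution-halving blocks needed to go from image_size to final_size."""
    if image_size < final_size or image_size % final_size != 0:
        raise ValueError("image_size must be a multiple of final_size")
    ratio = image_size // final_size
    if ratio & (ratio - 1) != 0:
        raise ValueError("image_size / final_size must be a power of two")
    return ratio.bit_length() - 1

config = Hyperparameters()
depth = num_stride2_blocks(config.image_size)
```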
3.2 Dataset and Preprocessing
We use the ISIC2018 Challenge dataset (Task 3), which contains images from 7 diseases. Detailed dataset information can be found in Table 1. For training the VAE, we use 6369 images as the training set and 336 as the validation set. For anomaly detection, we select 250 images from the validation set and 100 images from the rest of the diseases. We normalize the pixel values to a fixed range and resize each image to $128 \times 128$.
The AUC results are summarized in Table 2. Our best result is obtained by the reconstruction scores, with an overall AUC of 0.77. In addition, the detection AUC for AKIEC and MEL reaches 0.87 and 0.86 respectively, even though the model has never seen a single image of these two diseases. We notice that the KL score is not very discriminative between normal and abnormal data. This is caused by using a small weight $\beta$ for the KL term, so that the model essentially ignores the KL loss during training. We also tried a larger $\beta$, but it resulted in poorer AUC. We also tried an even smaller $\beta$, but it caused some numerical instability and the improvement was not significant. These results imply that the current prior is not expressive enough, so that forcing the approximate posterior to be close to the prior hurts the model’s expressiveness and leads to worse AUC performance. We also find that the IWAE variants of the scores do not differ much from the VAE variants, which suggests that even if the bound is theoretically tighter, its practical benefit for anomaly detection may be small. A sample of reconstructed images is shown in Figure 1.
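For reference, the AUC of an anomaly score can be computed directly from the scores of normal and abnormal images as the probability that a randomly chosen abnormal image receives a higher score than a randomly chosen normal one. The sketch below uses this rank-based definition with ties counted as one half; the function name and the toy scores are illustrative and are not taken from the experiments.

```python
import numpy as np

def auroc(scores_normal, scores_abnormal):
    """AUROC as P(abnormal score > normal score), counting ties as 1/2."""
    n = np.asarray(scores_normal, dtype=float)[:, None]    # shape (N, 1)
    a = np.asarray(scores_abnormal, dtype=float)[None, :]  # shape (1, A)
    return float(np.mean((a > n) + 0.5 * (a == n)))

# Toy usage with made-up scores: 5 of the 6 (normal, abnormal) pairs are ordered correctly.
auc = auroc([0.1, 0.2, 0.3], [0.25, 0.4])
```

An AUC of 0.5 corresponds to a score that carries no information about abnormality, and 1.0 to a perfect separation.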
We tried to compare our method with a traditional baseline such as PCA or Kernel-PCA for anomaly detection, but our image size (3x128x128) is too large for these methods to be applied without resorting to feature engineering. This also demonstrates the advantage of using VAE to cope with the curse of dimensionality in anomaly detection.
4 Future Work
Based on our current experimental results, there are several future research directions worth pursuing.
4.1 Improve VAE
As shown above, our VAE faces a performance bottleneck because of the constraint to match the posterior with a simple prior. One potential improvement would be a more expressive decoder, as in PixelVAE [8]. PixelVAE uses an expressive autoregressive structure for the decoder, which separates the lower-level features from the higher-level semantics. When the latent variable is only left to model the higher-level features, the simple Gaussian prior might be enough. From the reconstruction results, we can see that the model still outputs blurry images. This could be improved by using a more flexible posterior family or by doing hierarchical variational inference [14].
4.2 Improve Detection Methods
In this work we have not fully explored how to use VAE for anomaly detection, but only use different outputs of the VAE to compute the scores. One could fit a probability distribution (e.g. a Gamma distribution) to the scores of normal images and use standard statistical tests for anomaly detection. The latents of the VAE can also be used for anomaly detection in several ways. One could train a one-class SVM using the latents as features. The latent space can also be used as a metric space, so that the distance between two images is defined by their inner product in the latent space. This would enable one to develop a method similar to the metric learning based anomaly detection method of [5].
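The idea of fitting a Gamma distribution to the scores of normal images could look roughly like the sketch below. It is one possible realization under stated assumptions, not a method evaluated in this paper: the Gamma parameters are fitted by the method of moments, the one-sided tail probability is estimated by Monte Carlo sampling to keep the example NumPy-only, and the "normal" scores are synthetic.

```python
import numpy as np

def fit_gamma_moments(scores):
    """Method-of-moments Gamma fit: returns (shape, scale)."""
    scores = np.asarray(scores, dtype=float)
    mean, var = scores.mean(), scores.var()
    return mean ** 2 / var, var / mean

def tail_probability(score, shape, scale, n_draws=100_000, seed=0):
    """Monte Carlo estimate of P(S >= score) under the fitted Gamma distribution."""
    rng = np.random.default_rng(seed)
    draws = rng.gamma(shape, scale, size=n_draws)
    return float(np.mean(draws >= score))

# Toy usage: synthetic "normal" scores, then a test score far in the right tail.
rng = np.random.default_rng(1)
normal_scores = rng.gamma(4.0, 2.0, size=5000)
shape, scale = fit_gamma_moments(normal_scores)
p_value = tail_probability(100.0, shape, scale)
is_anomaly = p_value < 0.01   # hypothetical significance threshold
```

A small tail probability means the score would be very unusual for a normal image, so the image is flagged as abnormal.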
In this paper we apply the Variational Autoencoder (VAE) to the problem of anomaly detection in dermatology. The VAE-based anomaly detection method has a solid theoretical framework and is able to cope with high-dimensional data, such as raw image pixels. Our objective is a specific case of $\beta$-VAE [9] but comes from a different derivation. We experiment on the ISIC 2018 Challenge Task 3 dataset. By training only on normal data (nevus), the model is able to detect abnormal images with 0.77 AUC. In particular, the model is able to detect AKIEC and MEL with 0.87 and 0.86 AUC respectively. This is, to our knowledge, the first work applying the Variational Autoencoder to dermatology. We argue that, although there have been successful applications of supervised learning and CNN-based methods in dermatology, applying deep unsupervised learning in dermatology is a fruitful yet not fully explored research direction.
- [1] An, J., Cho, S.: Variational autoencoder based anomaly detection using reconstruction probability (2015)
- [2] Burda, Y., Grosse, R.B., Salakhutdinov, R.: Importance weighted autoencoders. CoRR abs/1509.00519 (2015)
- [3] Chen, X., Konukoglu, E.: Unsupervised detection of lesions in brain MRI using constrained adversarial auto-encoders (2018)
- [4] Creswell, A., Pouplin, A., Bharath, A.A.: Denoising adversarial autoencoders: Classifying skin lesions using limited labelled training data. arXiv preprint arXiv:1801.00693 (2018)
- [5] Du, B., Zhang, L.: A discriminative metric learning based anomaly detection method. IEEE Transactions on Geoscience and Remote Sensing 52(11), 6844–6857 (2014)
- [6] Esteva, A., Kuprel, B., Novoa, R.A., Ko, J., Swetter, S.M., Blau, H.M., Thrun, S.: Dermatologist-level classification of skin cancer with deep neural networks. Nature 542(7639), 115 (2017)
- [7] Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.C.: Improved training of Wasserstein GANs. In: Advances in Neural Information Processing Systems. pp. 5769–5779 (2017)
- [8] Gulrajani, I., Kumar, K., Ahmed, F., Taiga, A.A., Visin, F., Vazquez, D., Courville, A.: PixelVAE: A latent variable model for natural images. arXiv preprint arXiv:1611.05013 (2016)
- [9] Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., Lerchner, A.: beta-VAE: Learning basic visual concepts with a constrained variational framework (2016)
- [10] Kingma, D.P., Welling, M.: Stochastic gradient VB and the variational auto-encoder. In: Second International Conference on Learning Representations, ICLR (2014)
- [11] Kiran, B.R., Thomas, D.M., Parakkal, R.: An overview of deep learning based methods for unsupervised and semi-supervised anomaly detection in videos. arXiv preprint arXiv:1801.03149 (2018)
- [12] Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR abs/1511.06434 (2015)
- [13] Schlegl, T., Seeböck, P., Waldstein, S.M., Schmidt-Erfurth, U., Langs, G.: Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In: International Conference on Information Processing in Medical Imaging. pp. 146–157. Springer (2017)
- [14] Sønderby, C.K., Raiko, T., Maaløe, L., Sønderby, S.K., Winther, O.: Ladder variational autoencoders. In: Advances in Neural Information Processing Systems. pp. 3738–3746 (2016)