Towards Visually Explaining Variational Autoencoders
Abstract
Recent advances in Convolutional Neural Network (CNN) model interpretability have led to impressive progress in visualizing and understanding model predictions. In particular, gradient-based visual attention methods have driven much recent effort in using visual attention maps as a means for visual explanations. A key problem, however, is that these methods are designed for classification and categorization tasks, and their extension to explaining generative models, e.g., variational autoencoders (VAE), is not trivial. In this work, we take a step towards bridging this crucial gap, proposing the first technique to visually explain VAEs by means of gradient-based attention. We present methods to generate visual attention from the learned latent space, and also demonstrate that such attention explanations serve more than just explaining VAE predictions. We show how these attention maps can be used to localize anomalies in images, demonstrating state-of-the-art performance on the MVTec-AD dataset. We also show how they can be infused into model training, helping bootstrap the VAE into learning improved latent space disentanglement, demonstrated on the dSprites dataset.
1 Introduction
Dramatic progress in computer vision, driven by deep learning [23, 13, 15], has led to widespread adoption of the associated algorithms in real-world tasks, including healthcare, robotics, and autonomous driving [17, 52, 24], among others. Applications in many such safety-critical and consumer-facing areas demand a clear understanding of the reasoning behind an algorithm's predictions, in addition to robustness and performance guarantees. Consequently, there has been substantial recent interest in devising ways to understand and explain the underlying "why" driving the output "what".
Following the work of Zeiler and Fergus [41], much recent effort has been expended in developing ways to visualize feature activations in convolutional neural networks (CNNs). One line of work that has seen increasing adoption involves network attention [48, 34], typically visualized by means of attention maps that highlight feature regions considered (by the trained model) to be important for satisfying the training criterion. Given a trained CNN model, these techniques are able to generate attention maps that visualize where a certain object, e.g., a cat, is in the image, helping explain why the image is classified as belonging to the cat category. Some extensions [25, 37] provide ways to use the generated attention maps as part of trainable constraints that are enforced during model training, showing improved model generalizability as well as visual explainability. While Zheng et al. [46] used a classification module to show how one can generate a pair of such attention maps to explain why two images of people are similar/dissimilar, all these techniques, by design, need to perform classification to guide model explainability, limiting their use to object categorization problems.
Starting from such classification model explainability, one would naturally like to explain a wider variety of neural network models and architectures. For instance, there has been an explosion in the use of generative models following the work of Kingma and Welling [21] and Goodfellow et al. [12], and subsequent successful applications in a variety of tasks [16, 27, 38, 40]. While progress in algorithmic generative modeling has been swift [39, 18, 31], explaining such generative algorithms is still a relatively unexplored field of study. There are certainly some ongoing efforts in using the concept of visual attention in generative models [36, 2, 42], but the focus of these methods is to use attention as an auxiliary information source for the particular task of interest, and not to visually explain the generative model itself.
In this work, we take a step towards bridging this crucial gap, developing new techniques to visually explain Variational Autoencoders (VAE) [22]. Note that while we use VAEs as an instantiation of generative models in our work, some of the ideas we discuss are not limited to VAEs and can certainly be extended to GANs [12] as well. Our intuition is that the latent space of a trained VAE encapsulates key properties of the VAE, and that generating explanations conditioned on the latent space will help explain the reasoning for any downstream model predictions. Given a trained VAE, we present new ways to generate visual attention maps from the latent space by means of gradient-based attention. Specifically, given the learned Gaussian distribution, we use the reparameterization trick [22] to sample a latent code. We then backpropagate the activations in each dimension of the latent code to a convolutional feature layer in the model and aggregate all the resulting gradients to generate the attention maps. While these visual attention maps serve as a means to explain the VAE, we can do much more than just that. A classical application of a VAE is anomaly localization, where the intuition is that any input data not drawn from the standard Gaussian distribution used to train the VAE should be anomalous in the inferred latent space. Given this inference, we can generate attention maps that help visually explain why a particular input is anomalous. We then go a step further, presenting ways to use these explanations as cues to precisely localize where the anomaly is in the image. We conduct extensive experiments on the recently proposed MVTec anomaly detection dataset and present state-of-the-art anomaly localization results with just the standard VAE, without any bells and whistles.
Latent space disentanglement is another important area of study with VAEs and has seen much recent progress [14, 19, 47]. With our visual attention explanations conditioned on the learned latent space, our intuition is that using these attention maps as part of trainable constraints will lead to improved latent space disentanglement. To this end, we present a new learning objective we call the attention disentanglement loss and show how one can train existing VAE models with this new loss. We demonstrate its impact in learning a disentangled embedding by means of experiments on the dSprites dataset [30].
To summarize, the key contributions of this work include:

- We take a step towards solving the relatively unexplored problem of visually explaining generative models, presenting new methods to generate visual attention maps conditioned on the latent space of a variational autoencoder.
- Going beyond visual explanations, we show how our visual attention maps can be put to multi-purpose use.
- We present new ways to localize anomalies in images by using our attention maps as cues, demonstrating state-of-the-art localization performance on the MVTec-AD dataset [3].
- We present a new learning objective called the attention disentanglement loss, show how one can incorporate it into standard VAE models, and demonstrate improved disentanglement performance on the dSprites dataset [30].
2 Related Work
CNN Visual Explanations. Much recent effort has been expended in explaining CNNs as they have come to dominate performance on most vision tasks. Some widely adopted methods that attempt to visualize intermediate CNN feature layers include the work of Zeiler and Fergus [41] and Mahendran and Vedaldi [28], where methods to understand the activity within the layers of convolutional nets were presented. Some more recent extensions of this line of work include visual-attention-based approaches [49, 11, 35, 6], most of which can be categorized into either gradient-based or response-based methods. Gradient-based methods such as Grad-CAM [35] compute and visualize gradients backpropagated from the decision unit to a convolutional feature layer. On the other hand, response-based approaches [43, 49, 11] typically add additional trainable units to the original CNN architecture to compute the attention maps. In both cases, the goal is to localize attentive and informative image regions that contribute the most to the model prediction. However, these methods and their extensions [11, 25, 37], while able to explain classification/categorization models, cannot be trivially extended to explaining deep generative models such as VAEs. In this work, we present methods, using the philosophy of gradient-based network attention, to compute and visualize attention maps directly from the learned latent embedding of the VAE. Furthermore, we make the resulting attention maps end-to-end trainable and show how such a change can result in improved latent space disentanglement.
Anomaly Detection. Unsupervised learning for anomaly detection [1] remains challenging. Most recent work in anomaly detection is based on either classification-based [32, 5] or reconstruction-based approaches. Classification-based approaches aim to progressively learn representative one-class decision boundaries, such as hyperplanes [5] or hyperspheres [32], around the normal-class input distribution to tell outliers/anomalies apart. However, it has also been shown [4] that these methods have difficulty dealing with high-dimensional data. Reconstruction-based models, on the other hand, assume that anomalous input data cannot be reconstructed well by a model trained only with normal input data. This principle has been used by several methods based on traditional PCA [20], sparse representation [45], and more recently deep autoencoders [51, 50]. In this work, we take a different approach to tackling this problem. We use the attention maps generated by our proposed VAE visual explanation method as cues to localize anomalies. Our intuition is that representations of anomalous data should be reflected in the latent embedding as being anomalous, and that generating visual explanations from such an embedding gives us the information we need to localize the particular anomaly.
VAE Disentanglement. Much effort has been expended in understanding latent space disentanglement for generative models. Early work of Schmidhuber [33] proposed a principle to disentangle latent representations by minimizing the predictability of one latent dimension given the other dimensions. Desjardins et al. [10] generalized an approach based on restricted Boltzmann machines to factor the latent variables. Chen et al. extended the GAN [12] framework to design InfoGAN [8], which maximizes the mutual information between a subset of latent variables and the observation. Some of the more recent unsupervised methods for disentanglement include β-VAE [14], which attempted to explore independent latent factors of variation in observed data. While still a popular unsupervised framework, β-VAE sacrifices reconstruction quality for better disentanglement. Chen et al. [7] extended β-VAE to β-TCVAE by introducing a total-correlation-based objective, whereas Mathieu et al. [29] explored decomposition of the latent representation into two factors for disentanglement, and Kim et al. [19] proposed FactorVAE, which encourages the distribution of representations to be factorial and independent across dimensions. While these methods focus on factorizing the latent representations provided by each individual latent neuron, we take a different approach. We enforce learning a disentangled space by formulating disentanglement constraints based on our proposed visual explanations, i.e., visual attention maps. To this end, we propose a new attention disentanglement learning objective that we quantitatively show provides superior performance when compared to existing work.
3 Technical Approach
In this section, we present our method to generate explanations for a VAE by means of gradient-based attention. We begin with a brief review of VAEs in Section 3.1, followed by our proposed method to generate VAE attention. We then discuss our framework for localizing anomalies in images with these attention maps and conduct extensive experiments on the MVTec-AD anomaly detection dataset [3], establishing state-of-the-art anomaly localization performance. Next, we show how our generated attention visualizations can assist in learning a disentangled latent space by optimizing our new attention disentanglement loss. Here, we conduct experiments on the dSprites [30] dataset and quantitatively demonstrate improved disentanglement performance when compared to existing approaches.
3.1 One-Class Variational Autoencoder
A vanilla VAE is essentially an autoencoder trained with the standard autoencoder reconstruction objective between the input and the decoded/reconstructed data, as well as a variational objective term that attempts to learn a standard normal latent space distribution. The variational objective is typically implemented with the Kullback-Leibler divergence computed between the latent space distribution and the standard Gaussian. Given input data $x$, the conditional distribution $q_\phi(z|x)$ of the encoder, the standard Gaussian prior $p(z)$, and the reconstructed data $\hat{x}$, the vanilla VAE optimizes:
$\mathcal{L}_{\mathrm{VAE}} = \mathcal{L}_{r} + D_{\mathrm{KL}}\big(q_\phi(z|x)\,\|\,p(z)\big)$   (1)
where $D_{\mathrm{KL}}$ is the Kullback-Leibler divergence and $\mathcal{L}_{r}$ is the reconstruction term:
$\mathcal{L}_{r} = \frac{1}{N}\sum_{i=1}^{N}\|x_i - \hat{x}_i\|^2$   (2)
where N is the total number of data samples.
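The objective in Equations (1)-(2) can be sketched numerically. The following is a minimal NumPy illustration, not the paper's implementation: it uses the closed-form KL divergence between a diagonal Gaussian and the standard normal, a mean-squared-error reconstruction term, and (our choice) averages the KL term over the batch as well.

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL(N(mu, sigma^2) || N(0, I)), summed over all latent dims."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

def vae_loss(x, x_hat, mu, logvar):
    """Vanilla VAE objective: reconstruction term (Eq. 2) plus KL term (Eq. 1)."""
    n = x.shape[0]
    recon = np.sum((x - x_hat) ** 2) / n          # L_r, averaged over N samples
    kl = kl_to_standard_normal(mu, logvar) / n    # KL term, batch-averaged here
    return recon + kl
```

When the posterior already matches the prior (zero mean, unit variance) and the reconstruction is perfect, both terms vanish.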
3.2 Generating VAE Attention
We propose a new technique to generate VAE visual attention by means of gradient-based attention computation. Our proposed approach is substantially different from existing techniques [35, 49, 46], which compute attention maps by backpropagating a score from a classification model and thus require class-specific information (the score). In contrast, we are not restricted to such class-wise information; we develop an attention mechanism directly using the learned latent space, thereby not needing the additional classification module these existing techniques rely on. As illustrated in Figure 2 and discussed below, we compute a score from the latent space, which is then used to calculate gradients and obtain the attention map.
Specifically, for each element $z_i$ in the latent vector $z$, we backpropagate gradients to the last convolutional feature maps $A$, giving the attention map $M_i$ corresponding to $z_i$. We repeat this for all elements of the $L$-dimensional latent space, giving $M_1, \ldots, M_L$. Finally, we get the overall VAE attention map as
$M = \frac{1}{L}\sum_{i=1}^{L} M_i$   (3)
Each individual $M_i$ is computed as the linear combination
$M_i = \mathrm{ReLU}\Big(\sum_{k}\alpha_k^i A^k\Big)$   (4)
where the scalar $\alpha_k^i$ weights $A^k$, the $k$-th feature channel of the feature maps $A$. Note that the gradient $\partial z_i / \partial A^k$ is a matrix, so we use the global average pooling (GAP) operation to obtain the scalar $\alpha_k^i$. Specifically, $\alpha_k^i = \frac{1}{Z}\sum_{u}\sum_{v}\frac{\partial z_i}{\partial A_{uv}^k}$, where $Z$ is the number of spatial locations in $A^k$ and $A_{uv}^k$ is the value at location $(u,v)$ of the matrix $A^k$.
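Equations (3)-(4) can be sketched as follows. This is a minimal NumPy illustration assuming the gradients $\partial z_i / \partial A^k$ have already been obtained (e.g., from an autograd framework); the per-dimension maps are aggregated here by averaging, matching our reading of Equation (3).

```python
import numpy as np

def vae_attention(feature_maps, grads_z):
    """
    feature_maps: (K, H, W) activations A^k of the last conv layer.
    grads_z:      (L, K, H, W) gradients d z_i / d A^k for each latent dim i.
    Returns the overall attention map M of shape (H, W).
    """
    L = grads_z.shape[0]
    per_dim_maps = []
    for i in range(L):
        # alpha_k^i: global average pooling of the gradient matrix (Eq. 4)
        alpha = grads_z[i].mean(axis=(1, 2))                        # (K,)
        # ReLU of the linear combination of feature channels (Eq. 4)
        m_i = np.maximum(0.0, np.tensordot(alpha, feature_maps, axes=1))
        per_dim_maps.append(m_i)
    # Aggregate over the L latent dimensions (Eq. 3)
    return np.mean(per_dim_maps, axis=0)
```

In practice the gradients would come from backpropagating each $z_i$ through the encoder rather than being supplied directly.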
3.3 Generating Anomaly Attention Explanations
We now move on to discuss how our gradient-based attention generation mechanism can be used to localize anomalous regions given a trained one-class VAE. Inference with such a one-class VAE on data of the type it was trained on, i.e., normal data (the digit "1", for instance), should ideally result in latent representations matching the learned standard normal distribution. Consequently, given a testing sample from a different class (abnormal data, the digit "5", for instance), the latent representation inferred by the learned encoder should differ substantially from the learned normal distribution.
Given an abnormal image as input to a one-class VAE trained on normal images, the encoder infers the corresponding mean $\mu_i$ and variance $\sigma_i^2$ for each latent variable $z_i$ describing the abnormal data. Since the learned latent distribution follows $\mathcal{N}(0, I)$ in the latent space and any anomalies should deviate from this distribution, we define a normal difference distribution from which to sample the latent code for anomaly attention generation:
$z_i \sim \mathcal{N}\big(\mu_i - 0,\; \sigma_i^2 + 1\big)$   (5)
for each latent variable $z_i$. Given a latent code sampled from this difference distribution, we use Equation 3 to compute the attention map for abnormal images. Figure 3 provides a visual summary.
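Sampling from the difference distribution can be sketched via the reparameterization trick. The $\mathcal{N}(\mu_i, \sigma_i^2 + 1)$ form below is our reading of Equation (5) as the difference between the inferred posterior and the standard normal prior (the difference of two independent Gaussians has mean $\mu_i - 0$ and variance $\sigma_i^2 + 1$).

```python
import numpy as np

def sample_anomaly_code(mu, var, rng=None):
    """
    Sample a latent code from the difference distribution between the
    inferred posterior N(mu, var) and the standard normal prior N(0, 1),
    i.e., from N(mu, var + 1), via the reparameterization trick.
    mu, var: arrays of per-dimension posterior means and variances.
    """
    rng = np.random.default_rng() if rng is None else rng
    std = np.sqrt(var + 1.0)          # variance of the difference distribution
    eps = rng.standard_normal(mu.shape)
    return mu + std * eps
```

The sampled code is then backpropagated exactly as in the normal-attention case to obtain the anomaly attention map.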
Results
In this section, we evaluate our proposed method to generate visual explanations as well as perform anomaly localization with VAEs.
Metrics: We adopt the commonly used area under the receiver operating characteristic curve (ROC AUC) for all quantitative performance evaluation. We define the true positive rate (TPR) as the percentage of pixels correctly classified as anomalous across the whole testing class, and the false positive rate (FPR) as the percentage of pixels wrongly classified as anomalous. In addition, we also compute the best intersection-over-union (IOU) score by searching for the best threshold based on our ROC curve. Note that we first begin with qualitative (visual) evaluation on the MNIST and UCSD datasets, and then proceed to a more thorough quantitative evaluation on the MVTec-AD dataset.
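The best-IOU search described above can be sketched as a simple sweep over thresholds on the attention-map responses; this is an illustrative implementation, with the threshold grid chosen here as a linear sweep rather than derived from the ROC curve.

```python
import numpy as np

def best_iou(anomaly_map, gt_mask, num_thresholds=100):
    """
    Sweep thresholds over the anomaly-map responses and return the best
    pixel-level intersection-over-union against the binary ground truth.
    anomaly_map: (H, W) float responses; gt_mask: (H, W) boolean mask.
    """
    thresholds = np.linspace(anomaly_map.min(), anomaly_map.max(), num_thresholds)
    best = 0.0
    for t in thresholds:
        pred = anomaly_map >= t                       # binarize at threshold t
        inter = np.logical_and(pred, gt_mask).sum()
        union = np.logical_or(pred, gt_mask).sum()
        if union > 0:
            best = max(best, inter / union)
    return best
```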
MNIST. We start by qualitatively evaluating our visual attention maps on the MNIST dataset [9]. Using training images from a single digit class, we train our one-class VAE model, which we then test on the testing images of all digit classes. We reshape all training and testing images to the same resolution.
In Figure 4 (top), we show results with a model trained on the digit "1" (normal class) and tested on all other digits (each of which becomes an abnormal class). For each test image, we infer the latent vector using our trained encoder and use Equation 3 to generate the attention map. As can be observed, the attention maps computed with the proposed method are intuitively satisfying. For instance, consider the attention maps generated with the digit "7" as the test image. Our intuition tells us that a key difference between the "1" and the "7" is the top horizontal bar in the "7", and our generated attention map indeed highlights this region. Similarly, the differences between an image of the digit "2" and the "1" are the horizontal base and the top round regions in the "2". From the generated attention maps for "2", we notice that we are indeed able to capture these differences, highlighting the top and bottom regions of the images for "2". We also show testing results with other digits (e.g., "4", "9") as well as with a model trained on the digit "3" and tested on the other digits in the same figure. Similar observations can be made from these results, suggesting that our proposed attention generation mechanism is indeed able to highlight anomalous regions, thereby capturing the features in the underlying latent space that cause a certain data sample to be abnormal.
UCSD Ped1 Dataset: We next test our proposed method on the UCSD Ped1 [26] pedestrian video dataset, where the videos were captured with a stationary camera monitoring a pedestrian walkway. This dataset includes 34 training sequences and 36 testing sequences, with about 5500 "normal" frames and 3400 "abnormal" frames. We resize all frames to a common resolution for training and testing.
We first qualitatively evaluate the performance of our proposed attention generation method in localizing anomalies. As we can see from Figure 5 (where the corresponding anomaly of interest is annotated on the left, e.g., bicycle, car, etc.), our anomaly localization technique with attention maps performs substantially better than simply computing the difference between the input and its reconstruction (this result is annotated as Vanilla-VAE in the figure). We note more precise localization of the high-response regions in our generated attention maps, and these high-response regions indeed correspond to anomalies in these images.
We next conduct a simple ablation study using the pixel-level segmentation AUROC score against the baseline of the difference between the input data and its reconstruction. We test our proposed attention generation mechanism at varying levels of spatial resolution by backpropagating to each of the encoder's convolutional layers: conv1, conv2, and conv3. The results are shown in Table 1, where we see that our proposed mechanism gives better performance than the baseline technique.
Method | Vanilla-VAE | Ours (Conv1) | Ours (Conv2) | Ours (Conv3)
AUROC  | 0.86        | 0.89         | 0.92         | 0.91
MVTec-AD Dataset: We consider the recently released comprehensive anomaly detection dataset MVTec Anomaly Detection (MVTec-AD) [3], which provides multi-object, multi-defect natural images and pixel-level ground truth. This dataset contains 5354 high-resolution color images of different objects and textures, with both normal and defect (abnormal) images provided in the testing set. We resize all images to the same resolution for training and testing. We conduct extensive qualitative and quantitative experiments and summarize the results below.
We train a VAE with ResNet-18 [13] as our feature encoder and a 32-dimensional latent space. We further use random mirroring and random rotation, as done in the original work [3], to generate an augmented training set. Given a test image, we infer its latent representation and use Equation 3 to generate the anomaly attention map. Given our anomaly attention maps, we generate binary anomaly localization maps using a variety of thresholds on the pixel response values, which is encapsulated in the ROC curve. We then compute and report the area under the ROC curve (ROC AUC) and report the best IOU for our method based on the FPR and TPR from the ROC curve.
The results are shown in Table 2 (note that the baselines in the table are the same methods as in [3]), where we compare our performance with the techniques evaluated in the benchmark paper of Bergmann et al. [3]. From the results, we note that with our anomaly localization approach using the proposed VAE attention, we obtain better results on most of the object categories than the competing methods. It is worth noting that some of these methods are specifically designed for the anomaly localization task, whereas we train a standard VAE and generate our VAE attention maps for localization. Despite this simplicity, our method achieves competitive performance, demonstrating the potential of such an attention generation technique to be useful for tasks other than just model explanation.
We also show some qualitative results in Figure 6, covering 6 categories: three textures and three objects. For each category, we also show four types of defects provided by the dataset. From top to bottom, we plot the original images, the ground truth segmentation masks, and our anomaly attention maps overlaid on the input images. Across a variety of defect types in multiple categories, our attention maps localize the anomalies accurately and are sometimes even more refined than the ground truth maps (note the Wood-Scratch example: the ground truth map indicates a much larger anomalous area than the actual scratch defect, yet our attention map captures the shape of the defect precisely).
Table 2: Anomaly localization results (ROC AUC / best IOU per cell) on MVTec-AD. The first four columns are the baseline methods evaluated in [3]; the last column is ours.

Category   | [3]         | [3]         | [3]         | [3]         | Ours
Textures:
Carpet     | 0.87 / 0.69 | 0.59 / 0.38 | 0.54 / 0.34 | 0.72 / 0.20 | 0.78 / 0.1
Grid       | 0.94 / 0.88 | 0.90 / 0.83 | 0.58 / 0.04 | 0.59 / 0.02 | 0.73 / 0.02
Leather    | 0.78 / 0.71 | 0.75 / 0.67 | 0.64 / 0.34 | 0.87 / 0.74 | 0.95 / 0.24
Tile       | 0.59 / 0.04 | 0.51 / 0.23 | 0.50 / 0.08 | 0.93 / 0.14 | 0.80 / 0.23
Wood       | 0.73 / 0.36 | 0.73 / 0.29 | 0.62 / 0.14 | 0.91 / 0.47 | 0.77 / 0.14
Objects:
Bottle     | 0.93 / 0.15 | 0.86 / 0.22 | 0.86 / 0.05 | 0.78 / 0.07 | 0.87 / 0.27
Cable      | 0.82 / 0.01 | 0.86 / 0.05 | 0.78 / 0.01 | 0.79 / 0.13 | 0.90 / 0.18
Capsule    | 0.94 / 0.09 | 0.88 / 0.11 | 0.84 / 0.04 | 0.84 / 0.00 | 0.74 / 0.11
Hazelnut   | 0.97 / 0.00 | 0.95 / 0.41 | 0.87 / 0.02 | 0.72 / 0.00 | 0.98 / 0.44
Metal Nut  | 0.89 / 0.01 | 0.86 / 0.26 | 0.76 / 0.00 | 0.82 / 0.13 | 0.94 / 0.49
Pill       | 0.91 / 0.07 | 0.85 / 0.25 | 0.87 / 0.17 | 0.68 / 0.00 | 0.83 / 0.18
Screw      | 0.96 / 0.03 | 0.96 / 0.34 | 0.80 / 0.01 | 0.87 / 0.00 | 0.97 / 0.17
Toothbrush | 0.92 / 0.08 | 0.93 / 0.51 | 0.90 / 0.07 | 0.77 / 0.00 | 0.94 / 0.14
Transistor | 0.90 / 0.01 | 0.86 / 0.22 | 0.80 / 0.08 | 0.66 / 0.03 | 0.93 / 0.30
Zipper     | 0.88 / 0.10 | 0.77 / 0.13 | 0.78 / 0.01 | 0.76 / 0.00 | 0.78 / 0.06
3.4 Attention Disentanglement
In the previous section, we discussed how one can generate visual explanations, by means of gradient-based attention, as well as anomaly attention maps for VAEs. We also discussed and experimentally evaluated using these anomaly attention maps for anomaly localization on a variety of datasets. We next discuss another application of our proposed VAE attention: VAE latent space disentanglement. Existing approaches for learning disentangled representations of deep generative models focus on formulating factorized, independent latent distributions so as to learn interpretable data representations. Some examples include β-VAE [14], InfoVAE [44], and FactorVAE [19], among others, all of which attempt to model the latent prior with a factorial probability distribution. In this work, we present an alternative technique, based on our proposed VAE attention, called the attention disentanglement loss. We show how it can be integrated with existing baselines, e.g., FactorVAE, and demonstrate the resulting impact by means of qualitative attention maps and quantitative performance characterization with standard disentanglement evaluation metrics.
Training with Attention Disentanglement
As we showed earlier, our proposed VAE attention, by means of gradient-based attention, generates attention maps that can explain the underlying latent space represented by the trained VAE. We showed how attention maps intuitively represent different regions of normal and abnormal images, directly corresponding to differences in the latent space (since we generate attention from the latent code). Consequently, our intuition is that using these attention maps to further bootstrap the training process of the VAE model should help boost latent space disentanglement. To this end, our big-picture idea is to use these attention maps as trainable constraints to explicitly force the attention computed from the various dimensions in latent space to be as disentangled, or separable, as possible. Our hypothesis is that if we are able to achieve this, we will learn an improved disentangled latent space. To realize this objective, we propose a new loss called the attention disentanglement loss ($\mathcal{L}_{AD}$) that can be easily integrated with existing VAE-type models (see Figure 7). Note that while we use FactorVAE [19] for demonstration in this work, the proposed attention disentanglement loss is in no way limited to this model and can be used in conjunction with other models as well (e.g., β-VAE [14]). The proposed $\mathcal{L}_{AD}$ takes two attention maps $A_1$ and $A_2$ (each computed from a certain dimension of the latent space) as input, and attempts to separate the high-response pixel regions in them as much as possible. This can be mathematically expressed as:
$\mathcal{L}_{AD} = A_1 \odot A_2 = \sum_{u,v} a_{uv}^1\, a_{uv}^2$   (6)
where $\odot$ is the scalar product operation, and $a_{uv}^1$ and $a_{uv}^2$ are the pixels at location $(u,v)$ in the attention maps $A_1$ and $A_2$, respectively. The proposed $\mathcal{L}_{AD}$ can be directly integrated with the standard FactorVAE training objective $\mathcal{L}_{\mathrm{FactorVAE}}$, giving an overall learning objective expressed as:
$\mathcal{L} = \mathcal{L}_{\mathrm{FactorVAE}} + \lambda\,\mathcal{L}_{AD}$   (7)
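The loss terms above can be sketched as follows. This is an illustrative NumPy version: the per-map normalization is an assumption added here so the overlap penalty is scale-invariant, and the weight `lam` stands in for the trade-off parameter $\lambda$.

```python
import numpy as np

def attention_disentanglement_loss(A1, A2):
    """
    Penalize overlap between two attention maps: the elementwise product
    is large only where both maps respond strongly, so minimizing it
    pushes their high-response regions apart. Maps are normalized to unit
    mass first (an assumption, not stated in the text).
    """
    A1 = A1 / (A1.sum() + 1e-8)
    A2 = A2 / (A2.sum() + 1e-8)
    return float(np.sum(A1 * A2))

def total_loss(factorvae_loss, A1, A2, lam=1.0):
    """Overall objective: FactorVAE loss plus the weighted attention term."""
    return factorvae_loss + lam * attention_disentanglement_loss(A1, A2)
```

Perfectly disjoint attention maps incur zero penalty, while identical maps incur the maximum penalty for their mass.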
Results
Data: We use the dSprites dataset [30] for experimental evaluation. This is a standard dataset in the disentanglement literature, providing 737,280 binary 2D shape images.
Quantitative Results: In Table 3, we compare the disentanglement performance of our proposed method with competing approaches: the baseline FactorVAE [19] and β-VAE [14]. From Table 3, we note that training with our proposed $\mathcal{L}_{AD}$ results in substantially higher disentanglement scores compared to the baseline FactorVAE trained without it under the same experimental settings. Specifically, the average disentanglement score of our proposed method is around 0.93, significantly higher than the 0.82 for the baseline FactorVAE (γ=40). We also note that our proposed method obtains a higher disentanglement score than β-VAE, which achieves 0.73 for β=4. These results demonstrate the potential of both our proposed VAE attention and the associated disentanglement loss in improving the performance of existing methods in the disentanglement literature. These improved results are also reflected in the qualitative attention map results we discuss next.
Model          | β=1  | β=4  | β=6  | β=16
β-VAE [14]     | 0.69 | 0.73 | 0.7  | 0.68

Model          | γ=10 | γ=20 | γ=40 | γ=100
FactorVAE [19] | 0.75 | 0.77 | 0.82 | 0.7
AD-FactorVAE   | 0.91 | 0.93 | 0.93 | 0.84
Qualitative Results: Figure 8 shows attention maps generated using the baseline FactorVAE as well as FactorVAE trained with our proposed $\mathcal{L}_{AD}$ (called AD-FactorVAE) using the pipeline discussed above. The first row shows 5 input images, and the next 4 rows show results with our proposed method and the baseline FactorVAE. Row 2 shows attention maps generated with AD-FactorVAE by backpropagating from the latent dimension with the highest response, whereas row 3 shows attention maps generated by backpropagating from the latent dimension with the next highest response. Rows 4 and 5 show the corresponding attention maps for the baseline FactorVAE. We observe that our proposed method gives better attention separation than the baseline FactorVAE, with high-response regions in different areas of the image.
4 Summary
We presented new techniques to visually explain variational autoencoders, taking a first step towards explaining deep generative models by means of gradient-based network attention. We showed how one can use the learned latent representation to compute gradients and generate VAE attention maps, without relying on the classification-style models that existing works use. We also showed that we can go beyond using the resulting attention maps for explaining VAEs by demonstrating applicability and performance on two tasks: anomaly localization and latent space disentanglement. In anomaly localization, we exploited the fact that an abnormal input results in latent variables that do not conform to the standard Gaussian, using this deviation in gradient backpropagation and attention generation. The resulting anomaly attention maps were then used as cues to generate pixel-level binary anomaly masks. In latent space disentanglement, we showed how to use our VAE attention from each latent dimension to enforce a new attention disentanglement learning objective, resulting in improved attention separability as well as disentanglement performance.
Footnotes
 Wenqian Liu and Runze Li contributed equally to this work. Email: liu.wenqi@husky.neu.edu, rli047@ucr.edu, zhengm3@rpi.edu, srikrishna.karanam@unitedimaging.com, ziyan.wu@unitedimaging.com, bhanu@cris.ucr.edu, rjradke@ecse.rpi.edu, camps@coe.neu.edu.
References
 Samet Akcay, Amir Atapour-Abarghouei, and Toby P. Breckon. GANomaly: Semi-supervised anomaly detection via adversarial training. In Asian Conference on Computer Vision, pages 622–637. Springer, 2018.
 Youssef Alami Mejjati, Christian Richardt, James Tompkin, Darren Cosker, and Kwang In Kim. Unsupervised attention-guided image-to-image translation. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 3693–3703. Curran Associates, Inc., 2018.
 Paul Bergmann, Michael Fauser, David Sattlegger, and Carsten Steger. MVTec AD – a comprehensive real-world dataset for unsupervised anomaly detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9592–9600, 2019.
 Raghavendra Chalapathy and Sanjay Chawla. Deep learning for anomaly detection: A survey. arXiv preprint arXiv:1901.03407, 2019.
 Raghavendra Chalapathy, Aditya Krishna Menon, and Sanjay Chawla. Anomaly detection using one-class neural networks. arXiv preprint arXiv:1802.06360, 2018.
 Aditya Chattopadhay, Anirban Sarkar, Prantik Howlader, and Vineeth N. Balasubramanian. Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks. In WACV, 2018.
 Tian Qi Chen, Xuechen Li, Roger B. Grosse, and David K. Duvenaud. Isolating sources of disentanglement in variational autoencoders. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 2610–2620. Curran Associates, Inc., 2018.
 Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2172–2180. Curran Associates, Inc., 2016.
 Li Deng. The MNIST database of handwritten digit images for machine learning research [Best of the Web]. IEEE Signal Processing Magazine, 29(6):141–142, 2012.
 Guillaume Desjardins, Aaron C. Courville, and Yoshua Bengio. Disentangling factors of variation via generative entangling. arXiv preprint arXiv:1210.5474, 2012.
 Hiroshi Fukui, Tsubasa Hirakawa, Takayoshi Yamashita, and Hironobu Fujiyoshi. Attention branch network: Learning of attention mechanism for visual explanation. In CVPR, 2019.
 Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
 Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
 Irina Higgins, Loïc Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew M. Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.
 Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.
 Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
 Dakai Jin, Dazhou Guo, Tsung-Ying Ho, Adam P. Harrison, Jing Xiao, Chen-Kan Tseng, and Le Lu. Accurate esophageal gross tumor volume segmentation in PET/CT using two-stream chained 3D deep network fusion. In MICCAI, 2019.
 Takuhiro Kaneko, Yoshitaka Ushiku, and Tatsuya Harada. Label-noise robust generative adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
 Hyunjik Kim and Andriy Mnih. Disentangling by factorising. In ICML, 2018.
 Jaechul Kim and Kristen Grauman. Observe locally, infer globally: A space-time MRF for detecting abnormal activities with incremental updates. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 2921–2928. IEEE, 2009.
 Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
 Buyu Li, Wanli Ouyang, Lu Sheng, Xingyu Zeng, and Xiaogang Wang. GS3D: An efficient 3D object detection framework for autonomous driving. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
 Kunpeng Li, Ziyan Wu, Kuan-Chuan Peng, Jan Ernst, and Yun Fu. Guided attention inference network. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
 Weixin Li, Vijay Mahadevan, and Nuno Vasconcelos. Anomaly detection and localization in crowded scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(1):18–32, 2013.
 Liqian Ma, Xu Jia, Qianru Sun, Bernt Schiele, Tinne Tuytelaars, and Luc Van Gool. Pose guided person image generation. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 406–416. Curran Associates, Inc., 2017.
 Aravindh Mahendran and Andrea Vedaldi. Understanding deep image representations by inverting them. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5188–5196, 2015.
 Emile Mathieu, Tom Rainforth, Siddharth Narayanaswamy, and Yee Whye Teh. Disentangling disentanglement. arXiv preprint arXiv:1812.02833, 2018.
 Loic Matthey, Irina Higgins, Demis Hassabis, and Alexander Lerchner. dSprites: Disentanglement testing Sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017.
 Nazanin Mehrasa, Akash Abdu Jyothi, Thibaut Durand, Jiawei He, Leonid Sigal, and Greg Mori. A variational autoencoder model for stochastic point processes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
 Lukas Ruff, Robert Vandermeulen, Nico Goernitz, Lucas Deecke, Shoaib Ahmed Siddiqui, Alexander Binder, Emmanuel Müller, and Marius Kloft. Deep oneclass classification. In International Conference on Machine Learning, pages 4393–4402, 2018.
 Jürgen Schmidhuber. Learning factorial codes by predictability minimization. Neural Computation, 4(6):863–879, Nov. 1992.
 Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
 Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In ICCV, 2017.
 Yichuan Tang, Nitish Srivastava, and Ruslan Salakhutdinov. Learning generative models with visual attention. In NIPS, 2013.
 Lezi Wang, Ziyan Wu, Srikrishna Karanam, Kuan-Chuan Peng, Rajat Vikram Singh, Bo Liu, and Dimitris N. Metaxas. Sharpen focus: Learning with attention separability and consistency. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
 Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. Video-to-video synthesis. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
 Jiqing Wu, Zhiwu Huang, Dinesh Acharya, Wen Li, Janine Thoma, Danda Pani Paudel, and Luc Van Gool. Sliced Wasserstein generative models. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
 Yongqin Xian, Saurabh Sharma, Bernt Schiele, and Zeynep Akata. f-VAEGAN-D2: A feature generating framework for any-shot learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
 Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer, 2014.
 Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 7354–7363, Long Beach, California, USA, 09–15 Jun 2019. PMLR.
 Jianming Zhang, Zhe Lin, Jonathan Brandt, Xiaohui Shen, and Stan Sclaroff. Top-down neural attention by excitation backprop. International Journal of Computer Vision, 126:1084–1102, 2016.
 Shengjia Zhao, Jiaming Song, and Stefano Ermon. InfoVAE: Information maximizing variational autoencoders. arXiv preprint arXiv:1706.02262, 2017.
 Yiru Zhao, Bing Deng, Chen Shen, Yao Liu, Hongtao Lu, and Xian-Sheng Hua. Spatio-temporal autoencoder for video anomaly detection. In Proceedings of the 25th ACM International Conference on Multimedia, pages 1933–1941. ACM, 2017.
 Meng Zheng, Srikrishna Karanam, Ziyan Wu, and Richard J. Radke. Re-identification with consistent attentive Siamese networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5735–5744, 2019.
 Zhilin Zheng and Li Sun. Disentangling latent space for VAE by label relevant/irrelevant dimensions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
 Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2921–2929, 2016.
 Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In CVPR, 2016.
 Chong Zhou and Randy C. Paffenroth. Anomaly detection with robust deep autoencoders. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 665–674. ACM, 2017.
 Bo Zong, Qi Song, Martin Renqiang Min, Wei Cheng, Cristian Lumezanu, Daeki Cho, and Haifeng Chen. Deep autoencoding Gaussian mixture model for unsupervised anomaly detection. In International Conference on Learning Representations, 2018.
 Yiming Zuo, Weichao Qiu, Lingxi Xie, Fangwei Zhong, Yizhou Wang, and Alan L. Yuille. Craves: Controlling robotic arm with a visionbased economic system. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.