Classifier-agnostic saliency map extraction
Abstract
We argue for the importance of decoupling saliency map extraction from any specific classifier. We propose a practical algorithm to train a classifier-agnostic saliency mapping by simultaneously training a classifier and a saliency mapping. The proposed algorithm is motivated as finding the mapping that is not strongly coupled with any specific classifier. We qualitatively and quantitatively evaluate the proposed approach and verify that it extracts higher-quality saliency maps than existing approaches that depend on a fixed classifier. The proposed approach performs well even on images containing objects from classes unseen during training.
Konrad Żołna (Jagiellonian University, New York University) konrad.zolna@gmail.com
Krzysztof J. Geras (NYU School of Medicine) k.j.geras@nyu.edu
Kyunghyun Cho (New York University, CIFAR Azrieli Global Scholar, Facebook AI Research) kyunghyun.cho@nyu.edu
Preprint. Work in progress.
1 Introduction
The recent success of deep convolutional networks for large-scale object recognition [6, 8, 15, 19] has spurred interest in utilizing them to automatically detect and localize objects in natural images. Simonyan et al. [16] and Springenberg et al. [17] demonstrated that the gradient of the class-specific score of a given classifier can be used to extract a saliency map of an image. Despite these promising results, it has been noticed that such classifier-dependent saliency maps tend to be noisy, covering many irrelevant pixels and missing many relevant ones. Sometimes a map may even be adversarial, meaning that it is sufficient to fool the specific, given classifier but would not confuse another classifier. Much of the recent work has therefore focused on regularization techniques for correcting saliency maps extracted for a given classifier. Selvaraju et al. [14], for instance, average multiple saliency maps created for perturbed images to obtain a smooth saliency map. Others have tried to modify deep convolutional networks to explicitly equip them with a saliency map extractor [11, 12].
We notice that the strong dependence on a given classifier lies at the center of the whole issue of unsatisfactory saliency maps, and we attempt to tackle this core problem directly. We first argue that it is necessary for a classifier to be uniquely optimal in order for any saliency map extracted for it to indicate each and every relevant pixel. This is a difficult condition to satisfy in general, because there may be many equally good classifiers, and there is no guarantee that we could find a uniquely optimal one, if it even exists. We thus propose to train a saliency mapping that works for all possible classifiers (within a single family) weighted by their posterior probabilities. We call this approach classifier-agnostic saliency map extraction and propose a practical algorithm that avoids the intractable expectation over the posterior distribution.
The proposed approach results in a neural-network-based saliency mapping that depends only on an input image. We qualitatively find that it extracts higher-quality saliency maps than classifier-dependent methods, as can be seen in Fig. 2. We evaluate it quantitatively by using the extracted saliency maps for object localization and observe that the proposed approach outperforms the existing localization techniques based on a fixed classifier and closely approaches the localization performance of a strongly supervised model. Furthermore, we experimentally validate that the proposed approach works reasonably well even for classes unseen during training, suggesting a way toward class-agnostic saliency map extraction.
2 Classifier-agnostic saliency map extraction
In this paper, we tackle the problem of extracting a salient region of an input image $x$ by learning a mapping $m$ that produces a saliency map $m(x)$ over the pixels of $x$. Such a mapping should retain ($m(x)_{ij} \approx 1$) any pixel of the input image if it aids classification, while it should mask ($m(x)_{ij} \approx 0$) any other pixel.
2.1 Classifier-dependent saliency map extraction
Earlier work has largely focused on a setting in which a classifier $f$ is given. These approaches can be implemented as solving the following maximization problem:

$$\hat{m}(f) = \arg\max_{m} S(f, m), \qquad (1)$$

where $S$ is a score function corresponding to a classification loss, i.e.,

$$S(f, m) = \sum_{(x, y) \in \mathcal{D}} \ell\big(f\big((1 - m(x)) \odot x\big),\, y\big) - R(m), \qquad (2)$$

where $\odot$ denotes a masking operation (elementwise multiplication), $R$ is a regularization term and $\ell$ is a per-example classification loss, such as cross-entropy. We are given a training set $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$. This optimization procedure can be interpreted as finding a mapping $m$ that maximally confuses the given classifier $f$ by masking out the pixels on which $f$ relies. We refer to it as classifier-dependent saliency map extraction.
We define the classifier $f$ to be uniquely optimal if, for any possible mapping $m$ that masks out at least one relevant pixel, $\mathcal{L}(f, m) > \mathcal{L}(f)$, where the classification loss $\mathcal{L}(f, m)$ on masked-out images is defined as

$$\mathcal{L}(f, m) = \sum_{(x, y) \in \mathcal{D}} \ell\big(f\big((1 - m(x)) \odot x\big),\, y\big), \qquad (3)$$

and $\mathcal{L}(f)$ denotes the corresponding loss on unmasked images.
In other words, a uniquely optimal classifier utilizes all salient parts of the image. Any mapping $\hat{m}(f)$ obtained for a single classifier $f$ will not capture the saliency of all input pixels unless the classifier is uniquely optimal.
A mapping $\hat{m}(f_1)$ obtained with a classifier $f_1$ that is not uniquely optimal may differ from a mapping $\hat{m}(f_2)$ found using another classifier $f_2$, even if both classifiers are equally good. This is against our definition of the mapping above, which states that any pixel that helps classification must be indicated by the mask (a saliency map) with a value close to 1. This disagreement happens because the two equally good, but distinct, classifiers may use different, overlapping subsets of input pixels to perform classification.
An example
This behaviour can be intuitively explained with a simple example. Consider a data set in which every instance consists of two identical copies of a smaller image concatenated together, that is, $x = [\bar{x}; \bar{x}]$ for some image $\bar{x}$. In this case, there exist at least two classifiers, $f_1$ and $f_2$, with the same classification loss: $f_1$ uses only the left half of the image, while $f_2$ uses only the right half. Each of the corresponding mappings, $\hat{m}(f_1)$ and $\hat{m}(f_2)$, would then indicate a region of interest only on the corresponding half of the image.
2.2 Classifier-agnostic saliency map extraction
In order to address this issue, we propose to alter the objective function in Eq. (1) to consider not a single fixed classifier but all possible classifiers weighted by their posterior probabilities. That is,

$$\hat{m} = \arg\max_{m}\; \mathbb{E}_{f \sim p(f \mid \mathcal{D})}\left[ S(f, m) \right], \qquad (4)$$

where the posterior probability $p(f \mid \mathcal{D})$ is defined to be proportional to the exponentiated negative classification loss, i.e., $p(f \mid \mathcal{D}) \propto \exp(-\mathcal{L}(f))$. Solving this optimization problem is equivalent to searching over the space of all possible classifiers and finding a mapping that works with all of them. As we parameterize $f$ as a convolutional network (with parameters denoted as $\theta$), the space of all possible classifiers is isomorphic to the space of its parameters. Because the proposed approach considers all classifiers, we call it classifier-agnostic saliency map extraction.
In the case of the simple example above, where each image contains two copies of a smaller image, the two classifiers $f_1$ and $f_2$, which respectively look at one and the other half of an image, have the same posterior probability (we assume a flat prior, i.e., a constant $p(f)$). Solving Eq. (4) therefore implies finding a mapping that confuses both of these classifiers, i.e., one that masks out the relevant pixels in both halves of the image.
2.3 Algorithm
The optimization problem in Eq. (4) is, unfortunately, generally intractable. This arises from the intractable expectation over the posterior distribution. Furthermore, the expectation sits inside the optimization loop for the mapping $m$, making the problem even harder to solve.
Thus, we solve this problem approximately by simultaneously estimating the mapping and the expected objective. First, we draw a sample from the posterior distribution by taking a single step of stochastic gradient descent (SGD) on the classification loss with respect to the classifier parameters $\theta$ with a small step size $\eta$:

$$\theta_t = \theta_{t-1} - \eta \nabla_{\theta} \mathcal{L}(f_{\theta_{t-1}}, m_{t-1}). \qquad (5)$$
This is motivated by earlier work [21, 10] which showed that SGD performs approximate Bayesian posterior inference.
After $t$ steps, we have up to $t$ samples from the posterior distribution (a usual practice of thinning may be applied, leading to fewer than $t$ samples; see the paragraph on thinning below). We sample one of them, $f_j$ (we set the chance of selecting the latest classifier to 50% and spread the remaining 50% uniformly over the earlier ones), to get a single-sample estimate of the expectation in Eq. (4) by computing $S(f_j, m_{t-1})$. Then, we use it to obtain an updated mapping $m_t$ by taking one gradient ascent step on the parameters $\phi$ of the mapping:

$$\phi_t = \phi_{t-1} + \eta' \nabla_{\phi} S(f_j, m_{\phi_{t-1}}). \qquad (6)$$
We alternate between these two steps until converges (cf. Alg. 1).
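For concreteness, one iteration of this alternating procedure can be sketched as follows. This is a toy sketch in plain Python, not the authors' implementation: `grad_L`, `grad_S` and the learning rates are illustrative stand-ins for the gradients of the classification loss (Eq. 5) and of the score function (Eq. 6).

```python
import random

def train_step(theta, phi, grad_L, grad_S, classifiers,
               lr_f=1e-3, lr_m=1e-3):
    """One iteration of the alternating procedure sketched above.

    `classifiers` accumulates the posterior samples from which f_j is drawn;
    all names here are illustrative stand-ins.
    """
    # (5) one SGD step on the classification loss -> a new posterior sample
    theta = [t - lr_f * g for t, g in zip(theta, grad_L(theta, phi))]
    classifiers.append(list(theta))
    # pick f_j: the latest sample with probability 0.5,
    # otherwise uniformly from the earlier ones
    if len(classifiers) == 1 or random.random() < 0.5:
        f_j = classifiers[-1]
    else:
        f_j = random.choice(classifiers[:-1])
    # (6) one ascent step on the single-sample estimate S(f_j, m)
    phi = [p + lr_m * g for p, g in zip(phi, grad_S(f_j, phi))]
    return theta, phi
```

In practice the growing list of classifier snapshots is pruned by one of the thinning strategies described later.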
Score function
The score function $S$ estimates the quality of the saliency map extracted by $m$ given a data set $\mathcal{D}$ and a classifier $f$. The score function must be designed to balance precision and recall. Precision refers to the fraction of truly relevant pixels among those marked by $m$ as relevant, while recall refers to the fraction of pixels marked by $m$ as relevant among all the relevant pixels. In order to balance these two, the score function consists of two terms.
The first term ensures that all the relevant pixels are included (high recall). As in Eq. (2), a popular choice has been the classification loss computed on the input image masked out by $m$. In our preliminary experiments, however, we noticed that this choice leads to masks with adversarial artifacts. We hence propose to use the entropy of the classifier's predictive distribution instead. This makes the generated masks cover all salient pixels in the input, avoiding masks that merely sway the class prediction to a different but close class, for example, from one dog breed to another.
The second term, $R(m)$, excludes a trivial solution that maximizes recall: a mapping that simply outputs an all-ones saliency map. Such a map achieves maximal recall with low precision. In order to achieve reasonable precision, we must introduce a regularization term. Popular choices include total variation [13] and the $L_1$ norm; we use only the latter.
In summary, we use the following score function for classifier-agnostic saliency map extraction:

$$S(f, m) = \sum_{(x, y) \in \mathcal{D}} \Big[ \mathcal{H}\big(f\big((1 - m(x)) \odot x\big)\big) - \lambda \lVert m(x) \rVert_1 \Big], \qquad (7)$$

where $\mathcal{H}$ denotes the entropy of the classifier's predictive distribution and $\lambda$ is a regularization coefficient.
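A single-batch version of this score can be sketched in NumPy as follows. The sketch is illustrative only: `predict` stands in for the classifier (mapping a batch of images to class probabilities) and is our assumption, not the paper's code.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Entropy of each row of a batch of probability vectors."""
    return -np.sum(p * np.log(p + eps), axis=-1)

def score(predict, masks, images, lam):
    """Single-batch sketch of the score in Eq. (7): the entropy of the
    classifier's prediction on masked-out images, minus an L1 penalty
    on the masks."""
    masked_out = (1.0 - masks) * images  # remove the salient pixels
    return entropy(predict(masked_out)).sum() - lam * np.abs(masks).sum()
```

Maximizing the first term pushes the mapping to remove all evidence the classifier could use, while the penalty keeps the mask from covering everything.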
Thinning
As the algorithm collects a set $\mathcal{F}$ of classifiers sampled from the posterior distribution, we need a strategy to keep only a small subset of them. An obvious approach is to keep all classifiers, but this does not scale well with the number of iterations. We propose and empirically evaluate a few strategies. The first three assume a fixed size of $\mathcal{F}$: keeping the first classifier only, denoted by F ($|\mathcal{F}| = 1$); the last only, denoted by L ($|\mathcal{F}| = 1$); and the first and last only, denoted by FL ($|\mathcal{F}| = 2$). As an alternative, we also consider a growing set of classifiers where we keep one classifier every 1000 iterations (denoted by L1000), but whenever the set exceeds a fixed capacity, we randomly remove one classifier from it. Analogously, we experiment with L100.
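These strategies can be sketched as a pool-update rule applied at each iteration. The sketch is ours, not the released code; `every` and `cap` (the snapshot interval and the capacity that triggers random removal) are assumed bookkeeping parameters.

```python
import random

def update_pool(pool, snapshot, t, strategy, every=100, cap=30):
    """Sketch of the thinning strategies named in the text; `every` and
    `cap` are illustrative assumptions, not values from the paper."""
    if strategy == "F":                # keep the first classifier only
        return pool if pool else [snapshot]
    if strategy == "L":                # keep the last classifier only
        return [snapshot]
    if strategy == "FL":               # keep the first and the last
        return [pool[0] if pool else snapshot, snapshot]
    if strategy in ("L100", "L1000"):  # growing set with random eviction
        if t % every == 0:
            pool = pool + [snapshot]
            if len(pool) > cap:
                pool.pop(random.randrange(len(pool)))
        return pool
    raise ValueError("unknown strategy: " + strategy)
```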
Classification loss
Although we described our approach using the classification loss computed only on masked-out images, as in Eq. (3), it is not necessary to define the classification loss exactly in this way. In preliminary experiments, we noticed that the following alternative formulation, inspired by adversarial training [18], works better:

$$\mathcal{L}(f, m) = \frac{1}{2} \sum_{(x, y) \in \mathcal{D}} \Big[ \ell\big(f\big((1 - m(x)) \odot x\big),\, y\big) + \ell\big(f(x),\, y\big) \Big]. \qquad (8)$$

We thus use the loss defined above in the experiments. We conjecture that it is advantageous over the original one in Eq. (3), as the additional term prevents the degradation of the classifier's performance on the original, unmasked images, while the first term encourages the classifier to collect new pieces of evidence from the masked-out images.
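Under our reading of this mixed formulation (the classifier is trained on masked-out and clean images in equal parts), the loss can be sketched as follows; `predict` is again an illustrative stand-in for the classifier.

```python
import numpy as np

def cross_entropy(probs, y, eps=1e-12):
    """Mean negative log-probability of the correct class."""
    return -np.log(probs[np.arange(len(y)), y] + eps).mean()

def mixed_loss(predict, images, masks, y):
    """Sketch of the mixed classification loss: the average of the loss on
    masked-out images and the loss on the original, unmasked images."""
    on_masked_out = cross_entropy(predict((1.0 - masks) * images), y)
    on_clean = cross_entropy(predict(images), y)
    return 0.5 * (on_masked_out + on_clean)
```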
3 Experimental settings
Dataset: ImageNet
Our models were trained on the official ImageNet training set with ground-truth class labels [3]. We evaluate them on the validation set. Depending on the experiment, we use ground-truth class or localization labels.
3.1 Architectures
Classifier and mapping
We use a ResNet-50 [6] as the classifier $f$ in our experiments. We follow an encoder-decoder architecture for constructing the mapping $m$. The encoder is also implemented as a ResNet-50, so its weights can either be shared with the classifier or kept separate; we experimentally find that sharing is beneficial. The decoder is a deep deconvolutional network that ultimately outputs the mask of an input image. The input to the decoder consists of all hidden layers of the encoder that are directly followed by a downscaling operation. We transform them to a common size and concatenate them into a single feature map: each is passed through a 1×1 convolution with 64 filters, followed by batch normalization, a ReLU nonlinearity and rescaling to 56×56 pixels. Finally, a single 3×3 convolutional filter followed by a sigmoid activation is applied to the concatenated feature map, and the output is upscaled to a 224×224-pixel mask using bilinear interpolation. The overall architecture is shown in Fig. 1.
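The decoder just described can be sketched in PyTorch as follows. This is a sketch under the stated description, not the released code; the ResNet-50 stage channel counts and the input feature resolutions are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderSketch(nn.Module):
    """Sketch of the decoder: each encoder feature map goes through a 1x1
    conv (64 filters), batch norm and ReLU, is rescaled to 56x56, all maps
    are concatenated, and a 3x3 conv with a sigmoid plus bilinear upsampling
    to 224x224 produces the mask. Channel sizes follow ResNet-50 stages and
    are assumptions of this sketch."""
    def __init__(self, enc_channels=(256, 512, 1024, 2048)):
        super().__init__()
        self.proj = nn.ModuleList(
            nn.Sequential(nn.Conv2d(c, 64, kernel_size=1),
                          nn.BatchNorm2d(64), nn.ReLU(inplace=True))
            for c in enc_channels)
        self.head = nn.Conv2d(64 * len(enc_channels), 1,
                              kernel_size=3, padding=1)

    def forward(self, feats):
        maps = [F.interpolate(p(f), size=(56, 56), mode='bilinear',
                              align_corners=False)
                for p, f in zip(self.proj, feats)]
        mask = torch.sigmoid(self.head(torch.cat(maps, dim=1)))
        return F.interpolate(mask, size=(224, 224), mode='bilinear',
                             align_corners=False)
```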
3.2 Training
Optimization
We initialize the classifier by training it on the entire training set; in practice, we use the pretrained ResNet-50 from the torchvision model zoo. We find this pretraining strategy facilitates learning, particularly in the early stage. We use vanilla SGD with a small learning rate (with the momentum coefficient set to 0.9 and a small weight-decay coefficient) to continue training the classifier with the mixed classification loss of Eq. (8). To train the mapping, we use Adam [7] with a small learning rate (and a small weight-decay coefficient) and all other hyperparameters set to their default values. We fix the number of training epochs to 70, where each epoch covers a random 20% of the training set.
Regularization coefficient
As noticed by Fan et al. [4], it is not trivial to find an optimal regularization coefficient $\lambda$. They proposed an adaptive strategy that removes the need for manual selection of $\lambda$. We, however, find it undesirable due to the lack of control over the average size of the saliency map. Instead, we propose to control the average number of relevant pixels by manually setting $\lambda$, while applying the regularization term only when the classifier's predictions on the clean and masked images disagree. We then set $\lambda$ for each experiment such that approximately 50% of the pixels in each image are indicated as relevant by the mapping $m$. In preliminary experiments, we further noticed that this approach avoids the problematic behavior on images containing small objects observed earlier by Fong and Vedaldi [5].
3.3 Evaluation
In our experiments we use only the single architecture described in Section 3.1. We use the abbreviation CASM (classifier-agnostic saliency mapping) to denote the final model obtained using the proposed method. Our baseline model (Baseline) has the same architecture but is trained with a fixed classifier (classifier-dependent saliency mapping), realized by the thinning strategy F.
Following previous work [1, 5, 23], we discretize our mask by

$$m_b(x) = \mathbb{1}\left[ m(x) > \alpha \bar{m} \right],$$

where $\bar{m}$ is the average mask intensity and $\alpha$ is a hyperparameter. We simply set $\alpha$ to 1, hence the average of pixel intensities is approximately the same for the input mask $m(x)$ and the discretized binary mask $m_b(x)$. To focus on the most dominant object, we take the largest connected component of the binary mask to obtain the binary connected mask.
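The discretization and the largest-connected-component step can be sketched as follows, in plain NumPy; the thresholding rule and 4-connectivity are assumptions of this sketch.

```python
import numpy as np

def binarize(mask, alpha=1.0):
    """Threshold the continuous mask at alpha times its mean intensity
    (alpha = 1 corresponds to thresholding at the mean)."""
    return (mask > alpha * mask.mean()).astype(np.uint8)

def largest_component(binary):
    """Return a mask containing only the largest 4-connected component,
    found with a simple flood fill."""
    h, w = binary.shape
    seen = np.zeros_like(binary, dtype=bool)
    best = np.zeros_like(binary)
    for i in range(h):
        for j in range(w):
            if binary[i, j] and not seen[i, j]:
                comp, stack = [], [(i, j)]
                seen[i, j] = True
                while stack:
                    a, b = stack.pop()
                    comp.append((a, b))
                    for da, db in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        na, nb = a + da, b + db
                        if (0 <= na < h and 0 <= nb < w
                                and binary[na, nb] and not seen[na, nb]):
                            seen[na, nb] = True
                            stack.append((na, nb))
                if len(comp) > best.sum():
                    cur = np.zeros_like(binary)
                    for a, b in comp:
                        cur[a, b] = 1
                    best = cur
    return best
```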
Visualization
We visualize the learned mapping by inspecting the saliency map of each image in three ways. First, we visualize the masked-in image $m(x) \odot x$, which ideally leaves only the relevant pixels visible. Second, we visualize the masked-out image $(1 - m(x)) \odot x$, which ideally contains only the pixels irrelevant to classification. Third, we visualize the masked-out image processed by an inpainting algorithm [20]. This allows us to check that the masked-out object cannot be easily reconstructed from nearby pixels.
Classification by multiple classifiers
In order to verify our claim that the proposed approach results in a classifier-agnostic saliency mapping, we evaluate a set of classifiers on the validation sets of masked-in images, masked-out images and inpainted masked-out images. (We train twenty ResNet-50 models from different random initializations, in addition to the pretrained classifiers from torchvision.models (https://pytorch.org/docs/master/torchvision/models.html): densenet121, densenet169, densenet201, densenet161, resnet18, resnet34, resnet50, resnet101, resnet152, vgg11, vgg11_bn, vgg13, vgg13_bn, vgg16, vgg16_bn, vgg19 and vgg19_bn.) If our claim is correct, we expect the inpainted masked-out images created by our method to break these classifiers, while the masked-in images should cause only minimal performance degradation.
Object localization
Because our downstream task is to recognize the most dominant object in an image, we can evaluate our approach on the task of localizing that object. To do so, we use the ILSVRC'14 localization task. We compute the bounding box of an object as the tightest box that covers the binary connected mask.
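Computing the tightest box around the binary connected mask is a small NumPy exercise; in this illustrative sketch, row and column bounds are inclusive.

```python
import numpy as np

def bounding_box(mask):
    """Tightest (row0, col0, row1, col1) box covering all nonzero mask
    pixels; bounds are inclusive."""
    rows = np.where(np.any(mask, axis=1))[0]
    cols = np.where(np.any(mask, axis=0))[0]
    return rows[0], cols[0], rows[-1], cols[-1]
```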
We use three metrics to quantify the quality of localization. First, we use the official metric (OM) from the ImageNet localization challenge, which considers a localization successful if at least one ground-truth bounding box has IOU higher than 0.5 with the predicted bounding box and the class prediction is correct. Since OM depends on the classifier, from which we have sought to make our mapping independent, we also use another widely used metric, called localization error (LE), which depends only on the bounding-box prediction [1, 5]. Lastly, we evaluate the original saliency map, in which each mask pixel is a continuous value between 0 and 1, by a continuous F1 score, with precision and recall defined as

$$p = \frac{\sum_{(i, j) \in B} m(x)_{ij}}{\sum_{(i, j)} m(x)_{ij}}, \qquad r = \frac{\sum_{(i, j) \in B} m(x)_{ij}}{|B|},$$

where $B$ is the set of pixels inside the ground-truth bounding box and $|B|$ is its area. We compute F1 scores against all the ground-truth bounding boxes for each image and report the highest one among them as its final score.
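Under our reading of these definitions (precision: mask intensity inside the box over total intensity; recall: intensity inside the box over the box area), the continuous F1 can be sketched as follows; the half-open box convention is an assumption of the sketch.

```python
import numpy as np

def continuous_f1(mask, box):
    """Continuous F1 of a soft mask against a ground-truth box given as
    half-open bounds (r0, c0, r1, c1)."""
    r0, c0, r1, c1 = box
    inside = mask[r0:r1, c0:c1].sum()
    precision = inside / max(mask.sum(), 1e-12)
    recall = inside / ((r1 - r0) * (c1 - c0))
    return 2 * precision * recall / max(precision + recall, 1e-12)
```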
4 Results and analysis
Visualization and statistics
We randomly select seven consecutive images from the validation set and feed them to two instances of CASM (each using a different thinning strategy, L or L100) and to Baseline. We visualize the original (clean), masked-in, masked-out and inpainted masked-out images in Fig. 2. We observe that the proposed approach produces clearly better saliency maps, while the classifier-dependent approach (Baseline) produces so-called adversarial masks [2].
We further compute statistics of the saliency maps generated by CASM and Baseline over the validation set. The masks extracted by CASM exhibit lower total variation, indicating that CASM produces more regular masks, despite the lack of explicit TV regularization. The entropy of the mask pixel intensities is much smaller for CASM, indicating that the intensities are closer to either 0 or 1 on average. Furthermore, the standard deviation of the masked-out volume is larger with CASM, indicating that CASM is capable of producing saliency maps of varying sizes, dependent on the input image.
Classification
Table 1: Localization evaluation: official metric (OM), localization error (LE) and continuous F1 (a dash marks scores not reported).

Model                 OM    LE    F1
Our:
  Baseline            59.6  49.6  53.4
  CASM                49.0  36.7  63.8
  CASM-nearest        48.9  36.6  63.6
ALN [4]               54.5  43.5  -
Weakly supervised:
  Occlusion [22]      -     48.6  -
  CAM [24]            56.4  48.1  -
  Grad-CAM [14]       -     47.5  -
  Mask [5]            -     43.1  -
  Guid [9]            -     42.0  -
  Grad [16]           -     41.7  -
  Feed [1]            -     38.8  -
  Exc [23]            -     38.7  -
  Masking model [2]   -     36.7  -
Supervised:
  VGG Net [15]        -     34.3  -
As shown in the left panel of Fig. 3, the entire set of classifiers suffers less from the masked-in images produced by CASM than from those produced by Baseline. We notice, however, that most of the classifiers fail to classify the masked-out images produced by Baseline, which we conjecture is due to the adversarial nature of the saliency maps produced by the Baseline approach. This is confirmed by the right panel, which shows that simple inpainting of the masked-out images dramatically increases accuracy when the saliency maps were produced by Baseline. The masked-out images produced by CASM, on the other hand, do not benefit from inpainting, because they truly do not retain any useful evidence for classification.
Localization
We report the localization performance of CASM, Baseline and prior works in Table 1 using three different metrics. Most of the existing approaches, except for that of Fan et al. [4], assume knowledge of the target class, unlike the proposed approach. The table clearly shows that CASM performs better than all prior approaches, including the classifier-dependent Baseline. In terms of LE, the fully supervised approach (VGG Net) is the only one that outperforms CASM.
Thinning strategies
Table 2: Ablation over the score function (S: E = entropy, C = classification loss), sharing of the encoder and the classifier (Shr) and the thinning strategy (Thin).

     S  Shr  Thin   OM    LE    F1
(a)  E  Y    F      59.6  49.6  53.4
(b)  E  Y    L      49.4  36.9  64.9
(c)  E  Y    FL     52.8  41.3  61.3
(d)  E  Y    L1000  49.5  37.3  64.0
(e)  E  Y    L100   49.0  36.7  63.8
(f)  C  Y    F      69.0  61.2  44.0
(g)  C  Y    L      50.4  38.1  65.7
(h)  C  Y    L100   51.3  39.2  63.6
(i)  E  N    F      -     55.5  47.8
(j)  E  N    L      -     47.2  59.2
(k)  E  N    L100   -     46.8  57.8
In Table 2 (a–e), we compare the five thinning strategies described earlier, where F is equivalent to Baseline. We observe that the strategies L and L100 perform better than the others, closely followed by L1000. This is expected, because the classifier samples become stale as the mapping evolves, implying that we should trust only the few latest classifier samples.
Sharing the encoder and classifier
Score function
Unlike Fan et al. [4], we use separate score functions for training the classifier and the saliency mapping. We empirically observe in Table 2 (a, b, e, f–h) that the proposed use of entropy as a score function results in a better mapping in terms of OM and LE. The gap, however, narrows as we use better thinning strategies. On the other hand, the classification loss is better in terms of F1, as it makes CASM focus on the dominant object only. Because we take the highest score over the ground-truth bounding boxes for each image, concentrating on the dominant object yields higher scores.
5 Unseen classes
Since the proposed approach does not require knowing the class of the object to be localized, we can use it with images that contain objects of classes that were seen during training neither by the classifier $f$ nor by the mapping $m$. We explicitly test this capability by training five different CASMs on five subsets of the original ImageNet training set.
Table 3: Localization error (LE) on each subset of classes (columns) for models trained on different subsets of classes (rows); the last row corresponds to training on all classes.

Training classes   A     B     C     D     E     F     All
F                  46.5  46.4  48.1  45.0  45.7  41.3  44.9
E, F               39.5  41.2  43.1  40.3  39.5  38.7  40.0
D, E, F            37.9  39.3  40.0  38.0  38.0  37.4  38.1
C, D, E, F         38.2  38.5  39.9  37.9  37.9  37.8  38.1
B, C, D, E, F      36.7  36.8  39.9  37.4  37.0  37.0  37.4
All                35.6  36.1  39.0  37.0  36.6  36.7  36.9
We first divide the 1000 classes into six disjoint subsets (denoted A, B, C, D, E and F) of sizes 50, 50, 100, 300, 300 and 200 classes, respectively. We train our models (in all stages) on 95% of the images (classes in B, C, D, E and F), 90% (classes in C, D, E and F), 80% (classes in D, E and F), 50% (classes in E and F) and, finally, on only 20% of the images (classes in F). Then, we test each saliency mapping on all six subsets of classes independently. In each case, we use the thinning strategy L for computational efficiency.
All models generalize well: the difference between their accuracy on seen and unseen classes is negligible (with the exception of the model trained on only 20% of the classes, whose overall performance is a little poorer, which can be explained by the smaller training set). In Table 3, we see that the proposed approach works well even for localizing objects from previously unseen classes. The gap in the localization error between the seen and unseen classes grows as the training set shrinks. However, with a reasonably sized training set, the difference between the seen and unseen classes is small. This is an encouraging sign for the proposed model as a class-agnostic saliency mapping.
6 Related work
The adversarial localization network [4] is perhaps the work most closely related to ours. Similarly to us, they simultaneously train the classifier and the saliency mapping, which does not require the object's class at test time. There are, however, four major differences. First, we use the entropy as the score function for training the mapping, whereas they used the classification loss; as shown earlier, this results in better saliency maps. Second, we make the training procedure faster by tying the weights of the encoder and the classifier, which also results in much better performance. Third, we do not let the classifier drift to the distribution of masked-out images, as we continue training it on both clean and masked-out images. Finally, their mapping relies on superpixels to build more contiguous masks, which may miss small details due to inaccurate segmentation and makes the entire procedure more complex. Our approach works solely on raw pixels, without requiring any extra tricks or techniques.
Dabkowski and Gal [2] also train a separate neural network dedicated to predicting saliency maps. Their approach is, however, classifier-dependent and, as such, much effort is devoted to preventing the generation of adversarial masks. Furthermore, the authors use a complex training objective with multiple hyperparameters that have to be tuned carefully. Finally, their model needs a ground-truth class label, which limits its use in practice.
7 Conclusions
In this paper, we proposed a new framework for classifier-agnostic saliency map extraction, which aims at finding a saliency mapping that works for all possible classifiers weighted by their posterior probabilities. We designed a practical algorithm that amounts to simultaneously training a classifier and a saliency mapping using stochastic gradient descent. We qualitatively observed that the proposed approach extracts saliency maps that cover all the relevant pixels in an image and that the masked-out images cannot be easily recovered by inpainting, unlike with classifier-dependent approaches. We further observed that the proposed saliency map extraction procedure outperforms all existing weakly supervised approaches to object localization and can also be used on images containing objects from previously unseen classes, paving a way toward class-agnostic saliency map extraction.
Acknowledgments
KC thanks AdeptMind, eBay, TenCent, NVIDIA and CIFAR for their support. The authors would also like to thank Catriona C. Geras for correcting earlier versions of the manuscript.
References
 Cao et al. [2015] Chunshui Cao, Xianming Liu, Yi Yang, Yinan Yu, Jiang Wang, Zilei Wang, Yongzhen Huang, Liang Wang, Chang Huang, Wei Xu, et al. Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks. In International Conference on Computer Vision, 2015.
 Dabkowski and Gal [2017] Piotr Dabkowski and Yarin Gal. Real time image saliency for black box classifiers. In Neural Information Processing Systems, 2017.
 Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, LiJia Li, Kai Li, and Li FeiFei. Imagenet: A largescale hierarchical image database. In Computer Vision and Pattern Recognition, 2009.
 Fan et al. [2017] Lijie Fan, Shengjia Zhao, and Stefano Ermon. Adversarial localization network. In Learning with limited labeled data: weak supervision and beyond, NIPS Workshop, 2017.
 Fong and Vedaldi [2017] Ruth C Fong and Andrea Vedaldi. Interpretable explanations of black boxes by meaningful perturbation. arXiv preprint arXiv:1704.03296, 2017.
 He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition, 2016.
 Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Neural Information Processing Systems, 2012.
 Mahendran and Vedaldi [2016] Aravindh Mahendran and Andrea Vedaldi. Salient deconvolutional networks. In European Conference on Computer Vision. Springer, 2016.
 Mandt et al. [2017] Stephan Mandt, Matthew D Hoffman, and David M Blei. Stochastic gradient descent as approximate bayesian inference. arXiv preprint arXiv:1704.04289, 2017.
 Oquab et al. [2015] Maxime Oquab, Léon Bottou, Ivan Laptev, and Josef Sivic. Is object localization for free? Weakly-supervised learning with convolutional neural networks. In Computer Vision and Pattern Recognition, 2015.
 Pinheiro and Collobert [2015] Pedro O Pinheiro and Ronan Collobert. From imagelevel to pixellevel labeling with convolutional networks. In Computer Vision and Pattern Recognition, 2015.
 Rudin et al. [1992] Leonid I Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1-4):259–268, 1992.
 Selvaraju et al. [2016] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. arXiv preprint arXiv:1610.02391, 2016.
 Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for largescale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 Simonyan et al. [2013] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
 Springenberg et al. [2014] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.
 Szegedy et al. [2013] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
 Szegedy et al. [2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Computer Vision and Pattern Recognition, June 2015.
 Telea [2004] Alexandru Telea. An image inpainting technique based on the fast marching method. Journal of graphics tools, 9(1):23–34, 2004.
 Welling and Teh [2011] Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In International Conference on Machine Learning, 2011.
 Zeiler and Fergus [2014] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, 2014.
 Zhang et al. [2016] Jianming Zhang, Zhe Lin, Jonathan Brandt, Xiaohui Shen, and Stan Sclaroff. Top-down neural attention by excitation backprop. In European Conference on Computer Vision, 2016.
 Zhou et al. [2016] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Computer Vision and Pattern Recognition, 2016.
Appendix
Resizing
We noticed that the details of the resizing policy preceding the OM and LE evaluation procedures vary between different works. The one thing they have in common is that the resized image is always 224×224 pixels. The two main approaches are the following.

The image in its original size is resized such that the smaller edge of the resulting image is 224 pixels long. Then, the central 224×224 crop is taken. The original aspect ratio of the objects in the image is preserved. Unfortunately, this method has a flaw: it may be impossible to obtain IOU > 0.5 between the predicted localization box and the ground-truth box when more than half of the bounding box falls outside the crop seen by the model.

The image in its original size is resized directly to 224×224 pixels. The advantage of this method is that the image is not cropped, and it is always possible to obtain IOU > 0.5 between the predicted localization box and the ground-truth box. However, the original aspect ratio is distorted.
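The geometry of the two policies can be sketched as follows; this is illustrative bookkeeping only, and the function names are ours.

```python
def crop_policy(h, w, size=224):
    """First policy: scale so the shorter edge equals `size`, then take
    the central size x size crop. Returns the scale factor and the crop
    box (top, left, bottom, right) in resized coordinates."""
    scale = size / min(h, w)
    rh, rw = round(h * scale), round(w * scale)
    top, left = (rh - size) // 2, (rw - size) // 2
    return scale, (top, left, top + size, left + size)

def stretch_policy(h, w, size=224):
    """Second policy: rescale both edges directly to `size`, distorting
    the aspect ratio. Returns the per-axis scale factors."""
    return size / h, size / w
```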
The difference between LE scores under the two resizing strategies should not be large. For CASM it is 0.6% (the error rises to 37.2% when the second method is used). In this paper, we report the results for CASM using the first method.
Visualizations
In the remainder of the appendix, we replicate the content of Fig. 2 for sixteen randomly chosen sets of classes. That is, in each figure we visualize saliency maps obtained for seven consecutive images from the validation set. The original images are in the first row, followed by rows of masked-in images, masked-out images and inpainted masked-out images. As before, we use two instances of CASM (each with a different thinning strategy, L or L100) and Baseline.
(Sixteen figures follow; in each, panels show: (a) CASM (L100), (b) CASM (L), (c) Baseline.)