Classifier-agnostic saliency map extraction

Konrad Żołna (konrad.zolna@gmail.com), Krzysztof J. Geras (k.j.geras@nyu.edu), Kyunghyun Cho (kyunghyun.cho@nyu.edu)

Jagiellonian University; New York University; NYU School of Medicine; CIFAR Azrieli Global Scholar; Facebook AI Research
Abstract

We argue for the importance of decoupling saliency map extraction from any specific classifier. We propose a practical algorithm to train a classifier-agnostic saliency mapping by simultaneously training a classifier and a saliency mapping. The proposed algorithm is motivated as finding the mapping that is not strongly coupled with any specific classifier. We qualitatively and quantitatively evaluate the proposed approach and verify that it extracts higher quality saliency maps compared to the existing approaches that are dependent on a fixed classifier. The proposed approach performs well even on images containing objects from classes unseen during training.

 

Preprint. Work in progress.

1 Introduction

The recent success of deep convolutional networks for large-scale object recognition [6, 8, 15, 19] has spurred interest in utilizing them to automatically detect and localize objects in natural images. Simonyan et al. [16] and Springenberg et al. [17] demonstrated that the gradient of the class-specific score of a given classifier could be used for extracting a saliency map of an image. Despite their promising results, it has been noticed that such classifier-dependent saliency maps tend to be noisy, covering many irrelevant pixels and missing many relevant ones. Sometimes a map may even be adversarial, meaning that it fools the specific, given classifier but would not confuse another classifier. Much of the recent work has therefore focused on introducing regularization techniques for correcting saliency maps extracted for a given classifier. Selvaraju et al. [14], for instance, average multiple saliency maps created for perturbed images to obtain a smooth saliency map. On the other hand, some have tried to modify the deep convolutional networks to explicitly equip them with a saliency map extractor [11, 12].

We notice that the strong dependence on a given classifier lies at the center of the whole issue of unsatisfactory saliency maps, and we attempt to tackle this core problem directly. We first argue that it is necessary for a classifier to be uniquely optimal in order for any saliency map extracted for it to indicate each and every relevant pixel. This is a difficult condition to satisfy in general, because there may be many equally good classifiers and there is no guarantee that we could find one, if it even exists. We thus propose to train a saliency mapping that works for all possible classifiers (within a single family) weighted by their posterior probabilities. We call this approach classifier-agnostic saliency map extraction and propose a practical algorithm that avoids the intractable expectation over the posterior distribution.

The proposed approach results in a neural network based saliency mapping that only depends on an input image. We qualitatively find that it extracts higher quality saliency maps compared to classifier-dependent methods, as can be seen in Fig. 2. We evaluate it quantitatively by using the extracted saliency maps for object localization and observe that the proposed approach outperforms the existing localization techniques based on a fixed classifier and closely approaches the localization performance of a strongly supervised model. Furthermore, we experimentally validate that the proposed approach works reasonably well even for classes unseen during training, suggesting a way toward class-agnostic saliency map extraction.

2 Classifier-agnostic saliency map extraction

In this paper, we tackle the problem of extracting a salient region of an input image $x$ as a problem of learning a mapping $m$ that outputs a saliency map $m(x)$ over the input image. Such a mapping should retain ($m_{i,j}(x) = 1$) any pixel of the input image if it aids classification, while it should mask ($m_{i,j}(x) = 0$) any other pixel.

2.1 Classifier-dependent saliency map extraction

Earlier work has largely focused on a setting in which a classifier $f$ was given. These approaches can be implemented as solving the following maximization problem:

$$\hat{m} = \arg\max_m S(m \mid f, \mathcal{D}),$$ (1)

where $S$ is a score function corresponding to a classification loss, i.e.,

$$S(m \mid f, \mathcal{D}) = \sum_{(x, y) \in \mathcal{D}} \Big[ l\big(f((1 - m(x)) \odot x), y\big) - R(m(x)) \Big],$$ (2)

where $\odot$ denotes a masking operation, $R$ is a regularization term and $l$ is a per-example classification loss, such as cross-entropy. We are given a training set $\mathcal{D} = \{(x_1, y_1), \ldots, (x_N, y_N)\}$. This optimization procedure can be interpreted as finding a mapping $m$ that maximally confuses the given classifier $f$. We refer to it as classifier-dependent saliency map extraction.

We define a classifier $\hat{f}$ to be uniquely optimal if $L(\hat{f}, m) \leq L(f, m)$ for any classifier $f$ and any possible mapping $m$, where $L$ is defined as

$$L(f, m) = \sum_{(x, y) \in \mathcal{D}} l\big(f((1 - m(x)) \odot x), y\big).$$ (3)

In other words, a uniquely optimal classifier utilizes all salient parts of the image. Any mapping obtained for a single classifier would not capture the saliency of all input pixels, unless the classifier is uniquely optimal.

A mapping $m_1$ obtained with a classifier $f_1$ that is not uniquely optimal may differ from a mapping $m_2$ found using another classifier $f_2$, even if both classifiers are equally good. This is against our definition of the mapping above, which stated that any pixel which helps classification must be indicated by the mask (a saliency map) with $m_{i,j}(x) = 1$. This disagreement happens because the two equally good, but distinct, classifiers may use different, overlapping subsets of input pixels to perform classification.

An example

This behaviour can be intuitively explained with a simple example. Let us consider a data set in which every instance consists of two identical copies of an image concatenated together, that is, $x = [\bar{x}; \bar{x}]$, where $\bar{x}$ denotes the original, smaller image. In this case, there exist at least two classifiers, $f_1$ and $f_2$, with the same classification loss. The classifier $f_1$ uses only the left half of the image, while $f_2$ uses only the other half. Each of the corresponding mappings, $m_1$ and $m_2$, would then indicate a region of interest only on the corresponding half of the image.

2.2 Classifier-agnostic saliency map extraction

In order to address this issue, we propose to alter the objective function in Eq. (1) to consider not only a single fixed classifier but all possible classifiers weighted by their posterior probabilities. That is,

$$\hat{m} = \arg\max_m \mathbb{E}_{f \sim p(f \mid \mathcal{D}, m)} \big[ S(m \mid f, \mathcal{D}) \big],$$ (4)

where the posterior probability $p(f \mid \mathcal{D}, m)$ is defined to be proportional to the exponentiated negative classification loss, i.e., $p(f \mid \mathcal{D}, m) \propto \exp(-L(f, m))$. Solving this optimization problem is equivalent to searching over the space of all possible classifiers and finding a mapping that works with all of them. As we parameterize the classifier $f$ as a convolutional network (with parameters denoted as $\theta_f$), the space of all possible classifiers is isomorphic to the space of its parameters. The proposed approach considers all the classifiers, and we therefore call it classifier-agnostic saliency map extraction.

In the case of the simple example above, where each image contains two copies of a smaller image, both $f_1$ and $f_2$, which respectively look at one and the other half of an image, would have the same posterior probability (we assume a flat prior, i.e., $p(f) \propto 1$). Solving Eq. (4) implies that a mapping $\hat{m}$ must minimize the loss for both of these classifiers.

2.3 Algorithm

input : an initial classifier $f^{(0)}$,
an initial mapping $m^{(0)}$,
a dataset $\mathcal{D}$,
a number of iterations $T$
output : the final mapping $m^{(T)}$
Initialize a sample set $F \leftarrow \{f^{(0)}\}$. for $t = 1$ to $T$ do
      Update the classifier with one SGD step on $L(f, m)$ as in Eq. (5), obtaining $f^{(t)}$.
      Add $f^{(t)}$ to $F$ and thin $F$ if necessary.
      Sample a classifier $\tilde{f}$ from $F$.
      Update the mapping with one gradient ascent step on $S(m \mid \tilde{f}, \mathcal{D})$ as in Eq. (6), obtaining $m^{(t)}$.
Algorithm 1 Classifier-agnostic saliency map extraction

The optimization problem in Eq. (4) is, unfortunately, generally intractable. This arises from the intractable expectation over the posterior distribution. Furthermore, the expectation is inside the optimization loop for the mapping $m$, making it even harder to solve.

Thus, we approximately solve this problem by simultaneously estimating the mapping and the expected objective. First, we sample a classifier $f^{(t)}$ approximately from the posterior distribution by taking a single step of stochastic gradient descent (SGD) on the classification loss with respect to $f$ with a small step size:

$$\theta_f^{(t)} = \theta_f^{(t-1)} - \eta \, \nabla_{\theta_f} L(f^{(t-1)}, m^{(t-1)}).$$ (5)

This is motivated by earlier work [21, 10] which showed that SGD performs approximate Bayesian posterior inference.

After $t$ iterations, we have up to $t$ samples from the posterior distribution (a usual practice of "thinning" may be applied, leading to fewer than $t$ samples). We sample one classifier $\tilde{f}$ from this set (we set the chance of selecting the latest sample $f^{(t)}$ to 50% and spread the remaining 50% uniformly over the older samples) to get a single-sample estimate of the expectation in Eq. (4) by computing $S(m \mid \tilde{f}, \mathcal{D})$. Then, we use it to obtain an updated mapping $m^{(t)}$ by

$$\theta_m^{(t)} = \theta_m^{(t-1)} + \eta' \, \nabla_{\theta_m} S(m^{(t-1)} \mid \tilde{f}, \mathcal{D}).$$ (6)

We alternate between these two steps until $m$ converges (cf. Alg. 1).
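To make the alternating updates in Eqs. (5) and (6) concrete, the following is a minimal PyTorch-style sketch of Alg. 1. It is an illustration rather than the authors' implementation: the names train_casm, score_fn, mixed_loss and thin_fn, as well as the default learning rates, are placeholders.

```python
# Minimal sketch of Alg. 1; all names and default hyperparameters are illustrative.
import copy
import random
from itertools import cycle

import torch


def train_casm(classifier, mapping, loader, score_fn, mixed_loss,
               thin_fn=lambda samples, f: [f],   # strategy "L": keep only the latest sample
               steps=1000, lr_f=1e-4, lr_m=1e-4, p_latest=0.5):
    """Alternate one SGD step on the classifier (Eq. 5) with one
    gradient-ascent step on the mapping (Eq. 6)."""
    opt_f = torch.optim.SGD(classifier.parameters(), lr=lr_f, momentum=0.9)
    opt_m = torch.optim.Adam(mapping.parameters(), lr=lr_m)
    samples = [copy.deepcopy(classifier)]          # F: stored classifier samples
    data = cycle(loader)

    for _ in range(steps):
        x, y = next(data)

        # Eq. (5): one SGD step on the classification loss L(f, m).
        opt_f.zero_grad()
        mixed_loss(classifier, x, y, mapping(x).detach()).backward()
        opt_f.step()

        # Keep only a small set of classifier samples (thinning).
        samples = thin_fn(samples, copy.deepcopy(classifier))

        # Pick the latest sample with probability p_latest, otherwise an older one.
        if len(samples) == 1 or random.random() < p_latest:
            f_tilde = samples[-1]
        else:
            f_tilde = random.choice(samples[:-1])

        # Eq. (6): one gradient step maximizing the score S(m | f~, D).
        opt_m.zero_grad()
        (-score_fn(f_tilde, x, mapping(x))).backward()   # ascent on S
        opt_m.step()

    return mapping
```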

Score function

The score function $S$ estimates the quality of the saliency map extracted by $m$ given a data set and a classifier $f$. The score function must be designed to balance precision and recall. Precision refers to the proportion of relevant pixels among those marked by $m$ as relevant, while recall refers to the proportion of pixels marked by $m$ as relevant among all the relevant pixels. In order to balance these two, the score function often consists of two terms.

The first term ensures that all the relevant pixels are included (high recall). As in Eq. (2), a popular choice has been the classification loss based on an input image masked out by $m$. In our preliminary experiments, however, we noticed that this approach leads to masks with adversarial artifacts. We hence propose to use the entropy of the classifier's predictive distribution instead. This makes the generated masks cover all salient pixels in the input, avoiding masks that merely sway the class prediction to a different but closely related class, for example from one dog species to another.

The second term, $R$, excludes a trivial solution that maximizes the recall, in which the mapping simply outputs an all-ones saliency map. That would imply maximal recall but low precision. In order to avoid this and achieve reasonable precision, we must introduce a regularization term. Popular choices include total variation [13] and the $L_1$ norm; we use only the latter.

In summary, we use the following score function for classifier-agnostic saliency map extraction:

$$S(m \mid f, \mathcal{D}) = \sum_{(x, y) \in \mathcal{D}} \Big[ H\big(f((1 - m(x)) \odot x)\big) - \lambda \, \lVert m(x) \rVert_1 \Big],$$ (7)

where $H$ denotes the entropy of the classifier's predictive distribution and $\lambda$ is a regularization coefficient.
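A compact sketch of this score in PyTorch, assuming the masking convention $(1 - m(x)) \odot x$ for the masked-out image; the function name and the default value of lam are illustrative placeholders.

```python
# Sketch of Eq. (7): prediction entropy on the masked-out image minus an L1 penalty.
import torch
import torch.nn.functional as F


def score(classifier, x, mask, lam=1e-3):
    """Higher is better for the mapping: the classifier should be maximally
    uncertain once the pixels marked as salient have been removed."""
    masked_out = (1.0 - mask) * x                 # remove the salient region
    log_p = F.log_softmax(classifier(masked_out), dim=1)
    entropy = -(log_p.exp() * log_p).sum(dim=1)   # H(f((1 - m(x)) . x))
    l1 = mask.abs().mean(dim=(1, 2, 3))           # L1 norm of the mask (per pixel)
    return (entropy - lam * l1).mean()
```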

Thinning

As the algorithm collects a set of classifiers sampled from the posterior distribution, we need a strategy to keep only a small subset of them. An obvious approach is to keep all of them, but this does not scale well with the number of iterations. We propose and empirically evaluate a few strategies. The first three of them keep a fixed number of classifiers: the first classifier only, denoted by F; the last only, denoted by L; and the first and the last, denoted by FL. As an alternative, we also consider a growing set of classifiers in which we keep one classifier every 1000 iterations (denoted by L1000), and whenever the set grows beyond a fixed maximum size, we randomly remove one classifier from it. Analogously, we experiment with keeping one classifier every 100 iterations (L100).
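The thinning strategies can be written as small set-update rules, as in the illustrative sketch below; the maximum set size used for the growing strategies is a placeholder, since the exact cap is not reproduced here.

```python
# Illustrative thinning rules; the interval k and max_size are placeholders.
import random


def thin_L(samples, new_f):
    """L: keep only the most recent classifier."""
    return [new_f]


def thin_FL(samples, new_f):
    """FL: keep the first and the most recent classifier."""
    return [samples[0], new_f]


def thin_every_k(samples, new_f, step, k=1000, max_size=10):
    """L1000-style: keep one classifier every k iterations; when the set grows
    beyond max_size, drop a randomly chosen element."""
    if step % k == 0:
        samples = samples + [new_f]
        if len(samples) > max_size:
            samples.pop(random.randrange(len(samples)))
    return samples
```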

Classification loss

Although we described our approach using the classification loss computed only on masked-out images, as in Eq. (3), it is not necessary to define the classification loss exactly this way. In preliminary experiments, we noticed that the following alternative formulation, inspired by adversarial training [18], works better:

$$L(f, m) = \sum_{(x, y) \in \mathcal{D}} \Big[ l\big(f((1 - m(x)) \odot x), y\big) + l\big(f(x), y\big) \Big].$$ (8)

We thus use the loss defined above in our experiments. We conjecture that it is advantageous over the original one in Eq. (3), as the additional term prevents the degradation of the classifier's performance on the original, unmasked images, while the first term encourages the classifier to collect new pieces of evidence from the masked-out images.
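A one-function sketch of Eq. (8) in PyTorch, under the same masking convention as before; the function name is illustrative.

```python
# Sketch of the mixed classification loss in Eq. (8): cross-entropy on the
# masked-out image plus cross-entropy on the clean image.
import torch.nn.functional as F


def mixed_loss(classifier, x, y, mask):
    masked_out = (1.0 - mask) * x
    return (F.cross_entropy(classifier(masked_out), y) +
            F.cross_entropy(classifier(x), y))
```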

3 Experimental settings

Dataset: ImageNet

Our models were trained on the official ImageNet training set with ground truth class labels [3]. We evaluate them on the validation set. Depending on the experiment, we use ground truth class or localization labels.

3.1 Architectures

Figure 1: The overall architecture. The mapping consists of an encoder and a decoder and is shown at the top with a gray background. The additional forward pass (in which the classifier acts on the masked-out image) is needed during training only.
Classifier and mapping

We use a ResNet-50 [6] as the classifier in our experiments. We follow an encoder-decoder architecture for constructing the mapping $m$. The encoder is also implemented as a ResNet-50, so its weights can either be shared with the classifier or kept separate; we experimentally find that sharing is beneficial. The decoder is a deep deconvolutional network that ultimately outputs the mask of an input image. The input to the decoder consists of all hidden layers of the encoder that are directly followed by a downscaling operation. We upsample them to be of the same size and concatenate them into a single feature map. This upsampling operation is implemented by first applying a 1×1 convolution with 64 filters, followed by batch normalization, a ReLU non-linearity and rescaling to 56×56 pixels. Finally, a single 3×3 convolutional filter followed by a sigmoid activation is applied to the concatenated feature map, and the output is upscaled to a 224×224 pixel-sized mask using bilinear interpolation. The overall architecture is shown in Fig. 1.
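The decoder just described can be sketched as follows. The class name and the assumed ResNet-50 stage channel sizes (256, 512, 1024, 2048) are illustrative choices, not a verbatim reproduction of the authors' code.

```python
# Sketch of the mask decoder: reduce each encoder feature map with a 1x1
# convolution, batch norm and ReLU, resize to 56x56, concatenate, and predict
# the mask with a single 3x3 convolution followed by a sigmoid.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskDecoder(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024, 2048), mid=64):
        super().__init__()
        self.reduce = nn.ModuleList(
            nn.Sequential(nn.Conv2d(c, mid, kernel_size=1),
                          nn.BatchNorm2d(mid),
                          nn.ReLU(inplace=True))
            for c in in_channels)
        self.head = nn.Conv2d(mid * len(in_channels), 1, kernel_size=3, padding=1)

    def forward(self, features):
        # `features` is a list of encoder feature maps, from fine to coarse.
        resized = [F.interpolate(r(f), size=(56, 56), mode='bilinear',
                                 align_corners=False)
                   for r, f in zip(self.reduce, features)]
        h = torch.cat(resized, dim=1)        # combined 56x56 feature map
        mask = torch.sigmoid(self.head(h))   # 56x56 mask in [0, 1]
        return F.interpolate(mask, size=(224, 224), mode='bilinear',
                             align_corners=False)
```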

3.2 Training

Optimization

We initialize the classifier by training it on the entire training set; we find this pretraining strategy facilitates learning, particularly in the early stage. In practice, we use the pretrained ResNet-50 from the torchvision model zoo. We use vanilla SGD with a small learning rate (with the momentum coefficient set to 0.9 and a small weight-decay coefficient) to continue training the classifier with the mixed classification loss in Eq. (8). To train the mapping, we use Adam [7] with a small learning rate and weight decay, and all the other hyperparameters set to their default values. We fix the number of training epochs to 70 (each epoch covers only a random 20% of the training set).
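A minimal sketch of the optimizer setup described above; the learning-rate and weight-decay values are placeholders (the exact values are not reproduced here), and the mapping module merely stands in for the encoder-decoder of subsection 3.1.

```python
import torch
import torch.nn as nn
import torchvision

# Pretrained ResNet-50 from the torchvision model zoo, as described above.
classifier = torchvision.models.resnet50(pretrained=True)

# Placeholder mapping network standing in for the encoder-decoder of Sec. 3.1.
mapping = nn.Sequential(nn.Conv2d(3, 1, kernel_size=3, padding=1), nn.Sigmoid())

# SGD with momentum and weight decay for the classifier, Adam for the mapping.
opt_f = torch.optim.SGD(classifier.parameters(), lr=1e-4,       # placeholder lr
                        momentum=0.9, weight_decay=1e-4)        # placeholder wd
opt_m = torch.optim.Adam(mapping.parameters(), lr=1e-4,         # placeholder lr
                         weight_decay=1e-4)                     # placeholder wd
```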

Regularization coefficient

As noticed by Fan et al. [4], it is not trivial to find an optimal regularization coefficient $\lambda$. They proposed an adaptive strategy which removes the need for a manual selection of $\lambda$. We, however, find it undesirable due to the lack of control over the average size of the saliency map. Instead, we propose to control the average number of relevant pixels by manually setting $\lambda$, while applying the regularization term only when the average mask intensity disagrees with the target level. We then set $\lambda$ for each experiment such that approximately 50% of the pixels in each image are indicated as relevant by the mapping $m$. In preliminary experiments, we further noticed that this approach avoids problematic behavior when an image contains small objects, as earlier observed by Fong and Vedaldi [5].

3.3 Evaluation

In our experiments we use only the single architecture explained in subsection 3.1. We use the abbreviation CASM (classifier-agnostic saliency mapping) to denote the final model obtained using the proposed method. Our baseline model (Baseline) has the same architecture and is trained with a fixed classifier (classifier-dependent saliency mapping), realized by following the thinning strategy F.

Following previous work [1, 5, 23], we discretize the mask by keeping the $\alpha \langle m(x) \rangle$ fraction of pixels with the highest intensities (setting them to 1 and all the others to 0),

where $\langle m(x) \rangle$ is the average mask intensity and $\alpha$ is a hyperparameter. We simply set $\alpha$ to 1, hence the average pixel intensity is the same for the input mask $m(x)$ and the discretized binary mask. To focus on the most dominant object, we take the largest connected component of the binary mask to obtain the binary connected mask.
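One way to implement the discretization and the largest-connected-component step, assuming the quantile reading of the rule above (keep the $\alpha \langle m(x) \rangle$ fraction of highest-valued pixels); function names are illustrative and SciPy is used for the connected-component labelling.

```python
import numpy as np
from scipy import ndimage


def binarize(mask, alpha=1.0):
    """mask: 2D array with values in [0, 1]; keeps the alpha * mean(mask)
    fraction of highest-valued pixels, so that for alpha == 1 the average
    intensity of the binary mask matches that of the input mask."""
    k = int(round(alpha * mask.mean() * mask.size))   # number of pixels to keep
    if k <= 0:
        return np.zeros_like(mask, dtype=bool)
    threshold = np.partition(mask.ravel(), -k)[-k]    # k-th largest value
    return mask >= threshold


def largest_connected_component(binary_mask):
    labels, num = ndimage.label(binary_mask)
    if num == 0:
        return binary_mask
    sizes = ndimage.sum(binary_mask, labels, index=range(1, num + 1))
    return labels == (int(np.argmax(sizes)) + 1)
```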

Visualization

We visualize the learned mapping by inspecting the saliency map of each image in three different ways. First, we visualize the masked-in image $m(x) \odot x$, which ideally leaves only the relevant pixels visible. Second, we visualize the masked-out image $(1 - m(x)) \odot x$, which keeps only the pixels irrelevant to classification. Third, we visualize the inpainted masked-out image, obtained using an inpainting algorithm [20]. This allows us to check whether the object that has been masked out can be easily reconstructed from the surrounding pixels.
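For reference, the inpainted masked-out visualization can be produced with the Telea algorithm [20] as implemented in OpenCV. The helper below is an illustration and assumes an 8-bit RGB image and a binary mask marking the salient pixels.

```python
import cv2
import numpy as np


def inpainted_masked_out(image, mask, radius=3):
    """image: HxWx3 uint8 array; mask: binary HxW array of salient pixels."""
    masked_out = image * (1 - mask)[..., None]            # remove salient pixels
    holes = (mask * 255).astype(np.uint8)                 # region to fill
    return cv2.inpaint(masked_out.astype(np.uint8), holes, radius,
                       cv2.INPAINT_TELEA)
```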

Classification by multiple classifiers

In order to verify our claim that the proposed approach results in a classifier-agnostic saliency mapping, we evaluate a set of classifiers on the validation sets of masked-in images, masked-out images and inpainted masked-out images. This set consists of twenty ResNet-50 models trained from different random initializations, in addition to the following classifiers from torchvision.models (https://pytorch.org/docs/master/torchvision/models.html): densenet121, densenet169, densenet201, densenet161, resnet18, resnet34, resnet50, resnet101, resnet152, vgg11, vgg11_bn, vgg13, vgg13_bn, vgg16, vgg16_bn, vgg19 and vgg19_bn. If our claim is correct, we expect the inpainted masked-out images created by our method to break these classifiers, while the masked-in images should cause minimal performance degradation.

Object localization

Because our downstream task is to recognize the most dominant object in an image, we can evaluate our approach on the task of its localization. To do so, we use the ILSVRC’14 localization task. We compute the bounding box of an object as the tightest box that covers the binary connected mask.

We use three metrics to quantify the quality of localization. First, we use the official metric (OM) from the ImageNet localization challenge, which considers the localization successful if at least one ground truth bounding box has an IOU with the predicted bounding box higher than 0.5 and the class prediction is correct. Since OM depends on the classifier, from which we have sought to make our mapping independent, we also use another widely used metric, called localization error (LE), which only depends on the bounding box prediction [1, 5]. Lastly, we evaluate the original saliency map, in which each mask pixel is a continuous value between 0 and 1, by a continuous F1 score. Precision and recall are defined as follows:

$$\text{Precision} = \frac{\sum_{(i,j) \in B} m_{i,j}(x)}{\sum_{(i,j)} m_{i,j}(x)}, \qquad \text{Recall} = \frac{\sum_{(i,j) \in B} m_{i,j}(x)}{|B|},$$

where $B$ is the set of pixels inside the ground truth bounding box. We compute F1 scores against all the ground truth bounding boxes for each image and report the highest one among them as its final score.
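The localization measures can be sketched as follows: the tightest box around the binary connected mask, the IOU used by the LE metric, and the continuous F1 score following the precision and recall definitions above. Function names are illustrative.

```python
import numpy as np


def tightest_box(binary_mask):
    """Smallest axis-aligned box (x1, y1, x2, y2) covering the binary mask."""
    ys, xs = np.where(binary_mask)
    return xs.min(), ys.min(), xs.max(), ys.max()


def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1 + 1) * max(0, iy2 - iy1 + 1)
    area = lambda r: (r[2] - r[0] + 1) * (r[3] - r[1] + 1)
    return inter / float(area(a) + area(b) - inter)


def continuous_f1(mask, box):
    """mask: HxW array in [0, 1]; box: (x1, y1, x2, y2) ground truth box."""
    x1, y1, x2, y2 = box
    inside = mask[y1:y2 + 1, x1:x2 + 1].sum()
    precision = inside / max(mask.sum(), 1e-8)
    recall = inside / ((y2 - y1 + 1) * (x2 - x1 + 1))
    return 2 * precision * recall / max(precision + recall, 1e-8)
```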

(a) CASM (L100) (b) CASM (L) (c) Baseline
Figure 2: The original images are in the first row. The following rows show masked-in images, masked-out images and inpainted masked-out images, respectively. Note that the proposed approach (a-b) removes all relevant pixels, hence the inpainted images show the background only. Seven consecutive images, selected at random from the validation set, are presented here. Please see the appendix for additional visualizations.
Figure 3: The classification accuracy of the ImageNet-trained convolutional networks on masked-in images (left), masked-out images (center) and inpainted masked-out images (right). Orange and blue dots correspond to ResNet-50 models and all the other types of convolutional networks, respectively. We observe that the inpainted masked-out images obtained using Baseline are easier to classify than those obtained using CASM, because Baseline, unlike CASM, fails to mask out all the relevant pixels. In the right panel, most of the classifiers evaluated on images with random masks achieve an accuracy higher than 40% and are not shown. We add jitter along the x-axis to make the dots visually distinct from one another.

4 Results and analysis

Visualization and statistics

We randomly select seven consecutive images from the validation set and feed them to two instances of CASM (each using a different thinning strategy, L or L100) and to Baseline. We visualize the original (clean), masked-in, masked-out and inpainted masked-out images in Fig. 2. We observe that the proposed approach produces clearly better saliency maps, while the classifier-dependent approach (Baseline) produces so-called adversarial masks [2].

We further compute statistics of the saliency maps generated by CASM and Baseline over the validation set. The masks extracted by CASM exhibit lower total variation, indicating that CASM produces more regular masks despite the lack of explicit TV regularization. The entropy of the mask pixel intensities is much smaller for CASM, indicating that the intensities are closer to either 0 or 1 on average. Furthermore, the standard deviation of the masked-out volume is larger for CASM, indicating that CASM is capable of producing saliency maps of varying sizes depending on the input image.

Classification
Model OM LE F1
Our:
Baseline 59.6 49.6 53.4
CASM 49.0 36.7 63.8
CASM-nearest 48.9 36.6 63.6
ALN [4] 54.5 43.5 -
Weakly supervised:
Occlusion [22] - 48.6 -
CAM [24] 56.4 48.1 -
Grad-CAM [14] - 47.5 -
Mask [5] - 43.1 -
Guid [9] - 42.0 -
Grad [16] - 41.7 -
Feed [1] - 38.8 -
Exc [23] - 38.7 -
Masking model [2] - 36.7 -
Supervised:
VGG Net [15] - 34.3 -
Table 1: Localization evaluation using the OM, LE and F1 scores. For each prior method we report the better result between the one in the original paper and the one reported by Fong and Vedaldi [5]. In CASM-nearest, we replace bilinear interpolation with nearest-neighbour interpolation at test time.

As shown in the left panel of Figure 3, the entire set of classifiers suffers less from the masked-in images produced by CASM than from those produced by Baseline. We notice, however, that most of the classifiers fail to classify the masked-out images produced by Baseline, which we conjecture is due to the adversarial nature of the saliency maps produced by the Baseline approach. This is confirmed by the right panel, which shows that simple inpainting of the masked-out images dramatically increases the accuracy when the saliency maps were produced by Baseline. The masked-out images produced by CASM, on the other hand, do not benefit from inpainting, because they truly do not retain any useful evidence for classification.

Localization

We report the localization performance of CASM, Baseline and prior works in Table 1 using three different metrics. Most of the existing approaches, except for Fan et al. [4], assume the knowledge of the target class, unlike the proposed approach. The table clearly shows that CASM performs better than all prior approaches including the classifier-dependent Baseline. In terms of LE, the fully supervised approach (VGG Net) is the only approach that outperforms CASM.

Thinning strategies
S Shr Thin OM LE F1
(a) E Y F 59.6 49.6 53.4
(b) E Y L 49.4 36.9 64.9
(c) E Y FL 52.8 41.3 61.3
(d) E Y L1000 49.5 37.3 64.0
(e) E Y L100 49.0 36.7 63.8
(f) C Y F 69.0 61.2 44.0
(g) C Y L 50.4 38.1 65.7
(h) C Y L100 51.3 39.2 63.6
(i) E N F - 55.5 47.8
(j) E N L - 47.2 59.2
(k) E N L100 - 46.8 57.8
Table 2: Ablation study. S refers to the choice of a score function (E: entropy, C: classification loss), Shr to whether the encoder and classifier are shared (Y: yes, N: no) and Thin to the thinning strategies.

In Table 2 (a–e), we compare the five thinning strategies described earlier, where F is equivalent to Baseline. We observe that the strategies L and L100 perform better than the others, closely followed by L1000. This is expected, because the classifier samples become stale as the mapping evolves, implying that we should only trust the most recent few classifier samples.

Sharing the encoder and classifier

As is clear from Fig. 1, it is not necessary to share the parameters of the encoder and the classifier. Our experiments, however, reveal that it is always beneficial to share them, as shown in Table 2 (a,b,e,i–k).

Score function

Unlike Fan et al. [4], we use separate score functions for training the classifier and the saliency mapping. We empirically observe in Table 2 (a,b,e,f–h) that the proposed use of the entropy as a score function results in a better mapping in terms of OM and LE. The gap, however, narrows as we use better thinning strategies. On the other hand, the classification loss is better in terms of F1, as it makes CASM focus on the dominant object only. Because we take the highest score over the ground truth bounding boxes of each image, concentrating on the dominant object yields higher scores.

5 Unseen classes

Since the proposed approach does not require knowing the class of the object to be localized, we can use it with images that contain objects of classes that were seen during training by neither the classifier nor the mapping. We explicitly test this capability by training five different CASMs on five subsets of the original ImageNet training set.

A B C D E F All
F 46.5 46.4 48.1 45.0 45.7 41.3 44.9
E, F 39.5 41.2 43.1 40.3 39.5 38.7 40.0
D, E, F 37.9 39.3 40.0 38.0 38.0 37.4 38.1
C, D, E, F 38.2 38.5 39.9 37.9 37.9 37.8 38.1
B, C, D, E, F 36.7 36.8 39.9 37.4 37.0 37.0 37.4
- 35.6 36.1 39.0 37.0 36.6 36.7 36.9
Table 3: Localization errors (LE, in %; lower is better) of the models trained on subsets of classes. Each row corresponds to the training subset of classes and each column to the test subset of classes. Error rates on previously unseen classes are marked with a gray shade.

We first divide the 1000 classes into six disjoint subsets (denoted A, B, C, D, E and F) of sizes 50, 50, 100, 300, 300 and 200, respectively. We train our models (in all stages) on 95% of the images (classes in B, C, D, E and F), 90% of the images (classes in C, D, E and F), 80% of the images (classes in D, E and F), 50% of the images (classes in E and F) and finally on only 20% of the images (classes in F only). Then, we test each saliency mapping on all six subsets of classes independently. In each case, we use the thinning strategy L for computational efficiency.

All models generalize well, and the difference between their accuracy on seen and unseen classes is negligible (with the exception of the model trained on 20% of the classes). The overall performance is slightly worse, which can be explained by the smaller training set. In Table 3, we see that the proposed approach works well even for localizing objects from previously unseen classes. The gap in localization error between the seen and unseen classes grows as the training set shrinks. However, with a reasonably sized training set, the difference between the seen and unseen classes is small. This is an encouraging sign for the proposed model as a class-agnostic saliency mapping.

6 Related work

The adversarial localization network [4] is perhaps the work most closely related to ours. Similarly to ours, they simultaneously train the classifier and the saliency mapping, which does not require the object's class at test time. There are, however, four major differences. First, we use the entropy as the score function for training the mapping, whereas they use the classification loss; as shown earlier, this results in better saliency maps. Second, we make the training procedure faster by tying the weights of the encoder and the classifier, which also results in much better performance. Third, we do not let the classifier drift toward the distribution of masked-out images, by continuing to train it on both clean and masked-out images. Finally, their mapping relies on superpixels to build more contiguous masks, which may miss small details due to inaccurate segmentation and makes the entire procedure more complex. Our approach works solely on raw pixels, without requiring any extra tricks or techniques.

Dabkowski and Gal [2] also train a separate neural network dedicated to predicting saliency maps. However, their approach is classifier-dependent and, as such, much effort is devoted to preventing the generation of adversarial masks. Furthermore, the authors use a complex training objective with multiple hyperparameters which have to be tuned carefully. Finally, their model needs a ground truth class label, which limits its use in practice.

7 Conclusions

In this paper, we proposed a new framework for classifier-agnostic saliency map extraction, which aims at finding a saliency mapping that works for all possible classifiers weighted by their posterior probabilities. We designed a practical algorithm that amounts to simultaneously training a classifier and a saliency mapping using stochastic gradient descent. We qualitatively observed that the proposed approach extracts saliency maps that cover all the relevant pixels in an image and that the masked-out images cannot be easily recovered by inpainting, unlike with classifier-dependent approaches. We further observed that the proposed saliency map extraction procedure outperforms all existing weakly supervised approaches to object localization and can also be applied to images containing objects from previously unseen classes, paving the way toward class-agnostic saliency map extraction.

Acknowledgments

KC thanks AdeptMind, eBay, TenCent, NVIDIA and CIFAR for their support. The authors would also like to thank Catriona C. Geras for correcting earlier versions of the manuscript.

References

  • Cao et al. [2015] Chunshui Cao, Xianming Liu, Yi Yang, Yinan Yu, Jiang Wang, Zilei Wang, Yongzhen Huang, Liang Wang, Chang Huang, Wei Xu, et al. Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks. In International Conference on Computer Vision, 2015.
  • Dabkowski and Gal [2017] Piotr Dabkowski and Yarin Gal. Real time image saliency for black box classifiers. In Neural Information Processing Systems, 2017.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009.
  • Fan et al. [2017] Lijie Fan, Shengjia Zhao, and Stefano Ermon. Adversarial localization network. In Learning with limited labeled data: weak supervision and beyond, NIPS Workshop, 2017.
  • Fong and Vedaldi [2017] Ruth C Fong and Andrea Vedaldi. Interpretable explanations of black boxes by meaningful perturbation. arXiv preprint arXiv:1704.03296, 2017.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition, 2016.
  • Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Neural Information Processing Systems, 2012.
  • Mahendran and Vedaldi [2016] Aravindh Mahendran and Andrea Vedaldi. Salient deconvolutional networks. In European Conference on Computer Vision. Springer, 2016.
  • Mandt et al. [2017] Stephan Mandt, Matthew D Hoffman, and David M Blei. Stochastic gradient descent as approximate bayesian inference. arXiv preprint arXiv:1704.04289, 2017.
  • Oquab et al. [2015] Maxime Oquab, Léon Bottou, Ivan Laptev, and Josef Sivic. Is object localization for free?-weakly-supervised learning with convolutional neural networks. In Computer Vision and Pattern Recognition, 2015.
  • Pinheiro and Collobert [2015] Pedro O Pinheiro and Ronan Collobert. From image-level to pixel-level labeling with convolutional networks. In Computer Vision and Pattern Recognition, 2015.
  • Rudin et al. [1992] Leonid I Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1-4):259–268, 1992.
  • Selvaraju et al. [2016] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. arXiv preprint arXiv:1610.02391, 2016.
  • Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Simonyan et al. [2013] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
  • Springenberg et al. [2014] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.
  • Szegedy et al. [2013] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
  • Szegedy et al. [2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Computer Vision and Pattern Recognition, June 2015.
  • Telea [2004] Alexandru Telea. An image inpainting technique based on the fast marching method. Journal of graphics tools, 9(1):23–34, 2004.
  • Welling and Teh [2011] Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In International Conference on Machine Learning, 2011.
  • Zeiler and Fergus [2014] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, 2014.
  • Zhang et al. [2016] Jianming Zhang, Zhe Lin, Jonathan Brandt, Xiaohui Shen, and Stan Sclaroff. Top-down neural attention by excitation backprop. In European Conference on Computer Vision, 2016.
  • Zhou et al. [2016] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Computer Vision and Pattern Recognition, 2016.

Appendix

Resizing

We noticed that the details of the resizing policy preceding the OM and LE evaluation procedures vary between different works. The one thing they have in common is that the resized image is always 224×224 pixels. The two main approaches are the following.

  • The image in its original size is resized such that the smaller edge of the resulting image is 224 pixels long. Then, the central 224×224 crop is taken. The original aspect ratio of the objects in the image is preserved. Unfortunately, this method has a flaw – it may be impossible to obtain IOU > 0.5 between the predicted localization box and the ground truth box when more than half of the bounding box lies outside the crop seen by the model.

  • The image in its original size is resized directly to 224×224 pixels. The advantage of this method is that the image is not cropped and it is always possible to obtain IOU > 0.5 between the predicted localization box and the ground truth box. However, the original aspect ratio is distorted.

The difference between the LE scores for the two resizing strategies should not be large. For CASM it is 0.6% (the error rises to 37.2% when the second method is used). In this paper, for CASM, we report results obtained with the first method.
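For illustration, the two resizing policies can be written with torchvision transforms as follows.

```python
from torchvision import transforms

# 1) Preserve the aspect ratio: shorter edge to 224, then a central 224x224 crop.
resize_crop = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
])

# 2) Ignore the aspect ratio: resize the whole image directly to 224x224.
resize_direct = transforms.Resize((224, 224))
```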

Visualizations

In the remainder of the appendix we replicate the content of Fig. 2 for sixteen randomly chosen classes. That is, in each figure we visualize saliency maps obtained for seven consecutive images from the validation set. The original images are in the first row, and the following rows show masked-in images, masked-out images and inpainted masked-out images, respectively. As before, we use two instances of CASM (each using a different thinning strategy, L or L100) and Baseline.


[Figures: sixteen sets of visualizations, each with panels (a) CASM (L100), (b) CASM (L) and (c) Baseline.]
