# Iterative and Adaptive Sampling with Spatial Attention for Black-Box Model Explanations

## Abstract

Deep neural networks have achieved great success in many real-world applications, yet it remains unclear and difficult to explain their decision-making process to an end-user. In this paper, we address the explainable AI problem for deep neural networks with our proposed framework, named IASSA, which generates an importance map indicating how salient each pixel is for the model's prediction with an iterative and adaptive sampling module. We employ an affinity matrix calculated on multi-level deep learning features to explore long-range pixel-to-pixel correlation, which can shift the saliency values guided by our long-range and parameter-free spatial attention. Extensive experiments on the MS-COCO dataset show that our proposed approach matches or exceeds the performance of state-of-the-art black-box explanation methods.

## 1 Introduction

Although deep neural networks have achieved remarkable success in applications such as object recognition [42, 51, 9, 18, 16, 19, 17, 10, 38], object detection [21, 5, 30], image labeling [15, 8], media forensics [33, 20, 14], medical diagnosis [43, 44], and autonomous driving [23, 4, 22], it is still unclear how a specific network works or how certain it is about its decisions. Because explanation is important for understanding and building trust, as studied in cognitive psychology and philosophy [12, 13, 28, 45, 31], it is critical to make deep neural networks more explainable and trustworthy, and in particular to ensure that their decision-making mechanism is transparent and easily interpretable. Therefore, the problem of Explainable AI, i.e., providing explanations for an intelligent model's decision, especially for classification decisions made by deep neural networks on natural images, attracts much attention in artificial intelligence research [34].

Rather than explainable solutions [35, 50, 40, 29, 37] for white-box models, which calculate importance from information such as the network's weights and gradients, we advocate a more general approach that produces a saliency map for an arbitrary network treated as a black box, without requiring details of its architecture or implementation. Such a saliency map shows how important each image pixel is for the network's prediction.

Recently, multiple explanation approaches have been proposed for black-box models. LIME [26, 1] draws random samples around the instance to be explained and fits an approximate linear decision model. However, such a superpixel-based saliency method may not group the correct regions. RISE [27] probes the black-box model by sub-sampling the input image with random masks and generates the final importance map as a linear combination of the random binary masks. Although this is a seemingly simple yet surprisingly powerful approach for black-box models, the results are still far from perfect, especially in complex scenes.

In this paper, inspired by RISE [27], we propose a novel iterative and adaptive sampling with spatial attention (IASSA) for the explanation of black-box models. We access neither the parameter weights and gradients nor the intermediate feature maps of the model being explained. During the initialization stage, we only sample the image using a sliding window. An iterative and adaptive sampling module then generates the sampling masks for the next iteration, based on an adjusted attention map obtained from the saliency map at the current iteration and the long-range and parameter-free spatial attention. This iterative procedure continues until convergence. A visual comparison with LIME and RISE is shown in Figure 1.

Regarding the long-range and parameter-free spatial attention module, we apply a pre-trained model trained on the large-scale ImageNet dataset to extract features for the input image. Note that we combine multi-level contextual features to better represent the image. Then we calculate an affinity matrix and apply a softmax function to get spatial attention. Since the affinity matrix covers the pixel-to-pixel correlations no matter whether they are local neighbors or not, our attention covers long-range inter-dependencies. Also, no parameters are required to be learned in this procedure. Such a long-range and parameter-free spatial attention can guide the saliency values in the obtained saliency map to the correlative pixels. This can be very helpful as guidance for adaptive sampling for the next iteration.

Another contribution of our work is a more thorough evaluation. Besides previously used metrics such as deletion, insertion and the “Pointing Game” [27], we also use F-1 and IoU. We further evaluate the final saliency maps at the pixel level to highlight the success of our approach in maximizing the information contained in each pixel. We argue that a comprehensive evaluation is more trustworthy when explanations are compared with the human-annotated importance of image regions. In our case, we assume that ground-truth masks are representative of the human interpretation of the object, as they are human-annotated.

To sum up, our technical contributions are threefold: (1) we propose an iterative and adaptive sampling for generating accurate explanations, based on the adjusted saliency map generated by combining the saliency map obtained from the previous iteration and the long-range and parameter-free spatial attention map; (2) we propose a long-range and parameter-free attention module that incorporates “objectness” and guides our adaptive sampler with the help of multi-level feature fusion; and (3) we introduce an evaluation scheme that estimates the “goodness” of an explanation in a reliable and accurate way.

We conduct extensive experiments on the popular, large-scale MS-COCO dataset [11] and compare our method with state-of-the-art methods. The experimental results demonstrate the efficacy of our proposed method.

## 2 Related work

The related work on producing importance-based explanations can be divided into two categories, i.e., white-box approaches and black-box approaches.

White-box approaches rely on information such as the model parameter weights and gradients, as well as the intermediate feature maps. Zeiler et al. [47] visualize the intermediate representation learned by CNNs using deconvolutional networks. Other methods [25, 36, 46] produce explanations by synthesizing an input image that highly activates a neuron. Class activation maps (CAM) [52] achieve class-specific importance at each location in an image by computing a weighted sum of the activation values at each location across all channels using a Global Average Pooling (GAP) layer. This requirement prevents the approach from being used to explain models that lack a native GAP layer without additional re-training. Later, CAM was extended to Grad-CAM [35] by weighing the feature activation values at every location with the average gradient of the class score (w.r.t. the feature activation values) for every feature map channel. In addition, Zhang et al. [50] introduce a probabilistic winner-takes-all strategy to compute the relative importance of neurons towards model predictions. Fong et al. [7] and Cao et al. [2] learn a perturbation mask that maximally affects the model's output by back-propagating the error signals through the model. However, all of the above methods assume that the internal parameters of the underlying model are accessible as a white box. They achieve interpretability by incorporating changes to a white-box based model and are constrained to use specific network architectures, limiting reproducibility on a new dataset.

Black-box approaches treat the learning models as purely black boxes, without requiring access to any details of the architecture and the implementation. LIME [32] fits an approximate linear decision model in the vicinity of a particular input. For a sufficiently complex model, a linear approximation may not be a faithful representation of the non-linear model. Even though LIME produces good-quality results on the MS-COCO dataset, its reliance on super-pixels means that its explanations do not always align activations with object boundaries. As an improvement over LIME, RISE [27] was proposed to generate an importance map indicating how salient each pixel is for the black-box model’s prediction. Such a method estimates importance empirically by probing the model with randomly masked versions of the input image and obtaining the corresponding outputs. Note that sampling methods to generate explanations have been explored in the past [27, 48, 6]. Even though they produce explanations for a wide variety of black-box model applications, their resolution is always limited by factors such as sampling sensitivity and the strength of the classifier.

In this paper, unlike the existing methods, we explore a novel method to provide precise explanations for any application that uses a deep neural network for feature extraction, irrespective of the multi-level features. We leverage a long-range and parameter-free spatial attention to adjust the saliency map. We propose an iterative and adaptive sampling module with long-range and parameter-free attention to determine important regions in an image. The proposed system can also be adapted to perform co-saliency [6] by weighting the final saliency map using a standard feature comparison metric like Euclidean or Cosine distance. This makes our approach robust to the form of explanation desired and produces better quality saliency maps across different applications with little or no overhead in training.

## 3 Methodology

The proposed framework is illustrated in Figure 2. Given an input image, we perform a rough pass to initialize our approach. The sampled image regions are passed to the black-box classifier, which predicts logit scores for each sample. The predicted logit scores are used to weight the image regions and produce an aggregated response map. An adjusted saliency map is then generated by combining the response map with the attention map obtained from the long-range and parameter-free spatial attention module. The attention module also guides the iterative and adaptive sampling to sample relevant regions in the next iteration. This iterative procedure continues until convergence. Note that the spatial attention module is built on multi-level deep learning features via an affinity matrix. In the following subsections, we explain our approach in detail.

### 3.1 Iterative and Adaptive Sampling Module

We propose a novel iterative and adaptive sampler that is guided by our long-range and parameter-free spatial attention (LRPF-SA) to automatically pick sampling regions of interest with an appropriate sampling factor rather than weighting them equally. Sampling around the important regions ensures faster convergence and better-quality saliency maps. The iterative nature of our approach also lets users control the quality of the saliency maps, trading quality against the time needed to generate them. We believe this is crucial in applications where the same explanation generator needs to be scaled according to user requirements with minimal changes.

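Given an image $I$, a black-box model $f$ produces a score vector of length $C$, where $C$ is the number of classes the black-box model was trained for. We sample the input image $I$ using masks $M: \Lambda \to \{0, 1\}$ generated by a sliding window of size $W$ and stride $S$, where $\Lambda$ denotes the set of pixel locations. Considering the masked version $I \odot M$ of $I$, where $\odot$ represents element-wise multiplication, we compute the confidence scores $f(I \odot M)$ for all the masked images. When the scalar score $f(I \odot M)$ is high for a chosen mask $M$, we can infer that the pixels preserved by $M$ are important. We therefore define the importance of a pixel $\lambda \in \Lambda$ as the expected score over all possible masks $M$ conditioned on the event that pixel $\lambda$ is observed, i.e., $M(\lambda) = 1$:

$$S_{I,f}(\lambda) = \mathbb{E}_M\left[f(I \odot M) \mid M(\lambda) = 1\right] = \sum_{m} f(I \odot m)\, P[M = m \mid M(\lambda) = 1] = \frac{1}{P[M(\lambda) = 1]} \sum_{m} f(I \odot m)\, P[M = m, M(\lambda) = 1] \tag{1}$$

where

$$P[M = m, M(\lambda) = 1] = \begin{cases} 0, & \text{if } m(\lambda) = 0 \\ P[M = m], & \text{if } m(\lambda) = 1 \end{cases} = m(\lambda)\, P[M = m] \tag{2}$$

Substituting Equation 2 into Equation 1 gives

$$S_{I,f}(\lambda) = \frac{1}{P[M(\lambda) = 1]} \sum_{m} f(I \odot m) \cdot m(\lambda) \cdot P[M = m] \tag{3}$$

Considering that $P[M(\lambda) = 1] = \mathbb{E}[M(\lambda)]$, we rewrite Equation 3 in matrix notation as

$$S_{I,f} = \frac{1}{\mathbb{E}[M]} \sum_{m} f(I \odot m) \cdot m \cdot P[M = m] \tag{4}$$

Using Monte Carlo sampling, at iteration $t$, the saliency map is computed as a weighted average of a collection of masks $\{M^t_1, \dots, M^t_{N_t}\}$ by the following approximation:

$$S^t_{I,f}(\lambda) \approx \frac{1}{\mathbb{E}[M^t] \cdot N_t} \sum_{i=1}^{N_t} f(I \odot M^t_i) \cdot M^t_i(\lambda) \tag{5}$$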

When the black-box model's score is associated with a class $c$, we can obtain a saliency map corresponding to $c$ according to Equation 4. Although most applications require only the top-1 saliency map, our approach can be used to obtain class-specific salient structures.
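The Monte Carlo estimate described above can be illustrated with a small NumPy sketch. It is only a toy under stated assumptions: `toy_score` is a hypothetical stand-in for the black-box class score, the image is a single-channel array, and the normalisation uses the per-pixel empirical coverage of the masks in place of the expectation of the mask, since sliding windows do not cover every pixel equally often.

```python
import numpy as np

def sliding_window_masks(height, width, window, stride):
    """Binary masks that each keep one window-sized square of the image."""
    masks = []
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            mask = np.zeros((height, width), dtype=np.float64)
            mask[top:top + window, left:left + window] = 1.0
            masks.append(mask)
    return np.stack(masks)

def saliency_from_masks(image, masks, score_fn):
    """Mask-weighted sum of scores, normalised by how often each pixel is kept."""
    scores = np.array([score_fn(image * m) for m in masks])  # score of each masked image
    weighted = np.tensordot(scores, masks, axes=1)           # sum_i score_i * mask_i
    coverage = masks.mean(axis=0)                            # empirical per-pixel mask mean
    return weighted / (len(masks) * np.maximum(coverage, 1e-12))

def toy_score(masked_image):
    """Hypothetical black-box score: mean brightness inside a fixed 8x8 patch."""
    return float(masked_image[8:16, 8:16].mean())

image = np.ones((24, 24))
masks = sliding_window_masks(24, 24, window=8, stride=4)
saliency = saliency_from_masks(image, masks, toy_score)
```

With this normalisation, each pixel's value is the average score of the masks that keep it, so pixels inside the patch that drives `toy_score` receive the largest values.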

The initial saliency map is generated based on a sliding window. After the initialization, we use the long-range and parameter-free attention module to adjust the saliency map at each iteration by the following rule

(6) |

where a regularizer controls the amount of influence the attention network has on the final explanation. The intuition behind using both saliency and attention maps is that, while the saliency maps are associated with the output of a black-box model, our proposed LRPF-SA (see next subsection) applies spatial constraints with respect to the extracted features. By combining both forms of explanation, we hope to converge on an aggregated saliency map that gives a complete picture of the image regions that interest the system and of the image regions that conform with object boundaries.

Then we use the adjusted saliency map to guide the adaptive sampling for the next iteration by

(7) |

where the highest activated region is obtained by applying a threshold to the adjusted saliency map and is evaluated against the binary map that highlights all pixels containing the object of interest, i.e.,

(8) |

With the adaptive sampling masks, we apply Equation 5 to obtain the saliency map at the next iteration. The adjusted saliency map is then obtained by Equation 6 and used to generate the adaptive sampling masks for the following iteration. It is worth noting that the window size and stride can be gradually decreased with the iteration count to increase the resolution of the saliency maps until there is very little or no change in the quality of the maps. The number of iterations can also be fixed based on user requirements in applications where the user is willing to sacrifice the quality of saliency maps for run-time.
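One possible reading of the adaptive step is sketched below. It is not the exact sampling rule of the paper, and every name in it is hypothetical. The assumptions are that the adjusted map is min-max normalised before thresholding, and that the next iteration keeps only those sliding windows that overlap the thresholded region.

```python
import numpy as np

def highest_activated_region(adjusted_map, threshold=0.3):
    """Binary region from thresholding the min-max normalised map (assumed rule)."""
    low, high = adjusted_map.min(), adjusted_map.max()
    span = high - low
    normalised = (adjusted_map - low) / (span if span > 0 else 1.0)
    return normalised >= threshold

def adaptive_masks(adjusted_map, window, stride, threshold=0.3):
    """Keep only the sliding windows that overlap the highest-activated region."""
    region = highest_activated_region(adjusted_map, threshold)
    height, width = region.shape
    masks = []
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            if region[top:top + window, left:left + window].any():
                mask = np.zeros((height, width))
                mask[top:top + window, left:left + window] = 1.0
                masks.append(mask)
    return region, np.array(masks).reshape(-1, height, width)

# Toy adjusted saliency map with one active 6x6 blob.
adjusted = np.zeros((24, 24))
adjusted[4:10, 4:10] = 1.0
region, masks = adaptive_masks(adjusted, window=8, stride=4)
```

Calling `adaptive_masks` again with a smaller `window` and `stride` would mimic the gradual refinement over iterations described in the text.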

### 3.2 Long-Range and Parameter-Free Spatial Attention

Obtaining an attention map from a deep learning model is a well-researched topic [39, 49]. A recent approach to minimizing the overhead of attention generation was proposed in [41]. Inspired by [41], we propose a novel long-range and parameter-free spatial attention (LRPF-SA) module. We use a deep network for feature extraction and collect activations from different levels of the network. We believe that using activations from different levels provides a faithful account of how the image is perceived by the complete network, giving rise to hierarchical salient concepts in the attention map. The saliency maps are then used to choose, among these hierarchical concepts, those that match image boundaries, yielding accurate and reliable saliency maps.

In this paper, we use a network pre-trained on the ImageNet dataset. Note that in the case of a new domain, the network can be adapted to the target domain using methods proposed in [3]. The pre-trained deep network is used to extract multi-level features that are combined by upsampling and performing sum fusion. Finally, we apply a softmax operation over the resulting affinity matrix to obtain an attention map, as shown in Figure 2.

Note that the affinity matrix contains the dependencies of every pixel on all other pixels. Let $F_1$, $F_2$, $F_3$ and $F_4$ be the features extracted from four different levels of the feature extractor. The feature map $F_1$ has dimensions $h \times w \times C_1$, where $h$ and $w$ are the height and width of the obtained feature maps and $C_1$ is the number of channels. The feature maps $F_2$, $F_3$ and $F_4$ are upsampled to $h \times w$, with channel numbers $C_2$, $C_3$, and $C_4$. Upsampling the feature maps lets us directly compute an aggregated response using the following equation

$$F = F_1 \oplus F_{2u} \oplus F_{3u} \oplus F_{4u} \tag{9}$$

where the subscript $u$ denotes the upsampling operation, $\oplus$ is the concatenation operation, and the long-range and parameter-free spatial attention can be obtained by

$$A = \mathrm{softmax}\left(F F^{\top}\right) \tag{10}$$

where $F$ is reshaped from $h \times w \times C$ to $hw \times C$, and $C$ is the channel number of $F$. Figure 3 shows an illustration of our LRPF-SA module that produces attention maps used to guide the iterative and adaptive sampling module. By using an attention mechanism we hope to gain information related to the “objectness” hidden among pixels in an image.
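As a rough illustration of the attention computation described above, the sketch below fuses four feature maps and applies a softmax over their pixel-to-pixel affinity. It rests on several assumptions that are not taken from the paper: random arrays stand in for real network activations, upsampling is nearest-neighbour with integer factors, fusion is channel concatenation, and the softmax is applied row-wise.

```python
import numpy as np

def upsample_nearest(feature, height, width):
    """Nearest-neighbour upsampling of an (h, w, c) map by integer factors."""
    h, w, _ = feature.shape
    return np.repeat(np.repeat(feature, height // h, axis=0), width // w, axis=1)

def lrpf_attention(features):
    """Parameter-free attention: softmax over the affinity of fused features."""
    height, width, _ = features[0].shape
    fused = np.concatenate(
        [features[0]] + [upsample_nearest(f, height, width) for f in features[1:]],
        axis=2,
    )
    flat = fused.reshape(height * width, -1)         # one feature row per pixel
    affinity = flat @ flat.T                         # pixel-to-pixel correlations
    affinity -= affinity.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(affinity)
    return weights / weights.sum(axis=1, keepdims=True)

# Random stand-ins for activations from four levels of a feature extractor.
rng = np.random.default_rng(0)
levels = [
    rng.standard_normal((8, 8, 4)),
    rng.standard_normal((4, 4, 6)),
    rng.standard_normal((2, 2, 8)),
    rng.standard_normal((1, 1, 10)),
]
attention = lrpf_attention(levels)
```

Each row of `attention` is a distribution over all pixels, which is what lets a pixel attend to locations far from its local neighbourhood. Nothing in the computation is learned.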

### 3.3 Iterative Saliency Convergence

We propose to find the best possible saliency map that captures the decision-making process of the underlying algorithm in an iterative manner. Generating high-quality explanations is a very time-consuming process and limits its usage in applications that require generating precise maps on large datasets. By gradually converging on the optimal saliency map, we hope to let the user decide the rate of convergence that fits their time budget, opening up possibilities of use of explanations for a wide variety of applications.

## 4 Experiments

One may wonder whether an explanation should be considered “good” if it represents the importance according to the black-box classifier or if it conforms with object boundaries, encouraging human trust in the explanation system. To verify the effectiveness of our proposed approach IASSA, we conduct experiments on the MS-COCO dataset [11] and evaluate explanations both for their ability to represent the image regions that the underlying model relies on and for their segmentation performance. By leveraging attention together with model-dependent saliency, the proposed approach achieves better performance when evaluated for insertion, deletion, intersection over union (IoU), F1-score, and a pointing game score [27]. We believe the proposed explanation generation method can be leveraged to fine-tune models, especially deep learning classifiers, in a closed loop using Attention Branch Networks [24].

Note that in this paper, the input images are resized to a fixed size to facilitate mask reuse and ease feature extraction. The IAS module is initialized with a window size W of 45 and a stride S of 8, with step sizes of 1.5 and 0.2, respectively. We use a regularizer of 0.5 and a threshold of 0.3 to generate a new saliency map at each iteration. The maximum number of iterations is 25.

### 4.1 Evaluation Metrics

Evaluating the quality of saliency maps can depend on the kind of explanation desired. We evaluate the quality of saliency maps using five different metrics: deletion, insertion, IoU, F1-score, and a pointing game score [27].

In deletion, given a saliency map and input image we gradually remove pixels based on their importance in the saliency map, meanwhile monitoring the Area Under the Curve (AUC). A sharp drop in activation as a function of the fraction of pixels removed can be used to quantify the quality of saliency maps. Analogously, in insertion, we reveal pixels gradually in the blurred image. The pixels can be removed or added in several ways like setting the pixels of interest to zero, image mean, gray value or blurring pixels. For deletion, we set pixels of interest to a constant grey value. But the same evaluation protocol cannot be used for insertion as the model would be biased towards shapes of pixels introduced on an empty canvas.
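A minimal sketch of the deletion metric follows, assuming a single-channel image, a hypothetical scoring function `toy_score` in place of the black-box model, and a fixed number of removal steps. The area under the curve is computed with the trapezoidal rule over the fraction of pixels removed.

```python
import numpy as np

def deletion_auc(image, saliency, score_fn, steps=9, fill_value=0.5):
    """Area under the score curve as the most salient pixels are removed first."""
    order = np.argsort(saliency.ravel())[::-1]   # most salient pixels first
    working = np.array(image, dtype=np.float64)  # private copy of the image
    flat = working.reshape(-1)                   # view onto the copy
    chunk = int(np.ceil(order.size / steps))
    fractions, scores = [0.0], [score_fn(working)]
    for start in range(0, order.size, chunk):
        flat[order[start:start + chunk]] = fill_value  # replace pixels with a constant
        fractions.append(min(start + chunk, order.size) / order.size)
        scores.append(score_fn(working))
    fractions, scores = np.array(fractions), np.array(scores)
    return float(np.sum(np.diff(fractions) * (scores[:-1] + scores[1:]) / 2.0))

def toy_score(img):
    """Hypothetical black-box score: mean brightness of a 2x2 patch."""
    return float(img[2:4, 2:4].mean())

image = np.ones((6, 6))
good_map = np.zeros((6, 6))
good_map[2:4, 2:4] = 1.0          # highlights exactly the pixels the score uses
bad_map = 1.0 - good_map          # highlights everything else
good_auc = deletion_auc(image, good_map, toy_score, fill_value=0.0)
bad_auc = deletion_auc(image, bad_map, toy_score, fill_value=0.0)
```

A lower deletion AUC indicates a better map: removing the pixels ranked highest by `good_map` makes the score collapse immediately, while `bad_map` leaves the score intact until the very last step.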

To prevent the introduction of bias towards the shapes of pixel groups, for insertion we unblur the regions of the image under consideration. The IoU and F1-score are calculated by applying a threshold of 0.3 on the range of aggregated saliency maps using Equation 8 and 9 obtained at the end of the final iteration. We also use a pointing game that considers an explanation as a positive hit when the highest activated pixel lies inside the object boundary. We average all performance metrics at both the image and pixel level, where the pixel-level value normalizes the performance by the number of pixels activated. The normalization for per-pixel performance lets us fairly evaluate explanations that might cover a region much larger than the object of interest but also include the object.
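The thresholded metrics can be sketched as follows. This assumes the threshold is applied to the min-max normalised saliency map and that the ground truth is a boolean object mask. The function names are illustrative only.

```python
import numpy as np

def threshold_map(saliency, threshold=0.3):
    """Binarise a saliency map at a fraction of its value range."""
    low, high = saliency.min(), saliency.max()
    span = high - low
    return (saliency - low) / (span if span > 0 else 1.0) >= threshold

def iou_and_f1(pred, gt):
    """Intersection over union and F1-score of two boolean masks."""
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    total = pred.sum() + gt.sum()
    iou = intersection / union if union else 0.0
    f1 = 2.0 * intersection / total if total else 0.0
    return float(iou), float(f1)

def pointing_game_hit(saliency, gt):
    """True when the most activated pixel falls inside the ground-truth mask."""
    row, col = np.unravel_index(np.argmax(saliency), saliency.shape)
    return bool(gt[row, col])

gt = np.zeros((10, 10), dtype=bool)
gt[2:6, 2:6] = True                      # 16 ground-truth pixels
saliency = np.zeros((10, 10))
saliency[3:7, 3:7] = 1.0                 # explanation shifted by one pixel
saliency[4, 4] = 2.0                     # single peak inside the object

pred = threshold_map(saliency)
iou, f1 = iou_and_f1(pred, gt)
hit = pointing_game_hit(saliency, gt)
per_pixel_iou = iou / pred.sum()         # pixel-level variant: divide by activated pixels
```

Dividing by `pred.sum()` penalises maps that reach the object only by activating a much larger area.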

| Level | Method | Deletion | Insertion | F-1 | IoU | Pointing Game |
|---|---|---|---|---|---|---|
| Image-level | LIME | 0.900967 | 0.99 | 0.15390 | 0.09745 | 0.16461 |
| Image-level | RISE | 0.1847 | 1.0 | 0.13837 | 0.13653 | 0.25 |
| Image-level | IASSA | 0.18803 | 1.0 | 0.23658 | 0.15153 | 0.4216 |
| Pixel-level | LIME | 10.8526e-05 | 10.96158e-05 | 1.71177e-05 | 1.08447e-05 | 0.43671e-05 |
| Pixel-level | RISE | 5.5423e-05 | 28.8669e-05 | 4.26672e-05 | 2.69240e-05 | 8.95937e-05 |
| Pixel-level | IASSA | 5.50534e-05 | 35.33639e-05 | 10.5960e-05 | 6.9282e-05 | 17.79331e-05 |

### 4.2 Effectiveness of Iterative Adaptive Sampling Module with LRPF-SA

We consider explanation generation as an optimization problem, assuming there exists an optimal explanation that encapsulates both model dependence and human-interpretable cues in an image. Converging on this optimal explanation depends on parameters such as the iteration number, the regularizer, and the threshold (where the regularizer and the threshold decide the convergence rate). We fix the values of the regularizer and the threshold, and evaluate the impact of the iteration number.

The ability of the proposed explanation system to converge on an optimal explanation can be assessed qualitatively in Figure 5, which shows the improvement in the quality of explanations as the number of iterations increases. The obtained explanations contain well-defined image boundaries at iteration 10 and slowly converge to their peak performance at iteration 15. Figures 6 and 7 show the quantitative performance at both the image and pixel level with an increasing number of iterations. At the image level, the proposed IASSA appears to reach its peak performance at iteration 15 and to deteriorate after the peak due to oversampling. When evaluated at the pixel level, in contrast, the performance of IASSA increases across all metrics except deletion, suggesting a reduction in the influence of model-dependent saliency.

### 4.3 Comparison with State-of-the-art approaches

Figure 4 shows results comparing the proposed method with LIME and RISE. The saliency maps obtained by our IASSA highlight regions of interest more accurately than other state-of-the-art approaches. For example, the success of our approach can be qualitatively visualized in the test image for class “snowboard” in Figure 4 (row 4, column 5), while there exists an ambiguity if the person in the input images contributes to classification if using either LIME or RISE. The model looks at the snowboard to classify the image.

We also summarize the quantitative results in Table 1. From the table, we can observe that our proposed method IASSA outperforms these two known black-box explanation approaches, with the added flexibility of explaining in an iterative manner, which enables its application in speed-critical explanation systems. When averaged at the image level, LIME is severely affected, especially in the pointing game, due to instances where the pixels with the highest activation were not aligned with the ground-truth mask. The proposed model not only outperforms other explanation mechanisms when evaluated for “goodness” with respect to the underlying model but also maintains human trust in the explanation.

Even though RISE obtains deletion metrics close to the proposed system, our IASSA gives the best of both worlds by explaining the model underneath and encapsulating objectness information at the same time. While our IASSA performs close to the best when evaluated at the image level, the true merit of our approach can only be appreciated at the pixel level. In an ideal explanation, we would expect all the contributing regions to contain the highest activation possible as our optimal solution. Black-box explanation approaches are prone to errors in the interpretation of an explanation due to extraneous image regions that affect human trust in the explanation. Normalizing saliency maps by the number of pixels carrying the top 30% of the activations resolves this issue, resulting in a fair evaluation. The iterative aspect of our IASSA makes it a good match for applications that require the system to be scaled with minimal overhead.

### 4.4 Discussion

Fine-tuning the hyper-parameters plays a crucial role in determining performance. Hyper-parameters help the human user control the quality of explanations and the algorithm's convergence rate. Even though setting hyper-parameters requires some knowledge about the underlying algorithm, we limit their values to a standard range rather than leaving them arbitrary. The proposed system can produce explanations containing sampling artifacts due to a mismatch between the window size and the stride. To prevent this, we plan to look into other sampling methods that are both faster and can reach a consensus on a larger image region at a time. Some examples of sampling artifacts are shown in Figure 8. Finally, the proposed system takes an average of approximately 800 milliseconds per iteration to compute an explanation for an image using ResNet-50 in batches of 256. Since a majority of the run-time is spent loading the deep learning feature extractor, we advise using large batch sizes to minimize model load time.

## 5 Conclusion

In this paper, we propose a novel iterative and adaptive sampling with a parameter-free long-range spatial attention for generating explanations for black-box models. The proposed approach helps bridge the gap between model-dependent explanation and human-trustable explanation by laying the path for future research on methodologies to define the “goodness” of an explanation. We support this claim by evaluating our approach using several metrics, namely deletion, insertion, IoU, F-1 score, and pointing game, at both the image and pixel levels. We believe the explanations obtained using our proposed approach can not only help humans reason about model decisions but also contain generalized class-specific information that could be fed back into the model to form a closed loop.

### References

- S. A. Bargal, A. Zunino, V. Petsiuk, J. Zhang, K. Saenko, V. Murino, and S. Sclaroff. Guided zoom: Questioning network evidence for fine-grained classification. ArXiv, abs/1812.02626, 2018.
- C. Cao, X. Liu, Y. Yang, Y. Yu, J. Wang, Z. Wang, Y. Huang, L. Wang, C. Huang, W. Xu, et al. Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2956–2964, 2015.
- M. Caron, P. Bojanowski, A. Joulin, and M. Douze. Deep clustering for unsupervised learning of visual features. In European Conference on Computer Vision (ECCV), 2018.
- J. Choi, D. Chun, H. Kim, and H.-J. Lee. Gaussian yolov3: An accurate and fast object detector using localization uncertainty for autonomous driving. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
- B. Ding, C. Long, L. Zhang, and C. Xiao. Argan: Attentive recurrent generative adversarial network for shadow detection and removal. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 10213–10222, 2019.
- B. Dong, R. Collins, and A. Hoogs. Explainability for content-based image retrieval. In The IEEE Conference on Computer Vision and Pattern Recognition Workshop (CVPRW), June 2019.
- R. C. Fong and A. Vedaldi. Interpretable explanations of black boxes by meaningful perturbation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 3429–3437, 2017.
- T. Hu, C. Long, L. Zhang, and C. Xiao. Vital: A visual interpretation on text with adversarial learning for image labeling. arXiv preprint arXiv:1907.11811, 2019.
- G. Hua, C. Long, M. Yang, and Y. Gao. Collaborative active learning of a kernel machine ensemble for recognition. In IEEE International Conference on Computer Vision (ICCV), pages 1209–1216. IEEE, 2013.
- G. Hua, C. Long, M. Yang, and Y. Gao. Collaborative active visual recognition from crowds: A distributed ensemble approach. IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), 40(3):582–594, 2018.
- T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision (ECCV), pages 740–755. Springer, 2014.
- T. Lombrozo. The structure and function of explanations. Trends in Cognitive Sciences, 10(10):464–470, 2006.
- T. Lombrozo. The instrumental value of explanations. Philosophy Compass, 6(8):539–551, 2011.
- C. Long, A. Basharat, and A. Hoogs. A coarse-to-fine deep convolutional neural network framework for frame duplication detection and localization in forged videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1–10. IEEE, 2019.
- C. Long, R. Collins, E. Swears, and A. Hoogs. Deep neural networks in fully connected crf for image labeling with social network metadata. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1607–1615. IEEE, 2019.
- C. Long and G. Hua. Multi-class multi-annotator active learning with robust gaussian process for visual recognition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2839–2847, 2015.
- C. Long and G. Hua. Correlational gaussian processes for cross-domain visual recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 118–126, 2017.
- C. Long, G. Hua, and A. Kapoor. Active visual recognition with expertise estimation in crowdsourcing. In IEEE International Conference on Computer Vision (ICCV), pages 3000–3007. IEEE, 2013.
- C. Long, G. Hua, and A. Kapoor. A joint gaussian process model for active visual recognition with expertise estimation in crowdsourcing. International Journal of Computer Vision (IJCV), 116(2):136–160, 2016.
- C. Long, E. Smith, A. Basharat, and A. Hoogs. A c3d-based convolutional neural network for frame dropping detection in a single video shot. In IEEE International Conference on Computer Vision and Pattern Recognition Workshop (CVPRW) on Media Forensics. IEEE, 2017.
- C. Long, X. Wang, G. Hua, M. Yang, and Y. Lin. Accurate object detection with location relaxation and regionlets re-localization. In The 12th Asian Conf. on Computer Vision (ACCV), pages 3000–3016. IEEE, 2014.
- X. Ma, Z. Wang, H. Li, P. Zhang, W. Ouyang, and X. Fan. Accurate monocular 3d object detection via color-embedded 3d reconstruction for autonomous driving. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
- A. I. Maqueda, A. Loquercio, G. Gallego, N. García, and D. Scaramuzza. Event-based vision meets deep learning on steering prediction for self-driving cars. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
- M. Mitsuhara, H. Fukui, Y. Sakashita, T. Ogata, T. Hirakawa, T. Yamashita, and H. Fujiyoshi. Embedding human knowledge in deep neural network via attention map. arXiv preprint arXiv:1905.03540, 2019.
- A. Nguyen, A. Dosovitskiy, J. Yosinski, T. Brox, and J. Clune. Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 3387–3395, 2016.
- T. L. Pedersen and M. Benesty. lime: Local interpretable model-agnostic explanations. R Package version 0.4, 1, 2018.
- V. Petsiuk, A. Das, and K. Saenko. Rise: Randomized input sampling for explanation of black-box models. In Proceedings of the British Machine Vision Conference (BMVC), 2018.
- B. A. Plummer, M. I. Vasileva, V. Petsiuk, K. Saenko, and D. Forsyth. Why do these match? explaining the behavior of image similarity models. arXiv preprint arXiv:1905.10797, 2019.
- Z. Qi, S. Khorram, and F. Li. Visualizing deep networks by optimizing with integrated gradients. arXiv preprint arXiv:1905.00954, 2019.
- F. U. Rahman, B. Vasu, and A. Savakis. Resilience and self-healing of deep convolutional object detectors. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 3443–3447. IEEE, 2018.
- M. T. Ribeiro, S. Singh, and C. Guestrin. "Why should i trust you?": Explaining the predictions of any classifier. In HLT-NAACL Demos, 2016.
- M. T. Ribeiro, S. Singh, and C. Guestrin. Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining (SIGKDD), pages 1135–1144. ACM, 2016.
- A. Rossler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Niessner. Faceforensics++: Learning to detect manipulated facial images. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
- W. Samek, G. Montavon, A. Vedaldi, L. K. Hansen, and K.-R. Müller. Explainable ai: Interpreting, explaining and visualizing deep learning. Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, 2019.
- R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 618–626, 2017.
- K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
- B. Vasu, F. U. Rahman, and A. Savakis. Aerial-cam: Salient structures and textures in network class activation maps of aerial imagery. In 2018 IEEE 13th Image, Video, and Multidimensional Signal Processing Workshop (IVMSP), pages 1–5. IEEE, 2018.
- B. Vasu and A. Savakis. Visualizing the resilience of deep convolutional network interpretations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 107–110, 2019.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), pages 5998–6008, 2017.
- J. Wagner, J. M. Kohler, T. Gindele, L. Hetzel, J. T. Wiedemer, and S. Behnke. Interpretable and fine-grained visual explanations for convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 9097–9107, 2019.
- H. Wang, Y. Fan, Z. Wang, L. Jiao, and B. Schiele. Parameter-free spatial attention network for person re-identification. arXiv preprint arXiv:1811.12150, 2018.
- J. Wei, C. Long, H. Zou, and C. Xiao. Shadow inpainting and removal using generative adversarial networks with slice convolutions. Computer Graphics Forum (CGF), 38(7):381–392, 2019.
- B. Wu, X. Sun, L. Hu, and Y. Wang. Learning with unsure data for medical image diagnosis. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
- X. Xing, Q. Li, H. Wei, M. Zhang, Y. Zhan, X. S. Zhou, Z. Xue, and F. Shi. Dynamic spectral graph convolution networks with assistant task training for early mci diagnosis. In Medical Image Computing and Computer Assisted Intervention (MICCAI), pages 639–646, 2019.
- C.-K. Yeh, C.-Y. Hsieh, A. S. Suggala, D. W. Inouye, and P. D. Ravikumar. On the (in)fidelity and sensitivity for explanations. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
- J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, and H. Lipson. Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579, 2015.
- M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision (ECCV), pages 818–833. Springer, 2014.
- M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision (ECCV), pages 818–833. Springer, 2014.
- H. Zhang, I. J. Goodfellow, D. N. Metaxas, and A. Odena. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.
- J. Zhang, S. A. Bargal, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff. Top-down neural attention by excitation backprop. International Journal of Computer Vision (IJCV), 126(10):1084–1102, 2018.
- L. Zhang, C. Long, X. Zhang, and C. Xiao. Ris-gan: Explore residual and illumination with generative adversarial networks for shadow removal. In AAAI Conference on Artificial Intelligence (AAAI), 2020.
- B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2921–2929, 2016.