RISE: Randomized Input Sampling for Explanation of Black-box Models


Deep neural networks are increasingly being used to automate data analysis and decision making, yet their decision process remains largely unclear and difficult to explain to end users. In this paper, we address the problem of Explainable AI for deep neural networks that take images as input and output a class probability. We propose an approach called RISE that generates an importance map indicating how salient each pixel is for the model’s prediction. In contrast to white-box approaches that estimate pixel importance using gradients or other internal network state, RISE works on black-box models. It estimates importance empirically by probing the model with randomly masked versions of the input image and obtaining the corresponding outputs. We compare our approach to state-of-the-art importance extraction methods using both an automatic deletion/insertion metric and a pointing metric based on human-annotated object segments. Extensive experiments on several benchmark datasets show that our approach matches or exceeds the performance of other methods, including white-box approaches.

1 Introduction

Recent success of deep neural networks has led to a remarkable growth in Artificial Intelligence (AI) research. In spite of the success, it remains largely unclear how a particular neural network comes to a decision, how certain it is about the decision, if and when it can be trusted, or when it has to be corrected. In domains where a decision can have serious consequences (e.g., medical diagnosis, autonomous driving, criminal justice etc.), it is especially important that the decision-making models are transparent. There is extensive evidence for the importance of explanation towards understanding and building trust in cognitive psychology Lombrozo2006Structure (), philosophy Lombrozo2011Instrumental () and machine learning Dzindolet2003Role (); Ribeiro2016Should (); Lipton2016Mythos () research. In this paper, we address the problem of Explainable AI, i.e., providing explanations for the artificially intelligent model’s decision. Specifically, we are interested in explaining classification decisions made by deep neural networks on natural images.

Consider the prediction of a popular image classification model (ResNet obtained from zhang2017EB ()) on the image depicting several sheep shown in Fig. 1(a). We might wonder, why is the model predicting the presence of a cow in this photo? Does it see all sheep as equally sheep-like? An explainable AI approach can provide answers to these questions, which in turn can help fix such mistakes. In this paper, we take a popular approach of generating a saliency or importance map that shows how important each image pixel is for the network’s prediction. In this case, our approach reveals that the ResNet model confuses the black sheep for a cow (Fig. 1(c)), potentially due to the scarcity of black colored sheep in its training data. A similar observation is made for the photo of two birds (Fig. 1(d)) where the same ResNet model predicts the presence of a bird and a person. Our generated explanation reveals that the left bird provides most of the visual evidence for the ‘person’ class.

(a) Sheep - , Cow -
(b) Importance map of ‘sheep’
(c) Importance map of ‘cow’
(d) Bird - , Person -
(e) Importance map of ‘bird’
(f) Importance map of ‘person’
Figure 1: Our proposed RISE approach can explain why a black-box model (here, ResNet50) makes classification decisions by generating a pixel importance map for each decision (redder is more important). For the top image, it reveals that the model only recognizes the white sheep and confuses the black one with a cow; for the bottom image it confuses parts of birds with a person. (Images taken from the PASCAL VOC dataset.)

Existing methods Simonyan2013Deep (); Yosinski2014Understanding (); Nguyen2016Synthesizing (); Zhou2016Learning (); Selvaraju2017Gradcam (); zhang2017EB (); Fong2017Interpretable () compute importance for a given base model (the one being explained) and an output category. However, they require access to the internals of the base model, such as the gradients of the output with respect to the input, intermediate feature maps, or the network’s weights. Many methods are also limited to certain network architectures and/or layer types Zhou2016Learning (). In this paper, we advocate for a more general approach that can produce a saliency map for an arbitrary network without requiring access to its internals and without re-implementation for each network architecture. LIME Ribeiro2016Should () offers such a black-box approach by drawing random samples around the instance to be explained and fitting an approximate linear decision model. However, its saliency is based on superpixels, which may not capture correct regions (see Fig. 2(d)).

(a) Input
(b) RISE (ours)
(c) GradCAM
(d) LIME
(e) Image during deletion
(f) RISE-Deletion
(g) GradCAM-Deletion
(h) LIME-Deletion
Figure 2: Estimation of importance of each pixel by our proposed RISE model and state-of-the-art methods for a base model’s prediction along with ‘deletion’ score (AUC). The top row shows an input image (from ImageNet) and saliency maps produced by RISE, Grad-CAM Selvaraju2017Gradcam () and LIME Ribeiro2016Should () with ResNet50 as the base network (redder values indicate higher importance). The bottom row illustrates the deletion metric: salient pixels are gradually masked from the image (panel (e)) in order of decreasing importance, and the probability of the ‘goldfish’ class predicted by the network is plotted vs. the fraction of removed pixels. In this example, RISE provides more accurate saliency and achieves the lowest AUC.

We propose a new black-box approach for estimating pixel saliency called Randomized Input Sampling for Explanation (RISE). Our approach is general and applies to any off-the-shelf image network, treating it as a complete black box and not assuming access to its parameters, features or gradients. The key idea is to probe the base model by sub-sampling the input image via random masks and recording its response to each of the masked images. The final importance map is generated as a linear combination of the random binary masks where the combination weights come from the output probabilities predicted by the base model on the masked images (See Fig. 3). This seemingly simple yet surprisingly powerful approach allows us to peek inside an arbitrary network without accessing any of its internal structure. Thus, RISE is a true black-box explanation approach which is conceptually different from mainstream white-box saliency approaches such as GradCAM Selvaraju2017Gradcam () and, in principle, is generalizable to base models of any architecture.

Another key contribution of our work is to propose causal metrics to evaluate the produced explanations. Most explanation approaches are evaluated in a human-centered way, where the generated saliency map is compared to the “ground truth” regions or bounding boxes drawn by humans in localization datasets Selvaraju2017Gradcam (); zhang2017EB (). Some approaches also measure human trust or reliability on the explanations Ribeiro2016Should (); Selvaraju2017Gradcam (). Such evaluations not only require a lot of human effort but, importantly, are unfit for evaluating whether the explanation is the true cause of the model’s decision. They only capture how well the explanations imitate the human-annotated importance of the image regions. But an AI system could behave differently from a human and learn to use cues from the background (e.g., using grass to detect cows) or other cues that are non-intuitive to humans. Thus a human-truthed metric cannot evaluate the correctness of an explanation that aims to extract the underlying decision process from the network.

Motivated by Fong2017Interpretable (), we propose two automatic evaluation metrics: deletion and insertion. The deletion metric measures the drop in the probability of a class as important pixels (given by the saliency map) are gradually removed from the image. A sharp drop, and thus a small area under the probability curve, is indicative of a good explanation. Fig. 2 shows plots produced by different explanation techniques for an image containing ‘goldfish’, where the total Area Under Curve (AUC) value is the smallest for our RISE model, indicating a more causal explanation. The insertion metric, on the other hand, captures the importance of the pixels in terms of their ability to synthesize an image and is measured by the rise in the probability of the class of interest as pixels are added according to the generated importance map. We argue that these two metrics not only alleviate the need for large-scale human evaluation or annotation effort, but are also better at assessing causal explanations by being human agnostic. For the sake of completeness, we also compare the performance of our method to state-of-the-art explanation models in terms of a human-centric evaluation metric.

2 Related work

The importance of producing explanations has been extensively studied in multiple fields, within and outside machine learning. Historically, representing knowledge using rules or decision trees Swartout1981Producing (); Swartout1993Explanation () has been found to be interpretable by humans. Another line of research focused on approximating the less interpretable models (e.g., neural networks, non-linear SVMs etc.) with simple, interpretable models such as decision rules or sparse linear models Thrun1995Extracting (); Craven1996Extracting (). In a recent work, Ribeiro et al. Ribeiro2016Should () fit a more interpretable approximate linear decision model (LIME) in the vicinity of a particular input. Though the approximation is fairly good locally, for a sufficiently complex model, a linear approximation may not lead to a faithful representation of the non-linear model. The LIME model can be applied to black-box networks like our approach, but its reliance on superpixels leads to inferior importance maps as shown in our experiments.

To explain classification decisions in images, previous works either visually ground image regions that strongly support the decision Selvaraju2017Gradcam (); Park2018Multimodal () or generate a textual description of why the decision was made Hendricks2016Generating (). The visual grounding is generally expressed as a saliency or importance map which shows the importance of each pixel towards the model’s decision. Existing approaches to deep neural network explanation either design ‘interpretable’ network architectures or attempt to explain or ‘justify’ decisions made by an existing model.

Within the class of interpretable architectures, Xu et al. Xu2015ShowAttention () proposed an interpretable image captioning system by incorporating an attention network which learns where to look next in an image before producing each word of the natural language description of the image. A neural module network is employed in Andreas2016Learning (); Hu2017Learning () to produce the answers to visual question-answering problems in an interpretable manner by learning to divide the problem into subproblems. However, these approaches achieve interpretability by incorporating changes to a white-box base model and are constrained to use specific network architectures.

Neural justification approaches attempt to justify the decision of a base model. Third-person models Hendricks2016Generating (); Park2018Multimodal () train additional models from human annotated ‘ground truth’ reasoning in the form of saliency maps or textual justifications. The success of such methods depends on the availability of tediously labeled ground-truth explanations, and they do not produce high-fidelity explanations. On the other hand, first-person models Zhou2016Learning (); Selvaraju2017Gradcam (); Fong2017Interpretable () aim to generate explanations providing evidence for the model’s underlying decision process without using an additional model. In our work, we focus on producing a first-person justification.

Several approaches generate importance maps by isolating contributions of image regions to the prediction. In one of the early works Zeiler2013Visualizing (), Zeiler et al. visualize the internal representation learned by CNNs using deconvolutional networks. Other approaches Simonyan2013Deep (); Yosinski2014Understanding (); Nguyen2016Synthesizing () have tried to synthesize an input (an image) that highly activates a neuron. The Class Activation Mapping (CAM) approach Zhou2016Learning () achieves class-specific importance of each location of an image by computing a weighted sum of the feature activation values at that location across all channels. However, the approach can only be applied to a particular kind of CNN architecture where a global average pooling is performed over convolutional feature map channels immediately prior to the classification layer. Grad-CAM Selvaraju2017Gradcam () extends CAM by weighing the feature activation values at every location with the average gradient of the class score (w.r.t. the feature activation values) for every feature map channel. Zhang et al. zhang2017EB () introduce a probabilistic winner-take-all strategy to compute top-down importance of neurons towards model predictions. Fong et al. Fong2017Interpretable () learn a perturbation mask that maximally affects the model’s output by backpropagating the error signals through the model. However, all of the above methods Simonyan2013Deep (); Yosinski2014Understanding (); Nguyen2016Synthesizing (); Zhou2016Learning (); Selvaraju2017Gradcam (); zhang2017EB (); Fong2017Interpretable () assume access to the internals of the base model to obtain feature activation values, gradients or weights. RISE is a more general framework as the importance map is obtained with access to only the input and output of the base model.

Figure 3: Overview of RISE: Input image is elementwise multiplied with random masks and the masked images are fed to the base model. The final saliency map is a linear combination of the masks where the weights come from the score of the target class corresponding to the respective masked inputs.

3 Randomized Input Sampling for Explanation (RISE)

One way to measure the importance of an image region is to obscure or ‘perturb’ it and observe how much this affects the black box decision. For example, it can be done by setting pixel intensities to zero Zeiler2013Visualizing (); Fong2017Interpretable (); Ribeiro2016Should (), blurring the region Fong2017Interpretable () or by adding noise. In this work we estimate the importance of pixels by dimming them in random combinations, reducing their intensities down to zero. We model this by multiplying an image with a $[0, 1]$-valued mask.

The mask generation process is described in detail in section 3.2.

3.1 Random Masking

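Let $f : \mathcal{I} \to \mathbb{R}$ be a black-box model that, for a given input from $\mathcal{I}$, produces a scalar confidence score. In our case, $\mathcal{I} = \{I \mid I : \Lambda \to \mathbb{R}^3\}$ is the space of color images of size $H \times W$ ($\Lambda = \{1, \dots, H\} \times \{1, \dots, W\}$). For example, $f$ may be a classifier that produces the probability that an object of some class is present in the image, or a captioning model that outputs the probability of the next word given a partial sentence.

Let $M : \Lambda \to \{0, 1\}$ be a random binary mask with distribution $\mathcal{D}$. Consider the random variable $f(I \odot M)$, where $\odot$ denotes element-wise multiplication. First, the image is masked by preserving only a subset of pixels. Then, the confidence score for the masked image is computed by the black box. We define the importance of pixel $\lambda \in \Lambda$ as the expected score over all possible masks $M$ conditioned on the event that pixel $\lambda$ is observed, i.e., $M(\lambda) = 1$:

$$S_{I,f}(\lambda) = \mathbb{E}_M\left[ f(I \odot M) \mid M(\lambda) = 1 \right]. \tag{1}$$

The intuition behind this is that $f(I \odot M)$ is high when pixels preserved by mask $M$ are important. It may not be the case for ‘adversarial’ examples when the model’s decision depends more on specific input characteristics rather than on the input content.

Eq. (1) can be rewritten as a sum over mask values $m$:

$$S_{I,f}(\lambda) = \sum_{m} f(I \odot m)\, P[M = m \mid M(\lambda) = 1] = \frac{1}{P[M(\lambda) = 1]} \sum_{m} f(I \odot m)\, P[M = m, M(\lambda) = 1]. \tag{2}$$

Now,

$$P[M = m, M(\lambda) = 1] = \begin{cases} 0, & \text{if } m(\lambda) = 0, \\ P[M = m], & \text{if } m(\lambda) = 1 \end{cases} \;=\; m(\lambda)\, P[M = m]. \tag{3}$$

Substituting from (3) in (2),

$$S_{I,f}(\lambda) = \frac{1}{P[M(\lambda) = 1]} \sum_{m} f(I \odot m) \cdot m(\lambda) \cdot P[M = m]. \tag{4}$$

It can be written in matrix notation, combined with the fact that $P[M(\lambda) = 1] = \mathbb{E}[M(\lambda)]$:

$$S_{I,f} = \frac{1}{\mathbb{E}[M]} \sum_{m} f(I \odot m) \cdot m \cdot P[M = m]. \tag{5}$$

We propose to generate importance maps by empirically estimating the sum in equation (5) using Monte Carlo sampling. To produce an importance map explaining the decision of model $f$ on image $I$, we sample a set of masks $\{M_1, \dots, M_N\}$ according to $\mathcal{D}$ and probe the model by running it on the masked images $I \odot M_i$, $i = 1, \dots, N$. Then, we take the weighted average of the masks, where the weights are the confidence scores $f(I \odot M_i)$, and normalize it by the expectation of $M$:

$$S_{I,f}(\lambda) \overset{\mathrm{MC}}{\approx} \frac{1}{\mathbb{E}[M] \cdot N} \sum_{i=1}^{N} f(I \odot M_i) \cdot M_i(\lambda). \tag{6}$$

Note that our method does not use any information from inside the model $f$ and is thus suitable for explaining black-box models.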
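The Monte Carlo estimate in equation (6) can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the function name `rise_saliency`, the `model` callable (any function mapping an image array to a scalar score) and the choice to estimate the mask expectation from the sampled masks when it is not supplied are all assumptions made here.

```python
import numpy as np


def rise_saliency(model, image, masks, mask_mean=None):
    """Monte Carlo estimate of the importance map, following equation (6).

    model     : callable taking an (H, W, C) array and returning a scalar
                score (hypothetical stand-in for the black box).
    image     : (H, W, C) float array.
    masks     : (N, H, W) float array with values in [0, 1].
    mask_mean : expectation of the mask; if None it is estimated from the
                sampled masks (an assumption of this sketch).
    """
    image = np.asarray(image, dtype=float)
    masks = np.asarray(masks, dtype=float)
    n = masks.shape[0]
    # Probe the black box once per masked image (mask broadcast over channels).
    scores = np.array([float(model(image * m[..., None])) for m in masks])
    # Weighted sum of the masks; the weights are the model scores.
    weighted = np.einsum("n,nhw->hw", scores, masks)
    if mask_mean is None:
        mask_mean = masks.mean()
    # Normalize by N and by the mask expectation.
    return weighted / (n * mask_mean)
```

Only forward calls to `model` are used, so the same function applies to any classifier that can be wrapped as a scalar-valued callable. The cost is one model evaluation per mask.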

3.2 Mask generation

Masking pixels independently may cause adversarial effects: a slight change in pixel values may cause significant variation in the model’s confidence scores. Moreover, generating masks by independently setting their elements to zeros and ones will result in a mask space of size $2^{H \times W}$. A bigger space requires more samples for a good estimation in equation (6).

To address these issues we first sample smaller binary masks and then upsample them to a larger resolution using bilinear interpolation. Bilinear upsampling does not introduce sharp edges in $I \odot M$ and results in a smooth importance map $S$. After the interpolation, the masks $M_i$ are no longer binary, but have values in $[0, 1]$. Finally, to allow more flexible masking, we shift all masks by a random number of pixels in both spatial directions.

Formally, mask generation can be summarized as:

  1. Sample $N$ binary masks of some smaller size $h \times w$ (smaller than the image size $H \times W$) by setting each element independently to $1$ with probability $p$ and to $0$ with the remaining probability.

  2. Upsample all masks to size $(h+1)C_H \times (w+1)C_W$ using bilinear interpolation, where $C_H \times C_W = \lfloor H/h \rfloor \times \lfloor W/w \rfloor$ is the size of the cell in the upsampled mask.

  3. Crop areas of size $H \times W$ with uniformly random indents from $(0, 0)$ up to $(C_H, C_W)$.
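The three steps above can be sketched as follows. This is a minimal NumPy-only sketch under stated assumptions: the helper names `bilinear_upsample` and `generate_masks` are hypothetical, the interpolation uses an align-corners sampling grid, and the cell size is rounded up (rather than down) so that the random crop always fits when the small grid size does not divide the image size.

```python
import numpy as np


def bilinear_upsample(grid, out_h, out_w):
    """Bilinear interpolation of a 2-D array (align-corners grid, an assumption)."""
    in_h, in_w = grid.shape
    ys = np.linspace(0, in_h - 1, out_h)
    xs = np.linspace(0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = ys - y0
    wx = xs - x0
    # Interpolate along x for the upper and lower neighbouring rows, then along y.
    top = grid[y0][:, x0] * (1 - wx) + grid[y0][:, x1] * wx
    bottom = grid[y1][:, x0] * (1 - wx) + grid[y1][:, x1] * wx
    return top * (1 - wy)[:, None] + bottom * wy[:, None]


def generate_masks(n, h, w, H, W, p, rng):
    """Return an (n, H, W) array of smooth masks with values in [0, 1]."""
    # Cell size; rounded up here (assumption) so the crop below always fits.
    cell_h = int(np.ceil(H / h))
    cell_w = int(np.ceil(W / w))
    up_h, up_w = (h + 1) * cell_h, (w + 1) * cell_w
    masks = np.empty((n, H, W))
    for i in range(n):
        # Step 1: small binary grid, each element 1 with probability p.
        grid = (rng.random((h, w)) < p).astype(float)
        # Step 2: bilinear upsampling to (h+1)*cell_h x (w+1)*cell_w.
        up = bilinear_upsample(grid, up_h, up_w)
        # Step 3: crop H x W with a random indent between 0 and the cell size.
        dy = rng.integers(0, cell_h + 1)
        dx = rng.integers(0, cell_w + 1)
        masks[i] = up[dy:dy + H, dx:dx + W]
    return masks
```

A call such as `generate_masks(100, 7, 7, 224, 224, 0.5, np.random.default_rng(0))` would produce 100 masks for a 224 x 224 image; these particular numbers are illustrative and not taken from the text.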

(a) train –
(b) ‘train’ importance
(c) car –
(d) ‘car’ importance
(e) car –
(f) ‘car’ importance
(g) TV –
(h) ‘TV’ importance
Figure 4: RISE applied to images from the PASCAL VOC dataset, explaining the ResNet model from zhang2017EB (). Captions show the categories that are explained as well as their probabilities. The explanations reveal interesting insights about the base model’s decisions. For example, it is right in deciding that there is a car in the image shown in (e). However, it is only the explanation in (f) that shows that the base model is right for the wrong reason, as it concentrates on the monitor on the pavement instead of the actual cars that are very hard to see. The explanation helps us understand that the context overpowered the visual evidence of the object of interest. On the other hand, explanations for the image in (a) show that the base model’s prediction of a train is wrong for the right reason. The platform-like bank of the canal and locomotive-like body of the boat make the base model predict it is a train. This is also true for the prediction of a TV in (g). For the image in (c), the base model is right for the right reason as the license plate of the vehicle is given the most importance in predicting a car.

4 Experiments

Datasets and Base Models: We evaluated RISE on 3 publicly available object classification datasets, namely, PASCAL VOC07 Everingham2010Pascal (), MSCOCO2014 Lin2014Microsoft () and ImageNet Russakovsky2015Imagenet (). Given a base model, we test importance maps generated by different explanation methods for a target object category present in an image from the VOC and MSCOCO datasets. For the ImageNet dataset, we test the explanation generated for the top probable class of the image. We chose the particular versions of the VOC and MSCOCO datasets to compare fairly with the state-of-the-art reporting on the same datasets and same base models. For these two datasets, we used ResNet50 He2016Deep () and VGG16 Simonyan2013Very () networks trained by zhang2017EB () as base models. For ImageNet, the same base models were downloaded from the PyTorch model zoo (see footnote 1). Fig. 4 shows some qualitative examples of importance maps for predictions made by the ResNet50 model on PASCAL VOC images.

4.1 Evaluation Metrics

Despite a growing body of research focusing on explainable machine learning, there is still no consensus about how to measure the explainability of a machine learning model Poursabzi2018Manipulating (). As a result, human evaluation has been the predominant way to assess model explanation by measuring it from the perspective of transparency, user trust or human comprehension of the decisions made by the model Herman2018Promise (). Existing justification methods zhang2017EB (); Selvaraju2017Gradcam () have evaluated saliency maps by their ability to localize objects. However, localization is merely a proxy for human explanation and may not correctly capture what causes the base model to make a decision irrespective of whether the decision is right or wrong as far as the proxy task is concerned. As a typical example, let us consider an image of a car driving on a road. Evaluating an explanation against the localization bounding box of the car does not give credit (in fact discredits) for correctly capturing ‘road’ as a possible cause behind the base model’s decision of classifying the image as that of a car. We argue that keeping humans out of the loop for evaluation makes it more fair and true to the classifier’s own view on the problem rather than representing a human’s view. Such a metric is not only objective (free from human bias) in nature but also saves time and resources.

Causal metrics for explanations: In order to avoid these issues, we propose two automatic evaluation metrics: deletion and insertion, motivated by Fong2017Interpretable (). The intuition behind the deletion metric is that the removal of the ‘cause’ will force the base model to change its decision. Specifically, this metric measures a decrease in the probability of the predicted class as more and more important pixels are removed from the image, where the importance of pixels is defined by the saliency score. A sharp drop and thus a low area under the probability curve (as a function of the fraction of removed pixels) means a good explanation. The insertion metric, on the other hand, takes a complementary approach. It measures the increase in probability as more and more pixels are introduced, with higher AUC indicative of a better explanation.

Method ResNet50 VGG16
Deletion Insertion Deletion Insertion
Grad-CAM Selvaraju2017Gradcam ()
Sliding window Zeiler2013Visualizing ()
LIME Ribeiro2016Should ()
RISE (ours)
Table 1: Comparative evaluation in terms of deletion and insertion scores. Except for Grad-CAM, the rest are black-box explanation models.

There are several ways of removing pixels from an image Dabkowski2017Real (), e.g., setting the pixel values to zero or any other constant gray value, blurring the pixels or even cropping out a tight bounding box. The same is true when pixels are introduced, e.g., they can be introduced to a constant canvas or by starting with a highly blurred image and gradually unblurring regions. All of these approaches have different pros and cons. A common issue is the introduction of spurious evidence which can fool the classifier. For example, if pixels are introduced to a constant canvas and if the introduced region happens to be oval in shape, the classifier may classify the image as a ‘balloon’ (possibly a printed balloon) with high probability. This issue is less severe if pixels are introduced to an initially blurred canvas as blurring takes away most of the finer details of an image without exposing it to sharp edges as image regions are introduced. This gives higher scores for all methods, so we took this strategy for insertion. For deletion, the aim is to fool the classifier as quickly as possible and blurring small regions instead of setting them to a constant gray level does not help. This is because a good classifier is usually able to fill in the missing details quite remarkably from the surrounding regions and from the small amount of low-frequency information left after blurring a tiny region. As a result, we set the image regions to constant values when removing them for the deletion metric evaluation. We used the same strategies for all the existing approaches with which we compared our method in terms of these two metrics.
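The insertion strategy described above (start from a blurred canvas and reveal pixels in order of decreasing importance) can be sketched as below. The function names are hypothetical, `model` is any callable returning a scalar score, and a simple box blur stands in for whatever blur kernel is actually used; the text does not specify the kernel, so treat that choice as an assumption.

```python
import numpy as np


def box_blur(image, k):
    """Mean filter with an odd k x k window and edge padding (stand-in blur)."""
    image = np.asarray(image, dtype=float)
    H, W = image.shape[:2]
    pad = k // 2
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(image)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + H, dx:dx + W]
    return out / (k * k)


def insertion_score(model, image, saliency, step, canvas):
    """Area under the score curve as pixels are revealed on `canvas`.

    image, canvas : (H, W, C) arrays; canvas is e.g. box_blur(image, k).
    saliency      : (H, W) importance map; most important pixels go first.
    step          : number of pixels introduced per step.
    """
    image = np.asarray(image, dtype=float)
    H, W = saliency.shape
    order = np.argsort(-saliency, axis=None)  # flat indices, most salient first
    source = image.reshape(H * W, -1)
    current = np.array(canvas, dtype=float).reshape(H * W, -1)
    scores = [float(model(current.reshape(image.shape)))]
    for start in range(0, H * W, step):
        idx = order[start:start + step]
        current[idx] = source[idx]  # reveal the next batch of original pixels
        scores.append(float(model(current.reshape(image.shape))))
    scores = np.asarray(scores)
    x = np.linspace(0.0, 1.0, len(scores))  # fraction of steps completed
    # Trapezoidal area under the curve.
    return float(np.sum((scores[1:] + scores[:-1]) * np.diff(x)) / 2.0)
```

A higher value indicates that the pixels ranked most important raise the class score quickly when they are introduced.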

Pointing game: We also evaluate saliency explanations in terms of a human evaluation metric, the pointing game introduced in zhang2017EB (). If the highest saliency point lies inside the human-annotated bounding box of an object, it is counted as a hit. The pointing game accuracy is given by $\frac{\#Hits}{\#Hits + \#Misses}$, averaged over all target categories in the dataset. For a classification model that learns to rely on objects, this metric should be high for a good explanation.
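The hit test and the per-category averaging can be sketched as follows. The function names and the `(x0, y0, x1, y1)` box format with inclusive corners are assumptions made for illustration; any tolerance or special handling used by the original benchmark is not modelled here.

```python
import numpy as np


def pointing_hit(saliency, boxes):
    """True if the maximum-saliency pixel lies inside any annotated box.

    saliency : (H, W) array.
    boxes    : iterable of (x0, y0, x1, y1) boxes with inclusive corners
               (hypothetical format).
    """
    y, x = np.unravel_index(np.argmax(saliency), saliency.shape)
    return any(x0 <= x <= x1 and y0 <= y <= y1 for (x0, y0, x1, y1) in boxes)


def pointing_accuracy(hits_per_category):
    """Mean over categories of hits / (hits + misses).

    hits_per_category : dict mapping a category name to a list of booleans,
                        one per evaluated (image, category) pair.
    """
    per_category = [
        sum(hits) / len(hits) for hits in hits_per_category.values() if hits
    ]
    return float(np.mean(per_category))
```

Averaging per category first, and only then across categories, keeps frequent categories from dominating the reported accuracy.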

4.2 Experimental Results

Experimental Settings: The binary random masks are generated with equal probabilities for 0’s and 1’s. For different CNN classifiers, we empirically select different numbers of masks, in particular, we used 4000 masks for the VGG16 network and 8000 for ResNet50. We have used and throughout. All the results used for comparison were either taken from published works or by running the publicly available code on datasets for which reported results could not be obtained.

Figure 5: RISE-generated importance maps (second column) for two representative images (first column) with deletion (third column) and insertion curves (fourth column).

Deletion and Insertion scores: Table 1 shows a comparative evaluation of RISE with other state-of-the-art approaches in terms of both deletion and insertion metrics. The sliding window approach Zeiler2013Visualizing () systematically occludes fixed size image regions and probes the model with the perturbed image to measure the importance of the occluded region. We used a sliding window of size with stride . For LIME Ribeiro2016Should (), the number of samples was set to (taken from the code). For this experiment, we used the ImageNet classification dataset where no ground truth segmentation or localization mask is provided and thus explainability performance can only be measured via automatic metrics like deletion and insertion. For both the base models and according to both the metrics, RISE provides better performance, outperforming even the white-box Grad-CAM method. The values are better for ResNet50 which is intuitive as it is a better classification model than VGG16. Fig. 5 shows examples of RISE generated importance maps along with the deletion and insertion curves. The appendices contain more visual examples.
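For reference, the sliding window baseline described above can be sketched as follows. The function name, the fill value, and the rule for turning window-level score drops into a per-pixel map (averaging the drops of all windows that cover a pixel) are assumptions of this sketch; the window size and stride are left as parameters.

```python
import numpy as np


def sliding_window_saliency(model, image, window, stride, fill=0.0):
    """Occlusion-based importance: score drop when a window is blanked out.

    model  : callable taking an (H, W, C) array and returning a scalar score.
    image  : (H, W, C) float array.
    window : side length of the square occluder, in pixels.
    stride : step between occluder positions, in pixels.
    """
    image = np.asarray(image, dtype=float)
    H, W = image.shape[:2]
    base = float(model(image))
    total = np.zeros((H, W))
    count = np.zeros((H, W))
    for y in range(0, H - window + 1, stride):
        for x in range(0, W - window + 1, stride):
            occluded = image.copy()
            occluded[y:y + window, x:x + window] = fill
            drop = base - float(model(occluded))
            # Credit the drop to every pixel covered by this window.
            total[y:y + window, x:x + window] += drop
            count[y:y + window, x:x + window] += 1
    # Average over covering windows; pixels never covered keep importance 0.
    return total / np.maximum(count, 1)
```

Unlike the random masks of RISE, each probe here hides exactly one fixed-size region, so the resulting map is tied to the chosen window size.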


Base model | Dataset | AM Simonyan2013Deep () | Deconv Zeiler2013Visualizing () | CAM Zhou2016Learning () | MWP zhang2017EB () | c-MWP zhang2017EB () | RISE
VGG16 | VOC | 76.00 | 75.50 | - | 76.90 | 80.00 |
VGG16 | MSCOCO | 37.10 | 38.60 | - | 39.50 | 49.60 |
ResNet50 | VOC | 65.80 | 73.00 | 90.60 | 80.90 | 89.20 |
ResNet50 | MSCOCO | 30.40 | 38.2 | 58.4 | 46.8 | 57.4 |

Table 2: Mean accuracy (%) in the pointing game

Pointing game accuracy: The performance in terms of pointing game accuracy is shown in Table 2 for the test split of PASCAL VOC07 and val split of MSCOCO2014 datasets. In this table, RISE is the only black-box method. The base models are obtained from zhang2017EB () and thus we list the pointing game accuracies reported in the paper. RISE reports an average value of 3 independent runs for VGG16 and 2 independent runs for ResNet50; the low standard deviation values indicate the robustness of the proposed approach against the randomness of the masks. For VGG16, RISE performs consistently better than all of the white-box methods with a significantly improved performance for the VOC dataset. For the deeper ResNet50 network with residual connections, RISE does not have the highest pointing accuracy but comes close. We stress again that good pointing accuracy may not correlate with actual causal processes in a network. However, RISE is competitive despite being black-box and more general than methods like CAM, which is only applicable to architectures without fully-connected layers.

4.3 RISE for Captioning

RISE can easily be extended to explain captions for any image description system. Some existing works use a separate attention network Xu2015ShowAttention () or assume access to feature activations zhang2017EB () and/or gradient values Selvaraju2017Gradcam () to ground words in an image caption. The most similar to our work is by Ramanishka et al. Ramanishka2017Top (), where the base model is probed with conv features from small patches of the input image to estimate its importance for each word in the caption. However, our approach is not constrained to a single fixed-size patch and is thus less sensitive to object sizes as well as better at capturing additional context that may be present in the image. We provide a brief description of how RISE can be applied to explain captions, along with the generated importance maps for a representative image. An extensive evaluation on this task is left as future work.

We take a base captioning model Donahue2015Long () that models the probability of the next word $w_{k+1}$ given a partial sentence $s = (w_1, \dots, w_k)$ and an input image $I$:

$$f(I, s, w_{k+1}) = P[w_{k+1} \mid I, w_1, \dots, w_k]. \tag{7}$$

We probe the base model by running it on a set of randomly masked inputs $I \odot M_i$ and computing saliency as $\mathbb{E}[f(I \odot M, s, w_{k+1}) \mid M(\lambda) = 1]$ for each word in $s$. The input sentence $s$ can be any arbitrary sentence, including the caption generated by the base model itself. One such example saliency map is shown in Fig. 6 from the MSCOCO dataset.
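A sketch of how one pass over the masks can yield an importance map for every word at once is given below. The interface is hypothetical: `word_probs(image, sentence)` is assumed to return one probability per word of `sentence`, each conditioned on the image and on the preceding words; the real captioning model's API is not described in the text.

```python
import numpy as np


def caption_saliency(word_probs, image, sentence, masks):
    """Per-word importance maps from a single pass over the masks.

    word_probs : callable (image, sentence) -> array of length len(sentence);
                 entry k is the probability of word k given the image and the
                 preceding words (hypothetical interface).
    image      : (H, W, C) float array.
    sentence   : sequence of words.
    masks      : (N, H, W) float array with values in [0, 1].
    Returns an array of shape (len(sentence), H, W).
    """
    image = np.asarray(image, dtype=float)
    masks = np.asarray(masks, dtype=float)
    # scores[i, k]: probability of word k when the image is masked by mask i.
    scores = np.stack(
        [np.asarray(word_probs(image * m[..., None], sentence), dtype=float)
         for m in masks]
    )
    # One weighted sum of masks per word, normalized as in the classifier case.
    maps = np.einsum("nk,nhw->khw", scores, masks)
    return maps / (masks.shape[0] * masks.mean())
```

Because the masked images are shared across words, the number of model evaluations does not grow with the sentence length under this interface.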

(a) “A horse and carriage on a city street.”
(b) “A horse…”
(c) “A horse and carriage…”
(d) “White…”
Figure 6: Explanations of image captioning models. (a) is the image with the caption generated by Donahue2015Long (). (b) and (c) show the importance maps generated by RISE for the two words ‘horse’ and ‘carriage’, respectively, from the generated caption. (d) shows the importance map for an arbitrary word ‘white’.

5 Conclusion

This paper presented RISE, a randomized approach for explaining black-box models by estimating the importance of input image regions for the model’s prediction. Despite its simplicity and generality, the method outperforms existing explanation approaches in terms of automatic causal metrics and performs competitively in terms of the human-centric pointing metric.

Future work will be to exploit the generality of the approach for explaining decisions made by complex networks in video and other domains.

Appendix A Algorithm to Compute Deletion Score

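procedure Deletion
    Input: black box $f$, image $I$, importance map $S$, number of pixels $N$ removed per step
    Output: deletion score $d$
    $n \leftarrow 0$
    $h_n \leftarrow f(I)$
    while $I$ has non-zero pixels do
        According to $S$, set next $N$ pixels in $I$ to 0
        $n \leftarrow n + 1$
        $h_n \leftarrow f(I)$
    $d \leftarrow$ AreaUnderCurve($h_i$ vs. $i/n$, $\forall i = 0, \dots, n$)
    return $d$
Algorithm 1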
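A runnable sketch of the deletion procedure is given below. It departs from the pseudocode in one labeled way: instead of testing whether the image still has non-zero pixels, it loops until every pixel has been removed once, in order of decreasing importance. The function name and the `model` callable are hypothetical.

```python
import numpy as np


def deletion_score(model, image, saliency, step):
    """Area under the score curve as the most important pixels are zeroed.

    model    : callable taking an array shaped like `image`, returning a scalar.
    image    : (H, W, C) or (H, W) float array.
    saliency : (H, W) importance map; most important pixels are removed first.
    step     : number of pixels removed per step.
    """
    image = np.asarray(image, dtype=float)
    H, W = saliency.shape
    order = np.argsort(-saliency, axis=None)  # flat indices, most salient first
    current = np.array(image).reshape(H * W, -1)  # working copy, one row per pixel
    scores = [float(model(current.reshape(image.shape)))]
    for start in range(0, H * W, step):
        current[order[start:start + step]] = 0.0  # remove the next batch
        scores.append(float(model(current.reshape(image.shape))))
    scores = np.asarray(scores)
    x = np.linspace(0.0, 1.0, len(scores))  # i / n for i = 0, ..., n
    # Trapezoidal area under the curve.
    return float(np.sum((scores[1:] + scores[:-1]) * np.diff(x)) / 2.0)
```

A lower value means the score collapses sooner, which is the behavior expected from a map that ranks the truly influential pixels first.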

Appendix B More Explanations with Insertion and Deletion Scores

Figure 7: RISE-generated importance maps (second column) for representative images (first column) with deletion (third column) and insertion curves (fourth column).
Figure 8: RISE-generated importance maps (second column) for representative images (first column) with deletion (third column) and insertion curves (fourth column).
Figure 9: RISE-generated importance maps (second column) for representative images (first column) with deletion (third column) and insertion curves (fourth column).


  1. https://github.com/pytorch/vision


  1. T. Lombrozo, “The Structure and Function of Explanations,” Trends in Cognitive Sciences, vol. 10, no. 10, pp. 464–470, 2006.
  2. T. Lombrozo, “The Instrumental Value of Explanations,” Philosophy Compass, vol. 6, no. 8, pp. 539–551, 2011.
  3. M. T. Dzindolet, S. A. Peterson, R. A. Pomranky, L. G. Pierce, and H. P. Beck, “The Role of Trust in Automation Reliance,” International Journal of Human-Computer Studies, vol. 58, no. 6, pp. 697–718, 2003.
  4. M. T. Ribeiro, S. Singh, and C. Guestrin, “Why Should I Trust You?: Explaining the Predictions of any Classifier,” in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016, pp. 1135–1144.
  5. Z. C. Lipton, “The Mythos of Model Interpretability,” arXiv preprint arXiv:1606.03490, 2016.
  6. J. Zhang, S. A. Bargal, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff, “Top-down Neural Attention by Excitation Backprop,” International Journal of Computer Vision, Dec 2017.
  7. K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps,” arXiv preprint arXiv:1312.6034, 2013.
  8. J. Yosinski, J. Clune, T. Fuchs, and H. Lipson, “Understanding Neural Networks Through Deep Visualization,” in International Conference on Machine Learning Workshop on Deep Learning.
  9. A. Nguyen, A. Dosovitskiy, J. Yosinski, T. Brox, and J. Clune, “Synthesizing the Preferred Inputs for Neurons in Neural Networks via Deep Generator Networks,” in Neural Information Processing Systems, 2016, pp. 3387–3395.
  10. B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning Deep Features for Discriminative Localization,” in IEEE Computer Vision and Pattern Recognition, 2016, pp. 2921–2929.
  11. R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-CAM: Visual Explanations From Deep Networks via Gradient-Based Localization,” in IEEE International Conference on Computer Vision, Oct 2017.
  12. R. C. Fong and A. Vedaldi, “Interpretable Explanations of Black Boxes by Meaningful Perturbation,” in IEEE International Conference on Computer Vision, Oct 2017.
  13. W. R. Swartout, “Producing Explanations and Justifications of Expert Consulting Programs,” 1981.
  14. W. R. Swartout and J. D. Moore, “Explanation in Second Generation Expert Systems,” in Second Generation Expert Systems. Springer, 1993, pp. 543–585.
  15. S. Thrun, “Extracting Rules from Artificial Neural Networks with Distributed Representations,” in Advances in Neural Information Processing Systems, 1995, pp. 505–512.
  16. M. W. Craven and J. W. Shavlik, “Extracting Comprehensible Models from Trained Neural Networks,” Ph.D. dissertation, University of Wisconsin, Madison, 1996.
  17. D. H. Park, L. A. Hendricks, Z. Akata, A. Rohrbach, B. Schiele, T. Darrell, and M. Rohrbach, “Multimodal Explanations: Justifying Decisions and Pointing to the Evidence,” in IEEE Computer Vision and Pattern Recognition, Jun 2018.
  18. L. A. Hendricks, Z. Akata, M. Rohrbach, J. Donahue, B. Schiele, and T. Darrell, “Generating Visual Explanations,” in European Conference on Computer Vision, 2016, pp. 3–19.
  19. K. Xu, J. L. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention,” in International Conference on Machine Learning, 2015, pp. 2048–2057.
  20. J. Andreas, M. Rohrbach, T. Darrell, and D. Klein, “Learning to Compose Neural Networks for Question Answering,” in The Conference of the North American Chapter of the Association for Computational Linguistics, 2016, pp. 1545–1554.
  21. R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and K. Saenko, “Learning to Reason: End-To-End Module Networks for Visual Question Answering,” in IEEE International Conference on Computer Vision, Oct 2017.
  22. M. D. Zeiler and R. Fergus, “Visualizing and Understanding Convolutional Networks,” in European Conference on Computer Vision. Springer, 2014, pp. 818–833.
  23. M. Everingham, L. V. Gool, C. K. Williams, J. Winn, and A. Zisserman, “The Pascal Visual Object Classes (VOC) Challenge,” International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.
  24. T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common Objects in Context,” in European Conference on Computer Vision, 2014, pp. 740–755.
  25. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
  26. K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  27. K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” in International Conference on Learning Representations, May 2015.
  28. F. Poursabzi-Sangdeh, D. G. Goldstein, J. M. Hofman, J. W. Vaughan, and H. Wallach, “Manipulating and Measuring Model Interpretability,” arXiv preprint arXiv:1802.07810, 2018.
  29. B. Herman, “The Promise and Peril of Human Evaluation for Model Interpretability,” in Interpretable ML Symposium, Neural Information Processing Systems, Dec 2017.
  30. P. Dabkowski and Y. Gal, “Real Time Image Saliency for Black Box Classifiers,” in Neural Information Processing Systems, 2017, pp. 6970–6979.
  31. V. Ramanishka, A. Das, J. Zhang, and K. Saenko, “Top-Down Visual Saliency Guided by Captions,” in IEEE Conference on Computer Vision and Pattern Recognition, July 2017.
  32. J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term Recurrent Convolutional Networks for Visual Recognition and Description,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2625–2634.