xGEMs: Generating Examplars to Explain Black-Box Models
Abstract
This work proposes xGEMs, or manifold-guided exemplars, a framework to understand black-box classifier behavior by exploring the landscape of the underlying data manifold as data points cross decision boundaries. To do so, we train an unsupervised implicit generative model, treated as a proxy for the data manifold. We summarize black-box model behavior quantitatively by perturbing data samples along the manifold. We demonstrate xGEMs' ability to detect and quantify bias in model learning, and also to track changes in model behavior as training progresses.
Shalmali Joshi UT Austin shalmali@utexas.edu Oluwasanmi Koyejo UIUC sanmi@illinois.edu Been Kim Google Brain beenkim@google.com Joydeep Ghosh UT Austin jghosh@utexas.edu
Preprint. Work in progress.
1 Introduction
Machine learning algorithms have become widely deployed in domains beyond web-based recommendation systems, such as the criminal justice system (Angwin et al., 2016) and clinical healthcare (Callahan and Shah, 2017). For instance, risk assessment tools like COMPAS (Angwin et al., 2016) produce learned recidivism scores that are then used to determine the amount of pretrial bail and detention. Similarly, medical interventions can impact health outcomes for patients, making institutions liable to provide explanations for their decisions. This has motivated regulatory agencies like the EU Parliament^{1}^{1}1in collaboration with the European Commission and the Council of the European Union to codify a right to data protection and to “obtain an explanation of the decision reached” using such automated systems^{2}^{2}2https://www.privacyregulation.eu/en/r71.htm.
Systems that provide satisfactory explanations for the decisions of such learning algorithms have until recently been few and far between. It is challenging to characterize the specific nature of explainability mechanisms given their complexity and the lack of consensus on the nature and sufficiency of such explanations (Doshi-Velez and Kim, 2017; Lipton, 2016). The problem is often compounded by the multiple levels of abstraction required to provide such explanations (Miller, 2017). For instance, system-level explanations as required by regulatory bodies are different from an abstraction that would assist practitioners of machine learning. This work focuses on providing explanations for low-level understanding of model behavior, albeit at an abstraction beyond performance metrics. Such a suite of explanations not only helps improve understanding of opaque models^{3}^{3}3https://distill.pub/2018/buildingblocks/ (Higgins et al., 2016; Karpathy et al., 2015) but can also uncover biases (inherent in the data) that models pick up on, e.g., learned gender and racial biases (Bolukbasi et al., 2016).
In this work, we posit that there need not be an inherent tradeoff between model performance and explainability, as is generally assumed (and found in Kim et al. (2016); Gupta et al. (2016); Hughes et al. (2016)). We propose an explainability tool that probes a supervised black-box model along the data manifold for explanations via examples and/or summaries. Demonstrating model behavior via examples is known to be beneficial for improving and understanding the decision-making process (Aamodt and Plaza, 1994). Navigating the data manifold allows us to explore black-box behavior in different regions of the manifold. The proposed method can be utilized as a diagnostic tool to analyze training progression, compare classifier performance, and/or uncover inherent biases the classifier may have learned.
2 Related Work
The works most closely related to our approach are those that provide explanations by subselecting meaningful samples and/or semantically relevant features (like superpixels) that highlight undesirable model behavior (Elenberg et al., 2017; Kim et al., 2016). Most of these methods require the selected samples to be part of the training/test dataset. This means that if the training/test set does not include the instance that best explains a specific decision, we have to settle for a suboptimal choice. Our method relaxes this constraint by generating new examples that are better suited for this purpose. In terms of generating examples, adversarial criticisms (Stock and Cisse, 2017) and the class of generative networks like GANs are relevant approaches. Specifically, Stock and Cisse (2017) use the adversarial attack paradigm as a means to select examples from existing training data to explain model behavior, similar to Kim et al. (2016). Note, however, that the goals of adversarial examples and of our explanations are fundamentally different. Adversarial examples focus on exploiting the worst-case confounding scenario given a decision boundary, while our work focuses on generating an example that stays on the data manifold as it crosses a decision boundary. See Figure 1(a) for a more intuitive explanation. We posit that it is important to uncover classifier behavior when data points are constrained to the data manifold. Such data instances are more ‘realistic’ and likely to be created by the underlying phenomenon that generated the training data. They provide an alternative method to probe a black box, especially in non-adversarial settings. They also characterize the residual vulnerabilities of a model that defends itself against adversarial attacks by detecting directed “noise” that is orthogonal to the manifold of the data or of an associated latent space.
We position our work as a diagnostic framework for understanding model behavior at an abstraction that may be most useful to a data science practitioner and/or a machine learning expert. However, as suggested before, explainable models focus on different notions of explainability. For example, Koh and Liang (2017) use influence functions, motivated by robust statistics (Cook and Weisberg, 1980), to determine the importance of each training sample for model predictions. Li et al. (2015); Selvaraju et al. (2016) focus on understanding the workings of different layers of a deep network and studying saliency maps for feature attribution (Simonyan et al., 2013; Smilkov et al., 2017; Sundararajan et al., 2017). Saliency methods, while powerful, can be shown to be unreliable without stronger conditions on the saliency model (Kindermans et al., 2017; Adebayo et al., 2018). Other paradigms of explainable models focus on locally approximating complex models using a simpler functional form to approximate the (local) decision boundary. For instance, LIME-based approaches (Ribeiro et al., 2016; Shrikumar et al., 2016; Bach et al., 2015) locally approximate complex models with linear fits. Decision trees are also considered more explainable if they are not too large. These approaches inherently assume a tradeoff between model performance and explainability, as less complex model classes tend to be empirically subpar in performance relative to the target black-box models they endeavor to explain. The xGEMs framework, however, does not rely on local approximations to provide explanations, nor does it assume such a tradeoff.
We summarize our key contributions as follows: 1. We introduce xGEMs, a framework for explaining supervised black-box models via examples generated along the underlying data manifold. 2. We demonstrate the utility of xGEMs in (a) detecting bias in learned models, (b) characterizing the probabilistic decision manifold w.r.t. examples, and (c) facilitating model comparison beyond standard performance metrics.
3 Background
Implicit Generative Models
can be described as stochastic procedures that generate samples (denoted by the random variable $x$) from the data distribution $p(x)$ without explicitly parameterizing $p(x)$. The two most significant types are Variational Auto-Encoders (VAEs) (Kingma and Welling, 2013) and Generative Adversarial Networks (GANs) (Goodfellow et al., 2014). Implicit generative models generally assume an underlying latent variable $z$ that is mapped to the ambient data domain by a deterministic function $\mathcal{G}_\theta$, parametrized by $\theta$ and usually realized as a deep neural network. The primary difference between GANs and VAEs is the training mechanism employed to learn the function $\mathcal{G}_\theta$. GANs employ an adversarial framework, using a discriminator that tries to distinguish generated samples from original samples, while VAEs maximize an approximation to the data likelihood. The approximation thus obtained has the encoder-decoder structure of conventional autoencoders (Doersch, 2016). One can obtain a latent representation of any data sample within the latent embedding using the trained encoder network. While GANs do not train an associated encoder, recent advances in adversarially learned inference like BiGANs (Dumoulin et al., 2016; Donahue et al., 2016) can be utilized to obtain the latent embedding. In this work, we assume access to an implicit generative model that allows us to obtain the latent embedding of a data point.
Let $\mathcal{E}_\phi$ (parametrized by $\phi$) be the inverse mapping function that provides the latent representation of a given data sample. Let $\ell$ be the analogous loss function such that, for a given data sample $x$:

$$\mathcal{E}_\phi(x) = \arg\min_{z} \ell(x, \mathcal{G}_\theta(z)) \qquad (1)$$

Examples of $\mathcal{E}_\phi$ are the encoder in a VAE, or an inference network in a BiGAN. An appropriate distance function in the data domain can be used as the loss $\ell$.
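To make Equation (1) concrete, here is a minimal sketch of latent inference by reconstruction-loss minimization, using the one-dimensional parabolic manifold of Section 5.1 as a stand-in generator. The names `G`, `recon_loss`, and `infer_latent`, the analytic gradient, and all constants are illustrative assumptions, not the paper's implementation.

```python
# Toy 1-D generator mapping latent z to the parabolic manifold of Sec. 5.1:
# G(z) = (z, z**2). All names and constants here are illustrative.
def G(z):
    return (z, z * z)

def recon_loss(x, z):
    """Squared-error reconstruction loss ||x - G(z)||^2 (the loss in Eq. 1)."""
    gx, gy = G(z)
    return (x[0] - gx) ** 2 + (x[1] - gy) ** 2

def infer_latent(x, z0=0.0, lr=0.05, steps=500):
    """Approximate E(x) = argmin_z ||x - G(z)||^2 by gradient descent."""
    z = z0
    for _ in range(steps):
        # d/dz ||x - G(z)||^2, computed analytically for this toy G
        gx, gy = G(z)
        grad = -2 * (x[0] - gx) - 4 * z * (x[1] - gy)
        z -= lr * grad
    return z

# A point on the manifold at z = 1 should be inverted back to z ~ 1.
z_hat = infer_latent((1.0, 1.0))
```

In a real VAE or BiGAN the inverse map is an amortized encoder network rather than per-sample optimization, but the objective being (approximately) minimized is the same.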
Without loss of generality, we assume that we would like to provide explanations for a binary classifier. Let $y \in \{0, 1\}$ be the target label. Let $f$ be the target black-box classifier to be ‘explained’ and $\ell_f$ be the loss function used to train the black-box classifier.
Adversarial Criticisms
Adversarial criticisms to explain black-box classifiers look for perturbations to data samples such that the perturbations maximize the loss or change the predicted label. These perturbations are invisible to the human eye. That is, if $x_{\mathrm{adv}}$ is the target adversarial sample, an adversarial attack solves a Taylor approximation to the following:

$$x_{\mathrm{adv}} = \arg\max_{\tilde{x}\,:\,\|\tilde{x} - x\| \le \epsilon} \ell_f(f(\tilde{x}), y) \qquad (2)$$
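The Taylor approximation of Equation (2) is commonly realized as a single signed-gradient (FGSM-style) step; a toy sketch against a hand-set linear "black box" follows. All names (`w`, `b`, `predict`, `fgsm_step`) and constants are illustrative assumptions, not from the paper.

```python
import math

# Toy linear "black box": P(y=1 | x) = sigmoid(w . x + b).
# All names and constants are illustrative.
w, b = (1.0, -1.5), 0.2

def predict(x):
    s = w[0] * x[0] + w[1] * x[1] + b
    return 1.0 / (1.0 + math.exp(-s))

def fgsm_step(x, y, eps=0.3):
    """One signed-gradient step approximating Eq. (2):
    x_adv = x + eps * sign(grad_x BCE(f(x), y))."""
    p = predict(x)
    # gradient of binary cross-entropy w.r.t. x is (p - y) * w
    g = ((p - y) * w[0], (p - y) * w[1])
    sign = lambda v: (v > 0) - (v < 0)
    return (x[0] + eps * sign(g[0]), x[1] + eps * sign(g[1]))

x = (1.0, 1.0)
x_adv = fgsm_step(x, y=1)   # perturbation increases the loss for label 1
```

Note that nothing constrains `x_adv` to any data manifold; it moves freely in the ambient space, which is precisely the contrast drawn with xGEMs below.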
We now characterize the proposed model and detail the kinds of explanations it can provide.
4 Generating xGEMs
To provide explanations via examples over more naturalistic perturbations, we introduce a new set of examples, called manifold-guided examples or xGEMs. First, we train an implicit generative model $\mathcal{G}_\theta$ and an encoder network $\mathcal{E}_\phi$. For a data sample $x$ with predicted label $y$ and a target label $\bar{y} \ne y$, we then solve:

$$\bar{z} = \arg\min_{z} \ell(x, \mathcal{G}_\theta(z)) + \lambda\, \ell_f(f(\mathcal{G}_\theta(z)), \bar{y}) \qquad (3)$$

where $\lambda > 0$ trades off proximity to $x$ against crossing the decision boundary.
A manifold-guided example is defined w.r.t. a given data sample $x$.
Definition 1 (xGEM).
An xGEM corresponding to a data point $x$ (with predicted label $y$) and a target label $\bar{y} \ne y$ refers to the solution of Equation (3) for a fixed and known $\lambda$. The xGEM is denoted by $\bar{x} = \mathcal{G}_\theta(\bar{z})$.
We propose Algorithm 1 to estimate a manifold-guided example (xGEM) for any data point $x$. Intuitively, for a point $x$, we determine its latent representation using $\mathcal{E}_\phi$. This allows us to explain model behavior from a common latent representation across all black boxes. To find realistic perturbations of this point along the data manifold, we traverse the latent space of the generator (our proxy for the data manifold) until the label switches to the desired target label $\bar{y}$. The desired manifold-guided example, or xGEM, is the sample generated at the switch point in the latent embedding.
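The latent-space traversal described above can be sketched end-to-end on the toy parabola manifold of Section 5.1. This is an illustrative re-implementation under stated assumptions (hand-set generator, linear black box, analytic gradients, crude latent initialization in place of the encoder), not the paper's Algorithm 1 verbatim.

```python
import math

# Toy parabola manifold: G(z) = (z, z**2). Black box assigns class 1
# iff the second coordinate exceeds 0.5. All names are illustrative.
def G(z):
    return (z, z * z)

w, b = (0.0, 1.0), -0.5

def f_prob(x):
    """P(y = 1 | x) under the toy linear black box."""
    s = w[0] * x[0] + w[1] * x[1] + b
    return 1.0 / (1.0 + math.exp(-s))

def xgem(x, y_target, lr=0.05, lambda_=5.0, steps=2000):
    """Descend in latent space on recon_loss + lambda * classifier loss
    until the predicted label switches to y_target (cf. Eq. 3)."""
    z = x[0]                       # crude init: encoder omitted in this toy
    for _ in range(steps):
        gx, gy = G(z)
        p = f_prob((gx, gy))
        if (p > 0.5) == (y_target == 1):
            break                  # label switched: stop at the boundary
        # gradient of reconstruction term ||x - G(z)||^2 w.r.t. z
        g_rec = -2 * (x[0] - gx) - 4 * z * (x[1] - gy)
        # gradient of BCE(f(G(z)), y_target) w.r.t. z via the chain rule
        g_cls = (p - y_target) * (w[0] + 2 * z * w[1])
        z -= lr * (g_rec + lambda_ * g_cls)
    return G(z)

# Start from a class-0 point on the manifold; walk until class 1.
x_gem = xgem((0.3, 0.09), y_target=1)
```

Because every iterate is of the form $G(z)$, the trajectory stays on the manifold by construction, unlike the unconstrained adversarial step of Equation (2).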
We empirically highlight the benefits of discovering manifold-guided examples in different contexts and at different abstractions that provide insights into model behavior.
5 Explanations using xGEMs
We first use a simple setting with simulated data to highlight the differences between the proposed explanation tool and the criticisms and prototypes derived from adversarial attacks (Stock and Cisse, 2017).
5.1 An Alternative View of Adversarial Criticisms
Figure 1(a) demonstrates a linear decision boundary trained on data with ambient dimension 2. The one-dimensional data manifold is the parabola shown by the blue curve. The green points are samples in class 1 and the red points are samples in class 2. The figure illustrates manifold-guided examples as well as the trajectory taken by the gradient steps of Algorithm 1. The trajectory to generate an adversarial criticism stems from Equation (2). A generative model maps a one-dimensional latent variable to the data manifold shown by the blue curve. A single-layer (softmax) neural network with output dimension 2 is trained on points sampled from this manifold (the yellow line is the decision boundary; the two classes occupy the pink and green regions). As the figure demonstrates, navigating along the latent dimension of the generator constrains the xGEM trajectory to the data manifold, while adversarial criticisms may lie well outside it. Thus manifold-guided examples offer an alternative view of classifier behavior via examples.

We defer examples of xGEMs evaluated for the MNIST dataset to the Appendix in the interest of space.
5.2 Towards automated bias detection
We now demonstrate the utility of generating manifold-guided examples to detect whether a target classifier is confounded w.r.t. a given attribute of interest. In particular, we wish to determine whether a black box differentiates among the target labels using spurious correlations in the data. For instance, a classifier trained to determine the best medical intervention may be relying on attributes like gender to determine the best treatment. It is desirable to have an automated mechanism to detect such behavior. We say that a classifier is confounded with an attribute of interest if the attribute substantially influences the black box's predictions. We make this concrete in the context of our framework below.
Without loss of generality, let $a \in \{0, 1\}$ be the (potentially protected) binary attribute of interest. We wish to examine whether the target classifier $f$ is biased/confounded by $a$. Intuitively, the attribute $a$ of an xGEM should be the same as that of the original point. In order to detect violations, we assume there exists an oracle $f_a^*$ that perfectly classifies the confounding attribute, when considered as the dependent variable, based on the other independent variables. Additionally, we assume that $a$ is not confounded by the target label $y$ of the black box and is not used by $f$ to predict $y$. Let $\{x_i\}_{i=1}^{N}$ be the training data, where $i$ indexes a given point. Let $\bar{x}_i$ be the xGEM of $x_i$ w.r.t. $f$ as returned by Algorithm 1. We argue that the classifier $f$ is confounded by the attribute $a$ if Equation (4) holds for a given threshold $\zeta > 0$.
$$\frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\!\left[f_a^*(x_i) \ne f_a^*(\bar{x}_i)\right] > \zeta \qquad (4)$$
In practice, access to a perfect oracle $f_a^*$ is infeasible or prohibitively expensive. In some cases, such a classifier may be provided by regulatory bodies, thereby adhering to predetermined criteria as to what counts as a reliable proxy oracle. For this case study, we assume it is sufficient that the proxy oracle $f_a$ has the same false positive and false negative error rates w.r.t. the target label, which is a fairness condition known as the Equalized Odds criterion (Hardt et al., 2016). To demonstrate our algorithm, we assume access to a proxy oracle $f_a$ that satisfies the following conditions, given a tolerance $\epsilon$:
(i) $f_a$ classifies the attribute $a$ with error at most $\epsilon$;
(ii) $f_a$ satisfies the Equalized Odds (Hardt et al., 2016) criterion w.r.t. the target label $y$.
Note that while we consider $f_a$ as an inexpensive proxy for $f_a^*$, we prescribe that the experiment be carried out with $f_a^*$ whenever it is available. Figure 2 demonstrates how such confounding can be detected, as well as used for model comparison w.r.t. their biases. As shown in the figure, $f_1$ and $f_2$ are the classification boundaries of two black-box models classifying a target label of interest, while $f_a$ is a classifier that classifies the data according to the attribute $a$. Consider a sample $x$ and let $\bar{x}_1$ and $\bar{x}_2$ be the manifold-guided examples of $x$ corresponding to classifiers $f_1$ and $f_2$ respectively. As shown in the figure, one of the xGEMs changes the attribute relative to $x$ while the other does not. We conclude that a black box is confounded if the fraction of points whose manifold-guided examples, or xGEMs, change the attribute is greater than $\zeta$. Thus an empirical estimate of Equation (4) gives a metric that quantifies the amount of confounding in a given black box, while also allowing comparison of different black boxes w.r.t. the target attribute $a$.
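An empirical estimate of Equation (4) reduces to counting attribute flips between originals and their xGEMs under the proxy oracle. A minimal sketch follows; the function names, the threshold value, and the oracle outputs below are hypothetical stand-ins, not CelebA results.

```python
# Illustrative check of Eq. (4): a black box is flagged as confounded by
# attribute a if the fraction of samples whose xGEM flips the proxy
# oracle's attribute prediction exceeds a threshold zeta.
# All names and data here are hypothetical stand-ins.
def confounding_fraction(attrs_orig, attrs_xgem):
    """Fraction of samples whose oracle-predicted attribute flips."""
    flips = sum(a != b for a, b in zip(attrs_orig, attrs_xgem))
    return flips / len(attrs_orig)

def is_confounded(attrs_orig, attrs_xgem, zeta=0.1):
    return confounding_fraction(attrs_orig, attrs_xgem) > zeta

# Hypothetical oracle outputs on originals vs. their xGEMs:
orig = [0, 0, 1, 1, 0, 1, 0, 1]
gems = [0, 1, 0, 1, 0, 0, 0, 1]   # 3 of 8 attribute flips
frac = confounding_fraction(orig, gems)
```

Comparing `confounding_fraction` across two black boxes on the same validation set gives the model-comparison metric used in Tables 2 and 3.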
We evaluate our framework for confounding detection in facial images using the CelebA (Liu et al., 2015) dataset. The target black-box classifier predicts a binary facial attribute, hair color (black or blond). We determine whether or not the black box is confounded with gender. We restrict to two genders, male and female, based on the annotations available in CelebA. In particular, $f_a$ is a ResNet model (He et al., 2016)^{4}^{4}4https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10_estimator that classifies CelebA faces by gender. $f_a$ is recalibrated to satisfy the two conditions mentioned earlier. Details of $f_a$'s performance and recalibration are provided in Table 1.
Two ResNet models $f_1$ and $f_2$ are trained to detect the hair-color attribute (black hair vs. blond hair) using two different datasets. $f_1$ is trained on all face samples with either black or blond hair, whereas $f_2$ is trained such that all black-haired samples are male while all blond-haired samples are female. Table 2 gives the overall validation accuracy of both classifiers. Note that the validation sets used for $f_1$ and $f_2$ are the same.
Table 2 also shows the fraction of samples whose manifold-guided examples' predicted attribute (in this case gender) differs from that of the original training sample according to $f_a$. The fraction of confounded samples is clearly much larger for the classifier trained on the biased dataset, as determined by the proxy oracle $f_a$. Additionally, Table 3 suggests a 10-fold increase in the fraction of confounding for blond-haired females with the biased classifier $f_2$. Notice the decrease in the amount of confounding for black-haired females alongside a general increase in confounding for all black-haired faces. As an aside, the biased model also changes the background more than the hair color in order to change the hair-color label (see Figure 3). This suggests that quantifying such confounding using manifold-guided examples allows us to characterize biases w.r.t. any attribute of interest.
Figure 3 shows a few examples of such confounded images for the two black boxes. In particular, we show examples where the black box trained on biased data for hair-color classification changes the gender of the sample as it crosses the decision boundary, whereas the black box trained on unbiased data does not^{5}^{5}5All qualitative figures were chosen based on the confidence of the prediction from the black box and the confidence of the reconstructed image.
5.3 Case Study: Model Assessment beyond performance metrics
An important aspect of black-box analysis is studying the progression of training complex models. Specifically, observing manifold-guided examples allows us to consider model behavior in the following respects: 1) discerning shifts in the features relied on by the black box to differentiate between classes during training; 2) characterizing the probabilistic manifolds of manifold-guided examples as training progresses and their relation to the calibration of complex networks (DeGroot and Fienberg, 1983); 3) qualitative tradeoffs and/or mistakes made by the classifier for prototypical examples.
Reliability Diagrams have been used as a summary statistic to evaluate model calibration (DeGroot and Fienberg, 1983); they study whether the confidence of a prediction matches the ground-truth likelihood of the prediction. It has been observed that while model performance has improved substantially in recent years because of deep networks, such models are typically more prone to miscalibration (Guo et al., 2017). We provide a statistic complementary to Reliability Diagrams to assist model assessment and comparison.
For this study we train two deep networks, $f_1$ (a ResNet model) and $f_2$ (a four-layer CNN with local response normalization (lrn)^{6}^{6}6https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10), on CelebA face images for the hair-color (black/blond) binary classification task. For a given face, we evaluate the corresponding xGEM at multiple incremental training steps. We plot the confidence of labeling a point as having black hair against the distance of the original reconstruction and its xGEM, including all intermediate points, from the decision boundary (we call this curve the ‘confidence manifold’). Thus, all samples originally labeled black-haired should have high confidence of being labeled as such, and the confidence should decrease as the sample crosses the decision boundary (and vice versa for blond-haired faces). Figure 4 shows the confidence manifolds for two samples (one in each column).
The top and bottom rows represent the manifolds obtained during training for model 1 ($f_1$) and model 2 ($f_2$) respectively. Sample 1 (column 1) is a face with black hair while Sample 2 (column 2) has blond hair. Legends show the distance of the reconstructions from the original sample along the gradient steps, followed by overall classifier performance. Additionally, we fit a logistic function to each curve and denote its steepness parameter by $\beta$. All plots have been shift-aligned using the midpoint of the fitted logistic. The confidence manifold for the same instance is fairly different across the two models. As expected, the overall steepness increases as the model trains to better discriminate samples. Intuitively, a higher $\beta$ suggests that the classifier can easily discriminate the label with high confidence. For instance, for comparable overall accuracies, the manifolds suggest that model 2 has trained a decision boundary such that a manifold-guided example is relatively close in image distance (compared to that of model 1). In the case of Sample 2, it is clear that model 2 initially mislabels the data point with high confidence, while eventually learning to predict the correct label. However, the decreasing distance of Sample 2 from the boundary as training progresses, for both models, suggests a significant shift of the decision boundary toward Sample 2. Qualitative images corresponding to these manifolds are shown in the Appendix.
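A logistic fit of the kind described above can be computed in closed form by linear regression on logits, since $\sigma(\beta(d - d_0))$ is linear in $d$ after the logit transform. The sketch below uses synthetic confidences; the function name `fit_logistic` and the closed-form approach are illustrative assumptions, not necessarily the fitting procedure used in the paper.

```python
import math

# Fit sigma(beta * (d - d0)) to a confidence manifold by linear
# regression on logits; beta summarizes steepness as in Sec. 5.3.
# All names here are illustrative.
def fit_logistic(ds, ps):
    """Return (beta, d0) fitting p = sigmoid(beta * (d - d0))."""
    ys = [math.log(p / (1 - p)) for p in ps]      # logit transform
    n = len(ds)
    mx, my = sum(ds) / n, sum(ys) / n
    beta = sum((x - mx) * (y - my) for x, y in zip(ds, ys)) / \
           sum((x - mx) ** 2 for x in ds)
    d0 = mx - my / beta
    return beta, d0

# Synthetic confidences generated from a known logistic curve:
true_beta, true_d0 = 2.0, 1.0
ds = [0.0, 0.5, 1.0, 1.5, 2.0]
ps = [1 / (1 + math.exp(-true_beta * (d - true_d0))) for d in ds]
beta, d0 = fit_logistic(ds, ps)
```

With noiseless logistic data the fit recovers the generating parameters exactly; with real confidence manifolds the recovered $\beta$ is the steepness statistic histogrammed in Figures 5(a) and 5(b).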
Figures 4(a) and 4(b) show 2-D histograms of the logistic-function parameter estimates, stratified by the target label and the attribute of interest (gender). This allows us to summarize the confidence manifolds across groups of interest for overall model comparison. For reference, Figure 6 shows the Reliability Diagram for both black boxes. The ResNet model generally demonstrates more uniform steepness across samples at different distances from the decision boundary compared to the CNN+lrn model. Both models have a relatively small $\beta$ for blond-haired males, suggesting lower confidence in their predictions. Thus, summarizing confidence manifolds provides additional insight for model comparison that may not be captured by Reliability Diagrams.
6 Discussion
This work presents a novel approach to characterizing and explaining black-box supervised models via examples. An unsupervised implicit generative model is used to approximate the data manifold, and subsequently to guide the generation of increasingly confounding examples from a given starting point. These examples are used to probe the target black box in several ways. In particular, we demonstrate the utility of manifold-guided examples in automatically detecting bias in black-box learning w.r.t. a (potentially protected) attribute, as well as for model comparison. The proposed method also allows one to visualize training progression and provides insights complementary to notions of calibration of the black-box model. Limitations of the proposed method include its reliance on the implicit generator as a proxy for the data manifold. However, we note that we do not rely on specific architectures and/or training mechanisms for the generative model. We used images as they are easy to visualize even in high dimensions; extending our studies to complex datasets beyond images is a compelling future direction.
References
 Aamodt and Plaza (1994) Agnar Aamodt and Enric Plaza. Casebased reasoning: Foundational issues, methodological variations, and system approaches. AI communications, 7(1):39–59, 1994.
 Adebayo et al. (2018) Julius Adebayo, Justin Gilmer, Ian Goodfellow, and Been Kim. Local explanation methods for deep neural networks lack sensitivity to parameter values. arXiv preprint, 2018.
 Angwin et al. (2016) J Angwin, J Larson, S Mattu, and L Kirchner. Machine bias: Risk assessments in criminal sentencing. ProPublica, https://www.propublica.org, 2016.
 Bach et al. (2015) Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.
 Bolukbasi et al. (2016) Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information Processing Systems, pages 4349–4357, 2016.
 Callahan and Shah (2017) Alison Callahan and Nigam H Shah. Chapter 19  machine learning in healthcare. In Aziz Sheikh, Kathrin M. Cresswell, Adam Wright, and David W. Bates, editors, Key Advances in Clinical Informatics, pages 279 – 291. Academic Press, 2017.
 Cook and Weisberg (1980) R Dennis Cook and Sanford Weisberg. Characterizations of an empirical influence function for detecting influential cases in regression. Technometrics, 1980.
 DeGroot and Fienberg (1983) Morris H DeGroot and Stephen E Fienberg. The comparison and evaluation of forecasters. The statistician, pages 12–22, 1983.
 Doersch (2016) Carl Doersch. Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908, 2016.
 Donahue et al. (2016) Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
 Doshi-Velez and Kim (2017) Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.
 Dumoulin et al. (2016) Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
 Elenberg et al. (2017) Ethan Elenberg, Alexandros G Dimakis, Moran Feldman, and Amin Karbasi. Streaming weak submodularity: Interpreting neural networks on the fly. In Advances in Neural Information Processing Systems 30. Curran Associates, Inc., 2017.
 Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
 Guo et al. (2017) Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. arXiv preprint arXiv:1706.04599, 2017.
 Gupta et al. (2016) Maya Gupta, Andrew Cotter, Jan Pfeifer, Konstantin Voevodski, Kevin Canini, Alexander Mangylov, Wojciech Moczydlowski, and Alexander Van Esbroeck. Monotonic calibrated interpolated lookup tables. Journal of Machine Learning Research, 2016.
 Hardt et al. (2016) Moritz Hardt, Eric Price, Nati Srebro, et al. Equality of opportunity in supervised learning. In Advances in neural information processing systems, pages 3315–3323, 2016.
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 Higgins et al. (2016) Irina Higgins, Loic Matthey, Xavier Glorot, Arka Pal, Benigno Uria, Charles Blundell, Shakir Mohamed, and Alexander Lerchner. Early visual concept learning with unsupervised deep learning. arXiv preprint arXiv:1606.05579, 2016.
 Hughes et al. (2016) Michael C Hughes, Huseyin Melih Elibol, Thomas McCoy, Roy Perlis, and Finale DoshiVelez. Supervised topic models for clinical interpretability. arXiv preprint arXiv:1612.01678, 2016.
 Karpathy et al. (2015) Andrej Karpathy, Justin Johnson, and Li Fei-Fei. Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078, 2015.
 Kim et al. (2016) Been Kim, Rajiv Khanna, and Oluwasanmi O Koyejo. Examples are not enough, learn to criticize! criticism for interpretability. In Advances in Neural Information Processing Systems, pages 2280–2288, 2016.
 Kindermans et al. (2017) P.J. Kindermans, S. Hooker, J. Adebayo, M. Alber, K. T. Schütt, S. Dähne, D. Erhan, and B. Kim. The (Un)reliability of saliency methods. NIPS workshop on Explaining and Visualizing Deep Learning, 2017.
 Kingma and Welling (2013) Diederik P Kingma and Max Welling. Autoencoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 Koh and Liang (2017) Pang Wei Koh and Percy Liang. Understanding blackbox predictions via influence functions. arXiv preprint arXiv:1703.04730, 2017.
 Li et al. (2015) Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Jurafsky. Visualizing and understanding neural models in nlp. arXiv preprint arXiv:1506.01066, 2015.
 Lipton (2016) Zachary C Lipton. The mythos of model interpretability. arXiv preprint arXiv:1606.03490, 2016.
 Liu et al. (2015) Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
 Miller (2017) Tim Miller. Explanation in artificial intelligence: Insights from the social sciences. arXiv preprint arXiv:1706.07269, 2017.
 Ribeiro et al. (2016) Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM, 2016.
 Selvaraju et al. (2016) Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. arXiv preprint arXiv:1610.02391, 2016.
 Shrikumar et al. (2016) Avanti Shrikumar, Peyton Greenside, Anna Shcherbina, and Anshul Kundaje. Not just a black box: Learning important features through propagating activation differences. arXiv preprint arXiv:1605.01713, 2016.
 Simonyan et al. (2013) Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
 Smilkov et al. (2017) Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Viégas, and Martin Wattenberg. Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.
 Stock and Cisse (2017) Pierre Stock and Moustapha Cisse. Convnets and imagenet beyond accuracy: Explanations, bias detection, adversarial examples and model criticism. arXiv preprint arXiv:1711.11443, 2017.
 Sundararajan et al. (2017) Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365, 2017.
Appendix
xGEMs for MNIST
Figure 7 shows manifold-guided examples generated for a (multiclass) softmax classifier on MNIST^{7}^{7}7http://yann.lecun.com/exdb/mnist/ digit data; each row traverses the manifold from a source digit toward a different target label. Notice how, while traversing the manifold in row 1, the classifier switches its decision through an intermediate digit before reaching the target label, even though the intermediate samples still resemble the source digit to the human eye, indicating a bias toward the intermediate prediction. Row 2 suggests a bias toward flipping the prediction after only minor smudging (visible to the human eye). Finally, the third row demonstrates which distortions of the source digit the classifier requires before assigning the target label. Thus manifold-guided examples can provide insight into the decision boundary of the classifier for each pair of digits.