Generating Contrastive Explanations with
Monotonic Attribute Functions
Explaining the decisions of deep neural networks is a hot research topic with applications in medical imaging, video surveillance, and self-driving cars. Many methods have been proposed in the literature to explain these decisions by identifying the relevance of different pixels. In this paper, we propose a method that can generate contrastive explanations for such data: we not only highlight aspects that are in themselves sufficient to justify the classification by the deep model, but also new aspects which, if added, would change the classification. One of our key contributions is how we define "addition" for such rich data in a formal yet humanly interpretable way that leads to meaningful results. This was one of the open questions laid out in , which proposed a general framework for creating (local) contrastive explanations for deep models. We showcase the efficacy of our approach on CelebA and Fashion-MNIST by creating intuitive explanations that are also quantitatively superior to those of other state-of-the-art interpretability methods.
1 Introduction

With the explosion of deep learning  and its huge impact on domains such as computer vision and speech, amongst others, many of these technologies are being implemented in systems that affect our daily lives. In many cases, a negative side effect of deploying these technologies has been their lack of transparency , which has raised concerns not just at an individual level  but also at an organization or government level .
There have been many methods proposed in the literature [14, 23, 19, 31, 1] that explain predictions of deep neural networks based on the relevance of different features, or pixels/superpixels for an image. Recently, an approach called the contrastive explanations method (CEM)  was proposed which highlights not just correlations or relevances but also features that are minimally sufficient to justify a classification, referred to as pertinent positives (PPs). CEM additionally outputs a minimal set of features, referred to as pertinent negatives (PNs), which when made non-zero or added, alter the classification and thus should remain absent in order for the original classification to prevail. For example, when justifying the classification of a handwritten image of a 3, the method will identify a subset of non-zero or on-pixels within the 3 which by themselves are sufficient for the image to be predicted as a 3 even if all other pixels are turned off (that is, made zero to match the background). Moreover, it will identify a minimal set of off-pixels which if turned on (viz. a horizontal line of pixels at the top right making the 3 look like a 5) will alter the classification. Such forms of explanations are not only common in day-to-day social interactions (viz. the twin without the scar) but are also heavily used in fields such as medicine and criminology , with arguments for PNs being the most important aspect of an explanation .
To identify PNs, addition is easy to define for grayscale images, where a pixel with a value of zero indicates no information and so increasing its value towards 1 indicates addition. However, for colored images with rich structure, it is not clear what a "no information" value for a pixel is, and consequently what one means by addition. Defining addition in a naive way, such as simply increasing the pixel or red-green-blue (RGB) channel intensities, can lead to uninterpretable images, as the relative structure may not be maintained and the added portion is not necessarily interpretable. Moreover, even for grayscale images, just increasing pixel values may not lead to humanly interpretable images, nor is there a guarantee that the added portion can be interpreted even if the overall image is realistic and lies on the data manifold.
In this paper, we overcome these limitations by defining “addition" in a novel way which leads to realistic images with the additions also being interpretable. To showcase the general applicability of our method to various settings, we first experiment with CelebA  where we apply our method to a data manifold learned using a generative adversarial network (GAN)  trained over the data and by building attribute classifiers for certain high-level concepts (viz. lipstick, hair color) in the dataset. We create realistic images with interpretable additions. Our second experiment is on Fashion-MNIST  where the data manifold is learned using a variational autoencoder (VAE)  and certain (interpretable) latent factors (as no attributes are available) are used to create realistic images with, again, interpretable additions. These two cases show that our method can be applied to colored as well as grayscale images and to datasets that may or may not have high level attributes.
2 Related Work
There have been many methods proposed in the literature that aim to explain the reasons for a model's decisions. These methods may be globally interpretable – rule/decision lists [34, 32] – or exemplar based – prototype exploration [12, 8] – or inspired by psychometrics , or interpretable generalizations of linear models . Moreover, there are also works that try to formalize interpretability .
A survey by  mainly explores two types of methods for explaining decisions of neural networks: i) prototype selection methods [21, 22] that produce a prototype for a given class, and ii) local explanation methods that highlight relevant input features/pixels [1, 14, 25, 28]. Within this second type, there are multiple local explanation methods that generate explanations for images [29, 31, 23] and others for NLP applications . There are also works  that highlight higher-level concepts present in images based on examples of the concept provided by the user. These methods mostly focus on features that are present, although they may highlight features contributing negatively to the final classification. In particular, they do not identify concepts or features that are minimally sufficient to justify the classification, or those that should be necessarily absent to maintain the original classification. There are also evaluation methods that perturb the input and remove features  to verify their importance, but these methods can only evaluate an explanation, not find one.
Recently, there have been works that look beyond relevance. In , the authors try to find features that, with almost certainty, indicate a particular class. These can be seen as global indicators for a particular class. Of course, these may not always exist for a dataset. There are also works  that try to find stable insight that can be conveyed to the user in an (asymmetric) binary setting for medium-sized neural networks. The most relevant work to our current endeavor is , and, as mentioned before, it cannot be directly applied when it comes to explaining colored images or images with rich structure.
3 Methodology

We now describe the methodology for generating contrastive explanations for images. We first describe how to identify PNs, which involves the key contribution of defining "addition" for colored images in a meaningful way. We then describe how to identify PPs, which also utilizes this notion of adding attributes. Finally, we provide the algorithmic details for solving the optimization problems in Algorithm 1.
We first introduce some notation. Let $\mathcal{X}$ denote the feasible input space, with $x_0 \in \mathcal{X}$ being an example and $t_0$ the predicted label obtained from a neural network classifier. Let $\{s_1, \ldots, s_k\}$ denote the set of superpixels that partition $x_0$, with $\mathcal{M}$ denoting a set of binary masks which, when applied to $x_0$, produce images by selecting the corresponding superpixels from $x_0$. Let $m_x \in \mathcal{M}$ denote the mask corresponding to image $x$.
If $\mathcal{G}$ denotes the data manifold (based on a GAN or VAE), then let $\mathcal{Z}$ denote the latent space, with $z_0 \in \mathcal{Z}$ denoting the latent representation corresponding to input $x_0$ such that $\mathcal{G}(z_0) \approx x_0$. Let $d$ denote the number of (available or learned) interpretable features (latent or otherwise) which represent meaningful concepts (viz. moustache, glasses, smile), and let $g_1, \ldots, g_d$ be corresponding functions acting on these features, with higher values indicating the presence of a certain visual concept and lower values indicating its absence. For example, CelebA has different high-level (interpretable) features for each image, such as whether the person has black hair or high cheekbones. In this case, we could build binary classifiers for each of the features, where a 1 would indicate the presence of black hair or high cheekbones, while a 0 would mean its absence. These classifiers would be the functions $g_i$. On the other hand, for datasets with no high-level interpretable features, we could find latents by learning disentangled representations and choose those latents (with ranges) that are interpretable. Here the functions $g_i$ would be an identity map or negative identity map depending on which direction adds a certain concept (viz. turning a sleeveless shirt into a long-sleeved one). We note that these attribute functions could be used as latent features for the generator in a causal graph (e.g., ), or, given a causal graph for the desired attributes, we could learn these functions from the architecture in .
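As a concrete illustration, the two kinds of attribute function can be sketched as thin wrappers: per-attribute classifiers for datasets like CelebA, and signed identity maps on single latent coordinates for learned disentangled representations. This is a minimal sketch with hypothetical names; `clf` is assumed to expose a `predict_proba` method, and the choice of latent index and sign is illustrative only.

```python
import numpy as np

def make_classifier_attribute(clf):
    """Wrap a binary attribute classifier (e.g. for 'wearing lipstick') so
    that a higher output indicates the concept is more present. `clf` is a
    hypothetical object exposing predict_proba(x) -> P(attribute present)."""
    return lambda x: clf.predict_proba(x)

def make_latent_attribute(index, sign=1.0):
    """For learned disentangled latents, the attribute function is an
    identity (or negated identity) map on one latent coordinate; the sign
    is chosen so that increasing the value adds the concept."""
    return lambda z: sign * z[index]

# Hypothetical example: suppose latent coordinate 2 controls sleeve length,
# but decreasing it lengthens the sleeve, so the sign is flipped.
g_sleeve = make_latent_attribute(index=2, sign=-1.0)
z = np.array([0.1, -0.3, 0.8, 0.0])
print(g_sleeve(z))  # -0.8
```

The same calling convention for both cases is what lets the downstream PN and PP objectives treat available and learned attributes uniformly.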
Our procedure for finding PNs and PPs involves solving an optimization problem whose solution determines the outputted image. We denote the prediction of the model on an example $x$ by $\mathrm{Pred}(x)$, where $\mathrm{Pred}(\cdot)$ is any function that outputs a vector of prediction scores for all classes, such as prediction probabilities or logits (unnormalized probabilities) that are widely used in neural networks.
3.1 Pertinent Negatives (PNs)
To find PNs, we want to create a (realistic) image that lies in a different class than the original image but where we can claim that we have (minimally) “added" things to the original image without deleting anything to obtain this new image. If we are able to do this, we can say that the things that were added, which we call PNs, should be necessarily absent from the original image in order for its classification to remain unchanged.
The question is how to define "addition" for colored images or, more generally, images with rich structure. In , the authors tested on grayscale images, where intuitively it is easy to define addition as increasing the pixel values towards 1. This, however, does not generalize to images with multiple channels (viz. RGB) and inherent structure that leads to realistic images, where simply moving away from a certain pixel value may produce unrealistic images. Moreover, addition defined in this manner may be completely uninterpretable. This is true even for grayscale images where, while the final image may be realistic, the addition may not be. The other big issue is that for colored images the background, which carries no signal, may be any color; object pixels whose values in the original image are lower than the background would, if increased towards the background value, render the object imperceptible in the new image, even though the claim would be that we have "added" something. Such counterintuitive cases arise for complex images if we maintain their definition of addition.
Given these issues, we define addition in a novel manner. To define addition we assume that we have high-level interpretable features available for the dataset. Multiple public datasets [17, 38] have high-level interpretable features, while for others such features can be learned using unsupervised methods such as disentangled representations  or supervised methods where one learns concepts through labeled examples . Given such features, we define the functions $g_1, \ldots, g_d$ as before, where for each of these functions an increasing value indicates the addition of a concept. Using these functions, we define addition as introducing more concepts into an image without deleting any existing concepts. Formally, this corresponds to never decreasing the $g_i$ from their original values based on the input image, but rather increasing them. However, we also want a minimal number of additions for our explanation to be crisp, and so we encourage as few $g_i$ as possible to increase in value (within their allowed ranges) such that the final image lies in a different class. We also want the final image to be realistic, and that is why we learn a manifold on which we perturb the image, as we want our final image to also lie on it after the necessary additions.
This gives rise to the following optimization problem:

$$\min_{z \in \mathcal{Z}} \;\; \sum_{i=1}^{d} \max\bigl(g_i(x_0) - g_i(\mathcal{G}(z)),\, 0\bigr) \;+\; \gamma \sum_{i=1}^{d} \bigl|\, g_i(\mathcal{G}(z)) - g_i(x_0) \,\bigr| \;+\; c \cdot \max\Bigl\{ [\mathrm{Pred}(\mathcal{G}(z))]_{t_0} - \max_{t \neq t_0} [\mathrm{Pred}(\mathcal{G}(z))]_{t},\, -\kappa \Bigr\} \;+\; \beta \,\| \mathcal{G}(z) - x_0 \|_2^2 \;+\; \eta \,\| z - z_0 \|_2^2$$

The first two terms in the objective function here are the novelty for PNs. The first term encourages the addition of attributes, in that we want the $g_i$'s for the final image to be no less than their original values. The second term encourages minimal addition of interpretable attributes. The third term is the PN loss from  and encourages the modified example $\mathcal{G}(z)$ to be predicted as a different class than $t_0$, where $[\mathrm{Pred}(\cdot)]_t$ is the $t$-th class prediction score. The hinge-like loss function pushes the modified example to lie in a different class than $x_0$. The parameter $\kappa$ is a confidence parameter that controls the separation between $[\mathrm{Pred}(\mathcal{G}(z))]_{t_0}$ and $\max_{t \neq t_0} [\mathrm{Pred}(\mathcal{G}(z))]_t$. The fourth ($\beta$) and fifth ($\eta$) terms encourage the final image to be close to the original image in the input and latent spaces, respectively. In practice, one could have a threshold for each of the $g_i$, where only an increase in value beyond that threshold would imply a meaningful addition. The advantage of defining addition in this manner is that not only are the final images interpretable, but so are the additions, and we can clearly elucidate which attributes (concepts) should be necessarily absent to maintain the original classification.
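The PN objective described above can be sketched in a few lines of numpy. This is a minimal illustration, not the full implementation: the generator `G`, the attribute functions `gs`, the scoring function `pred`, and the coefficient names are our stand-ins.

```python
import numpy as np

def pn_objective(z, z0, x0, G, gs, pred, t0, c=1.0, kappa=0.1,
                 beta=0.0, eta=0.0, lam=1.0):
    """Pertinent-negative objective: forbid attribute deletion, encourage
    few additions, push the prediction away from t0, and stay close to the
    original image in both input and latent space."""
    x = G(z)
    g_new = np.array([g(x) for g in gs])
    g_old = np.array([g(x0) for g in gs])
    no_deletion = np.sum(np.maximum(g_old - g_new, 0.0))  # g_i must not decrease
    sparsity = lam * np.sum(np.abs(g_new - g_old))        # few concept additions
    scores = pred(x)
    # hinge-like PN loss: original class score should fall below the others
    pn_loss = c * max(scores[t0] - np.max(np.delete(scores, t0)), -kappa)
    proximity = beta * np.sum((x - x0) ** 2) + eta * np.sum((z - z0) ** 2)
    return no_deletion + sparsity + pn_loss + proximity

# Toy check with an identity "generator" and linear scores.
val = pn_objective(z=np.array([1.0, 2.0]), z0=np.zeros(2), x0=np.zeros(2),
                   G=lambda z: z, gs=[lambda x: x[0]],
                   pred=lambda x: x, t0=0)
print(round(val, 3))  # 0.9
```

In the toy run the attribute increase contributes 1.0 to the sparsity term while the PN hinge is already saturated at its floor of -0.1, illustrating how the two forces trade off.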
3.2 Pertinent Positives (PPs)
To find PPs, we want to highlight a minimal set of important pixels or superpixels which by themselves are sufficient for the classifier to output the same class as the original example. More formally, for an example image $x_0$, our goal is to find an image $m \circ x_0$ for some mask $m \in \mathcal{M}$ such that $\arg\max_t [\mathrm{Pred}(m \circ x_0)]_t = t_0$ (i.e. the same prediction), with $m \circ x_0$ containing as few superpixels and interpretable concepts from the original image as possible. This leads to the following optimization problem:

$$\min_{m \in \mathcal{M}} \;\; \gamma \sum_{i=1}^{d} g_i(m \circ x_0) \;+\; c \cdot \max\Bigl\{ \max_{t \neq t_0} [\mathrm{Pred}(m \circ x_0)]_{t} - [\mathrm{Pred}(m \circ x_0)]_{t_0},\, -\kappa \Bigr\} \;+\; \beta \,\| m \|_1$$

The first term in the objective function here is the novelty for PPs and penalizes the addition of attributes, since we seek a sparse explanation. The second term is the PP loss from  and is minimized when $[\mathrm{Pred}(m \circ x_0)]_{t_0}$ is greater than $\max_{t \neq t_0} [\mathrm{Pred}(m \circ x_0)]_t$ by at least $\kappa$, which is a margin/confidence parameter. Parameters $\gamma$ and $\beta$ are the associated regularization coefficients.
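The PP objective admits an equally short sketch. Again this is a toy illustration under our own naming conventions: `m * x0` stands in for applying the superpixel mask, and `gs`/`pred` are assumed attribute and scoring functions.

```python
import numpy as np

def pp_objective(m, x0, gs, pred, t0, c=1.0, kappa=0.5, gamma=1.0, beta=1.0):
    """Pertinent-positive objective over a (relaxed) superpixel mask m:
    retain few concepts and few superpixels while preserving prediction t0."""
    x = m * x0                                   # masked image
    concepts = gamma * sum(g(x) for g in gs)     # penalize retained attributes
    scores = pred(x)
    # hinge-like PP loss: class t0 should beat the runner-up by margin kappa
    pp_loss = c * max(np.max(np.delete(scores, t0)) - scores[t0], -kappa)
    l1 = beta * np.sum(np.abs(m))                # sparse superpixel selection
    return concepts + pp_loss + l1

# Toy check: keeping only the first "superpixel" preserves class 0.
val = pp_objective(m=np.array([1.0, 0.0]), x0=np.array([2.0, 3.0]),
                   gs=[lambda x: x[0]], pred=lambda x: x, t0=0)
print(round(val, 3))  # 2.5
```

Note the sign flip relative to the PN hinge: here the loss is minimized when the original class score dominates, rather than when it is overtaken.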
In the above formulation, we optimize over superpixels which of course subsumes the case of just using pixels. Superpixels have been used in prior works  to provide more interpretable results on image datasets and we allow for this more general option.
3.3 Optimization Details
To solve for PNs as formulated in Section 3.1, we note that the regularization term penalizes a non-identity and complicated function of the optimization variable involving the data manifold $\mathcal{G}$, so proximal methods are not applicable. Instead, we use 1000 iterations of standard subgradient descent. We find a PN by setting it to be the iterate having the smallest distance to the latent code $z_0$ of $x_0$, among all iterates whose solution attains the required prediction margin $\kappa$.
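The PN search can be sketched as a plain subgradient loop that keeps the qualifying iterate closest to the original latent code. This is a simplified stand-in: a toy objective and central finite differences replace the network's actual subgradients, and the step size, iteration count, and names are ours.

```python
import numpy as np

def solve_pn(objective, qualifies, z0, lr=0.05, iters=200, eps=1e-4):
    """Subgradient descent from z0, returning the qualifying iterate (one
    whose prediction margin is satisfied) closest to the latent code z0."""
    z, best, best_dist = z0.copy(), None, np.inf
    for _ in range(iters):
        # central finite differences stand in for network subgradients here
        grad = np.array([(objective(z + eps * e) - objective(z - eps * e))
                         / (2 * eps) for e in np.eye(len(z))])
        z = z - lr * grad
        if qualifies(z):
            dist = np.linalg.norm(z - z0)
            if dist < best_dist:
                best, best_dist = z.copy(), dist
    return best

# Toy run: minimize (z - 2)^2 and accept iterates with z > 1; the accepted
# iterate closest to z0 = 0 sits just past the threshold.
best = solve_pn(lambda z: float((z[0] - 2.0) ** 2),
                lambda z: z[0] > 1.0, np.zeros(1))
print(best)  # approximately [1.043]
```

Tracking the closest qualifying iterate, rather than the final one, is what keeps the PN a minimal departure from the original image.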
To solve for PPs as formulated in Section 3.2, we first relax the binary mask on superpixels to be real-valued (each entry is between 0 and 1) and then apply the standard iterative soft-thresholding algorithm (ISTA) (see  for various references) that efficiently solves optimization problems with $\ell_1$ regularization. We run 100 iterations of ISTA in our experiments and obtain a solution $m$ that has the smallest $\ell_1$ norm and satisfies the prediction constraint of being within margin $\kappa$ of class $t_0$. We then rank the entries in $m$ according to their values in descending order and subsequently add ranked superpixels until the masked image predicts $t_0$.
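An ISTA step alternates a gradient step on the smooth part of the loss with the soft-thresholding (proximal) operator of the $\ell_1$ term. The sketch below uses a quadratic toy loss in place of the classifier terms; the projection onto $[0, 1]$ reflects the relaxed mask, and all parameter values are illustrative.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the l1 norm: shrink each entry toward 0 by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(grad_smooth, m0, step=0.1, l1=0.05, iters=100):
    """ISTA on a relaxed superpixel mask: gradient step on the smooth loss,
    l1 shrinkage, then projection back onto [0, 1]."""
    m = m0.copy()
    for _ in range(iters):
        m = soft_threshold(m - step * grad_smooth(m), step * l1)
        m = np.clip(m, 0.0, 1.0)
    return m

# Toy smooth loss 0.5 * ||m - target||^2: the near-zero coordinate is
# shrunk exactly to 0 while the other settles just below its target.
target = np.array([1.0, 0.02])
m = ista(lambda m: m - target, np.full(2, 0.5))
print(np.round(m, 3))  # ~[0.95, 0.0]
```

The shrinkage drives weakly supported mask entries exactly to zero, which is why ISTA yields the sparse superpixel selections reported in the experiments.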
A discussion of hyperparameter selection is deferred to the Supplement.
4 Experiments

We next illustrate the usefulness of CEM-MAF on three image datasets: CelebA  and two subsets of Fashion-MNIST . These datasets cover the gamut of color versus grayscale images and of known high-level features versus derived disentangled features. CEM-MAF handles each scenario and offers explanations that are understandable by humans. These experiments exemplify the following observations:
PNs offer intuitive explanations given a set of interpretable monotonic attribute functions. In fact, they seem to be the preferred form of explanation in many cases as describing a decision in isolation (viz. why is a shirt a shirt) using PPs, relevance, or heatmaps is not always informative.
For colored images, PPs offer better direction as to what is important for the classification, versus too much direction from LIME (which shows too many features) or too little direction from Grad-CAM (which only focuses on smiles), while for grayscale images, neither PPs, LIME, nor Grad-CAM are particularly informative compared with PNs.
PPs and PNs offer the guarantee of being 100% accurate in maintaining or changing the class respectively as seen in Table 1 versus LIME or Grad-CAM.
Both proximity in the input and latent space along with sparsity in the additions play an important role in generating good quality contrastive explanations.
4.1 CelebA with Available High-Level Features
CelebA is a large-scale dataset of celebrity faces annotated with 40 attributes .
The CelebA experiments explain an 8-class classifier learned from the following binary attributes: Young/Old, Male/Female, Smiling/Not Smiling. We train a ResNet50  architecture to classify the original CelebA images. We selected the following 11 attributes as our attribute functions based on previous studies  as well as on what might be relevant for our class labels: High Cheekbones, Narrow Eyes, Oval Face, Bags Under Eyes, Heavy Makeup, Wearing Lipstick, Bangs, Gray Hair, Brown Hair, Black Hair, and Blonde Hair. Note that this list does not include the attributes that define the classes, because an explanation for someone who is smiling which simply says they are smiling would not be useful. Note that the usefulness of CEM-MAF is directly a function of the accuracies of the attribute functions. See the Supplement for details on training these attribute functions and the GAN used for generation.
Results on five images are exhibited in Figure 1 using a segmentation into 200 superpixels (more examples are in the Supplement). The first two rows show the original class prediction followed by the original image. The next two rows show the pertinent negative's class prediction and the pertinent negative image. The next row lists the attributes that were modified in the original, i.e., the reasons why the original image is not classified as being in the class of the pertinent negative. The next row shows the pertinent positive image, which, combined with the PN, gives the complete explanation. The final two rows illustrate different explanations that can be compared with the PP: one derived from locally interpretable model-agnostic explanations (LIME) , followed by a gradient-based localized explanation designed for CNN models (Grad-CAM) .
First consider the class explanations given by the PPs. Age seems to be captured by patches of skin, sex by patches of hair, and smiling by the presence or absence of the mouth. One might consider the patches of skin to be used to explain young versus old. PPs capture a part of the smile for those smiling, while leaving out the mouth for those not smiling. Visually, these explanations are simple (very few selected features) and quite useful, although they require human analysis. In comparison, LIME selects many more features that are relevant to the prediction and, while also useful, requires even more human intervention to explain the classifier. Grad-CAM seems to always focus on the mouth (Grad-CAM is more useful for discriminative tasks) and does not always find a part of the image that is positively relevant to the prediction.
A performance comparison of PPs between CEM-MAF, LIME, and Grad-CAM is given in Table 1. Across colored images, CEM-MAF finds a much sparser subset of superpixels than LIME and is guaranteed to have the same prediction as the original image. Both LIME and Grad-CAM select features for visual explanations that often have the wrong prediction (low PP accuracy). A third measure, PP Correlation, measures the benefit of each additional feature by ranking the prediction scores after each feature is added (confidence in the prediction should increase) and correlating with the expected ranks (perfect correlation here would give -1). Order for LIME was determined by classifier weights while order for Grad-CAM was determined by colors in the corresponding heatmaps. CEM-MAF is in general best at selecting features that increase confidence.
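The PP Correlation measure just described can be sketched as a rank correlation between the feature-addition order and the descending rank of the resulting prediction scores. This is our own minimal reading of the metric, implemented with a Spearman-style correlation; function names are ours.

```python
import numpy as np

def spearman(a, b):
    """Spearman correlation as Pearson correlation of the ranks."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra, rb = ra - ra.mean(), rb - rb.mean()
    return float(ra @ rb / np.sqrt((ra @ ra) * (rb @ rb)))

def pp_correlation(scores_after_each_addition):
    """Correlate the descending rank of each prediction score with the
    order in which features were added; a trajectory whose confidence
    rises with every addition yields the ideal value of -1."""
    s = np.asarray(scores_after_each_addition, dtype=float)
    descending_rank = np.argsort(np.argsort(-s))  # 0 = highest score
    return spearman(descending_rank, np.arange(len(s)))

# Monotonically increasing confidence gives the perfect score of -1.
print(pp_correlation([0.1, 0.4, 0.7, 0.9]))  # -1.0
```

A method whose later additions actually lower the prediction score will pull the correlation toward +1, which is what the table's comparison penalizes.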
More intuitive explanations are offered by the pertinent negatives in Figure 1. The first PN changes a young, smiling male into an old, smiling male by adding gray hair, so we can explain young as not having gray hair. While the gray hair is not visually apparent, the classifier has picked up on it. We also note that while his facial hair was removed by the PN, this cannot be part of our explanation because facial hair is not one of our attributes. Another way to explain being young, in the second column, is the absence of an oval face. The third PN changes a female into a male, and the female is explained by the absence of a single hair color (in fact, she has black hair with brown highlights) and the presence of bangs. While the presence of bangs is intuitive, it is selected because our constraint of adding features to form PNs can be violated, as the constraints are enforced via regularization. The last two columns explain a straight face (not smiling), which is given by the absence of high cheekbones or the absence of an oval face (since one's face can become more oval when the cheekbones are raised).
4.2 Fashion-MNIST with Learned Disentangled Features
Fashion-MNIST is a large-scale image dataset of various fashion items (e.g., coats, trousers, sandals).
Two datasets are created as subsets of Fashion-MNIST, one for clothes (tee-shirts, trousers, pullovers, dresses, coats, and shirts) and one for shoes (sandals, sneakers, and ankle boots). As with CelebA, we need to generate new images for PNs that are realistic, and thus a VAE was trained to learn the Fashion-MNIST manifold. However, this dataset does not have annotated features as CelebA does, so we learn the features using disentanglement, following a recent variant of the VAE called DIP-VAE  (see the Supplement for more details).
We can then use these disentangled latent codes/features in lieu of the ground truth attributes. Based on what might be relevant to the clothes and shoes classes, we use four dimensions from the latent code as the attributes, corresponding to sleeve length, shoe mass, heel length and waist size. Given these attributes, we learn two classifiers, one with six classes (clothes) and one with three classes (shoes). See the Supplement for a visualization of the attributes and details about the classifiers.
Results on five clothes images and five shoe images are shown in Figure 2 (a) and (b), respectively. In order to make a fair comparison with CEM, we do not segment the images but do feature selection pixel-wise. Note that CEM does not do pixel selection for PPs but rather modifies pixel values.
Let us first look at the PPs. They are mostly sparse explanations and do not give any visually intuitive explanations. Both LIME and Grad-CAM, by selecting many more relevant pixels, offer much more visually appealing explanations. However, these explanations simply imply, for example, that the shirt in the first row of Figure 2 (a) is a shirt because it looks like a shirt. These two datasets (clothes and shoes) are examples where the features that are present do not offer an intuitive explanation. Table 1 again shows that CEM-MAF selects far fewer features, but here LIME does better at selecting useful features (PP Correlation close to -1 for clothes). Additionally, LIME and Grad-CAM both have high PP Accuracy, but that is due to selecting many more features.
Rather, contrastive explanations (relative to other items) about what is absent lead to the most intuitive explanations for these images. That same shirt is a shirt because it does not have wide sleeves and is not wide around the waist. The items in the second row of Figure 2 (a) are classified as trousers because the connection between the legs is absent, and in the fifth row, the item is classified as a shirt because a solid color (more likely on a coat) is absent (i.e., there is a pattern). In the first row of Figure 2 (b), the item is a sandal because it is not thick (i.e., it is an open shoe), while the second item is a sneaker (rather than an ankle boot) because it is missing a heel. The other rows in Figure 2 demonstrate similar explanations.
4.3 Value of Regularization in PNs
Two key regularizations in the pertinent negative optimization problem in Section 3.1 ensure that the latent representations of the original image and its PN remain close and that only a few of the attribute functions exhibit changes. One might ask whether both regularizations are necessary; in fact, the framework does allow for including either regularization separately or jointly by setting appropriate penalty values.
Figure 3 illustrates the usefulness of these regularizations on an image of a young, smiling, male (left image). Regularizing only the latent representation proximity results in a PN that is a young, smiling, female (middle image), from which we can explain that the original image is a male because of the absence of makeup, bangs, and brown hair. The brown hair makes sense because the original image is not defined by any hair color, while female hair color is often easier to detect because females color their hair more often resulting in stronger color [4, 30]. Interestingly, three attributes were used to explain the female PN. However, adding sparsity to the number of selected attributes results in the not smiling PN (right image) obtained by solely modifying the cheekbone attribute. The lesson is that both forms of regularization add something to the explanation. Proximity of latent representations keeps the image visually similar (a close inspection shows the male and female having similar facial structure, smile, nose, etc.), while sparsity of modifications can be used to keep the explanation simpler resulting in minimal additions.
5 Discussion

In the previous sections, we produced contrastive explanations by learning a data manifold. It is important to note that we do not necessarily need such a global data manifold to create our explanations; rather, a good local data representation around the image in question should be sufficient. Thus, for datasets where building an accurate global representation may be hard, if one can build a good local representation (viz. using conditional GANs), then it should still be possible to generate high-quality contrastive explanations.
In this paper, we leveraged high-level features that were readily available (viz. CelebA) as well as generated ones (viz. Fashion-MNIST) based on unsupervised methods to produce contrastive explanations. As mentioned before, one could also learn interpretable features in a supervised manner as done in , which we could also use to generate such explanations.
In summary, we have proposed a method to create contrastive explanations for image data with rich structure. The key novelty over previous work has been the way in which we define addition, which leads to realistic images where the added information is also easy to interpret (viz. added makeup). Our results also showcase that pertinent negatives might be the preferred form of explanation when it is not clear why a certain entity is what it is in isolation (e.g. based on relevance or pertinent positives), but this can be explained much more crisply by contrasting it to another entity that closely resembles it (viz. adding a heel to a sneaker makes it look like an ankle boot).
References

-  Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.
-  Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183–202, 2009.
-  Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, pages 1721–1730, New York, NY, USA, 2015. ACM.
-  Kavita Daswani. More men coloring their hair. LA Times, 2012.
-  Amit Dhurandhar, Pin-Yu Chen, Ronny Luss, Chun-Chen Tu, Paishun Ting, Karthikeyan Shanmugam, and Payel Das. Explanations based on the missing: Towards contrastive explanations with pertinent negatives. In Advances in Neural Information Processing Systems, 2018.
-  Amit Dhurandhar, Vijay Iyengar, Ronny Luss, and Karthikeyan Shanmugam. Tip: Typifying the interpretability of procedures. arXiv preprint arXiv:1706.02952, 2017.
-  I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
-  Karthik Gurumoorthy, Amit Dhurandhar, and Guillermo Cecchi. Protodash: Fast interpretable prototype selection. arXiv preprint arXiv:1707.01212, 2017.
-  Tsuyoshi Idé and Amit Dhurandhar. Supervised item response models for informative prediction. Knowl. Inf. Syst., 51(1):235–257, April 2017.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
-  Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. ICLR, 2018.
-  Been Kim, Rajiv Khanna, and Oluwasanmi Koyejo. Examples are not enough, learn to criticize! Criticism for interpretability. In Advances in Neural Information Processing Systems, 2016.
-  Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, and Rory Sayres. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors. Intl. Conf. on Machine Learning, 2018.
-  Pieter-Jan Kindermans, Kristof T. Schütt, Maximilian Alber, Klaus-Robert Müller, Dumitru Erhan, Been Kim, and Sven Dähne. Learning how to explain neural networks: Patternnet and patternattribution. In Intl. Conference on Learning Representations (ICLR), 2018.
-  Abhishek Kumar, Prasanna Sattigeri, and Avinash Balakrishnan. Variational inference of disentangled latent concepts from unlabeled observations. Intl. Conf. on Learning Representations, 2017.
-  Tao Lei, Regina Barzilay, and Tommi Jaakkola. Rationalizing neural predictions. arXiv preprint arXiv:1606.04155, 2016.
-  Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
-  Tim Miller. Contrastive explanation: A structural-model approach. CoRR, abs/1811.03163, 2018.
-  Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller. Methods for interpreting and understanding deep neural networks. Digital Signal Processing, 2017.
-  Murat Kocaoglu, Christopher Snyder, Alexandros G. Dimakis, and Sriram Vishwanath. CausalGAN: Learning causal implicit generative models with adversarial training. In International Conference on Learning Representations (ICLR), 2018.
-  Anh Nguyen, Alexey Dosovitskiy, Jason Yosinski, Thomas Brox, and Jeff Clune. Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. In Advances in Neural Information Processing Systems, pages 3387–3395, 2016.
-  Anh Nguyen, Jason Yosinski, and Jeff Clune. Multifaceted feature visualization: Uncovering the different types of features learned by each neuron in deep neural networks. arXiv preprint arXiv:1602.03616, 2016.
-  Jose Oramas, Kaili Wang, and Tinne Tuytelaars. Visual explanation by interpretation: Improving visual feedback capabilities of deep neural networks. arXiv preprint arXiv:1712.06302, 2017.
-  Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), October 2017.
-  Marco Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?" Explaining the predictions of any classifier. In ACM SIGKDD Intl. Conference on Knowledge Discovery and Data Mining, 2016.
-  Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Anchors: High-precision model-agnostic explanations. In AAAI Conference on Artificial Intelligence (AAAI), 2018.
-  Wojciech Samek, Alexander Binder, Grégoire Montavon, Sebastian Lapuschkin, and Klaus-Robert Müller. Evaluating the visualization of what a deep neural network has learned. In IEEE Transactions on Neural Networks and Learning Systems, 2017.
-  Scott Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, 2017.
-  Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. arXiv preprint arXiv:1610.02391, 2016.
-  SFR. 75% of women now color their hair compared to 7% in 1950. South Florida Reporter, 2017.
-  Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR, abs/1312.6034, 2013.
-  Guolong Su, Dennis Wei, Kush Varshney, and Dmitry Malioutov. Interpretable two-level boolean rule learning for classification. arXiv preprint arXiv:1606.05798, 2016.
-  Kush Varshney. Engineering safety in machine learning. arXiv preprint arXiv:1601.04126, 2016.
-  Fulton Wang and Cynthia Rudin. Falling rule lists. In AISTATS, 2015.
-  Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. CoRR, abs/1708.07747, 2017.
-  Philip N. Yannella and Odia Kagan. Analysis: Article 29 working party guidelines on automated decision making under gdpr. 2018. https://www.cyberadviserblog.com/2018/01/analysis-article-29-working-party-guidelines-on-automated-decision-making-under-gdpr/.
-  Xin Zhang, Armando Solar-Lezama, and Rishabh Singh. Interpreting neural network judgments via minimal, stable, and symbolic corrections. arXiv preprint arXiv:1802.07384, 2018.
-  Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. DeepFashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
Note that all model training for the experiments was done using TensorFlow and Keras.
Appendix A Hyperparameter selection for PN and PP
Finding PNs is done by solving (3.1), which has several hyperparameters. The confidence parameter is the user's choice; we experimented with a range of values and report results with the chosen setting. For the attribute-monotonicity weight, we likewise report results with the setting that better enforces the constraint of only adding attributes to a PN. Three of the hyperparameters were held fixed: the chosen sparsity weight yielded sufficiently sparse attributes, though further experiments increasing it could be done to allow more attributes if desired, and the remaining two produced results that were deemed realistic, so no further tuning was needed. Note that the experiments in Section 4.3 required removing the attribute sparsity regularization by adjusting its weight. The last hyperparameter, the loss weight, was selected via the following search: start with an initial value, multiply it by a fixed factor if no PN is found after 1000 iterations of subgradient descent, and divide it by 2 if a PN is found; then run the next 1000 iterations and update again. This search was repeated 9 times, with all other hyperparameters fixed.
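The search on the loss weight can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper `find_pn` and the increase factor of 10 are hypothetical stand-ins, since the text does not fix the multiplicative factor.

```python
def search_c(find_pn, c_init=1.0, rounds=9, iters=1000,
             inc=10.0, dec=2.0):
    """Adaptive search over the loss weight c (sketch).

    `find_pn(c, iters)` is assumed to run `iters` subgradient-descent
    steps with weight c and return a PN or None. After each round,
    c is increased when no PN was found and decreased otherwise.
    The factor `inc=10.0` is a placeholder assumption.
    """
    c, best = c_init, None
    for _ in range(rounds):
        pn = find_pn(c, iters)
        if pn is None:
            c *= inc  # no PN found: weight the attack loss more heavily
        else:
            best = pn
            c /= dec  # PN found: back off toward sparser perturbations
    return best
```

Decreasing c after a success biases later rounds toward sparser (more interpretable) PNs while keeping the last valid one found.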
Finding PPs is done by solving (2), which has the analogous hyperparameters. Again, we experimented with a range of values for the confidence parameter and report results with the chosen setting, and we use the same monotonicity setting as for PNs for the same reason. The sparsity weight used for PNs was too strong here and did not find PPs at such high sparsity (usually allowing no selection), so a weaker value was used. The same search on the loss weight described for PNs above was performed for PPs, except that iterations of ISTA were run instead of subgradient descent, with all other hyperparameters fixed, to learn a single PP.
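Since the PP search runs ISTA, its core update can be illustrated with a minimal sketch: a gradient step on the smooth part of the loss followed by soft-thresholding. The function names, step size, and sparsity weight `beta` here are illustrative placeholders, not the paper's exact settings.

```python
import numpy as np

def soft_threshold(z, thresh):
    """Proximal operator of the l1 norm, applied after each gradient step."""
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)

def ista_step(x, grad_fn, step, beta):
    """One ISTA iteration: descend on the smooth loss via grad_fn,
    then soft-threshold to enforce sparsity with weight beta."""
    return soft_threshold(x - step * grad_fn(x), step * beta)
```

The soft-thresholding is what drives most coordinates of the perturbation exactly to zero, yielding the sparse PPs described above.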
Appendix B Additional CelebA Information
We here discuss how attribute classifiers were trained for CelebA, describe the GAN used for generation, and provide additional examples of CEM-MAF. CelebA datasets are available at http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html.
B.1 Training attribute classifiers for CelebA
For each of the 11 selected binary attributes, a CNN with four convolutional layers followed by a single dense layer was trained on 10,000 CelebA images using TensorFlow's SGD optimizer with Nesterov momentum (learning rate = 0.001, decay = 1e-6, momentum = 0.9) for 250 epochs. Accuracies of the classifiers are given in Table 2.
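A minimal Keras sketch of such an attribute classifier follows. The filter counts, kernel sizes, and 64×64 input shape are assumptions (the text fixes only the layer count and optimizer settings), and the legacy `decay=1e-6` argument is omitted because current `tf.keras` SGD no longer accepts it.

```python
from tensorflow import keras

def build_attribute_classifier(input_shape=(64, 64, 3)):
    """Sketch of a binary attribute classifier: four conv layers
    followed by a single dense output, per the description above.
    Filter counts, kernel sizes, and input shape are assumptions."""
    model = keras.Sequential([
        keras.layers.Input(shape=input_shape),
        keras.layers.Conv2D(32, 3, activation="relu"),
        keras.layers.MaxPool2D(),
        keras.layers.Conv2D(32, 3, activation="relu"),
        keras.layers.MaxPool2D(),
        keras.layers.Conv2D(64, 3, activation="relu"),
        keras.layers.MaxPool2D(),
        keras.layers.Conv2D(64, 3, activation="relu"),
        keras.layers.MaxPool2D(),
        keras.layers.Flatten(),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    # SGD with Nesterov momentum, matching the reported settings.
    opt = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9,
                               nesterov=True)
    model.compile(optimizer=opt, loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```

One such model would be trained per attribute, each producing a sigmoid score thresholded at 0.5.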
| Attribute | Accuracy |
| Bags Under Eyes | 79.3% |
B.2 GAN Information
Our setup processes each face as a colored image. A GAN was trained on the CelebA dataset in order to generate new images that lie in the same distribution as CelebA. Specifically, we use the pretrained progressive GAN (https://github.com/tkarras/progressive_growing_of_gans) to approximate the data manifold of the CelebA dataset. The progressive training technique grows both the generator and discriminator from low to high resolution, generating realistic human face images at different resolutions.
B.3 Additional CelebA Examples
Figure 4 gives additional examples of applying CEM-MAF to CelebA. Similar patterns can be seen with the PPs: CEM-MAF provides sparse explanations highlighting a few features, LIME's explanations (positively relevant superpixels) cover most of the image, and Grad-CAM focuses on the mouth area.
Appendix C Additional Fashion-MNIST Information
We here discuss how disentangled features were learned for Fashion-MNIST, how classifiers were trained, and provide additional examples of CEM-MAF. Fashion-MNIST datasets are available at https://github.com/zalandoresearch/fashion-mnist.
C.1 Learning Disentangled Features for Fashion-MNIST
Our setup processes each item as a 28×28 pixel grayscale image. As in many real-world scenarios, Fashion-MNIST samples come without any supervision about the generative factors or attributes. For such data, we can rely on latent generative models such as the variational autoencoder (VAE), which aim to maximize the likelihood of generating new examples that match the observed data. VAE models have a natural inference mechanism baked in and thus allow principled enhancements to the learning objective that encourage disentanglement in the latent space. For the inferred factors to be disentangled, the inferred prior (expected variational posterior) should be factorizable along its dimensions. We use a recent VAE variant called DIP-VAE that encourages disentanglement by explicitly matching the inferred aggregated posterior to the prior distribution. This is done by matching the covariance of the two distributions, which amounts to decorrelating the dimensions of the inferred prior. Table 3 details the architecture used for training DIP-VAE.
| Component | Configuration |
| Input | 784 (flattened 28×28×1) |
| Encoder | FC 1200, 1200. ReLU activation. |
| Decoder | FC 1200, 1200, 1200, 784. ReLU activation. |
| Optimizer | Adam (lr = 1e-4) with MSE loss |
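The covariance-matching idea behind DIP-VAE can be sketched on a batch of inferred posterior means: off-diagonal covariance entries are pushed toward zero (decorrelation) and diagonal entries toward one. The penalty weights and the use of NumPy on precomputed means, rather than inside the training graph, are simplifications for illustration.

```python
import numpy as np

def dip_penalty(mu, lambda_od=10.0, lambda_d=5.0):
    """DIP-VAE-style regularizer (sketch): penalize deviation of the
    covariance of the inferred means from the identity matrix.
    mu has shape (batch, latent_dim); the weights are assumptions."""
    cov = np.cov(mu, rowvar=False)           # (latent_dim, latent_dim)
    off_diag = cov - np.diag(np.diag(cov))   # decorrelate dimensions
    diag_dev = np.diag(cov) - 1.0            # push variances toward 1
    return (lambda_od * np.sum(off_diag ** 2)
            + lambda_d * np.sum(diag_dev ** 2))
```

This penalty is added to the usual VAE objective; for means drawn from a factorized standard normal it is close to zero, as intended.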
Figure 5 shows the disentanglement in these latent features by visualizing the VAE decoder's output for single-latent traversals (varying a single latent dimension over a fixed range while keeping the others fixed). For example, increasing the value of the second dimension of the latent code corresponds to increasing sleeve length, while increasing the value of the third dimension corresponds to adding more material to the shoe.
C.2 Training Fashion-MNIST classifiers
Two datasets are created as subsets of Fashion-MNIST, one for clothes (tee-shirts, trousers, pullovers, dresses, coats, and shirts) and one for shoes (sandals, sneakers, and ankle boots). We train a CNN for each of these subsets, with two convolutional layers followed by two dense layers, to classify the corresponding images from the original Fashion-MNIST dataset. See Table 4 for training details.
| Model | Architecture |
| Shoes Classifier | Conv(5,5,32), MaxPool(2,2), Conv(5,5,64), MaxPool(2,2), Flatten, FC 1024, Dropout (rate=0.4), FC 3. ReLU activation. |
| Clothes Classifier | Conv(5,5,32), MaxPool(2,2), Conv(5,5,64), MaxPool(2,2), Flatten, FC 1024, Dropout (rate=0.4), FC 6. ReLU activation. |
| Optimizer | SGD (lr = 1e-3) with cross-entropy loss |
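A Keras sketch of the Table 4 architecture follows. It assumes `same` padding and that the ReLU applies to the hidden dense layer rather than the output (the table is ambiguous on this), with the final layer emitting logits consumed by a from-logits cross-entropy loss.

```python
from tensorflow import keras

def build_fashion_classifier(num_classes):
    """Sketch of the Table 4 classifier: two conv blocks, a dense
    layer with dropout, and a final dense layer with num_classes
    outputs (3 for shoes, 6 for clothes). Padding is an assumption."""
    model = keras.Sequential([
        keras.layers.Input(shape=(28, 28, 1)),
        keras.layers.Conv2D(32, (5, 5), padding="same", activation="relu"),
        keras.layers.MaxPool2D((2, 2)),
        keras.layers.Conv2D(64, (5, 5), padding="same", activation="relu"),
        keras.layers.MaxPool2D((2, 2)),
        keras.layers.Flatten(),
        keras.layers.Dense(1024, activation="relu"),
        keras.layers.Dropout(rate=0.4),
        keras.layers.Dense(num_classes),  # logits
    ])
    model.compile(
        optimizer=keras.optimizers.SGD(learning_rate=1e-3),
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"])
    return model
```

`build_fashion_classifier(3)` would give the shoes model and `build_fashion_classifier(6)` the clothes model.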
C.3 Additional Fashion-MNIST Examples
Figure 6 gives additional examples of applying CEM-MAF to Fashion-MNIST. Here, the PPs are not as useful as for CelebA because they are often too sparse. This could be alleviated by requiring more confidence (adjusting the corresponding hyperparameter in CEM-MAF). Both LIME and Grad-CAM highlight most of the image, which is also not particularly useful for explanation. This is a dataset where the PNs offer the most intuitive explanations.