An Adversarial Approach for Explaining the Predictions of Deep Neural Networks
Abstract
Machine learning models have been successfully applied to a wide range of applications including computer vision, natural language processing, and speech recognition. A successful implementation of these models, however, usually relies on deep neural networks (DNNs), which are treated as opaque black-box systems due to their incomprehensible complexity and intricate internal mechanisms. In this work, we present a novel algorithm for explaining the predictions of a DNN using adversarial machine learning. Our approach identifies the relative importance of input features in relation to the predictions based on the behavior of an adversarial attack on the DNN. Our algorithm has the advantage of being fast, consistent, and easy to implement and interpret. We present a detailed analysis that demonstrates how the behavior of an adversarial attack, given a DNN and a task, stays consistent for any input test data point, proving the generality of our approach. Our analysis enables us to produce consistent and efficient explanations. We illustrate the effectiveness of our approach by conducting experiments using a variety of DNNs, tasks, and datasets. Finally, we compare our work with other well-known techniques in the current literature.
1 Introduction
Explaining the outcomes of complex machine learning models is a prerequisite for establishing trust between machines and users. As humans increasingly rely on DNNs to process large amounts of data and make decisions, it is crucial to develop solutions that can interpret the predictions of DNNs in a user-friendly manner. Explaining the outcomes of a model can help reduce bias and contribute to improvements in model design, performance, and accountability by providing beneficial insights into how models behave fidel2019explainability. Consequently, the field of explainable artificial intelligence systems, XAI, has gained traction in recent years, where researchers from different disciplines have come together to define, design, and evaluate explainable systems vstrumbelj2014explaining; datta2016algorithmic; mohseni2018survey. The majority of current explainability algorithms for DNNs produce an explanation for a single input-output pair: an input data point fed into the DNN and the respective prediction made by the DNN. The algorithm usually finds the input features contributing the most to the model’s predictions and selects those as explanations for the model’s behavior alvarez2018robustness. The majority of these algorithms find the important features using either a perturbation-based approach or a saliency-based approach lundberg2017unified. Saliency-based approaches rely on gradients of the outputs with respect to the inputs to find the important features simonyan2013deep; selvaraju2017grad. Perturbation-based methods, on the other hand, apply small local changes to the input, track the changes in the output, and rank the important input features ribeiro2016should; alvarez2017causal.
One main problem with current state-of-the-art explainability tools is their reliance on a large set of hyperparameters. This leads to local instability of explanations and can negatively affect the user’s experience alvarez2018robustness. An explainability algorithm should satisfy three properties: 1) it has to produce human-understandable explanations that are loyal to the decision-making process of the DNN; 2) it has to be locally consistent and efficient; 3) it should be user-friendly, easy to apply, and quick in providing explanations. In this work, we propose a new algorithm, explanations via adversarial attacks, which satisfies these three important properties and more. We call our method Adversarial Explanations for Artificial Intelligence systems, or AXAI.
Obviously, one needs to first show how adversarial attacks link to explainability, i.e., how an attack can point to the important features in the input and how one can filter out the unimportant ones to produce explanations. Further, one needs to show that an adversary behaves similarly across models, tasks, and datasets, so that the explanations are consistent, stable, and applicable to a large group of models. Here, we present a novel algorithm for explaining a DNN’s predictions in multiple domains, including text, audio, and image. In particular, this paper makes the following contributions:

We show that given a PGD attack and a trained DNN, the distribution of attack magnitudes vs. frequency across all unseen test inputs follows a beta distribution, regardless of the task and dataset. We also show that these distributions are symmetric and that the differences between their means, medians, and quantiles are not statistically significant.

We show that the most important input features, i.e., features with the largest effect on the model’s predictions, can be found using a consistent rule across different DNN architectures, datasets, and tasks. This rule leverages the properties of the distributions explained above.

We propose a novel algorithm for explaining the outcomes of DNNs and provide a detailed analysis of our algorithm’s performance for different DNN architectures, datasets and tasks.

We benchmark our algorithm against methods such as LIME and SHAP lundberg2017unified; ribeiro2016should and show that our algorithm performs faster while producing similar or better explainability results.
2 Related Work
One of the popular explainability solutions, LIME ribeiro2016should, assumes that DNNs are locally linear. LIME trains weighted linear models on top of the DNN for perturbed samples around a target input to produce explanations. The computational bottleneck in LIME is the training part, where a selected number of perturbed samples are sent through the DNN for learning the explanation. Certain combinations of LIME’s hyperparameters can produce unstable results alvarez2018robustness. DeepLIFT produces explanations by modeling the slope of gradient changes of the output with respect to the input shrikumar2017learning. Grad-CAM is a saliency-based method that uses the gradients at the final convolutional layer to produce coarse localization maps pointing to important regions in the input selvaraju2017grad. The majority of approaches based on sensitivity maps fail to produce explanations that rely only on important features. The creators of DeepLIFT attribute this lack of stability to the behavior of activation functions such as ReLU. smilkov2017smoothgrad proposed SmoothGrad, which uses gradients and Gaussian-based denoising to produce stable explanations. The authors mention that large outlier values in the gradient maps produced by gradient differentiation may cause instability. In our algorithm, we overcome the problem of instability by utilizing the density of attacks, which are created iteratively, on segments. Some other important works in this area are given in sundararajan2017axiomatic; jacovi2018understanding; zhao2018respond; bach2015pixel; becker2018interpreting; erhan2009visualizing; letham2015interpretable.
DNNs are vulnerable to subtle adversarial perturbations applied to their inputs. The basic idea behind most adversarial attacks is to solve a maximization problem under a constraint that keeps the distance between the original and adversarial inputs small, so that the adversarial input, while capable of fooling the DNN, is not perceptually recognizable by humans. The connection between model interpretation and attacks has recently attracted the interest of researchers. ilyas2019adversarial and tsipras2018robustness showed that one benefit of adversarial examples is that they reveal useful insights into the salient features of input data and their effects on DNNs’ predictions. Our solution relies on the nature of adversarial attacks to select and produce important and explainable features given a specific input and DNN. Our work puts more emphasis on model interpretability: we make use of the information obtained from an adversarial attack on a DNN to denoise the sensitivity maps and produce stable explanations. We denoise the gradient map by utilizing the iterative nature of the PGD attack and by considering only a minimum number of highly influential gradients that contribute the most to the predictions. We use the density of gradients in a number of segments to remove the noise that was not filtered out in the previous steps and produce human-interpretable explanations.
3 Main Results
The core idea behind our approach, AXAI, is to utilize the knowledge gained from an adversarial attack on a DNN and an input to find the important features in the input and produce good explanations. This is done by mapping “carefully filtered attacked inputs” onto predefined segments and filtering out the unimportant features, as discussed in more detail in later sections. First, let us look at the example in Fig. 1 to see how our approach works. Given an image classification DNN, the adversarial attack changes the pixels in the entire image, as seen in Fig. 1(c). The reason for this is simple: each pixel value is changed by the adversary so that the accumulated loss value can increase enough to fool the DNN. Fig. 1(b) shows the distribution of the attack on this image. The x-axis represents the magnitude of the pixel changes and the y-axis represents the number of pixels at each value on the x-axis. AXAI maps the strongly attacked pixels to the segments of the original image and retains the segments with the highest density of attacked pixels that meet certain criteria to produce explanations. Fig. 1(c) shows the value changes for the important attacked pixels. As we will show, the important features used for explanations are located at specific sections in the tails of the distribution given in Fig. 1(b). These are the pixels that directly affect the classification decision made by the model. We use QuickShift vedaldi2008quick for segmenting the input image (Fig. 1(d)). It is important to note that the segmentation step in our algorithm is general, and any input segmentation method may be utilized for this step depending on the model and input type, e.g., language, signal, or imagery. Fig. 1(e) shows the explanation produced by our algorithm.
Algorithm 1 details the steps taken by AXAI to produce an explanation for the output of a selected model f. Suppose that the input x is segmented into groups using a segmentation method and that the attack magnitudes for the input and DNN are obtained. Let δ be the difference between the original input x and its adversarial counterpart. We filter out the low-intensity attack magnitudes and create a Boolean array B, where only values larger than a threshold are set to True. Let S be the set of unique segments. Next, we map the filtered attack B to the segments S and create a new list of filtered attack groups. The mapping function, Map in Algorithm 1, simply stacks the filtered attacks on the segments and groups the filtered attack by segment. The attack density of each unique segment is then computed (Calculate_density in Algorithm 1). Finally, we extract the indices of the top K maximum values among the densities (TopK_indices in Algorithm 1) and return the corresponding segments as the explanation for the input x. In the next sections, we explain each step in detail.
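The density-and-top-K step above can be sketched in a few lines of NumPy. This is an illustrative rendering of the procedure, not the authors’ exact implementation; the function and variable names are ours:

```python
import numpy as np

def axai_explain(delta, segments, threshold, k):
    """Rank segments by the density of strongly attacked features.

    delta:     array of attack magnitudes |x_adv - x| (same shape as input)
    segments:  integer segment label per feature (same shape as delta)
    threshold: magnitudes above this count as 'strongly attacked'
    k:         number of segments to return as the explanation
    """
    mask = np.abs(delta) > threshold                  # Boolean array B
    labels = np.unique(segments)                      # set of unique segments S
    # density = fraction of strongly attacked features in each segment
    density = np.array([mask[segments == s].mean() for s in labels])
    return labels[np.argsort(density)[::-1][:k]]      # top-k densest segments

# toy 4x4 "image" with 2 segments; segment 1 is strongly attacked
delta = np.array([[0.0, 0.0, 0.9, 0.8],
                  [0.1, 0.0, 0.7, 0.9],
                  [0.0, 0.1, 0.8, 0.6],
                  [0.1, 0.0, 0.9, 0.7]])
segments = np.array([[0, 0, 1, 1]] * 4)
print(axai_explain(delta, segments, threshold=0.5, k=1))  # -> [1]
```

In this toy example, all strongly perturbed pixels fall in segment 1, so that segment is returned as the explanation.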
3.1 Whitebox adversarial attacks
An adversary can attack a DNN by adding engineered noise to the input to increase the associated loss value, provided it has some prior knowledge of the DNN, including its weights and biases. AXAI utilizes the Projected Gradient Descent (PGD) attack madry2017towards, although any adversarial attack can replace PGD in our algorithm (Appendix B). However, PGD provides specific benefits, such as stability and gradient smoothness, that other attacks do not. PGD can be thought of as an iterative version of the Fast Gradient Method (FGM) attack goodfellow2014explaining, where in each iteration the adversarial changes are clipped into an ℓ∞ ball of some radius. PGD is generally considered a strong, stable attack and is defined as
x^{t+1} = Π_{x+S} ( x^t + α · sign(∇_x L(θ, x^t, y)) )    (1)
where t = 0, …, T−1 indexes the iterations, x and y are the inputs and outputs, θ denotes the weights and biases, α is the step size, and Π_{x+S} is the projection onto the set of allowed perturbations S around x.
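As a rough sketch of the update above, the following NumPy loop implements the sign-gradient ascent step and the projection onto an ℓ∞ ball. The toy linear loss and all names are ours, standing in for a real DNN loss gradient:

```python
import numpy as np

def pgd_attack(x, grad_fn, eps, alpha, iters):
    """Projected Gradient Descent (untargeted, l-infinity ball).

    x:       clean input (1-D array)
    grad_fn: returns the gradient of the loss w.r.t. the input
    eps:     radius of the l-infinity ball around x
    alpha:   step size per iteration
    iters:   number of iterations
    """
    x_adv = x.copy()
    for _ in range(iters):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))  # ascend the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)         # project into the ball
    return x_adv

# toy example: loss L(x) = w . x, so the gradient is the constant vector w
w = np.array([1.0, -2.0, 0.5])
x = np.zeros(3)
x_adv = pgd_attack(x, lambda z: w, eps=0.1, alpha=0.05, iters=20)
print(x_adv)  # each coordinate saturates at +/-eps in the sign of w
```

With a constant gradient, every coordinate walks to the boundary of the ball, which makes the clipping behavior easy to verify by hand.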
3.2 Statistical analysis of attack magnitudes vs. frequency distributions
Here, we briefly report our statistical analysis of attack magnitudes vs. frequency distributions for a fixed DNN, dataset and an adversarial attack. We can show that the distributions are similar in their “shapes,” “means,” “mean ranks,” “medians,” and “quantiles,” and follow a Beta distribution with specific parameters. Given that there is no significant difference in the distributions, we can provide a universal threshold using quantiles which separates the important features from the rest to produce explanations.
To be able to show that highly perturbed regions can be chosen to produce explanations for a single input, we should first show analytically that the results are consistent for all inputs, i.e., that adversarial attacks are consistent in their adversarial behavior and in the manner in which they attack the most influential input segments. Our analyses prove this point, and consequently we can show that our proposed rule for finding the important input segments of a single input holds. Finally, we empirically show that these segments are indeed the most important parts of the input by analyzing their effects on the test error rate.
We can measure the symmetry of distributions using the Fisher-Pearson coefficient of skewness. We present the results for AlexNet on CIFAR-10 kaur2018convolutional, VGG16 on CIFAR-100 krizhevsky2009cifar, and ResNet-34 on ImageNet deng2009imagenet. The Fisher-Pearson coefficients of the attack magnitudes vs. frequency distributions for all cases are shown in Fig. 2. It is seen that the skewness of the distributions falls within the range commonly associated with approximate symmetry, showing strong evidence that they are approximately symmetric bulmer1979principles. Only 0.9% of the CIFAR-10, 3.3% of the CIFAR-100, and 1.9% of the ImageNet test datasets lie outside of this range.
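The symmetry check can be reproduced with the Fisher-Pearson coefficient of skewness, g1 = m3 / m2^{3/2}. A small sketch on synthetic stand-in data (not the paper’s attack magnitudes):

```python
import numpy as np

def fisher_pearson_skew(samples):
    """Fisher-Pearson coefficient of skewness, g1 = m3 / m2^(3/2)."""
    x = np.asarray(samples, dtype=float)
    m = x.mean()
    m2 = ((x - m) ** 2).mean()   # second central moment
    m3 = ((x - m) ** 3).mean()   # third central moment
    return m3 / m2 ** 1.5

rng = np.random.default_rng(0)
symmetric = rng.normal(size=10_000)    # skew near 0 (approximately symmetric)
skewed = rng.exponential(size=10_000)  # skew near 2 (strongly right-skewed)
print(round(fisher_pearson_skew(symmetric), 2),
      round(fisher_pearson_skew(skewed), 2))
```

`scipy.stats.skew` computes the same quantity; the hand-rolled version just makes the moment definitions explicit.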
A Quantile-Quantile (QQ) plot allows us to understand how the quantiles of a distribution deviate from a specified theoretical distribution; here, the theoretical distribution is the normal distribution. The x-axis and y-axis represent the quantile values of the theoretical and sample distributions, respectively. While it is unlikely that two distributions match perfectly, one can look at different parts of the QQ plot to distinguish between the similar and dissimilar locations in the distributions. Fig. 3 shows the QQ plots for random subsets of the ImageNet and CIFAR-10 test datasets, each containing 1000 images. It is seen that the distributions follow a fairly straight line in the middle portion of the curve, while deviating at the upper and lower parts. This provides some evidence supporting the hypothesis that the distributions are symmetric with heavier tails.
Table 1: p-values of the pairwise tests.
t-test (CIFAR-10) | Mann-Whitney (CIFAR-10) | t-test (ImageNet) | Mann-Whitney (ImageNet)
p-value: 0.70 | 0.58 | 0.64 | 0.55
We perform the two-sample location t-test and the Mann-Whitney U test to determine whether there is a significant difference between the distributions, where the null hypothesis is equality of the means. Carrying out pairwise t-tests on all samples allows us to be conservative in confirming the mean similarity of the distributions. A sample here is defined as the attack magnitudes vs. frequency distribution for a data point in the adversarial test dataset created by the PGD attack on a DNN trained on the training dataset. The results reported in Table 1 indicate no significant difference between the means. Further, the Mann-Whitney U test results indicate that all pairs are similar to each other in mean ranks. Under the assumption that two distributions have similar shapes, one can further treat the Mann-Whitney test as a test of medians mcdonald2009handbook. Since we have shown that the shapes are similar, we can conclude that there is no significant difference between the medians of the distributions. Further details, in addition to the results of the ANOVA test, are given in Appendix C.
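A minimal sketch of the two tests using SciPy, on synthetic stand-ins for the attack-magnitude samples of two test inputs (the data here is illustrative, not the paper’s):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# stand-ins for the attack-magnitude distributions of two test inputs
sample_a = rng.normal(loc=0.0, scale=1.0, size=5000)
sample_b = rng.normal(loc=0.0, scale=1.0, size=5000)

# two-sample location t-test: null hypothesis is equality of means
t_stat, t_p = stats.ttest_ind(sample_a, sample_b)
# Mann-Whitney U test: compares mean ranks (medians, if shapes are similar)
u_stat, u_p = stats.mannwhitneyu(sample_a, sample_b, alternative='two-sided')
print(f"t-test p={t_p:.2f}, Mann-Whitney p={u_p:.2f}")
# when both samples come from the same distribution, the p-values
# are typically large, i.e., no evidence against the null hypothesis
```

The same calls, applied pairwise over all test-input distributions, reproduce the structure of Table 1.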
Table 2: Confidence-interval estimates of the 15th and 25th quantiles, mean, median, and 75th and 85th quantiles of the attack magnitudes vs. frequency distributions for AlexNet (CIFAR-10, PGD), VGG16 (CIFAR-100, PGD), and ResNet-34 (ImageNet, PGD).
Table 3: Estimated shape parameters of the fitted beta distributions for AlexNet (CIFAR-10, PGD), VGG16 (CIFAR-100, PGD), and ResNet-34 (ImageNet, PGD).
Next, to show consistency across distributions for a given model, dataset, and attack, we estimate the values of quantiles, means, and medians. We do this by estimating the statistics of the distributions and constructing confidence intervals. For each experiment, we estimate the mean, median, and the 15th, 25th, 75th, and 85th quantiles of each attack magnitudes vs. frequency distribution over the entire test dataset. The confidence-interval estimates at the chosen confidence level are reported in Table 2. Our results show that the confidence intervals have narrow ranges and the estimates are consistent. The estimates for the 15th, 25th, 75th, and 85th quantiles indicate strong symmetry about the origin in all cases, matching the results of the skewness test in Fig. 2. Another observation is that the confidence intervals of the means and medians are quite narrow, supporting the results of the t-tests and the Mann-Whitney U test. Finally, we can show with high confidence that the distributions consistently follow a beta distribution. The beta distribution is a family of distributions defined by two positive shape parameters, denoted by α and β. The estimated α and β of the beta distribution are reported in Table 3. Further technical details on the analyses presented in this section, in addition to further experiments with audio and text input types, are provided in Appendix C.
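Estimating the beta shape parameters and a normal-approximation confidence interval for the mean can be sketched as follows, on synthetic stand-in data with known parameters (the real inputs would be the normalized attack magnitudes):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data = rng.beta(2.0, 5.0, size=20_000)  # stand-in with known shape (2, 5)

# fit a beta distribution with the support fixed to [0, 1]
a_hat, b_hat, loc, scale = stats.beta.fit(data, floc=0.0, fscale=1.0)
print(round(a_hat, 1), round(b_hat, 1))  # close to the true (2, 5)

# normal-approximation 95% confidence interval for the mean
mean = data.mean()
half = 1.96 * data.std(ddof=1) / np.sqrt(data.size)
print(f"mean 95% CI: [{mean - half:.3f}, {mean + half:.3f}]")
```

With a large sample, the fitted shape parameters land close to the generating ones and the confidence interval for the mean is narrow, mirroring the consistency reported in Tables 2 and 3.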
3.3 Quantile selection for the explanations
Our algorithm produces explanations that rely only on the input features that have the largest effect on the predictions. While the majority of the input is attacked, our hypothesis is that only important features are strongly attacked. We show how one can select the boundary threshold between “explainable features” and the rest based on attack magnitudes. We demonstrate this with two experiments: 1) AlexNet trained on CIFAR-10, and 2) ResNet-34 trained on ImageNet, both attacked by PGD with 20 iterations. In each case, we select the successfully attacked inputs from the adversarial test dataset, i.e., the inputs that fool the DNN. We then re-attack only the features of the original clean inputs that lie within a given percentile band of the distributions, parameterized by a percentage threshold p. The re-attacking process starts from p = 0, where none of the input features are attacked; we then gradually increase p until the attack successfully changes the prediction and save that value of p (Fig. 4(a)). We repeat this for every input. The probability density distributions of the saved thresholds are given in Fig. 4(b) and Fig. 4(c), along with their estimated means.
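The threshold sweep can be sketched as a simple search. The toy model and all names here are ours, assuming a full-attack perturbation delta for the input is already available:

```python
import numpy as np

def minimal_percentile_band(x, delta, predict, p_step=5):
    """Find the smallest top-percentile band of attack magnitudes that flips the model.

    x:       clean input (flat array)
    delta:   full-attack perturbation for x
    predict: returns the model's predicted label for an input
    Re-attacks only features whose |delta| lies in the top-p band, growing p
    until the prediction changes; returns that p (or 100 if it never flips).
    """
    original = predict(x)
    mags = np.abs(delta)
    for p in range(p_step, 101, p_step):
        cutoff = np.percentile(mags, 100 - p)   # keep only the top-p% magnitudes
        mask = mags >= cutoff
        if predict(x + delta * mask) != original:
            return p
    return 100

# toy 'model': the label is the sign of the sum of the features
predict = lambda z: int(z.sum() > 0)
x = np.full(100, 0.01)                      # predicted class 1
delta = np.zeros(100)
delta[:10] = -0.2                           # attack concentrated on 10 features
print(minimal_percentile_band(x, delta, predict))  # -> 5
```

Because the perturbation is concentrated on a few high-magnitude features, attacking only the uppermost band already flips the toy model, which is the qualitative behavior the experiment measures.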
Table 4: Test accuracies when attacking only features within given percentile ranges.
Attack Percentile | CIFAR-10, AlexNet | ImageNet, ResNet-34 | Attack Percentile | CIFAR-10, AlexNet | ImageNet, ResNet-34
 | 0.78 | 0.88 |  | 0.16 | 0.07
 | 0.26 | 0.79 |  | 0.26 | 0.13
 | 0.50 | 0.63 |  | 0.45 | 0.25
 | 0.07 | 0.12 |  | 0.92 | 0.80
Further, we report the test accuracies of the DNNs on adversarial test datasets created based on different attack percentiles. Given an attack percentile range, the adversarial test dataset consists of adversarial test inputs created by attacking only the portion of the input features that lies within that percentile range of the attack magnitudes vs. frequency distributions, similar to above. This allows us to understand how the features lying in the middle area, tails, and outliers of the distributions affect the DNN’s predictions. Our findings are reported in Table 4. Our results show that the majority of the input features, including those within the first two standard deviations and the outliers of the distributions, do not have a strong effect on the predictions. A smaller portion of the input features, which are also those attacked with the highest intensity, i.e., those within the uppermost percentiles of the distributions, have the largest effect on the DNN’s predictions, confirming our hypothesis. We see the same trend across different DNNs and datasets (Appendix C).
4 Experiment Results
Earlier, we provided a sample explanation created by AXAI for an image classifier. Appendix E contains more experiments for image classification and object detection DNNs, as well as an ablation study and an interesting comparison between explanations produced by a non-robust model and an adversarially robust model. Here, we provide sample explanations produced by our algorithm for speech recognition and language-based tasks.
4.1 Explaining a speech recognition model
The Speech Commands Dataset warden2018speech is an audio dataset of short spoken words. Here, we have converted the audio files to spectrograms and used them to train a LeNet model to identify “speech commands.” We have created time-frequency segments by dividing the spectrogram into time-frequency grids similar to Mishra2017LocalIM. The x-axis and y-axis indicate the time scale and log-scale frequency of the spectrograms, respectively, and the color bar indicates the magnitude. This kind of segmentation results in equal-sized rectangular blocks where the height of a segment covers a range of frequencies (y-axis) and its width covers a range of time (x-axis) associated with the spoken word. The spectrogram of the first word, “Right,” and its explanation are shown in Fig. 5(a) and Fig. 5(b). The explanation shows that the first and last characters in the spoken word “Right” stand out as important features (the corresponding time intervals). This is reasonable because “Five” is the neighboring class of “Right” in the dataset (Appendix D), and “Right” and “Five” differ in the pronunciation of “r” and “f” and of “t” and “v.” The second example is the word “Three” (Fig. 5(c) and Fig. 5(d)). The produced explanation indicates the importance of “Thr” (the corresponding time interval). This is reasonable because “Three” and its neighbor “Tree” differ in the letter “h” in “Thr,” and this difference is learned by the model during training to identify the two words correctly. More examples are shown in Fig. 8. Details on this experiment are given in Appendix E.
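The grid segmentation described above can be sketched as an integer label map over the spectrogram. The grid sizes below are illustrative, not the ones used in the experiment:

```python
import numpy as np

def grid_segments(spec, n_time, n_freq):
    """Label each spectrogram bin with its time-frequency grid cell.

    spec:   2-D array (frequency bins x time frames)
    n_time: number of grid columns (time)
    n_freq: number of grid rows (frequency)
    Returns an integer label map with the same shape as spec.
    """
    f_bins, t_frames = spec.shape
    f_idx = np.minimum(np.arange(f_bins) * n_freq // f_bins, n_freq - 1)
    t_idx = np.minimum(np.arange(t_frames) * n_time // t_frames, n_time - 1)
    return f_idx[:, None] * n_time + t_idx[None, :]   # unique label per cell

spec = np.zeros((128, 64))          # e.g. 128 frequency bins x 64 time frames
labels = grid_segments(spec, n_time=8, n_freq=4)
print(labels.shape, labels.max() + 1)  # (128, 64) 32 rectangular segments
```

The resulting label map plays the same role as the QuickShift segments in the image experiments: AXAI accumulates attack density per grid cell and keeps the top-K cells.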
4.2 Explaining a text classification model
The Sentence Polarity Dataset Pang+Lee:05a is a collection of movie-review documents labeled with respect to their overall sentiment polarity. Here, we look at a negative and a positive example (Fig. 6(a) and Fig. 6(b)), where the rows are the word tokens in the sentence and the columns are the embedding dimensions. The NLP model used in our experiment is taken from kim2014convolutional and trained on this dataset. As part of the preprocessing, the words in the dataset are tokenized and mapped to an embedding matrix. The word embedding matrix is also used as the segments in our algorithm. li2015visualizing mentions that the saliency map of an NLP model can be visualized using the embedding layer, similar to the saliency maps used for image-based models. Consequently, one can apply our algorithm to NLP models in a similar manner, i.e., we can utilize the first-order derivative of the loss with respect to the word embedding. This technique is similar to the one used in miyato2016adversarial. The first example, “it’s a glorified sitcom, and a long, unfunny one at that.” is classified as a negative review by the model. Fig. 6(a) shows that the word “unfunny” is strongly highlighted as the main explanation for this prediction. For the positive example, “a work of astonishing delicacy and force,” the word “astonishing” has the most significant influence on the model’s prediction. More examples are shown in Fig. 9.
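Treating each embedding row as a segment, the per-token attribution reduces to collapsing the attack magnitudes over the embedding dimension. A toy sketch with made-up numbers (the real magnitudes would come from attacking the embedding layer):

```python
import numpy as np

def token_scores(embed_grad):
    """Collapse per-dimension embedding attack magnitudes into one score per token.

    embed_grad: (num_tokens, embed_dim) array of attack magnitudes on the
                embedding matrix; each row is treated as one 'segment'.
    """
    return np.abs(embed_grad).sum(axis=1)

tokens = ["a", "glorified", "sitcom", "unfunny"]
grad = np.array([[0.01, 0.02],    # illustrative 2-D embeddings
                 [0.05, 0.04],
                 [0.06, 0.05],
                 [0.90, 0.80]])   # the attack concentrates on this row
scores = token_scores(grad)
print(tokens[int(np.argmax(scores))])  # -> unfunny
```

The most strongly attacked embedding row is reported as the explanatory token, mirroring how “unfunny” is highlighted in the negative example.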
4.3 Benchmark tests
We test our algorithm against LIME and SHAP (Gradient Explainer). It is important to note that SHAP subsumes a number of prior approaches and provides a fair baseline. To show the consistency of our approach, we present visualizations for three cases: 1) AlexNet on CIFAR-10, 2) VGG16 on CIFAR-100, and 3) ResNet-34 on ImageNet using the three explainability tools, and provide more experiments in Appendix F. The algorithms produce similar explanations, while AXAI has fewer tunable parameters and runs faster. LIME fails to produce good explanations for low-resolution CIFAR-10 images. In Appendix F, we provide examples showing that AXAI outperforms LIME for low-resolution inputs. We benchmark the running-time performance of AXAI, LIME, and SHAP for ResNet-34 trained on ImageNet on a single CPU (Intel Core i5-7360U) and a single GPU (Tesla V100-SXM2) over the entire test dataset. The results are given in Table 5. LIME is the slowest to produce explanations because it needs to forward propagate the perturbed inputs through the DNN several times. SHAP is also slower than AXAI. LIME benefits the most from GPU acceleration, while AXAI maintains its relative performance on the CPU and GPU because the segmentation step, which mainly uses the CPU, is the main computational bottleneck (Appendix A). A few comparisons between AXAI, LIME, and SHAP are shown in Fig. 7.
 | Single CPU (Intel Core i5-7360U) | Single GPU (Tesla V100-SXM2)
LIME | 105s | 5.8s
SHAP (Gradient Explainer) | 35s | 3.8s
AXAI (PGD with 20 iters) | 6.6s | 1.7s
5 Final Remarks and Conclusion
In this paper, we proposed a new approach for explaining the predictions of DNNs. Interpretability is directly related to the readability of an explanation gilpin2018explaining; an explanation relying on thousands of features is not interpretable. AXAI, similar to LIME, uses input segmentation to create human-readable explanations focused on important input features. Further, AXAI has the following properties:
Property 1 (Robustness): Our approach is more robust to the changes in segmentation hyperparameters in comparison to other segmentation based approaches such as LIME. This is because AXAI does not require a surrogate model trained on “randomly perturbed inputs.” AXAI uses the deterministic attack magnitudes as “base explanations” for a given DNN and dataset, and uses segments as an “aid” to visualize the results. The segmentation affects the visualizations. We further explain this in Appendix A. Robustness is identical to stability of explanations as defined in pub.1104451629. A lower number of nondeterministic steps in the algorithm enhances stability. A carefully filtered explanation based on our approach simply removes the features that have a low impact on predictions. One can interpret this process as a denoising step to create a sparse representation of explanations.
Property 2 (Local attribution): Our algorithm is locally stable and uses local attributes to produce explanations. This is because an adversarial attack uses a minimal amount of noise within an ℓ∞ ball of some small radius to fool the DNN. Given the untargeted nature of the attack used in AXAI, the distributions can be interpreted as estimations of the boundaries among neighboring classes. Thus, one can conclude that the attack magnitudes are a representation of feature contributions to the predictions on a local scale. A similar conclusion is made in ancona2017towards, where it is argued that gradients can in fact point to important local attributions of a DNN. We explore this in detail in Appendix D.
Property 3 (Completeness): Completeness as a property is described as the ability to accurately explain the operations of a DNN gilpin2018explaining. An explanation is more complete when it can explain the behavior of the DNN for a larger set of inputs. sundararajan2017axiomatic and smilkov2017smoothgrad mention the problem of sensitivity and lack of stability in gradient-based algorithms. In the literature, if a solution can reduce the gradient “sensitivity” problem, it can be described as having the “completeness” property gilpin2018explaining. AXAI with the PGD attack is complete in the same sense as SmoothGrad smilkov2017smoothgrad. SmoothGrad averages saliency maps with added Gaussian noise to reduce sensitivity. The PGD attack behaves in a similar manner by adding adversarial noise at each iteration. Both solutions add perturbations to the input to smooth gradient fluctuations. While further research can be done on the gradient-smoothing effects of iterative attacks, we argue that AXAI with iterative PGD does have this desirable characteristic and produces stable, sharpened visualizations of sensitivity maps for robust explanations.
Lastly, as shown in Section 3, our explainability algorithm exhibits a highlevel of fidelity where the explainability outputs are both interpretable and also loyal to the decision making process of the DNN. The produced explainability segments directly point to the places in the input that affect the decision of the DNN. As a result, our solution can be used to explore the relationship between input features and predictions and to understand issues related to the training of DNNs, bias and robustness against adversarial attacks (Appendix E).
Potential Ethical Impact
Our work in this paper contributes to the fields of adversarial machine learning and artificial intelligence explainability (AI explainability). There is still a huge gap between building a model in a Jupyter notebook and shipping it as a standalone product to users. Advances in these two fields directly relate to the deployment of AI systems that behave in a robust and user-friendly manner after deployment. Building AI systems is hard. AI explainability can provide insights into how AI models behave, why they make the decisions they make, and the reasoning behind their incorrect predictions. Additionally, explaining the outcomes of a model can help reduce bias and contribute to improvements in accountability and ethics by providing beneficial insights into how AI models make their decisions.
Despite the hype, AI engineers struggle with deploying models that meet users’ performance expectations. A lack of robustness in the performance of trained models is a major impediment. We need to be able to design AI systems that both perform well and are robust. A robust model not only makes correct predictions in expected environments, but also maintains an acceptable level of performance in unpredictable situations. Our work gives insights into how an adversary attacks an AI system trained to perform a specific task. Understanding how adversarial attacks behave can help AI engineers develop AI systems that perform as expected while maintaining some level of robustness in the presence of external disturbances and adversarial noise. In short, our paper can help AI researchers in their endeavor to design, develop, and deploy explainable, ethical AI systems that are robust and reliable.
References
Appendix A QuickShift Segmentation
QuickShift is a mode-seeking clustering algorithm proposed by vedaldi2008quick. QuickShift creates segments by repeatedly moving each data point to its closest neighboring point that has a higher density, calculated by a Parzen estimator. The Kernel size argument in the QuickShift function controls the width of the Gaussian kernel of the estimator. The paths of the moving points can be seen as a tree that connects the data points; eventually, the algorithm connects all data points into a single tree. To balance between under- and over-fragmentation of the image, a threshold, τ, serves as a breaking point that limits the length of the branches in the QuickShift trees. The threshold, τ, is the Max distance argument in the QuickShift function. Finally, the preprocessing step of QuickShift projects a given image into a 5D space consisting of the color space (r, g, b) and the location (x, y). A hyperparameter, λ, takes a value between 0 and 1 and serves as a weight assigned to the color space, such that the feature space can be presented as (λr, λg, λb, x, y).
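The 5D projection can be sketched directly. `quickshift_features` is our illustrative name, and the weight here plays the role of the color-space hyperparameter described above (`ratio` in scikit-image's QuickShift implementation):

```python
import numpy as np

def quickshift_features(image, ratio):
    """Project an RGB image into QuickShift's 5-D feature space.

    Each pixel becomes (ratio*r, ratio*g, ratio*b, x, y): the weight 'ratio'
    trades off color similarity against spatial proximity when the Parzen
    densities and nearest higher-density neighbors are computed.
    """
    h, w, _ = image.shape
    ys, xs = np.mgrid[0:h, 0:w]          # per-pixel row (y) and column (x)
    return np.concatenate(
        [ratio * image, xs[..., None], ys[..., None]], axis=-1)

img = np.random.default_rng(3).random((4, 4, 3))
feats = quickshift_features(img, ratio=0.2)
print(feats.shape)  # (4, 4, 5): three weighted color channels plus (x, y)
```

A small weight makes the spatial coordinates dominate the distances, producing more spatially compact segments; a large weight lets color similarity pull distant pixels into the same segment.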
LIME uses QuickShift for image segmentation with a default Kernel size of 4, a Max distance of 200, and a color-space weight $\lambda$ of 0.2. This combination prevents generating too many image segments. Even though the image segmentation process is only performed once per image, we would like to point out that the parameter selection does change the explanation results slightly. First, increasing the kernel size increases the computation time while decreasing the number of image segments, making this parameter the major computational bottleneck in image segmentation. Second, extra care should be taken with low-resolution images: when the image is coarse and the number of image segments is low, important and unimportant features can easily be merged together, as demonstrated in Fig. 10. From the perspective of explainability, both accuracy and human-readability are needed, and this is achieved as long as the important segments are not merged with unimportant ones. This problem can be solved by selecting a small kernel size. In our algorithm, we introduce a user-tunable hyperparameter, called the explainability length K, that allows users to decide the number of explainable segments. Human-readability is subjective, so we let the user decide the explainability length, Fig. 11. In Fig. 11, the wall of the castle on the leftmost side of the image is merged with the sky due to the similarity between their colors. In both cases, we picked the top 10 segments as explanations, i.e., an explainability length of 10. It is important to note that unlike LIME and other explainability algorithms, choosing a longer explainability length (more segments) does not increase the computational time of our algorithm.
Deciding the tradeoff between the importance of the color components $(r, g, b)$ and the spatial components $(x, y)$ of the feature space is especially important for high-resolution images. Take a castle image in the ImageNet dataset as an example (given in Fig. 12). We choose two different parameter combinations for comparison; the only difference between the two is the $\lambda$ parameter. For the first combination we used $\lambda = 0.2$ (Fig. 11(b)); for the second, $\lambda = 0.8$ (Fig. 11(c)). One can see that using a lower $\lambda$ prevents details from merging with irrelevant background information. In Fig. 11(b) and Fig. 11(c), the total numbers of segments are nearly the same (73 and 81) but the explanations have different qualities.
Appendix B Convergence of Explanations across Adversarial Attacks
As a tool for explainability, efficiency, accuracy, and consistency are of top priority. Our experiments show that PGD attacks with different numbers of iterations create explanations similar to those of the FGM attack, which points to the consistency of the explanations produced by our algorithm. The PGD attack is an iterative version of FGM, and both attacks are constrained by the same norm. Note that the distribution of the attack magnitudes can influence the explanation results; since the attack distributions of the first iteration and later iterations of the PGD attack are nearly identical, the overall explanations remain the same. In Fig. 13, we provide an example from the ImageNet dataset to show the convergence of the attacks and the consistency of our explanations. Fig. 12(b) shows the explanation results for an FGM-based algorithm, while Fig. 12(c) and Fig. 12(d) show the explanation results based on the PGD attack with different numbers of iterations. They look exactly the same: the slight changes in the attack distribution for different numbers of iterations do not affect the overall density of pixel changes in each segment, so the final explainability results do not change. This points to the stability and consistency of our algorithm. To explore this further, we can segment the image into much smaller segments, as given in Fig. 12(e) and Fig. 12(f), in this case using 50 times more segments than before, and then produce the explanations. In this case, we do see small differences between an explanation produced with a PGD attack with 10 iterations and one based on a PGD attack with 40 iterations; these small differences are caused by small differences in the attack distributions within each segment. While it is interesting to further explore how different types of attacks can lead to more "suitable" explanations, it is important to note that one can explain the outcomes using our algorithm with both types of attacks.
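The convergence of the two attacks is easiest to see on a model whose loss gradient direction does not change across PGD iterations, e.g., a linear model with a logistic loss. Below is a small illustrative sketch; the weights, input, and step sizes are all made up.

```python
import numpy as np

def grad_loss(w, x, y):
    """Gradient of the logistic loss w.r.t. the input x of a linear model."""
    p = 1.0 / (1.0 + np.exp(-y * (w @ x)))
    return -(1.0 - p) * y * w

def fgm(w, x, y, eps):
    # one-step attack: move eps along the sign of the input gradient
    return x + eps * np.sign(grad_loss(w, x, y))

def pgd(w, x, y, eps, steps, alpha):
    # iterative attack: repeated signed steps, projected back onto the eps-ball
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_loss(w, x_adv, y))
        x_adv = np.clip(x_adv, x - eps, x + eps)
    return x_adv

w = np.array([1.0, -2.0, 0.5])   # hypothetical model weights
x = np.array([0.2, 0.1, -0.3])   # hypothetical input, true label y = 1
x_fgm = fgm(w, x, 1, eps=0.1)
x_pgd = pgd(w, x, 1, eps=0.1, steps=10, alpha=0.03)
# the per-feature perturbation directions of the two attacks agree
print(np.array_equal(np.sign(x_fgm - x), np.sign(x_pgd - x)))  # True
```

For a DNN the gradient direction does drift between iterations, which is precisely the source of the small per-segment differences discussed above.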
Further, we conclude that using either FGM or PGD attacks in our algorithm satisfies the consistency, accuracy, and efficiency conditions for producing explanations.
Appendix C Further Details on the Statistical Analyses given in Subsection 3.2
C.1 Further details on the statistical tests
The Fisher-Pearson coefficient $g_1$ of a distribution with sample size $N$ is calculated using the third moment $m_3$ and the second moment $m_2$ of the distribution,
$$g_1 = \frac{m_3}{m_2^{3/2}} \qquad (2)$$
where,
$$m_k = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^k \qquad (3)$$
If the skewness is 0, the data is perfectly symmetrical; if the skewness is positive, the distribution is skewed right; if the skewness is negative, the distribution is skewed left. bulmer1979principles pointed out that there are three levels of symmetricity: a) when the skewness is between $-0.5$ and $0.5$, the distribution is "approximately symmetric;" b) when the skewness is between $-1$ and $-0.5$ or between $0.5$ and $1$, the distribution is "moderately skewed;" c) when the skewness falls outside these ranges, the distribution is highly skewed. The Fisher-Pearson coefficients of all attack magnitudes are shown in Fig. 18. The skewness of all attack magnitudes falls within $-0.5$ and $0.5$, providing strong evidence that the distributions are approximately symmetric.
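Eq. (2) can be computed directly from the raw moments and checked against `scipy.stats.skew`, which implements the same (biased) Fisher-Pearson definition; the sample below is a synthetic stand-in for the attack magnitudes.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
sample = rng.normal(size=10_000)  # synthetic stand-in for attack magnitudes

# Fisher-Pearson coefficient g1 = m3 / m2^(3/2), from the raw moments
centered = sample - sample.mean()
m2 = np.mean(centered ** 2)
m3 = np.mean(centered ** 3)
g1 = m3 / m2 ** 1.5

assert np.isclose(g1, skew(sample))  # matches scipy's biased definition
print(abs(g1) < 0.5)  # True: "approximately symmetric" per bulmer1979principles
```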
The t-statistic is calculated as follows,
$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{2/n}} \qquad (4)$$
where,
$$s_p = \sqrt{\frac{s_{x_1}^2 + s_{x_2}^2}{2}} \qquad (5)$$
Here $\bar{x}_1$, $\bar{x}_2$ and $s_{x_1}^2$, $s_{x_2}^2$ are the means and variances of the two distributions, each of size $n$. The t-statistic can be interpreted as a measurement of the ratio of the "difference between groups" over the "difference within groups." Carrying out pairwise t-tests on all samples allows us to be further conservative about the similarity of the means of the distributions. The results are shown in Table 4. Overall, there are no significant differences between the distributions.
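The pooled statistic in Eqs. (4)-(5) matches what `scipy.stats.ttest_ind` computes with its default equal-variance setting; the two samples below are synthetic stand-ins for a pair of attack-magnitude distributions.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, 500)  # synthetic stand-ins for two
b = rng.normal(0.0, 1.0, 500)  # attack-magnitude distributions

# pooled two-sample t-statistic, Eqs. (4)-(5), for equal sample sizes n
s_p = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
t_manual = (a.mean() - b.mean()) / (s_p * np.sqrt(2 / len(a)))

t_scipy, p_value = ttest_ind(a, b)  # equal-variance test by default
print(np.isclose(t_manual, t_scipy))  # True
```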
To show the similarity between the distributions produced for a dataset, we also use the one-way ANOVA test on all the samples to show that the means across different distributions are the same. Samples here are defined as intensity vs. frequency distributions for all adversarial test samples created by attacking a model trained on a specific dataset. For CIFAR10 we get a p-value of 0.9, and for a random subset of the ImageNet test dataset we get a p-value of 0.94, indicating no significant differences between the distribution means. Similarly, a two-sample location t-test is used to determine if there is a significant difference between two groups, where the null hypothesis is the equality of the means. Even though ANOVA and t-tests are known for being robust on non-normal data, we further performed the pairwise Mann-Whitney U test on all pairs of distributions to test whether the mean ranks are similar.
The Mann-Whitney U test is a nonparametric test of the null hypothesis that two independent samples are selected from populations having the same distribution. The statistic U is calculated as follows,
$$U_1 = R_1 - \frac{n_1(n_1+1)}{2}, \qquad U_2 = R_2 - \frac{n_2(n_2+1)}{2} \qquad (6)$$
Here the subscripts "1" and "2" denote the two distributions being compared. To compare two distributions "sample 1" and "sample 2," one first combines them into a single ordered set and assigns ranks to its members. Next, one adds up the ranks of the members coming from "sample 1" and "sample 2" respectively; these are the rank sums $R_1$ and $R_2$. Once the rank sums are calculated, the U statistics of the two distributions ($U_1$ and $U_2$) are computed as above. Finally, the U statistic of the test is the lower of $U_1$ and $U_2$: if $U_1$ is lower than $U_2$, then $U_1$ is the U statistic of the Mann-Whitney test between "sample 1" and "sample 2," and vice versa; the smaller value is the one used when consulting significance tables. If U is 0, the two distributions are far away from each other, with no overlap between them; if the rank sums are close, the two distributions overlap heavily. Thus, the Mann-Whitney U test can be seen as a comparison of the rank sums (or the mean ranks, calculated by dividing the rank sums by the sample sizes) of two distributions. We perform the pairwise Mann-Whitney U test on all pairs of distributions to test whether the mean ranks are similar as well.
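The rank-sum procedure above can be verified against `scipy.stats.mannwhitneyu`, whose reported statistic corresponds to the U of the first sample; the samples below are synthetic stand-ins.

```python
import numpy as np
from scipy.stats import mannwhitneyu, rankdata

rng = np.random.default_rng(2)
s1 = rng.normal(0.0, 1.0, 200)  # synthetic stand-in samples
s2 = rng.normal(0.0, 1.0, 300)

# Eq. (6): rank the pooled set, take rank sums, then U_i = R_i - n_i(n_i+1)/2
ranks = rankdata(np.concatenate([s1, s2]))
n1, n2 = len(s1), len(s2)
u1 = ranks[:n1].sum() - n1 * (n1 + 1) / 2
u2 = ranks[n1:].sum() - n2 * (n2 + 1) / 2
assert np.isclose(u1 + u2, n1 * n2)   # the two statistics always sum to n1*n2
u_stat = min(u1, u2)                  # value consulted in significance tables

res = mannwhitneyu(s1, s2, alternative="two-sided")  # reports U of s1
print(np.isclose(u_stat, min(res.statistic, n1 * n2 - res.statistic)))  # True
```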
C.2 Quantile-Quantile plot
A Quantile-Quantile (Q-Q) plot allows us to show how the quantiles of a distribution deviate from those of a specified theoretical distribution; the theoretical distribution selected here is the normal distribution. Quantiles are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities. A Q-Q plot is then a scatterplot showing two sets of quantiles (a sample distribution and a theoretical distribution) against one another. The x-axis shows the quantile values of the theoretical distribution, while the y-axis shows the quantile values of the sample distribution, i.e., the distribution of attack intensities vs. pixel frequencies. If the quantiles of the sample distribution perfectly match the theoretical quantiles, all points lie on a straight line. While it is unlikely for a sample distribution to perfectly match the theoretical distribution, one can look at different sections of the Q-Q curve to distinguish the parts where the two distributions are similar from the parts where they differ. Compared to a normal distribution, if the sample distribution has heavy or light tails, the Q-Q curve bends at the upper or lower portion depending on which tail deviates from the normal distribution; one purpose of Q-Q plots is thus to look at the "straightness" of the Q-Q curve. We took a subset of 1000 images from both ImageNet and CIFAR10 and plotted the distributions against a normal distribution, as given in Fig. 3. All attack distributions plotted against the normal distribution have fairly straight lines in the middle portion of the Q-Q curve, while the curve bends at the upper and lower parts. One can interpret this result as follows: the attack magnitudes are similar to a normal distribution but have "heavy tails," so the upper part of the curve bends "up" and the lower part bends "down."
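`scipy.stats.probplot` produces exactly these (theoretical, sample) quantile pairs together with a least-squares line fit. The sketch below uses a Student-t sample as a hypothetical heavy-tailed stand-in and checks quantitatively that its scaled tail quantile exceeds the normal one, i.e., that the Q-Q curve bends at the extremes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
sample = rng.standard_t(df=3, size=5000)  # heavy-tailed stand-in distribution

# probplot computes the (theoretical, sample) quantile pairs of a Q-Q plot
# against the normal distribution, plus a least-squares line fit
(theo_q, samp_q), (slope, intercept, r) = stats.probplot(sample, dist="norm")

# A heavy-tailed sample bends away from the line at the extremes: its
# IQR-scaled tail quantile exceeds that of a normal distribution.
iqr = np.subtract(*np.quantile(sample, [0.75, 0.25]))   # q75 - q25
tail_sample = np.quantile(sample, 0.999) / iqr
tail_normal = stats.norm.ppf(0.999) / (stats.norm.ppf(0.75) - stats.norm.ppf(0.25))
print(tail_sample > tail_normal)  # True: the upper tail bends "up"
```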
C.3 The beta distribution
The beta distribution is a family of distributions defined on an interval $[a, b]$ and parametrized by two positive shape parameters, denoted by $\alpha$ and $\beta$. The general formula for the probability density function of the beta distribution can be written as,
$$f(x) = \frac{(x-a)^{\alpha-1}(b-x)^{\beta-1}}{B(\alpha,\beta)\,(b-a)^{\alpha+\beta-1}}, \quad a \le x \le b \qquad (7)$$
where,
$$B(\alpha,\beta) = \int_0^1 t^{\alpha-1}(1-t)^{\beta-1}\,dt \qquad (8)$$
The beta distribution is often used to describe different types of data, such as rainfall, traffic, and financial data. In this paper, we estimate the parameters of a beta distribution for our distributions. The method of moments is employed to estimate the shape parameters $\alpha$ and $\beta$ of the two-parameter beta distribution. As the interval $[a, b]$ is known, the method of moments estimates of $\alpha$ and $\beta$ are
$$\hat{\alpha} = \bar{x}\left(\frac{\bar{x}(1-\bar{x})}{s^2} - 1\right) \qquad (9)$$
$$\hat{\beta} = (1-\bar{x})\left(\frac{\bar{x}(1-\bar{x})}{s^2} - 1\right) \qquad (10)$$
when the interval $[a, b]$ is $[0, 1]$; this is called the standard beta distribution. Since in most cases the interval $[a, b]$ is not $[0, 1]$, one can replace $\bar{x}$ with $(\bar{x}-a)/(b-a)$ and $s^2$ with $s^2/(b-a)^2$. Finally, the estimated $\hat{\alpha}$ and $\hat{\beta}$ of the beta distribution are listed in Table 3.
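A quick numerical sanity check of Eqs. (9)-(10): draw a sample from a beta distribution with known shape parameters on a hypothetical interval $[a, b]$, rescale it to $[0, 1]$, and recover the shape parameters by the method of moments.

```python
import numpy as np

rng = np.random.default_rng(4)
a, b = -2.0, 2.0  # hypothetical attack-magnitude interval
data = a + (b - a) * rng.beta(2.0, 5.0, size=20_000)

# rescale to [0, 1], then apply the standard method of moments, Eqs. (9)-(10)
x = (data - a) / (b - a)
m, v = x.mean(), x.var(ddof=1)
common = m * (1 - m) / v - 1
alpha_hat = m * common
beta_hat = (1 - m) * common
print(round(alpha_hat, 2), round(beta_hat, 2))  # close to the true (2, 5)
```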
C.4 Statistical analysis of distributions for DNNs with text or audio input types
We test the symmetricity of the distributions by calculating the Fisher-Pearson coefficient of skewness for a LeNet trained on the Speech Commands dataset and a convolutional neural network (CNN) given in kim2014convolutional trained on the Polarity dataset. The Fisher-Pearson coefficients of the attack magnitude vs. frequency distributions are shown in Fig. 14. The skewness of all distributions falls within $-0.5$ and $0.5$, showing strong evidence that they are approximately symmetric bulmer1979principles.
We perform the two-sample location t-test and the Mann-Whitney U test to determine if there is a significant difference between two groups, where the null hypothesis is the equality of the means. The results reported in Table 6 indicate no significant difference between the means. Further, the Mann-Whitney U test results indicate that all pairs are similar to each other in mean ranks. Under the assumption that two distributions have similar shapes, one could further state that the Mann-Whitney test can be considered a test of medians mcdonald2009handbook. Since we have shown that the shapes are similar, we can conclude that there are no significant differences between the medians of the distributions.
Dataset | LeNet, SpeechCommands, PGD | CNN, Sentence Polarity, PGD
Test | t-test | Mann-Whitney | t-test | Mann-Whitney
p-value | 0.30 | 0.25 | 0.47 | 0.42
[Table 7: confidence-interval estimates of the mean, median, and 15th, 25th, 75th, and 85th quantiles for LeNet (SpeechCommands, PGD) and the CNN (Sentence Polarity, PGD). Table 8: estimated $\hat{\alpha}$ and $\hat{\beta}$ of the beta distribution for the same models. Numeric entries not recoverable.]
Next, to show consistency across the distributions for a given model, dataset, and attack, we estimate the values of the quantiles, means, and medians by estimating the statistics of the distributions and constructing confidence intervals. For each experiment, we estimate the mean, median, and the 15th, 25th, 75th, and 85th quantiles of each attack magnitude vs. frequency distribution over the entire test dataset. The confidence interval estimates are reported in Table 7. Our results show that the confidence intervals have narrow ranges and the estimates are consistent. The estimates for the 15th, 25th, 75th, and 85th quantiles indicate a strong symmetricity with respect to the origin in all cases. Another observation is that the confidence intervals of the means and medians are quite narrow, supporting the results of the t-tests and Mann-Whitney U tests. Finally, we can show with high confidence that the distributions consistently follow a beta distribution, a family of distributions defined by two positive shape parameters $\alpha$ and $\beta$; the estimated $\hat{\alpha}$ and $\hat{\beta}$ of the beta distribution are reported in Table 8.
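The interval estimates can be reproduced in spirit with a percentile bootstrap, one common construction (the exact procedure used for Table 7 is not specified here); the sample below is a synthetic stand-in.

```python
import numpy as np

rng = np.random.default_rng(5)
data = rng.beta(2.0, 5.0, 1000)  # synthetic stand-in sample

# percentile bootstrap: resample with replacement, recompute the statistic,
# and take the 2.5% / 97.5% quantiles of the bootstrap distribution
boot_medians = np.array([
    np.median(rng.choice(data, size=data.size, replace=True))
    for _ in range(2000)
])
lo, hi = np.quantile(boot_medians, [0.025, 0.975])
print(lo < np.median(data) < hi, hi - lo < 0.1)  # interval covers and is narrow
```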
Appendix D Explanations and Class Boundaries
Explaining how important features affect the predictions made by a model depends on the set of classes the model was trained to predict. Untargeted attacks change the prediction label of an input to the label of its closest neighbor, so depending on the dataset a model was trained on, the label changes after an attack may be significantly different. For example, given an image of a Beagle and a model trained on a dataset with the labels {Cat, Dog}, attacking the model can change the label of the image from "Dog" to "Cat." But if the same model is trained on a dataset composed of {Beagle, Golden Retriever, Egyptian Cat}, the label of the image can change from "Beagle" to "Golden Retriever," which is a more granular change. When an image is attacked, its features are directed toward the nearest class with a similar probability distribution in the decision layer. Let's look at an example from ImageNet where the input image is classified as a "convertible" by a ResNet34 trained on ImageNet (given in Fig. 15). There are multiple classes, such as minivan, sports car, and race car, under the "car" category in ImageNet. After attacking the model, the label changes from "convertible" to "sports car," which indicates that "sports car" may be the nearest neighboring class to the "convertible" class. Looking at the produced explanations, we see that the segments including the door are intensely attacked, as given in Fig. 14(b).
In effect, the model considers the doors the most important features for classifying the original image as "convertible" rather than "sports car:" both classes have similar wheels but different doors. In order to fool the model, attacking the wheels is not a top priority; it is the doors that make the difference between the two classes. After blurring the segments of interest to the model, i.e., the door segment (Fig. 14(c)), and feeding the image to the model, the predicted label changes from "convertible" to "sports car," which confirms that the doors are the major features supporting the prediction made by the model. Using adversarial attacks as the force behind producing the explanations helps find the important features that are not only globally important to the model (doors are important features of cars, and other classes do not have doors similar to those of cars), but also locally important (within the car category, doors are the features that make the difference between a convertible and a sports car).
There are also explainable features that models pick up on but humans can hardly understand. tsipras2018robustness introduced the concept of robust and non-robust features, indicating that there are features that humans ignore but models are sensitive to; they call these non-robust features. Non-robust features are features that can easily be manipulated by an attacker in order to fool the model. Robust features are features that are important to both the model and humans and, at the same time, resilient to small adversarial manipulations.
Appendix E Further Experiment Results
E.1 Explaining an image classification model
Fig. 16 shows two examples of the explanations produced by AXAI for image samples from the ImageNet deng2009imagenet test dataset for a ResNet34 trained on the ImageNet training dataset. In the first example, Fig. 15(a), the explanation results clearly show that the round control panel of an iPod is an important feature that helps the model identify an iPod in the image. The second example, Fig. 15(c), shows how the model recognizes that there are two cats in the image (one being the reflection of the cat in the mirror).
The CIFAR10 dataset kaur2018convolutional consists of $32 \times 32$ pixel images, which are low-resolution compared to ImageNet. Fig. 17 shows the explanations produced by AXAI for sample images from the CIFAR10 dataset for an AlexNet image classification model trained on the CIFAR10 training dataset. For CIFAR10, our explanations clearly separate the background and capture the target object. The explanation given in Fig. 16(b) shows that the head of the horse with the leather halter is recognized by the model, while the white fence behind the horse is completely ignored; this indicates that the model is well-trained. Similarly, in Fig. 16(d) the ear and head of the deer help the model classify the image correctly into the deer class. Images from the CIFAR10 dataset are easily explained due to the nature of the dataset: most objects are located in the middle of the image, and most images lack a noisy background.
E.2 Explaining an object detection model
We present two examples of explanations produced by our algorithm for a YOLOv3 object detection model trained on the SpaceNet Building Dataset van2018spacenet to detect buildings in overhead imagery. The produced explanations are clearly focused on areas where buildings are located and ignore empty spaces in the images, such as the top left corner of Fig. 17(b). Further, as seen in Fig. 17(d), the roads are ignored and only the buildings and their contours affect the predictions made by the object detector.
E.3 Further details on the speech recognition experiment
The Speech Commands Dataset warden2018speech is an audio dataset of short spoken words, such as "Right," "Three," and "Bed." The audio files are converted to spectrograms and used to train a LeNet for a command recognition task. A spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time; Fig. 4(a) is an example. The y-axis of the spectrograms, the frequency, is presented on a log scale, the x-axis represents time, and the color bar shows the magnitude. Fig. 4(a) is the frequency spectrum of a human speaking the word "Right." In the time interval 0.4s to 1.1s, high magnitudes are present in the spectrum; in other words, the speaker pronounces the word "Right" around 0.4s to 1.1s into the recorded audio file. This is how one reads a spectrogram. Our explainable solution takes audio files as input, converts them into spectrograms, and then generates the corresponding explanations. So if one feeds AXAI an audio file of a human speaking "Right," AXAI first transforms the audio into the spectrogram shown in Fig. 4(a), and then produces the explanation in Fig. 4(b). The explanation has the exact same scale as the input and simply masks out the unimportant parts of the spectrogram. To read the explanations, one can refer to the original spectrogram input, Fig. 4(a), find where the audio is located in the spectrogram (for example, by looking at the magnitudes), and then look at the corresponding location in the explanation in Fig. 4(b).
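The audio-to-spectrogram step can be reproduced with `scipy.signal.spectrogram`; the clip below is a synthetic tone standing in for a spoken word, silent for its first third.

```python
import numpy as np
from scipy import signal

fs = 16_000  # Speech Commands clips are sampled at 16 kHz
t = np.linspace(0, 1, fs, endpoint=False)
audio = np.sin(2 * np.pi * 440 * t)  # a pure tone standing in for a word
audio[: fs // 3] = 0.0               # silence before the "word" starts

freqs, times, Sxx = signal.spectrogram(audio, fs=fs)
# Sxx has shape (len(freqs), len(times)); energy concentrates where and when
# the tone is present, which is what the explanations are read against
quiet = Sxx[:, times < 0.3].sum()
loud = Sxx[:, times > 0.4].sum()
print(loud > quiet)  # True
```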
The explanations of two examples are presented in Fig. 5. The spectrogram of the first example, "Right," and its explanation are shown in Fig. 4(a) and Fig. 4(b). One can see from Fig. 4(a) that the spoken word "Right" appears between 0.4s and 1.1s in the spectrogram of the audio file. Looking at the corresponding explanation, only the time intervals 0.4s to 0.5s, 0.5s to 0.6s, and 1.0s to 1.2s are not masked out by AXAI, meaning that these intervals of the audio have great importance for the prediction made by the model. Looking back at Fig. 4(a), one realizes that the explanation shows that the beginning and the end of the spoken word "Right" are important to the model, while the middle part is not. Why is that? The neighboring class of "Right" is "Five." "Right" and "Five" differ in how "R" and "F" and "t" and "ve" are pronounced; the middle parts of "Five" and "Right" are highly similar and do not affect the model's decision on whether the spoken word is "Five" or "Right." The second example is "Three." As seen in the spectrogram, Fig. 4(c), "Three" is expressed around the time interval 1.4s to 2.2s of the audio file. The corresponding explanation is shown in Fig. 4(d); it masks out almost everything except 1.4s to 1.6s, a small part of 1.6s to 1.7s, and 1.9s to 2.2s. Now, let us look at the original spectrogram of "Three" and understand what the explanation means. The explanation highlights 1.4s to 1.6s, the beginning of the spoken word. To understand why, note that if we attack the model, "Three" is misclassified as "Tree." This indicates that the model has learned to distinguish "Three" from "Tree" by learning the difference between "Thr" and "Tr;" the explanation tells us that the beginning of the audio (the utterance of "Thr") is important.
E.4 Ablation study
If a feature or a group of features is important to a model, then completely removing those features from the input should decrease the probability of a correct prediction. Accordingly, we performed an ablation study confirming that the explanations produced by AXAI contain important features. This ablation method can be used to test the accuracy of an explainability solution: if the generated explanation is faithful to the model, then removing the explanations should decrease the accuracy of the predictions. In this section, we demonstrate a simple experiment to validate our algorithm. Our experiment is performed as follows: 1) generate the explanation of a targeted image via AXAI; 2) blur the top 5 explanation segments of the targeted image according to the produced explanations, feed the modified image to the model, and obtain its label; 3) repeat this process throughout the test dataset; 4) calculate the total decrease in accuracy. We use a ResNet34 trained on ImageNet for this experiment and report the results for the entire ImageNet test dataset. Our results show that the prediction accuracy of the DNN drops sharply after blurring the top 5 explanation segments. To investigate further, instead of blurring the top 5 explanations, we blur only the 6th to 10th explanations; this results in a smaller drop in total accuracy. Hence, we can conclude that 1) AXAI generates faithful explanations, so that blurring the top explanations (the 1st to 5th) leads to a strong decrease in model prediction accuracy, and 2) AXAI generates faithful explanations in order of importance, i.e., the 6th to 10th explanations are also important to the model, but their influence on the predictions is relatively less than that of the first 5.
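The four steps above can be sketched framework-agnostically. In the toy example below, the predict function, segment map, and explanation ranking are all hypothetical, and mean-filling stands in for Gaussian blurring.

```python
import numpy as np

def blur_segments(image, segments, segment_ids):
    """Replace the pixels of the chosen segments with the image mean
    (a cheap stand-in for Gaussian blurring)."""
    out = image.copy()
    for sid in segment_ids:
        out[segments == sid] = image.mean()
    return out

def ablation_accuracy(predict, samples, top):
    """Steps 2-4: blur the `top`-ranked segments of every image,
    re-classify, and return the remaining accuracy."""
    correct = 0
    for image, label, segments, ranking in samples:
        ablated = blur_segments(image, segments, [ranking[i] for i in top])
        correct += int(predict(ablated) == label)
    return correct / len(samples)

# toy setup: a 2x2 grid of segments; the "model" looks only at segment 0
segments = np.kron(np.array([[0, 1], [2, 3]]), np.ones((4, 4), dtype=int))
image = np.zeros((8, 8))
image[:4, :4] = 1.0
predict = lambda im: int(im[:4, :4].mean() > 0.5)
samples = [(image, 1, segments, [0, 1, 2, 3])]  # segment 0 ranked first

print(ablation_accuracy(predict, samples, top=range(1)),     # blur top-1 -> 0.0
      ablation_accuracy(predict, samples, top=range(3, 4)))  # blur 4th   -> 1.0
```

Blurring the top-ranked segment destroys the prediction, while blurring a low-ranked one leaves it intact, mirroring the accuracy pattern reported above.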
E.5 AXAI explanations for a robust model trained with adversarial training
In this subsection, we compare the explanations produced for a robust model to those produced for a non-robust model. In our experiments, a robust model is a model trained on an adversarial dataset in addition to the training dataset, so that the final trained model is more robust against adversarial attacks. Hypothetically, a robust model should focus more on robust important input features when making predictions. We trained a non-robust AlexNet and a robust AlexNet on CIFAR10 and produced explanations using AXAI for test inputs. Fig. 19 shows the AXAI-produced explanations for a sample input. A small part of the background is included in the explanations produced for the non-robust AlexNet, whereas the AXAI-generated explanations for the robust model include only the important features pertaining to the object in the image; in addition, the leg of the deer is now included in the explanations as well. We conclude that the explanations produced for the robust DNN are sharper, clearer, and more robust than those generated for the regularly trained DNN.
E.6 Additional Examples
In this section, we provide additional explainability results from using AXAI on an AlexNet image classification model trained on CIFAR10, a VGG16 image classification model trained on CIFAR100, a ResNet34 image classification model trained on ImageNet, the LeNet speech recognition model, and the sentence classification model (Fig. 20, Fig. 21, Fig. 22, Fig. 23, and Fig. 24, respectively).
Appendix F Benchmark Tests
We test our algorithm against LIME and SHAP. We use the "Gradient Explainer" in SHAP, which integrates the Integrated Gradients algorithm with SHAP. Fig. 25 shows sample comparisons among the three algorithms for three cases: 1) AlexNet trained on CIFAR10, 2) ResNet34 trained on ImageNet, and 3) VGG16 trained on CIFAR100. PGD with 20 iterations is used in our algorithm. For ImageNet, explanations for a sample test picture belonging to "Egyptian cat" are shown in Fig. 24(a), Fig. 24(b), and Fig. 24(c). One can see the similarity between the explanations: all three algorithms focus on the upper left of the image, which contains the eyes of the "Egyptian cat." Both LIME and our algorithm point to the same segment as explanations, while SHAP (Gradient Explainer) locates pixels of interest; the important pixels shown in this case align with the results of LIME and AXAI. Since the default image segmentation parameters of LIME do not produce a suitable number of segments for CIFAR10 and CIFAR100 due to the resolution of the images, we lowered the Kernel size parameter to 1; the default Kernel size LIME uses for QuickShift is too large for low-resolution images, which, as mentioned before, leads to a few very large segments and neglects the granular details in the image. For CIFAR10, both our approach and LIME capture the upper portion of the head of the horse, including the ears and eyes (Fig. 24(d), Fig. 24(e)). The results of SHAP point out important pixels located on the head and the nose, along with some pixels in the background (Fig. 24(f)). For CIFAR100, the explanations produced by the three algorithms are once again highly similar (Fig. 24(g), Fig. 24(h), and Fig. 24(i)). One can see that in many cases, pixel explanations do not serve as the best solution.
Without segments, it is hard to grasp the meaning behind the explanations; the human brain tends to comprehend image segments better than individual pixels.
Footnotes
 Code will be readily available.