# Certifiably Robust Interpretation in Deep Learning

Alexander Levine, Sahil Singla, Soheil Feizi

Corresponding author: sfeizi@cs.umd.edu

###### Abstract

Although gradient-based saliency maps are popular methods for deep learning interpretation, they can be extremely vulnerable to adversarial attacks. This is worrisome especially due to the lack of practical defenses for protecting deep learning interpretations against attacks. In this paper, we address this problem and provide two defense methods for deep learning interpretation. First, we show that a sparsified version of the popular SmoothGrad method, which computes the average saliency maps over random perturbations of the input, is certifiably robust against adversarial perturbations. We obtain this result by extending recent bounds for certifiably robust smooth classifiers to the interpretation setting. Experiments on ImageNet samples validate our theory. Second, we introduce an adversarial training approach to further robustify deep learning interpretation by adding a regularization term to penalize the inconsistency of saliency maps between normal and crafted adversarial samples. Empirically, we observe that this approach not only improves the robustness of deep learning interpretation to adversarial attacks, but it also improves the quality of the gradient-based saliency maps.

## 1 Introduction

The growing use of deep learning in many sensitive areas like autonomous driving, medicine, finance and even the legal system ([1, 2, 3, 4]) raises concerns about human trust in machine learning systems. Therefore, having interpretations for why certain predictions are made is critical for establishing trust between users and the machine learning system.

In the last couple of years, several approaches have been proposed for interpreting neural network outputs ([5, 6, 7, 8, 9]). Specifically, [5] computes the elementwise absolute value of the gradient of the largest class score with respect to the input. To define some notation, let $g(x)$ be this most basic form of the gradient-based saliency map, for an input image $x \in \mathbb{R}^n$. For simplicity, we also assume that elements of $g(x)$ have been linearly normalized to be between $0$ and $1$. $g(x)$ represents, to a first-order linear approximation, the importance of each pixel in determining the class label (see Figure 1-a). Numerous variations of this method have been introduced in the last couple of years, which we review in the appendix.

A popular saliency map method which extends the basic gradient method is SmoothGrad [10], which takes the average gradient over random perturbations of the input. Formally, we define the smoothing function as:

$$\bar{g}(x) := \mathbb{E}_{\epsilon}\left[g(x+\epsilon)\right], \tag{1.1}$$

where $\epsilon$ has a normal distribution (i.e. $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$). We will discuss other smoothing functions in Section 3.1, while the empirical smoothing function, which computes the average over finitely many perturbations of the input, will be discussed in Section 3.3. We refer to the basic method described in the above equation as the scaled SmoothGrad. (Footnote 1: The original definition of SmoothGrad does not normalize and take the absolute values of gradient elements before averaging. We start with the definition of Equation 1.1 since our results are easier to explain for it than for a more general case. We discuss a more general case in Section 3.)

Having a robust interpretation method is important since interpretation results are often used in downstream actions such as medical recommendations, object localization, program debugging and safety, etc. However, [11] has shown that several gradient-based interpretation methods are sensitive to adversarial examples, obtained by adding a small perturbation to the input image. These adversarial examples maintain the original class label while greatly distorting the saliency map (Figure 1-a).

Although adversarial attacks and defenses on image classification have been studied extensively in recent years (e.g. [12, 13, 14, 15, 16, 17, 18, 19, 20, 21]), to the best of our knowledge, there is no practical defense for deep learning interpretation against adversarial examples [22]. This is partially due to the difficulty of protecting high-dimensional saliency maps compared to defending a class label, as well as to the lack of a ground truth for interpretation.

Since a ground truth for interpretation is not available, we use a similarity metric between the original and perturbed saliency maps as an estimate of the interpretation robustness. We define $R(x, \tilde{x}, K)$ as the number of overlapping elements between the top $K$ largest elements of the saliency maps of $x$ and its perturbed version $\tilde{x}$. For an input $x$, this measure depends on its specific perturbation $\tilde{x}$. We define $R^*(x, K)$ as the robustness measure with respect to the worst perturbation of $x$. That is,

$$R^*(x, K) := \min_{\tilde{x}} \; R(x, \tilde{x}, K) \quad \text{s.t.} \quad \|\tilde{x} - x\|_2 \le \rho. \tag{1.2}$$

For deep learning models, this optimization is non-convex in general. Thus, characterizing the true robustness of interpretation methods will be a daunting task.
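To make the overlap measure concrete, the following minimal sketch computes $R(x, \tilde{x}, K)$ for two saliency vectors that are assumed to be already computed. The function names and the toy values are illustrative assumptions, and ties are broken by index, which the text does not specify.

```python
def top_k_indices(saliency, k):
    """Indices of the k largest entries; ties are broken by lower index."""
    order = sorted(range(len(saliency)), key=lambda i: (-saliency[i], i))
    return set(order[:k])


def top_k_overlap(saliency, saliency_perturbed, k):
    """R(x, x~, K): size of the intersection of the two top-K index sets."""
    return len(top_k_indices(saliency, k) & top_k_indices(saliency_perturbed, k))


# Hypothetical saliency values for an input and one perturbed version of it.
original = [0.9, 0.1, 0.8, 0.3, 0.7]
perturbed = [0.2, 0.9, 0.8, 0.1, 0.7]
overlap = top_k_overlap(original, perturbed, k=3)
```

Computing $R^*(x, K)$ itself would require minimizing this quantity over all admissible $\tilde{x}$, which is the non-convex problem discussed above; the sketch only evaluates a single perturbation.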

In our first main result, we show that a lower bound on the true robustness value of an interpretation method (i.e. a robustness certificate) can be computed efficiently. In other words, for a given input $x$, we compute a robustness certificate $R^{cert}(x, K)$ such that $R^{cert}(x, K) \le R^*(x, K)$. To establish the robustness certificate for saliency map methods, we first prove the following result for a general function $h(x)$ whose range is between 0 and 1:

###### Theorem 1.

Let $h(x)$ be the output of an interpretation method whose range is between 0 and 1 and let $\bar{h}$ be its smoothed version defined as in Equation 1.1. Let $\bar{h}_i(x)$ and $\bar{h}_{[i]}(x)$ be the $i$-th element and the $i$-th largest element of $\bar{h}(x)$, respectively. Let $\Phi$ be the cdf of the standard normal distribution. If

$$\Phi\left(\Phi^{-1}\left(\bar{h}_{[i]}(x)\right) - \frac{2\rho}{\sigma}\right) \ge \bar{h}_{[2K-i]}(x), \tag{1.3}$$

then for the smoothed interpretation method, we have $R^*(x, K) \ge i$.

Intuitively, this means that, if there is a sufficiently large gap between the $i$-th largest element of the smoothed saliency map and its $(2K-i)$-th largest element, then we can certify that at least $i$ elements in the top $K$ largest elements of the original smoothed saliency map will also be in the top $K$ elements of the adversarially perturbed saliency map. We present a more general version of this result with empirical expectations for smoothing, as well as another rank-based robustness certificate, in Section 3. The proof of this bound relies on an extension of the results of [23], which addresses certified robustness in the classification case. Proofs for all theorems are given in the Appendix.

Evaluating the robustness certificate for the scaled SmoothGrad method on ImageNet samples produced vacuous bounds (Figure 1-b). This motivated us to develop variations of SmoothGrad with larger robustness certificates. One such variation is Sparsified SmoothGrad, which is defined by smoothing a sparsification function that maps the largest elements of $g(x)$ to one and the rest to zero. Sparsified SmoothGrad obtains a considerably larger value of the robustness certificate (Figure 1-b) while producing high-quality saliency maps. We study other variations of Sparsified SmoothGrad in Section 3.

Our second main result in this paper is to develop an adversarial training approach to further robustify deep learning interpretation methods. Adversarial training is a common technique used to improve the robustness of classification models, by generating adversarial examples to the classification model during training, and then re-training the model to correctly classify these examples [21].

To the best of our knowledge, adversarial training has not yet been adapted to the interpretation domain. In this paper, we develop an adversarial training approach for the interpretation problem in two steps: First, we develop an adversarial attack on the interpretation as the extension of the attack introduced in [11]. We use the developed attack to craft adversarial examples to saliency maps during training. Second, we re-train the network by adding a regularization term to the training loss that penalizes the inconsistency of saliency maps between normal and crafted adversarial samples.

Empirically, we observe that our proposed adversarial training for interpretation significantly improves the robustness of saliency maps to adversarial attacks. Interestingly, we also observe that our proposed adversarial training improves the quality of the gradient-based saliency maps as well (Figure 2). We note that this observation is related to the observation made in [24] showing that adversarial training for classification improves the quality of the gradient-based saliency maps.

## 2 Preliminaries and Notation

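We introduce the following notation to indicate Gaussian smoothing: for a function $h$, we define the population and empirical smoothed functions, respectively, as:

$$\bar{h}(x) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \sigma^2 I)}\left[h(x+\epsilon)\right], \qquad \tilde{h}(x) = \frac{1}{q}\sum_{i=1}^{q} h(x+\epsilon_i), \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2 I) \tag{2.1}$$

In other words, $\bar{h}(x)$ represents the expected value of $h(x)$ when smoothed under normal perturbations of $x$ with some standard deviation $\sigma$, while $\tilde{h}(x)$ represents an empirical estimate of $\bar{h}(x)$ using $q$ samples. We call $\sigma^2$ the smoothing variance and $q$ the number of smoothing perturbations.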
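As an illustration of the empirical smoothed function $\tilde{h}$, the sketch below averages a saliency function over $q$ Gaussian perturbations. The function `toy_saliency` is a hypothetical stand-in for a real gradient-based saliency map (it is not a network gradient), and all numerical values are arbitrary.

```python
import random


def toy_saliency(x):
    """Hypothetical stand-in for h(x): |x_i|, scaled so the largest entry is 1."""
    magnitudes = [abs(v) for v in x]
    top = max(magnitudes) or 1.0  # avoid dividing by zero for an all-zero input
    return [m / top for m in magnitudes]


def empirical_smooth(h, x, sigma, q, rng):
    """Average h(x + eps_i) over q draws eps_i ~ N(0, sigma^2 I)."""
    total = [0.0] * len(x)
    for _ in range(q):
        noisy = [v + rng.gauss(0.0, sigma) for v in x]
        for i, value in enumerate(h(noisy)):
            total[i] += value
    return [t / q for t in total]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
x = [2.0, -0.5, 0.1, 1.5]
smoothed = empirical_smooth(toy_saliency, x, sigma=0.25, q=512, rng=rng)
```

Because each $h(x + \epsilon_i)$ lies in $[0,1]$, so does every entry of the average, which is the property the certificates in Section 3 rely on.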

We use $x_i$ to denote the $i$-th element of the vector $x$. Similarly, $h_i(x)$ denotes the $i$-th element of the output $h(x)$. We also define, for any $i \in [n]$, $\mathrm{rank}(x, i)$ as the ordinal rank of $x_i$ in $x$ (in descending order): $\mathrm{rank}(x, i) = j$ denotes that $x_i$ is the $j$-th largest element in $x$. We use $x_{[i]}$ to denote the $i$-th largest element in $x$. If $i$ is not an integer, the ceiling of $i$ is used. We use $n$ to denote the dimension of the input.

## 3 Smoothing for Certifiable Robustness

### 3.1 Sparsified SmoothGrad

In this section, we will derive general bounds which allow us to certify the robustness for a large class of smoothed saliency map methods. These bounds are applicable to any saliency map method whose range is $[0,1]$. Note that while SmoothGrad [10] is similar to such methods, it requires some modifications for our bounds to be directly applicable. In particular, [10] defines two methods, which we will call SmoothGrad and Quadratic SmoothGrad. SmoothGrad takes the mean over samples of the signed gradient values, with the absolute value typically taken after smoothing for visualization. Quadratic SmoothGrad takes the mean of the elementwise squares of gradient values. Both methods therefore require modification for our bounds to be applied: we define scaled SmoothGrad $\tilde{g}(x)$, such that $g(x)$ is the elementwise absolute value of the gradient, linearly scaled so that the largest element is one. We can similarly define a scaled Quadratic SmoothGrad.

We first realized that scaled SmoothGrad and Quadratic SmoothGrad give vacuous robustness certificate bounds, as we demonstrated in Figure 1. Instead, we developed a new method, Sparsified SmoothGrad, which has (1) non-vacuous robustness certificates at ImageNet scale (Figure 3(a)), (2) similar high-quality visual output to SmoothGrad, and (3) theoretical guarantees that aid in setting its hyper-parameters (Section 3.5).

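The Sparsified SmoothGrad is defined as $\tilde{g}^{[\tau]}(x)$, where $g^{[\tau]}(x)$ is defined as follows:

$$g^{[\tau]}_i(x) = \begin{cases} 0, & \text{if } g_i(x) < g_{[\tau n]}(x) \\ 1, & \text{if } g_i(x) \ge g_{[\tau n]}(x) \end{cases}$$

In other words, $\tau$ controls the degree of sparsification: a fraction $\tau$ of elements (the largest $\tau n$ elements of $g(x)$) are assigned to $1$, and the rest are set to $0$.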
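The sparsification step can be sketched as follows. This is an illustrative reading of the definition above, not the authors' code: the function name is an assumption, and ties at the threshold are broken by index so that exactly $\lceil \tau n \rceil$ entries are kept.

```python
import math


def sparsify(saliency, tau):
    """Map the ceil(tau * n) largest entries to 1 and all other entries to 0."""
    n = len(saliency)
    keep = math.ceil(tau * n)
    # Rank indices by descending value; ties are broken by lower index.
    order = sorted(range(n), key=lambda i: (-saliency[i], i))
    kept = set(order[:keep])
    return [1.0 if i in kept else 0.0 for i in range(n)]


# Hypothetical scaled gradient magnitudes for a four-element input.
sparse_map = sparsify([0.9, 0.1, 0.8, 0.3], tau=0.5)
```

Sparsified SmoothGrad would then average such binary maps over Gaussian perturbations of the input, so each entry of the smoothed map is the fraction of perturbations in which that element ranked in the top $\lceil \tau n \rceil$.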

### 3.2 Robustness Certificate for the Population Case

In order to derive a robustness certificate for saliency maps, we present an extension of the classification robustness result of [23] to real-valued functions, rather than discrete classification functions. In our case, we will apply this to the saliency map vector $h(x)$. First, we define a floor function to simplify notation.

###### Definition 3.1.

(Floor function) The floor function is a function $L: [0,1] \to [0,1]$, such that

$$L(z) = \Phi\left(\Phi^{-1}(z) - \frac{2\rho}{\sigma}\right)$$

where $\rho$ denotes the $\ell_2$ norm of the adversarial distortion and $\sigma^2$ denotes the smoothing variance. $\Phi$ is the cdf of the standard normal distribution and $\Phi^{-1}$ is its inverse.
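The floor function is straightforward to evaluate numerically. The sketch below uses Python's standard-library `statistics.NormalDist` for $\Phi$ and $\Phi^{-1}$; the function names and the explicit handling of the endpoints $z \in \{0, 1\}$ are assumptions made for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def floor_fn(z, rho, sigma):
    """L(z) = Phi(Phi^{-1}(z) - 2 * rho / sigma), with L(0) = 0 and L(1) = 1."""
    if z <= 0.0:
        return 0.0
    if z >= 1.0:
        return 1.0
    return STD_NORMAL.cdf(STD_NORMAL.inv_cdf(z) - 2.0 * rho / sigma)


def certifies_order(h_bar_i, h_bar_j, rho, sigma):
    """Check L(h_bar_i) >= h_bar_j, the sufficient condition for i to stay above j."""
    return floor_fn(h_bar_i, rho, sigma) >= h_bar_j
```

For a fixed $z$, $L(z)$ shrinks as the ratio $\rho/\sigma$ grows, so a larger allowed distortion requires a larger gap between two smoothed saliency values before their order can be certified.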

Below is our main result used in characterizing robustness certificates for interpretation methods:

###### Theorem 2.

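Let $h: \mathbb{R}^n \to [0,1]^n$ be a real-valued function. Let $L$ be the floor function defined as in Definition 3.1 with parameters $\rho$ and $\sigma$. Using $\sigma^2$ as the smoothing variance for $\bar{h}$, for all $i, j \in [n]$ and all $\tilde{x}$ where $\|\tilde{x} - x\|_2 \le \rho$:

$$L(\bar{h}_i(x)) \ge \bar{h}_j(x) \implies \bar{h}_i(\tilde{x}) \ge \bar{h}_j(\tilde{x})$$

Note that this theorem is valid for any general function. However, we will use it for our case where $\bar{h}$ is a smoothed saliency map. Theorem 2 states that, for a given saliency map vector $h(x)$, if $L(\bar{h}_i(x)) \ge \bar{h}_j(x)$, then if $x$ is perturbed inside an $\ell_2$ norm ball of radius at most $\rho$, $\bar{h}_i(\tilde{x}) \ge \bar{h}_j(\tilde{x})$.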

This result extends Theorem 1 in [23] in two ways: first, it provides a guarantee about the difference in the values of two quantities, which in general might not be related, while the original result compared probabilities of two mutually exclusive events. Second, we are considering a real-valued function , rather than a classification output which can only take discrete values. This bound can be compared directly to [25]’s result which similarly concerns unrelated elements in a vector. Just as in the classification case (as noted by [23]), Theorem 2 gives a significantly tighter bound than that of [25] (see details in the appendix).

### 3.3 Robustness Certificate for the Empirical Case

In this section, we extend our robustness certificate result of Theorem 2 to the case where we use empirical estimates of smoothed functions. Following [25], we derive upper and lower bounds of the expected value function $\bar{h}(x)$ in terms of $\tilde{h}(x)$, by applying Hoeffding's inequality. To present our result for the empirical case, we first define an empirical floor function to derive a similar lower bound when the population mean is estimated using a finite number of samples:

###### Definition 3.2.

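(Empirical Floor function) The empirical floor function is a function $\hat{L}$, such that for given values of $\rho, \sigma, p, q, n$, where $\rho$ denotes the maximum $\ell_2$ distortion, $\sigma^2$ denotes the smoothing variance, $p$ denotes the probability bound, $q$ denotes the number of perturbations, and $n$ is the size of the input of the function:

$$\hat{L}(z) = \Phi\left(\Phi^{-1}(z - c) - \frac{2\rho}{\sigma}\right) - c, \quad \text{where } c = \sqrt{\frac{\ln\left(2n(1-p)^{-1}\right)}{2q}}$$

###### Corollary 1.

Let $h$ be a function and let $\tilde{h}$ be its empirical smoothed version for given values of $q$ and $\sigma$. Then, for all $i, j \in [n]$ and all $\tilde{x}$ with $\|\tilde{x} - x\|_2 \le \rho$, with probability at least $p$,

$$\hat{L}(\tilde{h}_i(x)) \ge \tilde{h}_j(x) \implies \bar{h}_i(\tilde{x}) \ge \bar{h}_j(\tilde{x}) \tag{3.2}$$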

Note that unlike the population case, this certificate bound is probabilistic. Another consequence of Theorem 2 is that it allows us to derive certificates for the top-$K$ overlap (denoted by $R$). In particular:

###### Corollary 2.

For any $\tilde{x}$ with $\|\tilde{x} - x\|_2 \le \rho$, define $R^{cert}(x, K)$ as the largest $i \le K$ such that $\hat{L}(\tilde{h}_{[i]}(x)) \ge \tilde{h}_{[2K-i]}(x)$. Then, with probability at least $p$,

$$R^{cert}(x, K) \le R(x, \tilde{x}, K). \tag{3.3}$$

Intuitively, if there is a sufficiently large gap between the $i$-th and $(2K-i)$-th largest elements of the empirical smoothed saliency map, then we can certify that the overlap between the top $K$ elements of the original and perturbed population smoothed saliency maps is at least $i$ with probability at least $p$.
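The certificate of Corollary 2 can be evaluated directly from an empirical smoothed saliency vector. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the estimation error $c$ is passed in rather than derived from $n$, $p$ and $q$, the certificate is taken to be $0$ when no $i$ satisfies the condition, and the input is required to have at least $2K - 1$ elements.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def empirical_floor(z, rho, sigma, c):
    """L^(z) = Phi(Phi^{-1}(z - c) - 2 * rho / sigma) - c."""
    shifted = z - c
    if shifted <= 0.0:
        return -c          # Phi(-infinity) = 0
    if shifted >= 1.0:
        return 1.0 - c     # Phi(+infinity) = 1
    return STD_NORMAL.cdf(STD_NORMAL.inv_cdf(shifted) - 2.0 * rho / sigma) - c


def certified_overlap(smoothed, k, rho, sigma, c):
    """Largest i <= K with L^(h~_[i]) >= h~_[2K - i]; 0 if no such i exists."""
    ranked = sorted(smoothed, reverse=True)  # ranked[r - 1] is the r-th largest value
    if 2 * k - 1 > len(ranked):
        raise ValueError("need at least 2K - 1 elements")
    best = 0
    for i in range(1, k + 1):
        if empirical_floor(ranked[i - 1], rho, sigma, c) >= ranked[2 * k - i - 1]:
            best = i
    return best


# Hypothetical smoothed Sparsified SmoothGrad values for a six-element input.
smoothed = [1.0, 1.0, 0.95, 0.05, 0.02, 0.0]
cert = certified_overlap(smoothed, k=3, rho=0.1, sigma=1.0, c=0.01)
```

In this toy example the two largest entries are far above the fourth and fifth, so two of the top three elements are certified; the third entry cannot be certified because the condition at $i = K$ compares an element with itself.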

Note that we can apply Corollary 2 directly to SmoothGrad (or Quadratic SmoothGrad), simply by scaling the components of the saliency map to lie in the interval $[0,1]$. However, we observe that this gives vacuous bounds for both of them when using the suggested hyperparameters from [10]. One issue is that the suggested value for $q$ (number of perturbations) is too small to give useful bounds in Corollary 1: for a standard-size image from the ImageNet dataset, it yields a large estimation error $c$ (using Definition 3.2). Note that even for a small $\rho$:

$$\hat{L}(z) = \Phi\left(\Phi^{-1}(z - c) - \frac{2\rho}{\sigma}\right) - c \approx \Phi\left(\Phi^{-1}(z - c)\right) - c = z - 2c$$

Thus the gap between $\tilde{h}_i(x)$ and $\tilde{h}_j(x)$ must be at least $2c$. We can see from Corollaries 1 and 2 that a gap of this size (on a scale of 1) is far too large to be of any practical use. We instead take a much larger $q$, which gives a more manageable estimation error $c$. However, we found that even with this adjustment, the bounds computed using Corollary 2 are not satisfactory for either scaled SmoothGrad or scaled Quadratic SmoothGrad (see details in the appendix). This prompted the development of Sparsified SmoothGrad described in Section 3.1.
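To see how the estimation error $c$ from Definition 3.2 depends on the number of perturbations, the sketch below evaluates it for a few values of $q$. The input size, probability bound and values of $q$ are illustrative assumptions chosen for this example and are not taken from the experiments.

```python
import math


def estimation_error(n, p, q):
    """c = sqrt(ln(2n / (1 - p)) / (2q)) from Definition 3.2."""
    return math.sqrt(math.log(2.0 * n / (1.0 - p)) / (2.0 * q))


n = 224 * 224   # assumed saliency-map size (one value per pixel)
p = 0.95        # assumed probability bound
for q in (50, 512, 8192):  # illustrative numbers of smoothing perturbations
    c = estimation_error(n, p, q)
    print(f"q={q:5d}  c={c:.3f}  minimum gap 2c={2 * c:.3f}")
```

Since $c$ scales as $1/\sqrt{q}$ and only logarithmically in $n$, reducing the minimum certifiable gap by a factor of ten requires roughly one hundred times more perturbations.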

### 3.4 Relaxed Sparsified SmoothGrad

For some applications, it may be desirable to have at least some differentiable elements in the computed saliency map. For this purpose, we also propose Relaxed Sparsified SmoothGrad:

$$g^{[\gamma,\tau]}_i(x) = \begin{cases} 0, & \text{if } g_i(x) < g_{[\tau n]}(x) \\ \text{linearly scaled between } 0 \text{ and } 1, & \text{if } g_{[\tau n]}(x) \le g_i(x) < g_{[\gamma n]}(x) \\ 1, & \text{if } g_i(x) \ge g_{[\gamma n]}(x) \end{cases}$$

Here, $\tau$ controls the degree of sparsification and $\gamma$ controls the degree of clipping: a fraction $\gamma$ of elements are clipped to 1. Elements neither clipped nor sparsified are linearly scaled between $0$ and $1$. Note that Relaxed Sparsified SmoothGrad is a generalization of Sparsified SmoothGrad. With no clipping ($\gamma = 0$), we again achieve nearly vacuous results. However, with only a small degree of clipping, we achieve results very similar to (although slightly worse than) Sparsified SmoothGrad; see Figure 3(b). We use Relaxed Sparsified SmoothGrad in this paper to test the performance of first-order adversarial attacks against Sparsified SmoothGrad-like techniques.

### 3.5 Robustness Certificate based on Median Saliency Ranks

In this section, we show that if the median rank of a saliency map element over smoothing perturbations is sufficiently small (i.e. near the top rank), then for an adversarially perturbed input, that element will certifiably remain near the top rank of the proposed Sparsified SmoothGrad method with high probability. This provides another theoretical reason for the robustness of the Sparsified SmoothGrad method.

To present this result, we first define the certified rank of an element in the saliency map as follows:

###### Definition 3.3 (Certified Rank).

For a given input $x$ and a given saliency map method (denoted by $h$), let the maximum $\ell_2$ adversarial distortion be $\rho$, i.e. $\|\tilde{x} - x\|_2 \le \rho$. Then, for a probability $p$, the certified rank for an element at index $i$ (denoted by $\mathrm{rank}^{cert}(x, i)$) is defined as the minimum $k$ such that the condition:

$$\hat{L}(\tilde{h}_i(x)) \ge \tilde{h}_{[k]}(x)$$

holds.

If the $i$-th element of the saliency map has a certified rank of $k$, using Corollary 1, we will have:

$$\bar{h}_i(\tilde{x}) \ge \bar{h}_{[k]}(\tilde{x}) \quad \text{with probability at least } p.$$

That is, the $i$-th element of the population smoothed saliency map is guaranteed to be at least as large as the smallest $n - k + 1$ elements of the smoothed saliency map of any adversarially perturbed input.

Note that the certified rank depends on the particular perturbations used to generate the smoothed saliency map $\tilde{h}(x)$. In the following result, we show that if the median rank of a gradient element at index $i$, over a set of randomly generated perturbations, is less than a specified threshold value, then the certified rank of that element in the Sparsified SmoothGrad saliency map generated using those perturbations can be upper bounded.

###### Theorem 3.

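Let $U$ be the set of $q$ random perturbations for a given input $x$ using the smoothing variance $\sigma^2$. Using the Sparsified SmoothGrad method, for probability $p$, we have

$$\mathrm{Median}_{\epsilon \in U}\left[\mathrm{rank}(g(x+\epsilon), i)\right] \le \lceil \tau n \rceil \implies \mathrm{rank}^{cert}(x, i) \le \frac{\lceil \tau n \rceil}{\hat{L}\left(\frac{1}{2}\right)}, \tag{3.5}$$

where $\tau$ is the sparsification parameter of the Sparsified SmoothGrad method.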

For instance, for a sufficiently large number of smoothing perturbations (i.e. $q \to \infty$), we have $c \to 0$ and therefore $\hat{L}(1/2) \to \Phi(-2\rho/\sigma)$. Then, for indices whose median ranks are less than or equal to $\lceil \tau n \rceil$, their certified ranks will be less than or equal to $\lceil \tau n \rceil / \Phi(-2\rho/\sigma)$. That is, even after adversarially perturbing the input, they will certifiably remain among the top $\lceil \tau n \rceil / \Phi(-2\rho/\sigma)$ elements of the Sparsified SmoothGrad saliency map.
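The bound of Theorem 3 can be evaluated numerically as in the sketch below. The function name and all parameter values are hypothetical, chosen only to illustrate the formula; the function returns `None` when $\hat{L}(1/2) \le 0$, in which case the bound is vacuous.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def certified_rank_bound(tau, n, rho, sigma, p, q):
    """Upper bound ceil(tau * n) / L^(1/2) on the certified rank (Theorem 3)."""
    c = math.sqrt(math.log(2.0 * n / (1.0 - p)) / (2.0 * q))
    if c >= 0.5:
        return None  # L^(1/2) is not positive, so the bound is vacuous
    floor_half = STD_NORMAL.cdf(STD_NORMAL.inv_cdf(0.5 - c) - 2.0 * rho / sigma) - c
    if floor_half <= 0.0:
        return None
    return math.ceil(tau * n) / floor_half


# Illustrative values only: 1024 inputs, top 12.5% kept, very many perturbations.
no_attack = certified_rank_bound(tau=0.125, n=1024, rho=0.0, sigma=1.0, p=0.95, q=10**9)
small_attack = certified_rank_bound(tau=0.125, n=1024, rho=0.25, sigma=1.0, p=0.95, q=10**9)
```

Even with $\rho = 0$ the bound is about $2\lceil \tau n \rceil$, because a median rank within the top $\lceil \tau n \rceil$ only guarantees a smoothed value of at least $1/2$; the bound then loosens further as $\rho/\sigma$ grows.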

We present a more general form of this result in the appendix.

### 3.6 Experimental Results

To test the empirical robustness of Sparsified SmoothGrad, we used an attack on $\tilde{g}^{[\gamma,\tau]}$ adapted from the attack defined by [11]; see the appendix for details of our proposed attack. We chose Relaxed Sparsified SmoothGrad to test, rather than Sparsified SmoothGrad, because we are using a gradient-based attack, and Sparsified SmoothGrad has no defined gradients. We tested on ResNet-18 with CIFAR-10, with the attacker using a separately trained, fully differentiable version of ResNet-18, with SoftPlus activations in place of ReLU.

We present our empirical results in Figure 6. We observe that our method is significantly more robust than the SmoothGrad method, while its robustness is on par with the Quadratic SmoothGrad method with the same number of smoothing perturbations. We note that our robustness certificate appears to be loose for the large perturbation magnitudes used in these experiments.

## 4 Adversarial Training for Robust Saliency Maps

Adversarial training has been used extensively for making neural networks robust against adversarial attacks on classification [21]. The key idea is to generate adversarial examples for a classification model, and then re-train the model on these adversarial examples.

In this section, we present, for the first time, an adversarial training approach for fortifying deep learning interpretations so that the saliency maps generated by the model (during test time) are robust against adversarial examples. We focus on “vanilla gradient” saliency maps, although the technique presented here can potentially be applied to any saliency map method which is differentiable w.r.t. the input. We solve the following optimization problem for the network weights (denoted by $\theta$):

$$\min_{\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}}\Big[\underbrace{\ell_{cls}(x,y)}_{\text{Classification loss}} + \lambda \underbrace{\|g(x) - g(\tilde{x})\|_2^2}_{\text{Robustness loss}}\Big], \tag{4.1}$$

where $\tilde{x}$ is an adversarial perturbation for the saliency map generated from $x$. To generate $\tilde{x}$, we developed an attack on saliency maps by extending the attack of [11] (see the details in the appendix). $\ell_{cls}$ is the standard cross-entropy loss, and $\lambda$ is the regularization parameter to encourage consistency between saliency maps of the original and adversarially perturbed images.
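The structure of the objective in Equation 4.1 can be illustrated for a single training example as follows. This is only a sketch under strong assumptions: `toy_saliency` is a hypothetical closed-form saliency map for a toy score $\sum_j w_j x_j^2$, the classification loss is passed in as a number, and the adversarial input is given rather than produced by an attack.

```python
def toy_saliency(x, weights):
    """Hypothetical vanilla-gradient saliency of sum_j w_j * x_j^2: |2 * w_i * x_i|."""
    return [abs(2.0 * w * v) for w, v in zip(weights, x)]


def robust_objective(cls_loss, x, x_adv, weights, lam):
    """Per-example value of l_cls + lambda * ||g(x) - g(x~)||_2^2."""
    g_clean = toy_saliency(x, weights)
    g_adv = toy_saliency(x_adv, weights)
    penalty = sum((a - b) ** 2 for a, b in zip(g_clean, g_adv))
    return cls_loss + lam * penalty


# Illustrative numbers: the perturbation changes only the first saliency entry.
value = robust_objective(cls_loss=0.3, x=[1.0, 2.0], x_adv=[1.5, 2.0],
                         weights=[1.0, 0.5], lam=0.1)
```

In actual training the saliency map is itself a gradient of the network, so minimizing the robustness term over $\theta$ requires differentiating through that gradient, which an automatic differentiation framework would handle.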

We observe that the proposed adversarial training significantly improves the robustness of saliency maps. Aggregate empirical results are presented in Figure 6, and examples of saliency maps are presented in Figure 2. Notably, adversarial training also greatly improves the quality of the saliency maps for unperturbed inputs. We observe that even for very large values of $\lambda$, only a slight reduction in classification accuracy occurs due to the added regularization term.

## 5 Conclusion

In this work, we studied the robustness of deep learning interpretation against adversarial attacks and proposed two defense methods. Our first method is a sparsified variant of the popular SmoothGrad method, which computes the average saliency maps over random perturbations of the input. By establishing an easy-to-compute robustness certificate for the interpretation problem, we showed that the proposed Sparsified SmoothGrad is certifiably robust to adversarial attacks while producing high-quality saliency maps. We provided extensive experiments on ImageNet samples validating our theory. Second, for the first time, we introduced an adversarial training approach to further fortify deep learning interpretation against adversarial attacks by penalizing the inconsistency of saliency maps between normal and crafted adversarial samples. The proposed adversarial training significantly improved the robustness of saliency maps without degrading the classification accuracy. We also observed that, somewhat surprisingly, adversarial training for interpretation enhances the quality of the gradient-based saliency maps in addition to their robustness.

## References

• [1] Amal Lahiani, Jacob Gildenblat, Irina Klaman, Nassir Navab, and Eldad Klaiman. Generalizing multistain immunohistochemistry tissue segmentation using one-shot color deconvolution deep neural networks. arXiv preprint arXiv:1805.06958, 2018.
• [2] A. BenTaieb, J. Kawahara, and G. Hamarneh. Multi-loss convolutional networks for gland analysis in microscopy. In 2016 IEEE 13th International Symposium on Biomedical Imaging (ISBI), pages 642–645, April 2016.
• [3] Marvin Teichmann, Michael Weber, Marius Zoellner, Roberto Cipolla, and Raquel Urtasun. Multinet: Real-time joint semantic reasoning for autonomous driving. In 2018 IEEE Intelligent Vehicles Symposium (IV), pages 1013–1020. IEEE, 2018.
• [4] Thomas Fischer and Christopher Krauss. Deep learning with long short-term memory networks for financial market predictions. European Journal of Operational Research, 270(2):654–669, 2018.
• [5] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In ICLR, 2014.
• [6] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In ICML, 2017.
• [7] D. Alvarez-Melis and T. S. Jaakkola. Towards Robust Interpretability with Self-Explaining Neural Networks. Neural Information Processing Systems, 2018.
• [8] Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. Sanity checks for saliency maps. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, pages 9525–9536, USA, 2018. Curran Associates Inc.
• [9] Kan Huang, Chunbiao Zhu, and Ge Li. Robust saliency detection via fusing foreground and background priors. CoRR, abs/1711.00322, 2017.
• [10] Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Viégas, and Martin Wattenberg. Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.
• [11] Amirata Ghorbani, Abubakar Abid, and James Zou. Interpretation of neural networks is fragile. arXiv preprint arXiv:1710.10547, 2017.
• [12] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. Intriguing properties of neural networks. CoRR, abs/1312.6199, 2014.
• [13] Jonathan Uesato, Brendan O’Donoghue, Pushmeet Kohli, and Aäron van den Oord. Adversarial risk and the dangers of evaluating against weak attacks. In ICML, 2018.
• [14] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. CoRR, abs/1412.6572, 2015.
• [15] Anish Athalye, Nicholas Carlini, and David A. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In ICML, 2018.
• [16] Anish Athalye, Logan Engstrom, Andrew Ilyas, and Kevin Kwok. Synthesizing robust adversarial examples. In ICML, 2018.
• [17] Jacob Buckman, Aurko Roy, Colin Raffel, and Ian J. Goodfellow. Thermometer encoding: One hot way to resist adversarial examples. In ICLR. OpenReview.net, 2018.
• [18] Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial examples in the physical world. CoRR, abs/1607.02533, 2017.
• [19] Nicolas Papernot and Patrick D. McDaniel. On the effectiveness of defensive distillation. CoRR, abs/1607.05113, 2016.
• [20] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE, 2017.
• [21] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. CoRR, abs/1706.06083, 2018.
• [22] C. Etmann, S. Lunz, P. Maass, and C.-B. Schönlieb. On the Connection Between Adversarial Robustness and Saliency Map Interpretability. arXiv e-prints, May 2019.
• [23] Jeremy M Cohen, Elan Rosenfeld, and J Zico Kolter. Certified adversarial robustness via randomized smoothing. arXiv preprint arXiv:1902.02918, 2019.
• [24] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. There is no free lunch in adversarial robustness (but there are unexpected benefits). arXiv preprint arXiv:1805.12152, 2018.
• [25] Mathias Lecuyer, Vaggelis Atlidakis, Roxana Geambasu, Daniel Hsu, and Suman Jana. Certified robustness to adversarial examples with differential privacy. arXiv preprint arXiv:1802.03471, 2018.
• [26] Pieter-Jan Kindermans, Sara Hooker, Julius Adebayo, Maximilian Alber, Kristof T. Schütt, Sven Dähne, Dumitru Erhan, and Been Kim. The (un)reliability of saliency methods. CoRR, abs/1711.00867, 2018.
• [27] Bai Li, Changyou Chen, Wenlin Wang, and Lawrence Carin. Second-order adversarial attack and certifiable robustness. arXiv preprint arXiv:1809.03113, 2018.
• [28] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3319–3328. JMLR. org, 2017.
• [29] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3145–3153. JMLR. org, 2017.
• [30] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.

## Appendix A Proofs

###### Theorem 1.

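Let $h: \mathbb{R}^n \to [0,1]^n$ be a real-valued function and let $\sigma^2$ be the smoothing variance for $\bar{h}$. Then for all $i, j \in [n]$ and all $x, \tilde{x}$ such that $\|x - \tilde{x}\|_2 \le \rho$:

$$\Phi\left(\Phi^{-1}(\bar{h}_i(x)) - \frac{2\rho}{\sigma}\right) \ge \bar{h}_j(x) \implies \bar{h}_i(\tilde{x}) \ge \bar{h}_j(\tilde{x})$$

where $\Phi$ denotes the cdf of the standard normal distribution and $\Phi^{-1}$ is its inverse.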

###### Proof.

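We first define a new, randomized function $H(x)$, where for all $k \in [n]$,

$$H_k(x) \sim \mathrm{Bern}(h_k(x))$$

Let $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$. Then for all $k \in [n]$:

$$\mathbb{E}[H_k(x+\epsilon)] = \mathbb{E}_{\epsilon}\left[\mathbb{E}_{H_k}[H_k(x+\epsilon)]\right] = \mathbb{E}_{\epsilon}[h_k(x+\epsilon)] \tag{A.1}$$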

Now, we apply the following Lemma (Lemma 4 from [23]):

###### Lemma (Cohen’s lemma).

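Let $X \sim \mathcal{N}(x, \sigma^2 I)$ and $Y \sim \mathcal{N}(x + \delta, \sigma^2 I)$. Let $h: \mathbb{R}^n \to \{0, 1\}$ be any deterministic or random function. Then:

1. If $S = \{z \in \mathbb{R}^n : \delta^T z \le \beta\}$ for some $\beta$ and $\Pr(h(X) = 1) \ge \Pr(X \in S)$, then $\Pr(h(Y) = 1) \ge \Pr(Y \in S)$.

2. If $S = \{z \in \mathbb{R}^n : \delta^T z \ge \beta\}$ for some $\beta$ and $\Pr(h(X) = 1) \le \Pr(X \in S)$, then $\Pr(h(Y) = 1) \le \Pr(Y \in S)$.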

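Using the same technique as used in the proof of Lemma 4 in [23], we fix a perturbation $\delta$ and define,

$$\beta_i = \sigma\|\delta\|\,\Phi^{-1}\left(\mathbb{E}[H_i(x+\epsilon)]\right), \qquad \beta_j = \sigma\|\delta\|\,\Phi^{-1}\left(1 - \mathbb{E}[H_j(x+\epsilon)]\right)$$

Also define the half-spaces:

$$S_i = \{z : \delta^T z \le \beta_i + \delta^T x\} = \{z : \delta^T (z - x) \le \beta_i\}$$

$$S_j = \{z : \delta^T z \ge \beta_j + \delta^T x\} = \{z : \delta^T (z - x) \ge \beta_j\}$$

Applying algebra from the proof of Theorem 1 in [23], with $X \sim \mathcal{N}(x, \sigma^2 I)$ and $Y \sim \mathcal{N}(x + \delta, \sigma^2 I)$ as in the lemma, we have,

$$\Pr(X \in S_i) = \Phi\left(\frac{\beta_i}{\sigma\|\delta\|}\right) = \mathbb{E}[H_i(x+\epsilon)] \tag{A.2}$$

$$\Pr(X \in S_j) = 1 - \Phi\left(\frac{\beta_j}{\sigma\|\delta\|}\right) = \mathbb{E}[H_j(x+\epsilon)] \tag{A.3}$$

$$\Pr(Y \in S_i) = \Phi\left(\frac{\beta_i}{\sigma\|\delta\|} - \frac{\|\delta\|}{\sigma}\right) = \Phi\left(\Phi^{-1}\left(\mathbb{E}[H_i(x+\epsilon)]\right) - \frac{\|\delta\|}{\sigma}\right) \tag{A.4}$$

$$\Pr(Y \in S_j) = \Phi\left(-\frac{\beta_j}{\sigma\|\delta\|} + \frac{\|\delta\|}{\sigma}\right) = \Phi\left(\Phi^{-1}\left(\mathbb{E}[H_j(x+\epsilon)]\right) + \frac{\|\delta\|}{\sigma}\right) \tag{A.5}$$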

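Using Equation A.2,

$$\Pr(H_i(X) = 1) = \mathbb{E}[H_i(X)] = \mathbb{E}[H_i(x+\epsilon)] \ge \Pr(X \in S_i)$$

Applying Statement 1 of Cohen’s lemma, using $h = H_i$ and $S = S_i$:

$$\mathbb{E}[H_i(x+\delta+\epsilon)] = \Pr(H_i(x+\delta+\epsilon) = 1) = \Pr(H_i(Y) = 1) \ge \Pr(Y \in S_i) \tag{A.6}$$

Using Equation A.3,

$$\Pr(H_j(X) = 1) = \mathbb{E}[H_j(X)] = \mathbb{E}[H_j(x+\epsilon)] \le \Pr(X \in S_j)$$

Applying Statement 2 of Cohen’s lemma, using $h = H_j$ and $S = S_j$:

$$\mathbb{E}[H_j(x+\delta+\epsilon)] = \Pr(H_j(x+\delta+\epsilon) = 1) = \Pr(H_j(Y) = 1) \le \Pr(Y \in S_j) \tag{A.7}$$

Using Equation A.6 and Equation A.7:

$$\Pr(Y \in S_i) \ge \Pr(Y \in S_j) \implies \mathbb{E}[H_i(x+\delta+\epsilon)] \ge \mathbb{E}[H_j(x+\delta+\epsilon)]$$

Using Equation A.4 and Equation A.5, the condition $\Pr(Y \in S_i) \ge \Pr(Y \in S_j)$ can be written as:

$$\Phi\left(\Phi^{-1}\left(\mathbb{E}[H_i(x+\epsilon)]\right) - \frac{\|\delta\|}{\sigma}\right) \ge \Phi\left(\Phi^{-1}\left(\mathbb{E}[H_j(x+\epsilon)]\right) + \frac{\|\delta\|}{\sigma}\right)$$

Now using $\|\delta\|_2 \le \rho$, and that $\Phi$ is a monotonic function:

$$\Phi\left(\Phi^{-1}\left(\mathbb{E}[H_i(x+\epsilon)]\right) - \frac{2\rho}{\sigma}\right) \ge \mathbb{E}[H_j(x+\epsilon)] \implies \mathbb{E}[H_i(x+\delta+\epsilon)] \ge \mathbb{E}[H_j(x+\delta+\epsilon)]$$

Finally, we use Equation A.1 to derive the intended result. ∎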

###### Corollary 1.

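Let $h$ be a function such that for given values of $q$ and $\sigma$:

$$\tilde{h}(x) = \frac{1}{q}\sum_{i=1}^{q} h(x+\epsilon_i), \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2 I) \tag{A.8}$$

Then for all $i, j \in [n]$ and all $x, \tilde{x}$ such that $\|x - \tilde{x}\|_2 \le \rho$, with probability at least $p$,

$$\hat{L}(\tilde{h}_i(x)) \ge \tilde{h}_j(x) \implies \bar{h}_i(\tilde{x}) \ge \bar{h}_j(\tilde{x})$$

###### Proof.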

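By Hoeffding’s Inequality, for any $c > 0$ and any $i \in [n]$,

$$\Pr\left[|\tilde{h}_i(x) - \bar{h}_i(x)| \ge c\right] \le 2e^{-2qc^2} \tag{A.9}$$

Then, by a union bound over the $n$ elements:

$$\Pr\left[\bigcup_i \left(|\tilde{h}_i(x) - \bar{h}_i(x)| \ge c\right)\right] \le 2ne^{-2qc^2} \tag{A.10}$$

Since we are free to choose $c$, we define $c$ such that $2ne^{-2qc^2} = 1 - p$, then:

$$c = \sqrt{\frac{\ln\left(2n(1-p)^{-1}\right)}{2q}} \tag{A.11}$$

$$\begin{aligned} &\Pr\left[\bigcup_i \left(|\tilde{h}_i(x) - \bar{h}_i(x)| \ge c\right)\right] \le 2ne^{-2qc^2} = 1 - p \\ \implies\; & 1 - \Pr\left[\bigcup_i \left(|\tilde{h}_i(x) - \bar{h}_i(x)| \ge c\right)\right] \ge p \\ \implies\; & \Pr\left[\bigcap_i \left(|\tilde{h}_i(x) - \bar{h}_i(x)| < c\right)\right] \ge p \end{aligned}$$

Then with probability at least $p$:

$$\tilde{h}_i(x) - c < \bar{h}_i(x) < \tilde{h}_i(x) + c \quad \forall i \in [n] \tag{A.12}$$

So:

$$\Phi\left(\Phi^{-1}\left(\tilde{h}_i(x) - c\right) - \frac{2\rho}{\sigma}\right) \ge \tilde{h}_j(x) + c \implies \Phi\left(\Phi^{-1}\left(\bar{h}_i(x)\right) - \frac{2\rho}{\sigma}\right) \ge \bar{h}_j(x) \tag{A.13}$$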

The result directly follows from Theorem 2. ∎
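As a quick numerical sanity check of the choice of $c$ in Equation A.11, the sketch below confirms that plugging it into the union bound of Equation A.10 gives a failure probability of exactly $1 - p$. The parameter values are arbitrary and chosen only for illustration.

```python
import math


def estimation_error(n, p, q):
    """c from Equation A.11: sqrt(ln(2n / (1 - p)) / (2q))."""
    return math.sqrt(math.log(2.0 * n / (1.0 - p)) / (2.0 * q))


def union_failure_bound(n, q, c):
    """Right-hand side of Equation A.10: 2n * exp(-2 * q * c^2)."""
    return 2.0 * n * math.exp(-2.0 * q * c * c)


n, p, q = 1000, 0.9, 256  # illustrative values
c = estimation_error(n, p, q)
failure = union_failure_bound(n, q, c)
```

Here `failure` matches $1 - p$ up to floating-point error, so all $n$ estimates are simultaneously within $c$ of their population values with probability at least $p$.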

###### Corollary 2.

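For all $\tilde{x}$ such that $\|x - \tilde{x}\|_2 \le \rho$, with probability at least $p$,

$$R(x, \tilde{x}, K) \ge R^{cert}(x, K) \tag{A.14}$$

where $R^{cert}(x, K)$ is the largest $i \le K$ such that $\hat{L}(\tilde{h}_{[i]}(x)) \ge \tilde{h}_{[2K-i]}(x)$.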

###### Proof.

Note that the proof of Corollary 1 guarantees that with probability at least , all estimates are within the approximation bound of . So we can assume that Corollary 1 will apply simultaneously to all pairs of indices , with probability .
We proceed to prove by contradiction.

 Let i=Rcert(x,K)
 ⟹^L(~h[i](x))≥~h[2K−i](x),

Suppose there exists such that:

 R(x, ~x, K)

Since is a monotonically increasing function,

 ^L(~h[i](x))≥~h[2K−i](x)
 ⟹^L(~h[i′](x))≥~h[j′](x),∀ i′≤i, j′≥2K−i,

and therefore by Corollary 1:

 ∀ m,nrank(~h(x),m)≤i, rank(~h(x),n)≥2K−i⟹¯hm(~x))≥¯hn(~x) (A.15)

Let be the set of indices in the top elements in , and be the set of indices in the top elements in .
By assumption, and share fewer than elements, so there will be at least elements in which are not in .
All of these elements have rank at least in .
Thus by pigeonhole principle, there is some index , such that .
Thus by Equation equation A.15,

 ∀m, where rank(~h(x),m)≤i,¯hm(~x)≥¯hl(~x) (A.16)

Hence, there are such elements where : these elements are clearly in .
Because , Equation A.16 implies that these elements are all also in . Thus and share at least elements, which contradicts the premise.
(In this proof we have implicitly assumed that the top elements of a vector can contain more than elements, if ties occur, but that is assigned arbitrarily in cases of ties. In practice, ties in smoothed scores will be very unlikely.) ∎
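The certificate used in this proof can be sketched in a few lines of Python. The criterion below (the largest i ≤ K for which the lower bound of the i-th largest smoothed score is at least the (2K − i)-th largest smoothed score) is inferred from the first step of the proof and is an assumption of this example; `certified_overlap` and the toy `lower_bound` are hypothetical names and stand-ins.

```python
def certified_overlap(smoothed_scores, K, lower_bound):
    """Largest i <= K with lower_bound(h_[i]) >= h_[2K - i], or 0 if none exists.

    smoothed_scores: iterable of smoothed saliency values (one per feature).
    lower_bound: callable implementing the certified lower bound on a score.
    """
    scores = sorted(smoothed_scores, reverse=True)  # scores[0] is the largest
    if 2 * K - 1 > len(scores):
        raise ValueError("need at least 2K - 1 scores")
    for i in range(K, 0, -1):
        # 1-indexed order statistics [i] and [2K - i] map to 0-indexed positions.
        if lower_bound(scores[i - 1]) >= scores[2 * K - i - 1]:
            return i
    return 0


# Toy stand-in for the lower bound: a fixed loss of 0.1 (not the bound from the paper).
toy_scores = [0.9, 0.8, 0.3, 0.2, 0.1, 0.05]
overlap = certified_overlap(toy_scores, K=3, lower_bound=lambda v: v - 0.1)
```

Scanning from i = K downwards returns the first, and therefore largest, index that satisfies the condition.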

### A.1 General Form and Proof of Theorem 3

We note that Theorem 3 can be used to derive a more general bound for any saliency map method that for an input , first maps to an elementwise function that only depends on the rank of the current element in and not on the individual value of the element. We denote the composition of the gradient function and this elementwise function as . The only properties that the function must satisfy are that it is monotonically decreasing and non-negative. Thus, we have the following statement:

###### Theorem 2.

Let be the threshold value and let be the set of random perturbations for a given input using the smoothing variance and let be the probability bound. If is an element index such that:

$$\operatorname{Median}_{\epsilon \in U}\left[\operatorname{rank}(g(x+\epsilon), i)\right] \leq T \tag{A.17}$$

Then:

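$$\operatorname{rank}^{\text{cert}}(x, i) \leq \frac{\sum_{j=1}^{n} g^{[\text{rank}]}_{[j]}(x)}{\hat{L}\left(g^{[\text{rank}]}_{[T]}(x)/2\right)} \tag{A.18}$$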

Furthermore:

$\sum_{j=1}^{n} g^{[\text{rank}]}_{[j]}(x)$ and $\hat{L}\left(g^{[\text{rank}]}_{[T]}(x)/2\right)$ are both independent of $x$. Thus the RHS is a constant. (A.19)

###### Proof.

Let the elementwise function be , i.e., takes the rank of the element as the input and outputs a real number. Furthermore, we assume that is a non-negative monotonically decreasing function. Thus .
We use to denote the constant value that maps elements of rank to.
Note that is the largest element of .
Since is a monotonically decreasing function:

$$g^{[\text{rank}]}_{[i]}(x) = f(i) \quad \forall\, i \in [n]$$

Thus is independent of ; we simply use to denote , i.e.:

$$g^{[\text{rank}]}_{[i]}(\cdot) = f(i) \quad \forall\, i \in [n]$$

Because , for at least half of sampling instances in , .
So in these instances ,
The remaining half or fewer elements are mapped to other nonnegative values.
Thus the sample mean:

$$\tilde{g}^{[\text{rank}]}_{i}(x) = \frac{1}{q}\sum_{\epsilon \in U} g^{[\text{rank}]}_{i}(x+\epsilon) \geq g^{[\text{rank}]}_{[T]}(\cdot)/2$$

Using Corollary 1, is certifiably as large as all elements with indices j such that:

$$\hat{L}\left(g^{[\text{rank}]}_{[T]}(\cdot)/2\right) \geq \tilde{g}^{[\text{rank}]}_{j}(x).$$
Now we will find an upper bound on the number of elements with indices j such that:

$$\tilde{g}^{[\text{rank}]}_{j}(x) > \hat{L}\left(g^{[\text{rank}]}_{[T]}(\cdot)/2\right)$$

Because all the ranks from to will occur in every sample in U, we have:

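$$\forall\, \epsilon \in U, \quad \sum_{k=1}^{n} g^{[\text{rank}]}_{k}(x+\epsilon) = \sum_{k=1}^{n} g^{[\text{rank}]}_{[k]}(\cdot)$$

$$\implies \sum_{k=1}^{n} \tilde{g}^{[\text{rank}]}_{k}(x) = \sum_{k=1}^{n} \frac{1}{q}\sum_{\epsilon \in U} g^{[\text{rank}]}_{k}(x+\epsilon) = \sum_{k=1}^{n} g^{[\text{rank}]}_{[k]}(\cdot)$$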

Thus strictly fewer than elements will have mean greater than .
Hence, is certifiably at least as large as elements, which by the definition of yields the result.
Theorem 3 in the main text follows trivially, because in the Sparsified SmoothGrad case, , and . Note that this represents the tightest possible realization of this general theorem. ∎
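The counting argument behind Equation A.18 can be made concrete with a short Python sketch. It evaluates the right-hand side for a given rank-to-value function; the name `rank_cert_bound`, the example values of f, and the toy lower bound are all hypothetical choices for illustration and do not reproduce the specific setting of Theorem 3.

```python
import math


def rank_cert_bound(f_values, T, lower_bound):
    """Right-hand side of Equation A.18: sum_j f(j) / lower_bound(f(T) / 2).

    f_values: [f(1), ..., f(n)], non-negative and non-increasing in the rank.
    T: rank threshold (1-indexed).
    lower_bound: callable implementing the certified lower bound on a score.
    """
    total = sum(f_values)
    denom = lower_bound(f_values[T - 1] / 2)
    if denom <= 0:
        # The counting argument gives no information when the lower bound is not positive.
        return math.inf
    return total / denom


# Hypothetical example: f maps the top 3 ranks to 1 and all other ranks to 0.
f_values = [1, 1, 1, 0, 0, 0, 0, 0]
bound = rank_cert_bound(f_values, T=3, lower_bound=lambda v: v - 0.1)
```

Because the sum of f over all ranks and the value f(T) do not depend on the input, the returned bound is the same for every image, as stated in A.19.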

## Appendix B Related Works

[12] introduced adversarial attacks for classification in deep learning. That work dealt with attacks, and used L-BFGS optimization to minimize the norm of the perturbation. [20] provides an attack for classification which is often considered state of the art.

One strategy to make classifiers more robust to adversarial attacks is randomized smoothing. [25] use randomized smoothing to develop certifiably robust classifiers in both the and norms. They show that if Gaussian smoothing is applied to class scores, a gap between the highest smoothed class score and the next highest smoothed score implies that the highest smoothed class score will still be highest under all perturbations of some magnitude. This guarantees that the smoothed classifier will be robust under adversarial perturbation.

[27] and [23] consider a related formulation. Cohen gives a bound that is tight in the case of linear classifiers and gives significantly larger certified radii. In their formulation, the unsmoothed classifier is treated as a black box outputting just a discrete class label. The smoothed classifier outputs the class observed with greatest frequency over noisy samples.
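The black-box formulation described above can be sketched as follows, assuming NumPy. The function `smoothed_predict` and its parameters are hypothetical names for this example; the sketch only shows the majority vote over noisy samples and omits the certification step.

```python
import numpy as np


def smoothed_predict(base_classifier, x, sigma, num_samples=100, seed=0):
    """Return the label that base_classifier outputs most often on x + Gaussian noise."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(scale=sigma, size=(num_samples,) + x.shape)
    # The base classifier is treated as a black box that returns a discrete label.
    labels = np.array([base_classifier(x + eps) for eps in noise])
    classes, counts = np.unique(labels, return_counts=True)
    return int(classes[np.argmax(counts)])


# Toy black-box classifier: label 1 if the coordinates sum to a positive value.
toy_classifier = lambda z: int(z.sum() > 0)
label = smoothed_predict(toy_classifier, np.array([2.0, 2.0]), sigma=0.5)
```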

In the last couple of years, several approaches have been proposed for interpreting neural network outputs. [5] computes the gradient of the class score with respect to the input. [10] computes the average gradient-based importance values generated from several noisy versions of the input. [6] defines a baseline, which represents an input absent of information and determines feature importance by accumulating gradient information along the path from the baseline to the original input. [7] builds interpretable neural networks by learning basis concepts that satisfy an interpretability criterion. [8] proposes methods to assess the quality of saliency maps. Although these methods can produce visually pleasing results, they can be sensitive to noise and adversarial perturbations ([11], [26]).

As mentioned in Section 1, several approaches have been introduced for interpreting image classification by neural networks ([5, 10, 6, 29]). It has also been shown that deep networks can be sensitive to noise and adversarial perturbations ([11], [26]).

## Appendix C L2 Attack on Saliency Maps

We developed an norm attack on , based on [11]’s attack. Our algorithm is presented as Algorithm 1. We deviate from [11]’s attack in the following ways:

• We use gradient descent, rather than gradient sign descent: this is a direct adaptation to the norm.

• We initialize the learning rate as , and then decrease it with increasing iteration count, roughly in proportion to the reciprocal of the iteration count. These are both standard practices for gradient descent.

• We use random initialization and random restarts, also standard optimization practices.

• If a gradient descent step would cross a decision boundary, we use backtracking line search to reduce the learning rate until the step stays on the correct-class side. This allows the optimization to get arbitrarily close to decision boundaries without crossing them.
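The four modifications listed above can be sketched as a generic constrained gradient descent loop, assuming NumPy. This is not Algorithm 1 itself: the objective `loss`, its gradient `grad`, the projection onto the ball, the halving schedules, and all default parameter values are assumptions made for illustration.

```python
import numpy as np


def project_l2(delta, rho):
    """Rescale delta onto the L2 ball of radius rho if it lies outside it."""
    norm = np.linalg.norm(delta)
    return delta if norm <= rho else delta * (rho / norm)


def l2_attack(x, loss, grad, predict, rho, lr0=1.0, steps=100, restarts=5, seed=0):
    """Minimize loss(x + delta) subject to ||delta||_2 <= rho and an unchanged label."""
    rng = np.random.default_rng(seed)
    label = predict(x)
    best_x, best_loss = x.copy(), loss(x)
    for _ in range(restarts):  # random restarts
        # Random initialization inside the ball, shrunk until the label is preserved.
        delta = project_l2(rng.normal(size=x.shape), rho * rng.uniform())
        for _ in range(30):
            if predict(x + delta) == label:
                break
            delta = 0.5 * delta
        else:
            delta = np.zeros_like(x)
        for t in range(1, steps + 1):
            lr = lr0 / t  # learning rate decays with the reciprocal of the iteration count
            g = grad(x + delta)  # plain gradient step, not a sign step
            cand = project_l2(delta - lr * g, rho)
            # Backtracking line search: halve the step until it stays in the original class.
            tries = 0
            while predict(x + cand) != label and tries < 30:
                lr *= 0.5
                cand = project_l2(delta - lr * g, rho)
                tries += 1
            if predict(x + cand) != label:
                break  # no admissible step found; keep the current point
            delta = cand
            current = loss(x + delta)
            if current < best_loss:
                best_x, best_loss = x + delta, current
    return best_x


# Toy usage: the classifier thresholds the first coordinate, and the loss pulls
# the point towards a target on the other side of the decision boundary.
x0 = np.array([1.0, 0.0])
target = np.array([-1.0, 0.5])
toy_loss = lambda z: float(np.sum((z - target) ** 2))
toy_grad = lambda z: 2.0 * (z - target)
toy_predict = lambda z: int(z[0] > 0)
x_adv = l2_attack(x0, toy_loss, toy_grad, toy_predict, rho=1.5)
```

In the toy problem the unconstrained minimizer lies in the other class, so the backtracking step is what lets the iterate approach the boundary while keeping the original label.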

We measured the effectiveness of our attack against a slight modification of [11]’s attack, in which the image was projected (if necessary) onto the ball at every iteration, and also clipped to fit within image box constraints (this was not mentioned in [11]’s original algorithm). For this attack, we set the () learning rate parameter at , and ran for up to 100 iterations. We also tested against random perturbations. For random perturbations, up to points were tested until a point in the correct class was identified. We tested these attacks on both “vanilla gradient” and SmoothGrad saliency maps. See Figure 7. Experimental conditions are as described in Section D for experiments on CIFAR-10. In this figure, for each attack magnitude, we discard any image on which any optimization method failed.

## Appendix D Description of Experiments in Section 3

For ImageNet experiments (Figures 1-b and 4-a,b), we use ResNet-50, using the model pre-trained on ImageNet that is provided by torchvision.models, and images were pre-processed according to the recommended procedure for that model. In all of these figures, data are from the validation set, sample size is 64, and the main data lines represent the percentile in the sample of the calculated robustness certificate. Error bars represent the and percentile values, corresponding to a confidence interval for the population quantile.
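One standard way to relate sample percentiles to a confidence interval for a population quantile is through order statistics; the text does not state whether this exact construction was used. The sketch below computes the coverage probability of an interval between two order statistics. The function name and the example ranks are hypothetical and are not the values used for the figures.

```python
from math import comb


def order_stat_coverage(n, tau, lower_rank, upper_rank):
    """P(X_(lower_rank) <= tau-quantile < X_(upper_rank)) for n i.i.d. continuous samples.

    The number of samples at or below the true quantile is Binomial(n, tau), and the
    interval covers the quantile exactly when that count is in [lower_rank, upper_rank - 1].
    """
    return sum(
        comb(n, b) * tau ** b * (1 - tau) ** (n - b)
        for b in range(lower_rank, upper_rank)
    )


# Hypothetical example: coverage of the interval between the 25th and 40th
# smallest of 64 samples, as an interval for the population median.
coverage = order_stat_coverage(n=64, tau=0.5, lower_rank=25, upper_rank=40)
```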
For CIFAR-10 experiments, we train a ResNet-18 model on the CIFAR-10 training set (with pixel intensities normalized to in each channel) using Stochastic Gradient Descent with Momentum as implemented by PyTorch [30]. The following training parameters were used:

Early stopping was used to maximize accuracy relative to the CIFAR-10 test set (this should not affect the validity of our results, because we are not concerned with classification accuracy). For adversarial attacks, we train a version of ResNet-18 with SoftMax activations instead of ReLU. The adversarial attack used was the attack described in Algorithm 1, with