Saliency Methods for Explaining Adversarial Attacks

Jindong Gu
The University of Munich
Siemens AG, Corporate Technology
Volker Tresp
The University of Munich
Siemens AG, Corporate Technology

In this work, we aim to explain the classifications of adversary images using saliency methods. Saliency methods explain individual classification decisions of neural networks by creating saliency maps. All saliency methods were proposed for explaining correct predictions. Recent research shows that many of the proposed saliency methods fail to explain the predictions. Notably, Guided Backpropagation (GuidedBP) has been shown to essentially perform (partial) image recovery. In contrast, our numerical analysis shows that the saliency maps created by GuidedBP do contain class-discriminative information. We propose a simple and efficient way to enhance the created saliency maps. The proposed enhanced GuidedBP is the state-of-the-art saliency method for explaining adversary classifications.



1 Introduction

The explanations produced by saliency methods reveal the relationship between inputs and outputs of the underlying model. In image classification, the explanations are generally visualized as saliency maps. A saliency map (SM) is defined by three components: an input $x$, a model corresponding to a function $f$, and an output class $k$.

A saliency method can be formulated as a function $g$. A saliency map for the classification of the $k$-th class is defined as

$$SM^{k} = g(x, f, k) \qquad (1)$$

where $SM^{k}$ has the same dimensions as the input $x$. The value of an element in $SM^{k}$ specifies the relevance of the corresponding input feature to the $k$-th class. Note that the $k$-th class may be neither the ground-truth class nor the predicted class.

In recent years, a large number of significant saliency methods have been proposed Simonyan et al. (2013); Zeiler and Fergus (2014); Springenberg et al. (2014); Bach et al. (2015); Ribeiro et al. (2016); Sundararajan et al. (2017); Smilkov et al. (2017); Shrikumar et al. (2017); Selvaraju et al. (2017); Ancona et al. (2017); Zintgraf et al. (2017); Dabkowski and Gal (2017); Fong and Vedaldi (2017); Gu et al. (2018). Mahendran and Vedaldi (2016) and Adebayo et al. (2018) show that SMs created by Guided Backpropagation (GuidedBP, Springenberg et al. (2014)) are neither class-discriminative nor sensitive to model parameters. Nie et al. (2018) prove that Guided Backpropagation is essentially doing (partial) image recovery, which is unrelated to the network decisions. In contrast to their conclusions, our numerical analysis shows that the SMs created by GuidedBP do contain class-relevant information.

Most of the existing saliency methods only consider the SM of the ground-truth class without considering the change of the output class in Equation 1. Alvarez-Melis and Jaakkola (2018) show that meaningful explanations should be robust to small local perturbations of the input. However, a small perturbation can lead to the misclassification of neural networks Szegedy et al. (2014); Goodfellow et al. (2015). We would not expect the explanations to stay unchanged in this case, since the neural network makes a totally different decision. Hence, saliency methods should be discriminative to adversary perturbations.

Our contributions are as follows: 1) We identify class-discriminative information in SMs created by GuidedBP and propose a simple and efficient way to enhance the created SMs; 2) We explain classifications of adversary images with the proposed enhanced Guided Backpropagation and the existing saliency methods. Their created explanations are evaluated via qualitative and quantitative experiments.

2 Enhanced Guided Backpropagation

Similar to raw gradient backpropagation, GuidedBP propagates gradients back to the inputs and takes the received gradients as their saliency values. The two methods differ only in how they handle ReLU layers. In GuidedBP, $R^{l} = \mathbb{1}(z^{l} > 0) \cdot \mathbb{1}(R^{l+1} > 0) \cdot R^{l+1}$, where $R^{l}$ denotes the gradients of the $l$-th layer, $z^{l}$ are the activations before the ReLU layer, and $\mathbb{1}$ is the indicator function. Since the indicator functions filter out part of the gradients, the gradients received by some input features can be zero, which is called the filtering effect (FE). The filtering effect of an SM is formally defined by its binary pattern of zero and non-zero entries, $\mathbb{1}(SM \neq 0)$.
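The difference between the two ReLU rules can be illustrated with a minimal NumPy sketch. The function names and the toy arrays are hypothetical, and representing the filtering effect as the mask of non-zero entries is an assumption made here for illustration, not necessarily the exact definition used in the experiments.

```python
import numpy as np

def relu_backward(grad_out, pre_act):
    """Raw gradient rule: pass gradients only where the ReLU input was positive."""
    return grad_out * (pre_act > 0)

def guided_relu_backward(grad_out, pre_act):
    """GuidedBP rule: additionally drop negative incoming gradients."""
    return grad_out * (pre_act > 0) * (grad_out > 0)

# Toy values (hypothetical): activations before the ReLU and gradients from the layer above.
pre_act = np.array([1.5, -0.3, 2.0, 0.7])
grad_out = np.array([0.4, 0.9, -1.2, 0.0])

vanilla = relu_backward(grad_out, pre_act)
guided = guided_relu_backward(grad_out, pre_act)

# Assumed representation of the filtering effect: which entries survive as non-zero.
filtering_effect = guided != 0
```

In this toy case the raw gradient keeps the negative value at the third position, while GuidedBP zeroes it out, so more entries are filtered.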

Alvarez-Melis and Jaakkola (2018) provide a theoretical analysis of GuidedBP. They show that the created SMs of different classes have similar filtering effects, which means that GuidedBP is not class-discriminative. In the following, we show that the SMs created by GuidedBP do contain class-discriminative information and propose a simple way to enhance the discriminative information in the corresponding saliency maps.

(a) Similarity of FE (b) Difference of SM Values
Figure 1: The relationship between the two SMs of each SM pair: a) the similarity ratio of the binarized SMs describes the similarity of their filtering effects; b) the Avg-Diff and the Max-Diff between the two unnormalized SMs describe the difference of their saliency values.
(a) SMs before Norm. (b) SMs after Norm.
Figure 2: This toy example illustrates how the proposed method enhances the discriminativity of SMs. The indexes located between A and B correspond to the input features relevant to the $i$-th class, and the ones between B and C are the input features relevant to the $j$-th class.

2.1 Identifying Discriminative Information

$SM^{i}$ and $SM^{j}$ are the two saliency maps created by GuidedBP for the $i$-th output class and the $j$-th output class. They have similar filtering effects, as theoretically analyzed in Alvarez-Melis and Jaakkola (2018). Any difference between them can therefore lie only in their saliency values. However, in all published work, SMs are visualized by normalizing the saliency values in an SM and mapping them to a color map. The possible difference between their saliency values is hidden by this normalization.

We take a pre-trained VGG16 Simonyan and Zisserman (2014) model and fine-tune it on the PASCAL VOC2012 Everingham et al. (2010) dataset. Each image in the dataset may contain objects belonging to more than one class. We select images with multiple labels from the validation dataset. For each image with $m$ ground-truth classes, we produce one SM per ground-truth class and choose any two of the SMs to form an SM pair ($SM^{i}$ and $SM^{j}$), i.e., $\binom{m}{2}$ SM pairs.

For each pair, we binarize the SMs and compute the similarity between the two binarized maps, defined as the ratio of the number of pixels with the same value to the number of all pixels. The scores of all images from the validation dataset are shown in Figure 1(a). All the scores are close to 1, which means that the SMs of different classes have almost the same filtering effect.
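The similarity score described above can be sketched as follows. The helper names are hypothetical, and binarizing by testing for non-zero entries is an assumption, since the exact binarization threshold is not stated.

```python
import numpy as np

def binarize(sm):
    """Map a saliency map to {0, 1}; assumes 'binarize' means non-zero vs. zero."""
    return (sm != 0).astype(np.uint8)

def fe_similarity(sm_a, sm_b):
    """Ratio of pixels with the same binarized value to the number of all pixels."""
    if sm_a.shape != sm_b.shape:
        raise ValueError("saliency maps must have the same shape")
    return float(np.mean(binarize(sm_a) == binarize(sm_b)))

# Toy saliency maps (hypothetical values) that differ in a single zero/non-zero position.
sm_i = np.array([[0.0, 0.2, 0.9],
                 [0.0, 0.0, 0.4]])
sm_j = np.array([[0.0, 0.7, 0.1],
                 [0.0, 0.3, 0.6]])

score = fe_similarity(sm_i, sm_j)  # 5 of the 6 pixels agree after binarization
```

A score of 1 means the two maps have identical zero patterns, regardless of how much their non-zero values differ.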

Without modifying the values of the SMs, we compute their average and maximum. For each SM pair, we compute the difference between the saliency values of the two SMs as Avg-Diff and Max-Diff. The scores are visualized in Figure 1(b). They vary from 0 to 0.8. Given a classification, the two SMs of a pair differ in their saliency values rather than in their filtering effect.

2.2 Enhancing Discriminative Information of Saliency Maps

In this section, we propose a simple and efficient way to extract information about the difference. We argue that the relatively larger saliency values in SMs correspond to the input features that support a specific class. We extract such class-relevant information by normalizing the two SMs and subtracting one from the other, as visualized in Figure 2. Figure 2(a) shows the saliency values of two SMs, where the input features are ordered by the saliency values of one of the SMs. The two SMs have zeros in the interval [0, A] since both have the same filtering effect. The difference between the two SMs lies in their saliency values in the interval (A, C]. Figure 2(b) shows the normalized saliency values, where the input features in (A, B] are relevant to the $i$-th class and the ones in (B, C] are relevant to the $j$-th class.
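The toy setting can be reproduced with a few lines of NumPy. This is only a sketch: the arrays are made-up values, and normalizing each SM by the sum of its values is an assumed choice of normalization for the illustration.

```python
import numpy as np

# Made-up 1-D saliency maps with the same zero pattern (same filtering effect).
# Index 1 plays the role of A, index 3 of B, and index 5 of C.
sm_i = np.array([0.0, 0.0, 4.0, 6.0, 4.0, 6.0])
sm_j = np.array([0.0, 0.0, 1.0, 2.0, 3.0, 4.0])

def sum_normalize(sm):
    """Divide a saliency map by the sum of its values (assumed normalization)."""
    total = sm.sum()
    return sm / total if total != 0 else sm.copy()

diff = sum_normalize(sm_i) - sum_normalize(sm_j)

relevant_to_i = np.flatnonzero(diff > 0)  # features in (A, B]
relevant_to_j = np.flatnonzero(diff < 0)  # features in (B, C]
```

Before normalization `sm_i` is larger than `sm_j` at every non-zero feature, so a raw subtraction would assign every feature to the same class. Only after normalization do the two groups of features separate.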

In classifications of real-world images, the obtained discriminative pixels strongly depend on how the SMs are normalized. The trivial normalization is to divide each SM by its maximum. However, the maximal values of the SMs (i.e., the maximal local gradient values in the vanilla Gradient approach) are noisy and often outliers Szegedy et al. (2014); Smilkov et al. (2017).

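One alternative is energy-based normalization: each individual SM is normalized by the sum of its saliency values (i.e., the energy of the SM). The SMs $SM^{i}$ and $SM^{j}$ are composed of three channels. The discriminative pixels for the $i$-th class on the R channel are those where the normalized saliency value of $SM^{i}_{R}$ exceeds that of $SM^{j}_{R}$, i.e., where $SM^{i}_{R} / E(SM^{i}) - SM^{j}_{R} / E(SM^{j}) > 0$, with $E(\cdot)$ denoting the energy over all three channels.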

Neural networks have different sensitivities to different feature maps and input channels. In a classification, the sensitivity to the channels could differ across output classes. E.g., if the normalized saliency values of $SM^{i}$ on the red channel are smaller than those of $SM^{j}$ at every pixel, the discriminative region for the $i$-th class on the red channel is empty, and we lose all the information on the red channel. In the opposite case, we might keep too much detailed information without highlighting discriminative features. On the other channels, we could similarly lose all the information or keep too much non-discriminative information.

We propose channel-wise energy-based normalization to circumvent this problem. We consider the three channels separately: each channel of an SM is normalized by the energy of that channel alone. The discriminative pixels of the R channel are then those where $SM^{i}_{R} / E(SM^{i}_{R}) - SM^{j}_{R} / E(SM^{j}_{R}) > 0$. Similarly, the discriminative information of each channel is accurately identified. The generalization of the proposed enhancing method to other saliency methods is discussed in Section 4.
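The contrast between normalizing by the energy of the whole SM and normalizing each channel separately can be sketched as below. The function name, the `(H, W, 3)` layout, the small `eps` guard, and keeping only the positive part of the difference are all assumptions for illustration; the SMs are assumed to be non-negative.

```python
import numpy as np

def discriminative_map(sm_i, sm_j, channel_wise=True, eps=1e-12):
    """Positive part of the difference between energy-normalized SMs of shape (H, W, 3).

    channel_wise=True normalizes every channel by its own energy;
    channel_wise=False normalizes by the energy of the whole SM.
    """
    axes = (0, 1) if channel_wise else (0, 1, 2)
    energy_i = sm_i.sum(axis=axes, keepdims=True)
    energy_j = sm_j.sum(axis=axes, keepdims=True)
    diff = sm_i / (energy_i + eps) - sm_j / (energy_j + eps)
    return np.maximum(diff, 0.0)

# Made-up SMs with one row of two pixels; class i is weakly sensitive to the R channel.
sm_i = np.array([[[0.1, 3.0, 3.0], [0.3, 1.0, 1.0]]])
sm_j = np.array([[[3.0, 1.0, 1.0], [1.0, 1.0, 1.0]]])

global_map = discriminative_map(sm_i, sm_j, channel_wise=False)
channel_map = discriminative_map(sm_i, sm_j, channel_wise=True)
```

With whole-SM normalization the R channel of `global_map` is zero everywhere, which is the loss of information described above. With channel-wise normalization the second pixel is recovered as discriminative on the R channel.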

Figure 3: SMs of a clean image and its adversary versions. The first column lists the original image and its adversary versions. Our enhanced GuidedBP reacts strongly to the adversary attacks, while all the other saliency methods produce similar SMs.
Figure 4: Following the rank of the saliency values of an SM, a certain percentage of the pixels of the adversary image is perturbed. The classification accuracy on the perturbed adversary images is shown.

3 Explaining Classifications of Adversary Images

Inputs with imperceptible perturbations can fool well-trained neural networks. The Fast Gradient Sign Method (FGSM) Goodfellow et al. (2015) perturbs an image to increase the loss of the classifier on the resulting image. The Basic Iterative Method (BIM) Kurakin et al. (2016) extends FGSM by taking multiple small steps instead of one big step. A stronger attack method is the Carlini and Wagner attack (C&W) Carlini and Wagner (2017). In the wake of defensive distillation, they create quasi-imperceptible perturbations by restricting their $\ell_0$, $\ell_2$ and $\ell_\infty$-norms. The -norm is used across this paper.
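The one-step and iterative attacks can be sketched in NumPy as follows. This is a minimal illustration, not the attack code used in the experiments: the sign step with an $\ell_\infty$ budget is the standard formulation of FGSM and BIM, the pixel range [0, 1] is assumed, and the logistic-regression "model" with weights `w` is a hypothetical stand-in for a neural network.

```python
import numpy as np

def fgsm(x, grad_fn, eps):
    """One step of size eps in the sign direction of the loss gradient."""
    return np.clip(x + eps * np.sign(grad_fn(x)), 0.0, 1.0)

def bim(x, grad_fn, eps, alpha, steps):
    """Several steps of size alpha, projected back into the eps-ball around x."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))
        x_adv = np.clip(x_adv, x - eps, x + eps)  # keep the perturbation within eps
        x_adv = np.clip(x_adv, 0.0, 1.0)          # keep a valid image (assumed range)
    return x_adv

# Hypothetical toy model: logistic regression with true label 1.
w = np.array([1.0, -2.0, 0.5])

def loss(x):
    return -np.log(1.0 / (1.0 + np.exp(-w @ x)))

def loss_grad(x):
    p = 1.0 / (1.0 + np.exp(-w @ x))
    return (p - 1.0) * w  # gradient of the loss with respect to the input

x = np.array([0.5, 0.5, 0.5])
x_fgsm = fgsm(x, loss_grad, eps=0.1)
x_bim = bim(x, loss_grad, eps=0.1, alpha=0.02, steps=10)
```

Both attacks raise the loss on the toy input while keeping every feature within 0.1 of its original value.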

For ImageNet validation images, we create adversary images using the three described attack methods on a pre-trained VGG16. The SMs of clean images and adversary images are shown in Figure 3. For all the saliency methods except our enhanced GuidedBP, the SMs created for the predicted classes of the clean image and its adversary versions are visually the same. One might argue that this is an advantage of the saliency methods: they can still identify the object in the image even when attacked. However, we argue that saliency methods should reflect the strong reaction of deep neural networks. In other words, they should produce different SMs for clean images and adversary ones.

Since the existing saliency methods always create similar SMs for a clean image and its adversary versions, they cannot be applied to explain classifications misled by adversary perturbations. Our enhanced GuidedBP can identify the relevant evidence of the decisions. For the classification of the original input (e.g., Shetland sheepdog), the created SM shows that VGG16 focuses on the important visual feature of the target object (the head), while it focuses on class-irrelevant features (background and body parts) when explaining the classifications of adversary inputs.

Saliency methods can identify the input features that contribute to the classification decision, so we can apply them to the misled classifications of adversary samples. If we perturb the pixels relevant to the misclassification according to the created SMs, the attack effectiveness decreases, and the performance of the model on the perturbed samples can be recovered to some extent. Figure 4 shows the performance of the model on the adversary samples (C&W attack) when they are perturbed according to the SMs. We observe that perturbation with the SMs of our enhanced GuidedBP recovers the most accuracy. Rather than claiming that SM-based perturbation is an effective defense method, we aim to show that the SMs created by enhanced GuidedBP better identify the pixels relevant to the classifications. When too many image pixels are perturbed, the visual features of the true target objects are lost, which can also lead to low performance of the model.
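The perturbation step can be sketched as follows. The function name is hypothetical, and two details are assumptions because the text does not specify them: channels of the SM are summed to obtain one score per pixel, and perturbed pixels are overwritten with a constant `fill_value`.

```python
import numpy as np

def perturb_most_salient(image, sm, fraction, fill_value=0.0):
    """Overwrite the `fraction` most salient pixels of `image` (H, W, C) with `fill_value`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    # One relevance score per pixel (assumed aggregation: sum over channels).
    pixel_scores = sm.sum(axis=-1) if sm.ndim == 3 else sm
    n_perturb = int(round(fraction * pixel_scores.size))
    perturbed = image.copy()
    if n_perturb == 0:
        return perturbed
    # Flat indexes of the highest-scoring pixels, most salient first.
    order = np.argsort(pixel_scores, axis=None)[::-1][:n_perturb]
    rows, cols = np.unravel_index(order, pixel_scores.shape)
    perturbed[rows, cols] = fill_value
    return perturbed

# Toy example (made-up values): a 2x2 RGB image and a single-channel SM.
image = np.ones((2, 2, 3))
sm = np.array([[0.1, 0.9],
               [0.5, 0.2]])
out = perturb_most_salient(image, sm, fraction=0.5)
```

Sweeping `fraction` from 0 to 1 and re-classifying `out` at each step would give an accuracy curve of the kind reported in Figure 4.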

Figure 5: The figure shows SMs created by GuidedBP and Enhanced GuidedBP for clean images and adversary ones. The predictions under the map indicate the success or failure of adversary attacks.

To further analyze the adversary-discriminativity of the SMs created by enhanced GuidedBP, we divide the created adversary images into two categories: the ones that mislead the classification decisions successfully and the ones that fail to attack the neural network. For the clean images and the adversary images that fail to attack the network, the created SMs should identify the class-discriminative parts. In contrast, for the successful adversary images, the parts identified in the SMs are irrelevant to the ground-truth label, which means the network focuses on the wrong parts of the adversary images when making decisions.

In Figure 5, the image in the first row contains a vulture. If the created adversary image fails to fool the neural network, the corresponding SM focuses on the head of the vulture (see the 1st-3rd columns to the right of the image). If the attack is successful, the created SM for the misclassified class (i.e., kite) focuses on the wings of the vulture. In comparison, GuidedBP always visualizes all the salient low-level features of all the images (e.g., the ski, the persons, and the alp in the image of the second row).

4 Discussion and Conclusion

Why is enhanced GuidedBP better? The pre-softmax scores (logits) are often taken as output scores to create SMs. The previous attribution methods show that the scores of different classes can be attributed to the same pixels. They explain where the scores themselves come from. Our approach explains where the difference between logits comes from, which is the exact reason why the network predicts a higher probability for class A than for class B. In the optimization that creates adversary images, the loss of the neural network is increased, which results in a change of the rank of the logits. Our approach can find the evidence for the difference between the scores, i.e., the rank of the logits. The change of the rank is the reason for misclassifications. That is why enhanced GuidedBP can explain the classification decisions of adversary images better.

The generalization of the enhancing method. As analyzed in Sec. 2.1, the important factor behind the success of enhanced GuidedBP is that the SMs of different classes have similar filtering effects. When generalizing the enhancing method to other saliency methods, the effectiveness depends on whether their created SMs have similar filtering effects.

In this work, we identify the class-discriminative information in SMs created by GuidedBP and propose a simple way to enhance it. The proposed enhanced GuidedBP can explain classification decisions of adversary images better. In future work, we will investigate how to regularize the deep neural networks using the captured discriminative information so that the rank of logits is not easily changed by adversary perturbations.


References

  • J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt, and B. Kim (2018) Sanity checks for saliency maps. In NeurIPS, pp. 9525–9536.
  • D. Alvarez-Melis and T. S. Jaakkola (2018) On the robustness of interpretability methods. In Workshop on Human Interpretability in Machine Learning (WHI).
  • M. Ancona, E. Ceolini, C. Öztireli, and M. Gross (2017) A unified view of gradient-based attribution methods for deep neural networks. In NIPS 2017-Workshop on Interpreting, Explaining and Visualizing Deep Learning.
  • S. Bach, A. Binder, G. Montavon, F. Klauschen, K. Müller, and W. Samek (2015) On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one 10 (7), pp. e0130140.
  • N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57.
  • P. Dabkowski and Y. Gal (2017) Real time image saliency for black box classifiers. In NeurIPS, pp. 6967–6976.
  • M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. IJCV 88 (2), pp. 303–338.
  • R. C. Fong and A. Vedaldi (2017) Interpretable explanations of black boxes by meaningful perturbation. In ICCV, pp. 3449–3457.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. In ICLR.
  • J. Gu, Y. Yang, and V. Tresp (2018) Understanding individual decisions of cnns via contrastive backpropagation. In ACCV.
  • A. Kurakin, I. Goodfellow, and S. Bengio (2016) Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533.
  • A. Mahendran and A. Vedaldi (2016) Salient deconvolutional networks. In ECCV.
  • W. Nie, Y. Zhang, and A. Patel (2018) A theoretical explanation for perplexing behaviors of backpropagation-based visualizations. In 2018 Workshop on Human Interpretability in Machine Learning (WHI).
  • M. T. Ribeiro, S. Singh, and C. Guestrin (2016) Why should i trust you?: explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD, pp. 1135–1144.
  • R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, et al. (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In ICCV, pp. 618–626.
  • A. Shrikumar, P. Greenside, and A. Kundaje (2017) Learning important features through propagating activation differences. In ICML.
  • K. Simonyan, A. Vedaldi, and A. Zisserman (2013) Deep inside convolutional networks: visualising image classification models and saliency maps. In ICLR.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • D. Smilkov, N. Thorat, B. Kim, F. Viégas, and M. Wattenberg (2017) Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825.
  • J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. A. Riedmiller (2014) Striving for simplicity: the all convolutional net. In ICLR.
  • M. Sundararajan, A. Taly, and Q. Yan (2017) Axiomatic attribution for deep networks. In ICML.
  • C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. In ICLR.
  • M. D. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks. In ECCV, pp. 818–833.
  • L. M. Zintgraf, T. Cohen, T. Adel, and M. Welling (2017) Visualizing deep neural network decisions: prediction difference analysis. In ICLR.