# The (Un)reliability of saliency methods

## Abstract

Saliency methods aim to explain the predictions of deep neural networks. These methods lack reliability when the explanation is sensitive to factors that do not contribute to the model prediction. We use a simple and common pre-processing step (adding a constant shift to the input data) to show that a transformation with no effect on the model can cause numerous methods to attribute incorrectly. In order to guarantee reliability, we posit that methods should fulfill input invariance, the requirement that a saliency method mirror the sensitivity of the model with respect to transformations of the input. We show, through several examples, that saliency methods that do not satisfy input invariance result in misleading attribution.

## 1 Introduction

While considerable research has focused on discerning the decision process of neural networks [2], there remains a trade-off between model complexity and interpretability. Research to address this tension is urgently needed; reliable explanations build trust with users, help identify points of model failure and remove barriers to entry for the deployment of deep neural networks in domains like health care, security and transportation.

In deep neural networks, data representation is delegated to the model, and consequently we cannot generally say in an informative way what led to a model prediction. Instead, saliency methods aim to infer insights about the function $f(x)$ learnt by the model by ranking the explanatory power of constituent inputs. While unified in purpose, these methods are surprisingly divergent and non-overlapping in outcome. Evaluating the reliability of these methods is complicated by a lack of ground truth, as ground truth would depend upon full transparency into how a model arrives at a decision, which is the very problem we are trying to solve in the first place.

Given the need for a quantitative method of comparison, several properties such as completeness, implementation invariance and sensitivity have been articulated as desirable to ensure that saliency methods are reliable [1]. Implementation invariance, proposed as an axiom for attribution methods by [14], is the requirement that functionally equivalent networks (models with different architectures but equal outputs for all inputs) always attribute in an identical way.

This work posits that a second invariance axiom, which we term *input invariance*, needs to be satisfied to ensure reliable interpretation of the input’s contribution to the model prediction. *Input invariance* requires that the saliency method mirror the sensitivity of the model with respect to transformations of the input. We demonstrate that numerous methods do not satisfy input invariance using a simple transformation (a constant shift of the input) that changes the attribution of these methods but does not affect the model prediction or weights. Our results demonstrate that explanations of a network's predictions can be purposefully manipulated to be misleading using surprisingly simple transformations. This work is motivated by an understanding that saliency methods are highly valued tools for gaining intuition about a network. Determining points of failure is a necessary step for knowledgeable use of these tools, as well as a prerequisite for domains like medicine where the incorrect classification of an input as salient carries a high cost.

In this work we:

- introduce the axiom *input invariance* and show, using a simple constant shift in the input, that certain saliency methods do not satisfy this property (see Figure 2).
- demonstrate using MNIST that we can purposefully force misleading attribution (see Figure 3 and Figure 5).
- show that “reference point” methods (Integrated Gradients and the Deep Taylor Decomposition) have diverging attribution and satisfy input invariance only contingent on the choice of reference and the type of transformation considered (see Fig. ?).
- propose data normalization as a way to ensure that some methods satisfy input invariance for the type of transformation considered, and discuss the need for wider research, as normalization does not systematically guarantee reliable attribution for all possible transformations.

In **Section 2**, we detail our experiment framework. In **Section 3**, we determine that while the model is invariant to the input transformation considered, several saliency methods attribute to the mean shift. In **Section 4** we discuss “reference point” methods and illustrate the importance of choosing an appropriate reference before discussing some directions for future research in **Section 5**.

## 2 The model is invariant to a constant shift in input

We show that, by construction, the bias of a neural network compensates for the constant shift resulting in two networks with identical weights and predictions.

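We compare the attribution across two networks, $f_1(x)$ and $f_2(x)$. $f_1(x)$ is a network trained on input $x_1^i$, which denotes sample $i$ from training set $X_1$. The classification task of network 1 is:

$$f_1(x_1^i) = y^i$$

$f_2(x)$ is a network that predicts the classification of a transformed input $x_2^i$. The relationship between $x_1^i$ and $x_2^i$ is the addition of a constant vector $m_2$:

$$\forall i, \quad x_2^i = x_1^i + m_2$$

Network 1 and 2 differ only by construction. Consider the first layer neuron before non-linearity in $f_1(x)$:

$$z = w^T x_1^i + b_1$$

We alter the bias of the first layer neuron so that it compensates for the mean shift $m_2$. This now becomes Network 2:

$$b_2 = b_1 - w^T m_2$$

As a result the first layer activations are the same for $f_1(x)$ and $f_2(x)$:

$$z = w^T x_2^i + b_2 = w^T x_1^i + w^T m_2 + b_1 - w^T m_2 = w^T x_1^i + b_1$$

Note that the gradient with respect to the input remains unchanged as well:

$$\frac{\partial f_1(x_1^i)}{\partial x_1^i} = \frac{\partial f_2(x_2^i)}{\partial x_2^i}$$

We have shown that Network 2 cancels out the mean shift transformation. This means that $f_1(x)$ and $f_2(x)$ have identical weights and produce the same output for all corresponding samples $x_1^i \in X_1$, $x_2^i \in X_2$:

$$\forall i, \quad f_1(x_1^i) = f_2(x_2^i)$$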
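The cancellation can be checked numerically on a single first-layer neuron. The following is a minimal sketch in plain Python, not the authors' code: the weights `w`, bias `b1`, shift `m2` and sample `x1` are illustrative values chosen here, and the bias of the second neuron is assumed to be the original bias minus the weighted shift.

```python
def dot(u, v):
    """Inner product of two equally long lists."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical first-layer weights and bias of Network 1 (illustrative values).
w = [0.5, -1.0, 2.0]
b1 = 0.25

# Constant shift added to every input, and the compensated bias of Network 2.
m2 = [-1.0, -1.0, -1.0]
b2 = b1 - dot(w, m2)

def activation(x, b):
    """ReLU activation of the neuron for input x and bias b."""
    return max(0.0, dot(w, x) + b)

def input_gradient(x, b):
    """Gradient of the activation w.r.t. x: w if the unit is active, else zeros."""
    active = dot(w, x) + b > 0
    return [wd if active else 0.0 for wd in w]

x1 = [0.8, 0.1, 0.4]                        # sample seen by Network 1
x2 = [xd + md for xd, md in zip(x1, m2)]    # shifted sample seen by Network 2

a1, a2 = activation(x1, b1), activation(x2, b2)
g1, g2 = input_gradient(x1, b1), input_gradient(x2, b2)

print("activations:", a1, a2)
print("gradients:  ", g1, g2)
```

Both the activation and the input gradient agree (up to floating-point rounding for the activation), even though the two neurons see inputs in different encodings.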

### 2.1 Experimental Setup

Now, we describe our experiment setup to evaluate the input invariance of a set of saliency methods. Most saliency research to date has centered on convolutional neural networks (CNN). In this work, we evaluate input invariance using a fully connected network. Network 1 is a 3 layer multi-layer perceptron with 1024 ReLU-activated neurons per layer. Network 1 classifies MNIST image inputs in a [0,1] encoding. We consider a negative constant shift of $m_2 = -1$; Network 2 classifies MNIST image inputs in a [-1,0] encoding. The first network is trained for 10 epochs using mini-batch stochastic gradient descent (SGD). The final accuracy is 98.3% for both.^{1}

## 3 The (In)sensitivity of Saliency Methods to Mean Shifts

In Section 3.1 we introduce key approaches to the classification of inputs as salient and the saliency methods we evaluate. In Section 3.2 we find that gradient and signal methods satisfy input invariance. In Section 3.3 we find that all attribution methods considered have points of failure.

### 3.1 Saliency methods considered

Saliency methods broadly fall into three different categories:

- **Gradients (Sensitivity)** [2] show how a small change to the input affects the classification score for the output of interest.
- **Signal methods** such as DeConvNet [16], Guided BackProp [13] and PatternNet [5] aim to isolate input patterns that stimulate neuron activation in higher layers.
- **Attribution methods** such as Deep-Taylor Decomposition [6] and Integrated Gradients [14] assign importance to input dimensions by decomposing the value at an output neuron $j$ into contributions from the individual input dimensions:

  $$f(x)^j = \sum_d A(x)^j_d$$

  $A(x)^j$ is the decomposition into input contributions and has the same number of dimensions as $x$; it signifies the attribution method $A$ applied to output $j$ for sample $x$. Attribution methods are distinct from gradients because of the insistence on *completeness*: the sum of all attributions should be approximately equal to the original output $f(x)^j$.

We consider the input invariance of each category separately (by evaluating raw gradients, GuidedBackprop, PatternNet, Integrated Gradients and Deep Taylor Decomposition) and also benchmark the input invariance of SmoothGrad [12], a method that wraps around an underlying saliency approach and uses the addition of noise to produce a sharper visualization of the saliency heatmap.

The experiment setup and methodology are as described in **Section 2**. Each method is evaluated by comparing the saliency heatmaps for the predictions of network 1 and 2, where $x_2^i$ is simply the mean shifted input ($x_2^i = x_1^i + m_2$). A saliency method that satisfies input invariance will produce identical saliency heatmaps for Network 1 and 2 despite the constant shift in input.

### 3.2 Gradient and Signal methods Satisfy Input Invariance

Gradient and signal methods are not sensitive to a constant shift in inputs. In Figure 1 raw gradients, PatternNet (PN) [5] and GuidedBackprop (GB) [13] produce identical saliency heatmaps for both networks. Intuitively, gradient, PN and GB satisfy input invariance given that we are comparing two networks with identical weights. All three methods determine attribution entirely as a function of the network/pattern weights and thus will be input invariant as long as the networks being compared share those weights.

In the same manner, we can say that these methods will not be input invariant when comparing networks with different weights (even if we consider models with different architectures but identical predictions for every input).

### 3.3 The Sensitivity of Attribution Methods

We evaluate the following attribution methods: gradient times input (GI), integrated gradients (IG, [14]) and the deep-taylor decomposition (DTD, [6]).

We first find GI to be sensitive to constant shifts in the input (first subsection below). We then group discussion of IG and DTD under “reference point” methods (second subsection below) because both require that attribution is done in relation to a chosen reference. We find that satisfying input invariance depends upon the choice of reference point and the type of constant shift to the input.

#### Gradient times input is sensitive to mean shift of inputs

We find that the multiplication of raw gradients by the image fails to satisfy input invariance. In Figure 2 GI produces a different saliency heatmap for each of the two networks.

In Section 3.2 we determined that a saliency heatmap of raw gradients does satisfy input invariance. This breaks when the gradients are multiplied with the input image.

Multiplying by the input fails to satisfy input invariance because the input shift is carried through to final attribution. Naive multiplication by the input, as noted by [12], also constrains attribution without justification to inputs that are not 0.
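A small numerical sketch makes the mechanism concrete. It is written in plain Python with illustrative values (the neuron parameters `w`, `b1`, the shift `m2` and the sample `x1` are assumptions made here, not taken from the experiments): the gradients of the two neurons agree, but gradient times input differs by exactly the gradient multiplied by the shift.

```python
def dot(u, v):
    """Inner product of two equally long lists."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical single ReLU neuron standing in for Network 1 (illustrative values).
w = [0.5, -1.0, 2.0]
b1 = 0.25
m2 = [-1.0, -1.0, -1.0]      # constant shift of the input
b2 = b1 - dot(w, m2)         # bias of Network 2 compensates for the shift

def gradient(x, b):
    """Input gradient of relu(w.x + b): w if the unit is active, else zeros."""
    active = dot(w, x) + b > 0
    return [wd if active else 0.0 for wd in w]

x1 = [0.8, 0.1, 0.4]
x2 = [xd + md for xd, md in zip(x1, m2)]

grad1, grad2 = gradient(x1, b1), gradient(x2, b2)

# Gradient times input for each network.
gi1 = [g * xd for g, xd in zip(grad1, x1)]
gi2 = [g * xd for g, xd in zip(grad2, x2)]

# The difference between the two attributions is the shift carried through.
gap = [second - first for first, second in zip(gi1, gi2)]
carried_shift = [g * md for g, md in zip(grad1, m2)]

print("GI network 1:", gi1)
print("GI network 2:", gi2)
print("difference:  ", gap)
```

In this sketch the difference between the two attributions equals the gradient multiplied element-wise by `m2`, which is how the shift ends up in the final heatmap.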

#### Reliability of Reference Point Methods Depends on the Choice of Reference

Both Integrated Gradients (IG, [14]) and Deep Taylor Decomposition (DTD, [6]) determine the importance of inputs relative to a reference point. DTD refers to this as the root point and IG terms the reference point a baseline. The choice of reference point is not determined *a priori* by the method and is instead a hyperparameter of the attribution task.

The choice of reference point determines all subsequent attribution. In Fig. ? IG and DTD show different attribution depending on the choice of reference point. We show that IG and DTD only satisfy input invariance contingent on the choice of reference point and the type of transformation considered.

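**Integrated gradients** (IG) [14] attributes the predicted score to each input with respect to a baseline $x_0$. This is achieved by constructing a set of inputs interpolating between the baseline and the input.

$$A^{IG}(x) = (x - x_0) \odot \int_{\alpha=0}^{1} \frac{\partial f(x_0 + \alpha (x - x_0))}{\partial x} \, d\alpha$$

Since this integral cannot be computed analytically, it is approximated by a finite sum ranging over $\alpha \in [0, 1]$.

$$A^{IG}(x) \approx (x - x_0) \odot \frac{1}{K} \sum_{k=1}^{K} \frac{\partial f(x_0 + \tfrac{k}{K} (x - x_0))}{\partial x}$$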

We evaluate whether two possible IG reference points satisfy input invariance: an image populated uniformly with the minimum pixel value from the dataset ($x_0 = \min(x)$, a black image) and a zero vector image. In Figure 2, a black image reference point produces identical attribution heatmaps whereas a zero vector reference point is not input invariant.

IG using a black image reference point is not sensitive to the constant shift in input because $x_0$ is determined after the mean shift of the input, so the difference between $x$ and $x_0$ remains the same for both networks. In network 1 this is $x_1 - \min(x_1)$ and in network 2 this is $x_2 - \min(x_2) = (x_1 + m_2) - \min(x_1 + m_2)$.

IG with a zero vector reference point fails to satisfy input invariance because while the difference in network 1 is $x_1 - 0 = x_1$, the difference in network 2 becomes $x_2 - 0 = x_1 + m_2$.
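The contrast between the two baselines can be reproduced on a toy model. The sketch below is an illustration under stated assumptions rather than the experimental code: the two-input ReLU network (`W`, `v`, `b1`), the sample `x1`, the shift `m2` and the number of interpolation steps are all hypothetical choices, and the integral is approximated with a plain Riemann sum.

```python
def dot(u, v):
    """Inner product of two equally long lists."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical network f(x) = v . relu(W x + b); all values are illustrative.
W = [[1.0, -1.0], [0.5, 1.0]]
v = [1.0, 2.0]
b1 = [0.0, -0.5]
m2 = [-1.0, -1.0]                                   # shift from [0, 1] to [-1, 0]
b2 = [bj - dot(row, m2) for row, bj in zip(W, b1)]  # compensated bias of Network 2

def gradient(x, b):
    """Gradient of v . relu(W x + b) with respect to x."""
    grad = [0.0] * len(x)
    for row, bj, vj in zip(W, b, v):
        if dot(row, x) + bj > 0:          # only active ReLU units pass gradient
            for d, w_jd in enumerate(row):
                grad[d] += vj * w_jd
    return grad

def integrated_gradients(x, x0, b, steps=50):
    """Riemann-sum approximation of IG for input x and baseline x0."""
    diff = [xd - x0d for xd, x0d in zip(x, x0)]
    avg = [0.0] * len(x)
    for k in range(1, steps + 1):
        alpha = k / steps
        point = [x0d + alpha * dd for x0d, dd in zip(x0, diff)]
        avg = [a + gd / steps for a, gd in zip(avg, gradient(point, b))]
    return [dd * a for dd, a in zip(diff, avg)]

x1 = [0.9, 0.4]
x2 = [xd + md for xd, md in zip(x1, m2)]

black1, black2 = [0.0, 0.0], [-1.0, -1.0]   # minimum pixel value in each encoding
zero = [0.0, 0.0]

ig_black_1 = integrated_gradients(x1, black1, b1)
ig_black_2 = integrated_gradients(x2, black2, b2)
ig_zero_1 = integrated_gradients(x1, zero, b1)
ig_zero_2 = integrated_gradients(x2, zero, b2)

print("black baseline:", ig_black_1, ig_black_2)
print("zero baseline: ", ig_zero_1, ig_zero_2)
```

With the black baseline the interpolation path moves with the data, so both networks visit corresponding points and the attributions agree. With the zero baseline the path for the shifted input is a different one, and the attributions diverge.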

It is possible to construct a constant vector that will break the reliability of using a black image as a baseline. We consider a transformation of the input where the constant vector ($m_2$) added to $x_1^i$ is an image of a checkered box. Consistent with **Section 2**, the relationship between $x_1^i$ and the transformed input $x_2^i$ is the addition of the checkered box image vector $m_2$.

Figure 3 shows that we are able to manipulate the attribution heatmap of an MNIST prediction so that $m_2$, an image of checkered boxes, appears for all reference points except for PA. This constant vector transformation causes all IG reference points to fail to satisfy input invariance.

**Deep Taylor Decomposition (DTD)** determines attribution relative to a reference point neuron. DTD can satisfy input invariance if the right reference point is chosen. In the general formulation, the attribution of an output neuron $j$ is initialized to be equal to the output of that neuron. The attribution of other output neurons is set to zero. This attribution is backpropagated to input neurons using the following distribution rule, where $s^{l,j}$ is the attribution assigned to neuron $j$ in layer $l$:

$$s^{\text{output},j} = y, \quad s^{\text{output},k \neq j} = 0, \quad s^{l-1,j} = \frac{w \odot (x - x_0)}{w^T x} s^{l,j}$$

We evaluate the input invariance of DTD using a reference point determined by Layer-wise Relevance Propagation (LRP, [1]) and PatternAttribution (PA). In Figure 2, DTD satisfies input invariance when using a reference point defined by PA but fails to satisfy input invariance when using a reference point defined by LRP.

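LRP is sensitive to the input shift because it is a case of DTD where a zero vector is chosen as the root point.^{2} The back-propagation rule becomes:

$$s^{\text{output},j} = y, \quad s^{\text{output},k \neq j} = 0, \quad s^{l-1,j} = \frac{w \odot x}{w^T x} s^{l,j}$$

$s^{l-1,j}$ depends only upon the input and so attribution will change between network 1 and 2 because $x_1$ and $x_2$ differ by a constant vector.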
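To see the effect on a single neuron, the sketch below applies a zero-root-point rule to both networks. It assumes the rule redistributes a neuron's attribution to its inputs in proportion to the element-wise product of weights and input, divided by their inner product; the helper name `zero_root_rule` and all numeric values are hypothetical.

```python
def dot(u, v):
    """Inner product of two equally long lists."""
    return sum(a * b for a, b in zip(u, v))

def zero_root_rule(w, x, s):
    """Redistribute attribution s to the inputs in proportion to w * x.

    Assumes w.x is non-zero; a practical implementation would need a stabiliser.
    """
    total = dot(w, x)
    return [wd * xd / total * s for wd, xd in zip(w, x)]

# Hypothetical linear neuron and shift (illustrative values).
w = [0.5, -1.0, 2.0]
b1 = 0.25
m2 = [-1.0, -1.0, -1.0]
b2 = b1 - dot(w, m2)          # bias of Network 2 compensates for the shift

x1 = [0.8, 0.1, 0.4]
x2 = [xd + md for xd, md in zip(x1, m2)]

# Both networks produce the same output, so they start from the same attribution.
y1 = dot(w, x1) + b1
y2 = dot(w, x2) + b2

lrp1 = zero_root_rule(w, x1, y1)
lrp2 = zero_root_rule(w, x2, y2)

print("outputs:", y1, y2)
print("attribution network 1:", lrp1)
print("attribution network 2:", lrp2)
```

The outputs match, yet the redistributed attributions differ, because the rule reads the raw input values and those carry the shift.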

PatternAttribution (PA) satisfies input invariance because the reference point is defined as the natural direction of variation in the data [5]. This natural direction is determined by the covariance of the data and thus compensates explicitly for the constant vector shift of the input. Therefore it is by construction input invariant.

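The PA root point is:

$$x_0 = x - a w^T x$$

where $a^T w = 1$.

In a linear model:

$$a = \frac{\mathrm{cov}[x, y]}{w^T \mathrm{cov}[x, y]}$$

For neurons followed by a ReLU non-linearity the vector $a$ accounts for the non-linearity and is computed as:

$$a = \frac{\mathbb{E}_+[x y] - \mathbb{E}_+[x]\,\mathbb{E}[y]}{w^T \left(\mathbb{E}_+[x y] - \mathbb{E}_+[x]\,\mathbb{E}[y]\right)}$$

Here $\mathbb{E}_+$ denotes the expectation taken over values where $y$ is positive.

PA reduces to the following step:

$$s^{l-1,j} = \frac{w \odot (x - x_0)}{w^T x} s^{l,j} = \frac{w \odot a w^T x}{w^T x} s^{l,j} = w \odot a \, s^{l,j}$$

The vector $a$ depends upon covariance and thus compensates for the mean shift of the input. The attribution for both networks is thus identical.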
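The invariance of the linear-model pattern can be verified on a toy dataset. This is a sketch under assumptions: the four two-dimensional samples, the weights `w`, bias `b1` and shift `m2` are invented for illustration, the helper `pattern` is a hypothetical name, and only the linear case (covariance divided by its projection onto the weights) is covered.

```python
def dot(u, v):
    """Inner product of two equally long lists."""
    return sum(a * b for a, b in zip(u, v))

def mean(values):
    return sum(values) / len(values)

def pattern(X, y, w):
    """Linear-model pattern: cov[x, y] normalised so that its dot product with w is 1."""
    y_mean = mean(y)
    cov = []
    for d in range(len(w)):
        column = [x[d] for x in X]
        cov.append(mean([xd * yi for xd, yi in zip(column, y)]) - mean(column) * y_mean)
    norm = dot(w, cov)
    return [c / norm for c in cov]

# Hypothetical linear neuron, shift and toy dataset (illustrative values).
w = [1.0, -0.5]
b1 = 0.1
m2 = [-1.0, -1.0]
b2 = b1 - dot(w, m2)          # bias of Network 2 compensates for the shift

X1 = [[0.0, 0.2], [0.5, 0.1], [1.0, 0.9], [0.3, 0.6]]
X2 = [[xd + md for xd, md in zip(x, m2)] for x in X1]
y1 = [dot(w, x) + b1 for x in X1]
y2 = [dot(w, x) + b2 for x in X2]

a1 = pattern(X1, y1, w)
a2 = pattern(X2, y2, w)

# Attribution step for the last sample: weights * pattern * neuron output.
pa1 = [wd * ad * y1[-1] for wd, ad in zip(w, a1)]
pa2 = [wd * ad * y2[-1] for wd, ad in zip(w, a2)]

print("pattern network 1:", a1)
print("pattern network 2:", a2)
print("attributions:", pa1, pa2)
```

Because a covariance is unchanged when a constant is added to the data, both networks recover the same pattern and therefore the same attribution, up to floating-point rounding.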

### 3.4 SmoothGrad Inherits the Sensitivity Properties of Underlying Methods

SmoothGrad (SG, [12]) replaces the input with $N$ identical versions of the input with added random noise. These noisy inputs are injected into the underlying attribution method and the final attribution is the average attribution across the $N$ copies. For example, if the underlying method is the gradient w.r.t. the input, $g(x)^j = \frac{\partial f(x)^j}{\partial x}$, SG becomes:

$$\frac{1}{N} \sum_{i=1}^{N} g\left(x + \mathcal{N}(0, \sigma^2)\right)^j$$

SG often results in aesthetically sharper visualizations when applied to multi-layer neural networks with non-linearities. SG does not alter the attribution method itself, so it will always inherit the sensitivity of the underlying method to an input transformation. Figure 4 compares SG heatmaps generated for all methods discussed so far. Applying SG on top of gradients and signal methods produces identical saliency maps. SG does not satisfy input invariance when applied to gradient times input, LRP and zero vector reference points. SG is insensitive to the input transformation when applied to PA and a black image.
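The inheritance property can be illustrated by wrapping two underlying methods in the same averaging loop. The following sketch is an assumption-laden toy, not the SmoothGrad reference implementation: the network (`W`, `v`, `b1`), the noise level `sigma`, the sample count `n` and the fixed seed are illustrative choices, and the same noise is reused for both networks so that only the input encoding differs.

```python
import random

def dot(u, v):
    """Inner product of two equally long lists."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical network f(x) = v . relu(W x + b); all values are illustrative.
W = [[1.0, -1.0], [0.5, 1.0]]
v = [1.0, 2.0]
b1 = [0.0, -0.5]
m2 = [-1.0, -1.0]
b2 = [bj - dot(row, m2) for row, bj in zip(W, b1)]  # compensated bias of Network 2

def gradient(x, b):
    """Gradient of v . relu(W x + b) with respect to x."""
    grad = [0.0] * len(x)
    for row, bj, vj in zip(W, b, v):
        if dot(row, x) + bj > 0:
            for d, w_jd in enumerate(row):
                grad[d] += vj * w_jd
    return grad

def grad_times_input(x, b):
    return [g * xd for g, xd in zip(gradient(x, b), x)]

def smoothgrad(x, b, explain, n=25, sigma=0.1, seed=0):
    """Average the attribution `explain` over n noisy copies of x."""
    rng = random.Random(seed)   # same seed: both networks see the same noise
    total = [0.0] * len(x)
    for _ in range(n):
        noisy = [xd + rng.gauss(0.0, sigma) for xd in x]
        total = [t + a / n for t, a in zip(total, explain(noisy, b))]
    return total

x1 = [0.9, 0.4]
x2 = [xd + md for xd, md in zip(x1, m2)]

sg_grad_1 = smoothgrad(x1, b1, gradient)
sg_grad_2 = smoothgrad(x2, b2, gradient)
sg_gi_1 = smoothgrad(x1, b1, grad_times_input)
sg_gi_2 = smoothgrad(x2, b2, grad_times_input)

print("SG over gradients:       ", sg_grad_1, sg_grad_2)
print("SG over gradient x input:", sg_gi_1, sg_gi_2)
```

Averaging over noisy copies leaves the gradient-based map unchanged between the two networks, while the gradient times input map still carries the shift.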

## 4 The Importance of Choosing an Appropriate Reference Point

IG and DTD satisfy input invariance only when certain reference points and/or input transformations are considered. The choice of reference point is also important because it determines all subsequent attribution. In fig. ? attribution visually diverges for the same method if multiple reference points are considered.

A reasonable reference point choice will naturally depend upon domain and task. For example, [14] suggests that a black image is a natural reference point for image recognition tasks whereas a zero vector is a reasonable choice for text based networks. However, we have shown that the choice of reference point can lead to very different results. Unintentional misrepresentation of the model is very possible when the implications of attribution using a given reference point are unclear. Thus far, we have discussed attribution for image recognition tasks with the assumption that pre-processing steps are known and visual inspection of the points determined to be salient is possible. For audio and language based models where visual inspection is difficult or inappropriate, identifying failure points or how attribution varies under different baselines poses a challenge.

If we cannot determine the implications of reference point choice, we are limited in our ability to say anything about the reliability of the method. To demonstrate this point, we construct a constant shift of the input that takes advantage of the input invariance points of failure we have already identified.

In the following experiment, we construct a constant vector shift $m_2$ using a hand drawn image of a cat. Network 1 is the same as introduced in **Section 2**. The raw image can be seen in Figure 5. Consistent with **Section 2**, the relationship between $x_1^i$ and the transformed input $x_2^i$ is the addition of a constant vector $m_2$.

We construct $m_2$ by choosing a desired attribution $\hat{s}$ that should be assigned to a specific sample $\hat{x}$ when the gradient is multiplied with the input.

$m_2$ is constructed to ensure that the specific $\hat{x}$ receives the desired attribution as follows:

$$m_2 = \hat{s} \oslash \frac{\partial f(\hat{x})}{\partial \hat{x}} - \hat{x}$$

where $\oslash$ denotes element-wise division.

We clip the shift to be within [-0.3, 0.3] so that the MNIST digit is still visible. Without clipping, the end attribution would only show the cat.
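The construction of the shift can be sketched for a single linear neuron. Everything below is illustrative: the weights `w`, the sample `x_hat` and the target attribution `s_hat` are invented, the shift is assumed to solve "gradient times shifted input equals the desired attribution" element-wise, and the sketch requires every gradient entry to be non-zero.

```python
def clip(value, low, high):
    """Clamp value to the interval [low, high]."""
    return max(low, min(high, value))

# Hypothetical linear neuron: its input gradient is simply the weight vector.
w = [0.5, -1.0, 2.0]
x_hat = [0.8, 0.1, 0.4]      # the sample whose explanation is manipulated
s_hat = [0.3, 0.0, 2.0]      # desired gradient-times-input attribution

grad = w                     # gradient of w.x + b with respect to x

# Solve grad * (x_hat + m2) = s_hat element-wise (needs non-zero gradients).
m2 = [s / g - x for s, g, x in zip(s_hat, grad, x_hat)]
m2_clipped = [clip(m, -0.3, 0.3) for m in m2]

def gi_after_shift(shift):
    """Gradient times input once the shift has been added to x_hat."""
    return [g * (x + m) for g, x, m in zip(grad, x_hat, shift)]

gi_exact = gi_after_shift(m2)            # reproduces s_hat
gi_clipped = gi_after_shift(m2_clipped)  # only approximates s_hat

print("shift:          ", m2)
print("clipped shift:  ", m2_clipped)
print("GI with shift:  ", gi_exact)
print("GI with clipped:", gi_clipped)
```

Without clipping the target attribution is reproduced exactly. With clipping, the entries whose required shift falls outside the allowed range only move part of the way towards the target, which is the trade-off that keeps the original digit visible.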

In Figure 5 transforming the input in this manner allows purposeful misrepresentation of the attribution. All methods, except for PA, fail to satisfy input invariance and visibly show a cat as the explanation for an MNIST prediction.

How can we avoid breaks in input invariance? PA is invariant to the input transformations considered because it relies on the covariance of the data, which compensates for the shift. If the data had been normalized prior to attribution, in a manner that counters this exact transformation, many of the methods considered would still satisfy input invariance. However, this is far from a systematic treatment of reference point selection, as there are input transformations outside of our experiment scope where this would not be sufficient. We believe an important research agenda is furthering the understanding of reference point choices that guarantee reliability without relying on case-by-case solutions.

## 5 Conclusion

Saliency methods are powerful tools to gain intuition about our model. We show that numerous methods fail to attribute correctly when a constant vector shift is applied to the input. More worryingly, we show that we are able to purposefully create a deceptive explanation of the network using a hand drawn cat image.

We introduce *input invariance* as a prerequisite for reliable attribution. Our treatment of input invariance is restricted to demonstrating that there is at least one input transformation (a constant vector shift to the input) that causes numerous saliency methods to attribute incorrectly. This work is motivated by our belief that saliency methods remain valuable tools to gain intuition about the network. Understanding where they fail equips researchers with the tools to appropriately weigh the explanations these methods provide.

Guaranteeing the reliability of saliency methods is crucial in tasks where visual inspection of results is not easy or the cost of incorrect attribution is high. For example, human inspection of the attribution for an image recognition task would catch the cat attack experiment (described in Section 4). However, it is unclear how we would catch the same purposeful manipulation or an unintentional misrepresentation in a language or audio model, where inspection is not possible or its results are opaque. Paradoxically, these are also the cases where attribution is most needed in order to understand the data.

Determining how saliency methods fail is an important stepping stone to understanding where and how we should use these methods. An urgent research agenda, and a requirement for the use of deep neural networks in domains like medicine, is evaluating which methods and/or reference points consistently guarantee reliability for all possible transformations.

## Acknowledgements

We would like to acknowledge the thoughtful feedback and guidance of Grégoire Montavon, Mukund Sundararajan, Ankur Taly, Doug Eck and Jonas Kemp.

### Footnotes

1. Although there is a gap between this and the state of the art, the gap does not significantly influence our findings.
2. This case of DTD is called the z-rule and can be shown to be equivalent to Layer-wise Relevance Propagation [1]. Under specific circumstances, LRP is also equivalent to the gradient times input [4].

### References

1. Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. **On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation.** PloS one.
2. David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert Müller. **How to explain individual classification decisions.** Journal of Machine Learning Research.
3. Stefan Haufe, Frank Meinecke, Kai Görgen, Sven Dähne, John-Dylan Haynes, Benjamin Blankertz, and Felix Bießmann. **On the interpretation of weight vectors of linear models in multivariate neuroimaging.** Neuroimage.
4. Pieter-Jan Kindermans, Kristof Schütt, Klaus-Robert Müller, and Sven Dähne. **Investigating the influence of noise and distractors on the interpretation of neural networks.** arXiv preprint arXiv:1611.07270.
5. Pieter-Jan Kindermans, Kristof T Schütt, Maximilian Alber, Klaus-Robert Müller, Dumitru Erhan, Been Kim, and Sven Dähne. **Learning how to explain neural networks: Patternnet and patternattribution.** arXiv preprint arXiv:1705.05598v2.
6. Grégoire Montavon, Sebastian Lapuschkin, Alexander Binder, Wojciech Samek, and Klaus-Robert Müller. **Explaining nonlinear classification decisions with deep taylor decomposition.** Pattern Recognition.
7. Anh Nguyen, Alexey Dosovitskiy, Jason Yosinski, Thomas Brox, and Jeff Clune. **Synthesizing the preferred inputs for neurons in neural networks via deep generator networks.** In *Advances in Neural Information Processing Systems*, pp. 3387–3395, 2016.
8. Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. **Imagenet large scale visual recognition challenge.** International Journal of Computer Vision.
9. Avanti Shrikumar, Peyton Greenside, Anna Shcherbina, and Anshul Kundaje. **Not just a black box: Learning important features through propagating activation differences.** CoRR.
10. Karen Simonyan and Andrew Zisserman. **Very deep convolutional networks for large-scale image recognition.** In *ICLR*, 2015.
11. Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. **Deep inside convolutional networks: Visualising image classification models and saliency maps.** In *ICLR*, 2014.
12. Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Viégas, and Martin Wattenberg. **Smoothgrad: removing noise by adding noise.** arXiv preprint arXiv:1706.03825.
13. Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. **Striving for simplicity: The all convolutional net.** In *ICLR*, 2015.
14. Mukund Sundararajan, Ankur Taly, and Qiqi Yan. **Axiomatic attribution for deep networks.** arXiv preprint arXiv:1703.01365.
15. Jason Yosinski, Jeff Clune, Thomas Fuchs, and Hod Lipson. **Understanding neural networks through deep visualization.** In *ICML Workshop on Deep Learning*, 2015.
16. Matthew D Zeiler and Rob Fergus. **Visualizing and understanding convolutional networks.** In *European Conference on Computer Vision*, pp. 818–833. Springer, 2014.
17. Luisa M Zintgraf, Taco S Cohen, Tameem Adel, and Max Welling. **Visualizing deep neural network decisions: Prediction difference analysis.** In *ICLR*, 2017.