Understanding Impacts of High-Order Loss Approximations and Features in Deep Learning Interpretation

Abstract

Current saliency map interpretations for neural networks generally rely on two key assumptions. First, they use first-order approximations of the loss function, neglecting higher-order terms such as the loss curvature. Second, they evaluate each feature’s importance in isolation, ignoring feature interdependencies. This work studies the effect of relaxing these two assumptions. First, we characterize a closed-form formula for the input Hessian matrix of a deep ReLU network. Using this, we show that, for classification problems with many classes, if a prediction has high probability then including the Hessian term has a small impact on the interpretation. We prove this result by demonstrating that these conditions cause the Hessian matrix to be approximately rank one and its leading eigenvector to be almost parallel to the gradient of the loss. We empirically validate this theory by interpreting ImageNet classifiers. Second, we incorporate feature interdependencies by calculating the importance of group-features using a sparsity regularization term. We use an $\ell_1$ relaxation technique along with proximal gradient descent to efficiently compute group-feature importance values. Our empirical results show that our method significantly improves deep learning interpretations.

Sahil Singla, Eric Wallace, Shi Feng, Soheil Feizi
Computer Science Department, University of Maryland
Correspondence: Sahil Singla <ssingla@cs.umd.edu>, Soheil Feizi <sfeizi@cs.umd.edu>

Keywords: deep learning interpretation, Hessian, regularization

1 Introduction

The growing use of deep learning in sensitive applications such as medicine, autonomous driving, and finance raises concerns about human trust in machine learning systems. For trained models, a central question is test-time interpretability: how can humans understand the reasoning behind model predictions? A common interpretation approach is to identify the importance of each input feature for a model’s prediction. A saliency map can then visualize the important features, e.g., the pixels of an image Simonyan et al. (2014); Sundararajan et al. (2017) or words in a sentence Li et al. (2016).

Several approaches exist to create saliency maps, largely based on model gradients. For example, Simonyan et al. (2014) compute the gradient of the class score with respect to the input, while Smilkov et al. (2017) average the gradient from several noisy versions of the input. Although these gradient-based methods can produce visually pleasing results, they often weakly approximate the underlying model Feng et al. (2018); Nie et al. (2018). Existing saliency interpretations mainly rely on two key assumptions:

  • Gradient-based loss surrogate: For computational efficiency, several existing methods, e.g., Simonyan et al. (2014); Smilkov et al. (2017); Sundararajan et al. (2017), assume that the loss function is almost linear at the test sample. Thus, they use variations of the input gradient to compute feature importance.

  • Isolated feature importance: Current methods evaluate the importance of each feature in isolation, assuming all other features are fixed. Features, however, may have complex interdependencies that can be learned by the model.

This work studies the impact of relaxing these two assumptions in deep learning interpretation. To relax the first assumption, we use the second-order approximation of the loss function by keeping the Hessian term in the Taylor expansion of the loss. For a deep ReLU network and the cross-entropy loss function, we compute this Hessian term in closed-form. Using this closed-form formula for the Hessian, we prove the following for ReLU networks:

Theorem 1 (informal version)

If the probability of the predicted class is close to one and the number of classes is large, first-order and second-order interpretations are sufficiently close to each other.

We present a formal version of this result in Theorem 5 and also validate it empirically. For instance, in ImageNet 2012 Russakovsky et al. (2015), a dataset of 1,000 classes, we show that incorporating the Hessian term in deep learning interpretation has a small impact for most images.

The key idea of the proof follows from the fact that when the number of classes is large and the confidence in the predicted class is high, the Hessian of the loss function is approximately of rank one. In essence, the largest eigenvalue squared is significantly larger than the sum of the squared remaining eigenvalues. Moreover, the corresponding eigenvector is approximately parallel to the gradient vector (Theorem 4). This causes first-order and second-order interpretations to perform similarly. We also show in Appendix F.3 that this result holds empirically for a neural network model that is not piecewise linear. Our theoretical results can also be extended to related problems such as adversarial examples, where most methods are based on first-order loss approximations Goodfellow et al. (2014); Moosavi-Dezfooli et al. (2015); Madry et al. (2018).

Next, we relax the isolated feature importance assumption. To incorporate feature interdependencies in the interpretation, we define the importance function over subsets of features, referred to as group-features. We adjust the subset size on a per-example basis using an unsupervised approach, making the interpretation context-aware. Including group-features in the interpretation makes the optimization combinatorial. To circumvent the associated computational issues, we use an $\ell_1$ relaxation as is common in compressive sensing Candes & Tao (2005); Donoho (2006), LASSO regression Tibshirani (1996), and other related problems. To solve the relaxed optimization, we employ proximal gradient descent Parikh & Boyd (2014). Our empirical results on ImageNet indicate that incorporating group-features removes noise and makes the interpretation more visually coherent with the object of interest. We refer to our interpretation method based on first-order (gradient) information as the CAFO (Context-Aware First Order) interpretation. Similarly, the method based on second-order information is called the CASO (Context-Aware Second Order) interpretation. We provide open-source code at https://github.com/singlasahil14/CASO.

2 Problem Setup and Notation

Consider a prediction problem from input variables (features) $\mathbf{X} \in \mathcal{X} \subseteq \mathbb{R}^{d}$ to an output variable $Y \in \mathcal{Y}$. For example, in the image classification problem, $\mathcal{X}$ is the space of images and $\mathcal{Y}$ is the set of labels $\{1, \dots, c\}$. We observe $n$ samples from these variables, namely $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$. Let $P_n$ be the observed empirical distribution (for simplicity, we hide the dependency of $P_n$ on $n$). The empirical risk minimization (ERM) approach computes the optimal predictor $f_{\theta^*}$ for a loss function $\ell$ using the following optimization:

$$\theta^{*} \in \arg\min_{\theta}\; \mathbb{E}_{(\mathbf{x}, y) \sim P_n}\!\left[\ell\!\left(y, f_{\theta}(\mathbf{x})\right)\right] \qquad\qquad (1)$$

Let $S$ be a subset of $\{1, \dots, d\}$ with cardinality $|S| = k$. For a given sample $\mathbf{x}$, let $\mathbf{x}_S$ indicate the features of $\mathbf{x}$ in positions $S$. We refer to $\mathbf{x}_S$ as a group-feature of $\mathbf{x}$. The importance of a group-feature is proportional to the change in the loss function when $\mathbf{x}_S$ is perturbed. We select the group-feature with maximum importance and visualize that subset in a saliency map.

Definition 1 (Group-Feature Importance Function)

Let $\theta^{*}$ be the optimizer of the ERM problem (1). For a given sample $(\mathbf{x}, y)$, we define the group-feature importance function as follows:

$$I(k, \epsilon) := \max_{\boldsymbol{\Delta}:\ \|\boldsymbol{\Delta}\|_0 \le k,\ \|\boldsymbol{\Delta}\|_2 \le \epsilon}\; \ell\!\left(y, f_{\theta^*}(\mathbf{x} + \boldsymbol{\Delta})\right) - \ell\!\left(y, f_{\theta^*}(\mathbf{x})\right) \qquad\qquad (2)$$

where $\|\cdot\|_0$ counts the number of non-zero elements of its argument (known as the $\ell_0$ norm). The parameter $k$ characterizes an upper bound on the cardinality of the group-features. The parameter $\epsilon$ characterizes an upper bound on the $\ell_2$ norm of feature perturbations.

If $\boldsymbol{\Delta}^{*}$ is the solution of optimization (2), then the vector $\boldsymbol{\Delta}^{*}$ contains the feature importance values that are visualized in the saliency map. Note, when $k = 1$ this definition simplifies to current feature importance formulations which consider features in isolation. When $k > 1$, our formulation can capture feature interdependencies. Parameters $k$ and $\epsilon$ in general depend on the test sample (i.e., the size of the group-features is different for each image and model). We introduce an unsupervised metric to determine these parameters in Section 4.1, but assume these parameters are given for the time being.

The cardinality constraint (i.e., the constraint on the group-feature size) leads to a combinatorial optimization problem in general. Such a sparsity constraint has appeared in different problems such as compressive sensing Candes & Tao (2005); Donoho (2006) and LASSO regression Tibshirani (1996). Under certain conditions, we show that without loss of generality the $\ell_0$ norm can be relaxed to the (convex) $\ell_1$ norm (Appendix E).

Our goal is to solve optimization (2), which is non-linear and non-concave in $\boldsymbol{\Delta}$. Current approaches do not consider the cardinality constraint and optimize $\boldsymbol{\Delta}$ by linearizing the objective function (i.e., using the gradient). To incorporate group-features into current methods, we can add the constraints of optimization (2) to the objective function using Lagrange multipliers. This yields the following Context-Aware First-Order (CAFO) interpretation function.

Definition 2 (The CAFO Interpretation)

For a given sample $(\mathbf{x}, y)$, we define the Context-Aware First-Order (CAFO) importance function as follows:

$$\boldsymbol{\Delta}^{(1)} := \arg\max_{\boldsymbol{\Delta}}\; \nabla_{\mathbf{x}}\ell^{T}\boldsymbol{\Delta} - \lambda_1\|\boldsymbol{\Delta}\|_1 - \lambda_2\|\boldsymbol{\Delta}\|_2^2 \qquad\qquad (3)$$

where $\lambda_1$ and $\lambda_2$ are non-negative regularization parameters. We refer to the objective of this optimization as $J^{(1)}(\boldsymbol{\Delta})$, hiding its dependency on $\mathbf{x}$ and $y$ to simplify notation.

Large values of the regularization parameters $\lambda_1$ and $\lambda_2$ in optimization (3) correspond to small values of the parameters $k$ and $\epsilon$ in optimization (2). Incorporating group-features naturally leads to a sparsity regularizer through the $\ell_1$ penalty. Note, this is not a hard constraint which forces a sparse interpretation. Instead, given a proper choice of the regularization coefficients, the interpretation will reflect the sparsity used by the underlying model. In Section 4.1, we detail our method for setting $\lambda_1$ on an example-specific basis (i.e., context-aware) based on the sparsity ratio of CAFO’s optimal solution. Moreover, in Appendix E, we show that under some general conditions, optimization (3) can be solved efficiently and its solution matches that of the original optimization (2).

To better approximate the loss function, we use its second-order Taylor expansion around the point $\mathbf{x}$:

$$\ell\!\left(y, f_{\theta^*}(\mathbf{x} + \boldsymbol{\Delta})\right) \approx \ell\!\left(y, f_{\theta^*}(\mathbf{x})\right) + \nabla_{\mathbf{x}}\ell^{T}\boldsymbol{\Delta} + \frac{1}{2}\boldsymbol{\Delta}^{T}\mathbf{H}\,\boldsymbol{\Delta} \qquad\qquad (4)$$

where $\nabla_{\mathbf{x}}\ell$ is the gradient and $\mathbf{H} := \nabla^{2}_{\mathbf{x}}\ell$ is the Hessian of the loss function with respect to the input features (note $\theta^{*}$ is fixed). This second-order expansion of the loss function decreases the interpretation’s model approximation error.

By choosing proper values for the regularization parameters, the resulting optimization using the second-order surrogate loss is a strongly concave maximization (equivalently, a strongly convex minimization) problem, allowing for efficient optimization using gradient descent (Theorem 3). Moreover, even though the Hessian matrix can be expensive to compute for large neural networks, gradient updates of our method only require the Hessian-vector product (i.e., $\mathbf{H}\boldsymbol{\Delta}$), which can be computed efficiently Pearlmutter (1994). This yields the following Context-Aware Second-Order (CASO) interpretation function.

Definition 3 (The CASO Interpretation)

For a given sample $(\mathbf{x}, y)$, we define the Context-Aware Second-Order (CASO) importance function as follows:

$$\boldsymbol{\Delta}^{(2)} := \arg\max_{\boldsymbol{\Delta}}\; \nabla_{\mathbf{x}}\ell^{T}\boldsymbol{\Delta} + \frac{1}{2}\boldsymbol{\Delta}^{T}\mathbf{H}\,\boldsymbol{\Delta} - \lambda_1\|\boldsymbol{\Delta}\|_1 - \lambda_2\|\boldsymbol{\Delta}\|_2^2 \qquad\qquad (5)$$

We refer to the objective of this optimization as $J^{(2)}(\boldsymbol{\Delta})$. $\lambda_1$ and $\lambda_2$ are defined as in (3).

3 The Impact of the Hessian

The Hessian is by definition useful when the loss function at the test sample has high curvature. However, given the linear nature of popular network architectures with piecewise linear activations, e.g., ReLU Glorot et al. (2011) or Maxout Goodfellow et al. (2013), do these regions of high curvature even exist? We answer this question for neural networks with piecewise linear activations by first providing an exact calculation of the input Hessian. Then, we use this derivation to understand the impact of including the Hessian term in interpretation. More specifically, we prove that when the probability of the predicted class is 1 and the number of classes is large, the second-order interpretation is similar to the first-order one. We verify this theoretical result experimentally over images in the ImageNet 2012 dataset Russakovsky et al. (2015). We also observe that when the confidence in the predicted class is low, the second-order interpretation can be significantly different from the first-order interpretation. Since second-order interpretations take into account the curvature of the model, we conjecture that they are more faithful to the underlying model in these cases.

3.1 Closed-form Hessian Formula for ReLU Networks

We present an abridged version of the exact Hessian calculation here; the details are provided in Appendix A.1. Neural network models which use piecewise linear activation functions have class scores (logits) which are piecewise linear functions of the input, and are therefore locally linear at a particular input (we ignore points where the function is non-differentiable, as they form a measure-zero set). Thus, we can write:

$$\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}$$

where $\mathbf{x}$ is the input of dimension $d$, $\mathbf{z} \in \mathbb{R}^{c}$ are the logits, $\mathbf{W} \in \mathbb{R}^{c \times d}$ are the weights, and $\mathbf{b} \in \mathbb{R}^{c}$ are the biases of the local linear function. Note that $\mathbf{W}$ combines weights of different layers from the input to the output of the network. Each row of $\mathbf{W}$ is the gradient of a logit with respect to the flattened input and can be computed in auto-grad software such as PyTorch Paszke et al. (2017). We define:

$$\mathbf{p} := \mathrm{softmax}(\mathbf{z}), \qquad \ell := -\sum_{i=1}^{c} y_i \log p_i$$

where $c$ denotes the number of classes, $\mathbf{p}$ denotes the class probabilities, and $\ell$ is the cross-entropy loss function.

In this case, we have the following result:

Proposition 1

The input Hessian $\mathbf{H} = \nabla^{2}_{\mathbf{x}}\ell$ is given by:

$$\mathbf{H} = \mathbf{W}^{T}\left(\mathbf{A} - \mathbf{p}\mathbf{p}^{T}\right)\mathbf{W} \qquad\qquad (6)$$

where $\mathbf{A} = \mathrm{diag}(\mathbf{p})$ is a diagonal matrix whose diagonal elements are equal to the class probabilities $p_i$.

The first observation from Proposition 1 is as follows:

Theorem 2

$\mathbf{H}$ is a positive semidefinite matrix.

These two results allow an extremely efficient computation of the Hessian’s eigenvectors and eigenvalues using the Cholesky decomposition of $\mathbf{A} - \mathbf{p}\mathbf{p}^{T}$ (Appendix C). Note the use of this decomposition is critical, as storing the full $d \times d$ Hessian requires an intractable amount of memory for high-dimensional inputs. The entire calculation of the Hessian’s decomposition for ImageNet using a ResNet-50 He et al. (2016) runs in approximately 4.2 seconds on an NVIDIA GTX 1080 Ti.
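To make the factorization concrete, the following PyTorch sketch builds the factors of Eq. (6) with one backward pass per logit and then recovers the Hessian’s nonzero eigenpairs from a small $c \times c$ problem. It is a minimal illustration under our own naming, not the released code, and it uses a symmetric square root of $\mathbf{A} - \mathbf{p}\mathbf{p}^{T}$ in place of the Cholesky factor described in Appendix C (either works, since $\mathbf{H} = \mathbf{M}^{T}\mathbf{M}$ in both cases).

import torch
import torch.nn.functional as F

def input_hessian_factors(model, x):
    """Factors of the closed-form input Hessian H = W^T (diag(p) - p p^T) W
    (Eq. 6) for a locally linear (e.g. ReLU) classifier. Minimal sketch:
    `model` maps a batch of inputs to logits; names are ours."""
    x = x.clone().requires_grad_(True)
    logits = model(x.unsqueeze(0)).squeeze(0)                 # shape (c,)
    p = F.softmax(logits, dim=0).detach()                     # class probabilities
    rows = []
    for i in range(logits.shape[0]):                          # one backward pass per logit
        g = torch.autograd.grad(logits[i], x, retain_graph=True)[0]
        rows.append(g.reshape(-1))                            # gradient of logit i w.r.t. input
    W = torch.stack(rows)                                     # shape (c, d)
    A_h = torch.diag(p) - torch.outer(p, p)                   # shape (c, c), PSD (Theorem 2)
    return W, A_h

def hessian_eigs(W, A_h):
    """Nonzero eigenpairs of H = W^T A_h W without forming the d x d matrix:
    they match those of the small c x c matrix S W W^T S, where S is a
    symmetric square root of A_h."""
    evals, evecs = torch.linalg.eigh(A_h)
    S = evecs @ torch.diag(evals.clamp(min=0).sqrt()) @ evecs.T
    M = S @ W                                                 # shape (c, d)
    lam, U = torch.linalg.eigh(M @ M.T)                       # small c x c eigenproblem
    V = W.T @ (S @ U)                                         # eigenvectors of H (columns)
    return lam, V / (V.norm(dim=0, keepdim=True) + 1e-12)

Because the eigenproblem is solved on a $c \times c$ matrix rather than a $d \times d$ one, the cost is dominated by the $c$ backward passes used to form $\mathbf{W}$.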

To the best of our knowledge, this is the first work which derives the exact Hessian decomposition for piecewise linear networks. Yao et al. (2018) also proved that the Hessian for piecewise linear networks is at most rank $c$ (the number of classes) but did not derive the exact input Hessian.

One advantage of having a closed-form formula for the Hessian matrix (6) is that we can use it to properly set the regularization parameter $\lambda_2$ in CASO’s formulation. To do this, we rely on the following result:

Theorem 3

If $\lambda_{\max}$ is the largest eigenvalue of $\mathbf{H}$, then for any value of $\lambda_2 > \lambda_{\max}/2$, the second-order interpretation objective function (5) is strongly concave.

We use Theorem 3 to set the regularization parameter $\lambda_2$ for CASO. We need to set $\lambda_2$ large enough to make the optimization strongly concave, but not so large that it overpowers the Hessian term. In particular, we set $\lambda_2 = c_2 \cdot \lambda_{\max}$, where the constant $c_2$ is chosen separately for CASO and CAFO. We observe that if $\lambda_2$ is small, the optimization can become non-concave due to numerical error in the calculation of $\lambda_{\max}$. Above a certain threshold, however, the value of $\lambda_2$ does not have a significant impact on the saliency map.
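When the closed-form factors are available, $\lambda_{\max}$ can be read directly off the eigendecomposition sketched above. The alternative sketch below estimates it with power iteration on Hessian-vector products, which also covers models that are not piecewise linear (as in Appendix F.3); the loop count and the constant $c_2$ here are illustrative choices, not the paper’s settings.

import torch

def hessian_lambda_max(model, loss_fn, x, y, iters=20):
    """Estimate the largest eigenvalue of the input Hessian by power
    iteration on Hessian-vector products (double backprop). Sketch only;
    argument names and `iters` are illustrative."""
    x = x.clone().requires_grad_(True)
    loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
    grad = torch.autograd.grad(loss, x, create_graph=True)[0]   # keep graph for HVPs
    v = torch.randn_like(x)
    v = v / v.norm()
    lam = 0.0
    for _ in range(iters):
        hv = torch.autograd.grad((grad * v).sum(), x, retain_graph=True)[0]  # H @ v
        lam = torch.dot(hv.reshape(-1), v.reshape(-1)).item()    # Rayleigh quotient
        v = hv / (hv.norm() + 1e-12)
    return lam

# e.g. lambda_2 = c2 * hessian_lambda_max(...), with c2 a tuning constant.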

3.2 Theoretical Results on the Hessian Impact

We now leverage the exact Hessian calculation to prove that when the probability of the predicted class is close to 1 and the number of classes is large, the Hessian of a piecewise linear neural network is approximately of rank one and its leading eigenvector is approximately parallel to the gradient. Since a constant scaling does not affect the visualization, this causes the two interpretations to be similar to one another.

Theorem 4

If the probability of the predicted class is $1 - (c-1)\epsilon$ and every other class has probability $\epsilon$, then as $c \to \infty$ such that $(c-1)\epsilon \to 0$, the Hessian is of rank one and its eigenvector is parallel to the gradient.

Let $\boldsymbol{\Delta}^{(2)}$ be the optimal solution to the CASO objective (5) and $\boldsymbol{\Delta}^{(1)}$ be the optimal solution to the CAFO objective (3). We assume $\lambda_1 = 0$ for both objectives.

Theorem 5

If the probability of the predicted class is $1 - (c-1)\epsilon$ and every other class has probability $\epsilon$, then as $c \to \infty$ such that $(c-1)\epsilon \to 0$, the CASO solution of (5) with $\lambda_1 = 0$ is almost parallel to the CAFO solution of (3) with $\lambda_1 = 0$.

We emphasize that our theoretical results are valid in the “asymptotic regime”. To analyze the approximation in the finite length regime, we simulate the relative error between the true Hessian and the rank-one approximation of the Hessian as the number of classes increases and probability of predicted class tends to 1. We find the Hessian quickly converges to rank-one empirically (Appendix F.1).

3.3 Empirical Results on the Hessian Impact

We now present empirical results on the impact of the Hessian in interpreting deep learning models. In our experiments here, we isolate the impact of the Hessian term by setting $\lambda_1 = 0$ in both CASO and CAFO.

A consequence of Theorem 3 is that the gradient descent method with Nesterov momentum converges to the global optimizer of the second-order interpretation objective with a convergence rate of $O(1/T^2)$, where $T$ is the number of iterations; see Appendix B for details.

To optimize $J^{(2)}$ (with $\lambda_1 = 0$), the gradient with respect to $\boldsymbol{\Delta}$ is given by:

$$\nabla_{\boldsymbol{\Delta}} J^{(2)} = \nabla_{\mathbf{x}}\ell + \mathbf{H}\boldsymbol{\Delta} - 2\lambda_2\boldsymbol{\Delta} \qquad\qquad (7)$$

Figure 1: The Frobenius norm difference between CASO and CAFO after normalizing both vectors to have the same norm. Consistent with the result of Theorem 4, when the classification confidence is low, the CASO result differs significantly from CAFO. When the confidence is high, CASO and CAFO are approximately the same. To isolate the impact of the Hessian term, we assume $\lambda_1 = 0$ in both CASO and CAFO.

Figure 2: Panel (a) shows an example where the classification confidence is low. In this case, the CASO and CAFO interpretations differ significantly. Panel (b) demonstrates an example where the classification confidence is high. In this case, CASO and CAFO lead to similar interpretations as suggested by our theory.

The gradient term and the regularization term are straightforward to implement using standard backpropagation.

To compute the Hessian-vector product term $\mathbf{H}\boldsymbol{\Delta}$, we rely on the result of Pearlmutter (1994): a Hessian-vector product can be computed in roughly the same time as the gradient $\nabla_{\mathbf{x}}\ell$. This is handled easily in modern auto-grad software. Moreover, for ReLU networks, our closed-form formula for the Hessian (Proposition 1) can be used in the computation of the Hessian-vector product as well. In our experiments here we use the closed-form formula for $\mathbf{H}$. When $\lambda_1 > 0$, we use proximal gradient descent (Section 4).
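A minimal sketch of this procedure is given below: it follows the ascent direction of Eq. (7), computing $\mathbf{H}\boldsymbol{\Delta}$ by differentiating the inner product of the graph-retaining gradient with $\boldsymbol{\Delta}$. The step size, iteration count, and plain heavy-ball momentum are illustrative simplifications, not the exact optimizer settings used in our experiments.

import torch

def caso_l2_only(model, loss_fn, x, y, lam2, steps=10, lr=0.1, momentum=0.9):
    """Gradient ascent on the CASO objective with lambda_1 = 0, following the
    ascent direction of Eq. (7): grad + H*delta - 2*lam2*delta. The
    Hessian-vector product uses double backprop (Pearlmutter, 1994) rather
    than the closed form. Hyperparameter defaults are illustrative."""
    x = x.clone().requires_grad_(True)
    loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
    grad = torch.autograd.grad(loss, x, create_graph=True)[0]   # keep graph for HVPs
    delta = torch.zeros_like(x)
    velocity = torch.zeros_like(x)
    for _ in range(steps):
        hv = torch.autograd.grad((grad * delta).sum(), x, retain_graph=True)[0]  # H @ delta
        ascent = grad.detach() + hv - 2.0 * lam2 * delta                          # Eq. (7)
        velocity = momentum * velocity + ascent        # heavy-ball momentum (not Nesterov)
        delta = delta + lr * velocity
    return delta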

We compare the second-order (CASO with $\lambda_1 = 0$) and the first-order (CAFO with $\lambda_1 = 0$) interpretations empirically. Note that when $\lambda_1 = 0$, the CAFO solution is $\boldsymbol{\Delta}^{(1)} = \nabla_{\mathbf{x}}\ell / (2\lambda_2)$, i.e., a rescaled version of the gradient.

We compute second-order and first-order interpretations for 1000 random samples from the ImageNet ILSVRC-2012 Russakovsky et al. (2015) validation set using a ResNet-50 He et al. (2016) model. Our loss function is the cross-entropy loss. After calculating $\boldsymbol{\Delta}$ for all methods, the values must be normalized for visualization in a saliency map. We apply a normalization technique from existing work, which we describe in Appendix D.

We plot the Frobenius norm of the difference between CASO and CAFO in Figure 1. Before taking the difference, we normalize the solutions produced by CASO and CAFO to have the same norm, because a constant scaling of the elements of $\boldsymbol{\Delta}$ does not change the visualization.

The empirical results are consistent with our theoretical results: second-order and first-order interpretations are similar when the classification confidence is high. However, when the confidence is small, including the Hessian term can be useful in deep learning interpretation.

To observe the difference between the CAFO and CASO interpretations qualitatively, we compare them for an image where the confidence is high and for one where it is low in Figure 2. When the classification confidence is high, CAFO $\approx$ CASO; when it is low, CASO differs noticeably from CAFO. Additional examples are given in Appendix F.

We do additional experiments to evaluate the impact of the Hessian on a neural network that is not piecewise linear. We interpret a SE-Resnet-50 Hu et al. (2018) neural network (which uses sigmoid non-linearities) on the same 1000 images. We observe a similar trend as in the case of ReLU networks (Appendix F.3).

4 The Impact of Group-features

This section studies the impact of group-features in deep learning interpretation. Group-features enter optimization (2) through the sparsity ($\ell_0$) constraint.

To obtain an unconstrained concave optimization for the CASO interpretation, we relaxed the sparsity (cardinality) constraint (often called an $\ell_0$ norm constraint) to a convex $\ell_1$ norm constraint. Such a relaxation is a core component of popular learning methods such as compressive sensing Candes & Tao (2005); Donoho (2006) and LASSO regression Tibshirani (1996). Using results from this literature, we show this relaxation is tight under certain conditions on the Hessian matrix (see Appendix E). In other words, the optimal $\boldsymbol{\Delta}$ of optimization (5) is sparse with the proper choice of regularization parameters.

Note that the regularization term $-\lambda_1\|\boldsymbol{\Delta}\|_1$ is a concave function of $\boldsymbol{\Delta}$ for $\lambda_1 \ge 0$. Thus, due to Theorem 3, the CASO interpretation objective (5) remains strongly concave.

One method for optimizing this objective is to use gradient descent as done in the second-order interpretation, but with an $\ell_1$ regularization penalty. However, we found that this procedure leads to poor convergence properties in practice, partially due to the non-smoothness of the $\ell_1$ term.

To resolve this issue, we instead use proximal gradient descent to compute a solution for CAFO and CASO when $\lambda_1 > 0$. Using the Nesterov momentum method and backtracking with proximal gradient descent gives a convergence rate of $O(1/T^2)$, where $T$ is the number of gradient updates (Appendix B).

Below we explain how we use proximal gradient descent to optimize our objective. First, we write the (negated) objective function as the sum of a smooth and a non-smooth function:

$$-J^{(2)}(\boldsymbol{\Delta}) = g(\boldsymbol{\Delta}) + h(\boldsymbol{\Delta})$$

Let $g$ be the smooth part and $h$ be the non-smooth part:

$$g(\boldsymbol{\Delta}) = -\nabla_{\mathbf{x}}\ell^{T}\boldsymbol{\Delta} - \frac{1}{2}\boldsymbol{\Delta}^{T}\mathbf{H}\boldsymbol{\Delta} + \lambda_2\|\boldsymbol{\Delta}\|_2^2, \qquad h(\boldsymbol{\Delta}) = \lambda_1\|\boldsymbol{\Delta}\|_1$$

The gradient of the smooth part is given by:

$$\nabla g(\boldsymbol{\Delta}) = -\nabla_{\mathbf{x}}\ell - \mathbf{H}\boldsymbol{\Delta} + 2\lambda_2\boldsymbol{\Delta}$$

The proximal operator of $h$ with step size $\eta$ is the elementwise soft-thresholding operator:

$$\mathrm{prox}_{\eta h}(\mathbf{v})_i = \mathrm{sign}(v_i)\,\max\!\left(|v_i| - \eta\lambda_1,\, 0\right)$$

(a) Interpretation solutions for CAFO with different values of the regularization parameter $\lambda_1$.
(b) Interpretation solutions for CASO with different values of the regularization parameter $\lambda_1$.
(c) Interpretation solutions using the gradient with different clipping thresholds to induce the given sparsity.
Figure 3: Larger $\lambda_1$ values lead to higher sparsity ratios $r$. Our unsupervised method selects the interpretations marked with a green box. Interpretations selected in panels (a) and (b) are less noisy compared to (c).

This formula can be understood intuitively as follows. If the magnitude of an element of $\boldsymbol{\Delta}$ is below the threshold $\eta\lambda_1$, the proximal mapping sets that value to zero. This leads to values that are exactly zero in the saliency map.

To optimize the objective, we use FISTA Beck & Teboulle (2009) with backtracking and the Nesterov momentum optimizer, run for 10 iterations with a fixed initial learning rate and decay factor. $\boldsymbol{\Delta}$ is initialized to zero. FISTA takes a gradient step on the smooth part of the objective and then applies the proximal mapping to the resulting $\boldsymbol{\Delta}$. Backtracking reduces the learning rate when an update results in a higher loss.
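The sketch below shows the core update as a plain ISTA-style loop (no momentum or backtracking) to keep the structure visible; `hvp_fn` is an assumed helper that returns $\mathbf{H}\boldsymbol{\Delta}$, e.g., via double backprop or the closed form of Eq. (6).

import torch

def soft_threshold(v, thresh):
    """Elementwise proximal operator of thresh * ||.||_1 (soft-thresholding)."""
    return torch.sign(v) * torch.clamp(v.abs() - thresh, min=0.0)

def proximal_caso(grad, hvp_fn, lam1, lam2, lr=0.1, steps=10):
    """Proximal gradient sketch for CASO with an l1 penalty. `grad` is the
    input gradient of the loss; `hvp_fn(delta)` returns H @ delta. This is
    an ISTA-style simplification of the FISTA-with-backtracking procedure
    described above; hyperparameter defaults are illustrative."""
    delta = torch.zeros_like(grad)
    for _ in range(steps):
        # Gradient of the smooth (negated) part: -grad - H*delta + 2*lam2*delta
        g_smooth = -grad + 2.0 * lam2 * delta - hvp_fn(delta)
        delta = soft_threshold(delta - lr * g_smooth, lr * lam1)
    return delta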

4.1 Empirical Impact of Group-Features

We now investigate the empirical impact of group-features. In our experiments, we focus on image classification because visual interpretations are intuitive and allow for comparison with prior work. We use a Resnet-50 He et al. (2016) model on the ImageNet ILSVRC-2012 dataset.

To gain an intuition for the effect of $\lambda_1$, we show a sweep over its values in Figure 3. When $\lambda_1$ is too high, the saliency map becomes all zero. Different approaches to setting the regularization parameter have been explored for different problems. For example, in LASSO, one common approach is to use Least Angle Regression Efron et al. (2004).

We propose an unsupervised method based on the sparsity ratio of the interpretation solution to set $\lambda_1$. We define the sparsity ratio $r$ as the number of zero pixels divided by the total number of pixels. We start with a small $\lambda_1$ and increase it by a factor of 10 until $\boldsymbol{\Delta}$ becomes all zeros. Among interpretations with a sparsity ratio in a certain range (a fixed interval in our examples), we choose the interpretation with the highest loss. If we do not find any interpretation that satisfies the sparsity condition, we reduce the first $\lambda_1$ that resulted in $\boldsymbol{\Delta}$ becoming all zero by a factor of 2 and repeat further iterations. In practice, we batch different values of $\lambda_1$ to find a reasonable parameter setting efficiently.
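A sketch of this sweep is shown below; the starting value, the admissible sparsity window, and the number of sweep steps are illustrative placeholders rather than the settings used in our experiments.

def select_lambda1(interpret_fn, lam1_init=1e-4, r_lo=0.3, r_hi=0.7, n_sweep=8):
    """Sketch of the unsupervised lambda_1 selection. `interpret_fn(lam1)`
    returns (delta, loss); lam1_init, [r_lo, r_hi], and n_sweep are
    illustrative values, not the paper's settings."""
    candidates, lam1, delta = [], lam1_init, None
    for _ in range(n_sweep):
        delta, loss = interpret_fn(lam1)
        r = float((delta == 0).float().mean())    # sparsity ratio: fraction of zeros
        if r_lo <= r <= r_hi:
            candidates.append((loss, lam1, delta))
        if r >= 1.0:            # saliency map went all-zero: back off and refine
            lam1 /= 2.0
        else:                   # otherwise keep sweeping lambda_1 upward
            lam1 *= 10.0
    # Among interpretations inside the sparsity window, keep the highest-loss one.
    return max(candidates, key=lambda t: t[0])[2] if candidates else delta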

This method selects the interpretations marked with a green box in Figures 3(a) and 3(b). In Figure 3(c), we show the gradient interpretation with different clipping thresholds chosen to induce the specified sparsity values. We observe that the interpretations obtained using group-features (Figures 3(a) and 3(b)) are less noisy compared to Figure 3(c).

Figure 4: A qualitative comparison of existing interpretation methods. More examples are shown in Appendix H. Grad stands for Vanilla Gradient and IntegratedGrad stands for Integrated Gradient. For our methods (CAFO, CASO, SmoothCAFO, SmoothCASO) the saliency map is more visually coherent with the object of interest compared to existing methods.

5 Qualitative Comparison of Deep Learning Interpretation Methods

This section briefly reviews prior saliency map approaches and compares their performance to CAFO and CASO qualitatively. The proposed Hessian and group-feature terms can be included in existing approaches as well.

Vanilla Gradient: Simonyan et al. (2014) propose to compute the gradient of the class score with respect to the input.

SmoothGrad: Smilkov et al. (2017) argue that the input gradient may fluctuate sharply in the region local to the test sample. To address this, they average the gradient-based importance values generated from many noisy inputs.

Integrated Gradients: Sundararajan et al. (2017) define a baseline $\mathbf{x}'$, which represents an input absent of information (e.g., a completely zero image). Feature importance is determined by accumulating gradient information along the path from the baseline to the original input:

$$\mathrm{IG}_i(\mathbf{x}) = (x_i - x'_i)\int_{0}^{1}\frac{\partial f\!\left(\mathbf{x}' + \alpha(\mathbf{x} - \mathbf{x}')\right)}{\partial x_i}\, d\alpha$$

The integral is approximated by a finite sum.
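For reference, a standard Riemann-sum approximation of Integrated Gradients looks like the following sketch (argument names and the number of steps are ours):

import torch

def integrated_gradients(model, x, target, baseline=None, steps=50):
    """Riemann-sum approximation of Integrated Gradients (Sundararajan et al.,
    2017) for a single input. Minimal sketch; defaults are illustrative."""
    baseline = torch.zeros_like(x) if baseline is None else baseline
    total = torch.zeros_like(x)
    for alpha in torch.linspace(0.0, 1.0, steps):
        point = (baseline + alpha * (x - baseline)).requires_grad_(True)
        score = model(point.unsqueeze(0))[0, target]      # class score for the target
        total += torch.autograd.grad(score, point)[0]
    return (x - baseline) * total / steps                 # (x - x') * average gradient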

We use the normalization method from SmoothGrad Smilkov et al. (2017) for visualizing the saliency map. Details of this method are given in Appendix D.

We can also extend the idea of SmoothGrad to define smooth versions of CASO and CAFO. This yields the following interpretation objective.

Definition 4 (The Smooth CASO Interpretation)

For a given sample $(\mathbf{x}, y)$, we define the smooth context-aware second-order (Smooth CASO) importance function as follows:

$$\boldsymbol{\Delta}^{(s)} := \arg\max_{\boldsymbol{\Delta}}\; \mathbb{E}_{\mathbf{z} \sim \mathcal{N}(0, \sigma^2 I)}\!\left[\nabla_{\mathbf{x}}\ell(\mathbf{x} + \mathbf{z})^{T}\boldsymbol{\Delta} + \frac{1}{2}\boldsymbol{\Delta}^{T}\mathbf{H}(\mathbf{x} + \mathbf{z})\,\boldsymbol{\Delta}\right] - \lambda_1\|\boldsymbol{\Delta}\|_1 - \lambda_2\|\boldsymbol{\Delta}\|_2^2 \qquad\qquad (8)$$

where $\mathbf{z} \sim \mathcal{N}(0, \sigma^2 I)$, and $\lambda_1$ and $\lambda_2$ are defined as before.

In the smoothed versions, we average over $n$ noisy samples $\mathbf{x} + \mathbf{z}$ with $\mathbf{z} \sim \mathcal{N}(0, \sigma^2 I)$. Smooth CAFO is defined similarly without the Hessian term.
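For instance, the linear (gradient) term of the smoothed objectives can be estimated as in the sketch below; the noise level and sample count are illustrative defaults:

import torch

def smooth_gradient(model, loss_fn, x, y, sigma=0.1, n_samples=20):
    """Average input gradient over Gaussian-perturbed copies of x, used as the
    linear term in the smoothed (Smooth CAFO/CASO) variants. Sketch only;
    sigma and n_samples are illustrative."""
    avg = torch.zeros_like(x)
    for _ in range(n_samples):
        noisy = (x + sigma * torch.randn_like(x)).requires_grad_(True)
        loss = loss_fn(model(noisy.unsqueeze(0)), y.unsqueeze(0))
        avg += torch.autograd.grad(loss, noisy)[0]
    return avg / n_samples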

Since quantitatively evaluating a saliency map is an open problem, we focus on two qualitative aspects. First, we inspect visual coherence, i.e., only the object of interest should be highlighted and not the background. Second, we test for discriminativity, i.e., in an image with two objects the predicted object should be highlighted.

Figure 4 shows comparisons between CAFO, CASO, and other existing interpretation methods. Including group-features in the interpretation leads to a sparse saliency map, eliminating the spurious noise and creating a visually coherent saliency map. More examples have been presented in Appendix H.

6 Conclusion and Future Work

We have studied two aspects of the deep learning interpretation problem. First, we characterized a closed-form formula for the input Hessian matrix of a deep ReLU network. Using this, we showed that, if the confidence in the predicted class is high and the number of classes is large, first-order and second-order methods produce similar results. In the process, we also proved that the Hessian matrix is of rank one and its eigenvector is parallel to the gradient. These results can be insightful in other related problems such as adversarial examples. Second, we incorporated feature interdependencies in the interpretation using a sparsity regularization term. Adding this term significantly improves qualitative interpretation results.

There remain many open problems in interpreting deep learning models. For instance, since saliency maps are high-dimensional, they can be sensitive to noise and adversarial perturbations Ghorbani et al. (2017). Moreover, without proper quantitative evaluation metrics for model interpretations, the evaluation of interpretations is often qualitative and can be subjective. Finally, the theoretical impact of the Hessian term for low-confidence predictions, and for the case when the number of classes is small, remains unknown. Resolving these issues is among the interesting directions for future work.

Acknowledgments

Shi Feng and Eric Wallace were supported by NSF Grant IIS-1822494. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsor.

References

  • Adebayo et al. (2018) Adebayo, J., Gilmer, J., Goodfellow, I., and Kim, B. Local explanation methods for deep neural networks lack sensitivity to parameter values. In ICLR Workshop, 2018.
  • Feng et al. (2018) Feng, S., Wallace, E., Grissom II, A., Iyyer, M., Rodriguez, P., and Boyd-Graber, J. Pathologies of neural models make interpretations difficult. In EMNLP, 2018.
  • Adebayo et al. (2018) Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., and Kim, B. Sanity checks for saliency maps. In Proceedings of Advances in Neural Information Processing Systems, 2018.
  • Alvarez-Melis & Jaakkola (2018) Alvarez-Melis, D. and Jaakkola, T. S. Towards Robust Interpretability with Self-Explaining Neural Networks. Proceedings of the Neural Information Processing Systems, 2018.
  • Bach et al. (2015) Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.-R., Samek, W., and Suárez, Ó. D. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. In PloS one, 2015.
  • Beck & Teboulle (2009) Beck, A. and Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2009.
  • Madry et al. (2018) Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In Proceedings of the International Conference on Learning Representations, 2018.
  • Bickel et al. (2009) Bickel, P. J., Ritov, Y., Tsybakov, A. B., et al. Simultaneous analysis of lasso and dantzig selector. The Annals of Statistics, 2009.
  • Candes et al. (2007) Candes, E., Tao, T., et al. The dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 2007.
  • Candes & Tao (2005) Candes, E. J. and Tao, T. Decoding by linear programming. IEEE transactions on information theory, 2005.
  • Carlini & Wagner (2016) Carlini, N. and Wagner, D. Towards Evaluating the Robustness of Neural Networks. IEEE Symposium on Security and Privacy (SP), 2017.
  • Chen et al. (2017) Chen, P.-Y., Sharma, Y., Zhang, H., Yi, J., and Hsieh, C.-J. Ead: elastic-net attacks to deep neural networks via adversarial examples. AAAI, 2018.
  • Donoho (2006) Donoho, D. L. Compressed sensing. In IEEE Transactions on Information Theory, 2006.
  • Efron et al. (2004) Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. Least Angle Regression. 2004.
  • Ghorbani et al. (2017) Ghorbani, A., Abid, A., and Zou, J. Y. Interpretation of neural networks is fragile. AAAI, 2019.
  • Glorot et al. (2011) Glorot, X., Bordes, A., and Bengio, Y. Deep sparse rectifier neural networks. In Proceedings of Artificial Intelligence and Statistics, 2011.
  • Goodfellow et al. (2013) Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. Maxout networks. In Proceedings of the International Conference of Machine Learning, 2013.
  • Goodfellow et al. (2014) Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and Harnessing Adversarial Examples. Proceedings of the International Conference on Learning Representations, 2015.
  • He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition, 2016.
  • Kindermans et al. (2016) Kindermans, P.-J., Schütt, K., Müller, K.-R., and Dähne, S. Investigating the influence of noise and distractors on the interpretation of neural networks. In NIPS Workshop on Interpretable Machine Learning in Complex Systems, 2016.
  • Li et al. (2016) Li, J., Monroe, W., and Jurafsky, D. Understanding neural networks through representation erasure. arXiv preprint arXiv: 1612.08220, 2016.
  • Moosavi-Dezfooli et al. (2015) Moosavi-Dezfooli, S.-M., Fawzi, A., and Frossard, P. DeepFool: a simple and accurate method to fool deep neural networks. CVPR, 2016.
  • Hu et al. (2018) Hu, J., Shen, L., and Sun, G. Squeeze-and-excitation networks. In CVPR, 2018.
  • Nie et al. (2018) Nie, W., Zhang, Y., and Patel, A. A theoretical explanation for perplexing behaviors of backpropagation-based visualizations. Proceedings of the International Conference of Machine Learning, 2018.
  • Parikh & Boyd (2014) Parikh, N. and Boyd, S. Proximal algorithms. Found. Trends Optim., 1(3):127–239, January 2014. ISSN 2167-3888. doi: 10.1561/2400000003. URL http://dx.doi.org/10.1561/2400000003.
  • Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. In NIPS Autodiff Workshop: The Future of Gradient-based Machine Learning Software and Techniques, 2017.
  • Pearlmutter (1994) Pearlmutter, B. A. Fast exact multiplication by the hessian. In Neural Computation, 1994.
  • Kindermans et al. (2018) Kindermans, P.-J., Hooker, S., Adebayo, J., Alber, M., Schütt, K. T., Dähne, S., Erhan, D., and Kim, B. The (un)reliability of saliency methods. In Proceedings of Advances in Neural Information Processing Systems, 2018.
  • Raskutti et al. (2010) Raskutti, G., Wainwright, M. J., and Yu, B. Restricted eigenvalue properties for correlated gaussian designs. Journal of Machine Learning Research, 11(Aug):2241–2259, 2010.
  • Russakovsky et al. (2015) Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 2015.
  • Shrikumar et al. (2017) Shrikumar, A., Greenside, P., and Kundaje, A. Learning important features through propagating activation differences. In Proceedings of the International Conference of Machine Learning, 2017.
  • Simonyan et al. (2014) Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. In Proceedings of the International Conference on Learning Representations, 2014.
  • Smilkov et al. (2017) Smilkov, D., Thorat, N., Kim, B., Viégas, F. B., and Wattenberg, M. SmoothGrad: removing noise by adding noise. CoRR, 2017.
  • Sundararajan et al. (2017) Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. In Proceedings of the International Conference of Machine Learning, 2017.
  • Tibshirani (1996) Tibshirani, R. Regression shrinkage and selection via the lasso. In Journal of the Royal Statistical Society, 1996.
  • Yao et al. (2018) Yao, Z., Gholami, A., Lei, Q., Keutzer, K., and Mahoney, M. W. Hessian-based analysis of large batch training and robustness to adversaries. Proceedings of the Neural Information Processing Systems, 2018.

Appendix

Appendix A Proofs

A.1 Proof of Proposition 1

This section derives the closed-form formula for the Hessian of the loss function for a deep ReLU network. Since a ReLU network is piecewise linear, it is locally linear around an input $\mathbf{x}$. Thus the logits can be represented as:

$$\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}$$

where $\mathbf{x}$ is the input of dimension $d$, $\mathbf{z}$ are the logits, $\mathbf{W} \in \mathbb{R}^{c \times d}$ are the weights, and $\mathbf{b}$ are the biases of the local linear function. In this proof, we use $\mathbf{z}$ to denote the logits, $\mathbf{p}$ to denote the class probabilities, $\mathbf{y}$ to denote the (one-hot) label vector, and $c$ to denote the number of classes. Each row of $\mathbf{W}$ is the gradient of a logit with respect to the flattened input and can be easily computed in auto-grad software such as PyTorch Paszke et al. (2017).

Thus,

$$p_i = \frac{\exp(z_i)}{\sum_{j=1}^{c}\exp(z_j)} \qquad\qquad (9)$$
$$\ell = -\sum_{i=1}^{c} y_i \log p_i \qquad\qquad (10)$$

Using (9) and (10), $\partial\ell/\partial z_i = p_i - y_i$. Therefore, we have:

$$\nabla_{\mathbf{z}}\ell = \mathbf{p} - \mathbf{y} \qquad\qquad (11)$$

Deriving $\partial p_i / \partial z_j$:

$$\frac{\partial p_i}{\partial z_i} = p_i(1 - p_i) \qquad\qquad (12)$$
$$\frac{\partial p_i}{\partial z_j} = -p_i p_j, \quad i \neq j \qquad\qquad (13)$$

Thus, using the chain rule with $\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}$, we have

$$\nabla_{\mathbf{x}}\ell = \mathbf{W}^{T}(\mathbf{p} - \mathbf{y}) \qquad\qquad (14)$$
$$\mathbf{H} = \nabla^{2}_{\mathbf{x}}\ell = \mathbf{W}^{T}\mathbf{A}_{H}\,\mathbf{W} \qquad\qquad (15)$$

where

$$\mathbf{A}_{H} = \mathrm{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^{T} \qquad\qquad (16)$$

This completes the proof.

A.2 Proof of Theorem 2

To simplify notation, define $\mathbf{A}_{H}$ as in (16). For any arbitrary row $i$ of the matrix $\mathbf{A}_{H}$, the diagonal entry is $p_i(1 - p_i)$ and the sum of the absolute values of the off-diagonal entries is $\sum_{j \neq i} p_i p_j = p_i(1 - p_i)$.

Because each diagonal entry equals the corresponding off-diagonal row sum, by the Gershgorin circle theorem all eigenvalues of $\mathbf{A}_{H}$ are non-negative and $\mathbf{A}_{H}$ is a positive semidefinite matrix. Since $\mathbf{A}_{H}$ is positive semidefinite, we can write $\mathbf{A}_{H} = \mathbf{L}\mathbf{L}^{T}$. Using (15), for any vector $\mathbf{v}$:

$$\mathbf{v}^{T}\mathbf{H}\mathbf{v} = \mathbf{v}^{T}\mathbf{W}^{T}\mathbf{L}\mathbf{L}^{T}\mathbf{W}\mathbf{v} = \left\|\mathbf{L}^{T}\mathbf{W}\mathbf{v}\right\|_2^2 \ge 0$$

Hence $\mathbf{H}$ is a positive semidefinite matrix as well.

A.3 Proof of Theorem 3

The second-order interpretation objective function is:

$$J^{(2)}(\boldsymbol{\Delta}) = \nabla_{\mathbf{x}}\ell^{T}\boldsymbol{\Delta} + \frac{1}{2}\boldsymbol{\Delta}^{T}\mathbf{H}\boldsymbol{\Delta} - \lambda_1\|\boldsymbol{\Delta}\|_1 - \lambda_2\|\boldsymbol{\Delta}\|_2^2$$

where $\mathbf{H} = \nabla^{2}_{\mathbf{x}}\ell$ ($\mathbf{x}$ is fixed). The Hessian of the smooth part with respect to $\boldsymbol{\Delta}$ is $\mathbf{H} - 2\lambda_2\mathbf{I}$. Therefore if $\lambda_2 > \lambda_{\max}/2$, $\mathbf{H} - 2\lambda_2\mathbf{I}$ is negative definite; since the $-\lambda_1\|\boldsymbol{\Delta}\|_1$ term is concave, $J^{(2)}$ is strongly concave.

A.4 Proof of Theorem 4

Let the class probabilities be denoted by $\mathbf{p}$, the number of classes by $c$, and the label vector by $\mathbf{y}$. We again use the gradient and the Hessian as defined in (14) and (15), respectively. Without loss of generality, assume that the first class is the class with maximum probability. Hence,

$$p_1 = 1 - (c - 1)\epsilon \qquad\qquad (17)$$

We assume all other classes have small probability (i.e., the confidence is high), $p_i = \epsilon$ for $i = 2, \dots, c$.

Since ,

(18)

We define:

Ignoring terms:

Let be an eigenvalue of and be an eigenvector of , then .
Let be the individual components of the eigenvector. The equation can be rewritten in terms of its individual components as follows:

(19)
(20)
(21)

We first consider the case . Substituting in :

Since is an eigenvector, it cannot be zero,

Let be the corresponding eigenvector for .
By substituting in (20):

Dividing by the normalization constant,

(22)

Now we consider the case . Substituting in  :
The space of eigenvectors for is a dimensional subspace with
Let be the eigenvectors with
Let be the eigenvector with
Writing in terms of its eigenvalues and eigenvectors,

Let

Hence as ,

Using ,

Substituting ,

(23)

Using (14),

Let denote the row of ,
Using and ,


Using (22),

(24)

Using (23),

Using (24),

(25)

Thus, the Hessian is approximately rank one and the gradient is parallel to the Hessian’s only eigenvector.

A.5 Proof of Theorem 5

We use the gradient as defined in (14). Let $\lambda_1 = 0$ in the CASO and CAFO objectives. The CASO objective then becomes:

$$J^{(2)}(\boldsymbol{\Delta}) = \nabla_{\mathbf{x}}\ell^{T}\boldsymbol{\Delta} + \frac{1}{2}\boldsymbol{\Delta}^{T}\mathbf{H}\boldsymbol{\Delta} - \lambda_2\|\boldsymbol{\Delta}\|_2^2$$

Taking the derivative with respect to $\boldsymbol{\Delta}$ and setting it to zero:

$$\nabla_{\mathbf{x}}\ell + \mathbf{H}\boldsymbol{\Delta}^{(2)} - 2\lambda_2\boldsymbol{\Delta}^{(2)} = 0 \quad\Longrightarrow\quad \boldsymbol{\Delta}^{(2)} = \left(2\lambda_2\mathbf{I} - \mathbf{H}\right)^{-1}\nabla_{\mathbf{x}}\ell$$

Similarly, for the CAFO objective we get:

$$\boldsymbol{\Delta}^{(1)} = \frac{1}{2\lambda_2}\,\nabla_{\mathbf{x}}\ell$$

Using (25),

Define:

Thus is the eigenvalue of for the eigenvector:

Consider the matrix :
Let be the eigenvectors of where:

Eigenvalue for
Eigenvalue for

Since each is orthogonal to

Hence $\boldsymbol{\Delta}^{(2)}$ is parallel to $\boldsymbol{\Delta}^{(1)}$, and since scaling does not affect the visualization, the two interpretations are equivalent.

Appendix B Convergence of Gradient Descent to Solve CASO

A consequence of Theorem 3 is that gradient descent with Nesterov momentum converges to the global optimizer of the second-order interpretation objective with a convergence rate of $O(1/T^2)$, where $T$ is the number of iterations. More precisely, we have:

Corollary 1

Let $J^{(2)}$ be the objective function of the second-order interpretation (Definition 3), and let $L$ be the smoothness constant of $-J^{(2)}$. Let $\boldsymbol{\Delta}_t$ be the value of $\boldsymbol{\Delta}$ at the $t$-th step with a learning rate $\eta = 1/L$. We have

$$J^{(2)}(\boldsymbol{\Delta}^{*}) - J^{(2)}(\boldsymbol{\Delta}_t) \le \frac{2L\,\|\boldsymbol{\Delta}_0 - \boldsymbol{\Delta}^{*}\|_2^2}{(t + 1)^2}$$

Appendix C Efficient Computation of the Hessian Matrix Using the Cholesky Decomposition

By Theorem 2, the Cholesky decomposition of $\mathbf{A}_{H}$ (defined in (16)) exists. Let $\mathbf{L}\mathbf{L}^{T}$ be the Cholesky decomposition of $\mathbf{A}_{H}$. Thus, we have

$$\mathbf{H} = \mathbf{W}^{T}\mathbf{L}\mathbf{L}^{T}\mathbf{W}$$

Let $\mathbf{M} = \mathbf{L}^{T}\mathbf{W}$. Thus, $\mathbf{H}$ can be re-written as $\mathbf{M}^{T}\mathbf{M}$.

Let the SVD of $\mathbf{M}$ be the following:

$$\mathbf{M} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{T}$$

Thus, we can write:

$$\mathbf{H} = \mathbf{M}^{T}\mathbf{M} = \mathbf{V}\boldsymbol{\Sigma}^{T}\boldsymbol{\Sigma}\mathbf{V}^{T}$$

Define $\mathbf{C} = \mathbf{M}\mathbf{M}^{T} = \mathbf{L}^{T}\mathbf{W}\mathbf{W}^{T}\mathbf{L}$. Note that $c \ll d$ and the nonzero eigenvalues of $\mathbf{C}$ and $\mathbf{H}$ are the same. For a dataset such as ImageNet, the input has dimension $d = 224 \times 224 \times 3$ and $c = 1000$. Decomposing $\mathbf{C}$ (size $1000 \times 1000$) into its eigenvalues and eigenvectors is computationally efficient. Thus, from the eigenvectors of $\mathbf{C}$, we can compute the eigenvectors of $\mathbf{H}$.

Appendix D Saliency Visualization Methods

Normalizing Feature Importance Values: After assigning importance values to each input feature, the values must be normalized for visualization in a saliency map. For a fair comparison across all methods, we use the non-diverging normalization method from SmoothGrad Smilkov et al. (2017). This normalization method first takes the absolute value of the importance scores and then sums across the three color channels of the image. Next, the largest importance values are capped at a high percentile value. Finally, the importance values are divided by the cap and clipped to enforce the range $[0, 1]$. Code for the method is available at https://github.com/PAIR-code/saliency/blob/master/saliency/visualization.py.
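A sketch of this normalization is given below; the exact percentile used for capping is an illustrative choice:

import torch

def normalize_saliency(delta, percentile=99.0):
    """SmoothGrad-style visualization normalization: absolute value, sum over
    color channels, cap at a high percentile, rescale to [0, 1]. The
    percentile value here is illustrative."""
    sal = delta.abs().sum(dim=0)                              # (3, H, W) -> (H, W)
    cap = torch.quantile(sal.flatten(), percentile / 100.0)   # cap the largest values
    return (sal / (cap + 1e-12)).clamp(0.0, 1.0)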

Domain-Specific Post-Processing: Gradient $\times$ Input Shrikumar et al. (2017) multiplies the importance values by the raw feature values. In image tasks where the baseline is zero, Integrated Gradients Sundararajan et al. (2017) does the same. This heuristic can visually sharpen the saliency map and has some theoretical justification: it is equivalent to the original Layerwise Relevance Propagation technique Bach et al. (2015) modulo a scaling factor Shrikumar et al. (2017); Kindermans et al. (2016). Additionally, if the model is linear, multiplying the gradient by the input is equivalent to a feature’s true contribution to the final class score.

However, multiplying by the input can introduce visual artifacts not present in the importance values Smilkov et al. (2017). We argue against multiplying by the input: it artificially enhances the visualization and only yields benefits in the image domain. Adebayo et al. (2018) argue similarly and show cases when the input term can dominate the interpretation. Moreover, multiplication by the input removes the input invariance of the interpretation regardless of the invariances of the underlying model Kindermans et al. (2018). We observed numerous failures in existing interpretation methods when input multiplication is removed.

Appendix E Tightness of the $\ell_1$ Relaxation

We assume the condition of Theorem 3 holds, thus, the CASO optimization is a concave maximization (equivalently a convex minimization) problem.

Note the CASO optimization with the cardinality constraint can be re-written as follows:

$$\min_{\boldsymbol{\Delta}:\ \|\boldsymbol{\Delta}\|_0 \le k}\; \frac{1}{2}\left\|\mathbf{Q}\boldsymbol{\Delta} - \mathbf{b}\right\|_2^2 \qquad\qquad (26)$$

where

$$\mathbf{Q} = \left(2\lambda_2\mathbf{I} - \mathbf{H}\right)^{1/2} \qquad\qquad (27)$$
$$\mathbf{b} = \mathbf{Q}^{-1}\,\nabla_{\mathbf{x}}\ell \qquad\qquad (28)$$
where $(\cdot)^{1/2}$ indicates the square root of a positive definite matrix. Equation (27) highlights the condition for tuning the parameter $\lambda_2$: it needs to be sufficiently large to allow inversion of $2\lambda_2\mathbf{I} - \mathbf{H}$ but sufficiently small to not “overpower” the Hessian term. Note, we are now minimizing for consistency with the compressive sensing literature. To explain the conditions under which the relaxation is tight, we define the following notation. For a given subset $S \subseteq \{1, \dots, d\}$ and constant $\alpha \ge 1$, we define the following cone:

$$\mathcal{C}(S; \alpha) := \left\{\boldsymbol{\Delta} \in \mathbb{R}^{d} : \|\boldsymbol{\Delta}_{S^{c}}\|_1 \le \alpha\,\|\boldsymbol{\Delta}_{S}\|_1\right\} \qquad\qquad (29)$$

where $S^{c}$ is the complement of $S$. We say that the matrix $\mathbf{Q}$ satisfies the restricted eigenvalue (RE) condition Raskutti et al. (2010); Bickel et al. (2009) over $S$ with parameters $(\alpha, \gamma)$ if