Understanding Impacts of High-Order Loss Approximations and Features in Deep Learning Interpretation
Abstract
Current saliency map interpretations for neural networks generally rely on two key assumptions. First, they use first-order approximations of the loss function, neglecting higher-order terms such as the loss curvature. Second, they evaluate each feature’s importance in isolation, ignoring feature interdependencies. This work studies the effect of relaxing these two assumptions. First, we characterize a closed-form formula for the input Hessian matrix of a deep ReLU network. Using this, we show that, for classification problems with many classes, if a prediction has high probability then including the Hessian term has a small impact on the interpretation. We prove this result by demonstrating that these conditions cause the Hessian matrix to be approximately rank one and its leading eigenvector to be almost parallel to the gradient of the loss. We empirically validate this theory by interpreting ImageNet classifiers. Second, we incorporate feature interdependencies by calculating the importance of group-features using a sparsity regularization term. We use an $L_0$-$L_1$ relaxation technique along with proximal gradient descent to efficiently compute group-feature importance values. Our empirical results show that our method significantly improves deep learning interpretations.
Sahil Singla, Eric Wallace, Shi Feng, Soheil Feizi
Computer Science Department, University of Maryland
Correspondence to: Sahil Singla (ssingla@cs.umd.edu), Soheil Feizi (sfeizi@cs.umd.edu)
Deep Learning Interpretation, Hessian, regularization
1 Introduction
The growing use of deep learning in sensitive applications such as medicine, autonomous driving, and finance raises concerns about human trust in machine learning systems. For trained models, a central question is test-time interpretability: how can humans understand the reasoning behind model predictions? A common interpretation approach is to identify the importance of each input feature for a model’s prediction. A saliency map can then visualize the important features, e.g., the pixels of an image Simonyan et al. (2014); Sundararajan et al. (2017) or words in a sentence Li et al. (2016).
Several approaches exist to create saliency maps, largely based on model gradients. For example, Simonyan et al. (2014) compute the gradient of the class score with respect to the input, while Smilkov et al. (2017) average the gradient from several noisy versions of the input. Although these gradient-based methods can produce visually pleasing results, they often only weakly approximate the underlying model Feng et al. (2018); Nie et al. (2018). Existing saliency interpretations mainly rely on two key assumptions:

Gradient-based loss surrogate: For computational efficiency, several existing methods, e.g., Simonyan et al. (2014); Smilkov et al. (2017); Sundararajan et al. (2017), assume that the loss function is almost linear at the test sample. Thus, they use variations of the input gradient to compute feature importance.

Isolated feature importance: Current methods evaluate the importance of each feature in isolation, assuming all other features are fixed. Features, however, may have complex interdependencies that can be learned by the model.
This work studies the impact of relaxing these two assumptions in deep learning interpretation. To relax the first assumption, we use the second-order approximation of the loss function by keeping the Hessian term in the Taylor expansion of the loss. For a deep ReLU network and the cross-entropy loss function, we compute this Hessian term in closed form. Using this closed-form formula for the Hessian, we prove the following for ReLU networks:
Theorem 1 (informal version)
If the probability of the predicted class is close to one and the number of classes is large, first-order and second-order interpretations are sufficiently close to each other.
We present a formal version of this result in Theorem 5 and also validate it empirically. For instance, in ImageNet 2012 Russakovsky et al. (2015), a dataset of 1,000 classes, we show that incorporating the Hessian term in deep learning interpretation has a small impact for most images.
The key idea of the proof follows from the fact that when the number of classes is large and the confidence in the predicted class is high, the Hessian of the loss function is approximately of rank one. In essence, the largest eigenvalue squared is significantly larger than the sum of the squared remaining eigenvalues. Moreover, the corresponding eigenvector is approximately parallel to the gradient vector (Theorem 4). This causes first-order and second-order interpretations to perform similarly. We also show in Appendix F.3 that this result holds empirically for a neural network model that is not piecewise linear. Our theoretical results can also be extended to related problems such as adversarial examples, where most methods are based on first-order loss approximations Goodfellow et al. (2014); Moosavi-Dezfooli et al. (2015); Madry et al. (2018).
Next, we relax the isolated feature importance assumption. To incorporate feature interdependencies in the interpretation, we define the importance function over subsets of features, referred to as group-features. We adjust the subset size on a per-example basis using an unsupervised approach, making the interpretation context-aware. Including group-features in the interpretation makes the optimization combinatorial. To circumvent the associated computational issues, we use an $L_0$-$L_1$ relaxation as is common in compressive sensing Candes & Tao (2005); Donoho (2006), LASSO regression Tibshirani (1996), and other related problems. To solve the relaxed optimization, we employ proximal gradient descent Parikh & Boyd (2014). Our empirical results on ImageNet indicate that incorporating group-features removes noise and makes the interpretation more visually coherent with the object of interest. We refer to our interpretation method based on first-order (gradient) information as the CAFO (Context-Aware First-Order) interpretation. Similarly, the method based on second-order information is called the CASO (Context-Aware Second-Order) interpretation. We provide open-source code at https://github.com/singlasahil14/CASO.
2 Problem Setup and Notation
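Consider a prediction problem from input variables (features) $X \in \mathcal{X}$ to an output variable $Y \in \mathcal{Y}$. For example, in the image classification problem, $\mathcal{X}$ is the space of images and $\mathcal{Y}$ is the set of labels $\{1, 2, \dots, c\}$. We observe $m$ samples from these variables, namely $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{m}$. Let $\hat{\mathbb{P}}$ be the observed empirical distribution (for simplicity, we hide the dependency of $\hat{\mathbb{P}}$ on the samples). The empirical risk minimization (ERM) approach computes the optimal predictor $f_{\theta}: \mathcal{X} \to \mathcal{Y}$ for a loss function $\ell(\cdot, \cdot)$ using the following optimization:
$$\min_{\theta} \; \mathbb{E}_{(X,Y) \sim \hat{\mathbb{P}}} \left[ \ell\big(f_{\theta}(X), Y\big) \right]. \tag{1}$$
Let $S$ be a subset of $[d] := \{1, 2, \dots, d\}$ with cardinality $|S|$, where $d$ is the number of input features. For a given sample $(x, y)$, let $x_S$ indicate the features of $x$ in positions $S$. We refer to $x_S$ as a group-feature of $x$. The importance of a group-feature $x_S$ is proportional to the change in the loss function when $x_S$ is perturbed. We select the group-feature with maximum importance and visualize that subset in a saliency map.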
Definition 1 (Group-Feature Importance Function)
Let $\theta^*$ be the optimizer of the ERM problem (1). For a given sample $(x, y)$, we define the group-feature importance function $I_{k,\rho}(x, y)$ as follows:
$$I_{k,\rho}(x, y) := \max_{\tilde{x}} \; \ell\big(f_{\theta^*}(\tilde{x}), y\big) \quad \text{s.t.} \quad \|\tilde{x} - x\|_0 \le k, \;\; \|\tilde{x} - x\|_2 \le \rho, \tag{2}$$
where $\|\cdot\|_0$ counts the number of non-zero elements of its argument (known as the $L_0$ norm). The parameter $k$ characterizes an upper bound on the cardinality of the group-features. The parameter $\rho$ characterizes an upper bound on the $L_2$ norm of feature perturbations.
If $\tilde{x}^*$ is the solution of optimization (2), then the vector $|\tilde{x}^* - x|$ contains the feature importance values that are visualized in the saliency map. Note that when $k = 1$, this definition simplifies to current feature importance formulations, which consider features in isolation. When $k > 1$, our formulation can capture feature interdependencies. Parameters $k$ and $\rho$ in general depend on the test sample (i.e., the size of the group-features is different for each image and model). We introduce an unsupervised metric to determine these parameters in Section 4.1, but assume these parameters are given for the time being.
The cardinality constraint $\|\tilde{x} - x\|_0 \le k$ (i.e., the constraint on the group-feature size) leads to a combinatorial optimization problem in general. Such a sparsity constraint has appeared in different problems such as compressive sensing Candes & Tao (2005); Donoho (2006) and LASSO regression Tibshirani (1996). Under certain conditions, we show that without loss of generality the $L_0$ norm can be relaxed with the (convex) $L_1$ norm (Appendix E).
Our goal is to solve optimization (2), which is nonlinear and nonconcave in $\tilde{x}$. Current approaches do not consider the cardinality constraint and optimize by linearizing the objective function (i.e., using the gradient). To incorporate group-features into current methods, we can add the constraints of optimization (2) to the objective function using Lagrange multipliers. This yields the following Context-Aware First-Order (CAFO) interpretation function.
Definition 2 (The CAFO Interpretation)
For a given sample $(x, y)$, we define the Context-Aware First-Order (CAFO) importance function as follows:
$$\max_{\Delta} \; \nabla_x \ell\big(f_{\theta^*}(x), y\big)^T \Delta - \lambda_1 \|\Delta\|_1 - \lambda_2 \|\Delta\|_2^2, \tag{3}$$
where $\Delta$ is the perturbation of the input, and $\lambda_1$ and $\lambda_2$ are non-negative regularization parameters. We refer to the objective of this optimization as $\tilde{\ell}_1(\Delta)$, hiding its dependency on $(x, y)$ and $\theta^*$ to simplify notation.
Large values of the regularization parameters $\lambda_1$ and $\lambda_2$ in optimization (3) correspond to small values of the parameters $k$ and $\rho$ in optimization (2). Incorporating group-features naturally leads to a sparsity regularizer through the $L_1$ penalty. Note that this is not a hard constraint that forces a sparse interpretation. Instead, given a proper choice of the regularization coefficients, the interpretation will reflect the sparsity used by the underlying model. In Section 4.1, we detail our method for setting $\lambda_1$ on an example-specific basis (i.e., context-aware) based on the sparsity ratio of CAFO’s optimal solution. Moreover, in Appendix E, we show that under some general conditions, optimization (3) can be solved efficiently and its solution matches that of the original optimization (2).
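To illustrate how the $L_1$ penalty produces exact zeros, note that a first-order objective of the form $g^T\Delta - \lambda_1\|\Delta\|_1 - \lambda_2\|\Delta\|_2^2$ is separable across coordinates, so its maximizer is a soft-thresholded and rescaled gradient: $\Delta_i^* = \mathrm{sign}(g_i)\max(|g_i| - \lambda_1, 0)/(2\lambda_2)$. The following is a minimal sketch under that assumed form of the penalty; the function names and the toy gradient are hypothetical and are not taken from the released code.

```python
import numpy as np

def cafo_objective(delta, grad, lam1, lam2):
    """g^T delta - lam1*||delta||_1 - lam2*||delta||_2^2 (assumed CAFO form)."""
    return grad @ delta - lam1 * np.abs(delta).sum() - lam2 * (delta ** 2).sum()

def cafo_closed_form(grad, lam1, lam2):
    """Coordinate-wise maximizer: soft-threshold the gradient, then rescale."""
    return np.sign(grad) * np.maximum(np.abs(grad) - lam1, 0.0) / (2.0 * lam2)

# Toy vector standing in for the input gradient of a real network.
grad = np.array([0.9, -0.05, 0.3, 0.0, -1.2])
delta_star = cafo_closed_form(grad, lam1=0.1, lam2=0.5)
```

Coordinates whose gradient magnitude is below $\lambda_1$ are set exactly to zero, which is the mechanism that removes low-magnitude noise from the saliency map.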
To better approximate the loss function, we use its second-order Taylor expansion around point $x$:
$$\ell\big(f_{\theta^*}(x + \Delta), y\big) \approx \ell\big(f_{\theta^*}(x), y\big) + \nabla_x \ell\big(f_{\theta^*}(x), y\big)^T \Delta + \frac{1}{2} \Delta^T H_x \Delta, \tag{4}$$
where $\Delta := \tilde{x} - x$ and $H_x$ is the Hessian of the loss function on the input features $x$ (note that $\theta^*$ is fixed). This second-order expansion of the loss function decreases the interpretation’s model approximation error.
By choosing proper values for the regularization parameters, the resulting optimization using the second-order surrogate loss is a strictly convex minimization (or equivalently, strictly concave maximization) problem, allowing for efficient optimization using gradient descent (Theorem 3). Moreover, even though the Hessian matrix can be expensive to compute for large neural networks, gradient updates of our method only require the Hessian-vector product (i.e., $H_x \Delta$), which can be computed efficiently Pearlmutter (1994). This yields the following Context-Aware Second-Order (CASO) interpretation function.
Definition 3 (The CASO Interpretation)
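For a given sample $(x, y)$, we define the Context-Aware Second-Order (CASO) importance function as follows:
$$\max_{\Delta} \; \nabla_x \ell\big(f_{\theta^*}(x), y\big)^T \Delta + \frac{1}{2} \Delta^T H_x \Delta - \lambda_1 \|\Delta\|_1 - \lambda_2 \|\Delta\|_2^2. \tag{5}$$
We refer to the objective of this optimization as $\tilde{\ell}_2(\Delta)$. $\lambda_1$ and $\lambda_2$ are defined as in (3).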
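The claim that only Hessian-vector products are needed can be illustrated without ever forming the Hessian. The sketch below approximates $H_x v$ by a central finite difference of two gradient evaluations. This is a simpler stand-in for the exact method of Pearlmutter (1994), not that method itself. The toy softmax-linear model, the logit convention $y = W^T x + b$, and all function names are assumptions made for illustration.

```python
import numpy as np

def softmax(y):
    e = np.exp(y - y.max())
    return e / e.sum()

def loss_grad(x, W, b, label):
    """Input gradient of the cross-entropy loss for logits y = W^T x + b."""
    p = softmax(W.T @ x + b)
    p[label] -= 1.0                      # p - onehot(label)
    return W @ p

def hvp_finite_diff(x, v, W, b, label, eps=1e-5):
    """Approximate H_x v by a central difference of two gradient calls."""
    g_plus = loss_grad(x + eps * v, W, b, label)
    g_minus = loss_grad(x - eps * v, W, b, label)
    return (g_plus - g_minus) / (2.0 * eps)

rng = np.random.default_rng(0)
d, c = 6, 4                              # toy input dimension and class count
W, b = rng.standard_normal((d, c)), rng.standard_normal(c)
x, v = rng.standard_normal(d), rng.standard_normal(d)
hv = hvp_finite_diff(x, v, W, b, label=2)
```

Each product costs two gradient evaluations, so memory stays linear in the input dimension even when the full $d \times d$ Hessian would not fit.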
3 The Impact of the Hessian
The Hessian is by definition useful when the loss function at the test sample has high curvature. However, given the linear nature of popular network architectures with piecewise linear activations, e.g., ReLU Glorot et al. (2011) or Maxout Goodfellow et al. (2013), do these regions of high curvature even exist? We answer this question for neural networks with piecewise linear activations by first providing an exact calculation of the input Hessian. Then, we use this derivation to understand the impact of including the Hessian term in interpretation. More specifically, we prove that when the probability of the predicted class is close to 1 and the number of classes is large, the second-order interpretation is similar to the first-order one. We verify this theoretical result experimentally over images in the ImageNet 2012 dataset Russakovsky et al. (2015). We also observe that when the confidence in the predicted class is low, the second-order interpretation can be significantly different from the first-order interpretation. Since second-order interpretations take into account the curvature of the model, we conjecture that they are more faithful to the underlying model in these cases.
3.1 Closed-form Hessian Formula for ReLU Networks
We present an abridged version of the exact Hessian calculation here; the details are provided in Appendix A.1. Neural network models which use piecewise linear activation functions have class scores (logits) which are piecewise linear functions of the input. That is, since they are piecewise linear over the entire domain, they are linear in a neighborhood of a particular input (we ignore points where the function is non-differentiable, as they form a measure-zero set). Thus, we can write:
$$y = W^T x + b,$$
where $x$ is the input of dimension $d$, $y$ are the logits, $W$ are the weights, and $b$ are the biases of the linear function. Note that $W$ combines weights of different layers from the input to the output of the network. Each column of $W$ is the gradient of logit $y_i$ with respect to the flattened input $x$ and can be handled in autograd software such as PyTorch Paszke et al. (2017). We define:
$$p_i = \frac{e^{y_i}}{\sum_{j=1}^{c} e^{y_j}}, \qquad \ell = -\sum_{i=1}^{c} t_i \log p_i,$$
where $c$ denotes the number of classes, $p$ denotes the class probabilities, $t$ is the one-hot label vector, and $\ell$ is the cross-entropy loss function.
In this case, we have the following result:
Proposition 1
$H_x$ is given by:
$$H_x = W A W^T, \qquad A := \mathrm{diag}(p) - p p^T, \tag{6}$$
where $\mathrm{diag}(p)$ is a diagonal matrix whose diagonal elements are equal to $p$.
The first observation from Proposition 1 is as follows:
Theorem 2
$H_x$ is a positive semidefinite matrix.
These two results allow an extremely efficient computation of the Hessian’s eigenvectors and eigenvalues using the Cholesky decomposition of $A = \mathrm{diag}(p) - p p^T$ (Appendix C). Note that the use of this decomposition is critical, as storing the Hessian requires intractable amounts of memory for high-dimensional inputs. The entire calculation of the Hessian’s decomposition for ImageNet using a ResNet-50 He et al. (2016) runs in approximately 4.2 seconds on an NVIDIA GTX 1080 Ti.
To the best of our knowledge, this is the first work which derives the exact Hessian decomposition for piecewise linear networks. Yao et al. (2018) also proved that the Hessian for piecewise linear networks is at most rank $c$ but did not derive the exact input Hessian.
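The closed-form expression can be sanity-checked numerically on a small locally linear model. The sketch below assumes the convention $y = W^T x + b$ and $H_x = W(\mathrm{diag}(p) - pp^T)W^T$, builds the Hessian from that formula, and compares it against second-order finite differences of the cross-entropy loss. The toy dimensions and helper names are hypothetical.

```python
import numpy as np

def softmax(y):
    e = np.exp(y - y.max())
    return e / e.sum()

def cross_entropy(x, W, b, label):
    """Loss of a locally linear model with logits y = W^T x + b."""
    return -np.log(softmax(W.T @ x + b)[label])

def closed_form_hessian(x, W, b):
    """H_x = W (diag(p) - p p^T) W^T; it does not depend on the label."""
    p = softmax(W.T @ x + b)
    A = np.diag(p) - np.outer(p, p)
    return W @ A @ W.T

def numerical_hessian(f, x, eps=1e-4):
    """Second-order central differences of a scalar function f at x."""
    d = x.size
    H = np.zeros((d, d))
    I = np.eye(d)
    for i in range(d):
        for j in range(d):
            H[i, j] = (f(x + eps * I[i] + eps * I[j]) - f(x + eps * I[i] - eps * I[j])
                       - f(x - eps * I[i] + eps * I[j]) + f(x - eps * I[i] - eps * I[j])
                       ) / (4.0 * eps ** 2)
    return H

rng = np.random.default_rng(1)
d, c = 5, 3
W, b, x = rng.standard_normal((d, c)), rng.standard_normal(c), rng.standard_normal(d)
H = closed_form_hessian(x, W, b)
H_num = numerical_hessian(lambda z: cross_entropy(z, W, b, label=0), x)
eigvals = np.linalg.eigvalsh(H)   # all non-negative up to round-off
```

Because the rows of $\mathrm{diag}(p) - pp^T$ sum to zero, the rank of the resulting matrix is at most $c - 1$ in this toy setting, which is consistent with a low-rank input Hessian.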
One advantage of having a closed-form formula for the Hessian matrix (6) is that we can use it to properly set the regularization parameter $\lambda_2$ in CASO’s formulation. To do this, we rely on the following result:
Theorem 3
If $L$ is the largest eigenvalue of $H_x$, then for any value of $\lambda_2 > L/2$, the second-order interpretation objective function (5) is strongly concave.
We use Theorem 3 to set the regularization parameter $\lambda_2$ for CASO. We need to set $\lambda_2 > L/2$ to make the optimization convex, but not so large that it overpowers $H_x$. In particular, we set $\lambda_2$ proportional to $L$, using the same proportionality constant for CASO and CAFO. We observe that if this constant is small, the optimization can become nonconvex due to numerical error in the calculation of $L$. However, above a threshold, the value of the constant does not have a significant impact on the saliency map.
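As a rough illustration of how the largest eigenvalue can be obtained without materializing the $d \times d$ Hessian, the sketch below uses the fact that the nonzero eigenvalues of $W A W^T$ coincide with those of the $c \times c$ matrix $A W^T W$. This shortcut is a substitute for the Cholesky-based procedure of Appendix C, not a reproduction of it. The threshold $\lambda_2 > L/2$ assumes a penalty of the form $\lambda_2\|\Delta\|_2^2$, and the safety factor 1.5 is a hypothetical choice.

```python
import numpy as np

def softmax(y):
    e = np.exp(y - y.max())
    return e / e.sum()

def largest_hessian_eigenvalue(W, p):
    """Largest eigenvalue of W A W^T, computed from the c x c matrix A W^T W."""
    A = np.diag(p) - np.outer(p, p)
    small = A @ (W.T @ W)                 # c x c, cheap when c << d
    return np.linalg.eigvals(small).real.max()

rng = np.random.default_rng(2)
d, c = 200, 5
W = rng.standard_normal((d, c))           # toy stand-in for the logit gradients
p = softmax(rng.standard_normal(c))
L = largest_hessian_eigenvalue(W, p)

lam2 = 1.5 * (L / 2.0)                    # hypothetical margin above the L/2 threshold

# For checking only: curvature of the second-order objective without the L1 term
# is H - 2*lam2*I, which must be negative definite for strong concavity.
H = W @ (np.diag(p) - np.outer(p, p)) @ W.T
objective_curvature = np.linalg.eigvalsh(H - 2.0 * lam2 * np.eye(d))
```

Forming `H` explicitly is done here only to verify the shortcut on a small problem; for image-sized inputs only the $c \times c$ computation would be practical.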
3.2 Theoretical Results on the Hessian Impact
We now leverage the exact Hessian calculation to prove that when the probability of the predicted class is close to 1 and the number of classes is large, the Hessian of a piecewise linear neural network is approximately of rank one and its leading eigenvector is approximately parallel to the gradient. Since a constant scaling does not affect the visualization, this causes the two interpretations to be similar to one another.
Theorem 4
If the probability of the predicted class is $1 - (c-1)\epsilon$, where $\epsilon \approx 0$, then as $c \to \infty$ such that $(c-1)\epsilon \to 0$, the Hessian is of rank one and its eigenvector is parallel to the gradient.
Let $\Delta^*_{\text{CASO}}$ be the optimal solution to the CASO objective (5) and $\Delta^*_{\text{CAFO}}$ be the optimal solution for the CAFO objective (3). We assume $\lambda_1 = 0$ for both objectives.
Theorem 5
We emphasize that our theoretical results are valid in the “asymptotic regime”. To analyze the approximation in the finite regime, we simulate the relative error between the true Hessian and the rank-one approximation of the Hessian as the number of classes increases and the probability of the predicted class tends to 1. We find that the Hessian quickly converges to rank one empirically (Appendix F.1).
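A simplified version of such a simulation can be sketched on the logit-space matrix $A = \mathrm{diag}(p) - pp^T$ alone, leaving out the weight matrix. The sketch assumes that the predicted class has probability $1 - (c-1)\epsilon$ and that every other class has the same probability $\epsilon$. It is an illustration under that assumption, not the experiment of Appendix F.1, and the function name is hypothetical.

```python
import numpy as np

def rank_one_relative_error(c, eps):
    """Relative Frobenius error of the best rank-one approximation of
    A = diag(p) - p p^T, plus |cosine| between its top eigenvector and p - onehot."""
    p = np.full(c, eps)
    p[0] = 1.0 - (c - 1) * eps            # predicted class
    A = np.diag(p) - np.outer(p, p)
    vals, vecs = np.linalg.eigh(A)        # eigenvalues in ascending order
    err = np.sqrt((vals[:-1] ** 2).sum()) / np.sqrt((vals ** 2).sum())
    top = vecs[:, -1]
    g = p.copy()
    g[0] -= 1.0                           # gradient of the loss w.r.t. the logits
    cosine = abs(top @ g) / np.linalg.norm(g)
    return err, cosine

# Keep the predicted-class probability near 0.999 while the class count grows.
errors = {c: rank_one_relative_error(c, eps=1e-3 / c) for c in (10, 100, 1000)}
```

Under these assumptions the relative error shrinks roughly like $1/\sqrt{c}$, while the leading eigenvector of $A$ stays aligned with the logit-space gradient $p - t$.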
3.3 Empirical Results on the Hessian Impact
We now present empirical results on the impact of the Hessian in interpreting deep learning models. In our experiments here, we isolate the impact of the Hessian term by setting $\lambda_1 = 0$ in both CASO and CAFO.
A consequence of Theorem 3 is that the gradient descent method with Nesterov momentum converges to the global optimizer of the second-order interpretation objective with a convergence rate of $O(1/t^2)$; see Appendix B for details.
To optimize $\tilde{\ell}_2(\Delta)$ with $\lambda_1 = 0$, the gradient is given by:
$$\nabla_{\Delta} \tilde{\ell}_2(\Delta) = \nabla_x \ell\big(f_{\theta^*}(x), y\big) + H_x \Delta - 2 \lambda_2 \Delta. \tag{7}$$
The gradient term $\nabla_x \ell$ and the regularization term $-2\lambda_2 \Delta$ are straightforward to implement using standard backpropagation.
To compute the Hessian-vector product term $H_x \Delta$, we rely on the result of Pearlmutter (1994): a Hessian-vector product can be computed in time of the same order as the gradient. This is handled easily in modern autograd software. Moreover, for ReLU networks, our closed-form formula for the Hessian term (Proposition 1) can be used in the computation of the Hessian-vector product as well. In our experiments here we use the closed-form formula for $H_x$. When $\lambda_1 > 0$, we use proximal gradient descent (Section 4).
We compare the second-order (CASO with $\lambda_1 = 0$) and the first-order (CAFO with $\lambda_1 = 0$) interpretations empirically. Note that when $\lambda_1 = 0$, $\Delta^*_{\text{CAFO}} \propto \nabla_x \ell$, where $\nabla_x \ell$ is the gradient and $\Delta^*_{\text{CAFO}}$ is the interpretation obtained using the CAFO objective.
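With $\lambda_1 = 0$ the second-order objective is a concave quadratic, so the iterative solution can be checked against a linear solve. The sketch below uses plain gradient ascent (not the Nesterov variant mentioned above) on a synthetic positive semidefinite matrix standing in for $H_x$. The penalty form $\lambda_2\|\Delta\|_2^2$, the choice $\lambda_2 = L$, and all variable names are assumptions for illustration.

```python
import numpy as np

def caso_gradient(delta, g, H, lam2):
    """Gradient of g^T d + 0.5 d^T H d - lam2 ||d||^2 with respect to d."""
    return g + H @ delta - 2.0 * lam2 * delta

rng = np.random.default_rng(3)
d = 8
B = rng.standard_normal((d, 3))
H = B @ B.T                                # synthetic PSD stand-in for H_x
g = rng.standard_normal(d)                 # synthetic stand-in for the input gradient
L = np.linalg.eigvalsh(H).max()
lam2 = L                                   # any lam2 > L/2 keeps the objective concave

delta = np.zeros(d)
step = 1.0 / (2.0 * lam2)                  # curvature magnitude is at most 2*lam2
for _ in range(500):
    delta = delta + step * caso_gradient(delta, g, H, lam2)

# Stationary point of the quadratic: (2*lam2*I - H) delta = g
delta_exact = np.linalg.solve(2.0 * lam2 * np.eye(d) - H, g)
# First-order counterpart with lam1 = 0: a rescaled gradient
delta_cafo = g / (2.0 * lam2)
```

Comparing `delta_exact` with `delta_cafo` shows how the Hessian rotates the interpretation away from the pure gradient direction; the two coincide up to scale only when $g$ is an eigenvector of $H_x$.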
We compute second-order and first-order interpretations for 1000 random samples on the ImageNet ILSVRC-2012 Russakovsky et al. (2015) validation set using a ResNet-50 He et al. (2016) model. Our loss function is the cross-entropy loss. After calculating $\Delta$ for all methods, the values must be normalized for visualization in a saliency map. We apply a normalization technique from existing work, which we describe in Appendix D.
We plot the Frobenius norm of the difference between CASO and CAFO in Figure 1. Before taking the difference, we normalize the solutions produced by CASO and CAFO to have the same norm because a constant scaling of the elements of $\Delta$ does not change the visualization.
The empirical results are consistent with our theoretical results: second-order and first-order interpretations are similar when the classification confidence is high. However, when the confidence is low, including the Hessian term can be useful in deep learning interpretation.
To observe the difference between CAFO and CASO interpretations qualitatively, we compare them in Figure 2 for an image where the confidence is high and for one where it is low. When the classification confidence is high, CAFO $\approx$ CASO, and when it is low, CASO $\neq$ CAFO. Additional examples are given in Appendix F.
We perform additional experiments to evaluate the impact of the Hessian on a neural network that is not piecewise linear. We interpret an SE-ResNet-50 Hu et al. (2018) neural network (which uses sigmoid nonlinearities) on the same 1000 images. We observe a similar trend as in the case of ReLU networks (Appendix F.3).
4 The Impact of Group-features
This section studies the impact of group-features in deep learning interpretation. The group-feature has been included as the sparsity constraint in optimization (2).
To obtain an unconstrained concave optimization for the CASO interpretation, we relaxed the sparsity (cardinality) constraint (often called an $L_0$ norm constraint) to a convex $L_1$ norm constraint. Such a relaxation is a core component of popular learning methods such as compressive sensing Candes & Tao (2005); Donoho (2006) or LASSO regression Tibshirani (1996). Using results from this literature, we show that this relaxation is tight under certain conditions on the Hessian matrix $H_x$ (see Appendix E). In other words, the optimal $\Delta$ of optimization (5) is sparse with the proper choice of regularization parameters.
Note that the regularization term $-\lambda_1 \|\Delta\|_1$ is a concave function of $\Delta$ for $\lambda_1 \ge 0$. Together with Theorem 3, this implies that the CASO interpretation objective (5) is strongly concave.
One method for optimizing this objective is to use gradient descent as done in the second-order interpretation, but with an $L_1$ regularization penalty. However, we found that this procedure leads to poor convergence properties in practice, partially due to the non-smoothness of the $L_1$ term.
To resolve this issue, we instead use proximal gradient descent to compute a solution for CAFO and CASO when $\lambda_1 > 0$. Using the Nesterov momentum method and backtracking with proximal gradient descent gives a convergence rate of $O(1/t^2)$, where $t$ is the number of gradient updates (Appendix B).
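Below we explain how we use proximal gradient descent to optimize our objective. First, we write the objective function, stated as a minimization, as the sum of a smooth and a nonsmooth function:
$$\min_{\Delta} \; -\nabla_x \ell^T \Delta - \frac{1}{2} \Delta^T H_x \Delta + \lambda_2 \|\Delta\|_2^2 + \lambda_1 \|\Delta\|_1.$$
Let $g$ be the smooth part and $h$ be the nonsmooth part:
$$g(\Delta) = -\nabla_x \ell^T \Delta - \frac{1}{2} \Delta^T H_x \Delta + \lambda_2 \|\Delta\|_2^2, \qquad h(\Delta) = \lambda_1 \|\Delta\|_1.$$
The gradient of the smooth objective is given by:
$$\nabla g(\Delta) = -\nabla_x \ell - H_x \Delta + 2 \lambda_2 \Delta.$$
The proximal operator is given by:
$$\mathrm{prox}_{\alpha h}(\Delta) = \mathrm{sign}(\Delta) \odot \max\big(|\Delta| - \alpha \lambda_1, 0\big),$$
where $\alpha$ is the learning rate and all operations are applied elementwise.
This formula can be understood intuitively as follows. If the magnitude of some elements of $\Delta$ is below a certain threshold ($\alpha \lambda_1$), the proximal mapping sets those values to zero. This leads to values that are exactly zero in the saliency map.
To optimize the objective, we use FISTA Beck & Teboulle (2009) with backtracking and the Nesterov momentum optimizer, run for 10 iterations. $\Delta$ is initialized to zero. FISTA takes a step with the current learning rate to reduce the smooth objective loss and then applies a proximal mapping to the resulting $\Delta$. Backtracking reduces the learning rate by a decay factor when the update results in a higher loss.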
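The procedure above can be sketched compactly in NumPy. This is a minimal illustration under stated assumptions, not the released implementation: it uses a fixed step size $1/(2\lambda_2)$ instead of backtracking, assumes penalties of the form $\lambda_1\|\Delta\|_1 + \lambda_2\|\Delta\|_2^2$ with $\lambda_2 > L/2$, and runs on a synthetic positive semidefinite matrix standing in for $H_x$. The function names and the values of `lam1` and `n_iter` are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fista_caso(g, H, lam1, lam2, n_iter=200):
    """Minimize -(g^T d + 0.5 d^T H d) + lam2 ||d||^2 + lam1 ||d||_1 with FISTA."""
    # For PSD H, the smooth part has curvature 2*lam2*I - H with eigenvalues
    # at most 2*lam2, so 1/(2*lam2) is a valid fixed step size.
    step = 1.0 / (2.0 * lam2)
    delta = np.zeros(g.size)
    z, t = delta.copy(), 1.0
    for _ in range(n_iter):
        grad_smooth = -g - H @ z + 2.0 * lam2 * z
        new_delta = soft_threshold(z - step * grad_smooth, step * lam1)
        new_t = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        z = new_delta + ((t - 1.0) / new_t) * (new_delta - delta)   # momentum step
        delta, t = new_delta, new_t
    return delta

rng = np.random.default_rng(4)
d = 20
B = rng.standard_normal((d, 3))
H = B @ B.T                                # synthetic PSD stand-in for H_x
g = rng.standard_normal(d)                 # synthetic stand-in for the input gradient
lam2 = np.linalg.eigvalsh(H).max()         # satisfies lam2 > L/2
lam1 = 0.5
delta = fista_caso(g, H, lam1, lam2)
sparsity = np.mean(delta == 0.0)           # fraction of exactly-zero entries
```

A solution is optimal when it is a fixed point of the proximal gradient map, which gives a simple way to check convergence without knowing the optimum in advance.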
4.1 Empirical Impact of Group-Features
We now investigate the empirical impact of group-features. In our experiments, we focus on image classification because visual interpretations are intuitive and allow for comparison with prior work. We use a ResNet-50 He et al. (2016) model on the ImageNet ILSVRC-2012 dataset.
To gain an intuition for the effect of $\lambda_1$, we show a sweep over $\lambda_1$ values in Figure 3. When $\lambda_1$ is too high, the saliency map becomes all zero. Different approaches to set the regularization parameter have been explored in different problems. For example, in LASSO, one common approach is to use Least Angle Regression Efron et al. (2004).
We propose an unsupervised method based on the sparsity ratio of the interpretation solution to set $\lambda_1$. We define the sparsity ratio as the number of zero pixels divided by the total number of pixels. We start with a small value of $\lambda_1$ and increase it by a factor of 10 until $\Delta$ becomes all zeros. Among interpretations whose sparsity ratio falls in a prescribed range, we choose the interpretation with the highest loss. If we do not find any interpretation that satisfies the sparsity condition, we reduce the first $\lambda_1$ that resulted in $\Delta$ becoming zero by a factor of 2 and repeat further iterations. In practice, we batch different values of $\lambda_1$ to find a reasonable parameter setting efficiently.
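The sweep can be sketched as follows. For brevity the sketch uses the closed-form first-order solution (a soft-thresholded gradient) rather than an iterative solver, and it omits both the final selection by highest loss and the halving refinement. The starting value `lam1_start` and the sparsity range `[lo, hi)` are hypothetical placeholders, not the values used in the experiments.

```python
import numpy as np

def cafo_solution(grad, lam1, lam2):
    """Closed-form maximizer of g^T d - lam1*||d||_1 - lam2*||d||_2^2."""
    return np.sign(grad) * np.maximum(np.abs(grad) - lam1, 0.0) / (2.0 * lam2)

def sparsity_ratio(delta):
    """Fraction of entries that are exactly zero."""
    return float(np.mean(delta == 0.0))

def sweep_lambda1(grad, lam2, lam1_start=1e-4, lo=0.5, hi=1.0):
    """Increase lam1 tenfold until delta is all zeros; keep the candidates
    whose sparsity ratio lies in [lo, hi)."""
    candidates, lam1 = [], lam1_start
    while True:
        delta = cafo_solution(grad, lam1, lam2)
        s = sparsity_ratio(delta)
        if s == 1.0:                       # saliency map became all zero
            break
        if lo <= s < hi:
            candidates.append((lam1, s, delta))
        lam1 *= 10.0
    return candidates, lam1

rng = np.random.default_rng(5)
grad = rng.standard_normal(1000)           # toy stand-in for an input gradient
candidates, lam1_zero = sweep_lambda1(grad, lam2=1.0)
```

Because each value of $\lambda_1$ is solved independently, the candidates could also be evaluated in a single batch, in line with the batching remark above.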
This method selects the interpretation marked with a green box in Figures 3(a) and 3(b). In Figure 3(c), we show the gradient interpretation with different values of clipping thresholds to induce the specified sparsity value. We observe that the interpretations obtained using group-features (Figures 3(a) and 3(b)) are less noisy compared to Figure 3(c).
5 Qualitative Comparison of Deep Learning Interpretation Methods
This section briefly reviews prior saliency map approaches and compares their performance to CAFO and CASO qualitatively. The proposed Hessian and group-feature terms can be included in existing approaches as well.
Vanilla Gradient: Simonyan et al. (2014) propose to compute the gradient of the class score with respect to the input.
SmoothGrad: Smilkov et al. (2017) argue that the input gradient may fluctuate sharply in the region local to the test sample. To address this, they average the gradient-based importance values generated from many noisy inputs.
Integrated Gradients: Sundararajan et al. (2017) define a baseline $x^0$, which represents an input absent of information (e.g., a completely zero image). Feature importance is determined by accumulating gradient information along the path from the baseline to the original input: $(x - x^0) \odot \int_0^1 \nabla F\big(x^0 + \alpha (x - x^0)\big)\, d\alpha$, where $F$ is the function being attributed. The integral is approximated by a finite sum.
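The finite-sum approximation used by Integrated Gradients can be sketched on a toy differentiable function with a known gradient. The function `f`, its gradient `grad_f`, the zero baseline, and the number of steps are illustrative assumptions; a real use would obtain the gradients from a network through automatic differentiation.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, n_steps=50):
    """Midpoint Riemann-sum approximation of the path integral of gradients."""
    alphas = (np.arange(n_steps) + 0.5) / n_steps
    avg_grad = np.mean([grad_fn(baseline + a * (x - baseline)) for a in alphas], axis=0)
    return (x - baseline) * avg_grad

# Toy differentiable scalar function and its analytical gradient.
w = np.array([1.0, -2.0, 0.5])
f = lambda x: np.tanh(w @ x)
grad_f = lambda x: (1.0 - np.tanh(w @ x) ** 2) * w

x = np.array([0.3, -0.1, 0.8])
baseline = np.zeros_like(x)
attributions = integrated_gradients(grad_f, x, baseline)
```

A convenient check is the completeness property of the method: the attributions should sum, up to discretization error, to `f(x) - f(baseline)`.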
We use the normalization method from SmoothGrad Smilkov et al. (2017) for visualizing the saliency map. Details of this method are given in Appendix D.
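We can also extend the idea of SmoothGrad to define smooth versions of CASO and CAFO. This yields the following interpretation objective.
Definition 4 (The Smooth CASO Interpretation)
For a given sample $(x, y)$, we define the smooth context-aware second-order (Smooth CASO) importance function as follows:
$$\max_{\Delta} \; \mathbb{E}_{\delta} \left[ \nabla_x \ell\big(f_{\theta^*}(x + \delta), y\big)^T \Delta + \frac{1}{2} \Delta^T H_{x+\delta} \Delta \right] - \lambda_1 \|\Delta\|_1 - \lambda_2 \|\Delta\|_2^2, \tag{8}$$
where $\delta \sim \mathcal{N}(0, \sigma^2 I)$, and $\lambda_1$ and $\lambda_2$ are defined similarly as before.
In the smoothed versions, we approximate the expectation by averaging over noisy samples. Smooth CAFO is defined similarly without the Hessian term.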
Since quantitatively evaluating a saliency map is an open problem, we focus on two qualitative aspects. First, we inspect visual coherence, i.e., only the object of interest should be highlighted and not the background. Second, we test for discriminativity, i.e., in an image with two objects the predicted object should be highlighted.
6 Conclusion and Future Work
We have studied two aspects of the deep learning interpretation problem. First, we characterized a closed-form formula for the input Hessian matrix of a deep ReLU network. Using this, we showed that, if the confidence in the predicted class is high and the number of classes is large, first-order and second-order methods produce similar results. In the process, we also proved that, under these conditions, the Hessian matrix is approximately of rank one and its leading eigenvector is approximately parallel to the gradient. These results can be insightful in other related problems such as adversarial examples. Second, we incorporated feature interdependencies in the interpretation using a sparsity regularization term. Adding this term significantly improves qualitative interpretation results.
There remain many open problems in interpreting deep learning models. For instance, since saliency maps are high-dimensional, they can be sensitive to noise and adversarial perturbations Ghorbani et al. (2017). Moreover, without proper quantitative evaluation metrics for model interpretations, the evaluation of interpretations is often qualitative and can be subjective. Finally, the theoretical impact of the Hessian term for low-confidence predictions and for the case when the number of classes is small remains unknown. Resolving these issues is among the interesting directions for future work.
Acknowledgments
Shi Feng and Eric Wallace were supported by NSF Grant IIS-1822494. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsor.
References
 Adebayo et al. (2018) Adebayo, J., Gilmer, J., Goodfellow, I., and Kim, B. Local explanation methods for deep neural networks lack sensitivity to parameter values. In ICLR Workshop, 2018.
 Feng et al. (2018) Feng, S., Wallace, E., Grissom II, A., Iyyer, M., Rodriguez, P., and Boyd-Graber, J. Pathologies of neural models make interpretations difficult. In EMNLP, 2018.
 Adebayo et al. (2018) Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., and Kim, B. Sanity checks for saliency maps. In Proceedings of Advances in Neural Information Processing Systems, 2018.
 Alvarez-Melis & Jaakkola (2018) Alvarez-Melis, D. and Jaakkola, T. S. Towards Robust Interpretability with Self-Explaining Neural Networks. Proceedings of the Neural Information Processing Systems, 2018.
 Bach et al. (2015) Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.-R., Samek, W., and Suárez, Ó. D. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. In PloS one, 2015.
 Beck & Teboulle (2009) Beck, A. and Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2009.
 Madry et al. (2018) Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. Proceedings of the International Conference on Learning Representations, 2018.
 Bickel et al. (2009) Bickel, P. J., Ritov, Y., Tsybakov, A. B., et al. Simultaneous analysis of lasso and dantzig selector. The Annals of Statistics, 2009.
 Candes et al. (2007) Candes, E., Tao, T., et al. The dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 2007.
 Candes & Tao (2005) Candes, E. J. and Tao, T. Decoding by linear programming. IEEE transactions on information theory, 2005.
 Carlini & Wagner (2016) Carlini, N. and Wagner, D. Towards Evaluating the Robustness of Neural Networks. IEEE Symposium on Security and Privacy (SP), 2017.
 Chen et al. (2017) Chen, P.-Y., Sharma, Y., Zhang, H., Yi, J., and Hsieh, C.-J. EAD: elastic-net attacks to deep neural networks via adversarial examples. AAAI, 2018.
 Donoho (2006) Donoho, D. L. Compressed sensing. In IEEE Transactions on Information Theory, 2006.
 Efron et al. (2004) Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. Least Angle Regression. 2004.
 Ghorbani et al. (2017) Ghorbani, A., Abid, A., and Zou, J. Y. Interpretation of neural networks is fragile. AAAI, 2019.
 Glorot et al. (2011) Glorot, X., Bordes, A., and Bengio, Y. Deep sparse rectifier neural networks. In Proceedings of Artificial Intelligence and Statistics, 2011.
 Goodfellow et al. (2013) Goodfellow, I. J., WardeFarley, D., Mirza, M., Courville, A., and Bengio, Y. Maxout networks. In Proceedings of the International Conference of Machine Learning, 2013.
 Goodfellow et al. (2014) Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and Harnessing Adversarial Examples. Proceedings of the International Conference on Learning Representations, 2015.
 He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition, 2016.
 Kindermans et al. (2016) Kindermans, P.-J., Schütt, K., Müller, K.-R., and Dähne, S. Investigating the influence of noise and distractors on the interpretation of neural networks. In NIPS Workshop on Interpretable Machine Learning in Complex Systems, 2016.
 Li et al. (2016) Li, J., Monroe, W., and Jurafsky, D. Understanding neural networks through representation erasure. arXiv preprint arXiv:1612.08220, 2016.
 Moosavi-Dezfooli et al. (2015) Moosavi-Dezfooli, S.-M., Fawzi, A., and Frossard, P. DeepFool: a simple and accurate method to fool deep neural networks. CVPR, 2016.
 Hu et al. (2018) Hu, J., Shen, L., and Sun, G. Squeeze-and-Excitation Networks. CVPR, 2018.
 Nie et al. (2018) Nie, W., Zhang, Y., and Patel, A. A theoretical explanation for perplexing behaviors of backpropagation-based visualizations. Proceedings of the International Conference of Machine Learning, 2018.
 Parikh & Boyd (2014) Parikh, N. and Boyd, S. Proximal algorithms. Found. Trends Optim., 1(3):127–239, January 2014. ISSN 2167-3888. doi: 10.1561/2400000003. URL http://dx.doi.org/10.1561/2400000003.
 Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop: The Future of Gradient-based Machine Learning Software and Techniques, 2017.
 Pearlmutter (1994) Pearlmutter, B. A. Fast exact multiplication by the Hessian. In Neural Computation, 1994.
 Kindermans et al. (2018) Kindermans, P.-J., Hooker, S., Adebayo, J., Alber, M., Schütt, K. T., Dähne, S., Erhan, D., and Kim, B. The (un)reliability of saliency methods. Proceedings of the Neural Information Processing Systems, 2018.
 Raskutti et al. (2010) Raskutti, G., Wainwright, M. J., and Yu, B. Restricted eigenvalue properties for correlated gaussian designs. Journal of Machine Learning Research, 11(Aug):2241–2259, 2010.
 Russakovsky et al. (2015) Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 2015.
 Shrikumar et al. (2017) Shrikumar, A., Greenside, P., and Kundaje, A. Learning important features through propagating activation differences. In Proceedings of the International Conference on Machine Learning, 2017.
 Simonyan et al. (2014) Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. In Proceedings of the International Conference on Learning Representations, 2014.
 Smilkov et al. (2017) Smilkov, D., Thorat, N., Kim, B., Viégas, F. B., and Wattenberg, M. SmoothGrad: removing noise by adding noise. CoRR, 2017.
 Sundararajan et al. (2017) Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. In Proceedings of the International Conference on Machine Learning, 2017.
 Tibshirani (1996) Tibshirani, R. Regression shrinkage and selection via the lasso. In Journal of the Royal Statistical Society, 1996.
 Yao et al. (2018) Yao, Z., Gholami, A., Lei, Q., Keutzer, K., and Mahoney, M. W. Hessian-based analysis of large batch training and robustness to adversaries. Proceedings of the Neural Information Processing Systems, 2018.
Appendix
Appendix A Proofs
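A.1 Proof of Proposition 1
This section derives the closed-form formula for the Hessian of the loss function for a deep ReLU network. Since a ReLU network is piecewise linear, it is locally linear around an input $x$. Thus the logits can be represented as:
$z = W^\top x + b,$
where $x$ is the input of dimension $d$, $z$ are the logits, $W$ are the weights, and $b$ are the biases of the linear function. In this proof, we use $z$ to denote the logits, $p$ to denote the class probabilities, $y$ to denote the label vector and $c$ to denote the number of classes. Each column of $W$ is the gradient of a logit with respect to the flattened input $x$ and can be easily handled in autograd software such as PyTorch Paszke et al. (2017).
Therefore, we have:
(11) $\nabla_x \ell = W \, \nabla_z \ell, \qquad \nabla_x^2 \ell = W \, (\nabla_z^2 \ell) \, W^\top$
Deriving $\nabla_z \ell$ and $\nabla_z^2 \ell$ for the softmax cross-entropy loss:
(12) $\ell = -\sum_{i=1}^{c} y_i \log p_i, \qquad p_i = \frac{e^{z_i}}{\sum_{j=1}^{c} e^{z_j}}$
(13) $\nabla_z \ell = p - y, \qquad \nabla_z^2 \ell = \mathrm{diag}(p) - p p^\top$
Thus we have,
(14) $g_x = \nabla_x \ell = W (p - y)$
(15) $H_x = \nabla_x^2 \ell = W A W^\top$
where
(16) $A = \mathrm{diag}(p) - p p^\top$
This completes the proof.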
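The closed form can be sanity-checked numerically. The sketch below is an illustration only, not the authors' code: it assumes a locally linear model with logits `W.T @ x + b`, a softmax cross-entropy loss, and a one-hot label, and compares the closed-form Hessian with a central finite-difference estimate. All function names (`softmax`, `loss`, `closed_form`, `finite_difference_hessian`) and the toy sizes are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def loss(x, W, b, y):
    # Cross-entropy of a locally linear model (assumed form of the loss).
    return -np.sum(y * np.log(softmax(W.T @ x + b)))

def closed_form(x, W, b, y):
    # Gradient W (p - y) and Hessian W (diag(p) - p p^T) W^T.
    p = softmax(W.T @ x + b)
    A = np.diag(p) - np.outer(p, p)
    return W @ (p - y), W @ A @ W.T

def finite_difference_hessian(x, W, b, y, h=1e-3):
    # Central second differences, entry by entry (only viable for tiny d).
    d = x.size
    E = np.eye(d)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            H[i, j] = (loss(x + h * E[i] + h * E[j], W, b, y)
                       - loss(x + h * E[i] - h * E[j], W, b, y)
                       - loss(x - h * E[i] + h * E[j], W, b, y)
                       + loss(x - h * E[i] - h * E[j], W, b, y)) / (4 * h * h)
    return H

rng = np.random.default_rng(0)
d, c = 5, 4                               # toy input dimension and class count
W = rng.normal(size=(d, c))
b = rng.normal(size=c)
x = rng.normal(size=d)
y = np.eye(c)[2]                          # one-hot label for class 2

g, H = closed_form(x, W, b, y)
H_fd = finite_difference_hessian(x, W, b, y)
max_err = np.abs(H - H_fd).max()          # should be small
```

For a real network, the matrix playing the role of `W` would be obtained from automatic differentiation of the logits with respect to the input rather than stored explicitly.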
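A.2 Proof of Theorem 2
To simplify notation, define $A$ as in (16). For any arbitrary row $i$ of the matrix $A$, we have
$|A_{ii}| - \sum_{j \neq i} |A_{ij}| = p_i (1 - p_i) - p_i \sum_{j \neq i} p_j = p_i \Big(1 - \sum_{j=1}^{c} p_j\Big) = 0.$
Because $A_{ii} \geq \sum_{j \neq i} |A_{ij}|$, by the Gershgorin Circle theorem, we have that all eigenvalues of $A$ are nonnegative and $A$ is a positive semidefinite matrix. Since $A$ is positive semidefinite, we can write $A = L L^\top$. Using (15):
$H_x = W L L^\top W^\top = (W L)(W L)^\top.$
Hence $H_x$ is a positive semidefinite matrix as well.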
A.3 Proof of Theorem 3
The second-order interpretation objective function is:
where ( is fixed). Therefore if , is negative definite and is strongly concave.
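The concavity argument can be sketched as follows. This is a reconstruction under stated assumptions, not a quotation of the original equations: it assumes that the smooth part of the objective is a quadratic in the perturbation $\Delta$ with a squared-norm penalty of weight $\lambda_2$, that $L$ denotes the largest eigenvalue of $H_x$, and that any sparsity penalty is a concave term that does not affect the conclusion. The symbol names and the exact scaling of the penalty are assumptions.

```latex
\begin{align*}
\tilde{\ell}(\Delta) &= g_x^\top \Delta + \tfrac{1}{2}\, \Delta^\top H_x \Delta - \lambda_2 \|\Delta\|_2^2, \\
\nabla_\Delta^2\, \tilde{\ell}(\Delta) &= H_x - 2 \lambda_2 I, \\
\lambda_{\max}\!\left(H_x - 2 \lambda_2 I\right) &= L - 2 \lambda_2 < 0
\quad \Longleftrightarrow \quad \lambda_2 > \tfrac{L}{2}.
\end{align*}
```

Under these assumptions the Hessian of the objective is negative definite with all eigenvalues at most $L - 2\lambda_2$, which is the strong concavity property used in the convergence result.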
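A.4 Proof of Theorem 4
Let the class probabilities be denoted by $p$, the number of classes by $c$ and the label vector by $y$. We again use $g_x$ and $H_x$ as defined in (14) and (15) respectively. Without loss of generality, assume that the first class is the class with maximum probability. Hence,
(17) $y = [1, 0, \ldots, 0]^\top$
We assume all other classes have small probability (i.e., the confidence is high),
$p_i = \epsilon \quad \forall\, i \in \{2, \ldots, c\}$
Since $\sum_{i=1}^{c} p_i = 1$,
(18) $p = [\,1 - (c-1)\epsilon,\ \epsilon,\ \ldots,\ \epsilon\,]^\top$
We define:
$\bar{A} = A / \epsilon$
Ignoring $\epsilon^2$ terms in $A$ (valid when $(c-1)\epsilon \ll 1$):
$\bar{A} \approx \begin{bmatrix} c-1 & -1 & \cdots & -1 \\ -1 & 1 & & 0 \\ \vdots & & \ddots & \\ -1 & 0 & & 1 \end{bmatrix}$
Let $\lambda$ be an eigenvalue of $\bar{A}$ and $v$ be an eigenvector of $\bar{A}$, then $\bar{A} v = \lambda v$.
Let $v_1, \ldots, v_c$ be the individual components of the eigenvector. The equation can be rewritten in terms of its individual components as follows:
(19) $(c-1)\, v_1 - \sum_{i=2}^{c} v_i = \lambda v_1$
(20) $-v_1 + v_i = \lambda v_i \quad \forall\, i \in \{2, \ldots, c\}$
(21) $(1 - \lambda)\, v_i = v_1 \quad \forall\, i \in \{2, \ldots, c\}$
We first consider the case $\lambda \neq 1$. Substituting $v_i = v_1 / (1 - \lambda)$ in (19):
$(c-1)\, v_1 - \frac{c-1}{1-\lambda}\, v_1 = \lambda v_1 \quad \Longrightarrow \quad \lambda (\lambda - c)\, v_1 = 0$
Since $v$ is an eigenvector, it cannot be zero, so $v_1 \neq 0$ (otherwise (21) forces every $v_i = 0$),
$\lambda \in \{0, c\}$
Let $v^{(1)}$ be the corresponding eigenvector for $\lambda = c$.
By substituting $\lambda = c$ in (20):
$v_i = -\frac{v_1}{c-1} \quad \forall\, i \in \{2, \ldots, c\}$
Dividing by the normalization constant,
(22) $v^{(1)} = \frac{1}{\sqrt{c(c-1)}}\, [\,c-1,\ -1,\ \ldots,\ -1\,]^\top$
Now we consider the case $\lambda = 1$. Substituting $\lambda = 1$ in (20):
$v_1 = 0$
The space of eigenvectors for $\lambda = 1$ is a $(c-2)$-dimensional subspace with
$v_1 = 0, \qquad \sum_{i=2}^{c} v_i = 0$
Let $v^{(2)}, \ldots, v^{(c-1)}$ be orthonormal eigenvectors with $\lambda = 1$.
Let $v^{(c)}$ be the eigenvector with $\lambda = 0$ (all components equal).
Writing $\bar{A}$ in terms of its eigenvalues and eigenvectors,
$\bar{A} \approx c\, v^{(1)} v^{(1)\top} + \sum_{i=2}^{c-1} v^{(i)} v^{(i)\top}$
Let $\rho = 1/c$ be the ratio of the second-largest eigenvalue to the largest eigenvalue.
Hence as $c \to \infty$, $\rho \to 0$ and
$\bar{A} \approx c\, v^{(1)} v^{(1)\top}$
Using $A = \epsilon \bar{A}$,
$A \approx c\,\epsilon\, v^{(1)} v^{(1)\top}$
Substituting in (15),
(23) $H_x \approx c\,\epsilon\, \big(W v^{(1)}\big)\big(W v^{(1)}\big)^\top$
Using (14),
$g_x = W (p - y)$
Let $W_i$ denote the $i$-th column of $W$,
$g_x = \sum_{i=1}^{c} (p_i - y_i)\, W_i$
Using (17) and (18),
$g_x = -(c-1)\,\epsilon\, W_1 + \epsilon \sum_{i=2}^{c} W_i$
Using (22),
(24) $g_x = -\epsilon \sqrt{c(c-1)}\; W v^{(1)}$
Using (23),
$H_x \approx c\,\epsilon\, \big(W v^{(1)}\big)\big(W v^{(1)}\big)^\top$
Using (24),
(25) $H_x \approx \frac{1}{(c-1)\,\epsilon}\; g_x g_x^\top$
Thus, the Hessian is approximately rank one and the gradient is parallel to the Hessian's only eigenvector with a nonzero eigenvalue.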
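The conclusion can be illustrated numerically. The sketch below rests on assumptions that are not taken from the paper: a random Gaussian matrix `W` stands in for the logit gradients, the predicted class has probability `1 - (c - 1) * eps`, and every other class has probability `eps`. It reports the ratio of the two largest eigenvalues of `diag(p) - p p^T` and the cosine between the gradient and the leading eigenvector of the Hessian. All names and sizes are hypothetical.

```python
import numpy as np

def hessian_and_gradient(W, p, y):
    # Closed forms for a locally linear model: H = W A W^T, g = W (p - y).
    A = np.diag(p) - np.outer(p, p)
    return W @ A @ W.T, W @ (p - y), A

rng = np.random.default_rng(0)
d, c, eps = 50, 1000, 1e-6                 # toy input dim, classes, tail probability
W = rng.normal(size=(d, c))                # stand-in for the logit gradients

p = np.full(c, eps)
p[0] = 1.0 - (c - 1) * eps                 # confident prediction for class 0
y = np.zeros(c)
y[0] = 1.0                                 # label equals the predicted class

H, g, A = hessian_and_gradient(W, p, y)

# Spectrum of A: one dominant eigenvalue (about c * eps), the rest about eps or 0.
eig_A = np.linalg.eigvalsh(A)              # ascending order
ratio_A = eig_A[-1] / eig_A[-2]            # grows roughly like c

# Leading eigenvector of H versus the gradient direction.
eig_H, vec_H = np.linalg.eigh(H)
cosine = abs(vec_H[:, -1] @ g) / np.linalg.norm(g)
```

With a random `W`, the gap between the top two eigenvalues of the Hessian itself depends on `W` as well as on `c`, so this check is only indicative of the mechanism and says nothing about trained networks.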
A.5 Proof of Theorem 5
We use (14).
Let = 0 in the CASO and CAFO objectives. The CASO objective then becomes:
Taking the derivative with respect to and solving:
Similarly, for the CAFO objective we get:
Using (25),
Define:
Thus is the eigenvalue of for the eigenvector:
Consider the matrix :
Let be the eigenvectors of where:
Eigenvalue for
Eigenvalue for
Since each is orthogonal to
Hence and since scaling does not affect the visualization, the two interpretations are equivalent.
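The equivalence can be checked on a toy example. The sketch below is an illustration under assumptions: the sparsity weight is set to zero, the second-order objective is taken to be `g @ delta + 0.5 * delta @ H @ delta - lam2 * delta @ delta` (an assumed form and scaling), and the Hessian is exactly rank one and aligned with the gradient, `H = k * g g^T`. The variable names and the value of `k` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
g = rng.normal(size=d)                     # gradient of the loss at the input
k = 0.05                                   # hypothetical rank-one scale
H = k * np.outer(g, g)                     # Hessian aligned with the gradient

top_eig = k * (g @ g)                      # only nonzero eigenvalue of H
lam2 = top_eig                             # any value above top_eig / 2 works

# Stationary point of the assumed second-order objective.
delta_second = np.linalg.solve(2.0 * lam2 * np.eye(d) - H, g)
# Stationary point of the first-order objective (Hessian term dropped).
delta_first = g / (2.0 * lam2)

# Both solutions point along g; only the scale differs.
cosine = (delta_second @ delta_first) / (
    np.linalg.norm(delta_second) * np.linalg.norm(delta_first))
scale = np.linalg.norm(delta_second) / np.linalg.norm(delta_first)
```

Since saliency maps are normalized before display, a difference in scale alone would not change the visualization.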
Appendix B Convergence of Gradient Descent to Solve CASO
A consequence of Theorem 3 is that gradient descent converges to the global optimizer of the second-order interpretation objective with a convergence rate of . More precisely, we have:
Corollary 1
Let be the objective function of the secondorder interpretation objective (Definition 3). Let be the value of in the step with a learning rate . We have
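The convergence behaviour can be seen on a small quadratic. The sketch below is a toy illustration, assuming the smooth part of the objective has the form `g @ delta + 0.5 * delta @ H @ delta - lam2 * delta @ delta` with no sparsity term, and a step size of `1 / (2 * lam2)`. These choices, the function name `gradient_ascent`, and the random test problem are assumptions rather than the paper's setup.

```python
import numpy as np

def gradient_ascent(g, H, lam2, steps=200):
    # Maximize g.delta + 0.5 delta.H.delta - lam2 ||delta||^2 by gradient steps.
    step = 1.0 / (2.0 * lam2)              # assumed step size
    delta = np.zeros_like(g)
    for _ in range(steps):
        grad = g + H @ delta - 2.0 * lam2 * delta
        delta = delta + step * grad
    return delta

rng = np.random.default_rng(0)
d = 6
M = rng.normal(size=(d, d))
H = M @ M.T                                # positive semidefinite Hessian
g = rng.normal(size=d)

top_eig = np.linalg.eigvalsh(H)[-1]
lam2 = top_eig                             # satisfies lam2 > top_eig / 2

delta_gd = gradient_ascent(g, H, lam2)
delta_star = np.linalg.solve(2.0 * lam2 * np.eye(d) - H, g)  # closed-form maximizer
gap = np.linalg.norm(delta_gd - delta_star)
```

With this step size each iteration shrinks the error by a factor of at most `top_eig / (2 * lam2)`, which is the geometric behaviour expected of gradient methods on strongly concave objectives.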
Appendix C Efficient Computation of the Hessian Matrix Using the Cholesky Decomposition
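By Theorem 2, the Cholesky decomposition of $A$ (defined in (16)) exists. Let $L$ be the Cholesky factor of $A$. Thus, we have
$H_x = W L L^\top W^\top.$
Let $B = W L$. Thus, $H_x$ can be rewritten as $H_x = B B^\top$.
Let the SVD of $B$ be as the following:
$B = U \Sigma V^\top.$
Thus, we can write:
$H_x = U \Sigma^2 U^\top.$
Define $C = B^\top B$. Note that $C = V \Sigma^2 V^\top$, so the nonzero eigenvalues of $C$ and $H_x$ are the same. For a dataset such as ImageNet, the input has dimension d = 224 × 224 × 3 and c = 1000. Decomposing C (size 1000 × 1000) into its eigenvalues and eigenvectors is computationally efficient. Thus, from $C$, we can compute the eigenvectors of $H_x$: an eigenvector $v$ of $C$ with eigenvalue $\sigma^2 > 0$ yields the eigenvector $B v / \sigma$ of $H_x$.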
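A minimal sketch of this computation follows. It is not the authors' implementation. Because `diag(p) - p p^T` is singular, the sketch builds a symmetric factor from an eigendecomposition as a stand-in for the Cholesky factor; this substitution, the function name `top_hessian_eigenpairs`, and the toy sizes are assumptions.

```python
import numpy as np

def top_hessian_eigenpairs(W, p, k=3):
    # Factor A = F F^T. An eigendecomposition-based factor is used here in
    # place of a Cholesky factor because A is only positive semidefinite.
    A = np.diag(p) - np.outer(p, p)
    s, Q = np.linalg.eigh(A)
    F = Q * np.sqrt(np.clip(s, 0.0, None))     # scales column j by sqrt(s_j)

    B = W @ F                                  # d x c, so H = B B^T
    C = B.T @ B                                # small c x c matrix
    evals, V = np.linalg.eigh(C)
    order = np.argsort(evals)[::-1][:k]        # indices of the k largest
    evals, V = evals[order], V[:, order]

    U = (B @ V) / np.sqrt(evals)               # eigenvectors of H, unit norm
    return evals, U

rng = np.random.default_rng(0)
d, c = 300, 10                                 # toy sizes with d much larger than c
W = rng.normal(size=(d, c))
p = rng.dirichlet(np.ones(c))                  # a random probability vector

evals, U = top_hessian_eigenpairs(W, p)

# Reference: form the full d x d Hessian directly (only feasible for small d).
H = W @ (np.diag(p) - np.outer(p, p)) @ W.T
reference = np.linalg.eigvalsh(H)[::-1][:3]
```

The point of the construction is that only a `c x c` eigenproblem is solved, so the full `d x d` Hessian never needs to be formed when `d` is large; it is built above purely for comparison.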
Appendix D Saliency Visualization Methods
Normalizing Feature Importance Values: After assigning importance values to each input feature, the values must be normalized for visualization in a saliency map. For fair comparison across all methods, we use the non-diverging normalization method from SmoothGrad Smilkov et al. (2017). This normalization method first takes the absolute value of the importance scores and then sums across the three color channels of the image. Next, the largest importance values are capped to the value of the 99th percentile. Finally, the importance values are divided and clipped to enforce the range [0, 1]. Code for the method is available at https://github.com/PAIR-code/saliency/blob/master/saliency/visualization.py
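The normalization steps described above can be sketched as follows. This is a simplified illustration and not a copy of the linked implementation, which may differ in details such as how the minimum value is handled. The function name `normalize_saliency` and the default percentile are assumptions.

```python
import numpy as np

def normalize_saliency(importance, percentile=99.0):
    """Map an H x W x 3 array of importance scores to an H x W map in [0, 1]."""
    # Step 1: absolute value, then sum over the three color channels.
    grayscale = np.abs(importance).sum(axis=2)
    # Step 2: cap the largest values at the chosen percentile.
    cap = np.percentile(grayscale, percentile)
    if cap <= 0:
        return np.zeros_like(grayscale)        # all-zero scores: nothing to show
    # Step 3: divide by the cap and clip into [0, 1].
    return np.clip(grayscale / cap, 0.0, 1.0)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4, 3))            # hypothetical importance scores
saliency_map = normalize_saliency(scores)
```

Capping at a high percentile keeps a handful of extreme pixels from compressing the rest of the map toward zero.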
Domain-Specific Post-Processing: Gradient ⊙ Input Shrikumar et al. (2017) multiplies the importance values by the raw feature values. In image tasks where the baseline is zero, Integrated Gradients Sundararajan et al. (2017) does the same. This heuristic can visually sharpen the saliency map and has some theoretical justification: it is equivalent to the original Layer-wise Relevance Propagation technique Bach et al. (2015) modulo a scaling factor Shrikumar et al. (2017); Kindermans et al. (2016). Additionally, if the model is linear, $f(x) = w^\top x$, multiplying the gradient by the input gives each feature's true contribution to the final class score.
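The linear case can be made concrete with a toy example. The weights, input, and bias below are hypothetical values chosen for illustration. For a linear score the gradient with respect to the input equals the weight vector, so the elementwise product of gradient and input gives the per-feature contributions.

```python
import numpy as np

w = np.array([0.5, -2.0, 3.0])     # hypothetical weights of a linear scorer
x = np.array([4.0, 1.0, 0.5])      # hypothetical input features
b = 0.25                           # hypothetical bias

score = w @ x + b                  # class score of the linear model
gradient = w                       # d(score)/dx for a linear model
contributions = gradient * x       # gradient times input, elementwise
```

Here the contributions sum to the score minus the bias, which is why the product is an exact decomposition for linear models; no such guarantee follows for a deep network.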
However, multiplying by the input can introduce visual artifacts not present in the importance values Smilkov et al. (2017). We argue against multiplying by the input: it artificially enhances the visualization and only yields benefits in the image domain. Adebayo et al. (2018) argue similarly and show cases when the input term can dominate the interpretation. Moreover, multiplication by the input removes the input invariance of the interpretation regardless of the invariances of the underlying model Kindermans et al. (2018). We observed numerous failures in existing interpretation methods when input multiplication is removed.
Appendix E Tightness of the Relaxation
We assume the condition of Theorem 3 holds; thus, the CASO optimization is a concave maximization (equivalently, a convex minimization) problem.
Note the CASO optimization with the cardinality constraint can be rewritten as follows:
(26)  
where
(27)  
(28) 
Where indicates the square root of a positive definite matrix. Equation (27) highlights the condition for tuning the parameter : it needs to be sufficiently large to allow inversion of but sufficiently small to not “overpower” the Hessian term. Note, we are now minimizing for consistency with the compressive sensing literature. To explain the conditions under which the relaxation is tight, we define the following notation. For a given subset and constant , we define the following cone:
(29) 
where is the complement of . We say that the matrix satisfies the restricted eigenvalue (RE) Raskutti et al. (2010); Bickel et al. (2009) condition over with parameters if