Interpretation of Prediction Models Using the Input Gradient
State-of-the-art machine learning algorithms are highly optimized to provide the best possible prediction, which naturally results in complex models. While these models often outperform simpler, more interpretable models by orders of magnitude, when it comes to understanding how the model functions we often face a “black box”.
In this paper we suggest a simple method to interpret the behavior of any predictive model, both for regression and classification. Given a particular model, the information required to interpret it can be obtained by studying the partial derivatives of the model with respect to the input. We exemplify this insight by interpreting convolutional and multi-layer neural networks in the field of natural language processing.
The property of interpretability of a model can be defined in several ways. In this paper, we are interested in the following property: given a prediction model $f: \mathcal{X} \to \mathcal{Y}$, where $\mathcal{X}$ is the input space and $\mathcal{Y}$ is the output space, we would like to interpret which variables in $\mathcal{X}$ affect the prediction of $f$ and in which way. When this property is achieved we say the model is interpretable.
In linear regression, interpreting a given model relies on the model parameters. Since the model is of the form $f(x) = \beta^\top x$, there is a 1–1 correspondence between the parameters $\beta_j$ and the coordinates $x_j$ of the variables in the input space. Therefore, the effect of a given variable $x_j$ on the model $f$ can be evaluated through the value of $\beta_j$. When $\beta_j \gg 0$ we may say $x_j$ has a significant positive effect on $f$. When $\beta_j = 0$, we may say that $x_j$ is irrelevant to the prediction of $f$, and it may get selected out of the model. Intuitively, it is often also explained in terms of a slope: a one-unit increase in $x_j$ would increase the prediction by the amount $\beta_j$.
This approach of inferring variable importance through the parameters has been thoroughly studied by statisticians, since it is fundamental to the field of statistical inference. It has been expanded from linear regression to generalized linear models, and can also be applied to other algorithms which rely on the inner product of the data and the parameters, such as SVM. As models become more complex there is no longer a 1–1 relation between the parameters and the variables. For example, in neural networks, the number of parameters is typically orders of magnitude larger than the number of variables, and the relation between a variable and a specific set of parameters is often intractable.
The key observation and main point of this paper is that it is possible to interpret the model without using the parameters. Essentially, it is possible to interpret any $f$ by studying the partial derivatives of $f$ with respect to the input variables $x_j$. The intuition is straightforward. Suppose that for every $x$ in the input space it happens that $\partial f / \partial x_j = 0$ for the variable $x_j$; then $x_j$ effectively does not influence $f$ at all. This is also true the other way around: if for every $x$ we have $|\partial f / \partial x_j| \gg 0$, we (intuitively) learn that $x_j$ is important to the model prediction.
For the particular case of linear regression, when $f(x) = \beta^\top x$, the two approaches coincide since $\partial f / \partial x_j = \beta_j$, and the parameters themselves are the partial derivatives. The major advantage of the suggested perspective is that it can be generalized to any given prediction model $f$, both for regression and classification problems.
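As a worked illustration of why the two approaches coincide, and of what changes once the model is no longer linear, consider the derivation below. The logistic model $g$ and the symbol $\sigma$ are introduced here only for contrast; they are assumptions and not part of the original text.

```latex
% Linear model: the partial derivative is the parameter itself, at every x.
f(x) = \sum_{k} \beta_k x_k
\quad\Longrightarrow\quad
\frac{\partial f}{\partial x_j}(x) = \beta_j .

% Logistic model (illustrative assumption): the derivative now depends on x.
g(x) = \sigma\Big(\sum_{k} \beta_k x_k\Big), \qquad
\sigma(t) = \frac{1}{1 + e^{-t}}
\quad\Longrightarrow\quad
\frac{\partial g}{\partial x_j}(x) = g(x)\,\big(1 - g(x)\big)\,\beta_j .
```

In the second case the sign of the derivative still matches the sign of $\beta_j$, but its magnitude varies with $x$, which is why the gradient has to be evaluated at specific points rather than read off the parameters.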
In the paper we show how it is possible to interpret complex models, specifically, convolutional and multi-layer neural networks using data from the field of natural language processing.
The idea of evaluating the performance of the model through its partial derivatives with respect to the input variables is seemingly trivial, but we could not locate examples in the literature in which it has been done for interpretation purposes. The most relevant reference we could find is Google DeepDream, which uses the GoogLeNet convolutional neural net. They calculate the gradients of a given input image and perform gradient ascent on the input space to maximize the activations of particular neurons or layers. This provides a visualization tool that enables the interpretation of the convolutional neural network.
A recent paper by Ribeiro et al. approaches the problem of interpreting a given model from a different direction, with a new technique named LIME that aims to interpret any prediction model. In the paper, the authors suggest interpreting the performance of $f$ at a particular point $x$ by approximating $f$ locally around $x$ with a model which is simple and interpretable, such as sparse linear regression (“Lasso”) or a decision tree.
4 Method and Notation
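This paper considers a prediction model $f: \mathcal{X} \to \mathcal{Y}$, learned from $n$ observations of the input space $\mathcal{X}$ and the output space $\mathcal{Y}$. This is the general framework used in supervised learning problems, including both regression, when $\mathcal{Y}$ is continuous, and classification, when $\mathcal{Y}$ is categorical. $\mathcal{X}$ is a high-dimensional feature space with $p$ different features, $\mathcal{X} = \mathcal{X}_1 \times \dots \times \mathcal{X}_p$, where each $\mathcal{X}_j$ can also be discrete or categorical. We denote the training set as $D_{\text{train}}$ and the test set as $D_{\text{test}}$.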
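We propose to study the gradient values of $f$ with respect to the features. Formally, an observation $x$ is of the form $x = (x_1, \dots, x_p)$, where $x_j$ is the value of the $j$-th variable (or feature). We will study the gradient $\nabla f(x) = \left( \frac{\partial f}{\partial x_1}(x), \dots, \frac{\partial f}{\partial x_p}(x) \right)$. The $j$-th partial derivative will be denoted as $\nabla_j f(x)$. We will also use the average of the partial derivatives over the test set $D_{\text{test}}$, $\bar{\nabla}_j f = \frac{1}{|D_{\text{test}}|} \sum_{x \in D_{\text{test}}} \nabla_j f(x)$, denoted in vector form as $\bar{\nabla} f$.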
The gradients in neural networks, and often in other models, can be derived explicitly using backpropagation and the chain rule. This is what we have done in this paper. In other cases it may be possible to estimate them numerically. For a small $\epsilon > 0$, $\frac{\partial f}{\partial x_j}(x) \approx \frac{f(x + \epsilon e_j) - f(x)}{\epsilon}$, where $e_j$ is the $j$-th standard basis vector. When $x_j$ is continuous, this is quickly computable, and only requires two forward passes per coordinate, since $f$ is given. In Section 5.1 and Section 5.2 we will discuss cases when $x_j$ is categorical and $f$ is not differentiable.
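The numerical estimate described above can be sketched in a few lines. This is a minimal illustration that assumes a forward difference; the function names, the step size and the toy logistic model are hypothetical and are not taken from the paper's source code.

```python
import math


def numerical_gradient(f, x, eps=1e-6):
    """Estimate the gradient of a scalar function f at the point x.

    Forward difference: f(x) is evaluated once and shared, and each
    coordinate needs one additional evaluation at the shifted point.
    """
    base = f(x)
    grad = []
    for j in range(len(x)):
        shifted = list(x)
        shifted[j] += eps  # move only the j-th coordinate
        grad.append((f(shifted) - base) / eps)
    return grad


def toy_model(x):
    """Hypothetical prediction model: logistic function of a weighted sum."""
    weights = [2.0, -1.0, 0.0]
    t = sum(w * v for w, v in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-t))


x0 = [0.5, 1.0, -3.0]
grad = numerical_gradient(toy_model, x0)
print(grad)
```

At `x0` the weighted sum is zero, so the exact gradient is 0.25 times the weight vector; the third feature has zero weight and therefore a zero partial derivative, matching the intuition that it does not influence the prediction.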
In addition, the gradient, $\nabla f$, is a function defined globally on the entire space. When it cannot be analytically derived, as in most cases, it has to be evaluated locally at multiple points. In all of the examples we trained a model on the training set and evaluated the gradient on the test set. This is reasonable due to the standard assumption that the training and test sets are an i.i.d. sample from the population, but this course of action should be evaluated on a case-by-case basis.
In order to demonstrate the effectiveness of the gradients for interpretation purposes, we present examples from the field of natural language processing in which complex models are interpreted using the gradients. The examples use the IMDB Large Movie Review Dataset. The dataset contains different movie reviews, equally partitioned into train and test sets, with $y$ being the labeled sentiment of the review, either positive or negative.
The implementation uses Keras and Theano. Since Theano performs symbolic differentiation, calculating the gradients efficiently required only a short function call. Source code enabling replication of the results is available on the author's website.
5.1 Example 1: Convolutional Neural Networks following Word Embedding
There are two major challenges with textual data. First, the words are categorical and part of a dictionary. Second, significant information is carried by the sequential order of the sentence. In recent years, many state-of-the-art results on textual problems have been achieved using convolutional neural network (CNN) models. CNN models address the first challenge by embedding the dictionary in a high-dimensional metric space, and the second by applying convolutions to capture the sequential component of the data. See  for model details and  for performance review.
Interpreting a CNN model with textual data is not a trivial task because the original sentence is hardly traceable after the embedding and the convolutions. Nevertheless, in this example we demonstrate how this can be achieved by using the gradients to interpret the model.
To train a CNN on the IMDB dataset, we thresholded the review length to , and the dictionary size to the most popular words. For a sentence $s = (w_1, \dots, w_m)$, where $w_i$ is the $i$-th word, the embedding maps each word $w_i$ to a vector $e_i \in \mathbb{R}^d$ and transforms $s$ into the sequence $(e_1, \dots, e_m)$. After the embedding, we apply convolutions on $(e_1, \dots, e_m)$ with length (words), depth (embedding dimension), followed by max pooling and a linear classifier. This model yields a accuracy, not far from the state-of-the-art (topping has been done a handful of times and is considered a significant improvement ).
To interpret this complex model, we turn to gradients. Since the words themselves are categorical, the prediction function is not differentiable with respect to the words. We therefore use the chain rule to calculate, for every word, the partial derivatives of the model with respect to its embedding vector. Thus, for the $i$-th word, with embedding vector $e_i$, the gradient is $\nabla_{e_i} f$. This provides one gradient vector per word in each sentence, which we rank according to their norms in order to assess which words are the most influential. The intuition for taking the highest norm is that the norm of the gradient is a proxy for the “magnitude” of the slope of the graph in the direction of the gradient vector, so the steepest slope identifies the word affecting the prediction the most.
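The ranking step can be sketched as follows, assuming the per-word gradients with respect to the embedding vectors have already been obtained by backpropagation. The sentence, the gradient values, the use of the Euclidean norm and the window length of 3 are all made up for illustration.

```python
import math


def rank_words_by_gradient_norm(words, word_gradients):
    """Return (index, word, norm) triples sorted by decreasing gradient norm."""
    scored = []
    for i, (word, grad) in enumerate(zip(words, word_gradients)):
        norm = math.sqrt(sum(g * g for g in grad))  # Euclidean norm (assumption)
        scored.append((i, word, norm))
    return sorted(scored, key=lambda item: item[2], reverse=True)


def top_expression(words, word_gradients, window):
    """Words covered by a convolution window starting at the top-ranked word."""
    top_index = rank_words_by_gradient_norm(words, word_gradients)[0][0]
    return words[top_index:top_index + window]


# Hypothetical sentence with one 2-dimensional gradient vector per word.
words = ["the", "ape", "was", "outstanding", "overall"]
word_gradients = [
    [0.01, -0.02],
    [0.30, 0.40],
    [0.05, 0.00],
    [0.20, -0.10],
    [0.00, 0.03],
]

ranking = rank_words_by_gradient_norm(words, word_gradients)
expression = top_expression(words, word_gradients, window=3)
print(ranking[0], expression)
```

Reporting the window that starts at the top-ranked word, rather than the word alone, follows the presentation choice described in the text: the convolution acts on several consecutive words, so the neighbors often carry the explanation.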
Table ? shows the expressions for which the gradient norms are the largest for several reviews in the test set. Since the convolution is applied on length of words, we present the expressions activating the convolution, where the first word is the one with the highest gradient norm. The highest ranked expressions are highly interpretable, and most examples are self-explanatory, such as “fantastic film total” or “lack of credibility”. Some expressions were activated by the words following the selected one, such as “ape was outstanding”, in which the word “ape” had the largest gradient in the sentence, but its influence is probably explained by the “was outstanding” expression that follows it.
5.2 Example 2: Bag of Words Model
In the previous example we have used the gradients to interpret the local information about a particular sentence. In this example we will use the gradients globally in order to assess which words affect the classifier the most, and draw insight on the classifier functionality. To do so we represent the sentences with a simplified version of the Bag of Words (BoW) model.
A sentence is represented as a binary vector $x \in \{0,1\}^p$ whose length $p$ is the size of the dictionary, in which $x_j = 1$ if the $j$-th word occurs in the sentence, and $x_j = 0$ otherwise. In this model the data representation is simpler than the word embedding used in Example 1, since it discards the grammar and sequential information of the original text. In general it tends to underperform the word embedding representations, but it is easier to interpret, since there is now a 1–1 correspondence between the features and the words. Following the BoW data representation we fit a fully connected neural network with hidden layers to perform the classification, resulting in a model accuracy.
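The binary representation can be sketched as below. The five-word dictionary, the lowercasing and the whitespace tokenization are simplifying assumptions made for illustration; the preprocessing actually used for the IMDB data is not described in the text.

```python
def bag_of_words(sentence, dictionary):
    """Binary vector: 1 where the dictionary word occurs in the sentence."""
    tokens = set(sentence.lower().split())
    return [1 if word in tokens else 0 for word in dictionary]


# Hypothetical dictionary; position j of the vector corresponds to word j.
dictionary = ["excellent", "great", "waste", "worst", "movie"]
x = bag_of_words("A great movie with a great cast", dictionary)
print(x)
```

Repeated words and word order leave the vector unchanged, which is the loss of sequential information mentioned above, and each coordinate corresponds to exactly one word.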
Next, the partial derivatives are evaluated for all observations in the test set, and $\bar{\nabla} f$, the vector of the averages of the partial derivatives, is calculated. $\bar{\nabla} f$ provides a single global estimator of the gradient function across the input space, which can be used to rank the words according to their influence on the model. As can be seen in Table ?, the top positive and negative words according to the values of $\bar{\nabla} f$ are highly interpretable, and include words such as “excellent” and “great” for positive sentiment and “worst” and “waste” for negative sentiment.
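A minimal sketch of this averaging and ranking step is given below. It assumes a tiny network with two hidden tanh units and hand-picked weights standing in for the trained classifier; the weights, the five-word dictionary and the test points are hypothetical.

```python
import math

WORDS = ["excellent", "great", "movie", "waste", "worst"]
# Hypothetical fixed weights: 5 binary inputs -> 2 tanh units -> sigmoid output.
W = [[1.0, 0.8, 0.0, -0.9, -1.2],
     [0.5, 0.4, 0.0, -0.3, -0.6]]
B = [0.1, -0.2]
V = [1.5, 1.0]


def forward(x):
    """Return the hidden activations and the scalar prediction."""
    hidden = [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
              for row, b in zip(W, B)]
    score = sum(v * h for v, h in zip(V, hidden))
    return hidden, 1.0 / (1.0 + math.exp(-score))


def input_gradient(x):
    """Partial derivatives of the prediction w.r.t. each input (chain rule)."""
    hidden, out = forward(x)
    grad = []
    for j in range(len(x)):
        inner = sum(V[k] * (1.0 - hidden[k] ** 2) * W[k][j]
                    for k in range(len(V)))
        grad.append(out * (1.0 - out) * inner)
    return grad


def average_gradient(points):
    """Coordinate-wise mean of the input gradient over a set of points."""
    total = [0.0] * len(points[0])
    for x in points:
        for j, g in enumerate(input_gradient(x)):
            total[j] += g
    return [t / len(points) for t in total]


test_points = [[1, 0, 1, 0, 0], [0, 1, 1, 0, 0], [0, 0, 1, 1, 1],
               [1, 1, 0, 0, 0], [0, 0, 0, 0, 1]]
avg = average_gradient(test_points)
ranked = sorted(zip(WORDS, avg), key=lambda pair: pair[1], reverse=True)
print(ranked)
```

The top of the sorted list holds the words pushing the prediction toward the positive class and the bottom holds those pushing it toward the negative class, while a word such as "movie", which the toy weights ignore, receives an average gradient of zero.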
Furthermore, $\bar{\nabla} f$ can also be used for approximating the prediction function $f$, since the features are binary. By classifying as if and otherwise, we get a classifier that agrees with the prediction function on of the observations in the test set. This surprising result provides further insight on $f$, specifically that (1) the decision boundary of $f$ is closely defined by the hyperplane created by $\bar{\nabla} f$, so the learned $f$ is approximately linear although the model is much more complex; (2) $\bar{\nabla} f$ can be used as a trustworthy estimate of the influence of a given word.
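The agreement check can be sketched as follows. The exact classification rule is not recoverable from the text, so the sketch assumes a threshold of zero on the inner product between the averaged gradient and the input. The toy model is hypothetical and deliberately built so that its decision boundary is a hyperplane, which means the full agreement it produces is by construction and says nothing about the figure reported for the real network.

```python
import math
from itertools import product

W = [1.0, 0.8, 0.0, -0.9, -1.2]  # hypothetical hidden-unit weights
C = 2.0                          # hypothetical output weight


def model(x):
    """Toy classifier: sigmoid of a scaled tanh of a weighted sum."""
    h = math.tanh(sum(w * v for w, v in zip(W, x)))
    return 1.0 / (1.0 + math.exp(-C * h))


def input_gradient(x):
    """Analytic gradient of the toy model w.r.t. the input (chain rule)."""
    h = math.tanh(sum(w * v for w, v in zip(W, x)))
    out = 1.0 / (1.0 + math.exp(-C * h))
    scale = out * (1.0 - out) * C * (1.0 - h * h)
    return [scale * w for w in W]


# All binary vectors of length 5 play the role of the test set.
points = [list(p) for p in product([0, 1], repeat=len(W))]
avg_grad = [sum(input_gradient(x)[j] for x in points) / len(points)
            for j in range(len(W))]


def linear_rule(x):
    """Assumed rule: positive class when the inner product exceeds zero."""
    return 1 if sum(g * v for g, v in zip(avg_grad, x)) > 0 else 0


def model_class(x):
    return 1 if model(x) > 0.5 else 0


agreement = sum(linear_rule(x) == model_class(x) for x in points) / len(points)
print(agreement)
```

On a trained network the same procedure would be run on the held-out test set, and the resulting fraction would be an empirical measurement rather than a guaranteed value.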
There is a last nuance about this example that should be discussed. Although the function $f$ is not differentiable, because the features are binary ($x_j \in \{0, 1\}$), the derivatives were obtained as if it were. Effectively, this means that the derivative is evaluated as if each $x_j$ were continuous. This approach can be extended to general categorical data with the use of indicators as features.
5.3 Example 3: Fully Connected Neural Networks on MNIST
The MNIST data contains labeled handwritten digits and often functions as a simple dataset to test new machine learning methods. To validate the feasibility of the technique, we fitted a fully connected neural network with hidden layers on the dataset. This results in accuracy, inferior to the state of the art achieved by convolutional neural networks (), but significantly better than logistic regression, which can be interpreted directly through the model's weights. This time the output is a vector of size , so $f$ is a vector-valued function. We therefore calculate the Jacobian values of the fully connected network over the test set and take the average in a similar way to Example 2. Figure ? visualizes the result and compares it to the logistic regression weights.
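For a vector-valued model the averaged quantity is a matrix with one row per output class and one column per input feature. The sketch below assumes a toy network with 4 inputs, 3 hidden tanh units and a 3-class softmax output; the weights and test points are hypothetical and far smaller than anything used on MNIST.

```python
import math

# Hypothetical weights: 4 inputs -> 3 tanh units -> 3 softmax classes.
W1 = [[0.5, -0.3, 0.8, 0.1],
      [-0.6, 0.9, 0.2, -0.4],
      [0.3, 0.3, -0.7, 0.5]]
W2 = [[1.0, -0.5, 0.2],
      [-0.8, 0.7, 0.4],
      [0.1, 0.3, -0.9]]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def softmax(z):
    top = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(t - top) for t in z]
    total = sum(exps)
    return [e / total for e in exps]


def predict(x):
    hidden = [math.tanh(t) for t in matvec(W1, x)]
    return softmax(matvec(W2, hidden))


def jacobian(x):
    """jac[c][j] = d f_c / d x_j, obtained with the chain rule."""
    hidden = [math.tanh(t) for t in matvec(W1, x)]
    probs = softmax(matvec(W2, hidden))
    # dz[k][j]: derivative of the k-th pre-softmax score w.r.t. input j.
    dz = [[sum(W2[k][m] * (1.0 - hidden[m] ** 2) * W1[m][j]
               for m in range(len(hidden)))
           for j in range(len(x))]
          for k in range(len(probs))]
    jac = []
    for c in range(len(probs)):
        row = []
        for j in range(len(x)):
            mean_dz = sum(probs[k] * dz[k][j] for k in range(len(probs)))
            row.append(probs[c] * (dz[c][j] - mean_dz))
        jac.append(row)
    return jac


def average_jacobian(points):
    """Entry-wise mean of the Jacobian over a set of points."""
    n_out, n_in = len(W2), len(points[0])
    avg = [[0.0] * n_in for _ in range(n_out)]
    for x in points:
        jac = jacobian(x)
        for c in range(n_out):
            for j in range(n_in):
                avg[c][j] += jac[c][j] / len(points)
    return avg


test_points = [[0.0, 0.5, 1.0, 0.2], [1.0, 0.0, 0.3, 0.7], [0.4, 0.4, 0.0, 1.0]]
avg_jac = average_jacobian(test_points)
print(avg_jac)
```

Each row of the averaged Jacobian can be reshaped to the image grid and plotted next to the corresponding row of logistic regression weights. Because the softmax outputs sum to one, every column of the Jacobian sums to zero, which is a convenient sanity check.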
In this paper we have demonstrated the usefulness of the model gradient vector with respect to the input as an important quantity to use when studying the behavior of a given model. The gradient vector indicates which features affect the prediction the most, thus it can be used to draw insights and to interpret the input features in complex models quickly and efficiently. We have also shown that the gradients contain enough information to draw global conclusions on the behavior of the classifier, including the linear structure of the model, and to create a linear approximation of the function.
We would like to thank Ruslan Salakhutdinov and Alessandro Rinaldo for suggestions, insights and remarks that have greatly improved the quality of the paper.
- Keras, https://github.com/fchollet/keras, 2015.
- Natural language processing (almost) from scratch.
Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Journal of Machine Learning Research, 12(Aug):2493–2537, 2011.
- Supervised and semi-supervised text categorization using one-hot LSTM for region embeddings.
Rie Johnson and Tong Zhang. arXiv preprint arXiv:1602.02373, 2016.
- Convolutional neural networks for sentence classification.
Yoon Kim. arXiv preprint arXiv:1408.5882, 2014.
- Learning word vectors for sentiment analysis.
Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 142–150. Association for Computational Linguistics, 2011.
- Deepdream, https://github.com/google/deepdream, 2015.
Alexander Mordvintsev, Christopher Olah, and Mike Tyka.
- “Why should I trust you?”: Explaining the predictions of any classifier.
Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. arXiv preprint arXiv:1602.04938, 2016.
- Going deeper with convolutions.
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
- Theano: A Python framework for fast computation of mathematical expressions.
Theano Development Team. arXiv e-prints, abs/1605.02688, May 2016.