Georgia Institute of Technology
Ant Financial Voleon Group
We study instancewise feature importance scoring as a method for model interpretation. Any such method yields, for each predicted instance, a vector of importance scores associated with the feature vector. Methods based on the Shapley score have been proposed as a fair way of computing feature attributions of this kind, but incur an exponential complexity in the number of features. This combinatorial explosion arises from the definition of the Shapley value and prevents these methods from being scalable to large data sets and complex models. We focus on settings in which the data have a graph structure, and the contribution of features to the target variable is well-approximated by a graph-structured factorization. In such settings, we develop two algorithms with linear complexity for instancewise feature importance scoring. We establish the relationship of our methods to the Shapley value and another closely related concept known as the Myerson value from cooperative game theory. We demonstrate on both language and image data that our algorithms compare favorably with other methods for model interpretation.
Modern machine learning models, including random forests, deep neural networks, and kernel methods, can produce high-accuracy predictions in many applications. Often, however, the predictive accuracy of such black-box models comes at the cost of interpretability. Ease of interpretation is a crucial criterion when these tools are applied in areas such as medicine, financial markets, and criminal justice; for more background, see the discussion paper by Lipton as well as references therein.
In this paper, we study instancewise feature importance scoring as a specific approach to the problem of interpreting the predictions of black-box models. Given a predictive model, such a method yields, for each instance to which the model is applied, a vector of importance scores associated with the underlying features. The instancewise property means that this vector, and hence the relative importance of each feature, is allowed to vary across instances. Thus, the importance scores can act as an explanation for the specific instance, indicating which features are the key for the model to make its prediction on that instance.
There is now a large body of research focused on the problem of scoring input features based on the prediction of a given instance (for example, see the papers [19, 1, 17, 14, 22, 2, 5, 23] as well as references therein). Of most relevance to this paper is a line of recent work [22, 14, 5] that has developed methods for model interpretation based on the Shapley value from cooperative game theory. The Shapley value was originally proposed as an axiomatic characterization of a fair distribution of a total surplus among all the players, and can be applied to predictive models, in which case each feature is modeled as a player in the underlying game. While the Shapley value approach is conceptually appealing, it is also computationally challenging: in general, each evaluation of a Shapley value requires an exponential number of model evaluations. Different approaches to circumventing this complexity barrier have been proposed, including those based on Monte Carlo approximation [22, 5] and methods based on sampled least-squares with weights.
In this paper, we take a complementary point of view, arguing that the problem of explanation is best approached within a model-based paradigm. In this view, explanations are cast in terms of a model, which may or may not be the same model as used to fit the data. Criteria such as the Shapley value, which are intractable to compute when no assumptions are made, can be more effectively computed or approximated within the framework of a model. We focus specifically on settings in which a graph structure is appropriate for the data; specifically, we consider simple chains and grids, appropriate for time series and images, respectively. We propose two measures for instancewise feature importance scoring in this framework, which we term L-Shapley and C-Shapley; here the abbreviations “L” and “C” refer to “local” and “connected,” respectively. By exploiting the underlying graph structure, the number of model evaluations is reduced to linear (as opposed to exponential) in the number of features. We demonstrate the relationship of these measures with a constrained form of the Shapley value, and we additionally relate C-Shapley with another solution concept from cooperative game theory, known as the Myerson value. The Myerson value is commonly used in graph-restricted games, under a local additivity assumption of the model on disconnected subsets of features. Finally, we apply our feature scoring methods to several state-of-the-art models for both language and image data, and find that our scoring algorithms compare favorably to several existing sampling-based algorithms for instancewise feature importance scoring.
The remainder of this paper is organized as follows. We begin in Section 2 with background and set-up for the problem to be studied. In Section 3, we describe the two methods proposed and analyzed in this paper, based on the L-Shapley and C-Shapley scores. Section 4 is devoted to a study of the relationship between these scores and the Myerson value. In Section 5, we evaluate the performance of L-Shapley and C-Shapley on various real-world data sets, and we conclude with a discussion in Section 6.
2 Background and preliminaries
We begin by introducing some background and notation.
2.1 Importance of a feature subset
We are interested in studying models that are trained to perform prediction, taking as input a feature vector and predicting a response or output variable . We assume access to the output of a model via a conditional distribution, denoted by , that provides the distribution of the response conditioned on a given vector of inputs. For any given subset , we use to denote the associated sub-vector of features, and we let denote the induced conditional distribution when is restricted to using only the sub-vector . In the corner case in which , we define . In terms of this notation, for a given feature vector , subset and fitted model distribution , we introduce the importance score
where denotes the expectation over . The importance score has a coding-theoretic interpretation: it corresponds to the negative of the expected number of bits required to encode the output of the model based on the sub-vector . It will be zero when the model makes a deterministic prediction based on , and larger when the model returns a distribution closer to uniform over the output space.
There is also an information-theoretic interpretation to this definition of importance scores, as discussed in our previous work . In particular, suppose that for a given integer , there is a function such that, for almost all , the -sized subset maximizes over all subsets of size . In this case, we are guaranteed that the mutual information between and is maximized, over any conditional distribution that generates a subset of size given . The converse is also true.
In many cases, class-specific importance is favored, where one is interested in seeing how important a feature subset is to the predicted class, instead of the prediction as a conditional distribution. In order to handle such cases, it is convenient to introduce the degenerate conditional distribution
We can then define the importance of a subset with respect to using the modified score
which is the expected log probability of the predicted class given the features in .
Estimating the conditional distribution:
In practice, we need to estimate, for any given feature vector , the conditional probability functions based on observed data. Past work has used one of two approaches: either estimation based on empirical averages , or plug-in estimation using a reference point [5, 14].
Empirical average estimation: In this approach, we first draw a set of feature vectors by sampling with replacement from the full data set. For each sample , we define a new vector with components
The empirical mean of over is then used as an estimate of .
Plug-in estimation: In this approach, the first step is to specify a reference vector . We then define the vector with components
Finally, we use the conditional probability as an approximation to . The plug-in estimate is more computationally efficient than the empirical average estimator, and works well when there exist appropriate choices of reference points. We use this method for our experiments, where we use the index of padding for language data, and the average pixel strength of an image for vision data.
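The two estimators can be sketched in a few lines. The following is a minimal illustration rather than the implementation used in the experiments: the function names, the `model_proba` callable (assumed to map one feature vector to a vector of class probabilities), and the logistic `toy_model` are all hypothetical, and the class-specific log-probability score is used as the importance score.

```python
import numpy as np


def plug_in_input(x, subset, reference):
    """Keep x[j] for j in subset; set every other coordinate to the reference value."""
    x = np.asarray(x, dtype=float)
    x_tilde = np.broadcast_to(np.asarray(reference, dtype=float), x.shape).copy()
    idx = np.array(sorted(subset), dtype=int)
    x_tilde[idx] = x[idx]
    return x_tilde


def score_plug_in(model_proba, x, subset, reference, target_class):
    """Plug-in estimate of the log-probability of target_class given the sub-vector."""
    probs = model_proba(plug_in_input(x, subset, reference))
    return float(np.log(probs[target_class]))


def score_empirical(model_proba, x, subset, data, target_class, n_samples, rng):
    """Empirical-average estimate: off-subset coordinates are filled from resampled rows."""
    x = np.asarray(x, dtype=float)
    data = np.asarray(data, dtype=float)
    idx = np.array(sorted(subset), dtype=int)
    rows = rng.integers(0, len(data), size=n_samples)  # sampling with replacement
    probs = []
    for r in rows:
        x_tilde = data[r].copy()
        x_tilde[idx] = x[idx]
        probs.append(model_proba(x_tilde)[target_class])
    # Average the model output first, then take the logarithm.
    return float(np.log(np.mean(probs)))


def toy_model(x):
    """Hypothetical two-class model: logistic function of the sum of the features."""
    p1 = 1.0 / (1.0 + np.exp(-np.sum(x)))
    return np.array([1.0 - p1, p1])
```

The plug-in version needs one model evaluation per subset, whereas the empirical version needs one per resampled row, which is the efficiency gap noted above.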
2.2 Shapley value for measuring interaction between features
Consider the problem of quantifying the importance of a given feature index for feature vector . A naive way of doing so would be by computing the importance score of feature on its own. However, doing so ignores interactions between features, which are likely to be very important in applications. As a simple example, suppose that we were interested in performing sentiment analysis on the following sentence:
It is not heartwarming or entertaining. It just sucks.
This sentence is contained in a movie review from the IMDB movie data set , and it is classified as negative sentiment by a machine learning model to be discussed in the sequel. Now suppose we wish to quantify the importance of feature “not” in prediction. The word “not” plays an important role in the overall sentence as being classified as negative, and thus should be attributed a significant weight. However, viewed in isolation, the word “not” has neither negative nor positive sentiment, so that one would expect that .
Thus, it is essential to consider the interaction of a given feature with other features. For a given subset containing , a natural way in which to assess how interacts with the other features in is by computing the difference between the importance of all features in , with and without . This difference is called the marginal contribution of to , and given by
In order to obtain a simple scalar measure for feature , we need to aggregate these marginal contributions over all subsets that contain . The Shapley value  is one principled way of doing so. For each integer , we let denote the set of -sized subsets that contain . The Shapley value is obtained by averaging the marginal contributions, first over the set for a fixed , and then over all possible choices of set size :
Since the model remains fixed throughout our analysis, we frequently omit the dependence of on , instead adopting the more compact notation .
The concept of Shapley value was first introduced in cooperative game theory , and it has been used in a line of recent work on instancewise feature importance ranking [22, 5, 14]. It can be justified on an axiomatic basis [18, 24] as being the unique function from a collection of numbers (one for each subset ) to a collection of numbers (one for each feature ) with the following properties:
- Additivity: The sum of the Shapley values is equal to the difference .
- Equal contributions: If for all subsets , then .
- Monotonicity: Given two models and , let and denote the associated marginal contribution functions, and let and denote the associated Shapley values. If for all subsets , then we are guaranteed that .
Note that all three of these axioms are reasonable in our feature selection context.
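To make the averaging in the definition concrete, the following sketch computes the exact Shapley value by brute-force enumeration. It assumes a score function `v` that maps a `frozenset` of feature indices to a number; the names `v`, `shapley_value`, and `toy_game` are illustrative and do not come from the paper. The nested loops visit every subset containing the feature, which is exactly the exponential cost discussed in this paper.

```python
from itertools import combinations
from math import comb


def marginal_contribution(v, subset, i):
    """Score of a subset containing i, minus the score of the same subset without i."""
    s = frozenset(subset)
    return v(s) - v(s - {i})


def shapley_value(v, d, i):
    """Average marginal contributions over subsets of each size k, then over all sizes."""
    others = [j for j in range(d) if j != i]
    total = 0.0
    for k in range(1, d + 1):
        inner = sum(
            marginal_contribution(v, set(rest) | {i}, i)
            for rest in combinations(others, k - 1)
        )
        total += inner / comb(d - 1, k - 1)  # mean over the k-sized subsets containing i
    return total / d  # mean over the d possible subset sizes


def toy_game(s):
    """Hypothetical score: features 0 and 1 only matter jointly; feature 2 adds 2 alone."""
    return (1.0 if {0, 1} <= s else 0.0) + (2.0 if 2 in s else 0.0)
```

In `toy_game`, features 0 and 1 each have zero importance in isolation but share the credit for their interaction equally, which mirrors the role of “not” in the example sentence.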
2.3 The challenge with computing Shapley values
The exact computation of the Shapley value takes into account the interaction of feature with all subsets that contain , thereby leading to computational difficulties. Various approximation methods have been developed with the goal of reducing complexity. For example, Štrumbelj and Kononenko  proposed to estimate the Shapley values via a Monte Carlo approximation built on an alternative permutation-based definition of the Shapley value. Lundberg and Lee  proposed to evaluate the model over randomly sampled subsets and use a weighted linear regression to approximate the Shapley values based on the collected model evaluations.
In practice, such sampling-based approximations may suffer from high variance when the number of samples to be collected per instance is limited. For large-scale predictive models, the number of features is often relatively large, meaning that the number of samples required to obtain stable estimates can be prohibitively large. The main contribution of this paper is to address this challenge in a model-based paradigm, where the contribution of features to the response variable respects the structure of an underlying graph. In this setting, we propose efficient algorithms and provide bounds on the quality of the resulting approximation. As we discuss in more detail later, our approach should be viewed as complementary to sampling-based or regression-based approximations of the Shapley value. In particular, these methods can be combined with the approach of this paper so as to speed up the computation of the L-Shapley and C-Shapley values that we propose.
In many applications, the features can be associated with the nodes of a graph, and we can define distances between pairs of features based on the graph structure. More concretely, for sequence data (such as language, music etc.), each feature vector can be associated with a line graph, whereas for image data, each is naturally associated with a grid graph. In this section, we propose modified forms of the Shapley values, referred to as L-Shapley and C-Shapley values, that can be computed more efficiently than the Shapley value. We also show that under certain probabilistic assumptions on the marginal distribution over the features, these quantities yield good approximations to the original Shapley values.
More precisely, given feature vectors , we let denote a connected graph with nodes and edges , where each feature is associated with a node , and edges represent interactions between features. The graph induces a distance function on , given by
In the line graph, this graph distance corresponds to the number of edges in the unique path joining them, whereas it corresponds to the Manhattan distance in the grid graph. For a given node , its -neighborhood is the set
of all nodes at graph distance at most . See Figure 1 for an illustration for the two-dimensional grid graph.
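As a small illustration of these neighborhoods, the sketch below enumerates them for a line graph and for a two-dimensional grid graph with Manhattan distance. Zero-based node indices and the function names are assumptions made for the example.

```python
def line_neighborhood(i, k, d):
    """Nodes of a line graph with d nodes that lie within graph distance k of node i."""
    return set(range(max(0, i - k), min(d, i + k + 1)))


def grid_neighborhood(node, k, height, width):
    """Nodes of a height-by-width grid graph within Manhattan distance k of node (r, c)."""
    r, c = node
    return {
        (a, b)
        for a in range(max(0, r - k), min(height, r + k + 1))
        for b in range(max(0, c - k), min(width, c + k + 1))
        if abs(a - r) + abs(b - c) <= k
    }
```

Neighborhoods are truncated at the boundary of the graph, so nodes near the ends of a sequence or the edges of an image have fewer neighbors.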
We propose two algorithms for the setting in which features that are far apart on the graph, or that are not directly connected, have a correspondingly weaker interaction.
3.1 Local Shapley
In order to motivate our first graph-structured Shapley score, let us take a deeper look at the example sentence from Section 2.2. In order to compute the importance score of “not,” the most important words to be included are “heartwarming” and “entertaining.” Intuitively, the words distant from them have a weaker influence on the importance of a given word in a document, and therefore have relatively less effect on the Shapley score. Accordingly, as one approximation, we propose the L-Shapley score, which only perturbs the neighboring features of a given feature when evaluating its importance:
Given a model , a sample and a feature , the L-Shapley estimate of order on a graph is given by
The coefficients in front of the marginal contributions of feature are chosen to match the coefficients in the definition of the Shapley value restricted to the neighborhood . We show in Section 4 that this choice controls the error under certain probabilistic assumptions. In practice, the choice of the integer is dictated by computational considerations. By the definition of -neighborhoods, evaluating all L-Shapley scores on a line graph requires model evaluations. (In particular, computing each feature takes model evaluations, half of which overlap with those of its preceding feature.) A similar calculation shows that computing all L-Shapley scores on a grid graph requires function evaluations.
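The following sketch illustrates L-Shapley on a line graph. It takes the coefficients to be those of the Shapley value restricted to the neighborhood, as described above; the score function `v` (mapping a `frozenset` of zero-based feature indices to a number) and the function name are assumptions made for illustration.

```python
from itertools import combinations
from math import comb


def l_shapley_line(v, d, i, k):
    """L-Shapley estimate of order k for feature i on a line graph with d nodes."""
    neighborhood = list(range(max(0, i - k), min(d, i + k + 1)))
    others = [j for j in neighborhood if j != i]
    n = len(neighborhood)
    total = 0.0
    # Only subsets of the k-neighborhood that contain i are perturbed.
    for size in range(1, n + 1):
        for rest in combinations(others, size - 1):
            t = frozenset(rest) | {i}
            total += (v(t) - v(t - {i})) / comb(n - 1, size - 1)
    return total / n
```

When the neighborhood covers every feature, this reduces to the exact Shapley value. For smaller orders, the number of subsets visited per feature depends only on the neighborhood size and not on the total number of features.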
3.2 Connected Shapley
We also propose a second algorithm, C-Shapley, that further reduces the complexity of approximating the Shapley value. Returning to the example sentence from Section 2.2, where we evaluate the importance of “not,” both the L-Shapley estimate of order larger than two and the exact Shapley value estimate would evaluate the model on the word subset “It not heartwarming,” which rarely appears in real data and may not make sense to a human or a model trained on real-world data. The marginal contribution of “not” relative to “It not heartwarming” may be well approximated by the marginal contribution of “not” to “not heartwarming.” This motivates us to propose C-Shapley:
Given a model , a sample and a feature , the C-Shapley estimate of order on a graph is given by
where denotes the set of all subsets of that contain node , and are connected in the graph .
The coefficients in front of the marginal contributions are a result of using the Myerson value to characterize a new coalitional game over the graph , in which the influence of disconnected subsets of features is additive. The error between C-Shapley and the Shapley value can also be controlled under certain statistical assumptions. See Section 4 for details.
For text data, C-Shapley is equivalent to only evaluating n-grams in a neighborhood of the word to be explained. By the definition of -neighborhoods, evaluating the C-Shapley scores for all features takes model evaluations on a line graph, as each feature takes model evaluations.
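The n-gram view can be sketched as follows for a line graph. Note the assumption: the displayed definition did not survive in this copy of the text, so the weight 2 / ((|U| + 2)(|U| + 1)|U|) applied to each connected subset U is taken from our recollection of the published definition and should be checked against the original. The score function `v` and the function names are likewise illustrative.

```python
def connected_subsets_line(i, k, d):
    """Contiguous index ranges (n-grams) inside the k-neighborhood of i that contain i."""
    lo, hi = max(0, i - k), min(d - 1, i + k)
    return [
        frozenset(range(a, b + 1))
        for a in range(lo, i + 1)
        for b in range(i, hi + 1)
    ]


def c_shapley_line(v, d, i, k):
    """C-Shapley estimate of order k for feature i on a line graph with d nodes."""
    total = 0.0
    for u in connected_subsets_line(i, k, d):
        size = len(u)
        # Assumed coefficient; see the note above.
        weight = 2.0 / ((size + 2) * (size + 1) * size)
        total += weight * (v(u) - v(u - {i}))
    return total
```

For an interior word, the number of connected subsets grows quadratically in the order, compared with the exponential growth in the number of subsets of the same neighborhood.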
In this section, we study some basic properties of the L-Shapley and C-Shapley values. In particular, under certain probabilistic assumptions on the features, we show that they provide good approximations to the original Shapley values. We also show their relationship to another concept from cooperative game theory, namely that of Myerson values, when the model satisfies certain local additivity assumptions.
4.1 Approximation of Shapley value
In order to characterize the relationship between L-Shapley and the Shapley value, we introduce absolute mutual information as a measure of dependence. Given two random variables and , the absolute mutual information between and is defined as
where the expectation is taken jointly over . Based on the definition of independence, we have if and only if . Recall the mutual information  is defined as . The new measure is more stringent than the mutual information in the sense that . The absolute conditional mutual information can be defined in an analogous way. Given three random variables and , we define the absolute conditional mutual information to be , where the expectation is taken jointly over . Recall that is zero if and only if .
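For discrete variables, the two dependence measures can be compared numerically. The sketch below assumes that the absolute mutual information is the expected absolute value of the same log-ratio whose plain expectation gives the mutual information, which is consistent with the inequality stated above; the function name and the example distributions are hypothetical.

```python
import numpy as np


def mutual_information_pair(joint):
    """Return (mutual information, absolute mutual information) of a discrete joint table."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)  # marginal of the row variable
    py = joint.sum(axis=0, keepdims=True)  # marginal of the column variable
    mask = joint > 0  # cells with zero probability contribute nothing
    log_ratio = np.log(joint[mask] / (px @ py)[mask])
    mi = float(np.sum(joint[mask] * log_ratio))
    abs_mi = float(np.sum(joint[mask] * np.abs(log_ratio)))
    return mi, abs_mi
```

Because negative log-ratios can no longer cancel positive ones, the absolute version is never smaller than the mutual information, and both vanish under independence.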
Theorem 1 and Theorem 2 show that L-Shapley and C-Shapley values, respectively, are related to the Shapley value whenever the model obeys a Markovian structure that is encoded by the graph. We leave their proofs to Appendix B.
Suppose there exists a feature subset with , such that
where we identify with for notational convenience. Then the expected error between the L-Shapley estimate and the true Shapley-value-based importance score is bounded by :
In particular, we have almost surely if we have and for any .
Suppose there exists a neighborhood of , with , such that Condition 8 is satisfied. Moreover, for any connected subset with , we have
where . Then the expected error between the C-Shapley estimate and the true Shapley-value-based importance score is bounded by :
In particular, we have almost surely if we have and for any .
4.2 Relating the C-Shapley value to the Myerson value
Let us now discuss how the C-Shapley value can be related to the Myerson value, which was introduced by Myerson  as an approach for characterizing a coalitional game over a graph . Given a subset of nodes in the graph , let denote the set of connected components of —i.e., subsets of that are connected via edges of the graph. Thus, if is a connected subset of , then consists only of ; otherwise, it contains a collection of subsets whose disjoint union is equal to .
Consider a score function that satisfies the following decomposability condition: for any subset of nodes , the score is equal to the sum of the scores over all the connected components of —viz.
For any such score function, we can define the associated Shapley value, and it is known as the Myerson value on with respect to . Myerson  showed that the Myerson value is the unique quantity that satisfies both the decomposability property, as well as the properties additivity, equal contributions and monotonicity given in Section 2.2.
In our setting, if we use a plug-in estimate for conditional probability, the decomposability condition (12) is equivalent to assuming that the influence of disconnected subsets of features is additive at sample , and C-Shapley of order is exactly the Myerson value over . In fact, if we partition each subset into connected components, as in the definition of the Myerson value, and sum up the coefficients (using Lemma 1 in Appendix B), then the Myerson value is equivalent to equation (6).
4.3 Connections with related work
Let us now discuss connections with related work in more depth, and in particular how methods useful for approximating the Shapley value can be used to speed up the evaluation of approximate L-Shapley and C-Shapley values.
4.3.1 Sampling-based methods
There is an alternative definition of the Shapley value based on taking averages over permutations of the features. In particular, the contribution of a feature corresponds to the average of the marginal contribution of to its preceding features over the set of all permutations of features. Based on this definition, Štrumbelj and Kononenko  propose a Monte Carlo approximation, based on randomly sampling permutations.
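A minimal sketch of this permutation-based Monte Carlo estimate is given below. It is an illustration of the idea rather than the procedure of Štrumbelj and Kononenko; the score function `v` (mapping a `frozenset` of feature indices to a number), the function name, and the fixed seed are assumptions.

```python
import random


def sample_shapley(v, d, i, n_permutations, seed=0):
    """Estimate the Shapley value of feature i from randomly sampled permutations."""
    rng = random.Random(seed)
    features = list(range(d))
    total = 0.0
    for _ in range(n_permutations):
        rng.shuffle(features)
        # Features that appear before i in the sampled ordering.
        preceding = frozenset(features[: features.index(i)])
        total += v(preceding | {i}) - v(preceding)
    return total / n_permutations
```

Each permutation costs two model evaluations, and the estimate is unbiased, but its variance shrinks only at the usual Monte Carlo rate in the number of permutations.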
While L-Shapley is deterministic in nature, it is possible to combine it with this and other sampling-based methods. For example, if one hopes to consider the interaction of features in a large neighborhood with a feature , where exponential complexity in becomes a barrier, sampling based on random permutation of local features may be used to alleviate the computational burden.
4.3.2 Regression-based methods
Lundberg and Lee proposed to sample feature subsets based on a weighted kernel, and carry out a weighted linear regression to estimate the Shapley value. Suppose the model is evaluated on feature subsets at . In weighted least squares, each row of the data matrix is a -dimensional vector, with the entry being one if the feature is selected, and zero otherwise. The response is the evaluation of the model over feature subsets. The weight matrix is diagonal with .
Lundberg and Lee  provide strong empirical results using this regression-based approximation, referred to as KernelSHAP; in particular, see Section 5.1 and Figure 3 of their paper. We can combine such a regression-based approximation with our modified Shapley values to further reduce the evaluation complexity of the C-Shapley values. In particular, for a chain graph, we evaluate the score function over all connected subsequences of length ; similarly, on a grid graph, we evaluate it over all connected squares of size . Doing so yields a data matrix and a response vector , where if the th feature is included in the th sample, and , the score function evaluated on the corresponding feature subset. We use the solution to this weighted least-squares problem as a regression-based estimate of C-Shapley—that is, .
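The weighted least-squares step can be sketched as follows. This is a simplified illustration, not the KernelSHAP implementation: the weight function is the Shapley kernel (d - 1) / (C(d, s) s (d - s)) from Lundberg and Lee, the intercept column is an assumption added here, and the empty and full subsets (which carry infinite weight in KernelSHAP) are simply omitted. The names `regression_estimate` and `proper_subsets` are hypothetical.

```python
import numpy as np
from itertools import combinations
from math import comb


def shapley_kernel_weight(d, size):
    """Shapley kernel weight for a subset of the given size (0 < size < d)."""
    return (d - 1) / (comb(d, size) * size * (d - size))


def regression_estimate(v, d, subsets):
    """Weighted least-squares fit of the scores v(S) on subset-membership indicators."""
    subsets = [frozenset(s) for s in subsets]
    design = np.array([[1.0 if j in s else 0.0 for j in range(d)] for s in subsets])
    design = np.hstack([np.ones((len(subsets), 1)), design])  # assumed intercept column
    response = np.array([v(s) for s in subsets])
    weights = np.array([shapley_kernel_weight(d, len(s)) for s in subsets])
    gram = design.T @ (weights[:, None] * design)
    rhs = design.T @ (weights * response)
    return np.linalg.solve(gram, rhs)[1:]  # drop the intercept, keep per-feature values


def proper_subsets(d):
    """All non-empty proper subsets of range(d), used here as the evaluated subsets."""
    return [c for size in range(1, d) for c in combinations(range(d), size)]
```

Passing only connected subsequences (or connected image patches) as `subsets` gives the flavor of the regression-based C-Shapley variant described above, although the exact weights used there are not reproduced in this sketch.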
We evaluate the performance of L-Shapley and C-Shapley on real-world data sets involving text and image classification. Code for reproducing the key results is available online at https://github.com/Jianbo-Lab/LCShapley. We compare L-Shapley and C-Shapley with several competitive algorithms for instancewise feature importance scoring on black-box models, including the regression-based approximation known as KernelSHAP, SampleShapley, and the LIME method. As discussed previously, KernelSHAP forms a weighted regression-approximation of the Shapley values, whereas SampleShapley estimates the Shapley value by random permutation of features. The LIME method uses a linear model to locally approximate the original model through weighted least squares. For all methods, the number of model evaluations is the same, and linear in the number of features. We also choose the objective to be the log probability of the predicted class, and use the plug-in estimate of conditional probability across all methods (see Section 2.1).
For image data, we also compare with Saliency map  as another baseline. The Saliency method is used for interpreting neural networks in computer vision, by assuming knowledge of the gradient of a model with respect to the input, and using the gradient magnitude as the importance score for each pixel.
5.1 Text Classification
Text classification is a classical problem in natural language processing, in which text documents are assigned to predefined categories. We study the performance of L-Shapley and C-Shapley on three popular neural models for text classification: word-based CNNs , character-based CNNs , and long short-term memory (LSTM) recurrent neural networks , on the following three data sets of different scales. See Table 1 for a summary, and Appendix A for all of the details.
IMDB Review with Word-CNN: The Internet Movie Review Dataset (IMDB) is a dataset of movie reviews for sentiment classification , which contains binary labeled movie reviews, with a split of for training and for testing. A simple word-based CNN model composed of an embedding layer, a convolutional layer, a max-pooling layer, and a dense layer is used, achieving an accuracy of on the test data set.
AG news with Char-CNN: The AG news corpus is composed of titles and descriptions of news articles from news sources . It is segmented into four classes, each containing training samples and testing samples. Our character-based CNN has the same structure as that proposed in Zhang et al. . The model achieves an accuracy of on the test data set.
Yahoo! Answers with LSTM: The corpus of Yahoo! Answers Topic Classification Dataset is divided into ten categories, each class containing training samples and testing samples. Each input text includes the question title, content and best answer. We train a bidirectional LSTM which achieves an accuracy of on the test data set, close to the state-of-the-art accuracy of obtained by character-based CNNs .
We choose zero paddings as the reference point for all methods, and make model evaluations, where is the number of words for each input. Given the average length of each input (see Table 1), this choice controls the number of model evaluations under , taking less than one second in TensorFlow on a Tesla K80 GPU for all the three models. For L-Shapley, we are able to consider the interaction of each word with the two neighboring words in given the budget. For C-Shapley, the budget allows the regression-based version to evaluate all -grams with .
The change in log-odds scores before and after masking the top features ranked by importance scores is used as a metric for evaluating performance, where masked words are replaced by zero paddings. This metric has been used in previous literature in model interpretation [19, 14]. We study how the average log-odds score of the predicted class decreases as the percentage of masked features over the total number of features increases on samples from the test set. Results are plotted in Figure 2.
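The evaluation metric can be sketched as follows. The code is an illustrative assumption rather than the evaluation script used for the experiments: `model_proba` is a hypothetical callable returning class probabilities for one input, masked features are replaced by a reference value (zero padding here), and `toy_model` is a stand-in logistic model.

```python
import numpy as np


def log_odds(p):
    """Log-odds of a probability p strictly between 0 and 1."""
    return float(np.log(p / (1.0 - p)))


def log_odds_drop(model_proba, x, scores, fraction, reference=0.0):
    """Decrease in log-odds of the predicted class after masking the top-scored features."""
    x = np.asarray(x, dtype=float)
    probs = model_proba(x)
    predicted = int(np.argmax(probs))  # class predicted on the unmasked input
    n_mask = int(round(fraction * len(x)))
    top = np.argsort(-np.asarray(scores, dtype=float))[:n_mask]
    masked = x.copy()
    masked[top] = reference
    return log_odds(probs[predicted]) - log_odds(model_proba(masked)[predicted])


def toy_model(x):
    """Hypothetical two-class model: logistic function of the sum of the features."""
    p1 = 1.0 / (1.0 + np.exp(-np.sum(x)))
    return np.array([1.0 - p1, p1])
```

A larger drop at a given masking fraction indicates that the scoring method ranked the features the model actually relies on near the top.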
Table 1: Summary of the text classification data sets and models.
|Data Set||Classes||Train Samples||Test Samples||Average #Words||Model||Parameters||Accuracy|
|IMDB Review||2||25,000||25,000||325.6||WordCNN||351,002||90.1%|
|AG’s News||4||120,000||7,600||43.3||CharCNN||11,337,988||90.09%|
|Yahoo! Answers||10||1,400,000||60,000||108.4||LSTM||7,146,166||70.84%|
[Table: the sentence “It is not heartwarming or entertaining. It just sucks.” as scored by Shapley, C-Shapley, L-Shapley, KernelSHAP, and SampleShapley; the per-word highlighting is not reproduced here.]
On IMDB with Word-CNN, the simplest of the three models, L-Shapley achieves the best performance, while LIME, KernelSHAP and C-Shapley perform slightly worse. On AG’s news with Char-CNN, L-Shapley and C-Shapley both outperform the other algorithms. On Yahoo! Answers with LSTM, C-Shapley outperforms the rest of the algorithms by a large margin, followed by LIME. L-Shapley with order , SampleShapley, and KernelSHAP do not perform well for the LSTM model, probably because some of the signals captured by the LSTM are relatively long -grams.
5.2 Image Classification
We carry out experiments in image classification on the MNIST and CIFAR10 data sets:
MNIST: The MNIST data set contains images of handwritten digits with ten categories . A subset of MNIST data set composed of digits and is used for better visualization, with images for training and images for testing. A simple CNN model achieves accuracy on the test data set.
CIFAR10: The CIFAR10 data set contains images in ten classes. A subset of the CIFAR10 data set composed of deer and horses is used for better visualization, with images for training and images for testing. A convolutional neural network modified from AlexNet achieves accuracy on the test data set.
We take each pixel as a single feature for both MNIST and CIFAR10. We choose the average pixel strength as the reference point for all methods, and make model evaluations, where is the number of pixels for each input image, which keeps the number of model evaluations under .
LIME and L-Shapley are not used for comparison: LIME takes as single features “superpixels” produced by segmentation algorithms rather than raw pixels, and L-Shapley requires nearly sixteen thousand model evaluations when applied to raw pixels. (L-Shapley becomes practical if we take small patches of images instead of pixels as single features.) For C-Shapley, the budget allows the regression-based version to evaluate all image patches with .
Figure 3 shows the decrease in log-odds scores before and after masking the top pixels ranked by importance scores as the percentage of masked pixels over the total number of pixels increases on test samples on MNIST and CIFAR10 data sets. C-Shapley consistently outperforms other methods on both data sets.
Figure 4 and Figure 5 provide additional visualization of the results. By masking the top pixels ranked by various methods, we find that the pixels picked by C-Shapley concentrate around and inside the digits in MNIST. The C-Shapley and Saliency methods yield the most interpretable results in CIFAR10. In particular, C-Shapley tends to mask the parts of head and body that distinguish deer and horses, and the human riding the horse. Figure 3 shows two digits misclassified by the CNN model. Interestingly, the top pixels chosen by C-Shapley visualize the “reasoning” of the model: more specifically, the important pixels to the model are exactly those which could form a digit from the opposite class.
We have proposed L-Shapley and C-Shapley for instancewise feature importance scoring, making use of a graphical representation of the data. We have shown the superior performance of the proposed algorithms compared to other methods for instancewise feature importance scoring in text and image classification.
We would like to acknowledge support from the DARPA Program on Lifelong Learning Machines from the Army Research Office under grant number W911NF-17-1-0304, and from National Science Foundation grant NSF-DMS-1612948.
- Bach et al.  Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS One, 10(7):e0130140, 2015.
- Baehrens et al.  David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert Müller. How to explain individual classification decisions. Journal of Machine Learning Research, 11:1803–1831, 2010.
- Chen et al.  Jianbo Chen, Le Song, Martin J Wainwright, and Michael I Jordan. Learning to explain: An information-theoretic perspective on model interpretation. arXiv preprint arXiv:1802.07814, 2018.
- Cover and Thomas  Thomas M Cover and Joy A Thomas. Elements of Information Theory. John Wiley & Sons, 2012.
- Datta et al.  Anupam Datta, Shayak Sen, and Yair Zick. Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems. In Security and Privacy (SP), 2016 IEEE Symposium on, pages 598–617. IEEE, 2016.
- Hinton et al.  Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Neural networks for machine learning, lecture 6a: Overview of mini-batch gradient descent.
- Hochreiter and Schmidhuber  Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
- Kim  Yoon Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, 2014.
- Kingma and Ba  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.
- Krizhevsky  Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
- Krizhevsky et al.  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
- LeCun et al.  Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Lipton  Zachary C Lipton. The mythos of model interpretability. arXiv preprint arXiv:1606.03490, 2016.
- Lundberg and Lee  Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4765–4774. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf.
- Maas et al.  Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pages 142–150. Association for Computational Linguistics, 2011.
- Myerson  Roger B Myerson. Graphs and cooperation in games. Mathematics of Operations Research, 2(3):225–229, 1977.
- Ribeiro et al.  Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM, 2016.
- Shapley  Lloyd S Shapley. A value for n-person games. Contributions to the Theory of Games, 2(28):307–317, 1953.
- Shrikumar et al.  Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In ICML, volume 70 of Proceedings of Machine Learning Research, pages 3145–3153. PMLR, 06–11 Aug 2017.
- Simonyan et al.  K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In Proceedings of the International Conference on Learning Representations (ICLR), 2014.
- Srivastava et al.  Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
- Štrumbelj and Kononenko  Erik Štrumbelj and Igor Kononenko. An efficient explanation of individual classifications using game theory. Journal of Machine Learning Research, 11:1–18, 2010.
- Sundararajan et al.  Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In International Conference on Machine Learning, pages 3319–3328, 2017.
- Young  H Peyton Young. Monotonic solutions of cooperative games. International Journal of Game Theory, 14(2):65–72, 1985.
- Zhang et al.  Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657, 2015.
Appendix A Model structure
IMDB Review with Word-CNN
The word-based CNN model is composed of a -dimensional word embedding, a -D convolutional layer of 250 filters and kernel size three, a max-pooling layer, and a -dimensional dense layer as hidden layers. Both the convolutional and the dense layers are followed by ReLU as the nonlinearity and Dropout (Srivastava et al., 2014) as regularization. The model is trained with rmsprop (Hinton et al.). The model achieves an accuracy of on the test data set.
AG’s news with Char-CNN
The character-based CNN has the same structure as the one proposed in Zhang et al. (2015), composed of six convolutional layers, three max-pooling layers, and two dense layers. The model is trained with SGD with momentum 0.9 and a decreasing step size initialized at . (Details can be found in Zhang et al. (2015).) The model reaches accuracy of on the test data set.
Yahoo! Answers with LSTM
The network consists of a -dimensional randomly-initialized word embedding, a bidirectional LSTM, each LSTM unit of dimension , and a dropout layer as hidden layers. The model is trained with rmsprop (Hinton et al.). The model reaches accuracy of on the test data set, close to the state-of-the-art accuracy of obtained by a character-based CNN (Zhang et al., 2015).
A simple CNN model is trained on the data set, which achieves accuracy on the test data set. It is composed of two convolutional layers of kernel size and a final dense linear layer. The two convolutional layers contain 8 and 16 filters respectively, and both are followed by a max-pooling layer of pool size two.
A convolutional neural network modified from AlexNet (Krizhevsky et al., 2012) is trained on the subset. It is composed of six convolutional layers of kernel size and two final dense linear layers of dimension 512 and 256. The six convolutional layers contain 48, 48, 96, 96, 192, and 192 filters respectively, and every two convolutional layers are followed by a max-pooling layer of pool size two and a dropout layer. The CNN model is trained with the Adam optimizer (Kingma and Ba, 2015) and achieves accuracy on the test data set.
Appendix B Proof of Theorems
In this appendix, we collect the proofs of Theorems 1 and 2.
B.1 Proof of Theorem 1
We state an elementary combinatorial equality required for the proof of the main theorem:
Lemma 1 (A combinatorial equality).
For any positive integer , and any pair of non-negative integers with , we have
By the binomial theorem for negative integer exponents, we have
The identity can be found by examination of the coefficient of in the expansion of
In fact, equating the coefficients of in the left and the right hand sides, we get
Moving to the right hand side and expanding the binomial coefficients, we have
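The displayed identity is missing from this copy of the text. As a sanity check, the sketch below verifies with exact rational arithmetic one identity that is consistent with the stated hypotheses (a positive integer $n$ and non-negative integers $s \geq t$), namely $\sum_{j=0}^{n} \binom{n}{j} / \binom{n+s}{j+t} = \frac{n+s+1}{(s+1)\binom{s}{t}}$. Reading this as the statement of Lemma 1 is an assumption, and the symbols $n$, $s$, $t$ as well as the function names are illustrative.

```python
from fractions import Fraction
from math import comb

def lhs(n, s, t):
    """Sum over j of C(n, j) / C(n + s, j + t), computed exactly."""
    return sum(Fraction(comb(n, j), comb(n + s, j + t)) for j in range(n + 1))

def rhs(n, s, t):
    """Closed form (n + s + 1) / ((s + 1) * C(s, t))."""
    return Fraction(n + s + 1, (s + 1) * comb(s, t))

def check_identity(max_n=8, max_s=8):
    """Check the assumed identity on a small grid of (n, s, t) with s >= t."""
    for n in range(1, max_n + 1):
        for s in range(max_s + 1):
            for t in range(s + 1):
                if lhs(n, s, t) != rhs(n, s, t):
                    return False
    return True

identity_holds = check_identity()
```

A finite check of this kind is not a proof; it only guards against a mistranscribed index or sign when the identity is restated.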
Using this lemma, we now prove the theorem. We split our analysis into two cases, namely versus . For notational convenience, we extend the definition of the L-Shapley estimate for feature to an arbitrary feature subset containing . In particular, we define
First, suppose that . For any subset , we introduce the shorthand notation and , and note that . Recalling the definition of the Shapley value, let us partition all the subsets based on , in particular writing
Based on this partitioning, the expected error between and can be written as
Partitioning the set by the size of , we observe that
where we recall that .
Omitting the dependence of on for notational simplicity, we write the difference as
Substituting this equivalence into our earlier bound (19) and taking an expectation over on both sides, we find that the expected error is upper bounded as
Recalling the definition of the absolute mutual information, we see that
which completes the proof of the claimed bound.
Finally, in the special case that and for any , this inequality holds with , which implies . Therefore, we have almost surely, as claimed.
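The special case above, in which the L-Shapley estimate coincides with the Shapley value, can be illustrated on a toy example. The sketch below assumes the standard Shapley formula and an L-Shapley estimate that applies the same weighting inside the k-neighborhood of feature i on a chain graph. The value function `v`, the coefficient lists `A` and `B`, and the function names are hypothetical; `v` is a deterministic stand-in for the model-based quantity analyzed in the proof, chosen so that each marginal contribution depends only on adjacent features.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

def shapley(v, d, i):
    """Exact Shapley value of feature i for value function v on {0, ..., d-1}."""
    others = [j for j in range(d) if j != i]
    total = Fraction(0)
    for size in range(d):
        for rest in combinations(others, size):
            S = frozenset(rest) | {i}
            total += Fraction(v(S) - v(S - {i}), comb(d - 1, size))
    return total / d

def l_shapley(v, d, i, k):
    """Same weighting, restricted to subsets of the k-neighborhood of i on a chain."""
    hood = list(range(max(0, i - k), min(d, i + k + 1)))
    others = [j for j in hood if j != i]
    m = len(hood)
    total = Fraction(0)
    for size in range(m):
        for rest in combinations(others, size):
            S = frozenset(rest) | {i}
            total += Fraction(v(S) - v(S - {i}), comb(m - 1, size))
    return total / m

A = [3, 1, 4, 1, 5]  # hypothetical main effects
B = [2, 7, 1, 8]     # hypothetical interactions between neighbors j and j + 1

def v(S):
    """Toy value function with only nearest-neighbor interactions."""
    main = sum(A[j] for j in S)
    pair = sum(B[j] for j in range(len(B)) if j in S and j + 1 in S)
    return main + pair

d = 5
exact = [shapley(v, d, i) for i in range(d)]
local = [l_shapley(v, d, i, 1) for i in range(d)]
```

Here `exact` and `local` agree for every feature, although the local estimate enumerates at most four subsets per feature rather than sixteen. With `k = 0` the neighborhood omits the interacting neighbors and the two quantities differ.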
We now consider the general case in which . Using the previous arguments, we can show
Applying the triangle inequality yields , which establishes the claim.
B.2 Proof of Theorem 2
As in the previous proof, we divide our analysis into two cases.
First, suppose that . For any subset with , we can partition into two components and , such that and is a connected subsequence. is disconnected from . We also define
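The decomposition of a subset into a connected subsequence containing the feature of interest and a remainder disconnected from it can be sketched as follows. This is a minimal illustration assuming that features are indexed by consecutive integers on a chain graph; the function name `split_connected` and the variable names are hypothetical.

```python
def split_connected(S, i):
    """Split S into the maximal run of consecutive indices containing i, and the rest."""
    if i not in S:
        raise ValueError("feature i must belong to S")
    lo = hi = i
    while lo - 1 in S:  # extend the run to the left
        lo -= 1
    while hi + 1 in S:  # extend the run to the right
        hi += 1
    U = set(range(lo, hi + 1))
    V = set(S) - U
    return U, V

U, V = split_connected({0, 1, 3, 4, 5, 8}, 4)
```

Because the run is maximal, no element of the remainder is adjacent to it on the chain, which is the disconnectedness property used in the argument.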
We partition all the subsets based on in the definition of the Shapley value:
The expected error between and is
Partitioning by the size of , we observe that
where . We omit the dependence of and on the pair for notational simplicity, and observe that the difference between and is
Taking an expectation over on both sides, we can upper bound the expected error by