L-Shapley and C-Shapley: Efficient Model Interpretation for Structured Data
Georgia Institute of Technology
Ant Financial
Voleon Group
Abstract
We study instance-wise feature importance scoring as a method for model interpretation. Any such method yields, for each predicted instance, a vector of importance scores associated with the feature vector. Methods based on the Shapley score have been proposed as a fair way of computing feature attributions of this kind, but incur an exponential complexity in the number of features. This combinatorial explosion arises from the definition of the Shapley value and prevents these methods from being scalable to large data sets and complex models. We focus on settings in which the data have a graph structure, and the contribution of features to the target variable is well-approximated by a graph-structured factorization. In such settings, we develop two algorithms with linear complexity for instance-wise feature importance scoring. We establish the relationship of our methods to the Shapley value, and to another closely related concept from cooperative game theory known as the Myerson value. We demonstrate on both language and image data that our algorithms compare favorably with other methods for model interpretation.
1 Introduction
Modern machine learning models, including random forests, deep neural networks, and kernel methods, can produce high-accuracy predictions in many applications. Often, however, the predictive accuracy of such black-box models comes at the cost of interpretability. Ease of interpretation is a crucial criterion when these tools are applied in areas such as medicine, financial markets, and criminal justice; for more background, see the discussion paper by Lipton [13] as well as the references therein.
In this paper, we study instance-wise feature importance scoring as a specific approach to the problem of interpreting the predictions of black-box models. Given a predictive model, such a method yields, for each instance to which the model is applied, a vector of importance scores associated with the underlying features. The instance-wise property means that this vector, and hence the relative importance of each feature, is allowed to vary across instances. Thus, the importance scores can act as an explanation for the specific instance, indicating which features are key to the model's prediction on that instance.
There is now a large body of research focused on the problem of scoring input features based on the prediction of a given instance (see, for instance, the papers [19, 1, 17, 14, 22, 2, 5, 23] as well as the references therein). Of most relevance to this paper is a line of recent work [22, 14, 5] that has developed methods for model interpretation based on the Shapley value [18] from cooperative game theory. The Shapley value was originally proposed as an axiomatic characterization of a fair distribution of the total surplus among all the players, and it can be applied to predictive models by modeling each feature as a player in the underlying game. While the Shapley value approach is conceptually appealing, it is also computationally challenging: in general, each evaluation of a Shapley value requires an exponential number of model evaluations. Different approaches to circumventing this complexity barrier have been proposed, including those based on Monte Carlo approximation [22, 5] and those based on weighted least squares over sampled subsets [14].
In this paper, we take a complementary point of view, arguing that the problem of explanation is best approached within a model-based paradigm. In this view, explanations are cast in terms of a model, which may or may not be the same model as used to fit the data. Criteria such as the Shapley value, which are intractable to compute when no assumptions are made, can be more effectively computed or approximated within the framework of a model. We focus specifically on settings in which a graph structure is appropriate for the data; specifically, we consider simple chains and grids, appropriate for time series and images, respectively. We propose two measures for instance-wise feature importance scoring in this framework, which we term L-Shapley and C-Shapley; here the abbreviations “L” and “C” refer to “local” and “connected,” respectively. By exploiting the underlying graph structure, the number of model evaluations is reduced to linear—as opposed to exponential—in the number of features. We demonstrate the relationship of these measures with a constrained form of the Shapley value, and we additionally relate C-Shapley with another solution concept from cooperative game theory, known as the Myerson value [16]. The Myerson value is commonly used in graph-restricted games, under a local additivity assumption of the model on disconnected subsets of features. Finally, we apply our feature scoring methods to several state-of-the-art models for both language and image data, and find that our scoring algorithms compare favorably to several existing sampling-based algorithms for instance-wise feature importance scoring.
The remainder of this paper is organized as follows. We begin in Section 2 with background and setup for the problem to be studied. In Section 3, we describe the two methods proposed and analyzed in this paper, based on the L-Shapley and C-Shapley scores. Section 4 is devoted to a study of the relationship between these scores and the Myerson value. In Section 5, we evaluate the performance of L-Shapley and C-Shapley on various real-world data sets, and we conclude with a discussion in Section 6.
2 Background and preliminaries
We begin by introducing some background and notation.
2.1 Importance of a feature subset
We are interested in studying models that are trained to perform prediction, taking as input a feature vector and predicting a response or output variable . We assume access to the output of a model via a conditional distribution, denoted by , that provides the distribution of the response conditioned on a given vector of inputs. For any given subset , we use to denote the associated subvector of features, and we let denote the induced conditional distribution when is restricted to using only the subvector . In the corner case in which , we define . In terms of this notation, for a given feature vector , subset , and fitted model distribution , we introduce the importance score
where denotes the expectation over . The importance score has a coding-theoretic interpretation: it corresponds to the negative of the expected number of bits required to encode the output of the model based on the subvector . It is zero when the model makes a deterministic prediction based on , and becomes more negative—meaning more bits are required—as the model's output distribution approaches uniform over the output space.
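To make this concrete, the score can be computed directly from the model's predictive distribution. The minimal sketch below is ours, not the authors' code; the helper name `importance_score` is illustrative, and the function simply evaluates the negative Shannon entropy of a probability vector:

```python
import numpy as np

def importance_score(cond_probs):
    """Importance of a feature subset, given the model's predictive
    distribution p(y | x_S) as a probability vector over classes.

    Returns E_{Y ~ p}[log p(Y)], i.e. the negative Shannon entropy:
    zero for a deterministic prediction, and increasingly negative
    as the distribution approaches uniform.
    """
    p = np.asarray(cond_probs, dtype=float)
    mask = p > 0  # use the convention 0 * log 0 = 0
    return float(np.sum(p[mask] * np.log(p[mask])))
```

A deterministic prediction such as `[1.0, 0.0]` scores zero, while a uniform prediction over four classes scores -log 4.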
There is also an information-theoretic interpretation of this definition of importance scores, as discussed in our previous work [3]. In particular, suppose that for a given integer , there is a function such that, for almost all , the subset of size maximizes over all subsets of size . In this case, we are guaranteed that the mutual information between and is maximized over any conditional distribution that generates a subset of size given . The converse is also true.
In many cases, classspecific importance is favored, where one is interested in seeing how important a feature subset is to the predicted class, instead of the prediction as a conditional distribution. In order to handle such cases, it is convenient to introduce the degenerate conditional distribution
We can then define the importance of a subset with respect to using the modified score
which is the expected log probability of the predicted class given the features in .
Estimating the conditional distribution: In practice, we need to estimate—for any given feature vector —the conditional probability functions based on observed data. Past work has used one of two approaches: either estimation based on empirical averages [22], or plug-in estimation using a reference point [5, 14].
Empirical average estimation: In this approach, we first draw a set of feature vectors by sampling with replacement from the full data set. For each sample , we define a new vector with components
The empirical mean of over these samples is then used as an estimate of .
Plug-in estimation: In this approach, the first step is to specify a reference vector . We then define the vector with components
Finally, we use the conditional probability as an approximation to . The plug-in estimate is more computationally efficient than the empirical average estimator, and works well when there exist appropriate choices of reference points. We use this method in our experiments, taking the index of padding as the reference for language data, and the average pixel strength of an image for vision data.
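Assuming the plug-in approach above, forming the masked input is a single array operation; the helper name `plugin_input` and its interface are illustrative assumptions:

```python
import numpy as np

def plugin_input(x, subset, reference):
    """Form the plug-in input x(S): keep the features in S, and replace
    every other coordinate by the reference value (e.g. the padding
    index for text, or the mean pixel value for images)."""
    x = np.asarray(x)
    out = np.asarray(reference, dtype=x.dtype).copy()
    idx = list(subset)
    out[idx] = x[idx]  # restore only the coordinates in S
    return out
```

The model is then evaluated on `plugin_input(x, S, reference)` in place of the full vector `x`.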
2.2 Shapley value for measuring interaction between features
Consider the problem of quantifying the importance of a given feature index for the feature vector . A naive approach would be to compute the importance score of feature on its own. However, doing so ignores interactions between features, which are likely to be very important in applications. As a simple example, suppose that we were interested in performing sentiment analysis on the following sentence:
It is not heartwarming or entertaining. It just sucks.
This sentence is contained in a movie review from the IMDB movie data set [15], and it is classified as negative sentiment by a machine learning model to be discussed in the sequel. Now suppose we wish to quantify the importance of the feature “not” in this prediction. The word “not” plays an important role in the overall sentence being classified as negative, and thus should be attributed a significant weight. However, viewed in isolation, the word “not” has neither negative nor positive sentiment, so one would expect that .
Thus, it is essential to consider the interaction of a given feature with other features. For a given subset containing , a natural way to assess how interacts with the other features in is to compute the difference between the importance of all features in , with and without . This difference is called the marginal contribution of to , and is given by
m(S, i) := v(S) - v(S \ {i}).   (1)
In order to obtain a simple scalar measure for feature , we need to aggregate these marginal contributions over all subsets that contain . The Shapley value [18] is one principled way of doing so. For each integer , we let denote the set of sized subsets that contain . The Shapley value is obtained by averaging the marginal contributions, first over the set for a fixed , and then over all possible choices of set size :
Φ_i(x) = (1/n) Σ_{k=1}^{n} C(n-1, k-1)^{-1} Σ_{S ∈ S_k(i)} m(S, i).   (2)
Since the model remains fixed throughout our analysis, we frequently omit the dependence of on , instead adopting the more compact notation .
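The size-then-subset averaging in this definition can be written out directly, which also makes the exponential cost explicit: the sketch below (our own `shapley_value` helper, with `v` an arbitrary set function standing in for the importance score) enumerates all 2^(n-1) subsets containing feature i.

```python
from itertools import combinations
from math import comb

def shapley_value(v, n, i):
    """Exact Shapley value of feature i for a set function v over
    features {0, ..., n-1}.

    Averages the marginal contribution m(S, i) = v(S) - v(S \ {i})
    first over all k-sized subsets S containing i, then over the
    size k; this touches 2^(n-1) subsets, hence the exponential cost.
    """
    others = [j for j in range(n) if j != i]
    total = 0.0
    for k in range(1, n + 1):  # |S| = k, with i in S
        inner = 0.0
        for rest in combinations(others, k - 1):
            s = set(rest) | {i}
            inner += v(s) - v(s - {i})
        total += inner / comb(n - 1, k - 1)
    return total / n
```

For an additive game (each feature contributes a fixed weight), the Shapley value of a feature recovers exactly its own weight, and the values sum to v of the full set, consistent with the additivity axiom below.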
The concept of the Shapley value was first introduced in cooperative game theory [18], and it has been used in a line of recent work on instance-wise feature importance ranking [22, 5, 14]. It can be justified on an axiomatic basis [18, 24] as the unique function from a collection of numbers (one for each subset ) to a collection of numbers (one for each feature ) with the following properties:
 Additivity:

The sum of the Shapley values is equal to the difference .
 Equal contributions:

If for all subsets , then .
 Monotonicity:

Given two models and , let and denote the associated marginal contribution functions, and let and denote the associated Shapley values. If for all subsets , then we are guaranteed that .
Note that all three of these axioms are reasonable in our feature selection context.
2.3 The challenge with computing Shapley values
The exact computation of the Shapley value takes into account the interaction of feature with all subsets that contain , thereby leading to computational difficulties. Various approximation methods have been developed with the goal of reducing complexity. For example, Štrumbelj and Kononenko [22] proposed to estimate the Shapley values via a Monte Carlo approximation built on an alternative permutation-based definition of the Shapley value. Lundberg and Lee [14] proposed to evaluate the model over randomly sampled subsets and use a weighted linear regression to approximate the Shapley values based on the collected model evaluations.
In practice, such sampling-based approximations may suffer from high variance when the number of samples to be collected per instance is limited. For large-scale predictive models, the number of features is often relatively large, meaning that the number of samples required to obtain stable estimates can be prohibitively large. The main contribution of this paper is to address this challenge in a model-based paradigm, where the contribution of features to the response variable respects the structure of an underlying graph. In this setting, we propose efficient algorithms and provide bounds on the quality of the resulting approximation. As we discuss in more detail later, our approach should be viewed as complementary to sampling-based or regression-based approximations of the Shapley value. In particular, these methods can be combined with the approach of this paper so as to speed up the computation of the L-Shapley and C-Shapley values that we propose.
3 Methods
In many applications, the features can be associated with the nodes of a graph, and we can define distances between pairs of features based on the graph structure. More concretely, for sequence data (such as language or music), each feature vector can be associated with a line graph, whereas for image data, each is naturally associated with a grid graph. In this section, we propose modified forms of the Shapley value, referred to as the L-Shapley and C-Shapley values, that can be computed more efficiently than the Shapley value. We also show that, under certain probabilistic assumptions on the marginal distribution over the features, these quantities yield good approximations to the original Shapley values.
More precisely, given feature vectors , we let denote a connected graph with nodes and edges , where each feature is associated with a node , and edges represent interactions between features. The graph induces a distance function on , given by
d_G(i, j) := the number of edges in a shortest path of G joining i and j.   (3)
In the line graph, this graph distance corresponds to the number of edges in the unique path joining two nodes, whereas in the grid graph it corresponds to the Manhattan distance. For a given node , its neighborhood is the set
N_k(i) := { j : d_G(i, j) ≤ k }   (4)
of all nodes at graph distance at most . See Figure 1 for an illustration for the twodimensional grid graph.
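On a grid graph, this k-neighborhood is simply a Manhattan-distance ball, so it can be enumerated directly; the helper name `grid_neighborhood` is ours:

```python
def grid_neighborhood(node, shape, k):
    """All nodes within graph distance k of `node` on an m x n grid
    graph, where the graph distance equals the Manhattan distance."""
    r0, c0 = node
    m, n = shape
    return {(r, c)
            for r in range(m) for c in range(n)
            if abs(r - r0) + abs(c - c0) <= k}
```

For an interior pixel, the order-1 neighborhood contains the pixel itself plus its four axis-aligned neighbors.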
We propose two algorithms for settings in which features that are far apart on the graph, or not directly connected, have an accordingly weaker interaction.
3.1 Local Shapley
In order to motivate our first graph-structured Shapley score, let us take a deeper look at the example sentence from Section 2.2. To compute the importance score of “not,” the most important words to include are “heartwarming” and “entertaining.” Intuitively, words distant from a given word have a weaker influence on its importance in the document, and therefore have relatively less effect on its Shapley score. Accordingly, as one approximation, we propose the L-Shapley score, which only perturbs the neighboring features of a given feature when evaluating its importance:
Definition 1.
Given a model , a sample , and a feature , the L-Shapley estimate of order on a graph is given by
Φ̂_i(x) = (1/|N_k(i)|) Σ_{T ⊆ N_k(i), T ∋ i} C(|N_k(i)|-1, |T|-1)^{-1} m(T, i).   (5)
The coefficients in front of the marginal contributions of feature are chosen to match the coefficients in the definition of the Shapley value restricted to the neighborhood . We show in Section 4 that this choice controls the error under certain probabilistic assumptions. In practice, the choice of the integer is dictated by computational considerations. By the definition of neighborhoods, evaluating all L-Shapley scores on a line graph requires model evaluations. (In particular, computing each feature takes model evaluations, half of which overlap with those of the preceding feature.) A similar calculation shows that computing all L-Shapley scores on a grid graph requires function evaluations.
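Definition 1 amounts to running the Shapley averaging formula inside the neighborhood only. The sketch below illustrates this for a line graph; the function name is ours, and `v` is again an arbitrary set function standing in for the importance score:

```python
from itertools import combinations
from math import comb

def l_shapley_chain(v, n, k, i):
    """L-Shapley estimate of order k for feature i on a line graph
    with nodes 0..n-1: the Shapley averaging formula applied only to
    subsets of the neighborhood N_k(i) that contain i (a sketch of
    Definition 1)."""
    nbhd = range(max(0, i - k), min(n, i + k + 1))  # N_k(i) on a chain
    nbrs = [j for j in nbhd if j != i]
    size = len(nbrs) + 1  # |N_k(i)|
    total = 0.0
    for m in range(1, size + 1):  # |T| = m, with i in T
        inner = 0.0
        for rest in combinations(nbrs, m - 1):
            t = set(rest) | {i}
            inner += v(t) - v(t - {i})
        total += inner / comb(size - 1, m - 1)
    return total / size
```

Because the neighborhood has at most 2k+1 nodes, each feature touches at most 2^(2k) subsets regardless of n, which is the source of the linear overall complexity. When k is large enough that the neighborhood covers all n features, the estimate coincides with the exact Shapley value.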
3.2 Connected Shapley
We also propose a second algorithm, C-Shapley, that further reduces the complexity of approximating the Shapley value. Coming back to the example sentence from Section 2.2, where we evaluate the importance of “not”: both the L-Shapley estimate of order larger than two and the exact Shapley value would evaluate the model on the word subset “It not heartwarming,” which rarely appears in real data and may not make sense either to a human or to a model trained on real-world data. The marginal contribution of “not” relative to “It not heartwarming” may be well approximated by the marginal contribution of “not” to “not heartwarming.” This motivates us to propose C-Shapley:
Definition 2.
Given a model , a sample , and a feature , the C-Shapley estimate of order on a graph is given by
(6) 
where denotes the set of all subsets of that contain node and are connected in the graph .
The coefficients in front of the marginal contributions result from using the Myerson value to characterize a new coalitional game over the graph , in which the influence of disconnected subsets of features is additive. The error between C-Shapley and the Shapley value can also be controlled under certain statistical assumptions; see Section 4 for details.
For text data, C-Shapley is equivalent to evaluating only n-grams in a neighborhood of the word to be explained. By the definition of neighborhoods, evaluating the C-Shapley scores for all features takes model evaluations on a line graph, as each feature takes model evaluations.
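On a line graph, the connected subsets containing word i within its k-neighborhood are exactly the n-grams covering position i, so the collection that C-Shapley sums over can be enumerated directly; the helper name below is ours:

```python
def connected_subsets_line(i, n, k):
    """All connected subsets (n-grams) of a line graph with nodes
    0..n-1 that contain node i and lie within its k-neighborhood:
    the collection C-Shapley sums over on text data."""
    lo, hi = max(0, i - k), min(n - 1, i + k)
    out = []
    for a in range(lo, i + 1):        # left endpoint of the n-gram
        for b in range(i, hi + 1):    # right endpoint of the n-gram
            out.append(tuple(range(a, b + 1)))
    return out
```

There are at most (k+1)^2 such n-grams per word, which is why the per-feature evaluation cost is quadratic in k rather than exponential.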
4 Properties
In this section, we study some basic properties of the L-Shapley and C-Shapley values. In particular, under certain probabilistic assumptions on the features, we show that they provide good approximations to the original Shapley values. We also show their relationship to another concept from cooperative game theory, namely that of Myerson values, when the model satisfies certain local additivity assumptions.
4.1 Approximation of Shapley value
In order to characterize the relationship between L-Shapley and the Shapley value, we introduce the absolute mutual information as a measure of dependence. Given two random variables and , the absolute mutual information between and is defined as
I_abs(X; Y) := E | log ( p(X, Y) / ( p(X) p(Y) ) ) |,   (7)
where the expectation is taken jointly over . Based on the definition of independence, we have if and only if . Recall that the mutual information [4] is defined as . The new measure is more stringent than the mutual information, in the sense that . The absolute conditional mutual information can be defined analogously: given three random variables , and , we define the absolute conditional mutual information to be , where the expectation is taken jointly over . Recall that is zero if and only if .
Theorem 1 and Theorem 2 show that the L-Shapley and C-Shapley values, respectively, are related to the Shapley value whenever the model obeys a Markovian structure encoded by the graph. We leave their proofs to Appendix B.
Theorem 1.
Suppose there exists a feature subset with , such that
(8) 
where we identify with for notational convenience. Then the expected error between the L-Shapley estimate and the true Shapley-value-based importance score is bounded by :
(9) 
In particular, we have almost surely if we have and for any .
Theorem 2.
Suppose there exists a neighborhood of , with , such that condition (8) is satisfied. Moreover, suppose that for any connected subset with , we have
(10) 
where . Then the expected error between the C-Shapley estimate and the true Shapley-value-based importance score is bounded by :
(11) 
In particular, we have almost surely if we have and for any .
4.2 Relating the CShapley value to the Myerson value
Let us now discuss how the C-Shapley value can be related to the Myerson value, which was introduced by Myerson [16] as an approach to characterizing a coalitional game over a graph . Given a subset of nodes in the graph , let denote the set of connected components of —i.e., the subsets of that are connected via edges of the graph. Thus, if is a connected subset of , then consists only of ; otherwise, it contains a collection of subsets whose disjoint union is equal to .
Consider a score function that satisfies the following decomposability condition: for any subset of nodes , the score is equal to the sum of the scores over all the connected components of —viz.
(12) 
For any such score function, we can define the associated Shapley value, which is known as the Myerson value on with respect to . Myerson [16] showed that the Myerson value is the unique quantity satisfying both the decomposability property and the properties of additivity, equal contributions, and monotonicity given in Section 2.2.
In our setting, if we use a plug-in estimate for the conditional probability, the decomposability condition (12) is equivalent to assuming that the influence of disconnected subsets of features is additive at sample , and C-Shapley of order is exactly the Myerson value over . In fact, if we partition each subset into connected components, as in the definition of the Myerson value, and sum up the coefficients (using Lemma 1 in Appendix B), then the Myerson value is equivalent to equation (6).
4.3 Connections with related work
Let us now discuss connections with related work in more depth, and in particular how methods for approximating the Shapley value can be used to speed up the evaluation of approximate L-Shapley and C-Shapley values.
4.3.1 Samplingbased methods
There is an alternative definition of the Shapley value based on taking averages over permutations of the features. In particular, the contribution of a feature corresponds to the average of the marginal contribution of to its preceding features, over the set of all permutations of the features. Based on this definition, Štrumbelj and Kononenko [22] proposed a Monte Carlo approximation based on randomly sampling permutations.
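A minimal sketch of this permutation-based Monte Carlo estimator follows; the interface is ours, and `v` is again an arbitrary subset score:

```python
import random

def sample_shapley(v, n, i, num_samples, seed=0):
    """Monte Carlo estimate of the Shapley value of feature i, using
    the permutation view: phi_i is the average marginal contribution
    of i to the set of features preceding it in a uniformly random
    ordering of {0, ..., n-1}."""
    rng = random.Random(seed)
    feats = list(range(n))
    total = 0.0
    for _ in range(num_samples):
        rng.shuffle(feats)
        before = set(feats[:feats.index(i)])  # features preceding i
        total += v(before | {i}) - v(before)
    return total / num_samples
```

The estimate is unbiased, but its variance depends on how strongly the marginal contribution of i varies across orderings, which is the source of the instability discussed in Section 2.3.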
While L-Shapley is deterministic in nature, it is possible to combine it with this and other sampling-based methods. For example, if one hopes to consider the interaction of a feature with features in a large neighborhood, where the exponential complexity in the neighborhood size becomes a barrier, sampling based on random permutations of local features may be used to alleviate the computational burden.
4.3.2 Regressionbased methods
Lundberg and Lee [14] proposed to sample feature subsets based on a weighted kernel, and to carry out a weighted linear regression to estimate the Shapley value. Suppose the model is evaluated on feature subsets at . In the weighted least squares, each row of the data matrix is a -dimensional vector, with an entry equal to one if the corresponding feature is selected, and zero otherwise. The response is the evaluation of the model over the feature subsets. The weight matrix is diagonal, with .
Lundberg and Lee [14] provide strong empirical results using this regression-based approximation, referred to as KernelSHAP; in particular, see Section 5.1 and Figure 3 of their paper. We can combine such a regression-based approximation with our modified Shapley values to further reduce the evaluation complexity of the C-Shapley values. In particular, for a chain graph, we evaluate the score function over all connected subsequences of length ; similarly, on a grid graph, we evaluate it over all connected squares of size . Doing so yields a data matrix and a response vector , where if the th feature is included in the th sample, and is the score function evaluated on the corresponding feature subset. We use the solution to this weighted least-squares problem as a regression-based estimate of C-Shapley—that is, .
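The regression step itself reduces to a standard weighted least-squares solve. The sketch below shows only the generic solve b = (Z'WZ)^{-1} Z'Wy via row reweighting; the specific Shapley kernel weights of KernelSHAP are not reproduced here, and the helper name is ours:

```python
import numpy as np

def weighted_ls(Z, y, w):
    """Solve min_b sum_j w_j * (y_j - z_j . b)^2 for the coefficient
    vector b. Rows z_j are 0/1 indicators of which features were
    selected in subset j, y_j is the model score on that subset, and
    w_j is the kernel weight of the subset."""
    Z = np.asarray(Z, dtype=float)
    y = np.asarray(y, dtype=float)
    sw = np.sqrt(np.asarray(w, dtype=float))
    # Reweight rows by sqrt(w), then solve ordinary least squares.
    b, *_ = np.linalg.lstsq(Z * sw[:, None], y * sw, rcond=None)
    return b
```

With uniform weights and an exactly consistent system, the solver recovers the additive coefficients exactly.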
5 Experiments
We evaluate the performance of L-Shapley and C-Shapley on real-world data sets involving text and image classification. Code for reproducing the key results is available at https://github.com/JianboLab/LCShapley. We compare L-Shapley and C-Shapley with several competitive algorithms for instance-wise feature importance scoring on black-box models, including the regression-based approximation known as KernelSHAP [14], SampleShapley [22], and the LIME method [17]. As discussed previously, KernelSHAP forms a weighted regression approximation of the Shapley values, whereas SampleShapley estimates the Shapley value by random permutation of features. The LIME method uses a linear model to locally approximate the original model through weighted least squares. For all methods, the number of model evaluations is the same, and linear in the number of features. We also choose the objective to be the log probability of the predicted class, and use the plug-in estimate of the conditional probability across all methods (see Section 2.1).
For image data, we also compare with the Saliency map [20] as another baseline. The Saliency method is used for interpreting neural networks in computer vision; it assumes knowledge of the gradient of the model with respect to the input, and uses the gradient magnitude as the importance score for each pixel.
5.1 Text Classification
Text classification is a classical problem in natural language processing, in which text documents are assigned to predefined categories. We study the performance of L-Shapley and C-Shapley on three popular neural models for text classification: word-based CNNs [8], character-based CNNs [25], and long short-term memory (LSTM) recurrent neural networks [7], with the following three data sets on different scales. See Table 1 for a summary, and Appendix A for all of the details.

IMDB Review with WordCNN: The Internet Movie Review Dataset (IMDB) is a dataset of movie reviews for sentiment classification [15], which contains binary labeled movie reviews, with a split of for training and for testing. A simple word-based CNN model composed of an embedding layer, a convolutional layer, a max-pooling layer, and a dense layer is used, achieving an accuracy of on the test data set.

AG news with CharCNN: The AG news corpus is composed of titles and descriptions of news articles from news sources [25]. It is segmented into four classes, each containing training samples and testing samples. Our character-based CNN has the same structure as the one proposed in Zhang et al. [25]. The model achieves an accuracy of on the test data set.

Yahoo! Answers with LSTM: The corpus of the Yahoo! Answers Topic Classification Dataset is divided into ten categories, each class containing training samples and testing samples. Each input text includes the question title, content, and best answer. We train a bidirectional LSTM that achieves an accuracy of on the test data set, close to the state-of-the-art accuracy of obtained by character-based CNNs [25].
We choose zero paddings as the reference point for all methods, and make model evaluations, where is the number of words in each input. Given the average length of each input (see Table 1), this choice keeps the number of model evaluations under , taking less than one second in TensorFlow on a Tesla K80 GPU for all three models. For L-Shapley, the budget allows us to consider the interaction of each word with its two neighboring words. For C-Shapley, the budget allows the regression-based version to evaluate all n-grams with .
The change in log-odds scores before and after masking the top features ranked by importance scores is used as the evaluation metric, where masked words are replaced by zero paddings. This metric has been used in previous work on model interpretation [19, 14]. We study how the average log-odds score of the predicted class decreases as the percentage of masked features over the total number of features increases, on samples from the test set. Results are plotted in Figure 2.
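The masking metric can be sketched as follows; `predict_proba`, the reference-based masking, and the helper name are illustrative assumptions rather than the authors' evaluation code:

```python
import numpy as np

def log_odds_drop(predict_proba, x, reference, ranked, frac):
    """Change in log-probability of the originally predicted class
    after masking the top `frac` fraction of features, in the order
    given by `ranked` (most important first). Masked entries are
    replaced by the reference value, e.g. zero padding for text.
    `predict_proba` maps an input vector to class probabilities."""
    x = np.asarray(x)
    p0 = np.asarray(predict_proba(x), dtype=float)
    c = int(np.argmax(p0))  # originally predicted class
    m = int(np.ceil(frac * len(ranked)))
    masked = x.copy()
    idx = list(ranked[:m])
    masked[idx] = np.asarray(reference)[idx]
    p1 = np.asarray(predict_proba(masked), dtype=float)
    return float(np.log(p0[c]) - np.log(p1[c]))
```

A larger drop after masking the same fraction of features indicates that the importance scores better identify the features the model actually relies on.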
Data Set  Classes  Train Samples  Test Samples  Average #w  Model  Parameters  Accuracy 

IMDB Review [15]  2  25,000  25,000  325.6  WordCNN  351,002  90.1% 
AG’s News [25]  4  120,000  7,600  43.3  CharCNN  11,337,988  90.09% 
Yahoo! Answers [25]  10  1,400,000  60,000  108.4  LSTM  7,146,166  70.84% 
(Table: the example sentence “It is not heartwarming or entertaining. It just sucks.” as explained by Shapley, C-Shapley, L-Shapley, KernelSHAP, and SampleShapley, with each method's top-ranked words highlighted in the original.)
On IMDB with WordCNN, the simplest of the three models, L-Shapley achieves the best performance, while LIME, KernelSHAP, and C-Shapley achieve slightly worse performance. On AG's news with CharCNN, L-Shapley and C-Shapley both outperform the other algorithms. On Yahoo! Answers with LSTM, C-Shapley outperforms the rest of the algorithms by a large margin, followed by LIME. L-Shapley with order , SampleShapley, and KernelSHAP do not perform well for the LSTM model, probably because some of the signals captured by the LSTM are relatively long n-grams.
5.2 Image Classification
We carry out experiments in image classification on the MNIST and CIFAR10 data sets:

MNIST: The MNIST data set contains images of handwritten digits in ten categories [12]. A subset of the MNIST data set composed of digits and is used for better visualization, with images for training and images for testing. A simple CNN model achieves accuracy on the test data set.

CIFAR10: The CIFAR10 data set [10] contains images in ten classes. A subset of the CIFAR10 data set composed of deer and horses is used for better visualization, with images for training and images for testing. A convolutional neural network modified from AlexNet [11] achieves accuracy on the test data set.
We take each pixel as a single feature for both MNIST and CIFAR10. We choose the average pixel strength as the reference point for all methods, and make model evaluations, where is the number of pixels in each input image; this keeps the number of model evaluations under .
LIME and L-Shapley are not used for comparison, because LIME takes “superpixels” produced by segmentation algorithms, rather than raw pixels, as single features, and L-Shapley requires nearly sixteen thousand model evaluations when applied to raw pixels. (L-Shapley becomes practical if we take small patches of images, instead of pixels, as single features.) For C-Shapley, the budget allows the regression-based version to evaluate all image patches with .
Figure 3 shows the decrease in log-odds scores before and after masking the top pixels ranked by importance scores, as the percentage of masked pixels over the total number of pixels increases, on test samples from the MNIST and CIFAR10 data sets. C-Shapley consistently outperforms the other methods on both data sets.
Figure 4 and Figure 5 provide additional visualization of the results. By masking the top pixels ranked by the various methods, we find that the pixels picked by C-Shapley concentrate around and inside the digits in MNIST. The C-Shapley and Saliency methods yield the most interpretable results on CIFAR10. In particular, C-Shapley tends to mask the parts of the head and body that distinguish deer from horses, as well as the human riding the horse. Figure 4 shows two digits misclassified by the CNN model. Interestingly, the top pixels chosen by C-Shapley visualize the “reasoning” of the model: more specifically, the important pixels to the model are exactly those which could form a digit from the opposite class.
6 Discussion
We have proposed L-Shapley and C-Shapley for instance-wise feature importance scoring, making use of a graphical representation of the data. We have shown the superior performance of the proposed algorithms compared to other methods for instance-wise feature importance scoring in text and image classification.
Acknowledgments
We would like to acknowledge support from the DARPA Program on Lifelong Learning Machines from the Army Research Office under grant number W911NF1710304, and from National Science Foundation grant NSFDMS1612948.
References
 Bach et al. [2015] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS One, 10(7):e0130140, 2015.
 Baehrens et al. [2010] David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert Müller. How to explain individual classification decisions. Journal of Machine Learning Research, 11:1803–1831, 2010.
 Chen et al. [2018] Jianbo Chen, Le Song, Martin J Wainwright, and Michael I Jordan. Learning to explain: An information-theoretic perspective on model interpretation. arXiv preprint arXiv:1802.07814, 2018.
 Cover and Thomas [2012] Thomas M Cover and Joy A Thomas. Elements of Information Theory. John Wiley & Sons, 2012.
 Datta et al. [2016] Anupam Datta, Shayak Sen, and Yair Zick. Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems. In Security and Privacy (SP), 2016 IEEE Symposium on, pages 598–617. IEEE, 2016.
 [6] Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Neural networks for machine learning, lecture 6a: Overview of mini-batch gradient descent.
 Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 Kim [2014] Yoon Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, 2014.
 Kingma and Ba [2015] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.
 Krizhevsky [2009] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
 Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
 LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Lipton [2016] Zachary C Lipton. The mythos of model interpretability. arXiv preprint arXiv:1606.03490, 2016.
 Lundberg and Lee [2017] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4765–4774. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf.
 Maas et al. [2011] Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pages 142–150. Association for Computational Linguistics, 2011.
 Myerson [1977] Roger B Myerson. Graphs and cooperation in games. Mathematics of Operations Research, 2(3):225–229, 1977.
 Ribeiro et al. [2016] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM, 2016.
 Shapley [1953] Lloyd S Shapley. A value for n-person games. Contributions to the Theory of Games, 2(28):307–317, 1953.
 Shrikumar et al. [2017] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In ICML, volume 70 of Proceedings of Machine Learning Research, pages 3145–3153. PMLR, 06–11 Aug 2017.
 Simonyan et al. [2014] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In Proceedings of the International Conference on Learning Representations (ICLR), 2014.
 Srivastava et al. [2014] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
 Štrumbelj and Kononenko [2010] Erik Štrumbelj and Igor Kononenko. An efficient explanation of individual classifications using game theory. Journal of Machine Learning Research, 11:1–18, 2010.
 Sundararajan et al. [2017] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In International Conference on Machine Learning, pages 3319–3328, 2017.
 Young [1985] H Peyton Young. Monotonic solutions of cooperative games. International Journal of Game Theory, 14(2):65–72, 1985.
 Zhang et al. [2015] Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657, 2015.
Appendix A Model structure
IMDB Review with Word-CNN
The word-based CNN model is composed of a -dimensional word embedding, a 1-D convolutional layer with 250 filters of kernel size three, a max-pooling layer, and a -dimensional dense layer as hidden layers. Both the convolutional and the dense layers are followed by a ReLU nonlinearity and Dropout [21] for regularization. The model is trained with RMSprop [6] and achieves an accuracy of on the test data set.
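As a concrete illustration of the convolution-and-pooling stage of this word-CNN, below is a minimal NumPy sketch of a 1-D convolution over word embeddings followed by ReLU and max-pooling over time. The shapes and function names are ours for illustration, not taken from the model code.

```python
import numpy as np

def conv1d_maxpool(embeddings, filters, bias):
    """1-D convolution over a sequence of word embeddings, then ReLU,
    then max-pooling over time (one feature per filter).

    embeddings: (seq_len, emb_dim)
    filters:    (n_filters, kernel, emb_dim)
    bias:       (n_filters,)
    """
    seq_len, emb_dim = embeddings.shape
    n_filters, kernel, _ = filters.shape
    out = np.empty((seq_len - kernel + 1, n_filters))
    for t in range(seq_len - kernel + 1):
        window = embeddings[t:t + kernel]                 # (kernel, emb_dim)
        # Inner product of each filter with the current window of words.
        out[t] = np.tensordot(filters, window, axes=([1, 2], [0, 1])) + bias
    out = np.maximum(out, 0.0)                            # ReLU
    return out.max(axis=0)                                # (n_filters,)
```

Max-pooling over time makes the resulting feature vector independent of sequence length, which is why a single dense layer can follow regardless of review length.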
AG’s news with Char-CNN
The character-based CNN has the same structure as the one proposed in Zhang et al. [25]: six convolutional layers, three max-pooling layers, and two dense layers. The model is trained with SGD with momentum 0.9 and a decreasing step size initialized at . (Details can be found in Zhang et al. [25].) The model reaches an accuracy of on the test data set.
Yahoo! Answers with LSTM
The network consists of a -dimensional randomly-initialized word embedding, a bidirectional LSTM with each LSTM unit of dimension , and a dropout layer as hidden layers. The model is trained with RMSprop [6]. It reaches an accuracy of on the test data set, close to the state-of-the-art accuracy of obtained by a character-based CNN [25].
MNIST
A simple CNN model is trained on the data set, achieving an accuracy of on the test data. It is composed of two convolutional layers of kernel size , followed by a final dense linear layer. The two convolutional layers contain 8 and 16 filters respectively, and each is followed by a max-pooling layer of pool size two.
CIFAR-10
A convolutional neural network modified from AlexNet [11] is trained on the subset. It is composed of six convolutional layers of kernel size , followed by two dense linear layers of dimensions 512 and 256. The six convolutional layers contain 48, 48, 96, 96, 192, 192 filters respectively, and every two convolutional layers are followed by a max-pooling layer of pool size two and a dropout layer. The CNN model is trained with the Adam optimizer [9] and achieves an accuracy of on the test data set.
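The spatial dimensions implied by this stack can be traced with a small helper. We assume "same" padding for the convolutions (not stated above), so only the three max-pooling stages shrink the 32×32 CIFAR-10 input; the helper and its names are ours for illustration.

```python
def feature_map_sizes(input_size, n_conv=6, pool_every=2, pool_size=2):
    """Spatial size seen by each conv layer, assuming 'same'-padded
    convolutions and a max-pool after every `pool_every` conv layers."""
    sizes = []
    size = input_size
    for layer in range(1, n_conv + 1):
        sizes.append(size)            # conv with 'same' padding keeps size
        if layer % pool_every == 0:
            size //= pool_size        # max-pooling halves each dimension
    return sizes, size
```

For a 32×32 input this gives 32, 32, 16, 16, 8, 8 for the six conv layers and a 4×4 map entering the dense layers, which explains why exactly three pooling stages are interleaved.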
Appendix B Proof of Theorems
In this appendix, we collect the proofs of Theorems 1 and 2.
B.1 Proof of Theorem 1
We state an elementary combinatorial equality required for the proof of the main theorem:
Lemma 1 (A combinatorial equality).
For any positive integer , and any pair of nonnegative integers with , we have
(13) 
Proof.
By the binomial theorem for negative integer exponents, we have
The identity can be found by examination of the coefficient of in the expansion of
(14) 
In fact, equating the coefficients of in the left- and right-hand sides, we get
(15) 
Moving to the right-hand side and expanding the binomial coefficients, we have
(16) 
which implies
∎
Equipped with this lemma, we now prove the theorem. We split our analysis into two cases, namely versus . For notational convenience, we extend the definition of the L-Shapley estimate for feature to an arbitrary feature subset containing . In particular, we define
(17) 
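To make the construction concrete, the following sketch computes the exact Shapley value and a neighborhood-restricted variant in the spirit of L-Shapley on a line graph. The weighting used here is the standard Shapley weighting applied within the k-neighborhood; this is our reading of the definition, not a verbatim transcription of equation (17).

```python
from itertools import combinations
from math import comb

def shapley(value, n, i):
    """Exact Shapley value of feature i for a set function `value` over range(n)."""
    others = [j for j in range(n) if j != i]
    total = 0.0
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            s = set(subset)
            # Standard Shapley weight for a coalition of this size.
            total += (value(s | {i}) - value(s)) / (n * comb(n - 1, size))
    return total

def l_shapley(value, n, i, k):
    """Shapley value of the game restricted to the k-neighborhood of i
    on a line graph: only subsets within the neighborhood are evaluated,
    so the cost is exponential in the neighborhood size, not in n."""
    neigh = list(range(max(0, i - k), min(n, i + k + 1)))
    m = len(neigh)
    others = [j for j in neigh if j != i]
    total = 0.0
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            s = set(subset)
            total += (value(s | {i}) - value(s)) / (m * comb(m - 1, size))
    return total
```

When the marginal contribution of feature i depends only on the features in its neighborhood, as in the case treated by Theorem 1, the restricted estimate coincides with the exact Shapley value.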
Case 1:
First, suppose that . For any subset , we introduce the shorthand notation and , and note that . Recalling the definition of the Shapley value, let us partition all the subsets based on , in particular writing
Based on this partitioning, the expected error between and can be written as
(18) 
Partitioning the set by the size of , we observe that
where we have applied Lemma 1 with , , and . Substituting this equivalence into equation (18), we find that the expected error can be upper bounded by
(19) 
where we recall that .
Now omitting the dependence of on for notational simplicity, we write the difference as
Substituting this equivalence into our earlier bound (19) and taking an expectation over on both sides, we find that the expected error is upper bounded as
Recalling the definition of the absolute mutual information, we see that
which completes the proof of the claimed bound.
Finally, in the special case that and for any , this inequality holds with , which implies . Therefore, we have almost surely, as claimed.
Case 2:
We now consider the general case in which . Using the previous arguments, we can show
Applying the triangle inequality yields , which establishes the claim.
B.2 Proof of Theorem 2
As in the previous proof, we divide our analysis into two cases.
Case 1:
First, suppose that . For any subset with , we can partition into two components and , such that is a connected subsequence and is disconnected from . We also define
(20) 
We partition all the subsets based on in the definition of the Shapley value:
The expected error between and is
(21) 
Partitioning by the size of , we observe that
where we apply Lemma 1 with , and . From equation (21), the expected error can be upper bounded by
where . We omit the dependence of and on the pair for notational simplicity, and observe that the difference between and is
Taking an expectation over on both sides, we can upper bound the expected error by