Marginal Contribution Feature Importance - an Axiomatic Approach for The Natural Case


Abstract

When training a predictive model over medical data, the goal is sometimes to gain insights about a certain disease. In such cases, it is common to use feature importance as a tool to highlight significant factors contributing to that disease. As there are many existing methods for computing feature importance scores, understanding their relative merits is not trivial. Further, the diversity of scenarios in which they are used leads to different expectations from the feature importance scores. While it is common to make the distinction between local scores that focus on individual predictions and global scores that look at the contribution of a feature to the model, another important division distinguishes model scenarios, in which the goal is to understand the predictions of a given model, from natural scenarios, in which the goal is to understand a phenomenon such as a disease. We develop a set of axioms that represent the properties expected from a feature importance function in the natural scenario and prove that there exists only one function that satisfies all of them, the Marginal Contribution Feature Importance (MCI). We analyze this function for its theoretical and empirical properties and compare it to other feature importance scores. While our focus is the natural scenario, we suggest that our axiomatic approach could be carried out in other scenarios too.

1 Introduction

In recent years, increasing amounts of data coupled with the rise of highly complex models such as artificial neural networks have led to major advances in our ability to model highly complex relations krizhevsky2012imagenet (). These increasingly complex models tend to be harder to interpret lipton2018mythos (). This has led to extensive work on interpretability molnar2020interpretable (), explainability holzinger2019causability (), and more specifically on algorithms that assign feature importance lundberg2017unified (); plumb2018model (); ribeiro2016should (); shrikumar2017learning (); sundararajan2017axiomatic (). Most of the previous work on feature importance has focused on assigning importance scores to predictions of a specific trained model. Methods for assigning feature importance scores are commonly divided into local and global, where the goal of local scores is to explain how much each feature affects a specific prediction, while the goal of global scores is to explain how much each feature affects the model's predictions across the entire data distribution. However, there are scenarios in which assigning feature importance to explain a phenomenon in the real world is also very useful. For example, consider the case of modeling the relation between gene expression and a certain disease. A scientist may be interested in gene importance as a tool for highlighting the genes most related to that disease in order to prioritize experiments to be done in the lab.

We identify another important dimension that defines different goals for feature importance scores. This dimension differentiates between model feature importance scores, which aim to explain a given model, and natural feature importance scores, which aim to explain a phenomenon in the real world, such as factors that contribute to a heightened risk of a disease. This raises the question of whether feature importance scores that were designed to explain a model would work well in natural scenarios. In this work we argue that the scenarios are different, and we show examples where methods that were designed to explain a model fail to explain a natural phenomenon. Chen et al. chen2020true () recently demonstrated the difference between explaining a model and explaining the data for a specific use case in the context of the Shapley value. In this work we address this difference more broadly.

Most previous work on feature importance focused on explaining models baehrens2009explain (); breiman1984classification (); covert2020understanding (); lundberg2017unified (); plumb2018model (); ribeiro2016should (); shrikumar2017learning (); sundararajan2017axiomatic (). Therefore, in this work we focus on the natural setting, and more specifically on the global-natural scenario. To come up with a good feature importance function for this setting, we identify a few axioms that any feature importance score in the global-natural case should satisfy in order to have the properties expected of it. We prove that there is only one function that satisfies these axioms, the Marginal Contribution Feature Importance (MCI). We compare this score to other feature importance scores, both from a theoretical standpoint and from an empirical one.

The contributions of this paper are the following:

  1. We define a new important dimension for feature importance scores that differentiates the model scenario, in which the goal is to explain model predictions, from the natural scenario, in which the goal is to explain a phenomenon in the real world.

  2. We focus on the global-natural scenario and present three simple properties (axioms) that every feature importance score in this scenario should have.

  3. We prove that these axioms uniquely identify the MCI score.

  4. We analyze the theoretical properties of MCI and show empirically that it is more accurate and robust than other available solutions.

2 Feature Importance Scenarios

It has been noted that the term “feature importance” may have different meanings in different scenarios ribeiro2016should (). In the introduction we already discussed the common distinction between the local and global scenarios in which feature importance is used, and suggested another distinction between model scenarios and natural scenarios. To clarify this two-dimensional distinction, in the following we list potential use cases for each scenario.

Global-natural (the scientist scenario):

When studying the association between genomic loci or gene expression and a disease, it is common for a scientist to train models that predict the disease as a tool to study it danaee2017deep (); mckinney2006machine (). Note that in this case the models are acting as a surrogate for the natural phenomenon being studied. Here, feature importance is used to rank loci or genes that may have a significant influence on the disease. Note that this use case is not limited to genomics. Similar scenarios have been described in medical studies fontana2013estimation (), in the chemical sciences mccloskey2019using (), and in other domains cui2010music ().

Global-model (the engineer scenario):

Another common use case for feature importance is in understanding a specific model. For example, for security purposes it is important to understand the sensitivity of a model to changes in specific features guo2018lemna (). In other cases, feature importance is used for explaining a model adadi2018peeking () or identifying bugs tan2018learning (). Note that the main users of these importance scores are the engineers who develop the model and would like to verify its correctness.

Local-model (the user scenario):

In this scenario, a user of a predictive model is interested in an explanation of a specific prediction kulesza2015principles (). For example, it may be a customer trying to understand why an application for credit was denied by a credit provider and what the customer can do to change this decision. Here, feature importance serves in understanding how the model used by the credit provider sees the customer.

Local-natural (the patient scenario):

Consider analyzing the CT scan of a patient for identifying cancer tumors. Important features here are places in the image that are cancerous. In this case, the goal in highlighting important features is to identify the location of the tumor in the body, and the model is only a tool for doing that.

To summarize, in model scenarios the goal is to understand a specific model whereas in the natural scenarios the model is just a tool for learning a natural structure. Further, in the local scenarios the goal is to understand the effect of each feature on a specific prediction or case, whereas in the global scenario the goal is to understand the global effect of each feature.

These differences are also reflected in the expectations from feature importance scores. While the difference between the local and global settings has been discussed before ribeiro2016should (); tan2018learning (); tonekaboni2020went (), we would like to explain here some differences between the model scenario and the natural scenario scores. Imagine a scientist investigating the relation between gene expression and a disease, where some of the genes are highly correlated in their expression. From the scientist's point of view, correlated genes encapsulate the same information about the disease, and therefore they are equally important. On the other hand, from an engineer's point of view, a model built for making predictions using the same data could use only a portion of those genes and still make accurate predictions. In this case, an engineer trying to understand the model would like to consider only the specific genes the model uses.

From the previous discussion, it follows that feature importance scores that may work well in the model setting will not necessarily be adequate for the natural setting. In the rest of the work, we focus our attention on the global-natural scenario that has received limited attention thus far.

3 Definitions and Notations

In this section, we introduce important concepts that were defined in previous works covert2020understanding () and are relevant to the global-natural scenario. These concepts will serve us in defining our method. We start our discussion with the Shapley value, a fundamental concept in game theory that was recently adapted to the realm of feature importance lundberg2017unified (); covert2020understanding (). The Shapley value was originally designed for problems of cost allocation in which participants cooperate to achieve a certain good shapley1953value (). It uses an evaluation function $\nu$ that computes, for each set of players, the value generated if these participants were to cooperate, where the evaluation of the empty set is assumed to be zero.

Shapley presented four axioms that a fair allocation of costs should satisfy and showed that there is only one fair cost allocation function shapley1953value ():

$$\phi_i(\nu) = \frac{1}{|\Pi(F)|} \sum_{\pi \in \Pi(F)} \left[\nu\left(S_{\pi,i} \cup \{f_i\}\right) - \nu\left(S_{\pi,i}\right)\right] \qquad (1)$$

where $F$ is the set of all features, $\Pi(F)$ is the set of all permutations of $F$, $S_{\pi,i}$ is the set of all features preceding $f_i$ in permutation $\pi$, and $\phi_i(\nu)$ denotes the resulting importance of feature $f_i$.

By treating features as players cooperating to make accurate predictions, this idea was adopted for feature selection cohen2007feature (), extended to local-model feature importance by the SHAP method lundberg2017unified (), and recently extended to global-model feature importance by the SAGE method covert2020understanding ().
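To make the formula in Eq. (1) concrete, the following is a minimal sketch (not from the paper) that evaluates it exactly by enumerating all feature orderings; it assumes a caller-supplied evaluation function `nu` mapping a frozenset of features to a real value with `nu(frozenset()) == 0`, and it is only feasible for small feature sets.

```python
from itertools import permutations

def shapley_exact(nu, features):
    """Average the marginal contribution of each feature over all feature orderings."""
    features = list(features)
    totals = {f: 0.0 for f in features}
    count = 0
    for order in permutations(features):
        count += 1
        prefix = frozenset()
        prev = nu(prefix)
        for f in order:
            prefix = prefix | {f}
            value = nu(prefix)
            totals[f] += value - prev  # marginal contribution of f given its predecessors
            prev = value
    return {f: totals[f] / count for f in features}

# Example: two perfectly redundant indicator features split the credit between them.
nu = lambda S: min(len(S), 1)
print(shapley_exact(nu, ["x1", "x2"]))  # {'x1': 0.5, 'x2': 0.5}
```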

In the context of global feature importance, the evaluation function is a measure of the predictive power of a given set of features covert2020understanding (). For example, let $\mathcal{H}_S$ be the set of predictors restricted to use only the set of features $S$; then, given a loss function $\ell$, the predictive power can be defined as

$$\nu(S) = -\min_{h \in \mathcal{H}_S} \mathbb{E}\left[\ell\left(h(x_S), y\right)\right],$$

where the expectation is with respect to the distribution of the feature vector $x$ and the label $y$. Other examples of measures of predictive power include the AUC of the best classifier in $\mathcal{H}_S$ or the mutual information between the set of features and the label.

In this work, we refer to a function $\nu: 2^F \to \mathbb{R}$ as an evaluation function if $\nu(\emptyset) = 0$ and if $\nu$ is monotone increasing in the sense that if $S \subseteq T$ then $\nu(S) \leq \nu(T)$. This reflects the intuition that giving more features to the model can only increase the amount of information on the label and thus allows for more accurate models.

While the Shapley values axioms make sense in the realm of allocating costs to beneficiaries, we question their adequacy to the natural scenario of feature importance. To demonstrate the problem, consider a system with the binary features which are Rademacher random variables such that the target variable and the mutual information is the evaluation function. Shapley value assigns the feature importance and to and respectively. However, if is duplicated 3 times then the feature importance becomes and for and . Note that the importance of drops when it is duplicated while the importance of and increases to the point that they become the most important features. This means that if these features were indicators of the presence of a certain protein in a blood sample then their importance scores may change if measurements are taken of other proteins that are triggered by the same mechanism and therefore are highly correlated. As a consequence, if the scientist suspected that a certain mechanism is responsible for a disease and therefore sampled many proteins that are related to this mechanism then Shapley-based feature importance scores will suggest that these proteins are of lesser importance.

Ablation studies are another common approach to assigning importance scores to features bengtson-roth-2008-understanding (); casagrande1974ablation (); hessel2018rainbow (). In this method, the importance of a feature is the reduction in performance due to removing it. Using the notation above, in ablation studies the importance of feature $f_i$ is $\nu(F) - \nu(F \setminus \{f_i\})$.

Bivariate association is the complement of ablation studies. In this method, the importance of a feature is its contribution in isolation, that is, $\nu(\{f_i\})$. These methods are commonly used in Genome-Wide Association Studies (GWAS) haljas2018bivariate (); liu2009powerful (), in feature ranking methods Zien_2009 (), in feature selection methods guyon2003introduction (), and in feature screening methods fan2008sure ().

As mentioned in covert2020understanding (), both the ablation and the bivariate association methods deal imperfectly with specific types of feature interactions. Ablation and Shapley-based methods fail when features are highly correlated, while the bivariate method fails when there are synergies between the features. As an example, consider an XOR problem in which the target variable is the exclusive or of two features: bivariate methods would fail to find the association between these features and the target variable, as illustrated in the sketch below.
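As a small illustration of this failure mode, the following sketch (synthetic data, not from the paper) uses empirical mutual information as the association measure: for $y = x_1 \oplus x_2$, each feature alone carries zero information about the label, while the pair determines it completely.

```python
import itertools
import math
from collections import Counter

def mutual_information(pairs):
    """Empirical mutual information (in bits) between a feature value and a label,
    given a list of (feature_value, label) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )

# Enumerate the full joint distribution of two fair binary features with y = x1 XOR x2.
data = [(x1, x2, x1 ^ x2) for x1, x2 in itertools.product([0, 1], repeat=2)]
print(mutual_information([(x1, y) for x1, _, y in data]))         # 0.0  (x1 alone)
print(mutual_information([((x1, x2), y) for x1, x2, y in data]))  # 1.0  (both together)
```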

So far we have discussed importance scores that are model agnostic. However, it is important to mention that there are also importance scores that are specific to a certain type of model. For example, in linear models it is common practice to look at the weight assigned to each feature as a measure of its importance, while in tree-based models a common practice is to look at the sum of the gains from the decision nodes in which a feature was used breiman1984classification (). MAPLE is a method for explaining ensembles of trees plumb2018model (), while for deep learning there are methods such as Integrated Gradients sundararajan2017axiomatic () and DeepLIFT shrikumar2017learning ().

We now move forward to introducing our method, which aims to overcome the difficulties in existing methods.

4 Marginal Contribution Feature Importance

In previous sections, we discussed the different scenarios in which feature importance can be used and presented the limitations of current methods in the global-natural scenario. To find a proper score for this scenario we begin by introducing a small set of properties expected of a feature importance scoring function in this setting which we refer to as axioms. We show that Marginal Contribution Feature Importance (MCI) is the only function that satisfies these axioms and we study its properties.

4.1 The Axioms

The axioms use the Elimination operation, which is defined as follows:

Definition 1.

Let $F$ be a set of features and $\nu$ be an evaluation function. Eliminating the set $T \subseteq F$ creates a new set of features $F' = F \setminus T$ and a new evaluation function $\nu'$ such that for every $S \subseteq F'$, $\nu'(S) = \nu(S)$.

Using this definition we introduce the set of required properties (axioms):

Definition 2.

A valid feature importance function in the global-natural scenario is a function $I$ that assigns a score to every feature $f_i \in F$ and has the following properties:

  1. Marginal contribution: The importance of a feature is equal to or higher than the increase in the evaluation function when adding it to all the other features, i.e., $I(f_i) \geq \nu(F) - \nu(F \setminus \{f_i\})$.

  2. Elimination: Eliminating features from $F$ can only decrease the importance of each remaining feature, i.e., if $T \subseteq F$ and $\nu'$ is the evaluation function obtained by eliminating $T$ from $F$, then for every $f_i \in F \setminus T$, $I_{\nu'}(f_i) \leq I_{\nu}(f_i)$.

  3. Minimalism: If $I$ is the feature importance function, then for every function $\tilde{I}$ for which axioms 1 and 2 hold, and for every $f_i \in F$: $I(f_i) \leq \tilde{I}(f_i)$.

To understand the rationale behind these axioms, note that the marginal contribution axiom states that if a feature generates an increase in performance even when all other features are present, then its importance is at least as large as this additional gain. In other words, if an ablation study (see Section 3) shows a certain gain, then the feature importance is at least that gain.

The rationale for the elimination axiom is that the importance of a feature may become apparent only when some context is present. For example, if the target variable is the XOR of two features, their significance is apparent only when both of them are observed. Therefore, eliminating features can cause the estimated importance of the remaining features to drop. On the other hand, if a feature is shown to be important, that is, it provides high predictive power given the current set of features, then its predictive power does not decrease when additional information is provided by adding features. Note that it still may be the case that the relative importance of features changes when adding or eliminating features. In other words, we may find, by adding features, that a feature that was considered less important is very significant in combination with these new features. However, if a feature was considered to have high predictive power, then adding features can only demonstrate even higher predictive power using the additional information provided by the new features.

Finally, note that if a function satisfies the marginal contribution and the elimination properties, then so does that function plus any positive constant. The minimalism axiom provides disambiguation by requiring the selection of the smallest such function.

These axioms allow us to present the main theorem which shows the existence and uniqueness of the feature importance function.

Theorem 3.

The function

$$I_\nu(f_i) = \max_{S \subseteq F} \left[\nu\left(S \cup \{f_i\}\right) - \nu(S)\right]$$

is the only function that satisfies the marginal contribution, the elimination, and the minimalism axioms.

Theorem 3 shows that there is only one way to define a feature importance function that satisfies the axioms presented above. We call this function the Marginal Contribution Feature Importance (MCI) function. The proof of this theorem is presented in the supplementary material (Section A).
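For small feature sets, this definition can be evaluated directly. The following is a minimal brute-force sketch (not an official implementation) that assumes the same `nu` interface as before, i.e., a function from a frozenset of features to a real value with `nu(frozenset()) == 0`; it is exponential in the number of features.

```python
from itertools import chain, combinations

def all_subsets(items):
    """All subsets of `items`, including the empty set."""
    items = list(items)
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))

def mci(nu, features):
    """MCI of each feature: its largest marginal contribution over all contexts."""
    scores = {}
    for f in features:
        others = [g for g in features if g != f]
        scores[f] = max(
            nu(frozenset(s) | {f}) - nu(frozenset(s)) for s in all_subsets(others)
        )
    return scores

# With the same two perfectly redundant features as above, both receive the full score,
# in contrast to the Shapley value, which splits it between them.
nu = lambda S: min(len(S), 1)
print(mci(nu, ["x1", "x2"]))  # {'x1': 1, 'x2': 1}
```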

4.2 Properties of the Marginal Contribution Feature Importance Function

MCI has many advantageous properties as shown in the following theorem.

Theorem 4.

Let $F$ be a set of features, let $\nu$ be an evaluation function, and let $I_\nu$ be the MCI feature importance function defined in Theorem 3. The following holds:

  • Dummy: if $f_i$ is a dummy variable, that is, $\nu(S \cup \{f_i\}) = \nu(S)$ for every $S \subseteq F$, then $I_\nu(f_i) = 0$.

  • Symmetry: if $f_i$ and $f_j$ are symmetric, that is, if for every $S \subseteq F \setminus \{f_i, f_j\}$ we have that $\nu(S \cup \{f_i\}) = \nu(S \cup \{f_j\})$, then $I_\nu(f_i) = I_\nu(f_j)$.

  • Super-efficiency: $\sum_{f_i \in F} I_\nu(f_i) \geq \nu(F)$.

  • Sub-additivity: if $\nu_1$ and $\nu_2$ are evaluation functions defined on $F$, then $I_{\nu_1 + \nu_2}(f_i) \leq I_{\nu_1}(f_i) + I_{\nu_2}(f_i)$.

  • Upper bound on the self contribution: for every feature $f_i$, $I_\nu(f_i) \geq \nu(\{f_i\})$.

  • Duplication invariance: let $F' = F \cup \{f'\}$ be a set of features and $\nu'$ be an evaluation function on $F'$. Assume that $f'$ is a duplication of $f_i \in F$ in the sense that for every $S \subseteq F$ we have that $\nu'(S \cup \{f'\}) = \nu'(S \cup \{f_i\})$. If $F$ and $\nu$ are the results of eliminating $\{f'\}$, then for every $f_j \in F$, $I_\nu(f_j) = I_{\nu'}(f_j)$.

Recall that the Shapley value is defined by four axioms: efficiency, symmetry, dummy, and additivity shapley1953value (). Theorem 4 shows that MCI has the symmetry and dummy properties, but the efficiency property is replaced by a super-efficiency property while the additivity property is replaced by a sub-additivity property. The upper bound on the self contribution shows that MCI always dominates the bivariate association scores. It is also easy to verify that it upper bounds the Shapley value and the ablation score. Finally, duplication invariance shows that when features are duplicated the feature importance scores do not change, unlike the Shapley value, whose sensitivity to duplication we demonstrated earlier. The proof of Theorem 4 is presented in the supplementary material (Section A).

Another interesting property of MCI is the context it can provide for the importance of a feature. From the definition of MCI it follows that for every feature $f_i$ there is a set $S^* \subseteq F$ such that $I_\nu(f_i) = \nu(S^* \cup \{f_i\}) - \nu(S^*)$. This set is a context together with which $f_i$ provides its largest gain. In some cases this context can give additional insight to the scientist.
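A small extension of the brute-force sketch above (same assumed `nu` interface and `all_subsets` helper) also reports this maximizing context for each feature, which is the kind of additional insight mentioned here.

```python
def mci_with_context(nu, features):
    """Return, for each feature, its MCI score together with a context S*
    achieving the maximal marginal contribution."""
    results = {}
    for f in features:
        others = [g for g in features if g != f]
        best_gain, best_context = float("-inf"), frozenset()
        for s in all_subsets(others):
            s = frozenset(s)
            gain = nu(s | {f}) - nu(s)
            if gain > best_gain:
                best_gain, best_context = gain, s
        results[f] = (best_gain, best_context)
    return results
```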

4.3 Computation and Approximation

The complexity of computing the MCI function in a straightforward way is exponential in the number of features. Since computing the Shapley value is NP-complete deng1994complexity (), there is no reason to believe that MCI is easier to compute. In the supplementary material (Section C) we provide examples of cases where MCI can be computed in polynomial time, for example, when $\nu$ is submodular. Moreover, much like the Shapley value, MCI can be approximated by sampling techniques castro2009polynomial (). One interesting property of MCI is that any sampling-based technique provides a lower bound on the score. We also present some upper bounds that allow for a branch-and-bound type technique. In addition, we show that when there are strong correlations between features, an estimate of the MCI scores can be derived by computing the scores on a much smaller problem.

Another challenge in computing MCI is obtaining the values of $\nu$ for various sets. Using the theory of uniform convergence of empirical means (the PAC model) vapnik2015uniform (), we show that with high probability $\nu$ can be estimated to within an additive factor using a finite sample, and that this estimate can be used to estimate MCI to within a similar additive factor. Details about computation and approximation techniques are provided in the supplementary material (Sections C and D).

5 Experiments

In this section we analyze the performance of MCI empirically and compare it to other methods. For the experiments described here we used the breast cancer sub-type classification task from a gene microarray dataset (BRCA) tomczak2015cancer (). We present two experiments: in the first, we compare the quality of the feature importance rankings provided by different methods; in the second, we test the robustness of the methods to correlations between the features. In the supplementary material (Section E) we also provide results for robustness experiments on six different datasets from the UCI repository asuncion2007uci ().

5.1 Data

We used the BRCA tomczak2015cancer () dataset, which consists of the expression of 17,814 genes from 571 patients who have been diagnosed with one of 4 breast cancer sub-types. In our experiments we used the same subset of 50 features used in covert2020understanding (). The main advantage of this dataset is the existing scientific knowledge about genes that are related to breast cancer. We used this knowledge to evaluate the quality of the different feature importance scores.

5.2 Implementation Details

The evaluation $\nu(S)$ of any feature subset $S$ was defined as the average negative log-loss over 3-fold cross-validation of a logistic regression model trained using only the features in $S$.
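The following is a sketch of such an evaluation function (an assumed implementation, not the authors' code), using scikit-learn. It takes a numpy feature matrix `X` and label vector `y`, treats a feature subset as a set of column indices of `X`, and subtracts the log-loss of predicting the empirical class frequencies so that the empty set scores zero (matching the convention of Section 3; this baseline term is our addition).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import cross_val_score

def make_nu(X, y):
    """Build nu(S): the reduction in cross-validated log-loss of a logistic
    regression using only the columns in S, relative to predicting the class priors."""
    classes, counts = np.unique(y, return_counts=True)
    priors = counts / counts.sum()
    baseline = log_loss(y, np.tile(priors, (len(y), 1)), labels=classes)

    def nu(S):
        S = sorted(S)
        if not S:                      # convention: the empty set has zero predictive power
            return 0.0
        model = LogisticRegression(max_iter=1000)
        # 'neg_log_loss' returns the negative log-loss, so higher is better.
        scores = cross_val_score(model, X[:, S], y, cv=3, scoring="neg_log_loss")
        return baseline + float(np.mean(scores))

    return nu
```

Such a finite-sample estimate need not be exactly monotone; Appendix D discusses approximating $\nu$ from finite samples.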

Since computing Shapley values and MCI scores exactly has exponential complexity, we used the sampling technique proposed by covert2020understanding () for the SAGE algorithm. According to this method, a random set $\Pi'$ of permutations of the features is sampled. For each permutation $\pi \in \Pi'$ and feature $f_i$, we denote by $S_{\pi,i}$ the set of features preceding $f_i$ in $\pi$, and estimate the feature importance for Shapley and MCI to be:

$$\hat{\phi}_i = \frac{1}{|\Pi'|}\sum_{\pi \in \Pi'} \left[\nu\left(S_{\pi,i} \cup \{f_i\}\right) - \nu\left(S_{\pi,i}\right)\right], \qquad \hat{I}_\nu(f_i) = \max_{\pi \in \Pi'} \left[\nu\left(S_{\pi,i} \cup \{f_i\}\right) - \nu\left(S_{\pi,i}\right)\right].$$

Recall that using this method provides an unbiased estimator for the Shapley value and a lower bound for MCI. In our experiments with BRCA, $|\Pi'|$ was chosen large enough that the relative order of the features stabilizes in both methods.
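A sketch of this shared sampling scheme, under the same assumed `nu` interface as above: each sampled permutation yields one context per feature, and averaging the marginal gains estimates the Shapley value while taking their maximum yields a lower bound on MCI.

```python
import random

def sample_scores(nu, features, n_permutations=1000, seed=0):
    """Permutation-sampling estimates: unbiased for Shapley, a lower bound for MCI."""
    rng = random.Random(seed)
    features = list(features)
    sums = {f: 0.0 for f in features}
    maxima = {f: float("-inf") for f in features}
    for _ in range(n_permutations):
        order = features[:]
        rng.shuffle(order)
        prefix, prev = frozenset(), nu(frozenset())
        for f in order:
            prefix = prefix | {f}
            value = nu(prefix)
            gain = value - prev
            sums[f] += gain                 # accumulates the Shapley estimate
            maxima[f] = max(maxima[f], gain)  # tracks the MCI lower bound
            prev = value
    shapley = {f: sums[f] / n_permutations for f in features}
    return shapley, maxima
```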

                Experiment I: Quality (NDCG)         Experiment II: Robustness (MKD)
Method        @3     @5     @10    @20    @50        @3     @5     @10    @20    @50
MCI           1.00   0.85   0.77   0.88   0.92       0.00   0.00   0.04   0.02   0.03
Shapley       0.77   0.70   0.73   0.88   0.88       1.00   0.28   0.13   0.08   0.04
Bivariate     1.00   0.85   0.77   0.82   0.92       0.00   0.00   0.00   0.00   0.00
Ablation      0.30   0.21   0.28   0.44   0.61       0.12   0.14   0.17   0.09   0.04
Table 1: Results of experiments with the BRCA dataset. The left-hand side presents the results for Experiment I, which measures the quality of the scores, using the NDCG of prefixes of varying sizes of the rankings generated by the different methods (higher is better; the perfect score is 1.00). The right-hand side presents the results for Experiment II, which measures the robustness of the scores to feature correlations, as the Minimizing Kendall Distance (MKD) between the rankings of the top-$k$ elements before and after duplicating the top feature three times (lower is better; the perfect score is 0.00).

5.3 Experiment I: Quality

The goal of the first experiment is to evaluate the quality of the importance scores provided by the different methods. Each feature importance score generates a ranking of the features. Since there is existing scientific knowledge about the association between genes and breast cancer sub-types, we consider a better ranking to be one that ranks genes known to be related to breast cancer higher covert2020understanding (). We evaluated the rankings using the well-known Normalized Discounted Cumulative Gain (NDCG) metric jarvelin2002cumulated (). The results of the experiment are presented in Table 1. They show that MCI and Bivariate outperform Shapley, while Ablation performs poorly. The success of the Bivariate method in this experiment suggests that there are no significant synergies between the features in this dataset. MCI handled this setting well and even slightly outperforms Bivariate in the top-20 list. However, due to the strong correlations between features, Ablation failed to generate a meaningful ranking, and this is also a probable explanation for the low performance of Shapley.

SAGE covert2020understanding () is a global feature importance score that was designed for the global-model scenario. It is possible to use it in the global-natural scenario by training a model on a portion of the data and using the rest for evaluating feature importance. Unfortunately, we found that when used in this manner, SAGE’s results are sensitive to the choice of the random seed (other methods were found to be stable) and in most cases it is substantially outperformed by MCI. Details are provided in the supplementary material (Section F).

Figure 1: Feature duplication experiment on the BRCA dataset, with the top 10 ranked features displayed: (a) Shapley values, (b) MCI values, (c) Shapley values with duplication, (d) MCI values with duplication. The top row shows the feature importance according to Shapley (a) and MCI (b). The bottom row shows the estimates of both methods when the top-ranked feature (BCL11A) is duplicated three times. Both methods produce similar importance estimates when the top-ranked feature is not duplicated. However, when it is, Shapley's importance assignment (c) is affected drastically, while MCI (d) remains stable.

5.4 Experiment II: Robustness

The goal of the second experiment is to measure the robustness of the different methods, and more specifically the effect of correlated features on the rankings they produce. For each method, we duplicated three times the feature that was ranked first and re-evaluated the scores and the associated ranking with the additional features. We then computed the distance between the top-$k$ list before duplicating the feature and the top-$k$ list after the duplication.
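The duplication step itself is straightforward; a sketch, assuming a numpy feature matrix `X` in which column `index` holds the top-ranked feature:

```python
import numpy as np

def duplicate_feature(X, index, copies=3):
    """Append `copies` exact copies of column `index` to the feature matrix."""
    extra = np.repeat(X[:, [index]], copies, axis=1)
    return np.hstack([X, extra])
```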

To measure the distances between the rankings of the top-$k$ features we used the Minimizing Kendall Distance (MKD) fagin2003comparing () metric. In this method, for a ranking $\sigma$ of the top-$k$ features we define $C(\sigma)$ to be the set of completions of $\sigma$ to rankings of the entire feature set, i.e., $\tau \in C(\sigma)$ if $\tau$ is a ranking of all the features and $\tau$ agrees with $\sigma$ on the top-$k$ features. Using this notation, the MKD distance between $\sigma_1$ and $\sigma_2$ is $\min_{\tau_1 \in C(\sigma_1),\, \tau_2 \in C(\sigma_2)} K(\tau_1, \tau_2)$, where $K$ is the Kendall-tau distance. Furthermore, since we introduced duplicated features which are not ranked in the original ranking, we included only the first occurrence of each duplicated feature in the list of ranked features.
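The following sketch implements this metric via the case analysis of Fagin et al., as read from the description above; treat it as an assumed implementation rather than a reference one. Pairs ranked by both lists are penalized when their orders disagree, pairs split across the two lists are forced disagreements, and pairs known to only one list are completed in the most favorable way (no penalty).

```python
from itertools import combinations

def mkd(top1, top2):
    """Minimal Kendall-tau distance over all completions of two top-k lists
    (most important item first). Returns a raw count of disagreeing pairs."""
    pos1 = {item: r for r, item in enumerate(top1)}
    pos2 = {item: r for r, item in enumerate(top2)}
    distance = 0
    for i, j in combinations(set(top1) | set(top2), 2):
        both1 = i in pos1 and j in pos1
        both2 = i in pos2 and j in pos2
        if both1 and both2:
            if (pos1[i] - pos1[j]) * (pos2[i] - pos2[j]) < 0:
                distance += 1          # both lists rank the pair, in opposite orders
        elif both1:
            # only list 1 ranks both; list 2 implies items inside its top-k precede those outside
            if i in pos2 and pos1[j] < pos1[i]:
                distance += 1
            elif j in pos2 and pos1[i] < pos1[j]:
                distance += 1
        elif both2:
            if i in pos1 and pos2[j] < pos2[i]:
                distance += 1
            elif j in pos1 and pos2[i] < pos2[j]:
                distance += 1
        else:
            distance += 1              # each list ranks exactly one item of the pair: forced disagreement
    return distance

print(mkd(["a", "b", "c"], ["a", "b", "c"]))  # 0
print(mkd(["a", "b", "c"], ["b", "a", "c"]))  # 1
```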

The results of this experiment are presented in Table 1. As expected, the Bivariate method is not affected by the duplicated features since it ignores feature correlations. MCI was resilient to the introduction of these features at the top of the list, while some small changes are seen at the bottom of the list. We suspect that this is noise due to the fact that only a small sample of permutations was used to estimate the scores. Shapley suffered the most from the introduction of the new features, and most of the differences are at the top of the list, i.e., features that were considered the most important before the duplication were pushed down the list. Finally, Ablation also suffered from the introduction of duplicated features.

An additional demonstration of the difference between MCI and Shapley is presented in Figure 1, which shows the top 10 features computed by Shapley and MCI before and after introducing the duplicated features. While the rankings before the duplication are similar for both methods, this is no longer the case after the duplication. While MCI finds all the duplicates to be as important as the original feature, Shapley penalizes the duplication and demotes this feature while promoting features that were previously ranked lower.

6 Conclusion

In this work we introduced an important distinction between the model scenario for feature importance scores and the natural scenario. We showed in what sense these scenarios are different, and demonstrated cases where previous methods, such as Shapley-value-based methods, behave imperfectly when extended to the natural scenario. Therefore, we focused on the natural scenario and used an axiomatic approach in which we identified three properties that any feature importance function for this scenario should satisfy. We showed that there exists a function that satisfies all of these properties, and that it is unique. We called it the MCI function and addressed its computational and approximation challenges in the supplementary material. Finally, we ran several experiments showing that MCI performed better at identifying important features and was robust to the addition of correlated features, whereas Shapley-value-based methods are highly affected by this kind of change.

Appendix A Proofs

In this section we group together all the proofs for the theorems we presented in the main paper.

A.1 Existence and Uniqueness of the Feature Importance Score

Here we prove Theorem 3, which states that there is a function for which all the axioms defined in Definition 2 hold and that this function is unique. The proof is constructive in the sense that we show that the only feature importance function for which the axioms hold is the function that assigns to feature $f_i$ the score

$$I_\nu(f_i) = \max_{S \subseteq F} \left[\nu\left(S \cup \{f_i\}\right) - \nu(S)\right].$$

That is, the importance of a feature is its maximum contribution to the evaluation function over all subsets of features.

Lemma 1.

Let $I_\nu$ be a feature importance function for which axioms (1) and (2) hold (the marginal contribution and elimination axioms). Then for every $f_i \in F$:

$$I_\nu(f_i) \geq \max_{S \subseteq F} \left[\nu\left(S \cup \{f_i\}\right) - \nu(S)\right].$$

Proof.

We prove the statement using induction on the size of the feature set $F$. Let $f_i \in F$. If $|F| = 1$ (i.e., $F = \{f_i\}$) then from the marginal contribution axiom we have that

$$I_\nu(f_i) \geq \nu(F) - \nu(F \setminus \{f_i\}) = \nu(\{f_i\}) - \nu(\emptyset) = \max_{S \subseteq F}\left[\nu(S \cup \{f_i\}) - \nu(S)\right].$$

Assume that the statement holds for any set of features of size smaller than $n$. Let $|F| = n$ and let $f_i \in F$. Let $S \in \arg\max_{S' \subseteq F}\left[\nu(S' \cup \{f_i\}) - \nu(S')\right]$. If there exists $f_j \in F \setminus (S \cup \{f_i\})$ then from the elimination axiom, if $\{f_j\}$ is eliminated we obtain $F'$ and $\nu'$ such that $I_{\nu'}(f_i) \leq I_\nu(f_i)$ and $\nu'(S \cup \{f_i\}) - \nu'(S) = \nu(S \cup \{f_i\}) - \nu(S)$. However, since $|F'| = n - 1$, we have from the induction assumption that

$$I_\nu(f_i) \geq I_{\nu'}(f_i) \geq \nu'(S \cup \{f_i\}) - \nu'(S) = \max_{S' \subseteq F}\left[\nu(S' \cup \{f_i\}) - \nu(S')\right].$$

Otherwise, $S$ is such that $S \cup \{f_i\} = F$, and hence $F \setminus \{f_i\} \subseteq S$. Therefore, $\nu(S \cup \{f_i\}) - \nu(S) \leq \nu(F) - \nu(F \setminus \{f_i\})$. From the marginal contribution axiom we have that $I_\nu(f_i) \geq \nu(F) - \nu(F \setminus \{f_i\}) \geq \max_{S' \subseteq F}\left[\nu(S' \cup \{f_i\}) - \nu(S')\right]$.

Lemma 1 shows that any importance function that has the marginal contribution property and the elimination property must assign to every feature $f_i$ an importance score of at least $\max_{S \subseteq F}\left[\nu(S \cup \{f_i\}) - \nu(S)\right]$. Therefore, by adding the minimalism axiom we obtain the uniqueness and existence of the feature importance function, as stated in Theorem 3.

We now prove Theorem 3.

Proof.

Adding the minimalism axiom to Lemma 1 shows that if the marginal contribution and the elimination axioms hold for $I_\nu(f_i) = \max_{S \subseteq F}\left[\nu(S \cup \{f_i\}) - \nu(S)\right]$, then it is the unique feature importance function. Proving that the marginal contribution axiom holds is straightforward: for a feature $f_i$,

$$I_\nu(f_i) = \max_{S \subseteq F}\left[\nu(S \cup \{f_i\}) - \nu(S)\right] \geq \nu(F) - \nu(F \setminus \{f_i\}).$$

To see that the elimination axiom holds too, let $T \subseteq F$ and let $f_i \in F \setminus T$. If $T$ is eliminated from $F$ to create $F' = F \setminus T$ and $\nu'$, then

$$I_{\nu'}(f_i) = \max_{S \subseteq F'}\left[\nu'(S \cup \{f_i\}) - \nu'(S)\right] = \max_{S \subseteq F'}\left[\nu(S \cup \{f_i\}) - \nu(S)\right] \leq \max_{S \subseteq F}\left[\nu(S \cup \{f_i\}) - \nu(S)\right] = I_\nu(f_i).$$

A.2 Properties of the MCI Function

Here, we prove the MCI function properties presented in Theorem 4.

Proof.

Dummy: Let $f_i$ be a dummy variable such that $\nu(S \cup \{f_i\}) = \nu(S)$ for every $S \subseteq F$; then $I_\nu(f_i) = \max_{S \subseteq F}\left[\nu(S \cup \{f_i\}) - \nu(S)\right] = 0$.

Symmetry: Let $f_i$ and $f_j$ be such that for every $S \subseteq F \setminus \{f_i, f_j\}$ we have that $\nu(S \cup \{f_i\}) = \nu(S \cup \{f_j\})$. Consider any set $S \subseteq F$. We consider three cases: (1) $f_i, f_j \in S$, (2) $f_i, f_j \notin S$, and (3) exactly one of $f_i, f_j$ is in $S$. In case (1) we have that $\nu(S \cup \{f_i\}) - \nu(S) = 0 = \nu(S \cup \{f_j\}) - \nu(S)$. In case (2) we have

$$\nu(S \cup \{f_i\}) - \nu(S) = \nu(S \cup \{f_j\}) - \nu(S).$$

In case (3) assume, w.l.o.g., that $f_j \in S$ and $f_i \notin S$. Let $S'$ denote the set $S$ where $f_j$ is replaced by $f_i$; due to the symmetry between $f_i$ and $f_j$ it holds that $\nu(S') = \nu(S)$. Note also that $S \cup \{f_i\} = S' \cup \{f_j\}$ and therefore $\nu(S \cup \{f_i\}) - \nu(S) = \nu(S' \cup \{f_j\}) - \nu(S')$. From analyzing these 3 cases it follows that for every $S$ there exists $S''$ such that $\nu(S \cup \{f_i\}) - \nu(S) \leq \nu(S'' \cup \{f_j\}) - \nu(S'')$. Therefore, $I_\nu(f_i) \leq I_\nu(f_j)$. However, by exchanging the roles of $f_i$ and $f_j$ it also holds that $I_\nu(f_j) \leq I_\nu(f_i)$, and therefore $I_\nu(f_i) = I_\nu(f_j)$.

Super-efficiency: Let $F = \{f_1, \ldots, f_n\}$, where w.l.o.g. the features are indexed in an arbitrary fixed order. Define $S_j = \{f_1, \ldots, f_j\}$ and $S_0 = \emptyset$. Therefore, $\nu(S_n) = \nu(F)$ and $\nu(S_0) = 0$. Since $I_\nu(f_j) \geq \nu(S_j) - \nu(S_{j-1})$ for every $j$,

$$\sum_{j=1}^{n} I_\nu(f_j) \geq \sum_{j=1}^{n} \left[\nu(S_j) - \nu(S_{j-1})\right] = \nu(F).$$

Sub-additivity: if $\nu_1$ and $\nu_2$ are evaluation functions defined on $F$ then for all $f_i \in F$,

$$I_{\nu_1 + \nu_2}(f_i) = \max_{S \subseteq F}\left[(\nu_1 + \nu_2)(S \cup \{f_i\}) - (\nu_1 + \nu_2)(S)\right] \leq \max_{S \subseteq F}\left[\nu_1(S \cup \{f_i\}) - \nu_1(S)\right] + \max_{S \subseteq F}\left[\nu_2(S \cup \{f_i\}) - \nu_2(S)\right] = I_{\nu_1}(f_i) + I_{\nu_2}(f_i).$$

Upper bound on the self contribution: For $S = \emptyset$ we have that $I_\nu(f_i) \geq \nu(\emptyset \cup \{f_i\}) - \nu(\emptyset) = \nu(\{f_i\})$.

Duplication invariance: Assume that $f'$ is a duplication of $f_i$ in the sense that for every $S \subseteq F$ we have that $\nu'(S \cup \{f'\}) = \nu'(S \cup \{f_i\})$. Let $F$ and $\nu$ be the results of eliminating $\{f'\}$ and let $f_j \in F$. From the elimination axiom we have that $I_\nu(f_j) \leq I_{\nu'}(f_j)$. Assume that $S \subseteq F \cup \{f'\}$ is such that $I_{\nu'}(f_j) = \nu'(S \cup \{f_j\}) - \nu'(S)$. If $f' \notin S$ then $S \subseteq F$ and $I_\nu(f_j) \geq \nu(S \cup \{f_j\}) - \nu(S) = I_{\nu'}(f_j)$. Otherwise $f' \in S$, and for $S'' = (S \setminus \{f'\}) \cup \{f_i\} \subseteq F$ it holds that $\nu'(S'' \cup \{f_j\}) - \nu'(S'') = \nu'(S \cup \{f_j\}) - \nu'(S)$, so again $I_\nu(f_j) \geq \nu(S'' \cup \{f_j\}) - \nu(S'') = I_{\nu'}(f_j)$. Therefore, in all possible cases we have that $I_\nu(f_j) \geq I_{\nu'}(f_j)$. When combined with the elimination axiom we conclude that $I_\nu(f_j) = I_{\nu'}(f_j)$. ∎

Appendix B Additional Properties of the MCI Function

In the following theorem we present and prove additional relevant properties of the MCI function that were not mentioned in the main paper.

Theorem 5.

Let $F$ be a set of features, let $\nu$ be an evaluation function, and let $I_\nu$ be the MCI function. The following holds:

  • Scaling: for every $c > 0$, $I_{c \cdot \nu}(f_i) = c \cdot I_\nu(f_i)$.

  • Monotonicity: If then .

In the following we prove Theorem 5.

Proof.

Scaling: This property follows since for every $S \subseteq F$ and every $c > 0$ it holds that $c\,\nu(S \cup \{f_i\}) - c\,\nu(S) = c\left[\nu(S \cup \{f_i\}) - \nu(S)\right]$.

Monotonicity: Let for which . Let be such that . Then, if it holds that

Otherwise, if then

Appendix C Computation Optimizations

We now turn our attention to the computational challenge of computing or approximating the MCI function. A straightforward computation is exponential in the size of the feature set. Therefore, we study cases in which the computation can be made efficient, as well as approximation techniques.

C.1 Submodularity

In the following we show that if the evaluation function $\nu$ is submodular lovasz1983submodular () then the MCI feature importance score of each feature is equal to its self contribution.

Lemma 2.

If $\nu$ is submodular then $I_\nu(f_i) = \nu(\{f_i\})$ for every $f_i \in F$.

Proof.

Recall that in this case there are diminishing returns, and therefore for every $S \subseteq F$: $\nu(S \cup \{f_i\}) - \nu(S) \leq \nu(\{f_i\}) - \nu(\emptyset) = \nu(\{f_i\})$. Therefore, $I_\nu(f_i) = \max_{S \subseteq F}\left[\nu(S \cup \{f_i\}) - \nu(S)\right] = \nu(\{f_i\})$.

The submodularity assumption might be too stringent in some cases. For example, if the target variable is an XOR of some features then the submodularity assumption does not hold. However, if we assume that submodularity holds for large sets then we obtain a polynomial algorithm for computing the feature importance. This may make sense in the genomics setting, where genes may have synergies but we may assume that only small interactions of 2, 3, or 4 genes are significant. We begin by defining $k$-size submodularity:

Definition 6.

A function is -size submodular if for every such that

A function is soft -size submodular if it holds that for every , , there exists , for which:

Lemma 3.

If $\nu$ is $k$-size-submodular or soft $k$-size-submodular then $I_\nu(f_i) = \max_{S \subseteq F,\, |S| < k}\left[\nu(S \cup \{f_i\}) - \nu(S)\right]$.

Proof.

First, we show that if is -size-submodular it is also soft -size-submodular. Let be a -size-submodular valuation function. Let , and let , , . From the -size-submodular property we get that:

And therefore is also soft -size-submodular. Hence, it is enough to prove the theorem for soft -size-submodular functions.
Let be a soft -size-submodular evaluation function. Let be such that . Assume, in contradiction, that . Note that since if then , and in this case we have that in contradiction. Due to the soft -size submodular and monotonous properties of it follows that exists , such that:

Therefore, . This is a contradiction since . ∎

Lemma 3 shows that if $\nu$ is soft $k$-size-submodular then the entire MCI function can be computed in time $O(|F|^k)$.
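A sketch of the resulting restricted computation, reusing the assumed `nu` interface from the earlier sketches; the cutoff of contexts with fewer than $k$ features is an assumption read from the discussion above.

```python
from itertools import combinations

def mci_k_submodular(nu, features, k):
    """MCI under (soft) k-size submodularity: maximize only over contexts of size < k."""
    features = list(features)
    scores = {}
    for f in features:
        others = [g for g in features if g != f]
        best = nu(frozenset({f})) - nu(frozenset())   # the empty context
        for r in range(1, k):
            for s in combinations(others, r):
                s = frozenset(s)
                best = max(best, nu(s | {f}) - nu(s))
        scores[f] = best
    return scores
```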

C.2 Branch and Bound Optimization

Here we show how we can discard computation for some of the subsets using a branch and bound like technique.

Lemma 4.

For every and :

Proof.

This lemma follows from the monotonicity of $\nu$.

The ability to upper bound provided by this Lemma allows cutting back the computation significantly. For example, if we computed for every set of size and we have that then . The following lemma proves this property in a more general setting.

Lemma 5.

Let and for . Let and . Let and and then

Proof.

The lower bound on follows from the simple fact that for every

Let be such that . If then . Otherwise, there exists and such that and from Lemma C.II it holds that

which completes the proof. ∎

C.3 Heuristics

Recall that for any $S \subseteq F$ it holds that $I_\nu(f_i) \geq \nu(S \cup \{f_i\}) - \nu(S)$. Therefore, any method can be used to select candidate subsets $S$ and obtain a lower bound on the feature importance. In the experiments in this paper we used random permutations to generate the candidate sets, following the proposal of covert2020understanding (). This method is described in Section 5. Our experiments show that this method is effective; however, in some cases it may be too demanding, since a model has to be trained for every subset of features. The computational cost can be further reduced by using a method such as SAGE covert2020understanding () to estimate $\nu$ from a model that was trained on the entire dataset and therefore trained only once. Only for sets for which SAGE estimates that the marginal contribution is large is the real value computed by training models. In this way, the SAGE estimator (or any other proposed method) is used to eliminate sets for which the marginal contribution of a feature is predicted to be small.

Appendix D Approximations of the MCI Function

Another challenge in computing MCI is computing the evaluation function $\nu$. Recall that we assumed, for example, that $\nu$ is monotone increasing. This is motivated by thinking about $\nu(S)$ as some measure of the information that $S$ provides on the target variable. However, if only a finite sample exists for evaluating $\nu$, then enforcing this property is no longer trivial. Therefore, in this section we show, using uniform convergence of empirical means vapnik2015uniform () (in other words, the PAC theory), that $\nu$ can be approximated well enough from a finite sample such that the estimate of MCI will be good too.

First, we need to define a way for estimating the valuation function using a finite sample. Let be a random variable consists of features and let be a target random variable. Let be a finite i.i.d sample from and let be a hypothesis class for any subset of features . For any hypothesis denote by the expected error of over , and denote by the expected error of over the true distribution . The estimated valuation function defined as follows:

and the true valuation function defined as:

For convenience we assume that is finite (this restriction can be exchanged for finite VC dimension). denote .

Theorem 7.

Let be a random variable consists of features , let be a target random variable and let be a hypothesis class for for any subset .
For any and sample of size it holds that:

Proof.

Let be a random variable consists of features , let be a target random variable and let be a hypothesis class for any subset of features . Let and let be an i.i.d sample of size .

First, we show that for all :

Let . Denote and . We have that:

and also:

Next we would like to show that for any it holds that .

Let and let , .

Notice that

and also

and therefore we get that

Hence, using union bound and Hoeffding inequality, for any it holds that:

Where the last inequality follows by using the bound on in the statement of this theorem. ∎

Appendix E Robustness Experiments on UCI Datasets

In this section we present additional experiments comparing the robustness of different feature importance methods. We follow a procedure similar to the one described in Section 5 for six additional datasets from the UCI repository asuncion2007uci () (see the description of the datasets in Table 2). We do not run quality experiments for these datasets since, unlike the BRCA dataset, there is no definitive knowledge about the importance of the features to which we can compare.

Following the robustness experiment for the BRCA dataset described in Section 5, we first computed feature importance using the different methods. Then, for each method, we duplicated three times the feature that was ranked first and re-computed the feature importance. We measured stability using the MKD distance described in Section 5, computing the MKD distances between the top-$k$ list before the introduction of the duplicates and the top-$k$ list after the duplication.

Dataset Type # Features # Examples
Heart Disease Classification 13 303
Wine Quality Regression 11 1599
German Credit Default Classification 20 1000
Bike Rental Regression 12 303
Online Shopping Classification 17 12330