# Approximation Trees: Statistical Stability in Model Distillation

###### Abstract

This paper examines the stability of learned explanations for black-box predictions via model distillation with decision trees. One approach to intelligibility in machine learning is to use an understandable “student” model to mimic the output of an accurate “teacher”. Here, we consider the use of regression trees as a student model, in which nodes of the tree can be used as “explanations” for particular predictions, and the whole structure of the tree can be used as a global representation of the resulting function.

However, individual trees are sensitive to the particular data sets used to train them, and an interpretation of a student model may be suspect if small changes in the training data have a large effect on it. In this context, access to outcomes from a teacher helps to stabilize the greedy splitting strategy by generating a much larger corpus of training examples than was originally available. We develop tests to ensure that enough examples are generated at each split so that the same splitting rule would be chosen with high probability were the tree to be re-trained. Further, we develop a stopping rule to indicate how deep the tree should be built based on recent results on the variability of Random Forests when these are used as the teacher. We provide concrete examples of these procedures on the CAD-MDD and COMPAS data sets.

## 1 Introduction

This paper examines the use of regression trees for model distillation. While Machine Learning has traditionally focused on predictive performance, there has been considerable recent interest in “X-raying the black box”: finding methods to make the ways in which neural networks, Random Forests and other predictive models arrive at their predictions understandable to humans. This problem can be approached by creating summaries of these models such as variable importance scores (Breiman, 2001), partial dependence or ICE plots (Friedman, 2001; Goldstein et al., 2013), saliency maps (Simonyan et al., 2013) and other local explanations (Ribeiro et al., 2016). It can also be approached by developing intelligible “student” models which mimic the predictions of the original ”teacher” black box: a strategy encompassed by the term model distillation. Within model distillation, common student models are generalized additive models (GAMS: see Lou et al. (2012); Tan et al. (2017), Hooker (2007) provides a link between these and PDPs) and decision trees Breiman et al. (1984); Quinlan (1987), which are our focus. Decision trees have an intelligible graphical representation and can automatically fit complex high-dimensional functions, both of which make them appealing as student models. They have been used to reduce the computational cost of predictions (Craven and Shavlik, 1995), reduce effort in data collection (see Gibbons et al. (2013) especially for shortening medical questionnaires) and the final few nodes on a tree can provide an explanation for its predictions (Ribeiro et al., 2016; Hu et al., 2018). They have been used to represent complex models in Johansson et al. (2010, 2011); He et al. (2012); Augasta and Kathirvalavakumar (2012) and as a means of examining disparate impact in machine learning models in Chouldechova (2017).

The features listed above make decision trees attractive as a machine learning technique. However, the greedy algorithm used to build trees results in high variability and poor performance when used directly on training data. This is because small perturbations of the data used to build the tree can result in dramatically different models as when, for example, a different covariate is chosen in a high-level split with consequences that cascade through the rest of the tree structure. In the context of model distillation, this instability is an important concern: an explanation or interpretation of a learning outcome that is sensitive to small changes in the data may be viewed as unreliable. A related question is to determine the depth of the tree: while explanations of the global behavior of a predictive model focus on the leading nodes, those for individual predictions typically discuss the bottom of the tree. This is also where the tree structure is most variable and determining at what level of detail a teacher model needs to be explained is also relevant.

We address both these questions here. In order to obtain a stabilized structure for an approximation tree we take advantage of our ability to generate an arbitrarily large data set from which to build it. Specifically, we follow Gibbons et al. (2013) in generating pseudo data from a kernel density estimate based on the observed feature variables and using the value of the teacher model at these points as a response. In this paper we additionally ensure that, were this pseudo data to be re-generated, the same tree structure would be chosen with high probability. To carry this out, at each node we assess the stability of the selected split via a hypothesis testing framework; when splitting, we generate a large enough corpus of pseudo data to ensure that separation between the Gini index split criterion at the chosen split and that of other candidates is large enough to be consistently selected. This framework is repeated at each split to obtain a stabilized tree, generating new pseudo data as needed.

As our experiments show, this can result in the need to generate very large sets of pseudo-data when competing splits produce very similar improvement and achieving a stabilized tree can be computationally demanding. We think that this is an important observation: that many existing uses of decision trees in model distillation may produce unstable model interpretations or explanations and our understanding of these models may rest more on the particular data used to generate the approximation tree than on the underlying structure of the teacher. There are some subtle distinctions to be made here: if a distillation tree replaces the learned model when making predictions, we might reasonably choose to present it as an explanation for how a prediction is made, even if the structure of the tree was originally determined partially by chance. However, if we also hope to interpret reasoning behind the prediction, or expect the tree to explain something about the teacher, we would require explanations to be reproducible.

Our second question concerns a stopping rule for the depth of an approximation tree: we would like to capture the structure of the teacher in so far as it relates to the underlying signal in the data, but we also want to avoid mimicking artifacts of the particular data set used to obtain it. That is, we would like to stop building a tree once nodes start capturing the variance of the teacher. Few machine learning methods come with statistical estimates of sample variability; however for the specific case of Random Forests (Breiman, 2001), Mentch and Hooker (2016) and Wager and Athey (2017) were able to demonstrate that the predictions of Random Forests are asymptotically normally distributed with a variance that can be estimated at no additional overhead. Here we use this to develop a test of whether the values of a Random Forests teacher model are statistically different from a constant within the current leaf of the student. This in turn allows us to annotate the student model with an assessment of whether each of the leaf nodes could yield further insight, or predictive accuracy, if split again. Here our experiments suggest that the signal from Random Forests is generally more complex than is captured by an individual approximation tree grown to common depths, suggesting that explanations based on our approximation tree procedure are at least conservative.

In the remainder of this paper, Section 2 develops a means of ensuring the stability of a single split based on ideas from hypothesis testing. This is then employed to conduct a stabilized decision tree in Section 3 where we also present the results of some simulation experiments. Section 4 develops our stopping criterion. We then employ the complete procedure on two data sets: the CAD-MDD data set studied in Gibbons et al. (2013) in which the goal is to shorten a medical questionnaire screening for severe depression, and the COMPAS data set Larson et al. (2016) used to assess recidivism risk, also studied in Kleinberg et al. (2016); Chouldechova (2017). Studies on further data sets are presented in the Supplemental Material.

## 2 Stabilizing Splitting Rules

We begin by examining the statistical properties of deciding on a single split within a tree. Decision trees proceed by recursively splitting the covariate space in two based on the values of one covariate. That is, we define a split function

which divides the covariate space in two with the division indicated by the label . Decision tree algorithms select the split function by examining all possible splitting rules among the possible choices of covariates, , and split points and choosing the values which minimize a data-fitting criterion. In this paper we focus on multi-class classification with a tree built using the change in Gini index measured over classes:

(1) |

between the root and the two leaf nodes. In practice, this criterion is evaluated on data by replacing the probabilities in (1) with their empirical counterparts on data.

In the context of model distillation, we consider a black box classifier along with a pseudo-sample of arbitrary size . Here , and are the -predicted class probabilities over responses and we choose the split that minimizes a Gini index of the . In order to ensure that our choice of split is stable, we need to choose large enough to control the probability that two different pseudo samples, and would result in different splits. Here, we make pairwise comparisons between the current best split, and the list of candidate alternatives. For each alternative, the p-value for a test that the difference in Gini gains is greater than zero gives us an estimate of the probability that a different data set would choose the alternative over the current best split. Summing these probabilities gives a bound on the likelihood of splitting the current node a different way and we then select to control this probability.

The remainder of this section provides the technical details required to carry this calculation out. We first derive a central limit theorem for the difference in Gini indices between splits; we then show how this can be used to assess the probability that a different pseudo-sample would result in a different choice of split and increment until we control this probability at a desired level. Finally, we control for multiple comparisons through a Bonferroni procedure.

### 2.1 Asymptotic Distribution of Gini Indices

A theoretical discussion of the evaluation of splits can be found in Banerjee et al. (2007). In our specific case, we compare the Gini indices of candidate splits: To do so, we examine their asymptotic behavior and obtain a central limit theorem (CLT) so tests can be developed based on a normal distribution; (1) implies an averaging over all samples when calculating the Gini index, suggesting the existence of the CLT.

To examine two prospective splits and with the same samples, recall their Gini gains

where represents the covariate distribution of and the conditional probability of given . Subscripts are arranged in the order of the split, the left (denoted as ) or right (denoted as ) child, and the class label. For instance,

and respectively for . The empirical versions, Gini indices, are

Moving to the left and right children of both splits, we denote the numbers of samples and the probabilities over class labels in each child by, for

For simplicity we write, for

Employing a multivariate CLT we obtain

To relate this limiting distribution to the difference of Gini indices we shall employ the -method. Consider the analytic function s.t.

The -method imples that

(2) |

Here we write

We should point out that while (2) provides us with the CLT we need to assess the difference between two Gini indices. After expanding (2),

or asymptotically,

Hence, by replacing by the empirical versions from the pseudo samples, we write

(3) |

Denote the Gini difference we can rewrite (3) as

(4) |

### 2.2 The Probability of Choosing the Same Split Again

The above formula (3) gives rise to the following test when comparing two splits with different batches of pseudo samples. Suppose we have two prospective splits and . After drawing pseudo samples and observing, without loss of generality, that . We intend to claim that is better than . In order to ensure this split is chosen reliably, we can run a single-sided test to check whether we would obtain the same decision when accessing with another independently-generated set of pseudo samples . Assuming that and are independent samples, (4) implies

which gives,

This distribution leads to a prediction interval based on which we would get the prediction of the Gini difference using a different pseudo sample. In order to control at a confidence level , we need

(5) |

where is the -quantile of a standard normal. With a sufficiently large it is always possible to determine the better split between and should they have any difference. In addition, by combining this test with a pairwise comparisons procedure, we are capable of finding the best split among multiple prospective splits.

### 2.3 Sequential Testing

The power of this split test increases with . Since we need to choose to reveal any detectable difference between two splits, when no prior knowledge is given regarding the magnitude of the difference, we need an adaptive approach to increasing accordingly.

For a fixed confidence level , suppose we have tested at sample size and get p-value . Referring to (5), we have

Notice that is the estimator of which is an intrinsic constant with respect to the pairwise comparison. Hence in order to reach a p-value less than we may increase sample size to such that

which yields that

(6) |

Due to pseudo sample randomness, a few successive increments are required before we land in the confidence level.

### 2.4 Multiple Testing

So far we have obtained a method to compare a pair of splits. But when splitting a certain node we usually need to choose the best split among multiple . In order to adapt our pairwise split test to this situation, we consider modifying the problem slightly into deciding whether the split with the minimal Gini index is intrinsically superior than any other splits. This problem can be resolved by conducting pairwise comparisons of the split with minimal Gini index against the rest.

If we still want to test at a certain significance whether the split with the lowest estimated Gini index, i.e, , is the optimal, we can still work within the scheme of the pairwise comparison with an additional procedure controlling the familywise error rate (FWER). Here we make an analogue of the Bonferroni correction (Dunnett, 1955).

Test the hypotheses . Get the -values .

Analogous to a Bonferroni correction, use , the upper bound for making at most one Type I error, as the -value of the multiple comparison.

This test aggregates all significance levels into one. The Bonferroni correction will result in a conservative estimate as we ignore much of the correlation structure of the splits. In this scenario, the updates of sample size made in sequential testing should also be adapted as we are now taking the aggregated significance level. A quick and feasible fix is to replace the in (6) by the aggregated significance level. Alternatively, we may just test between the best two splits.

Because of the computational cost, when we have two splits that cannot be distinguished, the sequential and multiple testing procedure may end up demanding an extremely large number of points to make the test significant. In practice, we halt the testing early at a cutoff of certain amount of points, and choose the current best split. This compensation for computation time might lower the real power of the test, leading to a less stable result.

## 3 Stably Approximating a Black Box

To build an approximation tree, we replace the greedy splitting criterion by our stabilized version within the CART construction algorithm. At each node, we first generate an initial number of pseudo samples belonging to this node from the black box. Then we compare prospective splits simultaneously based on this set and decide whether we choose the one with the smallest Gini index with certain confidence or request more pseudo sample points. In the latter case, we keep generating until the pseudo sample size reaches what is required by the sequential testing procedure. This is repeated until we distinguish the best split. We perform this procedure on any node that needs to split during construction to get the final approximation tree.

In Section 4 we explore a potential stopping rule to decide the maximal depth of the tree, although we will often want to impose a maximal depth. We also need to specify : the maximal number of pseudo-samples to use and : our risk tolerance for choosing an alternative split. Additionally we specify which candidate splits to consider, and how to generate the pseudo-sample covariates below.

### 3.1 Choice of Prospective Splits

Most methods of finding prospective splits for a decision tree are compatible with our method once they target optimizing some information gain (Quinlan, 2014, 1987). In building an approximation tree, we only consider making splits at those points which would have been employed in a tree generated from the original training data. We look at the original samples that have been carried along the path and take the possible combinations of the covariates and their middle points of adjacent values that have appeared in those samples. This significantly narrows the number of splits to consider, improving the Bonferroni bounds above, and ensures that the set of candidate splits is independent of the pseudo-sample data.

Although this method will initially generate a large number of prospective splits, because of the sequential testing scheme, most of those splits will be identified as far worse than the best after a few tests and can be discarded, leaving a negligible effect on the overall performance. In practice, we implement a scheme (Benjamini and Hochberg, 1995) to adaptively discard splits that perform far worse than the current best. All splits are ordered by their p-values against the current best split, and the splits fall below the threshold are discarded and the Bonferroni procedure applied to the remaining candidates.

### 3.2 Generating Points

We first generate pseudo covariates and then obtain predictions from the black box to get the responses. It is worth noticing that the first step here may encounter the obstacle that, in practice, we do not have the true generative distribution of covariates.

The methods we develop here do not rely on a particular method of generating pseudo-samples. However, in this paper, we employ a kernel smoother of the empirical covariate distribution in order to maintain a close approximation to the original data (this assumes that these data are available). This translates to generating pseudo covariates from observed covariates plus random Gaussian noise. In the case of discrete covariates, we choose a neighboring category with a small probability. The width of the kernels to be employed (i.e. the size of perturbation of the training data) may be chosen based on prior knowledge of where new predictions are to be made, or can be chosen by cross-validation.

When we go further down the approximation tree, the covariate space may be narrowed down by the splits along the path. A feasible covariate generator can thus be produced by only smoothing the empirical distribution of those original samples that have been carried on by this path. We further check the boundary condition to ensure that the covariates we generated agree within the region divided by the splits along the path.

### 3.3 Performance on Simulated Data

We use Random Forests (RF) as the teacher model, which will allow us to develop stopping rules below. However, our calculations for developing a tree do not rely on using RF as a teacher and any other machine learning algorithm could be readily employed.

We experiment our method on a simple simulated dataset to check its behavior. Assume and , and let the covariate distribution Write and let

The generative distribution is intentionally set to be almost tree-structured so the result should reflect our method working under ideal conditions. We do so to avoid extreme cases during our check, while general distributions will be tested on using real datasets.

We compare across three methods: classification trees (CART), Random Forests (RF) and our proposed approximation tree (AppTree). During each replication, we generate 1,000 sample points from above distribution and obtain a standard RF consisting of 100 trees and a 5-layer CART tree. Then we build a 5-layer approximation tree via the algorithm above. The significant level for the split test is set to be 0.1, and the maximal number of pseudo samples at each node is set to be and respectively. For each we have 100 replications. To assess stability, we use the same setting above but fix one RF as an oracle and learn it by an approximation tree 100 times with and respectively.

In order to evaluate predictive accuracy and consistency, we generate new covariates and measure how much the predictions of approximation tree agree with those of the RF. To measure stability, defined in our case as structural uniqueness, we construct multiple approximation trees out of a single RF and examine the variation in their structures. The split test does not always guarantee a consistent pick through multiple trials due to the pseudo-sample randomness, hence we hope to see small variation among all the trees built. We also examine the trees at different depths to capture the variation along the tree growth. In this paper, our interest lies in the consistency of our approximation trees with RF and the stability of their structures. However, we will still compare the predictive accuracy of approximation trees with other models.

Figure 1 shows the predictive accuracy of the three methods on new test points. On average they share similar predictive accuracy; RF has the smallest variance, followed by AppTree. This meets our expectation that AppTree is capable of inheriting stability from the RF after learning how the RF extrapolates. Since the relation between the covariates and the responses is relatively simple, increasing does not significantly improve the performance.

Figure 2 shows the comparison between RF and AppTree in terms of the difference of their predicted class probability, and the disagreement of their class labels. Again the increase of does not bring significant improvement to performance. AppTree has achieved 95% agreement on average with the RF. By expanding the trees to larger sizes the approximation accuracy can still be marginally increased by “overfitting” the RF.

Figure 3 shows the stability of AppTree viewing from its top 4 layers and top 5 layers. It can be seen that by increasing the cap on the maximal number of pseudo samples AppTree can generate, its stability is significantly improved. One unique structure is obtained when , which means that some nodes require points to detect the best split. Two key observations can be made here. Our control of is relatively conservative due to our sequential testing and multiple testing steps. The maximal number of pseudo sample needed may be quite large to detect the best split. Overall, this initial check shows results as we expected.

In the Supplemental Material we report additional empirical studies on both simulated and real data to show how the performance of approximation tree compares with both decision trees and RFs based on prediction accuracy, consistency with the RF (mimicking accuracy), and stability.

## 4 Stopping Rules with Random Forests

The methods above ensure that we can construct a tree that will reliably exhibit the same structure across repetitions of the pseudo-data generating mechanism. Here we present a means of deciding how deep the resulting tree can be grown. Generically, this depth may be chosen by the user, or via test-set performance (the approach taken in Gibbons et al., 2013). However, when Random Forests are used as a teacher, recent results on their statistical properties (Mentch and Hooker, 2016; Wager and Athey, 2017) allow us to determine a stopping rule by asking whether the Random Forests predictions within the current node are statistically different from constant. That is, we ask whether splitting the node further is simply modeling noise. This allows us to annotate each node with a p-value that tests whether there is further signal within that node. We view this as an additional criterion for stopping; intelligibility or performance criteria may suggest stopping even with a small p-value – for example if a class prediction won’t change across the node even if the class probability varies – but that large p-values do indicate that further splits are potentially misleading. To carry this out, we first review the properties of Random Forests which we will use, and then develop our specific testing criteria.

### 4.1 Random Forest Variance

The tests we propose are based on recent results on the asymptotic normality of random forests. These results consider a model in which the forest is a sum of trees, each built using a subsample of size providing the representation

in which is a row, including response, of the original data, is a subset of indices selected at random and is a function that builds a tree using the ’s as data with randomization parameter (used to choose candidate covariates to split on) and makes a prediction for the covariate values . Below we will use the shorthand for the collection . Mentch and Hooker (2016) observed that this can be seen as an infinite-order incomplete U-statistic with a random kernel whose values are asymptotically normal with variance given by

(7) |

with

the covariance of two trees who share the first data points. Similar results are found in Wager and Athey (2017) with differing details on rates and assumptions about . Below we make use of a multivariate extension of these results to predictions at multiple values of in which the are matrices with dimension corresponding to the number of query points.

Estimates for have been proposed in Sexton and Laake (2009); Wager et al. (2014) as well as in the papers mentioned above; see Athey et al. (2016) for demonstrations of consistency of some of these estimates. Here we use query points and employ the infinitesimal jackknife (Wager et al., 2014) to estimate :

(8) |

where

is the empirical covariance over the . We also use the natural estimate

(9) | |||

See Ghosal and Hooker (2018) for the consistency of the multivariate infinitesimal jackknife.

Note that while these particular estimates are specific to bagging-type teachers, the procedures below can be readily modified for any other teacher for which a CLT can be proven and where a variance estimate is available, e.g. Zhou and Hooker (2018).

### 4.2 Testing Procedures

We apply these results to develop a stopping rule via a test that the variability of within the current node is different from chance. To do so, we employ a variant of the procedures in Mentch and Hooker (2017) used to test for structure in Random Forests. Formally, for a node (which designates a region of feature space), we generate data points from the kernel density defined above and for which we have the asymptotic result that

and test the hypothesis that is constant over predictions. Formally, has a Gaussian process limit, so that this test can be viewed as an approximation to the hypothesis that is constant over .

To carry this out, let be the identity matrix and be the matrix whose entries are all 1 and we let be the result of plugging (8) and (9) into (7). Then by standard arguments under the null hypothesis

from which we can obtain a p-value. This is formalized in Algorithm 2.

Our choice of represents a further tuning parameter. Theoretically, larger is always an improvement, but this must be balanced both with computational cost and with instability in the estimate and inversion of which can require building very large numbers of trees. Mentch and Hooker (2017) employed random projections to improve the stability of these tests for large . Here we take a simpler, though more conservative, approach of generating several sets of points and averaging the -values; in our studies below we average the -values over 10 sets of query points.

### 4.3 Simulation Example

We demonstrate the behavior of this stopping rule with a simulation example. Here we suppose our features are two dimensional , and are each generated independently by a uniform distribution over and we simulate a response from

to which we add Gaussian noise with standard deviation 1. Thus, the underlying signal can be represented with a two-level tree.

We generated 500 data sets of size 1,000 from this process and trained a Random Forests of 10,000 trees on each using 200 observations for each tree. The large number of trees was chosen to stabilize our variance estimate. We then used 10,000 test points to generate an approximation tree, and example of which is given in Figure 4.

For these data, the approximation trees all split on at the root note. Figure 5 records the -values we calculate at the root node over all 500 data sets, and the combined p-values of the left and right children (since symmetry arguments give these the same distribution). Using a 0.05 threshold, all but 3 of the approximation trees would be chosen to be split at the root node and no trees exhibit further splits.

This gives us confidence that we can distinguish appropriate tree structure when it exists. However, as we observe in our real-world examples below, the underlying structure may not be well approximated by a shallow tree and we may wish to truncate an approximation tree, even if there is remaining signal that can be distinguished from noise.

## 5 Real-world Experiments

Here we present two case-studies in using our approximation tree algorithm. In each of these we first obtain a subsampled Random Forests to predict the response using the randomForest package in R. We then develop a stabilized approximation tree for that forest using up to 500,000 pseudo-samples per split. We also annotate each node with a p-value indicating whether the values of the underlying Random Forests are statistically distinguishable from being constant. For our case studies, we find that we stop due to the maximum tree depth being reached before the observed p-values become large.

### 5.1 CAD-MDD Data

The CAD-MDD data were employed in Gibbons et al. (2013) to shorten diagnostic screening questionnaires for severe depression. The data set contains 836 responses to an 88-question tool along with a clinical diagnosis. The length of this tool represents a significant response burden and a goal is to shorten the tool while retaining diagnostic accuracy by adaptively asking questions as the subject answers them.

Figure 6 presents the dominant tree structure, occurring in 86 out of 100 replications that we obtain out of the approximation tree algorithm, annotated with significance levels for each split. The pseudo-code-formatted decision rule can be followed by adaptively asking patients one question at a time until arriving at a leaf node. Following Gibbons et al. (2013) we improve the depression diagnosis from an 88-question survey plus a RF predictor, to a unique adaptive screening tool with at most 5 questions, while retaining 90% consistency. It is now possible for psychiatrists to manually follow the screening tool asking a few questions to perform a quick initial diagnosis for patients without directly going through the 88-question survey, or adaptive data collection can also be automated.

In order to demonstrate that each split we made is indeed capturing true signal instead of random variability, each intermediate tree node is annotated with a p-value returned by Algorithm 2. Full details of the learned approximation tree can be found in the Supplemental Material.

We also note that although we only have 836 original patients, we may still need pseudo points to stabilize a split. One possible reason is that we have many variables and values to choose as a splitting rule, making the best splits indistinguishable and always reaching the pseudo sample size cap . An alternative remedy is to obtain a prior set of fewer splits of interests and split by this set. However, in general we still require a large number of points.

### 5.2 COMPAS Data

COMPAS, abbreviated for Correctional Offender Management Profiling for Alternative Sanctions, is a proprietary score developed to predict recidivism risk. It has been the subject of considerable controversy, stimulating a growing literature on the subject of bias in algorithmic decision making; see Angwin
et al. (2016), Kleinberg et al. (2016) or Tan
et al. (2017) for example. ProPublica analyzed the COMPAS Recidivism Algorithm (Larson
et al., 2016) and released the dataset containing criminal history, jail and prison time, demographics and COMPAS scores for defendants from Broward County, Florida. ^{1}^{1}1https://github.com/propublica/compas-analysis

Our focus here is to build a stable and reliable approximation tree, which may then be investigated for explicit racial bias. We select nine features and reformulate the task as a binary classification problem: recidivism or not. Due to space constraints, Figure 7 shows part of our approximation tree with p-values annotated; see Supplemental Material for details.

## 6 Conclusion

While much attention has been given to ways to understand the result of machine learning algorithms or to provide explanations for particular predictions, there has been relatively little attention given to the statistical stability of those explanations. An explanation for a prediction that might differ due to random chance even when the underlying prediction doesn’t change may be reasonably regarded as spurious. In this paper, we propose a method to ensure the stability of explanations that are given by decision trees that result from model distillation. We have shown that a hypothesis testing framework can be employed to ensure that sufficient pseudo-data is generated to consistently choose the same tree structure. Moreover, when distilling a Random Forests we have used recent results on their statistical properties to design stopping rules to ensure that explanations derived from approximating decision trees reflect consistent signal rather than sample noise.

Our empirical results suggest that while trees can be used as students to recover predictions with high accuracy, stabilizing their structure, and hence providing stable explanations for predictions, can require the generation of much larger pseudo-data sets than is currently common. We have produced evidence that our stopping rules will indeed choose the “right” tree structure, but we observe that in at least two cases trees of more than five levels continue to uncover real signal.

The framework we have employed here has been based on the use of a Gini split criterion and Random Forests. The first of these can readily be replaced with any criterion that can be represented as an average (subject to mild regularity conditions). Our use of Random Forests is motivated by a central limit theorem for the values of Random Forests predictions; the use of these stopping rules with other machine learning teachers requires similar results for them, of which we know few.

It is important to note that split-stabilization and stopping rules are based on two different sources of variability. We control pseudo-data variability in the choice of splits, while variability from the original sample is calculated for our stopping rule. These choices have been made in the interests of presenting a single tree that retains reasonable fidelity to the teacher. We have not assessed whether sample variability might produce a teacher which induces a different approximation tree; this would require accounting for the covariance of predictions within our central limit theorem for the split criterion as well as intelligible means of representing variability in tree structures and we leave this for future work.

## Acknowledgements

This work was partially supported by grants NIH R03DA036683, NSF DMS-1053252 and NSF DEB-1353039. The authors would like to thank Robert Gibbons for providing the CAD-MDD data.

## References

- Angwin et al. (2016) Angwin, J., J. Larson, S. Mattu, and L. Kirchner (2016). Machine bias: Thereâs software used across the country to predict future criminals. and itâs biased against blacks. ProPublica, May 23.
- Athey et al. (2016) Athey, S., J. Tibshirani, and S. Wager (2016). Generalized random forests. arXiv preprint arXiv:1610.01271.
- Augasta and Kathirvalavakumar (2012) Augasta, M. G. and T. Kathirvalavakumar (2012). Reverse engineering the neural networks for rule extraction in classification problems. Neural processing letters 35(2), 131–150.
- Banerjee et al. (2007) Banerjee, M., I. W. McKeague, et al. (2007). Confidence sets for split points in decision trees. The Annals of Statistics 35(2), 543–574.
- Benjamini and Hochberg (1995) Benjamini, Y. and Y. Hochberg (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 289–300.
- Breiman (2001) Breiman, L. (2001). Random forests. Machine learning 45(1), 5–32.
- Breiman et al. (1984) Breiman, L., J. Friedman, C. J. Stone, and R. A. Olshen (1984). Classification and regression trees. CRC press.
- Chouldechova (2017) Chouldechova, A. (2017). Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big data 5(2), 153–163.
- Craven and Shavlik (1995) Craven, M. W. and J. W. Shavlik (1995). Extracting tree-structured representations of trained networks. In NIPS.
- Dunnett (1955) Dunnett, C. W. (1955). A multiple comparison procedure for comparing several treatments with a control. Journal of the American Statistical Association 50(272), 1096–1121.
- Friedman (2001) Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of statistics, 1189–1232.
- Ghosal and Hooker (2018) Ghosal, I. and G. Hooker (2018). Boosting random forests to reduce bias; one-step boosted forest and its variance estimate. arXiv preprint arXiv:1803.08000.
- Gibbons et al. (2013) Gibbons, R. D., G. Hooker, M. D. Finkelman, D. J. Weiss, P. A. Pilkonis, E. Frank, T. Moore, and D. J. Kupfer (2013). The computerized adaptive diagnostic test for major depressive disorder (cad-mdd): a screening tool for depression. The Journal of clinical psychiatry 74(7), 1–478.
- Goldstein et al. (2013) Goldstein, A., A. Kapelner, J. Bleich, and E. Pitkin (2013, 09). Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation. 24.
- He et al. (2012) He, H., J. Eisner, and H. Daume (2012). Imitation learning by coaching. In Advances in Neural Information Processing Systems, pp. 3149–3157.
- Hooker (2007) Hooker, G. (2007). Generalized functional anova diagnostics for high-dimensional functions of dependent variables. Journal of Computational and Graphical Statistics 16(3), 709–732.
- Hu et al. (2018) Hu, L., J. Chen, V. N. Nair, and A. Sudjianto (2018, June). Locally Interpretable Models and Effects based on Supervised Partitioning (LIME-SUP). ArXiv e-prints.
- Johansson and Niklasson (2009) Johansson, U. and L. Niklasson (2009). Evolving decision trees using oracle guides. In Computational Intelligence and Data Mining, 2009. CIDM’09. IEEE Symposium on, pp. 238–244. IEEE.
- Johansson et al. (2010) Johansson, U., C. Sönströd, and T. Löfström (2010). Oracle coached decision trees and lists. In International Symposium on Intelligent Data Analysis, pp. 67–78. Springer.
- Johansson et al. (2011) Johansson, U., C. Sönströd, and T. Löfström (2011). One tree to explain them all. In Evolutionary Computation (CEC), 2011 IEEE Congress on, pp. 1444–1451. IEEE.
- Kleinberg et al. (2016) Kleinberg, J., S. Mullainathan, and M. Raghavan (2016). Inherent trade-offs in the fair determination of risk scores. arXiv preprint arXiv:1609.05807.
- Larson et al. (2016) Larson, J., S. Mattu, L. Kirchner, and J. Angwin (2016). How we analyzed the compas recidivism algorithm. ProPublica (5 2016) 9.
- Lichman (2013) Lichman, M. (2013). UCI machine learning repository.
- Lou et al. (2012) Lou, Y., R. Caruana, and J. Gehrke (2012). Intelligible models for classification and regression. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’12, New York, NY, USA. ACM.
- Lucas et al. (2013) Lucas, D., R. Klein, J. Tannahill, D. Ivanova, S. Brandon, D. Domyancic, and Y. Zhang (2013). Failure analysis of parameter-induced simulation crashes in climate models. Geoscientific Model Development 6(4), 1157–1171.
- Mangasarian et al. (1995) Mangasarian, O. L., W. N. Street, and W. H. Wolberg (1995). Breast cancer diagnosis and prognosis via linear programming. Operations Research 43(4), 570–577.
- Mentch and Hooker (2016) Mentch, L. and G. Hooker (2016). Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. The Journal of Machine Learning Research 17(1), 841–881.
- Mentch and Hooker (2017) Mentch, L. and G. Hooker (2017). Formal hypothesis tests for additive structure in random forests. Journal of Computational and Graphical Statistics 26(3), 589–597.
- Quinlan (1987) Quinlan, J. R. (1987). Generating production rules from decision trees. In IJCAI, Volume 87, pp. 304–307. Citeseer.
- Quinlan (2014) Quinlan, J. R. (2014). C4. 5: programs for machine learning. Elsevier.
- Ribeiro et al. (2016) Ribeiro, M. T., S. Singh, and C. Guestrin (2016). Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. ACM.
- Sexton and Laake (2009) Sexton, J. and P. Laake (2009). Standard errors for bagged and random forest estimators. Computational Statistics & Data Analysis 53(3), 801–811.
- Simonyan et al. (2013) Simonyan, K., A. Vedaldi, and A. Zisserman (2013). Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034.
- Tan et al. (2017) Tan, S., R. Caruana, G. Hooker, and Y. Lou (2017). Detecting bias in black-box models using transparent model distillation. arXiv preprint arXiv:1710.06169.
- Wager and Athey (2017) Wager, S. and S. Athey (2017). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association (just-accepted).
- Wager et al. (2014) Wager, S., T. Hastie, and B. Efron (2014). Confidence intervals for random forests: The jackknife and the infinitesimal jackknife. The Journal of Machine Learning Research 15(1), 1625–1651.
- Zhou and Hooker (2018) Zhou, Y. and G. Hooker (2018). Boulevard: Regularized Stochastic Gradient Boosted Trees and Their Limiting Distribution. ArXiv e-prints.

## Appendix A Additional results on AppTree

### a.1 Real Datasets

In this section we will show the results of our method on eight datasets. Seven of them are available on the UCI repository (Lichman, 2013) and one is the CAD-MDD data used by Gibbons et al. (2013). We manually split each dataset into train and test for cross validation. Table 1 shows the number of covariates, training and testing sample size and the levels of responses for each dataset.

Name | #Cov | #Train | #Test | Response Levels |
---|---|---|---|---|

CAD-MDD | 88 | 500 | 336 | 0,1 |

BreastCancer(Mangasarian et al., 1995) | 30 | 350 | 218 | 0,1 |

Car | 6 | 1000 | 727 | 0,1 |

ClimateModel(Lucas et al., 2013) | 18 | 400 | 140 | 0,1 |

Abalone | 10 | 3133 | 1044 | 0,1,2,3 |

Cardiotocography | 30 | 1126 | 1000 | 0,1,2 |

WineRed | 11 | 1100 | 499 | 0,1,2 |

WineWhite | 11 | 3000 | 1898 | 0,1,2 |

To decide the generative distribution of covariates before running our algorithm, we perturb the empirical distribution by Gaussian noise whose variances are approximately 1/50 of the ranges of corresponding covariates. The probability of jumping to a neighboring category for discrete covariates is set to be 1/7.

We compare across four methods here: classification trees (CART), random forests (RF), our proposed approximation tree (AppTree), and a baseline method (BASE). Previous work (Johansson and Niklasson, 2009; Johansson et al., 2010) fixes the number of pseudo samples from the oracle during the coaching procedure. Analogously, we set BASE to be a non-adaptive version of our AppTree which requests a pseudo sample set from the RF only once at the root node and uses it all the way down. The pseudo sample size is set to be 9 times the size of the training data; larger than the samples used in (Johansson and Niklasson, 2009; Johansson et al., 2010) and is designed as a reasonable blind decision without any prior information.

We use the same setting for all datasets for consistency. For each dataset, we train a RF containing 200 trees, a CART tree, then 100 AppTrees and 100 BASE trees approximating the RF. , which means each node of AppTree can generate at most pseudo samples to decide its split. CART, BASE and AppTree all grow to the 6th layer including the root. Confidence level is set to be 0.1.

### a.2 Binary Classification

Figure 8 shows the evaluation of methods on binary classification datasets. Johansson et al. (2011) has pointed out that a single decision tree is already capable of mimicking an oracle predictor to make highly accurate predictions given the oracle is not overly complicated. Our simulation shows similar results, as all three methods CART, BASE and AppTree tightly follow the ROC curves of the RF and there is no significance difference among them. Consistency is measured as the frequency of a model agreeing with the RF when predicting on same input covariates. We use a moving threshold as the classification boundary, and evaluate the consistency on both the testing data and the new data (as marked “test” and “new” in the plot). First, all three 6-layer trees can approximate the RF with over 80% probability for almost any given classification threshold, and there is no significant difference among them. Further, the behaviors of both “test” and “new” plot seem similar, which lends support to our generative covariate distribution. While the overall 80% consistency may seem not powerful enough to make those trees aligned with the oracle RF, we can build the trees deeper to better “overfit” the RF.

### a.3 Multiclass Classification

Figures 9 and 10 demonstrate evaluate performance on the three multiclass classification datasets. We observe similar patterns as we did in binary cases that all three tree methods work similarly. ROC curves of RF show poorer performance than in binary classification, and ROC curves of the three tree methods are further from random forests, suggesting that deeper trees might be necessary. Nonetheless, all tree methods again agree with the RF on about 80% of the predictions made by RF on the test data. We have therefore shown that our stability request of AppTree does not undermine its predictive power and consistency with the black box when compared with other tree methods.

### a.4 Stability

Stability is the major concern in our simulation study. We measure how many different tree structures BASE and AppTree report out of their 100 replications of approximating the same RF, and count how many times each individual tree structure (both splitting covariate and splitting value) occurs. Table 2 shows the number of different tree structures and number of occurrences of the top 3 frequent structures for both BASE and AppTree on each dataset. Figure 11 and 12 visualize these results.

BASE | AppTree | |||
---|---|---|---|---|

Name | #Struct | Top 3 Cnt | #Struct | Top 3 Cnt |

CAD-MDD | 100 | 1, 1, 1 | 6 | 86, 5, 4 |

BreastCancer | 99 | 2, 1, 1 | 6 | 69, 15, 9 |

Car | 70 | 7, 4, 3 | 41 | 13, 10, 9 |

ClimateModel | 100 | 1, 1, 1 | 4 | 86, 12, 1 |

Abalone | 93 | 4, 2, 2 | 14 | 26, 22, 15 |

Cardiotocography | 100 | 1, 1, 1 | 9 | 68, 12, 5 |

WineRed | 100 | 1, 1, 1 | 26 | 18, 15, 11 |

WineWhite | 100 | 1, 1, 1 | 25 | 30, 28, 5 |

BASE is a non-adaptive version of AppTree that only requests pseudo samples once at the root node. Our simulation setting guarantees that BASE and AppTree have access to the same set of all possible splitting covariates and values. If we compared AppTree with BASE equipped with an enormous amount of pseudo samples at the beginning such that at each node BASE had no fewer sample points than AppTree, we should expect similar behavior between those two methods. However, BASE fails to stabilize the tree structure in our experiment as almost every 6-layer tree it produces has unique structure, whereas AppTree manages to generate a small number of dominant tree structures with a confidence control of . Thus, our adaptive increment of the pseudo sample size significantly contributes to the stability of the decision tree we obtain from the coaching procedure as the approximation tree.

Notice that , which means if we choose and train with with infinitely many pseudo samples, we should have the most dominant 6-layer tree structure occurring about 20 out of 100 replications. Our results on most of the datasets have already attained such stability with and , therefore the control of is relative conservative while the choice of the pseudo sample cap is sufficient. The significance level controls the stability at a split-wise level. It is possible to extend this to further stabilize the tree by again introducing the FWER at the tree level. Notice this procedure may also increase the number of pseudo samples we need at each split.

## Appendix B Approximation Tree for CAD-MDD Data

See table 3 for full details of the approximation tree for CAD-MDD data.

Covariate | Value | Sample_Number | p |
---|---|---|---|

46 | 1.5 | 836 | 4.3e-158 |

53 | 1.5 | 508 | 1.8e-22 |

39 | 1.5 | 474 | 3.3e-08 |

54 | 0.5 | 427 | 9.9e-06 |

40 | 2.5 | 47 | 2.3e-18 |

48 | 2.5 | 34 | 3.0e-13 |

17 | 1.5 | 28 | 4.8e-14 |

14 | 1.5 | 318 | 3.8e-34 |

50 | 2.5 | 101 | 2.7e-13 |

58 | 1.5 | 70 | 4.7e-10 |

27 | 1.5 | 31 | 8.7e-10 |

48 | 2.5 | 227 | 1.7e-18 |

53 | 1.5 | 75 | 2.0e-14 |

35 | 2.5 | 152 | 2.3e-06 |

## Appendix C Approximation Tree for COMPAS Data

The original dataset we downloaded was https://github.com/propublica/compas-analysis/blob/master/compas-scores-two-years.csv. We only keep eight features (sex, age, race, juv_fel_count, juv_misd_count, _other_count, priors_count, c_charge_degree) and one target value (is_recid). There are 7214 training data in total.

See table 4 for full details of the approximation tree for COMPAS data.

Covariate | Value | Sample_Number | p |
---|---|---|---|

priors_count | 2.5 | 7214 | 3.8e-22 |

age | 23.5 | 4387 | 2.1e-46 |

priors_count | 0.5 | 981 | 7.0e-07 |

age | 21.5 | 512 | 3.2e-02 |

is_male | 0.5 | 469 | 4.2e-16 |

age | 31.5 | 3406 | 1.3e-08 |

priors_count | 1.5 | 1400 | 1.4e-04 |

priors_count | 1.5 | 2006 | 6.0e-02 |

age | 33.5 | 2827 | 6.2e-08 |

priors_count | 4.5 | 1460 | 4.9e-05 |

age | 27.5 | 542 | 7.4e-04 |

is_African_American | 0.5 | 918 | 1.5e-02 |

priors_count | 5.5 | 1267 | 3.2e-02 |

age | 36.5 | 571 | 2.1e-02 |

priors_count | 8.5 | 796 | 3.2e-03 |