Quantifying Uncertainty in Random Forests via Confidence Intervals and Hypothesis Tests

# Quantifying Uncertainty in Random Forests via Confidence Intervals and Hypothesis Tests

Lucas Mentch                                         LKM54@CORNELL.EDU
Giles Hooker                                                  GJH27@CORNELL.EDU

Department of Statistical Science
Cornell University
Ithaca, NY 14850, USA
July 3, 2019
###### Abstract

This work develops formal statistical inference procedures for predictions generated by supervised learning ensembles. Ensemble methods based on bootstrapping, such as bagging and random forests, have improved the predictive accuracy of individual trees, but fail to provide a framework in which distributional results can be easily determined. Instead of aggregating full bootstrap samples, we consider predicting by averaging over trees built on subsamples of the training set and demonstrate that the resulting estimator takes the form of a U-statistic. As such, predictions for individual feature vectors are asymptotically normal, allowing for confidence intervals to accompany predictions. In practice, a subset of subsamples is used for computational speed; here our estimators take the form of incomplete U-statistics and equivalent results are derived. We further demonstrate that this setup provides a framework for testing the significance of features. Moreover, the internal estimation method we develop allows us to estimate the variance parameters and perform these inference procedures at no additional computational cost. Simulations and illustrations on a real dataset are provided.

## 1 Introduction

This paper develops tools for performing formal statistical inference for predictions generated by a broad class of methods developed under the algorithmic framework of data analysis. In particular, we focus on ensemble methods – combinations of many individual, frequently tree-based, prediction functions – which have played an important role. We present a variant of bagging and random forests, both initially introduced by Breiman [1996, 2001b], in which base learners are built on randomly chosen subsamples of the training data and the final prediction is taken as the average over the individual outputs. We demonstrate that this fits into the statistical framework of U-statistics, which were shown to have minimum variance by Halmos [1946] and later demonstrated to be asymptotically normal by Hoeffding [1948]. This allows us to demonstrate that under weak regularity conditions, predictions generated by these subsample ensemble methods are asymptotically normal. We also provide a method to consistently estimate the variance in the limiting distribution without increasing the computational cost so that we may produce confidence intervals and formally test feature significance in practice. Though not the focus of this paper, it is worth noting that this subbagging procedure – suggested by Andonova et al. [2002] for use in model selection – was shown by Zaman and Hirose [2009] to outperform traditional bagging in many situations.

We consider a general supervised learning framework in which an outcome is predicted as a function of features by the function . We also allow binary classification so long as the model predicts the probability of success, as opposed to a majority vote, so that the prediction remains real valued. Additionally, we assume a training set consisting of independent examples from the process that is used to produce the prediction function . Throughout the remainder of this paper, we implicitly assume that the dimension of the feature space remains fixed, though nothing in the theory provided prohibits a growing number of features so long as our other explicit conditions on the statistical behavior of trees are met.

Statistical inference proceeds by asking the counterfactual question, “What would our results look like if we regenerated these data.” That is, if a new training set was generated and we reproduced , how different might we expect the predictions to be? To illustrate, consider the hypothesis that the feature does not contribute to the outcome at any point in the feature space:

 H0:∃F1s.t.F(x1,...,xd)=F1(x2,...,xd)∀(x1,...,xd)∈X

A formal statistical test begins by calculating a test statistic and asks, “If our data was generated according to and we generated a new training set and recalculated , what is the probability that this new statistic would be larger than ?” That is, we are interested in estimating . In most fields, a probability of less than 0.05 is considered sufficient evidence to reject the assumption that the data was generated according to . Of course, a 0.05 chance can be obtained by many methods (tossing a biassed coin, for example) so we also seek a statistic such that when is false, we are likely to reject. This probability of correctly rejecting is known as the power of the test, with more powerful tests clearly being more useful.

Here we propose to conduct the above test by comparing predictions generated by and . Before doing so however, we consider the simpler hypothesis involving the value of a prediction:

 H′0:F(x1,…,xp)=f0.

Though often of less scientific importance, hypotheses of this form allow us to generate confidence intervals for predictions. These intervals are defined to be those values of for which we do not have enough evidence to reject . In practice, we choose to be the prediction generated by the ensemble method in order to provide a formalized notion of plausible values of the prediction, which is, of course, of genuine interest. Our results begin here because the statistical machinery we develop will provide a distribution for the values of the prediction. This allows us to address , after which we can combine these tests to address hypotheses like .

Although this form of statistical analysis is ubiquitous in scientific literature, it is worthwhile contrasting this form of analysis with an alternative based on probably approximately correct (PAC) theory, as developed by Vapnik and Chervonenkis [1971] and Valiant [1984]. PAC theory provides a uniform bound on the difference between the true error and observed training error of a particular estimator, also referred to as a hypothesis. In this framework, is some error of the function which is estimated by based on the data. A bound is then found for where is some class of functions that includes . Since this bound is uniform over , it applies to and we might think of comparing with using such bounds. While appealing, these bounds provide the accuracy of our estimate of but do not account for how the true might change when is reproduced with new training data. The uniformity of these bounds could be used to account for the uncertainty in if it is chosen to minimize over , but this is not always the case, for example, when using tree-based methods. We also expect the same uniformity to make PAC bounds conservative, thereby resulting in tests with lower power than those we develop.

Our analysis relies on the structure of subsample-based ensemble methods, specifically making use of classic -statistic theory. These estimators have a long history (see, for example, original work by Kendall [1938] or Wilcoxon [1945], or Lee [1990] which has a modern overview), frequently focussed on rank-based non-parametric tests, and have been shown to have an asymptotically normal sampling distribution by Hoeffding [1948]. Our application to subsample ensembles requires the extension of these results to some new cases as well as methods to estimate the asymptotic variance, both of which we provide.

U-statistics have traditionally been employed in the context of statistical parameter estimation. From this classical statistical perspective, we treat ensemble-tree methods like bagging and random forests as estimators and thus the limiting distributions and inference procedures we develop are with respect to the expected prediction generated by the ensemble. That is, given a particular prediction point , our limiting normal distributions are centered at the expected ensemble-based prediction at and not necessarily . Such forms of distributional analysis are common in other nonparametric regression settings – see Eubank [1999] Section 4.8, for example. More details on appropriate interpretations of the results are provided throughout the paper, in particular in Section 4.1.

In order to claim that the inference procedures proposed here are asymptotically valid for , the ensemble must consistently predict at a rate of or faster. Though this is the case for many classical estimators, establishing fast uniform rates of convergence for tree-based ensembles has proven extremely difficult. Breiman et al. [1984] discuss consistency of general partition-type models in the final chapter of their seminal book in the context of both classification and regression. Biau et al. [2008] restrict their attention to classification, but prove consistency of certain idealized bagging and random forest estimators, provided the individual trees are consistent. This paper also discusses a more general version of bagging, where the samples used to construct individual base learners may be proper subsamples of the training set taken with replacement as opposed to full bootstrap samples, so as to include the subbagging approach. Biau [2012] further examines the consistency of random forests and investigates their behavior in the presence of a sparse feature space. Recently, Denil et al. [2013] proved consistency for a mathematically tractable variant of random forests and in some cases, achieved empirical performance on par with the original random forest procedure suggested by Breiman. Zhu et al. [2015] prove consistency for their Reinforcement Learning Trees, where embedded random forests are used to decide splitting variables, and achieve significant improvements in empirical MSE for some datasets. However, no rates of convergence have been developed that could be applied to analyze the ensemble methods we consider here.

Beyond these consistency efforts, mathematical analyses of ensemble learners has been somewhat limited. Sexton and Laake [2009] propose estimating the standard error of bagged trees and random forests using jackknife and bootstrap estimators. Recently, Wager et al. [2014] proposed applying the jackknife and infinitesimal jackknife procedures introduced by Efron [2014] for estimating standard errors in random forest predictions. Chipman et al. [2010] have received significant attention for developing BART, a Bayesian “sum-of-trees” statistical model for the underlying regression function that allows for pointwise posterior inference throughout the feature space as well as estimates for individual feature effects. Recently, Bleich et al. [2014] extended the BART approach by suggesting a permutation-based approach for determining feature relevance and by introducing a procedure to allow variable importance information to be reflected in the prior.

The layout of this paper is as follows: we demonstrate in Section 2 that ensemble methods based on subsampling can be viewed as U-statistics. In Section 3 we provide consistent estimators of the limiting variance parameters so that inference may be carried out in practice. Inference procedures, including a test of significance for features, are discussed in Section 4. Simulations illustrating the limiting distributions and inference procedures are provided in Section 5 and the inference procedures are applied to a real dataset provided by Cornell University’s Lab of Ornithology in Section 6.

## 2 Ensemble Methods as U-statistics

We begin by introducing the subbagging and subsampled random forest procedures that result in estimators in the form of U-statistics. In both cases, we provide an algorithm to make the procedure explicit.

### 2.1 Subbagging

We begin with a brief introduction to U-statistics; see Lee [1990] for a more thorough treatment. Let where is the parameter of interest and suppose that there exists an unbiased estimator of that is a function of arguments. Then we can write

 θ=Eh(Z1,...,Zk)

and without loss of generality, we may further assume that is permutation symmetric in its arguments since any given may be replaced by an equivalent permutation symmetric version. The minimum variance unbiased estimator for is given by

 Un=1(nk)∑(i)h(Zi1,...,Zik) (1)

where the sum is taken over all subsamples of size and is referred to as a U-statistic with kernel of rank . When both the kernel and rank remain fixed, Hoeffding [1948] showed that these statistics are asymptotically normal with limiting variance where

 ζ1,k=cov(h(Z1,...,Zk),h(Z1,Z′2,...,Z′k)) (2)

and . The 1 in the subscript comes from the fact that there is 1 example in common between the two subsamples. In general, denotes a covariance in the form of (2) with examples in common.

Given infinite computing power and a consistent estimate of , Hoeffding’s original result is enough to produce a subbagging procedure with asymptotically normal predictions. Suppose that as our training set, we observe where is a vector of features and is the response. Fix and let be a subsample of the training set. Given a feature vector where we are interested in making a prediction, we can write the prediction at generated by a tree that was built using the subsample as a function from to . Taking all subsamples, building a tree and predicting at with each, we can write our final subbagged prediction at as

 bn(x∗)=1(nk)∑(i)Tx∗((Xi1,Yi1),...,(Xik,Yik)). (3)

by averaging the tree-based predictions. Treating each ordered pair as one of inputs into the function , the estimator in (3) is in the form of a U-statistic since tree-based estimators produce the same predictions independent of the order of the training data. Thus, provided the distribution of predictions at has a finite second moment and , the distribution of subbagged predictions at is asymptotically normal. Note that in this context, is the covariance between predictions at generated by trees trained on datasets with 1 sample in common.

Of course, building trees is compuationally infeasible for even moderately sized training sets and an obvious substantial improvement in computationally efficiency can be achieved by building and averaging over only trees. In this case, the estimator in (3), appropriately scaled, is called an incomplete U-statistic. When the subsamples are selected uniformly at random with replacement from the possibilities, the resulting incomplete U-statistic remains asymptotically normal; see Janson [1984] or Lee [1990] page 200 for details.

Though more computationally efficient, there remains a major shortcomming with this approach: the number of samples used to build each tree, , remains fixed as . We would instead like to grow with so that trees can be grown to a greater depth, thereby presumably producing more accurate predictions. Incorporating this, our estimator becomes

 bn,kn,mn(x∗)=1mn∑(i)Tx∗,kn((Xi1,Yi1),...,(Xikn,Yikn)). (4)

Statistics of this form were discussed by Frees [1989] and called Infinite Order U-statistics (IOUS) in the complete case, when , and resampled statistics in the incomplete case. Specifically, Frees considers the situation where, given an sample and kernel , and and goes on to develop sufficient conditions for consistency and asymptotic normality whenever grows faster than . In contrast, the theorem below introduces a central limit theorem for estimators of the same form as in (4) but with respect to their individual means and covers all possible growth rates of with respect to . In this context, only minimal regularity conditions are required for asymptotic normality. We begin with an assumption on the distribution of estimates for the general U-statistic case.

Condition 1: Let with and define . Then for all ,

 limn→∞1ζ1,kn∫|h1,kn(Z1)|≥δ√nζ1,knh21,kn(Z1)dP=0.

This condition serves to control the tail behavior of the predictions and allows us to satisfy the Lindeberg condition needed to obtain part (i) of Theorem 1 below.

###### Theorem 1

Let and let be an incomplete, infinite order U-statistic with kernel that satisfies Condition 1. Let such that for all and some constant , and let . Then as long as and ,

1. if , then .

2. if , then .

3. if , then .

Condition 1, though necessary for the general U-statistic setting, is a bit obscure. However, in our regression context, when the regression function is bounded and the errors have exponential tails, a more intuitive Lipschitz-type condition given in Proposition 1 is sufficient. Though stronger than necessary, this alternative condition allows us to satisfy the Lindeberg condition and is reasonable to expect of any supervised learning method.

Proposition 1: For a bounded regression function , if there exists a constant such that for all ,

 ∣∣h((X1,Y1),...,(Xkn,Ykn),(Xkn+1,Ykn+1))−h((X1,Y1),...,(Xkn, Ykn),(Xkn+1,Y∗kn+1))∣∣ ≤c∣∣Ykn+1−Y∗kn+1∣∣

where , , and where and are i.i.d. with exponential tails, then Condition 1 is satisfied.

A number of important aspects of these results are worth pointing out. First, note from Theorem 1 that the trees are built with subsamples that are approximately square root of the size of the full training set. This condition is not necessary for the proof, but ensures that the variance of the U-statistic in part converges to 0 as is typically the case in central limit theorems. By maintaining this relatively small subsample size, we can build many more trees and maintain a procedure that is computationally equivalent to traditional bagging based on full bootstrap samples. Also note that no particular assumptions are placed on the dimension of the feature space; the number of features may grow with so long as the stated conditions remain satisfied.

The final condition of Theorem 1, that , though not explicitly controllable, should be easily satisfied in many cases. As an example, suppose that the terminal node size is bounded by so that trees built with larger training sets are grown to greater depths. Then if the form of the response is where has variance , will be bounded below by . Finally, note that the assumption of exponential tails on the distribution of regression errors in Proposition 1 is stronger than necessary. Indeed, so long as , we need only insist that .

The proofs of Theorem and Proposition 1 are provided in Appendix A. The subbagging algorithm that produces asymptotically normal predictions at each point in the feature space is provided in Algorithm 1.

Note that this procedure is precisely the original bagging algorithm suggested by Breiman, but with proper subsamples used to build trees instead of full bootstrap samples. In Section 3, we provide consistent estimators for the limiting variance parameters in Theorem 1 so that we may carry out inference in practice.

We would also like to acknowledge similar work currently in progress by Wager [2014]. Wager builds upon the potential nearest neighbor framework introduced by Lin and Jeon [2006] and seeks to provide a limiting distribution for the case where many trees are used in the ensemble, roughly corresponding to our result in Theorems 1 and 2. The author considers only an idealized class of trees based on the assumptions in Meinshausen [2006] as well as additional honesty and regularity conditions that allow to grow at a faster rate, and demonstrates that when many Monte Carlo samples are employed, the infinitesimal jackknife estimator of variance is consistent and predictions are asymptotically normal. This estimator has roughly the same computational complexity as those we propose in Section 3 and should scale well subject to some additional bookkeeping. In contrast, the theory we provide here takes into account all possible rates of Monte Carlo sampling via the three cases discussed in Theorems 1 and 2 and we provide a consistent means for estimating each corresponding variance.

### 2.2 Random Forests

The distributional results described above for subbagging do not insist on a particular tree building method. So long as the trees generate predictions that satisfy minimal regularity conditions, the experimenter is free to use whichever building method is preferred. The subbagging procedure does, however, require that each tree in the ensemble is built according to the same method.

This insistence on a uniform, non-randomized building method is in contrast with random forests. The original random forests procedure suggested by Breiman [2001b] dictates that at each node in each tree, the split may occur on only a randomly selected subset of features. Thus, we may think of each tree in a random forest as having an additional randomization parameter that determines the eligible features that may potentially be split at each node. In a general U-statistic context, we can write this random kernel U-statistic as

 Uω;n,kn,mn=1mn∑(i)h(ωi)kn(Zi1,...,Zikn) (5)

so that we can write a random forest estimator as

 rn,kn,mn(x∗)=1mn∑(i)T(ωi)x∗,kn((Xi1,Yi1),...,(Xikn,Yikn)).

Due to this additional randomness, random forests and random kernel U-statistics in general do not fit within the framework developed in the previous section so we develop new theory for this expanded class. Suppose and that these randomization parameters are selected independently of the original sample . Consider the statistic

 U∗ω;n,kn,mn=Eω⎛⎝1mn∑(i)h(ωi)kn(Zi1,...,Zikn)⎞⎠

so that . Taking the expectation with respect to , the kernel becomes fixed and hence conforms to the non-random kernel U-statistic theory. Thus, is asymptotically normal in both the complete and incomplete cases, as well as in the complete and incomplete infinite order cases, by Theorem 1. Given this asymptotic normality of , in order to retain asymptotic normality of the corresponding random kernel version, we need only show that

 √n(U∗ω;n,kn,mn−Uω;n,kn,mn)P→0.

We make use of this idea in the proof of the following theorem, which is provided in Appendix A.

###### Theorem 2

Let be a random kernel U-statistic of the form defined in equation (5) such that satisfies Condition 1 and suppose that for all , , and . Then, letting index the subsamples, so long as and

 limn→∞E(h(ω)kn(Zβ1,...,Zβkn)−Eωh(ω)kn(Zβ1,...,Zβkn))2≠∞,

is asymptotically normal and the limiting distributions are the same as those provided in Theorem 1.

Note that the variance parameters and in the context of random kernel U-statistics are still defined as the covariance between estimates generated by the (now random) kernels. Thus, in the specific context of random forests, these variance parameters correspond to the covariance between predictions generated by trees, but each tree is built according to its own randomization parameter and this covariance is taken over as well. The final condition of Theorem 2 that

 limn→∞E(h(ω)kn(Zβ1,...,Zβkn)−Eωh(ω)kn(Zβ1,...,Zβkn))2≠∞,

simply ensures that the randomization parameter does not continually pull predictions from the same subsample further apart as . This condition is satisfied, for example, if the response is bounded and should also be easily satisfied for any reasonable implementation of random forests.

The subsampled random forest algorithm that produces asymptotically normal predictions is provided in Algorithm 2. As with subbagging, this subsampled random forest algorithm is exactly a random forest with subsamples used to build trees instead of full bootstrap samples.

Another random-forest-type estimator based on a crossed design that results in an infinite order generalized U-statistic is provided in Appendix B. However, the above formulation most resembles Breiman’s original procedure and is more computationally feasible than the method mentioned in Appendix B, so we consider only this random kernel version of random forests in the simulations and other work that follows.

## 3 Estimating the Variance

The limiting distributions provided in Theorem 1 depend on the unknown mean parameter as well as the unknown variance parameters and . In order for us to be able to use these distributions for statistical inference in practice, we must establish consistent estimators of these parameters. It is obvious that we can use the sample mean – i.e. the prediction from our ensemble – as a consistent estimate of , but determining an appropriate variance estimate is less straightforward.

In equation (2) of the previous section, we defined as the covariance between two instances of the kernel with shared arguments, so the sample covariance between predictions may serve as a consistent estimator for both and . However, in practice we find that this often results in estimates close to 0, which may then lead to an overall negative variance estimate.

It is not difficult to show - see Lee [1990] page 11 for details - that an equivalent expression for is given by

 ζc,kn=var(E(hkn(Z1,...,Zkn)|Z1=z1,...,Zc=zc)).

To estimate for our tree-based ensembles, we begin by selecting observations , which we refer to as initial fixed points, from the training set. We then select several subsamples of size from the training set, each of which must include , build a tree with each subsample, and record the mean of the predictions at . Let (MC for “Monte Carlo”) denote the number of subsamples drawn so that this average is taken over predictions. We then repeat the process for initial sets of fixed points and take our final estimate of as the variance over the final averages, yielding the estimator

 ^ζc,kn=var(1nMCnMC∑i=1Tx∗,kn(S~z(1),i),...,1nMCnMC∑i=1Tx∗,kn(S~z(n~z),i))

where denotes the set of initial fixed points and denotes the subsample that includes (which is used here as shorthand for the argument to the tree function ). Now, since we assume that the orginal data in the training set is i.i.d., the random variables are also i.i.d. and since the sample variance is a U-statistic, is a consistent estimator. The algorithm for calculating is provided in Algorithm 3. Note that when , each of the subsamples is identical so we need only use which simplifies the estimation procedure for . The procedure for calculating is provided in Algorithm 4.

Choosing the values of and will depend on the situation. The number of iterations required to accurately estimate the variance depends on a number of factors, including the tree building method and true underlying regression function. Of course, ideally these estimation parameters should be chosen as large as is computationally feasible. In our simulations, we find that in most cases, only a relatively small number of initial fixed point sets are needed, but many more Monte Carlo samples are often necessary for accurate estimation. In most cases, we used an of at least 500. Recall that because our trees are built with small subsamples, we can build correspondingly more trees at the same computational cost.

### Internal vs External Estimation

The algorithms for producing the subbagged or subsampled random forest predictions as well as the above algorithms for estimating the variance parameters are all that is needed to perform statistical inference. We can begin with Algorithm 1 or 2 to generate the predictions, followed by Algorithms 3 and 4 to estimate the variance parameters and . This procedure of running these 3 algorithms seperately is what we will refer to as the external variance estimation method, since the the variance parameters are estimated outside of the orginal ensemble. By contrast, we could instead generate the predictions and estimate the variance parameters in one procedure by taking the mean and variance of the predictions generated by the trees used to estimate . Algorithm 5 outlines the steps in this internal variance estimation method.

This internal variance estimation method is more computationally efficient and has the added benefit of producing variance estimates by simply changing the way in which the subsamples are selected. This means that we are able to obtain all parameter estimates we need to conduct inference at no greater computational cost than building the original ensemble. Although Theorems 1 and 2 dictate that the subsamples used in the ensemble be selected uniformly at random, we find that the additional correlation introduced by selecting the subsamples in this way and using the same subsamples to estimate all parameters does not affect the limiting distribution.

## 4 Inference Procedures

In this section, we describe the inference procedures that may be carried out after performing the estimation procedures.

### 4.1 Confidence Intervals

In Section 2, we showed that predictions from subbagging and subsampled random forests are asymptotically normal and in Section 3 we provided consistent estimators for the parameters in the limiting normal distributions. Thus, given a training set, we can estimate the approximate distribution of predictions at any given feature vector of interest . To produce a confidence interval for predictions at , we need only estimate the variance parameters and take quantiles from the appropriate limiting distribution. Formally, our confidence interval is where the lower and upper bounds, and , are the and quantiles respectively of the normal distribution with mean and variance where and are the variance estimates and . This limiting distribution is that given in result of Theorem 1 which is the distribution we recommend using in practice.

As mentioned in the introduction, these confidence intervals can also be used to address hypotheses of the form

 H0:θkn=c H1:θkn≠c.

Formally, we can define the test statistic

 t=^θkn−csd(^θkn)

and reject if is greater than the the quantile of the standard normal. This corresponds to a test with type 1 error rate so that . However, this testing procedure is equivalent to simply checking whether is within the calculated confidence interval: if is in the confidence interval, then we fail to reject this hypothesis that the true mean prediction is equal to , otherwise we reject.

Finally, recall that these confidence intervals are for the expected prediction and not necessarily for the true value of the underlying regression function . If the tree building method employed is consistent so that , then as the sample size increases, the tree should be (on average) producing more accurate predictions, but in order to claim that our confidence intervals are asymptotically valid for , we need for this convergence to occur at rate of or faster. However, in general, the rate of convergence will depend on not only the tree-building method and true underlying regression function, but also on the location of the prediction point within the feature space.

Note that Theorems 1 and 2 apply not only to tree-based ensembles, but to any estimator that can be written in the form of an infinite order U-statistic, as in equation (4). Some of these ensembles may be straightforward to analyze, but for others, such as random forest predictions near the edge of the feature space, it may be difficult to establish a universal rate of consistency. However, even when -consistency cannot be guaranteed, these intervals still provide valuable information not currently available with existing tools. In these cases, the confidence interval provides a reasonable range of values for where the prediction might fall if the ensemble was recomputed using a new training set; areas of the feature space where confidence intervals are relatively large indicate regions where the ensemble is particularly unstable.

Compare this, for example, to the standard approach of withholding some (usually small) portion of the training set and comparing predictions made at these hold-out points to the corresponding known responses. Such an approach provides some information as to the accuracy of the learner at specific locations throughout the feature space, but says nothing about the stability of these predictions. Thus, instead of relying only on measures of overall goodness-of-fit such as MSE or SSE, these intervals allow users to investigate prediction variability at particular points or regions and in this sense, can be seen as a measure of how much the accuracy of predictions at that point is due to chance.

### 4.2 Tests Of Significance

The limiting distributions developed in Theorems 1 and 2 also allow us a way to test the significance of features. In many situations, data are recorded for a large number of features but a sparse true regression structure is suspected. Suppose that the training set consists of features, and consider a reduced set . Let be a set of feature vectors where we are interested in making predictions. Also, let denote the function that maps feature vectors to their corresponding true mean prediction and let denote the same type of function that maps from the reduced feature space. That is, for a particular prediction point of interest , is the true mean prediction generated by trees built using the full feature space, and is the true mean prediction generated by trees that are only permitted to utilize features in the reduced set. Then, for each test point , we would like to know whether so that we can determine the predictive influence of features not in . More formally, we would like to test the hypothesis

 H0:g(xi)=g(R)(xi)∀xi∈x\tiny TEST (6) H1:g(xi)≠g(R)(xi)for somexi∈x\tiny TEST.

Rejecting this null hypothesis means that a feature not in the reduced feature space makes a significant contribution to the prediction at at least one of the test points.

To perform this test with a training set of size , we take subsamples, each of size , and build a tree with each subsample. Denote these subsamples and for a given feature vector , let denote the average over the predictions at generated from the trees. Then, using the same subsamples , again build a tree with each, but using only those features in , and let be the average prediction at generated by these trees. Finally, define the difference function

 ^D(xi)=^g(xi)−^g(R)(xi)

as the difference between the two ensemble predictions. Note that we can write

 ^D(xi) =^g(xi)−^g(R)(xi) =1mn∑(j)Txi,kn(Sj)−1mn∑(j)T(R)xi,kn(Sj) =1mn∑(j)(Txi,kn(Sj)−T(R)xi,kn(Sj))

so that this difference function is a U-statistic. Thus, if we have only a single test point of interest, is asymptotically normal, so is asymptotically and we can use as a test statistic.

However, it is more often the case that we have several test points of interest. In this case, define to be the vector of observed differences in the predictions

 ^D=(^D(x1),...,^D(xN))

so that, provided a joint distribution exists with respect to Lebesgue measure, has a multivariate normal distribution with mean vector

 μ=(g(x1)−g(R)(x1),...,g(xN)−g(R)(xN))T

which we estimate with

 ^μ=^DT

as well as a covariance matrix . This covariance matrix has parameters and , the multivariate analogues of and . Consistent estimators for these multivariate parameters can be obtained by simply replacing the variance calculation in Algorithms 3 and 4 with a covariance. For clarity, the procedure for obtaining is provided in Algorithm 6 in Appendix C.

Finally, combining these predictions to form a consistent estimator we have that

 ^μT^Σ−1^μ∼χ2N

under . Thus, in order to test the hypothesis in (6), we compare the test statistic to the quantile of the distribution to produce a test with type 1 error rate . If our test statistic is larger than this critical value, we reject the null hypothesis.

### Further Testing Procedures

This setup, though straightforward, may not always definitively decide the significance of features. In some cases, even randomly generated features that are unrelated to the response can be reported significant. Depending on the building method, tree-based algorithms may take advantage of additional randomness in features even when the particular values of those features do not directly contribute to the response. For this reason, we also recommend repeating the testing procedure by comparing predictions generated using the full dataset to predictions generated by a dataset with randomly generated values – commonly obtained by permuting the values in the training set – for the features not in the reduced feature set to test hypostheses of the form

 H0:g(xi)=g(RAND)(xi)∀xi∈x\tiny TEST H1:g(xi)≠g(RAND)(xi)for some% xi∈x\tiny TEST.

The testing procedure remains exactly the same except that to calculate the second set of trees, we simply substitute the reduced training set for a training set with the same number of features, but with randomized values taking the place of the original values for the additional features. Rejecting this null hypothesis allows us to conclude that not only do the additional features not in the reduced training set make a significant contribution to the predictions, but that the contribution is significantly more than could be obtained simply by adding additional randomness.

There are also two additional tests that may be performed. First, we can test whether predictions generated by a training set with randomized values for the additional features are significantly different from predictions generated by the reduced feature set. If a significant difference is found, then the trees in the ensemble are making use of the additional randomness or possibly an accidental structure in the randomized features. As a final check, we can compare predictions generated by two training sets, each with randomized values for the features not in the reduced set. In the unlikely event that a significant difference is found between these predictions, it is again likely due to an accidental structure in the randomized values. Both of these tests can be performed in exactly the same fashion by substituting the appropriate training sets.

## 5 Simulations

We present here a small simulation study in order to illustrate the limiting distributions derived in Section 2 and also to demonstrate the inference procedures proposed in the previous section. We consider two different underlying regression functions:

1. ;

2. ;

The first function corresponds to simple linear regression (SLR) and was chosen for simplicity and ease of visualization. The second was initially considered by Friedman [1991] in development of the Multivariate Adaptive Regression Spline (MARS) procedure and was recently investigated by Biau [2012]. In each case, features were selected uniformly at random from the feature spaces and responses were sampled from , where , to form the training sets.

### Limiting Distributions

We begin by illustrating the distributions of subbagged predictions. In the SLR case, predictions were made at and in the MARS case, predictions were made at . The histograms of subbagged predictions are shown in Figure 1. Each histogram is comprised of 250 simulations.

For each histogram, the size of the training set , number of subsamples , and size of each subsample , is provided in the title. Each tree in the ensembles was built using the rpart function in R, with the additional restriction that at least 3 observations per node were needed in order for the algorithm to consider splitting on that node. Overlaying each histogram is the density obtained by estimating the parameters in the limiting distribution. In each case, we take the limiting distribution to be that given in result (ii) of Theorem 1; namely that the predictions are normally distributed with mean and variance .

The mean was estimated as the empirical mean across the 250 subbagged predictions. To estimate , 5000 new subsamples of size were selected and with each subsample, a tree was built and used to predict at and was taken as the empirical variance between these predictions. To estimate , we follow the procedure in Algorithm 3 with and in the SLR cases and with and in the MARS cases. Note that since we are only interested in verifying the distributions of predictions, the variance parameters are estimated only once for each case and not for each ensemble.

It is worth noting that the same variance estimation procedure with and lead to an overestimate of the variance, so we reiterate that using a large seems to provide better results, even when is relatively small. In each case, we use as a plug-in estimate for . We also repeated this procedure and generated the distribution of predictions according to the internal variance estimation method described in Algorithm 5. Details and histograms are provided in Appendix D. These distributions appear to be the same as when the subsamples are selected uniformly at random, as in the external variance estimation method.

Note that the distributional results in Theorem 1 require , so in practice, the subsample size should be small relative to . In the above simulations, we choose slightly larger than and the distributions appear normally distributed with the correct limiting distribution. However, though this restriction on the growth rate of the subsample size is sufficient for asymptotic normality, it is perhaps not necessary. In our simulations, we found that ensembles built with larger are still approximately normal, but begin to look increasingly further from normal as increases. The histograms in Figure 2 show the distribution of subbagged predictions in the MARS case with and and also with so that we are using full bootstrap samples to build the ensembles in the latter case. The parameters in the limiting distribution are estimated in exactly the same manner as with the smaller for the case where . In the bootstrap case, we cannot follow our subbagging procedure exactly since the bootstrap samples used to build each tree in the ensemble must be taken with replacement, so we do not attempt to estimate the variance. These distributions look less normal and we begin to overestimate the variance in the case where .

### Confidence Intervals

We move now to building confidence intervals for predictions and examine their coverage probabilities. We begin with the SLR case, with , , and and as above, predict at . To build the confidence intervals, we generate 250 datasets and with each dataset, we produce a subbagged ensemble, estimate the parameters in the limiting distribution, and take the 0.025 and 0.975 quantiles from the estimated limiting normal distribution to form an approximate 95% confidence interval. The mean of this limiting normal was estimated as the mean of the predictions generated by the ensemble. The variance parameter was estimated by drawing 500 new subsamples, not necessarily used in the ensemble, and calculating the variance between predictions generated by the resulting 500 trees and was estimated externally using and .

In order to assess the coverage probability of our confidence intervals, we first need to estimate the true mean prediction at that would be generated by this subbagging ensemble. To estimate this true mean, we built 1000 subbagged ensembles and took the mean prediction generated by these ensembles, which we found to be 20.02 - very close to the true underlying value of 20. In this case, we found a coverage probability of 0.912, which means that 228 of our 250 confidence intervals contained our estimate of the true mean prediction. These confidence intervals are shown in Figure 3. The horizontal line in the plot is at 20.02 and represents our estimate of the true expected prediction.

This same procedure was repeated for the SLR case with and , the MARS case with and , and the MARS case with and . In each case, we produced 250 confidence intervals and the coverage probabilities are shown in Table 1. The parameters in the limiting distributions were estimated externally in exactly the same fashion, using and to estimate . These slightly higher coverage probabilities mean that we are overestimating the variance, which is likely due to smaller values of the estimation parameters and being used to estimate .

We also repeated this procedure for generating confidence intervals using the internal variance estimation method. The resulting coverage probabilities are remarkably similar to the external variance estimation method and are shown in Table 1. These ensembles were built using and .

### Hypothesis Testing

We now explore the hypothesis testing procedure for assessing feature significance. We focus on the MARS case, where our training set now consists of 6 features , but the response depends only on the first 5. The values of the additional feature are sampled uniformly at random from the interval and independently of the first 5 features.

We begin by looking at the distribution of test statistics when the test set consists 41 equally spaced points between 0 and 1. That is, the first test point is , the second is , and so on so that the last test point is . For this test, we are interested in looking at the difference between trees built using all features and those built using only the first 5 so that in the notation in Section 4.2, . We ran 250 simulations with , , and using a test set consisting of all 41 test points, the 20 central-most points, and the 5 central-most points. The parameter was estimated externally using and and was estimated by taking the covariance of the difference in predictions generated by 5000 trees. These covariance parameters are estimated only once instead of within each ensemble since we are only interested in the distribution of test statistics. Histograms of the resulting test statistics along with an overlay of the estimated densities are shown in the top row of Figure 4.

We repeated this procedure, this time with a test set consisting of points in the center of the hypercube so we are not predicting close to any edge of the feature space. The value of each feature in the test set was selected uniformly at random from . A total of 41 such test points were generated, and histograms of the 250 resulting test statistics were produced in the cases where we use 5 of these test points, 20 test points, and all 41 test points. These histograms and estimated densities are shown in the bottom row of Figure 4. Note that the bottom row appears to be a better fit and thus there appears to be some bias occuring when test points are selected near the edges of the feature space.

To check the alpha level of the test – the probability of incorrectly rejecting the null hypothesis – we simulated 250 new training sets and used the test set consisting of 41 randomly selected central points. For each training set, we built full and reduced subbagged estimates, allowed and not allowed to utilize respectively, estimated the parameters, and performed the hypothesis test. For each ensemble, the variance parameter was estimated externally using and and was estimated externally on an independently generated set of trees.

In this setup, none of the 250 simulations resulted in rejecting the null hypothesis, so our empirical alpha level was 0. A histogram of the resulting test statistics is shown on the left of Figure 5; the critical value, the 0.95 quantile of the , is 56.942. Thus, as with the confidence intervals, we are being conservative. Recall that our confidence interval simulations with , , and predicting at captured our true value 99.6% of the time, so this estimate of 0, though conservative, is not necessarily unexpected. We also repeated this procedure using an internal variance estimate with and and found an alpha level of 0.14. The histogram of test statistics resulting from the internal variance estimation method is shown on the right in Figure 5. Here we see that the correlation introduced by not taking an i.i.d. selection of subsamples to build the ensemble may be slightly inflating the test statistics.

### Random Forests

Thus far, our simulations have dealt only with subbagged ensembles, but recall that Theorem 2 established the asymptotic normality for predictions generated by random forests as well. The histograms in Figure 6 show the distribution of predictions generated by subsampled random forests at when the true underlying function is the MARS function. These trees were grown using the randomForest function in R with the ntree argument set to 1. At each node in each tree, 3 of the 5 features were selected at random as candidates for splits and we insisted on at least 2 observations in each terminal node. The histograms show the empirical distribution of 250 subsampled random forest predictions and the overlaid density is the limiting normal with the variance parameters estimated externally. Our estimate of the mean of this distribution was taken as the empirical mean of the 250 predictions. Our estimate of was taken as the empirical variance of predictions across 5000 new trees and the estimate for was calculated with