Generalized Random Forests

Susan Athey
athey@stanford.edu
   Julie Tibshirani
jtibs@cs.stanford.edu
   Stefan Wager
swager@stanford.edu
Current version July 2019  
Abstract

We propose generalized random forests, a method for non-parametric statistical estimation based on random forests (Breiman, 2001) that can be used to fit any quantity of interest identified as the solution to a set of local moment equations. Following the literature on local maximum likelihood estimation, our method operates at a particular point in covariate space by considering a weighted set of nearby training examples; however, instead of using classical kernel weighting functions that are prone to a strong curse of dimensionality, we use an adaptive weighting function derived from a forest designed to express heterogeneity in the specified quantity of interest. We propose a flexible, computationally efficient algorithm for growing generalized random forests, develop a large sample theory for our method showing that our estimates are consistent and asymptotically Gaussian, and provide an estimator for their asymptotic variance that enables valid confidence intervals. We use our approach to develop new methods for three statistical tasks: non-parametric quantile regression, conditional average partial effect estimation, and heterogeneous treatment effect estimation via instrumental variables. A software implementation, grf for R and C++, is available from CRAN.

1 Introduction

Random forests, introduced by Breiman (2001), have become one of the most popular methods in applied statistical learning.*

*We are grateful for helpful comments from several colleagues; in particular, we are indebted to Jerry Friedman for first recommending we take a closer look at splitting rules for quantile regression forests, to Will Fithian for drawing our attention to connections between our early ideas and gradient boosting, to Guido Imbens for suggesting the local centering scheme discussed in Section 6.1.1, to two anonymous referees, and to seminar participants at the Atlantic Causal Inference Conference, the California Econometrics Conference, Ca' Foscari University of Venice, Columbia, Cornell, the Econometric Society Winter Meetings, EPFL, the European University Institute, INFORMS, Kellogg, the Microsoft Conference on Digital Economics, the MIT Conference on Digital Experimentation, Northwestern, Toulouse, Triangle Computer Science Distinguished Lecture Series, University of Chicago, University of Illinois Urbana–Champaign, University of Lausanne, and the USC Dornsife Conference on Big Data in Economics.

From a conceptual perspective, random forests are usually understood as a practical method for non-parametric conditional mean estimation: Given a data-generating distribution for pairs $(X_i, Y_i) \in \mathcal{X} \times \mathbb{R}$, forests are used to estimate

\[ \mu(x) = \mathbb{E}\big[ Y_i \,\big|\, X_i = x \big]. \qquad (1) \]

Several theoretical results are available on the asymptotic behavior of variants of such forest-based estimates $\hat{\mu}(x)$, including consistency (Arlot and Genuer, 2014; Biau et al., 2008; Biau, 2012; Denil et al., 2014; Lin and Jeon, 2006; Scornet et al., 2015; Wager and Walther, 2015) and asymptotic normality (Mentch and Hooker, 2016; Wager and Athey, 2017).

The goal of our paper is to generalize Breiman's random forests, and to develop a forest-based method for estimating any quantity identified via local moment conditions. Specifically, given data $(X_i, O_i) \in \mathcal{X} \times \mathcal{O}$, we take the target $\theta(x)$ to be defined via a local estimating equation of the form

\[ \mathbb{E}\big[ \psi_{\theta(x),\,\nu(x)}(O_i) \,\big|\, X_i = x \big] = 0 \quad \text{for all } x \in \mathcal{X}, \qquad (2) \]

where $\psi_{\theta,\,\nu}(\cdot)$ is some scoring function and $\nu(x)$ is an optional nuisance parameter. This setup encompasses several key statistical problems. For example, if we model the distribution of $O_i$ conditionally on $X_i$ as having a density $f_{\theta(x),\,\nu(x)}(\cdot)$ then, under standard regularity conditions, the moment condition (2) with $\psi_{\theta(x),\,\nu(x)}(O) = \nabla \log\big(f_{\theta(x),\,\nu(x)}(O)\big)$ identifies the local maximum likelihood parameters $(\theta(x),\,\nu(x))$. More generally, we can use moment conditions of the form (2) to identify conditional means, conditional quantiles, conditional average partial effects, etc. Our main substantive application of generalized random forests involves heterogeneous treatment effect estimation via instrumental variables.
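To make the estimating-equation notation concrete, here is a minimal sketch of two score functions of the form $\psi_{\theta}(O)$: one identifying a conditional mean and one identifying a conditional $q$-th quantile (the function names are ours, for illustration only).

```r
# Two examples of score functions psi_theta(O) of the form used in (2).

# Conditional mean: theta(x) solves E[Y_i - theta | X_i = x] = 0.
psi_mean <- function(y, theta) {
  y - theta
}

# Conditional q-th quantile: theta(x) solves E[q - 1{Y_i <= theta} | X_i = x] = 0.
psi_quantile <- function(y, theta, q) {
  q - as.numeric(y <= theta)
}
```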

In developing generalized random forests, our aim is to build a family of non-parametric estimators that inherit the desirable empirical properties of regression forests—such as stability, ease of use, and flexible adaptation to different functional forms as in, e.g., Biau and Scornet (2016) or Varian (2014)—but can be used in the wide range of statistical settings characterized by (2) rather than just in regression problems of the type (1). Our main focus is on addressing the resulting conceptual and methodological challenges, and in establishing formal asymptotic inference results for such forests.

In order to support generic estimating equations as in (2), our method requires several important extensions to the standard regression forests of Breiman (2001). First, while regression forests are typically understood as ensemble methods, i.e., forest predictions $\hat{\mu}(x)$ are written as the average of noisy tree-based predictors $\hat{\mu}_b(x)$,

\[ \hat{\mu}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{\mu}_b(x), \qquad (3) \]

this perspective is not appropriate more generally. Noisy solutions to moment equations as in (2) are generally biased, and averaging as in (3) would do nothing to alleviate the bias.¹

¹In the special case of regression forests, individual trees have low bias but high variance, and so (3) does meaningfully stabilize predictions; see Scornet et al. (2015) or Wager and Athey (2017) for formal statements.

To avoid this issue, we cast forests as a type of adaptive locally weighted estimator that first uses a forest to calculate a weighted set of neighbors for each test point $x$, and then solves a plug-in version of the estimating equation (2) using these neighbors. Section 2.2 gives a detailed treatment of this perspective. This local weighting view of random forests was previously advocated by Meinshausen (2006) in the context of quantile regression, and also underlies theoretical analyses of regression forests (e.g., Lin and Jeon, 2006). In the context of regression, the averaging and weighting views of forests are equivalent; however, once we move to more general settings, the weighting-based perspective will prove to be substantially more effective, and also brings forests closer to the classical literature on local maximum likelihood estimation (Fan and Gijbels, 1996; Loader, 1999; Newey, 1994b; Stone, 1977; Tibshirani and Hastie, 1987).

The second challenge in generalizing forest-based methods is that their success hinges on whether the adaptive neighborhood function obtained via partitioning adequately captures the heterogeneity in the underlying function we want to estimate. Even within the same class of statistical tasks, different types of questions can require different neighborhood functions. For example, suppose that two scientists are studying the effects of a new medical treatment. One wants to know how the treatment affects long-term survival, whereas the other is examining its effect on the length of hospital stays. It is entirely plausible that the neighborhood functions that are helpful in capturing the treatment heterogeneity in each setting would be based on completely different covariates, e.g., a patient’s smoking habits for long-term survival, and the location and size of the hospital for the length of stay.

Thus, each time we apply random forests to a new scientific task, it is important to use rules for recursive partitioning that are able to detect and highlight heterogeneity in the signal the researcher is interested in. In prior work, such problem-specific rules have largely been designed by hand, a labor-intensive task. Although the CART rules of Breiman et al. (1984) have long been popular for classification and regression tasks, there has been a steady stream of papers proposing new splitting rules for other problems, including Athey and Imbens (2016) and Su et al. (2009) for treatment effect estimation, Beygelzimer and Langford (2009) and Kallus (2016) for personalized policy allocation, and Ciampi et al. (1986), Gordon and Olshen (1985), LeBlanc and Crowley (1992), Molinaro et al. (2004) as well as several others for survival analysis (see Bou-Hamad et al. (2011) for a review). Zeileis et al. (2008) propose a method for constructing a single tree for general maximum likelihood problems, where the splitting rule is based on hypothesis tests for improvements in model goodness of fit.

In contrast, we seek a unified, general framework for computationally efficient problem-specific splitting rules, optimized for the primary objective of capturing heterogeneity in a key parameter of interest. In the spirit of gradient boosting (Friedman, 2001), our recursive partitioning method begins by computing a linear, gradient-based approximation to the non-linear estimating equation we are trying to solve, and then uses this approximation to specify the tree-split point. Algorithmically, our procedure reduces to iteratively applying a labeling step where we generate pseudo-outcomes by computing gradients using parameters estimated in the parent node, and a regression step where we pass this labeled data to a standard CART regression routine. Thus, we can make use of pre-existing, optimized tree software to execute the regression step, and obtain high quality neighborhood functions while only using computational resources comparable to those required by standard CART algorithms. In line with this approach, our generalized random forest software package builds on the carefully optimized ranger implementation of regression forest splitting rules (Wright and Ziegler, 2017).

Finally, moment conditions of the form (2) typically arise in scientific applications where rigorous statistical inference is required, and so a generalization of random forests to such problems would not be of much use without accompanying theoretical guarantees. The bulk of this paper is devoted to a theoretical analysis of generalized random forests, and to establishing asymptotic consistency and Gaussianity of the resulting estimates $\hat{\theta}(x)$. We also develop methodology for asymptotic confidence intervals. Our technical analysis is motivated by classical theory for local estimating equations, in particular the results of Newey (1994b), paired with machinery from Wager and Athey (2017) to address the adaptivity of the random forest weighting function.

The resulting generalized random forests present a flexible framework for non-parametric statistical estimation and inference with formal asymptotic guarantees. In this paper, we consider applications to quantile regression, conditional average partial effect estimation and heterogeneous treatment effect estimation with instruments; however, there are many other popular statistical models that fit directly into our framework, including panel regression, Huberized robust regression, models of consumer choice, etc. In order to fit any of these models with generalized random forests, the analyst simply needs to provide the problem-specific routines to calculate gradients of the moment conditions evaluated at different observations in the dataset for the "label" step of our algorithm. Moreover, despite all the required formalism, we emphasize that our method is in fact a proper generalization of regression forests: If we apply our framework to build a forest-based method for local least-squares regression, i.e., we use the moment function $\psi_{\mu(x)}(Y_i) = Y_i - \mu(x)$ in (2), we exactly recover a regression forest.

A high-performance software implementation of generalized random forests, grf for R and C++, is available from CRAN and at https://www.github.com/swager/grf.
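For orientation, a minimal usage sketch with the CRAN grf package is given below; the data-generating process is a toy example of ours, and the exact function signatures should be checked against the package documentation.

```r
# Toy usage sketch of the grf package (installed from CRAN); the data-generating
# process below is a made-up example.
library(grf)

n <- 2000; p <- 10
X <- matrix(runif(n * p, -1, 1), n, p)
W <- rbinom(n, 1, 0.5)                        # randomized binary treatment
Y <- pmax(X[, 1], 0) * W + X[, 2] + rnorm(n)  # treatment effect heterogeneous in X1

# Forest-based quantile regression (Section 3).
qf <- quantile_forest(X, Y, quantiles = c(0.1, 0.5, 0.9))
quantile_hat <- predict(qf, X[1:5, ])

# Forest-based heterogeneous treatment effect estimation.
cf <- causal_forest(X, Y, W)
tau_hat <- predict(cf, X[1:5, ], estimate.variance = TRUE)
```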

1.1 Related Work

The idea of local maximum likelihood (and closely related, local generalized method of moments) estimation has a long history in statistics, with notable contributions from Fan et al. (1998), Newey (1994b), Staniswalis (1989), Stone (1977), Tibshirani and Hastie (1987), Lewbel (2007) and others. In the economics literature, popular applications of these techniques include local linear regression in regression discontinuity frameworks (see Imbens and Lemieux, 2008, for a review), multinomial choice modeling in a panel data setting (e.g., Honoré and Kyriazidou, 2000), and instrumental variables regression (Su et al., 2013). The basic idea is that when estimating parameters at a particular value of covariates, a kernel weighting function is used to place more weight on nearby observations in the covariate space. A challenge facing this approach is that if the covariate space has more than two or three dimensions, the “curse of dimensionality” implies that plain kernel-based methods may not perform well (e.g., Robins and Ritov, 1997).

Our paper takes the approach of replacing the kernel weighting with forest-based weights, that is, weights derived from the fraction of trees in which an observation appears in the same leaf as the target value of the covariate vector. The original random forest algorithm for non-parametric classification and regression was proposed by Breiman (2001), building on insights from the ensemble learning literature (Amit and Geman, 1997; Breiman, 1996; Dietterich, 2000; Ho, 1998). The perspective we take on random forests as a form of adaptive nearest neighbor estimation, however, most closely builds on the proposal of Meinshausen (2006) for forest-based quantile regression. This adaptive nearest neighbors perspective also underlies several statistical analyses of random forests, including those of Arlot and Genuer (2014), Biau and Devroye (2010), and Lin and Jeon (2006).

Meanwhile, our gradient-based splitting scheme draws heavily from a long tradition in the statistics and econometrics literatures of using gradient-based test statistics to detect change points in likelihood models (Andrews, 1993; Hansen, 1992; Hjort and Koning, 2002; Nyblom, 1989; Ploberger and Krämer, 1992; Zeileis, 2005; Zeileis and Hornik, 2007). In particular, Zeileis et al. (2008) consider the use of such methods for model-based recursive partitioning. Our problem setting differs from the above in that we are not focused on running a hypothesis test, but rather seek an adaptive nearest neighbor weighting that is as sensitive as possible to heterogeneity in our parameter of interest; we then rely on the random forest resampling mechanism to achieve statistical stability (Mentch and Hooker, 2016; Scornet et al., 2015; Wager and Athey, 2017). In this sense, our approach is closely related to the gradient boosting algorithm of Friedman (2001), who uses similar gradient-based approximations to guide a greedy, heuristic, non-parametric regression procedure.

Our asymptotic theory relates to an extensive recent literature on the statistics of random forests, most of which focuses on the regression case (Arlot and Genuer, 2014; Biau, 2012; Biau et al., 2008; Biau and Scornet, 2016; Breiman, 2004; Bühlmann and Yu, 2002; Chipman et al., 2010; Denil et al., 2014; Efron, 2014; Geurts et al., 2006; Ishwaran and Kogalur, 2010; Lin and Jeon, 2006; Meinshausen, 2006; Mentch and Hooker, 2016; Samworth, 2012; Scornet et al., 2015; Sexton and Laake, 2009; Wager and Athey, 2017; Wager and Walther, 2015; Zhu et al., 2015). Our present paper complements this body of work by showing how methods developed to study regression forests can also be used to understand estimated solutions to local moment equations obtained via generalized random forests.

Finally we note that the problem we study, namely estimating how a function varies with covariates, is distinct from the problem of estimating a single, low-dimensional parameter—such as an average treatment effect—while controlling for a non-parametric or high-dimensional set of covariates. Recent contributions to the latter include Athey et al. (2016), Belloni et al. (2013), Chernozhukov et al. (2016), Robins et al. (1995), Robins et al. (2008), and van der Laan and Rubin (2006). In particular, Chernozhukov et al. (2016) and van der Laan and Rubin (2006) discuss how traditional machine learning methods like regression forests can be used as sub-components in efficient inference about such low-dimensional parameters.

2 Generalized Random Forests

2.1 Review of Breiman’s Forests

Figure 1: Example of a small regression tree on a sample of size $n$. The examples used to build this tree are of the form $(X_i, Y_i)$, and axis-aligned splits based on the features $X_i$ determine the leaf membership of each training example. In "standard" regression trees as discussed in, e.g., Breiman et al. (1984) or Hastie et al. (2009), the tree predicts by averaging the outcomes $Y_i$ within the relevant leaf; thus, in the example of Figure 1, any test point $x$ would be assigned the average outcome of the training examples falling in the leaf containing $x$.

In standard classification or regression forests as proposed by Breiman (2001), the prediction for a particular test point $x$ is determined by averaging predictions across an ensemble of different trees (Amit and Geman, 1997; Breiman, 1996; Dietterich, 2000; Ho, 1998). Individual trees are grown by greedy recursive partitioning, i.e., we recursively add axis-aligned splits to the tree, where each split is chosen to maximize the improvement to model fit (Breiman et al., 1984); see Figure 1 for an example of a tree. The trees are randomized using bootstrap (or subsample) aggregation, whereby each tree is grown on a different random subset of the training data, and random split selection that restricts the variables available at each step of the algorithm.²

²For an introductory overview of random forests, we recommend the chapter of Hastie et al. (2009) dedicated to the method.

In generalizing random forests beyond the regression and classification contexts, we preserve several core elements of Breiman’s forests—including recursive partitioning, subsampling, and random split selection—but must abandon the idea that our final estimate is obtained by averaging estimates from each member of an ensemble. As discussed below, standard regression forests can equivalently be described as a type of adaptive nearest neighbor estimator, and this alternative perspective is much more amenable to statistical extensions.

2.2 Forest-Based Local Estimation

Suppose now that we have independent and identically distributed data for $n$ samples, indexed $i = 1, \ldots, n$. For each sample, we have access to an observable quantity $O_i$ that encodes information relevant to estimating $\theta(\cdot)$, along with a set of auxiliary covariates $X_i$. In the case of non-parametric regression, this observable just consists of an outcome $O_i = \{Y_i\}$ with $Y_i \in \mathbb{R}$; in general, however, it will contain richer information. For example, in the case of treatment effect estimation with exogenous treatment assignment, $O_i = \{Y_i, W_i\}$ also includes the treatment assignment $W_i$. Given this type of data, our goal is to estimate solutions to local estimation equations of the form

\[ \mathbb{E}\big[ \psi_{\theta(x),\,\nu(x)}(O_i) \,\big|\, X_i = x \big] = 0 \quad \text{for all } x \in \mathcal{X}, \qquad (4) \]

where $\theta(x)$ is the parameter we care about and $\nu(x)$ is an optional nuisance parameter. This setting is general, and encompasses many important problems in statistics and econometrics.

A popular approach to estimating such functions is to first define some kind of similarity weights $\alpha_i(x)$ that measure the relevance of the $i$-th training example to fitting $\theta(\cdot)$ at $x$, and then fit the target of interest via an empirical version of the estimating equation (Fan et al., 1998; Newey, 1994b; Staniswalis, 1989; Stone, 1977; Tibshirani and Hastie, 1987):

\[ \big( \hat{\theta}(x),\, \hat{\nu}(x) \big) \in \operatorname{argmin}_{\theta,\,\nu} \left\{ \left\| \sum_{i=1}^{n} \alpha_i(x)\, \psi_{\theta,\,\nu}(O_i) \right\|_2 \right\}. \qquad (5) \]

When the above expression has a unique root, we can simply say that $(\hat{\theta}(x), \hat{\nu}(x))$ solves $\sum_{i=1}^{n} \alpha_i(x)\, \psi_{\hat{\theta}(x),\,\hat{\nu}(x)}(O_i) = 0$. The weights $\alpha_i(x)$ used to specify the above solution to the heterogeneous estimating equation are traditionally obtained via a deterministic kernel function, perhaps with an adaptively chosen bandwidth parameter (Hastie et al., 2009). Although methods of the above kind often work well in low dimensions, they can be very sensitive to the curse of dimensionality.
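As a point of reference, here is a minimal sketch of the classical kernel-weighted version of (5) for the conditional mean, using Gaussian kernel weights with a hand-picked bandwidth (all names are ours, for illustration).

```r
# Classical kernel-weighted local estimation of a conditional mean:
# Gaussian kernel weights alpha_i(x), then solve sum_i alpha_i(x) (Y_i - theta) = 0.
kernel_weights <- function(X, x, bandwidth) {
  d2 <- rowSums((X - matrix(x, nrow(X), ncol(X), byrow = TRUE))^2)
  w <- exp(-d2 / (2 * bandwidth^2))
  w / sum(w)
}

local_mean <- function(X, Y, x, bandwidth = 0.5) {
  alpha <- kernel_weights(X, x, bandwidth)
  sum(alpha * Y)  # unique root of the weighted least-squares score
}
```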

Figure 2: Illustration of the random forest weighting function. The rectangles depicted above correspond to terminal nodes in the dendrogram representation of Figure 1. Each tree starts by giving equal (positive) weight to the training examples in the same leaf as our test point $x$ of interest, and zero weight to all the other training examples. Then, the forest averages all these tree-based weightings, and effectively measures how often each training example falls into the same leaf as $x$.

In this paper, we seek to use forest-based algorithms to adaptively learn better, problem-specific weights $\alpha_i(x)$ that can be used in conjunction with (5). As in Meinshausen (2006), our generalized random forests obtain such weights by averaging neighborhoods implicitly produced by different trees. First, we grow a set of $B$ trees indexed by $b = 1, \ldots, B$ and, for each such tree, define $L_b(x)$ as the set of training examples falling in the same "leaf" as $x$. The weights $\alpha_i(x)$ then capture the frequency with which the $i$-th training example falls into the same leaf as $x$:

\[ \alpha_{bi}(x) = \frac{\mathbf{1}\big( \{ X_i \in L_b(x) \} \big)}{|L_b(x)|}, \qquad \alpha_i(x) = \frac{1}{B} \sum_{b=1}^{B} \alpha_{bi}(x). \qquad (6) \]

These weights sum to 1, and define the forest-based adaptive neighborhood of $x$; see Figure 2 for an illustration of this weighting function. There are of course some subtleties in how the sets $L_b(x)$ are defined—in particular, as discussed in Section 2.5 below, our construction will rely on both subsampling and a specific form of sample-splitting to achieve consistency—but at a high level the estimates produced by a generalized random forest are simply obtained by solving (5) with weights (6).
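A minimal sketch of the weighting step (6), assuming leaf memberships for each tree are already available (an n × B matrix for the training examples and a length-B vector for the test point x); the function name is ours.

```r
# Forest weights alpha_i(x) as in (6): average, over trees, of the uniform
# within-leaf weights for the leaf containing the test point x.
forest_weights <- function(train_leaves, test_leaves) {
  # train_leaves: n x B matrix of leaf ids for the training examples
  # test_leaves:  length-B vector of leaf ids of the test point x, one per tree
  n <- nrow(train_leaves)
  B <- ncol(train_leaves)
  alpha <- numeric(n)
  for (b in seq_len(B)) {
    in_leaf <- train_leaves[, b] == test_leaves[b]
    if (any(in_leaf)) {
      alpha[in_leaf] <- alpha[in_leaf] + 1 / sum(in_leaf)
    }
  }
  alpha / B
}
```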

Finally, as claimed above, our present weighting-based definition of a random forest is equivalent to the standard "average of trees" perspective taken in Breiman (2001) for the special case of regression trees. Specifically, suppose we want to estimate the conditional mean function $\mu(x) = \mathbb{E}[Y_i \mid X_i = x]$, which is identified in (4) using the moment function $\psi_{\mu(x)}(Y_i) = Y_i - \mu(x)$. Then, we can use simple algebra to verify that

\[ \hat{\mu}(x) = \sum_{i=1}^{n} \alpha_i(x)\, Y_i = \frac{1}{B} \sum_{b=1}^{B} \sum_{\{i \,:\, X_i \in L_b(x)\}} \frac{Y_i}{|L_b(x)|} = \frac{1}{B} \sum_{b=1}^{B} \hat{\mu}_b(x), \qquad (7) \]

where $\hat{\mu}_b(x)$ is the prediction made by a single CART regression tree, and so we in fact recover the simple averaging definition (3) of a regression forest.
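As a quick numerical check of the equivalence (7), here is a self-contained sketch in which each of three hypothetical trees assigns every training example a leaf id (all values are made up for illustration).

```r
# Numerical check of (7): the forest-weighted average of the outcomes equals the
# average of the per-tree leaf means (assuming every tree's leaf is non-empty).
n <- 12; B <- 3
Y <- rnorm(n)
train_leaves <- matrix(rep_len(1:3, n * B), n, B)  # made-up leaf ids, one column per tree
test_leaves <- c(1, 2, 3)                          # leaf containing x in each tree

alpha <- rowMeans(sapply(seq_len(B), function(b) {
  in_leaf <- train_leaves[, b] == test_leaves[b]
  in_leaf / sum(in_leaf)
}))
mu_weighted <- sum(alpha * Y)
mu_trees <- mean(sapply(seq_len(B), function(b) mean(Y[train_leaves[, b] == test_leaves[b]])))
all.equal(mu_weighted, mu_trees)  # TRUE
```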

2.3 Splitting to Maximize Heterogeneity

Given this setup, we need to build trees that, when combined into a forest, induce weights $\alpha_i(x)$ that lead to good estimates $\hat{\theta}(x)$ of $\theta(x)$. The main substantive difference between the random forests of Breiman (2001) and other non-parametric regression techniques is their use of recursive partitioning on subsamples to generate these weights $\alpha_i(x)$. Motivated by the empirical success of regression forests across several application areas, our approach here is to mimic the algorithm of Breiman (2001) as closely as possible, while tailoring our splitting scheme to focus on heterogeneity in the target functional $\theta(x)$.

Just like in Breiman's forests, our search for good splits proceeds greedily, i.e., we seek splits that immediately improve the quality of the tree fit as much as possible. Every split starts with a parent node $P \subseteq \mathcal{X}$; given a sample of data $\mathcal{J}$, we define $(\hat{\theta}_P, \hat{\nu}_P)$ as the solution to the estimating equation over the parent, as follows (where we suppress the dependence on the sample $\mathcal{J}$ in cases where it is unambiguous):

\[ \big( \hat{\theta}_P,\, \hat{\nu}_P \big)(\mathcal{J}) \in \operatorname{argmin}_{\theta,\,\nu} \left\{ \left\| \sum_{\{i \in \mathcal{J} \,:\, X_i \in P\}} \psi_{\theta,\,\nu}(O_i) \right\|_2 \right\}. \qquad (8) \]

We would like to divide $P$ into two children $C_1, C_2 \subseteq \mathcal{X}$ using an axis-aligned cut chosen so as to improve the accuracy of our $\theta$-estimates as much as possible. That is, we wish to evaluate a given split using the expected squared error, where the expectation is taken with respect to both the uncertainty that arises when we estimate the child parameters using a random sample $\mathcal{J}$ and a random evaluation point $X$ drawn from within the parent:

\[ \mathrm{err}(C_1,\, C_2) = \sum_{j=1,2} \mathbb{P}\big[ X \in C_j \,\big|\, X \in P \big]\; \mathbb{E}\Big[ \big( \hat{\theta}_{C_j}(\mathcal{J}) - \theta(X) \big)^2 \,\Big|\, X \in C_j \Big], \qquad (9) \]

where the $\hat{\theta}_{C_j}(\mathcal{J})$ are fit over the child nodes in analogy to (8).

Many standard regression tree implementations, such as CART (Breiman et al., 1984), choose their splits by simply minimizing the in-sample prediction error of the node, which corresponds to (9) with plug-in estimators $\hat{\theta}_{C_j}$ from the training sample. Athey and Imbens (2016) study "honest" trees, where one sample is used to select the splits but a distinct, independent sample is used to estimate $\hat{\theta}_{C_j}$; this motivates an explicit treatment of $\hat{\theta}_{C_j}(\mathcal{J})$ as a random variable. They show that for the case of "causal trees" (estimating the effect of a binary treatment), an unbiased, model-free estimate of (9) is available, one which reduces to adjusting the in-sample squared error in treatment effects by a correction that accounts for the way in which the splits affect the variance of the parameter estimates when re-estimated in new samples $\mathcal{J}$, in the spirit of Mallows (1973).

In our setting, however, this kind of direct loss minimization is not an option: If $\theta(x)$ is only identified through a moment condition, then we do not in general have access to unbiased, model-free estimates of the criterion (9). To address this issue, we rely on the following more abstract characterization of our target criterion.

Proposition 1.

Suppose that basic assumptions detailed in Section 4 hold, and that the parent node $P$ has a radius smaller than $r$ for some value $r > 0$. Fix a training sample $\mathcal{J}$. We write $n_P$ for the number of observations in the parent and $n_{C_1}, n_{C_2}$ for the number of observations in each child, and define

\[ \Delta(C_1,\, C_2) := \frac{n_{C_1}\, n_{C_2}}{n_P^2} \big( \hat{\theta}_{C_1}(\mathcal{J}) - \hat{\theta}_{C_2}(\mathcal{J}) \big)^2, \qquad (10) \]

where $\hat{\theta}_{C_1}$ and $\hat{\theta}_{C_2}$ are solutions to the estimating equation computed in the children, following (8). Then, treating the children $C_1$ and $C_2$ as well as the counts $n_{C_1}$ and $n_{C_2}$ as constant, and assuming that $\min\{n_{C_1},\, n_{C_2}\} \to \infty$, we have

\[ \mathrm{err}(C_1,\, C_2) = K(P) - \mathbb{E}\big[ \Delta(C_1,\, C_2) \big] + o\big( r^2 \big), \qquad (11) \]

where $K(P)$ is a deterministic term that measures the purity of the parent node and does not depend on how the parent is split, and the $o(r^2)$-term incorporates terms that depend on sampling variance.

Motivated by this observation, we consider splits that make the above $\Delta$-criterion (10) large. A special case of the above idea also underlies the splitting rule for treatment effect estimation proposed by Athey and Imbens (2016). At a high level, we can think of this $\Delta$-criterion as favoring splits that increase the heterogeneity of the in-sample $\theta$-estimates as fast as possible.
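For intuition, here is a minimal sketch of the exact criterion (10) for a single candidate split in the least-squares case, where the child estimates are simply child means (the function name is ours).

```r
# Exact Delta-criterion (10) for one candidate split, in the least-squares case
# where the child estimates are simply child means of the (pseudo-)outcomes.
delta_criterion <- function(rho, left) {
  # rho:  outcomes of the observations in the parent node
  # left: logical vector, TRUE for observations sent to child C1
  n_p <- length(rho)
  n1 <- sum(left); n2 <- n_p - n1
  (n1 * n2 / n_p^2) * (mean(rho[left]) - mean(rho[!left]))^2
}
```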

Finally, we note that the dominant bias term in (11) is due to the sampling variance of regression trees, and is the same term that appears in the analysis of Athey and Imbens (2016). Including this error term in the splitting criterion may stabilize the construction of the tree, and further it can prevent the splitting criterion from favoring splits that make the model difficult to estimate, for example, splits where there is not sufficient variation in the data to estimate the model parameters within the resulting child leaves.

2.4 The Gradient Tree Algorithm

The above discussion provides some helpful conceptual guidance on how to pick good splits. However, from a computational perspective, actually optimizing the criterion $\Delta(C_1, C_2)$ over all possible axis-aligned splits while explicitly solving for $\hat{\theta}_{C_1}$ and $\hat{\theta}_{C_2}$ in each candidate child using an analogue to (8) may be quite expensive.

To avoid this issue, we instead optimize an approximate criterion $\tilde{\Delta}(C_1, C_2)$ built using gradient-based approximations $\tilde{\theta}_{C_1}$ and $\tilde{\theta}_{C_2}$ for $\hat{\theta}_{C_1}$ and $\hat{\theta}_{C_2}$. For each child $C$, we construct $\tilde{\theta}_C$ as follows: We first compute $A_P$ as any consistent estimate for the gradient of the expectation of the $\psi$-function, i.e., $A_P \approx \nabla\, \mathbb{E}\big[ \psi_{\hat{\theta}_P,\,\hat{\nu}_P}(O_i) \,\big|\, X_i \in P \big]$, and then set

\[ \tilde{\theta}_C = \hat{\theta}_P - \frac{1}{\big| \{ i \,:\, X_i \in C \} \big|} \sum_{\{i \,:\, X_i \in C\}} \xi^{\top} A_P^{-1}\, \psi_{\hat{\theta}_P,\,\hat{\nu}_P}(O_i), \qquad (12) \]

where $\hat{\theta}_P$ and $\hat{\nu}_P$ are obtained by solving (8) once in the parent node, and $\xi$ is a vector that picks out the $\theta$-coordinate from the $(\theta, \nu)$ vector. When the $\psi$-function itself is continuously differentiable, we use

\[ A_P = \frac{1}{\big| \{ i \,:\, X_i \in P \} \big|} \sum_{\{i \,:\, X_i \in P\}} \nabla \psi_{\hat{\theta}_P,\,\hat{\nu}_P}(O_i), \qquad (13) \]

and, in this case, the quantity $\xi^{\top} A_P^{-1}\, \psi_{\hat{\theta}_P,\,\hat{\nu}_P}(O_i)$ corresponds to the influence function of the $i$-th observation for computing $\hat{\theta}_P$ in the parent. Cases where $\psi$ is non-differentiable, e.g., with quantile regression, require more care.

Algorithmically, our recursive partitioning scheme now reduces to alternately applying the following two steps. First, in a labeling step, we compute $\hat{\theta}_P$, $\hat{\nu}_P$, and the derivative matrix $A_P^{-1}$ on the parent data as in (8) and (13), and use them to get pseudo-outcomes

\[ \rho_i = -\xi^{\top} A_P^{-1}\, \psi_{\hat{\theta}_P,\,\hat{\nu}_P}(O_i) \in \mathbb{R}. \qquad (14) \]

Next, in a regression step, we run a standard CART regression split on the pseudo-outcomes $\rho_i$. Specifically, we split $P$ into two axis-aligned children $C_1$ and $C_2$ so as to maximize the criterion

\[ \tilde{\Delta}(C_1,\, C_2) = \sum_{j=1}^{2} \frac{1}{\big| \{ i \,:\, X_i \in C_j \} \big|} \left( \sum_{\{i \,:\, X_i \in C_j\}} \rho_i \right)^{2}. \qquad (15) \]

Finally, once we have executed the regression step, we relabel the observations in each child by solving the estimating equation, and continue on recursively.

To gain more intuition about this method, it is helpful to examine what it does in the simplest case of least-squares regression, i.e., with $\psi_{\theta}(Y_i) = Y_i - \theta$. Here, the labeling step (14) doesn't change anything—we get $\rho_i = Y_i - \bar{Y}_P$, where $\bar{Y}_P$ is the mean outcome in the parent—while the second step maximizing (15) corresponds to the usual way of making splits as in Breiman (2001). In other words, the special structure of the type of problem we are trying to solve is encoded in (14), while the second scanning step is a universal step shared across all different types of forests.
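The following is a minimal sketch of the labeling step (14) for a one-dimensional parameter with no nuisance component, under the assumption that the user supplies a differentiable score function and its derivative; for least squares it reproduces ρ_i = Y_i minus the parent mean. All names are ours.

```r
# Labeling step (14) for a one-dimensional parameter theta (no nuisance parameter),
# assuming a differentiable score psi and its derivative psi_grad are supplied.
label_step <- function(Y, psi, psi_grad, theta_init = mean(Y)) {
  theta <- theta_init
  for (iter in 1:50) {                 # Newton steps for sum_i psi(Y_i, theta) = 0
    step <- sum(psi(Y, theta)) / sum(psi_grad(Y, theta))
    theta <- theta - step
    if (abs(step) < 1e-10) break
  }
  A_P <- mean(psi_grad(Y, theta))      # gradient estimate as in (13)
  rho <- -psi(Y, theta) / A_P          # pseudo-outcomes (14)
  list(theta_parent = theta, rho = rho)
}

# Least squares: psi(y, theta) = y - theta has gradient -1, so rho_i = Y_i - mean(Y).
out <- label_step(rnorm(20),
                  psi = function(y, theta) y - theta,
                  psi_grad = function(y, theta) rep(-1, length(y)))
```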

From a computational perspective, we expect this approach to provide more consistent performance than optimizing (10) at each split directly. When growing a tree, the computation is typically dominated by the split-selection step, and so it is critical for this step to be implemented as efficiently as possible (conversely, the labeling step (14) is only solved once per node, and so is less performance sensitive). From this perspective, using a regression splitting criterion as in (15) is very desirable, as it is possible to evaluate all possible split points along a given feature with only a single pass over the data in the parent node (by representing the criterion in terms of cumulative sums). In contrast, directly optimizing the original criterion (10) may require solving more intricate optimization problems for each possible candidate split.
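To illustrate the single-pass evaluation, here is a minimal sketch of the regression step along one candidate feature, using cumulative sums of the pseudo-outcomes to score every split point; the function name is ours, and a full implementation would loop this over the candidate features.

```r
# Regression step (15) along one feature: a single pass over the sorted data,
# using cumulative sums of the pseudo-outcomes rho to score every split point.
best_split_1d <- function(x, rho, min_node = 5) {
  n <- length(rho)
  best <- list(value = -Inf, cutoff = NA)
  if (n < 2 * min_node) return(best)
  ord <- order(x)
  x_s <- x[ord]; rho_s <- rho[ord]
  left_sum <- cumsum(rho_s)
  total <- left_sum[n]
  for (k in min_node:(n - min_node)) {
    if (x_s[k] == x_s[k + 1]) next  # cannot split between tied feature values
    crit <- left_sum[k]^2 / k + (total - left_sum[k])^2 / (n - k)
    if (crit > best$value) best <- list(value = crit, cutoff = (x_s[k] + x_s[k + 1]) / 2)
  }
  best
}
```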

This type of gradient-based approximation also underlies other popular statistical algorithms, including gradient boosting (Friedman, 2001) and the model-based recursive partitioning algorithm of Zeileis et al. (2008); conceptually, it is closely related to standard techniques for moment-based change-point detection (Andrews, 1993; Hansen, 1992; Hjort and Koning, 2002; Nyblom, 1989; Ploberger and Krämer, 1992; Zeileis, 2005; Zeileis and Hornik, 2007). Finally, in our context, we can verify that the error from using the approximate $\tilde{\Delta}$-criterion (15) instead of the exact $\Delta$-criterion (10) is within the tolerance used to motivate the $\Delta$-criterion in Proposition 1, thus suggesting that our use of (12) to guide splitting may not result in too much inefficiency.

Proposition 2.

Under the conditions of Proposition 1, suppose that the gradient estimate $A_P$ is consistent, i.e., that $A_P \to_p \nabla\, \mathbb{E}\big[ \psi_{\hat{\theta}_P,\,\hat{\nu}_P}(O_i) \,\big|\, X_i \in P \big]$. Then,

\[ \tilde{\Delta}(C_1,\, C_2) = \Delta(C_1,\, C_2) + o_P\!\left( \max\left\{ r^2,\; \frac{1}{n_{C_1}},\; \frac{1}{n_{C_2}} \right\} \right). \qquad (16) \]

Note: All tuning parameters, such as the total number of trees $B$ and the sub-sample size $s$ used in Subsample, are taken as pre-specified. This function is implemented in the package grf for R and C++.

1: procedure GeneralizedRandomForest(set of examples $\mathcal{S}$, test point $x$)
2:     weight vector $\alpha \leftarrow$ Zeros($|\mathcal{S}|$)
3:     for $b = 1$ to total number of trees $B$ do
4:         set of examples $\mathcal{I} \leftarrow$ Subsample($\mathcal{S}$, $s$)
5:         sets of examples $\mathcal{J}_1, \mathcal{J}_2 \leftarrow$ SplitSample($\mathcal{I}$)
6:         tree $\mathcal{T} \leftarrow$ GradientTree($\mathcal{J}_1$)    ⊳ Grows a tree by recursive partitioning, alternating the steps (14) and (15).
7:         $\mathcal{N} \leftarrow$ Neighbors($x$, $\mathcal{T}$, $\mathcal{J}_2$)    ⊳ Returns those elements of $\mathcal{J}_2$ that fall into the same leaf as $x$ in the tree $\mathcal{T}$.
8:         for all example $i \in \mathcal{N}$ do
9:             $\alpha[i] \leftarrow \alpha[i] + 1 / |\mathcal{N}|$
10:    output $\hat{\theta}(x)$, the solution to (5) with weights $\alpha / B$

The function call Zeros($n$) creates a vector of zeros of length $n$; Subsample($\mathcal{S}$, $s$) draws a subsample of size $s$ from $\mathcal{S}$ without replacement; and SplitSample randomly divides a set into two evenly-sized, non-overlapping halves.

The final step (5) can be solved using any numerical estimator. Our implementation grf provides an explicit plug-in point where a user can write a solver for (5) that is appropriate for their $\psi$-function of interest; a sketch of such a solver for the quantile case is given below Algorithm 1.

Algorithm 1 Generalized random forest with honesty and subsampling
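As an illustration of such a plug-in solver, here is a minimal sketch that solves the weighted estimating equation (5) in the quantile case by computing a weighted quantile; the function name is ours, and the grf internals may differ.

```r
# Plug-in solver for (5) in the quantile case: find theta such that
# sum_i alpha_i(x) * (q - 1{Y_i <= theta}) changes sign, i.e., a weighted quantile.
weighted_quantile <- function(Y, alpha, q) {
  ord <- order(Y)
  cum_w <- cumsum(alpha[ord]) / sum(alpha)
  Y[ord][which(cum_w >= q)[1]]  # smallest Y whose cumulative weight reaches q
}

# With uniform weights this is close to an ordinary sample quantile.
Y <- rnorm(1000)
weighted_quantile(Y, rep(1 / 1000, 1000), 0.9)
```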

2.5 Building a Forest with Theoretical Guarantees

Now, given a practical splitting scheme for growing individual trees, we want to grow a forest that allows for consistent estimation of $\theta(x)$ using (5) paired with the forest weights (6). At a high level, we expect each tree to provide small, relevant neighborhoods for $x$ that give us noisy estimates of $\theta(x)$. Then, if every tree has a different small, relevant neighborhood for $x$, we may hope that forest-based aggregation will provide a single larger but still relevant neighborhood for $x$ that yields stable estimates $\hat{\theta}(x)$.

To ensure that forest-based aggregation succeeds in providing such stable, consistent parameter estimates, we rely on two conceptual ideas that have proven to be successful in the literature on forest-based least-squares regression: Training trees on subsamples of the training data (Mentch and Hooker, 2016; Scornet et al., 2015; Wager and Athey, 2017), and a sub-sample splitting technique that we call honesty (Biau, 2012; Denil et al., 2014; Wager and Athey, 2017). Our final algorithm for forest-based solutions to heterogeneous estimating equations is given as Algorithm 1; we refer to Section 2.4 of Wager and Athey (2017) for a more in-depth discussion of honesty in the context of forests.

As we will show in the theoretical analysis in Section 4, assuming regularity conditions, the estimates $\hat{\theta}(x)$ obtained using a generalized random forest as described in Algorithm 1 are consistent for $\theta(x)$. Moreover, given appropriate subsampling rates, we establish asymptotic normality of the resulting forest estimates $\hat{\theta}(x)$.

3 Interlude: Quantile Regression Forests

Before developing the asymptotics of generalized random forests below, we take a brief pause to illustrate how the formalisms described above play out in the case of quantile regression. This problem has also been considered in detail by Meinshausen (2006), who proposed a consistent forest-based quantile regression algorithm; his method also fits into the paradigm of solving estimating equations (5) using random forest weights (6). However, unlike us, Meinshausen (2006) does not propose a splitting rule that is tailored to the quantile regression context, and instead builds his forests using plain CART regression splits. Thus, a comparison of our method with that of Meinshausen (2006) provides a perfect opportunity for evaluating the value of our proposed method for constructing forest-based weights that are specifically designed to express heterogeneity in conditional quantiles.

Recall that, in the language of estimating equations, the $q$-th quantile of the distribution of $Y_i$ conditionally on $X_i = x$ is identified via (4), using the moment function

\[ \psi_{\theta}(Y_i) = q\, \mathbf{1}\big( \{ Y_i > \theta \} \big) - (1 - q)\, \mathbf{1}\big( \{ Y_i \le \theta \} \big). \qquad (17) \]

Plugging this moment function into our splitting scheme, (14) gives us pseudo-outcomes

\[ \rho_i = \mathbf{1}\big( \{ Y_i > \hat{\theta}_P \} \big), \qquad (18) \]

up to a scaling and re-centering that do not affect the subsequent regression split on these pseudo-outcomes. In other words, gradient-based quantile regression trees simply try to separate observations that fall above the $q$-th quantile of the parent from those that fall below it.
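A minimal sketch of this relabeling for a single parent node (the function name is ours):

```r
# Quantile-regression labeling step: pseudo-outcomes (18) for one parent node.
# Observations above the parent's q-th quantile are labeled 1 and the rest 0,
# so the subsequent CART regression split separates the two groups.
quantile_pseudo_outcomes <- function(Y_parent, q) {
  theta_parent <- as.numeric(quantile(Y_parent, probs = q))
  as.numeric(Y_parent > theta_parent)
}
```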

Figure 3: Comparison of quantile regression using generalized random forests and the quantregForest package of Meinshausen (2006), on data with a mean shift (left panel) and a scale shift (right panel). In both cases, we have independent and identically distributed examples where $X_i$ is uniformly distributed on a 40-dimensional hypercube and $Y_i$ is Gaussian conditionally on $X_i$: in the left panel, the conditional mean of $Y_i$ shifts at $X_{i1} = 0$ while its variance is constant, whereas in the right panel the conditional mean is constant and the conditional variance shifts at $X_{i1} = 0$. The other 39 covariates are noise. We estimate the quantiles at $q = 0.1$, $0.5$, and $0.9$.

We compare our method to that of Meinshausen (2006) in Figure 3. In the left panel, we have a mean shift in the distribution of $Y_i$ conditional on $X_i$ at $X_{i1} = 0$, and both methods are able to pick it up as expected. However, in the right panel, the mean of $Y_i$ given $X_i$ is constant, but there is a scale shift at $X_{i1} = 0$. Here, our method still performs well, as our splitting rule targets changes in the quantiles of the $Y$-distribution. However, the method of Meinshausen (2006) breaks down completely, as it relies on CART regression splits that are only sensitive to changes in the conditional mean of $Y_i$ given $X_i$.

We also note that generalized random forests produce somewhat smoother sample paths than the method of Meinshausen (2006); this is due to our use of honesty as described in Section 2.5. If we run generalized random forests without honesty, then our method still correctly identifies the jumps at $X_{i1} = 0$, but has sample paths that oscillate locally just as much as the baseline method.

Finally, the purpose of this example is not to claim that our variant of quantile regression forests built using gradient trees is always superior to the method of Meinshausen (2006), which uses regression-based splitting to obtain the weights $\alpha_i(x)$. Rather, we have shown that, as claimed, our splitting rule is specifically sensitive to quantile shifts in a way that regression splits are not—and, moreover, deriving the splitting rule (18) was fully automatic given our generalized random forest formalism. Of course, there are presumably applications where changes in the conditional mean are good proxies for changes in the conditional quantiles; and, in these cases, we might expect the method of Meinshausen (2006) to work very well. However, if we want to make sure our splitting rule is always sensitive to changes at the quantile of interest, our gradient-based approach may be more robust.

Remark (Estimating many quantiles).

In many cases, we want to estimate multiple quantiles at the same time; for example, in Figure 3, we sought to estimate the quantiles $q = 0.1$, $0.5$, and $0.9$ simultaneously. Estimating different forests for each quantile separately would be undesirable for many reasons: It would be computationally expensive and, moreover, there is a risk that quantile estimates might cross in finite samples due to statistical noise. Thus, we need to build a forest using a splitting scheme that is sensitive to changes at any of our quantiles of interest. Here, we use a simple heuristic inspired by the relabeling transformation (18). Given a set of quantiles of interest, we first evaluate all these quantiles in the parent node, and label the $i$-th point by the interval between consecutive parent quantile estimates that it falls into. Then, we choose the split point using a multiclass classification rule that classifies each observation into one of these intervals; see the sketch below.
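A minimal sketch of this multi-quantile relabeling heuristic (the function name is ours; the subsequent split would use a standard multiclass classification criterion such as the Gini index):

```r
# Relabeling for several quantiles at once: label each observation in the parent
# by the inter-quantile interval it falls into; a multiclass classification split
# (e.g., Gini) is then applied to these labels.
multi_quantile_labels <- function(Y_parent, quantiles = c(0.1, 0.5, 0.9)) {
  cuts <- as.numeric(quantile(Y_parent, probs = sort(quantiles)))
  findInterval(Y_parent, sort(unique(cuts))) + 1L  # labels in 1, ..., K + 1
}
```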

4 Asymptotic Analysis

4.1 Theoretical Setup

We now turn to a formal characterization of generalized random forests, with the aim of establishing asymptotic Gaussianity of the estimates $\hat{\theta}(x)$, and of providing tools for statistical inference about $\theta(x)$. We begin by listing the basic assumptions underlying all of our theoretical results. First, we assume that the covariate space and the parameter space are both subsets of Euclidean space; specifically, we assume that $\mathcal{X} = [0, 1]^p$ and $(\theta, \nu) \in \mathcal{B} \subset \mathbb{R}^k$ for some $p, k \ge 1$, where $\mathcal{B}$ is a compact subset of $\mathbb{R}^k$. Moreover, we assume that the features $X_i$ have a density that is bounded away from 0 and $\infty$; as argued in, e.g., Wager and Walther (2015), this is equivalent to imposing a weak dependence condition on the individual features because trees and forests are invariant to monotone rescaling of the features.

Some practically interesting cases, such as quantile regression, involve discontinuous score functions $\psi$, which makes the analysis considerably more intricate. Here, we follow standard practice, and assume that the expected score function

\[ M_{\theta,\,\nu}(x) := \mathbb{E}\big[ \psi_{\theta,\,\nu}(O_i) \,\big|\, X_i = x \big] \qquad (19) \]

varies smoothly in the parameters, even though $\psi$ itself may be discontinuous. For example, in the case of quantile regression, $\psi$ is discontinuous in $\theta$, but $M_{\theta}(x)$ will be smooth whenever $Y_i$ conditionally on $X_i = x$ has a smooth density.

Assumption 1 (Lipschitz $x$-signal).

For fixed values of $(\theta, \nu)$, we assume that $M_{\theta,\,\nu}(x)$ as defined in (19) is Lipschitz continuous in $x$.

Assumption 2 (Smooth identification).

When $x$ is fixed, we assume that the $M$-function is twice continuously differentiable in $(\theta, \nu)$ with a uniformly bounded second derivative. Moreover, writing the derivative of $M_{\theta,\,\nu}(x)$ at the true parameter values as

\[ V(x) := \left. \frac{\partial}{\partial\,(\theta, \nu)}\, M_{\theta,\,\nu}(x) \right|_{\theta(x),\,\nu(x)}, \qquad (20) \]

we assume that $V(x)$ is invertible for all $x \in \mathcal{X}$.

Our next two assumptions control regularity properties of the $\psi$-function itself. Assumption 3 holds trivially when $\psi$ itself is Lipschitz in $(\theta, \nu)$ (in fact, having $\psi$ be 0.5-Hölder would be enough); it also holds for quantile regression, with the Lipschitz constant referred to in the results below depending on the conditional density of $Y_i$ given $X_i$. Assumption 4 is used to show that a certain empirical process is Donsker, and clearly holds for all our cases of interest; see Section 4.3 for details. Finally, Assumption 5 can be taken to hold with $C = 0$ when $\psi$ is continuous and non-degenerate, and with a positive constant $C$ for quantile regression.

Assumption 3 (Lipschitz $\psi$-variogram).

We assume that the score functions $\psi_{\theta,\,\nu}(O_i)$ have a continuous covariance structure. Writing $\gamma$ for the worst-case variogram

\[ \gamma\!\left( \begin{pmatrix} \theta \\ \nu \end{pmatrix},\; \begin{pmatrix} \theta' \\ \nu' \end{pmatrix} \right) := \sup_{x \in \mathcal{X}} \left\{ \mathrm{Var}\left[ \psi_{\theta,\,\nu}(O_i) - \psi_{\theta',\,\nu'}(O_i) \,\middle|\, X_i = x \right] \right\}, \qquad (21) \]

we assume that this variogram is Lipschitz, i.e., for all $(\theta, \nu),\, (\theta', \nu') \in \mathcal{B}$,

\[ \gamma\!\left( \begin{pmatrix} \theta \\ \nu \end{pmatrix},\; \begin{pmatrix} \theta' \\ \nu' \end{pmatrix} \right) \le L\, \left\| \begin{pmatrix} \theta \\ \nu \end{pmatrix} - \begin{pmatrix} \theta' \\ \nu' \end{pmatrix} \right\|_2 \qquad (22) \]

for some constant $L > 0$.

Assumption 4 (Regularity of $\psi$).

The $\psi$-functions can be written as

\[ \psi_{\theta,\,\nu}(O_i) = \lambda(\theta,\, \nu;\, O_i) + \zeta_{\theta,\,\nu}\big( g(O_i) \big), \]

such that $\lambda$ is Lipschitz-continuous in $(\theta, \nu)$, $g(\cdot)$ is a univariate summary of $O_i$, and $\zeta_{\theta,\,\nu}(\cdot)$ is any family of monotone and bounded functions.

Assumption 5 (Existence of solutions).

We assume that, for any weights $\alpha_i$ with $\sum_i \alpha_i = 1$, the estimating equation (5) returns a minimizer $(\hat{\theta}, \hat{\nu})$ that at least approximately solves the estimating equation:

\[ \left\| \sum_{i=1}^{n} \alpha_i\, \psi_{\hat{\theta},\,\hat{\nu}}(O_i) \right\|_2 \le C \max\{ \alpha_i \} \quad \text{for some constant } C \ge 0. \qquad (23) \]

All the previous assumptions only deal with local properties of the estimating equation, and can be used to control the behavior of $(\hat{\theta}(x), \hat{\nu}(x))$ in a small neighborhood of the population parameter value $(\theta(x), \nu(x))$. Now, to make any use of these assumptions, we first need to verify that $(\hat{\theta}(x), \hat{\nu}(x))$ is consistent. Here, we use the following assumption to guarantee consistency; this setup is general enough to cover both instrumental variables regression and quantile regression.

Assumption 6 (Convexity).

The score function $\psi_{\theta,\,\nu}(O_i)$ is a sub-gradient of a convex function of $(\theta, \nu)$, and the expected score $M_{\theta,\,\nu}(x)$ is the gradient of a strongly convex function.

Finally, our consistency and Gaussianity results require some control on the behavior of the trees comprising the forest. To do so, we follow Wager and Athey (2017), as follows.

Assumption 7 (Regular and honest forest).

We assume that our trees are symmetric, in that their output is invariant to permuting the indices of the training examples. We also assume that the tree makes balanced splits, in the sense that every split puts at least a fraction $\omega$ of the observations in the parent node into each child, for some $\omega > 0$, and take the tree to be randomized in such a way that, at every split, the probability that the tree splits on the $j$-th feature is bounded from below by some $\pi > 0$.³ Finally, we assume that our forest is honest and built via subsampling with subsample size $s$ satisfying $s/n \to 0$ and $s \to \infty$, as described in Section 2.5.

³In our implementation, we satisfy this condition by using the device of Denil et al. (2014), whereby the number of splitting variables considered at each step of the algorithm is random; specifically, we try a random number of variables at each step, drawn as $\max\{\mathrm{Poisson}(m),\, 1\}$ where $m > 0$ is a tuning parameter.

In the interest of generality, we set up Assumptions 1–6 in a rather abstract way. We end this section by showing that, in the context of our main problems of interest, requiring Assumptions 1–6 is not particularly stringent. Further examples that satisfy the above assumptions will be discussed in Sections 6 and 7.

Example 1 (Least squares regression).

In the case of least-squares regression, i.e., $\psi_{\theta}(Y_i) = Y_i - \theta$, Assumptions 2–6 hold immediately from the definition of $\psi$. Meanwhile, Assumption 1 simply means that the conditional mean function $\mathbb{E}[Y_i \mid X_i = x]$ must be Lipschitz in $x$; this is a standard assumption in the literature on regression forests.

Example 2 (Quantile regression).

For quantile regression, Assumption 1 is equivalent to assuming that the conditional exceedance probabilities $\mathbb{P}[Y_i > y \mid X_i = x]$ be Lipschitz-continuous in $x$ for every $y$. Further, Assumption 2 holds if the conditional density $f(y \mid x)$ has a continuous, uniformly bounded first derivative and is bounded away from 0 at the quantile of interest, i.e., at $y = F^{-1}(q \mid x)$, where $F(\cdot \mid x)$ denotes the cumulative distribution function of $Y_i$ conditionally on $X_i = x$; and Assumption 3 holds if $f(y \mid x)$ is uniformly bounded from above. Assumptions 4–6 hold from the definition of $\psi$.

4.2 A Central Limit Theorem for Generalized Random Forests

Given these assumptions, we are now ready to provide an asymptotic characterization of generalized random forests. In doing so, we note that existing asymptotic analyses of regression forests, including Mentch and Hooker (2016), Scornet et al. (2015) and Wager and Athey (2017), were built around the fact that regression forests are averages of regression trees grown over sub-samples, and can thus be analyzed as $U$-statistics (Hoeffding, 1948). Unlike regression forest predictions, however, the parameter estimates provided by generalized random forests are not averages of estimates made by different trees; instead, we obtain $\hat{\theta}(x)$ by solving a single weighted moment equation as in (5). Thus, existing proof strategies do not apply in our setting.

We tackle this problem using the method of influence functions as described by Hampel (1974); in particular, we are motivated by the analysis of Newey (1994b). The core idea of these methods is to first derive a sharp, linearized approximation to the local estimator $\hat{\theta}(x)$, and then to analyze the linear approximation instead.

In our setup, the influence function heuristic motivates a natural approximation to $\hat{\theta}(x)$ as follows. Let $\rho_i^*(x)$ denote the influence function of the $i$-th observation with respect to the true parameter value $\theta(x)$,

\[ \rho_i^*(x) := -\xi^{\top} V(x)^{-1}\, \psi_{\theta(x),\,\nu(x)}(O_i). \qquad (24) \]

These quantities are closely related to the pseudo-outcomes (14) used in our gradient tree splitting rule; the main difference is that, here, the quantities $\rho_i^*(x)$ depend on the unknown true parameter values at $x$ and are thus inaccessible in practice. We use the $*$-superscript to remind ourselves of this fact.

Then, given any set of forest weights $\alpha_i(x)$ used to define the generalized random forest estimate $\hat{\theta}(x)$ by solving (5), we can also define a pseudo-forest

\[ \tilde{\theta}^*(x) := \theta(x) + \sum_{i=1}^{n} \alpha_i(x)\, \rho_i^*(x), \qquad (25) \]

which we will use as an approximation for $\hat{\theta}(x)$. We note that, formally, this pseudo-forest estimate is equivalent to the output of an (infeasible) regression forest with weights $\alpha_i(x)$ and outcomes $\theta(x) + \rho_i^*(x)$.

The upshot of this approximation is that, unlike $\hat{\theta}(x)$, the pseudo-forest $\tilde{\theta}^*(x)$ is a $U$-statistic. More specifically, because $\tilde{\theta}^*(x)$ is a linear function of the pseudo-outcomes $\rho_i^*(x)$, we can write it as an average of pseudo-tree predictions,

\[ \tilde{\theta}^*(x) = \frac{1}{B} \sum_{b=1}^{B} \tilde{\theta}_b^*(x), \qquad \tilde{\theta}_b^*(x) = \theta(x) + \sum_{i=1}^{n} \alpha_{bi}(x)\, \rho_i^*(x). \qquad (26) \]

Then, because each individual pseudo-tree prediction $\tilde{\theta}_b^*(x)$ is trained on a subsample of the training data drawn without replacement (see Section 2.5), the arguments of Mentch and Hooker (2016) or Wager and Athey (2017) can be used to study the averaged estimator $\tilde{\theta}^*(x)$ using classical results about $U$-statistics (Hoeffding, 1948; Efron and Stein, 1981).

Following this proof strategy, the key difficulty is in showing that our influence-based statistic $\tilde{\theta}^*(x)$ is in fact a good approximation for $\hat{\theta}(x)$. The following result, which is the main technical contribution of this paper, establishes such a coupling provided the estimator $\hat{\theta}(x)$ itself is consistent for $\theta(x)$. Then, in Lemma 4 we guarantee consistency of $\hat{\theta}(x)$ assuming convexity of the $\psi$-function as in Assumption 6. We note that separating the analysis of moment estimators into a local approximation argument that hinges on consistency and a separate result that establishes consistency is standard; see, e.g., Chapter 5.3 of Van der Vaart (2000).

Lemma 3.

Given Assumptions 1–5, and a forest trained according to Assumption 7, suppose that the generalized random forest estimator $\hat{\theta}(x)$ is consistent for $\theta(x)$. Then $\hat{\theta}(x)$ and $\tilde{\theta}^*(x)$ are coupled at the following rate:

(27)
Lemma 4.

Given Assumptions 1–6, the estimates $\hat{\theta}(x)$ from a generalized random forest trained according to Assumption 7 converge in probability to $\theta(x)$ as $n \to \infty$.

Given this coupling result, it now remains to study the asymptotics of $\tilde{\theta}^*(x)$. In doing so, we re-iterate that $\tilde{\theta}^*(x)$ is exactly the output of an infeasible regression forest trained on outcomes $\theta(x) + \rho_i^*(x)$. Thus, the results of Wager and Athey (2017) apply directly to this object, and can be used to establish its Gaussianity. The fact that we cannot actually compute $\tilde{\theta}^*(x)$ in practice does not hinder an application of their results.

Pursuing this approach, we find that whenever trees are grown on subsamples of size $s$ scaling as $s \asymp n^{\beta}$ for an appropriate exponent $\beta < 1$, the pseudo-forest $\tilde{\theta}^*(x)$—and thus also $\hat{\theta}(x)$—is asymptotically normal.

Theorem 5.

Suppose that Assumptions 1–6 hold, that our forest is trained according to Assumption 7, and moreover that trees are grown on subsamples of size $s$ scaling as $s \asymp n^{\beta}$ with

(28)

where $\omega$ and $\pi$ are the constants defined when stating our basic assumptions about the forest in Assumption 7. Finally, suppose that $\mathrm{Var}\big[ \rho_i^*(x) \,\big|\, X_i = x \big] > 0$. Then, there is a sequence $\sigma_n(x)$ for which

\[ \frac{\hat{\theta}_n(x) - \theta(x)}{\sigma_n(x)} \Rightarrow \mathcal{N}(0,\, 1), \qquad \sigma_n^2(x) = \mathrm{polylog}(n/s)^{-1}\, \frac{s}{n}, \qquad (29) \]

where $\mathrm{polylog}(n/s)$ is a function that is bounded away from 0 and increases at most polynomially with the log-inverse sampling ratio $\log(n/s)$.

4.3 Proof of Main Results

Here, we present arguments leading up to our main result, namely the central limit theorem presented in Theorem 5, starting with some technical lemmas. Due to space considerations, the proofs of Propositions 1 and 2, Lemma 4, and the technical results stated below are given in the appendix. Throughout our theoretical analysis, we use the following notation: Given our forest weights (6), let

\[ \bar{\Psi}_{\theta,\,\nu}(x) := \sum_{i=1}^{n} \alpha_i(x)\, \psi_{\theta,\,\nu}(O_i). \qquad (30) \]

We will frequently use the following bounds on the moments of $\bar{\Psi}_{\theta,\,\nu}(x)$ at the true parameter values $(\theta(x), \nu(x))$.

Lemma 6.

Let $\alpha_i(x)$ be weights from a forest obtained as in Assumption 7, and suppose that the $M$-function is Lipschitz in $x$ as in Assumption 1. Then, $\bar{\Psi}_{\theta(x),\,\nu(x)}(x)$ satisfies the following moment bounds:

(31)
(32)

where $s$ is the subsampling rate used when building our forest.

Proof.

To establish these bounds, we start by expanding $\bar{\Psi}_{\theta(x),\,\nu(x)}(x)$ as

\[ \bar{\Psi}_{\theta(x),\,\nu(x)}(x) = \frac{1}{B} \sum_{b=1}^{B} \sum_{i=1}^{n} \alpha_{bi}(x)\, \psi_{\theta(x),\,\nu(x)}(O_i), \qquad (33) \]

where the $\alpha_{bi}(x)$ are the individual tree weights used to build the forest weights in (6). Now, $\bar{\Psi}_{\theta(x),\,\nu(x)}(x)$ is nothing but the output of a regression forest with response $\psi_{\theta(x),\,\nu(x)}(O_i)$. Thus, given our assumptions about the moments of $\psi$ and the fact that our trees are built via honest subsampling, these bounds follow directly from arguments made in Wager and Athey (2017). First, the proof of Theorem 3 of Wager and Athey (2017) shows that the weights $\alpha_i(x)$ are localized:

(34)

thus directly implying (31) thanks to Assumption 1. Meanwhile, because individual trees are grown on subsamples, we can verify that

(35)

where the first inequality results from classical results about $U$-statistics going back to Hoeffding (1948), while the second inequality follows from second-moment bounds on $\psi_{\theta(x),\,\nu(x)}(O_i)$ along with the fact that our trees are grown on honest subsamples. ∎

4.3.1 Local Regularity of Forests

Before proving any of our main results, we need to establish a result that gives us some control over the "sample paths" of $(\theta, \nu) \mapsto \bar{\Psi}_{\theta,\,\nu}(x)$. To do so, define the local discrepancy measure

(36)

which describes how tightly the stochastic fluctuations of $\bar{\Psi}_{\theta,\,\nu}(x)$ are coupled for nearby parameter values $(\theta, \nu)$ and $(\theta', \nu')$. The following lemmas establish uniform local concentration of $\bar{\Psi}$: First, in Lemma 7, we control the variogram of the forest, and then Lemma 8 establishes concentration of this discrepancy over small balls. Both proofs are given in the appendix.

Lemma 7.

Let $(\theta, \nu)$ and $(\theta', \nu')$ be fixed pairs of parameters, and let $\alpha_i(x)$ be forest weights generated according to Assumption 7. Then, provided that Assumptions 1–3 hold,

(37)

where $L$ is the Lipschitz parameter from (22).

Next, to generalize this concentration bound from a single point into a uniform bound, we will need some standard formalism from empirical process theory as presented in, e.g., van der Vaart and Wellner (1996). To do so, we start by defining a bracketing, as follows. For any pair of parameters $(\theta, \nu)$, $(\theta', \nu')$, define the bracket

for all realizations of $O_i$, where the inequality is understood coordinate-wise; and define the radius of the bracket in terms of the $L_2$-distance of the individual "$\psi$-trees" that comprise $\bar{\Psi}$:

(38)

where $\mathcal{J}_1$ and $\mathcal{J}_2$ are two disjoint half-subsamples as in Algorithm 1. For any $\varepsilon > 0$, the $\varepsilon$-bracketing number is the minimum number of brackets of radius at most $\varepsilon$ required to cover the parameter space $\mathcal{B}$.

Given this notation, our concentration bound for $\bar{\Psi}$ will depend on controlling this covering number. Specifically, we assume that there is a constant for which the bracketing entropy is bounded by

(39)

We use Assumption 4 to give us bounds of this type; and, in fact, this is the only place we use Assumption 4. Replacing Assumption 4 with (39) would be enough to prove our results, which will only depend on this assumption through Lemma 8 below.

To see how Assumption 4 leads to (39), we first write

\[ \bar{\Psi}_{\theta,\,\nu}(x) = \sum_{i=1}^{n} \alpha_i(x)\, \lambda(\theta,\, \nu;\, O_i) + \sum_{i=1}^{n} \alpha_i(x)\, \zeta_{\theta,\,\nu}\big( g(O_i) \big), \]

where $\lambda$ is Lipschitz in $(\theta, \nu)$ and $\zeta_{\theta,\,\nu}$ is a monotone function of a univariate representation $g(O_i)$ of $O_i$. Writing the bracketing numbers of these two additive components analogously, we can verify that the bracketing number of $\bar{\Psi}$ is controlled by their product.

Because $\{\zeta_{\theta,\,\nu}\}$ is a bounded, monotone, univariate family, Theorem 2.7.5 of van der Vaart and Wellner (1996) bounds the bracketing entropy of the second component at the desired order. Meanwhile, because $\lambda$ is Lipschitz and our parameter space is compact, Lemma 2.7.11 of van der Vaart and Wellner (1996) does the same for the first component. Thus, both terms are controlled at the desired order, and so (39) holds.

Lemma 8.

Under the conditions of Lemma 7, suppose moreover that (39) holds. Then,

(40)

for any $\varepsilon, \delta > 0$, where the constant appearing in the bound is an upper bound for the magnitude of the $\psi$-function; note that Assumption 4 guarantees that such a finite bound exists.

4.3.2 Proof of Lemma 3

We now turn to proving one of our key results. We first note that, if $\bar{\Psi}_{\theta,\,\nu}(x)$ were twice differentiable in $(\theta, \nu)$, then we could verify (27) fairly directly via a Taylor expansion of $\bar{\Psi}$. Now, of course, $\bar{\Psi}$ is not twice differentiable, and so we cannot apply this argument directly. Rather, we need to first apply a Taylor expansion to the expected score function $M_{\theta,\,\nu}(x)$, which is twice differentiable; we then use the regularity properties established in Section 4.3.1 to extend this result to $\bar{\Psi}$.

Now, recall that we have assumed $\hat{\theta}(x)$ to be consistent; thus, there is a sequence $\varepsilon_n \to 0$ such that

(41)

Using notation established in (30) and (36), we then write