Boulevard: Regularized Stochastic Gradient Boosted Trees and Their Limiting Distribution
This paper examines a novel gradient boosting framework for regression. We regularize gradient boosted trees by introducing subsampling and employ a modified shrinkage algorithm so that at every boosting stage the estimate is given by an average of trees. The resulting algorithm, titled “Boulevard”, is shown to converge as the number of trees grows. We also demonstrate a central limit theorem for this limit, allowing a characterization of uncertainty for predictions. A simulation study and real world examples provide support for both the predictive accuracy of the model and its limiting behavior.
This paper presents a theoretical study of gradient boosted trees (GBM or GBT: Friedman, 2001). Machine learning methods for prediction have generally been thought of as trading off both intelligibility and statistical uncertainty quantification in favor of accuracy. Recent results have started to provide a statistical understanding of methods based on ensembles of decision trees (Breiman et al., 1984). In particular, the consistency of methods related to random forests (RF: Breiman, 2001) has been demonstrated in Biau (2012); Scornet et al. (2015) while Wager et al. (2014); Mentch and Hooker (2016); Wager and Athey (2017) and Athey et al. (2016) prove central limit theorems for RF predictions. These have then been used for tests of variable importance and nonparametric interactions in Mentch and Hooker (2017).
In this paper, we extend this analysis to GBT. Analyses of RF have relied on a subsampling structure to express the estimator in the form of a U-statistic from which central limit theorems can be derived. By contrast, GBT produces trees sequentially with the current tree depending on the values in those built previously, requiring a different analytical approach. While the algorithm proposed in Friedman (2001) is intended to be generally applicable to any loss function, in this paper we focus specifically on nonparametric regression (Stone, 1977, 1982). Given a sample of observations , assume they follow the relation
which satisfies the following:
The density, , is bounded from above and below, i.e. s.t. .
is bounded Lipschitsz, i.e. , and s.t. .
is sub-Gaussian error with , ,
To fit this sample, GBT builds correlated trees sequentially to perform gradient descent in functional space (Friedman et al., 2000). With loss, the process resembles the Robbins-Monro algorithm (Robbins and Monro, 1951), applying trees to iteratively fit residuals. The procedure is given as follows.
Start with the initial estimate .
For given the current functional estimate , calculate the negative gradient
Construct a tree regressor on .
Update by a small learning rate ,
Gradient boosting is initially developed from attempts to understand Adaboost (Freund et al., 1999) in Friedman et al. (2000). Mallat and Zhang (1993) studied the Robbins-Monro algorithm and demonstrated convergence when the regressors are taken from a Hilbert space. Their result was extended by Bühlmann (2002) to show the consistency of decision tree boosting under the settings of norm and early stopping. From a broader point of view, discussions of consistency and convergence of general boosting framework can be found in Bühlmann and Yu (2003), Zhang et al. (2005) and Bühlmann and Hothorn (2007).
Besides the original implementation, there are a number of variations on the algorithm presented above. Friedman (2002) incorporated subsampling in each iteration and empirically showed significant improvement in predictive accuracy. Rashmi and Gilad-Bachrach (2015) argued that GBT is sensitive towards the initial trees, requiring lots of subsequent trees to make an impact. Their solution incorporates the idea of dropout (Wager et al., 2013; Srivastava et al., 2014) to train each new iteration with a subset of the existing ensemble to alleviate the imbalance. Further, Rogozhnikov and Likhomanenko (2017) suggested to downscale the learning rate and studied the convergence of the boosting path when the learning rate is small enough to guarantee contraction.
In this paper we intend to provide a universal framework combining the factors mentioned above into GBT to study its asymptotic behavior. Our method is particularly inspired by the recent development of the RF inferential framework (Mentch and Hooker, 2016; Wager and Athey, 2017; Mentch and Hooker, 2017), in which the averaging structure of random forests enables analyses based on U-statistics and projections leading to asymptotic normality. Similarly, among classic stochastic gradient methods, Ruppert-Polyak (Polyak and Juditsky, 1992; Ruppert, 1988) averaging implies asymptotic normality for model parameter estimators by averaging the gradient descent history. The boosting framework we present also produces models that exhibit this averaging structure which we can therefore leverage. We name this algorithm Boulevard boosting and summarize it in the following Algorithm 1.
Algorithm 1 (Boulevard).
Start with the initial estimate .
Given the current functional functional estimate , calculate the negative gradient
If needed, generate a subsample (otherwise let ).
Construct a tree regressor on .
Update by learning rate , presumably closer to , that
When boosting ends after the -th iteration with , rescale it by to obtain the final prediction .
As indicated by the last two steps, Boulevard and GBT differ in how they update the ensemble. The name Boulevard re-imagines the random forest stretched out along a row, with the averaging structure diminishing the influence of early trees as the algorithm proceeds. The practical benefit of Boulevard update is threefold. First, shrinkage makes the ensemble less sensitive to any particular tree, particularly those at the start of the process. Second, subsampling reduces overfitting. Third, it is free of early stopping rules because there is always signal in the gradients. As a result, the final form of the predictor sits between an ordinary GBT and a RF. Moreover, Boulevard provides us with theoretical guarantees regarding its limiting behavior. If we write to be the functional estimate after iterations for a sample of size , we can obtain the following results in their simplified forms.
Finite sample convergence. Conditioned on any finite sample of size , should we keep boosting using Boulevard, the boosting path converges almost surely. In other words, relying on the sample such that for ,
Consistency. Under several conditions regarding the construction of regression trees, this limit aligns with the truth after rescaling by . For ,
where the convergence in probability is with respect to sample variability.
Asymptotical normality. Under the same conditions for consistency, we can prove that the prediction is asymptotically normal. For ,
We will demonstrate in detail these results along with their required conditions in later sections. So far as we are aware, these represent the first results on a distributional limit for GBT and hence the potential for inference using this framework; we hope that they inspire further refinements. It is worth noticing that Bayesian Additive Regression Trees (BART) Chipman et al. (2010) were also motivated by GBT and allow the development of Bayesian credible intervals. However, the training procedure for BART resembles backfitting a finite number of trees, resulting in a somewhat different model class.
The remainder of the paper is organized as follows: In Section 2, we introduce structure-value isolation, a counterpart of honesty for boosted trees and non-adaptivity to ensure that the distribution of tree structures stabilizes. In Section 3, we show finite sample Boulevard convergence to a fixed point. In Section 4, we further prove the limiting distribution. In Section 5, we focus on non-adaptivity and discuss how to achieve it either manually or spontaneously. In Section 6, we present our empirical study.
The sequential process employed in constructing decision trees renders them challenging to describe mathematically and more tractable variants of the original implementation in Breiman et al. (1984) are often employed. In particular, separating the structure of the tree from the observed responses – which may still used to determine leaf values – has been particularly useful. Early analyses employed completely randomized trees in which the structure was determined independently of observed responses: see Bühlmann et al. (2002); Breiman (2004); Biau et al. (2008); Biau (2012); Arlot and Genuer (2014) that, in particular, generate a connection to kernel methods observed in Davies and Ghahramani (2014) and Scornet (2016). See Section 3.1 of Biau and Scornet (2016) for an overview of these developments.
More recent results in Wager and Athey (2017) employ sample splitting so that the structure of a tree is determined by part of the data while the values in the leaves are decided by an independent set to create a property that is termed honesty – independence between tree structure and leaf values. Wager and Athey (2017) provide some evidence for the practical, as well as theoretical, utility of this construction. However, they also rely on a condition of regularity: requiring the sample probability of selecting each covariate to be bounded away from zero. This is important in controlling the size of leaves and therefore the bias of the resulting estimate. This condition removes the variable selection properties of trees (see Biau, 2012) and we are unaware of practical implementations that guarantee it in practice.
In this paper, we rely on two restrictions on the tree building process: an extension of honesty that we call structure-value isolation to require the structure of trees independent of leaf values for the whole ensemble rather than individual trees, and non-adaptivity that requires the distribution of tree structures (conditional on the sample) to be constant over boosting iterations. Non-adaptivity allows us to demonstrate the convergence of the Boulevard algorithm for a fixed sample and to show that its limit has the form of kernel ridge regression. Structure-value isolation then allows us to provide distributional results by separating the form of the kernel from the response.
We recognize that these place restrictive conditions on the tree building process and note parallels between these and the early developments in the theory of random forests referenced above. Completely randomized trees satisfy both conditions but are not strictly necessary. Non-adaptivity can be replaced by an asymptotic version which we argue may arise spontaneously or can be enforced in various ways. Structure-value isolation can be achieved, for example, by global subsample splitting – replacing the leaf values of trees by a held-out subset as Boulevard progresses, although note that our results do not cover uncertainty in the resulting kernel structure. Further details of these conditions are pursued below; we speculate in Section 4.6 that our distributional results may also be obtained via alternative representations, but developing these is beyond the scope of this paper.
2.1 Structure-Value Isolation
A decision tree (Breiman et al., 1984) predicts by iteratively segmenting the covariate space into disjoint subsets (i.e. leaves) within each of which a terminal leaf value is chosen for making predictions. Therefore we can separate the construction of a decision tree into two stages: deciding the tree structure, and deciding the values in the terminal nodes. Traditional greedy tree building algorithms use the same sample points for both of these two steps. One drawback of these algorithms is the difficulty of providing mathematical guarantees about isolating sample points with large observation errors, i.e. outliers, thereby de-stabilizing the resulting predicted values. We think of this behavior as “chasing order statistics”. As a result, a plethora of conclusions on trees and tree ensembles rely on randomization, for example, using completely randomized trees or applying honesty as noted above.
In the context of Boulevard, the sequential dependence of trees requires us to employ a stronger condition. We define structure-value isolation as the requirement that leaf values are independent of tree structures across the entire ensemble, rather than on a tree-by-tree basis. To make this precise, here we introduce our decision tree notations. A regression tree segments the covariate space into disjoint leaves representing the tree structure. and are hyper-rectangles. A traditional regression tree prediction can be explicitly expressed as
where, given for some ,
describing the influence of on predicting the value at . Slight changes are required when a subsample is used instead of the full sample to calculate the leaf values. For given subsample , we write
In this case, for any ,
We define the (column) structure vector of x, and
the structure matrix as the stacked structure vectors of the full sample. With this notation, structure-value isolation can also be viewed as the separation between the tree structure matrix and the response vector. Formally, we have
An ensemble of trees
exhibits structure-value isolation if is independent of the vector for all and , where we use the superscript to represent the -th tree.
For example, random forests utilizing completely randomized trees achieve structure-value isolation, while standard boosting and standard random forests do not.
In addition to structure-value isolation, in order to demonstrate that Boulevard converges for a fixed sample, we also impose a condition which ensures that the distribution of tree structures stabilizes as Boulevard progresses. We describe this in two senses: a strong sense in which the distribution, conditional on the sample, is fixed for the whole sample, and a weak sense in which this occurs eventually.
Denote the probability space of tree structures given sample of size after trees have been built, where consists of all possible tree structures and the probability measure on . A tree ensemble is non-adaptive if identical for all . A tree ensemble is eventually non-adaptive if identical for sufficiently large .
Following this definition, both standard random forests and random forests utilizing completely randomized trees are non-adaptive. In terms of boosting, since the target responses change each iteration, a shortcut for achieving non-adaptivity is the aforementioned structure-value isolation. On the other hand, eventual non-adaptivity is a desirable condition should a GBT ensemble become stationary after enough iterations. The following algorithm provides a straightforward example to guarantee non-adaptivity.
Algorithm 2 (Trees for Non-adaptive Boosting).
Start with , where are current residuals.
Obtain the tree structure independently of .
If needed, uniformly subsample an index set of size (otherwise let ).
Decide the leaf values, hence , merely with respect to as for ,
with defined to be .
Algorithm 2 is not specific about how to decide the tree structures. For non-adaptivity it can be realized using completely randomized trees in which the gradients only influence the leaf values. Another strategy is to acquire another independent sample and implement a random forest on this sample solely for extracting tree structures. We may also consider another alternative which uses a parallel adaptive boosting procedure on another independent sample to produce tree structures. While this option does not guarantee non-adaptivity, it results in trees with structure-value isolation and can be tailored to fulfill eventual non-adaptivity should we manage to reach an asymptotically stationary distribution of trees at the end of the parallel procedure.
In fact, eventual non-adaptivity allows the use of any tree building strategy for a finite number of trees. This assumption turns out to be sufficient for our theoretical development as long as we focus on the limit as the number of trees tends to infinite, rather than using a finite ensemble. Meanwhile it is compatible with any of the tree building algorithms used in practice for the beginning part where most signals get explored, which potentially accelerates the learning process. We will delay the discussion of fulfilling eventual non-adaptivity either spontaneously or manually to Section 5, and will focus on the Boulevard algorithm equipped with non-adaptivity mechanism as non-adaptive Boulevard unless otherwise stated.
2.3 Trees as Kernel Method
Another motivation for isolating tree structures and leaf values comes from the fact that it relates decision trees to a kernel method. To clarify, recall the definition of the tree structure matrix and denote by the probability space of all possible tree structures given sample of size and, when necessary, a subsampling framework , where is the structure of a single possible tree. When we have structure-value isolation, we can write the expected decision tree prediction on this sample as
where , the expectation with respect to subsample indices and the expectation with respect to the probability measure .
Denote by the expectation over all possible tree structures and subsample index sets, then
is symmetric, element-wise nonnegative.
is positive semidefinite.
, , . Here the last stands for the spectrum norm.
This in (3) is similar to the random forest kernel (Scornet et al., 2015) defined by the corresponding tree structure space, subsampling strategy and tree structure randomization approach. In terms of a single tree, honesty contributes to its symmetry and positive semidefiniteness, while subsampling decides the concentration level of the kernel. While for an ensemble, non-adaptivity guarantees that applies to all its tree components, making it possible to use a uniform mathematical expression to describe predictions from different trees.
for all , where is the target negative gradients and the fitted values after -th iteration.
It is also worth noticing that in part (iii) of Theorem 2.1 we only guarantee inequality rather than setting the norm exactly to 1. This is caused by structure-value isolation: for instance, when applying subsample splitting, it is possible that certain leaf of the tree structure produced by the first subsample contains no points of the second subsample. For such cases we can predict 0 for expediency: we demonstrate in Section 4.2 that this choice has an asymptotically negligible effect below.
As stated in Zhang et al. (2005), a first “theoretical issue” of analyzing boosting method is the difficulty of attaining convergence. As a starting point we will show that Boulevard guarantees point-wise convergence under finite sample settings.
3.1 Stochastic Contraction and Boulevard Convergence
To prove convergence of the Boulevard algorithm, we introduce the following definition, lemmas and theorem inspired by the unpublished manuscript by Almudevar (2002) regarding a special class of stochastic processes. We refer the readers to the original manuscript, but key points of the proof are briefly reproduced in Appendix A.2 and extended to cover a Kolmogorov inequality.
Theorem 3.1 (Stochastic Contraction).
Given -valued stochastic process , a sequence of , define
We call a stochastic contraction if the following properties hold
In particular, a multidimensional stochastic contraction exhibits the following behavior
holds for all s.t. .
To study Boulevard convergence, we start by informally postulating a convergent point for any boosting algorithm. Ideally the boosting path should remain stationary at and after .
For standard boosting with loss, supposing we obtain at the -th iteration, the next update is with the tree structure matrix decided by the gradient vector. being the convergent point implies that =0. Either we can implement a diminishing sequence of learning rates , or we must guarantee that we stop boosting with . The latter condition depends on both the properties of the gradient vector and the tree building algorithm deciding , making it challenging to study either the existence of such point or the convergence of the boosting path to .
For Boulevard, the update will be , implying that when is random. For trees with non-adaptivity and structure-value isolation, we can solve this relation for to prove its existence. Furthermore, Theorem 3.1 can be used to study the convergence of the boosting path to such . We formalize our discussion as the following theorem.
Given sample . If we construct gradient boosted trees with structure-value isolation and non-adaptivity using tree structure space , by choosing and defining as a truncation function, let Boulevard iteration take form of
where the observed response vector, the predicted response vector by the first trees, the random tree structure vector. We have
where , the random tree structure matrix defined above.
This theorem guarantees the convergence of Boulevard paths under finite sample setting once we threshold by a large . As a corollary we can express the prediction at any point of interest , which coincides with a kernel ridge regression using the random forest kernel.
By defining ,
Ridge regression tends to shrink the predictions towards 0 and so does (5). Therefore Corollary 3.2.1 carries the message that Boulevard may not converge to the whole signal. We will see in later proofs that Boulevard converges to , in contrast to standard boosting where we expect the convergence to the full signal. In fact, Boulevard down-weighs the boosting history, thereby ensuring that each tree in the finite ensemble must predict some partial signal during training. It thus avoids being dominated by the first few trees then repeatedly fitting on noise. In practice, since we showed that the prediction from Boulevard is consistent with respect to , we simply rescale it by to retrieve the whole signal.
3.2 Beyond Loss
Besides regression, other tasks may require alternative loss functions for boosting. For instance, the exponential loss in adaboost (Freund and Schapire, 1995). Analogous to the proof for loss, we can write the counterparts for any general loss whose non-adaptive Boulevard iteration takes the form of
Suppose the existence of the fix point , then
If the gradient term is bounded and Lipschitz (which could be enforced by truncation), i.e.
we can similarly show such Boulevard iteration converges by choosing . However, the closed form of can be intractable to obtain and potentially non-unique. For example for AdaBoost, is the solution to .
4 Limiting Distribution
Inspired by recent results demonstrating the asymptotic normality of random forest predictions, in this section we prove the asymptotic normality of predictions from Boulevard. Before detailing these results, we need some prerequisite discussion on the rates used for decision tree construction in order to ensure asymptotic local behavior. In general, the variability of model predictions comes from three sources: the sample variability, response errors, and the systematic viability of boosting caused by its subsampling and stopping rule. As shown in the last section, Boulevard convergence relieves the need for considering the stopping rule and the systematic viability. Therefore the strategy for our proof is as follows: we first consider the boosting process conditioned on the sample triangular array so only the response errors contribute to the variability. We then establish the uniformity over almost all random samples to extend the limiting distribution to the unconditional cases, showing that it is still the response errors that dominate the prediction variability.
4.1 Building Deeper Trees
Decision trees can be thought as k-nearest-neighbor (k-NN: Altman, 1992) models where is the leaf size and the distance metric is given by whether two points are in the same leaf. This adapts the metric to the local geometry of the response function. As the conclusions on k-NN predictions require growing-in-size and shrinking-in-radius neighborhoods (Gordon and Olshen, 1984), so are the counterparts of building deeper trees. Assuming non-adaptivity, the following assumptions are sufficient for our analysis.
Asymptotic locality. We introduce a sequence by which the diameter of any leaf is bounded from above. Writing , we require,
Minimal leaf size. We introduce another sequence by which the volume of any leaf is bounded from below. Writing as the volume function by Lebesgue measure, we require that
Recall the notation that . These place requirements on all the leaves of a tree; we specify the rates we require for and below.
These assumptions together bound the space occupied by any leaf of any possible tree from being either too extensive or too small. Along with the Lipschitz condition, this geometrically shrinking neighborhood not only indicates that each tree is asymptotically guaranteed to gather points with close response values together, but also helps to control the impact of one point on another across multiple consecutive iterations. To be specific, because boosting fits gradients instead of responses, a pair of distant points with divergent values might still contaminate each other through a bridge created by other points linked by being in the same leaf nodes across multiple iterations. Here (L1) provides a theoretical upper bound for such influence.
Notice that randomized honest tree building strategies can be compatible with (L1) if we force it, whereas the standard CART algorithm does not imply (L1). However, since the intuition of (L1) is to put points with similar responses together, we anticipate that CART, tuned to a proper depth, will satisfy (L1), at least in practice. If so, we can still exercise Boulevard boosting using trees built with greedy algorithms without compromising its theoretical guarantees in practice.
4.2 Conditioned on the Sample
We first consider a sample triangular array, i.e. for each , the sample is given. The first subscript will be dropped when there is no ambiguity. We specify the rates for the size of leaf nodes as:
For some ,
For some ,
One compatible realization is
For simplicity all our proofs are under this setting. However, any other rates satisfying these conditions are also sufficient.
Starting here we use the abbreviations
to cover the components of (5). The rates of and determine Boulevard asymptotics.
Recall that the structure-value isolation paired with subsampling may lead to the consequence of predicting 0 in some leaves, so that we only guarantee . Working with the tree construction rates as above, the subsample rate determines how far away is from 1. For Algorithm 2, without loss of generality, suppose we obtain a tree structure where each leaf contains no fewer than sample points before applying subsampling to decide the fitted values. If the subsample size is , i.e. , the chance of missing all sample points in one leaf is
Therefore, for any , if we use subsample size at least of . This requires the subsample to be relatively large, which is compatible with both being constant, or .
4.3 Exponential Decay of Influence and Asymptotic Normality
The prediction that Boulevard makes at a point is a linear combination of responses whose coefficients are given by . Therefore the asymptotical normality of Boulevard predictions will rely on whether can comply with the Lindeberg-Feller condition. Since only differs from by a ridge regression style matrix multiplication, we can expect hold similar properties as does. Further, by examining we can show that distant points are less influential on the prediction, and this decay of influence is exponential in our case.
Given any -vector and an index set , denote
This notation implies the decomposition that .
Given sample , a point of interest , set and define index set , then
If we only look nearby,
Lemma 4.1 indicates that Boulevard trees will asymptotically rely on a shrinking neighborhood around the point of interest. Given sample size and a point of interest , we can therefore define two neighborhoods of different radii and . contains all points that have direct influence on in a single tree, and contains the points that dominate the prediction at . Their cardinalities and follow Binomial distributions with parameters depending on the local covariate density.
To show the limiting distribution of for the sequence of predictions , the key point is to verify that, by writing the prediction at any point as a linear combination of sample responses, all these coefficients are diminishing at a rate allowed by the Lindeberg-Feller condition. Therefore we take a look at the coefficient vectors and .
Given the triangular array with sample at size , assume
The smaller neighborhood is growing fast enough: .
We have enough points in each leaf:
Combining all our discussions above, the following theorem stating the asymptotic normality conditioned on a given sample sequence follows. Lemma 4.2 is used here to estimate the order of the variance term.
Theorem 4.1 (Conditional Asymptotic Normality for Boulevard Predictions).
For any , suppose the conditions in Lemma 4.2 hold. Write , then under non-adaptivity and structure-value isolation,
4.4 Random Design
To extend the scope of the conditional limiting distribution to the random design case where the covariates are randomly drawn from their distribution, we intend to claim all the assumptions we made leading to Theorem 4.1 also hold true for almost surely all random sample sequence. In order to do so, we first incorporate the sample randomness in a well-defined probability space regarding the triangular array constructed from a random sample sequence. This can be done by the Kolmogorov’s extension theorem; see details in Appendix A.9.
We also have to revise (R2) to account for sample randomness. We increase the minimal leaf geometric volume by any small s.t. follows
With this assumption, the following lemma demonstrates the asymptotic normality where the mean depends on the random sample.
For given , suppose we have random sample for each . If we restrict the cardinality of tree space using any small s.t.
The proof of Lemma 4.3 allows us to substitute all by in the analyses of random design. Further, we take advantage of the following theorem to replace the empirical mean by its population version . Combining all above we obtain the main theorem of this paper that the limiting distribution of the random design non-adaptive Boulevard predictions is normal.
Theorem 4.2 (Asymptotic Normality for Boulevard Predictions).
Theorem 4.2 shows a deterministic mean term but a stochastic variance term. From results on kernel ridge regression, we would expect that this stochastic variance converges in probability if the random forest kernel behaves as a generic kernel with a shrinking bandwidth. From a theoretical perspective, the optimal rate of is bounded from below by , which corresponds to the optimal nonparametric regression rate using -Hölder continuous functions as base learners (Stone, 1982). In practice, relies on the specific method of growing the boosted trees, and therefore may vary from case to case.
Furthermore, this demonstrates that with carefully structured trees the prediction is consistent while the variance involves no signal but the error. It acts as an undersmoothed local smoother whose bias term shrinks faster than the variance term.
4.5 Subsampling and Tree Space Capacity
We have a strict requirement that the tree terminal node size grows at a rate between and to guarantee undersmoothing. Any term is allowed to be added to the existing polynomial result without changing the behavior. We notice that different subsample rates (i.e. in Wager et al. (2014), in Mentch and Hooker (2016), although see relaxations in Peng et al. (2019)) have been applied for measuring uncertainty. In comparison, the Boulevard algorithm requires a relatively restricted rate between these. In addition, though Boulevard training implements subsampling at each iteration, this does not influence the asymptotic distribution. The impact of subsampling is on the possible variance from the mean process therefore the convergence speed especially if we assume non-adaptivity.
We have required the size of tree space to scale at a rate close to . In comparison, Wager and Walther (2015) have shown that, in fixed dimension, any tree can be well approximated by a collection of hyper rectangles. One feasible way to compare these two rates is to consider a tree space satisfying Boulevard rates in which each tree has several leaves that constitute one hyper rectangle in the collection for approximation. Therefore a tree space of cardinality is enough for approximating any tree ensemble. In this sense, the capacity of our designated tree space is large enough from a practical perspective.
4.6 Alternative Analyses
The distributional results above are based on an analysis of a kernel representation of the Boulevard process in which structure-value isolation plays a key role in both consistency and asymptotic normality results. By contrast, Mentch and Hooker (2016) and Wager and Athey (2017) employed results derived from U-statistics to demonstrate asymptotic normality when random forest trees are obtained using subsamples. This analysis uses the subsample structure to bypass the kernel representation and allows the trees to be built adaptively. In a similar manner, when Boulevard employs subsampling we may represent its fixed point as a statistic. However, as in our discussion in Section 3.2, -statistic kernel is then defined implicitly with respect to a non-smooth tree-building process and further analysis is beyond the scope of this paper.
Wager and Athey (2017) also uses regularity to constrain leaf sizes in order to demonstrate consistency outside of the -statistic framework. We expect that a similar weakening of these conditions is possible in Boulevard.
5 Eventual Non-adaptivity
All the results discussed above have assumed the non-adaptivity of the boosting procedure of Boulevard. In standard boosting however, it is conventional and important to decide tree structures on the current gradients in order to better exploit the gap between the prediction and the signal. Such procedures are known for their tendency to overfit, behavior which can be relieved by subsampling. However, when seeking to extend our results to this case we lose the easy identifiability of a Boulevard convergence point as the tree structure distribution changes at each iteration. We therefore need more assumptions and further theoretical development to extend our convergence results to a more practical Boulevard algorithm that allows the current gradient to determine tree structure.
A first approach to this is to relax non-adaptivity to eventual non-adaptivity. We postulate a convergent sequence of predictions, indicating that the underlying tree spaces will be stabilized after boosting for a sufficiently long time. If this is the case, eventual non-adaptivity is spontaneous. Here we introduce the notation where and indicating the expected tree structure given the gradient of the loss between observed responses and current predictions. In regression this is , and we will take this form into the following discussion instead of a generic gradient expression.
5.1 Local Homogeneity and Contraction Regions
We start with trees whose splits are based on the optimal L2 gain (Breiman et al., 1984). For , the chosen split minimizes the impurity in the form of
where . Once the optimal split is unique, i.e. the optimum has a positive margin over the rest, we could allow a small change of all ’s values without changing the split decision. This also holds true if the split is decided by a subsample. In terms of adaptive boosting, this observation demonstrates local homogeneity that, except on a set with Lebesgue measure 0 where has multiple optima for (6), we can segment , the space of possible vectors , into subsets s.t. for the same subset.
Notice that Gini gain is insensitive to multiplying by a nonzero factor. Therefore all ’s are open double cones in .
Definition 5.1 (Contraction Region).
Given the sample . Write and current prediction . Following the above segmentation . We call a contraction region if for the following
for any , where is the unique structural matrix in this region.
The intuition behind this definition is that, as long as a Boulevard process stays inside a contraction region, the subsequent tree structures will be conditionally independent of the predicted values. Therefore the process becomes non-adaptive, collapsing to . To achieve this eventual non-adaptivity, we would like to know when a Boulevard path is permanently contained in a contraction region.
We should point out here that we have not shown the existence and the uniqueness of such contraction regions. Such an analysis would rely on the method of choosing splits, the training sample and the choice of .
5.2 Escaping the Contraction Region
In this section we explore possible approaches to restrict a Boulevard process inside a contraction region. Assuming the existence of contraction regions, we recall Theorem 3.1 which indicates that the Boulevard process has positive probability of not moving far from the fixed point.
Denote the open ball of radius centered at in . Suppose a contraction region, the contraction point and for some . Write the Boulevard process. For sufficiently large ,
Theorem 5.1 states that the longer we boost, the more likely the boosting path is trapped in a contraction region forever, resulting in spontaneous eventual non-adaptivity and justifying all our theory assuming non-adaptivity. However, it guarantees neither the existence or the uniqueness of the contraction region.
5.3 Forcing Eventual Non-Adaptivity
If we alternatively consider manually forcing eventual non-adaptivity, a possible ad hoc solution to the existence is to apply a tail snapshot which uses the tree space of certain iteration for the rest of the boosting run. This can be either pre-specified or decided on the fly when the Boulevard path becomes stationary. Ideally, as long as is in a contraction region, its forced non-adaptivity is indistinguishable from spontaneous eventual non-adaptivity. An example of Boulevard regression implementing the tail snapshot strategy is detailed in Algorithm 3, in which we decide by examining the training loss reduction.
Algorithm 3 (Tail Snapshot Boulevard).
Start with and to be specified.
For , given , calculate the gradient
If is not specified, update by and the tree structure space decided by current gradients,
where denotes the random tree structure vector based on tree space . If is specified, update by instead, i.e.
When is not specified, check the empirical training loss as a measure of the distance to the fixed point.
If a preset threshold, we claim Boulevard is close enough to a fixed point and specify to be current . This step is then skipped for the rest of the boosting run.
Another plausible solution is to collectively build the tree space up based on the following intuition. For the first trees we prefer to construct them using the gradients to capture the signal. However when the boosting process keeps going on, we may consider reusing the tree structures or the tree spaces of previous iterations as to avoid overfitting the data or fitting the errors. Therefore the tree space for a large should be the tree structures created by current gradients, plus all tree spaces for . This accumulation scheme changes as we boost, however asymptotically it leads to a common tree space and thus eventual non-adaptivity.
6 Empirical Study
We have conducted an empirical study to demonstrate the performance of Boulevard. Despite the fact that our purpose in developing Boulevard lies in statistical inference, we require its accuracy to be on par with other predominant tree ensembles, which is assessed on both simulated and real world data. In addition, we inspect the empirical limiting behavior of non-adaptive Boulevard to show its agreement with our theory. We summarize the result of the empirical study in this section, while additional details can be found in Appendix B.
6.1 Predictive Accuracy
We first compare Boulevard predictive accuracy with the following tree ensembles: Random Forest (RF), gradient boosted trees without subsampling (GBT), stochastic gradient boosted trees (SGBT), non-adaptive Boulevard achieved by completely randomized trees (rBLV), adaptive Boulevard whose tree structures are influenced by gradient values (BLV). All the tree ensembles build same depth of trees (see Appendix B.1 for details) throughout the experiment.
Results on simulated data are shown in Figure 1. We choose sample size of 5000 and use the following two settings as underlying response functions: (1) (top), and (2) (bottom). Error terms are Unif[-1,1] (left) and equal point mass on (right). Training errors are evaluated on the training set with noisy responses, while test errors are evaluated by error from the underlying signal on a separate test set. For each setup we have two plots covering the behavior of the first 50 trees and the following 250 trees respectively. BLV and rBLV are comparable with RF, while all the three equal-weight tree ensembles are slightly inferior to GBM and SGBM.
In addition, we have used the same settings and applied different values 0.2, 0.5 and 0.8 on two simulated data sets to investigate the impact of on Boulevard accuracy for finite sample cases. Results are summarized in Figure 2. BLV outperforms rBLV while appears to be the best choice. Though this observation aligns with our intuition to choose a large , we consider the discrepancy between different values to be small.
Results on four real world data sets selected from UCI Machine Learning Repository (Dheeru and Karra Taniskidou, 2017; Tüfekci, 2014; Kaya et al., 2012) are shown in Figure 3. All curves are averages after 5-fold cross validation. Different parameters are used for different data sets, see Appendix B.1. Rankings of the five methods in comparison do not show consistency, nevertheless rBLV and BLV manage to achieve similar performance to the other methods on test sets.
6.2 Convergence and Kernel Ridge Regression
Theorem 3.2 and Corollary 3.2.1 claim that any Boulevard prediction should converge to a kernel ridge regression (KRR) form. Although the matrix inverse involved in the expression is too time consuming to compute for large , it is possible to estimate it using Monte Carlo when and are small. We take advantage of this setup to compare Boulevard and KRR in order to empirically support the theorem.
We choose and . We use the following model