On Semi-parametric Bernstein-von Mises Theorems for BART

# On Semi-parametric Bernstein-von Mises Theorems for BART

Veronika Ročková111Assistant Professor in Econometrics and Statistics and James S. Kemper Faculty Scholar at the University of Chicago Booth School of Business. This research was supported by the James S. Kemper Faculty Research Fund.
###### Abstract

Few methods in Bayesian non-parametric statistics/ machine learning have received as much attention as Bayesian Additive Regression Trees (BART). While BART is now routinely performed for prediction tasks, its theoretical properties began to be understood only very recently. In this work, we continue the theoretical investigation of BART initiated by Ročková and van der Pas (2017). In particular, we study the Bernstein-von Mises (BvM) phenomenon (i.e. asymptotic normality) for smooth linear functionals of the regression surface within the framework of non-parametric regression with fixed covariates. As with other adaptive priors, the BvM phenomenon may fail when the regularities of the functional and the truth are not compatible. To overcome the curse of adaptivity under hierarchical priors, we induce a self-similarity assumption to ensure convergence towards a single Gaussian distribution as opposed to a Gaussian mixture. Similar qualitative restrictions on the functional parameter are known to be necessary for adaptive inference. Many machine learning methods lack coherent probabilistic mechanisms for gauging uncertainty. BART readily provides such quantification via posterior credible sets. The BvM theorem implies that the credible sets are also confidence regions with the same asymptotic coverage. This paper presents the first asymptotic normality result for BART priors, providing another piece of evidence that BART is a valid tool from a frequentist point of view.

## 1 Introduction

This note studies the Gaussian approximability of certain aspects of posterior distributions in non-parametric regression with trees/forest priors. Results of this type, initially due to Laplace (1810) but most commonly known as Bernstein-von Mises (BvM) theorems, imply that posterior-based inference asymptotically coincides with the one based on traditional frequentist -consistent estimators. In this vein, BvM theorems provide a rigorous frequentist justification of Bayesian inference. The main thrust of this work is to understand the extent to which this phenomenon holds for various incarnations of Bayesian Additive Regression Trees (BART) of Chipman et al. (2010), one of the workhorses of Bayesian machine learning.

In simple words, the BvM phenomenon occurs when, as the number of observations increases, the posterior distribution has approximately the shape of a Gaussian distribution centered at an efficient estimator of the parameter of interest. Moreover, the posterior credible sets, i.e. regions with prescribed posterior probability, are then also confidence regions with the same asymptotic coverage. This property has nontrivial practical implications in that constructing confidence regions from MCMC simulation can be relatively straightforward whereas computing confidence regions directly is not. Under fairly mild assumptions, BvM statements can be expected to hold in regular finite-dimensional models (van der Vaart, 2000).

Unfortunately, the frequentist theory on asymptotic normality does not generalize fully to semi- and non-parametric estimation problems (Bickel and Kleijn, 2012). Freedman initiated the discussion on the consequences of unwisely chosen priors in the 1960’s, providing a negative BvM result in a basic -sequence Gaussian conjugate setting. The warnings against seemingly innocuous priors that may lead to posterior inconsistency were then reiterated many times in the literature, including Cox (1993) and (Diaconis and Freedman, 1998, 1986). Other counterexamples and anomalies of the BvM behavior in infinite-dimensional problems can be found in Johnstone (2010) and Leahu (2011). While, as pointed out by Bickel and Kleijn (1999), analogues of the BvM property for infinite-dimensional parameters are not immediately obvious, rigorous notions of non-parametric BvM’s have been introduced in several pioneering works (Leahu, 2011; Ghosal, 2000; Castillo and Nickl, 2014, 2013).

Unwisely chosen priors leave room for unintended consequences also in semi-parametric contexts (Rivoirard and Rousseau, 2012). Castillo (2012a) provided an interesting counterexample where the posterior does not display the BvM behavior due to a non-vanishing bias in the centering of the posterior distribution. Various researchers have nevertheless documented instances of the BvM limit in semi-parametric models (a) when the parameter can be separated into a finite-dimensional parameter of interest and an infinite-dimensional nuisance parameter (Castillo, 2012b; Shen, 2002; Bickel and Kleijn, 2012; Cheng and Kosorok, 2008; Johnstone, 2010; Jonge and van Zanten, 2013), and (b) when the parameter of interest is a functional of the infinite-dimensional parameter (Rivoirard and Rousseau, 2012; Castillo and Rousseau, 2015). In this work, we focus on the latter class of semi-parametric BvM’s and study the asymptotic behavior of smooth linear functionals of the regression function.

We consider the standard non-parametric regression setup, where a vector of responses is linked to fixed (rescaled) predictors for each through

 Yi=f0(xi)+εi,εi\lx@stackreliid∼N(0,1), (1.1)

where is an unknown -Hölder continuous function on a unit cube . The true generative model giving rise to (1.1) will be denoted with .

Each model is parametrized by , where is an infinite-dimensional set of possibilities of . Let be a measurable functional of interest and let be a probability distribution on . Given observations from (1.1), we study the asymptotic behavior of the posterior distribution of , denoted with . Let denote the centered normal law with a covariance matrix . In simple words, we want to show that under the Bayesian CART or BART priors on , the posterior distribution satisfies the BvM-type property in the sense that

 Π[√n(Ψ(f)−ˆΨ)|Y(n)]⇝N(0,V) (1.2)

as in -probability, where is a random centering point estimator and where stands for weak convergence. We make this statement more precise in Section 2

Castillo and Rousseau (2015) provide general conditions on the model and on the functional to guarantee that (1.2) holds. These conditions describe how the choice of the prior influences the bias and variance . Building on this contribution, we provide (a) tailored statements for various incarnations of tree/forest priors that have occurred in the literature, (b) extend the considerations to adaptive priors under self-similarity.

### 1.1 Notation

The notation will be used to denote inequality up to a constant, denotes and and denotes . The -covering number of a set for a semimetric , denoted by is the minimal number of -balls of radius needed to cover set . We denote by the normal density with zero mean and variance With we denote the standard Euclidean norm and with and the extremal eigenvalues of a matrix . The set of -Hölder continuous functions (i.e. Hölder smooth with ) on will be denoted with .

## 2 Rudiments

Before delving into our development, we first we make precise the definition of asymptotic normality.

###### Definition 2.1.

We say that the posterior distribution for the functional is asymptotically normal with centering and variance if, for the bounded Lipschitz metric for weak convergence, and the real-valued mapping , it holds that

 β(Π[⋅|Y(n)]∘τ−1n,N(0,V))→0 (2.1)

in probability as , which we denote as .

When efficiency theory at the rate is available, we say that the posterior distribution satisfies the BvM theorem if (2.1) holds with for a linear efficient estimator of The proof of such a statement typically requires of a few steps (a) a semi-local step where one proves that the posterior distribution concentrates on certain sets and (b) a strictly local study on these sets, which can be carried out under the LAN (local asymptotic normality) conditions. Denoting the log-likelihood with

 ℓn(f)=n2log2π−n∑i=1[Yi−f(xi)]22,

we define the log-likelihood ratio and express it as a sum of a quadratic term and a stochastic term via the LAN expansion as follows

 Δn(f) =−n2∥f−f0∥2L+√nWn(f−f0), (2.2)

where

 ∥f−f0∥2L =1nn∑i=1[f0(xi)−f(xi)]2, Wn(f−f0) =⟨f−f0,√nε⟩L=1nn∑i=1√nεi[f(xi)−f0(xi)].

Recall that the phrase “semi-parametric” here refers to the problem of estimating functionals in an infinite-dimensional model rather than Euclidean parameters in the presence of infinite-dimensional nuisance parameters. In this note, we consider the smooth linear functional

 Ψ(f)=1nn∑i=1a(xi)f(xi) (2.3)

for some smooth uniformly bounded -Hölder continuous function , i.e. and for some . Were , Hölder continuity would imply that is a constant function and (2.3) boils down to a constant multiple of the average regression surface evaluated at fixed design points. This quantity is actually of independent interest and has been studied in a different setup by Ray and van der Vaart (2018) who focus on the posterior distribution of the “half average treatment effect estimator” (the mean regression surface) in the presence of missing data and random covariates. Besides its standalone purpose, the semi-parametric BvM result for functionals (2.3) has a useful alternative purpose in the multiscale approach to posteriors towards obtaining non-parametric BvM statements for BART priors (Castillo and Ročková, 2019). One of the ingredients is proving convergence of finite-dimensional distributions, which can be shown by the Cramer-Wald theorem applying results presented here.

## 3 Tree and Forest Priors

Regression trees provide a piecewise constant reconstruction of the regression surface , where the pieces correspond to terminal nodes of recursive partitioning (Donoho, 1997). Before introducing the tree function classes, we need to define the notion of tree-shaped partitions.

Starting from a parent node , a binary tree partition is obtained by successively applying a splitting rule on a chosen internal node. Each such internal node is divided into two daughter cells with a split along one of the coordinates at a chosen observed value. These daughter cells define two new rectangular subregions of , which can be split further (to become internal nodes) or end up being terminal nodes. The terminal cells after splits then yield a tree-shaped partition consisting of boxes . We denote with the set of all tree-shaped partitions that can be obtained by recursively splitting times at observed values .

The tree functions can be then written as

 fT,β(x)=K∑k=1I(x∈Ωk)βk, (3.1)

where and where is a vector of jump sizes. Solitary trees are not as good performers as ensembles of trees/forests (Breiman, 2001; Chipman et al., 2010). The forest mapping underpinning the BART method of Chipman et al. (2010) is the following sum-of-trees model indexed by a collection of tree-shaped partitions and step sizes :

 fE,B(x)=T∑t=1fTt,βt(x)forTt∈VKtandβt∈RKt. (3.2)

The prior distribution is assigned over the class of forests

 F={fE,B(x)of the % form (???) for some E and B and T∈N}

in a hierarchical manner. One first assigns a prior distribution over (or sets it equal to some value) and then a prior over the tree-shaped partitions as well as step sizes for .

### 3.1 Tree Partition Priors π(T)

In 1998, there were two Bayesian CART developments that surfaced independently of each other: Chipman et al. (1997) and Denison et al. (1998). Albeit related, the two methods differ in terms of the proposed tree partition prior .

The prior of Denison et al. (1998) is equalitarian in the sense that trees with the same number of leaves are a-priori equally likely, regardless of their shape. To prioritize smaller trees (that do not overfit), one assigns a complexity prior on the tree size (i.e. the number of bottom nodes) . They considered the Poisson distribution

 π(K)=λK(eλ−1)K!,K=1,2,…,for someλ>0. (3.3)

Given the tree size , one assigns a uniform prior over tree topologies

 π(T|K)=I(T∈VK)|VK|, (3.4)

where is the cardinality of . This prior can be straightforwardly implemented using Metropolis-Hastings strategies (Denison et al., 1998) and was studied theoretically by Ročková and van der Pas (2017).

The Bayesian CART prior of Chipman et al. (1997), on the other hand, specifies the prior implicitly as a tree-generating Galton-Watson (GW) process. This process provides a mathematical representation of an evolving population of individuals who reproduce and die subject to laws of chance. The tree growing process is characterized as follows. Starting with a root node , one decides to split each node into two children by flipping a coin. We are tacitly using the two-index numbering of nodes , where corresponds to the tree layer and denotes the left-most node at the layer. In order to prevent the trees from growing indefinitely, the success probability decays with , making bushy trees far less likely. Let us denote with the binary splitting indicator for whether or not a node was divided. Chipman et al. (1997) assume

 P(γlk=1)=plk=α(1+l)δ (3.5)

for some and . Ročková and Saha (2019) propose a modification for some , which yields optimal posterior concentration in the sense. If the node splits (i.e. ), one chooses a splitting rule by first picking a variable uniformly from available directions and then picking a split point uniformly from available data values .

Unlike in the homogeneous case (where all ’s are iid), (3.5) defines a heterogeneous GW process where the offspring distribution is allowed to vary from generation to generation, i.e. the variables are independent but non-identical. Figure 1 depicts the split probabilities.

### 3.2 Priors on Jump Sizes π(β|K)

After partitioning the predictor space into nested rectangular cells, one needs to assign a prior on the presumed level of the outcome. Throughout this work, we denote with the vector of jump sizes associated with partitioning cells. Both Chipman et al. (1997) and Denison et al. (1998) assumed an independent product of Gaussians for some . Chipman et al. (2000) argue, however, that the simple independence prior on the bottom leaf coefficients may not provide enough structure. They claim that values that correspond to nearby cells in the predictor space should be more similar so that the prior incorporates local smoothness. They suggest a prior on bottom leaves that aggregates priors on the ancestral internal nodes and, in this way, induces correlation among neighboring cells. Motivated by these considerations, here we allow for general correlation structures by assuming a multivariate Gaussian prior

 π(β|K)∼NK(0,Σ),withλmin(Σ)>c>0andλmax(Σ)≲n (3.6)

where is some positive definite matrix. We also consider an independent product of Laplace priors (which was not yet studied in this context)

 π(β|K)=K∏k=1ψ(βk;λ), (3.7)

where for some .

## 4 Simple One-dimensional Scenario

To set the stage for our development, we start with a simple toy scenario where (a) , (b) is regarded as fixed, and (c) when there is only one partition consisting of equivalent blocks. The equivalent blocks partition (Anderson, 1966) comprises intervals with roughly equal number of observations in them, i.e. . For the sake of simplicity, we will also assume that the data points lie on a regular grid, where for in which case the intervals have also roughly equal lengths. This setup was studied previously by van der Pas and Ročková (2017) in the study of regression histograms. We relax this assumption in the next section. We denote the class of approximating functions as

 F[K]={fKβ:[0,1]p→R;fKβ(x)=K∑k=1I(x∈Ωk)βk}, (4.1)

where we have omitted the subscript and highlighted the dependence on . We denote with the prior distribution over , obtained by assigning the prior (3.6) or (3.7). To further simplify the notation, we will drop the subscript and write when referring to the elements of .

The aim is to study the posterior distribution of and to derive conditions under which

 Π[√n(Ψ(fK)−Ψn)|Y(n)] (4.2)

converges to a normal distribution (in probability) with mean zero and variance , where is a random centering, distributed according to a Gaussian distribution with mean and variance .

Using the fact that convergence of Laplace transforms for all in probability implies convergence in distribution in probability (Section 1 of Castillo and Rousseau (2015)) this BvM statement holds when

 EΠ[et√n(Ψ(fK)−ˆΨK)|Y(n)]→et22∥a∥2Lin Pn0 probability as n→∞, (4.3)

where is a linear efficient estimator of such that

 √n(ˆΨK−Ψn)=oP(1).

In order to show (4.3), we first need to introduce some notation. Let be the projection of onto , i.e.

 aK(x)=K∑k=1I(x∈Ωk)aKkwithaKk=n∑i=1I(xi∈Ωk)na(xi)μ(Ωk),

where was defined above as . Similarly, we denote with the projection of onto with jump sizes . Next, we define

 ˆΨK =Ψ(fK0)+Wn(aK)√nandΨn=Ψ(f0)+Wn(a)√n.

The following Theorem characterizes the BvM property when is sufficiently large and when is known.

###### Theorem 4.1.

Assume the model (1.1) with , where is endowed with a prior on in (4.1) induced by (3.6). Assume and for and . With the choice , we have

 Π(√n(Ψ(fK)−ˆΨK)|Y(n))⇝N(0,∥a∥2L)

in -probability as .

Section A.1

###### Remark 4.1.

Theorem 4.1 applies to the so-called symmetric dyadic trees. In particular, when for some , the equivalent blocks partition with cells can be represented with a symmetric full binary tree which splits all the nodes at dyadic rationals up to a resolution .

Theorem 4.1 is related to Theorem 4.2 of Castillo and Rousseau (2015) for density estimation with non-adaptive histogram priors. The proof here also requires two key ingredients. The first one is the construction of a prior which does not change too much under the change of measures from to , where is a step function with shifted step sizes . This property holds for (correlated) Gaussian step heights and is safely satisfied by other prior distributions with heavier tails.

###### Remark 4.2.

Theorem 4.1 holds for Laplace product prior (as shown in Section A.1.1). It is interesting to note that under the Laplace prior, one can obviate the need for showing posterior concentration around a projection of onto , which is needed for the Gaussian case.

The second crucial ingredient (as in the Proposition 1 of Castillo and Rousseau (2015)) is the so-called no bias condition:

 √n⟨a−aK,f0−fK0⟩L=o(1). (4.4)

This condition vaguely reads as follows: one should be able to approximate simultaneously and well enough using functions in the inferential class . Using the Cauchy-Schwartz inequality and Lemma 3.2 of Ročková and van der Pas (2017), this condition will be satisfied when . Choosing , (4.4) holds as long as . Different choices of , however, would imply different restrictions on and . The no-bias condition thus enforces certain limitations on the regularities and . This poses challenges for adaptive priors that only adapt to the regularity of , which may not be necessarily similar to the regularity of . This phenomenon has been coined as the curse of adaptivity (Castillo and Rousseau, 2015).

## 5 Overcoming the Curse of Adaptivity

The dependence of on makes the result in Theorem 4.1 somewhat theoretical. In practice, it is common to estimate the regularity from the data using, e.g., a hierarchical Bayes method, which treats both and the partition as unknown with a prior. This fully Bayes model brings us a step closer to the actual Bayesian CART and BART priors. Treating both and as random and assuming , the class of approximating functions now constitutes a union of shells

 F=∞⋃K=1F[K],

where each shell

 F[K]=⋃T∈VKF[T],

itself is a union of sets . The sets collect all step functions that grow on the same tree partition .

As mentioned in Castillo and Rousseau (2015), obtaining BvM in the case of random is case dependent. As the prior typically adapts to the regularity of , the no-bias condition (4.4) may not be satisfied if the regularities of and are too different. The adaptive prior can be detrimental in such scenarios, inducing a non-vanishing bias in the centering of the posterior distribution (see Castillo (2012a) or Rivoirard and Rousseau (2012)). Roughly speaking, one needs to make sure that the prior supports large enough values and sufficiently regular partitions so that and can be both safely approximated. To ensure this behavior, we enforce a signal strength assumption through self-similarity requiring that the function “does not appear smoother than it actually is” (Gine and Nickl, 2015). Such qualitative assumptions are natural and necessary for obtaining adaptation properties of confidence bands (Picard and Tribouley, 2000; Bull, 2012; Gine and Nickl, 2010; Nickl and Szabo, 2016; Ray, 2017).

### 5.1 Self-similarity

Various self-similarity conditions have been considered in various estimation settings, including the supremum-norm loss (Bull, 2012; Gine and Nickl, 2010, 2011) as well as the loss (Nickl and Szabo, 2016; Szabo et al., 2015). In the multi-scale analysis, the term self-similar coins “desirable truths” that are not so difficult to estimate since their regularity looks similar at small and large scales. Gine and Nickl (2010) argue that such self-similarity is a reasonable modeling assumption and Bull (2012) shows the set of self-dissimilar Hölder functions (in the sense) is negligible. Szabo et al. (2015) provided an -style self-similarity restriction on a Sobolev parameter space. Nickl and Szabo (2016) weakened this condition and showed that it “cannot be improved upon” and that the statistical complexity of the estimation problem does not decrease quantitatively under self-similarity in Sobolev spaces.

We consider a related notion of self-similar classes within the context of fixed-design regression as opposed to the asymptotically equivalent white noise model. To this end, let us first formalize the notion of the cell size in terms of the local spread of the data and introduce the partition diameter (Verma et al., 2009; Ročková and van der Pas, 2017).

###### Definition 5.1.

(Diameter) Denote by a partition of and by a collection of data points in . We define a diameter of as

 diam(Ωk)=maxx,y∈Ωk∩X∥x−y∥2.

and with we define a diameter of the entire partition where .

Here, we do not require that the design points are strictly on a grid as long as they are regular according to Definition 3.3 of Ročková and van der Pas (2017). We define regular datasets below. First we introduce the notion of the - tree (Bentley, 1975). Such a partition is constructed by cycling over coordinate directions in , where all nodes at the same level are split along the same axis. For a given direction , each internal node, say , will be split at a median of the point set (along the axis). This split will pass and observations onto its two children, thereby roughly halving the number of points. After rounds of splits on each variable, all terminal nodes have at least observations, where .

We now define regular datasets in terms of the - tree partition.

###### Definition 5.2.

Denote by the - tree where . We say that a dataset is regular if

 max1≤k≤Kdiam(ˆΩk)

for some large enough constant and all .

The definition states that in a regular dataset, the maximal diameter in the - tree partition should not be much larger than a “typical” diameter.

###### Definition 5.3.

We say that the function is self-similar, if there exists constant and such that

 ∥f0T−f0∥2L≥Mdiam(T)2αfor allT∈VKsuch % thatdiam(T)≤D, (5.2)

where is the the projection of onto

We can relate the assumption (5.2) to the notion of self-similarity in the Remark 3.4 of Szabo et al. (2015). To see this connection, assume for now the equivalent block partition from Section 4, whose diameter is roughly when the design points lie on a regular grid. The study of regression histograms with equivalent blocks under a fixed regular design is statistically equivalent to the multi-scale analysis of Haar wavelet coefficients up to the resolution in the white noise model. The projected model onto the Haar basis can be written as

 Ylk=f0lk+1√nεlkfor0≤l

where and where are the wavelet coefficients indexed by the resolution level and the shift index . The speed of decay of determines the statistical properties of , where -Hölder continuous functions satisfy . Assuming the equivalent blocks partition with blocks, the condition in the Remark 3.4 of Szabo et al. (2015) writes as follows: there exists such that we have

 ∫10|f0T(x)−f0(x)|2dx=∑l≥L∑0≤k<2l(f0lk)2≥MK−2α≍diam(T)2α.

The first equality above stems from the orthogonality of the Haar bases. In this vein, (5.2) can be regarded a generalization of this condition to imbalanced partitions and fixed design under the norm .

To get even more insight into (5.2) in fixed design regression, we take a closer look at the approximation gap. We have

 ∥f0−f0T∥2L=K∑k=1μ(Ωk)Var[f0|Ωk],

where

 Var[f0|Ωk]=1n(Ωk)∑xi∈Ωk⎛⎝f0(xi)−1n(Ωk)∑xi∈Ωkf0(xi)⎞⎠2

is the local variability of the function inside . The function will be self-similar when the variability inside each cell ’s is large enough to be detectable, i.e.

 inf1≤k≤KVar[f0|Ωk]>Mdiam2α(T)for allT={Ωk}Kk=1∈VK

such that for some . From the definition of the partition diameter, it turns out that this will be satisfied when for all and Functions that are nearly constant in long intervals will not satisfy this requirement. The premise of self-similar functions is that their signal should be detectable with partitions that undersmooth the truth.

In addition, it follows from the proof of Lemma 3.2 of Ročková and van der Pas (2017) that . The lower bound in (5.2) thus matches the upper bound making the approximation error behave similarly across partitions with similar diameters, essentially identifying the smoothness.

Based on these considerations, we introduce the notion of regular partitions that are not too rough in the sense that their diameters shrink sufficiently fast.

###### Definition 5.4.

For some and for some arbitrarily slowly increasing sequence , we denote

 dn(α)≡(Mn/M)1/αn−1/(2α+p)log1/2αn. (5.3)

A tree partition is said to be -regular for a given if

 diam(T)≤dn(α).

We denote the subset of all -regular partitions with .

The following Lemma states that, when is self-similar, the posterior concentrates on partitions that are not too complex or irregular.

###### Lemma 5.1.

Assume that is self-similar, and that the design is regular. Under the Bayesian CART prior ((3.3) and (3.4) or (3.5)) with Gaussian or Laplace step heights ((3.6) or (3.7)) we have

 Π({T∉Rn}∪{T∈VK:K>Kn}|Y(n))→0

in -probability as , where are all regular partitions and for some .

###### Proof.

With , the assumption (5.2) implies when , where is the projection of onto . The posterior distribution under the Bayesian CART prior concentrates at the rate in the sense, i.e. in -probability for any arbitrarily slowly increasing sequence . This follows from Ročková and van der Pas (2017) for the conditionally uniform tree partition prior and from Ročková and Saha (2019) for the tree-branching process prior. Both of these papers study Gaussian step heights. In the Appendix (Section A.2), we extend these results to Laplace step heights. From the near-minimaxity of the posterior, it then follows that partitions that are not regular are not supported by the posterior. The dimensionality part regarding follows from Ročková and van der Pas (2017) and Ročková and Saha (2019). ∎

### 5.2 Adaptive BvM for Smooth Functionals when p=1

It is known that signal-strength conditions enforced through self-similarity allow for the construction of honest adaptive credible balls (Gine and Nickl, 2010). Our notion of self-similarity is sufficient for obtaining the adaptive semi-parametric BvM phenomenon for smooth linear functionals. Denote with

 R(Kn)={T∈Rn∩VKforK≤Kn}.

Lemma 5.1 shows that the posterior concentrates on this set so that we can perform the analysis locally. For any , we write

 ˆΨT=Ψ(fT0)+Wn(aT)√n,

where and are the projections onto . Under the adaptive prior (treating the partitions as random with a prior) the posterior is asymptotically close to a mixture of normals indexed by with weights . When

 maxT∈R(Kn)∣∣∥aT∥L−∥a∥L∣∣=o(1)andmaxT∈R(Kn)√n(Ψn−ˆΨT)=oP(1), (5.4)

this mixture boils down to the target law . The first condition in (5.4) holds owing to the fact that (Lemma 3.2 of Rockova and van der Pas (2017)). The second condition will be satisfied as long as

 √ndn(α)α+γ→0. (5.5)

For and assuming , we have for

 √ndn(α)α+γ≲n1/2−α+γ2α+1(logn)α+γα→0

for . We can now state an adaptive variant of Theorem 4.1 for random partitions.

###### Theorem 5.1.

Assume model (1.1) with , where and for and . Assume that is self-similar. Under the Bayesian CART prior ((3.3) and (3.4) or (3.5)) with Gaussian or Laplace step heights ((3.6) or (3.7)), we have

 Π(√n(Ψ(fT,β)−ˆΨT)|Y(n))⇝N(0,∥a∥2L)

in -probability as .

###### Proof.

Section A.3

Theorem 5.1 can be regarded as an adaptive extension of the regular density histogram result of Castillo and Rousseau (2015). Here, we instead focus on irregular and adaptive regression histograms underpinned by tree priors and treat both and as random under self-similarity. The change of measure argument is performed locally for each regular partition.

Theorem 5.1 can be extended to tree ensembles. The self-similarity assumption would be instead formulated in terms of a global partition, which is obtained by super-imposing all tree partitions inside and by merging empty cells with one of their neighbors. Since tree ensembles also concentrate at near-minimax speed (Ročková and van der Pas, 2017; Ročková and Saha, 2019), one obtains that the posterior concentrates on regular ensembles (where the diameter is small). The analysis is then performed locally on regular ensembles in the same spirit as for single trees.

### 5.3 BvM for Average Regression Surface when p>1

One of the main limitations of tree/forest methods is that they cannot estimate optimally functions that are smoother than Lipschitz (Scricciolo, 2007). The reason for this limitation is that step functions are relatively rough; e.g. the approximation error of histograms for functions that are smoother than Lipschitz is at least of the order , where is the number of bins (Ročková and van der Pas, 2017). The number of steps required to approximate a smooth function well is thus too large, creating a costly bias-variance tradeoff. When , the no-bias condition (4.4) would be satisfied if which, from the Hölder continuity, holds when is a constant function.

Focusing on the actual BART method when , we now rephrase Theorem 5.1 for the average regression functional (2.3) obtained with . When is a constant function, the no-bias condition (4.4) is automatically satisfied. Recall that the second requirement for BvM pertains to the shift of measures. It turns out that the Gaussian prior  (3.6) may induce too much bias when the variance is too small (fixed as ). We thereby deploy an additional assumption in the prior to make sure that the variance increases suitably with the number of steps . For the BART prior on step heights of each tree , we assume either a Gaussian prior or the Laplace prior with .

Having the variance scale with the number of steps is generic in the multi-scale analysis of Haar wavelets. For instance, assuming a full dyadic tree with layers, where , the iid standard Gaussian product prior on the wavelet coefficients implies a Gaussian prior with variance on the bottom nodes (Castillo and Ročková, 2019). The bottom nodes correspond to histogram bins in our setup, where more refined partitions will ultimately have more variable step sizes.

The following theorem is formulated for a few variants of the BART prior. This prior consists of (a) either fixed number of trees (as recommended by (Chipman et al., 2010)), (b) the Galton-Watson prior (3.5) or the conditionally uniform tree prior (3.3) and (3.4), independently for each tree, and (c) the Gaussian prior or the Laplace prior with where is the number of bottom nodes of a tree . Below, we denote with , where is a projection of onto a set of all forest mappings (3.2) supported on the tree ensemble .

###### Theorem 5.2.

Assume model (1.1) with , where is endowed with the BART prior (as stated above) and where and . Assume that in (2.3). When the design is regular, we have in -probability as .

Section A.4

## 6 Discussion

This note reveals some aspects of Bernstein-von Mises limits under adaptive BART priors. We focus on semi-parametric BvM’s for linear functionals of the infinite-dimensional regression function parameters. This semi-parametric setup already poses nontrivial challenges on hierarchical Bayes. We have reiterated and highlighted some of the challenges here and addressed them with self-similarity identification. Our results serve as a step towards obtaining fully non-parametric BvM for the actual function , as opposed to just its low-dimensional summaries. These results will be reported in follow-up work.

## References

• Anderson (1966) Anderson, T. W. (1966). Some nonparametric multivariate procedures based on statistically equivalent blocks. Multivariate Analysis 1, 5–27.
• Bentley (1975) Bentley, J. L. (1975). Multidimensional binary search trees used for associative searching. Communications of the ACM 18(9), 509–517.
• Bickel and Kleijn (1999) Bickel, P. and B. Kleijn (1999). On the Bernstein–von Mises theorem with infinite-dimensional parameters. The Annals of Statistics 27, 1119–1140.
• Bickel and Kleijn (2012) Bickel, P. and B. Kleijn (2012). The semiparametric Bernstein–von Mises theorem. The Annals of Statistics 40, 206–237.
• Breiman (2001) Breiman, L. (2001). Random forests. Machine Learning 45(1), 5–32.
• Bull (2012) Bull, A. (2012). Honest adaptive confidence bands and self-similar functions. Electronic Journal of Statistics 6, 1490–1516.
• Castillo (2012a) Castillo, I. (2012a). Semiparametric Bernstein–von Mises theorem and bias, illustrated with Gaussian process priors. Sankhya 74, 194–221.
• Castillo (2012b) Castillo, I. (2012b). A semiparametric Bernstein–von Mises theorem for Gaussian process priors. Probab. Theory Related Fields 152, 53–99.
• Castillo and Nickl (2013) Castillo, I. and R. Nickl (2013). Nonparametric Bernstein–von Mises theorems in Gaussian white noise. The Annals of Statistics 41, 1999–2028.
• Castillo and Nickl (2014) Castillo, I. and R. Nickl (2014). On the Bernstein–von Mises theorem for nonparametric Bayes proceduress. The Annals of Statistics 27, 1941–1969.
• Castillo and Rousseau (2015) Castillo, I. and J. Rousseau (2015). A Bernstein–von Mises theorem for smooth functionals in semiparametric models. The Annals of Statistics 43, 2353–2383.
• Castillo and Ročková (2019) Castillo, I. and V. Ročková (2019). Multiscale analysis of BART. In Preparation.
• Cheng and Kosorok (2008) Cheng, G. and M. R. Kosorok (2008). General frequentist properties of the posterior profile distribution. The Annals of Statistics 36, 1819–1853.
• Chipman et al. (2010) Chipman, H., E. I. George, and R. McCulloch (2010). BART: Bayesian additive regression trees. Annals of Applied Statistics 4, 266–298.
• Chipman et al. (1997) Chipman, H., E. I. George, and R. E. McCulloch (1997). Bayesian CART model search. Journal of the American Statistical Association 93, 935–960.
• Chipman et al. (2000) Chipman, H., E. I. George, and R. E. McCulloch (2000). Hierarchical priors for Bayesian CART shrinkage. Statistics and Computing 10, 17–24.
• Cox (1993) Cox, D. (1993). An analysis of Bayesian inference for nonparametric regression. The Annals of Statistics 21, 903–923.
• Denison et al. (1998) Denison, D., B. Mallick, and A. Smith (1998). A Bayesian CART algorithm. Biometrika 85, 363–377.
• Diaconis and Freedman (1986) Diaconis, P. and D. Freedman (1986). On consistency of Bayes estimates. The Annals of Statistics 14, 1–26.
• Diaconis and Freedman (1998) Diaconis, P. and D. Freedman (1998). Consistency of Bayes estimates for nonparametric regression: normal theory. Bernoulli 4, 411–444.
• Donoho (1997) Donoho, D. (1997). CART and best-ortho-basis: a connection. Annals of Statistics 25, 1870–1911.
• Ghosal (2000) Ghosal, S. (2000). Asymptotic normality of posterior distributions for exponential families when the number of parameters tends to infinity. Journal of Multivariate Analysis 74, 49–68.
• Gine and Nickl (2010) Gine, E. and R. Nickl (2010). Confidence bands in density estimation. The Annals of Statistics 38, 1122–1170.
• Gine and Nickl (2011) Gine, E. and R. Nickl (2011). On adaptive inference and confidence bands. The Annals of Statistics 39, 2383–2409.
• Gine and Nickl (2015) Gine, E. and R. Nickl (2015). Mathematical Foundations of Infinite-Dimensional Statistical Models. Cambridge University Press.
• Johnstone (2010) Johnstone, I. (2010). High dimensional Bernstein–von Mises: Simple examples. In Borrowing Strength: Theory Powering Applications?A Festschrift for Lawrence D. Brown.
• Jonge and van Zanten (2013) Jonge, R. and H. van Zanten (2013). Semiparametric Bernstein?von Mises for the error standard deviation. Electronic Journal of Statistics 7, 217–243.
• Kleijn and van der Vaart (2006) Kleijn, B. and A. van der Vaart (2006). Misspecification in infinite-dimensional Bayesian statistics. Annals of Statistics 34, 837–877.
• Laplace (1810) Laplace, P. S. (1810). Memoire sur les formules qui sont fonctions de tres grands nombres et sur leurs applications aux probabilites. Oeuvres de Laplace 12, 301–349.
• Leahu (2011) Leahu, H. (2011). On the Bernstein–von Mises phenomenon in the Gaussian white noise model. Electron. J. Stat 4, 373–404.
• Nickl and Szabo (2016) Nickl, R. and B. Szabo (2016). A sharp adaptive confidence ball for self-similar functions. Stochastic Processes and their Applications 126, 3913–3934.
• Picard and Tribouley (2000) Picard, D. and K. Tribouley (2000). Adaptive confidence interval for pointwise curve estimation. The Annals of Statistics 28, 298–335.
• Ray (2017) Ray, K. (2017). Adaptive Bernstein–von Mises theorems in Gaussian white noise. The Annals of Statistics 45, 2511–2536.
• Ray and van der Vaart (2018) Ray, K. and A. van der Vaart (2018). Semiparametric Bayesian causal inference using Gaussian process priors. ArXiv.
• Rivoirard and Rousseau (2012) Rivoirard, V. and J. Rousseau (2012). Bernstein–von Mises theorem for linear functionals of the density. The Annals of Statistics 40, 1489–1523.
• Ročková and Saha (2019) Ročková, V. and E. Saha (2019). On theory for BART. International Conference on Artificial Intelligence and Statistics.
• Ročková and van der Pas (2017) Ročková, V. and S. van der Pas (2017+). Posterior concentration for Bayesian regression trees and forests. Annals of Statistics (In Revision), 1–40.
• Scricciolo (2007) Scricciolo, C. (2007). On rates of convergence for Bayesian density estimation. Scandinavian Journal of Statistics 34, 626–642.
• Shen (2002) Shen, X. (2002). Asymptotic normality of semiparametric and nonparametric posterior distributions. Journal of the American Statistical Association 97, 222–235.
• Szabo et al. (2015) Szabo, B., A. van der Vaart, and J. van Zanten (2015). Frequentist coverage of adaptive nonparametric Bayesian credible sets. The Annals of Statistics 43, 1391–1428.
• van der Pas and Ročková (2017) van der Pas, S. and V. Ročková (2017). Bayesian dyadic trees and histograms for regression. Advances in Neural Information Processing Systems, 1–12.
• van der Vaart (2000) van der Vaart, A. (2000). Asymptotic Statistics. Cambridge University Press.
• Verma et al. (2009) Verma, N., S. Kpotufe, and S. Dasgupta (2009). Which spatial partition trees are adaptive to intrinsic dimension? In Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence, pp. 565–574. AUAI Press.

## Appendix A Appendix

### a.1 Proof of Theorem 4.1

First, we show that with large enough will satisfy

 √n|Ψn−ˆΨK|=oP(1). (A.1)

We can write

 √n|Ψn−ˆΨK|=√n|Ψ(f0)−Ψ(fK0)+Wn(a−aK)| =√n∣∣ ∣∣1nn∑i=1[f0(xi)−fK0(xi)+εi][a(xi)−aK(xi)]∣∣ ∣∣,

where we used the fact that

 n∑i=1[f0(xi)−fK0(xi)]aK(xi)=K∑k=1aKkn∑xi∈Ωk⎡⎣f0(xi)−nμ(Ωk)∑xi∈Ωkf0(xi)⎤⎦=0.

Next, we can write

 √n|Ψn−ˆΨK|<[√n∥a−aK∥L×∥f0−fK0∥L+ZKn],

where

 ZKn=1√nn∑i=1εi[a(xi)−aK(xi)].

We assume that the design is regular (according to Definition 3.3 of Rockova and van der Pas (2017) (further referred to as RP17) with ). Assuming that , it follows from the proof of Lemma 3.2 of Rockova and van der Pas (2017) that

 ∥f0−fK0∥L≤∥f0∥HαC1/Kαand∣∣∥a∥L−∥aK∥L∣∣≤∥a−aK∥L≤∥a∥HγC2/Kγ.

We assume that for some and for some . Because

 VarZKn=∥a−aK∥2L=1nn∑i=1[a(xi)−aK(xi)]2≲1/K2γ,

we conclude that when as and thereby

 √n|Ψn−ˆΨK|≲√nK−(α+γ)+oP(1). (A.2)

With the choice for and , (A.2) will be satisfied.

To continue, we introduce the following notation

 fKt=fK−taK√n

and write

 ℓn(fKt)−ℓn(fK0)−[ℓn(fK)−ℓn(fK0)]