# High-dimensional nonparametric monotone function estimation using BART

Hugh A. Chipman, Edward I. George, Robert E. McCulloch, Thomas S. Shively¹

¹ Hugh Chipman is Professor, Department of Mathematics and Statistics, Acadia University, Wolfville, Nova Scotia, B4P 2R6, hugh.chipman@acadiau.ca. Edward I. George is Professor of Statistics, Department of Statistics, The Wharton School, University of Pennsylvania, 3730 Walnut St, 400 JMHH, Philadelphia, PA 19104-6304, edgeorge@wharton.upenn.edu. Robert E. McCulloch is Professor of Statistics, The School of Mathematical and Statistical Sciences, Arizona State University, Tempe, AZ, 85281, robert.mcculloch@asu.edu. Thomas S. Shively is Professor of Statistics, Department of Information, Risk, and Operations Management, University of Texas at Austin, 1 University Station, B6500, Austin, TX 78712-1175, Tom.Shively@mccombs.utexas.edu. This work was supported by grant DMS-1406563 from the National Science Foundation.
July 11, 2019
###### Abstract

For the estimation of a regression relationship between $Y$ and a large set of potential predictors $x$, the flexible nature of a nonparametric approach such as BART (Bayesian Additive Regression Trees) allows for a much richer set of possibilities than a more restrictive parametric approach. However, it may often occur that subject matter considerations suggest the relationship will be monotone in one or more of the predictors. For such situations, we propose monotone BART, a constrained version of BART that uses the monotonicity information to improve function estimation without resorting to a parametric form. Imposing monotonicity, when appropriate, results in (i) function estimates that are smoother and more interpretable, (ii) better out-of-sample predictive performance, (iii) less uncertainty, and (iv) less sensitivity to prior choice. While some of the key aspects of the unconstrained BART model carry over directly to monotone BART, the imposition of the monotonicity constraints necessitates a fundamental rethinking of how the model is implemented. In particular, the Markov chain Monte Carlo algorithm of the original BART relied on a conditional conjugacy that is no longer available in a high-dimensional, constrained space.

KEY WORDS: Bayesian nonparametrics, high-dimensional regression, ensemble model, MCMC algorithm, monotonicity constraints

## 1 Introduction

Suppose one would like to learn how $Y$ depends on a vector of potential predictors $x = (x_1, \ldots, x_p)$ when very little prior information is available about the form of the relationship. With only very weak assumptions, the Bayesian nonparametric approach BART (Bayesian Additive Regression Trees) can quickly discover the nature of this relationship; see Chipman, George, and McCulloch (2010), hereafter CGM10. More precisely, based only on the assumption that

$$Y = f(x) + \epsilon, \qquad \epsilon \sim N(0, \sigma^2), \tag{1.1}$$

BART can quickly obtain full posterior inference for the unknown regression function

$$f(x) = E(Y \mid x) \tag{1.2}$$

and the unknown variance $\sigma^2$. BART also provides predictive inference as well as model-free variable selection and interaction detection; see Chipman, George, and McCulloch (2013), Bleich et al. (2014), and Kapelner and Bleich (2016).

The main goal of this paper is the development of monotone BART (hereafter mBART), a constrained version of BART that restricts attention to regression functions that are monotone in any predesignated subset of the components of $x$. Such monotonicity constraints often arise naturally from subject matter considerations. For example, in one of our illustrative data sets, the variable of interest is the price of a house. One of the components of the explanatory variable $x$ is the size of the house. It seems reasonable to restrict our search for $f$ to functions such that bigger houses sell for more, all other things equal.

There is a rich literature on monotone function estimation in both the frequentist and Bayesian paradigms. Frequentist methods to estimate univariate monotone functions include Barlow et al. (1972), Mammen (1991), Ramsay (1998), and Kong and Eubank (2006), while Bayesian methods include Lavine and Mockus (1995), Holmes and Heard (2003), Neelon and Dunson (2004), Shively, Sager, and Walker (2009), and Shively, Walker, and Damien (2011). However, these methods are difficult to implement for high-dimensional multivariate function estimation because they are built on basis elements that are fundamentally low-dimensional objects. Saarela and Arjas (2011) and Lin and Dunson (2014) develop methods to estimate monotone multivariate functions, but these become computationally intensive as the number of predictor variables increases. As we will show, mBART imposes no restrictions on $f$ beyond the monotonicity constraints and, at the same time, easily handles high-dimensional data. The reason is that mBART is built on a sum-of-trees approximation to $f$ and is therefore composed of inherently multivariate basis elements.

The extension of BART to our monotonically constrained setting requires two basic innovations. First, it is necessary to develop general constraints for regression trees to be monotone in any predesignated set of coordinates. Under these constraints, the monotonicity of the complete sum-of-trees approximation follows directly. The second innovation requires a new approach for MCMC posterior computation. Whereas the original BART formulation allowed straightforward marginalization over regression tree parameters using a conditional conjugacy argument, the constrained trees formulation requires a more nuanced approach because the conjugacy argument no longer applies.

The outline of the paper is as follows. In Section 2, we describe in detail the constrained sum-of-trees model used for monotone function estimation. Section 3 discusses the regularization prior for the trees, while Section 4 describes the new MCMC algorithm required to implement mBART. Section 5 provides simulation results for one-dimensional and five-dimensional function estimation as well as three examples using house price, car price, and stock returns data. Section 6 contains some concluding remarks.

## 2 A Monotone Sum-of-Trees Model

The essence of BART is a sum-of-trees model approximation of the relationship between $Y$ and $x$ in (1.1):

$$Y = \sum_{j=1}^{m} g(x; T_j, M_j) + \epsilon, \qquad \epsilon \sim N(0, \sigma^2), \tag{2.1}$$

where each $T_j$ is a binary regression tree with a set of associated terminal node constants $M_j$, and $g(x; T_j, M_j)$ is the function which assigns a constant in $M_j$ to $x$ according to the sequence of decision rules in $T_j$. These decision rules are binary partitions of the predictor space of the form $\{x_i \le c\}$ vs. $\{x_i > c\}$, where the splitting value $c$ is in the range of $x_i$. (A clarifying example of how $g(x; T, M)$ works appears in Figure 1 below and is described later in this section.) When $m = 1$, (2.1) reduces to the single tree model used by Chipman et al. (1998) for Bayesian CART.

Under (2.1), $E(Y \mid x)$ is the sum, over the trees $T_j$, of all the terminal node constants assigned to $x$ by the $g(x; T_j, M_j)$'s. As these constants can take any values, it is easy to see that the sum-of-trees model (2.1) is a very flexible representation capable of representing a wide class of functions from $R^p$ to $R$, especially when the number of trees $m$ is large. Composed of many simple functions from $R^p$ to $R$, namely the $g(x; T_j, M_j)$'s, the sum-of-trees representation is much more manageable than a representation with more complicated basis elements such as multidimensional wavelets or multidimensional splines. And because each tree function is invariant to monotone transformations of the predictors (aside from the splitting values), standardization choices are not needed for the predictors.

Key to the construction of monotone BART are the conditions under which the underlying sum-of-trees function will satisfy the following precise definition of a multivariate monotone function.

Definition: For $S$ a subset of the coordinates of $x = (x_1, \ldots, x_p)$, a function $f$ is said to be monotone in $S$ if for each $i \in S$ and all values of the remaining coordinates, $f$ satisfies

$$f(x_1, \ldots, x_i + \delta, \ldots, x_p) \ge f(x_1, \ldots, x_i, \ldots, x_p) \tag{2.2}$$

for all $\delta > 0$ ($f$ is nondecreasing in $x_i$), or for all $\delta < 0$ ($f$ is nonincreasing in $x_i$).

Clearly, a sum-of-trees function will be monotone in $S$ whenever each of its component trees is monotone in $S$. Thus it suffices to focus on the conditions for a single tree function $g(x; T, M)$ to be monotone in $S$. As we'll see, this will only entail placing constraints on the set of terminal node constants $M$; constraints determined by the tree $T$.

We illustrate these concepts with the bivariate monotone tree function $g(x; T, M)$ in Figure 1. This tree has six terminal nodes, labeled 4, 10, 11, 12, 13, and 7. The labels follow the standard tree node labeling scheme where the top node is labeled 1 and any non-terminal node with label $k$ has a left child with label $2k$ and a right child with label $2k + 1$. Beginning at the top node, each $x$ is assigned to subsequent nodes according to the sequence of splitting rules it meets. This continues until $x$ reaches a terminal node, where $g(x; T, M)$ assigns the designated value of $\mu$ from the set $M$. For example, with this choice of $M$, $g(x; T, M) = 3$ whenever $x$ is assigned to the corresponding terminal node.
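The labeling scheme above is the standard heap-style indexing, which can be sketched in a few lines (the function names here are our own, for illustration only):

```python
import math

def children(k):
    """Left and right child labels of node k under the standard scheme."""
    return 2 * k, 2 * k + 1

def depth(k):
    """Depth of node k (the root, node 1, has depth 0)."""
    return int(math.log2(k))

# The tree in Figure 1: node 7 is a terminal child of node 3, and the
# terminal nodes 10-13 are the children of nodes 5 and 6.
assert children(3) == (6, 7)
assert children(5) == (10, 11) and children(6) == (12, 13)
assert depth(1) == 0 and depth(7) == 2 and depth(13) == 3
```
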

Alternative views of the function in Figure 1 are depicted in Figure 2. Figure 2a shows the partition of the $x$ space induced by $T$. The terminal node regions, $R_4$, $R_{10}$, $R_{11}$, $R_{12}$, $R_{13}$, and $R_7$, correspond to the six similarly labeled terminal nodes of $T$. Figure 2b shows $g(x; T, M)$ as a simple step function which assigns a level to each terminal node region. From Figure 2b, it is clear that for any $x$, increasing $x_1$ or $x_2$ can only increase $g(x; T, M)$. Thus, in the sense of our definition, this $g$ is monotone in both $x_1$ and $x_2$.

To see the essence of what is needed to guarantee the monotonicity of a tree function, consider the very simple case of a monotone $g(x; T, M)$ when $g$ is a function of $x_1$ only, as depicted in Figure 3. Each level region of $g$ corresponds to a terminal node region in $x_1$ space, which is simply an interval whenever $g$ is a univariate function. For each such region, consider the adjoining region with larger values of $x_1$, which we refer to as an above-neighbor region, and the adjoining region with smaller values of $x_1$, which we refer to as a below-neighbor region. End regions will only have a single neighboring region. To guarantee (nondecreasing) monotonicity, it suffices to constrain the level assigned to each terminal node region to be less than or equal to the level of its above-neighbor region, and to be greater than or equal to the level of its below-neighbor region.

To apply these notions to a bivariate tree function as depicted in Figures 1 and 2, we will simply say that rectangular regions are neighboring if they have boundaries which are adjoining in any of the coordinates. Furthermore, a region will be called an above-neighbor of a second region if the lower adjoining boundary of the first is the upper adjoining boundary of the second. A below-neighbor is defined similarly. The above- and below-neighbors of each region in Figure 2a can be read off from its adjoining boundaries.

Not all pairs of regions in Figure 2a are neighbors. We will say that two regions are separated in a coordinate when the upper boundary of one is less than the lower boundary of the other in that coordinate. For a small enough step size $\delta$, it is impossible to get from one separated region to the other by changing any single $x_i$ by $\delta$, so the mean level of one does not constrain the mean level of the other.

To make these definitions precise for a $d$-dimensional tree (a function of $x_1, \ldots, x_d$), we note that each terminal node region $R_k$ of $T$ will be a rectangular region of the form

$$R_k = \{x : x_i \in [L_{ik}, U_{ik}), \; i = 1, \ldots, d\}, \tag{2.3}$$

where the interval $[L_{ik}, U_{ik})$ for each $x_i$ is determined by the sequence of splitting rules leading to $R_k$.

We say that $R_k$ is separated from $R_{k'}$ if $U_{ik} < L_{ik'}$ or $U_{ik'} < L_{ik}$ for some $i$. Separated regions place no constraints on each other's levels.

If $R_k$ and $R_{k'}$ are not separated, $R_{k'}$ will be said to be an above-neighbor of $R_k$ if $L_{ik'} = U_{ik}$ for some $i$, and it will be said to be a below-neighbor of $R_k$ if $U_{ik'} = L_{ik}$ for some $i$. Note that any terminal node region may have several above-neighbor and several below-neighbor regions, as the regions in Figure 2a illustrate.

The constraints on the levels of $M$ under which $g(x; T, M)$ will be monotone are now straightforward to state.

Constraint Conditions for Tree Monotonicity: A tree function $g(x; T, M)$ will be monotone if the level of each of its terminal node regions is
(a) less than or equal to the minimum level of all of its above-neighbor regions, and
(b) greater than or equal to the maximum level of all of its below-neighbor regions.

The function will be monotone in $S$ if the neighboring regions satisfy (a) and (b) for all the coordinates in $S$ (rather than for all coordinates).

As we'll see in subsequent sections, an attractive feature of these conditions is that they dovetail perfectly with the nature of our iterative MCMC simulation calculations. At each step there, we simulate one terminal node level at a time conditionally on all the other node levels, so imposing the constraints is straightforward. This avoids the need to simultaneously constrain all the levels at once.
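The separation and neighbor relations, together with the Constraint Conditions, are easy to operationalize. Below is a minimal sketch (our own data layout, not the paper's implementation) for monotonicity in all coordinates; restricting to a subset $S$ would simply limit which coordinates are examined. Each terminal region is a list of per-coordinate intervals as in (2.3), and a level vector is monotone exactly when each level lies between the maximum below-neighbor level and the minimum above-neighbor level:

```python
INF = float("inf")

def separated(Ra, Rb):
    # Rb unreachable from Ra by a small step in any single coordinate
    return any(Ua < Lb or Ub < La for (La, Ua), (Lb, Ub) in zip(Ra, Rb))

def above_neighbor(Ra, Rb):
    # True if Rb is an above-neighbor of Ra: Rb's lower boundary equals
    # Ra's upper boundary in some coordinate, and they are not separated
    if separated(Ra, Rb):
        return False
    return any(Lb == Ua for (La, Ua), (Lb, Ub) in zip(Ra, Rb))

def level_bounds(k, regions, mu):
    """Interval within which mu[k] keeps the step function monotone."""
    lo = max([mu[j] for j in range(len(regions))
              if j != k and above_neighbor(regions[j], regions[k])], default=-INF)
    hi = min([mu[j] for j in range(len(regions))
              if j != k and above_neighbor(regions[k], regions[j])], default=INF)
    return lo, hi

def is_monotone(regions, mu):
    return all(lo <= mu[k] <= hi
               for k in range(len(regions))
               for lo, hi in [level_bounds(k, regions, mu)])
```

The `level_bounds` function is exactly the one-node-at-a-time view used later by the MCMC algorithm: conditioning on all other levels reduces the constraint on a single level to an interval.
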

## 3 A Constrained Regularization Prior

The mBART model specification is completed by putting a constrained regularization prior on the parameters, $(T_1, M_1), \ldots, (T_m, M_m)$ and $\sigma$, of the sum-of-trees model (2.1). Essentially a modification of the original BART prior formulation to accommodate designated monotone constraints, we follow CGM10 and proceed by restricting attention to priors of the form

$$p((T_1, M_1), \ldots, (T_m, M_m), \sigma) = \Big[\prod_j p(T_j, M_j)\Big]\, p(\sigma) = \Big[\prod_j p(M_j \mid T_j)\, p(T_j)\Big]\, p(\sigma). \tag{3.1}$$

Under such priors, the tree components $(T_j, M_j)$ are independent of each other and of $\sigma$.

As discussed in the previous section, a sum-of-trees function is guaranteed to be monotone whenever each of the trees is monotone in the sense of (2.2). Thus, it suffices to restrict the support of $p(M_j \mid T_j)$ to values which satisfy the Monotonicity Constraints (a) and (b) from Section 2. For this purpose, let $C$ be the set of all $(T, M)$ which satisfy these monotonicity constraints, namely

$$C = \{(T, M) : g(x; T, M) \text{ is monotone in } x_i \text{ for } i \in S\}. \tag{3.2}$$

These constraints are then incorporated into the prior by constraining the CGM10 BART independence form to have support only over $C$,

$$p(M_j \mid T_j) \propto \Big[\prod_{i=1}^{b_j} p(\mu_{ij} \mid T_j)\Big]\, \chi_C(T_j, M_j). \tag{3.3}$$

Here $b_j$ is the number of bottom (terminal) nodes of $T_j$, and $\chi_C = 1$ on $C$ and $\chi_C = 0$ otherwise.

In the next three sections we discuss the choice of the priors $p(T_j)$, $p(\sigma)$, and $p(M_j \mid T_j)$. These priors will have the same form as in CGM10, but in some cases the monotonicity constraint will motivate modifications to our choices for the hyperparameters.

### 3.1 The $T_j$ Prior

The tree prior $p(T_j)$ is specified by three aspects: (i) the probability that a node at depth $d$ ($d = 0, 1, 2, \ldots$) has children is

$$\alpha(1 + d)^{-\beta}, \qquad \alpha \in (0, 1), \; \beta \in [0, \infty), \tag{3.4}$$

(ii) the uniform distribution over available predictors for splitting rule assignment at each interior node, and (iii) the uniform distribution on the discrete set of available splitting values for the assigned predictor at each interior node. This last choice has the appeal of invariance under monotone transformations of the predictors.

Because we want the regularization prior to keep the individual tree components small, especially when the number of trees $m$ is set to be large, we typically recommend the defaults $\alpha = 0.95$ and $\beta = 2$ in (3.4) in the unconstrained case. With this choice, simulation of tree skeletons directly from (i) shows us that trees with 1, 2, 3, 4, and 5 or more terminal nodes will receive prior probabilities of about 0.05, 0.55, 0.28, 0.09, and 0.03, respectively.
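These prior probabilities can be checked by directly simulating tree skeletons from (i): grow from the root, splitting each depth-$d$ node with probability $\alpha(1+d)^{-\beta}$. A quick sketch, assuming the CGM10 defaults $\alpha = 0.95$, $\beta = 2$:

```python
import random

def n_terminal(depth, alpha, beta, rng):
    """Terminal-node count of a tree skeleton grown from prior (i):
    a depth-d node splits with probability alpha * (1 + d) ** (-beta)."""
    if rng.random() < alpha * (1.0 + depth) ** (-beta):
        return (n_terminal(depth + 1, alpha, beta, rng)
                + n_terminal(depth + 1, alpha, beta, rng))
    return 1

rng = random.Random(0)
alpha, beta, n_sim = 0.95, 2.0, 100_000
counts = {}
for _ in range(n_sim):
    b = min(n_terminal(0, alpha, beta, rng), 5)  # bin 5+ terminal nodes together
    counts[b] = counts.get(b, 0) + 1
probs = {b: counts[b] / n_sim for b in sorted(counts)}
# probs is roughly {1: 0.05, 2: 0.55, 3: 0.28, 4: 0.09, 5: 0.03}
```

Note that $P(\text{1 node}) = 1 - \alpha = 0.05$ exactly, and $P(\text{2 nodes}) = \alpha(1 - \alpha/4)^2 \approx 0.55$, matching the simulation.
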

Discussion of the choice of and in the constrained case is deferred to the end of Section 4.3 since our choices are motivated by details of the Markov Chain Monte Carlo algorithm for posterior computation.

### 3.2 The σ Prior

For $\sigma^2$, we use the (conditionally) conjugate inverse chi-square distribution $\sigma^2 \sim \nu\lambda/\chi^2_\nu$. To guide the specification of the hyperparameters $\nu$ and $\lambda$, we recommend a data-informed approach in order to assign substantial probability to the entire region of plausible $\sigma$ values while avoiding overconcentration and overdispersion. This entails calibrating the prior degrees of freedom $\nu$ and scale $\lambda$ using a "rough data-based overestimate" $\hat\sigma$ of $\sigma$.

The two natural choices for $\hat\sigma$ are (1) the "naive" specification, in which we take $\hat\sigma$ to be the sample standard deviation of $Y$ (or some fraction of it), or (2) the "linear model" specification, in which we take $\hat\sigma$ as the residual standard deviation from a least squares linear regression of $Y$ on the original $x_i$'s. We then pick a value of $\nu$ between 3 and 10 to get an appropriate shape, and a value of $\lambda$ so that the $q$th quantile of the prior on $\sigma$ is located at $\hat\sigma$, that is, $P(\sigma < \hat\sigma) = q$. We consider values of $q$ such as 0.75, 0.90, or 0.99 to center the distribution below $\hat\sigma$. For automatic use, we recommend the default setting $(\nu, q) = (3, 0.90)$, which tends to avoid extremes. Alternatively, the values of $(\nu, q)$ may be chosen by cross-validation from a range of reasonable choices. This choice is exactly as in CGM10.
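Concretely, with $\sigma^2 \sim \nu\lambda/\chi^2_\nu$, the condition $P(\sigma < \hat\sigma) = q$ is equivalent to $P(\chi^2_\nu > \nu\lambda/\hat\sigma^2) = q$, so $\lambda = \hat\sigma^2\,\chi^2_{\nu,\,1-q}/\nu$, where $\chi^2_{\nu,\,1-q}$ is the $(1-q)$ quantile of a $\chi^2_\nu$. A stdlib-only sketch (the incomplete-gamma series and bisection helpers are our own, not part of the paper):

```python
import math

def chi2_cdf(x, df, terms=200):
    """P(X <= x) for X ~ chi-square(df), via the regularized lower
    incomplete gamma series with a = df / 2 and y = x / 2."""
    if x <= 0.0:
        return 0.0
    a, y = df / 2.0, x / 2.0
    term = total = 1.0 / a
    for n in range(1, terms):
        term *= y / (a + n)
        total += term
    return math.exp(-y + a * math.log(y) - math.lgamma(a)) * total

def calibrate_lambda(sigma_hat, nu=3, q=0.90):
    """lambda such that P(sigma < sigma_hat) = q when sigma^2 ~ nu*lambda/chi2_nu."""
    lo, hi = 0.0, 100.0  # bisection for the (1 - q) quantile of chi2_nu
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, nu) < 1.0 - q:
            lo = mid
        else:
            hi = mid
    return sigma_hat ** 2 * lo / nu
```

With SciPy available, this is simply `sigma_hat**2 * scipy.stats.chi2.ppf(1 - q, nu) / nu`; for the defaults $(\nu, q) = (3, 0.90)$ it gives $\lambda \approx 0.195\,\hat\sigma^2$.
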

### 3.3 The $M_j \mid T_j$ Prior

For the choice of $p(\mu_{ij} \mid T_j)$ in (3.3), we adopt a normal form as in BART. However, here we use different choices of the prior variance depending on whether $C$ in (3.2) imposes constraints on $\mu_{ij}$. For $\mu_{ij}$ unconstrained by $C$, we set

$$p(\mu_{ij} \mid T_j) = \phi(\mu_{ij}; \mu_\mu, \sigma_\mu), \tag{3.5}$$

the normal density with mean $\mu_\mu$ and standard deviation $\sigma_\mu$ used by CGM10 for BART, whereas for $\mu_{ij}$ constrained by $C$, we set

$$p(\mu_{ij} \mid T_j) = \phi(\mu_{ij}; \mu_\mu, c\,\sigma_\mu), \tag{3.6}$$

the density with standard deviation inflated by a factor $c > 1$.

To motivate our prior choice in (3.6), consider a simple tree with just two terminal node means $\mu_1$ and $\mu_2$ constrained to satisfy $\mu_1 \le \mu_2$. Under (3.6), the joint distribution of $\mu_1$ and $\mu_2$ is

$$p(\mu_1, \mu_2) \propto \phi(\mu_1; \mu_\mu, c\sigma_\mu)\, \phi(\mu_2; \mu_\mu, c\sigma_\mu)\, \chi_{\{\mu_1 \le \mu_2\}}(\mu_1, \mu_2). \tag{3.7}$$

In this case, the marginal distributions of $\mu_1$ and $\mu_2$ are skew normal distributions (Azzalini 1985), with variances equal to $\sigma_\mu^2$ for a suitable choice of $c$, and means below and above $\mu_\mu$, respectively. Thus, the marginal prior variances of $\mu_1$ and $\mu_2$ are identical to the marginal variances of the unconstrained means. This equality helps to balance the prior effects across predictors and facilitates the calibrated specification of $\sigma_\mu$ described below. Of course, it will be the case that some means may be further constrained when they occur deeper down the tree, thereby further reducing their prior variance. Although additional small prior adjustments can be considered for such cases, we view them as relatively unimportant because the vast majority of BART trees will be small with at most one or two constraints. Thus, we recommend the prior (3.6) for all $\mu_{ij}$ which are constrained.

To guide the specification of the hyperparameters $\mu_\mu$ and $\sigma_\mu$, we use the same informal empirical Bayes strategy as in CGM10. Based on the idea that $E(Y \mid x)$ is very likely between $y_{min}$ and $y_{max}$, the observed minimum and maximum of $Y$, we want to choose $\mu_\mu$ and $\sigma_\mu$ so that the induced prior on $E(Y \mid x)$ assigns substantial probability to the interval $(y_{min}, y_{max})$. By using the observed $y_{min}$ and $y_{max}$, we aim to ensure that the implicit prior for $E(Y \mid x)$ is in the right "ballpark".

In the unconstrained case, where each value of $E(Y \mid x)$ is the sum of $m$ iid $N(\mu_\mu, \sigma_\mu^2)$ terminal node means under the sum-of-trees model, the induced prior on $E(Y \mid x)$ under (3.5) is exactly $N(m\mu_\mu, m\sigma_\mu^2)$. Let us argue now that when monotone constraints are introduced, $N(m\mu_\mu, m\sigma_\mu^2)$ still holds up as a useful approximation to the induced prior on $E(Y \mid x)$. To begin with, for each value of $x$, let $\mu_j(x) = g(x; T_j, M_j)$, the mean assigned to $x$ by the $j$th tree. Then, under the sum-of-trees model, $E(Y \mid x)$ is the sum of $m$ independent means since the $\mu_j(x)$'s are independent across trees. Using central limit theorem considerations, this sum of small random effects will be approximately normal, at least for the central part of the distribution. The means of all the random effects will be centered around $\mu_\mu$ (the constrained $\mu$'s will have pairwise offsetting biases), and so the mean of $E(Y \mid x)$ will be approximately $m\mu_\mu$. Finally, since the marginal variance of all the $\mu$'s is at least approximately $\sigma_\mu^2$, the variance of $E(Y \mid x)$ will be approximately $m\sigma_\mu^2$.

Proceeding as in CGM10, we thus choose $\mu_\mu$ and $\sigma_\mu$ so that $m\mu_\mu - k\sqrt{m}\,\sigma_\mu = y_{min}$ and $m\mu_\mu + k\sqrt{m}\,\sigma_\mu = y_{max}$ for some preselected value of $k$. This is conveniently implemented by first shifting and rescaling $Y$ so that the observed transformed values range from $-0.5$ to $0.5$, and then setting $\mu_\mu = 0$ and $\sigma_\mu = 0.5/(k\sqrt{m})$. Using $k = 2$, for example, would yield a 95% prior probability that $E(Y \mid x)$ is in the interval $(-0.5, 0.5)$, thereby assigning substantial probability to the entire region of plausible values of $E(Y \mid x)$ while avoiding overconcentration and overdispersion. As $k$ and/or the number of trees $m$ is increased, this prior will become tighter, thus limiting the effect of the individual tree components of (2.1) by keeping the $\mu_{ij}$ values small. We have found that values of $k$ between 1 and 3 yield good results, and we recommend $k = 2$ as an automatic default choice, the same default recommendation as for BART. Alternatively, the value of $k$ may be chosen by cross-validation from a range of reasonable choices.
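The calibration above reduces, after rescaling $y$ to $(-0.5, 0.5)$, to $\mu_\mu = 0$ and $\sigma_\mu = 0.5/(k\sqrt{m})$, so that $k$ standard deviations of the induced $N(0, m\sigma_\mu^2)$ prior land exactly at $\pm 0.5$. A minimal sketch (helper name is ours):

```python
import math

def calibrate_mu_prior(y, m, k=2.0):
    """Shift/scale y so its observed range maps to (-0.5, 0.5), then set
    mu_mu = 0 and sigma_mu = 0.5 / (k * sqrt(m)) so that k sd's of the
    induced N(0, m * sigma_mu^2) prior on E(Y|x) land at +/- 0.5."""
    y_min, y_max = min(y), max(y)
    shift, scale = (y_min + y_max) / 2.0, y_max - y_min
    sigma_mu = 0.5 / (k * math.sqrt(m))
    return shift, scale, sigma_mu

shift, scale, sigma_mu = calibrate_mu_prior([3.0, 7.0, 5.0], m=200, k=2.0)
# induced prior sd of E(Y|x): sqrt(200) * sigma_mu = 0.25, so +/- 2 sd = (-0.5, 0.5)
```
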

#### 3.3.1 The Choice of m

Again as in BART, we treat the number of trees $m$ as a fixed tuning constant to be chosen by the user. For prediction, we have found that mBART performs well with relatively large values of $m$, while for variable selection, much smaller values are often effective.

## 4 Simulation from the Constrained Posterior

### 4.1 Backfitting MCMC for Constrained Regression Trees

Let $y$ be the vector of $n$ independent observations of $Y$ from (2.1). All post-data information for Bayesian inference about any aspects of the unknowns, $(T_1, M_1), \ldots, (T_m, M_m)$, $\sigma$, and future values of $Y$, is captured by the full posterior distribution

$$p((T_1, M_1), \ldots, (T_m, M_m), \sigma \mid y). \tag{4.1}$$

Since all inference is conditional on the given predictor values, we suppress them in the notation. This posterior is proportional to the product of the likelihood $p(y \mid (T_1, M_1), \ldots, (T_m, M_m), \sigma)$, which is the product of normal likelihoods based on (2.1), and the constrained regularization prior described in Section 3.

To extract information from (4.1), which is generally intractable, we propose an MCMC backfitting algorithm that simulates a sequence of draws, $k = 1, 2, \ldots$,

$$(T_1, M_1)^{(k)}, \ldots, (T_m, M_m)^{(k)}, \sigma^{(k)} \tag{4.2}$$

that converges in distribution to (4.1) as $k \to \infty$.

Beginning with a set of initial values $(T_1, M_1)^{(0)}, \ldots, (T_m, M_m)^{(0)}, \sigma^{(0)}$, this algorithm proceeds by simulating a sequence of transitions $(T_j, M_j)^{(k)} \to (T_j, M_j)^{(k+1)}$, for $j = 1, \ldots, m$ and $k = 0, 1, \ldots$. Each transition is obtained by using a Metropolis-Hastings (MH) algorithm to simulate a single transition of a Markov chain with stable distribution

$$p((T_j, M_j) \mid r_j^{(k)}, \sigma^{(k)}), \tag{4.3}$$

for $j = 1, \ldots, m$, where

$$r_j^{(k)} \equiv y - \sum_{j' \ne j} g(x; T_{j'}, M_{j'}) \tag{4.4}$$

is the vector of partial residuals based on a fit that excludes the $j$th tree, computed with the most current simulated values of $(T_{j'}, M_{j'})$ for $j' \ne j$. A full iteration of the algorithm is then completed by simulating the draw of $\sigma^{(k+1)}$ from the full conditional

$$\sigma \mid (T_1, M_1)^{(k+1)}, \ldots, (T_m, M_m)^{(k+1)}, y. \tag{4.5}$$

Because conditioning the distribution of $(T_j, M_j)$ on $r_j^{(k)}$ and $\sigma^{(k)}$ in (4.3) is equivalent to conditioning on the excluded values of $(T_{j'}, M_{j'})$, $j' \ne j$, and $\sigma^{(k)}$, this algorithm is an instance of MH within a Gibbs sampler.
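The overall sampler can be summarized in a short Python sketch. The `propose_tree_move` and `draw_sigma` callbacks stand in for the MH step of Section 4.2 and the conjugate draw (4.5); all names here are our own, not the paper's implementation:

```python
def backfitting_mcmc(y, x, trees, sigma, n_iter, propose_tree_move, draw_sigma, predict):
    """Sketch of the one-tree-at-a-time sampler for the sum-of-trees model (2.1).

    trees is a list of m (T_j, M_j) states; predict(tree, xi) returns g(xi; T_j, M_j);
    propose_tree_move performs a single MH transition targeting (4.3);
    draw_sigma performs the full-conditional draw (4.5).
    """
    draws = []
    for _ in range(n_iter):
        for j in range(len(trees)):
            # Partial residuals (4.4): subtract the fit of every tree except
            # tree j, using the most current values of the other trees.
            r_j = [yi - sum(predict(trees[jp], xi)
                            for jp in range(len(trees)) if jp != j)
                   for yi, xi in zip(y, x)]
            trees[j] = propose_tree_move(trees[j], r_j, sigma)
        sigma = draw_sigma(y, x, trees, predict)
        draws.append((list(trees), sigma))
    return draws
```
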

### 4.2 A New Localized MH Algorithm

We now describe a new localized MH algorithm for the simulation of $(T_j, M_j)^{(k)} \to (T_j, M_j)^{(k+1)}$ as single transitions of a Markov chain converging to the (possibly constrained) posterior (4.3). For simplicity of notation, let us denote a generic instance of these moves by $(T^0, M^0) \to (T^1, M^1)$. Dropping $\sigma$ from (4.3) since it is fixed throughout this move, and dropping all the remaining subscripts and superscripts, the target posterior distribution can be expressed as

$$p(T, M \mid r) = p(r \mid T, M)\, p(M \mid T)\, p(T)/p(r), \tag{4.6}$$

where its components are as follows.

where its components are as follows.

First, $p(r \mid T, M)$ is the normal likelihood which would correspond to an observation of $r = g(x; T, M) + \epsilon$. Assuming $T$ has $b$ terminal nodes, and letting $r_i$ be the vector of components of $r$ assigned to terminal node $i$ by $T$, this likelihood is of the form

$$p(r \mid T, M) = \prod_{i=1}^{b} p(r_i \mid \mu_i), \tag{4.7}$$

where

$$p(r_i \mid \mu_i) \propto \prod_j \exp\{-(r_{ij} - \mu_i)^2 / 2\sigma^2\}. \tag{4.8}$$

The prior on $M$ given $T$ from (3.3) is of the form

$$p(M \mid T) \propto \Big[\prod_{i=1}^{b} p(\mu_i \mid T)\Big]\, \chi_C(T, M), \tag{4.9}$$

where $p(\mu_i \mid T)$ is taken from (3.5) if $\mu_i$ is unconstrained by $C$, and from (3.6) if $\mu_i$ is constrained by $C$. The tree prior $p(T)$ described in Section 3.1 is of the same form used for unconstrained BART. Finally, the intractable marginal $p(r)$, which would in principle be obtained by summing and integrating over $T$ and $M$, will fortunately play no role in our algorithm.

In unconstrained CART and BART, CGM98 and CGM10 used the following two-step Metropolis-Hastings (MH) procedure for the simulation of $(T^0, M^0) \to (T^1, M^1)$. First, a tree proposal $T^*$ was generated with probability $q(T^0 \to T^*)$. Letting $q(T^* \to T^0)$ be the probability of the reversed step, the move was then accepted with probability

$$\alpha = \min\Big\{\frac{q(T^* \to T^0)}{q(T^0 \to T^*)}\, \frac{p(T^* \mid r)}{p(T^0 \mid r)},\; 1\Big\} = \min\Big\{\frac{q(T^* \to T^0)}{q(T^0 \to T^*)}\, \frac{p(r \mid T^*)}{p(r \mid T^0)}\, \frac{p(T^*)}{p(T^0)},\; 1\Big\}. \tag{4.10}$$

If accepted, any part of $M^1$ with a new ancestry under $T^1 = T^*$ is simulated from independent normals, since $p(M \mid T, r)$ just consists of independent normals given the independence and conditional conjugacy of our prior (which is (4.9) without the monotonicity constraint $\chi_C$) and the conditional data independence (4.7). Otherwise $(T^1, M^1)$ is set equal to $(T^0, M^0)$.

In the constrained case, the basic algorithm is the same except that, with the monotonicity constraint $\chi_C$ in (4.9), the $\mu_i$ in $M$ are dependent. Hence, when we make local moves involving a few of the $\mu_i$, we must be careful to condition on the remaining elements. In addition, computations must be done numerically since we lose the conditional conjugacy. The moves in mBART only operate on one or two of the $\mu_i$ values at a time so that the appropriate conditional integrals can easily be done numerically.

We consider localized proposals under which $M^0$ and $M^*$ differ only by those $\mu_i$'s which have different ancestries under $T^0$ and $T^*$. Letting $\mu^{same}$ be the part of $M^0$ with the same ancestry under $T^0$ and $T^*$, we restrict attention to proposals for which $M^0 = (\mu^{same}, \mu^{old})$ and $M^* = (\mu^{same}, \mu^{new})$, where $\mu^{old}$ is the part of $M^0$ that will be replaced by $\mu^{new}$ in $M^*$. It will also be convenient in what follows to let $r^{old}$ be the components of the data $r$ assigned to $\mu^{old}$ by $T^0$, $r^{new}$ be the components assigned to $\mu^{new}$ by $T^*$, and $r^{same}$ be the components assigned to the identical components of $\mu^{same}$ by both $T^0$ and $T^*$.

For example, suppose we begin with a proposal distribution that randomly chooses between a birth step and a death step, and that $T^*$ was obtained by a birth step, which entails adding two child nodes at a randomly chosen terminal node of $T^0$. This move is illustrated in Figure 4, where the birth occurs at terminal node 7 of $T^0$: $\mu^{old}$ is the level at node 7 under $T^0$, $\mu^{new}$ consists of the levels at the two new child nodes under $T^*$, and $\mu^{same}$ consists of the levels at the remaining terminal nodes, which are shared by $T^0$ and $T^*$. Note that the set of observations in $r^{new}$ is just the division of the set of observations in $r^{old}$ defined by the decision rule associated with node 7 in the tree $T^*$.

The key is to then proceed conditionally on $\mu^{same}$ and the tree ancestry associated with it. In Figure 4, we condition on $\mu^{same}$ and the ancestral tree structure given by the shared nodes, including the decision rules associated with the interior nodes 1 and 3. To keep the notation clean, we will use $\mu^{same}$ as a conditioning variable in our expressions below, and the reader must make a mental note to include the associated tree ancestry as conditioning information.

Conditional on $\mu^{same}$, our Metropolis procedure is as follows. First, a tree proposal $T^*$ is generated with probability $q(T^0 \to T^*)$, using the same CGM98 proposal distribution used in unconstrained CART and BART. Letting $q(T^* \to T^0)$ be the probability of the reversed step, the move is then accepted with probability

$$\begin{aligned} \alpha &= \min\Big\{\frac{q(T^* \to T^0)}{q(T^0 \to T^*)}\, \frac{p(T^* \mid \mu^{same}, r)}{p(T^0 \mid \mu^{same}, r)},\; 1\Big\} \\ &= \min\Big\{\frac{q(T^* \to T^0)}{q(T^0 \to T^*)}\, \frac{p(T^* \mid \mu^{same}, r^{new})}{p(T^0 \mid \mu^{same}, r^{old})},\; 1\Big\} \\ &= \min\Big\{\frac{q(T^* \to T^0)}{q(T^0 \to T^*)}\, \frac{p(r^{new} \mid T^*, \mu^{same})}{p(r^{old} \mid T^0, \mu^{same})}\, \frac{p(T^*)}{p(T^0)},\; 1\Big\}. \end{aligned} \tag{4.11}$$

The difference between (4.10) and (4.11) is that we condition on $\mu^{same}$ throughout and explicitly note that the $r^{same}$ part of $r$ does not matter. In going from the first line above to the second, we have used the fact that, conditional on $\mu^{same}$, $r^{same}$ gives the same multiplicative contribution to the top and bottom of the acceptance ratio, so that it cancels out, leaving only terms depending on $r^{new}$ and $r^{old}$. To go from the second line above to the third, we will compute the required $p(r^{new} \mid T^*, \mu^{same})$ and $p(r^{old} \mid T^0, \mu^{same})$ marginals numerically, as detailed in Section 4.3 below. Note also that in the BART prior, $T$ and $M$ are dependent only through the dimension of $M$, so the ratio $p(T^*)/p(T^0)$ is the same as in the unconstrained case.

If $T^*$ is accepted, $\mu^{new}$ is then simulated from $p(\mu^{new} \mid r^{new}, T^*, \mu^{same})$ and $(T^1, M^1)$ is set equal to $(T^*, (\mu^{same}, \mu^{new}))$. Otherwise, $(T^1, M^1)$ is set equal to $(T^0, M^0)$.

### 4.3 Implementation of the Localized MH Algorithm

The implementation of our localized MH algorithm requires the evaluation of $p(r^{new} \mid T^*, \mu^{same})$ and $p(r^{old} \mid T^0, \mu^{same})$ for the calculation of $\alpha$ in (4.11), and the simulation from $p(\mu^{new} \mid r^{new}, T^*, \mu^{same})$. Although these can all be done quickly and easily in the unconstrained cases, a different approach is needed for constrained cases. This approach, which we now describe, relies crucially on the reduced computational requirements for the localized MH algorithm when $T^0 \to T^*$ is restricted to local moves at a single node.

For the moment, consider the birth move described in Section 4.2 and illustrated in Figure 4. In this case, $\mu^{old} = \mu_0$ with corresponding $r^{old} = r_0$, and $\mu^{new} = (\mu_L, \mu_R)$ with corresponding $r^{new} = (r_L, r_R)$. Thus, to perform this move, it is necessary to compute $p(r_L, r_R \mid T^*, \mu^{same})$ and $p(r_0 \mid T^0, \mu^{same})$ for the computation of $\alpha$ in (4.11), and to simulate from $p(\mu_L, \mu_R \mid r_L, r_R, T^*, \mu^{same})$ when $T^*$ is selected. For the corresponding death step, we would need to simulate from $p(\mu_0 \mid r_0, T^0, \mu^{same})$. When these means are unconstrained, these calculations can be done quickly with closed form expressions and the simulations by routine methods, so we focus here on the constrained case.

Let us begin with the calculation of

$$p(r_L, r_R \mid T^*, \mu^{same}) = \int p(r_L \mid \mu_L)\, p(r_R \mid \mu_R)\, p(\mu_L, \mu_R \mid T^*, \mu^{same})\, d\mu_L\, d\mu_R, \tag{4.12}$$

where

$$p(\mu_L, \mu_R \mid T^*, \mu^{same}) = \phi(\mu_L; \mu_\mu, c\sigma_\mu)\, \phi(\mu_R; \mu_\mu, c\sigma_\mu)\, \chi_C(\mu_L, \mu_R)/d^* \tag{4.13}$$

and $d^*$ is the normalizing constant. The determination of the constraint set $C$ is discussed in Section 2; it is the set of $(\mu_L, \mu_R)$ which results in a monotonic function. Here $C$ takes the form of bound constraints on $(\mu_L, \mu_R)$, with bounds (possibly $-\infty$ and/or $\infty$) determined by the conditioning on $\mu^{same}$ and $T^*$. In particular, note that $C$ depends on $\mu^{same}$, but we have suppressed this in the notation for the sake of simplicity.

Closed forms for (4.12) and the norming constant $d^*$ are unavailable. However, since the integrals are only two-dimensional, it is straightforward to compute them numerically. To use a very simple approach, we approximate them by summing over a grid of values. We choose a grid of equally spaced $\mu$ values and then let $G$ be the set of pairs $(\mu_L, \mu_R)$ where both $\mu_L$ and $\mu_R$ belong to the grid. Then, our approximate integrals are

$$\tilde p(r_L, r_R \mid T^*, \mu^{same}) = \sum_{(\mu_L, \mu_R) \in G \cap C} p(r_L \mid \mu_L)\, p(r_R \mid \mu_R)\, \tilde p(\mu_L, \mu_R \mid T^*, \mu^{same}), \tag{4.14}$$

where

$$\tilde p(\mu_L, \mu_R \mid T^*, \mu^{same}) = \phi(\mu_L; \mu_\mu, c\sigma_\mu)\, \phi(\mu_R; \mu_\mu, c\sigma_\mu)/\tilde d^* \tag{4.15}$$

with

$$\tilde d^* = \sum_{(\mu_L, \mu_R) \in G \cap C} \phi(\mu_L; \mu_\mu, c\sigma_\mu)\, \phi(\mu_R; \mu_\mu, c\sigma_\mu). \tag{4.16}$$

Note that we do not include grid-spacing terms (the difference between adjacent grid values) in our integral approximations since they cancel out.

If $T^*$ is accepted, the simulation of $(\mu_L, \mu_R)$ proceeds by sampling from the probability distribution over $G \cap C$ given by

$$\tilde p(\mu_L, \mu_R \mid r_L, r_R, T^*, \mu^{same}) = \frac{p(r_L \mid \mu_L)\, p(r_R \mid \mu_R)\, \tilde p(\mu_L, \mu_R \mid T^*, \mu^{same})}{\tilde p(r_L, r_R \mid T^*, \mu^{same})}. \tag{4.17}$$

Note that $\tilde d^*$ cancels in (4.17), so that we are just renormalizing

$$p(r_L \mid \mu_L)\, p(r_R \mid \mu_R)\, \phi(\mu_L; \mu_\mu, c\sigma_\mu)\, \phi(\mu_R; \mu_\mu, c\sigma_\mu)$$

to sum to one on $G \cap C$.
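The bivariate grid computations are easy to sketch. Below, assuming for illustration that the constraint set is the ordering $\mu_L \le \mu_R$ with both means restricted to a common interval (the actual bounds come from $\mu^{same}$, as described above), we sum over grid pairs and then draw $(\mu_L, \mu_R)$ from the renormalized grid posterior; all names are our own:

```python
import math, random

def normal_pdf(z, m, s):
    return math.exp(-((z - m) ** 2) / (2.0 * s * s)) / (s * math.sqrt(2.0 * math.pi))

def log_lik(r, mu, sigma):
    # log p(r_i | mu_i) from (4.8), up to an additive constant
    return -sum((ri - mu) ** 2 for ri in r) / (2.0 * sigma * sigma)

def grid_birth_quantities(rL, rR, sigma, mu_mu, c_sigma_mu, lo, hi, n_grid=101):
    """Grid versions of (4.14)-(4.17) for a birth move, with the ordering
    constraint mu_L <= mu_R and both means restricted to [lo, hi]."""
    grid = [lo + (hi - lo) * i / (n_grid - 1) for i in range(n_grid)]
    pairs, weights = [], []
    for muL in grid:
        for muR in grid:
            if muL <= muR:  # the set G intersect C
                w = (math.exp(log_lik(rL, muL, sigma) + log_lik(rR, muR, sigma))
                     * normal_pdf(muL, mu_mu, c_sigma_mu)
                     * normal_pdf(muR, mu_mu, c_sigma_mu))
                pairs.append((muL, muR))
                weights.append(w)
    total = sum(weights)                   # proportional to (4.14), with d* omitted
    probs = [w / total for w in weights]   # the renormalized posterior (4.17)
    return pairs, probs

def draw_pair(pairs, probs, rng):
    """Draw (mu_L, mu_R) from the discrete posterior (4.17)."""
    u, acc = rng.random(), 0.0
    for pair, p in zip(pairs, probs):
        acc += p
        if u <= acc:
            return pair
    return pairs[-1]
```

For long residual vectors, one would subtract the maximum log-likelihood before exponentiating to avoid underflow; we omit this for brevity.
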

For the calculation of

$$p(r_0 \mid T^0, \mu^{same}) = \int p(r_0 \mid \mu_0)\, p(\mu_0 \mid T^0, \mu^{same})\, d\mu_0, \tag{4.18}$$

where

$$p(\mu_0 \mid T^0, \mu^{same}) = \phi(\mu_0; \mu_\mu, c\sigma_\mu)\, \chi_C(\mu_0)/d^0 \tag{4.19}$$

and $d^0$ is the normalizing constant, with the constraint set $C$ now an interval of $\mu_0$ values, similar gridding can be done to obtain a discrete approximation of (4.18) and a constrained posterior sample of $\mu_0$. Again, $C$ implicitly depends on $\mu^{same}$ and $T^0$. The grid here would be just one-dimensional.

Computations for the reverse death move would proceed similarly. Local moves $T^0 \to T^*$ beyond birth and death moves may also be similarly applied, as long as $\mu^{old}$ and $\mu^{new}$ are each at most two-dimensional, since beyond two dimensions, grids become computationally demanding. For example, a $T^*$ obtained by changing a splitting rule whose children are terminal nodes would fall into this category. In all our examples, we use birth/death moves and draws of a single component $\mu_i$ given $T$ and all the remaining elements of $M$.

The approach outlined above for birth/death moves involves two bivariate integrals and two univariate integrals, which we approximate with two sums over a bivariate grid and two sums over a univariate grid. In practice, we reduce the computational burden by setting the normalizing constants $d^*$ and $d^0$ equal to one and then compensating for this omission with an adjustment of our prior. For example, in a birth move, setting the $d$'s to one ignores a factor of $d^0/d^*$ in our acceptance ratio. Note that $d^*$ in (4.13) is just the constrained integral of the product of two univariate normal densities. Without the constraint, the integral would be one. The more our monotonicity constraint limits the integral (through $\chi_C$), the smaller $d^*$ is. Similarly, $d^0$ is a constrained univariate integral. However, in a birth step, $\mu^{new}$ is typically more constrained than $\mu^{old}$. Hence, $d^0/d^*$ is a ratio depending on $T^0$ and $T^*$ which we expect to be greater than one. Note that $d^*$ only depends on $T^*$ and $d^0$ only depends on $T^0$ (that is, not on $r$).

We compensate for the omission of $d^*$ and $d^0$ by adjusting the tree prior hyperparameters $\alpha$ and $\beta$ in (3.4) away from the standard BART default values, so that the prior probability of node birth is larger, mimicking the effect of the omitted ratio $d^0/d^*$. We have found that with these adjusted choices we get tree sizes comparable to those obtained in unconstrained BART, and we use them in all our examples.

## 5 Examples

In this section we present several examples to illustrate the performance of mBART. We present results for two simulated scenarios and three real data sets. In the simulations, where we know the true function is monotonic and non-linear, we compare mBART to BART. For the real data, we compare mBART to BART and the standard linear model. In all cases (except Figure 6 in which we study prior sensitivity) we use default priors for mBART and BART but remind the reader that for best out-of-sample results, it may be wise to consider the use of cross-validation to tune the prior choice as illustrated in CGM10.

Note that below when we refer to the “fit” of BART or mBART at a given $x$, we mean the posterior mean of $f(x)$, estimated by averaging the MCMC draws of $f$ evaluated at $x$.
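Concretely, the fit is just a draw-wise average. A minimal sketch with a hypothetical `draws` array (rows are MCMC draws, columns are evaluation points):

```python
import numpy as np

def posterior_mean_fit(draws):
    """draws[s, j] is the s-th MCMC draw of f evaluated at the j-th x;
    the "fit" at each x is the average over the draws."""
    return np.asarray(draws, dtype=float).mean(axis=0)

draws = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])
print(posterior_mean_fit(draws))   # [3. 4.]
```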

Our first simulated example has a one-dimensional $x$ so that we can easily see the fits graphically. We see three differences between mBART and BART which are intuitive consequences of the injection of the strong prior information that the function is monotonic: (i) the fitted function is smoother, (ii) the uncertainty is lower, and (iii) the influence of the prior specification is reduced.

Our second example simulates data sets where $f$ is a five-dimensional nonlinear monotonic function. We explore the relationship between the out-of-sample predictive performance of mBART and the signal-to-noise ratio by using four different values for the error standard deviation. In the high signal case, BART is able to estimate $f$ without the aid of the additional information about monotonicity, so that BART and mBART have the same out-of-sample performance. In the low signal cases, the additional information is important and mBART beats BART out-of-sample.

In our first real example, $y$ is the price of a house and $x$ represents attributes of the house. In this example, mBART, BART, and the linear model all give similar fits. However, the non-monotonic BART fits are quite counterintuitive while the mBART fits are quite reasonable. This example also illustrates a very important feature of mBART: it handles ordered categorical explanatory variables beautifully.

In our second real example, $y$ is the price of a used car and $x$ represents attributes of the car. In this example, the linear model fails while mBART and BART perform comparably, with mBART giving more interpretable results.

In our third real example, $y$ is the excess return for a cross section of firms in a given month and $x$ measures attributes of each firm in the previous month. This example is extreme in that the signal-to-noise ratio is very low. The unconstrained BART fit is very noisy while the mBART fit is nicely smooth and suggestive of non-linearity.

The examples illustrate two key attributes of mBART in cases where the monotonicity is appropriate. First, the smoother mBART fits are more interpretable than the more “jumbly” BART fits. Second, in terms of the bias-variance trade-off, the monotonicity constraint may serve to restrain the fit, giving improved out-of-sample performance. This is clearly illustrated in our second simulated example. In all three of our real examples, the posterior distribution of $f$ from BART covers the posterior mean from mBART, informally suggesting that the monotonicity constraint is consistent with the data, and the mBART fits are more appealing. In the car price example, BART and mBART beat the linear model out-of-sample, and mBART is (perhaps) a bit better than BART. In the returns example, the linear model beats BART out-of-sample, while mBART does as well as the linear model. In general, we can think of mBART as a convenient “halfway house” between the very flexible ensemble method BART and the inflexible linear method.

### 5.1 One-Dimensional Simulated Example

In this section we present a very simple simulated example with just one variable so that we can visually see some of the basic properties of the monotone BART inference. We simulate observations from the model

$$Y_i = X_i^3 + \epsilon_i, \qquad \epsilon_i \sim N(0, .1^2), \qquad X_i \sim U[-1,1].$$
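A minimal simulation sketch of this model follows; the sample size `n` is an illustrative choice, as this excerpt does not state the number of observations.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200                                   # illustrative sample size
x = rng.uniform(-1.0, 1.0, size=n)        # X_i ~ U[-1, 1]
eps = rng.normal(0.0, 0.1, size=n)        # eps_i ~ N(0, .1^2)
y = x**3 + eps                            # Y_i = X_i^3 + eps_i
```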

Figure 5 shows the results for a single simulated data set. The BART inference is displayed in the left panel and the mBART inference is in the right panel. The mBART fit is much better. The fit is smoother and the uncertainty is much smaller. Clearly, injecting the correct prior information that the function is monotone dramatically tightens up the inference.

Given the additional information in the monotonicity constraint, mBART should also be less sensitive to the choice of prior than BART. To explore this, we fit BART and mBART using 36 different prior specifications and compared the variation in fits. We used three different values for $m$, the number of trees in the sum: 50, 200, and 500. We used four different values for $k$, where a large value of $k$ implies a tighter prior on the end-node parameters $\mu$: 1, 2, 3, or 5 (see CGM10 and Section (3.3)). We used three different settings for the pair of values $(\nu, q)$, where $\nu$ is the degrees of freedom for the inverted chi-squared prior on the error variance and the scale is chosen so that the least squares estimate of the error standard deviation is at the $q$ quantile of the prior: (3,.9), (3,.99), and (10,.75). These settings follow the prior settings explored in CGM10. All possible combinations give us $3 \times 4 \times 3 = 36$ possible settings.
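The 36 specifications are simply the Cartesian product of the three lists of settings; a sketch:

```python
from itertools import product

m_vals = [50, 200, 500]                    # number of trees m
k_vals = [1, 2, 3, 5]                      # tightness k of the mu prior
nq_vals = [(3, .90), (3, .99), (10, .75)]  # (nu, q) for the sigma prior

settings = list(product(m_vals, k_vals, nq_vals))
print(len(settings))   # 36
```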

The results for one simulated data set are shown in Figure 6. The horizontal axis i