# A Practically Competitive and Provably Consistent Algorithm for Uplift Modeling

## Abstract

Randomized experiments have been critical tools of decision making for decades. However, subjects can show significant heterogeneity in response to treatments in many important applications. Therefore, it is not enough to simply know which treatment is optimal for the entire population. What we need is a model that correctly customizes treatment assignment based on subject characteristics. The problem of constructing such models from randomized experiment data is known as Uplift Modeling in the literature. Many algorithms have been proposed for uplift modeling and some have generated promising results on various data sets. Yet little is known about the theoretical properties of these algorithms. In this paper, we propose a new tree-based ensemble algorithm for uplift modeling. Experiments show that our algorithm can achieve competitive results on both synthetic and industry-provided data. In addition, by properly tuning the "node size" parameter, our algorithm is proved to be consistent under mild regularity conditions. This is the first consistent algorithm for uplift modeling that we are aware of.

Copyright Notice: This paper has been accepted to the 2017 IEEE International Conference on Data Mining. Authors have assigned to The Institute of Electrical and Electronics Engineers (the "IEEE") all rights under the IEEE copyright. The article published in the proceedings of ICDM 2017 under the same title is a shortened version of this paper.


## 1 Introduction

Decision makers often face the situation where they need to identify from a set of alternatives the candidate that leads to the most desirable outcome. For example, an airline company that sells priority boarding as an ancillary product needs to select a good price (usually among a few predetermined numbers) that maximizes the revenue. Oftentimes passengers show significant heterogeneity in their response to prices, and the answer as to which price is optimal depends on the circumstance. For example, the revenue-maximizing prices are likely to be different for a route between major cities and a route between vacation destinations. Luckily, for some applications, we can conduct randomized experiments to learn more about subject responses under different scenarios. In such an experiment, subjects are randomly assigned to treatments following a given probability distribution. Then the characteristics of the subject, the assigned treatment, and the response are recorded. Given the randomized experiment data, we want to construct models that can correctly predict the optimal treatment based on subject characteristics. This problem is known as Uplift Modeling in the literature.

Although both produce a mapping from the feature space to a finite set of labels, uplift modeling should not be confused with classification. The fundamental difference comes from the fact that the data for uplift modeling is unlabeled. For any individual subject, it is impossible to know which treatment is optimal because we can only observe its response under the (randomly) assigned treatment and none of the alternatives. This poses unique challenges in the construction and evaluation of uplift models.

One research area that is related to but different from uplift modeling is the study of heterogeneous treatment effects [1][2]. While uplift modeling aims to identify the optimal treatment among possibly many alternatives, the analysis of heterogeneous treatment effects focuses on estimating the difference in expected response caused by a single treatment. The distinction between the two areas is more apparent when we look at their formulations. Let $X$ be the feature vector and $T$ the treatment. Denote by $Y$ the response, whose distribution depends on $X$ and $T$. For uplift modeling the treatment can take a finite number of values denoted as $1, \dots, K$. The objective is to obtain an accurate estimator of

$$h(x) \equiv \arg\max_{t=1,\dots,K} E[Y \mid X = x, T = t],$$

i.e., the conditional response-maximizing treatment. The focus of heterogeneous treatment effects is, on the other hand, accurate estimation of, and inference for,

$$\tau(x) \equiv E[Y \mid X = x, T = 1] - E[Y \mid X = x, T = 0]$$

where $T = 1$ indicates the treatment is applied and $T = 0$ otherwise. It is clear that the heterogeneous-treatment-effect formulation is applicable only when there is a single treatment, because the definition of subtraction is ambiguous between more than two terms. Similar arguments can be made about the difference between uplift modeling and subgroup analysis [3].

A generic way to solve uplift problems is the Separate Model Approach (SMA). The randomized experiment data is split by treatment, and for each treatment one prediction model is built. Given a new test example, we can obtain its predicted response under each treatment and select the correspondingly best treatment. The main advantage of this approach is that it does not require specialized algorithms. Any existing classification/regression model can be incorporated into this scheme. The disadvantage is that SMA does not always perform well in practice [6][9]. To correctly identify the optimal treatment, a learning algorithm needs to know how well each and every treatment is doing. However, information about other treatments is never provided to the learning algorithm under the SMA scheme. For more discussion on the failure of SMA please see Section 5 of [9].
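The SMA scheme above can be sketched in a few lines. This is an illustration, not part of the algorithm proposed in this paper; `MeanRegressor` is a toy stand-in, and in practice any regression learner could be passed as the model factory.

```python
import numpy as np

def fit_sma(x, t, y, make_model):
    """Separate Model Approach: fit one response model per treatment.

    make_model() must return an object with fit(X, y) and predict(X);
    the factory is a placeholder for any off-the-shelf regressor.
    """
    models = {}
    for treatment in np.unique(t):
        mask = t == treatment
        model = make_model()
        model.fit(x[mask], y[mask])
        models[treatment] = model
    return models

def sma_select(models, x_new):
    """Pick, per row, the treatment whose model predicts the highest response."""
    treatments = sorted(models)
    preds = np.column_stack([models[tr].predict(x_new) for tr in treatments])
    return np.asarray(treatments)[np.argmax(preds, axis=1)]

class MeanRegressor:
    """Toy stand-in model: predicts the training-set mean everywhere."""
    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self
    def predict(self, X):
        return np.full(len(X), self.mean_)
```

Note how each per-treatment model only ever sees its own treatment's data, which is exactly the information loss discussed above.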

Disappointed by the performance of the Separate Model Approach, researchers have proposed a number of specialized algorithms for uplift modeling. Most of them are designed for the special case of a single treatment [4] [5] [6] [7] [8] [9] [10] [11] [12]. Methods for multiple treatments are introduced in [13] [14] and [15]. In [13], the tree-based algorithm described in [9] is extended to multiple treatment cases by using a weighted sum of pairwise distributional divergence as the splitting criterion. In [14], a multinomial logit formulation is proposed in which treatments are incorporated as binary features. They also explicitly include the interaction terms between treatments and features. What is most relevant to our work is the Contextual Treatment Selection (CTS) algorithm presented in [15]. CTS is a tree-based ensemble method. It grows a group of trees, each with a random subsample of the original training data. At each step of the tree-growing process, a random subset of all features is drawn as candidates, over which an exhaustive search is conducted to find the best splitting point. A split is evaluated by the increase in expected response it can bring as measured on the training data. As far as we are aware, CTS is the first uplift algorithm that can handle multiple treatments and continuous response. It can lead to significant performance improvement over other applicable methods.

One drawback with exhaustive search is its susceptibility to outliers. Splits are likely to be placed adjacent to extreme values. This is especially problematic for uplift trees because the score of a split is affected by estimations for all treatments. Outliers of any treatment can influence the choice of a split point. Furthermore, successive splits tend to group together similar extreme values, introducing more bias into the estimation of expected responses.

To solve the problem above, we introduce a modified version of the CTS algorithm named Unbiased Contextual Treatment Selection (UCTS). The key difference is the separation between the partition of the feature space and the estimation of leaf responses. Before growing a tree, UCTS first randomly splits the training data into two subsets, one for selecting tree splits and the other for estimating treatment-wise expected responses in the leaf nodes. In Section 3 we demonstrate experimentally that UCTS is competitive with CTS using both synthetic and industry-provided data. Another advantage of this two-sample approach is that it makes the consistency analysis more tractable. In Section 4, we prove that UCTS can achieve mean-square consistency under mild regularity conditions by properly tuning the "node size" parameter. This is the first consistency result for uplift modeling that we are aware of.

In the remainder of this section we define the notations used throughout this paper. The UCTS algorithm is described in detail in Section 2. In Section 3 we explain the setup and the results of the numerical experiments. The consistency analysis of UCTS is presented in Section 4. Section 5 ends the paper with a brief summary.

### 1.1 Notations

We use upper case letters to denote random variables and lower case letters their realizations. We use boldface for vectors and normal typeface for scalars.

• $\mathbf{X}$ represents the feature vector and $\mathbf{x}$ its realization. Subscripts are used to indicate specific features. For example, $X_j$ is the $j$th feature in the vector and $x_j$ its realization. Let $\mathbb{X}$ denote the $d$-dimensional feature space.

• $T$ represents the treatment. We assume there are $K$ different treatments encoded as $1, 2, \dots, K$.

• Let $Y$ be the response and $y$ its realization. Throughout this paper we assume that the larger the value of $Y$, the more desirable the outcome. Denote the expectation of $Y$ conditional on features $\mathbf{x}$ and treatment $t$ as $\mu(\mathbf{x}, t) \equiv E[Y \mid \mathbf{X} = \mathbf{x}, T = t]$.

For the priority boarding example mentioned earlier, where the airline wants to customize the price of priority boarding to maximize its revenue, $\mathbf{X}$ would be the characterizing information of flights such as the origin-destination pair, the date and time of the flight, etc. $T$ would take values in a discrete set of candidate prices such as \$5, \$10, \$15. And the response $Y$ would be the revenue per passenger-segment.

Suppose we have a data set of size $n$ containing the joint realizations of $(\mathbf{X}, T, Y)$ collected from a randomized experiment. We use the superscript $(i)$ to index the samples as below,

$$S_n = \{(\mathbf{x}^{(i)}, t^{(i)}, y^{(i)}),\ i = 1, \dots, n\}.$$

A treatment selection rule $h$ is a mapping from the feature space to the space of treatments, $h: \mathbb{X} \to \{1, \dots, K\}$. The goal of Uplift Modeling is, based on the training data $S_n$, to find a treatment selection rule $h$ such that the expectation $E[Y \mid \mathbf{X}, T = h(\mathbf{X})]$ is as high as possible. It is obvious that the maximum expected response is achieved by the point-wise optimal treatment rule $h^*(\mathbf{x}) \equiv \arg\max_t \mu(\mathbf{x}, t)$.
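A practical question raised by this formulation is how to evaluate a given rule $h$ from randomized data, since each subject's response is observed under only one treatment. A standard (not paper-specific) device is to keep the samples whose random assignment happens to agree with the rule and reweight them by the inverse assignment probability; the sketch below illustrates this, with all names hypothetical.

```python
import numpy as np

def estimate_value(h, x, t, y, propensity):
    """Estimate E[Y | T = h(X)] from randomized-experiment data.

    Keeps samples whose assigned treatment matches the rule's
    recommendation and reweights by 1 / P{T = t} (`propensity[t]` is the
    assignment probability used in the experiment). This is a generic
    inverse-propensity illustration, not part of the UCTS algorithm.
    """
    rec = np.array([h(xi) for xi in x])
    match = rec == t
    weights = np.array([1.0 / propensity[ti] for ti in t])
    return float(np.sum(y * match * weights) / len(y))
```

With uniform assignment over two treatments, each matching sample simply counts twice, so the estimator is unbiased for the rule's expected response.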

## 2 Algorithm

Classification or regression trees, when combined into ensembles, prove to be among the most powerful Machine Learning methods [16]. Almost predictably, the Contextual Treatment Selection (CTS) algorithm, which generates tree-based ensembles, also leads to significant performance improvement for uplift modeling problems [15]. In this section we describe a modified version of CTS called the Unbiased Contextual Treatment Selection (UCTS) which eliminates the estimation bias of leaf responses by using separate data sets for partition generation and leaf estimation.

### 2.1 Splitting Criteria

Here we only consider the binary partition approach where each split creates two branches further down the tree. Let $\phi$ be the subset of the feature space associated with the current node. Suppose $s$ is a candidate split that divides $\phi$ into the left child-node $\phi_l$ and the right child-node $\phi_r$. Having two child nodes allows us to select a different treatment for each of them. The added flexibility brings about an increase in expected response, which is

$$\Delta\mu(s) = P\{\mathbf{X} \in \phi_l \mid \mathbf{X} \in \phi\} \max_{t_l=1,\dots,K} E[Y \mid \mathbf{X} \in \phi_l, T = t_l] + P\{\mathbf{X} \in \phi_r \mid \mathbf{X} \in \phi\} \max_{t_r=1,\dots,K} E[Y \mid \mathbf{X} \in \phi_r, T = t_r] - \max_{t=1,\dots,K} E[Y \mid \mathbf{X} \in \phi, T = t]. \quad (1)$$

At each step of the tree-growing process, we want to select the split that leads to the largest $\Delta\mu(s)$. The conditional probability of $\mathbf{X}$ falling into a child node is estimated using the sample fraction, i.e.,

$$P\{\mathbf{X} \in \phi' \mid \mathbf{X} \in \phi\} \approx \hat{p}(\phi' \mid \phi) \equiv \frac{\sum_{i=1}^n I\{\mathbf{x}^{(i)} \in \phi'\}}{\sum_{i=1}^n I\{\mathbf{x}^{(i)} \in \phi\}} \quad (2)$$

for $\phi' \in \{\phi_l, \phi_r\}$, where $I\{\cdot\}$ is the indicator function.

Estimating the conditional expectation requires more care. We need to consider the fact that the estimation is done by treatment, so fewer samples are available. In addition, treatments may not have equal probabilities in the randomized experiment that generates the training set. Let $n_t(\phi')$ be the number of samples in $\phi'$ with treatment $t$. Given two user-defined parameters min_split and n_reg, the estimator $\hat{y}_t(\phi')$ of $E[Y \mid \mathbf{X} \in \phi', T = t]$ is defined as follows.

If $n_t(\phi') \geq \text{min\_split}$,

$$\hat{y}_t(\phi') = \frac{\sum_{i=1}^n y^{(i)} I\{\mathbf{x}^{(i)} \in \phi'\} I\{t^{(i)} = t\} + \hat{y}_t(\phi) \cdot \text{n\_reg}}{\sum_{i=1}^n I\{\mathbf{x}^{(i)} \in \phi'\} I\{t^{(i)} = t\} + \text{n\_reg}}, \quad (3)$$

otherwise

$$\hat{y}_t(\phi') = \hat{y}_t(\phi), \quad (4)$$

where $\phi$ is the parent node of $\phi'$. To initialize this recursive definition, the estimation at the root node is set to the treatment-wise sample average. Letting $\phi'$ inherit its parent node's estimation when there are not enough samples allows the tree to grow to full extent while ensuring reliable estimation for minority treatments. To summarize, the score of a split is computed as,

$$\hat{\Delta}\mu(s) = \hat{p}(\phi_l \mid \phi) \times \max_{t=1,\dots,K} \hat{y}_t(\phi_l) + \hat{p}(\phi_r \mid \phi) \times \max_{t=1,\dots,K} \hat{y}_t(\phi_r) - \max_{t=1,\dots,K} \hat{y}_t(\phi). \quad (5)$$
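Equations (2)-(5) can be sketched as follows for a single candidate split on one feature. The parameter names `n_reg` and `min_split` follow the notation above; this is an illustrative sketch of the scoring rule, not the full tree-growing code.

```python
import numpy as np

def regularized_mean(y, parent_mean, n_reg, min_split):
    """Treatment-wise estimate of Eqs. (3)-(4): shrink the child-node mean
    toward the parent estimate, and fall back to the parent entirely when
    fewer than `min_split` samples are available."""
    if len(y) < min_split:
        return parent_mean
    return (np.sum(y) + parent_mean * n_reg) / (len(y) + n_reg)

def split_score(x_feat, t, y, threshold, parent_means, n_reg=10, min_split=2):
    """Score of Eq. (5) for the candidate split {x_feat <= threshold}.

    `parent_means` maps each treatment to the current node's estimate
    y_hat_t(phi). Returns the estimated increase in expected response."""
    left = x_feat <= threshold
    best = {}
    for side, mask in (("l", left), ("r", ~left)):
        best[side] = max(
            regularized_mean(y[mask & (t == tr)], parent_means[tr],
                             n_reg, min_split)
            for tr in sorted(parent_means)
        )
    p_left = left.mean()                      # Eq. (2): sample fraction
    parent_best = max(parent_means.values())  # max_t y_hat_t(phi)
    return p_left * best["l"] + (1 - p_left) * best["r"] - parent_best
```

A positive score means the two child nodes, each assigned its own best treatment, are estimated to outperform the best single treatment for the parent node.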

$\alpha$-Regularity

To avoid having severely unbalanced trees, UCTS requires that selected splits must leave at least a fraction $\alpha$ of the available training examples on each side of the split, for some user-defined $\alpha \in (0, 0.5]$.

### 2.2 Termination Rules

UCTS considers a node as a terminal node if the number of samples in the node is less than min_split for every treatment.

### 2.3 Leaf Response Estimation

In order to eliminate the bias, UCTS estimates the leaf responses with a set of data separate from the set by which the partition is generated. This is achieved by randomly splitting the training set $S_n$ into the approximation set $S^A$ and the estimation set $S^E$. For a user-defined parameter $\rho \in (0, 1)$, $S^A$ contains a fraction $\rho$ of the examples in $S_n$, sampled by treatment. $S^E$ contains the rest of the data.

Of the two sets, $S^A$ is used to generate the tree structure using the splitting criteria and termination conditions described above. Let $\mathcal{N}$ be the set of nodes of a tree grown with $S^A$. For any $\phi \in \mathcal{N}$, denote by $S^E(\phi, t)$ the examples in $S^E$ that fall into $\phi$ with treatment $t$. If $S^E(\phi, t)$ is not empty, then the conditional expected response in $\phi$ under treatment $t$ is estimated as the sample average of $S^E(\phi, t)$. Otherwise, $\phi$ inherits the estimation from its parent node. We assume that $S^E$ contains samples of all treatments at least for the root node. By this definition, we can estimate the root node first and then traverse down level by level until all nodes are estimated.
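The two ingredients of this step, the treatment-stratified data split and the estimation-set leaf averages with parent fallback, can be sketched as below. Function names are hypothetical, and the "tree" is reduced to an array of leaf ids for brevity.

```python
import numpy as np

def honest_split(t, rho, rng):
    """Randomly split the training data, stratified by treatment:
    a fraction rho goes to the approximation set (tree growing),
    the rest to the estimation set (leaf response estimation).
    Returns a boolean mask marking the approximation set."""
    approx = np.zeros(len(t), dtype=bool)
    for tr in np.unique(t):
        idx = rng.permutation(np.flatnonzero(t == tr))
        approx[idx[: int(rho * len(idx))]] = True
    return approx

def leaf_estimates(leaf_ids, t, y, parent_estimates):
    """Per-leaf, per-treatment mean response on the estimation set.
    A (leaf, treatment) cell with no estimation samples inherits the
    parent (here: root) estimate, mirroring Section 2.3."""
    out = {}
    for leaf in np.unique(leaf_ids):
        for tr in parent_estimates:
            mask = (leaf_ids == leaf) & (t == tr)
            out[(leaf, tr)] = float(np.mean(y[mask])) if mask.any() \
                else parent_estimates[tr]
    return out
```

Because the leaf averages come from data that played no role in choosing the splits, they are unbiased for the within-leaf expected responses.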

### 2.4 Algorithm

To reduce the high variance associated with a single tree, UCTS generates a forest of trees in a way similar to Random Forest [17]. The algorithm is outlined below.

## 3 Experiments

One of the challenges for testing uplift algorithms is the lack of publicly available randomized experiment data. In this section, we first use a simple two-dimensional data model to illustrate the behavioral difference between UCTS and CTS. Then, the performance of UCTS is tested on two larger data sets. The first one is a 50-dimensional synthetic data set. The second is industry-provided data on the pricing of priority boarding of flights. These two data sets are the same ones tested in [15], which allows us to directly compare with their results.

### 3.1 Simple 2D Example

Consider a two-dimensional feature space. The first feature $X_1$ is continuous and uniformly distributed between 0 and 100, i.e., $X_1 \sim U[0, 100]$. The second feature $X_2$ takes discrete values A, B, and C, each with probability 1/3. There are two treatments, and the response under each treatment is defined as below.

$$\text{If } T = 1,\ Y \sim U[0, X_1]. \qquad \text{If } T = 2,\ Y \sim \begin{cases} 0.8 \cdot U[0, X_1] + 5 & \text{if } X_2 = B, \\ 1.2 \cdot U[0, X_1] - 5 & \text{if } X_2 = A \text{ or } C. \end{cases}$$

The optimal treatment rule for this data model is illustrated in Fig. 1. The vertical boundary in the middle is located at $X_1 = 50$. Feature $X_2$ is plotted like a continuous variable so that we can have a 2D image. Note that, although the optimal treatment assignment exhibits a sharp change at the boundary, the actual difference between treatments changes smoothly with $X_1$ and is zero at the middle. Therefore the algorithms are likely to have some difficulty identifying the correct treatment around $X_1 = 50$. Another characteristic of this example is that the variance in response grows quadratically with $X_1$. Because CTS is more susceptible to extreme values than UCTS, we should expect their behaviors to differ more when $X_1$ is large.
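Since $E[U[0, x_1]] = x_1 / 2$, the optimal rule for this model can be written in closed form. The sketch below assumes, as reconstructed above, that $X_1$ ranges over $[0, 100]$ so that the boundary sits at $X_1 = 50$; note the two treatments swap roles across the boundary depending on $X_2$.

```python
def expected_response(x1, x2, treatment):
    """Expected response under the 2D data model, using E[U[0, x1]] = x1 / 2."""
    if treatment == 1:
        return x1 / 2
    if x2 == "B":
        return 0.8 * x1 / 2 + 5     # uplift over T=1 is 5 - 0.1 * x1
    return 1.2 * x1 / 2 - 5         # X2 = A or C: uplift is 0.1 * x1 - 5

def optimal_treatment(x1, x2):
    """Point-wise optimal rule h*(x) for the 2D model (away from ties)."""
    return max((1, 2), key=lambda tr: expected_response(x1, x2, tr))
```

For $X_2 = B$ treatment 2 is optimal left of the boundary, while for $X_2 \in \{A, C\}$ it is optimal to the right, matching the sharp change in Fig. 1.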

To have a fair comparison of the behaviors of UCTS and CTS, we must first find their optimal parameters, in particular the node-size parameters of the two algorithms. The parameters are selected based on the performance of models trained with different training sets as measured by the true data model. With the parameters chosen for a training size of 1,000 samples per treatment, 5 more training sets are sampled and the decision boundaries reconstructed by the two algorithms are plotted in Fig. 2. We can see that the decision boundary generated by UCTS is much smoother than that by CTS for all training sets. This is especially the case on the right side of each plot, where the variance in response is high and extreme values are more common.

To verify that UCTS is not sacrificing performance for smoothness, we compare the results of the UCTS and CTS models generated from these training sets. The expected response under each model is computed using the true data model. The average performance and the 95% confidence interval are plotted in Fig. 3. We can see that UCTS is fully competitive with CTS.

### 3.2 High-Dimensional Synthetic Data

While the 2D example is helpful for understanding the behavioral difference between UCTS and CTS, it might not be complex enough to represent real-world scenarios. In this subsection we consider a 50-dimensional data model with a much more complex response distribution. This is also the data model used in Section 4.1 of [15], which allows us to compare our results with theirs.

The feature space is the fifty-dimensional hyper-cube of side length 10. Features are uniformly distributed in the feature space, i.e., $X_j \sim U[0, 10]$ for $j = 1, \dots, 50$. There are four different treatments, $T \in \{1, 2, 3, 4\}$, and the response under each treatment is defined as below.

$$Y = \begin{cases} f(\mathbf{X}) + U[0, \alpha X_1] + \epsilon & \text{if } T = 1, \\ f(\mathbf{X}) + U[0, \alpha X_2] + \epsilon & \text{if } T = 2, \\ f(\mathbf{X}) + U[0, \alpha X_3] + \epsilon & \text{if } T = 3, \\ f(\mathbf{X}) + U[0, \alpha X_4] + \epsilon & \text{if } T = 4. \end{cases} \quad (6)$$

The first term $f(\mathbf{X})$ is a mixture of exponential functions defined on the feature space. This term is the same for all treatments and represents the systematic dependence of the response on the features. The second term is the treatment effect, and $\alpha$ determines the magnitude of the effect. The third term $\epsilon$ is the zero-mean Gaussian noise whose standard deviation is set to twice the magnitude of the treatment effect. By the symmetry of the model, the expected response is the same for all treatments; it is estimated to be 5.18 using Monte Carlo simulation on 10,000,000 samples. Similarly, the expected response under the optimal treatment rule is estimated to be 5.79.

The performance of UCTS is tested under different training data sizes, specifically 500, 2000, 4000, 8000, 16000, and 32000 samples per treatment. For each size, 10 training sets and test sets are provided in [15]. We use the results from these data to generate the 95% margin of error. When training each model, the secondary parameters are held fixed, and the most important parameter is selected on a validation set (30% of the total training data). The results are plotted in Fig. 4.

In Fig. 4 the results of UCTS (yellow line with crosses) are plotted together with those of 5 different algorithms, including CTS (green line with horizontal bars). The other 4 methods are Separate Model Approach with Random Forest (SMA-RF), K-Nearest Neighbor (SMA-KNN), Support Vector Regressor with Radial Basis Kernel (SMA-SVR), and AdaBoost (SMA-Ada). From the figure we can see that UCTS and CTS outperform Separate Model Approaches when the training size is greater than 4,000. By training size 32,000, they have almost achieved the optimal performance. Meanwhile, the 95% margins of error of UCTS and CTS overlap for every training size. It is not unreasonable to say that they have comparable performance for this particular data model.

### 3.3 Priority Boarding Data

As we have mentioned in the introduction, one of the applications of Uplift Modeling is customized pricing. In this example we apply uplift algorithms to select the price of priority boarding based on flight information. The data is provided by one of the major airlines in Europe. In the data set, half of the passengers receive the default price of €5 and half receive the treatment price of €7. Interestingly, the two prices lead to the same €0.42 average revenue per passenger overall. A total of 9 features are derived based on the information of the flight and of the reservation. These are the origin station, the origin-destination pair, the departure weekday, the arrival weekday, the number of days between flight booking and departure, flight fare, flight fare per passenger, flight fare per passenger per mile, and the group size.

The performance of UCTS is compared with those of 6 other methods, which are the separate model approach with Random Forest (SMA-RF), Support Vector Machine (SMA-SVM), Adaboost (SMA-Ada), K-Nearest Neighbors (SMA-KNN), as well as the uplift Random Forest method implemented in [11], and CTS. The data is randomly split into the training set (225,000 samples per treatment) and the test set (75,000 samples per treatment). For UCTS, the secondary parameters are held fixed, and the remaining parameters are set according to the results on a validation set (30% of the training data). Details on parameter tuning of the 6 other methods can be found in the Appendix of [15].

The expected revenue from each algorithm is plotted in Fig. 5. The benefit of applying specialized uplift algorithms is apparent. The best result from the Separate Model Approach is €0.45, which is a 7% increase relative to fixed pricing. However, with UCTS, we can achieve an astonishing 29% increase.

We also plot the Modified Uplift Curves (MUC) of the 7 methods in Fig. 6. The horizontal axis in a MUC indicates the percentage of the population subject to treatments (while the others receive the control). The vertical axis is the expected response at a given percentage. The MUC is a useful tool for balancing the gain from customizing treatment assignment against the risk of exposing subjects to treatments. In Fig. 6 we can see that UCTS achieves a higher expected response than other methods for any given percentage.
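The idea behind such a curve can be sketched as follows. This sketch assumes the true per-subject expected responses under treatment and control are known, as in a synthetic setting; on experiment data they would have to be estimated, and the exact estimator used for Fig. 6 is not reproduced here.

```python
import numpy as np

def modified_uplift_curve(uplift_scores, mu_treat, mu_control, percents):
    """Expected response when the top-p fraction of the population,
    ranked by a model's uplift score, receives the treatment and the
    rest receive the control.

    mu_treat / mu_control: true per-subject expected responses (known
    only for synthetic data; hypothetical inputs for illustration).
    """
    order = np.argsort(-np.asarray(uplift_scores))  # highest uplift first
    mu_t = np.asarray(mu_treat, dtype=float)[order]
    mu_c = np.asarray(mu_control, dtype=float)[order]
    n = len(order)
    curve = []
    for p in percents:
        cut = int(round(p * n))
        curve.append((mu_t[:cut].sum() + mu_c[cut:].sum()) / n)
    return curve
```

At $p = 0$ the curve gives the all-control expected response and at $p = 1$ the all-treatment one; a good uplift model makes the curve rise steeply at small $p$.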

Knowing that there does not exist a learning algorithm that always performs better than others regardless of the underlying data model [19], we hope the experiments in this section have demonstrated that UCTS can be competitive with CTS for some data sets. In the next section we present a distinct advantage of UCTS: its provable consistency.

## 4 Consistency Analysis

Tree-based ensemble methods have eluded theoretical analysis for many years. Since its publication in 2001 [17], Random Forest has become a major analytical tool in many areas of application thanks to its stable and excellent performance. Yet it is still an open question whether the algorithm is consistent. The difficulties in analysis come partly from the fact that the algorithm is highly data-dependent and partly from the randomization procedure. In recent years, there have been several critical attempts at narrowing the gap between theory and practice. For a more detailed summary of these results, please refer to the Introduction of [18].

Uplift modeling is in a similar situation. Many algorithms have been proposed in the past two decades and some have achieved promising results on various data sets. However, to the best of our knowledge, there have been very few publications about the theoretical properties of these algorithms. In order to fill this vacancy in the literature and to understand the behavior of the algorithm, in this section we provide a proof of consistency for the proposed UCTS algorithm. Unlike the theoretical studies on Random Forest, which often concentrate on simplified versions of the procedure, our proof is for UCTS exactly as described in Algorithm 1.

### 4.1 Consistency of Uplift Algorithms

The general framework of uplift modeling is that, after observing the feature vector $\mathbf{X}$ of a subject, the decision maker applies a treatment $T$ to the subject and observes its response $Y$. Assume $Y = \mu(\mathbf{X}, T) + \epsilon$ where $\epsilon$ is a zero-mean random noise that may depend on $\mathbf{X}$ and $T$. Then the conditional expectation is simply

$$\mu(\mathbf{x}, t) \equiv E[Y \mid \mathbf{X} = \mathbf{x}, T = t].$$

A treatment selection rule is a mapping from the feature space to treatments, i.e., $h: \mathbb{X} \to \{1, \dots, K\}$. Denote the expected response under a treatment rule $h$ as

$$v(h) \equiv E[Y \mid \mathbf{X}, T = h(\mathbf{X})] = E\{\mu[\mathbf{X}, h(\mathbf{X})]\}$$

where the expectation is taken over $\mathbf{X}$. It is obvious that the maximum expected response is achieved by the point-wise optimal treatment rule $h^*(\mathbf{x}) \equiv \arg\max_t \mu(\mathbf{x}, t)$.

Given a set of samples $S_n$ from a randomized experiment, the goal of an uplift algorithm is to construct a treatment selection rule $h_n$ such that $v(h_n)$ is as close to $v(h^*)$ as possible. In this sense, we can define the consistency of uplift algorithms as follows.

###### Definition 1.

An uplift algorithm is Consistent if

$$\lim_{n \to +\infty} E\{\mu[\mathbf{X}, h^*(\mathbf{X})] - \mu[\mathbf{X}, h_n(\mathbf{X})]\}^2 = 0,$$

where the expectation is taken over both the test example $\mathbf{X}$ and the training data $S_n$.

A UCTS model consists of a collection of randomized uplift trees, each of which is an estimator of $\mu(\mathbf{x}, t)$. For the $b$th tree in the forest, the predicted value at query point $(\mathbf{x}, t)$ is denoted as $\mu_n(\mathbf{x}, t; \Theta_b, S_n)$, where $\Theta_1, \dots, \Theta_B$ are independent random variables, distributed as a generic random variable $\Theta$ and independent of $S_n$. This auxiliary random variable is used to subsample training data for each tree and to select splitting variables. Averaging tree predictions gives us the predicted value of the forest at $(\mathbf{x}, t)$,

$$\mu_{B,n}(\mathbf{x}, t; \Theta_1, \dots, \Theta_B, S_n) = \frac{1}{B} \sum_{b=1}^B \mu_n(\mathbf{x}, t; \Theta_b, S_n).$$

From now on we abbreviate $\mu_{B,n}(\mathbf{x}, t; \Theta_1, \dots, \Theta_B, S_n)$ as $\mu_n(\mathbf{x}, t)$ to lighten the notation, although it should be kept in mind that $\mu_n$ depends on the training data, the auxiliary randomness, and the number of trees $B$. Given the estimator $\mu_n$, the treatment rule $h_n$ is simply defined as

$$h_n(\mathbf{x}) = \arg\max_{t=1,\dots,K} \mu_n(\mathbf{x}, t)$$

with ties broken randomly.
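The forest average and the tie-broken argmax above can be sketched as follows; trees are represented abstractly as callables, and all names are hypothetical.

```python
import random

def forest_estimate(trees, x, treatment):
    """Average the per-tree estimates mu_n(x, t; Theta_b) over the forest."""
    return sum(tree(x, treatment) for tree in trees) / len(trees)

def select_treatment(trees, x, treatments, rng=random):
    """h_n(x): the treatment maximizing the forest estimate,
    with ties broken uniformly at random."""
    scores = {tr: forest_estimate(trees, x, tr) for tr in treatments}
    best = max(scores.values())
    return rng.choice([tr for tr in treatments if scores[tr] == best])
```

Random tie-breaking matters mainly near decision boundaries, where two treatments have (estimated) equal expected response.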

###### Lemma 1.

If for each $t$ we have $\lim_{n \to \infty} E[\mu_n(\mathbf{X}, t) - \mu(\mathbf{X}, t)]^2 = 0$, where the expectation is taken over $\mathbf{X}$, $S_n$, and $\Theta_1, \dots, \Theta_B$, then

$$\lim_{n \to \infty} E\{\mu[\mathbf{X}, h^*(\mathbf{X})] - \mu[\mathbf{X}, h_n(\mathbf{X})]\}^2 = 0.$$
###### Proof.

See Appendix 6.1. ∎

Lemma 1 establishes a connection between the consistency of uplift problems and that of regression problems. The key here is that we need to ensure the consistency of $\mu_n(\cdot, t)$ simultaneously for all treatments, for which we need more details about recursive partitioning algorithms.

### 4.2 Recursive Partitioning

Let $\mathcal{L}$ be a partition of the feature space generated by a recursive partitioning algorithm, as represented by the leaf nodes. Given a point $\mathbf{x}$ in the feature space, denote by $L(\mathbf{x})$ the element of $\mathcal{L}$ that contains $\mathbf{x}$. Suppose features are distributed according to a density function $f(\cdot)$. Then let $f(L)$ be the expected fraction of samples in leaf node $L$. Given a set of $n$ training examples, let $\#L$ be the number of examples in $L$. In this paper we only consider the case where splits are orthogonal to the splitting variables. Therefore all leaves are rectangles, and we let $\operatorname{diam}_j(L)$ be the length of $L$ along the $j$th coordinate. To rigorously describe the theoretical results, we introduce the following definitions.

###### Definition 2.

A tree is a random-split tree if at every step of the tree-growing procedure, marginalizing over $\Theta$, the probability that the next split occurs along the $j$th feature is bounded below by $\pi/d$ for some $0 < \pi \leq 1$, for all $j = 1, \dots, d$.

###### Definition 3.

An uplift tree is $(\alpha, k)$-regular for some $\alpha \in (0, 0.5]$ if each split leaves at least a fraction $\alpha$ of the available training examples on each side of the split and each leaf node contains at least $k$ training examples for some $k \in \mathbb{N}$. In each leaf node, there are at most $2k - 1$ training examples for each treatment.

It is not difficult to see that a tree generated by UCTS is both a random-split tree and an $(\alpha, k)$-regular tree, with $\pi$, $\alpha$, and $k$ determined by its tuning parameters. Lemma 2 states that a leaf node of an $(\alpha, k)$-regular tree cannot be too small in its probability measure.

###### Lemma 2.

A leaf node $L$ of an $(\alpha, k)$-regular tree grown with $n$ training examples satisfies the following inequality,

$$P\left\{f(L) \geq \frac{k}{n} - \delta\right\} \geq 1 - e^{-2\delta^2 n} \quad (7)$$

for some $\delta > 0$.

###### Proof.

See Appendix 6.2. ∎

Lemma 3 further proves that the diameter of the leaf nodes of a random-split and $(\alpha, k)$-regular tree shrinks in all dimensions as the number of training examples grows.

###### Lemma 3.

A leaf node $L$ of a random-split and $(\alpha, k)$-regular tree grown with $n > k$ training examples satisfies the following inequality,

$$P\left\{\operatorname{diam}_j(L) \leq (1-\alpha+\delta)^{\left[\frac{\ln(n/k)}{\ln(\alpha^{-1})}-1\right]\left(\frac{\pi}{d}-\eta\right)}\right\} \geq 1 - e^{2\eta^2}\left(\frac{k}{n}\right)^{\frac{2\eta^2}{\ln(\alpha^{-1})}} - e^{-2\delta^2 k} \cdot \frac{\pi \ln(n/k)}{d \ln(\alpha^{-1})}$$

for some $\delta > 0$ and $0 < \eta < \pi/d$.

###### Proof.

See Appendix 6.3. ∎

### 4.3 Consistency of UCTS Trees

With the help of Lemma 2 and Lemma 3 we can proceed to prove the consistency of UCTS trees. The intuition is quite straightforward. By properly tuning the parameter $k$ such that $k/n \to 0$, the diameter of a leaf node vanishes, and with it the within-node variation in expected response. In addition, if $k \to \infty$ as $n \to \infty$, then we can estimate the leaf node responses to arbitrary accuracy.

The main consistency result is derived based on the following assumptions.

• Features are uniformly distributed in the $d$-dimensional unit hypercube, i.e., $\mathbf{X} \sim U([0,1]^d)$. This assumption is not as restrictive as it might seem. Because trees are invariant to monotone transformations of the features, any distribution that has bounded support and a bounded non-zero density function can be rescaled, without loss of generality, to the uniform distribution.

• The response is bounded, i.e., $|Y| \leq C_Y$.

• The conditional expectation function $\mu(\mathbf{x}, t)$ is Lipschitz continuous for each $t$, i.e., there exists a constant $C_L$ such that for all $\mathbf{x}_1, \mathbf{x}_2 \in \mathbb{X}$,

$$|\mu(\mathbf{x}_1, t) - \mu(\mathbf{x}_2, t)| \leq C_L |\mathbf{x}_1 - \mathbf{x}_2|. \quad (8)$$
• A UCTS tree is an $(\alpha, k)$-regular tree. We assume the parameters are chosen properly with $n$ such that $k \to \infty$ and $k/n \to 0$ as $n \to \infty$.

###### Theorem 1.

If the above assumptions are satisfied, then a treatment selection rule constructed by the UCTS algorithm is consistent.

###### Proof.

See Appendix 6.4. ∎

## 5 Conclusion

With the increasing ease of accessing and analyzing large amounts of data come the possibility and necessity of personalization. Uplift Modeling has proved to be an important tool in this movement. The algorithm presented in this paper, in addition to being competitive performance-wise, fills a vacancy in the literature with its provable consistency.

## 6 Appendices

Proofs are organized in this section. There is a simple inequality that is used repeatedly. Given a random variable $Z$ bounded above by $C_Z$, for any $z$ we have,

$$E[Z] = E[Z \mid Z > z]\, P\{Z > z\} + E[Z \mid Z \leq z]\, P\{Z \leq z\} \leq C_Z P\{Z > z\} + z. \quad (9)$$

We also need Hoeffding's inequality for the binomial distribution. Let $H(n)$ be the number of successes in $n$ independently and identically distributed Bernoulli random variables with success probability $p$. For some $\delta > 0$, we have,

$$P\left\{\frac{H(n)}{n} \leq p - \delta\right\} \leq e^{-2\delta^2 n} \quad (10)$$

and

$$P\left\{\frac{H(n)}{n} \geq p + \delta\right\} \leq e^{-2\delta^2 n}. \quad (11)$$
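As a sanity check on bound (10), the exact binomial tail can be compared against the Hoeffding bound numerically; the sketch below uses only the standard library.

```python
import math

def binom_cdf(n, p, k):
    """Exact P{H(n) <= k} for H(n) ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def hoeffding_bound(n, delta):
    """Hoeffding's bound on P{H(n)/n <= p - delta}: exp(-2 delta^2 n)."""
    return math.exp(-2 * delta**2 * n)

# Example: n = 100, p = 0.5, delta = 0.1, so the event is {H <= 40}.
# The exact tail probability never exceeds the Hoeffding bound.
n, p, delta = 100, 0.5, 0.1
assert binom_cdf(n, p, math.floor((p - delta) * n)) <= hoeffding_bound(n, delta)
```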

### 6.1 Proof of Lemma 1

For any $\epsilon > 0$,

$$\begin{aligned}
& E\{\mu(\mathbf{X}, h^*(\mathbf{X})) - \mu(\mathbf{X}, h_n(\mathbf{X}))\}^2 \\
={}& \sum_{t \neq t'} E\big\{[\mu(\mathbf{X},t) - \mu(\mathbf{X},t')]^2 \,\big|\, h^*(\mathbf{X}) = t,\ h_n(\mathbf{X}) = t',\ \mu(\mathbf{X},t) - \mu(\mathbf{X},t') \geq \sqrt{\epsilon/2}\big\} \\
& \quad \cdot P\big\{h^*(\mathbf{X}) = t,\ h_n(\mathbf{X}) = t',\ \mu(\mathbf{X},t) - \mu(\mathbf{X},t') \geq \sqrt{\epsilon/2}\big\} \\
& + \sum_{t \neq t'} E\big\{[\mu(\mathbf{X},t) - \mu(\mathbf{X},t')]^2 \,\big|\, h^*(\mathbf{X}) = t,\ h_n(\mathbf{X}) = t',\ \mu(\mathbf{X},t) - \mu(\mathbf{X},t') < \sqrt{\epsilon/2}\big\} \\
& \quad \cdot P\big\{h^*(\mathbf{X}) = t,\ h_n(\mathbf{X}) = t',\ \mu(\mathbf{X},t) - \mu(\mathbf{X},t') < \sqrt{\epsilon/2}\big\} && (12) \\
\leq{}& 4C_Y^2 \sum_{t \neq t'} P\big\{h_n(\mathbf{X}) = t',\ h^*(\mathbf{X}) = t,\ \mu(\mathbf{X},t) - \mu(\mathbf{X},t') \geq \sqrt{\epsilon/2}\big\} + \frac{\epsilon}{2} && (13) \\
\leq{}& 4C_Y^2 \sum_{t \neq t'} P\big\{\mu_n(\mathbf{X},t') \geq \mu_n(\mathbf{X},t),\ h^*(\mathbf{X}) = t,\ \mu(\mathbf{X},t) - \mu(\mathbf{X},t') \geq \sqrt{\epsilon/2}\big\} + \frac{\epsilon}{2} && (14) \\
\leq{}& 4C_Y^2 \sum_{t \neq t'} P\big\{\mu_n(\mathbf{X},t') \geq \mu(\mathbf{X},t') + \tfrac{1}{2}\sqrt{\epsilon/2} \text{ or } \mu_n(\mathbf{X},t) \leq \mu(\mathbf{X},t) - \tfrac{1}{2}\sqrt{\epsilon/2}, \\
& \qquad\qquad h^*(\mathbf{X}) = t,\ \mu(\mathbf{X},t) - \mu(\mathbf{X},t') \geq \sqrt{\epsilon/2}\big\} + \frac{\epsilon}{2} && (15) \\
\leq{}& 4C_Y^2 \sum_{t \neq t'} P\big\{\mu_n(\mathbf{X},t') \geq \mu(\mathbf{X},t') + \tfrac{1}{2}\sqrt{\epsilon/2} \text{ or } \mu_n(\mathbf{X},t) \leq \mu(\mathbf{X},t) - \tfrac{1}{2}\sqrt{\epsilon/2}\big\} + \frac{\epsilon}{2} && (16) \\
\leq{}& 4C_Y^2 \sum_{t \neq t'} \Big[ P\big\{\mu_n(\mathbf{X},t') \geq \mu(\mathbf{X},t') + \tfrac{1}{2}\sqrt{\tfrac{\epsilon}{2}}\big\} + P\big\{\mu_n(\mathbf{X},t) \leq \mu(\mathbf{X},t) - \tfrac{1}{2}\sqrt{\tfrac{\epsilon}{2}}\big\} \Big] + \frac{\epsilon}{2}. && (17\text{--}18)
\end{aligned}$$

On one hand, we know that for any $\epsilon > 0$ there exists some $N$ such that when $n > N$,

$$E\{\mu_n(\mathbf{X}, t) - \mu(\mathbf{X}, t)\}^2 < \frac{\epsilon^2}{64 K (K-1) C_Y^2}. \quad (19)$$

On the other hand we have,

$$\begin{aligned}
E\{\mu_n(\mathbf{X},t) - \mu(\mathbf{X},t)\}^2 ={}& E\big\{[\mu_n(\mathbf{X},t) - \mu(\mathbf{X},t)]^2 \,\big|\, |\mu_n(\mathbf{X},t) - \mu(\mathbf{X},t)| \geq \tfrac{1}{2}\sqrt{\tfrac{\epsilon}{2}}\big\} \cdot P\big\{|\mu_n(\mathbf{X},t) - \mu(\mathbf{X},t)| \geq \tfrac{1}{2}\sqrt{\tfrac{\epsilon}{2}}\big\} \\
& + E\big\{[\mu_n(\mathbf{X},t) - \mu(\mathbf{X},t)]^2 \,\big|\, |\mu_n(\mathbf{X},t) - \mu(\mathbf{X},t)| < \tfrac{1}{2}\sqrt{\tfrac{\epsilon}{2}}\big\} \cdot P\big\{|\mu_n(\mathbf{X},t) - \mu(\mathbf{X},t)| < \tfrac{1}{2}\sqrt{\tfrac{\epsilon}{2}}\big\} \\
\geq{}& \frac{\epsilon}{8} P\big\{|\mu_n(\mathbf{X},t) - \mu(\mathbf{X},t)| \geq \tfrac{1}{2}\sqrt{\tfrac{\epsilon}{2}}\big\}. \quad (20)
\end{aligned}$$

Therefore when $n > N$ we have

$$P\left\{|\mu_n(\mathbf{X},t) - \mu(\mathbf{X},t)| \geq \frac{1}{2}\sqrt{\frac{\epsilon}{2}}\right\} \leq \frac{\epsilon}{8 K (K-1) C_Y^2}. \quad (21)$$

When $n > N$, combining Eq. (18) with Eq. (21) gives us

$$E\{\mu(\mathbf{X}, h^*(\mathbf{X})) - \mu(\mathbf{X}, h_n(\mathbf{X}))\}^2 \leq \epsilon.$$

### 6.2 Proof of Lemma 2

Let $L$ be a leaf node of an $(\alpha, k)$-regular tree. Given that the tree is grown with $n$ training examples, the number of examples $\#L$ in $L$ follows the binomial distribution $\mathrm{Binomial}(n, f(L))$. By Hoeffding's inequality, for some $\delta > 0$,

$$P\left\{\frac{\#L}{n} \leq f(L) + \delta\right\} \geq 1 - e^{-2\delta^2 n}. \quad (22)$$

Since $\#L \geq k$,

$$P\left\{f(L) \geq \frac{k}{n} - \delta\right\} \geq P\left\{f(L) \geq \frac{\#L}{n} - \delta\right\} \geq 1 - e^{-2\delta^2 n}. \quad (23)$$

### 6.3 Proof of Lemma 3

Let $\phi$ be an internal node of an $(\alpha, k)$-regular tree and $\phi'$ its child node. Given the number of examples $\#\phi$ in node $\phi$, the number of examples in the child node follows the binomial distribution $\mathrm{Binomial}(\#\phi, f(\phi')/f(\phi))$. Then by Hoeffding's inequality, for some $\delta > 0$,

$$P\left\{\frac{\#\phi'}{\#\phi} - \frac{f(\phi')}{f(\phi)} \geq -\delta\right\} \geq 1 - e^{-2\delta^2 \#\phi}. \quad (24)$$

Combining the above with the regularity condition $\#\phi'/\#\phi \leq 1 - \alpha$ gives us

$$P\{f(\phi') \leq (1 - \alpha + \delta) f(\phi)\} \geq 1 - e^{-2\delta^2 \#\phi}. \quad (25)$$

Suppose $\phi'$ is created by a split of $\phi$ on the $j$th coordinate. Then, because features are uniformly distributed,

$$P\{\operatorname{diam}_j(\phi') \leq (1 - \alpha + \delta) \operatorname{diam}_j(\phi)\} \geq 1 - e^{-2\delta^2 \#\phi}. \quad (26)$$

This means that, with high probability, each split shrinks the diameter along the splitting coordinate to at most a fraction $(1 - \alpha + \delta)$ of its previous value.

Let $L$ be a leaf node of an $(\alpha, k)$-regular tree. By regularity, each leaf contains at most $2k - 1$ training examples while each split retains at least a fraction $\alpha$ of the examples on each side, so the number of splits from the root to any leaf is greater than $\ln(n/k)/\ln(\alpha^{-1}) - 1$. Because the marginal probability that a split is made on the $j$th coordinate is bounded below by $\pi/d$, the number of splits $q_j$ on the $j$th coordinate is stochastically bounded below by a $\mathrm{Binomial}(\ln(n/k)/\ln(\alpha^{-1}) - 1, \pi/d)$ random variable. Again, by Hoeffding's inequality, for some $\eta > 0$,

$$P\left\{q_j \geq \left[\frac{\ln(n/k)}{\ln(\alpha^{-1})} - 1\right]\left(\frac{\pi}{d} - \eta\right)\right\} \geq 1 - \exp\left\{-2\eta^2\left[\frac{\ln(n/k)}{\ln(\alpha^{-1})} - 1\right]\right\} = 1 - e^{2\eta^2}\left(\frac{k}{n}\right)^{\frac{2\eta^2}{\ln(\alpha^{-1})}}. \quad (27)$$

Therefore, intuitively, the diameter of any leaf on the $j$th coordinate is, with high probability, bounded above by $(1-\alpha+\delta)^{[\ln(n/k)/\ln(\alpha^{-1}) - 1](\pi/d - \eta)}$. To be more precise, we have

$$\begin{aligned}
& P\left\{\operatorname{diam}_j(L) \leq (1-\alpha+\delta)^{\left[\frac{\ln(n/k)}{\ln(\alpha^{-1})}-1\right]\left(\frac{\pi}{d}-\eta\right)}\right\} \\
\geq{}& \left[1 - e^{2\eta^2}\left(\frac{k}{n}\right)^{\frac{2\eta^2}{\ln(\alpha^{-1})}}\right] \cdot \left[1 - e^{-2\delta^2 k}\right]^{\left[\frac{\ln(n/k)}{\ln(\alpha^{-1})}-1\right]\left(\frac{\pi}{d}-\eta\right)} && (28) \\
\geq{}& \left[1 - e^{2\eta^2}\left(\frac{k}{n}\right)^{\frac{2\eta^2}{\ln(\alpha^{-1})}}\right] \cdot \left\{1 - e^{-2\delta^2 k}\left[\frac{\ln(n/k)}{\ln(\alpha^{-1})} - 1\right]\left(\frac{\pi}{d} - \eta\right)\right\} && (29) \\
\geq{}& 1 - e^{2\eta^2}\left(\frac{k}{n}\right)^{\frac{2\eta^2}{\ln(\alpha^{-1})}} - e^{-2\delta^2 k} \cdot \frac{\pi \ln(n/k)}{d \ln(\alpha^{-1})}. && (30)
\end{aligned}$$

### 6.4 Proof of Theorem 1

Given a set of training examples $S_n$ and the auxiliary randomness $\Theta$, let $\mathcal{L}$ denote the partition of the feature space generated by the approximation set $S^A$. For a random test point $\mathbf{X}$, define $S^E(\mathbf{X}, t) = \{(\mathbf{x}^{(i)}, t^{(i)}, y^{(i)}) \in S^E : \mathbf{x}^{(i)} \in L(\mathbf{X}),\ t^{(i)} = t\}$, i.e., $S^E(\mathbf{X}, t)$ contains the data in $S^E$ that fall into the same leaf as $\mathbf{X}$ and are also assigned treatment $t$. For $t = 1, \dots, K$,

$$\begin{aligned}
E[\mu_n(\mathbf{X}, t; \Theta, S_n) - \mu(\mathbf{X}, t)]^2 ={}& E\left\{\frac{1}{\#S^E(\mathbf{X},t)} \sum_{S^E(\mathbf{X},t)} Y_i - \mu[L(\mathbf{X}), t] + \mu[L(\mathbf{X}), t] - \mu(\mathbf{X}, t)\right\}^2 && (31) \\
\leq{}& 2 E\left\{\frac{1}{\#S^E(\mathbf{X},t)} \sum_{S^E(\mathbf{X},t)} Y_i - \mu[L(\mathbf{X}), t]\right\}^2 + 2 E\{\mu[L(\mathbf{X}), t] - \mu(\mathbf{X}, t)\}^2 && (32) \\
\triangleq{}& 2I + 2J. && (33)
\end{aligned}$$

We can bound the estimation error by appropriately increasing the minimum number of samples in the leaf nodes. Define , , and .

 I= E{∑SE(X,t)Yi#SE(X,t)−μ[L(X),t]∣∣f(L(X))≥kρn−δ1}2 ⋅P{f(L(X))≥kρn−δ1} +E{∑SE(X,t)Yi#SE(X,t)−μ[L(X),t]∣∣f(L(X))