Orthogonal Random Forest for Causal Inference
Abstract
We propose the orthogonal random forest, an algorithm that combines Neyman orthogonality, which reduces sensitivity to the estimation error of nuisance parameters, with generalized random forests (Athey et al., 2017)—a flexible nonparametric method for statistical estimation of conditional moment models using random forests. We provide a consistency rate and establish asymptotic normality for our estimator. We show that under mild assumptions on the consistency rate of the nuisance estimator, we can achieve the same error rate as an oracle with a priori knowledge of these nuisance parameters. We show that when the nuisance functions have a locally sparse parametrization, then a local penalized regression achieves the required rate. We apply our method to estimate heterogeneous treatment effects from observational data with discrete or continuous treatments, and we show that, unlike prior work, our method provably allows one to control for a high-dimensional set of variables under standard sparsity conditions. We also provide a comprehensive empirical evaluation of our algorithm on both synthetic and real data.
1 Introduction
Many problems that arise in causal inference can be formulated in the language of conditional moment models: given a target feature x, find a solution θ₀(x) to a system of conditional moment equations
(1)   E[ψ(Z; θ₀(x), h₀(x, W)) ∣ X = x] = 0
given access to i.i.d. samples from the data generating distribution, where ψ is a known score function and h₀ is an unknown nuisance function that also needs to be estimated from data. Examples include nonparametric regression, heterogeneous treatment effect estimation, instrumental variable regression, local maximum likelihood estimation, and estimation of structural econometric models (see e.g. Reiss & Wolak (2007) and the examples in Chernozhukov et al. (2016, 2018)). The study of such conditional moment restriction problems has a long history in econometrics (see e.g. Newey (1993); Ai & Chen (2003); Chen & Pouzo (2009); Chernozhukov et al. (2015)).
In this general estimation problem, the main goal is to estimate the target parameter at a rate that is robust to the estimation error of the nuisance component. This allows the use of flexible models to fit the nuisance functions and enables asymptotically valid inference. Almost all prior work on the topic has focused on two settings: i) assume the target function takes a parametric form and allow for a potentially high-dimensional parametric nuisance function, e.g. (Chernozhukov et al., 2016, 2017, 2018), or ii) take a nonparametric stance at estimating the target but do not allow for high-dimensional nuisance functions (Wager & Athey, 2015; Athey et al., 2017).
We propose the Orthogonal Random Forest (ORF), a random-forest-based estimation algorithm which performs nonparametric estimation of the target parameter while permitting more complex nuisance functions with high-dimensional parameterizations. Our estimator is also asymptotically normal and hence allows for the construction of asymptotically valid confidence intervals via plug-in or bootstrap approaches. Our approach combines the notion of Neyman orthogonality of the moment equations with a two-stage random-forest-based algorithm, which generalizes prior work on Generalized Random Forests (Athey et al., 2017) and the double machine learning (double ML) approach proposed in (Chernozhukov et al., 2017). To support our general algorithm, we also provide a novel nuisance estimation algorithm—Forest Lasso—that effectively recovers high-dimensional nuisance parameters provided they have locally sparse structure. This result combines techniques from Lasso theory (Hastie et al., 2015) with concentration inequalities (Hoeffding, 1963).
As a concrete example and as a main application of our approach, we consider the problem of heterogeneous treatment effect estimation. This problem is at the heart of many decision-making processes, including clinical trial assignment to patients, price adjustments of products, and ad placement by a search engine. In many situations, we would like to take the heterogeneity of the population into account and estimate the heterogeneous treatment effect (HTE)—the effect of a treatment (e.g. drug treatment, price discount, and ad position) on the outcome of interest (e.g. clinical response, demand, and click-through rate), as a function of observable characteristics of the treated subject (e.g. individual patient, product, and ad). HTE estimation is a fundamental problem in causal inference from observational data (Imbens & Rubin, 2015; Wager & Athey, 2015; Athey et al., 2017), and is intimately related to many areas of machine learning, including contextual bandits, off-policy evaluation and optimization (Swaminathan et al., 2016; Wang et al., 2017; Nie & Wager, 2017), and counterfactual prediction (Swaminathan & Joachims, 2015; Hartford et al., 2016).
The key challenge in HTE estimation is that the observations are typically collected by a policy that depends on confounders or control variables W, which also directly influence the outcome. Performing a direct regression of the outcome on the treatment and features, without controlling for a multitude of other potential confounders, will produce biased estimates. This leads to a regression problem that in the language of conditional moments takes the form:
(2)   E[Y − ⟨θ₀(x), T⟩ − f₀(x, W) ∣ T, x, W] = 0
where θ₀(x) is the heterogeneous effect of the treatment T (discrete or continuous) on the outcome Y as a function of the features x, and f₀ is an unknown nuisance function that captures the direct effect of the control variables on the outcome. Moreover, unlike active experimentation settings such as contextual bandits, when dealing with observational data, the actual treatment or logging policy that could potentially be used to debias the estimation of θ₀ is also unknown.
There is a surge of recent work at the interplay of machine learning and causal inference that studies efficient estimation and inference of treatment effects. Chernozhukov et al. (2017) propose a two-stage estimation method called double machine learning that first orthogonalizes out the effect of high-dimensional confounding factors using sophisticated machine learning algorithms, including Lasso, deep neural nets, and random forests, and then estimates the effect of the lower-dimensional treatment variables by running a low-dimensional linear regression between the residualized treatments and residualized outcomes. They show that even if the first-stage estimation of the nuisances is not particularly accurate, the second-stage estimate can still be asymptotically normal. However, their approach requires a parametric specification of the effect function. In contrast, another line of work that brings machine learning to causal inference provides fully flexible nonparametric HTE estimation based on random forest techniques (Wager & Athey, 2015; Athey et al., 2017; Powers et al., 2017). However, these methods heavily rely on low-dimensional assumptions.
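The residual-on-residual recipe behind double ML can be sketched in a few lines. The following toy simulation is our own illustration, not the paper's code: a constant effect, linear nuisances, and plain least squares standing in for the ML learners. It shows how residualization removes confounding bias that a naive regression absorbs:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, 10

# Illustrative DGP: constant effect theta0; confounders W shift both
# the treatment and the outcome, so a naive regression is biased.
theta0 = 2.0
W = rng.normal(size=(n, d))
gamma = np.full(d, 1 / np.sqrt(d))
T = W @ gamma + rng.normal(size=n)                 # confounded treatment
Y = theta0 * T + W @ gamma + rng.normal(size=n)    # outcome

def residualize(W, v):
    """Residualize v on W by least squares (stand-in for any ML learner)."""
    coef, *_ = np.linalg.lstsq(W, v, rcond=None)
    return v - W @ coef

# Stage 1: predict treatment and outcome from the confounders.
Y_res = residualize(W, Y)
T_res = residualize(W, T)

# Stage 2: regress residuals on residuals to recover the effect.
theta_dml = (T_res @ Y_res) / (T_res @ T_res)

# Naive regression of Y on T alone absorbs the confounding.
theta_naive = (T @ Y) / (T @ T)
```

In this DGP the naive estimate is biased upward by roughly cov(T, Wγ)/var(T), while the residual-on-residual estimate concentrates around the true effect.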
Our algorithm ORF, when applied to the HTE problem (see Section 6), allows for the nonparametric estimation of the effect function via forest-based approaches while simultaneously allowing for a high-dimensional set of control variables W. This estimation problem is of practical importance when a decision maker (DM) wants to optimize a policy that depends only on a small set of variables, e.g. due to data collection or regulatory constraints or due to interpretability of the resulting policy, while at the same time controlling for many potential confounders in the existing data that could lead to biased estimates. Such settings naturally arise in contextual pricing or personalized medicine. In such settings the DM is faced with the problem of estimating a conditional average treatment effect conditional on a small set of variables while controlling for a much larger set. Our estimator provably offers a significant statistical advantage for this task over prior approaches.
In the HTE setting, the ORF algorithm follows the residual-on-residual regression approach analyzed by (Chernozhukov et al., 2016) to formulate a locally Neyman orthogonal moment, and then applies our orthogonal forest algorithm to this orthogonal moment. Notably, (Athey et al., 2017) also recommend such a residual-on-residual regression approach in their empirical evaluation, which they refer to as “local centering”, albeit with no theoretical analysis. Our results provide a theoretical foundation for the local centering approach through the lens of Neyman orthogonality. Moreover, our theoretical results give rise to a slightly different overall estimation approach than the one in (Athey et al., 2017): namely, we residualize locally around the target estimation point, as opposed to performing an overall residualization step and then calling the Generalized Random Forest algorithm on the residuals. The latter stems from the fact that our results require that the nuisance estimator achieve a good estimation rate only around the target point. Hence, residualizing locally seems more appropriate than running a global nuisance estimation, which would typically minimize a non-local mean squared error. Our experimental findings reinforce this intuition (see e.g. the comparison between ORF and the GRF-Res benchmark). Another notable work that combines the residualization idea with flexible heterogeneous effect estimation is that of (Nie & Wager, 2017), who formulate the problem as an appropriate residual-based square loss minimization over an arbitrary hypothesis space for the heterogeneous effect function. Formally, they show robustness, with respect to nuisance estimation errors, of the mean squared error (MSE) of the resulting estimate, in expectation over the feature distribution and for the case where the hypothesis space is a reproducing kernel Hilbert space (RKHS).
Our work differs primarily by: i) focusing on sup-norm estimation error at any target point as opposed to MSE, ii) using forest-based estimation as opposed to finding a function in an RKHS, iii) working with general orthogonal conditional moment problems, and iv) providing asymptotic normality results and hence valid inference.
We provide a comprehensive empirical comparison of ORF with several benchmarks, including three variants of GRF. We show that by setting the parameters according to what our theory suggests, ORF consistently outperforms all of the benchmarks. Moreover, we show that bootstrap-based confidence intervals provide good finite sample coverage.
Finally, to motivate the usage of the ORF, we applied our technique to Dominick’s dataset, a popular historical dataset of store-level orange juice prices and sales provided by the University of Chicago Booth School of Business. The dataset is comprised of a large number of covariates, but economics researchers might only be interested in learning the elasticity of demand as a function of a few variables, such as income or education. We applied our method (see Appendix G for details) to estimate orange juice price elasticity as a function of income, and our results, depicted in Figure 1, unveil the natural phenomenon that lower-income consumers are more price-sensitive.
2 Estimation via Local Orthogonal Moments
We study nonparametric estimation of models defined via conditional moment restrictions, in the presence of nuisance functions. Suppose we have a set of observations Z₁, …, Zₙ drawn independently from some underlying distribution over the observation domain. Each observation Z contains a feature vector x.
Given a target feature x, our goal is to estimate a parameter vector θ₀(x) that is defined via a local moment condition, i.e. for all x, θ₀(x) is the unique solution with respect to θ of:
(3)   E[ψ(Z; θ, h₀(x, W)) ∣ X = x] = 0
where ψ is a score function that maps an observation Z, parameter vector θ, and nuisance vector to a vector-valued score, and h₀ is an unknown nuisance function that takes as input x and a subvector W of Z, and outputs a nuisance vector. For any feature x, parameter θ, and nuisance function h, we define the moment function m as:
(4)   m(x; θ, h) = E[ψ(Z; θ, h(x, W)) ∣ X = x]
We assume that the dimensions of the parameter and the score are constants, while the dimension of the nuisance parametrization can grow with the sample size n.
We will analyze the following twostage estimation process.


First stage. Compute a nuisance estimate ĥ for h₀ using the data, with some guarantee on the conditional root mean squared error around the target x. (Throughout the paper, ‖·‖ denotes the Euclidean norm and ‖·‖₁ the ℓ₁ norm.)

Second stage. Compute a set of similarity weights a₁, …, aₙ over the data that measure the similarity between their feature vectors and the target x. Compute the estimate θ̂(x) using the nuisance estimate ĥ via the plug-in weighted moment condition:
(5)   ∑ᵢ aᵢ ψ(Zᵢ; θ̂(x), ĥ(x, Wᵢ)) = 0
In practice, our framework permits the use of any method to estimate the nuisance function in the first stage. However, since this description is abstract, let us give a special case, which we will also need to assume for our normality result. Consider the case when the nuisance function takes the form h₀(x, W) = h(W; ν₀(x)), for some known function h but unknown parameter function ν₀ (whose dimension can potentially grow with n), i.e. locally around each x the nuisance is a parametric function of W. Moreover, the parameter ν₀(x) of the true nuisance function is identified as the minimizer of a local loss:
(6)   ν₀(x) = argmin_ν E[ℓ(Z; ν) ∣ X = x]
Then we can estimate ν₀(x) via a locally weighted and penalized empirical loss minimization algorithm. In particular, in Section 5 we will consider the case of local ℓ₁-penalized estimation, which we refer to as forest lasso, and which provides formal guarantees in the case where ν₀(x) is sparse.
The key technical condition that allows us to reliably perform the two-stage estimation is the following local orthogonality condition, which can be viewed as a localized version of the Neyman orthogonality condition (Neyman, 1979; Chernozhukov et al., 2017) around the neighborhood of the target feature x. Intuitively, the condition says that the score function is insensitive to local perturbations in the nuisance parameters around their true values.
Definition 2.1 (Local Orthogonality).
Fix any estimator ĥ for the nuisance function. Then the Gateaux derivative of the moment with respect to the nuisance, in the direction ĥ − h₀, is defined as:
D m(x; θ, ĥ − h₀) = E[∇ψ(Z; θ, h₀(x, W)) · (ĥ(x, W) − h₀(x, W)) ∣ X = x],
where ∇ψ denotes the gradient of ψ with respect to its final (nuisance) arguments. We say that the moment conditions are locally orthogonal if for all estimators ĥ: D m(x; θ₀(x), ĥ − h₀) = 0.
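To see what this first-order insensitivity buys, consider a toy partially linear model where both nuisances are known up to an additive perturbation delta (our own Monte Carlo illustration, not the paper's code). The orthogonal residual-on-residual moment incurs an error of order delta squared, while a moment that is not orthogonal to the outcome nuisance incurs an error of order delta:

```python
import numpy as np

rng = np.random.default_rng(1)
n, theta0 = 200_000, 1.5

# Partially linear model with a single confounder W.
W = rng.normal(loc=1.0, size=n)
p0 = 0.8 * W                          # true E[T | W]
T = p0 + rng.normal(size=n)
f0 = 0.5 * W
Y = theta0 * T + f0 + rng.normal(size=n)
q0 = theta0 * p0 + f0                 # true E[Y | W]

def theta_orthogonal(delta):
    """Residual-on-residual moment with both nuisances off by delta."""
    Y_res, T_res = Y - (q0 + delta), T - (p0 + delta)
    return (T_res @ Y_res) / (T_res @ T_res)

def theta_non_orthogonal(delta):
    """Moment E[(Y - f - theta*T) T]: consistent at the true f0, but its
    error is first-order in the nuisance perturbation."""
    return (T @ (Y - (f0 + delta))) / (T @ T)

delta = 0.2
err_orth = abs(theta_orthogonal(delta) - theta0)
err_plain = abs(theta_non_orthogonal(delta) - theta0)
```

With delta = 0.2 the orthogonal moment's error stays an order of magnitude closer to zero, reflecting the vanishing Gateaux derivative at the true nuisances.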
3 Orthogonal Random Forest
We describe our main algorithm, the orthogonal random forest (ORF), for calculating the similarity weights used in the second stage of the two-stage estimation. In Section 5 we will see that we also use this algorithm for the estimation of the nuisance functions, so as to perform a local nuisance estimation. At a high level, ORF can be viewed as an orthogonalized version of GRF that is more robust to nuisance estimation error. Similar to GRF, the algorithm runs a tree learner over B random subsamples (without replacement) of size s to build B trees, such that each tree b provides a tree-based weight for each observation in the input sample. The ORF weight for each sample is then the average of its tree weights.
The tree learner starts with a root node that contains the entire input sample and recursively grows the tree, splitting into a set of leaves, until the number of observations in each leaf is not too small. The set of neighborhoods defined by the leaves naturally gives a similarity measure between each observation and the target x. Following the approach of (Tibshirani et al., 2018; Wager & Athey, 2015), we maintain the following tree properties in the process of building a tree.
Specification 1 (Forest Regularity).
The tree satisfies


Honesty: we randomly partition the input sample into two subsets S¹ and S², then use S¹ to place splits in the tree, and use S² for estimation.

Balanced: each split leaves at least a fraction ρ of the node's observations on each side of the split, for some constant parameter ρ > 0.

Minimum leaf size r: there are between r and 2r − 1 observations from S² in each leaf of the tree.

Random-split: at every step, marginalizing over the internal randomness of the learner, the probability that the next split occurs along the j-th feature is at least π/d for some constant 0 < π ≤ 1, for every feature j. (For example, this can be achieved by uniformly randomizing the splitting variable with probability π, or via a Poisson sampling scheme in which a random subset of the variables is chosen as candidates for splitting.)
The key modification to GRF’s tree learner is our incorporation of orthogonal nuisance estimation in the splitting criterion. In order to accurately capture the heterogeneity in the target parameter in the presence of a high-dimensional nuisance, we perform a two-stage estimation locally at each internal node to decide the next split. While the splitting criterion does not factor into our theoretical analysis (similar to (Tibshirani et al., 2018)), we find it to be an effective practical heuristic.
Splitting criterion with orthogonalization.
At each internal node P we perform a two-stage estimation over the examples in S¹ that reach node P: 1) compute a local nuisance estimate using only these data (e.g. by estimating a parameter that minimizes the local loss), and then 2) form the node estimate θ̂_P using this nuisance estimate. (In our implementation we actually use a cross-fitting approach, where we use half of the node's data to compute a nuisance estimate applied to the other half, and vice versa.)
We now generate a large random set of candidate axis-aligned splits (satisfying Specification 1), and we want to find the split into two children C₁ and C₂ such that, if we perform the same two-stage estimation separately at each child, the new estimates θ̂_{C₁} and θ̂_{C₂} take on very different values, so that the heterogeneity of the two children nodes is maximized. Performing the two-stage estimation of θ̂_{C₁} and θ̂_{C₂} for all candidate splits is too computationally expensive. Instead, we approximate these estimates by taking a Newton step from the parent node estimate θ̂_P: for any child node C given by a candidate split, our proxy estimate is:
where the per-observation corrections are computed from the score and the Jacobian at the parent estimate. We select the candidate split that maximizes the following proxy heterogeneity score: for each coordinate t of the parameter, let
(7) 
where the per-coordinate score measures the difference between the children's proxy estimates. We then create a single heterogeneity score per split as a convex combination that puts weight λ on the mean and 1 − λ on the maximum score across coordinates, where λ is chosen uniformly at random at each iteration of splitting. Hence, some splits focus on heterogeneity on average, while others focus on creating heterogeneity on individual coordinates.
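For the residual-on-residual moment with a scalar treatment, the Newton-step proxy has a closed form and, because this moment is linear in the parameter, it tracks the exact child re-estimate closely. A sketch under illustrative residualized data (the names and DGP below are ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 4000

# Residualized data in a node; the effect differs across the two halves.
x = rng.uniform(size=n)
T_res = rng.normal(size=n)
theta_true = np.where(x < 0.5, 1.0, 3.0)
Y_res = theta_true * T_res + 0.5 * rng.normal(size=n)

# Parent estimate and Jacobian for the moment psi = (Y - theta*T) * T.
theta_p = (T_res @ Y_res) / (T_res @ T_res)
A = (T_res ** 2).mean()
rho = (Y_res - theta_p * T_res) * T_res / A    # per-sample corrections

def proxy(child):
    """One Newton step from the parent estimate (the splitting proxy)."""
    return theta_p + rho[child].mean()

def exact(child):
    """Full re-estimation on the child node."""
    return (T_res[child] @ Y_res[child]) / (T_res[child] @ T_res[child])

left = x < 0.5
gap_left = abs(proxy(left) - exact(left))
gap_right = abs(proxy(~left) - exact(~left))
```

The split separating the two halves yields proxy estimates near 1 and 3, so its heterogeneity score is large, without ever re-solving the moment per candidate split.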
ORF weights and estimator.
For each tree b, built on subsample S_b, let L_b(x) be the leaf that contains the target feature x. We assign to each observation i a tree weight, uniform over the estimation samples of S_b that fall in L_b(x), and define the ORF weight aᵢ as the average of the tree weights over the B trees.
Wager & Athey (2015) show that under the structural specification of the trees, the tree weights are nonzero only in a small neighborhood of x; a property that we leverage in our analysis.
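The weight construction can be sketched as follows, with a random interval partition of [0, 1] standing in for the leaves of each honest tree (our own illustration, not the paper's learner):

```python
import numpy as np

rng = np.random.default_rng(2)
n, B = 500, 100
X = rng.uniform(size=n)      # one-dimensional features for illustration
x_target = 0.5

weights = np.zeros(n)
for _ in range(B):
    # Stand-in for one tree: a random partition of [0, 1] into intervals.
    edges = np.concatenate(([0.0], np.sort(rng.uniform(size=7)), [1.0]))
    leaf = np.searchsorted(edges, X)
    in_leaf = (leaf == np.searchsorted(edges, x_target)).astype(float)
    if in_leaf.sum() > 0:
        # Tree weight: uniform over observations sharing the target's leaf.
        weights += in_leaf / in_leaf.sum()
weights /= weights.sum()     # average over trees, normalized

# The resulting weights act as a data-adaptive kernel around x_target:
# they are nonnegative, sum to one, and concentrate on nearby points.
```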
Theorem 3.1 (Kernel shrinkage (Wager & Athey, 2015)).
Suppose the minimum leaf size parameter r ≥ 1, the tree is balanced and random-split, and the distribution of x admits a density that is bounded away from zero and infinity. Then the tree weights satisfy
where
and s is the size of the subsamples.
4 Convergence and Asymptotic Analysis
The ORF estimate θ̂(x) is computed by solving the weighted moment condition in Equation 5, using the ORF weights described in the previous section. We now provide theoretical guarantees for θ̂(x) under the following assumptions on the moment, the score function, and the data generating process.
Assumption 4.1.
The moment condition and the score function satisfy the following:


Local Orthogonality. The moment condition satisfies local orthogonality.

Identifiability. The moment equation has a unique solution θ₀(x).

Smooth Signal. The moments are Lipschitz in the target feature x for any parameter value.

Curvature. The Jacobian of the moment with respect to the parameter has minimum eigenvalue bounded away from zero.

Smoothness of scores. The eigenvalues of the expected Hessian of the score are bounded above by a constant. The score is Lipschitz in the parameter for any nuisance value, and Lipschitz in the nuisance for any parameter value. The gradient of the score with respect to the parameter is Lipschitz in the nuisance.

Boundedness. The parameter set has constant diameter. There exists a bound such that for any observation, the first-stage nuisance estimate is uniformly bounded.

Full Support. The distribution of x admits a density that is bounded away from zero and infinity.
All results presented in the remainder of the paper assume these conditions, and we omit stating so in each of the theorems. Any extra conditions required by a theorem are explicitly provided. Note that except for the local orthogonality condition, all of the assumptions impose standard boundedness and regularity conditions on the moments.
Theorem 4.2 (Error Bound).
Suppose that: . Then:
Theorem 4.3 (High Probability Error Bound).
Suppose that the score is the gradient of a convex loss and let σ denote the minimum eigenvalue of the Jacobian. Moreover, suppose that the nuisance estimate satisfies, with high probability, a bound on its conditional root mean squared error. Then, with high probability:
(8) 
For asymptotic normality, we restrict our framework to the case of locally parametric nuisance functions, i.e. h₀(x, W) = h(W; ν₀(x)) for some known function h, and to a particular type of nuisance estimator that recovers the true parameter ν₀(x). We note, however, that the parameter ν₀(x) can be an arbitrary nonparametric function of x and can also be high-dimensional. We will further assume that the moments also have a smooth covariance structure in x, i.e. if we let
then this covariance is Lipschitz in x for any parameter value.
Theorem 4.4 (Asymptotic Normality).
Suppose that the nuisance takes a locally parametric form h₀(x, W) = h(W; ν₀(x)), for some known function h that is Lipschitz in the parameter with respect to the ℓ₁ norm, and that the nuisance estimate is of the form ĥ = h(·; ν̂(x)) and satisfies:
Suppose that is chosen such that:
for any , and . Moreover, is Lipschitz in for any . Then for any coefficient , with , assuming for any , there exists a sequence , such that:
(9) 
Given the result in Theorem 4.4, we can follow the Bootstrap of Little Bags approach (Athey et al., 2017; Sexton & Laake, 2009) to build valid confidence intervals.
5 Nuisance Estimation: Forest Lasso
Next, we study the nuisance estimation problem in the first stage and provide a general nuisance estimation method that leverages a locally sparse parameterization of the nuisance function, permitting low error rates even for high-dimensional problems. Consider the case when the nuisance function takes the form h₀(x, W) = h(W; ν₀(x)) for some known function h but unknown parameter function ν₀, whose dimension can potentially grow with n. Moreover, the parameter ν₀(x) of the true nuisance function is identified as the minimizer of some local loss, as defined in Equation (6).
We consider the following estimation process: given a set of observations, we run the same tree learner as in Section 3 over random subsamples (without replacement) to compute ORF weights aᵢ for each observation. Then we apply a locally weighted ℓ₁-penalized estimation:
(10)   ν̂(x) = argmin_ν ∑ᵢ aᵢ ℓ(Zᵢ; ν) + λ ‖ν‖₁
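A minimal sketch of this locally weighted, l1-penalized first stage for the squared-loss case, solved by proximal gradient descent (our own illustration: any lasso solver applied to the weighted loss would do, and the weights below are uniform-random placeholders for actual forest weights):

```python
import numpy as np

def weighted_lasso(X, y, w, lam, iters=2000):
    """Minimize sum_i w_i (y_i - X_i @ beta)^2 + lam * ||beta||_1
    (weights normalized to sum to one) via proximal gradient (ISTA)."""
    w = w / w.sum()
    beta = np.zeros(X.shape[1])
    G = X.T @ (w[:, None] * X)
    L = 2 * np.linalg.eigvalsh(G)[-1]        # gradient Lipschitz constant
    for _ in range(iters):
        grad = 2 * X.T @ (w * (X @ beta - y))
        z = beta - grad / L
        # Soft-thresholding step handles the l1 penalty.
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return beta

rng = np.random.default_rng(3)
n, p = 400, 50
X = rng.normal(size=(n, p))
beta0 = np.zeros(p)
beta0[:3] = [1.0, -2.0, 1.5]                  # locally sparse truth
y = X @ beta0 + 0.1 * rng.normal(size=n)
w = rng.uniform(0.5, 1.5, size=n)             # stand-in forest weights
beta_hat = weighted_lasso(X, y, w, lam=0.02)
```

Under the sparse truth above, the weighted penalized fit recovers the three active coefficients and leaves the remaining coordinates essentially at zero.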
To provide formal guarantees for this method we will need to make the following assumptions.
Assumption 5.1 (Assumptions for nuisance estimation).
The target parameter and data distribution satisfy:


For any x, the parameter ν₀(x) is sparse, with small support S(x).

The parameter ν₀(x) is Lipschitz in x, and the nuisance h(W; ν) is Lipschitz in ν for any W, with respect to the ℓ₁ norm.

The data distribution satisfies a conditional restricted eigenvalue condition: for all x, and for all vectors in the restricted cone associated with the support S(x), the quadratic form of the conditional expectation of a data-dependent matrix H(Z) is bounded below:
(11)   ν^T E[H(Z) ∣ X = x] ν ≥ γ ‖ν‖², for some constant γ > 0.
Under Assumption 5.1, we show that the local ℓ₁-penalized estimator achieves the following parameter recovery guarantee.
Theorem 5.2.
With probability :
as long as .
Example 5.3 (Forest Lasso).
For locally sparse linear regression, h(W; ν) = ⟨ν, W⟩ and ℓ(Z; ν) = (Y − ⟨ν, W⟩)². This means the relevant matrix is H(Z) = W Wᵀ. Hence, the conditional restricted eigenvalue condition is simply a condition on the conditional covariance E[W Wᵀ ∣ X = x].
Example 5.4 (Forest Logistic Lasso).
For locally sparse logistic regression, h(W; ν) = L(⟨ν, W⟩), and ℓ is the corresponding logistic log-likelihood, where L is the logistic function. In this case, the relevant matrix is L′(⟨ν, W⟩) W Wᵀ, which is bounded below by a constant multiple of W Wᵀ, assuming the index ⟨ν, W⟩ is bounded in some finite range. Hence, our conditional restricted eigenvalue condition reduces to the same conditional covariance condition on E[W Wᵀ ∣ X = x].
6 Heterogeneous Treatment Effects
Now we apply ORF to the problem of estimating heterogeneous treatment effects. We consider the following extension of the partially linear regression (PLR) model due to Robinson (1988). (The standard PLR model considers solely the case of constant treatment effects, θ₀(x) ≡ θ₀, and the goal is the estimation of the constant parameter θ₀.) We have i.i.d. observations such that, for each observation, T represents the applied treatment, which can be either real-valued or discrete (taking values among the standard basis vectors, one per treatment level), Y represents the outcome, W represents potential confounding variables (controls), and x is the feature vector that captures the heterogeneity. These quantities are related via the following equations:
(12)   Y = ⟨θ₀(x), T⟩ + f₀(x, W) + ε
(13)   T = g₀(x, W) + η
where ε and η are bounded unobserved noises such that E[ε ∣ x, W] = 0 and E[η ∣ x, W] = 0. In the main equation (12), θ₀(x) represents the treatment effect function. Our goal is to estimate the conditional average treatment effect (CATE) conditioned on the target feature x:
(14) 
The confounding equation (13) determines the relationship between the treatment variable T and the feature x and confounder W. To create an orthogonal moment for identifying θ₀(x), we follow the classical residualization approach, similar to (Chernozhukov et al., 2017). First, observe that
E[Y ∣ x, W] = ⟨θ₀(x), g₀(x, W)⟩ + f₀(x, W).
Let us define the function q₀(x, W) = E[Y ∣ x, W], and consider the residuals
Ỹ = Y − q₀(x, W),   T̃ = T − g₀(x, W) = η.
Then we can simplify the equation as Ỹ = ⟨θ₀(x), T̃⟩ + ε. As long as ε is independent of η conditioned on x (e.g. when the two noises are independent), we also have
E[ε T̃ ∣ x] = 0.
Since T̃ = η, then
E[(Ỹ − ⟨θ₀(x), T̃⟩) T̃ ∣ x] = 0.
This relationship suggests that we can obtain an estimate of θ₀(x) by regressing Ỹ on T̃ locally around x. We can thus define the orthogonalized score function: for any observation Z, any parameter θ, and any estimates q̂ and ĝ for the functions q₀ and g₀, the score is:
ψ(Z; θ, q̂, ĝ) = (Y − q̂(x, W) − ⟨θ, T − ĝ(x, W)⟩) (T − ĝ(x, W)),
where the nuisance is the pair (q̂, ĝ). In the appendix, we show that this moment condition satisfies local orthogonality, and that it identifies θ₀(x) as long as the noise ε is independent of η conditioned on x, and the expected matrix E[T̃ T̃ᵀ ∣ x] is invertible. Even though the approach applies generically, to obtain formal guarantees on the nuisance estimates via our Forest Lasso, we will restrict their functional form.
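Given the residuals and a set of similarity weights, the second stage for a scalar treatment reduces to solving the weighted moment above for theta. A sketch with oracle nuisances and a box kernel standing in for the ORF weights (all names and the DGP are illustrative, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000
x = rng.uniform(size=n)
W = rng.normal(size=n)
theta_fn = lambda v: 1.0 + v              # heterogeneous effect theta0(x)
g0 = 0.5 * W                              # E[T | x, W]
T = g0 + rng.normal(size=n)
Y = theta_fn(x) * T + W + rng.normal(size=n)
q0 = theta_fn(x) * g0 + W                 # E[Y | x, W]

x0 = 0.5
# Box-kernel stand-in for the ORF similarity weights around x0.
a = (np.abs(x - x0) < 0.05).astype(float)
a /= a.sum()

# Solve sum_i a_i (Ytilde_i - theta * Ttilde_i) Ttilde_i = 0 for theta.
T_res, Y_res = T - g0, Y - q0
theta_hat = (a * T_res * Y_res).sum() / (a * T_res ** 2).sum()
```

Here the local solution concentrates around theta_fn(x0) = 1.5; in the actual algorithm the oracle residuals are replaced by first-stage estimates and the box kernel by forest weights.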
Realvalued treatments.
Suppose the treatment is real-valued and f₀ and each coordinate of g₀ are given by high-dimensional linear functions of the controls with sparse coefficient vectors. Consequently, q₀ can be written as a sparse linear function over degree-2 polynomial features. Then, as long as the coefficient vectors are Lipschitz in x and the confounders satisfy a conditional covariance condition,
then we can use Forest Lasso to estimate both q₀ and g₀. Hence, we can apply the ORF algorithm to obtain estimation error rates and asymptotic normality results for θ̂(x) (see Appendix B for the formal statement).
Discrete treatments.
We now describe how our theory can be applied to discrete treatments. Suppose the treatment is discrete and each coordinate of g₀ takes a logistic form with a sparse linear index, where L denotes the logistic function. Note that in this case η is not independent of W, since its distribution is determined by the treatment probabilities g₀(x, W).
To maintain the conditional independence between ε and η conditioned on x, we focus on the setting where f₀ is only a function of x, i.e. f₀(x, W) = f₀(x) for all W. In this setting we can estimate g₀ by running a forest logistic lasso for each treatment. Then we can estimate q₀ as follows: for each treatment t, estimate the expected counterfactual outcome function E[Y ∣ T = t, x, W] by running a forest lasso between Y and W only among the subset of samples that received treatment t; similarly, estimate the baseline outcome by running a forest lasso only among the subset of samples that received the baseline treatment. Then observe that q₀ can be written as a function of these counterfactual outcome functions and the treatment probabilities g₀. Thus we can combine these estimates to get an estimate of q₀. Hence, we can obtain a guarantee similar to that of Corollary B.1 (see appendix).
Doubly robust moment for discrete treatments.
In the setting where f₀ also depends on W and treatments are discrete, we can formulate an alternative orthogonal moment that identifies the CATE even when ε is correlated with η. This moment is based on first constructing unbiased estimates of the counterfactual outcome for every observation and for any potential treatment t, i.e. even for treatments other than the one received. The latter is done by invoking the doubly robust formula (Robins & Rotnitzky, 1995; Cassel et al., 1976; Kang et al., 2007):
with the convention that
Then we can identify the parameter θ₀(x) using a moment condition on the doubly robust counterfactual outcomes. One can easily show that this moment satisfies the Neyman orthogonality condition with respect to the nuisance functions (see appendix). In fact, this property is essentially implied by the fact that the estimates satisfy the double robustness property, since double robustness is a stronger condition than orthogonality. We will again consider the same functional forms for the nuisances. Then, using reasoning similar to that of the previous paragraph, we see that with a combination of forest logistic lasso for the treatment probabilities and forest lasso for the counterfactual outcomes, we can estimate these nuisance functions at a sufficiently fast rate for our ORF estimator (based on this doubly robust moment) to be asymptotically normal, assuming they have locally sparse linear or logistic parameterizations.
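The doubly robust construction for a binary treatment can be sketched as follows, with oracle nuisances for clarity (our own illustration; in the algorithm the propensities and outcome models would come from forest logistic lasso and forest lasso respectively):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
W = rng.normal(size=n)

# Binary treatment confounded through W; the true effect is 2.
p1 = 1.0 / (1.0 + np.exp(-W))             # propensity P(T = 1 | W)
T = (rng.uniform(size=n) < p1).astype(int)
mu0, mu1 = 0.2 * W, 0.2 * W + 2.0         # E[Y | T = t, W]
Y = np.where(T == 1, mu1, mu0) + rng.normal(size=n)

# Doubly robust counterfactual outcomes for each potential treatment:
# model prediction plus an inverse-propensity-weighted residual correction.
Y1_dr = mu1 + (T == 1) * (Y - mu1) / p1
Y0_dr = mu0 + (T == 0) * (Y - mu0) / (1.0 - p1)
ate_dr = (Y1_dr - Y0_dr).mean()

# Naive difference of group means stays confounded through W.
ate_naive = Y[T == 1].mean() - Y[T == 0].mean()
```

The correction term has mean zero when either the outcome model or the propensity is correct, which is exactly the double robustness that implies orthogonality of the resulting moment.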
7 Monte Carlo Experiments
We compare the empirical performance of ORF with other methods in the literature and their variants. (The source code for these experiments is available in the Microsoft/EconML repository.) The data generating process we consider is described by the following equations:
Moreover, the features are drawn from a uniform distribution, the controls are drawn independently, and the noise terms are exogenous. The sparse coefficient vectors have nonzero entries drawn independently at random, and we vary the support size over a range of values. We examine a treatment effect function that is continuous and piecewise linear (detailed in Figure 3); in Appendix H we analyze other forms, including piecewise constant and piecewise polynomial effects.
For each fixed treatment effect function, we repeat a set of Monte Carlo experiments, each of which consists of generating observations from the DGP, drawing the nuisance coefficient vectors, and estimating the effect at 100 test points over a grid. We then calculate the bias, variance, and root mean squared error (RMSE) of each estimate. We report the median and extreme percentiles of these three quantities across test points, so as to evaluate the average performance of each method. We compare two variants of ORF with two variants of GRF (Athey et al., 2017) (see Appendix H for a third variant) and two extensions of double ML methods for heterogeneous treatment effect estimation (Chernozhukov et al., 2017).
ORF variants. (1) ORF: We implement ORF as described in Section 3, setting parameters under the guidance of our theoretical results: the subsample size, Lasso regularization (for both the tree learner and kernel estimation), number of trees, maximum tree depth, and minimum leaf size. (2) ORF with LassoCV (ORF-CV): we replace the Lasso in ORF's kernel estimation with a cross-validated Lasso for the selection of the regularization parameters. ORF-CV provides a more systematic optimization over the parameters.
GRF variants. (1) GRF-Res-Lasso: We perform a naive combination of double ML and GRF by first residualizing the treatments and outcomes on both the features and controls, then running the GRF R package of (Tibshirani et al., 2018) on the residualized treatments, residualized outcomes, and features. A cross-validated Lasso is used for residualization. (2) GRF-Res-RF: We combine double ML and GRF as above, but we use cross-validated random forests for calculating the residuals.
Double ML with Polynomial Heterogeneity (DML-Poly). An extension of the classic double ML procedure for heterogeneous treatment effects introduced in (Chernozhukov et al., 2017). This method accounts for heterogeneity by creating an expanded linear basis of composite treatments (cross products between treatments and features). (1) Heterogeneous double ML using LassoCV for first-stage estimation (HeteroDML-Lasso): in this version, we use Lasso with cross-validation for calculating the residuals in the first stage. (2) Heterogeneous double ML using random forests for first-stage estimation (HeteroDML-RF): a more flexible version that uses random forests to perform the residualization of treatments and outcomes. The latter performs better when treatments and outcomes have a nonlinear relationship with the joint features and controls.
We generate data according to the Monte Carlo process above, with samples, controls, features, and support size , and consider three types of treatment effect functions. In this section, we present the results for a piecewise linear treatment effect function.
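The data generating process can be sketched as follows (a hypothetical instantiation: the coefficient ranges, noise scales, and this particular piecewise linear effect are our own choices, not the paper's exact values):

```python
import numpy as np

def make_dgp(n=5000, d_w=100, support=5, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n, d_w))            # high-dimensional controls
    x = rng.uniform(0, 1, size=n)            # scalar feature driving heterogeneity
    beta = np.zeros(d_w)                     # sparse nuisance coefficients
    gamma = np.zeros(d_w)
    beta[:support] = rng.uniform(-1, 1, size=support)
    gamma[:support] = rng.uniform(-1, 1, size=support)
    # Continuous piecewise linear treatment effect with a kink at x = 0.5.
    theta = np.where(x < 0.5, 1 + 2 * x, 2 - 2 * (x - 0.5))
    T = W @ gamma + rng.normal(size=n)       # confounded continuous treatment
    Y = theta * T + W @ beta + rng.normal(size=n)
    return W, x, T, Y, theta
```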
In Figure 3, we inspect the goodness of fit of the chosen estimation methods across 100 Monte Carlo experiments. We note the limitations of the two GRF-Res estimators, GRF-Res-Lasso and GRF-Res-RF, in capturing the treatment effect function. The GRF-Res-RF estimates exhibit a consistent bias, as the random forest residualization cannot capture the dependency on the controls given their high dimensionality. The Hetero-DML methods are not flexible enough to capture the complexity of the treatment effect function. The best performers are ORF-CV, ORF, and GRF-Res-Lasso, with the latter exhibiting a larger bias and variance.
We analyze these estimators as we increase the support size of . Figure 2 illustrates the evaluation metrics across different support sizes. ORF-CV performs very well, with consistent bias and RMSE across support sizes and treatment functions. The bias, variance, and RMSE of ORF grow with the support size, but at a lower rate than for the alternative estimators. The ORF-CV and ORF algorithms outperform the GRF-Res methods on all metrics for this example. We observe the same pattern for other choices of support size, sample size, and treatment effect function (see Appendix H). In Figure 4, we provide a snapshot of the bootstrap confidence interval coverage for this example.
Acknowledgements
ZSW is supported in part by a Google Faculty Research Award, a J.P. Morgan Faculty Award, a Mozilla research grant, and a Facebook Research Award. Part of this work was completed while ZSW was at Microsoft Research New York City.
References
 Ai & Chen (2003) Ai, C. and Chen, X. Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica, 71(6):1795–1843, 2003. doi: 10.1111/1468-0262.00470. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/1468-0262.00470.
 Athey et al. (2017) Athey, S., Tibshirani, J., and Wager, S. Generalized Random Forests. arXiv e-prints, October 2017.
 Cassel et al. (1976) Cassel, C. M., Särndal, C. E., and Wretman, J. H. Some results on generalized difference estimation and generalized regression estimation for finite populations. Biometrika, 63(3):615–620, 1976.
 Chen & Pouzo (2009) Chen, X. and Pouzo, D. Efficient estimation of semiparametric conditional moment models with possibly nonsmooth residuals. Journal of Econometrics, 152(1):46–60, 2009. ISSN 0304-4076. doi: 10.1016/j.jeconom.2009.02.002. URL http://www.sciencedirect.com/science/article/pii/S0304407609000529. Recent Advances in Nonparametric and Semiparametric Econometrics: A Volume Honouring Peter M. Robinson.
 Chernozhukov et al. (2015) Chernozhukov, V., Newey, W. K., and Santos, A. Constrained Conditional Moment Restriction Models. arXiv e-prints, art. arXiv:1509.06311, September 2015.
 Chernozhukov et al. (2016) Chernozhukov, V., Escanciano, J. C., Ichimura, H., Newey, W. K., and Robins, J. M. Locally Robust Semiparametric Estimation. arXiv e-prints, art. arXiv:1608.00033, July 2016.
 Chernozhukov et al. (2017) Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., and Newey, W. Double/debiased/neyman machine learning of treatment effects. American Economic Review, 107(5):261–65, May 2017. doi: 10.1257/aer.p20171038. URL http://www.aeaweb.org/articles?id=10.1257/aer.p20171038.
 Chernozhukov et al. (2017) Chernozhukov, V., Goldman, M., Semenova, V., and Taddy, M. Orthogonal Machine Learning for Demand Estimation: High Dimensional Causal Inference in Dynamic Panels. arXiv e-prints, December 2017.
 Chernozhukov et al. (2018) Chernozhukov, V., Nekipelov, D., Semenova, V., and Syrgkanis, V. Plug-in Regularized Estimation of High-Dimensional Parameters in Nonlinear Semiparametric Models. arXiv e-prints, art. arXiv:1806.04823, June 2018.
 Hartford et al. (2016) Hartford, J., Lewis, G., Leyton-Brown, K., and Taddy, M. Counterfactual Prediction with Deep Instrumental Variables Networks. arXiv e-prints, December 2016.
 Hastie et al. (2015) Hastie, T., Tibshirani, R., and Wainwright, M. Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman & Hall/CRC, 2015. ISBN 1498712169, 9781498712163.
 Hoeffding (1963) Hoeffding, W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963. ISSN 0162-1459. URL http://www.jstor.org/stable/2282952.
 Imbens & Rubin (2015) Imbens, G. W. and Rubin, D. B. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press, 2015. doi: 10.1017/CBO9781139025751.
 Kang et al. (2007) Kang, J. D., Schafer, J. L., et al. Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical science, 22(4):523–539, 2007.
 Mentch & Hooker (2016) Mentch, L. and Hooker, G. Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. The Journal of Machine Learning Research, 17(1):841–881, 2016.
 Newey (1993) Newey, W. K. Efficient estimation of models with conditional moment restrictions. In Econometrics, volume 11 of Handbook of Statistics, pp. 419–454. Elsevier, 1993. doi: 10.1016/S0169-7161(05)80051-3. URL http://www.sciencedirect.com/science/article/pii/S0169716105800513.
 Neyman (1979) Neyman, J. C(α) tests and their use. Sankhyā: The Indian Journal of Statistics, Series A (1961–2002), 41(1/2):1–21, 1979. ISSN 0581-572X. URL http://www.jstor.org/stable/25050174.
 Nie & Wager (2017) Nie, X. and Wager, S. Learning Objectives for Treatment Effect Estimation. arXiv e-prints, December 2017.
 Peel et al. (2010) Peel, T., Anthoine, S., and Ralaivola, L. Empirical bernstein inequalities for ustatistics. In Lafferty, J. D., Williams, C. K. I., ShaweTaylor, J., Zemel, R. S., and Culotta, A. (eds.), Advances in Neural Information Processing Systems 23, pp. 1903–1911. Curran Associates, Inc., 2010.
 Powers et al. (2017) Powers, S., Qian, J., Jung, K., Schuler, A., Shah, N. H., Hastie, T., and Tibshirani, R. Some methods for heterogeneous treatment effect estimation in high-dimensions. arXiv e-prints, July 2017.
 Reiss & Wolak (2007) Reiss, P. C. and Wolak, F. A. Chapter 64: Structural econometric modeling: Rationales and examples from industrial organization. In Handbook of Econometrics, volume 6, pp. 4277–4415. Elsevier, 2007. doi: 10.1016/S1573-4412(07)06064-3. URL http://www.sciencedirect.com/science/article/pii/S1573441207060643.
 Robins & Rotnitzky (1995) Robins, J. M. and Rotnitzky, A. Semiparametric efficiency in multivariate regression models with missing data. Journal of the American Statistical Association, 90(429):122–129, 1995.
 Robinson (1988) Robinson, P. M. Root-N-consistent semiparametric regression. Econometrica, 56(4):931–954, 1988. ISSN 0012-9682. URL http://www.jstor.org/stable/1912705.
 Sexton & Laake (2009) Sexton, J. and Laake, P. Standard errors for bagged and random forest estimators. Computational Statistics and Data Analysis, 53(3):801–811, 2009. ISSN 0167-9473. doi: 10.1016/j.csda.2008.08.007. URL http://www.sciencedirect.com/science/article/pii/S0167947308003988. Computational Statistics within Clinical Research.
 Swaminathan & Joachims (2015) Swaminathan, A. and Joachims, T. Counterfactual risk minimization: Learning from logged bandit feedback. CoRR, abs/1502.02362, 2015. URL http://arxiv.org/abs/1502.02362.
 Swaminathan et al. (2016) Swaminathan, A., Krishnamurthy, A., Agarwal, A., Dudík, M., Langford, J., Jose, D., and Zitouni, I. Offpolicy evaluation for slate recommendation. CoRR, abs/1605.04812, 2016. URL http://arxiv.org/abs/1605.04812.
 Tibshirani et al. (2018) Tibshirani, J., Athey, S., Wager, S., Friedberg, R., Miner, L., and Wright, M. grf: Generalized Random Forests (Beta), 2018. URL https://CRAN.R-project.org/package=grf. R package version 0.10.2.
 Wager & Athey (2015) Wager, S. and Athey, S. Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. arXiv e-prints, October 2015.
 Wang et al. (2017) Wang, Y.X., Agarwal, A., and Dudík, M. Optimal and adaptive offpolicy evaluation in contextual bandits. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 3589–3597, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/wang17a.html.
Appendix A Two-Forest ORF vs. One-Forest ORF
A natural question about ORF is whether it is necessary to build two separate forests for the two-stage estimation. We investigate this question by implementing a variant of ORF without sample splitting (ORF-NS): it builds a single random forest over the entire dataset and performs the two-stage estimation using the same set of importance weights. We empirically compare ORF-CV with ORF-NS. In Figure 5, we note that the bias, variance, and RMSE of ORF-NS increase drastically with the subsample ratio, whereas the same metrics remain almost constant for ORF-CV. This phenomenon is consistent with the theory, since larger subsamples induce a higher probability of collision between independently drawn subsamples, and the resulting "spillover" can incur a large bias and error.
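The collision intuition can be quantified: for two independent size-s subsamples drawn without replacement from the same n points, the probability that they share at least one sample is 1 - C(n-s, s)/C(n, s), which grows rapidly with the subsample ratio s/n. A small sketch (our own illustration, not part of the algorithm):

```python
from math import comb

def collision_prob(n, s):
    """Probability that two independent subsamples of size s, each drawn without
    replacement from the same n points, overlap in at least one sample.
    (comb(n - s, s) is 0 when 2s > n, correctly giving probability 1.)"""
    return 1 - comb(n - s, s) / comb(n, s)
```

For example, with n = 1000 the overlap probability already exceeds 90% once the subsample ratio reaches about 5%, which is consistent with the sharp degradation of ORF-NS at larger subsample ratios.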
Appendix B Formal Guarantee for Real-Valued Treatments
Corollary B.1 (Accuracy for real-valued treatments).
Suppose that and each coordinate are Lipschitz in and have norms bounded by for any . Assume that the distribution of admits a density that is bounded away from zero and infinity. For any feature , the conditional covariance matrices satisfy , and , where denotes the degree-2 polynomial feature vector of . Then with probability , ORF returns an estimator such that
as long as the sparsity and the subsampling rate of ORF . Moreover, for any with , there exists a sequence such that
as long as the sparsity and the subsampling rate of ORF for any .
Appendix C Uniform Convergence of Lipschitz Processes
Lemma C.1 (Stochastic Equicontinuity for U-statistics via Bracketing).
Consider a parameter space that is a bounded subset of , with . Consider the U-statistic over samples of order :
(15) 
where is a known function, symmetric in its first s arguments and Lipschitz in . Suppose that and . Then w.p. :
(16) 
Proof of Lemma C.1.
Note that for any fixed , is a U-statistic of order . Therefore, by the Bernstein inequality for U-statistics (see, e.g., Theorem 2 of Peel et al. (2010)), for any fixed , w.p.
Since , we can find a finite cover of size , such that for any , there exists with . Moreover, since is Lipschitz with respect to :
Thus we have that:
Taking a union bound over