Orthogonal Random Forest for Causal Inference

Orthogonal Random Forest for Causal Inference

Miruna Oprescu    Vasilis Syrgkanis    Zhiwei Steven Wu
Abstract

We propose the orthogonal random forest, an algorithm that combines Neyman-orthogonality to reduce sensitivity with respect to estimation error of nuisance parameters with generalized random forests (Athey et al., 2017)—a flexible non-parametric method for statistical estimation of conditional moment models using random forests. We provide a consistency rate and establish asymptotic normality for our estimator. We show that under mild assumptions on the consistency rate of the nuisance estimator, we can achieve the same error rate as an oracle with a priori knowledge of these nuisance parameters. We show that when the nuisance functions have a locally sparse parametrization, then a local -penalized regression achieves the required rate. We apply our method to estimate heterogeneous treatment effects from observational data with discrete treatments or continuous treatments, and we show that, unlike prior work, our method provably allows to control for a high-dimensional set of variables under standard sparsity conditions. We also provide a comprehensive empirical evaluation of our algorithm on both synthetic and real data.

1 Introduction

Many problems that arise in causal inference can be formulated in the language of conditional moment models: given a target feature find a solution to a system of conditional moment equations

 E[ψ(Z;θ,h0(x,W))∣X=x]=0, (1)

given access to i.i.d. samples from the data generating distribution, where is a known score function and is an unknown nuisance function that also needs to be estimated from data. Examples include non-parametric regression, heterogeneous treatment effect estimation, instrumental variable regression, local maximum likelihood estimation and estimation of structural econometric models.111See e.g. Reiss & Wolak (2007) and examples in Chernozhukov et al. (2016, 2018) The study of such conditional moment restriction problems has a long history in econometrics (see e.g. Newey (1993); Ai & Chen (2003); Chen & Pouzo (2009); Chernozhukov et al. (2015)).

\swedit

In this general estimation problem, the main goal is to estimate the target parameter at a rate that is robust to the estimation error of the nuisance component. This allows \sweditthe use of flexible models to fit the nuisance functions and enables asymptotically valid inference. Almost all prior work on the topic has focused on two settings: i) they either assume the target function takes a parametric form and allow for a potentially high-dimensional parametric nuisance function, e.g. (Chernozhukov et al., 2016, 2017, 2018), ii) \sweditor take a non-parametric stance at estimating but do not allow for high-dimensional \sweditnuisance functions (Wager & Athey, 2015; Athey et al., 2017).

We propose Orthogonal Random Forest (ORF), a random forest-based estimation algorithm, which performs non-parametric estimation of the target parameter while permitting more complex nuisance functions with high-dimensional parameterizations. Our estimator is also asymptotically normal and hence allows for the construction of asymptotically valid confidence intervals via plug-in or bootstrap approaches. Our approach combines the notion of Neyman orthogonality of the moment equations with a two-stage random forest based algorithm, which generalizes prior work on Generalized Random Forests (Athey et al., 2017) and the double machine learning (double ML) approach proposed in (Chernozhukov et al., 2017). \sweditTo support our general algorithm, we also provide a novel nuisance estimation algorithm—Forest Lasso—that effectively recovers high-dimensional nuisance parameters provided they have locally sparse structure. This result combines techniques from Lasso theory (Hastie et al., 2015) with concentration inequalities for -statistics (Hoeffding, 1963).

As a concrete example and as a main application of our approach, we consider the problem of heterogeneous treatment effect estimation. This problem is at the heart of many decision-making processes, including clinical trial assignment to patients, price adjustments of products, and ad placement by a search engine. In many situations, we would like to take the heterogeneity of the population into account and estimate the heterogeneous treatment effect (HTE)—the effect of a treatment (e.g. drug treatment, price discount, and ad position), on the outcome of interest (e.g. clinical response, demand, and click-through-rate), as a function of observable characteristics of the treated subject (e.g. individual patient, product, and ad). HTE estimation is a fundamental problem in causal inference from observational data (Imbens & Rubin, 2015; Wager & Athey, 2015; Athey et al., 2017), and is intimately related to many areas of machine learning, including contextual bandits, off-policy evaluation and optimization (Swaminathan et al., 2016; Wang et al., 2017; Nie & Wager, 2017), and counterfactual prediction (Swaminathan & Joachims, 2015; Hartford et al., 2016).

The key challenge in HTE estimation is that the observations are typically collected by a policy that depends on confounders or control variables , which also directly influence the outcome. Performing a direct regression of the outcome on the treatment and features , without controlling for a multitude of other potential confounders, will produce biased estimation. This leads to a regression problem that in the language of conditional moments takes the form:

 E[Y−θ0(x)T−f0(x,W)∣X=x]=0 (2)

where is the heterogeneous effect of the treatment (discrete or continuous) on the outcome as a function of the features and is an unknown nuisance function that captures the direct effect of the control variables on the outcome. Moreover, unlike active experimentation settings such as contextual bandits, when dealing with observational data, the actual treatment or logging policy that could potentially be used to de-bias the estimation of is also unknown.

There is a surge of recent work at the interplay of machine learning and causal inference that studies efficient estimation and inference of treatment effects. Chernozhukov et al. (2017) propose a two-stage estimation method called double machine learning that first orthogonalizes out the effect of high-dimensional confounding factors using sophisticated machine learning algorithms, including Lasso, deep neural nets and random forests, and then estimates the effect of the lower dimensional treatment variables, by running a low-dimensional linear regression between the residualized treatments and residualized outcomes. They show that \swediteven if the estimation error of the first stage is \sweditnot particularly accurate, the second-stage estimate can still be -asymptotically normal. However, their approach requires a parametric specification of . In contrast, another line of work that brings machine learning to causal inference provides fully flexible non-parametric HTE estimation based on random forest techniques (Wager & Athey, 2015; Athey et al., 2017; Powers et al., 2017). However, these methods heavily rely on low-dimensional assumptions.

Our algorithm ORF, when applied to the HTE problem (see Section 6) allows for the non-parametric estimation of via forest based approaches while simultaneously allowing for a high-dimensional set of control variables . This estimation problem is of practical importance when a decision maker (DM) wants to optimize a policy that depends only on a small set of variables, e.g. due to data collection or regulatory constraints or due to interpretability of the resulting policy, while at the same time controlling for many potential confounders in the existing data that could lead to biased estimates. Such settings naturally arise in contextual pricing or personalized medicine. In such settings the DM is faced with the problem of estimating a conditional average treatment effect conditional on a small set of variables while controlling for a much larger set. Our estimator provably offers a significant statistical advantage for this task over prior approaches.

In the HTE setting, the ORF algorithm follows the residual-on-residual regression approach analyzed by (Chernozhukov et al., 2016) to formulate a locally Neyman orthogonal moment and then applies our orthogonal forest algorithm to this orthogonal moment. Notably, (Athey et al., 2017) also recommend such a residual on residual regression approach in their empirical evaluation, which they refer to as “local centering”, albeit with no theoretical analysis. Our results provide a theoretical foundation of the local centering approach through the lens of Neyman orthogonality. Moreover, our theoretical results give rise to a slightly different overall estimation approach than the one in (Athey et al., 2017): namely we residualize locally around the target estimation point , as opposed to performing an overall residualization step and then calling the Generalized Random Forest algorithm on the residuals. The latter stems from the fact that our results require that the nuisance estimator achieve a good estimation rate only around the target point . Hence, residualizing locally seems more appropriate than running a global nuisance estimation, which would typically minimize a non-local mean squared error. Our experimental findings reinforce this intuition (see e.g. comparison between ORF and the GRF-Res benchmark). Another notable work that combines the residualization idea with flexible heterogeneous effect estimation is that of (Nie & Wager, 2017), who formulate the problem as an appropriate residual-based square loss minimization over an arbitrary hypothesis space for the heterogeneous effect function . Formally, they show robustness, with respect to nuisance estimation errors, of the mean squared error (MSE) of the resulting estimate in expectation over the distribution and for the case where the hypothesis space is a reproducing kernel Hilbert space (RKHS). Our work differs primarily by: i) focusing on sup-norm estimation error at any target point as opposed to MSE, ii) using forest based estimation as opposed to finding a function in an RKHS, iii) working with the general orthogonal conditional moment problems, and iv) providing asymptotic normality results and hence valid inference.

We provide a comprehensive empirical comparison of ORF with several benchmarks, including three variants of GRF. We show that by setting the parameters according to what our theory suggests, ORF consistently outperforms all of the benchmarks. Moreover, we show that bootstrap based confidence intervals provide good finite sample coverage.

Finally, to motivate the usage of the ORF, we applied our technique to Dominick’s dataset, a popular historical dataset of store-level orange juice prices and sales provided by University of Chicago Booth School of Business. The dataset is comprised of a large number of covariates , but economics researchers might only be interested in learning the elasticity of demand as a function of a few variables such as income or education. We applied our method (see Appendix G for details) to estimate orange juice price elasticity as a function of income, and our results, depicted in Figure 1, unveil the natural phenomenon that lower income consumers are more price-sensitive.

\vscomment

1.1 Related Work

we need to add an extensive comparison to many relevant literatures.

2 Estimation via Local Orthogonal Moments

\vsedit

We study non-parametric estimation of models defined via conditional moment restrictions, in the presence of nuisance functions. Suppose we have a set of observations drawn independently from some underlying distribution over the observation domain . Each observation contains a feature vector .

\vsedit

Given a target feature , our goal is to estimate a parameter vector that is defined via a local moment condition, i.e. for all , is the unique solution with respect to of:

 E[ψ(Z;θ,h0(x,W))∣X=x]=0, (3)

where is a score function that maps an observation , parameter vector , and nuisance vector to a vector-valued score and is an unknown nuisance function that takes as input and a subvector of , and outputs a nuisance vector in . For any feature , parameter , and nuisance function , we define the moment function as:

 m(x;θ,h)=E[ψ(Z;θ,h(X,W))∣X=x] (4)

We assume that the dimensions are constants, while the dimension of can be growing with .

\vsedit

We will analyze the following two-stage estimation process.

1. [topsep=0pt,itemsep=-1ex,partopsep=1ex,parsep=1ex]

2. First stage. Compute a nuisance estimate for using data with some guarantee on the conditional root mean squared error:222Throughout the paper we denote with the euclidean norm and with the -norm.

 E(^h)=√E[∥^h(x,W)−h0(x,W)∥2∣X=x]
3. Second stage. Compute a set of similarity weights over the data that measure the similarity between their feature vectors and the target . \vsdelete333We obtain the similarity weights using random forests, but they can also be computed through other methods such as -NN. Compute the estimate using the nuisance estimate via the plug-in weighted moment condition:

 ^θ(x) solves: ∑ni=1aiψ(Zi;θ,^h(Xi,Wi))=0 (5)
\vsedit

In practice, our framework permits the use of any method to estimate the nuisance function in the first stage. However, since our description is a bit too abstract let us give a special case, which we will also need to assume for our normality result. Consider the case when the nuisance function takes the form , for some known function but unknown function (with potentially growing with ), i.e. locally around each the function is a parametric function of . Moreover, the parameter of the true nuisance function is identified as the minimizer of a local loss:

 ν0(x)=argminν∈VE[ℓ(Z;ν)∣X=x] (6)

Then we can estimate via a locally weighted and penalized empirical loss minimization algorithm. In particular in Section 5 we will consider the case of local -penalized estimation that we will refer to as forest lasso and which provides formal guarantees in the case where is sparse.

The key technical condition that allows us to reliably perform the two-stage estimation is the following local orthogonality condition, which can be viewed as a localized version of the Neyman orthogonality condtion (Neyman, 1979; Chernozhukov et al., 2017) around the neighborhood of the target feature . Intuitively, the condition says that the score function is insensitive to local perturbations in the nuisance parameters around their true values.

Definition 2.1 (Local Orthogonality).

Fix any estimator for the nuisance function. Then the Gateaux derivative with respect to , denoted , is defined as:

 E[∇hψ(Z,θ0(x),h0(x,W))(^h(x,W)−h0(x,W))∣x]

where denotes the gradient of with respect to the final arguments. We say that the moment conditions are locally orthogonal if for all : .

3 Orthogonal Random Forest

\vsedit

We describe our main algorithm orthogonal random forest (ORF) for calculating the similarity weights in the second stage of the two stage estimation. In the next section we will see that we will be using this algorithm for the estimation of the nuisance functions, so as to perform a local nuisance estimation. At a high level, ORF can be viewed as an orthogonalized version of GRF that is more robust to the nuisance estimation error. Similar to GRF, the algorithm runs a tree learner over random subsamples (without replacement) of size , to build trees such that each tree indexed by provides a tree-based weight for each observation in the input sample. \vseditThen the ORF weight for each sample is the average over the tree-weights .

The tree learner starts with a root node that contains the entire and recursively grows the tree to split into a set of leaves until the number of observations in each leaf is not too small. The set of neighboods defined by the leaves naturally gives a simlarity measure between each observation and the target . Following the same approach of (Tibshirani et al., 2018; Wager & Athey, 2015), we maintain the following tree properties in the process of building a tree.

Specification 1 (Forest Regularity).

The tree satisfies

• [topsep=0pt,itemsep=-1ex,partopsep=1ex,parsep=1ex]

• Honesty: we randomly partition the input sample into two subsets , , then uses to place splits in the tree, and uses for estimation.

• -balanced: each split leaves at least a fraction of the observations in on each side of the split for some parameter of .

• Minimum leaf size : there are between and observations from in each leaf of the tree.

• -random-split: at every step, marginalizing over the internal randomness of the learner, the probability that the next split occurs along the -th feature is at least for some , for all 444For example, this can be achieved by uniformly randomizing the splitting variable with probability or via a Poisson sampling scheme where a random subset of the variables of size is chosen to consider for candidate splits, with .

The key modification to GRF’s tree learner is our incorporation of orthogonal nuisance estimation in the splitting criterion. \vsdeleteIn order to accurately capture the heterogeneity in the target parameter in the presence of high-dimensional nuisance, we perform a two-stage estimation locally at each internal node to decide the next split. While the splitting criterion does not factor into our theoretical analysis (similar to (Tibshirani et al., 2018)), we find it to be an effective practical heuristic.

Splitting criterion with orthogonalization.

\vsedit

At each internal node we perform a two-stage estimation over , i.e. the set of examples in that reach node : 1) compute a nuisance estimate using only data (e.g. by estimating a parameter that minimizes and setting ), and then 2) form estimate using :555In our implementation we actually use a cross-fitting approach, where we use half of to compute a nuisance function to apply to the other half and vice versa.

 ^θP∈argminθ∈Θ∥∥∑i∈(P∩S1)ψ(Zi;θ,^hP(Wi))∥∥

We now generate a large random set of candidate axis-aligned splits (satisfying Specification 1 and we want to find the split into two children and such that if we perform the same two-stage estimation separately at each child, the new estimates and take on very different values, so that the heterogeneity of the two children nodes is maximized. Performing the two-stage estimation of and for all candidate splits is too computationally expensive. Instead, we will approximate these estimates by taking a Newton step from the parent node estimate : for any child node given by a candidate split, our proxy estimate is:

 ~θC=^θP−1|C∩S1|∑i∈Cj∩S1A−1Pψ(Zi;^θP,^hP(Xi,Wi))

where . \vseditWe select the candidate split that maximizes the following proxy heterogeneity score: for each coordinate let

 ~Δt(C1,C2)=∑2j=11|Cj∩S1|(∑i∈Cj∩S1ρt,i)2 (7)

where . We then create a single heterogeneity score per split as a convex combination that puts weight on the mean and on the maximum score across coordinates. is chosen uniformly at random in at each iteration of splitting. Hence, some splits focus on heterogeneity on average, while others focus on creating heterogeneity on individual coordinates.

ORF weights and estimator.

\vsedit

For each tree indexed based on subsample , let be the leaf that contains the target feature . We assign tree weight and ORF weight to each observation :

 aib=1[(Xi∈Lb(x))∧(Zi∈S2b)]|Lb(x)∩S2b|,ai=1B∑Bb=1aib

Wager & Athey (2015) show that under the structural specification of the trees, the tree weights are non-zero only around a small neighborhood of ; a property that we will leverage in our analysis.

Theorem 3.1 (Kernel shrinkage (Wager & Athey, 2015)).

Suppose the minimum leaf size parameter , the tree is -balanced and -random-split and the distribution of admits a density in that is bounded away from zero and infinity. Then the tree weights satisfy

 E[sup{∥x−xi∥:aib>0}]=O(s−12αd),

where

 α=log(ρ−1)πlog((1−ρ)−1)

and is size of the subsamples.

4 Convergence and Asymptotic Analysis

\vsedit

The ORF estimate is computed by solving the weighted moment condition in Equation 5, using the ORF weights as described in the previous section. We now provide theoretical guarantees for under the following assumption on the moment, score fuction and the data generating process.

Assumption 4.1.

The moment condition and the score function satisfy the following:

1. [topsep=0pt,itemsep=-1ex,partopsep=1ex,parsep=1ex]

2. Local Orthogonality. The moment condition satisfies local orthogonality.

3. Identifiability. The moments has a unique solution .

4. Smooth Signal. The moments are -Lipschitz in for any .

5. Curvature. The Jacobian has minimum eigenvalue bounded away from zero.

6. Smoothness of scores. For every and for all and , the eigenvalues of the expected Hessian are bounded above by a constant . For any , the score is -Lipschitz in for any and -Lipschitz in for any . The gradient of the score with respect to is -Lipschitz in .

7. Boundedness. The parameter set has constant diameter. There exists a bound such that for any observation , the first-stage nuisance estimate satisfies for any .

\vsedit
8. Full Support . The distribution of admits a density that is bounded away from zero and infinity.

All the results presented in the remainder of the paper will assume these conditions and we omit stating so in each of the theorems. Any extra conditions required for each theorem will be explicitly provided. Note that except for the local orthogonality condition, all of the assumptions are imposing standard boundedness and regularity conditions of the moments.

Theorem 4.2 (Lq-Error Bound).

Suppose that: . Then:

Theorem 4.3 (High Probability Error Bound).

Suppose that the score is the gradient of a convex loss and let denote the minimum eigenvalue of the jacobian . Moreover, suppose that the nuisance estimate satisfies that w.p. : . Then w.p. :

 ∥^θ−θ0∥=O⎛⎜ ⎜⎝s−12αd+√slog(nsδ)n+χ2n,δ⎞⎟ ⎟⎠σ−O(χn,δ) (8)

For asymptotic normality we will restrict our framework to the case of parametric nuisance functions, i.e. for some known function and to a particular type of nuisance estimators that recover the true parameter . Albeit we note that the parameter can be an arbitrary non-parametric function of and can also be high-dimensional. We will further assume that the moments also have a smooth co-variance structure in , i.e. if we let

 V=ψ(Z;θ0(x),g(W;ν0(x)))

then is Lipschitz in for any .

Theorem 4.4 (Asymptotic Normality).

Suppose that takes a locally parametric form , for some known function that is -Lipschitz in w.r.t. the norm for some and the nuisance estimate is of the form and satisfies:

 E[∥^ν(x)−ν0(x)∥4r]1/4≤χn,4=o((s/n)1/4)

Suppose that is chosen such that:

 s−1/(2αd)=o((s/n)1/2−ε),

for any , and . Moreover, is Lipschitz in for any . Then for any coefficient , with , assuming for any , there exists a sequence , such that:

 (9)

Given the result in Theorem 4.4, we can follow the same approach of Bootstrap of Little Bags by (Athey et al., 2017; Sexton & Laake, 2009) to build valid confidence intervals.

5 Nuisance Estimation: Forest Lasso

Next, we study the nuisance estimation problem in the first stage and provide a general nuisance estimation method that leverages locally sparse parameterization of the nuisance function, permitting low error rates even for high-dimensional problems. \vseditConsider the case when the nuisance function takes the form for some known functional form , for some known function but unknown function , with potentially growing with . Moreover, the parameter of the true nuisance function is identified as the minimizer of some local loss, as defined in Equation (6).

\vsedit

We consider the following estimation process: given a set of observations , we run the same tree learner in Section 3 over random subsamples (without replacement) to compute ORF weights for each observation over . Then we apply a local penalized -estimation:

 ^ν(x)=argminν∈V∑ni=1aiℓ(Zi;ν)+λ∥ν∥1 (10)
\vsedit

To provide formal guarantees for this method we will need to make the following assumptions.

Assumption 5.1 (Assumptions for nuisance estimation).

The target parameter and data distribution satisfy:

• [topsep=0pt,itemsep=-1ex,partopsep=1ex,parsep=1ex]

• For any , is -sparse with support .

• is a -Lipschitz in and the function is -Lipschitz in for any , with respect to the norm.

• The data distribution satisfies the conditional restricted eigenvalue condition: for all and for all , for some matrix that depends only on the data: , and for all and for all :

 (11)
\vsedit

Under Assumption 5.1 we show that the local penalized estimator achieves the following parameter recovery guarantee.

Theorem 5.2.

With probability :

 ∥^ν(x)−ν0(x)∥1≤2λkγ−32k√sln(dν/δ)/n

as long as .

Example 5.3 (Forest Lasso).

For locally sparse linear regression, and . This means, . Hence, the conditional restricted eigenvalue condition is simply a conditional covariance condition: .

Example 5.4 (Forest Logistic Lasso).

For locally sparse logistic regression, , and

 ℓ(Zi;ν)=yiln(L(⟨ν,Wi⟩))+(1−yi)ln(1−L(⟨ν,Wi⟩)),

where is the logistic function. In this case,

 ∇ννℓ(Zi;ν) =L(⟨ν,Wi⟩)(1−L(⟨ν,Wi⟩))WiW⊺i ⪰ρWiW⊺i=H(Zi)

assuming the index is bounded in some finite range. Hence, our conditional restricted eigenvalue condition is the same conditional covariance condition:

 ρE[WWT∣x]⪰ργI

6 Heterogeneous Treatment Effects

Now we apply ORF to the problem of estimating heterogeneous treatment effects. We will consider the following extension of the partially linear regression (PLR) model due to Robinson (1988). 666The standard PLR model (Robinson, 1988) considers solely the case of constant treatment effects, , and the goal is the estimation of the parameter . We have i.i.d. observations such that for each , represents the treatment applied that can be either real-valued (in ) or discrete (taking values in , where each denotes the standard basis in ), represents the outcome, represents potential confounding variables (controls), and is the feature vector that captures the heterogeneity. The set of parameters are related via the following equations:

 Y =⟨μ0(X,W),T⟩+f0(X,W)+ε, (12) T =g0(X,W)+η, (13)

where are bounded unobserved noises such that and .\swcommentadded the last In the main equation (12), represents the treatment effect function. Our goal is to estimate conditional average treatment effect (CATE) conditioned on target feature :

 θ0(x)=E[μ0(X,W)∣X=x]. (14)

The confounding equation (13) determines the relationship between treatments variable and the feature and confounder . To create an orthogonal moment for identifying , we follow the classical residualization approach similar to (Chernozhukov et al., 2017). First, observe that

 Y−E[Y∣X,W]=⟨μ0(X,W),T−E[T∣X,W]⟩+ε

Let us define the function , and consider the residuals

 ~Y =Y−q0(X,W) ~T =T−g0(X,W)=η

Then we can simplify the equation as . \sweditAs long as is independent of conditioned on (e.g. is independent of or does not depend on ), we also have

 E[μ0(X,W)∣X,η]=E[μ0(X,W)∣X]=θ(X).

Since , then

 E[~Y∣X,~T]=∗E[μ0(X,W)∣X]⋅~T=θ(X)⋅~T.

This relationship suggests that we can obtain an estimate of by regressing on locally around . We can thus define the orthogonalized score function: for any observation , any parameter , any estimates and for functions and , the score is:

 {Y−q(X,W)−θ(T−g(X,W)⟩)}(T−g(X,W)),

where . In the appendix, we show that this moment condition satisfies local orthogonality, and it identifies as long as as the noise is independent of conditioned on and the expected matrix is invertible. Even though the approach applies generically, to obtain formal guarantees on the nuisance estimates via our Forest Lasso, we will restrict their functional form.

Real-valued treatments.

Suppose and each coordinate of and are given by high-dimensional linear functions: , , where are -sparse vectors in . Consequently, can be written as a -sparse linear function over degree-2 polynomial features of . Then as long as and are Lipschitz in and the confounders satisfy

then we can use Forest Lasso to estimate both and . Hence, we can apply the ORF algorithm to get estimation error rates and asymptotic normality results for . (see Appendix B for formal statement).

Discrete treatments.

We now describe how our theory can be applied to discrete treatments. Suppose and each coordinate of are of the form: and , where is the logistic function. Note in this case is not independent of since

 Var(ηj)=gj0(X,W)(1−gj0(X,W)).

To maintain the conditional independence between and conditioned on , we focus on the setting where is only a function of , i.e. for all . In this setting we can estimate by running a forest logistic lasso for each treatment . Then we can estimate as follows: For each estimate the expected counter-factual outcome function: , by running a forest lasso between and only among the subset of samples that received treatment . Similarly, estimate by running a forest lasso between and only among the subset of samples that received treatment . Then observe that can be written as a function of , and . Thus we can combine these estimates to get an estimate of . Hence, we can obtain a guarantee similar to that of Corollary B.1 (see appendix).

Doubly robust moment for discrete treatments.

In the setting where also depends on and treatments are discrete, we can formulate an alternative orthogonal moment that identifies the CATE even when is correlated with . This moment is based on first constructing unbiased estimates of the counterfactual outcome for every observation and for any potential treatment , i.e. even for . The latter is done by invoking the doubly robust formula (Robins & Rotnitzky, 1995; Cassel et al., 1976; Kang et al., 2007):

 Y(t)=mt0(X,W)+(Y−mt0(X,W))1{T=t}gt0(X,W)

with the convention that

 g00(X,W) =1−∑t≠0gt0(X,W) m00(X,W) =f0(X,W)

Then we can identify the parameter using the moment: One can easily show that this moment satisfies the Neyman orthogonality condition with respect to the nuisance functions and (see appendix). In fact this property is essentially implied by the fact that the estimates satisfy the double robustness property, since double robustness is a stronger condition than orthogonality. We will again consider . Then using similar reasoning as in the previous paragraph, we see that with a combination of forest logistic lasso for and forest lasso for , we can estimate these nuisance functions at a sufficiently fast rate for our ORF estimator (based on this doubly robust moment) to be asymptotically normal, assuming they have locally sparse linear or logistic parameterizations.

7 Monte Carlo Experiments

We compare the empirical performance of ORF with other methods in the literature (and their variants).777The source code for running these experiments is available in the git repo Microsoft/EconML. The data generating process we consider is described by the following equations:

 Yi =θ0(xi)Ti+⟨Wi,γ0⟩+εi Ti =⟨Wi,β0⟩+ηi

Moreover, is drawn from the uniform distribution , is drawn from , and the noise terms . The -sparse vectors have coefficients drawn independently from . The dimension and we vary the support size over the range of . We examine a treatment function that is continuous and piecewise linear (detailed in Figure 3). In Appendix H we analyze other forms for . \vsdeletewe show that ORF also outperforms other methods with discontinuous that is piecewise constant or piecewise polynomial.

For each fixed treatment function, we repeat experiments, each of which consists of generating observations from the DGP, drawing the vectors and , and estimating at 100 test points over a grid in . We then calculate the bias, variance and root mean squared error (RMSE) of each estimate . Here we report summary statistics of the median and percentiles of these three quantities across test points, so as to evaluate the average performance of each method. We compare two variants of ORF with two variants of GRF (Athey et al., 2017) (see Appendix H for a third variant) and two extensions of double ML methods for heterogeneous treatment effect estimation (Chernozhukov et al., 2017).

ORF variants. (1) ORF: We implement ORF as described in Section 3, setting parameters under the guidance of our theoretical result: subsample size , Lasso regularization (for both tree learner and kernel estimation), number of trees , a max tree depth of , and a minimum leaf size of . (2) ORF with LassoCV (ORF-CV): we replaced the Lasso algorithm in ORF’s kernel estimation, with a cross-validated Lasso for the selection of the regularization parameter and . ORF-CV provides a more systematic optimization over the parameters.

GRF variants. (1) GRF-Res-Lasso: We perform a naive combination of double ML and GRF by first residualizing the treatments and outcomes on both the features and controls , then running GRF R package by (Tibshirani et al., 2018) on the residualized treatments , residualized outcomes , and features . A cross-validated Lasso is used for residualization. (2) GRF-Res-RF: We combine DoubleML and GRF as above, but we use cross-validated Random Forests for calculating residuals and .

Double ML with Polynomial Heterogeneity (DML-Poly). An extension of the classic Double ML procedure for heterogeneous treatment effects introduced in (Chernozhukov et al., 2017). This method accounts for heterogeneity by creating an expanded linear base of composite treatments (cross products between treatments and features). (1) Heterogeneous Double ML using LassoCV for first-stage estimation (HeteroDML-Lasso): In this version, we use Lasso with cross-validation for calculating residuals on in the first stage. (2) Heterogeneous Double ML using random forest for first-stage estimation (HeteroDML-RF): A more flexible version that uses random forests to perform residualization on treatments and outcomes. The latter performs better when treatments and outcomes have a non-linear relationship with the joint features of .

We generated data according to the Monte Carlo process above and set the parameters to samples, controls, features and support size and three types of treatment effect functions. In this section, we present the results for a piecewise linear treatment effect function.

In Figure 3, we inspect the goodness of fit for the chosen estimation methods across 100 Monte Carlo experiments. We note the limitations of two versions of the GRF-Res estimators, GRF-Res-Lasso and GRF-Res-RF, in capturing the treatment effect function well. The GRF-Res-RF estimations have a consistent bias as the Random Forest residualization cannot capture the dependency on the controls given their high-dimensionality. The HeteroDML methods are not flexible enough to capture the complexity of the treatment effect function. The best performers are the ORF-CV, ORF, and GRF-Res-Lasso, with the latter estimator having a larger bias and variance.

We analyze these estimators as we increase the support size of . Figures 2 illustrate the evaluation metrics across different support sizes. The ORF-CV performs very well, with consistent bias and RMSE across support sizes and treatment functions. The bias, variance and RMSE of the ORF grow with support size, but this growth is at a lower rate compared to the alternative estimators. The ORF-CV and ORF algorithms perform better than the GRF-Res methods on all metrics for this example. We observe this pattern for the other choices of support size, sample size and treatment effect function (see Appendix H). In Figure 4, we provide a snapshot of the bootstrap confidence interval coverage for this example.

Acknowledgements

ZSW is supported in part by a Google Faculty Research Award, a J.P. Morgan Faculty Award, a Mozilla research grant, and a Facebook Research Award. Part of this work was completed while ZSW was at Microsoft Research-New York City.

References

• Ai & Chen (2003) Ai, C. and Chen, X. Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica, 71(6):1795–1843, 2003.
• Athey et al. (2017) Athey, S., Tibshirani, J., and Wager, S. Generalized Random Forests. ArXiv e-prints, October 2017.
• Cassel et al. (1976) Cassel, C. M., Särndal, C. E., and Wretman, J. H. Some results on generalized difference estimation and generalized regression estimation for finite populations. Biometrika, 63(3):615–620, 1976.
• Chen & Pouzo (2009) Chen, X. and Pouzo, D. Efficient estimation of semiparametric conditional moment models with possibly nonsmooth residuals. Journal of Econometrics, 152(1):46 – 60, 2009. ISSN 0304-4076. Recent Adavances in Nonparametric and Semiparametric Econometrics: A Volume Honouring Peter M. Robinson.
• Chernozhukov et al. (2015) Chernozhukov, V., Newey, W. K., and Santos, A. Constrained Conditional Moment Restriction Models. arXiv e-prints, art. arXiv:1509.06311, September 2015.
• Chernozhukov et al. (2016) Chernozhukov, V., Escanciano, J. C., Ichimura, H., Newey, W. K., and Robins, J. M. Locally Robust Semiparametric Estimation. arXiv e-prints, art. arXiv:1608.00033, July 2016.
• Chernozhukov et al. (2017) Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., and Newey, W. Double/debiased/neyman machine learning of treatment effects. American Economic Review, 107(5):261–65, May 2017.
• Chernozhukov et al. (2017) Chernozhukov, V., Goldman, M., Semenova, V., and Taddy, M. Orthogonal Machine Learning for Demand Estimation: High Dimensional Causal Inference in Dynamic Panels. ArXiv e-prints, December 2017.
• Chernozhukov et al. (2018) Chernozhukov, V., Nekipelov, D., Semenova, V., and Syrgkanis, V. Plug-in Regularized Estimation of High-Dimensional Parameters in Nonlinear Semiparametric Models. arXiv e-prints, art. arXiv:1806.04823, June 2018.
• Hartford et al. (2016) Hartford, J., Lewis, G., Leyton-Brown, K., and Taddy, M. Counterfactual Prediction with Deep Instrumental Variables Networks. ArXiv e-prints, December 2016.
• Hastie et al. (2015) Hastie, T., Tibshirani, R., and Wainwright, M. Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman & Hall/CRC, 2015. ISBN 1498712169, 9781498712163.
• Hoeffding (1963) Hoeffding, W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963. ISSN 01621459.
• Imbens & Rubin (2015) Imbens, G. W. and Rubin, D. B. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press, 2015.
• Kang et al. (2007) Kang, J. D., Schafer, J. L., et al. Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical science, 22(4):523–539, 2007.
• Mentch & Hooker (2016) Mentch, L. and Hooker, G. Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. The Journal of Machine Learning Research, 17(1):841–881, 2016.
• Newey (1993) Newey, W. K. 16 efficient estimation of models with conditional moment restrictions. In Econometrics, volume 11 of Handbook of Statistics, pp. 419 – 454. Elsevier, 1993.
• Neyman (1979) Neyman, J. C() tests and their use. Sankhyā: The Indian Journal of Statistics, Series A (1961-2002), 41(1/2):1–21, 1979. ISSN 0581572X.
• Nie & Wager (2017) Nie, X. and Wager, S. Learning Objectives for Treatment Effect Estimation. ArXiv e-prints, December 2017.
• Peel et al. (2010) Peel, T., Anthoine, S., and Ralaivola, L. Empirical bernstein inequalities for u-statistics. In Lafferty, J. D., Williams, C. K. I., Shawe-Taylor, J., Zemel, R. S., and Culotta, A. (eds.), Advances in Neural Information Processing Systems 23, pp. 1903–1911. Curran Associates, Inc., 2010.
• Powers et al. (2017) Powers, S., Qian, J., Jung, K., Schuler, A., Shah, N. H., Hastie, T., and Tibshirani, R. Some methods for heterogeneous treatment effect estimation in high-dimensions. ArXiv e-prints, July 2017.
• Reiss & Wolak (2007) Reiss, P. C. and Wolak, F. A. Chapter 64 structural econometric modeling: Rationales and examples from industrial organization. volume 6 of Handbook of Econometrics, pp. 4277 – 4415. Elsevier, 2007.
• Robins & Rotnitzky (1995) Robins, J. M. and Rotnitzky, A. Semiparametric efficiency in multivariate regression models with missing data. Journal of the American Statistical Association, 90(429):122–129, 1995.
• Robinson (1988) Robinson, P. M. Root-n-consistent semiparametric regression. Econometrica, 56(4):931–954, 1988. ISSN 00129682, 14680262.
• Sexton & Laake (2009) Sexton, J. and Laake, P. Standard errors for bagged and random forest estimators. Computational Statistics and Data Analysis, 53(3):801 – 811, 2009. ISSN 0167-9473. Computational Statistics within Clinical Research.
• Swaminathan & Joachims (2015) Swaminathan, A. and Joachims, T. Counterfactual risk minimization: Learning from logged bandit feedback. CoRR, abs/1502.02362, 2015. URL http://arxiv.org/abs/1502.02362.
• Swaminathan et al. (2016) Swaminathan, A., Krishnamurthy, A., Agarwal, A., Dudík, M., Langford, J., Jose, D., and Zitouni, I. Off-policy evaluation for slate recommendation. CoRR, abs/1605.04812, 2016. URL http://arxiv.org/abs/1605.04812.
• Tibshirani et al. (2018) Tibshirani, J., Athey, S., Wager, S., Friedberg, R., Miner, L., and Wright, M. grf: Generalized Random Forests (Beta), 2018. R package version 0.10.2.
• Wager & Athey (2015) Wager, S. and Athey, S. Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. ArXiv e-prints, October 2015.
• Wang et al. (2017) Wang, Y.-X., Agarwal, A., and Dudík, M. Optimal and adaptive off-policy evaluation in contextual bandits. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 3589–3597, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

Appendix A Two-forest ORF v.s. One-forest ORF

A natural question about ORF is whether it is necessary to have two separate forests for the two-stage estimation. We investigate this question by implementing a variant of ORF without sample splitting (ORF-NS)—it builds only one random forest over the entire dataset, and perform the two-stage estimation using the same set of importance weights. We empirically compare ORF-CV with ORF-NS. In Figure 5, we note that the bias, variance and RMSE of the ORF-NS increase drastically with the subsample ratio , whereas the same metrics are almost constant for the ORF-CV. This phenomenon is consistent with the theory, since larger subsamples induce a higher probability of collision between independently drawn samples, and the “spill-over” can incur large bias and error.

Appendix B Formal Guarantee for Real-Valued Treatments

Corollary B.1 (Accuracy for real-valued treatments).

Suppose that and each coorindate are Lipschitz in and have norms bounded by for any . Assume that distribution of admits a density that is bounded away from zero and infinity. For any feature , the conditional covariance matrices satisfy , and , where denotes the degree-2 polynomial feature vector of . Then with probability , ORF returns an estimator such that

 ∥^θ−θ0∥≤O(n−12+2αd√log(ndν/δ))

as long as the sparsity and the subsampling rate of ORF . Moreover, for any with , there exists a sequence such that

 σ−1n⟨b,^θ−θ⟩→dN(0,1),

as long as the sparsity and the subsampling rate of ORF for any .

Appendix C Uniform Convergence of Lipschitz U-Processes

Lemma C.1 (Stochastic Equicontinuity for U-statistics via Bracketing).

Consider a parameter space that is a bounded subset of , with . Consider the -statistic over samples of order :

 Gs,nf(⋅;θ)=(ns)−1∑1≤i1≤…≤is≤nf(zi1,…,zis;θ) (15)

where is a known symmetric function in its first sarguments and -Lipschitz in . Suppose that and . Then w.p. :

 supθ∈Θ|Gs,nf(⋅;θ)−E[f(Z1:s;θ)]|=O⎛⎝η√s(log(n/s)+log(1/δ))n+(G+LR)s(log(n/s)+log(1/δ))n⎞⎠ (16)
Proof of Lemma c.1.

Note that for any fixed , is a U-statistic of order . Therefore by the Bernestein inequality for -statistics (see e.g. Theorem 2 of (Peel et al., 2010)), for any fixed , w.p.

 |Gs,nf(⋅;θ)−E[f(Z1:s;θ)]|≤η√2log(1/δ)n/s+G2log(1/δ)3(n/s)

Since , we can find a finite space of size , such that for any , there exists with . Moreover, since is -Lipschitz with respect to :

 |Gs,nf(⋅;θ)−E[f(Z1:s;θ)]|≤|Gs,nf(⋅;θε)−E[f(Z1:sθε)]|+2L∥θ−θε∥

Thus we have that:

 supθ∈Θ|Gs,nf(⋅;θ)−E[f(Z1:s;θ)]|≤supθ∈Θε|Gs,nf(⋅;θε)−E[f(Z1:sθε)]|+2Lε

Taking a union bound over