Estimation and Inference about Conditional Average Treatment Effect and Other Causal Functions
Abstract
Our framework can be viewed as inference on lowdimensional nonparametric functions in the presence of highdimensional nuisance function (where dimensionality refers to the number of covariates). Specifically, we consider the setting where we have a signal that is an unbiased predictor of causal/structural objects like treatment effect, structural derivative, outcome given a treatment, and others, conditional on a set of very high dimensional controls/ confounders . We are interested in simpler lowerdimensional nonparametric summaries of , namely conditional on a lowdimensional subset of covariates . The random variable , which we refer to as a signal, depends on a nuisance function of the highdimensional controls, which is unknown.
In the first stage, we need to learn the function using any machine learning method that is able to approximate accurately under very high dimensionality of . For example, under approximate sparsity with respect to a dictionary, penalized methods can be used; in others, tools such as deep neural networks can be used. To estimate we would use , but to make the subsequent inference valid, we need to carefully construct such that the signal is orthogonal to perturbations of , namely . This property allows us to use arbitrary, highquality machine learning tools to learn , because it eliminates the first order biases arising from biased/regularized estimation of necessarily occurring in such cases. As a result, the secondstage lowdimensional nonparametric inference enjoys the quasioracle properties, as if we knew .
In the second stage, we approximate the target function by a linear form , where is a vector of the approximating functions and is the Best Linear Predictor parameter, where the dimension of is increasing slower than the sample size. We develop a complete set of results about estimation and approximately Gaussian inference on and . If is sufficiently rich and admits a good approximation, then gets automatically targeted by the inference; otherwise, the best linear approximation to gets targeted. In some applications, can be simply specified as a collection of group indicators, in which case describes groupaverage treatment/structural effects (GATEs), providing a very practical summary of the heterogeneous effects.
,
class=MSC] \kwd[Primary ]62G08 \kwd[; secondary ]62G05 62G20 62G35
Highdimensional statistics, heterogeneous treatment effect, conditional average treatment effect, orthogonal estimation, machine learning, double robustness
1 Introduction and Motivation.
This paper gives a method to estimate and conduct inference about a nonparametric function that summarizes heterogeneous treatment/causal/structural effects conditionally on a small subset of (potentially very) highdimensional controls . We assume that this function can be represented as a conditional expectation function
(1.1) 
where the random variable , which we refer to as a signal, depends on a nuisance function of the controls. Examples of the nonparametric target function include the Group Average Treatment Effect, Continuous Treatment Effects, the Conditional Average Treatment Effect (CATE), regression function with Partially Missing Outcome, and the Conditional Average Partial Derivative (CAPD). Examples of the nuisance functions
include the propensity score (probability of treatment assignment), conditional density, and the regression function, among others. In summary,
Although there are multiple possible choices of signals for the target function , we focus on signals that possess the Neymantype orthogonality property (Neyman (1959)). Formally, we require the pathwise derivative of the conditional expectation to be zero conditionally on :
(1.2) 
If the signal is orthogonal, its plugin estimate is insensitive to the biased estimation of , resulting from application of modern adaptive learning methods in high dimensions, and delivers a highquality estimator of the target function under mild conditions.
We demonstrate the importance of the orthogonality property using an example of CATE. We define the CATE function as:
where are the potential outcomes corresponding to receipt(nonreceipt) of a binary treatment and is a covariate vector of interest. Consider an Inverse Probability Weighting (IPW, Horwitz and Thompson (1952)) type signal
where is the observed outcome, and the true value of the firststage nuisance parameter is the propensity score . The function is estimated by modern regularized methods, and the bias of its estimation error converges slowly. The standard choice of signal is not orthogonal to the firststage bias:
Consequently, the bias of the estimation error, , in the propensity score translates into the bias of the estimated signal . As a result, the estimate of the target function based on is lowquality.
To overcome the transmission of the firststage bias, we consider the signal of Robins and Rotnitzky (1995) type:
where is the regression function, and the true value of the firststage nuisance parameter . Our proposed choice of signal is orthogonal with respect to the firststage bias
Consequently, the bias of the estimation error, , in the propensity score does not translate into the bias of the estimated signal . As a result, the estimate of the target function based on is highquality.
In the second stage we consider a linear projection of an orthogonal signal on a vector of basis functions
The choice of basis functions depends on the problem and the desired interpretation of CATE. For one example, perhaps the simplest choice is to take be a partition of the support of into mutually exclusive groups. Setting
corresponds to that can be interpreted as Group Average Treatment Effect. For another example, let be a dimensional dictionary of series basis functions (e.g., polynomials or splines). Then, corresponds to the best linear approximation to the target function in the given dictionary. Under some smoothness conditions, as the dimension of the dictionary becomes large, will approximate at the given point .
1.1 Overview of Main Results.
The first main contribution of this paper is to provide sufficient conditions for pointwise and uniform asymptotic Gaussian approximation of the target function. We approximate the target function at a point by a linear form :
where is a vector of technical transformations of the covariates , is the misspecification error due to the linear approximation
(1.3) 
The twostage estimator of , which we refer to as Orthogonal Estimator, is constructed as follows. In the first stage we construct an estimate of the nuisance parameter , using any highquality machine learning estimator. In the second stage we construct an estimate of the signal as and run ordinary least squares of on the technical regressors . We use different samples for the estimation of in the first stage and the estimation of in the second stage in a form of crossfitting, described in the following definition.
Definition 1.1 (Crossfitting).

For a random sample of size , denote a fold random partition of the sample indices by , where is the number of partitions and sample size of each fold is . Also for each define .

For each , construct an estimator
^{2} of the nuisance parameter value using only the data from ; and for any observation , define the estimate .
Definition 1.2 (Orthogonal Estimator).
Given , define Orthogonal Estimator as:
(1.4) 
Under mild conditions on , Orthogonal Estimator delivers a highquality estimate of the pseudotarget function with the following properties:

W.p. , the mean squared error of is bounded by
where is the effect of the misspecification error .

The estimator of the pseudotarget function is asymptotically linear:
where the empirical process is approximated by a Gaussian process with coordinates, uniformly over , and the covariance matrix can be consistently estimated by a sample analog .

If the misspecification error is small, the pseudotarget function can be replaced by the target function :

The quantiles of the suprema of the approximating Gaussian process can be consistently approximated by bootstrap, using which valid uniform confidence bands for and can be constructed.
The results of this paper accommodate the estimation of by highdimensional/highly complex modern machine learning (ML) methods, such as random forests, neural networks, and shrinkage estimators, as well as estimation of by previously developed methods. The only requirement we impose on the estimation of is its mean square convergence to the true nuisance parameter at a highquality rate for some . This requirement is satisfied under structured assumptions on , such as approximate sparsity of with respect to some dictionary, wellapproximability of by trees or by sparse neural and deep neural nets.
Another important example, worth highlighting here, is that of Continuous Treatment Effects, studied in Kennedy et al. (2017), where
where is a continuous treatment and are the potential outcomes. We prove that the doubly robust signal of Kennedy et al. (2017)
is conditionally orthogonal with respect to the nuisance parameter consisting of the conditional treatment density , regression function , and marginal density :
The proposed estimator of the signal takes the form
where is estimated on and the sample average in the second summand is taken over data excluding observation . Having replaced by in Definition 1.2, we obtain an asymptotically linear estimator of the pseudotarget function:
(1.5) 
where empirical process is approximated by a Gaussian process with the same covariance function uniformly over .
To the best of our knowledge, the inferential approach via series approximation that we propose for this example and all other is novel and so are the inferential results.
1.2 Literature Review.
This paper builds on the three bodies of research within the semiparametric literature: orthogonal(debiased) machine learning, least squares series estimation, and treatment effects/missing data problems. Orthogonal machine learning (Chernozhukov et al. (2016a), Chernozhukov et al. (2016b)) concerns with the inference on a fixeddimensional target parameter in presence of a highdimensional nuisance function in a semiparametric moment problem. In case the moment condition is orthogonal to the perturbations of , the estimation of by ML methods has no firstorder effect on the asymptotic distribution of the target parameter . In particular, plugging in an estimate of obtained on a separate sample, results in a consistent asymptotically normal estimate whose asymptotic variance is the same as if was known. This result allows one to use highly complex machine learning methods to estimate the nuisance function , such as penalized methods in sparse models (Bühlmann,Peter and van der Geer,Sara (2011), Belloni et al. (2016)), boosting in sparse linear models (Luo,Ye and Spindler,Martin (2016)), and other methods for classes of neural nets, regression trees, and random forests. In many cases, the orthogonal moment conditions are also doubly robust (Robins and Rotnitzky (1995),Kennedy et al. (2017)) with respect to misspecification of one of the nuisance components.
The second building block of our paper is the literature on least squares series estimation (Newey (2007), Newey (2009), Belloni et al. (2015), Chen and Christensen (2015)), which establishes the pointwise and the uniform limit theory for least squares series estimation. We extend this theory by allowing the dependent variable of the series projection to depend upon an unknown nuisance parameter . We show that series properties continue to hold without any additional strong assumptions on the problem design.
Finally, we also contribute to the literature on missing data, estimation of the conditional average treatment effect and group average treatment effect (Robins and Rotnitzky (1995), Hahn (1998), Graham (2011), Graham et al. (2012), Hirano et al. (2003), Abrevaya and Lieli (2015), Athey and Imbens (2015), Kennedy et al. (2017), Grimmer et al. (2017)). It is also worth mentioning several references that appeared well after our paper was posted to ArXiv (1702.06240). They include Fan et al. (2019), Colangelo and Lee (2019), Jacob et al. (2019), Oprescu et al. (2019), Zimmert and Lechner (2019), and mainly analyze inference on (average, group, or local average) features of CATE(Fan et al. (2019)) and CTE (Colangelo and Lee (2019)). Our framework covers many more examples than features CATE (which was our original motivation), and uses series estimators instead of kernels for localization (e.g., in contrast to Fan et al. (2019), Colangelo and Lee (2019)).
2 Examples.
Examples below apply the proposed framework to study Continuous Treatment Effects, Conditional Average Treatment Effect, regression function with Partially Missing Outcome and Conditional Average Partial Derivative.
Here we introduce a new target function  average potential outcome for the case of continuous treatment  and derive an orthogonal signal for it.
Example 1 (Continuous Treatment Effects).
Let be a continuous treatment variable, be a vector of the controls, and stand for the potential outcomes corresponding to the subject’s response after receiving units of treatment. The observed data consist of the treatment , controls , and the treatment response . For a given value , the average potential outcome as defined in Gill and Robins (2001) and Kennedy et al. (2017) is
(2.1) 
Let us define the following nuisance functions. Define the conditional density of given as
and the conditional expectation of the outcome given as . Finally, let
be the marginal density of the vector .The orthogonal signal conditionally on is given by
(2.2) 
where the nuisance parameter is a vectorvalued function:
In particular, (2.2) coincides with the doubly robust signal of Kennedy et al. (2017). Corollary 5.1 shows that Equation (2.2) is orthogonal with respect to the nuisance parameter . Theorem 5.1 gives pointwise and uniform asymptotic theory for . This theorem presents a novel asymptotic linearity representation of that accounts for the estimation error of the integral .
Example 2 (Conditional Average Treatment Effect).
Let and be the potential outcomes, corresponding to the response of a subject with and without receiving a treatment, respectively. Let indicate the subject’s presence in the treatment group. The object of interest is the Conditional Average Treatment Effect
Since an individual cannot be treated and nontreated at the same time, only the actual outcome , but not the treatment effect , is observed.
A standard way to make progress in this problem is to assume unconfoundedness (Rosenbaum and Rubin (1983)). Suppose there exists an observable control vector such that the treatment status is independent of the potential outcomes conditionally on , and .
Assumption 2.1 (Unconfoundedness).
The treatment status is independent of the potential outcomes conditionally on : .
Define the conditional probability of treatment receipt as Consider an orthogonal signal of Robins and Rotnitzky (1995) type:
Example 3 (Regression Function with Partially Missing Outcome).
Suppose a researcher is interested in the conditional expectation given a covariate vector of a variable
that is partially missing. Let indicate whether the outcome is observed, and be the observed outcome. Since the researcher does not control the presence status , a standard way to make progress is to assume existence of an observable control vector such that .
Assumption 2.2 (Missingnesss at Random).
The presence indicator is independent from the outcome conditionally on : .
Corollary 5.3 shows that the signal , defined as
(2.4) 
is orthogonal to the estimation error of the nuisance parameter . Here, the function is the conditional expectation function of the observed outcome given and .
Example 4 (Experiment with Partially Missing Outcome).
Let and be the potential outcomes, corresponding to the response of a subject with and without receiving a treatment, respectively. Suppose a researcher is interested in the Conditional Average Treatment Effect
Conditionally on a vector of stratifying variables , he randomly assigns the treatment status to measure the outcome . In presence of the partially missing outcome, let indicate the presence of the outcome record in the data. Since the presence indicator may be codetermined with the covariates , estimating the treatment effect function on the observed outcomes only without accounting for the missingnesss may lead to an inconsistent estimate of .
Since the researcher does not control presence status , a standard way to make progress is to assume Missingnesss at Random, namely existence of an observable control vector such that . Setting to be a full vector of observables (in particular, ) makes Missingness at Random (Assumption 2.2) the least restrictive. Define the conditional probability of presence
and the treatment propensity score
An orthogonal signal for the CATE can be obtained as follows:
(2.5) 
where is the conditional expectation function of given .
Example 5 (Conditional Average Partial Derivative).
Let
be a conditional expectation function of an outcome given a set of variables . Suppose a researcher is interested in the conditional average derivative of with respect to given , denoted by
An orthogonal signal of Newey and Stoker (1993) type takes the form
(2.6) 
where is the conditional density of conditionally on . The nuisance parameter consists of the conditional expectation function and the conditional density . Corollary 5.4 shows that the is orthogonal to the estimation error of the nuisance parameter .
3 Empirical Application
To show the immediate usefulness of the method, we consider an important problem of inference about structural derivatives. We apply our methods to study the household demand for gasoline, a question studied in Hausman and Newey (1995), Schmalensee and Stoker (1999), Yatchew and No (2001) and Blundell et al. (2012). These papers estimated the demand function and the average price elasticity for various demographic groups. The dependence of the price elasticity on the household income was highlighted in Blundell et al. (2012), who have estimated the elasticity by low, middle, and highincome groups and found its relationship with income to be nonmonotonic. To gain more insight into this question, we estimate the average price elasticity as a function of income and provide simultaneous confidence bands for it.
The data for our analysis are the same as in Yatchew and No (2001), coming from National Private Vehicle Use Survey, conducted by Statistics Canada between October 1994 and September 1996. The data set is based on fuel purchase diaries and contains detailed information about fuel prices, fuel consumption patterns, vehicles and demographic characteristics. We employ the same selection procedure as in Yatchew and No (2001) and Belloni et al. (2011), focusing on a sample of the households with nonzero licensed drivers, vehicles, and distance driven which leaves us with 5001 observations.
The object of interest is the average predicted percentage change in the demand due to a unit percentage change in the price, holding the observed demographic characteristics fixed, conditional on income. In context of Example 5, this corresponds to the conditional average derivative
where is the logarithm of gas consumption, is the logarithm of price per liter, is log income, and are the observed subject characteristics such as household size and composition, distance driven, and the type of fuel usage. The orthogonal signal for the target function is given by
(3.1) 
where is the conditional density of the price variable given income and subject characteristics . The conditional density and the conditional expectation functions comprise the set of the nuisance parameters to be estimated in the first stage.
The choice of the estimators in the first and the second stages is as follows. To estimate the conditional expectation function and its partial derivative , we consider a linear model that includes price, price squared, income, income squared, their interactions with 28 time, geographical, and household composition dummies. All in all, we have 91 explanatory variables. We estimate using Lasso with the penalty level chosen as in Belloni et al. (2014), and estimate the derivative using the estimated coefficients of . To estimate the conditional density , we consider a model:
where is the conditional expectation of price variable given income variable and covariates , and is an independent continuously distributed shock with univariate density . Under this assumption, the log density equals to
We estimate by an adaptive kernel density estimator of Portnoy and Koenker (1989) with Silverman choice of bandwidth. Finally, we plug in the estimates of , , into the Equation 3.1 to get an estimate of and estimate by least squares series regression of on . We try both polynomial basis function and Bsplines to construct technical regressors.
Figures 1 and 3 report the estimate of the target function (the black line), the pointwise (the dashed blue lines) and the uniform confidence (the solid blue lines) bands for the average price elasticity conditional on income, where the significance level . The panels of Figure 1 correspond to different choices of the firststage estimates of the nuisance functions and and dictionaries of technical regressors. The panels of Figure 3 correspond to the subsamples of large and small households and to different choices of the dictionaries.
The summary of our empirical findings based on Figure 1 and 3 is as follows. We find the elasticity to be in the range and significant for majority of income levels. The estimates based on splines (Figures 0(c), 0(d)) are monotonically increasing in income, which is intuitive. The estimates based on polynomial functions are nonmonotonic in income. For every algorithm on Figure 1 we cannot reject the null hypothesis of constant price elasticity for all income levels: for each estimation procedure, the uniform confidence bands contain the constant function. Figure 3 shows the average price elasticity conditional on income for small and large households.
To demonstrate the relevance of demographic data in the first stage estimation, we have shown the average predicted effect of the price change on the gasoline consumption (in logs), without accounting for the covariates in the first stage. In particular, this effect equals to , where is the conditional expectation of gas consumption given income and price. This predictive effect consists two effects: the effect of price change on the consumption holding the demographic covariates fixed, which we refer to as average price elasticity, and the association of the price change with the change in the household characteristics that also affect the consumption themselves. Figure 2 shows this predictive effect, approximated by the polynomials of degree , conditional on income. By contrast to the results in Figure 1, the slope of the polynomial of degree has a negative relationship between income and price elasticity, which present evidence that the demographics confound the relationship between income and price elasticity.
4 Main Theoretical Results.
Notation. For two sequences of random variables means . For two sequences of numbers , means . Let . The norm of a vector is denoted by , the norm is denoted by , the is denoted by , and the  norm denotes the number of nonzero components of a vector. Given a vector and a set of indices , we denote by the vector in for which and . Let . For a matrix , let be the maximal eigenvalue of . For a random variable , let . Let .
The random sample is a sequence of independent copies of a random element taking values in a measurable space according to a probability law . We shall use empirical process notation. For a generic function and a generic sample , denote a sample average by
and a scaled, demeaned sample average by
All asymptotic statements below are with respect to . Let
For an observation index that belongs to a fold , define , where is estimated on as in Definition 1.1. Define .
4.1 Asymptotic Theory for Least Squares Series
Assumption 4.1 (Identification).
Let denote population covariance matrix of technical regressors. Assume that that do not depend on s.t. .
Assumption 4.2 (Growth Condition).
We assume that the norm of the technical regressors grows sufficiently slow:
Assumption 4.3 (Misspecification Error).
There exists a sequence of finite constants such that the norms of the misspecification error are controlled as follows:
Assumption 4.3 introduces the rate of decay of the misspecification error. Specifically, the sequence of constants bounds the mean squared misspecification error. In addition, the sequence bounds the worstcase misspecification error uniformly over the domain of , where is the modulus of continuity of the worstcase error with respect to mean squared error.
Define the sampling error as follows:
Assumption 4.4 (Sampling Error).
The second moment of the sampling error conditionally on is bounded from above by :
To describe the firststage rate requirement, Assumption 4.5 introduces a sequence of nuisance realization sets for the nuisance parameter . As sample size increases, the sets shrink around the true value . The shrinkage speed is described in terms of the statistical rates and .
Assumption 4.5 (Small Bias Condition).
There exists a sequence , such that with probability at least , the first stage estimate , obtained by crossfitting (Definition 1.1), belongs to a shrinking neighborhood of , denoted by . Uniformly over , the following mean square convergence holds:
(4.1)  
(4.2) 
In particular, can be bounded as
Assumption 4.5 is stated in a highlevel form in order to accommodate various machine learning estimators. We demonstrate the plausibility of Assumption 4.5 for a highdimensional sparse model for Example 3 (see Example 6).