Estimation and Inference about Conditional Average Treatment Effect and Other Causal Functions


Abstract

Our framework can be viewed as inference on low-dimensional nonparametric functions in the presence of a high-dimensional nuisance function (where dimensionality refers to the number of covariates). Specifically, we consider a setting with a signal $Y$ that is an unbiased predictor of a causal/structural object, such as a treatment effect, a structural derivative, or an outcome given a treatment, conditional on a set of very high-dimensional controls/confounders $Z$. We are interested in simpler, lower-dimensional nonparametric summaries of this object, namely the conditional expectation $g(x) = \mathrm{E}[\,Y \mid X = x\,]$, where $X$ is a low-dimensional subset of the covariates $Z$. The random variable $Y$, which we refer to as a signal, depends on a nuisance function $\eta_0$ of the high-dimensional controls, which is unknown.

In the first stage, we need to learn the function $\eta_0$ using any machine learning method that is able to approximate $\eta_0$ accurately under very high dimensionality of $Z$. For example, under approximate sparsity with respect to a dictionary, $\ell_1$-penalized methods can be used; in other settings, tools such as deep neural networks can be used. To estimate $g(x)$ we would use the estimated signal $Y(\hat\eta)$, but to make the subsequent inference valid, we need to carefully construct the signal $Y(\eta)$ so that it is orthogonal to perturbations of $\eta$ around $\eta_0$, namely so that the pathwise derivative of its conditional expectation with respect to $\eta$ vanishes at $\eta_0$. This property allows us to use arbitrary, high-quality machine learning tools to learn $\eta_0$, because it eliminates the first-order biases arising from the biased/regularized estimation of $\eta_0$ that necessarily occurs in such cases. As a result, the second-stage low-dimensional nonparametric inference enjoys quasi-oracle properties, as if we knew $\eta_0$.

In the second stage, we approximate the target function $g(x)$ by a linear form $p(x)'\beta_0$, where $p(x)$ is a vector of approximating basis functions and $\beta_0$ is the Best Linear Predictor parameter, with the dimension of $p(x)$ growing more slowly than the sample size. We develop a complete set of results about estimation and approximately Gaussian inference on $\beta_0$ and $p(x)'\beta_0$. If $p(x)$ is sufficiently rich and admits a good approximation to $g(x)$, then $g(x)$ gets automatically targeted by the inference; otherwise, the best linear approximation to $g(x)$ gets targeted. In some applications, $p(x)$ can simply be specified as a collection of group indicators, in which case $p(x)'\beta_0$ describes group average treatment/structural effects (GATEs), providing a very practical summary of the heterogeneous effects.

MSC subject classifications: Primary 62G08; secondary 62G05, 62G20, 62G35.

High-dimensional statistics, heterogeneous treatment effect, conditional average treatment effect, orthogonal estimation, machine learning, double robustness

1 Introduction and Motivation.

This paper gives a method to estimate and conduct inference about a nonparametric function $g(x)$ that summarizes heterogeneous treatment/causal/structural effects conditional on a small subset $X$ of the (potentially very) high-dimensional controls $Z$. We assume that this function can be represented as a conditional expectation function

$$g(x) = \mathrm{E}[\,Y \mid X = x\,], \qquad (1.1)$$

where the random variable $Y$, which we refer to as a signal, depends on a nuisance function $\eta_0$ of the controls $Z$. Examples of the nonparametric target function $g(x)$ include the Group Average Treatment Effect, Continuous Treatment Effects, the Conditional Average Treatment Effect (CATE), the regression function with Partially Missing Outcome, and the Conditional Average Partial Derivative (CAPD). Examples of the nuisance functions $\eta_0$

include the propensity score (probability of treatment assignment), conditional densities, and regression functions, among others. In summary, the target function and the signal are linked by the representation $g(x) = \mathrm{E}[\,Y(\eta_0) \mid X = x\,]$.

Although there are multiple possible choices of signals for the target function $g(x)$, we focus on signals that possess the Neyman-type orthogonality property (Neyman (1959)). Formally, we require the pathwise derivative of the conditional expectation of the signal with respect to the nuisance parameter to be zero conditionally on $X$:

$$\partial_r\, \mathrm{E}\big[\,Y\big(\eta_0 + r(\eta - \eta_0)\big) \mid X\,\big]\Big|_{r=0} = 0. \qquad (1.2)$$

If the signal is orthogonal, its plug-in estimate $Y(\hat\eta)$ is insensitive to the biased estimation of $\eta_0$ resulting from the application of modern adaptive learning methods in high dimensions, and it delivers a high-quality estimator of the target function under mild conditions.

We demonstrate the importance of the orthogonality property using the example of the CATE. We define the CATE function as
$$g(x) = \mathrm{E}[\,Y(1) - Y(0) \mid X = x\,],$$
where $Y(1)$ and $Y(0)$ are the potential outcomes corresponding to receipt (non-receipt) of a binary treatment $D \in \{0,1\}$ and $X$ is a covariate vector of interest. Consider an Inverse Probability Weighting (IPW, Horvitz and Thompson (1952)) type signal
$$Y^{\mathrm{IPW}}(\eta) = \frac{D\,Y}{s(Z)} - \frac{(1-D)\,Y}{1 - s(Z)},$$

where $Y = D\,Y(1) + (1-D)\,Y(0)$ is the observed outcome, $Z$ is the vector of controls, and the true value of the first-stage nuisance parameter is the propensity score $s_0(z) = \Pr(D = 1 \mid Z = z)$. The propensity score is estimated by modern regularized methods, and the bias of its estimation error converges slowly. This standard choice of signal is not orthogonal to the first-stage bias: the pathwise derivative of its conditional expectation with respect to the propensity score does not vanish at $s_0$.

Consequently, the bias of the estimation error $\hat s - s_0$ in the propensity score translates into a bias of the estimated signal $Y^{\mathrm{IPW}}(\hat\eta)$. As a result, the estimate of the target function based on $Y^{\mathrm{IPW}}(\hat\eta)$ is of low quality.

To overcome the transmission of the first-stage bias, we consider a signal of the Robins and Rotnitzky (1995) type:
$$Y^{\mathrm{DR}}(\eta) = \mu(1,Z) - \mu(0,Z) + \frac{D\,\big(Y - \mu(1,Z)\big)}{s(Z)} - \frac{(1-D)\,\big(Y - \mu(0,Z)\big)}{1 - s(Z)},$$
where $\mu_0(d,z) = \mathrm{E}[\,Y \mid D = d, Z = z\,]$ is the regression function, and the true value of the first-stage nuisance parameter is $\eta_0 = (s_0, \mu_0)$. Our proposed choice of signal is orthogonal with respect to the first-stage bias: the pathwise derivative of $\mathrm{E}[\,Y^{\mathrm{DR}}(\eta) \mid X\,]$ with respect to $\eta$ vanishes at $\eta_0$.

Consequently, the bias of the estimation error $\hat\eta - \eta_0$ in the first-stage nuisance parameters does not translate into a first-order bias of the estimated signal $Y^{\mathrm{DR}}(\hat\eta)$. As a result, the estimate of the target function based on $Y^{\mathrm{DR}}(\hat\eta)$ is of high quality.
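To fix ideas, the short Python sketch below computes both signals from first-stage fitted values. It is an illustration only: the inputs `s_hat`, `mu1_hat`, and `mu0_hat` stand in for cross-fitted propensity score and outcome regression estimates produced by any machine learning method, and are not part of the formal development.

```python
def ipw_signal(y, d, s_hat):
    """Non-orthogonal IPW signal: sensitive to first-stage bias in s_hat."""
    return d * y / s_hat - (1 - d) * y / (1 - s_hat)

def dr_signal(y, d, s_hat, mu1_hat, mu0_hat):
    """Orthogonal (Robins-Rotnitzky type) signal: regression adjustment plus
    inverse-probability-weighted residuals; first-stage errors enter only
    through second-order product terms."""
    return (mu1_hat - mu0_hat
            + d * (y - mu1_hat) / s_hat
            - (1 - d) * (y - mu0_hat) / (1 - s_hat))
```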

In the second stage we consider a linear projection of an orthogonal signal on a vector of basis functions $p(X)$:
$$\beta_0 = \arg\min_{\beta}\; \mathrm{E}\big[\,Y(\eta_0) - p(X)'\beta\,\big]^2.$$
The choice of basis functions depends on the problem and on the desired interpretation of the CATE. Perhaps the simplest choice is to take a partition of the support of $X$ into mutually exclusive groups $G_1, \dots, G_k$. Setting
$$p(x) = \big(1\{x \in G_1\}, \dots, 1\{x \in G_k\}\big)'$$
yields a $p(x)'\beta_0$ that can be interpreted as the Group Average Treatment Effect. For another example, let $p(x)$ be a $k$-dimensional dictionary of series basis functions (e.g., polynomials or splines). Then $p(x)'\beta_0$ corresponds to the best linear approximation to the target function in the given dictionary. Under some smoothness conditions, as the dimension of the dictionary becomes large, $p(x)'\beta_0$ approximates $g(x)$ at any given point $x$.
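As an illustration, the sketch below builds the two dictionaries just described for a scalar covariate; the cut points and the polynomial degree are arbitrary choices made for the example, not prescriptions of the paper.

```python
import numpy as np

def group_indicator_basis(x, cutpoints):
    """One indicator column per mutually exclusive group of the support of x;
    regressing the signal on this basis yields group average treatment effects."""
    groups = np.digitize(x, cutpoints)          # group label 0, ..., len(cutpoints)
    return (groups[:, None] == np.arange(len(cutpoints) + 1)).astype(float)

def polynomial_basis(x, degree):
    """Polynomial series dictionary 1, x, ..., x**degree."""
    return np.vander(x, N=degree + 1, increasing=True)

# example: income split into three groups, or approximated by a cubic
x = np.random.uniform(0, 10, size=200)
P_groups = group_indicator_basis(x, cutpoints=[3.0, 7.0])   # shape (200, 3)
P_poly = polynomial_basis(x, degree=3)                       # shape (200, 4)
```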

1.1 Overview of Main Results.

The first main contribution of this paper is to provide sufficient conditions for pointwise and uniform asymptotic Gaussian approximation of the target function. We approximate the target function $g(x)$ at a point $x$ by a linear form $p(x)'\beta_0$:
$$g(x) = p(x)'\beta_0 + r(x),$$
where $p(x)$ is a $k$-vector of technical transformations of the covariates $x$, $r(x)$ is the misspecification error due to the linear approximation, and $\beta_0$ is the Best Linear Predictor, defined by the normal equation:
$$\mathrm{E}\big[\,p(X)\,\big(Y(\eta_0) - p(X)'\beta_0\big)\,\big] = 0. \qquad (1.3)$$

The two-stage estimator of $\beta_0$, which we refer to as the Orthogonal Estimator, is constructed as follows. In the first stage we construct an estimate $\hat\eta$ of the nuisance parameter $\eta_0$, using any high-quality machine learning estimator. In the second stage we construct an estimate of the signal as $\hat Y_i = Y_i(\hat\eta)$ and run ordinary least squares of $\hat Y_i$ on the technical regressors $p(X_i)$. We use different samples for the estimation of $\eta_0$ in the first stage and the estimation of $\beta_0$ in the second stage, in the form of cross-fitting described in the following definition.

Definition 1.1 (Cross-fitting).
  1. For a random sample of size $N$, denote a $K$-fold random partition of the sample indices $\{1,\dots,N\}$ by $(J_k)_{k=1}^{K}$, where $K$ is the number of partitions and the sample size of each fold is $n = N/K$. Also, for each $k \in \{1,\dots,K\}$, define $J_k^c = \{1,\dots,N\} \setminus J_k$.

  2. For each $k$, construct an estimator $\hat\eta_k$ of the nuisance parameter value $\eta_0$ using only the data from $J_k^c$; and for any observation $i \in J_k$, define the estimated signal $\hat Y_i = Y_i(\hat\eta_k)$.

Definition 1.2 (Orthogonal Estimator).

Given the cross-fitted signal estimates $\hat Y_i$, $i = 1,\dots,N$, from Definition 1.1, define the Orthogonal Estimator as the least squares estimator of $\hat Y_i$ on $p(X_i)$:
$$\hat\beta = \Big(\sum_{i=1}^{N} p(X_i)\,p(X_i)'\Big)^{-1} \sum_{i=1}^{N} p(X_i)\,\hat Y_i. \qquad (1.4)$$
A schematic implementation is sketched below.
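The following Python sketch implements Definitions 1.1 and 1.2 for the CATE example under simplifying assumptions: scikit-learn learners (logistic regression for the propensity score, cross-validated Lasso for the outcome regressions) stand in for whatever first-stage machine learning method is actually used, and the number of folds is arbitrary.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression, LassoCV

def orthogonal_estimator(y, d, z, p_x, n_folds=5):
    """Cross-fitted orthogonal estimator of the BLP coefficient beta_0.

    y   : (n,)   observed outcome
    d   : (n,)   binary treatment
    z   : (n, p) high-dimensional controls
    p_x : (n, k) basis functions of the low-dimensional covariates X
    """
    n = len(y)
    signal = np.empty(n)
    for train, test in KFold(n_folds, shuffle=True, random_state=0).split(z):
        # first stage on the auxiliary fold: propensity score and outcome regressions
        s_hat = LogisticRegression(max_iter=1000).fit(z[train], d[train]) \
                    .predict_proba(z[test])[:, 1]
        mu1 = LassoCV(cv=3).fit(z[train][d[train] == 1], y[train][d[train] == 1]) \
                    .predict(z[test])
        mu0 = LassoCV(cv=3).fit(z[train][d[train] == 0], y[train][d[train] == 0]) \
                    .predict(z[test])
        s_hat = np.clip(s_hat, 0.01, 0.99)      # guard against extreme propensities
        # orthogonal (doubly robust) signal on the held-out fold
        signal[test] = (mu1 - mu0
                        + d[test] * (y[test] - mu1) / s_hat
                        - (1 - d[test]) * (y[test] - mu0) / (1 - s_hat))
    # second stage: OLS of the estimated signal on the basis p(X)
    beta_hat, *_ = np.linalg.lstsq(p_x, signal, rcond=None)
    return beta_hat, signal
```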

Under mild conditions on the design, the Orthogonal Estimator delivers a high-quality estimate $p(x)'\hat\beta$ of the pseudo-target function $p(x)'\beta_0$ with the following properties:

  • With probability approaching one, the mean squared error of $\hat\beta$ is bounded by the sum of the standard least squares series sampling rate and a term capturing the effect of the misspecification error $r(x)$.

  • The estimator of the pseudo-target function is asymptotically linear: uniformly over $x$, $\sqrt{N}\,\big(p(x)'\hat\beta - p(x)'\beta_0\big)$ is approximated by a linear term in an empirical process, which in turn is approximated by a Gaussian process, and the covariance matrix can be consistently estimated by a sample analog.

  • If the misspecification error $r(x)$ is small, the pseudo-target function $p(x)'\beta_0$ can be replaced by the target function $g(x)$ in the statements above.

  • The quantiles of the suprema of the approximating Gaussian process can be consistently approximated by the bootstrap, from which valid uniform confidence bands for $p(x)'\beta_0$ and $g(x)$ can be constructed (a sketch is given below).
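The sketch below illustrates one way to construct such bands with a weighted (multiplier) bootstrap, reusing the basis matrix and cross-fitted signal from the previous sketch. The mean-zero exponential weights, the evaluation grid, and the studentization are choices of the illustration rather than the paper's exact algorithm.

```python
import numpy as np

def uniform_band(p_x, signal, beta_hat, grid_basis, n_boot=500, alpha=0.05, seed=0):
    """Weighted-bootstrap sup-t critical value and uniform band for p(x)'beta."""
    rng = np.random.default_rng(seed)
    n, k = p_x.shape
    Q_inv = np.linalg.inv(p_x.T @ p_x / n)
    resid = signal - p_x @ beta_hat
    # sandwich covariance of beta_hat and pointwise standard errors on the grid
    Omega = Q_inv @ ((p_x * resid[:, None] ** 2).T @ p_x / n) @ Q_inv
    se = np.sqrt(np.einsum('gk,kl,gl->g', grid_basis, Omega, grid_basis) / n)
    sup_stats = np.empty(n_boot)
    for b in range(n_boot):
        w = rng.exponential(size=n) - 1.0        # mean-zero bootstrap weights
        beta_b = Q_inv @ (p_x * (w * resid)[:, None]).mean(axis=0)
        sup_stats[b] = np.max(np.abs(grid_basis @ beta_b) / se)
    crit = np.quantile(sup_stats, 1 - alpha)
    fitted = grid_basis @ beta_hat
    return fitted - crit * se, fitted + crit * se
```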

The results of this paper accommodate the estimation of $\eta_0$ by high-dimensional/highly complex modern machine learning (ML) methods, such as random forests, neural networks, and $\ell_1$-shrinkage estimators, as well as estimation of $\eta_0$ by previously developed classical methods. The only requirement we impose on the estimation of $\eta_0$ is its mean square convergence to the true nuisance parameter at a sufficiently fast rate, as specified in Assumption 4.5. This requirement is satisfied under structured assumptions on $\eta_0$, such as approximate sparsity with respect to some dictionary, or well-approximability of $\eta_0$ by trees or by sparse shallow or deep neural nets.

Another important example, worth highlighting here, is that of Continuous Treatment Effects, studied in Kennedy et al. (2017), where the target is the average potential outcome
$$g(d) = \mathrm{E}[\,Y(d)\,],$$
where $D$ is a continuous treatment and $\{Y(d)\}_d$ are the potential outcomes. We prove that the doubly robust signal of Kennedy et al. (2017),
$$Y(\eta) = \frac{Y - \mu(D,Z)}{f(D \mid Z)} \int f(D \mid z)\, f_Z(z)\, dz + \int \mu(D, z)\, f_Z(z)\, dz,$$
is conditionally orthogonal with respect to the nuisance parameter consisting of the conditional treatment density $f(d \mid z)$, the regression function $\mu(d,z) = \mathrm{E}[\,Y \mid D = d, Z = z\,]$, and the marginal density $f_Z(z)$.

The proposed estimator of the signal replaces the nuisance functions by cross-fitted estimates and the integrals over the distribution of $Z$ by sample counterparts, where $\hat\eta$ is estimated on the auxiliary fold and the sample average in the second summand is taken over data excluding observation $i$. Having replaced $Y_i(\eta_0)$ by this estimated signal in Definition 1.2, we obtain an asymptotically linear estimator of the pseudo-target function:

(1.5)

where the empirical process is approximated by a Gaussian process with the same covariance function, uniformly over the points of interest.
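A minimal sketch of one way to compute this signal is given below, assuming callable nuisance estimates `mu_hat(d, z)` and `f_hat(d, z)` fitted on an auxiliary fold, and approximating both integrals over the distribution of $Z$ by leave-one-out sample averages; the exact averaging scheme used in the paper may differ.

```python
import numpy as np

def continuous_treatment_signal(y, d, z, mu_hat, f_hat):
    """Doubly robust pseudo-outcome of Kennedy et al. (2017) type, with the
    integrals over the law of Z replaced by leave-one-out sample averages.

    mu_hat(d, z): estimated regression function E[Y | D=d, Z=z]
    f_hat(d, z) : estimated conditional density of D given Z=z
    """
    n = len(y)
    signal = np.empty(n)
    for i in range(n):
        others = np.delete(np.arange(n), i)
        avg_f = np.mean([f_hat(d[i], z[j]) for j in others])
        avg_mu = np.mean([mu_hat(d[i], z[j]) for j in others])
        signal[i] = (y[i] - mu_hat(d[i], z[i])) / f_hat(d[i], z[i]) * avg_f + avg_mu
    return signal
```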

To the best of our knowledge, the inferential approach via series approximation that we propose for this example and all others is novel, and so are the resulting inferential results.

1.2 Literature Review.

This paper builds on three bodies of research within the semiparametric literature: orthogonal (debiased) machine learning, least squares series estimation, and treatment effects/missing data problems. Orthogonal machine learning (Chernozhukov et al. (2016a), Chernozhukov et al. (2016b)) concerns inference on a fixed-dimensional target parameter in the presence of a high-dimensional nuisance function in a semiparametric moment problem. When the moment condition is orthogonal to perturbations of the nuisance function, its estimation by ML methods has no first-order effect on the asymptotic distribution of the target parameter. In particular, plugging in an estimate of the nuisance function obtained on a separate sample results in a $\sqrt{N}$-consistent, asymptotically normal estimate whose asymptotic variance is the same as if the nuisance function were known. This result allows one to use highly complex machine learning methods to estimate the nuisance function, such as penalized methods in sparse models (Bühlmann and van de Geer (2011), Belloni et al. (2016)), boosting in sparse linear models (Luo and Spindler (2016)), and other methods for classes of neural nets, regression trees, and random forests. In many cases, the orthogonal moment conditions are also doubly robust (Robins and Rotnitzky (1995), Kennedy et al. (2017)) with respect to misspecification of one of the nuisance components.

The second building block of our paper is the literature on least squares series estimation (Newey (2007), Newey (2009), Belloni et al. (2015), Chen and Christensen (2015)), which establishes the pointwise and the uniform limit theory for least squares series estimators. We extend this theory by allowing the dependent variable of the series projection to depend upon an unknown nuisance parameter that must be estimated in a first stage. We show that the series properties continue to hold without any additional strong assumptions on the problem design.

Finally, we also contribute to the literature on missing data and on estimation of the conditional average treatment effect and the group average treatment effect (Robins and Rotnitzky (1995), Hahn (1998), Graham (2011), Graham et al. (2012), Hirano et al. (2003), Abrevaya and Lieli (2015), Athey and Imbens (2015), Kennedy et al. (2017), Grimmer et al. (2017)). It is also worth mentioning several references that appeared well after our paper was posted to arXiv (1702.06240). They include Fan et al. (2019), Colangelo and Lee (2019), Jacob et al. (2019), Oprescu et al. (2019), and Zimmert and Lechner (2019), and they mainly analyze inference on (average, group, or local average) features of the CATE (Fan et al. (2019)) and continuous treatment effects (Colangelo and Lee (2019)). Our framework covers many more examples than features of the CATE (which was our original motivation), and uses series estimators instead of kernels for localization (e.g., in contrast to Fan et al. (2019) and Colangelo and Lee (2019)).

2 Examples.

The examples below apply the proposed framework to study Continuous Treatment Effects, the Conditional Average Treatment Effect, the regression function with a Partially Missing Outcome, and the Conditional Average Partial Derivative.

Here we introduce a new target function, the average potential outcome in the case of a continuous treatment, and derive an orthogonal signal for it.

Example 1 (Continuous Treatment Effects).

Let $D$ be a continuous treatment variable, $Z$ be a vector of controls, and $\{Y(d)\}_d$ stand for the potential outcomes corresponding to the subject's response after receiving $d$ units of treatment. The observed data consist of the treatment $D$, the controls $Z$, and the treatment response $Y = Y(D)$. For a given value $d$, the average potential outcome, as defined in Gill and Robins (2001) and Kennedy et al. (2017), is

$$g(d) = \mathrm{E}[\,Y(d)\,]. \qquad (2.1)$$

Let us define the following nuisance functions. Define the conditional density of the treatment $D$ given $Z$ as $f(d \mid z)$, and the conditional expectation of the outcome given $(D, Z)$ as $\mu(d,z) = \mathrm{E}[\,Y \mid D = d, Z = z\,]$. Finally, let $f_Z(z)$ be the marginal density of the vector $Z$. The orthogonal signal is given by
$$Y(\eta) = \frac{Y - \mu(D,Z)}{f(D \mid Z)} \int f(D \mid z)\, f_Z(z)\, dz + \int \mu(D, z)\, f_Z(z)\, dz, \qquad (2.2)$$
where the nuisance parameter is the vector-valued function $\eta_0 = \big(f(\cdot \mid \cdot),\, \mu(\cdot, \cdot),\, f_Z(\cdot)\big)$. In particular, (2.2) coincides with the doubly robust signal of Kennedy et al. (2017). Corollary 5.1 shows that Equation (2.2) is orthogonal with respect to the nuisance parameter $\eta_0$. Theorem 5.1 gives pointwise and uniform asymptotic theory for the resulting estimator. This theorem presents a novel asymptotic linearity representation that accounts for the estimation error of the integral $\int \mu(D, z)\, f_Z(z)\, dz$.

Example 2 (Conditional Average Treatment Effect).

Let $Y(1)$ and $Y(0)$ be the potential outcomes corresponding to the response of a subject with and without receiving a treatment, respectively. Let $D \in \{0,1\}$ indicate the subject's presence in the treatment group. The object of interest is the Conditional Average Treatment Effect
$$g(x) = \mathrm{E}[\,Y(1) - Y(0) \mid X = x\,].$$
Since an individual cannot be treated and non-treated at the same time, only the actual outcome $Y = D\,Y(1) + (1-D)\,Y(0)$, but not the treatment effect $Y(1) - Y(0)$, is observed.

A standard way to make progress in this problem is to assume unconfoundedness (Rosenbaum and Rubin (1983)): there exists an observable control vector $Z$ such that the treatment status is independent of the potential outcomes conditionally on $Z$, and $X$ is a subvector of $Z$.

Assumption 2.1 (Unconfoundedness).

The treatment status $D$ is independent of the potential outcomes $(Y(0), Y(1))$ conditionally on $Z$: $D \perp (Y(0), Y(1)) \mid Z$.

Define the conditional probability of treatment receipt (the propensity score) as $s_0(z) = \Pr(D = 1 \mid Z = z)$. Consider an orthogonal signal of the Robins and Rotnitzky (1995) type:
$$Y(\eta) = \mu(1,Z) - \mu(0,Z) + \frac{D\,\big(Y - \mu(1,Z)\big)}{s(Z)} - \frac{(1-D)\,\big(Y - \mu(0,Z)\big)}{1 - s(Z)}, \qquad (2.3)$$
where $\mu_0(d,z) = \mathrm{E}[\,Y \mid D = d, Z = z\,]$ is the conditional expectation function of $Y$ given $(D, Z)$. Corollary 5.2 shows that (2.3) is orthogonal to the estimation error of the nuisance parameter $\eta_0 = (s_0, \mu_0)$.

Example 3 (Regression Function with Partially Missing Outcome).

Suppose a researcher is interested in the conditional expectation, given a covariate vector $X$, of an outcome $Y^{*}$,
$$g(x) = \mathrm{E}[\,Y^{*} \mid X = x\,],$$
that is partially missing. Let $D \in \{0,1\}$ indicate whether the outcome is observed, and let $Y = D\,Y^{*}$ be the observed outcome. Since the researcher does not control the presence status $D$, a standard way to make progress is to assume the existence of an observable control vector $Z$ such that the outcome is missing at random conditionally on $Z$.

Assumption 2.2 (Missingness at Random).

The presence indicator $D$ is independent of the outcome $Y^{*}$ conditionally on $Z$: $D \perp Y^{*} \mid Z$.

Corollary 5.3 shows that the signal $Y(\eta)$, defined as
$$Y(\eta) = \mu(Z) + \frac{D\,\big(Y - \mu(Z)\big)}{s(Z)}, \qquad (2.4)$$
is orthogonal to the estimation error of the nuisance parameter $\eta_0 = (s_0, \mu_0)$, where $s_0(z) = \Pr(D = 1 \mid Z = z)$ is the conditional probability that the outcome is observed. Here, the function $\mu_0(z) = \mathrm{E}[\,Y \mid D = 1, Z = z\,]$ is the conditional expectation function of the observed outcome given $D = 1$ and $Z$.
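As a small illustration, the signal in (2.4) can be computed from fitted values as follows; `s_hat` and `mu_hat` denote cross-fitted estimates of the observation probability and the regression function, and are assumptions of the sketch rather than objects defined in the paper.

```python
def missing_outcome_signal(y, observed, s_hat, mu_hat):
    """Orthogonal signal (2.4): regression imputation plus an
    inverse-probability-weighted residual for the observed records."""
    return mu_hat + observed * (y - mu_hat) / s_hat
```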

Example 4 (Experiment with Partially Missing Outcome).

Let $Y(1)$ and $Y(0)$ be the potential outcomes corresponding to the response of a subject with and without receiving a treatment, respectively. Suppose a researcher is interested in the Conditional Average Treatment Effect
$$g(x) = \mathrm{E}[\,Y(1) - Y(0) \mid X = x\,].$$
Conditionally on a vector of stratifying variables, the researcher randomly assigns the treatment status $D$ and measures the outcome $Y(D)$. In the presence of a partially missing outcome, let $V \in \{0,1\}$ indicate the presence of the outcome record in the data. Since the presence indicator $V$ may be co-determined with the covariates, estimating the treatment effect function on the observed outcomes only, without accounting for the missingness, may lead to an inconsistent estimate of $g(x)$.

Since the researcher does not control the presence status $V$, a standard way to make progress is to assume Missingness at Random, namely the existence of an observable control vector $Z$ such that $V$ is independent of the potential outcomes conditionally on the treatment and the controls $(D, Z)$. Setting $Z$ to be the full vector of observables makes Missingness at Random (Assumption 2.2) the least restrictive. Define the conditional probability of presence
$$h_0(d, z) = \Pr(V = 1 \mid D = d, Z = z)$$
and the treatment propensity score
$$s_0(z) = \Pr(D = 1 \mid Z = z).$$
An orthogonal signal for the CATE can be obtained as follows:
$$Y(\eta) = \mu(1,Z) - \mu(0,Z) + \frac{D\,V\,\big(Y - \mu(1,Z)\big)}{s(Z)\,h(1,Z)} - \frac{(1-D)\,V\,\big(Y - \mu(0,Z)\big)}{\big(1 - s(Z)\big)\,h(0,Z)}, \qquad (2.5)$$
where $\mu_0(d,z) = \mathrm{E}[\,Y \mid D = d, V = 1, Z = z\,]$ is the conditional expectation function of the observed outcome given the treatment, presence, and controls.

Example 5 (Conditional Average Partial Derivative).

Let
$$\mu_0(d, z) = \mathrm{E}[\,Y \mid D = d, Z = z\,]$$
be the conditional expectation function of an outcome $Y$ given a set of variables $(D, Z)$. Suppose a researcher is interested in the conditional average derivative of $\mu_0$ with respect to $D$ given $X$, denoted by
$$g(x) = \mathrm{E}\big[\,\partial_d\, \mu_0(D, Z) \mid X = x\,\big].$$
An orthogonal signal of the Newey and Stoker (1993) type takes the form
$$Y(\eta) = \partial_d\, \mu(D, Z) - \partial_d \log f(D \mid Z)\,\big(Y - \mu(D, Z)\big), \qquad (2.6)$$
where $f(d \mid z)$ is the conditional density of $D$ conditionally on $Z$. The nuisance parameter consists of the conditional expectation function $\mu_0$ and the conditional density $f$. Corollary 5.4 shows that the signal (2.6) is orthogonal to the estimation error of the nuisance parameter $\eta_0 = (\mu_0, f)$.
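For illustration, the signal (2.6) can be assembled from fitted values of the regression function, its partial derivative, and the conditional density score; all inputs below are assumed to be cross-fitted first-stage estimates evaluated at the observations, and are placeholders of the sketch.

```python
def capd_signal(y, dmu_hat, mu_hat, dlogf_hat):
    """Orthogonal signal (2.6): derivative of the regression function minus the
    conditional-density score times the regression residual."""
    return dmu_hat - dlogf_hat * (y - mu_hat)
```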

3 Empirical Application

To show the immediate usefulness of the method, we consider the important problem of inference about structural derivatives. We apply our methods to study the household demand for gasoline, a question studied in Hausman and Newey (1995), Schmalensee and Stoker (1999), Yatchew and No (2001), and Blundell et al. (2012). These papers estimated the demand function and the average price elasticity for various demographic groups. The dependence of the price elasticity on household income was highlighted in Blundell et al. (2012), who estimated the elasticity for low-, middle-, and high-income groups and found its relationship with income to be non-monotonic. To gain more insight into this question, we estimate the average price elasticity as a function of income and provide simultaneous confidence bands for it.

The data for our analysis are the same as in Yatchew and No (2001), coming from the National Private Vehicle Use Survey conducted by Statistics Canada between October 1994 and September 1996. The data set is based on fuel purchase diaries and contains detailed information about fuel prices, fuel consumption patterns, vehicles, and demographic characteristics. We employ the same selection procedure as in Yatchew and No (2001) and Belloni et al. (2011), focusing on a sample of households with a non-zero number of licensed drivers, vehicles, and distance driven, which leaves us with 5001 observations.

The object of interest is the average predicted percentage change in demand due to a unit percentage change in the price, holding the observed demographic characteristics fixed, conditional on income. In the context of Example 5, this corresponds to the conditional average derivative
$$g(x) = \mathrm{E}\big[\,\partial_p\, \mu_0(P, X, Z) \mid X = x\,\big],$$
where $Y$ is the logarithm of gas consumption, $P$ is the logarithm of the price per liter, $X$ is log income, $Z$ are the observed subject characteristics such as household size and composition, distance driven, and the type of fuel usage, and $\mu_0(p, x, z) = \mathrm{E}[\,Y \mid P = p, X = x, Z = z\,]$. The orthogonal signal for the target function is given by
$$Y(\eta) = \partial_p\, \mu(P, X, Z) - \partial_p \log f(P \mid X, Z)\,\big(Y - \mu(P, X, Z)\big), \qquad (3.1)$$
where $f(p \mid x, z)$ is the conditional density of the price variable given income and subject characteristics. The conditional density $f$ and the conditional expectation function $\mu_0$ comprise the set of nuisance parameters to be estimated in the first stage.

The choice of estimators in the first and the second stages is as follows. To estimate the conditional expectation function $\mu_0$ and its partial derivative $\partial_p \mu_0$, we consider a linear model that includes price, price squared, income, income squared, and their interactions with 28 time, geographical, and household composition dummies. All in all, we have 91 explanatory variables. We estimate $\mu_0$ using Lasso with the penalty level chosen as in Belloni et al. (2014), and estimate the derivative $\partial_p \mu_0$ using the estimated coefficients of $\mu_0$. To estimate the conditional density $f(p \mid x, z)$, we consider the model
$$P = h_0(X, Z) + \epsilon,$$
where $h_0(x, z)$ is the conditional expectation of the price variable given the income variable and covariates, and $\epsilon$ is an independent, continuously distributed shock with univariate density $f_{\epsilon}$. Under this assumption, the log density equals
$$\log f(P \mid X, Z) = \log f_{\epsilon}\big(P - h_0(X, Z)\big).$$
We estimate $f_{\epsilon}$ by the adaptive kernel density estimator of Portnoy and Koenker (1989) with the Silverman choice of bandwidth. Finally, we plug the estimates of $\mu_0$, $\partial_p \mu_0$, and $f$ into Equation (3.1) to get an estimate of the signal and estimate the target function by a least squares series regression of the estimated signal on $p(X)$. We try both polynomial basis functions and B-splines to construct the technical regressors.
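The three-step procedure can be sketched in Python as follows, with simple stand-ins for the actual choices: scikit-learn's LassoCV replaces the Belloni et al. (2014) penalty rule, a Gaussian kernel density estimate replaces the adaptive estimator of Portnoy and Koenker (1989), the design matrices are schematic, and cross-fitting is omitted for brevity. The signal is the CAPD signal of Example 5.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from scipy.stats import gaussian_kde

def elasticity_pipeline(logq, logp, logy, controls, p_basis):
    """Three-step estimate of the BLP of price elasticity conditional on income.

    logq, logp, logy : (n,) log consumption, log price, log income
    controls         : (n, m) demographic controls
    p_basis          : (n, k) series basis in log income for the second stage
    """
    # Step 1: Lasso regression of log consumption; the price derivative comes
    # from the fitted coefficients of the price terms.
    design = np.column_stack([logp, logp ** 2, logy, logy ** 2, controls])
    fit = LassoCV(cv=5).fit(design, logq)
    mu_hat = fit.predict(design)
    dmu_hat = fit.coef_[0] + 2 * fit.coef_[1] * logp        # d mu / d logp

    # Step 2: location model logp = h(logy, controls) + eps; score of the residual density.
    h_design = np.column_stack([logy, logy ** 2, controls])
    h_fit = LassoCV(cv=5).fit(h_design, logp)
    eps = logp - h_fit.predict(h_design)
    kde = gaussian_kde(eps)
    step = 1e-3                                              # numerical derivative of log density
    dlogf_hat = (np.log(kde(eps + step)) - np.log(kde(eps - step))) / (2 * step)

    # Step 3: orthogonal signal and least squares series regression on p(income).
    signal = dmu_hat - dlogf_hat * (logq - mu_hat)
    beta_hat, *_ = np.linalg.lstsq(p_basis, signal, rcond=None)
    return beta_hat
```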

Figures 1 and 3 report the estimate of the target function (the black line) and the pointwise (dashed blue lines) and uniform (solid blue lines) confidence bands for the average price elasticity conditional on income. The panels of Figure 1 correspond to different choices of the first-stage estimates of the nuisance functions and of the dictionaries of technical regressors. The panels of Figure 3 correspond to the subsamples of large and small households and to different choices of the dictionaries.

The summary of our empirical findings based on Figures 1 and 3 is as follows. We find the elasticity to be significant for the majority of income levels. The estimates based on B-splines (Figures 1(c) and 1(d)) are monotonically increasing in income, which is intuitive. The estimates based on polynomial functions are non-monotonic in income. For every algorithm in Figure 1 we cannot reject the null hypothesis of constant price elasticity across income levels: for each estimation procedure, the uniform confidence bands contain a constant function. Figure 3 shows the average price elasticity conditional on income for small and large households. For the majority of income levels, we find large households to be more price elastic than small ones, but the difference is not significant at any income level.

To demonstrate the relevance of the demographic data in the first-stage estimation, we have also computed the average predicted effect of a price change on gasoline consumption (in logs) without accounting for the covariates in the first stage. In particular, this effect equals $\mathrm{E}\big[\,\partial_p\, m_0(P, X) \mid X = x\,\big]$, where $m_0(p, x)$ is the conditional expectation of gas consumption given income and price only. This predictive effect consists of two parts: the effect of a price change on consumption holding the demographic covariates fixed, which we refer to as the average price elasticity, and the association of the price change with changes in the household characteristics that themselves affect consumption. Figure 2 shows this predictive effect, approximated by polynomials, conditional on income. In contrast to the results in Figure 1, the estimated relationship between income and price elasticity has a negative slope, which presents evidence that the demographics confound the relationship between income and price elasticity.

(a) Step 2: $h_0$ is estimated by Lasso. Step 3: polynomial basis.
(b) Step 2: $h_0$ is estimated by random forest. Step 3: polynomial basis.
(c) Step 2: $h_0$ is estimated by Lasso. Step 3: B-splines with one knot.
(d) Step 2: $h_0$ is estimated by random forest. Step 3: B-splines with one knot.
Figure 1: Confidence bands for the best linear approximation of the average price elasticity conditional on income, accounting for the demographic controls in the first stage. The black line is the estimated function; the dashed (solid) blue lines are the pointwise (uniform) confidence bands. The estimation algorithm has three steps: (1) first-stage estimation of the conditional expectation function $\mu_0$, (2) second-stage estimation of the conditional density $f(p \mid x, z)$, and (3) third-stage estimation of the target function by least squares series. Step 1 is performed using Lasso with standardized covariates and a penalty level based on an estimate of the residual variance, as described in the text. Step 2 is performed by estimating the regression function $h_0$ of the price on income and covariates and estimating the density of the residual by the adaptive kernel density estimator of Portnoy and Koenker (1989) with the Silverman choice of bandwidth; the regression function $h_0$ is estimated by Lasso (panels (a), (c)) and by random forest (panels (b), (d)). Step 3 is performed using B-splines with one knot (panels (c), (d)) and polynomial functions (panels (a), (b)). Uniform bands are based on weighted bootstrap repetitions.
(a) Polynomial degree
(b) Polynomial degree
Figure 2: Confidence bands for the best linear approximation of the average price elasticity conditional on income, without accounting for the demographic controls in the first stage. The black line is the estimated function; the dashed and solid blue lines are the pointwise and uniform confidence bands, respectively. The estimation algorithm has three steps: (1) first-stage estimation of the conditional expectation function, (2) second-stage estimation of the conditional density, and (3) third-stage estimation of the target function by least squares series. Step 1 is performed by a least squares series regression using polynomial functions whose degree is chosen by cross-validation. Step 2 is performed by a kernel density estimator with the Silverman choice of bandwidth. Step 3 is performed using polynomial functions and is shown for two degree choices. Uniform bands are based on weighted bootstrap repetitions.
(a) Large Households, polynomial basis.
(b) Small Households, polynomial basis.
(c) Large Households, B-splines with one knot.
(d) Small Households, B-splines with one knot.
Figure 3: Confidence bands for the best linear approximation of the average price elasticity conditional on income, accounting for the demographic controls in the first stage, by household size. The black line is the estimated function; the dashed (solid) blue lines are the pointwise (uniform) confidence bands. The estimation algorithm has three steps: (1) first-stage estimation of the conditional expectation function $\mu_0$, (2) second-stage estimation of the conditional density $f(p \mid x, z)$, and (3) third-stage estimation of the target function by least squares series. Step 1 is performed using Lasso with standardized covariates and a penalty level based on an estimate of the residual variance, as described in the text. Step 2 is performed by estimating the regression function $h_0$ of the price on income and covariates and estimating the density of the residual by the adaptive kernel density estimator of Portnoy and Koenker (1989) with the Silverman choice of bandwidth; the regression function $h_0$ is estimated by Lasso. Step 3 is performed using B-splines with one knot (panels (c), (d)) and non-orthogonal polynomial functions (panels (a), (b)). Uniform bands are based on weighted bootstrap repetitions.

4 Main Theoretical Results.

Notation. For two sequences of random variables $a_N$ and $b_N$, $a_N \lesssim_P b_N$ means $a_N = O_P(b_N)$. For two sequences of numbers $a_N$ and $b_N$, $a_N \lesssim b_N$ means $a_N = O(b_N)$. The $\ell_2$ norm of a vector is denoted by $\|\cdot\|$, the $\ell_1$ norm is denoted by $\|\cdot\|_1$, the $\ell_\infty$ norm is denoted by $\|\cdot\|_\infty$, and the $\ell_0$-norm $\|\cdot\|_0$ denotes the number of non-zero components of a vector. Given a vector $\delta \in \mathbb{R}^p$ and a set of indices $T \subseteq \{1,\dots,p\}$, we denote by $\delta_T \in \mathbb{R}^p$ the vector for which $\delta_{T,j} = \delta_j$ if $j \in T$ and $\delta_{T,j} = 0$ if $j \notin T$. For a symmetric matrix $A$, let $\lambda_{\max}(A)$ denote the maximal eigenvalue of $A$.

The random sample $(W_i)_{i=1}^{N}$ is a sequence of independent copies of a random element $W$ taking values in a measurable space according to a probability law $P$. We shall use empirical process notation. For a generic function $f$ and a generic sample $(W_i)_{i=1}^{N}$, denote the sample average by
$$\mathbb{E}_N[f] = \frac{1}{N}\sum_{i=1}^{N} f(W_i),$$
and the $\sqrt{N}$-scaled, demeaned sample average by
$$\mathbb{G}_N[f] = \frac{1}{\sqrt{N}}\sum_{i=1}^{N} \big(f(W_i) - \mathrm{E}\,f(W_i)\big).$$
All asymptotic statements below are with respect to $N \to \infty$. For an observation index $i$ that belongs to a fold $J_k$, define $\hat Y_i = Y_i(\hat\eta_k)$, where $\hat\eta_k$ is estimated on $J_k^c$ as in Definition 1.1.

4.1 Asymptotic Theory for Least Squares Series

Assumption 4.1 (Identification).

Let $Q = \mathrm{E}\,[\,p(X)\,p(X)'\,]$ denote the population covariance matrix of the technical regressors. Assume that there exist constants that do not depend on $N$ such that the eigenvalues of $Q$ are bounded from above and away from zero by these constants.

Assumption 4.2 (Growth Condition).

We assume that the sup-norm of the technical regressors, $\xi_k := \sup_{x} \|p(x)\|$, grows sufficiently slowly relative to the sample size.

Assumption 4.3 (Misspecification Error).

There exist sequences of finite constants $r_k$ and $\ell_k r_k$ such that the norms of the misspecification error $r(x) = g(x) - p(x)'\beta_0$ are controlled as follows:
$$\big(\mathrm{E}\,[\,r(X)^2\,]\big)^{1/2} \lesssim r_k, \qquad \sup_{x}\,|r(x)| \lesssim \ell_k\, r_k.$$

Assumption 4.3 introduces the rate of decay of the misspecification error. Specifically, the sequence of constants $r_k$ bounds the mean squared misspecification error. In addition, the sequence $\ell_k r_k$ bounds the worst-case misspecification error uniformly over the domain of $X$, where $\ell_k$ is the modulus of continuity of the worst-case error with respect to the mean squared error.

Define the sampling error as
$$\epsilon := Y(\eta_0) - \mathrm{E}\,[\,Y(\eta_0) \mid X\,] = Y(\eta_0) - g(X).$$

Assumption 4.4 (Sampling Error).

The second moment of the sampling error conditionally on $X$ is bounded from above by a constant $\bar\sigma^2$:
$$\mathrm{E}\,[\,\epsilon^2 \mid X\,] \leq \bar\sigma^2 \quad \text{almost surely.}$$

To describe the first-stage rate requirement, Assumption 4.5 introduces a sequence of nuisance realization sets $\mathcal{G}_N$ for the nuisance parameter $\eta$. As the sample size increases, the sets shrink around the true value $\eta_0$. The shrinkage speed is described in terms of the statistical rates appearing in (4.1) and (4.2).

Assumption 4.5 (Small Bias Condition).

There exists a sequence $\Delta_N = o(1)$ such that, with probability at least $1 - \Delta_N$, the first-stage estimate $\hat\eta$, obtained by cross-fitting (Definition 1.1), belongs to a shrinking neighborhood of $\eta_0$, denoted by $\mathcal{G}_N$. Uniformly over $\eta \in \mathcal{G}_N$, the following mean square convergence holds:

(4.1)
(4.2)

In particular, the first-stage bias entering the second stage can be bounded in terms of these rates.

Assumption 4.5 is stated in a high-level form in order to accommodate various machine learning estimators. We demonstrate the plausibility of Assumption 4.5 for a high-dimensional sparse model for Example 3 (see Example 6).

Pointwise Limit Theory

Theorem 4.1 (Pointwise Limit Theory of Orthogonal Estimator).

Let Assumptions 4.1, 4.2, 4.3, 4.4, 4.5 hold. Then, the following statements hold:

  1. The Euclidean norm of the estimation error $\hat\beta - \beta_0$ is bounded as follows:

    which implies a bound on the mean squared error of the estimate $p(x)'\hat\beta$ of the pseudo-target function $p(x)'\beta_0$:

  2. For any point $x$, the estimator $p(x)'\hat\beta$ is approximately linear: