De-biased Machine Learning for Compliers

Rahul Singh
MIT Economics
rahul.singh@mit.edu

Liyang Sun
MIT Economics
lsun20@mit.edu

Equal contribution

September 9, 2019
Abstract

Instrumental variable identification is a concept in causal statistics for estimating the counterfactual effect of treatment $D$ on outcome $Y$ controlling for covariates $X$ using observational data. Even when measurements of $(Y, D)$ are confounded, the treatment effect on the subpopulation of compliers can nonetheless be identified if an instrumental variable $Z$ is available, which is independent of $Y$ conditional on $(D, X)$ and the unmeasured confounder. We introduce a de-biased machine learning (DML) approach to estimating complier parameters with high-dimensional data. Complier parameters include the local average treatment effect, average complier characteristics, and complier counterfactual outcome distributions. In our approach, the de-biasing is itself performed by machine learning, a variant called de-biased machine learning via regularized Riesz representers (DML-RRR). We prove our estimator is consistent, asymptotically normal, and semi-parametrically efficient. In experiments, our estimator outperforms state-of-the-art alternatives. We use it to estimate the effect of 401(k) participation on the distribution of net financial assets.

1 Introduction

Instrumental variable (IV) identification is a concept in causal statistics for estimating the counterfactual effect of treatment $D$ on outcome $Y$ controlling for covariates $X$ using observational data [stock2003retrospectives]. Even when measurements of $(Y, D)$ are confounded, the treatment effect can nonetheless be identified if an instrumental variable $Z$ is available, which is independent of $Y$ conditional on $(D, X)$ and the unmeasured confounder. Intuitively, $Z$ only influences $Y$ via $D$, identifying the counterfactual relationship of interest.

This solution comes at a price; the analyst can no longer measure parameters of the entire population, such as the average treatment effect (ATE). Measuring population parameters requires a stronger assumption called selection on observables: conditional on covariates $X$, the relationship between treatment $D$ and outcome $Y$ is as good as random. Instead, the analyst can only measure parameters defined for the subpopulation of compliers, such as the local average treatment effect (LATE). A complier is an individual whose treatment status $D$ is affected by variation in the instrument $Z$. In public policy research, instruments often take the form of changes in eligibility criteria for social programs. Compliers are thus of policy interest as they are exactly the subpopulation affected by eligibility changes.

To fix ideas, we provide examples with continuous outcome $Y$, binary treatment $D$, and binary instrument $Z$. Randomized assignment of a drug ($Z$) only influences patient health ($Y$) via actual consumption of the drug ($D$), identifying the counterfactual effect of the drug on health even in the scenario of imperfect compliance [angrist1996identification]. However, the analyst can only learn the treatment effect on the subpopulation of complier patients: those who would consume the drug if assigned and who would not consume the drug if not assigned. Charter school admission by lottery ($Z$) only influences student test scores ($Y$) via actual attendance of the charter school ($D$), identifying the counterfactual effect of the charter school on test scores even if there is selection bias in which students choose to accept an offer of admission [angrist2010inputs; angrist2013explaining]. However, the analyst can only learn the treatment effect on the subpopulation of complier students: those who would attend the charter school if they won the lottery and who would not attend the charter school if they lost the lottery.

In the present work, we introduce a de-biased machine learning (DML) approach to estimating complier parameters with high-dimensional data [chernozhukov2018double; foster2019orthogonal]. In our approach, the de-biasing is itself performed by machine learning, a variant called de-biased machine learning via regularized Riesz representers (DML-RRR) [chernozhukov2018dantzig; chernozhukov2018learning]. We present a general estimator, then specialize it to the tasks of learning LATE, average complier characteristics, and complier counterfactual outcome distributions. Counterfactual outcome distributions are particularly important in welfare analysis of schooling, subsidized training, union status, minimum wages, and transfer programs [abadie_bootstrap_2002; abadie_instrumental_2002].

We make three contributions. First, we extend the theory of DML-RRR pioneered by [chernozhukov2018dantzig; chernozhukov2018learning]. Whereas [chernozhukov2018dantzig; chernozhukov2018learning] consider parameters of the full population identified by selection on observables, we consider parameters of the complier subpopulation identified by instrumental variables. We prove our estimator is consistent, asymptotically normal, and semi-parametrically efficient, and we provide simultaneous confidence bands. Second, we re-interpret the weight in a widely-used algorithm for estimating complier parameters, called $\kappa$-weighting, as the Riesz representer in DML-RRR; the $\kappa$-weight is in fact a component of the de-biasing term. Third, we show our approach outperforms alternative approaches to estimating complier parameters, suggesting DML-RRR may be an effective paradigm in high-dimensional causal inference.

2 Related Work

Several approaches have been proposed to estimate complier parameters by DML. Both [ogburn_doubly_2015] and [chernozhukov2018double] present a DML estimator for LATE. The justification in [ogburn_doubly_2015] is via inverse propensity weighting, while the justification in [chernozhukov2018double] is by interpreting LATE as a ratio of ATEs. In [belloni2017program], the authors present a DML estimator for counterfactual outcome distributions with simultaneous confidence bands. All of these estimators involve plugging an estimated propensity score into a denominator, which is numerically unstable. Unlike previous work, we present a general justification that covers a broad class of estimators, and we present a DML-RRR variant that eliminates the numerically unstable step of plugging in an estimated propensity score. As far as we know, ours is the first DML or DML-RRR estimator of average complier characteristics. For a comparison between DML-RRR and other approaches to semi-parametric estimation that use ML, namely targeted maximum likelihood [van2011targeted], the efficient score [ning2017general], and approximate residual balancing [athey2018approximate], we recommend [chernozhukov2018dantzig; chernozhukov2018learning].

Our work also relates to the literature on $\kappa$-weighting, an algorithm introduced by [abadie_semiparametric_2003]. In $\kappa$-weighting, any complier parameter can be expressed as a weighted average of the corresponding population parameter. For example, LATE can be expressed as a weighted average of ATE across covariate values. Likewise, [abadie_instrumental_2002; angrist_stand_2016] propose $\kappa$-weighting estimators of counterfactual outcome distributions. The $\kappa$-weight involves an estimated propensity score in the denominator, which is numerically unstable. Theoretically, the literature has not yet justified the use of a black-box regularized ML algorithm to learn the propensity score in high-dimensional settings. By elucidating the relationship between $\kappa$-weighting and DML, we provide this justification. Moreover, by introducing the DML-RRR variant, we are able to learn the $\kappa$-weight directly, without estimating its components or even knowing its functional form.

Finally, our paper contributes to the growing literature on instrumental variables in machine learning. Both [hartford2017deep] and [singh2019kernel] consider the problem of nonparametric instrumental variable regression, where the target parameter is the structural function $h$ that summarizes the counterfactual relationship $Y = h(D, X) + e$, where $e$ is confounding noise. In [athey2019generalized], the authors further assume the function can be decomposed as $h(D, X) = \theta_0(X) D + f(X)$. Importantly, [hartford2017deep; singh2019kernel; athey2019generalized] assume that the noise term is additively separable, a model proposed by [newey2003instrumental]. In this setting, [hartford2017deep] introduce nonlinearity with neural networks, [singh2019kernel] do so with RKHS methods, and [athey2019generalized] with random forests. In our setting, $(D, Z)$ are binary and we do not assume additive separability of confounding noise, a model considered by [angrist1996identification]. Our target parameters are functionals of the underlying regression $\gamma_0(z, x) = E[V \mid Z = z, X = x]$, where $V$ is a vector of relevant random variables. Such parameters are called semi-parametric. We allow black-box ML for nonlinear estimation of $\gamma_0$.

3 Problem setting and definitions

Let $W = (Y, D, Z, X)$ concatenate the random variables. $Y$ is the continuous outcome, $D$ is the binary treatment, $Z$ is the binary instrumental variable, and $X$ is the covariate. We observe $n$ i.i.d. observations $\{W_i\}_{i=1}^{n}$. Where possible, we suppress the index $i$ to lighten notation.

Instrumental variable identification requires an assumption expressed in terms of potential outcomes. A potential outcome is a latent random variable expressing a counterfactual outcome given a hypothetical intervention. We recommend [imbens2015causal; peters2017elements; hernan2019causal] for a clear introduction to this framework for causal inference. Following the notation of [angrist1996identification], we denote by $Y(d, z)$ the potential outcome under the intervention $D = d$ and $Z = z$. We denote by $D(z)$ the potential treatment under the intervention $Z = z$. Compliers are the subpopulation for whom $D(1) > D(0)$.

We now formalize our causal assumption about the instrument $Z$, quoting [angrist1996identification]. This prior knowledge, described informally in the introduction, allows us to define and recover the counterfactual effect of treatment on outcome for compliers.

Assumption 1 (Identification).

Assume

  1. independence: $\{Y(d, z) : d, z \in \{0, 1\}\}$ and $\{D(z) : z \in \{0, 1\}\}$ are jointly independent of $Z$ conditional on $X$

  2. exclusion: $P\{Y(d, 1) = Y(d, 0)\} = 1$ for $d \in \{0, 1\}$

  3. overlap: $0 < P(Z = 1 \mid X) < 1$ a.s.

  4. monotonicity: $P\{D(1) \ge D(0)\} = 1$ and $P\{D(1) > D(0)\} > 0$

The independence condition states that the instrument $Z$ is as good as randomly assigned conditional on covariates $X$. The exclusion condition imposes that the instrument $Z$ only affects the outcome $Y$ via the treatment $D$. We can therefore simplify notation: $Y(d) := Y(d, 1) = Y(d, 0)$. The overlap condition ensures that there are no covariate values for which the instrument is deterministic. The monotonicity condition rules out the possibility of defiers: individuals who would always pursue the opposite treatment status from their assignment. A code sketch of a data-generating process satisfying these conditions follows.
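To make the four conditions concrete, here is a minimal simulated data-generating process (our own illustration, not the design used in Section 6) that satisfies Assumption 1 by construction: $Z$ is randomized given $X$, enters $Y$ only through $D$, and every unit is a never-taker, complier, or always-taker.

```python
import numpy as np

def simulate_iv(n=10_000, seed=0):
    """Toy DGP satisfying Assumption 1 (illustration only)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=n)                       # observed covariate
    U = rng.normal(size=n)                       # unmeasured confounder
    Z = rng.binomial(1, 1 / (1 + np.exp(-X)))    # independence and overlap: P(Z=1|X) in (0,1)
    t = U + rng.normal(size=n)                   # latent type index, confounded by U
    D0 = (t > 1.0).astype(int)                   # D(0) = 1 for always-takers only
    D1 = (t > -1.0).astype(int)                  # D(1) = 1 for compliers and always-takers
    D = Z * D1 + (1 - Z) * D0                    # realized treatment
    Y = 1.0 * D + 0.5 * X + U + rng.normal(size=n)  # exclusion: Z enters Y only through D
    return Y, D, Z, X

Y, D, Z, X = simulate_iv()  # monotonicity D(1) >= D(0) holds unit by unit
```

Here the complier treatment effect is 1.0 by construction, while naive regression of $Y$ on $D$ is biased by the confounder $U$.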

Definition 1 (Complier parameters).

We define the following complier parameters

  1. LATE is $\theta_0 = E[Y(1) - Y(0) \mid D(1) > D(0)]$.

  2. Average complier characteristics are $\theta_0 = E[f(X) \mid D(1) > D(0)]$ for characteristics $f(X)$.

  3. Complier counterfactual outcome distributions are $\theta_0(y) = P\{Y(d) \le y \mid D(1) > D(0)\}$ where $d \in \{0, 1\}$.

Using the notation of [chernozhukov2018learning], we denote the conditional expectation function (CEF) of a random vector $V$ conditional on $(Z, X)$ as
$$\gamma_0(z, x) = E[V \mid Z = z, X = x],$$
where $V$ is constructed from $W$. The random vector $V$ is observable and depends on the complier parameter of interest; we specify its components for LATE, complier characteristics, and counterfactual outcome distributions in Theorem 1. We denote the classic Horvitz-Thompson weight with
$$\alpha_0^{HT}(z, x) = \frac{z}{\pi_0(x)} - \frac{1 - z}{1 - \pi_0(x)}, \qquad \pi_0(x) = P(Z = 1 \mid X = x).$$
Lastly, we denote by $|\cdot|_q$ the $\ell_q$ norm of a vector, and we denote by $\|V\|_2$ the $L_2$ norm of a random variable, i.e. $\|V\|_2 = (E[V^2])^{1/2}$.
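To connect Definition 1 with this notation, recall the standard identification result for LATE under Assumption 1 (a conditional version of the Wald ratio; see [angrist1996identification; abadie_semiparametric_2003]), written here in our CEF notation with superscripts marking the component of $V$:
$$\theta_0^{\mathrm{LATE}} = E[Y(1) - Y(0) \mid D(1) > D(0)] = \frac{E[\gamma_0^{Y}(1, X) - \gamma_0^{Y}(0, X)]}{E[\gamma_0^{D}(1, X) - \gamma_0^{D}(0, X)]},$$
where $\gamma_0^{Y}(z, x) = E[Y \mid Z = z, X = x]$ and $\gamma_0^{D}(z, x) = E[D \mid Z = z, X = x]$. Both numerator and denominator are functionals of CEFs, which is exactly the structure the DML machinery of Section 4 exploits.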

4 Learning problem and algorithm

DML is a method-of-moments approach to estimation with de-biasing and strong statistical guarantees. We review the DML algorithm: in stage 1, estimate the CEF $\gamma_0$ and an additional nuisance parameter called the Riesz representer (RR); in stage 2, estimate the parameter of interest $\theta_0$ by the method of moments with a de-biased moment function and the stage 1 estimates. We extend DML to estimate complier parameters. Specifically, we demonstrate how the identification assumption, expressed in terms of potential outcomes, implies a moment function and a corresponding de-biased moment function. Its de-biasing term is precisely the normalized $\kappa$-weight.

4.1 DML

Consider a causal parameter $\theta_0$ implicitly defined by
$$E[m(W; \gamma_0, \theta_0)] = 0.$$

Here $m$ is called the moment function, and it defines the causal parameter $\theta_0$. $\gamma_0$ is the CEF, a nuisance parameter that must be estimated in order to estimate the parameter of interest $\theta_0$.

The plug-in approach involves estimating $\hat{\gamma}$ in stage 1 by some black-box ML algorithm, and estimating $\hat{\theta}$ in stage 2 by the method of moments with moment function $m$. The plug-in approach is badly biased [chernozhukov2018double].

The DML approach uses a more sophisticated moment function [newey1994asymptotic]:
$$\psi(w; \gamma, \alpha, \theta) = m(w; \gamma, \theta) + \phi(w; \gamma, \alpha), \qquad \phi(w; \gamma, \alpha) = \alpha(z, x)'\{v - \gamma(z, x)\}.$$

$\phi$ is called the de-biasing term. We derive $\alpha_0$ such that $\psi$ is doubly robust (DR). In particular, we derive $\alpha_0$ such that
$$E[\psi(W; \gamma, \alpha_0, \theta_0)] = 0 \quad \forall \gamma, \qquad E[\psi(W; \gamma_0, \alpha, \theta_0)] = 0 \quad \forall \alpha,$$

so stage 2 estimation of $\theta_0$ by the method of moments with moment function $\psi$ is asymptotically invariant to estimation error in either $\hat{\gamma}$ or $\hat{\alpha}$. In this sense, introducing the additional term $\phi$ serves to de-bias the original moment function $m$.

Importantly, the DR moment function $\psi$ introduces an additional nuisance parameter $\alpha_0$, the RR, which must be estimated in stage 1. Whereas DML involves estimating $\alpha_0$ by estimating its components and knowing its functional form, we estimate $\alpha_0$ directly by DML-RRR.
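To see the mechanics, here is the standard double-robustness calculation for the scalar case $V = Y$ with the Horvitz-Thompson functional $\theta_0 = E[\gamma_0(1, X) - \gamma_0(0, X)]$; the expectation of the de-biased moment is zero at $\theta_0$ for any candidate $\gamma$:
$$E\big[\gamma(1, X) - \gamma(0, X) - \theta_0 + \alpha_0^{HT}(Z, X)\{Y - \gamma(Z, X)\}\big]$$
$$= E[\gamma(1, X) - \gamma(0, X)] - \theta_0 + E[\alpha_0^{HT}(Z, X)\gamma_0(Z, X)] - E[\alpha_0^{HT}(Z, X)\gamma(Z, X)]$$
$$= E[\gamma(1, X) - \gamma(0, X)] - \theta_0 + E[\gamma_0(1, X) - \gamma_0(0, X)] - E[\gamma(1, X) - \gamma(0, X)] = 0,$$
using iterated expectations ($E[Y \mid Z, X] = \gamma_0(Z, X)$) in the first equality, and the RR property $E[\alpha_0^{HT}(Z, X) g(Z, X)] = E[g(1, X) - g(0, X)]$, applied to $g = \gamma_0$ and $g = \gamma$, in the second.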

4.2 DML for complier parameters

We derive the DR moment functions for complier parameters. We show that these moment functions share a common structure.

Theorem 1 (DR moment functions).

Under Assumption 1, the DR moment functions for LATE, average complier characteristics, and complier counterfactual outcome distributions are of the form
$$\psi(w; \gamma, \alpha, \theta) = m(w; \gamma, \theta) + \alpha(z, x)'\{v - \gamma(z, x)\},$$

where

  1. for LATE, $V = (Y, D)'$ and $m(w; \gamma, \theta) = \{\gamma^{(1)}(1, x) - \gamma^{(1)}(0, x)\} - \theta\{\gamma^{(2)}(1, x) - \gamma^{(2)}(0, x)\}$;

  2. for complier characteristics, $V = (f(X) D, D)'$ and $m$ takes the same form as for LATE;

  3. for complier counterfactual distributions, $V = (1\{Y \le y\}(1 - D), 1\{Y \le y\} D, D)'$ and $m$ stacks the corresponding CEF contrasts for $Y(0)$ and $Y(1)$.

Formally, $\alpha_0$ is the RR to the continuous linear functional $\gamma \mapsto E[m(W; \gamma, \theta)]$, i.e.
$$E[m(W; \gamma, \theta)] = E[\alpha_0(Z, X)'\gamma(Z, X)] \quad \forall \gamma;$$
for LATE, $\alpha_0(z, x) = \alpha_0^{HT}(z, x)(1, -\theta)'$. Indeed, this follows immediately since we know from the classic Horvitz-Thompson derivation that $\alpha_0^{HT}$ is the RR to the continuous linear functional $\gamma \mapsto E[\gamma(1, X) - \gamma(0, X)]$, i.e.
$$E[\gamma(1, X) - \gamma(0, X)] = E[\alpha_0^{HT}(Z, X)\gamma(Z, X)] \quad \forall \gamma.$$

In Appendix 8.2, we review the classic Horvitz-Thompson derivation. We also prove a more general version of Theorem 1 for the entire class of complier parameters, and we demonstrate that the $\kappa$-weight is a reparametrization of the RR $\alpha_0^{HT}$.

4.3 Algorithm

In [chernozhukov2016locally], the authors show that it is data-efficient and theoretically elegant to use sample splitting in DML [bickel1982adaptive; schick1986asymptotically]. The DML-RRR algorithm of [chernozhukov2018dantzig; chernozhukov2018learning] is as follows.

Algorithm 1 (DML).

Partition the sample into $L$ subsets $\{I_\ell\}_{\ell=1}^{L}$.

  1. For each $\ell$, estimate $\hat{\gamma}_{-\ell}$ and $\hat{\alpha}_{-\ell}$ from observations not in $I_\ell$.

  2. Estimate $\hat{\theta}$ as the solution to
$$\frac{1}{n} \sum_{\ell=1}^{L} \sum_{i \in I_\ell} \psi(W_i; \hat{\gamma}_{-\ell}, \hat{\alpha}_{-\ell}, \hat{\theta}) = 0.$$
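For concreteness, here is a minimal sketch of Algorithm 1 for the LATE moment of Theorem 1, exploiting its linearity in $\theta$ to solve the empirical moment equation in closed form. The helper interfaces fit_gamma and fit_alpha are our own assumptions, standing in for any stage 1 estimators (for example, Algorithm 2 for $\hat{\alpha}$):

```python
import numpy as np

def dml_late(Y, D, Z, X, fit_gamma, fit_alpha, L=5, seed=0):
    """Cross-fitted DML estimate of LATE (sketch of Algorithm 1).

    fit_gamma(V, Z, X) -> gamma(z, Xnew): black-box CEF estimator.
    fit_alpha(Z, X)    -> alpha(Z, Xnew): RR estimator, e.g. Algorithm 2.
    Both interfaces are assumptions made for this illustration.
    """
    n = len(Y)
    folds = np.random.default_rng(seed).integers(0, L, size=n)  # partition into L subsets
    num, den = np.zeros(n), np.zeros(n)
    for l in range(L):
        train, test = folds != l, folds == l
        gY = fit_gamma(Y[train], Z[train], X[train])  # CEF of Y, fit off-fold
        gD = fit_gamma(D[train], Z[train], X[train])  # CEF of D, fit off-fold
        alpha = fit_alpha(Z[train], X[train])         # Riesz representer, fit off-fold
        a = alpha(Z[test], X[test])
        # de-biased moment components, evaluated on the held-out fold
        num[test] = gY(1, X[test]) - gY(0, X[test]) + a * (Y[test] - gY(Z[test], X[test]))
        den[test] = gD(1, X[test]) - gD(0, X[test]) + a * (D[test] - gD(Z[test], X[test]))
    return num.mean() / den.mean()  # theta_hat solving the empirical moment equation
```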

Our theoretical guarantees apply to Dantzig selector or Lasso estimators of $\hat{\alpha}$, originally presented in [chernozhukov2018dantzig] and [chernozhukov2018learning], respectively. In what follows, we restrict attention to Lasso.

Consider the projection of $\alpha_0$ onto the $p$-dimensional dictionary $b(z, x)$, i.e. approximations of the form $b(z, x)'\rho$. Extending the RR result component-wise,
$$M := E[b(1, X) - b(0, X)] = E[\alpha_0(Z, X)\, b(Z, X)].$$

With $\ell_1$-regularization, the objective becomes
$$\min_{\rho} \; E[\{\alpha_0(Z, X) - b(Z, X)'\rho\}^2] + 2\lambda_n |\rho|_1.$$

Expanding the square, ignoring terms without $\rho$, and using the RR result,
$$\min_{\rho} \; -2 M'\rho + \rho' G \rho + 2\lambda_n |\rho|_1, \qquad G := E[b(Z, X) b(Z, X)'].$$

The empirical analogue of this expression yields an estimator $\hat{\rho}$, and hence $\hat{\alpha}(z, x) = b(z, x)'\hat{\rho}$. In this paper, we consider the RR $\alpha_0 = \alpha_0^{HT}$ of the functional $\gamma \mapsto E[\gamma(1, X) - \gamma(0, X)]$, so $M$ is directly estimable without knowledge of $\pi_0$.

Algorithm 2 (RRR).

For observations not in $I_\ell$,

  1. Calculate the matrix $\hat{G}_{-\ell} = \frac{1}{n - |I_\ell|} \sum_{i \notin I_\ell} b(Z_i, X_i) b(Z_i, X_i)'$.

  2. Calculate the vector $\hat{M}_{-\ell} = \frac{1}{n - |I_\ell|} \sum_{i \notin I_\ell} \{b(1, X_i) - b(0, X_i)\}$.

  3. Set $\hat{\alpha}_{-\ell}(z, x) = b(z, x)'\hat{\rho}_{-\ell}$ where $\hat{\rho}_{-\ell} = \arg\min_{\rho} \; -2\hat{M}_{-\ell}'\rho + \rho'\hat{G}_{-\ell}\rho + 2\lambda_n |\rho|_1$.
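A minimal sketch of Algorithm 2's optimization step by proximal gradient descent (ISTA); the paper does not prescribe a particular solver, so this is one convenient choice:

```python
import numpy as np

def rrr_alpha(b_train, b1_minus_b0, lam, n_iter=5000):
    """Regularized Riesz representer (sketch of Algorithm 2).

    b_train:      (n, p) dictionary evaluated at observed (Z_i, X_i)
    b1_minus_b0:  (n, p) dictionary contrast b(1, X_i) - b(0, X_i)
    Solves min_rho -2 M'rho + rho' G rho + 2 lam |rho|_1 by ISTA.
    """
    n, p = b_train.shape
    G = b_train.T @ b_train / n                  # hat G
    M = b1_minus_b0.mean(axis=0)                 # hat M for the HT functional
    step = 1.0 / (2 * np.linalg.eigvalsh(G).max() + 1e-12)  # 1 / Lipschitz constant
    rho = np.zeros(p)
    for _ in range(n_iter):
        grad = 2 * (G @ rho - M)                 # gradient of the smooth part
        z = rho - step * grad
        rho = np.sign(z) * np.maximum(np.abs(z) - 2 * lam * step, 0.0)  # soft-threshold
    return rho                                    # alpha_hat(z, x) = b(z, x) @ rho
```

Note that $\hat{M}$ uses only the dictionary contrast $b(1, x) - b(0, x)$, so no propensity score is estimated or inverted.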

Likewise, we can project $\gamma_0$ onto the $p$-dimensional dictionary $b$, estimating it by regularized regression of $V$ on $b(Z, X)$. Our theoretical results are agnostic about the choice of estimator $\hat{\gamma}$; it may be this estimator or any other black-box ML algorithm satisfying the rate condition specified in Assumption 4.
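The CEF estimate can come from any regularized regression. A minimal sketch using scikit-learn's Lasso on an illustrative dictionary (the paper's actual dictionaries are described in its appendices), matching the fit_gamma interface assumed above:

```python
import numpy as np
from sklearn.linear_model import Lasso

def make_dictionary(z, X):
    """An illustrative dictionary b(z, x): intercept, z, x, and z * x terms."""
    X = np.asarray(X, dtype=float).reshape(len(X), -1)
    z = np.broadcast_to(z, (len(X),)).astype(float)
    return np.column_stack([np.ones(len(X)), z, X, z[:, None] * X])

def fit_gamma(V, Z, X, lam=0.01):
    """Estimate gamma_0(z, x) = E[V | Z=z, X=x] by Lasso on the dictionary;
    returns a callable matching the interface assumed in dml_late above."""
    model = Lasso(alpha=lam).fit(make_dictionary(Z, X), V)
    return lambda z, Xnew: model.predict(make_dictionary(z, Xnew))
```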

Suppose we wish to form a simultaneous confidence band for the components of $\hat{\theta}$; this is particularly relevant for the estimation of counterfactual outcome distributions based on a grid $y_1, \ldots, y_k$. The following algorithm allows us to do so from some estimator $\hat{\Sigma}$ of the asymptotic variance of $\hat{\theta}$.

Algorithm 3 (Simultaneous confidence band).

Given $\hat{\Sigma}$,

  1. Calculate the correlation matrix $\hat{\Omega} = \hat{D}^{-1/2} \hat{\Sigma} \hat{D}^{-1/2}$, where $\hat{D} = \mathrm{diag}(\hat{\Sigma})$.

  2. Sample $G \sim N(0, \hat{\Omega})$ and compute the critical value $c_{1-a}$ as the $(1-a)$-quantile of sampled $\max_j |G_j|$.

  3. Form the confidence band $[\hat{\theta}_j \pm c_{1-a} \sqrt{\hat{\sigma}_j^2 / n}]$, where $\hat{\sigma}_j^2$ is the diagonal entry of $\hat{\Sigma}$ corresponding to grid value $y_j$.
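A minimal sketch of Algorithm 3, drawing from the estimated Gaussian limit to obtain the sup-t critical value (the level and draw count are illustrative):

```python
import numpy as np

def sup_t_band(theta_hat, Sigma_hat, n, a=0.05, n_draws=10_000, seed=0):
    """Simultaneous confidence band (sketch of Algorithm 3).

    theta_hat: (k,) point estimates over the grid of y values
    Sigma_hat: (k, k) estimated asymptotic covariance of sqrt(n)(theta_hat - theta_0)
    Returns (lower, upper) bands covering all grid points jointly at level 1 - a.
    """
    sd = np.sqrt(np.diag(Sigma_hat))
    Omega = Sigma_hat / np.outer(sd, sd)            # correlation matrix
    rng = np.random.default_rng(seed)
    G = rng.multivariate_normal(np.zeros(len(sd)), Omega, size=n_draws)
    c = np.quantile(np.abs(G).max(axis=1), 1 - a)   # (1-a)-quantile of max_j |G_j|
    half_width = c * sd / np.sqrt(n)
    return theta_hat - half_width, theta_hat + half_width
```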

5 Consistency and asymptotic normality

We adapt the assumptions of [chernozhukov2018learning] to our setting. First, we place weak assumptions on the dictionary $b$, the propensity score $\pi_0$, the conditional variance $\mathrm{Var}(V \mid Z, X)$, and the Jacobian $J_0$. We allow the bound on the dictionary to be a sequence $B_n$ that increases in $n$.

Assumption 2 (Bounded dictionary).

$\exists B_n$ s.t. $|b(Z, X)|_\infty \le B_n$ a.s.

Assumption 3 (Regularity).

Assume

  1. $\pi_0(X) \in [\bar{\epsilon}, 1 - \bar{\epsilon}]$ a.s. for some $\bar{\epsilon} > 0$

  2. $\mathrm{Var}(V \mid Z, X)$ is bounded

  3. the Jacobian $J_0 = \partial_\theta E[m(W; \gamma_0, \theta)] \big|_{\theta = \theta_0}$ is nonsingular

Next we state our rate assumption on the black-box estimator $\hat{\gamma}$. We allow $\hat{\gamma}$ to converge at a rate slower than $n^{-1/2}$.

Assumption 4 (CEF rate).

$\exists \epsilon_n \to 0$ s.t. $\|\hat{\gamma} - \gamma_0\|_2 = O_P(\epsilon_n)$

Let $\hat{\alpha}(z, x) = b(z, x)'\hat{\rho}$. We articulate assumptions required for convergence of $\hat{\alpha}$ under two regimes: the regime in which $\alpha_0$ is dense and the regime in which $\alpha_0$ is sparse.

Assumption 5 (Dense RR).

Assume

  1. $\exists \rho_n$ s.t. $|\rho_n|_1$ is bounded and $\|\alpha_0 - b'\rho_n\|_2 \to 0$

Assumption 5 is a statement about the quality of the approximation of $\alpha_0$ by the dictionary $b$. It is satisfied if, for example, $\alpha_0$ is a linear combination of the components of $b$.

Assumption 6 (Sparse RR).

Assume

  1. $\exists \rho_n$ with $s$ nonzero elements s.t. $|\rho_n|_1$ is bounded and $\|\alpha_0 - b'\rho_n\|_2 \to 0$

  2. $G = E[b(Z, X) b(Z, X)']$ is nonsingular with largest eigenvalue uniformly bounded in $n$

  3. $\exists k > 0$ s.t. $\delta' G \delta \ge k^2 |\delta_T|_2^2$ for all $\delta$ with $|\delta_{T^c}|_1 \le 3|\delta_T|_1$, where $T$ is the support of $\rho_n$.

Assumption 6 is a statement about the quality of the approximation of $\alpha_0$ by a sparse subset of the dictionary $b$. It is satisfied if, for example, $\alpha_0$ is sparse or approximately sparse [chernozhukov2018learning]. Condition 3 is the population version of the restricted eigenvalue condition of [bickel2009simultaneous]. Finally, we state sufficient conditions on the sequences of constants as they relate to the regularization sequence $\lambda_n$ and the CEF convergence rate $\epsilon_n$, as the dictionary dimension $p$ and sample size $n$ increase.

Assumption 7 (Sufficient conditions).

Assume the sequences $(B_n, \epsilon_n, \lambda_n)$ satisfy the growth conditions of [chernozhukov2018learning]; in particular, $\lambda_n \to 0$ slowly enough to dominate the sampling error of $(\hat{G}, \hat{M})$.

We quote stage 1 convergence guarantees for the estimator $\hat{\alpha}$ in Algorithm 2 from [chernozhukov2018learning]. We obtain a slow rate when $\alpha_0$ is dense and a fast rate when $\alpha_0$ is sparse.

Theorem 2 (Dense RR rate).

Under Assumptions 1, 2, 3, 5, and 7, $|\hat{\rho} - \rho_n|_1 \to_P 0$ and $\|\hat{\alpha} - \alpha_0\|_2 \to_P 0$ at the slow rate of [chernozhukov2018learning].

Theorem 3 (Sparse RR rate).

Under Assumptions 1, 2, 3, 6, and 7, $|\hat{\rho} - \rho_n|_1 \to_P 0$ and $\|\hat{\alpha} - \alpha_0\|_2 \to_P 0$ at the fast rate of [chernozhukov2018learning].

We now present the main theorem of this paper. We prove that our DML-RRR estimator for complier parameters is consistent and asymptotically normal, appealing to the theory in [chernozhukov2016locally] to generalize the main result of [chernozhukov2018learning].

Assumption 8.

$\theta_0 \in \Theta$, a compact parameter space

Theorem 4 (DML-RRR asymptotic normality).

Suppose Assumptions 1, 2, 3, 4, either 5 or 6, 7, and 8 hold. Then $\hat{\theta} \to_P \theta_0$, $\sqrt{n}(\hat{\theta} - \theta_0) \to_d N(0, \Sigma)$, and $\hat{\Sigma} \to_P \Sigma$, where
$$\Sigma = J_0^{-1}\, E[\psi(W; \gamma_0, \alpha_0, \theta_0)\, \psi(W; \gamma_0, \alpha_0, \theta_0)']\, (J_0^{-1})'.$$

It follows that $\hat{\theta}$ is semiparametrically efficient [chernozhukov2016locally]. Finally, we prove the validity of simultaneous confidence bands for counterfactual distribution estimators.

Theorem 5 (Simultaneous confidence band).

Under the assumptions of Theorem 4, the confidence band in Algorithm 3 jointly covers the true counterfactual distribution at all grid points with probability approaching the nominal level, i.e.
$$P\left(\theta_0(y_j) \in \left[\hat{\theta}_j \pm c_{1-a}\sqrt{\hat{\sigma}_j^2 / n}\right] \; \forall j \in \{1, \ldots, k\}\right) \to 1 - a.$$

6 Experiments

We compare the performance of DML-RRR with original DML [chernozhukov2018double] and $\kappa$-weighting [abadie_semiparametric_2003] in simulations. We focus on counterfactual distributions as our choice of complier parameter. We then apply DML-RRR to real-world data to estimate the counterfactual distribution of employee net financial assets with and without 401(k) participation.

6.1 Simulation

We apply DML-RRR, DML, and $\kappa$-weighting to a counterfactual distribution design detailed in Appendix 8.6. Each simulation consists of $n$ i.i.d. observations, and we use a dictionary of dimension $p$. Both DML and $\kappa$-weighting involve inverting an estimated propensity score $\hat{\pi}(X)$. To improve numerical stability for the DML estimator, we impose trimming according to [belloni2017program], dropping observations whose $\hat{\pi}(X)$ falls outside fixed trimming bounds.

For each algorithm, we implement 500 simulations and visualize the mean as well as the 2.5% and 97.5% quantiles for each value $y$ in the grid. Figure 1 summarizes the results: DML-RRR performs best, though its improvement over DML is modest. However, one advantage of DML-RRR is that it does not require ad hoc trimming. $\kappa$-weighting performs worst, perhaps because it does not use a regularized ML estimator of the propensity score.

Figure 1: Counterfactual distribution simulation (panels (a) and (b)).

6.2 Effect of 401(k) on assets

Next, we use DML-RRR to investigate the effect of 401(k) participation on the distribution of net financial assets. We follow the identification strategy of [poterba1994; poterba1995]. The authors assume that when 401(k) plans were introduced, workers ignored whether a given job offered a 401(k) and instead made employment decisions based on income and other observable job characteristics; conditional on income and job characteristics, 401(k) eligibility was then exogenous at the time.

We use data from the 1991 US Survey of Income and Program Participation, studied in [abadie_semiparametric_2003; chernozhukov2004effects; chernozhukov_iv_2005; ogburn_doubly_2015; belloni2017program]. We use the sample selection and variable construction of [chernozhukov2004effects]. The outcome $Y$ is net financial assets, defined as the sum of IRA balances, 401(k) balances, checking accounts, US saving bonds, other interest-earning accounts, stocks, mutual funds, and other interest-earning assets, minus non-mortgage debt. The treatment $D$ is participation in a 401(k) plan. The instrument $Z$ is eligibility to enroll in a 401(k) plan. The covariates $X$ are age, income, years of education, family size, marital status, two-earner status, defined-benefit pension status, IRA participation, and home-ownership.

The data include 9,915 observations. We follow [belloni2017program] in the choice of grid points and the dictionary $b$. We take the grid to be the 5th through 95th percentiles of $Y$, a total of 91 different values of $y$. We consider two dictionaries, low-p and very-high-p, which differ in dimension $p$. See Appendix 8.7 for further details on the dictionaries and the DML-RRR implementation.

Figure 2 visualizes point estimates and simultaneous 95% confidence bands. We find that 401(k) participation significantly shifts out the distribution of net financial assets, consistent with the results reported in [belloni2017program]. Moreover, the DML-RRR algorithm is robust in the high-dimensional setting, yielding similar results in the low-p and very-high-p specifications.

Figure 2: Effect of 401(k) on net financial assets for compliers. (a) low-p; (b) very-high-p.

7 Conclusion

We extend DML-RRR to the task of learning causal parameters from confounded, high-dimensional data. DML-RRR is easily implemented and semiparametrically efficient. As a contribution to the IV literature, we reinterpret the $\kappa$-weight as the Riesz representer in the problem of learning complier parameters. As a contribution to the DML literature, we generalize the theory of DML-RRR and provide simultaneous confidence bands. In simulations, DML-RRR modestly outperforms DML and $\kappa$-weighting and eliminates the ad hoc step of trimming, suggesting DML-RRR may be an effective paradigm in high-dimensional causal inference.

Acknowledgments

We are grateful to Alberto Abadie, Anish Agarwal, Isaiah Andrews, Victor Chernozhukov, Anna Mikusheva, Whitney Newey, and Suhas Vijaykumar.

References

  • (1) Alberto Abadie. Bootstrap tests for distributional treatment effects in instrumental variable models. Journal of the American Statistical Association, 97(457):284–292, March 2002.
  • (2) Alberto Abadie. Semiparametric instrumental variable estimation of treatment response models. Journal of Econometrics, 113(2):231–263, April 2003.
  • (3) Alberto Abadie, Joshua Angrist, and Guido Imbens. Instrumental variables estimates of the effect of subsidized training on the quantiles of trainee earnings. Econometrica, 70(1):91–117, 2002.
  • (4) Joshua D. Angrist, Sarah R. Cohodes, Susan M. Dynarski, Parag A. Pathak, and Christopher R. Walters. Stand and deliver: Effects of Boston’s charter high schools on college preparation, entry, and choice. Journal of Labor Economics, 34(2):275–318, January 2016.
  • (5) Joshua D Angrist, Susan M Dynarski, Thomas J Kane, Parag A Pathak, and Christopher R Walters. Inputs and impacts in charter schools: KIPP Lynn. American Economic Review, 100(2):239–43, 2010.
  • (6) Joshua D Angrist, Guido W Imbens, and Donald B Rubin. Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434):444–455, 1996.
  • (7) Joshua D Angrist, Parag A Pathak, and Christopher R Walters. Explaining charter school effectiveness. American Economic Journal: Applied Economics, 5(4):1–27, 2013.
  • (8) Susan Athey, Guido W Imbens, and Stefan Wager. Approximate residual balancing: Debiased inference of average treatment effects in high dimensions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(4):597–623, 2018.
  • (9) Susan Athey, Julie Tibshirani, and Stefan Wager. Generalized random forests. The Annals of Statistics, 47(2):1148–1178, 2019.
  • (10) Alexandre Belloni, Victor Chernozhukov, Ivan Fernández-Val, and Christian Hansen. Program evaluation and causal inference with high-dimensional data. Econometrica, 85(1):233–298, 2017.
  • (11) Peter J Bickel. On adaptive estimation. The Annals of Statistics, pages 647–671, 1982.
  • (12) Peter J Bickel, Ya’acov Ritov, and Alexandre B Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.
  • (13) Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68, 2018.
  • (14) Victor Chernozhukov, Denis Chetverikov, and Kengo Kato. Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. The Annals of Statistics, 41(6):2786–2819, December 2013.
  • (15) Victor Chernozhukov, Juan Carlos Escanciano, Hidehiko Ichimura, Whitney K Newey, and James M Robins. Locally robust semiparametric estimation. arXiv:1608.00033, 2016.
  • (16) Victor Chernozhukov and Christian Hansen. The effects of 401(k) participation on the wealth distribution: An instrumental quantile regression analysis. Review of Economics and Statistics, 86(3):735–751, 2004.
  • (17) Victor Chernozhukov and Christian Hansen. An IV model of quantile treatment effects. Econometrica, 73(1):245–261, 2005.
  • (18) Victor Chernozhukov, Whitney Newey, James Robins, and Rahul Singh. Double/de-biased machine learning of global and local parameters using regularized Riesz representers. arXiv:1802.08667, 2018.
  • (19) Victor Chernozhukov, Whitney K Newey, and Rahul Singh. Learning L2 continuous regression functionals via regularized Riesz representers. arXiv:1809.05224, 2018.
  • (20) Dylan J Foster and Vasilis Syrgkanis. Orthogonal statistical learning. arXiv:1901.09036, 2019.
  • (21) Jason Hartford, Greg Lewis, Kevin Leyton-Brown, and Matt Taddy. Deep IV: A flexible approach for counterfactual prediction. In International Conference on Machine Learning, pages 1414–1423, 2017.
  • (22) Miguel A Hernan and James M Robins. Causal Inference. CRC, 2019.
  • (23) Guido W Imbens and Donald B Rubin. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.
  • (24) Whitney K Newey. The asymptotic variance of semiparametric estimators. Econometrica, pages 1349–1382, 1994.
  • (25) Whitney K Newey and Daniel McFadden. Large sample estimation and hypothesis testing. Handbook of Econometrics, 4:2111–2245, 1994.
  • (26) Whitney K Newey and James L Powell. Instrumental variable estimation of nonparametric models. Econometrica, 71(5):1565–1578, 2003.
  • (27) Yang Ning and Han Liu. A general theory of hypothesis tests and confidence regions for sparse high dimensional models. The Annals of Statistics, 45(1):158–195, 2017.
  • (28) Elizabeth L. Ogburn, Andrea Rotnitzky, and James M. Robins. Doubly robust estimation of the local average treatment effect curve. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77(2):373–396, 2015.
  • (29) Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms. MIT press, 2017.
  • (30) James M Poterba and Steven F Venti. 401(k) plans and tax-deferred saving. In Studies in the Economics of Aging, pages 105–142. University of Chicago Press, 1994.
  • (31) James M Poterba, Steven F Venti, and David A Wise. Do 401(k) contributions crowd out other personal saving? Journal of Public Economics, 58(1):1–32, 1995.
  • (32) Anton Schick. On asymptotically efficient estimation in semiparametric models. The Annals of Statistics, 14(3):1139–1151, 1986.
  • (33) Rahul Singh, Maneesh Sahani, and Arthur Gretton. Kernel instrumental variable regression. arXiv:1906.00232, 2019.
  • (34) James H Stock and Francesco Trebbi. Retrospectives: Who invented instrumental variable regression? Journal of Economic Perspectives, 17(3):177–194, 2003.
  • (35) Mark J Van der Laan and Sherri Rose. Targeted Learning: Causal Inference for Observational and Experimental Data. Springer Science & Business Media, 2011.

8 Appendix


8.1 Notation glossary

Let $W = (Y, D, Z, X)$ concatenate the random variables. $Y$ is the continuous outcome, $D$ is the binary treatment, $Z$ is the binary instrumental variable, and $X$ is the covariate. We observe $n$ i.i.d. observations $\{W_i\}_{i=1}^{n}$. Where possible, we suppress the index $i$ to lighten notation.

Following the notation of [angrist1996identification], we denote by $Y(d, z)$ the potential outcome under the intervention $D = d$ and $Z = z$. Due to Assumption 1, we can simplify notation: $Y(d) := Y(d, 1) = Y(d, 0)$. We denote by $D(z)$ the potential treatment under the intervention $Z = z$. Compliers are the subpopulation for whom $D(1) > D(0)$.

Using the notation of [chernozhukov2018learning], we denote the conditional expectation function (CEF) of a random vector $V$ conditional on $(Z, X)$ as
$$\gamma_0(z, x) = E[V \mid Z = z, X = x],$$
where $V$ is constructed from $W$. The random vector $V$ is observable and depends on the complier parameter of interest; we specify its components for LATE, complier characteristics, and counterfactual outcome distributions in Theorem 1.

We denote the propensity score $\pi_0(x) = P(Z = 1 \mid X = x)$. We denote the classic Horvitz-Thompson weight with
$$\alpha_0^{HT}(z, x) = \frac{z}{\pi_0(x)} - \frac{1 - z}{1 - \pi_0(x)}.$$

We denote by $|\cdot|_q$ the $\ell_q$ norm of a vector. We denote by $\|V\|_2$ the $L_2$ norm of a random variable $V$, i.e. $\|V\|_2 = (E[V^2])^{1/2}$. For a random vector $V$, we slightly abuse notation by writing
$$\|V\|_2 = (E[|V|_2^2])^{1/2}.$$

Likewise, we write the element-wise absolute value as
$$|V| = (|V^{(1)}|, \ldots, |V^{(k)}|)'.$$

Finally, we denote the true parameter value by $\theta_0 \in \Theta$, where $\Theta$ is some compact parameter space.

8.2 Identification

We review the derivation of the classic Horvitz-Thompson weight, relate DML-RRR to $\kappa$-weighting, prove a general identification result, and specialize this result to LATE, complier characteristics, and counterfactual outcome distributions.

Proposition 1.

$\alpha_0^{HT}$ is the RR to the continuous linear functional $\gamma \mapsto E[\gamma(1, X) - \gamma(0, X)]$, i.e.
$$E[\gamma(1, X) - \gamma(0, X)] = E[\alpha_0^{HT}(Z, X)\gamma(Z, X)] \quad \forall \gamma.$$

Proof.

Observe that, by iterated expectations,
$$E\left[\frac{Z}{\pi_0(X)}\gamma(Z, X)\right] = E\left[\frac{\pi_0(X)}{\pi_0(X)}\gamma(1, X)\right] = E[\gamma(1, X)],$$

and likewise
$$E\left[\frac{1 - Z}{1 - \pi_0(X)}\gamma(Z, X)\right] = E[\gamma(0, X)].$$

In summary, we can write
$$E[\alpha_0^{HT}(Z, X)\gamma(Z, X)] = E[\gamma(1, X)] - E[\gamma(0, X)]. \qquad \square$$

Definition 2.

Define
$$\kappa_0 = \frac{(1 - D)\{\pi_0(X) - Z\}}{\pi_0(X)\{1 - \pi_0(X)\}}, \qquad \kappa_1 = \frac{D\{Z - \pi_0(X)\}}{\pi_0(X)\{1 - \pi_0(X)\}}, \qquad \kappa = \kappa_0\{1 - \pi_0(X)\} + \kappa_1 \pi_0(X).$$

These are the $\kappa$-weights introduced in [abadie_semiparametric_2003].

Proposition 2.

The $\kappa$-weights can be rewritten as
$$\kappa_0 = -(1 - D)\,\alpha_0^{HT}(Z, X), \qquad \kappa_1 = D\,\alpha_0^{HT}(Z, X).$$

Proof.

Observe that $\alpha_0^{HT}(z, x) = \frac{z - \pi_0(x)}{\pi_0(x)\{1 - \pi_0(x)\}}$, and substitute into Definition 2. $\square$

Theorem 6.

Suppose Assumption 1 holds. Let $g$ be a measurable, real-valued function s.t. $E|g| < \infty$.

  1. If $\theta_0$ is defined by the moment condition $E[g(Y(0), X) - \theta_0 \mid D(1) > D(0)] = 0$, let $V = (g(Y, X)(1 - D), D)'$.

  2. If $\theta_0$ is defined by the moment condition $E[g(Y(1), X) - \theta_0 \mid D(1) > D(0)] = 0$, let $V = (g(Y, X) D, D)'$.

  3. If $\theta_0$ is defined by the moment condition $E[g(Y, D, X) - \theta_0 \mid D(1) > D(0)] = 0$, let $V = (g(Y, D, X)(1 - D), g(Y, D, X) D, D)'$.

Then the DR moment function for $\theta_0$ is of the form
$$\psi(w; \gamma, \alpha, \theta) = m(w; \gamma, \theta) + \alpha(z, x)'\{v - \gamma(z, x)\},$$

where $\alpha_0$ is the RR to the functional $\gamma \mapsto E[m(W; \gamma, \theta)]$.

Proof.

Consider the first case. Under Assumption 1, we can appeal to (abadie_semiparametric_2003, Theorem 3.1):
$$E[g(Y(0), X) \mid D(1) > D(0)] = \frac{E[\kappa_0\, g(Y, X)]}{E[\kappa_0]}.$$

Hence
$$\theta_0 = \frac{E[\kappa_0\, g(Y, X)]}{E[\kappa_0]} = \frac{E[\gamma_0^{(1)}(0, X) - \gamma_0^{(1)}(1, X)]}{E[\gamma_0^{(2)}(1, X) - \gamma_0^{(2)}(0, X)]},$$

appealing to Assumption 1, Proposition 2, and the fact that $\alpha_0^{HT}$ is the RR for $\gamma \mapsto E[\gamma(1, X) - \gamma(0, X)]$. Likewise for the second and third cases. ∎

Proof of Theorem 1.

Suppose we can decompose $V = \tilde{V} + f(X)$ for some fixed function $f$ that does not depend on data. Then we can replace $V$ with $\tilde{V}$ without changing $m$ and $\phi$. This is because $\gamma_0(z, x) = \tilde{\gamma}_0(z, x) + f(x)$, so the CEF contrast satisfies $\gamma_0(1, x) - \gamma_0(0, x) = \tilde{\gamma}_0(1, x) - \tilde{\gamma}_0(0, x)$ and hence $v - \gamma_0(z, x) = \tilde{v} - \tilde{\gamma}_0(z, x)$. Whenever we use this reasoning, we write $V \simeq \tilde{V}$.

  1. For LATE we can write $\theta_0 = \theta_0^a - \theta_0^b$, where $\theta_0^a$ is defined by the moment condition $E[Y(1) - \theta_0^a \mid D(1) > D(0)] = 0$ and $\theta_0^b$ is defined by the moment condition $E[Y(0) - \theta_0^b \mid D(1) > D(0)] = 0$. Applying case 2 of Theorem 6 to $\theta_0^a$, we have $V^a = (Y D, D)'$. Applying case 1 of Theorem 6 to $\theta_0^b$, we have $V^b = (Y(1 - D), D)'$. Writing $Y D + Y(1 - D) = Y$, the moment function for $\theta_0$ can thus be derived with $V = (Y, D)'$. Note that this expression decomposes into $m$ and $\phi$ in Theorem 1.

  2. For average complier characteristics, $\theta_0$ is defined by the moment condition $E[f(X) - \theta_0 \mid D(1) > D(0)] = 0$. Applying case 2 of Theorem 6, we have $V = (f(X) D, D)'$. This expression decomposes into $m$ and $\phi$ in Theorem 1.

  3. For the complier distribution of $Y(0)$, $\theta_0(y)$ is defined by the moment condition $E[1\{Y(0) \le y\} - \theta_0(y) \mid D(1) > D(0)] = 0$. Applying case 1 of Theorem 6 to $\theta_0(y)$, we have $V = (1\{Y \le y\}(1 - D), D)'$. For the complier distribution of $Y(1)$, $\theta_0(y)$ is defined by the moment condition $E[1\{Y(1) \le y\} - \theta_0(y) \mid D(1) > D(0)] = 0$. Applying case 2 of Theorem 6 to $\theta_0(y)$, we have $V = (1\{Y \le y\} D, D)'$. Concatenating the $V$ for $Y(0)$ and $Y(1)$, we arrive at the decomposition in Theorem 1.

8.3 Lemmas

Definition 3.
Proposition 3.

Under Assumption 2,

Proof.

(chernozhukov2018learning, Lemma A1)

Denote $\bar{\psi}(\theta) = E[\psi(W; \gamma_0, \alpha_0, \theta)]$.

Proposition 4.

Under Assumptions 1 and 3

  1. $\bar{\psi}(\theta)$ is continuous at $\theta_0$ w.r.t. $|\cdot|_2$

Proof.

(chernozhukov2018learning, Theorem 6)

Proposition 5.

Under Assumptions 1 and 3,

Proof.

Proposition 4 and (chernozhukov2018learning, Lemma 4)