Debiased Machine Learning for Compliers
Abstract
Instrumental variable identification is a concept in causal statistics for estimating the counterfactual effect of treatment $D$ on outcome $Y$, controlling for covariates $X$, using observational data. Even when measurements of $(D, Y)$ are confounded, the treatment effect on the subpopulation of compliers can nonetheless be identified if an instrumental variable $Z$ is available, which is independent of $Y$ conditional on $D$, $X$, and the unmeasured confounder. We introduce a debiased machine learning (DML) approach to estimating complier parameters with high-dimensional data. Complier parameters include the local average treatment effect, average complier characteristics, and complier counterfactual outcome distributions. In our approach, the debiasing is itself performed by machine learning, a variant called debiased machine learning via regularized Riesz representers (DML-RRR). We prove our estimator is consistent, asymptotically normal, and semiparametrically efficient. In experiments, our estimator outperforms state-of-the-art alternatives. We use it to estimate the effect of 401(k) participation on the distribution of net financial assets.
1 Introduction
Instrumental variable (IV) identification is a concept in causal statistics for estimating the counterfactual effect of treatment $D$ on outcome $Y$, controlling for covariates $X$, using observational data stock2003retrospectives . Even when measurements of $(D, Y)$ are confounded, the treatment effect can nonetheless be identified if an instrumental variable $Z$ is available, which is independent of $Y$ conditional on $D$, $X$, and the unmeasured confounder. Intuitively, $Z$ only influences $Y$ via $D$, identifying the counterfactual relationship of interest.
This solution comes at a price: the analyst can no longer measure parameters of the entire population, such as the average treatment effect (ATE). Measuring population parameters requires a stronger assumption called selection on observables: conditional on covariates $X$, the relationship between treatment $D$ and outcome $Y$ is as good as random. Instead, the analyst can only measure parameters defined for the subpopulation of compliers, such as the local average treatment effect (LATE). A complier is an individual whose treatment status $D$ is affected by variation in the instrument $Z$. In public policy research, instruments often take the form of changes in eligibility criteria for social programs. Compliers are thus of policy interest, as they are exactly the subpopulation to be affected by eligibility changes.
To fix ideas, we provide examples with continuous outcome $Y$, binary treatment $D$, and binary instrument $Z$. Randomized assignment of a drug ($Z$) only influences patient health ($Y$) via actual consumption of the drug ($D$), identifying the counterfactual effect of the drug on health even in the scenario of imperfect compliance angrist1996identification . However, the analyst can only learn the treatment effect on the subpopulation of complier patients: those who would consume the drug if assigned and who would not consume the drug if not assigned. Charter school admission by lottery ($Z$) only influences student test scores ($Y$) via actually attending the charter school ($D$), identifying the counterfactual effect of the charter school on test scores even if there is selection bias in which students choose to accept an offer of admission angrist2010inputs ; angrist2013explaining . However, the analyst can only learn the treatment effect on the subpopulation of complier students: those who would attend the charter school if they won the lottery and who would not attend the charter school if they lost the lottery.
In the present work, we introduce a debiased machine learning (DML) approach to estimating complier parameters with high-dimensional data chernozhukov2018double ; foster2019orthogonal . In our approach, the debiasing is itself performed by machine learning, a variant called debiased machine learning via regularized Riesz representers (DML-RRR) chernozhukov2018dantzig ; chernozhukov2018learning . We present a general estimator, then specialize it to the tasks of learning LATE, average complier characteristics, and complier counterfactual outcome distributions. Counterfactual outcome distributions are particularly important in welfare analysis of schooling, subsidized training, union status, minimum wages, and transfer programs abadie_bootstrap_2002 ; abadie_instrumental_2002 .
We make three contributions. First, we extend the theory of DML-RRR pioneered by chernozhukov2018dantzig ; chernozhukov2018learning . Whereas chernozhukov2018dantzig ; chernozhukov2018learning consider parameters of the full population identified by selection on observables, we consider parameters of the complier subpopulation identified by instrumental variables. We prove our estimator is consistent, asymptotically normal, and semiparametrically efficient, and we provide simultaneous confidence bands. Second, we reinterpret a widely used algorithm for estimating complier parameters, called $\kappa$ weighting, as the Riesz representer in DML-RRR; it is in fact a component of the debiasing term. Third, we show our approach outperforms alternative approaches to estimating complier parameters, suggesting DML-RRR may be an effective paradigm in high-dimensional causal inference.
2 Related Work
Several approaches have been proposed to estimate complier parameters by DML. Both ogburn_doubly_2015 and chernozhukov2018double present a DML estimator for LATE. The justification in ogburn_doubly_2015 is via inverse propensity weighting, while the justification in chernozhukov2018double is by interpreting LATE as a ratio of ATEs. In belloni2017program , the authors present a DML estimator for counterfactual outcome distributions with simultaneous confidence bands. All of these estimators involve plugging an estimated propensity score into a denominator, which is numerically unstable. Unlike previous work, we present a general justification that covers a broad class of estimators, and we present a DML-RRR variant that eliminates the numerically unstable step of plugging in an estimated propensity score. As far as we know, ours is the first DML or DML-RRR estimator of complier characteristics. For a comparison between DML-RRR and other approaches to semiparametric estimation that use ML, namely targeted maximum likelihood van2011targeted , efficient score ning2017general , and approximate residual balancing athey2018approximate , we recommend chernozhukov2018dantzig ; chernozhukov2018learning .
Our work also relates to the literature on $\kappa$ weighting, an algorithm introduced by abadie_semiparametric_2003 . In $\kappa$ weighting, any complier parameter can be expressed as a weighted average of the corresponding population parameter. For example, LATE can be expressed as a weighted average of ATE across covariate values. Likewise, abadie_instrumental_2002 ; angrist_stand_2016 propose $\kappa$ weighting estimators of counterfactual outcome distributions. The $\kappa$ weight involves an estimated propensity score in the denominator, which is numerically unstable. Theoretically, the literature has not yet justified the use of a black-box regularized ML algorithm to learn the propensity score in high-dimensional settings. By elucidating the relationship between $\kappa$ weighting and DML, we provide this justification. Moreover, by introducing the DML-RRR variant, we are able to learn the $\kappa$ weight directly, without estimating its components or even knowing its functional form.
Finally, our paper contributes to the growing literature on instrumental variables in machine learning. Both hartford2017deep and singh2019kernel consider the problem of nonparametric instrumental variable regression, where the target parameter is the structural function $f$ that summarizes the counterfactual relationship $Y = f(D, X) + e$, where $e$ is confounding noise. In athey2019generalized , the authors further assume the function decomposes into a term that is linear in treatment and a term that depends only on covariates. Importantly, hartford2017deep ; singh2019kernel ; athey2019generalized assume that the noise term is additively separable, a model proposed by newey2003instrumental . In this setting, hartford2017deep introduce nonlinearity with neural networks, singh2019kernel do so with RKHS methods, and athey2019generalized with random forests. In our setting, $D$ and $Z$ are binary, and we do not assume additive separability of the confounding noise, a model considered by angrist1996identification . Our target parameters are functionals of the underlying regression $\gamma_0(z, x) = \mathbb{E}[V \mid Z = z, X = x]$, where $V$ is a vector of relevant random variables. Such parameters are called semiparametric. We allow black-box ML for nonlinear estimation of $\gamma_0$.
3 Problem setting and definitions
Let $W = (Y, D, Z, X)$ concatenate the random variables. $Y$ is the continuous outcome, $D$ is the binary treatment, $Z$ is the binary instrumental variable, and $X$ is the covariate. We observe $n$ i.i.d. observations $\{W_i\}_{i=1}^n$. Where possible, we suppress the index $i$ to lighten notation.
Instrumental variable identification requires an assumption expressed in terms of potential outcomes. A potential outcome is a latent random variable expressing a counterfactual outcome given a hypothetical intervention. We recommend imbens2015causal ; peters2017elements ; hernan2019causal for a clear introduction to this framework for causal inference. Following the notation of angrist1996identification , we denote by $Y^{(d,z)}$ the potential outcome under the intervention $D = d$ and $Z = z$. We denote by $D^{(z)}$ the potential treatment under the intervention $Z = z$. Compliers are the subpopulation for whom $D^{(1)} > D^{(0)}$.
We now formalize our causal assumption about the instrument $Z$, quoting angrist1996identification . This prior knowledge, described informally in the introduction, allows us to define and recover the counterfactual effect of treatment on outcome for compliers.
Assumption 1 (Identification).
Assume
1. independence: $\{Y^{(d,z)}, D^{(z)}\} \perp Z \mid X$ for $d, z \in \{0, 1\}$;
2. exclusion: $Y^{(d,1)} = Y^{(d,0)}$ almost surely, for $d \in \{0, 1\}$;
3. overlap: $0 < P(Z = 1 \mid X) < 1$ almost surely;
4. monotonicity: $P(D^{(1)} \geq D^{(0)} \mid X) = 1$ and $P(D^{(1)} > D^{(0)} \mid X) > 0$.
The independence condition states that the instrument $Z$ is as good as randomly assigned conditional on covariates $X$. The exclusion condition imposes that the instrument $Z$ only affects the outcome $Y$ via the treatment $D$. We can therefore simplify notation: $Y^{(d)} = Y^{(d,1)} = Y^{(d,0)}$. The overlap condition ensures that there are no covariate values for which the instrument is deterministic. The monotonicity condition rules out the possibility of defiers: individuals who would always pursue the opposite treatment status from their assignment.
Definition 1 (Complier parameters).
We define the following complier parameters.
1. LATE: $\theta_0 = \mathbb{E}[Y^{(1)} - Y^{(0)} \mid D^{(1)} > D^{(0)}]$.
2. Average complier characteristics: $\theta_0 = \mathbb{E}[f(X) \mid D^{(1)} > D^{(0)}]$ for characteristics $f(X)$.
3. Complier counterfactual outcome distributions: $\theta_0^d(y) = P(Y^{(d)} \leq y \mid D^{(1)} > D^{(0)})$, where $d \in \{0, 1\}$ and $y$ ranges over a grid of values.
Using the notation of chernozhukov2018learning , we denote the conditional expectation function (CEF) of a random vector $V$ conditional on $(Z, X)$ as $\gamma_0(z, x) = \mathbb{E}[V \mid Z = z, X = x]$. The random vector $V$ is observable and depends on the complier parameter of interest; we specify its components for LATE, complier characteristics, and counterfactual outcome distributions in Theorem 1. We denote the propensity score by $\pi_0(x) = P(Z = 1 \mid X = x)$ and the classic Horvitz-Thompson weight by $\alpha_0(z, x) = \frac{z}{\pi_0(x)} - \frac{1 - z}{1 - \pi_0(x)}$. Lastly, we denote by $\|v\|_q$ the $\ell_q$ norm of a vector $v$, and we denote by $\|V\|_{P,q}$ the $L^q$ norm of a random variable $V$, i.e. $\|V\|_{P,q} = (\mathbb{E}|V|^q)^{1/q}$.
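As a quick numerical illustration of the Riesz representer property of the Horvitz-Thompson weight (our own sketch; the logistic propensity score and the test function $\gamma$ are hypothetical choices, not part of the paper):

```python
import numpy as np

# Check E[alpha_0(Z,X) gamma(Z,X)] = E[gamma(1,X) - gamma(0,X)] by Monte Carlo.
rng = np.random.default_rng(0)
n = 200_000
X = rng.uniform(-1, 1, n)
pi = 1 / (1 + np.exp(-X))                 # assumed propensity score pi_0(X)
Z = (rng.uniform(size=n) < pi).astype(float)

def gamma(z, x):                          # an arbitrary square-integrable test function
    return z * x + np.cos(x)

alpha = Z / pi - (1 - Z) / (1 - pi)       # Horvitz-Thompson weight alpha_0(Z,X)
lhs = np.mean(alpha * gamma(Z, X))        # E_n[alpha * gamma]
rhs = np.mean(gamma(1.0, X) - gamma(0.0, X))
print(abs(lhs - rhs))                     # small, up to Monte Carlo error
```

The two sample averages agree up to Monte Carlo error, as the representer property predicts.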
4 Learning problem and algorithm
DML is a method-of-moments approach to estimation with debiasing and strong statistical guarantees. We review the DML algorithm: in stage 1, estimate the CEF $\gamma_0$ and an additional nuisance parameter called the Riesz representer (RR); in stage 2, estimate the parameter of interest using the method of moments with a debiased moment function and the stage 1 estimates. We extend DML to estimate complier parameters. Specifically, we demonstrate how the identification assumption, expressed in terms of potential outcomes, implies a moment function and a corresponding debiased moment function. Its debiasing term is precisely the normalized $\kappa$ weight.
4.1 DML
Consider a causal parameter $\theta_0$ implicitly defined by the moment condition $\mathbb{E}[m(W; \theta_0, \gamma_0)] = 0$.
Here $m$ is called the moment function, and it defines the causal parameter $\theta_0$. $\gamma_0$ is the CEF, a nuisance parameter that must be estimated in order to estimate the parameter of interest $\theta_0$.
The plug-in approach involves estimating $\gamma_0$ in stage 1 by some black-box ML algorithm, and estimating $\theta_0$ in stage 2 by the method of moments with moment function $m$. The plug-in approach is badly biased chernozhukov2018double .
The DML approach uses a more sophisticated moment function newey1994asymptotic : $\psi(W; \theta, \gamma, \alpha) = m(W; \theta, \gamma) + \phi(W; \gamma, \alpha)$.
The term $\phi$ is called the debiasing term. We derive $\phi$ such that $\psi$ is doubly robust (DR). In particular, we derive $\phi$ such that
$\mathbb{E}[\psi(W; \theta_0, \gamma, \alpha_0)] = 0$ for any $\gamma$ and $\mathbb{E}[\psi(W; \theta_0, \gamma_0, \alpha)] = 0$ for any $\alpha$, so stage 2 estimation of $\theta_0$ by the method of moments with moment function $\psi$ is asymptotically invariant to estimation error of either $\gamma_0$ or $\alpha_0$. In this sense, introducing the additional term $\phi$ serves to debias the original moment function $m$.
Importantly, the DR moment function introduces an additional nuisance parameter $\alpha_0$, a component of the RR, which must be estimated in stage 1. Whereas DML involves estimating $\alpha_0$ by estimating its components and knowing its functional form, we estimate $\alpha_0$ directly by DML-RRR.
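The double robustness can be illustrated numerically. In the sketch below (an assumed design of our own, not the paper's), the plug-in moment with a deliberately misspecified CEF is biased, while adding the debiasing term with the true Riesz representer restores the target:

```python
import numpy as np

# Illustrative design: gamma_0(z,x) = z + x, logistic propensity, theta_0 = 1.
rng = np.random.default_rng(1)
n = 400_000
X = rng.uniform(-1, 1, n)
pi = 1 / (1 + np.exp(-X))
Z = (rng.uniform(size=n) < pi).astype(float)
Y = Z + X + rng.normal(scale=0.5, size=n)

alpha = Z / pi - (1 - Z) / (1 - pi)             # true Riesz representer

def gamma_bad(z, x):                            # deliberately misspecified CEF
    return 0.5 * z + x

plug_in = np.mean(gamma_bad(1.0, X) - gamma_bad(0.0, X))      # biased: ~0.5
debiased = plug_in + np.mean(alpha * (Y - gamma_bad(Z, X)))   # corrected: ~1.0
print(plug_in, debiased)
```

The correction term, evaluated with the true representer, absorbs the error in the misspecified CEF, exactly as the DR property promises.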
4.2 DML for complier parameters
We derive the DR moment functions for complier parameters. We show that these moment functions share a common structure.
Theorem 1 (DR moment functions).
Under Assumption 1, the DR moment functions for LATE, average complier characteristics, and complier counterfactual outcome distributions are of the form
$\psi(W; \theta, \gamma, \alpha) = \gamma(1, X) - \gamma(0, X) + \alpha(Z, X)\,\{V - \gamma(Z, X)\},$
where $\gamma_0(z, x) = \mathbb{E}[V \mid Z = z, X = x]$ and
$V = Y - \theta D$
for LATE, and
$V = D\,\{f(X) - \theta\}$
for complier characteristics, and
$V = D\,\{\mathbb{1}(Y \leq y) - \theta\}$ or $V = -(1 - D)\,\{\mathbb{1}(Y \leq y) - \theta\}$
for complier counterfactual distributions of $Y^{(1)}$ and $Y^{(0)}$, respectively, and $\alpha_0(z, x) = \frac{z}{\pi_0(x)} - \frac{1 - z}{1 - \pi_0(x)}$.
Formally, $\alpha_0$ is the RR to the continuous linear functional $\gamma \mapsto \mathbb{E}[\gamma(1, X) - \gamma(0, X)]$, i.e. $\mathbb{E}[\gamma(1, X) - \gamma(0, X)] = \mathbb{E}[\alpha_0(Z, X)\,\gamma(Z, X)]$. Indeed, this is immediate from the classic Horvitz-Thompson derivation, which shows that reweighting by $\alpha_0$ recovers the counterfactual means $\mathbb{E}[\gamma(1, X)]$ and $\mathbb{E}[\gamma(0, X)]$.
4.3 Algorithm
In chernozhukov2016locally , the authors show it is data-efficient and theoretically elegant to use sample splitting in DML bickel1982adaptive ; schick1986asymptotically . The DML-RRR algorithm of chernozhukov2018dantzig ; chernozhukov2018learning is as follows.
Algorithm 1 (DML).
Partition the sample into $L$ subsets $\{I_\ell\}_{\ell=1}^L$.
1. For each $\ell$, estimate $\hat\gamma_{-\ell}$ and $\hat\alpha_{-\ell}$ from observations not in $I_\ell$.
2. Estimate $\hat\theta$ as the solution to $\frac{1}{n}\sum_{\ell=1}^{L}\sum_{i \in I_\ell} \psi(W_i; \hat\theta, \hat\gamma_{-\ell}, \hat\alpha_{-\ell}) = 0$.
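A minimal cross-fitting sketch for the LATE case may clarify the algorithm. All design choices below (OLS nuisances on a small dictionary, a linear-probability propensity fit, the simulated data with true LATE equal to 1.5) are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 50_000, 5

# Simulated IV data with monotone compliance (hypothetical design).
X = rng.uniform(0, 1, n)
pi0 = 0.3 + 0.4 * X                       # true propensity P(Z=1|X)
Z = (rng.uniform(size=n) < pi0).astype(float)
U = rng.uniform(size=n)                   # latent type
D0, D1 = (U < 0.2).astype(float), (U < 0.8).astype(float)
D = Z * D1 + (1 - Z) * D0
Y = D * (1 + X) + U + rng.normal(scale=0.1, size=n)
# compliers (D1 > D0) are independent of X here, so LATE = E[1 + X] = 1.5

def basis(z, x):                          # small dictionary for OLS nuisances
    z = z * np.ones_like(x)
    return np.column_stack([np.ones_like(x), z, x, z * x])

folds = np.arange(n) % K
num, den = np.zeros(n), np.zeros(n)
for k in range(K):
    tr, te = folds != k, folds == k
    B_tr = basis(Z[tr], X[tr])
    cy = np.linalg.lstsq(B_tr, Y[tr], rcond=None)[0]          # gamma_Y
    cd = np.linalg.lstsq(B_tr, D[tr], rcond=None)[0]          # gamma_D
    Xp = np.column_stack([np.ones(tr.sum()), X[tr]])
    cp = np.linalg.lstsq(Xp, Z[tr], rcond=None)[0]            # propensity fit
    pi = np.clip(np.column_stack([np.ones(te.sum()), X[te]]) @ cp, 0.05, 0.95)
    alpha = Z[te] / pi - (1 - Z[te]) / (1 - pi)               # HT representer
    B1, B0, B = basis(1.0, X[te]), basis(0.0, X[te]), basis(Z[te], X[te])
    num[te] = (B1 - B0) @ cy + alpha * (Y[te] - B @ cy)       # DR numerator
    den[te] = (B1 - B0) @ cd + alpha * (D[te] - B @ cd)       # DR denominator

late_hat = num.mean() / den.mean()
print(late_hat)
```

Each fold's debiased moments are evaluated only on held-out observations, with nuisances fit on the complement, matching the sample-splitting structure above.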
Our theoretical guarantees apply to Lasso or Dantzig selector estimators of $\alpha_0$, originally presented in chernozhukov2018dantzig and chernozhukov2018learning , respectively. In what follows, we restrict attention to the Lasso.
Consider the projection of $\alpha_0$ onto the $p$-dimensional dictionary $b(z, x)$, with coefficient $\rho_0$. Extending the RR result component-wise,
$\mathbb{E}[b(Z, X)\, b(Z, X)']\, \rho_0 = \mathbb{E}[\alpha_0(Z, X)\, b(Z, X)] = \mathbb{E}[b(1, X) - b(0, X)].$
With $\ell_1$ regularization, the objective becomes
$\min_\rho \ \mathbb{E}[\{\alpha_0(Z, X) - b(Z, X)'\rho\}^2] + 2\lambda \|\rho\|_1.$
Expanding the square, ignoring terms without $\rho$, and using the RR result, this is equivalent to
$\min_\rho \ -2\,\mathbb{E}[b(1, X) - b(0, X)]'\rho + \rho'\,\mathbb{E}[b(Z, X)\, b(Z, X)']\,\rho + 2\lambda \|\rho\|_1.$
The empirical analogue to the above expression yields an estimator $\hat\rho$ of $\rho_0$.
Algorithm 2 (RRR).
For observations not in $I_\ell$:
1. Calculate the matrix $\hat{G}_{-\ell} = \frac{1}{|I_\ell^c|}\sum_{i \notin I_\ell} b(Z_i, X_i)\, b(Z_i, X_i)'$.
2. Calculate the vector $\hat{M}_{-\ell} = \frac{1}{|I_\ell^c|}\sum_{i \notin I_\ell} \{b(1, X_i) - b(0, X_i)\}$.
3. Set $\hat\alpha_{-\ell}(z, x) = b(z, x)'\hat\rho_{-\ell}$, where $\hat\rho_{-\ell} = \arg\min_\rho \big\{{-2}\hat{M}_{-\ell}'\rho + \rho'\hat{G}_{-\ell}\,\rho + 2\lambda\|\rho\|_1\big\}$.
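The resulting Lasso-type program can be solved by proximal gradient descent. The sketch below (our own; the dictionary, the simulated design, and the ISTA solver are illustrative assumptions) checks the approximate first-order "balancing" property $\hat{G}\hat\rho \approx \hat{M}$ implied by the objective:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000
X = rng.uniform(-1, 1, n)
pi = 1 / (1 + np.exp(-X))                 # assumed propensity score
Z = (rng.uniform(size=n) < pi).astype(float)

def b(z, x):                              # polynomial dictionary b(z, x)
    z = z * np.ones_like(x)
    return np.column_stack([np.ones_like(x), z, x, z * x,
                            x**2, z * x**2, x**3, z * x**3])

B = b(Z, X)
M = (b(1.0, X) - b(0.0, X)).mean(axis=0)  # hat M
G = B.T @ B / n                           # hat G

def rrr_lasso(G, M, lam, iters=20_000):
    """ISTA on the objective rho' G rho - 2 M' rho + 2 lam ||rho||_1."""
    step = 1.0 / (2 * np.linalg.eigvalsh(G)[-1])   # 1 / Lipschitz constant
    rho = np.zeros(len(M))
    for _ in range(iters):
        rho = rho - step * 2 * (G @ rho - M)       # gradient of smooth part
        rho = np.sign(rho) * np.maximum(np.abs(rho) - 2 * lam * step, 0.0)
    return rho

rho = rrr_lasso(G, M, lam=1e-4)
alpha_hat = B @ rho                       # estimated Riesz representer
# balancing check: E_n[alpha_hat * gamma(Z,X)] ~ E_n[gamma(1,X) - gamma(0,X)]
w = rng.normal(size=len(M))               # random gamma in the span of b
lhs = np.mean(alpha_hat * (B @ w))
rhs = np.mean((b(1.0, X) - b(0.0, X)) @ w)
print(lhs, rhs)
```

At a small penalty level, the subgradient condition forces $\|\hat{G}\hat\rho - \hat{M}\|_\infty \leq \lambda$, so the estimated representer approximately balances every function in the span of the dictionary.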
Likewise, we can project $\gamma_0$ onto the $p$-dimensional dictionary $b$ and estimate it by Lasso. Our theoretical results are agnostic about the choice of estimator $\hat\gamma$; it may be this estimator or any other black-box ML algorithm satisfying the rate condition specified in Assumption 4.
Suppose we wish to form a simultaneous confidence band for the components of $\theta_0$, particularly relevant for the estimation of counterfactual outcome distributions based on a grid of values $\{y_j\}$. The following algorithm allows us to do so from some estimator $\hat\Sigma$ for the asymptotic variance of $\hat\theta$.
Algorithm 3 (Simultaneous confidence band).
Given $\hat\Sigma$,
1. Calculate $\hat\sigma_j = \sqrt{\hat\Sigma_{jj}/n}$, where $\hat\Sigma_{jj}$ is the diagonal entry of $\hat\Sigma$ corresponding to grid value $y_j$.
2. Sample $g \sim \mathcal{N}(0, \hat\Sigma)$ repeatedly and compute the critical value $c$ as the $(1 - a)$ quantile of the sampled $\max_j |g_j| / \hat\Sigma_{jj}^{1/2}$.
3. Form the confidence band $\big[\hat\theta_j - c\,\hat\sigma_j,\ \hat\theta_j + c\,\hat\sigma_j\big]$ for each grid value $y_j$.
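Algorithm 3 amounts to a sup-t band computed from Gaussian draws with covariance $\hat\Sigma$. A minimal sketch with illustrative inputs (the identity covariance and the point estimates are placeholders, not results from the paper):

```python
import numpy as np

def simultaneous_band(theta_hat, Sigma_hat, n, a=0.05, draws=10_000, seed=0):
    """Sup-t confidence band from an estimated asymptotic covariance."""
    rng = np.random.default_rng(seed)
    d = len(theta_hat)
    se = np.sqrt(np.diag(Sigma_hat) / n)
    # draw g ~ N(0, Sigma_hat) and take the sup of studentized coordinates
    g = rng.multivariate_normal(np.zeros(d), Sigma_hat, size=draws)
    sup_t = np.abs(g / np.sqrt(np.diag(Sigma_hat))).max(axis=1)
    c = np.quantile(sup_t, 1 - a)
    return theta_hat - c * se, theta_hat + c * se, c

theta = np.array([0.1, 0.2, 0.3])         # placeholder point estimates
Sigma = np.eye(3)                          # placeholder covariance estimate
lo, hi, c = simultaneous_band(theta, Sigma, n=10_000)
print(c)
```

The critical value $c$ exceeds the pointwise normal quantile 1.96, which is what makes the band simultaneous rather than pointwise.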
5 Consistency and asymptotic normality
We adapt the assumptions of chernozhukov2018learning to our setting. First, we place weak assumptions on the dictionary $b$, the propensity score $\pi_0$, the conditional variance, and the Jacobian. We allow the bound on the dictionary to be a sequence $C_n$ that increases in $n$.
Assumption 2 (Bounded dictionary).
$\exists\, C_n$ s.t. $\|b(Z, X)\|_\infty \leq C_n$ a.s.
Assumption 3 (Regularity).
Assume
1. $c \leq \pi_0(X) \leq 1 - c$ a.s. for some $c > 0$;
2. the conditional variance $\mathrm{Var}(V \mid Z, X)$ is bounded;
3. the Jacobian $\partial_\theta\, \mathbb{E}[\psi(W; \theta, \gamma_0, \alpha_0)]\big|_{\theta = \theta_0}$ is nonsingular.
Next we state our rate assumption on the black-box estimator $\hat\gamma$. We allow $\hat\gamma$ to converge at a rate slower than $n^{-1/2}$.
Assumption 4 (CEF rate).
$\exists$ a sequence $d_n \to 0$ s.t. $\|\hat\gamma - \gamma_0\|_{P,2} = O_P(d_n)$.
Let $\rho_n$ denote the coefficient of the projection of $\alpha_0$ onto the dictionary $b$. We articulate assumptions required for convergence of $\hat\rho$ under two regimes: the regime in which $\rho_n$ is dense and the regime in which $\rho_n$ is sparse.
Assumption 5 (Dense RR).
Assume $\exists\, \rho_n$ s.t. the approximation error $\|\alpha_0 - b'\rho_n\|_{P,2}$ vanishes and $\|\rho_n\|_1$ grows sufficiently slowly.
Assumption 5 is a statement about the quality of approximation of $\alpha_0$ by the dictionary $b$. It is satisfied if, for example, $\alpha_0$ is a linear combination of the dictionary elements.
Assumption 6 (Sparse RR).
Assume
1. $\exists\, \rho_n$ with at most $s$ nonzero elements s.t. the approximation error $\|\alpha_0 - b'\rho_n\|_{P,2}$ vanishes sufficiently fast;
2. $G = \mathbb{E}[b(Z, X)\, b(Z, X)']$ is nonsingular with largest eigenvalue uniformly bounded in $n$;
3. $\exists\, k$ s.t. the restricted eigenvalue of $G$ is bounded away from zero over $\{\delta : \|\delta_{T^c}\|_1 \leq k\, \|\delta_T\|_1\}$, where $T$ is the support of $\rho_n$.
Assumption 6 is a statement about the quality of approximation of $\alpha_0$ by a subset of the dictionary $b$. It is satisfied if, for example, $\rho_n$ is sparse or approximately sparse chernozhukov2018learning . The third condition is the population version of the restricted eigenvalue condition of bickel2009simultaneous . Finally, we state sufficient conditions on the sequences of constants as they relate to the regularization sequence $\lambda$ and the CEF convergence rate, as the dictionary dimension $p$ and the sample size $n$ increase.
Assumption 7 (Sufficient conditions).
Assume
We quote stage 1 convergence guarantees for the estimator $\hat\rho$ in Algorithm 2 from chernozhukov2018learning . We obtain a slow rate in the dense regime and a fast rate in the sparse regime.
We now present the main theorem of this paper. We prove our DML-RRR estimator for complier parameters is consistent and asymptotically normal, appealing to the theory in chernozhukov2016locally to generalize the main result in chernozhukov2018learning .
Assumption 8.
$\theta_0 \in \Theta$, a compact parameter space.
Theorem 4 (DML-RRR asymptotic normality).
It follows that $\hat\theta$ is semiparametrically efficient chernozhukov2016locally . Finally, we prove the validity of simultaneous confidence bands for the counterfactual distribution estimators.
6 Experiments
We compare the performance of DML-RRR with the original DML chernozhukov2018double and $\kappa$ weighting abadie_semiparametric_2003 in simulations. We focus on counterfactual distributions as our choice of complier parameter. We then apply DML-RRR to real-world data to estimate the counterfactual distribution of employee net financial assets with and without 401(k) participation.
6.1 Simulation
We apply DML-RRR, DML, and $\kappa$ weighting to a counterfactual distribution design detailed in Appendix 8.6, which also specifies the sample size and dictionary dimension of each simulation. Both DML and $\kappa$ weighting involve dividing by the estimated propensity score $\hat\pi(X)$. To improve numerical stability for the DML estimator, we impose trimming according to belloni2017program , dropping observations with extreme values of $\hat\pi(X_i)$.
For each algorithm, we implement 500 simulations and visualize the mean as well as the 2.5% and 97.5% quantiles for each value $y$ in the grid. Figure 1 summarizes the results: DML-RRR performs best, though its improvement over DML is modest. However, one advantage of DML-RRR is that it does not require ad-hoc trimming. $\kappa$ weighting performs worst, perhaps because it does not use a regularized ML estimator of the propensity score.
6.2 Effect of 401(k) on assets
Next, we use DML-RRR to investigate the effect of 401(k) participation on the distribution of net financial assets. We follow the identification strategy of poterba1994 ; poterba1995 . The authors assume that when the 401(k) was introduced, workers ignored whether a given job offered a 401(k) and instead made employment decisions based on income and other observable job characteristics; conditional on income and job characteristics, 401(k) eligibility was therefore exogenous at the time.
We use data from the 1991 US Survey of Income and Program Participation, studied in abadie_semiparametric_2003 ; chernozhukov2004effects ; chernozhukov_iv_2005 ; ogburn_doubly_2015 ; belloni2017program . We use sample selection and variable construction as in chernozhukov2004effects . The outcome $Y$ is net financial assets, defined as the sum of IRA balances, 401(k) balances, checking accounts, US saving bonds, other interest-earning accounts, stocks, mutual funds, and other interest-earning assets, minus non-mortgage debt. The treatment $D$ is participation in the 401(k) plan. The instrument $Z$ is eligibility to enroll in a 401(k) plan. The covariates $X$ are age, income, years of education, family size, marital status, two-earner status, benefit pension status, IRA participation, and homeownership.
The data include $n$ i.i.d. observations. We follow belloni2017program in the choice of grid points and the dictionary $b$. We take the grid to be the 5th through 95th percentiles of net financial assets, a total of 91 different values of $y$. We consider two dictionaries: low-$p$ and very-high-$p$. See Appendix 8.7 for further details on the dictionaries and the DML-RRR implementation.
Figure 2 visualizes point estimates and simultaneous 95% confidence bands. We find that 401(k) participation significantly shifts out the distribution of net financial assets, consistent with results reported in belloni2017program . Moreover, the DML-RRR algorithm is robust in the high-dimensional setting, yielding similar results in the low-$p$ and very-high-$p$ specifications.
7 Conclusion
We extend DML-RRR to the task of learning causal parameters from confounded, high-dimensional data. DML-RRR is easily implemented and semiparametrically efficient. As a contribution to the IV literature, we reinterpret the $\kappa$ weight as the Riesz representer in the problem of learning complier parameters. As a contribution to the DML literature, we generalize the theory of DML-RRR and provide simultaneous confidence bands. In simulations, DML-RRR modestly outperforms DML and $\kappa$ weighting and eliminates the ad-hoc step of trimming, suggesting DML-RRR may be an effective paradigm in high-dimensional causal inference.
Acknowledgments
We are grateful to Alberto Abadie, Anish Agarwal, Isaiah Andrews, Victor Chernozhukov, Anna Mikusheva, Whitney Newey, and Suhas Vijaykumar.
References
 (1) Alberto Abadie. Bootstrap tests for distributional treatment effects in instrumental variable models. Journal of the American Statistical Association, 97(457):284–292, March 2002.
 (2) Alberto Abadie. Semiparametric instrumental variable estimation of treatment response models. Journal of Econometrics, 113(2):231–263, April 2003.
 (3) Alberto Abadie, Joshua Angrist, and Guido Imbens. Instrumental variables estimates of the effect of subsidized training on the quantiles of trainee earnings. Econometrica, 70(1):91–117, 2002.
 (4) Joshua D. Angrist, Sarah R. Cohodes, Susan M. Dynarski, Parag A. Pathak, and Christopher R. Walters. Stand and deliver: Effects of Boston’s charter high schools on college preparation, entry, and choice. Journal of Labor Economics, 34(2):275–318, January 2016.
 (5) Joshua D Angrist, Susan M Dynarski, Thomas J Kane, Parag A Pathak, and Christopher R Walters. Inputs and impacts in charter schools: KIPP Lynn. American Economic Review, 100(2):239–43, 2010.
 (6) Joshua D Angrist, Guido W Imbens, and Donald B Rubin. Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434):444–455, 1996.
 (7) Joshua D Angrist, Parag A Pathak, and Christopher R Walters. Explaining charter school effectiveness. American Economic Journal: Applied Economics, 5(4):1–27, 2013.
 (8) Susan Athey, Guido W Imbens, and Stefan Wager. Approximate residual balancing: Debiased inference of average treatment effects in high dimensions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(4):597–623, 2018.
 (9) Susan Athey, Julie Tibshirani, and Stefan Wager. Generalized random forests. The Annals of Statistics, 47(2):1148–1178, 2019.
 (10) Alexandre Belloni, Victor Chernozhukov, Ivan Fernández-Val, and Christian Hansen. Program evaluation and causal inference with high-dimensional data. Econometrica, 85(1):233–298, 2017.
 (11) Peter J Bickel. On adaptive estimation. The Annals of Statistics, pages 647–671, 1982.
 (12) Peter J Bickel, Ya’acov Ritov, and Alexandre B Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.
 (13) Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68, 2018.
 (14) Victor Chernozhukov, Denis Chetverikov, and Kengo Kato. Gaussian approximations and multiplier bootstrap for maxima of sums of highdimensional random vectors. The Annals of Statistics, 41(6):2786–2819, December 2013.
 (15) Victor Chernozhukov, Juan Carlos Escanciano, Hidehiko Ichimura, Whitney K Newey, and James M Robins. Locally robust semiparametric estimation. arXiv:1608.00033, 2016.
 (16) Victor Chernozhukov and Christian Hansen. The effects of 401(k) participation on the wealth distribution: An instrumental quantile regression analysis. Review of Economics and Statistics, 86(3):735–751, 2004.
 (17) Victor Chernozhukov and Christian Hansen. An IV model of quantile treatment effects. Econometrica, 73(1):245–261, 2005.
 (18) Victor Chernozhukov, Whitney Newey, James Robins, and Rahul Singh. Double/debiased machine learning of global and local parameters using regularized Riesz representers. arXiv:1802.08667, 2018.
 (19) Victor Chernozhukov, Whitney K Newey, and Rahul Singh. Learning L2 continuous regression functionals via regularized Riesz representers. arXiv:1809.05224, 2018.
 (20) Dylan J Foster and Vasilis Syrgkanis. Orthogonal statistical learning. arXiv:1901.09036, 2019.
 (21) Jason Hartford, Greg Lewis, Kevin Leyton-Brown, and Matt Taddy. Deep IV: A flexible approach for counterfactual prediction. In International Conference on Machine Learning, pages 1414–1423, 2017.
 (22) Miguel A Hernan and James M Robins. Causal Inference. CRC, 2019.
 (23) Guido W Imbens and Donald B Rubin. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.
 (24) Whitney K Newey. The asymptotic variance of semiparametric estimators. Econometrica, pages 1349–1382, 1994.
 (25) Whitney K Newey and Daniel McFadden. Large sample estimation and hypothesis testing. Handbook of Econometrics, 4:2111–2245, 1994.
 (26) Whitney K Newey and James L Powell. Instrumental variable estimation of nonparametric models. Econometrica, 71(5):1565–1578, 2003.
 (27) Yang Ning and Han Liu. A general theory of hypothesis tests and confidence regions for sparse high dimensional models. The Annals of Statistics, 45(1):158–195, 2017.
 (28) Elizabeth L. Ogburn, Andrea Rotnitzky, and James M. Robins. Doubly robust estimation of the local average treatment effect curve. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77(2):373–396, 2015.
 (29) Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms. MIT press, 2017.
 (30) James M Poterba and Steven F Venti. 401(k) plans and tax-deferred saving. In Studies in the Economics of Aging, pages 105–142. University of Chicago Press, 1994.
 (31) James M Poterba, Steven F Venti, and David A Wise. Do 401(k) contributions crowd out other personal saving? Journal of Public Economics, 58(1):1–32, 1995.
 (32) Anton Schick. On asymptotically efficient estimation in semiparametric models. The Annals of Statistics, 14(3):1139–1151, 1986.
 (33) Rahul Singh, Maneesh Sahani, and Arthur Gretton. Kernel instrumental variable regression. arXiv:1906.00232, 2019.
 (34) James H Stock and Francesco Trebbi. Retrospectives: Who invented instrumental variable regression? Journal of Economic Perspectives, 17(3):177–194, 2003.
 (35) Mark J Van der Laan and Sherri Rose. Targeted Learning: Causal Inference for Observational and Experimental Data. Springer Science & Business Media, 2011.
8 Appendix
8.1 Notation glossary
Let $W = (Y, D, Z, X)$ concatenate the random variables. $Y$ is the continuous outcome, $D$ is the binary treatment, $Z$ is the binary instrumental variable, and $X$ is the covariate. We observe $n$ i.i.d. observations $\{W_i\}_{i=1}^n$. Where possible, we suppress the index $i$ to lighten notation.
Following the notation of angrist1996identification , we denote by $Y^{(d,z)}$ the potential outcome under the intervention $D = d$ and $Z = z$. Due to Assumption 1, we can simplify notation: $Y^{(d)} = Y^{(d,1)} = Y^{(d,0)}$. We denote by $D^{(z)}$ the potential treatment under the intervention $Z = z$. Compliers are the subpopulation for whom $D^{(1)} > D^{(0)}$.
Using the notation of chernozhukov2018learning , we denote the conditional expectation function (CEF) of a random vector $V$ conditional on $(Z, X)$ as
$\gamma_0(z, x) = \mathbb{E}[V \mid Z = z, X = x].$
The random vector $V$ is observable and depends on the complier parameter of interest; we specify its components for LATE, complier characteristics, and counterfactual outcome distributions in Theorem 1.
We denote the propensity score by $\pi_0(x) = P(Z = 1 \mid X = x)$. We denote the classic Horvitz-Thompson weight by $\alpha_0(z, x) = \frac{z}{\pi_0(x)} - \frac{1 - z}{1 - \pi_0(x)}$.
We denote by $\|v\|_q$ the $\ell_q$ norm of a vector $v$. We denote by $\|V\|_{P,q}$ the $L^q$ norm of a random variable $V$, i.e. $\|V\|_{P,q} = (\mathbb{E}|V|^q)^{1/q}$. For a random vector $V$, we slightly abuse notation by writing $\|V\|_{P,q}$ for the vector of component-wise norms. Likewise, we write $|V|$ for the element-wise absolute value. Finally, we denote the true parameter value by $\theta_0 \in \Theta$, where $\Theta$ is some compact parameter space.
8.2 Identification
We review the derivation of the classic Horvitz-Thompson weight, relate DML-RRR to $\kappa$ weighting, prove a general identification result, and specialize this result to LATE, complier characteristics, and counterfactual outcome distributions.
Proposition 1.
$\alpha_0$ is the RR to the continuous linear functional $\gamma \mapsto \mathbb{E}[\gamma(1, X) - \gamma(0, X)]$, i.e. $\mathbb{E}[\alpha_0(Z, X)\,\gamma(Z, X)] = \mathbb{E}[\gamma(1, X) - \gamma(0, X)]$ for all $\gamma$ with $\|\gamma\|_{P,2} < \infty$.
Proof.
Observe that
$\mathbb{E}\left[\frac{Z}{\pi_0(X)}\,\gamma(Z, X)\right] = \mathbb{E}\left[\frac{\mathbb{E}[Z \mid X]}{\pi_0(X)}\,\gamma(1, X)\right] = \mathbb{E}[\gamma(1, X)],$
and likewise
$\mathbb{E}\left[\frac{1 - Z}{1 - \pi_0(X)}\,\gamma(Z, X)\right] = \mathbb{E}[\gamma(0, X)].$
In summary, we can write
$\mathbb{E}[\alpha_0(Z, X)\,\gamma(Z, X)] = \mathbb{E}[\gamma(1, X)] - \mathbb{E}[\gamma(0, X)].$
∎
Definition 2.
Define
$\kappa_0(W) = 1 - \frac{D(1 - Z)}{1 - \pi_0(X)} - \frac{(1 - D)\,Z}{\pi_0(X)}.$
These are the $\kappa$ weights introduced in abadie_semiparametric_2003 .
Proposition 2.
The $\kappa$ weights can be rewritten in terms of the Horvitz-Thompson weight as
$\kappa_0(W) = D\,\alpha_0(Z, X) + 1 - \frac{Z}{\pi_0(X)}.$
Proof.
Direct calculation:
$1 - \frac{D(1 - Z)}{1 - \pi_0(X)} - \frac{(1 - D)\,Z}{\pi_0(X)} = 1 - \frac{Z}{\pi_0(X)} + D\left(\frac{Z}{\pi_0(X)} - \frac{1 - Z}{1 - \pi_0(X)}\right) = 1 - \frac{Z}{\pi_0(X)} + D\,\alpha_0(Z, X).$
∎
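As a numerical sanity check of the $\kappa$ weights (our own sketch on a simulated design, not the paper's), one can verify Abadie's identity that the weights average to the complier share and recover average complier characteristics:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500_000
X = rng.uniform(0, 1, n)
pi = 0.3 + 0.4 * X                      # true propensity P(Z=1|X)
Z = (rng.uniform(size=n) < pi).astype(float)
U = rng.uniform(size=n)                 # monotone potential treatments
D0 = (U < 0.2).astype(float)
D1 = (U < 0.2 + 0.6 * X).astype(float)  # compliers: 0.2 <= U < 0.2 + 0.6 X
D = Z * D1 + (1 - Z) * D0

# kappa_0(W) = 1 - D(1-Z)/(1-pi) - (1-D)Z/pi
kappa = 1 - D * (1 - Z) / (1 - pi) - (1 - D) * Z / pi
p_complier = np.mean(D1 > D0)           # true complier share
print(kappa.mean(), p_complier)         # weights average to the complier share
# average complier characteristic E[X | complier] via kappa weighting
print((kappa * X).mean() / kappa.mean(), X[D1 > D0].mean())
```

The $\kappa$-weighted averages match the oracle quantities computed from the latent compliance types, up to Monte Carlo error.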
Theorem 6.
Suppose Assumption 1 holds. Let $g$ be a measurable, real-valued function s.t. the moments below are finite for all $\theta$ in the parameter space.

If is defined by the moment condition , let

If is defined by the moment condition , let

If is defined by the moment condition , let
Then the DR moment function for $\theta_0$ is of the form
where
Proof.
Consider the first case. Under Assumption 1, we can appeal to (abadie_semiparametric_2003 , Theorem 3.1).
Hence the complier moment condition can be rewritten as a moment of observables,
appealing to Assumption 1, Proposition 2, and the fact that $\alpha_0$ is the RR for $\gamma \mapsto \mathbb{E}[\gamma(1, X) - \gamma(0, X)]$. Likewise for the second and third cases. ∎
Proof of Theorem 1.
Suppose we can decompose $V = \tilde{V} + f(X)$ for some function $f$ that does not depend on $(Y, D, Z)$. Then we can replace $V$ with $\tilde{V}$ without changing the moment function: the CEF satisfies $\gamma_0(z, x) = \tilde\gamma_0(z, x) + f(x)$, and hence $\gamma_0(1, X) - \gamma_0(0, X) = \tilde\gamma_0(1, X) - \tilde\gamma_0(0, X)$ and $V - \gamma_0(Z, X) = \tilde{V} - \tilde\gamma_0(Z, X)$. Whenever we use this reasoning, we write $V \simeq \tilde{V}$.

For LATE we can write , where is defined by the moment condition and is defined by the moment condition . Applying case 2 of Theorem 6 to , we have . Applying case 1 of Theorem 6 to , we have . Writing , the moment function for can thus be derived with . Note that this expression decomposes into and in Theorem 1.
∎
8.3 Lemmas
Definition 3.
Proposition 3.
Under Assumption 2,
Proof.
(chernozhukov2018learning , Lemma A1) ∎
Denote .
Proof.
(chernozhukov2018learning , Theorem 6) ∎
Proof.
Proposition 4 and (chernozhukov2018learning , Lemma 4) ∎