Machine Learning Estimation of Heterogeneous Treatment Effects with Instruments
Abstract
We consider the estimation of heterogeneous treatment effects with arbitrary machine learning methods in the presence of unobserved confounders with the aid of a valid instrument. Such settings arise in A/B tests with an intenttotreat structure, where the experimenter randomizes over which user will receive a recommendation to take an action, and we are interested in the effect of the downstream action. We develop a statistical learning approach to the estimation of heterogeneous effects, reducing the problem to the minimization of an appropriate loss function that depends on a set of auxiliary models (each corresponding to a separate prediction task). The reduction enables the use of all recent algorithmic advances (e.g. neural nets, forests). We show that the estimated effect model is robust to estimation errors in the auxiliary models, by showing that the loss satisfies a Neyman orthogonality criterion. Our approach can be used to estimate projections of the true effect model on simpler hypothesis spaces. When these spaces are parametric, then the parameter estimates are asymptotically normal, which enables construction of confidence sets. We applied our method to estimate the effect of membership on downstream webpage engagement on TripAdvisor, using as an instrument an intenttotreat A/B test among 4 million TripAdvisor users, where some users received an easier membership signup process. We also validate our method on synthetic data and on public datasets for the effects of schooling on income.
Machine Learning Estimation of Heterogeneous Treatment Effects with Instruments
Vasilis Syrgkanis Microsoft Research vasy@microsoft.com Victor Lei TripAdvisor vlei@tripadvisor.com Miruna Oprescu Microsoft Research moprescu@microsoft.com Maggie Hei Microsoft Research Maggie.Hei@microsoft.com Keith Battocchi Microsoft Research kebatt@microsoft.com Greg Lewis Microsoft Research glewis@microsoft.com
noticebox[b]\end@float
1 Introduction
A/B testing is the gold standard of causal inference. But even when A/B testing is feasible, estimating the effect of a treatment on an outcome might not be a straightforward task. One major difficulty is noncompliance: even if we randomize what treatment to recommend to a subject, the subject might not comply with the recommendation due to unobserved factors and follow the alternate action. The impact that unobserved factors might have on the measured outcome is a source of endogeneity and can lead to biased estimates of the effect. This problem arises in large scale data problems in the digital economy; when optimizing a digital service, we might often want to estimate the effect of some action taken by our users on downstream metrics. However, the service cannot force users to comply, but can only find means of incentivizing or recommending the action. The unobserved factors of compliance can lead to biased estimates if we consider the takers and not takers as exogenously assigned and employ machine learning approaches to estimate the potentially heterogeneous effect of the action on the downstream metric.
The problem can be solved by using the technique of instrumental variable (IV) regression: as long as the recommendation increases the probability of taking the treatment, then we know that there is at least some fraction of users that were assigned the treatment “exogeneously”. IV regression parses out this population of “exogenously treated” users and estimates an effect based solely on them.
Most classical IV approaches estimate a constant average treatment effect. However, to make personalized policy decisions (an emerging trend in most digital services) one might want to estimate a heterogeneous effect based on observable characteristics of the user. The latter is a daunting task, as we seek to estimate a function of observable characteristics as opposed to a single number. Hence, statistical power is at stake. Even estimating an ATE is nontrivial when effect and compliance are correlated through observables. The emergence of large datasets in the digital economy alleviates this concern; with A/B tests running on millions of users it is possible to estimate complex heterogeneous effect models, even if compliance levels are relatively weak. Moreover, as we control for more and more observable features of the user, we also reduce the risk that correlation between effect and compliance is stemming from unobserved factors.
This leads to the question this work seeks to answer: how can we blend the power of modern machine learning approaches (e.g. random forests, gradient boosting, penalized regressions, neural networks) with instrumental variable methods, so as to estimate complex heterogeneous effect rules. Recent work at the intersection of machine learning and econometrics has proposed powerful methods for estimating the effect of a treatment on an outcome, while using machine learning methods for learning nuisance models that help debias the final effect rule. However, the majority of the work has either focused on 1) estimating average treatment effects or low dimensional parametric effect models (e.g. the double machine learning approach of [11]), 2) developing new algorithms for estimating fully nonparametric models of the effect (e.g. the IV forest method of [4], the DeepIV method of [15]), 3) assuming that the treatment is exogenous once we condition on the observable features and reducing the problem to an appropriate square loss minimization framework (see e.g. [24, 19]).
Nevertheless, a general reduction of IV based machine learning estimation of heterogeneous effects to a more standard statistical learning problem that can incorporate existing algorithms in a blackbox manner has not been formulated in prior work. In fact, the work of [24] leaves this as an open question. Such a reduction can help us leverage the recent algorithmic advances in statistical learning theory so as to work with large datasets. Our work proposes the reduction of heterogeneous effects estimation via instruments to a square loss minimization problem over a hypothesis space. This enables us to learn not only the true heterogeneous effect model, but also the projections of the true model in simpler hypothesis spaces for interpretability. Moreover, our work leverages recent advances in statistical learning with nuisance functions [12, 13], to show that the mean squared error (MSE) of the learned model is robust to the estimation error of auxiliary models that need to be estimated (as is standard in IV regression). Thus we achieve MSE rates where the leading term depends only on the sample complexity of the hypothesis space of the heterogeneous effect model.
Some advantages of reducing our problem to a set of standard regression problems include being able to use existing algorithms and implementations, as well as recent advances of interpretability in machine learning. For instance, in our application we deploy the SHAP framework [21, 20] to interpret random forest based models of the heterogeneous effect. Furthermore, when the hypothesis space is low dimensional and parametric then our approach falls in the setting studied by prior work of [11] and, hence, not only MSE rates but also confidence interval construction is relatively straightforward. This enables hypothesis testing on parametric projections of the true effect model.
We apply our approach to an intenttotreat A/B test among 4 million users on a major travel webpage so as to estimate the effect of membership on downstream engagement. We identify sources of heterogeneity that have policy implications on which users the platform should engage more and potentially how to redesign the recommendation to target users with large effects. We validate the findings on a different cohort in a separate experiment among 10 million users on the same platform. Even though the new experiment was deployed on a much broader and different cohort, we identify common leading factors of effect heterogeneity, hence confirming our findings. As a robustness check we create semisynthetic data with similar features and marginal distributions of variables as the real data, but where we know the ground truth. We find that our method performs well both in terms of MSE, identifying the relevant factors and coverage of the confidence intervals.
Finally, we apply our method to a more traditional IV application: estimating the effect of schooling on wages. We use a well studied public data set and observe that our approach automatically identifies sources of heterogeneity that were previously uncovered using more structural approaches. We also validate our method in this application on semisynthetic data that emulate the true data.
2 Estimation of Heterogeneous Treatment Effects with Instruments
We consider estimation of heterogeneous treatment effects with respect to a set of features , of an endogenous treatment on an outcome with an instrument . For simplicity of exposition, we will restrict attention to the case where and are scalar variables, but several of our results extend to the case of multidimensional treatments and instruments. is an instrumental variable if it has an effect on the treatment but does not have a direct effect on the outcome other than through the treatment. More formally, we assume the following moment condition:
(1) 
Equivalently we assume that: , with . We allow for the presence of confounders, i.e. could be correlated with via some unobserved common factor . However, our exclusion restriction on the instrument implies that the residual is mean zero conditional on the instrument. Together with the fact that the instrument also has an effect on the treatment at any value of the feature , i.e.: , allows us to identify the heterogeneous effect function . We focus on the case where the effect is linear in the treatment , which is wlog in the binary treatment setting, which is our main application, and since our goal is to focus on the nonlinearity wrt (this greatly simplifies our problem, see [9, 10, 23, 16]).
Given i.i.d. samples from the data generating process, our goal is to estimate a model that achieves small expected meansquarederror, i.e.: . Since the true function can be very complex and difficult to estimate in finite samples, we are also interested in estimating projections of the true on simpler hypothesis spaces . Projections are also useful for interpretability: one might want to understand what is the best linear projection of on , i.e. . In this case we will denote with the projection of on , i.e. and our goal would be to achieve small mean squared error with respect to . When is a low dimensional parametric class (e.g. a linear function on a lowdimensional feature space or a constant function), we are also interested in performing inference; i.e. constructing confidence intervals that asymptotically contain the correct parameter with probability equal to some target confidence level.
2.1 WarmUp: Estimating the Average Treatment Effect (ATE)
For estimation of the average treatment effect (ATE), assuming that either there is no effect heterogeneity with respect to or there is no heterogeneous compliance with respect to , [11] propose a method for estimating the ATE that solves the empirical analogue of the following moment equation:
(2) 
This moment function is orthogonal to all the functions , and that also need to be estimated from data. This moment avoids the estimation of the expected conditional on and satisfies an orthogonality condition that enables robustness of the estimate , to errors in the nuisance estimates and . The estimate is asymptotically normal with variance equal to the variance of the method if the estimates were the correct ones, assuming that the mean squared error of these estimates decays at least at a rate of (see [11] for more details). This result requires that the nuisance estimates are fitted in a crossfitting manner, i.e. we use half of the data to fit a model for each of these functions and then predict the values of the model on the other half of the samples. We refer to this algorithm as DMLATEIV.^{1}^{1}1For Double Machine Learning ATE estimation with Instrumental Variables.
Inconsistency under Effect and Compliance Heterogeneity The above estimate is a consistent estimate of the average treatment effect as long as there is either no effect heterogeneity with respect to or there is no heterogeneous compliance (i.e. the effect of the instrument on the treatment) with respect to . Otherwise it is inconsistent. The reason is that, if we let and , then the population quantity: is a function of . If we also have effect heterogeneity, then we are solving for a constant that in the limit satisfies: , where . On the other hand the true heterogeneous model satisfies the equation: . In the limit, the two quantities are related via the equation: . Then the constant effect that we estimate converges to the quantity: . If is not independent with , then is a reweighted version of the true average treatment effect , reweighted by the heterogeneous compliance. To account for this heterogeneous compliance we need to change our moment equation so as to reweight based on , which is unknown and also needs to be estimated from data. Given that this function could be arbitrarily complex, we want our final estimate to be robust to estimation errors of . We can achieve this by considering a doubly robust approach to estimating . Suppose that we had some other method of computing an estimate of the heterogeneous treatment effect , then we can combine both estimates to get a more robust method for the ATE, e.g.:
(3) 
This approach has been analyzed in [25] in the case of constant treatment effects and an analogue of this average effect was also used by [5] in a policy learning problem as opposed to an estimation problem. In particular, the quantity is known as the compliance score [1, 3]. Our methodological contribution in the next two sections is twofold: i) first we propose a modelbased stable approach for estimating a preliminary estimate , which does not necessarily require that everywhere (an assumption that is implicit in the latter method), ii) second we show that this doubly robust quantity can be used as a regression target and minimizing the square loss with respect to this target, corresponds to an orthogonal loss, as defined in [12, 13].
3 Preliminary Estimate of Conditional Average Treatment Effect (CATE)
Let and , as in the previous section. Then observe that we can rewrite the moment condition as:
(4) 
Moreover, observe that the functions and are related via: . Thus we can further rewrite the moment condition in terms of instead of :
(5) 
Moreover, we can identify with the following subset of conditional moments, where the conditioning of is removed:
(6) 
Equivalently, is a minimizer of the square loss:
(7) 
since the derivative of this loss with respect to is equal to the moment equation and, thus, the first order condition for the loss minimization problem is satisfied by the true model . Moreover, if the loss function satisfies a functional analogue of strong convexity, then any minimizer of the loss achieves small mean squared error with respect to . This leads to the following approach:
(8) 
This method is an extension of the classical twostageleastsquares (2SLS) approach [2] to allow for arbitrary machine learning models; ignoring the residualization part (i.e. if for instance ), then it boils down to: 1) predict the mean treatment from the instrument and with an arbritrary regression/classification method, 2) predict the outcome from the predicted treatment multiplied by the heterogeneous effect model . Residualization helps us remove the dependence of the mean squared error on the complexity of the baseline function . We achieve this by showing that this loss is orthogonal with respect to (see [13] for the definition of an orthogonal loss). However, orthogonality does not hold with respect to . This finding is reasonable since we are using as our regressor. Hence, any error in the measurement of the regressor can directly propagate to an error in . This is the same reason why in classical IV regression one cannot ignore the variance from the first stage of 2SLS when calculating confidence intervals.
Lemma 1.
The loss function is orthogonal to the nuisance functions , but not .
Strong convexity and overlap. Note that both the empirical loss and the population loss are convex in the prediction, which typically implies computational tractability. Moreover, the second order directional derivative of the population loss in any functional direction is:
(9) 
and let:
(10) 
To be able to achieve meansquarederror rates based on our loss minimization, we need the population version of the loss function to satisfy a functional analogue of strong convexity:
(11) 
This setting falls under the “singleindex” setup of [13]. Using arguments from Lemma 1 of [13], if:
(12) 
where
(13) 
then . A sufficient condition is that for all . This is a standard "overlap" condition that the instrument is exogenously varying at any and has a direct effect on the treatment at any . DMLIV only requires an "average" overlap condition, tailored particularly to the hypothesis space , hence it could handle settings where the instrument is weak for some subset of the population. For instance, if is a linear function class: , then for the oracle strong convexity to hold it suffices that: . Lemma 1, combined with the above discussion and the results of [13] yields:^{2}^{2}2This corollary follows by small modifications of the proofs of Theorem 1 and Theorem 3 of [13] that accounts for the nonorthogonality w.r.t. , so we omit its proof.
Corollary 2.
Assume all random variables are bounded and consider any algorithm that achieves expected generalization error with respect to loss . Moreover, suppose that the nuisance estimates satisfy and . Then returned by DMLIV satisfies: . If empirical risk minimization is used in the final stage, then , where is the critical radius of the hypothesis space as defined via the localized Rademacher complexity [18].
Computational considerations. The empirical loss is not a standard square loss. However, we can rewrite it as . Thus the problem is equivalent to a standard square loss minimization with label and sample weights . Thus we can use any outofthebox machine learning method that accepts sample weights, such as stochastic gradient based regression methods and gradient boosted or random forests. Alternatively, if we assume a linear representation of the effect function , then the problem is equivalent to regressing on the scaled features , and again any method for fitting linear models can be invoked.
4 DRIV: Orthogonal Loss for IV Estimation of CATE and Projections
We now present the main estimation algorithm that combines the doubly robust approach presented for ATE estimation with the preliminary estimator of the CATE to obtain a fully orthogonal and strongly convex loss. This method achieves a second order effect from all nuisance estimation errors and enables oracle rates for the target effect class and asymptotically valid inference for low dimensional target effect classes. In particular, given access to a first stage model of heterogeneous effects (such as the one produced by DMLIV), we can estimate a more robust model of heterogeneous effects via minimizing a square loss that treats the doubly robust quantity used in Equation (3) as the label:
(14) 
We allow for a model space that is not necessarily equal to . The solution in Equation (3) is a special case of this minimization problem where the space contains only constant functions. Our main result shows that this loss is orthogonal to all nuisance functions , , , , . Moreover, it is strongly convex in the prediction , since conditional on all the nuisance estimates it is a standard square loss. Moreover, we show that the loss is orthogonal irrespective of what the model space , even if , as long as the preliminary estimate is consistent with respect to the true CATE (i.e. fit a flexible preliminary CATE and use it to project to a simpler hypothesis space).
Lemma 3.
The loss is orthogonal with respect to the nuisance functions , , , and .
If we use DMLIV for , even though DMLIV has a first order impact from the error of , the second stage estimate has a second order impact, since it has a second order impact from the first stage CATE error. Lemma 3 together with the results of [13] and [12] implies the following corollary:
Corollary 4.
Assume all random variables are bounded and consider any algorithm that achieves expected generalization error with respect to loss . Moreover, suppose that each nuisance estimate , . Then returned by DRIV satisfies: , where . If empirical risk minimization is used in the final stage, then , where is the critical radius of the hypothesis space as defined via the localized Rademacher complexity [18]. If is highdimensional sparse linear, i.e. with , and , then if an penalized square loss minimization is used in the final step of DRIV, it suffices that to get: .
Interpretability through projections. The fact that our loss function can be used with any target allows us to perform inference on the projection of on a simple space (e.g. decision trees, linear functions) for interpretability purposes. If we let the label in the final regression of DRIV, then observe that when the nuisance estimates take their true values then , since the second part of has mean zero. Hence:
(15) 
The first part is independent of and hence minimizing the oracle is equivalent to minimizing over , which is exactly the projection of on . One version of an interpretable model is estimating the CATE with respect to a subset of the variables, i.e.: (e.g. how treatment effect varies with a single feature). This boils down to setting some space of functions of .
If is a low dimensional set of features and is a the space of linear functions of , i.e. , then the first order condition of our loss is equal to the moment condition . Then orthogonality of our loss implies that DRIV is equivalent to an orthogonal moment estimation method [11]. Thus using the results of [11] we get the corollary:
Corollary 5 (Confidence Intervals).
The estimate returned by DRIV with for a constant independent of , is asymptotically normal with asymptotic variance equal to the hypothetical variance of as if the nuisance estimates had their true values.
Hence, we can use outofthebox packages for calculating CIs of an OLS regression to get values on the coefficients.
5 Estimating Effects of Membership on TripAdvisor
We apply our methods to estimate the treatment effect of membership on the number of days a user visits a leading travel assistance website, W. The instrument used was a 14day intenttotreat A/B test run during 2018, where users in group A received a new, easier membership signup process, while the users in group B did not. The treatment is whether a user became a member or not. Becoming a member and logging into TripAdvisor gives users exclusive access to trip planning tools, special deals and price alerts, and personalized ideas and travel advice. Our data consists of 4,606,041 total users in a 50:50 A/B test. For each user, we have a 28day preexperiment summary about their browsing and purchasing activity on TripAdvisor (see Sec. B.2). The instrument significantly increased the rate of treatment, and is assumed to satisfy the exclusion restriction.
We applied two sets of nuisance estimation models with different complexity characteristics: LASSO regression and logistic regression with an L2 penalty (LM); and gradient boosting regression and classification (GB). The only exception was , where we used a fixed estimate of since the instrument was a large randomized experiment. See Sec. B.1 for details.^{3}^{3}3We attempted to use the R implementation of Generalized Random Forests (GRF)[4] to compare with our results. However, we could not fit due to the size of the data and insufficient memory errors (with 64GB RAM).
We estimate the ATE using DRIV projected onto a constant (Table 1). Using linear nuisance models results in very similar ATE estimates between DMLATEIV and DRIV. We compare the covariate associations for both heterogeneity and compliance under DRIV to understand why. If there are covariates with significant nonzero associations in both heterogeneity and compliance, this could lead to different estimates between DRIV and DMLATEIV (and vice versa). Replacing the CATE projection model with a linear regression, we obtain valid inferences for the covariates associated with treatment effect heterogeneity (Figure 1). For compliance, we run a linear regression of the estimated quantity on , to assess its association with each of the features (see Sec. B.1 for details). Comparing treatment and compliance coefficients, os_type_linux and revenue_pre are the only coefficients substantially different from 0 in both. However, only a very small proportion of users in the experiment are Linux users, and the distribution of revenue is very positively skewed. This justifies the minor difference between the DMLATEIV and DRIV estimates. Moreover, we fit a shallow, heavily regularized random forest and interpret it using Shapley Additive Explanations (SHAP) [22]. SHAP gave directionally similar impact of each feature on the effect (Figure 1). However, since we constrained the model to have depth at most one, it essentially gives the features in order of importance if we were to split the population based on a single feature. This justifies why the order of importance of features in the forest is not in the same order as the magnitude of the rank in the linear model, since they have different interpretations. The features picked up by the forest intuitively make sense since an already highly engaged member of W, or a user who has recently made a booking, is less likely to further increase their visits to W.
Using gradient boosting nuisance models, we show that many inferences remain similar (Figure 2 in Appendix). The most notable changes in heterogeneity were for features which have a highly skewed distribution (e.g. visits to specific pages on W), or which appear rarely in the data (e.g. Linux users). The linear CATE projection model coefficients are largely similar for both residualization models (except the Linux operating system feature, which appears rarely in the data). Moving to a random forest for the CATE projection model with SHAP presents greater differences, especially for the highly skewed features.
Similar instrument from a recent experiment A recent 2019 A/B test of the same membership signup process provided another viable instrument. This 21day A/B test included a much larger, more diverse population of users than in 2018 due to fewer restrictions for eligibility (see Sec. B.2 for details). We apply DRIV with gradient boosting residualization models and a linear projection of the CATE. The CATE distribution has generally higher values compared to the 2018 experiment which reflects the different experimental population. In particular, users in the 2018 experiment had much higher engagement and significantly higher revenue in the preexperiment period. This was largely because users were only included in the 2018 experiment on their second visit. The higher baseline naturally makes it more difficult to achieve high treatment effects, explaining the generally lower CATE distribution in the 2018 experiment. We note that, unlike in 2018, the revenue coefficient is no longer significant. We again attribute this to the much higher revenue baseline in 2018. Despite the population differences, however, we observe "days_visited_vrs_pre" continues to have a very significant positive association. "days_visited_exp_pre" now also appears to have a significantly positive association, as does the iPhone device (which was not a feature in the 2018 experiment). The inclusion of iPhone users is another big domain shift in the two experiments.
Policy recommendations for W Our results offer several policy implications for W. Firstly, encourage iPhone users, and users who frequent vacation rentals pages to signup for membership. These users exhibited high treatment effects from membership. For frequent visitors to vacation rentals pages, this effect was robust across residualization models, CATE projections, and even different instruments (e.g. by providing stronger encouragements for signup on particular subpages). Second, find ways to improve the membership offering for users who are already engaged: e.g. recently made a booking (high revenue_pre), were already frequent visitors (high days_visited_free_pre).
Validation on SemiSynthetic Data In Appendix C, we validate the correctness of ATE and CATE from DRIV, by creating a semisynthetic dataset with the same variables and such that the marginal distribution of each variable looks similar to the TripAdvisor data, but where we know the true effect model. We find that DRIV recovers a good estimate of the ATE. The CATE of DRIV with linear regression as final stage also recovers the true coefficients, and a random forest final stage picks the correct factors of heterogeneity as most important features. Moreover, coverage of DRIV ATE confidence intervals is almost nominal at 94%, while DMLATEIV can be very biased and have 0 coverage.
6 Estimating the Effect of Schooling on Wages
The causal impact of schooling on wages has been studied at length in Economics (see [14], [6], [7], [17]), and although it is generally agreed that there is a positive impact, it is difficult to obtain a consistent estimate of the effect due to selfselection into education levels. To account for this endogeneity, Card ([6]) proposes using proximity to a 4year college as an IV for schooling. We analyze Card’s data from the Nat. Long. Survey of Young Men (NLSYM, 1966) to estimate the ATE of education on wages and find sources of heterogeneity. We describe the NLSYM data in depth in Appendix D. At high level, the data contains 3,010 rows with 22 mostly binary covariates , log wages (), years of schooling (), and 4year college proximity indicator ().
We apply DMLATEIV and DRIV with linear (LM) or gradient boosted (GBM) nuisance models to estimate the ATE (Table 2 and Table 8 in Appx. D). While the DMLATEIV results are consistent with Card’s ( CI), this estimate is likely biased in the presence of compliance and effect heterogeneity (see Sec. 2.1). The DRIV ATE estimates, albeit lower, still lie within the 95% CI of the DML ATE.
We study effect heterogeneity with a shallow random forest an the last stage of DRIV. Fig. 3 depicts the spread of treatment effects, and the important features selected. Most effects (89%) are positive, with very few very negative outliers. The heterogeneity is driven mainly by parental education variables. We project the DRIV treatment effect on the mother’s education variable to study this effect. In fig. 3, we note that treatment effects are highest among children of less educated mothers. This pattern has also been observed in [6] and [17].
Observational Data  SemiSynthetic Data  
Nuisance  Method  ATE Est  95% CI  ATE Est  95% CI  Cover 
LM  DMLATEIV  0.137  [0.027, 0.248]  0.654  [0.621, 0.687]  10% 
LM  DRIV  0.065  [0.02, 0.151]  0.587  [0.521, 0.652]  92% 
Contains the true ATE (0.609)  Coverage for 95% CI over 100 Monte Carlo simulations 
Semisynthetic Data Results. We created semisynthetic data from the NLSYM covariates and instrument , with generated treatments and outcomes based on known compliance and treatment functions (see Appx. D for details). In Table 2, we see that DMLATEIV ATE (true ATE=0.609) is upwards biased and has poor coverage over 100 runs, whereas the DRIV ATE is less biased and has overall good coverage. With DRIV, we also recover the correct coefficients: vs , vs , and vs .
References
 [1] Alberto Abadie. Semiparametric instrumental variable estimation of treatment response models. Journal of Econometrics, 113(2):231 – 263, 2003.
 [2] Joshua D Angrist and JörnSteffen Pischke. Mostly harmless econometrics: An empiricist’s companion. Princeton university press, 2008.
 [3] Peter M Aronow and Allison Carnegie. Beyond late: Estimation of the average treatment effect with an instrumental variable. Political Analysis, 21(4):492–506, 2013.
 [4] Susan Athey, Julie Tibshirani, Stefan Wager, et al. Generalized random forests. The Annals of Statistics, 47(2):1148–1178, 2019.
 [5] Susan Athey and Stefan Wager. Efficient policy learning. arXiv preprint arXiv:1702.02896, 2017.
 [6] David Card. Using geographic variation in college proximity to estimate the return to schooling. Technical report, National Bureau of Economic Research, 1993.
 [7] David Card. Estimating the return to schooling: Progress on some persistent econometric problems. Econometrica, 69(5):1127–1160, 2001.
 [8] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages 785–794, New York, NY, USA, 2016. ACM.
 [9] Xiaohong Chen and Zhipeng Liao. Sieve semiparametric twostep gmm under weak dependence. Cowles Foundation Discussion Papers 2012, Cowles Foundation for Research in Economics, Yale University, 2015.
 [10] Xiaohong Chen and Demian Pouzo. Estimation of nonparametric conditional moment models with possibly nonsmooth generalized residuals. Econometrica, 80(1):277–321, 2012.
 [11] Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68, 2018.
 [12] Victor Chernozhukov, Denis Nekipelov, Vira Semenova, and Vasilis Syrgkanis. Plugin regularized estimation of highdimensional parameters in nonlinear semiparametric models. arXiv preprint arXiv:1806.04823, 2018.
 [13] Dylan J Foster and Vasilis Syrgkanis. Orthogonal statistical learning. arXiv preprint arXiv:1901.09036, 2019.
 [14] Zvi Griliches. Estimating the returns to schooling: Some econometric problems. Econometrica: Journal of the Econometric Society, pages 1–22, 1977.
 [15] Jason Hartford, Greg Lewis, Kevin LeytonBrown, and Matt Taddy. Deep IV: A flexible approach for counterfactual prediction. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1414–1423, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
 [16] Jason Hartford, Greg Lewis, Kevin LeytonBrown, and Matt Taddy. Deep iv: A flexible approach for counterfactual prediction. In International Conference on Machine Learning, pages 1414–1423, 2017.
 [17] John Hudson and John G Sessions. Parental education, labor market experience and earnings: new wine in an old bottle? Economics Letters, 113(2):112–115, 2011.
 [18] Vladimir Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. Springer, 2011.
 [19] Sören R Künzel, Jasjeet S Sekhon, Peter J Bickel, and Bin Yu. Metalearners for estimating heterogeneous treatment effects using machine learning. arXiv preprint arXiv:1706.03461, 2017.
 [20] Scott M Lundberg, Gabriel G Erion, and SuIn Lee. Consistent individualized feature attribution for tree ensembles. arXiv preprint arXiv:1802.03888, 2018.
 [21] Scott M Lundberg and SuIn Lee. A unified approach to interpreting model predictions. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4765–4774. Curran Associates, Inc., 2017.
 [22] Scott M Lundberg and SuIn Lee. A unified approach to interpreting model predictions. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4765–4774. Curran Associates, Inc., 2017.
 [23] Whitney K. Newey and James L. Powell. Instrumental variable estimation of nonparametric models. Econometrica, 71(5):1565–1578, 2003.
 [24] Xinkun Nie and Stefan Wager. Quasioracle estimation of heterogeneous treatment effects. arXiv preprint arXiv:1712.04912, 2017.
 [25] Ryo Okui, Dylan S Small, Zhiqiang Tan, and James M Robins. Doubly robust instrumental variable regression. Statistica Sinica, pages 173–205, 2012.
Appendix A Proof of Main Lemmas
Before we prove our two main lemmas we define the concept of an orthogonal loss. Consider a loss function that depends on a target model and nuisance model .
Definition 1 (Directional Derivative).
Let be a vector space of functions. For a functional , we define the derivative operator
for a pair of functions . Likewise, we define
When considering a functional in two arguments, e.g. , we will write and to make the argument with respect to which the derivative is taken explicit.
Definition 2 (Orthogonal Loss).
The population risk is orthogonal, if:
(16) 
Definition 3 (Strong Convexity in Prediction).
The population risk is strongly convex with respect to the prediction, if:
where:
(17) 
a.1 Proof of Lemma 1
Proof.
We show that the expected directional derivative of the moment (directional derivative of the loss with respect to ) conditional on , with respect to each of the nuisance functions is equal to zero, when evaluated at the true nuisance and target functions. The directional derivative of the loss with respect to direction and evaluated at parameter is:
where:
To show orthogonality with respect to , it suffices to show that the classical derivative of with respect to the inputs and is zero, when evaluated at the true nuisance and target parameters:
Where in both equations we invoked the conditional moment restrictions to claim that they are equal to zero.
To prove orthogonality with respect to we need to show that the directional derivative of with respect to is zero. We cannot reduce it to a classical derivative condition, since takes as input the variable which is not part of the conditioning set of the moment . However, we see that this directional derivative evaluated at and at a direction , is not zero:
The last quantity is not necessarily zero, since . This finding is reasonable since we are using as our regressor. Hence, any error in the measurement of the regressor should directly propagate to an error in . The quantity would have been zero if the residual error from the first stage function was independent of the residual randomness , conditional on . However, the two in general can be correlated: the second quantity measures how far is from each mean , while the first quantity measures how far is the estimate from . It is highly probable that when takes values that lead to a large deviation from the mean treatment, then these are also the values of for which the first stage model makes more mistakes. ∎
a.2 Proof of Lemma 3
Proof.
We show that the expected derivative of the moment (derivative of the loss with respect to ) conditional on , with respect to each of the nuisance functions is equal to zero, when evaluated at the true nuisance and target functions. The directional derivative of the loss with respect to direction and evaluated at parameter is:
where and :
To show orthogonality with respect to the nuisance functions , it suffices to show that the classical derivative of with respect to each component of is zero, when evaluated at the true nuisance and target parameters:
Where in all equations we invoked the conditional moment restrictions and the definitions of the true nuisance functions to claim that they are equal to zero.
Moreover, observe that the second directional derivative of the loss with respect to and for a direction is equal to:
(18) 
Thus the loss is strongly convex. ∎
Appendix B W Data and Analysis
b.1 Model Details and Parameters
Residualization Models

LASSO regression and logistic regression with an L2 penalty using the Python sklearn library. For each crossfitted fold, 3Fold crossvalidation was used to select the regularization parameter based on minimizing RMSE and logloss.

Gradient boosting (GB) regression and classification using the XGBoost library.[8] 100 estimators were used, with a minimum child weight of 20, and gamma set to 0.1. A 10% validation set was used for early stopping based on RMSE and logloss.
The gradient boosting models from sklearn also yielded substantially similar results to XGBoost.
Random Forest We use a shallow, heavily regularized random forest for projection of the CATE. Parameters used: 1,000 trees, a minimum leaf size of 20,000, and a maximum depth size of 1. The heavy regularization is required in order to ensure stability of the CATE estimates.
Linear Compliance Model Using the 2018 experiment data and linear residualization models, the compliance quantity (despite the logistic function) is wellapproximated by a linear regression. We use this approximation for interpreting the coefficents of the fitted model (Figure 4).
b.2 Additional Data Description and Preparation
Full description of the data in Table 3. The criteria for eligibility required that users were not existing members of TripAdvisor before the experimental period; visited TripAdvisor through a desktop browser during the experimental period; and visited TripAdvisor at least twice during the experimental period. The first visit did not activate the test functionality. Group assignment was determined randomly with equal probability, resulting in in group A, and in group B.
We transform the operating system categorical variable using onehot encoding and drop the "Windows" level to use as the baseline. In addition, the covariates are normalized uniformly over 1,000 quantiles, resulting in a covariate vector for each user to be used for both residualization and effect heterogeneity.
For confidentiality reasons, we report the ATE and CATE results normalized by , the mean number of days visited by users in group B of the A/B experiment. A treatment effect of 1 unit is therefore equal to additional days visited.
revenue_pre  Total revenue in dollars generated by the user in the preexperimental period 
days_visited_free_pre  Count of the days the user visited the TripAdvisor through free channels (e.g. email) in the preexperimental period (028) 
days_visited_hs_pre  Count of the days the user visited the hotels pages of TripAdvisor in the preexperimental period (028) 
days_visited_exp_pre  Count of the days the user visited the experiences pages of TripAdvisor in the preexperimental period (028) 
days_visited_rs_pre  Count of the days the user visited the restaurants pages of TripAdvisor in the preexperimental period (028) 
days_visited_vrs_pre  Count of the days the user visited the vacation rentals pages of TripAdvisor in the preexperimental period (028) 
days_visited_fs_pre  Count of the days the user visited the flights pages of TripAdvisor in the preexperimental period (028) 
os_type  Categorical variable for the user’s operating system (3 levels) 
locale_en_US  Binary variable indicating whether the user was from the en_US locale 
Y  Outcome measurement, count of the number of total days the user visited TripAdvisor 
T  Treatment, binary variable of whether the user became a member during the experimental period 
Z  Instrument, binary variable of the user’s group assignment in the A/B test 
Additional details about the 2019 experiment There were some key differences compared to the 2018 A/B test:

the test was run for 3 weeks instead of 2;

the test functionality was displayed on both desktop and mobile platforms across nearly all pages of TripAdvisor (i.e. not just the homepage);

firsttime visitors were eligible for the test; and

the sample size was much larger at users. We use a sample of users stratified by A/B test group allocation for computational reasons.
b.3 Additional Figures of Experimental Results
Appendix C SemiSynthetic Data Analysis for TripAdvisor Data
W Semisynthetic Data Results. In order to validate the correctness of ATE and CATE from DRIV model, we consider a semisynthetic data generating process that looks similar in structure to TripAdvisor data. The covariates have the same schema but are generated from fixed marginal distributions. The instrument corresponds to a fully randomized recommendation of treatment. And the compliance rates are generated to be similar with the experiment. This probability depends both on the observed feature and an unobserved confounder that has a direct effect on the outcome. The X covariates and DGP are given by:
Covariate  Distribution 

days_visited_free_pre, days_visited_hs_pre, days_visited_rs_pre, days_visited_exp_pre, days_visited_vrs_pre, days_visited_fs_pre  
locale_US  
os_type  
revenue_pre 
(Instrument)  
(Unobserved confounder)  
(Compliers when recommended)  
(NonCompliers when not recommended)  
(Treatment)  
(Outcome) 
Moreover, the treatment effect function is predefined here, which depends on the feature "days_visited_free_pre"(X[0]) and "locale_US"(X[6])
(CATE) 
We rerun the same experiments with 4 million samples. In table 5, it shows that both DMLATEIV and DRIV with either linear or GBM nuisance estimators, their ATE CI can recover the true estimate of ATE. Moreover, we validate the CATE via DRIV. In figure 10 and 11, we can see that DRIV with linear regression as final stage recovers the true coefficient from CATE, and the last stage model using random forest also picks the correct factor of heterogeneity as the most important features.
Nuisance Models  Method  True ATE  ATE Estimate  95% CI 

Linear Models  DMLATEIV  0.249  0.336  [0.186, 0.487] 
Linear Models  DRIV with constant  0.249  0.166  [0.025, 0.358] 
Gradient Boosting Models  DMLATEIV  0.249  0.342  [0.191, 0.492 ] 
Gradient Boosting Models  DRIV with constant  0.249  0.136  [0.060, 0.332 ] 
We also run some other experiments with different sample size and different level of endogeneity (the coefficient of variable ) to learn the consistency of ATE for these two models. we can see from the table and figures below that all of their CI covers the true estimate of ATE, but with the increase of and the decrease of the endogeneity coefficient, the ATE of DMLATEIV is more biased.
Nuisance Models  Method  True ATE  ATE Estimate  95% CI 

Linear Models  DMLATEIV  0.249  0.349  [0.230, 0.468] 
Linear Models  DRIV with constant  0.249  0.197  [0.044, 0.350] 
Gradient Boosting Models  DMLATEIV  0.249  0.354  [0.235, 0.473] 
Gradient Boosting Models  DRIV with constant  0.249  0.179  [0.023, 0.335] 
Nuisance Models  Method  True ATE  ATE Estimate  95% CI 

Linear Models  DMLATEIV  0.250  0.350  [0.045, 0.655] 
Linear Models  DRIV with constant  0.250  0.167  [0.222, 0.556] 
Gradient Boosting Models  DMLATEIV  0.250  0.344  [0.040, 0.648] 
Gradient Boosting Models  DRIV with constant  0.250  0.253  [0.212, 0.718] 
Coverage Experiment. To further validate the consistency of DMLATEIV and DRIV under effect and compliance heterogeneity, we create a slightly different semisynthetic dataset with stronger instrument and less samples () to run 100 times Monte Carlo Simulations. The DGP is given by:
(Instrument)  
(Unobserved confounder)  
(Compliers when recommended)  
(NonCompliers when not recommended)  
(Treatment)  
(Outcome) 
Moreover,
(CATE) 
It turns out that distribution of DMLATEIV ATE has smaller variance but larger bias with 0 coverage to the true ATE, while DRIV ATE are more converged to the true ATE with 94% coverage.
Appendix D NLSYM Data Analysis
NLSYM Data Results. The NLSYM data is comprised of 3,010 entries from men ages 1424 that were interviewed in 1966 and again in 1976. We use the covariates selected by Card: mother and father education, family composition at 14, workforce experience, indicators for black, region, southern residence and residence in an SMSA in 1966 and 1976. The outcome of interest is log wages, the treatment is the years of schooling, and the instrument is an indicator of whether the participant grew up near a 4year college.
Observational Data  SemiSynthetic Data  
Nuisance  Method  ATE Est  95% CI  ATE Est  95% CI  Cover 
LM  DMLATEIV  0.137  [0.027, 0.248]  0.654  [0.621, 0.687]  10% 
LM  DRIV  0.065  [0.02, 0.151]  0.587  [0.521, 0.652]  92% 
GBM  DMLATEIV  0.138  [0.025, 0.251]  0.645  [0.613, 0.695]  30% 
GBM  DRIV  0.052  [0.032, 0.136]  0.612  [0.548, 0.677]  86% 
Contains the true ATE (0.609)  Coverage for 95% CI over 100 Monte Carlo simulations 
Semisynthetic Data Results. The NLSYM data is a relatively small dataset and could potentially be a weak instrument, which could explain the large confidence intervals in the prior analysis. To disentangle these effects, we create semisynthetic data from the NLSYM covariates and instrument , with generated treatments and outcomes based on known compliance and treatment functions. The data generating process for the semisynthetic data is given by:
(Unobserved Confounder)  
(Compliance Level)  
(Treatment)  
(Outcome) 
We create a realistic heterogeneous treatment effect that depends on the mother’s education (X[4]) and whether the child was in the care of a single mother at age 14 (X[7], 10% of subjects):
(CATE)  
(Nuissance Functions) 