# Machine Learning Estimation of Heterogeneous Treatment Effects with Instruments

###### Abstract

We consider the estimation of heterogeneous treatment effects with arbitrary machine learning methods in the presence of unobserved confounders with the aid of a valid instrument. Such settings arise in A/B tests with an intent-to-treat structure, where the experimenter randomizes over which user will receive a recommendation to take an action, and we are interested in the effect of the downstream action. We develop a statistical learning approach to the estimation of heterogeneous effects, reducing the problem to the minimization of an appropriate loss function that depends on a set of auxiliary models (each corresponding to a separate prediction task). The reduction enables the use of all recent algorithmic advances (e.g. neural nets, forests). We show that the estimated effect model is robust to estimation errors in the auxiliary models, by showing that the loss satisfies a Neyman orthogonality criterion. Our approach can be used to estimate projections of the true effect model on simpler hypothesis spaces. When these spaces are parametric, then the parameter estimates are asymptotically normal, which enables construction of confidence sets. We applied our method to estimate the effect of membership on downstream webpage engagement on TripAdvisor, using as an instrument an intent-to-treat A/B test among 4 million TripAdvisor users, where some users received an easier membership sign-up process. We also validate our method on synthetic data and on public datasets for the effects of schooling on income.^{†}^{†}Prototype code for all the algorithms presented and the synthetic data experimental study can be found at https://github.com/Microsoft/EconML/tree/master/prototypes/dml_iv


Vasilis Syrgkanis Microsoft Research vasy@microsoft.com Victor Lei TripAdvisor vlei@tripadvisor.com Miruna Oprescu Microsoft Research moprescu@microsoft.com Maggie Hei Microsoft Research Maggie.Hei@microsoft.com Keith Battocchi Microsoft Research kebatt@microsoft.com Greg Lewis Microsoft Research glewis@microsoft.com


## 1 Introduction

A/B testing is the gold standard of causal inference. But even when A/B testing is feasible, estimating the effect of a treatment on an outcome might not be a straightforward task. One major difficulty is non-compliance: even if we randomize what treatment to recommend to a subject, the subject might not comply with the recommendation due to unobserved factors and instead follow the alternative action. The impact that unobserved factors might have on the measured outcome is a source of endogeneity and can lead to biased estimates of the effect. This problem arises at large scale in the digital economy: when optimizing a digital service, we often want to estimate the effect of some action taken by our users on downstream metrics. However, the service cannot force users to comply; it can only find means of incentivizing or recommending the action. The unobserved factors behind compliance can lead to biased estimates if we treat the takers and non-takers as exogenously assigned and employ machine learning approaches to estimate the potentially heterogeneous effect of the action on the downstream metric.

The problem can be solved by using the technique of instrumental variable (IV) regression: as long as the recommendation increases the probability of taking the treatment, then we know that there is at least some fraction of users that were assigned the treatment “exogenously”. IV regression parses out this population of “exogenously treated” users and estimates an effect based solely on them.

Most classical IV approaches estimate a constant average treatment effect. However, to make personalized policy decisions (an emerging trend in most digital services) one might want to estimate a heterogeneous effect based on observable characteristics of the user. The latter is a daunting task, as we seek to estimate a function of observable characteristics as opposed to a single number. Hence, statistical power is at stake. Even estimating an ATE is non-trivial when effect and compliance are correlated through observables. The emergence of large data-sets in the digital economy alleviates this concern; with A/B tests running on millions of users it is possible to estimate complex heterogeneous effect models, even if compliance levels are relatively weak. Moreover, as we control for more and more observable features of the user, we also reduce the risk that correlation between effect and compliance is stemming from unobserved factors.

This leads to the question this work seeks to answer: how can we blend the power of modern machine learning approaches (e.g. random forests, gradient boosting, penalized regressions, neural networks) with instrumental variable methods, so as to estimate complex heterogeneous effect rules? Recent work at the intersection of machine learning and econometrics has proposed powerful methods for estimating the effect of a treatment on an outcome, while using machine learning methods for learning nuisance models that help de-bias the final effect rule. However, the majority of the work has either focused on 1) estimating average treatment effects or low dimensional parametric effect models (e.g. the double machine learning approach of [11]), 2) developing new algorithms for estimating fully non-parametric models of the effect (e.g. the IV forest method of [4], the DeepIV method of [15]), or 3) assuming that the treatment is exogenous once we condition on the observable features and reducing the problem to an appropriate square loss minimization framework (see e.g. [24, 19]).

Nevertheless, a general reduction of IV based machine learning estimation of heterogeneous effects to a more standard statistical learning problem that can incorporate existing algorithms in a black-box manner has not been formulated in prior work. In fact, the work of [24] leaves this as an open question. Such a reduction can help us leverage the recent algorithmic advances in statistical learning theory so as to work with large data-sets. Our work proposes the reduction of heterogeneous effects estimation via instruments to a square loss minimization problem over a hypothesis space. This enables us to learn not only the true heterogeneous effect model, but also the projections of the true model in simpler hypothesis spaces for interpretability. Moreover, our work leverages recent advances in statistical learning with nuisance functions [12, 13], to show that the mean squared error (MSE) of the learned model is robust to the estimation error of auxiliary models that need to be estimated (as is standard in IV regression). Thus we achieve MSE rates where the leading term depends only on the sample complexity of the hypothesis space of the heterogeneous effect model.

Some advantages of reducing our problem to a set of standard regression problems include being able to use existing algorithms and implementations, as well as recent advances in interpretability in machine learning. For instance, in our application we deploy the SHAP framework [21, 20] to interpret random forest based models of the heterogeneous effect. Furthermore, when the hypothesis space is low-dimensional and parametric, our approach falls in the setting studied by prior work of [11] and, hence, not only MSE rates but also confidence interval construction is relatively straightforward. This enables hypothesis testing on parametric projections of the true effect model.

We apply our approach to an intent-to-treat A/B test among 4 million users on a major travel webpage so as to estimate the effect of membership on downstream engagement. We identify sources of heterogeneity that have policy implications on which users the platform should engage more and potentially how to re-design the recommendation to target users with large effects. We validate the findings on a different cohort in a separate experiment among 10 million users on the same platform. Even though the new experiment was deployed on a much broader and different cohort, we identify common leading factors of effect heterogeneity, hence confirming our findings. As a robustness check we create semi-synthetic data with similar features and marginal distributions of variables as the real data, but where we know the ground truth. We find that our method performs well in terms of MSE, identification of the relevant factors, and coverage of the confidence intervals.

Finally, we apply our method to a more traditional IV application: estimating the effect of schooling on wages. We use a well studied public data set and observe that our approach automatically identifies sources of heterogeneity that were previously uncovered using more structural approaches. We also validate our method in this application on semi-synthetic data that emulate the true data.

## 2 Estimation of Heterogeneous Treatment Effects with Instruments

We consider estimation of heterogeneous treatment effects, with respect to a set of features $X \in \mathcal{X}$, of an endogenous treatment $T$ on an outcome $Y$ with an instrument $Z$. For simplicity of exposition, we will restrict attention to the case where $T$ and $Z$ are scalar variables, but several of our results extend to the case of multi-dimensional treatments and instruments. $Z$ is an instrumental variable if it has an effect on the treatment but does not have a direct effect on the outcome other than through the treatment. More formally, we assume the following moment condition:

$$\mathbb{E}\big[\,Y - \theta(X)\, T - f(X) \,\big|\, Z, X\,\big] = 0 \tag{1}$$

Equivalently, we assume that $Y = \theta(X)\,T + f(X) + e$, with $\mathbb{E}[e \mid Z, X] = 0$. We allow for the presence of confounders, i.e. $e$ could be correlated with $T$ via some unobserved common factor. However, our exclusion restriction on the instrument implies that the residual $e$ is mean zero conditional on the instrument. Together with the fact that the instrument also has an effect on the treatment at any value of the feature $X$, i.e. $\mathrm{Var}(\mathbb{E}[T \mid Z, X] \mid X) > 0$, this allows us to identify the heterogeneous effect function $\theta(X)$. We focus on the case where the effect is linear in the treatment $T$, which is without loss of generality in the binary treatment setting (our main application), and since our goal is to focus on the non-linearity with respect to $X$ (this greatly simplifies our problem, see [9, 10, 23, 16]).

Given $n$ i.i.d. samples from the data generating process, our goal is to estimate a model $\hat\theta$ that achieves small expected mean squared error, i.e. $\mathbb{E}[(\hat\theta(X) - \theta(X))^2]$. Since the true function $\theta$ can be very complex and difficult to estimate in finite samples, we are also interested in estimating projections of the true $\theta$ on simpler hypothesis spaces $\Theta_\Pi$. Projections are also useful for interpretability: one might want to understand what is the best linear approximation of $\theta$ within $\Theta_\Pi$. In this case we denote with $\theta_\Pi$ the projection of $\theta$ on $\Theta_\Pi$, i.e. $\theta_\Pi = \arg\min_{\theta_\pi \in \Theta_\Pi} \mathbb{E}[(\theta(X) - \theta_\pi(X))^2]$, and our goal would be to achieve small mean squared error with respect to $\theta_\Pi$. When $\Theta_\Pi$ is a low-dimensional parametric class (e.g. a linear function on a low-dimensional feature space or a constant function), we are also interested in performing inference, i.e. constructing confidence intervals that asymptotically contain the correct parameter with probability equal to some target confidence level.

### 2.1 Warm-Up: Estimating the Average Treatment Effect (ATE)

For estimation of the average treatment effect (ATE), assuming that either there is no effect heterogeneity with respect to $X$ or there is no heterogeneous compliance with respect to $X$, [11] propose a method for estimating the ATE $\theta$ that solves the empirical analogue of the following moment equation:

$$\mathbb{E}\big[\big(\tilde{Y} - \theta\,\tilde{T}\big)\,\tilde{Z}\big] = 0 \tag{2}$$

where $\tilde{Y} = Y - \mathbb{E}[Y \mid X]$, $\tilde{T} = T - \mathbb{E}[T \mid X]$ and $\tilde{Z} = Z - \mathbb{E}[Z \mid X]$.

This moment function is orthogonal to the three functions $q(X) = \mathbb{E}[Y \mid X]$, $p(X) = \mathbb{E}[T \mid X]$ and $r(X) = \mathbb{E}[Z \mid X]$ that also need to be estimated from data. It avoids estimation of the expected treatment conditional on $Z$ and $X$, and it satisfies an orthogonality condition that makes the estimate $\hat\theta$ robust to errors in the nuisance estimates $\hat q$, $\hat p$ and $\hat r$. The estimate $\hat\theta$ is asymptotically normal, with variance equal to the variance the method would have if the nuisance estimates were the correct ones, assuming that the mean squared error of these estimates decays at least at a rate of $n^{-1/2}$ (see [11] for more details). This result requires that the nuisance estimates are fitted in a cross-fitting manner, i.e. we use half of the data to fit a model for each of these functions and then predict the values of the model on the other half of the samples. We refer to this algorithm as DMLATEIV.^{1}^{1}1For Double Machine Learning ATE estimation with Instrumental Variables.
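To make the recipe concrete, the following is a minimal sketch of DMLATEIV on simulated data. The data generating process, the linear nuisance models, and the use of two folds are all illustrative assumptions, not choices prescribed by the paper:

```python
# Sketch of DMLATEIV on simulated data: cross-fit the three nuisance
# regressions E[Y|X], E[T|X], E[Z|X] with linear models, then solve the
# residual moment E[(Y_res - theta * T_res) * Z_res] = 0 for a constant theta.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 20000
X = rng.normal(size=(n, 3))
Z = rng.binomial(1, 0.5, size=n)                  # fully exogenous instrument
U = rng.normal(size=n)                            # unobserved confounder
T = (1.0 * Z - 0.5 + 0.5 * U + rng.normal(size=n) > 0).astype(float)
theta_true = 2.0                                  # constant treatment effect
Y = theta_true * T + X[:, 0] + U + rng.normal(size=n)

def residual(v, X):
    # two-fold cross-fitted residual: v - E[v|X]
    return v - cross_val_predict(LinearRegression(), X, v, cv=2)

Y_res, T_res, Z_res = residual(Y, X), residual(T, X), residual(Z, X)
theta_hat = np.mean(Y_res * Z_res) / np.mean(T_res * Z_res)
```

Because `U` drives both take-up and the outcome, a direct regression of `Y` on `T` would be biased upward; the residualized IV moment recovers the effect.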

Inconsistency under Effect and Compliance Heterogeneity. The above estimate is a consistent estimate of the average treatment effect as long as there is either no effect heterogeneity with respect to $X$ or there is no heterogeneous compliance (i.e. the effect of the instrument on the treatment) with respect to $X$. Otherwise it is inconsistent. The reason is that, if we let $\tilde{Y} = Y - \mathbb{E}[Y \mid X]$ and $\tilde{T} = T - \mathbb{E}[T \mid X]$, then the population quantity $\beta(X) := \mathbb{E}[\tilde{T}\tilde{Z} \mid X] = \mathrm{Cov}(T, Z \mid X)$ is a function of $X$ under heterogeneous compliance. If we also have effect heterogeneity, then we are solving for a constant $\theta$ that in the limit satisfies $\mathbb{E}[(\tilde{Y} - \theta\,\tilde{T})\tilde{Z}] = 0$, i.e. $\theta = \mathbb{E}[\tilde{Y}\tilde{Z}]/\mathbb{E}[\tilde{T}\tilde{Z}]$. On the other hand, the true heterogeneous model satisfies the conditional equation $\mathbb{E}[(\tilde{Y} - \theta(X)\,\tilde{T})\tilde{Z} \mid X] = 0$. In the limit, the two quantities are related via $\mathbb{E}[\tilde{Y}\tilde{Z} \mid X] = \theta(X)\,\beta(X)$. Then the constant effect that we estimate converges to the quantity $\theta = \mathbb{E}[\theta(X)\beta(X)]/\mathbb{E}[\beta(X)]$. If $\beta(X)$ is not independent of $\theta(X)$, then $\theta$ is a re-weighted version of the true average treatment effect $\mathbb{E}[\theta(X)]$, re-weighted by the heterogeneous compliance. To account for this heterogeneous compliance we need to change our moment equation so as to re-weight based on $\beta(X)$, which is unknown and also needs to be estimated from data. Given that this function could be arbitrarily complex, we want our final estimate to be robust to estimation errors of $\beta$. We can achieve this by considering a doubly robust approach to estimating the ATE. Suppose that we had some other method of computing a preliminary estimate $\theta_{\text{pre}}(X)$ of the heterogeneous treatment effect; then we can combine both estimates to get a more robust method for the ATE, e.g.:

$$\hat\theta_{DR} = \frac{1}{n}\sum_{i=1}^{n}\left( \theta_{\text{pre}}(X_i) + \frac{\big(\tilde{Y}_i - \theta_{\text{pre}}(X_i)\,\tilde{T}_i\big)\,\tilde{Z}_i}{\hat\beta(X_i)} \right) \tag{3}$$

This approach has been analyzed in [25] in the case of constant treatment effects, and an analogue of this average effect was also used by [5] in a policy learning problem as opposed to an estimation problem. In particular, the quantity $\beta(X) = \mathrm{Cov}(T, Z \mid X)$ is known as the compliance score [1, 3]. Our methodological contribution in the next two sections is two-fold: i) first, we propose a model-based stable approach for estimating a preliminary estimate $\theta_{\text{pre}}$, which does not necessarily require that $\beta(X) \neq 0$ everywhere (an assumption that is implicit in the latter method); ii) second, we show that this doubly robust quantity can be used as a regression target, and that minimizing the square loss with respect to this target corresponds to an orthogonal loss, as defined in [12, 13].

## 3 Preliminary Estimate of Conditional Average Treatment Effect (CATE)

Let $q(X) = \mathbb{E}[Y \mid X]$ and $p(X) = \mathbb{E}[T \mid X]$, as in the previous section, and let $h(Z, X) = \mathbb{E}[T \mid Z, X]$. Then observe that we can re-write the moment condition (1) as:

$$\mathbb{E}\big[\,Y - \theta(X)\,h(Z,X) - f(X) \,\big|\, Z, X\,\big] = 0 \tag{4}$$

Moreover, observe that the functions $f$ and $q$ are related via $q(X) = \theta(X)\,p(X) + f(X)$. Thus we can further re-write the moment condition in terms of $q$ instead of $f$:

$$\mathbb{E}\big[\,Y - q(X) - \theta(X)\,\big(h(Z,X) - p(X)\big) \,\big|\, Z, X\,\big] = 0 \tag{5}$$

Moreover, we can identify $\theta$ with the following subset of conditional moments, where the conditioning on $Z$ is removed:

$$\mathbb{E}\Big[\big(Y - q(X) - \theta(X)\,(h(Z,X) - p(X))\big)\,\big(h(Z,X) - p(X)\big) \,\Big|\, X\Big] = 0 \tag{6}$$

Equivalently, $\theta$ is a minimizer of the square loss:

$$L(\theta; q, h, p) = \mathbb{E}\Big[\big(Y - q(X) - \theta(X)\,(h(Z,X) - p(X))\big)^2\Big] \tag{7}$$

since the derivative of this loss with respect to $\theta$ is equal to the moment equation (6) and, thus, the first order condition for the loss minimization problem is satisfied by the true model $\theta$. Moreover, if the loss function satisfies a functional analogue of strong convexity, then any minimizer of the loss achieves small mean squared error with respect to $\theta$. This leads to the following approach, which we refer to as DMLIV:

$$\hat\theta = \arg\min_{\theta \in \Theta}\; \frac{1}{n}\sum_{i=1}^{n} \Big(Y_i - \hat q(X_i) - \theta(X_i)\,\big(\hat h(Z_i, X_i) - \hat p(X_i)\big)\Big)^2 \tag{8}$$

where the nuisance estimates $\hat q$, $\hat h$, $\hat p$ are fitted in a cross-fitting manner.

This method is an extension of the classical two-stage-least-squares (2SLS) approach [2] to allow for arbitrary machine learning models; ignoring the residualization part (i.e. if, for instance, $\hat q = \hat p = 0$), it boils down to: 1) predict the mean treatment $h(Z,X)$ from the instrument $Z$ and the features $X$ with an arbitrary regression/classification method; 2) predict the outcome $Y$ from the predicted treatment multiplied by the heterogeneous effect model $\theta(X)$. Residualization helps us remove the dependence of the mean squared error on the complexity of the baseline function $f$. We achieve this by showing that this loss is orthogonal with respect to $q$ and $p$ (see [13] for the definition of an orthogonal loss). However, orthogonality does not hold with respect to $h$. This finding is reasonable, since we are using $\hat h(Z, X)$ as our regressor. Hence, any error in the measurement of the regressor can directly propagate to an error in $\hat\theta$. This is the same reason why in classical IV regression one cannot ignore the variance from the first stage of 2SLS when calculating confidence intervals.

###### Lemma 1.

The loss function $L(\theta; q, h, p)$ is orthogonal to the nuisance functions $q$ and $p$, but not $h$.

Strong convexity and overlap. Note that both the empirical loss and the population loss are convex in the prediction, which typically implies computational stability. Moreover, the second order directional derivative of the population loss in any functional direction $\nu$ is:

$$\nabla^2 L(\theta)[\nu, \nu] = 2\,\mathbb{E}\Big[\big(h(Z,X) - p(X)\big)^2\, \nu(X)^2\Big] \tag{9}$$

and let:

$$V(X) := \mathrm{Var}\big(h(Z,X) \mid X\big) = \mathbb{E}\Big[\big(h(Z,X) - p(X)\big)^2 \,\Big|\, X\Big] \tag{10}$$

To be able to achieve mean-squared-error rates based on our loss minimization, we need the population version of the loss function to satisfy a functional analogue of $\lambda$-strong convexity: for every direction $\nu = \theta - \theta'$ with $\theta, \theta' \in \Theta$,

$$\mathbb{E}\big[V(X)\, \nu(X)^2\big] \geq \lambda\, \mathbb{E}\big[\nu(X)^2\big] \tag{11}$$

This setting falls under the “single-index” setup of [13]. Using arguments from Lemma 1 of [13], if the following “oracle” strong convexity condition holds:

$$\forall\, \theta \in \Theta: \quad \mathbb{E}\Big[V(X)\,\big(\theta(X) - \theta_*(X)\big)^2\Big] \geq \lambda\, \mathbb{E}\Big[\big(\theta(X) - \theta_*(X)\big)^2\Big] \tag{12}$$

where $\theta_*$ is a minimizer of the population loss,

$$\theta_* \in \arg\min_{\theta \in \Theta} L(\theta; q, h, p) \tag{13}$$

then $L(\theta) - L(\theta_*) \geq \lambda\,\mathbb{E}[(\theta(X) - \theta_*(X))^2]$ for every $\theta \in \Theta$.
A sufficient condition is that $V(X) \geq \lambda$ for all $X$. This is a standard "overlap" condition requiring that the instrument is exogenously varying at any $X$ and has a direct effect on the treatment at any $X$. DMLIV only requires the "average" overlap condition (12), tailored particularly to the hypothesis space $\Theta$; hence it can handle settings where the instrument is weak for some subset of the population. For instance, if $\Theta$ is a linear function class, $\theta(X) = \langle \beta_\theta, \phi(X)\rangle$, then for the oracle strong convexity to hold it suffices that $\mathbb{E}\big[V(X)\,\phi(X)\phi(X)^\top\big] \succeq \lambda\,\mathbb{E}\big[\phi(X)\phi(X)^\top\big]$.
Lemma 1, combined with the above discussion and the results of [13], yields:^{2}^{2}2This corollary follows by small modifications of the proofs of Theorems 1 and 3 of [13] that account for the non-orthogonality w.r.t. $h$, so we omit its proof.

###### Corollary 2.

Assume all random variables are bounded. Suppose that in the final stage of DMLIV we use any algorithm that achieves expected generalization error $\epsilon_n$ w.r.t. the loss $L$, i.e.:

$$\mathbb{E}\big[L(\hat\theta; \hat q, \hat h, \hat p)\big] \leq \min_{\theta \in \Theta} L(\theta; \hat q, \hat h, \hat p) + \epsilon_n \tag{14}$$

Moreover, suppose that the nuisance estimates satisfy $\mathbb{E}[(\hat q(X) - q(X))^2] \leq \epsilon_q$ and $\mathbb{E}[(\hat h(Z,X) - h(Z,X))^2] \leq \epsilon_h$, and that, for all $X$, $|\hat p(X) - p(X)| \leq \epsilon_p$. Then $\hat\theta$ returned by DMLIV satisfies:

$$\mathbb{E}\big[(\hat\theta(X) - \theta_*(X))^2\big] = O\!\left(\frac{\epsilon_n + \epsilon_h + \epsilon_q\,\epsilon_p}{\lambda}\right) \tag{15}$$

If empirical risk minimization is used in the final stage, i.e.:

$$\hat\theta = \arg\min_{\theta \in \Theta}\; \frac{1}{n}\sum_{i=1}^{n} \Big(Y_i - \hat q(X_i) - \theta(X_i)\,\big(\hat h(Z_i, X_i) - \hat p(X_i)\big)\Big)^2 \tag{16}$$

then $\epsilon_n = O(\delta_n^2)$, where $\delta_n$ is the critical radius of the hypothesis space $\Theta$ as defined via the localized Rademacher complexity [18].

Computational considerations. The empirical loss is not a standard square loss. However, we can re-write each summand as $(\hat h_i - \hat p_i)^2\,\big(\tfrac{Y_i - \hat q(X_i)}{\hat h_i - \hat p_i} - \theta(X_i)\big)^2$, with $\hat h_i = \hat h(Z_i, X_i)$ and $\hat p_i = \hat p(X_i)$. Thus the problem is equivalent to a standard square loss minimization with label $(Y_i - \hat q(X_i))/(\hat h_i - \hat p_i)$ and sample weights $(\hat h_i - \hat p_i)^2$. Thus we can use any out-of-the-box machine learning method that accepts sample weights, such as stochastic-gradient-based regression methods and gradient boosted or random forests. Alternatively, if we assume a linear representation of the effect function, $\theta(X) = \langle \beta_\theta, \phi(X)\rangle$, then the problem is equivalent to regressing $Y_i - \hat q(X_i)$ on the scaled features $(\hat h_i - \hat p_i)\,\phi(X_i)$, and again any method for fitting linear models can be invoked.
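The weighted-square-loss reduction can be sketched as follows on simulated data, with linear nuisance models and a gradient boosting final stage; the data generating process and all model choices here are illustrative assumptions:

```python
# Sketch of the DMLIV reduction: cross-fit q(X)=E[Y|X], h(Z,X)=E[T|Z,X],
# p(X)=E[T|X]; then run a weighted square-loss regression with
# label (Y - q)/(h - p) and sample weight (h - p)^2.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)
n = 20000
X = rng.uniform(-1, 1, size=(n, 2))
Z = rng.binomial(1, 0.5, size=n)                  # binary instrument
U = rng.normal(size=n)                            # unobserved confounder
T = (2.0 * Z - 1.0 + 0.5 * U + rng.normal(size=n) > 0).astype(float)
theta_true = 1.0 + X[:, 0]                        # true heterogeneous effect
Y = theta_true * T + 0.5 * X[:, 1] + 0.5 * U + 0.5 * rng.normal(size=n)

XZ = np.column_stack([X, Z])
q_hat = cross_val_predict(LinearRegression(), X, Y, cv=2)    # E[Y|X]
h_hat = cross_val_predict(LinearRegression(), XZ, T, cv=2)   # E[T|Z,X]
p_hat = cross_val_predict(LinearRegression(), X, T, cv=2)    # E[T|X]

weights = (h_hat - p_hat) ** 2
labels = (Y - q_hat) / (h_hat - p_hat)
cate = GradientBoostingRegressor(max_depth=2, random_state=0)
cate.fit(X, labels, sample_weight=weights)        # any weighted learner works
mse = np.mean((cate.predict(X) - theta_true) ** 2)
```

Because the final stage is an ordinary weighted regression, the boosting model could be swapped for any other estimator that accepts `sample_weight`.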

## 4 DRIV: Orthogonal Loss for IV Estimation of CATE and Projections

We now present the main estimation algorithm that combines the doubly robust approach presented for ATE estimation with the preliminary estimator of the CATE to obtain a fully orthogonal and strongly convex loss. This method achieves a second order effect from all nuisance estimation errors and enables oracle rates for the target effect class and asymptotically valid inference for low dimensional target effect classes. In particular, given access to a first-stage model of heterogeneous effects $\theta_{\text{pre}}$ (such as the one produced by DMLIV), we can estimate a more robust model of heterogeneous effects by minimizing a square loss that treats the doubly robust quantity used in Equation (3) as the label:

$$L_{DR}(\theta) = \mathbb{E}\left[\left( \theta_{\text{pre}}(X) + \frac{\big(\tilde{Y} - \theta_{\text{pre}}(X)\,\tilde{T}\big)\,\tilde{Z}}{\beta(X)} - \theta(X) \right)^{\!2}\,\right] \tag{17}$$

We allow for a model space $\Theta_\Pi$ that is not necessarily equal to $\Theta$. The solution in Equation (3) is a special case of this minimization problem where the space $\Theta_\Pi$ contains only constant functions. Our main result shows that this loss is orthogonal to all the nuisance functions $q$, $p$, $r$, $\beta$ and $\theta_{\text{pre}}$. Moreover, it is strongly convex in the prediction $\theta(X)$, since conditional on all the nuisance estimates it is a standard square loss. Moreover, we show that the loss is orthogonal irrespective of the model space $\Theta_\Pi$, even if $\theta \notin \Theta_\Pi$, as long as the preliminary estimate $\theta_{\text{pre}}$ is consistent with respect to the true CATE (i.e. we can fit a flexible preliminary CATE and use it to project onto a simpler hypothesis space).
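A minimal end-to-end sketch of this recipe on simulated data follows. The data generating process and model choices are illustrative assumptions, and for brevity the preliminary effect is a crude constant IV estimate rather than a full DMLIV fit:

```python
# Sketch of DRIV: cross-fit residuals, estimate the compliance score
# beta(X) = Cov(T, Z | X), form the doubly robust label, and regress it on X.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(2)
n = 20000
X = rng.uniform(-1, 1, size=(n, 2))
Z = rng.binomial(1, 0.5, size=n)
U = rng.normal(size=n)
T = (2.0 * Z - 1.0 + 0.5 * U + rng.normal(size=n) > 0).astype(float)
theta_true = 1.0 + X[:, 0]
Y = theta_true * T + 0.5 * X[:, 1] + 0.5 * U + 0.5 * rng.normal(size=n)

def residual(v, X):
    return v - cross_val_predict(LinearRegression(), X, v, cv=2)

Y_res, T_res, Z_res = residual(Y, X), residual(T, X), residual(Z, X)
beta_hat = cross_val_predict(LinearRegression(), X, T_res * Z_res, cv=2)
theta_pre = np.mean(Y_res * Z_res) / np.mean(T_res * Z_res)  # constant IV

y_dr = theta_pre + (Y_res - theta_pre * T_res) * Z_res / beta_hat
driv = GradientBoostingRegressor(max_depth=2, random_state=0)
driv.fit(X, y_dr)                      # CATE model: regress label on X
ate_hat = np.mean(y_dr)                # projection onto constant functions
```

Averaging the label recovers the ATE, while the regression on `X` recovers the heterogeneous model; both use the same doubly robust target.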

###### Lemma 3.

The loss $L_{DR}$ is orthogonal with respect to the nuisance functions $q$, $p$, $r$, $\beta$ and $\theta_{\text{pre}}$.

If we use DMLIV for $\theta_{\text{pre}}$, then even though DMLIV has a first order impact from the error of $\hat h$, the second stage estimate has only a second order impact from this error, since it has a second order impact from the first stage CATE error. Lemma 3, together with the results of [13], implies the following corollary:

###### Corollary 4.

Assume all random variables are bounded and $|\beta(X)| \geq \lambda$ for all $X$. Suppose that in the final stage of DRIV we use any algorithm that achieves expected generalization error $\epsilon_n$ with respect to the loss $L_{DR}$ over hypothesis space $\Theta_\Pi$, i.e.:

$$\mathbb{E}\big[L_{DR}(\hat\theta)\big] \leq \min_{\theta \in \Theta_\Pi} L_{DR}(\theta) + \epsilon_n \tag{18}$$

Moreover, suppose that each nuisance estimate $g \in \{q, p, r, \beta, \theta_{\text{pre}}\}$ satisfies $\mathbb{E}[(\hat g(X) - g(X))^2] \leq \epsilon_g$, and the hypothesis space $\Theta_\Pi$ is convex. Then $\hat\theta$ returned by DRIV satisfies:

$$\mathbb{E}\big[(\hat\theta(X) - \theta_*(X))^2\big] = O(\epsilon_n + \kappa_n) \tag{19}$$

where $\theta_*$ is the minimizer of the oracle loss over $\Theta_\Pi$ and $\kappa_n$ collects the second order terms, i.e. products of pairs of nuisance error rates:

$$\kappa_n = \max_{g, g' \in \{q, p, r, \beta, \theta_{\text{pre}}\}} \epsilon_g\, \epsilon_{g'} \tag{20}$$

If empirical risk minimization is used in the final stage, i.e.:

$$\hat\theta = \arg\min_{\theta \in \Theta_\Pi}\; \frac{1}{n}\sum_{i=1}^{n} \big(\hat y_{DR,i} - \theta(X_i)\big)^2 \tag{21}$$

where $\hat y_{DR,i}$ is the doubly robust label of sample $i$ computed with the cross-fitted nuisance estimates, then $\epsilon_n = O(\delta_n^2)$, where $\delta_n$ is the critical radius of the hypothesis space $\Theta_\Pi$ as defined via the localized Rademacher complexity [18].

For the special case where $\Theta_\Pi$ is high-dimensional sparse linear and contains the true CATE, it suffices to require only an $o(n^{-1/4})$ rate for the nuisance functions, as opposed to the stronger requirement of the general case. This corollary stems from verifying that all the assumptions of [12] hold for our loss function:

###### Corollary 5.

If $\Theta_\Pi$ is high-dimensional sparse linear, i.e. $\theta(X) = \langle \beta_\theta, \phi(X)\rangle$ with $\phi(X) \in \mathbb{R}^d$, and the model is well-specified, i.e. $\theta \in \Theta_\Pi$ with $\|\beta_\theta\|_0 \leq s$, then if $\ell_1$-penalized square loss minimization is used in the final step of DRIV, i.e.:

$$\hat\theta = \arg\min_{\theta \in \Theta_\Pi}\; \frac{1}{n}\sum_{i=1}^{n} \big(\hat y_{DR,i} - \theta(X_i)\big)^2 + \mu\,\|\beta_\theta\|_1 \tag{22}$$

it suffices that each nuisance error rate is $o(n^{-1/4})$ to get $\mathbb{E}[(\hat\theta(X) - \theta(X))^2] = O\!\big(\tfrac{s \log d}{n}\big)$ for $\mu = \Theta\!\big(\sqrt{\log d / n}\big)$.

Interpretability through projections. The fact that our loss function can be used with any target space $\Theta_\Pi$ allows us to perform inference on the projection of $\theta$ on a simple space (e.g. decision trees, linear functions) for interpretability purposes. If we let $y_{DR} = \theta_{\text{pre}}(X) + (\tilde{Y} - \theta_{\text{pre}}(X)\,\tilde{T})\,\tilde{Z}/\beta(X)$ be the label in the final regression of DRIV, then observe that when the nuisance estimates take their true values, $\mathbb{E}[y_{DR} \mid X] = \theta(X)$, since the second part of $y_{DR}$ has conditional mean $\theta(X) - \theta_{\text{pre}}(X)$. Hence:

$$\mathbb{E}\big[(y_{DR} - \theta_\pi(X))^2\big] = \mathbb{E}\big[(y_{DR} - \theta(X))^2\big] + \mathbb{E}\big[(\theta(X) - \theta_\pi(X))^2\big] \tag{23}$$

The first part is independent of $\theta_\pi$, and hence minimizing the oracle loss is equivalent to minimizing $\mathbb{E}[(\theta(X) - \theta_\pi(X))^2]$ over $\theta_\pi \in \Theta_\Pi$, which is exactly the projection of $\theta$ on $\Theta_\Pi$. One version of an interpretable model is estimating the CATE with respect to a subset of the variables $X_1 \subseteq X$, i.e. $\mathbb{E}[\theta(X) \mid X_1]$ (e.g. how the treatment effect varies with a single feature). This boils down to setting $\Theta_\Pi$ to some space of functions of $X_1$ only.

If $X_1$ is a low-dimensional set of features and $\Theta_\Pi$ is the space of linear functions of $X_1$, i.e. $\theta_\pi(X_1) = \langle \alpha, [1; X_1]\rangle$, then the first order condition of our loss is equal to the moment condition $\mathbb{E}\big[(y_{DR} - \langle \alpha, [1; X_1]\rangle)\,[1; X_1]\big] = 0$. Then orthogonality of our loss implies that DRIV is equivalent to an orthogonal moment estimation method [11]. Thus, using the results of [11], we get the corollary:

###### Corollary 6 (Confidence Intervals).

The estimate $\hat\alpha$ returned by DRIV with $\Theta_\Pi$ the space of linear functions of a low-dimensional $X_1$, with all nuisance estimation error rates of $o(n^{-1/4})$ for constants independent of $n$, is asymptotically normal, with asymptotic variance equal to the hypothetical variance of $\hat\alpha$ as if the nuisance estimates had their true values.

Hence, we can use out-of-the-box packages for calculating confidence intervals of an OLS regression (with $y_{DR}$ as the outcome) to get confidence intervals and $p$-values on the coefficients $\alpha$.
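The following sketch illustrates this on simulated data: the doubly robust label is regressed on $[1, X_1]$ by OLS, and heteroskedasticity-robust (sandwich) standard errors yield normal-based confidence intervals. The DGP and model choices are illustrative assumptions:

```python
# OLS projection of the DRIV label on [1, x1] with robust (sandwich)
# standard errors, on synthetic data where the true CATE is 1 + x1.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(3)
n = 20000
X = rng.uniform(-1, 1, size=(n, 2))
Z = rng.binomial(1, 0.5, size=n)
U = rng.normal(size=n)
T = (2.0 * Z - 1.0 + 0.5 * U + rng.normal(size=n) > 0).astype(float)
theta_true = 1.0 + X[:, 0]            # linear CATE: intercept 1, slope 1
Y = theta_true * T + 0.5 * X[:, 1] + 0.5 * U + 0.5 * rng.normal(size=n)

def residual(v, X):
    return v - cross_val_predict(LinearRegression(), X, v, cv=2)

Y_res, T_res, Z_res = residual(Y, X), residual(T, X), residual(Z, X)
beta_hat = cross_val_predict(LinearRegression(), X, T_res * Z_res, cv=2)
theta_pre = np.mean(Y_res * Z_res) / np.mean(T_res * Z_res)
y_dr = theta_pre + (Y_res - theta_pre * T_res) * Z_res / beta_hat

A = np.column_stack([np.ones(n), X[:, 0]])       # design matrix [1, x1]
coef = np.linalg.lstsq(A, y_dr, rcond=None)[0]
resid = y_dr - A @ coef
bread = np.linalg.inv(A.T @ A)
meat = A.T @ (A * (resid ** 2)[:, None])
se = np.sqrt(np.diag(bread @ meat @ bread))      # robust standard errors
ci_low, ci_high = coef - 1.96 * se, coef + 1.96 * se
```

A standard OLS package (with robust standard errors) applied to `y_dr` would give the same intervals; the sandwich formula is written out here only to keep the sketch dependency-free.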

### 4.1 Variance Reduction with Re-Weighting under Well-Specification

DRIV is orthogonal irrespective of whether $\Theta_\Pi$ contains the true CATE model, and provides estimation rates for the projection of the CATE on the space $\Theta_\Pi$. However, it does suffer from high variance, since we are dividing by the conditional co-variance $\beta(X) = \mathrm{Cov}(T, Z \mid X)$. Hence, if the instrument is weak in some region of $X$'s, then the method can suffer from instability. Moreover, the variance of the approach does not adapt to the model space $\Theta_\Pi$, i.e. it could be that some $X$'s have a zero co-variance but the model is identified by the remainder of the $X$'s.

Unlike DRIV, DMLIV does not suffer from these drawbacks. However, as pointed out earlier, DMLIV is not orthogonal with respect to the nuisance function $h$. We now show how we can combine the best of both worlds by simply re-weighting the DRIV loss, so as to put less weight on high variance regions. However, unlike DRIV, the loss that we construct is simply orthogonal and not universally orthogonal. In particular, since $1/\beta(X)^2$ is a reasonable proxy for the magnitude of the variance of the regression target $y_{DR}$, we will re-weight the loss by $\beta(X)^2$:

$$L_{RW}(\theta) = \mathbb{E}\big[\beta(X)^2\,\big(y_{DR} - \theta(X)\big)^2\big] = \mathbb{E}\Big[\Big(\beta(X)\,\big(\theta_{\text{pre}}(X) - \theta(X)\big) + \big(\tilde{Y} - \theta_{\text{pre}}(X)\,\tilde{T}\big)\,\tilde{Z}\Big)^2\Big] \tag{24}$$

In other words, we are re-weighting the DRIV loss to give less weight to samples that have high variance, and solely using the samples with low variance to identify our true model. We can then show the following lemma:

###### Lemma 7.

Assuming that the preliminary estimate $\theta_{\text{pre}}$ converges in mean squared error to the true CATE $\theta$, the loss $L_{RW}$ is orthogonal with respect to the nuisance functions $q$, $p$, $r$, $\beta$ and $\theta_{\text{pre}}$.

We will refer to the latter algorithm as DRIV-RW (for re-weighted DRIV). Similar to DRIV, we can then get the following corollary:

###### Corollary 8.

Assume all random variables are bounded, and suppose that in the final stage of DRIV-RW we use any algorithm that achieves expected generalization error $\epsilon_n$ with respect to the loss $L_{RW}$ over hypothesis space $\Theta_\Pi$, i.e.:

$$\mathbb{E}\big[L_{RW}(\hat\theta)\big] \leq \min_{\theta \in \Theta_\Pi} L_{RW}(\theta) + \epsilon_n \tag{25}$$

Moreover, suppose that each nuisance estimate converges in mean squared error at a rate such that the second order products of the errors are $o(\epsilon_n)$, and that for all $\nu \in \Theta_\Pi - \Theta_\Pi$: $\mathbb{E}[\beta(X)^2\,\nu(X)^2] \geq \lambda\,\mathbb{E}[\nu(X)^2]$. Then $\hat\theta$ returned by DRIV-RW satisfies:

$$\mathbb{E}\big[(\hat\theta(X) - \theta_*(X))^2\big] = O(\epsilon_n / \lambda) \tag{26}$$

We can also make statements analogous to Corollary 4 for the case where empirical risk minimization is used in the final stage, or $\ell_1$-penalized empirical risk minimization over linear functions. Moreover, observe that even if the on-average overlap condition does not hold, we still recover a model that converges to the true one in terms of the re-weighted MSE, where we re-weight based on $\beta(X)^2$, a measure of the strength of the instrument (i.e. we will predict better in regions where the instrument is strong). This is a good fall-back guarantee to have in cases where the instrument happens to be weak.

Finally, in the case where the instrument is multi-dimensional, we can follow an approach similar to DMLIV, where we construct an “optimal” instrument of the form $h(Z, X) = \mathbb{E}[T \mid Z, X]$. Then we can consider the analogue of DRIV-RW, but where $h(Z,X)$ takes the place of $Z$, i.e. let:

$$\tilde{H} = h(Z, X) - p(X) \tag{27}$$

Then observe that the analogue of $\beta(X)$ for the instrument $h(Z,X)$ is equal to $V(X)$ as defined in the DMLIV section, i.e.

$$\mathbb{E}\big[\tilde{T}\tilde{H} \mid X\big] = \mathbb{E}\big[\big(h(Z,X) - p(X)\big)^2 \mid X\big] = V(X)$$

and the loss function takes the form:

$$L_{RW}^{h}(\theta) = \mathbb{E}\Big[\Big(V(X)\,\big(\theta_{\text{pre}}(X) - \theta(X)\big) + \big(\tilde{Y} - \theta_{\text{pre}}(X)\,\tilde{T}\big)\,\tilde{H}\Big)^2\Big] \tag{28}$$

###### Lemma 9.

Assuming that there exists $\bar\theta$ such that:

$$\mathbb{E}\big[\,Y - q(X) - \bar\theta(X)\,\big(h(Z,X) - p(X)\big) \,\big|\, h(Z,X), X\,\big] = 0 \tag{29}$$

and that the preliminary estimate $\theta_{\text{pre}}$ converges in mean squared error to $\bar\theta$, then the loss $L_{RW}^{h}$ is orthogonal with respect to the nuisance functions $q$, $p$, $h$, $V$ and $\theta_{\text{pre}}$.

Observe that if we use DMLIV as the preliminary estimator, then because DMLIV solely uses the latter set of moment restrictions, it will converge in MSE to a function that satisfies the moment condition (29). Thus it will satisfy the requirement. We will refer to this algorithm as Projected-DRIV-RW.

###### Corollary 10.

Assume all random variables are bounded, and suppose that in the final stage of Projected-DRIV-RW we use any algorithm that achieves expected generalization error $\epsilon_n$ with respect to the loss $L_{RW}^{h}$ over hypothesis space $\Theta_\Pi$, i.e.:

$$\mathbb{E}\big[L_{RW}^{h}(\hat\theta)\big] \leq \min_{\theta \in \Theta_\Pi} L_{RW}^{h}(\theta) + \epsilon_n \tag{30}$$

Moreover, suppose that each nuisance estimate converges in mean squared error at a rate such that the second order products of the errors are $o(\epsilon_n)$, and that for all $\nu \in \Theta_\Pi - \Theta_\Pi$: $\mathbb{E}[V(X)^2\,\nu(X)^2] \geq \lambda\,\mathbb{E}[\nu(X)^2]$. Then $\hat\theta$ returned by Projected-DRIV-RW satisfies:

$$\mathbb{E}\big[(\hat\theta(X) - \theta_*(X))^2\big] = O(\epsilon_n / \lambda) \tag{31}$$

## 5 Example: Randomized Trials with Non-Compliance

Given that our main application is a special case of our framework where the instrument is the assignment in a randomized control trial with non-compliance, in this section we show how our algorithms simplify for this setting.

Randomized control trials with non-compliance, or equivalently intent-to-treat A/B tests, are the special case where the instrument and the treatment are binary, i.e. $T, Z \in \{0, 1\}$, and the instrument is fully exogenous. For simplicity, we assume that $\Pr[Z = 1 \mid X] = 1/2$ for all $X$ (i.e. a balanced test). In this case, the nuisance components can all be expressed as a function of the following quantity (which is typically referred to as the compliance score):

$$\Delta(X) = \mathbb{E}[T \mid Z = 1, X] - \mathbb{E}[T \mid Z = 0, X] \tag{32}$$

Then, after straightforward algebraic manipulations, we have:

$$h(Z, X) - p(X) = \big(Z - \tfrac{1}{2}\big)\,\Delta(X), \qquad \beta(X) = \mathrm{Cov}(T, Z \mid X) = \frac{\Delta(X)}{4}, \qquad \tilde{Z} = Z - \tfrac{1}{2}$$

and the loss functions simplify to expressions that depend on the nuisances only through $q(X)$ and $\Delta(X)$. Moreover, observe that the re-weighted loss $L_{RW}$ with $\theta_{\text{pre}} = 0$, i.e. a zero preliminary estimator of the treatment effect, is equivalent (up to a constant factor) to the DMLIV loss. Thus in this case we essentially need to estimate three nuisance components: $q(X)$, $\Delta(X)$ and the preliminary estimate $\theta_{\text{pre}}$. We can estimate $\Delta$ by simply estimating $\mathbb{E}[T \mid Z = 1, X]$ and $\mathbb{E}[T \mid Z = 0, X]$ and taking their difference, which boils down to a classification problem among the treated and control populations for each value of $Z$. In fact, once we have estimated these two nuisance functions, we also know that $p(X) = \tfrac{1}{2}\big(\mathbb{E}[T \mid Z = 1, X] + \mathbb{E}[T \mid Z = 0, X]\big)$. Thus it all boils down to estimating $q(X)$ and $\Delta(X)$.
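This nuisance estimation step can be sketched as follows on illustrative synthetic data, with a single logistic model standing in for any classifier:

```python
# Estimate the compliance score Delta(X) = E[T|Z=1,X] - E[T|Z=0,X] with one
# classifier for E[T|Z,X], then derive p(X) and beta(X) for a balanced
# binary instrument (Pr[Z=1|X] = 1/2).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n = 20000
X = rng.normal(size=(n, 2))
Z = rng.binomial(1, 0.5, size=n)                 # balanced instrument
T = rng.binomial(1, 0.2 + 0.6 * Z)               # true compliance = 0.6

clf = LogisticRegression().fit(np.column_stack([X, Z]), T)
m1 = clf.predict_proba(np.column_stack([X, np.ones(n)]))[:, 1]   # E[T|Z=1,X]
m0 = clf.predict_proba(np.column_stack([X, np.zeros(n)]))[:, 1]  # E[T|Z=0,X]
delta = m1 - m0                                  # compliance score
p_hat = (m1 + m0) / 2                            # E[T|X] for balanced Z
beta_hat = delta / 4                             # Cov(T, Z | X)
```

Equivalently, one could fit separate classifiers on the $Z=1$ and $Z=0$ subsamples; the single-model version is used here only to keep the sketch short.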

Finally, the re-weighted loss can be further simplified to:

$$L_{RW}(\theta) = \mathbb{E}\Big[\Big(\big(Z - \tfrac{1}{2}\big)\big(\tilde{Y} - \theta_{\text{pre}}(X)\,\tilde{T}\big) + \frac{\Delta(X)}{4}\big(\theta_{\text{pre}}(X) - \theta(X)\big)\Big)^2\Big] \tag{33}$$

This functional form is reminiscent of the R-learner loss for the case when the treatment is exogenous, which corresponds to $\mathbb{E}[(\tilde{Y} - \theta(X)\,\tilde{T})^2]$. Here, we see that to handle the endogeneity of the treatment, we need to fit a preliminary estimator of the CATE.

## 6 Estimating Effects of Membership on TripAdvisor

We apply our methods to estimate the treatment effect of membership on the number of days a user visits TripAdvisor, a leading travel assistance website. The instrument is a 14-day intent-to-treat A/B test run during 2018, in which users in group A received a new, easier membership sign-up process, while users in group B did not. The treatment is whether a user became a member or not. Becoming a member and logging into TripAdvisor gives users exclusive access to trip planning tools, special deals and price alerts, and personalized ideas and travel advice. Our data consists of 4,606,041 total users in a 50:50 A/B test. For each user, we have a 28-day pre-experiment summary of their browsing and purchasing activity on TripAdvisor (see Sec. B.2). The instrument significantly increased the rate of treatment, and is assumed to satisfy the exclusion restriction.

We applied two sets of nuisance estimation models with different complexity characteristics: LASSO regression and logistic regression with an L2 penalty (LM); and gradient boosting regression and classification (GB). The only exception was $r(X) = \mathbb{E}[Z \mid X]$, for which we used a fixed estimate of $1/2$, since the instrument was a large 50:50 randomized experiment. See Sec. B.1 for details.^{3}^{3}3We attempted to use the R implementation of Generalized Random Forests (GRF) [4] to compare with our results. However, we could not fit the model due to the size of the data and insufficient-memory errors (with 64GB RAM).

We estimate the ATE using DRIV projected onto a constant (Table 1). Using linear nuisance models results in very similar ATE estimates between DMLATEIV and DRIV. To understand why, we compare the covariate associations with both heterogeneity and compliance under DRIV: if some covariates had significant non-zero associations with both heterogeneity and compliance, this could lead to different estimates between DRIV and DMLATEIV (and vice versa). Replacing the CATE projection model with a linear regression, we obtain valid inferences for the covariates associated with treatment effect heterogeneity (Figure 1). For compliance, we run a linear regression of the estimated compliance quantity on the features to assess its association with each of them (see Sec. B.1 for details). Comparing treatment and compliance coefficients, os_type_linux and revenue_pre are the only coefficients substantially different from 0 in both. However, only a very small proportion of users in the experiment are Linux users, and the distribution of revenue is very positively skewed. This explains the minor difference between the DMLATEIV and DRIV estimates. Moreover, we fit a shallow, heavily regularized random forest and interpret it using Shapley Additive Explanations (SHAP) [22]. SHAP gave a directionally similar impact of each feature on the effect (Figure 1). However, since we constrained the model to have depth at most one, it essentially gives the features in order of importance if we were to split the population based on a single feature. This explains why the order of importance of features in the forest does not match the ranking by coefficient magnitude in the linear model: the two have different interpretations. The features picked up by the forest make intuitive sense, since an already highly engaged member of TripAdvisor, or a user who has recently made a booking, is less likely to further increase their visits to TripAdvisor.

Using gradient boosting nuisance models, we show that many inferences remain similar (Figure 2 in Appendix). The most notable changes in heterogeneity were for features with a highly skewed distribution (e.g. visits to specific pages on W), or features that appear rarely in the data (e.g. Linux users). The linear CATE projection model coefficients are largely similar under both residualization models (except for the Linux operating system feature, which appears rarely in the data). Moving to a random forest CATE projection model interpreted with SHAP reveals greater differences, especially for the highly skewed features.

Similar instrument from a recent experiment A recent 2019 A/B test of the same membership sign-up process provided another viable instrument. This 21-day A/B test included a much larger, more diverse population of users than the 2018 test, due to fewer eligibility restrictions (see Sec. B.2 for details). We apply DRIV with gradient boosting residualization models and a linear projection of the CATE. The CATE distribution has generally higher values than in the 2018 experiment, which reflects the different experimental population: users in the 2018 experiment had much higher engagement and significantly higher revenue in the pre-experiment period, largely because users were only included in the 2018 experiment on their second visit. This higher baseline naturally makes large treatment effects harder to achieve, explaining the generally lower CATE distribution in the 2018 experiment. We note that, unlike in 2018, the revenue coefficient is no longer significant; we again attribute this to the much higher revenue baseline in 2018. Despite the population differences, however, "days_visited_vrs_pre" continues to have a very significant positive association. "days_visited_exp_pre" now also appears to have a significantly positive association, as does the iPhone device indicator (which was not a feature in the 2018 experiment). The inclusion of iPhone users is another big domain shift between the two experiments.

Policy recommendations for W Our results offer several policy implications for W. First, encourage iPhone users and users who frequent vacation rentals pages to sign up for membership (e.g. by providing stronger encouragement for sign-up on the relevant sub-pages), since these users exhibited high treatment effects from membership. For frequent visitors to vacation rentals pages, this effect was robust across residualization models, CATE projections, and even different instruments. Second, find ways to improve the membership offering for users who are already engaged: e.g. those who recently made a booking (high revenue_pre) or who were already frequent visitors (high days_visited_free_pre).

Validation on Semi-Synthetic Data In Appendix C, we validate the correctness of the ATE and CATE estimates from DRIV by creating a semi-synthetic dataset with the same variables, whose marginal distributions look similar to the TripAdvisor data, but where we know the true effect model.
We find that DRIV recovers a good estimate of the ATE.
DRIV with a linear regression final stage also recovers the true CATE coefficients, and a random forest final stage picks out the correct drivers of heterogeneity as its most important features.
Moreover, coverage of the DRIV ATE confidence intervals is close to nominal at 94%, while DMLATEIV can be very biased and has 26% coverage.^{4}^{4}4Results of the coverage experiment can be reproduced by running https://github.com/microsoft/EconML/tree/master/prototypes/dml_iv/coverage.py followed by the post-processing in https://github.com/microsoft/EconML/tree/master/prototypes/dml_iv/post_processing.ipynb. Single synthetic instance results on the quality of the recovered estimates and how they compare with benchmark approaches can be found in https://github.com/microsoft/EconML/tree/master/prototypes/dml_iv/TA_DGP_Analysis.ipynb.
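The coverage numbers can be computed with a simple helper that checks, across Monte Carlo runs, how often a normal-approximation confidence interval contains the true ATE; a minimal sketch:

```python
import numpy as np

def coverage_rate(estimates, std_errs, true_ate, z_crit=1.96):
    """Fraction of Monte Carlo runs whose normal-approximation CI
    (estimate +/- z_crit * std_err) contains the true ATE."""
    estimates = np.asarray(estimates)
    std_errs = np.asarray(std_errs)
    lower = estimates - z_crit * std_errs
    upper = estimates + z_crit * std_errs
    return np.mean((lower <= true_ate) & (true_ate <= upper))
```

A well-calibrated estimator should achieve roughly 95% coverage with `z_crit=1.96`; the 94% DRIV and 26% DMLATEIV figures above are exactly this statistic over the simulation runs.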

## 7 Estimating the Effect of Schooling on Wages

The causal impact of schooling on wages has been studied at length in economics (see [14], [6], [7], [17]), and although it is generally agreed that there is a positive impact, it is difficult to obtain a consistent estimate of the effect due to self-selection into education levels. To account for this endogeneity, Card [6] proposes using proximity to a 4-year college as an instrument for schooling. We analyze Card's data from the National Longitudinal Survey of Young Men (NLSYM, 1966) to estimate the ATE of education on wages and to find sources of heterogeneity. We describe the NLSYM data in depth in Appendix D. At a high level, the data contains 3,010 rows with 22 mostly binary covariates $X$, log wages ($Y$), years of schooling ($T$), and a 4-year college proximity indicator ($Z$).

We apply DMLATEIV and DRIV with linear (LM) or gradient boosted (GBM) nuisance models to estimate the ATE (Table 2). While the DMLATEIV results are consistent with Card's estimate, this estimate is likely biased in the presence of compliance and effect heterogeneity (see Sec. 2.1). The DRIV ATE estimates, albeit lower, still lie within the 95% CI of the DMLATEIV ATE.
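In the homogeneous-effect model, the final stage of DMLATEIV is an IV ratio of residualized moments: the ATE estimate is E[ỹ z̃] / E[T̃ z̃], where ỹ, T̃, and z̃ are the residuals of outcome, treatment, and instrument after removing their covariate predictions. Below is a minimal numpy sketch of this final stage, assuming cross-fitted residuals are already in hand; the standard error is a standard influence-function calculation, not necessarily the paper's exact implementation.

```python
import numpy as np

def dmlateiv_ate(y_res, t_res, z_res):
    """Final stage of DMLATEIV: the ATE is the IV ratio of residual moments,
    E[y_res * z_res] / E[t_res * z_res], with an influence-function-based
    standard error for the ratio estimator."""
    den = np.mean(t_res * z_res)
    ate = np.mean(y_res * z_res) / den
    psi = (y_res - ate * t_res) * z_res / den   # per-sample influence values
    se = np.std(psi) / np.sqrt(len(psi))
    return ate, se
```

DRIV replaces this single ratio with a doubly robust pseudo-outcome, which is what allows it to remain valid under heterogeneous compliance and effects.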

We study effect heterogeneity with a shallow random forest as the last stage of DRIV. Fig. 3 depicts the spread of treatment effects and the important features selected. Most effects (89%) are positive, with a few strongly negative outliers. The heterogeneity is driven mainly by parental education variables. To study this effect, we project the DRIV treatment effect on the mother's education variable. In Fig. 3, we note that treatment effects are highest among children of less educated mothers. This pattern has also been observed in [6] and [17].^{5}^{5}5See https://github.com/microsoft/EconML/tree/master/prototypes/dml_iv/NLSYM_Linear.ipynb for LM nuisance models and https://github.com/microsoft/EconML/tree/master/prototypes/dml_iv/NLSYM_GBM.ipynb for GBM nuisance models for these results.

Table 2: ATE estimates on the observational NLSYM data and on the semi-synthetic data.

| Nuisance | Method | Obs. ATE Est | Obs. 95% CI | Semi-Syn. ATE Est | Semi-Syn. 95% CI | Semi-Syn. Cover |
| --- | --- | --- | --- | --- | --- | --- |
| LM | DMLATEIV | 0.137 | [0.027, 0.248] | 0.651 | [0.607, 0.696] | 52% |
| LM | DRIV | 0.072 | [0.009, 0.135] | 0.546 | [0.427, 0.665] | 98% |
| GBM | DMLATEIV | 0.157 | [0.041, 0.274] | 0.653 | [0.612, 0.694] | 72% |
| GBM | DRIV | 0.041 | [-0.037, 0.120] | 0.650 | [0.574, 0.727] | 93% |

Semi-synthetic true ATE = 0.609; "Cover" is the coverage of the 95% CI over 100 Monte Carlo simulations.

Semi-synthetic Data Results. We created semi-synthetic data from the NLSYM covariates $X$ and instrument $Z$, with generated treatments and outcomes based on known compliance and treatment effect functions (see Appx. D for details).
In Table 2, we see that the DMLATEIV ATE (true ATE = 0.609) is upwards biased and has poor coverage over 100 runs, whereas the DRIV ATE is less biased and has overall good coverage. With DRIV, the coefficient CIs also cover the correct coefficients when the final stage is the space of linear models on the relevant variables, both with linear nuisance models (including products of features) and with gradient boosted forest nuisance models (coefficients are compared after variables were pre-processed and normalized to mean zero and unit variance).^{6}^{6}6See the Semi-Synthetic Analysis notebooks with LM and GBM nuisance models for these results.
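As an illustration of the semi-synthetic construction, the sketch below generates treatments and outcomes from real covariates and instrument using known compliance and effect functions; the functional forms, coefficients, and variable roles are illustrative assumptions, not the exact design of Appx. D.

```python
import numpy as np

def semi_synthetic(X, Z, rng):
    """Generate treatment and outcome from real covariates X and a binary
    instrument Z, using known compliance and effect functions, plus an
    unobserved confounder that affects both take-up and the outcome."""
    n = X.shape[0]
    compliance = 0.2 + 0.5 * (X[:, 0] > 0)       # known compliance function
    true_effect = 0.6 + 0.3 * X[:, 1]            # known CATE
    confounder = rng.normal(size=n)              # unobserved confounder
    # Take-up depends on the instrument (via compliance) and the confounder
    T = (rng.uniform(size=n) < compliance * Z + 0.1 * (confounder > 0)).astype(float)
    Y = true_effect * T + X[:, 0] + confounder + rng.normal(size=n)
    return T, Y, true_effect
```

Because the true CATE function is known, both the ATE bias and the coverage of coefficient CIs can be evaluated exactly on data generated this way.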

#### Acknowledgements

We thank Jeff Palmucci, Brett Malone, Baskar Mohan, Molly Steinkrauss, Gwyn Fisher and Matthew Dacey from TripAdvisor for their support and assistance in making this collaboration possible.

## References

- [1] Alberto Abadie. Semiparametric instrumental variable estimation of treatment response models. Journal of Econometrics, 113(2):231 – 263, 2003.
- [2] Joshua D Angrist and Jörn-Steffen Pischke. Mostly harmless econometrics: An empiricist’s companion. Princeton university press, 2008.
- [3] Peter M Aronow and Allison Carnegie. Beyond late: Estimation of the average treatment effect with an instrumental variable. Political Analysis, 21(4):492–506, 2013.
- [4] Susan Athey, Julie Tibshirani, Stefan Wager, et al. Generalized random forests. The Annals of Statistics, 47(2):1148–1178, 2019.
- [5] Susan Athey and Stefan Wager. Efficient policy learning. arXiv preprint arXiv:1702.02896, 2017.
- [6] David Card. Using geographic variation in college proximity to estimate the return to schooling. Technical report, National Bureau of Economic Research, 1993.
- [7] David Card. Estimating the return to schooling: Progress on some persistent econometric problems. Econometrica, 69(5):1127–1160, 2001.
- [8] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages 785–794, New York, NY, USA, 2016. ACM.
- [9] Xiaohong Chen and Zhipeng Liao. Sieve semiparametric two-step gmm under weak dependence. Cowles Foundation Discussion Papers 2012, Cowles Foundation for Research in Economics, Yale University, 2015.
- [10] Xiaohong Chen and Demian Pouzo. Estimation of nonparametric conditional moment models with possibly nonsmooth generalized residuals. Econometrica, 80(1):277–321, 2012.
- [11] Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68, 2018.
- [12] Victor Chernozhukov, Denis Nekipelov, Vira Semenova, and Vasilis Syrgkanis. Plug-in regularized estimation of high-dimensional parameters in nonlinear semiparametric models. arXiv preprint arXiv:1806.04823, 2018.
- [13] Dylan J Foster and Vasilis Syrgkanis. Orthogonal statistical learning. arXiv preprint arXiv:1901.09036, 2019.
- [14] Zvi Griliches. Estimating the returns to schooling: Some econometric problems. Econometrica: Journal of the Econometric Society, pages 1–22, 1977.
- [15] Jason Hartford, Greg Lewis, Kevin Leyton-Brown, and Matt Taddy. Deep IV: A flexible approach for counterfactual prediction. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1414–1423, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
- [16] Jason Hartford, Greg Lewis, Kevin Leyton-Brown, and Matt Taddy. Deep iv: A flexible approach for counterfactual prediction. In International Conference on Machine Learning, pages 1414–1423, 2017.
- [17] John Hudson and John G Sessions. Parental education, labor market experience and earnings: new wine in an old bottle? Economics Letters, 113(2):112–115, 2011.
- [18] Vladimir Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. Springer, 2011.
- [19] Sören R Künzel, Jasjeet S Sekhon, Peter J Bickel, and Bin Yu. Meta-learners for estimating heterogeneous treatment effects using machine learning. arXiv preprint arXiv:1706.03461, 2017.
- [20] Scott M Lundberg, Gabriel G Erion, and Su-In Lee. Consistent individualized feature attribution for tree ensembles. arXiv preprint arXiv:1802.03888, 2018.
- [21] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4765–4774. Curran Associates, Inc., 2017.
- [22] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4765–4774. Curran Associates, Inc., 2017.
- [23] Whitney K. Newey and James L. Powell. Instrumental variable estimation of nonparametric models. Econometrica, 71(5):1565–1578, 2003.
- [24] Xinkun Nie and Stefan Wager. Quasi-oracle estimation of heterogeneous treatment effects. arXiv preprint arXiv:1712.04912, 2017.
- [25] Ryo Okui, Dylan S Small, Zhiqiang Tan, and James M Robins. Doubly robust instrumental variable regression. Statistica Sinica, pages 173–205, 2012.

## Appendix A Proof of Main Lemmas and Corollaries

Before we prove our two main lemmas we define the concept of an orthogonal loss. Consider a loss function $L_D(\theta; g)$ that depends on a target model $\theta \in \Theta$ and a nuisance model $g \in G$.

###### Definition 1 (Directional Derivative).

Let $\mathcal{F}$ be a vector space of functions. For a functional $F : \mathcal{F} \to \mathbb{R}$, we define the derivative operator

$$D_f F(f)[h] = \frac{d}{dt} F(f + t\, h)\Big|_{t=0},$$

for a pair of functions $f, h \in \mathcal{F}$. Likewise, we define the higher-order derivative

$$D_f^k F(f)[h_1, \ldots, h_k] = \frac{\partial^k}{\partial t_1 \cdots \partial t_k} F(f + t_1 h_1 + \cdots + t_k h_k)\Big|_{t_1 = \cdots = t_k = 0}.$$

When considering a functional in two arguments, e.g. $L_D(\theta; g)$, we will write $D_\theta$ and $D_g$ to make explicit the argument with respect to which the derivative is taken.
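As a quick numerical illustration of this definition (not part of the proofs), the directional derivative of a simple functional can be checked by finite differences; the functional below is a toy example, not one from the paper.

```python
import numpy as np

def directional_derivative(F, f, h, eps=1e-6):
    """Central finite-difference approximation of D_f F(f)[h],
    i.e. d/dt F(f + t*h) evaluated at t = 0."""
    return (F(f + eps * h) - F(f - eps * h)) / (2 * eps)

# Example: for F(f) = mean(f^2) on a grid, the analytic directional
# derivative is D_f F(f)[h] = mean(2 * f * h).
f = np.linspace(0.0, 1.0, 101)
h = np.ones_like(f)
F = lambda g: np.mean(g ** 2)
```

For this quadratic functional the central difference is exact up to floating-point error, so the numerical value matches the analytic expression `np.mean(2 * f * h)`.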

###### Definition 2 (Orthogonal Loss).

The population risk $L_D(\theta; g)$ is orthogonal, if:

$$D_g D_\theta L_D(\theta_0; g_0)[\theta - \theta_0,\, g - g_0] = 0, \qquad \forall\, \theta \in \Theta,\ g \in G. \qquad (34)$$

Suppose that the loss function is the expectation of a point-wise loss:

$$L_D(\theta; g) = \mathbb{E}\left[\ell(Z; \theta(x), g(x))\right], \qquad (35)$$

where $Z$ represents all random variables of the data generating process and $x$ is a subset of these variables.

Then orthogonality is implied by the condition that, for almost all $x$:

$$\mathbb{E}\left[\nabla_g \nabla_\theta\, \ell(Z; \theta_0(x), g_0(x)) \,\big|\, x\right] = 0, \qquad (36)$$
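To see why a conditional cross-derivative condition of this form yields orthogonality of the population risk, note that for scalar-valued $\theta$ and $g$ the cross derivative factors through a conditional expectation. A sketch, under standard differentiability assumptions that allow exchanging derivatives and expectations:

```latex
\begin{align*}
D_g D_\theta L_D(\theta_0; g_0)[\theta - \theta_0,\, g - g_0]
  &= \mathbb{E}\Big[ \nabla_g \nabla_\theta\, \ell\big(Z; \theta_0(x), g_0(x)\big)\,
      \big(\theta(x) - \theta_0(x)\big)\,\big(g(x) - g_0(x)\big) \Big] \\
  &= \mathbb{E}\Big[ \mathbb{E}\big[ \nabla_g \nabla_\theta\, \ell\big(Z; \theta_0(x), g_0(x)\big) \,\big|\, x \big]\,
      \big(\theta(x) - \theta_0(x)\big)\,\big(g(x) - g_0(x)\big) \Big] = 0,
\end{align*}
```

where the second equality is the tower rule, using that $\theta(x) - \theta_0(x)$ and $g(x) - g_0(x)$ are measurable with respect to $x$.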

where