Generic Machine Learning Inference on Heterogenous Treatment Effects in Randomized Experiments
Abstract. We propose strategies to estimate and make inference on key features of heterogeneous effects in randomized experiments. These key features include best linear predictors of the effects using machine learning proxies, average effects sorted by impact groups, and average characteristics of most and least impacted units. The approach is valid in high dimensional settings, where the effects are proxied by machine learning methods. We postprocess these proxies into the estimates of the key features. Our approach is generic, it can be used in conjunction with penalized methods, deep and shallow neural networks, canonical and new random forests, boosted trees, and ensemble methods. It does not rely on strong assumptions. In particular, we don’t require conditions for consistency of the machine learning methods. Estimation and inference relies on repeated data splitting to avoid overfitting and achieve validity. For inference, we take medians of pvalues and medians of confidence intervals, resulting from many different data splits, and then adjust their nominal level to guarantee uniform validity. This variational inference method is shown to be uniformly valid and quantifies the uncertainty coming from both parameter estimation and data splitting. We illustrate the use of the approach with two randomized experiments in development on the effects of microcredit and nudges to stimulate immunization demand.
Key words: Agnostic Inference, Machine Learning, Confidence Intervals, Causal Effects, Variational Pvalues and Confidence Intervals, Uniformly Valid Inference, Quantification of Uncertainty, Sample Splitting, Multiple Splitting, AssumptionFreeness, Microcredit, Immunization Incentives
JEL: C18, C21, D14, G21, O16
1. Introduction
Randomized experiments play an important role in the evaluation of social and economic programs and medical treatments (e.g., Imbens and Rubin (2015); Duflo et al. (2007)). Researchers and policy makers are often interested in features of the impact of the treatment that go beyond the simple average treatment effects. In particular, very often, they want to know whether treatment effect depends on covariates, such as gender, age, etc. It is essential to assess if the impact of the program would generalize to a different population with different characteristics, and for economists, to better understand the driving mechanism behind the effects of a particular program. In a review of 189 RCT published in top economic journals since 2006, we found that 76 (40%) report at least one subgroup analysis, wherein they report treatment effects in subgroups formed by baseline covariates.^{1}^{1}1The papers were published in Quarterly Journal of of Economics, American Economic Review, Review of Economics Studies, Econometrica and Journal of Political Economy. We than Karthik Mularidharan, Mauricio Romero and Kaspar Wüthrich for sharing the list of papers they computed for another project.
One issue with reporting treatment effects split by subgroups, however, is that there are often a large number of potential sample splits: choosing subgroups expost opens the possibility of overfitting. To solve this problem, medical journals and the FDA require preregistering the subsample of interest in medical trials in advance. In economics, this approach has gained some traction, with the adoption of preanalysis plans (which can be filed in the AEA registry for randomized experiments). Restricting heterogeneity analysis to preregistered subgroups, however, amounts to throwing away a large amount of potentially valuable information, especially now that many researchers collect large baseline data sets. It should be possible to use the data to discover ex post whether there is any relevant heterogeneity in treatment effect by covariates.
To do this in a disciplined fashion and avoid the risk of overfitting, scholars have recently proposed using machine learning (ML) tools (see e.g. Athey and Imbens (2017) and below for a review). Indeed, ML tools seem to be ideal to explore heterogeneity of treatment effects, when researchers have access to a potentially large array of baseline variables to form subgroups, and little guiding principles on which of those are likely to be relevant. Several recent papers, which we review below, develop methods for detecting heterogeneity in treatment effects. Empirical researchers have taken notice.^{2}^{2}2In the last few months alone, several new empirical papers in economics used ML methods to estimate heterogenous effects. E.g. Rigol et al. (2016) shows that villagers outperform the machine learning tools when they predict heterogeneity in returns to capital. Davis and Heller (2017) predicts who benefits the most from a summer internship projects. Deryugina et al. (Forthcoming) uses the methods developed in the present paper to evaluate the heterogeneity in the effect of air pollution on mortality. Crepon et al. (2019) also builds on the present paper to develop a methodology to determine if the impact of two different programs can be accounted for by different selection. The methodological papers reviewed later also contain a number of empirical applications.
This paper develops a generic approach to use any of the ML tools to predict and make inference on heterogeneous treatment or policy effects. A core difficulty of applying ML tools to the estimation of heterogenous causal effects is that, while they are successful in prediction empirically, it is much more difficult to obtain uniformly valid inference. In fact, in high dimensional settings, absent strong assumptions, generic ML tools may not even produce consistent estimates of the conditional average treatment effect (CATE), the difference in the expected potential outcomes between treated and control groups conditional on covariates.
Previous attempts to solve this problem focus either on specific tools (for example the method proposed by Athey and Imbens (2016), which has become popular with applied researchers, and uses trees), or on situations where those assumptions might be satisfied. Our approach to resolve the fundamental impossibilities in nonparametric inference is different. Motivated by Genovese and Wasserman (2008), instead of attempting to get consistent estimation and uniformly valid inference on the CATE itself, we focus on providing valid estimation and inference on features of CATE. We start by building a ML proxy predictor of CATE, and then develop valid inference on features of the CATE based on this proxy predictor. In particular, we develop valid inference on three objects, which are likely to be of interest to applied researchers and policy makers: First, the Best Linear Predictor (BLP) of the CATE based on the ML proxy predictor; second, the Sorted Group Average Treatment Effects (GATES) or average treatment effect by heterogeneity groups induced by the ML proxy predictor; and third, the Classification Analysis (CLAN) or the average characteristics of the most and least affected units defined in terms of the ML proxy predictor. Thus, we can find out if there is detectable heterogeneity in the treatment effect based on observables, and if there is any, what is the treatment effect for different bins. And finally we can describe which of the covariates is correlated with this heterogeneity.
There is a tradeoff between more restrictive assumptions or tools and a more ambitious estimation. We chose a different approach to address this tradeoff than previous papers: focus on coarser objects of the function rather than the function itself, but make as little assumptions as possible. This seems to be a worthwhile sacrifice: the objects for which we have developed inference appear to us at this point to be the most relevant, but in the future, one could easily use the same approach to develop methods to estimate other objects of interest.
The Model and Key Causal Functions.
Let and be the potential outcomes in the treatment state 1 and the nontreatment state 0; see Neyman (1923) and Rubin (1974). Let be a vector of covariates that characterize the observational units. The main causal functions are the baseline conditional average (BCA):
(1.1) 
and the conditional average treatment effect (CATE):
(1.2) 
Suppose the binary treatment variable is randomly assigned conditional on , with probability of assignment depending only on a subvector of stratifying variables , namely
(1.3) 
and the propensity score is known and is given by
(1.4) 
which we assume is bounded away from zero or one:
(1.5) 
The observed outcome is . Under the stated assumption, the causal functions are identified by the components of the regression function of given :
that is,
(1.6) 
and
(1.7) 
We observe , consisting of i.i.d. copies of the random vector having probability law . The expectation with respect to is denoted by . The probability law of the entire data is denoted by and the corresponding expectation is denoted by .
Properties of Machine Learning Estimators of Motivating the Agnostic Approach
Machine learning (ML) is a name attached to a variety of new, constantly evolving statistical learning methods: Random Forest, Boosted Trees, Neural Networks, Penalized Regression, Ensembles, and Hybrids (see, e.g., Wasserman (2016) for a recent review, and Friedman et al. (2001) for a prominent textbook treatment). In modern highdimensional settings, ML methods effectively explore the various forms of nonlinear structured sparsity to yield “good” approximations to whenever such assumptions are valid, based on equations (1.6) and (1.7). As a result these methods often work much better than classical methods in highdimensional settings, and have found widespread uses in industrial and academic applications.
Motivated by their practical predictive success, it is really tempting to apply ML methods directly to try to learn the CATE function (by learning the two regression functions for treated and untreated and taking the difference). However, it is hard, if not impossible, to obtain uniformly valid inference on using generic ML methods, under credible assumptions and practical tuning parameter choices. There are several fundamental reasons as well as huge gaps between theory and practice that are responsible for this.
One fundamental reason is that the ML methods might not even produce consistent estimators of in high dimensional settings. For example, if has dimension and the target function is assumed to have continuous and bounded derivatives, then the worst case (minimax) lower bound on the rate of learning this function from a random sample of size cannot be better than as , as shown by Stone Stone (1982). Hence if is fixed and is also small, but slowly increasing with , such as , then there exists no consistent estimator of generally.
Hence, generic ML estimators cannot be regarded as consistent, unless further very strong assumptions are made. Examples of such assumptions include structured forms of linear and nonlinear sparsity and supersmoothness. While these (sometime believable and yet untestable) assumptions make consistent adaptive estimation possible (e.g.,Bickel et al. (2009)), inference remains a more difficult problem, as adaptive confidence sets do not exist even for lowdimensional nonparametric problems (Low et al. (1997); Genovese and Wasserman (2008)). Indeed, adaptive estimators (including modern ML methods) have biases of comparable or dominating order as compared to sampling error. Further assumptions such as ”selfsimilarity” are needed to bound the biases and expand the confidence bands by the size of bias (see Giné and Nickl (2010); Chernozhukov et al. (2014)) to produce partly adaptive confidence bands. For more traditional statistical methods there are constructions in this vein that make use of either undersmoothing or biasbounding arguments (Giné and Nickl (2010); Chernozhukov et al. (2014)). These methods, however, are not yet available for ML methods in high dimensions (see, however, Hansen et al. (2017) for a promising approach called ”targeted undersmoothing” in sparse linear models).
Suppose we did decide to be optimistic (or panglossian) and imposed the strong assumptions, that made the theoretical versions of the ML methods provide us with highquality consistent estimators of and valid confidence bands based on them. This would still not give us a practical construction we would want for our applications. The reason is that there is often a gap between theoretical versions of the ML methods appearing in various theoretical papers and the practical versions (with the actual, datadriven tuning parameters) coded up in statistical computing packages used by practitioners.^{3}^{3}3There are cases where such gap does not exist, e.g., see Belloni et al. (2014, 2011) for the lasso. On the other hand, for example, even the wide use of Kfold crossvalidation in highdimensional settings for machine learning remains theoretically unjustified. There do exist, however, related subsamplebased methods that achieve excellent performance for tuning selection (Wegkamp et al., 2003; Lecué and Mitchell, 2012). The use of ML, for example, involves many tuning parameters with practical rules for choosing them, while theoretical work provides little guidance or backing for such practical rules; see e.g., the influential book Friedman et al. (2001) for many examples of such rules. Unfortunately, theoretical work often only provides existence results: there exist theoretical ranges of the tuning parameters that make the simple versions of the methods work for predictive purposes (under very strong assumptions), leaving no satisfactory guide to practice.
In this paper we take an agnostic view. We neither rely on any structured assumptions, which might be difficult to verify or believe in practice, nor impose conditions that make the ML estimators consistent. We simply treat ML as providing proxy predictors for the objects of interest.
Our Agnostic Approach
Here, we propose strategies for estimation and inference on
key features of rather than itself. 
Because of this difference in focus we can avoid making strong assumptions about the properties of the ML estimators.
Let denote a random partition of the set of indices . The strategies that we consider rely on random splitting of
into a main sample, denoted by , and an auxiliary sample, denoted by . We will sometimes refer to these samples as and . We assume that the main and auxiliary samples are approximately equal in size, though this is not required theoretically.
From the auxiliary sample , we obtain ML estimators of the baseline and treatment effects, which we call the proxy predictors,
These are possibly biased and noisy predictors of and , and in principle, we do not even require that they are consistent for and . We simply treat these estimates as proxies, which we postprocess to estimate and make inference on the features of the CATE . We condition on the auxiliary sample , so we consider these maps as frozen, when working with the main sample.
Using the main sample and the proxies, we shall target and develop valid inference about key features of rather than , which include

Best Linear Predictor (BLP) of the CATE based on the ML proxy predictor ;

Sorted Group Average Treatment Effects (GATES): average of (ATE) by heterogeneity groups induced by the ML proxy predictor ;

Classification Analysis (CLAN): average characteristics of the most and least affected units defined in terms of the ML proxy predictor .
Our approach is generic with respect to the ML method being used, and is agnostic about its formal properties.
We will make use of many splits of the data into main and auxiliary samples to produce robust estimates. Our estimation and inference will systematically account for two sources of uncertainty:

Estimation uncertainty conditional on the auxiliary sample.

Splitting uncertainty induced by random partitioning of the data into the main and auxiliary samples.
Because we account for the second source, we call the resulting collection of methods as variational estimation and inference methods (VEINs). For point estimates we report the median of the estimated key features over different random splits of the data. For the confidence intervals we take the medians of many random conditional confidence sets and we adjust their nominal confidence level to reflect the splitting uncertainty. We construct pvalues by taking medians of many random conditional pvalues and adjust the nominal levels to reflect the splitting uncertainty. Note that considering many different splits and accounting for variability caused by splitting is very important. Indeed, with a single splitting practice, empiricists may unintentionally look for a ”good” data split, which supports their prior beliefs about the likely results, thereby invalidating inference.^{4}^{4}4This problem is “solved” by fixing the MonteCarlo seed and the entire data analysis algorithm before the empirical study. Even if such a huge commitment is really made and followed, there is a considerable risk that the resulting datasplit may be nontypical. Our approach allows one to avoid taking this risk.
Relationship to the Literature.
We focus the review strictly on the literatures about estimation and inference on heterogeneous effects and inference using sample splitting.
We first mention work that uses linear and semiparametric regression methods. A semiparametric inference method for characterizing heterogeneity, called the sorted effects method, was given in Chernozhukov et al. (2015). This approach does provide a full set of inference tools, including simultaneous bands for percentiles of the CATE, but is strictly limited to the traditional semiparametric estimators for the regression and causal functions. Hansen et al. (2017) proposed a sparsity based method called ”targeted undersmoothing” to perform inference on heterogeneous effects. This approach does allow for highdimensional settings, but makes strong assumptions on sparsity as well as additional assumptions that enable the targeted undersmoothing. A related approach, which allows for simultaneous inference on many coefficients (for example, inference on the coefficients corresponding to the interaction of the treatment with other variables) was first given in Belloni et al. (2013) using a Zestimation framework, where the number of interactions can be very large; see also Dezeure et al. (2016) for a more recent effort in this direction, focusing on debiased lasso in mean regression problems. This approach, however, still relies on a strong form of sparsity assumptions. Zhao et al. (2017) proposed a postselection inference framework within the highdimensional linear sparse models for the heterogeneous effects. The approach is attractive because it allows for some misspecification of the model.
Next we discuss the use of treebased and other methods. Imai and Ratkovic (2013) discussed the use of a heuristic supportvectormachine method with lasso penalization for classification of heterogeneous treatments into positive and negative ones. They used the HorvitzThompson transformation of the outcome (e.g., as in Hirano et al. (2003); Abadie (2005)) such that the new outcome becomes an unbiased, noisy version of CATE. Athey and Imbens (2016) made use of the HorvitzThompson transformation of the outcome variable to inform the process of building causal trees, with the main goal of predicting CATE. They also provide a valid inference result on average treatment effects for groups defined by the tree leaves, conditional on the data split in two subsamples: one used to build the tree leaves and the one to estimate the predicted values given the leaves. Like our methods, this approach is essentially assumptionfree. The difference with our generic approach is that it is limited to trees and does not account for splitting uncertainty, which is important in practical settings. Wager and Athey (2017) provided a subsamplingbased construction of a causal random forest, providing valid pointwise inference for CATE (see also the review in Wager and Athey (2017) on prior uses of random forests in causal settings) for the case when covariates are very lowdimensional (and essentially uniformly distributed).^{5}^{5}5The dimension is fixed in Wager and Athey (2017); the analysis relies on the Stone’s model with smoothness index , in which no consistent estimator exists once . It’d be interesting to establish consistency properties and find valid inferential procedures for the random forest in highdimensional ( or ) approximately sparse cases, with continuous and categorical covariates, but we are not aware of any studies that cover such settings, which are of central importance to us. Unfortunately, this condition rules out the typical highdimensional settings that arise in many empirical problems, including the ones considered in this paper.
Our approach is different from these existing approaches, in that we are changing the target, and instead of hunting for CATE , we focus on key features of . We simply treat the ML methods as providing a proxy predictor , which we postprocess to estimate and make inference on the key features of the CATE . Some of our strategies rely on HorvitzThompson transformations of outcome and some do not. The inspiration for our approach draws upon an observation in Genovese and Wasserman (2008), namely that some fundamental impossibilities in nonparametric inference could be avoided if we focus inference on coarser features of the nonparametric functions rather than the functions themselves.
Our inference approach is also of independent interest, and could be applied to many problems, where sample splitting is used to produce ML predictions, e.g. Abadie et al. (2017). Related references include Wasserman and Roeder (2009); Meinshausen et al. (2009), where the ideas are related but quite different in details, which we shall explain below. The premise is the same, however, as in Meinshausen et al. (2009); Rinaldo et al. (2016) – we should not rely on a single random split of the data and should adjust inference in some way. Our approach takes the medians of many conditional confidence intervals as the confidence interval and the median of many conditional pvalues as the pvalue, and adjusts their nominal levels to account for the splitting uncertainty. Our construction of pvalues builds upon ideas in Benjamini and Hochberg (1995); Meinshausen et al. (2009), though what we propose is radically simpler, and our confidence intervals appear to be brand new. Of course sample splitting ideas are classical, going back to Hartigan (1969); Kish and Frankel (1974); Barnard (1974); Cox (1975); Mosteller and Tukey (1977), though having been mostly underdeveloped and overlooked for inference, as characterized by Rinaldo et al. (2016).
2. Main Identification Results and Estimation Strategies
2.1. BLP of CATE
We consider two strategies for identifying and estimating the best linear predictor of using :
which, if exists, is defined by projecting on the linear span of and in the space .
BLP of CATE: The First Strategy
Here we shall identify the coefficients of the BLP from the weighted linear projection:
(2.1) 
where ,
Note that the above equation uniquely pins down and under weak assumptions.
The interaction is orthogonal to under the weight and to all other regressors that are functions of under any dependent weight.^{6}^{6}6The orthogonalization ideas embedded in this strategy do have classical roots in econometrics (going back to at least Frisch and Waugh in the 30s), and similar strategies underlie the orthogonal or double machine learning approach (DML) in Chernozhukov et al. (2017). Our paper has different goals than DML, attacking the problem of inference on heterogeneous effects without rate and even consistency assumptions. The strategy here is more nuanced in that we are making it work under misspecification or inconsistent learning, which is likely to be true in very highdimensional problems.
Theorem 2.1 (Blp 1).
Consider and as fixed maps. Assume that and have finite second moments, is full rank, and . Then, defined in (2.1) also solves the best linear predictor/approximation problem for the target :
in particular and .
The identification result is constructive. We can base the corresponding estimation strategy on the empirical analog:
where denotes the empirical expectation with respect to the main sample, i.e.
The properties of this estimator, conditional on the auxilliary data, are well known and follow as a special case of Lemma B.1 in the Appendix.
Comment 2.1 (Main Implications of the result).
If is a perfect proxy for , then In general, , correcting for noise in . If is complete noise, uncorrelated to , then . Furthermore, if there is no heterogeneity, that is , then Rejecting the hypothesis therefore means that there is both heterogeneity and is its relevant predictor.
Figure 1 provides two examples. The left panel shows a case without heterogeneity in the CATE where , whereas there right panel shows a case with strong heterogeneity in the CATE where . In both cases we evenly split 1000 observations between the auxiliary and main samples, is uniformly distributed in , and the proxy predictor is estimated by random forest in the auxiliary sample following the standard implementation, see e.g. Friedman et al. (2001). When there is no heterogeneity, postprocessing the ML estimates helps reducing sampling noise bringing the estimated BLP close to the true BLP; whereas under strong heterogeneity the signal in the ML estimates dominates the sampling noise and the postprocessing has little effect.
Comment 2.2 (Digression: Naive Strategy that is not Quite Right).
It is tempting and “more natural” to estimate
where . This is a good strategy for predicting the conditional expectation of given and . But, , and is not the best linear predictor of .
BLP of CATE: The Second Strategy
The second strategy makes use of the HorvitzThompson transformation:
It is well known that the transformed response provides an unbiased signal about CATE:
and it follows that
This simple strategy is completely fine for identification purposes, but can severely underperform in estimation and inference due to lack of precision. We can repair the deficiencies by considering, instead, the linear projection:
(2.2) 
where , , , , and , e.g. or . The terms are present in order to reduce noise.
Theorem 2.2 (Blp 2).
Consider and as fixed maps. Assume that has finite second moments, is such that is finite and full rank, and . Then, defined in (2.2) solves the best linear predictor/approximation problem for the target :
in particular and .
The corresponding estimator is defined through the empirical analog:
and the properties of this estimator, conditional on the auxiliary data, are well known and given in Lemma B.1.
Comment 2.3 (Comparison of Estimation Strategies).
A natural question that may arise is whether the two estimation strategies proposed can be ranked in terms of asymptotic efficiency. The answer is negative. We show in Appendix C that they produce estimators that have the same distribution in large samples.
2.2. The Sorted Group ATE
The target parameters are
where is an indicator of group membership.
Comment 2.4.
There are many possibilities for creating groups based upon ML tools applied to the auxiliary data. For example, one can group or cluster based upon predicted baseline response as in the “endogenous stratification” analysis (Abadie et al., 2017), or based upon actual predicted treatment effect . We focus on the latter approach for defining groups, although our identification and inference ideas immediately apply to other ways of defining groups, and could be helpful in these contexts.
We build the groups to explain as much variation in as possible
where are nonoverlaping intervals that divide the support of into regions with equal or unequal masses:
The parameters of interest are the Sorted Group Average Treatment Effects (GATES):
Given the definition of groups, it is natural for us to impose the monotonicity restriction
which holds asymptotically if is consistent for and the latter has an absolutely continuous distribution. Under the monotonicity condition, the estimates could be rearranged to obey the weak monotonicity condition, improving the precision of the estimator. The joint confidence intervals could also be improved by intersecting them with the set of monotone functions. Furthermore, as before, we can test for homogeneous effects, , by testing whether,
GATES: The First Strategy
Here we shall recover the GATES parameters from the weighted linear projection equation:
(2.3) 
for , , ,
The presence of in the interaction orthogonalizes this regressor relative to all other regressors that are functions of . The controls , e.g. , can be included to improve precision.
Given the identification strategy, we can base the corresponding estimation strategy on the following empirical analog:
(2.4) 
The properties of this estimator, conditional on the auxilliary data, are well known and stated as a special case of Lemma B.1.
A formal statement appears below, together with a complementary result.
Figure 2 provides two examples using the same designs as in fig. 1. Postprocessing the ML estimates again has stronger effect when there is no heterogeneity, but in both cases help bring the estimated GATES close to the true GATES.
GATES: The Second Strategy
Here we employ linear projections on HorvitzThompson transformed variables:
(2.5) 
for , , ,
Given the identification strategy, we can base the corresponding estimation strategy on the following empirical analog:
(2.6) 
The properties of this estimator, conditional on the auxiliary data, are well known and given in Lemma B.1. The resulting estimator has similar performance to the previous estimator, and under some conditions their firstorder properties coincide.
The following is the formal statement of the identification result.
Theorem 2.3 (Gates).
Consider and as fixed maps. Assume that has finite second moments and the ’s and defined above are such that and are finite and have full rank. Consider defined by the weighted regression equation (2.3) or by the regression equation (2.5). These parameters defined in two different ways are equivalent and are equal to the expectation of conditional on the proxy group :
2.3. Classification Analysis (CLAN)
When the BLP and GATES analyses reveal substantial heterogeneity, it is interesting to know the properties of the subpopulations that are most and least affected. Here we focus on the “least affected group” and “most affect group” . Under the monotonicity assumption, it is reasonable that the first and the last groups are the most and least affected, where the labels “most” and “least” can be swapped depending on the context.
Let be a vector of characteristics of an observational unit. The parameters of interest are the average characteristics of the most and least affected groups:
The parameters and are identified because they are averages of variables that are directly observed. We can compare and to quantify differences between the most and least affected groups. We call this type of comparisons as classification analysis or CLAN.
3. ”Variational” Estimation and Inference Methods
3.1. Estimation and Inference: The Generic Targets
Let denote a generic target parameter or functional, for example,

is the heterogeneity predictor loading parameter;

is the “personalized” prediction of ;

is the expectation of for the group ;

is the difference in the expectation of between the most and least affected groups;

is the difference in the expectation of the characteristics of the most and least impacted groups.
3.2. Quantification of Uncertainty: Two Sources
There are two principal sources of sampling uncertainty:

Estimation uncertainty regarding the parameter , conditional on the data split;

Uncertainty or ”variation” induced by the data splitting.
Conditional on the data split, quantification of estimation uncertainty is standard. To account for uncertainty with respect to the data splitting, it makes sense to examine the robustness and variability of the estimates/confidence intervals with respect to different random splits. One of our goals is to develop methods, which we call ”variational estimation and inference” (VEIN) methods, for quantifying this uncertainty. These methods can be of independent interest in many settings where the sample splitting is used.
Quantifying Source (I): Conditional Inference
We first recognize that the parameters implicitly depend on
the auxiliary sample, used to create the ML proxies and . Here we make the dependence explicit: .
All of the examples admit an estimator such that under mild assumptions,
in the sense that, as ,
Implicitly this requires the auxiliary data to be ”sufficiently regular”, and this should happen with high probability.
As a consequence, the confidence interval (CI)
covers with approximate probability :
This leads to straighforward conditional inference, which does not account for the sample splitting uncertainty.
Quantifying Source (II): “Variational” Inference
Different partitions of yield different targets . Conditional on the data, we treat as a random variable, since are random sets that form random partitions of into samples of size and . Different partitions also yield different estimators and approximate distributions for these estimators. Hence we need a systematic way of treating the randomness in these estimators and their distributions.
Comment 3.1.
In cases where the data sets are not large, it may be desirable to restrict attention to balanced partitions , where the proportion of treated units is equal to the designed propensity score.
We want to quantify the uncertainty induced by the random partitioning. Conditional on Data, the estimated is still a random variable, and the confidence band is a random set. For reporting purposes, we instead would like to report an estimator and confidence set, which are nonrandom conditional on the data.
The above estimator and confidence set are nonrandom conditional on the data. The confidence set reflects the uncertainty created by the random partitioning of the data into the main and auxilliary data.
Comment 3.2.
For a random variable with law we define
Note that the lower median is the usual definition of the median. The upper median is the next distinct quantile of the random variable (or it is the usual median after reversing the order on ). For example, when is uniform on , then and ; and if is uniform on , then . For continuous random variables the upper and lower medians coincide. For discrete random variables they can differ, but the differences will be small for variables that are close to being continuous.
Suppose we are testing against , conditional on the auxiliary data, then the pvalue is given by
The pvalue for testing against , is given by .
Under the null hypothesis is approximately distributed as the uniform variable, , conditional on . Note that, conditional on , still has randomness induced by random partitioning of the data, which we need to address.
Comment 3.3.
The main idea behind this construction is simple: the pvalues are distributed as marginal uniform variables , and hence obey the following property.
Lemma 3.1 (A Property of Uniform Variables).
Consider , the (usual, lower) median of a sequence of uniformly distributed variables, for each , where variables are not necessarily independent. Then,
Proof. Let denote the median of . Then is equivalent to So
By Markov inequality this is bounded by
where the last inequality holds by the marginal uniformity.
Main Inference Result: Variational Pvalues and Confidence Intervals
We present a formal result on adjusted pvalues using this condition:

Suppose that is a set of regular auxiliary data configurations such that for all , under the null hypothesis:
and . In particular, suppose that this holds for the pvalues
Lemma B.1 shows that this condition is plausible for the least squares estimators defined in the previous section under mild conditions.
Theorem 3.1 (Uniform Validity of Variational PValue).
Under condition PV and the null hypothesis holding,
uniformly in .
In order to establish the properties of the confidence interval , we first consider the properties of the related confidence interval, which is based on the inversion of the pvalue based tests:
(3.1) 
for , where, for ,
(3.2)  
(3.3) 
The confidence interval has the following representation in terms of the medians of tstatistics implied by the proof Theorem 3.2 stated below:
(3.4) 
This can be (slightly) tighter than , while the latter is much simpler to construct.
The following theorem establishes that both confidence sets maintain the approximate confidence level .
Theorem 3.2 (Uniform Validity of Variational Confidence Intervals).
4. Other Considerations and Extensions
1. Choosing the Best ML Method Targeting CATE in Stage 1. There are several options. The best ML method can be chosen using the auxiliary sample, based on either (a) the ability to predict using and or (b) the ability to predict using and under the weight (as in the first type of strategies we developed earlier). To be specific, we can solve either of the following problems:

minimize the errors in the prediction of on and :
where and are parameter spaces for and ; or

minimize the errors in the weighted prediction of on and :
where and and are parameter spaces for and .
This idea improves over simple but inefficient strategy of predicting just using , which have been suggested before for causal inference. It also improves over the simple strategy that predicts using and (which chooses the best predictor for in a given class but not necessarily the best predictor for CATE ). Note that this idea is new and is of major independent interest.
2. Choosing the Best ML Method BLP Targeting CATE in Stage 2. The best ML method can also be chosen in the main sample by maximizing
(4.1) 
Maximizing is equivalent to maximizing the correlation
between the ML proxy predictor and the true score ,
or equivalent to maximizing the in the regression of on .
3. Choosing the Best ML Method GATES Targeting CATE in Stage 2. Analogously, for GATES the best ML method can also be chosen in the main sample by maximizing
(4.2) 
This is the part of variation of explained by . Hence choosing the ML proxy to maximize is equivalent to maximizing the in the regression of on (without a constant). If the groups have equal size, namely for each , then
4. Stratified Splitting. The idea is to balance the proportions of treated and untreated in both and samples, so that the proportion of treated is equal to the experiment’s propensity scores across strata. This formally requires us to replace the i.i.d. assumption by the i.n.i.d. assumption (independent but not identically distributed observations) when accounting for estimation uncertainty, conditional on the auxiliary sample. This makes the notation more complicated, but the results in Lemma B.1 still go through with notational modifications.
5. When Proxies have Little Variation. The analysis may generate proxy predictors that have little variation, so we can think of them as “weak”, which makes the parameter weakly identified. We can either add small noise to the proxies (jittering), so that inference results go through, or we may switch to testing rather than estimation. For practical reasons, we prefer the jittering approach.
5. Further Potential Applications to Prediction and Causal Inference Problems
Our inference approach generalizes to any problem of the following sort.
The noise reduction strategies, like the ones we used in the context of Htransformed outcomes, can be useful in these cases, but their construction could depend on the context.
Example 1. Forecasting or Predicting Regression Functions using ML proxies. This is the most common type of the problem arising in forecasting. Here the target is the best predictor of using , namely , and trivially serves as the unbiased signal. The interesting part here is the use of variational inference tools developed in this paper for constructing confidence intervals for the predicted values produced by the estimated BLP of using .
Example 2. Predicting Structural Derivatives using ML proxies. Suppose we are interested in best predicting the conditional average partial derivative , where and . In the context of demand analysis, is the log of individual demand, is the logprice of a product, and includes prices of other products and characteristics of individuals. Then, the unbiased signal is given by where is the conditional density function of given . That is, under mild conditions on the density using the integration by parts formula.
6. Empirical Applications and Implementation Algorithms
To illustrate the methods developed in this paper, we consider two empirical examples. The first example is an RCT conducted in Morocco, which investigates the effect of microfinance access on several outcomes. The second example analyzes a randomized intervention program in India to improve immunization. We conclude this section by providing the implementation algorithm.
6.1. Heterogeneity in the Effect of Microcredit Availability
We analyze a randomized experiment designed to evaluate the impact of microcredit availability on borrowing and selfemployment activities, which was previously studied in Crépon et al. (2015). The experiment was conducted in 162 villages in Morocco, divided into 81 pairs of villages with similar observable characteristics (number of households, accessibility to the center of the community, existing infrastructure, type of activities carried out by the households, and type of agriculture activities). One of the villages in each pair was randomly assigned to treatment and the other to control. Between 2006 and 2007 a microfinance institution started operating in the treated villages. Two years after the intervention an endline household survey was conducted with 5,551 households, which constitute our sample. There was no other microcredit penetration in these villages, before and for the duration of the study. Therefore, we interpret the treatment as the availability of microcredit.
Recent randomized evaluations of access to microcredit at the community level have found limited impacts of microcredit.^{7}^{7}7See Banerjee (2013) for a summary of the recent literature Despite evidence that access to microfinance leads to an increase in borrowing (Angelucci et al. (2015), Banerjee et al. (2015b), Tarozzi et al. (2015)) and business creation or expansion (Angelucci et al. (2015), Attanasio et al. (2015), Banerjee et al. (2015b), Tarozzi et al. (2015)), most studies have found that this does not translate into an increase in economic outcomes such as profit, income, labor supply and consumption (Angelucci et al. (2015), Banerjee et al. (2015b), Crépon et al. (2015)). Moreover, there is also no evidence of substantial gains in human development outcomes, such as education and health (Banerjee et al. (2015b),Tarozzi et al. (2015)). Studies which estimate the impact of microfinance by randomizing microcredit at the individual level confirm these findings (Augsburg et al. (2012), Karlan and Zinman (2009), Karlan and Zinman (2011)).
One question that remains elusive is whether the lack of evidence on the average effects masks heterogeneity, in which there are potential winners and losers of the microcredit expansion. Understanding this heterogeneity can have important implications for evaluating the welfare effects of microcredit, designing policies and targeting the groups that would benefit from microfinance. Indeed, the idea that there might be heterogeneity in the impact of microcredit has been a common theme among RCTs evaluating microfinance programs. Having found mostly positive but insignificant coefficients, the papers cited above attempt to explore heterogeneous treatment effects, mostly using quantile treatment effects. For profits, most studies seem to find positive impact at the higher quantiles (and in the data set we study here, Crépon et al. (2015) actually find negative impacts at lower level). Using Bayesian hierarchical methods to aggregate the evidence across studies, Meager (2017) cautions that these results on quantiles may not be generalizable: the profit variables seems to have too much noise to lend itself to quantile estimation.
A number of recent papers also consider heterogeneous treatment effects by studying the effect of microfinance on subpopulations. In a followup study of Banerjee et al. (2015b), Banerjee et al. (2015a) investigates whether the heterogeneity is persistent six years after the microfinance was introduced. They find that credit has a much bigger impact on the business outcomes of those who started a business before microfinance entered than of those without prior businesses. Using the same dataset as in this application, Crépon et al. (2015) classifies households into three categories in terms of their probability to borrow before the intervention and finds that microcredit access has a significant impact on investment and profit, but still no impact on income and consumption among those who are most likely to borrow. It is worth noting that the original strategy for this study was to construct groups which, ex ante had different probability to borrow, in order to separately estimate the direct effect of microcredit on those most likely to borrow, and the indirect effect on those very unlikely to borrow. The researchers initially tried to predict the probability to borrow fitting a model to a first group of villages for which they had collected a short survey. However they ended up predicting the probability to borrow expost because the model proved to have low predictive power. This expost classification may lead to overfitting. One cause for concern in this case is that different variables predict the probability to borrow in different waves, which makes it less likely that those variables reflect true structural relationships.
The strategy developed in this paper provides several advantages in studying heterogeneity in the treatment effects of microfinance. First, contrary to the literature, which relies on ad hoc subgroup analysis across a few baseline characteristics, we are agnostic about the source of heterogeneity. While the variable “had a prior business” has proven to be a robust and generalizable predictor of differences in treatment effect ( Meager (2017)) and could therefore be prespecified in future preanalysis plans, we have little idea about what else predicts heterogeneity. Second, our approach is valid in high dimensional settings, allowing us to include a rich set of characteristics in an unspecified functional form. Finally, using the CLAN estimation we are able to identify the characteristics of the most and least affected subpopulations, which could be an important input for a welfare analysis or targeting households who are likely to benefit from access to microfinance.
We focus on heterogeneity in treatment effects on four household outcome variables, : the amount of money borrowed, the output from selfemployment activities, profit from selfemployment activities, and monthly consumption. The treatment variable, , is an indicator for the household residing in a treated village. The covariates, , include some baseline household characteristics such as number of members, number of adults, head age, indicators for households doing animal husbandry, doing other nonagricultural activity, having an outstanding loan over the past 12 months, household spouse responded to the survey, another household member (excluding the household head) responded to the survey, and 81 village pair fixed effects (these are the variables that are available for all households). We also include indicators for missing observation at baseline as controls. Table 1 shows some descriptive statistics for the variables used in the analysis (all monetary variables are expressed in Moroccan Dirams, or MAD). Treated and control households have similar characteristics and the unconditional average treatment effect on loans, output, profit and consumption are respectively 1,128, 5,237, 1,844 and 31.
All  Treated  Control  

Outcome Variables  
Total Amount of Loans  2,359  2,930  1,802 
Total output from selfemployment activities (past 12 months)  32,499  35,148  29,911 
Total profit from selfemployment activities (past 12 months)  10,102  11,035  9,191 
Total monthly consumption  3,012  2,996  3,027 
Baseline Covariates  
Number of Household Members  3.879  3.872  3.886 
Number of Members 16 Years Old or Older  2.604  2.601  2.607 
Head Age  35.976  35.937  36.014 
Declared Animal Husbandry Selfemployment Activity  0.415  0.426  0.404 
Declared Nonagricultural Selfemployment Activity  0.146  0.129  0.164 
Borrowed from Any Source  0.210  0.224  0.196 
Spouse of Head Responded to Selfemployment Section  0.067  0.074  0.061 
Member Responded to Selfemployment Section  0.044  0.048  0.041 
Elastic Net  Boosting  Neural Network  Random Forest  
Amount of Loans  
Best BLP ()  2,808,960  1,919,609  2,175,872  2,753,511 
Best GATES ()  875  283  568  1290 
Output  
Best BLP ()  142,021,759  81,927,950  72,908,917  123,485,223 
Best GATES ()  8,677  3,625  4,986  5,123 
Profit  
Best BLP ()  32,462,874  16,674,642  13,411,383  43,184,732 
Best GATES ()  4,595  2,167  1,447  4,344 
Consumption  
Best BLP ()  45,084  26,158  38,578  37,507 
Best GATES ()  101  69  85  109 
Notes: Medians over 100 splits in half. 