Group Average Treatment Effects for Observational StudiesFinancial support from the Deutsche Forschungsgemeinschaft via the IRTG 1792 “High Dimensional Non Stationary Time Series”, Humboldt-Universität zu Berlin, is gratefully acknowledged. All correspondence may be addressed to the author by e-mail at

Group Average Treatment Effects for Observational Studiesthanks: Financial support from the Deutsche Forschungsgemeinschaft via the IRTG 1792 “High Dimensional Non Stationary Time Series”, Humboldt-Universität zu Berlin, is gratefully acknowledged. All correspondence may be addressed to the author by e-mail at

Daniel Jacob School of Business and Economics, Humboldt-Universität zu Berlin, Spandauer Straße 1, 10178 Berlin, Germany.    Wolfgang Karl Härdle1 Sim Kee Boon Institute for Financial Economics, Singapore Management University, 50 Stamford Road, 178899 Singapore, SingaporeW.I.S.E. - Wang Yanan Institute for Studies in Economics, Xiamen University, 422 Siming S Rd, 361005 Fujian, China    Stefan Lessmann1
2footnotemark: 2
This Draft: November 22, 2019

The paper proposes an estimator to make inference on key features of heterogeneous treatment effects sorted by impact groups (GATES) for non-randomised experiments. Observational studies are standard in policy evaluation from labour markets, educational surveys and other empirical studies. To control for a potential selection-bias we implement a doubly-robust estimator in the first stage. Keeping the flexibility to use any machine learning method to learn the conditional mean functions as well as the propensity score we also use machine learning methods to learn a function for the conditional average treatment effect. The group average treatment effect is then estimated via a parametric linear model to provide p-values and confidence intervals. The result is a best linear predictor for effect heterogeneity based on impact groups. Cross-splitting and averaging for each observation is a further extension to avoid biases introduced through sample splitting. The advantage of the proposed method is a robust estimation of heterogeneous group treatment effects under mild assumptions, which is comparable with other models and thus keeps its flexibility in the choice of machine learning methods. At the same time, its ability to deliver interpretable results is ensured.

JEL classification: C01, C14, C31, C63

Keywords: causal inference, machine learning, simulation study, confidence intervals, multiple splitting, sorted group ATE (GATES), doubly-robust estimator

figuresection tablesection \tcbsetcolframe=black, colback=white,size=fbox

1 Introduction

When evaluating a causal effect of some policy, marketing action or another treatment indicator, it might not be sufficient to only report the average treatment effect (ATE). The estimation of heterogeneous effects, e.g. the conditional (on covariates) average treatment effect (CATE), provides further insight into causal mechanisms and helps researchers and practitioners to actively adjust the treatment assignment towards an efficient allocation. The more information in terms of characteristics i.e. covariates we are provided with, the better can heterogeneity be observed. If we have little deterministic information it might be that heterogeneity effects are overlooked. The trade-off here is that the more covariates datasets have, the more complex they get. This is why parametric models are often insufficient when applied on high-dimensional, non-linear datasets (chernozhukov2018double). Therefore, recent methods for treatment effect estimation use machine learning models that have shown to be superior in high-dimensional prediction problems (hastie2009elements). The idea is to learn nuisance functions and regularize the parameter space while making as little assumptions as possible. This is especially helpful when the data does not come from randomised experiments where treatment is randomly assigned to the individuals. In observational studies, self-selection into treatment can arise which introduces a bias that has to be corrected for (i.e. self-selection bias) (heckman1998characterizing). For the ATE one would use the nuisance parameter to orthogonalize the effect that covariates have on both, the treatment assignment and the outcome variable. See chernozhukov2018double for a recent approach which they call double machine learning.

The two most prominent methods used to estimate the CATE may be the general random forest, which builds on the idea of controlling for observed confounders through a tree structure and then estimates the CATE within each final leaf (athey2019generalized). The results from each tree are then weighted over the trees within the forest to get a final estimate. The second one is causal boosting, which uses boosted trees to increase performance (powers2018methods). What the aforementioned methods lack, however, is that they are built on tree algorithms and therefore do not allow a flexible estimation of heterogeneous treatment effects in terms of the model choice. A recent method called R-learner does provide such flexibility and shows competitive performance in the estimation of the CATE to other existing proposals (nie2017quasi). Other models, known as meta-learners, decompose the modelling procedure into sub-regression functions, which can be solved using any supervised learning method. This can e.g. be done by a two-model approach (TMA) where a response function (conditional mean) on the treated and another one on the non-treated observations is trained. In randomised experiments, the difference between the two functions can thus be interpreted as the CATE. (kunzel2019metalearners). Applying the two-model approach on data from non-randomized experiments would incorporate a potential bias. One way to address the problem is to use a doubly-robust estimator as proposed by (robins1995semiparametric). Using the estimates from the two-model approach in combination with inverse probability weighting (IPW) decreases the variance of the estimator and controls for observed confounding (see e.g. lunceford2004stratification). Additional orthogonalization using the two conditional mean functions produced by the TMA further decreases the bias of the parameter of interest (lee2017doubly). The doubly-robust estimator can be used in high-dimensional settings to estimate a reduced dimensional conditional average treatment effect function. Functional limit theory can be derived for the case where the nuisance functions are trained via machine learning methods which are then applied on the doubly-robust estimator. The reduced functional form can then be found using a traditional kernel regression (fan2019estimation). Recent papers study and evaluate different models that are designed for the estimation of heterogeneous treatment effects (see e.g. knaus2018machine; kunzel2019metalearners; powers2018methods.

The difficulty, however, is that machine learning methods are often a black box that is not easy to interpret. This fact hinders the information on drivers for effect heterogeneity. In this paper, we, therefore, build on the ideas of chernozhukov2018generic who concentrate to estimate group average treatment effects (GATE) in randomised experiments. The groups are built on the distribution from the CATE (e.g. quantiles to get five groups). A parametric model is then used to identify the best linear predictor for the group treatment effect, providing standard errors and confidence intervals. The heterogeneity between the groups can further be interpreted through covariates which shed some light on the question of what characteristics determine the differences between groups. In this paper, we extend the approach to estimating the GATE parameter towards the use in observational studies and also towards the possibility to estimate a best linear CATE based on the group heterogeneity. The advantage of the proposed method is a robust estimation of heterogeneous treatment effect that is comparable with other models thus keeping its flexibility in the choice of machine learning methods and at the same time its ability to interpret the results. The latter is especially useful in all areas of empirical economics like policy or labour markets. It also has the advantage to control for potential self-selection bias. The idea of going beyond the average, but not as deep as to estimate conditional average treatment effects for many covariates, is first considered in chernozhukov2018sorted. They provide standard errors and confidence bands for the estimated sorted group effects and related classification analysis and provide confidence sets for the most and least affected groups. While they only use parametric estimators, a nonparametric attempt to estimate group average treatment effects and also provide insights from the heterogeneity in terms of observed covariates is proposed by zimmert2019group. They use a two-step estimator of which the second step consists of a kernel estimator. Our contribution is to keep machine learning methods to learn the nuisance parameter in the first step but use a parametric model in the last step. This allows us to make inference and limit the degree of uncertainty in observational studies. This paper consists of three parts. First, we state the methodology for randomized experiments and second, the extensions to deliver robust results in observational studies. Third, we simulate data that include selection bias and are high-dimensional and non-linear. We compare the results for the GATE obtained with the two-model approach and the extended doubly-robust method. Through averaging of the results for each observation we report the mean absolute error from the true heterogeneous treatment effects for both methods.

2 Generic Machine Learning for Group ATE

2.1 Potential Outcome Assumptions

Throughout this paper, we make use of the potential outcome theorem (rosenbaum1983central) and state four necessary assumptions. The first assumption is the ignorability of treatment, conditional on observed covariates (), from the two potential outcomes. It is also known as unconfoundedness or simply conditional independence:


With denoting the potential outcome under treatment and if not being treated. is the treatment assignment variable.

The second assumption, the Stable Unit Treatment Value Assumption (SUTVA), guarantees that the potential outcome of an individual is unaffected by changes in the treatment assignment of others. This assumption might be violated if individuals can interact with each other (peer and social effects). In randomised controlled experiments, the first two assumptions are fulfilled by design or, at least, cancel out.

The third assumption, called overlap, guarantees that for all , the probability of being in the treatment group (i.e. the propensity score, ), is bounded away from 0 and 1:


We control for the common support by estimating the propensity score and balance the treatment and control group based on the distribution. We hence exclude all observations that have a propensity score lower or higher than . The fundamental problem of causal inference is that we only observe one of the two potential outcomes at the same time. The counterfactual for a nontreated (treated) person, namely, what would have happened if this person were (not) treated, is always missing. We can represent this statement through a switching regression where the observed outcome () depends on the two potential outcomes and the treatment assignment:


We further assume that, for the estimation of standard errors, the following moments exist: for and .

2.2 Randomized Control Trial

To provide valid estimation and inference for a causal interpretation of parameters, chernozhukov2018generic focus on features of the CATE. One of the main features is the Sorted Group Average Treatment Effect. The idea is to find groups of observations depending on the estimated treatment effect heterogeneity. Their proposed method relies on a two-model approach in the first step. Here, two response functions are trained separately for the treated and non-treated observations. This approach can be biased if the data sample is from an observational study. In randomized control trials, the difference between the two functions provides an estimate of the treatment effect for every observation. To denote that this function might not be consistent or unbiased it is further called score-function ():


Here is the regression model of the outcome variable on separately for and represents the parameters for treatment and control group. The two functions can be estimated with a broad range of supervised machine learning methods. The target parameters are


where is an indicator of a group membership. The groups are ex-post defined by the predicted score function in the first stage. If the treatment effect for the groups are consistent, it asymptotically holds that


which is the monotonicity restriction. Furthermore, it can be tested whether there is a homogeneous effect if would be equal for all groups. The weighted linear projection equation to recover the GATES parameter is:


with and being the baseline function without treatment. is the treatment effect projection. and is the -quantile of . Subscript denotes that these are all out-of-sample predictions. This becomes clearer in the pseudo-code of Algorithm 1, which describes the implementation of this method. The weights represent the Horvitz-Thompson transformation (horvitz1952generalization):


This estimator, which is applied to account for different proportions of observations within strata in a target population, is equivalent to the simple inverse probability weighting estimator. These estimators, however, might exhibit a high variance if the identification (the precision) of the propensity scores is lacking (lunceford2004stratification).

The main identification result is that the projection coefficients can be represented in the following way:

1 for b=1 to B do
2       Split Data in samples: and with
3       Train , with
4       Train , with
5                Predict , with
6                Predict , with
7                        Calculate =
8       Train , with
9                Predict , with
10                        Calculate , with
11       Estimate GATES parameters () with weighted OLS using (see equation 7)
13 end for
14Average over iterations:
Algorithm 1 GATES

2.3 Observational Studies

To use the best linear predictor for group heterogeneity in observational studies, we need to change and extend the first and second stage. First, we replace the two-model approach by a doubly-robust estimator. This means we not only weight by the inverse of the propensity score but also orthogonalize the outcome variable by subtracting the conditional mean. We also use the sample splitting as a form of cross-fitting by using the auxiliary sample to estimate the score function via the doubly- robust estimator and then use the main sample to predict the final score function, which is used in the parametric step. In this way, we limit the danger of overfitting. The parametric second stage simplifies by plugging in the robust score function without the use of inverse probability weighting. The resulting function is a more robust version of the CATE for each individual as well as for the GATE function. The two steps are described in more detail in the following.

The separate estimation of the outcome conditioning on the treatment assignment only works for randomised experiments. Assume that in observational studies individual’s self-select themselves into the treatment. If this is the case, then the distribution of the covariates is different given treatment status. As a consequence, the estimated score-function might not reflect the treatment effect rather than observed differences based on the covariates. We replace the simple two-model approach by a doubly-robust estimator, which accounts for this potential bias via an extension of inverse probability weighting and by using the residuals between the outcome variable and the conditional expectation functions for (see equation 10.

The function is calculated using the training data (the sample). In a second step, a new supervised model is trained on the transformed outcome using while predictions are made on the test set in order to get an unbiased estimate (see equation 11. Algorithm 2 describes this process.


In equation 10, is equivalent to the score-function from the two-model approach. Simulation evidence from knaus2018machine suggests that estimators based on might be more stable because of the doubly-robust property and that the performance is competitive for the estimation of heterogeneous treatment effects in observational studies. The doubly-robust property states that, at least for the ATE, the estimator is consistent and unbiased if only one of the models, the regression or the propensity score, is correctly specified (robins1994estimation; robins1995semiparametric). lunceford2004stratification; williamson2014variance; belloni2014inference study the theoretical properties and highlight implications for practice. One of the findings is that the variance can be decreased when using the doubly-robust estimator instead of a simple inverse probability estimator (lunceford2004stratification). vira2017 show that equation 10 is conditionally locally robust to the estimation error of the nuisance parameter.

Next we state some asymptotic results to recover the CATE. From equation 4 it follows that


Let be the true high dimensional nuisance parameters. Following fan2019estimation we can define



Theorem 2.1
(i) under Assumptions 1,2,3,4

(ii) given (i). This moment condition satisfies the Neyman-orthogonality condition. Neyman-orthogonality is a key component in ensuring that the CATE estimators are robust to the regularization bias inherent for the nuisance functions which are learned via machine learning models.

Through the doubly-robust estimator, , the weighted linear projection equation changes to


with . The interaction is an orthogonalization of the treatment variable to all other covariates and used to increase precision. The Horvitz-Thompson transformation is excluded since the inverse probability weighting is already included in the doubly-robust estimator.

The second extension is to weight each individual based on the group inclusion probability. Instead of taking the median over repetitions for each of the groups we store the information about the group estimate for each individual over the repetitions. The median is then taken over repetitions for each individual rather than the groups. This allows us to not only have a more robust GATE for each group but also to get an estimate for each individual which can be used for comparison with other methods and to make predictions. Naturally, we can do the same in the first step and apply this weighting procedure on the score-function. Algorithm 2 shows the steps to identify the group treatment effect for each individual.

1 for b=1 to B do
2       Split Data in samples: and with
3       Train , with
4       Train , with
5       Train , with
6                Predict , with
7                Predict , with
8                Predict , with
9                    Train on = with
10                    Predict with
11                        Calculate , with
12                        Calculate =
13       Estimate GATES parameters () with OLS using (see equation 14)
15 end for
16Average over iterations for each :
17 Calculate Density for every : given over
18 Calculate Final score-function () given density of medians for i = 1 to N
Algorithm 2 Extended GATES

3 Simulation Study

3.1 Data Generating Process

To evaluate the advantage of the proposed extensions i) doubly-robust first stage and simplified parametric second stage and ii) inclusion probability weighting, we use simulated data where the true treatment effects are known. In the following we describe the data generating process (DGP) in detail and show the variations that we consider. We generate the covariates in a way that they are partially correlated among each other. The process is described in Algorithm 3.

1 Generate random positive definite covariance matrix based on a uniform distribution over the space of the correlation matrix
2 Scale covariance matrix. This equals the correlation matrix and can be seen as the covariance matrix of the standardised random variables .
Generate random normal distributed variables with mean = 0 and variance =
Algorithm 3 Correlation Matrix

An illustration of the distribution for and observations is given in Figure 1.

Figure 1: Correlation Matrix of Covariates. Correlation metric is bravais-pearson.

It shows that the covariates are correlated among each other. This is guaranteed through the uniform distribution of the covariance matrix which is then transformed to a correlation matrix. This assumption is more common in real datasets and helps to investigate the performance of machine learning algorithms, especially the regularization bias, in a more realistic manner.

The basic model used in this simulation study is a partially linear regression model based on robinson1988root with extensions:


with being a continuous outcome variable. is the true treatment effect or population uplift, while is the treatment status. The vector consists of different features, covariates or confounders, while the vector Z is a subspace of X and represents the variables on which the treatment effect is dependent. , and are unobserved covariates which follow a random normal distribution = .

Equation 16 is the propensity score. In the case of completely random treatment assignment the propensity score for all units (). The scalar can take any value between the interval (0,1). Here we use (balanced assignment).

The function is calculated via a trigonometric function to make the covariates non-linear and potentially complicated for estimation.


The vector with represents weights for every covariate. Next, a description of how to build the function as well as how to create a heterogeneous treatment effect is given. A varying treatment effect implies that its strength differs among the observations and is therefore conditioned on some covariates . Regarding the treatment assignment, two options are considered. Option 1 assumes to be completely random assigned among the observations. In this case, is just a vector of random numbers with values or . In the second option, the treatment assignment is dependent on the covariates. The functions are generated as follows:

1 if  random assignment then
2       Generate ;
4 else
5       Create Vector Multiply the matrix by vector with to get vector .
6       Add covariates
7       Calculate probability distribution for the vector from the normal distribution function:
8       Apply random number generator from a Binomial function with probability () for success equals . This creates a vector such that .
9 end if
Algorithm 4 Treatment Assignment

Regarding the treatment effect,we consider different options. First, is constant for every unit. Second, depends linear on a subset of the covariates and is continuous. The third option is a non-linear dependence of all covariates and continuous. Fourth, again depends on some space of the covariates and further takes only two different values.

1 if constant effect then
2       with ;
4 else if simple heterogeneous effect then
5       Generate
6       ;
8 else if non-linear heterogeneous effect then
9       Apply trigonometric function:
10 else
11       Define Z as some feature space of X and apply CDF as in 19 and run Bernoulli trials:
Standardise the treatment effect within the set {-0.1,+0.3}.
12 end if
Algorithm 5 Treatment Effect

3.2 Simulation Results

Figure 2 shows the densities for 49 randomly selected observations. The simulated data in this case has the following properties. , , and . We show that even in randomised experiments, the point estimates differ due to the sample-splitting in the first step. Averaging them by taking the median leads to a more stable conditional treatment effect function. The same is done with the group average treatment effects.

Figure 2: Distribution of scores () for 49 randomly selected individuals.

Figure 3 shows the results from the simulation given the DGP in Table 1. We use inclusion probability weighting to assign a group average treatment effect to every observation. We use these estimates to report the mean absolute error from the true treatment effects. This is done for both methods, the two-model and the doubly-robust. For each data generating process, we use Monte Carlo resampling 10 times and show the single results in Figure 3. Points below the 45-degree line are in favour for the doubly-robust method since it show a smaller MAE for the doubly-robust compared to the two-model approach. We also state the average result (error) for each setting in Table 1. A two-sample Welch t-test confirms that the hypothesis of equal means can be rejected based on a 1% significance level for each setting. Algorithm 6 describes the assignment of the treatment effects based on the groups from the GATES as well as the MAE estimation over datasets.

1 for s=1 to S do
2       for b=1 to B do
3             Assign group average treatment effect from group to observations in group
4             Store results in some matrix
5       end for
6               Average Take median for each observation over bootstraps
7                Estimate final GATE based on relative group membership
8                Estimate mean absolute error (MAE) from true treatment effect
9                                 for both estimators
10                Store results in some matrix
11       Resample keeping specifications constant (Monte Carlo study)
12 end for
Average errors over iterations
Algorithm 6 Inclusion Probability Weighting and MAE estimation
Scenarios A B C D E F
N 1000 1000 5000 5000 5000 5000
100 200 100 200 200 500
constant linear non-linear binary non-linear binary
Average error Two-Model 0.19 0.20 0.16 0.16 0.18 0.20
Average error Doubly-Robust 0.15 0.16 0.13 0.12 0.13 0.15
Table 1: Settings and Monte Carlo averages
Figure 3: Comparison of two-model approach and doubly-robust. Axes show mean absolute error between estimates and true individual treatment effects. 45-degree line indicates the equality of both methods.

The simulation study shows that, for all the considered data settings, our method decreases the error to the true individual treatment effect. Setting A:D show results for non-randomized settings with different parameters. We even find that the proposed extensions produce a smaller MAE in randomized control trials ( see Figure 3: E, F). This is true for every resampling of the DGP and each setting. Looking at the MAE we find the highest difference between the two methods for random assignment ().

4 Conclusion

In this paper, we propose a method to estimate group average treatment effects in the combination of machine learning methods and parametric estimation for non-randomized control trials. Since flexibility in terms of the model choice, as well as interpretability of the results, is of main interest, we extend the idea of the GATES approach towards the use of a doubly-robust estimator in the non-parametric step. This ensures to control for self-selection into treatment which is a realistic challenge in observational studies.

We find that using a doubly-robust estimator with cross-fitting, in combination with a simplified parametric model, decreases the mean absolute error compared to the original two-model approach significantly. We further propose inclusion probability weighting to identify the GATE value for each individual in a robust way. This allows making predictions on new observations. In our setting, we considered only five different groups. This amount could be increased to e.g. 10 or even more groups. In empirical settings, it would depend on the sample size. If we want to have at least 30 observations within a group we could have groups, with -splits or folds of the dataset in the first stage. Here we considered only two-folds. However, there is no general relationship between the number of folds in cross-fitting and the precision of the estimator (see chernozhukov2018double for an example with different folds). Due to computational reasons we only use = 10 bootstrap iterations within the same sample and = 10 Monte Carlo re-samplings of the same data generating process. This amount needs to be increased to e.g. 50 and 100, respectively. At this stage, we only consider a boosting-trees algorithm (with parameter tuning via 10 fold cross-validation) as a machine learning method. In a further draft, we will extend this to the use of boosted gradient descent (XGBoost), random forest algorithm, neural networks and some linear methods like variants of the Elastic Net. We can even consider different methods for each nuisance function.

In a further draft, we would also compare the ATE as well as the CATE, all resulting from the group average treatment effect, with recent methods that estimate the former parameter or function. This could especially be useful to test for some linear dependency in the underlying data generating process.


Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description