Tree-Based Bayesian Treatment Effect Analysis
The inclusion of the propensity score as a covariate in Bayesian regression trees for causal inference can reduce the bias in treatment effect estimations, which occurs due to the regularization-induced confounding phenomenon. This study advocates for the use of the propensity score by evaluating it under a full-Bayesian variable selection setting, and the use of Individual Conditional Expectation Plots, which is a graphical tool that can improve treatment effect analysis on tree-based Bayesian models and others “black box” models. The first one, even if poorly estimated, can lead to bias reduction on the estimated treatment effects, while the latter can be used to found groups of individuals which have different responses to the applied treatment, and analyze the impact of each variable in the estimated treatment effect.
Keywords: Causal inference, Propensity score, Bayesian Regression Trees
The advent of Bayesian computation on the previous decades allowed the creation of models with a high degree of complexity, while Monte Carlo Markov Chains (MCMC) algorithms allow the estimation of models that were previously considered infeasible. One of these models is the Bayesian Additive Regression Trees (BART) (CGM10), an extremely flexible nonparametric model. Hill11 applied the BART models to the causal inference setting, more specifically on the estimations of binary treatment effects for observational studies, and the results were promising.
But these models can be affected by the regularization-induced confounding (Hahn18), which states that, in the presence of confounding, the regularization of these models may cause biased estimations of the treatment effects. Hahn17 argue that, under strong ignorability, this problem can be eased through the use of the propensity score (Rubin83) among the covariates of the model.
This study has two main contributions. The first is the application of a sensitivity analysis in the causal inference setting through the use of Individual Conditional Expectation Plots (Goldstein15). The second is to corroborate the inclusion of the propensity score on Bayesian regression tree models (Hahn17) by using simulations and the full-Bayesian variable selection proposed by Linero18.
Section 2 introduces notation and a revision on tree-based Bayesian regression trees for causal inference. Section 3 specify how the propensity score and the Individual Conditional Expectation (ICE) Plots can be used to properly perform treatment effect analysis on Bayesian regression trees. Section 4 have simulations on which the techniques introduced on the previous section are used. Finally, in Section 5 the results of the study are discussed, along with proposed extensions.
2 Treatment Effect Estimation with Bayesian Regression Trees Models
Capital roman letters are used to denote random variables, while realizations are denoted in lower case. Vectors are denoted by a tilde on the top of the variable, and matrices are denoted by bold variables. Let denote a scalar response, denote a binary treatment effect and denote a vector of covariates . In such way that the triplet denotes the observation of the th individual of a sample size .
Following the notation of Imbens04, the that has a realization can be denoted by:
It is important to notice that only or can be observed, while the unobserved is called counterfactual.
In the paper strong ignorability (Rubin83) is assumed to hold. This condition require two assumptions:
The first assumption guarantees that there are no unmeasured confounders in the analysis, while the second assumption assures that there is a positive probability of assigning each treatment to every individual in the population, always enabling the existence of the counterfactual, thus making it possible to estimate the treatment effect.
Using the framework adopted by Hahn17 for expressing treatment effects, it follows that the Individual Treatment Effect (ITE) can be represented as
where is the ITE for the th individual of the sample.
As in Hill11, the Conditional Average Treatment Effect (CATE) is estimated by
which is equivalent to the average of the individual treatment effects of the sample.
2.1 BART Model
CGM10 introduced the BART model, which is a Bayesian nonparametric “sum-of-trees” model, where each tree is a weak learner constrained by a regularization prior. The individual tree is formed by a set of binary splits from the set of covariates , and a set terminal nodes . Each split is of the form vs for continuous variables.
The model can be expressed as
where is the number of trees on the sum, and is the tree function from CGM98, which assigns a value from the th tree to as seen in the example from Figure 1.
The priors of the model can be specified as
where is the cardinality of the set .
The prior works as a regularization prior which constrains each tree to be a “weak learner”, which is a model that performs better than chance. For convenience, is scaled between and , so that the prior given by
helds, for the default setting of , probability that the expected value of the response lies within the interval . The hyperparameter can, also, be chosen by cross-validation.
The prior is given by
which is an inverse chi-square distribution. The hyperparameter is given in a way such that , where is an initial guess based on the data. CGM10 recommends the default setting as .
The prior has three parts. The first one is the probability of a split in a determined node, which is given by
where and are hyperparameters with default values and , and is the depth of tree . The second part, which is the probability of selecting a determined variable to perform a split, is given by a discrete Uniform hyperprior. Finally, the third part is given by a discrete Uniform distribution over the possible splits of the variable chosen for the split.
The posterior of the model can be expressed as
One way to sample from this posterior distribution is through the Bayesian backfitting algorithm introduced by Hastie00, which, basically, is a Gibbs sampler drawing with , conditionally on , where is the set of trees with its associated terminal node parameters except .
To perform a draw from it is important to notice that have impact on only through
in such a way that can be sampled using the framework of CGM98 for drawing samples of a single tree.
The authors introduced four proposals from which the trees can mix by using the Metropolis-Hastings algorithm: GROW (grow a split from a terminal node), PRUNE (collapse a split above two terminal nodes), CHANGE (change a rule from a nonterminal node) and SWAP (swap the rules of a parent and child nonterminal nodes). Pratola16 also presents the ROTATION and the PERTURB proposals as an alternative to improve the tree mixing. For further details of proposals GROW, PRUNE and CHANGE, see Bleich16.
Hill11 noted that as the BART captures interactions and nonlinearities with ease, handle a large number of covariates that are potential confounders, and have a stable default setup for its priors. So the model is a tool that can be applied to the causal inference setting, specially at the estimation of CATE, since the model estimates of ITE have shown great uncertainty.
2.2 BCF Model
One possible extension of the BART model can be achieved by adding linear components to the model (CGM10, p. 295). The Bayesian Causal Forests (BCF) model introduced by Hahn17 follows this idea by using a linear combination of two BART models to estimate the value of the response .
The BCF model is given by
where and are independent BART priors, is vector of covariates for the th individual of the sample, and is the estimate of the propensity score for the th individual. The propensity score estimate for each individual is introduced in the model as a covariate with the main objective of reducing bias into the estimate of treatment effects. Its role is further explored at Section 3.1.
The function estimate the prognostic effect of each individual, while the function is used to capture its treatment effect. The BCF has been designed this way because in the original BART setting there was no control on how the model varies in , and in this new reparametrization the works like an indicator function for the , enabling the model to aggregate all the covariates interactions regarding treatment effects in the same function.
Since the functions and are independent, each can have different priors that are adaptable according to its characteristics. Both functions held reasonable results with the default prior settings, but Hahn17 have chosen to do two main modifications regarding . The first modification has been made to support homogeneous treatment effects in the model by two amendments in the prior: Considering that the homogeneous treatment effects are represented by trees that are root nodes, the parameter is adapted to the probability of homogeneous effects by solving , where is the number of trees in the function ; Setting (instead of ) to lower the split probability, since the works as an indicator function which can be compared to a first tree split in the variable , meaning that all trees in actually start with depth of . The second modification has been made through the implementation of a half-Cauchy hyperprior to the scale parameter of , where . The default BART uses a constant to its scale parameter, but the change grants a way of avoiding spurious inferences. For further details on the half-Cauchy hyperprior, see Gelman06.
Furthermore, the posterior estimates can be used to analyze the treatment effects at individual level in a way that is possible to identify groups which have positive (or negative) treatment effects within a certain credible interval, allowing a kind of study that was not recommended in Hill11 framework due to the lack of estimates robustness.
3 Treatment Effect Analysis
The BART model presented in Section 2.1 can be used to estimate the ITE by
where consist of the posterior draws from the prediction of the estimated model.
In the case of the BCF model from Section 2.2, as the function is estimated separately from the prognostic effect, the BART prior of already gives the posterior draws from the estimated model as an output. It should be noted that since both models outputs are given by posterior draws, it is straightforward to construct credible intervals for the estimated treatment effects with the use of quantiles from these draws.
The following subsections introduce some methods and tools to assist the analysis of treatment effects in these models.
3.1 The RIC and the Role of the Propensity Score
The term “regularization-induced confounding” (RIC) was introduced by Hahn18 into the setting of linear models with homogeneous treatment effects and further expanded by Hahn17 to the BART models, which despite having a good predictive performance, have shown biased treatment effect estimation and lack of robustness at the individual level estimates when applied to the causal settings. Hahn17 tries to avoid the RIC phenomenon by including the estimate of the propensity score as a covariate in the BART and the BCF models.
The propensity score was first introduced by Rubin83 as an assisting tool to reduce bias in observational studies by balancing the data according to the probability of assigning a treatment to an individual given its vector of covariates. The authors also introduced matched sampling, subclassification, and covariance adjustment techniques by using the propensity score. In other words, the propensity score was created to deal with the problem of treatment effects bias in observational studies.
As shown by Hahn17, the inclusion of the propensity score allows the tree-based models to adapt more easily in cases where the data exhibits complex confounding. Since the propensity score naturally simplifies the number of required splits in a context of parsimonious trees, it allows the model to focus on other interactions between the variables, reducing bias and improving the predictions of the model.
Under the sparsity setting it is possible to assessthe use of the propensity score in the model by the variable selection framework introduced by Linero18 in its Dirichlet Additive Regression Trees (DART) models, which assigns a sparsity-inducing Dirichlet hyperprior (instead of an Uniform hyperprior) on the probability of choosing a variable to split on, which is given by
where is the number of covariates in the model and is given by the prior
with , and in this paper.
The author suggests two approaches to perform the variable selection: The first one is to look into the posterior draws from the Dirichlet hyperprior and the second one is to use the method of variable selection proposed by Berger04 and calculate the posterior inclusion probability () for each variable.
The former approach is straightforward, since the posterior draws allow the construction of credible intervals for the probability of choosing any variable. The latter approach can be performed by simple verifying, for each iteration of the MCMC, if the variable is used in at least one splitting rule of the tree ensemble. If it is used, the value is assigned, and, if not, is assigned. The is the mean of these indicator functions for variable over all iterations of the MCMC, and the variable will be selected if .
3.2 A Visualization Tool: ICE Plots
Friedman01 introduced the Partial Dependence Plot (PDP) in order to allow the visualization of the impact that a variable have in the response, performing an sensitivity analysis. The main reason for the development of such a tool is due to the difficulty to interpret parameters in machine learning methods, more specifically, the models known as “black box” models, such as Support Vector Machines, Random Forests, Neural Networks, etc.
The partial dependence function for the th observation is defined as
where is the sample size, is the prediction function from the “black box” model, is the index of one of variables of interest, , is a scalar from the variable of interest from the th observation of the training data, and is the vector of covariates from the subset of the observation from the training data. The curve is made from the partial dependence functions of all observations from . In general, with , but there are no means of plotting the PDP for .
CGM10 suggest the use of PDP to analyze the marginal effect of the variables in relation to the response. It must be noted that the implementation of this technique in BART models actually give draws from the posterior distribution, thus, it is easy to acquire the credible intervals for the PDP curve.
This tools have already been applied to BART models in the treatment effect setting in Green12, where the authors create curves regarding each variable in relation to the CATE, which takes the place of the response variable. This is possible due to a slight modification in the algorithm by using
But since this tool only evaluates the mean of the response, it cannot be used into the analysis of individual treatment effects.
Goldstein15 introduced the ICE Plots by noting that since the PDP curve could conceal heterogeneous effects in the response variable, it was necessary a tool for analyzing the marginal effect at each individual. By using the same notation of equation (13) the ICE function for the th individual is defined by
where the ICE curve for the th individual is formed by calculating the ICE function for every of the sample. This way, there will be curves in the plot, and the PDP curve can be obtained by averaging the ICE curves.
This analysis can be very useful in the context of treatment effects, since in the setting of heterogeneous treatment effects it is possible to look for indications of groups that have different reactions to the applied treatment. Like in equation (14), the ICE curves can be adapted to the ITE setting by estimating
instead of using the formula described in equation (15).
Examples from the ICE Plots and the PDPs can be found on Section 4.1.
In order to assess the tools that have been proposed in Section 3, some simulations were performed. In Section 4.1 the simulations were made in order to corroborate the use of the propensity score as a covariate in BART models by using the ICE Plots to evaluate the ITE of the sample. In Section 4.2 the simulations were performed in the sparsity setting, to evaluate how often the propensity score is used by the model in relation to the other variables. Only the BART models were studied at Sections 4.1 and 4.2, but these methods can be extended to the BCF models as well. Furthermore, in Section 4.3 the CATE and ITE estimates of BART and BCF models were evaluated.
Zigler14 pointed that since the propensity score carries uncertainty about its estimations, so it is natural to advocate the use of a Bayesian framework in it. Following Hahn17, the probit-BART posterior mean is used as the propensity score estimate for each individual in the simulations. As a comparative, the frequentist approach of estimating the propensity score by Generalized Linear Models with logit link was used.
All calculations were performed in R version 3.4.3 (R) by using the packages BART (BART), bcf (bcf) and ICEbox (Goldstein15). As in Hill11, the priors, hyperparameters, and hyperpriors were held under default setting in this paper, but cross-validation can be performed in order to improve results.
4.1 Simulation Based on Real Data
In this scenario, the following models were analyzed:
Vanilla: estimated by a BART model using as covariates;
Oracle (Oracle-BCF): estimated by a BART (BCF) model using and , the true value of the propensity score, as covariates;
PS-BART (PS-BCF): estimated by a BART (BCF) model using and , estimated by the posterior mean of the probit-BART, as covariates;
GLM-BART (GLM-BCF): estimated by a BART (BCF) model using and , estimated by GLM (which is considered as a naive approach to estimate the propensity score), as covariates;
Rand-BART (Rand-BCF): estimated by a BART (BCF) model using and , given by a random Uniform distribution, as covariates.
This simulation is based on Hill11, and uses the Infant Health and Development Program (IHDP) dataset (). In order to simplify the simulation, only the continuous covariates are used: birth weight (); head circunference (); weeks born preterm (); birth order (); neonatal health index () and age of the mother (). The response surface was generated by
The predictors were standardized for data generation, and the ’s were sampled from (0, 1, 2, 3, 4) with probabilities (0.05, 0.1, 0.15, 0.2, 0.5).
The true propensity score and the real treatment effects were generated based on Hahn17 example as it follows,
This simulation was replicated times in order to assessthe results. In each model the first draws from the MCMC were treated as burn in, while the posterior draws had size .
The Figure 2 is composed by the boxplots of posterior CATE for each model for the one replication of the simulation, but it is important to point out that similar results were found for the other replications. The red line is the real CATE for this specific iteration.
In general, the BART and the BCF models held similar results. The Vanilla model posterior CATE estimates apparently are impacted by the RIC phenomenon, so the model performed poorly. The Oracle models, as expected, had a good performance due to the inclusion of the propensity score as a covariate. The PS and the GLM models performed slightly better than the Vanilla model, indicating that the inclusion of the propensity score had a positive impact on the model, but the uncertainty associated with the estimation of propensity score contributed negatively on the CATE estimates. The Rand models held the worst results in the simulation due to the inclusion of an irrelevant variable and the lack of a propensity score estimate among its covariates.
As seen in the ICE Plots of the Vanilla and Oracle models from Figure 3, the inclusion of the true propensity score as a covariate in greatly reduces the uncertainty over the individual treatment effects, allowing the visualization of different groups of individuals and eliminating most of spurious effects from the other covariates.
Furthermore, across replications, the use of the propensity score as a covariate is corroborated, since the inclusion of true (estimated) propensity score greatly (slightly) improved the performance of the model, as seen in the CATE and ITE RMSE boxplots in Figures 4 and 5. The means and the standard deviations for average RMSE of these models over the replications can be found on Table 1.
|CATE RMSE||ITE RMSE||CATE RMSE||ITE RMSE|
Also, the BART and the BCF models held similar results across simulations, but for the ITE RMSE, the BCF model seems to have a better performance under the uncertainty of the estimation of the propensity score, despite having a higher variance across the RMSE calculated over the replications. Nevertheless, the BART model was superior in the Oracle scenario.
4.2 Sparse Data Examples
Under the sparse setting, the following models were analyzed:
Vanilla (Vanilla-DART): estimated by a BART (DART) model using as covariates;
Oracle (Oracle-DART): estimated by a BART (DART) model using and , the true value of the propensity score, as covariates;
PS-BART (PS-DART): estimated by a BART (DART) model using and , estimated by the posterior mean of the probit-BART (probit-DART), as covariates;
GLM-BART (GLM-DART): estimated by a BART (DART) model using and , estimated by GLM, as covariates;
Rand-BART (Rand-DART): estimated by a BART (DART) model using and , given by a random Uniform distribution, as covariates;
In order to acknowledge the propensity score role in the model, a method of variable selection was performed. Selected variables were those whose presented . To assessthe performance of the variable selection, following Linero18 and Bleich14, Precision, Recall, and were used. These measures are defined by,
where is True Positive, is False Positive, and is False Negative.
These simulations were replicated times in order to assessthe results. In each model the first draws from the MCMC were treated as burn in, while the posterior draws had size , with thinning being set to . For all simulations, .
4.2.1 Hahn Simulation under Sparsity
The example based on Hahn17 simulation was generated as it follows,
where and .
Like in the previous example, the Figure 6 is composed by the boxplots of posterior CATE for each model for the one replication of the simulation. The red line is the real CATE for this specific iteration.
Under the sparse setting, the DART models, in general, had a better performance at estimating the CATE. The Vanilla model estimates remained close to the Rand models estimates. The GLM models performed slightly better than the Vanilla model, but the inclusion of many irrelevant variables on the model have shown a negative impact on the estimates. The PS models have performed significantly better than the GLM models, while, as expected, the Oracle models have shown the best results.
Figures 7 and 8 represents the Dirichlet hyperprior draws for the Vanilla-DART and Oracle-DART models. The inclusion of the propensity score in the Oracle model allows the tree ensemble to focus on the variable , which determines the treatment effect of each individual, instead of trying to figure out how the variables and are related. Figures 9 and 10 represents BART and DART estimation from the Oracle models. While BART tends to use all the variables in the tree ensemble, the Dirichlet hyperprior on DART select only a few among the possible covariates.
|Model||CATE RMSE||ITE RMSE||Precision||Recall||PS-Usage|
The performance of the models over the replications is evaluated at Table 2, along with the Probit model used to estimate the propensity score. Variables were selected via . For Precision, Recall and , the value indicates a perfect adjustment. PS-Usage indicates the mean of the proportion of times that the propensity score estimation was used in the model, and if the model had an estimation of the propensity score, regardless of misspecification, the estimation was considered a relevant variable in this analysis. All measurements are given by the mean over the replications, with standard deviation in parentheses.
4.2.2 Friedman Function under Sparsity
The simulation adapted from Friedman91 example was generated as it follows,
where and .
The boxplots from Figure 11 are composed by posterior CATE for each model for the one replication of the simulation. The red line is the real CATE for this specific iteration.
Like in Section 4.2.1, the DART models were superior. Again Vanilla and Rand models estimates are, apparently biased due to the RIC phenomenon, while the PS models have shown more consistent results in relation to the GLM models. The most accurate results were held by the Oracle models.
Figures 12 and 13 represents the Dirichlet hyperprior draws for the Vanilla-DART and Oracle-DART models, while the Figures 14 and 15 represents BART and DART estimation from the Oracle models. It is possible to see that the DART models tends to select only the relevant variables in the model, including the propensity score.
|Model||CATE RMSE||ITE RMSE||Precision||Recall||PS-Usage|
Following the Section 4.2.1 methodology, the performance of the models over the replications is evaluated at Table 3, along with Figures 16 and 17. Variables were selected via . For Precision, Recall and , the value indicates a perfect adjustment. PS-Usage indicates the mean of the proportion of times that the propensity score estimation was used in the model, and if the model had an estimation of the propensity score, regardless of misspecification, the estimation was considered a relevant variable in this analysis. All measurements are given by the mean over the replications, with standard deviation in parentheses.
4.3 Simulations Assessment
In all examples the inclusion of the true or the estimated propensity score resulted in a decrease of the impact that the RIC phenomenon had over the model. As expected, the models which had the true propensity score have shown better results. The two models used to estimated the propensity score have shown similar performance in the simulation based on real data, but on the sparsity examples the BART based models had superior results. That may be due to the fact that the GLM was including all the variables in the model, while the BART and the DART models can naturally incorporate interactions between covariates, and even perform accurate variable selection in the case of DART. The simulations were generated in a simple setting, allowing both models to adjust relatively well, but in real datasets there might be unusual interactions between the covariates, as well as irrelevant variables, which is a scenario that models derived from BART, such as DART and BCF, can adapt with ease.
The flexibility of the ICE Plot allow it to be used under many different scenarios, but in the treatment effect setting it brings up three main advantages: allows the visualization of variables that do not impact in the treatment effect; show possible candidates of relevant variables for different individuals; and grants a way that may help in the identification of groups whose individuals may be affected in different ways by the chosen treatment.
Under the sparsity setting, it can be seen that the DART models variable selection performed well, even in both examples. Moreover, based on the results, the from the DART model can be considered an important tool upon the definition of which variables to include in the propensity score estimation.
In general, Hill11 and Hahn17 work were extended by including previously existing tools that could, and should, be applied to the causal inference setting.
We have corroborated Hahn17 study, which argues that inclusion of the propensity score can suppress at least part of the bias that the RIC phenomenon adds to the data. This idea was enforced by analyzing the effects of propensity score through a sensitivity analysis, and by the use of a full-Bayesian variable selection method. We have also found that in binary treatment effect observational studies even a naively estimated propensity score (which was played by the GLM in the simulations) may have a positive impact on the model, and even if the estimates are completely random (like in the Rand models in the simulations), there will be no additional bias in the treatment effect estimation in relation to the Vanilla model. Alternative Bayesian or machine learning estimations of the propensity score can also be further explored.
In regard of model effectiveness, the BCF allows its priors to be allocated freely in the functions related to the prognostic effect and the treatment effect, so the model may held better results than BART if cross-validation is applied, but we have not found a clear superior model under default priors, hyperparameters and hyperpriors.
A possible extension of this study can be done by applying the DART Dirichlet hyperprior to the BCF model and verifying the model effectiveness in high dimensional data examples with . Also, sensitivity analysis can be done under the sparse setting.
Another possible approach could be done by inserting heteroscedastic error terms and applying Pratola17 approach.
Let denote the th observation in the terminal node in a tree with structure . It will be assumed that
such that , and .
So, the likelihood of the data following CGM98 framework is given by
In order to avoid reversible jumps, the MCMC algorithm uses
Thus, Linero17 shows that this integral can be simplified to