Estimating Causal Effects With Partial Covariates For Clinical Interpretability

Sonali Parbhoo*, Mario Wieser* and Volker Roth
University of Basel
* These authors contributed equally.




Machine Learning for Health (ML4H) Workshop at NeurIPS 2018.

1 Introduction

Understanding the causal effects of an intervention is a key question in many applications, from personalised medicine to marketing (e.g. Sun et al. (2015); Wager and Athey (2017); Alaa and van der Schaar (2017)). Predicting the causal outcome typically involves dealing with high-dimensional observational data that is frequently subject to the effects of confounding.

In general, we distinguish between measured and hidden confounding. When confounders are directly measured, they may be accounted for using techniques that correct for their effects, such as propensity reweighting (IPS) or covariate shift (Hernán and Robins, 2006; Rosenbaum and Rubin, 1984). In contrast, to account for hidden confounding, proxy variables may be used as noisy representatives of latent confounders (Greenland and Lash, 2008; Pearl, 2012; Kuroki and Pearl, 2014; Louizos et al., 2017). Both approaches can, however, only be applied when covariate data is completely measured. This assumption does not hold in a large number of settings such as medicine. For example, doctors are interested in identifying treatments that improve patient outcomes, and have to base decisions on hundreds of potentially confounding variables such as age and genetic factors. Here, a doctor may readily have access to many routine measurements such as blood count data for all patients, but may only have genetic information for some patients. Inferring the causal effects of a treatment requires learning a joint distribution over covariates and confounders of patients whose data is completely observable, while simultaneously transferring this knowledge to patients whose data is incomplete. This is not achievable in practice since we have to integrate over all missing covariates.

We propose addressing the problem of performing causal inference with partial covariate information from a decision-theoretic point of view. Specifically, we assume that a fixed set of measurements is unavailable for a subset of the data (or patients) at test time. The key idea is to use the Information Bottleneck (IB) criterion (Tishby et al., 2000) to perform a sufficient reduction of the covariates and recover a distribution of the confounding information. The IB enables us to build a discrete reference class over patients whose covariate data is complete, to which we can map patients with incomplete data and estimate treatment effects on the basis of such a mapping. Finally, we demonstrate that our method outperforms existing approaches across established causal inference benchmarks and a real-world application for treating sepsis.

2 Method

We refer to our model as cause-effect IB (CEIB). Figure 1 gives an overview of the possible configurations for performing causal inference and places our model in the context of existing work. The corresponding causal graphs for Cases I and II are shown in Figure 2. The major difference between Cases I and II is the reversal of the arrow between the covariates $X$ and the latent representation $Z$, and the fact that in Case II confounders are not measured, but indirectly observed via noisy proxies.

Figure 1: Overview of causal inference with confounding effects and missing covariates. In this paper, we address Cases I and II, thus accounting for incomplete covariate information when confounding is measured and hidden respectively.
(a) Case I: Measured confounding
(b) Case II: Hidden confounding
Figure 2: Influence diagrams of the two cases considered in this paper. Red and green circles correspond to observed and latent random variables respectively, while blue rectangles represent interventions. In Case I, we identify a low-dimensional representation $Z$ of measured covariates $X$ to estimate the effects of an intervention on outcome $Y$. In Case II, the arrow between $X$ and $Z$ is reversed and confounders are indirectly measured via proxy variables, indicated by an orange circle here. We identify a low-dimensional representation $Z$ and use this to explicitly estimate as well as . In both cases, representation $Z$ is used to make inferences for a subset of patients where only partial covariate information is available.

In this paper, we adopt the decision-theoretic approach of Dawid (2007) to estimate the causal effect under both hidden and measured confounding with incomplete covariates. This involves computing the average causal effect (ACE) of treatment $T$ on outcome $Y$. Dawid (2007) shows that the ACE and the observational ACE are equivalent under the conditional independence assumption $Y \perp\!\!\!\perp F_T \mid T$, where $F_T$ denotes the regime indicator. This assumption expresses that the distribution of $Y \mid T$ is the same in the interventional and observational regimes. It can also be extended to account for the notion of confounding. Here, the treatment assignment may be ignored when estimating $Y$, provided a sufficient covariate $Z$ and $T$. Formally, $Z$ is a sufficient covariate for the effect of $T$ on outcome $Y$ if $Z \perp\!\!\!\perp F_T$ and $Y \perp\!\!\!\perp F_T \mid (Z, T)$. It can also be shown via Pearl’s backdoor criterion (Pearl, 2009) that the ACE may be defined in terms of the Specific Causal Effect (SCE),

$$\mathrm{ACE} := \mathbb{E}_{Z}\left[\mathrm{SCE}_Z\right],$$

$$\mathrm{SCE}_Z := \mathbb{E}\left[Y \mid Z, T = 1\right] - \mathbb{E}\left[Y \mid Z, T = 0\right].$$

Importantly, estimating the ACE only requires computing a distribution $p(y \mid z, t)$ in Figure 2. In what follows, we use the IB to learn a sufficient covariate that allows us to approximate this distribution.
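To make the relation between the ACE and the SCE concrete, the following minimal sketch averages cluster-level effects over a discrete sufficient covariate. It assumes that the covariate takes finitely many values (clusters) with known probabilities, and that the expected outcome under treatment and under control has already been estimated for each cluster. The function name and the numbers are illustrative and are not taken from the paper.

```python
from typing import Sequence


def average_causal_effect(
    cluster_probs: Sequence[float],
    mean_treated: Sequence[float],
    mean_control: Sequence[float],
) -> float:
    """Average the per-cluster specific causal effects (SCE) into the ACE."""
    if not (len(cluster_probs) == len(mean_treated) == len(mean_control)):
        raise ValueError("all inputs must have one entry per cluster")
    total = sum(cluster_probs)
    if total <= 0:
        raise ValueError("cluster probabilities must sum to a positive value")
    ace = 0.0
    for prob, y1, y0 in zip(cluster_probs, mean_treated, mean_control):
        sce = y1 - y0  # specific causal effect within this cluster
        ace += (prob / total) * sce  # weight by the normalised cluster probability
    return ace


# Hypothetical values for three clusters (assumed, for illustration only).
example_ace = average_causal_effect(
    cluster_probs=[0.5, 0.3, 0.2],
    mean_treated=[0.2, 0.4, 0.6],
    mean_control=[0.1, 0.5, 0.3],
)
```

Because the average runs over clusters rather than raw covariates, the same computation applies once every patient has been mapped to a cluster, whether or not all of their covariates were measured.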

Case I: Measured Confounding

This case occurs when we have observational data where all the relevant confounding variables are measured, but where a fixed set of covariates is only available for some subset of the data at test time. Let and be our covariate sets (both available at training). We adapt the IB for learning the outcome of a therapy when partial covariate information is available for at test time. To do so, we consider the following parametric form,


where and are low-dimensional discrete representations of the covariate data, is a concatenation of and and represents the mutual information parameterised by networks , , and respectively. We assume a parametric form of the conditionals , , , , as well as Markov chain . The three terms in Equation 3 have the following forms:

as a result of the Markov assumption in the IB model. Here is the entropy of . For the decoder model, we use an architecture similar to the TARnet (Johansson et al., 2016), where we replace conditioning on high-dimensional covariates with conditioning on latent . We can thus express the conditionals as,


with logistic function , and outcome given by a Gaussian distribution parameterised with a TARnet with . Note that the terms correspond to neural networks.
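The structure of such a decoder can be sketched in plain Python. This is only an illustration under stated assumptions, not the authors' implementation: a logistic model gives the treatment probability from the latent code, and a TARnet-style network shares one hidden layer before branching into separate output heads for treated and control patients, returning the mean of the Gaussian outcome. Layer sizes, weights and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# A layer is (weight rows, biases); each weight row produces one output unit.
Layer = Tuple[Sequence[Sequence[float]], Sequence[float]]


def sigmoid(a: float) -> float:
    """Numerically stable logistic function."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def dense(layer: Layer, inputs: Sequence[float]) -> List[float]:
    """Affine map: one output per weight row."""
    weights, biases = layer
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def treatment_probability(z: Sequence[float], treatment_layer: Layer) -> float:
    """Probability of treatment given the latent code z (logistic model)."""
    return sigmoid(dense(treatment_layer, z)[0])


def outcome_mean(
    z: Sequence[float],
    t: int,
    shared_layer: Layer,
    head_control: Layer,
    head_treated: Layer,
) -> float:
    """Mean of the Gaussian outcome: shared hidden layer, then one head per arm."""
    hidden = [max(0.0, h) for h in dense(shared_layer, z)]  # ReLU activation
    head = head_treated if t == 1 else head_control
    return dense(head, hidden)[0]
```

Keeping a separate head per treatment arm means the treatment indicator selects which function of the latent code is evaluated, rather than entering the network as one more input feature.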

Case II: Hidden Confounding

This case is analogous to the work of Louizos et al. (2017). However, we treat proxies as measured confounders and propose using Case I to estimate the causal effect here. Using Case I is permissible since both DAGs in Figure 2 are Markov equivalent, and the causal direction between $X$ and $Z$ can only be determined by additional assumptions on the causal graph. Moreover, assuming the causal structure in Figure 2(b), as in Louizos et al. (2017), requires the definition of a complex prior over $Z$. Hence, it may be more natural to treat all covariates, including proxies, as measured confounders, as we propose in this paper. In doing so, we compress the relevant information to a sufficient covariate as described in Case I.

Once we can estimate $p(y \mid z, t)$ in both cases using the proposed model, we can compute the ACE. Given a test patient with partial covariates, we can assign them to the closest equivalence class of patients with similar characteristics, and approximate the effect of treatments on this basis.
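The assignment step can be sketched as a nearest-centroid lookup restricted to the covariates that were actually measured. This is a deliberate simplification for illustration: it assumes each equivalence class is summarised by a centroid in covariate space and uses Euclidean distance, whereas the model described here performs the mapping through its learned latent representation. All names and values are hypothetical.

```python
import math
from typing import Optional, Sequence


def nearest_cluster(
    patient: Sequence[Optional[float]],
    centroids: Sequence[Sequence[float]],
) -> int:
    """Index of the closest centroid, comparing only the measured covariates."""
    observed = [i for i, value in enumerate(patient) if value is not None]
    if not observed:
        raise ValueError("at least one covariate must be measured")

    def distance(centroid: Sequence[float]) -> float:
        return math.sqrt(sum((patient[i] - centroid[i]) ** 2 for i in observed))

    return min(range(len(centroids)), key=lambda k: distance(centroids[k]))


def estimated_effect(
    patient: Sequence[Optional[float]],
    centroids: Sequence[Sequence[float]],
    cluster_effects: Sequence[float],
) -> float:
    """Treatment effect of the cluster the patient is assigned to."""
    return cluster_effects[nearest_cluster(patient, centroids)]
```

Missing covariates are encoded as `None` and are simply skipped in the distance, so a patient with partial data is still compared against every cluster on the measurements they do have.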

3 Experiments

We demonstrate the performance of our approach on a high-dimensional real-world task for managing and treating sepsis. Additional experiments are in the supplement. For this experiment, we make use of data from the Multiparameter Intelligent Monitoring in Intensive Care (MIMIC-III) database (Johnson et al., 2016). We focus on patients satisfying Sepsis-3 criteria (16 804 patients in total). For each patient, we have a 48-dimensional set of physiological parameters including demographics, lab values, vital signs and input/output events, where covariates are partially incomplete. Our outcomes correspond to the odds of mortality, while we binarise medical interventions according to whether or not a vasopressor is administered. The data set is split 60/20/20% into training, validation and test sets. We train our model with 6, 4-dimensional Gaussian mixture components and analyse the information curves and cluster compositions respectively.
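The data preparation described above can be sketched as follows. The split proportions and the binary treatment come from the text; the shuffling seed, the function names and the rule that any positive vasopressor dose counts as treated are assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_60_20_20(
    items: Sequence[T], seed: int = 0
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle and split into training, validation and test sets (60/20/20%)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding in the split sizes.
    n_train = len(shuffled) * 60 // 100
    n_val = len(shuffled) * 20 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


def binarise_vasopressor(dose: float) -> int:
    """Treatment indicator: 1 if any vasopressor was given (assumed threshold)."""
    return 1 if dose > 0 else 0
```

Fixing the seed makes the split reproducible, so that the same patients stay in the held-out test set across runs.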

Figure 3: Subfigures (a) and (b) illustrate the information curves $I(Z;T)$ and $I(Z;Y)$ for the task of managing sepsis. We perform a sufficient reduction of the covariates to 6 dimensions and are able to approximate the ACE on this basis.

The information curves for $I(Z;T)$ and $I(Z;Y)$ are shown in Figures 3(a) and 3(b) respectively. We observe that we can perform a sufficient reduction of the high-dimensional covariate information to between 4 and 6 dimensions while achieving high predictive accuracy of outcomes $Y$. Since there is no ground truth available for the sepsis task, we do not have access to the true confounding variables. However, we can perform an analysis on the basis of the clusters obtained over the latent space. Here, we see that we can characterise the patients in each cluster according to their initial SOFA (Sequential Organ Failure Assessment) scores. SOFA scores range from 1 to 4 and are used to track a patient’s stay in hospital. In Figure 4, we observe clear differences in cluster composition relative to the SOFA scores. Clusters 2, 5 and 6 tend to have higher proportions of patients with lower SOFA scores, while Clusters 3 and 4 have larger proportions of patients with higher SOFA scores. This result suggests that a patient’s initial SOFA score is potentially a confounder when determining how to administer subsequent treatments and predicting their odds of in-hospital mortality. This is consistent with medical studies such as Medam et al. (2017) and Studnek et al. (2012), where the authors indicate that high initial SOFA scores are likely to affect patients’ overall chances of survival and the treatments administered in hospital. Overall, performing such analyses for tasks like sepsis may help correct for confounding and assist in establishing potential guidelines.

(a) Cluster 1
(b) Cluster 2
(c) Cluster 3
(d) Cluster 4
(e) Cluster 5
(f) Cluster 6
Figure 4: Proportion of initial SOFA scores in each cluster. The variation in initial SOFA scores across clusters suggests that it is a potential confounder of odds of mortality when managing and treating sepsis.
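A cluster-composition analysis of this kind amounts to counting, within each cluster, how often each score occurs. The sketch below assumes a hard assignment of patients to clusters and one discrete score per patient; the function name and data layout are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Sequence


def cluster_composition(
    cluster_ids: Sequence[Hashable],
    scores: Sequence[Hashable],
) -> Dict[Hashable, Dict[Hashable, float]]:
    """Proportion of each score value within each cluster."""
    if len(cluster_ids) != len(scores):
        raise ValueError("need exactly one score per cluster assignment")
    counts: Dict[Hashable, Counter] = defaultdict(Counter)
    for cluster, score in zip(cluster_ids, scores):
        counts[cluster][score] += 1
    return {
        cluster: {
            score: n / sum(counter.values()) for score, n in counter.items()
        }
        for cluster, counter in counts.items()
    }
```

Comparing the resulting proportions across clusters is what reveals whether a variable such as the initial SOFA score is unevenly distributed over the latent space.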

4 Discussion

CEIB makes state-of-the-art predictions of the ACE that are robust against confounding.

CEIB learns a low-dimensional, interpretable representation of latent confounding.

Since CEIB extracts only the information that is relevant for making predictions, it is able to learn a low-dimensional representation of the confounding effect and uses this to make predictions. In particular, the introduction of a discrete cluster structure in the latent space allows an easier interpretation of the confounding effect. Similar methods such as Louizos et al. (2017) typically use a higher-dimensional representation to account for these effects without gains in performance. This is likely a result of misrepresenting the true confounding effect. Modelling the task as an IB alleviates this problem. For sepsis, we identify a latent space of 6 dimensions when predicting odds of mortality, where clusters exhibit a distinct structure with respect to a patient’s initial SOFA score.

CEIB enables estimating the causal effect with incomplete covariates.

Unlike previous approaches, CEIB can deal with incomplete covariate data at test time by introducing a discrete latent space. Specifically, we learn equivalence classes among patients such that the approximate effects of treatments can be computed where data is incomplete.

References


  • Alaa and van der Schaar (2017) Ahmed M. Alaa and Mihaela van der Schaar. Bayesian inference of individualized treatment effects using multi-task gaussian processes. CoRR, abs/1704.02801, 2017.
  • Almond et al. (2005) Douglas Almond, Kenneth Y Chay, and David S Lee. The costs of low birth weight. The Quarterly Journal of Economics, 120(3):1031–1083, 2005.
  • Dawid (2007) Philip Dawid. Fundamentals of statistical causality. Technical report, Department of Statistical Science, University College London, 2007.
  • Greenland and Lash (2008) Sander Greenland and Timothy Lash. Bias analysis. Modern Epidemiology, pages 345 – 380, 2008.
  • Hernán and Robins (2006) Miguel A Hernán and James M Robins. Estimating causal effects from epidemiological data. Journal of Epidemiology & Community Health, 60(7):578–586, 2006.
  • Hill (2011) Jennifer L. Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240, 2011.
  • Johansson et al. (2016) Fredrik D. Johansson, Uri Shalit, and David Sontag. Learning representations for counterfactual inference. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48, ICML’16, pages 3020–3029, 2016.
  • Johnson et al. (2016) Alistair EW Johnson, Tom J Pollard, Lu Shen, H Lehman Li-wei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. Mimic-iii, a freely accessible critical care database. Scientific data, 3:160035, 2016.
  • Kuroki and Pearl (2014) Manabu Kuroki and Judea Pearl. Measurement bias and effect restoration in causal inference. Biometrika, 101(2):423–437, 2014.
  • Louizos et al. (2017) Christos Louizos, Uri Shalit, Joris M Mooij, David Sontag, Richard Zemel, and Max Welling. Causal effect inference with deep latent-variable models. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6446–6456. Curran Associates, Inc., 2017.
  • McCormick et al. (2013) Marie C. McCormick, Jeanne Brooks-Gunn, and Stephen L. Buka. Infant health and development program, phase iv, 2001-2004 united states. 2013. doi: 10.3886/ICPSR23580.v2.
  • Medam et al. (2017) Sophie Medam, Laurent Zieleskiewicz, Gary Duclos, Karine Baumstarck, Anderson Loundou, Julie Alingrin, Emmanuelle Hammad, Coralie Vigne, François Antonini, and Marc Leone. Medicine, 96(50), 12 2017. doi: 10.1097/MD.0000000000009241.
  • Pearl (2009) Judea Pearl. Causality. Cambridge university press, 2009.
  • Pearl (2012) Judea Pearl. On measurement bias in causal inference. arXiv preprint arXiv:1203.3504, 2012.
  • Rosenbaum and Rubin (1984) Paul R Rosenbaum and Donald B Rubin. Reducing bias in observational studies using subclassification on the propensity score. Journal of the American statistical Association, 79(387):516–524, 1984.
  • Shalit et al. (2017) Uri Shalit, Fredrik D. Johansson, and David Sontag. Estimating individual treatment effect: generalization bounds and algorithms. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3076–3085, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
  • Studnek et al. (2012) Jonathan R Studnek, Melanie R Artho, Craymon L Garner Jr, and Alan E Jones. The impact of emergency medical services on the ed care of severe sepsis. The American journal of emergency medicine, 30(1):51–56, 2012.
  • Sun et al. (2015) Wei Sun, Pengyuan Wang, Dawei Yin, Jian Yang, and Yi Chang. Causal inference via sparse additive models with application to online advertising. In AAAI, pages 297–303, 2015.
  • Tishby et al. (2000) Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.
  • Wager and Athey (2017) Stefan Wager and Susan Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 2017.

Appendix A Additional Experiments

A.1 Infant Health and Development Program

The Infant Health and Development Program (IHDP) (McCormick et al., 2013; Hill, 2011) is a randomised controlled experiment assessing the impact of educational intervention on outcomes of premature, low birth weight infants born in 1984-1985. Measurements from children and their mothers were collected to study the effects of childcare and home visits from a trained specialist on test scores. Briefly, the study contains information about the children and their mothers or caregivers. Data on the children include treatment group, sex, birth weight and health indices. Information about the mothers includes maternal age, race and educational achievement. Hill (2011) extracts features and treatment assignments from the real-world clinical trial, and introduces selection bias to the data artificially by removing a non-random portion of the treatment group, in particular children with non-white mothers. In total, the data set consists of 747 subjects (139 treated, 608 control), each represented by 25 covariates measuring properties of the child and their mother. The data set is split 60/20/20% into training, validation and test sets.

For our experiments, we compare the performance of CEIB for predicting the ACE against several existing baselines as in Louizos et al. (2017): OLS-1 is a least squares regression; OLS-2 uses two separate least squares regressions to fit the treatment and control groups respectively; TARnet is a feedforward neural network from Shalit et al. (2017); KNN is a $k$-nearest neighbours regression; RF is a random forest; BNN is a balancing neural network (Johansson et al., 2016); BLR is a balancing linear regression (Johansson et al., 2016); and CFRW is a counterfactual regression that uses the Wasserstein distance (Shalit et al., 2017).

Table 1: Within-sample and out-of-sample mean and standard errors for the metrics across models on the IHDP data set. A smaller value indicates better performance. Bold values indicate the method with the best performance.

We train our model with , -dimensional Gaussian mixture components, although our method can be applied without loss of generality to any number of dimensions. To assess the ability to estimate treatment effects on the basis of partial information, we artificially exclude three covariates at test time. These covariates exhibit a moderate correlation with the hidden confounder, ethnicity. The results are shown in Table 1. Overall, our approach exhibits good performance for both within-sample and out-of-sample predictions, while simultaneously accounting for partial covariate information.

To assess the interpretability of the proposed approach and the ability to account for hidden confounding, we perform an analysis on the latent space of our model. First, we plot two information curves illustrating the number of latent dimensions required to reconstruct the output for the terms $I(Z;T)$ and $I(Z;Y)$ respectively. These results are shown in Figure 5(a) and Figure 5(b). In particular, we perform this analysis when the data set of subjects is both de-randomised and randomised (i.e. when we do not introduce selection bias into the data set). Comparing the information curves in Figure 5(a) confirms that when we do not de-randomise the data, the information content in the treatment tends to be closer to 0, whereas the opposite is true when the data is de-randomised. The information curves in Figure 5(b) additionally demonstrate our model’s ability to account for indirect effects of confounding when predicting the overall outcomes: when data is de-randomised, we are able to reconstruct treatment outcomes more accurately. Overall, the results from Figures 5(a) and 5(b) highlight that there is indeed a hidden confounding effect that we can account for using the proposed approach.

Figure 5: Information curves for (a) $I(Z;T)$ and (b) $I(Z;Y)$ with de-randomised and randomised data respectively. When the data is randomised, the value of $I(Z;T)$ is close to zero. The differences between the curves illustrate confounding. When data is de-randomised, we are able to estimate treatment effects more accurately by accounting for this confounding.

Next, we perform an analysis of the discretised latent space by comparing the proportions of ethnic groups of test subjects in each cluster from the Gaussian mixture to see if we can recover the hidden confounding effect. These results are shown in Figure 6, where we plot a hard assignment of test subjects to clusters on the basis of their ethnicity. Evidently, the clusters exhibit a clear structure with respect to the ethnic groups. In particular, Cluster 2 in Figure 6(b) has a significantly higher proportion of non-white members in the de-randomised setting, confirming that we are able to correctly identify the true confounding effect and account for this when making predictions. Finally, we perform similar analyses and assess the error in estimating the ACE when varying the number of mixture components in Figure 7. When the number of clusters is larger, the clusters get smaller and it becomes more difficult to reliably estimate the ACE, since we average over the cluster members to account for partial covariate information at test time. Here, model selection is made by observing where the error in estimating the ACE stabilises (anywhere between 4 and 7 mixture components).

(a) Cluster 1
(b) Cluster 2
(c) Cluster 3
(d) Cluster 4
Figure 6: Illustration of the proportion of major ethnic groups within the four clusters. Grey and orange indicate de-randomised and randomised data respectively. The first cluster in (a) is a neutral cluster. The second cluster in (b) shows an enrichment of information in the African-American group. Clusters 3 and 4 in (c) and (d) respectively, show an enrichment of information in the White group. Overall, we are able to identify the hidden confounder correctly and account for this when predicting outcomes. For better visualisation, we only report the two main clusters which include the majority of all patients.
Figure 7: Out-of-sample error in ACE with a varying number of clusters.

A.2 Binary Treatment Outcome on Twins

Like Louizos et al. (2017), we apply CEIB to a benchmark task using the birth data of twins in the USA between 1989 and 1991 (Almond et al., 2005). Here, treatment is a binary indicator of being the heavier twin at birth, while the outcome corresponds to mortality within a year after birth. Since mortality is rare, we consider only same-sex twins with weights less than 2 kg, which results in 11 984 pairs of twins. Each twin has a set of 46 covariates including information about their parents such as their level of education, race, incidence of renal disease, diabetes, smoking etc., as well as whether the birth took place in hospital or at home and the number of gestation weeks prior to birth.

Figure 8: Absolute error in ACE estimation for Twins task. CEIB outperforms baselines over varying levels of proxy noise.

To simulate an observational study, we selectively hide one of the twins. To illustrate the ability of CEIB to be applied to Case II, where we treat proxy variables as measured confounders, we base the treatment assignment on a single variable which is highly correlated with the outcome: GESTAT10, the number of gestation weeks prior to birth. This takes values from 0 to 9 that correspond to bins of gestation weeks before birth, i.e. birth before 20 weeks of gestation, 20-27 weeks of gestation, etc. Analogous to Louizos et al. (2017), we set treatment to $t_i \mid x_i, z_i \sim \mathrm{Bern}\left(\sigma\left(w_o^\top x_i + w_h (z_i/10 - 0.1)\right)\right)$ for $w_o \sim \mathcal{N}(0, 0.1 \cdot I)$ and $w_h \sim \mathcal{N}(5, 0.1)$, where $z$ is GESTAT10 and $x$ are the 45 remaining covariates. Since CEIB can account for incomplete covariates, we artificially exclude 3 covariates from $x$ at test time.

As in Louizos et al. (2017), proxies are created from a one-hot encoding of the GESTAT10 variable $z$, replicated 3 times, with each of the resulting 30 bits flipped at random; the flipping probability varies from 0.05 to 0.15. There may also be additional proxy variables for $z$ among the remaining covariates $x$ in the data. Our task is to predict the ACE. Specifically, we compare the performance of CEIB to CEVAE (with a varying number of hidden layers), TARnet (with a varying number of hidden layers) and logistic regression (LR). These results are shown in Figure 8. Here too, CEIB achieves close to state-of-the-art performance on the Twins task.
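The proxy construction can be sketched as follows, assuming that the ten GESTAT10 categories are coded as integers 0 to 9 and that every bit is flipped independently with the same probability. The function name and the use of Python's `random` module are choices made for this illustration.

```python
import random
from typing import List


def noisy_proxies(
    category: int,
    flip_prob: float,
    rng: random.Random,
    n_categories: int = 10,
    n_replicas: int = 3,
) -> List[int]:
    """One-hot encode a category, replicate it, then flip bits at random."""
    if not 0 <= category < n_categories:
        raise ValueError("category must lie in [0, n_categories)")
    if not 0.0 <= flip_prob <= 1.0:
        raise ValueError("flip_prob must be a probability")
    one_hot = [1 if k == category else 0 for k in range(n_categories)]
    bits = one_hot * n_replicas  # 3 copies of 10 bits give 30 bits by default
    # Flip each bit independently with probability flip_prob.
    return [1 - b if rng.random() < flip_prob else b for b in bits]


# Example: proxies for category 4 at the lowest noise level mentioned in the text.
example_proxies = noisy_proxies(4, flip_prob=0.05, rng=random.Random(0))
```

Raising `flip_prob` makes the proxies less informative about the underlying category, which is how the varying levels of proxy noise are obtained.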
