Cause-Effect Deep Information Bottleneck for Systematically Missing Covariates
Abstract
Estimating the causal effects of an intervention from high-dimensional observational data is difficult due to the presence of confounding. The task is often further complicated by systematic missingness in the data at test time. Our approach uses the information bottleneck to perform a low-dimensional compression of covariates by explicitly considering the relevance of information. Based on the sufficiently reduced covariate, we transfer the relevant information to cases where data is missing at test time, allowing us to reliably and accurately estimate the effects of an intervention even where data is incomplete. Our results on causal inference benchmarks and a real application for treating sepsis show that our method achieves state-of-the-art performance without sacrificing interpretability.
1 Introduction
Understanding the causal effects of an intervention is a key question in many applications, such as healthcare (e.g. [34, 1]). The problem is especially complicated when we have a complete, high-dimensional set of observational measurements for a group of patients, but an incomplete set of measurements for a potentially larger group of patients for whom we would like to infer treatment effects at test time. For instance, a doctor treating patients with HIV may readily have access to routine measurements such as blood count data for all their patients, but only have the genotype information for some patients at a future test time, as a result of the medical costs associated with genotyping, or resource limitations.
A naive strategy to address this problem would be to remove all features that are missing at test time and infer treatment effects on the basis of the reduced feature space. Alternatively, one may attempt to impute the incomplete dimensions for the same purpose. Both of these solutions, however, fail in high-dimensional settings, particularly if the missingness is systematic, as in this case, or if many dimensions are missing. Other approaches account for incomplete data during training, for instance by assuming hidden confounding. These methods typically try to build a joint model on the basis of noisy representatives of confounders (see for instance [7, 22, 16, 17]). However, in high-dimensional settings, it is unclear what these representatives might be, and whether our data meets such assumptions. Regardless of these assumptions, none of these approaches addresses systematic missingness at test time.
A more natural approach would be to assume one could measure everything that is relevant for estimating treatment effects for a subset of the patients, and attempt to transfer this distribution of information to a potentially larger set of test patients. However, this is a challenging task given the high dimensionality of the data that we must condition on. Here, we propose tackling this question from the decision-theoretic perspective of causal inference. The overall idea is to use the Information Bottleneck (IB) criterion [32, 2] to perform a sufficient reduction of the covariates, or learn a minimal sufficient statistic, for inferring treatment outcomes. Unlike traditional dimensionality reduction techniques, the IB is expressed entirely in terms of information-theoretic quantities and is thus particularly appealing in this context, since it allows us to retain only the information that is relevant for inferring treatment outcomes. Specifically, by conditioning on this reduced covariate, the IB enables us to build a discrete reference class over patients with complete data, to which we can map patients with incomplete covariates at test time, and subsequently estimate treatment effects on the basis of these groups.
Our contributions may thus be summarised as follows: We learn a discrete, low-dimensional, interpretable latent space representation of confounding. This representation allows us to learn equivalence classes among patients, such that the specific causal effect for a patient can be approximated by the specific causal effect of their subgroup. We subsequently transfer this information to a set of test cases with incomplete measurements, such that we can estimate the causal effect. Finally, we demonstrate that our method outperforms existing approaches on established causal inference benchmarks and a real-world application for treating sepsis.
2 Preliminaries and Related Work
Potential Outcomes and Counterfactual Reasoning
Counterfactual reasoning (CR) has drawn large attention, particularly in the medical community. Such models are formalised in terms of potential outcomes [29, 30, 25]. Assume we have two choices: taking a treatment $t$, and not taking a treatment (control) $c$. Let $Y_t$ denote the outcomes under $t$ and $Y_c$ denote the outcomes under the control $c$. The counterfactual approach assumes that there is a pre-existing joint distribution over outcomes. This joint distribution is hidden since $t$ and $c$ cannot be applied simultaneously. Applying an action $t$ thus only reveals $Y_t$, but not $Y_c$. In this setting, computing the effect of an intervention involves computing the difference between $Y_t$ when an intervention is made and $Y_c$ when no treatment is applied [21, 20]. We would subsequently choose to treat with $t$ if,
$$\mathbb{E}[L(Y_t)] \le \mathbb{E}[L(Y_c)], \qquad (1)$$
for a loss function $L$ over $Y_t$ and $Y_c$ respectively. Potential outcomes are typically applied to cross-sectional data [26, 27] and sequential time settings. Notable examples of models for counterfactual reasoning include [10] and [3]. Specifically, [10] propose a neural network architecture called TARnet to estimate the effects of interventions. Similarly, Gaussian Process CR (GPCR) models are proposed in [26, 27] and further extended to the multi-task setting in [1]. Approaches that address missing data within the potential outcomes framework include [4] and [12]. The former adapts propensity score estimation for this purpose; the latter uses low-rank matrix factorisation to deduce a set of confounders and compute treatment effects. Unlike all of these methods, we make use of the IB criterion to learn treatment effects and adjust for confounding.
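The decision rule in Equation 1 can be illustrated with a toy numerical sketch, assuming hypothetical outcome samples and a squared loss around an invented clinical target value of 5.0 (none of these numbers come from the paper):

```python
import numpy as np

# Hypothetical samples of potential outcomes under treatment t and control c,
# e.g. a biomarker we would like to be close to a target of 5.0 (all invented).
rng = np.random.default_rng(0)
y_t = rng.normal(5.2, 1.0, size=10_000)   # outcomes under treatment t
y_c = rng.normal(3.0, 1.0, size=10_000)   # outcomes under control c

loss = lambda y: (y - 5.0) ** 2           # squared loss around the target

# Choose the action with the smaller expected loss, as in Equation 1
expected_loss_t = loss(y_t).mean()
expected_loss_c = loss(y_c).mean()
choose_treatment = expected_loss_t < expected_loss_c
print(expected_loss_t, expected_loss_c, choose_treatment)
```

Here the treated outcomes sit much closer to the target, so the expected loss under $t$ is smaller and the rule selects the treatment.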
DecisionTheoretic View of Causal Inference
The decision-theoretic approach to causal inference focuses on studying the effects of causes rather than the causes of effects [6]. Here, the key question is: what is the effect of the causal action $t$ on the outcome $Y$? The outcome $Y$ may be modelled as a random variable for which we can set up a decision problem. That is, at each point, the value of $Y$ is dependent on whether $t$ or $c$ is selected. The decision-theoretic view of causal inference considers the distributions of outcomes given the treatment or control, $P_t$ and $P_c$, and explicitly computes an expected loss of $Y$ with respect to each action choice. Finally, the choice to treat with $t$ is made using Bayesian decision theory if,
$$\mathbb{E}_{Y \sim P_t}[L(Y)] \le \mathbb{E}_{Y \sim P_c}[L(Y)]. \qquad (2)$$
Thus in this setting, causal inference involves comparing the expected losses over the hypothetical distributions $P_t$ and $P_c$ for outcome $Y$.
Information Bottleneck
The classical IB method [32] describes an information-theoretic approach to compressing a random variable $X$ with respect to a second random variable $Y$. The compression of $X$ may be described by another random variable $Z$. Achieving an optimal compression requires solving the problem,
$$\min_{p(z \mid x)} \; I(X; Z) - \lambda I(Z; Y), \qquad (3)$$
under the assumption that $Y$ and $Z$ are conditionally independent given $X$. That is, the classical IB method assumes that the variables satisfy the Markov relation $Z - X - Y$. $I(\cdot\,;\cdot)$ in Eqn. 3 represents the mutual information between two random variables, and $\lambda$ controls the degree of compression. In its classical form, the IB principle is defined only for discrete random variables. However, in recent years multiple IB relaxations and extensions, such as for Gaussian [5] and meta-Gaussian variables [23], have been proposed. Among these extensions is the latent variable formulation of the IB method (e.g. [2, 35]). Here, one assumes structural equations of the form,
$$Z = f_\theta(X) + \xi, \qquad \xi \sim \mathcal{N}(0, \Sigma_\xi), \qquad (4)$$
$$Y = g_\phi(Z) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \Sigma_\varepsilon). \qquad (5)$$
These equations give rise to a different conditional independence relation, $X \perp Y \mid Z$. While both independences cannot hold in the same graph, in the limiting case where the noise term $\xi \to 0$, $Z$ becomes a deterministic function of $X$ and the Markov relation $Z - X - Y$ is recovered. In what follows, we assume the latent variable formulation of the IB.
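In practice, both mutual-information terms in Eqn. 3 are intractable and are replaced by variational bounds, as in the deep variational IB [2]. A standard sketch of these bounds, assuming a variational marginal $q(z)$ and a variational decoder $q(y \mid z)$:

```latex
I(X;Z) \;=\; \mathbb{E}_{p(x)}\!\left[\mathrm{KL}\big(p(z \mid x)\,\|\,p(z)\big)\right]
       \;\le\; \mathbb{E}_{p(x)}\!\left[\mathrm{KL}\big(p(z \mid x)\,\|\,q(z)\big)\right],
\qquad
I(Z;Y) \;=\; H(Y) - H(Y \mid Z)
       \;\ge\; H(Y) + \mathbb{E}_{p(x,y)}\,\mathbb{E}_{p(z \mid x)}\!\left[\log q(y \mid z)\right].
```

The upper bound on $I(X;Z)$ and lower bound on $I(Z;Y)$ together give a tractable surrogate objective; the same structure reappears in the encoder and decoder terms of our method below.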
Deep Latent Variable Models
Deep latent variable models have recently received remarkable attention and have been applied to a variety of problems. Among these, variational autoencoders (VAEs) employ the reparameterisation trick introduced in [13, 24] to infer a variational approximation of the posterior distribution over the latent space $Z$. Important works in this direction include [15] and [9]. Most closely related to the work we present here is the application of VAEs in a healthcare setting by [17]. Here, the authors introduce a Cause-Effect VAE (CEVAE) to estimate the causal effect of an intervention in the presence of noisy proxies. Despite their differences, it has been shown that there are several close connections between the VAE framework and the previously described latent variable formulation of the IB principle: the latter is essentially a VAE where $X$ is replaced by $Y$ in the decoder, with $\lambda$ weighting the compression term. In contrast, the approach in this paper considers the IB principle to perform causal inference in scenarios where our data has a systematic missingness at test time.
3 Method
In this section, we present an approach based on the IB principle for estimating the causal effects of an intervention with incomplete covariate information. We refer to this model as a Cause-Effect Information Bottleneck (CEIB). In recent years, there has been growing interest in the connections between the IB principle and deep neural networks [33, 2, 35]. Here, we combine the nonlinear expressiveness of neural networks with the IB criterion to learn a sufficiently reduced representation of confounding, based on which we can approximate the effects of an intervention more effectively. Specifically, we interpret our model from the decision-theoretic view [6] of causal inference.
Problem Formulation
Like other approaches in the decision-theoretic setting, our goal is to estimate the Average Causal Effect (ACE) of $T$ on $Y$. If we let the regime indicator $F_T = t$ or $F_T = c$ define the interventional regimes, and $F_T = \emptyset$ the observational regime, the ACE is given by,
$$\mathrm{ACE} := \mathbb{E}(Y \mid F_T = t) - \mathbb{E}(Y \mid F_T = c). \qquad (6)$$
The ACE in Equation 6 is defined in terms of the interventional regimes; in practice, however, we can only collect data in the observational regime $F_T = \emptyset$. The observational counterpart of the ACE may formally be defined as:
$$\mathrm{ACE}_\emptyset := \mathbb{E}(Y \mid T = t, F_T = \emptyset) - \mathbb{E}(Y \mid T = c, F_T = \emptyset). \qquad (7)$$
In general, the ACEs in Equations 6 and 7 are not equal unless we assume ignorable treatment assignments, expressed by the conditional independence assumption $Y \perp F_T \mid T$. This assumption states that the distribution of $Y$ given $T$ is the same in the interventional and observational regimes. In the presence of confounding, the treatment assignment may only be ignored when estimating treatment effects, provided we have a sufficient covariate $Z$ [6]. That is, $Z$ is a sufficient covariate for the effect of $T$ on outcome $Y$ if $Z \perp F_T$ and $Y \perp F_T \mid (Z, T)$. In this case, it can be shown via Pearl's back-door criterion [21] that the ACE may be defined in terms of the Specific Causal Effect (SCE),
$$\mathrm{ACE} = \mathbb{E}_{Z \mid F_T = \emptyset}\big[\mathrm{SCE}(Z)\big], \qquad (8)$$
where
$$\mathrm{SCE}(z) := \mathbb{E}(Y \mid T = t, Z = z, F_T = \emptyset) - \mathbb{E}(Y \mid T = c, Z = z, F_T = \emptyset). \qquad (9)$$
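As a toy numerical illustration of Equations 8 and 9 with a discrete sufficient covariate, consider the following sketch (the data-generating process below is synthetic and invented purely for illustration; the true effect is 1.0 by construction):

```python
import numpy as np

# Synthetic observational data: a discrete sufficient covariate z (cluster id),
# binary treatment t and outcome y. Treatment assignment is confounded by z.
rng = np.random.default_rng(0)
z = rng.integers(0, 3, size=5000)                      # 3 equivalence classes
t = rng.binomial(1, 0.3 + 0.2 * (z == 2))              # confounded assignment
y = 1.0 * t + 0.5 * z + rng.normal(0, 0.1, size=5000)  # true effect is 1.0

# Specific causal effect per stratum: E[Y | t=1, z] - E[Y | t=0, z]  (Eq. 9)
sce = {k: y[(z == k) & (t == 1)].mean() - y[(z == k) & (t == 0)].mean()
       for k in np.unique(z)}

# ACE = E_z[SCE(z)], weighting each stratum by its marginal probability (Eq. 8)
ace = sum(sce[k] * np.mean(z == k) for k in sce)
print(ace)  # recovers the true effect of 1.0 up to sampling noise
```

Averaging the per-stratum effects weighted by the stratum probabilities removes the confounding induced by the assignment mechanism, which a naive comparison of treated and untreated means would not.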
We employ the following assumptions and notation. Let $X$ denote a set of patient covariates on the basis of which we would like to estimate treatment effects. During training, we assume that all covariates can be observed, as in a medical study, where the covariate dimension $D$ is large. Outside the study at test time, however, we assume some covariates are not observed, e.g. due to the expensive data acquisition process. That is, we assume the same feature dimensions are missing for all patients at testing. Let $Y$ denote the outcomes following treatments $T$. For simplicity and ease of comparison with prior methods on existing benchmarks, we consider treatments that are binary, but our method is applicable to any general $T$. We assume strong ignorability, i.e. that all confounders are measured for a set of patients. The causal graph of our model is shown in Figure 0(a). Importantly, estimating the ACE in this case only requires computing a distribution $p(y \mid z, t)$, provided $Z$ is a sufficient covariate. In what follows, we use the IB to learn a sufficient covariate $Z$ that allows us to approximate the distribution in Figure 0(a).
Performing a Sufficient Reduction of the Covariate
We propose modelling this task with an extended formulation of the information bottleneck, using the architecture proposed in Figure 0(b). The IB approach allows us to learn a low-dimensional, interpretable compression of the relevant information during training, which we can use to infer treatment effects where covariate information is incomplete at test time.
We adapt the IB to learning the outcome of a therapy when only incomplete covariate information is available at test time. To do so, we consider the following extended parametric form of the IB,
$$\max_{\theta, \phi, \psi, \zeta} \; -\lambda \big( I_\theta(Z_t; X) + I_\phi(Z_y; X) \big) + I_\psi(Z_t; T) + I_\zeta(Z; Y), \qquad (10)$$
where $Z_t$ and $Z_y$ are low-dimensional discrete representations of the covariate data, $Z$ is a concatenation of $Z_t$ and $Z_y$, and $I_\theta$, $I_\phi$, $I_\psi$, $I_\zeta$ represent the mutual information terms parameterised by networks $\theta$, $\phi$, $\psi$ and $\zeta$ respectively. We assume a parametric form of the conditionals $p_\theta(z_t \mid x)$, $p_\phi(z_y \mid x)$, $p_\psi(t \mid z_t)$ and $p_\zeta(y \mid z, t)$. The first two terms in Equation 10 for our encoder model have the following forms:
$$I_\theta(Z_t; X) = \mathbb{E}_{p(x)}\big[\mathrm{KL}\big(p_\theta(z_t \mid x) \,\|\, p(z_t)\big)\big], \qquad (11)$$
$$I_\phi(Z_y; X) = \mathbb{E}_{p(x)}\big[\mathrm{KL}\big(p_\phi(z_y \mid x) \,\|\, p(z_y)\big)\big]. \qquad (12)$$
The decoder model in Equation 10 can analogously be expressed as:
$$I_\zeta(Z; Y) = \mathbb{E}_{p(x, y)}\, \mathbb{E}_{p(z \mid x)}\big[\log p_\zeta(y \mid z, t)\big] + H(Y), \qquad (13)$$
where $H(Y)$ is the entropy of $Y$. For the decoder model, we use an architecture similar to the TARnet [10], where we replace conditioning on high-dimensional covariates $X$ with conditioning on the reduced covariate $Z$. We can thus formulate the conditionals as,
$$p_\psi(t \mid z_t) = \mathrm{Bern}\big(\sigma(f_\psi(z_t))\big), \qquad p_\zeta(y \mid z, t) = \mathcal{N}\big(\mu_t(z), \sigma_y^2\big), \qquad (14)$$
with logistic function $\sigma(\cdot)$, and outcome $y$ given by a Gaussian distribution parameterised with a TARnet with mean $\mu_t(z)$. Note that the terms $f_\psi$ and $\mu_t$ correspond to neural networks. While the distribution $p_\psi(t \mid z_t)$ is included to ensure the joint distribution over treatments, outcomes and covariates is identifiable, in practice our goal is to approximate the effects of a given $t$ on $y$. Hence, we train our model in a teacher-forcing fashion by using the true treatment assignments from the data, and fix the treatment assignments at test time. Unlike other approaches for inferring treatment effects, the Lagrange parameter $\lambda$ in the IB formulation in Equation 10 allows us to adjust the degree of compression, which, in this context, enables us to learn a sufficient statistic. In particular, adjusting $\lambda$ enables us to explore a range of such representations, from a completely insufficient covariate to a completely sufficient compression of confounding.
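The structure of these two decoder heads can be sketched in plain numpy. This is an illustrative forward pass only, not the trained CEIB model: the weights, dimensions and tanh nonlinearity below are arbitrary choices, and the TARnet-style design is reduced to a shared representation with one outcome head per treatment arm.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Illustrative (untrained) weights for a logistic treatment head p(t | z_t)
# and a TARnet-style pair of Gaussian outcome heads on z, one per arm.
rng = np.random.default_rng(0)
d_zt, d_z, h = 4, 8, 16
W_t = rng.normal(size=(d_zt,))                    # treatment-head weights
W_shared = rng.normal(size=(d_z, h)) / np.sqrt(d_z)
W_y0, W_y1 = rng.normal(size=(h,)), rng.normal(size=(h,))

def p_t_given_zt(z_t):
    """Bernoulli parameter of the treatment head."""
    return sigmoid(z_t @ W_t)

def mu_y(z, t):
    """Mean of the Gaussian outcome head for treatment arm t (0 or 1)."""
    phi = np.tanh(z @ W_shared)                   # shared representation
    return phi @ (W_y1 if t == 1 else W_y0)       # arm-specific head

z_t = rng.normal(size=(d_zt,))
z = rng.normal(size=(d_z,))
print(p_t_given_zt(z_t), mu_y(z, 1) - mu_y(z, 0))
```

The difference between the two outcome heads, evaluated at the same $z$, is the per-example treatment effect estimate that the SCE aggregates over each equivalence class.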
Learning Equivalence Classes and Distribution Transfer
Since $Z_t$ and $Z_y$ are discrete latent representations of the covariate information, we make use of the Gumbel-softmax reparameterisation trick [9] to draw samples from a categorical distribution with probabilities $\pi_1, \dots, \pi_k$. Here,
$$z = \mathrm{one\_hot}\Big(\arg\max_i \big[g_i + \log \pi_i\big]\Big), \qquad (15)$$
where $g_i$ are samples drawn from Gumbel(0, 1). The softmax function is used to approximate the $\arg\max$ in Equation 15 and generate $k$-dimensional sample vectors $\tilde{z}$, where
$$\tilde{z}_i = \frac{\exp\big((\log \pi_i + g_i) / \tau\big)}{\sum_{j=1}^{k} \exp\big((\log \pi_j + g_j) / \tau\big)}, \qquad i = 1, \dots, k, \qquad (16)$$
and $\tau$ is the softmax temperature parameter. By using the Gumbel-softmax reparameterisation trick to obtain a discrete representation of the relevant information, we can learn equivalence classes among patients, based on which we can compute the SCE for each group using sufficient covariate $Z$ via Equation 9. Specifically, during training, $Z_t$ and $Z_y$ are used to learn cluster assignment probabilities for each data point. At test time, we subsequently assign an example with missing covariates to the relevant equivalence class. Computing the SCE potentially allows us to tailor treatments to specific groups based on $Z$ rather than an entire population, an important aspect in healthcare, where patients are typically heterogeneous. Based on the SCE, we can also compute the population-level effects of an intervention via the ACE from Equation 8. Without the latent compression via CEIB and the discrete representation of relevant information, it would not be possible to transfer knowledge from examples with complete information to cases with incomplete information, since estimating treatment effects would require integrating over all covariates, an infeasible task in high dimensions.
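The Gumbel-softmax sampling of Equations 15 and 16 takes only a few lines of numpy; the class probabilities below are invented for illustration:

```python
import numpy as np

def gumbel_softmax(pi, tau, rng):
    """Draw one relaxed categorical sample from class probabilities pi
    (Eq. 16): softmax((log pi + g) / tau) with g ~ Gumbel(0, 1)."""
    g = -np.log(-np.log(rng.uniform(size=pi.shape)))   # Gumbel(0,1) samples
    logits = (np.log(pi) + g) / tau
    e = np.exp(logits - logits.max())                  # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
pi = np.array([0.7, 0.2, 0.1])                 # illustrative cluster probabilities
soft = gumbel_softmax(pi, tau=0.5, rng=rng)    # lower tau gives more peaked samples
hard = np.argmax(soft)                         # hard cluster assignment (Eq. 15)
print(soft, hard)
```

The argmax of the relaxed sample coincides with the Gumbel-max draw of Equation 15, so hard assignments follow the categorical distribution exactly, while the soft sample stays differentiable for training.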
4 Experiments
The goal of the experiments is to demonstrate the ability of CEIB to accurately infer treatment effects while learning a low-dimensional, interpretable representation of confounding in cases where only partial covariate information is available at test time. We report the ACE and SCE values in our experiments for this purpose. In general, the lack of ground truth in real-world data makes evaluating causal inference algorithms a difficult problem. To overcome this, in our artificial experiments we consider a semi-synthetic data set where the true outcomes and treatment assignments are known.
The Infant Health and Development Program:
The Infant Health and Development Program (IHDP) [18, 8] is a randomised control experiment assessing the impact of educational intervention on the outcomes of premature, low-birth-weight infants born in 1984-1985. Measurements from children and their mothers were collected to study the effects of childcare and home visits from a trained specialist on test scores. The data includes treatment groups, health indices, mothers' ethnicity and educational achievement. [8] extract features and treatment assignments from the real-world clinical trial, and introduce selection bias to the data artificially by removing a non-random portion of the treatment group, in particular children with non-white mothers. In total, the data set consists of 747 subjects (139 treated, 608 control), each represented by 25 covariates measuring characteristics of the child and their mother. The data set is divided 60/10/30% into training/validation/testing sets. For our setup, we use encoder and decoder architectures with 3 hidden layers. Our model is trained with the Adam optimiser [14] with a learning rate of 0.001. We compare the performance of CEIB for predicting the ACE against several existing baselines, as in [17].


Experiment 1:
In the first experiment, we compared the performance of CEIB for estimating the ACE against the baselines when using the complete set of measurements at test time. These results are shown in Table 1a. Evidently, CEIB outperforms existing approaches. To demonstrate that we can transfer the relevant information to cases where covariates are incomplete at test time, we artificially excluded covariates that have a moderate correlation with ethnicity at test time. We compute the ACE and compare this to the performance of TARnet and CFRW on the same reduced set of covariates (Table 1b). If we extend this to the extreme case of removing 8 covariates at test time, the out-of-sample error in predicting the ACE increases to 0.29 ± 0.02. Thus CEIB achieves state-of-the-art predictive performance for both in-sample and out-of-sample predictions, even with incomplete covariate information.
Experiment 2:
Building on Experiment 1, we perform an analysis of the latent space of our model to assess whether we learn a sufficiently reduced covariate. We use the IHDP data set as before, but this time consider both the data before introducing selection bias (analogous to a randomised study), as well as after introducing selection bias by removing a non-random proportion of the treatment group as before (akin to a derandomised study). We plot the information curves illustrating the number of latent dimensions required to reconstruct the output for the treatment and outcome terms respectively, for varying values of $\lambda$. These results are shown in Figures 1(a) and 1(b). Theoretically, we should be able to examine the shape of the curves to identify whether a sufficiently reduced covariate has been obtained. In particular, where a study is randomised, the sufficient covariate should have no impact on the treatment $t$. In this case, the mutual information $I(Z_t; T)$ should be approximately zero and the curve should remain flat for varying values of $\lambda$. This result is confirmed in Figure 1(a). The information curves in Figure 1(b) additionally demonstrate our model's ability to account for confounding when predicting the overall outcomes. Specifically, the point at which each of the information curves saturates is the point at which we have learnt a sufficiently reduced covariate, on the basis of which we can infer treatment effects. The curves also highlight the benefit of adjusting $\lambda$, since we obtain a task-dependent adjustment of the latent space which allows us to explore a range of solutions, from a completely insufficient covariate to a completely sufficient compression of the covariates where the information curve saturates. Overall, we are able to learn a low-dimensional representation that is consistent with the ethnicity confounder and account for its effects when predicting treatment outcomes.
We also analysed the discretised latent space by comparing the proportions of ethnic groups of test subjects in each cluster in the derandomised setting. These results are shown in Figure 3, where we plot a hard assignment of test subjects to clusters on the basis of their ethnicity. Evidently, the clusters exhibit a clear structure with respect to ethnicity. In particular, Cluster 2 in Figure 2(b) has a significantly higher proportion of non-white members in the derandomised setting. The discretisation also allows us to calculate the SCE for each cluster. In general, Cluster 2 tends to have a lower SCE than the other groups. This is consistent with how the data was derandomised, since we removed a proportion of the treated instances with non-white mothers. Conditioning on this kind of information is thus crucial for accurately assessing the impact of educational intervention on test scores. Finally, we assess the error in estimating the ACE when varying the number of mixture components in Figure 1(c). When the number of clusters is larger, the clusters get smaller and it becomes more difficult to reliably estimate the ACE, since we average over the cluster members to account for partial covariate information at test time. Here, model selection is performed by observing where the error in estimating the ACE stabilises (anywhere between 4 and 7 mixture components).
Sepsis Management:
We illustrate the performance of CEIB on the real-world task of managing and treating sepsis. Sepsis is one of the leading causes of mortality within hospitals, and treating septic patients is highly challenging, since outcomes vary with interventions and there are no universal treatment guidelines. We use data from the Multiparameter Intelligent Monitoring in Intensive Care (MIMIC-III) database [11]. We focus on patients satisfying the Sepsis-3 criteria (16,804 patients in total). For each patient, we have a 48-dimensional set of physiological parameters including demographics, lab values, vital signs and input/output events. Our outcomes correspond to the odds of mortality, while we binarise medical interventions according to whether or not a vasopressor is administered. The data set is divided 60/20/20% into training/validation/testing sets. We train our model with six 4-dimensional Gaussian mixture components and analyse the information curves and cluster compositions respectively.
The information curves for the treatment and outcome terms can be found in Figures 4(a) and 4(b) respectively. Overall, we can perform a sufficient reduction of the high-dimensional covariate information to between 4 and 6 dimensions while accurately estimating treatment outcomes. Since there is no ground truth available for the sepsis task, we do not have access to the true confounding variables. However, we analyse the clusters with respect to a patient's initial Sequential Organ Failure Assessment (SOFA) score, used to track a patient's stay in hospital (Figure 4). Clusters 2, 5 and 6 tend to have higher proportions of patients with lower SOFA scores compared to Clusters 3 and 4. The SCE values over these clusters also vary considerably, suggesting that the initial SOFA score is potentially a confounder of treatments and odds of in-hospital mortality. This is consistent with medical studies such as [19, 31], in which high initial SOFA scores tend to impact treatments and overall chances of survival. Finally, while we cannot quantify an error in estimating the ACE since we do not have access to the counterfactual outcomes, we can still compute the ACE for the sepsis management task. Here, we specifically observe a negative ACE, suggesting that treating patients with vasopressors generally reduces the chances of mortality. Performing such analyses for tasks like sepsis may help adjust for confounding and assist in establishing potential treatment guidelines.
5 Conclusion
We have presented a novel approach to estimating causal effects with incomplete covariates at test time. This is a crucial problem in several domains such as healthcare, where doctors frequently have access to routine measurements, but may have difficulty acquiring, for instance, genotyping data for patients at test time as a result of the costs. We used the IB framework to learn a sufficient statistic of the information relevant for predicting outcomes. By further introducing a discrete latent space, we could learn equivalence classes that, in turn, allowed us to transfer knowledge from instances with complete covariates to instances with incomplete covariates, such that treatment effects could be accurately estimated. Our extensive experiments show that our method outperforms state-of-the-art approaches on semi-synthetic and real-world datasets. Since handling systematic missingness is a highly relevant problem in healthcare, we view this as a step towards improving these systems on a larger scale.
Footnotes
 Additional experiments can be found in the supplement
 OLS-1 is a least squares regression; OLS-2 uses two separate least squares regressions to fit the treatment and control groups respectively; TARnet is a feed-forward neural network from [28]; KNN is a nearest neighbours regression; RF is a random forest; BNN is a balancing neural network [10]; BLR is a balancing linear regression [10]; and CFRW is a counterfactual regression that uses the Wasserstein distance [28].
References
 (2017) Bayesian inference of individualized treatment effects using multi-task Gaussian processes. CoRR abs/1704.02801. Cited by: §1, §2.
 (2016) Deep Variational Information Bottleneck. arXiv e-prints, arXiv:1612.00410. Cited by: §1, §2, §3.
 (2013) Counterfactual reasoning and learning systems: the example of computational advertising. The Journal of Machine Learning Research 14 (1), pp. 3207–3260. Cited by: §2.
 (2016) Propensity score analysis with missing data.. Psychological methods 21 (3), pp. 427. Cited by: §2.
 (2005) Information bottleneck for Gaussian variables. In Journal of Machine Learning Research, pp. 165–188. Cited by: §2.
 (2007) Fundamentals of statistical causality. Technical report Department of Statistical Science, University College London. Cited by: §2, §3, §3.
 (2008) Bias analysis. Modern Epidemiology, pp. 345 – 380. Cited by: §1.
 (2011) Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics 20 (1), pp. 217–240. Cited by: §4.
 (2017) Categorical Reparameterization with Gumbel-Softmax. International Conference on Learning Representations (ICLR). Cited by: §2, §3.
 (2016) Learning representations for counterfactual inference. In Proceedings of the 33rd International Conference on Machine Learning, ICML’16, pp. 3020–3029. Cited by: §2, §3, footnote 3.
 (2016) MIMIC-III, a freely accessible critical care database. Scientific Data 3, pp. 160035. Cited by: §4.
 (2018) Causal inference with noisy and missing covariates via matrix factorization. In Advances in Neural Information Processing Systems, pp. 6921–6932. Cited by: §2.
 (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114. Cited by: §2.
 (2014) Adam: a method for stochastic optimization. CoRR abs/1412.6980. Cited by: §4.
 (2014) Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pp. 3581–3589. Cited by: §2.
 (2014) Measurement bias and effect restoration in causal inference. Biometrika 101 (2), pp. 423–437. Cited by: §1.
 (2017) Causal effect inference with deep latent-variable models. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan and R. Garnett (Eds.), pp. 6446–6456. Cited by: §1, §2, §4.
 (2013) Infant health and development program, phase iv, 20012004 united states. External Links: Document Cited by: §4.
 (2017) Medicine 96 (50). External Links: Document, ISSN 0025-7974; 1536-5964. Cited by: §4.
 (2015) Counterfactuals and causal inference. Cambridge University Press. Cited by: §2.
 (2009) Causality. Cambridge university press. Cited by: §2, §3.
 (2012) On measurement bias in causal inference. arXiv preprint arXiv:1203.3504. Cited by: §1.
 (2012) Meta-Gaussian information bottleneck. In NIPS, P. L. Bartlett, F. C. N. Pereira, C. J. C. Burges, L. Bottou and K. Q. Weinberger (Eds.), pp. 1925–1933. Cited by: §2.
 (2014) Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, E. P. Xing and T. Jebara (Eds.), Proceedings of Machine Learning Research, Vol. 32, Beijing, China, pp. 1278–1286. Cited by: §2.
 (1978) Bayesian inference for causal effects: the role of randomization. The Annals of statistics, pp. 34–58. Cited by: §2.
 (2017) Reliable decision support using counterfactual models. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan and R. Garnett (Eds.), pp. 1697–1708. Cited by: §2.
 (2017) What-if reasoning using counterfactual Gaussian processes. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 1696–1706. Cited by: §2.
 (2017) Estimating individual treatment effect: generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, International Convention Centre, Sydney, Australia, pp. 3076–3085. Cited by: footnote 3.
 (1923) Sur les applications de la théorie des probabilités aux experiences agricoles: essai des principes. Roczniki Nauk Rolniczych 10, pp. 1–51. Cited by: §2.
 (1990) On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statistical Science 5 (4), pp. 465–472. Cited by: §2.
 (2012) The impact of emergency medical services on the ed care of severe sepsis. The American journal of emergency medicine 30 (1), pp. 51–56. Cited by: §4.
 (2000) The information bottleneck method. arXiv preprint physics/0004057. Cited by: §1, §2.
 (2015) Deep learning and the information bottleneck principle. CoRR abs/1503.02406. Cited by: §3.
 (2017) Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association. Cited by: §1.
 (2018) Learning Sparse Latent Representations with the Deep Copula Information Bottleneck. International Conference on Learning Representations (ICLR). External Links: 1804.06216 Cited by: §2, §3.