# Multiple Causal Inference with Latent Confounding

###### Abstract

Causal inference from observational data requires assumptions. These assumptions range from measuring confounders to identifying instruments. Traditionally, these assumptions have focused on estimation in a single causal problem. In this work, we develop techniques for causal estimation in causal problems with multiple treatments. We develop two assumptions based on shared confounding between treatments and independence of treatments given the confounder. Together these assumptions lead to a confounder estimator regularized by mutual information. For this estimator, we develop a tractable lower bound. To fit the outcome model, we use the residual information in the treatments given the confounder. We validate on simulations and an example from clinical medicine.

Rajesh Ranganath, Courant Institute of Mathematical Sciences, New York University, rajeshr@cims.nyu.edu

Adler Perotte, Department of Biomedical Informatics, Columbia University, adler.perotte@columbia.edu

## 1 Introduction

Causal inference aims to estimate the effect one variable has on another. Such causal inferences form the heart of inquiry in many domains, including estimating the value of giving a medication to a patient, understanding the influence of genetic variations on phenotypes, and measuring the impact of job training programs on income.

Assumption-free causal inferences rely on randomized experimentation [7, 26]. Randomized experiments break the relationship between the intervention variable (the treatment) and variables that could alter both the treatment and the outcome—confounders. Though powerful, randomized experimentation fails to make use of large collections of non-randomized observational data (like electronic health records in medicine) and is inapplicable where broad experimentation is infeasible (like in human genetics). The counterpart to experimentation is causal inference from observational data. Causal inference from observational data requires assumptions. These assumptions include measurement of all confounders [31], the presence of external randomness that partially controls treatment [3], and structural assumptions on the randomness [13].

Though causal inference from observational data has been used in many domains, the assumptions that underlie these inferences focus on estimation in a single causal problem. But many real-world applications consist of multiple causal problems. For example, the effects of genetic variation on various phenotypes [6] or the effects of medications from order sets in clinical medicine [24] consist of multiple causal problems rather than a single causal problem. Considering multiple causal problems makes new kinds of assumptions possible.

We formalize multiple causal inference, a collection of causal inference problems with multiple treatments and a single outcome. We develop a set of assumptions under which confounders can be estimated when unmeasured. Two assumptions form the starting point: that the treatments share confounders, and that given the shared confounder, all of the treatments are independent. This kind of shared confounding structure can be found in many domains such as genetics. In genetics, the multiple causal problems are the effects of different genetic variations on different phenotypes. The shared unobserved confounding structure across problems comes from population structure or local correlation structure (such as linkage disequilibrium).

These two assumptions of shared confounding and independent treatments given the confounder imply a probabilistic model for the treatments, confounder, and outcome. This model is akin to generalized factor analysis. Yet shared confounding and independent treatments still leave ambiguity in how to estimate the confounder. This ambiguity lies in how much information the confounder contains about a single treatment given the rest of the treatments.

To respect shared confounding, the information between the confounder and a treatment given the rest of the treatments should be minimal. However, forcing this information to zero makes the confounder independent of the treatments. This can violate the assumption of independence given the shared confounder. To resolve the tension between our two assumptions, we develop the principle of minimal information. This minimal information principle says that the true confounder has minimal information with each treatment, given the rest of the treatments, subject to the constraint that the confounder should predict each treatment best, given the rest of the treatments. In other words, we try to find the confounder that has the lowest information with the treatments, while still rendering them independent.

We develop an algorithm for estimating the confounder based on the minimal information principle. The algorithm relies on independently reconstructing each treatment given a stochastic confounder estimated from all of the treatments, while regularizing the mutual information each treatment has with the confounder, given the rest of the treatments. The mutual information is intractable, so we build a lower bound called the multiple causal lower bound (mclbo).

The last step in building a causal estimator is to build the outcome model. Traditional outcome models regress the confounders and treatments to the outcome [23]. However, since the confounder estimate is a stochastic function of the treatments, it contains no new information about the response over the treatments: a regression on both the estimated confounder and treatments can ignore the estimated confounder. Instead, we build regression models using the residual information in the treatments and develop an estimator to compute these residuals. The assumptions and algorithms we develop strengthen the theoretical basis for existing causal estimation with unobserved confounders such as causal estimation with linear mixed models (lmms) [16, 20].

We demonstrate our approach on a large simulation study. Though traditional methods like principal component analysis (pca) adjustment [40] closely approximate the family of techniques we describe, we find that our approach more accurately estimates the causal effects, especially when the confounder dimensionality is misspecified. Finally, we apply the mclbo to control for confounders in a medical prediction problem on health records from the Multiparameter Intelligent Monitoring in Intensive Care (mimic iii) clinical database [15]. We show the effects we recover match those in the literature.

#### Related Work.

Causal inference has a long history in many disciplines including statistics, computer science, and econometrics. A full review is outside the scope of this article; however, we highlight some recent advances in building flexible causal models. Wager and Athey [37] develop random forests to capture variability in treatment effects. Hill [12] uses Bayesian nonparametric methods to model the outcome response. Louizos et al. [21] build flexible latent variables to correct for confounding when proxies of confounders are measured, rather than the confounders themselves. Johansson et al. [14] and Shalit et al. [33] develop estimators with theoretical guarantees by building representations that penalize differences in confounder distributions between the treated and untreated.

The above approaches focus on building estimators for single treatments, where either the confounder or a non-treatment proxy is measured. In contrast, computational genetics has developed a variety of methods to control for unmeasured confounding in genome-wide association studies (gwas). Genome-wide association studies have multiple treatments in the form of genetic variations across multiple sites. Yu et al. [40], Kang et al. [16], and Lippert et al. [20] estimate kinship matrices between individuals using a subset of the genetic variations, then fit an lmm where the kinship provides the covariance for random effects. Song et al. [34] adjust for confounding via factor analysis on discrete variables and use inverse regression to estimate individual treatment effects. Tran and Blei [36] build implicit models for genome-wide association studies and describe general implicit causal models in the same vein as Kocaoglu et al. [18].

Our theoretical grounding for multiple causal inference complements this earlier work. We develop the two assumptions of shared confounding and of independence given shared confounders. We describe the information ambiguity inherent in estimating confounders in multiple causal inference and show that a probabilistic model can be insufficient to resolve the ambiguity. We develop a confounder estimator to resolve this by regularization, while minimizing the leave-one-treatment-out prediction error. Earlier work estimates confounders by choosing their dimensionality to not overfit. This matches the flavor of the estimator we develop.

## 2 Multiple Causal Inference

The trouble with causal inference from observational data lies in confounders, variables that affect both treatments and outcome. The problem is that the observed statistical relationship between the treatment and outcome may be partially or completely due to the confounder. Randomizing the treatment breaks the relationship between the treatment and confounder, rendering the observed statistical relationship causal. But the lack of randomized data necessitates assumptions to control for potential confounders. These assumptions have focused on causal estimation with a single treatment and a single outcome. In real-world settings such as in genetics and medicine, there are multiple treatments. We now define the multiple causal inference problem, detail assumptions for multiple causal inference, and develop new estimators for the causal effects given these assumptions.

Multiple causal inference consists of a collection of causal inference problems. Consider a set of $T$ treatments $t_i$ indexed by $i$ and a single outcome $y$. For example, $t_i$ could be the $i$th medication for a disease given to a patient and $y$ could be the severity of that disease. The patient’s unmeasured traits induce a relationship between the treatments and the outcome. The goal of multiple causal inference is to simultaneously estimate the causal effects for all $T$ treatments. We develop two assumptions under which these causal effects can be estimated in the presence of unobserved confounders.

#### Shared Confounding.

The first assumption we make to identify multiple causal effects is that of shared confounders. The shared confounder assumption posits that confounders are shared across all of the treatments. Under this assumption, each treatment provides a view on the shared confounder. This provides the means to control for confounding on other treatments. This assumption is natural in many problems. For example, in gwas, treatments are genetic variations and the outcome is a phenotype. Due to correlations in genetic variations caused by ancestry, the treatments share confounding structure.

#### Independence Given Unobserved Confounders.

The shared confounding assumption does not identify the causal effects, since there can be links from treatment $t_i$ to treatment $t_j$. In the presence of these links, we cannot get a clear view of the shared confounder. This is because the dependence between $t_i$ and $t_j$ may be due either to confounding or to the direct link between the pair of treatments. To address this issue, we assume that treatments are independent given unobserved confounders. In the discussion, we explore strategies to loosen this assumption.

#### Implied Model.

We developed two assumptions: shared confounding and independence given the confounder. Together, these assumptions imply the existence of an unmeasured variable $z$ with some unknown distribution such that, when conditioned on $z$, the treatments become independent:

###### Proposition 1

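Let $\epsilon$ be independent noise terms, and $f$ be functions. Then, shared confounding and independence given unobserved confounding imply a generative process for the data that is

$$z = f_z(\epsilon_z), \qquad t_i = f_i(\epsilon_i, z), \qquad y = f_y(\epsilon_y, z, t_1, \ldots, t_T). \tag{1}$$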

If this model were explicitly given, posterior inference would reveal the causal effects. However, an issue remains. Since the information to infer $z$ comes from the treatments and the outcome, flexible models lead to a problematic ambiguity. Specifically, it is unclear how much information the outcome and each treatment reveal about the confounder. These assumptions lead to the principle of minimal information. This principle says that the confounder retains the minimum amount of information about each treatment needed to reconstruct the other treatments. We formalize the notion of information and develop an estimator that controls the information in the next section.
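As a minimal sketch of the generative process in Proposition 1, the simulation below assumes linear-Gaussian forms for the functions; the coefficient vectors `a` and `b`, the sizes, and the noise scales are illustrative assumptions, not values from the paper. It shows that treatments sharing a confounder are correlated marginally but become approximately uncorrelated once the confounder is removed.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 20_000, 3                      # observations and treatments (illustrative sizes)

# Hypothetical linear choices for the functions in Proposition 1.
a = np.array([1.0, -0.8, 0.6])        # assumed effect of the confounder on each treatment
b = np.array([0.5, 0.0, -0.3])        # assumed causal effects of the treatments on y

z = rng.normal(size=n)                            # shared confounder
t = z[:, None] * a + rng.normal(size=(n, T))      # treatments: independent given z
y = t @ b + 2.0 * z + rng.normal(size=n)          # outcome depends on z and all treatments

# Marginally, the treatments are correlated through the shared confounder.
marginal_corr = np.corrcoef(t, rowvar=False)

# After subtracting the confounder's contribution they are roughly uncorrelated.
conditional_corr = np.corrcoef(t - z[:, None] * a, rowvar=False)

print(round(marginal_corr[0, 1], 2), round(conditional_corr[0, 1], 2))
```

In real data the confounder is unobserved, so the subtraction step is unavailable; the sketch only illustrates the dependence structure the two assumptions imply.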

## 3 Confounder Estimation in Multiple Causal Inference

We develop an estimator for the confounder in multiple causal inference without directly specifying a generative model. This estimator finds the confounder with minimum information that reconstructs each treatment given the other treatments. The estimator works via information-based regularization and cross-validation in a way that is agnostic to the particular functional form of the estimator. We present a tractable lower bound to estimate the confounder.

#### Information Ambiguity.

We formalize the notion of information using mutual information [8]. Let $I(a; b)$ denote the mutual information between $a$ and $b$. The mutual information is nonnegative and is zero when $a$ and $b$ are independent. To understand the ambiguity in building stochastic confounder estimators, consider the information between the estimated confounder $z$ and treatment $t_i$ given the remaining treatments $t_{-i}$: $I(z; t_i \mid t_{-i})$. We call this the additional mutual information (ami). It is the additional information a treatment can provide to the confounder, over what the rest of the treatments provide. The additional mutual information takes values between zero and some nonnegative maximum. The maximum indicates that $z$ and $t_{-i}$ perfectly predict $t_i$. (When all variables are discrete, the upper bound is the entropy $H(t_i \mid t_{-i})$.) This range parameterizes the ambiguity of how much information the confounder encodes about treatment $t_i$, over the information present in the remaining treatments. (In the appendix, we provide an explicit construction of equivalent models that have different information.)
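To make the range of the additional mutual information concrete, the sketch below computes a conditional mutual information for small discrete tables. The function `cond_mutual_info` and the three toy joint distributions over two fair, independent binary treatments are hypothetical illustrations. A confounder independent of the treatments, or one that copies only the other treatment, gives zero; a confounder that copies the treatment itself attains the entropy upper bound.

```python
import numpy as np

def cond_mutual_info(p):
    """I(z; t1 | t2) in nats for a joint probability table indexed as p[z, t1, t2]."""
    p = p / p.sum()
    p_t2 = p.sum(axis=(0, 1), keepdims=True)      # p(t2)
    p_z_t2 = p.sum(axis=1, keepdims=True)         # p(z, t2)
    p_t1_t2 = p.sum(axis=0, keepdims=True)        # p(t1, t2)
    mask = p > 0                                  # skip zero-probability cells
    ratio = (p * p_t2)[mask] / (p_z_t2 * p_t1_t2)[mask]
    return float(np.sum(p[mask] * np.log(ratio)))

# Three toy confounders for two fair, independent binary treatments.
independent = np.full((2, 2, 2), 1 / 8)           # z carries no information
copies_t1 = np.zeros((2, 2, 2))                   # z memorizes t1
copies_t2 = np.zeros((2, 2, 2))                   # z only repeats t2
for t1 in range(2):
    for t2 in range(2):
        copies_t1[t1, t1, t2] = 0.25
        copies_t2[t2, t1, t2] = 0.25

ami_independent = cond_mutual_info(independent)
ami_copies_t1 = cond_mutual_info(copies_t1)
ami_copies_t2 = cond_mutual_info(copies_t2)
```

Here `ami_copies_t1` equals the entropy of a fair coin, matching the discrete upper bound, while the other two are zero.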

At first glance, letting $I(z; t_i \mid t_{-i}) > 0$ seems to violate shared confounding because the confounder has information about a treatment that is not in the other treatments. But setting $I(z; t_i \mid t_{-i}) = 0$ forces the confounder to be independent of all of the treatments. This is in tension with the assumption of the independence of treatments given the confounder: if the entropy of a treatment given the estimated confounder is bigger than its true entropy given the other treatments under the population sampling distribution, the treatments cannot be independent given the confounder.

The minimal information principle can be seen as a balancing of independence given a shared confounder and having a purely shared confounder. This principle can formally be stated as trying to minimize the ami subject to good predictions via the confounder, that is, subject to minimizing the predictive entropy of each treatment.

#### Confounder Estimation.

The most general form of a confounder estimator is a function that takes the following as input: noise $\epsilon$, parameters $\theta$, treatments $t$, and outcome $y$. Using the outcome without extra assumptions is inherently ambiguous. The ambiguity lies in how much of $y$ is retained in $z$. The only statistics we observe about $y$ come from $y$ itself or its cross statistics with $t$. From eq. 1, we know that the cross statistics provide a way to estimate the confounder. However, since the outcome depends on the treatments, cross statistics between the treatment and outcome could either be from the confounder or from the direct relationship between the treatments and outcome. This ambiguity cannot be resolved without further assumptions like assuming a model. Therefore we focus on estimating the confounder without using the outcome, where the outcome has been marginalized out.

A generic stochastic confounder estimator with marginalized outcome is a stochastic function of the treatments $t$ and noise $\epsilon$ with parameters $\theta$. The posterior of the latent confounder in a model is an example of such a stochastic estimator. To respect the assumptions, we want to find a $\theta$ such that conditional on the confounder, the treatments are independent. With the estimator we can directly control the ami via regularization. This contrasts with classical latent variable models, where parameters like the dimensionality, flexibility of the likelihood, and number of layers in a neural network control the additional mutual information.
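As a toy illustration of how the stochasticity of an estimator determines the additional mutual information, the sketch below uses a hypothetical linear estimator, a weighted sum of the treatments plus Gaussian noise, with independent unit-variance treatments. Under these assumptions the ami has the closed form $\tfrac{1}{2}\log(1 + w_i^2/s^2)$ for weight $w_i$ and noise scale $s$. The weights, noise scales, and helper names are all illustrative, and the noise scale here stands in for the regularizer rather than reproducing it.

```python
import numpy as np

def residual_variance(target, predictors):
    """Variance of target left over after least-squares regression on predictors."""
    X = np.column_stack([np.ones(len(target)), predictors])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return float(np.var(target - X @ coef))

def gaussian_ami(z, t, i):
    """Estimate I(z; t_i | t_-i) in nats, assuming z and t are jointly Gaussian."""
    t_rest = np.delete(t, i, axis=1)
    return 0.5 * np.log(residual_variance(z, t_rest) / residual_variance(z, t))

rng = np.random.default_rng(0)
n, T = 50_000, 4
t = rng.normal(size=(n, T))           # independent unit-variance treatments (assumed)
w = np.full(T, 0.5)                   # hypothetical estimator weights

results = {}
for noise_scale in (0.1, 1.0, 10.0):
    # Stochastic estimator: a function of the treatments and independent noise.
    z = t @ w + noise_scale * rng.normal(size=n)
    closed_form = 0.5 * np.log1p((w[0] / noise_scale) ** 2)
    results[noise_scale] = (gaussian_ami(z, t, 0), closed_form)
```

Small noise lets the estimator memorize each treatment (large ami), while large noise drives the ami toward zero and the estimate toward independence from the treatments.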

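For compactness, we drop the confounder’s functional dependence on $\epsilon$ and write $z \sim p_\theta(z \mid t)$. Let $\hat{p}(t)$ be the empirical distribution over the observed treatments, let $\beta$ be a parameter, and let $\alpha$ be a regularization parameter. Then we can define an objective that tries to reconstruct each treatment independently given $z$, while controlling the additional mutual information:

$$\max_{\theta, \beta} \; \mathbb{E}_{\hat{p}(t)} \mathbb{E}_{p_\theta(z \mid t)} \left[ \sum_{i=1}^{T} \log p_\beta(t_i \mid z) \right] - \alpha \sum_{i=1}^{T} I_\theta(z; t_i \mid t_{-i}). \tag{2}$$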

We suppress indices when clear from context. This objective finds the $z$ that can reconstruct $t$ most accurately, assuming the treatments are conditionally independent given $z$. Equation 2 can be viewed as an autoencoder where the code is regularized to limit the additional mutual information, thereby preferring to keep information shared between treatments.

The information regularizer is similar to regularizers in supervised learning. Consider how well the confounder predicts a treatment when estimated conditional on the rest of the treatments. When $\alpha$ is too small for a flexible model, the confounder memorizes the treatment so the prediction error is big. When $\alpha$ is too large, $z$ is independent of the treatments so again the prediction error is big. This mirrors choosing the regularization coefficient in linear regression. When the regularization is too large, the regression coefficients ignore the data, and when it is too small, the regression coefficients memorize the data. As in regression, $\alpha$ can be found by cross-validation. Minimizing the conditional entropy directly rather than by cross-validation leads to the degenerate solution of having all the information in $z$.

Since we do not have access to the conditional distribution of a treatment given the rest, $p(t_i \mid t_{-i})$, the objective contains an intractable mutual information term. We develop a tractable objective based on the conditional entropy decomposition of conditional mutual information.

#### Multiple Causal Lower Bound.

The conditional mutual information can be written in terms of conditional entropies as

$$I(z; t_i \mid t_{-i}) = H(z \mid t_{-i}) - H(z \mid t).$$

The second term comes from the entropy of $p_\theta(z \mid t)$ and is tractable. But the first term requires marginalizing out the treatment $t_i$. This conditional entropy with marginalized treatment is not tractable, so we develop a lower bound. Expanding the integral gives

The lower bound follows from Jensen’s inequality. Unbiased estimates of the lower bound can be computed via Monte Carlo. The bound becomes tight as goes to the number of observations. Substituting this back, the information-regularized confounder estimator objective gives

(3) |

This gives a tractable lower bound to the information-regularized objective called the multiple causal lower bound (mclbo).

#### Algorithm.

To optimize the mclbo, we use stochastic gradients by passing the derivative inside expectations [39]. These techniques underlie black box variational inference algorithms [28, 17, 30]. We derive the full gradients for $\theta$ and $\beta$ in the appendix. With these gradients, the algorithm can be summarized as follows. We choose a range of $\alpha$ values and fit the confounder estimator using the mclbo. We then select the $\alpha$ that minimizes the entropy on held-out treatments. In practice, we allow a small relative tolerance for larger $\alpha$’s over the best held-out prediction to account for finite sample estimation error. The algorithm can be seen as learning an autoencoder. The code of this autoencoder minimizes the information retained about each treatment subject to the code predicting each $t_i$ best when the code is computed only from $t_{-i}$, the remaining treatments.
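The selection step of the algorithm can be sketched as below. The function `select_alpha`, the default tolerance, and the example numbers are hypothetical; the sketch assumes the confounder estimator has already been fit for each candidate value and that a held-out entropy is available for each. How exactly the tolerance is applied is an assumption: here the largest value whose held-out entropy is within a relative tolerance of the best is chosen.

```python
import numpy as np

def select_alpha(alphas, heldout_entropy, rel_tol=0.01):
    """Pick the largest alpha whose held-out entropy is within rel_tol of the best."""
    alphas = np.asarray(alphas, dtype=float)
    heldout_entropy = np.asarray(heldout_entropy, dtype=float)
    best = heldout_entropy.min()
    # abs() keeps the tolerance meaningful when differential entropies are negative.
    threshold = best + rel_tol * abs(best)
    admissible = alphas[heldout_entropy <= threshold]
    return float(admissible.max())

# Made-up held-out entropies, one per candidate regularization value.
alphas = [0.01, 0.1, 1.0, 10.0]
heldout = [1.50, 1.20, 1.205, 1.90]

chosen = select_alpha(alphas, heldout)                   # prefers the larger alpha within tolerance
strict = select_alpha(alphas, heldout, rel_tol=0.0)      # plain minimizer of held-out entropy
```

Preferring the larger value within tolerance favors the estimator with less additional mutual information when held-out predictions are statistically indistinguishable.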

#### Necessity of Minimal Information.

Even with an explicit probabilistic model, the minimal information principle is necessary. We provide an example model and population treatment distribution that demonstrates this. Suppose the true treatments come from an unconfounded model, where all of the are independent. Consider a latent variable model where each observation has a latent variable and treatment vector . Let be a matrix of parameters, let and be hyperparameters, and let index observations. Then the model is

(4) |

The maximum likelihood estimate for the model in eq. 4 with latent size equal to data size given this true model is , up to rotations. The posterior distribution is

From the posterior, we see that small values of the hyperparameter lead to the posterior memorizing the treatments: the model learns that all of $t$ is confounded by $z$, while the true treatments are unconfounded. This problem occurs because there is a class of models, indexed by the hyperparameter, that model the observed treatments and have the same conditional entropy and predictive likelihoods. All additional mutual information regularization values lead to the same prediction. Satisfying the minimal information principle forces identification, in this case preserving the true unconfounded treatments.

## 4 Estimating the Outcome Model

In traditional observational causal inference, the possible outcomes are independent of the treatments given the confounders, so predictions given confounders are causal estimates. With the $\mathrm{do}$ notation that removes any influence from confounding variables, we have

So to estimate the causal effect, it suffices to build a consistent regression model. However, with estimated confounders that are stochastic functions of the treatments, this relationship breaks. The confounder has less information about the outcomes than the treatments themselves. Given the treatments, the confounders provide no information about the outcome. The lack of added information means that if we were to simply regress $t$ and $z$ to $y$, the regression could completely ignore the confounder. Building outcome models conditional on $z$ addresses this issue, but this requires doing regression for every value of $z$.
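The claim that a joint regression can ignore the estimated confounder can be checked in a small simulation. The sketch below assumes a linear-Gaussian data-generating process and a hypothetical confounder estimate `z_hat` that is a noisy linear function of the treatments only; all coefficients are illustrative. The fitted coefficient on `z_hat` is near zero, and the treatment coefficients stay biased by confounding.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 50_000, 3
a = np.array([1.0, -0.8, 0.6])        # assumed effect of the true confounder on treatments
b = np.array([0.5, 0.0, -0.3])        # assumed true causal effects on the outcome

z = rng.normal(size=n)                             # true, unobserved confounder
t = z[:, None] * a + rng.normal(size=(n, T))       # confounded treatments
y = t @ b + 2.0 * z + rng.normal(size=n)           # outcome

# Estimated confounder: a stochastic function of the treatments alone.
z_hat = t @ np.array([0.4, -0.3, 0.2]) + rng.normal(size=n)

# Ordinary least squares of the outcome on treatments and the estimate together.
X = np.column_stack([t, z_hat])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
coef_t, coef_z_hat = coef[:T], coef[T]
```

Because the only part of `z_hat` not already explained by the treatments is independent noise, the regression gains nothing from it, which motivates regressing on residuals instead.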

A regression conditional on the confounder can only use the part of the treatment that is independent of the confounder. Recovering these independent components allows for the use of regression. Formally, let be the independent component of the th treatment, then we would like to find a distribution that maximizes

(5) |

Optimizing this objective over and provides stochastic estimates of the part of independent of . We call this leftover part the residuals. These residuals are independent of the confounders and can be used in conjunction with them to estimate the outcome. Optimizing eq. 5 can be a challenge both due to the intractable mutual information constraint and that may have degenerate density.

In the appendix, we provide a general estimation technique for the residuals. Here we focus on the degenerate case. Suppose the reconstruction distribution $p_\beta(t_i \mid z)$ in eq. 2 can be written, for some function $d$, as $t_i = d(z, \epsilon_i)$ for $\epsilon_i$ drawn independently. Then if $d$ is invertible for every fixed value of $z$, the residuals that satisfy eq. 5 can be found via inversion. That is, eq. 5 is optimal if $\epsilon_i = d_z^{-1}(t_i)$, since $\epsilon_i$ is independent of $z$ by construction and in conjunction with $z$ perfectly reconstructs $t_i$. This means when the reconstruction in eq. 2 is invertible, independent residuals are easy to compute.
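Residuals by inversion can be sketched for one invertible family. The example below assumes a location-scale reconstruction, where a treatment equals a mean function of the confounder plus a positive scale function times independent noise; `mean_fn`, `scale_fn`, and the function names are hypothetical choices, not the paper's model. Inverting at a fixed confounder value recovers the noise exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

def mean_fn(z):
    """Hypothetical conditional mean of a treatment given the confounder."""
    return np.tanh(z)

def scale_fn(z):
    """Hypothetical conditional scale; strictly positive so the map is invertible."""
    return 0.5 + 0.25 * z ** 2

def reconstruct(z, eps):
    """Treatment as a function of confounder and noise, invertible in eps for fixed z."""
    return mean_fn(z) + scale_fn(z) * eps

def residual(z, t_i):
    """Invert the reconstruction at fixed z to recover the independent component."""
    return (t_i - mean_fn(z)) / scale_fn(z)

z = rng.normal(size=n)
eps = rng.normal(size=n)              # independent of z by construction
t_i = reconstruct(z, eps)
eps_hat = residual(z, t_i)
```

The recovered `eps_hat` matches the original noise, is uncorrelated with `z`, and together with `z` reproduces `t_i`, which are the two properties the residuals need.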

To learn the outcome model, we regress with the residuals and confounder by maximizing

(6) |

Since and are independent, they provide different information to . To compute the desired causal estimate, given the learned , we can substitute:

(7) |

The right hand side is in terms of known quantities: the outcome model from eq. 6 and the quantities learned during confounder estimation in eq. 3. The causal estimate can be computed by averaging over the confounder. We call the process of confounder estimation with the mclbo followed by outcome estimation with residuals the multiple causal estimation via information (mcei). We formalize causal recovery with mcei in a simple setting.

###### Proposition 2

The intuition is that as the number of treatments grows, we get perfect estimates of the confounder. Moreover, the amount of information about each treatment in the confounder goes to zero, so the information ambiguity gets resolved and we return to the classical causal inference setup. This argument can be generalized to non-identically distributed treatments where posterior concentration occurs.

## 5 Experiments

We demonstrate our approach on a challenging simulation where the noise also grows with the amount of confounding. We study multiple variants of this simulation across many replications. In total, we estimate over a hundred different models. We also study a real-world example from medicine. Here, we look at the effects of various lab values prior to entering the intensive care unit on the length of stay in the hospital.

#### Simulation.

Consider a model with real-valued treatments. Let index the observations and the treatments. Let be a parameter matrix, be the simulation standard deviation, and be the dimensionality of . The treatments are drawn conditional on an unobserved as

(8) |

where scales the influence of on each of the treatments. The outcomes are drawn in a way so that the effects of and cancel each other out. Let be a vector of weights and be the outcome standard deviation. Then the outcomes are

The amount of confounding grows with . Moreover since the weights are shared and opposite in sign between the treatment part , and confounding part , the effects when simply considering the observed confounded treatment can cancel each other out. This means to have any hope of recovering the true requires be split apart into the confounded and unconfounded parts. This, combined with the fact the noise grows relative to the unconfounded portion as increases, makes this simulation a challenge.

We compare our approach to the pca correction [40] and the lmm [16, 20]. These approaches should perform well since the process in eq. 8 matches the assumptions behind probabilistic pca [5]. Both these methods fall into the theoretical framework we developed. For confounder estimation by the mclbo, we limit our approach to have a similar number of parameters. Details are in the appendix.

We study two cases. First, we correctly give all methods the right dimensionality . Second, we misspecify: all methods use a smaller dimension , while the true , and the reverse setting where the dimensionality (40) is much bigger than the true . We measure MSE to the true parameters scaled by the true norm. We simulate observations with treatments over redraws. We describe the remaining simulation parameters in the appendix.
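One plausible reading of the error metric, sketched below, is the squared error between estimated and true effects divided by the squared norm of the true effects. The exact normalization is an assumption, and `scaled_mse` is a hypothetical helper name.

```python
import numpy as np

def scaled_mse(estimate, truth):
    """Squared error to the true effects, divided by the squared norm of the truth."""
    estimate = np.asarray(estimate, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return float(np.sum((estimate - truth) ** 2) / np.sum(truth ** 2))

truth = [1.0, -2.0, 2.0]              # made-up true effects
error = scaled_mse([1.0, -1.0, 2.0], truth)
```

Scaling by the true norm makes errors comparable across simulation settings whose effect sizes differ.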

Figure 1 shows the results. The left panel plots the error for varying levels of confounding when the confounder dimension is correctly specified. We find that confounder estimation with mcei performs similarly to or better than pca and the lmm. This difference is larger when the confounding grows. This is mostly due to the better variance control from fitting all treatments at once (pca with all treatments has degenerate design). The middle and right panels show mcei tolerates misspecification better than pca and the linear mixed model.

#### Clinical Experiment.

Length of stay (los) is defined as the duration of a hospital visit. This measure is often used as an intermediate outcome in studies due to the associated adverse primary outcomes. Patient flow is important medically because unnecessarily prolonged hospitalization places patients at risk for hospital acquired infections (among other adverse outcomes). These can be difficult to treat and are associated with significant morbidity, mortality, and cost. Studies have found a 1.37% increase in infection risk and 6% increase in any adverse event risk for each excess los day [11, 2]. Also, it is of operational concern for hospitals because reimbursement for medical care is increasingly tied to visit episodes rather than to discrete products or services provided [27].

The dataset studied in this experiment comprises 25753 ICU visits and 37 laboratory tests from the mimic iii clinical database [15]. We applied our mcei approach to laboratory tests measured in the emergency department prior to admission as treatments, and a binarized los based on the average value as outcome. Laboratory test values were shifted, log-transformed, and then standardized, with missing values imputed by randomly sampling the empirical distribution of the laboratory test type.
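The preprocessing described above might look like the sketch below. It is an illustration under several assumptions, not the authors' pipeline: the shift maps each test's minimum to 1 before the log, imputation draws from the already-standardized observed values of the same test, and the synthetic arrays stand in for the mimic iii data. The helper names `preprocess_labs` and `binarize_los` are hypothetical.

```python
import numpy as np

def preprocess_labs(labs, rng):
    """Shift, log-transform, standardize, and impute a (visits x tests) array."""
    labs = np.asarray(labs, dtype=float)
    out = np.empty_like(labs)
    for j in range(labs.shape[1]):
        col = labs[:, j]
        observed = ~np.isnan(col)
        shifted = col[observed] - col[observed].min() + 1.0   # assumed shift: minimum maps to 1
        logged = np.log(shifted)
        standardized = (logged - logged.mean()) / logged.std()
        out[observed, j] = standardized
        # Impute missing entries by resampling observed values of the same test.
        out[~observed, j] = rng.choice(standardized, size=(~observed).sum())
    return out

def binarize_los(los):
    """Label a visit 1 when its length of stay exceeds the average, else 0."""
    los = np.asarray(los, dtype=float)
    return (los > los.mean()).astype(int)

rng = np.random.default_rng(0)
labs = rng.lognormal(size=(200, 3))                   # synthetic stand-in for lab values
labs[rng.random(labs.shape) < 0.1] = np.nan           # roughly 10% missing at random
missing = np.isnan(labs)

clean = preprocess_labs(labs, rng)
labels = binarize_los(rng.exponential(scale=5.0, size=200))
```

Sampling from the empirical distribution keeps the marginal distribution of each test intact, unlike mean imputation, which would shrink its variance.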

The results are shown in fig. 2 and correlate well with findings in the literature regarding factors influencing los. For example, elevated blood urea nitrogen is associated with states of hypovolemia and hypercatabolism and has been linked to increased los in pancreatitis and stroke patients [10, 19]. Elevated white blood cells, or leukocytosis, is one of the main markers for infection and, as expected, infection has been associated with increased los, particularly when systemic [4, 35]. Other findings, such as an inverse relationship to potassium (hypokalemia) and platelets (thrombocytopenia) are also supported by the literature [25, 9].

## 6 Discussion

We formalized two assumptions needed for multiple causal inference, namely shared confounding and independence between treatments given the shared confounders. Together, these assumptions imply a minimal information principle that tries to find the confounder that has minimal information with the treatments while rendering the treatments conditionally independent given the confounder. We developed a tractable algorithm to estimate the confounder based on the minimal information principle. We showed how stochastic residuals can be used to estimate the outcome model, and we demonstrated our approach in simulations and on ICU data.

Many future directions remain. First, the assumptions we made are likely not tight. For example, the independence between treatments given the shared confounder could be relaxed to allow a finite number of dependencies between observations. The intuition is that if there is a limited amount of dependence between treatments, the confounder can be estimated from the other treatments. Next, in the algorithm to estimate the information, the lower bound can be replaced by likelihood ratio estimation. This has the added benefit of removing slack from the bound, while also improving numerical stability by avoiding differences of numbers of arbitrary magnitude. In this work, we focused on estimation with a single outcome. With multiple outcomes, new kinds of estimators that are simpler can be developed.

### Acknowledgments

We would like to acknowledge Jaan Altosaar, Fredrik Johansson, Rahul Krishnan, and Bharat Srikishan for helpful discussion and comments.

## References

- Agakov and Barber [2004] Agakov, F. V. and Barber, D. (2004). An auxiliary variational method. In Neural Information Processing, pages 561–566.
- Andrews et al. [1997] Andrews, L. B., Stocking, C., Krizek, T., Gottlieb, L., Krizek, C., Vargish, T., and Siegler, M. (1997). An alternative strategy for studying adverse events in medical care. The Lancet, 349(9048):309–313.
- Angrist et al. [1996] Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996). Identification of causal effects using instrumental variables. Journal of the American statistical Association, 91(434):444–455.
- Beyersmann et al. [2009] Beyersmann, J., Kneib, T., Schumacher, M., and Gastmeier, P. (2009). Nosocomial infection, length of stay, and time-dependent bias. Infection Control & Hospital Epidemiology, 30(3):273–276.
- Bishop [2016] Bishop, C. (2016). Pattern Recognition and Machine Learning. Springer-Verlag New York.
- Consortium et al. [2007] Consortium, W. T. C. C. et al. (2007). Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447(7145):661.
- Cook et al. [2002] Cook, T. D., Campbell, D. T., and Shadish, W. (2002). Experimental and quasi-experimental designs for generalized causal inference. Houghton Mifflin Boston.
- Cover and Thomas [2012] Cover, T. M. and Thomas, J. A. (2012). Elements of information theory. John Wiley & Sons.
- Crowther et al. [2005] Crowther, M. A., Cook, D. J., Meade, M. O., Griffith, L. E., Guyatt, G. H., Arnold, D. M., Rabbat, C. G., Geerts, W. H., and Warkentin, T. E. (2005). Thrombocytopenia in medical-surgical critically ill patients: prevalence, incidence, and risk factors. Journal of critical care, 20(4):348–353.
- Faisst et al. [2010] Faisst, M., Wellner, U. F., Utzolino, S., Hopt, U. T., and Keck, T. (2010). Elevated blood urea nitrogen is an independent risk factor of prolonged intensive care unit stay due to acute necrotizing pancreatitis. Journal of critical care, 25(1):105–111.
- Hassan et al. [2010] Hassan, M., Tuckman, H. P., Patrick, R. H., Kountz, D. S., and Kohn, J. L. (2010). Hospital length of stay and probability of acquiring infection. International Journal of pharmaceutical and healthcare marketing, 4(4):324–338.
- Hill [2011] Hill, J. L. (2011). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240.
- Hoyer et al. [2009] Hoyer, P. O., Janzing, D., Mooij, J. M., Peters, J., and Schölkopf, B. (2009). Nonlinear causal discovery with additive noise models. In Advances in neural information processing systems, pages 689–696.
- Johansson et al. [2016] Johansson, F., Shalit, U., and Sontag, D. (2016). Learning representations for counterfactual inference. In International Conference on Machine Learning, pages 3020–3029.
- Johnson et al. [2016] Johnson, A. E., Pollard, T. J., Shen, L., Lehman, L.-W. H., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L. A., and Mark, R. G. (2016). MIMIC-III, a freely accessible critical care database. Scientific data, 3:160035.
- Kang et al. [2010] Kang, H. M., Sul, J. H., Zaitlen, N. A., Kong, S.-y., Freimer, N. B., Sabatti, C., Eskin, E., et al. (2010). Variance component model to account for sample structure in genome-wide association studies. Nature genetics, 42(4):348.
- Kingma and Welling [2014] Kingma, D. and Welling, M. (2014). Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR-14).
- Kocaoglu et al. [2017] Kocaoglu, M., Snyder, C., Dimakis, A. G., and Vishwanath, S. (2017). CausalGAN: Learning causal implicit generative models with adversarial training. arXiv preprint arXiv:1709.02023.
- Lin et al. [2015] Lin, W.-C., Shih, H.-M., and Lin, L.-C. (2015). Preliminary prospective study to assess the effect of early blood urea nitrogen/creatinine ratio-based hydration therapy on poststroke infection rate and length of stay in acute ischemic stroke. Journal of Stroke and Cerebrovascular Diseases, 24(12):2720–2727.
- Lippert et al. [2011] Lippert, C., Listgarten, J., Liu, Y., Kadie, C. M., Davidson, R. I., and Heckerman, D. (2011). Fast linear mixed models for genome-wide association studies. Nature methods, 8(10):833.
- Louizos et al. [2017] Louizos, C., Shalit, U., Mooij, J. M., Sontag, D., Zemel, R., and Welling, M. (2017). Causal effect inference with deep latent-variable models. In Advances in Neural Information Processing Systems, pages 6449–6459.
- Maaløe et al. [2016] Maaløe, L., Sønderby, C. K., Sønderby, S. K., and Winther, O. (2016). Auxiliary deep generative models. arXiv preprint arXiv:1602.05473.
- Morgan and Winship [2014] Morgan, S. L. and Winship, C. (2014). Counterfactuals and causal inference. Cambridge University Press.
- O’Connor et al. [2009] O’Connor, C., Adhikari, N. K., DeCaire, K., and Friedrich, J. O. (2009). Medical admission order sets to improve deep vein thrombosis prophylaxis rates and other outcomes. Journal of hospital medicine, 4(2):81–89.
- Paltiel et al. [2001] Paltiel, O., Salakhov, E., Ronen, I., Berg, D., and Israeli, A. (2001). Management of severe hypokalemia in hospitalized patients: a study of quality of care based on computerized databases. Archives of internal medicine, 161(8):1089–1095.
- Pearl et al. [2009] Pearl, J. et al. (2009). Causal inference in statistics: An overview. Statistics surveys, 3:96–146.
- Press et al. [2016] Press, M. J., Rajkumar, R., and Conway, P. H. (2016). Medicare’s new bundled payments: design, strategy, and evolution. JAMA, 315(2):131–132.
- Ranganath et al. [2014] Ranganath, R., Gerrish, S., and Blei, D. (2014). Black box variational inference. In Artificial Intelligence and Statistics, pages 814–822.
- Ranganath et al. [2016] Ranganath, R., Tran, D., and Blei, D. (2016). Hierarchical variational models. In International Conference on Machine Learning, pages 324–333.
- Rezende et al. [2014] Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278–1286.
- Rosenbaum and Rubin [1983] Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55.
- Salimans et al. [2015] Salimans, T., Kingma, D., and Welling, M. (2015). Markov chain Monte Carlo and variational inference: Bridging the gap. In International Conference on Machine Learning, pages 1218–1226.
- Shalit et al. [2017] Shalit, U., Johansson, F. D., and Sontag, D. (2017). Estimating individual treatment effect: generalization bounds and algorithms. In International Conference on Machine Learning, pages 3076–3085.
- Song et al. [2015] Song, M., Hao, W., and Storey, J. D. (2015). Testing for genetic associations in arbitrarily structured populations. Nature genetics, 47(5):550.
- Talmor et al. [1999] Talmor, M., Hydo, L., and Barie, P. S. (1999). Relationship of systemic inflammatory response syndrome to organ dysfunction, length of stay, and mortality in critical surgical illness: effect of intensive care unit resuscitation. Archives of surgery, 134(1):81–87.
- Tran and Blei [2017] Tran, D. and Blei, D. M. (2017). Implicit causal models for genome-wide association studies. arXiv preprint arXiv:1710.10742.
- Wager and Athey [2017] Wager, S. and Athey, S. (2017). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, (just-accepted).
- Wallach et al. [2009] Wallach, H. M., Murray, I., Salakhutdinov, R., and Mimno, D. (2009). Evaluation methods for topic models. In Proceedings of the 26th annual international conference on machine learning, pages 1105–1112. ACM.
- Williams [1992] Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning, pages 5–32. Springer.
- Yu et al. [2006] Yu, J., Pressoir, G., Briggs, W. H., Bi, I. V., Yamasaki, M., Doebley, J. F., McMullen, M. D., Gaut, B. S., Nielsen, D. M., Holland, J. B., et al. (2006). A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nature genetics, 38(2):203.

## Appendix A

#### Explicit Ambiguity.

We make the idea of explicit ambiguity precise by constructing two models. Both models have the same distribution of treatments and outcomes and have treatments that are independent of the outcome. Take the model

and the model

Both of these models satisfy the independence of treatments given the shared confounder and have the same joint distribution on the treatments and outcomes. But the second model differs in a key way. It assumes all of the treatments are due to confounding. Since the two models have the same observed distribution, we need assumptions to choose between them.

We could start by arguing that this model does not satisfy shared confounding because parts of the confounder only relate to a single treatment. However, limiting the confounder by requiring it to contain only shared information requires the confounder be independent of the treatments. In practice it can be hard to prevent a confounder from memorizing the treatments, as even a single-dimensional with flexible can memorize the treatments. This is why we need the minimal information principle along with the regularizer it induces.
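To make the memorization concern concrete, the following is a minimal sketch, not the construction used in the paper. It assumes binary treatments and uses hypothetical helper names `pack` and `unpack` to show that a single real-valued code can store every treatment exactly, so a scalar confounder paired with a flexible enough decoder carries all of the treatment information.

```python
import numpy as np

def pack(treatments):
    """Encode a binary treatment vector as one scalar in [0, 1)."""
    # Treatment k contributes the k-th binary digit after the point.
    weights = 0.5 ** np.arange(1, len(treatments) + 1)
    return float(np.dot(treatments, weights))

def unpack(z, num_treatments):
    """Recover the binary treatments from the scalar code."""
    bits = []
    for _ in range(num_treatments):
        z *= 2.0
        bit = int(z >= 1.0)  # read off the leading binary digit
        bits.append(bit)
        z -= bit
    return np.array(bits)

rng = np.random.default_rng(0)
t = rng.integers(0, 2, size=20)  # hypothetical binary treatments
z = pack(t)                      # one-dimensional "confounder"
recovered = unpack(z, len(t))    # flexible decoder recovers every treatment
```

Because the scalar reproduces the treatments exactly, nothing about the reconstruction quality alone distinguishes it from a genuinely shared confounder, which is the motivation for penalizing the information the confounder retains.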

#### Negative Entropy Lower Bound

The lower bound follows via the relationship from Jensen’s inequality,

(9) |
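The Jensen step can be checked numerically. The sketch below does not reproduce Equation 9. It assumes, purely for illustration, a marginal density formed by averaging Gaussian conditionals with hypothetical means, and compares the log of the average density with the average of the log densities.

```python
import numpy as np

def normal_logpdf(z, mean, std):
    """Log density of a univariate normal distribution."""
    return -0.5 * np.log(2 * np.pi * std**2) - 0.5 * ((z - mean) / std) ** 2

rng = np.random.default_rng(0)
means = rng.normal(size=50)  # hypothetical conditional means, one per data point
std = 0.5

# Draw samples from the mixture (the marginal) by picking a component first.
idx = rng.integers(0, len(means), size=2000)
z = means[idx] + std * rng.normal(size=2000)

log_cond = normal_logpdf(z[:, None], means[None, :], std)  # shape (2000, 50)
log_marginal = np.log(np.mean(np.exp(log_cond), axis=1))   # log of the average
avg_log_cond = np.mean(log_cond, axis=1)                    # average of the logs

# Monte Carlo estimates of the negative entropy and of its Jensen lower bound.
neg_entropy_estimate = log_marginal.mean()
jensen_lower_bound = avg_log_cond.mean()
```

Since the logarithm is concave, the log of an average is never smaller than the average of the logs, so the second quantity lower bounds the first at every sample and therefore also in expectation.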

#### Proposition 1.

Independence given the confounder means that is independent of given the unobserved confounder. Shared confounding means there is only a single confounder . Since the form of is arbitrary, the distribution on is arbitrary. Also, since is arbitrary the distribution of given is arbitrary. Thus the generative process in Equation 1 constructs treatments that are conditionally independent given the confounder. It can represent any distribution for each treatment given the confounder. The confounder can also take any distribution. This means that Equation 1 can represent any distribution of treatments that satisfies both assumptions: shared confounding and independence given the confounder. The outcome function is arbitrary and so can be chosen to match any true outcome model.

#### Gradients of the mclbo.

The first term in the mclbo denoted and the conditional entropy are all integrals with respect to the distribution . To compute stochastic gradients, we differentiate under the integral sign as in variational inference. For simplicity, we assume that a sample from can be generated by transforming parameter-free noise through a function . This assumption leads to simpler gradient computation [17, 30].

Define

(10) |

Then the gradient with respect to can be written as

(11) |

Sampling from the various expectations gives a noisy unbiased estimate of the gradient. The gradient for is much simpler, as the sampled distributions do not depend on :

(12) |

Sampling from the observed data then sampling the confounder estimate gives an unbiased estimate of this gradient. The confounder estimation for a fixed value of is summarized in Algorithm 1.
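The reparameterization idea behind these gradients can be sketched on a toy problem. The example below is not the mclbo gradient itself. It assumes a one-dimensional normal sample written as a transformation of parameter-free noise, a hypothetical test function `f(z) = z**2` with known derivative, and a hand-written chain rule in place of automatic differentiation.

```python
import numpy as np

def reparam_grad(mu, log_sigma, df, num_samples, rng):
    """Monte Carlo gradient of E[f(z)] for z = mu + sigma * eps, eps ~ N(0, 1)."""
    eps = rng.normal(size=num_samples)  # parameter-free noise
    sigma = np.exp(log_sigma)
    z = mu + sigma * eps                # sample expressed as a transformation
    dz = df(z)                          # derivative of f at each sample
    grad_mu = np.mean(dz)                       # dz/dmu = 1
    grad_log_sigma = np.mean(dz * sigma * eps)  # dz/dlog_sigma = sigma * eps
    return grad_mu, grad_log_sigma

rng = np.random.default_rng(0)
# For f(z) = z**2 the expectation is mu**2 + sigma**2, so the exact gradients
# are 2 * mu and 2 * sigma**2 with respect to mu and log_sigma.
grad_mu, grad_log_sigma = reparam_grad(
    mu=1.5, log_sigma=0.0, df=lambda z: 2.0 * z, num_samples=200_000, rng=rng
)
```

Averaging over noise draws gives an unbiased estimate because the sampling distribution of the noise does not depend on the parameters, which is the same property the derivation above relies on.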

#### Equivalent Confounders.

Invertible transformations of a random variable preserve the information in that random variable. Take two distributions for computing the stochastic confounder and where can be written as an invertible function of . These two distributions have equivalent information for downstream tasks, such as building the outcome model or conditioning on the confounder. This equivalence means we have a choice of which member of the equivalence class to use. One way to narrow the choice is to enforce that the dimensions of are independent by minimizing total correlation.
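The information-preservation claim can be verified on a small discrete example. The sketch below assumes a hypothetical random joint table over a discrete confounder and a discrete treatment. It compares the mutual information before and after an invertible relabelling of the confounder, and after a non-invertible merge of two confounder states.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in nats) of a discrete joint table p(z, t)."""
    joint = joint / joint.sum()
    pz = joint.sum(axis=1, keepdims=True)  # marginal over rows (z)
    pt = joint.sum(axis=0, keepdims=True)  # marginal over columns (t)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (pz @ pt)[mask])))

rng = np.random.default_rng(0)
joint = rng.random((4, 3))    # rows: confounder values, columns: treatment values

perm = rng.permutation(4)     # invertible relabelling of the confounder
mi_original = mutual_information(joint)
mi_relabelled = mutual_information(joint[perm])

# A non-invertible map: merge confounder states 0 and 1 into one state.
merged = np.vstack([joint[0] + joint[1], joint[2:]])
mi_merged = mutual_information(merged)
```

The relabelled confounder carries exactly the same information about the treatment, while the merged one can only carry the same amount or less, so only invertible maps define members of the same equivalence class.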

#### Connection to Factor Analysis.

Factor analysis methods work by specifying a generative model for observations that independently generate each dimension of each observation. In its most general form this model is

Inference in this model matches the reconstruction term inside our confounder estimator with a kl-divergence regularizer. If we allow for the parameters of the prior on to be learned to maximize the overall likelihood, and if ’s dimensions are independent, then inference corresponds to minimizing the reconstruction eq. 2 with a total correlation style penalty.

There are many ways to choose the complexity of the factor model. One choice is to find the smallest complexity model that still gives good predictions of given (like document completion evaluation in topic models [38]). Here complexity is measured in terms of the dimensionality of and the complexity of and . This choice tries to minimize the amount of information retained in , while still reconstructing the treatments well. This way of selecting the factor analysis model’s complexity meets the condition of the minimum information principle. However, selecting discrete parameters like dimensionality gives less fine-grained control over the information rates.
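One way to picture this selection rule is a completion-style evaluation on held-out data. The sketch below is an illustration under several assumptions that are not from the paper: simulated treatments with a two-dimensional linear factor, a PCA basis standing in for the factor model, a fixed split into observed and held-out treatments, and a 10% tolerance for deciding that predictions are still good.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_treat, true_dim = 500, 200, 10, 2

# Hypothetical simulated treatments: low-rank signal plus small noise.
loadings = rng.normal(size=(true_dim, n_treat))

def simulate(n):
    z = rng.normal(size=(n, true_dim))
    return z @ loadings + 0.1 * rng.normal(size=(n, n_treat))

train, test = simulate(n_train), simulate(n_test)
mean = train.mean(axis=0)
# Principal directions estimated from the training treatments.
_, _, vt = np.linalg.svd(train - mean, full_matrices=False)

observed, held_out = np.arange(0, 7), np.arange(7, 10)

def completion_error(k):
    """Predict held-out treatments from observed ones with a k-dimensional factor."""
    basis = vt[:k]  # shape (k, n_treat)
    centered = test - mean
    # Least-squares estimate of the factor using the observed treatments only.
    z_hat, *_ = np.linalg.lstsq(
        basis[:, observed].T, centered[:, observed].T, rcond=None
    )
    pred = z_hat.T @ basis[:, held_out]
    return float(np.mean((pred - centered[:, held_out]) ** 2))

errors = {k: completion_error(k) for k in range(1, 6)}
best = min(errors.values())
# Smallest dimensionality whose held-out error is within 10% of the best.
chosen_dim = min(k for k, e in errors.items() if e <= 1.1 * best)
```

The choice is made over a discrete grid of dimensionalities, which illustrates the coarse control over information that the paragraph above contrasts with a continuous regularizer.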

#### Proposition 2.

If the data are conditionally i.i.d., then in the true model concentrates as the number of treatments goes to infinity. In this setting, we can learn the model from Proposition 1 using the mclbo. This follows because the information each treatment provides goes to zero as since they are conditionally i.i.d., thus the true confounder (and posterior), up to information equivalences, is simply a point that maximizes the reconstruction term in the mclbo subject to asymptotically zero ami. This shows outcome estimation corresponds to simple regression with treatments and confounder (up to an information equivalence), which correctly estimates the causal effects as .
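The concentration argument can be illustrated in a conjugate Gaussian model. This is a sketch under assumptions that are not part of the proposition: a scalar confounder with a standard normal prior, and treatments that are conditionally i.i.d. normal around the confounder with a known noise standard deviation.

```python
import numpy as np

def posterior(treatments, noise_std):
    """Posterior mean and variance of z for z ~ N(0, 1), t_i | z ~ N(z, noise_std**2)."""
    precision = 1.0 + len(treatments) / noise_std**2
    mean = treatments.sum() / noise_std**2 / precision
    return mean, 1.0 / precision

rng = np.random.default_rng(0)
z_true, noise_std = 0.7, 1.0  # hypothetical confounder value and noise level

means, variances = {}, {}
for num_treatments in (1, 10, 100, 1000):
    t = z_true + noise_std * rng.normal(size=num_treatments)
    means[num_treatments], variances[num_treatments] = posterior(t, noise_std)
```

In this toy model the posterior variance is `1 / (1 + T / noise_std**2)` for `T` treatments, so it shrinks toward zero as treatments are added, and each additional treatment changes the posterior less than the one before it.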

#### Estimating .

The estimation requires finding parameters and that maximize

The constraint can be baked into a Lagrangian with parameter ,

The mutual information can be split into entropy terms:

The first term can be bounded as before using multiple samples. The second term requires an alternative approach. We can use the entropy bounds with auxiliary distributions on the conditioning set, as used in variational inference [1, 32, 29, 22]. These bounds work with a distribution over the reverse conditioning set in this case . For this, we can use the reconstruction distribution and the fact that and do not depend on the parameters and .

#### Confounder Parameterization and Simulation Hyperparameters.

We limit the confounder to have similar complexity as pca. We do this by using a confounder distribution with normal noise, where we restrict the mean of the confounder estimate to be a linear function of the treatments . The variance is independent and controlled by a two-layer (for second moments) neural network. We similarly limit the likelihoods and outcome model to have linear means and fixed variance.
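A minimal sketch of such a parameterization follows. It is not the authors' implementation. The layer sizes, the `tanh` hidden activation, the softplus link for the standard deviation, the choice to feed the treatments into the variance network, and the names `init_params` and `sample_confounder` are all assumptions made for illustration.

```python
import numpy as np

def softplus(x):
    """Smooth positive link, log(1 + exp(x)), computed stably."""
    return np.logaddexp(0.0, x)

def init_params(num_treatments, confounder_dim, hidden, rng):
    """Randomly initialized parameters (hypothetical shapes and scales)."""
    return {
        "W_mean": 0.1 * rng.normal(size=(num_treatments, confounder_dim)),
        "b_mean": np.zeros(confounder_dim),
        "W1": 0.1 * rng.normal(size=(num_treatments, hidden)),
        "b1": np.zeros(hidden),
        "W2": 0.1 * rng.normal(size=(hidden, confounder_dim)),
        "b2": np.zeros(confounder_dim),
    }

def sample_confounder(params, treatments, rng):
    """Draw a confounder estimate with a linear mean and network-controlled scale."""
    # Mean restricted to a linear function of the treatments, as in pca.
    mean = treatments @ params["W_mean"] + params["b_mean"]
    # Two-layer network producing a per-dimension standard deviation.
    hidden = np.tanh(treatments @ params["W1"] + params["b1"])
    std = softplus(hidden @ params["W2"] + params["b2"])
    eps = rng.normal(size=mean.shape)  # normal noise
    return mean + std * eps, mean, std

rng = np.random.default_rng(0)
params = init_params(num_treatments=8, confounder_dim=2, hidden=16, rng=rng)
treatments = rng.normal(size=(5, 8))  # hypothetical batch of treatment vectors
z, mean, std = sample_confounder(params, treatments, rng)
```

Keeping the mean linear limits how much of the treatments the confounder can encode, while the separate network only shapes the per-dimension noise scale.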

For the remaining simulation hyperparameters, we set and to be the absolute value of draws from the standard normal. We fix the simulation standard deviation to and fix outcome standard deviation to .