# Causal Inference with Selectively-Deconfounded Data

## Abstract

Given only data generated by a standard confounding graph with an unobserved confounder, the Average Treatment Effect (ATE) is not identifiable. To estimate the ATE, a practitioner must then either (a) collect deconfounded data; (b) run a clinical trial; or (c) elucidate further properties of the causal graph that might render the ATE identifiable. In this paper, we consider the benefit of incorporating a (large) confounded observational dataset alongside a (small) deconfounded observational dataset when estimating the ATE. Our theoretical results show that the inclusion of confounded data can significantly reduce the quantity of deconfounded data required to estimate the ATE to within a desired accuracy level. Moreover, in some cases (say, genetics) we could imagine retrospectively selecting samples to deconfound. We demonstrate that by strategically selecting these examples based upon the (already observed) treatment and outcome, we can reduce our data dependence further. Our theoretical and empirical results establish that the worst-case relative performance of our approach (vs. a natural benchmark) is bounded while our best-case gains are unbounded. Next, we demonstrate the benefits of selective deconfounding using a large real-world dataset related to genetic mutation in cancer. Finally, we introduce an online version of the problem, proposing two adaptive heuristics.

## 1 Introduction

The academic literature on causal inference typically addresses the rigid setting in which confounders are either always or never observed. However, in many applications, suspected confounders might be infrequently observed. For example, in healthcare, a particular gene might be suspected to influence both a behavior and a health outcome of interest. However, while behaviors (e.g. smoking) might easily be tracked via questionnaires and outcomes (e.g. cancer status) tracked with similar ubiquity, genetic tests may be expensive and thus only available for a subset of patients.

In this paper, we address the middle ground along the confounded-deconfounded spectrum, focusing on the case where ample (cheaply-acquired) confounded data and few (expensive) deconfounded data are available. Naively, one could estimate the average treatment effect with standard methods using only the deconfounded data. First, we ask: how much can we improve our ATE estimates by incorporating confounded data over approaches that rely on deconfounded data alone? Second, motivated by genetic confounders, which might be retrospectively observed for cases with known treatments and outcomes, we introduce the problem of selective deconfounding—allocating a fixed budget for revealing the confounder based upon observed treatments and outcomes. This prompts our second question: what is the optimal policy for selecting data to deconfound?

We address these questions for a standard confounding graph where the treatment and outcome are binary, and the confounder is categorical. First, we propose a simple method for incorporating confounded data that achieves a constant-factor improvement in ATE estimation error. In short, the inclusion of (infinite) confounded data reduces the number of free parameters to be estimated, improving our estimates of the remaining parameters. Moreover, due to the multiplicative factors in the causal functional, errors in parameter estimates can compound. Thus, our improvements in parameter estimates yield greater benefits in estimating treatment effects. For binary confounders, our numerical results show that on average, over problem instances selected uniformly on the parameter simplex, our method achieves roughly improvements in ATE estimation error.

Next, we show that we can reduce the estimation error further by strategically choosing which samples to deconfound in an offline setting. Our proposed policy for selecting samples dominates reasonable benchmarks. In the worst case, our method requires no more than twice as many samples as a natural sampling policy, and our best-case gains are unbounded. Moreover, our qualitative analysis characterizes those situations most favorable/unfavorable for our proposed method. We extend our work to the scenario where only a finite amount of confounded data is present, demonstrating that our qualitative insights continue to apply. Additionally, we validate our methods using COSMIC (Tate et al., 2019; Cosmic, 2019), a real-world dataset containing cancer types, genetic mutations, and other patient features, showing the practical benefits of our proposed sampling policy. Finally, we introduce an online variant where our selection strategy can adapt dynamically as new information is revealed after each step of selective deconfounding. Throughout the paper, we implicitly assume that the confounded data was sampled i.i.d. from the target population of interest.

## 2 Related Work

Causal inference has been studied thoroughly under the ignorability assumption, i.e., no unobserved confounding (Neyman, 1923; Rubin, 1974; Holland, 1986). Some approaches for estimating the ATE under ignorability include inverse propensity score weighting (Rosenbaum and Rubin, 1983; Hirano et al., 2003; McCaffrey et al., 2004), matching (Dehejia and Wahba, 2002), and the backdoor adjustment (Pearl, 1995). Some related papers look to combine various sources of information, for instance from RCTs and observational data, to estimate the ATE (Stuart et al., 2011; Hartman et al., 2015). Other papers leverage machine learning techniques, such as random forests, for estimating causal effects (Alaa and van der Schaar, 2017; Wager and Athey, 2018). Other techniques include using time-series data to estimate the ATE (Athey et al., 2016), and targeted learning (Van der Laan and Rose, 2011).

Since the presence of an unobserved confounder can invalidate the estimated ATE, two lines of work attempt to relax or remove the ignorability assumption: one using observational data alone, and the other combining confounded observational data with experimental (and thus unconfounded) data. The first line of work includes papers using proxies (Miao et al., 2018) and mediators (Pearl, 1995). Kuroki and Pearl (2014) identify graphical structures under which the causal effect can be identified. Miao et al. (2018) propose to use two different types of proxies to recover causal effects with one unobserved confounder. Shi et al. (2018) extend the work by Miao et al. (2018) to multiple confounders. However, both methods require knowledge of proxy categories a priori and are not robust under misspecification of proxy categories. Louizos et al. (2017) use variational autoencoders to recover the causal effect under a model in which, conditioned on the unobserved confounders, the proxies are independent of treatment and outcome. Pearl (1995) introduces the front-door adjustment, expressing the causal effect as a functional that concerns only the (possibly confounded) treatment and outcome, and an (unconfounded) mediator that transmits the entire effect.

In other work, Bareinboim and Pearl (2013) propose to combine observational and experimental data under distribution shift, learning the treatment effect from the experimental data and transporting it to the confounded observational data to obtain a bias-free estimator for the causal effect. Recently, Kallus et al. (2018) propose a two-step process to remove hidden confounding by incorporating experimental data. Lastly, few papers provide finite sample guarantees for causal inference. Shalit et al. (2017) upper bound the estimation error for a family of algorithms that estimate causal effects under the ignorability assumption.

Unlike most prior work, we (i) address confounded and deconfounded (but not experimental) data, and (ii) perform finite sample analysis to quantify the relative benefit of additional confounded and deconfounded data towards improving our estimate of the average treatment effect.

## 3 Methods and Theory

Let $T$ and $Y$ be random variables denoting the treatment and treatment outcome, respectively. We restrict these to be binary, viewing $T$ as an indicator of whether a particular treatment has occurred and $Y$ as an indicator of whether the outcome was successful. In this work, we assume the existence of a single (possible) confounder, denoted $Z$, which can take up to $k$ categorical values (Figure 1). Following Pearl's nomenclature (Pearl, 2000), let

$$P(Y=y \mid do(T=t)) := \sum_{z\in[k]} P_{Y|T,Z}(y \mid t,z)\,P_Z(z).$$

Our goal is to estimate the ATE, which can be expressed, via the back-door adjustment, in terms of the joint distribution $P_{Y,T,Z}$ on $(Y,T,Z)$, as:

$$\mathrm{ATE} := P(Y=1 \mid do(T=1)) - P(Y=1 \mid do(T=0)) = \sum_{z\in[k]} \big(P_{Y|T,Z}(1 \mid 1,z) - P_{Y|T,Z}(1 \mid 0,z)\big)\,P_Z(z). \tag{1}$$
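To make Equation (1) concrete, the following minimal sketch evaluates the back-door adjustment on a small joint table and contrasts it with the unadjusted difference computed from the confounded marginal alone. The table layout `p[(y, t, z)]`, the function names, and the numerical instance are illustrative assumptions, not part of the method itself.

```python
from itertools import product


def ate_backdoor(p):
    """ATE via Equation (1) from a joint table p[(y, t, z)] = P(Y=y, T=t, Z=z)."""
    ate = 0.0
    for z in sorted({z for (_, _, z) in p}):
        p_z = sum(p[(y, t, z)] for y, t in product((0, 1), repeat=2))
        # P(Y=1 | T=t, Z=z) = p[1, t, z] / sum_y p[y, t, z]
        y1 = {t: p[(1, t, z)] / (p[(0, t, z)] + p[(1, t, z)]) for t in (0, 1)}
        ate += (y1[1] - y1[0]) * p_z
    return ate


def naive_contrast(p):
    """P(Y=1 | T=1) - P(Y=1 | T=0), using only the marginal over (Y, T)."""
    zs = {z for (_, _, z) in p}
    a = {(y, t): sum(p[(y, t, z)] for z in zs)
         for y, t in product((0, 1), repeat=2)}
    return (a[(1, 1)] / (a[(0, 1)] + a[(1, 1)])
            - a[(1, 0)] / (a[(0, 0)] + a[(1, 0)]))


def p_y1(t, z):
    """Hypothetical P(Y=1 | T=t, Z=z): the effect of T is 0.2 in every stratum."""
    return 0.1 + 0.2 * t + 0.5 * z


# Hypothetical instance with a binary confounder that drives both T and Y.
p_z = {0: 0.5, 1: 0.5}
p_t1 = {0: 0.2, 1: 0.8}  # P(T=1 | Z=z)

joint = {}
for y, t, z in product((0, 1), repeat=3):
    pt = p_t1[z] if t == 1 else 1 - p_t1[z]
    py = p_y1(t, z) if y == 1 else 1 - p_y1(t, z)
    joint[(y, t, z)] = p_z[z] * pt * py

print(ate_backdoor(joint), naive_contrast(joint))
```

On this instance the adjusted value is about 0.2 while the unadjusted contrast is about 0.5, which illustrates why confounded data alone cannot identify the ATE.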

Our key contribution is to provide, analyze, and empirically validate methods for estimating the ATE from data consisting of both confounded and deconfounded observations. In our setup, the confounded data contains $n$ independent draws from the joint distribution $P_{Y,T}$ (marginalized over the hidden confounder $Z$), and the deconfounded data contains $m$ independent draws from the full joint distribution $P_{Y,T,Z}$. Thus, the confounded and deconfounded data take the form of $(Y,T)$ and $(Y,T,Z)$ tuples, respectively.

Note that given this graph, we cannot exactly calculate the ATE without knowing the entire joint distribution $P_{Y,T,Z}$, and thus we cannot hope to estimate the ATE from confounded data alone. On the other hand, when deconfounded data is scarce and confounded data comparatively plentiful, we hope to use the confounded data to improve our ATE estimates.

### 3.1 Infinite Confounded Data

Throughout this subsection, we address the setting where we have an infinite amount of confounded data ($n = \infty$), i.e., the marginal distribution $P_{Y,T}$ is known exactly.

Deconfounded Data Alone. We begin with the baseline approach of using only the deconfounded data. Let $p^z_{yt} = P_{Y,T,Z}(y,t,z)$, and let $\hat{p}^z_{yt}$ be empirical estimates of $p^z_{yt}$ from the deconfounded data using the Maximum Likelihood Estimator (MLE). Let $\widehat{\mathrm{ATE}}$ be the estimated average treatment effect calculated by plugging the $\hat{p}^z_{yt}$ into Equation (1). In the following theorem, we show a quantity of samples which is sufficient to estimate the ATE to within a desired level of accuracy under the estimation process described above. The constant $C$ appearing in the bounds below is held fixed throughout.

###### Theorem 1.

Using deconfounded data alone, is satisfied if the sample size is at least

$$m_{\mathrm{base}} := \max_{t,z} \frac{C}{\left(\sum_y p^z_{yt}\right)^2} = \max_{t,z} \frac{1}{P_{T,Z}(t,z)^2}\,C.$$

The proof of Theorem 1 (Appendix B.1) relies on an additive decomposition of the estimation error on the ATE in terms of the estimation error on the $p^z_{yt}$'s, along with concentration via Hoeffding's inequality. Theorem 1 analyzes the worst case (ignoring all confounded data). We will contrast this bound with counterpart methods that use confounded data.

Incorporating Confounded Data. Estimating the ATE requires estimating the entire distribution $P_{Y,T,Z}$. To assess the utility of confounded data, we decompose $P_{Y,T,Z}$ into two components: (i) the confounded distribution $P_{Y,T}$; and (ii) the conditional distributions $P_{Z|Y,T}$. Given infinite confounded data, the confounded distribution is known exactly, reducing the number of free parameters in $P_{Y,T,Z}$ by three. The deconfounded data can then be used exclusively to estimate the conditional distributions $P_{Z|Y,T}$. To ease notation, let $a_{yt} = P_{Y,T}(y,t)$, let $q^z_{yt} = P_{Z|Y,T}(z \mid y,t)$, and let $\hat{q}^z_{yt}$ be the empirical estimate of $q^z_{yt}$ from the deconfounded data using the MLE. Then, we will always calculate our estimate $\widehat{\mathrm{ATE}}$ by plugging the $a_{yt}$'s and $\hat{q}^z_{yt}$'s into Equation (1), using $p^z_{yt} = a_{yt}\,q^z_{yt}$. The following theorem bounds the sample complexity for this estimator (proof in Appendix B.2):

###### Theorem 2.

When incorporating (infinite) confounded data, is satisfied if the number of samples is at least

$$m_{\mathrm{nsp}} := \max_{t,z} \frac{C\sum_y a_{yt}}{\left(\sum_y a_{yt}\,q^z_{yt}\right)^2} = \max_{t,z} \frac{P_T(t)}{P_{T,Z}(t,z)^2}\,C.$$

Note that $m_{\mathrm{nsp}}$ is smaller than $m_{\mathrm{base}}$ for any problem instance, highlighting the value of incorporating confounded data into the ATE estimation.
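The decomposition of the joint into a confounded part and a conditional part can be sketched as follows. The code splits a joint table into $a_{yt} = P_{Y,T}(y,t)$ and $q^z_{yt} = P_{Z|Y,T}(z \mid y,t)$, recomputes the ATE from the product $a_{yt} q^z_{yt}$, and evaluates the two sample-size bounds of Theorems 1 and 2 divided by $C$. The function names and the numerical table are hypothetical, chosen only for illustration.

```python
from itertools import product


def split_joint(p, k):
    """Decompose p[(y,t,z)] into a[(y,t)] = P(y,t) and q[(y,t)][z] = P(z | y,t)."""
    a = {(y, t): sum(p[(y, t, z)] for z in range(k))
         for y, t in product((0, 1), repeat=2)}
    q = {yt: [p[yt + (z,)] / a[yt] for z in range(k)] for yt in a}
    return a, q


def ate_from_a_q(a, q, k):
    """Plug a and q into Equation (1) using P(y,t,z) = a[(y,t)] * q[(y,t)][z]."""
    ate = 0.0
    for z in range(k):
        p_tz = {t: a[(0, t)] * q[(0, t)][z] + a[(1, t)] * q[(1, t)][z]
                for t in (0, 1)}
        y1 = {t: a[(1, t)] * q[(1, t)][z] / p_tz[t] for t in (0, 1)}
        ate += (y1[1] - y1[0]) * (p_tz[0] + p_tz[1])
    return ate


def bounds_over_c(a, q, k):
    """Return (m_base / C, m_nsp / C) following Theorems 1 and 2."""
    base = nsp = 0.0
    for t, z in product((0, 1), range(k)):
        p_tz = a[(0, t)] * q[(0, t)][z] + a[(1, t)] * q[(1, t)][z]
        p_t = a[(0, t)] + a[(1, t)]
        base = max(base, 1.0 / p_tz ** 2)
        nsp = max(nsp, p_t / p_tz ** 2)
    return base, nsp


# Hypothetical joint over (y, t, z), listed in product order and normalized.
weights = [3, 1, 2, 2, 1, 4, 2, 5]
keys = list(product((0, 1), repeat=3))
joint = {key: w / sum(weights) for key, w in zip(keys, weights)}

a, q = split_joint(joint, k=2)
print(ate_from_a_q(a, q, k=2), bounds_over_c(a, q, k=2))
```

In practice the true `q` would be replaced by its empirical estimate from the deconfounded samples, while `a` stays fixed at its known value.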

Offline Sample Selection Policies. One important consequence of our procedure for estimating the ATE is that the four conditional distributions $P_{Z|Y,T}(\cdot \mid y,t)$ are estimated separately: the deconfounded data is partitioned into four groups, one for each $(y,t)$, and the empirical measures $\hat{q}^z_{yt}$ are then calculated separately. This means that the procedure does not rely on the fact that the deconfounded data is drawn from the exact distribution $P_{Y,T,Z}$, and in particular, the draws might as well have been made directly from the conditional distributions $P_{Z|Y,T}$.

Suppose now that we can draw directly from these conditional distributions. This situation may arise when the confounder is fixed (like a genetic trait) and can be observed retrospectively. We now ask, given a budget for selectively deconfounding samples, how should we allocate our samples among the four groups ($(y,t) \in \{0,1\}^2$)?

Let $x = (x_{00}, x_{01}, x_{10}, x_{11})$ denote a selection policy with each $x_{yt}$ indicating the proportion of samples allocated to group $(y,t)$, and $\sum_{y,t} x_{yt} = 1$. We consider the following three selection policies:

1. Natural (NSP): $x_{yt} = a_{yt}$. This is similar to drawing from $P_{Y,T,Z}$.

2. Uniform (USP): $x_{yt} = 1/4$. This splits the samples evenly across all four conditional distributions.

3. Outcome-weighted (OWSP): $x_{yt} = \frac{a_{yt}}{2\sum_{y'} a_{y't}}$, i.e. splitting samples evenly across treatment groups ($T=0$ vs. $T=1$), and within each treatment group, choosing the number of samples to be proportional to the outcome ($Y=0$ vs. $Y=1$).
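The three allocations can be sketched in a few lines. This is an illustration under one stated assumption: the outcome-weighted proportions are taken to be $x_{yt} = a_{yt} / (2\sum_{y'} a_{y't})$, which is the reading consistent with the bound $m_{\mathrm{owsp}}$ in Theorem 3. The function names, the rounding helper, and the example marginal are hypothetical.

```python
def selection_policies(a):
    """Allocation proportions x[(y, t)] given a[(y, t)] = P(Y=y, T=t)."""
    p_t = {t: a[(0, t)] + a[(1, t)] for t in (0, 1)}
    nsp = dict(a)                                  # proportional to P(y, t)
    usp = {yt: 0.25 for yt in a}                   # even split over four groups
    # Assumed OWSP form: half the budget per treatment arm, split by P(y | t).
    owsp = {(y, t): a[(y, t)] / (2.0 * p_t[t]) for (y, t) in a}
    return {"NSP": nsp, "USP": usp, "OWSP": owsp}


def group_counts(x, budget):
    """Turn proportions into integer counts using largest-remainder rounding."""
    raw = {g: x[g] * budget for g in x}
    counts = {g: int(raw[g]) for g in x}
    leftover = budget - sum(counts.values())
    by_remainder = sorted(x, key=lambda g: raw[g] - counts[g], reverse=True)
    for g in by_remainder[:leftover]:
        counts[g] += 1
    return counts


# Hypothetical confounded marginal in which treatment T=1 is uncommon.
a = {(0, 0): 0.6, (1, 0): 0.2, (0, 1): 0.05, (1, 1): 0.15}
policies = selection_policies(a)
for name, x in policies.items():
    print(name, group_counts(x, budget=1200))
```

With this marginal, NSP spends only 20% of the budget on the rarer treatment arm, whereas OWSP spends half of it there.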

While OWSP may seem the least intuitive, we demonstrate benefits over the other policies via theorems, analogous to Theorems 1 and 2, for the uniform and outcome-weighted policies:

###### Theorem 3.

Under the uniform selection policy, with (infinite) confounded data incorporated, is satisfied if the number of samples is at least

$$m_{\mathrm{usp}} := \max_{t,z} \frac{C\sum_y 4a_{yt}^2}{\left(\sum_y a_{yt}\,q^z_{yt}\right)^2} = \max_{t,z} \frac{4\sum_y P_{Y,T}(y,t)^2}{P_{T,Z}(t,z)^2}\,C.$$

Similarly, for the outcome-weighted selection policy:

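$$m_{\mathrm{owsp}} := \max_{t,z} \frac{2C\left(\sum_y a_{yt}\right)^2}{\left(\sum_y a_{yt}\,q^z_{yt}\right)^2} = \max_{t,z} \frac{2}{P_{Z|T}(z \mid t)^2}\,C.$$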

The proofs of Theorems 2-3 (Appendix B.2), which differ from the proof of Theorem 1, require a modification to Hoeffding's inequality (Appendix, Lemma 4), which we derive to bound the sample complexity of the weighted sum of two independent random variables. Theorem 3 points to some advantages of OWSP. First, OWSP has the nice property that the sufficient number of samples, $m_{\mathrm{owsp}}$, depends only on $P_{Z|T}$ and in particular does not depend on the treatment marginal $P_T$.

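A comparison of the quantities $m_{\mathrm{usp}}$ and $m_{\mathrm{owsp}}$ suggests that USP is dominated by OWSP, since $2\left(\sum_y a_{yt}\right)^2 \le 4\sum_y a_{yt}^2$ and hence $m_{\mathrm{owsp}} \le m_{\mathrm{usp}}$. We might hope for a similar result by comparing $m_{\mathrm{owsp}}$ with $m_{\mathrm{nsp}}$ from Theorem 2, but neither strictly dominates the other. Instead, our final result shows that NSP may be significantly worse than OWSP, but OWSP is never much worse (proof in Appendix B.3):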

###### Corollary 1.

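Let $m_{\mathrm{nsp}}$ and $m_{\mathrm{owsp}}$ be defined as in Theorems 2 and 3. Then

$$\frac{m_{\mathrm{owsp}}}{m_{\mathrm{nsp}}} \le 2,$$

and there exist distributions where $m_{\mathrm{owsp}}/m_{\mathrm{nsp}}$ is arbitrarily close to zero.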
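The relationships between the three bounds can be checked numerically. The sketch below evaluates the bounds of Theorems 2 and 3 divided by $C$, verifies on randomly drawn joints that the outcome-weighted bound is at most twice the natural bound and never exceeds the uniform bound, and then evaluates an instance with a rare treatment where the ratio is small. The function names, the random seed, and the skewed instance are illustrative assumptions.

```python
import random
from itertools import product


def bounds_over_c(p):
    """(m_nsp, m_usp, m_owsp) divided by C for a joint p[(y,t,z)], binary Y and T."""
    zs = sorted({z for (_, _, z) in p})
    nsp = usp = owsp = 0.0
    for t, z in product((0, 1), zs):
        a0, a1 = (sum(p[(y, t, zz)] for zz in zs) for y in (0, 1))
        p_tz = p[(0, t, z)] + p[(1, t, z)]          # equals sum_y a_yt * q^z_yt
        nsp = max(nsp, (a0 + a1) / p_tz ** 2)
        usp = max(usp, 4 * (a0 ** 2 + a1 ** 2) / p_tz ** 2)
        owsp = max(owsp, 2 * (a0 + a1) ** 2 / p_tz ** 2)
    return nsp, usp, owsp


def random_joint(rng, k=2):
    """Draw a joint uniformly from the simplex by normalizing Exp(1) variables."""
    keys = list(product((0, 1), (0, 1), range(k)))
    w = [rng.expovariate(1.0) for _ in keys]
    total = sum(w)
    return {key: wi / total for key, wi in zip(keys, w)}


rng = random.Random(0)
for _ in range(1000):
    nsp, usp, owsp = bounds_over_c(random_joint(rng))
    assert owsp <= 2 * nsp * (1 + 1e-12)
    assert owsp <= usp * (1 + 1e-12)

# Rare treatment, P(T=1) = 0.01, with Z independent of (Y, T).
skewed = {(y, t, z): 0.5 * (0.01 if t else 0.99) * 0.5
          for y, t, z in product((0, 1), repeat=3)}
nsp, usp, owsp = bounds_over_c(skewed)
print(owsp / nsp)
```

Shrinking `P(T=1)` further drives the printed ratio toward zero, in line with the second claim of the corollary.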

### 3.2 Finite Confounded Data

We have now shown that given an infinite amount of confounded data, OWSP outperforms the NSP in the worst case (Section 3.1). However, in practice, the confounded data will be finite. In this case, these confounded data provide us with an estimate of the confounded distribution, , which we denote , and thus provide us an estimated OWSP. Similarly, we estimate using the MLE from the confounded data. To check the robustness of OWSP, we extend our analysis to handle finite confounded data. With defined as in Section 3.1, we can derive a theorem analogous to Theorems 1-3:

###### Theorem 4.

Given confounded and deconfounded samples, with , is satisfied when

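$$\min_{y,t,z} \frac{\left(\sum_{y'} a_{y't}\,q^z_{y't}\right)^2}{\frac{1}{x_{yt}m} + \frac{(q^z_{yt})^2}{n}} = \min_{y,t,z} \left(\frac{P_{T,Z}(t,z)^2}{\frac{1}{x_{yt}m} + \frac{(q^z_{yt})^2}{n}}\right) \ge 4C.$$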

The proof of Theorem 4 (Appendix B.4) requires a bound we derive (Appendix, Lemma 5) for the product of two independent random variables. A few results follow from Theorem 4. First, a quick calculation shows that when is held constant, remains positive as . This means that for a certain combinations of , there does not necessarily exist a sufficiently large s.t. can be satisfied.

However, when there exists such an , then we can write

$$m \ge \max_{y,t,z} \frac{1}{x_{yt}\left(\frac{P_{T,Z}(t,z)^2}{4C} - \frac{(q^z_{yt})^2}{n}\right)}.$$

In this case, the conclusions of Corollary 1 still hold: , and there exist distributions such that is arbitrarily small. Theorem 4 also implies that when

$$n \gg (q^z_{yt})^2\,x_{yt}\,m \quad \forall y,t,$$

the majority of the estimation error comes from not deconfounding enough data. To put it another way, for a given , having confounded samples is sufficient.
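The rearranged condition on $m$ can be evaluated directly. The sketch below returns the smallest integer $m$ permitted by the bound for given $n$, or `None` when the error contributed by the finite confounded sample already exceeds the budget, so that no amount of deconfounded data suffices. The constant $C$ is treated as a given positive number, and the function name and the example values are hypothetical.

```python
import math
from itertools import product


def min_deconfounded(n, c, a, q, x, k):
    """Smallest m allowed by the rearranged Theorem 4 bound, or None if none works.

    a[(y,t)] = P(y,t); q[(y,t)][z] = P(z | y,t); x[(y,t)] = selection proportions.
    c stands for the constant C, taken here as a given positive number.
    """
    worst = 0.0
    for y, t, z in product((0, 1), (0, 1), range(k)):
        p_tz = sum(a[(yy, t)] * q[(yy, t)][z] for yy in (0, 1))
        slack = p_tz ** 2 / (4 * c) - q[(y, t)][z] ** 2 / n
        if slack <= 0:
            # The confounded-data term alone exhausts the error budget.
            return None
        worst = max(worst, 1.0 / (x[(y, t)] * slack))
    return math.ceil(worst)


# Hypothetical symmetric instance with a binary confounder.
groups = list(product((0, 1), repeat=2))
a = {g: 0.25 for g in groups}
q = {g: [0.5, 0.5] for g in groups}
x = {g: 0.25 for g in groups}

for n in (100, 1600, 10 ** 9):
    print(n, min_deconfounded(n, c=10.0, a=a, q=q, x=x, k=2))
```

As `n` grows, the required `m` decreases toward the value obtained with infinite confounded data, which matches the observation that beyond a certain point additional confounded samples add little.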

One new issue that arises with finite confounded data is that a sampling policy may not be feasible because there are not enough confounded samples to deconfound. This does not happen for NSP (assuming ), but can occur for USP and OWSP. When this happens, e.g. in our experiments, we approximate the target sampling policies as closely as is feasible (see Appendix D).

## 4 Experiments

Since the upper bounds that we derived in Section 3 are not necessarily tight, we first perform synthetic experiments to assess the tightness of our bounds. For the purpose of illustration, we focus on binary confounders throughout this section, and denote . We first compare the sampling policies in synthetic experiments on randomly chosen distributions , measuring both the average and worst-case performance of each sampling policy. We then measure the effect of having finite (vs. infinite) confounded data. Finally, we test the performance of OWSP on real-world data taken from a genetic database, COSMIC, that includes genetic mutations of cancer patients (Tate et al., 2019; Cosmic, 2019).

### 4.1 Infinite Confounded Data: Synthetic Experiments

Assuming access to infinite confounded data, we experimentally evaluate all four sampling methods for estimating the ATE: using deconfounded data alone, and incorporating confounded data with the deconfounded samples selected according to NSP, USP, and OWSP. Let

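$$a := (a_{00}, a_{01}, a_{10}, a_{11}), \quad \text{and} \quad q := (q_{00}, q_{01}, q_{10}, q_{11}),$$

encoding the confounded and conditional distributions, respectively. We evaluate the performance of the four methods in terms of the absolute error, $|\widehat{\mathrm{ATE}} - \mathrm{ATE}|$.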

Average Performance We first evaluate the four methods over a randomly-selected set of distributions. Figure 2 was generated by averaging over 13,000 instances, each with the distribution drawn uniformly from the unit -Simplex. Every instance consists of replications, each with a random draw of 1,200 deconfounded samples. The absolute error is measured as a function of the number of deconfounded samples in steps of samples. Figure 2 (left) compares the use of deconfounded data alone with the incorporation of confounded data selected naturally (as in the comparison of Theorems 1 and 2). It shows that incorporating confounded data yields a significant improvement in estimation error. For example, achieving an absolute error of using deconfounded data alone requires more than 1,200 samples on average, while by incorporating confounded data, only samples are required. Having established the value of confounded data, Figure 2 (middle) compares the three selection policies. We find that, when averaged over joint distributions, OWSP outperforms both NSP and USP. The two scatter plots in Figure 2 (right) contain the 13,000 instances in the left figures, each averaged over replications. The number of deconfounded samples is fixed at 1,200. We observe that OWSP outperforms NSP and USP in the majority of instances.

Worst-Case Instances In Figure 3, we evaluate the performance of the three selection policies on joint distributions chosen adversarially against each. The three sub-figures (the columns) correspond to instances where NSP, USP, and OWSP perform the worst, respectively, from the left to the right. Each sub-figure is further subdivided: the top contains results for the single adversarial example while the bottom is averaged over ’s sampled uniformly from . The absolute error is averaged over 10,000 replications in the left figures and over in the right. In all cases, we draw deconfounded samples and measure the absolute error in steps of samples.

Figure 3 (left) validates Corollary 1. We observe that when the distribution of is heavily skewed towards , OWSP and USP significantly outperform NSP. Figure 3 (middle) shows that USP can underperform NSP, but when averaged over all possible values of , USP performs better than NSP. Figure 3 (right), we observe that OWSP can underperform NSP and USP, but, when compared with the left and middle column, the performance of OWSP is close to that of NSP and USP. Furthermore, when averaged over all possible values of , OWSP outperforms the other two policies. Appendix C provides representative examples in which each of these joint distributions could appear.

### 4.2 Finite Confounded Data

Given only confounded data, we test the performance of the OWSP against NSP and USP. In Figure 4, the absolute error is measured as a function of the number of confounded samples in step sizes that increment in the log scale from to 10,000 while fixing the number of deconfounded samples to . Figure 4 (left) is generated by averaging over 13,000 instances, and each consisted of replications, and it compares three offline sampling selection policies. Since when we only have confounded samples, the three sampling policies are identical, the error curves corresponding to NSP, USP and OWSP start at the same point on the top left corner. We observe that as the number of confounded samples increases, OWSP quickly outperforms NSP and USP on average, and the gaps between OWSP and the other two selection policies widen. Notice that the average absolute errors of the three selection policies do not converge to in this setting because we fix the amount of deconfounded samples to be .

Figure 4 (middle) contains the 13,000 instances described above averaged over replications. It compares the performance of OWSP with that of the NSP on an instance level. Similarly, Figure 4 (right) compares the performance of OWSP with that of the USP. In both figures, We fix the number of confounded samples to be . We observe that OWSP dominates NSP and USP in the majority of instances.

### 4.3 Real-World Experiments: Cancer Mutations

In the previous experiments, we chose the underlying distribution uniformly from the unit -Simplex. However, real-world problems of interest do not follow this distribution. Thus, we illustrate the usefulness of our methods using a real-world dataset, where we take the underlying distribution to be the empirical distribution over the data. In particular, we first pick three variables to be the outcome, treatment, and confounder, and then artificially hide the values of the confounder. Finally, we evaluate our proposed sampling methods under the assumption that we have access to infinitely many confounded samples.

Data. The Catalogue Of Somatic Mutations In Cancer (COSMIC) is a public database of DNA sequences of tumor samples. It consists of targeted gene-screening panels aggregated and manually curated from over 25,000 peer-reviewed papers. We focus on three variables: primary cancer site, gene, and the age of the patient at the time of the genetic test. Specifically, for a total of 1,350,015 cancer patients, we collected their age, type of cancer, and for a subset of genes, whether or not a mutation was observed in each gene. Ages were converted to binary values by setting a threshold at 45 years old.

Causal Models. In our experiments, we designate cancer type as the outcome, mutation as the treatment, and age as the confounder. This might be plausible because older people have a larger accumulated exposure to radiation and thus a higher probability of having somatic mutations. On the other hand, as people age, their immune systems become weaker (Montecino-Rodriguez et al., 2013), making them more susceptible to a particular type of cancer (outcome).

The top most commonly mutated genes were selected as treatment candidates. For each combination of a cancer type and one of these genes, we removed patients for whom this gene was not sequenced, and kept all pairs that had at least patients in each of the four treatment-outcome groups (to ensure our deconfounding policies would have enough samples to deconfound). This procedure gave us unique combinations of a cancer (outcome), gene (treatment), and age (confounder).

Since, on average, each {cancer, mutation, age} tuple contains around 94,619 patients, we took the estimated empirical distribution as the data-generating distribution and applied the ATE formula described in Section 3 to obtain the “true” ATE. To model the unobserved confounder, we hid the age parameter, only revealing it to a sampling policy when it requested a deconfounded sample. We compared the use of deconfounded data alone with the incorporation of confounded data under the three sampling selection policies: NSP, USP, and OWSP.

Results Figure 5 (left) was generated with the instances described above, each repeated for 10,000 replications. The absolute error is measured as a function of the number of deconfounded samples in step sizes of . First, similar to Figure 2, we observe that incorporating confounded data reduces the absolute estimation error by a large margin. Note the improvement of OWSP over NSP is larger in this case as compared to that seen in Figure 2. Furthermore, when the number of deconfounded samples is small, OWSP outperforms USP.

In Figure 5 (middle, right), we fix the number of deconfounded samples to be , and compare the performance of OWSP against that of NSP and USP, respectively. Both figures contain the instances in the left figure, averaged over 10,000 replications. We observe that under this setup, OWSP dominates NSP in all instances, and outperforms USP in the majority of instances.

## 5 Online Sample Selection Policies

So far, we have established the benefits of using confounded data (alongside deconfounded data) in the offline setting, i.e. where the fraction of our deconfounded data allocated to each group (as defined by the values of and ) must be specified upfront. Note that offline policies cannot utilize any information about . We conclude by introducing an online setting where we can choose which group to sample from sequentially (over periods) and adaptively, incorporating information from all previous steps. For notational convenience, we will focus on the case where the confounder is binary, i.e., , and denote .

We will again assume that we have infinitely many confounded samples. Recall that this implies is known for all , and thus, the only parameters we need to estimate are , for all . At each period, we can represent the state as a tuple

$$(n_{00}, n_{01}, n_{10}, n_{11}, \hat{q}_{00}, \hat{q}_{01}, \hat{q}_{10}, \hat{q}_{11}),$$

where is the number of previous periods where group was chosen, and is the proportion of these periods for which was observed (initialized to be if ). The ’s coincide with our running estimates of the ’s, and we will continue to estimate the by plugging these estimates into (1). An online policy is then a mapping from the state to the four groups. Here, we propose two online policies for selective deconfounding.

Greedy: For each group , we approximate the expected distance from the currently estimated ATE to the new estimated ATE if we were to observe one additional sample from the group, and we do so by substituting the estimated ’s for the true data generating parameters (which we do not know). We then sample from the group for which the following quantity is the largest, i.e., the group that induces the largest expected change in the estimated ATE:

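$$\mathbb{E}\left[\left|\widehat{\mathrm{ATE}}\left(q_{yt} = \frac{n_{yt}\hat{q}_{yt} + B(\hat{q}_{yt})}{n_{yt}+1}\right) - \widehat{\mathrm{ATE}}\left(q_{yt} = \hat{q}_{yt}\right)\right|\right],$$

where $B(\hat{q}_{yt})$ denotes a Bernoulli variable with mean $\hat{q}_{yt}$ (over which the expectation is taken), and $\widehat{\mathrm{ATE}}(q_{yt} = c)$ is the estimated ATE when we update the estimate for $q_{yt}$ to $c$, and keep all other estimates unchanged. Intuitively, maximizing this quantity is meant as an approximate proxy for minimizing the expected distance between the estimated and true ATEs.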
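The Greedy rule can be sketched as follows for a binary confounder. The code assumes that the scalar $q_{yt}$ stands for $P(Z=1 \mid Y=y, T=t)$ and that every group already holds at least one sample with a running estimate strictly between 0 and 1; both are assumptions made here for illustration, as are the function names and the example state.

```python
def ate_plugin(a, q):
    """Equation (1) for a binary confounder; q[(y,t)] is read as P(Z=1 | y,t)."""
    ate = 0.0
    for z in (0, 1):
        joint = {g: a[g] * (q[g] if z == 1 else 1.0 - q[g]) for g in a}
        y1 = {t: joint[(1, t)] / (joint[(0, t)] + joint[(1, t)]) for t in (0, 1)}
        ate += (y1[1] - y1[0]) * sum(joint.values())
    return ate


def greedy_scores(a, q_hat, n):
    """Expected absolute change in the estimated ATE from one more sample per group."""
    current = ate_plugin(a, q_hat)
    scores = {}
    for g in a:
        score = 0.0
        # The next observation is modeled as B ~ Bernoulli(q_hat[g]).
        for b, prob in ((1, q_hat[g]), (0, 1.0 - q_hat[g])):
            updated = dict(q_hat)
            updated[g] = (n[g] * q_hat[g] + b) / (n[g] + 1)
            score += prob * abs(ate_plugin(a, updated) - current)
        scores[g] = score
    return scores


def greedy_choice(a, q_hat, n):
    """Group (y, t) with the largest expected change in the estimated ATE."""
    scores = greedy_scores(a, q_hat, n)
    return max(scores, key=scores.get)


# Hypothetical state after five samples from each group.
a = {(0, 0): 0.4, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.3}
q_hat = {(0, 0): 0.3, (1, 0): 0.6, (0, 1): 0.5, (1, 1): 0.7}
n = {g: 5 for g in a}
print(greedy_choice(a, q_hat, n), greedy_scores(a, q_hat, n))
```

Because each update moves $\hat{q}_{yt}$ by at most $1/(n_{yt}+1)$, the scores shrink as a group accumulates samples, which steers the policy toward groups whose estimates are still volatile or whose weight in the ATE is large.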

-Step Lookahead: As a generalization of Greedy, which by definition looks one period ahead, we can consider policies that look samples ahead. More specifically, we consider all online policies over the next periods, and calculate the expected distance from the currently estimated ATE to the estimated ATE periods from now. Just as in Greedy, we choose the policy that maximizes this quantity, and implement the policy’s first action. That is, we repeat the entire process at every period rather than every periods. The -Step Lookahead policies should improve with increased , but this comes at a computational cost exponential in .

Synthetic Experiments As in the experiments in Section 4, Figure 6 was generated using 1,300 instances sampled uniformly from the unit -Simplex, and each instance contains replications. The absolute error is measured as a function of the number of deconfounded samples in step sizes of . We observe that the Greedy and 2-Step Lookahead selection policies outperform OWSP (Figure 6). Figure 6 (middle, right) contain points, each corresponding to one instance averaged over replications. The right scatter plot suggests that there are benefits of selecting sample online. However, the middle plot suggests the improvement might be limited.

## 6 Conclusion

Our theoretical results upper bound the amount of deconfounded data required under each sample selection policy, and provide insights into why the outcome-weighted selection policy works better on average than the natural selection policy. We point to several promising directions for potential future research. First, we plan to extend our results to more general causal problems, including linear and semi-parametric causal models. Second, we plan to investigate scenarios with multiple confounders that may not always be observed simultaneously. Third, we could extend the idea of selective revelation of information beyond confounders to incorporate other variables, such as mediators or proxies.

## Appendix A Review of Classical Results in Concentration Inequalities

Before embarking on our proofs, we state some classic results that we will use frequently. The following concentration inequalities are part of a family of results collectively referred to as Hoeffding’s inequality (e.g., see Vershynin [2018]).

###### Lemma 1 (Hoeffding’s Lemma).

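Let $X$ be any real-valued random variable with expected value $\mathbb{E}[X] = \mu$, such that $a \le X \le b$ almost surely. Then, for all $\lambda \in \mathbb{R}$,

$$\mathbb{E}\left[e^{\lambda (X - \mu)}\right] \le \exp\left(\frac{\lambda^2 (b-a)^2}{8}\right).$$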

###### Theorem 5 (Hoeffding’s inequality for general bounded r.v.s).

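Let $X_1, \dots, X_N$ be independent random variables such that $X_i \in [m_i, M_i]$ for every $i$. Then, for any $t > 0$, we have

$$P\left(\sum_{i=1}^{N} \left(X_i - \mathbb{E}[X_i]\right) \ge t\right) \le \exp\left(-\frac{2t^2}{\sum_{i=1}^{N} (M_i - m_i)^2}\right).$$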
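As a small worked instance of how such a bound turns into a sample size, consider the mean of $m$ i.i.d. variables taking values in $[0,1]$. Applying the one-sided inequality to both tails gives $P(|\text{mean} - \text{expectation}| \ge t) \le 2\exp(-2mt^2)$. The sketch below inverts this for $m$; the failure probability `delta` and the function names are introduced here only for illustration.

```python
import math


def hoeffding_tail(m, t):
    """Two-sided Hoeffding bound for the mean of m i.i.d. variables in [0, 1]."""
    return 2.0 * math.exp(-2.0 * m * t * t)


def hoeffding_sample_size(t, delta):
    """Smallest m for which hoeffding_tail(m, t) <= delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * t * t))


m = hoeffding_sample_size(t=0.05, delta=0.05)
print(m, hoeffding_tail(m, 0.05), hoeffding_tail(m - 1, 0.05))
```

The required sample size grows like $1/t^2$ in the target accuracy and only logarithmically in $1/\delta$, which is the same scaling that appears in the sample-complexity bounds of the main text.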

## Appendix B Proofs

To begin, recall the notation introduced in Section 3: we model the binary-valued treatment, the binary-valued outcome, and the categorical confounder as the random variables $T$, $Y$, and $Z$, respectively. The underlying joint distribution of these three random variables is represented as $P_{Y,T,Z}$. To save on space for terms that are used frequently, we define the following shorthand notation:

$$p^z_{yt} = P_{Y,T,Z}(y,t,z), \qquad a_{yt} = P_{Y,T}(y,t), \qquad q^z_{yt} = P_{Z|Y,T}(z \mid y,t).$$

These terms appear frequently because, to estimate the entire joint distribution on $(Y,T,Z)$ (the $p^z_{yt}$'s), it suffices to estimate the joint distribution on $(Y,T)$ (the $a_{yt}$'s), along with the conditional distribution of $Z$ on $(Y,T)$ (the $q^z_{yt}$'s):

$$p^z_{yt} = a_{yt}\,q^z_{yt}.$$

Finally, let $\hat{p}^z_{yt}$, $\hat{a}_{yt}$, and $\hat{q}^z_{yt}$ be the empirical estimates of $p^z_{yt}$, $a_{yt}$, and $q^z_{yt}$, respectively, using the MLE.

### B.1 Proof of Theorem 1

###### Theorem 1.

Using deconfounded data alone, is satisfied if the sample size is at least

$$m_{\mathrm{base}} := \max_{t,z} \frac{C}{\left(\sum_y p^z_{yt}\right)^2} = \max_{t,z} \frac{1}{P_{T,Z}(t,z)^2}\,C.$$

###### Proof of Theorem 1.

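This proof proceeds as follows: first, we prove a sufficient (deterministic) condition, on the errors of our estimates of the $p^z_{yt}$'s, under which $|\widehat{\mathrm{ATE}} - \mathrm{ATE}|$ is small. Second, we show that the errors of our estimates of the $p^z_{yt}$'s are indeed small with high probability.

Step 1: First, we can write the ATE in terms of the $p^z_{yt}$'s as follows:

$$\mathrm{ATE} = \sum_z \big(P_{Y|T,Z}(1 \mid 1,z) - P_{Y|T,Z}(1 \mid 0,z)\big)P_Z(z) = \sum_z \left(\left(\frac{p^z_{11}}{\sum_y p^z_{y1}} - \frac{p^z_{10}}{\sum_y p^z_{y0}}\right)\left(\sum_{y,t} p^z_{yt}\right)\right).$$

In order for the ATE to be well-defined, we assume $\sum_y p^z_{yt} > 0$ for all $t, z$ throughout. We can then decompose $|\widehat{\mathrm{ATE}} - \mathrm{ATE}|$:

$$\begin{aligned}
|\widehat{\mathrm{ATE}} - \mathrm{ATE}| &= \left|\sum_z \left(\left(\frac{\hat{p}^z_{11}}{\sum_y \hat{p}^z_{y1}} - \frac{\hat{p}^z_{10}}{\sum_y \hat{p}^z_{y0}}\right)\left(\sum_{y,t} \hat{p}^z_{yt}\right) - \left(\frac{p^z_{11}}{\sum_y p^z_{y1}} - \frac{p^z_{10}}{\sum_y p^z_{y0}}\right)\left(\sum_{y,t} p^z_{yt}\right)\right)\right| \\
&\le \sum_z \left|\left(\frac{\hat{p}^z_{11}}{\sum_y \hat{p}^z_{y1}} - \frac{\hat{p}^z_{10}}{\sum_y \hat{p}^z_{y0}}\right)\left(\sum_{y,t} \hat{p}^z_{yt}\right) - \left(\frac{p^z_{11}}{\sum_y p^z_{y1}} - \frac{p^z_{10}}{\sum_y p^z_{y0}}\right)\left(\sum_{y,t} p^z_{yt}\right)\right|.
\end{aligned}$$

Thus, in order to upper bound $|\widehat{\mathrm{ATE}} - \mathrm{ATE}|$ by some $\epsilon$, it suffices to show that

$$\left|\left(\frac{\hat{p}^z_{11}}{\sum_y \hat{p}^z_{y1}} - \frac{\hat{p}^z_{10}}{\sum_y \hat{p}^z_{y0}}\right)\left(\sum_{y,t} \hat{p}^z_{yt}\right) - \left(\frac{p^z_{11}}{\sum_y p^z_{y1}} - \frac{p^z_{10}}{\sum_y p^z_{y0}}\right)\left(\sum_{y,t} p^z_{yt}\right)\right| \le \frac{\epsilon}{k}, \quad \forall z. \tag{2}$$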

Step 2: To bound the above terms, we first derive Lemma 2 for bounding the error of the product of two estimates in terms of their two individual errors:

###### Lemma 2.

For any , and , suppose there exists such that all of the following conditions hold:

Then, $|uv - \hat{u}\hat{v}| \le \epsilon$.

###### Proof of Lemma 2.

Since , we have , and similarly, from , we have . Thus,

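$$\begin{aligned}
|uv - \hat{u}\hat{v}| &\le \max\big(\left|uv - (u + (1-\theta)\epsilon)(v + \theta\epsilon)\right|,\ \left|uv - (u - (1-\theta)\epsilon)(v - \theta\epsilon)\right|\big) && (\text{because } v, \hat{v} \ge 0) \\
&= \max\big(\left|\theta u\epsilon + (1-\theta)v\epsilon + (1-\theta)\theta\epsilon^2\right|,\ \left|\theta u\epsilon + (1-\theta)v\epsilon - (1-\theta)\theta\epsilon^2\right|\big) \\
&= \left|\theta u\epsilon + (1-\theta)v\epsilon + (1-\theta)\theta\epsilon^2\right| && (\text{because } (1-\theta)\theta\epsilon^2 > 0) \\
&\le \left|\theta(u + \epsilon)\epsilon + (1-\theta)v\epsilon\right| && (\text{because } \theta\epsilon^2 > (1-\theta)\theta\epsilon^2) \\
&\le \epsilon && (\text{because } u + \epsilon \in [-1,1], \text{ and } v \le 1)
\end{aligned}$$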

We can apply Lemma 2 directly to the terms in (2) by setting

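$$u_z = \frac{p^z_{11}}{\sum_y p^z_{y1}} - \frac{p^z_{10}}{\sum_y p^z_{y0}}, \quad \hat{u}_z = \frac{\hat{p}^z_{11}}{\sum_y \hat{p}^z_{y1}} - \frac{\hat{p}^z_{10}}{\sum_y \hat{p}^z_{y0}}, \quad v_z = \sum_{y,t} p^z_{yt}, \quad \hat{v}_z = \sum_{y,t} \hat{p}^z_{yt},$$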

and noting that , and . Lemma 2 implies that the upper bound in (2) holds if, for some , we have

$$|v_z - \hat{v}_z| < \frac{\theta}{k}\epsilon \quad \text{and} \quad |u_z - \hat{u}_z| < \frac{1-\theta}{k}\epsilon.$$

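While we can apply standard concentration results to the $|v_z - \hat{v}_z|$ terms, the $|u_z - \hat{u}_z|$ terms will need to be further decomposed:

$$\begin{aligned}
|u_z - \hat{u}_z| &= \left|\frac{p^z_{11}}{\sum_y p^z_{y1}} - \frac{p^z_{10}}{\sum_y p^z_{y0}} - \frac{\hat{p}^z_{11}}{\sum_y \hat{p}^z_{y1}} + \frac{\hat{p}^z_{10}}{\sum_y \hat{p}^z_{y0}}\right| \\
&\le \left|\frac{p^z_{11}}{\sum_y p^z_{y1}} - \frac{\hat{p}^z_{11}}{\sum_y \hat{p}^z_{y1}}\right| + \left|\frac{p^z_{10}}{\sum_y p^z_{y0}} - \frac{\hat{p}^z_{10}}{\sum_y \hat{p}^z_{y0}}\right|.
\end{aligned}$$

It will suffice to show that for each $t$ and $z$,

$$\left|\frac{p^z_{1t}}{\sum_y p^z_{yt}} - \frac{\hat{p}^z_{1t}}{\sum_y \hat{p}^z_{yt}}\right| < \frac{1-\theta}{2k}\epsilon. \tag{3}$$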

Step 3: To bound these terms, we derive Lemma 3. Recall that .

###### Lemma 3.

For any , if and , then

$$\left|\frac{w}{w+s} - \frac{\hat{w}}{\hat{w}+\hat{s}}\right| \le 2\epsilon.$$

###### Proof of Lemma 3.

First, since , we have that

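$$\left|\frac{w+s}{\hat{w}+\hat{s}} - 1\right| \le \frac{w+s}{\hat{w}+\hat{s}}\,\epsilon,$$

or equivalently,

$$1 - \frac{w+s}{\hat{w}+\hat{s}}\,\epsilon \le \frac{w+s}{\hat{w}+\hat{s}} \le 1 + \frac{w+s}{\hat{w}+\hat{s}}\,\epsilon.$$

We can apply this inequality and rearrange terms as follows to conclude the proof:

$$\begin{aligned}
\left|\frac{w}{w+s} - \frac{\hat{w}}{\hat{w}+\hat{s}}\right| &= \left|\frac{1}{w+s}\right| \left|w - \hat{w}\,\frac{w+s}{\hat{w}+\hat{s}}\right| \\
&\le \left|\frac{1}{w+s}\right| \max\left(\left|w - \hat{w}\left(1 - \frac{w+s}{\hat{w}+\hat{s}}\,\epsilon\right)\right|,\ \left|w - \hat{w}\left(1 + \frac{w+s}{\hat{w}+\hat{s}}\,\epsilon\right)\right|\right) \\
&= \left|\frac{1}{w+s}\right| \max\left(\left|w - \hat{w} + \frac{w+s}{\hat{w}+\hat{s}}\,\hat{w}\epsilon\right|,\ \left|w - \hat{w} - \frac{w+s}{\hat{w}+\hat{s}}\,\hat{w}\epsilon\right|\right) \\
&= \max\left(\left|\frac{w - \hat{w}}{w+s} + \frac{\hat{w}}{\hat{w}+\hat{s}}\,\epsilon\right|,\ \left|\frac{w - \hat{w}}{w+s} - \frac{\hat{w}}{\hat{w}+\hat{s}}\,\epsilon\right|\right) \\
&\le \left|\frac{w - \hat{w}}{w+s}\right| + \left|\frac{\hat{w}}{\hat{w}+\hat{s}}\right|\epsilon \\
&\le \left|\frac{w+s}{w+s}\right|\epsilon + \left|\frac{\hat{w}}{\hat{w}+\hat{s}}\right|\epsilon \\
&\le 2\epsilon.
\end{aligned}$$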

The second to last inequality follows from the assumption that . ∎

Lemma 3 implies that (3) is satisfied if

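$$\left|p^z_{1t} - \hat{p}^z_{1t}\right| < \left(\sum_y p^z_{yt}\right)\frac{1-\theta}{4k}\epsilon \quad \text{and} \quad \left|p^z_{1t} + p^z_{0t} - \hat{p}^z_{1t} - \hat{p}^z_{0t}\right| < \left(\sum_y p^z_{yt}\right)\frac{1-\theta}{4k}\epsilon.$$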

Step 3: We’ve shown above that is satisfied when

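$$|v_z - \hat{v}_z| < \frac{\theta}{k}\epsilon, \quad \left|p^z_{1t} - \hat{p}^z_{1t}\right| < \left(\sum_y p^z_{yt}\right)\frac{1-\theta}{4k}\epsilon, \quad \text{and} \quad \left|p^z_{1t} + p^z_{0t} - \hat{p}^z_{1t} - \hat{p}^z_{0t}\right| < \left(\sum_y p^z_{yt}\right)\frac{1-\theta}{4k}\epsilon, \quad \forall t, z.$$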

Note that if then

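$$|v_z - \hat{v}_z| = \left|\sum_{y,t} p^z_{yt} - \sum_{y,t} \hat{p}^z_{yt}\right| \le \sum_t \left|\sum_y p^z_{yt} - \sum_y \hat{p}^z_{yt}\right| < \left(\sum_{y,t} p^z_{yt}\right)\frac{1-\theta}{4k}\epsilon \le \frac{1-\theta}{4k}\epsilon.$$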

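Thus, to remove the first constraint $|v_z - \hat{v}_z| < \frac{\theta}{k}\epsilon$, we set

$$\frac{\theta}{k}\epsilon = \frac{1-\theta}{4k}\epsilon,$$

and obtain $\theta = 1/5$.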

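Step 4: To summarize so far, Lemmas 2 and 3 allow us to upper bound the error of our estimated ATE in terms of upper bounds on the error of our estimates of its constituent terms:

$$P\left(|\widehat{\mathrm{ATE}} - \mathrm{ATE}| < \epsilon\right) \ge P\left(\bigcap_{t,z}\left\{\left|p^z_{1t} - \hat{p}^z_{1t}\right| < \frac{\sum_y p^z_{yt}}{5k}\epsilon\right\} \bigcap_{t,z}\left\{\left|p^z_{1t} + p^z_{0t} - \hat{p}^z_{1t} - \hat{p}^z_{0t}\right| < \frac{\sum_y p^z_{yt}}{5k}\epsilon\right\}\right),$$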

or equivalently,

 P(|ˆATE−ATE|≥ϵ)≤P(⋃t,z{∣∣pz