Efficiently Learning and Sampling Interventional Distributions from Observations¹

¹ Author names are in alphabetical order.


Abstract

We study the problem of efficiently estimating the effect of an intervention on a single variable (atomic interventions) using observational samples in a causal Bayesian network. Our goal is to give algorithms that are efficient in both time and sample complexity in a non-parametric setting.

Tian and Pearl (AAAI '02) exactly characterized the class of causal graphs for which causal effects of atomic interventions can be identified from observational data. We make their result quantitative. Suppose M is a causal model on a set V of observable variables with respect to a given causal graph G, and let P denote the observational distribution over V. Let P_x denote the interventional distribution over the observables with respect to an intervention that sets a designated variable X to x. We show that, assuming G has bounded in-degree and bounded c-components, and that the observational distribution P is identifiable and satisfies a certain strong positivity condition:

1. [Evaluation] There is an algorithm that, with high probability, outputs an evaluator for a distribution P̂ satisfying d_TV(P̂, P_x) ≤ ε, using a bounded number of samples from P and bounded running time. The evaluator can efficiently return the probability P̂(v) for any assignment v to V.

2. [Generation] There is an algorithm that, with high probability, outputs a sampler for a distribution P̂ satisfying d_TV(P̂, P_x) ≤ ε, using a bounded number of samples from P and bounded running time. On each call, the sampler efficiently returns an i.i.d. sample from P̂.

We extend our techniques to estimate marginals of P_x over a given subset of variables of interest. We also show lower bounds on the sample complexity, demonstrating that our sample complexity has an optimal dependence on the key parameters of the model as well as on the strong positivity parameter α.

1 Introduction

A causal model for a system of variables describes not only how the variables are associated with each other but also how they would change if they were to be acted on by an external force. For example, in order to have a proper discussion about global warming, we need more than just an associational model which would give the correlation between human CO2 emissions and Arctic temperature levels. We instead need a causal model which would predict the climatological effects of humans reducing CO2 emissions by (say) 20% over the next five years. Notice how the two can give starkly different pictures: if global warming is being propelled by natural weather cycles, then changing human emissions won’t make any difference to temperature levels, even though human emissions and temperature may be correlated in our dataset (just because both are increasing over the timespan of our data).

Causality has been a topic of inquiry since ancient times, but a modern, rigorous formulation of causality came about in the twentieth century through the works of Pearl, Robins, Rubin, and others [IR15, PEA09, RSR11, HR20]. In particular, Pearl [PEA09] recast causality in the language of causal Bayesian networks (or causal Bayes nets for short). A causal Bayes net is a standard Bayes net that is reinterpreted causally. Specifically, it makes the assumption of modularity: for any variable V, the dependence of V on its parents is an autonomous mechanism that does not change even if other parts of the network are changed. This allows assessment of external interventions, such as those encountered in policy analysis, treatment management, and planning. The idea is that by virtue of the modularity assumption, an intervention simply amounts to a modified Bayes net where some of the parent-child mechanisms are altered while the rest are kept the same.

The underlying structure of a causal Bayes net is a directed acyclic graph G. The graph consists of nodes corresponding to the observable variables V together with additional nodes corresponding to a set of hidden variables U. We assume that the observable variables take values over a finite alphabet Σ. By interpreting the model as a standard Bayes net over V ∪ U and then marginalizing to V, we get the observational distribution P on V. The modularity assumption allows us to define the result of an intervention. An intervention is specified by a subset X ⊆ V of variables and an assignment x to X. In the interventional distribution, the variables in X are fixed to x, while every other variable V_i is sampled as it would have been in the original Bayes net, according to the conditional distribution Pr[V_i | Pa(V_i)], where Pa(V_i) (the parents of V_i) consists of either variables previously sampled in the topological order of G or variables in X set by the intervention. The marginal of the resulting distribution to V is the interventional distribution, denoted by P_x. We sometimes also use do(x) to denote the intervention process.

In this work, we focus our attention on the case that X is a single observable variable, so that interventions on X are atomic. We study the following estimation problems:

1. (Evaluation) Given an intervention x to X, construct an evaluator for P_x which estimates the value of the probability mass function

$$P_x(v) \stackrel{\mathrm{def}}{=} \Pr_{V \sim P_x}[V = v]$$

for any assignment v to V. The goal is to construct the evaluator using only a bounded number of samples from the observational distribution P, and moreover, the evaluator should run efficiently.

2. (Generation) Given an intervention x to X, construct a generator for P_x which generates i.i.d. samples from a distribution that approximates P_x. The goal is to construct the generator using only a bounded number of samples from the observational distribution P, and moreover, the generator should be able to output each sample efficiently.

We study these problems in the non-parametric setting, where we assume that all the observable variables under consideration take values over a finite alphabet Σ.
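To make the generation task concrete, here is a minimal sketch (ours, purely illustrative; all names are assumptions, not the paper's notation) of a generator for a fully specified Bayes net over a finite alphabet: ancestral sampling, which visits variables in topological order and samples each from its conditional distribution given its already-sampled parents.

```python
import random

def ancestral_sample(order, parents, cpds, rng):
    """Draw one sample from a Bayes net by visiting variables in
    topological order and sampling each from its conditional
    distribution given the already-sampled values of its parents."""
    v = {}
    for var in order:
        pa = tuple(v[p] for p in parents[var])
        dist = cpds[var][pa]            # dict: value -> probability
        vals = list(dist)
        weights = [dist[val] for val in vals]
        v[var] = rng.choices(vals, weights=weights)[0]
    return v

# Toy net A -> B with deterministic mechanisms, for illustration only
parents = {"A": [], "B": ["A"]}
cpds = {"A": {(): {1: 1.0}}, "B": {(1,): {0: 1.0}}}
print(ancestral_sample(["A", "B"], parents, cpds, random.Random(0)))
```

The estimation problem studied in this paper is harder: the conditional distributions are unknown and must themselves be estimated from observational samples.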

Evaluation and generation are two very natural inference problems. Indeed, the influential work of Kearns et al. [KMR+94] introduced the computational framework of distribution learning in terms of these two problems. Over the last 25 years, work on distribution learning has clarified how classical techniques in statistics can be married to new algorithmic ideas in order to yield sample- and time-efficient algorithms for learning very general classes of distributions; see [DIA16] for a recent survey of the area. The goal of our work is to initiate a similar computational study of the fundamental problems in causal inference.

The crucial distinction of our setting from the distribution-learning setting is that the algorithm does not get samples from the distribution of interest. In our setting, the algorithm receives as input samples from P, while its goal is to estimate the distribution P_x. This is motivated by the fact that randomized experiments are typically hard (or unethical) to conduct, while observational samples are easy to collect. Even if we disregard computational considerations, it may be impossible to determine the interventional distribution P_x from the observational distribution P and knowledge of the causal graph G. The simplest example is the so-called “bow-tie graph” on two observable variables X and Y (with X being a parent of Y) and a hidden variable U that is a parent of both X and Y. Here, it is easy to see that P does not uniquely determine P_x. Tian and Pearl [TP02a] studied the general question of when the interventional distribution P_x is identifiable from the observational distribution P. They characterized the class of directed acyclic graphs with hidden variables such that for any graph G in this class, for any causal Bayes net on G, and for any intervention x to X, P_x is identifiable from P. Thus, throughout this work we assume that G belongs to this class, because otherwise P_x is not identifiable, even with an infinite number of observations.

We design sample- and time-efficient algorithms for the above-mentioned estimation problems. Our starting point is the work of Tian and Pearl [TP02a], which (as well as other related work on identifiability) assumes, in addition to the underlying graph being identifiable, that the distribution P is positive, meaning that P(v) > 0 for all assignments v to V. We show that under reasonable assumptions about the structure of G, we only need to assume strong positivity for the marginal of P over a bounded number of variables to design our algorithms. We extend our techniques to the problem of efficiently estimating the marginal interventional distributions over a subset of observable variables. Finally, we establish a lower bound on the sample complexity showing that our sample complexity has near-optimal dependence on the parameters of interest. We discuss our results in detail next.

2 Our Contributions

Let M be a causal Bayes net over a graph G, in which the set of observable variables is denoted by V and the set of hidden variables is denoted by U. Let n = |V|. There is a standard procedure in the causality literature (see [TP02b]) to convert G into a graph on n nodes. Namely, under the semi-Markovian assumption that each hidden variable does not have any parents and affects exactly two observable variables V_i and V_j, we remove U from G and put a bidirected edge between V_i and V_j. We end up with an Acyclic Directed Mixed Graph (ADMG) having n nodes corresponding to the variables in V and having edge set E = E_d ∪ E_b, where E_d are the directed edges and E_b are the bidirected edges. Figure 1 shows an example. The in-degree of an ADMG is the maximum number of directed edges coming into any node. A c-component refers to any maximal subset of nodes/variables which is connected using only bidirected edges. Then V gets partitioned into c-components: S_1, …, S_ℓ.
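The c-components of an ADMG are just the connected components of the graph restricted to its bidirected edges. A minimal illustrative sketch (our names, not the paper's):

```python
def c_components(nodes, bidirected_edges):
    """Partition `nodes` into c-components: connected components of
    the graph formed by the bidirected edges alone."""
    adj = {v: set() for v in nodes}
    for u, v in bidirected_edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for v in nodes:
        if v in seen:
            continue
        comp, stack = set(), [v]
        while stack:
            w = stack.pop()
            if w in comp:
                continue
            comp.add(w)
            stack.extend(adj[w] - comp)
        seen |= comp
        comps.append(comp)
    return comps

# Example: X <-> Y bidirected, Z unconfounded (directed edges play no role)
comps = c_components(["X", "Y", "Z"], [("X", "Y")])
```

On this example the partition is {X, Y} and {Z}: two c-components.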

Let X be a designated variable in V. Without loss of generality, suppose X ∈ S_1. Tian and Pearl [TP02a] showed that the class of ADMGs for which P_x is identifiable from P, for any causal Bayes net on the graph and any intervention x to X, consists of exactly those graphs that satisfy Assumption 2.1 below (see Theorem 3 of [TP02a]).

Assumption 2.1 (Identifiability with respect to X).

There does not exist a path of bidirected edges between X and any child of X. Equivalently, no child of X belongs to S_1, the c-component containing X.

The second assumption we make is about the observational distribution P. For a subset of variables S ⊆ V, let Pa⁺(S) = S ∪ Pa(S), where Pa(S) denotes the observable parents of S in the graph.

Assumption 2.2 (α-strong positivity with respect to X).

Suppose X lies in the c-component S_1, and let Z = Pa⁺(S_1). For every assignment z to Z, P(z) ≥ α.

So, if Pa⁺(S_1) is small, then Assumption 2.2 only requires that a small set of variables take on each possible configuration with non-negligible probability. When Assumption 2.2 holds, we say that the causal Bayes net is α-strongly positive with respect to X.

2.1 Algorithms

Suppose M is an unknown causal Bayes net over a known ADMG G on n observable variables that satisfies identifiability (Assumption 2.1) and α-strong positivity (Assumption 2.2) with respect to a variable X. Let d denote the maximum in-degree of the graph and k denote the size of its largest c-component.

We first present an efficient algorithm for the evaluation problem.

Theorem 2.3 (Evaluation).

For any intervention x to X and parameter ε > 0, there is an algorithm that takes a bounded number of samples from P and, in bounded time, returns a circuit. With high probability, this circuit on any input v runs efficiently and outputs P̂(v), where P̂ is a distribution satisfying d_TV(P̂, P_x) ≤ ε.

We then extend the techniques used for Theorem 2.3 to design an efficient generator for P_x.

Theorem 2.4 (Generation).

For any intervention x to X and parameter ε > 0, there is an algorithm that takes a bounded number of samples from P and, in bounded time, returns a probabilistic circuit that generates samples of a distribution P̂ satisfying d_TV(P̂, P_x) ≤ ε. On each call, the circuit runs efficiently and, with high probability, outputs a sample of P̂.

We now discuss the problem of estimating (P_x)_Y, i.e., the marginal of the interventional distribution, upon intervention x to X, over a subset Y of the observables. We show finite-sample bounds for estimating (P_x)_Y when the causal Bayes net satisfies Assumption 2.1 and Assumption 2.2, thus obtaining quantitative counterparts to the results shown in [TP02a] (see Theorem 4 of [TP02a]). We use m to denote the cardinality |Y|.

A generator for P_x obviously also gives a generator for the marginal of P_x on any subset Y ⊆ V. We observe that given a generator, we can also learn an approximate evaluator for the marginal of P_x on Y sample-efficiently. This is because, using samples from the generator, we can learn an explicit description of its marginal on Y up to total variation distance ε with high probability, by simply using the empirical estimator. Since the generated distribution is itself ε-close to P_x in total variation distance, we get an algorithm that, with constant probability, returns an evaluator for a distribution that is O(ε)-close to (P_x)_Y. Summarizing:

Corollary 2.5.

For any subset Y ⊆ V with |Y| = m, intervention x to X and parameter ε > 0, there is an algorithm that takes a bounded number of samples from P and returns an evaluator for a distribution P̂_Y on Y such that d_TV(P̂_Y, (P_x)_Y) ≤ ε.
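The empirical estimator invoked in the argument above is simply frequency counting over the generated samples; a minimal illustrative sketch (ours, not the paper's pseudocode):

```python
from collections import Counter

def empirical_estimator(samples):
    """Explicit description of a distribution from i.i.d. samples:
    map each observed outcome to its relative frequency."""
    counts = Counter(samples)
    n = len(samples)
    return {outcome: c / n for outcome, c in counts.items()}

print(empirical_estimator(["a", "b", "a", "a"]))  # {'a': 0.75, 'b': 0.25}
```

Standard concentration arguments show that over a domain of size |Σ|^m, a number of samples growing with |Σ|^m suffices for this estimate to be close in total variation distance, which is why the approach is only suitable for small m.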

Note that the time complexity of the above algorithm is exponential in m, as we need to take exponentially many samples from the generator. To handle problems that arise in practice for Y of small cardinality, it is of interest to develop sample- and time-efficient algorithms for estimating (P_x)_Y. In such cases the approach discussed above is superfluous, as its sample complexity depends linearly on n, the total number of variables in the model, which could potentially be large. We show that when m is small we can perform efficient estimation with small sample size. A more detailed discussion of our analysis of evaluating marginals, including the algorithms and proofs, can be found in Section 6. Precisely, we show the following theorem:

Theorem 2.6. For any subset Y ⊆ V with |Y| = m, intervention x to X and parameter ε > 0, there is an algorithm that takes a bounded number of samples from P, runs in bounded time, and returns an evaluator for a distribution P̂_Y on Y such that d_TV(P̂_Y, (P_x)_Y) ≤ ε.

2.2 Lower Bounds

We next address the question of whether the sample complexity of our algorithms has the right dependence on the parameters of the causal Bayes net as well as on α. We also explore whether Assumption 2.2 can be weakened. Since our focus in this section is on sample complexity rather than time complexity, we do not distinguish between evaluation and generation.

To get some intuition, consider the simple causal Bayes net depicted in Figure 1(a). Here, X does not have any parents and is not confounded with any variable. Y is a child of X, and suppose X and Y are boolean variables, where Pr[X = 1] = α for some small α > 0. Now, to estimate the interventional probability of Y = 1 under do(X = 1) to within ±ε, it is well known that Ω(1/ε²) samples with X = 1 are needed. Since X = 1 occurs with probability α, an Ω(1/(αε²)) lower bound on the sample complexity follows.

However, from this example, it is not clear that we need to enforce strong positivity on the parents of X or on the c-component containing X, since both are trivial here. Also, the sample complexity above has no dependence on the in-degree d or the c-component size k. The following theorem addresses these issues.

Theorem 2.7 (Lower bound). Fix integers d and k, and a set Z of a suitable size depending on d and k. For all sufficiently large n, there exists an ADMG G with n nodes and in-degree at most d so that the following hold. G contains a node X whose strong-positivity set Pa⁺(S_1) contains Z (where S_1 is the c-component containing X). For any α > 0, there exists a causal Bayes net on G over Σ-valued variables such that:

1. For the observational distribution P, the marginal on each individual variable is uniform, but the joint marginal on Z has mass at most α at some assignment.

2. There exists an intervention x on X such that learning the distribution P_x up to ε-distance in total variation with constant probability requires a number of samples from P that grows inversely with α.

So, P must have a guarantee that its marginal on Z has mass at least α at all points in order for an algorithm to learn P_x using a comparable number of samples. For comparison, our algorithms in Theorem 2.3 and Theorem 2.4 assume strong positivity for Pa⁺(S_1) and achieve a comparable sample complexity. For small values of d and k, the upper and lower bounds are qualitatively similar. It remains an open question to fully close the gap.

To hint towards the proof of the lower bound in Section 2.2, we sketch the argument in a simple special case. Figure 1(b) shows a graph where X has one parent Z and no hidden variables. Both X and Z are parents of Y, and all three are binary variables. Consider two causal models M_1 and M_2. For both M_1 and M_2, Z is uniform over {0, 1} and X agrees with Z except with small probability, so that one parent configuration of Y is rarely observed. The two models differ only in the conditional distribution of Y given that rare configuration. Then the interventional distributions under the two models are far from each other in total variation distance, while the observational distributions are nearly identical. On the other hand, it can be shown using Fano’s inequality that any algorithm needs to observe many samples to distinguish M_1 and M_2.

2.3 Previous Work

Identification of causal effects from the observational distribution has been studied extensively in the literature. Here we discuss some of the relevant literature in the non-parametric setting. When there are no unobservable variables (and hence the associated ADMG is a DAG), it is always possible to identify any given intervention from the observational distribution [PEA09, ROB86, SGS00]. However, when there are unobservable variables, causal effect identifiability in ADMGs is not always possible. A series of important works focused on establishing graphical criteria for identifiability of interventional distributions from the observational distribution [TP02a, SGS00, GP95, HAL00, KM99, PR95]. This led to a complete algorithm, first by Tian and Pearl for the identifiability of atomic interventions [TP02a] (the work most relevant to the present paper), and then by Shpitser and Pearl (algorithm ID) for the identifiability of any given intervention from the observational distribution [SP06] (see also [HV08]). Researchers have also investigated implementation aspects of the identification algorithms. In particular, an implementation of the algorithm ID has been carried out in the R package causaleffect [TK17]. This was followed by a sequence of works [TK17, TK18] in which the authors simplify ID and obtain a succinct representation of the target causal effect by removing unnecessary variables from the expression. Other software packages related to causal identifiability are also publicly available [42, 19, 33].

Researchers have also investigated non-parametric causal effect identification from observations on structures other than ADMGs. Some recent results in this direction include the work reported in [JZB19], where complete algorithms have been established for causal effect identifiability (and conditional causal effect identifiability) with respect to Markov equivalence class diagrams, a more general class of causal graphs. Maximally oriented partially directed acyclic graphs (MPDAGs) are yet another generalization of DAGs with no hidden variables. Very recently, complete algorithms for causal identification with respect to MPDAGs have been established [PER19]. Complete algorithms are also known for dynamic causal networks, a causal analogue of dynamic Bayesian networks that evolve over time [BAG16]. Causal chain event graphs (CEGs) are yet another class of graphs for which identifiability of interventions has been investigated and conditions (similar to Pearl’s back-door criterion) have been established [TSR10, THW13].

In a different line of work reported in [SS16], the authors introduce the notion of stability of causal identification, capturing the sensitivity of causal effects to small perturbations in the input. They show that the causal identification function is numerically unstable for the ID algorithm [SP06]. They also show that, in contrast, for atomic interventions (i.e., when X is a singleton), the identification algorithm of Tian and Pearl [TP02a] is not too sensitive to changes in the input whenever Assumption 2.1 of [TP02a] holds.

Although most of the work on non-parametric causal identification mentioned above assume the causal graph is known, the problem of inferring the underlying causal graph has also been studied in various contexts. Some papers reporting the work along this line include  [HEJ15, HB13, ASY+19, YKU18, KJS+19]. Causal effect identification is a fundamental topic with a wide range of practical applications. In particular it has found applications in a range of applied areas including recommendation systems [SHW15], computational sciences [SPI10], social and behavioral sciences [SOB00], econometrics [HV07, MAT93, LEW19], and epidemiology [HR20].

An important observation is that all existing work on non-parametric causal identifiability assumes infinite-sample access to the observational distribution. To the best of our knowledge, the present work is the first to establish sample and time complexity bounds for non-parametric causal effect identification. In this respect, the closest related work is [ABD+18], which looked at the problem of goodness-of-fit testing of causal models in a non-parametric setting; however, it assumed access to experimental data, not just observational data.

3 Preliminaries

Notation. We use capital (bold capital) letters to denote variables (sets of variables), e.g., $X$ is a variable and $\mathbf{X}$ is a set of variables. We use small (bold small) letters to denote values taken by the corresponding variables (sets of variables), e.g., $x$ is a value of $X$ and $\mathbf{x}$ is a value of $\mathbf{X}$. For a vector $v$ and a subset of coordinates $S$, we use $v_S$ to denote the restriction of $v$ to the coordinates in $S$ and $v_i$ to denote the $i$-th coordinate of $v$. For two sets of variables $\mathbf{X}$ and $\mathbf{Y}$ and assignments of values $\mathbf{x}$ to $\mathbf{X}$ and $\mathbf{y}$ to $\mathbf{Y}$, $\mathbf{x} \circ \mathbf{y}$ denotes the combined assignment to $\mathbf{X} \cup \mathbf{Y}$ in the natural way.

The variables in this paper take values in a finite set Σ. We use the total variation distance to measure the distance between distributions. For two distributions $P$ and $Q$ over the same finite sample space $\Omega$, their total variation distance is denoted by $d_{TV}(P, Q)$ and is given by

$$d_{TV}(P, Q) = \frac{1}{2} \sum_{\omega \in \Omega} |P(\omega) - Q(\omega)|.$$

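As a quick sanity check, the definition above translates directly into code (an illustrative helper of ours, with distributions represented as dicts):

```python
def tv_distance(p, q):
    """Total variation distance: half the L1 distance between two
    distributions given as dicts mapping outcomes to probabilities."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

# Example: two distributions on {0, 1}
p = {0: 0.5, 1: 0.5}
q = {0: 0.8, 1: 0.2}
print(tv_distance(p, q))  # 0.3
```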
Bayesian Networks. Bayesian networks are popular probabilistic graphical models for describing high-dimensional distributions.

Definition 3.1.

A Bayesian Network is a distribution that can be specified by a tuple ⟨V, G, {Pr[v_i | π]}⟩ where: (i) V = {V_1, …, V_n} is a set of variables over alphabet Σ, (ii) G is a directed acyclic graph with n nodes corresponding to the elements of V, and (iii) Pr[v_i | π] is the conditional distribution of variable V_i given that its parents Π(V_i) in G take the values π.

The Bayesian Network ⟨V, G, {Pr[v_i | π]}⟩ defines a probability distribution P over Σ^n, as follows. For all v ∈ Σ^n,

$$P(v) = \prod_{i=1}^{n} \Pr[v_i \mid \Pi(V_i) = v_{\Pi(V_i)}].$$

In this distribution, each variable V_i is independent of its non-descendants given its parents in G.
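The factorization above can be evaluated directly once the conditional distributions are known; a minimal illustrative sketch (our representation, not the paper's):

```python
def bayes_net_prob(v, parents, cpds):
    """Probability of a full assignment v (dict var -> value) under
    the factorization P(v) = prod_i P(v_i | parents(v_i))."""
    prob = 1.0
    for var, val in v.items():
        pa = tuple(v[p] for p in parents[var])
        prob *= cpds[var][pa][val]
    return prob

# Toy chain A -> B: P(A=1) = 0.3, P(B=1 | A=1) = 0.9
parents = {"A": [], "B": ["A"]}
cpds = {
    "A": {(): {0: 0.7, 1: 0.3}},
    "B": {(0,): {0: 0.8, 1: 0.2}, (1,): {0: 0.1, 1: 0.9}},
}
prob = bayes_net_prob({"A": 1, "B": 1}, parents, cpds)  # 0.3 * 0.9
```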

Causality. We describe Pearl’s notion of causality from [PEA95]. Central to his formalism is the notion of an intervention. Given an observable variable set V and a subset X ⊆ V, an intervention x is the process of fixing the set of variables X to the values x. The interventional distribution P_x is the distribution on V after setting X to x. Formally:

Definition 3.2 (Causal Bayes Net).

A causal Bayes net M is a collection of interventional distributions that can be defined in terms of a tuple ⟨V, U, G, {Pr[v_i | π]}, Pr[u]⟩, where (i) V and U are the tuples of observable and hidden variables respectively, (ii) G is a directed acyclic graph on V ∪ U, (iii) Pr[v_i | π] is the conditional probability distribution of V_i given that its parents Π(V_i) take the values π, and (iv) Pr[u] is the distribution of the hidden variables U. G is said to be the causal graph corresponding to M.

Such a causal Bayes net defines a unique interventional distribution P_x for every subset X ⊆ V (including X = ∅) and assignment x to X, as follows. For all v:

$$P_x(v) = \begin{cases} \displaystyle\sum_{u} \prod_{V_i \in V \setminus X} \Pr[v_i \mid \Pi(V_i) = v_{\Pi(V_i)}] \cdot \Pr[u] & \text{if } v \text{ is consistent with } x, \\ 0 & \text{otherwise.} \end{cases}$$

We use P to denote the observational distribution P_∅. For a subset S ⊆ V, P_S denotes the marginal of P on S. For an assignment s to S, we also use the notation P(s) as shorthand for the probability mass of P_S at s.
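In the special case with no hidden variables, the display above reduces to the familiar truncated factorization: intervened variables contribute no factor, and assignments inconsistent with the intervention get probability zero. A minimal illustrative sketch (our names; the hidden-variable sum is omitted by assumption):

```python
def interventional_prob(v, x, parents, cpds):
    """P_x(v) for a Bayes net with NO hidden variables, via truncated
    factorization: drop the factors of intervened variables and
    require consistency with the intervention x."""
    if any(v[var] != val for var, val in x.items()):
        return 0.0  # v inconsistent with the intervention
    prob = 1.0
    for var, val in v.items():
        if var in x:
            continue  # intervened variables contribute no factor
        pa = tuple(v[p] for p in parents[var])
        prob *= cpds[var][pa][val]
    return prob

# Toy chain A -> B; under do(A=1) only the factor P(B | A) remains
parents = {"A": [], "B": ["A"]}
cpds = {
    "A": {(): {0: 0.7, 1: 0.3}},
    "B": {(0,): {0: 0.8, 1: 0.2}, (1,): {0: 0.1, 1: 0.9}},
}
print(interventional_prob({"A": 1, "B": 1}, {"A": 1}, parents, cpds))  # 0.9
```

With hidden confounders this direct computation is unavailable, which is exactly why the identification machinery of the next subsections is needed.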

As mentioned in the introduction, we often consider a causal graph as an ADMG by implicitly representing hidden variables using bidirected edges. In an ADMG, we imagine that there is a hidden variable subdividing each bidirected edge, acting as a parent of the two endpoints of the edge. Thus, the edge set of an ADMG is the union of the directed edges E_d and the bidirected edges E_b. Given such an ADMG G, for any S ⊆ V, S̄ denotes the complement set V ∖ S, and Pa(S) denotes the parents of S according to the directed edges of G. We also define Pa⁺(S) = S ∪ Pa(S). The bidirected edges are used to define c-components:

Definition 3.3 (c-component).

For a given ADMG G, S ⊆ V is a c-component of G if S is a maximal set such that between any two vertices of S, there exists a path that uses only the bidirected edges E_b.

Since connectivity by bidirected paths forms an equivalence relation, the set of all c-components forms a partition of V, the observable vertices of G. Let S_1, …, S_ℓ denote the partition of V into the c-components of G.

Definition 3.4.

For a subset S ⊆ V, the Q-factor for S is defined as the following function of assignments v to V:

$$Q_S(v) = P_{v_{\bar{S}}}(v_S).$$

Clearly, for every fixed assignment to S̄, Q_S(v) as a function of v_S is a distribution over the assignments to S.

For S ⊆ V, the induced subgraph G[S] is the subgraph obtained by removing the vertices in S̄ and their corresponding edges from G.

The following lemma is used heavily in this work.

Lemma 3.5 (Corollary 1 of [Tia02]).

Let M be a causal Bayes net on an ADMG G. Let S_1, …, S_ℓ be the c-components of G. Then for any assignment v we have:

• $P(v) = \prod_{j=1}^{\ell} Q_{S_j}(v)$.

• Let V_1 < V_2 < ⋯ < V_n be a topological order over V with respect to the directed edges. Then, for every j, Q_{S_j} is computable from P and is given by:

$$Q_{S_j}(v) = \prod_{i : V_i \in S_j} P(v_i \mid v_1, \ldots, v_{i-1}).$$
• Furthermore, each factor can be expressed as:

$$P(v_i \mid v_1, \ldots, v_{i-1}) = P\big(v_i \mid v_{\mathrm{Pa}^+(T_i) \cap \{V_1, \ldots, V_{i-1}\}}\big)$$

where T_i is the c-component of the induced subgraph G[{V_1, …, V_i}] that contains V_i.

Note that Lemma 3.5 implies that each Q_{S_j}(v) is a function only of the coordinates of v corresponding to Pa⁺(S_j). The next result, due to Tian and Pearl, uses the identifiability criterion encoded in Assumption 2.1.
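The second item of Lemma 3.5 is constructive: given the observational distribution, each Q-factor is a product of chain-rule conditionals along the topological order. An illustrative sketch (our representation; the joint is given as a full probability table, which is only feasible for toy examples):

```python
def q_factor(v, S, order, joint):
    """Q_S(v) per Lemma 3.5: product over variables V_i in S of
    P(v_i | v_1, ..., v_{i-1}), with the conditionals computed from a
    full joint table `joint` (dict mapping assignment tuples, indexed
    in `order`, to probabilities). `v` is a full assignment tuple."""
    res = 1.0
    for i, var in enumerate(order):
        if var not in S:
            continue
        num = sum(p for a, p in joint.items() if a[: i + 1] == v[: i + 1])
        den = sum(p for a, p in joint.items() if a[:i] == v[:i])
        res *= num / den
    return res

# Two binary variables in order (A, B); Q_{{B}}((a, b)) = P(b | a)
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
q = q_factor((1, 1), {"B"}, ["A", "B"], joint)  # P(B=1 | A=1) = 0.3/0.5
```

The algorithms in Section 4 avoid such exponential-size tables by exploiting the third item of the lemma: each conditional depends only on a bounded set of coordinates.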

Theorem 3.6 (Theorem 3 of [TP02a]).

Let M be a causal Bayes net over an ADMG G and X be a variable. Let S_1, …, S_ℓ be the c-components of G, and assume X ∈ S_1 without loss of generality. Suppose G satisfies Assumption 2.1 (identifiability with respect to X). Then for any setting x to X and any assignment w to V ∖ {X}, the interventional distribution is given by:

$$P_x(w) = P_{w_{V \setminus S_1}}\big(w_{S_1 \setminus \{X\}}\big) \cdot \prod_{j=2}^{\ell} P_{w_{V \setminus (S_j \cup \{X\})} \circ x}\big(w_{S_j}\big) = \sum_{x' \in \Sigma} Q_{S_1}(w \circ x') \cdot \prod_{j=2}^{\ell} Q_{S_j}(w \circ x)$$

4 Efficient Estimation

Let M be a causal Bayes net over a causal graph G, where G is an ADMG with observable variables V = {V_1, …, V_n}. Without loss of generality, let V_1 < ⋯ < V_n be a topological order according to the directed edges of G. As a first step towards our algorithms for interventional distributions, we are interested in learning the observational distribution P. Our approach is to view the causal Bayes net as a regular Bayes net over the observable variables and use an existing learning algorithm for Bayes nets. From Lemma 3.5, we can write the observational distribution as:

$$P(V) = \prod_{i=1}^{n} P(V_i \mid Z_i)$$

where each Z_i ⊆ {V_1, …, V_{i−1}} has size bounded in terms of k and d. Here k is the maximum c-component size and d is the maximum in-degree. Therefore P can also be viewed as the distribution of a (regular) Bayes net with no hidden variables but with bounded in-degree. The problem of properly learning a Bayes net is well-studied [DAS97, CDK+17], starting from Dasgupta’s early work [DAS97]. We use the following learning result described in [BGM+20].

Theorem 4.1 ([Bgm+20]).

There is an algorithm that, on input parameters ε, δ and samples from an unknown Bayes net Q over Σ on a known DAG with vertex set of size n and maximum in-degree d, takes a number of samples polynomial in n, |Σ|^d, 1/ε and log(1/δ), runs in time polynomial in the number of samples, and produces a Bayes net Q̂ on the same DAG such that d_TV(Q̂, Q) ≤ ε with probability 1 − δ.

From the above discussion we get the following corollary.

Corollary 4.2.

There is an algorithm that, on input parameters ε, δ and samples from the observed distribution P of an unknown causal Bayes net over a known ADMG with vertex set of size n, maximum in-degree d and maximum c-component size k, takes a bounded number of samples, runs in bounded time, and outputs a Bayes net P̂ on a DAG such that d_TV(P̂, P) ≤ ε with probability 1 − δ.

Lemma 3.5 together with the Bayes net learning algorithm also gives us a way to learn the joint distribution on a subset of c-components when the remaining observable variables are intervened upon (assuming a certain strong positivity condition), which we will use.

Lemma 4.3.

Let S be any fixed union of c-components of an ADMG G with vertex set V, maximum in-degree d and maximum c-component size k. Let T be the set of vertices in S and T̄ = V ∖ T. Let t̄ be any assignment to T̄ and p be the restriction of t̄ to the external parents of T. Suppose P(p) ≥ α. Then: (1) the interventional distribution P_{t̄} on T is identifiable. (2) Moreover, there is an algorithm that, on input parameters ε, δ and samples from the observed distribution P, takes a bounded number of samples, runs in bounded time and outputs a Bayes net Q̂ such that Q̂ is ε-close in total variation distance to P_{t̄} restricted to T, with probability 1 − δ.

Proof.

Part (1) follows from the fact that the intervention t̄ blocks any other influences, since the intervened variables include the immediate parents of T in G. An informal argument that the distribution is identifiable implicitly follows from Lemma 3.5. We can add dummy bidirected edges to make T a single c-component and apply the second item of Lemma 3.5. In the resulting factorization, the terms are conditional probabilities of the form P(v_i | v_1, …, v_{i−1}). We can then keep only the variables of each conditioning set which come from the c-component of V_i and the parents of this c-component, and remove the rest using conditional independence (blocking), since the dummy bidirected edges do not affect the distribution.

Hence, for a fixed assignment t̄, the factorization of the joint distribution consists of the product of only the conditional terms for the variables in T, where each conditioning set has a bounded number of variables, among which the variables coming from T̄ are fixed by t̄. More formally, this fact can be proven by closely following the proof of Lemma 3.5.

It follows, by comparing the above factorization with the factorization of P from Lemma 3.5, that the samples from P conditioned on agreeing with t̄ are distributed according to the desired interventional distribution. Moreover, this distribution is a Bayes net with bounded in-degree, as discussed in the previous paragraph. By Theorem 4.1, a bounded number of conditional samples and bounded processing time suffice to learn it to within total variation distance ε. The result follows by observing that, by the strong positivity assumption, a bounded number of samples from P yields the required conditional samples except with probability δ. ∎

In the next two subsections we design our evaluation and generation algorithms. We use the following partitioning of the variables in these subsections. Let the c-components of G be S_1, …, S_ℓ with X ∈ S_1. Let A = S_1 ∖ {X}, let B denote the external parents of S_1, and let C = V ∖ (S_1 ∪ B) denote the remaining variables. Note that A, B, C and {X} partition V. See Figure 3.

4.1 Evaluation

Proof of Theorem 2.3.

In view of Theorem 3.6, for an assignment w = (a, b, c) to (A, B, C), let us define M and R so that P_x(w) = M(w) · R(w):

$$M(w) = P_{w_B}(w_A) = \sum_{x'} P_{w_B}(w_A \circ x') = \sum_{x' \in \Sigma} Q_{S_1}(w \circ x')$$

$$R(w) = P_{w_A \circ x}(w_{B \cup C}) = \prod_{j=2}^{\ell} Q_{S_j}(w \circ x).$$

Our strategy will be to approximate M and R separately as M′ and R′, and then output their product. See Algorithm 1.

For every assignment b to B, let M_b denote the partial function of M for those w such that w_B = b. Note that we can interpret M_b as a probability distribution over the assignments to A, so that the values M_b(a) sum to 1. Invoking Lemma 4.3 and using Assumption 2.2, for every b, we can learn an M′_b such that M_b and M′_b are close in total variation distance, using a bounded number of samples.

For a particular assignment a to A, let R_a denote the partial function of R for those w such that w_A = a. We can interpret R_a as a probability distribution over the assignments to B ∪ C, so that the values R_a(b, c) sum to 1. We learn the distribution on all of B ∪ C and then take its marginal. Again, we can invoke Lemma 4.3 and Assumption 2.2, so that for every a, we learn an R′_a that is within small total variation distance of R_a.

Finally, we use Lemma 4.4 below to combine the two estimates to obtain an estimate of P_x.

Lemma 4.4.

Suppose d_TV(M_b, M′_b) ≤ ε_1 for all assignments b to B, and d_TV(R_a, R′_a) ≤ ε_2 for all assignments a to A. Then the product estimate P′_x(w) = M′(w) R′(w) satisfies Σ_w |P_x(w) − P′_x(w)| ≤ (ε_1 + ε_2)|Σ|^k.

Proof.
$$\begin{aligned} \sum_w |P_x(w) - P'_x(w)| &= \sum_w |M(w)R(w) - M'(w)R'(w)| \\ &= \sum_w |M(w)R(w) - M(w)R'(w) + M(w)R'(w) - M'(w)R'(w)| \\ &\leqslant \sum_w \big[\, M(w)\,|R(w) - R'(w)| + |M(w) - M'(w)|\, R'(w) \,\big] \end{aligned}$$

We upper bound each of the above two terms separately, breaking the sum over w into sums over the assignments a, b, c to A, B, C respectively.

$$\begin{aligned} \sum_{a,b,c} |M_b(a) - M'_b(a)|\, R'_a(b,c) &= \sum_{b,c} \Big( \sum_a |M_b(a) - M'_b(a)|\, R'_a(b,c) \Big) \\ &\leqslant \sum_{b,c} \Big( \sum_a |M_b(a) - M'_b(a)| \Big) \Big( \sum_a R'_a(b,c) \Big) \\ &\leqslant \sum_{b,c} \varepsilon_1 \sum_a R'_a(b,c) \;\leqslant\; \varepsilon_1 |\Sigma|^{k} \end{aligned}$$

$$\sum_w M(w)\, |R(w) - R'(w)| \;\leqslant\; \sum_w |R(w) - R'(w)| \;\leqslant\; \sum_a \varepsilon_2 \;\leqslant\; \varepsilon_2 |\Sigma|^{k}$$

Finally, we claim that P′_x is a valid distribution, i.e., it is properly normalized. In the Bayes net factorization of P′_x, for every variable we have exactly one term in either of the Bayes nets as