Estimating the effect of joint interventions from observational data in sparse high-dimensional settings
We consider the estimation of joint causal effects from observational data. In particular, we propose new methods to estimate the effect of multiple simultaneous interventions (e.g., multiple gene knockouts), under the assumption that the observational data come from an unknown linear structural equation model with independent errors. We derive asymptotic variances of our estimators when the underlying causal structure is partly known, as well as high-dimensional consistency when the causal structure is fully unknown and the joint distribution is multivariate Gaussian. We also propose a generalization of our methodology to the class of nonparanormal distributions. We evaluate the estimators in simulation studies and also illustrate them on data from the DREAM4 challenge.
The effect of joint interventions from observational data
T1Supported in part by Swiss NSF Grant 200021_149760.
Causal inference \kwddirected acyclic graph (DAG) \kwdlinear structural equation model (linear SEM) \kwdmultiple simultaneous interventions \kwdjoint causal effects \kwdnonparanormal distribution \kwdhigh-dimensional data
Estimation of causal effects from observational data is impossible in general. It is, however, possible to estimate sets of possible causal effects of single interventions from observational data, under the assumption that the data were generated from an unknown linear structural equation model (SEM) with independent errors, or equivalently, from an unknown directed acyclic graph (DAG) model without hidden variables. The IDA method  was developed for this purpose, and can for example be used to predict the effect of single gene knockouts on other genes or some phenotype of interest, based on observational gene expression profiles. IDA has been applied to high-dimensional gene expression data sets and is a useful tool for the design of experiments, in the sense that it can indicate which genes are likely to have a large effect on the variable of interest [19, 34].
In this paper, we generalize IDA by relaxing some of its assumptions and extending it to multiple simultaneous interventions. For example, we may want to predict the effect of a double or triple gene knockout on some phenotype of interest. Since the space of possible intervention experiments grows exponentially in the number of simultaneous interventions, having an IDA-like tool to predict the effect of multiple simultaneous interventions is highly desirable in order to plan and prioritize such experiments. Moreover, the strength of the epistatic interaction [13, 38] between a pair of genes can be estimated by computing the difference between the predicted effect of a double gene knockout and the combined predicted effects of the two corresponding single gene knockouts.
The idea behind IDA is as follows. Since the underlying causal DAG is unknown, it seems natural to try to learn it. In general, however, the underlying causal DAG is not identifiable. We can learn its Markov equivalence class, which can be represented as a graph by a so-called completed partially directed acyclic graph (CPDAG) (see Section LABEL:A2-section:_Markov_equivalence_class_of_DAGs of ). Conceptually, we can then list all DAGs in the Markov equivalence class. One of these DAGs is the true causal DAG, but we do not know which one. For each DAG, we can then estimate the total causal effect of say on , under the assumption that the given DAG is the true causal DAG. In a linear SEM, this means that we can simply take the coefficient of in the regression of on and the parents of in the given DAG (see Section 2.3). Doing this for all DAGs in the Markov equivalence class yields a multiset of possible causal effects that is guaranteed to contain the true causal effect. We can then summarize the information on the effect of on by taking summary measures of this multiset.
For large graphs, listing all DAGs in the Markov equivalence class is computationally intensive. The above reasoning shows, however, that it suffices to know the parents of in the different DAGs. These possible parent sets can be extracted directly from the CPDAG, using a simple local criterion . This approach has two important advantages: it is a computational shortcut and it is less sensitive to estimation errors in the estimated CPDAG. The three steps of IDA can then be summarized as (1) estimating the CPDAG; (2) extracting possible valid parent sets of the intervention node from the CPDAG; (3) regressing on while adjusting for the possible parent sets. A schematic representation of IDA is given in Section LABEL:A2-subsection:_IDA_for_single_intervention of .
In order to generalize IDA to estimate the effect of joint interventions, we need non-trivial modifications of steps (2) and (3). In step (2), we must make sure that the possible parent sets of the various intervention nodes are jointly valid, in the sense that there is a DAG in the Markov equivalence class with this specific combination of parent sets. This decision can no longer be made fully locally, as was possible for the single intervention case. In step (3), we can no longer use regression with covariate adjustment, as illustrated in Example 2 in Section 2.3 (cf. ). We therefore develop new methods to estimate the effect of joint interventions under the assumption that only the parent sets of the intervention nodes are given. We refer to this assumption as the OPIN assumption (only parents of intervention nodes).
In the literature on time-dependent treatments (which can be viewed as joint interventions), it has been proposed to use inverse probability weighting (IPW) . IPW fits our framework in the sense that it works under the OPIN assumption. The method is widely used when the underlying causal DAG is given, but combining it with a causal structure learning method seems new. Unfortunately, however, such a combination does not provide a satisfactory solution to our problem, since we found that the statistical behavior was disappointing in our setting with continuous treatments. We therefore propose two new methods for estimating the effect of joint interventions under the OPIN assumption: one is based on recursive regressions for causal effects (RRC) and the other on modifying Cholesky decompositions (MCD). Combining our new steps (2) and (3), we obtain methods, called jointIDA, for estimating the effect of joint interventions from observational data, under the assumption that the data were generated from an unknown linear SEM with continuous independent errors.
We establish asymptotic normality of our estimators when the underlying SEM is linear and the parent sets are known (Section 4). Moreover, we provide high-dimensional consistency results when the causal structure is fully unknown and the distribution is multivariate Gaussian (Section 6). We also provide a generalization of our methodology to the family of nonparanormal distributions (Section 7).
Compared to the original IDA method , we have considerably weaker assumptions. IDA required linearity, Gaussianity and no hidden confounders. We dropped the Gaussianity assumption to a large extent, and only use it now in the high-dimensional consistency proof of (joint-)IDA, where it is needed since (so far) no causal structure learning method has been shown to be consistent in high-dimensional settings for linear SEMs with non-Gaussian noise. Additionally, we give an extension of our methods to nonparanormal distributions, hence treating an interesting subclass of non-linear and non-Gaussian distributions.
2.1 Graph terminology
We consider graphs with vertex (or node) set and edge set . There is at most one edge between any pair of vertices and edges may be either directed () or undirected (). If contains only (un)directed edges, it is called (un)directed. If contains directed and/or undirected edges, it is called partially directed. The skeleton of a partially directed graph is the undirected graph that results from replacing all directed edges by undirected edges.
If there is an edge between and in , we say that and are adjacent. The adjacency set of in is denoted by . If in , then is a parent of , and the edge between and is into . The set of all parents of in is denoted by .
A path between and is a sequence of distinct vertices such that all pairs of successive vertices are adjacent. A backdoor path from to is a path between and that has an edge into , i.e., . A path is called a v-structure if and and are not adjacent. A directed path from to is a path between and where all edges are directed towards . A directed path from to together with the edge forms a directed cycle. If there is a directed path from to , then is an ancestor of and is a descendant of . We also say that each node is an ancestor and a descendant of itself. The set of all non-descendants of in is denoted by .
A graph that does not contain directed cycles is called acyclic. Important classes of graphs in this paper are directed acyclic graphs (DAGs) and partially directed acyclic graphs (PDAGs).
2.2 Linear structural equation models and causal effects
Throughout this paper, we use the same notation to refer to sets or vectors. For example, , and can refer to sets or vectors, depending on the context.
Let be a DAG with vertices. Each vertex , , represents a random variable . An edge means that is a direct cause of in the sense of Definition 2.1 below. Throughout this paper, we use the same notation to denote a set of vertices and the corresponding set of random variables. For example, can refer to a set of indices or a set of random variables. Let be a weight matrix, where is the weight of the edge if , and otherwise. Then we say that is a weighted DAG.
(Linear structural equation model) Let be a weighted DAG, and a continuous random vector of jointly independent error variables with mean zero. Then is said to be generated from a linear structural equation model (linear SEM) characterized by the pair if
If is generated from a linear SEM characterized by the pair with , then we call the causal weighted DAG and the causal DAG. The symbol “” in (1) emphasizes that the expression should be understood as a generating mechanism rather than as a mere equation.
We emphasize that we assume that there are no hidden variables; hence the joint independence of the error terms. In the rest of the paper, we refer to linear SEMs without explicitly mentioning the independent error assumption. We also consider each of the structural equations in (1) as “autonomous”, meaning that changing the generating mechanism of one of the variables does not affect the generating mechanisms of the other variables.
An example of a weighted DAG is given in Figure 1, where , , , , , and . Note that directly causes , in the sense that plays a role in the generating process of . The set of all direct causes of is .
Suppose that is generated from a linear SEM characterized by . Since is acyclic, the vertices can always be rearranged to obtain an upper triangular weight matrix B. Such an ordering of the nodes is called a causal ordering. In Figure 1, (5, 1, 3, 4, 2, 6) is a causal ordering. (In this example there is a unique causal ordering, but that is not true in general.)
For any generated by a linear SEM characterized by , the joint density of satisfies the following factorization :
where the parent sets are determined from .
We now consider a (hypothetical) outside intervention to the system, where we set a variable to some value within the support of , uniformly over the entire population. This can be denoted by Pearl’s do-operator: or , which corresponds to removing the edges into in (or equivalently, setting the th column of equal to zero) and replacing by the constant . Since we assume that the other generating mechanisms are not affected by this intervention, the post-intervention joint density is given by the so-called truncated factorization formula or g-formula, see [25, 29, 33]:
The post-intervention distribution after a joint intervention on several nodes can be handled similarly:
Unless stated otherwise, we assume that are the intervention variables () and is the variable of interest. One can always label the variables to achieve this, since the nodes are not assumed to be in a causal ordering. The number of intervention variables is called the cardinality of the joint intervention.
Definition 2.2 defines the total joint effect of on in terms of partial derivatives of the expected value of the post-intervention distribution of .
(Total joint effect) Let be generated from a linear SEM characterized by . Then the total joint effect of on is given by
is the total effect of on in a joint intervention on . For notational convenience, we write instead of to denote the total effect of on in a single intervention on . Finally, we write and when it is helpful to indicate the dependence on the weighted DAG .
In general, is a vector of functions of , but under the assumption that is generated from a linear SEM, it reduces to a vector of numbers. In this case the partial derivatives can be interpreted as follows:
Thus, the total effect of on in a joint intervention on represents the increase in expected value of due to one unit increase in the intervention value of , while keeping the intervention values of constant. (In certain cases, can be viewed as a direct effect; see for example in Figure LABEL:A2-subfig:_RRCvsCochran1 in Section LABEL:A2-subsection:_Cochran of .)
The meaning of in a linear SEM can also be understood by looking at the causal weighted DAG : the causal effect of on along a directed path from to in can be calculated by multiplying all edge weights along the path, see . Then each can be calculated by summing up the causal effects along all directed paths from to which do not pass through (since those variables are held fixed by the intervention). We refer to this interpretation as the “path method” and illustrate it in the following example.
Let be generated from a linear SEM characterized by , where is depicted in Figure 1 and are jointly independent errors with arbitrary mean zero distributions.
We first consider the total effect of on in a single intervention on . There are four directed paths from to , namely , , and . Hence, the total causal effect of on is . Similarly, the total causal effect of on is .
Next, we consider the total joint effect of on . Since the only directed path from to does not pass through , . On the other hand, three of the four directed paths from 1 to 6 pass through 2, and the only remaining directed path is . Hence, , which is different from the single intervention effect .
If , then there is no directed path from to in . Thus, and the total effect of on is also zero in any joint intervention that involves .
2.3 Causal effects via covariate adjustment
It is straightforward to determine the total effect of on in a single intervention on a linear SEM (see Proposition LABEL:A2-proposition:_causal_effect_in_linear_SEM of ), since
where, for any and any set of variables such that , we define to be the coefficient of in the linear regression of on (without intercept term), denoted by . Equation (3) immediately follows from Pearl’s backdoor criterion  if all error variables are Gaussian . In Section LABEL:A2-section:_single_intervention_effects_as_regression_coefficients of , we show that (3) in fact holds under a linear SEM with arbitrary error distributions, since both the left hand side and the right hand side only depend on the covariance matrix of . Comparing equation (3) to the path method, we see that (3) does not require any knowledge about the underlying causal DAG beyond the parents of the intervention node .
For the total joint effect of on (), straightforward covariate adjustment cannot be used to calculate from one multiple linear regression. One might perhaps hope that each () can be computed separately as a coefficient of in a multiple linear regression, but the following example shows that this strategy fails as well.
3 Joint interventions when we only know the parents of the intervention nodes
Let be generated from a linear SEM characterized by , and suppose that we are interested in the total joint effect of on . If were known, then these effects could be computed with the path method. In this section, we consider the following weaker assumption.
(OPIN: only parents of intervention nodes) We only have partial knowledge of the underlying DAG : we know the direct causes (parent sets) of the intervention variables , but have no other information about the underlying causal structure. In particular, we do not know, in general, whether comes before or after in a causal ordering of the nodes for any in .
We consider this set-up for two main reasons. First, we think it is an interesting and novel assumption in itself, as there may be scenarios where one does not know the entire causal DAG, but one does know the direct causes of the intervention nodes. Second, it is a stepping stone for determining possible total joint effects in settings where the underlying causal DAG is fully unknown. In particular, we can mimic the IDA approach and use the CPDAG to determine possible jointly valid parent sets, i.e., parent sets of the intervention nodes that correspond to a DAG in the Markov equivalence class (see Section 5). For each of these possible jointly valid parent sets, we can compute the total joint effect under the OPIN assumption, and then collect all of these in a multiset. For very large graphs, one could even go a step further and only learn the Markov blankets of the intervention nodes [1, 3, 28] and extract possible parent sets from there.
We say that a procedure is an OPIN method if it does not require any knowledge of the underlying causal DAG beyond the parent sets of the intervention nodes. As mentioned in Section 1, an existing OPIN method for joint interventions is given by IPW . We introduce two new OPIN methods, called RRC and MCD. Sections 3.1 and 3.2 discuss the “oracle versions” of the methods, where we assume that the true distribution of is fully known. The corresponding sample versions are given in Section 3.3.
3.1 Recursive regressions for causal effects (RRC)
Our first method is based on recursive regressions for causal effects (RRC). We start with the special case of double interventions, i.e., .
(Oracle version of RRC for ) Let be generated from a linear SEM. Then the total joint effect of on is given by
where is defined in (3).
This result may seem rather straightforward, but we were unable to find it in the literature. There is a somewhat similar recursive formula for regression coefficients , namely , which is considered in the causality context (e.g., [9, 8]). However, the expression for in (4) contains causal effects which are generally different from the corresponding regression coefficients in (see equation (3) above and Example LABEL:A2-ex:_RRC_vs_Cochran's_formula in Section LABEL:A2-subsection:_Cochran of ).
The formula for is clear from the path method if is an ancestor of and is an ancestor of in : is the effect along all directed paths from 1 to , and we then subtract the effect along the subset of paths that pass through node 2. It is important to note, however, that equation (4) holds regardless of the causal ordering of , and .
If , then (see Remark 2.1) and hence . Similarly, if , then . At least one of these two scenarios must hold due to acyclicity.
We now generalize Theorem 3.1 to , yielding a recursive tool to compute total joint effects of any cardinality from lower order effects, and in particular from single intervention effects.
(Oracle version of RRC for ) Let be generated from a linear SEM and let . Then the total effect of on in a joint intervention on satisfies:
where we use the notation and to denote and , respectively.
3.2 Modified Cholesky decompositions (MCD)
Our second OPIN method is based on modifying Cholesky decompositions (MCD) of covariance matrices. The pseudocode is given in Algorithms 3.1 and 3.2, and the intuition is as follows. The covariance matrix of is given by
Let , where is obtained from by deleting all edges into nodes and is obtained from by setting the columns corresponding to equal to zero. is related to the joint intervention on as follows. Let be generated from the linear SEM . Then the post intervention joint density of given the intervention values is identical to the conditional distribution of given . Let be the covariance matrix of , i.e.,
Then for each , equals , the total effect of on in a joint intervention on . Hence, we focus on obtaining from .
If we knew the causal ordering of the variables, could be obtained by regressing each variable on its predecessors in the causal ordering, or equivalently, by the generalized Cholesky decomposition. Since is a positive definite matrix, there exists a unique generalized Cholesky decomposition , where , is a lower triangular matrix with ones on the diagonal, and is a diagonal matrix. The first entries of the th row of correspond to the negative of the regression coefficients in the regression of on . Hence, if the variables in are arranged in a causal ordering, the weight matrix can be obtained from the Cholesky decomposition. Setting the columns of corresponding to equal to zero is therefore equivalent to setting the off-diagonal elements of the rows of corresponding to equal to zero (cf. [2, 11]). Denoting the resulting matrix by , we then obtain .
In our set-up, however, we do not know the causal ordering. Instead, we work under the OPIN assumption, knowing only the parent sets of the intervention nodes . But we can still obtain by an iterative procedure. That is, we first consider a single intervention on to obtain . Next, we add the intervention on to obtain . After steps, this yields .
The key idea is the following. Suppose we wish to construct from , and let denote the number of parents of . Now order the variables in such that the first variables correspond to (in an arbitrary order), the th variable corresponds to , and the remaining variables follow in an arbitrary order. Let be the generalized Cholesky decomposition of with this ordering. Then the first entries of the th row of contain the negative weights of all edges that are into node (i.e., these weights are equal to the ones in the true causal weighted DAG). We can then obtain from by setting the first elements in the th row equal to zero, and use the reverse Cholesky decomposition to construct .
Repeating this procedure for the other intervention nodes leads to the iterative procedure given in Algorithm 3.1, where the matrices in the th step of this algorithm are denoted by , and . Theorem 3.3 shows that the output of this algorithm indeed equals .
Since our main goal is to obtain the total joint effect of on , we do not need to obtain the full covariance matrix . Let
Then it suffices to obtain , i.e., the sub-matrix of that corresponds to . The proof of Theorem 3.4 shows that can be obtained by simply running Algorithm 3.1 with input matrix . This simplification is important in sparse high-dimensional settings, where the full covariance matrix can be very large and difficult to estimate, while the sub-matrix is small. We can then compute the total joint effect of on as indicated in Algorithm 3.2.
(Soundness of MCD oracle) Let be generated from a linear SEM and let . Then the output of Algorithm 3.2 equals , the total joint effect of on .
The term in line 2 of Algorithm 3.2 is not necessary for the oracle version, since if . However, it does make a difference in the sample version, when we use the sample covariance matrix instead of the true covariance matrix. Section LABEL:A2-section:_an_illustration_of_the_MCD_algorithm of  contains a detailed example where MCD is applied to the weighted DAG in Figure 1.
Soundness of RRC and MCD may still hold even when the parent sets are not correctly specified. We show this in Section LABEL:A2-section:_hidden_variable_examples of  with two examples with incorrectly specified parent sets. In Example LABEL:A2-example:_hidden_variable_example_1 of , both RRC and MCD produce the correct output, while only MCD produces the correct output in Example LABEL:A2-example:_hidden_variable_example_2 of . The latter shows that RRC and MCD are indeed two different approaches, although their outputs are identical in the oracle versions when the parent sets are correctly specified.
3.3 Sample versions
Suppose that we have i.i.d. observations of , where is generated from a linear SEM characterized by .
where is the sample regression coefficient of in the linear regression .
Next, we define sample versions of RRC and MCD.
Thus, the effects of multiple interventions can be estimated from single intervention effects, where the latter can be estimated from single intervention experiments or from observational data and an IDA-like method.
The MCD estimator for simply equals adjusted regression:
Finally, we note that the RRC estimator for and the MCD estimator for generally depend on the ordering of . However, using different orderings in simulations showed very little difference for or , especially when the underlying causal DAG was sparse.
4 Asymptotic distributions of RRC and MCD
We now derive the asymptotic distributions of the RRC and MCD estimators under the OPIN assumption. For simplicity, we limit ourselves to the case .
Assume that we have i.i.d. observations of , where is generated from a linear SEM characterized by . Moreover, in this section we assume that for all . Let and . Let and let denote the corresponding sample covariance matrix. The half-vectorization, , of a symmetric matrix is the column vector in obtained by vectorizing the lower triangular part of . The derivative of a vector-valued differentiable function with respect to is denoted by the matrix whose -th entry is equal to .
Since all fourth moments of the variables in are finite, the multivariate central limit theorem implies
where . We will use (8) and the multivariate delta-method to derive the asymptotic distributions of RRC and MCD.
(Asymptotic distribution of RRC) Assume that is generated from a linear SEM characterized by , where for all . Moreover, assume that . Then
where , with , and . An explicit expression of the matrix is given in Proposition LABEL:A2-proposition:_asymptotic_distribution_of_RRC of .
By definition, if for any , and . Hence, the cases excluded from Theorem 4.1 can be handled trivially.
To obtain the asymptotic distribution of MCD, we first derive the asymptotic distribution of , where is the output of Algorithm 3.1 applied to . Without loss of generality, we assume that the variables in are ordered as , where the variable in and are ordered arbitrarily. Moreover, for simplicity of notation we omit line 9 of Algorithm 3.1, so that the variables in are ordered as .
Let be the permutation matrix such that variables in are ordered as in . Let be the matrix such that for any symmetric matrix , . We define and .
(Asymptotic distribution of ) Assume that is generated from a linear SEM characterized by , where for all . Then
where with . An explicit expression for is given in Section LABEL:A2-section:_Supplement_to_Section_4 of , and an expression for can be obtained analogously.
The following corollary follows directly from Theorem 4.2 and the multivariate delta-method.
(Asymptotic distribution of MCD) Assume that is generated from a linear SEM characterized by , where for all . Moreover, assume that . Then
where is defined in Theorem 4.2, and .
Since the formulas in Theorem 4.1 and Corollary 4.1 are not easily comparable, we computed them numerically for various settings in Section LABEL:A2-section:_asymptotic_comparison of . We found that no estimator dominates the other in terms of asymptotic variance, but that RRC seems to have a smaller asymptotic variance in most cases.
5 Extracting possible parent sets from a CPDAG
Recall that we introduced the OPIN assumption in Section 3 as a stepping stone for the scenario where we have no information on the underlying causal DAG. We will now consider this more general scenario: is generated from a linear SEM, and we only know the observational distribution of .
As in IDA we can first estimate the CPDAG of the unknown underlying causal DAG. Conceptually, we could then list all DAGs in the Markov equivalence class described by (see, e.g., ). Suppose that the Markov equivalence class consists of DAGs . For each , , we could determine the parent sets of the intervention nodes , denoted by the ordered set . All possible jointly valid parent sets of are then
We could then apply the RRC and MCD algorithms, using each of the possible jointly valid parent sets , , to obtain the multiset of possible total joint effects
However, listing all DAGs is computationally expensive and does not scale well to large graphs. In this section, our aim is to develop efficient ways to find the jointly valid parent sets of , where parent sets are called jointly valid if there exists a DAG in the Markov equivalence class of with this particular combination of parent sets.
In , the authors defined a so-called “locally valid” parent set of a node and showed that any locally valid parent set of a single intervention node is also a valid parent set. All locally valid parent sets of node can be obtained efficiently by orienting only those undirected edges in which contain node as an endpoint, without creating new v-structures with as a collider, and then taking all resulting parent sets of . As an easy extension of the method of  to multiple interventions, one could try to obtain jointly valid parent sets by taking all combinations of the locally valid parent sets of all intervention nodes. However, in Example 3 we show that this approach may generate parent sets that are not jointly valid. In other words, local validity of each parent set is necessary, but not sufficient for joint validity.
Consider the CPDAG in Figure 1(a). There are three DAGs that belong to the Markov equivalence class represented by (Figure 1(b)). Thus, all jointly valid sets of parent sets of are , and . However, both and are locally valid parent sets of vertices and in . Hence, all combinations of locally valid parent sets of nodes and include the additional ordered set . The latter is not jointly valid, since it corresponds to the DAG , which is not in the Markov equivalence class represented by due to the additional v-structure.
We now propose a semi-local algorithm for extracting all jointly valid parent sets, using the following graph-theoretic property of a CPDAG: no orientation of edges not oriented in a CPDAG can create a directed cycle or a new v-structure which includes at least one edge that was oriented in (see the proof of Theorem 4 in ).
Let and be the subgraphs on all vertices of that consist of all undirected and all directed edges of , respectively. Then is a jointly valid parent set of the intervention nodes with respect to if and only if is a jointly valid parent set of the intervention nodes with respect to . Typically, consists of several connected components, and these can be considered independently of each other. Since we only have to consider components that contain an intervention node, we have to work with at most components (which are typically much smaller than ), and this gives a large computational advantage. The algorithm is given in pseudocode in Algorithm 5.1 and illustrated Example 4.
Consider the CPDAG in Figure 3, together with its corresponding subgraphs and . We assume that the intervention nodes are . Note that contains four connected components and two of them contain at least one intervention node, namely and . We first consider . The multiset of all possible jointly valid parent sets of with respect to can be obtained by creating all possible DAGs in the Markov equivalence class described by