Ancestral Causal Inference

Ancestral Causal Inference

Sara Magliacane
VU Amsterdam & University of Amsterdam
sara.magliacane@gmail.com &Tom Claassen
Radboud University Nijmegen
tomc@cs.ru.nl &Joris M. Mooij
University of Amsterdam
j.m.mooij@uva.nl
Abstract

Constraint-based causal discovery from limited data is a notoriously difficult challenge due to the many borderline independence test decisions. Several approaches to improve the reliability of the predictions by exploiting redundancy in the independence information have been proposed recently. Though promising, existing approaches can still be greatly improved in terms of accuracy and scalability. We present a novel method that reduces the combinatorial explosion of the search space by using a more coarse-grained representation of causal information, drastically reducing computation time. Additionally, we propose a method to score causal predictions based on their confidence. Crucially, our implementation also allows one to easily combine observational and interventional data and to incorporate various types of available background knowledge. We prove soundness and asymptotic consistency of our method and demonstrate that it can outperform the state-of-the-art on synthetic data, achieving a speedup of several orders of magnitude. We illustrate its practical feasibility by applying it to a challenging protein data set.

 

Ancestral Causal Inference


  Sara Magliacane VU Amsterdam & University of Amsterdam sara.magliacane@gmail.com Tom Claassen Radboud University Nijmegen tomc@cs.ru.nl Joris M. Mooij University of Amsterdam j.m.mooij@uva.nl

\@float

noticebox[b]29th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\end@float

1 Introduction

Discovering causal relations from data is at the foundation of the scientific method. Traditionally, cause-effect relations have been recovered from experimental data in which the variable of interest is perturbed, but seminal work like the do-calculus Pearl2009 and the PC/FCI algorithms Spirtes2000; Zhang:2008:COR:1414091.1414237 demonstrate that, under certain assumptions (e.g., the well-known Causal Markov and Faithfulness assumptions Spirtes2000), it is already possible to obtain substantial causal information by using only observational data.

Recently, there have been several proposals for combining observational and experimental data to discover causal relations. These causal discovery methods are usually divided into two categories: constraint-based and score-based methods. Score-based methods typically evaluate models using a penalized likelihood score, while constraint-based methods use statistical independences to express constraints over possible causal models. The advantages of constraint-based over score-based methods are the ability to handle latent confounders and selection bias naturally, and that there is no need for parametric modeling assumptions. Additionally, constraint-based methods expressed in logic loci; BCCD; triantafillou2015constraint; antti allow for an easy integration of background knowledge, which is not trivial even for simple cases in approaches that are not based on logic borboudakis2012.

Two major disadvantages of traditional constraint-based methods are: (i) vulnerability to errors in statistical independence test results, which are quite common in real-world applications, (ii) no ranking or estimation of the confidence in the causal predictions. Several approaches address the first issue and improve the reliability of constraint-based methods by exploiting redundancy in the independence information BCCD; antti; triantafillou2015constraint. The idea is to assign weights to the input statements that reflect their reliability, and then use a reasoning scheme that takes these weights into account. Several weighting schemes can be defined, from simple ways to attach weights to single independence statements antti, to more complicated schemes to obtain weights for combinations of independence statements triantafillou2015constraint; BCCD. Unfortunately, these approaches have to sacrifice either accuracy by using a greedy method BCCD; triantafillou2015constraint, or scalability by formulating a discrete optimization problem on a super-exponentially large search space antti. Additionally, the confidence estimation issue is addressed only in limited cases ICP.

We propose Ancestral Causal Inference (ACI), a logic-based method that provides comparable accuracy to the best state-of-the-art constraint-based methods (e.g., antti) for causal systems with latent variables without feedback, but improves on their scalability by using a more coarse-grained representation of causal information. Instead of representing all possible direct causal relations, in ACI we represent and reason only with ancestral relations (“indirect” causal relations), developing specialised ancestral reasoning rules. This representation, though still super-exponentially large, drastically reduces computation time. Moreover, it turns out to be very convenient, because in real-world applications the distinction between direct causal relations and ancestral relations is not always clear or necessary. Given the estimated ancestral relations, the estimation can be refined to direct causal relations by constraining standard methods to a smaller search space, if necessary.

Furthermore, we propose a method to score predictions according to their confidence. The confidence score can be thought of as an approximation to the marginal probability of an ancestral relation. Scoring predictions enables one to rank them according to their reliability, allowing for higher accuracy. This is very important for practical applications, as the low reliability of the predictions of constraint-based methods has been a major impediment to their wide-spread use.

We prove soundness and asymptotic consistency under mild conditions on the statistical tests for ACI and our scoring method. We show that ACI outperforms standard methods, like bootstrapped FCI and CFCI, in terms of accuracy, and achieves a speedup of several orders of magnitude over antti on a synthetic dataset. We illustrate its practical feasibility by applying it to a challenging protein data set SPP05 that so far had only been addressed with score-based methods and observe that it successfully recovers from faithfulness violations. In this context, we showcase the flexibility of logic-based approaches by introducing weighted ancestral relation constraints that we obtain from a combination of observational and interventional data, and show that they substantially increase the reliability of the predictions. Finally, we provide an open-source version of our algorithms and the evaluation framework, which can be easily extended, at http://github.com/caus-am/aci.

2 Preliminaries and related work

Preliminaries

We assume that the data generating process can be modeled by a causal Directed Acyclic Graph (DAG) that may contain latent variables. For simplicity we also assume that there is no selection bias. Finally, we assume that the Causal Markov Assumption and the Causal Faithfulness Assumption Spirtes2000 both hold. In other words, the conditional independences in the observational distribution correspond one-to-one with the d-separations in the causal DAG. Throughout the paper we represent variables with uppercase letters, while sets of variables are denoted by boldface. All proofs are provided in the Supplementary Material.

A directed edge in the causal DAG represents a direct causal relation between cause on effect . Intuitively, in this framework this indicates that manipulating will produce a change in , while manipulating will have no effect on . A more detailed discussion can be found in Spirtes2000. A sequence of directed edges is a directed path. If there exists a directed path from to (or ), then is an ancestor of (denoted as ). Otherwise, is not an ancestor of (denoted as ). For a set of variables , we write:

(1)

We define an ancestral structure as any non-strict partial order on the observed variables of the DAG, i.e., any relation that satisfies the following axioms:

(2)
(3)
(4)

The underlying causal DAG induces a unique “true” ancestral structure, which represents the transitive closure of the direct causal relations projected on the observed variables.

For disjoint sets we denote conditional independence of and given as , and conditional dependence as . We call the cardinality the order of the conditional (in)dependence relation. Following loci we define a minimal conditional independence by:

and similarly, a minimal conditional dependence by:

The square brackets indicate that is needed for the (in)dependence to hold in the context of . Note that the negation of a minimal conditional independence is not a minimal conditional dependence. Minimal conditional (in)dependences are closely related to ancestral relations, as pointed out in loci:

Lemma 1.

For disjoint (sets of) variables :

(5)
(6)

Exploiting these rules (as well as others that will be introduced in Section 3) to deduce ancestral relations directly from (in)dependences is key to the greatly improved scalability of our method.

Related work on conflict resolution

One of the earliest algorithms to deal with conflicting inputs in constraint-based causal discovery is Conservative PC conf/uai/RamseyZS06, which adds “redundant” checks to the PC algorithm that allow it to detect inconsistencies in the inputs, and then makes only predictions that do not rely on the ambiguous inputs. The same idea can be applied to FCI, yielding Conservative FCI (CFCI) Colombo++2012; pcalg. BCCD (Bayesian Constraint-based Causal Discovery) BCCD uses Bayesian confidence estimates to process information in decreasing order of reliability, discarding contradictory inputs as they arise. COmbINE (Causal discovery from Overlapping INtErventions) triantafillou2015constraint is an algorithm that combines the output of FCI on several overlapping observational and experimental datasets into a single causal model by first pooling and recalibrating the independence test -values, and then adding each constraint incrementally in order of reliability to a SAT instance. Any constraint that makes the problem unsatisfiable is discarded.

Our approach is inspired by a method presented by Hyttinen, Eberhardt and Järvisalo antti (that we will refer to as HEJ in this paper), in which causal discovery is formulated as a constrained discrete minimization problem. Given a list of weighted independence statements, HEJ searches for the optimal causal graph (an acyclic directed mixed graph, or ADMG) that minimizes the sum of the weights of the independence statements that are violated according to . In order to test whether a causal graph induces a certain independence, the method creates an encoding DAG of d-connection graphs. D-connection graphs are graphs that can be obtained from a causal graph through a series of operations (conditioning, marginalization and interventions). An encoding DAG of d-connection graphs is a complex structure encoding all possible d-connection graphs and the sequence of operations that generated them from a given causal graph. This approach has been shown to correct errors in the inputs, but is computationally demanding because of the huge search space.

3 ACI: Ancestral Causal Inference

We propose Ancestral Causal Inference (ACI), a causal discovery method that accurately reconstructs ancestral structures, also in the presence of latent variables and statistical errors. ACI builds on HEJ antti, but rather than optimizing over encoding DAGs, ACI optimizes over the much simpler (but still very expressive) ancestral structures.

For variables, the number of possible ancestral structures is the number of partial orders (http://oeis.org/A001035), which grows as asymptoticEnumerationPosets, while the number of DAGs can be computed with a well-known super-exponential recurrence formula (http://oeis.org/A003024). The number of ADMGs is . Although still super-exponential, the number of ancestral structures grows asymptotically much slower than the number of DAGs and even more so, ADMGs. For example, for 7 variables, there are ancestral structures but already ADMGs, which lower bound the number of encoding DAGs of d-connection graphs used by HEJ.

New rules

The rules in HEJ explicitly encode marginalization and conditioning operations on d-connection graphs, so they cannot be easily adapted to work directly with ancestral relations. Instead, ACI encodes the ancestral reasoning rules (2)–(6) and five novel causal reasoning rules:

Lemma 2.

For disjoint (sets) of variables :

(7)
(8)
(9)
(10)
(11)

We prove the soundness of the rules in the Supplementary Material. We elaborate some conjectures about their completeness in the discussion after Theorem 1 in the next Section.

Optimization of loss function

We formulate causal discovery as an optimization problem where a loss function is optimized over possible causal structures. Intuitively, the loss function sums the weights of all the inputs that are violated in a candidate causal structure.

Given a list of weighted input statements , where is the input statement and is the associated weight, we define the loss function as the sum of the weights of the input statements that are not satisfied in a given possible structure , where denotes the set of all possible causal structures. Causal discovery is formulated as a discrete optimization problem:

(12)
(13)

where means that input is not satisfied in structure according to the rules .

This general formulation includes both HEJ and ACI, which differ in the types of possible structures and the rules . In HEJ represents all possible causal graphs (specifically, acyclic directed mixed graphs, or ADMGs, in the acyclic case) and are operations on d-connection graphs. In ACI  represent ancestral structures (defined with the rules(2)-(4)) and the rules are rules (5)–(11).

Constrained optimization in ASP

The constrained optimization problem in (12) can be implemented using a variety of methods. Given the complexity of the rules, a formulation in an expressive logical language that supports optimization, e.g., Answer Set Programming (ASP), is very convenient. ASP is a widely used declarative programming language based on the stable model semantics DBLP:conf/aaai/Lifschitz08; DBLP:reference/fai/Gelfond08 that has successfully been applied to several NP-hard problems. For ACI we use the state-of-the-art ASP solver clingo 4 clingo. We provide the encoding in the Supplementary Material.

Weighting schemes

ACI supports two types of input statements: conditional independences and ancestral relations. These statements can each be assigned a weight that reflects their confidence. We propose two simple approaches with the desirable properties of making ACI asymptotically consistent under mild assumptions (as described in the end of this Section), and assigning a much smaller weight to independences than to dependences (which agrees with the intuition that one is confident about a measured strong dependence, but not about independence vs. weak dependence). The approaches are:

  • a frequentist approach, in which for any appropriate frequentist statistical test with independence as null hypothesis (resp. a non-ancestral relation), we define the weight:

    (14)
  • a Bayesian approach, in which the weight of each input statement using data set is:

    (15)

    where the prior probability can be used as a tuning parameter.

Given observational and interventional data, in which each intervention has a single known target (in particular, it is not a fat-hand intervention EatonMurphy07), a simple way to obtain a weighted ancestral statement is with a two-sample test that tests whether the distribution of changes with respect to its observational distribution when intervening on . This approach conveniently applies to various types of interventions: perfect interventions Pearl2009, soft interventions Markowetz++2005, mechanism changes TianPearl2001, and activity interventions MooijHeskes_UAI_13. The two-sample test can also be implemented as an independence test that tests for the independence of and , the indicator variable that has value for observational samples and for samples from the interventional distribution in which has been intervened upon.

4 Scoring causal predictions

The constrained minimization in (12) may produce several optimal solutions, because the underlying structure may not be identifiable from the inputs. To address this issue, we propose to use the loss function (13) and score the confidence of a feature (e.g., an ancestral relation ) as:

(16)

Without going into details here, we note that the confidence (16) can be interpreted as a MAP approximation of the log-odds ratio of the probability that feature is true in a Markov Logic model:

In this paper, we usually consider the features to be ancestral relations, but the idea is more generally applicable. For example, combined with HEJ it can be used to score direct causal relations.

Soundness and completeness

Our scoring method is sound for oracle inputs:

Theorem 1.

Let be sound (not necessarily complete) causal reasoning rules. For any feature , the confidence score of (16) is sound for oracle inputs with infinite weights.

Here, soundness means that if is identifiable from the inputs, if is identifiable from the inputs, and otherwise (neither are identifiable). As features, we can consider for example ancestral relations for variables . We conjecture that the rules (2)–(11) are “order-1-complete”, i.e., they allow one to deduce all (non)ancestral relations that are identifiable from oracle conditional independences of order in observational data. For higher-order inputs additional rules can be derived. However, our primary interest in this work is improving computation time and accuracy, and we are willing to sacrifice completeness. A more detailed study of the completeness properties is left as future work.

Asymptotic consistency

Denote the number of samples by . For the frequentist weights in (14), we assume that the statistical tests are consistent in the following sense:

(17)

as , where the null hypothesis is independence/nonancestral relation and the alternative hypothesis is dependence/ancestral relation. Note that we need to choose a sample-size dependent threshold such that at a suitable rate. Kalisch and Bühlmann KalischBuehlmann2007 show how this can be done for partial correlation tests under the assumption that the distribution is multivariate Gaussian.

For the Bayesian weighting scheme in (15), we assume that for ,

(18)

This will hold (as long as there is no model misspecification) under mild technical conditions for finite-dimensional exponential family models. In both cases, the probability of a type I or type II error will converge to 0, and in addition, the corresponding weight will converge to .

Theorem 2.

Let be sound (not necessarily complete) causal reasoning rules. For any feature , the confidence score of (16) is asymptotically consistent under assumption (17) or (18).

Here, “asymptotically consistent” means that the confidence score in probability if is identifiably true, in probability if is identifiably false, and in probability otherwise.

5 Evaluation

In this section we report evaluations on synthetically generated data and an application on a real dataset. Crucially, in causal discovery precision is often more important than recall. In many real-world applications, discovering a few high-confidence causal relations is more useful than finding every possible causal relation, as reflected in recently proposed algorithms, e.g., ICP.

Average execution time (s) ACI HEJ BAFCI BACFCI 6 1 0.21 12.09 8.39 12.51 6 4 1.66 432.67 11.10 16.36 7 1 1.03 715.74 9.37 15.12 8 1 9.74 13.71 21.71 9 1 146.66 18.28 28.51
(a)
(b)
Figure 1: Execution time comparison on synthetic data for the frequentist test on 2000 synthetic models: (a) average execution time for different combinations of number of variables and max. order ; (b) detailed plot of execution times for (logarithmic scale).
Compared methods

We compare the predictions of ACI and of the acyclic causally insufficient version of HEJ antti, when used in combination with our scoring method (16). We also evaluate two standard methods: Anytime FCI Spirtes01ananytime; Zhang:2008:COR:1414091.1414237 and Anytime CFCI Colombo++2012, as implemented in the pcalg R package pcalg. We use the anytime versions of (C)FCI because they allow for independence test results up to a certain order. We obtain the ancestral relations from the output PAG using Theorem 3.1 from ancestralFCI. (Anytime) FCI and CFCI do not rank their predictions, but only predict the type of relation: ancestral (which we convert to +1), non-ancestral (-1) and unknown (0). To get a scoring of the predictions, we also compare with bootstrapped versions of Anytime FCI and Anytime CFCI. We perform the bootstrap by repeating the following procedure 100 times: sample randomly half of the data, perform the independence tests, run Anytime (C)FCI. From the 100 output PAGs we extract the ancestral predictions and average them. We refer to these methods as BA(C)FCI. For a fair comparison, we use the same independence tests and thresholds for all methods.

Synthetic data

We simulate the data using the simulator from HEJ antti: for each experimental condition (e.g., a given number of variables and order ), we generate randomly linear acyclic models with latent variables and Gaussian noise and sample data points. We then perform independence tests up to order and weight the (in)dependence statements using the weighting schemes described in Section 3. For the frequentist weights we use tests based on partial correlations and Fisher’s -transform to obtain approximate -values (see, e.g., KalischBuehlmann2007) with significance level . For the Bayesian weights, we use the Bayesian test for conditional independence presented in DBLP:journals/ci/MargaritisB09 as implemented by HEJ with a prior probability of 0.1 for independence.

(a) PR ancestral: n=6
(b) PR ancestral: n=6 (zoom)
(c) PR nonancestral: n=6
(d) PR ancestral: n=8
(e) PR ancestral: n=8 (zoom)
(f) PR nonancestral: n=8
Figure 2: Accuracy on synthetic data for the two prediction tasks (ancestral and nonancestral relations) using the frequentist test with . The left column shows the precision-recall curve for ancestral predictions, the middle column shows a zoomed-in version in the interval (0,0.02), while the right column shows the nonancestral predictions.

In Figure 1(a) we show the average execution times on a single core of a 2.80GHz CPU for different combinations of and , while in Figure 1(b) we show the execution times for , sorting the execution times in ascending order. For 7 variables ACI is almost 3 orders of magnitude faster than HEJ, and the difference grows exponentially as increases. For 8 variables HEJ can complete only four of the first 40 simulated models before the timeout of s. For reference we add the execution time for bootstrapped anytime FCI and CFCI.

In Figure 2 we show the accuracy of the predictions with precision-recall (PR) curves for both ancestral () and nonancestral () relations, in different settings. In this Figure, for ACI and HEJ all of the results are computed using frequentist weights and, as in all evaluations, our scoring method (16). While for these two methods we use , for (bootstrapped) (C)FCI we use all possible independence test results (). In this case, the anytime versions of FCI and CFCI are equivalent to the standard versions of FCI and CFCI. Since the overall results are similar, we report the results with the Bayesian weights in the Supplementary Material.

In the first row of Figure 2, we show the setting with variables. The performances of HEJ and ACI coincide, performing significantly better for nonancestral predictions and the top ancestral predictions (see zoomed-in version in Figure 2(b)). This is remarkable, as HEJ and ACI use only independence test results up to order , in contrast with (C)FCI which uses independence test results of all orders. Interestingly, the two discrete optimization algorithms do not seem to benefit much from higher order independence tests, thus we omit them from the plots (although we add the graphs in the Supplementary Material). Instead, bootstrapping traditional methods, oblivious to the (in)dependence weights, seems to produce surprisingly good results. Nevertheless, both ACI and HEJ outperform bootstrapped FCI and CFCI, suggesting these methods achieve nontrivial error-correction.

In the second row of Figure 2, we show the setting with 8 variables. In this setting HEJ is too slow. In addition to the previous plot, we plot the accuracy of ACI when there is oracle background knowledge on the descendants of one variable (). This setting simulates the effect of using interventional data, and we can see that the performance of ACI improves significantly, especially in the ancestral preditions. The performance of (bootstrapped) FCI and CFCI is limited by the fact that they cannot take advantage of this background knowledge, except with complicated postprocessing borboudakis2012.

Application on real data

We consider the challenging task of reconstructing a signalling network from flow cytometry data SPP05 under different experimental conditions. Here we consider one experimental condition as the observational setting and seven others as interventional settings. More details and more evaluations are reported in the Supplementary Material. In contrast to likelihood-based approaches like SPP05; EatonMurphy07; MooijHeskes_UAI_13; Rothenhausler++2015, in our approach we do not need to model the interventions quantitatively. We only need to know the intervention targets, while the intervention types do not matter. Another advantage of our approach is that it takes into account possible latent variables.

(a) Bootstrapped (100) anytime CFCI (input: independences of order )
(b) ACI (input: weighted ancestral relations)
(c) ACI (input: independences of order , weighted ancestral relations)
Figure 3: Results for flow cytometry dataset. Each matrix represents the ancestral relations, where each row represents a cause and each column an effect. The colors encode the confidence levels: green is positive, black is unknown, while red is negative. The intensity of the color represents the degree of confidence. For example, ACI identifies MEK to be a cause of RAF with high confidence.

We use a -test to test for each intervention and for each variable whether its distribution changes with respect to the observational condition. We use the -values of these tests as in (14) in order to obtain weighted ancestral relations that are used as input (with threshold ). For example, if adding U0126 (a MEK inhibitor) changes the distribution of RAF significantly with respect to the observational baseline, we get a weighted ancestral relation MEKRAF. In addition, we use partial correlations up to order 1 (tested in the observational data only) to obtain weighted independences used as input. We use ACI with (16) to score the ancestral relations for each ordered pair of variables. The main results are illustrated in Figure 3, where we compare ACI with bootstrapped anytime CFCI under different inputs. The output for boostrapped anytime FCI is similar, so we report it only in the Supplementary Material. Algorithms like (anytime) (C)FCI can only use the independences in the observational data as input and therefore miss the strongest signal, weighted ancestral relations, which are obtained by comparing interventional with observational data. In the Supplementary Material, we compare also with other methods (ICP, MooijHeskes_UAI_13). Interestingly, as we show there, our results are similar to the best acyclic model reconstructed by the score-based method from MooijHeskes_UAI_13. As for other constraint-based methods, HEJ is computationally unfeasible in this setting, while COMBINE assumes perfect interventions (while this dataset contains mostly activity interventions).

Notably, our algorithms can correctly recover from faithfulness violations (e.g., the independence between MEK and ERK), because they take into account the weight of the input statements (the weight of the independence is considerably smaller than that of the ancestral relation, which corresponds with a quite significant change in distribution). In contrast, methods that start by reconstructing the skeleton, like (anytime) (C)FCI, would decide that MEK and ERK are nonadjacent, and are unable to recover from that erroneous decision. This illustrates another advantage of our approach.

6 Discussion and conclusions

As we have shown, ancestral structures are very well-suited for causal discovery. They offer a natural way to incorporate background causal knowledge, e.g., from experimental data, and allow a huge computational advantage over existing representations for error-correcting algorithms, such as antti. When needed, ancestral structures can be mapped to a finer-grained representation with direct causal relations, as we sketch in the Supplementary Material. Furthermore, confidence estimates on causal predictions are extremely helpful in practice, and can significantly boost the reliability of the output. Although standard methods, like bootstrapping (C)FCI, already provide reasonable estimates, methods that take into account the confidence in the inputs, as the one presented here, can lead to further improvements of the reliability of causal relations inferred from data.

Strangely (or fortunately) enough, neither of the optimization methods seems to improve much with higher order independence test results. We conjecture that this may happen because our loss function essentially assumes that the test results are independent from another (which is not true). Finding a way to take this into account in the loss function may further improve the achievable accuracy, but such an extension may not be straightforward.

Acknowledgments

SM and JMM were supported by NWO, the Netherlands Organization for Scientific Research (VIDI grant 639.072.410). SM was also supported by the Dutch programme COMMIT/ under the Data2Semantics project. TC was supported by NWO grant 612.001.202 (MoCoCaDi), and EU-FP7 grant agreement n.603016 (MATRICS). We also thank Sofia Triantafillou for her feedback, especially for pointing out the correct way to read ancestral relations from a PAG.

7 Proofs

7.1 ACI causal reasoning rules

We give a combined proof of all the ACI reasoning rules. Note that the numbering of the rules here is different from the numbering used in the main paper.

Lemma 3.

For , , , , disjoint (sets of) variables:

Proof.

We assume a causal DAG with possible latent variables, the causal Markov assumption, and the causal faithfulness assumption.

  1. This is a strengthened version of rule (i) in conf/aistats/EntnerHS13: note that the additional assumptions made there (, ) are redundant and not actually used in their proof. For completeness, we give the proof here. If , then there is a directed path from to . As all paths between and are blocked by , the directed path from to must contain a node . Hence , a contradiction with .

  2. If then there exists a path between and such that each noncollider on is not in , every collider on is ancestor of , and there exists a collider on that is ancestor of but not of . Let be the collider on closest to that is ancestor of but not of . Note that

    1. The path is d-connected given .

    2. (because otherwise , a contradiction).

    3. (because otherwise the path would be d-connected given , a contradiction).

    Hence we conclude that , , , and by symmetry also .

  3. Suppose . Then there exists a path between and , such that each noncollider on is not in , each collider on is an ancestor of , and is a noncollider on . Note that

    1. The subpath must be d-connected given .

    2. has at least one outgoing edge on . Follow this edge further along until reaching either , , or the first collider. When a collider is reached, follow the directed path to . Hence there is a directed path from to or or to , i.e., .

  4. If in addition, , then must be a noncollider on the subpath . Therefore, .

  5. Assume that and . Then there must be paths between and and between and such that each noncollider is not in and each collider is ancestor of . Let be the node on closest to that is also on (this could be ). Then we have a path such that each collider (except ) is ancestor of and each noncollider (except ) is not in . This path must be blocked given as . If would be a noncollider on this path, it would need to be in in order to block it; however, it must then also be a noncollider on or and hence cannot be in . Therefore, must be a collider on this path and cannot be ancestor of . We have to show that is ancestor of . If were a collider on or , it would be ancestor of , a contradiction. Hence must have an outgoing arrow pointing towards on and . If we encounter a collider following the directed edges, we get a contradiction, as that collider, and hence , would be ancestor of . Hence is ancestor of , and therefore, .

7.2 Soundness

Theorem 3.

Let be sound (not necessarily complete) causal reasoning rules. For any feature , the confidence score of (16) is sound for oracle inputs with infinite weights, i.e., if is identifiable from the inputs, if is identifiable from the inputs, and otherwise (neither are identifiable).

Proof.

We assume that the data generating process is described by a causal DAG which may contain additional latent variables, and that the distributions are faithful to the DAG. The theorem then follows directly from the soundness of the rules and the soundness of logical reasoning. ∎

7.3 Asymptotic consistency of scoring method

Theorem 4.

Let be sound (not necessarily complete) causal reasoning rules. For any feature , the confidence score of (16) is asymptotically consistent under assumption (14) or (15) in the main paper, i.e.,

  • in probability if is identifiably true,

  • in probability if is identifiably false,

  • in probability otherwise (neither are identifiable).

Proof.

As the number of statistical tests is fixed (or at least bounded from above), the probability of any error in the test results converges to 0 asymptotically. The loss function of all structures that do not correspond with the properties of the true causal DAG converges to in probability, whereas the loss function of all structures that are compatible with properties of the true causal DAG converges to 0 in probability. ∎

(a) PR ancestral
(b) PR ancestral (zoom)
(c) PR nonancestral
Figure 4: Synthetic data: accuracy for the two prediction tasks (ancestral and nonancestral relations) for variables using the frequentist test with , also for higher order .
(a) PR ancestral
(b) PR ancestral (zoom)
(c) PR nonancestral
Figure 5: Synthetic data: accuracy for the two prediction tasks (ancestral and nonancestral relations) for variables using the Bayesian test with prior probability of independence .

8 Additional results on synthetic data

In Figures 4 and 5 we show the performance of ACI and HEJ antti for higher order independence test results (). As in the main paper, for (bootstrapped) FCI and CFCI we use , because it gives the best predictions for these methods. In Figure 4 we report more accuracy results on the frequentist test with , the same setting as Figure 2 (a-c) in the main paper. As we see, the performances of ACI and HEJ do not really improve with higher order but actually seem to deteriorate.

In Figure 5 we report accuracy results on synthetic data also for the Bayesian test described in the main paper, with prior probability of independence . Using the Bayesian test does not change the overall conclusions: ACI and HEJ overlap for order and they perform better than bootstrapped (C)FCI.

9 Application on real data

We provide more details and more results on the real-world dataset that was briefly described in the main paper, the flow cytometry data SPP05. The data consists of simultaneous measurements of expression levels of 11 biochemical agents in individual cells of the human immune system under 14 different experimental conditions.

9.1 Experimental conditions

The experimental conditions can be grouped into two batches of 8 conditions each that have very similar interventions:

  • “no-ICAM”, used in the main paper and commonly used in the literature;

  • “ICAM”, where Intercellular Adhesion Protein-2 (ICAM-2) was added (except when PMA or 2CAMP was added).

For each batch of 8 conditions, the experimenters added -CD3 and -CD28 to activate the signaling network in 6 out of 8 conditions. For the remaining two conditions (PMA and 2CAMP), -CD3 and -CD28 were not added (and neither was ICAM-2). We can consider the absence of these stimuli as a global intervention relative to the observational baseline (where -CD3 and -CD28 are present, and in addition ICAM-2 is present in the ICAM batch). For each batch (ICAM and no-ICAM), we can consider an observational dataset and 7 interventional datasets with different activators and inhibitors added to the cells, as described in Table 1. Note that the datasets from the last two conditions are the same in both settings. For more information about intervention types, see MooijHeskes_UAI_13.

no-ICAM:   Reagents Intervention -CD3, -CD28 ICAM-2 Additional Target Type + - - - (observational) + - AKT inhibitor AKT activity + - G0076 PKC activity + - Psitectorigenin PIP2 abundance + - U0126 MEK activity + - LY294002 PIP2/PIP3 mechanism change - - PMA PKC activity + fat-hand - - 2CAMP PKA activity + fat-hand


no-ICAM:   Reagents Intervention -CD3, -CD28 ICAM-2 Additional Target Type + + - - (observational) + + AKT inhibitor AKT activity + + G0076 PKC activity + + Psitectorigenin PIP2 abundance + + U0126 MEK activity + + LY294002 PIP2/PIP3 mechanism change - - PMA PKC activity + fat-hand - - 2CAMP PKA activity + fat-hand

Table 1: Reagents used in the various experimental conditions in SPP05 and corresponding intervention types and targets. The intervention types and targets are based on (our interpretation of) biological background knowledge. The upper table describes the “no-ICAM” batch of conditions that is most commonly used in the literature. The lower table describes the additional “ICAM” batch of conditions that we also use here.

In this paper, we ignore the fact that in the last two interventional datasets in each batch (PMA and 2CAMP) there is also a global intervention. Ignoring the global intervention allows us to compute the weighted ancestral relations, since we consider any variable that changes its distribution with respect to the observational condition to be an effect of the main target of the intervention (PKC for PMA and PKA for 2CAMP). This is in line with previous work SPP05; MooijHeskes_UAI_13. Also, we consider only PIP3 as the main target of the LY294002 intervention, based on the consensus network SPP05, even though in MooijHeskes_UAI_13 both PIP2 and PIP3 are considered to be targets of this intervention. In future work, we plan to extend ACI in order to address the task of learning the intervention targets from data, as done by EatonMurphy07 for a score-based approach.

In the main paper we provide some results for the most commonly used no-ICAM batch of experimental conditions. Below we report additional results on the same batch. Moreover, we provide results for causal discovery on the ICAM batch, which are quite consistent with the no-ICAM batch. Finally, we compare with other methods that were applied to this dataset, especially with a score-based approach (MooijHeskes_UAI_13) that shows surprisingly similar results to ACI, although it uses a very different method.

(a) Independences of order 0
(b) Weighted ancestral relations
(c) ACI (input: independences order )
(d) ACI (input: weighted ancestral relations)
(e) ACI (input: independences order , weighted ancestral relations)
(f) Bootstrapped (100) anytime FCI (input: independences order )
(g) Bootstrapped (100) anytime CFCI (input: independences order )
Figure 6: Results on flow cytometry dataset, no-ICAM batch. The top row represents some of the possible inputs: weighted independences of order 0 from the observational dataset (the inputs include also order 1 test results, but these are not visualized here) and weighted ancestral relations recovered from comparing the interventional datasets with the observational data. In the bottom two rows each matrix represents the ancestral relations that are estimated using different inputs and different methods (ACI, bootstrapped anytime FCI or CFCI). Each row represents a cause, while the columns are the effects. The colors encodes the confidence levels, green is positive, black is unknown, while red is negative. The intensity of the color represents the degree of confidence.

9.2 Results on no-ICAM batch

In Figure 6 we provide additional results for the no-ICAM batch. In the first row we show some of the possible inputs: weighted independences (in this case partial correlations) from observational data and weighted ancestral relations from comparing the interventional datasets with the observational data. Specifically, we consider as inputs only independences up to order (but only independences of order 0 are visualized in the figure). The color encodes the weight of the independence. As an example, the heatmap shows that Raf and Mek are strongly dependent.

For the weighted ancestral relations, in Figure 6 we plot a matrix in which each row represents a cause, while the columns are the effects. As described in the main paper we use a -test to test for each intervention and for each variable whether its distribution changes with respect to the observational condition. We use the biological knowledge summarised in Table 1 to define the intervention target, which is then considered the putative “cause”. Then we use the -values of these tests and a threshold to obtain the weights of the ancestral relations, similarly to what is proposed in the main paper for the frequentist weights for the independence tests:

For example, if adding U0126 (which is known to be a MEK inhibitor) changes the distribution of RAF with with respect to the observational baseline, we get a weighted ancestral relation (MEKRAF, 1.609).

(a) Independences of order 0
(b) Weighted ancestral relations
(c) ACI (input: independences order )
(d) ACI (input: weighted ancestral relations)
(e) ACI (input: independences order , weighted ancestral relations)
(f) Bootstrapped (100) anytime FCI(input: independences order )
(g) Bootstrapped (100) anytime CFCI (input: independences order )
Figure 7: Results on flow cytometry dataset, ICAM batch. Same comparison as in Figure 6, but for the ICAM batch.

9.3 ICAM batch

In Figure 7 we show the results for the ICAM setting. These results are very similar to the results for the no-ICAM batch (see also Figure 8), showing that the predicted ancestral relations are robust. In particular it is clear that also for the ICAM batch, weighted ancestral relations are a very strong signal, and that methods that can exploit them (e.g., ACI) have a distinct advantage over methods that cannot (e.g., FCI and CFCI).

In general, in both settings there appear to be various faithfulness violations. For example, it is well-known that MEK causes ERK, yet in the observational data these two variables are independent. Nevertheless, we can see in the data that an intervention on MEK leads to a change of ERK, as expected. It is interesting to note that our approach can correctly recover from this faithfulness violation because it takes into account the weight of the input statements (note that the weight of the independence is smaller than that of the ancestral relation, which corresponds with a quite significant change in distribution). In contrast, methods that start by reconstructing the skeleton (like (C)FCI or LoCI loci) would decide that MEK and ERK are nonadjacent, unable to recover from that erroneous decision. This illustrates one of the advantages of our approach.

Figure 8: ACI results (input: independences of order and weighted ancestral relations) on no-ICAM (left) and ICAM (right) batches. These heatmaps are identical to the ones in Figures 6 and 7, but are reproduced here next to each other for easy comparison.

9.4 Comparison with other approaches

We also compare our results with other, mostly score-based approaches. Amongst other results, MooijHeskes_UAI_13 report the top 17 direct causal relations on the no-ICAM batch that were inferred by their score-based method when assuming acyclicity. In order to compare fairly with the ancestral relations found by ACI, we first perform a transitive closure of these direct causal relations, which results in 21 ancestral relations. We then take the top 21 predicted ancestral relations from ACI (for the same no-ICAM batch), and compare the two in Figure 9. The black edges, the majority, represent the ancestral relations found by both methods. The blue edges are found only by ACI, while the grey edges are found only by MooijHeskes_UAI_13. Interestingly, the results are quite similar, despite the very different approaches. In particular, ACI allows for confounders and is constraint-based, while the method in MooijHeskes_UAI_13 assumes causal sufficiency (i.e., no confounders) and is score-based.

Figure 9: Comparison of ancestral relations predicted by ACI and the score-based method from MooijHeskes_UAI_13, both using the no-ICAM batch. Depicted are the top 21 ancestral relations obtained by ACI and the transitive closure of the top 17 direct causal relations reported in MooijHeskes_UAI_13, which results in 21 ancestral relations. Black edges are ancestral relations found by both methods, blue edges were identified only by ACI, while grey edges are present only in the transitive closure of the result from MooijHeskes_UAI_13.

Table 2 summarizes most of the existing work on this flow cytometry dataset. It was originally part of the S1 material of ICP_PNAS. We have updated it here by adding also the results for ACI and the transitive closure of MooijHeskes_UAI_13.

Direct causal predictions Ancestral predictions
Edge SPP05a SPP05b MooijHeskes_UAI_13a EatonMurphy07 ICP ICP hiddenICP ICP MooijHeskes_UAI_13b ACI (top 21)
RAFMEK
MEKRAF
MEKERK
MEKAKT
MEKJNK
PLCgPIP2
PLCgPIP3
PLCgPKC
PIP2PLCg
PIP2PIP3
PIP2PKC
PIP3PLCg
PIP3PIP2
PIP3AKT
AKTERK
AKTJNK
ERKAKT
ERKPKA
PKARAF
PKAMEK
PKAERK
PKAAKT
PKAPKC
PKAP38
PKAJNK
PKCRAF
PKCMEK
PKCPLCg
PKCPIP2
PKCPIP3
PKCERK
PKCAKT
PKCPKA
PKCP38
PKCJNK
P38JNK
P38PKC
JNKPKC
JNKP38
Table 2: Updated Table S1 from ICP_PNAS: causal relationships between the biochemical agents in the flow cytometry data of SPP05, according to different causal discovery methods. The consensus network according to SPP05 is denoted here by “SPP05a” and their reconstructed network by “SPP05b”. For MooijHeskes_UAI_13 we provide two versions: “MooijHeskes_UAI_13a” for the top 17 edges in the acyclic case, as reported in the original paper, and “MooijHeskes_UAI_13b” for its transitive closure, which consists of 21 edges. To provide a fair comparison, we also pick the top 21 ancestral predictions from ACI.

10 Mapping ancestral structures to direct causal relations

An ancestral structure can be seen as the transitive closure of the directed edges of an acyclic directed mixed graph (ADMG). There are several strategies to reconstruct “direct” causal relations from an ancestral structure, in particular in combination with our scoring method. Here we sketch a possible strategy, but we leave a more in-depth investigation to future work.

(a) PR direct
(b) PR direct (zoom)
(c) PR direct acausal
Figure 10: Synthetic data: accuracy for the two prediction tasks (direct causal and noncausal relations) for variables using the frequentist test with for 2000 simulated models.
Average execution time (s)
Setting Direct causal relations Only second step Ancestral relations
ACI with restricted HEJ direct HEJ restricted HEJ ancestral HEJ
6 1 9.77 15.03 7.62 12.09
6 4 16.96 314.29 14.43 432.67
7 1 36.13 356.49 30.68 715.74
8 1 98.92 81.73
9 1 361.91 240.47
Table 3: Average execution times for recovering causal relations with different strategies for 2000 models for variables using the frequentist test with .

A possible strategy is to first recover the ancestral structure from ACI with our scoring method and then use it as “oracle” input constraints for the HEJ antti algorithm. Specifically, for each weighted output obtained by ACI, we add to the input list , and similarly for each . Then we can use our scoring algorithm with HEJ to score direct causal relations (e.g., ) and direct acausal relations (e.g., ):

(19)

In the standard HEJ algorithm, are all possible ADMGs, but with our additional constraints we can reduce the search space to only the ones that fit the specific ancestral structure, which is on average and asymptotically a reduction of for variables. We will refer to this two-step approach as ACI with restricted HEJ (ACI + HEJ). A side effect of assigning infinite scores to the original ancestral predictions instead of the originally estimated scores is that some of the estimated direct causal predictions scores will also be infinite, flattening their ranking. For this preliminary evaluation, we fix this issue by reusing the original ancestral scores also for the infinite direct predictions scores. Another option may be to use the ACI scores for (a)causal relations as soft constraints for HEJ, although at the time of writing it is still unclear whether this would lead to the same speedup as the previously mentioned version.

We compared accuracy and execution times of standard HEJ (without the additional constraints derived from ACI) with ACI with restricted HEJ on simulated data. Figure 10 shows PR curves for predicting the presence and absence of direct causal relations for both methods. In Table 3 we list the execution times for recovering direct causal relations. Additionally, we list the execution times of only the second step of our approach, the restricted HEJ, to highlight the improvement in execution time resulting from the restrictions. In this preliminary investigation with simulated data, ACI with restricted HEJ is much faster than standard HEJ (without the additional constraints derived from ACI) for predicting direct causal relations, but only sacrifices a little accuracy (as can be seen in Figure 10). In the last column of Table 3, we show the execution times of standard HEJ when used to score ancestral relations. Interestingly, predicting direct causal relations is faster than predicting ancestral relations with HEJ. Still, for 8 variables the algorithm takes more than 2,500 seconds for all but 6 models of the first 40 simulated models.

Another possible strategy first reconstructs the (possibly incomplete) PAG Spirtes2000 from ancestral relations and conditional (in)dependences using a procedure similar to LoCI loci, and then recovering direct causal relations. There are some subtleties in the conversion from (possibly incomplete) PAGs to direct causal relations, so we leave this and other PAG based strategies, as well as a better analysis of conversion of ancestral relations to direct causal relations as future work.

11 Complete ACI encoding in ASP

Answer Set Programming (ASP) is a widely used declarative programming language based on the stable model semantics of logical programming. A thorough introduction to ASP can be found in DBLP:conf/aaai/Lifschitz08; DBLP:reference/fai/Gelfond08. The ASP syntax resembles Prolog, but the computational model is based on the principles that have led to faster solvers for propositional logic DBLP:conf/aaai/Lifschitz08. ASP has been applied to several NP-hard problems, including learning Bayesian networks and ADMGs antti. Search problems are reduced to computing the stable models (also called answer sets), which can be optionally scored.

For ACI we use the state-of-the-art ASP solver clingo 4 clingo. We provide the complete ACI encoding in ASP using the clingo syntax in Table 4. We encode sets via their natural correspondence with binary numbers and use boolean formulas in ASP to encode set-theoretic operations. Since ASP does not support real numbers, we scale all weights by a factor of 1000 and round to the nearest integer.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%  Ancestral Causal Inference (ACI)    %%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%% Preliminaries:
%%% Define ancestral structures:
{ causes(X,Y) } :- node(X), node(Y), X!=Y.
:- causes(X,Y), causes(Y,X), node(X), node(Y), X < Y.
:- not causes(X,Z), causes(X,Y), causes(Y,Z), node(X), node(Y), node(Z).
%%% Define the extension of causes to sets.
% existsCauses(Z,W) means there exists I \in W that is caused by Z.
1{causes(Z, I): ismember(W,I)} :- existsCauses(Z,W), node(Z), set(W), not ismember(W,Z).
existsCauses(Z,W) :- causes(Z, I), ismember(W,I), node(I), node(Z), set(W), not ismember(W,Z), Z!=I.
%%% Generate in/dependences in each model based on the input in/dependences.
1{ dep(X,Y,Z);indep(X,Y,Z) }1 :- input_indep(X,Y,Z,_).
1{ dep(X,Y,Z);indep(X,Y,Z) }1 :- input_dep(X,Y,Z,_).
%%% To simplify the rules, add symmetry of in/dependences.
dep(X,Y,Z) :- dep(Y,X,Z), node(X), node(Y), set(Z), X!=Y, not ismember(Z,X), not ismember(Z,Y).
indep(X,Y,Z) :- indep(Y,X,Z), node(X), node(Y), set(Z), X!=Y, not ismember(Z,X), not ismember(Z,Y).
%%%%% Rules from LoCI:
%%% Minimal independence rule (4) : X || Y | W u [Z] => Z -/-> X, Z -/-> Y, Z -/-> W
:- not causes(Z,X), not causes(Z,Y), not existsCauses(Z,W), dep(X,Y,W), indep(X,Y,U),
U==W+2**(Z-1), set(W), node(Z), not ismember(W, Z), Y != Z, X != Z.
%%% Minimal dependence rule (5): X |/| Y | W u [Z] => Z --> X or Z-->Y or Z-->W
:- causes(Z,X), indep(X,Y,W), dep(X,Y,U), U==W+2**(Z-1), set(W), set(U), node(X),
 node(Y), node(Z), not ismember(W, Z), not ismember(W, X), not ismember(W,Y),
 X != Y, Y != Z, X != Z.
% Note: the version with causes(Z,Y) is implied by the symmetry of in/dependences.
:- existsCauses(Z,W), indep(X,Y,W), dep(X,Y,U), U==W+2**(Z-1), set(W), set(U), node(X),
 node(Y), node(Z), not ismember(W, Z), not ismember(W, X), not ismember(W,Y),
 X != Y, Y != Z, X != Z.
%%%%% ACI rules:
%%% Rule 1: X || Y | U and X -/-> U => X -/->Y
:- causes(X,Y), indep(X,Y,U), not existsCauses(X,U), node(X), node(Y), set(U), X != Y,
not ismember(U,X), not ismember(U,Y).
%%% Rule 2: X || Y | W u [Z] => X |/| Z | W
dep(X,Z,W) :- indep(X,Y,W), dep(X,Y,U), U==W+2**(Z-1), set(W), set(U),
node(X), node(Y), node(Z), X != Y, Y != Z, X != Z, not ismember(W,X), not ismember(W,Y).
%%% Rule 3: X |/| Y | W u [Z] => X |/| Z | W
dep(X,Z,W) :- dep(X,Y,W), indep(X,Y,U), U==W+2**(Z-1), set(W), set(U),
node(X), node(Y), node(Z), X != Y, Y != Z, X != Z, not ismember(W,X), not ismember(W,Y).
%%% Rule 4: X || Y | W u [Z] and X || Z | W u U => X || Y | W u U
indep(X,Y,A) :- dep(X,Y,W), indep(X,Y,U), U==W+2**(Z-1), indep(X,Z,A), A==W+2**(B-1),
 set(W), set(U), not ismember(W,X), not ismember(W,Y), node(X), node(Y), node(Z),
 set(A), node(B), X!=B, Y!=B, Z!=B, X != Y, Y != Z, X != Z.
%%% Rule 5: Z |/| X | W and Z |/| Y | W and X || Y | W => X |/| Z | W u Z
dep(X,Y,U) :- dep(Z,X,W), dep(Z,Y,W), indep(X,Y,W), node(X), node(Y), U==W+2**(Z-1),
 set(W), set(U), X != Y, Y != Z, X != Z, not ismember(W,X), not ismember(W,Y).
%%%%% Loss function and optimization.
%%% Define the loss function as the incongruence between the input in/dependences
%%% and the in/dependences of the model.
fail(X,Y,Z,W) :- dep(X,Y,Z), input_indep(X,Y,Z,W).
fail(X,Y,Z,W) :- indep(X,Y,Z), input_dep(X,Y,Z,W).
%%% Include the weighted ancestral relations in the loss function.
fail(X,Y,-1,W) :- causes(X,Y), wnotcauses(X,Y,W), node(X), node(Y), X != Y.
fail(X,Y,-1,W) :- not causes(X,Y), wcauses(X,Y,W), node(X), node(Y), X != Y.
%%% Optimization part: minimize the sum of W of all fail predicates that are true.
#minimize{W,X,Y,C:fail(X,Y,C,W) }.
Table 4: Complete ACI encoding in Answer Set Programming, written in the syntax for clingo 4.

12 Open source code repository

We provide an open-source version of our algorithms and the evaluation framework, which can be easily extended, at http://github.com/caus-am/aci.

References

Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
""
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
   
Add comment
Cancel
Loading ...
47636
This is a comment super asjknd jkasnjk adsnkj
Upvote
Downvote
""
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters
Submit
Cancel

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test
Test description