Ancestral Causal Inference
Abstract
Constraint-based causal discovery from limited data is a notoriously difficult challenge due to the many borderline independence test decisions. Several approaches to improve the reliability of the predictions by exploiting redundancy in the independence information have been proposed recently. Though promising, existing approaches can still be greatly improved in terms of accuracy and scalability. We present a novel method that reduces the combinatorial explosion of the search space by using a more coarse-grained representation of causal information, drastically reducing computation time. Additionally, we propose a method to score causal predictions based on their confidence. Crucially, our implementation also allows one to easily combine observational and interventional data and to incorporate various types of available background knowledge. We prove soundness and asymptotic consistency of our method and demonstrate that it can outperform the state-of-the-art on synthetic data, achieving a speedup of several orders of magnitude. We illustrate its practical feasibility by applying it to a challenging protein dataset.
Sara Magliacane VU Amsterdam & University of Amsterdam sara.magliacane@gmail.com Tom Claassen Radboud University Nijmegen tomc@cs.ru.nl Joris M. Mooij University of Amsterdam j.m.mooij@uva.nl
29th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.
1 Introduction
Discovering causal relations from data is at the foundation of the scientific method. Traditionally, cause-effect relations have been recovered from experimental data in which the variable of interest is perturbed, but seminal works like the do-calculus [Pearl2009] and the PC/FCI algorithms [Spirtes2000; Zhang:2008:COR:1414091.1414237] demonstrate that, under certain assumptions (e.g., the well-known Causal Markov and Faithfulness assumptions [Spirtes2000]), substantial causal information can already be obtained from observational data alone.
Recently, there have been several proposals for combining observational and experimental data to discover causal relations. These causal discovery methods are usually divided into two categories: constraint-based and score-based methods. Score-based methods typically evaluate models using a penalized likelihood score, while constraint-based methods use statistical independences to express constraints over possible causal models. The advantages of constraint-based over score-based methods are the ability to handle latent confounders and selection bias naturally, and the absence of parametric modeling assumptions. Additionally, constraint-based methods expressed in logic [loci; BCCD; triantafillou2015constraint; antti] allow for an easy integration of background knowledge, which is not trivial even in simple cases for approaches that are not based on logic [borboudakis2012].
Two major disadvantages of traditional constraint-based methods are: (i) vulnerability to errors in statistical independence test results, which are quite common in real-world applications; (ii) no ranking or estimation of the confidence in the causal predictions. Several approaches address the first issue and improve the reliability of constraint-based methods by exploiting redundancy in the independence information [BCCD; antti; triantafillou2015constraint]. The idea is to assign weights to the input statements that reflect their reliability, and then use a reasoning scheme that takes these weights into account. Several weighting schemes can be defined, from simple ways of attaching weights to single independence statements [antti] to more complicated schemes that obtain weights for combinations of independence statements [triantafillou2015constraint; BCCD]. Unfortunately, these approaches have to sacrifice either accuracy, by using a greedy method [BCCD; triantafillou2015constraint], or scalability, by formulating a discrete optimization problem on a super-exponentially large search space [antti]. Additionally, the confidence estimation issue has been addressed only in limited cases [ICP].
We propose Ancestral Causal Inference (ACI), a logic-based method that provides comparable accuracy to the best state-of-the-art constraint-based methods (e.g., [antti]) for causal systems with latent variables without feedback, but improves on their scalability by using a more coarse-grained representation of causal information. Instead of representing all possible direct causal relations, ACI represents and reasons only with ancestral relations ("indirect" causal relations), using specialised ancestral reasoning rules. This representation, though still super-exponentially large, drastically reduces computation time. Moreover, it turns out to be very convenient, because in real-world applications the distinction between direct causal relations and ancestral relations is not always clear or necessary. Given the estimated ancestral relations, the estimate can be refined to direct causal relations, if necessary, by constraining standard methods to a smaller search space.
Furthermore, we propose a method to score predictions according to their confidence. The confidence score can be thought of as an approximation to the marginal probability of an ancestral relation. Scoring predictions enables one to rank them according to their reliability, allowing for higher accuracy. This is very important for practical applications, as the low reliability of the predictions of constraintbased methods has been a major impediment to their widespread use.
We prove soundness and asymptotic consistency of ACI and of our scoring method under mild conditions on the statistical tests. We show that ACI outperforms standard methods, like bootstrapped FCI and CFCI, in terms of accuracy, and achieves a speedup of several orders of magnitude over [antti] on a synthetic dataset. We illustrate its practical feasibility by applying it to a challenging protein dataset [SPP05] that so far had only been addressed with score-based methods, and observe that it successfully recovers from faithfulness violations. In this context, we showcase the flexibility of logic-based approaches by introducing weighted ancestral relation constraints that we obtain from a combination of observational and interventional data, and show that they substantially increase the reliability of the predictions. Finally, we provide an open-source implementation of our algorithms and the evaluation framework, which can be easily extended, at http://github.com/caus-am/aci.
2 Preliminaries and related work
Preliminaries
We assume that the data-generating process can be modeled by a causal Directed Acyclic Graph (DAG) that may contain latent variables. For simplicity we also assume that there is no selection bias. Finally, we assume that the Causal Markov Assumption and the Causal Faithfulness Assumption [Spirtes2000] both hold. In other words, the conditional independences in the observational distribution correspond one-to-one with the d-separations in the causal DAG. Throughout the paper we represent variables with uppercase letters, while sets of variables are denoted in boldface. All proofs are provided in the Supplementary Material.
A directed edge X → Y in the causal DAG represents a direct causal relation between cause X and effect Y. Intuitively, in this framework this indicates that manipulating X will produce a change in Y, while manipulating Y will have no effect on X. A more detailed discussion can be found in [Spirtes2000]. A sequence of directed edges is a directed path. If there exists a directed path from X to Y (or X = Y), then X is an ancestor of Y (denoted X ⤳ Y). Otherwise, X is not an ancestor of Y (denoted X ↛ Y). For a set of variables W, we write:

X ⤳ W := ∃ W′ ∈ W : X ⤳ W′,    X ↛ W := ¬(X ⤳ W).  (1)
We define an ancestral structure as any non-strict partial order on the observed variables of the DAG, i.e., any relation ⤳ that satisfies the following axioms:

(reflexivity):    X ⤳ X  (2)

(transitivity):    (X ⤳ Y) ∧ (Y ⤳ Z) ⟹ X ⤳ Z  (3)

(antisymmetry):    (X ⤳ Y) ∧ (Y ⤳ X) ⟹ X = Y  (4)
The underlying causal DAG induces a unique “true” ancestral structure, which represents the transitive closure of the direct causal relations projected on the observed variables.
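Since the three axioms are exactly those of a non-strict partial order, the validity of a candidate ancestral structure can be checked mechanically. A minimal sketch (representing the relation as a set of ordered pairs is our own choice, not part of the paper):

```python
from itertools import product

def is_ancestral_structure(variables, rel):
    """Check that rel (a set of (a, b) pairs meaning 'a causes b') is
    reflexive, transitive and antisymmetric, i.e. a non-strict partial
    order on the given variables."""
    for a in variables:
        if (a, a) not in rel:                    # reflexivity
            return False
    for (a, b), (c, d) in product(rel, rel):
        if b == c and (a, d) not in rel:         # transitivity
            return False
    for a, b in rel:
        if a != b and (b, a) in rel:             # antisymmetry
            return False
    return True
```

For instance, the transitive closure of X ⤳ Y ⤳ Z passes the check, while dropping the pair (X, Z) fails transitivity.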
For disjoint (sets of) variables X, Y, W we denote conditional independence of X and Y given W as X ⊥ Y | W, and conditional dependence as X ⊥̸ Y | W. We call the cardinality |W| the order of the conditional (in)dependence relation. Following [loci] we define a minimal conditional independence by:

X ⊥ Y | W ∪ [Z] := (X ⊥ Y | W ∪ {Z}) ∧ (X ⊥̸ Y | W),

and similarly, a minimal conditional dependence by:

X ⊥̸ Y | W ∪ [Z] := (X ⊥̸ Y | W ∪ {Z}) ∧ (X ⊥ Y | W).

The square brackets indicate that Z is needed for the (in)dependence to hold in the context of W. Note that the negation of a minimal conditional independence is not a minimal conditional dependence. Minimal conditional (in)dependences are closely related to ancestral relations, as pointed out in [loci]:
Lemma 1.
For disjoint (sets of) variables X, Y, Z, W:

(X ⊥ Y | W ∪ [Z]) ⟹ (Z ⤳ X) ∨ (Z ⤳ Y)  (5)

(X ⊥̸ Y | W ∪ [Z]) ⟹ Z ↛ ({X, Y} ∪ W)  (6)
Exploiting these rules (as well as others that will be introduced in Section 3) to deduce ancestral relations directly from (in)dependences is key to the greatly improved scalability of our method.
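In code, minimality is just a pair of membership tests against plain (in)dependence judgments. A sketch, assuming an oracle-style callable `indep(x, y, w)` (a name we introduce here) that reports whether x and y are independent given the conditioning set w:

```python
def is_minimal_independence(indep, x, y, w, z):
    """X indep Y given W u [Z]: adding z to the conditioning set w
    switches a dependence into an independence."""
    return indep(x, y, w | {z}) and not indep(x, y, w)

def is_minimal_dependence(indep, x, y, w, z):
    """X dep Y given W u [Z]: adding z to the conditioning set w
    switches an independence into a dependence."""
    return (not indep(x, y, w | {z})) and indep(x, y, w)
```

As a sanity check, in a collider X → Z ← Y, conditioning on Z creates a minimal dependence between X and Y, while in a chain X → Z → Y it creates a minimal independence.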
Related work on conflict resolution
One of the earliest algorithms to deal with conflicting inputs in constraint-based causal discovery is Conservative PC [conf/uai/RamseyZS06], which adds "redundant" checks to the PC algorithm that allow it to detect inconsistencies in the inputs, and then makes only predictions that do not rely on the ambiguous inputs. The same idea can be applied to FCI, yielding Conservative FCI (CFCI) [Colombo++2012; pcalg]. BCCD (Bayesian Constraint-based Causal Discovery) [BCCD] uses Bayesian confidence estimates to process information in decreasing order of reliability, discarding contradictory inputs as they arise. COmbINE (Causal discovery from Overlapping INtErventions) [triantafillou2015constraint] is an algorithm that combines the output of FCI on several overlapping observational and experimental datasets into a single causal model by first pooling and recalibrating the independence test p-values, and then adding each constraint incrementally, in order of reliability, to a SAT instance. Any constraint that makes the problem unsatisfiable is discarded.
Our approach is inspired by a method presented by Hyttinen, Eberhardt and Järvisalo [antti] (which we will refer to as HEJ in this paper), in which causal discovery is formulated as a constrained discrete minimization problem. Given a list of weighted independence statements, HEJ searches for the optimal causal graph G (an acyclic directed mixed graph, or ADMG) that minimizes the sum of the weights of the independence statements that are violated according to G. In order to test whether a causal graph induces a certain independence, the method creates an encoding DAG of d-connection graphs. D-connection graphs are graphs that can be obtained from a causal graph through a series of operations (conditioning, marginalization and interventions). An encoding DAG of d-connection graphs is a complex structure encoding all possible d-connection graphs and the sequences of operations that generated them from a given causal graph. This approach has been shown to correct errors in the inputs, but is computationally demanding because of the huge search space.
3 ACI: Ancestral Causal Inference
We propose Ancestral Causal Inference (ACI), a causal discovery method that accurately reconstructs ancestral structures, also in the presence of latent variables and statistical errors. ACI builds on HEJ [antti], but rather than optimizing over encoding DAGs, ACI optimizes over the much simpler (but still very expressive) ancestral structures.
For n variables, the number of possible ancestral structures is the number of partial orders on n labeled elements (http://oeis.org/A001035), which grows as 2^(n²/4 + o(n²)) [asymptoticEnumerationPosets], while the number of DAGs can be computed with a well-known super-exponential recurrence formula (http://oeis.org/A003024). The number of ADMGs is 2^(n(n−1)/2) times the number of DAGs. Although still super-exponential, the number of ancestral structures grows asymptotically much slower than the number of DAGs, and even more so than the number of ADMGs. For example, for 7 variables there are about 6 × 10^6 ancestral structures, but already about 2.4 × 10^15 ADMGs, which lower-bounds the number of encoding DAGs of d-connection graphs used by HEJ.
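The counts underlying this comparison are easy to reproduce. The sketch below uses Robinson's recurrence for the number of labeled DAGs (OEIS A003024), counts an ADMG as a DAG plus an arbitrary subset of bidirected edges, and hard-codes the poset counts from OEIS A001035:

```python
from math import comb

def num_dags(n):
    """Robinson's recurrence for labeled DAGs (OEIS A003024):
    a(n) = sum_{k=1..n} (-1)^(k+1) * C(n,k) * 2^(k*(n-k)) * a(n-k)."""
    a = [1]
    for m in range(1, n + 1):
        a.append(sum((-1) ** (k + 1) * comb(m, k) * 2 ** (k * (m - k)) * a[m - k]
                     for k in range(1, m + 1)))
    return a[n]

def num_admgs(n):
    """An ADMG is a DAG plus an arbitrary subset of the C(n,2) possible
    bidirected edges."""
    return num_dags(n) * 2 ** comb(n, 2)

# Number of partial orders on n labeled points, i.e. ancestral structures
# (OEIS A001035), for n = 0..8:
NUM_POSETS = [1, 1, 3, 19, 219, 4231, 130023, 6129859, 431723379]
```

For 7 variables this gives 6,129,859 ancestral structures versus roughly 2.4 × 10^15 ADMGs.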
New rules
The rules in HEJ explicitly encode marginalization and conditioning operations on d-connection graphs, so they cannot be easily adapted to work directly with ancestral relations. Instead, ACI encodes the ancestral reasoning rules (2)–(6) and five novel causal reasoning rules:
Lemma 2.
For disjoint (sets of) variables X, Y, Z, W:
(7)  
(8)  
(9)  
(10)  
(11) 
We prove the soundness of the rules in the Supplementary Material. We elaborate some conjectures about their completeness in the discussion after Theorem 1 in the next Section.
Optimization of loss function
We formulate causal discovery as an optimization problem where a loss function is optimized over possible causal structures. Intuitively, the loss function sums the weights of all the inputs that are violated in a candidate causal structure.
Given a list I = (i_j, w_j)_{j=1,…,m} of weighted input statements, where i_j is the input statement and w_j ≥ 0 is the associated weight, we define the loss function L(C; I) as the sum of the weights of the input statements that are not satisfied in a given possible structure C ∈ 𝒞, where 𝒞 denotes the set of all possible causal structures. Causal discovery is formulated as a discrete optimization problem:

C* := argmin_{C ∈ 𝒞} L(C; I),  (12)

with

L(C; I) := Σ_{(i_j, w_j) ∈ I : C ⊭ i_j} w_j,  (13)

where C ⊭ i_j means that input i_j is not satisfied in structure C according to the rules R.
This general formulation includes both HEJ and ACI, which differ in the types of possible structures 𝒞 and the rules R. In HEJ, 𝒞 represents all possible causal graphs (specifically, acyclic directed mixed graphs, or ADMGs, in the acyclic case) and R are operations on d-connection graphs. In ACI, 𝒞 represents ancestral structures (defined by rules (2)–(4)) and R are rules (5)–(11).
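On toy instances, the optimization (12)-(13) can be solved by brute force, which is also a useful sanity check for a logic-based implementation. The sketch below enumerates all ancestral structures over a few variables and minimizes the loss for weighted ancestral statements only (a simplification: the actual method also handles independence inputs):

```python
from itertools import combinations, product

def all_ancestral_structures(variables):
    """Enumerate all partial orders (sets of 'a causes b' pairs) by
    filtering all reflexive relations for transitivity and antisymmetry."""
    refl = {(v, v) for v in variables}
    pairs = [(a, b) for a, b in product(variables, repeat=2) if a != b]
    for r in range(len(pairs) + 1):
        for extra in combinations(pairs, r):
            rel = refl | set(extra)
            if (all((a, d) in rel for (a, b) in rel for (c, d) in rel if b == c)
                    and all((b, a) not in rel for (a, b) in rel if a != b)):
                yield rel

def loss(structure, weighted_inputs):
    """Sum of the weights of the inputs violated by the structure.
    Inputs are (statement, truth, weight); statement (a, b) means
    'a causes b'."""
    return sum(w for (s, truth, w) in weighted_inputs
               if (s in structure) != truth)

def best_structures(variables, weighted_inputs):
    """Solve the discrete minimization by exhaustive search."""
    structs = list(all_ancestral_structures(variables))
    m = min(loss(c, weighted_inputs) for c in structs)
    return m, [c for c in structs if loss(c, weighted_inputs) == m]
```

With inputs asserting X causes Y and Y causes Z (weight 2 each) but denying X causes Z (weight 1), transitivity forces the optimum to accept X causes Z and pay the cheaper weight, illustrating the error-correction behaviour.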
Constrained optimization in ASP
The constrained optimization problem in (12) can be implemented using a variety of methods. Given the complexity of the rules, a formulation in an expressive logical language that supports optimization, e.g., Answer Set Programming (ASP), is very convenient. ASP is a widely used declarative programming language based on the stable model semantics [DBLP:conf/aaai/Lifschitz08; DBLP:reference/fai/Gelfond08] that has been applied successfully to several NP-hard problems. For ACI we use the state-of-the-art ASP solver clingo 4 [clingo]. We provide the encoding in the Supplementary Material.
Weighting schemes
ACI supports two types of input statements: conditional independences and ancestral relations. Each statement can be assigned a weight that reflects its confidence. We propose two simple approaches with two desirable properties: they make ACI asymptotically consistent under mild assumptions (as described at the end of this section), and they assign a much smaller weight to independences than to dependences (which agrees with the intuition that one is confident about a measured strong dependence, but not about independence versus weak dependence). The approaches are:

a frequentist approach, in which for any appropriate frequentist statistical test with independence (resp. a non-ancestral relation) as null hypothesis, we define the weight:

w := |log p − log α|,  (14)

where p is the p-value of the test and α the significance level;
a Bayesian approach, in which the weight of each input statement i using data set D is:

w := log p(i | D) − log p(¬i | D),  (15)

where the prior probability p(i) can be used as a tuning parameter.
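Both schemes reduce to a few lines of code. The sketch below assumes the frequentist weight has the form |log p − log α| and the Bayesian weight is the log-odds of the statement given the data; these exact functional forms are our assumption here:

```python
from math import log

def frequentist_weight(p_value, alpha=0.05):
    """Reliability weight |log p - log alpha| for a single test result:
    strong dependences (p << alpha) get large weights, while
    independences (p > alpha) get weights bounded by |log alpha|."""
    return abs(log(p_value) - log(alpha))

def bayesian_weight(prob_statement):
    """Log-odds weight log p(i|D) - log p(not i|D), given the posterior
    probability of the statement from a Bayesian (in)dependence test."""
    return log(prob_statement) - log(1.0 - prob_statement)
```

Note how the frequentist weight of an independence (p > α) is bounded by |log α|, whereas a strong dependence (p ≪ α) can receive an arbitrarily large weight, matching the asymmetry discussed above.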
Given observational and interventional data, in which each intervention has a single known target (in particular, it is not a fat-hand intervention [EatonMurphy07]), a simple way to obtain a weighted ancestral statement X ⤳ Y is with a two-sample test that tests whether the distribution of Y changes with respect to its observational distribution when intervening on X. This approach conveniently applies to various types of interventions: perfect interventions [Pearl2009], soft interventions [Markowetz++2005], mechanism changes [TianPearl2001], and activity interventions [MooijHeskes_UAI_13]. The two-sample test can also be implemented as an independence test that tests for the independence of Y and an indicator variable that has value 0 for observational samples and value 1 for samples from the interventional distribution in which X has been intervened upon.
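A weighted ancestral statement can thus be produced by any two-sample test comparing observational and interventional samples of a variable. A minimal sketch with a Welch-style z statistic under a normal approximation (the test choice, the p-value floor, and all names are ours):

```python
from math import erf, log, sqrt

def two_sample_pvalue(obs, interv):
    """Approximate two-sided p-value for a difference in means between
    observational and interventional samples of a single variable."""
    n1, n2 = len(obs), len(interv)
    m1, m2 = sum(obs) / n1, sum(interv) / n2
    v1 = sum((x - m1) ** 2 for x in obs) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in interv) / (n2 - 1)
    z = (m1 - m2) / sqrt(v1 / n1 + v2 / n2)
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))

def ancestral_statement(target, intervened, obs, interv, alpha=0.05):
    """If intervening on `intervened` shifts the distribution of `target`,
    return the statement 'intervened causes target' together with a
    weight of the assumed form |log p - log alpha|."""
    p = max(two_sample_pvalue(obs, interv), 1e-16)  # floor to avoid log(0)
    return (intervened, target), abs(log(p) - log(alpha))
```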
4 Scoring causal predictions
The constrained minimization in (12) may produce several optimal solutions, because the underlying structure may not be identifiable from the inputs. To address this issue, we propose to use the loss function (13) to score the confidence of a feature f (e.g., an ancestral relation X ⤳ Y) as:

C(f) := min_{C ∈ 𝒞 : C ⊭ f} L(C; I) − min_{C ∈ 𝒞 : C ⊨ f} L(C; I).  (16)
Without going into details here, we note that the confidence (16) can be interpreted as a MAP approximation of the log-odds ratio of the probability that feature f is true in a Markov Logic model.
In this paper, we usually consider the features to be ancestral relations, but the idea is more generally applicable. For example, combined with HEJ it can be used to score direct causal relations.
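On small instances the score can again be computed by brute force: take the minimum loss over structures violating the feature minus the minimum loss over structures satisfying it (our reading of (16)). A self-contained sketch with only two variables:

```python
def confidence(feature, structures, weighted_inputs):
    """C(f) = min over structures not satisfying f of L(C)
            - min over structures satisfying f of L(C)."""
    def loss(c):
        return sum(w for (s, truth, w) in weighted_inputs if (s in c) != truth)
    with_f = [loss(c) for c in structures if feature in c]
    without_f = [loss(c) for c in structures if feature not in c]
    return min(without_f) - min(with_f)

# All three ancestral structures on the two variables X, Y:
STRUCTS = [
    {("X", "X"), ("Y", "Y")},              # no causal relation
    {("X", "X"), ("Y", "Y"), ("X", "Y")},  # X causes Y
    {("X", "X"), ("Y", "Y"), ("Y", "X")},  # Y causes X
]
```

A positive score favours the feature, a negative score its negation, and 0 means the inputs do not discriminate.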
Soundness and completeness
Our scoring method is sound for oracle inputs:
Theorem 1.
Let R be sound (not necessarily complete) causal reasoning rules. For any feature f, the confidence score C(f) of (16) is sound for oracle inputs with infinite weights.
Here, soundness means that C(f) = ∞ if f is identifiable from the inputs, C(f) = −∞ if ¬f is identifiable from the inputs, and C(f) = 0 otherwise (neither is identifiable). As features, we can consider for example ancestral relations X ⤳ Y and non-ancestral relations X ↛ Y for variables X, Y. We conjecture that the rules (2)–(11) are "order-1-complete", i.e., they allow one to deduce all (non)ancestral relations that are identifiable from oracle conditional independences of order at most 1 in observational data. For higher-order inputs additional rules can be derived. However, our primary interest in this work is improving computation time and accuracy, and we are willing to sacrifice completeness. A more detailed study of the completeness properties is left as future work.
Asymptotic consistency
Denote the number of samples by N. For the frequentist weights in (14), we assume that the statistical tests are consistent in the following sense:

P(reject H_0 | H_0) → 0,  P(accept H_0 | H_1) → 0  (17)

as N → ∞, where the null hypothesis H_0 is independence (resp. a non-ancestral relation) and the alternative hypothesis H_1 is dependence (resp. an ancestral relation). Note that we need to choose a sample-size dependent threshold α_N such that α_N → 0 at a suitable rate. Kalisch and Bühlmann [KalischBuehlmann2007] show how this can be done for partial correlation tests under the assumption that the distribution is multivariate Gaussian.
For the Bayesian weighting scheme in (15), we assume that for N → ∞,

p(i | D_N) → 1 in probability if i holds, and p(i | D_N) → 0 in probability otherwise.  (18)

This will hold (as long as there is no model misspecification) under mild technical conditions for finite-dimensional exponential family models. In both cases, the probability of a type I or type II error will converge to 0, and in addition, the corresponding weight will converge to ∞.
Theorem 2.
Let R be sound (not necessarily complete) causal reasoning rules. For any feature f, the confidence score C(f) of (16) is asymptotically consistent under assumption (17) or (18).
Here, "asymptotically consistent" means that the confidence score C(f) → ∞ in probability if f is identifiably true, C(f) → −∞ in probability if f is identifiably false, and C(f) → 0 in probability otherwise.
5 Evaluation
In this section we report evaluations on synthetically generated data and an application to a real dataset. Crucially, in causal discovery precision is often more important than recall. In many real-world applications, discovering a few high-confidence causal relations is more useful than finding every possible causal relation, as reflected in recently proposed algorithms, e.g., [ICP].
Compared methods
We compare the predictions of ACI and of the acyclic, causally insufficient version of HEJ [antti], both used in combination with our scoring method (16). We also evaluate two standard methods: Anytime FCI [Spirtes01ananytime; Zhang:2008:COR:1414091.1414237] and Anytime CFCI [Colombo++2012], as implemented in the pcalg R package [pcalg]. We use the anytime versions of (C)FCI because they allow for independence test results only up to a certain order. We obtain the ancestral relations from the output PAG using Theorem 3.1 from [ancestralFCI]. (Anytime) FCI and CFCI do not rank their predictions, but only predict the type of relation: ancestral (which we convert to +1), non-ancestral (−1) and unknown (0). To get a scoring of the predictions, we also compare with bootstrapped versions of Anytime FCI and Anytime CFCI. We perform the bootstrap by repeating the following procedure 100 times: sample randomly half of the data, perform the independence tests, and run Anytime (C)FCI. From the 100 output PAGs we extract the ancestral predictions and average them. We refer to these methods as BA(C)FCI. For a fair comparison, we use the same independence tests and thresholds for all methods.
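The bootstrap used for BA(C)FCI is straightforward to reproduce around any discovery algorithm. A sketch with a stub in place of Anytime (C)FCI (the function names and the {−1, 0, +1} encoding of predictions follow the description above; the stub itself is ours):

```python
import random

def bootstrap_predictions(data, run_algorithm, n_rep=100, seed=0):
    """Repeat: subsample half the data, run the algorithm, and average
    its {-1, 0, +1} ancestral predictions over the repetitions."""
    rng = random.Random(seed)
    totals = {}
    for _ in range(n_rep):
        half = rng.sample(data, len(data) // 2)
        for pair, pred in run_algorithm(half).items():
            totals[pair] = totals.get(pair, 0.0) + pred
    return {pair: s / n_rep for pair, s in totals.items()}
```

The averaged score lies in [−1, 1] and serves as a ranking of the predictions, exactly the role it plays in the comparison above.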
Synthetic data
We simulate the data using the simulator from HEJ [antti]: for each experimental condition (e.g., a given number of variables and a given order of independence tests), we generate random linear acyclic models with latent variables and Gaussian noise, and sample data points from them. We then perform independence tests up to the given order and weight the (in)dependence statements using the weighting schemes described in Section 3. For the frequentist weights we use tests based on partial correlations and Fisher's z-transform to obtain approximate p-values (see, e.g., [KalischBuehlmann2007]) with a fixed significance level. For the Bayesian weights, we use the Bayesian test for conditional independence presented in [DBLP:journals/ci/MargaritisB09] as implemented by HEJ, with a prior probability of 0.1 for independence.
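The order-1 frequentist weights rest on partial-correlation tests with Fisher's z-transform. A self-contained sketch using the standard first-order recursion and a normal approximation for the p-value (degrees-of-freedom conventions vary slightly between implementations):

```python
from math import atanh, erf, sqrt

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def partial_corr(xs, ys, zs):
    """First-order partial correlation of x and y given z, via the
    standard recursion on pairwise correlations."""
    rxy, rxz, ryz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

def fisher_z_pvalue(r, n, num_cond=1):
    """Two-sided p-value for a zero (partial) correlation of order
    num_cond, using Fisher's z-transform and a normal approximation."""
    z = atanh(r) * sqrt(n - num_cond - 3)
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))
```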
In Figure 1(a) we show the average execution times on a single core of a 2.80GHz CPU for different combinations of the number of variables and the order of the independence tests, while in Figure 1(b) we show the execution times for the individual simulated models, sorted in ascending order. For 7 variables ACI is almost 3 orders of magnitude faster than HEJ, and the difference grows exponentially as the number of variables increases. For 8 variables HEJ can complete only four of the first 40 simulated models before the timeout. For reference we add the execution times for bootstrapped Anytime FCI and CFCI.
In Figure 2 we show the accuracy of the predictions with precision-recall (PR) curves for both ancestral (X ⤳ Y) and non-ancestral (X ↛ Y) relations, in different settings. In this Figure, all results for ACI and HEJ are computed using frequentist weights and, as in all evaluations, our scoring method (16). While for these two methods we use independence test results up to order 1, for (bootstrapped) (C)FCI we use all possible independence test results. In this case, the anytime versions of FCI and CFCI are equivalent to the standard versions of FCI and CFCI. Since the overall results are similar, we report the results with the Bayesian weights in the Supplementary Material.
In the first row of Figure 2, we show the setting with 6 variables. The performances of HEJ and ACI coincide, performing significantly better than the other methods for non-ancestral predictions and for the top ancestral predictions (see the zoomed-in version in Figure 2(b)). This is remarkable, as HEJ and ACI use only independence test results up to order 1, in contrast with (C)FCI, which uses independence test results of all orders. Interestingly, the two discrete optimization algorithms do not seem to benefit much from higher-order independence tests, so we omit these from the plots (although we add the graphs in the Supplementary Material). Instead, bootstrapping traditional methods, oblivious to the (in)dependence weights, seems to produce surprisingly good results. Nevertheless, both ACI and HEJ outperform bootstrapped FCI and CFCI, suggesting that these methods achieve nontrivial error-correction.
In the second row of Figure 2, we show the setting with 8 variables, for which HEJ is too slow. In addition to the previous curves, we plot the accuracy of ACI when there is oracle background knowledge on the descendants of one variable. This setting simulates the effect of using interventional data, and we can see that the performance of ACI improves significantly, especially for the ancestral predictions. The performance of (bootstrapped) FCI and CFCI is limited by the fact that they cannot take advantage of this background knowledge, except with complicated postprocessing [borboudakis2012].
Application on real data
We consider the challenging task of reconstructing a signalling network from flow cytometry data [SPP05] under different experimental conditions. Here we consider one experimental condition as the observational setting and seven others as interventional settings. More details and more evaluations are reported in the Supplementary Material. In contrast to likelihood-based approaches like [SPP05; EatonMurphy07; MooijHeskes_UAI_13; Rothenhausler++2015], in our approach we do not need to model the interventions quantitatively. We only need to know the intervention targets, while the intervention types do not matter. Another advantage of our approach is that it takes possible latent variables into account.
We use a two-sample test to check, for each intervention and for each variable, whether the variable's distribution changes with respect to the observational condition. We use the p-values of these tests as in (14) in order to obtain weighted ancestral relations that are used as input (thresholded at the significance level). For example, if adding U0126 (a MEK inhibitor) changes the distribution of RAF significantly with respect to the observational baseline, we get a weighted ancestral relation MEK ⤳ RAF. In addition, we use partial correlations up to order 1 (tested in the observational data only) to obtain weighted independences used as input. We use ACI with (16) to score the ancestral relations for each ordered pair of variables. The main results are illustrated in Figure 3, where we compare ACI with bootstrapped Anytime CFCI under different inputs. The output for bootstrapped Anytime FCI is similar, so we report it only in the Supplementary Material. Algorithms like (Anytime) (C)FCI can only use the independences in the observational data as input, and therefore miss the strongest signal, the weighted ancestral relations, which are obtained by comparing interventional with observational data. In the Supplementary Material, we also compare with other methods ([ICP], [MooijHeskes_UAI_13]). Interestingly, as we show there, our results are similar to the best acyclic model reconstructed by the score-based method from [MooijHeskes_UAI_13]. As for other constraint-based methods, HEJ is computationally infeasible in this setting, while COmbINE assumes perfect interventions (whereas this dataset contains mostly activity interventions).
Notably, our algorithms can correctly recover from faithfulness violations (e.g., the independence between MEK and ERK), because they take into account the weights of the input statements (the weight of the independence is considerably smaller than that of the ancestral relation, which corresponds to a quite significant change in distribution). In contrast, methods that start by reconstructing the skeleton, like (Anytime) (C)FCI, would decide that MEK and ERK are non-adjacent, and are unable to recover from that erroneous decision. This illustrates another advantage of our approach.
6 Discussion and conclusions
As we have shown, ancestral structures are very well-suited for causal discovery. They offer a natural way to incorporate background causal knowledge, e.g., from experimental data, and allow a huge computational advantage over existing representations for error-correcting algorithms, such as [antti]. When needed, ancestral structures can be mapped to a finer-grained representation with direct causal relations, as we sketch in the Supplementary Material. Furthermore, confidence estimates on causal predictions are extremely helpful in practice, and can significantly boost the reliability of the output. Although standard methods, like bootstrapping (C)FCI, already provide reasonable estimates, methods that take into account the confidence in the inputs, like the one presented here, can lead to further improvements of the reliability of causal relations inferred from data.
Strangely (or fortunately) enough, neither of the optimization methods seems to improve much with higher-order independence test results. We conjecture that this may happen because our loss function essentially assumes that the test results are independent of one another (which is not true). Finding a way to take this into account in the loss function may further improve the achievable accuracy, but such an extension may not be straightforward.
Acknowledgments
SM and JMM were supported by NWO, the Netherlands Organization for Scientific Research (VIDI grant 639.072.410). SM was also supported by the Dutch programme COMMIT/ under the Data2Semantics project. TC was supported by NWO grant 612.001.202 (MoCoCaDi), and EU FP7 grant agreement no. 603016 (MATRICS). We also thank Sofia Triantafillou for her feedback, especially for pointing out the correct way to read ancestral relations from a PAG.
7 Proofs
7.1 ACI causal reasoning rules
We give a combined proof of all the ACI reasoning rules. Note that the numbering of the rules here is different from the numbering used in the main paper.
Lemma 3.
For X, Y, Z, W disjoint (sets of) variables:
Proof.
We assume a causal DAG with possible latent variables, the causal Markov assumption, and the causal faithfulness assumption.

This is a strengthened version of rule (i) in [conf/aistats/EntnerHS13]: note that the additional assumptions made there are redundant and not actually used in their proof. For completeness, we give the proof here. If X ⤳ Y, then there is a directed path from X to Y. As all paths between X and Y are blocked by W, this directed path must contain a node W′ ∈ W. Hence X ⤳ W, a contradiction with X ↛ W.

If then there exists a path between and such that each noncollider on is not in , every collider on is ancestor of , and there exists a collider on that is ancestor of but not of . Let be the collider on closest to that is ancestor of but not of . Note that

The path is d-connected given .

(because otherwise , a contradiction).

(because otherwise the path would be d-connected given , a contradiction).
Hence we conclude that , , , and by symmetry also .


Suppose . Then there exists a path between and , such that each noncollider on is not in , each collider on is an ancestor of , and is a noncollider on . Note that

The subpath must be d-connected given .

has at least one outgoing edge on . Follow this edge further along until reaching either , , or the first collider. When a collider is reached, follow the directed path to . Hence there is a directed path from to or or to , i.e., .


If in addition, , then must be a noncollider on the subpath . Therefore, .

Assume that and . Then there must be paths between and and between and such that each noncollider is not in and each collider is ancestor of . Let be the node on closest to that is also on (this could be ). Then we have a path such that each collider (except ) is ancestor of and each noncollider (except ) is not in . This path must be blocked given as . If would be a noncollider on this path, it would need to be in in order to block it; however, it must then also be a noncollider on or and hence cannot be in . Therefore, must be a collider on this path and cannot be ancestor of . We have to show that is ancestor of . If were a collider on or , it would be ancestor of , a contradiction. Hence must have an outgoing arrow pointing towards on and . If we encounter a collider following the directed edges, we get a contradiction, as that collider, and hence , would be ancestor of . Hence is ancestor of , and therefore, .
∎
7.2 Soundness
Theorem 3.
Let R be sound (not necessarily complete) causal reasoning rules. For any feature f, the confidence score C(f) of (16) is sound for oracle inputs with infinite weights, i.e., C(f) = ∞ if f is identifiable from the inputs, C(f) = −∞ if ¬f is identifiable from the inputs, and C(f) = 0 otherwise (neither is identifiable).
Proof.
We assume that the data generating process is described by a causal DAG which may contain additional latent variables, and that the distributions are faithful to the DAG. The theorem then follows directly from the soundness of the rules and the soundness of logical reasoning. ∎
7.3 Asymptotic consistency of scoring method
Theorem 4.
Let R be sound (not necessarily complete) causal reasoning rules. For any feature f, the confidence score C(f) of (16) is asymptotically consistent under assumption (14) or (15) in the main paper, i.e.,

in probability if is identifiably true,

in probability if is identifiably false,

in probability otherwise (neither are identifiable).
Proof.
As the number of statistical tests is fixed (or at least bounded from above), the probability of any error in the test results converges to 0 asymptotically. The loss function of all structures that do not correspond with the properties of the true causal DAG converges to ∞ in probability, whereas the loss function of all structures that are compatible with the properties of the true causal DAG converges to 0 in probability. ∎
8 Additional results on synthetic data
In Figures 4 and 5 we show the performance of ACI and HEJ antti for higher-order independence test results (). As in the main paper, for (bootstrapped) FCI and CFCI we use , because this gives the best predictions for these methods. In Figure 4 we report more accuracy results for the frequentist test with , the same setting as Figure 2 (a-c) in the main paper. As we can see, the performance of ACI and HEJ does not improve with higher order, but actually seems to deteriorate.
In Figure 5 we report accuracy results on synthetic data also for the Bayesian test described in the main paper, with prior probability of independence . Using the Bayesian test does not change the overall conclusions: ACI and HEJ overlap for order and they perform better than bootstrapped (C)FCI.
9 Application on real data
We provide more details and more results on the real-world dataset that was briefly described in the main paper, the flow cytometry data SPP05. The data consists of simultaneous measurements of expression levels of 11 biochemical agents in individual cells of the human immune system under 14 different experimental conditions.
9.1 Experimental conditions
The experimental conditions can be grouped into two batches of 8 conditions each that have very similar interventions:

- “noICAM”, used in the main paper and commonly used in the literature;

- “ICAM”, where Intercellular Adhesion Protein-2 (ICAM-2) was added (except when PMA or 2CAMP was added).
For each batch of 8 conditions, the experimenters added CD3 and CD28 to activate the signaling network in 6 out of 8 conditions. For the remaining two conditions (PMA and 2CAMP), CD3 and CD28 were not added (and neither was ICAM-2). We can consider the absence of these stimuli as a global intervention relative to the observational baseline (where CD3 and CD28 are present, and in addition ICAM-2 is present in the ICAM batch). For each batch (ICAM and noICAM), we can consider an observational dataset and 7 interventional datasets with different activators and inhibitors added to the cells, as described in Table 1. Note that the datasets from the last two conditions are the same in both settings. For more information about intervention types, see MooijHeskes_UAI_13.
In this paper, we ignore the fact that in the last two interventional datasets in each batch (PMA and 2CAMP) there is also a global intervention. Ignoring the global intervention allows us to compute the weighted ancestral relations, since we consider any variable that changes its distribution with respect to the observational condition to be an effect of the main target of the intervention (PKC for PMA and PKA for 2CAMP). This is in line with previous work SPP05; MooijHeskes_UAI_13. Also, we consider only PIP3 as the main target of the LY294002 intervention, based on the consensus network SPP05, even though in MooijHeskes_UAI_13 both PIP2 and PIP3 are considered to be targets of this intervention. In future work, we plan to extend ACI in order to address the task of learning the intervention targets from data, as done by EatonMurphy07 for a score-based approach.
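The reagent-to-target assignments used above can be collected in a small lookup table. Here is a minimal sketch in Python; it contains only the targets explicitly stated in this section (the complete per-condition mapping is given in Table 1):

```python
# Intervention target per reagent, as stated in the text.
# The full per-condition mapping is given in Table 1.
intervention_targets = {
    "PMA": "PKC",        # global intervention; main target PKC
    "2CAMP": "PKA",      # global intervention; main target PKA
    "LY294002": "PIP3",  # only PIP3 taken as target (consensus network)
    "U0126": "MEK",      # MEK inhibitor
}

def putative_cause(reagent):
    """Variable treated as the cause when comparing the interventional
    dataset for `reagent` with the observational data."""
    return intervention_targets[reagent]

print(putative_cause("LY294002"))  # PIP3
```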
In the main paper we provide some results for the most commonly used noICAM batch of experimental conditions. Below we report additional results on the same batch. Moreover, we provide results for causal discovery on the ICAM batch, which are quite consistent with the noICAM batch. Finally, we compare with other methods that were applied to this dataset, especially with a score-based approach (MooijHeskes_UAI_13) that shows surprisingly similar results to ACI, although it uses a very different method.
9.2 Results on noICAM batch
In Figure 6 we provide additional results for the noICAM batch. In the first row we show some of the possible inputs: weighted independences (in this case partial correlations) from observational data and weighted ancestral relations from comparing the interventional datasets with the observational data. Specifically, we consider as inputs only independences up to order (but only independences of order 0 are visualized in the figure). The color encodes the weight of the independence. As an example, the heatmap shows that RAF and MEK are strongly dependent.
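The independence weights are derived from partial correlation tests. As an illustration of how such a p-value can be obtained, here is a minimal sketch using Fisher's z-transform, a standard construction; the authors' exact test implementation may differ:

```python
import math

def partial_corr_pvalue(r, n, k):
    """Two-sided p-value for a (partial) correlation r estimated from
    n samples while conditioning on k variables (Fisher z-transform)."""
    z = math.sqrt(n - k - 3) * math.atanh(r)
    # standard normal CDF at |z|, via the error function
    phi = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))
    return 2.0 * (1.0 - phi)

# A strong correlation yields a tiny p-value (strong dependence),
# a weak one yields a large p-value (plausible independence).
print(partial_corr_pvalue(0.5, 100, 0))   # very small: dependent
print(partial_corr_pvalue(0.05, 100, 0))  # large: plausibly independent
```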
For the weighted ancestral relations, in Figure 6 we plot a matrix in which each row represents a cause, while the columns are the effects. As described in the main paper, we test for each intervention and for each variable whether its distribution changes with respect to the observational condition. We use the biological knowledge summarised in Table 1 to define the intervention target, which is then considered the putative “cause”. Then we use the p-values of these tests and a threshold to obtain the weights of the ancestral relations, similarly to what is proposed in the main paper for the frequentist weights for the independence tests:
For example, if adding U0126 (which is known to be a MEK inhibitor) changes the distribution of RAF with respect to the observational baseline, we get a weighted ancestral relation (MEK ⇝ RAF, 1.609).
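The numeric weight above can be reproduced under the assumption that the frequentist weight is the absolute log-ratio between the test p-value and the significance threshold, w = |log p − log α|; this is our reading of the main paper's weighting and should be treated as a sketch:

```python
import math

# Assumed frequentist weight: absolute log-ratio of p-value to threshold.
def ancestral_weight(p, alpha=0.05):
    return abs(math.log(p) - math.log(alpha))

# e.g. p = 0.01 at alpha = 0.05 gives |ln 0.01 - ln 0.05| = ln 5
print(round(ancestral_weight(0.01), 3))  # 1.609
```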
9.3 ICAM batch
In Figure 7 we show the results for the ICAM setting. These results are very similar to those for the noICAM batch (see also Figure 8), showing that the predicted ancestral relations are robust. In particular, it is clear that also for the ICAM batch weighted ancestral relations provide a very strong signal, and that methods that can exploit them (e.g., ACI) have a distinct advantage over methods that cannot (e.g., FCI and CFCI).
In general, in both settings there appear to be various faithfulness violations. For example, it is well-known that MEK causes ERK, yet in the observational data these two variables are independent. Nevertheless, we can see in the data that an intervention on MEK leads to a change of ERK, as expected. It is interesting to note that our approach can correctly recover from this faithfulness violation because it takes into account the weight of the input statements (note that the weight of the independence is smaller than that of the ancestral relation, which corresponds to a quite significant change in distribution). In contrast, methods that start by reconstructing the skeleton (like (C)FCI or LoCI loci) would decide that MEK and ERK are non-adjacent, and would be unable to recover from that erroneous decision. This illustrates one of the advantages of our approach.
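This recovery mechanism can be illustrated with a toy weighted loss minimization in the spirit of the method; the two candidate structures and the weights below are invented for illustration and are not the actual data:

```python
# Toy conflict: a weak independence statement (MEK indep ERK, weight 0.2)
# versus a strong ancestral relation (MEK causes ERK, weight 2.0).
# Weights are hypothetical.
statements = {"indep": 0.2, "anc": 2.0}

# Each candidate structure lists which input statements it violates.
violations = {
    "MEK causes ERK": ["indep"],  # dependence contradicts the independence
    "no relation":    ["anc"],    # contradicts the ancestral relation
}

def loss(struct):
    """Sum of the weights of the input statements the structure violates."""
    return sum(statements[s] for s in violations[struct])

best = min(violations, key=loss)
print(best)  # the strongly weighted ancestral relation wins
```

Minimizing the weighted loss prefers violating the weak independence over violating the strong ancestral relation, which is exactly how the method recovers from the faithfulness violation.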
9.4 Comparison with other approaches
We also compare our results with other, mostly score-based approaches. Amongst other results, MooijHeskes_UAI_13 report the top 17 direct causal relations on the noICAM batch that were inferred by their score-based method when assuming acyclicity. In order to compare fairly with the ancestral relations found by ACI, we first perform a transitive closure of these direct causal relations, which results in 21 ancestral relations. We then take the top 21 predicted ancestral relations from ACI (for the same noICAM batch), and compare the two in Figure 9. The black edges, the majority, represent the ancestral relations found by both methods. The blue edges are found only by ACI, while the grey edges are found only by MooijHeskes_UAI_13. Interestingly, the results are quite similar, despite the very different approaches. In particular, ACI allows for confounders and is constraint-based, while the method in MooijHeskes_UAI_13 assumes causal sufficiency (i.e., no confounders) and is score-based.
Table 2 summarizes most of the existing work on this flow cytometry dataset. It was originally part of the S1 material of ICP_PNAS. We have updated it here by also adding the results for ACI and for the transitive closure of MooijHeskes_UAI_13.
[Table 2: overview of causal predictions on the flow cytometry data. Columns: direct causal predictions from SPP05 (a), SPP05 (b), MooijHeskes_UAI_13 (a), EatonMurphy07, ICP ICP and hiddenICP ICP; ancestral predictions from MooijHeskes_UAI_13 (b, transitive closure) and ACI (top 21). Rows: the candidate relations RAF→MEK, MEK→RAF, MEK→ERK, MEK→AKT, MEK→JNK, PLCg→PIP2, PLCg→PIP3, PLCg→PKC, PIP2→PLCg, PIP2→PIP3, PIP2→PKC, PIP3→PLCg, PIP3→PIP2, PIP3→AKT, AKT→ERK, AKT→JNK, ERK→AKT, ERK→PKA, PKA→RAF, PKA→MEK, PKA→ERK, PKA→AKT, PKA→PKC, PKA→P38, PKA→JNK, PKC→RAF, PKC→MEK, PKC→PLCg, PKC→PIP2, PKC→PIP3, PKC→ERK, PKC→AKT, PKC→PKA, PKC→P38, PKC→JNK, P38→JNK, P38→PKC, JNK→PKC, JNK→P38. The per-method entries of the table are omitted here.]
10 Mapping ancestral structures to direct causal relations
An ancestral structure can be seen as the transitive closure of the directed edges of an acyclic directed mixed graph (ADMG). There are several strategies to reconstruct “direct” causal relations from an ancestral structure, in particular in combination with our scoring method. Here we sketch a possible strategy, but we leave a more in-depth investigation to future work.
Average execution time (s)

           Direct causal relations               Only second step    Ancestral relations
Setting    ACI with restricted HEJ   direct HEJ  restricted HEJ      ancestral HEJ
6, 1          9.77                     15.03        7.62                12.09
6, 4         16.96                    314.29       14.43               432.67
7, 1         36.13                    356.49       30.68               715.74
8, 1         98.92                       —         81.73                  —
9, 1        361.91                       —        240.47                  —
A possible strategy is to first recover the ancestral structure with ACI and our scoring method, and then use it as “oracle” input constraints for the HEJ antti algorithm. Specifically, for each weighted ancestral relation output by ACI we add the corresponding statement with infinite weight to the input list, and similarly for each predicted absence of an ancestral relation. Then we can use our scoring algorithm with HEJ to score direct causal relations (e.g., ) and direct acausal relations (e.g., ):
(19) 
In the standard HEJ algorithm, the search is over all possible ADMGs, but with our additional constraints we can reduce the search space to only the ADMGs that fit the specific ancestral structure, which on average and asymptotically is a reduction of for variables. We will refer to this two-step approach as ACI with restricted HEJ (ACI + HEJ). A side effect of assigning infinite scores to the original ancestral predictions instead of the originally estimated scores is that some of the estimated direct causal prediction scores will also be infinite, flattening their ranking. For this preliminary evaluation, we fix this issue by reusing the original ancestral scores also for the infinite direct prediction scores. Another option may be to use the ACI scores for (a)causal relations as soft constraints for HEJ, although at the time of writing it is still unclear whether this would lead to the same speedup as the previously mentioned version.
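The effect of restricting the search space to structures with a fixed ancestral structure can be illustrated at toy scale. The sketch below uses plain DAGs over three variables rather than ADMGs, purely to show the idea of filtering by transitive closure:

```python
from itertools import combinations

nodes = range(3)
edges = [(i, j) for i in nodes for j in nodes if i != j]

def closure(es):
    """Transitive closure (reachability) of an edge set, by fixed point."""
    reach = set(es)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(reach):
            for (c, d) in list(reach):
                if b == c and (a, d) not in reach:
                    reach.add((a, d))
                    changed = True
    return reach

def is_acyclic(es):
    # a cycle through v puts (v, v) in the closure
    return all((v, v) not in closure(es) for v in nodes)

# all DAGs over three labeled nodes
dags = [es for r in range(len(edges) + 1)
        for es in combinations(edges, r) if is_acyclic(es)]

# fix the ancestral structure 0 -> 1 -> 2 (hence also 0 -> 2)
target = {(0, 1), (1, 2), (0, 2)}
restricted = [es for es in dags if closure(es) == target]

print(len(dags), len(restricted))  # 25 unconstrained DAGs vs. 2 restricted
```

Already at this tiny scale the constraint shrinks the candidate set from 25 structures to 2; the reduction grows quickly with the number of variables.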
We compared the accuracy and execution times of standard HEJ (without the additional constraints derived from ACI) and of ACI with restricted HEJ on simulated data. Figure 10 shows PR curves for predicting the presence and absence of direct causal relations for both methods. In Table 3 we list the execution times for recovering direct causal relations. Additionally, we list the execution times of only the second step of our approach, the restricted HEJ, to highlight the improvement in execution time resulting from the restrictions. In this preliminary investigation with simulated data, ACI with restricted HEJ is much faster than standard HEJ for predicting direct causal relations, while sacrificing only a little accuracy (as can be seen in Figure 10). In the last column of Table 3, we show the execution times of standard HEJ when used to score ancestral relations. Interestingly, predicting direct causal relations with HEJ is faster than predicting ancestral relations. Still, for 8 variables the algorithm takes more than 2,500 seconds for all but 6 of the first 40 simulated models.
Another possible strategy first reconstructs the (possibly incomplete) PAG Spirtes2000 from ancestral relations and conditional (in)dependences using a procedure similar to LoCI loci, and then recovers direct causal relations from it. There are some subtleties in the conversion from (possibly incomplete) PAGs to direct causal relations, so we leave this and other PAG-based strategies, as well as a more thorough analysis of the conversion from ancestral relations to direct causal relations, as future work.
11 Complete ACI encoding in ASP
Answer Set Programming (ASP) is a widely used declarative programming language based on the stable model semantics of logic programming. A thorough introduction to ASP can be found in DBLP:conf/aaai/Lifschitz08; DBLP:reference/fai/Gelfond08. The ASP syntax resembles Prolog, but the computational model is based on the principles that have led to faster solvers for propositional logic DBLP:conf/aaai/Lifschitz08. ASP has been applied to several NP-hard problems, including learning Bayesian networks and ADMGs antti. Search problems are reduced to computing the stable models (also called answer sets), which can optionally be scored.
For ACI we use the state-of-the-art ASP solver clingo 4 clingo. We provide the complete ACI encoding in ASP using the clingo syntax in Table 4. We encode sets via their natural correspondence with binary numbers and use boolean formulas in ASP to encode set-theoretic operations. Since ASP does not support real numbers, we scale all weights by a factor of 1000 and round to the nearest integer.
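The weight scaling can be sketched as follows. The predicate name `indep` and the argument layout below are hypothetical placeholders, not the paper's actual encoding (see Table 4 for that); only the scale-by-1000-and-round step is taken from the text:

```python
def to_asp_fact(pred, args, weight):
    """Format a weighted statement as an integer-weighted ASP fact.
    ASP solvers such as clingo do not support real numbers, so the
    weight is scaled by 1000 and rounded to the nearest integer."""
    w = int(round(weight * 1000))
    arglist = ",".join(str(a) for a in args)
    return f"{pred}({arglist},{w})."

# Hypothetical fact: variables 1 and 2 independent given set 0, weight 1.609
print(to_asp_fact("indep", [1, 2, 0], 1.609))  # indep(1,2,0,1609).
```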
12 Open source code repository
We provide an open-source version of our algorithms and the evaluation framework, which can easily be extended, at http://github.com/caus-am/aci.