Anchored Causal Inference
in the Presence of Measurement Error
Abstract
We consider the problem of learning a causal graph in the presence of measurement error. This setting is for example common in genomics, where gene expression is corrupted through the measurement process. We develop a provably consistent procedure for estimating the causal structure in a linear Gaussian structural equation model from corrupted observations on its nodes, under a variety of measurement error models. We provide an estimator based on the methodofmoments, which can be used in conjunction with constraintbased causal structure discovery algorithms. We prove asymptotic consistency of the procedure and also discuss finitesample considerations. We demonstrate our method’s performance through simulations and on real data, where we recover the underlying gene regulatory network from zeroinflated singlecell RNAseq data.
Anchored Causal Inference
in the Presence of Measurement Error
Basil Saeed Lab for Information & Decision Systems Massachusetts Institute of Technology Cambridge, MA 02139 bsaeed@mit.edu Anastasiya Belyaeva Lab for Information & Decision Systems Massachusetts Institute of Technology Cambridge, MA 02139 belyaeva@mit.edu Yuhao Wang Lab for Information & Decision Systems Massachusetts Institute of Technology Cambridge, MA 02139 yuhaow@mit.edu Caroline Uhler Lab for Information & Decision Systems Massachusetts Institute of Technology Cambridge, MA 02139 cuhler@mit.edu
noticebox[b]Preprint. Under review.\end@float
1 Introduction
Determining causal relationships between a set of variables is a central task in causal inference with applications in many scientific fields including economics, biology and social sciences Friedman et al. (2000); Pearl (2003); Robins et al. (2000). Directed acyclic graph (DAG) models, also known as Bayesian networks, are commonly used to represent the causal structure among variables. Learning a DAG from observations on the nodes is intrinsically hard Chickering et al. (2004), and in general a DAG is only identifiable up to its Markov equivalence class Verma and Pearl (1990). In addition, in many applications there may be latent variables. While various algorithms have been developed to learn DAGs with latent variables Colombo et al. (2012); Spirtes (2001); Spirtes et al. (2000); Zhang (2008), without restrictions on the latent variables there may be infinitely many DAGs that can explain the data Richardson et al. (2002), in which case the model is of little use for further interpretation and analysis.
Restrictions on the latent variables can improve model identifiability. In this paper, we consider the problem of causal discovery with measurement error, where each latent variable has exactly one corresponding observed variable (corrupted observation of the latent variable), which serves as its anchor, and the goal is to infer the causal relationships among the latent variables; see Figure 1a. For instance in social sciences, the beliefs of people cannot be directly measured, but surveys can provide a noisy version of the latent variables, and we may be interested in inferring the causal structure among the latent beliefs. Similarly, in biological applications measurment error needs to be taken into account, e.g. when measuring brain signals using functional magnetic resonance (fMRI) or gene expression using RNA sequencing.
While the method developed in this paper can be applied generally to causal inference in the presence of measurement noise, we will showcase its use on learning the underlying gene regulatory network from singlecell RNAseq data Klein et al. (2015). Such data is known to suffer from dropout Ziegenhain et al. (2017), which manifests itself as false zeros due to too little starting RNA or technical noise. In singlecell RNAseq experiments, it is estimated that such false zeros occur with a probability of 2476% across current stateoftheart technologies Ziegenhain et al. (2017). Applying causal inference methods directly to such data may lead to biased estimates, where negatively correlated variables may appear positively correlated due to dropout; see Figure 1b. Currently, the typical approach for dealing with dropout is to first impute gene expression data Van Dijk et al. (2018). However, this may introduce artificial dependencies (Figure 1c) that may be detrimental for learning the correct causal model. It is therefore of great interest to develop algorithms that directly learn the causal structure among the latent variables from the corrupted data.
Zhang et al. Zhang et al. (2017) considered the problem of learning a causal DAG model under measurement error as in Figure 1a, but restricted the measurement error to be independent from the latent variables. For many applications, including modeling dropout in singlecell RNAseq data, this assumption is too restrictive. Halpern et al. Halpern et al. (2015) considered more general anchored causal models than in Figure 1a, but only in the binary setting. Silva et al. Silva et al. (2006) considered a similar model for continuous distributions, but under the assumption that the dependence between latent and observed variables is linear, an assumption that is too restrictive for many applications. Inspired by topic modeling, the authors in Anandkumar et al. (2013) proposed a causal discovery method for DAGs with various levels of latent variables, but under the assumption that the latent variables are nonGaussian and have sufficient outgoing edges for identifiability of the model.
The main contributions of this paper are as follows:

We introduce anchored causal inference to model causal relationships among latent variables from observations with measurement error in the Gaussian setting.

We develop a provably consistent algorithm based on the methodofmoments to identify the causal structure (up to Markov equivalence) among the latent variables for a flexible class of measurement error models, including nonlinear dependence on the latent variables.

In particular, we derive a consistent estimator for partial correlations and a consistent conditional independence test when the measurement noise is dropout.

We present experimental results on both simulated and singlecell RNAseq data, showing that our estimator, which takes into account measurement error, outperforms standard causal inference algorithms applied directly to the corrupted data.
2 Preliminaries and Related Work
Let be a directed acyclic graph (DAG) with nodes and directed edges . We associate a random variable to each node . We denote the joint distribution of by and assume that is generated by a linear Gaussian structural equation model:
(1) 
where is the weighted adjacency matrix of and with . We consider the problem where only a noisecorrupted version of is observed. We define to be the observed variables (anchors) generated from the latent variables by a noise process , where has nonzero variance; see Figure 1a. We aim to learn the DAG associated with the latent variables .
The majority of literature in causal inference assumes causal sufficiency, i.e. no latent variables. A standard approach for causal structure discovery in this setting is to first infer the conditional independence (CI) relations among the observed variables and then use the CI relations to learn the DAG structure Spirtes et al. (2000). However, since multiple DAGs can encode the same CI relations, can only be identified up to its Markov equivalence class (MEC). An MEC can be represented by a CPDAG, a partially directed graph whose skeleton (underlying undirected graph) is the skeleton of and an edge is directed if it has the same direction for all DAGs in the MEC Andersson et al. (1997); Verma and Pearl (1990). Various algorithms have been developed for learning a CPDAG under causal sufficiency Chickering (2002); Solus et al. (2017); Spirtes et al. (2000), most prominently the PC algorithm Spirtes et al. (2000), which treats causal inference as a constraint satisfaction problem with the constraints being CI relations. The PC algorithm is provably consistent, meaning that it outputs the correct MEC when the sample size , under the socalled faithfulness assumption, which asserts that the CI relations entailed by are exactly the relations implied by separation in the underlying DAG Spirtes et al. (2000).
In the presence of latent variables, identifiability is further weakened (only the socalled PAG is identifiable) and various algorithms have been developed for learning a PAG Colombo et al. (2012); Spirtes (2001); Spirtes et al. (2000); Zhang (2008). However, these algorithms cannot estimate causal relations among the latent variables, which is our problem of interest. Leung et al. Leung et al. (2016) study identifiability of directed Gaussian graphical models in the presence of a single latent variable. Zhang et al. Zhang et al. (2017), Silva et al. Silva et al. (2006), Halpern et al. Halpern et al. (2015) and Anandkumar et al. Anandkumar et al. (2013) all consider the problem of learning causal edges among latent variables from the observed variables, i.e. models as in Figure 1a or generalizations thereof, but under assumptions that may not hold for our applications of interest, namely that the measurement error is independent of the latent variables Zhang et al. (2017), that the observed variables are a linear function of the latent variables Silva et al. (2006), that the observed variables are binary Halpern et al. (2015), or that each latent variable is nonGaussian with sufficient outgoing edges to guarantee identifiability Anandkumar et al. (2013).
3 Anchored Causal Inference
In the following, we first describe the assumptions of our Anchored Causal Model, then motivate the model by the application to learning the underlying gene regulatory network from zeroinflated singlecell RNAseq data, and finally provide an algorithm for anchored causal inference and prove its consistency under the model assumptions.
Model AssumptionsAnchored Causal Model.
(A1).
Given a DAG , the latent variables are generated by a linear Gaussian structural equation model (see (1)) that is faithful to .
(A2).
The observed random vector satisfies the CI relations
Furthermore, for all there exists a finitedimensional vector of monomials in and a finitedimensional vector of monomials in and such that their means can be mapped to the moments of the latent variables by known continuously differentiable functions and , i.e. and , and their covariance satisfies .
While Assumption (A1) fixes the structural and functional relationship between the latent variables by a linear Gaussian structural equation model, Assumption (A2) fixes the structural relationship between latent and observed variables by each having exactly one parent for all , and ensures that the first and secondorder moments of can be obtained from moments of without the restriction to a specific measurement error model. This allows for more general noise models than in Silva et al. (2006); Zhang et al. (2017).
Example 3.1 (Modeling singlecell RNAseq data).
Let represent the true latent RNA values (logtransformed) and the observed RNA values after dropout. The authors in Pierson and Yau (2015) considered a simple model of gene regulation represented by a linear Gaussian structural equation model among the latent variables and modeled dropout for singlecell RNAseq data by
Assuming the dropout probabilities are known, this model satisfies Assumptions (A1) and (A2) with
but does not satisfy the assumptions in Silva et al. (2006); Zhang et al. (2017). ∎
Algorithm 1 describes our Anchored Causal Inference procedure. The procedure works as follows: Given i.i.d. samples of denoted by , compute the required empirical moments and . Given a particular measurement error model defined by and , compute the first and secondorder moments of to obtain its covariance matrix . If we can obtain the set of CI relations involving , we can use causal structure discovery algorithms to learn (up to its Markov equivalence class). Since follows a Gaussian distribution, conditional independence corresponds to zero partial correlation. Let and , then the sample partial correlations can be computed recursively for increasing conditioning set sizes by
(2) 
where in the base case are the correlations obtained from . The main difficulty lies in developing test statistics based on the estimated partial correlations such that the inferred CI relations correspond as to the set of CI relations implied by the underlying causal DAG . Such test statistics are developed in Corollaries 1 and 2. The inferred CI relations can then be fed into a constraintbased causal discovery algorithm such as the PC algorithm Spirtes et al. (2000) to obtain the CPDAG of .
The first step in asserting consistency of Algorithm 1 is the following lemma, which follows from the law of large numbers; the proof is provided in appendix A.
Next, we design a consistent hypothesis test for obtaining CI relations based on the estimated partial correlations of in (2), similar in principle to Gaussian CI tests based on Fisher’s ztransform used by many causal inference algorithms Chickering (2002); Spirtes et al. (2000); Wang et al. (2017). Under causal sufficiency, i.e., when is the identity function for all , it can be shown Lehmann (1998) that the estimated partial correlations in (2) satisfy
(3) 
Hence, applying the delta method to Fisher’s ztransform of the estimated partial correlations yields
(4) 
Hence Fisher’s ztransform can be used in the test statistic to test conditional independence by declaring at significance level if and only if
(5) 
where denotes the inverse CDF of . The following theorem generalizes (3) to our anchored causal model class, where the partial correlations of are estimated from the observed moments of .
Theorem 1.
The proof of Theorem 1 can be found in appendix B, where we provide a procedure for computing the function for any and . The idea of the proof is as follows: First apply the Central Limit Theorem to the vector of sample moments . Under assumption (A2), the correlations based on are continuously differentiable functions of . Furthermore, for any and , the partial correlation is defined recursively for increasing conditioning set sizes as a continuously differentiable function of . Hence, one can iteratively apply the delta method starting from the statement of the Central Limit Theorem applied to to obtain the asymptotic distribution of .
In the following two corollaries to Theorem 1, we provide different test statistics for CI testing based on the estimated partial correlations of the latent vector . We start by generalizing Fisher’s transform and its asymptotic distribution given in (4).
Corollary 1.
The proof of Corollary 1 follows by applying the delta method to Theorem 1 (appendix C). Whether the conditions of Corollary 1 are satisfied, depends on the measurement error model . We show in appendix F.1 that the conditions of Corollary 1 hold for the dropout model for and , and derive the corresponding variance stabilizing transformation. Note that it is sufficient if we can compute the integral in (6) numerically; a closedform solution is not required. Corollary 1 implies that the test statistic in (5) can be used to consistently estimate the CI relations among the latent variables . When the assumptions of Corollary 1 are not met, then we can obtain a different test statistic that is asymptotically normal as shown in the following result.
Corollary 2 follows from Theorem 1: since is continuous in , converges to by the law of large numbers and implies that as (appendix D). Hence the test statistic can be used in (5) to obtain the CI relations among the latent variables .
With respect to finitesample considerations, note that in Corollary 2 is a function of , and thus its convergence to its asymptotic distribution requires the convergence of . Hence, we expect the convergence in distribution of Corollary 1 to be faster than that of Corollary 2 and as a result the test statistic in Corollary 1 to perform better in the finitesample regime.
We end this section with the main result of this paper, namely the consistency of Algorithm 1.
Theorem 2.
The proof can be found in appendix E. In particular, we show that under the faithfulness assumption, the set of CI relations inferred from in Algorithm 1 converges to the set of CI statements implied by the underlying DAG . Hence using any consistent causal structure discovery algorithm on these CI relations results in the correct Markov equivalence class.
4 Implementation
Next, we discuss an important aspect of implementation and show how the results in Section 3 can be applied to the dropout model in Example 3.1.
In the finitesample setting, the estimated covariance matrix of the latent variables is not guaranteed to be positive semidefinite. In this case, shrinkage towards a positive definite matrix can be used as a form of regularization. When , a standard approach is to use LedoitWolf shrinkage towards the identity matrix Ledoit and W. (2004). When , the sample covariance matrix based on the samples is positive definite with probability 1 and hence can be shrunk towards by
(7) 
The form of shrinkage that provides better results depends on whether or the identity matrix are better approximations of the true underlying covariance matrix . In our experiments in Section 5, we applied shrinkage towards as in (7). Both types of shrinkage result in consistent estimates: consistency of LedoitWolf shrinkage is proven in Ledoit and W. (2004) and the consistency of shrinkage towards the sample covariance matrix in (7) follows from Theorem 1, since as implies that becomes positive semidefinite with large enough sample size, and therefore, as , which shows that shrinkage reduces to the consistent case without shrinkage.
4.1 Application: The Dropout Model
Under the dropout model in Example 3.1, the assumptions of Corollary 1 are not satisfied, since in general cannot be expressed as a function of only. This is shown in appendix F.2 by plotting as a function of for fixed . In the special case when , the conditions are satisfied for all when , i.e., a variance stabilizing transformation can be found for the correlations . This dropout stabilizing transform is provided in appendix F.1. This transform and the resulting CI test can be used as a heuristic also when or and we analyze its performance in Section 5.
In appendix F.3, we provide a recursive formula for computing from Corollary 2 for the dropout model, which we refer to as the dropout normalizing transform. Computing it requires determining the asymptotic variance of . Also note that applying shrinkage will in general change the asymptotic variance of the partial correlations. We show how to correct as a function of the shrinkage coefficient in appendix F.4. This adjustment is applied in all of our experiments in Section 5.
In Section 5.2, we apply our estimation procedure based on the dropout model to singlecell RNAseq data in order to infer the structure of the underlying gene regulatory network. For the theoretical analysis in Section 3 we assumed that the dropout probabilities are known. However, in general these parameters need to be estimated and it was proposed in Pierson (2015) to model by for , where depends on the singlecell RNAseq assay. Using this model for the dropout probabilities we can jointly estimate the parameters and as follows. This model implies
Since corresponds to the gene expression count averages, we can assume that . Under this assumption, the equation has a unique solution for . With respect to the parameter , for some singlecell RNAseq assays it is possible to obtain an estimate for by including molecules with known expression as controls. However, since this estimate is often unreliable Grün and van Oudenaarden (2015) and not always available, we selected so as to minimize the amount of shrinkage required to obtain a positive semidefinite matrix.
5 Experiments
In this section, we analyze the performance of Algorithm 1 based on the dropout model both on simulated data and on singlecell RNAseq data.
5.1 Simulations
The data was generated from the dropout model described in Example 3.1. The structure of the matrix in the linear Gaussian structural equation model (1) was generated by an ErdösRenyi model with expected degree for and number of nodes . The weights of the matrix were uniformly drawn from to be bounded away from . The mean parameters were uniformly drawn from and the probabilities from . These ranges were chosen to match the expected ranges in the gene expression data analyzed in Section 5.2. We generated observations of from this generating model, for .
Figure 2 shows the ROC curves and the Structural Hamming Distance (SHD) plots evaluating the skeleton of the CPDAG output by Algorithm 1 for , and . Each point in the plots is an average over simulations (divisible by the number of cores). The CI relations were obtained using the dropout stabilizing transform (green curve), the dropout normalizing transform (red curve), and the Gaussian CI test applied directly to the perturbed observations (blue curve) and the PC algorithm was used to estimate the CPDAG. In appendix G, we provide additional figures for the setting where , and , as well as the ROC curves and SHD plots evaluating the inferred CPDAG.
The simulation results show that the dropout stabilizing transform outperforms, or performs at least as well as the Gaussian CI test in all settings we tested even though it was derived for and . The performance of both dropout transforms improves over the Gaussian CI test with increasing sample size. A large sample size is especially important for the dropout normalizing transform because it relies on the estimation of more parameters. Since the dropout stabilizing transform is preferable to the dropout normalizing transform from a computational point of view while performing similarly well, we concentrate on this transform in Figure 2(eh).
The simulation results show that our estimators significantly outperform the naive Gaussian CI test applied directly to the corrupted data. This has important implications for the development of new singlecell RNAseq technologies indicating that an increased sample size is preferrable to minimizing dropout. Current singlecell technologies have been heading exactly in this direction, trading off increased sample sizes (with studies containing up to a million samples) for an increased dropout rate sin (). The simulation results suggest that our estimators are wellsuited for such data.
5.2 Singlecell RNAseq Data
Perturbseq. We tested our method on gene expression data collected via singlecell Perturbseq by Dixit et al. Dixit et al. (2016) from bone marrowderived dendritic cells (BMDCs). As in most singlecell studies, the gene expression observations are affected by dropout. The data consists of 933 observational samples (after standard preprocessing), which we used for learning the gene regulatory network. The Perturbseq data set also contains interventional samples, which we used to evaluate the estimated CPDAG and construct an ROC curve. As in Dixit et al. (2016), we focussed our analysis on 24 genes, which are important transcription factors known to regulate each other as well as a variety of other genes Garber et al. (2012). We used the dropout stabilizing transform to obtain the CI relations among the latent variables and compared the resulting CPDAG to the graph obtained using the standard Gaussian CI test applied directly to the observed corrupted data. In both settings we used the PC algorithm to infer the CPDAG from the CI relations. Figure 3a shows the resulting ROC curve, which quantifies for varying tuning parameters the accuracy of each of the learned CPDAGs in predicting the effect of each of the eight interventions as also described in Wang et al. (2017). Our algorithm with the dropout stabilizing transform outperforms the Gaussian CI test. The inferred gene regulatory network is shown in Figure 3b.
Pancreas  Type II Diabetes. We also tested our method on two gene expression data sets collected from human pancreatic cells Baron et al. (2016); Segerstolpe et al. (2016) via different singlecell assays, one with low dropout rate (Smartseq2) and the other with high dropout rate (inDrop). The Smartseq2 data set consists of 3514 cells and the inDrop data set of 8569 cells. We focused our analysis on a gene regulatory network of 20 genes, which is known to be involved in Type II Diabetes Sharma et al. (2018). Since no interventional data is available for this application, we evaluated our estimator based on how consistent the estimated CPDAG is across the two data sets. Figure 3c shows the SHD between the CPDAGs estimated from the data set with low versus high dropout using the dropout stabilizing transform as compared to the Gaussian CI test applied directly on the observed data. The inferred gene regulatory network is provided in appendix G. Since the SHD is lower for the dropout stabilizing transform than the Gaussian CI test, the CPDAG estimates produced by our method are more consistent across different dropout levels, thereby suggesting that our method is more robust to dropout.
6 Discussion
In this paper, we proposed a procedure for learning causal relationships in the presence of measurement error. For this purpose, we considered the anchored causal model, where each corrupted observed variable is generated from a latent uncorrupted variable and the aim is to learn the causal relationships among the latent variables using the observed variables as anchors. We introduced an algorithm that learns the Markov equivalence class of the causal DAG among the latent variables based on the empirical moments of the observed variables and proved its consistency. One of the main motivations for developing this algorithm was to address the problem of dropout in singlecell RNAseq experiments. We showed how to apply our algorithm for learning the underlying gene regulatory network under a standard dropout model and analyzed its performance on synthetic data and on singlecell RNAseq data, thereby showing that taking into account dropout allows identifying the causal relationships between genes in a more accurate and robust manner.
References
 [1] 10x genomics: Transcriptional profiling of 1.3 million brain cells with the chromium single cell 3’ solution. Application note. http://support.10xgenomics.com/singlecellgeneexpression/datasets/1.3.0/1M_neurons. Accessed May, 2019.
 Anandkumar et al. [2013] A. Anandkumar, D. Hsu, A. Javanmard, and S. Kakade. Learning linear bayesian networks with latent variables. In International Conference on Machine Learning, pages 249–257, 2013.
 Andersson et al. [1997] S. A. Andersson, D. Madigan, and M. D. Perlman. A characterization of Markov equivalence classes for acyclic digraphs. The Annals of Statistics, 25(2):505–541, 1997.
 Baron et al. [2016] M. Baron, A. Veres, S. L. Wolock, A. L. Faust, R. Gaujoux, A. Vetere, J. H. Ryu, B. K. Wagner, S. S. ShenOrr, A. M. Klein, D.A. Melton, and Yanai I. A singlecell transcriptomic map of the human and mouse pancreas reveals interand intracell population structure. Cell Systems, 3(4):346–360, 2016.
 Chickering [2002] D. M. Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3(Nov):507–554, 2002.
 Chickering et al. [2004] D. M. Chickering, D. Heckerman, and C. Meek. Largesample learning of Bayesian networks is NPhard. Journal of Machine Learning Research, 5(Oct):1287–1330, 2004.
 Colombo et al. [2012] D. Colombo, M. H. Maathuis, M. Kalisch, and T. S. Richardson. Learning highdimensional directed acyclic graphs with latent and selection variables. The Annals of Statistics, pages 294–321, 2012.
 Dixit et al. [2016] A. Dixit, O. Parnas, B. Li, J. Chen, C. P. Fulco, L. JerbyArnon, N. D. Marjanovic, D. Dionne, T. Burks, R. Raychowdhury, B. Adamson, T. M. Norman, E. S. Lander, J. S. Weissman, N. Friedman, and A. Regev. Perturbseq: dissecting molecular circuits with scalable singlecell RNA profiling of pooled genetic screens. Cell, 167(7):1853–1866, 2016.
 Friedman et al. [2000] N. Friedman, M. Linial, I. Nachman, and D. Pe’er. Using Bayesian networks to analyze expression data. Journal of Computational Biology, 7(34):601–620, 2000.
 Garber et al. [2012] M. Garber, N. Yosef, A. Goren, R. Raychowdhury, A. Thielke, M. Guttman, J. Robinson, B. Minie, N. Chevrier, Z. Itzhaki, R. BlecherGonen, C. Bornstein, D. AmannZalcenstein, A. Weiner, D. Friedrich, J. Meldrim, O. Ram, C. Cheng, A. Gnirke, S. Fisher, N. Friedman, B. Wong, B. E. Bernstein, C. Nusbaum, N. Hacohen, A. Regev, and I. Amit. A highthroughput chromatin immunoprecipitation approach reveals principles of dynamic gene regulation in mammals. Molecular Cell, 47(5):810–822, 2012.
 Grün and van Oudenaarden [2015] D. Grün and A. van Oudenaarden. Design and analysis of singlecell sequencing experiments. Cell, 163(4):799–810, 2015.
 Halpern et al. [2015] Y. Halpern, S. Horng, and D. Sontag. Anchored discrete factor analysis. arXiv preprint arXiv:1511.03299, 2015.
 Klein et al. [2015] A. M. Klein, L. Mazutis, I. Akartuna, N. Tallapragada, A. Veres, V. Li, L. Peshkin, and M. W. Kirscner. Droplet barcoding for singlecell transcriptomics applied to embryonic stem cells. Cell, 161(5):1187–1201, 2015. ISSN 00928674. doi: 10.1016/j.cell.2015.04.044. URL http://dx.doi.org/10.1016/j.cell.2015.04.044.
 Ledoit and W. [2004] O. Ledoit and Michael W. A wellconditioned estimator for largedimensional covariance matrices. Journal of Multivariate Analysis, 88(2):365–411, 2004.
 Lehmann [1998] E. L. Lehmann. Elements of LargeSample Theory. Spring Science and Business Media, 1998. ISBN 0387985956.
 Leung et al. [2016] D. Leung, M. Drton, and H. Hara. Identifiability of directed Gaussian graphical models with one latent source. Electronic Journal of Statistics, 10(1):394–422, 2016.
 Pearl [2003] J. Pearl. Causality: Models, reasoning, and inference. Econometric Theory, 19(4):675–685, 2003.
 Pierson [2015] E. Pierson. Statistical models for singlecell data. Master’s thesis, Oxford University, UK, 2015.
 Pierson and Yau [2015] E. Pierson and C. Yau. ZIFA: Dimensionality reduction for zeroinflated singlecell gene expression analysis. Genome Biology, 16(1):241, 2015.
 Richardson et al. [2002] Thomas Richardson, Peter Spirtes, et al. Ancestral graph Markov models. The Annals of Statistics, 30(4):962–1030, 2002.
 Robins et al. [2000] J. M. Robins, M. A. Hernan, and B. Brumback. Marginal structural models and causal inference in epidemiology, 2000.
 Segerstolpe et al. [2016] Å. Segerstolpe, A. Palasantza, P. Eliasson, E. Andersson, A. Andréasson, X. Sun, S. Picelli, A. Sabirsh, M. Clausen, M. K. Bjursell, D. M. Smith, M. Kasper, C. Ammala, and R. Sandberg. Singlecell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metabolism, 24(4):593–607, 2016.
 Sharma et al. [2018] A. Sharma, A. Halu, J. L. Decano, M. Padi, Y. Liu, R. B. Prasad, J. Fadista, M. Santolini, J. Menche, S. T. Weiss, M. Vidal, E.K. Silverman, M. Aikawa, A. L. Barabási, L. Groop, and J. Loscalzo. Controllability in an islet specific regulatory network identifies the transcriptional factor NFATC4, which regulates type 2 diabetes associated genes. NPJ Systems Biology and Applications, 4(1):25, 2018.
 Silva et al. [2006] R. Silva, R. Scheine, C. Glymour, and P. Spirtes. Learning the structure of linear latent variable models. Journal of Machine Learning Research, 7(Feb):191–246, 2006.
 Solus et al. [2017] L. Solus, Y. Wang, L. Matejovicova, and C. Uhler. Consistency guarantees for permutationbased causal inference algorithms. arXiv preprint arXiv:1702.03530, 2017.
 Spirtes [2001] P. Spirtes. An anytime algorithm for causal inference. In Proceedings of the Eighth International Workshop on Artificial Intelligence and Statistics, pages 213–221, 2001.
 Spirtes et al. [2000] P. Spirtes, C. N. Glymour, R. Scheines, D. Heckerman, C. Meek, G. Cooper, and T. Richardson. Causation, Prediction, and Search. MIT press, 2000.
 Van Dijk et al. [2018] D. Van Dijk, R. Sharma, J. Nainys, K. Yim, P. Kathail, A. J. Carr, C. Burdziak, K. R. Moon, C. L. Chaffer, D. Pattabiraman, B. Bierie, L. Mazutis, G. Wolf, S. Krishnaswamy, and D. Pe’er. Recovering gene interactions from singlecell data using data diffusion. Cell, 174(3):716–729, 2018.
 Verma and Pearl [1990] T. Verma and J. Pearl. Equivalence and synthesis of causal models. In Proceedings of the Sixth Annual Conference on Uncertainty in Artificial Intelligence, pages 255–270, 1990.
 Wang et al. [2017] Y. Wang, L. Solus, K. Yang, and C. Uhler. Permutationbased causal inference algorithms with interventions. In Advances in Neural Information Processing Systems, pages 5822–5831, 2017.
 Zhang [2008] J. Zhang. On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias. Artificial Intelligence, 172(1617):1873–1896, 2008.
 Zhang et al. [2017] K. Zhang, M. Gong, J. Ramsey, K. Batmanghelich, P. Spirtes, and C. Glymour. Causal discovery in the presence of measurement error: Identifiability conditions. arXiv preprint arXiv:1706.03768, 2017.
 Ziegenhain et al. [2017] C. Ziegenhain, B. Vieth, S. Parekh, B. Reinius, A. GuillaumetAdkins, M. Smets, H. Leonhardt, H. Heyn, I. Hellmann, and W. Enard. Comparative analysis of singlecell RNA sequencing methods. Molecular Cell, 65(4):631–643.e4, 2017. ISSN 10974164. doi: 10.1016/j.molcel.2017.01.023.
Appendix A Proof of Lemma 1
Appendix B Proof of Theorem 1
We start by defining the vectors of all correlations estimated from Algorithm 1 and all true correlations of as
(8) 
respectively. We use to denote the vector obtained from concatenating all monomials in and that appear in and for in Assumption (A2). That is,
We let
be the analogous concatenated vector of sample monomials in and calculated from the data .
The following lemma is concerned with the asymptotic distribution of the correlation vector .
Lemma 2.
Proof.
Assumption (A2) asserts that the covariance of is finite. Hence, we can apply the Central Limit Theorem to obtain
(9) 
where is the covariance matrix of . The elements of the covariance matrix can be written as a continuous function of the firstand secondorder moments of , i.e., they can be written as a continuous function of .
Assumptions (A1) and (A2) imply that we can write for all ,
(10) 
We compute the sample correlation in our algorithm as
Based on equation (10) we can define a function such that and . Applying the delta method to equation (9) with the function , we get
where , since the elements of the mean vector are elements of . Notice that under the assumption that the variance of is nonzero from Section 2, the denominator in (10) is nonzero, and therefore is continuously differentiable in , which is continuously differentiable in by Assumption (A2). Hence, is continuous in , and therefore continuous in . Since is a matrix product of functions continuous in , it is also continuous in . ∎
Lemma 3.
Proof.
We take any arbitrary but fixed and subset , and we prove the lemma for Let , where is the size of the conditioning set . We begin by relabeling the variables of interest for clarity. We relabel to , to and the elements of to . Furthermore, we define the sets
Note that , and thereby, the partial correlation of interest is . Now we define for , the vectors
It follows from the definition of that and . In order to prove the lemma, we proceed by induction on starting with the base case of and show that for all such that ,
(11) 
for some that is continuous in . Note that the base case is given by the hypothesis in the lemma. Moreover, the statement of the lemma is that the above holds for , and therefore, completing the inductive step proves the lemma.
To complete the inductive step, assume that for such that , we have
(12) 
Note that for any , the recursive formula for the partial correlations
(13) 
implies that the vector can be written as a function of . Let be this function, then we have
Note that this implies since our procedure uses this recursive formula to estimate the partial correlations. Applying the delta method to with the function gives
where . The matrix can be computed to be the following matrix:
where
and
To simplify indexing, we define the index function
Then, the element will be on the row and column of the Jacobian . We can now compute the elements of the matrix in terms of the elements of . Namely, defining and using the notation to denote the entry in the row and column of , we can compute the element in the row and column of to be