Learning Directed Acyclic Graphs with Hidden Variables via Latent Gaussian Graphical Model Selection
Abstract.
We introduce a new method to estimate the Markov equivalence class of a directed acyclic graph (DAG) in the presence of hidden variables, in settings where the underlying DAG among the observed variables is sparse, and there are a few hidden variables that have a direct effect on many of the observed ones. Building on the socalled low rank plus sparse framework, we suggest a twostage approach which first removes unwanted variation using latent Gaussian graphical model selection, and then estimates the Markov equivalence class of the underlying DAG by applying GES. This approach is consistent in certain highdimensional regimes and performs favourably when compared to the state of the art, both in terms of graphical structure recovery and total causal effect estimation.
Key words and phrases:
directed acyclic graph (DAG), highdimensional data, causal inference, hidden variables, lowrank plus sparse1. Introduction
The task of learning directed acyclic graphs (DAGs) arises in many areas of science and engineering. In such graphs, nodes represent random variables and edges encode conditional dependencies. The problem of recovering their structure from observational data is challenging and cannot be tackled without making untestable assumptions [33]. Among other assumptions, causal sufficiency is particularly constraining. Briefly, causal sufficiency requires that there be no hidden (or latent) variables that are common causes of two or more observed variables (such hidden variables are often called confounders). Although causal sufficiency is unrealistic in most applications, many structure learning algorithms operate under this assumption [40, 7, 43, 17, 31]. On the other hand, methods allowing for arbitrary hidden structures tend to be overly conservative, recovering only a small subset of the causal effects [41, 10, 8, 29]. In the present work, we suggest taking a middleground stance on causal sufficiency by allowing hidden variables while imposing some restrictions on their number and behaviour. More precisely, we consider settings where the underlying DAG among the observed variables is sparse, and there are a few hidden variables that have a direct effect on many of the observed ones [6]. This is an interesting problem for at least two reasons.
First, these assumptions cover important realworld applications. In the context of gene expression data, for example, such confounding occurs due to technical factors or unobserved environmental variables [25, 42, 16]. For another example, consider the task of modelling the inverse covariance structure of stock returns [21, §9.5]. [6] showed that a large fraction of the conditional dependencies among stock returns can be explained by a few hidden variables, e.g. energy prices.
Second, this setting is complementary to the realm of application of popular algorithms that do not assume causal sufficiency, such as versions of the Fast Causal Inference algorithm [41, 10, 8]. Under our assumptions, most observed variables do not appear conditionally independent of each other and the underlying socalled maximal ancestral graph is expected to be dense which, in turn, implies that very few causal effects are identifiable and bounded away from zero (see Figure 1 for an example). Moreover, learning such dense graphs is computationally demanding.
In this paper, we suggest a twostage procedure. First, the socalled “lowrank plus sparse” approach of [6] is used to obtain a pair of inverse covariance matrices, say, describing the estimated Gaussian graphical model over the observed variables and the estimated effect of the hidden variables, respectively. In the second stage, a causal structure learning algorithm which assumes causal sufficiency is applied to . In addition, (joint) total causal effects can be straightforwardly estimated using the (joint)IDA algorithm [28, 27, 32].
The suggested approach is conceptually simple and enjoys many desirable theoretical and computational properties.
In particular, we derive conditions under which this estimator is consistent in a highdimensional regime.
Through extensive simulations, we show that our approach outperforms other relevant methods, both in terms of graph structure recovery and total causal effect estimation.
Our main focus being on applicability, we also suggest strategies to select the tuning parameters in various settings and illustrate their performances in simulations.
Finally, we analyse two datasets.
In our first application, we model the expression levels of the genes responsible for isoprenoid synthesis in Arabidopsis thaliana and show
that some of the hidden variables we estimate have a clear biological interpretation.
We also model the expression levels of hundreds of genes expressed in ovarian cancer and assess our results using two external sources of validation.
Compared to stateoftheart algorithms, we find our approach to be better at recovering known causal relationships
2. Problem Statement
2.1. Graphical Models Terminology
We consider graphs , where the vertices (or nodes) represent random variables and the edges represent relationships between pairs of variables. The edges can be either directed () or undirected (). An (un)directed graph can only contain (un)directed edges, whereas a partially directed graph may contain both directed and undirected edges. The skeleton of a partially directed graph , denoted as , is the undirected graph that results from replacing all directed edges of by undirected edges. Two nodes and are adjacent if there is an edge between them. If , then is a parent of . A path between and in a graph is a sequence of distinct nodes such that all pairs of successive nodes in the sequence are adjacent in . A directed path from to is a path between and , where all edges are directed towards . A directed path from to together with forms a directed cycle. A graph without directed cycles is acyclic. A graph that is both (partially) directed and acyclic, is a (partially) directed acyclic graph or (P)DAG.
A DAG encodes conditional independence relationships via the notion of dseparation [see 34, Def. 1.2.3]. Several DAGs can encode the same dseparations and such DAGs form a Markov equivalence class. A Markov equivalence class of DAGs can be uniquely represented by a completed partially directed acyclic graph (CPDAG), which is a PDAG that satisfies the following: in the CPDAG if in every DAG in the Markov equivalence class, and in the CPDAG if the Markov equivalence class contains a DAG in which as well as a DAG in which [44, 3].
For , we write to denote conditional independence in the joint distribution of , while means that and are dseparated by in . A DAG is a perfect map of the joint distribution of if for all such that and for all , we have .
When some variables are unobserved, as is assumed in this paper, complications arise because the class of DAGs is not closed under marginalisation. Among other factors, this limitation prompted the development of another class of graphical independence models called maximal ancestral graphs (MAGs) [37]. A criterion akin to dseparation makes it possible to readoff independencies of such graphs and, since multiple MAGs can encode the same set of conditional independence statements, one usually attempts to recover a partial ancestral graph (PAG) which describes a Markov equivalence class of MAGs [2]. Like in CPDAGs, circle marks represent uncertainty about edge marks. In particular, a circle mark occurs in the PAG if the Markov equivalence class contains a MAG in which the edge mark is a tail, and a MAG in which the edge mark is an arrowhead [46].
2.2. Setup and Notations
Throughout, we assume that we are given independent, identically distributed (i.i.d.) realisations of a partially observed zeromean random vector . The subvector corresponds to those variables that remain hidden. Our main assumptions are a) is normally distributed and b) there exists a DAG, say, which is a perfect map of the distribution of conditional on . While our initial motivation came from causal problems, as discussed in the introduction, our assumptions do not require to be the causal DAG. Moreover, we do not assume that is generated by a linear Gaussian DAG, a scenario which arises as a special case.
More formally, our first assumption amounts to saying that the distribution of is of the form with inverse covariance (or precision) matrix given by
The superscript is used to indicate that these matrices are parameters of the model, as opposed to estimates. Moreover, we make use of partitioned matrices, so that and .
Since is unobserved, we only see observations from the marginal distribution of . A simple calculation yields
(1) 
Setting , we have that summarises the effect of the hidden variables on the observed ones. In practice, only samples drawn from are available and the data at our disposal is summarised by their sample covariance matrix . By assumption b), there exists a DAG, say, that is a perfect map of the distribution . The CPDAG of is denoted by . Our goal is to recover . In a causal framework, this also enables the use of the (joint)IDA algorithm [28, 32]. Beyond its capacity to recover , an ideal procedure should also be wellbehaved in highdimensional settings and tractable at scale.
In Figure 1 a) we give an example of a DAG with two influential hidden variables. In such a scenario, we see that the MAG and PAG (Fig. 1 b), c)) are dense and the PAG contains many uninformative circle edgemarks. For comparison, Figure 1 d) depicts our target object which is sparse and contains more informative edge marks. Figure 1 e) shows the corresponding matrix for some choice of coefficients in . is a dense, rank2 matrix.
2.3. Suggested Estimator
The presence of hidden variables represents a substantial complication since follows a normal distribution with an inverse covariance matrix () which is the sum of two matrices. The contribution of the observed and the hidden variables is split: is the precision matrix induced by , while summarises the effect of the hidden variables on the observed ones. Since our goal is to infer , our strategy is to first tease apart the summands in order to learn , and then to rely on this estimate to infer under the assumption that there are no confounders.
Inferring the components of from samples drawn according to (1) is a challenging problem because it is fundamentally misspecified: an infinity of pairs can achieve the maximum value of the likelihood. Remarkably, [6] showed that it is sometimes possible to uniquely decompose into its summands. Loosely speaking, this is the case when is sparse and there are relatively few hidden variables with an effect spread over most of the observed variables (this is detailed further in Section 3). Since , constraining the number of hidden variables amounts to requiring that be lowrank. The and nuclear norms being convex relaxations for the norm and the rank respectively, [6] suggest regularising the maximumlikelihood estimator with these norms so as to encourage the learning of a sparse and a lowrank . An estimate of is obtained as the minimiser of
(2) 
where and . Here, the notations and indicate that and are restricted to the space of positive definite and positive semidefinite matrices, respectively. Moreover is the nuclear norm of (i.e. the sum of its singular values); is the norm of (i.e. the sum of its entries’ magnitudes). Under suitable regularity conditions, this estimator is consistent. Moreover, (2) is jointly convex in its parameters and can be efficiently minimised even when is in the thousands [26]. We call this estimator the “lowrank plus sparse” estimator (LRpS) and we write for the program which applies (2) to a covariance matrix , with tuning parameters and outputs a pair of matrices .
Provided the conditions for consistency of LRpS are met, an algorithm which assumes causal sufficiency can be readily applied to the estimated covariance matrix . For this second stage, we suggest using the GES algorithm with an regularised loglikelihood score [7]. Let us write for the program which applies GES to a covariance matrix with tuning parameter and outputs a CPDAG . The suggested estimator, called LRpS+GES henceforth, can be summarised as in Algorithm 2.1.
At a practical level, the fact there are three tuning parameters might be a legitimate concern. We suggest first selecting the tuning parameters of – – using crossvalidation or the (extended) BIC [14] and then choosing , so that there is no need to scout a 3dimensional grid. Moreover, we will see that both theoretical and empirical results support the idea that LRpS is not very sensitive to the value of : trying only a few values (i.e. five or so) of this tuning parameter is enough for most applications. More practical details are given later.
2.4. Previous Work
Over the past two decades, significant advances have been made on the problem of estimating DAGs from observational data. This is a task which is known to be very challenging, especially in the highdimensional setting. For example, the space of DAGs is nonconvex and its size increases superexponentially with the dimension of the problem [38]. Structure learning algorithms fall into three categories that we review here. Since there are many approaches in each of these categories we refer the reader to [19], [12], [22] for a more detailed overview and simulation studies.
Score based approaches assign a score to each structure and aim to identify the one (or ones) that maximises a scoring function. Usually, the scoring criterion measures the quality of a candidate structure based on the data. Due to its theoretical properties and its performance on real and simulated datasets, we give special attention to the Greedy Equivalence Search (GES) algorithm of [7]. GES is a greedy algorithm which searches for the CPDAG that maximises the penalised loglikelihood score over the space of CPDAGs. It proceeds with a forward phase in which single edge additions are carried out sequentially so as to yield the largest possible increase of the score criterion, until no addition can improve the score further. The algorithm then starts with the output of the forward phase and uses best single edge deletions until the score can no longer be improved. In spite of being a greedy algorithm, GES is consistent not only in the classical sense (“fixed , increasing ”) [7] but also in certain sparse highdimensional regimes [39, 31].
Constraint based algorithms learn graphical models by performing conditional independence tests. The PCalgorithm is a popular approach that falls in this category [40]. Under suitable conditions, it is consistent for CPDAG recovery, even in the highdimensional regime [23, 20, 9]. When there are hidden variables and/or selection bias, the counterpart of the PCalgorithm is the Fast Causal Inference (FCI) algorithm [41] whose output is a partial ancestral graph [40, 37]. While consistent in sparse highdimensional settings [10], FCI is not fast enough to be applied to large graphs. This limitation prompted the development of methods such as the Really Fast Causal Inference (RFCI) algorithm and FCI+ [10, 8]. A strength of FCItype algorithms is that the hidden structure can be arbitrarily complicated, since no assumptions are made about selection bias and hidden variables.
Hybrid algorithms combine constraintbased and scorebased methods. For example, the (MaxMin HillClimbing) MMHC algorithm first learns the skeleton using a local discovery algorithm and then orients the edges via a greedy hillclimbing procedure [43]. The NSDIST approach suggested in [19] also outputs a DAG in twostages. In the first stage, the adaptivelasso [47] is used to perform neighbourhood selection. For the second stage, [19] suggest a novel greedy algorithm which searches the space of DAGs whose neighbourhoods agree with the output of the first stage. Finally, the adaptively restricted GES of [31] is a hybrid approach which modifies the forward phase of GES by adaptively restricting the search space. They show that this approach is consistent in some sparse highdimensional regimes, and much faster than GES.
The type of confounding we consider in this paper is ubiquitous is genomics applications, which is why the problem of estimating and removing this kind of unwanted variation has been wellstudied [25, 16, 30]. A simple approach consists in estimating the first few principal components of the data and to regress them out before conducting any analysis. More sophisticated, general purpose algorithms have also been developed. PEER, for example, is a Bayesian approach which aims at inferring “hidden determinants and their effects from gene expression profiles by using factor analysis methods” [42]. It was recently used by the GTEX consortium in order to remove confounding from their datasets [1].
3. Theoretical Results
The highdimensional behaviour of the LowRank plus Sparse decomposition (LRpS henceforth) and the GES algorithm has been studied in the literature [6, 31]. We rely on this body of work to derive the highdimensional consistency of LRpS+GES.
We consider an asymptotic scenario where both the dimension of the problem and the sample size are allowed to grow simultaneously, meaning that and are now functions of . We write and to make this dependence explicit. Likewise, we write for the random vectors being modelled, where . We also write and to make it clear that the nominal parameters and the estimates obtained from Algorithm 2.1 are indexed by . We let be the true partial correlation between and given , for and . The sample partial correlation is defined similarly. These quantities can be computed from and . For any matrix , we denote by the maximum number of nonzero entries in any row or column of . If is a partially directed graph with adjacency matrix , we define its degree . Finally, for any matrix , let be the spectral norm of , i.e. its largest singular value, and let be its largest entry in magnitude, i.e. .
We prove the following result in Appendix A.
Theorem 3.1.
Assume (A1)  (A6) given below. Then, there exists a sequence such that
.
Assumptions (A1)  (A6) are as follows:
 (A1):

(Consistency of LRpS) The conditions for the algebraic consistency of LRpS are satisfied (see Theorem 4.1 of [6]). In particular, it is required that both observed and hidden variables be jointly normal.
 (A2):

(Highdimensional setting) , for some .
 (A3):

(Sparsity condition) Let and , so that . We assume that , for some .
 (A4):

(Bounds on the growth of the oracle versions) The maximum degree in the output of the forward phase of every optimal oracle version of GES is bounded by , for some sequence such that and , and where is given by (A3).
 (A5):

(Bounds on partial correlations) The partial correlations satisfy the following upper and lower bound for all and such that :
with for some where is as in (A4).
 (A6):

and , for some .
In the previous section, it was mentioned that the LRpS estimator is consistent when is sparse and is lowrank. Assumption (A1) contains more precise requirements for the problem to be identified. One of the conditions for identifiability is expressed as , for some constant which depends on the Fisher information matrix. Here, is a property of such that a small value of guarantees that no single hidden variable will have an effect on only a small number of the observed variables. It is related to the concept of incoherence, which is easily calculated and satisfies , for any matrix [5, 6]. On the other hand, quantifies the diffusivity of ’s spectrum. Matrices that have a small have few nonzero entries per row/column. Thus, (A1) entails that there must be few hidden variables acting on many observed ones and that must have sparse rows/columns. Assumption (A1) also requires that the tuning parameter be chosen within the range , thus supporting our observation that LRpS is not very sensitive to the value of this tuning parameter in practice.
An important feature of Theorem 3.1 is that the degrees of the true CPDAG and are allowed to grow logarithmically with the sample size . When coupled with (A1), this assumption on the growth rate of imposes restrictions on the number of hidden variables , albeit not explicitly. Indeed, it can be shown that for the condition to hold with high probability, has to be of the form (under some assumptions about the distribution from which is sampled) [6]. Thus, the degree of and the number of hidden variables are allowed to grow simultaneously with the sample size.
We note that [31] proved highdimensional consistency of GES under weaker versions of (A2) and (A3). Namely, GES allows for some and for some . Second, assumption (A1) includes stringent conditions on the value of the smallest eigenvalue of as well as a form of “generalised irrepresentability condition” similar to the irrepresentability condition introduced in [35] in the context of regularised maximum likelihood estimation. In the next section, we illustrate the performances of LRpS+GES on simulated data both when our assumptions hold and when they are grossly violated. Just like its regularised counterpart (the graphical lasso), we find that LRpS performs well in many more settings than suggested by theory. In the case of the graphical lasso, this is explained by the socalled “screening property”. To the best of our knowledge, no such theoretical results exist for LRpS.
4. Performances on Simulated Data
4.1. CPDAG Structure Recovery
Throughout, we generate DAGs with nodes, and set – a value which does not depend on the sample size
where and are matrices encoding the structure and effect sizes of the DAGs and [4]. Furthermore, and are strictly uppertriangular matrices and is a diagonal matrix. The DAGs over the observed variables () are random DAGs with an expected sparsity of , which corresponds to an average degree of about 2.5 and an average maximum degree of about 6.3. The hidden variables remain independent, but each of them has directed edges towards a random of the observed variables. All edge weights – i.e the nonzero entries of the matrices – are drawn uniformly at random from . Residual variances – i.e. the diagonal entries of – follow a uniform distribution over .
In this section, we compare methods based on the precisionrecall curves obtained by varying the tuning parameter for the last stage of the structure learning methods. The tuning parameters of the first stages (when applicable) are selected as described below. The following methods are applied to the data:
 GES [7]:

implemented in the pcalg package [24].
 NSDIST [19]:
 PCA*+GES:

the top principal components are first estimated from the data matrix and regressed out. GES is then applied to the residuals. The number of principal components is chosen with perfect knowledge (hence the * in the name) so as to maximise the average precision.
 PEER*+GES [42]:

similar to PCA*+GES, the first stage is replaced with PEER. Here again, the number of latent factors is selected so as to maximise average precision, hence the * in the name.
 LRpS+GES:

the suggested approach described in Algorithm 2.1. The tuning parameters for LRpS are chosen by crossvalidation with .
In our first set of simulations, we investigate the effect of the sample size and the number of hidden variables on CPDAG recovery. We set and , but fix to 70. For each of the nine possible pairs, we generate 50 distinct DAGs and draw samples from each of them, for a total of 450 datasets. This is a setting which is favourable to our approach since the hidden variables impact a large fraction (70%) of the observed ones.
In Figure 4 a) we assess the performances of the methods in terms of skeleton recovery by plotting average precision/recall curves. Precision is calculated as the fraction of correct edges among the retrieved edges; recall is computed as the number of correctly retrieved edges divided by the total number of edges in the true CPDAG. Since each of the 9 designs is repeated 50 times, we report average precisions at fixed recalls of . In the appendix, similar curves are plotted for directed edges. When there is no confounding (leftmost column), all methods are known to be consistent for skeleton recovery and offer comparable performances. As soon as , GES is outperformed. Unsurprisingly, when and are of the same order of magnitude, PCA does not perform as well as a Bayesian approach like PEER. When there are only a few confounders (, middle column) they both perform significantly better than GES. When it comes to skeleton recovery, LRpS+GES is always at least as good as the other methods. When there is confounding, it is significantly better because it is the only method which explicitly models hidden variables. This is true even though the tuning parameters and were chosen with crossvalidation. We also see that when , LRpS+GES is the only method whose performance improves with increasing sample size.
In our second set of simulations, we draw from a more diverse set of hidden structures. We set but draw and uniformly at random from and respectively. We generate 500 datasets according to this scheme. In order to quantify the departure of from our assumptions we compute , the incoherence of , for each of the 500 datasets (the distribution of along with figures showing the effect of are shown in the appendix). In this second scenario, many datasets explicitly violate our assumptions since there are many hidden variables acting in a sparse fashion.
In Figure 4 b), we plot average precision/recall curves for this second simulation design. The datasets are divided into four bins based on the quartiles of ’s distribution (noted ), so that the leftmost panel corresponds to the 125 datasets for which it is easiest to estimate . Doing so indicates how our approach is expected to behave in the most adverse scenarios. As can be seen from this figure, LRpS+GES performs at least as well as other approaches in terms of skeleton recovery. In most cases, it performs better.
4.2. Total Causal Effect Estimation
Under the additional assumption that – which is a perfect map of the distribution – is the causal DAG, the Markov equivalence class encoded by contains the causal DAG. For any given pair of distinct nodes , we can therefore estimate the total causal effect of on for all DAGs in the Markov equivalence class. Since one of these DAGs is the true causal DAG, this yields a list of possible total causal effects which includes the true total causal effect. The IDA approach makes it possible to generate such lists efficiently without enumerating all DAGs in the Markov equivalence class [28, 27]. The original IDA method described in [28] uses the PC algorithm in order to first estimate a CPDAG, and then computes sets of possible total causal effects using the sample covariance matrix and the output of the first stage. However, it is possible to replace this first step by any other algorithm which estimates a CPDAG. Likewise, any estimator of can replace .
Since LRPS+GES outputs a CPDAG, we can assess its ability to estimate total causal effects by using it in the first stage of IDA. Thus, lists of possible causal effects are generated using the estimated CPDAG and the covariance matrix . We denote this method by (LRPS+GES),IDA. For all pairs , , we compute the set of possible total causal effects of on . Pairs of variables are then ranked according to . This ranking is compared to the true total causal effects using the precision and recall metrics, e.g. “precision at rank ” would be the number of pairs that are in the top pairs and have a nonzero total causal effect in the true DAG, divided by .
In this section, we select a single DAG, PAG or CPDAG along the regularisation paths in order to apply IDA or LVIDA. Thus, we pick a value of the tuning parameters for both the first and second stages. This is in contrast with the previous section where only the tuning parameters of the first stages (when applicable) were selected, while we reported precisionrecall curves for the whole regularisation paths of the second stages. We consider the following methods, where the first stage tuning parameters are selected as before (when applicable):
 GES,IDA:

the CPDAG is estimated using GES. The tuning parameter is chosen with the BIC. IDA is applied with the resulting CPDAG and the sample covariance matrix .
 NSDIST,IDA:

the DAG is estimated using NSDIST and converted to a CPDAG. The tuning parameter for the second stage (, with the notations of [19]) is chosen using the BIC. IDA is applied with the resulting CPDAG and the sample covariance matrix .
 (PCA*+GES*),IDA:

: the top principal components are first estimated from the data and regressed out. The CPDAG is estimated using GES on the residuals. The tuning parameter is chosen with perfect knowledge so as to maximise the average precision (in terms of causal effect recovery). IDA is applied with the resulting CPDAG and the covariance matrix of the residuals.
 (LRpS+GES),IDA:

the CPDAG is estimated using LRpS+GES. The tuning parameter of the second stage is chosen with the BIC. IDA is applied with the resulting CPDAG and the covariance matrix .
 RFCI,LVIDA [10, 29]:

the PAG is estimated with RFCI. The significance level for RFCI is given by
^{3} . LVIDA is applied to the resulting PAG and the sample covariance matrix . Whenever LVIDA outputs an NA, the corresponding pair is not counted, i.e. it is neither a true positive nor false positive.  RANDOM,IDA:

one hundred random DAGs are generated from the same model as was used in the simulation. Total causal effects are then estimated based on the resulting CPDAG and the sample covariance matrix . We report the interval spanned by the 2.597.5 percentiles of the distribution of precisions at fixed recalls.
 EMPTY, IDA:

Causal effects are computed without adjustment, which is equivalent to applying the ida function of the pcalg package to an empty graph and the sample covariance matrix.
With respect to total causal effect estimation, we found (PEER*+GES*),IDA and (PCA* + GES*),IDA to be nearly undistinguishable, which is why PEER is not reported here. Moreover, note that since we are reporting results for the GES,IDA approach, we are not considering the “PC,IDA” method. Indeed, GES has been shown to have good finite sample performance, and recent highdimensional consistency guarantees have been given in [31].
We consider the same simulation designs as in the previous section. Figure 7 a) displays the results obtained in the first setting, the one where hidden variables are influential. Unlike in Figure 4, the orientation of the directed edges matters. When the sample size is relatively small , there appear to be little to gain from using (CP)DAG estimation methods – the EMPTY approach is competitive. As soon as the sample size increases and , there is a clear benefit in using LRpS+GES. When , LRpS+GES is outperformed but its performances remain comparable to those of GES. Since it is designed to handle hidden variables, the behaviour of RFCI,LVIDA might come as a surprise. First, we see that when there is no confounding, RFCI,LVIDA is capable of achieving a higher precision than any other method. This is consistent with previous findings indicating that LVIDA is conservative but capable of recovering a small but highquality set of total causal effects [29]. When , the set of models we simulate from is particularly challenging for methods relying on MAGs since nearly all pairs of observed variables are confounded. It is therefore not surprising to see RFCI,LVIDA being outperformed.
As can be seen from Figure 7 b), LRpS+GES performs at least as well other approaches and, in most cases, it performs better. As pointed out above, it is in the most challenging scenarios, when confounders act in a sparse fashion (rightmost panel), that RFCI,LVIDA is the most useful. It is very conservative but is capable of achieving the highest precision.
5. Applications
5.1. Application 1: Isoprenoid Synthesis in Arabidopsis thaliana
We illustrate a few properties of our approach on a dataset containing gene expression measurements taken in Arabidopsis thaliana under experimental conditions [45]. [45] gave particular attention to the genes involved in isoprenoid synthesis. In Arabidopsis thaliana, two pathways, located in distinct organs, are responsible for isoprenoid synthesis: the mevalonate pathway (MVA) and the nonmevalonate pathway (MEP). We downloaded the data made available in the supplementary materials of [45] and modelled the genes represented in Figure 3 of [45]. They fall into three categories: genes that are part of the MVA pathway, genes that are in the MEP pathway and mitochondrial genes. For illustration, Figure 10 a) shows the adjacency matrix of the metabolic pathways. This is a graph in which nodes are genes and edges are chemical reactions between gene products. This graph, while related to the regulatory network we aim to estimate, is very different from it: in general, both the structure and the direction of the edges differ.
We fitted LRpS to our data and selected the tuning parameters and with fivefold crossvalidation. The estimated lowrank matrix had two nonzero eigenvalues, with ratio . Hence, only the first eigenvector was retained. In order to see whether we could interpret the hidden variables estimated by LRpS, we looked at the loadings of the genes in the first eigenvector of . Figure 10 b) shows the distribution of the loadings per pathway and suggests that the main source of variation in the data is given by these pathways, which are sometimes unknown in less studied organisms. This shows that in some cases, we can account for hidden variables, but also interpret them. By applying GES to – the sparse output of LRpS – we are therefore modelling a regulatory network conditionally on those pathways, without having to provide further information.
In Figures 17 a) and b), we show the adjacency matrices of the CPDAGs obtained by running GES and LRpS+GES using the BIC score for GES. Figure 17 c) shows the adjacency matrix of the PAG obtained by RFCI, with as before. In Figures 17 d), e), the matrices of total causal effects computed from GES,IDA and (LRPS+GES),IDA are plotted, where IDA is used as in our simulations. In Figure 17 f), the output of LVIDA is plotted, with NAs marked in red. The graphs and total causal effects obtained from other methods (NSDIST,IDA, etc…) are plotted in the appendix. Figure 17 illustrates the tendency of LVIDA to produce very conservative estimates of the causal effects, with many pairs being either zero or NA. On the other extreme, the causal effects of GES are stronger than those of LRpS+GES. In particular, LRpS+GES does not find any strong causal relationship between mitochondrial genes and any other genes, as indicated by the “white cross” in the middle of the matrix plotted in Figure 17 e). Both GES and LRpS+GES support the hypothesis of crosstalk from the MEP to the MVA pathway.
The metabolic pathways of Arabidopsis thaliana have been studied in detail but, to the best of our knowledge, no reliable groundtruth is available for its directed regulatory network. For that reason, it is difficult to assess the quality of the estimated CPDAGs or matrices of total causal effects. Nonetheless, by applying LRpS+GES to this dataset, we were able to show that it is sometimes possible to interpret the latent factors inferred by our approach. Moreover, this type of confounding is not specific to this particular dataset, but is expected to occur in most gene expression datasets, possibly in organisms for which no pathway information is available. This application also gave us the opportunity to compare LVIDA to other IDAbased methods on a real dataset.
5.2. Application 2: Regulatory Network in Ovarian Cancer
We now consider the problem of identifying the targets regulated by a given set of transcription factors in a human gene expression dataset. This problem is often considered in the literature because it constitutes an example of a reallife dataset for which the existence and direction of some edges is known, thus making it possible to compare estimated graphs to a “partial groundtruth” [43, 19]. Briefly, a transcription factor is a protein which regulates the mRNA expression of a gene by binding to a specific DNA sequence near its promoting region. Some families of transcription factors have been studied in detail, and publicly available databases such as TRRUST provide lists of transcription factors along with the genes – called targets – they regulate [18]. transcription factors play a crucial in role in cancer development, which is why it is believed that intervening on the expression of such genes could alter the course of some cancers [11].
In this application, we follow closely the steps described in Section 5.1 of [19] where
ovarian adenocarcinomas are studied.
We used the RNASeq data available from the National Cancer Institute (portal.gdc.cancer.gov/)
and logtransformed the gene expression levels.
There is a consensus about how important some transcription factor families are for cancer development [11, 36].
We therefore selected the transcription factors belonging to those families
To construct a reference network to which we can compare our estimates, we used the output of NetBox. NetBox outputs a list of known (unoriented) interactions between some of the 501 selected genes. Unfortunately, nothing indicates whether those interactions are causal; in general it is not because two genes interact in NetBox that intervening on the expression levels of one of the genes will induce a change in the expression level of the other. However, thanks to our knowledge of transcription factors, we do know that whenever there is an interaction between a transcription factor and a nontranscription factor, then it is likely to be causal and directed from the transcription factor to its target. Moreover, transcription factors are tissue specific, meaning that we can only expect a subset of the interactions to be active in any given celltype [13]. These observations allow us to build three reference networks: a) an undirected graph in which there is an edge between A and B whenever they are said to interact according to NetBox (this is Network A); b) a “causal” undirected graph in which only edges between transcription factors and their targets have been retained (Network B); c) a causal directed graph in which the edges of Network B have been ordered from transcription factors to their targets (Network C).
In this application, the number of variables () is rather large compared to the sample size (). We therefore selected the tuning parameters of LRpS () using the Extended BIC instead of crossvalidation [14].
In Figure 22 we compare the output of various methods (GES, LRpS+GES, NSDIST, PCA*+GES, PEER*+GES) to reference networks A, B and C in terms of True and False Positive Rates (TPR, FPR). For Network C, we follow again [19]: an undirected edge in a CPDAG is counted as half a true positive and half a false negative. In grey, we plot the range spanned by the 2.5 and 97.5 percentiles of our null distribution. It was computed by first picking 100 random ordering of the variables and, starting from a complete DAG, removing random edges one after the other until there are no edges left. After each removal, we computed the performance metrics of the DAG with respect to all reference networks, thus generating 100 random regularisation paths for each of the plots.
Figure 22 a) plots the Receiving Operator Curve (ROC) for Network A. All methods display comparable performances, although NSDIST and LRpS+GES appear slightly above GES. In Figure 22 b), we restrict ourselves to Network B, so that only transcription factortarget edges are counted. LRpS+GES is clearly above NSDIST, PCA*+GES and PEER*+GES which are themselves outperforming GES. When the direction of the edges is also taken into account (Figure 22 c)), LRpS+GES remains ahead of the other methods, and NSDIST beats PCA and PEER. The difference in performance between NSDIST and GES does not come as a surprise since Figure 22 a) and b) reproduce the findings of Figures 4 a) and b) of [19].
Since NetBox is not restricted to transcription factortarget interactions, we sought to confirm the results of Figure 22 c) by using an independent source of validation specialised in transcriptional regulatory relationships. We used TRRUST [18] and constructed Network D by adding directed edges between transcription factors and targets according to TRRUST. In Figure 22 d) we plot the resulting ROC cuve, which reproduces the results obtained using NetBox transcription factortarget interactions as ground truth.
6. Discussion
We discussed the problem of estimating the Markov equivalence class of a DAG in the presence of hidden variables.
Building on previous work by [6] and [7], we suggested a twostage approach – termed LRpS+GES – which first
removes unwanted variation using latent Gaussian graphical model selection, and then estimates a CPDAG by applying GES.
We chose GES for its good empirical performance and theoretical guarantees [31], but we note that the second step can be replaced by any structure learning algorithm
for DAGs that assumes causal sufficiency.
Our main theoretical result states that LRpS+GES is consistent for CPDAG recovery in some sparse highdimensional regimes.
Through simulations and two applications to gene expression datasets, we showed that our approach often outperforms
the state of the art, both in terms of graphical structure recovery and total causal effect estimation.
Moreover, the results reported in our simulations can be achieved in practice since tuning parameter selection
was performed using insample information only
When it comes to removing unwanted variation from biological datasets, stateoftheart approaches usually incorporate external information into the analysis by including additional covariates (e.g. gender, genetic relatedness), thus also accounting for known confounders [42, 30]. Since these additional covariates are often discrete, modelling them with LRpS+GES would be a violation of our assumptions. In such a setting, it is straightforward to replace LRpS by the LSCGGM estimator suggested in [15]. LSCGGM makes it possible to perform a low rank plus sparse decomposition, conditionally on a number of arbitrarily distributed random variables. The LRpS+GES approach could therefore be replaced by the “LSCGGM+GES” estimator, which would come with similar theoretical guarantees.
The computational cost of LRpS+GES might also be a concern to the practitioner. In Algorithm 2.1 we first estimate an inverse covariance matrix . To the best of our knowledge, the fastest algorithm for this LRpS step uses the socalled alternative direction method of multipliers, with a cost of per iteration [26]. Next, must be inverted, at a cost of , and then GES is run on . For large problems, this last step can be replaced by the ARGES algorithm suggested in [31].
As detailed earlier, there exist other approaches which are capable of estimating DAG models and total causal effects in the presence of hidden variables, i.e. FCItype algorithms [41, 10, 8] and LVIDA [29]. In both our simulations and our first application, we found that such approaches are very conservative under our assumptions. However, they do outperform LRpS+GES when hidden variables act on the observed ones in a sparse fashion. As such, LRpS+GES is complementary to existing methods.
Finally, we note that the LRpS+GES estimator can be modified to tackle another widespread problem: selection bias. Reusing the notations introduced in Section 2.2, selection bias can be handled as follows. Let be a zeromean random vector which follows a multivariate normal distribution with covariance matrix . Let us further assume that there exists a DAG, say, which is a perfect map of the distribution . Then, assuming that the variables in are selection variables, we only see observations from
By the Woodbury identity, this can be rewritten in terms of the precision matrix as
where is a negative semidefinite matrix defined as
Since Theorem 4.1 of [6] does not make any assumptions about the positivedefinitiveness of , the following estimator could replace LRpS in the first stage of LRpS+GES:
(3) 
where and . This modified approach is consistent in the presence of selection variables under the similar conditions as Theorem 3.1. The only difference is in the interpretation of the condition , which would require that be sparse and that there be few selection variables that are directly regulated by many of the observed variables.
Footnotes
 The code for our simulations and applications is made available with this paper.
 The code for our simulations and applications is made available with this paper.
 In a number of cases, the LVIDA algorithm, when applied to a single dataset, was still running after a few days of computation. Given that we simulated data from hundreds of datasets, we could not experiment with many values of .
 Namely: FOS, FOSB, JUN, JUNB, JUND, ESR1, ESR2, AR, NFKB1, NFKB2, RELA, RELB, REL, STAT1, STAT2, STAT3, STAT4, STAT5, STAT6
 http://sanderlab.org/tools/netbox.html
 The code for our simulations and applications is made available with this paper.
References
 Aguet, F. et al. (2016) Local genetic effects on gene expression across 44 human tissues. Tech. rep. Available at http://dx.doi.org/10.1101/074450.
 Ali, R. A., Richardson, T. S. and Spirtes, P. (2009) Markov equivalence for ancestral graphs. Ann. Statist., 37, 2808–2837.
 Andersson, S. A., Madigan, D. and Perlman, M. D. (1997) A characterization of Markov equivalence classes for acyclic digraphs. Ann. Statist., 25, 505–541.
 Bollen, K. (1989) Structural equations with latent variables.
 Candès, E. J., Li, X., Ma, Y. and Wright, J. (2011) Robust principal component analysis? J. ACM, 58, Art. 11.
 Chandrasekaran, V., Parrilo, P. A. and Willsky, A. S. (2012) Latent variable graphical model selection via convex optimization. Ann. Statist., 40, 1935–1967.
 Chickering, D. M. (2002) Learning equivalence classes of Bayesiannetwork structures. J. Mach. Learn. Res., 2, 445–498.
 Claassen, T., Mooij, J. M. and Heskes, T. (2013) Learning sparse causal models is not NPhard. In Proceedings of the TwentyNinth Conference on Uncertainty in Artificial Intelligence, UAI’13, 172–181.
 Colombo, D. and Maathuis, M. (2014) Orderindependent constraintbased causal structure learning. J. Mach. Learn. Res., 15, 3741–3782.
 Colombo, D., Maathuis, M. H., Kalisch, M. and Richardson, T. S. (2012) Learning highdimensional directed acyclic graphs with latent and selection variables. Ann. Statist., 40, 294–321.
 Darnell, J. E. (2002) Transcription factors as targets for cancer therapy. Nature Reviews Cancer, 2, 740–749.
 Drton, M. and Maathuis, M. H. (2017) Structure learning in graphical modeling. Annual Review of Statistics and Its Application, 4, 365–393.
 Eeckhoute, J., Métivier, R. and Salbert, G. (2009) Defining specificity of transcription factor regulatory activities. Journal of Cell Science, 122, 4027–4034.
 Foygel, R. and Drton, M. (2010) Extended bayesian information criteria for gaussian graphical models. In Advances in Neural Information Processing Systems 23, 604–612.
 Frot, B., Jostins, L. and McVean, G. (2017) Latent variable model selection for Gaussian conditional random fields. Available at: arXiv:1512.06412.
 GagnonBartsch, J. A., Jacob, L. and Speed, T. P. (2013) Removing unwanted variation from high dimensional data with negative controls. Tech. Rep. 820, Department of Statistics, University of California at Berkeley.
 Van de Geer, S. and Bühlmann, P. (2013) penalized maximum likelihood for sparse directed acyclic graphs. Ann. Statist., 41, 536–567.
 Han, H., Shim, H., Shin, D., Shim, J. E., Ko, Y., Shin, J., Kim, H., Cho, A., Kim, E., Lee, T., Kim, H., Kim, K., Yang, S., Bae, D., Yun, A., Kim, S., Kim, C. Y., Cho, H. J., Kang, B., Shin, S. and Lee, I. (2015) TRRUST: a reference database of human transcriptional regulatory interactions. Scientific Reports, 5.
 Han, S. W., Chen, G., Cheon, M.S. and Zhong, H. (2016) Estimation of directed acyclic graphs through twostage adaptive lasso for gene network inference. J. Am. Statist. Ass., 111, 1004–1019.
 Harris, N. and Drton, M. (2013) PC algorithm for nonparanormal graphical models. J. Mach. Learn. Res., 14, 3365–3383.
 Hastie, T., Tibshirani, R. and Wainwright, M. (2015) Statistical Learning with Sparsity: The Lasso and Generalizations. Boca Raton, USA: Chapman & Hall/CRC.
 HeinzeDeml, C., Maathuis, M. H. and Meinshausen, N. (2018) Causal structure learning. To appear. Available at: arXiv:1706.09141.
 Kalisch, M. and Bühlmann, P. (2007) Estimating highdimensional directed acyclic graphs with the PCalgorithm. J. Mach. Learn. Res., 8, 613–636.
 Kalisch, M., Mächler, M., Colombo, D., Maathuis, M. and Bühlmann, P. (2012) Causal inference using graphical models with the R package pcalg. J. Statist. Software, 47, 1–26.
 Leek, J. T. and Storey, J. D. (2007) Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genetics, 3, e161.
 Ma, S., Xue, L. and Zou, H. (2013) Alternating direction methods for latent variable gaussian graphical model selection. Neural Comput., 25, 2172–2198.
 Maathuis, M. H., Colombo, D., Kalisch, M. and Bühlmann, P. (2010) Predicting causal effects in largescale systems from observational data. Nature methods, 7, 247–8.
 Maathuis, M. H., Kalisch, M. and Bühlmann, P. (2009) Estimating highdimensional intervention effects from observational data. Ann. Statist., 37, 3133–3164.
 Malinsky, D. and Spirtes, P. (2016) Estimating causal effects with ancestral graph Markov models. In Proceedings of the Eighth International Conference on Probabilistic Graphical Models, vol. 52 of Proceedings of Machine Learning Research, 299–309. Lugano, Switzerland: PMLR.
 Mostafavi, S., Battle, A., Zhu, X., Urban, A. E., Levinson, D., Montgomery, S. B. and Koller, D. (2013) Normalizing RNAsequencing data by modeling hidden covariates with prior knowledge. PLoS ONE, 8, e68141.
 Nandy, P., Hauser, A. and Maathuis, M. H. (2017a) Highdimensional consistency in scorebased and hybrid structure learning. Available at: arXiv:1507.02608.
 Nandy, P., Maathuis, M. H. and Richardson, T. S. (2017b) Estimating the effect of joint interventions from observational data in sparse highdimensional settings. Ann. Statist., 45, 647–674.
 Pearl, J. (2009a) Causal inference in statstics: an overview. Statistics Surveys, 3, 96–146.
 — (2009b) Causality: Models, Reasoning and Inference. Cambridge: Cambridge University Press, 2nd edn.
 Ravikumar, P. K., Raskutti, G., Wainwright, M. J. and Yu, B. (2009) Model selection in Gaussian graphical models: Highdimensional consistency of l1regularized mle. In Advances in Neural Information Processing Systems 21, 1329–1336. Curran Associates, Inc.
 Redell, M. and Tweardy, D. (2005) Targeting transcription factors for cancer therapy. Current Pharmaceutical Design, 11, 2873–2887.
 Richardson, T. S. and Spirtes, P. (2002) Ancestral graph Markov models. Ann. Statist., 30, 962–1030.
 Robinson, R. W. (1977) Counting unlabeled acyclic digraphs, 28–43. Combinatorial Mathematics V: Proceedings of the Fifth Australian Conference. Springer Berlin Heidelberg.
 Shpitser, I., Evans, R. J., Richardson, T. S. and Robins, J. M. (2013) Sparse nested markov models with loglinear parameters. In Proceedings of the TwentyNinth Conference on Uncertainty in Artificial Intelligence, UAI’13, 576–585.
 Spirtes, P., Glymour, C. and Scheines, R. (2000) Causation, Prediction, and Search. Adaptive Computation and Machine Learning. Cambridge: MIT Press, second edn.
 Spirtes, P., Meek, C. and Richardson, T. (1995) Causal inference in the presence of latent variables and selection bias. In In Proceedings of Eleventh Conference on Uncertainty in Artificial Intelligence, 499–506. San Francisco, CA: Morgan Kaufmann.
 Stegle, O., Parts, L., Piipari, M., Winn, J. and Durbin, R. (2012) Using probabilistic estimation of expression residuals (PEER) to obtain increased power and interpretability of gene expression analyses. Nature Protocols, 7, 500–507.
 Tsamardinos, I., Brown, L. E. and Aliferis, C. F. (2006) The maxmin hillclimbing Bayesian network structure learning algorithm. Mach. Learn., 65, 31–78.
 Verma, T. and Pearl, J. (1991) Equivalence and synthesis of causal models. In Proceedings of the Sixth Annual Conference on Uncertainty in Artificial Intelligence, UAI ’90, 255–270. New York, NY, USA: Elsevier Science Inc.
 Wille, A., Zimmermann, P., Vranová, E., Fürholz, A., Laule, O., Bleuler, S., Hennig, L., Prelić, A., von Rohr, P., Thiele, L., Zitzler, E., Gruissem, W. and Bühlmann, P. (2004) Sparse graphical Gaussian modeling of the isoprenoid gene network in arabidopsis thaliana. Genome Biology, 5, R92.
 Zhang, J. (2008) On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias. Artificial Intelligence, 172, 1873–1896.
 Zou, H. (2006) The adaptive lasso and its oracle properties. J. Am. Statist. Ass., 101, 1418–1429.