Efficient Structure Learning and Sampling
of Bayesian Networks^{†}^{†}thanks: R package BiDAG is available at https://CRAN.Rproject.org/package=BiDAG
Abstract
Bayesian networks are probabilistic graphical models widely employed to understand dependencies in high dimensional data, and even to facilitate causal discovery. Learning the underlying network structure, which is encoded as a directed acyclic graph (DAG) is highly challenging mainly due to the vast number of possible networks. Efforts have focussed on two fronts: constraint based methods that perform conditional independence tests to exclude edges and score and search approaches which explore the DAG space with greedy or MCMC schemes. Here we synthesise these two fields in a novel hybrid method which reduces the complexity of MCMC approaches to that of a constraint based method. Individual steps in the MCMC scheme only require simple table lookups so that very long chains can be efficiently obtained. Furthermore, the scheme includes an iterative procedure to correct for errors from the conditional independence tests. The algorithm not only offers markedly superior performance to alternatives, but DAGs can also be sampled from the posterior distribution enabling full Bayesian modelling averaging for much larger Bayesian networks.
Polina Suter
DBSSE, ETH Zurich, Mattenstrasse 26, 4058 Basel, Switzerland Giusi Moffa
Division of Psychiatry, University College London, London, UK
Institute for Clinical Epidemiology and Biostatistics,
University Hospital Basel and University of Basel, Basel, Switzerland
Editor:
Keywords: Bayesian Networks, Structure Learning, MCMC.
1 Introduction
Bayesian networks are statistical models to describe and visualise in a compact graphical form the probabilistic relationships between variables of interest. The nodes of a graphical structure correspond to the variables, while directed edges between the nodes encode conditional dependency relationships between them. The most important property of the digraphs underlying a Bayesian network is that they are acyclic, i.e. there are no directed paths which revisit any node. Such graphical objects are commonly denoted DAGs (directed acyclic graphs).
Alongside their more canonical use for representing multivariate probability distributions, DAGs also play a prominent role in describing causal models (Pearl, 2000; Hernán and Robins, 2006; VanderWeele and Robins, 2007) and facilitating causal discovery from observational data (Maathuis et al., 2009; Moffa et al., 2017), though caution is warranted in their use and interpretation (Dawid, 2010). However, the potential for causal discovery and uncovering the mechanisms underlying scientific phenomena in disciplines ranging from the social sciences (Elwert, 2013) to biology (Friedman et al., 2000; Friedman, 2004; Kuipers et al., 2017) has driven interest in DAG inference.
To fully characterise a Bayesian network we need the DAG structure and the parameters which define an associated statistical model to explicitly describe the probabilistic relationships between the variables. To make any inference about the variables in the network, we need both components, the structure and the parameters, which we may need to estimate from data. Given sample data from the joint probability distribution on the node variables, learning the graphical structure is in general the more challenging task. The difficulty mostly rests with the mere size of the search space, which grows superexponentially with the number of nodes , since the logarithm grows quadratically (Robinson, 1970, 1973). A curious illustration of this growth is that the number of DAGs with 21 nodes is approximately the estimated number of atoms in the observable universe ().
1.1 Bayesian network notation
Bayesian networks represent a factorisation of multivariate probability distributions of random variables by encoding conditional dependencies in a graphical structure. A Bayesian network consists of a DAG and an associated set of parameters which define the conditional distribution of each variable on its parents . The distribution represented by the Bayesian networks is then assumed to satisfy the Markov property (Koller and Friedman, 2009) that each variable is independent of its nondescendants given its parents , allowing the joint probability distribution to be factorised as
(1) 
Learning the parameters which best describe a set of data for a given a graph is generally straightforward, with the main difficulty in learning the structural dependence in and the DAG itself.
Due to the symmetry of conditional independence relationships, the same distribution might factorise according to different DAGs. Those DAGs encoding the same probability distribution constitute an equivalence class: they share the vstructures (two unconnected parents of any node, Verma and Pearl, 1990) and the skeleton (the edges if directions were removed). The equivalence class of DAGs can be represented as an essential graph (Andersson et al., 1997) or a completed partially directed acyclic graph (CPDAG) (Chickering, 2002b). Based purely on probabilistic properties, Bayesian networks can therefore only be learned from data up to an equivalence class.
1.2 DAG posteriors
In inferring Bayesian networks, the dual challenge consists of learning the graph structure (or its equivalence class) which best fits and explains the data , and accounting for the uncertainty in the structure and parameters given the data. A natural strategy in Bayesian inference consists of sampling and averaging over a set of similarly well fitting networks. Each DAG is assigned a score equal to its posterior probability given the data
where the likelihood has been marginalised over the parameter space. When the graph and parameter priors satisfy certain conditions of structure modularity, parameter independence and parameter modularity (Heckerman and Geiger, 1995; Friedman et al., 2000; Friedman and Koller, 2003) then the score decomposes
(2) 
involving a function which depends only on a node and its parents. For discrete categorical data, a Dirichlet prior is the only choice satisfying the required conditions for decomposition, leading to the BDe score of Heckerman and Geiger (1995). For continuous multivariate Gaussian data, the inverse Wishart prior leads to the BGe score (Geiger and Heckerman, 2002; corrected in Consonni and Rocca, 2012; Kuipers et al., 2014). In this manuscript we will focus on the continuous case with the BGe score when evaluating the complexity of our approach, and discuss the discrete binary case in Appendix B.
1.3 State of the art structure learning
Traditional approaches to structure learning fall into two categories (and their combination):

constraint based methods, relying on conditional independence tests

score and search algorithms, relying on a scoring function and a search procedure
Below we provide a brief review of these concepts and algorithms of each class pertinent to this work.
1.3.1 Constraint based methods
Constraint based methods exploit the property of Bayesian networks that edges encode conditional dependencies. If a pair of variables can be shown to be independent of each other when conditioning on at least one set (including the empty set) of the remaining variables, then a direct edge between the corresponding nodes in the graph can be excluded. The most prominent example of constraint based methods is the well known PC algorithm of Spirtes et al. (2000), more recently popularised by Kalisch and Bühlmann (2007), who provided consistency results for the case of sparse DAGs and an R (R Core Team, 2017) implementation within the pcalg package (Kalisch et al., 2012).
Rather than exhaustively search the possible conditioning sets for each pair of nodes, the crucial insight of the PC algorithm is to perform the tests in order of increasing complexity. Namely, starting from a fully connected (undirected) graph, the procedure tests marginal independence (conditioning on the empty set) for all pairs of nodes. Then it performs pairwise conditional independence tests between pairs of node which are still directly connected, conditioning on each adjacent node of either node in the pair, and so on conditioning on larger sets. Edges are always deleted when a conditional independence test cannot be rejected. This strategy differs from the typical use of hypothesis testing since edges are assumed to be present by default, but the null hypothesis is taken as conditional independence.
Edges which are never deleted through the process form the final skeleton of the PC algorithm. The conditional independencies which are not rejected identify all vstructures of the graph, fixing the direction of the corresponding edges. At this point it may still be possible to orient some edges, to ensure that no cycles are introduced and no additional vstructures are created. The algorithm finally returns the CPDAG of the Markov equivalence class consistent with the conditional dependencies compatible with the data.
In the final skeleton, for each node the remainder of its adjacent neighbourhood will have been conditioned upon. For the node with largest degree , at least tests will have been performed and the algorithm is of exponential complexity in the largest degree. In the best case the algorithm may still run with around 25–30, but in the worst case the base can increase giving a complexity bound of making the algorithm infeasible even for low for large DAGs (Kalisch and Bühlmann, 2007).
For sparse large graphs, the PC algorithm can be very efficient. Despite the efficiency, since edges can only be deleted and many (correlated) independence tests are performed, the PC algorithm tends to have a high rate of false negatives and hence lose edges, so that it finds only a fraction of those in the true network (Uhler et al., 2013). Increasing the threshold for the conditional independency tests does little to alleviate the problem of false negatives while increasing runtime substantially. Another aspect of the problem of repeated tests is that the output of the PC algorithm can depend on the order of the tests (or the ordering of the input data), leading to unstable estimates in high dimensions, although modifications have been proposed to mollify this effect (Colombo and Maathuis, 2014).
1.3.2 Score and search methods
On the other side of the coin are score and search methods and MCMC samplers. Each DAG gets a score, typically a penalised likelihood or a posterior probability (Section 1.2). An algorithm then searches through the DAG space for the structure (or ensemble thereof) which optimise the score.
The most basic sampler is structure MCMC where each step involves adding, deleting or reversing an edge (Madigan and York, 1995; Giudici and Castelo, 2003) and accepting the move according to a MetropolisHastings probability. The scheme is highly flexible – amplifying the score leads to simulated annealing while modulating the amplification through the acceptance rate leads to adaptive tempering, to speed up the traversal and exploration of the DAG space. Sampling from the neighbourhood of DAGs with one edge added, removed or reversed, proportional to their scores results in faster MCMC convergence (as for example in Jennings and Corcoran, 2016). Greedy hill climbing instead proceeds by choosing the highest scoring DAG in the neighbourhood as the new starting point. Greedy equivalence search (GES) (Chickering, 2002a) is a popular algorithm for a greedy search approach on the space of Markov equivalent DAGs.
When the score is decomposable (as in Eq. (2)), only nodes whose parents change need to be rescored providing an speedup for structure based methods. Since the decomposability in Eq. (2) mimics the factorisation in Eq. (1) it is a property possessed by commonly used scores like the BIC penalised likelihood or the BDe or BGe, and an essential property for more advanced algorithms.
A direction spearheaded by order MCMC (Friedman and Koller, 2003) reduces the search space by combining large collections of DAGs together, namely all DAGs sharing the same topological ordering of the nodes. The score of the order consists of the sum of the scores of the DAGs consistent with it, and an MCMC scheme runs on the space of orders which are simply permutations of the nodes. Although the number of DAGs compatible with each order also grows superexponentially, the sum of all their scores involves different contributions and can be evaluated in exponential time. For larger computations become quickly prohibitive and the complexity is artificially reduced to polynomial by imposing a hard limit on the number of parents allowed for each node. Nevertheless the strategy of combining large sets of DAGs and working on the much smaller (though still factorial) space of orders, enormously improves convergence with respect to structure MCMC, allowing the search and sampling of much larger graphs (for moderate or low ).
Score decomposability is also necessary for orderbased greedy search (Teyssier and Koller, 2005) as well as for dynamic or integer linear programming methods (Koivisto and Sood, 2004; Eaton and Murphy, 2007; He et al., 2016; Cussens, 2011; Cussens et al., 2017) for structure learning. For Bayesian model averaging, one limitation with order MCMC derives from the fact that DAGs may belong to many different orders, introducing a bias in the sampling. The bias can be avoided by working on the space of ordered partitions (Kuipers and Moffa, 2017) which provide a unique representation of each DAG. Other alternatives include large scale edge reversal (Grzegorczyk and Husmeier, 2008) and Gibbs sampling (Goudie and Mukherjee, 2016) moves. These unbiased MCMC schemes are currently the only viable approaches to sampling and accounting for structure uncertainty, though still limited to smaller or less dense graphs.
As the size and connectivity of the target DAGs increase, the wide spectrum of constraint based and score and search algorithms, cannot but fail to converge or discover optimally scoring graphs. To limit the extent to which the search space grows with the number of nodes, Tsamardinos et al. (2006) brought the two sides of the coin together to benefit from their individual advantages, the ease of conditional independence testing and the performance of DAG searching. First a constraint based method, akin to the PC algorithm, identifies a (liberal) undirected skeleton. A greedy search then acts on the restricted search space defined by excluding edges which are not included in the reference skeleton. Since score and search, when feasible, tends to perform better (Heckerman et al., 2006) than constraint based methods, the hybrid approach of Tsamardinos et al. (2006) outperformed previous methods. Nandy et al. (2015) recently investigated the consistency properties of hybrid approaches using GES in high dimensional settings.
1.4 Original contribution
In this work, we bring the power and sophistication of order and partition MCMC to the hybrid framework for structure learning. The result is a highly efficient algorithm for search and sampling with marked improvements on current state of the art methods. The key is to observe that the exponential complexity in of for order or partition MCMC (Friedman and Koller, 2003; Kuipers and Moffa, 2017) derives from allowing among the potentially parents of each node any of the other . If the set of potential parents is preselected, for example through a constraint based method relying on conditional independence tests, the complexity reduces to for searching of an optimal structure, and for unbiased structure sampling. The complexity of the search then matches that of the testing component of the PC algorithm. Moreover we can precompute and tabulate the parents set scores, which are exponentially costly. In particular we introduce a method to also precompute the tables of partial sums needed for the MCMC scheme with no complexity overhead for the search. During each MCMC step we then simply need to look up the relevant scores providing a very efficient sampler.
A distinctive feature of our method is not to fully trust the search space provided by the initial constraint based method. Each node is entitled to have as parent any of the permissible nodes in the search space, and an additional arbitrary one from outside that set. Accounting for the expansion to the potential parent set, each MCMC step takes a time of up to order , despite scoring vast sets of DAGs at a time, and which is even faster than structure MCMC moves. The expansion beyond the skeleton provides a mechanism to iteratively improve the search space until it includes the maximally scoring DAG encountered or the bulk of the posterior weight. Based on our simulations the results strongly outperform currently available alternatives, enabling efficient inference and sampling of much larger DAGs.
In Section 2 we develop efficient algorithms for orderbased inference for a known search space. Then in Section 3 we demonstrate how to expand and improve the search space iteratively, and show how this offers notably improved performance. Finally in Section 4 we extend the scheme to the space of partitions to provide an unbiased sampler. The algorithms introduced here are all implemented in the R (R Core Team, 2017) package BiDAG.
2 Orderbased inference on a given search space
In the order MCMC algorithm of Friedman and Koller (2003), the nodes of a DAG are arranged in topological order . We associate a permutation with each order. For a DAG to be compatible with an order, the parents of each node must have a higher index in the permutation
(3) 
Visually, when we place the nodes in a linear chain from left to right according to , edges may only come from nodes further to the right (Figure 1). With the rows and columns labelled following , the adjacency matrix of a compatible DAG is lower triangular so that a total of DAGs are compatible with each order.
2.1 Order MCMC
The idea of order MCMC is to combine all DAGs consistent with an order so to reduce the problem to the much smaller space of permutations instead of working directly on the space of DAGs. Each order receives a score equal to the sum of the scores of all DAGs in the order
(4) 
Instead of naively scoring all DAGs in an order, Friedman and Koller (2003) used the factorisation in Eq. (2)
(5) 
to exchange the sum and product. The sum is restricted to parent subsets compatible with the node ordering
(6) 
The score of the order therefore reduces to sums over all compatible parent subsets, eliminating the need of summing over DAGs. Evaluating the score therefore requires evaluations of the score function . This provides a massive reduction in complexity compared to scoring all DAGs in the order individually. The exponential complexity in is still too high for larger DAGs so a hard limit on the size of the parent sets is typically introduced to obtain polynomial complexity of evaluations of . For larger DAGs however, must be rather small in practice, so that the truncation runs the risk of artificially excluding highly scoring DAGs. As a remedy, we start by defining order MCMC on a given search space, which may be selected for example based on prior subject knowledge or from a skeleton derived through constraint based methods.
2.2 Restricting the search space
The search space can be defined by a directed graph , which is not necessarily acyclic, or its adjacency matrix, :
(7) 
One advantage with respect to simply using an undirected skeleton is that the directed graph naturally allows for prior information to be included. In the search space, each node has the following set of permissible parents
(8) 
For a set of size we evaluate the score of each possible combination and store the scores in a table (as in the example of Table 1). Since there are possible combinations, and for the BGe score each involves taking the determinant of a matrix, the complexity of building this table is .
Parent nodes  Node score  Banned parents  Summed node score 

For convenience later, we label the different scores with a binary mapping from a parent subset to the integers:
(9) 
using the indicator function .
2.3 Efficient order scoring
To score all the DAGs compatible with a particular order we still need to select and sum the rows in the score tables where the parent subset respects the order
(10) 
with the additional constraint that all elements in the parent sets considered must belong to the search space defined by . From the precomputed score table of all permissible parent subsets in the search space, we select those compatible with the order constraint. Simply running through the rows takes exponential time (in ) for each node. We can avoid this by building a second table: the summed score table (see also the example in Table 1). The first column indicates which nodes are banned as parents and the second column reports the sum of scores over all parent subsets excluding those nodes. For the indexing of the sums we negate the indicator function:
(11) 
A Hasse diagram (Figure 2) visualises the power set of the permissible parents with layers ranked by the size of the parent subsets, and helps develop a strategy to efficiently evaluate the partial sums over parent subsets. Directed edges indicate the addition of another parent to each subset, while the corresponding scores of each parent subset are attached to the nodes in Figure 2. The advantage of the Hasse representation is that each element in the summed score table (right of Table 1) is the sum of the scores of a node and all its ancestors in the network.
2.3.1 Computing the summed score table
To actually perform the sums we utilise the separation of the power set into layers of differently sized parent subsets and implement Algorithm 1. The partial sums at each layer are propagated to their children in the network. To avoid overcounting contributions from ancestors which are propagated along all paths connecting nodes layers apart, we divide by the corresponding factorials to obtain the required sums. For each end layer there are a different number of ancestral paths to the nodes in previous layers leading to different correction factors, so we need to repeat the propagation times. Building the summed score table for each variable has a complexity of . This is lower than creating the original score table and so adds no complexity overhead to the method.
The summed score table provides the ability to efficiently score each order. For each node we look up the relevant row in the summed score table and by moving linearly along the order we can compute Eq. (10) in as
(12) 
where
(13) 
2.3.2 Order MCMC scoring revisited
The same scheme can be applied to the standard order MCMC restriction of a hard limit of on the size of parent sets, but where the parents may be selected from among all nodes. In Algorithm 1 the number of permissible parents would be rather than and we simply need to truncate the main for loop over the level to take a maximum value of . The complexity of building the summed score table is per variable and hence in total, but once the tables have been built the complexity of scoring any order is again just .
2.4 Order MCMC moves
The strategy of order MCMC is to build a Markov chain on the space of orders. From the order at the current iteration , a new order is proposed and accepted with probability
(14) 
to provide a chain with a stationary distribution proportional to the score for symmetric proposals. The standard proposal (Friedman and Koller, 2003) is to swap two nodes in the order while leaving the others fixed. We will denote this move as a global swap. The banned parent subset of each node between two swapped ones may change. Therefore they need to be rescored and computing the score of the proposed order has a complexity . It is possible to move from one order to any other in steps, making the chain irreducible. A more local move of only transposing two adjacent nodes allows the proposed order to be scored in . We will denote this move as a local transposition which takes steps to become irreducible.
2.4.1 A Gibbs move in order space
On the space of orders we define a new move with the same complexity of the standard global swap move. We denote this move as a node relocation. From the current state in the chain, , first sample a node uniformly from the available, say node . Define the neighbourhood of the move, , to be all orders with node placed elsewhere in the order or at its current position, as in the example in Figure 3 with node 4 chosen. To move through the full neighbourhood of size , we can sequentially transpose node with adjacent nodes. Since each local transposition takes a time to compute the score of the next order, scoring the whole neighbourhood takes . Finally sample a proposed order proportionally to the scores of all the orders in the neighbourhood. As a consequence the move is always accepted and the next step of the chain set to the proposed order . We summarise the move in Algorithm 2.
The newly defined single node relocation move satisfies detailed balance
(15) 
where is the transition probability from to . The transition involves first sampling a node and then the order proportionally to its score so that
(16) 
The reverse move needs the same node to be selected, and as the denominators cancel when substituting into Eq. (15). For orders connected by a local transposition, say node swapped with the adjacent node , there are two possible paths connecting the orders and a transition probability of
(17) 
Since the reverse move involves the same pair of nodes, we can again directly verify detailed balance by substituting into Eq. (15).
The move is aperiodic since the original order is included in the neighbourhood. It is possible to move from any order to any other by performing steps making the chain also irreducible. Therefore the newly defined move satisfies the requirements for the chain to converge and provide order samples from a probability distribution proportional to the score .
We also note that the node relocation move naturally provides a fixed scan Gibbs sampler by cycling through the nodes sequentially, rather than sampling the node to move at each step.
2.4.2 Chain complexity
We mix the three moves into a single scheme. Since the global swap involves rescoring nodes on average at each step, while the local transposition involves and the node relocation we can keep the average complexity at if we sample the more expensive moves with a probability . To balance their computational costs, we select them with a probability of respectively. The irreducible length of the mixture is so following the heuristic reasoning of Kuipers and Moffa (2017) we would expect convergence of the chain to take steps. This complexity is consistent with our simulation results in Appendix A.
Once the score tables have been computed, the complexity of running the whole chain is also . Utilising the lookup table of summed scores therefore offers a substantial reduction in complexity of a factor of compared to standard order MCMC with only the size of parent sets restricted to and only utilising the basic score tables.
2.4.3 DAG sampling
From a sampled order, we can draw a DAG consistent with the order by sampling the parents of each node proportionally to the entries in the score table which respect the order. Complexity remains exponential, of , so the scheme should be thinned such that DAG sampling only happens periodically in the order chain. The frequency should be set so that sampling takes at most comparable computational time to running the order chain inbetween.
2.5 Maximal DAG discovery
In addition to sampling DAGs, we can also employ the MCMC scheme to search for the maximally scoring or maximum a posteriori (MAP) DAG. To this end we replace the score of each order by the score of the highest scoring DAG in that order
(18) 
To compute the terms on the right we again use the Hasse power set representation of the permissible parent set of each node and propagate the maximum of scores down the power set following Algorithm 3:
(19) 
2.5.1 Stochastic search
Finding the order with the highest directly provides a MAP DAG. A stochastic search based on the order MCMC scheme with score can tackle the problem. Running the scheme, we keep track of the highest scoring order, and hence the highest scoring DAG, encountered. The convergence time to sample orders from a distribution proportional to is again expected to be .
To perform adaptive tempering and speed up discovery of the MAP DAG, we can transform the score by raising it to the power of and employ . This transformation smooths () or amplifies () the score landscape and the value of can be tuned adaptively depending on the acceptance probability of the MCMC moves while running the algorithm. To effectively explore local neighbourhoods, the target acceptance of swaps may scale . Alternatively, simulated annealing can be performed by sending .
2.5.2 Greedy search
The orderbased scheme can also be adapted to perform a greedy search (Teyssier and Koller, 2005). For example we score all possible local transpositions of adjacent nodes in and select the highest scoring at each step. With an irreducible length of for this move, we would expect complexity to find each local maximum. For the standard global swap of two random nodes, scoring the neighbourhood itself is so that the irreducibility length of makes this move more expensive than local transpositions. Local transpositions would therefore be generally preferable for greedy search, although global swaps may be useful to escape from local maxima.
The new node relocation move of moving a single node at a time (Figure 3) requires only to score all the possible new placements of all nodes. With the irreducibility length of , we are again looking at complexity for each search. The new move also contains all local transpositions in its neighbourhood and so provides a complementary alternative to a greedy search scheme purely based on local transpositions.
3 Extending the search space
The restricted search space, derived for example through constraint based methods, may exclude relevant edges. To address this problem, we extend our approach to soften the restrictions. In particular we allow each node to have one additional parent from among the remaining nodes outside its permissible parent set. The score of each order becomes
(20) 
For the efficient computation of the sum, we build a score table for each node and each additional parent. For a node with parents this leads to tables in total and we perform Algorithm 1 on each of them. The time and space complexity of building these tables is therefore an order more expensive than using the restricted search space. Given the tables, however, an order can again be scored with a simple lookup
(21) 
where we also index the summed scores with the additional parent. The complexity of scoring the order is .
3.1 Move complexity
For the local transposition where we swap two adjacent nodes in the order, if neither is in the permissible parent set of the other we simply update one element of the sum in Eq. (21) in . If either is in the permissible parent set the index changes and all terms need to be replaced in . However since the node have up to parents, the probability of a permissible parent being affected is giving an average complexity of . For the global swap the maximum complexity is when many of the intermediate nodes have among their permissible parents either of the swapped nodes, but on average the complexity is following the same argument as above. The node relocation move has the same complexity, so that the weighted mixture of moves typically takes just .
3.2 Iteratively improving the search space
Extending the search space allows us to discover edges which improve the score of DAGs or with a high posterior weight which were previously excluded from the core search space defined by . These edges are then incorporated into the core search space and the search space improved iteratively.
Starting with the original core search space we expand to allow one additional parent and search for the maximally scoring DAG in that space and convert it to the CPDAG . For the next core search space we take the union of the edges in and , expand and search for the highest scoring CPDAG in the new space. We iteratively repeat this procedure
(22) 
until no higher scoring CPDAG is uncovered and it is entirely included in the core search space ().
By improving the search space in this way, we obtain a stark improvement in performance, as illustrated in Figure 4 and Appendix A. When simply utilising the initial core search space from the PC algorithm, we do observe a modest improvement of our MCMC sampler over both the constraint based PC algorithm (Spirtes et al., 2000) and the score and search based GES (Chickering, 2002a). However once we allow the score to guide the selection of the search space, we quickly move away from the alternatives (Figure 4). The improvements with each iteration in updating the search space are illustrated in Figure 9 in Appendix A.
4 DAG sampling on a fixed search space
For partition MCMC (Kuipers and Moffa, 2017), the nodes of a DAG are assigned to a labelled ordered partition consisting of a partition of the nodes and a permutation where the nodes within each partition element take ascending order. This provides a unique representation of each DAG unlike the simpler order representation which has a bias towards DAGs which belong to more orders. The assignment can be performed by recursively tracking nodes with no incoming edges, called outpoints (Robinson, 1970, 1973). In Figure 5 the outpoints are nodes 1 and 5 which are placed in the rightmost partition element. With these nodes and their outgoing edges removed, nodes 3 and 4 become outpoints placed into the second partition element and finally node 2 fills the remaining partition element. The partition is with permutation .
When reversing the process and building a DAG recursively, the outpoints at each stage must be connected to from outpoints at the next stage. Each node in any partition element must have at least one incoming edge from nodes in the adjacent partition element to the right. For example, node 2 in any DAG compatible with the partition in Figure 5 must have an edge from either node 3 or node 4 (or both). Additional edges may only come from nodes further right.
There are then 12 possible incoming edge combinations for node 2, three for each of node 3 and 4, giving a total of 108 DAGs compatible with the labelled ordered partition of the DAG in Figure 5. In partition MCMC the sum of the scores of all these DAGs is assigned to the partition. We now describe an efficient implementation when the search space is restricted.
4.1 Scoring partitions on a restricted search space
The posterior probability of a labelled partition is the sum of posterior probabilities of DAGs within the search space compatible with the partition
(23) 
where the restriction on parents sets induced by the partition is that they must contain at least one node from the adjacent partition element to the right. To evaluate the sums in Eq. (23) for each subset of banned nodes (belonging to the same partition element or further left) we need to keep track of the subset of needed nodes belonging to the partition element immediately to the right to ensure at least one is in the parent set. With permissible parents for node we have possible subset pairs for which we use the ternary mapping:
(24) 
We also define the mapping back to the banned parent set
(25) 
For the example with 3 permissible parents, there are the 8 values calculated in Table 1 where there was no restriction on enforcing the presence of a member of the needed parent subset (which we regard as the empty set). Additionally there are the 19 combinations in Table 2 where we index the sums with the ternary mapping of Eq. (24).
Banned parents  Needed parents  Summed node score 

Again we build a network representation of the possibilities by replicating each node in the power set representation according to the number of choices of possible needed parent subsets in the complement of the banned parent subsets. We rank the nodes by the size of the banned node subsets as in Figure 6, and assign nodes the score corresponding to the complement of the banned node subset. The connection in the network represent either removing an element from the banned parent subset, or moving it to the needed parent subset. For any such that is in the banned parent subset there is then an edge to the node indexed by more and the node indexed by less using the ternary mapping of Eq. (24).
The sums in Table 2 are the sums of the scores of each node in the network in Figure 6 and its ancestors. To compute these sums efficiently we again propagate through the network using Algorithm 4 whose complexity is . The restriction encoded by partitions to remove the bias of order MCMC therefore increases the computational cost of building the lookup tables for the partition MCMC sampler. However, once the score table is built, computing the score of any partition from Eq. (23) reduces to
(26) 
which can be evaluated in .
4.2 Partition MCMC moves
The simplest move in the partition space consists of splitting partition elements, or joining adjacent ones. Proposing such a move from to and accepting with probability
(27) 
while accounting for the neighbourhood sizes (following Kuipers and Moffa, 2017) is sufficient to sample partitions proportionally to their posterior in the search space. Nodes in two partition elements need to be rescored by looking up new values in their restricted sum tables. Although partition elements can get to a size , on average they contain around nodes (Kuipers and Moffa, 2015) so we would expect for this move. Once a partition has been sampled, a compatible DAG can be sampled conditionally.
To speed up convergence, additional permutations moves were included in Kuipers and Moffa (2017), either between two nodes in adjacent partition elements, requiring again the rescoring of nodes in two partition elements, or between any two nodes in different partition elements requiring the rescoring of all nodes inbetween as well. We would typically expect for the local swapping of nodes and for the global swapping. The global swap is picked with a probability to contain the average complexity.
A final move is to select a single node and to move it elsewhere in the partition, or as a new partition element. Analogously to the single node move introduced in Section 2.4 we can score the entire neighbourhood for any node selected at random by sequentially moving it through the partition, as in Figure 7. Since each other node has its needed or banned parent sets essentially affected twice, then scoring the neighbourhood takes . We always accept this move as it is sampled directly from the neighbourhood (which includes the starting partition), further aiding convergence. This move is also selected with a probability .
4.3 Expanding the search space
When expanding the search space, for each node we simply create further summed score tables where each time we include one other node from outside the search space as an additional parent. The space and time complexity increases by a factor of when building these tables. Since only one element is required from the needed node subsets defined by the adjacent partition element to the right, we sum over the relevant entries for all nodes outside the search space but in the adjacent partition element. For the MCMC moves, the time complexity can increase by a factor but the typical increase is again on average.
5 Discussion
In this work we presented a novel and computationally efficient approach for learning and sampling the DAGs underlying high dimensional Bayesian network models. Two main original features are worth highlighting:
First, the computational efficiency we gain by observing that the key quantities needed for the MCMC scheme can be effectively precomputed and stored in lookup tables. This constitute a valuable enrichment of methods which group DAGs into collections which can be scored together to reduce the size of the search space, as in order based approaches. As such, order and partition MCMC constitute the building blocks of our procedure with the advantage now that each step in the chain take minimal computational time.
Second, the expansion of the search space beyond the skeleton preliminary obtained through constraint based methods, which ensures a better inference of the network structure. In fact the predefined search space may not include DAGs corresponding to the mode of the posterior distribution, and therefore it is important to add some flexibility to hybrid methods in this respect. The advantages with respect to current mainstream methods have been illustrated in our simulation studies in Figure 4 and Appendix A.
When iteratively updating the search space, our focus was to include edges ensuring that the highest scoring (or MAP) DAG found at each stage belongs to the core search space for the next iteration. Alternatively we could sample from the posterior distribution and update the search space by adding edges with a certain posterior probability. The orderbased sample is of course biased, but the additional restriction in the combinatorics of partition MCMC, which ensures a unique representation of each DAG, increases the complexity of building the necessary lookup tables. For denser networks it may be preferable to remove the bias in later iterations only, once the search space has already converged under order. The complexity of finding the MAP or sampling with order MCMC is actually the same. We chose to update the search space based on the MAP since using order MCMC to find a maximal score can be faster than sampling due to the possibility of tempering.
The possibility to add edges beyond a predefined skeleton, allows for the correction of errors where edges have been missed. The iterative approach is, aside from stochastic fluctuations in the search or sampling, mainly deterministic. However, since only the addition of a single parent at a time is considered for each node, the algorithm would not pick up situations of missing correlated edges, which would need to be be added at the same time to improve the score. Allowing pairs of edges to be added together increases the overall space complexity by a factor which can be computationally prohibitive. On the other hand we could view the search space itself, or the lists of permitted parents, as a random variable and implement a stochastic updating scheme. This may be especially feasible for sparser graphs where such a scheme could effectively extend the posterior sample outside of a fixed search space.
As the initial core search space we adopted the undirected skeleton obtained from the PC algorithm, without accounting for any orientations. The iterative steps of building the score tables have exponential complexity in the number of parents. Therefore ignoring the direction in the case of nodes with many children, which will be included as potential parents, will lead to increased costs in building the lookup tables. In certain cases it may then be convenient to exploit the direction of edges from the PC algorithm and limit particular nodes’ parents to those nodes compatible with directed or undirected edges in the CPDAG.
Although we focussed on the use of the PC algorithm to obtain an initial core search space, our approach is agnostic to the method used to define starting point, although obviously performance will improve the closer the initial search space is to the target space containing the bulk of the posterior distribution. If relevant edges are missing in the initial search space, our algorithm can add them though it may take a few iterations. False positive edges in the search space do not affect the MCMC search, but do increase the time and space needed for computing the lookup tables. In our simulations, the PC algorithm was quite conservative, missing many edges but introducing few false positives. Due to the many missing edges improving the search space tended to require quite some iterations, which were however reasonably fast.
If we defined the initial core search space by GES we would start with more of the important edges, but also many false positives. As a consequence the algorithm would require fewer but more costly steps to improve the search space. The number of false positives when using GES is sensitive to the penalisation parameter in the score, which would ideally be optimally tuned if using GES to obtain the initial search space for our method. Orderbased conditional independence tests also offer another option (Raskutti and Uhler, 2013). For Gaussian models, the Markov random field or conditional independence graph defined by the precision matrix (as used for example in Nandy et al., 2015) is also a possibility. Theoretically the conditional independence graph should contain all edges present in the PC algorithm skeleton, potentially including more true positive edges, while most likely also introducing additional false positives. In principle one may even combine search spaces from different approaches.
A possibly interesting direction would be to start from the maximally scoring DAG given by the ILP method of Cussens (2011) and Cussens et al. (2017), if the solver manages to complete and the number of parents in the DAG is less than the low limit needed for the input score tables. The maximal DAG, appropriately expanded, may then be a good starting point for the full sampling. Alternatively the final search space obtained by our search could be an interesting input for the ILP, or may be determined by combining elements of both approaches. Similarly one may consider whether dynamic programming approaches for exhaustively searching orders (Koivisto and Sood, 2004; Silander and Myllymäki, 2006; Eaton and Murphy, 2007; He et al., 2016) can be modified to work on restricted search spaces and be efficient enough to replace the MCMC search.
Regardless of how the initial search space is defined, or how the maximal DAG is discovered, and our approach markedly outperforms others for this, our hybrid scheme is the only one capable of efficiently sampling larger and denser graphs. Sampling from the posterior distribution is vital for understanding the uncertainty in the graph structure itself. For robust inference, the structure uncertainty should be accounted for in further downstream analyses (Kuipers et al., 2017), for example for causal interpretations and in the estimation of intervention effects (Moffa et al., 2017).
Acknowledgements
The authors would like to thank Markus Kalisch and Niko Beerenwinkel for useful comments and discussions.
A Performance and convergence
To examine the performance and convergence of our method, we performed a simulation study for 4 network sizes and 2 sample sizes . For each combinations of network and sample sizes 100 random graphs and corresponding data were generated using the functions randomDAG and rmvDAG from the pcalg package. The edge probability parameter was set to , so that average number of edges in the DAG equals and the average parent set size per node equals . The strengths of the edges was set to the range .
a.1 Skeleton inference
To assess the performance we first considered the number of true positives (TP) and false positives (FP) in the undirected skeletons of the networks inferred and scaled them by the number of edges (P) in the true DAG
(28) 
We computed the median TPR along with the first and third quartiles and plotted (Figures 4 and 8) against the median FPRn over the 100 realisations for our MCMC scheme for the final and initial search spaces, along with the results from GES (Chickering, 2002a) and the PC algorithm (Spirtes et al., 2000; Kalisch et al., 2012). The range of discrimination thresholds were:

penalisation parameter for GES

significance level for the PC algorithm

posterior probability for MCMC
From the initial search space defined by the PC algorithm skeleton, and expanded to include an additional parent, we see a distinct improvement with our MCMC scheme (pink circles in Figures 4 and 8) while when we improve the search space iteratively until it contains the MAP DAG discovered we observe a strong advantage over the alternative methods (purple squares in Figures 4 and 8) and approach perfect recovery of the skeleton for the larger sample size (Figure 4).
To explore how the iterative search leads to such an improvement in performance we keep track of the highest scoring DAG uncovered at each iteration, and used to update the core search space for the next iteration. In Figure 9, we overlay these intermediate results on the MCMC lines of Figure 4. Each iteration, and especially the earlier ones leads to an improvement in the search space allowing the MCMC search to find better DAGs which were previously excluded. Finally, utilising the posterior probability of edges in the sample from the final search space, we can remove some of the false positive edges in the point estimate of the MAP DAG.
a.2 Direction inference
Along with inferring the undirected skeleton, we also assess the performance in recovering the correct directions and compute the structural Hamming distance (SHD) between the true generating DAG and those inferred by the different methods. In all cases we convert to CPDAGs before computing the distances. For GES we used the penalisation while for the PC algorithm we used a significance level of . To condense the sample of DAGs from our MCMC schemes to a single graph, we converted the sample to CPDAGs and retained edges occurring with a posterior probability greater than . The result for the MAP correspond to the highest scoring DAG discovered in the final search space, again transformed into a CPDAG.
a.3 Convergence
To examine the convergence of our MCMC scheme we compared two independent runs with different initial points in the final search space for each dataset. We computed the squared correlation between the two runs of the posterior probabilities of each edge, after excluding a burnin period of 20%. Since most edges are absent, only edges with a posterior probability greater than 5% in at least one run were included to avoid artificially inflating the correlation coefficient. The median along with the first and third quartiles are displayed in Figure 12. By scaling the number of MCMC steps with we observe the squared correlation approaching 1 with the difference decaying linearly with the scaled number of steps. We see, especially after this transformation that there is little dependence of the correlation on the size of the network, apart from a slower convergence for the smallest network size. With reasonable consistency, and importantly no obvious decrease in scaled performance as the number of variables increases, the simulation results of Figure 12 are in line with estimate of requiring steps for the MCMC convergence.
B Binary data
For binary data, we employ the BDe score (Heckerman and Geiger, 1995). For any node , its contribution to the score involves computing the number of times takes the value 1, or the value 0, for each of the possible configurations of its parents . All the parents for each of the observations must be run through in a complexity of . As there is a parameter associated with each of the parent configurations, we assume . Building the score table of node by naively running through all parent configurations would take .
However the parent configurations are connected via the poset structure of Figure 2, so we can build the score table more efficiently by only looking at the raw data once. For the BDe score for the case when all parents are present we build the vectors and whose elements count the number of times takes the value 1 and 0 for each parent state, in time . We employ a binary mapping of the parent states to elements of the vectors using
(29) 
where for the full set of parents, . When we remove one of the parents to compute the score table entry at layer in the poset of Figure 2 we simply combine elements of the vector where the removed parent takes the value 0 with the corresponding elements where it takes the value 1. In general we can create the vectors at each level from any connected at a higher level with
(30) 
where the square brackets indicate the elements of the vectors and we employ the mapping
(31) 
From the pair of vectors for any set of permissible parent nodes we can compute the entry in the score table according to the BDe score (Heckerman and Geiger, 1995)
(32) 
with and the hyperparameters of the beta distributions which correspond to pseudocounts in the score.
Repeating the creation of the count vectors and computation of the score by moving up the layers in the poset of Figure 2, as summarised in Algorithm 5, we efficiently build the score table for each node in the data. For each term at layer we look at vectors from the layer above of size so that filling out the score tables takes . Combining with the initial step leads to an overall complexity of which is a significant improvement on the naive implementation of .
References
 Andersson et al. (1997) S. A. Andersson, D. Madigan, and M. D. Perlman. A characterization of Markov equivalence classes for acyclic digraphs. Annals of Statistics, 25:505–541, 1997.
 Chickering (2002a) D. M. Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3:507–554, 2002a.
 Chickering (2002b) D. M. Chickering. Learning equivalence classes of Bayesiannetwork structures. Journal of Machine Learning Research, 2:445–498, 2002b.
 Colombo and Maathuis (2014) D. Colombo and M. H. Maathuis. Orderindependent constraintbased causal structure learning. Journal of Machine Learning Research, 15:3741–3782, 2014.
 Consonni and Rocca (2012) G. Consonni and L. L. Rocca. Objective Bayes factors for Gaussian directed acyclic graphical models. Scandinavian Journal of Statistics, 39:743–756, 2012.
 Cussens (2011) J. Cussens. Bayesian network learning with cutting planes. In Twentyseventh Conference on Uncertainty in Artificial Intelligence, pages 153–160, 2011.
 Cussens et al. (2017) J. Cussens, M. Järvisalo, J. H. Korhonen, and M. Bartlett. Bayesian network structure learning with integer programming: Polytopes, facets, and complexity. Journal of Artificial Intelligence Research, 58:185–229, 2017.
 Dawid (2010) A. P. Dawid. Beware of the DAG! Journal of Machine Learning Research Workshop and Conference Proceedings, 6:59–86, 2010.
 Eaton and Murphy (2007) D. Eaton and K. Murphy. Bayesian structure learning using dynamic programming and MCMC. In Twentythird Conference on Uncertainty in Artificial Intelligence, pages 101–108, 2007.
 Elwert (2013) F. Elwert. Graphical causal models. In Handbook of causal analysis for social research, pages 245–273. Springer, 2013.
 Friedman (2004) N. Friedman. Inferring cellular networks using probabilistic graphical models. Science, 303:799–805, 2004.
 Friedman and Koller (2003) N. Friedman and D. Koller. Being Bayesian about network structure. A Bayesian approach to structure discovery in Bayesian networks. Machine Learning, 50:95–125, 2003.
 Friedman et al. (2000) N. Friedman, M. Linial, I. Nachman, and D. Pe’er. Using Bayesian networks to analyze expression data. Journal of Computational Biology, 7:601–620, 2000.
 Geiger and Heckerman (2002) D. Geiger and D. Heckerman. Parameter priors for directed acyclic graphical models and the characterization of several probability distributions. Annals of Statistics, 30:1412–1440, 2002.
 Giudici and Castelo (2003) P. Giudici and R. Castelo. Improving Markov chain Monte Carlo model search for data mining. Machine Learning, 50:127–158, 2003.
 Goudie and Mukherjee (2016) R. J. B. Goudie and S. Mukherjee. A Gibbs sampler for learning DAGs. Journal of Machine Learning Research, 17:1032–1070, 2016.
 Grzegorczyk and Husmeier (2008) M. Grzegorczyk and D. Husmeier. Improving the structure MCMC sampler for Bayesian networks by introducing a new edge reversal move. Machine Learning, 71:265–305, 2008.
 He et al. (2016) R. He, J. Tian, and H. Wu. Structure learning in Bayesian networks of moderate size by efficient sampling. Journal of Machine Learning Research, 17:3483–3536, 2016.
 Heckerman and Geiger (1995) D. Heckerman and D. Geiger. Learning Bayesian networks: A unification for discrete and Gaussian domains. In Eleventh Conference on Uncertainty in Artificial Intelligence, pages 274–284, 1995.
 Heckerman et al. (2006) D. Heckerman, C. Meek, and G. Cooper. A Bayesian approach to causal discovery. Innovations in Machine Learning, pages 1–28, 2006.
 Hernán and Robins (2006) M. A. Hernán and J. M. Robins. Instruments for causal inference: an epidemiologist’s dream? Epidemiology, 17:360–372, 2006.
 Jennings and Corcoran (2016) D. Jennings and J. N. Corcoran. A birth and death process for Bayesian network structure inference. arXiv:1610.00189, 2016.
 Kalisch and Bühlmann (2007) M. Kalisch and P. Bühlmann. Estimating highdimensional directed acyclic graphs with the PCalgorithm. Journal of Machine Learning Research, 8:613–636, 2007.
 Kalisch et al. (2012) M. Kalisch, M. Mächler, D. Colombo, M. H. Maathuis, and P. Bühlmann. Causal inference using graphical models with the R package pcalg. Journal of Statistical Software, 47:1–26, 2012.
 Koivisto and Sood (2004) M. Koivisto and K. Sood. Exact Bayesian structure discovery in Bayesian networks. Journal of Machine Learning Research, 5:549–573, 2004.
 Koller and Friedman (2009) D. Koller and N. Friedman. Probabilistic graphical models. MIT press, 2009.
 Kuipers and Moffa (2015) J. Kuipers and G. Moffa. Uniform random generation of large acyclic digraphs. Statistics and Computing, 25:227–242, 2015.
 Kuipers and Moffa (2017) J. Kuipers and G. Moffa. Partition MCMC for inference on acyclic digraphs. Journal of the American Statistical Association, 112:282–299, 2017.
 Kuipers et al. (2014) J. Kuipers, G. Moffa, and D. Heckerman. Addendum on the scoring of Gaussian directed acyclic graphical models. Annals of Statistics, 42:1689–1691, 2014.
 Kuipers et al. (2017) J. Kuipers, T. Thurnherr, G. Moffa, P. Suter, J. Behr, R. Goosen, G. Christofori, and N. Beerenwinkel. Mutational interactions define novel cancer subgroups. bioRxiv:187260, 2017.
 Maathuis et al. (2009) M. H. Maathuis, M. Kalisch, and P. Bühlmann. Estimating highdimensional intervention effects from observational data. Annals of Statistics, 37:3133–3164, 2009.
 Madigan and York (1995) D. Madigan and J. York. Bayesian graphical models for discrete data. International Statistical Review, 63:215–232, 1995.
 Moffa et al. (2017) G. Moffa, G. Catone, J. Kuipers, E. Kuipers, D. Freeman, S. Marwaha, B. Lennox, M. Broome, and P. Bebbington. Using directed acyclic graphs in epidemiological research in psychosis: An analysis of the role of bullying in psychosis. Schizophrenia Bulletin, 43:1273–1279, 2017.
 Nandy et al. (2015) P. Nandy, A. Hauser, and M. H. Maathuis. Highdimensional consistency in scorebased and hybrid structure learning. arXiv:1507.02608, 2015.
 Pearl (2000) J. Pearl. Causality: models, reasoning and inference. MIT press, 2000.
 R Core Team (2017) R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, 2017. URL https://www.Rproject.org/.
 Raskutti and Uhler (2013) G. Raskutti and C. Uhler. Learning directed acyclic graphs based on sparsest permutations. arXiv:1307.0366, 2013.
 Robinson (1970) R. W. Robinson. Enumeration of acyclic digraphs. In Second Chapel Hill Conference on Combinatorial Mathematics and its Applications, pages 391–399, 1970.
 Robinson (1973) R. W. Robinson. Counting labeled acyclic digraphs. In New directions in the theory of graphs, pages 239–273. Academic Press, New York, 1973.
 Silander and Myllymäki (2006) T. Silander and P. Myllymäki. A simple approach for finding the globally optimal Bayesian network structure. In Twentysecond Conference on Uncertainty in Artificial Intelligence, pages 445–452, 2006.
 Spirtes et al. (2000) P. Spirtes, C. N. Glymour, and R. Scheines. Causation, prediction, and search. MIT Press, 2000.
 Teyssier and Koller (2005) M. Teyssier and D. Koller. Orderingbased search: A simple and effective algorithm for learning Bayesian networks. In Twentyfirst Conference on Uncertainty in Artificial Intelligence, pages 584–590, 2005.
 Tsamardinos et al. (2006) I. Tsamardinos, L. E. Brown, and C. F. Aliferis. The maxmin hillclimbing Bayesian network structure learning algorithm. Machine Learning, 65:31–78, 2006.
 Uhler et al. (2013) C. Uhler, G. Raskutti, P. Bühlmann, and B. Yu. Geometry of faithfulness assumption in causal inference. Annals of Statistics, 41:436–463, 2013.
 VanderWeele and Robins (2007) T. J. VanderWeele and J. M. Robins. Four types of effect modification: A classification based on directed acyclic graphs. Epidemiology, 18:561–568, 2007.
 Verma and Pearl (1990) T. S. Verma and J. Pearl. Equivalence and synthesis of causal models. In Sixth Conference on Uncertainty in Artificial Intelligence, pages 220–227, 1990.