
# Reversible MCMC on Markov equivalence classes of sparse directed acyclic graphs

Y. He and J. Jia
School of Mathematical Sciences
and Center of Statistical Sciences
Peking University
Beijing 100871
China

B. Yu
Department of Statistics
University of California
Berkeley, California 94720
USA
Received September 2012; revised April 2013.
###### Abstract

Graphical models are popular statistical tools which are used to represent dependent or causal complex systems. Statistically equivalent causal or directed graphical models are said to belong to a Markov equivalence class. It is of great interest to describe and understand the space of such classes. However, with currently known algorithms, sampling over such classes is only feasible for graphs with fewer than approximately 20 vertices. In this paper, we design reversible irreducible Markov chains on the space of Markov equivalence classes by proposing a perfect set of operators that determine the transitions of the Markov chain. The stationary distribution of a proposed Markov chain has a closed form and can be computed easily. Specifically, we construct a concrete perfect set of operators on sparse Markov equivalence classes by introducing appropriate conditions on each possible operator. Algorithms and their accelerated versions are provided to efficiently generate Markov chains and to explore properties of Markov equivalence classes of sparse directed acyclic graphs (DAGs) with thousands of vertices. We find experimentally that in most Markov equivalence classes of sparse DAGs, (1) most edges are directed, (2) most undirected subgraphs are small and (3) the number of these undirected subgraphs grows approximately linearly with the number of vertices.

The Annals of Statistics, Vol. 41, No. 1 (2013), 1742–1779. DOI: 10.1214/13-AOS1125.


Supported in part by NSFC (11101008, 11101005, 71271211), 973 Program-2007CB814905, DPHEC-20110001120113, US NSF Grants DMS-11-07000, DMS-09-07632, DMS-06-05165, DMS-12-28246, SES-0835531 (CDI), US ARO grant W911NF-11-1-0114 and the Center for Science of Information (CSoI), a US NSF Science and Technology Center, under Grant agreement CCF-0939370. This research was also supported by the School of Mathematical Sciences, the Center of Statistical Sciences, the Key Lab of Mathematical Economics and Quantitative Finance (Ministry of Education), the Key Lab of Mathematics and Applied Mathematics (Ministry of Education), and the Microsoft Joint Lab on Statistics and Information Technology at Peking University.

Yangbo He (heyb@math.pku.edu.cn) and Jinzhu Jia (jzjia@math.pku.edu.cn), Peking University; Bin Yu (binyu@stat.berkeley.edu), University of California, Berkeley.

AMS subject classifications: 62H05, 60J10, 05C81. Keywords: sparse graphical model, reversible Markov chain, Markov equivalence class, causal inference.

## 1 Introduction

Graphical models based on directed acyclic graphs (DAGs) are widely used to represent causal or dependent relationships in various scientific investigations, such as bioinformatics, epidemiology, sociology and business finegold2011robust (), Friedman (), heckerman1999bayesian (), jansen2003bayesian (), maathuis2009estimating (), pearl2000causality (), spirtes2001causation (). A DAG encodes the independence and conditional independence restrictions of variables. However, because different DAGs can encode the same set of independencies or conditional independencies, most of the time we cannot distinguish DAGs via observational data pearl1988probabilistic (). A Markov equivalence class is used to represent all DAGs that encode the same dependencies and independencies andersson1997characterization (), chickering2002learning (), pearl1991theory (). A Markov equivalence class can be visualized (or modeled) and uniquely represented by a completed partial directed acyclic graph (completed PDAG for short) chickering2002learning (), which possibly contains both directed edges and undirected edges lauritzen2002chain (). There exists a one-to-one correspondence between completed PDAGs and Markov equivalence classes andersson1997characterization (). Completed PDAGs are also called essential graphs by Andersson et al. andersson1997characterization () and maximally oriented graphs by Meek meek1995causal ().

A set of completed PDAGs can be used as a model space. The modeling task is to discover a proper Markov equivalence class in the model space castelo2004learning (), chickering1995learning (), CY1999 (), dash1999hybrid (), heckerman1995learning (), madigan1996bayesian (). Understanding the set of Markov equivalence classes is important and useful for statistical causal modeling gillispie2006formulas (), gillispie2002size (), Kalisch (). For example, if the number of DAGs is large for Markov equivalence classes in the model space, searching based on unique completed PDAGs could be substantially more efficient than searching based on DAGs chickering2002learning (), munteanu2001eq (), madigan1996bayesian (). Moreover, if most completed PDAGs in the model space have many undirected edges (with nonidentifiable directions), many interventions might be needed to identify the causal directions eberhardt2007interventions (), he2008active ().

Because the number of Markov equivalence classes increases super-exponentially with the number of vertices gillispie2002size (), it is hard to study sets of Markov equivalence classes exhaustively. To our knowledge, only completed PDAGs with a small given number of vertices (10 or fewer) have been studied thoroughly in the literature gillispie2006formulas (), gillispie2002size (), pena2007approximate (). Moreover, these studies focus on the size of Markov equivalence classes, which is defined as the number of DAGs in a Markov equivalence class. Gillispie and Perlman gillispie2002size () obtain the true size distribution of all Markov equivalence classes with a given number (10 or fewer) of vertices by listing all classes. Peña pena2007approximate () designs a Markov chain to estimate the proportion of the equivalence classes containing only one DAG for graphs with 20 or fewer vertices.

In recent years, sparse graphical models have become popular tools for fitting high-dimensional multivariate data. The sparsity assumption introduces restrictions on the model space; a standard restriction is that the number of edges in the graph be less than a small multiple of the number of vertices. It is thus both interesting and important to be able to explore the properties of subsets of graphical models, especially with sparsity constraints on the edges.

In this paper, we propose a reversible irreducible Markov chain on Markov equivalence classes. We first introduce a perfect set of operators that determine the transitions of the chain. Then we obtain the stationary distribution of the chain by counting (or estimating) all possible transitions for each state of the chain. Finally, based on the stationary distribution of the chain (or estimated stationary distribution), we reweight the samples from the chain. Hence these reweighted samples can be seen as uniformly (or approximately uniformly) generated from the Markov equivalence classes of interest. Our proposal allows the study of properties of the sets that contain sparse Markov equivalence classes in a computationally efficient manner for sparse graphs with thousands of vertices.

### 1.1 A Markov equivalence class and its representation

In this section, we give a short overview of the representations of a Markov equivalence class.

A graph $G$ is defined as a pair $(V,E)$, where $V$ denotes the vertex set with $p$ variables, and $E$ denotes the edge set. Let $n_G$ be the number of edges in $G$. A directed edge from $u$ to $v$ is denoted as $u\rightarrow v$, and an undirected edge as $u-v$. A graph is directed (undirected) if all of its edges are directed (undirected). A sequence of distinct vertices $\langle v_1,\ldots,v_k\rangle$ is called a path from $v_1$ to $v_k$ if either $v_i\rightarrow v_{i+1}$ or $v_i-v_{i+1}$ is in $E$ for all $i=1,\ldots,k-1$. A path is partially directed if at least one edge in it is directed. A path is directed (undirected) if all edges are directed (undirected). A cycle is a path from a vertex to itself.

A directed acyclic graph (DAG) is a directed graph which does not contain any directed cycle. Let $A$ be a subset of $V$. The subgraph induced by the subset $A$ has vertex set $A$ and edge set $E_A$, the subset of $E$ which contains the edges with both vertices in $A$. An induced subgraph $u\rightarrow v\leftarrow w$ is called a v-structure if there is no edge between $u$ and $w$. A partially directed acyclic graph (PDAG) is a graph with no directed cycle.

A graphical model consists of a DAG and a joint probability distribution. In general, the conditional independencies implied by the joint probability distribution can be read from the DAG. A Markov equivalence class (MEC) is a set of DAGs that encode the same set of independencies or conditional independencies. The skeleton of an arbitrary graph $G$ is the undirected graph with the same vertices and edges as $G$, regardless of their directions. Verma and Pearl verma1990equivalence () proved the following characterization of Markov equivalence classes:

###### Lemma 1 ((Verma and Pearl verma1990equivalence ()))

Two DAGs are Markov equivalent if and only if they have the same skeleton and the same v-structures.
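Lemma 1 yields a direct computational test for Markov equivalence. The sketch below (using an illustrative edge-set representation, not notation from the paper) compares skeletons and v-structures:

```python
# A DAG is a set of directed edges (a, b), meaning a -> b.

def skeleton(dag):
    """Undirected version of the edge set: {frozenset({a, b}), ...}."""
    return {frozenset(e) for e in dag}

def v_structures(dag):
    """All triples (a, b, c) with a -> b <- c and a, c nonadjacent."""
    skel = skeleton(dag)
    parents = {}
    for a, b in dag:
        parents.setdefault(b, set()).add(a)
    vs = set()
    for b, pa in parents.items():
        for a in pa:
            for c in pa:
                if a < c and frozenset({a, c}) not in skel:
                    vs.add((a, b, c))
    return vs

def markov_equivalent(d1, d2):
    """Lemma 1: same skeleton and same v-structures."""
    return skeleton(d1) == skeleton(d2) and v_structures(d1) == v_structures(d2)

d1 = {("x", "y"), ("z", "y")}   # contains the v-structure x -> y <- z
d2 = {("y", "x"), ("z", "y")}   # same skeleton, but no v-structure
d3 = {("a", "b")}
d4 = {("b", "a")}
print(markov_equivalent(d1, d2))   # False: v-structures differ
print(markov_equivalent(d3, d4))   # True: single edge, direction not identifiable
```

The last pair illustrates why a single edge's direction cannot be learned from observational data alone.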

This lemma implies that, among DAGs in an equivalence class, some edge orientations may vary, while others will be preserved (e.g., those involved in a v-structure). Consequently, a Markov equivalence class can be represented uniquely by a completed PDAG, defined as follows:

{definition}

[(Completed PDAG chickering2002learning ())] The completed PDAG of a DAG $D$, denoted as $C$, is a PDAG that has the same skeleton as $D$, and an edge is directed in $C$ if and only if it has the same orientation in every DAG equivalent to $D$. According to Definition 1.1 and Lemma 1, a completed PDAG of a DAG $D$ has the same skeleton as $D$, and it keeps at least the directed edges that occur in the v-structures of $D$. Another popular name for a completed PDAG is “essential graph,” introduced by Andersson et al. andersson1997characterization (), who introduce four necessary and sufficient conditions for a graph to be an essential graph; see them in Lemma 2, Appendix .1. One of the conditions shows that all directed edges in a completed PDAG must be “strongly protected,” defined as follows: {definition} Let $G$ be a graph. A directed edge $u\rightarrow v$ is strongly protected in $G$ if $u\rightarrow v$ occurs in at least one of the four induced subgraphs of $G$ in Figure 1.
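Figure 1 itself is not reproduced in this text. As a hedged illustration, the following sketch tests strong protection using the four configurations as they are commonly stated for essential graphs (following Andersson et al.); the graph representation and the exact configurations are my assumptions, not copied from the figure:

```python
# directed: set of (u, v) meaning u -> v; undirected: set of frozensets {u, v}.
# Configurations below are the standard statement; an assumption, not Figure 1 itself.

def strongly_protected(a, b, directed, undirected):
    """Is the directed edge a -> b strongly protected?"""
    def adj(u, v):
        return ((u, v) in directed or (v, u) in directed
                or frozenset({u, v}) in undirected)
    vertices = {x for e in directed for x in e} | {x for e in undirected for x in e}
    others = vertices - {a, b}
    # (a) c -> a -> b with c, b nonadjacent
    if any((c, a) in directed and not adj(c, b) for c in others):
        return True
    # (b) a -> b <- c with a, c nonadjacent (a v-structure)
    if any((c, b) in directed and not adj(a, c) for c in others):
        return True
    # (c) a - c and c -> b, together with a -> b
    if any(frozenset({a, c}) in undirected and (c, b) in directed for c in others):
        return True
    # (d) a - c1, a - c2, c1 -> b, c2 -> b, with c1, c2 nonadjacent
    cs = [c for c in others
          if frozenset({a, c}) in undirected and (c, b) in directed]
    return any(not adj(c1, c2) for c1 in cs for c2 in cs if c1 != c2)

print(strongly_protected("a", "b", {("a", "b"), ("c", "b")}, set()))  # True: case (b)
print(strongly_protected("a", "b", {("a", "b")}, set()))              # False
```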

If we delete all directed edges from a completed PDAG, we are left with several isolated undirected subgraphs. Each isolated undirected subgraph is a chain component of the completed PDAG. Observational data are not sufficient to learn the directions of undirected edges of a completed PDAG; one must perform additional intervention experiments. In general, the size of a chain component is a measure of the “complexity” of causal learning: the larger the chain components are, the more interventions will be necessary to learn the underlying causal graph he2008active ().
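The chain components described above can be extracted by dropping all directed edges and computing the connected components of what remains; a minimal sketch (vertex names are illustrative):

```python
from collections import defaultdict

def chain_components(vertices, undirected):
    """Connected components of the undirected part of a completed PDAG."""
    nbrs = defaultdict(set)
    for e in undirected:
        u, v = tuple(e)
        nbrs[u].add(v)
        nbrs[v].add(u)
    seen, comps = set(), []
    for s in vertices:
        if s in seen:
            continue
        stack, comp = [s], set()
        while stack:                    # iterative depth-first search
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(nbrs[u] - comp)
        seen |= comp
        comps.append(comp)
    return comps

# Completed PDAG with directed edges a -> c, b -> c and one undirected edge d - e:
comps = chain_components({"a", "b", "c", "d", "e"}, {frozenset({"d", "e"})})
print(sorted(sorted(c) for c in comps))   # [['a'], ['b'], ['c'], ['d', 'e']]
```

Here only the component {d, e} would require an intervention to orient its edge.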

In learning graphical models chickering2002learning () or studying Markov equivalence classes pena2007approximate (), Markov chains on completed PDAGs play an important role. We briefly introduce the existing methods to construct Markov chains on completed PDAGs in the next subsection.

### 1.2 Markov chains on completed PDAGs

To construct a Markov chain on completed PDAGs, we need to generate the transitions among them. In general, an operator that can modify the initial completed PDAG locally can be used to carry out a transition perlman2001graphical (), chickering2002learning (), munteanu2001eq (), pena2007approximate (). Let $C$ be a completed PDAG. We consider six types of operators on $C$: inserting an undirected edge (denoted by InsertU), deleting an undirected edge (DeleteU), inserting a directed edge (InsertD), deleting a directed edge (DeleteD), making a v-structure (MakeV) and removing a v-structure (RemoveV). We call InsertU, DeleteU, InsertD, DeleteD, MakeV and RemoveV the types of operators. An operator on a given completed PDAG is determined by two parts: its type and the modified edges. For example, the operator “InsertU $x-y$” on $C$ represents inserting an undirected edge $x-y$ into $C$, and $x-y$ is the modified edge of the operator. The modified graph of an operator is the same as the initial completed PDAG, except for the modified edges of the operator. A modified graph may or may not be a completed PDAG; see Example 1 in Section 2.1 of the Supplementary Material hesupp ().

Madigan et al. madigan1996bayesian (), Perlman perlman2001graphical () and Peña pena2007approximate () introduce several Markov chains based on the modified graphs of operators. At each state of these Markov chains, say $C$, they move to the modified graph of an operator on $C$ only when the modified graph happens to be a completed PDAG; otherwise, they stay at $C$. In order to move to new completed PDAGs, Madigan et al. madigan1996bayesian () search for the operators whose modified graphs are completed PDAGs by checking Andersson’s conditions andersson1997characterization () one by one. Perlman perlman2001graphical () introduces an alternative search approach that is more efficient by “exploiting further” Andersson’s conditions.

Even when the modified graph of an operator on $C$ is not a completed PDAG, the operator might still result in a transition from one completed PDAG to another, that is, in a “valid” transition. To obtain valid transitions, Chickering chickering2002learning (), chickering2003optimal () introduces the concept of validity for an operator on $C$. Before defining “valid operator,” we need the concept of a consistent extension. A consistent extension of a PDAG is a directed acyclic graph (DAG) on the same underlying set of edges, with the same orientations on the directed edges of the PDAG and the same set of v-structures dor1992simple (), verma1992algorithm (). According to Lemma 1, all consistent extensions of a PDAG, if they exist, belong to a unique Markov equivalence class. Hence if the modified graph of an operator is a PDAG and has a consistent extension, it results in a completed PDAG that corresponds to a unique Markov equivalence class. We call it the resulting completed PDAG of the operator. A valid operator is now defined as below.

{definition}

[(Valid operator)] An operator on $C$ is valid if (1) the modified graph of the operator is a PDAG and has a consistent extension, and (2) all modified edges in the modified graph occur in the resulting completed PDAG of the operator.

The first condition in Definition 1.2 guarantees that a valid operator results in a completed PDAG. The second condition guarantees that the valid operator is “effective;” that is, the change brought about by the operator occurs in the resulting completed PDAG. Here we notice that the second condition is implied by the context in Chickering chickering2002learning (). Below we briefly introduce how to obtain the resulting completed PDAG of a valid operator from the modified graph.

Verma and Pearl verma1992algorithm () and Meek meek1995causal () introduce an algorithm for finding the completed PDAG from a “pattern” (a given skeleton and set of v-structures). This method can be used to create the completed PDAG from a DAG or a PDAG. They first undirect every edge, except for those edges that participate in a v-structure. Then they choose one of the undirected edges and direct it if the corresponding directed edge is strongly protected, as shown in Figure 1(a), (c) or (d). The algorithm terminates when no remaining undirected edge can be directed.

Chickering chickering2002learning () proposes an alternative approach to obtain the completed PDAG of a valid operator from its modified graph; see Example 2, Section 2.1 of the Supplementary Material hesupp (). The method includes two steps. The first step generates a consistent extension (a DAG) of the modified graph (a PDAG) using the algorithm described in Dor and Tarsi dor1992simple (). The second step creates a completed PDAG corresponding to the consistent extension chickering1995transformational (), chickering2002learning (). We describe Dor and Tarsi’s algorithm and Chickering’s algorithms in Section 1 of the Supplementary Material hesupp ().
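The first of these two steps, Dor and Tarsi's consistent-extension algorithm, can be sketched as follows under my reading of it: repeatedly pick a vertex x that has no outgoing directed edge and whose undirected neighbors are adjacent to every other vertex adjacent to x, orient x's undirected edges into x, and remove x. This is a hedged illustration, not the paper's implementation:

```python
def consistent_extension(vertices, directed, undirected):
    """Return a DAG (edge set) extending the PDAG, or None if none exists."""
    V = set(vertices)
    D = set(directed)                          # (u, v) means u -> v
    U = {frozenset(e) for e in undirected}
    result = set(directed)

    def adj(x):
        return ({v for (u, v) in D if u == x} |
                {u for (u, v) in D if v == x} |
                {next(iter(e - {x})) for e in U if x in e})

    while V:
        for x in list(V):
            if any(u == x for (u, v) in D):    # x must be a sink in the directed part
                continue
            und = {next(iter(e - {x})) for e in U if x in e}
            # every undirected neighbor of x must be adjacent to all
            # other vertices adjacent to x (so no new v-structure appears)
            if not all((adj(x) - {y}) <= adj(y) for y in und):
                continue
            for y in und:                      # orient y - x as y -> x
                result.add((y, x))
            D = {(u, v) for (u, v) in D if x not in (u, v)}
            U = {e for e in U if x not in e}
            V.remove(x)
            break
        else:
            return None                        # no consistent extension exists
    return result

# a -> b with b - c extends to a -> b -> c (no new v-structure):
print(consistent_extension({"a", "b", "c"}, {("a", "b")}, {("b", "c")}))
# An undirected 4-cycle has no consistent extension:
square = {frozenset(p) for p in [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]}
print(consistent_extension({"a", "b", "c", "d"}, set(), square))   # None
```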

The approach proposed by Chickering chickering1995transformational (), chickering2002learning () is “more complicated but more efficient” meek1995causal () than Meek’s method described above. Hence when constructing a Markov chain, we use Chickering’s approach to obtain the resulting completed PDAG of a given valid operator from its modified graph.

With a set of valid operators, a Markov chain on completed PDAGs can be constructed. Let $S_p$ be the set of all completed PDAGs with $p$ vertices, and let $S$ be a given subset of $S_p$. For any completed PDAG $C\in S$, let $O_C$ be a set of valid operators of interest, to be defined later in Section 3.1. A set of valid operators on $S$ is defined as

$$O=\bigcup_{C\in S}O_C. \tag{1}$$

Here we notice that each operator in $O$ is specific to the completed PDAG that the operator applies to. A Markov chain on $S$ based on the set $O$ can be defined as follows.

{definition}

[(A Markov chain on $S$)] The Markov chain determined by a set $O$ of valid operators is generated as follows: start at an arbitrary completed PDAG, denoted as $e_0$, and repeat the following steps for $t=1,2,\ldots$: {longlist}[(2)]

At the $t$th step we are at a completed PDAG $e_{t-1}$.

We choose an operator $o$ uniformly from $O_{e_{t-1}}$; if the resulting completed PDAG $C'$ of $o$ is in $S$, move to $C'$ and set $e_t=C'$; otherwise we stay at $e_{t-1}$ and set $e_t=e_{t-1}$.
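The sampling loop above can be sketched generically; `operators`, `apply_op` and `in_S` are placeholders (here stubbed with a toy three-state space), since the real operator set is only constructed in Section 3:

```python
import random

def run_chain(e0, operators, apply_op, in_S, N, seed=0):
    """Generic form of the chain in Definition 1.2."""
    rng = random.Random(seed)
    chain = [e0]
    for _ in range(N):
        C = chain[-1]
        op = rng.choice(operators(C))       # uniform choice from O_C
        C_new = apply_op(C, op)
        chain.append(C_new if in_S(C_new) else C)   # move, or stay put
    return chain

# Toy state space {0, 1, 2} on a cycle; "operators" are +1 / -1 moves.
ops = lambda C: [+1, -1]
apply_op = lambda C, op: (C + op) % 3
chain = run_chain(0, ops, apply_op, lambda C: True, N=1000)
print(sorted(set(chain)))   # [0, 1, 2]: the chain visits every state
```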

Given the same operator set, the Markov chain in Definition 1.2 has more new transition states at any completed PDAG than chains based on the modified graphs of operators madigan1996bayesian (), perlman2001graphical (), pena2007approximate (). This is because some valid operators result in new completed PDAGs even when their modified graphs are not completed PDAGs. Consequently, the transitions generated by these operators are not contained in Markov chains based on the modified graphs.

The set $S$ is the finite state space of the chain. Clearly, the sequence of completed PDAGs in Definition 1.2 is a discrete-time Markov chain lovasz1993random (), norris1997markov (). Let $p_{CC'}$ be the one-step transition probability from $C$ to $C'$ for any two completed PDAGs $C$ and $C'$ in $S$. A Markov chain is irreducible if it can reach any completed PDAG in $S$ starting at any state in $S$. If the chain is irreducible, there exists a unique distribution $\pi$ satisfying the balance equations (see Theorems 1.7.7 and 1.5.6 in norris1997markov ())

$$\pi_C=\sum_{C'\in S}\pi_{C'}\,p_{C'C}\qquad\text{for all }C\in S. \tag{2}$$

An irreducible chain is reversible if there exists a probability distribution $\pi$ such that

$$\pi_C\,p_{CC'}=\pi_{C'}\,p_{C'C}\qquad\text{for all }C,C'\in S. \tag{3}$$

It is well known that $\pi$ is the unique stationary distribution of a discrete-time Markov chain that is finite, reversible and irreducible; see Lemma 1.9.2 in norris1997markov (). Moreover, the stationary probabilities can be calculated efficiently if the Markov chain satisfies equation (3).
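As a numerical illustration of why reversibility is convenient: for a finite chain built from symmetric weights, the detailed-balance distribution in (3) automatically satisfies the global balance equations (2). The three states and weights below are arbitrary toy values:

```python
import itertools

# Symmetric "conductances" w(C, C') define a reversible chain.
w = {(0, 1): 2.0, (1, 2): 1.0, (0, 2): 3.0}
w.update({(b, a): v for (a, b), v in w.items()})

states = [0, 1, 2]
strength = {C: sum(w.get((C, Cp), 0.0) for Cp in states) for C in states}
total = sum(strength.values())
pi = {C: strength[C] / total for C in states}   # candidate stationary distribution

def p(C, Cp):
    """One-step transition probability proportional to the edge weight."""
    return w.get((C, Cp), 0.0) / strength[C]

for C, Cp in itertools.product(states, states):
    assert abs(pi[C] * p(C, Cp) - pi[Cp] * p(Cp, C)) < 1e-12   # equation (3)
for C in states:
    assert abs(pi[C] - sum(pi[Cp] * p(Cp, C) for Cp in states)) < 1e-12   # equation (2)
print("detailed balance implies stationarity")
```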

The properties of the Markov chain given in Definition 1.2 depend on the operator set $O$. To implement score-based search over the whole set of Markov equivalence classes, Chickering chickering2002learning () introduces a set of operators of types InsertU, DeleteU, InsertD, DeleteD, MakeV and ReverseD (reversing the direction of a directed edge), subject to some validity conditions. Unfortunately, the Markov chain in Definition 1.2 is not reversible if the set of Chickering’s operators is used. Our goal is to design a reversible Markov chain, as reversibility makes it easier to compute the stationary distribution, and thereby to study the properties of a subset of Markov equivalence classes.

In Section 2, we first discuss the properties of an operator set needed to guarantee that the Markov chain is reversible. Section 2 also explains how to use the samples from the Markov chain to study properties of any given subset of Markov equivalence classes. In Section 3 we focus on studying sets of sparse Markov equivalence classes. Finally, in Section 4, we report the properties of directed edges and chain components in sparse Markov equivalence classes with up to a thousand vertices.

## 2 Reversible Markov chains on Markov equivalence classes

Let $S$ be any subset of the set $S_p$ that contains all completed PDAGs with $p$ vertices, and let $O$ be a set of operators on $S$ defined as in equation (1). As in Definition 1.2, we can obtain a Markov chain on $S$. We first discuss four properties of $O$ that guarantee that the chain is reversible and irreducible: validity, distinguishability, irreducibility and reversibility. We call a set of operators perfect if it satisfies these four properties. Then we give the stationary distribution of the chain when $O$ is perfect and show how to use the chain to study properties of $S$.

### 2.1 A reversible Markov chain based on a perfect set of operators

Let $p_{CC'}$ be the one-step transition probability from $C$ to $C'$ for any two completed PDAGs $C$ and $C'$ in $S$. In order to formulate $p_{CC'}$ explicitly, we introduce two properties of $O$: validity and distinguishability.

{definition}

[(Validity)] Given $S$ and any completed PDAG $C$ in $S$, a set $O_C$ of operators on $C$ is valid if every operator $o$ in $O_C$ is valid according to Definition 1.2 and the resulting completed PDAG obtained by applying $o$ to $C$, which is different from $C$, is also in $S$.

According to Definition 2.1, if a set $O_C$ of operators on $C$ is valid, we move to a new completed PDAG in each step of the chain, and the one-step transition probability from any completed PDAG to itself is zero:

$$p_{CC}=0\qquad\text{for any completed PDAG }C\in S. \tag{4}$$

For a set $O_C$ of valid operators on any completed PDAG $C$ in $S$, we define the resulting completed PDAGs of the operators in $O_C$ as the direct successors of $C$. For any direct successor of $C$, denoted by $C'$, we obtain $p_{CC'}$ explicitly, as in equation (5), if $O_C$ has the following property.

{definition}

[(Distinguishability)] A set $O$ of valid operators on $S$ is distinguishable if for any completed PDAG $C$ in $S$, different operators in $O_C$ result in different completed PDAGs.

If $O$ is distinguishable, for any direct successor of $C$, denoted by $C'$, there is a unique operator in $O_C$ that can transform $C$ to $C'$. Thus, the number of operators in $O_C$ is the same as the number of direct successors of $C$, and sampling operators uniformly from $O_C$ generates a uniformly random transition from $C$ to its direct successors. Denoting by $M(O_C)$ the number of operators in $O_C$, we have

$$p_{CC'}=\frac{1}{M(O_C)}\qquad\text{for any direct successor }C'\text{ of }C. \tag{5}$$

We introduce this property because it makes computation of $p_{CC'}$ efficient: if $O$ is distinguishable, we obtain $p_{CC'}$ right away from $M(O_C)$.

In order to make sure the Markov chain is irreducible and reversible, we introduce two more properties of $O$: irreducibility and reversibility.

{definition}

[(Irreducibility)] A set $O$ of operators on $S$ is irreducible if for any two completed PDAGs $C,C'\in S$, there exists a sequence of operators in $O$ such that we can obtain $C'$ from $C$ by applying these operators sequentially. If $O$ is irreducible, starting at any completed PDAG in $S$, we have positive probability of reaching any other completed PDAG via a sequence of operators in $O$. Thus, the Markov chain is irreducible.

{definition}

[(Reversibility)] A set $O$ of operators on $S$ is reversible if for any completed PDAG $C\in S$ and any operator $o\in O_C$ with $C'$ being the resulting completed PDAG of $o$, there is an operator $o'\in O_{C'}$ such that $C$ is the resulting completed PDAG of $o'$.

If the set $O$ of operators on $S$ is valid, distinguishable and reversible, then for any pair of completed PDAGs $C,C'\in S$, $C$ is a direct successor of $C'$ whenever $C'$ is a direct successor of $C$. For any $C$ and any of its direct successors $C'$, we have

$$p_{CC'}=1/M(O_C)\qquad\text{and}\qquad p_{C'C}=1/M(O_{C'}). \tag{6}$$

Let $T=\sum_{C\in S}M(O_C)$, and define a probability distribution $\pi$ as

$$\pi_C=M(O_C)/T. \tag{7}$$

Clearly, equation (3) holds for $\pi$ in equation (7) if $O$ is valid, distinguishable and reversible. Moreover, $\pi$ is the unique stationary distribution of the chain if $O$ is also irreducible AldoFill99 (), lovasz1993random (), norris1997markov ().
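Equations (6) and (7) can be checked numerically on a toy state space in which each state moves uniformly to its direct successors; the successor lists below are illustrative only:

```python
successors = {0: [1], 1: [0, 2], 2: [1]}          # direct-successor lists (toy)
M = {C: len(s) for C, s in successors.items()}    # M(O_C): one operator per successor
T = sum(M.values())
pi = {C: M[C] / T for C in M}                     # equation (7)

def p(C, Cp):
    """Equation (6): uniform over the operators in O_C."""
    return 1.0 / M[C] if Cp in successors[C] else 0.0

for C in M:                                       # detailed balance, equation (3)
    for Cp in M:
        assert abs(pi[C] * p(C, Cp) - pi[Cp] * p(Cp, C)) < 1e-12
print(pi)   # {0: 0.25, 1: 0.5, 2: 0.25}
```

States with more operators (here, state 1) carry proportionally more stationary mass, which is exactly what the reweighting in Section 2.2 corrects for.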

In the following proposition, we summarize our results about the Markov chain on , and give its stationary distribution.

###### Proposition 1 ((Stationary distribution of $\{e_t\}$))

Let $S$ be any given set of completed PDAGs. The set of operators $O$ is defined as $O=\bigcup_{C\in S}O_C$, where $O_C$ is a set of operators on $C$ for any $C$ in $S$. Let $M(O_C)$ be the number of operators in $O_C$. For the Markov chain on $S$ generated according to Definition 1.2, if $O$ is perfect, that is, the properties of validity, distinguishability, reversibility and irreducibility hold for $O$, then: {longlist}[(2)]

the Markov chain is irreducible and reversible;

the distribution $\pi$ in equation (7) is the unique stationary distribution of the chain, with $\pi_C=M(O_C)/T$.

The challenge is to construct a concrete perfect set of operators. In Section 3, we carry out such a construction for a set of Markov equivalence classes with sparsity constraints and provide algorithms to obtain a reversible Markov chain. We now show that a reversible Markov chain can be used to compute interesting properties of a completed PDAG set .

### 2.2 Estimating the properties of S by a perfect Markov chain

For any $C\in S$, let $f(C)$ be a real function describing a property of interest of $C$, and let the random variable $u$ be uniformly distributed on $S$. In order to understand the property of interest, we compute the distribution of $f(u)$.

Let us consider one example from the literature. The proportion of Markov equivalence classes of size one (equivalently, completed PDAGs that are directed) in $S$ is studied in the literature gillispie2006formulas (), gillispie2002size (), pena2007approximate (). For this purpose, we can define $f(C)$ as the size of the Markov equivalence class represented by $C$ and obtain the proportion by computing the probability of $f(u)=1$.

Let $A$ be any subset of the range of $f$; the probability of $\{f(u)\in A\}$ is

$$P\bigl(f(u)\in A\bigr)=\frac{|\{C\colon f(C)\in A, C\in S\}|}{|S|}=\frac{\sum_{C\in S}I\{f(C)\in A\}}{|S|}, \tag{8}$$

where $|\cdot|$ is the number of elements in the set and $I\{\cdot\}$ is an indicator function.

Let $e_1,\ldots,e_N$ be a realization of the Markov chain on $S$ based on a perfect operator set according to Definition 1.2, and let $M_t=M(O_{e_t})$. Let $\pi$ be the stationary distribution of the Markov chain. From Proposition 1, we have $\pi_{e_t}\propto M_t$ for $t=1,\ldots,N$. We can use the realization to estimate the probability of $\{f(u)\in A\}$ by

$$\hat P_N\bigl(f(u)\in A\bigr)=\frac{\sum_{t=1}^{N}I\{f(e_t)\in A\}M_t^{-1}}{\sum_{t=1}^{N}M_t^{-1}}. \tag{9}$$

From the ergodic theorem for Markov chains (see Theorem 1.10.2 in norris1997markov ()), we obtain Proposition 2 directly.

###### Proposition 2

Let $S$ be a given set of completed PDAGs, and assume the set $O$ of operators on $S$ is perfect. The Markov chain $\{e_t\}$ is obtained according to Definition 1.2. Then the estimator in equation (9) converges to the probability in equation (8) with probability one; that is,

$$P\bigl(\hat P_N(f(u)\in A)\rightarrow P(f(u)\in A)\text{ as }N\rightarrow\infty\bigr)=1. \tag{10}$$

Proposition 2 shows that the estimator defined in equation (9) is a consistent estimator of $P(f(u)\in A)$. We can study any given subset of Markov equivalence classes via equation (9) if we can obtain the chain $e_1,\ldots,e_N$ and the counts $M_1,\ldots,M_N$. We now turn to constructing a concrete perfect set of operators for a set of completed PDAGs with sparsity constraints, and then introduce algorithms to run a reversible Markov chain.
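A small simulation illustrates estimator (9): samples from a chain whose stationary distribution is proportional to M(O_C) are reweighted by 1/M_t, recovering the uniform target. The three-state space is illustrative only:

```python
import random

# Toy chain: from each state, one uniform move per "operator".
successors = {0: [1], 1: [0, 2], 2: [1]}
M = {C: len(s) for C, s in successors.items()}   # M_t values

rng = random.Random(1)
e, chain = 0, []
for _ in range(200_000):
    e = rng.choice(successors[e])                # stationary dist. proportional to M
    chain.append(e)

# Estimator (9) with f the identity and A = {0}: the reweighted chain
# targets the uniform distribution on {0, 1, 2}, so the answer is near 1/3,
# even though the raw chain visits state 1 about half the time.
A = {0}
num = sum((s in A) / M[s] for s in chain)
den = sum(1.0 / M[s] for s in chain)
print(round(num / den, 2))   # approximately 0.33
```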

## 3 A Reversible Markov chain on completed PDAGs with sparsity constraints

We define the set of Markov equivalence classes with $p$ vertices and at most $n$ edges as follows:

$$S_p^n=\{C\colon C\text{ is a completed PDAG with }p\text{ vertices and }n_C\le n\}, \tag{11}$$

where $n_C$ is the number of edges in $C$. Recall that $S_p$ denotes the set of all completed PDAGs with $p$ vertices. Clearly, $S_p^n=S_p$ when $n\ge p(p-1)/2$.

We now construct a perfect set of operators on $S_p^n$. Notice that our construction can be extended to adapt to some other sets of completed PDAGs, say, a set of completed PDAGs with a given maximum degree. In Section 3.1, we construct the perfect set of operators for any completed PDAG in $S_p^n$. In Section 3.2, we propose algorithms, and their accelerated version, for efficiently obtaining a Markov chain based on the perfect set of operators.

### 3.1 Construction of a perfect set of operators on Snp

In order to construct a perfect set of operators, we need to define the set of operators on each completed PDAG in $S_p^n$. Let $C$ be a completed PDAG in $S_p^n$. We consider the six types of operators on $C$ that were introduced in Section 1.2: InsertU, DeleteU, InsertD, DeleteD, MakeV and RemoveV. The operators on $C$ with the same type but different modified edges constitute a set of operators. We introduce six sets of operators on $C$, denoted by $\mathrm{InsertU}_C$, $\mathrm{DeleteU}_C$, $\mathrm{InsertD}_C$, $\mathrm{DeleteD}_C$, $\mathrm{MakeV}_C$ and $\mathrm{RemoveV}_C$, in Definition 3.1. In addition to the conditions that guarantee validity, for each type of operator we also introduce further constraints to make sure that all operators are reversible.

First we explain some notation used in Definition 3.1. Let $x$ and $y$ be any two distinct vertices in $C$. The neighbor set of $x$, denoted by $N_x$, consists of every vertex $u$ with $u-x$ in $C$. The common neighbor set of $x$ and $y$ is defined as $N_{xy}=N_x\cap N_y$. $x$ is a parent of $y$ and $y$ is a child of $x$ if $x\rightarrow y$ occurs in $C$. A vertex $u$ is a common child of $x$ and $y$ if $u$ is a child of both $x$ and $y$. $\mathrm{Pa}_x$ represents the set of all parents of $x$.
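The notation above translates directly into set operations; a small helper sketch (vertex and edge names are illustrative):

```python
# directed: set of (u, v) meaning u -> v; undirected: set of frozensets {u, v}.

def neighbors(x, undirected):
    """N_x: vertices joined to x by an undirected edge."""
    return {next(iter(e - {x})) for e in undirected if x in e}

def common_neighbors(x, y, undirected):
    """N_xy = N_x intersected with N_y."""
    return neighbors(x, undirected) & neighbors(y, undirected)

def parents(x, directed):
    """Pa_x: all u with u -> x."""
    return {u for (u, v) in directed if v == x}

def common_children(x, y, directed):
    """All u that are a child of both x and y."""
    kids = lambda z: {v for (u, v) in directed if u == z}
    return kids(x) & kids(y)

directed = {("x", "c"), ("y", "c"), ("a", "x")}
undirected = {frozenset({"x", "z"}), frozenset({"y", "z"})}
print(common_neighbors("x", "y", undirected))   # {'z'}
print(common_children("x", "y", directed))      # {'c'}
print(parents("x", directed))                   # {'a'}
```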

{definition}

[(Six sets of operators on $C$)] Let $C$ be a completed PDAG in $S_p^n$ and $n_C$ be the number of edges in $C$. We introduce six sets of operators on $C$: $\mathrm{InsertU}_C$, $\mathrm{DeleteU}_C$, $\mathrm{InsertD}_C$, $\mathrm{DeleteD}_C$, $\mathrm{MakeV}_C$ and $\mathrm{RemoveV}_C$, as follows.

{longlist}

[(d)]

For any two vertices $x$ and $y$ that are not adjacent in $C$, the operator “InsertU $x-y$” on $C$ is in $\mathrm{InsertU}_C$ if and only if $n_C<n$; “InsertU $x-y$” is valid; and for any $u$ that is a common child of $x$ and $y$ in $C$, both $x\rightarrow u$ and $y\rightarrow u$ occur in the resulting completed PDAG of “InsertU $x-y$.”

For any undirected edge $x-y$ in $C$, the operator “DeleteU $x-y$” on $C$ is in $\mathrm{DeleteU}_C$ if and only if “DeleteU $x-y$” is valid.

For any two vertices $x$ and $y$ that are not adjacent in $C$, the operator “InsertD $x\rightarrow y$” on $C$ is in $\mathrm{InsertD}_C$ if and only if $n_C<n$; “InsertD $x\rightarrow y$” is valid; and for any $u$ that is a common child of $x$ and $y$ in $C$, $y\rightarrow u$ occurs in the resulting completed PDAG of “InsertD $x\rightarrow y$.”

For any directed edge $x\rightarrow y$ in $C$, the operator “DeleteD $x\rightarrow y$” on $C$ is in $\mathrm{DeleteD}_C$ if and only if “DeleteD $x\rightarrow y$” is valid; and for any $u$ that is a parent of $y$ but not a parent of $x$, the directed edge $u\rightarrow y$ in $C$ occurs in the resulting completed PDAG of “DeleteD $x\rightarrow y$.”

For any induced subgraph $x-z-y$ in $C$, the operator “MakeV $x\rightarrow z\leftarrow y$” on $C$ is in $\mathrm{MakeV}_C$ if and only if “MakeV $x\rightarrow z\leftarrow y$” is valid.

For any v-structure $x\rightarrow z\leftarrow y$ of $C$, the operator “RemoveV $x\rightarrow z\leftarrow y$” on $C$ is in $\mathrm{RemoveV}_C$ if and only if …; …; and every undirected path between $x$ and $y$ contains a vertex in $N_{xy}$.

Munteanu and Bendou munteanu2001eq () discuss the constraints for the first five types of operators such that each one can transform one completed PDAG to another. Chickering chickering2002learning () introduces the necessary and sufficient conditions such that these five types of operators are valid. We list the conditions introduced by Chickering chickering2002learning () in Lemma 3, Appendix .1, and employ them to guarantee that the conditions iu, du, id, dd and mv in Definition 3.1 hold.

The set of operators on $C$, denoted by $O_C$, is defined as follows:

$$O_C=\mathrm{InsertU}_C\cup\mathrm{DeleteU}_C\cup\mathrm{InsertD}_C\cup\mathrm{DeleteD}_C\cup\mathrm{MakeV}_C\cup\mathrm{RemoveV}_C. \tag{12}$$

Taking the union over all completed PDAGs in $S_p^n$, we define the set of operators on $S_p^n$ as

$$O=\bigcup_{C\in S_p^n}O_C, \tag{13}$$

where $O_C$ is the set of operators defined above. In the main result of this paper, we show that $O$ in equation (13) is a perfect set of operators on $S_p^n$.

###### Theorem 1 ((A perfect set of operators on $S_p^n$))

The set $O$ defined in equation (13) is a perfect set of operators on $S_p^n$.

Here we notice that the conditions iu, id and dd in Definition 3.1 are the key conditions guaranteeing that $O$ is reversible. Without these three conditions, there are operators that are not reversible; see Example 3, Section 2.1 in the Supplementary Material hesupp (). We provide a proof of Theorem 1 in Appendix .2.

The preceding construction gives a perfect set of operators; a toy example is provided as Example 4 in Section 2.1 of the Supplementary Material hesupp (). Based on the perfect set of operators, we can obtain a finite, irreducible, reversible discrete-time chain. In the next subsection, we provide detailed algorithms for obtaining a Markov chain on $S_p^n$, together with an accelerated version.

### 3.2 Algorithms

In this subsection, we provide detailed algorithms to generate a Markov chain on $S_p^n$, defined in Definition 1.2, based on the perfect set of operators defined in (13). A sketch of Algorithm 1 is shown below; some steps of this algorithm are further explained in the subsequent algorithms.

Step A of Algorithm 1 constructs the sets of operators on the completed PDAGs in the chain . It is the most difficult step and dominates the time complexity of Algorithm 1. Steps B and C can be implemented easily once is obtained. Step D can be implemented via Chickering’s method chickering2002learning (), mentioned in Section 1.2. We will show that the time complexity of obtaining a Markov chain of length N on S_p^n is approximately cubic in p when n is of the same order as p. For large p, we also provide an accelerated version that, in some cases, runs hundreds of times faster.

The rest of this section is arranged as follows. In Section 3.2.1, we first introduce the algorithms to implement Step A. In Section 3.2.2 we discuss the time complexity of our algorithm, and provide an acceleration method to speed up Algorithm 1.

#### 3.2.1 Implementation of Step A in Algorithm 1

A detailed implementation of Step A (to construct ) is described in Algorithm 2. To construct in Algorithm 2, we go through all possible operators on and choose those satisfying the corresponding conditions in Definition 3.1.

The conditions in Algorithm 2 include iu1, iu2, iu3, du, id1, id2, id3, dd1, dd2, rm, rv1, rv2 and mv. For each possible operator, we check the corresponding conditions shown in Algorithm 2 one by one until one of them fails. Below, we introduce how to check these conditions.


The conditions iu, id and dd in Algorithm 2 depend on both and the resulting completed PDAGs of the operators, so checking iu, id or dd seemingly requires obtaining the corresponding resulting completed PDAGs. The time complexity of obtaining the resulting completed PDAG of an operator is  dor1992simple (), chickering2002learning (), where is the number of edges in . To avoid generating the resulting completed PDAGs, we provide three algorithms in the Supplementary Material hesupp () that check iu, id and dd efficiently based only on and .

The other conditions can be tested via classical graph algorithms. These tests include: (1) whether two vertex sets are equal, (2) whether a subgraph is a clique and (3) whether all partially directed paths or all undirected paths between two vertices contain at least one vertex in a given set. Checking the first two types of conditions is trivial and very efficient because the sets involved are small for most completed PDAGs in S_p^n when n is of the same order as p. To check conditions of the third type, we just need to check whether there is a partially directed or undirected path between the two given vertices that avoids every vertex in the given set. We check this using a depth-first search from the source vertex; when looking for an undirected path, we can restrict the search to the chain component that contains both the source and the destination vertices.
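The path check of the third type can be sketched with a simple depth-first search; the following minimal illustration is our own (the function and variable names are hypothetical, not taken from the paper's algorithms). The condition "every path between the two vertices contains a vertex in the given set" holds exactly when no path avoiding that set exists:

```python
def path_avoiding(neighbors, src, dst, blocked):
    """Depth-first search for a path from src to dst that avoids every
    vertex in `blocked`.  `neighbors` maps a vertex to the vertices
    reachable in one step (undirected or partially directed edges,
    depending on which condition is being checked)."""
    if src in blocked or dst in blocked:
        return False
    stack, seen = [src], {src}
    while stack:
        v = stack.pop()
        if v == dst:
            return True
        for w in neighbors.get(v, ()):
            if w not in seen and w not in blocked:
                seen.add(w)
                stack.append(w)
    return False

# Toy graph with edges 1-2, 1-3, 2-4, 3-4:
nbrs = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3]}
assert path_avoiding(nbrs, 1, 4, blocked={2}) is True     # path 1-3-4 survives
assert path_avoiding(nbrs, 1, 4, blocked={2, 3}) is False  # every path blocked
```

Restricting the search to one chain component simply means passing that component's adjacency structure as `neighbors`.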

#### 3.2.2 Time complexity of Algorithm 1 and an accelerated version

We now discuss the time complexity of Algorithm 1. For , let and be the numbers of vertices and edges in , respectively, let be the number of v-structures in , and let be the number of undirected v-structures (subgraphs with and nonadjacent) in . To construct in Step A of Algorithm 1 (equivalently, Algorithm 2), the possible operators we need to go through are: deleting operators (DeleteU and DeleteD), inserting operators (InsertU and InsertD) when the number of edges in is less than , RemoveV operators and MakeV operators. There are at most possible operators for . Among all conditions in Algorithm 2, the most time-consuming one, which takes time  chickering2002learning (), is the search for a path via depth-first search for an operator of type InsertD. The time complexity of constructing in Algorithm 2 is therefore in the worst case, and the time complexity of Algorithm 1 is in the worst case, where N is the length of the Markov chain in Algorithm 1. Both and reach their maxima when is an evenly divided complete bipartite graph gillispie2002size (), so the time complexity of Algorithm 1 is in the worst case. Fortunately, when n is a few times p, say , all completed PDAGs in are sparse, and our experiments show that and are much smaller than for most completed PDAGs in the Markov chain . Hence the time complexity of Algorithm 1 is approximately on average when n is a few times p.

We can implement Algorithm 1 efficiently when p is not large (less than or around 100 in our experiments). However, when p is larger, we need a large N to guarantee that the estimates converge; the experiments in Section 4 show that is suitable. In this case, the cubic complexity () of Algorithm 1 is unacceptable, and we need to speed up the algorithm for very large p.

Notice that in Algorithm 1 we obtain an irreducible and reversible Markov chain and a sequence of numbers by checking all possible operators on each . This sequence is used to compute the stationary probabilities of according to Proposition 1. We now introduce an accelerated version of Algorithm 1 that generates irreducible and reversible Markov chains on . The basic idea is to check not all possible operators but only a random sample of them; these random samples are then used to estimate .

We first explain some notation used in the accelerated version. For each completed PDAG , if , let be the set of all possible operators on of the types InsertU, DeleteU, InsertD, DeleteD, MakeV and RemoveV. If , the number of edges in has reached the upper bound n, so no more edges can be inserted into ; in this case, let be the set of operators obtained by removing the operators of types InsertU and InsertD from ; this is the set of all possible operators on when . We can obtain and easily via the possible modified edges introduced in Algorithm 2. The accelerated version of Algorithm 1 is shown in Algorithm 3.
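The core sampling step of the accelerated scheme can be sketched as follows. This is a minimal illustration under our own naming assumptions: `ops`, `is_perfect` and the operator representation are hypothetical placeholders, not the paper's actual data structures.

```python
import math
import random

def accelerated_step(ops, is_perfect, alpha, rng=random):
    """One step of the accelerated scheme: from the m_t candidate operators
    we check only ceil(alpha * m_t) of them, sampled without replacement,
    keep the perfect ones, scale the count up to estimate the total number
    of perfect operators M_t, and choose the next operator uniformly from
    the perfect operators found."""
    m_t = len(ops)
    k = math.ceil(alpha * m_t)
    checked = rng.sample(ops, k)                   # sample without replacement
    perfect = [op for op in checked if is_perfect(op)]
    M_hat = m_t * len(perfect) / k                 # scaled-up estimate of M_t
    chosen = rng.choice(perfect) if perfect else None  # uniform over perfect ops
    return chosen, M_hat
```

With `alpha = 1` every operator is checked, so `M_hat` is exact and the step coincides with the exhaustive enumeration of Algorithm 1.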

In Algorithm 3, (either or ) is the set of all possible operators on , is an acceleration parameter that determines how many operators in are checked, is the set of checked operators randomly sampled without replacement from , and is the set of all perfect operators in . When , , and Algorithm 3 reduces to Algorithm 1.

In Algorithm 3, the operators in are randomly sampled from in Step A and the operator is chosen uniformly from in Step C; consequently, is also chosen uniformly from . The following Corollary 1 therefore holds according to Proposition 1.

###### Corollary 1 ((Stationary distribution of {et} on Snp))

Let , defined in equation (11), be the set of completed PDAGs with p vertices and at most n edges, let , defined in equation (3.1), be the set of operators on , and let be the number of operators in . For the Markov chain on obtained via Algorithm 1 or Algorithm 3: {longlist}[(2)]

the Markov chain is irreducible and reversible;

the Markov chain has a unique stationary distribution and .
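The role of the operator counts in the stationary distribution can be illustrated with a standard toy analogue (our own example, not the paper's chain): a random walk that moves to a uniformly chosen neighbour of the current vertex has stationary probability proportional to the vertex degree, just as the operator-based chain weights each completed PDAG by its number of operators; reweighting visits by the reciprocal count then recovers the uniform distribution.

```python
import numpy as np

# Undirected toy graph on 4 vertices; edges 0-1, 0-2, 1-2, 1-3, 2-3.
adj = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
], dtype=float)
P = adj / adj.sum(axis=1, keepdims=True)  # move uniformly over neighbours

pi = np.full(4, 0.25)
for _ in range(200):                      # power iteration to stationarity
    pi = pi @ P

deg = adj.sum(axis=1)                     # degrees 2, 3, 3, 2
expected = deg / deg.sum()                # stationary law proportional to degree
assert np.allclose(pi, expected, atol=1e-8)

# Reweighting each state by 1/degree recovers the uniform distribution,
# mirroring the 1/M_t weights in equations (9) and (15):
w = pi / deg
w /= w.sum()
assert np.allclose(w, 0.25)
```

The analogy is only structural: here the "number of operators" at a state is its degree, and the chain is reversible with respect to the degree-proportional distribution.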

In Algorithm 3, we provide an estimate of rather than calculating it exactly as in Algorithm 1. Let , and . Clearly, the ratio is an unbiased estimator of the population proportion under sampling without replacement. We can therefore estimate in Step B as

 M̂_t = m_t · m(Õ)_t / ⌈αm_t⌉. (14)

When is large, the estimator is approximately normally distributed with mean .

Let the random variable be uniformly distributed on , let be a real function describing a property of interest of , and let be a subset of . By replacing with in equation (9), we estimate via as follows:

 P̂′_N(f(u) ∈ A) = (∑_{t=1}^{N} I_{{f(e_t) ∈ A}} M̂_t^{−1}) / (∑_{t=1}^{N} M̂_t^{−1}), (15)

where is defined in equation (8).
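Equation (15) is a weighted average over the visited states, with each state weighted by the reciprocal of its estimated operator count. A minimal sketch (the helper names are our own, hypothetical):

```python
def weighted_proportion(values, M_hat, in_A):
    """Estimate P(f(u) in A) under the uniform distribution from a chain
    e_1..e_N whose stationary probabilities are proportional to M_t:
    `values` holds f(e_t), `M_hat` holds the estimates from equation (14),
    and each state contributes weight 1 / M_hat_t, as in equation (15)."""
    num = sum(1.0 / m for v, m in zip(values, M_hat) if in_A(v))
    den = sum(1.0 / m for m in M_hat)
    return num / den

# Three visited states with f-values 1, 2, 1 and estimated counts 1, 2, 1:
# weights 1, 0.5, 1; proportion of {f = 1} is (1 + 1) / (1 + 0.5 + 1) = 0.8.
assert abs(weighted_proportion([1, 2, 1], [1.0, 2.0, 1.0], lambda v: v == 1) - 0.8) < 1e-12
```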

In the accelerated version, only ⌈αm_t⌉ of the possible operators on are checked. In Section 4, our experiments on show that the accelerated version speeds up the approach by nearly times, and that equation (15) provides almost the same results as equation (9), in which the from Algorithm 1 are used. Roughly speaking, if we set , the time complexity of our accelerated version reduces to .

## 4 Experiments

In this section, we conduct experiments to illustrate the reversible Markov chains proposed in this paper and their applications for studying Markov equivalence classes. The main points obtained from these experiments are as follows: {longlist}[(2)]

For with small p, the estimates from our proposed method are very close to the true values. For with large p (up to 1000), the accelerated version of our proposed approach is also very efficient, and the estimates in equations (9) and (15) converge quickly as the length of the Markov chain increases.

For completed PDAGs in with sparsity constraints ( is a small multiple of ), we see that (i) most edges are directed, (ii) the sizes of the maximum chain components (measured by the number of vertices) are very small (around ten) even for large p (around 1000) and (iii) the number of chain components grows approximately linearly with p.

Under the assumption that no latent or selection variables are present, causal inference based on observational data yields a completed PDAG, and interventions are needed to infer the directions of the undirected edges in the completed PDAG. Our results show that if the underlying completed PDAG is sparse, then in the model space of Markov equivalence classes most graphs have few undirected edges and small chain components. This gives hope for learning causal relationships from observational data and for inferring the directions of the undirected edges via interventions.

In Section 4.1, we evaluate our methods by comparing the size distributions of Markov equivalence classes in with small to true distributions () or Gillispie’s results () gillispie2002size (). In Section 4.2, we report the proportion of directed edges and the properties of chain components of Markov equivalence classes under sparsity constraints. In Section 4.3, we show experimentally that Algorithm 3 is much faster than Algorithm 1, and that the difference in the estimates obtained is small. Finally, we study the asymptotic properties of our proposed estimators in Section 4.4.

### 4.1 Size distributions of Markov equivalence classes in Sp for small p

We consider the size distributions of completed PDAGs in for and , respectively. There are 11 Markov equivalence classes in and 185 Markov equivalence classes in . Here we can obtain the true size distributions for and by listing all Markov equivalence classes and calculating the size of each explicitly. Gillispie and Perlman calculated the true size probabilities for by listing all classes; these are denoted GP-values. We estimate the size probabilities via equation (9) with the Markov chains from Algorithm 1, running ten independent Markov chains to calculate the mean and standard deviation of each estimate. The results are shown in Table 1, where is the sample size (the length of the Markov chain). We can see that the means are very close to the true values or GP-values, and the standard deviations are also very small.

We implemented our proposed method (Algorithm 1, the version without acceleration) in Python and ran it on a computer with a 2.6 GHz processor. In Table 1, is the time used to estimate the size distribution for , or . These results were obtained within at most tens of seconds. In comparison, an MCMC method in pena2013approximate () took more than one hour (in C on a 2.6 GHz computer) to obtain similar estimates of the proportions of Markov equivalence classes of size one. It is worth noting that our estimates are based on a single Markov chain, whereas the results in pena2013approximate () are based on independent Markov chains with steps.

### 4.2 Markov equivalence classes with sparsity constraints

We now study the sets of Markov equivalence classes defined in equation (11). The number of vertices p is set to or , and the maximum edge constraint is set to , where r is the ratio of n to p. For each p, we consider three ratios: 1.2, 1.5 and 3. The completed PDAGs in are sparse since . Define the size of a chain component as the number of vertices it contains. We study four distributions for completed PDAGs in : the distribution of the proportions of directed edges, the distribution of the numbers of chain components, the distribution of the maximum sizes of chain components and the distribution of the numbers of v-structures. The first three are reported in this section; the results on the numbers of v-structures are reported in the Supplementary Material hesupp (). In each simulation, given p and r, a Markov chain of length on is generated via Algorithm 3 to estimate the distributions via equation (15). The acceleration parameter is set to and for and , respectively.

In Figure 2, twelve distributions of the proportions of directed edges are reported for with different p and ratios r. We mark the minimums, 5% quartiles (solid circles below the boxes), 1st quartiles, medians, 3rd quartiles and maximums of these distributions. We can see that for a fixed p, the proportion of directed edges increases with the number of edges in the completed PDAG. For example, when the ratio , the medians (red lines in the boxes) of the proportions are near ; when the ratio , the medians are near ; when the ratio , the medians are near %.

Figure 2: Distribution of the proportion of directed edges in completed PDAGs in S_p^{rp}. The lines in the boxes and the solid circles under the boxes indicate the medians and the 5% quartiles, respectively.

Figure 3: Distributions of the numbers of chain components of completed PDAGs in S_p^{rp}. The lines in the boxes and the solid circles above the boxes indicate the medians and the 95% quartiles, respectively.

The distributions of the numbers of chain components of completed PDAGs in are shown in Figure 3. We plot the distributions for in the main window and the distributions for and in two sub-windows. We can see that the medians of the numbers of chain components are close to 5, 10, 20 and 40 for completed PDAGs in with and , respectively. This suggests a linear relationship between the number of chain components and the number of vertices p. The sub-windows show similar results for and .

The distributions of the maximum sizes of chain components of completed PDAGs in are shown in Figure 4. For in the main window, the medians of the four distributions are approximately 4, 5, 6 and 7 for and 1000, respectively. This shows that the maximum size of the chain components in a completed PDAG increases very slowly with p. In particular, from the 95% quartiles (solid circles above the boxes), we can see that the maximum chain components of more than of the completed PDAGs in have at most 8, 9, 10 and 13 vertices for and , respectively. This result implies that the chain components of most sparse completed PDAGs are small.

Figure 4: The distributions of the maximum sizes of chain components of completed PDAGs in S_p^{rp}. The lines in the boxes and the solid circles above the boxes indicate the medians and the 95% quartiles, respectively.

### 4.3 Comparisons between Algorithm 1 and its accelerated version

In this section, we show experimentally that the accelerated version, Algorithm 3, is much faster than Algorithm 1, and that the difference between the estimates based on the two algorithms is small. In Section 4.2, we estimated four distributions on via Algorithm 3: the distribution of the proportions of directed edges, the distribution of the numbers of chain components, the distribution of the maximum sizes of chain components and the distribution of the numbers of v-structures. To compare Algorithm 1 with Algorithm 3, we re-estimate these four distributions for completed PDAGs in via Algorithm 1.

For each distribution, Figure 5 reports the estimates obtained by Algorithm 1 (lines) and the estimates obtained by Algorithm 3 (points) in the main windows; the differences between the two estimates are shown in the sub-windows. The four panels (from top to bottom) display the cumulative distributions of the proportions of directed edges, the distributions of the numbers of chain components, the distributions of the maximum sizes of chain components and the distributions of the numbers of v-structures. We can see that the differences between the estimates are small.

Figure 5: Distributions for completed PDAGs in S_{100}^{150} estimated via Algorithm 1 (plotted as lines) and the accelerated version, Algorithm 3 (plotted as points), are shown in the main windows. The differences are shown in the sub-windows. The four panels (from top to bottom) display the distributions of directed edges, numbers of chain components, maximum sizes of chain components and v-structures, respectively.

The average times used to generate a state of the Markov chain of completed PDAGs in are shown in Table 2, in which is the acceleration parameter used in Algorithm 3. If , the Markov chain is generated via Algorithm 1. The results suggest that the accelerated version can speed up the approach by nearly times when .

### 4.4 Asymptotic properties of proposed estimators

We further illustrate the asymptotic properties of the proposed estimators of sparse completed PDAGs via simulation studies. We consider for , and , respectively. Let be a discrete function of a Markov equivalence class , where is a random variable distributed uniformly on . Let be the expectation of ; we have

 E(f) = ∑_i i P(f = i)