# Efficient Learning of Optimal Markov Network Topology with k-Tree Modeling

Liang Ding\*, Di Chang, Russell Malmberg, Aaron Martinez, David Robinson, Matthew Wicker, Hongfei Yan, and Liming Cai\*

\*To whom correspondence should be addressed. Co-authors are listed alphabetically. Affiliations: Department of Computer Science, The University of Georgia, Athens, GA; St. Jude Children's Research Hospital, Memphis, TN; Department of Plant Biology, The University of Georgia, Athens, GA.

July 14, 2019
###### Abstract

The seminal work of Chow and Liu (1968) shows that approximation of a finite probabilistic system by Markov trees can achieve the minimum information loss with the topology of a maximum spanning tree. Our current paper generalizes the result to Markov networks of tree width $\leq k$, for every fixed $k \geq 2$. In particular, we prove that approximation of a finite probabilistic system with such Markov networks has the minimum information loss when the network topology is achieved with a maximum spanning $k$-tree. While constructing a maximum spanning $k$-tree is intractable even for $k = 2$, we show that polynomial algorithms can be ensured by a sufficient condition accommodated by many meaningful applications. In particular, we present an efficient algorithm for learning the optimal topology of higher order correlations among random variables that belong to an underlying linear structure.

Keywords: Markov network, joint probability distribution function, $k$-tree, spanning graph, tree width, Kullback-Leibler divergence, mutual information

## 1 Introduction

We are interested in effective modeling of complex systems whose behavior is determined by relationships among uncertain events. When such uncertain events are quantified as random variables $\mathbf{X} = \{X_1, \dots, X_n\}$, the system is characterizable with an underlying joint probability distribution function $P(\mathbf{X})$ [39, 24]. Nevertheless, estimation of the function $P(\mathbf{X})$ from observed event data poses challenges, as relationships among events may be intrinsic and the $n$th-order function $P(\mathbf{X})$ may be very difficult, if not impossible, to compute. A viable solution is to approximate $P(\mathbf{X})$ with lower order functions, as a result of constraining the dependency relationships among random variables. Such constraints can be well characterized with the notion of Markov network, where the dependency relation of random variables is defined by a non-directed graph $G = (\mathbf{X}, E)$, with the edge set $E$ specifying the topology of the dependency relation. In a Markov network, two variables not connected with an edge are independent conditional on the rest of the variables [21, 39, 42].

Model approximation with Markov networks needs to address two essential issues: the quality of the approximation, and the feasibility of computing such approximated models. Chow and Liu [11] were the first to address both issues in investigating networks of tree topology. They measured the information loss with Kullback-Leibler divergence [27] between $P(\mathbf{X})$, the unknown distribution function, and $P_G(\mathbf{X})$, the distribution function estimated under the dependency graph $G$. They showed that the minimum information loss is guaranteed by the topology corresponding to a maximum spanning tree which can be computed very efficiently (in linear time in the number of random variables).

It has been difficult to extend the seminal work of simultaneous optimal and efficient learning to the realm of arbitrary Markov networks due to the computationally intractable nature of the problem [10]. To overcome the barrier, research in Markov network learning has sought heuristic algorithms that are efficient yet without an optimality guarantee in the learned topology [30, 23, 48, 24, 13, 55]. A more viable approach has been to consider the nature of probabilistic networks formulated from real-world applications, which often are constrained and typically tree-like [14, 35, 20, 3, 44, 37, 17, 8, 47, 56]. All such models can be quantitatively characterized with networks that have tree width $\leq k$, for small $k$. Tree width is a metric that measures how much a graph is tree-like [7]. It is closely related to the maximum size of a graph separator, i.e., a set $S$ of random variables upon which two sets of random variables $Y$ and $Z$ are conditionally independent. Such networks have the advantage of supporting efficient inference and other queries over constructed models [28]. In addition, many real-world applications actually imply Markov (Bayesian) networks of small tree width. In spite of these advantages, however, optimal learning of such networks is unfortunately difficult; for example, it is intractable even when the tree width is $k = 2$ [20, 44]. Hence, the techniques for learning such Markov networks have heavily relied on heuristics that may not guarantee the quality.

In this paper, we generalize the seminal work of Chow and Liu from Markov trees to Markov networks of tree width $\leq k$. First, we prove that model approximation with Markov networks of tree width $\leq k$ has the minimum information loss when the network topology is the one for a maximum spanning $k$-tree. A $k$-tree is a maximal graph of tree width $k$ to which no edge can be added without an increase of tree width. Therefore the quality of such Markov networks of tree width $\leq k$ is also optimal with a maximum spanning $k$-tree. Second, we reveal sufficient conditions satisfiable by many applications, which guarantee polynomial algorithms to compute the maximum spanning $k$-tree. That is, under such a condition, the optimal topology of Markov networks with tree width $\leq k$ can be learned in polynomial time for every fixed $k$.

We organize this paper as follows. Section 2 presents preliminaries and introduces the Markov $k$-tree model. In Section 3, we prove that the Kullback-Leibler divergence $D_{KL}(P \,\|\, P_G)$ is minimized when the approximated distribution function $P_G$ is estimated with a maximum spanning $k$-tree $G$. In Section 4, we present some conditions, each being sufficient for permitting polynomial-time algorithms to compute a maximum spanning $k$-tree and thus to learn the optimal topology of Markov networks of tree width $\leq k$. Efficient inference with Markov $k$-tree models is also discussed in Section 4. We conclude in Section 5 with some further remarks.

## 2 Markov k-Tree Model

The probabilistic modeling of a finite system involves random variables $\mathbf{X} = \{X_1, \dots, X_n\}$ with observable values. Such systems are thus characterized by dependency relations among the random variables, which may be learned by estimating the joint probability distribution function $P(\mathbf{X})$. Inspired by Chow and Liu's work [11], we are interested in approximating the $n$th-order dependency relation among random variables by a second-order (i.e., pairwise) relation. Our work first concerns how to minimize the information loss in the approximation.

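A pairwise dependency relation between random variables $\mathbf{X} = \{X_1, \dots, X_n\}$ can be defined with a graph $G = (\mathbf{X}, E)$, where the binary relation $E \subseteq \mathbf{X} \times \mathbf{X}$ is defined such that $(X_i, X_j) \notin E$ if and only if variables $X_i$ and $X_j$ are independent conditional on the rest of the variables. (Here we slightly abuse notation: we use the same symbol for a random variable and its corresponding vertex in the topology graph and rely on context to distinguish them.) That is, letting $Y = \mathbf{X} \setminus \{X_i, X_j\}$,

$$(X_i, X_j) \notin E \iff P_G(X_i, X_j \mid Y) = P_G(X_i \mid Y)\, P_G(X_j \mid Y) \tag{1}$$

We call the resulting model a Markov network with the topology graph $G$ over the random variable set $\mathbf{X}$.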

Condition (1) is called the pairwise property. By [24], it is equivalent to the following global property for positive density. Let $Y, Z \subseteq \mathbf{X}$ be two disjoint subsets and $S \subseteq \mathbf{X}$. If $S$ separates $Y$ and $Z$ in the graph $G$ (i.e., removing $S$ disconnects $Y$ and $Z$ in the graph), then

$$P_G(Y, Z \mid S) = P_G(Y \mid S)\, P_G(Z \mid S) \tag{2}$$

For the convenience of our discussion, we will assume our Markov networks to have the global property.

Equation (2) states that random variables in $Y$ and in $Z$ are independent conditional on variables in $S$. The minimum cardinality of $S$ characterizes the degree of condition for the independency between $Y$ and $Z$. The conditional independency of the network is thus characterized by the upper bound of minimum cardinalities of all such $S$ separating the graph $G$. This number is closely related to the notion of tree width [41]. In this paper, we are interested in Markov networks whose topology graphs are of tree width $\leq k$, for a given integer $k \geq 1$ [20].

### 2.1 k-Tree and Creation Order

Intuitively, tree width [7] is a metric for how much a graph is tree-like. It plays a central role in the development of structural graph theory and, in particular, in the well-known graph minor theorems [41] by Robertson and Seymour. There are several alternative definitions of tree width. For the work presented in this paper, we relate its definition to the concept of $k$-tree.

###### Definition 1.

[38] Let $k \geq 1$ be an integer. The class of $k$-trees consists of graphs that are defined inductively as follows:

1. A $k$-tree of $k$ vertices is a clique of $k$ vertices;

2. A $k$-tree of $n$ vertices, for $n > k$, is a graph consisting of a $k$-tree $G$ of $n - 1$ vertices and a new vertex $v$ not in $G$, such that $v$ forms a $(k+1)$-clique with some $k$-clique already in $G$.

###### Definition 2.

[7] For every fixed $k \geq 1$, a graph is said to be of tree width $\leq k$ if and only if it is a subgraph of some $k$-tree.

In particular, all trees are $k$-trees for $k = 1$ and they have tree width 1. Graphs of tree width 1 are simply forests. For $k \geq 1$, a $k$-tree can be succinctly represented with the vertices in the order in which they are introduced to the graph by the inductive process in Definition 1.

###### Definition 3.

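Let $k \geq 1$ be an integer. A creation order $\Phi_{\mathbf{X}}$ for a $k$-tree $G = (\mathbf{X}, E)$, where $\mathbf{X} = \{X_1, \dots, X_n\}$, is defined as an ordered sequence, inductively

$$\Phi_{\mathbf{X}} = \begin{cases} C & |\mathbf{X}| = k, \text{ where } C = \mathbf{X} \\ C X & |\mathbf{X}| = k + 1, \text{ where } C \cup \{X\} = \mathbf{X} \\ \Phi_{\mathbf{X} \setminus \{X\}}\, C X & |\mathbf{X}| > k + 1, \text{ where } X \notin C \subset \mathbf{X} \end{cases} \tag{3}$$

where $C$ is a $k$-clique and $C \cup \{X\}$ is a $(k+1)$-clique.

Note that creation orders may not be unique for the same $k$-tree. The succinct notation facilitates discussion of properties of $k$-trees. In particular, because Markov networks are non-directed graphs, the notion of creation order offers a convenient mechanism to discuss dependency between random variables modeled by the networks.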
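The inductive construction behind a creation order can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `ktree_from_creation_order` and the encoding of a creation order as a base clique plus a list of `(C, x)` pairs are assumptions made here, not notation from the paper.

```python
from itertools import combinations


def ktree_from_creation_order(base_clique, steps):
    """Return (vertices, edges) of the k-tree described by a creation order.

    base_clique: the initial k-clique, as a sequence of k vertices.
    steps: sequence of (C, x) pairs, each attaching new vertex x to k-clique C.
    (Hypothetical encoding chosen for this sketch.)
    """
    k = len(base_clique)
    vertices = list(base_clique)
    edges = {frozenset(e) for e in combinations(base_clique, 2)}
    for clique, x in steps:
        members = set(clique)
        if x in vertices:
            raise ValueError(f"vertex {x!r} was already introduced")
        if len(members) != k or not members <= set(vertices):
            raise ValueError(f"{clique!r} is not a set of k existing vertices")
        if any(frozenset(e) not in edges for e in combinations(members, 2)):
            raise ValueError(f"{clique!r} is not a clique in the current graph")
        # The new vertex forms a (k+1)-clique with the chosen k-clique.
        edges |= {frozenset((c, x)) for c in members}
        vertices.append(x)
    return vertices, edges


# A 2-tree on five vertices: start from clique {1, 2}, then attach 3, 4, 5.
vertices, edges = ktree_from_creation_order(
    [1, 2], [((1, 2), 3), ((2, 3), 4), ((2, 4), 5)]
)
print(sorted(tuple(sorted(e)) for e in edges))
```

Each step adds exactly $k$ edges, so a $k$-tree on $n$ vertices built this way has $k(k-1)/2 + k(n-k)$ edges.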

###### Proposition 1.

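Let $\Phi_{\mathbf{X}}$ be a creation order for $k$-tree $G = (\mathbf{X}, E)$. Then $\kappa$ is a $(k+1)$-clique in $G$ if and only if there are a vertex $X$ and a $k$-clique $C$ such that some prefix of $\Phi_{\mathbf{X}}$ ends with $C X$, where $\kappa = C \cup \{X\}$.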

###### Definition 4.

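Let $k \geq 1$ be an integer and $\Phi_{\mathbf{X}}$ be a creation order of some $k$-tree $G = (\mathbf{X}, E)$, where $\mathbf{X} = \{X_1, \dots, X_n\}$. Then $\Phi_{\mathbf{X}}$ induces an acyclic orientation $\hat{E}$ for $E$ such that, given any two different variables $X_i, X_j \in \mathbf{X}$,

$$\langle X_i, X_j \rangle \in \hat{E} \iff \begin{cases} X_i, X_j \in C \text{ and } i < j, & \text{where } C \text{ is the initial } k\text{-clique of } \Phi_{\mathbf{X}}, \\ X_i \in C, & \text{where } C X_j \text{ occurs in } \Phi_{\mathbf{X}}. \end{cases} \tag{4}$$

In addition, for any variable $X_i \in \mathbf{X}$, its parent set is defined as

$$\pi_{\hat{E}}(X_i) = \{X_j : \langle X_j, X_i \rangle \in \hat{E}\}$$

Based on Definition 4, it is clear that the orientation induced by any creation order of a $k$-tree is acyclic. In particular, there exists exactly one variable $X$ with $\pi_{\hat{E}}(X) = \emptyset$.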

### 2.2 Relative Entropy and Mutual Information

According to Shannon's information theory [43], the uncertainty of a discrete random variable $X$ can be measured with entropy $H(X) = -\sum_{x} P(X) \log P(X)$, where $P(X)$ is the probability distribution for $X$ and the sum takes all values $x$ in the range of $X$. Entropy for a set of random variables $\mathbf{X}$ is $H(\mathbf{X}) = -\sum_{x} P(\mathbf{X}) \log P(\mathbf{X})$, where the sum takes all combined values $x$ in the ranges of variables in $\mathbf{X}$.

###### Definition 5.

[27] Kullback-Leibler divergence between two probability distributions $P$ and $Q$ of the same random variable set $\mathbf{X}$ is defined as

$$D_{KL}(P \,\|\, Q) = \sum_{x} P(\mathbf{X}) \log \frac{P(\mathbf{X})}{Q(\mathbf{X})} \geq 0 \tag{5}$$

where the sum takes combined values $x$ in the ranges of all random variables in $\mathbf{X}$.

Equality with zero holds if and only if $P = Q$. $D_{KL}(P \,\|\, Q)$ can be used to measure the information gain by distribution $P$ over distribution $Q$, or the information loss by approximation of $P$ with $Q$, as in this work.

Let $X$ and $Y$ be two random variables. The mutual information between $X$ and $Y$, denoted as $I(X; Y)$, is defined as the Kullback-Leibler divergence between their joint probability distribution and the product of marginal distributions, i.e.,

$$I(X; Y) = D_{KL}\big(p(X, Y) \,\|\, p(X)\, p(Y)\big) = \sum_{x, y} p(X, Y) \log \frac{p(X, Y)}{p(X)\, p(Y)}$$

Mutual information measures the degree of the correlation between the two random variables. In this work, we slightly extend the mutual information to include more than two variables.

###### Definition 6.

Let $X$ be a random variable and $C$ be a set of random variables. Then

$$I(X; C) = D_{KL}\big(p(X, C) \,\|\, p(X)\, p(C)\big) = \sum_{x, c} p(X, C) \log \frac{p(X, C)}{p(X)\, p(C)}$$

where the sum takes combined values $x$ and $c$ in the ranges of $X$ and all random variables in $C$.
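To make Definition 6 concrete, the following sketch computes $I(X; C)$ from an explicit joint table. The table encoding (a dictionary from assignment tuples to probabilities) and the helper names `marginal` and `mutual_information` are assumptions made for illustration; natural logarithms are used, and the empty set $C$ is given mutual information 0 by convention.

```python
from collections import defaultdict
from math import log


def marginal(joint, idx):
    """Marginalize a joint table {assignment tuple: prob} onto positions idx."""
    out = defaultdict(float)
    for assignment, p in joint.items():
        out[tuple(assignment[i] for i in idx)] += p
    return out


def mutual_information(joint, i, parents):
    """I(X_i; C) for the variable at position i and the set C at `parents`."""
    parents = tuple(parents)
    if not parents:
        return 0.0  # convention for an empty set C
    p_xc = marginal(joint, (i,) + parents)
    p_x = marginal(joint, (i,))
    p_c = marginal(joint, parents)
    return sum(
        p * log(p / (p_x[key[:1]] * p_c[key[1:]]))
        for key, p in p_xc.items()
        if p > 0
    )


# X0 and X1 are independent fair bits and X2 = X0 XOR X1.
joint = {(a, b, a ^ b): 0.25 for a in (0, 1) for b in (0, 1)}
print(mutual_information(joint, 2, (0, 1)))  # X2 against the pair {X0, X1}
print(mutual_information(joint, 2, (0,)))    # X2 against X0 alone
```

The XOR example illustrates why a set-valued second argument matters: $X_2$ is fully determined by the pair $\{X_0, X_1\}$ yet is pairwise independent of each of them.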

Note that $I(X; C)$ is not the same as the multivariate mutual information defined elsewhere [49]. We further point out that mutual information has recently received much attention due to its capability in decoding hidden information from big data [22, 58]. In our derivation of optimal Markov $k$-tree modeling, estimating mutual information stands out as a critical condition that needs to be satisfied (see next section). These phenomena are unlikely to be coincidences.

### 2.3 Markov k-Tree Model

###### Definition 7.

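Let $k \geq 1$ be an integer. A Markov $k$-tree is a Markov network over $n$ random variables $\mathbf{X} = \{X_1, \dots, X_n\}$ with a topology graph $G = (\mathbf{X}, E)$ that is a $k$-tree. We denote with $P_G(\mathbf{X})$ the joint probability distribution function of the Markov $k$-tree.

###### Theorem 1.

Let $k \geq 1$ and $\mathbf{X} = \{X_1, \dots, X_n\}$ be random variables with joint probability distribution function $P(\mathbf{X})$. Let $G = (\mathbf{X}, E)$ be a Markov $k$-tree and $\hat{E}$ be an acyclic orientation for its edges. Then

$$P_G(\mathbf{X}) = \prod_{i=1}^{n} P\big(X_i \mid \pi_{\hat{E}}(X_i)\big) \tag{6}$$
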
###### Proof.

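Assume that the acyclic orientation $\hat{E}$ is induced by creation order

$$\Phi_{\mathbf{X}} = C_k X_{k+1} \dots C_{n-1} X_n$$

where each $C_i$, for $k \leq i \leq n - 1$, is a $k$-clique of variables introduced before $X_{i+1}$. Then

$$\begin{aligned} P_G(\mathbf{X}) &= P(\mathbf{X} \mid \Phi_{\mathbf{X}}) = P(X_1, \dots, X_n \mid \Phi_{\mathbf{X}}) \\ &= P(X_n \mid X_1, \dots, X_{n-1}, \Phi_{\mathbf{X}})\, P(X_1, \dots, X_{n-1} \mid \Phi_{\mathbf{X}}) \\ &= P(X_n \mid C_{n-1})\, P\big(\mathbf{X} \setminus \{X_n\} \mid \Phi_{\mathbf{X} \setminus \{X_n\}}\big) \end{aligned} \tag{7}$$

The last equality holds because $X_n$ is independent of the variables in $\mathbf{X} \setminus (C_{n-1} \cup \{X_n\})$ conditional on the variables in $C_{n-1}$, as shown in the following derivation. Let $Y = \mathbf{X} \setminus (C_{n-1} \cup \{X_n\})$. By Bayes' rule,

$$\begin{aligned} P(X_n \mid X_1, \dots, X_{n-1}, \Phi_{\mathbf{X}}) &= P(X_n \mid Y, C_{n-1}, \Phi_{\mathbf{X}}) = \frac{P(X_n, Y \mid C_{n-1}, \Phi_{\mathbf{X}})}{P(Y \mid C_{n-1}, \Phi_{\mathbf{X}})} \\ &= \frac{P(X_n \mid C_{n-1}, \Phi_{\mathbf{X}})\, P(Y \mid C_{n-1}, \Phi_{\mathbf{X}})}{P(Y \mid C_{n-1}, \Phi_{\mathbf{X}})} = P(X_n \mid C_{n-1}, \Phi_{\mathbf{X}}) = P(X_n \mid C_{n-1}) \end{aligned}$$

The derivation in (7) also results in the recurrence

$$P(\mathbf{X} \mid \Phi_{\mathbf{X}}) = \begin{cases} P(X_n \mid C_{n-1})\, P\big(\mathbf{X} \setminus \{X_n\} \mid \Phi_{\mathbf{X} \setminus \{X_n\}}\big) & \text{if } n > k \\ P(C_k) & \text{if } n = k \end{cases} \tag{8}$$

Solving the recurrence yields

$$P(\mathbf{X} \mid \Phi_{\mathbf{X}}) = P(C_k) \prod_{i=k+1}^{n} P(X_i \mid C_{i-1}) \tag{9}$$

where $C_k = \{X_1, \dots, X_k\}$. Because for $i = 1, \dots, k$, $\pi_{\hat{E}}(X_i) = \{X_1, \dots, X_{i-1}\}$, it is not hard to prove that

$$P(X_1, \dots, X_k) = \prod_{i=1}^{k} P\big(X_i \mid \pi_{\hat{E}}(X_i)\big) \tag{10}$$

In addition, because $\pi_{\hat{E}}(X_i) = C_{i-1}$ for all $i \geq k + 1$, by equations (9) and (10), we have

$$P_G(\mathbf{X}) = \prod_{i=1}^{k} P\big(X_i \mid \pi_{\hat{E}}(X_i)\big) \prod_{i=k+1}^{n} P\big(X_i \mid \pi_{\hat{E}}(X_i)\big) = \prod_{i=1}^{n} P\big(X_i \mid \pi_{\hat{E}}(X_i)\big)$$

Though the above proof is based on an explicit creation order for the $k$-tree $G$, the probability function computed for $G$ is actually independent of the choice of a creation order, as demonstrated in the following.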

###### Theorem 2.

The probability function $P_G(\mathbf{X})$ expressed in (6) for Markov $k$-tree $G$ remains the same regardless of the choice of creation order for $G$.

A proof is given in Appendix A (Theorem 7).
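The invariance can also be checked numerically on a small instance. The sketch below is an illustration under stated assumptions, not part of the proof: it draws an arbitrary positive joint distribution over four binary variables, evaluates the factorization (6) under two creation orders of the same 2-tree, and compares the results. The helper names `marginal` and `factorized` and the dictionary encoding of parent sets are hypothetical.

```python
import random
from collections import defaultdict
from itertools import product
from math import isclose


def marginal(joint, idx):
    """Marginalize a joint table {assignment tuple: prob} onto positions idx."""
    out = defaultdict(float)
    for assignment, p in joint.items():
        out[tuple(assignment[i] for i in idx)] += p
    return out


def factorized(joint, parents):
    """P_G(x) = prod_i P(x_i | parents of x_i), conditionals read off `joint`."""
    tables = {
        i: (marginal(joint, (i,) + pa), marginal(joint, pa))
        for i, pa in parents.items()
    }
    p_g = {}
    for x in joint:
        value = 1.0
        for i, pa in parents.items():
            num, den = tables[i]
            pa_vals = tuple(x[j] for j in pa)
            value *= num[(x[i],) + pa_vals] / den[pa_vals]
        p_g[x] = value
    return p_g


# An arbitrary strictly positive joint distribution over four binary variables.
rng = random.Random(0)
weights = {x: rng.random() + 0.1 for x in product((0, 1), repeat=4)}
total = sum(weights.values())
joint = {x: w / total for x, w in weights.items()}

# Parent sets induced by two creation orders of the same 2-tree,
# whose edges are 01, 02, 12, 13, 23.
parents_a = {0: (), 1: (0,), 2: (0, 1), 3: (1, 2)}  # {X0,X1}, then X2, then X3
parents_b = {2: (), 3: (2,), 1: (2, 3), 0: (1, 2)}  # {X2,X3}, then X1, then X0

p_a = factorized(joint, parents_a)
p_b = factorized(joint, parents_b)
print(all(isclose(p_a[x], p_b[x]) for x in joint))
```

Both orders reduce to $P(X_0, X_1, X_2)\, P(X_1, X_2, X_3) / P(X_1, X_2)$, which is why the two tables agree.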

## 3 Model Optimization

### 3.1 Optimal Markov k-Trees

Let $\mathbf{X} = \{X_1, \dots, X_n\}$ be random variables in any unconstrained probabilistic model $P(\mathbf{X})$. Let $k \geq 1$ and $k$-tree $G = (\mathbf{X}, E)$ be a topology graph of a Markov network $P_G(\mathbf{X})$ that approximates $P(\mathbf{X})$. The approximation can be measured using the Kullback-Leibler divergence between the two models [27, 31]:

$$D_{KL}(P \,\|\, P_G) = \sum_{x} P(\mathbf{X}) \log \frac{P(\mathbf{X})}{P_G(\mathbf{X})} \geq 0 \tag{11}$$

where $x$ is any combination of values for all random variables in $\mathbf{X}$, and equality with zero holds if and only if $P = P_G$.

Thus the problem of optimally approximating $P$ is to find a Markov $k$-tree with topology $G$ that minimizes the divergence (11). We are now ready for our first main theorem.

###### Theorem 3.

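$D_{KL}(P \,\|\, P_G)$ is minimized when the topology $k$-tree $G = (\mathbf{X}, E)$ for random variables $\mathbf{X} = \{X_1, \dots, X_n\}$ is such that it maximizes the sum of mutual information

$$\sum_{i=1}^{n} I\big(X_i; \pi_{\hat{E}}(X_i)\big) \tag{12}$$

where $\hat{E}$ is any acyclic orientation for the edges in $E$.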

###### Proof.

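Assume $\hat{E}$ to be an acyclic orientation for the edges in $E$. Applying equation (6),

$$P_G(\mathbf{X}) = \prod_{i=1}^{n} P\big(X_i \mid \pi_{\hat{E}}(X_i)\big)$$

to $P_G(\mathbf{X})$ in (11), we have

$$D_{KL}(P \,\|\, P_G) = \sum_{x} P(\mathbf{X}) \log P(\mathbf{X}) - \sum_{x} P(\mathbf{X}) \sum_{i=1}^{n} \log P\big(X_i \mid \pi_{\hat{E}}(X_i)\big)$$

The first term on the right-hand side (RHS) above is $-H(\mathbf{X})$, where $H(\mathbf{X})$ is the entropy of $\mathbf{X}$. The second term can be further expanded as

$$-\sum_{x} P(\mathbf{X}) \sum_{i=1}^{n} \log \frac{P\big(X_i, \pi_{\hat{E}}(X_i)\big)}{P(X_i)\, P\big(\pi_{\hat{E}}(X_i)\big)} - \sum_{x} P(\mathbf{X}) \sum_{i=1}^{n} \log P(X_i) \tag{13}$$

Because $X_i$ and $\pi_{\hat{E}}(X_i)$ are components of $\mathbf{X}$, the second term in (13) can be formulated as

$$-\sum_{x} P(\mathbf{X}) \sum_{i=1}^{n} \log P(X_i) = -\sum_{i=1}^{n} \sum_{x} P(\mathbf{X}) \log P(X_i) = -\sum_{i=1}^{n} \sum_{x_i} P(X_i) \log P(X_i) = \sum_{i=1}^{n} H(X_i) \tag{14}$$

where $x_i$ is any value in the range of random variable $X_i$.

Also the first term in (13) gives

$$-\sum_{i=1}^{n} \sum_{x} P(\mathbf{X}) \log \frac{P\big(X_i, \pi_{\hat{E}}(X_i)\big)}{P(X_i)\, P\big(\pi_{\hat{E}}(X_i)\big)} = -\sum_{i=1}^{n} \sum_{x_i, y_i} P\big(X_i, \pi_{\hat{E}}(X_i)\big) \log \frac{P\big(X_i, \pi_{\hat{E}}(X_i)\big)}{P(X_i)\, P\big(\pi_{\hat{E}}(X_i)\big)} = -\sum_{i=1}^{n} I\big(X_i; \pi_{\hat{E}}(X_i)\big) \tag{15}$$

where $y_i$ is any combination of values of all random variables in $\pi_{\hat{E}}(X_i)$ and $I\big(X_i; \pi_{\hat{E}}(X_i)\big)$ is the mutual information between variable $X_i$ and its parent set $\pi_{\hat{E}}(X_i)$.

Therefore,

$$D_{KL}(P \,\|\, P_G) = -\sum_{i=1}^{n} I\big(X_i; \pi_{\hat{E}}(X_i)\big) + \sum_{i=1}^{n} H(X_i) - H(\mathbf{X}) \tag{16}$$

Since $\sum_{i=1}^{n} H(X_i)$ and $H(\mathbf{X})$ are independent of the choice of $G$ (and of the acyclic orientation for the edges), $D_{KL}(P \,\|\, P_G)$ is minimized when $\sum_{i=1}^{n} I\big(X_i; \pi_{\hat{E}}(X_i)\big)$ is maximized.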
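As a sanity check on (16), the case $k = 1$ recovers the result of Chow and Liu. The block below is a worked specialization only; the names $X_r$ (the single root variable) and $j(i)$ (the index of the unique parent of $X_i$) are introduced here for illustration, and the empty parent set of the root is taken to contribute zero mutual information.

```latex
\begin{aligned}
&\pi_{\hat{E}}(X_r) = \emptyset, \qquad \pi_{\hat{E}}(X_i) = \{X_{j(i)}\} \quad (i \neq r), \\
&D_{KL}(P \,\|\, P_G) = -\sum_{i \neq r} I\big(X_i; X_{j(i)}\big) + \sum_{i=1}^{n} H(X_i) - H(\mathbf{X}).
\end{aligned}
```

Only the first sum depends on the tree, and it is the total weight of the tree's edges when each edge carries the pairwise mutual information of its endpoints. Minimizing the divergence therefore amounts to choosing a maximum spanning tree, which is the topology of [11].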

Though Theorem 3 is about optimal approximation of probabilistic systems with Markov $k$-trees, it also characterizes the optimal Markov $k$-tree.

###### Theorem 4.

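Let $k \geq 1$ and $\mathbf{X} = \{X_1, \dots, X_n\}$ be a set of random variables. The optimal Markov $k$-tree model for $\mathbf{X}$ is the one with topology $G = (\mathbf{X}, E)$ that maximizes $\sum_{i=1}^{n} I\big(X_i; \pi_{\hat{E}}(X_i)\big)$, for some acyclic orientation $\hat{E}$ of edges in $E$.

Now we let

$$\Delta_{G, \hat{E}}(\mathbf{X}) = \sum_{i=1}^{n} I\big(X_i; \pi_{\hat{E}}(X_i)\big)$$

By Theorem 2 and the proof of Theorem 3, we know that $\Delta_{G, \hat{E}}(\mathbf{X})$ is invariant under the choice of an acyclic orientation $\hat{E}$ of edges. So we can simply omit $\hat{E}$ and use $\Delta_G(\mathbf{X})$ for $\Delta_{G, \hat{E}}(\mathbf{X})$.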

###### Definition 8.

Let $\mathbf{X} = \{X_1, \dots, X_n\}$ be random variables. Define

$$G^{*} = \arg\max_{G} \{\Delta_G(\mathbf{X})\} \tag{17}$$

to be the topology of the optimal Markov $k$-tree over $\mathbf{X}$.

###### Corollary 1.

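Let $k \geq 1$ and $\mathbf{X} = \{X_1, \dots, X_n\}$ be a set of random variables. Divergence $D_{KL}(P \,\|\, P_G)$ is minimized with the topology $k$-tree $G^{*}$ satisfying (17).

###### Corollary 2.

Let $k \geq 1$ and $\mathbf{X} = \{X_1, \dots, X_n\}$ be a set of random variables. The optimal Markov $k$-tree model for $\mathbf{X}$ is the one with topology graph $G^{*}$ satisfying (17).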
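For very small $n$, the maximization in (17) can be carried out by exhaustive search over creation orders. The sketch below does exactly that and is meant only to illustrate the objective; it is exponential in $n$ and is not the efficient algorithm of Section 4. The function names, the table encoding of the joint distribution, and the test distribution (a first-order binary chain whose neighbors disagree with probability `flip`) are assumptions made here.

```python
from collections import defaultdict
from itertools import combinations, product
from math import log


def marginal(joint, idx):
    """Marginalize a joint table {assignment tuple: prob} onto positions idx."""
    out = defaultdict(float)
    for assignment, p in joint.items():
        out[tuple(assignment[i] for i in idx)] += p
    return out


def mutual_information(joint, i, parents):
    """I(X_i; C) for the variable at position i and the set C at `parents`."""
    parents = tuple(parents)
    if not parents:
        return 0.0  # convention for an empty parent set
    p_xc = marginal(joint, (i,) + parents)
    p_x = marginal(joint, (i,))
    p_c = marginal(joint, parents)
    return sum(
        p * log(p / (p_x[key[:1]] * p_c[key[1:]]))
        for key, p in p_xc.items()
        if p > 0
    )


def best_ktree(joint, n, k):
    """Try every creation order; return (best score, parent sets of a maximizer)."""
    best = (float("-inf"), None)

    def extend(cliques, remaining, parents, score):
        nonlocal best
        if not remaining:
            if score > best[0]:
                best = (score, dict(parents))
            return
        for x in remaining:
            for c in cliques:
                gain = mutual_information(joint, x, sorted(c))
                # Attaching x to k-clique c creates k new k-cliques containing x.
                new = {frozenset(c - {v} | {x}) for v in c}
                parents[x] = tuple(sorted(c))
                extend(cliques | new, remaining - {x}, parents, score + gain)
                del parents[x]

    for base in combinations(range(n), k):
        parents = {v: base[:j] for j, v in enumerate(base)}
        score = sum(mutual_information(joint, v, pa) for v, pa in parents.items())
        extend({frozenset(base)}, frozenset(range(n)) - set(base), parents, score)
    return best


def chain_joint(n, flip):
    """Joint table of a binary first-order chain; neighbors differ w.p. `flip`."""
    joint = {}
    for x in product((0, 1), repeat=n):
        p = 0.5
        for a, b in zip(x, x[1:]):
            p *= flip if a != b else 1 - flip
        joint[x] = p
    return joint


score, parents = best_ktree(chain_joint(4, 0.1), n=4, k=1)
edges = sorted({tuple(sorted((v, p))) for v, pa in parents.items() for p in pa})
print(edges)
```

With $k = 1$ this reduces to a brute-force version of the Chow and Liu procedure, so on data generated by a first-order chain the search should return the chain itself.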

### 3.2 Optimal Markov Networks of Tree Width ≤k

Because $k$-trees are the maximal graphs of tree width $k$, Theorems 3 and 4 are not immediately applicable to optimization of Markov networks of tree width $\leq k$. Our derivations so far have not included the situation that, given $k$, an optimal Markov network of tree width $\leq k$ is not exactly a $k'$-tree, for any $k' \leq k$. In this section, we resolve this issue with a slight adjustment to the objective function $\Delta_G(\mathbf{X})$.

###### Definition 9.

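Let $G = (\mathbf{X}, E)$ be a $k$-tree and $A \subseteq E$ be a binary relation over $\mathbf{X}$. The spanning subgraph $G_A = (\mathbf{X}, A)$ is called the amended graph of $G$ with $A$.

###### Proposition 2.

A graph has tree width $\leq k$ if and only if it is an amended graph $G_A$ of some $k$-tree $G$ with relation $A$.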

###### Definition 10.

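Let $\hat{E}$ be an acyclic orientation for edges in $k$-tree $G = (\mathbf{X}, E)$ and $G_A = (\mathbf{X}, A)$ be an amended graph of $G$. Then the orientation $\hat{A}$ of edges in the graph $G_A$ is defined as, for every pair of $X_i, X_j \in \mathbf{X}$,

$$\langle X_i, X_j \rangle \in \hat{A} \iff \langle X_i, X_j \rangle \in \hat{E} \wedge (X_i, X_j) \in A$$
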
###### Definition 11.

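Let $G = (\mathbf{X}, E)$ be a $k$-tree, where $\mathbf{X} = \{X_1, \dots, X_n\}$. Let $G_A = (\mathbf{X}, A)$ be an amended graph of $G$. Define

$$\Delta_{G, \hat{A}}(\mathbf{X}) = \sum_{i=1}^{n} I\big(X_i; \pi_{\hat{A}}(X_i)\big) \tag{18}$$

where $\hat{A}$ is an acyclic orientation on edges $A$.

We can apply the derivation in the previous section by replacing $\Delta_{G, \hat{E}}(\mathbf{X})$ with $\Delta_{G, \hat{A}}(\mathbf{X})$ and obtain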

###### Corollary 3.

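Let $k \geq 1$ and $\mathbf{X} = \{X_1, \dots, X_n\}$ be a set of random variables. The optimal Markov network of tree width $\leq k$ for $\mathbf{X}$ is the amended graph $G^{*}_{A}$ of some $k$-tree $G$ that satisfies

$$G^{*}_{A} = \arg\max_{G_A, \hat{A}} \{\Delta_{G, \hat{A}}(\mathbf{X})\}$$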

## 4 Efficient Computation with Markov k-Trees

Optimal topology learning is computationally intractable for Markov networks of arbitrary graph topology [10]. Markov $k$-trees are no exception to this obstacle. In particular, we are able to relate the optimal learning of Markov $k$-trees to the following graph-theoretic problem.

###### Definition 12.

Let $k \geq 1$. The Maximum Spanning $k$-Tree (MST) problem is, on an input graph of vertices $\mathbf{X} = \{X_1, \dots, X_n\}$, to find a spanning $k$-tree $G = (\mathbf{X}, E)$ with an acyclic orientation $\hat{E}$ such that

$$\sum_{i=1}^{n} f\big(X_i, \pi_{\hat{E}}(X_i)\big)$$

achieves the maximum.

In the definition, function $f$ is a pre-defined numerical function for a pair: vertex $X_i$ and its parent set $\pi_{\hat{E}}(X_i)$ in the output $k$-tree. MST generalizes the traditional "standard" definition of the maximum spanning $k$-tree problem [4, 9], where the objective function is the sum of weights $w$ on all edges involved in the output $k$-tree; that is,

$$f\big(X_i, \pi_{\hat{E}}(X_i)\big) = \sum_{X \in \pi_{\hat{E}}(X_i)} w(X, X_i) \tag{19}$$

###### Proposition 3.

The Markov $k$-tree learning problem is the problem MST defined in Definition 12 in which, for every $i = 1, \dots, n$, $f\big(X_i, \pi_{\hat{E}}(X_i)\big) = I\big(X_i; \pi_{\hat{E}}(X_i)\big)$, the mutual information between $X_i$ and its parent set $\pi_{\hat{E}}(X_i)$.

It has been proved that, for any fixed $k \geq 2$, the problem MST with the objective function defined with $f$ given in equation (19) is NP-hard [4]. The intractability appears inherent since a number of variants of the problem remain intractable [9]. It implies that optimal topology learning of Markov $k$-trees (for $k \geq 2$) is computationally intractable. Independent works in learning Markov networks of bounded tree width have also confirmed this unfortunate difficulty [44, 47].

### 4.1 Efficient Learning of Optimal Markov Backbone k-Trees

We now consider a class of less general Markov $k$-trees that are of higher order topologies over an underlying linear relation. In particular, such networks carry the signature of relationships among consecutively indexed random variables.

###### Definition 13.

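Let $\mathbf{X} = \{X_1, \dots, X_n\}$ be the set of integer-indexed variables.

• The graph on $\mathbf{X}$ with edge set $\{(X_i, X_{i+1}) : 1 \leq i < n\}$ is called the backbone of $\mathbf{X}$. Each edge $(X_i, X_{i+1})$, for $1 \leq i < n$, is called a backbone edge.

• A graph (resp. $k$-tree) $G = (\mathbf{X}, E)$ is called a backbone graph (resp. backbone $k$-tree) if it contains the backbone of $\mathbf{X}$ as a subgraph.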

###### Definition 14.

Let $k \geq 1$. A Markov network is called a Markov backbone $k$-tree if the underlying topology graph of the Markov network is a backbone $k$-tree.

The linearity relation among random variables occurs naturally in many applications with Markov network modeling, for instance, in speech recognition, cognitive linguistics, and higher-order relations among residues on biological sequences. Typically, the linearity also plays an important role in random systems involving variables associated with time series. For example,

###### Theorem 5.

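For each $k \geq 1$, the topology graph of any finite $k$th-order Markov chain is a backbone $k$-tree.

###### Proof.

This is because a finite $k$th-order Markov chain is defined over labelled variables $X_1, \dots, X_n$ such that $P(X_i \mid X_1, \dots, X_{i-1}) = P(X_i \mid X_{i-k}, \dots, X_{i-1})$, for every $i > k$. Clearly, its topology graph contains all edges $(X_i, X_{i+1})$, for $1 \leq i < n$. In addition, the following creation order asserts that the graph is a $k$-tree: $C_k X_{k+1} C_{k+1} X_{k+2} \dots C_{n-1} X_n$, where $C_i = \{X_{i-k+1}, \dots, X_i\}$, for all $i = k, \dots, n - 1$.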
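The creation order used in this proof can be generated and checked mechanically. The following is a minimal sketch under the assumption that vertices are represented by their integer indices $1, \dots, n$; the function names are hypothetical.

```python
from itertools import combinations


def chain_creation_order(n, k):
    """Creation order C_k X_{k+1} ... C_{n-1} X_n with C_i = {i-k+1, ..., i}."""
    base = list(range(1, k + 1))
    steps = [(list(range(i - k + 1, i + 1)), i + 1) for i in range(k, n)]
    return base, steps


def edges_of(base, steps):
    """Edge set of the graph built by the creation order (base, steps)."""
    edges = {frozenset(e) for e in combinations(base, 2)}
    for clique, x in steps:
        # The attachment set must already be a clique for this to be a k-tree.
        assert all(frozenset(e) in edges for e in combinations(clique, 2))
        edges |= {frozenset((c, x)) for c in clique}
    return edges


n, k = 7, 3
base, steps = chain_creation_order(n, k)
edges = edges_of(base, steps)
backbone = {frozenset((i, i + 1)) for i in range(1, n)}
print(backbone <= edges)  # every backbone edge is present
```

The resulting graph joins exactly the pairs of variables whose indices differ by at most $k$, which matches the conditional independence structure of a $k$th-order chain.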

We now show that learning the optimal topology of Markov backbone $k$-trees can be accomplished much more efficiently than with unconstrained Markov $k$-trees. We relate the learning problem to a special version of the graph-theoretic problem MST that can be computed efficiently. In the rest of this section, we present some technical details for this connection.

###### Definition 15.

Let $\mathcal{F}$ be a class of graphs. Then the $\mathcal{F}$-retaining MST problem is the MST problem in which the input graph includes a spanning subgraph $H \in \mathcal{F}$ that is required to be contained in the output spanning $k$-tree.

The unconstrained MST is simply the $\mathcal{F}$-retaining MST problem with $\mathcal{F}$ being the class of independent set graphs. There are other classes of graphs that are of interest to our work. In particular, we define $\mathcal{B}$ to be the class of all backbones. Then it is not difficult to see:

###### Proposition 4.

The Markov backbone $k$-tree learning problem is the $\mathcal{B}$-retaining MST problem.

In the following, we will show that, for various classes $\mathcal{F}$ that satisfy a certain graph-theoretic property, including class $\mathcal{B}$, the $\mathcal{F}$-retaining MST problems can be solved efficiently.

###### Definition 16.

Let $k \geq 1$ be fixed. A $k$-tree $G$ has bounded branches if every $(k+1)$-clique in $G$ separates $G$ into a bounded number of connected components.

###### Definition 17.

Let $k \geq 1$ be fixed. A graph $H$ is bounded branching-friendly if every $k$-tree that contains $H$ as a spanning subgraph has bounded branches.

###### Lemma 1.

Let $\mathcal{B}$ be the class of all backbones. Any graph in $\mathcal{B}$ is bounded branching-friendly.

###### Proof.

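Let $G$ be a $k$-tree containing a backbone. Let $(k+1)$-clique $\kappa = \{X_{i_1}, \dots, X_{i_{k+1}}\}$ with $i_1 < \dots < i_{k+1}$. Let

$$K = \{\kappa' : \kappa' \text{ is a } (k+1)\text{-clique and } |\kappa \cap \kappa'| = k\}$$

It is not difficult to see that $\kappa$ separates the $k$-tree into at most $|K|$ connected components.

Note that $\kappa$ divides the set $\mathbf{X}$ into at most $k + 2$ intervals of consecutively indexed variables. Let $\kappa_1, \kappa_2 \in K$ be two $(k+1)$-cliques such that vertices $X_a \in \kappa_1 \setminus \kappa$ and $X_b \in \kappa_2 \setminus \kappa$. Then $X_a$ and $X_b$ cannot belong to the same interval in order to guarantee that all backbone edges are included in $G$. This implies that $|K|$ is bounded by the number of intervals, a constant when $k$ is fixed.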

###### Lemma 2.

Let $\mathcal{F}$ be a class of bounded branching-friendly graphs. Then the $\mathcal{F}$-retaining MST problem can be solved in polynomial time for every fixed $k$.

###### Proof.

(We give a sketch for proof.)

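Note that the $\mathcal{F}$-retaining MST problem is defined as follows:

Input: a set of vertices $\mathbf{X} = \{X_1, \dots, X_n\}$ and a graph $H = (\mathbf{X}, E_H) \in \mathcal{F}$,

Output: $k$-tree $G^{*} = (\mathbf{X}, E)$ with $E_H \subseteq E$ such that

$$G^{*} = \arg\max_{G, \hat{E}} \left\{ \sum_{i=1}^{n} f\big(X_i, \pi_{\hat{E}}(X_i)\big) \right\}$$

where $f$ is a pre-defined function.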

First, any $(k+1)$-clique in a $k$-tree that contains $H$ as a spanning subgraph can only separate $H$ into a bounded number of connected components. To see this, assume some $(k+1)$-clique $\kappa$ does the opposite. From the graph $H$ and $\kappa$ we can construct a $k$-tree by augmenting $H$ with additional edges. We do this without connecting any two of the components in $H$ disconnected by $\kappa$. However, the resulting $k$-tree, though containing $H$ as a spanning subgraph, would be separated by $\kappa$ into an unbounded number of components, contradicting the assumption that $H$ is bounded branching-friendly.

Second, consider a process to construct a spanning $k$-tree anchored at any fixed $(k+1)$-clique $\kappa$. Note that by the aforementioned discussion, in the $k$-tree (to be constructed) the number of components separated by $\kappa$ is some constant $c$, possibly a function of $k$, which is fixed. For every component, its connection to $\kappa$ is by some other $(k+1)$-clique $\kappa'$ such that $\kappa' = (\kappa \setminus \{y\}) \cup \{x\}$, where $x$ is drawn from the component to "swap" out $y$, a vertex chosen from $\kappa$. It is easy to see that, once $x$ is chosen, there are only a bounded number of ways (i.e., at most $k + 1$ possibilities) to choose $y$. On the other hand, there are at most $n$ possibilities to choose $x$ from one of the connected components. If we use 1 and 0 to indicate the permission mode 'yes' or 'no' for drawing $x$ from a specified component, an indicator for such permissions for all the components can be represented with $c$ binary bits.

Let $\alpha$ be a binary string of length $c$. We define $M(\kappa, \alpha)$ to be the maximum value of a sub-$k$-tree created around $\kappa$, which is consistent with the permission indicator $\alpha$ for the associated components. We then have the following recurrence for function $M$:

$$M(\kappa, \alpha) = \max_{x, y, \beta, \gamma} \big\{ M(\kappa', \beta) + M(\kappa, \gamma) + f(x, \kappa \setminus \{y\}) \big\}$$

where $\kappa' = (\kappa \setminus \{y\}) \cup \{x\}$ and $f$ is the function defined over the newly created vertex $x$ and its parent set $\kappa \setminus \{y\}$. The recurrence defines $M(\kappa, \alpha)$ with two recursive terms $M(\kappa', \beta)$ and $M(\kappa, \gamma)$, where the permission indicators $\beta$ and $\gamma$ are such that the state corresponding to the component from which $x$ was drawn should be set to '1' by $\beta$ and to '0' by $\gamma$. When a component runs out of vertices, the indicator should be set to '0'. For other components, where $\alpha$ indicates '1', either $\beta$ or $\gamma$ indicates '1' but not both. Finally, the base case for the recurrence is reached when $\alpha$ is all 0's.

The recurrence facilitates a dynamic programming algorithm to compute $M(\kappa, \alpha)$ for all entries of $(k+1)$-clique $\kappa$ and indicator $\alpha$. For fixed $k$ it runs in polynomial time and memory, and both bounds can be improved with a more careful implementation of the idea, e.g., by considering a $k$-clique instead of a $(k+1)$-clique [15, 16].

Combining Proposition 4 and Lemmas 1 and 2, we have

###### Theorem 6.

For every fixed $k \geq 1$, optimal topology learning of Markov backbone $k$-trees can be accomplished in polynomial time.

### 4.2 Efficient Inference with Markov k-Trees

On learned or constructed Markov $k$-trees, inferences and queries of probability distributions can be conducted, typically through computing the maximum posterior probabilities for most probable explanations [24]. The topological nature of Markov $k$-trees offers the advantage of achieving (high) efficiency for such inference computations, especially when values of $k$ are small or moderate.

Traditional methods for inference with constructed Markov networks seek additional independency properties that the models may offer. Typically, the method of clique factorization measures the joint probability of a Markov network in terms of "potentials" $\phi$ such that it can be factorized according to the maximal cliques in the graph: $P(\mathbf{X}) \propto \prod_{\kappa \in \mathcal{K}} \phi_{\kappa}(\mathbf{X}_{\kappa})$, where $\mathcal{K}$ is the set of maximal cliques and $\mathbf{X}_{\kappa}$ is the subset of variables associated with clique $\kappa$ [5]. For general graphs, by the Hammersley-Clifford theorem [18], the condition that guarantees the factorization is non-zero probability $P(\mathbf{X}) > 0$, which may not always be satisfied. For Markov $k$-trees, such factorization suits well because $k$-trees are naturally chordal graphs. Nevertheless, inference with a Markov network may involve tasks more sophisticated than computing joint probabilities.

The $k$-tree topology offers a systematic strategy to compute many optimization functions over the constructed network models. This is briefly explained below.

Every $k$-tree of $n$ vertices, $n > k$, consists of exactly $n - k$ $(k+1)$-cliques. By Proposition 1 in Section 2, if a $k$-tree is defined by creation order $\Phi_{\mathbf{X}}$, all $(k+1)$-cliques uniquely correspond to prefixes of $\Phi_{\mathbf{X}}$. Thus relations among the $(k+1)$-cliques can be defined by relations among the prefixes of $\Phi_{\mathbf{X}}$. For the convenience of discussion, we refer to each $(k+1)$-clique through the prefix of $\Phi_{\mathbf{X}}$ to which it corresponds.

###### Definition 18.

Let be a -tree and be any creation order for . Two -cliques and in are related, denoted with , if and only if their corresponding prefixes and of satisfy

(1) ;

(2) is a prefix of ;

(3) No other prefix of satisfies (1) and of which is a prefix.

###### Proposition 5.

Let $\mathcal{K}$ be the set of all $(k+1)$-cliques in a $k$-tree and consider the relation of cliques given in Definition 18. Then the cliques in $\mathcal{K}$, connected according to this relation, form a rooted, directed tree with directed edges from each child to its parent.

Actually, this tree is a tree decomposition for the $k$-tree graph. Tree decomposition is a technique developed in algorithmic structural graph theory which makes it possible to measure how much a graph is tree-like. In particular, a tree decomposition reorganizes a graph into a tree topology connecting subsets of the graph vertices (called bags, e.g., $(k+1)$-cliques in a $k$-tree) as tree nodes [41, 7]. A heavy overlap is required between neighboring bags to ensure that the vertex connectivity information of the graph is not lost in such a representation. However, finding a tree decomposition with maximum bag size $k + 1$ for a given arbitrary graph (of tree width $k$) is computationally intractable. Fortunately, for a $k$-tree or a backbone $k$-tree generated with the method presented in Section 4.1, an optimal tree decomposition is already available according to Proposition 5.

Tree decomposition makes it possible to compute efficiently a large class of graph optimization problems on graphs of small tree width $k$, which are otherwise computationally intractable on unrestricted graphs. In particular, on the tree decompositions of a $k$-tree, many global optimization computations can be systematically solved in linear time [6, 1, 2, 12]. The central idea is to build one dynamic programming table for every $(k+1)$-clique, and according to the tree decomposition topology, the dynamic programming table for a parent tree node is built from the tables of its child nodes. For every node (i.e., every $(k+1)$-clique), all (partial) solutions associated with the $k + 1$ random variables are maintained, and (partial) solutions associated with variables only belonging to its children nodes are optimally selected conditional upon the variables it shares with its children nodes. The optimal solution associated with the whole network is then present at the root of the tree decomposition. The process takes $O(f(k)\, n)$ time, for some possibly exponential function $f$, where $f(k)$ is the time to build a table, one for each of the $O(n)$ nodes in the tree decomposition. For small or moderately large values of $k$, such algorithms scale linearly with the number of random variables in the network.
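The table-based idea is easiest to see on the topology of a $k$th-order Markov chain, where the tree decomposition is a path. The sketch below is an illustration under stated assumptions rather than the general algorithm: it maximizes a product of conditional factors with one table per step, indexed by the values of the last $k$ variables. The function name `map_kth_order_chain` and the callback `cond(i, v, prev)` are hypothetical, and $k \geq 1$ is assumed.

```python
def map_kth_order_chain(n, k, values, cond):
    """Maximize prod_i cond(i, x_i, prev_i) over assignments x_0, ..., x_{n-1}.

    prev_i holds the values of the (up to k) variables preceding x_i.
    cond is a hypothetical callback standing in for P(x_i | prev_i).
    Returns (best score, best assignment).
    """
    # Table: values of the last (up to k) variables -> (best score, best prefix).
    table = {(): (1.0, ())}
    for i in range(n):
        new_table = {}
        for prev, (score, prefix) in table.items():
            for v in values:
                s = score * cond(i, v, prev)
                key = (prev + (v,))[-k:]  # keep only the last k values
                if key not in new_table or s > new_table[key][0]:
                    new_table[key] = (s, prefix + (v,))
        table = new_table
    return max(table.values())


def cond(i, v, prev):
    """Toy conditional: the first variable prefers 1, later ones copy the last."""
    if not prev:
        return 0.7 if v == 1 else 0.3
    return 0.9 if v == prev[-1] else 0.1


score, assignment = map_kth_order_chain(6, 2, (0, 1), cond)
print(score, assignment)
```

Each table has at most $|\text{values}|^{k}$ entries and each step touches $|\text{values}|^{k+1}$ combinations, so the running time is linear in $n$ with a factor exponential only in $k$, in line with the bound discussed above.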

## 5 Concluding Remarks

We have generalized Chow and Liu's seminal work from Markov trees to Markov networks of tree width $\leq k$. In particular, we have proved that model approximation with Markov networks of tree width $\leq k$ has the minimum information loss when the network topology is a maximum spanning $k$-tree. We have also shown that learning Markov network topology of backbone $k$-trees can be done in polynomial time for every fixed $k$, in contrast to the intractability in learning Markov $k$-trees without the backbone constraint. This result also holds for a broader range of constraints.

The backbone constraint stipulates a linear structure inherent in many Markov networks. Markov backbone $k$-trees are apparently suitable for modeling random systems involving time series. In particular, we have shown that $k$th-order Markov chains are actually Markov backbone $k$-trees. The constrained model is also ideal for modeling systems that possess higher order relations upon a linear structure and has been successful in modeling 3-dimensional structure of bio-molecules and in modeling semantics in computational linguistics, among others.

## Acknowledgement

This research is supported in part by a research grant from the National Institutes of Health (NIGMS-R01) under the Joint NSF/NIH Initiative to Support Research at the Interface of the Biological and Mathematical Sciences.

## References

• [1] Arnborg, S. and Proskurowski, A. (1989) Linear time algorithms for NP-hard problems restricted to partial $k$-trees, Discrete Applied Mathematics, 23, 11-24.
• [2] Arnborg, S., Lagergren, J., and Seese, D. (1991) Easy problems for tree-decomposable graphs, Journal of Algorithms, 12, 308-340.
• [3] Bach, F. and Jordan, MI. (2002) Thin Junction trees, in Dietterich, Becker, and Ghahramani, ed., Advances in Neural Information Processing Systems, 14. MIT Press.
• [4] Bern, M.W. (1987) Network design problems: Steiner trees and spanning $k$-trees, PhD thesis, University of California, Berkeley, CA.
• [5] Bishop, C.M. (2006) Pattern Recognition and Machine Learning, Springer.
• [6] Bodlaender, H.L. (1988) Dynamic programming on graphs with bounded treewidth, Proc. 15th International Colloquium on Automata, Languages and Programming, Lecture Notes in Computer Science, 317, 105-118.
• [7] Bodlaender, H.L. (2006) Treewidth: Characterizations, Applications, and Computations. Proceedings of Workshop in Graph Theory, 1-14.
• [8] Bradley, J. and Guestrin, C. (2010) Learning tree conditional random fields, Proceedings of 21 International Conference on machine Learning, 2010.
• [9] Cai, L. and Maffray, F. (1993) On the spanning $k$-tree problem, Discrete Applied Mathematics, 44, 139-156.
• [10] Chickering, DM., Geiger, D., and Heckerman, D. (1994) Learning bayesian networks is NP-Hard, Technical Rep MSR-TR-94-17, Microsoft Research Advanced Technology Division.
• [11] Chow, CK. and Liu, CN. (1968) Approximating discrete probability distribution with dependence trees, IEEE Transactions on Information Theory, 14: 462-467.
• [12] Courcelle, B. (1990) The monadic second-order logic of graphs. I. recognizable sets of finite graphs, Information and Computation, 85, 12-75.
• [13] Daly, R, Qiang, S. and Aitken, S. (2011), Learning bayesian networks: approaches and issues. Knowledge Engineering Review, vol 26, no. 2, pp. 99-127.
• [14] Dasgupta, S. (1999) Learning polytree. In Laskey and Prade, ed. Proceedings Conference on Uncertainty in AI, 134-141.
• [15] Ding, L., Samad, A., Xue, X., Huang, X., Malmberg, R., and Cai, L. (2014) Stochastic $k$-tree grammar and its application in biomolecular structure modeling. Lecture Notes in Computer Science, 8370, 308-322.
• [16] Ding, L., Xue, X., LaMarca, S., Mohebbi, M., Samad, A., Malmberg, R., and Cai, L. (2014) Accurate prediction of RNA nucleotide interactions with backbone $k$-tree model. Bioinformatics,
• [17] Elidan, G. and Gould, S. (2008) Learning bounded treewidth Bayesian networks, Journal of Machine Learning Research 9, 2699-2731.
• [18] Grimmett, G. R. (1973) A theorem about random fields, Bulletin of the London Mathematical Society, 5 (1): 81-84.
• [19] Gupta, A. and Nishimura, N. (1996) The complexity of subgraph isomorphism for classes of partial $k$-trees. Theoretical Computer Science, 164: 287-298.
• [20] Karger, D., and Srebro, N. (2001) Learning Markov networks: maximum bounded tree-width graphs, Proceedings of 12th ACM-SIAM Symposium on Discrete Algorithms.
• [21] Kindermann, R. and Snell, J.L. (1980). Markov Random Fields and Their Applications, American Mathematical Society.
• [22] Kinney, J. B. and Atwal, G. S. (2014) Equitability, mutual information, and the maximal information coefficient. Proceedings of the National Academy of Sciences, 111:9, 3354-3359.
• [23] Koivisto, M., and Sood, K. (2004) Exact Bayesian structure discovery in Bayesian networks. Journal of Machine Learning Research, 5:549-573.
• [24] Koller, D. and N. Friedman (2010) Probabilistic Graphical Models: Principles and Techniques, MIT Press.
• [25] Koller, D. and Sahami, M. (1996) Toward optimal feature selection. Proceedings of the Thirteenth International Conference on Machine Learning, 284-292.
• [26] Kramer, N., Schafer, J., and Boulesteix, A. (2009) Regularized estimation of large-scale gene association networks using graphical Gaussian models. BMC bioinformatics, 10 (1):1.
• [27] Kullback, S. and Leibler, RA. (1951) On information and sufficiency, Annals of Mathematical Statistics, 22(1):79-86.
• [28] Kwisthout, JH., Bodlaender, HL, van der Gaag, LC. (2010) The necessity of bounded treewidth for efficient inference in Bayesian networks, Proceedings 19th European Conference on Artificial Intelligence, 237-242.
• [29] Lee, J., and Jun, CH. (2015) Classification of high dimensionality data through feature selection using Markov blanket, Industrial Engineering and Management Systems, 14:2, 210-219.
• [30] Lee, S., Ganapathi, V., and Koller, D. (2006) Efficient structure learning of Markov networks using $L_1$-regularization. In Advances in Neural Information Processing Systems, 817-824.
• [31] Lewis, PM, II (1959) Approximating probabilistic distributions to reduce storage requirements, Information and Control, 2, 214-225.
• [32] Mao, Q., Wang, L., Goodison, S., and Sun, Y. (2015) Dimensionality reduction via graph structure learning, Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 765-774.
• [33] Matousek, J. and Thomas, R. (1992) On the complexity of finding iso- and other morphisms for partial k-trees, Discrete Mathematics, 108, 343-364.
• [34] Meek, C. (2001) Finding a path is harder than finding a tree, Journal of Artificial Intelligence Research, 15: 383-389.
• [35] Meila, M. and Jordan, MI (2000) Learning with mixtures of trees, Journal of Machine Learning Research, 1: 1-48.
• [36] Nagarajan, R., Scutari, M., and Lebre., S. (2013) Bayesian Networks in R with Applications in Systems Biology, Springer.
• [37] Narasimhan, M. and Bilmes, J. (2003) Pac-learning bounded tree-width graphical models, in Chickering and Halpern, ed., Proceedings of Conference on Uncertainty in AI.
• [38] Patil, H. P. (1986) On the structure of $k$-tree. Journal of Combinatorics, Information and System Sciences, 11 (2-4):57-64.
• [39] Pearl, J., (1988) Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann Series in Representation and Reasoning.
• [40] Robertson, N. and Seymour, P.D. (1984) Graph minors III: Planar tree-width, Journal of Combinatorial Theory, Series B 36 (1): 49-64.
• [41] Robertson, N. and Seymour, P.D. (1986) Graph minors II. Algorithmic aspects of tree-width, Journal of Algorithms, 7, 309-322.
• [42] Rue, H. and Held, L. (2005) Gaussian Markov random fields: theory and applications. CRC Press.
• [43] Shannon, C.E. (1948) A Mathematical Theory of Communication, Bell System Technical Journal, 27, pp. 379-423 & 623-656.
• [44] Srebro, N. (2003) Maximum likelihood bounded tree-width Markov networks, Artificial Intelligence, 143, 123-138.
• [45] Sun, Y. and Han, J. (2013) Mining heterogeneous information networks: a structural analysis approach, ACM SIGKDD Explorations Newsletter, 14:2, 20–28.
• [46] Suzuki, J. (1999) Learning Bayesian belief networks based on the MDL principle: An efficient algorithm using the branch and bound technique. IEICE TRANSACTIONS on Information and Systems, 82(2):356-367.
• [47] Szántai, T. and Kovács, E. (2012) Hypergraphs as a mean of discovering the dependence structure of a discrete multivariate probability distribution. Annals of Operations Research, vol 193, pp 71-90.
• [48] Teyssier, M. and Koller, D. (2005) Ordering-based search: a simple and effective algorithm for learning Bayesian networks, Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, 584-590.
• [49] Timme, N., Alford, W., Flecker, B., and Beggs, J.M. (2014) Multivariate information measures: an experimentalist’s perspective. Journal of Computational Neuroscience, 36 (2), pp 119-140.
• [50] Tomczak, K., Czerwinska, P. and Wiznerowicz, M. (2015) The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Contemporary Oncology (Pozn), 19(1A): A68-77.
• [51] Torti, SV. and Torti, FM. (2013) Iron and cancer: more ore to be mined. Nature Reviews Cancer, 13(5): 342-55.
• [52] Valiant, L.G. (1975) General context-free recognition in less than cubic time, Journal of Computer and System Sciences 10 (2): 308-314.
• [53] Wu, X., Zhu, X., Wu, G., and Ding, W. (2014) Data mining with big data, IEEE Transactions on Knowledge and Data Engineering, 26:1, 97–107.
• [54] Xu, Y., Cui, J., and Puett, D. (2014) Cancer Bioinformatics, Springer.
• [55] Yuan, C., and Malone, B., (2013) Learning optimal Bayesian networks: A shortest path perspective. J. Artificial Intelligence Research 48:23-65.
• [56] Yin, W., Garimalla, S., Moreno, A., Galinski, MR., and Styczynski, MP. (2015) A tree-like Bayesian structure learning algorithm for small-sample datasets from complex biological model systems, BMC Systems Biology 9:49.
• [57] Zhang, Z., Bai, L., Liang, Y., and Hancock, ER. (2015) Adaptive graph learning for unsupervised feature selection, Computer Analysis of Images and Patterns LNCS Vol 9256, 790-800.
• [58] Zhao, I, Zhou, Y., Zhang, X., and Chen, L. (2015) Part mutual information for quantifying direct associations in networks, Proceedings of the National Academy of Sciences, 113: 18, 5130-5135.

## Appendix A

###### Theorem 7.

The probability function expressed in (6) for a Markov $k$-tree $G$ remains the same regardless of the choice of creation order for $G$.
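Before the proof, the claim can be checked numerically on a small instance. The Python sketch below is an illustration under assumptions of ours: it takes (6) to have the chain form suggested by the basis case of the proof, namely the marginal of the initial $k$-clique multiplied by one conditional probability per added variable given the $k$-clique it is attached to. The joint distribution, the 2-tree, and all function names are hypothetical example data.

```python
import random
from itertools import product

def marginal(joint, keep):
    """Sum the joint distribution down to the variables listed in `keep`."""
    out = {}
    for x, p in joint.items():
        key = tuple(x[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

def k_tree_probability(joint, first_clique, steps, x):
    """P(first clique) times one conditional P(v | clique) per creation step."""
    value = marginal(joint, first_clique)[tuple(x[i] for i in first_clique)]
    for clique, v in steps:
        scope = clique + (v,)
        numerator = marginal(joint, scope)[tuple(x[i] for i in scope)]
        denominator = marginal(joint, clique)[tuple(x[i] for i in clique)]
        value *= numerator / denominator
    return value

# Arbitrary strictly positive joint distribution over four binary variables.
random.seed(0)
states = list(product((0, 1), repeat=4))
weights = [random.random() + 0.1 for _ in states]
joint = {x: w / sum(weights) for x, w in zip(states, weights)}

# Two creation orders of the same 2-tree with edges 01, 02, 12, 13, 23.
order_a = ((0, 1), [((0, 1), 2), ((1, 2), 3)])
order_b = ((1, 2), [((1, 2), 3), ((1, 2), 0)])

# Largest disagreement between the two factorizations over all assignments.
max_gap = max(
    abs(k_tree_probability(joint, *order_a, x)
        - k_tree_probability(joint, *order_b, x))
    for x in states
)
print(max_gap)
```

Both orders reduce to the same ratio of clique marginals to the marginal of the shared pair of variables 1 and 2, so any difference printed here should be at the level of floating-point rounding.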

###### Proof.

Let $\Psi_X$ and $\Phi_X$ be two creation orders for $k$-tree $G$. For the same $k$-tree $G$, we define $P_G(X|\Psi_X)$ and $P_G(X|\Phi_X)$ to be, respectively, the joint probability distribution functions of $X$ under the topology graph $G$ with creation orders $\Psi_X$ and $\Phi_X$. Without loss of generality, we further assume that $\Phi_X$ is the creation order for $G$ as given in Definition 3, of the following form:

By induction on $n$, we prove the statement that $P_G(X|\Psi_X) = P_G(X|\Phi_X)$.

Basis: when $n = k$,

$$P_G(X|\Psi_X) = P(X_1,\dots,X_k|\Psi_X) = P(X_1,\dots,X_k) = P(C_k) = P_G(X|\Phi_X)$$

Assumption: the statement holds for any $k$-tree of $n-1$ random variables.

Induction: Let , where , be a -tree. Then it is not hard to see that subgraph of , after variable is removed, is indeed a -tree for random variables . Furthermore, according to Definition 3, is a creation order for -tree .

Now we assume creation order $\Psi_X$ for graph $G$ to be

$$D_k Y_{k+1} D_{k+1} Y_{k+2} \dots D_{n-1} Y_n$$

where , and for all . Then there are two possible cases for to be in the creation order .

Case 1. .

Because shares edges only with variables in , we have and . Therefore,

$$\Psi_{X'} = D_{k+1} Y_{k+2} \dots D_{n-1} Y_n$$

is a creation order for subgraph . Thus, based on (6), we have