Efficient Learning of Optimal Markov Network Topology
with k-Tree Modeling
Abstract
The seminal work of Chow and Liu (1968) shows that approximation of a finite probabilistic system by Markov trees can achieve the minimum information loss with the topology of a maximum spanning tree. Our current paper generalizes the result to Markov networks of tree width $\le k$, for every fixed $k \ge 2$. In particular, we prove that approximation of a finite probabilistic system with such Markov networks has the minimum information loss when the network topology is achieved with a maximum spanning $k$-tree. While constructing a maximum spanning $k$-tree is intractable even for $k = 2$, we show that polynomial algorithms can be ensured by a sufficient condition accommodated by many meaningful applications. In particular, we prove an efficient algorithm for learning the optimal topology of higher order correlations among random variables that belong to an underlying linear structure.
Keywords: Markov network, joint probability distribution function, k-tree, spanning graph, tree width, Kullback-Leibler divergence, mutual information
1 Introduction
We are interested in effective modeling of complex systems whose behavior is determined by relationships among uncertain events. When such uncertain events are quantified as random variables $X = \{X_1, \dots, X_n\}$, the system is characterizable with an underlying joint probability distribution function $P(X)$ [39, 24]. Nevertheless, estimating $P(X)$ from observed event data poses challenges, as relationships among events may be intrinsic and the $n$th-order function $P(X)$ may be very difficult, if not impossible, to compute. A viable solution is to approximate $P(X)$ with lower-order functions, as a result of constraining the dependency relationships among random variables. Such constraints can be well characterized with the notion of a Markov network, where the dependency relation of random variables is defined by a non-directed graph $G = (X, E)$, with the edge set $E$ specifying the topology of the dependency relation. In a Markov network, two variables not connected by an edge are independent conditional on the rest of the variables [21, 39, 42].
Model approximation with Markov networks needs to address two essential issues: the quality of the approximation, and the feasibility of computing such approximated models. Chow and Liu [11] were the first to address both issues in investigating networks of tree topology. They measured the information loss with the Kullback-Leibler divergence [27] between $P(X)$, the unknown distribution function, and $P_G(X)$, the distribution function estimated under the dependency graph $G$. They showed that the minimum information loss is guaranteed by the topology corresponding to a maximum spanning tree, which can be computed very efficiently (in linear time in the number of random variables).
It has been difficult to extend this seminal work of simultaneously optimal and efficient learning to the realm of arbitrary Markov networks, due to the computational intractability of the problem [10]. To overcome the barrier, research in Markov network learning has sought heuristic algorithms that are efficient yet offer no optimality guarantee for the learned topology [30, 23, 48, 24, 13, 55]. A more viable approach has been to consider the nature of probabilistic networks formulated from real-world applications, which are often constrained and typically tree-like [14, 35, 20, 3, 44, 37, 17, 8, 47, 56]. All such models can be quantitatively characterized with networks that have tree width $\le k$, for small $k$. Tree width is a metric that measures how tree-like a graph is [7]. It is closely related to the maximum size of a graph separator, i.e., a characteristic of any set $S$ of random variables upon which two (sets of) random variables $Y$ and $Z$ are conditionally independent. Such networks have the advantage of supporting efficient inference and other queries over constructed models [28]. In addition, many real-world applications actually imply Markov (Bayesian) networks of small tree width. In spite of these advantages, however, optimal learning of such networks is unfortunately difficult; for example, it is intractable even when tree width $k = 2$ [20, 44]. Hence, the techniques for learning such Markov networks have relied heavily on heuristics that may not guarantee the quality.
In this paper, we generalize the seminal work of Chow and Liu from Markov trees to Markov networks of tree width $\le k$. First, we prove that model approximation with Markov networks of tree width $\le k$ has the minimum information loss when the network topology is the one for a maximum spanning $k$-tree. A $k$-tree is a maximal graph of tree width $k$ to which no edge can be added without an increase of tree width. Therefore the quality of such Markov networks of tree width $\le k$ is also optimal with a maximum spanning $k$-tree. Second, we reveal sufficient conditions, satisfiable by many applications, which guarantee polynomial algorithms to compute the maximum spanning $k$-tree. That is, under such a condition, the optimal topology of Markov networks with tree width $\le k$ can be learned in polynomial time for every fixed $k$.
We organize this paper as follows. Section 2 presents preliminaries and introduces the Markov $k$-tree model. In Section 3, we prove that the Kullback-Leibler divergence $D_{KL}(P \,\|\, P_G)$ is minimized when the approximated distribution function $P_G$ is estimated with a maximum spanning $k$-tree $G$. In Section 4, we present some conditions, each being sufficient for permitting polynomial-time algorithms to compute a maximum spanning $k$-tree and thus to learn the optimal topology of Markov networks of tree width $\le k$. Efficient inference with Markov $k$-tree models is also discussed in Section 4. We conclude in Section 5 with some further remarks.
2 Markov k-Tree Model
The probabilistic modeling of a finite system involves random variables $X = \{X_1, \dots, X_n\}$ with observable values. Such systems are thus characterized by dependency relations among the random variables, which may be learned by estimating the joint probability distribution function $P(X)$. Inspired by Chow and Liu’s work [11], we are interested in approximating the $n$th-order dependency relation among random variables by a 2nd-order (i.e., pairwise) relation. Our work first concerns how to minimize the information loss in the approximation.
A pairwise dependency relation between random variables $X = \{X_1, \dots, X_n\}$ can be defined with a graph $G = (X, E)$ (here we slightly abuse notation: we use the same symbol $X_i$ for a random variable and its corresponding vertex in the topology graph, and rely on context to distinguish them), where the binary relation $E \subseteq X \times X$ is defined such that $(X_i, X_j) \notin E$ if and only if variables $X_i$ and $X_j$ are independent conditional only on the rest of the variables, i.e., letting $Y = X \setminus \{X_i, X_j\}$,
(1) $(X_i, X_j) \notin E \iff P(X_i, X_j \mid Y) = P(X_i \mid Y)\, P(X_j \mid Y)$
We call $G = (X, E)$ a Markov network, with $G$ as its topology graph, over the random variable set $X$.
Condition (1) is called the pairwise property. By [24], it is equivalent to the following global property for positive density. Let $Y, Z \subseteq X$ be two disjoint subsets and $S \subseteq X$. If $S$ separates $Y$ and $Z$ in the graph $G$ (i.e., removing $S$ disconnects $Y$ and $Z$ in the graph), then
(2) $P(Y, Z \mid S) = P(Y \mid S)\, P(Z \mid S)$
For the convenience of our discussion, we will assume our Markov networks to have the global property.
Equation (2) states that random variables in $Y$ and in $Z$ are independent conditional on variables in $S$. The minimum cardinality of $S$ characterizes the degree of condition for the independency between $Y$ and $Z$. The conditional independency of $X$ is thus characterized by the upper bound of minimum cardinalities of all such $S$ separating the graph $G$. This number is closely related to the notion of tree width [41]. In this paper, we are interested in Markov networks whose topology graphs are of tree width $\le k$, for a given integer $k \ge 1$ [20].
2.1 k-Tree and Creation Order
Intuitively, tree width [7] is a metric for how tree-like a graph is. It plays a central role in the development of structural graph theory and, in particular, in the well-known graph minor theorems [41] by Robertson and Seymour. There are several alternative definitions of tree width. For the work presented in this paper, we relate its definition to the concept of a $k$-tree.
Definition 1.
[38] Let $k \ge 1$ be an integer. The class of $k$-trees consists of graphs that are defined inductively as follows:
1. A $k$-tree of $k+1$ vertices is a clique of $k+1$ vertices;
2. A $k$-tree of $n$ vertices, for $n > k+1$, is a graph consisting of a $k$-tree $G$ of $n-1$ vertices and a new vertex $v$ not in $G$, such that $v$ forms a $(k+1)$-clique with some $k$-clique already in $G$.
Definition 2.
[7] For every fixed $k \ge 1$, a graph is said to be of tree width $\le k$ if and only if it is a subgraph of some $k$-tree.
In particular, all trees are $k$-trees for $k = 1$ and they have tree width 1. Graphs of tree width 1 are simply forests. For $k \ge 1$, a $k$-tree can be succinctly represented with the vertices in the order in which they are introduced to the graph by the inductive process in Definition 1.
Definition 3.
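Let $k \ge 1$ be an integer. A creation order $\Phi_X$ for a $k$-tree $G = (X, E)$, where $X = \{X_1, \dots, X_n\}$, is defined as an ordered sequence, inductively
(3) $\Phi_X = C X_i$ if $|X| = k+1$ and $X = C \cup \{X_i\}$; otherwise $\Phi_X = \Phi_{X \setminus \{X_i\}}\, C X_i$ with $C \subseteq X \setminus \{X_i\}$,
where $C$ is a $k$-clique and $C \cup \{X_i\}$ is a $(k+1)$-clique.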
Note that creation orders may not be unique for the same $k$-tree. The succinct notation facilitates discussion of properties of $k$-trees. In particular, because Markov networks are non-directed graphs, the notion of creation order offers a convenient mechanism to discuss dependency between random variables modeled by the networks.
Proposition 1.
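Let $\Phi_X$ be a creation order for a $k$-tree $G = (X, E)$. Then $\kappa \subseteq X$ is a $(k+1)$-clique in $G$ if and only if there are $X_i \in X$ and $C \subset X$ such that $\Phi' C X_i$ is a prefix of $\Phi_X$ for some (possibly empty) sequence $\Phi'$, where $\kappa = C \cup \{X_i\}$.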
Definition 4.
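Let $k \ge 1$ be an integer and $\Phi_X$ be a creation order of some $k$-tree $G = (X, E)$, where $X = \{X_1, \dots, X_n\}$. Then $\Phi_X$ induces an acyclic orientation $\hat{E}$ for $E$ such that, given any two different variables $X_i, X_j \in X$,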
(4) 
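In addition, for any variable $X_i \in X$, its parent set is defined as $\pi_{\hat{E}}(X_i) = \{X_j : \langle X_j, X_i \rangle \in \hat{E}\}$, i.e., the set of variables with an edge directed into $X_i$.
Based on Definition 4, it is clear that the orientation induced by any creation order of a $k$-tree is acyclic. In particular, there exists exactly one variable $X_i$ with $\pi_{\hat{E}}(X_i) = \emptyset$.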
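The inductive construction of Definitions 1, 3 and 4 can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the creation order is encoded as a list of `(clique, vertex)` pairs (a hypothetical encoding, not the paper's notation), and edges are assumed to be oriented from the earlier-introduced variable to the later one, with the base clique ordered as listed.

```python
from itertools import combinations


def ktree_from_creation_order(order, k):
    """Replay a creation order given as [(C, v), ...] (hypothetical encoding).

    The first pair (C, v) defines the base (k+1)-clique C + {v}; every later
    pair attaches the new vertex v to an existing k-clique C.
    Returns (edges, parents): undirected edges as frozensets, and the parent
    set of each vertex under the assumed earlier-to-later orientation.
    """
    first_clique, first_vertex = order[0]
    base = list(first_clique) + [first_vertex]
    if len(set(base)) != k + 1:
        raise ValueError("base clique must have k + 1 distinct vertices")
    edges = {frozenset(pair) for pair in combinations(base, 2)}
    # Inside the base clique, each vertex has the vertices listed before it as parents.
    parents = {v: tuple(base[:idx]) for idx, v in enumerate(base)}
    for clique, v in order[1:]:
        if v in parents:
            raise ValueError(f"vertex {v!r} was already introduced")
        if len(set(clique)) != k or any(u not in parents for u in clique):
            raise ValueError(f"{clique!r} is not a set of k existing vertices")
        if any(frozenset(pair) not in edges for pair in combinations(clique, 2)):
            raise ValueError(f"{clique!r} is not a clique in the current graph")
        edges.update(frozenset((u, v)) for u in clique)
        parents[v] = tuple(clique)
    return edges, parents


# A 2-tree on five vertices.
example_order = [((1, 2), 3), ((2, 3), 4), ((3, 4), 5)]
example_edges, example_parents = ktree_from_creation_order(example_order, k=2)
```

In this example exactly one vertex (vertex 1) has an empty parent set, in line with the remark following Definition 4.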
2.2 Relative Entropy and Mutual Information
According to Shannon’s information theory [43], the uncertainty of a discrete random variable $X$ can be measured with entropy $H(X) = -\sum_{x} P(x) \log P(x)$, where $P$ is the probability distribution for $X$ and the sum takes all values $x$ in the range of $X$. Entropy for a set of random variables $Y$ is $H(Y) = -\sum_{y} P(y) \log P(y)$, where the sum takes all combined values $y$ in the ranges of variables in $Y$.
Definition 5.
[27] The Kullback-Leibler divergence between two probability distributions $P$ and $Q$ of the same random variable set $X$ is defined as
(5) $D_{KL}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} \ge 0$
where the sum takes combined values $x$ in the ranges of all random variables in $X$.
The last equality holds if and only if $P = Q$. $D_{KL}(P \,\|\, Q)$ can be used to measure the information gain by distribution $P$ over distribution $Q$, or the information loss by approximation of $P$ with $Q$, as in this work.
Let $X$ and $Y$ be two random variables. The mutual information between $X$ and $Y$, denoted as $I(X, Y)$, is defined as the Kullback-Leibler divergence between their joint probability distribution and the product of marginal distributions, i.e.,
$I(X, Y) = D_{KL}\big(P(X, Y) \,\|\, P(X) P(Y)\big) = \sum_{x, y} P(x, y) \log \frac{P(x, y)}{P(x)\, P(y)}$
Mutual information measures the degree of the correlation between the two random variables. In this work, we slightly extend the mutual information to include more than two variables.
Definition 6.
Let $X_i$ be a random variable and $Y$ be a set of random variables. Then
$I(X_i, Y) = \sum_{x_i, y} P(x_i, y) \log \frac{P(x_i, y)}{P(x_i)\, P(y)}$
where the sum takes combined values $x_i$ and $y$ in the ranges of $X_i$ and all random variables in $Y$.
Note that $I(X_i, Y)$ is not the same as the multivariate mutual information defined elsewhere [49]. We further point out that mutual information has recently received much attention due to its capability in decoding hidden information from big data [22, 58]. In our derivation of optimal Markov $k$-tree modeling, estimating mutual information stands out as a critical condition that needs to be satisfied (see next section). These phenomena are unlikely coincidences.
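The quantities of this subsection can be computed directly from a tabulated joint distribution. The following Python sketch is illustrative only; it assumes the joint distribution is given as a dictionary from value tuples to probabilities (an encoding chosen here for convenience) and uses natural logarithms. The function names are hypothetical.

```python
import math
from collections import defaultdict


def marginal(joint, idxs):
    """Marginal distribution of the variables at positions idxs."""
    out = defaultdict(float)
    for x, p in joint.items():
        out[tuple(x[i] for i in idxs)] += p
    return dict(out)


def entropy(dist):
    """Shannon entropy H = -sum p log p (in nats)."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def kl_divergence(p, q):
    """D_KL(P || Q); assumes q[x] > 0 wherever p[x] > 0."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)


def mutual_information(joint, i, parent_idxs):
    """I(X_i, Y) between variable i and the variable set Y = parent_idxs."""
    if not parent_idxs:
        return 0.0  # an empty set carries no information about X_i
    pa = tuple(parent_idxs)
    p_i = marginal(joint, (i,))
    p_pa = marginal(joint, pa)
    p_both = marginal(joint, (i,) + pa)
    return sum(p * math.log(p / (p_i[key[:1]] * p_pa[key[1:]]))
               for key, p in p_both.items() if p > 0)


# X0 and X1 are identical fair bits; X2 is an independent fair bit.
example_joint = {(a, a, c): 0.25 for a in (0, 1) for c in (0, 1)}
```

With this example, the mutual information between the first two variables equals $\log 2$, while the third variable shares no information with the other two.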
2.3 Markov k-Tree Model
Definition 7.
Let $k \ge 1$ be an integer. A Markov $k$-tree is a Markov network over $n$ random variables $X = \{X_1, \dots, X_n\}$ with a topology graph $G = (X, E)$ that is a $k$-tree. We denote with $P_G(X)$ the joint probability distribution function of the Markov $k$-tree.
Theorem 1.
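Let $k \ge 1$ and $X = \{X_1, \dots, X_n\}$ be random variables with joint probability distribution function $P(X)$. Let $G = (X, E)$ be a Markov $k$-tree and $\hat{E}$ be an acyclic orientation for its edges. Then
(6) $P_G(X) = \prod_{i=1}^{n} P\big(X_i \mid \pi_{\hat{E}}(X_i)\big)$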
Proof.
Assume that the acyclic orientation is induced by creation order
where
(7) 
The last equality holds for the reason that is independent from variables in conditional on variables in , as shown in the following derivation. Assume . Use the Bayes,
The derivation in (7) also results in recurrence
(8) 
Solving the recurrence yields
(9) 
where . Because for , , it is not hard to prove that
(10) 
Though the above proof is based on an explicit creation order for the $k$-tree $G$, the probability function computed for $P_G(X)$ is actually independent of the choice of a creation order, as demonstrated in the following.
Theorem 2.
The probability function expressed in (6) for a Markov $k$-tree $G$ remains the same regardless of the choice of creation order for $G$.
A proof is given in Appendix A (Theorem 7).
3 Model Optimization
3.1 Optimal Markov k-Trees
Let $X = \{X_1, \dots, X_n\}$ be random variables in any unconstrained probabilistic model $P(X)$. Let $k \ge 1$ and the $k$-tree $G = (X, E)$ be a topology graph of a Markov network $P_G(X)$ that approximates $P(X)$. The approximation can be measured using the Kullback-Leibler divergence between the two models [27, 31]:
(11) $D_{KL}(P \,\|\, P_G) = \sum_{x} P(x) \log \frac{P(x)}{P_G(x)} \ge 0$
where $x$ is any combination of values for all random variables in $X$, and the last equality holds if and only if $P = P_G$.
Thus the problem of optimally approximating $P(X)$ is to find a Markov $k$-tree with topology $G$ that minimizes the divergence (11). We are now ready for our first main theorem.
Theorem 3.
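$D_{KL}(P \,\|\, P_G)$ is minimized when the topology $k$-tree $G = (X, E)$ for random variables $X = \{X_1, \dots, X_n\}$ is one that maximizes the sum of mutual information
(12) $\sum_{i=1}^{n} I\big(X_i, \pi_{\hat{E}}(X_i)\big)$
where $\hat{E}$ is any acyclic orientation for the edges in $E$.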
Proof.
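Assume $\hat{E}$ to be an acyclic orientation for the edges in $E$. Applying equation (6) to $P_G$ in (11), and writing $\pi_{\hat{E}}(x_i)$ for the values that $x$ assigns to the variables in $\pi_{\hat{E}}(X_i)$, we have
$D_{KL}(P \,\|\, P_G) = \sum_{x} P(x) \log P(x) - \sum_{x} P(x) \sum_{i=1}^{n} \log P\big(x_i \mid \pi_{\hat{E}}(x_i)\big)$
The first term on the right hand side (RHS) of the above is $-H(X)$, where $H(X)$ is the entropy of $X$. The second term can be further expanded as
(13) $-\sum_{x} P(x) \sum_{i=1}^{n} \log P\big(x_i \mid \pi_{\hat{E}}(x_i)\big) = -\sum_{x} P(x) \sum_{i=1}^{n} \log \frac{P\big(x_i, \pi_{\hat{E}}(x_i)\big)}{P(x_i)\, P\big(\pi_{\hat{E}}(x_i)\big)} - \sum_{x} P(x) \sum_{i=1}^{n} \log P(x_i)$
Because $x_i$ and $\pi_{\hat{E}}(x_i)$ are components of $x$, the second term in (13) can be formulated as
(14) $-\sum_{x} P(x) \sum_{i=1}^{n} \log P(x_i) = -\sum_{i=1}^{n} \sum_{x_i} P(x_i) \log P(x_i) = \sum_{i=1}^{n} H(X_i)$
where $x_i$ is any value in the range of random variable $X_i$.
Also the first term in (13) gives
(15) $-\sum_{i=1}^{n} \sum_{x_i, \pi_{\hat{E}}(x_i)} P\big(x_i, \pi_{\hat{E}}(x_i)\big) \log \frac{P\big(x_i, \pi_{\hat{E}}(x_i)\big)}{P(x_i)\, P\big(\pi_{\hat{E}}(x_i)\big)} = -\sum_{i=1}^{n} I\big(X_i, \pi_{\hat{E}}(X_i)\big)$
where $\pi_{\hat{E}}(x_i)$ is any combination of values of all random variables in $\pi_{\hat{E}}(X_i)$ and $I\big(X_i, \pi_{\hat{E}}(X_i)\big)$ is the mutual information between variable $X_i$ and its parent set $\pi_{\hat{E}}(X_i)$.
Therefore,
(16) $D_{KL}(P \,\|\, P_G) = -\sum_{i=1}^{n} I\big(X_i, \pi_{\hat{E}}(X_i)\big) + \sum_{i=1}^{n} H(X_i) - H(X)$
Since $H(X_i)$ and $H(X)$ are independent of the choice of $G$ (and of the acyclic orientation for the edges), $D_{KL}(P \,\|\, P_G)$ is minimized when $\sum_{i=1}^{n} I\big(X_i, \pi_{\hat{E}}(X_i)\big)$ is maximized.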
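The decomposition of the divergence into mutual information and entropy terms can be checked numerically on a small example. The sketch below is an illustration under stated assumptions, not part of the proof: it draws a random positive joint distribution over binary variables, builds the factorized model from a given assignment of parent sets (here those of a 2-tree on four variables, chosen for illustration), and compares the divergence with the entropy and mutual information expression. All function names are hypothetical.

```python
import itertools
import math
import random
from collections import defaultdict


def marginal(joint, idxs):
    """Marginal distribution of the variables at positions idxs."""
    out = defaultdict(float)
    for x, p in joint.items():
        out[tuple(x[i] for i in idxs)] += p
    return dict(out)


def entropy(dist):
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def kl_to_factorized_model(joint, parents):
    """D_KL(P || P_G), where P_G(x) is the product of P(x_i | parents of x_i)."""
    fam = {i: marginal(joint, (i,) + pa) for i, pa in parents.items()}
    par = {i: marginal(joint, pa) for i, pa in parents.items()}
    total = 0.0
    for x, p in joint.items():
        log_pg = 0.0
        for i, pa in parents.items():
            pa_vals = tuple(x[j] for j in pa)
            log_pg += math.log(fam[i][(x[i],) + pa_vals] / par[i][pa_vals])
        total += p * (math.log(p) - log_pg)
    return total


def entropy_mi_expression(joint, parents):
    """-sum_i I(X_i, pa_i) + sum_i H(X_i) - H(X)."""
    total = -entropy(joint)
    for i, pa in parents.items():
        h_i = entropy(marginal(joint, (i,)))
        h_pa = entropy(marginal(joint, pa))
        h_fam = entropy(marginal(joint, (i,) + pa))
        mi = h_i + h_pa - h_fam  # standard identity for I(X_i, pa_i)
        total += h_i - mi
    return total


def random_joint(num_vars, seed=0):
    """A strictly positive random joint distribution over binary variables."""
    rng = random.Random(seed)
    states = list(itertools.product((0, 1), repeat=num_vars))
    weights = [rng.random() + 0.1 for _ in states]
    z = sum(weights)
    return {s: w / z for s, w in zip(states, weights)}


# Parent sets of a 2-tree on four variables (illustrative choice).
example_parents = {0: (), 1: (0,), 2: (0, 1), 3: (1, 2)}
example_joint = random_joint(4)
lhs = kl_to_factorized_model(example_joint, example_parents)
rhs = entropy_mi_expression(example_joint, example_parents)
```

The two values `lhs` and `rhs` agree up to floating-point error, which is what the derivation predicts for any choice of parent sets.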
Although Theorem 3 is about optimal approximation of probabilistic systems with Markov $k$-trees, it also characterizes the optimal Markov $k$-tree.
Theorem 4.
Let $k \ge 1$ and $X = \{X_1, \dots, X_n\}$ be a set of random variables. The optimal Markov $k$-tree model for $X$ is the one with topology $G = (X, E)$ that maximizes $\sum_{i=1}^{n} I\big(X_i, \pi_{\hat{E}}(X_i)\big)$, for some acyclic orientation $\hat{E}$ of edges in $E$.
Now we let
By Theorem 2 and the proof of Theorem 3, we know that is invariant of the choice of an acyclic orientation of edges. So we can simply omit and use for .
Definition 8.
Let be random variables. Define
(17) 
to be the topology of optimal Markov tree over .
Corollary 1.
Let $k \ge 1$ and $X = \{X_1, \dots, X_n\}$ be a set of random variables. Divergence $D_{KL}(P \,\|\, P_G)$ is minimized with the topology $k$-tree $G$ satisfying (17).
Corollary 2.
Let $k \ge 1$ and $X = \{X_1, \dots, X_n\}$ be a set of random variables. The optimal Markov $k$-tree model for $X$ is the one with topology graph $G$ satisfying (17).
3.2 Optimal Markov Networks of Tree Width $\le k$
Because $k$-trees are the maximum graphs of tree width $\le k$, Theorems 3 and 4 are not immediately applicable to optimization of Markov networks of tree width $\le k$. Our derivations so far have not included the situation that, given $X$, an optimal Markov network of tree width $\le k$ is not exactly a $k$-tree, for any $k$. In this section, we resolve this issue with a slight adjustment to the objective function.
Definition 9.
Let be a tree and be a binary relation over . The spanning subgraph is called amended graph of with .
Proposition 2.
A graph has tree width if and only if it is an amended graph of some tree with relation .
Definition 10.
Let be an acyclic orientation for edges in tree and be an amended graph of . Then the orientation of edges in the graph is defined as, for every pair of ,
Definition 11.
Let be a tree, where . Let be an amended graph of . Define
(18) 
where is an acyclic orientation on edges .
We can apply the derivation in the previous section by replacing with and obtain
Corollary 3.
Let and be a set of random variables. The optimal Markov network of tree width for is the amended graph of some tree that satisfies
4 Efficient Computation with Markov k-Trees
Optimal topology learning is computationally intractable for Markov networks of arbitrary graph topology [10]. Markov $k$-trees are no exception to this obstacle. In particular, we are able to relate the optimal learning of Markov $k$-trees to the following graph-theoretic problem.
Definition 12.
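Let $k \ge 1$. The Maximum Spanning $k$-Tree (MS$k$T) problem is, on an input graph of vertices $X = \{X_1, \dots, X_n\}$, to find a spanning $k$-tree $G = (X, E)$ with an acyclic orientation $\hat{E}$ such that
$\sum_{i=1}^{n} f\big(X_i, \pi_{\hat{E}}(X_i)\big)$
achieves the maximum.
In the definition, $f$ is a predefined numerical function for a pair: a vertex $X_i$ and its parent set $\pi_{\hat{E}}(X_i)$ in the output $k$-tree. MS$k$T generalizes the traditional “standard” definition of the maximum spanning $k$-tree problem [4, 9], where the objective function is the sum of weights $w$ on all edges involved in the output tree; that is
(19) $f\big(X_i, \pi_{\hat{E}}(X_i)\big) = \sum_{X_j \in \pi_{\hat{E}}(X_i)} w(X_j, X_i)$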
Proposition 3.
The Markov $k$-tree learning problem is the problem MS$k$T defined in Definition 12 in which, for every $i = 1, \dots, n$, $f\big(X_i, \pi_{\hat{E}}(X_i)\big) = I\big(X_i, \pi_{\hat{E}}(X_i)\big)$, the mutual information between $X_i$ and its parent set $\pi_{\hat{E}}(X_i)$.
It has been proved that, for any fixed $k \ge 2$, the problem MS$k$T with the objective function defined with $f$ given in equation (19) is NP-hard [4]. The intractability appears inherent since a number of variants of the problem remain intractable [9]. It implies that optimal topology learning of Markov $k$-trees (for $k \ge 2$) is computationally intractable. Independent works in learning Markov networks of bounded tree width have also confirmed this unfortunate difficulty [44, 47].
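For contrast with the intractable case, the case $k = 1$ reduces to the classical maximum spanning tree over pairwise mutual information considered by Chow and Liu [11]. The Python sketch below illustrates that special case only; it assumes the data are given as a list of equal-length tuples of discrete values, estimates mutual information from empirical frequencies, and uses Kruskal's algorithm with a union-find structure (one standard choice, not necessarily the procedure used in [11]).

```python
import math
from collections import Counter
from itertools import combinations


def empirical_mutual_information(samples, i, j):
    """Empirical mutual information (in nats) between columns i and j."""
    n = len(samples)
    c_ij = Counter((s[i], s[j]) for s in samples)
    c_i = Counter(s[i] for s in samples)
    c_j = Counter(s[j] for s in samples)
    return sum((c / n) * math.log(c * n / (c_i[a] * c_j[b]))
               for (a, b), c in c_ij.items())


def chow_liu_tree(samples, num_vars):
    """Edges of a maximum-weight spanning tree under pairwise mutual information."""
    weighted = sorted(
        ((empirical_mutual_information(samples, i, j), i, j)
         for i, j in combinations(range(num_vars), 2)),
        reverse=True,
    )
    root = list(range(num_vars))  # union-find forest

    def find(u):
        while root[u] != u:
            root[u] = root[root[u]]  # path halving
            u = root[u]
        return u

    edges = []
    for _, i, j in weighted:
        ri, rj = find(i), find(j)
        if ri != rj:  # adding (i, j) does not create a cycle
            root[ri] = rj
            edges.append((i, j))
    return edges


# Columns 0 and 1 are identical; column 2 is independent of both.
example_samples = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
example_tree = chow_liu_tree(example_samples, 3)
```

In the example, the strongly dependent pair (columns 0 and 1) is always joined by an edge of the resulting tree.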
4.1 Efficient Learning of Optimal Markov Backbone k-Trees
We now consider a class of less general Markov $k$-trees that are of higher order topologies over an underlying linear relation. In particular, such networks carry the signature of relationships among consecutively indexed random variables.
Definition 13.
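Let $X = \{X_1, \dots, X_n\}$ be the set of integer-indexed variables.
1. The graph $B_X = (X, b)$, where $b = \{(X_i, X_{i+1}) : 1 \le i < n\}$, is called the backbone of $X$. Each edge $(X_i, X_{i+1})$, for $1 \le i < n$, is called a backbone edge.
2. A graph (resp. $k$-tree) $G = (X, E)$ is called a backbone graph (resp. backbone $k$-tree) if it contains $B_X$ as a subgraph.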
Definition 14.
Let $k \ge 1$. A Markov network is called a Markov backbone $k$-tree if the underlying topology graph of the Markov network is a backbone $k$-tree.
The linearity relation among random variables occurs naturally in many applications with Markov network modeling, for instance, in speech recognition, cognitive linguistics, and higher-order relations among residues on biological sequences. Typically, the linearity also plays an important role in random systems involving variables associated with time series. For example,
Theorem 5.
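For each $k \ge 1$, the topology graph of any finite $k$th-order Markov chain is a backbone $k$-tree.
Proof.
This is because a finite $k$th-order Markov chain is defined over labelled variables $X_1, \dots, X_n$ such that $P(X_i \mid X_1, \dots, X_{i-1}) = P(X_i \mid X_{i-k}, \dots, X_{i-1})$, for every $i > k$. Clearly, its topology graph contains all edges $(X_i, X_{i+1})$, for $1 \le i < n$. In addition, the following creation order asserts that the graph is a $k$-tree: $C_{k+1} X_{k+1}\, C_{k+2} X_{k+2} \cdots C_n X_n$, where $C_i = \{X_{i-k}, \dots, X_{i-1}\}$, for all $i = k+1, \dots, n$.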
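The argument of the proof can be replayed mechanically. The following Python sketch is an illustration under the assumption that the topology graph of a $k$th-order chain joins every pair of variables whose indices differ by at most $k$; variables are represented by their integer indices, and the function names are hypothetical. It rebuilds the graph from the sliding-window creation order and allows a comparison with the chain topology and the backbone.

```python
from itertools import combinations


def markov_chain_edges(n, k):
    """Edges (i, j) with 1 <= i < j <= n and j - i <= k."""
    return {(i, j) for i in range(1, n + 1)
            for j in range(i + 1, min(i + k, n) + 1)}


def chain_creation_order(n, k):
    """Pairs (C_i, X_i) with C_i = {X_{i-k}, ..., X_{i-1}}, for i = k+1, ..., n."""
    return [(tuple(range(i - k, i)), i) for i in range(k + 1, n + 1)]


def edges_from_creation_order(order):
    """Replay a creation order, checking that each C is a clique built so far."""
    first_clique, _ = order[0]
    edges = set(combinations(sorted(first_clique), 2))
    for clique, v in order:
        if any(pair not in edges for pair in combinations(sorted(clique), 2)):
            raise ValueError(f"{clique!r} is not a clique when {v!r} is added")
        edges.update(tuple(sorted((u, v))) for u in clique)
    return edges


n, k = 6, 2
chain = markov_chain_edges(n, k)
rebuilt = edges_from_creation_order(chain_creation_order(n, k))
backbone = {(i, i + 1) for i in range(1, n)}
```

For these values, `rebuilt` coincides with `chain` and contains every backbone edge, which is the content of Theorem 5 for this small instance.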
We now show that learning the optimal topology of Markov backbone $k$-trees can be accomplished much more efficiently than with unconstrained Markov $k$-trees. We relate the learning problem to a special version of the graph-theoretic problem MS$k$T that can be computed efficiently. In the rest of this section, we present some technical details for this connection.
Definition 15.
Let $\mathcal{C}$ be a class of graphs. Then the $\mathcal{C}$-retaining MS$k$T problem is the MS$k$T problem in which the input graph includes a spanning subgraph $H \in \mathcal{C}$ that is required to be contained by the output spanning $k$-tree.
The unconstrained MS$k$T is simply the $\mathcal{C}$-retaining MS$k$T problem with $\mathcal{C}$ being the class of independent set graphs. There are other classes of graphs that are of interest to our work. In particular, we define $\mathcal{B}$ to be the class of all backbones. Then it is not difficult to see:
Proposition 4.
The Markov backbone $k$-tree learning problem is the $\mathcal{B}$-retaining MS$k$T problem.
In the following, we will show that, for various classes $\mathcal{C}$ that satisfy a certain graph-theoretic property, including class $\mathcal{B}$, the $\mathcal{C}$-retaining MS$k$T problems can be solved efficiently.
Definition 16.
Let $k \ge 1$ be fixed. A $k$-tree $G$ has bounded branches if every $(k+1)$-clique in $G$ separates $G$ into a bounded number of connected components.
Definition 17.
Let $k \ge 1$ be fixed. A graph $H$ is bounded branching-friendly if every $k$-tree that contains $H$ as a spanning subgraph has bounded branches.
Lemma 1.
Let $\mathcal{B}$ be the class of all backbones. Any graph in $\mathcal{B}$ is bounded branching-friendly.
Proof.
Let be a tree containing a backbone . Let clique with . Let
It is not difficult to see that separates the tree into at most connected components.
Note that divides set into at most intervals. Let be two cliques such that vertices and . Then and cannot belong to the same interval in order to guarantee all backbone edges are included in . This implies , a constant when is fixed.
Lemma 2.
Let $\mathcal{C}$ be a class of bounded branching-friendly graphs. Then the $\mathcal{C}$-retaining MS$k$T problem can be solved in polynomial time for every fixed $k$.
Proof.
(We give a sketch for proof.)
Note that the retaining MST problem is defined as follows:
Input: a set of vertices and a graph ,
Output: tree with such that
where is a predefined function.
First, any $(k+1)$-clique in a $k$-tree that contains $H$ as a spanning subgraph can only separate $H$ into a bounded number of connected components. To see this, assume some $(k+1)$-clique $\kappa$ does the opposite. From the graph $H$ and $\kappa$ we can construct a $k$-tree by augmenting $H$ with additional edges. We do this without connecting any two of the components in $H$ disconnected by $\kappa$. However, the resulting $k$-tree, though containing $H$ as a spanning subgraph, would be separated by $\kappa$ into an unbounded number of components, contradicting the assumption that $H$ is bounded branching-friendly.
Second, consider a process to construct a spanning tree anchored at any fixed clique . Note that by the aforementioned discussion, in the tree (to be constructed) the number of components separated by is some constant , possibly a function in which is fixed. For every component, its connection to is by some other clique such that , where is drawn from the component to “swap” out , a vertex chosen from . It is easy to see that, once is chosen, there are only a bounded number of ways (i.e,, at most possibilities) to choose . On the other hand, there are at most possibilities to choose from one of the connected components. If we use 0 and 1 to indicate the permission mode ‘yes’ or ’no’ that can be drawn from a specified component, an indicator for such permissions for all the components can be represented with binary bits.
Let be a binary string of length . We define to be the maximum value of a subtree created around , which is consistent with the permission indicator for the associated components. We then have the following recurrence for function :
where and is the function defined over the newly created vertex and its parent set . The recurrence defines with two recursive terms and , where the permission indicators and are such that the state corresponding to the component from which was drawn should be set ’1’ by and ’0’ by . When a component runs out of vertices, the indicator should be set ’0’. For other components, where indicates ’1’, either or indicates ’1’ but not both. Finally, base case for the recurrence is that when is all 0’s.
Theorem 6.
For every fixed $k \ge 1$, optimal topology learning of Markov backbone $k$-trees can be accomplished in polynomial time.
4.2 Efficient Inference with Markov k-Trees
On learned or constructed Markov $k$-trees, inferences and queries of probability distributions can be conducted, typically through computing the maximum posterior probabilities for most probable explanations [24]. The topological nature of Markov $k$-trees offers the advantage of achieving (high) efficiency for such inference computations, especially when values of $k$ are small or moderate.
Traditional methods for inference with constructed Markov networks seek additional independency properties that the models may offer. Typically, the method of clique factorization measures the joint probability $P(X)$ of a Markov network in terms of “potential” $\phi$ such that it can be factorized according to the maximal cliques in the graph: $P(X) = \prod_{\kappa \in \mathcal{K}} \phi_{\kappa}(X_{\kappa})$, where $\mathcal{K}$ is the set of maximal cliques and $X_{\kappa}$ is the subset of variables associated with clique $\kappa$ [5]. For general graphs, by the Hammersley-Clifford theorem [18], the condition to guarantee the equation is nonzero probability $P(X) > 0$, which may not always be satisfied. For Markov $k$-trees, such factorization suits well because $k$-trees are naturally chordal graphs. Nevertheless, inference with a Markov network may involve tasks more sophisticated than computing joint probabilities.
The $k$-tree topology offers a systematic strategy to compute many optimization functions over the constructed network models. This is briefly explained below.
Every tree of vertices, , consists of exactly number of cliques. By Proposition 1 in Section 2, if a tree is defined by creation order , all cliques uniquely correspond prefixes of . Thus relations among the cliques can be defined by relations among the prefixes of . For the convenience of discussion, with we denote the clique corresponding to prefix of .
Definition 18.
Let be a tree and be any creation order for . Two cliques and in are related, denoted with , if and only if their corresponding prefixes and of satisfy
(1) ;
(2) is a prefix of ;
(3) No other prefix of satisfies (1) and of which is a prefix.
Proposition 5.
Let be the set of all cliques in a tree and be the relation of cliques given in Definition 18 for the cliques. Then a rooted, directed tree with directed edges from child to its parent .
Actually, this tree of cliques is a tree decomposition for the $k$-tree graph. Tree decomposition is a technique developed in algorithmic structural graph theory which makes it possible to measure how tree-like a graph is. In particular, a tree decomposition reorganizes a graph into a tree topology connecting subsets of the graph vertices (called bags, e.g., $(k+1)$-cliques in a $k$-tree) as tree nodes [41, 7]. A heavy overlap is required between neighboring bags to ensure that the vertex connectivity information of the graph is not lost in such a representation. However, finding a tree decomposition with maximum bag size $k+1$ for a given arbitrary graph (of tree width $k$) is computationally intractable. Fortunately, for a $k$-tree or a backbone $k$-tree generated with the method presented in section 3, an optimal tree decomposition is already available according to Proposition 5.
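How a tree decomposition falls out of a creation order can be sketched in a few lines of Python. This is one standard construction, given here as an assumption-laden illustration rather than a restatement of Definition 18: the creation order is encoded as a list of `(clique, vertex)` pairs, each pair yields the bag formed by the clique and the new vertex, and the parent of a bag is taken to be the earliest bag that contains its clique.

```python
def clique_tree(order):
    """Bags and parent pointers of a tree decomposition from a creation order.

    order is a list of (C, v) pairs (hypothetical encoding); bag t is C_t + {v_t}.
    The parent of bag t is the earliest bag containing C_t; bag 0 is the root.
    """
    bags = [frozenset(c) | {v} for c, v in order]
    parent = {0: None}
    for t in range(1, len(order)):
        c = frozenset(order[t][0])
        parent[t] = next(s for s in range(t) if c <= bags[s])
    return bags, parent


def has_running_intersection(bags, parent):
    """Check that, for every vertex, the bags containing it form a subtree."""
    for u in set().union(*bags):
        holding = {t for t, bag in enumerate(bags) if u in bag}
        # In a rooted tree, a node set is connected iff exactly one of its
        # nodes has its parent outside the set (or is the root).
        tops = [t for t in holding if parent[t] not in holding]
        if len(tops) != 1:
            return False
    return True


# A 2-tree on six vertices: n - k = 4 bags, each of size k + 1 = 3.
example_order = [((1, 2), 3), ((2, 3), 4), ((3, 4), 5), ((2, 3), 6)]
example_bags, example_parent = clique_tree(example_order)
```

Every bag has exactly $k+1$ vertices, so the width of the decomposition is $k$, and no search over decompositions is needed once a creation order is known.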
Tree decomposition makes it possible to compute efficiently a large class of graph optimization problems on graphs of small tree width $k$, which are otherwise computationally intractable on unrestricted graphs. In particular, on the tree decomposition of a $k$-tree, many global optimization computations can be systematically solved in linear time [6, 1, 2, 12]. The central idea is to build one dynamic programming table for every $(k+1)$-clique; following the tree decomposition topology, the dynamic programming table for a parent tree node is built from the tables of its child nodes. For every node (i.e., every $(k+1)$-clique), all (partial) solutions associated with its $k+1$ random variables are maintained, and (partial) solutions associated with variables belonging only to its child nodes are optimally selected conditional upon the variables it shares with its child nodes. The optimal solution associated with the whole network is then present at the root of the tree decomposition. The process takes $O(f(k)\, n)$ time, for some possibly exponential function $f$, where $f(k)$ is the time to build a table, one for each of the $O(n)$ nodes in the tree decomposition. For small or moderately large values of $k$, such algorithms scale linearly with the number of random variables in the network.
5 Concluding Remarks
We have generalized Chow and Liu’s seminal work from Markov trees to Markov networks of tree width $\le k$. In particular, we have proved that model approximation with Markov networks of tree width $\le k$ has the minimum information loss when the network topology is a maximum spanning $k$-tree. We have also shown that learning the Markov network topology of backbone $k$-trees can be done in polynomial time for every fixed $k$, in contrast to the intractability of learning Markov $k$-trees without the backbone constraint. This result also holds for a broader range of constraints.
The backbone constraint stipulates a linear structure inherent in many Markov networks. Markov backbone $k$-trees are apparently suitable for modeling random systems involving time series. In particular, we have shown that $k$th-order Markov chains are actually Markov backbone $k$-trees. The constrained model is also ideal for modeling systems that possess higher order relations upon a linear structure and has been successful in modeling the 3-dimensional structure of biomolecules and in modeling semantics in computational linguistics, among others.
Acknowledgement
This research is supported in part by a research grant from the National Institutes of Health (NIGMS-R01) under the Joint NSF/NIH Initiative to Support Research at the Interface of the Biological and Mathematical Sciences.
References
 [1] Arnborg S. and Proskurowski A., (1989) Linear time algorithms for NPhard problems restricted to partial trees. Discrete Applied Mathematics 23, Pages 11Ð24
 [2] Arnborg, S., Lagergren, J., and Seese, D. (1991) Easy problems for treedecomposable graphs, Journal of Algorithms, 12, 308340.
 [3] Bach, F. and Jordan, MI. (2002) Thin Junction trees, in Dietterich, Becker, and Ghahramani, ed., Advances in Neural Information Processing Systems, 14. MIT Press.
 [4] Bern, M.W., (1987) Network design problems: Steiner trees and spanning trees PhD thesis, University of California, Berkeley, CA.
 [5] Bishop, C.M. (2006) Pattern Recognition and Machine Learning, Springer.
 [6] Bodlaender, H.L. (1988), Dynamic programming on graphs with bounded treewidth, Proc. 15th International Colloquium on Automata, Languages and Programming, Lecture Notes in Computer Science, 317, 105Ð118.
 [7] Bodlaender, H.L. (2006) Treewidth: Characterizations, Applications, and Computations. Proceedings of Workshop in Graph Theory, 114.
 [8] Bradley, J. and Guestrin, C. (2010) Learning tree conditional random fields, Proceedings of 21 International Conference on machine Learning, 2010.
 [9] Cai, L. and Maffray, F. (1993) On the spanning ktree problem, Discrete Applied Mathematics, 44, 139Ð156.
 [10] Chickering, DM., Geiger, D., and Heckerman, D. (1994) Learning bayesian networks is NPHard, Technical Rep MSRTR9417, Microsoft Research Advanced Technology Division.
 [11] Chow, CK. and Liu, CN. (1968) Approximating discrete probability distribution with dependence trees, IEEE Transactions on Information Theory, 14: 462467.
 [12] Courcelle, B. (1990) The monadic secondorder logic of graphs. I. recognizable sets of finite graphs, Information and Computation, 85, 1275.
 [13] Daly, R, Qiang, S. and Aitken, S. (2011), Learning bayesian networks: approaches and issues. Knowledge Engineering Review, vol 26, no. 2, pp. 99127.
 [14] Dasgupta, S. (1999) Learning polytree. In Laskey and Prade, ed. Proceedings Conference on Uncertainty in AI, 134141.
 [15] Ding, L. Samad, A., Xue, X., Huang, X., Malmberg, R., and Cai, L. (2014) Stochastic tree grammar and its application in biomolecular structure modeling. Lecture Notes in Computer Science, 8370, 308322.
 [16] Ding, L., Xue, X., LaMarca, S., Mohebbi, M., Samad, A., Malmberg, R., and Cai, L. (2014) Accurate prediction of RNA nucleotide interactions with backbone tree model. Bioinformatics,
 [17] Elidan, G. and Gould, S. (2008) Learning bounded treewidth Bayesian networks, Journal of Machine Learning Research 9, 26992731.
 [18] Grimmett, G. R. (1973), A theorem about random fields, Bulletin of the London Mathematical Society, 5 (1): 81Ð84.
 [19] Gupta, A. and Nishimura, N. (1996) The complexity of subgraph isomorphism for classes of partial trees. Theoretical Computer Science, 164: 287298.
 [20] Karger, D., and Srebro, N. (2001) Learning Markov networks: maximum bounded treewidth graphs, Proceedings of 12th ACMSIAM Symposium on Discrete Algorithms.
 [21] Kindermann, R. and Snell, J.L. (1980). Markov Random Fields and Their Applications, American Mathematical Society.
 [22] Kinney, J. B. and Atwal, G. S. (2014) Equitability, mutual information, and the maximal information coefficient. Proceedings of the National Academy of Sciences, 111:9, 3354–3359.
 [23] Koivisto, M., and Sood, K. (2004) Exact Bayesian structure discovery in Bayesian networks. Journal of Machine Learning Research, 5:549–573.
 [24] Koller, D. and Friedman, N. (2010) Probabilistic Graphical Models: Principles and Techniques, MIT Press.
 [25] Koller, D. and Sahami, M. (1996) Toward optimal feature selection. Proceedings of the Thirteenth International Conference on Machine Learning, 284–292.
 [26] Kramer, N., Schafer, J., and Boulesteix, A. (2009) Regularized estimation of large-scale gene association networks using graphical Gaussian models. BMC Bioinformatics, 10 (1):1.
 [27] Kullback, S. and Leibler, RA. (1951) On information and sufficiency, Annals of Mathematical Statistics, 22(1):79–86.
 [28] Kwisthout, JH., Bodlaender, HL., van der Gaag, LC. (2010) The necessity of bounded treewidth for efficient inference in Bayesian networks, Proceedings 19th European Conference on Artificial Intelligence, 237–242.
 [29] Lee, J., and Jun, CH. (2015) Classification of high dimensionality data through feature selection using Markov blanket, Industrial Engineering and Management Systems, 14:2, 210–219.
 [30] Lee, S., Ganapathi, V., and Koller, D. (2006) Efficient structure learning of Markov networks using regularization. In Advances in Neural Information Processing Systems, 817–824.
 [31] Lewis, PM, II (1959) Approximating probability distributions to reduce storage requirements, Information and Control, 2, 214–225.
 [32] Mao, Q., Wang, L., Goodison, S., and Sun, Y. (2015) Dimensionality reduction via graph structure learning, Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 765–774.
 [33] Matousek, J. and Thomas, R. (1992) On the complexity of finding iso- and other morphisms for partial k-trees, Discrete Mathematics, 108, 343–364.
 [34] Meek, C. (2001) Finding a path is harder than finding a tree, Journal of Artificial Intelligence Research, 15: 383–389.
 [35] Meila, M. and Jordan, MI. (2000) Learning with mixtures of trees, Journal of Machine Learning Research, 1: 1–48.
 [36] Nagarajan, R., Scutari, M., and Lebre, S. (2013) Bayesian Networks in R with Applications in Systems Biology, Springer.
 [37] Narasimhan, M. and Bilmes, J. (2003) PAC-learning bounded tree-width graphical models, in Chickering and Halpern, ed., Proceedings of Conference on Uncertainty in AI.
 [38] Patil, H. P. (1986) On the structure of k-trees. Journal of Combinatorics, Information and System Sciences, 11 (2–4):57–64.
 [39] Pearl, J. (1988) Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann Series in Representation and Reasoning.
 [40] Robertson, N. and Seymour, P.D. (1984) Graph minors III: Planar tree-width, Journal of Combinatorial Theory, Series B 36 (1): 49–64.
 [41] Robertson, N. and Seymour, P.D. (1986) Graph minors II. Algorithmic aspects of tree-width, Journal of Algorithms, 7, 309–322.
 [42] Rue, H. and Held, L. (2005) Gaussian Markov random fields: theory and applications. CRC Press.
 [43] Shannon, C.E. (1948) A Mathematical Theory of Communication, Bell System Technical Journal, 27, pp. 379–423 & 623–656.
 [44] Srebro, N. (2003) Maximum likelihood bounded tree-width Markov networks, Artificial Intelligence, 143, 123–138.
 [45] Sun, Y. and Han, J. (2013) Mining heterogeneous information networks: a structural analysis approach, ACM SIGKDD Explorations Newsletter, 14:2, 20–28.
 [46] Suzuki, J. (1999) Learning Bayesian belief networks based on the MDL principle: An efficient algorithm using the branch and bound technique. IEICE Transactions on Information and Systems, 82(2):356–367.
 [47] Szántai, T. and Kovács, E. (2012) Hypergraphs as a mean of discovering the dependence structure of a discrete multivariate probability distribution. Annals of Operations Research, vol 193, pp 71–90.
 [48] Teyssier, M. and Koller, D. (2005) Ordering-based search: a simple and effective algorithm for learning Bayesian networks, Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, 584–590.
 [49] Timme, N., Alford, W., Flecker, B., and Beggs, J.M. (2014) Multivariate information measures: an experimentalist’s perspective. Journal of Computational Neuroscience, 36 (2), pp 119–140.
 [50] Tomczak, K., Czerwinska, P. and Wiznerowicz, M. (2015) The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Contemporary Oncology (Pozn), 19(1A): A68–77.
 [51] Torti, SV. and Torti, FM. (2013) Iron and cancer: more ore to be mined. Nature Reviews Cancer, 13(5): 342–55.
 [52] Valiant, L.G. (1975) General context-free recognition in less than cubic time, Journal of Computer and System Sciences 10 (2): 308–314.
 [53] Wu, X., Zhu, X., Wu, G., and Ding, W. (2014) Data mining with big data, IEEE Transactions on Knowledge and Data Engineering, 26:1, 97–107.
 [54] Xu, Y., Cui, J., and Puett, D. (2014) Cancer Bioinformatics, Springer.
 [55] Yuan, C., and Malone, B. (2013) Learning optimal Bayesian networks: A shortest path perspective. J. Artificial Intelligence Research 48:23–65.
 [56] Yin, W., Garimalla, S., Moreno, A., Galinski, MR., and Styczynski, MP. (2015) A tree-like Bayesian structure learning algorithm for small-sample datasets from complex biological model systems, BMC Systems Biology 9:49.
 [57] Zhang, Z., Bai, L., Liang, Y., and Hancock, ER. (2015) Adaptive graph learning for unsupervised feature selection, Computer Analysis of Images and Patterns LNCS Vol 9256, 790–800.
 [58] Zhao, I, Zhou, Y., Zhang, X., and Chen, L. (2015) Part mutual information for quantifying direct associations in networks, Proceedings of the National Academy of Sciences, 113: 18, 5130–5135.
Appendix A
Theorem 7.
The probability function expressed in (6) for Markov tree remains the same regardless of the choice of creation order for .
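The statement can be checked numerically on a small instance before reading the proof. The sketch below is only an illustration, not part of the argument: it assumes tree width 1 and assumes that (6) then reduces to the familiar product of the marginal of an initial edge with one conditional per later variable given the vertex it attaches to. The names `marginal`, `tree_factorization`, and the encoding of a creation order as a list of pairs are hypothetical.

```python
import numpy as np

def marginal(joint, keep):
    """Marginal of `joint` over the axes in `keep`; summed axes stay as size 1."""
    drop = tuple(ax for ax in range(joint.ndim) if ax not in keep)
    return joint.sum(axis=drop, keepdims=True)

def tree_factorization(joint, order):
    """Evaluate P(a, b) * prod_i P(x_i | parent_i) for one creation order.

    `order` is [(a, b), (x_3, parent_3), ...] (assumed encoding, width 1 only):
    the first pair is the initial edge; every later pair attaches a new
    variable to one that has already been created.
    """
    first, second = order[0]
    result = marginal(joint, {first, second})
    for child, parent in order[1:]:
        # P(child | parent) = P(child, parent) / P(parent)
        result = result * marginal(joint, {child, parent}) / marginal(joint, {parent})
    return result  # broadcasting expands this to the full shape of `joint`

# Arbitrary strictly positive joint distribution over four binary variables.
rng = np.random.default_rng(0)
joint = rng.random((2, 2, 2, 2)) + 0.1
joint /= joint.sum()

# Two creation orders for the same star-shaped tree with edges 0-1, 1-2, 1-3.
order_a = [(0, 1), (2, 1), (3, 1)]
order_b = [(3, 1), (2, 1), (0, 1)]
p_a = tree_factorization(joint, order_a)
p_b = tree_factorization(joint, order_b)
assert np.allclose(p_a, p_b)
```

Both orders yield the product of the three edge marginals divided by the squared marginal of the shared vertex, which is why the two arrays agree; the proof below establishes the general case.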
Proof.
Let and be two creation orders for tree . For the same tree , we define and to be, respectively, the joint probability distribution functions of under the topology graph with creation orders and . Without loss of generality, we further assume that is the creation order for as given in Definition 3, of the following form:
By induction on , we will prove in the following the statement that .
Basis: when ,
Assumption: the statement holds for any tree of random variables.
Induction: Let , where , be a tree. Then it is not hard to see that the subgraph of , after variable is removed, is indeed a tree for random variables . Furthermore, according to Definition 3, is a creation order for tree .
Now we assume creation order for graph to be
where , and for all . Then there are two possible cases for to be in the creation order .
Case 1. .
Because shares edges only with variables in , we have and . Therefore,
is a creation order for subgraph . Thus, based on (6), we have