Learning Nonparametric Forest Graphical Models with Prior Information
Abstract
We present a framework for incorporating prior information into nonparametric estimation of graphical models. To avoid distributional assumptions, we restrict the graph to be a forest and build on the work of forest density estimation (FDE). We reformulate the FDE approach from a Bayesian perspective and introduce prior distributions on the graphs. As two concrete examples, we apply this framework to estimating scale-free graphs and to learning multiple graphs with similar structures. The resulting algorithms are equivalent to finding a maximum spanning tree of a weighted graph with a penalty term on the connectivity pattern of the graph. We solve the optimization problem via a minorize-maximization procedure with Kruskal's algorithm. Simulations show that the proposed methods outperform competing parametric methods and are robust to the true data distribution. They also lead to improvements in predictive power and interpretability on two real data sets.
1 Introduction
Graphical models are widely used to encode the conditional independence relationships between random variables. In particular, a random vector $X = (X_1, \dots, X_d)$ is represented by an undirected graph $G = (V, E)$ with vertices $V = \{1, \dots, d\}$, where the edge between nodes $i$ and $j$ is missing whenever $X_i$ and $X_j$ are conditionally independent given the other variables. One major statistical task is to learn the graph from $n$ i.i.d. copies of the random vector.
Existing approaches for estimating graphical models make assumptions on either the underlying distribution or the graphical structure. Currently the most popular method, the graphical lasso [6], assumes that the random vector follows a multivariate Gaussian distribution $N(\mu, \Sigma)$. In this setting, learning the graph is equivalent to estimating the precision matrix $\Omega = \Sigma^{-1}$, since the conditional independence structure of a Gaussian random vector is entirely determined by the sparsity pattern of $\Omega$. The graphical lasso finds a sparse estimate of $\Omega$ by maximizing the $\ell_1$-regularized log-likelihood. On the other hand, we can make no distributional assumptions but instead restrict the graph to be a forest. Under this structural constraint, the density function factorizes into terms involving only the univariate and bivariate marginal densities, which makes nonparametric estimation tractable in high dimensions. In this case, estimating the graph amounts to finding the maximum spanning tree of a weighted graph; see, for example, [2, 10] for details.
Oftentimes, additional information on the structure of a graph is available a priori, and it can be exploited to assist the estimation task. For example, a wide variety of networks in the recent literature, such as protein, gene, and social networks, are reported to be scale-free; that is, the degree distribution of the vertices follows a power law, $\mathbb{P}(\text{degree} = k) \propto k^{-\alpha}$ for some exponent $\alpha$. In such scale-free networks, some vertices have many more connections than others; these highest-degree vertices are usually called hubs and serve significant roles in their networks. As another example of prior information, consider applications where we believe that several networks share similar but not necessarily identical structures. This phenomenon is not unusual when we have multiple sets of data across distinct classes or units, such as gene expression measurements collected on a set of normal tissue samples and a set of cancer tissue samples. It is thus natural to ask whether such prior information can be integrated to improve the estimation.
Various approaches have been proposed to incorporate prior beliefs about the underlying graphs, for example, [4, 11, 13, 14] for learning scale-free graphical models, and [7, 3, 12, 15] for joint estimation of multiple graphical models. Nevertheless, to the best of our knowledge, all the existing methods assume some parametric distribution on the data, mostly multivariate Gaussian. Such distributional assumptions can be quite unrealistic and unnecessary in many applications. Even if the marginal distribution of each variable is transformed to be approximately Gaussian, which accommodates arbitrary univariate distributions, the joint dependence is still restricted by the Gaussian assumption.
In this paper, we relax such distributional assumptions and estimate graphical models nonparametrically. We build on the forest density estimation (FDE) method introduced in [10]. In particular, we reformulate the FDE approach from a Bayesian perspective and encode the prior information by placing a prior distribution on the graphs, which favors those that are more consistent with our prior belief. We further show that for both the scale-free-graph case and the multiple-graph case, this approach amounts to finding a maximum spanning tree of a weighted graph with a penalty term on the connection pattern of the nodes. We then devise an algorithm based on a minorize-maximization procedure and Kruskal's algorithm [9] to find a locally optimal solution.
The rest of the paper is organized as follows. In the following section, we give background on forest density estimation. In Section 3, we first give a general framework for incorporating prior information into nonparametric forest-based graphical model estimation, and then illustrate how the framework can be specialized to model scale-free graphs and to jointly estimate multiple graphs with similar structure. In Section 4, we provide a brief review of related work. Experimental results on synthetic data sets and real applications are presented in Section 5, followed by a conclusion in Section 6.
2 Forest density estimation
We say an undirected graph is a forest if it is acyclic. Let $F = (V, E_F)$ be a forest with vertices $V = \{1, \dots, d\}$ and edge set $E_F$. Let $X = (X_1, \dots, X_d)$ be a $d$-dimensional random vector with density $p$. We say that $X$, or equivalently, its density $p$, is Markov to $F$ if $X_i$ and $X_j$ are conditionally independent given the other random variables whenever edge $(i,j)$ is missing in $E_F$. A density $p$ that is Markov to $F$ has the following factorization

(1)   $p(x) = \prod_{(i,j) \in E_F} \frac{p(x_i, x_j)}{p(x_i)\, p(x_j)} \prod_{k=1}^{d} p(x_k),$
where each $p(x_i, x_j)$ is a bivariate marginal density and each $p(x_k)$ is a univariate marginal density. With this factorization, we can write the expected log-likelihood as

(2)   $\mathbb{E}\big[\log p(X)\big] = \sum_{(i,j) \in E_F} \mathbb{E}\bigg[\log \frac{p(X_i, X_j)}{p(X_i)\, p(X_j)}\bigg] + \sum_{k=1}^{d} \mathbb{E}\big[\log p(X_k)\big]$

(3)   $\phantom{\mathbb{E}\big[\log p(X)\big]} = \sum_{(i,j) \in E_F} I(X_i; X_j) - \sum_{k=1}^{d} H(X_k),$

where $I(X_i; X_j)$ is the mutual information between $X_i$ and $X_j$, and $H(X_k)$ is the entropy of $X_k$. We maximize the right-hand side of (3) to find the optimal forest

(4)   $F^{*} = \arg\max_{F \in \mathcal{T}_d} \sum_{(i,j) \in E_F} I(X_i; X_j),$
where $\mathcal{T}_d$ is the collection of spanning trees on the vertices $\{1, \dots, d\}$. We let $\mathcal{T}_d$ contain only spanning trees because there is always a spanning tree that solves problem (4). This problem can be recast as finding a maximum spanning tree of a weighted graph, where the weight of the edge between nodes $i$ and $j$ is $I(X_i; X_j)$. Kruskal's algorithm [9] is a greedy algorithm that is guaranteed to find an optimal solution; Chow and Liu [2] propose this procedure in the setting of discrete random variables. The method is described in Algorithm 1.
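Algorithm 1 is the standard Kruskal procedure run on mutual-information edge weights, with edges scanned in decreasing order. A minimal sketch in Python (function and variable names are ours, not the paper's):

```python
def kruskal_max_spanning_tree(d, weights):
    """Greedy maximum spanning tree: scan edges by decreasing weight and
    keep an edge whenever it joins two different components.

    d       -- number of vertices, labelled 0 .. d-1
    weights -- dict mapping an edge (i, j), i < j, to its weight I(X_i; X_j)
    Returns the list of selected edges in the order they were added.
    """
    parent = list(range(d))

    def find(u):
        # union-find representative lookup with path halving
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    tree = []
    for (i, j), _ in sorted(weights.items(), key=lambda kv: -kv[1]):
        ri, rj = find(i), find(j)
        if ri != rj:              # accepting this edge creates no cycle
            parent[ri] = rj
            tree.append((i, j))
        if len(tree) == d - 1:    # a spanning tree has exactly d - 1 edges
            break
    return tree
```

Because the edges are recorded in the order they are accepted, stopping the loop early yields the partial forests used later for held-out pruning.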
However, this procedure is not practical since the true density is unknown. Suppose instead that we have $X^{(1)}, \dots, X^{(n)}$, which are i.i.d. copies of the random vector $X$. We replace the population mutual information by the estimates

(5)   $\hat I(X_i; X_j) = \iint \hat p(x_i, x_j) \log \frac{\hat p(x_i, x_j)}{\hat p(x_i)\, \hat p(x_j)} \, dx_i \, dx_j,$
where $\hat p(x_i, x_j)$ and $\hat p(x_i)$ are kernel density estimators of the bivariate and univariate marginal densities,

(6)   $\hat p(x_i, x_j) = \frac{1}{n} \sum_{k=1}^{n} \frac{1}{h_2^2}\, K\bigg(\frac{x_i - X_i^{(k)}}{h_2}\bigg) K\bigg(\frac{x_j - X_j^{(k)}}{h_2}\bigg), \qquad \hat p(x_i) = \frac{1}{n} \sum_{k=1}^{n} \frac{1}{h_1}\, K\bigg(\frac{x_i - X_i^{(k)}}{h_1}\bigg),$
with a kernel function $K$ and bandwidths $h_2$ and $h_1$. The resulting estimator of the graph becomes

(7)   $\hat F = \arg\max_{F \in \mathcal{T}_d} \sum_{(i,j) \in E_F} \hat I(X_i; X_j).$
A held-out set is usually used to prune the spanning tree: we stop Algorithm 1 early, at the step where the likelihood on the held-out set is maximized. Thus we obtain a forest estimate of the graph.
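To make (5) and (6) concrete, the sketch below estimates the mutual information of a single pair of variables by plugging Gaussian-kernel density estimates into the mutual-information integral, approximated on a regular grid over the data range. The bandwidths, grid size, and function names are illustrative choices of ours, not prescriptions from the paper:

```python
import math

def gauss_kernel(u):
    # standard Gaussian kernel
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde1(x, data, h):
    """Univariate kernel density estimate at point x."""
    return sum(gauss_kernel((x - t) / h) for t in data) / (len(data) * h)

def kde2(x, y, xs, ys, h):
    """Bivariate product-kernel density estimate at (x, y)."""
    return sum(gauss_kernel((x - a) / h) * gauss_kernel((y - b) / h)
               for a, b in zip(xs, ys)) / (len(xs) * h * h)

def mi_hat(xs, ys, h1=0.3, h2=0.3, grid=20):
    """Plug-in estimate of I(X; Y): integrate the estimated densities
    over a regular grid of cell midpoints spanning the data range."""
    lo_x, hi_x = min(xs), max(xs)
    lo_y, hi_y = min(ys), max(ys)
    dx = (hi_x - lo_x) / grid
    dy = (hi_y - lo_y) / grid
    total = 0.0
    for i in range(grid):
        for j in range(grid):
            x = lo_x + (i + 0.5) * dx
            y = lo_y + (j + 0.5) * dy
            pxy = kde2(x, y, xs, ys, h2)
            px, py = kde1(x, xs, h1), kde1(y, ys, h1)
            if pxy > 1e-12:       # skip negligible cells to avoid log(0)
                total += pxy * math.log(pxy / (px * py)) * dx * dy
    return total
```

A dependent pair should receive a visibly larger estimate than an independent one, which is all the spanning-tree step needs.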
3 Learning forest graphical models with prior knowledge
3.1 A Bayesian framework
Sometimes we have prior information about the structure of the underlying graphical model and would like to incorporate it to assist the estimation. One way to realize this is to encode the prior knowledge into a prior distribution on the spanning trees. Let $\pi$ be a prior distribution on $\mathcal{T}_d$, the set of spanning trees with $d$ nodes. Given the data $X^{(1)}, \dots, X^{(n)}$ and assuming the density $p$ is known and Markov to the spanning tree $T$, we can write the likelihood as

(8)   $p\big(X^{(1)}, \dots, X^{(n)} \mid T\big) = \prod_{k=1}^{n} \Bigg[ \prod_{(i,j) \in E_T} \frac{p\big(X_i^{(k)}, X_j^{(k)}\big)}{p\big(X_i^{(k)}\big)\, p\big(X_j^{(k)}\big)} \prod_{l=1}^{d} p\big(X_l^{(k)}\big) \Bigg].$
Then the posterior probability of $T$ is

(9)   $\pi\big(T \mid X^{(1)}, \dots, X^{(n)}\big) \propto \pi(T)\, p\big(X^{(1)}, \dots, X^{(n)} \mid T\big).$
The maximum a posteriori (MAP) estimate is given by

(10)   $\hat T_{\mathrm{MAP}} = \arg\max_{T \in \mathcal{T}_d} \Bigg\{ \sum_{(i,j) \in E_T} \sum_{k=1}^{n} \log \frac{p\big(X_i^{(k)}, X_j^{(k)}\big)}{p\big(X_i^{(k)}\big)\, p\big(X_j^{(k)}\big)} + \log \pi(T) \Bigg\}.$
Since we do not know the true density in practice, we can plug in the estimator (5) and obtain

(11)   $\hat T = \arg\max_{T \in \mathcal{T}_d} \Bigg\{ n \sum_{(i,j) \in E_T} \hat I(X_i; X_j) + \log \pi(T) \Bigg\}$

as an approximation of $\hat T_{\mathrm{MAP}}$. In fact, $\hat T$ is obtained by replacing the true marginal densities and the empirical distributions in (10) by the corresponding density estimates. It can also be viewed as a penalized version of the estimator (7).
The prior term $\log \pi(T)$, which is sometimes combinatorial, could make the optimization problem extremely hard to solve. However, when $\log \pi(T)$ is convex with respect to the entries of the adjacency matrix of $T$, we can adopt a minorize-maximization algorithm [8] to find a locally optimal solution. Indeed, given this convexity, the objective function admits a linear lower bound at any current estimate. This linear lower bound can then be decomposed into a sum of weights over the edges, and we can apply Kruskal's algorithm to update our estimate. The following two concrete examples show in detail how this is carried out.
3.2 Scale-free graphs
Now suppose that we have reasons to believe that the graph is scale-free, or more generally, that the graph contains several nodes with dominating degrees compared to the rest. Let $d_i$ be the degree of the $i$th node of a spanning tree $T$. Consider a prior distribution on $\mathcal{T}_d$ which satisfies

(12)   $\pi(T) \propto \prod_{i=1}^{d} d_i^{-\alpha}$

for some $\alpha > 0$. This prior distribution favors spanning trees whose degree sequence follows a power law, and thus reflects our prior belief. Plugging this into (11), we obtain

(13)   $\hat T = \arg\max_{T \in \mathcal{T}_d} \Bigg\{ \sum_{(i,j) \in E_T} \hat I(X_i; X_j) - \lambda \sum_{i=1}^{d} \log d_i \Bigg\},$
where $\lambda = \alpha / n$ can now be viewed as a tuning parameter. To solve this optimization problem, we first rewrite the objective function as

(14)   $f(T) = \sum_{i<j} T_{ij}\, \hat I(X_i; X_j) - \lambda \sum_{i=1}^{d} \log d_i,$
where $d_i = \sum_{j \ne i} T_{ij}$. Here we also abuse our notation by writing $T$ for the adjacency matrix of the tree, that is, $T_{ij} = 1$ if and only if $(i,j) \in E_T$. Note that we have the additional constraint that the graph is a spanning tree. Given a current estimate $T^{(t)}$ with degrees $d_i^{(t)}$, we first lower bound $f$ by linearizing the convex penalty at $T^{(t)}$:

(15)   $f(T) \ge \sum_{i<j} T_{ij}\, \hat I(X_i; X_j) - \lambda \sum_{i=1}^{d} \bigg( \log d_i^{(t)} + \frac{d_i - d_i^{(t)}}{d_i^{(t)}} \bigg)$

(16)   $\phantom{f(T)} = \sum_{i<j} T_{ij} \bigg( \hat I(X_i; X_j) - \frac{\lambda}{d_i^{(t)}} - \frac{\lambda}{d_j^{(t)}} \bigg) + C,$
where $C$ is a constant that does not depend on $T$. We can maximize this lower bound by applying Kruskal's algorithm to the graph with edge weights

(17)   $w_{ij}^{(t)} = \hat I(X_i; X_j) - \frac{\lambda}{d_i^{(t)}} - \frac{\lambda}{d_j^{(t)}}.$
We see that the weights are updated at each iteration based on the current estimate of the graph. Each edge weight is penalized by two quantities that are inversely proportional to the degrees of the edge's two endpoints; an edge is thus penalized less if its endpoints are already highly connected, and vice versa. With such a "rich get richer" procedure, the algorithm encourages some vertices to attain high connectivity, and hence the overall degree distribution to have a heavy tail. We iterate the minorization and maximization steps until convergence. Since the objective value is nondecreasing over the iterations, the algorithm is guaranteed to converge to a local maximum.
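The full minorize-maximization loop for the scale-free case can be sketched as follows, assuming the pairwise mutual-information estimates have already been computed. The initialization with plain FDE and the fixed-point stopping rule are our own choices, not the paper's:

```python
def mst_max(d, w):
    """Kruskal's algorithm: maximum spanning tree for a weight dict w."""
    parent = list(range(d))
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    tree = set()
    for (i, j), _ in sorted(w.items(), key=lambda kv: -kv[1]):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.add((i, j))
    return tree

def scale_free_forest(d, mi, lam, iters=20):
    """MM procedure for (13): alternate degree-based reweighting and Kruskal.

    mi  -- dict of estimated mutual informations, keyed by (i, j) with i < j
    lam -- tuning parameter multiplying the log-degree penalty
    """
    tree = mst_max(d, mi)                       # initialize with plain FDE
    for _ in range(iters):
        deg = [0] * d
        for i, j in tree:
            deg[i] += 1
            deg[j] += 1
        # weight update of (17): subtract lambda / degree for each endpoint
        w = {(i, j): mi[i, j] - lam / deg[i] - lam / deg[j]
             for (i, j) in mi}
        new_tree = mst_max(d, w)
        if new_tree == tree:                    # reached a fixed point
            break
        tree = new_tree
    return tree
```

On a toy example where one node already dominates, the reweighting leaves the hub intact, illustrating the "rich get richer" behavior.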
3.3 Multiple graphs with similar structure
In this part, we illustrate how the framework can be modified to handle the case where multiple graphs are believed to have similar, but not necessarily identical, structures. Instead of one single graph, suppose that we now have $K$ graphical models with underlying forests, and for the $k$th one we observe data $X^{(k,1)}, \dots, X^{(k,n_k)}$. Given a joint prior distribution on the spanning trees $(T_1, \dots, T_K)$, we combine the likelihoods of the $K$ models and update the posterior distribution (9) to be

(18)   $\pi\big(T_1, \dots, T_K \mid \text{data}\big) \propto \pi(T_1, \dots, T_K) \prod_{k=1}^{K} p\big(X^{(k,1)}, \dots, X^{(k,n_k)} \mid T_k\big).$
Next, we design a prior distribution on the set of spanning trees which reflects our belief that the structures of the $K$ graphs share some similarity. Again we write $T_k$ for the adjacency matrix of the corresponding graph, that is, $T_{k,ij} = 1$ if and only if $(i,j) \in E_{T_k}$. We consider the following hierarchical model:

(19)   $q_{ij} \sim \mathrm{Uniform}(0, 1)$ independently for $i < j$,

(20)   $T_{k,ij} \mid q_{ij} \sim \mathrm{Bernoulli}(q_{ij})$ independently for $k = 1, \dots, K$.
According to this model, the same edge across the multiple graphs is governed by a common parameter $q_{ij}$, which encourages similarity across them. Integrating out the $q_{ij}$'s gives a prior distribution on $(T_1, \dots, T_K)$:

(21)   $\pi(T_1, \dots, T_K) = \int \prod_{i<j} \prod_{k=1}^{K} q_{ij}^{T_{k,ij}} (1 - q_{ij})^{1 - T_{k,ij}} \, dq$

(22)   $\phantom{\pi(T_1, \dots, T_K)} = \prod_{i<j} \int_0^1 \prod_{k=1}^{K} q^{T_{k,ij}} (1 - q)^{1 - T_{k,ij}} \, dq$

(23)   $\phantom{\pi(T_1, \dots, T_K)} = \prod_{i<j} \int_0^1 q^{\|T_{\cdot,ij}\|_1} (1 - q)^{K - \|T_{\cdot,ij}\|_1} \, dq$

(24)   $\phantom{\pi(T_1, \dots, T_K)} = \prod_{i<j} B\big( \|T_{\cdot,ij}\|_1 + 1,\, K - \|T_{\cdot,ij}\|_1 + 1 \big),$
where $T_{\cdot,ij}$ is the vector containing the $(i,j)$th entries of $T_k$ for $k = 1, \dots, K$, $\|\cdot\|_1$ denotes the $\ell_1$ norm, and $B(\cdot, \cdot)$ denotes the Beta function. Now combining this with (18) and following the reasoning in Subsection 3.1, we obtain our estimator in this case:

(25)   $(\hat T_1, \dots, \hat T_K) = \arg\max_{T_1, \dots, T_K \in \mathcal{T}_d} \Bigg\{ \sum_{k=1}^{K} \sum_{(i,j) \in E_{T_k}} \hat I_k(X_i; X_j) + \gamma \sum_{i<j} \log B\big( \|T_{\cdot,ij}\|_1 + 1,\, K - \|T_{\cdot,ij}\|_1 + 1 \big) \Bigg\},$

where $\hat I_k$ denotes the mutual information estimated from the $k$th data set.
Note that we include an extra tuning parameter $\gamma$ in front of the penalty term to give us more flexibility in controlling its magnitude. The function $x \mapsto \log B(x + 1, K - x + 1)$ is convex in $x$ and takes larger values when $x$ is close to $0$ or $K$ than for values in between. Using it as a penalty thus favors sets of graphs that share common edges.
To solve (25), we again adopt a minorize-maximization procedure. Specifically, write the objective function as

(26)   $\sum_{k=1}^{K} \sum_{i<j} T_{k,ij}\, \hat I_k(X_i; X_j) + \gamma \sum_{i<j} g\big( \|T_{\cdot,ij}\|_1 \big),$
where $g(x) = \log B(x + 1, K - x + 1)$. Given a current solution $\big(T_1^{(t)}, \dots, T_K^{(t)}\big)$, we linearize $g$ at $m_{ij}^{(t)} = \big\| T_{\cdot,ij}^{(t)} \big\|_1$ and get

(27)   $g\big( \|T_{\cdot,ij}\|_1 \big) \ge g\big( m_{ij}^{(t)} \big) + g'\big( m_{ij}^{(t)} \big) \big( \|T_{\cdot,ij}\|_1 - m_{ij}^{(t)} \big),$

(28)   $g'(x) = \psi(x + 1) - \psi(K - x + 1),$
where $\psi$ is the digamma function. This gives the following weight-updating rule for the $k$th graph:

(29)   $w_{k,ij}^{(t)} = \hat I_k(X_i; X_j) + \gamma \Big[ \psi\big( m_{ij}^{(t)} + 1 \big) - \psi\big( K - m_{ij}^{(t)} + 1 \big) \Big].$
Note that $g'$ is an increasing function. Therefore, this updating rule borrows strength across the graphs: it increases the weight of edge $(i,j)$ when $m_{ij}^{(t)}$ is large, i.e., when the other graphs also have edge $(i,j)$ present.
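A sketch of the weight update (29). Since the shared-edge counts are integers, the digamma values can be computed exactly from harmonic numbers, $\psi(n) = -\gamma_{\mathrm{Euler}} + \sum_{k=1}^{n-1} 1/k$; whether a unit's own tree is included in the count is a detail we gloss over here, and the function names are ours:

```python
def digamma_int(n):
    """Digamma at a positive integer: psi(n) = -euler_gamma + H_{n-1}."""
    euler_gamma = 0.5772156649015329
    return -euler_gamma + sum(1.0 / k for k in range(1, n))

def joint_weights(mi_k, trees, gamma):
    """Updated edge weights for one unit, given the current trees of all K units.

    mi_k  -- dict of MI estimates for that unit, keyed by edge (i, j)
    trees -- list of K current edge sets, one per unit
    gamma -- tuning parameter on the similarity penalty
    """
    K = len(trees)
    w = {}
    for e, mi in mi_k.items():
        m = sum(e in t for t in trees)          # how many units share edge e
        # slope of log Beta(m + 1, K - m + 1): an increasing function of m
        w[e] = mi + gamma * (digamma_int(m + 1) - digamma_int(K - m + 1))
    return w
```

For example, with $K = 3$ an edge present in 2 units gets a boost of $\gamma[\psi(3) - \psi(2)] = \gamma/2$, while an edge present in only 1 unit is docked by the same amount.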
3.4 Algorithms
To conclude this section, we summarize the two procedures, which share the same structure but serve different applications, in Algorithms 2 and 3. After obtaining the output of an algorithm, we prune the resulting spanning tree to obtain a forest estimate (to avoid overfitting in high dimensions). This can be done by going through the last iteration of the algorithm and stopping at the step where the likelihood on a held-out dataset is maximized.
4 Related work
Before presenting the performance of the proposed nonparametric methods on both simulated and real datasets, we pause to review some existing approaches to estimating scale-free graphical models and to jointly estimating multiple graphical models.
Most existing methods for estimating graphical models with prior information assume that the data follow a multivariate Gaussian distribution. To encourage a scale-free graph, Liu and Ihler [11] propose to replace the $\ell_1$ penalty in the graphical lasso formulation with a nonconvex power-law regularization term. Along the same lines, Defazio and Caetano [4] impose a convex penalty by using submodular functions and their Lovász extension. Essentially, both methods penalize the log degree of each node, but end up using a continuous or convex surrogate to avoid the combinatorial problems involving the degrees. Tan et al. [13] propose a general framework to accommodate networks with hub nodes, using a convex formulation that involves a row-column overlap norm penalty.
Methods for inferring Gaussian graphical models on multiple units have also been proposed in recent years. Guo et al. [7] jointly estimate Gaussian graphical models by penalizing the graphical lasso objective function by the square root of the $\ell_1$ norms of the edge vectors across all graphs, which results in a nonconvex problem. A convex joint graphical lasso approach is developed in [3], based on generalized fused lasso or group lasso penalties. Peterson et al. [12] and Zhu and Barber [15] propose Bayesian approaches for inference on multiple Gaussian graphical models.
Table 1 summarizes the aforementioned methods, which are implemented and compared in the simulations. The methods proposed in this paper can be viewed as nonparametric counterparts of the parametric methods.
Table 1: Summary of methods.

                  General       With prior information
                                Scale-free graph             Multiple graphs
  Parametric      Glasso [6]    SF-Glasso [11] (nonconvex)   Guo-Glasso [7] (nonconvex)
                                Hub-Glasso [13] (convex)     Joint-Glasso [3] (convex)
  Nonparametric   FDE [10]      SF-FDE (this paper)          JFDE (this paper)
5 Experiments
5.1 Synthetic data
In this subsection, we evaluate the performance of the proposed methods and other existing methods on synthetic data.
Graph structures
We consider the following types of graph structures, each with $d = 100$ vertices.

Scale-free graph: We use a preferential attachment process to generate a scale-free graph [1]. We start with a chain of 4 nodes (i.e., with edges 1–2, 2–3, and 3–4). New nodes are added one at a time, and each new node is connected to one existing node, chosen with probability proportional to $d_t^{\beta}$, where $d_t$ is the current degree of the $t$th node and $\beta$ is a parameter, which we set to be 1.5 in our experiments. A typical realization of such a network is shown in Figure 1 (left).

Stars: The graph has 5 stars of size 20; each star is a tree with one root and 19 leaves. An illustration is shown in Figure 1 (right).

Multiple graphs: We follow the above two mechanisms to generate multiple graphs with similar structures. In particular, we generate a set of 3 scale-free graphs which share 80 common edges (this is done by applying the above generative model to grow a common tree of size 80 to be shared across the 3 units; each unit then continues the growing process independently until obtaining a tree of 100 vertices), and another set of 3 stars graphs, which have 4 common stars and one individual star with distinct roots.
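The preferential-attachment mechanism described above can be sketched as follows (the seed handling and function name are our own choices):

```python
import random

def preferential_attachment_tree(d, beta=1.5, seed=0):
    """Grow a tree on d nodes: start from a 4-node chain, then attach each
    new node to an existing node with probability proportional to degree**beta."""
    rng = random.Random(seed)
    edges = [(0, 1), (1, 2), (2, 3)]            # initial chain of 4 nodes
    deg = [1, 2, 2, 1]
    for new in range(4, d):
        weights = [dg ** beta for dg in deg]    # "rich get richer" attachment
        target = rng.choices(range(new), weights=weights)[0]
        edges.append((target, new))
        deg[target] += 1
        deg.append(1)
    return edges
```

Growing a shared tree of size 80 first and then letting each unit continue independently to size 100 yields the multiple-graph setting above.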
Figure 1: Typical realizations of the two graph structures: a scale-free graph (left) and a stars graph (right).
Probability distributions
Given a particular graph, we generate samples according to two types of probability distributions that are Markov to the graph: Gaussian copulas and $t$ copulas [5]. The Gaussian copula (resp., the $t$ copula) can be thought of as representing the dependence structure implicit in a multivariate Gaussian (resp., multivariate $t$) distribution, while each variable marginally follows a uniform distribution on $[0, 1]$. Since the graph structures we consider are trees or forests, we generate the data sequentially: we first sample an arbitrary node in a tree, and then draw samples for the neighboring nodes according to the conditional distribution given by the copula, until all nodes in the tree have been visited. In our simulations, the degrees of freedom of the $t$ copula is set to 1, and the correlation coefficients are chosen to be 0.4 and 0.25 for the Gaussian and the $t$ copula, respectively.
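A sketch of the sequential sampling step for the Gaussian copula, assuming the same correlation on every edge: map the root to a latent standard normal, propagate along tree edges via the bivariate-normal conditional law, and map back through the normal CDF to obtain uniform marginals. The traversal details and names are ours:

```python
import math
import random
from statistics import NormalDist

def sample_gaussian_copula_tree(edges, d, rho=0.4, seed=0):
    """Draw one sample of (U_1, ..., U_d) Markov to the given tree, with each
    pair of adjacent variables coupled by a Gaussian copula with correlation rho."""
    rng = random.Random(seed)
    nd = NormalDist()
    adj = {i: [] for i in range(d)}
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    z = [None] * d
    z[0] = rng.gauss(0.0, 1.0)                  # latent normal at the root
    stack, seen = [0], {0}
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                # conditional law under a standard bivariate normal:
                # Z_v | Z_u = z ~ N(rho * z, 1 - rho^2)
                z[v] = rho * z[u] + math.sqrt(1 - rho * rho) * rng.gauss(0.0, 1.0)
                stack.append(v)
    return [nd.cdf(t) for t in z]               # Phi(Z) has uniform marginals
```

For the $t$ copula, one would replace the latent normals and the CDF by their $t$-distribution counterparts; we omit that variant here.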
Methods
We implement the methods summarized in Table 1. For the forest-based methods, we use a held-out set to select tuning parameters and to prune the estimated spanning trees. To implement the Gaussian-based methods, we first transform the data marginally to be approximately Gaussian. We choose the tuning parameters by searching over a fine grid and selecting the values that maximize the likelihood on the held-out set; we refer to this as held-out tuning. The results obtained by held-out tuning reflect the performance of the methods in a fully data-driven way. In addition, we also consider what we call oracle tuning, where the tuning parameters are chosen to maximize the $F_1$ score of the estimated graph. This tuning method requires knowledge of the true graph, and it is not obvious that a data-driven procedure could achieve it. We include oracle tuning to show the best performance achievable by each method.
Results
For both scale-free graphs and multiple graphs, we carry out four sets of experiments, with data generated from the two types of graphs and the two types of distributions. For each set of experiments, we repeat the simulation 10 times and record the $F_1$ scores of the estimated graphs for each method. The $F_1$ score is the harmonic mean of a method's precision and recall, and hence a measure of its accuracy; it lies between 0 and 1, where a higher score means better accuracy and 1 means perfect recovery. The average $F_1$ scores are shown in Table 2. From the table, we see that SF-FDE and JFDE always outperform FDE in these settings. Also, SF-FDE and FDE perform better than the other three methods under held-out tuning, as the penalized-likelihood methods tend to select more edges when the tuning parameters are chosen to maximize the held-out likelihood. When the true copula is Gaussian, the graphical-lasso-based methods all achieve very high $F_1$ scores with oracle tuning; they fail to deliver good performance when the true copula is no longer Gaussian. The FDE-based methods, in contrast, are not much affected by the true distribution.
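For reference, the $F_1$ computation on edge sets is simply:

```python
def f1_score(true_edges, est_edges):
    """F1 = harmonic mean of precision and recall over undirected edge sets."""
    true_set = {tuple(sorted(e)) for e in true_edges}
    est_set = {tuple(sorted(e)) for e in est_edges}
    tp = len(true_set & est_set)                # correctly recovered edges
    if tp == 0:
        return 0.0
    precision = tp / len(est_set)
    recall = tp / len(true_set)
    return 2 * precision * recall / (precision + recall)
```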
Table 2: Average $F_1$ scores. For each penalized-likelihood method, the two columns give held-out (h.o.) and oracle tuning; the Gaussian and $t$ labels indicate the copula used to generate the data.

Graphs with hubs

  Graph        Dist.      FDE    SF-FDE   Glasso          SF-Glasso       Hub-Glasso
                                          h.o.   oracle   h.o.   oracle   h.o.   oracle
  Scale-free   Gaussian   0.79   0.92     0.24   0.91     0.42   0.92     0.16   0.88
  Stars        Gaussian   0.82   0.96     0.25   0.93     0.46   0.98     0.08   0.99
  Scale-free   t          0.89   0.98     0.30   0.43     0.47   0.53     0.07   0.55
  Stars        t          0.93   0.98     0.32   0.56     0.50   0.67     0.09   0.79

Multiple graphs

  Graph        Dist.      FDE    JFDE     Glasso          Guo-Glasso      Joint-Glasso
                                          h.o.   oracle   h.o.   oracle   h.o.   oracle
  Scale-free   Gaussian   0.78   0.90     0.25   0.92     0.84   0.97     0.17   0.97
  Stars        Gaussian   0.80   0.92     0.26   0.92     0.80   0.95     0.17   0.96
  Scale-free   t          0.91   0.98     0.30   0.44     0.61   0.64     0.23   0.66
  Stars        t          0.92   0.98     0.33   0.53     0.66   0.70     0.27   0.71
5.2 Real data
Stock price data
We test our methods on the daily closing prices of stocks that remained in the S&P 500 index throughout the study period, obtained from Yahoo! Finance. The log returns of each stock are replaced by their respective normal scores, subject to a Winsorized truncation.
In the application of learning scale-free forests, we use the data from the first 9 months of 2014 as the training data and the data from the last 3 months of 2014 as the held-out data. It turns out that SF-FDE yields a larger held-out log-likelihood than FDE (64.5 compared to 62.6), implying that a scale-free approximation is helpful in predicting the relationships. The graphs estimated by FDE and SF-FDE are shown in Figure 2. We see that the clusters produced by SF-FDE tend to be more consistent with the Global Industry Classification Standard categories, which are indicated by different colors in the graph.
Figure 2: Estimated graphs by FDE (left) and SF-FDE (right) for the stock price data.
We also consider the application of learning multiple forests by dividing the data into 4 periods, one per year from 2009 to 2012, and modeling the 4-unit data using our proposed method. The aggregated held-out log-likelihoods over the 4 units are 193.4 for JFDE and 185.5 for FDE. The numbers of edges shared by all 4 graphs are 111 for JFDE and 24 for FDE, respectively. Figure 3 shows the graphs estimated by JFDE, where the edges common to all 4 units are colored red.
Figure 3: Estimated graphs by JFDE for (a) 2009, (b) 2010, (c) 2011, and (d) 2012.
University webpage data
As a second example, we apply our methods to the university webpage data from the "World Wide Knowledge Base" project at Carnegie Mellon University, which records the occurrences of various terms on student webpages from 4 computer science departments, at Texas, Cornell, Washington, and Wisconsin. We choose a subset of 100 terms with the largest entropy. In the analysis, we compute empirical distributions instead of kernel density estimates since the data are discrete.
To understand the relationships among the terms, we first wish to identify terms that serve as hubs. Figure 4 shows that SF-FDE detects 4 highly connected nodes of degree greater than 10: comput, system, page, and interest. We then model the 4-unit data, one unit per university. Figure 5 shows the graphs estimated by JFDE (isolated nodes are not displayed in each graph). These results provide an intuitive picture of the relationships among the terms across the 4 universities.
Figure 4: Estimated graphs by FDE (left) and SF-FDE (right) for the webpage data.
Figure 5: Estimated graphs by JFDE for (a) Texas, (b) Cornell, (c) Washington, and (d) Wisconsin.
6 Conclusion
In this paper, we introduce a nonparametric framework for incorporating prior knowledge into the estimation of graphical models. Instead of a Gaussianity assumption, it assumes the density is Markov to a forest, thus allowing arbitrary distributions. A key ingredient is to design a prior distribution on graphs that favors those consistent with the prior belief. We illustrate the idea by proposing such prior distributions, leading to two algorithms, for the problems of estimating scale-free networks and multiple graphs with similar structures. An interesting future direction is to apply this idea to more applications and other types of prior information.
References
 Albert and Barabási [2002] Réka Albert and AlbertLászló Barabási. Statistical mechanics of complex networks. Reviews of Modern Physics, 74(1):47, 2002.
 Chow and Liu [1968] C Chow and C Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3):462–467, 1968.
 Danaher et al. [2014] Patrick Danaher, Pei Wang, and Daniela M Witten. The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(2):373–397, 2014.
 Defazio and Caetano [2012] Aaron Defazio and Tiberio S Caetano. A convex formulation for learning scalefree networks via submodular relaxation. In Advances in Neural Information Processing Systems, pages 1250–1258, 2012.
 Demarta and McNeil [2005] Stefano Demarta and Alexander J McNeil. The t copula and related copulas. International Statistical Review, 73(1):111–129, 2005.
 Friedman et al. [2008] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.
 Guo et al. [2011] Jian Guo, Elizaveta Levina, George Michailidis, and Ji Zhu. Joint estimation of multiple graphical models. Biometrika, 98(1):1–15, 2011.
 Hunter and Lange [2004] David R Hunter and Kenneth Lange. A tutorial on MM algorithms. The American Statistician, 58(1):30–37, 2004.
 Kruskal [1956] Joseph B Kruskal. On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical Society, 7(1):48–50, 1956.
 Liu et al. [2011] Han Liu, Min Xu, Haijie Gu, Anupam Gupta, John Lafferty, and Larry Wasserman. Forest density estimation. The Journal of Machine Learning Research, 12:907–951, 2011.
 Liu and Ihler [2011] Qiang Liu and Alexander T Ihler. Learning scale free networks by reweighted $\ell_1$ regularization. In International Conference on Artificial Intelligence and Statistics, pages 40–48, 2011.
 Peterson et al. [2015] Christine Peterson, Francesco C. Stingo, and Marina Vannucci. Bayesian inference of multiple Gaussian graphical models. Journal of the American Statistical Association, 110(509):159–174, 2015.
 Tan et al. [2014] Kean Ming Tan, Palma London, Karthik Mohan, SuIn Lee, Maryam Fazel, and Daniela Witten. Learning graphical models with hubs. The Journal of Machine Learning Research, 15(1):3297–3331, 2014.
 Tang et al. [2015] Qingming Tang, Siqi Sun, and Jinbo Xu. Learning scalefree networks by dynamic node specific degree prior. In Proceedings of the 32nd International Conference on Machine Learning, pages 2247–2255, 2015.
 Zhu and Barber [2015] Yuancheng Zhu and Rina Foygel Barber. The logshift penalty for adaptive estimation of multiple Gaussian graphical models. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, pages 1153–1161, 2015.