Learning network structure underlying data is an important problem in machine learning. This paper presents a novel degree prior to study the inference of scale-free networks, which are widely used to model social and biological networks. In particular, this paper formulates scale-free network inference using Gaussian Graphical model (GGM) regularized by a node degree prior. Our degree prior not only promotes a desirable global degree distribution, but also exploits the estimated degree of an individual node and the relative strength of all the edges of a single node. To fulfill this, this paper proposes a ranking-based method to dynamically estimate the degree of a node, which makes the resultant optimization problem challenging to solve. To deal with this, this paper presents a novel ADMM (alternating direction method of multipliers) procedure. Our experimental results on both synthetic and real data show that our prior not only yields a scale-free network, but also produces many more correctly predicted edges than existing scale-free inducing prior, hub-inducing prior and the norm.
Learning Scale-Free Networks by Dynamic Node-Specific Degree Prior
Qingming Tang111Equal Contribution. firstname.lastname@example.org
Toyota Technological Institute at Chicago, 6045 S. Kenwood Ave., Chicago, Illinois 60637, USA
Siqi Sun111Equal Contribution. email@example.com
Toyota Technological Institute at Chicago, 6045 S. Kenwood Ave., Chicago, Illinois 60637, USA
Jinbo Xu firstname.lastname@example.org
Toyota Technological Institute at Chicago, 6045 S. Kenwood Ave., Chicago, Illinois 60637, USA
Graphical models are widely used to describe the relationship between variables (or features). Estimating the structure of an undirected graphical model from a dataset has been extensively studied (Meinshausen & Bühlmann, 2006; Yuan & Lin, 2007; Friedman et al., 2008; Banerjee et al., 2008; Wainwright et al., 2006). Gaussian Graphical Models (GGMs) are widely used to model the data and penalty is used to yield a sparse graph structure. GGMs assume that the observed data follows a multivariate Gaussian distribution with a covariance matrix. When GGMs are used, the network structure is encoded in the zero pattern of the precision matrix. Accordingly, structure learning can be formulated as minimizing the negative log-likelihood (NLL) with an penalty. However, the widely-used penalty assumes that any pair of variables is equally likely to form an edge. This is not suitable for many real-world networks such as gene networks, protein-protein interaction networks and social networks, which are scale-free and contain a small percentage of hub nodes.
A scale-free network (BARABÁSI & Bonabeau, 2003) has node degree following the power-law distribution. In particular, a scale-free network may contain a few hub nodes, whose degrees are much larger than the others. That is, a hub node more likely forms an edge than the others. In real-world applications, a hub node is usually functionally important. For example, in a gene network, a gene playing functions in many biological processes (Zhang & Horvath, 2005; Goh et al., 2007) tends to be a hub. A few methods have been proposed to infer scale-free networks by using a reweighed norm. For example, (Peng et al., 2009) proposed a joint regression sparse method to learn scale-free networks by setting the penalty proportional to the estimated degrees. (Candes et al., 2008) proposed a method that iteratively reweighs the penalty in the norm based on the inverse of the previous estimation of precision matrix. This method can suppress the large bias in the norm when the magnitude of the non-zero entries vary a lot and is also closer to the norm. Some recent papers follow the idea of node-based learning using group lasso (Friedman et al., 2010b; Meier et al., 2008; Tan et al., 2014; Friedman et al., 2010a; Mohan et al., 2014). Since group lasso penalty promotes similar patterns among variables in the same group, it tends to strengthen the signal of hub nodes and suppress that of non-hub nodes. As such, group lasso may only produce edges adjacent to nodes with a large degree (i.e., hubs).
In order to capture the scale-free property, (Liu & Ihler, 2011) minimizes the negative log-likelihood with a scale-free prior, which approximates the degree of each variable by the norm. However, the objective function is non-convex and thus, is hard to optimize. (Defazio & Caetano, 2012) approximates the global node degree distribution by the submodular convex envelope of the node degree function. The convex envelope is the Lovasz extension (Lovász, 1983; Bach, 2010) of the sum of logarithm of node degree. These methods consider only the global node degree distribution, but not the degree of an individual node. However, the approximation function used to model this global degree distribution may not induce a power-law distribution, because there are many possible distributions other than power-law that optimizing the approximation function. See Theorem 1 for details.
To further improve scale-free network inference, this paper introduces a novel node-degree prior, which not only promotes a desirable global node degree distribution, but also exploits the estimated degree of an individual node and the relative strengths of all the edges adjacent to the same node. To fulfill this, we use node and edge ranking to dynamically estimate the degree of an individual node and the relative strength of one potential edge. As dynamic ranking is used in the prior, the objective function is very challenging to optimize. This paper presents a novel ADMM (alternating direction method of multipliers) (Boyd et al., 2011; Fukushima, 1992) method for this.
Let denote a graph, where is the vertex set () and the edge set. We also use to denote the adjacency matrix of , i.e., if and only if . In this paper we assume that the variable set has a Gaussian distribution with a precision matrix . Then, the graph structure is encoded in the zero pattern of the estimated , i.e. the edge formed by is . Let denote the objective function. For GGMs, , where is the empirical covariance matrix. To induce a sparse graph, a penalty is applied to obtain the following objective function.
Here we define two ranking functions. For a given , let denote the permutation of such that where is a shuffle of a row vector excluding the element . Let be a dimension positive vector and a dimension vector. We define . Let be a permutation of such that , where denote the element of . When is a zero vector, we also write as and as , respectively.
To promote a sparse graph, a natural way is to use a penalty to regulate . However, norm cannot induce a scale-free network. This is because that a scale-free network has node degree following a power-law distribution, i.e., where is the degree and is a constant ranging from 2 to 3, rather than a uniform distribution. A simple prior for scale-free networks is the sum of the logarithm of node degree as follows.
where and is the indicator function. The constant 1 is added to handle the situation when . The norm in (2) is non-differentiable and thus, hard to optimize. One way to resolve this problem is to approximate by the norm of as shown in (Liu & Ihler, 2011). Since the logarithm of approximation is non-convex, a convex envelope of (2) based on submodular function and Lovasz extension (Lovász, 1983) is introduced recently in the papers (Bach, 2010; Defazio & Caetano, 2012) as follows.
Let . The convex envelope of Eq. 2 can be written as
Let denote the sum of logarithm of node degree of a graph . For any graph satisfying the scale-free property, there exists another graph with the same number of edges but not satisfying the scale-free property such that .
Since we would like to use (2) as a regularizer, a smaller value of (2), denoted as , is always favorable. Here we briefly prove Theorem 1. Since is scale-free, a large percentage of its nodes have small degrees. It is reasonable to assume that has two nodes and with the same degree. Let denote their degree. Without loss of generality, there exists a node connecting to but not to . We construct a graph from by connecting to and disconnecting from . Since is a concave function, we have
If there are two nodes with the same degree in , we can construct from such that . By this procedure, we may finally obtain a non-scale-free graph with a smaller value. So, Theorem 1 is proved.
Since Theorem 1 shows that a global description of scale-free property is not the ideal choice for inducing a scale-free graph, we propose a new prior to describe the degree of each individual node in the target graph.
Intuitively, given a node , if we rank all the potential edges adjacent to in descending order by where , then with a higher rank is more likely to be a true edge than those with a lower rank and should receive less penalty. To account for this intuition, we introduce a non-decreasing positive sequence and the following prior for node
Here represents the node ranked in the position. When all the elements in are same, this prior simply is norm. In our work, we use a moderately increasing positive sequence (e.g. for ), such that a pair with a larger estimated has a smaller penalty. On the other hand, we do not want to penalize all the edges very differently since the penalty is based upon only rough estimation of the edge strength.
Let denote the degree of node , we propose the following prior based on Eq. 6.
Comparing to the norm, our prior in Eq.7 is non-uniform since each edge has a different penalty, depending on its ranking and the estimated node degrees. This prior has the following two properties.
When , the penalty for is less than or equal to ; otherwise, the penalty is larger than 1.
If node has a higher degree than node , then has a smaller penalty than .
The first property implies that Eq.(7) favors a graph following a given degree distribution. Suppose that in the true graph, node (or variable) () has degree . Let and denote two edge sets of the same size. If has the degree distribution but does not, then we can prove the following result (see Supplementary for proof).
Let denote the expected number of nodes with degree . We can estimate from the prior degree distribution when is given. Supposing that all the nodes are ranked in descending order by their degree, we use to denote the estimated degree of the ranked node based on the power-law distribution, such that . Further, if we know the desired number of predicted edges, say we are told to output edges, then the expected degree of node is assumed to be proportional to . In the following content, we would just use to denote the expected degree of node .
Now the question is how to rank nodes by their degrees? Although the exact degree of node is not available, we can approximate by Lovasz Extenion (Defazio & Caetano, 2012; Bach, 2010; Lovász, 1983), i.e., (see Eq. 2). That is, we can rank nodes by their Lovasz Extension. Note that although we use Lovasz Extension in our implementation, other approximations such as norm can also be used to rank nodes without losing accuracy.
We define our dynamic node-specific prior as follows.
Note that for and defines the ranking based on Lovasz Extension. The -th ranked node is assigned a node penalty , denoted as . Note that the ranking of nodes by their degrees is not static. Instead, it is determined dynamically by our optimization algorithm. Whenever there is a new estimation of , the node and edge ranking may be changed. That is, our node-specific degree prior is dynamic instead of static.
With prior defined in (9), we have the following objective function:
Here is used to control sparsity level. The challenge of minimizing Eq.(10) lies in the fact that we do not know the ranking of both nodes and edges in advance. Instead we have to determine the ranking dynamically. We use an ADMM algorithm to solve it by introducing dual variables and as follows.
We can use a first-order method such as gradient descent to optimize the first subproblem since is convex. In this paper, we assume that is a Gaussian likelihood, in which case (11) can be solved by eigen-decomposition. Here we describe a novel algorithm for (12).
Let and . Minimizing (12) is equivalent to
which can be relaxed as
Here, we simply use to denote a permutation of , and use to denote the element of this permutation. The reason we introduce (15) is, given and without the constraint of (15), the optimal is . Adding the constraint , (15) can be relaxed by solving the following problem until .
Here is the dual vector and can be updated by for , where is the step size. Actually, as only a small percentage of nodes have large degrees, we may speed up by using the condition for where is much smaller than . That is, instead of ranking all the nodes, we just need to select top of them, which can be done much faster. We now propose Algorithm 1 to solve (LABEL:dual_decomp), which in spirit is a dual decomposition (Sontag et al., 2011) algorithm.
The output of algorithm 11 satisfies the following condition.
See supplementary for the proof of Theorem 2.
Solving line 3-4 in Algorithm 11 is not trivial. We can reformulate it as follows.
Here is a dimension vector and for . Since is symmetric, (18) can be reformulated as follows.
Problem (19) can be solved iteratively using dual decomposition by introducing the Lagrangian term , where is a by matrix which would be updated by . Notice that , (19) can be decomposed into independent sub-problems as follows.
Let . Obviously holds, so we do not consider in the remaining section. Let = be a feasible solution of (20). We define the cluster structure of as follows.
Definition 3 Let be a ranked feasible solution. Supposing that , we say form cluster 1 and denote it as . Similarly, we can define , and so on. Assume that is clustered into groups. Let denote the size of and the absolute value of its element.
Assuming that is the optimal solution of (20), by Definition 3, for , has the following property.
See (Defazio & Caetano, 2012) and our supplementary material for detailed proof of (21). Based on (21), we propose a novel dynamic programming algorithm to solve (20), which can be reduced to the problem of finding the constrained optimal partition of .
Let be the optimal partition for . Let and . Then we have the following theorem.
, where is a set with elements which are equal to .
and is the largest value such that .
Theorem 3 clearly shows that this problem satisfies the optimal substructure property and thus, can be solved by dynamic programming. A algorithm (Algorithm 2) to solve (20) is proposed. See supplementary material for the proof of substructure property and the correctness of the algorithm. In Algorithm 2, duplicates by times.
To introduce scale-free prior (), (Liu & Ihler, 2011) proposed to approximate the degree of node by and then use the following objective function.
where is the parameter for each node to smooth out the scale-free distribution. Without prior knowledge, it is not easy to determine the value of . Note that the above objective function is not convex even when is convex because of the log-sum function involved. The objective is optimized by reweighing the penalty for each at each step, and the method (denoted as RW) is guaranteed to converge to a local optimal. The parameter is set as diagonal of estimated in previous iteration, and as , as suggested by the authors. They use to control the sparsity, i.e. the number of predicted edges.
Recently (Defazio & Caetano, 2012) proposed a Lovasz extension approach to approximate node degree by a convex function. The convex function is a reweighed with larger penalty applied to edges with relatively larger strength. It turns out that such kind of convex function prior does not work well when we just need to predict a few edges, as shown in the experiments. Further, both (Liu & Ihler, 2011) and (Defazio & Caetano, 2012) consider only the global degree distribution instead of the degree of each node.
(Tan et al., 2014) proposes a method (denoted as Hub) specifically for a graph with a few hubs and applies a group lasso penalty. In particular, they decompose as , where is a sparse symmetric matrix and is a matrix whose column are almost entirely zero or non-zero. Intuitively, describes the relationship between non-hubs and that between hubs. They formulate the problem as follows.
An ADMM algorithm is used to solve this problem. In our test, we use to yield the best performance. Besides,we set to produce a graph with a desired level of sparsity.
Our method uses hyper-parameters: and . Meanwhile, is the hyper parameter for the power-law distribution and controls sparsity.
We tested our method on two real gene expression datasets and two types of simulated networks: (1) a scale-free network generated by Barabasi-Albert (BA) model (Albert & Barabási, 2002) and (2) a network with a few hub nodes. We generated the data for the simulated scale-free network by its corresponding Multivariate Gaussian distribution. We compared our method (denoted as ”DNS”, short for ”Dynamic Node-Specific Prior”) with graphical lasso (”Glasso”) (Friedman et al., 2008), neighborhood selection (”NS”) (Meinshausen & Bühlmann, 2006), reweighted regularization (”RW”) (Liu & Ihler, 2011), Lovasz extenion (”Lovasz”) (Defazio & Caetano, 2012) and a recent hub detection method (”Hub”) (Tan et al., 2014).
We generated a network of 500 nodes (p = 500) from the BA model. Each entry of the precision matrix is set to 0.3 if forms an edge, and otherwise. To make the precision matrix positive definite, we set the diagonal of to the minimum eigenvalue of plus 0.2. In total we generate 250 data samples from the corresponding multivariate Gaussian distribution . The hyper-parameters of all the methods are set as described in the last section.
Our method is not sensitive to the hyper-parameter . As shown in Figure 2, a few different values (2.1, 2.5, and 2.9) yield almost the same result. Hence we use in the following experiments.
Moreover, as shown in Figure 2, our method DNS outperforms the others in terms of prediction accuracy. It is not surprising that both RW and Hub outperform Glasso and NS since the former two methods are specifically designed for scale-free or hub networks. Lovasz, which is also designed for scale-free networks, would outperform Glasso as the number of predicted edges increase.
Figure 1 displays the log-log degree distribution of the true network and the networks estimated by DNS, RW and Glasso. Both DNS and RW yield networks satisfying the power-law distribution while Glasso does not, which confirms that both DNS and RW indeed favor scale-free networks.
We also tested our method on a hub network, which contains a few nodes with very large degrees but not strictly follows the scale-free property. See Figure 3 for a visualization, where larger dots indicate the hub nodes. Here we use the DREAM5 Network Inference Challenge dataset 1, which is a simulated gene expression data with 806 samples. DREAM5 also provides a ground truth network for this dataset. See (Marbach et al., 2012) for more details.
The result in Figure 4 shows that our method outperforms all the others, although our method is not specifically designed for hub networks. This shows that DNS also performs well in a graph with non-uniform degree distribution but without strict scale-free property.
To further test our method, we used DREAM5 dataset 3 and 4 respectively. Dataset3 contains 805 samples and 4511 genes and its ground truth consists of 3902 edges. Dataset4 contains 536 samples and 5950 genes and its ground truth consists of 3940 edges. The two datasets are very challenging as the data is noisy and without Gaussian and scale-free property. For each dataset, up to predicted edges are allowed for submission for each team competing in the contest. See (Marbach et al., 2012) for a detailed description of the two data sets. We determine the hyper parameters of all the methods such that they output exactly the same number of edges. As shown in Figure 5, our method obtains much higher accuracy than the others on DREAM5 dataset 3. To compare the degree distribution of different methods, we chose the values of the hyper-parameters such that each method yields 4000 edges. The degree distributions of the resultant networks are shown in Figure 6. As shown in this figure, our estimated network is more similar to a scale-free network.
We also tested our result on DREAM5 dataset 4. As most algorithms are time consuming, we just run our method DNS and Glasso. According to our test, both our algorithm and Glasso perform not very good on this dataset (Actually, all DREAM5 competitors do not do well on this dataset (Marbach et al., 2012)). But our algorithm still outperforms Glasso in terms of accuracy. Actually, the accuracy of DNS is about two times of Glassso (e.g. when predicting about 19000 edges, Glasso generate 9 correct edges while DNS find 18 correct edges).
We have presented a novel node-specific degree prior to study the inference of scale-free networks, which are widely used to model social and biological networks. Our prior not only promotes a desirable global node degree distribution, but also takes into consideration the estimated degree of each individual node and the relative strength of all the possible edges adjacent to the same node. To fulfill this, we have developed a ranking-based algorithm to dynamically model the degree distribution of a given network. The optimization problem resulting from our prior is quite challenging. We have developed a novel ADMM algorithm to solve it.
We have demonstrated the superior performance of our prior using simulated data and three DREAM5 datasets. Our prior greatly outperforms the others in terms of the number of correctly predicted edges, especially on the real gene expression data.
The idea presented in this paper can potentially be useful to other degree-constrained network inference problem. In particular, it might be applied to infer protein residue-residue interaction network from a multiple sequence alignment, for which we may predict the degree distribution of each residue using a supervised machine learning method.
- Albert & Barabási (2002) Albert, Réka and Barabási, Albert-László. Statistical mechanics of complex networks. Reviews of modern physics, 74(1):47, 2002.
- Bach (2010) Bach, Francis R. Structured sparsity-inducing norms through submodular functions. In Advances in Neural Information Processing Systems, pp. 118–126, 2010.
- Banerjee et al. (2008) Banerjee, Onureena, El Ghaoui, Laurent, and d’Aspremont, Alexandre. Model selection through sparse maximum likelihood estimation for multivariate gaussian or binary data. The Journal of Machine Learning Research, 9:485–516, 2008.
- BARABÁSI & Bonabeau (2003) BARABÁSI, BY ALBERT-LÁSZLÓ and Bonabeau, Eric. Scale-free. Scientific American, 2003.
- Boyd et al. (2011) Boyd, Stephen, Parikh, Neal, Chu, Eric, Peleato, Borja, and Eckstein, Jonathan. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122, 2011.
- Candes et al. (2008) Candes, Emmanuel J, Wakin, Michael B, and Boyd, Stephen P. Enhancing sparsity by reweighted â 1 minimization. Journal of Fourier analysis and applications, 14(5-6):877–905, 2008.
- Defazio & Caetano (2012) Defazio, Aaron and Caetano, Tiberio S. A convex formulation for learning scale-free networks via submodular relaxation. In Advances in Neural Information Processing Systems, pp. 1250–1258, 2012.
- Friedman et al. (2008) Friedman, Jerome, Hastie, Trevor, and Tibshirani, Robert. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.
- Friedman et al. (2010a) Friedman, Jerome, Hastie, Trevor, and Tibshirani, Rob. Regularization paths for generalized linear models via coordinate descent. Journal of statistical software, 33(1):1, 2010a.
- Friedman et al. (2010b) Friedman, Jerome, Hastie, Trevor, and Tibshirani, Robert. A note on the group lasso and a sparse group lasso. arXiv preprint arXiv:1001.0736, 2010b.
- Fukushima (1992) Fukushima, Masao. Application of the alternating direction method of multipliers to separable convex programming problems. Computational Optimization and Applications, 1(1):93–111, 1992.
- Goh et al. (2007) Goh, Kwang-Il, Cusick, Michael E, Valle, David, Childs, Barton, Vidal, Marc, and Barabási, Albert-László. The human disease network. Proceedings of the National Academy of Sciences, 104(21):8685–8690, 2007.
- Liu & Ihler (2011) Liu, Qiang and Ihler, Alexander T. Learning scale free networks by reweighted l1 regularization. In International Conference on Artificial Intelligence and Statistics, pp. 40–48, 2011.
- Lovász (1983) Lovász, László. Submodular functions and convexity. In Mathematical Programming The State of the Art, pp. 235–257. Springer, 1983.
- Marbach et al. (2012) Marbach, Daniel, Costello, James C, Küffner, Robert, Vega, Nicole M, Prill, Robert J, Camacho, Diogo M, Allison, Kyle R, Kellis, Manolis, Collins, James J, Stolovitzky, Gustavo, et al. Wisdom of crowds for robust gene network inference. Nature methods, 9(8):796–804, 2012.
- Meier et al. (2008) Meier, Lukas, Van De Geer, Sara, and Bühlmann, Peter. The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):53–71, 2008.
- Meinshausen & Bühlmann (2006) Meinshausen, Nicolai and Bühlmann, Peter. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, pp. 1436–1462, 2006.
- Mohan et al. (2014) Mohan, Karthik, London, Palma, Fazel, Maryam, Witten, Daniela, and Lee, Su-In. Node-based learning of multiple gaussian graphical models. The Journal of Machine Learning Research, 15(1):445–488, 2014.
- Peng et al. (2009) Peng, Jie, Wang, Pei, Zhou, Nengfeng, and Zhu, Ji. Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association, 104(486), 2009.
- Sontag et al. (2011) Sontag, David, Globerson, Amir, and Jaakkola, Tommi. Introduction to dual decomposition for inference. Optimization for Machine Learning, 1:219–254, 2011.
- Tan et al. (2014) Tan, Kean Ming, London, Palma, Mohan, Karthik, Lee, Su-In, Fazel, Maryam, and Witten, Daniela. Learning graphical models with hubs. The Journal of Machine Learning Research, 15(1):3297–3331, 2014.
- Wainwright et al. (2006) Wainwright, Martin J, Lafferty, John D, and Ravikumar, Pradeep K. High-dimensional graphical model selection using -regularized logistic regression. In Advances in neural information processing systems, pp. 1465–1472, 2006.
- Yuan & Lin (2007) Yuan, Ming and Lin, Yi. Model selection and estimation in the gaussian graphical model. Biometrika, 94(1):19–35, 2007.
- Zhang & Horvath (2005) Zhang, Bin and Horvath, Steve. A general framework for weighted gene co-expression network analysis. Statistical applications in genetics and molecular biology, 4(1), 2005.