Network Inference by Learned Node-Specific Degree Prior
Abstract
We propose a novel method for network inference from partially observed edges using a node-specific degree prior. The degree prior is derived from the observed edges in the network to be inferred, and its hyperparameters are determined by cross validation. We then formulate network inference as a matrix completion problem regularized by our degree prior. Our theoretical analysis indicates that this prior favors a network following the learned degree distribution, and may lead to an improved network recovery error bound compared to previous work. Experimental results on both simulated and real biological networks demonstrate the superior performance of our method in various settings.
1 Introduction
Network inference or structure learning has been widely studied in machine learning. There are two typical scenarios. In the first scenario, the task is to estimate the structure of an undirected graphical model from a high-dimensional dataset, e.g., learning gene co-expression networks from gene expression data Marbach et al. (2012); De Smet and Marchal (2010); Zhang and Horvath (2005). This task has been extensively studied; a popular method is the graphical Lasso, which assumes the underlying graph to be sparse Meinshausen and Bühlmann (2006); Yuan and Lin (2007); Friedman et al. (2008); Banerjee et al. (2008); Wainwright et al. (2006). If prior information regarding clusters or blocks of the network is available, one may apply group penalties to promote desired patterns among the edges in a cluster Friedman et al. (2010); Mohan et al. (2014); Tibshirani et al. (2005). More recently, structure-inducing norms/priors Bach (2010); Candes et al. (2008) that promote a hub structure Mohan et al. (2014); Tan et al. (2014); Mohan et al. (2012) or a scale-free network Liu and Ihler (2011); Tang et al. (2015); Defazio and Caetano (2012) have been proposed.
In the second scenario of network inference, the task is to reconstruct the whole network or predict missing links based on a subset of observed edges, e.g., link prediction in a social network Liben-Nowell and Kleinberg (2007) and inferring unknown protein-protein interactions (PPIs) from experimentally-validated PPIs Wang et al. (2013); Dai and Prasad (2010). There are many solutions to this problem given different assumptions on data generation. For example, the problem can be solved by using influence cascade models under the diffusion process assumption Daneshmand et al. (2014); Pouget-Abadie and Horel (2015). Another popular approach is to formulate the problem as a matrix completion problem Wang et al. (2013); Huang et al. (2013); Hsieh et al. (2014), in which the matrix to be completed is often assumed to have certain structural properties, e.g., being low-rank Candès and Recht (2009); Ding et al. (2006); Mnih and Salakhutdinov (2007); Cai et al. (2010). This is reasonable since many real-world networks have only a small number of degrees of freedom.
This paper focuses on the second scenario, i.e., predicting the whole network from a subset of observed edges. We formulate this network inference problem as a matrix completion problem regularized by our novel node-specific degree prior. We learn the degree of each individual node from the observed edges by cross-validation, so that the learned degree is (approximately) consistent with the partial observation. Considering that the observed degree distribution may differ from the true one, we use a soft rather than hard degree constraint in our prior.
To justify our method, we show theoretically that our node-specific degree prior can indeed help induce a network that follows a given degree distribution. Under reasonable assumptions on the observation process, we provide an upper bound on the expected recovery error of our network inference algorithm, which is superior to the error bound of the existing approach due to the additional regularization by our degree prior. Furthermore, we experiment with real biological network data, and show that by tuning the hyperparameters of our degree-based prior from the observed edges, the proposed method obtains much better performance than existing methods for network inference.
The rest of the paper is organized as follows. In section 2, we introduce our node-specific degree prior, and show that it can induce a graph following a given degree distribution. In section 3, we formulate the network inference problem as a regularized matrix completion problem, and present its optimization procedure. In section 4, we discuss the differences between our method and some closely related works. In section 5, we compare our method with related methods in terms of prediction accuracy on both synthetic and real datasets. Finally, we provide concluding remarks in section 6.
Notations.
Let G = (V, E) denote the underlying network to be inferred, where V is the set of vertices (|V| = n) and E is the true edge set. We use A to denote the adjacency matrix of G, i.e., A_ij = 1 if and only if (i, j) ∈ E. In this paper, we assume A_ii = 0 for all i, i.e., there is no edge connecting a node to itself. We denote by Ω ⊆ E the set of observed edges, and by O the indicator matrix such that O_ij = 1 if and only if (i, j) ∈ Ω.
To infer the underlying graph from Ω, we first estimate a real-valued symmetric matrix X from O, each entry of which indicates the strength of evidence for an edge between the corresponding pair of vertices. We then predict the k largest entries of (the upper-diagonal of) X as edges, where k is the desired number of edges for the whole network. Alternatively, we can use a thresholding procedure to predict edges from X.
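As a concrete illustration of this top-k prediction step, the sketch below (our own code, not from the paper; the name predict_edges is ours) extracts the k largest strict upper-diagonal entries of a symmetric score matrix as the predicted edge list:

```python
import numpy as np

def predict_edges(X, k):
    """Predict the k strongest edges from a real-valued symmetric score matrix X.

    Only the strict upper triangle is considered, since X is symmetric and
    self-loops are excluded.
    """
    n = X.shape[0]
    iu, ju = np.triu_indices(n, k=1)          # strict upper-diagonal indices
    order = np.argsort(X[iu, ju])[::-1][:k]   # positions of the k largest scores
    return [(int(i), int(j)) for i, j in zip(iu[order], ju[order])]

# Toy example: node 0 is strongly connected to nodes 1 and 2.
X = np.array([[0.0, 0.9, 0.8, 0.1],
              [0.9, 0.0, 0.2, 0.3],
              [0.8, 0.2, 0.0, 0.4],
              [0.1, 0.3, 0.4, 0.0]])
print(predict_edges(X, 2))   # -> [(0, 1), (0, 2)]
```

The thresholding alternative would instead keep every pair whose score exceeds a cutoff.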
2 Network Inference by Node-Specific Degree Prior
In this section, we propose a new degree-based prior to regularize the network inference problem. We will show that this degree prior helps to induce a network following the desired node-specific degree distribution. We also estimate the recovery error bound of the constrained network inference problem under reasonable assumptions.
2.1 Inducing the desired degree distribution
There have been several degree priors/norms for inducing a scale-free (or hub) network Liu and Ihler (2011); Defazio and Caetano (2012); Mohan et al. (2012); Tang et al. (2015). But most of these priors work at the global level, e.g., they assume the degrees of all nodes approximately follow a power-law distribution. In contrast to these existing works, we would like to use the (noisy) degree information at each individual node. As mentioned before, we assume that the target network structure is implied by a real-valued symmetric matrix, and we may use the top entries in it as the predicted edges (including observed edges). Let be the desired degree of the nodes. We propose the following prior:
(1) 
where is a hyperparameter and is a monotonically increasing positive sequence, i.e., . In this paper, we use for , although there are many choices for as long as it increases at a moderate rate. denotes the largest entry in the row of excluding . That is, the likelihood of and forming an edge is ranked in the position among all the possible edges adjacent to .
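To make the prior concrete, here is one hypothetical instantiation (the name degree_prior, the parameters lam and beta, and the exact coefficient scheme are our assumptions, not necessarily the precise form of (1)): each row's entries are sorted decreasingly, the entry ranked j receives an increasing coefficient f(j) = j**beta, and coefficients beyond the node's estimated degree d[i] are amplified by lam:

```python
import numpy as np

def degree_prior(X, d, lam=2.0, beta=0.2):
    """Node-specific degree prior (illustrative sketch, not the exact Eq. (1)).

    For each node i, sort |X_i| (excluding the diagonal) in decreasing order
    and penalize the entry ranked j with coefficient f(j) = j**beta, amplified
    by lam for ranks beyond the estimated degree d[i].  Entries beyond a
    node's degree budget are thus discouraged more strongly.
    """
    n = X.shape[0]
    total = 0.0
    for i in range(n):
        row = np.abs(np.delete(X[i], i))          # exclude the diagonal entry
        vals = np.sort(row)[::-1]                 # sorted decreasingly
        ranks = np.arange(1, n)                   # 1-based ranks
        coef = ranks ** float(beta)               # monotonically increasing f
        coef[ranks > d[i]] *= lam                 # amplify past the degree
        total += float(coef @ vals)
    return total
```

With lam = 1 and beta = 0 every entry is penalized equally, recovering a plain ℓ1-style penalty.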
Theorem 1 below shows that the prior in (1) favors a graph that follows a given degree distribution.
Theorem 1
Let be the degree of the underlying network where . Let be a realvalued symmetric matrix with nonnegative entries and () be the degree distribution of the network resulting from the top upperdiagonal entries of . If , then there exists another symmetric matrix satisfying the following conditions: (1) has the same set of entries as ; (2) the network derived from the top entries of has degree ; and (3) .
Let be the sum of the terms associated with node in (1). Then we have
which is a linear combination of the (sorted) entries of the row of . By the definition of , we have
(2) 
We now construct from as follows. First, we sort all the upper-diagonal elements of in descending order to obtain the following sequence :
Then we employ Algorithm 1 to place each element of , in descending order, into two symmetric entries and of . appears twice in : once in with coefficient where is the ranking of this entry in row , and once in with coefficient . Thus the contribution of to is .
We now prove that has the desired properties.
is symmetric and has the same set of entries as .
It is clear from step 5 of the algorithm that the resultant is symmetric. Since in each iteration we assign a different entry of to that of , the two matrices have the same set of entries.
has the desired degree distribution .
The network resulting from is determined from the first iterations of Algorithm 1. (2) indicates that for any , so the following set contains the smallest coefficients in .
Since step 4 of Algorithm 1 selects a pair of indices and with the smallest and , these selected coefficients must be chosen from . As a result, upon the termination of the iteration in Algorithm 1, we have , , and has the desired degree distribution.
, i.e., has a smaller penalty than .
Note that both and have the same set of ranked entries . A larger entry in always has a smaller coefficient in , and all entries and coefficients are nonnegative, so the resultant has the smallest penalty among all the matrices with the same set of entries.
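The heart of this last step is the rearrangement inequality: with nonnegative, increasing coefficients, pairing the largest entries with the smallest coefficients minimizes the weighted sum, which is exactly the assignment Algorithm 1 constructs. A quick numeric check (our own illustration, not from the paper):

```python
import itertools
import numpy as np

# Coefficients increase with rank; values are the sorted matrix entries.
coef = np.array([1.0, 1.5, 2.0, 3.0])       # nonnegative, increasing
vals = np.array([0.9, 0.7, 0.3, 0.1])       # nonnegative, decreasing

# Exhaustively try every pairing of values to coefficients.
best = min(float(np.dot(coef, np.array(p)))
           for p in itertools.permutations(vals))

# The minimum is attained by pairing the largest value with the smallest
# coefficient, i.e., by the decreasing arrangement `vals` itself.
assert np.isclose(best, float(np.dot(coef, vals)))
```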
Theorem 1 shows that (1) is a structure-inducing prior that favors a network following a given degree distribution . We can tune the hyperparameter to control the impact of the prior. The induced graph almost exactly follows the degree distribution when the hyperparameter is large, since the penalty coefficients beyond each node's degree are then heavily amplified. On the other hand, when it is small, the prior only weakly encourages the given degree distribution. Furthermore, in the limiting case the prior reduces to the ℓ1 norm.
Although (1) induces a graph following a given degree distribution, it is a soft constraint and tolerates some noise in the degree distribution. In fact, the difference in penalty coefficients of two entries and is bounded as
(3) 
which is very small when is large and is small. This bound implies the following properties of our prior in (1).

The larger the degree of a node is, i.e., the more neighbors it has, the smoother its penalty coefficients are.

When two entries in the same row have similar values and thus similar rankings, their corresponding penalty coefficients in the regularizer are also close.
These smoothness properties are further fine-tuned via the parameter .
2.2 Recovery Error Bound
Mathematically, the network inference problem addressed in this paper is related to PU (Positive-Unlabeled) learning for matrix completion Hsieh et al. (2014).
Here instead of picking top entries, we assume that we derive the network from a realvalued symmetric matrix by thresholding (the two ways are closely related). We assume that the 0/1 adjacency matrix is observed from by a thresholding process: Let be the threshold value, then where is the indicator function. Furthermore, we assume that a subset of the edge set is observed by uniformly sampling elements from with probability . Recall that we use the matrix to denote the observations, where if the edge is observed and otherwise.
We assume the underlying real-valued matrix comes from the set , defined as
where denotes element-wise comparison and is the nuclear norm (sum of singular values). We assume the underlying matrix has a small nuclear norm as a proxy for being low-rank, which is a common approach in matrix completion Srebro and Shraibman (2005). We use the following loss function to estimate from :
(4) 
This loss function assigns two different weights to the squared error between and , depending on the observed value . When the percentage of unobserved edges is high (i.e., is small), the loss function places more weight on correctly predicting the observed 1's and allows for larger errors on unobserved entries; otherwise it tends to predict smaller values for the unobserved entries.
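A minimal sketch of this weighted loss (the function name pu_loss and the specific alpha vs. 1 - alpha weighting are our assumptions; the exact weights in (4) may differ):

```python
import numpy as np

def pu_loss(X, O, alpha=0.9):
    """Weighted squared loss for PU-style matrix completion (sketch).

    O is the 0/1 indicator matrix of observed edges.  Observed entries are
    pulled toward 1 with weight alpha; unobserved entries are pulled toward 0
    with the smaller weight 1 - alpha, allowing larger errors there.
    """
    obs = alpha * np.sum(O * (X - 1.0) ** 2)
    unobs = (1.0 - alpha) * np.sum((1 - O) * X ** 2)
    return obs + unobs
```

Setting alpha close to 1 corresponds to the regime where most true edges are unobserved.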
Let minimize the loss function subject to the constraint that . We may predict an edge set from by thresholding. That is, there is one edge between and if and only if . We can estimate the expected recovery error of as follows.
(5) 
where the expectation is taken over random selection of the observed edge set . We have the following theorem regarding the bound of the expected recovery error.
Theorem 2
Assume the underlying graph has a degree distribution , with and . Let be the threshold value used to obtain edges from the underlying matrix . Let be the minimizer of the loss function defined in (4) subject to the constraint that . Assume that the regularizer uses an estimated degree distribution with . Then with probability at least , we have
where , , and is a universal constant.
This theorem shows that on average, the number of mistakes (i.e., ) made in the recovered network is bounded, and the bound does not increase with the matrix size , but rather depends (mildly) on the maximum degree, the number of edges, and the complexity of the matrix. In the case of a sparse network, the average error has a very small bound. See our supplementary material for a detailed proof of Theorem 2. Part of our proof follows the techniques used by Hsieh et al. (2014). That is, we relate the expected recovery error (5) to the following label-dependent recovery error,
which is further related to the loss function defined in (4). Then we can derive a generalization guarantee for the (bounded) real-valued loss function using the Rademacher complexity Bartlett and Mendelson (2003) of the hypothesis class, controlled by the nuclear norm and the degree prior. Our error bound is better than that of Hsieh et al. (2014) due to the additional constraint of our degree-based prior.
2.3 Learning Node Degree via Cross Validation
In order to use our node-specific degree prior, we need to estimate the degree of each individual node of the underlying network. The naive strategy of searching the space of all possible degree assignments is clearly infeasible. Thus, we derive the degrees from the observed degrees, where is the degree of node in the observed network . Assuming that the predicted network has edges, under the uniform sampling assumption we can estimate the degree of each node in the predicted network by where .
Let be any loss function for where is a hyperparameter. Considering both the loss and the degreebased prior, we solve the following regularized objective function
(6) 
where is the degree-based prior and can be interpreted as an amplification factor. We define the new degree of node after multiplying by as . By setting , we amplify the impact of hub nodes, as the gap between the estimated degrees of hub nodes and non-hub nodes is enlarged, resulting in a larger difference in their penalty coefficients in . As in the matrix completion setting, we may determine the hyperparameters , , , and through cross validation. That is, we randomly hold out some observed edges as a tuning set, and use the remaining observations to train the model and predict all missing edges. We then select the hyperparameters that give the best prediction accuracy on the held-out tuning set.
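The degree estimation step can be sketched as follows (hypothetical code; the helper name estimate_degrees and the rounding scheme are ours): observed degrees are rescaled so that the estimated degrees are consistent with predicting k edges in total, then amplified by mu:

```python
import numpy as np

def estimate_degrees(O, k, mu=1.2):
    """Estimate node degrees of the predicted network (sketch).

    Under uniform sampling, the observed degree of each node is scaled by the
    ratio of the desired edge count k to the number of observed edges; mu >= 1
    is the amplification factor that widens the gap between hubs and non-hubs.
    """
    obs_deg = O.sum(axis=1)                       # observed degree per node
    n_obs_edges = O.sum() / 2                     # each edge is counted twice
    d = obs_deg * (k / n_obs_edges)               # rescale to k total edges
    return np.maximum(1, np.round(mu * d)).astype(int)
```

With mu = 1 this is the plain uniform-sampling rescaling; larger mu emphasizes hubs.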
3 Model and Optimization
3.1 Matrix Completion with Learned Node-Specific Degree Prior
We now consider how to solve the network inference problem (i.e., recover ) given some observed edges (i.e., ) using matrix completion regularized by our degree-based prior. Assume that the observed degree of node is and we would like to output edges in the solution. Let be the estimated degrees of the nodes in the predicted network, where . Using non-negative matrix tri-factorization Ding et al. (2006), the regularized matrix completion problem can be formulated as
(8) 
where if and otherwise.
In the above objective function, the first term is the loss function and the second term is our degree-based prior. Meanwhile, , , and are the hyperparameters to be tuned through cross-validation. This tri-factorization method has previously been applied to recover protein-protein interaction (PPI) networks Wang et al. (2013). Note that (8) differs from (4) in that it removes the hard constraints (which are often satisfied by the solution anyway) and it uses factorization instead of a nuclear norm constraint to enforce low-rank structure.
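To illustrate the low-rank model underlying this formulation, the following sketch (entirely our own; it omits the degree prior and the loss weighting, and all names are assumptions) fits a symmetric non-negative tri-factorization by projected gradient descent:

```python
import numpy as np

def tri_factorize(O, r=5, steps=500, lr=0.005, seed=0):
    """Symmetric non-negative tri-factorization X = F S F^T (sketch).

    Fits the observed 0/1 matrix O with a rank-r factorization by projected
    gradient descent on the squared Frobenius error; F (n x r) and
    S (r x r, symmetric) are kept entrywise non-negative.
    """
    rng = np.random.default_rng(seed)
    n = O.shape[0]
    F = rng.random((n, r))
    S = rng.random((r, r))
    S = (S + S.T) / 2                              # symmetric middle factor
    for _ in range(steps):
        R = F @ S @ F.T - O                        # residual
        gF = 2 * (R @ F @ S.T + R.T @ F @ S)       # gradient w.r.t. F
        gS = 2 * (F.T @ R @ F)                     # gradient w.r.t. S
        F = np.maximum(0.0, F - lr * gF)           # project onto F >= 0
        S = np.maximum(0.0, S - lr * gS)           # project onto S >= 0
    return F, S
```

The full objective would add the weighted loss and the degree prior on top of this factorized parameterization.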
3.2 Optimization
By introducing additional variables and the constraint , we rewrite (8) as
s.t.  (9) 
This formulation can be solved by the alternating direction method of multipliers (ADMM, Boyd et al., 2011), which has been successfully applied to non-negative matrix factorization problems (Sun and Fevotte, 2014). ADMM solves (9) by iterating the following steps until convergence, where is an optimization parameter we fix:
(10)  
(11)  
3.2.1 Solving (10)
3.2.2 Solving (11)
We can divide (11) into subproblems, one for each node of the graph. The objective for the th node is
(12) 
where and are the th columns of and , respectively, and for . However, these smaller problems are not independent since must be symmetric. To deal with this constraint, we apply the idea of Defazio and Caetano (2012) and rewrite the problem as
(13)  
Then we can apply dual decomposition Sontag et al. (2011) to (13). Specifically, we introduce a Lagrangian term and minimize the following objective function in each iteration of dual decomposition
(14) 
where and its entries are adjusted according to the difference between and in each iteration. By completing the squares, (14) is transformed into the form
(15) 
This objective can now be decomposed into independent subproblems, each of which has the same form as (12). We solve each subproblem using the algorithm of Tang et al. (2015); Bogdan et al. (2013), with time complexity .
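For intuition, here is a simplified version of such a sorted-penalty proximal step (our own sketch, assuming the input is nonnegative and sorted decreasingly and that the minimizer preserves that ordering; the actual algorithm of Tang et al. (2015); Bogdan et al. (2013) handles the general case): subtract the rank-dependent coefficients, restore monotonicity by pool-adjacent-violators averaging, and clip at zero.

```python
import numpy as np

def sorted_penalty_prox(v, c):
    """Prox of a sorted-coefficient penalty (SLOPE-style sketch).

    Approximately solves min_x 0.5 * ||x - v||^2 + sum_j c[j] * x_(j), where
    x_(j) is the j-th largest component, assuming v >= 0 is sorted decreasingly
    and the minimizer keeps that ordering.
    """
    y = v - c
    # Pool-adjacent-violators: merge blocks until the sequence is nonincreasing.
    vals, sizes = [], []
    for yi in y:
        vals.append(float(yi)); sizes.append(1)
        while len(vals) > 1 and vals[-2] < vals[-1]:
            s = sizes.pop(); u = vals.pop()
            vals[-1] = (vals[-1] * sizes[-1] + u * s) / (sizes[-1] + s)
            sizes[-1] += s
    x = np.repeat(vals, sizes)       # expand block averages back out
    return np.maximum(x, 0.0)        # enforce non-negativity
```

When no ordering violation occurs, the prox is simply soft-shrinkage by the rank-dependent coefficients.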
4 Related Work
There have been several works on applying matrix completion techniques to the link prediction task, including Hsieh et al. (2014); Wang et al. (2013); Huang et al. (2013). Hsieh et al. (2014) considered the PU (positive-unlabeled) learning setting for matrix completion, where the observed entries are purely 1's while all other entries are unlabeled. Wang et al. (2013) applied the orthogonal matrix tri-factorization technique Ding et al. (2006) to the problem of protein-protein interaction prediction (a real biological problem that can be modeled as link prediction), and obtained significant improvement in prediction accuracy. Huang et al. (2013) used trace-norm-regularized discrete matrix completion to predict new links in social networks and a protein-protein interaction network. However, none of these matrix completion based approaches carefully utilizes the degree information conveyed by the observed samples.
In this work, we have used a degree prior regularized matrix completion framework for predicting the missing edges of a network. Although our degree prior may be mathematically similar to some existing scale-free priors in the literature Liu and Ihler (2011); Defazio and Caetano (2012); Tang et al. (2015), there are clear differences between our work and theirs. Existing priors are all based upon the (global) scale-free assumption, while the prior we use here is directly learned from observed samples. Both theoretical and experimental results show that our learned prior leads to better prediction performance in practice. Furthermore, our work addresses a very different problem setting than the above-mentioned previous works do. While those previous works estimate the network from a covariance matrix calculated from observed attributes of nodes (e.g., gene expression data), our work aims to predict missing links in a partially observed network without any observed node attributes. For example, those works are suitable for gene co-expression network inference from a set of measured gene expression levels, but not for missing link prediction in a social network. The latter problem can be attacked using the method proposed here.
5 Experimental Results
We have implemented the proposed method described in section 3 (denoted as Tri+Degree) and compare its performance against the following four methods: the tri-factorization method without any prior (denoted as Tri), tri-factorization regularized by the ℓ1 penalty (denoted as Tri+L1), tri-factorization regularized by the scale-free penalty (denoted as Tri+Scale) Liu and Ihler (2011), and PU learning for matrix completion (denoted as PU) Hsieh et al. (2014). Cross validation is performed for all the methods for hyperparameter selection. We report for each method the averaged result over 5 random seeds (used for the initialization of matrix completion).
We conduct three experiments to test the performance of the five methods. In the first experiment, we use a simulated gene co-expression network from the well-known DREAM5 challenge Marbach et al. (2012). The second experiment makes use of one protein-protein interaction (PPI) network in BioGrid Stark et al. (2006). We randomly sample some of the known edges in this PPI network as observations and use the unsampled edges as the test set. In the last experiment, we test the different methods using multiple releases of the protein-protein interaction network for the species “Rat”. In particular, we predict the network from an old release and then evaluate the consistency between the predicted network and a newer release. Our experiments show that our method obtains the best performance in almost all the experimental settings, and in some cases our method shows a significant advantage over the others.
[Figure 1: number of correctly predicted edges vs. percentage of all pairs predicted, under base sampling rates of 90% and 70%.]

Table 1: Ratios of the edges only in a newer release that are recovered (PU / Tri / Tri+Scale / Tri+L1 / Tri+Degree).

Input release | Test: 3.2.104 | Test: 3.2.114
3.1.94 | 0.3269 / 0.3075 / 0.2105 / 0.3324 / 0.3989 | 0.2944 / 0.3318 / 0.1916 / 0.3154 / 0.4112
3.2.104 | - | 0.2360 / 0.2360 / 0.3483 / 0.3820 / 0.5169
5.1 Gene Expression Network
The ground truth network has nodes and edges. We consider a node to be a hub if its degree is among the top 20% of all nodes. To conduct the experiment, we sample some edges from the ground truth network as observations. The following three sampling conditions are tested.

Uniform sampling: All edges are randomly and uniformly sampled. Two different sampling rates and are used.

Oversampling: Edges adjacent to hubs are oversampled. That is, we sample of edges adjacent to hubs and of edges not adjacent to hubs. In another setting, we sample edges adjacent to hubs and of edges not adjacent to hubs.

Undersampling: Edges adjacent to hubs are undersampled. Two different settings are tested: vs and vs for edges adjacent to hubs vs. edges not adjacent to hubs, respectively.
The second and third strategies are used to test how robust our method is with respect to sampling bias.
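The three sampling conditions can be simulated as follows (illustrative code; the helper name sample_edges and the rate parameters are our own):

```python
import numpy as np

def sample_edges(edges, hubs, rate_hub, rate_other, seed=0):
    """Sample observed edges with possibly different rates for hub edges.

    edges: list of (i, j) pairs; hubs: set of hub node ids.  An edge is
    'adjacent to a hub' if either endpoint is a hub.  rate_hub == rate_other
    gives uniform sampling; rate_hub > rate_other oversamples hub edges, and
    rate_hub < rate_other undersamples them.
    """
    rng = np.random.default_rng(seed)
    observed = []
    for (i, j) in edges:
        rate = rate_hub if (i in hubs or j in hubs) else rate_other
        if rng.random() < rate:
            observed.append((i, j))
    return observed
```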
We give the results obtained by the different methods in Figure 1, where each panel shows how many predictions agree with the ground truth as we predict more and more edges (increasing ) from the real-valued matrix. (We exclude the already observed edges when counting the “number of correctly predicted edges”, so the maximum number of correct predictions for the 90% sampling rate is 1/3 of that for the 70% sampling rate.) Panels in the top row show results for the base sampling rate of 90%, and panels in the bottom row show results for the more challenging base sampling rate of 70%. The left column of Figure 1 shows the experimental results when the observed edges are uniformly sampled from the true edges, which is the usual assumption in many real applications. In this case, the performance of our approach is much better than that of the others. This is because the degree distribution learned from the observations is consistent with the ground truth, and thus our degree prior is most effective. Further, a larger sampling rate results in better performance than a smaller one, because when more edges are observed, the learned degree distribution is closer to the ground truth. The middle and right columns of Figure 1 show results under the oversampling and undersampling conditions, respectively. In these two cases, our approach Tri+Degree still outperforms the others, though not by as much as in the uniform sampling condition. This shows the robustness of our method with respect to mismatch in sampling conditions. Overall, our method achieves the best prediction accuracy in almost all conditions over a range of values.
5.2 ProteinProtein Interaction Network
We further use the protein-protein interaction (PPI) network for Plasmodium in the BioGrid database (release 3.3.124) to evaluate the performance of the five methods. This network has nodes and experimentally-validated true edges. We sample and of the true edges, respectively, as the input to all the tested methods. As shown in Figure 2, our method Tri+Degree outperforms the others in both settings. This shows that, in addition to gene co-expression networks, our method also works well on PPI networks.
5.3 Multiple Releases of ProteinProtein Interaction Networks
In the above two experiments, the training and test edges are sampled from the networks to be inferred. Here we test our method with another strategy. In particular, we make use of three releases , and of the protein-protein interaction (PPI) network for the Rat species, taken from the BioGrid database. The three releases have , and nodes, respectively, and , and known edges, respectively. A newer release typically contains all the edges in an older release. In this experiment, we predict PPIs from an older release and then check whether the predicted PPIs appear in a newer release. A prediction method is better if it yields a larger number of predicted PPIs appearing in a newer release.
Table 1 shows that our method outperforms the other four methods significantly. Each entry in the table contains the test results of the methods on two different releases. For example, the entry in the third row and third column contains the results obtained by using release as input and the difference between releases and as the test set. In each entry, from left to right, the numbers are the ratios of the edges only in the newer release that are recovered by PU, Tri, Tri+Scale, Tri+L1 and our method, respectively. Each method predicts new edges, where is the number of proteins in the older release. In fact, when predicting more edges, our method has an even larger advantage over the others.
6 Conclusion
This paper presents a novel method for network inference (link prediction) using a node-specific degree prior learned from a subset of observed edges. We show theoretically that the proposed degree prior is effective in inducing a network that approximately follows the observed degree distribution. We propose to use a matrix completion objective regularized by our degree prior for network inference, and provide a recovery guarantee for our structured matrix completion method under the uniform sampling assumption. Our experimental results show that our method achieves better performance than existing approaches on both simulated gene-expression networks and real protein-protein interaction networks.
Our analysis has mostly assumed that the observed edges are uniformly sampled from the true network, which is typically the case in many real applications. It turns out that our approach also performs better than existing methods when the edges adjacent to hubs are somewhat upsampled or downsampled, so our method is not very sensitive to mismatch in sampling conditions. However, it would be interesting and useful to study the performance of the different methods under more challenging conditions.
7 Proof of Theorem 2
Mathematically, the network inference problem addressed in our paper is related to PU (Positive-Unlabeled) learning for matrix completion Hsieh et al. (2014). Thus, we start from the same problem setting as Hsieh et al. (2014) and borrow some similar techniques in proving our bound. We theoretically show that our degree prior does improve network inference, as it guarantees a better bound on the recovery error.
Problem setting
We assume that a 0/1 matrix is observed from a real-valued matrix by a thresholding process: let be the threshold value, then where is the indicator function. Furthermore, we assume that a subset of the edge set is observed by uniformly sampling elements from with probability . We use the matrix to denote the observations, i.e., if the edge is observed and otherwise. This data generation process indicates
Consider the following objective as the proxy for recovering :
(16)  
where indicates element-wise comparison, i.e., , . This objective assigns different weights to the squared error between and depending on the observed value . When the percentage of unobserved edges is high ( is small), the objective places more emphasis on correctly predicting the observed 1's and allows for larger predictions for the unobserved entries; otherwise it tends to predict smaller values (close to ) for the unobserved entries.
We assume the underlying realvalued matrix comes from the class . Once we obtain the minimizer of (16), we recover a binary matrix using by thresholding, i.e., .
We define the expected recovery error of an estimator based on observation as
where the expectation is taken over random selection of observed edges (i.e., the randomness in observing from ).
In order to relate our objective to the recovery error, we will make use of the following labeldependent error
The expected labeldependent error of an estimator is defined as
(17) 
The following lemma shows that the expected labeldependent error is proportional to the expected recovery error.
Lemma 1
Let be an estimator based on observations . Then there exists some constant independent of , such that
(18) 
For notational simplicity, we denote by the thresholded binary value from , and by the entire thresholded 0/1 matrix.
Define and . According to the data generation process, we have
(19) 
Consider now the two cases in which makes an error in predicting .
(i) When and , according to (19) we have
(ii) When and , according to (19) we have
Since is also uniquely determined by the estimator for fixed , we can instead compute by taking the expectation over the thresholded matrix as follows
We conclude the proof by setting which is independent of due to the expectation.
We now connect the labeldependent error to the weighted quadratic objective.
Lemma 2
For any , we have
(20) 
where .
Furthermore, if , we have
(21) 
where .
First, we define as
Let us first prove (20). Consider the following two cases.
(i) If , then and the two types of losses incurred at entry are