
# Network Inference by Learned Node-Specific Degree Prior

Qingming Tang (equal contribution)
Toyota Technological Institute at Chicago
6045 S. Kenwood Ave.
Chicago, Illinois 60637
Lifu Tu (equal contribution)
Toyota Technological Institute at Chicago
6045 S. Kenwood Ave.
Chicago, IL
Weiran Wang (equal contribution)
Toyota Technological Institute at Chicago
6045 S. Kenwood Ave.
Chicago, IL
Jinbo Xu
Toyota Technological Institute at Chicago
6045 S. Kenwood Ave.
Chicago, Illinois 60637
###### Abstract

We propose a novel method for network inference from partially observed edges using a node-specific degree prior. The degree prior is derived from the observed edges in the network to be inferred, and its hyper-parameters are determined by cross validation. We then formulate network inference as a matrix completion problem regularized by our degree prior. Our theoretical analysis indicates that this prior favors a network following the learned degree distribution, and may lead to an improved network recovery error bound compared with previous work. Experimental results on both simulated and real biological networks demonstrate the superior performance of our method in various settings.


## 1 Introduction

Network inference or structure learning has been widely studied in machine learning. There are two typical scenarios. In the first scenario, the task is to estimate the structure of an undirected graphical model from a high-dimensional dataset, e.g., learning gene co-expression networks from gene expression data Marbach et al. (2012); De Smet and Marchal (2010); Zhang and Horvath (2005). This task has been extensively studied, with a popular method being the graphical Lasso, which assumes the underlying graph to be sparse Meinshausen and Bühlmann (2006); Yuan and Lin (2007); Friedman et al. (2008); Banerjee et al. (2008); Wainwright et al. (2006). If prior information regarding clusters or blocks of the network is available, one may apply group penalties to promote desired patterns among the edges in a cluster Friedman et al. (2010); Mohan et al. (2014); Tibshirani et al. (2005). More recently, structure-inducing norms/priors Bach (2010); Candes et al. (2008) that promote a hub structure Mohan et al. (2014); Tan et al. (2014); Mohan et al. (2012) or a scale-free network Liu and Ihler (2011); Tang et al. (2015); Defazio and Caetano (2012) have been proposed.

In the second scenario of network inference, the task is to reconstruct the whole network or predict missing links based on a subset of observed edges, e.g., link prediction in a social network Liben-Nowell and Kleinberg (2007) and inferring unknown protein-protein interactions (PPIs) from experimentally-validated PPIs Wang et al. (2013); Dai and Prasad (2010). There are many solutions to this problem given different assumptions on data generation. For example, the problem can be solved by using influence cascade models under the diffusion process assumption Daneshmand et al. (2014); Pouget-Abadie and Horel (2015). Another popular approach is to formulate the problem as a matrix completion problem Wang et al. (2013); Huang et al. (2013); Hsieh et al. (2014), in which the matrix to be completed is often assumed to have certain structural properties, e.g., being low-rank Candès and Recht (2009); Ding et al. (2006); Mnih and Salakhutdinov (2007); Cai et al. (2010). This is reasonable since many real-world networks have only a small number of degrees of freedom.

This paper focuses on the second scenario, i.e., predicting the whole network from a subset of observed edges. We formulate this network inference problem as a matrix completion problem regularized by our novel node-specific degree prior. We learn the degree of each individual node from the observed edges by cross-validation, so that the learned degrees are (approximately) consistent with the partial observation. Considering that the observed degree distribution may differ from the true one, we use a soft rather than a hard degree constraint in our prior.

To justify our method, we show theoretically that our node-specific degree prior indeed helps induce a network that follows a given degree distribution. Under reasonable assumptions on the observation process, we provide an upper bound on the expected recovery error of our network inference algorithm, which is tighter than the error bound of existing approaches thanks to the additional regularization by our degree prior. Furthermore, we experiment with real biological network data, and show that by tuning the hyper-parameters of our degree-based prior from the observed edges, the proposed method obtains much better performance than existing methods for network inference.

The rest of the paper is organized as follows. In section 2, we introduce our node-specific degree prior, and show that it can induce a graph following a given degree distribution. In section 3, we formulate the network inference problem as a regularized matrix completion problem, and present its optimization procedure. In section 4, we discuss the difference between our method and some closely related works. In section 5, we compare our method with related methods in terms of prediction accuracy on both synthetic and real datasets. Finally, we provide concluding remarks in section 6.

##### Notations.

Let $G = (V, E)$ denote the underlying network to be inferred, where $V$ is the set of vertices ($|V| = p$) and $E$ the true edge set. With a slight abuse of notation, we also use $E$ to denote the adjacency matrix of $G$, i.e., $E_{ij} = 1$ if and only if $(i, j) \in E$. In this paper, we assume $E_{ii} = 0$ for all $i$, i.e., there is no edge connecting a node to itself. We denote by $\Omega$ the set of observed edges, and also the indicator matrix such that $\Omega_{ij} = 1$ if and only if $(i, j) \in \Omega$.

To infer the underlying graph from $\Omega$, we first estimate a real-valued symmetric matrix $X$ from $\Omega$, each entry of which indicates the strength of the evidence for an edge between two vertices. We then predict the largest $K$ entries of (the upper triangle of) $X$ as edges, where $K$ is the number of desired edges for the whole network. Alternatively, we can use a thresholding procedure to predict edges from $X$.
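The top-$K$ prediction rule just described can be sketched in a few lines (a minimal NumPy illustration; the function name and the toy matrix are ours, not from the paper):

```python
import numpy as np

def predict_edges(X, K):
    """Given a real-valued symmetric score matrix X (zero diagonal),
    return the K upper-triangular index pairs with the largest scores."""
    p = X.shape[0]
    iu, ju = np.triu_indices(p, k=1)          # strict upper triangle
    order = np.argsort(X[iu, ju])[::-1][:K]   # indices of the top-K scores
    return [(int(i), int(j)) for i, j in zip(iu[order], ju[order])]

# Toy example: the scores support edges (0,1) and (1,2) most strongly.
X = np.array([[0.0, 0.9, 0.1],
              [0.9, 0.0, 0.8],
              [0.1, 0.8, 0.0]])
print(predict_edges(X, 2))   # -> [(0, 1), (1, 2)]
```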

## 2 Network Inference by Node-Specific Degree Prior

In this section, we propose a new degree-based prior to regularize the network inference problem. We will show that this degree prior helps to induce a network following the desired node-specific degree distribution. We also derive a recovery error bound for the constrained network inference problem under reasonable assumptions.

### 2.1 Inducing the desired degree distribution

There have been several degree priors/norms for inducing a scale-free (or hub) network Liu and Ihler (2011); Defazio and Caetano (2012); Mohan et al. (2012); Tang et al. (2015). But most of these priors work at the global level, e.g., they assume the degrees of all nodes approximately follow a power-law distribution. In contrast to these existing works, we would like to use the (noisy) degree information at each individual node. As mentioned before, we assume that the target network structure is implied by a real-valued symmetric matrix $X$, and we may use the top $K$ entries of $X$ as the predicted edges (including observed edges). Let $d = (d_1, \dots, d_p)$ be the desired degrees of the nodes. We propose the following prior for $X$:

$$S_H(X, d, \alpha) \;=\; \sum_{i=1}^{p} \frac{H^{\alpha}_{1} X_{i,[1]} + H^{\alpha}_{2} X_{i,[2]} + \cdots + H^{\alpha}_{p-1} X_{i,[p-1]}}{H^{\alpha}_{d_i}} \qquad (1)$$

where $\alpha > 0$ is a hyper-parameter and $(H_k)$ is a monotonically increasing positive sequence, i.e., $0 < H_1 < H_2 < \cdots$. In this paper, we use $H_k = \log(k+1)$ for $k = 1, \dots, p-1$, although there are many other choices for $(H_k)$ as long as it increases at a moderate rate. $X_{i,[k]}$ denotes the $k$-th largest entry in the $i$-th row of $X$ excluding $X_{ii}$. That is, the likelihood of node $i$ and node $[k]$ forming an edge is ranked in the $k$-th position among all the possible edges adjacent to node $i$.
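As an illustration, the prior (1) can be evaluated directly from its definition (a sketch assuming $H_k = \log(k+1)$ as above; the function name and toy data are ours):

```python
import numpy as np

def degree_prior(X, d, alpha):
    """S_H(X, d, alpha): in each row i, the entries (diagonal excluded) are
    sorted descendingly; the k-th largest gets coefficient
    H_k^alpha / H_{d_i}^alpha with H_k = log(k + 1)."""
    p = X.shape[0]
    H = np.log(np.arange(1, p) + 1.0) ** alpha      # H_1^a, ..., H_{p-1}^a
    total = 0.0
    for i in range(p):
        row = np.delete(X[i], i)                    # exclude X_ii
        row_sorted = np.sort(row)[::-1]             # X_{i,[1]} >= ... >= X_{i,[p-1]}
        total += np.dot(H, row_sorted) / (np.log(d[i] + 1.0) ** alpha)
    return total

# With alpha = 0 every coefficient equals 1, so the prior reduces to the
# l1 norm of the off-diagonal entries (here 2+1+2+3+1+3 = 12).
X = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])
print(degree_prior(X, d=[1, 2, 2], alpha=0.0))   # -> 12.0
```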

Theorem 1 below shows that the prior in (1) favors a graph that follows a given degree distribution.

###### Theorem 1

Let $d = (d_1, \dots, d_p)$ be the degree distribution of the underlying network, where $\sum_{i=1}^{p} d_i = 2K$. Let $X$ be a real-valued symmetric matrix with non-negative entries, and let $d'$ ($d' \neq d$) be the degree distribution of the network resulting from the top $K$ upper-diagonal entries of $X$. Then there exists another symmetric matrix $X^*$ satisfying the following conditions: (1) $X^*$ has the same set of entries as $X$; (2) the network derived from the top $K$ entries of $X^*$ has degree distribution $d$; and (3) $S_H(X^*, d, \alpha) \le S_H(X, d, \alpha)$.

Proof.

Let $S_H^i(X, d, \alpha)$ be the sum of the terms associated with node $i$ in (1). Then we have

$$S_H^i(X, d, \alpha) \;=\; \frac{H^{\alpha}_{1}}{H^{\alpha}_{d_i}} X_{i,[1]} + \frac{H^{\alpha}_{2}}{H^{\alpha}_{d_i}} X_{i,[2]} + \cdots + \frac{H^{\alpha}_{p-1}}{H^{\alpha}_{d_i}} X_{i,[p-1]},$$

which is a linear combination of the (sorted) entries of the $i$-th row of $X$. By the definition of $(H_k)$, we have

$$\frac{H^{\alpha}_{1}}{H^{\alpha}_{d_i}} < \frac{H^{\alpha}_{2}}{H^{\alpha}_{d_i}} < \cdots < \frac{H^{\alpha}_{d_i}}{H^{\alpha}_{d_i}} = 1 < \cdots < \frac{H^{\alpha}_{p-1}}{H^{\alpha}_{d_i}}. \qquad (2)$$

We now construct $X^*$ from $X$ as follows. First, we sort all the upper-diagonal elements of $X$ descendingly to obtain the sequence $Y$:

$$Y_1 \ge Y_2 \ge \cdots \ge Y_K \ge \cdots \ge Y_{p(p-1)/2} \ge 0.$$

Then we employ Algorithm 1 to place the elements of $Y$, in descending order, into pairs of entries $X^*_{ij}$ and $X^*_{ji}$ of $X^*$. Each $Y_k$ appears twice in $X^*$: once in row $i$ with coefficient $H^{\alpha}_{r_i} / H^{\alpha}_{d_i}$, where $r_i$ is the ranking of this entry in row $i$, and once in row $j$ with coefficient $H^{\alpha}_{r_j} / H^{\alpha}_{d_j}$. Thus the contribution of $Y_k$ to $S_H(X^*, d, \alpha)$ is $\big( H^{\alpha}_{r_i} / H^{\alpha}_{d_i} + H^{\alpha}_{r_j} / H^{\alpha}_{d_j} \big) Y_k$.

We now prove that has the desired properties.

##### $X^*$ is symmetric and has the same set of entries as $X$.

It is clear from step 5 of the algorithm that the resulting $X^*$ is symmetric. Since in each iteration we assign a different element of $Y$ to a pair of entries of $X^*$, the two matrices have the same set of entries.

##### $X^*$ has the desired degree distribution $d$.

The network resulting from $X^*$ is determined by the first $K$ iterations of Algorithm 1. (2) indicates that $H^{\alpha}_{k} / H^{\alpha}_{d_i} \le 1$ for any $k \le d_i$, so the following set $\mathcal{K}$ contains the $2K$ smallest coefficients in (1).

$$\mathcal{K} = \left\{ \frac{H^{\alpha}_{1}}{H^{\alpha}_{d_1}}, \dots, \frac{H^{\alpha}_{d_1}}{H^{\alpha}_{d_1}},\; \frac{H^{\alpha}_{1}}{H^{\alpha}_{d_2}}, \dots, \frac{H^{\alpha}_{d_2}}{H^{\alpha}_{d_2}},\; \dots,\; \frac{H^{\alpha}_{1}}{H^{\alpha}_{d_p}}, \dots, \frac{H^{\alpha}_{d_p}}{H^{\alpha}_{d_p}} \right\}$$

Since step 4 of Algorithm 1 always selects a pair of indices $i$ and $j$ with the smallest available coefficients $H^{\alpha}_{r_i} / H^{\alpha}_{d_i}$ and $H^{\alpha}_{r_j} / H^{\alpha}_{d_j}$, these selected coefficients must be chosen from $\mathcal{K}$. As a result, upon termination of the first $K$ iterations of Algorithm 1, each node $i$ is adjacent to exactly $d_i$ predicted edges, and $X^*$ has the desired degree distribution.

##### $S_H(X^*, d, \alpha) \le S_H(X, d, \alpha)$, i.e., $X^*$ has a smaller penalty than $X$.

Note that both $X$ and $X^*$ have the same set of ranked entries $Y$. A larger entry in $X^*$ always receives a smaller coefficient, and all entries and coefficients are non-negative, so the resulting $X^*$ has the smallest penalty among all the matrices with the same set of entries as $X$.

Theorem 1 shows that (1) is a structure-inducing prior that favors a network following a given degree distribution $d$. We can tune the parameter $\alpha$ to control the impact of the prior. The induced graph would almost exactly follow the degree distribution $d$ when $\alpha$ is large, as the penalty coefficient for $X_{i,[k]}$ with $k > d_i$ grows quickly with $\alpha$. On the other hand, when $\alpha$ is small, the prior only weakly encourages the given degree distribution. Furthermore, when $\alpha = 0$, the prior reduces to the $\ell_1$ norm.

Although (1) induces a graph following a given degree distribution, it is a soft constraint and tolerates some noise in the degree distribution. In fact, the difference in penalty coefficients of two entries $X_{i,[t]}$ and $X_{i,[t+\delta]}$ is bounded as

$$\frac{H^{\alpha}_{t+\delta} - H^{\alpha}_{t}}{H^{\alpha}_{d_i}} \;\le\; \frac{\log^{\alpha}(\delta+1) - \log^{\alpha}(1)}{\log^{\alpha}(d_i+1)} \;=\; \left( \frac{\log(\delta+1)}{\log(d_i+1)} \right)^{\alpha} \qquad (3)$$

which is very small when $d_i$ is large and $\delta$ is small. This bound implies the following properties of our prior in (1).
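The bound (3) can also be checked numerically. The sketch below verifies it on a grid for $\alpha \in (0, 1]$, where $\log^{\alpha}(x+1)$ is concave and vanishes at $x = 0$ (which is what makes the bound hold); the helper names are ours:

```python
import math

# Check (3): (H_{t+delta}^a - H_t^a) / H_{d_i}^a <= (log(delta+1)/log(d_i+1))^a
# with H_k = log(k+1), on a grid of ranks, gaps, degrees, and alpha in (0, 1].
def coeff_gap(t, delta, d_i, alpha):
    H = lambda k: math.log(k + 1)
    return (H(t + delta) ** alpha - H(t) ** alpha) / H(d_i) ** alpha

def bound(delta, d_i, alpha):
    return (math.log(delta + 1) / math.log(d_i + 1)) ** alpha

ok = all(
    coeff_gap(t, delta, d_i, a) <= bound(delta, d_i, a) + 1e-12
    for t in range(1, 50)
    for delta in range(1, 20)
    for d_i in (2, 5, 20, 100)
    for a in (0.25, 0.5, 1.0)
)
print(ok)   # -> True
```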

1. The larger the degree of a node is, i.e., the more neighbors it has, the smoother its penalty coefficients are.

2. When two entries in the same row have similar values and thus similar rankings, their corresponding penalty coefficients in the regularizer are also close.

These smoothness properties can be further tuned via the parameter $\alpha$.

### 2.2 Recovery Error Bound

Mathematically, the network inference problem addressed in this paper is related to PU (Positive-Unlabeled) learning for matrix completion Hsieh et al. (2014).

Here, instead of picking the top $K$ entries, we assume that we derive the network from a real-valued symmetric matrix by thresholding (the two approaches are closely related). We assume that the 0/1 adjacency matrix $E$ is observed from an underlying real-valued matrix $\bar{X}$ by a thresholding process: let $q$ be the threshold value, then $E_{ij} = I(\bar{X}_{ij} > q)$, where $I(\cdot)$ is the indicator function. Furthermore, we assume that a subset $\Omega$ of the edge set $E$ is observed by uniformly sampling elements from $E$ with probability $\rho$. Recall that we use the matrix $\Omega$ to denote the observations, where $\Omega_{ij} = 1$ if the edge $(i,j)$ is observed and $\Omega_{ij} = 0$ otherwise.
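This observation process can be simulated directly (a minimal sketch; the function `observe` and all parameter values are ours):

```python
import numpy as np

def observe(X_bar, q, rho, rng):
    """Threshold the underlying real-valued matrix X_bar at q to get the
    0/1 adjacency E, then reveal each true edge independently with
    probability rho (PU observations: non-edges are never labeled)."""
    E = (X_bar > q).astype(int)
    np.fill_diagonal(E, 0)                # no self-loops
    keep = rng.random(E.shape) < rho
    keep = np.triu(keep, 1)
    keep = keep | keep.T                  # sample each edge once, keep symmetry
    return E, E * keep

rng = np.random.default_rng(0)
p = 6
X_bar = rng.random((p, p))
X_bar = (X_bar + X_bar.T) / 2             # symmetric scores in [0, 1]
E, Omega = observe(X_bar, q=0.5, rho=0.7, rng=rng)
assert ((Omega == 1) <= (E == 1)).all()   # every observed edge is a true edge
```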

We assume the underlying real-valued matrix comes from the set $\mathcal{X}$, defined as

$$\mathcal{X} := \left\{ X \in \mathbb{R}^{p \times p} \,\middle|\, X = X^{\top},\; 0 \le X \le 1,\; \|X\|_* \le t,\; S_H(X, d, \alpha) \le r \right\},$$

where $0 \le X \le 1$ is elementwise comparison and $\|\cdot\|_*$ is the nuclear norm (sum of singular values). We assume the underlying matrix has small nuclear norm as a proxy for being low-rank, which is a common approach in matrix completion Srebro and Shraibman (2005). We use the following loss function to estimate $\bar{X}$ from $\Omega$:

$$L(X; \Omega) \;:=\; \left(1 - \frac{\rho}{2}\right) \sum_{(i,j):\, \Omega_{ij}=1} (X_{ij} - 1)^2 \;+\; \frac{\rho}{2} \sum_{(i,j):\, \Omega_{ij}=0} (X_{ij} - 0)^2 \qquad (4)$$

This loss function assigns two different weights to the squared error at entry $(i,j)$ depending on the observed value $\Omega_{ij}$. When the percentage of unobserved edges is high (i.e., $\rho$ is small), the loss function puts more weight on correctly predicting the observed $1$'s and allows larger errors for the unobserved entries; otherwise it tends to predict smaller values for the unobserved entries.
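A direct transcription of the loss (4) (the function name and toy data are ours):

```python
import numpy as np

def pu_loss(X, Omega, rho):
    """Weighted squared loss (4): entries with Omega_ij = 1 are pulled toward 1
    with weight 1 - rho/2; all other entries are pulled toward 0 with the
    (smaller, when rho < 1) weight rho/2."""
    obs = Omega == 1
    return ((1 - rho / 2) * ((X[obs] - 1.0) ** 2).sum()
            + (rho / 2) * (X[~obs] ** 2).sum())

# A perfect fit to the observations incurs zero loss.
Omega = np.array([[0, 1], [1, 0]])
print(pu_loss(Omega.astype(float), Omega, rho=0.4))   # -> 0.0
```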

Let $\hat{X}$ minimize the loss function subject to the constraint $\hat{X} \in \mathcal{X}$. We may predict an edge set from $\hat{X}$ by thresholding; that is, there is an edge between $i$ and $j$ if and only if $\hat{X}_{ij} > q$. We can estimate the expected recovery error of $\hat{X}$ as follows.

$$R(\hat{X}) = \mathbb{E}_{\Omega}\!\left[ \sum_{i,j} I\big( \mathrm{thr}(\hat{X}_{ij}) \neq E_{ij} \big) \right] \qquad (5)$$

where the expectation is taken over the random selection of the observed edge set $\Omega$. We have the following theorem regarding the bound on the expected recovery error.

###### Theorem 2

Assume the underlying graph has degree distribution $d$. Let $q$ be the threshold value used to obtain the edges $E$ from the underlying matrix $\bar{X}$. Let $\hat{X}$ be the minimizer of the loss function defined in (4) subject to the constraint that $\hat{X} \in \mathcal{X}$. Assume that the regularizer uses an estimated degree distribution $\hat{d}$ such that $\bar{X}$ still satisfies the constraint $S_H(\bar{X}, \hat{d}, \alpha) \le r$. Then with probability at least $1 - \delta$, we have

$$R(\hat{X}) \;\le\; \frac{4\gamma}{(2-\rho)\rho} \left( 2\min(A, B) + \sqrt{\frac{s \log(2/\delta)}{2}} \right)$$

where $A$ and $B$ are complexity terms depending on the nuclear-norm bound $t$, the degree-prior bound $r$, and the maximum degree (their exact forms are given in the supplementary material), and $\gamma$ is a universal constant.

This theorem shows that, on average, the number of mistakes made in the recovered network is bounded, and the bound does not grow with the matrix size $p$; rather, it depends (mildly) on the maximum degree, the number of edges, and the complexity of the matrix. In the case of a sparse network, the average error has a very small bound. See our supplementary material for a detailed proof of Theorem 2. Part of our proof follows the techniques used by Hsieh et al. (2014). That is, we relate the expected recovery error (5) to the following label-dependent recovery error:

$$R_{\rho}(X; \Omega) = \sum_{i,j} \left\{ \left(1 - \frac{\rho}{2}\right) I\big( \mathrm{thr}(X_{ij}) = 0 \big)\, I(\Omega_{ij} = 1) + \frac{\rho}{2}\, I\big( \mathrm{thr}(X_{ij}) = 1 \big)\, I(\Omega_{ij} = 0) \right\}$$

which is in turn related to the loss function defined in (4). We can then derive a generalization guarantee for the (bounded) real-valued loss function using the Rademacher complexity Bartlett and Mendelson (2003) of $\mathcal{X}$, which is controlled by the nuclear norm and the degree prior. Our error bound is better than that of Hsieh et al. (2014) due to the additional constraint imposed by our degree-based prior.

### 2.3 Learning Node Degree via Cross Validation

In order to use our node-specific degree prior, we need to estimate the degree of each individual node of the underlying network. The naive strategy of searching the space of all possible degree distributions $d$ is clearly infeasible. Thus, we derive $d$ from the observed degrees $\hat{d} = (\hat{d}_1, \dots, \hat{d}_p)$, where $\hat{d}_i$ is the degree of node $i$ in the observed network $\Omega$. Assuming that the network to be predicted has $K$ edges, under the uniform sampling assumption we can estimate the degrees of the predicted network by $d = \hat{d} / \rho$, where $\rho = \sum_i \hat{d}_i / (2K)$.
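A small numerical example of this degree estimate (the function name and toy matrix are ours); note that the rescaled degrees sum to $2K$ by construction:

```python
import numpy as np

def estimate_degrees(Omega, K):
    """Estimate target degrees d from the observed adjacency Omega, assuming
    edges were observed uniformly: with rho = sum_i d_hat_i / (2K), set
    d_i = d_hat_i / rho, so that sum_i d_i = 2K."""
    d_hat = Omega.sum(axis=1)              # observed degrees
    rho = d_hat.sum() / (2.0 * K)          # estimated observation rate
    return d_hat / rho

# Toy check: 4 observed edges, target network with K = 8 edges.
Omega = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]])
d = estimate_degrees(Omega, K=8)
print(d.sum())   # -> 16.0 (= 2K, as required for K predicted edges)
```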

Let $F_{\rho}(X)$ be any loss function for $X$, where $\rho$ is a hyper-parameter. Considering both the loss and the degree-based prior, we solve the following regularized objective:

$$\min_{X} \; F_{\rho}(X) + \lambda\, S_H(X, c\,d, \alpha). \qquad (6)$$

where $S_H$ is the degree-based prior and $c \ge 1$ can be interpreted as an amplification factor. We define the new degree of node $i$ after multiplying by $c$ as $c\,d_i$. By setting $c > 1$, we amplify the impact of the hub nodes, as the gap between the estimated degrees of hub nodes and non-hub nodes is enlarged, resulting in a larger difference in their penalty coefficients in $S_H$. As in the matrix completion setting, we may determine the hyper-parameters $\rho$, $\lambda$, $c$, and $\alpha$ through cross validation. That is, we randomly hold out some observed edges as a tuning set, and use the remaining observations to train the model and predict all missing edges. We then select the hyper-parameters that give the best prediction accuracy on the held-out tuning set.


## 3 Model and Optimization

### 3.1 Matrix Completion with Learned Node-Specific Degree Prior

We now consider how to solve the network inference problem (i.e., recover $E$) given some observed edges (i.e., $\Omega$) using matrix completion regularized by our degree-based prior. Assume that the observed degrees of the nodes are $\hat{d}$ and that we would like to output $K$ edges as the solution. Let $d = \hat{d} / \rho$ with $\rho = \sum_i \hat{d}_i / (2K)$ be the estimated degrees of the nodes in the predicted network. Using non-negative matrix tri-factorization Ding et al. (2006), the regularized matrix completion problem can be formulated as

$$\min_{U \ge 0,\, S \ge 0} \; \sum_{ij} M_{ij} \left( (USU^{\top})_{ij} - \Omega_{ij} \right)^2 + \lambda\, S_H(USU^{\top}, c\,d, \alpha) \qquad (8)$$

where $M_{ij} = 1 - \frac{\rho}{2}$ if $\Omega_{ij} = 1$ and $M_{ij} = \frac{\rho}{2}$ otherwise.

In the above objective function, the first term is the loss function and the second term is our degree-based prior. Meanwhile, $\rho$, $\lambda$, $c$, and $\alpha$ are the hyper-parameters to be tuned through cross-validation. This tri-factorization method has previously been applied to recover protein-protein interaction (PPI) networks Wang et al. (2013). Note that (8) differs from (4) in that it removes the hard constraints (which are often satisfied by the solution anyway) and uses factorization instead of a nuclear norm constraint to enforce low-rank structure.

### 3.2 Optimization

By introducing an additional variable $X$ and the constraint $USU^{\top} = X$, we rewrite (8) as

$$\min_{U \ge 0,\, S \ge 0,\, X} \; \sum_{ij} M_{ij} \left( (USU^{\top})_{ij} - \Omega_{ij} \right)^2 + \lambda\, S_H(X, c\,d, \alpha) \quad \text{s.t.} \quad USU^{\top} = X. \qquad (9)$$

This formulation can be solved by the alternating direction method of multipliers (ADMM, Boyd et al., 2011), which has been successfully applied to non-negative matrix factorization problems (Sun and Fevotte, 2014). ADMM solves (9) by iterating the following three steps until convergence, where $\eta > 0$ is an optimization parameter that we fix:

$$U^{(t+1)}, S^{(t+1)} = \operatorname*{arg\,min}_{U \ge 0,\, S \ge 0} \; \sum_{ij} M_{ij} \left( (USU^{\top})_{ij} - \Omega_{ij} \right)^2 + \frac{\eta}{2} \left\| USU^{\top} - X^{(t)} + Z^{(t)} \right\|_F^2, \qquad (10)$$

$$X^{(t+1)} = \operatorname*{arg\,min}_{X \ge 0} \; \lambda\, S_H(X, c\,d, \alpha) + \frac{\eta}{2} \left\| U^{(t+1)} S^{(t+1)} (U^{(t+1)})^{\top} - X + Z^{(t)} \right\|_F^2, \qquad (11)$$

$$Z^{(t+1)} = Z^{(t)} + U^{(t+1)} S^{(t+1)} (U^{(t+1)})^{\top} - X^{(t+1)}.$$

#### 3.2.1 Solving (10)

Let $Y^{(t)} = X^{(t)} - Z^{(t)}$ and combine the similar terms in (10). Solving (10) is then equivalent to minimizing the function

$$\sum_{ij} M_{ij} \left( (USU^{\top})_{ij} - \Omega_{ij} \right)^2 + \frac{\eta}{2} \sum_{ij} \left( (USU^{\top})_{ij} - Y^{(t)}_{ij} \right)^2 = \sum_{ij} \left( M_{ij} + \frac{\eta}{2} \right) (USU^{\top})_{ij}^2 - \sum_{ij} \left( 2 M_{ij} \Omega_{ij} + \eta Y^{(t)}_{ij} \right) (USU^{\top})_{ij} + \mathrm{const},$$

which can be further reduced to solving

$$\min_{U \ge 0,\, S \ge 0} \; \sum_{ij} \left( M_{ij} + \frac{\eta}{2} \right) \left( (USU^{\top})_{ij} - \frac{M_{ij} \Omega_{ij} + \frac{\eta}{2} Y^{(t)}_{ij}}{M_{ij} + \frac{\eta}{2}} \right)^2.$$

This is a weighted matrix tri-factorization problem which we solve with the algorithm of Ding et al. (2006).
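The merge of the two quadratic terms into a single weighted tri-factorization problem can be written and sanity-checked as follows (a sketch; the names and toy data are ours). The check confirms that the merged objective differs from the original one only by a constant:

```python
import numpy as np

def trifact_subproblem_target(M, Omega, Y, eta):
    """Merge the weighted data term and the ADMM quadratic term of (10) into
    one weighted problem  sum_ij W_ij ((U S U^T)_ij - T_ij)^2 (+ const),
    with weights W = M + eta/2 and targets T = (M*Omega + (eta/2)*Y) / W."""
    W = M + eta / 2.0
    T = (M * Omega + (eta / 2.0) * Y) / W
    return W, T

rng = np.random.default_rng(3)
M = np.where(rng.random((4, 4)) < 0.5, 0.8, 0.2)   # weights 1-rho/2 or rho/2
Omega = (M == 0.8).astype(float)
Y = rng.random((4, 4))
W, T = trifact_subproblem_target(M, Omega, Y, eta=2.0)

f = lambda Z: (M * (Z - Omega) ** 2).sum() + 1.0 * ((Z - Y) ** 2).sum()  # eta/2 = 1
g = lambda Z: (W * (Z - T) ** 2).sum()
Z1, Z2 = rng.random((4, 4)), rng.random((4, 4))
print(np.isclose(f(Z1) - g(Z1), f(Z2) - g(Z2)))   # -> True
```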

#### 3.2.2 Solving (11)

We can divide (11) into $p$ subproblems, one for each node of the graph. The objective for the $i$-th node is

$$\min_{X_i \ge 0} \; \frac{\eta}{2} \| X_i - A_i \|^2 + \lambda \sum_{k=1}^{p-1} b_i(k)\, X_{i,[k]}, \qquad (12)$$

where $A = U^{(t+1)} S^{(t+1)} (U^{(t+1)})^{\top} + Z^{(t)}$, $X_i$ and $A_i$ are the $i$-th columns of $X$ and $A$ respectively, and $b_i(k) = H^{\alpha}_{k} / H^{\alpha}_{c d_i}$ for $k = 1, \dots, p-1$. However, these smaller problems are not independent, since $X$ needs to be symmetric. To deal with this constraint, we apply the idea of Defazio and Caetano (2012) and rewrite the problem as

$$\min_{X \ge 0} \; \frac{\eta}{2} \| X - A \|_F^2 + \lambda\, S_H(X, c\,d, \alpha) \quad \text{s.t.} \quad X = X^{\top}. \qquad (13)$$

We can then apply dual decomposition Sontag et al. (2011) to (13). Specifically, we introduce a Lagrangian term and minimize the following objective in each iteration of dual decomposition:

$$\min_{X \ge 0} \; \frac{\eta}{2} \| X - A \|_F^2 + \lambda\, S_H(X, c\,d, \alpha) + \mathrm{tr}\big( (B - B^{\top}) X \big), \qquad (14)$$

where $B$ is the matrix of dual variables, whose entries are adjusted according to the difference between $X$ and $X^{\top}$ in each iteration. By completing the square, (14) is transformed into the form

$$\min_{X \ge 0} \; \frac{\eta}{2} \left\| X - \left( A + \frac{1}{\eta} (B - B^{\top}) \right) \right\|_F^2 + \lambda\, S_H(X, c\,d, \alpha). \qquad (15)$$

This objective can now be decomposed into $p$ independent subproblems, each of which has the same form as (12). We solve each subproblem using the algorithm of Tang et al. (2015); Bogdan et al. (2013) with time complexity $O(p \log p)$.
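The completing-the-square step from (14) to (15) can be verified numerically: since the tr term is linear in $X$, the two objectives differ by a constant independent of $X$ and therefore share minimizers (a sketch with our own toy data):

```python
import numpy as np

rng = np.random.default_rng(1)
p, eta = 5, 2.0
A = rng.random((p, p))
B = rng.random((p, p))
C = B - B.T                                  # skew-symmetric Lagrangian term

def f_original(X):                           # quadratic part of (14)
    return eta / 2 * np.sum((X - A) ** 2) + np.trace(C @ X)

def f_completed(X):                          # quadratic part of (15)
    return eta / 2 * np.sum((X - (A + C / eta)) ** 2)

X1, X2 = rng.random((p, p)), rng.random((p, p))
gap1 = f_original(X1) - f_completed(X1)
gap2 = f_original(X2) - f_completed(X2)
print(abs(gap1 - gap2) < 1e-9)   # -> True: the gap is a constant
```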

## 4 Related Work

There have been several works on applying matrix completion techniques to the link prediction task, including Hsieh et al. (2014); Wang et al. (2013); Huang et al. (2013). Hsieh et al. (2014) considered the PU (positive-unlabeled) learning setting for matrix completion, where the observed entries are purely 1's while all other entries are unlabeled. Wang et al. (2013) applied the orthogonal matrix tri-factorization technique Ding et al. (2006) to the problem of protein-protein interaction prediction (a real biological problem that can be modeled as link prediction) and obtained significant improvements in prediction accuracy. Huang et al. (2013) used trace-norm-regularized discrete matrix completion to predict new links in social networks and protein-protein interaction networks. However, none of these matrix completion based approaches carefully utilizes the degree information conveyed by the observed samples.

In this work, we use a degree-prior-regularized matrix completion framework for predicting the missing edges of a network. Although our degree prior may be mathematically similar to some existing scale-free priors in the literature Liu and Ihler (2011); Defazio and Caetano (2012); Tang et al. (2015), there are clear differences between our work and theirs. Existing priors are all based upon the (global) scale-free assumption, while the prior we use here is directly learned from the observed samples. Both theoretical and experimental results show that our learned prior leads to better prediction performance in practice. Furthermore, our work addresses a very different problem setting than the abovementioned previous works. While those works estimate the network from a given covariance matrix calculated from observed attributes of nodes (e.g., gene expression data), our work aims to predict missing links in a partially observed network without any observed node attributes. For example, those works are suitable for gene co-expression network inference from a set of measured gene expression levels, but not for missing link prediction in a social network; the latter problem can be attacked using the method proposed here.

## 5 Experimental Results

We have implemented the proposed method described in section 3 (denoted as Tri+Degree) and compared its performance against the following 4 methods: the tri-factorization method without any prior (denoted as Tri), tri-factorization regularized by an $\ell_1$ penalty (denoted as Tri+L1), tri-factorization regularized by a scale-free penalty (denoted as Tri+Scale) Liu and Ihler (2011), and PU learning for matrix completion (denoted as PU) Hsieh et al. (2014). Cross validation is performed for all the methods for hyper-parameter selection. We report for each method the averaged result over 5 random seeds (used for the initialization of matrix completion).

We conduct three experiments to test the performance of the 5 methods. In the first experiment, we use a simulated gene co-expression network in the well-known DREAM5 challenge Marbach et al. (2012). The second experiment makes use of one protein-protein interaction (PPI) network in BioGrid Stark et al. (2006). We randomly sample some of the known edges in this PPI network as observations and use the unsampled edges as the test set. In the last experiment, we test different methods using multiple releases of the protein-protein interaction network for the species “Rat”. In particular, we predict the network from an old release and then evaluate the consistency between the predicted network and a newer release. Our experiments show that our method obtains the best performance in almost all the experimental settings and in some cases, our method shows significant advantage over the others.

### 5.1 Gene Expression Network

The ground truth network comes from the DREAM5 data. We consider a node to be a hub if its degree is among the top 20% of all nodes. To conduct the experiment, we sample some edges from the ground truth as observations. The following three sampling conditions are tested.

• Uniform sampling: All edges are sampled uniformly at random. Two base sampling rates, 90% and 70%, are used.

• Over-sampling: Edges adjacent to hubs are over-sampled. That is, we sample a higher fraction of the edges adjacent to hubs than of the edges not adjacent to hubs; two settings with different rate pairs are tested.

• Under-sampling: Edges adjacent to hubs are under-sampled; again, two settings with different rate pairs for edges adjacent vs. not adjacent to hubs are tested.

The second and third strategies are used to test how robust our method is with respect to sampling bias.

We give the results obtained by different methods in Figure 1, where each panel shows how many predictions agree with the ground truth as we predict more and more edges (increasing $K$) from the real-valued matrix. (We exclude the already observed edges when counting the "number of correctly predicted edges", so the maximum number of correct predictions for the 90% sampling rate is 1/3 of that for the 70% sampling rate.) Panels in the top row show results for the base sampling rate of 90%, and panels in the bottom row show results for the more challenging base sampling rate of 70%. The left column of Figure 1 shows the experimental results when the observed edges are uniformly sampled from the true edges, which is the usual assumption in many real applications. In this case, the performance of our approach is much better than the others. This is because the degree distribution learned from the observations is consistent with the ground truth, and thus our degree prior is most effective. Further, a larger sampling rate results in better performance than a smaller one, because when more edges are observed, the learned degree distribution is closer to the ground truth. The middle and right columns of Figure 1 show results under the over-sampling and under-sampling conditions, respectively. In these two cases, our approach Tri+Degree still outperforms the others, though not by as much as in the uniform sampling condition. This shows the robustness of our method with respect to mismatch in sampling conditions. Overall, our method achieves the best prediction accuracy in almost all conditions for a range of $K$ values.

### 5.2 Protein-Protein Interaction Network

We further use the protein-protein interaction (PPI) network for Plasmodium in the BioGrid database (release 3.3.124) to evaluate the performance of the five methods. This network consists of a set of nodes connected by experimentally-validated true edges. We sample two different fractions of the true edges as the input to all of the tested methods. As shown in Figure 2, our method Tri+Degree outperforms the others in both settings. This shows that, in addition to gene co-expression networks, our method also works well on PPI networks.

### 5.3 Multiple Releases of Protein-Protein Interaction Networks

In the above two experiments, the training and test edges are sampled from the networks to be inferred. Here we test our method with a different strategy. In particular, we make use of three releases of the protein-protein interaction (PPI) network for the Rat species, taken from the BioGrid database, with increasing numbers of nodes and known edges. A newer release typically contains all the edges in an older release. In this experiment, we predict PPIs from an older release and then check whether the predicted PPIs appear in a newer release. A prediction method is better if it yields a larger number of predicted PPIs that appear in a newer release.

Table 1 shows that our method outperforms the other four methods significantly. Each entry in the table contains the test results of all methods on a pair of releases. For example, the entry at the third row and the third column contains the results obtained by using the earliest release as input and the difference between it and a newer release as the test set. In each entry, from left to right, the numbers are the ratios of the edges only in the newer release that are recovered by PU, Tri, Tri+Scale, Tri+L1, and our method, respectively. Each method predicts a number of new edges proportional to the number of proteins in the older release. In fact, when predicting more edges, our method has an even larger advantage over the others.

## 6 Conclusion

This paper presents a novel method for network inference (link prediction) using a node-specific degree prior learned from a subset of observed edges. We show theoretically that the proposed degree prior is effective in inducing a network that approximately follows the observed degree distribution. We propose to use a matrix completion objective regularized by our degree prior for network inference, and provide a recovery guarantee for our structured matrix completion method under the uniform sampling assumption. Our experimental results show that our method achieves better performance than existing approaches on both simulated gene-expression networks and real protein-protein interaction networks.

Our analysis has mostly assumed that the observed edges are uniformly sampled from the true network, which is the typical case in many real applications. It turns out that our approach also performs better than existing methods when the edges adjacent to hubs are somewhat over- or under-sampled, so our method is not very sensitive to mismatch in sampling conditions. Nevertheless, it would be interesting and useful to study the performance of different methods under more challenging conditions.

## 7 Proof of Theorem 2

Mathematically, the network inference problem addressed in our paper is related to PU (positive-unlabeled) learning for matrix completion Hsieh et al. (2014). Thus, we start from the same problem setting as Hsieh et al. (2014) and borrow some similar techniques in proving our bound. We show theoretically that our degree prior does improve network inference, as it guarantees a better bound on the recovery error.

##### Problem setting

We assume that a 0/1 matrix $E$ is observed from a real-valued matrix $\bar{X}$ by a thresholding process: let $q$ be the threshold value, then $E_{ij} = I(\bar{X}_{ij} > q)$, where $I(\cdot)$ is the indicator function. Furthermore, we assume that a subset $\Omega$ of the edge set $E$ is observed by uniformly sampling elements from $E$ with probability $\rho$. We use the matrix $\Omega$ to denote the observations, i.e., $\Omega_{ij} = 1$ if the edge $(i,j)$ is observed and $\Omega_{ij} = 0$ otherwise. This data generation process indicates

$$P(\Omega_{ij} = 1 \mid E_{ij} = 1) = \rho, \quad P(\Omega_{ij} = 0 \mid E_{ij} = 1) = 1 - \rho, \quad P(\Omega_{ij} = 1 \mid E_{ij} = 0) = 0, \quad P(\Omega_{ij} = 0 \mid E_{ij} = 0) = 1.$$

Consider the following objective as a proxy for recovering $E$:

$$\min_{X} \; R_l(X; \Omega) := \left(1 - \frac{\rho}{2}\right) \sum_{(i,j):\, \Omega_{ij}=1} (X_{ij} - 1)^2 + \frac{\rho}{2} \sum_{(i,j):\, \Omega_{ij}=0} (X_{ij} - 0)^2 \qquad (16)$$

$$\text{s.t.} \quad X \in \mathcal{X} := \left\{ X \in \mathbb{R}^{p \times p} \,\middle|\, X = X^{\top},\; 0 \le X \le 1,\; \|X\|_* \le t,\; S_H(X, d, \alpha) \le r \right\},$$

where $0 \le X \le 1$ indicates elementwise comparison, i.e., $0 \le X_{ij} \le 1$ for all $i, j$. This objective assigns different weights to the squared error at entry $(i,j)$ depending on the observed value $\Omega_{ij}$. When the percentage of unobserved edges is high ($\rho$ is small), the objective puts more emphasis on correctly predicting the observed $1$'s and allows larger predictions for the unobserved entries; otherwise it tends to predict smaller values (close to $0$) for the unobserved entries.

We assume the underlying real-valued matrix $\bar{X}$ comes from the class $\mathcal{X}$. Once we obtain the minimizer $\hat{X}$ of (16), we recover a binary matrix from $\hat{X}$ by thresholding, i.e., $\mathrm{thr}(\hat{X}_{ij}) = I(\hat{X}_{ij} > q)$.

We define the expected recovery error of an estimator $X$ based on the observation $\Omega$ as

$$R(X) = \mathbb{E}_{\Omega}\!\left[ \sum_{i,j} I\big( \mathrm{thr}(X_{ij}) \neq E_{ij} \big) \right],$$

where the expectation is taken over the random selection of observed edges (i.e., the randomness in observing $\Omega$ from $E$).

In order to relate our objective to the recovery error, we make use of the following label-dependent error:

$$R_{\rho}(X_{ij}, \Omega_{ij}) = \left(1 - \frac{\rho}{2}\right) I\big( \mathrm{thr}(X_{ij}) = 0 \big)\, I(\Omega_{ij} = 1) + \frac{\rho}{2}\, I\big( \mathrm{thr}(X_{ij}) = 1 \big)\, I(\Omega_{ij} = 0),$$

$$R_{\rho}(X; \Omega) = \sum_{i,j} R_{\rho}(X_{ij}, \Omega_{ij}).$$

The expected label-dependent error of an estimator $X$ is defined as

$$R_{\rho}(X) := \mathbb{E}_{\Omega}\big[ R_{\rho}(X; \Omega) \big] = \mathbb{E}_{\Omega}\!\left[ \sum_{i,j} R_{\rho}(X_{ij}, \Omega_{ij}) \right]. \qquad (17)$$

The following lemma shows that the expected label-dependent error is proportional to the expected recovery error.

###### Lemma 1

Let $X$ be an estimator based on the observations $\Omega$. Then there exists some constant $b$ independent of $X$, such that

$$R_{\rho}(X) = \frac{\rho}{2} R(X) + b. \qquad (18)$$
Proof.

For notational simplicity, we denote by $\tilde{X}_{ij} = \mathrm{thr}(X_{ij})$ the thresholded binary value of $X_{ij}$, and by $\tilde{X}$ the entire thresholded 0/1 matrix.

Define $\xi(\tilde{X}_{ij}) := P(\Omega_{ij} = 1 \mid \tilde{X}_{ij})$ and $\eta(\tilde{X}_{ij}) := P(E_{ij} = 1 \mid \tilde{X}_{ij})$. According to the data generation process, we have

$$\xi(\tilde{X}_{ij}) = P(\Omega_{ij} = 1, E_{ij} = 1 \mid \tilde{X}_{ij}) + P(\Omega_{ij} = 1, E_{ij} = 0 \mid \tilde{X}_{ij}) = \rho \cdot \eta(\tilde{X}_{ij}). \qquad (19)$$

Consider now the two cases in which $\tilde{X}$ makes an error in predicting $\Omega$.

(i) When $\tilde{X}_{ij} = 0$ and $\Omega_{ij} = 1$, according to (19) we have

$$P(\Omega_{ij} = 1 \mid \tilde{X}_{ij} = 0) = \xi(0) = \rho \cdot \eta(0).$$

(ii) When $\tilde{X}_{ij} = 1$ and $\Omega_{ij} = 0$, according to (19) we have

$$P(\Omega_{ij} = 0 \mid \tilde{X}_{ij} = 1) = 1 - \xi(1) = 1 - \rho \cdot \eta(1).$$

Since $\tilde{X}$ is uniquely determined by the estimator $X$ for a fixed threshold $q$, we can instead compute $R_{\rho}(X)$ by taking the expectation over the thresholded matrix $\tilde{X}$ as follows:

$$\begin{aligned} R_{\rho}(X) &= \mathbb{E}_{\tilde{X}}\!\left[ \sum_{ij} \left(1 - \frac{\rho}{2}\right) P(\Omega_{ij} = 1 \mid \tilde{X}_{ij} = 0)\, I(\tilde{X}_{ij} = 0) + \frac{\rho}{2} P(\Omega_{ij} = 0 \mid \tilde{X}_{ij} = 1)\, I(\tilde{X}_{ij} = 1) \right] \\ &= \mathbb{E}_{\tilde{X}}\!\left[ \sum_{ij} \left(1 - \frac{\rho}{2}\right) \xi(0)\, I(\tilde{X}_{ij} = 0) + \frac{\rho}{2} \big(1 - \xi(1)\big)\, I(\tilde{X}_{ij} = 1) \right] \\ &= \mathbb{E}_{\tilde{X}}\!\left[ \sum_{ij} \left(\rho - \frac{\rho^2}{2}\right) \eta(0)\, I(\tilde{X}_{ij} = 0) + \frac{\rho}{2} \big(1 - \rho \cdot \eta(1)\big)\, I(\tilde{X}_{ij} = 1) \right] \\ &= \mathbb{E}_{\tilde{X}}\!\left[ \sum_{ij} \frac{\rho}{2} \eta(0)\, I(\tilde{X}_{ij} = 0) + \frac{\rho - \rho^2}{2} \eta(0)\, I(\tilde{X}_{ij} = 0) + \frac{\rho}{2} \big(1 - \eta(1)\big)\, I(\tilde{X}_{ij} = 1) + \frac{\rho - \rho^2}{2} \eta(1)\, I(\tilde{X}_{ij} = 1) \right] \\ &= \mathbb{E}_{\tilde{X}}\!\left[ \sum_{ij} \frac{\rho}{2} \eta(0)\, I(\tilde{X}_{ij} = 0) + \frac{\rho}{2} \big(1 - \eta(1)\big)\, I(\tilde{X}_{ij} = 1) + \frac{\rho - \rho^2}{2} \eta(\tilde{X}_{ij}) \right] \end{aligned}$$

The first two terms inside the last expectation sum to $\frac{\rho}{2} R(X)$. We conclude the proof by setting $b = \mathbb{E}_{\tilde{X}}\big[ \sum_{ij} \frac{\rho - \rho^2}{2} \eta(\tilde{X}_{ij}) \big]$, which is independent of $X$ due to the expectation.
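Lemma 1 can be sanity-checked numerically. For a fixed $E$, both expectations over $\Omega$ have closed forms, and the offset $b = R_{\rho}(X) - \frac{\rho}{2} R(X)$ should not depend on $X$; working through the per-entry cases above for fixed $E$ also suggests $b = \sum_{ij} E_{ij} (\rho - \rho^2)/2$. A sketch (all names and toy data are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
p, q, rho = 8, 0.5, 0.3
E = rng.integers(0, 2, size=(p, p))           # a fixed 0/1 "true" matrix

def expected_errors(X):
    """Closed-form expectations over Omega (each true edge observed
    independently with prob. rho): recovery error R(X) and
    label-dependent error R_rho(X)."""
    pred = (X > q).astype(int)                # thr(X_ij)
    R = np.sum(pred != E)
    p_obs = rho * E                           # P(Omega_ij = 1)
    R_rho = np.sum((1 - rho / 2) * (pred == 0) * p_obs
                   + (rho / 2) * (pred == 1) * (1 - p_obs))
    return R, R_rho

# The offset b is the same for two unrelated estimators X1, X2.
X1, X2 = rng.random((p, p)), rng.random((p, p))
(R1, Rr1), (R2, Rr2) = expected_errors(X1), expected_errors(X2)
b1, b2 = Rr1 - rho / 2 * R1, Rr2 - rho / 2 * R2
print(np.isclose(b1, b2), np.isclose(b1, E.sum() * (rho - rho**2) / 2))
# -> True True
```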

We now connect the label-dependent error to the weighted quadratic objective.

###### Lemma 2

For any $X \in \mathcal{X}$, we have

$$R_{\rho}(X) - \min_{X' \in \mathcal{X}} R_{\rho}(X') \le \gamma'\, R_l(X) \qquad (20)$$

where $\gamma'$ is a constant depending only on $\rho$ and the threshold $q$.

Furthermore, under an additional mild condition, we have

$$R_{\rho}(X) - \min_{X' \in \mathcal{X}} R_{\rho}(X') \le \gamma \left( R_l(X) - \min_{X'} R_l(X') \right) \qquad (21)$$

where $\gamma$ is a constant depending only on $\rho$ and $q$.

Proof. First, we define the expected loss $R_l(X)$ as

$$R_l(X) := \mathbb{E}_{\Omega}\big[ R_l(X; \Omega) \big] = \mathbb{E}_{\Omega}\!\left[ \sum_{i,j} R_l(X_{ij}, \Omega_{ij}) \right].$$

Let us first prove (20). Consider the following two cases.

(i) If $E_{ij} = 0$, then $\Omega_{ij} = 0$ with probability one, and the two types of expected losses incurred at entry $(i,j)$ are

$$R_{\rho}(X_{ij}) = \frac{\rho}{2} I(X_{ij} > q), \quad \min_{X_{ij}} R_{\rho}(X_{ij}) = 0, \qquad R_l(X_{ij}) = \frac{\rho}{2} X_{ij}^2, \quad \min_{X_{ij}} R_l(X_{ij}) = 0.$$