# Predicting Signed Edges with $O(n^{1+o(1)}\log n)$ Queries

Michael Mitzenmacher Harvard University, michaelm@eecs.harvard.edu    Charalampos E. Tsourakakis Harvard University, babis@seas.harvard.edu
###### Abstract

Social networks and interactions in social media involve both positive and negative relationships. Signed graphs capture both types of relationships: positive edges correspond to pairs of “friends”, and negative edges to pairs of “foes”. The edge sign prediction problem, which aims to predict whether an interaction between a pair of nodes will be positive or negative, is an important graph mining task for which many heuristics have recently been proposed [leskovec2010predicting, leskovec2010signed].

Motivated by social balance theory, we model the edge sign prediction problem as a noisy correlation clustering problem with two clusters. We are allowed to query each pair of nodes whether they belong to the same cluster or not, but the answer to the query is corrupted with some probability $q$. Let be the gap. We provide an algorithm that recovers the clustering with high probability in the presence of noise for any constant gap with $O(n^{1+o(1)}\log n)$ queries. Our algorithm uses simple breadth first search as its main algorithmic primitive. Finally, we provide a novel generalization to $k \ge 3$ clusters and prove that our techniques can recover the clustering if the gap is constant in this generalized setting.


## 1 Introduction

With the rise of social media, where both positive and negative interactions take place, signed graphs, whose study was initiated by Heider, Cartwright, and Harary [cartwright1956structural, heider1946attitudes, harary1953notion], have become prevalent in graph mining. A key graph mining problem is the edge sign prediction problem, which aims to predict whether an interaction between a pair of nodes will be positive or negative [leskovec2010predicting, leskovec2010signed]. Recent works have developed numerous heuristics for this task that perform relatively well in practice [leskovec2010predicting, leskovec2010signed].

In this work we propose a theoretical model for the edge sign prediction problem that highlights its intimate connections with the famous planted partition problem [abbe2016exact, condon2001algorithms, hajek2016achieving, mcsherry2001spectral]. Specifically, we model the edge sign prediction problem as a noisy correlation clustering problem, where we are able to query a pair of nodes to test whether they belong to the same cluster (edge sign $+1$) or not (edge sign $-1$). The query fails to return the correct answer with some probability $q$. Correlation clustering is a basic data mining primitive with a large number of applications ranging from social network analysis [harary1953notion, leskovec2010predicting] to computational biology [hou2016new]. Our theoretical model is inspired by the famous balance theory: “the enemy of my enemy is my friend” [cartwright1956structural, easley2010networks, heider1946attitudes]. The details of our model follow.

Model I. Let $V$ be the set of $n$ items that belong to two clusters, call them red and blue. Set , and , where . The function $f$ is unknown and we wish to recover the two clusters by querying pairs of items. (We need not recover the labels, just the clusters.) For each query we receive the correct answer with probability $1-q$, where $q$ is the corruption probability. That is, for a pair of items $u,v$ such that $f(u)=f(v)$, with probability $q$ it is reported that $f(u)\neq f(v)$, and similarly if $f(u)\neq f(v)$, with probability $q$ it is reported that $f(u)=f(v)$. Our goal is to perform as few queries as possible while recovering the underlying cluster structure.
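To make the query model concrete, the following is a minimal Python sketch of a noisy oracle for Model I. The function name `make_noisy_oracle`, the `seed` parameter, and the choice to memoize one fixed answer per pair (so that repeating a query gives no new information) are assumptions made for illustration; they are not taken from the paper.

```python
import random


def make_noisy_oracle(labels, q, seed=0):
    """Return query(u, v) -> +1 or -1 under Model I with corruption probability q.

    Assumption (not from the paper): each unordered pair gets a single fixed
    answer, so asking the same pair twice returns the same sign.
    """
    rng = random.Random(seed)
    cache = {}

    def query(u, v):
        key = (min(u, v), max(u, v))
        if key not in cache:
            truth = 1 if labels[u] == labels[v] else -1
            flipped = rng.random() < q  # corrupted with probability q
            cache[key] = -truth if flipped else truth
        return cache[key]

    return query


labels = ["red", "blue", "red", "blue"]
exact = make_noisy_oracle(labels, q=0.0)
print(exact(0, 2), exact(0, 1))  # same cluster, then different clusters
```

With `q=0.0` the oracle simply reports the true relation; any larger `q` flips each answer independently with that probability.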

Main result. Our main theoretical result is that we can recover the clusters with high probability (whp; that is, with probability tending to 1 as $n \to \infty$) in polynomial time. Our algorithm uses breadth first search (BFS) as its main algorithmic primitive. Our result is stated as Theorem 1.

###### Theorem 1.

There exists a polynomial time algorithm that performs edge queries and recovers the clustering whp for any gap .

When the gap is constant, the number of queries is $O(n^{1+o(1)}\log n)$. A natural follow-up question that we address here is whether our results generalize to the case of more than two clusters. We provide a general model and show that our techniques recover the cluster structure whp as long as the gap is constant.

Model II. Our model is now that there are $k$ groups, that we number $0, 1, \ldots, k-1$ and that we think of as being arranged modulo $k$. Let $g(v)$ refer to the group number associated with a vertex $v$. We start by noting that if, when querying an edge, we returned only whether the groups of its two endpoints were equal, it would be difficult to reconstruct the clusters; indeed, even with no errors, a chain of such responses along a path would not generally allow us to determine whether the endpoints of a path were in the same group or not. A model that provides more information and naturally generalizes the two cluster case is the following: when we query an edge $e=(x,y)$, we obtain

$$\tilde{f}(e)=\begin{cases} g(x)-g(y) \bmod k, & \text{with probability } 1-q;\\ g(x)-g(y)+1 \bmod k, & \text{with probability } q/2;\\ g(x)-g(y)-1 \bmod k, & \text{with probability } q/2. \end{cases} \tag{1}$$

That is, we obtain the difference between the groups when no error occurs, and with probability $q$ we obtain an error that adds or subtracts one to this difference with equal probability. When $q=0$, so that there are no errors, the edge queries would allow us to determine the difference between the group numbers of vertices at the start and end of any path, and in particular would allow us to determine if the groups were the same. We also note that we choose this description for ease of exposition. More generally we could handle queries governed by more general error models, of the form:

$$\tilde{f}(e)=g(x)-g(y)+i \quad \text{with probability } q_i, \quad 0 \le i < k.$$

That is, the error does not depend on the group values $g(x)$ and $g(y)$, but is simply independent and identically distributed over the values $0$ to $k-1$.
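As an illustration of why difference-valued answers are useful, here is a small Python sketch of the query rule of equation (1) together with a BFS that propagates group differences from a root. The names `query_mod_k` and `propagate_from_root`, the dictionary-based graph representation, and the toy instance are all hypothetical; the sketch only demonstrates the noiseless case $q=0$, where differences add up exactly along any path.

```python
import random
from collections import deque


def query_mod_k(g, x, y, k, q, rng):
    """Noisy answer for edge (x, y) following equation (1)."""
    diff = (g[x] - g[y]) % k
    r = rng.random()
    if r < 1.0 - q:
        return diff                 # correct difference
    if r < 1.0 - q / 2.0:
        return (diff + 1) % k       # error +1, probability q/2
    return (diff - 1) % k           # error -1, probability q/2


def propagate_from_root(adj, answers, root, k):
    """BFS from root; est[v] estimates g(v) - g(root) mod k."""
    est = {root: 0}
    queue = deque([root])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in est:
                # answers[(x, y)] stands for g(x) - g(y), so g(y) = g(x) - answer
                est[y] = (est[x] - answers[(x, y)]) % k
                queue.append(y)
    return est


# Toy instance (hypothetical): a path on five vertices with k = 3 groups.
g = [0, 2, 1, 2, 0]
k = 3
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
rng = random.Random(1)
adj = {v: [] for v in range(len(g))}
answers = {}
for x, y in edges:
    a = query_mod_k(g, x, y, k, q=0.0, rng=rng)
    answers[(x, y)] = a
    answers[(y, x)] = (-a) % k      # reversing the edge negates the difference
    adj[x].append(y)
    adj[y].append(x)

est = propagate_from_root(adj, answers, root=0, k=k)
print(est)
```

With noise, a single path is no longer reliable, which is why the analysis aggregates many paths between the same pair of vertices.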

###### Theorem 2.

There exists a polynomial time algorithm that performs edge queries and recovers the clusters under the model of equation (1) whp for any constant gap .

Our proof techniques extend naturally to this model.

Roadmap. Section 2 presents some theoretical preliminaries. Section 3 presents our algorithmic contributions. Section 4 briefly reviews related work. Finally, Section 5 concludes the paper.

## 2 Theoretical Preliminaries

We use the following powerful probabilistic results for the proofs in Section 3.

###### Theorem 3 (Chernoff bound, Theorem 2.1 [janson2011random]).

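Let $X \sim \mathrm{Bin}(n,p)$, $\mu = np$, $a \ge 0$, and $\varphi(x) = (1+x)\ln(1+x) - x$ (for $x \ge -1$, or $\infty$ otherwise). Then the following inequalities hold:

$$\Pr[X \le \mu - a] \le e^{-\mu\varphi\left(-\frac{a}{\mu}\right)} \le e^{-\frac{a^2}{2\mu}}, \tag{2}$$

$$\Pr[X \ge \mu + a] \le e^{-\mu\varphi\left(\frac{a}{\mu}\right)} \le e^{-\frac{a^2}{2(\mu + a/3)}}. \tag{3}$$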
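As a quick numerical sanity check, the simplified right-hand sides of (2) and (3) can be compared against exact binomial tails. The sketch below assumes $X \sim \mathrm{Bin}(n,p)$ with $\mu = np$; the helper names and the parameter values are chosen only for illustration.

```python
import math


def binom_lower_tail(n, p, x):
    """Exact Pr[X <= x] for X ~ Bin(n, p)."""
    top = int(math.floor(x))
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(0, top + 1))


def binom_upper_tail(n, p, x):
    """Exact Pr[X >= x] for X ~ Bin(n, p)."""
    start = max(0, int(math.ceil(x)))
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(start, n + 1))


def chernoff_lower_bound(mu, a):
    """Simplified right-hand side of (2)."""
    return math.exp(-a * a / (2.0 * mu))


def chernoff_upper_bound(mu, a):
    """Simplified right-hand side of (3)."""
    return math.exp(-a * a / (2.0 * (mu + a / 3.0)))


n, p, a = 200, 0.1, 10   # illustrative values only
mu = n * p
print(binom_lower_tail(n, p, mu - a), chernoff_lower_bound(mu, a))
print(binom_upper_tail(n, p, mu + a), chernoff_upper_bound(mu, a))
```

In each printed pair the exact tail should not exceed the bound next to it.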

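We define the notion of read-$k$ families, a useful concept when proving concentration results for weakly dependent variables.

Let $X_1,\ldots,X_m$ be independent random variables. For $j \in [r]$, let $P_j \subseteq [m]$ and let $f_j$ be a Boolean function of $\{X_i\}_{i \in P_j}$. Assume that $|\{j : i \in P_j\}| \le k$ for every $i \in [m]$. Then, the random variables $Y_j = f_j(\{X_i\}_{i \in P_j})$ are called a read-$k$ family.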

###### Theorem 4 (Concentration of Read-k families [gavinsky2014tail]).

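Let $Y_1,\ldots,Y_r$ be a family of read-$k$ indicator variables with $\Pr[Y_i=1]=q$. Then for any $\epsilon>0$,

$$\Pr\left[\sum_{i=1}^{r} Y_i \ge (q+\epsilon)r\right] \le e^{-D_{KL}(q+\epsilon\,\|\,q)\cdot r/k} \tag{4}$$

and

$$\Pr\left[\sum_{i=1}^{r} Y_i \le (q-\epsilon)r\right] \le e^{-D_{KL}(q-\epsilon\,\|\,q)\cdot r/k}. \tag{5}$$

Here, $D_{KL}$ is the Kullback-Leibler divergence, defined as

$$D_{KL}(q\,\|\,p)=q\log\left(\frac{q}{p}\right)+(1-q)\log\left(\frac{1-q}{1-p}\right).$$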
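The bound in (4) is easy to evaluate numerically. The following sketch computes the Kullback-Leibler divergence between two Bernoulli parameters and the right-hand side of (4); the function names are hypothetical, natural logarithms are assumed, and the arguments are assumed to satisfy $0 < p < 1$ and $0 \le q+\epsilon \le 1$.

```python
import math


def kl_bernoulli(q, p):
    """D_KL(q || p) for Bernoulli parameters, using the convention 0 * log 0 = 0."""
    def term(a, b):
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return term(q, p) + term(1.0 - q, 1.0 - p)


def read_k_upper_tail(q, eps, r, k):
    """Right-hand side of inequality (4)."""
    return math.exp(-kl_bernoulli(q + eps, q) * r / k)


# The same deviation is harder to rule out when each base variable is read more often.
print(read_k_upper_tail(0.2, 0.1, 1000, 1))
print(read_k_upper_tail(0.2, 0.1, 1000, 10))
```

Setting $k=1$ corresponds to fully independent indicators; larger $k$ divides the exponent by $k$ and therefore weakens the bound.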

We will use the following corollary of Theorem 4, which provides Chernoff-type bounds for read-$k$ families. It is derived in the same way that multiplicative Chernoff bounds are derived from Equations (2) and (3), see [mcdiarmid1998concentration]. Notice that the main difference compared to the standard Chernoff bounds is the extra factor of $k$ in the denominator of the exponent.

###### Theorem 5 (Concentration of Read-k families [gavinsky2014tail]).

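Let $Y_1,\ldots,Y_r$ be a family of read-$k$ indicator variables with $\Pr[Y_i=1]=q$. Also, let $Y=\sum_{i=1}^{r} Y_i$. Then for any $\epsilon>0$,

$$\Pr\left[Y \ge (1+\epsilon)E[Y]\right] \le e^{-\frac{\epsilon^2 E[Y]}{2k(1+\epsilon/3)}} \tag{6}$$

$$\Pr\left[Y \le (1-\epsilon)E[Y]\right] \le e^{-\frac{\epsilon^2 E[Y]}{2k}}. \tag{7}$$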

## 3 Proposed Method

We prove our main result through a sequence of claims and lemmas. For completeness we include all proofs even if some claims are classic, e.g., Claim 1. At a high level, our proof strategy is as follows:

1. We compute the probability that a simple path between $u$ and $v$ provides us with the correct information on whether $f(u)=f(v)$ or not.

2. Let . We show that there exist almost edge-disjoint paths of length between any pair of vertices with probability at least . The reader can think of the paths as being edge-disjoint, if that is helpful; we shall clarify both what we mean by almost edge-disjoint paths and how it affects the proof later in the paper.

3. For each path from the collection of almost edge-disjoint paths, we compute the product of the signs of the edges along the path. Since the paths are not entirely edge-disjoint, the corresponding random variables are weakly dependent. We use concentration of multivariate polynomials [gavinsky2014tail], see also [alonspencer, kim-vu], in combination with Claim 1 to show that using the majority of the resulting signs to decide whether $f(u)=f(v)$ or not for a pair of nodes $u,v$ gives the correct answer with probability lower bounded by . Taking the union bound over all $\binom{n}{2}$ pairs concludes the proof.
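The decision rule in the last step can be sketched in a few lines of Python. This is only an illustration of the majority vote over path products, not the paper's Algorithm 1: the function names are hypothetical, each path is assumed to be given as the list of $+1/-1$ answers returned for its edges, and ties are (arbitrarily) resolved as "different clusters".

```python
from math import prod


def path_sign(signs):
    """Product of the +1/-1 query answers along one path."""
    return prod(signs)


def same_cluster_by_majority(paths):
    """True if a strict majority of the path products equals +1."""
    total = sum(path_sign(p) for p in paths)
    return total > 0


# Three hypothetical paths between the same pair of nodes.
paths = [[1, -1, -1], [1, 1, 1], [-1, 1, 1]]
print([path_sign(p) for p in paths], same_cluster_by_majority(paths))
```

Each individual product is only slightly biased toward the truth when paths are long, so the vote needs many paths to be reliable; that is the role of the path construction and of the concentration bound.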

The pseudo-code is shown as Algorithm 1. The algorithm runs over each pair of nodes, and it invokes Algorithm 2 to construct almost edge-disjoint paths for each pair of nodes using Breadth First Search. Note that since we perform queries uniformly at random, the resulting graph is asymptotically equivalent to $G(n,p)$, see [frieze2015introduction, Chapter 1]. Here, $G(n,p)$ is the classic Erdős-Rényi model (a.k.a. the random binomial graph model) where each possible edge between a pair of nodes is included in the graph with probability $p$, independently of every other edge.

It turns out that our algorithm needs an average degree only for the first level of the trees that we grow from and when we invoke Algorithm 2. For all other levels of the grown trees, we need the degree to be only . This difference in the branching factors exists in order to ensure that the number of leaves of trees in Algorithm 2 is amplified by a factor of , which then allows us to apply Theorem 5. Using appropriate data structures, Algorithm 1 runs in . One can improve the run time in expectation by sampling neighbors for each node in time instead of time using a standard sublinear sampling technique that generates geometric random variables between successive successes, see [knuth2007seminumerical, tsourakakis2011triangle]. This results in total expected run time . Since we use a branching factor of for all except the first two levels of , we work with the model with to construct the set of almost edge disjoint paths. (Alternatively, one can think that we start with the larger random graph with more edges, and then in the construction of the almost edge disjoint paths we subsample a smaller collection of edges to use in this stage.) The diameter of this graph whp grows asymptotically as [bollobas1998random] for this value of . We use the model only in Lemma 1 to prove that every node has degree at least .
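The sublinear sampling idea mentioned above, drawing geometric gaps between successive successes instead of flipping a coin for every candidate neighbor, can be sketched as follows. The function name `sample_neighbors` is hypothetical, candidates are assumed to be indexed $0,\ldots,n-1$, and $0<p<1$ is assumed.

```python
import math
import random


def sample_neighbors(n, p, rng):
    """Return a sorted list of indices in [0, n), each included independently w.p. p.

    Instead of n coin flips, jump ahead by a geometric number of failures each
    time, so the expected work is proportional to the number of successes.
    """
    chosen = []
    i = -1
    log_fail = math.log(1.0 - p)
    while True:
        u = 1.0 - rng.random()                  # uniform in (0, 1]
        skip = int(math.log(u) / log_fail)      # failures before the next success
        i += skip + 1
        if i >= n:
            break
        chosen.append(i)
    return chosen


rng = random.Random(0)
print(len(sample_neighbors(1000, 0.01, rng)))
```

The number of failures before the next success is at least $j$ with probability $(1-p)^j$, which is exactly the event $u \le (1-p)^j$ for uniform $u$, so the output has the same distribution as independent coin flips.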

The following result is well known but we present a proof for completeness.

###### Claim 1.

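Consider a path $P_{uv}$ between nodes $u,v$ of length $L$. Let $R_{uv}$ be the product of the edge signs returned by the queries along $P_{uv}$. Then,

$$\Pr[R_{uv}=1 \mid f(u)=f(v)] = \Pr[R_{uv}=-1 \mid f(u)\neq f(v)] = \frac{1+(1-2q)^{L}}{2}.$$

###### Proof.

The sign $R_{uv}$ agrees with the unknown clustering function $f$ on $u,v$ exactly when the number of corrupted edges along the path is even. Let $Z_{uv}$ be the number of corrupted edges (sign flips) along the path. Clearly, $Z_{uv} \sim \mathrm{Bin}(L,q)$. Also,

$$\begin{aligned}(1-2q)^{L} &= \sum_{k=0}^{L}\binom{L}{k}(-q)^{k}(1-q)^{L-k}\\ &= \sum_{k=0}^{\lfloor L/2\rfloor}\binom{L}{2k}q^{2k}(1-q)^{L-2k}-\sum_{k=0}^{\lfloor L/2\rfloor}\binom{L}{2k+1}q^{2k+1}(1-q)^{L-(2k+1)}\\ &= \Pr[Z_{uv}\text{ is even}]-\Pr[Z_{uv}\text{ is odd}].\end{aligned}$$

Since $\Pr[Z_{uv}\text{ is even}]+\Pr[Z_{uv}\text{ is odd}]=1$, we obtain $\Pr[Z_{uv}\text{ is even}]=\frac{1+(1-2q)^{L}}{2}$, and the result follows. ∎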
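The closed form in Claim 1 can be checked numerically by summing the binomial probabilities of an even number of flips directly. The sketch below uses hypothetical helper names and a few illustrative values of the path length and the corruption probability.

```python
import math


def prob_even_flips(L, q):
    """Pr[an even number of the L independent answers is corrupted]."""
    return sum(math.comb(L, j) * q**j * (1 - q) ** (L - j) for j in range(0, L + 1, 2))


def claim1_closed_form(L, q):
    """(1 + (1 - 2q)^L) / 2, the expression in Claim 1."""
    return (1 + (1 - 2 * q) ** L) / 2


for L in (1, 2, 5, 10):
    print(L, prob_even_flips(L, 0.25), claim1_closed_form(L, 0.25))
```

The advantage over a fair coin is $(1-2q)^{L}/2$, which decays geometrically in the path length; this is why the vote needs many paths.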

The next lemma is a direct corollary of the lower tail multiplicative Chernoff bound.

###### Lemma 1.

Let be a random binomial graph. Then whp all vertices have degree greater than .

###### Proof.

The degree of a node follows the binomial distribution . Set . Then

 Pr[deg(u)<5logn(2c)−L]

Taking a union bound over vertices gives the result. ∎

Now we proceed to our construction of sufficiently many almost edge-disjoint paths. Our construction is based on standard techniques in random graph theory [broder1998optimal, dudek2015rainbow, frieze2012rainbow, tsourakakis2013mathematical]; we include the full proofs for completeness.

###### Lemma 2.

Let where . Fix and . Then, whp there does not exist a subset $S$ of vertices such that $|S| \le \alpha t L$ and $e[S] \ge |S| + t$.

###### Proof.

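Set $s=|S|$. Then,

$$\begin{aligned}\Pr[\exists S: s \le \alpha t L \text{ and } e[S] \ge s+t] &\le \sum_{s \le \alpha t L}\binom{n}{s}\binom{\binom{s}{2}}{s+t}p^{s+t} \le \sum_{s \le \alpha t L}\left(\frac{ne}{s}\right)^{s}\left(\frac{es^{2}p}{2(s+t)}\right)^{s+t}\\ &\le \sum_{s \le \alpha t L}\left(e^{2+o(1)}\log n\right)^{s}\left(\frac{20es\log n}{n}\right)^{t}\\ &\le \alpha t L\left(\left(e^{2+o(1)}\log n\right)^{\alpha L}\left(\frac{20e\alpha t\log^{2}n}{n\log\log n}\right)\right)^{t} < \frac{1}{n^{(1-\alpha-o(1))t}}.\end{aligned}$$

∎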

###### Lemma 3.

Let be a rooted tree of depth at most and let be a vertex not in . Then with probability , has at most neighbors in , i.e., .

###### Proof.

Let be a rooted tree of depth at most and let consist of , the neighbors of in plus the ancestors of these neighbors. Set . Then and . It follows from Lemma 2 with and , that we must have with probability . ∎

We show that by growing trees iteratively we can construct sufficiently many edge-disjoint paths for $n$ sufficiently large.

###### Lemma 4.

Let , and . For all pairs of vertices there exists a subgraph of as shown in figure 1, whp. The subgraph consists of two isomorphic vertex disjoint trees rooted at each of depth . and both have a branching factor of for the first level, and for the remaining levels. If the leaves of are then where is a natural isomorphism. Between each pair of leaves there is a path of length . The paths are edge disjoint.

###### Proof.

Because we have to do this for all pairs , we note without further comment that likely (resp. unlikely) events will be shown to occur with probability (resp. )).

To find the subgraph shown in Figure 1(b) we grow tree structures as shown in Figure 1(a). Specifically, we first grow a tree from $x$ using BFS until it reaches depth . Then, we grow a tree starting from $y$, again using BFS, until it reaches depth . For the first level of both trees, we choose neighbors of $x,y$ respectively. For all other levels we use a branching factor equal to . Before we show how to continue our construction, we need to prune down the degree of so that the remaining subgraph behaves as with . This can be achieved, for example, either by considering a random subgraph of according to , applying Chernoff bounds as in Lemma 1 to show that each node has degree at least , or by letting each node choose neighbors uniformly at random.

Finally, once trees have been constructed, we grow trees from the leaves of and using BFS for depth . Now we analyze these processes. Since the argument is the same we explain it in detail for and we outline the differences for the other trees. We use the notation for the number of vertices at depth of the BFS tree rooted at .

First we grow . As we grow the tree via BFS from a vertex at depth to vertices at depth certain bad edges from may point to vertices already in . Lemma 3 shows with probability there can be at most 10 bad edges emanating from .

Hence, we obtain the recursion

$$D^{(x)}_{i+1} \ge (5\log n-10)\left(D^{(x)}_{i}-1\right) \ge 4\log n\, D^{(x)}_{i}. \tag{8}$$

Therefore the number of leaves satisfies

$$D^{(x)}_{k} \ge \frac{1}{(2c)^{L}}(4\log n)^{\epsilon L} \ge \frac{1}{(2c)^{L}}\,n^{4\epsilon/5}. \tag{9}$$

We can make the branching factors exactly for the first level and for all remaining levels by pruning. We do this so that the trees are isomorphic to each other. With a similar argument

$$D^{(y)}_{k} \ge \frac{1}{(2c)^{L}}\,n^{\frac{4}{5}\epsilon}. \tag{10}$$

The only difference is that now we also say an edge is bad if the other endpoint is in . This immediately gives

$$D^{(y)}_{i+1} \ge (5\log n-20)\left(D^{(y)}_{i}-1\right) \ge 4\log n\, D^{(y)}_{i}$$

and the required conclusion (10).

Similarly, from each leaf and we grow trees of depth using the same procedure and arguments as above. Lemma 3 implies that there are at most 20 edges from the vertex being explored to vertices in any of the trees already constructed (at most 10 to plus any trees rooted at an and another 10 for ). The number of leaves of each now satisfies

$$\hat{D}^{(x_i)}_{\gamma} \ge (4\log n)^{\gamma+1} \ge n^{\frac{1}{2}+\frac{4}{5}\epsilon}.$$

The result is similar for .

Observe next that BFS does not condition on the edges between the leaves of the trees and . That is, we do not need to look at these edges in order to carry out our construction. On the other hand we have conditioned on the occurrence of certain events to imply a certain growth rate. We handle this technicality as follows. We go through the above construction and halt if ever we find that we cannot expand by the required amount. Let $A$ be the event that we do not halt the construction, i.e., we never fail the conditions of Lemmas 2 or 3. We have and so,

$$\Pr[\exists i: e(X_i,Y_i)=0 \mid A] \le \frac{\Pr[\exists i: e(X_i,Y_i)=0]}{\Pr(A)} \le 2n^{\frac{4\epsilon}{5}}(1-p)^{n^{1+\frac{8\epsilon}{5}}} \le n^{-n^{\epsilon}}.$$

We conclude that whp there is always an edge between each and thus a path of length at most between each . ∎

The proof of Theorem 1 follows. Set .

###### Proof of Theorem 1.

Fix a pair of nodes . Let be the signs of the edge disjoint paths connecting them, i.e., for all . Also let . Notice that is a read- family where .

By the linearity of expectation

$$E[Y]=N(2c)^{L} \ge n^{\frac{4}{5}\epsilon}(2c)^{L}.$$

By applying Theorem 5 we obtain

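$$\Pr[Y<0] = \Pr\left[Y-E[Y] < -E[Y]\right] \le \exp\left(-\frac{n^{\frac{4}{5}\epsilon}(2c)^{L}}{2\cdot\frac{n^{\frac{4}{5}\epsilon}}{4(2c)^{-L}\log n}}\right) = o(n^{-2}).$$

∎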

Planted bisection model. Before we prove Theorem 2 we discuss the connection between our formulation and the well studied planted bisection model. Suppose that $n$ is even, and the graph has two clusters of equal size. The probabilities of connecting are within each cluster, and across the clusters. Now recall our setting as described in Model I, and consider just the edges that correspond to queries that return . These form a graph drawn from the planted bisection model where . Therefore, one can apply existing methods for exact recovery, e.g., [abbe2016exact, mcsherry2001spectral], instead of our method when the sizes of the two clusters are (roughly) equal. It is worth noting that despite the wide variety of techniques that appear in the context of the planted partition model, including the EM algorithm [snijders1997estimation], spectral methods [mcsherry2001spectral, vu2014simple], semidefinite programming [abbe2016exact, hajek2016achieving, montanari2015semidefinite], hill-climbing [carson2001hill], the Metropolis algorithm [jerrum1998metropolis], and modularity based methods [bickel2009nonparametric], our edge-disjoint path technique is novel in this context.

Hajek, Wu, and Xu proved that when each cluster has nodes, the average degree has to scale as for exact recovery [hajek2016achieving]. Also, they showed that using semidefinite programming (SDP) exact recovery is achievable at this threshold [hajek2016achieving]. Note that as , this lower bound scales as . It is an interesting theoretical problem to explore if the techniques we develop in this work, or similar techniques can get closer to this lower bound.

###### Proof of Theorem 2.

Since the proof of Theorem 2 overlaps with the proof of Theorem 1, we outline the main differences. Let us return to the basic version of Model II, and let $X(e)$ for $e=(x,y)$ be

$$\tilde{f}(e)-(g(x)-g(y)) \bmod k.$$

Then given a path $P_{uv}$ between two vertices $u$ and $v$,

$$g(v)=g(u)+\sum_{e\in P_{uv}}\tilde{f}(e)-\sum_{e\in P_{uv}}X(e) \bmod k.$$

Our question is now what is $\sum_{e\in P_{uv}}X(e)$. We would like this sum to be (even slightly) more highly concentrated on 0 than on other values, so that when $g(u)=g(v)$, we find that the sum of the return values from our algorithm, $\sum_{e\in P_{uv}}\tilde{f}(e)$, is most likely to be 0. We could then conclude by looking over many almost edge-disjoint paths that if this sum is 0 over a plurality of the paths, then $u$ and $v$ are in the same group whp.

For our simple error model, the sum behaves like a simple lazy random walk on the cycle of values modulo $k$, where the probability of remaining in the same state at each step is $1-q$. Let us consider this Markov chain on the values modulo $k$; we refer to the values as states. Let $p^{t}_{ij}$ be the probability of going from state $i$ to state $j$ after $t$ steps in such a walk. It is well known that one can derive explicit formulae for $p^{t}_{ij}$; see e.g. [feller1968introduction, Chapter XVI.2]. It also follows by simply finding the eigenvalues and eigenvectors of the matrix corresponding to the Markov chain and using that representation. One can check the resulting forms to determine that $p^{t}_{0j}$ is maximized when $j=0$, and to determine the corresponding gap. Based on this gap, we can apply Chernoff-type bounds as in Theorem 5 to show that the plurality of almost edge-disjoint paths will have error 0, allowing us to determine whether the endpoints $u$ and $v$ of the path are in the same group with high probability.

The simplest example is with $k=3$ groups, where we find

$$p^{t}_{00}=\frac{1}{3}+\frac{2}{3}\left(1-3q/2\right)^{t},$$

and

$$p^{t}_{01}=p^{t}_{02}=\frac{1}{3}-\frac{1}{3}\left(1-3q/2\right)^{t}.$$

In our case , and we see that for any , is large enough that we can detect paths using the same argument as in Model I.

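For general $k$, we use that the eigenvalues of the matrix

$$\begin{bmatrix}1-q & q/2 & 0 & \cdots & q/2\\ q/2 & 1-q & q/2 & \cdots & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ q/2 & 0 & 0 & \cdots & 1-q\end{bmatrix}$$

are $1-q+q\cos(2\pi j/k)$ for $j=0,\ldots,k-1$, with the $j$-th corresponding eigenvector being $(1,\omega^{j},\omega^{2j},\ldots,\omega^{(k-1)j})$, where $\omega=e^{2\pi i/k}$ is a primitive $k$th root of unity. Here, $i$ is not an index but the square root of $-1$, i.e., $i=\sqrt{-1}$. In this case we have

$$p^{t}_{00}=\frac{1}{k}+\frac{1}{k}\sum_{j=1}^{k-1}\left(1-q+q\cos(2\pi j/k)\right)^{t}.$$

Note that . Some algebra reveals that the next largest value of $p^{t}_{0j}$ belongs to $p^{t}_{01}$, and equals

$$p^{t}_{01}=\frac{1}{k}+\frac{1}{k}\sum_{j=1}^{k-1}\omega^{-j}\left(1-q+q\cos(2\pi j/k)\right)^{t}.$$

We therefore see that the error between the ends of a path again has the plurality value 0, with a gap of at least

$$p^{t}_{00}-p^{t}_{01} \ge \frac{2}{k}\left(1-\cos(2\pi/k)\right)\left(1-q+q\cos(2\pi/k)\right)^{t}.$$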

This gap is constant for any constant and . ∎
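The closed forms for the return probability can be checked by evolving the lazy walk directly. The sketch below assumes the error model of equation (1), i.e., the walk stays put with probability $1-q$ and moves by $\pm 1$ modulo $k$ with probability $q/2$ each; the function names and the test values $q=0.2$, $t=6$ are illustrative only, and $k \ge 3$ is assumed.

```python
import math


def walk_distribution(k, q, t):
    """Distribution after t steps of the lazy walk on Z_k started at state 0."""
    dist = [0.0] * k
    dist[0] = 1.0
    for _ in range(t):
        new = [0.0] * k
        for s, mass in enumerate(dist):
            new[s] += (1.0 - q) * mass            # stay put
            new[(s + 1) % k] += 0.5 * q * mass    # error +1
            new[(s - 1) % k] += 0.5 * q * mass    # error -1
        dist = new
    return dist


def p00_closed_form(k, q, t):
    """1/k + (1/k) * sum_{j=1}^{k-1} (1 - q + q cos(2 pi j / k))^t."""
    total = sum((1.0 - q + q * math.cos(2 * math.pi * j / k)) ** t for j in range(1, k))
    return 1.0 / k + total / k


q, t = 0.2, 6
for k in (3, 5, 8):
    print(k, walk_distribution(k, q, t)[0], p00_closed_form(k, q, t))
```

For $k=3$ the same computation also reproduces the expressions for $p^{t}_{00}$ and $p^{t}_{01}=p^{t}_{02}$ given for the three-group example.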

As we have already mentioned, the same approach could be used for the more general setting where

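$$\tilde{f}(e)=g(x)-g(y)+j \quad \text{with probability } q_j, \quad 0 \le j < k,$$

but now one works with the Markov chain matrix

$$\begin{bmatrix}q_0 & q_1 & q_2 & \cdots & q_{k-1}\\ q_{k-1} & q_0 & q_1 & \cdots & q_{k-2}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ q_1 & q_2 & q_3 & \cdots & q_0\end{bmatrix}.$$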

## 4 Related Work

Fritz Heider introduced the notion of a signed graph, with $+$ or $-$ labels on the edges, in the context of balance theory [heider1946attitudes]. The key subgraph in balance theory is the triangle: any set of three fully interconnected nodes whose product of edge signs is negative is not balanced. The complete graph is balanced if every one of its triangles is balanced. Early work on signed graphs focused on graph theoretic properties of balanced graphs [cartwright1956structural]. Harary proved the famous balance theorem which characterizes balanced graphs [harary1953notion].

Bansal et al. [bansal2004correlation] studied Correlation Clustering: given an undirected signed graph, partition the nodes into clusters so that the total number of disagreements is minimized. This problem is NP-hard [bansal2004correlation, shamir2004cluster]. Here, a disagreement can be either a positive edge between vertices in two clusters or a negative edge between two vertices in the same cluster. Note that in Correlation Clustering the number of clusters is not specified as part of the input. The case when the number of clusters is constrained to be at most two is known as 2-Correlation-Clustering.

Minimizing disagreements is equivalent to maximizing the number of agreements. However, from an approximation perspective these two versions are different: minimizing is harder. For minimizing disagreements, Bansal et al. [bansal2004correlation] provide a 3-approximation algorithm for 2-Correlation-Clustering, and Giotis and Guruswami provide a polynomial time approximation scheme (PTAS) [giotis2006correlation]. Ailon et al. designed a 2.5-approximation algorithm [ailon2008aggregating] that was further improved by Coleman et al. to a 2-approximation [coleman2008local]. We remark that the notion of imbalance studied by Harary is the 2-Correlation-Clustering cost of the signed graph. Mathieu and Schudy initiated the study of noisy correlation clustering [mathieu2010correlation]. They develop various algorithms when the graph is complete, both for the cases of a random and a semi-random model. Later, Makarychev, Makarychev, and Vijayaraghavan proposed an algorithm for graphs with edges under a semi-random model [makarychev2015correlation].

When the graph is not complete, Correlation Clustering and Minimum Multicut reduce to one another, leading to $O(\log n)$ approximations [charikar2003clustering, demaine2003correlation]. The case of constrained size clusters has recently been studied by Puleo and Milenkovic [puleo2015correlation]. Finally, by using the Goemans-Williamson SDP relaxation for Max Cut [goemans1995improved], one can obtain a 0.878 approximation guarantee for the 2-Correlation-Clustering problem, as noted by [coleman2008local].

Chen et al. [chen2014clustering, chen2012clustering] also consider Model I as described in Section 1 and provide a method that can reconstruct the clustering for random binomial graphs with edges. Their method exploits low rank properties of the cluster matrix, and requires certain conditions, including conditions on the imbalance between clusters, see [chen2012clustering, Theorem 1, Table 1], to be true. Their method is based on a convex relaxation of a low rank problem. Also closely related to our work is the work of Cesa-Bianchi et al. [cesa2012correlation], who take a learning-theoretic perspective on the problem of predicting signs. They consider three types of models: batch, online, and active learning, and provide theoretical bounds for prediction mistakes for each setting. They use the correlation clustering objective as their learning bias, and they show that the risk of the empirical risk minimizer is controlled by the correlation clustering objective. Chiang et al. point out that the work of Candès and Tao [candes2006robust] can be used to predict signs of edges, and also provide various other methods, including singular value decomposition based methods, for the sign prediction problem [chiang2014prediction]. The incoherence is the key parameter that determines the number of queries, and is equal to the group imbalance . The number of queries needed for exact recovery under Model I is .

## 5 Conclusion

In this work we have studied the edge sign prediction problem, showing that under our proposed correlation clustering formulation and a fully random noise model querying pairs of nodes uniformly at random suffices to recover the clusters efficiently, whp. We also provided a generalization of our model and proof approach to more than two clusters. While our work here is theoretical, in future work we plan to apply our method and additional heuristics to real data, and compare against related alternatives. From a theoretical perspective, it is an interesting problem to close the gap between our upper bound and the known lower bound for exact recovery [hajek2016achieving] using techniques based on BFS.

## Acknowledgment

We would like to thank Bruce Hajek and Zeyu Zhou for detecting an error in an earlier version of our work. We also want to thank Yury Makarychev, Konstantin Makarychev, and Aravindan Vijayaraghavan for their feedback.

This work was supported in part by NSF grants CNS-1228598, CCF-1320231, CCF-1563710, and CCF-1535795.
