Spectral Clustering of Graphs with the Bethe Hessian

Spectral Clustering of Graphs with the Bethe Hessian

A. Saade, F. Krzakala and L. Zdeborová Laboratoire de Physique Statistique, CNRS UMR 8550, Université P. et M. Curie Paris 6 et École Normale Supérieure, 24, rue Lhomond, 75005 Paris, France.
ESPCI and CNRS UMR 7083 Gulliver, 10 rue Vauquelin,Paris 75005
Institut de Physique Théorique, CEA Saclay and URA 2306, CNRS, 91191 Gif-sur-Yvette, France
July 11, 2019
Abstract

Spectral clustering is a standard approach to label nodes on a graph by studying the (largest or lowest) eigenvalues of a symmetric real matrix such as e.g. the adjacency or the Laplacian. Recently, it has been argued that using instead a more complicated, non-symmetric and higher dimensional operator, related to the non-backtracking walk on the graph, leads to improved performance in detecting clusters, and even to optimal performance for the stochastic block model. Here, we propose to use instead a simpler object, a symmetric real matrix known as the Bethe Hessian operator, or deformed Laplacian. We show that this approach combines the performances of the non-backtracking operator, thus detecting clusters all the way down to the theoretical limit in the stochastic block model, with the computational, theoretical and memory advantages of real symmetric matrices.

Clustering a graph into groups or functional modules (sometimes called communities) is a central task in many fields ranging from machine learning to biology. A common benchmark for this problem is to consider graphs generated by the stochastic block model (SBM) Holland1983109 (); wang1987stochastic (). In this case, one considers vertices and each of them has a group label . A graph is then created as follows: all edges are generated independently according to a matrix of probabilities, with . The group labels are hidden, and the task is to infer them from the knowledge of the graph. The stochastic block model generates graphs that are a generalization of the Erdős-Rényi ensemble where an unknown labeling has been hidden.

We concentrate on the sparse case, where algorithmic challenges appear. In this case is , and we denote . For simplicity we concentrate on the most commonly-studied case where groups are equally sized, if and if . Fixing is referred to as the assortative case, because vertices from the same group connect with higher probability than with vertices from other groups. is called the disassortative case. An important conjecture decelle2011asymptotic () is that any tractable algorithm will only detect communities if

 |cin−cout|>q√c, (1)

where is the average degree. In the case of groups, in particular, this has been rigorously proven mossel2013proof (); massoulie2013community () (in this case, one can also prove that no algorithm could detect communities if this condition is not met). An ideal clustering algorithm should have a low computational complexity while being able to perform optimally for the stochastic block model, detecting clusters down to the transition (1).

So far there are two algorithms in the literature that are able to detect clusters down to the transition (1). One is a message-passing algorithm based on belief-propagation  decelle2011inference (); decelle2011asymptotic (). This algorithm, however, needs to be fed with the correct parameters of the stochastic block model to perform well, and its computational complexity scales quadratically with the number of clusters, which is an important practical limitation. To avoid such problems, the most popular non-parametric approach to clustering are spectral methods, where one classifies vertices according to the eigenvectors of a matrix associated with the network, for instance its adjacency matrix Tutorial (); newman2006finding (). However, while this works remarkably well on regular, or dense enough graphs bickel2009nonparametric (), the standard versions of spectral clustering are suboptimal on graphs generated by the SBM, and in some cases completely fail to detect communities even when other (more complex) algorithms such as belief propagation can do so. Recently, a new class of spectral algorithms based on the use of a non-backtracking walk on the directed edges of the graph has been introduced in krzakala2013spectral () and argued to be better suited for spectral clustering. In particular, it has been shown that this operator is optimal for graphs generated by the stochastic block model, and able to detect communities even in the sparse case all the way down to the theoretical limit (1).

These results are, however, not entirely satisfactory. First, the use a of a high-dimensional matrix (of dimension - where is the number of edges - rather than , the number of nodes) can be expensive, both in terms of computational time and memory. Secondly, linear algebra methods are faster and more efficient for symmetric matrices than non-symmetric ones. The first problem was partially resolved in krzakala2013spectral () where an equivalent operator of dimensions was shown to exist. It was still, however, a non symmetric one and more importantly, the reduction does not extend to weighted graphs, and thus presents a strong limitation.

In this contribution, we provide the best of both worlds: a non-parametric spectral algorithm for clustering with a symmetric, real matrix that performs as well, and in fact slightly better, than the non-backtracking operator of krzakala2013spectral (). This operator is actually not new, and has been known as the Bethe Hessian in the context of statistical physics and machine learning mooij2004validity (); ricci2012bethe () or the deformed Laplacian in various other fields. We show that the spectrum of this operator is directly linked to that of the non-backtracking matrix, that it also performs optimally for the stochastic block model, in the sense that it identifies communities as soon as (1) holds. It also performs well in clustering standard real world benchmark networks.

The paper is organized as follows. In Sec. I we give the expression of the Bethe Hessian operator. We discuss in detail its properties and its connection with both the non-backtracking operator and an Ising spin glass in Sec. II. In Sec. III, we study analytically the spectrum in the case of the stochastic block model. Finally, in Sec. IV we perform numerical tests on both the stochastic block model and on some real networks.

I Clustering based on the Bethe Hessian matrix

Let be a graph with vertices, . Denote by its adjacency matrix, and by the diagonal matrix defined by , where is the degree of vertex . We then define the Bethe Hessian matrix, sometimes called the deformed Laplacian, as

 H(r):=(r2−1)\mathds1−rA+D, (2)

where is a regularizer that we will set to a well-defined value depending on the graph, for instance in the case of the stochastic block model, where is the average degree of the graph (see Sec. II.1).

The spectral algorithm that is the main result of this paper works as follows: we compute the eigenvectors associated with the negative eigenvalues of both and , and cluster them with a standard clustering algorithm such as k-means (or simply by looking at the sign of the components in the case of two communities). The negative eigenvalues of reveal the assortative aspects, while those of reveal the disassortative ones.

Figure 1 illustrates the spectral properties of the Bethe Hessian (2) for networks generated by the stochastic block model. When the informative eigenvalues (i.e. those having eigenvectors correlated to the cluster structure) are the negative ones, while the non-informative bulk remains positive. There are as many negative eigenvalues as there are hidden clusters. It is thus straightforward to select the relevant eigenvectors. This is very unlike the situation for the operators used in standard spectral clustering algorithms (except, again, for the non-backtracking operator) where one must decide in a somehow ambiguous way which eigenvalues are relevant (outside the bulk) or not (inside the bulk). Here, on the contrary, no prior knowledge of the number of communities is needed to select them.

On more general graphs, we argue that the best choice for the regularizer is , where is the spectral radius of the non-backtracking operator. We support this claim both numerically, on real world networks (sec. IV.2), and analytically (sec. III). We also show that can be computed without building the matrix itself, by efficiently solving a quadratic eigenproblem (sec. II.1).

The Bethe Hessian can be generalized straightforwardly to the weighed case: if the edge carries a weight , then we can use the matrix defined by

 ~H(r)ij=δij(1+∑k∈∂iw2ikr2−w2ik)−rwijAijr2−w2ij, (3)

where denotes the set of neighbors of vertex . This is in fact the general expression of the Bethe Hessian of a certain weighted statistical model (see section II.2). If all weights are equal to unity, reduces to (2) up to a trivial factor. Most of the arguments developed in the following generalize immediately to , including the relationship with the weighted non-backtracking operator, introduced in the conclusion of krzakala2013spectral ().

Ii Derivation and relation to previous works

Our approach is connected to both the spectral algorithm using the non-backtracking matrix and to an Ising spin glass model. We now discuss these connections, and the properties of the Bethe Hessian operator along the way.

ii.1 Relation with the non-backtracking matrix

In this section we describe the relationship between the spectrum of the Bethe Hessian and that of the non-backtracking operator of krzakala2013spectral () defined as a non-symmetric matrix indexed by the directed edges of the graph

 Bi→j,k→l=δjk(1−δil). (4)

The remarkable efficiency of the non-backtracking operator is due to the particular structure of its (complex) spectrum. For graphs generated by the SBM the spectrum decomposes into a bulk of uninformative eigenvalues sharply constrained when to the disk of radius , where is the spectral radius of saade2014spectral (), well separated from the real, informative eigenvalues, that lie outside of this circle. It was also remarked that the number of real eigenvalues outside of the circle is the number of communities, when the graph was generated by the stochastic block model. More precisely, the presence of assortative communities yields real positive eigenvalues larger than , while the presence of disassortative communities yields real negative eigenvalues smaller than . The authors of krzakala2013spectral () showed that all eigenvalues of that are different from are roots of the polynomial

 det[(λ2−1)\mathds1−λA+D]=detH(λ). (5)

This is known in graph theory as the Ihara-Bass formula for the graph zeta function. It provides the link between and the (determinant of the) Bethe Hessian (already noticed in watanabe2009graph ()): a real eigenvalue of corresponds to a value of such that the Bethe Hessian has a vanishing eigenvalue.

For any finite , when is large enough, is positive definite. This can be proven by using e.g. the Gershgorin circle theorem. Then as decreases, a new negative eigenvalue of appears when it crosses the zero axis, i.e whenever is equal to a real positive eigenvalue of . Of course, the same phenomenon takes place when increasing from a large negative value. In order to translate all the informative eigenvalues of into negative eigenvalues of we will adopt

 rc=√ρ(B). (6)

since all the relevant values of are outside the circle of radius . Note also that since the eigenvalues of come in pairs having their product close to krzakala2013spectral (), for the negative eigenvalues of move back into the positive part of the spectrum. Hence taking is not desirable.

Let us stress that to compute , we do not need to actually build the non-backtracking matrix. First, when the degree to degree correlations are small krzakala2013spectral (), . In a more general setting, we can efficiently refine this initial guess by solving for the closest root of the quadratic eigenproblem defined by (5), e.g. using a standard SLP algorithm ruhe1973algorithms (). This approach is implemented for the real world networks of sec. IV.2. With the choice (6), the informative eigenvalues of are in one-to-one correspondance with the union of negative eigenvalues of and . Their number will therefore tell us the number of (detectable) communities in the graph, and we will use them to infer the community membership of the nodes, by using a standard clustering algorithm such as k-means. In particular, the Bethe Hessian detects communities whenever does, that is down to the theoretical threshold krzakala2013spectral ().

ii.2 Hessian of the Bethe free energy

Let us define a pairwise Ising model on the graph by the joint probability distribution:

 P({x})=1Zexp⎛⎝∑(i,j)∈Eatanh(1r)xixj⎞⎠, (7)

where is a set of binary random variables sitting on the nodes of the graph . The regularizer is here a parameter that controls the strength of the interaction between the variables: the larger is, the weaker is the interaction (this would be analogous to the temperature in a statistical physics).

In order to study this model, a standard approach in machine learning is the Bethe approximation WJ2008 () in which the means and moments are approximated by the parameters and that minimize the so-called Bethe free energy defined as

 FBethe({mi},{ξij}) =−∑(i,j)∈Eatanh(1r)ξij+∑(i,j)∈E ∑xi,xjη(1+mixi+mjxj+ξijxixj4) (8)

where . Such approach allows for instance to derive the belief propagation () algorithm. Here, however, we wish to restrict to a spectral algorithm.

At very high the minimum of the Bethe free energy is given by the so-called paramagnetic point . It turns out mooij2004validity () that the is a stationarity point of the Bethe free energy for every . Instead of considering the complete Bethe free energy, we will consider only its behavior around the paramagnetic point. This can be expressed via the Hessian (matrix of second derivatives), that has been studied extensively, see e.g. mooij2004validity (), ricci2012bethe (). At the paramagnetic point, the blocks of the Hessian involving one derivative with respect to the are , and the block involving two such derivatives is a positive definite diagonal matrix watanabe2009graph (). We will therefore, somewhat improperly, call Hessian the matrix

 Hij(r)=∂FBethe∂mi∂mj∣∣mi=0,ξij=1r. (9)

In particular, at the paramagnetic point:

 H(r)=\mathds1+Dr2−1−rAr2−1=H(r)r2−1. (10)

A more general expression of the Bethe Hessian in the case of weighted interactions (with weights rescaled to be in ) is given by eq. (3). All eigenvectors of and are the same, as are the eigenvalues up to a multiplicative, positive factor (since we consider only ).

The paramagnetic point is stable if and only if is positive definite. The appearance of each negative eigenvalue of the Hessian corresponds to a phase transition in the Ising model at which a new cluster (or a set of clusters) starts to be identifiable. The corresponding eigenvector will give the direction towards the cluster labeling. This motivates the use of the Bethe Hessian for spectral clustering.

For tree-like graphs such as those generated by the SBM, model (7) can been studied analytically in the asymptotic limit . The location of the possible phase transitions in model (7) are also known from spin glass theory and the theory of phase transitions on random graphs (see e.g. mooij2004validity (); decelle2011inference (); decelle2011asymptotic (); ricci2012bethe ()). For positive the trivial ferromagnetic phase appears at , while the transitions towards the phase corresponding to the hidden community structure arise between . For disassortative communities, the situation is symmetric with . Interestingly, at , the model undergoes a spin glass phase transition. At this point all the relevant eigenvalues have passed in the negative side (all the possible transitions from the paramagnetic states to the hidden structure have taken place) while the bulk on non-informative ones remains positive. This scenario is illustrated in Fig. 1 for the case of two assortative clusters.

Iii The spectrum of the Bethe Hessian

In this section, we show how the spectral density of the Bethe Hessian can be computed analytically on tree-like graphs such as those generated by the stochastic block model. This will serve two goals: i) to justify independently our choice for the value of the regularizer and ii) to show that for all values of , the bulk of uninformative eigenvalues remains in the positive region. The spectral density is defined by:

 ν(λ)=1nn∑i=1 δ(λ−λi), (11)

where the ’s are the eigenvalues of the Bethe Hessian. It can be shown rogers2008cavitySym () that the spectral density (in which potential delta peaks have been removed) is given by

 ν(λ)=1πnn∑i=1 ImΔi(λ), (12)

where the are complex variables living on the vertices of the graph , which are given by:

 Δi=(−λ+r2+di−1−r2∑l∈∂iΔl→i)−1, (13)

where is the degree of node in the graph, and is the set of neighbors of . The are the (linearly stable) solution of the following belief propagation recursion, or cavity method MM2009 (),

 Δi→j=(−λ+r2+di−1−r2∑l∈∂i∖jΔl→i)−1. (14)

The ingredients to derive this formula are to turn the computation of the spectral density into a marginalization problem for a graphical model on the graph , and then write the belief propagation equations to solve it. It can be shown RSA:RSA20313 () that this approach leads to an asymptotically exact description of the spectral density on random graphs such as those generated by the stochastic block model, which are locally tree-like in the limit where . We can solve equation (14) numerically using a population dynamics algorithm MM2009 (): starting from a pool of variables, we iterate by drawing at each step a variable, its excess degree and its neighbors from the pool, and updating its value according to (14). The results are shown on Fig. 1: the bulk of the spectrum is always positive.

We now justify analytically that the bulk of eigenvalues of the Bethe Hessian reaches at . From (12) and (13), it is clear that if the linearly stable solution of (14) is real, then the corresponding spectral density will be equal to . We want to show that there exists an open set around in which there exists a real, stable, solution to the BP recursion. Let us call , where is the number of edges in , the vector which components are the . We introduce the function defined by

 F(λ,Δ––)i→j=(−λ+r2+di−1−r2∑l∈∂i∖jΔl→i)−1Δi→j, (15)

so that equation (14) can be rewritten as

 F(λ,Δ––)=0. (16)

It is straightforward to check that when , the assignment is a real solution of (16). Furthermore, the Jacobian of at this point reads

 JF(0,{1/r2})=⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝−10⋮$$\Larger2(r2\mathds1−B)$$0⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠, (17)

where is the non-backtracking operator and is the identity matrix. The square submatrix of the Jacobian containing the derivatives with respect to the messages is therefore invertible whenever . From the continuous differentiability of around and the implicit function theorem, there exists an open set containing such that for all , there exists solution of (16) , and the function is continuous in . To show that the spectral density is indeed 0 in an open set around , we need to show that this solution is linearly stable. Introducing the function defined by

 Gλ(Δ––)i→j=(−λ+r2+di−1−r2∑l∈∂i∖jΔl→i)−1 (18)

it is enough to show that the Jacobian of at the point has all its eigenvalues smaller than 1 in modulus, for close to . But since is continuous in in the neighborhood of , and is continuous in , it is enough to show that the spectral radius of is smaller than . We compute

 JG0({1/r2})=1r2B (19)

so that the spectral radius of is , which is (strictly) smaller than as long as . From the continuity of the eigenvalues of a matrix with respect to its entries, there exists an open set containing such that , the solution of the BP recursion (14) is real, so that the corresponding spectral density in is equal to . This proves that the bulk of the spectrum of reaches at , further justifying our choice for the regularizer.

Iv Numerical results

iv.1 Synthetic networks

We now illustrate the efficiency of the algorithm for graphs generated by the stochastic block model. Fig. 2 shows the performance of standard spectral clustering methods, as well as that of the belief propagation () algorithm of decelle2011asymptotic (), believed to be asymptotically optimal because it approximates the Bayes-optimal estimator in a (conjectured) exact manner in large tree-like graph. The performance is measured in terms of the overlap with the true labeling, defined as

 (1N∑uδgu,~gu−1q)/(1−1q), (20)

where is the true group label of node , and is the label given by the algorithm, and we maximize over all possible permutation of the groups. The Bethe Hessian systematically outperforms and does almost as well as , which is a more complicated non-linear machine learning algorithm, that we have run here assuming the knowledge of "oracle parameters": the number of communities, their sizes, and the matrix decelle2011inference (); decelle2011asymptotic (). The Bethe Hessian, on the other hand can infer the number of communities in the graph by counting the number of negative eigenvalues, and does not need to be fed with the interaction matrix.

iv.2 Real networks

We finally turn towards actual real graphs to illustrate the performances of our approach in practical applications, and to show that even if real networks are not generated by the stochastic block model, the Bethe Hessian operator remains a useful tool. In Table 1 we give the overlap and the number of groups to be identified for several networks commonly used as benchmarks for community detection. For each of these networks we observed a large correlation to the ground truth, and at least equal (and sometimes better) performances with respect to the non backtracking operator. In all cases, the eigenvalues we have considered lay in the negative part of the spectrum and are thus clearly identifiable.

In particular, we find by counting the negative eigenvalues an estimate of the number of clusters (just as krzakala2013spectral () did for the non-backtracking matrix). It is also interesting to note that our approach works not only in the assortative case but also in the disassortative ones, for instance for the word adjacency networks. A Matlab implementation to reproduce the results of the Bethe Hessian for both real and synthetic network is provided on the following webpage: http://mode_net.krzakala.org/.

V Conclusion and perspectives

We have presented here a new approach to spectral clustering using the Bethe Hessian and gave evidence that this approach combines the advantages of standard sparse symmetric real matrices, with the performances of the more involved non-backtracking operator, or the use of the belief propagation algorithm with oracle parameters. This answers the quest for a tractable non-parametric approach that performs optimally in the stochastic bloc model. We invite the reader to the demo file in matlab avaliable at: http://mode_net.krzakala.org/.

Given the large impact and the wide use of spectral clustering methods in many fields of modern science, we thus expect that our method will have a significant impact on data analysis. In particular, our approach can be straightforwardly generalized to other types of spectral clustering problems, as for instance, those with real-valued similarities between vertices and , without any price in scalability (as opposed, for instance, to the non-backtracking operator).

Another promising direction of investigation arises from the observation that the cost function used in the graphical model of section II.2 can be generalized to any objective function to be maximized, e.g. the modularity, or else. The Bethe Hessian’s negative eigenvalues at some carefully chosen regularization parameter could therefore provide an approximate solution to this maximization problem, giving a general spectral relaxation to some known NP-hard problems.

Acknowledgements.
We would like to thank Federico Ricci-Tersenghi for interesting discussions. This work has been supported in part by the ERC under the European Union’s 7th Framework Programme Grant Agreement 307087-SPARCS, by the Grant DySpaN of "Triangle de la Physique".

References

• [1] L. A Adamic and N. Glance. The political blogosphere and the 2004 us election: divided they blog. In Proceedings of the 3rd international workshop on Link discovery, page 36. ACM, 2005.
• [2] P. J Bickel and A. Chen. A nonparametric view of network models and newman–girvan and other modularities. Proceedings of the National Academy of Sciences, 106(50):21068, 2009.
• [3] Charles Bordenave and Marc Lelarge. Resolvent of large random graphs. Random Structures and Algorithms, 37(3):332–352, 2010.
• [4] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová. Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Phys. Rev. E, 84(6):066106, 2011.
• [5] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová. Inference and phase transitions in the detection of modules in sparse networks. Phys. Rev. Lett., 107(6):065701, 2011.
• [6] Michelle Girvan and Mark EJ Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12):7821–7826, 2002.
• [7] Paul W. Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109, 1983.
• [8] Valdis Krebs. The network can be found on http://www.orgnet.com/.
• [9] F. Krzakala, C. Moore, E. Mossel, J. Neeman, A. Sly, L. Zdeborová, and P. Zhang. Spectral redemption in clustering sparse networks. Proceedings of the National Academy of Sciences, 110(52):20935–20940, 2013.
• [10] D. Lusseau, K. Schneider, O. J. Boisseau, P. Haase, E. Slooten, and S. M Dawson. The bottlenose dolphin community of doubtful sound features a large proportion of long-lasting associations. Behavioral Ecology and Sociobiology, 54(4):396–405, 2003.
• [11] Ulrike Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395, 2007.
• [12] Laurent Massoulie. Community detection thresholds and the weak ramanujan property. arXiv preprint arXiv:1311.3085, 2013.
• [13] M. Mezard and A. Montanari. Information, Physics, and Computation. Oxford University Press, 2009.
• [14] Joris M Mooij, Hilbert J Kappen, et al. Validity estimates for loopy belief propagation on binary real-world networks. In NIPS, 2004.
• [15] Elchanan Mossel, Joe Neeman, and Allan Sly. A proof of the block model threshold conjecture. arXiv preprint arXiv:1311.4115, 2013.
• [16] Mark EJ Newman. Finding community structure in networks using the eigenvectors of matrices. Phys. Rev. E, 74(3):036104, 2006.
• [17] F. Ricci-Tersenghi. The bethe approximation for solving the inverse ising problem: a comparison with other inference methods. J. Stat. Mech.: Th. and Exp., page P08015, 2012.
• [18] Tim Rogers, Isaac Pérez Castillo, Reimer Kühn, and Koujin Takeda. Cavity approach to the spectral density of sparse symmetric random matrices. Phys. Rev. E, 78(3):031116, 2008.
• [19] Axel Ruhe. Algorithms for the nonlinear eigenvalue problem. SIAM Journal on Numerical Analysis, 10(4):674–689, 1973.
• [20] Alaa Saade, Florent Krzakala, and Lenka Zdeborová. Spectral density of the non-backtracking operator. arXiv preprint arXiv:1404.7787, 2014.
• [21] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1, 2008.
• [22] Yuchung J Wang and George Y Wong. Stochastic blockmodels for directed graphs. Journal of the American Statistical Association, 82(397):8–19, 1987.
• [23] Yusuke Watanabe and Kenji Fukumizu. Graph zeta function in the bethe free energy and loopy belief propagation. In NIPS, pages 2017–2025, 2009.
• [24] W Zachary. An information flow model for conflict and fission in small groups1. Journal of anthropological research, 33(4):452–473, 1977.
You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters