Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications

Aurelien Decelle, Florent Krzakala, Cristopher Moore, Lenka Zdeborová Université Paris-Sud & CNRS, LPTMS, UMR8626, Bât. 100, Université Paris-Sud 91405 Orsay, France. CNRS and ESPCI ParisTech, 10 rue Vauquelin, UMR 7083 Gulliver, Paris 75000, France. Santa Fe Institute and University of New Mexico, Albuquerque, New Mexico. Institut de Physique Théorique, IPhT, CEA Saclay, and URA 2306, CNRS, 91191 Gif-sur-Yvette, France.
July 14, 2019
Abstract

In this paper we extend our previous work on the stochastic block model, a commonly used generative model for social and biological networks, and the problem of inferring functional groups or communities from the topology of the network. We use the cavity method of statistical physics to obtain an asymptotically exact analysis of the phase diagram. We describe in detail properties of the detectability/undetectability phase transition and the easy/hard phase transition for the community detection problem. Our analysis translates naturally into a belief propagation algorithm for inferring the group memberships of the nodes in an optimal way, i.e., maximizing the overlap with the underlying group memberships, and for learning the underlying parameters of the block model. Finally, we apply the algorithm to two examples of real-world networks and discuss its performance.

pacs:
64.60.aq,89.75.Hc,75.10.Hk

I Introduction

Many systems of interest consist of a large number of nodes (e.g. atoms, agents, items, Boolean variables) and sometimes the only information we can observe about the system are connections, or edges, between pairs of nodes. The resulting structure is usually called a network. A naturally arising question is whether we are able to understand something about the underlying system based purely on the topology of the network. In many situations different nodes have different functions. For instance each of the nodes may belong to one of q groups, often called functional modules or communities, and the structure of the network may depend in an a priori unknown way on the group memberships. In general we are interested in finding how the network’s function and topology affect each other. Hence an interesting and practically important question is whether, based on the known structure of the network, it is possible to learn in what way the group memberships influenced the structure of the network, and which node belongs to which group.

One well-studied example of the above setting is the so-called community detection problem, for a review see e.g. Fortunato (2010). In the study of complex networks, a network is said to have community structure if it divides naturally into groups of nodes with denser connections within groups and sparser connections between groups. This type of structure, where nodes are more likely to connect to others of the same type as in a ferromagnet, is called assortative. The goal is to detect the communities and to estimate, for instance, the expected number of connections within and outside of the community.

In other cases the network can be disassortative, with denser connections between different groups than within groups. For instance, a set of predators might form a functional group in a food web, not because they eat each other, but because they eat similar prey. In networks of word adjacencies in English text, nouns often follow adjectives, but seldom follow other nouns. Even some social networks have disassortative structure: for example, some human societies are divided into moieties, and only allow marriages between different moieties.

In research on networks, generative models of random graphs often provide useful playgrounds for theoretical ideas and testing of algorithms. The simplest such model is the Erdős-Rényi random graph Erdős and Rényi (1959), where every pair of nodes is connected independently with the same probability. In this paper we study in detail the most commonly used generative model for random modular networks, the stochastic block model, which generalizes the Erdős-Rényi random graph by giving each pair of nodes a connection probability depending on what groups they belong to. We extend our previous analysis from Decelle et al. (2011), and provide a rather complete and asymptotically exact analysis of the phase diagram of this model. We focus on the question of whether we can infer the original group assignment based on the topology of the resulting network, and learn the unknown parameters that were used for generating the network. We use the cavity method developed in the statistical physics of spin glasses Mézard and Montanari (2009) to evaluate the phase diagram.

Our results naturally translate into a message-passing algorithm called belief propagation Yedidia et al. (2003), that we suggest as a heuristic tool for learning parameters and inferring modules in real networks. Our results are exact in the thermodynamic limit, but the algorithm works well even on small networks as long as they are well-described by the generative model. Our theory and algorithms are straightforwardly generalizable to other generative models, including hierarchical module structures Clauset et al. (2008), overlapping modules Airoldi et al. (2008), or degree-corrected versions of the stochastic block model Karrer and Newman (2011). In the present work we focus on graphs that do not contain any multiple edges or self-loops, but the generalization would be straightforward.

The paper is organized as follows. In Section II.1 we define the stochastic block model. In II.2 and II.3 we review the Bayesian theory for optimal inference and learning in the present context. In Section II.4 we discuss in more detail the relation between our work and previous work on community detection, and summarize the advantages of our approach. In Section III we present the cavity method for asymptotic analysis of the model, and the associated belief propagation algorithm for parameter learning and inference of the group assignment. In Section IV we analyze the phase diagram and describe in detail the different phase transitions introduced in Decelle et al. (2011). Finally, in Section V we discuss applications of our approach to real-world networks.

II The stochastic block model

II.1 Definition of the model

The stochastic block model is defined as follows. It has parameters q (the number of groups), {n_a} (the expected fraction of nodes in each group a, for 1 ≤ a ≤ q), and a q × q affinity matrix p_ab (the probability of an edge between group a and group b). We generate a random directed graph G on N nodes, with adjacency matrix A_ij = 1 if there is an edge from i to j and 0 otherwise, as follows. Each node i has a label q_i ∈ {1, …, q}, indicating which group it belongs to. These labels are chosen independently, where for each node i the probability that q_i = a is n_a. Between each pair of nodes i, j, we then include an edge from i to j with probability p_{q_i,q_j}, setting A_ij = 1, and set A_ij = 0 with probability 1 − p_{q_i,q_j}. We forbid self-loops, so A_ii = 0.

We let N_a denote the number of nodes in each group a. Since N_a is binomially distributed, in the limit of large N we have N_a/N → n_a with high probability. The average number of edges from group a to group b is then M_ab = N^2 n_a n_b p_ab, or M_aa = N^2 n_a^2 p_aa if a = b. Since we are interested in sparse graphs where p_ab = O(1/N), we will often work with a rescaled affinity matrix c_ab = N p_ab. In the limit of large N, the average degree of the network is then

 c = \sum_{a,b} c_{ab} n_a n_b . \qquad (1)

We can also consider the undirected case, where A_ij = A_ji, p_ab = p_ba, and c_ab = c_ba are symmetric. The average degree is then

 c = 2 \sum_{a<b} c_{ab} n_a n_b + \sum_a c_{aa} n_a^2 = \sum_{a,b} c_{ab} n_a n_b . \qquad (2)
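As a concrete illustration of the generative process just defined, here is a minimal Python sketch; the function name and the parameter values at the bottom are our own illustrative choices, not part of the model definition.

```python
import random

def generate_sbm(N, n, c, seed=0):
    """Sample an undirected stochastic block model graph.

    n : list of expected group fractions n_a
    c : symmetric rescaled affinity matrix, c[a][b] = N * p_ab
    Returns (labels, edges).
    """
    rng = random.Random(seed)
    q = len(n)
    # each node gets a group label independently with probabilities n_a
    labels = rng.choices(range(q), weights=n, k=N)
    edges = []
    for i in range(N):
        for j in range(i + 1, N):
            # edge probability p_ab = c_ab / N
            if rng.random() < c[labels[i]][labels[j]] / N:
                edges.append((i, j))
    return labels, edges

# check that the average degree matches Eq. (1)-(2): sum_ab c_ab n_a n_b
N = 2000
n = [0.5, 0.5]
c = [[8.0, 2.0], [2.0, 8.0]]
labels, edges = generate_sbm(N, n, c)
avg_degree = 2 * len(edges) / N
expected = sum(c[a][b] * n[a] * n[b] for a in range(2) for b in range(2))
```

For the symmetric two-group parameters chosen here, Eq. (1) gives an average degree of 5, which the sampled graph reproduces up to fluctuations of order 1/√N.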

Several special cases of this stochastic block model are well known and studied in the literature. Without the aim of being exhaustive let us mention three of them. We stress, however, that from a mathematical point of view many of our results are non-rigorous and hence rigorous proofs of our conjectures would be a natural and important extension of our work.

• A common benchmark for community detection is the “four groups” test of Newman and Girvan Newman and Girvan (2004), in which a network is divided into four equally-sized groups: q = 4 and n_a = 1/4, the average degree is c = 16, and

 c_{ab} = \begin{cases} c_{\rm in} & \text{if } a = b \\ c_{\rm out} & \text{if } a \neq b \end{cases} \qquad (3)

where c_in > c_out so that the community structure is assortative. By varying the difference between c_in and c_out, we create more or less challenging structures for community detection algorithms. It is usually expected in the literature that as N → ∞ it is possible to find a configuration correlated with the original assignment of nodes to the four groups as soon as the average in-degree (number of edges going to the same group) is larger than 4, or c_in > c_out in general, and that a failure to do so is due to an imperfection of the algorithm. In contrast, from our results (see e.g. Fig. 1) it follows that in the limit N → ∞, unless the average in-degree is larger than 7, no efficient algorithm will be better than a random choice in recovering the original assignment on large networks. At the same time we design an algorithm that finds the most likely assignment for each node if the in-degree is larger than 7.
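The arithmetic behind these thresholds can be checked directly, assuming the detectability condition c_in − c_out > q√c analyzed in Section IV.1; the variable names are ours.

```python
import math

# "Four groups" benchmark: q = 4 equal groups, average degree c = 16,
# so c_in + 3 * c_out = 4 * c = 64.
q, c = 4, 16.0
gap = q * math.sqrt(c)       # detectability condition: c_in - c_out > q * sqrt(c)
c_out = (q * c - gap) / q    # c_out at the threshold
c_in = c_out + gap           # c_in at the threshold
in_degree = c_in / q         # average in-degree at the threshold
# the naive expectation would instead place the threshold at c_in = c_out,
# i.e., at an average in-degree of c / q = 4
```

Solving the two conditions gives c_in = 28 and c_out = 12 at the threshold, i.e., an average in-degree of 7, versus the naive expectation of 4.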

• Planted graph partitioning, a generalization of the above example, is well known in the mathematics and computer science literature. We have n_a = 1/q for all a, and p_ab = p_in if a = b and p_ab = p_out if a ≠ b. We again assume an assortative structure, so that p_in > p_out, but now we allow the dense case where p_in, p_out might not be O(1/N). A classical result Dyer and Frieze (1989) shows that for constant p_in > p_out the planted partition is with high probability equal to the best possible partition, in terms of minimizing the number of edges between groups. Another classical result Condon and Karp (2001) shows that the planted partition can be easily found as long as p_in − p_out ≥ N^{−1/2+ε} for arbitrarily small ε > 0. In this work we do not aim at finding the best possible partition nor the planted partition exactly. Rather, we are interested in conditions under which polynomial-time algorithms can find a partition that is correlated with the planted partition. In Section IV.1 we will show that this is possible when c_in − c_out > q√c. We will also argue that, depending on the values of the parameters, this task may be impossible, or exponentially hard, if this condition is not satisfied.

• In the planted coloring problem we again have n_a = 1/q and c_ab = c for all a ≠ b, but c_aa = 0, so that there are no edges between nodes in the same group. The average degree of the graph is then c(q − 1)/q. A rigorous analysis of the problem Krivelevich and Vilenchik (2006) shows that for sufficiently large average degree it is possible to find in polynomial time a proper coloring correlated strongly with the planted coloring. The cavity method result derived in Krzakala and Zdeborová (2009) shows that configurations correlated with the planted coloring are possible to find if and only if the average degree exceeds a critical value c_q, where the critical values c_q are given in Krzakala and Zdeborová (2009). For large q we have c_q ≈ 2q log q. However, it is known how to find such colorings in polynomial time only above a larger threshold c_ℓ, of order q^2 Krzakala and Zdeborová (2009), and there are physical reasons to believe that c_ℓ provides a threshold for success of a large class of algorithms, as we will explain later.

We stress that in this paper we are mostly interested in the large-N behavior of the generative model, and hence in the text of the paper we will neglect terms that are negligible in this limit. For instance, we will write that the number of edges is cN/2 even though for finite N one typically has fluctuations of order √N around this average.

Now assume that a network was generated following the stochastic block model described above. The resulting graph G is known, but the parameters q, {n_a}, {p_ab} and the labeling {q_i} are unknown. In this paper we address the two following questions:

• Given the graph G, what are the most likely values of the parameters q, {n_a}, {p_ab} that were used to generate the graph? We will refer to this question as parameter learning.

• Given the graph G and the parameters q, {n_a}, {p_ab}, what is the most likely assignment of a label (group) to a given node? In particular, is this most likely assignment better than a random guess? How much better? We will refer to this as inferring the group assignment.

In order to be able to give a quantitative answer to the second question we define the agreement between the original assignment {t_i} and its estimate {q_i} as

 A(\{t_i\},\{q_i\}) = \max_\pi \frac{1}{N} \sum_i \delta_{t_i, \pi(q_i)} , \qquad (4)

where π ranges over the permutations on q elements. We also define a normalized agreement that we call the overlap,

 Q(\{t_i\},\{q_i\}) = \frac{\max_\pi \frac{1}{N} \sum_i \delta_{t_i, \pi(q_i)} - \max_a n_a}{1 - \max_a n_a} . \qquad (5)

The overlap is defined so that if q_i = t_i for all i, i.e., if we find the exact labeling, then Q = 1. If on the other hand the only information we have are the group sizes n_a, and we assign each node to the largest group to maximize the probability of the correct assignment of each node, then Q = 0. We will say that a labeling {q_i} is correlated with the original one if in the thermodynamic limit the overlap is strictly positive, i.e., Q is bounded above some constant.
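A direct implementation of the overlap (5), maximizing over label permutations by brute force, can make the definition concrete; this is our illustration (feasible only for small q), not code from the paper.

```python
from itertools import permutations

def overlap(t, g, q):
    """Normalized overlap, Eq. (5), between the original assignment t and an
    estimate g, maximized over all permutations of the q group labels."""
    N = len(t)
    # group sizes n_a of the original assignment
    n = [t.count(a) / N for a in range(q)]
    best = max(sum(t[i] == pi[g[i]] for i in range(N)) / N
               for pi in permutations(range(q)))
    return (best - max(n)) / (1 - max(n))

t = [0, 0, 0, 1, 1, 1]   # original labels
g = [1, 1, 1, 0, 0, 0]   # estimate: correct up to a relabeling
print(overlap(t, g, 2))  # -> 1.0
```

An estimate that puts every node in one group scores Q = 0, matching the normalization described above.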

Our main results provide exact answers to the above questions in the thermodynamic limit for sparse networks, i.e. when c_ab and n_a are constants independent of the size N as N → ∞. Many real-world networks are sparse in that the total number of edges grows only as N rather than N^2 (although we do not address networks with heavy-tailed degree distributions). In the dense case where the average degree diverges as N → ∞, learning and inference are algorithmically easier, as was previously realized in Condon and Karp (2001); Bickel and Chen (2009) and as will also become clear from the large-degree limit of our results.

II.2 Optimal inference of the group assignment

The probability that the stochastic block model generates a graph G, with adjacency matrix A, along with a given group assignment {q_i}, conditioned on the parameters θ = (q, {n_a}, {p_ab}), is

 P(G,\{q_i\} \mid \theta) = \prod_{i \neq j} \left[ p_{q_i,q_j}^{A_{ij}} \, (1 - p_{q_i,q_j})^{1-A_{ij}} \right] \prod_i n_{q_i} , \qquad (6)

where in the undirected case the product is over pairs i < j. Note that the above probability is normalized, i.e. ∑_G ∑_{q_i} P(G, {q_i} | θ) = 1. Assume now that we know the graph G and the parameters θ, and we are interested in the probability distribution over the group assignments given that knowledge. Using Bayes’ rule we have

 P(\{q_i\} \mid G,\theta) = \frac{P(G,\{q_i\} \mid \theta)}{\sum_{\{t_i\}} P(G,\{t_i\} \mid \theta)} . \qquad (7)

In the language of statistical physics, this distribution is the Boltzmann distribution of a generalized Potts model with Hamiltonian

 H(\{q_i\} \mid G,\theta) = -\sum_i \log n_{q_i} - \sum_{i \neq j} \left[ A_{ij} \log c_{q_i,q_j} + (1-A_{ij}) \log\left(1 - \frac{c_{q_i,q_j}}{N}\right) \right] , \qquad (8)

where the sum is over pairs i < j in the undirected case. The labels q_i are Potts spins taking one of the q possible values, and the group sizes (or rather their logarithms log n_a) become local magnetic fields. In the sparse case c_ab = O(1), there are strong interactions, of order O(1), between connected nodes, and weak interactions, of order O(1/N), between nodes that are not connected. Recall that the Boltzmann distribution at unit temperature is

 \mu(\{q_i\} \mid G,\theta) = P(\{q_i\} \mid G,\theta) = \frac{e^{-H(\{q_i\} \mid G,\theta)}}{\sum_{\{q_i\}} e^{-H(\{q_i\} \mid G,\theta)}} , \qquad (9)

where the denominator is the corresponding partition function

 Z(G,\theta) = \sum_{\{q_i\}} e^{-H(\{q_i\} \mid G,\theta)} . \qquad (10)

Note that we define the Hamiltonian in terms of the rescaled affinities c_ab = N p_ab to keep the formally useful property that the energy is extensive, i.e., proportional to N, in the thermodynamic limit. In statistical physics it is usual to work with the free energy density

 f_N(G,\theta) = \frac{F_N(G,\theta)}{N} = -\frac{\log Z(G,\theta)}{N} \;\xrightarrow{N \to \infty}\; f(\theta) . \qquad (11)

Since the energy (8) is extensive, the free energy density has a well defined finite thermodynamic limit f(θ).

It is useful to notice that if the parameters θ are known then the Boltzmann distribution (9) is asymptotically uniform over all configurations with the right group sizes, N_a = n_a N, and the right number of edges between each pair of groups, M_ab = N n_a n_b c_ab. The original, correct group assignment is just one of these configurations. In a statistical physics sense, the original group assignment is an equilibrium configuration for the Boltzmann distribution, rather than its ground state. In particular, if we were presented with the original assignment and a typical assignment sampled according to the Boltzmann distribution, we would be unable to tell which one was correct.

The marginals of the Boltzmann distribution, i.e. the probabilities ν_i(q_i) that a node i belongs to group q_i, are

 \nu_i(q_i) = \sum_{\{q_j\}_{j \neq i}} \mu(\{q_j\}_{j \neq i}, q_i \mid G,\theta) . \qquad (12)

Our estimate of the original group assignment assigns each node to its most-likely group,

 q_i^* = \operatorname{argmax}_{q_i} \nu_i(q_i) . \qquad (13)

If the maximum of ν_i is not unique, we choose at random from all the values of q_i achieving the maximum. We refer to this method of estimating the groups as marginalization; in Bayesian inference it is called the maximum posterior marginal estimator. Standard results in Bayesian inference (e.g. Iba (1999)) show that it is in fact the optimal estimator of the original group assignment if we seek to maximize the number of nodes i at which q_i^* = t_i. In particular, the ground state of the Hamiltonian (8), i.e. the configuration that maximizes μ({q_i} | G, θ), has in general a slightly smaller overlap with the original assignment than {q_i^*} does.
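To make Eqs. (8)-(13) concrete, the marginals can be computed exactly on a toy graph by enumerating all q^N assignments; the graph and parameter values below are our own illustrative choices, and the exponential cost limits this to tiny examples.

```python
import itertools
import math

def marginals(N, edges, n, c):
    """Exact marginals nu_i(a) of the Boltzmann distribution (9),
    computed by enumerating all q^N assignments (toy sizes only)."""
    q = len(n)
    E = set(edges)
    weights = {}
    for conf in itertools.product(range(q), repeat=N):
        logw = sum(math.log(n[a]) for a in conf)  # field terms log n_{q_i}
        for i in range(N):
            for j in range(i + 1, N):
                cab = c[conf[i]][conf[j]]
                # edge term log c_ab, non-edge term log(1 - c_ab/N), cf. Eq. (8)
                logw += math.log(cab) if (i, j) in E else math.log(1 - cab / N)
        weights[conf] = math.exp(logw)
    Z = sum(weights.values())                     # partition function (10)
    nu = [[0.0] * q for _ in range(N)]
    for conf, w in weights.items():
        for i, a in enumerate(conf):
            nu[i][a] += w / Z
    return nu

# two triangles joined by one edge, assortative toy parameters
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
nu = marginals(6, edges, n=[0.5, 0.5], c=[[5.0, 0.5], [0.5, 5.0]])
# marginalization estimate (13): most likely group for each node
qstar = [max(range(2), key=lambda a: nu[i][a]) for i in range(6)]
```

Note that with these fully symmetric parameters the marginals come out uniform, ν_i(a) = 1/2, by label symmetry; the estimator (13) becomes informative only once this permutation symmetry is broken.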

Note that the Boltzmann distribution is symmetric with respect to permutations of the group labels. Thus the marginals over the entire Boltzmann distribution are uniform. However, if this permutation symmetry is broken in the thermodynamic limit, so that each permutation corresponds to a different Gibbs state, we claim that marginalization within one of these Gibbs states is the optimal estimator for the overlap defined in (5), where we maximize over all permutations.

One of the advantages of knowing the marginals of the Boltzmann distribution (9) is that in the thermodynamic limit we can evaluate the overlap even without explicit knowledge of the original assignment {t_i}. It holds that

 Q_{\rm margin} \equiv \lim_{N\to\infty} \frac{\frac{1}{N} \sum_i \nu_i(q_i^*) - \max_a n_a}{1 - \max_a n_a} = \lim_{N\to\infty} Q(\{t_i\},\{q_i^*\}) . \qquad (14)

The overlap Q_margin measures the amount of information about the original assignment that can be retrieved from the topology of the network, given the parameters θ. The marginals ν_i can also be used to distinguish nodes that have a very strong group preference from those that are uncertain about their membership.

Another interesting property is that two random configurations taken from the Boltzmann distribution (9) have, on average, the same agreement with each other as either of them has with the original assignment {t_i}, i.e.

 \lim_{N\to\infty} \frac{1}{N} \max_\pi \sum_i \nu_i(\pi(t_i)) = \lim_{N\to\infty} \frac{1}{N} \sum_i \sum_a \nu_i(a)^2 , \qquad (15)

where again π ranges over the permutations on q elements. This identity holds only if the associated Boltzmann distribution was computed using the correct parameters θ. In the statistical physics of spin glasses, the property (15) is known as the equality between the Edwards-Anderson overlap and the magnetization, and holds on the Nishimori line. Here the knowledge of the actual parameter values is equivalent to the Nishimori condition being satisfied; for more details see Section II.3.

II.3 Learning the parameters of the model

Now assume that the only knowledge we have about the system is the graph G. The general goal in machine learning is to learn the most probable values of the parameters of an underlying model based on the data known to us. In this case, the parameters are θ = (q, {n_a}, {p_ab}) and the data is the graph G, or rather the adjacency matrix A_ij. According to Bayes’ rule, the probability P(θ | G) that the parameters take a certain value, conditioned on G, is proportional to the probability P(G | θ) that the model with parameters θ would generate G. This in turn is the sum of P(G, {q_i} | θ) over all group assignments {q_i}:

 P(\theta \mid G) = \frac{P(\theta)}{P(G)} P(G \mid \theta) = \frac{P(\theta)}{P(G)} \sum_{\{q_i\}} P(G,\{q_i\} \mid \theta) . \qquad (16)

In Bayesian inference, P(θ | G) is called the posterior distribution. The prior distribution P(θ) includes any graph-independent information we might have about the values of the parameters. In our setting, we wish to remain perfectly agnostic about these parameters; for instance, we do not want to bias our inference process towards assortative structures. Thus we assume a uniform prior, P(θ) = 1 up to normalization. Note, however, that since the sum in (16) typically grows exponentially with N, we could take any smooth prior as long as it is independent of N; for large N, the data would cause the prior to “wash out,” leaving us with the same posterior distribution we would have if the prior were uniform.

Thus maximizing P(θ | G) over θ is equivalent to maximizing the partition function (10) over θ, or equivalently minimizing the free energy density f(θ) defined in Eq. (11) of the Potts model (8) as a function of θ. As in the saddle-point method, if the function f(θ) has a non-degenerate minimum, then in the thermodynamic limit this minimum is achieved with high probability at precisely the values of the parameters that were used to generate the network. In mathematical terms, if we call θ* the original parameters and θ̂ the ones minimizing f(θ), then for all ε > 0 the probability that |θ̂ − θ*| > ε goes to zero as N → ∞. It is also important to note that due to the self-averaging nature of the model, as N → ∞ the free energy depends only on θ and not on the precise realization of the network. We will study the free energy in detail in the next section, but for the moment suppose that it indeed has a non-degenerate minimum as a function of θ. In that case we can learn the exact parameters: the number of groups q, their sizes n_a, and the affinity matrix c_ab.

In some cases, rather than minimizing f(θ) directly, it is useful to write explicit conditions for the stationarity of f(θ). Taking the derivative of f with respect to n_a for a = 1, …, q, subject to the condition ∑_a n_a = 1, and setting these derivatives equal to zero gives

 \frac{1}{N} \sum_i \langle \delta_{q_i,a} \rangle = \frac{\langle N_a \rangle}{N} = n_a \qquad \forall a = 1,\ldots,q , \qquad (17)

where by ⟨·⟩ we denote the thermodynamic average. Thus for each group a, the most likely value of n_a is the average group size; an intuitive result, but one that deserves to be stated. Analogously, taking the derivative of f with respect to the affinities c_ab gives

 \frac{1}{N n_a n_b} \sum_{(i,j)\in E} \langle \delta_{q_i,a} \delta_{q_j,b} \rangle = \frac{\langle M_{ab} \rangle}{N n_a n_b} = c_{ab} \qquad \forall a,b . \qquad (18)

This means that the most likely value of c_ab is proportional to the average number of edges from group a to group b. More to the point, the most likely value of p_ab = c_ab/N is the average fraction of the potential edges from group a to group b that in fact exist. In the undirected case, for a = b we have

 \frac{1}{N n_a^2/2} \sum_{(i,j)\in E} \langle \delta_{q_i,a} \delta_{q_j,a} \rangle = \frac{\langle M_{aa} \rangle}{N n_a^2/2} = c_{aa} \qquad \forall a . \qquad (19)

The stationarity conditions (17–19) naturally suggest an iterative way to search for the parameters θ that minimize the free energy. We start with arbitrary estimates of θ (actually not completely arbitrary; for a more precise statement see subsequent sections), measure the mean values ⟨N_a⟩ and ⟨M_ab⟩ in the Boltzmann distribution with parameters θ, and update θ according to (17–19). We then use the resulting θ to define a new Boltzmann distribution, again measure ⟨N_a⟩ and ⟨M_ab⟩, and so on until a fixed point is reached.
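The iteration just described can be sketched end to end on a toy example, with the thermodynamic averages computed by exact enumeration as a stand-in for the belief propagation of Section III; the graph, initial parameters, and all names below are our own illustrative choices.

```python
import itertools
import math

def em_step(N, edges, n, c):
    """One update of (n, c) via the stationarity conditions (17)-(19),
    with <N_a> and <M_ab> computed by exact enumeration (toy sizes only)."""
    q = len(n)
    E = set(edges)
    Z = 0.0
    Na = [0.0] * q                        # accumulates <N_a> * Z
    Mab = [[0.0] * q for _ in range(q)]   # accumulates edge counts (both orderings)
    for conf in itertools.product(range(q), repeat=N):
        logw = sum(math.log(n[a]) for a in conf)
        for i in range(N):
            for j in range(i + 1, N):
                cab = c[conf[i]][conf[j]]
                logw += math.log(cab) if (i, j) in E else math.log(1 - cab / N)
        w = math.exp(logw)
        Z += w
        for i, a in enumerate(conf):
            Na[a] += w
        for (i, j) in E:
            Mab[conf[i]][conf[j]] += w
            Mab[conf[j]][conf[i]] += w
    new_n = [Na[a] / (Z * N) for a in range(q)]
    new_c = [[0.0] * q for _ in range(q)]
    for a in range(q):
        for b in range(q):
            mean_edges = Mab[a][b] / Z / (2 if a == b else 1)        # <M_ab>
            pairs = N * new_n[a] ** 2 / 2 if a == b else N * new_n[a] * new_n[b]
            new_c[a][b] = mean_edges / pairs                         # Eqs. (18)-(19)
    return new_n, new_c

# two triangles joined by one edge; iterate the update to a fixed point
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n, c = [0.5, 0.5], [[4.0, 1.0], [1.0, 4.0]]
for _ in range(30):
    n, c = em_step(6, edges, n, c)
```

At a fixed point the learned parameters reproduce the observed number of edges, N[∑_{a<b} n_a n_b c_ab + ∑_a n_a^2 c_aa/2] = |E|.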

In statistical physics, the stationarity conditions (17–19) can be interpreted as the equality of the quenched and annealed magnetizations and correlations. In models of spin glasses (e.g. Iba (1999)) they are sometimes referred to as the Nishimori conditions. This iterative way of looking for a minimum of the free energy is equivalent to the well-known expectation-maximization (EM) method in statistics Dempster et al. (1977).

Note that if the conditions (17–19) are satisfied, the average of the Hamiltonian in the Boltzmann distribution (9) can be easily computed as

 \frac{1}{N} \langle H \rangle \equiv e = -\sum_a n_a \log n_a - \sum_{a,b} n_a n_b c_{ab} \log c_{ab} + c , \qquad (20)

where the term c comes from the weak interactions between disconnected pairs of nodes. In the undirected case,

 \frac{1}{N} \langle H \rangle \equiv e = -\sum_a n_a \log n_a - \sum_{a<b} n_a n_b c_{ab} \log c_{ab} - \sum_a \frac{n_a^2}{2} c_{aa} \log c_{aa} + \frac{c}{2} . \qquad (21)

Note that as usual in statistical physics the free energy is the average of the energy minus the entropy, f = e − s (where here the “temperature” is unity). The entropy density s is the logarithm of the number of assignments that have the corresponding energy, divided by N. In the Bayesian inference interpretation the entropy counts assignments that are as good as the original one (i.e. they have the same group sizes and the same number of edges between each pair of groups). This entropy provides another useful measure of the significance of communities in the network, and was studied numerically for instance in Good et al. (2010).

II.4 Our contribution and relation to previous work

The Bayesian approach presented in the previous two sections is well known in machine learning and statistics. However, it is also well known that computing the partition function and the marginal probabilities of a model defined by the Hamiltonian (8), or computing the averages on the left-hand sides of (17–19), is a computationally hard task. In general, and our model is no exception, no exact polynomial-time algorithm is known for these problems.

A standard tool of statistical physics and statistical inference is to approximate thermodynamic averages using Markov chain Monte Carlo (MCMC), also known in this context as Gibbs sampling. Any Markov chain respecting detailed balance will, after a sufficient number of steps, produce sample configurations distributed according to the Boltzmann distribution (9). These samples can then be used to compute averages like ⟨N_a⟩ and ⟨M_ab⟩, or even the free energy if we integrate over a temperature-like parameter. A central question, however, is the equilibration time of these Markov chains. For very large networks, a running time that grows more than linearly in N is impractical.

The running time of Gibbs sampling is one reason why the Bayesian approach is not widely used for community detection. Exceptions include specific generative models for which the partition function computation is tractable Newman and Leicht (2007); Ball et al. (2011); the work of Hofman and Wiggins (2008), where a variational approximation was used to bound the partition function (or related expectations); and the work of Clauset et al. (2008), where a more elaborate generative model is used to infer hierarchies of communities and subcommunities. Another exception is Moore et al. (2011), where Gibbs sampling is used to estimate the mutual information between each node and the rest of the graph in order to perform “active learning.”

The main contribution of this work is a detailed, and in the thermodynamic limit exact, analysis of the stochastic block model and its phase diagram with the use of the cavity method Mézard and Parisi (2001); Mézard and Montanari (2009) developed in the theory of spin glasses. We show that there is a region in the phase diagram, i.e., a range of parameters θ = (q, {n_a}, {c_ab}), where inference is impossible even in principle, since the marginals of the Boltzmann distribution yield no information about the original group assignment. There is another region where inference is possible but, we argue, exponentially hard. Finally, there is a region where a belief propagation algorithm Yedidia et al. (2003), which emerges naturally from the cavity method, computes the free energy density and the corresponding expectations exactly in the thermodynamic limit, letting us infer the parameters optimally and in linear time. As we show later, this algorithm also performs very well for networks of moderate size, and it is useful for real-world networks as well.

The boundaries between these phases correspond to well-known phase transitions in the statistical physics of spin glasses: namely, the dynamical transition or the reconstruction threshold, see e.g. Mézard and Montanari (2006); Zdeborová and Krzakala (2007); the condensation transition or the Kauzmann temperature Krzakala et al. (2007); Kauzmann (1948); and the easy/hard transition in planted models introduced in Krzakala and Zdeborová (2009). There is also a close relation between our approach and the optimal finite-temperature decoding Nishimori (1993); Sourlas (1994); Iba (1999); Nishimori (2001), and the statistical mechanics approach to image processing Tanaka (2002).

In fact, the theory of spin glasses also leads to the conclusion that Gibbs sampling works in linear time, i.e., its equilibration time is linear in , in the same region where belief propagation works. However, belief propagation is considerably faster than Gibbs sampling at finding the marginals, since belief propagation produces marginals directly while Gibbs sampling requires us to measure them by taking many independent samples.

Belief propagation was previously suggested as an algorithm for detecting communities in Hastings (2006). However, in that work its performance was not studied systematically as a function of the parameters of the block model. Moreover, the parameters were fixed there to an assortative community structure similar to the planted partition model discussed above, rather than being learned.

We argue that our Bayesian approach, coupled with belief propagation to compute marginals and estimate the free energy, is optimal for graphs generated from the stochastic block model. For such random networks it possesses several crucial advantages over the methods that are widely used for community detection in the current literature, without being computationally more involved. For a review of methods and results known about community detection see for instance Fortunato (2010) and references therein. Let us list some of the advantages:

• General modules, not just assortative ones: It is fair to say that when there are well-separated assortative communities in the network, the community detection problem is relatively easy, and many efficient algorithms are available in the literature; see for instance Lancichinetti and Fortunato (2009) for a comparative study. However, for block models where the matrix of affinities is not diagonally dominant, i.e., where functional groups may be disassortative or where edges between them are directed, the spectrum of available algorithms is, so far, rather limited.

• No prior knowledge of parameters needed: Our method does not require any prior knowledge about the functional groups or how they tend to be connected. It is able to learn the number of groups q, their sizes n_a, and the affinity matrix c_ab.

• Asymptotically exact for the stochastic block model: Unlike any other known method, for networks that are created by the stochastic block model, in the limit N → ∞ our algorithm either outputs the exact parameters θ = (q, {n_a}, {c_ab}), or halts and reports that learning is impossible or algorithmically hard. In the second case we are very confident that no other method will be able to do better in terms of the overlap parameter (5), and we will explain the reasons for this conjecture later in the text.

• Detects when a network has no communities: While people may differ on the precise definition of community structure and how to find it, they presumably agree that a purely random graph, such as an Erdős-Rényi graph where all pairs of nodes are connected with the same probability, has no community structure to discover. However, the vast majority of community detection algorithms do in fact find illusory communities in such graphs, due to random fluctuations. For instance, sparse random graphs possess bisections, i.e., divisions of the nodes into two equal groups, where the number of edges between groups is much smaller than the number of edges within groups. To give a specific example, the nodes in a large 3-regular random graph can be bisected in such a way that only about 11% of all edges are between the groups Šulc and Zdeborová (2010). Popular measures of community significance such as modularity Newman and Girvan (2004) do not take this fact into account. In contrast, our method naturally recognizes when a network does not in fact contain any modular structure.

• Better measures of significance: More generally, most known methods for community detection aim at providing one assignment of nodes to groups; physically speaking, they look for a single ground state. However, there are usually a large number of group assignments that are comparable to each other according to various quality measures, see e.g. Good et al. (2010). Our method provides all the thermodynamically significant group assignments “at once,” by providing the marginal probabilities with which each node belongs to a given community. It also provides the entropy of the good assignments, giving us a measure of how non-unique they are. This kind of information is much more informative than a single group assignment, even if we can find the “best” one. Physically speaking, the Boltzmann distribution tells us more about a network than the ground state does.

We stress, however, that all of the above is true only for networks generated from the stochastic block model, or for real world networks that are well described by this generative model. Indeed, many of the other methods suggested in the literature for community detection also implicitly assume that the network under study is well described by a similar model. For many real networks, this is not true; for instance, as pointed out in Karrer and Newman (2011), the block model performs poorly on networks where communities include nodes with a very broad range of degrees. On the other hand, the Bayesian approach and belief propagation algorithm presented here can be generalized straightforwardly to any generative model where the likelihood (6) can be written as a product of local terms, such as the degree-corrected block model suggested in Karrer and Newman (2011). Thus our methods can apply to a wide variety of generative models, which take various kinds of information about the nodes and edges into account. We leave these generalizations for future work.

III Cavity method and the stochastic block model

In this section we derive the cavity equations and the associated belief propagation (BP) algorithm for computing the marginal probabilities (12) and the average values (17-19) that we need in order to learn the parameters of the block model. When applied in the thermodynamic limit $N\to\infty$, our analysis lets us describe the phase diagram of the learning and inference problems, showing for which parameters these problems are easy, hard, or impossible.

III.1 Cavity equations: marginals and free energy

In the literature the so-called replica symmetric cavity method is often understood in terms of the BP algorithm and its behavior in the thermodynamic limit with properly chosen initial conditions. From a physics point of view, the BP algorithm is an iterative way to compute the partition function by neglecting correlations between the neighbors of a node $i$ when conditioning on the "spin" or label of $i$. Such correlations are nonexistent if the network of interactions is a tree. On networks that are locally treelike, these correlations are negligible provided they decay rapidly with distance, making BP asymptotically exact.

Note that in our case the "network of interactions" is fully connected, since in the Hamiltonian (8) there are weak interactions even along the non-edges, i.e., between pairs of nodes that are not connected. However, as we will see, these weak interactions can be replaced with a "mean field," reducing the interactions to the sparse network.

The belief propagation equations are derived from a recursive computation of the partition function with the assumption that the network of interactions is a tree. The asymptotic exactness of the replica symmetric cavity method (BP equations) is then validated by showing that the correlations that have been neglected are indeed negligible in the thermodynamic limit.

To write the belief propagation equations for the Hamiltonian (8) we define conditional marginals, or messages, denoted $\psi^{i\to j}_{t_i}$. This is the marginal probability that node $i$ belongs to group $t_i$ in the absence of node $j$. The cavity method assumes that the only correlations between $i$'s neighbors are mediated through $i$, so that if $i$ were missing (or if its label were fixed) the joint distribution of its neighbors' states would be a product distribution. In that case, we can compute the message that $i$ sends $j$ recursively in terms of the messages that $i$ receives from its other neighbors $k$:

\psi^{i\to j}_{t_i} = \frac{n_{t_i}}{Z^{i\to j}} \prod_{k\neq i,j}\left[\sum_{t_k} c_{t_i t_k}^{A_{ik}}\left(1-\frac{c_{t_i t_k}}{N}\right)^{1-A_{ik}}\psi^{k\to i}_{t_k}\right], \qquad (22)

where $\partial i$ denotes $i$'s neighborhood and $Z^{i\to j}$ is a normalization constant ensuring $\sum_{t_i}\psi^{i\to j}_{t_i}=1$. We apply (22) iteratively until we reach a fixed point. Then the marginal probability of node $i$ is estimated to be $\nu_i(t_i)=\psi^{i}_{t_i}$, where

\psi^{i}_{t_i} = \frac{n_{t_i}}{Z^{i}} \prod_{k\neq i}\left[\sum_{t_k} c_{t_i t_k}^{A_{ik}}\left(1-\frac{c_{t_i t_k}}{N}\right)^{1-A_{ik}}\psi^{k\to i}_{t_k}\right]. \qquad (23)

These equations are for the undirected case. In a directed network, node $i$ would send messages to, and receive messages from, both its incoming and outgoing neighbors, and we would use $c_{t_i t_k}$ or $c_{t_k t_i}$ depending on the direction of the edge.

Since we have nonzero interactions between every pair of nodes, we have potentially $N(N-1)$ messages, and indeed (22) tells us how to update all of these for finite $N$. However, this gives an algorithm where even a single sweep takes $O(N^2)$ time, making it suitable only for networks of up to a few thousand nodes. Happily, for large sparse networks, i.e., when $N$ is large and $c_{ab}=O(1)$, we can neglect terms of sub-leading order in $N$. In that case we can assume that $i$ sends the same message to all its non-neighbors $j$, and treat these messages as an external field, so that we only need to keep track of $2M$ messages, where $M$ is the number of edges. Each update step then takes only $O(M)$ time.

To see this, suppose that $(i,j)\notin E$. We have

\psi^{i\to j}_{t_i} = \frac{n_{t_i}}{Z^{i\to j}} \prod_{\substack{k\notin\partial i\\ k\neq j}}\left[1-\frac{1}{N}\sum_{t_k} c_{t_k t_i}\psi^{k\to i}_{t_k}\right]\prod_{k\in\partial i}\left[\sum_{t_k} c_{t_k t_i}\psi^{k\to i}_{t_k}\right] = \psi^{i}_{t_i} + O\!\left(\frac{1}{N}\right). \qquad (24)

Hence, to leading order, the messages on non-edges do not depend on the target node $j$. On the other hand, if $(i,j)\in E$ we have

\psi^{i\to j}_{t_i} = \frac{n_{t_i}}{Z^{i\to j}} \prod_{k\notin\partial i}\left[1-\frac{1}{N}\sum_{t_k} c_{t_k t_i}\psi^{k\to i}_{t_k}\right]\prod_{k\in\partial i\setminus j}\left[\sum_{t_k} c_{t_k t_i}\psi^{k\to i}_{t_k}\right]. \qquad (25)

The belief propagation equations can hence be rewritten as

\psi^{i\to j}_{t_i} = \frac{n_{t_i}}{Z^{i\to j}}\, e^{-h_{t_i}} \prod_{k\in\partial i\setminus j}\left[\sum_{t_k} c_{t_k t_i}\psi^{k\to i}_{t_k}\right], \qquad (26)

where we neglected terms that contribute $O(1/N)$ to the messages, and defined an auxiliary external field

h_{t_i} = \frac{1}{N}\sum_{k}\sum_{t_k} c_{t_k t_i}\psi^{k}_{t_k}. \qquad (27)

In order to find a fixed point of Eq. (26) in linear time we update the messages $\psi^{i\to j}$, recompute the marginals $\psi^{j}$, update the field $h_{t}$ by adding the new contribution and subtracting the old one, and repeat. The estimate of the marginal probability is then

\psi^{i}_{t_i} = \frac{n_{t_i}}{Z^{i}}\, e^{-h_{t_i}} \prod_{j\in\partial i}\left[\sum_{t_j} c_{t_j t_i}\psi^{j\to i}_{t_j}\right]. \qquad (28)

When the cavity approach is asymptotically exact, the true marginal probabilities obey $\nu_i(t_i)=\psi^{i}_{t_i}$. The overlap with the original group assignment is then computed from (14). Introducing

Z^{ij} = \sum_{a<b} c_{ab}\left(\psi^{i\to j}_{a}\psi^{j\to i}_{b}+\psi^{i\to j}_{b}\psi^{j\to i}_{a}\right) + \sum_{a} c_{aa}\,\psi^{i\to j}_{a}\psi^{j\to i}_{a} \quad \text{for } (i,j)\in E, \qquad (29)

Z^{i} = \sum_{t_i} n_{t_i}\, e^{-h_{t_i}} \prod_{j\in\partial i}\left[\sum_{t_j} c_{t_j t_i}\psi^{j\to i}_{t_j}\right], \qquad (30)

we can write the BP estimate for the free energy, also called the Bethe free energy, in the thermodynamic limit as

f_{\rm BP}(q,\{n_a\},\{c_{ab}\}) = -\frac{1}{N}\sum_{i}\log Z^{i} + \frac{1}{N}\sum_{(i,j)\in E}\log Z^{ij} - \frac{c}{2}, \qquad (32)

where $c$ is the average degree given by (1), and the third term is the contribution of the non-edges, i.e., pairs $(i,j)\notin E$.

When deriving the BP equations (26), the BP estimates of marginals (28), and the Bethe free energy (32), we saw that even though the original graph of interactions is fully connected, we end up with BP equations only on the edges of the sparse network. That network is locally treelike, so standard assumptions about correlation decay suggest that the BP equations are asymptotically exact.
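The reduction above rests on the observation in (24) that a message sent along a non-edge differs from the corresponding marginal only at $O(1/N)$. The following toy check (our own construction; all messages are fixed by hand and the parameter values are illustrative, nothing here is taken from the paper) makes this concrete:

```python
import numpy as np

# Toy check: for a non-neighbor j, the dense-BP message psi^{i->j} of (22)
# agrees with the marginal psi^i of (23) up to O(1/N), which justifies
# replacing non-edge messages by a mean field.
q = 2
C = np.array([[6.0, 1.0], [1.0, 6.0]])  # illustrative affinity matrix c_ab
n = np.full(q, 0.5)

def gap(N):
    """|psi^i - psi^{i->j}| when only node j's incoming message is non-uniform."""
    psi_j = np.array([0.9, 0.1])        # the one distinctive incoming message
    # non-edge factor contributed by j: present in psi^i, absent from psi^{i->j}
    factor_j = 1.0 - (C @ psi_j) / N
    # all other incoming messages are uniform, so their factors are identical
    # in both quantities and cancel after normalization
    marg = factor_j / factor_j.sum()    # psi^i (up to the shared factors)
    msg = n                             # psi^{i->j}
    return np.abs(marg - msg).max()

print(gap(100), gap(1000))              # shrinks roughly like 1/N
```

The printed gap decreases by about a factor of ten when $N$ grows tenfold, consistent with the $O(1/N)$ correction in (24).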

In the statistical physics of spin glasses, the equations presented in this section are called the replica symmetric cavity equations. A large amount of work has been devoted to understanding under what circumstances these equations give asymptotically exact results for the marginals and the free energy, and when they do not Mézard et al. (1987); Mézard and Montanari (2009). All known cases in which the replica symmetric solution of a model like (8) fails on a randomly generated network fall into two classes: a static spin glass phase with spin glass susceptibility

\chi_{\rm SG} = \frac{1}{N}\sum_{i,j}\sum_{t_i,t_j}\left[\nu_{ij}(t_i,t_j)-\nu_i(t_i)\,\nu_j(t_j)\right]^2 \qquad (33)

diverging with the system size, or a first-order phase transition into a dynamically non-attractive "ferromagnetic" phase, see e.g. Franz et al. (2001).

It is a general property of inference problems that at the correct value of the parameters the static spin glass phase never exists. Indeed, when the parameters are the actual ones from which the graph was generated, the system satisfies the so-called Nishimori condition, and a very general result is that there is no static spin glass phase on the Nishimori line Nishimori (2001); Montanari (2008).

On the other hand, the first-order phase transition to a dynamically non-attractive ferromagnetic phase cannot be avoided. This phase is easy to detect, and to describe asymptotically exactly with the BP equations, if we know the true group assignment. Thus the BP analysis is still asymptotically exact for the purpose of analysis of the phase diagram. However, in the situation we care about, where the original assignment is not known, this phase transition poses an algorithmic problem. We explain this in detail in Section IV.

III.2 Belief propagation algorithm for inferring the group assignment

We present here the belief propagation algorithm for inferring the group assignment, and for estimating the free energy, the marginals, and the overlap.


The main cycle of the algorithm takes $O(M)$ time per iteration: $2M$ messages need to be updated, and each update takes a number of operations that is constant for fixed $q$ and average degree. The number of iterations needed for the algorithm to converge is, in general, a more complicated question. However, at the right value of the parameters, the Nishimori condition ensures that BP converges to a fixed point in a number of steps that does not grow with $N$, so that the total running time of the algorithm is $O(M)$. This is illustrated in Fig. 2. A word of caution: if the parameters are not equal to the correct ones, the algorithm might not converge, and indeed sometimes does not. But even in that case, the messages after a fixed number of iterations can provide useful information.
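To make the discussion concrete, here is a minimal Python sketch of the sparse BP iteration (26)-(28) with the external-field trick, together with a toy graph generator. This is our own illustrative implementation, not the authors' code; the function names, parameter values, and convergence criterion are assumptions.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

def generate_sbm(N, q, c_in, c_out):
    """Planted partition: edge (i, j) is present with probability c_{g_i g_j}/N."""
    groups = rng.integers(q, size=N)
    edges = [(i, j) for i in range(N) for j in range(i + 1, N)
             if rng.random() < (c_in if groups[i] == groups[j] else c_out) / N]
    return groups, edges

def bp_infer(N, q, edges, n, C, iters=60, tol=1e-6):
    adj = defaultdict(list)
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    # one message psi^{i->j} per directed edge, random initial conditions
    psi = {}
    for i, j in edges:
        for s, t in ((i, j), (j, i)):
            m = rng.random(q)
            psi[(s, t)] = m / m.sum()
    marg = {i: np.full(q, 1.0 / q) for i in range(N)}
    for _ in range(iters):
        h = sum(C @ marg[k] for k in range(N)) / N    # external field (27)
        diff = 0.0
        for i in range(N):
            # precompute sum_{t_k} c_{t_k t_i} psi^{k->i}_{t_k} per neighbor
            inc = {k: C @ psi[(k, i)] for k in adj[i]}
            for j in adj[i]:                          # message update (26)
                m = n * np.exp(-h)
                for k in adj[i]:
                    if k != j:
                        m = m * inc[k]
                m /= m.sum()
                diff += np.abs(m - psi[(i, j)]).sum()
                psi[(i, j)] = m
            m = n * np.exp(-h)                        # marginal estimate (28)
            for k in adj[i]:
                m = m * inc[k]
            marg[i] = m / m.sum()
        if diff < tol:
            break
    return marg

N, q, c_in, c_out = 150, 2, 10.0, 1.0   # well inside the detectable phase
groups, edges = generate_sbm(N, q, c_in, c_out)
n = np.full(q, 1.0 / q)
C = np.full((q, q), c_out) + (c_in - c_out) * np.eye(q)
marg = bp_infer(N, q, edges, n, C)
est = np.array([marg[i].argmax() for i in range(N)])
# the overlap is defined only up to a global permutation of the labels
agreement = max(np.mean(est == groups), np.mean(est == 1 - groups))
print(agreement)   # typically well above 1/2 in this regime
```

A production implementation would update the field incrementally as described above rather than recomputing it each sweep; the version here favors readability over the strict $O(M)$ schedule.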

To conclude, we summarize some properties of the algorithm. At the correct parameters, the BP algorithm runs in time linear in the size of the network, and is typically much faster than an equivalent Gibbs sampling algorithm. Its advantage over other community detection algorithms is that, in addition to finding the group assignment that maximizes the overlap with the original assignment, it provides the marginal probability that each node belongs to each group, along with natural measures of the significance of the inferred assignment.

Of course, here we assumed that we already knew the parameters with which the network was generated. The next step is to use the BP algorithm as a subroutine to learn them, and to discuss under what circumstances this learning task is possible.

III.3 Belief propagation algorithm to learn the parameters

The Nishimori conditions (17-19) that we use for iterative learning can be written in terms of the BP messages as (for undirected graphs)

n_a = \frac{1}{N}\sum_{i}\psi^{i}_{a}, \qquad (34)

c_{ab} = \frac{1}{N\,n_a n_b}\sum_{(i,j)\in E}\frac{c_{ab}\left(\psi^{i\to j}_{a}\psi^{j\to i}_{b}+\psi^{i\to j}_{b}\psi^{j\to i}_{a}\right)}{Z^{ij}}, \qquad (35)

where $Z^{ij}$ is defined in (29). Therefore, BP can also be used to learn the optimal parameters as follows.

This is an expectation-maximization (EM) learning algorithm Dempster et al. (1977), where we use BP for the expectation step. The marginals in the expectation step can also be computed using Gibbs sampling, but BP is much faster. The number of iterations needed for the EM algorithm to converge is constant in the size of the system. However, it generally converges to the correct parameters only from some finite fraction of the possible initial values. In practice, several initial values need to be tried, and the fixed point with the smallest final free energy is the correct one.

When the learning process is possible, this algorithm learns the group sizes $n_a$ and the affinity matrix $c_{ab}$ exactly in the limit $N\to\infty$. To learn the number of groups $q$, we run BP-learning for different values of $q$ and find the smallest $q$ beyond which the free energy density no longer decreases.
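A minimal sketch of one maximization step, i.e., the updates (34) and (35). The messages and marginals below are random placeholders standing in for a converged BP run (all names and values are our own), so the only meaningful check is the normalization sum rule $\sum_{ab} n_a c_{ab} n_b = 2M/N$ implied by (35):

```python
import numpy as np

rng = np.random.default_rng(2)

def em_update(N, edges, psi, marg, C):
    """One maximization step: n_a from (34), c_ab from (35)."""
    q = C.shape[0]
    n_new = marg.mean(axis=0)                          # (34)
    acc = np.zeros((q, q))
    for i, j in edges:
        mi, mj = psi[(i, j)], psi[(j, i)]
        Zij = mi @ C @ mj                              # edge term of (29)
        acc += C * (np.outer(mi, mj) + np.outer(mj, mi)) / Zij
    C_new = acc / (N * np.outer(n_new, n_new))         # (35)
    return n_new, C_new

# toy stand-ins for a converged BP run: a small cycle and random messages
N, q = 6, 2
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
psi = {}
for i, j in edges:
    for s, t in ((i, j), (j, i)):
        m = rng.random(q)
        psi[(s, t)] = m / m.sum()
marg = rng.random((N, q))
marg /= marg.sum(axis=1, keepdims=True)
C = np.array([[5.0, 1.0], [1.0, 5.0]])

n_new, C_new = em_update(N, edges, psi, marg, C)
# sum rule implied by (35): sum_ab n_a c_ab n_b = 2M/N, the average degree
print(np.isclose(n_new @ C_new @ n_new, 2 * len(edges) / N))
```

The sum rule holds for any messages, which makes it a convenient sanity check when implementing the update.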

IV Phase transition in inference and learning

In this section we limit ourselves to a particularly difficult case of the block model, where the graph is undirected and every group has the same average degree $c$:

c_a \equiv \sum_{b} c_{ab}\, n_b = c \quad \text{for every group } a. \qquad (36)

If this is not the case, we can achieve a positive overlap with the original group assignment simply by labeling nodes based on their degrees, as we will briefly discuss in Section IV.3. We call a block model satisfying (36) a factorized block model, and explain the reason for this name in the next paragraph. Note that this case includes both the planted partitioning and the planted (noisy) coloring problems discussed in Section II.1.

The first observation to make about the belief propagation equations (26) in the factorized block model is that

\psi^{i\to j}_{t_i} = n_{t_i} \qquad (37)

is always a fixed point, as can be verified by plugging (37) into (26). In the literature, a fixed point where messages do not depend on the indices $i,j$ is called a factorized fixed point, hence our name for this case of the block model. The free energy density at this fixed point is

f_{\rm factorized} = \frac{c}{2}\left(1-\log c\right). \qquad (38)

For the factorized fixed point we have $\psi^{i}_{t_i}=n_{t_i}$, in which case the overlap (14) is zero. This fixed point does not provide any information about the original assignment; it is no better than a random guess. If this fixed point gives the correct marginal probabilities and the correct free energy, we have no hope of recovering the original group assignment. For which values of the parameters is this the case?
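The fixed-point property can be checked numerically. The sketch below (our own, with illustrative parameter values) plugs the uniform messages (37) into the update (26) for a single node and confirms that they are reproduced:

```python
import numpy as np

# Check that psi^{i->j}_t = n_t of (37) is a fixed point of the BP update (26)
# whenever the factorized condition (36) holds. Parameters are illustrative.
q = 3
n = np.full(q, 1.0 / q)
c_in, c_out = 5.0, 1.0
C = np.full((q, q), c_out) + (c_in - c_out) * np.eye(q)
assert np.allclose(C @ n, (C @ n)[0])      # condition (36): equal average degrees

def bp_update(incoming, h):
    """One update of (26); `incoming` holds the messages psi^{k->i}."""
    psi = n * np.exp(-h)
    for m in incoming:
        psi = psi * (C @ m)                # sum_{t_k} c_{t_k t_i} psi^{k->i}_{t_k}
    return psi / psi.sum()

h = C @ n                                  # external field (27) at this point
out = bp_update([n.copy() for _ in range(4)], h)   # a node of degree 4
print(np.allclose(out, n))                 # True: (37) is indeed a fixed point
```

Because every factor $\sum_{t_k} c_{t_k t_i} n_{t_k} = c$ and the field $h_{t_i}=c$ are independent of $t_i$, the update multiplies $n_{t_i}$ by a constant, which normalization removes.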

IV.1 Phase transitions in community detection

We will first study the result given by the cavity method in the thermodynamic limit in the case when the parameters used to generate the network are known.

Fig. 1 shows two examples in which the overlap is computed on a randomly generated graph with $q$ groups of the same size and an average degree $c$. We set $c_{aa}=c_{\rm in}$ and $c_{ab}=c_{\rm out}$ for all $a\neq b$, and vary the ratio $\epsilon=c_{\rm out}/c_{\rm in}$. The continuous line is the overlap resulting from the BP fixed point obtained by converging from a random initial condition (i.e., where for each node the initial messages are random normalized distributions over the $q$ groups). The convergence time is plotted in Fig. 2. The points in Fig. 1 are results obtained from Gibbs sampling, using the Metropolis rule obeying detailed balance with respect to the Hamiltonian (8), starting with a random initial group assignment. We see that the overlap vanishes when $\epsilon$ exceeds a critical value $\epsilon_c$. In other words, in this region both BP and MCMC converge to the factorized state, where the marginals contain no information about the original assignment. For $\epsilon<\epsilon_c$, however, the overlap is positive and the factorized fixed point is not the one to which BP or MCMC converge.

In particular, the right-hand side of Fig. 1 shows the case of $q=4$ groups with average degree $c=16$, corresponding to the benchmark of Newman and Girvan Newman and Girvan (2004). We show the large-$N$ results and also the overlap computed with MCMC for $N=128$, which is the commonly used size for this benchmark. Again, up to symmetry breaking, marginalization achieves the best possible overlap that can be inferred from the graph by any algorithm. Therefore, when algorithms are tested for performance, their results should be compared to Fig. 1 instead of to the common but wrong expectation that the four groups are detectable whenever $c_{\rm out}<c_{\rm in}$.

Let us now investigate the stability of the factorized fixed point under random perturbations of the messages as we iterate the BP equations. In the sparse case where $c_{ab}=O(1)$, graphs generated by the block model are locally treelike, in the sense that almost all nodes have a neighborhood that is a tree up to distance $O(\log N)$, where the constant hidden in the $O$ depends on the matrix $c_{ab}$. Equivalently, for almost all nodes $i$, the shortest loop through $i$ has length $O(\log N)$. Consider such a tree with $d$ levels, in the limit $d\to\infty$. Assume that on the leaves the factorized fixed point is perturbed as

\psi^{k}_{t} = n_{t} + \epsilon^{k}_{t}, \qquad (39)

and let us investigate the influence of this perturbation on the message at the root of the tree, which we denote $k_0$. There are, on average, $c^{d}$ leaves in the tree, where $c$ is the average degree. The influence of each leaf is independent, so let us first investigate the influence of the perturbation of a single leaf $k_d$, which is connected to $k_0$ by a path $k_d, k_{d-1}, \ldots, k_1, k_0$. We define a kind of transfer matrix

T^{i}_{ab} \equiv \left.\frac{\partial \psi^{k_i}_{a}}{\partial \psi^{k_{i+1}}_{b}}\right|_{\psi_t=n_t} = \left.\left[\frac{\psi^{k_i}_{a}\, c_{ab}}{\sum_r c_{ar}\psi^{k_{i+1}}_{r}} - \psi^{k_i}_{a}\sum_s \frac{\psi^{k_i}_{s}\, c_{sb}}{\sum_r c_{sr}\psi^{k_{i+1}}_{r}}\right]\right|_{\psi_t=n_t} = n_a\left(\frac{c_{ab}}{c}-1\right), \qquad (40)

where this expression was derived from (26) to leading order in $N$. The perturbation $\epsilon^{k_0}_{t_0}$ on the root due to the perturbation on the leaf can then be written as

\epsilon^{k_0}_{t_0} = \sum_{\{t_i\}_{i=1,\ldots,d}}\left[\prod_{i=0}^{d-1} T^{i}_{t_i t_{i+1}}\right]\epsilon^{k_d}_{t_d}. \qquad (41)

We observe in (40) that the matrix $T^{i}$ does not depend on the index $i$; call it $T$. Hence (41) can be written as $\epsilon^{k_0}=T^{d}\epsilon^{k_d}$. When $d\to\infty$, $T^{d}$ is dominated by $T$'s largest eigenvalue $\lambda$, so $\epsilon^{k_0}\approx \lambda^{d}\,\epsilon^{k_d}$.

Now let us consider the influence from all of the leaves. The mean value of the perturbation on the leaves is zero, so the mean value of the influence on the root is zero. For the variance, however, we have

\left\langle\left(\epsilon^{k_0}_{t_0}\right)^{2}\right\rangle \approx \left\langle\left(\sum_{k=1}^{c^{d}}\lambda^{d}\epsilon^{k}_{t}\right)^{2}\right\rangle \approx c^{d}\lambda^{2d}\left\langle\left(\epsilon^{k}_{t}\right)^{2}\right\rangle. \qquad (42)

This gives the following stability criterion,

c\,\lambda^{2} = 1. \qquad (43)

For $c\lambda^{2}<1$ the perturbation on the leaves vanishes as we move up the tree and the factorized fixed point is stable. On the other hand, if $c\lambda^{2}>1$ the perturbation is amplified exponentially, the factorized fixed point is unstable, and the communities are easily detectable.

Consider the case with $q$ groups of equal size, where $c_{aa}=c_{\rm in}$ for all $a$ and $c_{ab}=c_{\rm out}$ for all $a\neq b$. This includes the Newman-Girvan benchmarks, as well as planted (noisy) graph coloring and planted graph partitioning. If there are $q$ groups, then $c=\left[c_{\rm in}+(q-1)c_{\rm out}\right]/q$. The transfer matrix $T_{ab}$ has only two distinct eigenvalues: $\lambda_1=0$, with eigenvector $(1,1,\ldots,1)$, and $\lambda_2=(c_{\rm in}-c_{\rm out})/(qc)$, with eigenvectors of the form $(0,\ldots,0,1,-1,0,\ldots,0)$ and degeneracy $q-1$. The factorized fixed point is then unstable, and communities are easily detectable, if

\left|c_{\rm in}-c_{\rm out}\right| > q\sqrt{c}. \qquad (44)
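Numerically, the spectrum of the transfer matrix and the resulting threshold can be checked as follows (a sketch with illustrative parameter values of our choosing):

```python
import numpy as np

# Build the transfer matrix T_ab = n_a (c_ab/c - 1) of (40) for the symmetric
# case and check its spectrum against the criterion (43) and condition (44).
q, c_in, c_out = 4, 10.0, 1.0
c = (c_in + (q - 1) * c_out) / q                    # average degree
C = np.full((q, q), c_out) + (c_in - c_out) * np.eye(q)
T = (1.0 / q) * (C / c - 1.0)                       # n_a = 1/q; "-1" is elementwise

eigs = np.linalg.eigvalsh(T)                        # T is symmetric here
lam = eigs[np.argmax(np.abs(eigs))]                 # leading eigenvalue
# predicted: lambda_1 = 0 (uniform vector), lambda_2 = (c_in - c_out)/(q c)
print(np.isclose(lam, (c_in - c_out) / (q * c)))

# stability criterion (43): factorized fixed point unstable iff c lambda^2 > 1,
# which for these parameters is equivalent to |c_in - c_out| > q sqrt(c), i.e. (44)
print(c * lam**2 > 1, abs(c_in - c_out) > q * np.sqrt(c))
```

Both formulations of the instability condition agree, as they must, since $c\lambda_2^2>1$ reduces algebraically to $|c_{\rm in}-c_{\rm out}|>q\sqrt{c}$.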

The stability condition (43) is known in the literature on spin glasses as the de Almeida-Thouless local stability condition de Almeida and Thouless (1978), in information science as the Kesten-Stigum bound on reconstruction on trees Kesten and Stigum (1966a, b), or the threshold for census reconstruction Mézard and Montanari (2006), or robust reconstruction threshold Janson and Mossel (2004).

We observed empirically that for random initial conditions both belief propagation and Markov chain Monte Carlo converge to the factorized fixed point when $|c_{\rm in}-c_{\rm out}|<q\sqrt{c}$. On the other hand, when $|c_{\rm in}-c_{\rm out}|>q\sqrt{c}$, BP and MCMC converge to a fixed point with a positive overlap, so that it is possible to find a group assignment that is correlated (often strongly) with the original assignment. We thus conclude that if the parameters are known and $|c_{\rm in}-c_{\rm out}|>q\sqrt{c}$, it is possible to reconstruct the original group assignment.

We can estimate the number of assignments that are just as good as the original one, i.e., that have the right group sizes and the right number of edges between each pair of groups, using Eq. (21) to express the entropy as

s = \log q + \frac{c}{2}\log c - \frac{1}{2q}\left[(q-1)\,c_{\rm out}\log c_{\rm out} + c_{\rm in}\log c_{\rm in}\right]. \qquad (45)

We can think of this entropy as a measure of our uncertainty about the group assignment.
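As a small worked example (the parameter values are ours, not taken from a figure), the formula (45) can be evaluated directly; a useful consistency check is that for $c_{\rm in}=c_{\rm out}$ it reduces to $s=\log q$, the entropy of arbitrary labelings of an unstructured graph:

```python
import numpy as np

# Evaluate the entropy (45) of assignments as good as the planted one.
def entropy(q, c_in, c_out):
    c = (c_in + (q - 1) * c_out) / q
    return (np.log(q) + c / 2 * np.log(c)
            - ((q - 1) * c_out * np.log(c_out) + c_in * np.log(c_in)) / (2 * q))

s = entropy(4, 7.0, 1.0)               # illustrative structured case
print(s)                               # between 0 and log(4)
print(np.isclose(entropy(4, 3.0, 3.0), np.log(4)))  # no structure: s = log q
```

A smaller entropy indicates less uncertainty about the group assignment.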

Next, we discuss a situation where the true marginal probabilities are, in the thermodynamic limit, equal to $n_{t_i}$, and the free energy is given by (38). From the first fact it follows that the graph contains zero (or infinitesimally little) information about the original assignment; i.e., the overlap of the marginalized assignment with the original one is zero. To understand how this is possible even if $c_{\rm in}\neq c_{\rm out}$, note that from the expressions for the free energy, it follows that the network generated by the block model is thermodynamically indistinguishable from an Erdős-Rényi random graph of the same average degree. For the coloring problem, where $c_{\rm in}=0$ and $n_a=1/q$ for all $a$, this was proved in Achlioptas and Coja-Oghlan (2008) and discussed under the name "quiet planting" in Krzakala and Zdeborová (2009).

The main line of reasoning in the proof of Achlioptas and Coja-Oghlan (2008) goes as follows. First note that the free energy (38) is equal to the annealed free energy

f_{\rm ann} = -\lim_{N\to\infty}\frac{1}{N}\log\,[Z]_G, \qquad (46)

where $[\cdot]_G$ denotes the average over graphs. Then consider the following thought experiment. First fix the parameters $q$, $\{n_a\}$, $\{c_{ab}\}$; then list in columns all the group assignments with group sizes $\{Nn_a\}$, and list in rows all graphs with $M$ edges. Mark each (graph, assignment) pair for which the graph, under that assignment, has the correct number of edges between each pair of groups.

Now consider two ways to choose these pairs. The block model corresponds to choosing a random assignment (column), and then choosing a random graph (row) consistent with it. In contrast, we can start by choosing a random graph (row), and then choose an assignment (column) consistent with the block model. If there are exactly the same number of marked pairs in every row and column, these two procedures are the same. In that case, it would be impossible to distinguish a graph generated by the block model from an Erdős-Rényi graph.

It is certainly true that for every group assignment (with fixed group sizes) there are the same number of compatible graphs. But different graphs have different numbers of compatible assignments. Thus the number of marked pairs is different in different rows. The number of marked pairs in each row is essentially the partition function $Z$. Now, if all but an exponentially small fraction of graphs have the same typical properties as a random graph (a large deviation principle), it follows that the block model also generates typical random graphs as long as the quenched free energy equals the annealed one, i.e., when $f=f_{\rm ann}$.

In the example presented in Fig. 1, both BP and MCMC confirm that the free energy (38) is the correct free energy and that the marginals are uniform, $\nu_i(t_i)=1/q$, when $|c_{\rm in}-c_{\rm out}|<q\sqrt{c}$. The only possibility known in statistical physics for MCMC not to give the correct free energy is ergodicity breaking on the time-scale over which the MCMC was run (i.e., insufficient equilibration time). Indeed, the original group assignment could belong to a part of the phase space that dominates the Boltzmann distribution, but that is invisible to dynamics on time-scales linear in the size of the system. Such a situation indeed occurs in systems that undergo the ideal glass transition. There is a simple trick to check whether such a glass transition appears in the block model: just run BP or MCMC using the original group assignment as the initial condition.

When we do this for the examples shown in Fig. 1, there is absolutely no change in the result. Thus the free energy and overlaps presented in Fig. 1 are asymptotically exact and we can distinguish two phases:

• If $|c_{\rm in}-c_{\rm out}|<q\sqrt{c}$, the graph does not contain any significant information about the original group assignment, and community detection is impossible.

• If $|c_{\rm in}-c_{\rm out}|>q\sqrt{c}$, the graph contains significant information about the original group assignment, and using BP or MCMC yields an assignment that is strongly correlated with the original one. There is some intrinsic uncertainty about the group assignment due to the entropy, but if the graph was generated from the block model there is no better method for inference than the marginalization introduced by Eq. (13).

Fig. 1 hence illustrates a phase transition in the detectability of communities. Unless the ratio $c_{\rm out}/c_{\rm in}$ is far enough from $1$, the groups that truly existed when the network was generated are undetectable from the topology of the network. Moreover, unless the condition (44) is satisfied, the graph generated by the block model is indistinguishable from a random graph, in the sense that typical thermodynamic properties of the two ensembles are the same.

The situation illustrated in Fig. 1 is, however, not the most general one. Fig. 3 illustrates the case of planted coloring, where $c_{\rm in}=0$ and all groups have equal size. In this case the condition for stability (44) leads to a threshold value