Reconstruction and Estimation in the Planted Partition Model

Elchanan Mossel (supported by NSF grant DMS-1106999 and DOD ONR grant N000141110140)
Joe Neeman (Department of Statistics, UC Berkeley)
Allan Sly (Department of Statistics, UC Berkeley)

Abstract

The planted partition model (also known as the stochastic blockmodel) is a classical cluster-exhibiting random graph model that has been extensively studied in statistics, physics, and computer science. In its simplest form, the planted partition model is a model for random graphs on $n$ nodes with two equal-sized clusters, with a between-class edge probability of $q$ and a within-class edge probability of $p$. Although most of the literature on this model has focused on the case of increasing degrees (i.e. $pn, qn \to \infty$ as $n \to \infty$), the sparse case $p, q = O(1/n)$ is interesting both from a mathematical and an applied point of view.

A striking conjecture of Decelle, Krzakala, Moore and Zdeborová based on deep, non-rigorous ideas from statistical physics gave a precise prediction for the algorithmic threshold of clustering in the sparse planted partition model. In particular, if $p = a/n$ and $q = b/n$, then Decelle et al. conjectured that it is possible to cluster in a way correlated with the true partition if $(a-b)^2 > 2(a+b)$, and impossible if $(a-b)^2 < 2(a+b)$. By comparison, the best-known rigorous result is that of Coja-Oghlan, who showed that clustering is possible if $(a-b)^2 > C(a+b)$ for some sufficiently large $C$.

We prove half of their prediction, showing that it is indeed impossible to cluster if $(a-b)^2 < 2(a+b)$. Furthermore we show that it is impossible even to estimate the model parameters from the graph when $(a-b)^2 < 2(a+b)$; on the other hand, we provide a simple and efficient algorithm for estimating $a$ and $b$ when $(a-b)^2 > 2(a+b)$. Following Decelle et al., our work establishes a rigorous connection between the clustering problem, spin-glass models on the Bethe lattice and the so-called reconstruction problem. This connection points to fascinating applications and open problems.

1 Introduction

1.1 The planted partition problem

The clustering problem in its general form is, given a (possibly weighted) graph, to divide its vertices into several strongly connected classes with relatively weak cross-class connections. This problem is fundamental in modern statistics, machine learning and data mining, but its applications range from population genetics [26], where it is used to find genetically similar sub-populations, to image processing [30, 33], where it can be used to segment images or to group similar images, to the study of social networks [25], where it is used to find strongly connected groups of like-minded people.

The algorithms used for clustering are nearly as diverse as their applications. On one side are the hierarchical clustering algorithms [19] which build a hierarchy of larger and larger communities, by either recursive aggregation or division. On the other hand model-based statistical methods, including the celebrated EM algorithm [9], are used to fit cluster-exhibiting statistical models to the data. A third group of methods work by optimizing some sort of cost function, for example by finding a minimum cut [15, 30] or by maximizing the Girvan-Newman modularity [1, 24].

Despite the variety of available clustering algorithms, the theory of clustering contains some fascinating and fundamental algorithmic challenges. For example, the “min-bisection” problem – which asks for the smallest graph cut dividing a graph into two equal-sized pieces – is well-known to be NP-hard [13]. Going back to the 1980s, there has been much study of the average-case complexity of the min-bisection problem. For instance, the min-bisection problem is much easier if the minimum bisection is substantially smaller than most other bisections. This has led to interest in random graph models for which a typical sample has exactly one good minimum bisection. Perhaps the simplest such model is the “planted bisection” model, which is similar to the Erdös-Renyi model.

Definition 1.1 (The planted bisection model).

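For $n \in \mathbb{N}$ and $p, q \in [0,1]$, let $\mathcal{G}(n, p, q)$ denote the model of random, $\pm$-labelled graphs in which each vertex $u$ is assigned (independently and uniformly at random) a label $\sigma_u \in \{\pm\}$, and then each possible edge $(u, v)$ is included with probability $p$ if $\sigma_u = \sigma_v$ and with probability $q$ if $\sigma_u \ne \sigma_v$.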

If $p = q$, the planted partition model is just an Erdös-Renyi model, but if $p \gg q$ then a typical graph will have two well-defined clusters. Actually, the literature on the min-bisection problem usually assumes that the two classes have exactly the same size (instead of a random size), but this modification makes almost no difference in the context of this work.
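To make Definition 1.1 concrete, here is a minimal sampling sketch. The function name `sample_planted_bisection` and the encoding of labels as `+1`/`-1` are illustrative assumptions, not part of the paper; the quadratic loop over pairs is chosen for clarity rather than speed.

```python
import random
from itertools import combinations


def sample_planted_bisection(n, p, q, seed=None):
    """Sample (labels, edges) from the planted bisection model.

    Each vertex gets an independent uniform label in {+1, -1}. Each pair
    {u, v} becomes an edge with probability p if the labels agree and
    with probability q if they differ.
    """
    rng = random.Random(seed)
    labels = [rng.choice((1, -1)) for _ in range(n)]
    edges = []
    for u, v in combinations(range(n), 2):
        prob = p if labels[u] == labels[v] else q
        if rng.random() < prob:
            edges.append((u, v))
    return labels, edges
```

In the sparse regime discussed later one would call this with `p = a / n` and `q = b / n` for fixed constants `a` and `b`.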

The planted bisection model was not the earliest model to be studied in the context of min-bisection – Bui et al. [5] and Boppana [4] considered graphs chosen uniformly at random from all graphs with a given number of edges and a small minimum bisection. Dyer and Frieze [10] were the first to study the min-bisection problem on the planted bisection model; they showed that if are fixed as then the minimum bisection is the one that separates the two classes, and it can be found in expected time.

The result of Dyer and Frieze was improved by Jerrum and Sorkin [18], who reduced the running time to and allowed to shrink at the rate . More interesting than these improvements, however, was the fact that Jerrum and Sorkin’s analysis applied to the popular and fast-in-practice Metropolis algorithm. Later, Condon and Karp [7] gave better theoretical guarantees with a linear-time algorithm that works for .

With the exception of Boppana’s work (which was for a different model), the aforementioned results applied only to relatively dense graphs. McSherry [22] showed that a spectral clustering algorithm works as long as . In particular, his result is meaningful for graphs whose average degree is as low as . These are essentially the sparsest possible graphs for which the minimum cut will agree with the planted bisection, but Coja-Oghlan [6] managed to obtain a result for even sparser graphs by studying a relaxed problem. Instead of trying to recover the minimum bisection, he showed that a spectral algorithm will find a bisection which is positively correlated with the planted bisection. His result applies as long as , and so it is applicable even to graphs with a constant average degree.

1.2 Block Models in Statistics

The statistical literature on clustering is more closely focused on real-world network data with the planted bisection model (or “stochastic blockmodel,” as it is known in the statistics community) used as an important test-case for theoretical results. Its study goes back to Holland et al. [17], who discussed parameter estimation and gave a Bayesian method for finding a good bisection, without theoretical guarantees. Snijders and Nowicki [32] studied several different statistical methods – including maximum likelihood estimation and the EM algorithm – for the planted bisection model with . They then applied those methods to social networks data. More recently, Bickel and Chen [1] showed that maximizing the Girvan-Newman modularity – a popular measure of cluster strength – recovers the correct bisection, for the same range of parameters as the result of McSherry. They also demonstrated that their methods perform well on social and telephone network data. Spectral clustering, the method studied by Boppana and McSherry, has also appeared in the statistics literature: Rohe et al. [29] gave a theoretical analysis of spectral clustering under the planted bisection model and also applied the method to data from Facebook.

1.3 Sparse graphs and insights from statistical physics

The case of sparse graphs with constant average degree is well motivated from the perspective of real networks. Indeed, Leskovec et al. [21] collected and studied a vast collection of large network datasets, ranging from social networks like LinkedIn and MSN Messenger, to collaboration networks in movies and on the arXiv, to biological networks in yeast. Many of these networks had millions of nodes, but most had an average degree of no more than 20; for instance, the LinkedIn network they studied had approximately seven million nodes, but only 30 million edges. Similarly, the real-world networks considered by Strogatz [34] – which include coauthorship networks, power transmission networks and web link networks – also had small average degrees. Thus it is natural to consider the planted partition model with parameters $p$ and $q$ of order $O(1/n)$.

Although sparse graphs are natural for modelling many large networks, the planted partition model seems to be most difficult to analyze in the sparse setting. Despite the large amount of work studying this model, the only results we know of that apply in the sparse case are those of Coja-Oghlan. Recently, Decelle et al. [8] made some fascinating conjectures for the cluster identification problem in the sparse planted partition model. In what follows, we will set $p = a/n$ and $q = b/n$ for some fixed $a > b > 0$.

Conjecture 1.2.

If $(a-b)^2 > 2(a+b)$ then the clustering problem in $\mathcal{G}(n, \frac{a}{n}, \frac{b}{n})$ is solvable as $n \to \infty$, in the sense that one can a.a.s. find a bisection which is positively correlated with the planted bisection.

To put Coja-Oghlan’s work into the context of this conjecture, he showed that if $(a-b)^2 > C(a+b)$ for a large enough constant $C$, then the spectral method solves the clustering problem. Decelle et al.’s work is based on deep but non-rigorous ideas from statistical physics. In order to identify the best bisection, they use the sum-product algorithm (also known as belief propagation). Using the cavity method, they argue that the algorithm should work, a claim that is bolstered by compelling simulation results.

What makes Conjecture 1.2 even more interesting is the fact that it might represent a threshold for the solvability of the clustering problem.

Conjecture 1.3.

If $(a-b)^2 < 2(a+b)$ then the clustering problem in $\mathcal{G}(n, \frac{a}{n}, \frac{b}{n})$ is not solvable as $n \to \infty$.

This second conjecture is based on a connection with the tree reconstruction problem (see [23] for a survey). Consider a multi-type branching process where there are two types of particles named $+$ and $-$. Each particle gives birth to $\mathrm{Pois}(a/2)$ (i.e. a Poisson distribution with mean $a/2$) particles of the same type and $\mathrm{Pois}(b/2)$ particles of the complementary type. In the tree reconstruction problem, the goal is to recover the label of the root of the tree from the labels of level $r$ where $r \to \infty$. This problem goes back to Kesten and Stigum [20] in the 1960s, who showed that if $(a-b)^2 > 2(a+b)$ then it is possible to recover the root value with non-trivial probability. The converse was not resolved until 2000, when Evans, Kenyon, Peres and Schulman [11] proved that if $(a-b)^2 \le 2(a+b)$ then it is impossible to recover the root with probability bounded above $1/2$ independent of $r$. This is equivalent to the reconstruction or extremality threshold for the Ising model on a branching process.

At the intuitive level, the connection between clustering and tree reconstruction follows from the fact that the neighborhood of a vertex in $\mathcal{G}(n, \frac{a}{n}, \frac{b}{n})$ should look like a random labelled tree with high probability. Moreover, the distribution of that labelled tree should converge as $n \to \infty$ to the multi-type branching process defined above. We will make this connection formal later.

Decelle et al. also made a conjecture related to the parameter estimation problem that was previously studied extensively in the statistics literature. Here the problem is to identify the parameters $a$ and $b$. Again, they provided an algorithm based on belief propagation and they used physical ideas to argue that there is a threshold above which the parameters can be estimated, and below which they cannot.

Conjecture 1.4.

If $(a-b)^2 > 2(a+b)$ then there is a consistent estimator for $a$ and $b$ under $\mathcal{G}(n, \frac{a}{n}, \frac{b}{n})$. Conversely, if $(a-b)^2 < 2(a+b)$ then there is no consistent estimator.

2 Our results

Our main contribution is to establish Conjectures 1.3 and 1.4.

Theorem 2.1.

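If $a + b > 2$ and $(a-b)^2 \le 2(a+b)$ then, for any fixed vertices $u$ and $v$,

$$\mathbb{P}_n(\sigma_u = + \mid G, \sigma_v = +) \to \frac{1}{2} \quad \text{a.a.s.}$$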

Remark 2.2.

Theorem 2.1 is stronger than Conjecture 1.3 because it says that an even easier problem cannot be solved: if we take two random vertices of $G$, Theorem 2.1 says that no algorithm can tell whether or not they have the same label. This is an easier task than finding a bisection, because finding a bisection is equivalent to labeling all the vertices; we are only asking whether two of them have the same label or not. Theorem 2.1 is also stronger than the conjecture because it includes the case $(a-b)^2 = 2(a+b)$, for which Decelle et al. did not conjecture any particular behavior.

Remark 2.3.

Note that the assumption $a + b > 2$ is there to ensure that $G$ has a giant component, without which the clustering problem is clearly not solvable.

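To prove Conjecture 1.4, we compare the planted partition model to an appropriate Erdös-Renyi model: let $\mathbb{P}_n = \mathcal{G}(n, \frac{a}{n}, \frac{b}{n})$ and take $\mathbb{P}'_n = \mathcal{G}(n, \frac{a+b}{2n})$ to be the Erdös-Renyi model that has the same average degree as $\mathbb{P}_n$.

Theorem 2.4.

If $(a-b)^2 < 2(a+b)$ then $\mathbb{P}_n$ and $\mathbb{P}'_n$ are mutually contiguous, i.e., for a sequence of events $A_n$, $\mathbb{P}_n(A_n) \to 0$ if, and only if, $\mathbb{P}'_n(A_n) \to 0$.

Moreover, if $(a-b)^2 < 2(a+b)$ then there is no consistent estimator for $a$ and $b$.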

Note that the second part of Theorem 2.4 follows from the first part, since it implies that and are contiguous as long as . Indeed one cannot even consistently distinguish the planted partition model from the corresponding Erdös-Renyi model!

The other half of Conjecture 1.4 follows from a converse to Theorem 2.4:

Theorem 2.5.

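If $(a-b)^2 > 2(a+b)$, then $\mathbb{P}_n$ and $\mathbb{P}'_n$ are asymptotically orthogonal. Moreover, a consistent estimator for $a, b$ can be obtained as follows: let $X_k$ be the number of cycles of length $k$, and define

$$\hat{d}_n = \frac{2|E|}{n}, \qquad \hat{f}_n = \left(2 k_n X_{k_n} - \hat{d}_n^{\,k_n}\right)^{1/k_n},$$

where $k_n = \lfloor \log^{1/4} n \rfloor$. Then $\hat{d}_n + \hat{f}_n$ is a consistent estimator for $a$ and $\hat{d}_n - \hat{f}_n$ is a consistent estimator for $b$.

Finally, there is an efficient algorithm whose running time is polynomial in $n$ to calculate $\hat{d}_n$ and $\hat{f}_n$.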
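The estimator of Theorem 2.5 can be sketched in a few lines. The sketch assumes the estimator has the form $\hat{d} = 2|E|/n$ and $\hat{f} = (2k X_k - \hat{d}^k)^{1/k}$, which is what one gets by inverting the Poisson mean $\frac{1}{2k}\left(\left(\frac{a+b}{2}\right)^k + \left(\frac{a-b}{2}\right)^k\right)$ of the $k$-cycle count; the function name `estimate_a_b` and the clamping of a negative radicand to zero are illustrative choices, not taken from the paper.

```python
def estimate_a_b(n, num_edges, num_k_cycles, k):
    """Plug-in estimates of (a, b) from the edge count and the k-cycle count.

    d_hat estimates (a + b) / 2 via the average degree, and f_hat estimates
    (a - b) / 2 by inverting the assumed relation 2k * X_k ~ d^k + f^k.
    """
    d_hat = 2.0 * num_edges / n
    radicand = 2.0 * k * num_k_cycles - d_hat ** k
    # Sampling noise can push the radicand below zero; clamp it (assumption).
    f_hat = max(radicand, 0.0) ** (1.0 / k)
    return d_hat + f_hat, d_hat - f_hat
```

For instance, feeding in the idealized values for $a = 5$, $b = 1$ (so $|E| = 3n/2$ and $2k X_k = 3^k + 2^k$) recovers the pair $(5, 1)$ up to floating-point error.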

2.1 Proof Techniques

2.1.1 Short Cycles

To establish Theorem 2.5 we count the number of short cycles in $G \sim \mathbb{P}_n$. It is well-known that the number of $k$-cycles in a graph drawn from $\mathbb{P}'_n$ is approximately Poisson-distributed with mean $\frac{1}{2k}\left(\frac{a+b}{2}\right)^k$. Modifying the proof of this result, we will show that the number of $k$-cycles in $\mathbb{P}_n$ is approximately Poisson-distributed with mean $\frac{1}{2k}\left(\left(\frac{a+b}{2}\right)^k + \left(\frac{a-b}{2}\right)^k\right)$.

By comparing the first and second moments of Poisson random variables and taking $k$ to increase slowly with $n$, one can distinguish between the cycle counts of $G \sim \mathbb{P}_n$ and $G \sim \mathbb{P}'_n$ as long as $(a-b)^2 > 2(a+b)$.

The first half of Conjecture 1.4 follows because the same comparison of first and second moments implies that counting cycles gives a consistent estimator for $a+b$ and $a-b$ (and hence also for $a$ and $b$).

While there is in general no efficient algorithm for counting cycles in graphs, we show that with high probability the number of short cycles coincides with the number of non-backtracking walks of the same length which can be computed efficiently using matrix multiplication.

The proof of Theorem 2.5 is carried out in Section 3.

2.1.2 Non-Reconstruction

As mentioned earlier, Theorem 2.1 intuitively follows from the fact that the neighborhood of a vertex in $\mathcal{G}(n, \frac{a}{n}, \frac{b}{n})$ should look like a random labelled tree with high probability and the distribution of that labelled tree should converge as $n \to \infty$ to the multi-type branching process defined above. While this intuition is not too hard to justify for small neighborhoods (by proving there are no short cycles etc.) the global ramifications are more challenging to establish. This is because, conditioned on the graph structure, the model is neither an Ising model nor a Markov random field! This is due to two effects:

  • The fact that the two clusters are of the same (approximate) size. This amounts to a global conditioning on the number of $+$’s.

  • The model is not even a Markov random field conditioned on the number of $+$ and $-$ vertices. This follows from the fact that for every two vertices $u, v$ that do not form an edge, there is a different weight for $\sigma_u = \sigma_v$ and $\sigma_u \ne \sigma_v$. In other words, if $a > b$, then there is a slight repulsion (anti-ferromagnetic interaction) between vertices not joined by an edge.

In Section 4, we prove Theorem 2.1 by showing how to overcome the challenges above.

2.1.3 The Second Moment

A major effort is devoted to the proof of Theorem 2.4. In the proof we show that the random variables don’t have much mass near 0 or $\infty$. Since the margin of is somewhat complicated to work with, the first step is to enrich the distribution by adding random labels. Then we show that the random variables don’t have mass near 0 or $\infty$. Our proof is one of the most elegant applications of the second moment method in the context of statistical physics models. We derive an extremely explicit formula for the second moment of in Lemma 5.4. In particular we show that

This already shows that the second moment is bounded off . However, in order to establish the existence of a density, we also need to show that is bounded away from zero asymptotically. In order to establish this, we utilize the small graph conditioning method by calculating joint moments of the number of cycles and . It is quite surprising that this calculation can be carried out in a rather elegant manner.

3 Counting cycles

The main result of this section is that the number of $k$-cycles of $G \sim \mathbb{P}_n$ is approximately Poisson-distributed. We will then use this fact to show the first part of Theorem 2.5. Actually, Theorem 2.5 only requires us to calculate the first two moments of the number of $k$-cycles, but the rest of the moments require essentially no extra work, so we include them for completeness.

Theorem 3.1.

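Let $X_{k,n}$ be the number of $k$-cycles of $G$, where $G \sim \mathbb{P}_n$. If $k = O(\log^{1/4} n)$ then

$$X_{k,n} \stackrel{d}{\to} \mathrm{Pois}\left(\frac{1}{k 2^{k+1}}\left((a+b)^k + (a-b)^k\right)\right).$$

Before we prove this, let us explain how it implies Theorem 2.5. From now on, we will write $X_k$ instead of $X_{k,n}$.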

Proof of Theorem 2.5.

We start by proving the first statement of the theorem. Let’s recall the standard fact (which we have mentioned before) that under , . With this and Theorem 3.1 in mind,

Set (although any sufficiently slowly increasing function of would do). Choose such that . Then and are both as . By Chebyshev’s inequality, -a.a.s. and -a.a.s. Since , it follows that for large enough . And so, if we set then and .

We next show that Theorem 3.1 gives us an estimator for and that is consistent when . First of all, we have a consistent estimator for by simply counting the number of edges. Thus, if we can estimate consistently then we can do the same for and . Our estimator for is

where is some estimator with -a.a.s. and increases to infinity slowly enough so that and -a.a.s. Take ; by Chebyshev’s inequality, -a.a.s. Since , . Thus, -a.a.s. Since and , -a.a.s. and so is a consistent estimator for . Finally we take and . ∎

Proposition 3.2.

Let . There is an algorithm whose running time is for calculating and .

Proof.

Recall and from the proof of Theorem 2.5. Clearly, we can compute in time which is linear in the number of edges. Thus, we need to show how to find in time . It is easy to see that with high probability, each neighborhood of radius contains at most one cycle. Thus, the number of cycles of length is the same as , where is the number of non-backtracking walks of length that start and end at .

To calculate , let be the radius ball around in . Let be a diagonal matrix such that for each vertex , the diagonal entry corresponding to is the degree of in . Let be the adjacency matrix of . It is easy to see that w.h.p. for each , can be generated in time. Now define and . Then it is easy to see that the entry of is the number of non-backtracking walks from to of length . The proof follows. ∎
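The matrix computation in this proof can be illustrated with the standard recursion for non-backtracking walk counts: $A^{(1)} = A$, $A^{(2)} = A^2 - D$, and $A^{(s)} = A^{(s-1)} A - A^{(s-2)}(D - I)$ for $s \ge 3$. That this is the recursion intended here is an assumption, since the definitions are not legible in the text; the function names below are hypothetical, and the sketch works on a whole small graph rather than on local balls.

```python
def matmul(X, Y):
    """Multiply two square matrices given as lists of lists."""
    n = len(X)
    return [[sum(X[i][t] * Y[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]


def nb_walk_matrix(adj, k):
    """Entry (u, v) counts walks of length k from u to v that never
    traverse an edge and then immediately traverse it back.

    adj is a symmetric 0/1 adjacency matrix with zero diagonal.
    """
    n = len(adj)
    deg = [sum(row) for row in adj]
    prev = [row[:] for row in adj]                     # walks of length 1
    if k == 1:
        return prev
    sq = matmul(adj, adj)
    cur = [[sq[i][j] - (deg[i] if i == j else 0) for j in range(n)]
           for i in range(n)]                          # walks of length 2
    for _ in range(3, k + 1):
        step = matmul(cur, adj)
        # Remove extensions whose last step reverses the step before it.
        nxt = [[step[i][j] - prev[i][j] * (deg[j] - 1) for j in range(n)]
               for i in range(n)]
        prev, cur = cur, nxt
    return cur


def closed_nb_walks(adj, k):
    """Total number of such walks of length k that start and end at the same vertex."""
    M = nb_walk_matrix(adj, k)
    return sum(M[v][v] for v in range(len(adj)))
```

When every closed walk counted this way simply traverses a $k$-cycle, each cycle is counted $2k$ times (once per starting vertex and direction), so dividing the total by $2k$ gives the cycle count. The recursion only forbids reversals at interior steps, so this identification relies on the local structure described in the proof.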

Now we will prove Theorem 3.1 using the method of moments. Recall, therefore, that if then , where denotes the falling factorial . It will therefore be our goal to show that . It turns out that this follows almost entirely from the corresponding proof for the Erdös-Renyi model. The only additional work we need to do is in the case .

Lemma 3.3.

If then

Proof.

Let be distinct vertices. Let be the indicator that is a cycle in . Then , so let us compute . Define to be the number of times in the cycle that (with addition taken modulo ). Then

On the other hand, we can easily compute : for each , there is probability to have , and these events are mutually independent. But whether is completely determined by the other events since there must be an even number of such that . Thus,

for even , and zero for odd . Hence,

The second part of the claim amounts to saying that , which is trivial when . ∎

Proof of Theorem 3.1.

Let ; our goal, as discussed before Lemma 3.3, is to show that . Note that is the number of ordered -tuples of -cycles in . We will divide these -tuples into two sets: is the set of -tuples for which all of the -cycles are disjoint, while is the set of -tuples in which at least one pair of cycles is not disjoint.

Now, take . Since the are disjoint, they appear independently in . By the proof of Lemma 3.3, the probability that cycles are all present is

Since there are elements of , it follows that the expected number of vertex-disjoint -tuples of -cycles is

It remains to show, therefore, that the expected number of non-vertex-disjoint -tuples converges to zero. Let be the number of non-vertex-disjoint -tuples,

Then the distribution of under is stochastically dominated by the distribution of under the Erdös-Renyi model . It’s well-known (see, e.g. [3], Chapter 4) that as long as , under for any ; hence under also. ∎

4 Non-reconstruction

The goal of this section is to prove Theorem 2.1. As we said in the introduction, the proof of Theorem 2.1 uses a connection between and Markov processes on trees. Before we go any further, therefore, we should define a Markov process on a tree and state the result that we will use.

Let be an infinite rooted tree with root . Given a number , we will define a random labelling . First, we draw uniformly in . Then, conditionally independently given , we take every child of and set with probability and otherwise. We can continue this construction recursively to obtain a labelling for which every vertex, independently, has probability of having the same label as its parent.
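The recursive labelling just described can be simulated on a finite tree. In this sketch the parameter `eps` is assumed to be the probability that a child's label differs from its parent's, the tree is given as a dictionary mapping each vertex to a list of its children, and the name `broadcast_labels` is hypothetical.

```python
import random


def broadcast_labels(children, root, eps, seed=None):
    """Label a rooted tree by broadcasting from the root.

    The root label is uniform in {+1, -1}. Each child independently copies
    its parent's label, except that with probability eps it takes the
    opposite label (eps as flip probability is an assumption).
    """
    rng = random.Random(seed)
    labels = {root: rng.choice((1, -1))}
    stack = [root]
    while stack:
        parent = stack.pop()
        for child in children.get(parent, ()):
            flipped = rng.random() < eps
            labels[child] = -labels[parent] if flipped else labels[parent]
            stack.append(child)
    return labels
```

With `eps = 0` every vertex inherits the root's label, and with `eps = 0.5` the labels are independent of the root; the reconstruction question concerns the regime in between.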

Back in 1966, Kesten and Stigum [20] asked (although they used somewhat different terminology) whether the label of could be deduced from the labels of vertices at level of the tree (where is very large). There are many equivalent ways of stating the question. The interested reader should see the survey [23], because we will only mention two of them.

Let and define . We will write for the configuration restricted to .

Theorem 4.1.

Suppose is a Galton-Watson tree where the offspring distribution has mean . Then

if, and only if .

In particular, if then contains no information about . Theorem 4.1 was established by several authors over the course of more than 30 years. The non-reconstruction regime (i.e. the case ) is the harder one, and that part of Theorem 4.1 was first proved for -ary trees in [2], and for Galton-Watson trees in [11]. This latter work actually proves the result for more general trees in terms of their branching number.

We will be interested in trees whose offspring distribution is and we will take . Some simple arithmetic applied to Theorem 4.1 then shows that reconstruction of the root’s label is impossible whenever . Not coincidentally, this is the same threshold that appears in Theorem 2.1.
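The "simple arithmetic" can be spelled out as follows. This is a sketch under stated assumptions: the offspring distribution is taken to be Poisson with mean $d = (a+b)/2$, the flip probability is taken to be $\epsilon = b/(a+b)$ (matching the branching process of Section 1.3, where same-type and opposite-type children have means $a/2$ and $b/2$), and the condition in Theorem 4.1 is taken to be the Kesten-Stigum bound $d(1-2\epsilon)^2 \le 1$.

```latex
\[
  d = \frac{a+b}{2}, \qquad \epsilon = \frac{b}{a+b}
  \quad\Longrightarrow\quad
  1 - 2\epsilon = \frac{a-b}{a+b},
\]
\[
  d\,(1-2\epsilon)^2
  = \frac{a+b}{2}\cdot\frac{(a-b)^2}{(a+b)^2}
  = \frac{(a-b)^2}{2(a+b)},
\]
\[
  d\,(1-2\epsilon)^2 \le 1
  \quad\Longleftrightarrow\quad
  (a-b)^2 \le 2(a+b).
\]
```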

4.1 Coupling of balls in to the broadcast process on trees

The first step in applying Theorem 4.1 to our problem is to observe that a neighborhood of looks like . Indeed, fix and let be the induced subgraph on .

Proposition 4.2.

Let . There exists a coupling between and such that a.a.s.

For the rest of this section, we will take .

The proof of this proposition essentially follows from the fact that can be constructed from a sequence of independent Poisson variables, while can be constructed from a sequence of binomial variables, with approximately the same means.

For a vertex , let be the number of children of ; let be the number of children whose label is and let . By Poisson thinning, , and they are independent. Note that can be entirely reconstructed from the label of the root and the two sequences , .

We can almost do the same thing for , but it is a little more complicated. We will write and . For every subset , denote by and the subsets of that have the corresponding label. For example, . For a vertex , let be the number of neighbors that has in ; then let be the number of those neighbors whose label is and set . Then , and they are independent. Note, however, that they do not contain enough information to reconstruct : it’s possible to have which share a child in , but this cannot be determined from and . Fortunately, such events are very rare and so we can exclude them. In fact, this process of carefully excluding bad events is all that needs to be done to prove Proposition 4.2.

In order that we can exclude their complements, let us give names to all of our good events. For any , let be the event that no vertex in has more than one neighbor in . Let be the event that there are no edges within . Clearly, if and hold for all then is a tree. In fact, it’s easy to see that and are the only events that prevent from determining .

Lemma 4.3.

If

  1. ;

  2. and for every ; and

  3. and hold

then .

Proof.

The proof is essentially obvious from the construction of and , but we will be pedantic about it anyway. The statement means that there is some graph homomorphism such that . If and and then we can extend to while preserving the fact that for all . On the event , this extension can be made simultaneously for all , while the event ensures that this extension remains a homomorphism. Thus, we have constructed a label-preserving homomorphism from to , which is the same as saying that these two labelled graphs are equal.

From now on, we will not mention homomorphisms; we will just identify with . ∎

In order to complete our coupling, we need to identify one more kind of good event. Let be the event

The events are useful because they guarantee that is large enough for the desired binomial-Poisson approximation to hold. The utility of is demonstrated by the next two lemmas.

Lemma 4.4.

For all ,

Moreover, on .

Lemma 4.5.

For any ,

Proof of Lemma 4.4.

First of all, is stochastically dominated by for any . On , and so is stochastically dominated by

Thus,

by a multiplicative version of Chernoff’s inequality. But

which proves the first part of the lemma.

For the second part, on

Proof of Lemma 4.5.

For the first claim, fix . For any , the probability that and both appear is . Now, and Lemma 4.4 implies that . Hence the result follows from a union bound over all triples .

For the second part, the probability of having an edge between any particular is . Lemma 4.4 implies that and so the result follows from a union bound over all pairs . ∎

The final ingredient we need is a bound on the total variation distance between binomial and Poisson random variables.

Lemma 4.6.

If and are positive integers then

Proof.

Assume that , or else the result is trivial. A classical result of Hodges and Le Cam [16] shows that

With the triangle inequality in mind, we need only show that is close to . This follows from a direct computation: if then is just

Now the first term is and we can bound by the mean value theorem. Thus,

The claim follows from setting and . ∎
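As a numerical illustration of how close binomial and Poisson laws are in the sparse regime, the total variation distance can be computed directly. This is not the bound of Lemma 4.6, whose exact statement is not reproduced here; the function names and the truncation parameter `extra_terms` are assumptions made for the sketch.

```python
import math


def binomial_pmf(n, p, j):
    """P(Binom(n, p) = j) for 0 < p < 1, computed in log space."""
    if j < 0 or j > n:
        return 0.0
    log_pmf = (math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
               + j * math.log(p) + (n - j) * math.log1p(-p))
    return math.exp(log_pmf)


def tv_binomial_poisson(n, p, lam, extra_terms=60):
    """Total variation distance between Binom(n, p) and Pois(lam).

    The sum runs over j = 0, ..., n + extra_terms; the Poisson mass beyond
    that range is ignored, which is negligible when lam is much smaller
    than n.
    """
    total = 0.0
    poisson = math.exp(-lam)          # Poisson pmf at j = 0
    for j in range(n + extra_terms + 1):
        total += abs(binomial_pmf(n, p, j) - poisson)
        poisson *= lam / (j + 1)      # advance to the pmf at j + 1
    return 0.5 * total
```

For example, `tv_binomial_poisson(n, c / n, c)` shrinks as `n` grows with `c` fixed, in line with the classical bound $n p^2$ on the distance between $\mathrm{Binom}(n, p)$ and $\mathrm{Pois}(np)$.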

Finally, we are ready to prove Proposition 4.2.

Proof of Proposition 4.2.

Let be the event that . By Hoeffding’s inequality, exponentially fast.

Fix and suppose that and hold, and that . Then for each , is distributed as . Now,

and so Lemma 4.6 implies that we can couple with such that (and similarly for and ). Since by Lemma 4.4, the union bound implies that we can find a coupling such that with probability at least , and for every . Moreover, Lemmas 4.4 and 4.5 imply and hold simultaneously with probability at least . Putting these all together, we see that the hypothesis of Lemma 4.3 holds with probability at least . Thus,

But and we can certainly couple with . Therefore, with a union bound over , we see that a.a.s. ∎

4.2 No long range correlations in

We have shown that a neighborhood in looks like a Galton-Watson tree with a Markov process on it. In this section, we will apply this fact to prove Theorem 2.1. In the statement of Theorem 2.1, we claimed that , but this is clearly equivalent to . This latter statement is the one that we will prove, because the conditional variance has a nice monotonicity property.

The idea behind the proof of Theorem 2.1 is to condition on the labels of , which can only make reconstruction easier. Then we can remove the conditioning on , because gives much more information anyway. Since Theorem 4.1 and Proposition 4.2 imply that cannot be reconstructed from , we conclude that it cannot be reconstructed from either.

The goal of this section is to prove that once we have conditioned on , we can remove the conditioning on . If were distributed according to a Markov random field, this would be trivial because conditioning on would make and independent. For our model, unfortunately, there are weak long-range interactions. However, these interactions are sufficiently weak that we can get an asymptotic independence result for separated sets as long as one of them takes up most of the graph.

In what follows, we say that a.a.s. if for every , as , and we say that a.a.s. if

Lemma 4.7.

Let be a (random) partition of such that separates and in . If for a.a.e. 

for a.a.e.  and .

Note that Lemma 4.7 is only true for a.a.e. . In particular, the lemma does not hold for that are very unbalanced (eg. ).

Proof.

As in the analogous proof for a Markov random field, we factorize into parts depending on , and . We then show that the part which measures the interaction between and is negligible. The rest of the proof is then quite similar to the Markov random fields case.

Define