Hierarchical community detection by recursive partitioning
Abstract
The problem of community detection in networks is usually formulated as finding a single partition of the network into some “correct” number of communities. We argue that it is both more interpretable and in some regimes more accurate to construct a hierarchical tree of communities instead. This can be done with a simple top-down recursive partitioning algorithm, starting with a single community and repeatedly separating the nodes into two communities by spectral clustering, until a stopping rule suggests there are no further communities. This class of algorithms is model-free, computationally efficient, and requires no tuning other than selecting a stopping rule. We show that there are regimes where this approach outperforms K-way spectral clustering, and propose a natural framework for analyzing the algorithm’s theoretical performance, the binary tree stochastic block model. Under this model, we prove that the algorithm correctly recovers the entire community tree under relatively mild assumptions. We also apply the algorithm to a dataset of statistics papers to construct a hierarchical tree of statistical research communities.
1 Introduction
Data collected in the form of networks have become increasingly common in many fields, with interesting scientific phenomena discovered through the analysis of biological, social, ecological, and various other networks; see Newman (2010) for a review. Among various network analysis tasks, community detection has been one of the most studied, due to the ubiquity of communities in different types of networks and the appealing mathematical formulations that lend themselves to analysis; see for example reviews by Fortunato (2010), Goldenberg et al. (2010), and Abbe (2018). Community detection is the task of clustering network nodes into groups with similar connection patterns, and in many applications, community structure provides a useful and parsimonious representation of the network. There are many statistical models for networks with communities, including the stochastic block model (Holland et al., 1983) and its many variants and extensions, such as, for example, Handcock et al. (2007); Hoff (2008); Airoldi et al. (2008); Karrer and Newman (2011); Xu and Hero (2013); Zhang et al. (2014); Matias and Miele (2017). One large class of methods focuses on fitting such models based on their likelihoods or approximations to them (Bickel and Chen, 2009; Mariadassou et al., 2010; Celisse et al., 2012; Bickel et al., 2013; Amini et al., 2013); another class of methods takes an algorithmic approach, designing algorithms, often based on spectral clustering, that can sometimes be proven to work well under specific models (Newman and Girvan, 2004; Newman, 2006; Rohe et al., 2011; Bickel et al., 2011; Zhao et al., 2012; Chen et al., 2012; Lei and Rinaldo, 2014; Cai and Li, 2015; Chen and Xu, 2014; Amini and Levina, 2018; Joseph and Yu, 2016; Gao et al., 2017; Le et al., 2017; Gao et al., 2016).
Most work on community detection to date has focused on finding a single K-way partition of the network into K groups, which are sometimes allowed to overlap. This frequently leads to a mathematical structure that allows for sophisticated analysis, but for large K these partitions tend to be unstable and not easily interpretable. These methods also typically require the “true” number of communities K as input. Although various methods have been proposed to estimate K (e.g. Chen and Lei, 2018; Chatterjee, 2015; Wang and Bickel, 2017; Le and Levina, 2015; Li et al., 2016), none of them have been extensively tested or studied for large K, and in our experience they perform poorly empirically when K is large. Finally, a single “true” number of communities may not always be scientifically meaningful, since in practice different community structures can often be observed at different scales.
Communities in real networks are often hierarchically structured, and the hierarchy can be scientifically meaningful, as in an evolutionary tree. A hierarchical tree of communities, with larger communities subdivided into smaller ones further down, offers a natural and very interpretable representation of communities. It also simplifies the task of estimating K, since, instead of estimating a large K from the entire network, we only need to decide whether to continue splitting at any given point in the tree, which only requires deciding whether that particular subnetwork still contains more than one community. We can also view it as a way to regularize an otherwise unwieldy model with a large number of communities, which in theory can approximate any exchangeable graph (Olhede and Wolfe, 2014), by imposing structural constraints on the parameters. We would expect that for large networks and large K, such regularization can lead to improvements in both computational costs and theoretical guarantees.
Not many hierarchical community detection algorithms are currently available, especially with theoretical guarantees. The earliest work we are aware of is Kleinberg (2002), generalized by Clauset et al. (2008) and Peel and Clauset (2015). These models directly incorporate a tree by modeling connection probabilities between pairs of nodes based on their relative distance in the tree. One line of work uses a Bayesian approach, treating the tree as a parameter, e.g., Clauset et al. (2008). Bayesian inference on these models is computationally prohibitive, and thus feasible only for small networks. A more computationally efficient approach of this type was recently proposed by Blundell and Teh (2013), but its computational complexity is still high for a network with n nodes. Even more importantly, treating each node as a leaf involves a large number of parameters and makes the model less interpretable.
Another line of work on hierarchical community detection has developed greedy algorithms based on recursive partitioning. The idea has appeared in machine learning problems such as graph partitioning and image segmentation (Spielman and Teng, 1996; Shi and Malik, 2000; Kannan et al., 2004). The first rigorous analysis we are aware of was given by Dasgupta et al. (2006), for a recursive bipartitioning algorithm based on a modified version of spectral clustering. Their analysis allows for sparse networks with average degree growing polylogarithmically in n, but their procedure involves multiple tuning parameters with no obvious default values. Later, Balakrishnan et al. (2011) considered a top-down hierarchical clustering algorithm based on the unnormalized graph Laplacian and the model of Clauset et al. (2008), applied to a pairwise similarity matrix instead of a network. They did not propose a practical stopping rule, but did provide a rigorous frequentist guarantee for clustering accuracy. However, as we will elaborate in Section 3, their analysis only works for dense networks, which are rare in practice. Lyzinski et al. (2017) proposed another hierarchical model based on a mixture of random dot product graph (RDPG) models (Young and Scheinerman, 2007). In contrast to Balakrishnan et al. (2011), they use a two-stage procedure which first detects all communities at the finest level and then applies agglomerative hierarchical clustering to build the hierarchy from the bottom up. They proved strong consistency of their algorithm, but it hinges on perfect recovery of all communities in the first stage, which leads to very strong requirements on network density.
In this paper, we consider a framework for hierarchical community detection based on recursive bipartitioning, which is algorithmically similar to Balakrishnan et al. (2011). The algorithm needs a partitioning method, which divides any given network into two, and a stopping rule, which decides whether a given network has at least two communities; in principle, any partitioning method and any stopping rule can be used. The algorithm starts by splitting the entire network into two and then tests each resulting leaf with the stopping rule, until there are no leaves left to split. We prove that the algorithm consistently recovers the entire hierarchy, including all low-level communities, under the binary tree stochastic block model (BTSBM), a hierarchical network model we propose in the spirit of Clauset et al. (2008). Our analysis applies to networks with average degree growing only polylogarithmically in n, while existing results either require the degree to be polynomial in n, or impose much stronger conditions for large K (e.g. in Dasgupta et al. (2006)). We also allow the number of communities K to grow with n, which is natural for a hierarchy, at a strictly faster rate than previous work, which for the most part treats K as fixed. Even when K is too big to satisfy the assumptions for recovering the entire hierarchy, we can still consistently recover mega-communities at the higher levels of the hierarchical tree. Notably, since the stopping rule only needs to decide whether K > 1 rather than estimate K exactly, we both have more options, such as hypothesis tests (Bickel and Sarkar, 2016; Gao and Lafferty, 2017), and can trust various methods of estimating K more: while they tend to underestimate K when it is large, they never underestimate it so severely as to conclude K = 1. Finally, our procedure has better computational complexity than K-way partitioning methods.
The rest of the paper is organized as follows. In Section 2, we present our recursive bipartitioning framework, a specific recursive algorithm, and discuss the interpretation of the resulting hierarchical structure. In Section 3, we introduce a special class of stochastic block models under which a hierarchy of communities can be naturally defined, and provide theoretical guarantees on recovering the hierarchy for that class of models. Section 4 presents extensive simulation studies demonstrating advantages of recursive bipartitioning for both community detection and estimating the hierarchy. Section 5 applies the proposed algorithm to a statistics citation network and obtains a readily interpretable hierarchical community structure. Section 6 concludes with discussion.
2 Community detection by recursive partitioning
2.1 Setup and notation
We assume an undirected network on n nodes 1, …, n. The corresponding symmetric n × n adjacency matrix A is defined by A_ij = 1 if and only if node i and node j are connected, and 0 otherwise. We use [n] to denote the integer set {1, …, n}. We write I for the identity matrix and 1 for the column vector of ones, suppressing the dependence on dimension when the context makes it clear. For any matrix M, we use ‖M‖ to denote its spectral norm (the largest singular value of M), and ‖M‖_F the Frobenius norm. Community detection will output a partition of the nodes into K sets V_k, k ∈ [K], with K typically unknown beforehand.
2.2 The recursive partitioning algorithm
In many network problems where a hierarchical relationship between communities is expected, estimating the hierarchy accurately is just as important as finding the final partition of the nodes. A natural framework for producing a hierarchy is recursive partitioning, an old idea in clustering that has not resurfaced much in the current statistical network analysis literature (e.g. Kannan et al., 2004; Dasgupta et al., 2006; Balakrishnan et al., 2011). The framework is general and can be used in combination with any community detection algorithm and model selection method; we will give a few options that worked very well in our experiments. In principle, the output can be any tree, but we focus on binary trees, as is commonly done in hierarchical clustering; we will sometimes refer to partitioning into two communities as bipartitioning.
Recursive bipartitioning does exactly what its name suggests:

1. Starting from the whole network, partition it into two communities.

2. Apply a decision / model selection rule to each of these communities to decide if they should be split further.

3. If the rule says to split, divide into two communities again, and continue until no further splits are indicated.
This is a topdown clustering procedure which produces a binary tree, but the leaves are small communities, not necessarily single nodes. Intuitively speaking, as one goes down the tree, the communities become closer, so the tree distance between communities reflects their level of connection.
Computationally, while we do have to partition multiple times, each community detection problem we have to solve is only for K = 2, which is faster, easier, and more stable than for a general K; moreover, the subnetworks shrink as we go down the tree, so each successive split becomes faster. When K is large and connectivity levels between different communities are heterogeneous, we expect recursive partitioning to outperform K-way clustering, which does best when K is small and everything is balanced.
We call this approach hierarchical community detection (HCD). As input, it takes a network adjacency matrix A; a partitioning algorithm that takes an adjacency matrix as input, partitions it into two communities, and outputs the two induced submatrices; and a stopping rule S, where S(A) = 1 indicates that A contains more than one community and should be split further, and S(A) = 0 indicates there is no evidence that A has more than one community and we should stop. Its output is the community label vector and the hierarchical tree of communities. The algorithm clearly depends on the choice of the partitioning algorithm and the stopping rule; we describe a few specific options next.
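In code, the recursion can be sketched as follows. This is a minimal illustration only: the function names, the boolean-mask interface for the bipartitioner, and the binary-string labeling are our own choices for exposition, not notation from the paper.

```python
import numpy as np

def hcd(A, bipartition, stopping_rule, node=""):
    """Recursive bipartitioning (HCD) sketch.

    A             : adjacency matrix of the current (sub)network
    bipartition   : callable A -> boolean mask splitting nodes into two groups
    stopping_rule : callable A -> True if the network should be split further
    node          : binary string encoding the position in the tree

    Returns a dict mapping node indices (within A) to binary-string labels.
    """
    n = A.shape[0]
    if n < 2 or not stopping_rule(A):
        return {i: node for i in range(n)}           # leaf community
    mask = bipartition(A)                            # True -> right branch
    left, right = np.where(~mask)[0], np.where(mask)[0]
    if len(left) == 0 or len(right) == 0:            # degenerate split: stop
        return {i: node for i in range(n)}
    labels = {}
    sub_l = hcd(A[np.ix_(left, left)], bipartition, stopping_rule, node + "0")
    sub_r = hcd(A[np.ix_(right, right)], bipartition, stopping_rule, node + "1")
    for i, lab in sub_l.items():
        labels[left[i]] = lab
    for i, lab in sub_r.items():
        labels[right[i]] = lab
    return labels
```

Any of the partitioning methods and stopping rules discussed below can be plugged in for `bipartition` and `stopping_rule`.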
2.3 The choice of partitioning method and stopping rule
Perhaps the simplest partitioning algorithm is an eigenvector sign check, used in Balakrishnan et al. (2011); Gao et al. (2017); Le et al. (2017); Abbe et al. (2017):
Algorithm 1.
Given the adjacency matrix A, do the following:

1. Compute the eigenvector u corresponding to the second largest eigenvalue in magnitude of A.

2. Assign node i to community 1 if u_i ≥ 0 and to community 2 otherwise.

3. Return the label vector.
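A minimal numpy sketch of Algorithm 1, assuming a dense symmetric adjacency (or probability) matrix; the function name is ours:

```python
import numpy as np

def sign_split(A):
    """Algorithm 1 sketch: split by the signs of the eigenvector
    associated with the second largest eigenvalue in magnitude of A."""
    vals, vecs = np.linalg.eigh(A)          # A is symmetric
    order = np.argsort(-np.abs(vals))       # sort by |eigenvalue|, descending
    u = vecs[:, order[1]]                   # second largest in magnitude
    return (u >= 0).astype(int) + 1         # labels in {1, 2}
```

On a population block matrix with two communities, the signs of this eigenvector separate the blocks exactly.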
A more general and effective partitioning method is regularized spectral clustering (RSC), especially for sparse networks. Several regularized versions are available; in this paper, we use the proposal of Amini et al. (2013), shown to improve performance of spectral clustering for sparse networks (Joseph and Yu, 2016; Le et al., 2017).
Algorithm 2.
Given the adjacency matrix A and a regularization parameter τ, do the following:

1. Compute the regularized adjacency matrix A_τ = A + τ (d̄/n) 11^T, where d̄ is the average degree of the network.

2. Let D_τ = diag(d̂_1, …, d̂_n), where d̂_i = Σ_j (A_τ)_{ij}, and calculate the regularized Laplacian L_τ = D_τ^{−1/2} A_τ D_τ^{−1/2}.

3. Compute the leading two eigenvectors of L_τ, arrange them in an n × 2 matrix Û, and apply the k-means algorithm to the rows of Û, with k = 2.

4. Return the cluster labels from the k-means result.
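A sketch of Algorithm 2, using the Amini et al. (2013) regularizer and a minimal Lloyd's k-means in place of a library routine. The default τ = 0.1 and all function names here are our illustrative choices, not necessarily the paper's:

```python
import numpy as np

def regularized_spectral_clustering(A, tau=0.1, n_iter=50):
    """Sketch of Algorithm 2 (regularized spectral clustering with k = 2).
    Regularizer: A_tau = A + tau * (dbar / n) * 11^T (Amini et al., 2013)."""
    n = A.shape[0]
    dbar = A.sum() / n                               # average degree
    A_tau = A + tau * dbar / n * np.ones((n, n))
    d = A_tau.sum(axis=1)
    L = A_tau / np.sqrt(np.outer(d, d))              # D^{-1/2} A_tau D^{-1/2}
    vals, vecs = np.linalg.eigh(L)
    U = vecs[:, np.argsort(-vals)[:2]]               # two leading eigenvectors
    # minimal Lloyd's k-means with k = 2 on the rows of U
    centers = U[[0, np.argmax(np.linalg.norm(U - U[0], axis=1))]]
    for _ in range(n_iter):
        labels = np.argmin(
            ((U[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for k in range(2):
            if (labels == k).any():
                centers[k] = U[labels == k].mean(axis=0)
    return labels + 1
```

Note that the regularizer keeps the Laplacian well defined even when the network is disconnected, which is exactly the sparse regime where plain spectral clustering struggles.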
The simplest stopping rule is to fix the depth of the tree in advance, though that is not what we will ultimately do. A number of recent papers have focused on estimating the number of communities K in a network, typically assuming that a “one community” building block of such a network is generated from either the Erdős–Rényi model or the configuration model (Van Der Hofstad, 2016). The methods proposed include directly estimating the rank by the USVT method of Chatterjee (2015), the hypothesis tests of Bickel and Sarkar (2016) and Gao and Lafferty (2017), the BIC criterion of Wang and Bickel (2017), the spectral methods of Le and Levina (2015), and the cross-validation methods of Chen and Lei (2018) and Li et al. (2016). The cross-validation method of Li et al. (2016) works for both unweighted and weighted networks under a low rank assumption, while the others use the block model assumption.
Under block models, we found empirically that the most accurate and computationally feasible stopping criterion is the non-backtracking method of Le and Levina (2015). Let B be the non-backtracking matrix, indexed by the directed edges of the network and defined by

(1)  B_{(u→v),(w→x)} = 1 if v = w and x ≠ u, and 0 otherwise.

Let λ_1 ≥ λ_2 ≥ ⋯ be the real parts of the eigenvalues of B (which may be complex). The number of communities is then estimated as the number of eigenvalues whose real part exceeds √λ_1. This is because, if the network is generated from an SBM with K communities, the K largest eigenvalues of B will be well separated from the bulk of the spectrum, which lies within a disc of radius ‖B‖^{1/2}, with high probability, at least in sparse networks (Krzakala et al., 2013; Le and Levina, 2015). We approximate ‖B‖^{1/2} by √λ_1, as suggested by Le and Levina (2015), and since we only need to decide whether K > 1, we only check the real parts of the two leading eigenvalues. If we want to avoid the block model assumption, the edge cross-validation (ECV) method of Li et al. (2016) can be used instead to check whether a rank-1 model is a good approximation to the subnetwork under consideration.
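The stopping rule can be sketched directly from this definition. The dense construction below costs O(m²) for m directed edges, fine for illustration but not for large networks; the function name and threshold convention (split when the second largest real part exceeds √λ_1) follow the description above:

```python
import numpy as np

def nonbacktracking_split(A):
    """Sketch of the non-backtracking stopping rule (Le and Levina, 2015):
    return True (split further) if the second largest real part of the
    eigenvalues of B exceeds sqrt(lambda_1)."""
    n = A.shape[0]
    edges = [(i, j) for i in range(n) for j in range(n) if A[i, j] and i != j]
    idx = {e: k for k, e in enumerate(edges)}
    B = np.zeros((len(edges), len(edges)))
    for (u, v), k in idx.items():
        for (w, x), l in idx.items():
            if v == w and x != u:        # (u->v) then (v->x), no backtracking
                B[k, l] = 1
    re = np.sort(np.real(np.linalg.eigvals(B)))[::-1]
    return re[1] > np.sqrt(re[0])
```

On a single clique the rule declares one community; on two cliques joined by a single edge it declares more than one and splits.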
The main benefit of using these estimators as stopping rules (i.e., just checking at every step whether the estimated K is greater than 1) is that the tree can be of any form; if we fixed K in advance, we would have to choose the order of splits so as to end up with exactly the chosen K. Moreover, we found empirically that the local stopping criterion is more accurate than directly estimating K, especially for larger K. For the rest of the paper, we will focus on two versions: “HCD-Sign”, which splits by eigenvector sign (Algorithm 1), and “HCD-Spec”, which splits by regularized spectral clustering (Algorithm 2). Any of the stopping rules discussed above can be used with either method.
2.4 Mega-communities and a similarity measure for binary trees
The final communities (leaves of the tree) as well as the intermediate mega-communities can be indexed by their position on the tree. Formally, each node or (mega-)community of the binary tree can be represented by a binary string x = x_1 x_2 ⋯ x_d, where d is the depth of the node (the root node has depth 0). The string records the path from the root to the node, with x_k = 1 if step k of the path follows the right branch of the split and x_k = 0 otherwise. We define the string for the root node to be the empty string. Intuitively, the tree induces a similarity measure between communities: two communities that are split further down the tree should be more similar to each other than two communities that are split higher up. The similarity between two mega-communities does not depend on how they are split further down the tree, which is a desirable feature. Note that we do not assume an underlying hierarchical community model; the tree is simply the output of the HCD algorithm.
To quantify this notion of tree similarity, we define the similarity s(x, y) between two nodes x and y of a binary tree as the length of the longest common prefix of their binary strings, that is, the depth of their most recent common ancestor. For instance, for the binary tree in Figure 1, two leaves that share a parent have higher similarity than two leaves that only share a grandparent. Note that comparing values of s is only meaningful for pairs with a common tree node: s(x, y) > s(x, z) indicates that community x is closer to community y than to z, but comparing similarities of pairs that share no node is not meaningful.
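With the binary-string encoding, this similarity can be computed as the common-prefix length of the two strings (the natural reading of the definition; the function name is ours):

```python
def tree_similarity(x, y):
    """Similarity between two tree nodes encoded as binary strings:
    the length of their longest common prefix, i.e. the depth of
    their most recent common ancestor."""
    s = 0
    for a, b in zip(x, y):
        if a != b:
            break
        s += 1
    return s
```

For example, sibling leaves "0100" and "0101" have similarity 3, while "0100" and "0010", which only share the root's left child, have similarity 1.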
A natural question is whether this tree structure and the associated similarity measure tell us anything about the underlying population model. Suppose that the network is in fact generated from an SBM with probability matrix P = E[A]. The matrix P is block-constant, and applying either HCD-Sign or HCD-Spec to P will recover the correct communities and produce a binary tree. This binary tree may not be unique, for example, for the planted partition model where all communities have equal sizes, all within-block edge probabilities are equal, and all between-block edge probabilities are equal. However, in many situations P does correspond to a unique binary tree (up to a permutation of labels), for example, under the model introduced in Section 3. For the moment, assume this is the case. Let c̄ and T̄ be the binary string community labels and the binary tree produced by applying the HCD algorithm to P in exactly the same way we previously applied it to A, and let ĉ and T̂ be the result of applying HCD to A. The estimated tree T̂ depends on the stopping rule and may be very different in size from T̄; however, we can always compute the tree-based similarity between nodes based on their labels. Let S̄ be the matrix of pairwise similarities based on T̄, and Ŝ the corresponding similarity matrix based on T̂. Then Ŝ can be viewed as an estimate of S̄, and we argue that comparing Ŝ to S̄ may give a more informative measure of performance than just comparing ĉ to c̄. This is because with a large K and weak signals it may be hard to estimate all the leaf-level communities correctly, but if the tree gets most of the mega-communities right, it is still a useful and largely correct representation of the network.
Finally, we note that an estimate of S̄ under the SBM can be obtained for any community detection method: if ĉ are estimated community labels, we can always estimate the corresponding probability matrix P̂ under the SBM and apply HCD to P̂ to obtain an estimated tree. However, our empirical results in Section 4 show that applying HCD directly to the adjacency matrix A gives a better estimate of S̄ than the tree constructed by post-processing the estimated probability matrix produced by a K-way partitioning method.
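The pairwise similarity matrix used in this comparison can be computed from any assignment of binary-string labels, however they were obtained (a sketch; the function names are ours):

```python
import numpy as np

def similarity_matrix(labels):
    """n x n matrix of pairwise tree similarities, where labels is a list of
    binary strings (one per node) and similarity is common-prefix length."""
    def prefix_len(x, y):
        s = 0
        for a, b in zip(x, y):
            if a != b:
                break
            s += 1
        return s
    n = len(labels)
    S = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(n):
            S[i, j] = prefix_len(labels[i], labels[j])
    return S
```

Two such matrices, one from the population tree and one from the estimated tree, can then be compared entrywise.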
3 Theoretical properties of the HCD algorithm
3.1 The binary tree stochastic block model
We now proceed to study the properties of HCD on a class of SBMs that naturally admits a binary tree community structure. We call this class the binary tree stochastic block model (BTSBM), formally defined in Definition 1 and illustrated in Figure 1.
Definition 1 (The binary tree stochastic block model (BTSBM)).
Let S_d = {0, 1}^d be the set of all binary sequences of length d and let K = 2^d. Each binary string in S_d encodes a community label and has a one-to-one mapping to an integer in [K] via the standard binary representation. For node i, let c(i) ∈ S_d be its community label, let G_x = {i : c(i) = x} be the set of nodes labeled with string x, and let n_x = |G_x|.

Let B ∈ [0, 1]^{K×K} be a matrix of probabilities defined by

B_{x,y} = p_{d − s(x, y)},

where p_0, p_1, …, p_d are arbitrary parameters in [0, 1] and s(x, y) is the similarity (common prefix length) defined in Section 2.4; in particular, the within-community probability is B_{x,x} = p_0.

Edges between all pairs of distinct nodes are independent Bernoulli, with P(A_ij = 1) = B_{c(i), c(j)}, corresponding to the block-constant n × n probability matrix.
For instance, the BTSBM in Figure 1 (with d = 3) corresponds to the matrix

B =
[ p_0 p_1 p_2 p_2 p_3 p_3 p_3 p_3
  p_1 p_0 p_2 p_2 p_3 p_3 p_3 p_3
  p_2 p_2 p_0 p_1 p_3 p_3 p_3 p_3
  p_2 p_2 p_1 p_0 p_3 p_3 p_3 p_3
  p_3 p_3 p_3 p_3 p_0 p_1 p_2 p_2
  p_3 p_3 p_3 p_3 p_1 p_0 p_2 p_2
  p_3 p_3 p_3 p_3 p_2 p_2 p_0 p_1
  p_3 p_3 p_3 p_3 p_2 p_2 p_1 p_0 ]
A nice consequence of defining community labels through binary strings is that they naturally embed the communities in a binary tree. We can think of each entry of the binary string as representing one level of the tree, with the first digit corresponding to the first split at the top of the tree, and so on. We then define a mega-community labeled by a binary string x, at any level of the tree, as the union of all communities G_y over strings y starting with x. The mega-communities are unique up to community label permutations, and give a multiscale view of the community structure; for example, Figure 1 shows four mega-communities in layer 3 and two mega-communities in layer 2.
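The definition can be instantiated directly. The sketch below builds the K × K matrix B (and the corresponding block-constant edge-probability matrix for equal community sizes) from parameters p_0, …, p_d, using the common-prefix similarity; the function names are ours:

```python
import numpy as np

def btsbm_prob_matrix(p, m):
    """Community matrix B and block-constant edge-probability matrix of a
    balanced BTSBM with parameters p = (p_0, ..., p_d) and m nodes per
    community, so K = 2^d and n = m * 2^d.
    B[x, y] = p_{d - s(x, y)}, where s is the common-prefix length."""
    d = len(p) - 1
    K = 2 ** d
    labels = [format(x, "0{}b".format(d)) for x in range(K)]  # binary strings
    def prefix(x, y):
        s = 0
        while s < d and x[s] == y[s]:
            s += 1
        return s
    B = np.array([[p[d - prefix(x, y)] for y in labels] for x in labels])
    P = np.kron(B, np.ones((m, m)))   # lift to node level (diagonal ignored)
    return B, P
```

Sampling a network then amounts to drawing independent Bernoulli edges from the off-diagonal entries of the lifted matrix.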
The idea of embedding connection probabilities in a tree, to the best of our knowledge, was first introduced as the hierarchical random graph (HRG) by Clauset et al. (2008), and extended by Balakrishnan et al. (2011) to weighted graphs and by Peel and Clauset (2015) to general dendrograms. The BTSBM can be viewed as a hybrid of the original HRG and the SBM, maintaining parsimony by estimating only communitylevel parameters while imposing a natural and interpretable hierarchical structure. It also provides us with a model that can be used to analyze recursive bipartitioning on sparse graphs.
3.2 The eigenstructure of the BTSBM
Let Z ∈ {0, 1}^{n×K} be the membership matrix whose i-th row is e_{c(i)}, the canonical basis vector in R^K indexed by the integer given by the binary representation of c(i). Then it is straightforward to show that

E[A] = Z B Z^T − p_0 I_n.

The second term comes from the fact that A_ii = 0 for all i. For the rest of the theoretical analysis, we assume equal block sizes, i.e.,

(2)  n_x = n/K = n/2^d for all x ∈ S_d.
This assumption is stringent but standard in the literature, and can be relaxed to a certain extent. For the BTSBM, this assumption leads to a particularly simple and elegant eigenstructure of the population matrix.
Given a (mega-)community label denoted by a binary string x, we write x0 and x1 for the binary strings obtained by appending 0 and 1 to x, respectively. We further define T(x) to be the set of all binary strings starting with x. The following theorem gives a full characterization of the eigenstructure of the BTSBM.
Theorem 1.
Let B be the community connection probability matrix of the BTSBM with K = 2^d, and define m = n/K. Then the following holds:
1. (Eigenvalues) The distinct eigenvalues of P = E[A], excluding the eigenvalue −p_0 of multiplicity n − K, are given by
(3)  λ_1 = m(p_0 + Σ_{q=1}^{d} 2^{q−1} p_q) − p_0,  and  λ_r = m(p_0 + Σ_{q=1}^{d−r+1} 2^{q−1} p_q − 2^{d−r+1} p_{d−r+2}) − p_0  for r = 2, …, d + 1,
where λ_r has multiplicity 2^{r−2} for r ≥ 2.
2. (Eigenvectors) For each 2 ≤ r ≤ d + 1 and each binary string x of length r − 2, let ν_x be an n-dimensional vector such that, for any node i, ν_x(i) equals +1 if c(i) ∈ T(x0), −1 if c(i) ∈ T(x1), and 0 otherwise (up to normalization).
Then the eigenspace corresponding to eigenvalue λ_r is spanned by {ν_x : x ∈ {0, 1}^{r−2}}, and the eigenspace corresponding to λ_1 is spanned by the all-ones vector.
It is easy to see that each ν_x corresponds to the split of the two mega-communities below the internal tree node x. For instance, consider the colored rectangles in Figure 1, which correspond to x = 0 and x = 1. The vector ν_0 has a positive constant entry for all nodes in the (solid blue) mega-community 00, a negative constant entry for all nodes in the (dashed blue) mega-community 01, and zero entries for all the other nodes, thus separating mega-communities 00 and 01. Similarly, ν_1 has a positive constant entry for the nodes in the (solid orange) mega-community 10, and a negative constant entry for nodes in the (dashed orange) mega-community 11. Therefore, the binary tree structure is fully characterized by the signs of the given eigenvectors. Note that due to the multiplicity of eigenvalues, the basis of each eigenspace is not unique. In Appendix A, we use another basis which, though less interpretable, is used in the proofs of Theorem 2 and Theorem 3 to obtain better theoretical guarantees.
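The closed-form eigenvalues of the balanced BTSBM are easy to verify numerically. The sketch below (our own illustration, taking m = 1 and ignoring the −p_0 diagonal shift, so that we work with B itself) compares the formula λ_1 = p_0 + Σ_q 2^{q−1} p_q, λ_r = p_0 + Σ_{q ≤ d−r+1} 2^{q−1} p_q − 2^{d−r+1} p_{d−r+2}, against a direct eigendecomposition:

```python
import numpy as np

def btsbm_B(p):
    """K x K BTSBM community matrix for parameters p = (p_0, ..., p_d)."""
    d = len(p) - 1
    labels = [format(x, "0{}b".format(d)) for x in range(2 ** d)]
    def prefix(x, y):
        s = 0
        while s < d and x[s] == y[s]:
            s += 1
        return s
    return np.array([[p[d - prefix(x, y)] for y in labels] for x in labels])

def btsbm_eigenvalues(p):
    """Distinct eigenvalues of B predicted by the closed form above."""
    d = len(p) - 1
    lam = [p[0] + sum(2 ** (q - 1) * p[q] for q in range(1, d + 1))]
    for r in range(2, d + 2):
        lam.append(p[0] + sum(2 ** (q - 1) * p[q] for q in range(1, d - r + 2))
                   - 2 ** (d - r + 1) * p[d - r + 2])
    return lam
```

For example, with d = 3 and p = (0.8, 0.4, 0.2, 0.1), the distinct eigenvalues of B are 2.0, 1.2, 0.8 (multiplicity 2), and 0.4 (multiplicity 4).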
While the previous theorem is stated for general configurations of p_0, …, p_d, the two most natural situations where a hierarchy is meaningful are either assortative communities, with

(4)  p_0 > p_1 > ⋯ > p_d,

or disassortative communities, with

(5)  p_0 < p_1 < ⋯ < p_d.
Recall that the HCD-Sign algorithm only depends on the eigenvector corresponding to the second largest eigenvalue (in magnitude). Theorem 1 directly implies that, under either the assortative or the disassortative setting, this eigenvalue is unique (has multiplicity 1), with an eigenvector whose entry signs yield the first split at the top of the tree.
Corollary 1.
Let B be the community connection probability matrix of the BTSBM with K = 2^d and balanced communities (2). Under either (4) or (5), the second largest eigenvalue (in absolute value) of the population matrix is unique, given by

λ_2 = m(p_0 + Σ_{q=1}^{d−1} 2^{q−1} p_q − 2^{d−1} p_d) − p_0,

and its distance to the next eigenvalue λ_3 is m 2^{d−1} |p_{d−1} − p_d| in both the assortative and the disassortative case. The corresponding (normalized) eigenvector is ν = n^{−1/2}(1, …, 1, −1, …, −1)^T, with entry +1/√n for nodes whose community label begins with 0 and −1/√n for nodes whose label begins with 1.
By a slight abuse of notation, we still write λ_r for the r-th eigenvalue of A (instead of the population matrix) when it is clear from context.
3.3 Consistency of HCD-Sign under the BTSBM
The population binary tree defined in Section 2.4 is unique under the BTSBM, and thus we can evaluate methods under this model by how well they estimate the population tree. Given a community label vector c and the corresponding balanced binary tree of depth d, for each 1 ≤ ℓ ≤ d define the level-ℓ mega-community partition to be the partition of all nodes into the 2^ℓ mega-communities at level ℓ of the tree. In particular, at level d, this partition gives the true community labels c, up to a label permutation. This quantity is well defined only if the binary tree is balanced (i.e., all leaves are at the same depth d), and we will restrict our analysis to the balanced case.
The convention in the literature is to scale all probabilities of connection by a common factor ρ_n that goes to 0 as n grows, with no other dependence on n; see, e.g., the review by Abbe (2018). We similarly reparametrize the BTSBM as

(6)  p_q = ρ_n a_q, q = 0, 1, …, d,

where the a_q are constants not depending on n.
Let û be the eigenvector corresponding to the second largest eigenvalue (in magnitude) of A, and let ū be its population counterpart from Corollary 1. If, up to sign,

(7)  ‖û − ū‖_∞ < 1/√n

with high probability, then the first split will achieve exact recovery, since the entries of ū are ±1/√n and (7) forces the signs of û to match those of ū. A sufficient condition for (7) is concentration of û around ū in the ℓ_∞ norm. The ℓ_∞ perturbation theory for random matrices is now fairly well studied (e.g. Eldridge et al., 2017; Abbe et al., 2017). By recursively applying an ℓ_∞ concentration bound, we can guarantee recovery of the entire binary tree with high probability, under regularity conditions.
We start with a condition on the stopping rule. Recall that a stopping rule is a function ŝ that can be applied to an adjacency matrix A, such that ŝ(A) = 1 indicates that A contains more than one community and the network should be split further, while ŝ(A) = 0 indicates there is no evidence of more than one community and the algorithm should stop splitting.
Definition 2.
A stopping rule ŝ for a network A of size n generated from an SBM with K communities is called consistent with rate δ_n if

P(ŝ(A) = 1) ≥ 1 − δ_n when K > 1,

and

P(ŝ(A) = 0) ≥ 1 − δ_n when K = 1.
With a consistent stopping rule, the strong consistency of binary tree recovery can be guaranteed, as stated in the next two theorems.
Theorem 2 (Consistency of HCD-Sign in the assortative setting).
Let A be generated from a BTSBM with parameters as defined in (6), with K = 2^d. Let ĉ be the community labels and T̂ the corresponding binary tree computed by the HCD-Sign algorithm with stopping rule ŝ. Suppose the model satisfies the assortative condition (4). For any 1 ≤ ℓ ≤ d, define
(8)
Fix any level ℓ. Then there exists a constant C such that, for all sufficiently large n, if
(9)
and the stopping rule is consistent, with rate δ_n, for all the subgraphs corresponding to mega-communities up to layer ℓ, then HCD-Sign exactly recovers the level-ℓ mega-community partition, up to a label permutation, with high probability. The mega-community partition is defined at the start of Section 3.3. Further, if the conditions hold for ℓ = d, then with high probability the algorithm exactly recovers the entire binary tree and stops immediately after it is recovered.
Theorem 2 essentially says that each splitting step of HCD-Sign consistently recovers the corresponding mega-communities, provided that condition (9) holds for that layer. Note that, according to (8), as ℓ increases, the set over which we minimize grows while each individual term remains the same, and the remaining factor in (9) also increases in ℓ. Therefore, (9) gets strictly harder to satisfy as ℓ increases. Hence, even if recovering the entire tree is intrinsically hard or simply impossible (condition (9) fails to hold for ℓ = d), HCD-Sign can still consistently recover mega-communities at the higher levels of the hierarchy, as long as they satisfy the condition. This is a major practical advantage of recursive partitioning compared both to K-way partitioning and to the agglomerative hierarchical clustering of Lyzinski et al. (2017). A similar result holds in the disassortative setting.
Theorem 3 (Consistency of HCD-Sign in the disassortative setting).
It is easy to verify that condition (10) also becomes harder to satisfy for larger ℓ. Therefore, in the disassortative setting we may also be able to recover mega-communities even when we cannot recover the whole tree.
The theorems apply to any consistent stopping rule satisfying Definition 2. In particular, the non-backtracking matrix method we use in our implementation is a consistent stopping rule, based on a recently updated result of Le and Levina (2015), as we show next.
Proposition 1.
Define
(12) 
If
(13) 
and
(14) 
then, for a sufficiently large n, the non-backtracking matrix stopping rule described in Section 2.3 is consistent with rate (12) under the BTSBM for all mega-communities up to layer ℓ.
Proposition 1 directly implies that if (13), (14), and (9) or (10) hold at the same time, the conclusions of Theorem 2 and Theorem 3 hold when using the non-backtracking matrix as the stopping rule. In the proposition, condition (14) prevents the network from being too dense. We believe this to be an artifact of the proof technique of Le and Levina (2015). Intuitively, if the method works for a sparser network, it should work for a denser one as well, so we expect this condition can be removed, but we do not pursue this direction since the non-backtracking estimator is not the focus of the present paper.
Next, we illustrate conditions (13) and (9) in a simplified setting. Consider the assortative setting when the whole tree can be recovered (ℓ = d).
Example 1. Assume the a_q form an arithmetic sequence. In this case, it is easy to see that (9) becomes
(15)
Some simple algebra shows that condition (13) is implied by (15). Thus the additional requirement for strong consistency with the non-backtracking matrix stopping rule is redundant, and we only need condition (15).
Example 2. Assume a geometric sequence , given by for a constant . Let for . The condition (13), after some simplifications, becomes
On the other hand, for (9), we have
If , then , and (9) with becomes
If , then , and (9) becomes
In summary, the following conditions are sufficient for exact recovery:
(16) 
3.4 Comparison with existing theoretical guarantees
In this section, we compare our result with other strong consistency results for recursive bipartitioning. We focus on the assortative setting, since it is required by most existing results (Balakrishnan et al., 2011; Lyzinski et al., 2017). To make the comparison possible, we ignore the stopping rule, since no other method has established consistency with a data-driven stopping rule.
Strong consistency of recursive bipartitioning was previously discussed by Dasgupta et al. (2006). Their algorithm is far more complicated than ours, involving multiple rounds of resampling and pruning, and its computational cost is much higher. For comparison, we rewrite their assumptions in the BTSBM parametrization. Their Theorem 1 requires , and the gap between any two columns of corresponding to nodes in different communities to be at least
(17) 
Under the BTSBM, it is straightforward to show that the minimum in is achieved by two communities corresponding to sibling leaves in the last layer, and
For our result, assume that for some . Then for sufficiently large , . To compare our result (with ) to (17), we consider two cases.
1. Arithmetic sequence: Suppose that is given by . Then (17) gives
whereas our condition (15) is only
which has a much better dependence on .
2. Geometric sequence: Suppose that is given by . Then the condition (17) becomes
From (16), it is easy to see that HCD-Sign has a better rate in if , which is equivalent to .
The algorithm proposed in Balakrishnan et al. (2011) is similar to ours, though their analysis is rather different. Under the BTSBM, each entry of the adjacency matrix is sub-Gaussian with parameter (which cannot be improved), i.e., for any . To recover all mega-communities up to layer (of size at least ), it is easy to show that a necessary condition in their analysis (Theorem 1) is
Thus the consistency result of Balakrishnan et al. (2011) applies only to dense graphs.
Lyzinski et al. (2017) developed their hierarchical community detection algorithm under a different model, which they called the hierarchical stochastic blockmodel (HSBM). The HSBM is defined recursively as a mixture of lower-level models. In particular, when with , the BTSBM is a special case of the HSBM in which each level of the hierarchy has exactly two communities. Lyzinski et al. (2017) showed exact tree recovery for fixed and an average expected degree of at least , which implies a very dense network. By contrast, our result allows to grow, and if is fixed, the average degree only needs to grow as fast as for an arbitrary , a much weaker requirement, especially considering that a degree of order is necessary for strong consistency of community labels under a standard SBM with fixed .
3.5 Computational complexity
We conclude this section by investigating the computational complexity of HCD, which turns out to be better than that of K-way partitioning, especially for problems with a large number of communities. The intuition behind this somewhat surprising result is that, even though HCD performs clustering multiple times, each step involves a much simpler task: it computes no more than two eigenvectors instead of , and the number of nodes to cluster decreases after every split.
We start by stating some relevant known facts. Suppose we use Lloyd’s algorithm (Lloyd, 1982) for k-means and the Lanczos algorithm (Larsen, 1998) to compute the spectrum, both run for a fixed number of iterations. If the input matrix is , the k-means step has complexity . In computing the leading eigenvectors, we take advantage of matrix sparsity, resulting in complexity , where is the number of nonzero entries of . Therefore, the computational cost of K-way spectral clustering is
(18) 
Turning to HCD, let denote the adjacency matrix of the th block in the th layer. For comparison purposes, we assume the BTSBM and the conditions for exact recovery hold, so that the entire tree is recovered. Then corresponds to nodes. Note that for both Algorithm 1 and Algorithm 2, the complexity is linear in the size of the input. As in (18), the splitting step applied to has complexity for both HCD-Sign and HCD-Spec. Adding up the cost over all layers, the total computational cost becomes
(19) 
where we use the facts that and . Since the blocks in the th layer are disjoint,
(20) 
and thus (19) is upper bounded by
(21) 
This is strictly better than the complexity (18) of K-way spectral clustering for large .
Moreover, the inequality (20) may be overly conservative. Under the BTSBM, the expected number of within-block edges in the th layer is
As a result,
The coefficient in front of is in many situations, including Examples 1 and 2 discussed at the end of Section 3.3. In these cases, the average complexity of the HCD algorithms is only
Last but not least, the HCD framework, unlike K-way partitioning, is easy to parallelize.
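The recursion above can be sketched in a few lines. The following is a minimal dense-matrix illustration of sign splitting in the spirit of HCD-Sign; a real implementation would use sparse Lanczos iterations as discussed above, and the stopping rule is left abstract, with a rank-based rule (suitable only for population matrices) used for illustration. Function names are ours, not the authors'.

```python
import numpy as np

def hcd_sign(A, nodes=None, stop=None, min_size=2):
    """Top-down sign splitting: compute the second leading eigenvector
    of the current block and split its nodes by sign, recursing until
    the stopping rule fires. Returns a nested tuple (the community
    tree) whose leaves are arrays of node indices."""
    if nodes is None:
        nodes = np.arange(A.shape[0])
    if (stop is not None and stop(A)) or len(nodes) < 2 * min_size:
        return nodes
    vals, vecs = np.linalg.eigh(A)
    order = np.argsort(-np.abs(vals))   # eigenvalues by magnitude
    v2 = vecs[:, order[1]]              # second leading eigenvector
    left = v2 >= 0
    if left.all() or (~left).all():     # no usable split
        return nodes
    return (hcd_sign(A[np.ix_(left, left)], nodes[left], stop, min_size),
            hcd_sign(A[np.ix_(~left, ~left)], nodes[~left], stop, min_size))

def rank_one_stop(M):
    """Illustrative stopping rule: stop when the block is numerically
    rank one, which holds for population blocks of a single community."""
    if M.shape[0] < 2:
        return True
    ev = np.sort(np.abs(np.linalg.eigvalsh(M)))
    return ev[-2] < 1e-8
```

For example, applied to the population matrix of a two-community model (0.5 within, 0.1 between), the first split recovers the two communities exactly and the rank-one rule then stops the recursion.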
4 Numerical results on synthetic networks
In this section, we investigate the empirical performance of HCD on synthetic networks. Since our main focus is on comparing recursive bipartitioning with K-way partitioning, we compare HCD with regularized spectral clustering (RSC) and do not include other K-way community detection methods. All synthetic networks in this section are generated from the BTSBM. In particular, we set the sequence to be for some value of . Both balanced and unbalanced settings are studied. We evaluate several aspects of HCD:

Accuracy of estimating the number of communities, i.e., comparing the estimated and the true number of communities.

Accuracy of estimating the communities themselves, measured by the co-clustering accuracy , where is the true co-clustering matrix defined by , and is its estimate based on HCD. This is a convenient measure of clustering accuracy that does not depend on label permutations.

Accuracy of estimating the probability matrix , measured by .

Accuracy of estimating the hierarchy, measured by the tree similarity .

Accuracy of estimating mega-communities, measured by the proportion of correctly clustered nodes. We only compute this for balanced BTSBM settings, for levels 1 and 2.
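The co-clustering accuracy above can be computed directly from label vectors. A short sketch follows; since the exact display is not reproduced here, the normalization over all n^2 node pairs is one natural choice, stated as an assumption.

```python
import numpy as np

def coclustering_accuracy(labels_true, labels_est):
    """Fraction of node pairs (i, j) on which the true and estimated
    co-clustering matrices C_ij = 1{c_i = c_j} agree. Invariant to
    permutations of the community labels."""
    labels_true = np.asarray(labels_true)
    labels_est = np.asarray(labels_est)
    C_true = labels_true[:, None] == labels_true[None, :]
    C_est = labels_est[:, None] == labels_est[None, :]
    return (C_true == C_est).mean()
```

For instance, relabeling every community (e.g., swapping labels 0 and 1) leaves the score at 1.0, which is exactly the permutation invariance the text refers to.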
For estimating the number of communities, we compare HCD with the non-backtracking (NB) matrix method for estimating , shown to be one of the most accurate options available for the SBM (Le and Levina, 2015). For the other tasks, we compare HCD-Spec with regularized spectral clustering (RSC); for a fair comparison, RSC uses the number of communities chosen by HCD-Spec. To compare tree structures between HCD and RSC, we apply the HCD procedure to the probability matrix estimated by RSC. HCD-Sign is also included in all comparisons and, again for fairness, is regularized in the same way as HCD-Spec. Since the sign-splitting approach only splits into two parts, it has no K-way counterpart to compare to. All simulation results are averaged over 100 independent replications.
4.1 Varying the number of communities
In this example, we only consider the balanced setting of the BTSBM with and values of , setting . The parameter is set so that the average out-in ratio (between-block edges/within-block edges) is fixed at for all , and is set to give an average degree of 50. These values of and are not too challenging for small , so we can be sure that the main impact on accuracy comes from changing .
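Balanced BTSBM networks like these can be generated directly from the model definition. The sketch below adopts the convention that the probability between communities x and y is p[d - s], where s is the common-prefix length of their d-bit binary labels, so p[0] is the within-community probability; the parameter values in the test are illustrative only.

```python
import numpy as np

def btsbm_probability_matrix(d, p):
    """Community connection matrix B of a balanced BTSBM with K = 2^d
    leaf communities, given p = [p_0, ..., p_d]."""
    K = 2 ** d
    B = np.empty((K, K))
    for x in range(K):
        for y in range(K):
            s = 0  # length of the common prefix of the binary labels
            for bit in range(d - 1, -1, -1):
                if ((x >> bit) & 1) == ((y >> bit) & 1):
                    s += 1
                else:
                    break
            B[x, y] = p[d - s]
    return B

def sample_btsbm(n, d, p, rng):
    """Sample a symmetric adjacency matrix with n nodes split evenly
    over the 2^d leaf communities."""
    K = 2 ** d
    labels = np.repeat(np.arange(K), n // K)
    P = btsbm_probability_matrix(d, p)[np.ix_(labels, labels)]
    U = rng.random((n, n))
    A = np.triu((U < P).astype(int), 1)  # upper triangle, no self-loops
    return A + A.T, labels
```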
Figure 2 shows the results as a function of the true . For estimating , Figure 2(a) shows that HCD is clearly more accurate than the non-backtracking method applied to the entire network, though both underestimate for large . HCD also does better on community detection and probability matrix estimation (Figures 2(b) and 2(c)), especially for larger . Since is large and most pairs of nodes are in different clusters, the difference between co-clustering matrices appears small, but the probability matrix errors show that the differences are in fact substantial. For , when all three methods use roughly the true number of communities, the estimation error of HCD-Spec is roughly lower than that of RSC.
Perhaps the most telling result, shown in Figure 2(d), is that both HCD methods recover the tree much better than RSC, especially for larger . This indicates that even when the number of communities is underestimated, the upper levels of the tree are still estimated accurately. This is further evident from Figures 2(e) and 2(f), which compare the estimated mega-communities at the top two levels of the tree. Both HCD methods perform consistently better than RSC, and their advantage is especially pronounced when the network is sparse or is large. The two versions are always very close, with HCD-Spec showing a slight advantage for larger . In summary, as grows and the problem becomes more challenging, the advantages of HCD become more pronounced.
4.2 Varying network sparsity
We use the same configuration as before, except that we now vary the average degree of the network, fixing and holding the out-in ratio at 0.15. Results are shown in Figure 3. Though both methods tend to underestimate , HCD is uniformly more accurate. The global estimate of from the non-backtracking matrix stops improving once the average degree exceeds about , whereas the HCD estimate continues to improve and is close to the truth when the average degree reaches about . HCD also dominates RSC on all other tasks.
4.3 Unbalanced communities with a complex tree structure
The BTSBM gives us the flexibility to generate complex tree structures and communities of varying sizes. However, it is difficult to control these features with a single parameter such as or the average degree, so instead we include two specific examples as illustrations. The first example corresponds to the hierarchical community structure shown in Figure 4(a). It is generated from a balanced model with 32 communities by merging 4 pairs of the original communities, resulting in communities in total: 4 communities of 200 nodes each and 24 communities of 100 nodes each. This is a challenging community detection problem because is large, and the varying community sizes make it harder still. The second example is shown in Figure 4(b). It is again generated from 32 balanced communities, by merging 2 pairs of leaves one level up and 8 pairs three levels up, making it even more unbalanced. This tree has two communities of 800 nodes, two of 200 nodes, and the remaining 12 communities of 100 nodes each. In both examples, the average degree is 35.
Table 1 shows the performance in these two examples. The HCD methods clearly perform better on all tasks, matching what we observed in the balanced settings.
           Performance metric        HCD-Sign  HCD-Spec  RSC
Example 1  Co-clustering accuracy    0.963     0.964     0.949
           Probability matrix error  0.015     0.014     0.041
           Tree similarity error     0.135     0.126     0.212
Example 2  Co-clustering accuracy    0.980     0.981     0.919
           Probability matrix error  0.026     0.025     0.033
           Tree similarity error     0.081     0.077     0.131
5 Hierarchical communities in a statistics citation network
This dataset (Ji and Jin, 2016) contains information on statistics papers from four journals considered top (the Annals of Statistics, Biometrika, Journal of the American Statistical Association: Theory and Methods, and Journal of the Royal Statistical Society Series B) for the period from 2003 to 2012. For each paper, the dataset includes the title, authors, year of publication, journal, DOI, and citations. We constructed a citation network by connecting two authors if there is at least one citation (in either direction) between them in the dataset. Following Wang and Rohe (2016), we focused on the 3-core of the largest connected component of the network, ignoring the periphery, which frequently does not match the community structure of the core. The resulting network, shown in Figure 5, has 707 nodes (authors) and an average degree of 9.29.
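The preprocessing step (largest connected component, then its 3-core) is straightforward to reproduce. Below is a stdlib-only sketch with our own function names, not the code actually used by the authors.

```python
def largest_component(adj):
    """Largest connected component of an undirected graph given as a
    dict mapping node -> set of neighbours."""
    seen, best = set(), set()
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:                      # depth-first search
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(adj[u] - comp)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return {u: adj[u] & best for u in best}

def k_core(adj, k):
    """Iteratively remove nodes of degree < k until none remain."""
    adj = {u: set(vs) for u, vs in adj.items()}
    changed = True
    while changed:
        changed = False
        for u in list(adj):
            if len(adj[u]) < k:
                for v in adj.pop(u):
                    if v in adj:
                        adj[v].discard(u)
                changed = True
    return adj
```

Applying `k_core(largest_component(adj), 3)` keeps only authors with at least three remaining citation links, discarding the periphery.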
The two HCD algorithms (HCD-Spec and HCD-Sign) give the same result on this network. We used the edge cross-validation (ECV) method (Li et al., 2016) as the stopping rule instead of the non-backtracking method, because ECV does not rely on the block model assumption. In this particular problem, ECV chooses a deeper and more informative tree with 15 communities, shown in Table 2, compared to the non-backtracking estimate of 11 communities. For a closer look at the communities, see the list of the 10 highest-degree authors in each community in Table 3 in Appendix D. Community labels in Table 2 were constructed semi-manually from research keywords associated with the members of each community. The keywords were obtained by collecting the research interests of the 20 highest-degree statisticians in each community from personal webpages, department research pages, Google Scholar, and Wikipedia (sources consulted in that order), with stop-word filtering and stemming applied. The three most frequent keywords in each community are shown in Table 2. Note that the citations come from publications between 2003 and 2012, while the research interests were collected in 2018, so there is a potential time gap. Still, it is evident to anyone familiar with the statistics literature of that period that the detected communities largely correspond to real research communities, judging both by the people in each community and by the associated keywords.
The hierarchical tree of research communities, shown in Figure 6, contains a great deal of additional information and clearly reflects many well-known patterns. For example, Bayesian statistics and design of experiments split off very high up in the tree; the various high-dimensional communities cluster together; multiple testing is closely related to neuroimaging (which served as one of its main motivations); functional data analysis and non/semiparametric methods cluster together; and so on. These relationships between communities are just as informative as the communities themselves, if not more so, and could not have been obtained with any “flat” K-way community detection method.
This network was also studied by Ji and Jin (2016), though they did not extract the core. Our results are not directly comparable, since there is no ground truth and the numbers of communities found differ. Briefly, they first found three communities, and upon observing that one of them was very mixed, further broke it into three. They interpreted the resulting five communities as “Large-Scale Multiple Testing”, “Variable Selection”, “Nonparametric spatial/Bayesian statistics”, “Parametric spatial statistics”, and “Semiparametric/Nonparametric statistics”. Though some of these labels coincide with our communities in Table 2, their partition appears more mixed (with spatial statistics and nonparametric statistics each appearing twice), and the hierarchical information showing which communities are close and which are far apart is not available from a flat partition.
Community research area [size]            Top three research interest keywords (from webpages)

Design of experiments [16]                design, experiment, theory
Bayesian statistics [98]                  Bayesian, model, inference
Biostatistics and bio applications [35]   model, inference, sampling
Causal inference and shape (mixed) [15]   inference, estimation, causal
Nonparametrics and wavelets [26]          model, nonparametric, estimation
Neuroimaging [18]                         imaging, Bayesian, model
Multiple testing/inference [92]           inference, multiple, test
Clinical trials and survival analysis [45]  survival, clinical, trial
Non/semiparametric methods [38]           model, longitudinal, semiparametric
Functional data analysis [96]             functional, model, measurement
Dimensionality reduction [35]             dimension, reduction, regression
Machine learning [21]                     machine learning, biological, mining
(High-dim.) time series and finance [36]  financial, econometrics, time
High-dimensional theory [29]              high-dimensional, theory, model
High-dimensional methodology [107]        high-dimensional, machine learning, model
6 Discussion
In this paper, we studied recursive partitioning as a framework for hierarchical community detection and proposed two specific algorithms for implementing it, using either spectral clustering or sign splitting. This framework requires a stopping rule to decide when to stop splitting communities, but is otherwise tuning-free. We have shown that in certain regimes recursive partitioning has significant advantages over K-way partitioning in computational efficiency, community detection accuracy, and hierarchical structure recovery. An important feature of hierarchical splitting is that it can recover high-level mega-communities correctly even when the smaller communities cannot be recovered. It also provides a natural, interpretable representation of the community structure and induces a tree-based similarity measure that does not depend on community label permutations, allowing us to quantitatively compare entire hierarchies of communities. The algorithm itself is model-free, but we showed that it works under a new model we introduced, the binary tree SBM. Under this model, the hierarchical algorithm based on sign splitting consistently estimates both the individual communities and the entire hierarchy. We conjecture that the advantages of hierarchical clustering carry over to general non-binary trees and more general models; more work will be needed to establish this formally.
Acknowledgements
E. Levina and T. Li (while a PhD student at the University of Michigan) were supported in part by an NSF grant (DMS1521551) and an ONR grant (N000141612910). P. J. Bickel and L. Lei were supported in part by an NSF grant (DMS1713083). P. Sarkar was supported in part by an NSF grant (DMS1713082).
References
 Abbe (2018) E. Abbe. Community detection and stochastic block models: Recent developments. Journal of Machine Learning Research, 18(177):1–86, 2018.
 Abbe et al. (2017) E. Abbe, J. Fan, K. Wang, and Y. Zhong. Entrywise eigenvector analysis of random matrices with low expected rank. arXiv preprint arXiv:1709.09565, 2017.
 Airoldi et al. (2008) E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing. Mixed membership stochastic blockmodels. Journal of Machine Learning Research, 9(Sep):1981–2014, 2008.
 Amini and Levina (2018) A. A. Amini and E. Levina. On semidefinite relaxations for the block model. The Annals of Statistics, 46(1):149–179, 2018.
 Amini et al. (2013) A. A. Amini, A. Chen, P. J. Bickel, and E. Levina. Pseudolikelihood methods for community detection in large sparse networks. The Annals of Statistics, 41(4):2097–2122, 2013.
 Balakrishnan et al. (2011) S. Balakrishnan, M. Xu, A. Krishnamurthy, and A. Singh. Noise thresholds for spectral clustering. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 954–962. Curran Associates, Inc., 2011. URL http://papers.nips.cc/paper/4342-noise-thresholds-for-spectral-clustering.pdf.
 Bickel et al. (2013) P. Bickel, D. Choi, X. Chang, and H. Zhang. Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels. The Annals of Statistics, 41(4):1922–1943, 2013.
 Bickel and Chen (2009) P. J. Bickel and A. Chen. A nonparametric view of network models and Newman–Girvan and other modularities. Proceedings of the National Academy of Sciences, 106(50):21068–21073, 2009.
 Bickel and Sarkar (2016) P. J. Bickel and P. Sarkar. Hypothesis testing for automated community detection in networks. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(1):253–273, 2016.
 Bickel et al. (2011) P. J. Bickel, A. Chen, and E. Levina. The method of moments and degree distributions for network models. The Annals of Statistics, 39(5):2280–2301, 2011.
 Blundell and Teh (2013) C. Blundell and Y. W. Teh. Bayesian hierarchical community discovery. In Advances in Neural Information Processing Systems, pages 1601–1609, 2013.
 Cai and Li (2015) T. T. Cai and X. Li. Robust and computationally feasible community detection in the presence of arbitrary outlier nodes. The Annals of Statistics, 43(3):1027–1059, 2015.
 Celisse et al. (2012) A. Celisse, J.J. Daudin, L. Pierre, et al. Consistency of maximumlikelihood and variational estimators in the stochastic block model. Electronic Journal of Statistics, 6:1847–1899, 2012.
 Chatterjee (2015) S. Chatterjee. Matrix estimation by universal singular value thresholding. The Annals of Statistics, 43(1):177–214, 2015.
 Chen and Lei (2018) K. Chen and J. Lei. Network crossvalidation for determining the number of communities in network data. Journal of the American Statistical Association, 113(521):241–251, 2018.
 Chen and Xu (2014) Y. Chen and J. Xu. Statisticalcomputational tradeoffs in planted problems and submatrix localization with a growing number of clusters and submatrices. arXiv preprint arXiv:1402.1267, 2014.
 Chen et al. (2012) Y. Chen, S. Sanghavi, and H. Xu. Clustering sparse graphs. In Advances in neural information processing systems, pages 2204–2212, 2012.
 Clauset et al. (2008) A. Clauset, C. Moore, and M. E. Newman. Hierarchical structure and the prediction of missing links in networks. Nature, 453(7191):98–101, 2008.
 Dasgupta et al. (2006) A. Dasgupta, J. Hopcroft, R. Kannan, and P. Mitra. Spectral clustering by recursive partitioning. In European Symposium on Algorithms, pages 256–267. Springer, 2006.
 Eldridge et al. (2017) J. Eldridge, M. Belkin, and Y. Wang. Unperturbed: spectral analysis beyond Davis–Kahan. arXiv preprint arXiv:1706.06516, 2017.
 Fortunato (2010) S. Fortunato. Community detection in graphs. Physics Reports, 486(3):75–174, 2010.
 Gao and Lafferty (2017) C. Gao and J. Lafferty. Testing for global network structure using small subgraph statistics. arXiv preprint arXiv:1710.00862, 2017.
 Gao et al. (2016) C. Gao, Y. Lu, Z. Ma, and H. H. Zhou. Optimal estimation and completion of matrices with biclustering structures. Journal of Machine Learning Research, 17(161):1–29, 2016.
 Gao et al. (2017) C. Gao, Z. Ma, A. Y. Zhang, and H. H. Zhou. Achieving optimal misclassification proportion in stochastic block models. The Journal of Machine Learning Research, 18(1):1980–2024, 2017.
 Goldenberg et al. (2010) A. Goldenberg, A. X. Zheng, S. E. Fienberg, and E. M. Airoldi. A survey of statistical network models. Foundations and Trends® in Machine Learning, 2(2):129–233, 2010.
 Handcock et al. (2007) M. S. Handcock, A. E. Raftery, and J. M. Tantrum. Modelbased clustering for social networks. Journal of the Royal Statistical Society: Series A (Statistics in Society), 170(2):301–354, 2007.
 Hoff (2008) P. Hoff. Modeling homophily and stochastic equivalence in symmetric relational data. In Advances in Neural Information Processing Systems, pages 657–664, 2008.
 Holland et al. (1983) P. W. Holland, K. B. Laskey, and S. Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, 1983.
 Ji and Jin (2016) P. Ji and J. Jin. Coauthorship and citation networks for statisticians. The Annals of Applied Statistics, 10(4):1779–1812, 2016.
 Joseph and Yu (2016) A. Joseph and B. Yu. Impact of regularization on spectral clustering. The Annals of Statistics, 44(4):1765–1791, 2016.
 Kannan et al. (2004) R. Kannan, S. Vempala, and A. Vetta. On clusterings: Good, bad and spectral. Journal of the ACM (JACM), 51(3):497–515, 2004.
 Karrer and Newman (2011) B. Karrer and M. E. Newman. Stochastic blockmodels and community structure in networks. Physical Review E, 83(1):016107, 2011.
 Kleinberg (2002) J. M. Kleinberg. Smallworld phenomena and the dynamics of information. In Advances in neural information processing systems, pages 431–438, 2002.
 Krzakala et al. (2013) F. Krzakala, C. Moore, E. Mossel, J. Neeman, A. Sly, L. Zdeborová, and P. Zhang. Spectral redemption in clustering sparse networks. Proceedings of the National Academy of Sciences, 110(52):20935–20940, 2013.
 Larsen (1998) R. M. Larsen. Lanczos bidiagonalization with partial reorthogonalization. DAIMI Report Series, 27(537), 1998.
 Le and Levina (2015) C. M. Le and E. Levina. Estimating the number of communities in networks by spectral methods. arXiv preprint arXiv:1507.00827, 2015.
 Le et al. (2017) C. M. Le, E. Levina, and R. Vershynin. Concentration and regularization of random graphs. Random Structures & Algorithms, 2017.
 Lei and Rinaldo (2014) J. Lei and A. Rinaldo. Consistency of spectral clustering in stochastic block models. The Annals of Statistics, 43(1):215–237, 2014.
 Li et al. (2016) T. Li, E. Levina, and J. Zhu. Network crossvalidation by edge sampling. arXiv preprint arXiv:1612.04717, 2016.
 Lloyd (1982) S. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
 Lyzinski et al. (2017) V. Lyzinski, M. Tang, A. Athreya, Y. Park, and C. E. Priebe. Community detection and classification in hierarchical stochastic blockmodels. IEEE Transactions on Network Science and Engineering, 4(1):13–26, 2017.
 Mariadassou et al. (2010) M. Mariadassou, S. Robin, and C. Vacher. Uncovering latent structure in valued graphs: a variational approach. The Annals of Applied Statistics, pages 715–742, 2010.
 Matias and Miele (2017) C. Matias and V. Miele. Statistical clustering of temporal networks through a dynamic stochastic block model. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(4):1119–1141, 2017.
 Newman (2010) M. Newman. Networks: an introduction. Oxford university press, 2010.
 Newman (2006) M. E. Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23):8577–8582, 2006.
 Newman and Girvan (2004) M. E. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical review E, 69(2):026113, 2004.
 Olhede and Wolfe (2014) S. C. Olhede and P. J. Wolfe. Network histograms and universality of blockmodel approximation. Proceedings of the National Academy of Sciences, 111(41):14722–14727, 2014.
 Peel and Clauset (2015) L. Peel and A. Clauset. Detecting change points in the largescale structure of evolving networks. In AAAI, pages 2914–2920, 2015.
 Rohe et al. (2011) K. Rohe, S. Chatterjee, and B. Yu. Spectral clustering and the highdimensional stochastic blockmodel. The Annals of Statistics, 39(4):1878–1915, 2011.
 Shi and Malik (2000) J. Shi and J. Malik. Normalized cuts and image segmentation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22(8):888–905, 2000.
 Spielman and Teng (1996) D. A. Spielman and S.-H. Teng. Spectral partitioning works: Planar graphs and finite element meshes. In Foundations of Computer Science, 1996. Proceedings., 37th Annual Symposium on, pages 96–105. IEEE, 1996.
 Van Der Hofstad (2016) R. Van Der Hofstad. Random graphs and complex networks, 2016.
 Wang and Rohe (2016) S. Wang and K. Rohe. Discussion of “coauthorship and citation networks for statisticians”. The Annals of Applied Statistics, 10(4):1820–1826, 2016.
 Wang and Bickel (2017) Y. R. Wang and P. J. Bickel. Likelihoodbased model selection for stochastic block models. The Annals of Statistics, 45(2):500–528, 2017.
 Xu and Hero (2013) K. S. Xu and A. O. Hero. Dynamic stochastic blockmodels: Statistical models for timeevolving networks. In International conference on social computing, behavioralcultural modeling, and prediction, pages 201–210. Springer, 2013.
 Young and Scheinerman (2007) S. J. Young and E. R. Scheinerman. Random dot product graph models for social networks. In International Workshop on Algorithms and Models for the WebGraph, pages 138–149. Springer, 2007.
 Yu et al. (2014) Y. Yu, T. Wang, and R. J. Samworth. A useful variant of the Davis–Kahan theorem for statisticians. Biometrika, 102(2):315–323, 2014.
 Zhang et al. (2014) Y. Zhang, E. Levina, and J. Zhu. Detecting overlapping communities in networks using spectral methods. arXiv preprint arXiv:1412.3432, 2014.
 Zhao et al. (2012) Y. Zhao, E. Levina, J. Zhu, et al. Consistency of community detection in networks under degreecorrected stochastic block models. The Annals of Statistics, 40(4):2266–2292, 2012.
Appendix A More on the eigenstructure of the BTSBM
Recall that is the membership matrix with the th row given by , where is the th canonical basis vector in and is the integer given by the binary representation. Under the BTSBM, we have
Without loss of generality, we can rearrange the nodes so that
where denotes an -dimensional vector with all entries equal to . Under (2), has orthonormal columns, and we can rewrite as
(22) 
Therefore has the same eigenvalues as and the same eigenvectors as , where is any basis matrix spanning the eigenspace of . Theorem 1 in Section 3 is thus a direct consequence of the following eigenstructure of , which we prove at the end of this section.
Theorem 4.
Let be the community connection probability matrix of the BTSBM with . Then the following hold.
1. (Eigenvalues) The distinct nonzero eigenvalues of , denoted by , are given by
(23) 
2. (Eigenvectors) For any and each , let be a -dimensional vector such that for any ,
where maps the integer to its binary representation and are defined above Theorem 1. Then the eigenspace corresponding to is spanned by .
Theorem 4 gives one representation of the BTSBM eigenspace. While it is easy to describe, it is not technically convenient, because all eigenvectors except the first two are sparse, with norm much larger than . However, since has at most distinct eigenvalues, there are many ways to represent the eigenspace; in particular, we can find a representation in which all entries of all eigenvectors corresponding to nonzero eigenvalues are .
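The eigenvalue structure of Theorem 4 is easy to verify numerically for small trees. The sketch below builds the connectivity matrix of a balanced BTSBM (entry (x, y) equals p[d - s], with s the common-prefix length of the d-bit labels, so p[0] is the within-community value) and checks that there are exactly d + 1 distinct eigenvalues; the parameter values are illustrative only.

```python
import numpy as np

def btsbm_B(d, p):
    """K x K connectivity matrix of a balanced BTSBM with K = 2^d
    leaf communities, given p = [p_0, ..., p_d]."""
    K = 2 ** d
    B = np.empty((K, K))
    for x in range(K):
        for y in range(K):
            s = d  # common prefix length; d means identical labels
            for bit in range(d - 1, -1, -1):
                if ((x >> bit) & 1) != ((y >> bit) & 1):
                    s = d - 1 - bit
                    break
            B[x, y] = p[d - s]
    return B

# Assortative example with d = 3, K = 8.
d, p = 3, [0.40, 0.25, 0.15, 0.05]
eigs = np.linalg.eigvalsh(btsbm_B(d, p))
distinct = np.unique(np.round(eigs, 8))
```

For these parameters the eigenvalues are p0 + p1 + 2 p2 + 4 p3 = 1.15 (the Perron eigenvalue), 0.75, 0.35 (multiplicity 2), and p0 - p1 = 0.15 (multiplicity 4): d + 1 = 4 distinct values, with multiplicities doubling down the tree.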
We start with . In this case, let . Then