Algorithmic detectability threshold of the stochastic blockmodel
The assumption that the values of model parameters are known or correctly learned, i.e., the Nishimori condition, is one of the requirements for the detectability analysis of the stochastic blockmodel in statistical inference. In practice, however, there is no example demonstrating that we can know the model parameters beforehand, and there is no guarantee that the model parameters can be learned accurately. In this study, we consider the expectation–maximization (EM) algorithm with belief propagation (BP) and derive its algorithmic detectability threshold. Our analysis is not restricted to the community structure, but includes general modular structures. Because the algorithm cannot always learn the planted model parameters correctly, the algorithmic detectability threshold is qualitatively different from the one with the Nishimori condition.
Graph clustering is a technique to detect a macroscopic law of connectivity in a graph Goldenberg et al. (2010); Leger et al. (2014); Peixoto (2017); Fortunato (2010); Fortunato and Hric (2016). In other words, we expect that each vertex in a graph belongs to a module (or modules) and that the vertices in the same module are statistically equivalent. We then let an algorithm infer the most likely module assignments. In particular, the subset of the graph-clustering problem that focuses on the detection of densely connected (i.e., assortative) module is termed community detection. A classic example of community detection is the detection of social groups in a social network, wherein the vertices and edges represent persons and friendships, respectively.
Often, the graph structures that are more general than the assortative modules are detected using statistical inference methods. In this approach, we infer the most-likely module assignment for each vertex by fitting graph data using a random graph model with a planted modular structure. The stochastic blockmodel (SBM) Holland et al. (1983); Fienberg et al. (1985); Wang and Wong (1987); Anderson et al. (1992); Faust and Wasserman (1992); Karrer and Newman (2011) is a canonical model used for this purpose. The SBM is an extension of the Erdős-Rényi random graph; each vertex in the SBM has a planted module assignment and the vertices in the same module have stochastically equivalent connection patterns. Thus, we can generate the graph instances of various modular structures.
The model parameters of the SBM smoothly connect the random graphs with a strong modular structure and the Erdős-Rényi random graph. Interestingly, as the strength of the modular structure decreases, before the SBM becomes equivalent to the Erdős-Rényi random graph, a phase transition occurs, and at this stage, it becomes impossible to infer the planted module assignment. This critical point is called the detectability threshold or the detectability limit Reichardt and Leone (2008); Decelle et al. (2011a); Banks et al. (2016). The impossibility of inference stems from the fact that the fluctuations of the graph instances are not negligible, so that the graph instances generated from the SBM are statistically indistinguishable from the those of the Erdős-Rényi random graph. This is a fundamental problem in graph clustering, and it offers an insight into the extent to which we should expect algorithms to work. This is a characteristic phenomenon of sparse graphs, i.e., graphs with a constant average degree, and it cannot be observed in dense graphs. In the dense regime, instead, another interesting problem called the recovery problem Condon and Karp (2001); Bickel and Chen (2009); Rohe et al. (2011); Yun and Proutiere (2014); Abbe and Sandon (2015); Abbe (2017) arises.
We focus on sparse undirected graphs. We do not consider graphs with self-loops and multi-edges or the SBM wherein a vertex belongs to multiple modules. Instead, we allow the graphs to have multiple types of edges Heimlicher et al. (2012); Lelarge et al. (2015); Mariadassou et al. (2010); GuimerÃ and Sales-Pardo (2013); Rovira-Asenjo et al. (2013); Saade et al. (2016); Peixoto (2015). Thus, we will extend our analysis to a variant of the SBM called the labeled stochastic blockmodel (labeled SBM). An instance of the labeled SBM is shown in Fig. 1.
Several frameworks corresponding to algorithms such as the greedy Karrer and Newman (2011); Larremore et al. (2014), Monte Carlo Nowicki and Snijders (2001); Peixoto (2014a, b); Newman and Reinert (2016), and expectation–maximization (EM) Daudin et al. (2008); Latouche et al. (2012); Ball et al. (2011); Decelle et al. (2011b); Aicher et al. (2014); Newman and Clauset (2016) algorithm are available for the statistical inference of the SBM. We analyze the performance of the EM algorithm, because it is scalable and suited for theoretical analysis. In particular, we consider belief propagation (BP) as its module-assignment inference (E-step), and the point estimate of the model parameters as its model-parameter learning (M-step).
In this study, we derive the algorithmic detectability threshold of the SBM using the EM algorithm. The detectability threshold is often defined as the fundamental limit where all polynomial-time algorithms fail; here, we distinguish such a threshold as the theoretical limit of detectability. There are two reasons why we focus on the algorithmic detectability threshold.
The first reason is that it helps deriving a legitimate threshold in practice. It is known that the Bayesian inference using BP achieves the theoretical limit of detectability Decelle et al. (2011a, b); Mossel et al. (2014); Massoulié (2014); Moore (2017), assuming that the correct values of model parameters are known. Supported by this favorable property in theory and the good scalability of the algorithm, BP was implemented to solve various SBM variants Aicher et al. (2014); Zhang et al. (2014); Ghasemian et al. (2016); Newman and Clauset (2016); Kawamoto and Kabashima (2017). On the other hand, some non-Bayesian methods are known to be strictly suboptimal, e.g., Kawamoto and Kabashima (2015a, b); Javanmard et al. (2016). However, this is not a fair comparison because the Bayesian inference with BP is not an algorithmic detectability threshold; it can be regarded as the EM algorithm without the M-step. In practice, we do not know the planted model parameters beforehand, and there is no guarantee that they can be learned precisely. In fact, the performance of the EM algorithm generally depends on the initial values of the model parameters. To the best of our knowledge, the algorithmic detectability threshold that takes into account both the E-step and M-step remains a mystery, and we solve this problem analytically. The condition that the planted values of model parameters are correctly learned is often referred to as the Nishimori condition Iba (1999); Decelle et al. (2011b); Zdeborová and Krzakala (2016).
The second reason why we focus on the algorithmic detectability threshold is that we need to observe the algorithmic infeasibility through the analysis. In general, even if we have data of extremely high dimensions, there exists a limit that an algorithm can handle correctly. Then, the inference task is algorithmically infeasible though it may be computationally feasible. This difficulty can be depicted in the detectability phase-diagram of the algorithm, whereas it cannot be observed from the theoretical limit of detectability because of the assumption of infinite learnability.
This paper is organized as follows. In Sec. II, we explain the precise construction of the standard (i.e., binary label) SBM and its analytically tractable parametrization. We then explain the procedure of the EM algorithm (Sec. III) and the behavior of its M-step (Sec. IV). From Sec. V, we extend the SBM to the labeled SBM. After explaining how the formulation of the standard SBM is modified in the labeled SBM in Sec. V, we derive the algorithmic detectability threshold in Sec. VI. In Sec. VII, we present the detectability phase-diagrams for some specific cases. In Sec. VIII, we discuss the algorithmic infeasibility as a physical consequence of the algorithmic detectability threshold. Finally, Sec. IX is devoted to the summary and discussion. While we focused on a limited case of the same problem in Ref. ADT (), in this study, we extend the results therein as general as possible.
Ii Standard SBM
We first explain how the standard SBM is generated. We consider the set of vertices with and denote the number of modules as . For each vertex, we assign a module label with probability independently and randomly, where is an array that determines the relative size of each module. We denote an array of module assignments as . Given , an undirected edge is generated between vertices and independently with probability , where is the adjacency matrix element with if and are connected, and if they are not connected (or connected via a non-edge). The matrix is called the affinity matrix. It is a matrix and is of so that the resulting graph is sparse. We denote an edge between vertices and as , the set of edges as , and the number of edges as . Thus, the likelihood of the standard SBM is as follows.
The model parameters to be learned in the SBM are and , and the set of module assignments is the latent variable that is to be inferred, given the adjacency matrix .
Although the SBM is very flexible, it is sometimes difficult to treat it analytically. Therefore, it is common to restrict the affinity matrix to the simple community structure for , and otherwise. However, this parametrization largely restricts the graph ensemble that the SBM can originally express. Therefore, we instead consider the following affinity matrix Kawamoto and Kabashima (2017); Young et al. (2017).
where , is a column vector with all elements equal to unity. is an indicator matrix in which represents the densely connected module pair, and otherwise; the simple community structure is the special case where is the identity matrix. While this model is parametrized only by and , it can express various modular structures by the choices of , which is the input. We focus on undirected graphs, and thus, is symmetric.
As the SBM has the average degree , the affinity matrix of Eq. (2) can also be parametrized by and . Note that because both and are non-negative, is bounded as
where we defined as . In the following paragraphs, we consider the average degree as the input and parameterize the strength of the modular structure by a normalized parameter that linearly interpolates the maximum and minimum values of :
The graph exhibits assortative and disassortative structures when and , respectively.
Iii Statistical inference of the SBM
In principle, the model parameters and are learned by maximizing the marginalized log-likelihood and the set of module assignments is determined by the posterior distribution . However, because their exact computation is demanding, we use the EM algorithm for approximation.
Using the variational expression, can be expressed as
where is our estimate of the module-assignment distribution, and is the corresponding average. The second term represents the Kullback–Leibler divergence of distributions and . Equation (5) indicates that if our estimate coincides with the posterior distribution , then the model parameters can be learned by maximizing the first term of Eq. (5). Note that this is a double-optimization problem because is conditioned on . Therefore, the EM algorithm iteratively updates the module-assignment inference (E-step) and the model-parameter learning (M-step). We use the hat notation for the estimated (learned) value of the model parameter. Note also that we do not explicitly need the whole joint distribution of . Because the SBM only has pair-wise interactions, as confirmed from the calculation of the M-step, we only need to calculate the one-point and two-point marginals of the posterior estimates.
Although the E-step can be performed in various ways, we employ the BP algorithm. It solves for the marginal distributions of module assignments based on the tree approximation Yedidia et al. (2001, 2003); Mézard and Montanari (2009); Decelle et al. (2011b). This approximation is justified because we consider the sparse graphs, which are locally tree-like. We estimate the marginal distribution of the module assignment of vertex as follows.
where represents the neighboring vertices of ; is the normalization factor; and we use the sparse approximation. In the factor of neighboring vertices, represents the marginalized distribution of vertex with missing knowledge of edge . Although the elements in Eq. (6) are correlated to each other, in general, we can treat them independently when the graph is tree-like. Analogously to Eq. (6), is calculated as
where is the set of neighbors of where vertex is excluded, and is a normalization factor. Note that Eq. (7) constitutes of a set of closed equations, and so we can iteratively update the values of . The BP algorithm updates Eq. (7) until convergence, and it calculates the complete marginal by Eq. (6).
We next explain the M-step. Because we consider the affinity matrix that is parametrized by and , the only parameter that we need to update is and . From Eqs. (1) and (2), the extremum point of the first term of Eq. (5) can be calculated analytically. Following Ref. Kawamoto and Kabashima (2017), we obtain
Here we have
where we defined ; is the row vector . As often assumed Decelle et al. (2011b); Zhang et al. (2015); Kawamoto and Kabashima (2017), if there is no macroscopic fluctuation with respect to the number of vertices in each module, we can approximate that
where . Here, we introduced superscript to indicate the th update.
For estimating each element in , the extremum condition of the first term of Eq. (5) readily yields
Iv Transient dynamics of the M-step
The transient dynamics of the M-step is the key to deriving the algorithmic detectability threshold. The trajectories of the model parameter updates for the standard SBM are exemplified in Fig. 2a. The vertical axis represents the total variation from the mean value ; indicates the uniform structure. An important observation here is that the model parameter estimate is not attracted directly to the planted value. Instead, they are attracted to the point of uniform structure first. The EM algorithm encounters the algorithmic detectability threshold during this transient regime.
To gain a deeper insight about the non-linear update equation (11), we express it in terms of .
Note that when for any , i.e., when BP does not provide any additional knowledge compared to the prior distribution, we have , so will not be updated. Here, we introduce the normalized deviation and rewrite Eq. (13) as
Because we usually have no prior information about the distribution of the marginals, it is common to set uniformly random; i.e., and at the beginning of the algorithm. (In Appendix D, we show that this condition may be relaxed for the case of equally sized modules.)
Because the estimate of the module sizes are also updated concurrently, the M-step can be very complicated in general. Fortunately, however, the update dynamics for can be neglected for the analysis in this study. When we derive the detectability threshold in Sec. VI, we need to restrict ourselves to the case of equal-size modules. In that case, as shown in Appendix D, the distribution is kept randomized during the transient regime, and it yields the estimate that the module sizes are equal.
The specific shape of Eq. (14) is shown in Fig. 2b. The fixed points are located at , and . The former two fixed points indicate the parameters of the bipartite graph and the graph with completely disconnected modules, respectively. The last fixed point indicates the parameter of the uniform random graph. From Fig. 2b, we can confirm that and are unstable fixed points, while is a stable fixed point. Note that Eq. (14) is independent of the input graph during the transient regime. Moreover, because Eq. (14) corresponds to an arbitrary modular structure, this tendency holds irrespective of the specific structure that we assume in the model. Therefore, the M-step of the EM algorithm exhibits universal dynamics at the beginning of the algorithm. When the transient regime is over, the dynamics is no longer universal, and moves toward the planted value as long as it is detectable (Fig. 2c).
V Labeled SBM
Before we consider the algorithmic detectability threshold, we extend the standard SBM to the labeled SBM that has types of edges, i.e., ; represents the non-edge. We denote the set of -edges as , and therefore, (, ). Analogously to Eq. (1), the likelihood of the labeled SBM can be expressed as
where the affinity matrix element with respect to -edges obeys the normalization constraint for any and . We denote the average degree and the strength of the modular structure of -edges as and , respectively (). We also denote the fraction of -edges as , i.e., .
where is the -edge generalization of Eq. (7),
The vertex is a neighbor of such that and .
It is also straightforward to generalize Eq. (14) to the labeled SBM. Because the variables of different edge labels are not directly coupled, we can treat them separately. We can define analogously to in Eq. (4), and for each label ,
where . Corresponding to for each , and are generalized to and . Accordingly, we defined .
Vi Detectability threshold
We now derive the algorithmic detectability threshold of the EM algorithm. As in other inference-based detectability analyses, here, we need to impose further restrictions. We hereafter focus on the case where the module size is equal, i.e., for any , and the average degree of each module is equal, i.e., () for any . In addition, we assume that the matrix is common for all . In other words, while the edges of different types may indicate assortative and disassortative structures, they share the same planted modules.
The undetectable phase can be characterized as the phase in which we cannot retrieve any information about the planted module assignments through the BP update equation (16). In the present setting, it is equivalent to the condition where the factorized state is a stable fixed point of BP Decelle et al. (2011b); Mézard and Montanari (2009); Moore (2017). The factorized state has the form such that for any and , i.e., the state exhibits no signal of a likely-module assignment for any vertex.
vi.1 instability of the factorized state
The instability condition of the factorized state in a sparse graph, which is often termed the Kesten–Stigum bound, can be analyzed using the framework of tree reconstruction Janson and Mossel (2004); Mézard and Montanari (2009); Decelle et al. (2011b); Moore (2017). We assume that the graph is a tree and evaluate whether perturbations from the leaves are significant or negligible for the inference of the root vertex. We denote as the root vertex and as the descendant vertices in th generation. When the vertices at distance are perturbed as , the variation of the marginal probability at the root vertex is expressed as follows.
where we defined the non-backtracking matrix Krzakala et al. (2013) and the transfer matrix as
In general, Eq. (19) cannot be expressed as the tensor product of matrices and because depends on the edge label. However, because we assume that all the affinity matrices share the common matrix, the transfer matrix for each edge type differs only by a constant factor. Then, we can express Eq. (19) as
In the case of a simple community structure, when the perturbation from the factorized state is considered Decelle et al. (2011b); Krzakala et al. (2013); Zhang and Moore (2014), the elements of and are
For inferring the general modular structure, we analyze the BP algorithm of the transformed basis as considered in Kawamoto and Kabashima (2017). In this case, the transfer matrix is given by
From Eq. (22), the instability condition attributed to the perturbation is determined by the eigenvalues of and . Note that because the unit vector is a leading eigenvector of , we have , where is the leading eigenvalue of and is the second-leading eigenvalue of , which may be degenerated with the leading eigenvalue. Therefore, unless all the elements in the leading eigenvector of have the same sign, the instability condition of the factorized state, i.e., the detectable region is determined by
When all the elements in the leading eigenvector of do have the same sign, e.g., the sign of is the same for all , the leading eigenvector is unrelated to the modular structure. In this case, the second-leading eigenvalue determines the detectable region, i.e.,
The eigenvalue of that we should refer to is determined by the boundary of the spectral band, which we denote as , and the isolated eigenvalue, which we denote as . Note that the eigenvalues of changes dynamically because it is a function of the estimated value of the model parameters. When we initially assume a strong modular structure, i.e., large values of , holds. Then, the spectral band shrinks because of the universal dynamics of the M-step, until reaches unity. (For a specific example, see Sec. VII.1.) At this stage, the factorized state stabilizes if there is no isolated eigenvalue that is correlated to the planted modular structure. As we mentioned earlier, will no longer be updated once the factorized state is achieved. Hence, determines the values of that we should refer to at the detectability threshold. Given this estimate, the factorized state becomes unstable when is satisfied. Hence, the boundary condition of the detectable phase is given by
These two conditions coincide when the model parameters are correctly learned (i.e., the Nishimori condition), i.e., for all .
vi.2 spectral band
We first consider the boundary of the spectral band. By applying the result derived by the cavity method in Ref. Neri and Metz (2016), we have
In terms of ,
As shown in Appendix C, this can also be derived as an upper bound of the spectral band by using the method of types. In the case of two equally sized modules with a simple community structure under the Nishimori condition, the condition yields
This threshold is equal to the one derived in Ref. Heimlicher et al. (2012). As presented in Fig. 3, however, the numerical experiment shows that the actual boundary where the EM algorithm fails (open circles) does not coincide with Eq. (31) (dashed ellipse). Instead, the undetectable region can be well characterized by the shaded region; its boundary is the algorithmic detectability threshold that we will derive in Sec. VII.1. In addition, we can confirm that the condition coincides with the threshold obtained in Ref. Kawamoto and Kabashima (2017) under the Nishimori condition for the standard SBM with general modular structures.
vi.3 isolated eigenvalues
Next, we solve for the isolated eigenvalue . Given a graph, the eigenvalue equation of with respect to an eigenvector with an eigenvalue is
As the edge label is stochastically determined by the planted module assignments, we denote the eigenvector element together with the planted module assignments and as .
Note that the non-backtracking matrix is an oriented matrix, i.e., when , then . According to Ref. Neri and Metz (2016), the isolated eigenvalue of an oriented matrix can be obtained by solving the eigenvalue equation of the averaged quantities (Eq. (S59) in Ref. Neri and Metz (2016)). Because the eigenvector statistics have dependency on the module assignment in the present case, we let be the ensemble average of the eigenvector element with module assignment with respect to vertex . The isolated eigenvalue can be obtained by solving the equation for .
Taking the ensemble average of the eigenvector elements and the configuration average, we have
The probability that the neighboring vertex belongs to module is given as , and
is the configuration average for given , and is the probability that a neighboring vertex in module is connected to a vertex in module via an -edge.
Making use of the fact that the module sizes are equal, we have
The matrix can be expressed using , and the eigenvalues of Eq. (35) are written as follows.
The eigenvector that corresponds to in Eq. (36) is proportional to a unit vector. Thus, is the eigenvalue that we referred to , i.e.,
In terms of and ,
Vii Detectability phase-diagrams
In this section, we draw detectability phase-diagrams for some specific cases. Note that the algorithmic detectability threshold of the EM algorithm depends on the initial condition, i.e., we need to specify the initial estimates of the model parameters. As we will see for each example, the trajectory of the set of estimated model parameters is crucial to the geometry of the undetectable phase. In the following examples, we always set the number of edge types .
vii.1 Community structure with
We first derive the phase boundary of the shaded region in Fig. 3. This is a case of two equally sized modules (), and is equal to the identity matrix (i.e., ); this is often referred to as the symmetric SBM. Figure 4a shows the trajectories of the estimate for various instances of the labeled SBMs. All the planted model parameters are in the detectable region.
We set the initial estimate nearly at the corner of the -plane, . An example of the corresponding spectrum of the weighted non-backtracking matrix is shown in Fig. 4(i); upon setting the initial condition as shown, the boundary of the spectral band exceeds 1. As described in Secs. IV and VI.1, of each is attracted toward the point of the uniform graph at equal rates until it satisfies the condition (Fig. 4(ii)), or equivalently, for both . Given these estimates, the condition yields
Thereafter, when the graph is in the detectable region, moves to the planted value. The spectrum of is shown in Fig. 4(iii). On the other hand, as shown in Fig. 4b (circles and squares), does not reach the planted value in the undetectable region; provided that is initially located in the detectable region, it gets stuck when is satisfied.
A value of model parameter that exhibits a strong modular structure is empirically known as a better choice for the initial model parameter. Indeed, when is initially located deep in the undetectable region, the estimate does not move at all (triangles in Fig. 4b). This can be understood as the algorithmic detectability threshold; the condition is already satisfied at the beginning of the algorithm, and there is no isolated eigenvalue of that satisfies . Similarly, although is initially located in the undetectable region, when