Algorithmic detectability threshold of the stochastic blockmodel
The assumption that the values of the model parameters are known or correctly learned, i.e., the Nishimori condition, is one of the requirements for the detectability analysis of the stochastic blockmodel in statistical inference. In practice, however, the model parameters are rarely known beforehand, and there is no guarantee that they can be learned accurately. In this study, we consider the expectation–maximization (EM) algorithm with belief propagation (BP) and derive its algorithmic detectability threshold. Our analysis is not restricted to the community structure but covers general modular structures. Because the algorithm cannot always learn the planted model parameters correctly, the algorithmic detectability threshold is qualitatively different from the one obtained under the Nishimori condition.
Graph clustering is a technique to detect a macroscopic law of connectivity in a graph [?]. In other words, we expect that each vertex in a graph belongs to a module (or modules) and that the vertices in the same module are statistically equivalent. We then let an algorithm infer the most likely module assignments. In particular, the subset of the graph-clustering problem that focuses on the detection of densely connected (i.e., assortative) modules is termed community detection. A classic example of community detection is the detection of social groups in a social network, wherein the vertices and edges represent persons and friendships, respectively.
Often, the graph structures that are more general than the assortative modules are detected using statistical inference methods. In this approach, we infer the most-likely module assignment for each vertex by fitting graph data using a random graph model with a planted modular structure. The stochastic blockmodel (SBM) [?] is a canonical model used for this purpose. The SBM is an extension of the Erdős-Rényi random graph; each vertex in the SBM has a planted module assignment and the vertices in the same module have stochastically equivalent connection patterns. Thus, we can generate the graph instances of various modular structures.
The model parameters of the SBM smoothly connect the random graphs with a strong modular structure and the Erdős-Rényi random graph. Interestingly, as the strength of the modular structure decreases, before the SBM becomes equivalent to the Erdős-Rényi random graph, a phase transition occurs, and at this stage, it becomes impossible to infer the planted module assignment. This critical point is called the detectability threshold or the detectability limit [?]. The impossibility of inference stems from the fact that the fluctuations of the graph instances are not negligible, so that the graph instances generated from the SBM are statistically indistinguishable from those of the Erdős-Rényi random graph. This is a fundamental problem in graph clustering, and it offers an insight into the extent to which we should expect algorithms to work. This is a characteristic phenomenon of sparse graphs, i.e., graphs with a constant average degree, and it cannot be observed in dense graphs. In the dense regime, instead, another interesting problem, called the recovery problem [?], arises.
We focus on sparse undirected graphs. We do not consider graphs with self-loops and multi-edges or the SBM wherein a vertex belongs to multiple modules. Instead, we allow the graphs to have multiple types of edges [?]. Thus, we will extend our analysis to a variant of the SBM called the labeled stochastic blockmodel (labeled SBM). An instance of the labeled SBM is shown in Figure 1.
Several frameworks corresponding to algorithms such as the greedy [?], Monte Carlo [?], and expectation–maximization (EM) [?] algorithm are available for the statistical inference of the SBM. We analyze the performance of the EM algorithm, because it is scalable and suited for theoretical analysis. In particular, we consider belief propagation (BP) as its module-assignment inference (E-step), and the point estimate of the model parameters as its model-parameter learning (M-step).
In this study, we derive the algorithmic detectability threshold of the SBM using the EM algorithm. The detectability threshold is often defined as the fundamental limit where all polynomial-time algorithms fail; here, we distinguish such a threshold as the theoretical limit of detectability. There are two reasons why we focus on the algorithmic detectability threshold.
The first reason is that it helps in deriving a legitimate threshold in practice. It is known that Bayesian inference using BP achieves the theoretical limit of detectability [?], assuming that the correct values of the model parameters are known. Supported by this favorable property in theory and the good scalability of the algorithm, BP has been implemented to solve various SBM variants [?]. On the other hand, some non-Bayesian methods are known to be strictly suboptimal, e.g., [?]. However, this is not a fair comparison, because the threshold of Bayesian inference with BP is not an algorithmic detectability threshold; such inference can be regarded as the EM algorithm without the M-step. In practice, we do not know the planted model parameters beforehand, and there is no guarantee that they can be learned precisely. In fact, the performance of the EM algorithm generally depends on the initial values of the model parameters. To the best of our knowledge, the algorithmic detectability threshold that takes into account both the E-step and the M-step remains unknown, and we solve this problem analytically. The condition that the planted values of the model parameters are correctly learned is often referred to as the Nishimori condition [?].
The second reason why we focus on the algorithmic detectability threshold is that we need to observe the algorithmic infeasibility through the analysis. In general, even if we have data of extremely high dimensions, there exists a limit that an algorithm can handle correctly. Then, the inference task is algorithmically infeasible though it may be computationally feasible. This difficulty can be depicted in the detectability phase-diagram of the algorithm, whereas it cannot be observed from the theoretical limit of detectability because of the assumption of infinite learnability.
This paper is organized as follows. In Section 2, we explain the precise construction of the standard (i.e., binary label) SBM and its analytically tractable parametrization. We then explain the procedure of the EM algorithm (Section 3) and the behavior of its M-step (Section 4). From Section 5, we extend the SBM to the labeled SBM. After explaining how the formulation of the standard SBM is modified in the labeled SBM in Section 5, we derive the algorithmic detectability threshold in Section 6. In Section 7, we present the detectability phase-diagrams for some specific cases. In Section 8, we discuss the algorithmic infeasibility as a physical consequence of the algorithmic detectability threshold. Finally, Section 9 is devoted to the summary and discussion. While we focused on a limited case of the same problem in Ref. [?], in this study, we extend the results therein to as general a setting as possible.
We first explain how the standard SBM is generated. We consider a set of vertices and denote the number of modules as . For each vertex, we assign a module label with probability independently and randomly, where is an array that determines the relative size of each module. We denote an array of module assignments as . Given , an undirected edge is generated between vertices and independently with probability c, where is the adjacency matrix element with if and are connected, and if they are not connected (or connected via a non-edge). The matrix c is called the affinity matrix. It is a matrix and is of so that the resulting graph is sparse. We denote an edge between vertices and as , the set of edges as , and the number of edges as . Thus, the likelihood of the standard SBM is as follows.
The model parameters to be learned in the SBM are and c, and the set of module assignments is the latent variable that is to be inferred, given the adjacency matrix .
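As an illustration of this generative process, the following is a minimal Python sketch (the variable names `gamma` and `affinity` are ours, not the paper's notation):

```python
import numpy as np

def generate_sbm(n, gamma, affinity, seed=None):
    """Generate one instance of the standard SBM.

    n        : number of vertices
    gamma    : length-K array of prior module probabilities
    affinity : K x K matrix; vertices i, j are connected with
               probability affinity[sigma_i, sigma_j] / n (sparse regime)
    """
    rng = np.random.default_rng(seed)
    K = len(gamma)
    sigma = rng.choice(K, size=n, p=gamma)   # planted module assignments
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < affinity[sigma[i], sigma[j]] / n:
                A[i, j] = A[j, i] = 1        # undirected edge
    return A, sigma
```

Here the affinity matrix is scaled by 1/n so that the average degree stays constant as n grows, matching the sparse regime considered throughout the paper.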
Although the SBM is very flexible, it is sometimes difficult to treat analytically. Therefore, it is common to restrict the affinity matrix to the simple community structure for , and otherwise. However, this parametrization largely restricts the graph ensemble that the SBM can originally express. Therefore, we instead consider the following affinity matrix [?].
where , 1 is a column vector with all elements equal to unity. is an indicator matrix in which represents the densely connected module pair, and otherwise; the simple community structure is the special case where is the identity matrix. While this model is parametrized only by and , it can express various modular structures by the choices of , which is the input. We focus on undirected graphs, and thus, is symmetric.
As the SBM has the average degree c , the affinity matrix of Eq. (2) can also be parametrized by and . Note that because both and are non-negative, is bounded as
where we defined as . In the following paragraphs, we consider the average degree as the input and parameterize the strength of the modular structure by a normalized parameter that linearly interpolates the maximum and minimum values of :
The graph exhibits assortative and disassortative structures when and , respectively.
3 Statistical inference of the SBM
In principle, the model parameters and c are learned by maximizing the marginalized log-likelihood cc and the set of module assignments is determined by the posterior distribution c. However, because their exact computation is demanding, we use the EM algorithm for approximation.
Using the variational expression, c can be expressed as
where is our estimate of the module-assignment distribution, and is the corresponding average. The second term represents the Kullback–Leibler divergence of distributions and . Equation (4) indicates that if our estimate coincides with the posterior distribution c, then the model parameters can be learned by maximizing the first term of Eq. (4). Note that this is a double-optimization problem because c is conditioned on c. Therefore, the EM algorithm iteratively updates the module-assignment inference (E-step) and the model-parameter learning (M-step). We use the hat notation for the estimated (learned) value of the model parameter. Note also that we do not explicitly need the whole joint distribution of . Because the SBM only has pair-wise interactions, as confirmed from the calculation of the M-step, we only need to calculate the one-point and two-point marginals of the posterior estimates.
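The variational decomposition used here can be checked numerically on a toy two-state latent variable. The numbers below are arbitrary; the identity log P(A) = E_q[log(P(A, σ)/q(σ))] + KL(q ‖ P(σ|A)) holds exactly for any estimate q:

```python
import math

# Toy numerical check of the variational identity behind the EM algorithm:
#   log P(A) = sum_s q(s) log(P(A, s) / q(s)) + KL(q || P(s | A)).
joint = [0.3, 0.1]                         # P(A, s) for s = 0, 1 (arbitrary)
evidence = sum(joint)                      # P(A), the marginalized likelihood
posterior = [p / evidence for p in joint]  # P(s | A)
q = [0.6, 0.4]                             # an arbitrary estimate of the posterior
elbo = sum(qs * math.log(js / qs) for qs, js in zip(q, joint))
kl = sum(qs * math.log(qs / ps) for qs, ps in zip(q, posterior))
assert abs(math.log(evidence) - (elbo + kl)) < 1e-12
```

Maximizing the first term with the estimate q fixed is the M-step; matching q to the posterior, which sets the Kullback–Leibler term to zero, is the E-step.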
Although the E-step can be performed in various ways, we employ the BP algorithm. It solves for the marginal distributions of module assignments based on the tree approximation [?]. This approximation is justified because we consider the sparse graphs, which are locally tree-like. We estimate the marginal distribution of the module assignment of vertex as follows.
where represents the neighboring vertices of ; is the normalization factor; and we use the sparse approximation. In the factor of neighboring vertices, represents the marginalized distribution of vertex with missing knowledge of edge . Although the elements in Eq. (5) are correlated to each other, in general, we can treat them independently when the graph is tree-like. Analogously to Eq. (5), is calculated as
where is the set of neighbors of where vertex is excluded, and is a normalization factor. Note that Eq. (6) constitutes a set of closed equations, and so we can iteratively update the values of . The BP algorithm updates Eq. (6) until convergence, and then calculates the complete marginal by Eq. (5).
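A minimal sketch of these BP updates follows (ours, not the authors' implementation; the O(1/N) non-edge contribution, the so-called external field, is omitted for brevity, so this is only illustrative on small graphs):

```python
import numpy as np

def bp_marginals(A, gamma, affinity, n_iter=50, tol=1e-6, seed=None):
    """Simplified BP for the sparse SBM.

    The message psi[(i, j)] approximates the marginal of vertex i with
    missing knowledge of edge (i, j).  The non-edge ("external field")
    term is omitted, so this is only a sketch of the full algorithm.
    """
    rng = np.random.default_rng(seed)
    n, K = A.shape[0], len(gamma)
    neighbors = [np.flatnonzero(A[i]) for i in range(n)]
    # one message per directed edge, initialized uniformly at random
    psi = {(i, j): rng.dirichlet(np.ones(K)) for i in range(n) for j in neighbors[i]}
    for _ in range(n_iter):
        diff = 0.0
        for (i, j), old in list(psi.items()):
            m = gamma.copy()
            for k in neighbors[i]:
                if k == j:
                    continue          # exclude the target of the message
                m *= affinity.T @ psi[(k, i)]   # sum over the neighbor's label
            m /= m.sum()
            diff = max(diff, np.abs(m - old).max())
            psi[(i, j)] = m
        if diff < tol:                # fixed point of the closed equations
            break
    # the complete marginal of Eq. (5) uses all neighbors
    marg = np.zeros((n, K))
    for i in range(n):
        m = gamma.copy()
        for k in neighbors[i]:
            m *= affinity.T @ psi[(k, i)]
        marg[i] = m / m.sum()
    return marg
```

The directed-edge messages make the tree approximation explicit: each message excludes exactly one neighbor, so correlations along a single edge are never double-counted.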
We next explain the M-step. Because we consider the affinity matrix c that is parametrized by and , the only parameters that we need to update are and . From Eqs. (1) and (2), the extremum point of the first term of Eq. (4) can be calculated analytically. Following Ref. [?], we obtain
Here we have
where we defined ; is the row vector . As often assumed [?], if there is no macroscopic fluctuation with respect to the number of vertices in each module, we can approximate that
where . Here, we introduced superscript to indicate the th update.
For estimating each element in , the extremum condition of the first term of Eq. (4) readily yields
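Schematically, the M-step point estimates can be computed from the BP marginals. The following sketch is a naive variant of ours: it approximates the two-point marginal on each edge by the product of the one-point marginals, whereas the paper's estimator uses the proper two-point marginals:

```python
import numpy as np

def m_step(A, marg):
    """Naive M-step sketch: point estimates of the SBM parameters.

    marg : (n, K) one-point marginals from the E-step.
    The two-point marginals on edges are approximated here by products
    of one-point marginals, for brevity.
    """
    n, K = marg.shape
    gamma_hat = marg.mean(axis=0)            # estimated relative module sizes
    # expected number of edges between each module pair
    edge_counts = np.zeros((K, K))
    ii, jj = np.nonzero(np.triu(A, 1))
    for i, j in zip(ii, jj):
        pair = np.outer(marg[i], marg[j])
        edge_counts += pair + pair.T         # count each undirected edge both ways
    # affinity estimate: edges per possible pair, scaled by n (sparse convention)
    denom = np.outer(gamma_hat, gamma_hat) * n
    c_hat = edge_counts / denom
    return gamma_hat, c_hat
```

Iterating `bp_marginals`-style E-steps with this M-step is the structure of the EM algorithm analyzed in the paper; only the one-point and two-point marginals enter, as noted above.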
4 Transient dynamics of the M-step
The transient dynamics of the M-step is the key to deriving the algorithmic detectability threshold. The trajectories of the model-parameter updates for the standard SBM are exemplified in Figure 2a. The vertical axis represents the total variation c from the mean value c; indicates the uniform structure. An important observation here is that the model-parameter estimates are not attracted directly to the planted values. Instead, they are first attracted to the point of uniform structure. The EM algorithm encounters the algorithmic detectability threshold during this transient regime.
To gain a deeper insight into the non-linear update equation (10), we express it in terms of .
Note that when for any , i.e., when BP does not provide any additional knowledge compared to the prior distribution, we have , so will not be updated. Here, we introduce the normalized deviation and rewrite Eq. (11) as
Because we usually have no prior information about the distribution of the marginals, it is common to set uniformly random; i.e., and at the beginning of the algorithm. (In Appendix Section 14, we show that this condition may be relaxed for the case of equally sized modules.)
Because the estimates of the module sizes are also updated concurrently, the M-step can be very complicated in general. Fortunately, however, the update dynamics for can be neglected for the analysis in this study. When we derive the detectability threshold in Section 6, we need to restrict ourselves to the case of equal-size modules. In that case, as shown in Appendix Section 14, the distribution is kept randomized during the transient regime, and it yields the estimate that the module sizes are equal.
The specific shape of Eq. (12) is shown in Figure 2b. The fixed points are located at , and . The former two fixed points indicate the parameters of the bipartite graph and the graph with completely disconnected modules, respectively. The last fixed point indicates the parameter of the uniform random graph. From Figure 2b, we can confirm that and are unstable fixed points, while is a stable fixed point. Note that Eq. (12) is independent of the input graph during the transient regime. Moreover, because Eq. (12) corresponds to an arbitrary modular structure, this tendency holds irrespective of the specific structure that we assume in the model. Therefore, the M-step of the EM algorithm exhibits universal dynamics at the beginning of the algorithm. When the transient regime is over, the dynamics is no longer universal, and moves toward the planted value as long as it is detectable (Fig. 2c).
Before we consider the algorithmic detectability threshold, we extend the standard SBM to the labeled SBM that has types of edges, i.e., ; represents the non-edge. We denote the set of -edges as , and therefore, (, ). Analogously to Eq. (1), the likelihood of the labeled SBM can be expressed as
where the affinity matrix element with respect to -edges obeys the normalization constraint for any and . We denote the average degree and the strength of the modular structure of -edges as and , respectively (). We also denote the fraction of -edges as , i.e., .
where is the -edge generalization of Eq. (6),
The vertex is a neighbor of such that and .
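The labeled-SBM generative process introduced above can be sketched analogously to the standard SBM. In this sketch (our notation), `affinities[alpha - 1]` plays the role of the affinity matrix for edge type alpha, and type 0 is the non-edge:

```python
import numpy as np

def generate_labeled_sbm(n, gamma, affinities, seed=None):
    """Labeled-SBM sketch: affinities[a] is the K x K matrix for edge
    type a + 1; the non-edge (type 0) takes the remaining probability,
    enforcing the normalization constraint over edge types."""
    rng = np.random.default_rng(seed)
    K = len(gamma)
    s = len(affinities)                            # number of edge types
    sigma = rng.choice(K, size=n, p=gamma)         # shared planted modules
    labels = np.zeros((n, n), dtype=int)           # 0 means non-edge
    for i in range(n):
        for j in range(i + 1, n):
            p = np.array([aff[sigma[i], sigma[j]] / n for aff in affinities])
            p0 = 1.0 - p.sum()                     # non-edge probability
            alpha = rng.choice(s + 1, p=np.concatenate(([p0], p)))
            labels[i, j] = labels[j, i] = alpha
    return labels, sigma
```

Note that all edge types are drawn conditioned on the same planted module assignments, as assumed in the detectability analysis below.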
It is also straightforward to generalize Eq. (12) to the labeled SBM. Because the variables of different edge labels are not directly coupled, we can treat them separately. We can define analogously to in Eq. (3), and for each label ,
where . Corresponding to for each , and are generalized to and . Accordingly, we defined .
We now derive the algorithmic detectability threshold of the EM algorithm. As in other inference-based detectability analyses, here, we need to impose further restrictions. We hereafter focus on the case where the module sizes are equal, i.e., for any , and the average degree of each module is equal, i.e., () for any . In addition, we assume that the matrix is common for all . In other words, while the edges of different types may indicate assortative and disassortative structures, they share the same planted modules.
The undetectable phase can be characterized as the phase in which we cannot retrieve any information about the planted module assignments through the BP update equation (14). In the present setting, this is equivalent to the condition that the factorized state is a stable fixed point of BP [?]. The factorized state has the form such that for any and , i.e., the state exhibits no signal of a likely module assignment for any vertex.
6.1 Instability of the factorized state
The instability condition of the factorized state in a sparse graph, which is often termed the Kesten–Stigum bound, can be analyzed using the framework of tree reconstruction [?]. We assume that the graph is a tree and evaluate whether perturbations from the leaves are significant or negligible for the inference of the root vertex. We denote as the root vertex and as the descendant vertices in the th generation. When the vertices at distance are perturbed as , the variation of the marginal probability at the root vertex is expressed as follows.
where we defined the non-backtracking matrix [?] and the transfer matrix as
In general, Eq. ( ?) cannot be expressed as the tensor product of matrices and because depends on the edge label. However, because we assume that all the affinity matrices c share the common matrix, the transfer matrix for each edge type differs only by a constant factor. Then, we can express Eq. ( ?) as
In the case of a simple community structure, when the perturbation from the factorized state is considered [?], the elements of and are
For inferring the general modular structure, we analyze the BP algorithm of the transformed basis as considered in [?]. In this case, the transfer matrix is given by
From Eq. (17), the instability condition attributed to the perturbation is determined by the eigenvalues of and . Note that because the unit vector 1 is a leading eigenvector of , we have , where is the leading eigenvalue of and is the second-leading eigenvalue of , which may be degenerate with the leading eigenvalue. Therefore, unless all the elements in the leading eigenvector of have the same sign, the instability condition of the factorized state, i.e., the detectable region, is determined by
When all the elements in the leading eigenvector of do have the same sign, e.g., the sign of is the same for all , the leading eigenvector is unrelated to the modular structure. In this case, the second-leading eigenvalue determines the detectable region, i.e.,
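As a generic numerical illustration of these spectral conditions, the (unweighted) non-backtracking matrix of a small graph can be built explicitly. This sketch is not the paper's weighted version, but it shows the structure of the matrix whose spectrum is being discussed:

```python
import numpy as np

def nonbacktracking_matrix(edges):
    """Build the 2m x 2m non-backtracking matrix B of an undirected graph.

    Rows/columns are indexed by directed edges; B[(i->j), (j->k)] = 1
    whenever k != i, i.e., a walk may not immediately backtrack along
    the edge it arrived on.
    """
    directed = []
    for i, j in edges:
        directed += [(i, j), (j, i)]            # both orientations
    index = {e: t for t, e in enumerate(directed)}
    B = np.zeros((len(directed), len(directed)))
    for i, j in directed:
        for j2, k in directed:
            if j2 == j and k != i:              # continue without backtracking
                B[index[(i, j)], index[(j2, k)]] = 1.0
    return B
```

For a triangle, every directed edge has exactly one non-backtracking continuation, so B is a permutation matrix and its spectral radius is 1. On sparse random graphs, isolated eigenvalues outside the bulk of the spectrum carry the information about the planted modules, which is what the conditions above exploit.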
The eigenvalue of that we should refer to is determined by the boundary of the spectral band, which we denote as , and the isolated eigenvalue, which we denote as . Note that the eigenvalues of change dynamically because is a function of the estimated values of the model parameters. When we initially assume a strong modular structure, i.e., large values of , holds. Then, the spectral band shrinks because of the universal dynamics of the M-step, until reaches unity. (For a specific example, see Section 7.1.) At this stage, the factorized state stabilizes if there is no isolated eigenvalue that is correlated to the planted modular structure. As we mentioned earlier, will no longer be updated once the factorized state is achieved. Hence, determines the values of that we should refer to at the detectability threshold. Given this estimate, the factorized state becomes unstable when is satisfied. Thus, the boundary condition of the detectable phase is given by
These two conditions coincide when the model parameters are correctly learned, i.e., under the Nishimori condition, where for all .
We first consider the boundary of the spectral band. By applying the result derived by the cavity method in Ref. [?], we have
In terms of ,
As shown in Appendix Section 13, this can also be derived as an upper bound of the spectral band by using the method of types. In the case of two equally sized modules with a simple community structure under the Nishimori condition, the condition yields
This threshold is equal to the one derived in Ref. [?]. As presented in Figure 3, however, the numerical experiment shows that the actual boundary where the EM algorithm fails (open circles) does not coincide with Eq. (23) (dashed ellipse). Instead, the undetectable region can be well characterized by the shaded region; its boundary is the algorithmic detectability threshold that we will derive in Section 7.1. In addition, we can confirm that the condition coincides with the threshold obtained in Ref. [?] under the Nishimori condition for the standard SBM with general modular structures.
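For reference, the standard Kesten–Stigum condition for two equally sized modules with a simple community structure (the well-known form from the literature, with c_in and c_out the within- and between-module affinities and c the average degree) can be coded as:

```python
import math

def ks_detectable(c_in, c_out):
    """Kesten-Stigum condition for the symmetric two-module SBM:
    with correctly learned parameters, the planted structure is
    detectable when |c_in - c_out| > 2 * sqrt(c), c = (c_in + c_out) / 2."""
    c = (c_in + c_out) / 2.0
    return abs(c_in - c_out) > 2.0 * math.sqrt(c)
```

For example, `ks_detectable(8, 2)` holds (|6| > 2√5 ≈ 4.47) while `ks_detectable(5, 4)` does not; the numerical experiments around Figure 3 show that the boundary where the EM algorithm actually fails deviates from this condition.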
Next, we solve for the isolated eigenvalue . Given a graph, the eigenvalue equation of with respect to an eigenvector with an eigenvalue is
As the edge label is stochastically determined by the planted module assignments, we denote the eigenvector element together with the planted module assignments and as .
Note that the non-backtracking matrix is an oriented matrix, i.e., when , then . According to Ref. [?], the isolated eigenvalue of an oriented matrix can be obtained by solving the eigenvalue equation of the averaged quantities (Eq. (S59) in Ref. [?]). Because the eigenvector statistics depend on the module assignment in the present case, we let be the ensemble average of the eigenvector element with module assignment with respect to vertex . The isolated eigenvalue can be obtained by solving the equation for .
Taking the ensemble average of the eigenvector elements and the configuration average, we have
The probability that the neighboring vertex belongs to module is given as , and
is the configuration average for given , and is the probability that a neighboring vertex in module is connected to a vertex in module via an -edge.
Making use of the fact that the module sizes are equal, we have
The matrix can be expressed using , and the eigenvalues of Eq. (26) are written as follows.
The eigenvector that corresponds to in Eq. (27) is proportional to a unit vector. Thus, is the eigenvalue that we referred to as , i.e.,
In terms of and ,
In this section, we draw detectability phase-diagrams for some specific cases. Note that the algorithmic detectability threshold of the EM algorithm depends on the initial condition, i.e., we need to specify the initial estimates of the model parameters. As we will see for each example, the trajectory of the set of estimated model parameters is crucial to the geometry of the undetectable phase. In the following examples, we always set the number of edge types .
7.1 Community structure with
We first derive the phase boundary of the shaded region in Figure 3. This is a case of two equally sized modules (), and is equal to the identity matrix (i.e., ); this is often referred to as the symmetric SBM. Figure 4a shows the trajectories of the estimate for various instances of the labeled SBMs. All the planted model parameters are in the detectable region.
We set the initial estimate nearly at the corner of the -plane, . An example of the corresponding spectrum of the weighted non-backtracking matrix is shown in Figure 4(i); upon setting the initial condition as shown, the boundary of the spectral band exceeds 1. As described in Sections 4 and 6.1, of each is attracted toward the point of the uniform graph at equal rates until it satisfies the condition (Fig. 4(ii)), or equivalently, for both . Given these estimates, the condition yields
Thereafter, when the graph is in the detectable region, moves to the planted value. The spectrum of is shown in Figure 4(iii). On the other hand, as shown in Figure 4b (circles and squares), does not reach the planted value in the undetectable region; provided that is initially located in the detectable region, it gets stuck when is satisfied.
A model-parameter value that exhibits a strong modular structure is empirically known to be a better choice for the initial estimate. Indeed, when is initially located deep in the undetectable region, the estimate does not move at all (triangles in Figure 4b). This can be understood as the algorithmic detectability threshold; the condition is already satisfied at the beginning of the algorithm, and there is no isolated eigenvalue of that satisfies . Similarly, although is initially located in the undetectable region, when is satisfied, the estimate moves successfully to the planted value (diamonds in Figure 4b).
In the standard SBM, as long as the initial estimate of the model parameters is in the detectable region, we can confirm that there is no distinction between the algorithmic threshold and the threshold under the Nishimori condition; the undetectable region is given by in the case of a binary label . This is consistent with Theorem 3 of Ref. [?], which states that the model parameters are asymptotically learnable for .
The dependence on the initial estimate of the model parameters was also examined numerically in Ref. [?] for the standard SBM. If we substitute the value corresponding to the critical point in Fig. 5 (left) in Ref. [?] into Eq. (28), we obtain . In addition, according to Eq. (21), this result corresponds to the case where we start the algorithm with ; thus, the critical point was actually the algorithmic detectability threshold.
7.2 Community structure with
When as in Section 7.1, we observed that the rate of attraction of toward the point of the uniform random graph is equal (or almost equal) for all . As an example showing that this is not the case, let us consider the simple community structure with equally sized modules. The results are shown in Fig. ?. The value of is , and the update equation of is no longer symmetric with respect to . As a consequence, the rate of attraction in the transient dynamics of differs depending on whether we set or at the beginning of the algorithm.
Once we determine the point where the estimate x hits the boundary of the spectral band, we can derive the detectability threshold by . Note that the resulting detectability phase-diagram has a larger detectable region in the upper-right side than in the lower-left side; this reflects the fact that the edges within a module are more informative than the edges between modules when .
In the previous examples, we focused on the cases with . In this section, we describe the case of a non-community structure such that the -dependence of the detectability threshold is actually observed. We consider the graph with the matrix shown in Figure 6a, i.e., , , . As in Section 7.1, we set the initial condition that for any . All estimates are attracted toward the center of the phase space at equal rates until they satisfy the condition , which yields . Given these estimates, the condition yields
To obtain the results shown in Figure 6, although we could have used the EM algorithm with the restricted affinity matrix in Eq. (2), we instead used the affinity matrix with full degrees of freedom. We can confirm that the boundary of the detectability phase-diagram is still very accurate; thereby, the restriction of the affinity matrix that we imposed for analytical tractability does not have a crucial effect after all.
In this section, we focus on the case described in Section 7.1; for this case, we discuss the physical consequence of the distinction between the algorithmic detectability threshold, Eq. (30), and the detectability threshold under the Nishimori condition, Eq. (23).
Suppose that we have an instance of the standard SBM, whose edges indicate the assortative structure. However, the planted structure is undetectable by the EM algorithm because the graph is too sparse. In order to improve this situation, we can add some edges of a new type to the existing graph.
Because we obtain more information about the planted structure by adding these edges of a new type, in principle, the structure is more likely to be detectable. The question then arises whether such a prescription always strengthens the detectability, and whether it is even better to introduce yet another type of edge. In practice, neither statement is true. An algorithm does not always perform better, because it becomes more difficult to learn higher-dimensional model parameters. In other words, excessively high-order information will be, at some point, algorithmically infeasible to extract.
This infeasibility can also be explained in the reverse way: There are cases where discarding edges of one type improves the detectability. To observe this behavior, we identify a region of the undetectable phase in the phase diagram of where the corresponding graph becomes detectable upon discarding the edges of . Because the undetectable region of is given by
we can readily see that the striped region in Figure 7a is the phase where algorithmic infeasibility can be observed.
The improvement of the performance is confirmed in Figure 7b. This is a set of vertical histograms with respect to the overlap, the fraction of correctly classified vertices. Each histogram shows the distribution of overlaps for the given average degree of . As we increase , at some point, the graph will enter the undetectable region; according to Eq. (30), the critical value is . Indeed, the numerical experiment shows that the overlaps tend to be high for (blue) and tend to be close to , i.e., not better than chance, for (red). Note that our analytical results are of . Thus, there is a chance to retrieve the information of the planted modules even for because of the finite-size effect and vice versa. Note also that if we make the average degree even larger ( in the current case), the planted modules eventually become detectable again.
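The overlap used here can be computed by maximizing the agreement over permutations of the module labels; the following is a minimal sketch of the raw, unnormalized fraction (the paper's figures may use a normalized variant that maps chance level to zero):

```python
from itertools import permutations

def overlap(inferred, planted, K):
    """Fraction of correctly classified vertices, maximized over the K!
    permutations of the module labels (raw, unnormalized version)."""
    n = len(planted)
    best = 0.0
    for perm in permutations(range(K)):
        correct = sum(1 for a, b in zip(inferred, planted) if perm[a] == b)
        best = max(best, correct / n)
    return best
```

With two equally sized modules, random guessing yields an overlap of 1/2, which is why overlaps near that value in the histograms indicate that the inference is no better than chance.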
Importantly, the detectability threshold of , Eq. (31), is tangent to the detectability threshold under the Nishimori condition (dashed ellipse). Therefore, the emergence of the phase that exhibits the algorithmic infeasibility is a consequence of the algorithmic detectability threshold.