Measuring stability of spectral clustering
Abstract
As an indicator of the stability of spectral clustering of an undirected weighted graph into $k$ clusters, the $k$-th spectral gap of the graph Laplacian is often considered. The $k$-th spectral gap is characterized here as an unstructured distance to ambiguity, namely as the minimal distance of the Laplacian to arbitrary symmetric matrices with vanishing $k$-th spectral gap. As a more appropriate measure of stability, the structured distance to ambiguity of the $k$-clustering is introduced as the minimal distance of the Laplacian to Laplacians of the same graph with weights that are perturbed such that the $k$-th spectral gap vanishes. To compute a solution to this matrix nearness problem, a two-level iterative algorithm is proposed that uses a constrained gradient system of matrix differential equations in the inner iteration and a one-dimensional optimization of the perturbation size in the outer iteration. The structured and unstructured distances to ambiguity are compared on some example graphs. The numerical experiments show, in particular, that selecting the number of clusters according to the criterion of maximal stability can lead to different results for the structured and unstructured stability indicators.
Keywords: spectral clustering; clustering stability; matrix nearness problem; structured eigenvalue optimization.
AMS subject classifications: 15A18, 65K05
1 Introduction
Clustering, that is, identifying groups with similarities in a large data set, is a fundamental task in data analysis. In order to cluster an undirected weighted graph, several algorithms have been considered in the literature. Spectral clustering (originating with Fiedler [5]; see [19] and [21] for more recent accounts) offers significant advantages over more traditional combinatorial techniques such as direct $k$-means or single linkage. Quoting [21], “Results obtained by spectral clustering often outperform the traditional approaches; spectral clustering is very simple to implement and can be solved efficiently by linear algebra methods.”
The stability of spectral clustering algorithms is heuristically (and very conveniently) associated with the spectral gap in the graph Laplacian, that is, the difference between the $k$-th and $(k{+}1)$-st smallest eigenvalues of the Laplacian matrix in the case of $k$ clusters built from $k$ eigenvectors. Contrary to this, we formulate an ambiguity problem where it is asked how far the given Laplacian matrix is from the Laplacian matrix of a weighted graph with the same edges for which the $k$-th and $(k{+}1)$-st smallest eigenvalues of the graph Laplacian coalesce and for which therefore spectral clustering with $k$ clusters becomes ambiguous. This structured distance to ambiguity is introduced here as a measure of stability of spectral clustering, for any number $k$ of clusters. On the other hand, the $k$-th spectral gap divided by $\sqrt{2}$ is characterized as the distance of the Laplacian matrix to the much larger set of arbitrary, unstructured symmetric matrices with coalescing $k$-th and $(k{+}1)$-st eigenvalues.
A stability indicator is useful in selecting an appropriate number of clusters, by choosing $k$ such that the stability indicator is maximized. The structured distance to ambiguity, whose computation will be discussed in this paper, may be too costly computationally to be used as a general-purpose stability indicator in comparison with the spectral gap, which is available without further computations in spectral clustering. Nevertheless, comparing the structured distance to ambiguity and the spectral gap for representative examples within a class of graphs of interest gives insight into the reliability (or otherwise) of the spectral gap as a stability indicator.
The structured distance to ambiguity considered in this paper is similar in spirit to other structured robustness measures that arise in matrix and control theory, such as stability radii, passivity distances, distance to singularity, etc. [13, 11, 12, 4, 7, 10, 17]. The notion of stability considered here is, however, different from [22], where a statistical perspective on clustering stability is developed.
A main objective of this paper is to show how the structured distance to ambiguity can be computed. The proposed algorithm is an iterative algorithm, where in each step a pair of eigenvalues and associated eigenvectors of the Laplacian of a graph with perturbed weights are computed. For a large sparse graph (where the number of edges leaving any vertex is moderately bounded), these computations can typically be done with a complexity that is linear in the number of vertices.
A feature of this algorithm in common with recent algorithms for eigenvalue optimization as given in [8, 7, 10, 9, 3] is a two-level procedure for matrix nearness problems, where in an inner iteration a gradient flow drives perturbations to the original matrix of a fixed size into a (local) minimum of a nonnegative functional that depends on eigenvalues and eigenvectors, and in an outer iteration the perturbation size is optimized such that the functional becomes zero.
The paper is organized as follows. In Section 2 we recall basic concepts of spectral clustering. In Section 3 we introduce the structured distance to ambiguity (SDA) as a measure of stability for spectral clustering. In Section 4 we describe a numerical method to compute the SDA, which requires the solution of a structured matrix nearness problem. We propose a two-level iterative method, which is based on a gradient system of matrix differential equations in an inner iteration and a one-dimensional optimization of the perturbation size in an outer iteration, and we discuss algorithmic aspects. In Section 5 we present the results of numerical experiments where the spectral gap (i.e., the unstructured distance to ambiguity) and the theoretically more appropriate measure of stability discussed here (i.e., the structured distance to ambiguity) are compared on some classes of graphs.
2 Spectral clustering
Let $G = (\mathcal{V}, \mathcal{E})$ be an undirected graph with vertex set $\mathcal{V} = \{1, \dots, n\}$ and edge set $\mathcal{E}$. We assume that the graph is weighted, that is, to each edge $(i,j) \in \mathcal{E}$ between two vertices $i$ and $j$ a nonnegative weight $w_{ij} = w_{ji} \ge 0$ is associated. We set $w_{ij} = 0$ for $(i,j) \notin \mathcal{E}$. The weighted adjacency matrix of the graph is the matrix $W = (w_{ij})_{i,j=1}^n$.
The degrees $d_i = \sum_{j=1}^n w_{ij}$ are the elements of the diagonal matrix $D = \mathrm{diag}(d_1, \dots, d_n)$.
The (unnormalized) Laplacian of the graph is given by the matrix $L = L(W)$,
$$L(W) = D - W.$$
The Laplacian matrix $L$ is symmetric and positive semidefinite; since $L \mathbf{1} = 0$ by construction, with $\mathbf{1} = (1, \dots, 1)^\top$, $\lambda_1 = 0$ is the smallest eigenvalue of $L$. Note that the matrix $L$ does not depend on (possible) diagonal elements of the matrix $W$, which means that self-edges do not change the graph Laplacian. The graph Laplacian and its eigenvectors provide important instruments for spectral clustering, as stated by the following theorems.
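As a concrete illustration of these definitions, the matrices $W$, $D$, and $L$ can be assembled directly; the small weighted graph below (two strongly coupled pairs of vertices, weakly joined) is a hypothetical example, not one taken from this paper.

```python
import numpy as np

# Hypothetical weighted undirected graph on 4 vertices:
# two strongly coupled pairs joined by weak edges.
W = np.array([
    [0.0, 2.0, 0.1, 0.0],
    [2.0, 0.0, 0.0, 0.1],
    [0.1, 0.0, 0.0, 2.0],
    [0.0, 0.1, 2.0, 0.0],
])

D = np.diag(W.sum(axis=1))        # diagonal matrix of degrees d_i
L = D - W                         # unnormalized graph Laplacian L(W)

eigvals = np.linalg.eigvalsh(L)   # ascending eigenvalues of the symmetric L
# By construction L @ 1 = 0, so the smallest eigenvalue is 0.
```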
[Bipartition, M. Fiedler [5]] Let $W$ be the weight matrix of an undirected graph and $L = L(W)$ the corresponding Laplacian matrix. Let $\lambda_1 \le \lambda_2 \le \dots \le \lambda_n$ be the eigenvalues of $L$. Then, the graph is disconnected if and only if $\lambda_2 = 0$. Moreover, if $\lambda_2 = 0 < \lambda_3$, then the entries of the corresponding eigenvector orthogonal to $\mathbf{1}$ assume only two different values, of different sign, which mark the membership to the two connected components.
If $\lambda_2$ is a simple eigenvalue, then the corresponding eigenvector is known as the Fiedler vector. In spectral graph theory, inspired by Theorem 2, it is common to compute the second smallest eigenvalue $\lambda_2$ of $L$, label the positive components of the Fiedler vector as belonging to one subset and the negative ones to another subset, and in this way obtain a natural bipartition of the graph. However, this becomes unreliable when a small perturbation of the weights yields a coalescence of the eigenvalues $\lambda_2$ and $\lambda_3$.
More generally we have the following result (see e.g. [21]). For a subset of vertices $\mathcal{V}' \subseteq \mathcal{V}$, we here denote by $\mathbb{1}_{\mathcal{V}'} \in \mathbb{R}^n$ the indicator vector whose $i$-th entry is equal to $1$ if $i \in \mathcal{V}'$ and is equal to zero otherwise.
[$k$-partition] Let $W$ be the weight matrix of an undirected graph and $L = L(W)$ the corresponding Laplacian matrix. Then the multiplicity $k$ of the eigenvalue $0$ (the dimension of $\ker L$) equals the number of connected components $\mathcal{V}_1, \dots, \mathcal{V}_k$ in the graph. The eigenspace of the eigenvalue $0$ is spanned by the indicator vectors $\mathbb{1}_{\mathcal{V}_1}, \dots, \mathbb{1}_{\mathcal{V}_k}$.
Nonempty sets $\mathcal{V}_1, \dots, \mathcal{V}_k \subseteq \mathcal{V}$ form a clustering of the graph if $\mathcal{V}_1 \cup \dots \cup \mathcal{V}_k = \mathcal{V}$ and $\mathcal{V}_i \cap \mathcal{V}_j = \emptyset$ for $i \ne j$.
Similarly to the case of two clusters, this result motivates an algorithm for clustering a graph into $k$ components, which is reviewed in Algorithm 1; see [21]. For the classical $k$-means algorithm we refer the reader e.g. to [16].
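Algorithm 1 itself is not reproduced here, but the standard procedure it reviews (embed the vertices via the eigenvectors of the $k$ smallest Laplacian eigenvalues, then cluster the embedded rows by $k$-means) can be sketched as follows. The graph and the minimal, deterministic farthest-point $k$-means are illustrative stand-ins for the classical $k$-means of [16].

```python
import numpy as np

def spectral_clustering(W, k, iters=50):
    """Sketch of standard spectral clustering: embed vertices via the
    eigenvectors of the k smallest Laplacian eigenvalues, then run a
    minimal, deterministic k-means on the rows of the embedding."""
    L = np.diag(W.sum(axis=1)) - W
    _, U = np.linalg.eigh(L)              # columns ordered by ascending eigenvalue
    X = U[:, :k]                          # n x k spectral embedding
    # farthest-point initialization, then plain Lloyd iterations
    centers = [X[0]]
    for _ in range(k - 1):
        dist = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[int(np.argmax(dist))])
    centers = np.array(centers)
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Hypothetical example: two triangles joined by one weak edge,
# which should yield a clean clustering into 2 components.
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.05
labels = spectral_clustering(W, 2)
```

Any robust $k$-means variant can replace the minimal one used here; only the spectral embedding step is essential to the method.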
Analogous results and algorithms extend to the normalized Laplacian $D^{-1/2} L D^{-1/2}$.
3 Structured distance to ambiguity as a stability measure for spectral clustering
In order to evaluate the robustness of the spectral clustering algorithm, it is essential to quantify the sensitivity to perturbations of the invariant subspace associated with the eigenvectors used by the algorithm. Suppose that a small perturbation of the weights of the graph makes the eigenvalues $\lambda_k$ and $\lambda_{k+1}$ of the Laplacian coalesce. Then the eigenvector associated with the computed $k$-th smallest eigenvalue can change completely and hence can yield a different clustering. It has been suggested in the literature to use the spectral gap $\lambda_{k+1} - \lambda_k$ as an indicator of the stability of the clustering. For unstructured perturbations of the graph Laplacian, this is motivated by the Davis–Kahan theorem (see, e.g., [20] and [6]), which tells us that the distance between the eigenspaces of the Laplacian $L$ of a graph and of any perturbed symmetric matrix has a bound that is proportional to the perturbation size and inversely proportional to the spectral gap $\lambda_{k+1} - \lambda_k$ of the Laplacian $L$. In another direction, which is more related to the concepts studied here, the Hoffman–Wielandt theorem [15, p. 368] yields that the spectral gap is characterized as an unstructured distance to ambiguity (up to the scaling factor $1/\sqrt{2}$),
$$\frac{\lambda_{k+1} - \lambda_k}{\sqrt{2}} = \min \Bigl\{ \|\widehat{L} - L\|_F : \widehat{L} \in \mathbb{R}^{n \times n} \text{ is a symmetric matrix such that its } k\text{-th and } (k{+}1)\text{-st smallest eigenvalues coalesce} \Bigr\}. \qquad (1)$$
Here, $\|A\|_F$ denotes the Frobenius norm of a matrix $A$.
The interpretation is that applying Algorithm 1 is justified when the gap $\lambda_{k+1} - \lambda_k$ is relatively large; otherwise a small perturbation of the weight matrix may yield coalescence of $\lambda_k$ and $\lambda_{k+1}$ and significantly change the clustering.
The weakness of this approach is that it considers unstructured perturbations of the Laplacian, as opposed to the admissible perturbations that are Laplacians of a perturbed weight matrix $\widetilde{W}$ that preserves the symmetry and nonnegativity, i.e., $\widetilde{w}_{ij} = \widetilde{w}_{ji} \ge 0$, and the sparsity pattern of $W$, i.e., $\widetilde{w}_{ij} = 0$ for $(i,j) \notin \mathcal{E}$. In (1), the minimizer is given by
$$\Delta = \frac{\lambda_{k+1} - \lambda_k}{2} \bigl( x_k x_k^\top - x_{k+1} x_{k+1}^\top \bigr),$$
where $x_k$ and $x_{k+1}$ denote normalized eigenvectors of $L$ to the eigenvalues $\lambda_k$ and $\lambda_{k+1}$, respectively. Apart from exceptional cases, $L + \Delta$ is not the Laplacian of a graph.
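The characterization (1) and the rank-two minimizer $\Delta$ can be checked numerically; the path graph below is a hypothetical example.

```python
import numpy as np

# Hypothetical example: path graph on 4 vertices with unit weights.
W = np.diag([1.0, 1.0, 1.0], 1)
W = W + W.T
L = np.diag(W.sum(axis=1)) - W

k = 1                                   # 0-based: gap between 2nd and 3rd smallest eigenvalues
lam, X = np.linalg.eigh(L)
gap = lam[k + 1] - lam[k]
dist = gap / np.sqrt(2.0)               # unstructured distance to ambiguity, cf. (1)

# The minimizer moves lambda_k and lambda_{k+1} to their common midpoint.
Delta = 0.5 * gap * (np.outer(X[:, k], X[:, k]) - np.outer(X[:, k + 1], X[:, k + 1]))
mu = np.linalg.eigvalsh(L + Delta)      # coalescent k-th and (k+1)-st eigenvalues
```

Note that `L + Delta` has nonzero row sums in general, confirming that the unstructured minimizer is typically not the Laplacian of a graph.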
We therefore propose the following stability measure for clustering into $k$ clusters, which is given by a structured distance to ambiguity:
$$d_k(W) = \min \Bigl\{ \| L(\widetilde{W}) - L(W) \|_F : \text{the } k\text{-th and } (k{+}1)\text{-st smallest eigenvalues of } L(\widetilde{W}) \text{ coalesce} \Bigr\}$$
under the constraints $\widetilde{w}_{ij} = \widetilde{w}_{ji} \ge 0$ for $(i,j) \in \mathcal{E}$ and $\widetilde{w}_{ij} = 0$ for $(i,j) \notin \mathcal{E}$.
In view of (1), it is clear that
$$d_k(W) \ge \frac{\lambda_{k+1} - \lambda_k}{\sqrt{2}},$$
and often $d_k(W)$ is substantially larger than the scaled spectral gap. In view of Theorem 2, we further note that $d_1(W)$ is the Frobenius-norm distance of the Laplacian to that of a nearest disconnected graph.
For clustering a graph it is usually not known beforehand what the best choice of the number $k$ of clusters is. Asking for the most stable clustering of a graph (most stable with respect to admissible perturbations of the weights) determines the optimal number $k^\star$ as
$$k^\star = \arg\max_k \, d_k(W),$$
where $k$ can be limited to $2 \le k \le k_{\max}$, with $k_{\max}$ given a priori or chosen such that $d_{k_{\max}}(W)$ is smaller than some threshold. This criterion for the selection of $k$ is considered here instead of choosing $k$ such that the spectral gap $\lambda_{k+1} - \lambda_k$ is maximized. The latter is computationally cheaper, but appears less meaningful. In our numerical experiments in Section 5 we will compare the two criteria for some families of graphs.
Remark 3.1
There are obvious alternatives to the above stability measure:

Instead of minimizing the perturbation of the Laplacian, $\|L(\widetilde{W}) - L(W)\|_F$, we might minimize the perturbation of the weights, $\|\widetilde{W} - W\|_F$.

Instead of the unnormalized Laplacian $L = D - W$ we might work with the normalized Laplacian $D^{-1/2} L D^{-1/2}$.
In the following we concentrate on the structured distance to ambiguity $d_k(W)$ as given above, but our algorithm for computing $d_k(W)$ is readily extended to the other two cases.
4 Computation of the structured distance to ambiguity
In this section we describe an approach to compute the stability measure defined in the previous section.
4.1 Outline of the computational approach
Our approach is summarized by the following twolevel method:

Inner iteration: Given $\varepsilon > 0$, we aim to compute a symmetric matrix $E$ with the same sparsity pattern as $W$ (i.e., $e_{ij} = 0$ if $(i,j) \notin \mathcal{E}$), with $E$ of unit Frobenius norm, with $W + \varepsilon E \ge 0$ (with componentwise inequality) such that the difference between the $(k{+}1)$-st smallest and the $k$-th smallest eigenvalue of $L(W + \varepsilon E)$ is minimal. The obtained minimizer is denoted by $E(\varepsilon)$.

Outer iteration: We compute the smallest value $\varepsilon^\star$ of $\varepsilon$ such that the $k$-th and $(k{+}1)$-st smallest eigenvalues of $L(W + \varepsilon E(\varepsilon))$ coalesce.
In order to compute $E(\varepsilon)$ for a given $\varepsilon$, we make use of a constrained gradient system for the functional
$$F_\varepsilon(E) = \lambda_{k+1}\bigl( L(W + \varepsilon E) \bigr) - \lambda_k\bigl( L(W + \varepsilon E) \bigr) \qquad (2)$$
under the constraints of unit Frobenius norm of $E$, the nonnegativity $W + \varepsilon E \ge 0$, and the symmetry and the sparsity pattern of $E$.
In the outer iteration we compute the optimal $\varepsilon$, denoted $\varepsilon^\star$, by a combined Newton-bisection method. We then have
$$d_k(W) = \| L(W + \varepsilon^\star E(\varepsilon^\star)) - L(W) \|_F.$$
The algorithm computes a perturbed weight matrix $\widetilde{W} = W + \varepsilon^\star E(\varepsilon^\star)$ whose Laplacian has coalescent $k$-th and $(k{+}1)$-st smallest eigenvalues, $\lambda_k = \lambda_{k+1}$.
4.2 Gradient of the functional
We denote by $\|\cdot\|_F$ the Frobenius norm on $\mathbb{R}^{n \times n}$ and by $\langle X, Y \rangle = \mathrm{trace}(X^\top Y)$ the corresponding inner product.
For the edge set $\mathcal{E}$ and for an arbitrary matrix $A = (a_{ij}) \in \mathbb{R}^{n \times n}$, we define the symmetric projection onto the sparsity pattern given by $\mathcal{E}$ as
$$\bigl( P_{\mathcal{E}}(A) \bigr)_{ij} = \begin{cases} \dfrac{a_{ij} + a_{ji}}{2}, & (i,j) \in \mathcal{E}, \\ 0, & (i,j) \notin \mathcal{E}. \end{cases}$$
We denote by $\mathcal{S}$ the set of symmetric matrices in $\mathbb{R}^{n \times n}$ with sparsity pattern $\mathcal{E}$ and note that $P_{\mathcal{E}}$ is the orthogonal projection from $\mathbb{R}^{n \times n}$ to $\mathcal{S}$: $\langle P_{\mathcal{E}}(A), B \rangle = \langle A, B \rangle$ for all $B \in \mathcal{S}$. The Laplacian operator is a linear map
$$L : \mathcal{S} \to \mathbb{R}^{n \times n}, \qquad L(E) = \mathrm{diag}(E \mathbf{1}) - E.$$
Its adjoint $L^* : \mathbb{R}^{n \times n} \to \mathcal{S}$ with respect to the Frobenius inner product,
$$\langle L(E), Z \rangle = \langle E, L^*(Z) \rangle \quad \text{for all } E \in \mathcal{S}, \; Z \in \mathbb{R}^{n \times n},$$
is given in the following lemma. {lemma} For $Z = (z_{ij}) \in \mathbb{R}^{n \times n}$, let $z \in \mathbb{R}^n$ be defined by $z_i = z_{ii}$ for all $i$. Then,
$$L^*(Z) = P_{\mathcal{E}} \bigl( z \mathbf{1}^\top - Z \bigr),$$
where $z$ is the vector of the diagonal entries of $Z$. Furthermore, if $\mathcal{E}$ contains no self-loops, i.e., $(i,i) \notin \mathcal{E}$ for all $i$, then we have for $(i,j) \in \mathcal{E}$,
$$\bigl( L^*(Z) \bigr)_{ij} = \frac{z_{ii} + z_{jj}}{2} - \frac{z_{ij} + z_{ji}}{2}.$$
We have, for all $E \in \mathcal{S}$ and $Z \in \mathbb{R}^{n \times n}$,
$$\langle L(E), Z \rangle = \sum_{i,j} e_{ij} \, (z_{ii} - z_{ij}) = \sum_{(i,j) \in \mathcal{E}} e_{ij} \Bigl( \frac{z_{ii} + z_{jj}}{2} - \frac{z_{ij} + z_{ji}}{2} \Bigr) = \bigl\langle E, P_{\mathcal{E}}(z \mathbf{1}^\top - Z) \bigr\rangle,$$
where the second equality uses the symmetry and the sparsity pattern of $E$, and the result follows.
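The adjoint identity of the lemma is easy to verify numerically; the 4-cycle sparsity pattern and the random matrices below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
# Hypothetical sparsity pattern: a 4-cycle, no self-loops.
pattern = np.zeros((n, n), dtype=bool)
for i, j in [(0, 1), (1, 2), (2, 3), (3, 0)]:
    pattern[i, j] = pattern[j, i] = True

# Random symmetric E supported on the pattern, and an arbitrary test matrix Z.
S = rng.random((n, n))
E = np.where(pattern, 0.5 * (S + S.T), 0.0)
Z = rng.random((n, n))

LE = np.diag(E.sum(axis=1)) - E                  # L(E) = diag(E 1) - E
z = np.diag(Z)                                   # vector of diagonal entries of Z
Lstar = np.where(pattern, 0.5 * (np.add.outer(z, z) - (Z + Z.T)), 0.0)  # L*(Z)

lhs = np.sum(LE * Z)      # <L(E), Z>
rhs = np.sum(E * Lstar)   # <E, L*(Z)>
```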
We make use of the following result [3, Lemma 3.2], which provides a basic tool for the derivation of the gradient system we use to solve the problem. {lemma} Consider a regular path of feasible matrices $W + \varepsilon E(t)$, that is, such that $E(t) \in \mathcal{S}$ is symmetric and has the same sparsity pattern as $W$; denote by $\lambda(t)$ a simple eigenvalue of the Laplacian matrix $L(W + \varepsilon E(t))$ and by $x(t)$ the associated normalized eigenvector, i.e. such that $\|x(t)\| = 1$. Then (omitting the argument $t$)
$$\dot{\lambda} = \varepsilon \, \bigl\langle \dot{E}, \, L^*(x x^\top) \bigr\rangle = \varepsilon \, \bigl\langle \dot{E}, \, P_{\mathcal{E}} \bigl( (x \circ x) \mathbf{1}^\top - x x^\top \bigr) \bigr\rangle, \qquad (3)$$
where $x \circ x$ denotes the vector of squares of the entries of the vector $x$. Applied to the functional (2), with normalized eigenvectors $x_k$ and $x_{k+1}$ to the eigenvalues $\lambda_k$ and $\lambda_{k+1}$ of $L(W + \varepsilon E)$, this yields
$$\frac{d}{dt} F_\varepsilon(E(t)) = \varepsilon \, \langle \dot{E}, G_\varepsilon(E) \rangle \quad \text{with the gradient} \quad G_\varepsilon(E) = L^* \bigl( x_{k+1} x_{k+1}^\top - x_k x_k^\top \bigr). \qquad (4)$$
4.3 Gradient flow under the unit-norm constraint
We consider now the constrained gradient system associated with the functional $F_\varepsilon$ under the constraint that $E$ has unit Frobenius norm. We then get the following matrix differential equation:
$$\dot{E} = -G_\varepsilon(E) + \kappa E, \qquad \kappa = \frac{\langle G_\varepsilon(E), E \rangle}{\langle E, E \rangle}, \qquad (5)$$
with $G_\varepsilon(E) = L^* \bigl( x_{k+1} x_{k+1}^\top - x_k x_k^\top \bigr)$. Here we remark that the denominator cannot vanish, because $\langle E, E \rangle = 0$ would imply that $E = 0$, contradicting the constraint $\|E\|_F = 1$. The role of $\kappa$ is that of a Lagrange multiplier that ensures that the constraint is satisfied. The given formula for $\kappa$ is obtained as follows: Differentiating the constraint $\|E(t)\|_F^2 = 1$ gives $\langle E, \dot{E} \rangle = 0$. Taking the inner product of both sides of the differential equation with $E$ yields
$$0 = \langle \dot{E}, E \rangle = -\langle G_\varepsilon(E), E \rangle + \kappa \, \langle E, E \rangle,$$
which gives $\kappa$.
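A crude explicit Euler discretization of (5) illustrates the monotone decrease of the eigenvalue gap along the flow; the two-triangle graph, the perturbation size $\varepsilon$, and the step size are hypothetical choices, and the nonnegativity constraint of Section 4.4 is not enforced in this minimal sketch.

```python
import numpy as np

def laplacian(W):
    return np.diag(W.sum(axis=1)) - W

def gradient(W, E, eps, k):
    """G_eps(E) = L*(x_{k+1} x_{k+1}^T - x_k x_k^T), restricted to the
    sparsity pattern of W and symmetrized (cf. the lemmas above).
    Returns the gradient and the current gap lambda_{k+1} - lambda_k."""
    lam, X = np.linalg.eigh(laplacian(W + eps * E))
    Z = np.outer(X[:, k + 1], X[:, k + 1]) - np.outer(X[:, k], X[:, k])
    z = np.diag(Z)
    G = 0.5 * (np.add.outer(z, z) - (Z + Z.T))
    G[W == 0] = 0.0                       # keep the sparsity pattern of W
    return G, lam[k + 1] - lam[k]

# Hypothetical graph: two triangles joined by a bridge edge; k = 1 in
# 0-based indexing, i.e. the gap relevant for 2 clusters.
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0
k, eps, h = 1, 0.4, 0.02

G, F0 = gradient(W, np.zeros_like(W), eps, k)
E = -G / np.linalg.norm(G)                # initial guess: normalized free gradient
F = F0
for _ in range(300):                      # explicit Euler steps of (5)
    G, F = gradient(W, E, eps, k)
    kappa = np.sum(G * E)                 # Lagrange multiplier <G, E> (||E|| = 1)
    E = E + h * (-G + kappa * E)
    E = E / np.linalg.norm(E)             # re-normalize against discretization drift
```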
4.4 Nonnegativity constraint
It may happen that along the solution trajectory of (5) some entry of $W + \varepsilon E$ becomes negative. In our experiments we never observed such a situation, although it is a potential occurrence. In this subsection we explain how to deal with the further constraint $(W + \varepsilon E)_{ij} \ge 0$ for all $(i,j) \in \mathcal{E}$.
One possibility would be to follow the lines of [3] and consider KKT conditions by managing active constraints and integrating a piecewise smooth system of differential equations. Another possibility is to add a penalization term such as the following:
$$\Pi(E) = \frac{1}{2} \sum_{(i,j) \in \mathcal{E}} \bigl( \min(w_{ij} + \varepsilon e_{ij}, \, 0) \bigr)^2,$$
where $\Pi(E) \ge 0$ and $\Pi(E) = 0$ exactly when $W + \varepsilon E \ge 0$, and to minimize the functional
$$F_\varepsilon(E) + c \, \Pi(E)$$
for increasing values of the penalty parameter $c > 0$, letting $c \to \infty$. We have
$$\frac{d}{dt} \Pi(E(t)) = \varepsilon \, \bigl\langle \dot{E}, \, P_{\mathcal{E}} \bigl( \min(W + \varepsilon E, \, 0) \bigr) \bigr\rangle,$$
with the minimum taken entrywise, giving an extra term in the gradient: with
$$G = G_\varepsilon(E) + c \, P_{\mathcal{E}} \bigl( \min(W + \varepsilon E, \, 0) \bigr), \qquad (6)$$
the gradient system (5) becomes
$$\dot{E} = -G + \langle G, E \rangle \, E. \qquad (7)$$
4.5 Monotonicity and stationary points
The following monotonicity result follows directly from the construction of the gradient system.
Let $E(t)$ of unit Frobenius norm satisfy the differential equation (7) with $G$ of (6). Then, $F_\varepsilon(E(t))$ decreases monotonically with $t$. Moreover, if $\lambda_k(t) < \lambda_{k+1}(t)$ for all $t$, then $F_\varepsilon(E(t))$ converges monotonically to a nonnegative limit as $t \to \infty$,
where $\lambda_k(t)$ and $\lambda_{k+1}(t)$ are the $k$-th and $(k{+}1)$-st smallest eigenvalues of $L(W + \varepsilon E(t))$.
Using (7) we obtain, in the case of inactive nonnegativity constraints (so that $G = G_\varepsilon(E)$), with $\kappa = \langle E, G \rangle$,
$$\frac{d}{dt} F_\varepsilon(E(t)) = \varepsilon \, \langle \dot{E}, G \rangle = \varepsilon \bigl( -\langle G, G \rangle + \kappa \, \langle E, G \rangle \bigr) = -\varepsilon \bigl( \|G\|_F^2 - \langle E, G \rangle^2 \bigr) \le 0, \qquad (8)$$
where the final inequality follows directly from the Cauchy–Schwarz inequality together with $\|E\|_F = 1$. This yields the monotone decay of $F_\varepsilon(E(t))$ and hence also the second statement, since $F_\varepsilon(E(t)) \ge 0$ if $\lambda_{k+1}(t) \ge \lambda_k(t)$.
Equilibrium points of (7) are characterized as follows.
The following statements are equivalent along solutions of (7):

$\frac{d}{dt} F_\varepsilon(E(t)) = 0$.

$\dot{E} = 0$.

$G$ is a real multiple of $E$.

By the strict form of the Cauchy–Schwarz inequality in (8), $\frac{d}{dt} F_\varepsilon(E(t))$ can be zero only if $G$ is a real multiple of $E$, and if $G = \mu E$ with $\|E\|_F = 1$, then $\kappa = \langle E, G \rangle = \mu$ and hence $\dot{E} = -G + \kappa E = 0$. Conversely, $\dot{E} = 0$ clearly implies $\frac{d}{dt} F_\varepsilon(E(t)) = 0$.
4.6 Outer iteration
Let $E(\varepsilon)$ denote the minimizer of the functional $F_\varepsilon$. Generically we expect that for a given perturbation size $\varepsilon$, the eigenvalues $\lambda_k$ and $\lambda_{k+1}$ of $L(W + \varepsilon E(\varepsilon))$ are simple. If so, then $f(\varepsilon) = F_\varepsilon(E(\varepsilon))$ is a smooth function of $\varepsilon$ and we can exploit its regularity to obtain a fast iterative method to converge to $\varepsilon^\star$ from the left. Otherwise we can use a bisection technique to approach $\varepsilon^\star$.
The following result provides an inexpensive formula for the computation of the derivative of $f(\varepsilon)$, which will be useful in the construction of the outer iteration of the method.
Assumption 4.1
We assume that the $k$-th and $(k{+}1)$-st smallest eigenvalues of the Laplacian $L(W + \varepsilon E(\varepsilon))$ are simple. Moreover, $E(\varepsilon)$ is assumed to be a smooth function of $\varepsilon$ in some interval.
We then have the following result.
Under Assumption 4.1, the function $f(\varepsilon) = F_\varepsilon(E(\varepsilon))$ is differentiable and its derivative equals (with $G(\varepsilon) = G_\varepsilon(E(\varepsilon))$)
$$f'(\varepsilon) = -\| G(\varepsilon) \|_F. \qquad (9)$$
Differentiating $f(\varepsilon) = F_\varepsilon(E(\varepsilon))$ with respect to $\varepsilon$, we obtain, with $G = G_\varepsilon(E(\varepsilon))$ and by the same argument that led to (3),
$$f'(\varepsilon) = \Bigl\langle \frac{d}{d\varepsilon} \bigl( \varepsilon E(\varepsilon) \bigr), \, G \Bigr\rangle = \langle E(\varepsilon), G \rangle + \varepsilon \, \langle E'(\varepsilon), G \rangle. \qquad (10)$$
Now we use the property of minimizers, stated by Theorem 4.5, and conclude
$$G = \kappa E(\varepsilon) \quad \text{with} \quad \kappa = \langle E(\varepsilon), G \rangle,$$
which yields that
$$\langle E'(\varepsilon), G \rangle = \kappa \, \langle E'(\varepsilon), E(\varepsilon) \rangle = 0,$$
since $E(\varepsilon)$ is of unit norm for all $\varepsilon$. So we have
$$f'(\varepsilon) = \langle E(\varepsilon), G \rangle = \kappa.$$
Now, (8) shows that in an equilibrium,
$$\kappa = \langle E(\varepsilon), G \rangle = -\|G\|_F,$$
where the negative sign is a consequence of the steepest descent property. Together with the above formula for $f'(\varepsilon)$ we thus obtain the stated result.
For $\varepsilon < \varepsilon^\star$, we make use of the standard Newton iteration
$$\varepsilon_{\ell+1} = \varepsilon_\ell - \frac{f(\varepsilon_\ell)}{f'(\varepsilon_\ell)} \qquad (11)$$
to get an updated value $\varepsilon_{\ell+1}$. In a practical algorithm it is useful to couple the Newton iteration (11) with a bisection technique, in the same way as for the method presented in [3]. We adopt a tolerance tol which allows us to distinguish whether $f(\varepsilon_\ell) > \mathrm{tol}$, in which case we may use the derivative formula (9) and perform the Newton step, or $f(\varepsilon_\ell) \le \mathrm{tol}$, so that we have to make use of bisection. The method is formulated in Algorithm 2.
The overall method is formulated in Algorithm 3. In our experience, it usually terminates already in Step 2.
4.7 Effective monotonicity with respect to $\varepsilon$
Assume we have integrated equation (7) with $\varepsilon = \varepsilon_\ell$ and we skip to a new value $\varepsilon_{\ell+1} > \varepsilon_\ell$. In order to get continuity of the functional with respect to $\varepsilon$, we first integrate the ODE (with initial value $E(\varepsilon_\ell)$, i.e. the optimal matrix computed at the previous step)
$$\dot{E} = -G_{\varepsilon_{\ell+1}}(E),$$
that is, (7) dropping the term with $\kappa$ in the right-hand side accounting for norm preservation; along this flow the norm of $E$ is expected to increase. Then we determine an instant $\bar{t}$ at which the rescaled matrix $E(\bar{t}) / \|E(\bar{t})\|_F$ does not increase the functional, and continue with (7),
with initial datum $E(\bar{t}) / \|E(\bar{t})\|_F$ of unit norm.
In this way the computed functional is made decreasing with respect to both $\varepsilon$ and $t$.
4.8 Choice of the initial perturbation size and the initial perturbation matrix
While one might just start with a random perturbation, a more natural guess starts from the normalized free gradient, $E_0 = -G_0 / \|G_0\|_F$, where $G_0 = L^* \bigl( x_{k+1} x_{k+1}^\top - x_k x_k^\top \bigr)$ is evaluated for the unperturbed Laplacian $L(W)$. We propose to choose as a first guess for the perturbation size the scaled spectral gap $\varepsilon_0 = (\lambda_{k+1} - \lambda_k) / \sqrt{2}$, which underestimates the optimal $\varepsilon^\star$ and therefore leads us into the regime where Newton rather than bisection is used for updating $\varepsilon$ in the outer iteration.
4.9 Remark: Perturbing only selected weights
If one is interested in the effect of perturbations only in certain weights of the graph, it is sufficient to replace the projector $P_{\mathcal{E}}$ by the following one:
$$\bigl( P_{\mathcal{E}'}(A) \bigr)_{ij} = \begin{cases} \dfrac{a_{ij} + a_{ji}}{2}, & (i,j) \in \mathcal{E}', \\ 0, & \text{otherwise}, \end{cases}$$
where $\mathcal{E}' \subseteq \mathcal{E}$ is the subset of edges whose weights are subject to perturbation.
This means that when only some weights are subject to uncertainties, one could determine a more reliable stability measure by limiting the attention to such edges.
5 Numerical experiments
In this section we compare the behavior of the spectral gap and the structured distance to ambiguity (SDA) as stability indicators. First, we determine the optimal number of clusters by the criterion of maximal stability with both stability indicators in a family of stochastic block models with varying edge probabilities, alongside a reduced model with similar behavior. Then we examine their performance on a model with randomly generated numbers near given centers, where the optimal number of clusters is given a priori but is here determined by the criterion of maximal stability with both stability indicators.
5.1 Stochastic block model and reduced model
The stochastic block model (SBM) is a model for generating random graphs that tend to have communities. It is an important model in a wide range of fields, from sociology to physics (see [14] and, e.g., [1, 2]).
Here we consider a stochastic block model with the following parameters:

the number of vertices $n$,

a partition of the vertex set into disjoint subsets $C_1, \dots, C_r$, called communities,

a symmetric matrix $P = (p_{ab}) \in [0,1]^{r \times r}$ containing the edge probabilities.
The graph is then sampled in such a way that any two vertices $i \in C_a$ and $j \in C_b$ are connected with probability $p_{ab}$.
For the stability question considered in this paper, this model can be reduced to a model with fewer vertices. Assume that the probability matrix $P$ of the SBM is such that $p_{aa} = 1$ for all $a$. In this case, any two vertices $i$ and $j$ belonging to the same community are connected. If, in addition, $p_{ab} = 0$ for $a \ne b$, the SBM generates a disconnected graph with $r$ communities, which are complete graphs of sizes $|C_1|, \dots, |C_r|$. We represent this graph by a smaller weighted graph in which the complete communities are represented by suitably weighted vertices. The edge probabilities $p_{ab}$ in case $a \ne b$ are then represented by inserting corresponding blocks in the respective part of the weight matrix of the reduced model, with entries given by an appropriate function of $p_{ab}$ that takes the community sizes into account.
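A sampler for the stochastic block model described above can be sketched as follows; the community sizes and the probability matrix are hypothetical illustrative parameters.

```python
import numpy as np

def sample_sbm(sizes, P, seed=0):
    """Sample the adjacency matrix of a stochastic block model: vertices
    in communities a and b are joined independently with probability
    P[a][b] (P symmetric); no self-loops."""
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(len(sizes)), sizes)
    n = labels.size
    probs = np.asarray(P)[np.ix_(labels, labels)]    # per-pair edge probabilities
    U = rng.random((n, n))
    upper = np.triu(np.ones((n, n), dtype=bool), 1)  # sample each pair only once
    A = np.zeros((n, n))
    A[upper & (U < probs)] = 1.0
    return A + A.T                                   # symmetrize

# Hypothetical parameters: three communities, complete inside (p_aa = 1),
# loosely coupled across.
P = [[1.0, 0.1, 0.0],
     [0.1, 1.0, 0.1],
     [0.0, 0.1, 1.0]]
A = sample_sbm([5, 5, 5], P)
```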
To illustrate the construction of this reduced model, consider an SBM with $n$ vertices, three communities of equal size, and the probability matrix
Figure 1 shows the adjacency matrix of one sample of the corresponding SBM. This matrix is represented by the matrix
With regard to clustering stability, we observe a similar behavior for the full and the reduced model, as the following example shows.
Example 1
(Chain SBM) We measure the clustering stability when spectral clustering is applied to an SBM as described above.

We use communities of equal size in the reduced model, with coupling weights on the off-diagonal that depend on a coupling parameter (the above example shows such a model with three communities). For small values of the coupling parameter, we expect a clustering into the given communities to be most stable, whereas for increasing coupling, the optimal number of clusters should decrease. We compute the optimal number of clusters provided by the SDA and the optimal number of clusters as provided by the spectral gaps. Figure 2 shows the results. We observe the expected behavior for both robustness measures, but the SDA tends to opt for a lower number of clusters for smaller values of the coupling parameter. Figure 3 shows the two stability measures for different values of the coupling parameter. As we expect, some of these quantities are decreasing, while others increase up to a certain point and then decrease again.

To compare the behavior of the above reduced model with the full SBM, we compute the same quantities for an SBM with communities of equal size and corresponding edge probabilities. Figure 4 shows the resulting optimal numbers of clusters; Figure 5 shows the distances to ambiguity and the spectral gaps for different values of the coupling parameter. We see how the results are affected by the randomness in the graph generation, but conclude that the behavior of the two models is similar. Thus, the reduced model described above is a reasonable representation of the SBM.
It is remarkable that the optimal numbers of clusters provided by the two stability indicators never differ by more than one. This suggests that the spectral gap is a reasonable stability indicator in this case, even though it does not take the special structure of weight matrices into account. We observe a similar behavior in other stochastic block models with a similar structure.
5.2 Randomly distributed numbers around centers
In the previous subsection we compared the spectral gap to the structured distance to ambiguity as a stability measure. We observed a similar behavior, but could not conclude which measure works better. We now present an example where we can measure a success rate of the stability indicators.
Consider a given set of centers $c_1, \dots, c_m$ with the intention to create clusters around them. We generate a set of random numbers $x_1, \dots, x_N$ such that for each $i$, we first choose an index $j_i$ randomly (uniformly distributed) and then draw the random number $x_i$ from a normal distribution with expectation value $c_{j_i}$ and a given variance. If the centers are well separated, we expect that the numbers are naturally clustered into $m$ communities.
For a graph with weights related to the distances $|x_i - x_j|$, we then compute the SDA and the spectral gap for each candidate number of clusters $k$. Since the data set is constructed such that it should naturally have $m$ communities, we expect that the criterion of maximal stability yields $k^\star = m$ in most cases.
Example 2
We generate samples of random numbers distributed around a given number of group centers. For each $i$, we first choose a group $j_i$ randomly and then generate a random number $x_i$, normally distributed around the corresponding center $c_{j_i}$. Figure 6 shows one sorted random sample.
In order to represent this data set by a graph, we set, following [18],
$$w_{ij} = \exp \Bigl( -\frac{(x_i - x_j)^2}{2 \sigma^2} \Bigr),$$
and then set weights below a small threshold to zero
to avoid a complete graph. We denote by $k^\star_{\mathrm{gap}}$ and $k^\star_{\mathrm{SDA}}$ the optimal numbers of clusters obtained by taking the spectral gaps and the SDA, respectively, as stability indicators in the criterion of maximal stability. We used $2$ and $6$ as bounds for the number of communities. Table 1 reports the frequency of occurrence of the computed optimal numbers of clusters with both stability indicators for one choice of the parameters, and Table 2 for another. We conclude that the success in recognizing the number of communities strongly depends on the graph representation of the data set, but in all cases considered the SDA outperforms the spectral gap in finding the correct number of clusters.
Table 1:
                $k^\star = 2$   $k^\star = 3$   $k^\star = 4$   $k^\star = 5$   $k^\star = 6$
spectral gap:       0.0%            0.0%           69.6%           20.4%           10.0%
SDA:                0.0%            0.0%           78.0%           15.2%            6.8%

Table 2:
                $k^\star = 2$   $k^\star = 3$   $k^\star = 4$   $k^\star = 5$   $k^\star = 6$
spectral gap:       0.0%            0.0%           89.6%            7.6%            2.8%
SDA:                0.0%            0.0%           95.6%            3.2%            1.2%
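The graph representation used in this experiment can be sketched as follows; all parameter values (centers, spread, scale $\sigma$, truncation threshold $\tau$) are hypothetical stand-ins, and only the cheap spectral-gap criterion is evaluated here, since the SDA requires the full two-level algorithm of Section 4.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical 1-D data: three well-separated normally distributed groups.
centers = np.array([0.0, 8.0, 16.0])
x = np.concatenate([rng.normal(c, 0.4, 20) for c in centers])

sigma, tau = 1.0, 1e-4
W = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2.0 * sigma ** 2))  # Gaussian similarity, cf. [18]
np.fill_diagonal(W, 0.0)
W[W < tau] = 0.0                   # truncate tiny weights to avoid a complete graph

L = np.diag(W.sum(axis=1)) - W
lam = np.linalg.eigvalsh(L)
# Spectral-gap criterion: pick k in 2..6 maximizing lambda_{k+1} - lambda_k
# (1-based), i.e. lam[k] - lam[k-1] in 0-based indexing.
gaps = {k: lam[k] - lam[k - 1] for k in range(2, 7)}
k_gap = max(gaps, key=gaps.get)
```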
Acknowledgments
We thank Ulrike von Luxburg and Matthias Hein for a helpful and encouraging discussion.
Part of this work was developed during some visits to Gran Sasso Science Institute in L’Aquila and to the University of Tübingen. The authors thank both institutions for the very kind hospitality.
N. Guglielmi thanks the INdAM GNCS (Gruppo Nazionale di Calcolo Scientifico) and the Italian M.I.U.R. for financial support.
References
 [1] E. Abbe and C. Sandon. Community detection in general stochastic block models: Fundamental limits and efficient algorithms for recovery. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pages 670–688. IEEE, 2015.
 [2] A. A. Amini, E. Levina, et al. On semidefinite relaxations for the block model. The Annals of Statistics, 46(1):149–179, 2018.
 [3] E. Andreotti, D. Edelmann, N. Guglielmi, and C. Lubich. Constrained graph partitioning via matrix differential equations. SIAM J. Matrix Anal. Appl., 40:1–22, 2019.
 [4] P. Benner and M. Voigt. A structured pseudospectral method for $\mathcal{H}_\infty$-norm computation of large-scale descriptor systems. Math. Control Signals Systems, 26(2):303–338, 2014.
 [5] M. Fiedler. Algebraic connectivity of graphs. Czechoslovak Math. J., 23(98):298–305, 1973.
 [6] G. H. Golub and C. F. Van Loan. Matrix computations. Johns Hopkins Studies in the Mathematical Sciences. Johns Hopkins University Press, Baltimore, MD, fourth edition, 2013.
 [7] N. Guglielmi, D. Kressner, and C. Lubich. Low rank differential equations for Hamiltonian matrix nearness problems. Numer. Math., 129:279–319, 2015.
 [8] N. Guglielmi and C. Lubich. Differential equations for roaming pseudospectra: paths to extremal points and boundary tracking. SIAM J. Numer. Anal., 49:1194–1209, 2011.
 [9] N. Guglielmi and C. Lubich. Matrix stabilization using differential equations. SIAM J. Numer. Anal., 55:3097–3119, 2017.
 [10] N. Guglielmi, C. Lubich, and V. Mehrmann. On the nearest singular matrix pencil. SIAM J. Matrix Anal. Appl., 38:776–806, 2017.
 [11] N. J. Higham. Computing a nearest symmetric positive semidefinite matrix. Linear Algebra Appl., 103:103–118, 1988.
 [12] N. J. Higham. Computing the nearest correlation matrix—a problem from finance. IMA J. Numer. Anal., 22(3):329–343, 2002.
 [13] D. Hinrichsen and A. J. Pritchard. Stability radius for structured perturbations and the algebraic Riccati equation. Systems Control Lett., 8(2):105–113, 1986.
 [14] P. W. Holland, K. B. Laskey, and S. Leinhardt. Stochastic blockmodels: First steps. Social networks, 5(2):109–137, 1983.
 [15] R. A. Horn and C. R. Johnson. Matrix analysis. Cambridge University Press, Cambridge, 1985.
 [16] I. Koch. Analysis of multivariate and highdimensional data. Cambridge Series in Statistical and Probabilistic Mathematics, 32. Cambridge University Press, New York, 2014.
 [17] D. Kressner and M. Voigt. Distance problems for linear dynamical systems. In Numerical algebra, matrix theory, differential-algebraic equations and control theory, pages 559–583. Springer, Cham, 2015.
 [18] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Machine Intelligence, 22(8):888–905, 2000.
 [19] D. Spielman. Spectral graph theory and its applications. In Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science, pages 29–38, 2007.
 [20] G. W. Stewart and Ji Guang Sun. Matrix perturbation theory. Computer Science and Scientific Computing. Academic Press, Inc., Boston, MA, 1990.
 [21] U. von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17(4):395–416, 2007.
 [22] U. von Luxburg. Clustering stability: an overview. Foundations and Trends in Machine Learning, 2(3):235–274, 2010.