# Impact of regularization on Spectral Clustering

## Abstract

The performance of spectral clustering can be considerably improved via regularization, as demonstrated empirically in Amini et al. [2]. Here, we attempt to quantify this improvement through theoretical analysis. Under the stochastic block model (SBM), and its extensions, previous results on spectral clustering relied on the minimum degree of the graph being sufficiently large for its good performance. By examining the scenario where the regularization parameter $\tau$ is large, we show that the minimum degree assumption can potentially be removed. As a special case, for an SBM with two blocks, the results require the maximum degree to be large (grow faster than $\log n$) as opposed to the minimum degree. More importantly, we show the usefulness of regularization in situations where not all nodes belong to well-defined clusters. Our results rely on a ‘bias-variance’-like trade-off that arises from understanding the concentration of the sample Laplacian and the eigen gap as a function of the regularization parameter. As a byproduct of our bounds, we propose a data-driven technique DKest (standing for estimated Davis-Kahan bounds) for choosing the regularization parameter. This technique is shown to work well through simulations and on a real data set.

## 1 Introduction

The problem of identifying communities (or clusters) in large networks is an important contemporary problem in statistics. Spectral clustering is one of the more popular techniques for such a purpose, chiefly due to its computational advantage and generality of application. The algorithm’s generality arises from the fact that it is not tied to any modeling assumptions on the data, but is rooted in intuitive measures of community structure such as sparsest cut based measures [11], [24], [16], [20]. Other examples of applications of spectral clustering include manifold learning [4], image segmentation [24], and text mining [9].

The canonical nature of spectral clustering also generates interest in variants of the technique. Here, we attempt to better understand the impact of regularized forms of spectral clustering for community detection in networks. In particular, we focus on the regularized spectral clustering (RSC) procedure proposed in Amini et al. [2]. Their empirical findings demonstrate that the performance of the RSC algorithm, in terms of obtaining the correct clusters, is significantly better for certain values of the regularization parameter. An alternative form of regularization was studied in Chaudhuri et al. [7] and Qin and Rohe [22].

This paper attempts to provide a theoretical understanding of the regularization in the RSC algorithm. We also propose a practical scheme for choosing the regularization parameter based on our theoretical results. Our analysis focuses on the Stochastic Block Model (SBM) and an extension of this model. Below are the three main contributions of the paper.

1. We attempt to understand regularization for the stochastic block model. In particular, for a graph with $n$ nodes, previous theoretical analyses of spectral clustering under the SBM and its extensions [23], [7], [25], [10] assumed that the minimum degree of the graph scales at least as a polynomial power of $\log n$. Even when this assumption is satisfied, the dependence on the minimum degree is highly restrictive when it comes to making inferences about cluster recovery. Our analysis provides cluster recovery results that potentially do not depend on the above-mentioned constraint on the minimum degree. As an example, for an SBM with two blocks (clusters), our results require that the maximum degree be large (grow faster than $\log n$) rather than the minimum degree. This is done in Section 3.

2. We demonstrate that regularization has the potential of addressing a situation where the lower degree nodes do not belong to well-defined clusters. Our results demonstrate that choosing a large regularization parameter has the effect of removing these relatively lower degree nodes. Without regularization, these nodes would hamper the clustering of the remaining nodes in the following way: In order for spectral clustering to work, the top eigenvectors - that is, the eigenvectors corresponding to the largest eigenvalues of the Laplacian - need to be able to discriminate between the clusters. Due to the effect of nodes that do not belong to well-defined clusters, these top eigenvectors do not necessarily discriminate between the clusters with ordinary spectral clustering. This is done in Section 4.

3. Although our theoretical results deal with the ‘large $\tau$’ case, it is observed empirically that moderate values of $\tau$ may produce better clustering performance. Consequently, in Section 5 we propose DKest, a data-dependent procedure for choosing the regularization parameter. We demonstrate that it works well through simulations and on a real data set.

Our theoretical results involve understanding the trade-offs between the eigen gap and the concentration of the sample Laplacian when viewed as a function of the regularization parameter. Assuming that there are $K$ clusters, the eigen gap refers to the gap between the $K$-th largest eigenvalue (the smallest of the top $K$) and the remaining eigenvalues. An adequate gap ensures that the sample eigenvectors can be estimated well ([26], [20], [16]), which leads to good cluster recovery. The adequacy of an eigen gap for cluster recovery is in turn determined by the concentration of the sample Laplacian.

In particular, a consequence of the Davis-Kahan theorem [5] is that if the spectral norm of the difference of the sample and population Laplacians is small compared to the eigen gap then the top eigenvector can be estimated well. Denoting by $\tau$ the regularization parameter, previous theoretical analyses of regularization ([7], [23]) provided high-probability bounds on this spectral norm. These bounds have a $1/\sqrt{\tau}$ dependence on $\tau$, for large $\tau$. In contrast, our high-probability bounds behave like $1/\tau$, for large $\tau$. We also demonstrate that the eigen gap behaves like $1/\tau$ for large $\tau$. The end result is that one can get a good understanding of the impact of regularization by understanding the situation where $\tau$ goes to infinity. This also explains empirical observations in [2], [22], where it was seen that the performance of regularized spectral clustering does not change for $\tau$ beyond a certain value. Our procedure for choosing the regularization parameter works by providing estimates of the Davis-Kahan bounds over a grid of values of $\tau$ and then choosing the $\tau$ that minimizes these estimates.

The paper is organized as follows. In the next subsection we discuss preliminaries. In particular, in Subsection 1.1 we review the RSC algorithm of [2], and also discuss the other forms of regularization in the literature. In Section 2 we review the stochastic block model. Our theoretical results, described in items 1 and 2 above, are provided in Sections 3 and 4. Section 5 describes our data-dependent method for choosing the regularization parameter.

### 1.1 Regularized spectral clustering

In this section we review the regularized spectral clustering (RSC) algorithm of Amini et al. [2].

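We first introduce some basic notation. A graph with $n$ nodes and edge set $E$ is represented by the $n \times n$ symmetric adjacency matrix $A = ((A_{ij}))$, where $A_{ij} = 1$ if there is an edge between $i$ and $j$, and $A_{ij} = 0$ otherwise. In other words, for $1 \le i, j \le n$,

$$A_{ij} = \begin{cases} 1, & \text{if } (i,j) \in E \\ 0, & \text{otherwise.} \end{cases}$$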

Given such a graph, the typical community detection problem is synonymous with finding a partition of the nodes. A good partitioning would be one in which there are fewer edges between the various components of the partition, compared to the number of edges within the components. Various measures for goodness of a partition have been proposed, chiefly the Ratio Cut [11] and Normalized Cut [24]. However, minimization of the above measures is an NP-hard problem since it involves searching over all partitions of the nodes. The significance of spectral clustering partly arises from the fact that it provides a continuous approximation to the above discrete optimization problem [11], [24].

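We now describe the RSC algorithm [2]. Denote by $D = \mathrm{diag}(\hat d_1, \ldots, \hat d_n)$ the diagonal matrix of degrees, where $\hat d_i = \sum_{j=1}^{n} A_{ij}$. The normalized (unregularized) symmetric graph Laplacian is defined as

$$L = D^{-1/2} A D^{-1/2}.$$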

Regularization is introduced in the following way: Let $J = \frac{1}{n}\mathbf{1}\mathbf{1}'$ be a constant matrix with all entries equal to $1/n$. Then, in regularized spectral clustering one constructs a new adjacency matrix by adding $\tau J$ to the adjacency matrix $A$ and computing the corresponding Laplacian. In particular, let

$$A_\tau = A + \tau J,$$

where $\tau$ is the regularization parameter. The corresponding regularized symmetric Laplacian is defined as

$$L_\tau = D_\tau^{-1/2} A_\tau D_\tau^{-1/2}. \tag{1}$$

Here, $D_\tau$ is the diagonal matrix of ‘degrees’ of the modified adjacency matrix $A_\tau$. In other words, the $i$-th diagonal entry of $D_\tau$ is $\sum_{j} A_{ij} + \tau$, the degree of node $i$ plus $\tau$.
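The ‘degrees’ of the modified adjacency matrix can be written out directly. The short derivation below is a sketch under two notational assumptions that are not fixed by the surrounding text: every entry of the constant matrix $J$ equals $1/n$ (so that $J\mathbf{1} = \mathbf{1}$), and $\hat d_i$ denotes the observed degree of node $i$. The last line also shows why the largest eigenvalue of $L_\tau$ is 1, with eigenvector proportional to $D_\tau^{1/2}\mathbf{1}$.

```latex
\begin{aligned}
A_\tau \mathbf{1} &= A\mathbf{1} + \tau J\mathbf{1}
   = (\hat d_1, \ldots, \hat d_n)' + \tau \mathbf{1}, \\
D_\tau &= \mathrm{diag}(A_\tau \mathbf{1}) = D + \tau I, \\
L_\tau D_\tau^{1/2}\mathbf{1} &= D_\tau^{-1/2} A_\tau \mathbf{1}
   = D_\tau^{-1/2} D_\tau \mathbf{1} = D_\tau^{1/2}\mathbf{1}.
\end{aligned}
```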

The RSC algorithm for finding communities is described in Algorithm 1. In order to bring to the forefront the dependence on $\tau$, we also denote the RSC algorithm as RSC-$\tau$. The algorithm first computes $V_\tau$, the $n \times K$ eigenvector matrix corresponding to the $K$ largest eigenvalues of $L_\tau$. The columns of $V_\tau$ are taken to be orthogonal. The rows of $V_\tau$, denoted by $V_{i,\tau}$, for $i = 1, \ldots, n$, correspond to the nodes in the graph. Clustering the rows of $V_\tau$, for example using the $K$-means algorithm, provides a clustering of the nodes. We remark that the RSC-0 Algorithm corresponds to the usual spectral clustering algorithm.
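The steps just described can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' Algorithm 1: the function names are hypothetical, the constant matrix $J$ is assumed to have all entries equal to $1/n$, and plain Lloyd iterations with random restarts stand in for whichever $K$-means routine is actually used.

```python
import numpy as np

def regularized_laplacian(A, tau):
    """Return L_tau = D_tau^{-1/2} (A + tau*J) D_tau^{-1/2}, J having all entries 1/n."""
    n = A.shape[0]
    A_tau = A + tau / n                      # adds tau/n to every entry
    d_tau = A_tau.sum(axis=1)                # equals (degree of node i) + tau
    inv_sqrt = np.zeros(n)
    positive = d_tau > 0                     # isolated nodes with tau = 0 get a zero row
    inv_sqrt[positive] = 1.0 / np.sqrt(d_tau[positive])
    return inv_sqrt[:, None] * A_tau * inv_sqrt[None, :]

def top_eigenvectors(L, K):
    """Orthonormal eigenvectors for the K largest eigenvalues of a symmetric matrix."""
    _, vecs = np.linalg.eigh(L)              # eigenvalues come back in ascending order
    return vecs[:, -K:][:, ::-1]

def lloyd_kmeans(X, K, n_init=20, n_iter=100, seed=0):
    """Lloyd iterations with random restarts; returns labels of the best run found."""
    rng = np.random.default_rng(seed)
    best_labels, best_obj = None, np.inf
    for _ in range(n_init):
        centers = X[rng.choice(len(X), size=K, replace=False)].copy()
        for _ in range(n_iter):
            sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = sq_dist.argmin(axis=1)
            updated = centers.copy()
            for k in range(K):
                if np.any(labels == k):      # keep the old center if a cluster empties
                    updated[k] = X[labels == k].mean(axis=0)
            if np.allclose(updated, centers):
                break
            centers = updated
        obj = ((X - centers[labels]) ** 2).sum()
        if obj < best_obj:
            best_labels, best_obj = labels, obj
    return best_labels

def rsc(A, K, tau):
    """RSC-tau sketch: cluster the rows of the top-K eigenvector matrix of L_tau."""
    return lloyd_kmeans(top_eigenvectors(regularized_laplacian(A, tau), K), K)

# Toy graph (an assumption for illustration): two 5-cliques joined by a single edge.
A = np.zeros((10, 10))
A[:5, :5] = 1.0
A[5:, 5:] = 1.0
np.fill_diagonal(A, 0.0)
A[4, 5] = A[5, 4] = 1.0
labels = rsc(A, K=2, tau=1.0)
```

Setting `tau=0.0` in `rsc` gives ordinary spectral clustering on the normalized Laplacian, which makes it easy to compare the two on the same graph.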

Our theoretical results assume that the data is randomly generated from a stochastic block model (SBM), which we review in the next section. While it is well known that there are real data examples where the SBM fails to provide a good approximation, we believe that the above provides a good playground for understanding the role of regularization in the RSC algorithm. Recent works [2], [10], [23], [6], [14] have used this model, and its variants, to provide theoretical analyses for various community detection algorithms.

In Chaudhuri et al. [7], the following alternative regularized version of the symmetric Laplacian is proposed:

$$L_{deg,\tau} = D_\tau^{-1/2} A D_\tau^{-1/2}, \quad \text{where } D_\tau = D + \tau I. \tag{2}$$

Here, the subscript $deg$ stands for ‘degree’ since the usual Laplacian is modified by adding $\tau$ to the degree matrix $D$. Notice that for the RSC algorithm the matrix $A$ in the above expression was replaced by $A_\tau$.

As mentioned before, we attempt to understand regularization in the framework of the SBM and its extension. We review the SBM in the next section. Using recent results on the concentration of random graph Laplacians [21], we were able to show concentration results in Theorem 4 for the regularized Laplacian in the RSC algorithm. Previous concentration results for the Laplacian (2), as in [7], provide high-probability bounds on the spectral norm of the difference of the sample and population regularized Laplacians that depend inversely on $\sqrt{\tau}$. However, for the regularization (1) we show that the dependence is inverse in $\tau$, for large $\tau$. We believe that this holds for the regularization (2) as well. We also demonstrate that the eigen gap depends inversely on $\tau$, for large $\tau$. The benefit of this, along with our improved concentration bounds, is that one can understand regularization by looking at the case where $\tau$ is large. This results in a very neat criterion for cluster recovery with the RSC-$\tau$ algorithm.

## 2 The Stochastic Block Model

Given a set of $n$ nodes, the stochastic block model (SBM), introduced in [12], is one among many random graph models that have communities inherent in their definition. We denote the number of communities in the SBM by $K$. Throughout this paper we assume that $K$ is known. The communities, which represent a partition of the $n$ nodes, are assumed to be fixed beforehand. Denote these by $C_1, \ldots, C_K$. Let $n_k$, for $k = 1, \ldots, K$, denote the number of nodes belonging to each of the clusters.

Given the communities, the edges between nodes, say $i$ and $j$, are chosen independently with probability depending on the communities that $i$ and $j$ belong to. In particular, for a node $i$ belonging to cluster $C_{k_1}$, and a node $j$ belonging to cluster $C_{k_2}$, the probability of an edge between $i$ and $j$ is given by

$$P_{ij} = B_{k_1,k_2}.$$

Here, the block probability matrix

$$B = ((B_{k_1,k_2})), \quad \text{where } k_1, k_2 = 1, \ldots, K,$$

is a symmetric full rank matrix, with each entry in $[0,1]$. The edge probability matrix $P = ((P_{ij}))$, given by (3), represents the population counterpart of the adjacency matrix $A$.

Denote by $Z$ the $n \times K$ binary matrix providing the cluster memberships of each node. In other words, each row of $Z$ has exactly one 1, with $Z_{ik} = 1$ if node $i$ belongs to $C_k$. Notice that,

$$P = Z B Z'. \tag{3}$$

Here $Z'$ denotes the transpose of $Z$. Consequently, from (3), it is seen that the rank of $P$ is also $K$.

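The population counterpart for the degree matrix $D$ is denoted by $\mathcal{D}$, where $\mathcal{D} = \mathrm{diag}(P\mathbf{1})$. Here $\mathbf{1}$ denotes the column vector of all ones. Similarly, the population version of the symmetric Laplacian $L_\tau$ is denoted by $\mathcal{L}_\tau$, where

$$\mathcal{L}_\tau = \mathcal{D}_\tau^{-1/2} P_\tau \mathcal{D}_\tau^{-1/2}.$$

Here $\mathcal{D}_\tau = \mathcal{D} + \tau I$ and $P_\tau = P + \tau J$. The matrices $\mathcal{D}_\tau$ and $P_\tau$ represent the population counterparts to $D_\tau$ and $A_\tau$ respectively. Notice that since $P$ has rank $K$, the same holds for $\mathcal{L}_\tau$.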
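The population quantities above are easy to build numerically. The following is a minimal sketch, assuming NumPy; the helper names are hypothetical, $J$ is taken to have all entries $1/n$, and the sampler draws no self-loops, which is an assumption the text does not spell out.

```python
import numpy as np

def sbm_population(block_sizes, B):
    """Return the membership matrix Z (n x K) and the edge-probability matrix P = Z B Z'."""
    K = len(block_sizes)
    labels = np.repeat(np.arange(K), block_sizes)
    Z = np.eye(K)[labels]                    # exactly one 1 per row
    return Z, Z @ B @ Z.T

def sample_adjacency(P, rng):
    """Symmetric 0/1 adjacency with independent edges; no self-loops (assumption)."""
    n = P.shape[0]
    upper = np.triu(rng.random((n, n)) < P, k=1)   # strict upper triangle only
    return (upper | upper.T).astype(float)

def population_laplacian(P, tau):
    """Regularized population Laplacian built from P + tau*J, J having all entries 1/n."""
    n = P.shape[0]
    P_tau = P + tau / n
    inv_sqrt = 1.0 / np.sqrt(P_tau.sum(axis=1))    # population degrees plus tau
    return inv_sqrt[:, None] * P_tau * inv_sqrt[None, :]
```

For example, `Z, P = sbm_population([6, 4], np.array([[0.5, 0.1], [0.1, 0.4]]))` followed by `np.linalg.matrix_rank(population_laplacian(P, 3.0))` can be used to check the rank claim on a small case.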

### 2.1 Notation

We use $\|\cdot\|$ to denote the spectral norm of a matrix. Notice that for vectors this corresponds to the usual $\ell_2$-norm. We use $A'$ to denote the transpose of a matrix, or vector, $A$.

For positive $a_n, b_n$, we use the notation $a_n \asymp b_n$ if there exist universal constants $c_1, c_2 > 0$ so that $c_1 a_n \le b_n \le c_2 a_n$. Further, we use $b_n \lesssim a_n$ if $b_n \le c\, a_n$, for some positive $c$ not depending on $n$. The notation $b_n \gtrsim a_n$ is analogously defined.

The quantities

$$d_{\min,n} = \min_{i=1,\ldots,n} d_i, \qquad d_{\max,n} = \max_{i=1,\ldots,n} d_i$$

denote the minimum and maximum expected degrees of the nodes.

### 2.2 The Population Cluster Centers

We now proceed to define population cluster centers, one for each of the $K$ clusters, for the $K$-block SBM. These points are defined so that the rows $V_{i,\tau}$ of the sample eigenvector matrix, for $i \in C_k$, are expected to be scattered around the center of the $k$-th cluster.

Denote by $\mathcal{V}_\tau$ an $n \times K$ matrix containing the eigenvectors of the $K$ largest eigenvalues of the population Laplacian $\mathcal{L}_\tau$. As with $V_\tau$, the columns of $\mathcal{V}_\tau$ are also assumed to be orthogonal.

Notice that both $\mathcal{V}_\tau$ and $-\mathcal{V}_\tau$ are eigenvector matrices corresponding to $\mathcal{L}_\tau$. This ambiguity in the definition of $\mathcal{V}_\tau$ is further complicated if an eigenvalue of $\mathcal{L}_\tau$ has multiplicity greater than one. We do away with this ambiguity in the following way: Let $\mathcal{H}$ denote the set of all $n \times K$ eigenvector matrices of $\mathcal{L}_\tau$ corresponding to the top $K$ eigenvalues. We take,

$$\mathcal{V}_\tau = \arg\min_{H \in \mathcal{H}} \|V_\tau - H\|, \tag{4}$$

where recall that $\|\cdot\|$ denotes the spectral norm. The matrix $\mathcal{V}_\tau$, as defined above, represents the population counterpart of the matrix $V_\tau$.

Let $\mathcal{V}_{i,\tau}$ denote the $i$-th row of $\mathcal{V}_\tau$. Notice that since the set $\mathcal{H}$ is closed under the $\|\cdot\|$ norm, one has that $\mathcal{V}_\tau$ is also an eigenvector matrix of $\mathcal{L}_\tau$ corresponding to the top $K$ eigenvalues. Consequently, the rows $\mathcal{V}_{i,\tau}$ are the same across nodes belonging to a particular cluster (see, for example, Rohe et al. [23] for a proof of this fact). In other words, there are $K$ distinct rows of $\mathcal{V}_\tau$, with each row corresponding to nodes from one of the $K$ clusters.

Notice that the matrix $\mathcal{V}_\tau$ depends on the sample eigenvector matrix $V_\tau$ through (4), and consequently is a random quantity. However, the following lemma shows that the pairwise distances between the rows of $\mathcal{V}_\tau$ are non-random and, more importantly, independent of $\tau$.

###### Lemma 1.

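Let $i \in C_k$ and $i' \in C_{k'}$. Then,

$$\|\mathcal{V}_{i,\tau} - \mathcal{V}_{i',\tau}\| = \begin{cases} 0, & \text{if } k = k' \\ \sqrt{\dfrac{1}{n_k} + \dfrac{1}{n_{k'}}}, & \text{if } k \ne k' \end{cases}$$

From the above lemma, there are $K$ distinct rows of $\mathcal{V}_\tau$ corresponding to the $K$ clusters. We call these the population cluster centers since, intuitively, in an idealized scenario the data points $V_{i,\tau}$, with $i \in C_k$, should be concentrated around the row of $\mathcal{V}_\tau$ shared by the nodes of $C_k$.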
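Lemma 1 can be checked numerically on a small example. The sketch below assumes NumPy, a hypothetical 3-block model with a positive definite block matrix (so that the top three eigenvalues are exactly the nonzero ones), and a constant matrix $J$ with entries $1/n$; it compares the row distances of the population eigenvector matrix with $\sqrt{1/n_k + 1/n_{k'}}$ for several values of $\tau$.

```python
import numpy as np

# Hypothetical 3-block model, chosen only for illustration.
block_sizes = [30, 50, 20]
B = np.array([[0.30, 0.05, 0.05],
              [0.05, 0.20, 0.05],
              [0.05, 0.05, 0.10]])
K = len(block_sizes)
labels = np.repeat(np.arange(K), block_sizes)
Z = np.eye(K)[labels]
P = Z @ B @ Z.T
n = P.shape[0]

def population_row_distances(tau):
    """Distances between the rows of the top-K population eigenvector matrix,
    taking one representative node from each cluster."""
    P_tau = P + tau / n
    inv_sqrt = 1.0 / np.sqrt(P_tau.sum(axis=1))
    L_pop = inv_sqrt[:, None] * P_tau * inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_pop)          # ascending eigenvalues
    V_pop = vecs[:, -K:]                     # eigenvectors of the K largest eigenvalues
    reps = [0, 30, 80]                       # first node of each block
    return np.array([[np.linalg.norm(V_pop[i] - V_pop[j]) for j in reps] for i in reps])

# Distances predicted by Lemma 1: 0 within a cluster, sqrt(1/n_k + 1/n_k') across clusters.
expected = np.array([[0.0 if a == b else np.sqrt(1.0 / na + 1.0 / nb)
                      for b, nb in enumerate(block_sizes)]
                     for a, na in enumerate(block_sizes)])
```

The distances do not depend on which orthonormal basis of the top eigenspace `eigh` returns, since right-multiplying the eigenvector matrix by an orthogonal $K \times K$ matrix preserves distances between rows.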

### 2.3 Cluster recovery using K-means algorithm

Recall that the RSC-$\tau$ Algorithm 1 works by performing $K$-means clustering on the rows of the sample eigenvector matrix, denoted by $V_{i,\tau}$, for $i = 1, \ldots, n$. In this section, in particular Corollary 3, we relate the fraction of mis-clustered nodes using the $K$-means algorithm to the various parameters in the SBM.

In general, the $K$-means algorithm can be described as follows: Assume one wants to find $K$ clusters, for a given set of data points $x_i$, for $i = 1, \ldots, n$. Then the $K$ clusters resulting from applying the $K$-means algorithm correspond to a partition $\hat T = \{\hat T_1, \ldots, \hat T_K\}$ of $\{1, \ldots, n\}$ that aims to minimize the following objective function over all such partitions:

$$\mathrm{Obj}(T) = \sum_{k=1}^{K} \sum_{i \in T_k} \|x_i - \bar x_{T_k}\|^2. \tag{5}$$

Here $T = \{T_1, \ldots, T_K\}$ is a partition of $\{1, \ldots, n\}$, and $\bar x_{T_k}$ corresponds to the vector of component-wise means of the $x_i$, for $i \in T_k$.

In our situation there is also an underlying true partition of nodes into $K$ clusters, given by $C = \{C_1, \ldots, C_K\}$. Notice that $\hat T = C$ iff there is a permutation $\pi$ of $\{1, \ldots, K\}$ so that $C_k = \hat T_{\pi(k)}$, for $k = 1, \ldots, K$. In general, we use the following measure to quantify the closeness of the outputted partition $\hat T$ and the true partition $C$: Denote the clustering error associated with $\hat T$ as

$$\hat f = \min_{\pi} \max_{k} \frac{|C_k \cap \hat T^c_{\pi(k)}| + |C_k^c \cap \hat T_{\pi(k)}|}{n_k}. \tag{6}$$

The clustering error measures the maximum proportion of nodes in the symmetric difference of $C_k$ and $\hat T_{\pi(k)}$.
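The error measure (6) can be computed by brute force over permutations when $K$ is small. The sketch below uses only the Python standard library; the function name and the label encoding (integers $0, \ldots, K-1$) are assumptions made for illustration, and every true cluster is assumed to be non-empty.

```python
from itertools import permutations

def clustering_error(true_labels, est_labels, K):
    """Clustering error (6): min over permutations of the worst-case ratio
    |symmetric difference of C_k and T_pi(k)| / n_k. Cost grows like K!."""
    n = len(true_labels)
    C = [{i for i in range(n) if true_labels[i] == k} for k in range(K)]
    T = [{i for i in range(n) if est_labels[i] == k} for k in range(K)]
    best = float("inf")
    for pi in permutations(range(K)):
        # |C_k ∩ T^c| + |C_k^c ∩ T| is the size of the symmetric difference C_k ^ T
        worst = max(len(C[k] ^ T[pi[k]]) / len(C[k]) for k in range(K))
        best = min(best, worst)
    return best
```

For instance, with true labels `[0, 0, 0, 1, 1, 1]`, the relabelled estimate `[1, 1, 1, 0, 0, 0]` has error 0, while `[0, 0, 1, 1, 1, 1]` misplaces one node and has error $1/3$.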

In many situations, such as ours, there exist population quantities associated with each cluster around which the $x_i$'s are expected to concentrate. Denote these quantities by $m_1, \ldots, m_K$. In our case, $m_k$ is the $k$-th population cluster center. If the $x_i$'s, for $i \in C_k$, concentrate well around $m_k$, and the $m_k$'s are sufficiently well separated, then it is expected that the $K$-means algorithm recovers the clusters with small error $\hat f$.

Denote by $X$ the matrix with the $x_i$'s as rows. In our case, the $x_i$ are the rows $V_{i,\tau}$ of the sample eigenvector matrix, and $X = V_\tau$. Further, denote by $M$ the matrix whose $i$-th row is $m_k$ for $i \in C_k$. In our case, $M$ is the population eigenvector matrix $\mathcal{V}_\tau$. Recent results on cluster recovery using the $K$-means algorithm, as given in Kumar and Kannan [15] and Awasthi and Sheffet [3], provide conditions on $X$ and $M$ for the success of $K$-means. The following lemma is implied by Theorem 3.1 in Awasthi and Sheffet [3].

###### Lemma 2.

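Let $\delta > 0$ be a small quantity. If for each $k \ne k'$, one has

$$\|m_k - m_{k'}\| \ge \left(\frac{1}{\delta}\right) \sqrt{K}\, \|X - M\| \left(\frac{1}{\sqrt{n_k}} + \frac{1}{\sqrt{n_{k'}}}\right) \tag{7}$$

then the clustering error $\hat f = O(\delta^2)$ using the $K$-means algorithm.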

Remark : In general, minimizing the objective function (5) is not computationally feasible. However, the results in [15], [3] can be extended to partitions that approximately minimize (5). The condition (7), called the center separation condition in [3], provides lower bounds on the pairwise distances between the population cluster centers that depend on the perturbation of the data points around the population centers (represented by $\|X - M\|$) and the cluster sizes.

Let

$$1 = \mu_{1,\tau} \ge \ldots \ge \mu_{n,\tau}$$

be the eigenvalues of the regularized population Laplacian $\mathcal{L}_\tau$ arranged in decreasing order. The fact that $\mu_{1,\tau}$ is 1 follows from standard results on the spectrum of Laplacian matrices (see, for example, [26]). As mentioned in the introduction, in order to control the perturbation of the first $K$ eigenvectors the eigen gap, given by $\mu_{K,\tau} - \mu_{K+1,\tau}$, must be adequately large, as noted in [26], [20], [16]. Since $\mathcal{L}_\tau$ has rank $K$ one has $\mu_{K+1,\tau} = 0$. Thus the eigen gap is simply $\mu_{K,\tau}$. For our $K$-block SBM framework the following is an immediate consequence of Lemma 2 and the Davis-Kahan theorem for the perturbation of eigenvectors.

###### Corollary 3.

Let $\tau \ge 0$ be fixed. For the RSC-$\tau$ algorithm the clustering error, given by (6), is

$$O\left(\frac{K\, \|L_\tau - \mathcal{L}_\tau\|^2}{\mu_{K,\tau}^2}\right).$$

###### Proof.

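Use Lemma 2 with $X = V_\tau$, $M = \mathcal{V}_\tau$, and $m_k$ equal to the $k$-th population cluster center, and notice from Lemma 1 that $\|m_k - m_{k'}\|$ is $\sqrt{1/n_k + 1/n_{k'}}$ for $k \ne k'$.

Consequently, using $\sqrt{1/n_k + 1/n_{k'}} \ge \frac{1}{\sqrt{2}}\left(1/\sqrt{n_k} + 1/\sqrt{n_{k'}}\right)$, one gets from (7), up to a constant factor absorbed into $\delta$, that if

$$\|V_\tau - \mathcal{V}_\tau\| \le \frac{\delta}{\sqrt{K}}, \tag{8}$$

for some $\delta > 0$, then at most an $O(\delta^2)$ fraction of nodes are misclassified with the RSC-$\tau$ algorithm.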

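From the Davis-Kahan theorem [5], one has

$$\|V_\tau - \mathcal{V}_\tau\| \lesssim \frac{\|L_\tau - \mathcal{L}_\tau\|}{\mu_{K,\tau}}. \tag{9}$$

Consequently, if we take $\delta$ to be of the order $\sqrt{K}\, \|L_\tau - \mathcal{L}_\tau\| / \mu_{K,\tau}$ then relation (8) is satisfied using (9). This proves the corollary. ∎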

## 3 Improvements through regularization

In this section we will use Corollary 3 to quantify improvements in clustering performance via regularization. If the number of clusters $K$ is fixed (does not grow with $n$) then the quantity

$$\frac{\|L_\tau - \mathcal{L}_\tau\|}{\mu_{K,\tau}}, \tag{10}$$

in Corollary 3 provides an insight into the role of the regularization parameter $\tau$. Clearly, an ideal choice of $\tau$ would be the one that minimizes (10). Note, however, that this is not practically possible since $\mathcal{L}_\tau$ and $\mu_{K,\tau}$ are not known in advance.

Increasing $\tau$ will ensure that the Laplacian $L_\tau$ will be well concentrated around $\mathcal{L}_\tau$. This is demonstrated in Theorem 4 below. However, increasing $\tau$ also has the effect of decreasing the eigen gap, which in this case is $\mu_{K,\tau}$, since the population Laplacian becomes more like a constant matrix upon increasing $\tau$. Thus the optimum $\tau$ results from the balancing out of these two competing effects.

Independent of our work, a similar argument for the optimum choice of regularization, using the Davis-Kahan theorem, was given in Qin and Rohe [22] for the regularization proposed in [7]. However, they did not provide a quantification of the benefit of regularization as given in this section and Section 4.

Theorem 4 provides high-probability bounds on the quantity $\|L_\tau - \mathcal{L}_\tau\|$ appearing in the numerator of (10). Previous analyses of the regularization (2), in [7], [22], show high-probability bounds on the aforementioned spectral norm that have a $1/\sqrt{\tau}$ dependence on $\tau$. However, for large $\tau$, the theorem below shows that the behavior is $1/\tau$. We believe this holds for the regularization (2) as well. Thus, our bounds have a $1/\tau$ dependence on $\tau$, for large $\tau$, as opposed to the $1/\sqrt{\tau}$ dependence shown in [7]. This is crucial since the eigen gap also behaves like $1/\tau$ for large $\tau$, which implies that (10) converges to a quantity as $\tau$ tends to infinity. In Theorem 5 we provide a bound on this quantity. Our claims regarding improvements via regularization will then follow from comparing this bound with the bound on (10) at $\tau = 0$.

###### Theorem 4.

With probability at least , for all satisfying

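$$\max\{\tau, d_{\min,n}\} \ge 32 \log n, \tag{11}$$

we have

$$\|L_\tau - \mathcal{L}_\tau\| \le \epsilon_{\tau,n}. \tag{12}$$

Here

$$\epsilon_{\tau,n} = \begin{cases} \dfrac{10\sqrt{\log n}}{\sqrt{d_{\min,n} + \tau}}, & \text{if } \tau \le 2 d_{\max,n} \\[2ex] \dfrac{10\sqrt{d_{\max,n} \log n}}{d_{\max,n} + \tau/2}, & \text{if } \tau > 2 d_{\max,n} \end{cases}$$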
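The piecewise bound in Theorem 4 is simple to evaluate. The sketch below uses only the Python standard library; the function name is hypothetical, and raising an error when condition (11) fails is a design choice of this sketch, reflecting that the theorem makes no claim outside that range.

```python
import math

def epsilon_bound(tau, n, d_min, d_max):
    """Evaluate epsilon_{tau,n} from Theorem 4 for expected degrees d_min <= d_max."""
    if max(tau, d_min) < 32 * math.log(n):
        raise ValueError("condition (11) fails: max(tau, d_min) < 32 log n")
    if tau <= 2 * d_max:
        # moderate tau: decays like 1 / sqrt(d_min + tau)
        return 10 * math.sqrt(math.log(n)) / math.sqrt(d_min + tau)
    # large tau: decays like 1 / tau
    return 10 * math.sqrt(d_max * math.log(n)) / (d_max + tau / 2)
```

Doubling a very large $\tau$ roughly halves the returned value, which is the $1/\tau$ behavior discussed before the theorem.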

We use Theorem 4, along with Corollary 3, to demonstrate improvements from regularization over previous analyses of eigenvector perturbation. Our strategy for this is as follows: Take

$$\delta_{\tau,n} = \frac{\epsilon_{\tau,n}}{\mu_{K,\tau}}.$$

Notice that from Corollary 3 and Theorem 4, one gets that, with the probability guaranteed by Theorem 4, for all $\tau$ satisfying (11), the clustering error is $O(\delta_{\tau,n}^2)$. Consequently, it is of interest to study the quantity $\delta_{\tau,n}$ as a function of $\tau$. Define,

$$\delta_n = \lim_{\tau \to \infty} \delta_{\tau,n}. \tag{13}$$

Although we would have ideally liked to study the quantity,

$$\tilde\delta_n = \min_{\tau:\ \max\{\tau, d_{\min,n}\} \gtrsim \log n} \delta_{\tau,n},$$

we study $\delta_n$ since it is easy to characterize, as we shall see in Theorem 5 below. Section 5 introduces a data-driven methodology that is based on estimating the quantity (10) over a grid of values of $\tau$.
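The trade-off between the concentration bound and the eigen gap can be traced numerically. The sketch below assumes NumPy, a hypothetical two-block model with 200 nodes per block and the block matrix of example (22), and a constant matrix $J$ with entries $1/n$. It evaluates $\delta_{\tau,n} = \epsilon_{\tau,n} / \mu_{K,\tau}$ on a grid of $\tau$ values that satisfy (11); the helper names are not from the paper.

```python
import math
import numpy as np

def population_eigen_gap(P, K, tau):
    """mu_{K,tau}: the K-th largest eigenvalue of the regularized population Laplacian."""
    n = P.shape[0]
    P_tau = P + tau / n
    inv_sqrt = 1.0 / np.sqrt(P_tau.sum(axis=1))
    L_pop = inv_sqrt[:, None] * P_tau * inv_sqrt[None, :]
    return np.linalg.eigvalsh(L_pop)[-K]     # eigvalsh returns ascending eigenvalues

def epsilon_bound(tau, n, d_min, d_max):
    """Bound of Theorem 4; only meaningful when max(tau, d_min) >= 32 log n."""
    if tau <= 2 * d_max:
        return 10 * math.sqrt(math.log(n)) / math.sqrt(d_min + tau)
    return 10 * math.sqrt(d_max * math.log(n)) / (d_max + tau / 2)

def delta(P, K, tau):
    """delta_{tau,n} = epsilon_{tau,n} / mu_{K,tau}."""
    d = P.sum(axis=1)                        # expected degrees
    return epsilon_bound(tau, P.shape[0], d.min(), d.max()) / population_eigen_gap(P, K, tau)

n_per_block = 200                            # hypothetical block size
B = np.array([[0.01, 0.0025], [0.0025, 0.003]])
Z = np.eye(2)[np.repeat([0, 1], n_per_block)]
P = Z @ B @ Z.T
taus = [250.0, 1e3, 1e4, 1e5, 1e6]           # all at least 32 log n for n = 400
deltas = [delta(P, 2, t) for t in taus]
```

Because both $\epsilon_{\tau,n}$ and $\mu_{K,\tau}$ decay like $1/\tau$, the entries of `deltas` level off as $\tau$ grows, which is the limit $\delta_n$ in (13).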

Before introducing our main theorem quantifying the performance of RSC-$\tau$ for large $\tau$ we introduce the following definition.

###### Definition 1.

Let $\{\tau_n\}$ be a sequence of regularization parameters. For the $K$-block SBM we say that RSC-$\tau_n$ gives consistent cluster estimates if the error (6) goes to 0, with probability tending to 1, as $n$ goes to infinity.

Throughout the remainder of the section we consider a $K$-block stochastic block model with the following block probability matrix.

$$B = \begin{pmatrix} p_{1,n} & q_n & \cdots & q_n \\ q_n & p_{2,n} & \cdots & q_n \\ \vdots & \vdots & \ddots & \vdots \\ q_n & \cdots & q_n & p_{K,n} \end{pmatrix}. \tag{14}$$

The number of communities $K$ is assumed to be fixed. Without loss, assume that $p_{1,n} \ge p_{2,n} \ge \ldots \ge p_{K,n}$. We also assume that . Denote $w_k = n_k / n$, for $k = 1, \ldots, K$. The quantity $w_k$ represents the proportion of nodes belonging to the $k$-th community. Throughout this section we assume that $\{\tau_n\}$ is a sequence of regularization parameters satisfying,

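$$\frac{\left(\sum_{k=1}^{K} 1/w_k\right) d_{\max,n} \log n}{\tau_n} = o(1) \tag{15}$$

Notice that if the cluster sizes are of the same order, that is $w_k \asymp 1$ for each $k$, then the above condition simply states that $\tau_n$ should grow faster than $d_{\max,n} \log n$.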

Denote . The following is our main result regarding the impact of regularization.

###### Theorem 5.

For the $K$-block SBM, with block probability matrix (14),

 δn≍(~m1,nm1,n−m2,n)m1,n√dmax,nlogn. (16)

Here $\delta_n$ is given by (13) and

 m1,n =K∑k=1wkγk,n (17) ~m1,n =K∑k=11γk,n (18) m2,n =K∑k=1wkγ2k,n (19)

Further, let $\{\tau_n\}$ satisfy (15). If $\delta_n$ goes to 0, as $n$ tends to infinity, then RSC-$\tau_n$ gives consistent cluster estimates.

Theorem 5 will be proved in Appendix B. In particular, the following corollary shows that for the stochastic block model regularized spectral clustering would work even when the minimum degree is of constant order. This is an improvement over recent works on unregularized spectral clustering, such as [18], [7], [23], which required the minimum degree to grow at least as fast as $\log n$.

###### Corollary 6.

Let the block probability matrix be as in (14). Let $\{\tau_n\}$ satisfy (15). Then RSC-$\tau_n$ gives consistent cluster estimates under the following scenarios:

1. For the $K$-block SBM if $w_k \asymp 1$, for each $k = 1, \ldots, K$, and

$$\frac{(p_{K-1,n} - q_n)^2}{p_{1,n}} \ \text{grows faster than} \ \frac{\log n}{n}. \tag{20}$$
2. For the 2-block SBM if and

$$\frac{(p_{1,n} - q_n)^2}{w_1 p_{1,n} + w_2 q_n} \ \text{grows faster than} \ \frac{\log n}{n \left(\min\{w_1, w_2\}\right)^2}. \tag{21}$$

Remark : Regime 1 deals with the situation that the cluster sizes are of the same order of magnitude. Regime 2, where mimics a scenario where there is only one cluster. This is a generalization of the planted clique problem where and . For the planted clique problem (21) translates to requiring that grow faster than for consistent cluster estimates, which is similar to results in [18].

Notice that in both (20) and (21) the minimum degree could be of constant order. For example, for the two-block SBM, if $p_{2,n}$ and $q_n$ are both of order $1/n$ then the minimum degree is of constant order. In this case ordinary spectral clustering using the normalized Laplacian would perform poorly. RSC performs better since from (20) it only requires that the larger of the two within-block probabilities, that is $p_{1,n}$, grow appropriately fast. Figure 1 illustrates this for the edge probability matrix

$$B = \begin{pmatrix} 0.01 & 0.0025 \\ 0.0025 & 0.003 \end{pmatrix}. \tag{22}$$

The figure provides the scatter plot of the first two eigenvectors of the unregularized and regularized sample Laplacians. Plot a) corresponds to the usual spectral clustering, while plots b) and c) correspond to RSC-$\tau$ with two different values of $\tau$. The value of $\tau$ in b) was selected using our data-driven methodology for selecting $\tau$ proposed in Section 5, while the value in c) was selected as suggested by Theorem 5 and Corollary 6. The fraction of mis-classified are for the cases a),  b),  c) respectively.

From the scatter plots one sees that there is considerably less scattering for the blue points with regularization. This results in improvements in clustering performance. Also, note that the performance in case c), in which $\tau$ is taken to be very large, is only slightly worse than case b). For case c) there is almost no variation in the first eigenvector, plotted along the $x$-axis. This makes sense since the first eigenvector is proportional to $D_\tau^{1/2}\mathbf{1}$, and for large $\tau$ the diagonal entries of $D_\tau$, each equal to a node degree plus $\tau$, are all approximately $\tau$.

It may seem surprising that in Corollary 6, claim (20), the smallest within-block probability, that is $p_{K,n}$, does not matter at all. One way of explaining this is that if one can do a good job identifying the top $K-1$ highest degree clusters then the cluster with the lowest degree can also be identified simply by eliminating nodes not belonging to this cluster.

## 4 SBM with strong and weak clusters

In many practical situations, not all nodes belong to clusters that can be estimated well. As mentioned in the introduction, these nodes interfere with the clustering of the remaining nodes in the sense that none of the top eigenvectors might discriminate between the nodes that do belong to well-defined clusters. As an example of a real-life data set, we consider the political blogs data set, which has two clusters, in Subsection 5.2. With ordinary spectral clustering, the top two eigenvectors do not discriminate between the two clusters (see Figure 2 for explanation). In fact, it is only the third eigenvector that discriminates between the two clusters. This results in bad clustering performance when the first two eigenvectors are considered. However, regularization rectifies this problem by ‘bringing up’ the important eigenvector, thereby allowing for much better performance.

We model the above situation, where there are main clusters as well as outlier nodes, in the following way: Consider a stochastic block model, as in (14), with $K + K_w$ blocks, where $K$ is the number of strong clusters and $K_w$ the number of weak clusters. In particular, let the block probability matrix be given by

$$B = \begin{pmatrix} B_s & B_{sw} \\ B_{sw}' & B_w \end{pmatrix}, \tag{23}$$

where $B_s$ is a $K \times K$ matrix with $p^s_n$ in the diagonal and $q_n$ in the off-diagonal. Further, $B_{sw}$ and $B_w$ are $K \times K_w$ and $K_w \times K_w$ dimensional matrices respectively. In the above $(K + K_w)$-block SBM, the top $K$ blocks correspond to the well-defined or strong clusters, while the bottom $K_w$ blocks correspond to less well-defined or weak clusters.

We now formalize our notion of strong and weak clusters. The matrix $B_s$ models the distribution of edges between the nodes belonging to the strong clusters, while the matrix $B_w$ has the corresponding role for the weak clusters. The matrix $B_{sw}$ models the interaction between the strong and weak clusters. For ease of analysis, we make the following simplifying assumptions: Assume that all strong clusters share the same within-block probability $p^s_n$, and that the strong clusters have equal sizes.

Let $b_{sw}$ be defined as the maximum of the elements in $B_{sw}$, and let $n_w$ be the number of nodes belonging to a weak cluster. In other words, $n_w$ is the total size of the weak clusters. We make the following three assumptions:

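$$\frac{(p^s_n - q_n)^2}{p^s_n} \ \text{grows faster than} \ \frac{\log n}{n} \tag{24}$$

$$n_w = O(1). \tag{25}$$

$$b_{sw} \lesssim \sqrt{\frac{p^s_n \log n}{n}} \tag{26}$$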

Assumption (24) ensures recovery of the strong clusters if there were no nodes belonging to weak clusters (see Corollary 6 or McSherry [18], Corollary 1). Assumptions (25) and (26) pertain to the nodes in the weak clusters. In particular, Assumption (25) simply states that the total number of nodes belonging to a weak cluster is constant and does not grow with $n$. Assumption (26) states that the density of the edges between the strong and weak clusters, denoted by $b_{sw}$, is not too large.

We only assume that the rank of $B_s$ is $K$. Thus, the rank of $B$ is at least $K$. As before, we assume that $K$ is known and does not grow with $n$. The number of weak clusters, $K_w$, need not be known and could be as high as $n_w$. We do not even place any restriction on the sizes of the weak clusters. Indeed, we even entertain the case that each of the $K_w$ clusters has one node. Consequently, we are only interested in recovering the strong clusters.

Theorem 7 presents our result for the recovery of the strong clusters using the RSC-$\tau_n$ Algorithm, with $\{\tau_n\}$ satisfying

$$\frac{n p^s_n \log n}{\tau_n} = o(1) \tag{27}$$

In other words, the regularization parameter is taken to grow faster than $n p^s_n \log n$, where notice that $n p^s_n$ is of the same order as the expected maximum degree of the graph. Let $\hat T_1, \ldots, \hat T_K$ be the clusters outputted from the RSC-$\tau_n$ Algorithm. Let

$$\hat f = \min_{\pi} \max_{k} \frac{|C_k \cap \hat T^c_{\pi(k)}| + |C_k^c \cap \hat T_{\pi(k)}|}{n_k},$$

be as in (6). Notice that the clusters $C_1, \ldots, C_K$ do not form a partition of $\{1, \ldots, n\}$, while the estimates $\hat T_1, \ldots, \hat T_K$ do. However, since $n_w$ does not grow with $n$ this should not make much of a difference.

###### Theorem 7.

Let Assumptions (24), (25) and (26) be satisfied. If $\{\tau_n\}$ satisfies (27) then the clustering error $\hat f$ for RSC-$\tau_n$ goes to zero with probability tending to one.

The theorem is proved in Appendix C. It states that under Assumptions (24)-(26) one can get the same results with regularization that one would get if the nodes belonging to the weak clusters were not present.

Spectral clustering (with $\tau = 0$) may fail under the above assumptions. This is elucidated in Figure 3. Here there are two strong clusters ($K = 2$) and three weak clusters ($K_w = 3$). The first 1600 nodes are evenly split between the two strong clusters, with the remaining nodes split evenly between the weak clusters. The matrices $B_s$ and $B_w$ are as in (28) and $B_{sw}$ is a constant matrix (all entries equal).

$$B_s = \begin{pmatrix} 0.025 & 0.015 \\ 0.015 & 0.025 \end{pmatrix}, \qquad B_w = \begin{pmatrix} 0.007 & 0.015 & 0.015 \\ 0.015 & 0.0071 & 0.015 \\ 0.015 & 0.015 & 0.0069 \end{pmatrix}. \tag{28}$$

The nodes in the weak clusters have relatively lower degrees and, consequently, cannot be recovered. Figures 3(a) and 3(b) show the first 3 eigenvectors of the population Laplacian in the unregularized and regularized cases, respectively. We plot the first 3 instead of the first 5 eigenvectors in order to facilitate understanding of the plot. In both cases the first eigenvector is not able to distinguish between the two strong clusters. This makes sense since the first eigenvector of the Laplacian has elements whose magnitude is proportional to the square root of the population degrees (see, for example, [26] for a proof of this fact). Consequently, as the population degrees are the same for the two strong clusters, the values of this eigenvector are constant for nodes belonging to the strong clusters.

The situation is different for the second population eigenvector. In the regularized case, the second eigenvector is able to distinguish between these two clusters. However, this is not the case for the unregularized case. From Figure 3(a), not even the third unregularized eigenvector is able to distinguish between the strong and weak clusters. Indeed, it is only the fifth eigenvector that distinguishes between the two strong clusters in the unregularized case.

In Figures 4(a) and 4(b) we show the second sample eigenvector for the two cases in Figures 3(a) and 3(b). Note that we do not show the first sample eigenvector since, from Figures 3(a) and 3(b), the corresponding population eigenvectors are not able to distinguish between the two strong clusters. As expected, it is only in the regularized case that the second eigenvector does a good job of separating the two strong clusters. Running $k$-means, with , resulted in a mis-classification of of the nodes in the strong clusters in the unregularized case, compared with in the regularized case.

## 5 DKest: Data-dependent choice of τ

The results of Sections 3 and 4 theoretically examined the gains from regularization for large values of the regularization parameter $\tau$. Those results do not rule out the possibility that intermediate values of $\tau$ may lead to better clustering performance. In this section we propose a data-dependent scheme to select the regularization parameter. We compare it with the scheme in [8] that uses the Girvan-Newman modularity [6]. We use the widely used normalized mutual information criterion (NMI) [2], [27] to quantify the performance of the spectral clustering algorithm in terms of closeness of the estimated clusters to the true clusters.

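Our scheme works by directly estimating the quantity in (10) in the following manner. For each $\tau$ in the grid, an estimate $\hat{\mathscr{L}}_\tau$ of $\mathscr{L}_\tau$ is obtained using the clusters output by the RSC-$\tau$ algorithm. In particular, let $\hat{C}_{1,\tau}, \ldots, \hat{C}_{K,\tau}$ be the estimates of the clusters produced from running RSC-$\tau$. The estimate $\hat{\mathscr{L}}_\tau$ is taken as the population regularized Laplacian corresponding to an estimated block probability matrix $\hat{B}$ and clusters $\hat{C}_{1,\tau}, \ldots, \hat{C}_{K,\tau}$. More specifically, the $(k_1, k_2)$-th entry of $\hat{B}$ is taken as

$$\hat{B}_{k_1,k_2} = \frac{\sum_{i \in \hat{C}_{k_1,\tau},\, j \in \hat{C}_{k_2,\tau}} A_{ij}}{|\hat{C}_{k_1,\tau}|\,|\hat{C}_{k_2,\tau}|}. \tag{29}$$

The above is simply the proportion of edges between the nodes in the cluster estimates $\hat{C}_{k_1,\tau}$ and $\hat{C}_{k_2,\tau}$. The following statistic is then considered:

$$\mathrm{DKest}_\tau = \frac{\|L_\tau - \hat{\mathscr{L}}_\tau\|}{\mu_K(\hat{\mathscr{L}}_\tau)}, \tag{30}$$

where $\mu_K(\hat{\mathscr{L}}_\tau)$ denotes the $K$-th smallest eigenvalue of $\hat{\mathscr{L}}_\tau$. The $\tau$ that minimizes the $\mathrm{DKest}_\tau$ criterion is then chosen. Since this criterion provides an estimate of the Davis-Kahan bound, we call it the DKest criterion.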
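The computation in (29) and (30) can be sketched as follows for a single value of the regularization parameter, given cluster labels that are assumed to come from a separate run of the regularized spectral clustering algorithm. Several choices here are assumptions and not taken from the text: the regularized Laplacian is formed as $D_\tau^{-1/2} A_\tau D_\tau^{-1/2}$ with $A_\tau = A + (\tau/n)\mathbf{1}\mathbf{1}'$, the matrix norm is the spectral norm, and $\mu_K$ is read as the $K$-th largest eigenvalue in absolute value. The function names are hypothetical.

```python
import numpy as np

def regularized_laplacian(M, tau):
    """Assumed form: add tau/n to every entry, then normalize by row sums.

    Requires positive row sums (for tau = 0 an isolated node would break this).
    """
    n = M.shape[0]
    M_tau = M + tau / n
    d = M_tau.sum(axis=1)
    inv_sqrt_d = 1.0 / np.sqrt(d)
    return inv_sqrt_d[:, None] * M_tau * inv_sqrt_d[None, :]

def dkest_statistic(A, labels, K, tau):
    """Estimate of the Davis-Kahan ratio in (30) for one value of tau.

    labels: integer array with values in 0..K-1; every cluster must be non-empty.
    """
    A = np.asarray(A, dtype=float)
    Z = np.eye(K)[np.asarray(labels)]            # n x K membership matrix
    sizes = Z.sum(axis=0)
    # Equation (29): block-wise edge sums divided by products of cluster sizes.
    B_hat = (Z.T @ A @ Z) / np.outer(sizes, sizes)
    P_hat = Z @ B_hat @ Z.T                      # estimated edge probabilities
    L_sample = regularized_laplacian(A, tau)
    L_hat = regularized_laplacian(P_hat, tau)
    numerator = np.linalg.norm(L_sample - L_hat, 2)   # spectral norm
    magnitudes = np.sort(np.abs(np.linalg.eigvalsh(L_hat)))[::-1]
    return numerator / magnitudes[K - 1]
```

To select the parameter, one would evaluate `dkest_statistic` on each grid value, using the labels obtained at that same value, and keep the minimizer.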

We compare the above to the scheme that uses Girvan-Newman modularity [6], [19], as suggested in [8]. For a particular $\tau$ in the grid, the Girvan-Newman modularity is computed for the clusters output by the RSC-$\tau$ algorithm. The $\tau$ that maximizes the modularity value over the grid is then chosen.
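For reference, a minimal sketch of the modularity computation is given below. It uses the standard definition $Q = \frac{1}{2m}\sum_{i,j}\left(A_{ij} - \frac{d_i d_j}{2m}\right)\mathbf{1}\{c_i = c_j\}$ for an undirected graph with $m$ edges; whether [8] uses exactly this variant is an assumption, and the function name is hypothetical.

```python
import numpy as np

def girvan_newman_modularity(A, labels):
    """Modularity of a partition of an undirected graph with adjacency matrix A."""
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    degrees = A.sum(axis=1)
    two_m = degrees.sum()                          # twice the number of edges
    same_cluster = labels[:, None] == labels[None, :]
    expected = np.outer(degrees, degrees) / two_m  # null-model edge weights
    return ((A - expected) * same_cluster).sum() / two_m
```

The modularity-based scheme would compute this value for the clusters obtained at each grid point and keep the maximizer.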

Notice that the best possible choice of $\tau$ would be the one that simply maximizes the NMI over the selected grid. However, this cannot be computed in practice since calculation of the NMI requires knowledge of the true clusters. Nevertheless, this provides a useful benchmark against which one can compare the other two schemes. We call this the ‘oracle’ scheme.
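As an illustration of the benchmark quantity, the sketch below computes one common version of NMI, namely the mutual information divided by the average of the two label entropies. The exact normalization used in [2], [27] may differ, so this choice is an assumption, and the function name is hypothetical.

```python
import numpy as np

def normalized_mutual_information(labels_a, labels_b):
    """NMI between two labelings, normalized by the mean of the two entropies."""
    a = np.unique(np.asarray(labels_a).ravel(), return_inverse=True)[1]
    b = np.unique(np.asarray(labels_b).ravel(), return_inverse=True)[1]
    joint = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(joint, (a, b), 1.0)                 # contingency table
    joint /= joint.sum()
    pa = joint.sum(axis=1)
    pb = joint.sum(axis=0)
    nonzero = joint > 0
    mutual_info = (joint[nonzero]
                   * np.log(joint[nonzero] / np.outer(pa, pb)[nonzero])).sum()
    entropy_a = -(pa * np.log(pa)).sum()
    entropy_b = -(pb * np.log(pb)).sum()
    if entropy_a + entropy_b == 0:                # both labelings are trivial
        return 1.0
    return 2.0 * mutual_info / (entropy_a + entropy_b)
```

The value is invariant to relabeling of the clusters, which is what makes it suitable for comparing estimated clusters with true ones.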

### 5.1 Simulation Results

Figure 5 provides results comparing the three schemes, viz. the DKest, Girvan-Newman and ‘oracle’ schemes. We perform simulations following the pattern of [2]. In particular, for a graph with $n$ nodes we take the $K$ clusters to be of equal sizes. The block probability matrix is taken to be of the form

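$$B = \mathrm{fac}\begin{pmatrix} \beta w_1 & 1 & \cdots & 1 \\ 1 & \beta w_2 & \cdots & 1 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & \cdots & 1 & \beta w_K \end{pmatrix}.$$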

Here, the vector $w = (w_1, \ldots, w_K)$, which contains the inside weights, denotes the relative degrees of nodes within the communities. Further, the quantity $\beta$, which is the out-in ratio, represents the ratio of the probability of an edge between nodes from different communities to the probability of an edge between nodes in the same community. The scalar parameter fac is chosen so that the average expected degree of the graph is equal to a prescribed value.
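The scaling by fac can be made concrete with a short sketch. It builds the matrix exactly as displayed above (diagonal entries $\beta w_k$, off-diagonal entries 1) and solves for fac so that the average expected degree matches a target. The argument name `target_degree`, the function name, and the exclusion of self-loops from the degree are assumptions made for illustration.

```python
import numpy as np

def block_probability_matrix(inside_weights, beta, n, target_degree):
    """Return fac * B0, with B0 as displayed, for K equal-sized clusters."""
    w = np.asarray(inside_weights, dtype=float)
    K = len(w)
    if n % K != 0:
        raise ValueError("n must be divisible by the number of clusters")
    B0 = np.ones((K, K))
    np.fill_diagonal(B0, beta * w)
    size = n // K
    # Expected degree of a node in block k before scaling, excluding itself.
    unscaled_degree = size * B0.sum(axis=1) - np.diag(B0)
    # Blocks have equal size, so the mean over blocks is the mean over nodes.
    fac = target_degree / unscaled_degree.mean()
    B = fac * B0
    if B.max() > 1.0:
        raise ValueError("target_degree too large: probabilities exceed 1")
    return B
```

For example, `block_probability_matrix([1, 2, 3], 5.0, 30, 6.0)` gives a three-block matrix whose implied average expected degree is 6.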

Figure 5 compares the two methods of choosing the best $\tau$ for various choices of and . In general, we see that the DKest selection procedure performs at least as well as, and in some cases much better than, the procedure that uses the Girvan-Newman modularity. The performance of the two methods is much closer when the average degree is small.

### 5.2 Analysis of the Political Blogs dataset

Here we investigate the performance of DKest on the well-studied network of political blogs [1]. The data set aims to study the degree of interaction between liberal and conservative blogs over a period prior to the 2004 U.S. Presidential Election. The nodes in the network are select conservative and liberal blog sites. While the original data set had directed edges corresponding to hyperlinks between the blog sites, we converted it to an undirected graph by connecting two nodes with an edge if there is at least one hyperlink from one node to the other.
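The conversion to an undirected graph described above can be sketched as follows. The input is assumed to be a square array whose nonzero entry $(i, j)$ marks a hyperlink from $i$ to $j$; discarding self-links is an additional assumption, and the function name is hypothetical.

```python
import numpy as np

def symmetrize(directed):
    """Connect i and j whenever a link exists in at least one direction."""
    has_link = np.asarray(directed) != 0
    undirected = has_link | has_link.T
    np.fill_diagonal(undirected, False)   # assumption: self-links are dropped
    return undirected.astype(int)
```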

The data set has 1222 nodes with an average degree of 27. Spectral clustering ($\tau = 0$) resulted in only 51% of the nodes correctly classified as liberal or conservative. The oracle procedure, with , resulted in 95% of the nodes correctly classified. The DKest procedure selected , with an accuracy of 81%. The Girvan-Newman (GN) procedure, in this case, outperforms the DKest procedure, providing the same accuracy as the oracle procedure. Figure 6 illustrates these findings. As predicted by our theory, the performance becomes insensitive for large $\tau$. In this case 70% of the nodes are correctly clustered for large $\tau$.

We remark that the DKest procedure does not perform as well as the GN procedure, most likely because our estimate in (30) assumes that the data is generated from an SBM, which is a poor model for the data due to the large heterogeneity in the node degrees. A better model for the data would be the degree corrected stochastic block model (D-SBM) proposed by Karrer and Newman [14]. If we instead use D-SBM based estimates in (30), then the selection of $\tau$ matches that of the GN and the oracle procedures. See Section 6 for a discussion on this.

The results of Section 4 also explain why unregularized spectral clustering performs badly (see Figure 2). The first eigenvector in both cases (regularized and unregularized) does not discriminate between the two clusters. In Figure 7, we plot the second eigenvector of the regularized and unregularized Laplacians. The second eigenvector is able to discriminate between the clusters in the regularized case, while it fails to do so without regularization. Indeed, it is only the third eigenvector in the unregularized case that distinguishes between the clusters, as shown in Figure 8.

## 6 Discussion

The paper provides a theoretical justification for regularization. In particular, we show why choosing a large regularization parameter can lead to good results. The paper also partly explains empirical findings in Amini et al. [2] showing that the performance of regularized spectral clustering becomes insensitive for larger values of regularization parameters. It is unclear at this stage whether the benefits of regularization, resulting from the trade-offs between the eigen gap and the concentration bound, hold for the regularization in [7], [22] as they hold for the regularization in Amini et al. [2] (as demonstrated in Sections 3 and 4).

Even though our theoretical results focus on larger values of the regularization parameter, it is very likely that intermediate values of $\tau$ produce better clustering performance. Consequently, we propose a data-driven methodology for choosing the regularization parameter. We hope to quantify theoretically the gains from using intermediate values of the regularization parameter in future work.

For the extension of the SBM proposed in Section 4, if the rank of , given by (23), is then the model encompasses specific degree-corrected stochastic block models (D-SBM) [14] where the edge probability matrix takes the form

$$P = \Theta Z B Z' \Theta.$$

Here $\Theta$ models the heterogeneity in the degrees. In particular, consider a -block D-SBM with , for each . Assume that for most of the nodes. Take the nodes in the strong clusters to be those with . The nodes in the strong clusters are associated to one of clusters depending on the cluster they belong to in the D-SBM. The remaining nodes are taken to be in the weak clusters. Assumptions (25) and (26) put constraints on the ’s, which allow one to distinguish between the strong clusters via regularization. It would be interesting to investigate the effect of regularization in more general versions of the D-SBM, especially where there are high as well as low degree nodes.

The methodology for choosing the regularization parameter works by providing estimates of the population Laplacian assuming that the data is drawn from an SBM. From our simulations, it is seen that the performance of DKest does not change much if we take the matrix norm in the numerator of (30) to be the Frobenius norm, which is much faster to compute.

It is seen that the performance of DKest improves for the political blogs data set by taking $\hat{\mathscr{L}}_\tau$ to be the estimate assuming that the data is drawn from the more flexible D-SBM. Indeed, if we take $\hat{\mathscr{L}}_\tau$ to be such an estimate then the performance of DKest is seen to be as good as the oracle scheme (and the GN scheme) for this data set. We describe how we construct this estimate in Appendix D.

## Acknowledgments

This paper is supported in part by NSF grants DMS-1228246 and DMS-1160319 (FRG), ARO grant W911NF-11-1-0114, NHGRI grant 1U01HG007031-01 (ENCODE), and the Center for Science of Information (CSoI), a US NSF Science and Technology Center, under grant agreement CCF-0939370. A. Joseph would like to thank Sivaraman Balakrishnan and Purnamrita Sarkar for some very helpful discussions, and also Arash A. Amini for sharing the code used in the work [2].

## Appendix A Analysis of SBM with K blocks

Throughout this section we assume that we have samples from a $K$-block SBM. Denote the sample and population regularized Laplacians as $L_\tau$ and $\mathscr{L}_\tau$, respectively. For ease of notation, we remove the subscript $\tau$ from the various matrices such as . We also remove the subscript $\tau$ in the $\hat{d}_{i,\tau}, d_{i,\tau}$’s and denote these as $\hat{d}_i, d_i$ respectively. However, in some situations we may need to refer to these quantities at $\tau = 0$. In such cases, we make this clear by writing them as , for and for .

We need probabilistic bounds on the weighted sum of Bernoulli random variables. The following lemma is proved in [13].

###### Lemma 8.

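Let $W_j$, $1 \le j \le N$, be independent Bernoulli($r_j$) random variables. Furthermore, let $\alpha_j$, $1 \le j \le N$, be non-negative weights that sum to 1 and let $N_\alpha = 1/\max_j \alpha_j$. Then the weighted sum $\hat{r} = \sum_j \alpha_j W_j$, which has mean given by $r^* = \sum_j \alpha_j r_j$, satisfies the following large deviation inequalities. For any $\tilde{r}$ with $0 < \tilde{r} < r^*$,

$$P(\hat{r} < \tilde{r}) \le \exp\{-N_\alpha D(\tilde{r} \,\|\, r^*)\} \tag{31}$$

and for any $\tilde{r}$ with $r^* < \tilde{r} < 1$,

$$P(\hat{r} > \tilde{r}) \le \exp\{-N_\alpha D(\tilde{r} \,\|\, r^*)\} \tag{32}$$

where $D(r \,\|\, r^*)$ denotes the relative entropy between Bernoulli random variables of success parameters $r$ and $r^*$.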

The following is an immediate corollary of the above.

###### Corollary 9.

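Let $W_j$ be as in Lemma 8. Let $\beta_j$, for $j = 1, \ldots, N$, be non-negative weights, and let

$$W = \sum_{j=1}^{N} \beta_j W_j.$$

Then,

$$P(W - E(W) > \delta) \le \exp\left\{-\frac{1}{2\max_j \beta_j}\,\frac{\delta^2}{E(W) + \delta}\right\} \tag{33}$$

and

$$P(W - E(W) < -\delta) \le \exp\left\{-\frac{1}{2\max_j \beta_j}\,\frac{\delta^2}{E(W)}\right\}. \tag{34}$$

###### Proof.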

Here we use the fact that

$$D(r \,\|\, r^*) \ge \frac{(r - r^*)^2}{2r}, \tag{35}$$

for any . We prove (33). The proof of (34) is similar. The event under consideration may be written as

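$$\{\hat{r} - r^* > \tilde{\delta}\},$$

where $\hat{r} = W/\sum_j \beta_j$, $r^* = E(W)/\sum_j \beta_j$ and $\tilde{\delta} = \delta/\sum_j \beta_j$. Correspondingly, using Lemma 8 with the normalized weights $\beta_j/\sum_\ell \beta_\ell$, together with (35), one gets that

$$P(W - E(W) > \delta) \le \exp\left\{-\frac{\sum_j \beta_j}{\max_j \beta_j}\,\frac{\tilde{\delta}^2}{2(r^* + \tilde{\delta})}\right\}.$$

Substituting the values of $\tilde{\delta}$ and $r^*$ results in bound (33). ∎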
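As a sanity check on (33), the following sketch compares the bound with the exact upper-tail probability, computed by enumerating all outcomes for a small number of Bernoulli variables. The weights and success probabilities are hypothetical values chosen only for illustration, as are the function names.

```python
import itertools
import math

def upper_tail_bound(beta, r, delta):
    """Right-hand side of (33) for weights beta and success probabilities r."""
    mean = sum(b * p for b, p in zip(beta, r))
    return math.exp(-delta ** 2 / (2.0 * max(beta) * (mean + delta)))

def exact_upper_tail(beta, r, delta):
    """P(W - E(W) > delta) by brute-force enumeration of all 2^N outcomes."""
    mean = sum(b * p for b, p in zip(beta, r))
    total = 0.0
    for outcome in itertools.product([0, 1], repeat=len(beta)):
        prob = 1.0
        for w, p in zip(outcome, r):
            prob *= p if w == 1 else 1.0 - p
        value = sum(b * w for b, w in zip(beta, outcome))
        if value - mean > delta:
            total += prob
    return total

# Hypothetical example values.
beta = [1.0, 2.0, 0.5, 1.5]
r = [0.1, 0.3, 0.2, 0.4]
comparison = [(exact_upper_tail(beta, r, d), upper_tail_bound(beta, r, d))
              for d in (0.5, 1.0, 2.0)]
```

Each pair in `comparison` holds the exact probability followed by the bound; enumeration is only feasible for small $N$.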

The following lemma provides high probability bounds on the degree. Let and .

###### Lemma 10.

On a set of probability at most , one has

$$|\hat{d}_{i,\tau} - d_{i,\tau}| \le c_2\sqrt{\delta_{i,c}\log n} \quad \text{for each } i = 1, \ldots, n,$$

where .

###### Proof.

Use the fact that , and

 P(|^di,0−di,0|≤c2√δi,clogn∀i)≤n∑i=1P(|^di,0−d