# A Unified Framework for Tuning Hyperparameters in Clustering Problems

## Abstract

Selecting hyperparameters for unsupervised learning problems is challenging in general due to the lack of ground truth for validation. Despite the prevalence of this issue in statistics and machine learning, especially in clustering problems, there are not many methods for tuning these hyperparameters with theoretical guarantees. In this paper, we provide a framework with provable guarantees for selecting hyperparameters in a number of distinct models. We consider both the subgaussian mixture model and network models to serve as examples of i.i.d. and non-i.i.d. data. We demonstrate that the same framework can be used to choose the Lagrange multipliers of penalty terms in semidefinite programming (SDP) relaxations for community detection, and the bandwidth parameter for constructing kernel similarity matrices for spectral clustering. By incorporating a cross-validation procedure, we show the framework can also do consistent model selection for network models. Using a variety of simulated and real data examples, we show that our framework outperforms other widely used tuning procedures in a broad range of parameter settings.

## 1 Introduction

A standard statistical model has parameters, which characterize the underlying data distribution; an inference algorithm to learn these parameters typically involves hyperparameters (or tuning parameters). Popular examples include the penalty parameter in regularized regression models, the number of clusters in clustering analysis, and the bandwidth parameter in kernel-based clustering, nonparametric density estimation or regression methods (wasserman2006all; tibshirani2015statistical), to name but a few. It is well known that selecting these hyperparameters may require repeated training to search through different combinations of plausible hyperparameter values and often has to rely on good heuristics and domain knowledge from the user.

A classical method for automated hyperparameter tuning is the nonparametric procedure Cross Validation (CV) (stone1974cross; zhang1993model), which has been used extensively in machine learning and statistics (hastie2005elements). CV has been studied extensively in supervised learning settings, particularly in low dimensional linear models (shao1993linear; yang2007consistency) and penalized regression in high dimensions (wasserman2009high). Other notable stability based methods for model selection in similar supervised settings include breiman1996heuristics; bach2008bolasso; meinshausen2010stability; lim2016estimation. Finally, a large number of empirical methods exist in the machine learning literature for tuning hyperparameters in various training algorithms (bergstra2012random; bengio2000gradient; snoek2012practical; bergstra2011algorithms), most of which do not provide theoretical guarantees.

In contrast to the supervised setting with i.i.d. data used in many of the above methods, in this paper we consider unsupervised clustering problems with possible dependence structure in the datapoints. We propose an overarching framework for hyperparameter tuning and model selection for a variety of probabilistic clustering models. Here the challenge is two-fold. First, since labels are not available, choosing a criterion for evaluation and, in general, a method for selecting hyperparameters is not easy. One may consider splitting the data into different folds and selecting the model or hyperparameter with the most stable solution. However, for multiple splits of the data, the inference algorithm may get stuck at the same local optimum, and thus stability alone can lead to a suboptimal solution (von2010clustering). In wang2010consistent; fang2012selection, the authors overcome this by redefining the number of clusters as one that gives the most stable clustering for a given algorithm. In meila2018tell, a semi-definite program (SDP) maximizing an inner product criterion is performed for each clustering solution, and the value of the objective function is used to evaluate the stability of the clustering. The analysis is done without any model assumptions. The second difficulty arises if there is dependence structure in the datapoints, which necessitates careful splitting procedures in a CV-based procedure.

To illustrate the generality of our framework, we focus on subgaussian mixtures and on statistical network models such as the Stochastic Blockmodel (SBM) and the Mixed Membership Stochastic Blockmodel (MMSB) as two representative models for i.i.d. data and non-i.i.d. data, where clustering is a natural problem. We propose a unified framework with provable guarantees to do hyperparameter tuning and model selection in these models. More specifically, our contributions can be summarized as below:

1. Our framework can provably tune the following hyperparameters:

   1. Lagrange multiplier of the penalty term in a type of semidefinite relaxation for community detection problems in SBM;

   2. Bandwidth parameter used in kernel spectral clustering for subgaussian mixture models.

2. We have consistent model selection, i.e. determining the number of clusters:

   1. When the model selection problem is embedded in the choice of the Lagrange multiplier in another type of SDP relaxation for community detection in SBM;

   2. General model selection for the Mixed Membership Stochastic Blockmodel (MMSB), which includes the SBM as a sub-model.

We choose to focus on model selection for network-structured data, because there already is an extensive repertoire of empirical and provable methods for i.i.d. mixture models, including the gap statistic (tibs2001gap), silhouette index (ROUSSEEUW198753), the slope criterion (Birge2001), eigen-gap (von2007tutorial), penalized maximum likelihood (leroux1992), information theoretic approaches (AIC (Bozdogan1987ModelSA), BIC (keribin2000; drtonjrssb), minimum message length (figueiredo2002mml)), and spectral clustering and diffusion based methods (Maggioni2018LearningBU; little2017spec). We discuss the related work on the other models in the following subsection.

### 1.1 Related Work

Hyperparameters and model selection in network models: In network analysis, while a number of methods exist for selecting the true number of communities (denoted by $r$) with consistency guarantees, including lei2016goodness; wang2017; le2015estimating; bickel2016hypothesis for SBM, and fan2019simple and han2019universal for more general models such as the degree-corrected mixed membership blockmodel, these methods have not been generalized to other hyperparameter selection problems. For CV-based methods, existing strategies involve node splitting (chen2018network) or edge splitting (li2016network). In the former, it is established that CV prevents underfitting for model selection in SBM. In the latter, a similar one-sided consistency result is shown for the Random Dot Product Graph (RDPG) model (young2007random), which includes SBM as a special case. This method has also been applied empirically to tune other hyperparameters, though no provable guarantee was provided.

In terms of algorithms for community detection or clustering, SDP methods have gained a lot of attention (abbe2015exact; amini2018semidefinite; Guedon2016; cai2015robust; hajek2016achieving) due to their strong theoretical guarantees. Typically, SDP-based methods can be divided into two broad categories. The first one maximizes a penalized trace of the product of the adjacency matrix and an unnormalized clustering matrix (see definition in Section 2.2). Here the hyperparameter is the Lagrange multiplier of the penalty term (amini2018semidefinite; cai2015robust; chen2018network; Guedon2016). In this formulation, the optimization problem does not need to know the number of clusters. However, it is implicitly required in the final step, which obtains the memberships from the clustering matrix.

The other class of SDP methods uses a trace criterion with a normalized clustering matrix (definition in Section 2.2) (Peng:2007; Yan2019CovariateRC; mixon2017sdp). Here the constraints directly use the number of clusters. yan2017provable use a penalized alternative of this SDP to do provable model selection for SBMs. However, most of these methods require appropriate tuning of the Lagrange multipliers, which are themselves hyperparameters. Usually the theoretical upper and lower bounds on these hyperparameters involve unknown model parameters, which are nontrivial to estimate. The proposed method in abbe2015recovering is agnostic of model parameters, but it involves a highly tuned and hard-to-implement spectral clustering step (also noted by perry2017semidefinite).

In this paper, we use an SDP from the first class (SDP-1) to demonstrate our provable tuning procedure, and another SDP from the second class (SDP-2) to establish a consistency guarantee for our model selection method.

Spectral clustering with mixture model: In the statistical machine learning literature, analysis of spectral clustering is typically done in terms of the Laplacian matrix built from an appropriately constructed similarity matrix of the datapoints. There has been much work (hein2005; hein2006uniform; vonLuxburg2007; belkin2003laplacian; gine2006empirical) on establishing different forms of asymptotic convergence of the Laplacian. Recently, löffler2019optimality have established error bounds for spectral clustering that uses the Gram matrix as the similarity matrix. In srivastava2019robust, error bounds are obtained for a variant of spectral clustering for the Gaussian kernel in the presence of outliers. Most of the existing tuning procedures for the bandwidth parameter of the Gaussian kernel are heuristic and do not have provable guarantees. Notable methods include vonLuxburg2007, who chooses an analogous parameter, namely the radius in an $\epsilon$-neighborhood graph, “as the length of the longest edge in a minimal spanning tree of the fully connected graph on the data points.” Other discussions on selecting the bandwidth can be found in (hein2005; coifman2008random) and (schiebinger2015). shi2008data propose a data dependent way to set the bandwidth parameter by suitably normalizing the quantile of a vector containing quantiles of distances from each point.

We now present our problem setup in Section 2. Section 3 proposes and analyzes our hyperparameter tuning method MATR for networks and subgaussian mixtures. Next, in Section 4, we present MATR-CV and the related consistency guarantees for model selection for SBM and MMSB models. Finally, Section 5 contains detailed simulated and real data experiments, and we conclude the paper with a discussion in Section 6.

## 2 Preliminaries and Notations

### 2.1 Notations

Let denote a partition of data points into clusters; denote the size of . Denote . The cluster membership of each node is represented by a matrix , with if data point belongs to cluster , and otherwise. Since is the true number of clusters, is full rank. Given , the corresponding unnormalized clustering matrix is , and the normalized clustering matrix is . can be either a normalized or unnormalized clustering matrix, and will be made clear. We use to denote the matrix returned by SDP algorithms, which may not be a clustering matrix. Denote as the set of all possible normalized clustering matrices with cluster number . Let and be the membership and normalized clustering matrix from the ground truth. is a general hyperparameter; although with a slight abuse of notation, we also use to denote the Lagrange multiplier in SDP methods. For any matrix , let be a matrix such that if , and otherwise. is the all ones matrix. We write Standard notations of will be used. By “with high probability”, we mean with probability tending to one.

### 2.2 Problem setup and motivation

We consider a general clustering setting where the data gives rise to an observed similarity matrix , where is symmetric. Denote as a clustering algorithm which operates on the data with a hyperparameter and outputs a clustering result in the form of or . Here note that may or may not perform clustering on , and , and could all depend on . In this paper we assume that has the form , where is a matrix of arbitrary noise, and is the “population similarity matrix”. As we consider different clustering models for network-structured data and i.i.d. mixture data, it will be made clear what and are in each context.

Assortativity (weak and strong): In some cases, we require weak assortativity on the similarity matrix defined as follows. Suppose for , . Define the minimal difference between diagonal term and off-diagonal terms in the same row cluster as

$$p_{\mathrm{gap}} = \min_k \Big( a_{kk} - \max_{i \in C_k,\, j \in C_\ell,\, \ell \neq k} S_{ij} \Big). \tag{1}$$

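Weak assortativity requires $p_{\mathrm{gap}} > 0$. This condition is similar to weak assortativity defined for blockmodels (e.g. amini2018semidefinite). It is mild compared to strong assortativity, which requires every within-cluster value $a_{kk}$ to exceed every between-cluster entry of $S$, i.e. $\min_k a_{kk} - \max_{i \in C_k,\, j \in C_\ell,\, \ell \neq k} S_{ij} > 0$.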
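To make the distinction concrete, here is a minimal sketch that evaluates both conditions at the block level. It assumes the population similarity matrix is blockwise constant, so that Eq (1) can be computed from an $r \times r$ matrix of block values; the function name `assortativity_gaps` and the example matrix are hypothetical and chosen only for illustration.

```python
def assortativity_gaps(B):
    """Return (weak_gap, strong_gap) for a square matrix B of block values.

    Assumes a blockwise-constant similarity matrix with at least two blocks.
    weak_gap   = min_k ( B[k][k] - max_{l != k} B[k][l] )  (row-wise, as in Eq (1))
    strong_gap = min_k B[k][k] - max_{k != l} B[k][l]      (global comparison)
    """
    r = len(B)
    # Largest off-diagonal value in each row.
    off_diag_row_max = [max(B[k][l] for l in range(r) if l != k) for k in range(r)]
    weak_gap = min(B[k][k] - off_diag_row_max[k] for k in range(r))
    strong_gap = min(B[k][k] for k in range(r)) - max(off_diag_row_max)
    return weak_gap, strong_gap


# Hypothetical example: every diagonal entry dominates its own row,
# but the smallest diagonal entry (0.2) is below the largest
# off-diagonal entry (0.5).
B = [[0.9, 0.5, 0.1],
     [0.5, 0.6, 0.1],
     [0.1, 0.1, 0.2]]
weak_gap, strong_gap = assortativity_gaps(B)
```

In this example the weak gap is positive while the strong gap is negative, so the matrix is weakly but not strongly assortative.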

Stochastic Blockmodel (SBM): The SBM is a generative model of networks with community structure on nodes. By first partitioning the nodes into classes which leads to a membership matrix , the binary adjacency matrix is sampled from probability matrix , where and are the and row of matrix , and is the block probability matrix. The aim is to estimate node memberships given . We assume the elements of have order with at some rate.
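The generative step can be sketched as follows. This is an illustrative sampler, not the simulation code used in the paper: it assumes an undirected graph without self-loops, independent Bernoulli edges, and hard labels given as integers; the name `sample_sbm` and the example probabilities are hypothetical.

```python
import random


def sample_sbm(labels, B, seed=0):
    """Sample a symmetric 0/1 adjacency matrix from an SBM.

    labels[i] is the community of node i, and B[k][l] is the probability of
    an edge between a node in community k and a node in community l.
    Assumption: no self-loops, so the diagonal stays zero.
    """
    rng = random.Random(seed)
    n = len(labels)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # One Bernoulli draw per unordered pair, mirrored for symmetry.
            if rng.random() < B[labels[i]][labels[j]]:
                A[i][j] = A[j][i] = 1
    return A


labels = [0, 0, 0, 1, 1, 1]
A = sample_sbm(labels, [[0.8, 0.1], [0.1, 0.7]], seed=1)
```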

Mixed Membership Stochastic Blockmodel (MMSB): The SBM can be restrictive when it comes to modeling real-world networks. As a result, various extensions have been proposed. The mixed membership stochastic blockmodel (MMSB; airoldi2008mmsb) relaxes the requirement that the membership vector be binary and allows its entries to lie in $[0, 1]$, such that they sum to 1 for every node. We will denote this soft membership matrix by .

Under the MMSB model, the adjacency matrix is sampled from the probability matrix with . We use an analogous definition for normalized clustering matrix: . Note that this reduces to the usual normalized clustering matrix when is a binary cluster membership matrix.

Mixture of sub-gaussian random variables: Let be a data matrix. We consider a setting where are generated from a mixture model with clusters,

$$Y_i = \mu_a + W_i, \quad E(W_i) = 0, \quad \mathrm{Cov}(W_i) = \sigma_a^2 I, \quad a = 1, \dots, r, \tag{2}$$

where the $W_i$'s are independent sub-gaussian vectors.

Trace criterion: Our framework is centered around the trace , where is the normalized clustering matrix associated with hyperparameter . This criterion is often used in relaxations of the k-means objective (mixon2017sdp; Peng:2007; yan2017provable) in the context of SDP methods. The idea is that the criterion is large when datapoints within the same cluster are more similar. This criterion is also used by meila2018tell for evaluating stability of a clustering solution, where the author uses SDP to maximize this criterion for each clustering solution. Of course, this makes the implicit assumption that (and ) is assortative, i.e. datapoints within the same cluster have high similarity based on . While this is reasonable for iid mixture models, not all community structures in network models are assortative if we use the adjacency matrix as . If all the communities in a network are dis-assortative, then one can just use as . However, for the SBM or MMSB models, one may have a mixture of assortative and dis-assortative structure. In what follows, we begin our discussion of hyperparameter tuning and model selection for SBM by assuming weak assortativity, both for ease of demonstration and the fact that our algorithms of interest, SDP methods, operate on weakly assortative networks. For MMSB, which includes SBM as a sub-model, we show the same criterion still works without assortativity if we choose to be with the diagonal removed.
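For hard memberships the trace criterion has a simple closed form, which may help build intuition. The following assumes the standard normalized clustering matrix $X = Z(Z^TZ)^{-1}Z^T$ for a binary membership matrix $Z$ with clusters $C_1, \dots, C_r$, and writes $M$ for a generic symmetric similarity matrix (a symbol introduced here for illustration only). Since $Z^TZ$ is diagonal with the cluster sizes on its diagonal, $X_{ij} = 1/|C_k|$ when $i, j \in C_k$ and $X_{ij} = 0$ otherwise, so

$$\langle M, X \rangle = \mathrm{trace}(MX) = \sum_{k=1}^{r} \frac{1}{|C_k|} \sum_{i, j \in C_k} M_{ij}.$$

Each cluster therefore contributes its size times its average within-cluster similarity, which is why the criterion rewards clusterings whose members are similar to each other.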

## 3 Hyperparameter tuning with known r

In this section, we consider tuning hyperparameters when the true number of clusters is known. First, we provide two simulation studies to motivate this section. The detailed parameter settings for generating the data can be found in the Appendix Section C.

As mentioned in Section 1.1, SDP is an important class of methods for community detection in SBM, but its performance can depend on the choice of the Lagrange multiplier parameter. We first consider an SDP formulation (li2018convex), which has been widely used with slight variations in the literature (amini2018semidefinite; perry2017semidefinite; Guedon2016; cai2015robust; chen2018network),

$$\max\ \mathrm{trace}(AX) - \lambda\, \mathrm{trace}(XE_n) \quad \text{s.t.} \quad X \succeq 0,\ X \ge 0,\ X_{ii} = 1 \text{ for } 1 \le i \le n, \tag{SDP-1}$$

where $\lambda$ is a hyperparameter. Typically, one then performs spectral clustering (that is, $k$-means on the top eigenvectors) on the output of the SDP to get the clustering result. In Figure 1(a), we generate an adjacency matrix from the probability matrix described in Appendix Section C and use SDP-1 with the tuning parameter $\lambda$ ranging from 0 to 1. The accuracy of the clustering result is measured by the normalized mutual information (NMI). We can see that different values of $\lambda$ lead to widely varying clustering performance.

As a second example, we consider a four-component Gaussian mixture model generated as described in Appendix Section C. We perform spectral clustering ($k$-means on the top eigenvectors) on the widely used Gaussian kernel matrix (denoted ) with bandwidth parameter $\theta$. Figure 1(b) shows the clustering performance using NMI as $\theta$ varies, and the flat region of suboptimal $\theta$ corresponds to cases when the two adjacent clusters cannot be separated well.

We show that in the case where the true cluster number is known, an ideal hyperparameter can be chosen by simply maximizing the trace criterion introduced in Section 2.2. The tuning algorithm (MATR) is presented in Algorithm 1. It takes a general clustering algorithm , data and similarity matrix as inputs, and outputs a clustering result with chosen by maximizing the trace criterion.
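Since Algorithm 1 is not reproduced in this section, the following is only a minimal sketch of the procedure as described in the paragraph above: run the clustering algorithm for each candidate hyperparameter value and keep the value whose clustering maximizes the trace criterion. The names `matr`, `trace_criterion`, and `cluster_fn` are hypothetical, the clustering algorithm is supplied by the caller and assumed to return hard labels, and the trace is computed through the normalized clustering matrix induced by those labels.

```python
from collections import defaultdict


def trace_criterion(S, labels):
    """<S, X> where X is the normalized clustering matrix of hard labels.

    X[i][j] = 1/|C_k| if i and j are both in cluster k, else 0, so the inner
    product is the sum over clusters of (block sum of S) / (cluster size).
    """
    clusters = defaultdict(list)
    for i, k in enumerate(labels):
        clusters[k].append(i)
    return sum(
        sum(S[i][j] for i in members for j in members) / len(members)
        for members in clusters.values()
    )


def matr(cluster_fn, data, S, candidates):
    """Return (best_value, labels, score) maximizing the trace criterion.

    cluster_fn(data, value) is a user-supplied clustering routine (assumed).
    Ties are broken in favor of the earliest candidate.
    """
    best = None
    for value in candidates:
        labels = cluster_fn(data, value)
        score = trace_criterion(S, labels)
        if best is None or score > best[0]:
            best = (score, value, labels)
    if best is None:
        raise ValueError("candidates must be non-empty")
    score, value, labels = best
    return value, labels, score


# Toy usage: 1-D points, negative squared distance as similarity, and a
# threshold rule standing in for a real clustering algorithm.
data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
S = [[-(x - y) ** 2 for y in data] for x in data]
threshold_clusterer = lambda pts, cut: [int(x > cut) for x in pts]
best_value, best_labels, best_score = matr(threshold_clusterer, data, S, [0.05, 2.5, 5.15])
```

In the toy example, only the middle threshold separates the two groups, and it is the one selected. The number of clusters is held fixed across candidates, matching the known-$r$ setting of this section.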

We have the following theoretical guarantee for Algorithm 1.

###### Theorem 1.

Consider a clustering algorithm with inputs and output . The similarity matrix used for Algorithm 1(MATR) can be written as . We further assume is weakly assortative with defined in Eq (1), and is the normalized clustering matrix for the true binary membership matrix . Let be the smallest cluster proportion, and . As long as there exists , such that , Algorithm 1 will output a , such that

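$$\big\|\hat{X}_{\lambda^*} - X_0\big\|_F^2 \le \frac{2}{\tau} \Big( \epsilon + \sup_{X \in \mathcal{X}_r} |\langle X, R \rangle| \Big),$$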

where is the normalized clustering matrix associated with

In other words, as long as the range of we consider covers some optimal value that leads to a sufficiently large trace criterion (compared with the true underlying and the population similarity matrix ), the theorem guarantees Algorithm 1 will lead to a normalized clustering matrix with small error. The deviation depends both on the noise matrix and how close the estimated is to the ground truth , i.e. the performance of the algorithm. If both and are , then MATR will yield a clustering matrix which is weakly consistent. The proof is in the Appendix Section A.

In the following subsections, we apply MATR to more specific settings, namely to select the Lagrange multiplier parameter in SDP-1 for SBM and the bandwidth parameter in spectral clustering for sub-gaussian mixtures.

### 3.1 Hyperparameter tuning for SBM

We consider the problem of choosing in SDP-1 for community detection in SBM. Here, the input to Algorithm 1 – the data and similarity matrix – are both the adjacency matrix . A natural choice of a weakly assortative is the conditional expectation of , i.e. up to diagonal entries: let for and for . Note that is blockwise constant, and assortativity condition on translates naturally to the usual assortativity condition on . As the output matrix from SDP-1 may not necessarily be a clustering matrix, we use spectral clustering on to get the membership matrix required in Algorithm 1. SDP-1 together with spectral clustering is used as .

In Proposition 13 of the Appendix, we show that SDP-1 is strongly consistent when applied to a general strongly assortative SBM with known $r$, as long as $\lambda$ satisfies:

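$$\max_{k \neq l} B_{k,l} + \Omega\Big(\sqrt{\rho \log n / n\pi_{\min}}\Big) \le \lambda \le \min_k B_{kk} + O\Big(\sqrt{\rho \log n / n\pi_{\max}^2}\Big) \tag{3}$$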

An empirical way of choosing $\lambda$ was provided in cai2015robust, which we compare against in Section 5. We first show a result complementary to Eq 3 under an SBM with weakly assortative $B$: for a specific region of $\lambda$, the normalized clustering matrix from SDP-1 will merge two clusters with high probability. This highlights the importance of selecting an appropriate $\lambda$, since different values can lead to drastically different clustering results. The detailed statement and proof can be found in Proposition 12 of the Appendix Section A.2.

When we use Algorithm 1 to tune for , we have the following theoretical guarantee.

###### Corollary 2.

Consider with weakly assortative and number of communities. Denote . If we have , for some constant , then as long as there exists , such that , with Algorithm 1(MATR) will output a , such that where , are the normalized clustering matrices for , respectively.

###### Remark 3.

1. Since , to ensure the range of considered overlaps with the optimal range in (3), it suffices to consider choices from . Then for satisfying Eq 3, SDP-1 produces w.h.p. if is strongly assortative. Since , we can take , and the conditions in this corollary imply . Suppose all the communities are of comparable sizes, i.e. , then the conditions only require since .

2. Since the proofs of Theorem 1 and Corollary 2 are general, the conclusion is not limited to SDP-1 and applies to more general community detection algorithms for SBM when is known. It is easy to see that a sufficient condition for the consistency of to hold is that there exists in the range considered, such that .

3. We note that the specific application of Corollary 2 to SDP-1 leads to weak consistency of instead of strong consistency as originally proved for SDP-1. This is partly due to the generality of the theorem (including the relaxation of strong assortativity on to weak assortativity) as discussed above, and the fact that we are estimating .

### 3.2 Hyperparameter tuning for mixtures of subgaussians

In this case, the data is defined in Eq (2), and the clustering algorithm is spectral clustering (see motivating example in Section 3) on the Gaussian kernel . Note that one could use the similarity matrix as the kernel itself. However, this makes the trace criterion a function of the hyperparameter we are trying to tune, which compounds the difficulty of the problem. For simplicity, we use the negative squared distance matrix as , i.e. . The natural choice for would be the conditional expectation of given the cluster memberships, which is blockwise constant, as in the case for SBMs. However, in this case, the convergence behavior is different from that of blockmodels. In addition, this choice leads to a suboptimal error rate. Therefore we use a slightly corrected variant of the matrix as (see also mixon2017sdp), called the reference matrix:

$$S_{ij} = -\frac{d_{ab}^2}{2} - \max\Big\{0,\ \frac{d_{ab}^2}{2} + 2(W_i - W_j)^T(\mu_a - \mu_b)\Big\}\, \mathbb{1}(i \in C_a,\ j \in C_b), \tag{4}$$

where , is defined in Eq 2. Note that for in the same cluster . Interestingly this reference matrix is random itself, which is a deviation from the used for network models. For MATR applied to select , we have the following theoretical guarantee.
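The two matrices involved in this subsection can be sketched as follows. The negative squared distance matrix is the similarity used in the trace criterion and does not depend on the bandwidth, while the Gaussian kernel is what the spectral clustering step operates on. The kernel parametrization $\exp(-\|Y_i - Y_j\|^2 / (2\theta^2))$ is an assumption made for this sketch, since the exact form is not recoverable from the text here; the function names are hypothetical.

```python
import math


def neg_sq_dist_matrix(Y):
    """Matrix with entries -||Y_i - Y_j||^2, where Y is a list of points."""
    return [[-sum((a - b) ** 2 for a, b in zip(yi, yj)) for yj in Y] for yi in Y]


def gaussian_kernel_matrix(Y, theta):
    """Gaussian kernel with bandwidth theta.

    Assumed parametrization: exp(-||Y_i - Y_j||^2 / (2 * theta^2)).
    """
    return [[math.exp(s / (2.0 * theta ** 2)) for s in row]
            for row in neg_sq_dist_matrix(Y)]


Y = [[0.0, 0.0], [3.0, 4.0]]
S_hat = neg_sq_dist_matrix(Y)          # used in the trace criterion
K = gaussian_kernel_matrix(Y, 5.0)     # fed to spectral clustering
```

Keeping the two matrices separate is the design choice discussed above: tuning the bandwidth changes `K` but leaves the matrix used for evaluation untouched.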

###### Corollary 4.

Let be the negative squared distance matrix, and let be defined as in Eq 4. Let denote the minimum distance between cluster centers, i.e. . Denote and . As long as there exists , such that , Algorithm 1(MATR) will output a , such that w.h.p.

$$\|\hat{X}_{\theta^*} - X_0\|_F^2 \le C\, \frac{\epsilon + r\alpha\sigma_{\max}^2(\alpha + \min\{r, d\})}{\delta_{\mathrm{sep}}^2},$$

where is the largest operator norm of the covariance matrices of the mixture components, is the normalized clustering matrix for and is a universal constant.

###### Remark 5.

Note that, similar to SBMs, in this setting, has to be much smaller than in order to guarantee small error. This will happen if the spectral clustering algorithm is supplied with an appropriate bandwidth parameter that leads to small error in estimating (see for example (srivastava2019robust)). This is satisfied by the condition in Corollary 4.

## 4 Hyperparameter tuning with unknown r

In this section, we adapt MATR to situations where the number of clusters is unknown to perform model selection. Similar to Section 3, we first explain the general tuning algorithm and state a general theorem to guarantee its performance. Then applications to specific models will be discussed in the following subsections. Since the applications we focus on are network models, we will present our algorithm with the data being for clarity.

We show that MATR can be extended to model selection if we incorporate a cross-validation (CV) procedure. In Algorithm 2, we present the general MATR-CV algorithm which takes clustering algorithm , adjacency matrix , and similarity matrix as inputs. Compared with MATR, MATR-CV has two additional parts.

The first part (Algorithm 3) is to split nodes into two subsets for training and testing. This in turn partitions the adjacency matrix into four submatrices , , and its transpose, and similarly for . MATR-CV makes use of all the submatrices: for training, for testing, and for estimating the clustering result for nodes in as shown in Algorithm 4, which is the second additional part. Algorithm 4 clusters testing nodes based on the training nodes cluster membership estimated from , and the connections between training nodes and testing nodes .

For each node in the testing set, using the estimated membership , the corresponding row in counts the number of connections it has with nodes in the training set belonging to each cluster and normalizes the counts by the cluster sizes. Finally, the estimated membership is determined by a majority vote. For now we still assume is weakly assortative, so majority vote is reasonable. As we later extend to more general network structures in Section 4.2, we will also show how Algorithm 4 can be generalized.
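Algorithms 3 and 4 are not reproduced in this section, so the following is only a sketch of the two steps as described in the text: a random split of the nodes, and assignment of each test node to the training cluster with which it has the highest size-normalized connection count. The function names and the argument layout are hypothetical, and hard training labels in `0..r-1` are assumed.

```python
import random


def split_nodes(n, train_ratio, rng):
    """Randomly split node indices 0..n-1 into sorted train and test lists."""
    perm = list(range(n))
    rng.shuffle(perm)
    n_train = int(round(train_ratio * n))
    return sorted(perm[:n_train]), sorted(perm[n_train:])


def cluster_test_nodes(A, train_idx, test_idx, train_labels, r):
    """Assign each test node to a training cluster.

    train_labels[pos] is the estimated cluster of node train_idx[pos].
    For each test node, count its edges to each training cluster, divide by
    the cluster size, and pick the largest value (ties go to the lowest
    cluster index). This relies on assortative structure.
    """
    sizes = [0] * r
    for k in train_labels:
        sizes[k] += 1
    test_labels = []
    for i in test_idx:
        counts = [0.0] * r
        for pos, j in enumerate(train_idx):
            counts[train_labels[pos]] += A[i][j]
        rates = [counts[k] / sizes[k] if sizes[k] > 0 else 0.0 for k in range(r)]
        test_labels.append(max(range(r), key=lambda k: rates[k]))
    return test_labels


# Toy graph: two disconnected triangles {0, 1, 2} and {3, 4, 5}.
A = [[0, 1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [1, 1, 0, 0, 0, 0],
     [0, 0, 0, 0, 1, 1],
     [0, 0, 0, 1, 0, 1],
     [0, 0, 0, 1, 1, 0]]
test_labels = cluster_test_nodes(A, [0, 1, 3, 4], [2, 5], [0, 0, 1, 1], r=2)
```

Only the edges between test and training nodes are used for the assignment, so the edges among test nodes remain available for evaluating the trace criterion.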

Like other CV procedures, we note that MATR-CV requires specifying a training ratio and the number of repetitions . Choosing any does not affect our asymptotic results. Repetitions of splits are used empirically to enhance stability; theoretically we show asymptotic consistency for any random split. The general theoretical guarantee and the role of the trace gap are given in the next theorem.
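One way to read the role of the trace gap, sketched below, is as a tolerance: among the candidate cluster numbers, pick the smallest one whose test-set trace is within the gap of the best trace. This selection rule is an assumption made for illustration, since Algorithm 2 is not reproduced in this section; the function name `select_num_clusters` and the numbers in the example are hypothetical.

```python
def select_num_clusters(test_traces, delta):
    """Pick the smallest candidate whose trace is within delta of the maximum.

    test_traces maps each candidate cluster number to the trace criterion
    evaluated on the test nodes; delta is the trace gap (non-negative).
    """
    best = max(test_traces.values())
    return min(r for r, value in test_traces.items() if value >= best - delta)


# Hypothetical test traces: large gains up to 3 clusters, marginal after.
traces = {1: 10.0, 2: 19.0, 3: 20.0, 4: 20.2}
chosen = select_num_clusters(traces, delta=0.5)
```

Under this reading, a gap of zero always returns the maximizer, which tends to favor larger models, while a gap that is too large returns a smaller cluster number.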

###### Theorem 6.

Given a candidate set of cluster numbers containing the true number of cluster , let be the normalized clustering matrix obtained from clusters, as described in MATR-CV. Assume the following is true:

(i) with probability at least ,

(ii) with probability at least ,

(iii) for the true , with probability at least ,

(iv) there exists such that

Here . Then with probability at least , MATR-CV will recover the true with trace gap .

The proof is deferred to the Appendix Section B.

###### Remark 7.

1. MATR-CV is also compatible with tuning multiple hyperparameters. For example, for SDP-1, if the number of clusters is unknown, then for each , we can run MATR to find the best for the given , followed by running a second level MATR-CV to find the best . As long as the conditions in Theorems 1 and 6 are met, and the clustering matrix returned will be consistent.

2. As will be seen in the applications below, the derivations of and are general and only depend on the properties of . On the other hand, measures the estimation error associated with the algorithm of interest and depends on its performance.

In what follows, we demonstrate MATR-CV can be applied to do model selection inherent to an SDP method for SBM and more general model selection for MMSB. While we still assume an assortative structure for the former model as required by the SDP method, the constraint is removed for MMSB. Furthermore, we use these two models to illustrate how MATR-CV works both when is zero (SBM) and nonzero (MMSB).

### 4.1 Model selection for SBM

We consider the SDP algorithm introduced in Peng:2007; yan2017provable as shown in SDP-2- for community detection in SBM. Here is a normalized clustering matrix, and in the case of exact recovery is equal to the number of clusters. In this way, is implicitly chosen through , hence most of the existing model selection methods with consistency guarantees do not apply directly. yan2017provable proposed to recover the clustering and simultaneously. However, still needs to be empirically selected first. We provide a systematic way to do this.

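$$\max_X\ \mathrm{trace}(AX) - \lambda\, \mathrm{trace}(X) \quad \text{s.t.} \quad X \succeq 0,\ X \ge 0,\ X\mathbf{1} = \mathbf{1} \tag{SDP-2-$\lambda$}$$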

We consider applying MATR-CV to an alternative form of SDP-2- as shown in SDP-2, where the cluster number appears explicitly in the constraint and is part of the input. SDP-2 returns an estimated normalized clustering matrix, to which we apply spectral clustering to compute the cluster memberships. We name this algorithm . In this case, we use as , so is the population similarity matrix.

$$\max_X\ \mathrm{trace}(AX) \quad \text{s.t.} \quad X \succeq 0,\ X \ge 0,\ \mathrm{trace}(X) = r',\ X\mathbf{1} = \mathbf{1} \tag{SDP-2}$$

We have the following result ensuring MATR-CV returns a consistent cluster number.

###### Theorem 8.

Suppose is generated from a SBM model with clusters and a weakly assortative . We assume is fixed, and for some constant , and . Given a candidate set of containing true cluster number and , with high probability for large, MATR-CV returns the true number of clusters with , where .

###### Proof sketch.

We provide a sketch of the proof here; the details can be found in the Appendix Section B.2. We derive the three errors in Theorem 6. In this case, we show that w.h.p., , and MATR-CV achieves exact recovery when given the true , that is, . Since under the conditions of the theorem, by Theorem 6, taking MATR-CV returns the correct w.h.p. Furthermore, we can remove the dependence of on unknown by noting that w.h.p., then it suffices to consider the candidate range . Thus and in can be replaced with . ∎

###### Remark 9.

1. Although we have assumed fixed , it is easy to see from the order of and that the theorem holds for , if we let for clarity. Many other existing works on SBM model selection assume fixed . lei2016goodness considered the regime . hu2017using allowed to grow linearly up to a logarithmic factor, but at the cost of making fixed.

2. Asymptotically, is equivalent to . We will use in practice when is fixed.

### 4.2 Model selection for MMSB

In this section, we consider model selection for the MMSB model as introduced in Section 2.2 with a soft membership matrix , which is more general than the SBM model. As an example of estimation algorithm, we consider the SPACL algorithm proposed by Mao2017EstimatingMM, which gives consistent parameter estimation when given the correct . As mentioned in Section 2.2, a normalized clustering matrix in this case is defined analogously as for any . is still a projection matrix, and since . Following Mao2017EstimatingMM, we consider a Bayesian setting for : each row of , . We assume , are all fixed constants. Note that the Bayesian setting here is only for convenience, and can be replaced with equivalent assumptions bounding the eigenvalues of . We also assume there is at least one pure node for each of the communities for consistent estimation at the correct .

MATR-CV can be applied to the MMSB model with a few modifications. (i) Replace all by , the estimated soft memberships from the training graph. (ii) We take , . This allows us to remove the assortativity requirement on and replace it with a full rank condition on , which is commonly assumed in the MMSB literature. The fact that is always positive semi-definite will be used in the proof. The removal of and leads to better concentration, since is centered around a different mean. (iii) We change Algorithm 4 to estimate . Note that , thus we can view the estimation of as a regression problem with plug-in estimators of and . In Algorithm 4, we use an estimate of the form , where , are estimated from .

We have the following consistency guarantee for returned by MATR-CV.

###### Theorem 10.

Let be generated from an MMSB model (see Section 2.2) satisfying , where is the smallest singular value of . We assume for some arbitrarily small . Given a candidate set of containing and , with high probability for large , MATR-CV returns the true cluster number if .

###### Proof sketch.

We first show that, w.h.p., the underfitting and overfitting errors in Theorem 6 are . To obtain , we show that given the true cluster number, the convergence rate of the parameter estimates for the testing nodes obtained from the regression algorithm is the same as the convergence rate for the training nodes. This leads to . For convenience we pick . For details, see Section B.3 of the supplement.

###### Remark 11.
1. Compared with fan2019simple and han2019universal, which consider the more general degree-corrected MMSB model, our consistency result holds for at a faster rate.

2. A practical note: due to the constant in the estimation error being tedious to determine, in this case we only know the asymptotic order of the gap . As has been observed in many other methods based on asymptotic properties (e.g. bickel2016hypothesis; lei2016goodness; wang2017; hu2017using), performing an adjustment for finite samples often improves the empirical performance. In practice we find that if the constant factor in is too large, then we tend to underfit. To guard against this, we note that at the correct , the trace difference should be much larger than . We start with and find by Algorithm 2; if is smaller than , we reduce by half and repeat the step of finding in Algorithm 2 until . This adjustment is much more computationally efficient than bootstrap corrections and works well empirically.
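
The halving adjustment in the practical note can be sketched as follows. This is only an illustrative sketch under stated assumptions: the function names `select_r` and `adjust_delta` are hypothetical, the selection rule (smallest candidate whose trace is within the gap of the maximum) stands in for Algorithm 2, and the "trace difference" is taken to be the increase in trace from the next smaller candidate to the selected one. None of these details are confirmed by the text above, whose symbols are not reproduced here.

```python
from typing import Dict, Tuple


def select_r(traces: Dict[int, float], delta: float) -> int:
    """Assumed stand-in for Algorithm 2: the smallest candidate r whose
    trace value is within `delta` of the largest trace value."""
    best = max(traces.values())
    return min(r for r, t in traces.items() if t >= best - delta)


def adjust_delta(
    traces: Dict[int, float], delta0: float, max_halvings: int = 50
) -> Tuple[int, float]:
    """Halve the gap until the trace difference at the selected r
    is at least as large as the gap (assumed stopping rule)."""
    delta = delta0
    r_hat = select_r(traces, delta)
    for _ in range(max_halvings):
        prev = traces.get(r_hat - 1)
        if prev is None:
            # No smaller candidate to compare against: nothing to adjust.
            break
        if traces[r_hat] - prev >= delta:
            break
        delta /= 2.0
        r_hat = select_r(traces, delta)
    return r_hat, delta


# Hypothetical trace values for candidates r = 1..4.
example = {1: 1.0, 2: 5.5, 3: 6.5, 4: 10.0}
print(adjust_delta(example, delta0=4.0))
```

In this toy input the initial gap is too large and the rule first selects r = 3; after one halving the selection moves to r = 4, where the trace jump exceeds the reduced gap. Each iteration only reuses the already computed trace values, which is why such an adjustment is cheap compared with a bootstrap correction.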

## 5 Numerical experiments

In this section, we present extensive numerical results on simulated and real data by applying MATR and MATR-CV to different settings considered in Sections 3 and 4.

### 5.1 MATR on SBM with known number of clusters

We apply MATR to tune in SDP-1 for known . Since for SDP-1, we choose in all the examples. For comparison we choose two existing data driven methods. The first method (CL, cai2015robust) sets as the mean connectivity density in a subgraph determined by nodes with “moderate” degrees. The second is ECV (li2016network) which uses CV with edge sampling to select the giving the smallest loss on the test edges from a model estimated on training edges. We use a training ratio of 0.9 and the loss throughout.

Simulated data. Consider a strongly assortative SBM as required by SDP-1 for both equal sized and unequal sized clusters. The details of the experimental setting can be found in the Appendix Section C. Standard deviations are calculated based on random runs of each parameter setting. We present NMI comparisons for equal sized SBM (, ) in Figure 2(A), and unequal sized SBM (two with 100 nodes, and two with 50) in Figure 2(B). In both, MATR outperforms others by a large margin as degree grows.

Real data. We also compare MATR with ECV and CL on three real datasets: the football dataset (girvan2002community), the political books dataset and the political blogs dataset (adamic2005political). All of them are binary networks with and nodes respectively. In the football dataset, the nodes represent teams and an edge is drawn between two teams if any regular-season games are played between them; there are clusters where each cluster represents the conference among teams, and games are played more frequently between teams in the same conference. In the political blogs dataset, the nodes are weblogs and edges are hyperlinks between the blogs; it has clusters based on political inclination: "liberal" and "conservative". In the political books dataset, the nodes represent books and edges indicate co-purchasing on Amazon; the clusters represent categories based on manual labeling of the content: "liberal", "neutral" and "conservative". The clustering performance of each method is evaluated by NMI and shown in Table 0(a). MATR performs the best out of the three methods on the football dataset, and is tied with ECV on the political books dataset. MATR is not as good as CL on the political blogs dataset, but still outperforms ECV.
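
For readers who want to reproduce the evaluation metric, the sketch below computes NMI between a predicted and a reference labeling using only the Python standard library. The normalization by the arithmetic mean of the two entropies is an assumption (the text does not state which NMI variant is used), and the function names are illustrative.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two labelings of the same items."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    return sum(
        (c / n) * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def nmi(a, b):
    """NMI normalized by the arithmetic mean of the entropies (assumed variant)."""
    if len(a) != len(b):
        raise ValueError("labelings must have the same length")
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0 and h_b == 0:
        # Both labelings put everything in one cluster: treat as identical.
        return 1.0
    return mutual_information(a, b) / ((h_a + h_b) / 2)


# NMI is invariant to relabeling of the clusters.
print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))
```

Because NMI depends only on the partition and not on the label names, it can compare estimated clusters with ground-truth communities without first matching labels.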

### 5.2 MATR on mixture model with known number of clusters

We use MATR to select the bandwidth parameter in spectral clustering applied to mixture data when given the correct number of clusters. In all the examples, our candidate set of is for and . We compare MATR with three other well-known heuristic methods. The first one was proposed by shi2008data (DS), where, for each data point , the quantile of is denoted and then is set to be . We also compare with two other methods in von2007tutorial: a method based on k-nearest neighbors (KNN) and a method based on the minimal spanning tree (MST). For KNN, is chosen in the order of the mean distance of a point to its k-th nearest neighbor, where . For MST, is set as the length of the longest edge in a minimal spanning tree of the fully connected graph on the data points.
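
The KNN and MST heuristics described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in the experiments: the function names are hypothetical, the neighbor index `k` is left as an argument because its value is not reproduced here, and both routines use a simple quadratic-time approach suitable only for small datasets.

```python
from math import dist


def knn_bandwidth(points, k):
    """Mean Euclidean distance from each point to its k-th nearest neighbor."""
    n = len(points)
    if not 1 <= k <= n - 1:
        raise ValueError("k must satisfy 1 <= k <= n - 1")
    total = 0.0
    for i, p in enumerate(points):
        distances = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        total += distances[k - 1]
    return total / n


def mst_bandwidth(points):
    """Length of the longest edge in a minimum spanning tree of the
    complete Euclidean graph on the points (Prim's algorithm)."""
    n = len(points)
    in_tree = [False] * n
    best = [float("inf")] * n  # cheapest edge connecting each node to the tree
    best[0] = 0.0
    longest = 0.0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        longest = max(longest, best[u])
        for v in range(n):
            if not in_tree[v]:
                best[v] = min(best[v], dist(points[u], points[v]))
    return longest


# Four points on a line: three close together and one far away.
pts = [(0.0,), (1.0,), (2.0,), (10.0,)]
print(knn_bandwidth(pts, k=1), mst_bandwidth(pts))
```

In the toy example the MST value is driven entirely by the single distant point, which illustrates why this heuristic can be sensitive to outliers, whereas the KNN value averages over all points.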

Simulated data. We first conduct experiments on simulated data generated from a 3-component Gaussian mixture with . The means are multiplied by a separation constant which controls clustering difficulty (the larger the constant, the easier the problem). Detailed descriptions of the parameter settings can be found in Section C of the Appendix. datapoints are generated for each mixture model and random runs are used to calculate standard deviations for each parameter setting. In Figure 2 (A) and (B) we plot NMI on the y-axis against the separation along the x-axis for mixture models with equal and unequal mixing coefficients respectively. For all these settings, MATR performs as well as or better than the best among DS, KNN and MST.
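
To make the role of the separation constant concrete, the sketch below draws points from a Gaussian mixture whose component means are scaled by that constant. All specifics here are assumptions for illustration: the base means, the mixing weights, the spherical covariance and the function name `sample_mixture` are hypothetical, and the actual settings are those given in Section C of the Appendix.

```python
import random


def sample_mixture(n, base_means, weights, separation, sigma=1.0, seed=0):
    """Draw n points from a spherical Gaussian mixture.

    Each component mean is `separation * base_mean`; a larger separation
    pushes the components apart and makes clustering easier.
    Returns (points, component_labels).
    """
    rng = random.Random(seed)
    labels = rng.choices(range(len(base_means)), weights=weights, k=n)
    points = [
        [separation * m + rng.gauss(0.0, sigma) for m in base_means[c]]
        for c in labels
    ]
    return points, labels


# Hypothetical 3-component mixture in two dimensions with unequal weights.
base_means = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
points, labels = sample_mixture(
    300, base_means, weights=[0.5, 0.3, 0.2], separation=4.0
)
print(len(points), sorted(set(labels)))
```

Sweeping `separation` over a grid and repeating with different seeds gives the kind of difficulty curve plotted in the figure, with one NMI value per method and setting.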

Real data. We also test MATR for tuning on a real dataset: Optical Recognition of Handwritten Digits Data Set. We use a copy of the test set provided by scikit-learn (scikit-learn), which consists of 1797 instances of 10 classes. We standardize the dataset before clustering. With clusters, MATR, DS, KNN and MST yield clustering results with NMI values , , and respectively. In other words, MATR performs similarly to KNN but outperforms DS and MST. We also visualize and compare the clustering results by different methods in 2-D using tSNE (maaten2008visualizing), which can be found in Section C of the Appendix.

### 5.3 Model selection with MATR-CV on SBM

We make comparisons among MATR-CV, Bethe-Hessian estimator (BH) (le2015estimating) and ECV (li2016network). For ECV and MATR-CV, we consider , where is the number of nodes.

Simulated data. We simulate networks from a -cluster strongly assortative SBM with equal and unequal sized blocks (detailed in Section C of the Appendix). In Figure 3, we show NMI on the y-axis vs. average degree on the x-axis. In Figure 3(a) and (b) we respectively consider equal sized ( clusters of size ) and unequal sized networks (two with nodes and two with nodes). In all cases, MATR-CV has the highest NMI. A table with the median number of clusters selected by each method can be found in Section C of the Appendix.

Real data. The same set of methods is also compared on three real datasets: the football dataset, the political blogs dataset and the political books dataset. The results are shown in Table 0(b), where MATR-CV finds the ground truth for the football dataset.

### 5.4 Model selection with MATR-CV on MMSB

We compare MATR-CV with Universal Singular Value Thresholding (USVT) (chatterjee2015matrix), ECV (li2016network) and SIMPLE (fan2019simple) in terms of doing model selection with MMSB. For ECV and MATR-CV, we consider the candidate set , where .

Simulated data. We first apply all the methods to simulated data. We consider . Following mao2018overlapping, we sample and . We generate networks with nodes with and respectively. We set when ; when for a range of . In Table 1(a) and 1(b), we report the fraction of runs that exactly recover the true cluster number over 40 runs for each method across different average degrees. We observe that in both and cases, MATR-CV outperforms the other three methods by a large margin on sparse graphs. The method SIMPLE consistently underfits in our sparsity regime, which is understandable, since its theoretical guarantees hold for a dense degree regime.

Real data. We also test MATR-CV with MMSB on a real network, the political books network, which contains 3 clusters. Here fitting an MMSB model is reasonable since each book can have mixed political inclinations, e.g. a “conservative” book may be in fact mixed between “neutral” and “conservative”. With MATR-CV, we found clusters. With USVT, ECV and SIMPLE we found fewer than clusters.

## 6 Discussion

Clustering of data, in both i.i.d. and network-structured settings, has received a lot of attention from both applied and theoretical communities. However, methods for tuning hyperparameters involved in clustering problems are mostly heuristic. In this paper, we present MATR, a provable MAx-TRace based hyperparameter tuning framework for general clustering problems. We prove the effectiveness of this framework for tuning SDP relaxations for community detection under the block model and for learning the bandwidth parameter of the Gaussian kernel in spectral clustering over a mixture of subgaussians. Our framework can also be used for model selection through a cross-validation based extension (MATR-CV), which consistently estimates the number of clusters in blockmodels and mixed membership blockmodels. Using a variety of simulated and real data experiments, we show the advantage of our method over other existing heuristics.

The framework presented in this paper is general and can be applied to model selection or tuning for broader model classes like degree corrected blockmodels (karrer2011dcbm), since there are many exact recovery based algorithms for estimation in these settings (chen2018). We believe that our framework can be extended to the broader class of degree corrected mixed membership blockmodels (jin2017estimating), which includes the topic model (mao2018overlapping). However, this extension requires tedious derivations of the parameter estimation error, which have not been carried out in existing works. Furthermore, even though our work uses node sampling, we believe we can extend the MATR-CV framework to get consistent model selection for other sampling procedures like edge sampling (li2016network).

Appendix

This appendix contains detailed proofs of theoretical results in the main paper “A Unified Framework for Tuning Hyperparameters in Clustering Problems”, additional theoretical results, and a detailed description of the experimental parameter settings. We present proofs for MATR and MATR-CV in Section A and Section B respectively. Section A.2 also contains additional theoretical results on the role of the hyperparameter in merging clusters in SDP-1 and SDP-2 respectively. Finally, Section C contains detailed parameter settings for the experimental results in the main paper.

## Appendix A Additional theoretical results and proofs of results in Section 3

### A.1 Proof of Theorem 1

###### Proof.

If for tuning parameter , we have , then

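$$
\langle S, \hat{X}_\lambda \rangle \ge \langle S, X_0 \rangle - \left| \langle \hat{S} - S, \hat{X}_\lambda \rangle \right| - \epsilon. \tag{5}
$$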

First we will prove that this immediately gives an upper bound on . We will remove the subscript for ease of exposition. Denote , when and otherwise, and off-diagonal set for th cluster as . Then we have

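$$
\begin{aligned}
\langle S, \hat{X} \rangle
&= \sum_{k=1}^{r_0} a_{kk} \langle E_{C_k, C_k}, \hat{X} \rangle + \sum_{k=1}^{r_0} \sum_{(i,j) \in C_k^c} a_{ij} \langle E_{i,j}, \hat{X} \rangle \\
&= \sum_{k=1}^{r_0} a_{kk} m_k \omega_k + \sum_{k=1}^{r_0} m_k (1 - \omega_k) \sum_{(i,j) \in C_k^c} a_{ij} \alpha_{ij} \\
&= \sum_{k=1}^{r_0} m_k \omega_k \Big( a_{kk} - \sum_{(i,j) \in C_k^c} a_{ij} \alpha_{ij} \Big) + \sum_{k=1}^{r_0} m_k \sum_{(i,j) \in C_k^c} a_{ij} \alpha_{ij}.
\end{aligned} \tag{6}
$$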

Since , by (5), , we have

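$$
\sum_{k} m_k \omega_k \Big( a_{kk} - \sum_{(i,j) \in C_k^c} a_{ij} \alpha_{ij} \Big) + \sum_{k} m_k \sum_{(i,j) \in C_k^c} a_{ij} \alpha_{ij} \ge \sum_{k} m_k a_{kk} - \left| \langle R, \hat{X} \rangle \right| - \epsilon.
$$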

Note that, since