Degrees of Freedom and Model Selection for $k$-means Clustering
Abstract
This paper investigates the problem of model selection for $k$-means clustering, based on conservative estimates of the model degrees of freedom. An extension of Stein’s lemma, which is used in unbiased risk estimation, is applied to obtain an expression which allows one to approximate the degrees of freedom. Empirically based estimates of this approximation are obtained. The degrees of freedom estimates are then used within the popular Bayesian Information Criterion to perform model selection. The proposed estimation procedure is validated in a thorough simulation study, and its robustness is assessed through relaxations of the modelling assumptions and on data from real applications. Comparisons with popular existing techniques suggest that this approach performs extremely well when the modelling assumptions are met, or almost met, and is highly competitive on real data applications.
Keywords: Cluster number determination, Model complexity
1 Introduction
1.1 Clustering
Clustering refers to the problem of identifying groups (clusters) of relatively similar points within collections of data, without any explicit information about the true groupings of any subsets of the data. Clustering has found application in extremely diverse fields from marketing to population genetics to computer vision [6, 8, 4]. The absence of a ground truth grouping for any of the data makes clustering extremely challenging. It also makes a precise definition of what constitutes a cluster difficult to pin down. Indeed numerous popular definitions exist, each leading to hosts of methods for obtaining clustering solutions based on the corresponding definition. The $k$-means clustering model, as well as other centroid based methods, define clusters as compact groups of points which arise around a collection of so-called cluster centroids. Centroid based methods are extremely popular for their simplicity and their reasonable degree of robustness to the relaxation of their underlying assumptions. In the case of $k$-means specifically the optimal centroids are those which minimise the sum of squared distances between each point and its assigned centroid. Using the squared distance connects the $k$-means objective with the log-likelihood of the Gaussian Mixture Model (GMM). The optimal $k$-means solution may be seen as an approximation of the maximum likelihood solution for a GMM in which the covariance matrices and mixing proportions are equal, and the covariance matrices are proportional to the identity. One of the major benefits of $k$-means over a more flexible GMM based clustering model is the relative computational efficiency of the algorithms used to obtain solutions. Although obtaining the globally optimal solution is NP-hard (as is the case with numerous other clustering objectives) fast algorithms allow for multiple initialisations and frequently lead to high quality locally optimal solutions.
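The fast algorithms alluded to here are typically variants of Lloyd's iterations, alternating an assignment step with a centroid update step, restarted from several random initialisations. A minimal numpy sketch of this scheme (illustrative only, not the implementation used for the experiments in this paper; the function name and defaults are invented):

```python
import numpy as np

def kmeans(X, k, n_init=10, max_iter=100, seed=0):
    """Lloyd's algorithm with random restarts. Returns the best
    (SS, centroids, labels) triple found across restarts, where SS
    is the within-cluster sum of squared distances."""
    rng = np.random.default_rng(seed)
    best = (np.inf, None, None)
    for _ in range(n_init):
        # initialise centroids at k distinct observations
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            # assignment step: nearest centroid in squared distance
            d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # update step: each centroid becomes its cluster mean
            new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centroids[j] for j in range(k)])
            if np.allclose(new, centroids):
                break
            centroids = new
        labels = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        ss = ((X - centroids[labels]) ** 2).sum()
        if ss < best[0]:
            best = (ss, centroids, labels)
    return best
```

Each restart can only find a local optimum, which is why multiple initialisations are standard practice.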
1.2 Model Selection
In statistical modelling the data being analysed are assumed to have come from an underlying generative process comprised of a fixed structural component, and a noise component which results in the actual observations being perturbed versions of the structural component. The noise component may be attributed to measurement error, unknown latent factors, or a host of other sources. It is convenient to model the noise component as being random, and therefore obeying some probability distribution. A sound modelling procedure should be capable of capturing the structural component in the data, while filtering out, as well as possible, the noise. When determining an appropriate model, it is necessary to identify the correct degree of flexibility in the model’s formulation. A model which is too flexible will begin to incorporate some of the noise, as well as the structure, in its representation/abstraction of the observed data. On the other hand, a model which is not flexible enough should be able to filter out the noise, but will not be able to adequately represent all of the underlying structure’s complexity.
The flexibility of a model may be interpreted through the concept of degrees of freedom.
A classical definition for model degrees of freedom is given by the number of freely varying parameters in its formulation. This has an appealing interpretation in terms of flexibility, in that the greater the number of free parameters the more flexible is the model. It is not always clear, however, how many effectively freely varying parameters are present in a model. This is especially the case in nonparametric modelling. The effective degrees of freedom [7] offers a more universal formulation as the covariance between the data and the fitted values arising from the model. Fits from an overly flexible model will have a high degree of covariance with the data. This is because a variation in the data (which can only be attributed to the noise component, due to the fact that the structural component is constant) will be largely incorporated into such a model’s representation of the data. The excess flexibility allows the model to attempt to “explain” this variation, since it is not able to sufficiently filter it out. Such a model thus consumes a large number of degrees of freedom. On the other hand, a rigid/inflexible model will, by definition, not vary as much with variations in the data, thereby consuming fewer degrees of freedom. In the extreme case a completely inflexible model is constant, and thus cannot depend at all on the data, therefore the covariance and hence degrees of freedom would be zero.
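For models where the effective degrees of freedom is known exactly, the covariance definition can be checked by simulation. For ordinary least squares with $p$ predictors it equals $p$, the trace of the hat matrix; a small Monte Carlo sketch (the design, mean and noise level below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 50, 3, 1.0
Z = rng.normal(size=(n, p))               # fixed design matrix
H = Z @ np.linalg.inv(Z.T @ Z) @ Z.T      # hat matrix; trace(H) = p

reps = 50_000
mu = Z @ np.ones(p)                       # fixed structural component
Y = mu + sigma * rng.normal(size=(reps, n))
Yhat = Y @ H                              # fitted values (H is symmetric)

# effective df = (1/sigma^2) * sum_i Cov(y_i, yhat_i)
df = ((Y - mu) * (Yhat - Yhat.mean(axis=0))).mean(axis=0).sum() / sigma ** 2
print(df)  # close to p = 3
```

The same Monte Carlo recipe applies to any fitting procedure, including $k$-means, though in practice the covariances must be estimated from a single realisation of the data.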
The purpose of model selection procedures is to compare models of varying levels of flexibility, and balancing how well the models explain the observations made (how well they fit the data) against this flexibility.
In the context of clustering, and specifically the $k$-means model, flexibility is varied by different choices of $k$, the number of centroids.
In the remainder of the paper a derivation of the model degrees of freedom for $k$-means is obtained, and a practical approach for approximating this value is presented. The proposed approach is then validated in a simulation study, by its ability to reliably estimate the true number of clusters under a correctly specified generative distribution. Robustness to the relaxation of these assumptions is determined using simulated and real data sets with varying characteristics.
2 Model Selection for $k$-means Clustering
Being arguably the most popular clustering method, $k$-means has been extremely well studied. Indeed there are numerous existing approaches for selecting the number of clusters in the model. A few approaches will be discussed here, and the reader is directed to [5] for an extensive list of selection criteria for clustering in general. One of the oldest approaches, often termed the “elbow method”, is a graphical technique based on a plot of the $k$-means objective against the number of clusters. One selects the number of clusters to the right of which the gradient of the plot becomes substantially less steep. That is, the value above which the relative benefit, in terms of objective function value, per additional cluster becomes substantially smaller.
So-called penalty methods can be seen as automations of the elbow method. If we use $SS(k)$ to denote the $k$-means objective for $k$ clusters, then penalty methods select the value of $k$ which minimises $SS(k) + f(k)$, for some penalty function $f$. Such approaches provide perhaps the most literal interpretation of model selection as identifying the correct trade-off of performance (objective function value) against flexibility (number of clusters). If $f$ is linear in $k$, say $f(k) = \lambda k$, then in most cases this approach will select the smallest $k$ for which $SS(k) - SS(k+1) < \lambda$. The standard implementations of AIC (Akaike’s Information Criterion) and BIC (Bayesian Information Criterion) for $k$-means use such linear penalty methods [13, 9]. The proposed approach uses a non-standard application of BIC for $k$-means based on an alternative estimation of the model degrees of freedom. Here the penalty function is not explicit and does not admit a closed form. An alternative approach for determining the number of clusters is to select the $k$ which leads to the greatest deviation in the observed value of $SS(k)$ (or some monotonic transformation thereof) from its expected value if the underlying data distribution does not contain any true clusters. This expected deviation can be estimated analytically under very strict assumptions [14], or more generally using Monte Carlo integration [17].
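The connection between the linear penalty and the elbow heuristic can be made concrete with a hypothetical (invented) $SS(k)$ curve exhibiting an elbow at $k = 3$:

```python
import numpy as np

# Invented SS(k) values for k = 1..8 with an elbow at k = 3.
ks = np.arange(1, 9)
ss = np.array([100.0, 60.0, 30.0, 28.0, 26.5, 25.5, 24.8, 24.3])

lam = 5.0  # slope of the linear penalty f(k) = lam * k

# Penalty method: minimise SS(k) + lam * k.
k_pen = int(ks[np.argmin(ss + lam * ks)])

# Elbow reading: smallest k for which the drop SS(k) - SS(k+1) < lam.
drops = ss[:-1] - ss[1:]
k_elbow = int(ks[np.argmax(drops < lam)])

print(k_pen, k_elbow)  # both select the elbow at k = 3
```

The two rules agree here because the invented curve is decreasing with diminishing returns; for non-convex $SS(k)$ curves the correspondence is only approximate, hence the “in most cases” qualification above.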
The remainder of this section is devoted to methodology for obtaining an estimate of the model degrees of freedom, to be used in model selection for $k$-means. Thereafter the proposed method is compared against popular existing techniques on both simulated data and data from real applications.
2.1 Degrees of Freedom in the $k$-means Model
In this section an investigation of the model degrees of freedom for $k$-means will be presented. The data matrix, $X \in \mathbb{R}^{n \times d}$, is assumed to have the following form,
$$X = M + E, \qquad (1)$$
where $M$ is a mean matrix and the elements of $E$ are independent realisations from a $\mathcal{N}(0, \sigma^2)$ distribution. The number of unique rows of $M$ (i.e., the number of clusters) is not assumed to be known. For a given number of clusters, $k$, the $k$-means model is defined by the function $M_k : \mathbb{R}^{n \times d} \to \mathbb{R}^{n \times d}$ which assigns each row of $X$ to the cluster centroid to which it is nearest, where the centroids minimise the sum of the squared distances between the data and their assigned centroids. That is,
$$M_k(X)_{i\cdot} = \mu_{c(i)}, \qquad (2)$$
$$c(i) = \operatorname*{argmin}_{j \in \{1, \dots, k\}} \|X_{i\cdot} - \mu_j\|^2, \qquad (3)$$
$$\mu_1, \dots, \mu_k = \operatorname*{argmin}_{m_1, \dots, m_k \in \mathbb{R}^d} \sum_{i=1}^{n} \min_{j \in \{1, \dots, k\}} \|X_{i\cdot} - m_j\|^2, \qquad (4)$$
where the subscript “$i\cdot$” is used to indicate the $i$th row of a matrix. The matrix $M_k(X)$ estimates the unique rows of $M$, and approximates the maximum likelihood solution under Eq. (1).
As mentioned previously the degrees of freedom may be interpreted as measuring the flexibility of a model. Model selection in clustering deals with selecting the correct number of clusters. A straightforward estimate of the degrees of freedom, given by $kd$, i.e., the number of entries in the matrix of centroids, has been fairly widely adopted [13]. However it has been observed that using this estimate within well established criteria such as the BIC often leads to an overestimation of the number of clusters [9]. The proposed approach instead adopts the definition given by the effective degrees of freedom [7],
$$\mathrm{df}(k) = \frac{1}{\sigma^2}\sum_{i=1}^{n}\sum_{j=1}^{d}\mathrm{Cov}\big(X_{ij}, M_k(X)_{ij}\big). \qquad (5)$$
We would expect that, for a correctly specified value of $k$, the clustering solution would not be especially sensitive to slight changes in the data. Indeed if the clusters are roughly spherical in shape, and have the same variance, then subtle changes in the data will not affect the fitted values significantly. The covariance between the data and fitted values should thus be small. If the value of $k$ is less than the true number of groups then at least one of the clusters in the model will contain points from multiple groups. This combining of multiple groups might be fairly arbitrary, in that a slight change in the data could result in a different combination of groups being clustered together. This would result in a substantial change in the fitted values, as entire subsets of the data are assigned to a different centroid. The covariance may in that case be fairly large. On the other hand, if the number of groups is overestimated then at least one group must be split among multiple clusters. Again this may occur in a somewhat arbitrary manner, and changes in the data might result in a different group being split. This is illustrated in Figure 1. The data arise from a Gaussian mixture containing ten components. The inter-class distances are roughly equal and the overlap is relatively large to illustrate the effect of ambiguity in the solution when $k$ is incorrectly specified. Figure 1(a) shows the standard estimate of the degrees of freedom, $kd$, and the estimated effective degrees of freedom. Although perhaps counterintuitive in a more general setting, the effective degrees of freedom decreases for increasing $k$, for values up to the correct number of 10, whereafter it increases as $k$ exceeds the correct number. This effect translates to the BIC, Figure 1(b), where there is a clear and unique minimum at $k = 10$, and where the standard approach would not have detected any clusters at all.
Important to note too is that the effective degrees of freedom estimates are greater than the standard estimate for all values of $k$. This exceedance of the standard estimate has been consistently observed across all experiments conducted for this research.
The remainder of this section is concerned with obtaining an appropriate approximation of the effective degrees of freedom for the $k$-means model. The following two lemmas are useful for obtaining such an estimate.
Lemma 1
Let $X = M + E$, with $M \in \mathbb{R}^{n \times d}$ fixed and with $E_{ij} \sim \mathcal{N}(0, \sigma^2)$ independent for all $i, j$. Let $f : \mathbb{R}^{n \times d} \to \mathbb{R}$ satisfy the following condition. For each $X$ and each $(i, j)$, there exists a finite set $D_{ij}(X) = \{\delta_1 < \dots < \delta_m\}$ s.t. $t \mapsto f(X + (t - X_{ij})e_{ij})$, viewed as a univariate function by keeping all entries of $X$ except the $(i,j)$th fixed, is Lipschitz on each of $(-\infty, \delta_1)$, $(\delta_1, \delta_2), \dots$, and $(\delta_m, \infty)$. Then for each $(i, j)$,
$$\mathrm{Cov}\big(X_{ij}, f(X)\big) = \sigma^2\,\mathbb{E}\left[\frac{\partial f(X)}{\partial X_{ij}}\right] + \sigma^2\,\mathbb{E}\left[\sum_{\delta \in D_{ij}(X)} \phi_{M_{ij},\sigma}(\delta)\,\big(f_{ij}^{+}(\delta) - f_{ij}^{-}(\delta)\big)\right], \qquad (6)$$
provided the second term on the right hand side exists. Here $e_{ij} \in \mathbb{R}^{n \times d}$ has zero entries except in the $(i,j)$th position, where it takes the value one, $\phi_{M_{ij},\sigma}$ is the density of the $\mathcal{N}(M_{ij}, \sigma^2)$ distribution, and
$$f_{ij}^{\pm}(\delta) = \lim_{t \to \delta^{\pm}} f\big(X + (t - X_{ij})e_{ij}\big).$$
Proof:
Let $x \sim \mathcal{N}(\theta, \sigma^2)$ and consider any $g : \mathbb{R} \to \mathbb{R}$ which is Lipschitz on $(-\infty, \delta)$ and $(\delta, \infty)$ for some $\delta \in \mathbb{R}$. For each $\epsilon > 0$ define
$$g_\epsilon(x) = \begin{cases} g(x), & x \not\in (\delta - \epsilon, \delta + \epsilon)\\ g(\delta - \epsilon) + \frac{x - \delta + \epsilon}{2\epsilon}\big(g(\delta + \epsilon) - g(\delta - \epsilon)\big), & x \in (\delta - \epsilon, \delta + \epsilon), \end{cases}$$
i.e., $g_\epsilon$ interpolates linearly across the interval $(\delta - \epsilon, \delta + \epsilon)$. Then $g_\epsilon$ is Lipschitz by construction and so by [3, Lemma 3.2] we know $g_\epsilon$ is almost differentiable and $\mathbb{E}|g_\epsilon'(x)| < \infty$, and so by [16, Lemma 2] we have
$$\mathrm{Cov}\big(x, g_\epsilon(x)\big) = \sigma^2\,\mathbb{E}\big[g_\epsilon'(x)\big].$$
But
$$\mathbb{E}\big[g_\epsilon'(x)\big] = \mathbb{E}\Big[g'(x)\,\mathbb{1}\big\{x \not\in (\delta - \epsilon, \delta + \epsilon)\big\}\Big] + \frac{g(\delta + \epsilon) - g(\delta - \epsilon)}{2\epsilon}\,P\big(x \in (\delta - \epsilon, \delta + \epsilon)\big).$$
Taking the limit as $\epsilon \to 0$ gives
$$\mathrm{Cov}\big(x, g(x)\big) = \sigma^2\,\mathbb{E}\big[g'(x)\big] + \sigma^2\,\phi_{\theta,\sigma}(\delta)\big(g^{+}(\delta) - g^{-}(\delta)\big),$$
as required. The extension to any $g$ with finitely many such discontinuity points arises from a very simple induction.
We therefore have for any $(i, j)$, that
$$\mathrm{Cov}\big(X_{ij}, f(X) \,\big|\, X_{-(ij)}\big) = \sigma^2\,\mathbb{E}\left[\frac{\partial f(X)}{\partial X_{ij}} \,\Big|\, X_{-(ij)}\right] + \sigma^2 \sum_{\delta \in D_{ij}(X)} \phi_{M_{ij},\sigma}(\delta)\,\big(f_{ij}^{+}(\delta) - f_{ij}^{-}(\delta)\big),$$
where $X_{-(ij)}$ denotes all entries of $X$ other than the $(i,j)$th. The result follows from the law of total expectation.
This result is very similar to [18, Lemma 5], although the proof given above is markedly different. The first term on the right hand side of Eq. (6) comes from Stein’s influential result [16, Lemma 2]. Due to the discontinuities in the $k$-means model, which occur at points where the cluster assignments of some of the data change, the additional covariance at the discontinuity points needs to be accounted for. Consider a value of $X_{ij}$ which is close to a point of discontinuity with respect to the $(i,j)$th entry. Conditional on the fact that $X_{ij}$ is close to such a point, $f(X)$ takes values approximately equal to the left and right limits, depending on whether $X_{ij}$ is below or above the discontinuity respectively. On a small enough scale each happens with roughly equal probability. After taking into account the probability of $X_{ij}$ being close to the discontinuity point, and taking the limit as $X_{ij}$ gets arbitrarily close to the discontinuity point, one can arrive at an intuitive justification for the additional term in Eq. (6). In the remainder this additional covariance term will be referred to as the excess degrees of freedom. The next lemma places Lemma 1 in the context of the $k$-means model.
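This intuition can be verified numerically in the simplest discontinuous case: a step function of a single Gaussian variable. Here Stein's term vanishes (the derivative is zero almost everywhere), so the entire covariance should equal $\sigma^2$ times the normal density at the discontinuity times the unit jump. A Monte Carlo sketch with arbitrary parameter values:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, delta = 0.5, 1.3, 1.0

y = theta + sigma * rng.normal(size=2_000_000)
f = (y > delta).astype(float)             # jump of size 1 at delta

# simulated covariance between the data and the discontinuous "fit"
cov = np.mean((y - theta) * (f - f.mean()))

# predicted excess term: sigma^2 * phi_{theta,sigma}(delta) * jump
phi = np.exp(-(delta - theta) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

print(cov, sigma ** 2 * phi)  # the two quantities should agree closely
```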
Lemma 2
Let $f : \mathbb{R}^{n \times d} \to \mathbb{R}$ be defined as
$$f(X) = M_k(X)_{ij},$$
where $M_k$ is the $k$-means model defined in Eqs. (2)–(4). Then $f$ satisfies the conditions on the function in the statement of Lemma 1, and moreover if $X = M + E$, with $M$ fixed and with $E_{ij} \sim \mathcal{N}(0, \sigma^2)$ independent for all $i, j$, then
$$\mathbb{E}\left[\sum_{\delta \in D_{ij}(X)} \phi_{M_{ij},\sigma}(\delta)\,\big(f_{ij}^{+}(\delta) - f_{ij}^{-}(\delta)\big)\right]$$
exists and is finite.
Proof:
Notice that the discontinuities in $f$ can occur only when there is a change in the assignment of one of the observations. If this occurs at the point $\delta$, then it is straightforward to show that $\big|f_{ij}^{+}(\delta) - f_{ij}^{-}(\delta)\big| \le C\,\mathrm{Diam}(X)$, where $\mathrm{Diam}(X)$ is the diameter of the rows of $X$ and $C$ is a constant independent of $X$. There are also clearly finitely many such discontinuities since there are finitely many cluster solutions arising from $n$ data, i.e., $|D_{ij}(X)| \le C'$, for some constant $C'$ independent of $X$. Furthermore, as long as all cluster assignments remain the same $M_k(X)_{ij}$ is a linear function of $X_{ij}$, and hence $f$ is Lipschitz as a function of $X_{ij}$ between points of discontinuity. Finally,
$$\mathbb{E}\left[\sum_{\delta \in D_{ij}(X)} \phi_{M_{ij},\sigma}(\delta)\,\big(f_{ij}^{+}(\delta) - f_{ij}^{-}(\delta)\big)\right] \le \frac{C C'}{\sigma\sqrt{2\pi}}\,\mathbb{E}\big[\mathrm{Diam}(X)\big],$$
since $\phi_{M_{ij},\sigma}$ is maximised by $\delta = M_{ij}$, and is bounded above by $1/(\sigma\sqrt{2\pi})$. Now, the tail of the distribution of $\mathrm{Diam}(X)$ is similar to that of the distribution of the maximum of $\binom{n}{2}$ random variables with (non-central) $\chi^2$ distributions with $d$ degrees of freedom, and the expectation of such a maximum is clearly finite. Hence the expectation in Lemma 2 exists and is finite.
One of the most important consequences of [16, Lemma 2], which leads to the first term on the right hand side of Eq. (6), is that this term is devoid of any of the parameters of the underlying distribution. An unbiased estimate of this term can be obtained by taking the partial derivatives of the model using the observed data. In the case of $k$-means one arrives at,
$$\frac{\partial M_k(X)_{ij}}{\partial X_{ij}} = \frac{1}{n_{c(i)}},$$
where $n_{c(i)}$ is the number of data assigned to centroid $\mu_{c(i)}$. Therefore,
$$\sum_{i=1}^{n}\sum_{j=1}^{d}\frac{\partial M_k(X)_{ij}}{\partial X_{ij}} = \sum_{i=1}^{n}\frac{d}{n_{c(i)}} = kd.$$
The excess degrees of freedom therefore equals the difference between the effective degrees of freedom and the commonly used value of $kd$, the number of elements in the matrix of centroids. The excess degrees of freedom reintroduces the unknown parameters to the degrees of freedom expression, and hence invariably one must use estimates (which, in the case of the residual variance, unfortunately depend on the degrees of freedom). Furthermore, as noted in [18], it is generally not straightforward to determine the discontinuity points, making the computation of the excess degrees of freedom challenging. This is discussed in further detail in the following subsection.
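The value $kd$ for the sum of partial derivatives holds whenever a small perturbation leaves the cluster assignments unchanged, and can be checked by finite differences with the labels held fixed (an illustrative sketch with arbitrary data and assignments):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 30, 4, 3
X = rng.normal(size=(n, d))
labels = np.arange(n) % k      # fixed assignments, 10 points per cluster

def fitted(X):
    # fitted values with assignments held fixed: each row gets its cluster mean
    cents = np.array([X[labels == c].mean(axis=0) for c in range(k)])
    return cents[labels]

eps, div = 1e-6, 0.0
for i in range(n):
    for j in range(d):
        Xp = X.copy()
        Xp[i, j] += eps
        # each perturbation moves only entry (i,j) of the fit, by 1/|cluster|
        div += (fitted(Xp)[i, j] - fitted(X)[i, j]) / eps

print(round(div, 3))  # k * d = 12
```

Each cluster of size $m$ contributes $m \cdot d / m = d$, so the total is $kd$ regardless of the cluster sizes.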
2.2 Approximating Excess Degrees of Freedom
Consider the excess degrees of freedom arising from the $(i,j)$th entry,
$$\mathbb{E}\left[\sum_{\delta \in D_{ij}(X)} \phi_{M_{ij},\sigma}(\delta)\,\big(f_{ij}^{+}(\delta) - f_{ij}^{-}(\delta)\big)\right].$$
Since the data matrix, $X$, is assumed to have arisen from the distribution underlying the expectation above, the excess degrees of freedom are approximated using the observed data and the corresponding clustering solution obtained. Firstly, it becomes necessary to use estimates of both $M_{ij}$ and $\sigma$, where $M_{ij}$ is estimated using $M_k(X)_{ij}$, and the estimation of $\sigma$ will be discussed at the end of this subsection. For the moment assume that an estimate $\hat\sigma$ is fixed.
Now, recall that the discontinuities are those $\delta$ at which the assignment of some of the data changes. That is, those $\delta$ for which $\exists\, i'$ s.t.
$$\lim_{t \to \delta^{-}} c\big(i'; X + (t - X_{ij})e_{ij}\big) \neq \lim_{t \to \delta^{+}} c\big(i'; X + (t - X_{ij})e_{ij}\big),$$
where $e_{ij}$ is as before the matrix with zeros except the $(i,j)$th entry where it takes the value one, and $c(i'; \cdot)$ makes explicit the dependence of the assignment of the $i'$th observation on the data matrix. This dependence is stressed explicitly since the location and magnitude of the discontinuities is determined in relation not to the observed matrix, $X$, but rather the modified matrix, $X + (t - X_{ij})e_{ij}$. It is this fact which makes determining the discontinuity points extremely challenging. Indeed it is possible to construct examples where slight changes in only a single matrix entry can result in reassignments of large subsets of data, resulting in substantial and unpredictable changes in the values of the fit. Experiments have led to the conclusion that in most cases contributions to the excess degrees of freedom arising from discontinuities in the fitted value of the $(i,j)$th entry, which come about as a result of a reassignment of another row of $X$, are negligible, in general contributing only a very small fraction of the total. Only reassignments of $X_{i\cdot}$ which come about as a result of modifying the $(i,j)$th entry are thus considered. If we consider the value $t$ at which the assignment of $X_{i\cdot}$ changes from $c(i)$ to some $c' \neq c(i)$, while ignoring all other clusters, we find that $t$ must satisfy
$$\left\|X_{i\cdot} + (t - X_{ij})e_j - \left(\mu_{c(i)} + \frac{t - X_{ij}}{n_{c(i)}}e_j\right)\right\|^2 = \big\|X_{i\cdot} + (t - X_{ij})e_j - \mu_{c'}\big\|^2. \qquad (7)$$
Here $e_j$ is the $j$th canonical basis vector for $\mathbb{R}^d$ and $n_{c(i)}$ is the size of the $c(i)$th cluster. This results in a quadratic equation which can easily be solved. A simplification is adopted here, which is that rather than considering the paths of the fitted values through multiple reassignments resulting from varying $t$ (which quickly become difficult to calculate), the magnitude and location of a discontinuity at a value $\delta$ is determined as though no reassignments had occurred for values of $t$ between $X_{ij}$ and $\delta$. It is fairly straightforward to reason that this will tend to lead to a positive bias in the total excess degrees of freedom. However, since the corresponding values of $|\delta - \hat{M}_{ij}|$ are generally large, the contributions from the quantities $\phi_{\hat{M}_{ij},\hat\sigma}(\delta)$ are generally small, and hence the bias induced by this simplification is relatively small. The excess degrees of freedom for the $(i,j)$th entry are thus approximated using
$$\widehat{\mathrm{edf}}_{ij} = \sum_{c' \neq c(i)} \phi_{\hat{M}_{ij},\hat\sigma}(\delta_{c'})\,\big(f_{ij}^{+}(\delta_{c'}) - f_{ij}^{-}(\delta_{c'})\big),$$
where $\delta_{c'}$ is the solution to Eq. (7) with smaller magnitude (when a solution exists). Notice that when $\delta_{c'} > X_{ij}$, and we assume that no values $t \in (X_{ij}, \delta_{c'})$ result in a reassignment of $X_{i\cdot}$, we have
$$f_{ij}^{+}(\delta_{c'}) - f_{ij}^{-}(\delta_{c'}) = \frac{n_{c'}\mu_{c'j} + \delta_{c'}}{n_{c'} + 1} - \left(\mu_{c(i)j} + \frac{\delta_{c'} - X_{ij}}{n_{c(i)}}\right).$$
If $\delta_{c'} < X_{ij}$ then we simply have the negative of the above. Now, notice that the locations and magnitudes of the discontinuities in $f$ do not depend on $\sigma$. To obtain an estimate, $\hat\sigma^2$, to be used in computing the degrees of freedom, a straightforward solve-the-equation approach is employed. That is, if we define
$$\widehat{\mathrm{df}}(\sigma^2) = kd + \sum_{i=1}^{n}\sum_{j=1}^{d}\widehat{\mathrm{edf}}_{ij}(\sigma^2),$$
then $\hat\sigma^2$ is the solution which satisfies
$$\hat\sigma^2 = \frac{SS(k)}{nd - \widehat{\mathrm{df}}(\hat\sigma^2)},$$
where $\widehat{\mathrm{edf}}_{ij}(\sigma^2)$ is as above, but now with an explicit dependence on the indices $(i,j)$ and on $\sigma^2$. In the author’s experience only one or two iterations are required for convergence of the value of $\hat\sigma^2$ in the above.
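The solve-the-equation step amounts to a short fixed-point iteration. The sketch below is hypothetical: `excess_df_fn` stands in for the excess degrees of freedom approximation described above, and the update uses the residual-variance relation between $\hat\sigma^2$, $SS(k)$ and the degrees of freedom as reconstructed here:

```python
def solve_sigma2(ss, n, d, k, excess_df_fn, tol=1e-8, max_iter=50):
    """Fixed-point iteration for the residual variance estimate.

    excess_df_fn(sigma2) should return the total estimated excess
    degrees of freedom at a given value of sigma^2; the full degrees
    of freedom estimate is then k*d plus this excess."""
    sigma2 = ss / (n * d - k * d)   # start from the standard-df estimate
    for _ in range(max_iter):
        df = k * d + excess_df_fn(sigma2)
        new = ss / (n * d - df)
        if abs(new - sigma2) < tol:
            return new
        sigma2 = new
    return sigma2
```

With a constant stand-in for the excess term, e.g. `solve_sigma2(100.0, 50, 2, 3, lambda s2: 5.0)`, the iteration converges in two steps, consistent with the observation that one or two iterations usually suffice.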
3 Experimental Results
In this section the results of experiments using the proposed approach of estimating the degrees of freedom are presented. To assess the reliability of the approach for model selection, a standard approximate BIC formula is used. That is, the number of clusters to select is that which minimises
$$nd\log\big(SS(k)\big) + \log(nd)\,\widehat{\mathrm{df}}(k),$$
where $SS(k)$ is the total within cluster sum of squares,
$$SS(k) = \sum_{i=1}^{n}\big\|X_{i\cdot} - \mu_{c(i)}\big\|^2.$$
This approach will be referred to as BIC$_{\mathrm{edf}}$. (A simple R package containing code to compute BIC$_{\mathrm{edf}}$ is available at https://github.com/DavidHofmeyr/edfkmeans; the package also contains sample code towards conducting the experiments presented in this section.) For context the proposed approach is compared with popular methods from the literature. These include

The approximate BIC described above but using the standard degrees of freedom formula, $\widehat{\mathrm{df}}(k) = kd$. This method will be referred to as BIC$_{kd}$.

The gap statistic [17], which is based on approximating the deviation of the within cluster sum of squares from its expected value when the underlying data distribution contains no clusters. Monte Carlo based estimates of this deviation are produced by simulating a uniform distribution scaled to have the same covariance as the data. Each estimate was based on 50 Monte Carlo samples. The implementation in the R package cluster [12] was used, and the number of clusters was selected based on the recommendation in [17].

The method of [14], which uses the same motivation as the gap statistic, but determines the deviation of the sum of squares from its expected value analytically under the assumption that the data distribution meets the standard $k$-means assumptions. When the data meet these assumptions the expectation is that this approach should perform extremely well. This approach will be referred to as fK.

The silhouette index [11], which is based on comparing the average dissimilarity of each point to its own cluster with its average dissimilarity to points in different clusters. Dissimilarity is determined by the Euclidean distance between points.
For all experiments presented in this section, the data sets were first standardised to have unit variance in every dimension.
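The selection loop for either BIC variant can be sketched as follows. This is illustrative only: the $k$-means routine below uses a deterministic farthest-first initialisation with plain Lloyd iterations rather than the multiply-restarted algorithm used in practice, and `df_fn` stands in for whichever degrees of freedom estimate ($kd$ or the effective-df estimate) is being used:

```python
import numpy as np

def kmeans_ss(X, k, iters=100):
    """Within-cluster sum of squares from farthest-first init + Lloyd."""
    C = [X[0]]
    for _ in range(1, k):  # farthest-first initialisation
        d2 = ((X[:, None, :] - np.array(C)[None, :, :]) ** 2).sum(-1).min(1)
        C.append(X[np.argmax(d2)])
    C = np.array(C)
    for _ in range(iters):
        lab = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
        newC = np.array([X[lab == j].mean(0) if np.any(lab == j) else C[j]
                         for j in range(k)])
        if np.allclose(newC, C):
            break
        C = newC
    return ((X - C[lab]) ** 2).sum()

def bic_select(X, df_fn, k_max=10):
    """Select k minimising nd*log(SS(k)) + log(nd)*df_fn(k)."""
    n, d = X.shape
    bics = []
    for k in range(1, k_max + 1):
        ss = kmeans_ss(X, k)
        bics.append(n * d * np.log(ss / (n * d)) + np.log(n * d) * df_fn(k))
    return int(np.argmin(bics)) + 1
```

Note that adding a constant (such as $-nd\log(nd)$) to every BIC value leaves the argmin unchanged, so the formula in the code selects the same $k$ as the criterion stated above.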
3.1 Simulations
In this subsection a thorough simulation study is presented, which is used to investigate the reliability of the proposed approach in model selection for $k$-means. Reliability is assessed through the ability to detect the correct number of clusters, and, in the event that the number is incorrectly selected, through the accuracy of the resulting clustering. The robustness of the approach is assessed via a variety of deviations from the modelling assumptions (Eq. (1)). It is important to consider the accuracy of the clustering solutions as well as the accurate determination of the number of clusters. This is because if, for example, the variance of each group is different, the $k$-means model is capable of focusing on splitting a single group with large variance rather than separating multiple groups of small variance. It is therefore possible that a solution with the correct number of clusters may be inferior in accuracy to one which contains an incorrect number of clusters. Numerous cluster evaluation metrics have been developed; see [1] for an extensive review. For these experiments the adjusted Rand index [10] is used. The Rand index [15] is defined as the proportion of pairs of observations which are treated the same way, grouped together or kept apart, in both the clustering solution and the true underlying grouping. The adjusted version of the index accounts for the expected value of the Rand index under a random assignment of data to clusters. The adjusted Rand index takes values in $[-1, 1]$, with positive values indicating that the solution is better than a random assignment, on average, and a value close to one indicating a very close agreement between the cluster assignments and the true underlying groups. This index captures both the homogeneity of the clusters, i.e. that points in the same cluster should mostly be from the same true group, as well as completeness, i.e. that the true groups should not be substantially divided among multiple clusters [1].
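The Rand and adjusted Rand computations reduce to pair counts from the contingency table between the two labellings; a compact sketch (function names are our own):

```python
from collections import Counter
from itertools import combinations

def rand_index(a, b):
    """Proportion of pairs treated the same way (together or apart)
    by both labellings."""
    n = len(a)
    agree = sum((a[i] == a[j]) == (b[i] == b[j])
                for i, j in combinations(range(n), 2))
    return agree / (n * (n - 1) / 2)

def adjusted_rand(a, b):
    """Rand index corrected for chance (Hubert and Arabie, 1985)."""
    comb2 = lambda m: m * (m - 1) / 2
    n = len(a)
    # pair counts from the contingency table and its margins
    sum_ij = sum(comb2(v) for v in Counter(zip(a, b)).values())
    sum_a = sum(comb2(v) for v in Counter(a).values())
    sum_b = sum(comb2(v) for v in Counter(b).values())
    expected = sum_a * sum_b / comb2(n)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)
```

Note that the adjusted index is invariant to relabelling of the clusters, e.g. `adjusted_rand([0, 0, 1, 1], [1, 1, 0, 0])` equals one.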
For all experiments data sets of size 1000 were generated from a 10 component Gaussian mixture in 10 dimensions, with each component representing a cluster. Under the modelling assumptions of $k$-means the mixing proportions are equal and the covariance matrices of the clusters are equal and proportional to the identity matrix. To deviate from these assumptions, the mixing proportions, the scale and shape of the covariance matrices, as well as the number of noise dimensions can be varied. For each collection of settings 50 data sets were generated.
Figures 2 to 6 show the empirical distributions of the number of clusters selected by each method, under different underlying data distributions. The horizontal lines are proportional to the number of times (out of 50) that the corresponding number of clusters was selected. Table 1 shows the average adjusted Rand index achieved by each method for each collection of settings.
fK  Gap  Silh.  BIC$_{kd}$  BIC$_{\mathrm{edf}}$  

Assumptions Met  low  0.98  0.71  0.95  1.00  1.00 
moderate  0.79  0.05  0.87  0.87  0.89  
Proportions differ  moderately  0.91  0.77  0.95  0.99  1.00 
substantially  0.72  0.89  0.97  0.79  0.99  
Variances differ  moderately  0.98  0.70  0.95  1.00  1.00 
substantially  0.77  0.24  0.80  0.68  0.72  
Shapes differ  moderately  0.97  0.61  0.95  0.98  0.99 
substantially  0.92  0.53  0.92  0.92  0.94  
Noise dimensions  one  0.92  0.40  0.94  0.75  1.00 
three  0.62  0.19  0.83  0.57  0.88 

When the modelling assumptions are met (Figure 2), BIC with the proposed estimation method for degrees of freedom is the only method which correctly identified 10 clusters in every experiment. When the within cluster variance is low (Figure 2(a)) the standard BIC implementation and the fK method also performed well. The gap statistic underestimated the correct number in general, with the silhouette index slightly overestimating the correct value on average. With moderate within cluster variability the gap statistic performed extremely poorly, with the other methods being fairly robust to this change. This performance is reflected in the adjusted Rand index values as well.
Figure 2: Empirical distribution of the number of clusters selected by each method: Modelling assumptions are met. 
Arguably the simplest deviation from the modelling assumptions for $k$-means is to vary the mixing proportions, leading to clusters with different numbers of points each on average (Figure 3). As expected this tends to lead to an underestimation of the correct number of clusters, due to there essentially being less “evidence” to support the existence of smaller clusters. Both versions of BIC again perform best in terms of the correct estimation of $k$. Interestingly, when the mixing proportions vary substantially (Figure 3(b)) the standard BIC does not underestimate $k$ on average, whereas all other methods do. This does not, however, translate to the clustering accuracy, where the standard BIC is one of the worst performers. The greater number of selected clusters must therefore correspond to splitting some of the true groups apart. In general all methods seem fairly robust to these changes.
Figure 3: Empirical distribution of the number of clusters selected by each method: Mixing proportions differ. 
Figure 4 shows the case where the individual group variances differ from one another. When the difference is moderate the performance of all methods is scarcely different from the situation where there is no difference in the variances. The case where the variances differ substantially is the only case in which the proposed approach does not obtain the greatest average accuracy. The silhouette index approach appears more robust for maintaining clustering accuracy, even though it overestimates the number of clusters. The gap seems most severely affected by this deviation from the modelling assumptions.
Figure 4: Empirical distribution of the number of clusters selected by each method: Within cluster variances differ. 
When the group covariance shapes differ to a large degree (Figure 5(b)), the fK method is the only approach which still reliably maintains a good estimation of the number of clusters. Both versions of BIC as well as the silhouette index overestimate substantially. The “additional clusters” which are included by these methods do not decrease the overall accuracy substantially, suggesting they contain relatively few points each. All methods besides the gap are reasonably robust to these changes in terms of clustering accuracy.
Figure 5: Empirical distribution of the number of clusters selected by each method: Covariance shapes differ for different clusters. 
With the introduction of dimensions which do not contribute to the group separation, the proposed approach and the silhouette index appear most robust in terms of cluster number estimation (Figure 6) and overall clustering accuracy. The standard BIC implementation and the gap statistic are most severely affected by this change.
Figure 6: Empirical distribution of the number of clusters selected by each method: Data contain dimensions which do not contribute to the cluster structure (noise dimensions)
3.2 Real Applications
In this section results from experiments using publicly available data sets associated with real applications are presented. These are popular benchmark data sets taken from the UCI machine learning repository [2]. The methods are again compared based on their ability to select the correct number of clusters, and the accuracy of the clustering solutions obtained. The results are summarised in Table 2, which shows the selected number of clusters, $\hat k$, and the adjusted Rand index (A.Rand). In addition, the number of times each method selected closest to the true value of $k$ (or was tied for closest) is included. Similarly, the number of times the highest adjusted Rand index was achieved by each method is shown. Unlike in the simulation study, the gap statistic is the most reliable for accurately detecting the correct number of clusters, followed by the silhouette index and the proposed approach. The fK method does not appear robust when the data do not come from a Gaussian mixture, selecting the smallest value of $k$ considered by the method for every data set. The proposed approach, despite overestimating the number of clusters in a few examples, achieved the best clustering accuracy overall. All things considered the gap statistic and the proposed approach performed the best on this selection of data sets.
fK  Gap  Silhouette  BIC$_{kd}$  BIC$_{\mathrm{edf}}$  
Data set  $\hat k$  A.Rand  $\hat k$  A.Rand  $\hat k$  A.Rand  $\hat k$  A.Rand  $\hat k$  A.Rand  
Wine (k=3)  2  0.37  3  0.90  3  0.90  5  0.66  3  0.90 
Seeds (k=3)  2  0.48  3  0.77  2  0.48  5  0.52  3  0.77 
Iris (k=3)  2  0.57  2  0.57  2  0.57  5  0.42  2  0.57 
Mammography (k=2)  2  0.39  2  0.39  3  0.32  14  0.01  3  0.32 
Parkinsons (k=2)  2  0.10  1  0.00  2  0.10  7  0.06  7  0.06 
Forest (k=4)  2  0.18  6  0.35  2  0.18  15  0.21  12  0.22 
Auto (k=3)  2  0.04  4  0.13  2  0.04  9  0.08  4  0.13 
Breast Cancer (k=2)  2  0.82  6  0.40  2  0.82  9  0.36  4  0.76 
Dermatology (k=6)  2  0.21  4  0.65  3  0.57  7  0.76  5  0.84 
Synth. (k=6)  2  0.27  4  0.58  2  0.27  8  0.67  8  0.67 
Soy Bean (k=19)  2  0.05  3  0.05  23  0.45  16  0.56  18  0.43 
Olive Oil (k=4/7)  2  0.46/0.34  7  0.41/0.64  5  0.58/0.76  10  0.36/0.62  6  0.53/0.80 
# times max A.Rand  3  6  4  3  8  
# times $\hat k$ closest to true $k$  6  8  7  2  7
4 Discussion
This work used the notion of effective degrees of freedom to obtain an alternative to the standard application of the Bayesian Information Criterion for model selection in $k$-means clustering. An extension of Stein’s lemma made it possible to approximate the effective degrees of freedom. A thorough simulation study illustrated the effectiveness and robustness of this approach in model selection for the $k$-means method. Experiments using publicly available benchmark data sets suggest that this approach is competitive with popular existing methods for the problem.
References
 [1] Amigó, E., Gonzalo, J., Artiles, J., Verdejo, F.: A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information Retrieval 12(4), 461–486 (2009)
 [2] Bache, K., Lichman, M.: UCI machine learning repository (2013). URL http://archive.ics.uci.edu/ml
 [3] Candes, E.J., Sing-Long, C.A., Trzasko, J.D.: Unbiased risk estimates for singular value thresholding and spectral estimators. IEEE Transactions on Signal Processing 61(19), 4643–4657 (2013)
 [4] Carson, C., Belongie, S., Greenspan, H., Malik, J.: Blobworld: Image segmentation using expectation-maximization and its application to image querying. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(8), 1026–1038 (2002)
 [5] Charrad, M., Ghazzali, N., Boiteau, V., Niknafs, A.: NbClust: An R package for determining the relevant number of clusters in a data set. Journal of Statistical Software 61(6), 1–36 (2014). DOI 10.18637/jss.v061.i06. URL https://www.jstatsoft.org/v061/i06
 [6] Dillon, W.R., Mulani, N.: Ladi: A latent discriminant model for analyzing marketing research data. Journal of Marketing Research pp. 15–29 (1989)
 [7] Efron, B.: How biased is the apparent error rate of a prediction rule? Journal of the American Statistical Association 81(394), 461–470 (1986)
 [8] François, O., Durand, E.: Spatially explicit Bayesian clustering models in population genetics. Molecular Ecology Resources 10(5), 773–784 (2010)
 [9] Hamerly, G., Elkan, C.: Learning the k in k-means. In: Advances in Neural Information Processing Systems, pp. 281–288 (2004)
 [10] Hubert, L., Arabie, P.: Comparing partitions. Journal of classification 2(1), 193–218 (1985)
 [11] Kaufman, L., Rousseeuw, P.J.: Finding groups in data: an introduction to cluster analysis, vol. 344. John Wiley & Sons (2009)
 [12] Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., Hornik, K.: cluster: Cluster Analysis Basics and Extensions (2017). R package version 2.0.6 — For new features, see the ’Changelog’ file (in the package source)
 [13] Manning, C., Raghavan, P., Schütze, H.: Introduction to Information Retrieval, 1 edn. Cambridge University Press (2008)
 [14] Pham, D.T., Dimov, S.S., Nguyen, C.D.: Selection of k in k-means clustering. Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science 219(1), 103–119 (2005)
 [15] Rand, W.M.: Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66(336), 846–850 (1971)
 [16] Stein, C.M.: Estimation of the mean of a multivariate normal distribution. The Annals of Statistics 9(6), 1135–1151 (1981)
 [17] Tibshirani, R., Walther, G., Hastie, T.: Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63(2), 411–423 (2001)
 [18] Tibshirani, R.J.: Degrees of freedom and model search. Statistica Sinica pp. 1265–1296 (2015)