Penalized kmeans algorithms for finding the correct
number of clusters
in a dataset
Abstract
In many applications we want to find the number of clusters in a dataset. A common approach is to use the penalized kmeans algorithm with an additive penalty term linear in the number of clusters. An open problem is estimating the value of the coefficient of the penalty term. Since estimating the value of the coefficient in a principled manner appears to be intractable for general clusters, we investigate ideal clusters, i.e. identical spherical clusters with no overlaps and no outlier background noise. In this paper: (a) We derive, for the case of ideal clusters, rigorous bounds for the coeffcient of the additive penalty. Unsurprisingly, the bounds depend on the correct number of clusters, which we want to find in the first place. We further show that additive penalty, even for this simplest case of ideal clusters, typically produces a weak and often ambiguous signature for the correct number of clusters. (b) As an alternative, we examine the kmeans with multiplicative penalty, and show that this parameterfree formulation has a stronger, and less often ambiguous, signature for the correct number of clusters. We also empirically investigate certain types of deviations from ideal cluster assumption and show that combination of kmeans with additive and multiplicative penalties can resolve ambiguous solutions.
1 Introduction
Finding the correct number of clusters in a dataset is an open, important problem in many applications. A widely used algorithm for this purpose is kmeans, and while it is appropriate only for samesize spherically symmetric clsuters, in practice it is often used beyond its valid range also for clusters with vastly different shapes and sizes. Because of its applicability in many areas including image and pattern analysis, statistics, information theroy, data reduction and others, there are a number of excellent treatments of kmeans in the literature. Here we cite only a few [6][7][10][12][18][20][21][24]. For the sake of completeness, we include a brief description of the kmeans here as well.
The problem may be stated as: Given a set of unlabeled data points , , in a dimensional feature space, we want to partition the data into clusters , , such that the empirical error (or cost function, distortion, sum of withincluster variances),
(1.1) 
is minimized. Here denotes the centroid of cluster , is the number of data points that are in cluster , and . The kmeans algorithm is a popular method for solving this nonconvex optimization problem.
In the kmeans algorithm we start by randomly selecting cluster centroids, and then use the Lloyd iterations to obtain a solution. The Lloyd iteration [19] is a simple ExpectationMaximization (EM) process, more precisely a hard EM process [21], where the two steps are: (i) assign the points that are closest to a cluster centroid to that centroid, (ii) recompute the cluster centroids. Then repeat the two steps until cluster memberships no longer change. This process monotonically decreases the error and typically converges to a local minimum. In order to potentially find a better solution, the algorithm is restarted many times with different initial cluster centroids and then the clustering with the smallest error is taken as the solution. Various initialization procedures have been introduced in the literature among which we cite the kmeans++ [3], and its modifications [4][5], that generally yield better solutions. Also a number of alternative algorithms have been proposed in the literature for more efficient solutions. Here we cite only a few: [15] computes distances more efficiently; [17] presents a parallel method; [22] proposes successive partitioning of data; and [13][14] proposes a soft kmeans algorithm based on the Hopfield network that generally leads to better solutions, albeit at higher comuptational cost.
We note that these algorithms, even for the simple case of ideal clusters (defined below), cannot be proven to find the best solution, or yield objective information about the correct number of clusters. Neither are the results reproducible, because of the random selection of initial clusters.
In many applications we do not know the number of clusters a priori, rather we want to find the correct number of clusters. Examples include finding the number of speakers in a crowded room from audio recordings, the number of vessels in a hightraffic channel from navigation radar signals, the number of distinct textures in an image, gene expressions, and many others. Since increasing leads to solutions with ever decreasing , the error (1.1) cannot indicate what the correct may be. Therefore, we must use a penalized error that has its global minimum at the correct value of . Penalized error may be formulated with either additive or multiplicative penalties respectively given by:
(1.2) 
(1.3) 
where is a monotonically increasing function of . The common practice is to use the additive penalized error . It will have a global minimum at some finite , which depends on the value of the penalty coefficient . The problem then becomes the more complicated simultaneous determination of the optimal , as well as the optimal set of cluster centroids , . The correct number of clusters depends critically on the value of coefficient . Obviously, small favors a larger number of clusters, and vice versa, large leads to a smaller number of clusters.
In this paper, we present a rigorous analysis of the penalized kmeans with additive penalty to obtain principled bounds for the value of . Since treating general clusters appears to be intractable, we consider ideal clusters that are the simplest form of clusters. We will describe and justify the rather strong assumption for using ideal clusters in the next section. But briefly, ideal clusters are dimensional spheres, with identical sizes, with no overlapps, and no outlier background data points.
Related work
Because finding the correct number of clusters plays an important role in many applications, a large number of methods have been proposed in the literature. While many methods have been proposed for the standard nonpenalized kmeans formulation (1.1), there is little work on the penalized kmeans formulations (1.2) and (1.3).
For the nonpenalized kmeans there are two broad approaches, which maybe called slopebased and validationbased. Here we cite only a few relevant papers and the references therein. The slopebased approach essentially tries to find a sudden change in the slope of the clustering error as a function of , and includes the elbow method, the more elaborate silhouette method [23], and the related slope statistics method [9]. These methods are subjective and in general unreliable. In the validationbased approach, the data is divided into a training set and a validation set [2][8]. This is a more reliable approach, but the outcome can be sensitive to how the data is divided into training and validation sets.
For the penalized kmeans there is very little work. Recently [16] derives the penalized kmeans with additive penalty from a Bayesian nonparametric viewpoint. However, coefficient is selected as the largest disatnce any data point can have from its centroid. Althogh this scheme limits the growth of number of clusters, it does not find the correct number of clusters. iTo our knowledge, the multiplicative penalized error has not been studied in the literature. We show that this parameterfree formulation has properties that appear to be more useful than the additive penalized error. We note that the penalized kmeans is in the broad class of slopebased methods, but without being subjective in principle.
In the following sections, we define ideal clusters, analyze penalized kmeans with additive and multiplicative penalties, and present a number of selected experiments that show the validity of theoretical analyses and certain limits of ideal cluster assumption.
2 Ideal Clusters
Here we define ideal clusters and argue that they are consistent with the underlying assumptions for kmeans clusters even though more restrictive. The original kmeans clustering may be interpreted as: we want to simultaneously estimate the value of entities , from unlalbeled measurements , and that these measurements have identical independent normal distributions , with the same covariance , around their true values. To do so, we must cluster the measurements into correct groups and maximize the likelihood,
(2.4) 
or equivalently minimize its logarithm given by (1.1). Here is the normalization factor, and is the dimension of the feature space. The underlying assumptions are: (a) clusters are spherically symmetric since the variances are isotropic in all the Euclidean space dimensions, (b) clusters have the same size since variances are identical, (c) clusters are sufficiently separated so no two clusters overlap in such a way that their peaks cannot be distinguished from each other, and (d) there are no outlier data points not belonging to any of the clusters.
We define ideal clusters to have the following properties: (a) clusters are spherical; (b) have the same size and volume ; (c) are nonoverlapping, a stronger assumption than the original; and (d) have no outliers. We make an additional assumption (e) that clusters are dense, that is, the data points nearly fill the volume of the spheres. This is primarily for computational convenience to replace sums with integrals, and implies the number of data points in each cluster may be approximated by the volume of the sphere and that is equivalent to .
In regular kmeans we know the number of clusters a priori. In penalized kmeans, however, is a variable to be determined along with the set of cluster centroids . The optimization procedure is to vary and then for each find the optimal set using the regular kmeans clustering. The expectation is that as a function of the number of clusters behaves as shown in Fig. 1, where the correct is the value at which penalized error is minimum. This happens if the regular kmeans algorithm finds the optimal set for each . Since the regular kmeans algorithm cannot guarantee optimality, we restrict the discussion to ideal clusters for which we can design an algorithm that is provably optimal.
2.1 Optimal Clustering Algorithm
The algorithm that is provably optimal for ideal clusters is as follows.

Step 1: Choose a data point and save it as the initial centroid for the first cluster . (The algorithm is insensitive to the initial choice; we typically choose the point closest to the origin of the feature space.)

Step 2: Choose the data point farthest from . Save it as the initial centroid for the second cluster . Using these initial centroids, run the Lloyd iteration until convergence to , .

Step 3: Choose the data point farthest from both and Save it as the initial centroid for the third cluster . Using these initial centroids, run the Lloyd iteration until convergence to , .

Step 4 and higher: At step continue choosing the farthest point to all the previous initial centroids and save it as the initial centroid for the th cluster. Using these initial centroids, run the Lloyd iteration until convergence to , .
We note that this initialization procedure may be considered as a deterministic version of kmeans++ initialization [3].
When ( being the correct number of clusters), the clustering is optimal. The proof is straightforward, because each initial centroid is in a distinct cluster, and since clusters are nonoverlapping each initial centroid converges to the true centroid of the cluster to which it belongs. When one of the clusters splits into two equal halfspheres, since that cluster contains two of the initial clusters. And when two adjacent clusters merge and form a dumbbell cluster. Fig. 1 illustrates these cases for an example with .
2.2 Errors for Individual Clusters
When the number of clusters is close to the correct value, individual ideal clusters take on three different shapes, namely, sphere with clustering error , halfsphere with error , and dumbbell with error . These individual clustering errors are given by,
(2.5) 
where is the volume of the dimensional sphere with radius ,
(2.6) 
is the distance of the center of the dumbbell to the center of either of its spheres,
(2.7) 
and is the generalized factorial function. Also , where is the distance of the halfsphere centroid from the equator plane. We refer to Appendix A for derivations of (2.5) and (2.6). Also see [6] for aspects of statistics in highdimensional spaces.
3 Additive Penalty
Here we analyze the linear additive penalty to obtain the bounds for the penalty coefficient . Other functional forms of penalty are discussed in more detail in Appendix B and mentioned briefly at the end of this section. The conclusions, however, remain the same.
Let denote the correct number of spherical (ideal) clusters. The case with clusters then has spheres and one dumbbell, and the case with clusters has spheres and 2 halfspheres. Thus, the clustering errors for the nonpenalized case in (1.1) are given by:
(3.8) 
The clustering error obviously decreases as we increase the number of clusters, and it is straight forward to show this for the ideal case:
(3.9) 
The second inequality is satisfied since , as may also be seen in Fig. 8.
For the clustering error to have the minimum at we must require:
(3.10) 
SInce , and volume stands for the average number of points in each cluster, i.e. , the two inequalities may be combined and written as:
(3.11) 
Note that depends on the correct number of clusters , which we want to determine. This introduces another layer of complexity in using kmeans with additive penalty. To overcome this problem, we propose the following procedure.
3.1 Procedure for Using Additive Penalty

Begin by assuming . Select a value for in the range given by (3.11). Run the kmeans algorithm for , where is very large. If has a minimum at , then the assumed is a potential solution.

Next assume . Select a value for in the corresponding range. Run the kmeans algorithm for . If has a minimum at , then the assumed is a potential solution.

Then assume , and repeat the process. If has a minimum at , then is a potential solution.

And so on assuming up to a very large value.
To make this procedure consistent and repeatable, we propose to select the midpoint of the range given in (3.11) as the value for . That is,
(3.12) 
The approximation is reasonable since (a) and for and goes to zero as the dimension , and (b) the term may be neglected since for nonoverlapping clusters. We use the approximate value for in the experiments we present later. This approximation becomes increasingly better as dimension increases. For we choose the smallest intercentroid distance, which also depends on .
We also considered other penalty functions, namely, . Similar to the linear penalty (3.12), for these cases also depends on and is, respectively (see Appendix B),
(3.13) 
4 Multiplicative Penalty
In this section, we analyze in (1.3) with linear multiplicative penalty , and show that for ideal clusters it has a natural miniumum at the correct number of clusters . To do so, we show that for clusters with multiplicative penalty while :
(4.14) 
The first inequality is satisfied since and . The second inequality holds because for (see Fig. 8 ). For the uninteresting case (clustering on a straight line), for which , the inequality is satisfied when .
Note that the inequalities happen naturally; unlike the additive penalty case where we must impose inequalities to ensure becomes minimum at .
5 Experiments
In this section we present several experiments for selected dimensions and number of clusters, and also varying degrees of cluster overlaps. They show that multiplicative penalty has a stronger signature than additive penalty, and is generally more stable when the ideal nonoverlap assumption is violated.
To do the experiments we first use the clustering algorithm presented in Sec. 2.1. Then for the additive penalty we use the procedure presented in Sec. 3.1, which yields candidate solutions when the Assumed and Estimated number of clusters are the same. We also compute the multiplicative penalized error as a function of from (1.3). Its minima are candidate solutions. In cases where there are ambiguities as to the correct number of clusters , the agreement between solutions of additive and multiplicative penalties resolves the ambiguity.
Figures 2 and 3 are both 2D cases with clusters each with approximately 100 point points. It can be seen that the additive penalty yields several candidate solutions, while multiplicative penalty yields only one solution at the correct number of clusters. It can also be seen that the additive penalized errors are significantly shallower than the multiplicative penalized error. Also note that as clusters overlap more, as in Fig. 3, both additive and multiplicative errors become shallower.
Fig. 4 is a 2D case with clusters each with about 200 data points. As can be seen there is little overlap among clusters, nevertheless additive penalty yields several solutions while the multiplicative penalty yields only one solution at the correct value. In Fig. 5 the clusters are the same as in Fig. 4 but the intercluster distances have been significantly reduced so that the clusters have substantial overlap. Even though both additive and multiplicative penalties yield several solutions, they both only agree on .
Fig. 6 is an example with clusters in dimensions. The additive penalty yields as candidate solutions, while the multiplicative penalty yields only . We also plot for and , which show the comparative depth of their minima. Note that at higher dimensions clusters generally have less overlap, thus becoming more like ideal clusters with little overlap and unambiguous solutions.
Fig. 7 is a composite picture of five Brodatz textures with 256256 image size. There are many alternative ways of representing the image for clustering. Here we convert the image to a set of data points into a dimensional space by using nonoverlapping 44 cosine transform. Thus, the image with over 65k pixels is represented by by 4,096 data points in a 16dimensional feature space. The pixels on the borders of the texture regions are sparse and appear like outliers. The clustering algorithm presented in Sec. 2.1 is sensitive to outliers, as are other clustering algroithms. Therefore, to deal with outliers we compute the point density around each of them, and eliminate those that have densities below a certain threshold. The remaining data is 3,660 points. In this figure we show the results of the penalized kmeans algorithms. As may be seen, both additive and multiplicative penalties have pronounced minima at yielding the correct number of clusters or textures. Again we note that clusters in high dimensional spaces are more likely to behave as ideal clusters, hence yielding unambiguous correct solutions.
6 Concluding Remarks
Penalized kmeans with additive penalty is a popular algorithm for finding the correct number of clusters in a dataset. However, in practice the coefficient of the additive penalty term is chosen arbitrarily in an ad hoc manner. We presented a principled derivation of the bounds for the coefficient of additive penalty in kmeans for ideal clusters. Even though in practice clusters typically deviate from the ideal assumption, the ideal case serves as a useful guideline. We also investigated multiplicative penalty, which turns out to produce a more reliable signature for the correct number of clusters. We examined empirically deviations from the nonoverlap assumption and presented a procedure for combining kmeans with additive and multiplicative penalties to obtain the correct number of clusters when one or both approaches yield ambiguous solutions. What the limits of this disambiguation procedure are needs further theoretical analysis or more exhaustive empirical evidence.
This work may be further extended in a number of directions that relax the ideal cluster assumptions. These include (a) rigorous derivation of the limits of nonoverlap assumption; (b) extend to spherically symmetric normal distribution of data points rather than spherical distributions, which would allow greater tolerance of overlapping clusters; (c) relax the assumption that the number of points in each cluster be approximately the same; (d) theoretical analysis and the algorithm for finding the best clustering is sensitive to outliers, develop more sophisticated methods for detecting and discarding outliers.
Appendix A Appendix A
In this Appendix, we calculate the volume of a sphere in the dimensional Euclidean space, as well as the withincluster variances for the sphere, and the centroid and the withincluster variances for the halfsphere and a dumbbell. Note that while in the body of the paper we denote the dimension by , in this appendix we denote it by to avoid confusion with the differential element in the integrals.
A.1 Sphere
The volume element in the dimensional spherical coordinate system (in terms of radius and angles ) is:
(A.15) 
Hence the volume of a sphere of radius may be calculated from:
(A.16) 
The integrals over spherical angles can be calculated from the following formula [11],
(A.17) 
where , the generalized factorial function, is:
(A.18) 
With repeated use of (A.17), Eq. (A.16) simplifies to:
(A.19) 
The withincluster variance, or clustering error, for the dimensional sphere is given by:
(A.20) 
and may be expressed in the spherical coordinate system (the centroid is at the origin),
(A.21) 
This can be similarly simplified to:
(A.22) 
A.2 Halfsphere
The centroid of the halfsphere is on the axis that is orthogonal to the equator, say axis 1, that is . Its distance, , from the equator may be calculated from
(A.23) 
Subscript indicates integral over the volume of the halfsphere. This leads to
(A.24) 
The withincluster variance for the halfsphere is given by:
(A.25) 
which reduces to the following,
(A.26) 
A.3 Dumbbell
The withincluster variance for the dumbbell with two equal size spheres may be expressed as,
(A.27) 
where is the distance of the center of the dumbbell to the center of either sphere.
A quantity of interest is the ratio , which ranges between 1 and 4. For (straight line) we have , and as it approaches 1, thus . See Fig. 8.
Appendix B Appendix B
In this Appendix we derive bounds for the coefficient in the additive penalized kmeans when the penalty function has polynomial, logarithmic, and exponential dependence on the number of clusters . For the additive penalty, Eq. (1.2), in order to get a similar signature to multiplicative penalty case, we must require the following inequalities:
(B.28) 
Noting that , the two inequalities in (B.1) maybe be combined and written as:
(B.29) 
To estimate the value of , we remember that the volume is the average number of data points in each cluster, i.e., , is given by (A.24) and is less than and goes to zero as the dimension increases. And is an intercluster distance.
We consider three functional forms for . For these penalty terms (B.29) becomes,
(B.30) 
where we assume is sufficiently greater than for the polynomial case. It can be seen that depends on the correct number of clusters , which is not known a priori. In particular, for the commonly used linear penalty function, , we get:
(B.31) 
References
 [1]
 [2] R.C. de Amorim and C. Henning. Recovering the number of clusters in data sets with noise features using feature rescaling factors. arXiv:1602.06989v1, Feb. 2016.
 [3] D. Arthur and S. Vassilvitskii. kmeans++: The Advantages of Careful Seeding. Proc. Annual ACMSIAM Symposium on Discrete Algorithms, pp. 10271035, 2007.
 [4] O. Bachem, M. Lucic, S.H. Hassani, and A. Krause. Fast and Provably Good Seedings for kMeans. Proceedings of NIPS 2016, pp. 5563.
 [5] C. Baldassi. Recombinatorkmeans: Enhancing kmeans++ by seeding from pools of previous runs. arXiv:1905.00531v1, May 2019
 [6] A. Blum, J. Hopcroft, and R. Kannan. Foundations of Data Science, Jan 4, 2018 version, https://www.cs.cornell.edu/jeh/book.pdf.
 [7] R.O. Duda, P. E. Hart, and D.G. Stork. Pattern Classification. Wiley, 2003.
 [8] W. Fu and P.O. Perry. Estimating the number of clusters using crossvalidation. arXiv:1702.02658v1, Feb. 2017.
 [9] A. Fujita, D.Y. Takahashi, and A.G. Patriota. A nonparametric method to estimate the number of clusters. Computational Statistics and Data Analysis, vol. 73, pp. 2739, 2014.
 [10] A. Gersho and R.M. Gray. Vector Quantization and Signal Compression. Kluwer Academic Publishers, 1992.
 [11] I.S. Gradshteyn and I.M. Ryzhik. Table of Integrals, Series, and Products, Academic Press, 4th edition, p. 369, 1965.
 [12] A.K. Jain and R.C. Dubes. Algorithms for Clustering Data. PrenticeHall, 1988.
 [13] B. KamgarParsi, J.A. Gualtieri, J.E. Devaney, and B. KamgarParsi. Clustering with neural networks. Biological Cybernetics, vol. 63, no. 3, pp. 201208, 1990.
 [14] B. KamgarParsi and B. KamgarParsi. Dynamical Stability and Parameter Selection in Neural Optimization. Proc. International Joint Conference on Neural Networks, vol. 4, pp 566571, 1992.
 [15] T. Kanungo, D.M. Mount, N.S. Netanyahu, C.D. Piatko, R. Silverman, and A.Y. Wu. An Efficient kMeans Clustering Algorithm: Analysis and Implementation. IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 881892, 2002.
 [16] B. Kulis and M.I. Jordan. Revisiting kmeans: New Algorithms via Bayesian Nonparametrics. Proc. International Conference on Machine Learning, Edinburgh, UK, 2012.
 [17] W. Li, E.K. Ryu, S. Osher, W. Yin, and W. Gangbo. A parallel method for earth mover’s distance. UCLA Comput. Appl. Math. Pub. (CAM) Rep., 1712, 2017.
 [18] Y. Linde, A. Buzo, and R. Gray. An Algorithm for Vector Quantizer Design. IEEE Trans. Communications, vol. 28, pp. 8495, 1980.
 [19] S. P. Lloyd. Least squares quantization in PCM. IEEE Trans. Information Theory, vol. 28, no. 2, pp. 129137, 1982. Appeared first in a 1956 AT&T Bell Labs memorandum.
 [20] C.D. Manning, P. Raghavan, and H. Schutze. Introduction to Information retrieval. Cambridge University Press, 2008.
 [21] K.P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
 [22] K. Rose, E. Gurewitz, and G.C. Fox. Vector quantization by deterministic annealing. IEEE Trans. Information Theory, vol. 38, no. 4, pp. 12491257, 1992
 [23] P.J. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, vol. 20, pp. 5365, 1987.
 [24] R. Vidal, Y. Ma, and S. Sastry. Generalized Principal Component Analysis, SpringerVerlag, 2016.