Improved Clustering with Augmented kmeans
Abstract
Identifying a set of homogeneous clusters in a heterogeneous dataset is one of the most important classes of problems in statistical modeling. In the realm of unsupervised partitional clustering, kmeans is a very important algorithm for this. In this technical report, we develop a new kmeans variant called Augmented kmeans, which is a hybrid of kmeans and logistic regression. During each iteration, logistic regression is used to predict the current cluster labels, and the cluster-belonging probabilities are used to control the subsequent reestimation of cluster means. Observations that cannot be firmly placed in a cluster are excluded from the reestimation step. This can be valuable when the data exhibit many characteristics of real datasets, such as heterogeneity, nonsphericity, substantial overlap, and high scatter. Augmented kmeans frequently outperforms kmeans by more accurately classifying observations into known clusters and/or converging in fewer iterations. We demonstrate this on both simulated and real datasets. Our algorithm is implemented in Python and will be available with this report.
Keywords: Unsupervised Learning, Clustering, kmeans
1 Introduction
Clustering $n$ datapoints in $p$ dimensions into $K$ distinct clusters is a very old problem in statistical modeling. The kmeans algorithm, introduced by MacQueen [12], is an important method for doing this (per a 2002 survey [3]). Fundamentally, the kmeans algorithm iterates through a set of steps in an attempt to minimize the sum of squared distances within all clusters. Its popularity is probably due to this inherent simplicity, rather than to any optimality guarantee. Indeed, as mentioned in Krishna and Murty [11], the kmeans algorithm exhibits a strong tendency to converge to suboptimal local minima and is not robust to the initial state, leading to different ways to cluster the same dataset. Many researchers have proposed differing degrees of variation around the underlying theme of iteratively minimizing the total sum of squared distances.
Wong [17] developed a hybrid clustering algorithm using both kmeans and single-linkage hierarchical clustering, with the specific goal of identifying high-density modal regions. With a similar goal of finding tight, stable clusters, Tseng and Wong [16] used a resampling method with a merged kmeans and truncated hierarchical clustering algorithm. To combat the same issue of scattered observations which truly don't belong in any cluster, Maitra and Ramler [13] proposed a new algorithm which iteratively builds each homogeneous spherical cluster around a core, so that scattered observations are explicitly excluded.
Several authors have used trimming in clustering with kmeans, in which certain observations are removed (or trimmed). The goal of trimming, originally proposed by Gordaliza [8], is to identify clusters robustly with respect to outliers. Cuesta-Albertos et al. [5] used a data-based trimming method optimized by simulated annealing. García-Escudero et al. [7] introduced a trimmed clustering algorithm, constrained by the ratio of maximum to minimum eigenvalues of the within-cluster scatter matrices. García-Escudero et al. [6] further extended robust clustering with trimmed kmeans by modeling linear patterns in the data around which clusters formed.
Bozdogan [4] proposed an initialization method that initializes the clusters so that they are evenly spaced throughout the data. Arthur and Vassilvitskii [2] proposed a modified algorithm called kmeans++, wherein cluster centers are spaced apart from each other following an iterative distance-weighting scheme.
Krishna and Murty [11] created a hybrid algorithm called Genetic kmeans based on the Genetic Algorithm of Holland [9, 10]. Along similar lines, Song et al. created their GARM algorithm, which computes the cluster distances using a regularized Mahalanobis distance. Use of the Genetic Algorithm in both cases allows the clustering algorithm to better avoid local optima and robustifies it against initialization.
Perhaps most conceptually relevant to this article is the work by Tibshirani and Walther [15]. Like [16, 17], they focused on identifying stable clusters. Their approach used iterated k-fold cross-validation to create a hybrid unsupervised and supervised clustering prediction technique.
In this work, we propose a new variant called Augmented kmeans, in which each iteration is augmented by performing logistic regression. Hence, our algorithm joins unsupervised and supervised clustering. The cluster-belonging probabilities output by the regression are used to exclude some observations from the reestimation of the cluster means. While this certainly adds computation time, the augmented algorithm tends to classify observations into known clusters more accurately, and often converges in fewer iterations.
For data with homogeneous, nonoverlapping clusters and little scatter, Augmented kmeans should require more time and iterations than kmeans to converge to a solution, assuming the clusters are seeded well (as in [2, 4]). However, in our experience, real datasets for which clustering is needed are often generated by diffuse, heterogeneous, nonspherical, and highly overlapped populations. Hence, Augmented kmeans is a practical addition to the current set of clustering methodologies.
In the interest of reproducible and open research, the Augmented kmeans algorithm is implemented in Python using the scientific computing package numpy and the machine learning library scikit-learn [14]. Along with the code for running the Monte Carlo experiments, it will be available with this report.
2 Augmented kmeans
The kmeans algorithm typically starts with initial cluster means (alternatively, it can start by classifying each observation into clusters, but this is a trivial difference). From here, it iterates between assigning observations to their closest clusters and recomputing the cluster means. The notation we use is:

$x_i$: the $i$th observation vector, $i = 1, \ldots, n$

$c_i$: the cluster assignment of the $i$th observation, $c_i \in \{1, \ldots, K\}$

$I(c_i = k)$: returns $1$ if $c_i = k$, and returns $0$ otherwise

$\mu_k$: the mean vector of the $k$th cluster, $k = 1, \ldots, K$

$\varepsilon$: convergence criterion for the sequential difference in total sum of squared distances
After generating initial cluster means (we use the initialization from kmeans++ [2]), the kmeans algorithm is:

(i) For each observation $i$ and cluster $k$, compute the squared Euclidean distance to the mean $\mu_k$:
(1) $d_{ik} = \lVert x_i - \mu_k \rVert^2$
(ii) Assign each observation to its closest cluster:
(2) $c_i = \arg\min_k \, d_{ik}$
(iii) Recompute the cluster means:
(3) $\mu_k = \sum_{i=1}^{n} I(c_i = k)\, x_i \,\big/\, \sum_{i=1}^{n} I(c_i = k)$
(iv) Compute the total sum of squared distances:
(4) $D = \sum_{i=1}^{n} \sum_{k=1}^{K} I(c_i = k)\, d_{ik}$
(v) Measure the change in the total sum of squared distances from the previous iteration (only from the second iteration onward), and compare it against $\varepsilon$. If $|D_t - D_{t-1}| < \varepsilon$, exit. Otherwise, return to step (i).
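The steps above can be sketched in a few lines of numpy. This is a minimal illustration, not the report's released implementation; the function and variable names are ours, and the initial means are assumed to come from an external seeding routine such as kmeans++:

```python
import numpy as np

def kmeans(X, K, mu, eps=1e-6, max_iter=300):
    """Plain kmeans. X is (n, p); mu is a (K, p) array of initial means."""
    mu = np.array(mu, dtype=float)  # work on a copy of the seed means
    D_prev = np.inf
    for _ in range(max_iter):
        # (i) squared Euclidean distance of every observation to every mean
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (n, K)
        # (ii) assign each observation to its closest cluster
        c = d.argmin(axis=1)
        # (iii) recompute the cluster means
        for k in range(K):
            members = X[c == k]
            if len(members):
                mu[k] = members.mean(axis=0)
        # (iv) total sum of squared distances at the current assignment
        D = d[np.arange(len(X)), c].sum()
        # (v) stop when the sequential change falls below eps
        if abs(D_prev - D) < eps:
            break
        D_prev = D
    return c, mu
```

On a toy dataset with two well-separated pairs of points, `kmeans(X, 2, mu0)` recovers the two pairs and their means after a few iterations.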
After step (ii), we have cluster assignments $c_1, \ldots, c_n$. If we take the stance that these are known class labels, we can use this data to formulate a supervised learning problem. Accordingly, Augmented kmeans inserts one step after (ii) and modifies (iii). This new step begins by solving the set of multinomial logistic regression equations shown below (if $K = 2$, regular binary logistic regression is used).
(5) $\log \dfrac{P(c_i = k \mid x_i)}{P(c_i = K \mid x_i)} = \beta_{0k} + \beta_k^{\top} x_i, \quad k = 1, \ldots, K - 1$
Next, for each observation $i$, it uses the estimated logistic relations to predict the probability of membership in every cluster; the probabilities are then sorted in descending order, and we denote the ordered vector by $p_{i(1)} \geq p_{i(2)} \geq \cdots \geq p_{i(K)}$. Consider a situation where $p_{i(1)}$ and $p_{i(2)}$ are nearly equal; while the highest probability is associated with the assigned cluster, it is not clear whether observation $i$ truly belongs in that cluster or in the runner-up. Since observation $i$ is almost equidistant between the two cluster means, it may not make sense to use it to recompute the mean of its assigned cluster $c_i$.
Our algorithm computes the ratio of the two largest of these probabilities:
(6) $r_i = p_{i(1)} / p_{i(2)}$
The $i$th observation is only used to recompute the mean of cluster $c_i$ if $r_i \geq \rho$. We write this condition as $A_i$, which returns $1$ if the inequality is met, and $0$ otherwise. We use $\rho = 1.5$ because it allows the algorithm to consider any observation with greater than a 60:40 split between its two most likely clusters as firmly placed in its cluster. Any belonging probability spread among the remaining clusters $p_{i(3)}, \ldots, p_{i(K)}$ (for $K > 2$) only makes the algorithm more lenient. The full Augmented kmeans algorithm is:
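The exclusion rule reduces to a simple vectorized computation. In this sketch (names are ours), `probs` stands for the $(n, K)$ array of predicted cluster-belonging probabilities from the logistic regression, and the default threshold matches the 60:40 split discussed above:

```python
import numpy as np

def exclusion_mask(probs, rho=1.5):
    """A_i = 1 iff the ratio of the two largest belonging probabilities
    is at least rho; a 60:40 split corresponds to a ratio of 1.5."""
    ordered = np.sort(probs, axis=1)[:, ::-1]  # each row in descending order
    r = ordered[:, 0] / ordered[:, 1]          # ratio of the two largest
    return r >= rho                            # boolean A_i per observation
```

For example, an observation with probabilities (0.7, 0.3) is kept, while one with (0.55, 0.45) is excluded from the mean update.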

(i) For each observation $i$ and cluster $k$, compute the squared Euclidean distance to the mean $\mu_k$:
(7) $d_{ik} = \lVert x_i - \mu_k \rVert^2$
(ii) Assign each observation to its closest cluster:
(8) $c_i = \arg\min_k \, d_{ik}$
(iii) Perform logistic regression, with $x_i$ as the independent data and $c_i$ as the known class labels; then compute the cluster-belonging probabilities, order them in descending order, compute the ratio of the two largest probabilities $r_i = p_{i(1)} / p_{i(2)}$, and set $A_i = 1$ if $r_i \geq \rho$ and $A_i = 0$ otherwise.
(iv) Recompute the cluster means:
(9) $\mu_k = \sum_{i=1}^{n} A_i \, I(c_i = k)\, x_i \,\big/\, \sum_{i=1}^{n} A_i \, I(c_i = k)$
(v) Compute the total sum of squared distances:
(10) $D = \sum_{i=1}^{n} \sum_{k=1}^{K} I(c_i = k)\, d_{ik}$
(vi) Measure the change in the total sum of squared distances from the previous iteration, and compare it against $\varepsilon$. If $|D_t - D_{t-1}| < \varepsilon$, exit. Otherwise, return to step (i).
Note that there is no $A_i$ in step (v); the sequence of total sums of squared distances does not take into account the additional information generated by the logistic regression. This computation is left unmodified so as to retain the algorithm's property of monotonic convergence.
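The full loop can be sketched using scikit-learn's logistic regression for step (iii). This is an illustrative sketch under our own naming, not the report's released code; `LogisticRegression` handles the multinomial case and falls back to binary logistic regression when only two clusters are present, matching the footnoted remark above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def augmented_kmeans(X, K, mu, rho=1.5, eps=1e-6, max_iter=300):
    """Sketch of Augmented kmeans. mu is a (K, p) array of initial means."""
    mu = np.array(mu, dtype=float)
    D_prev = np.inf
    for _ in range(max_iter):
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # step (i)
        c = d.argmin(axis=1)                                     # step (ii)
        # step (iii): logistic regression on the current labels, then A_i
        if len(np.unique(c)) > 1:
            probs = LogisticRegression(max_iter=1000).fit(X, c).predict_proba(X)
            ordered = np.sort(probs, axis=1)[:, ::-1]
            A = ordered[:, 0] / ordered[:, 1] >= rho  # firmly placed obs. only
        else:
            A = np.ones(len(X), dtype=bool)
        # step (iv): recompute each mean from its firmly placed members
        for k in range(K):
            members = X[(c == k) & A]
            if len(members):
                mu[k] = members.mean(axis=0)
        # step (v): total SSD over ALL observations -- note: no A_i here
        D = d[np.arange(len(X)), c].sum()
        # step (vi): convergence check on the sequential change
        if abs(D_prev - D) < eps:
            break
        D_prev = D
    return c, mu
```

Calling it the same way as plain kmeans, with kmeans++ seed means, makes the two algorithms directly comparable from an identical initial state, as in the experiments of Section 3.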
When Augmented kmeans converges to a solution, every observation is classed into a cluster. However, the knowledge that certain observations had $A_i = 0$ is retained. Hence, in answer to the concerns addressed in [13, 16, 17], we could instead mark these observations as scatter.
The two panes of Figure 1, generated from simulated data with overlapping clusters, help explain why Augmented kmeans is an improvement.
The solid blue circles in each pane show how the mean of cluster 0 changed as the algorithm iterated, and the solid green triangles show the same for cluster 2. The initial estimated cluster means from kmeans++ are indicated with the "Init 0" / "Init 2" text, and the true cluster means are annotated with "True 0" / "True 2". In both panes, the datapoints are the small points, and those circled in red are the observations excluded by the augmentation from updating the cluster means.
In Figure 1(a), we see that the cluster means kept moving further away from the true centers. In Figure 1(b), however, we see that they stopped moving away after only a handful of iterations. The predominance of excluded observations in the upper right corner of the plots shows the reason: with kmeans, these observations pulled the mean for cluster 0 in that direction. While not shown (to keep the plot legible), there is an observation in the lower left corner which acted to pull the mean for cluster 2 in that direction. In each pane of Figure 1, the distances between the final and true cluster means are written near the true means. For both clusters, the final means computed by Augmented kmeans are closer. The benefit obtained by the augmentation clearly depends on the homogeneity and scatter in the data, as well as on the number of clusters and their spacing.
3 Numerical Results
Here we show comparative results for several datasets with known clustering structures. In both real data examples, we executed 1,000 replications of the algorithms, using kmeans++ initialization. In each replication, both kmeans and Augmented kmeans began with the same initial state.
3.1 Simulated Data
We begin by demonstrating the performance on a simulated bivariate dataset with observations from overlapping clusters. As can be seen in Figure 2, there are many observations that we can expect not to be firmly placed in a specific cluster. Instead of 1,000 replications, we ran 5,000, since the data were simulated. Each time, the same data with the same initial means were used for both algorithms. As shown in Table 1, Augmented kmeans correctly classified more observations than kmeans of the time, and it only underperformed in of the replications. When it did outperform kmeans, the classification gain was slightly more than . Augmented kmeans converged in fewer iterations in approximately half the replications; when it converged faster, it required fewer iterations on average. Averaged over all 5,000 replications, Augmented kmeans only needed to run s longer.
3.2 Iris Data
We continue with Fisher's iris data. This dataset consists of four flower characteristics: petal length, petal width, sepal length, and sepal width. There are three groups: 50 observations each from the varieties Iris Setosa, Iris Versicolor, and Iris Virginica. The comparative results for the Iris data are shown in Table 2. In of the simulations, Augmented kmeans correctly classified more observations, with an average improvement of . For this dataset, the augmented algorithm tended to require more iterations; in the of simulations in which it converged faster, Augmented kmeans required on average 4.6 fewer iterations. For the rare simulations in which the classification performance was worse, the average shortfall relative to kmeans was only a single observation.
3.3 Wine Composition Data
Our final example is the wine recognition dataset of M. Forina et al., used in [1]. These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the values of 13 characteristics of each wine. The variables are shown in Table 3.
For the wine data, Augmented kmeans outperformed kmeans in of the simulations, regarding classification, and regarding iteration count, as can be seen in Table 4. While the improvement in classification performance was slight, the average additional computation time required by Augmented kmeans was only s. When it needed more iterations than kmeans, the excess was less than 2 iterations on average.
4 Concluding Remarks
In this technical report, we've developed a new clustering algorithm, called Augmented kmeans, that combines unsupervised clustering with kmeans and supervised clustering with logistic regression. In each iteration, we use the group membership probabilities from logistic regression to exclude observations from the recomputation of the cluster means in kmeans. This allows each cluster to form without being influenced by observations which don't firmly belong. We have demonstrated the advantages of Augmented kmeans on both simulated and real datasets. The augmentation frequently leads to better classification performance and/or faster convergence.
It is true that our results demonstrate minimal incremental performance improvement over kmeans++, which could be seen as a reason to forego publication. However, we feel that our hybrid unsupervised + supervised clustering approach is sufficiently innovative to justify publication. Additional work around this innovation will only be accelerated by sharing the idea openly in the research community.
Further research with Augmented kmeans could go in a few directions. The most obvious would be to augment kmeans with a different supervised learning procedure, such as discriminant analysis (linear, quadratic, or kernel) or artificial neural networks. Of course, with both of these procedures, the researcher has several subjective parameterization decisions to make. Also, neither directly produces cluster-belonging probabilities, so some other model output would need to be used in their place. It could also be worthwhile to include a feature selection procedure in the logistic regression step of our algorithm. While this would require more CPU time, modeling the predictive power in an optimal subset should cause greedier exclusion of observations, which may further improve performance.
References
 Aeberhard et al. [1992] Aeberhard, S., Coomans, D., de Vel, O., 1992. Comparison of Classifiers in High Dimensional Settings. Tech. Rep. 92-02, Dept. of Computer Science and Dept. of Mathematics and Statistics, James Cook University of North Queensland.

 Arthur and Vassilvitskii [2007] Arthur, D., Vassilvitskii, S., 2007. K-means++: The Advantages of Careful Seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. SODA '07. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, pp. 1027–1035. URL http://dl.acm.org/citation.cfm?id=1283383.1283494

 Berkhin [2002] Berkhin, P., 2002. Survey of Clustering Data Mining Techniques. Tech. rep., Accrue Software.
 Bozdogan [1983] Bozdogan, H., June 1983. Determining the Number of Component Clusters in the Standard Multivariate Normal Mixture Model Using Model-Selection Criteria. Tech. Rep. UIC/DQM/A83-1, University of Illinois at Chicago, Quantitative Methods Department, Illinois, ARO Contract DAAG29-82-K-0155.
 Cuesta-Albertos et al. [1997] Cuesta-Albertos, J. A., Gordaliza, A., Matrán, C., 1997. Trimmed k-Means: An Attempt to Robustify Quantizers. Annals of Statistics 25 (2), 553–576.
 García-Escudero et al. [2009] García-Escudero, L. A., Gordaliza, A., Martín, R. S., Zamar, R., 2009. Robust Linear Clustering. Journal of the Royal Statistical Society, Series B (Statistical Methodology) 71 (1), 301–318.
 García-Escudero et al. [2008] García-Escudero, L. A., Gordaliza, A., Matrán, C., Mayo-Iscar, A., 2008. A General Trimming Approach to Robust Cluster Analysis. Annals of Statistics 36 (3), 1324–1345.
 Gordaliza [1991] Gordaliza, A., 1991. Best Approximations to Random Variables Based on Trimming Procedures. Journal of Approximation Theory 64, 162–180.
 Holland [1975] Holland, J. H., 1975. Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence. The University of Michigan Press, Ann Arbor, USA.
 Holland [1992] Holland, J. H., 1992. Genetic Algorithms. Scientific American 267, 66–72.
 Krishna and Murty [1999] Krishna, K., Murty, M., 1999. Genetic K-Means Algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 29 (3), 433–439.
 MacQueen [1967] MacQueen, J., 1967. Some Methods for Classification and Analysis of Multivariate Observations. In: Cam, L. M. L., Neyman, J. (Eds.), Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability. Vol. 1. University of California, Berkeley, Berkeley, USA, pp. 281–297.
 Maitra and Ramler [2009] Maitra, R., Ramler, I. P., 2009. Clustering in the Presence of Scatter. Biometrics 65 (2), 341–352.
 Pedregosa et al. [2011] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E., 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12, 2825–2830.
 Tibshirani and Walther [2005] Tibshirani, R., Walther, G., 2005. Cluster Validation by Prediction Strength. Journal of Computational and Graphical Statistics 14 (3), 511–528.
 Tseng and Wong [2005] Tseng, G. C., Wong, W. H., 2005. Tight Clustering: A ResamplingBased Approach for Identifying Stable and Tight Patterns in Data. Biometrics 61 (1), 10–16.
 Wong [1982] Wong, M. A., 1982. A Hybrid Clustering Method for Identifying High-Density Clusters. Journal of the American Statistical Association 77 (380), 841–847.