Improved Clustering with Augmented k-means

J. Andrew Howe, ahowe42@gmail.com
Independent Researcher, Riyadh, Saudi Arabia
Abstract

Identifying a set of homogeneous clusters in a heterogeneous dataset is one of the most important classes of problems in statistical modeling. In the realm of unsupervised partitional clustering, k-means is a very important algorithm for this. In this technical report, we develop a new k-means variant called Augmented k-means, which is a hybrid of k-means and logistic regression. During each iteration, logistic regression is used to predict the current cluster labels, and the cluster belonging probabilities are used to control the subsequent re-estimation of cluster means. Observations which can’t be firmly identified into clusters are excluded from the re-estimation step. This can be valuable when the data exhibit many characteristics of real datasets such as heterogeneity, non-sphericity, substantial overlap, and high scatter. Augmented k-means frequently outperforms k-means by more accurately classifying observations into known clusters and / or converging in fewer iterations. We demonstrate this on both simulated and real datasets. Our algorithm is implemented in Python and will be available with this report.


Keywords: Unsupervised Learning, Clustering, k-means

1 Introduction

Clustering $n$ datapoints in $p$ dimensions into $K$ distinct clusters is a very old problem in statistical modeling. The k-means algorithm, introduced by MacQueen [12], is an important method (as of a 2002 survey [3]) to do this. Fundamentally, the k-means algorithm iterates through a set of steps in an attempt to minimize the sum of squared distances within all clusters. Its popularity is probably due to this inherent simplicity, as opposed to its perfection. Indeed, as mentioned in Krishna and Murty [11], the k-means algorithm exhibits a strong tendency to converge to suboptimal local minima, and is not robust to the initial state, so the same dataset can be clustered in different ways. Many researchers have proposed variations, of differing degrees, on the underlying theme of iteratively minimizing the total sum of squared distances.

Wong [17] developed a hybrid clustering algorithm using both k-means and single-linkage hierarchical clustering, with the specific goal of identifying high-density modal regions. With a similar goal of finding tight, stable clusters, Tseng and Wong [16] used a re-sampling method with a merged k-means and truncated hierarchical clustering algorithm. To combat the same issue of scattered observations which truly don’t belong in any cluster, Maitra and Ramler [13] proposed a new algorithm which iteratively builds each homogeneous spherical cluster around a core, so that scattered observations are explicitly excluded.

Several authors have used trimming in clustering with k-means, in which certain observations are removed (or trimmed). The goal of trimming, originally proposed by Gordaliza [8], is to identify clusters robustly with respect to outliers. Cuesta-Albertos et al. used a data-based trimming method optimized by simulated annealing [5]. García-Escudero et al. [7] introduced a trimmed clustering algorithm, constrained by the ratio of maximum to minimum eigenvalues from the within-cluster scatter matrices. García-Escudero et al. [6] further extended robust clustering with trimmed k-means by modeling linear patterns in the data around which clusters formed.

Bozdogan [4] proposed an initialization method that initializes the clusters so that they are evenly spaced throughout the data. Arthur & Vassilvitskii [2] proposed a modified algorithm called k-means++, wherein cluster centers are spaced around each other following an iterative distance-weighting scheme.
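The distance-weighted seeding idea behind k-means++ can be sketched as follows. This is a minimal illustration that assumes the usual squared-distance weighting described in [2]; the function name `kmeanspp_init` and its `rng` argument are hypothetical and are not taken from the code accompanying this report.

```python
import numpy as np

def kmeanspp_init(X, n_clusters, rng=None):
    """Pick initial cluster means by distance-weighted sampling (sketch)."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    # First center: an observation chosen uniformly at random.
    centers = [X[rng.integers(n)]]
    # Squared distance from each observation to its nearest chosen center.
    d2 = np.sum((X - centers[0]) ** 2, axis=1)
    for _ in range(1, n_clusters):
        # Sample the next center with probability proportional to d2.
        # (Assumes the data are not all identical, so d2.sum() > 0.)
        idx = rng.choice(n, p=d2 / d2.sum())
        centers.append(X[idx])
        d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))
    return np.array(centers)
```

Because already-chosen observations have zero weight, later centers tend to land far from earlier ones, which is what spreads the seeds across the data.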

Krishna and Murty [11] created a hybrid algorithm called Genetic k-means based on the Genetic Algorithm of Holland [9, 10]. Along similar lines, Song et al. created their GARM algorithm, which computes the cluster distances using a regularized Mahalanobis distance. Use of the Genetic Algorithm in both cases allows the clustering algorithm to better avoid local optima and robustifies it against initialization.

Perhaps most conceptually relevant to this article is the work by Tibshirani and Walther [15]. Like [16, 17], they focused on identifying stable clusters. Their approach used iterated k-fold cross-validation to create a hybrid unsupervised and supervised clustering prediction technique.

In this work, we propose a new variant called Augmented k-means in which each iteration is augmented by performing logistic regression. Hence, our algorithm joins unsupervised and supervised clustering. The cluster-belonging probabilities output by the regression are used to exclude some observations from being used to re-estimate the cluster means. While this certainly adds computation time, the augmented algorithm tends to more accurately classify the observations into known clusters, and often converges in fewer iterations.

For data with homogeneous, non-overlapping clusters and little scatter, Augmented k-means should require more time and iterations than k-means to converge to a solution, assuming the clusters are seeded well (as in [2, 4]). However, in our experience, real datasets for which clustering is needed are often generated by diffuse, heterogeneous, non-spherical, and highly overlapped populations. Hence, Augmented k-means is a practical addition to the current set of clustering methodologies.

In the interest of reproducible and open research, the Augmented k-means algorithm is implemented in Python using the scientific computing package numpy and the machine learning library scikit-learn [14]. Along with the code for running the Monte Carlo experiments, it will be available with this report.

Our new algorithm is detailed in Section 2, followed by numerical results on both simulated and real datasets in Section 3. We finish with some concluding remarks in Section 4.

2 Augmented k-means

The k-means algorithm typically starts with initial cluster means. (Alternatively, it can start by classifying each observation into clusters, but this is a trivial difference.) From here, it iterates between assigning observations to their closest cluster and recomputing the cluster means. The notation we use is:

  • $x_i$: the $i$th observation vector, $i = 1, \ldots, n$,

  • $c_i$: the cluster assignment of the $i$th observation, $c_i \in \{1, \ldots, K\}$,

  • $\delta(a, b)$: returns $1$ if $a = b$, and returns $0$ otherwise,

  • $\mu_k$: the mean vector of the $k$th cluster, $k = 1, \ldots, K$,

  • $\epsilon$: convergence criterion for the sequential difference in the total sum of squared distances.

After generating initial cluster means (we use the initialization from k-means++ [2]), the k-means algorithm is:

  (i) For each observation $x_i$ and cluster $k$, compute the squared Euclidean distance to the mean $\mu_k$:

    $d_{ik} = \lVert x_i - \mu_k \rVert^2$    (1)
  (ii) Assign each observation to its closest cluster:

    $c_i = \arg\min_{k} d_{ik}$    (2)
  (iii) Recompute the cluster means:

    $\mu_k = \frac{\sum_{i=1}^{n} \delta(c_i, k)\, x_i}{\sum_{i=1}^{n} \delta(c_i, k)}$    (3)
  (iv) Compute the total sum of squared distances:

    $S = \sum_{i=1}^{n} \sum_{k=1}^{K} \delta(c_i, k)\, \lVert x_i - \mu_k \rVert^2$    (4)
  (v) Measure the change in the total sum of squared distances from the previous iteration (only starting from the second iteration), and compare against $\epsilon$. If $\lvert S^{(t)} - S^{(t-1)} \rvert < \epsilon$, where $t$ indexes the iteration, exit. Otherwise, return to step (i).
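The five steps above can be sketched in numpy as follows. This is an illustrative sketch rather than the implementation released with the report: the function name `kmeans`, the `max_iter` safeguard, and the choice to leave an empty cluster's mean unchanged are all assumptions made here.

```python
import numpy as np

def kmeans(X, means, eps=1e-6, max_iter=300):
    """Plain k-means following steps (i)-(v); returns labels, means, SSD history."""
    means = np.array(means, dtype=float)
    history = []
    for _ in range(max_iter):
        # (i) squared Euclidean distance of every observation to every mean
        d = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        # (ii) assign each observation to its closest cluster
        labels = d.argmin(axis=1)
        # (iii) recompute each mean; an empty cluster keeps its old mean (assumption)
        for k in range(means.shape[0]):
            members = labels == k
            if members.any():
                means[k] = X[members].mean(axis=0)
        # (iv) total sum of squared distances to the assigned means
        ssd = ((X - means[labels]) ** 2).sum()
        history.append(ssd)
        # (v) from the second iteration on, stop once the change falls below eps
        if len(history) > 1 and abs(history[-2] - ssd) < eps:
            break
    return labels, means, history
```

The initial `means` would typically come from a seeding routine such as k-means++; here clusters are indexed from 0 to match numpy conventions.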

After step (ii), we have cluster assignments $c_i$. If we take the stance that they are known class labels, we can use these data to formulate a supervised learning problem. Accordingly, Augmented k-means inserts one step after (ii) and modifies (iii). This new step begins by solving the set of multinomial logistic regression equations shown below (if $K = 2$, regular binary logistic regression is used).

$\log \frac{P(c_i = k \mid x_i)}{P(c_i = K \mid x_i)} = \beta_{0k} + \beta_k^{\top} x_i, \quad k = 1, \ldots, K - 1$    (5)

Next, for each observation, it uses the estimated logistic relations to predict the probability of cluster membership in all $K$ clusters, and the probabilities are then sorted in descending order; the vector of ordered probabilities can be indicated as $\tilde{p}_i = (\tilde{p}_{i1}, \ldots, \tilde{p}_{iK})$. Consider a situation where $\tilde{p}_{i1} \approx \tilde{p}_{i2}$; while the highest probability is associated with one cluster, it’s not clear whether observation $i$ truly belongs in that cluster or in the one with the second-highest probability. Since observation $i$ is clearly almost equidistant between both cluster means, it may not make sense to use it to recompute the mean of its assigned cluster $c_i$.

Our algorithm computes the ratio of the two largest of these probabilities:

$r_i = \tilde{p}_{i1} / \tilde{p}_{i2}$    (6)

The $i$th observation is only used to recompute the mean of cluster $c_i$ if $r_i > 1.5$. We can annotate this condition as $\gamma(r_i)$, which returns $1$ if the inequality is met, and $0$ otherwise. We use $1.5$ because it allows the algorithm to consider any observation with greater than a 60:40 split between the two most likely clusters as being firmly placed in its cluster. Any belonging probability spread out among the remaining clusters (for $K > 2$) only makes the algorithm more lenient. After augmentation in this manner, the full Augmented k-means algorithm is:

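  (i) For each observation $x_i$ and cluster $k$, compute the squared Euclidean distance to the mean $\mu_k$:

    $d_{ik} = \lVert x_i - \mu_k \rVert^2$    (7)
  (ii) Assign each observation to its closest cluster:

    $c_i = \arg\min_{k} d_{ik}$    (8)
  (iii) Perform logistic regression, with $x_i$ as the independent data and $c_i$ as the known class labels, then compute the cluster belonging probabilities, sort them in descending order, and compute the ratio of the two largest probabilities, $r_i$.

  (iv) Recompute the cluster means, using only the observations for which the indicator $\gamma(r_i)$ equals $1$ (that is, $r_i > 1.5$):

    $\mu_k = \frac{\sum_{i=1}^{n} \gamma(r_i)\, \delta(c_i, k)\, x_i}{\sum_{i=1}^{n} \gamma(r_i)\, \delta(c_i, k)}$    (9)
  (v) Compute the total sum of squared distances:

    $S = \sum_{i=1}^{n} \sum_{k=1}^{K} \delta(c_i, k)\, \lVert x_i - \mu_k \rVert^2$    (10)
  (vi) Measure the change in the total sum of squared distances from the previous iteration, and compare against $\epsilon$. If $\lvert S^{(t)} - S^{(t-1)} \rvert < \epsilon$, where $t$ indexes the iteration, exit. Otherwise, return to step (i).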

Note that the inclusion indicator $\gamma(r_i)$ does not appear in step (v); the sequence of total sums of squared distances does not take into account the additional information generated by the logistic regression. This computation is left unchanged so as to retain the algorithm’s property of monotonic convergence.
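The augmented iteration can be sketched with scikit-learn as follows. This is a hedged illustration, not the released implementation: the function name `augmented_kmeans`, the `max_iter` safeguard, the fallback to using every observation when only one cluster is populated, and the use of `LogisticRegression` with its default L2 penalty are all assumptions made here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def augmented_kmeans(X, means, threshold=1.5, eps=1e-6, max_iter=300):
    """Sketch of Augmented k-means; returns labels, means, keep mask, SSD history."""
    means = np.array(means, dtype=float)
    n, K = X.shape[0], means.shape[0]
    history = []
    for _ in range(max_iter):
        # (i)-(ii) distances to every mean, then closest-cluster assignment
        d = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # (iii) logistic regression on the current labels, then the ratio of
        # the two largest membership probabilities for each observation
        keep = np.ones(n, dtype=bool)
        if np.unique(labels).size > 1:  # a classifier needs at least two classes
            model = LogisticRegression(max_iter=1000).fit(X, labels)
            proba = np.sort(model.predict_proba(X), axis=1)[:, ::-1]
            ratio = proba[:, 0] / np.maximum(proba[:, 1], 1e-300)
            keep = ratio > threshold
        # (iv) recompute means from firmly placed observations only
        for k in range(K):
            members = (labels == k) & keep
            if members.any():
                means[k] = X[members].mean(axis=0)
        # (v) total sum of squared distances over ALL observations
        ssd = ((X - means[labels]) ** 2).sum()
        history.append(ssd)
        # (vi) stop once the change in SSD falls below eps
        if len(history) > 1 and abs(history[-2] - ssd) < eps:
            break
    return labels, means, keep, history
```

The returned `keep` mask records which observations were firmly placed in the final iteration. Note that the regularization applied by scikit-learn's default settings shrinks the fitted coefficients, which changes the probability ratios somewhat relative to an unpenalized fit.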

When Augmented k-means converges to a solution, every observation is classed into a cluster. However, the knowledge that certain observations had a probability ratio $r_i$ at or below the threshold of $1.5$ is retained. Hence, in answer to the concerns addressed in [13, 16, 17], we could instead mark these observations as scatter.
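One way to mark such observations after convergence is sketched below. The helper name `flag_scatter` is hypothetical, and refitting a scikit-learn `LogisticRegression` (default settings) on the final labels is an assumption about how the ratios would be obtained.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def flag_scatter(X, labels, threshold=1.5):
    """Boolean mask: True where the top-two probability ratio is <= threshold."""
    model = LogisticRegression(max_iter=1000).fit(X, labels)
    # Sort each row of membership probabilities in descending order.
    proba = np.sort(model.predict_proba(X), axis=1)[:, ::-1]
    ratio = proba[:, 0] / np.maximum(proba[:, 1], 1e-300)
    return ratio <= threshold
```

A caller could then overwrite the labels of flagged observations with a sentinel such as `-1`, for example `labels_with_scatter = np.where(flag_scatter(X, labels), -1, labels)`; the sentinel value is likewise only a convention chosen for this sketch.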

The two panes of Figure 1 - generated from simulated data with overlapping clusters - help explain why Augmenting k-means is an improvement.

(a) k-means: Class. Rate: 68.7%, Iterations: 18
(b) Augmented k-means: Class. Rate: 74.7%, Iterations: 7
Figure 1: Evolution of Two Cluster Means, Contrasting k-means with Augmented k-means.

The solid blue circles in each pane show how the mean of cluster 0 changed as the algorithm iterated, and the solid green triangles show the same for cluster 2. The initial estimated cluster means from k-means++ are indicated with the “Init 0” / “Init 2” text, and the true cluster means are annotated with “True 0” / “True 2”. In both panes, the datapoints are the small points, and those circled in red are the observations excluded by the augmentation from updating the cluster means.

In Figure 1(a), we see that the cluster means kept moving further away from the true centers. In Figure 1(b), however, we see that they stopped moving away after only a handful of iterations. The predominance of excluded observations in the upper right corner of the plots shows the reason. With k-means, these observations pulled the mean for cluster 0 in that direction. Although it is omitted to keep the plot legible, there is an observation in the lower left corner which acted to pull the mean for cluster 2 in that direction. In each pane of Figure 1, the distances between the final and true cluster means are written near the true means. For both clusters, the final means computed by Augmented k-means are closer. It should be clear that the benefit obtained by the augmentation is very dependent on the homogeneity and scatter in the data, as well as the number of clusters and their spacing.

3 Numerical Results

Here we show comparative results for several datasets with known clustering structures. In both real data examples, we executed 1,000 replications of the algorithms, using k-means++ initialization. In each replication, both k-means and Augmented k-means began with the same initial state.
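Comparing cluster assignments with known classes requires matching arbitrary cluster indices to class labels. The report does not spell out how this matching is done, so the sketch below is only one plausible approach: it assumes a one-to-one matching that maximizes agreement, found with SciPy's `linear_sum_assignment`. The function name `classification_rate` is hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def classification_rate(true_labels, cluster_labels):
    """Fraction correctly classified under the best cluster-to-class matching."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    classes = np.unique(true_labels)
    clusters = np.unique(cluster_labels)
    # Contingency table: rows are true classes, columns are clusters.
    table = np.array([[np.sum((true_labels == a) & (cluster_labels == b))
                       for b in clusters] for a in classes])
    # One-to-one matching of clusters to classes that maximizes agreement.
    rows, cols = linear_sum_assignment(table, maximize=True)
    return table[rows, cols].sum() / true_labels.size
```

With such a function, each replication can score both algorithms on the same data and tally how often one rate exceeds the other.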

3.1 Simulated Data

Figure 2: Simulated Bivariate Data with Four Clusters

We begin by demonstrating the performance on a simulated bivariate dataset with observations from four overlapping clusters. As can be seen in Figure 2, there are a lot of observations that we can expect to not be firmly placed in a specific cluster. Instead of 1,000 replications, we ran 5,000, since the data were simulated. Each time, the same data with the same initial means were used for both algorithms. As shown in Table 1, Augmented k-means correctly classified more observations than k-means of the time, and it only under-performed in of the replications. When it did outperform k-means, the classification gain was slightly more than . Augmented k-means converged in fewer iterations in approximately half the replications; when it converged faster, it required 4.86 fewer iterations on average. Averaged over all 5,000 replications, Augmented k-means only needed to run s longer.

| | Correct Class. Rate | Number Iterations |
|---|---|---|
| Better | | |
| Better or Equal | | |
| Average Improvement When Augmented k-means is Better | | 4.86 |

Table 1: Simulated Data: Frequency with which Augmented k-means Performance, Relative to k-means is:

3.2 Iris Data

We continue with Fisher’s iris data. This dataset consists of four flower characteristics: petal length, petal width, sepal length, and sepal width. There are three groups: 50 observations each from the varieties Iris Setosa, Iris Versicolor, and Iris Virginica. The comparative results for the Iris data are shown in Table 2. In of the simulations, Augmented k-means correctly classified more observations, with an average improvement of . For this dataset, the augmented algorithm tended to require more iterations; in the of simulations in which it converged faster, Augmented k-means required on average 4.6 fewer iterations. For the rare simulations in which the classification performance was worse, the average shortfall relative to k-means was only - a single observation.

| | Correct Class. Rate | Number Iterations |
|---|---|---|
| Better | | |
| Better or Equal | | |
| Average Improvement When Augmented k-means is Better | | 4.59 |

Table 2: Iris Data: Frequency with which Augmented k-means Performance, Relative to k-means is:

3.3 Wine Composition Data

Our final example is the wine recognition dataset of M. Forina, et al., used in [1]. These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars (59, 71, and 48 wines, respectively). The analysis determined the values of 13 characteristics of each wine. The variables are shown in Table 3.

| Variable | Variable |
|---|---|
| Alcohol | Non-flavonoid Phenols |
| Malic Acid | Proanthocyanins |
| Ash | Color Intensity |
| Alcalinity of Ash | Hue |
| Magnesium | OD280/OD315 of Diluted Wines |
| Total Phenols | Proline |
| Total Flavonoids | |

Table 3: Wine Data Variables.
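For readers who want to inspect these data, scikit-learn bundles a copy of the wine recognition dataset. The snippet below is a small sketch that assumes the bundled copy corresponds to the data used here; it only loads the data and checks its dimensions, and applies no preprocessing.

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()
X, y = wine.data, wine.target

# Number of wines and number of measured characteristics.
print(X.shape)            # (178, 13)
# Number of wines from each cultivar.
print(np.bincount(y))     # [59 71 48]
# Variable names as stored by scikit-learn.
print(wine.feature_names)
```

The variable names stored by scikit-learn are spelled slightly differently from those in Table 3.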

For the wine data, Augmented k-means outperformed k-means in of the simulations, regarding classification, and regarding iteration count, as can be seen in Table 4. While the improvement in classification performance was slight, the average additional computation time required by Augmented k-means was only s. When it needed more iterations than k-means, the excess was less than 2 iterations on average.

| | Correct Class. Rate | Number Iterations |
|---|---|---|
| Better | | |
| Better or Equal | | |
| Average Improvement When Augmented k-means is Better | | 4.59 |

Table 4: Wine Data: Frequency with which Augmented k-means Performance, Relative to k-means is:

4 Concluding Remarks

In this technical report, we’ve developed a new clustering algorithm, called Augmented k-means, that combines unsupervised clustering with k-means and supervised clustering with logistic regression. In each iteration, we use the group membership probabilities from logistic regression to exclude observations used to recompute the cluster means in k-means. This allows each cluster to form without being influenced by observations which don’t firmly belong. We have demonstrated the advantages of Augmented k-means on both simulated and real datasets. The augmentation frequently leads to better classification performance and / or faster convergence.

It is true that our results demonstrate minimal incremental performance improvement over k-means++, which could be seen as a reason to forgo publication. However, we feel that our hybrid unsupervised + supervised clustering approach is sufficiently innovative to justify publication. Additional work around this innovation will only be accelerated by sharing the idea openly in the research community.

Further research with Augmented k-means could go in a few directions. The most obvious would be to attempt to augment k-means with a different supervised learning procedure, such as discriminant analysis (linear, quadratic, or kernel) or artificial neural networks. Of course, with both these procedures, the researcher has several subjective parameterization decisions to make. Also, neither produces the cluster belonging probabilities, so some other model output would need to be used in their place. It could be worthwhile to include a feature selection procedure in the logistic regression step in our algorithm. While this would require more CPU time, modeling the predictive power in an optimal subset should cause more greedy exclusion of observations, which may further improve performance.

References

  [1] Aeberhard, S., Coomans, D., de Vel, O., 1992. Comparison of Classifiers in High Dimensional Settings. Tech. Rep. 92-02, Dept. of Computer Science and Dept. of Mathematics and Statistics, James Cook University of North Queensland.
  [2] Arthur, D., Vassilvitskii, S., 2007. K-means++: The Advantages of Careful Seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. SODA ’07. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, pp. 1027–1035.
    URL http://dl.acm.org/citation.cfm?id=1283383.1283494
  [3] Berkhin, P., 2002. Survey of Clustering Data Mining Techniques. Tech. rep., Accrue Software.
  [4] Bozdogan, H., June 1983. Determining the Number of Component Clusters in the Standard Multivariate Normal Mixture Model Using Model-Selection Criteria. Tech. Rep. UIC/DQM/A83-1, University of Illinois at Chicago, Quantitative Methods Department, Illinois, ARO Contract DAAG29-82-K-0155.
  [5] Cuesta-Albertos, J. A., Gordaliza, A., Matrán, C., 1997. Trimmed k-Means: An Attempt to Robustify Quantizers. Annals of Statistics 25 (2), 553–576.
  [6] García-Escudero, L. A., Gordaliza, A., Martín, R. S., Zamar, R., 2009. Robust Linear Clustering. Journal of the Royal Statistical Society, Series B (Statistical Methodology) 71 (1), 301–318.
  [7] García-Escudero, L. A., Gordaliza, A., Matrán, C., Mayo-Iscar, A., 2008. A General Trimming Approach to Robust Cluster Analysis. Annals of Statistics 36 (3), 1324–1345.
  [8] Gordaliza, A., 1991. Best Approximations to Random Variables Based on Trimming Procedures. Journal of Approximation Theory 64, 162–180.
  [9] Holland, J. H., 1975. Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence. The University of Michigan Press, Ann Arbor, USA.
  [10] Holland, J. H., 1992. Genetic Algorithms. Scientific American 267, 66–72.
  [11] Krishna, K., Murty, M., 1999. Genetic K-Means Algorithm. IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics 29 (3), 433–439.
  [12] MacQueen, J., 1967. Some Methods for Classification and Analysis of Multivariate Observations. In: Le Cam, L. M., Neyman, J. (Eds.), Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. Vol. 1. University of California, Berkeley, Berkeley, USA, pp. 281–297.
  [13] Maitra, R., Ramler, I. P., 2009. Clustering in the Presence of Scatter. Biometrics 65 (2), 341–352.
  [14] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E., 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12, 2825–2830.
  [15] Tibshirani, R., Walther, G., 2005. Cluster Validation by Prediction Strength. Journal of Computational and Graphical Statistics 14 (3), 511–528.
  [16] Tseng, G. C., Wong, W. H., 2005. Tight Clustering: A Resampling-Based Approach for Identifying Stable and Tight Patterns in Data. Biometrics 61 (1), 10–16.
  [17] Wong, M. A., 1982. A Hybrid Clustering Method for Identifying High-Density Clusters. Journal of the American Statistical Association 77 (380), 841–847.