A Comparative Study of Efficient Initialization Methods for the K-Means Clustering Algorithm
K-means is undoubtedly the most widely used partitional clustering algorithm. Unfortunately, due to its gradient descent nature, this algorithm is highly sensitive to the initial placement of the cluster centers. Numerous initialization methods have been proposed to address this problem. In this paper, we first present an overview of these methods with an emphasis on their computational efficiency. We then compare eight commonly used linear time complexity initialization methods on a large and diverse collection of data sets using various performance criteria. Finally, we analyze the experimental results using non-parametric statistical tests and provide recommendations for practitioners. We demonstrate that popular initialization methods often perform poorly and that there are in fact strong alternatives to these methods.
Clustering, the unsupervised classification of patterns into groups, is one of the most important tasks in exploratory data analysis Jain99 (). Primary goals of clustering include gaining insight into data (detecting anomalies, identifying salient features, etc.), classifying data, and compressing data. Clustering has a long and rich history in a variety of scientific disciplines including anthropology, biology, medicine, psychology, statistics, mathematics, engineering, and computer science. As a result, numerous clustering algorithms have been proposed since the early 1950s Jain10 ().
Clustering algorithms can be broadly classified into two groups: hierarchical and partitional Jain10 (). Hierarchical algorithms recursively find nested clusters either in a top-down (divisive) or bottom-up (agglomerative) fashion. In contrast, partitional algorithms find all the clusters simultaneously as a partition of the data and do not impose a hierarchical structure. Most hierarchical algorithms have quadratic or higher complexity in the number of data points Jain99 () and therefore are not suitable for large data sets, whereas partitional algorithms often have lower complexity.
Given a data set in , i.e., points (vectors) each with attributes (components), hard partitional algorithms divide into exhaustive and mutually exclusive clusters for . These algorithms usually generate clusters by optimizing a criterion function. The most intuitive and frequently used criterion function is the Sum of Squared Error (SSE) given by:
where denotes the Euclidean () norm and is the centroid of cluster whose cardinality is . The optimization of (1) is often referred to as the minimum SSE clustering (MSSC) problem.
The number of ways in which a set of objects can be partitioned into non-empty groups is given by Stirling numbers of the second kind:
which can be approximated by It can be seen that a complete enumeration of all possible clusterings to determine the global minimum of (1) is clearly computationally prohibitive except for very small data sets Kaufman90 (). In fact, this non-convex optimization problem is proven to be NP-hard even for Aloise09 () or Mahajan12 (). Consequently, various heuristics have been developed to provide approximate solutions to this problem Tarsitano03 (). Among these heuristics, Lloyd’s algorithm Lloyd82 (), often referred to as the (batch) k-means algorithm, is the simplest and most commonly used one. This algorithm starts with arbitrary centers, typically chosen uniformly at random from the data points. Each point is assigned to the nearest center and then each center is recalculated as the mean of all points assigned to it. These two steps are repeated until a predefined termination criterion is met.
The k-means algorithm is undoubtedly the most widely used partitional clustering algorithm Jain99 (); Jain10 (). Its popularity can be attributed to several reasons. First, it is conceptually simple and easy to implement. Virtually every data mining software includes an implementation of it. Second, it is versatile, i.e., almost every aspect of the algorithm (initialization, distance function, termination criterion, etc.) can be modified. This is evidenced by hundreds of publications over the last fifty years that extend k-means in various ways. Third, it has a time complexity that is linear in , , and (in general, and ). For this reason, it can be used to initialize more expensive clustering algorithms such as expectation maximization Bradley98 (), DBSCAN Dash01 (), and spectral clustering Chen11 (). Furthermore, numerous sequential Kanungo02 (); Hamerly10 () and parallel Chen10 () acceleration techniques are available in the literature. Fourth, it has a storage complexity that is linear in , , and . In addition, there exist disk-based variants that do not require all points to be stored in memory Ordonez04 (). Fifth, it is guaranteed to converge Selim84 () at a quadratic rate Bottou95 (). Finally, it is invariant to data ordering, i.e., random shufflings of the data points.
On the other hand, k-means has several significant disadvantages. First, it requires the number of clusters, , to be specified a priori. The value of this parameter can be determined automatically by means of various cluster validity measures Vendramin10 (). Second, it can only detect compact, hyperspherical clusters that are well separated. This can be alleviated by using a more general distance function such as the Mahalanobis distance, which permits the detection of hyperellipsoidal clusters Mao96 (). Third, due its utilization of the squared Euclidean distance, it is sensitive to noise and outlier points since even a few such points can significantly influence the means of their respective clusters. This can addressed by outlier pruning Zhang03 () or using a more robust distance function such as City-block () distance. Fourth, due to its gradient descent nature, it often converges to a local minimum of the criterion function Selim84 (). For the same reason, it is highly sensitive to the selection of the initial centers. Adverse effects of improper initialization include empty clusters, slower convergence, and a higher chance of getting stuck in bad local minima Celebi11 (). Fortunately, all of these drawbacks except the first one can be remedied by using an adaptive initialization method (IM).
In this study, we investigate some of the most popular IMs developed for the k-means algorithm. Our motivation is three-fold. First, a large number of IMs have been proposed in the literature and thus a systematic study that reviews and compares these methods is desirable. Second, these IMs can be used to initialize other partitional clustering algorithms such as fuzzy c-means and its variants and expectation maximization. Third, most of these IMs can be used independently of k-means as standalone clustering algorithms.
This study differs from earlier studies of a similar nature Pena99 (); He04 () in several respects: (i) a more comprehensive overview of the existing IMs is provided, (ii) the experiments involve a larger set of methods and a significantly more diverse collection of data sets, (iii) in addition to clustering effectiveness, computational efficiency is used as a performance criterion, and (iv) the experimental results are analyzed more thoroughly using non-parametric statistical tests.
2 Initialization Methods for K-Means
In this section, we briefly review some of the commonly used IMs with an emphasis on their time complexity (with respect to ). In each complexity class, methods are presented in chronologically ascending order.
2.1 Linear Time-Complexity Initialization Methods
Forgy’s method Forgy65 () assigns each point to one of the clusters uniformly at random. The centers are then given by the centroids of these initial clusters. This method has no theoretical basis, as such random clusters have no internal homogeneity Anderberg73 ().
Jancey’s method Jancey66 () assigns to each center a synthetic point randomly generated within the data space. Unless the data set fills the space, some of these centers may be quite distant from any of the points Anderberg73 (), which might lead to the formation of empty clusters.
MacQueen MacQueen67 () proposed two different methods. The first one, which is the default option in the Quick Cluster procedure of IBM SPSS Statistics Norusis11 (), takes the first points in as the centers. An obvious drawback of this method is its sensitivity to data ordering. The second method chooses the centers randomly from the data points. The rationale behind this method is that random selection is likely to pick points from dense regions, i.e., points that are good candidates to be centers. However, there is no mechanism to avoid choosing outliers or points that are too close to each other Anderberg73 (). Multiple runs of this method is the standard way of initializing k-means Bradley98 (). It should be noted that this second method is often mistakenly attributed to Forgy Forgy65 ().
Ball and Hall’s method Ball67 () takes the centroid of , i.e., , as the first center. It then traverses the points in arbitrary order and takes a point as a center if it is at least units apart from the previously selected centers until centers are obtained. The purpose of the distance threshold is to ensure that the seed points are well separated. However, it is difficult to decide on an appropriate value for . In addition, the method is sensitive to data ordering.
The Simple Cluster Seeking method Tou74 () is identical to Ball and Hall’s method with the exception that the first point in is taken as the first center. This method is used in the FASTCLUS procedure of SAS SAS09 ().
Späth’s method Spath77 () is similar to Forgy’s method with the exception that the points are assigned to the clusters in a cyclical fashion, i.e., the -th () point is assigned to the -th cluster. In contrast to Forgy’s method, this method is sensitive to data ordering.
Maximin method Gonzalez85 (); Katsavounidis94 () chooses the first center arbitrarily and the -th () center is chosen to be the point that has the greatest minimum-distance to the previously selected centers, i.e., . This method was originally developed as a -approximation to the -center clustering problem111Given a set of points in a metric space, the goal of -center clustering is to find representative points (centers) such that the maximum distance of a point to a center is minimized.. It should be noted that, motivated by a vector quantization application, Katsavounidis et al.’s variant Katsavounidis94 () takes the point with the greatest Euclidean norm as the first center.
Al-Daoud’s density-based method AlDaoud96 () first uniformly partitions the data space into disjoint hypercubes. It then randomly selects points from hypercube () to obtain a total of centers ( is the number of points in hypercube ). There are two main disadvantages associated with this method. First, it is difficult to decide on an appropriate value for . Second, the method has a storage complexity of , where is the number of bits allocated to each attribute.
Bradley and Fayyad’s method Bradley98 () starts by randomly partitioning the data set into subsets. These subsets are clustered using k-means initialized by MacQueen’s second method producing sets of intermediate centers each with points. These center sets are combined into a superset, which is then clustered by k-means times, each time initialized with a different center set. Members of the center set that give the least SSE are then taken as the final centers.
Pizzuti et al. Pizzuti99 () improved upon Al-Daoud’s density-based method using a multiresolution grid approach. Their method starts with hypercubes and iteratively splits these as the number of points they receive increases. Once the splitting phase is completed, the centers are chosen from the densest hypercubes.
The k-means++ method Arthur07 () interpolates between MacQueen’s second method and the maximin method. It chooses the first center randomly and the -th () center is chosen to be with a probability of , where denotes the minimum-distance from a point to the previously selected centers. This method yields an approximation to the MSSC problem. The greedy k-means++ method probabilistically selects centers in each round and then greedily selects the center that most reduces the SSE. This modification aims to avoid the unlikely event of choosing two centers that are close to each other.
The PCA-Part method Su07 () uses a divisive hierarchical approach based on PCA (Principal Component Analysis) Hotelling36 (). Starting from an initial cluster that contains the entire data set, the method iteratively selects the cluster with the greatest SSE and divides it into two subclusters using a hyperplane that passes through the cluster centroid and is orthogonal to the principal eigenvector of the cluster covariance matrix. This procedure is repeated until clusters are obtained. The centers are then given by the centroids of these clusters. The Var-Part method Su07 () is an approximation to PCA-Part, where the covariance matrix of the cluster to be split is assumed to be diagonal. In this case, the splitting hyperplane is orthogonal to the coordinate axis with the greatest variance.
Lu et al.’s method Lu08 () uses a two-phase pyramidal approach. The attributes of each point are first encoded as integers using -level quantization, where is a resolution parameter. These integer points are considered to be at level of the pyramid. In the bottom-up phase, starting from level , neighboring data points at level () are averaged to obtain weighted points at level until at least points are obtained. Data points at the highest level are refined using k-means initialized with the points with the largest weights. In the top-down phase, starting from the highest level, centers at level are projected onto level and then used to initialize the -th level clustering. The top-down phase terminates when level is reached. The centers at this level are then inverse quantized to obtain the final centers. The performance of this method degrades with increasing dimensionality Lu08 ().
2.2 Loglinear Time-Complexity Initialization Methods
Hartigan’s method Hartigan79 () first sorts the points according to their distances to . The -th () center is then chosen to be the ()-th point. This method is an improvement over MacQueen’s first method in that it is invariant to data ordering and is more likely to produce seeds that are well separated. The computational cost of this method is dominated by the complexity of sorting, which is .
Al-Daoud’s variance-based method AlDaoud05 () first sorts the points on the attribute with the greatest variance and then partitions them into groups along the same dimension. The centers are then chosen to be the points that correspond to the medians of these groups. Note that this method disregards all attributes but one and therefore is likely to be effective only for data sets in which the variability is mostly on one dimension.
Redmond and Heneghan’s method Redmond07 () first constructs a kd-tree of the data points to perform density estimation and then uses a modified maximin method to select centers from densely populated leaf buckets. The computational cost of this method is dominated by the complexity of kd-tree construction, which is .
The ROBIN (ROBust INitialization) method AlHasan09 () uses a local outlier factor (LOF) Breunig00 () to avoid selecting outlier points as centers. In iteration (), the method first sorts the data points in decreasing order of their minimum-distance to the previously selected centers. It then traverses the points in sorted order and selects the first point that has an LOF value close to as the -th center. The computational cost of this method is dominated by the complexity of sorting, which is .
2.3 Quadratic-Complexity Initialization Methods
Astrahan’s method Astrahan70 () uses two distance thresholds and . It first calculates the density of each point as the number of points within a distance of . The points are sorted in decreasing order by their densities and the highest density point is chosen as the first center. Subsequent centers are chosen in order of decreasing density subject to the condition that each new center be at least at a distance of from the previously selected centers. This procedure is continued until no more centers can be chosen. Finally, if more than centers are chosen, hierarchical clustering is used to group the centers until only of them remain. The main problem with this method is that it is very sensitive to the values of and . For example, if is too small there may be many isolated points with zero density whereas if it is too large a few centers will cover the entire data set Anderberg73 ().
Lance and Williams Lance67 () suggested that the output of a hierarchical clustering algorithm can be used to initialize k-means. Despite the fact that such algorithms often have quadratic or higher complexity, this method is highly recommended in the statistics literature Milligan80a () possibly due to the limited size of the data sets in this field.
Kaufman and Rousseeuw’s method Kaufman90 () takes as the first center and the -th () center is chosen to be the point that most reduces the SSE. Since pairwise distances between the data points need to be calculated in each iteration, the time complexity of this method is .
Cao et al. Cao09 () formalized Astrahan’s density-based method within the framework of a neighborhood-based rough set model. In this model, the -neighborhood of a point is defined as the set of points within distance from it according to a particular distance measure. Based on this neighborhood model, the concepts of cohesion and coupling are defined. The former is a measure of the centrality of a point with respect to its neighborhood; whereas the latter is a measure of separation between two neighborhoods. The method first sorts the data points in decreasing order of their cohesion and takes the point with the greatest cohesion as the first center. It then traverses the points in sorted order and takes the first point that has a coupling of less than with the previously selected centers as the -th () center. The computational cost of this method is dominated by the complexity of the -neighborhood calculations, which is .
2.4 Other Initialization Methods
The binary-splitting method Linde80 () takes as the first center. In iteration (), each of the existing centers is split into two new centers by subtracting and adding a fixed perturbation vector , i.e., and (). These new centers are then refined using k-means. There are two main disadvantages associated with this method. First, there is no guidance on the selection of a proper value for , which determines the direction of the split Huang93 (). Second, the method is computationally demanding since after each iteration k-means has to be run for the entire data set.
The directed-search binary-splitting method Huang93 () is an improvement over the binary-splitting method in that it determines the value of using PCA. However, it has even higher computational requirements due to the calculation of the principal eigenvector in each iteration.
The global k-means method Likas03 () takes as the first center. In iteration () it considers each of the points in turn as a candidate for the -st center and runs k-means with centers on the entire data set. This method is computationally prohibitive for large data sets as it involves runs of k-means on the entire data set.
It should be noted that the two splitting methods and the global k-means method are not initialization methods per se. These methods can be considered as complete clustering methods that utilize k-means as a local search procedure. For this reason, to the best of our knowledge, none of the initialization studies to date included these methods in their comparisons.
We should also mention IMs based on metaheuristics such as simulated annealing Babu94a () and genetic algorithms Babu93 (). These algorithms start from a random initial configuration (population) and use k-means to evaluate their solutions in each iteration (generation). There are two main disadvantages associated with these methods. First, they involve numerous parameters that are difficult to tune (initial temperature, cooling schedule, population size, crossover/mutation probability, etc.) Jain99 (). Second, due to the large search space, they often require a large number of iterations, which renders them computationally prohibitive for all but the smallest data sets. Interestingly, with the recent developments in combinatorial optimization algorithms, it is now feasible to obtain globally minimum SSE clusterings for small data sets without resorting to metaheuristics Aloise10 ().
2.5 Linear vs. Superlinear Initialization Methods
Based on the descriptions given above, it can be seen that superlinear methods often have more elaborate designs when compared to linear ones. An interesting feature of the superlinear methods is that they are often deterministic, which can be considered as an advantage especially when dealing with large data sets. In contrast, linear methods are often non-deterministic and/or order-sensitive. As a result, it is common practice to perform multiple runs of such methods and take the output of the run that produces the least SSE Bradley98 ().
A frequently cited advantage of the more elaborate methods is that they often lead to faster k-means convergence, i.e., require fewer iterations, and as a result the time gained during the clustering phase can offset the time lost during the initialization phase Su07 (); Redmond07 (); AlHasan09 (). This may be true when a standard implementation of k-means is used. However, convergence speed may not be as important when a fast k-means variant is used as such methods often require significantly less time compared to a standard k-means implementation. In this study, we utilize a fast k-means variant based on triangle inequality Huang90 () and partial distance elimination Bei85 () techniques. As will be seen in §4, this fast and exact k-means implementation will diminish the computational efficiency differences among various IMs. In other words, we will demonstrate that elaborate methods that lead to faster k-means convergence are not necessarily more efficient than simple methods with slower convergence.
3 Experimental Setup
3.1 Data Set Descriptions
In order to obtain a comprehensive evaluation of various IMs, we conducted two sets of experiments. The first experiment involved commonly used real data sets with sizes ranging from to points. Most of these data sets were obtained from the UCI Machine Learning Repository Frank11 () (see Table 1.) The second experiment involved a large number of synthetic data sets with varying clustering complexity. We used a recent algorithm proposed by Maitra and Melnykov Maitra10 () to generate these data sets. This algorithm involves the calculation of the exact overlap () between each cluster pair, measured in terms of their total probability of misclassification, and guided simulation of Gaussian mixture components satisfying prespecified overlap characteristics. The algorithm was used with the following parameters: mean overlap (), number of points (), number of attributes (), and number of classes ().
The parameter denotes the mean overlap between pairs of clusters. However, we observed that two synthetic data sets with the same can have considerably different clustering complexity. Therefore, we quantified clustering complexity using the following indirect approach. For each data set, we executed the k-means algorithm initialized with the “true” centers given by the cluster generation algorithm and calculated the RAND, VD, and VI measures (see §3.3) upon convergence. The average of these measures, , was taken as a quantitative indicator of clustering complexity. Note that each of these normalized measures takes values from the interval. For RAND larger values are better, whereas for VD and VI smaller values are better. Therefore, we inverted the RAND values by subtracting them from to make this measure compatible with the other two. Finally, using the aforementioned complexity quantification scheme, we generated synthetic data sets from each of the following complexity classes: easy (), moderate (), and difficult (). The total number of synthetic data sets was thus . Figure 1 shows sample data sets with clusters from each complexity class.
|ID||Data Set||# Points ()||# Attributes ()||# Classes ()|
|1||Breast Cancer Wisconsin (Original)||683||9||2|
|2||Cloud Cover (DB1)||1,024||10||8†|
|3||Concrete Compressive Strength||1,030||9||8†|
|4||Corel Image Features||68,040||25||16†|
|7||Steel Plates Faults||1,941||27||7|
|12||Landsat Satellite (Statlog)||6,435||36||6|
|14||MAGIC Gamma Telescope||19,020||10||2|
|15||Multiple Features (Fourier)||2,000||76||10|
|16||MiniBooNE Particle Identification||130,064||50||2|
|19||Page Blocks Classification||5,473||10||5|
|23||Pima Indians Diabetes||768||8||2|
|27||Telugu Vowels Pal77 ()||871||3||6|
|28||Vehicle Silhouettes (Statlog)||846||18||4|
|29||Wall-Following Robot Navigation||5,456||24||4|
|31||World TSP Cook11 ()||1,904,711||2||7†|
Due to the unavailability of class labels, for data sets #2, #3, and #4, was chosen arbitrarily, whereas for #20 and #31, it was determined based on domain knowledge.
3.2 Attribute Normalization
In clustering tasks, normalization is a common preprocessing step that is necessary to prevent attributes with large ranges from dominating the distance calculations and also to avoid numerical instabilities in the computations. Two commonly used normalization schemes are linear scaling to unit range (min-max normalization) and linear scaling to unit variance (z-score normalization). Several studies revealed that the former scheme is preferable to the latter since the latter is likely to eliminate valuable between-cluster variation Milligan88 (); Su07 (). As a result, we used min-max normalization to map the attributes of each real data set to the interval. Note that attributes of the synthetic data sets were already normalized by the cluster generation algorithm.
3.3 Performance Criteria
The performance of the IMs was measured using five effectiveness (quality) and two efficiency (speed) criteria:
Initial SSE: This is the SSE value calculated after the initialization phase, before the clustering phase. It gives us a measure of the effectiveness of an IM by itself.
Final SSE: This is the SSE value calculated after the clustering phase. It gives us a measure of the effectiveness of an IM when its output is refined by k-means. Note that this is the objective function of the k-means algorithm, i.e., (1).
Normalized Rand (RAND) Hubert85 (), van Dongen (VD) vanDongen00 (), and Variation of Information Meila07 () criteria (VI): These are external validity measures that quantify the extent to which the clustering structure discovered by a clustering algorithm matches some external structure, e.g., one specified by the given class labels Wu09a (); Wu09c (). In a recent comprehensive study, these three measures were found to be the best among 16 external validity measures Wu09a (). Note that each of these normalized measures takes values from the interval.
Number of Iterations: This is the number of iterations that k-means requires until reaching convergence when initialized by a particular IM. It is an efficiency measure independent of programming language, implementation style, compiler, and CPU architecture.
CPU Time: This is the total CPU time taken by the initialization and clustering phases. This criterion is reported only for the real data sets.
All of the methods were implemented in the C language, compiled with the gcc v4.4.3 compiler, and executed on an Intel Xeon E5520 2.26GHz machine. Time measurements were performed using the getrusage function, which is capable of measuring CPU time to an accuracy of a microsecond. The MT19937 variant of the Mersenne Twister algorithm was used to generate high-quality pseudorandom numbers Matsumoto98 ().
The convergence of k-means was controlled by the disjunction of two criteria: the number of iterations reaches a maximum of or the relative improvement in SSE between two consecutive iterations drops below a threshold, i.e., , where denotes the SSE value at the end of the -th () iteration. The convergence threshold was set to .
4 Experimental Results and Discussion
In this study, we focus on IMs that have time complexity linear in . This is because k-means itself has linear complexity, which is perhaps the most important reason for its popularity. Therefore, an IM for k-means should not diminish this advantage of the algorithm. Eight commonly used, order-invariant IMs were included in the experiments: Forgy’s method (F), MacQueen’s second method (M), maximin (X), Bradley and Fayyad’s method (B) with , k-means++ (K), greedy k-means++ (G), Var-Part (V), and PCA-Part (P). It should be noted that among these methods only V and P are deterministic.
In the experiments, each non-deterministic method was executed a times and statistics such as minimum, mean, and standard deviation were collected for the effectiveness criteria. In each run, the number of clusters () was set equal to the number of classes (), as commonly seen in the related literature Bradley98 (); Pizzuti99 (); He04 (); Arthur07 (); Su07 (); AlHasan09 (); Cao09 ().
Tables 2 and 3 give the Final SSE and CPU time (in milliseconds) results for the real data sets, respectively. Note that, due to space limitations, only mean values are reported for the CPU time criterion. In order to determine if there are any statistically significant differences among the methods, we employed two non-parametric statistical tests Garcia08 (): the Friedman test Friedman37 () and Iman & Davenport test Iman80 (). These tests are alternatives to the parametric two-way analysis of variance (ANOVA) test. Their advantage over ANOVA is that they do not require normality or homoscedasticity, assumptions that are often violated in machine learning studies Luengo09 (); Garcia09 ().
Given blocks (subjects) and treatments (measurements), the null hypothesis () of the Friedman test is that populations within a block are identical. The alternative hypothesis () is that at least one treatment tends to yield larger (or smaller) values than at least one other treatment. The test statistic is calculated as follows Daniel00 (). In the first step, the observations within each block are ranked separately, so each block contains a separate set of ranks. If ties occur, the tied observations are given the mean of the rank positions for which they are tied. If is true, the ranks in each block should be randomly distributed over the columns (treatments). Otherwise, we expect a lack of randomness in this distribution. For example, if a particular treatment is better than the others, we expect large (or small) ranks to ‘favor’ that column. In the second step, the ranks in each column are summed. If is true, we expect the sums to be fairly close — so close that we can attribute differences to chance. Otherwise, we expect to see at least one difference between pairs of rank sums so large that we cannot reasonably attribute it to sampling variability. The test statistic is given as:
where () is the rank sum of the -th column. is approximately chi-square with degrees of freedom. is rejected at the level of significance if the value of (3) is greater than or equal to the critical chi-square value for degrees of freedom. Iman and Davenport Iman80 () proposed the following statistic:
which is distributed according to the F-distribution with and degrees of freedom. When compared to , this statistic is not only less conservative, but also more accurate for small sample sizes Iman80 ().
In this study, blocks and treatments correspond to data sets and initialization methods, respectively. Our goal is to determine whether or not there is at least one method that is significantly better than at least one other method at the level. If this is the case, we will conduct a post-hoc (multiple comparison) test to determine which pairs of methods differ significantly. For this purpose, we will use the Bergmann-Hommel test Bergmann88 (), a powerful post-hoc procedure that has been used successfully in various machine learning studies Garcia08 (); Voss10 ().
4.1 Real Data Sets
Table 4 gives the Final SSE rankings of the IMs for the real data sets as determined by the Bergmann-Hommel procedure using data given in Table 2. Here, a notation such as indicates that there is no statistically significant difference between methods D and E and these two methods are significantly better than method C. From Table 4 it can be seen that the methods cannot be distinguished from each other reliably. This was expected since even nonparametric post-hoc tests lack discrimination power in small sample cases (recall that only data sets were used) with a large number of ties (see Table 2). For example, with respect to the minimum statistic, the performances of F, M, B, K, and G are statistically indistinguishable. In other words, if we initialize k-means with each of these non-deterministic methods and execute it until convergence, the resulting clusterings over runs will have very similar minimum Final SSE values. Similar trends were observed for the RAND, VD, and VI criteria (data not shown). Given the abundance of local minima even in data sets of moderate size and/or dimensionality and the gradient descent nature of k-means, it is not surprising that the deterministic methods V and P are outperformed by most of the non-deterministic methods as the former methods were executed only once, whereas the latter ones were executed times.
As mentioned earlier, the minimum statistic is meaningful only when it is practical to execute k-means multiple times. Otherwise, the mean statistic is more meaningful. The analysis of mean Final SSE results using the Bergmann-Hommel procedure reveals that deterministic methods V and P are the preferred choices in this case. This is not surprising since non-deterministic methods, in particular those that are ad hoc in nature, often produce highly variable results across multiple runs.
The standard deviation statistic characterizes the reliability of a non-deterministic IM with respect to a particular performance criterion. In other words, if a non-deterministic IM obtains low mean and standard deviation with respect to an effectiveness criterion, we do not have to execute this method a large number of times to obtain good results. The analysis of Final SSE standard deviations reveals two overlapping groups of methods. Once again this is not necessarily because the members of each group are in fact indistinguishable with respect to their reliability, but due to the relatively small sample size used. In summary, due to the necessarily small number of real-world data sets available for clustering studies, it may not be possible to distinguish among various IMs. Therefore, it is crucial that these IMs be tested on a large number of synthetic data sets (see §4.2).
As for computational efficiency, it can be seen from Table 3 that, in general, the IMs have similar computational requirements per run. However, in practice, a non-deterministic method is typically executed times and the output of the run that gives the least SSE is taken as the result. Therefore, the total computational cost of a non-deterministic method is often significantly higher than that of a deterministic method. As predicted in §2.5, simple methods such as M require about the same CPU time as elaborate methods such as G. This is because simple methods often lead to more k-means iterations, whereas elaborate ones compensate for their computational overhead by requiring fewer k-means iterations. It should be noted that efficiency differences among the methods can be further reduced by using faster k-means variants such as those described in Kanungo02 (); Hamerly10 ().
4.2 Synthetic Data Sets
Table 5 gives the ranking of the IMs with respect to the minimum statistic. It can be seen that despite variations in rankings across the performance criteria, some general trends emerge:
Non-deterministic methods outperform the deterministic ones, i.e., V and P, except in the case of Initial SSE. As explained in §4.1, this is due to the fact that the non-deterministic methods were executed times, whereas the deterministic ones were executed only once. The reason why deterministic methods have good Initial SSE performance is because these methods are approximate (divisive hierarchical) clustering methods by themselves and thus they give reasonable results even without k-means refinement.
Method B consistently appears in the best performing group, whereas methods F and X are often among the worst non-deterministic methods.
Method M exhibits moderate-to-good performance except in the case of Initial SSE. Recall that this method randomly selects the initial centers from among the data points and therefore it cannot be expected to perform well without k-means refinement.
Methods K and G generally perform well. In some cases the latter outperforms the former, whereas in others they have comparable performance.
Table 6 gives the ranking of the IMs with respect to the mean statistic. It can be seen that despite variations in rankings across the performance criteria, some general trends emerge:
Deterministic methods, i.e., V and P, generally outperform the non-deterministic ones. As explained in §4.1, this is due to the fact that the non-deterministic methods can produce highly variable results across multiple runs. Method B is highly competitive with the deterministic methods.
Methods M and X are often among the worst performers, whereas methods F and K exhibit moderate-to-bad performance.
Method G is often significantly better than all non-deterministic methods but B.
Table 7 gives the ranking of the non-deterministic IMs with respect to the standard deviation statistic. It can be seen that despite variations in rankings across the performance criteria, some general trends emerge:
Method B consistently appears in the best performing group, whereas method M is often among the worst performers.
Methods X and K exhibit moderate-to-bad performance.
Method F and G are significantly better than all methods but B.
4.3 Recommendations for Practitioners
Based on the statistical analyses presented in the previous section, the following recommendations can be made:
In general, methods F, M, and X should not be used. These methods are easy to understand and implement, but they are often ineffective and unreliable. Furthermore, despite their low overhead, these methods do not offer significant time savings since they often result in slower k-means convergence.
In time-critical applications that involve large data sets or applications that demand determinism, methods V or P should be used. These methods need to be executed only once and they lead to very fast k-means converge. The efficiency difference between the two is noticeable only on high dimensional data sets. This is because method V calculates the direction of split by determining the coordinate axis with the greatest variance (in time), whereas method P achieves this by calculating the principal eigenvector of the covariance matrix (in time using the power method Hotelling36 ()). Note that despite its higher computational complexity, method P can, in some cases, be more efficient than method V (see Table 3). This is because the former converges significantly faster than the latter (see Table 6). The main disadvantage of these methods is that they are more complicated to implement due to their hierarchical formulation.
In applications that involve small data sets, e.g., , methods B or G should be used. It is computationally feasible to run these methods hundreds of times on such data sets given that one such run takes only a few milliseconds.
In this paper we presented an overview of k-means initialization methods with an emphasis on their computational efficiency. We then compared eight commonly used linear time initialization methods on a large and diverse collection of real and synthetic data sets using various performance criteria. Finally, we analyzed the experimental results using non-parametric statistical tests and provided recommendations for practitioners. Our statistical analyses revealed that popular initialization methods such as forgy, macqueen, and maximin often perform poorly and that there are significantly better alternatives to these methods that have comparable computational requirements.
This publication was made possible by grants from the Louisiana Board of Regents (LEQSF2008-11-RD-A-12) and National Science Foundation (0959583, 1117457). The authors are grateful to D. Arthur, S. Vassilvitskii, J.G. Dy, S.J. Redmond, J.F. Lu, and M. Al Hasan for clarifying various points about their papers.
- (1) A. K. Jain, M. N. Murty, P. J. Flynn, Data Clustering: A Review, ACM Computing Surveys 31 (3) (1999) 264–323.
- (2) A. K. Jain, Data Clustering: 50 Years Beyond K-means, Pattern Recognition Letters 31 (8) (2010) 651–666.
- (3) L. Kaufman, P. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley-Interscience, 1990.
- (4) D. Aloise, A. Deshpande, P. Hansen, P. Popat, NP-Hardness of Euclidean Sum-of-Squares Clustering, Machine Learning 75 (2) (2009) 245–248.
- (5) M. Mahajan, P. Nimbhorkar, K. Varadarajan, The Planar k-Means Problem is NP-hard, Theoretical Computer Science 442 (2012) 13–21.
- (6) A. Tarsitano, A Computational Study of Several Relocation Methods for K-Means Algorithms, Pattern Recognition 36 (12) (2003) 2955–2966.
- (7) S. Lloyd, Least Squares Quantization in PCM, IEEE Trans. on Information Theory 28 (2) (1982) 129–136.
- (8) P. S. Bradley, U. Fayyad, Refining Initial Points for K-Means Clustering, in: Proc. of the 15th Int. Conf. on Machine Learning, 1998, pp. 91–99.
- (9) M. Dash, H. Liu, X. Xu, ’’: Merging Distance and Density Based Clustering, in: Proc. of the 7th Int. Conf. on Database Systems for Advanced Applications, 2001, pp. 32–39.
- (10) W. Y. Chen, Y. Song, H. Bai, C. J. Lin, E. Y. Chang, Parallel Spectral Clustering in Distributed Systems, IEEE Trans. on Pattern Analysis and Machine Intelligence 33 (3) (2011) 568–586.
- (11) T. Kanungo, D. Mount, N. Netanyahu, C. Piatko, R. Silverman, A. Wu, An Efficient K-Means Clustering Algorithm: Analysis and Implementation, IEEE Trans. on Pattern Analysis and Machine Intelligence 24 (7) (2002) 881–892.
- (12) G. Hamerly, Making k-means Even Faster, in: Proc. of the 2010 SIAM Int. Conf. on Data Mining, 2010, pp. 130–140.
- (13) T. W. Chen, S. Y. Chien, Bandwidth Adaptive Hardware Architecture of K-Means Clustering for Video Analysis, IEEE Trans. on Very Large Scale Integration (VLSI) Systems 18 (6) (2010) 957–966.
- (14) C. Ordonez, E. Omiecinski, Efficient Disk-Based K-Means Clustering for Relational Databases, IEEE Trans. on Knowledge and Data Engineering 16 (8) (2004) 909–921.
- (15) S. Z. Selim, M. A. Ismail, K-Means-Type Algorithms: A Generalized Convergence Theorem and Characterization of Local Optimality, IEEE Trans. on Pattern Analysis and Machine Intelligence 6 (1) (1984) 81–87.
- (16) L. Bottou, Y. Bengio, Advances in Neural Information Processing Systems 7, MIT Press, 1995, Ch. Convergence Properties of the K-Means Algorithms, pp. 585–592.
- (17) L. Vendramin, R. J. G. B. Campello, E. R. Hruschka, Relative Clustering Validity Criteria: A Comparative Overview, Statistical Analysis and Data Mining 3 (4) (2010) 209–235.
- (18) J. Mao, A. K. Jain, A Self-Organizing Network for Hyperellipsoidal Clustering (HEC), IEEE Trans. on Neural Networks 7 (1) (1996) 16–29.
- (19) J. S. Zhang, Y.-W. Leung, Robust Clustering by Pruning Outliers, IEEE Trans. on Systems, Man, and Cybernetics – Part B 33 (6) (2003) 983–999.
- (20) M. E. Celebi, Improving the Performance of K-means for Color Quantization, Image and Vision Computing 29 (4) (2011) 260–271.
- (21) J. M. Pena, J. A. Lozano, P. Larranaga, An Empirical Comparison of Four Initialization Methods for the K-Means Algorithm, Pattern Recognition Letters 20 (10) (1999) 1027–1040.
- (22) J. He, M. Lan, C. L. Tan, S. Y. Sung, H. B. Low, Initialization of Cluster Refinement Algorithms: A Review and Comparative Study, in: Proc. of the 2004 IEEE Int. Joint Conf. on Neural Networks, 2004, pp. 297–302.
- (23) E. Forgy, Cluster Analysis of Multivariate Data: Efficiency vs. Interpretability of Classification, Biometrics 21 (1965) 768.
- (24) M. R. Anderberg, Cluster Analysis for Applications, Academic Press, 1973.
- (25) R. C. Jancey, Multidimensional Group Analysis, Australian Journal of Botany 14 (1) (1966) 127–130.
- (26) J. MacQueen, Some Methods for Classification and Analysis of Multivariate Observations, in: Proc. of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1967, pp. 281–297.
- (27) M. J. Norušis, IBM SPSS Statistics 19 Statistical Procedures Companion, Addison Wesley, 2011.
- (28) G. H. Ball, D. J. Hall, A Clustering Technique for Summarizing Multivariate Data, Behavioral Science 12 (2) (1967) 153–155.
- (29) J. T. Tou, R. C. Gonzales, Pattern Recognition Principles, Addison-Wesley, 1974.
- (30) SAS Institute Inc., SAS/STAT 9.2 User’s Guide, SAS Publishing, 2009.
- (31) H. Späth, Computational Experiences with the Exchange Method: Applied to Four Commonly Used Partitioning Cluster Analysis Criteria, European Journal of Operational Research 1 (1) (1977) 23–31.
- (32) T. Gonzalez, Clustering to Minimize the Maximum Intercluster Distance, Theoretical Computer Science 38 (2–3) (1985) 293–306.
- (33) I. Katsavounidis, C.-C. J. Kuo, Z. Zhang, A New Initialization Technique for Generalized Lloyd Iteration, IEEE Signal Processing Letters 1 (10) (1994) 144–146.
- (34) M. Al-Daoud, S. Roberts, New Methods for the Initialisation of Clusters, Pattern Recognition Letters 17 (5) (1996) 451–455.
- (35) C. Pizzuti, D. Talia, G. Vonella, A Divisive Initialisation Method for Clustering Algorithms, in: Proc. of the 3rd European Conf. on Principles and Practice of Knowledge Discovery in Databases, 1999, pp. 484–491.
- (36) D. Arthur, S. Vassilvitskii, k-means++: The Advantages of Careful Seeding, in: Proc. of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, 2007, pp. 1027–1035.
- (37) T. Su, J. G. Dy, In Search of Deterministic Methods for Initializing K-Means and Gaussian Mixture Clustering, Intelligent Data Analysis 11 (4) (2007) 319–338.
- (38) H. Hotelling, Simplified Calculation of Principal Components, Psychometrika 1 (1) (1936) 27–35.
- (39) J. F. Lu, J. B. Tang, Z. M. Tang, J. Y. Yang, Hierarchical Initialization Approach for K-Means Clustering, Pattern Recognition Letters 29 (6) (2008) 787–795.
- (40) T. Onoda, M. Sakai, S. Yamada, Careful Seeding Method based on Independent Components Analysis for k-means Clustering, Journal of Emerging Technologies in Web Intelligence 4 (1) (2012) 51–59.
- (41) A. Hyvärinen, Fast and Robust Fixed-Point Algorithms for Independent Component Analysis, IEEE Trans. on Neural Networks 10 (3) (1999) 626–634.
- (42) J. A. Hartigan, M. A. Wong, Algorithm AS 136: A K-Means Clustering Algorithm, Journal of the Royal Statistical Society C 28 (1) (1979) 100–108.
- (43) M. Al-Daoud, A New Algorithm for Cluster Initialization, in: Proc. of the 2nd World Enformatika Conf., 2005, pp. 74–76.
- (44) S. J. Redmond, C. Heneghan, A Method for Initialising the K-means Clustering Algorithm Using kd-trees, Pattern Recognition Letters 28 (8) (2007) 965–973.
- (45) M. Al Hasan, V. Chaoji, S. Salem, M. Zaki, Robust Partitional Clustering by Outlier and Density Insensitive Seeding, Pattern Recognition Letters 30 (11) (2009) 994–1002.
- (46) M. M. Breunig, H. P. Kriegel, R. T. Ng, J. Sander, LOF: Identifying Density-Based Local Outliers, ACM SIGMOD Record 29 (2) (2000) 93–104.
- (47) M. M. Astrahan, Speech Analysis by Clustering, or the Hyperphoneme Method, Tech. Rep. AIM-124, Stanford University (1970).
- (48) G. N. Lance, W. T. Williams, A General Theory of Classificatory Sorting Strategies - II. Clustering Systems, The Computer Journal 10 (3) (1967) 271–277.
- (49) G. W. Milligan, An Examination of the Effect of Six Types of Error Perturbation on Fifteen Clustering Algorithms, Psychometrika 45 (3) (1980) 325–342.
- (50) F. Cao, J. Liang, G. Jiang, An Initialization Method for the K-Means Algorithm Using Neighborhood Model, Computers and Mathematics with Applications 58 (3) (2009) 474–483.
- (51) Y. Linde, A. Buzo, R. Gray, An Algorithm for Vector Quantizer Design, IEEE Trans. on Communications 28 (1) (1980) 84–95.
- (52) C. M. Huang, R. W. Harris, A Comparison of Several Vector Quantization Codebook Generation Approaches, IEEE Trans. on Image Processing 2 (1) (1993) 108–112.
- (53) A. Likas, N. Vlassis, J. Verbeek, The Global K-Means Clustering Algorithm, Pattern Recognition 36 (2) (2003) 451–461.
- (54) G. Babu, M. Murty, Simulated Annealing for Selecting Optimal Initial Seeds in the K-Means Algorithm, Indian Journal of Pure and Applied Mathematics 25 (1–2) (1994) 85–94.
- (55) G. P. Babu, M. N. Murty, A Near-Optimal Initial Seed Value Selection in K-Means Algorithm Using a Genetic Algorithm, Pattern Recognition Letters 14 (10) (1993) 763–769.
- (56) D. Aloise, P. Hansen, L. Liberti, An Improved Column Generation Algorithm for Minimum Sum-of-Squares Clustering, Mathematical Programming (2010) 1–26.
- (57) S. H. Huang, S. H. Chen, Fast Encoding Algorithm for VQ-based Image Coding, Electronics Letters 26 (19) (1990) 1618–1619.
- (58) C. D. Bei, R. M. Gray, An Improvement of the Minimum Distortion Encoding Algorithm for Vector Quantization, IEEE Trans. on Communications 33 (10) (1985) 1132–1133.
- (59) A. Frank, A. Asuncion, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml, University of California, Irvine, School of Information and Computer Sciences (2011).
- (60) R. Maitra, V. Melnykov, Simulating Data to Study Performance of Finite Mixture Modeling and Clustering Algorithms, Journal of Computational and Graphical Statistics 19 (2) (2010) 354–376.
- (61) S. K. Pal, D. D. Majumder, Fuzzy Sets and Decision Making Approaches in Vowel and Speaker Recognition, IEEE Trans. on Systems, Man, and Cybernetics 7 (8) (1977) 625–629, http://www.isical.ac.in/s̃ushmita/patterns/vowel.dat.
- (62) W. Cook, World TSP, http://www.tsp.gatech.edu/world, Georgia Institute of Technology (2011).
- (63) G. Milligan, M. C. Cooper, A Study of Standardization of Variables in Cluster Analysis, Journal of Classification 5 (2) (1988) 181–204.
- (64) L. Hubert, P. Arabie, Comparing Partitions, Journal of Classification 2 (1) (1985) 193–218.
- (65) S. van Dongen, Performance Criteria for Graph Clustering and Markov Cluster Experiments, Tech. Rep. INS-R0012, Centrum voor Wiskunde en Informatica (2000).
- (66) M. Meilă, Comparing Clusterings — An Information Based Distance, Journal of Multivariate Analysis 98 (5) (2007) 873–895.
- (67) J. Wu, H. Xiong, J. Chen, Adapting the Right Measures for K-means Clustering, in: Proc. of the 15th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2009, pp. 877–885.
- (68) J. Wu, J. Chen, H. Xiong, M. Xie, External Validation Measures for K-Means Clustering: A Data Distribution Perspective, Expert Systems With Applications 36 (3) (2009) 6050–6061.
- (69) M. Matsumoto, T. Nishimura, Mersenne Twister: A 623-Dimensionally Equidistributed Uniform Pseudo-Random Number Generator, ACM Trans. on Modeling and Computer Simulation 8 (1) (1998) 3–30.
- (70) S. Garcia, F. Herrera, An Extension on ”Statistical Comparisons of Classifiers over Multiple Data Sets” for all Pairwise Comparisons, Journal of Machine Learning Research 9 (2008) 2677–2694.
- (71) M. Friedman, The Use of Ranks to Avoid the Assumption of Normality Implicit in the Analysis of Variance, Journal of the American Statistical Association 32 (200) (1937) 675–701.
- (72) R. L. Iman, J. M. Davenport, Approximations of the Critical Region of the Friedman Statistic, Communications in Statistics - Theory and Methods 9 (6) (1980) 571–595.
- (73) J. Luengo, S. Garcia, F. Herrera, A Study on the Use of Statistical Tests for Experimentation with Neural Networks: Analysis of Parametric Test Conditions and Non-Parametric Tests, Expert Systems With Applications 36 (4) (2009) 7798–7808.
- (74) S. Garcia, A. Fernandez, J. Luengo, F. Herrera, A Study of Statistical Techniques and Performance Measures for Genetics-Based Machine Learning: Accuracy and Interpretability, Soft Computing 13 (10) (2009) 959–977.
- (75) W. W. Daniel, Applied Nonparametric Statistics, Duxbury Press, 2000.
- (76) B. Bergmann, G. Hommel, Multiple Hypotheses Testing, Springer, 1988, Ch. Improvements of General Multiple Test Procedures for Redundant Systems of Hypotheses, pp. 100–115.
- (77) T. Voss, N. Hansen, C. Igel, Improved Step Size Adaptation for the MO-CMA-ES, in: Proc. of the 12th Annual Conf. on Genetic and Evolutionary Computation, 2010, pp. 487–494.
|Data Set Complexity||IM Ranking|
|Number of Iterations|
|Data Set Complexity||IM Ranking|
|Number of Iterations|
|Data Set Complexity||IM Ranking|
|Moderate||Same as Easy|
|Number of Iterations|