A Comparative Study of Efficient Initialization Methods for the K-Means Clustering Algorithm
Abstract
K-means is undoubtedly the most widely used partitional clustering algorithm. Unfortunately, due to its gradient descent nature, this algorithm is highly sensitive to the initial placement of the cluster centers. Numerous initialization methods have been proposed to address this problem. In this paper, we first present an overview of these methods with an emphasis on their computational efficiency. We then compare eight commonly used linear-time initialization methods on a large and diverse collection of data sets using various performance criteria. Finally, we analyze the experimental results using nonparametric statistical tests and provide recommendations for practitioners. We demonstrate that popular initialization methods often perform poorly and that there are in fact strong alternatives to these methods.
1 Introduction
Clustering, the unsupervised classification of patterns into groups, is one of the most important tasks in exploratory data analysis Jain99 (). Primary goals of clustering include gaining insight into data (detecting anomalies, identifying salient features, etc.), classifying data, and compressing data. Clustering has a long and rich history in a variety of scientific disciplines including anthropology, biology, medicine, psychology, statistics, mathematics, engineering, and computer science. As a result, numerous clustering algorithms have been proposed since the early 1950s Jain10 ().
Clustering algorithms can be broadly classified into two groups: hierarchical and partitional Jain10 (). Hierarchical algorithms recursively find nested clusters either in a top-down (divisive) or bottom-up (agglomerative) fashion. In contrast, partitional algorithms find all the clusters simultaneously as a partition of the data and do not impose a hierarchical structure. Most hierarchical algorithms have quadratic or higher complexity in the number of data points Jain99 () and therefore are not suitable for large data sets, whereas partitional algorithms often have lower complexity.
Given a data set X = {x_1, x_2, …, x_N} in R^D, i.e., N points (vectors) each with D attributes (components), hard partitional algorithms divide X into K exhaustive and mutually exclusive clusters C_1, …, C_K (with C_i ∩ C_j = ∅ for i ≠ j). These algorithms usually generate clusters by optimizing a criterion function. The most intuitive and frequently used criterion function is the Sum of Squared Error (SSE) given by:
\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x_j \in C_i} \left\| x_j - c_i \right\|_2^2    (1)
where ‖·‖_2 denotes the Euclidean (L_2) norm and c_i is the centroid of cluster C_i, whose cardinality is |C_i|. The optimization of (1) is often referred to as the minimum SSE clustering (MSSC) problem.
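To make the criterion concrete, the following minimal Python/NumPy sketch computes the SSE of a given hard partition; the array names and shapes are our own illustrative assumptions rather than anything prescribed by the study.

```python
import numpy as np

def sse(points, labels, centers):
    """Sum of squared Euclidean distances of each point to its assigned center.

    points  : (N, D) array of data points
    labels  : (N,)   array of cluster indices in {0, ..., K-1}
    centers : (K, D) array of cluster centroids
    """
    diffs = points - centers[labels]      # displacement of each point from its assigned center
    return float(np.sum(diffs ** 2))      # squared L2 norms, summed over all points
```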
The number of ways in which a set of N objects can be partitioned into K nonempty groups is given by the Stirling numbers of the second kind:
S(N, K) = \frac{1}{K!} \sum_{i=0}^{K} (-1)^i \binom{K}{i} (K - i)^N    (2)
which can be approximated by K^N / K!. It can be seen that a complete enumeration of all possible clusterings to determine the global minimum of (1) is clearly computationally prohibitive except for very small data sets Kaufman90 (). In fact, this nonconvex optimization problem is proven to be NP-hard even for K = 2 Aloise09 () or D = 2 Mahajan12 (). Consequently, various heuristics have been developed to provide approximate solutions to this problem Tarsitano03 (). Among these heuristics, Lloyd’s algorithm Lloyd82 (), often referred to as the (batch) k-means algorithm, is the simplest and most commonly used one. This algorithm starts with K arbitrary centers, typically chosen uniformly at random from the data points. Each point is assigned to the nearest center, and then each center is recalculated as the mean of all points assigned to it. These two steps are repeated until a predefined termination criterion is met.
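As an illustration of the two-step iteration just described (nearest-center assignment followed by centroid recomputation), here is a minimal NumPy sketch of Lloyd's algorithm with random seeding; it is an illustrative rendering under our own assumptions, not the implementation used in the experiments reported later.

```python
import numpy as np

def lloyd_kmeans(points, k, max_iter=100, tol=1e-6, rng=None):
    """Batch k-means (Lloyd's algorithm) with seeds drawn uniformly at random from the data."""
    rng = np.random.default_rng(rng)
    n = len(points)
    centers = points[rng.choice(n, size=k, replace=False)].astype(float)
    prev_sse = None
    for _ in range(max_iter):
        # Assignment step: squared Euclidean distance from every point to every center.
        sq_dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dists.argmin(axis=1)
        sse = float(sq_dists[np.arange(n), labels].sum())
        # Stop when the relative SSE improvement becomes negligible.
        if prev_sse is not None and prev_sse - sse <= tol * prev_sse:
            break
        prev_sse = sse
        # Update step: move each center to the mean of its assigned points.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers, labels, sse
```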
The k-means algorithm is undoubtedly the most widely used partitional clustering algorithm Jain99 (); Jain10 (). Its popularity can be attributed to several reasons. First, it is conceptually simple and easy to implement. Virtually every data mining software package includes an implementation of it. Second, it is versatile, i.e., almost every aspect of the algorithm (initialization, distance function, termination criterion, etc.) can be modified. This is evidenced by hundreds of publications over the last fifty years that extend k-means in various ways. Third, it has a time complexity that is linear in N, D, and K (in general, D ≪ N and K ≪ N). For this reason, it can be used to initialize more expensive clustering algorithms such as expectation maximization Bradley98 (), DBSCAN Dash01 (), and spectral clustering Chen11 (). Furthermore, numerous sequential Kanungo02 (); Hamerly10 () and parallel Chen10 () acceleration techniques are available in the literature. Fourth, it has a storage complexity that is linear in N, D, and K. In addition, there exist disk-based variants that do not require all points to be stored in memory Ordonez04 (). Fifth, it is guaranteed to converge Selim84 () at a quadratic rate Bottou95 (). Finally, it is invariant to data ordering, i.e., to random shufflings of the data points.
On the other hand, k-means has several significant disadvantages. First, it requires the number of clusters, K, to be specified a priori. The value of this parameter can be determined automatically by means of various cluster validity measures Vendramin10 (). Second, it can only detect compact, hyperspherical clusters that are well separated. This can be alleviated by using a more general distance function such as the Mahalanobis distance, which permits the detection of hyperellipsoidal clusters Mao96 (). Third, due to its utilization of the squared Euclidean distance, it is sensitive to noise and outlier points, since even a few such points can significantly influence the means of their respective clusters. This can be addressed by outlier pruning Zhang03 () or by using a more robust distance function such as the City-block (L1) distance. Fourth, due to its gradient descent nature, it often converges to a local minimum of the criterion function Selim84 (). For the same reason, it is highly sensitive to the selection of the initial centers. Adverse effects of improper initialization include empty clusters, slower convergence, and a higher chance of getting stuck in bad local minima Celebi11 (). Fortunately, all of these drawbacks except the first one can be remedied by using an adaptive initialization method (IM).
In this study, we investigate some of the most popular IMs developed for the k-means algorithm. Our motivation is threefold. First, a large number of IMs have been proposed in the literature and thus a systematic study that reviews and compares these methods is desirable. Second, these IMs can be used to initialize other partitional clustering algorithms such as fuzzy c-means and its variants and expectation maximization. Third, most of these IMs can be used independently of k-means as standalone clustering algorithms.
This study differs from earlier studies of a similar nature Pena99 (); He04 () in several respects: (i) a more comprehensive overview of the existing IMs is provided, (ii) the experiments involve a larger set of methods and a significantly more diverse collection of data sets, (iii) in addition to clustering effectiveness, computational efficiency is used as a performance criterion, and (iv) the experimental results are analyzed more thoroughly using nonparametric statistical tests.
2 Initialization Methods for K-Means
In this section, we briefly review some of the commonly used IMs with an emphasis on their time complexity (with respect to N). In each complexity class, methods are presented in chronologically ascending order.
2.1 Linear Time-Complexity Initialization Methods
Forgy’s method Forgy65 () assigns each point to one of the K clusters uniformly at random. The centers are then given by the centroids of these initial clusters. This method has no theoretical basis, as such random clusters have no internal homogeneity Anderberg73 ().
Jancey’s method Jancey66 () assigns to each center a synthetic point randomly generated within the data space. Unless the data set fills the space, some of these centers may be quite distant from any of the points Anderberg73 (), which might lead to the formation of empty clusters.
MacQueen MacQueen67 () proposed two different methods. The first one, which is the default option in the Quick Cluster procedure of IBM SPSS Statistics Norusis11 (), takes the first K points in X as the centers. An obvious drawback of this method is its sensitivity to data ordering. The second method chooses the K centers randomly from among the data points. The rationale behind this method is that random selection is likely to pick points from dense regions, i.e., points that are good candidates to be centers. However, there is no mechanism to avoid choosing outliers or points that are too close to each other Anderberg73 (). Multiple runs of this second method constitute the standard way of initializing k-means Bradley98 (). It should be noted that this second method is often mistakenly attributed to Forgy Forgy65 ().
Ball and Hall’s method Ball67 () takes the centroid of X as the first center. It then traverses the points in arbitrary order and takes a point as a center if it is at least T units apart from all previously selected centers, until K centers are obtained. The purpose of the distance threshold T is to ensure that the seed points are well separated. However, it is difficult to decide on an appropriate value for T. In addition, the method is sensitive to data ordering.
The Simple Cluster Seeking method Tou74 () is identical to Ball and Hall’s method with the exception that the first point in X is taken as the first center. This method is used in the FASTCLUS procedure of SAS SAS09 ().
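The following sketch illustrates the threshold-based seeding shared by Ball and Hall's method and Simple Cluster Seeking (the two differ only in the choice of the first seed); the parameter name t and the behavior when fewer than K seeds can be found are our own assumptions.

```python
import numpy as np

def ball_hall_seeds(points, k, t):
    """Ball & Hall seeding: start from the grand centroid, then accept any point
    that lies at least t away from every previously accepted seed."""
    seeds = [points.mean(axis=0)]                      # first seed: centroid of the data set
    for x in points:                                   # traverse points in the given order
        if len(seeds) == k:
            break
        if all(np.linalg.norm(x - s) >= t for s in seeds):
            seeds.append(x)
    return np.array(seeds)                             # may hold fewer than k seeds if t is too large
```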
Späth’s method Spath77 () is similar to Forgy’s method with the exception that the points are assigned to the clusters in a cyclical fashion, i.e., the i-th (i ∈ {1, 2, …, N}) point is assigned to the (i mod K)-th cluster. In contrast to Forgy’s method, this method is sensitive to data ordering.
The maximin method Gonzalez85 (); Katsavounidis94 () chooses the first center arbitrarily, and the i-th (i ∈ {2, …, K}) center is chosen to be the point that has the greatest minimum-distance to the previously selected centers. This method was originally developed as a 2-approximation to the K-center clustering problem (given a set of N points in a metric space, the goal of K-center clustering is to find K representative points, i.e., centers, such that the maximum distance of a point to its nearest center is minimized). It should be noted that, motivated by a vector quantization application, Katsavounidis et al.’s variant Katsavounidis94 () takes the point with the greatest Euclidean norm as the first center.
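A minimal sketch of maximin seeding is given below; it uses Katsavounidis et al.'s choice of the largest-norm point as the first center, and replacing that line with a randomly chosen point recovers the original variant. The function and variable names are our own.

```python
import numpy as np

def maximin_seeds(points, k):
    """Maximin seeding: each new center is the point farthest from its nearest existing center."""
    first = int(np.argmax(np.linalg.norm(points, axis=1)))   # Katsavounidis et al.: largest-norm point
    centers = [points[first]]
    # min_dist[i] holds the distance from point i to its nearest center selected so far.
    min_dist = np.linalg.norm(points - centers[0], axis=1)
    for _ in range(1, k):
        nxt = int(np.argmax(min_dist))                        # farthest point from all current centers
        centers.append(points[nxt])
        min_dist = np.minimum(min_dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(centers)
```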
Al-Daoud’s density-based method AlDaoud96 () first uniformly partitions the data space into M disjoint hypercubes. It then randomly selects from each hypercube a number of points proportional to the number of points it contains, so that a total of K centers are obtained. There are two main disadvantages associated with this method. First, it is difficult to decide on an appropriate value for M. Second, the method can have a prohibitive storage complexity, which depends on the number of bits allocated to each attribute.
Bradley and Fayyad’s method Bradley98 () starts by randomly partitioning the data set into J subsets. These subsets are clustered using k-means initialized by MacQueen’s second method, producing J sets of K intermediate centers each. These center sets are combined into a superset, which is then clustered by k-means J times, each time initialized with a different center set. The members of the center set that gives the least SSE are then taken as the final centers.
Pizzuti et al. Pizzuti99 () improved upon Al-Daoud’s density-based method using a multiresolution grid approach. Their method starts with a coarse grid of hypercubes and iteratively splits the hypercubes as the number of points they receive increases. Once the splitting phase is completed, the centers are chosen from the densest hypercubes.
The k-means++ method Arthur07 () interpolates between MacQueen’s second method and the maximin method. It chooses the first center randomly, and the i-th (i ∈ {2, …, K}) center is chosen to be the point x' with a probability of md(x')^2 / \sum_{j=1}^{N} md(x_j)^2, where md(x) denotes the minimum-distance from point x to the previously selected centers. This method yields an O(log K) approximation to the MSSC problem in expectation. The greedy k-means++ method probabilistically selects multiple candidate centers in each round and then greedily selects the one that most reduces the SSE. This modification aims to avoid the unlikely event of choosing two centers that are close to each other.
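The D²-weighted sampling at the heart of k-means++ can be sketched as follows (our own illustrative code, not the authors' implementation); the greedy variant would draw several candidates at the sampling step and keep the one yielding the smallest SSE.

```python
import numpy as np

def kmeanspp_seeds(points, k, rng=None):
    """k-means++ seeding: each new center is drawn with probability proportional to
    the squared distance from the point to its nearest existing center."""
    rng = np.random.default_rng(rng)
    n = len(points)
    centers = [points[rng.integers(n)]]                       # first center: uniform random point
    sq_dist = np.sum((points - centers[0]) ** 2, axis=1)      # squared distance to nearest center
    for _ in range(1, k):
        probs = sq_dist / sq_dist.sum()
        nxt = rng.choice(n, p=probs)                          # D^2-weighted sampling
        centers.append(points[nxt])
        sq_dist = np.minimum(sq_dist, np.sum((points - points[nxt]) ** 2, axis=1))
    return np.array(centers)
```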
The PCA-Part method Su07 () uses a divisive hierarchical approach based on PCA (Principal Component Analysis) Hotelling36 (). Starting from an initial cluster that contains the entire data set, the method iteratively selects the cluster with the greatest SSE and divides it into two subclusters using a hyperplane that passes through the cluster centroid and is orthogonal to the principal eigenvector of the cluster covariance matrix. This procedure is repeated until K clusters are obtained. The centers are then given by the centroids of these clusters. The Var-Part method Su07 () is an approximation to PCA-Part in which the covariance matrix of the cluster to be split is assumed to be diagonal. In this case, the splitting hyperplane is orthogonal to the coordinate axis with the greatest variance.
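A condensed sketch of the Var-Part splitting scheme follows: the cluster with the greatest SSE is repeatedly split at its centroid along the attribute with the greatest variance (PCA-Part would instead split along the principal eigenvector of the cluster covariance matrix). This is our own illustrative rendering under simplifying assumptions, e.g., the handling of degenerate splits.

```python
import numpy as np

def var_part_seeds(points, k):
    """Var-Part: repeatedly split the cluster with the largest SSE along its
    highest-variance attribute, cutting at the cluster centroid."""
    clusters = [points]
    while len(clusters) < k:
        # Pick the cluster with the greatest SSE (sum of squared deviations from its centroid).
        sses = [np.sum((c - c.mean(axis=0)) ** 2) for c in clusters]
        c = clusters.pop(int(np.argmax(sses)))
        dim = int(np.argmax(c.var(axis=0)))          # attribute with the greatest variance
        cut = c[:, dim].mean()                       # splitting hyperplane passes through the centroid
        left, right = c[c[:, dim] <= cut], c[c[:, dim] > cut]
        if len(right) == 0:                          # degenerate split (all values equal): move one point
            left, right = c[:-1], c[-1:]
        clusters.extend([left, right])
    return np.array([c.mean(axis=0) for c in clusters])
```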
Lu et al.’s method Lu08 () uses a two-phase pyramidal approach. The attributes of each point are first encoded as integers using quantization at a given resolution; the quantized points constitute the bottom level of the pyramid. In the bottom-up phase, neighboring data points at each level are averaged to obtain weighted points at the next (coarser) level; the coarsening stops at the highest level that still contains at least K points. The points at this highest level are refined using k-means initialized with the K points that have the largest weights. In the top-down phase, the centers obtained at each level are projected onto the next (finer) level and used to initialize the clustering at that level. The top-down phase terminates when the bottom level is reached; the centers at this level are then inverse quantized to obtain the final centers. The performance of this method degrades with increasing dimensionality Lu08 ().
Onoda et al.’s method Onoda12 () first calculates the K Independent Components (ICs) Hyvarinen99 () of X and then chooses the i-th (i ∈ {1, 2, …, K}) center as the point that has the least cosine distance to the i-th IC.
2.2 Log-linear Time-Complexity Initialization Methods
Hartigan’s method Hartigan79 () first sorts the points according to their distances to the centroid of X. The i-th (i ∈ {1, 2, …, K}) center is then chosen to be the (1 + (i − 1)N/K)-th point in this sorted order. This method is an improvement over MacQueen’s first method in that it is invariant to data ordering and is more likely to produce seeds that are well separated. The computational cost of this method is dominated by the complexity of sorting, which is O(N log N).
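Assuming the (1 + (i − 1)N/K) spacing described above, a short sketch of Hartigan-style seeding might look as follows; the rounding of the indices is our own choice.

```python
import numpy as np

def hartigan_seeds(points, k):
    """Hartigan-style seeding: sort points by distance to the grand centroid and
    pick K points evenly spaced along the sorted order."""
    order = np.argsort(np.linalg.norm(points - points.mean(axis=0), axis=1))
    n = len(points)
    idx = [order[(i * n) // k] for i in range(k)]   # roughly the (1 + (i - 1)N/K)-th sorted points
    return points[idx]
```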
Al-Daoud’s variance-based method AlDaoud05 () first sorts the points on the attribute with the greatest variance and then partitions them into K groups along the same dimension. The centers are then chosen to be the points that correspond to the medians of these groups. Note that this method disregards all attributes but one and therefore is likely to be effective only for data sets in which the variability is concentrated mostly in one dimension.
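A corresponding sketch of the variance-based method (our own illustration, assuming K does not exceed the number of points):

```python
import numpy as np

def aldaoud_variance_seeds(points, k):
    """Al-Daoud's variance-based seeding: sort on the highest-variance attribute and
    take the median point of each of K equal-size groups as a center."""
    dim = int(np.argmax(points.var(axis=0)))        # attribute with the greatest variance
    order = np.argsort(points[:, dim])
    groups = np.array_split(order, k)               # K contiguous groups along that attribute
    return np.array([points[g[len(g) // 2]] for g in groups])   # median point of each group
```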
Redmond and Heneghan’s method Redmond07 () first constructs a kd-tree of the data points to perform density estimation and then uses a modified maximin method to select K centers from densely populated leaf buckets. The computational cost of this method is dominated by the complexity of kd-tree construction, which is O(N log N).
The ROBIN (ROBust INitialization) method AlHasan09 () uses the local outlier factor (LOF) Breunig00 () to avoid selecting outlier points as centers. In iteration i (i ∈ {1, 2, …, K}), the method first sorts the data points in decreasing order of their minimum-distance to the previously selected centers. It then traverses the points in sorted order and selects, as the i-th center, the first point that has an LOF value close to 1. The computational cost of this method is dominated by the complexity of sorting, which is O(N log N).
2.3 Quadratic-Complexity Initialization Methods
Astrahan’s method Astrahan70 () uses two distance thresholds, d1 and d2. It first calculates the density of each point as the number of points within a distance of d1 from it. The points are sorted in decreasing order by their densities and the highest-density point is chosen as the first center. Subsequent centers are chosen in order of decreasing density, subject to the condition that each new center be at least at a distance of d2 from the previously selected centers. This procedure is continued until no more centers can be chosen. Finally, if more than K centers are chosen, hierarchical clustering is used to group the centers until only K of them remain. The main problem with this method is that it is very sensitive to the values of d1 and d2. For example, if d1 is too small there may be many isolated points with zero density, whereas if it is too large a few centers will cover the entire data set Anderberg73 ().
Lance and Williams Lance67 () suggested that the output of a hierarchical clustering algorithm can be used to initialize k-means. Despite the fact that such algorithms often have quadratic or higher complexity, this method is highly recommended in the statistics literature Milligan80a (), possibly due to the limited size of the data sets in this field.
Kaufman and Rousseeuw’s method Kaufman90 () takes the most centrally located point in X as the first center, and the i-th (i ∈ {2, …, K}) center is chosen to be the point that most reduces the SSE. Since pairwise distances between the data points need to be calculated in each iteration, the time complexity of this method is quadratic in N.
Cao et al. Cao09 () formalized Astrahan’s density-based method within the framework of a neighborhood-based rough set model. In this model, the neighborhood of a point is defined as the set of points within a given distance from it according to a particular distance measure. Based on this neighborhood model, the concepts of cohesion and coupling are defined. The former is a measure of the centrality of a point with respect to its neighborhood, whereas the latter is a measure of separation between two neighborhoods. The method first sorts the data points in decreasing order of their cohesion and takes the point with the greatest cohesion as the first center. It then traverses the points in sorted order and takes, as the i-th (i ∈ {2, …, K}) center, the first point whose coupling with the previously selected centers is below a given threshold. The computational cost of this method is dominated by the complexity of the neighborhood calculations, which is quadratic in N.
2.4 Other Initialization Methods
The binary-splitting method Linde80 () takes the centroid of X as the first center. In iteration i (i ∈ {1, 2, …}), each of the existing centers is split into two new centers by subtracting and adding a fixed perturbation vector ε, i.e., c_j − ε and c_j + ε. These new centers are then refined using k-means. There are two main disadvantages associated with this method. First, there is no guidance on the selection of a proper value for ε, which determines the direction of the split Huang93 (). Second, the method is computationally demanding, since after each iteration k-means has to be run on the entire data set.
The directed-search binary-splitting method Huang93 () is an improvement over the binary-splitting method in that it determines the value of ε using PCA. However, it has even higher computational requirements due to the calculation of the principal eigenvector in each iteration.
The global k-means method Likas03 () takes the centroid of X as the first center. In iteration i (i ∈ {2, 3, …, K}), it considers each of the N points in turn as a candidate for the i-th center and runs k-means with i centers on the entire data set. This method is computationally prohibitive for large data sets as it involves on the order of NK runs of k-means on the entire data set.
It should be noted that the two splitting methods and the global k-means method are not initialization methods per se. These methods can be considered as complete clustering methods that utilize k-means as a local search procedure. For this reason, to the best of our knowledge, none of the initialization studies to date has included these methods in their comparisons.
We should also mention IMs based on metaheuristics such as simulated annealing Babu94a () and genetic algorithms Babu93 (). These algorithms start from a random initial configuration (population) and use k-means to evaluate their solutions in each iteration (generation). There are two main disadvantages associated with these methods. First, they involve numerous parameters that are difficult to tune (initial temperature, cooling schedule, population size, crossover/mutation probability, etc.) Jain99 (). Second, due to the large search space, they often require a large number of iterations, which renders them computationally prohibitive for all but the smallest data sets. Interestingly, with the recent developments in combinatorial optimization algorithms, it is now feasible to obtain globally minimum SSE clusterings for small data sets without resorting to metaheuristics Aloise10 ().
2.5 Linear vs. Superlinear Initialization Methods
Based on the descriptions given above, it can be seen that superlinear methods often have more elaborate designs when compared to linear ones. An interesting feature of the superlinear methods is that they are often deterministic, which can be considered as an advantage especially when dealing with large data sets. In contrast, linear methods are often nondeterministic and/or ordersensitive. As a result, it is common practice to perform multiple runs of such methods and take the output of the run that produces the least SSE Bradley98 ().
A frequently cited advantage of the more elaborate methods is that they often lead to faster k-means convergence, i.e., they require fewer iterations, and as a result the time gained during the clustering phase can offset the time lost during the initialization phase Su07 (); Redmond07 (); AlHasan09 (). This may be true when a standard implementation of k-means is used. However, convergence speed may not be as important when a fast k-means variant is used, as such variants often require significantly less time than a standard k-means implementation. In this study, we utilize a fast k-means variant based on the triangle inequality Huang90 () and partial distance elimination Bei85 () techniques. As will be seen in §4, this fast and exact k-means implementation will diminish the computational efficiency differences among the various IMs. In other words, we will demonstrate that elaborate methods that lead to faster k-means convergence are not necessarily more efficient than simple methods with slower convergence.
3 Experimental Setup
3.1 Data Set Descriptions
In order to obtain a comprehensive evaluation of various IMs, we conducted two sets of experiments. The first experiment involved 32 commonly used real data sets with sizes ranging from 214 to 1,904,711 points. Most of these data sets were obtained from the UCI Machine Learning Repository Frank11 () (see Table 1). The second experiment involved a large number of synthetic data sets with varying clustering complexity. We used a recent algorithm proposed by Maitra and Melnykov Maitra10 () to generate these data sets. This algorithm involves the calculation of the exact overlap between each cluster pair, measured in terms of their total probability of misclassification, and the guided simulation of Gaussian mixture components satisfying prespecified overlap characteristics. The algorithm was parameterized by the mean pairwise overlap, the number of points (N), the number of attributes (D), and the number of classes (K).
The mean overlap parameter quantifies the average overlap between pairs of clusters. However, we observed that two synthetic data sets with the same mean overlap can have considerably different clustering complexity. Therefore, we quantified clustering complexity using the following indirect approach. For each data set, we executed the k-means algorithm initialized with the “true” centers given by the cluster generation algorithm and calculated the RAND, VD, and VI measures (see §3.3) upon convergence. The average of these measures was taken as a quantitative indicator of clustering complexity. Note that each of these normalized measures takes values from the [0, 1] interval. For RAND larger values are better, whereas for VD and VI smaller values are better. Therefore, we inverted the RAND values by subtracting them from 1 to make this measure compatible with the other two. Finally, using the aforementioned complexity quantification scheme, we generated an equal number of synthetic data sets from each of the following complexity classes: easy, moderate, and difficult. Figure 1 shows sample data sets from each complexity class.
ID  Data Set  # Points (N)  # Attributes (D)  # Classes (K) 

1  Breast Cancer Wisconsin (Original)  683  9  2 
2  Cloud Cover (DB1)  1,024  10  8† 
3  Concrete Compressive Strength  1,030  9  8† 
4  Corel Image Features  68,040  25  16† 
5  Covertype  581,012  10  7 
6  Ecoli  336  7  8 
7  Steel Plates Faults  1,941  27  7 
8  Glass Identification  214  9  6 
9  Heart Disease  297  13  5 
10  Ionosphere  351  34  2 
11  ISOLET  7,797  617  26 
12  Landsat Satellite (Statlog)  6,435  36  6 
13  Letter Recognition  20,000  16  26 
14  MAGIC Gamma Telescope  19,020  10  2 
15  Multiple Features (Fourier)  2,000  76  10 
16  MiniBooNE Particle Identification  130,064  50  2 
17  Musk (Clean2)  6,598  166  2 
18  Optical Digits  5,620  64  10 
19  Page Blocks Classification  5,473  10  5 
20  Parkinsons  5,875  18  42† 
21  Pen Digits  10,992  16  10 
22  Person Activity  164,860  3  11 
23  Pima Indians Diabetes  768  8  2 
24  Image Segmentation  2,310  19  7 
25  Shuttle (Statlog)  58,000  9  7 
26  SPECTF Heart  267  44  2 
27  Telugu Vowels Pal77 ()  871  3  6 
28  Vehicle Silhouettes (Statlog)  846  18  4 
29  WallFollowing Robot Navigation  5,456  24  4 
30  Wine Quality  6,497  11  7 
31  World TSP Cook11 ()  1,904,711  2  7† 
32  Yeast  1,484  8  10 

† Due to the unavailability of class labels, for data sets #2, #3, and #4 the value of K was chosen arbitrarily, whereas for #20 and #31 it was determined based on domain knowledge.
3.2 Attribute Normalization
In clustering tasks, normalization is a common preprocessing step that is necessary to prevent attributes with large ranges from dominating the distance calculations and also to avoid numerical instabilities in the computations. Two commonly used normalization schemes are linear scaling to unit range (min-max normalization) and linear scaling to unit variance (z-score normalization). Several studies have revealed that the former scheme is preferable to the latter, since the latter is likely to eliminate valuable between-cluster variation Milligan88 (); Su07 (). As a result, we used min-max normalization to map the attributes of each real data set to the [0, 1] interval. Note that the attributes of the synthetic data sets were already normalized by the cluster generation algorithm.
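For reference, min-max normalization amounts to the following one-pass rescaling; the handling of constant attributes is our own assumption.

```python
import numpy as np

def minmax_normalize(points):
    """Linearly scale each attribute to the [0, 1] range (constant attributes map to 0)."""
    lo = points.min(axis=0)
    span = points.max(axis=0) - lo
    span[span == 0] = 1.0          # avoid division by zero for constant attributes
    return (points - lo) / span
```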
3.3 Performance Criteria
The performance of the IMs was measured using five effectiveness (quality) and two efficiency (speed) criteria:

Initial SSE: This is the SSE value calculated after the initialization phase, before the clustering phase. It gives us a measure of the effectiveness of an IM by itself.

Final SSE: This is the SSE value calculated after the clustering phase. It gives us a measure of the effectiveness of an IM when its output is refined by k-means. Note that this is the objective function of the k-means algorithm, i.e., (1).

Normalized Rand (RAND) Hubert85 (), van Dongen (VD) vanDongen00 (), and Variation of Information (VI) Meila07 () criteria: These are external validity measures that quantify the extent to which the clustering structure discovered by a clustering algorithm matches some external structure, e.g., one specified by the given class labels Wu09a (); Wu09c (). In a recent comprehensive study, these three measures were found to be the best among 16 external validity measures Wu09a (). Note that each of these normalized measures takes values from the [0, 1] interval.

Number of Iterations: This is the number of iterations that k-means requires until reaching convergence when initialized by a particular IM. It is an efficiency measure independent of programming language, implementation style, compiler, and CPU architecture.

CPU Time: This is the total CPU time taken by the initialization and clustering phases. This criterion is reported only for the real data sets.
All of the methods were implemented in the C language, compiled with the gcc v4.4.3 compiler, and executed on an Intel Xeon E5520 2.26GHz machine. Time measurements were performed using the getrusage function, which is capable of measuring CPU time to an accuracy of a microsecond. The MT19937 variant of the Mersenne Twister algorithm was used to generate high-quality pseudorandom numbers Matsumoto98 ().
The convergence of k-means was controlled by the disjunction of two criteria: the number of iterations reaches a predefined maximum, or the relative improvement in SSE between two consecutive iterations drops below a convergence threshold ε, i.e., (SSE_{t−1} − SSE_t) / SSE_{t−1} ≤ ε, where SSE_t denotes the SSE value at the end of the t-th (t ∈ {1, 2, …}) iteration.
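The disjunctive stopping rule can be expressed as the following small helper; the max_iter and eps values shown are illustrative placeholders, not the settings used in the study.

```python
def converged(prev_sse, curr_sse, iteration, max_iter=100, eps=1e-6):
    """Disjunctive k-means stopping rule: iteration cap OR negligible relative SSE improvement.
    max_iter and eps here are placeholder values for illustration only."""
    if iteration >= max_iter:
        return True
    return (prev_sse - curr_sse) <= eps * prev_sse
```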
4 Experimental Results and Discussion
In this study, we focus on IMs that have time complexity linear in N. This is because k-means itself has linear complexity, which is perhaps the most important reason for its popularity. Therefore, an IM for k-means should not diminish this advantage of the algorithm. Eight commonly used, order-invariant IMs were included in the experiments: Forgy’s method (F), MacQueen’s second method (M), maximin (X), Bradley and Fayyad’s method (B), k-means++ (K), greedy k-means++ (G), Var-Part (V), and PCA-Part (P). It should be noted that among these methods only V and P are deterministic.
In the experiments, each nondeterministic method was executed multiple times and statistics such as the minimum, mean, and standard deviation were collected for the effectiveness criteria. In each run, the number of clusters (K) was set equal to the number of classes, as is common in the related literature Bradley98 (); Pizzuti99 (); He04 (); Arthur07 (); Su07 (); AlHasan09 (); Cao09 ().
Tables 2 and 3 give the Final SSE and CPU time (in milliseconds) results for the real data sets, respectively. Note that, due to space limitations, only mean values are reported for the CPU time criterion. In order to determine whether there are any statistically significant differences among the methods, we employed two nonparametric statistical tests Garcia08 (): the Friedman test Friedman37 () and the Iman & Davenport test Iman80 (). These tests are alternatives to the parametric two-way analysis of variance (ANOVA) test. Their advantage over ANOVA is that they do not require normality or homoscedasticity, assumptions that are often violated in machine learning studies Luengo09 (); Garcia09 ().
Given n blocks (subjects) and k treatments (measurements), the null hypothesis (H0) of the Friedman test is that the populations within a block are identical. The alternative hypothesis (H1) is that at least one treatment tends to yield larger (or smaller) values than at least one other treatment. The test statistic is calculated as follows Daniel00 (). In the first step, the observations within each block are ranked separately, so each block contains a separate set of ranks. If ties occur, the tied observations are given the mean of the rank positions for which they are tied. If H0 is true, the ranks in each block should be randomly distributed over the columns (treatments). Otherwise, we expect a lack of randomness in this distribution. For example, if a particular treatment is better than the others, we expect large (or small) ranks to ‘favor’ that column. In the second step, the ranks in each column are summed. If H0 is true, we expect the rank sums to be fairly close — so close that we can attribute the differences to chance. Otherwise, we expect to see at least one difference between pairs of rank sums so large that we cannot reasonably attribute it to sampling variability. The test statistic is given as:
\chi_r^2 = \frac{12}{n k (k+1)} \sum_{j=1}^{k} R_j^2 - 3 n (k+1)    (3)
where R_j (j ∈ {1, 2, …, k}) is the rank sum of the j-th column. χ_r² is approximately chi-square distributed with k − 1 degrees of freedom. H0 is rejected at the α level of significance if the value of (3) is greater than or equal to the critical chi-square value for k − 1 degrees of freedom. Iman and Davenport Iman80 () proposed the following statistic:
F_F = \frac{(n-1)\, \chi_r^2}{n(k-1) - \chi_r^2}    (4)
which is distributed according to the F-distribution with k − 1 and (k − 1)(n − 1) degrees of freedom. When compared to χ_r², this statistic is not only less conservative, but also more accurate for small sample sizes Iman80 ().
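For illustration, the two statistics can be computed from a blocks-by-treatments score matrix as follows; this is our own sketch using SciPy's rankdata for average-rank tie handling, and the function name and the toy input layout are assumptions.

```python
import numpy as np
from scipy.stats import rankdata

def friedman_iman_davenport(scores):
    """Compute the Friedman chi-square statistic (3) and the Iman-Davenport F statistic (4).

    scores : (n, k) array, one row per block (data set), one column per treatment (method);
             ranks are assigned within each row, with ties receiving average ranks.
    """
    n, k = scores.shape
    ranks = np.apply_along_axis(rankdata, 1, scores)   # rank within each block
    rank_sums = ranks.sum(axis=0)                      # R_j for each treatment
    chi2 = 12.0 / (n * k * (k + 1)) * np.sum(rank_sums ** 2) - 3.0 * n * (k + 1)
    f_stat = (n - 1) * chi2 / (n * (k - 1) - chi2)
    return chi2, f_stat
```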
In this study, blocks and treatments correspond to data sets and initialization methods, respectively. Our goal is to determine whether or not there is at least one method that is significantly better than at least one other method at the chosen significance level. If this is the case, we will conduct a post-hoc (multiple comparison) test to determine which pairs of methods differ significantly. For this purpose, we will use the Bergmann-Hommel test Bergmann88 (), a powerful post-hoc procedure that has been used successfully in various machine learning studies Garcia08 (); Voss10 ().
4.1 Real Data Sets
Table 4 gives the Final SSE rankings of the IMs for the real data sets as determined by the Bergmann-Hommel procedure using the data given in Table 2. In this notation, methods grouped together are statistically indistinguishable from one another, and groups are ordered by performance; for example, a ranking may indicate that there is no statistically significant difference between methods D and E and that these two methods are significantly better than method C. From Table 4 it can be seen that the methods cannot be distinguished from each other reliably. This was expected, since even nonparametric post-hoc tests lack discrimination power in small sample cases (recall that only 32 data sets were used) with a large number of ties (see Table 2). For example, with respect to the minimum statistic, the performances of F, M, B, K, and G are statistically indistinguishable. In other words, if we initialize k-means with each of these nondeterministic methods and execute it until convergence, the resulting clusterings over multiple runs will have very similar minimum Final SSE values. Similar trends were observed for the RAND, VD, and VI criteria (data not shown). Given the abundance of local minima even in data sets of moderate size and/or dimensionality and the gradient descent nature of k-means, it is not surprising that the deterministic methods V and P are outperformed by most of the nondeterministic methods, as the former methods were executed only once, whereas the latter ones were executed multiple times.
As mentioned earlier, the minimum statistic is meaningful only when it is practical to execute k-means multiple times. Otherwise, the mean statistic is more meaningful. The analysis of the mean Final SSE results using the Bergmann-Hommel procedure reveals that the deterministic methods V and P are the preferred choices in this case. This is not surprising, since nondeterministic methods, in particular those that are ad hoc in nature, often produce highly variable results across multiple runs.
The standard deviation statistic characterizes the reliability of a nondeterministic IM with respect to a particular performance criterion. In other words, if a nondeterministic IM obtains a low mean and a low standard deviation with respect to an effectiveness criterion, we do not have to execute this method a large number of times to obtain good results. The analysis of the Final SSE standard deviations reveals two overlapping groups of methods. Once again, this is not necessarily because the members of each group are in fact indistinguishable with respect to their reliability, but rather due to the relatively small sample size used. In summary, due to the necessarily small number of real-world data sets available for clustering studies, it may not be possible to distinguish among various IMs. Therefore, it is crucial that these IMs be tested on a large number of synthetic data sets (see §4.2).
As for computational efficiency, it can be seen from Table 3 that, in general, the IMs have similar computational requirements per run. However, in practice, a nondeterministic method is typically executed multiple times and the output of the run that gives the least SSE is taken as the result. Therefore, the total computational cost of a nondeterministic method is often significantly higher than that of a deterministic method. As predicted in §2.5, simple methods such as M require about the same CPU time as elaborate methods such as G. This is because simple methods often lead to more k-means iterations, whereas elaborate ones compensate for their computational overhead by requiring fewer k-means iterations. It should be noted that efficiency differences among the methods can be further reduced by using faster k-means variants such as those described in Kanungo02 (); Hamerly10 ().
4.2 Synthetic Data Sets
Table 5 gives the ranking of the IMs with respect to the minimum statistic. It can be seen that despite variations in rankings across the performance criteria, some general trends emerge:

Nondeterministic methods outperform the deterministic ones, i.e., V and P, except in the case of Initial SSE. As explained in §4.1, this is due to the fact that the nondeterministic methods were executed multiple times, whereas the deterministic ones were executed only once. The reason why the deterministic methods have good Initial SSE performance is that these methods are approximate (divisive hierarchical) clustering methods by themselves and thus give reasonable results even without k-means refinement.

Method B consistently appears in the best performing group, whereas methods F and X are often among the worst nondeterministic methods.

Method M exhibits moderate-to-good performance except in the case of Initial SSE. Recall that this method randomly selects the initial centers from among the data points and therefore it cannot be expected to perform well without k-means refinement.

Methods K and G generally perform well. In some cases the latter outperforms the former, whereas in others they have comparable performance.
Table 6 gives the ranking of the IMs with respect to the mean statistic. It can be seen that despite variations in rankings across the performance criteria, some general trends emerge:

Deterministic methods, i.e., V and P, generally outperform the nondeterministic ones. As explained in §4.1, this is due to the fact that the nondeterministic methods can produce highly variable results across multiple runs. Method B is highly competitive with the deterministic methods.

Methods M and X are often among the worst performers, whereas methods F and K exhibit moderate-to-bad performance.

Method G is often significantly better than all nondeterministic methods but B.
Table 7 gives the ranking of the nondeterministic IMs with respect to the standard deviation statistic. It can be seen that despite variations in rankings across the performance criteria, some general trends emerge:

Method B consistently appears in the best performing group, whereas method M is often among the worst performers.

Methods X and K exhibit moderate-to-bad performance.

Methods F and G are significantly better than all methods but B.
4.3 Recommendations for Practitioners
Based on the statistical analyses presented in the previous section, the following recommendations can be made:

In general, methods F, M, and X should not be used. These methods are easy to understand and implement, but they are often ineffective and unreliable. Furthermore, despite their low overhead, these methods do not offer significant time savings since they often result in slower k-means convergence.

In time-critical applications that involve large data sets, or in applications that demand determinism, methods V or P should be used. These methods need to be executed only once and they lead to very fast k-means convergence. The efficiency difference between the two is noticeable only on high-dimensional data sets. This is because method V calculates the direction of split by determining the coordinate axis with the greatest variance (a linear-time operation in the number of points and attributes), whereas method P achieves this by calculating the principal eigenvector of the covariance matrix (a more expensive operation, performed using the power method Hotelling36 ()). Note that despite its higher computational complexity, method P can, in some cases, be more efficient than method V (see Table 3). This is because the former converges significantly faster than the latter (see Table 6). The main disadvantage of these methods is that they are more complicated to implement due to their hierarchical formulation.

In applications that involve small data sets, methods B or G should be used. It is computationally feasible to run these methods hundreds of times on such data sets, given that one such run takes only a few milliseconds.
5 Conclusions
In this paper we presented an overview of k-means initialization methods with an emphasis on their computational efficiency. We then compared eight commonly used linear-time initialization methods on a large and diverse collection of real and synthetic data sets using various performance criteria. Finally, we analyzed the experimental results using nonparametric statistical tests and provided recommendations for practitioners. Our statistical analyses revealed that popular initialization methods such as Forgy’s, MacQueen’s, and the maximin method often perform poorly and that there are significantly better alternatives to these methods that have comparable computational requirements.
6 Acknowledgments
This publication was made possible by grants from the Louisiana Board of Regents (LEQSF200811RDA12) and National Science Foundation (0959583, 1117457). The authors are grateful to D. Arthur, S. Vassilvitskii, J.G. Dy, S.J. Redmond, J.F. Lu, and M. Al Hasan for clarifying various points about their papers.
References
 (1) A. K. Jain, M. N. Murty, P. J. Flynn, Data Clustering: A Review, ACM Computing Surveys 31 (3) (1999) 264–323.
 (2) A. K. Jain, Data Clustering: 50 Years Beyond Kmeans, Pattern Recognition Letters 31 (8) (2010) 651–666.
 (3) L. Kaufman, P. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, WileyInterscience, 1990.
 (4) D. Aloise, A. Deshpande, P. Hansen, P. Popat, NPHardness of Euclidean SumofSquares Clustering, Machine Learning 75 (2) (2009) 245–248.
 (5) M. Mahajan, P. Nimbhorkar, K. Varadarajan, The Planar kMeans Problem is NPhard, Theoretical Computer Science 442 (2012) 13–21.
 (6) A. Tarsitano, A Computational Study of Several Relocation Methods for KMeans Algorithms, Pattern Recognition 36 (12) (2003) 2955–2966.
 (7) S. Lloyd, Least Squares Quantization in PCM, IEEE Trans. on Information Theory 28 (2) (1982) 129–136.
 (8) P. S. Bradley, U. Fayyad, Refining Initial Points for KMeans Clustering, in: Proc. of the 15th Int. Conf. on Machine Learning, 1998, pp. 91–99.
 (9) M. Dash, H. Liu, X. Xu, ’1 + 1 > 2’: Merging Distance and Density Based Clustering, in: Proc. of the 7th Int. Conf. on Database Systems for Advanced Applications, 2001, pp. 32–39.
 (10) W. Y. Chen, Y. Song, H. Bai, C. J. Lin, E. Y. Chang, Parallel Spectral Clustering in Distributed Systems, IEEE Trans. on Pattern Analysis and Machine Intelligence 33 (3) (2011) 568–586.
 (11) T. Kanungo, D. Mount, N. Netanyahu, C. Piatko, R. Silverman, A. Wu, An Efficient KMeans Clustering Algorithm: Analysis and Implementation, IEEE Trans. on Pattern Analysis and Machine Intelligence 24 (7) (2002) 881–892.
 (12) G. Hamerly, Making kmeans Even Faster, in: Proc. of the 2010 SIAM Int. Conf. on Data Mining, 2010, pp. 130–140.
 (13) T. W. Chen, S. Y. Chien, Bandwidth Adaptive Hardware Architecture of KMeans Clustering for Video Analysis, IEEE Trans. on Very Large Scale Integration (VLSI) Systems 18 (6) (2010) 957–966.
 (14) C. Ordonez, E. Omiecinski, Efficient DiskBased KMeans Clustering for Relational Databases, IEEE Trans. on Knowledge and Data Engineering 16 (8) (2004) 909–921.
 (15) S. Z. Selim, M. A. Ismail, KMeansType Algorithms: A Generalized Convergence Theorem and Characterization of Local Optimality, IEEE Trans. on Pattern Analysis and Machine Intelligence 6 (1) (1984) 81–87.
 (16) L. Bottou, Y. Bengio, Advances in Neural Information Processing Systems 7, MIT Press, 1995, Ch. Convergence Properties of the KMeans Algorithms, pp. 585–592.
 (17) L. Vendramin, R. J. G. B. Campello, E. R. Hruschka, Relative Clustering Validity Criteria: A Comparative Overview, Statistical Analysis and Data Mining 3 (4) (2010) 209–235.
 (18) J. Mao, A. K. Jain, A SelfOrganizing Network for Hyperellipsoidal Clustering (HEC), IEEE Trans. on Neural Networks 7 (1) (1996) 16–29.
 (19) J. S. Zhang, Y.W. Leung, Robust Clustering by Pruning Outliers, IEEE Trans. on Systems, Man, and Cybernetics – Part B 33 (6) (2003) 983–999.
 (20) M. E. Celebi, Improving the Performance of Kmeans for Color Quantization, Image and Vision Computing 29 (4) (2011) 260–271.
 (21) J. M. Pena, J. A. Lozano, P. Larranaga, An Empirical Comparison of Four Initialization Methods for the KMeans Algorithm, Pattern Recognition Letters 20 (10) (1999) 1027–1040.
 (22) J. He, M. Lan, C. L. Tan, S. Y. Sung, H. B. Low, Initialization of Cluster Refinement Algorithms: A Review and Comparative Study, in: Proc. of the 2004 IEEE Int. Joint Conf. on Neural Networks, 2004, pp. 297–302.
 (23) E. Forgy, Cluster Analysis of Multivariate Data: Efficiency vs. Interpretability of Classification, Biometrics 21 (1965) 768.
 (24) M. R. Anderberg, Cluster Analysis for Applications, Academic Press, 1973.
 (25) R. C. Jancey, Multidimensional Group Analysis, Australian Journal of Botany 14 (1) (1966) 127–130.
 (26) J. MacQueen, Some Methods for Classification and Analysis of Multivariate Observations, in: Proc. of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1967, pp. 281–297.
 (27) M. J. Norušis, IBM SPSS Statistics 19 Statistical Procedures Companion, Addison Wesley, 2011.
 (28) G. H. Ball, D. J. Hall, A Clustering Technique for Summarizing Multivariate Data, Behavioral Science 12 (2) (1967) 153–155.
 (29) J. T. Tou, R. C. Gonzales, Pattern Recognition Principles, AddisonWesley, 1974.
 (30) SAS Institute Inc., SAS/STAT 9.2 User’s Guide, SAS Publishing, 2009.
 (31) H. Späth, Computational Experiences with the Exchange Method: Applied to Four Commonly Used Partitioning Cluster Analysis Criteria, European Journal of Operational Research 1 (1) (1977) 23–31.
 (32) T. Gonzalez, Clustering to Minimize the Maximum Intercluster Distance, Theoretical Computer Science 38 (2–3) (1985) 293–306.
 (33) I. Katsavounidis, C.C. J. Kuo, Z. Zhang, A New Initialization Technique for Generalized Lloyd Iteration, IEEE Signal Processing Letters 1 (10) (1994) 144–146.
 (34) M. AlDaoud, S. Roberts, New Methods for the Initialisation of Clusters, Pattern Recognition Letters 17 (5) (1996) 451–455.
 (35) C. Pizzuti, D. Talia, G. Vonella, A Divisive Initialisation Method for Clustering Algorithms, in: Proc. of the 3rd European Conf. on Principles and Practice of Knowledge Discovery in Databases, 1999, pp. 484–491.
 (36) D. Arthur, S. Vassilvitskii, kmeans++: The Advantages of Careful Seeding, in: Proc. of the 18th Annual ACMSIAM Symposium on Discrete Algorithms, 2007, pp. 1027–1035.
 (37) T. Su, J. G. Dy, In Search of Deterministic Methods for Initializing KMeans and Gaussian Mixture Clustering, Intelligent Data Analysis 11 (4) (2007) 319–338.
 (38) H. Hotelling, Simplified Calculation of Principal Components, Psychometrika 1 (1) (1936) 27–35.
 (39) J. F. Lu, J. B. Tang, Z. M. Tang, J. Y. Yang, Hierarchical Initialization Approach for KMeans Clustering, Pattern Recognition Letters 29 (6) (2008) 787–795.
 (40) T. Onoda, M. Sakai, S. Yamada, Careful Seeding Method based on Independent Components Analysis for kmeans Clustering, Journal of Emerging Technologies in Web Intelligence 4 (1) (2012) 51–59.
 (41) A. Hyvärinen, Fast and Robust FixedPoint Algorithms for Independent Component Analysis, IEEE Trans. on Neural Networks 10 (3) (1999) 626–634.
 (42) J. A. Hartigan, M. A. Wong, Algorithm AS 136: A KMeans Clustering Algorithm, Journal of the Royal Statistical Society C 28 (1) (1979) 100–108.
 (43) M. AlDaoud, A New Algorithm for Cluster Initialization, in: Proc. of the 2nd World Enformatika Conf., 2005, pp. 74–76.
 (44) S. J. Redmond, C. Heneghan, A Method for Initialising the Kmeans Clustering Algorithm Using kdtrees, Pattern Recognition Letters 28 (8) (2007) 965–973.
 (45) M. Al Hasan, V. Chaoji, S. Salem, M. Zaki, Robust Partitional Clustering by Outlier and Density Insensitive Seeding, Pattern Recognition Letters 30 (11) (2009) 994–1002.
 (46) M. M. Breunig, H. P. Kriegel, R. T. Ng, J. Sander, LOF: Identifying DensityBased Local Outliers, ACM SIGMOD Record 29 (2) (2000) 93–104.
 (47) M. M. Astrahan, Speech Analysis by Clustering, or the Hyperphoneme Method, Tech. Rep. AIM124, Stanford University (1970).
 (48) G. N. Lance, W. T. Williams, A General Theory of Classificatory Sorting Strategies  II. Clustering Systems, The Computer Journal 10 (3) (1967) 271–277.
 (49) G. W. Milligan, An Examination of the Effect of Six Types of Error Perturbation on Fifteen Clustering Algorithms, Psychometrika 45 (3) (1980) 325–342.
 (50) F. Cao, J. Liang, G. Jiang, An Initialization Method for the KMeans Algorithm Using Neighborhood Model, Computers and Mathematics with Applications 58 (3) (2009) 474–483.
 (51) Y. Linde, A. Buzo, R. Gray, An Algorithm for Vector Quantizer Design, IEEE Trans. on Communications 28 (1) (1980) 84–95.
 (52) C. M. Huang, R. W. Harris, A Comparison of Several Vector Quantization Codebook Generation Approaches, IEEE Trans. on Image Processing 2 (1) (1993) 108–112.
 (53) A. Likas, N. Vlassis, J. Verbeek, The Global KMeans Clustering Algorithm, Pattern Recognition 36 (2) (2003) 451–461.
 (54) G. Babu, M. Murty, Simulated Annealing for Selecting Optimal Initial Seeds in the KMeans Algorithm, Indian Journal of Pure and Applied Mathematics 25 (1–2) (1994) 85–94.
 (55) G. P. Babu, M. N. Murty, A NearOptimal Initial Seed Value Selection in KMeans Algorithm Using a Genetic Algorithm, Pattern Recognition Letters 14 (10) (1993) 763–769.
 (56) D. Aloise, P. Hansen, L. Liberti, An Improved Column Generation Algorithm for Minimum SumofSquares Clustering, Mathematical Programming (2010) 1–26.
 (57) S. H. Huang, S. H. Chen, Fast Encoding Algorithm for VQbased Image Coding, Electronics Letters 26 (19) (1990) 1618–1619.
 (58) C. D. Bei, R. M. Gray, An Improvement of the Minimum Distortion Encoding Algorithm for Vector Quantization, IEEE Trans. on Communications 33 (10) (1985) 1132–1133.
 (59) A. Frank, A. Asuncion, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml, University of California, Irvine, School of Information and Computer Sciences (2011).
 (60) R. Maitra, V. Melnykov, Simulating Data to Study Performance of Finite Mixture Modeling and Clustering Algorithms, Journal of Computational and Graphical Statistics 19 (2) (2010) 354–376.
 (61) S. K. Pal, D. D. Majumder, Fuzzy Sets and Decision Making Approaches in Vowel and Speaker Recognition, IEEE Trans. on Systems, Man, and Cybernetics 7 (8) (1977) 625–629, http://www.isical.ac.in/~sushmita/patterns/vowel.dat.
 (62) W. Cook, World TSP, http://www.tsp.gatech.edu/world, Georgia Institute of Technology (2011).
 (63) G. Milligan, M. C. Cooper, A Study of Standardization of Variables in Cluster Analysis, Journal of Classification 5 (2) (1988) 181–204.
 (64) L. Hubert, P. Arabie, Comparing Partitions, Journal of Classification 2 (1) (1985) 193–218.
 (65) S. van Dongen, Performance Criteria for Graph Clustering and Markov Cluster Experiments, Tech. Rep. INSR0012, Centrum voor Wiskunde en Informatica (2000).
 (66) M. Meilă, Comparing Clusterings — An Information Based Distance, Journal of Multivariate Analysis 98 (5) (2007) 873–895.
 (67) J. Wu, H. Xiong, J. Chen, Adapting the Right Measures for Kmeans Clustering, in: Proc. of the 15th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2009, pp. 877–885.
 (68) J. Wu, J. Chen, H. Xiong, M. Xie, External Validation Measures for KMeans Clustering: A Data Distribution Perspective, Expert Systems With Applications 36 (3) (2009) 6050–6061.
 (69) M. Matsumoto, T. Nishimura, Mersenne Twister: A 623Dimensionally Equidistributed Uniform PseudoRandom Number Generator, ACM Trans. on Modeling and Computer Simulation 8 (1) (1998) 3–30.
 (70) S. Garcia, F. Herrera, An Extension on ”Statistical Comparisons of Classifiers over Multiple Data Sets” for all Pairwise Comparisons, Journal of Machine Learning Research 9 (2008) 2677–2694.
 (71) M. Friedman, The Use of Ranks to Avoid the Assumption of Normality Implicit in the Analysis of Variance, Journal of the American Statistical Association 32 (200) (1937) 675–701.
 (72) R. L. Iman, J. M. Davenport, Approximations of the Critical Region of the Friedman Statistic, Communications in Statistics  Theory and Methods 9 (6) (1980) 571–595.
 (73) J. Luengo, S. Garcia, F. Herrera, A Study on the Use of Statistical Tests for Experimentation with Neural Networks: Analysis of Parametric Test Conditions and NonParametric Tests, Expert Systems With Applications 36 (4) (2009) 7798–7808.
 (74) S. Garcia, A. Fernandez, J. Luengo, F. Herrera, A Study of Statistical Techniques and Performance Measures for GeneticsBased Machine Learning: Accuracy and Interpretability, Soft Computing 13 (10) (2009) 959–977.
 (75) W. W. Daniel, Applied Nonparametric Statistics, Duxbury Press, 2000.
 (76) B. Bergmann, G. Hommel, Multiple Hypotheses Testing, Springer, 1988, Ch. Improvements of General Multiple Test Procedures for Redundant Systems of Hypotheses, pp. 100–115.
 (77) T. Voss, N. Hansen, C. Igel, Improved Step Size Adaptation for the MOCMAES, in: Proc. of the 12th Annual Conf. on Genetic and Evolutionary Computation, 2010, pp. 487–494.
F  M  X  B  K  G  V  P  
1  min  239  239  239  239  239  239  239  239 
mean  2390  2390  2390  2390  2390  2390  2390  2390  
2  min  38  38  41  38  38  38  39  38 
mean  444  401  453  391  391  391  390  380  
3  min  167  167  167  167  167  167  173  167 
mean  1768  1767  1726  1714  1745  1734  1730  1670  
4  min  10057  10057  10057  10057  10058  10058  10101  10100 
mean  1011579  1008020  1007715  1007618  1008323  1008018  101010  101000  
5  min  66224  66224  66224  66224  66224  66224  66238  66238 
mean  66990890  671961048  67350834  66431360  66948876  66930773  662380  662380  
6  min  17  17  19  17  17  17  17  18 
mean  192  193  201  181  181  181  170  180  
7  min  1167  1167  1267  1167  1167  1167  1167  1168 
mean  123183  1250103  130353  118425  123061  119835  11670  11680  
8  min  18  18  19  18  18  18  19  19 
mean  201  202  222  201  202  201  190  190  
9  min  243  243  243  243  243  243  248  243 
mean  2518  2528  2538  2517  2528  2497  2480  2430  
10  min  629  629  629  629  629  629  629  629 
mean  6290  63328  67181  63739  63534  63535  6290  6290  
11  min  117891  117924  120898  117863  117719  117995  118495  118386 
mean  1199311060  1196551061  1233881264  119050699  119538894  119176710  1184950  1183860  
12  min  1742  1742  1742  1742  1742  1742  1742  1742 
mean  17420  17420  17420  17420  174428  174741  17420  17420  
13  min  2723  2718  2721  2719  2718  2715  2735  2745 
mean  277528  275622  276517  274215  275418  275217  27350  27450  
14  min  2923  2923  2923  2923  2923  2923  2923  2923 
mean  29230  29230  29230  29230  29230  29230  29230  29230  
15  min  3127  3128  3180  3127  3128  3127  3137  3214 
mean  316430  316828  324722  315729  317333  314920  31370  32140  
16  min  2802  2802  2802  2802  2802  2802  21983  2802 
mean  82298685  125189667  28020  119359656  57226944  37744236  219830  28020  
17  min  36373  36373  36373  36373  36373  36373  36373  36373 
mean  377552829  37046916  36738754  371521340  374401906  371031639  363730  363730  
18  min  14559  14559  14559  14559  14559  14559  14581  14807 
mean  14653140  14763273  14774293  1462766  14735234  14719214  145810  148070  
19  min  215  215  230  215  215  215  227  215 
mean  2174  2174  25432  2196  21910  2174  2270  2150  
20  min  235  219  233  218  217  217  220  219 
mean  2517  2242  2413  2222  2202  2191  2200  2190  
21  min  4930  4930  4930  4930  4930  4930  4930  5004 
mean  5130131  5091110  5036106  501270  5111116  504675  49300  50040  
22  min  1177  1177  1195  1177  1177  1177  1182  1177 
mean  117910  118718  120425  118212  119327  118314  11820  11770  
23  min  121  121  121  121  121  121  121  121 
mean  1212  1225  1223  1223  1225  1225  1210  1210  
24  min  387  387  411  387  387  387  410  405 
mean  40723  41420  43021  40216  41019  40213  4100  4050  
25  min  235  235  411  235  235  235  235  274 
mean  30739  27523  930105  24418  27139  24621  2350  2740  
26  min  214  214  214  214  214  214  214  214 
mean  2140  2140  2140  2140  2140  2140  2140  2140  
27  min  22  22  22  22  22  22  23  23 
mean  232  231  230  231  231  230  230  230  
28  min  223  223  224  223  223  223  224  224 
mean  2242  2264  2371  2286  2265  2253  2240  2240  
29  min  7772  7772  7772  7772  7772  7772  7774  7774 
mean  779891  7808102  7854160  77731  7831140  7811106  77740  77740  
30  min  334  334  348  334  334  334  335  334 
mean  3352  3362  37417  3375  3363  3363  3350  3340  
31  min  11039  11039  11039  11039  11039  11039  11483  12422 
mean  140411686  123671057  11714627  11128231  11773872  11493626  114830  124220  
32  min  58  58  61  58  58  58  69  59 
mean  645  706  611  666  635  591  690  590 
F  M  X  B  K  G  V  P  
1  0  0  0  0  0  0  0  0 
2  3  3  2  4  3  2  10  0 
3  2  2  2  4  2  2  0  10 
4  2295  2248  2173  3624  2332  2459  1900  2540 
5  2183  2229  2714  3604  2273  2274  1730  2120 
6  0  0  0  1  0  0  0  0 
7  8  8  7  12  9  9  0  20 
8  0  0  0  1  0  0  0  0 
9  0  1  0  1  0  1  0  0 
10  0  0  0  1  0  0  0  0 
11  3730  3469  2469  4063  3537  3915  6940  12200 
12  28  32  40  40  34  32  40  50 
13  700  693  729  852  698  693  950  800 
14  20  19  28  30  21  19  30  20 
15  54  56  62  70  57  58  30  70 
16  252  283  58  417  112  96  230  570 
17  22  20  24  34  25  26  20  220 
18  112  116  131  137  121  125  60  140 
19  9  10  7  12  8  10  10  10 
20  109  100  110  150  97  118  60  70 
21  59  53  52  67  56  59  30  50 
22  430  524  314  718  469  513  730  480 
23  1  1  1  1  1  0  0  0 
24  6  5  7  8  6  7  10  0 
25  84  82  19  122  79  81  100  80 
26  0  0  0  1  0  0  10  0 
27  1  1  0  1  1  1  0  0 
28  3  2  1  3  2  3  10  10 
29  20  20  20  28  19  21  10  20 
30  38  38  24  51  36  38  60  40 
31  748  949  1377  2439  918  1044  580  840 
32  5  6  5  9  6  6  0  0 
Statistic  IM Ranking 

Minimum  
Mean  
Standard Deviation 
Data Set Complexity  IM Ranking 

Initial SSE  
Easy  
Moderate  
Difficult  
Final SSE  
Easy  
Moderate  
Difficult  
Final RAND  
Easy  
Moderate  
Difficult  
Final VD  
Easy  
Moderate  
Difficult  
Final VI  
Easy  
Moderate  
Difficult  
Number of Iterations  
Easy  
Moderate  
Difficult 
Data Set Complexity  IM Ranking 

Initial SSE  
Easy  
Moderate  
Difficult  
Final SSE  
Easy  
Moderate  
Difficult  
Final RAND  
Easy  
Moderate  
Difficult  
Final VD  
Easy  
Moderate  
Difficult  
Final VI  
Easy  
Moderate  
Difficult  
Number of Iterations  
Easy  
Moderate  
Difficult 
Data Set Complexity  IM Ranking 

Initial SSE  
Easy  
Moderate  Same as Easy 
Difficult  
Final SSE  
Easy  
Moderate  
Difficult  
Final RAND  
Easy  
Moderate  
Difficult  
Final VD  
Easy  
Moderate  
Difficult  
Final VI  
Easy  
Moderate  
Difficult  
Number of Iterations  
Easy  
Moderate  
Difficult 