A Comparative Study of Efficient Initialization Methods for the K-Means Clustering Algorithm

M. Emre Celebi Department of Computer Science
Louisiana State University, Shreveport, LA, USA
ecelebi@lsus.edu
Hassan A. Kingravi School of Electrical and Computer Engineering
Georgia Institute of Technology, Atlanta, GA, USA
kingravi@gatech.edu
Patricio A. Vela School of Electrical and Computer Engineering
Georgia Institute of Technology, Atlanta, GA, USA
pvela@gatech.edu
Abstract

K-means is undoubtedly the most widely used partitional clustering algorithm. Unfortunately, due to its gradient descent nature, this algorithm is highly sensitive to the initial placement of the cluster centers. Numerous initialization methods have been proposed to address this problem. In this paper, we first present an overview of these methods with an emphasis on their computational efficiency. We then compare eight commonly used linear time complexity initialization methods on a large and diverse collection of data sets using various performance criteria. Finally, we analyze the experimental results using non-parametric statistical tests and provide recommendations for practitioners. We demonstrate that popular initialization methods often perform poorly and that there are in fact strong alternatives to these methods.

1 Introduction

Clustering, the unsupervised classification of patterns into groups, is one of the most important tasks in exploratory data analysis Jain99 (). Primary goals of clustering include gaining insight into data (detecting anomalies, identifying salient features, etc.), classifying data, and compressing data. Clustering has a long and rich history in a variety of scientific disciplines including anthropology, biology, medicine, psychology, statistics, mathematics, engineering, and computer science. As a result, numerous clustering algorithms have been proposed since the early 1950s Jain10 ().

Clustering algorithms can be broadly classified into two groups: hierarchical and partitional Jain10 (). Hierarchical algorithms recursively find nested clusters either in a top-down (divisive) or bottom-up (agglomerative) fashion. In contrast, partitional algorithms find all the clusters simultaneously as a partition of the data and do not impose a hierarchical structure. Most hierarchical algorithms have quadratic or higher complexity in the number of data points Jain99 () and therefore are not suitable for large data sets, whereas partitional algorithms often have lower complexity.

Given a data set $X = \{x_1, x_2, \ldots, x_N\}$ in $\mathbb{R}^D$, i.e., $N$ points (vectors) each with $D$ attributes (components), hard partitional algorithms divide $X$ into $K$ exhaustive and mutually exclusive clusters $C_1, C_2, \ldots, C_K$ ($\bigcup_{k=1}^{K} C_k = X$, $C_i \cap C_j = \emptyset$ for $i \neq j$). These algorithms usually generate clusters by optimizing a criterion function. The most intuitive and frequently used criterion function is the Sum of Squared Error (SSE) given by:

$$\mathrm{SSE} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \left\lVert x_i - c_k \right\rVert_2^2 \qquad (1)$$

where $\lVert\cdot\rVert_2$ denotes the Euclidean ($L_2$) norm and $c_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i$ is the centroid of cluster $C_k$, whose cardinality is $|C_k|$. The optimization of (1) is often referred to as the minimum SSE clustering (MSSC) problem.
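For concreteness, the SSE in (1) can be computed directly from a point matrix, a label vector, and a center matrix. The following NumPy sketch uses illustrative variable names (it is not code from the study) and is reused by later sketches in this paper.

```python
import numpy as np

def sse(X, labels, centers):
    """Sum of Squared Error, Eq. (1): squared Euclidean distance of every
    point to the centroid of the cluster it is assigned to."""
    residuals = X - centers[labels]        # (N, D) point-to-center differences
    return float(np.sum(residuals ** 2))
```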

The number of ways in which a set of $N$ objects can be partitioned into $K$ non-empty groups is given by Stirling numbers of the second kind:

$$S(N, K) = \frac{1}{K!} \sum_{j=0}^{K} (-1)^{K-j} \binom{K}{j} j^N \qquad (2)$$

which can be approximated by $K^N / K!$. It can be seen that a complete enumeration of all possible clusterings to determine the global minimum of (1) is clearly computationally prohibitive except for very small data sets Kaufman90 (). In fact, this non-convex optimization problem is proven to be NP-hard even for $K = 2$ Aloise09 () or $D = 2$ Mahajan12 (). Consequently, various heuristics have been developed to provide approximate solutions to this problem Tarsitano03 (). Among these heuristics, Lloyd’s algorithm Lloyd82 (), often referred to as the (batch) k-means algorithm, is the simplest and most commonly used one. This algorithm starts with $K$ arbitrary centers, typically chosen uniformly at random from the data points. Each point is assigned to the nearest center, and then each center is recalculated as the mean of all points assigned to it. These two steps are repeated until a predefined termination criterion is met.
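The two alternating steps of Lloyd’s algorithm can be sketched in a few lines of NumPy. This is an illustrative sketch, not the optimized C implementation used in the experiments; the `init` parameter is included only so that the seeding methods surveyed in Section 2 can be plugged in, and the empty-cluster handling is a simplification.

```python
import numpy as np

def lloyd_kmeans(X, k, init=None, max_iter=100, seed=0):
    """Batch k-means (Lloyd's algorithm).  `init` may be a (k, D) array of
    initial centers; if None, k distinct data points are drawn uniformly at
    random (MacQueen's second method)."""
    rng = np.random.default_rng(seed)
    if init is None:
        centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    else:
        centers = np.array(init, dtype=float)
    for _ in range(max_iter):
        # assignment step: index of the nearest center for every point
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # update step: each center becomes the mean of its assigned points
        # (empty clusters keep their previous center in this sketch)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```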

The k-means algorithm is undoubtedly the most widely used partitional clustering algorithm Jain99 (); Jain10 (). Its popularity can be attributed to several reasons. First, it is conceptually simple and easy to implement. Virtually every data mining software package includes an implementation of it. Second, it is versatile, i.e., almost every aspect of the algorithm (initialization, distance function, termination criterion, etc.) can be modified. This is evidenced by hundreds of publications over the last fifty years that extend k-means in various ways. Third, it has a time complexity that is linear in $N$, $K$, and $D$ (and, in general, $K \ll N$ and $D \ll N$). For this reason, it can be used to initialize more expensive clustering algorithms such as expectation maximization Bradley98 (), DBSCAN Dash01 (), and spectral clustering Chen11 (). Furthermore, numerous sequential Kanungo02 (); Hamerly10 () and parallel Chen10 () acceleration techniques are available in the literature. Fourth, it has a storage complexity that is linear in $N$, $K$, and $D$. In addition, there exist disk-based variants that do not require all points to be stored in memory Ordonez04 (). Fifth, it is guaranteed to converge Selim84 () at a quadratic rate Bottou95 (). Finally, it is invariant to data ordering, i.e., to random shufflings of the data points.

On the other hand, k-means has several significant disadvantages. First, it requires the number of clusters, $K$, to be specified a priori. The value of this parameter can be determined automatically by means of various cluster validity measures Vendramin10 (). Second, it can only detect compact, hyperspherical clusters that are well separated. This can be alleviated by using a more general distance function such as the Mahalanobis distance, which permits the detection of hyperellipsoidal clusters Mao96 (). Third, due to its utilization of the squared Euclidean distance, it is sensitive to noise and outlier points, since even a few such points can significantly influence the means of their respective clusters. This can be addressed by outlier pruning Zhang03 () or by using a more robust distance function such as the city-block ($L_1$) distance. Fourth, due to its gradient descent nature, it often converges to a local minimum of the criterion function Selim84 (). For the same reason, it is highly sensitive to the selection of the initial centers. Adverse effects of improper initialization include empty clusters, slower convergence, and a higher chance of getting stuck in bad local minima Celebi11 (). Fortunately, all of these drawbacks except the first one can be remedied by using an adaptive initialization method (IM).

In this study, we investigate some of the most popular IMs developed for the k-means algorithm. Our motivation is three-fold. First, a large number of IMs have been proposed in the literature and thus a systematic study that reviews and compares these methods is desirable. Second, these IMs can be used to initialize other partitional clustering algorithms such as fuzzy c-means and its variants and expectation maximization. Third, most of these IMs can be used independently of k-means as standalone clustering algorithms.

This study differs from earlier studies of a similar nature Pena99 (); He04 () in several respects: (i) a more comprehensive overview of the existing IMs is provided, (ii) the experiments involve a larger set of methods and a significantly more diverse collection of data sets, (iii) in addition to clustering effectiveness, computational efficiency is used as a performance criterion, and (iv) the experimental results are analyzed more thoroughly using non-parametric statistical tests.

The rest of the paper is organized as follows. Section 2 presents a survey of k-means IMs. Section 3 describes the experimental setup. Section 4 presents the experimental results, while Section 5 gives the conclusions.

2 Initialization Methods for K-Means

In this section, we briefly review some of the commonly used IMs with an emphasis on their time complexity with respect to $N$, the number of points. In each complexity class, methods are presented in chronologically ascending order.

2.1 Linear Time-Complexity Initialization Methods

Forgy’s method Forgy65 () assigns each point to one of the $K$ clusters uniformly at random. The centers are then given by the centroids of these initial clusters. This method has no theoretical basis, as such random clusters have no internal homogeneity Anderberg73 ().

Jancey’s method Jancey66 () assigns to each center a synthetic point randomly generated within the data space. Unless the data set fills the space, some of these centers may be quite distant from any of the points Anderberg73 (), which might lead to the formation of empty clusters.

MacQueen MacQueen67 () proposed two different methods. The first one, which is the default option in the Quick Cluster procedure of IBM SPSS Statistics Norusis11 (), takes the first $K$ points in $X$ as the centers. An obvious drawback of this method is its sensitivity to data ordering. The second method chooses the $K$ centers randomly from among the data points. The rationale behind this method is that random selection is likely to pick points from dense regions, i.e., points that are good candidates to be centers. However, there is no mechanism to avoid choosing outliers or points that are too close to each other Anderberg73 (). Multiple runs of this method constitute the standard way of initializing k-means Bradley98 (). It should be noted that this second method is often mistakenly attributed to Forgy Forgy65 ().
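Since the standard practice is to repeat this random seeding several times and keep the best run, a minimal sketch of that protocol follows. It assumes the `lloyd_kmeans` and `sse` sketches given earlier, and the run count `r` is purely illustrative.

```python
import numpy as np

def best_of_random_restarts(X, k, r=10, seed=0):
    """MacQueen's second method with restarts: run randomly seeded k-means
    r times and keep the run with the least final SSE."""
    rng = np.random.default_rng(seed)
    best_err, best = np.inf, None
    for _ in range(r):
        centers, labels = lloyd_kmeans(X, k, seed=int(rng.integers(1 << 31)))
        err = sse(X, labels, centers)
        if err < best_err:
            best_err, best = err, (centers, labels)
    return best
```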

Ball and Hall’s method Ball67 () takes the centroid of $X$ as the first center. It then traverses the points in arbitrary order and takes a point as a new center if it is at least $T$ units apart from the previously selected centers, until $K$ centers are obtained. The purpose of the distance threshold $T$ is to ensure that the seed points are well separated. However, it is difficult to decide on an appropriate value for $T$. In addition, the method is sensitive to data ordering.

The Simple Cluster Seeking method Tou74 () is identical to Ball and Hall’s method with the exception that the first point in $X$ is taken as the first center. This method is used in the FASTCLUS procedure of SAS SAS09 ().

Späth’s method Spath77 () is similar to Forgy’s method with the exception that the points are assigned to the clusters in a cyclical fashion, i.e., the $i$-th ($i \in \{1, 2, \ldots, N\}$) point is assigned to the $(i \bmod K)$-th cluster. In contrast to Forgy’s method, this method is sensitive to data ordering.

The maximin method Gonzalez85 (); Katsavounidis94 () chooses the first center arbitrarily, and the $i$-th ($i \in \{2, 3, \ldots, K\}$) center is chosen to be the point that has the greatest minimum-distance to the previously selected centers, i.e., $\arg\max_{x \in X} \min_{j < i} \lVert x - c_j \rVert$. This method was originally developed as a 2-approximation to the $K$-center clustering problem (given a set of $N$ points in a metric space, the goal of $K$-center clustering is to find $K$ representative points, or centers, such that the maximum distance of a point to its nearest center is minimized). It should be noted that, motivated by a vector quantization application, Katsavounidis et al.’s variant Katsavounidis94 () takes the point with the greatest Euclidean norm as the first center.
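A compact sketch of the maximin seeding rule follows (illustrative NumPy code; the `first` argument corresponds to the arbitrarily chosen first center, and Katsavounidis et al.'s variant would instead pass the index of the point with the greatest norm).

```python
import numpy as np

def maximin_init(X, k, first=0):
    """Maximin seeding: each new center is the point with the greatest
    minimum distance to the centers selected so far."""
    centers = [X[first]]
    min_d2 = ((X - centers[0]) ** 2).sum(axis=1)   # squared distance to nearest center
    for _ in range(1, k):
        idx = int(np.argmax(min_d2))               # farthest point from its nearest center
        centers.append(X[idx])
        min_d2 = np.minimum(min_d2, ((X - X[idx]) ** 2).sum(axis=1))
    return np.array(centers)
```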

Al-Daoud’s density-based method AlDaoud96 () first uniformly partitions the data space into $M$ disjoint hypercubes. It then randomly selects $K N_m / N$ points from hypercube $m$ ($m \in \{1, 2, \ldots, M\}$) to obtain a total of $K$ centers, where $N_m$ is the number of points in hypercube $m$. There are two main disadvantages associated with this method. First, it is difficult to decide on an appropriate value for $M$. Second, since the hypercubes are obtained by uniformly quantizing each attribute, the method has a storage complexity of $\mathcal{O}(2^{\beta D})$, where $\beta$ is the number of bits allocated to each attribute.

Bradley and Fayyad’s method Bradley98 () starts by randomly partitioning the data set into $J$ subsets. These subsets are clustered using k-means initialized by MacQueen’s second method, producing $J$ sets of intermediate centers, each with $K$ points. These center sets are combined into a superset, which is then clustered by k-means $J$ times, each time initialized with a different center set. Members of the center set that gives the least SSE are then taken as the final centers.
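A simplified sketch of this refinement procedure is given below. It assumes the `lloyd_kmeans` and `sse` sketches above, omits Bradley and Fayyad's empty-cluster repair step, and treats the subset count `j` as a user-chosen parameter.

```python
import numpy as np

def bradley_fayyad_init(X, k, j=10, seed=0):
    """Refinement seeding: cluster j random subsets of X, pool the j*k
    intermediate centers, cluster the pool j times (each time seeded with
    one intermediate center set), and keep the best-fitting center set."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(len(X)), j)
    center_sets = [lloyd_kmeans(X[p], k, seed=int(rng.integers(1 << 31)))[0]
                   for p in parts]
    pool = np.vstack(center_sets)                  # superset of j*k points
    best_err, best_centers = np.inf, None
    for c0 in center_sets:
        centers, labels = lloyd_kmeans(pool, k, init=c0)
        err = sse(pool, labels, centers)
        if err < best_err:
            best_err, best_centers = err, centers
    return best_centers
```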

Pizzuti et al. Pizzuti99 () improved upon Al-Daoud’s density-based method using a multiresolution grid approach. Their method starts with a coarse partitioning of the data space into hypercubes and iteratively splits hypercubes as the number of points they receive increases. Once the splitting phase is completed, the centers are chosen from the densest hypercubes.

The k-means++ method Arthur07 () interpolates between MacQueen’s second method and the maximin method. It chooses the first center randomly, and the $i$-th ($i \in \{2, 3, \ldots, K\}$) center is chosen to be $x' \in X$ with a probability of $md(x')^2 \big/ \sum_{j=1}^{N} md(x_j)^2$, where $md(x)$ denotes the minimum-distance from a point $x$ to the previously selected centers. This method yields an $\mathcal{O}(\log K)$ approximation to the MSSC problem in expectation. The greedy k-means++ method probabilistically selects several candidate centers (on the order of $\log K$) in each round and then greedily selects, among these candidates, the center that most reduces the SSE. This modification aims to avoid the unlikely event of choosing two centers that are close to each other.
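The squared-distance weighted sampling of k-means++ can be sketched as follows (illustrative NumPy code; the greedy variant would draw several candidates at each step and keep the one giving the lowest SSE).

```python
import numpy as np

def kmeanspp_init(X, k, seed=0):
    """k-means++ seeding: first center uniformly at random; every further
    center drawn with probability proportional to the squared distance to
    its nearest already-selected center."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    min_d2 = ((X - centers[0]) ** 2).sum(axis=1)
    for _ in range(1, k):
        probs = min_d2 / min_d2.sum()
        idx = int(rng.choice(len(X), p=probs))
        centers.append(X[idx])
        min_d2 = np.minimum(min_d2, ((X - X[idx]) ** 2).sum(axis=1))
    return np.array(centers)
```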

The PCA-Part method Su07 () uses a divisive hierarchical approach based on PCA (Principal Component Analysis) Hotelling36 (). Starting from an initial cluster that contains the entire data set, the method iteratively selects the cluster with the greatest SSE and divides it into two subclusters using a hyperplane that passes through the cluster centroid and is orthogonal to the principal eigenvector of the cluster covariance matrix. This procedure is repeated until $K$ clusters are obtained. The centers are then given by the centroids of these clusters. The Var-Part method Su07 () is an approximation to PCA-Part in which the covariance matrix of the cluster to be split is assumed to be diagonal. In this case, the splitting hyperplane is orthogonal to the coordinate axis with the greatest variance.
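A sketch of Var-Part (the diagonal-covariance approximation) is shown below; PCA-Part would replace the axis selection with the principal eigenvector of the cluster covariance matrix. The code is illustrative and does not handle degenerate clusters whose points are all identical.

```python
import numpy as np

def var_part_init(X, k):
    """Var-Part seeding: repeatedly split the cluster with the greatest SSE
    at its centroid, along the coordinate axis of greatest variance, until
    k clusters remain; return their centroids."""
    clusters = [X]
    while len(clusters) < k:
        sses = [((c - c.mean(axis=0)) ** 2).sum() for c in clusters]
        c = clusters.pop(int(np.argmax(sses)))     # cluster with the greatest SSE
        axis = int(np.argmax(c.var(axis=0)))       # axis with the greatest variance
        cut = c[:, axis].mean()                    # hyperplane through the centroid
        clusters.append(c[c[:, axis] <= cut])
        clusters.append(c[c[:, axis] > cut])
    return np.array([c.mean(axis=0) for c in clusters])
```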

Lu et al.’s method Lu08 () uses a two-phase pyramidal approach. The attributes of each point are first encoded as integers using a multilevel quantization whose number of levels is controlled by a resolution parameter. These integer points form the base level of the pyramid. In the bottom-up phase, starting from the base level, neighboring data points at each level are averaged to obtain weighted points at the next coarser level, until only a small number of points (at least $K$) remain. Data points at the highest (coarsest) level are refined using k-means initialized with the $K$ points with the largest weights. In the top-down phase, starting from the highest level, the centers at each level are projected onto the next finer level and then used to initialize the clustering at that level. The top-down phase terminates when the base level is reached. The centers at this level are then inverse quantized to obtain the final centers. The performance of this method degrades with increasing dimensionality Lu08 ().

Onoda et al.’s method Onoda12 () first calculates the $K$ Independent Components (ICs) Hyvarinen99 () of $X$ and then chooses the $i$-th ($i \in \{1, 2, \ldots, K\}$) center as the point that has the least cosine distance to the $i$-th IC.

2.2 Loglinear Time-Complexity Initialization Methods

Hartigan’s method Hartigan79 () first sorts the points in ascending order of their distances to the centroid of $X$. The $i$-th ($i \in \{1, 2, \ldots, K\}$) center is then chosen to be the $\bigl(1 + (i - 1)\lfloor N / K \rfloor\bigr)$-th point in this ordering. This method is an improvement over MacQueen’s first method in that it is invariant to data ordering and is more likely to produce seeds that are well separated. The computational cost of this method is dominated by the complexity of sorting, which is $\mathcal{O}(N \log N)$.

Al-Daoud’s variance-based method AlDaoud05 () first sorts the points on the attribute with the greatest variance and then partitions them into $K$ groups along the same dimension. The centers are then chosen to be the points that correspond to the medians of these groups. Note that this method disregards all attributes but one and is therefore likely to be effective only for data sets in which the variability is mostly along one dimension.

Redmond and Heneghan’s method Redmond07 () first constructs a kd-tree of the data points to perform density estimation and then uses a modified maximin method to select $K$ centers from densely populated leaf buckets. The computational cost of this method is dominated by the complexity of kd-tree construction, which is $\mathcal{O}(N \log N)$.

The ROBIN (ROBust INitialization) method AlHasan09 () uses the local outlier factor (LOF) Breunig00 () to avoid selecting outlier points as centers. In iteration $i$ ($i \in \{1, 2, \ldots, K\}$), the method first sorts the data points in decreasing order of their minimum-distance to the previously selected centers. It then traverses the points in sorted order and selects the first point that has an LOF value close to 1 as the $i$-th center. The computational cost of this method is dominated by the complexity of sorting, which is $\mathcal{O}(N \log N)$.

2.3 Quadratic-Complexity Initialization Methods

Astrahan’s method Astrahan70 () uses two distance thresholds, $d_1$ and $d_2$. It first calculates the density of each point as the number of points within a distance of $d_1$ from it. The points are sorted in decreasing order of their densities and the highest-density point is chosen as the first center. Subsequent centers are chosen in order of decreasing density, subject to the condition that each new center be at least at a distance of $d_2$ from the previously selected centers. This procedure continues until no more centers can be chosen. Finally, if more than $K$ centers are chosen, hierarchical clustering is used to group the centers until only $K$ of them remain. The main problem with this method is that it is very sensitive to the values of $d_1$ and $d_2$. For example, if $d_1$ is too small, there may be many isolated points with zero density, whereas if it is too large, a few centers may cover the entire data set Anderberg73 ().

Lance and Williams Lance67 () suggested that the output of a hierarchical clustering algorithm can be used to initialize k-means. Despite the fact that such algorithms often have quadratic or higher complexity, this method is highly recommended in the statistics literature Milligan80a () possibly due to the limited size of the data sets in this field.

Kaufman and Rousseeuw’s method Kaufman90 () takes the most centrally located point in the data set as the first center, and the $i$-th ($i \in \{2, 3, \ldots, K\}$) center is chosen to be the point that most reduces the SSE. Since pairwise distances between the data points need to be calculated in each iteration, the time complexity of this method is $\mathcal{O}(N^2)$.

Cao et al. Cao09 () formalized Astrahan’s density-based method within the framework of a neighborhood-based rough set model. In this model, the $\epsilon$-neighborhood of a point is defined as the set of points within a distance of $\epsilon$ from it according to a particular distance measure. Based on this neighborhood model, the concepts of cohesion and coupling are defined. The former is a measure of the centrality of a point with respect to its neighborhood, whereas the latter is a measure of separation between two neighborhoods. The method first sorts the data points in decreasing order of their cohesion and takes the point with the greatest cohesion as the first center. It then traverses the points in sorted order and takes the first point that has a coupling below a prespecified threshold with the previously selected centers as the $i$-th ($i \in \{2, 3, \ldots, K\}$) center. The computational cost of this method is dominated by the complexity of the neighborhood calculations, which is $\mathcal{O}(N^2)$.

2.4 Other Initialization Methods

The binary-splitting method Linde80 () takes the centroid of $X$ as the first center. In iteration $i$ ($i \in \{1, 2, \ldots, \log_2 K\}$), each of the existing $2^{i-1}$ centers is split into two new centers by subtracting and adding a fixed perturbation vector $\epsilon$, i.e., $c_j - \epsilon$ and $c_j + \epsilon$. These new $2^i$ centers are then refined using k-means. There are two main disadvantages associated with this method. First, there is no guidance on the selection of a proper value for $\epsilon$, which determines the direction of the split Huang93 (). Second, the method is computationally demanding since after each iteration k-means has to be run on the entire data set.

The directed-search binary-splitting method Huang93 () is an improvement over the binary-splitting method in that it determines the perturbation vector $\epsilon$ using PCA. However, it has even higher computational requirements due to the calculation of the principal eigenvector in each iteration.

The global k-means method Likas03 () takes the centroid of $X$ as the first center. In iteration $i$ ($i \in \{1, 2, \ldots, K - 1\}$), it considers each of the $N$ points in turn as a candidate for the $(i+1)$-st center and runs k-means with $i + 1$ centers on the entire data set. This method is computationally prohibitive for large data sets as it involves on the order of $N(K - 1)$ runs of k-means on the entire data set.

It should be noted that the two splitting methods and the global k-means method are not initialization methods per se. These methods can be considered as complete clustering methods that utilize k-means as a local search procedure. For this reason, to the best of our knowledge, none of the initialization studies to date included these methods in their comparisons.

We should also mention IMs based on metaheuristics such as simulated annealing Babu94a () and genetic algorithms Babu93 (). These algorithms start from a random initial configuration (population) and use k-means to evaluate their solutions in each iteration (generation). There are two main disadvantages associated with these methods. First, they involve numerous parameters that are difficult to tune (initial temperature, cooling schedule, population size, crossover/mutation probability, etc.) Jain99 (). Second, due to the large search space, they often require a large number of iterations, which renders them computationally prohibitive for all but the smallest data sets. Interestingly, with the recent developments in combinatorial optimization algorithms, it is now feasible to obtain globally minimum SSE clusterings for small data sets without resorting to metaheuristics Aloise10 ().

2.5 Linear vs. Superlinear Initialization Methods

Based on the descriptions given above, it can be seen that superlinear methods often have more elaborate designs when compared to linear ones. An interesting feature of the superlinear methods is that they are often deterministic, which can be considered as an advantage especially when dealing with large data sets. In contrast, linear methods are often non-deterministic and/or order-sensitive. As a result, it is common practice to perform multiple runs of such methods and take the output of the run that produces the least SSE Bradley98 ().

A frequently cited advantage of the more elaborate methods is that they often lead to faster k-means convergence, i.e., require fewer iterations, and as a result the time gained during the clustering phase can offset the time lost during the initialization phase Su07 (); Redmond07 (); AlHasan09 (). This may be true when a standard implementation of k-means is used. However, convergence speed may not be as important when a fast k-means variant is used as such methods often require significantly less time compared to a standard k-means implementation. In this study, we utilize a fast k-means variant based on triangle inequality Huang90 () and partial distance elimination Bei85 () techniques. As will be seen in §4, this fast and exact k-means implementation will diminish the computational efficiency differences among various IMs. In other words, we will demonstrate that elaborate methods that lead to faster k-means convergence are not necessarily more efficient than simple methods with slower convergence.

3 Experimental Setup

3.1 Data Set Descriptions

In order to obtain a comprehensive evaluation of various IMs, we conducted two sets of experiments. The first experiment involved commonly used real data sets with sizes ranging from 214 to 1,904,711 points. Most of these data sets were obtained from the UCI Machine Learning Repository Frank11 () (see Table 1). The second experiment involved a large number of synthetic data sets with varying clustering complexity. We used a recent algorithm proposed by Maitra and Melnykov Maitra10 () to generate these data sets. This algorithm involves the calculation of the exact overlap between each cluster pair, measured in terms of their total probability of misclassification, and the guided simulation of Gaussian mixture components satisfying prespecified overlap characteristics. The algorithm was used with a range of settings for the mean overlap, the number of points, the number of attributes, and the number of classes.

The mean overlap parameter characterizes the average pairwise overlap between clusters. However, we observed that two synthetic data sets with the same mean overlap can have considerably different clustering complexity. Therefore, we quantified clustering complexity using the following indirect approach. For each data set, we executed the k-means algorithm initialized with the “true” centers given by the cluster generation algorithm and calculated the RAND, VD, and VI measures (see §3.3) upon convergence. The average of these measures was taken as a quantitative indicator of clustering complexity. Note that each of these normalized measures takes values from the $[0, 1]$ interval. For RAND, larger values are better, whereas for VD and VI smaller values are better. Therefore, we inverted the RAND values by subtracting them from 1 to make this measure compatible with the other two. Finally, using the aforementioned complexity quantification scheme, we generated an equal number of synthetic data sets from each of the following complexity classes: easy, moderate, and difficult. Figure 1 shows sample data sets from each complexity class.

Figure 1: Sample synthetic data sets from each complexity class: (a) easy, (b) moderate, (c) difficult
ID Data Set # Points () # Attributes () # Classes ()
1 Breast Cancer Wisconsin (Original) 683 9 2
2 Cloud Cover (DB1) 1,024 10 8
3 Concrete Compressive Strength 1,030 9 8
4 Corel Image Features 68,040 25 16
5 Covertype 581,012 10 7
6 Ecoli 336 7 8
7 Steel Plates Faults 1,941 27 7
8 Glass Identification 214 9 6
9 Heart Disease 297 13 5
10 Ionosphere 351 34 2
11 ISOLET 7,797 617 26
12 Landsat Satellite (Statlog) 6,435 36 6
13 Letter Recognition 20,000 16 26
14 MAGIC Gamma Telescope 19,020 10 2
15 Multiple Features (Fourier) 2,000 76 10
16 MiniBooNE Particle Identification 130,064 50 2
17 Musk (Clean2) 6,598 166 2
18 Optical Digits 5,620 64 10
19 Page Blocks Classification 5,473 10 5
20 Parkinsons 5,875 18 42
21 Pen Digits 10,992 16 10
22 Person Activity 164,860 3 11
23 Pima Indians Diabetes 768 8 2
24 Image Segmentation 2,310 19 7
25 Shuttle (Statlog) 58,000 9 7
26 SPECTF Heart 267 44 2
27 Telugu Vowels Pal77 () 871 3 6
28 Vehicle Silhouettes (Statlog) 846 18 4
29 Wall-Following Robot Navigation 5,456 24 4
30 Wine Quality 6,497 11 7
31 World TSP Cook11 () 1,904,711 2 7
32 Yeast 1,484 8 10
  • Due to the unavailability of class labels, for data sets #2, #3, and #4, $K$ was chosen arbitrarily, whereas for #20 and #31, it was determined based on domain knowledge.

Table 1: Descriptions of Real Data Sets

3.2 Attribute Normalization

In clustering tasks, normalization is a common preprocessing step that is necessary to prevent attributes with large ranges from dominating the distance calculations and also to avoid numerical instabilities in the computations. Two commonly used normalization schemes are linear scaling to unit range (min-max normalization) and linear scaling to unit variance (z-score normalization). Several studies revealed that the former scheme is preferable to the latter since the latter is likely to eliminate valuable between-cluster variation Milligan88 (); Su07 (). As a result, we used min-max normalization to map the attributes of each real data set to the $[0, 1]$ interval. Note that the attributes of the synthetic data sets were already normalized by the cluster generation algorithm.
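Min-max normalization amounts to the following per-attribute rescaling (a small NumPy sketch; constant attributes are left unscaled to avoid division by zero).

```python
import numpy as np

def minmax_normalize(X):
    """Linear scaling to unit range: map every attribute to [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against zero-range attributes
    return (X - lo) / span
```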

3.3 Performance Criteria

The performance of the IMs was measured using five effectiveness (quality) and two efficiency (speed) criteria:

  • Initial SSE: This is the SSE value calculated after the initialization phase, before the clustering phase. It gives us a measure of the effectiveness of an IM by itself.

  • Final SSE: This is the SSE value calculated after the clustering phase. It gives us a measure of the effectiveness of an IM when its output is refined by k-means. Note that this is the objective function of the k-means algorithm, i.e., (1).

  • Normalized Rand (RAND) Hubert85 (), van Dongen (VD) vanDongen00 (), and Variation of Information (VI) Meila07 () criteria: These are external validity measures that quantify the extent to which the clustering structure discovered by a clustering algorithm matches some external structure, e.g., one specified by the given class labels Wu09a (); Wu09c (). In a recent comprehensive study, these three measures were found to be the best among 16 external validity measures Wu09a (). Note that each of these normalized measures takes values from the $[0, 1]$ interval.

  • Number of Iterations: This is the number of iterations that k-means requires until reaching convergence when initialized by a particular IM. It is an efficiency measure independent of programming language, implementation style, compiler, and CPU architecture.

  • CPU Time: This is the total CPU time taken by the initialization and clustering phases. This criterion is reported only for the real data sets.

All of the methods were implemented in the C language, compiled with the gcc v4.4.3 compiler, and executed on an Intel Xeon E5520 2.26GHz machine. Time measurements were performed using the getrusage function, which is capable of measuring CPU time to an accuracy of a microsecond. The MT19937 variant of the Mersenne Twister algorithm was used to generate high-quality pseudorandom numbers Matsumoto98 ().

The convergence of k-means was controlled by the disjunction of two criteria: the number of iterations reaches a prespecified maximum, or the relative improvement in SSE between two consecutive iterations drops below a threshold $\varepsilon$, i.e., $(\mathrm{SSE}_{i-1} - \mathrm{SSE}_i)/\mathrm{SSE}_{i-1} \leq \varepsilon$, where $\mathrm{SSE}_i$ denotes the SSE value at the end of the $i$-th iteration. The convergence threshold $\varepsilon$ was set to a small fixed value.
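The disjunctive stopping rule can be expressed as a small predicate (illustrative code; the actual iteration cap and threshold are parameters of the experimental setup, not values shown here).

```python
def kmeans_converged(sse_prev, sse_curr, iteration, max_iter, eps):
    """Stop when the iteration cap is reached or the relative SSE
    improvement between consecutive iterations drops to eps or below."""
    if iteration >= max_iter or sse_prev == 0.0:
        return True
    return (sse_prev - sse_curr) / sse_prev <= eps
```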

4 Experimental Results and Discussion

In this study, we focus on IMs that have time complexity linear in $N$. This is because k-means itself has linear complexity, which is perhaps the most important reason for its popularity. Therefore, an IM for k-means should not diminish this advantage of the algorithm. Eight commonly used, order-invariant IMs were included in the experiments: Forgy’s method (F), MacQueen’s second method (M), maximin (X), Bradley and Fayyad’s method (B), k-means++ (K), greedy k-means++ (G), Var-Part (V), and PCA-Part (P). It should be noted that among these methods only V and P are deterministic.

In the experiments, each non-deterministic method was executed multiple times, and statistics such as the minimum, mean, and standard deviation were collected for the effectiveness criteria. In each run, the number of clusters ($K$) was set equal to the number of classes, as is common in the related literature Bradley98 (); Pizzuti99 (); He04 (); Arthur07 (); Su07 (); AlHasan09 (); Cao09 ().

Tables 2 and 3 give the Final SSE and CPU time (in milliseconds) results for the real data sets, respectively. Note that, due to space limitations, only mean values are reported for the CPU time criterion. In order to determine if there are any statistically significant differences among the methods, we employed two non-parametric statistical tests Garcia08 (): the Friedman test Friedman37 () and Iman & Davenport test Iman80 (). These tests are alternatives to the parametric two-way analysis of variance (ANOVA) test. Their advantage over ANOVA is that they do not require normality or homoscedasticity, assumptions that are often violated in machine learning studies Luengo09 (); Garcia09 ().

Given $b$ blocks (subjects) and $k$ treatments (measurements), the null hypothesis ($H_0$) of the Friedman test is that the populations within a block are identical. The alternative hypothesis ($H_1$) is that at least one treatment tends to yield larger (or smaller) values than at least one other treatment. The test statistic is calculated as follows Daniel00 (). In the first step, the observations within each block are ranked separately, so each block contains a separate set of ranks. If ties occur, the tied observations are given the mean of the rank positions for which they are tied. If $H_0$ is true, the ranks in each block should be randomly distributed over the columns (treatments). Otherwise, we expect a lack of randomness in this distribution. For example, if a particular treatment is better than the others, we expect large (or small) ranks to ‘favor’ that column. In the second step, the ranks in each column are summed. If $H_0$ is true, we expect the rank sums to be fairly close, so close that we can attribute the differences to chance. Otherwise, we expect to see at least one difference between pairs of rank sums so large that we cannot reasonably attribute it to sampling variability. The test statistic is given as:

$$\chi_r^2 = \frac{12}{bk(k+1)} \sum_{j=1}^{k} R_j^2 - 3b(k+1) \qquad (3)$$

where $R_j$ ($j \in \{1, 2, \ldots, k\}$) is the rank sum of the $j$-th column. $\chi_r^2$ is approximately chi-square distributed with $k - 1$ degrees of freedom. $H_0$ is rejected at the $\alpha$ level of significance if the value of (3) is greater than or equal to the critical chi-square value for $k - 1$ degrees of freedom. Iman and Davenport Iman80 () proposed the following statistic:

$$F_F = \frac{(b - 1)\,\chi_r^2}{b(k - 1) - \chi_r^2} \qquad (4)$$

which is distributed according to the F-distribution with $k - 1$ and $(b - 1)(k - 1)$ degrees of freedom. When compared to $\chi_r^2$, this statistic is not only less conservative, but also more accurate for small sample sizes Iman80 ().
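For illustration, the statistics in (3) and (4) can be computed from a b-by-k results matrix (a NumPy array whose rows are data sets and whose columns are initialization methods) as in the following NumPy/SciPy sketch, which is not the analysis code used in the study.

```python
import numpy as np
from scipy.stats import rankdata, chi2, f

def friedman_iman_davenport(results):
    """Friedman statistic (3) and Iman-Davenport statistic (4) with their
    p-values; ties within a block receive mean ranks."""
    b, k = results.shape
    ranks = np.vstack([rankdata(row) for row in results])  # rank within each block
    R = ranks.sum(axis=0)                                   # column rank sums
    chi_r = 12.0 / (b * k * (k + 1)) * np.sum(R ** 2) - 3 * b * (k + 1)
    f_f = (b - 1) * chi_r / (b * (k - 1) - chi_r)
    return (chi_r, chi2.sf(chi_r, k - 1),
            f_f, f.sf(f_f, k - 1, (b - 1) * (k - 1)))
```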

In this study, blocks and treatments correspond to data sets and initialization methods, respectively. Our goal is to determine whether or not there is at least one method that is significantly better than at least one other method at the $\alpha$ level of significance. If this is the case, we will conduct a post-hoc (multiple comparison) test to determine which pairs of methods differ significantly. For this purpose, we will use the Bergmann-Hommel test Bergmann88 (), a powerful post-hoc procedure that has been used successfully in various machine learning studies Garcia08 (); Voss10 ().

4.1 Real Data Sets

Table 4 gives the Final SSE rankings of the IMs for the real data sets as determined by the Bergmann-Hommel procedure using the data given in Table 2. Here, a notation such as {D, E} > C indicates that there is no statistically significant difference between methods D and E and that these two methods are significantly better than method C. From Table 4 it can be seen that the methods cannot be distinguished from each other reliably. This was expected since even nonparametric post-hoc tests lack discrimination power in small-sample cases (recall that only 32 data sets were used) with a large number of ties (see Table 2). For example, with respect to the minimum statistic, the performances of F, M, B, K, and G are statistically indistinguishable. In other words, if we initialize k-means with each of these non-deterministic methods and execute it until convergence, the resulting clusterings over multiple runs will have very similar minimum Final SSE values. Similar trends were observed for the RAND, VD, and VI criteria (data not shown). Given the abundance of local minima even in data sets of moderate size and/or dimensionality and the gradient descent nature of k-means, it is not surprising that the deterministic methods V and P are outperformed by most of the non-deterministic methods, as the former were executed only once, whereas the latter were executed multiple times.

As mentioned earlier, the minimum statistic is meaningful only when it is practical to execute k-means multiple times. Otherwise, the mean statistic is more meaningful. The analysis of mean Final SSE results using the Bergmann-Hommel procedure reveals that deterministic methods V and P are the preferred choices in this case. This is not surprising since non-deterministic methods, in particular those that are ad hoc in nature, often produce highly variable results across multiple runs.

The standard deviation statistic characterizes the reliability of a non-deterministic IM with respect to a particular performance criterion. In other words, if a non-deterministic IM obtains low mean and standard deviation with respect to an effectiveness criterion, we do not have to execute this method a large number of times to obtain good results. The analysis of Final SSE standard deviations reveals two overlapping groups of methods. Once again this is not necessarily because the members of each group are in fact indistinguishable with respect to their reliability, but due to the relatively small sample size used. In summary, due to the necessarily small number of real-world data sets available for clustering studies, it may not be possible to distinguish among various IMs. Therefore, it is crucial that these IMs be tested on a large number of synthetic data sets (see §4.2).

As for computational efficiency, it can be seen from Table 3 that, in general, the IMs have similar computational requirements per run. However, in practice, a non-deterministic method is typically executed multiple times and the output of the run that gives the least SSE is taken as the result. Therefore, the total computational cost of a non-deterministic method is often significantly higher than that of a deterministic method. As predicted in §2.5, simple methods such as M require about the same CPU time as elaborate methods such as G. This is because simple methods often lead to more k-means iterations, whereas elaborate ones compensate for their computational overhead by requiring fewer k-means iterations. It should be noted that the efficiency differences among the methods can be further reduced by using faster k-means variants such as those described in Kanungo02 (); Hamerly10 ().

4.2 Synthetic Data Sets

Table 5 gives the ranking of the IMs with respect to the minimum statistic. It can be seen that despite variations in rankings across the performance criteria, some general trends emerge:

  • Non-deterministic methods outperform the deterministic ones, i.e., V and P, except in the case of Initial SSE. As explained in §4.1, this is due to the fact that the non-deterministic methods were executed multiple times, whereas the deterministic ones were executed only once. The reason why deterministic methods have good Initial SSE performance is that these methods are approximate (divisive hierarchical) clustering algorithms by themselves and thus give reasonable results even without k-means refinement.

  • Method B consistently appears in the best performing group, whereas methods F and X are often among the worst non-deterministic methods.

  • Method M exhibits moderate-to-good performance except in the case of Initial SSE. Recall that this method randomly selects the initial centers from among the data points and therefore it cannot be expected to perform well without k-means refinement.

  • Methods K and G generally perform well. In some cases the latter outperforms the former, whereas in others they have comparable performance.

Table 6 gives the ranking of the IMs with respect to the mean statistic. It can be seen that despite variations in rankings across the performance criteria, some general trends emerge:

  • Deterministic methods, i.e., V and P, generally outperform the non-deterministic ones. As explained in §4.1, this is due to the fact that the non-deterministic methods can produce highly variable results across multiple runs. Method B is highly competitive with the deterministic methods.

  • Methods M and X are often among the worst performers, whereas methods F and K exhibit moderate-to-bad performance.

  • Method G is often significantly better than all non-deterministic methods but B.

Table 7 gives the ranking of the non-deterministic IMs with respect to the standard deviation statistic. It can be seen that despite variations in rankings across the performance criteria, some general trends emerge:

  • Method B consistently appears in the best performing group, whereas method M is often among the worst performers.

  • Methods X and K exhibit moderate-to-bad performance.

  • Methods F and G are significantly better than all methods but B.

4.3 Recommendations for Practitioners

Based on the statistical analyses presented in the previous section, the following recommendations can be made:

  • In general, methods F, M, and X should not be used. These methods are easy to understand and implement, but they are often ineffective and unreliable. Furthermore, despite their low overhead, these methods do not offer significant time savings since they often result in slower k-means convergence.

  • In time-critical applications that involve large data sets, or in applications that demand determinism, methods V or P should be used. These methods need to be executed only once and they lead to very fast k-means convergence. The efficiency difference between the two is noticeable only on high-dimensional data sets. This is because method V calculates the direction of split by determining the coordinate axis with the greatest variance (in $\mathcal{O}(ND)$ time), whereas method P achieves this by calculating the principal eigenvector of the cluster covariance matrix (in roughly $\mathcal{O}(ND^2)$ time using the power method Hotelling36 ()). Note that despite its higher computational complexity, method P can, in some cases, be more efficient than method V (see Table 3). This is because the former converges significantly faster than the latter (see Table 6). The main disadvantage of these methods is that they are more complicated to implement due to their hierarchical formulation.

  • In applications that involve small data sets, methods B or G should be used. It is computationally feasible to run these methods hundreds of times on such data sets given that one such run takes only a few milliseconds.

  • In applications where an approximate clustering of the data set is desired, methods B, G, V, or P should be used. These methods produce very good initial clusterings (see Tables 5 and 6), which makes it possible to use them as standalone clustering algorithms.

5 Conclusions

In this paper, we presented an overview of k-means initialization methods with an emphasis on their computational efficiency. We then compared eight commonly used linear time complexity initialization methods on a large and diverse collection of real and synthetic data sets using various performance criteria. Finally, we analyzed the experimental results using non-parametric statistical tests and provided recommendations for practitioners. Our statistical analyses revealed that popular initialization methods such as Forgy’s, MacQueen’s second, and the maximin methods often perform poorly and that there are significantly better alternatives with comparable computational requirements.

6 Acknowledgments

This publication was made possible by grants from the Louisiana Board of Regents (LEQSF2008-11-RD-A-12) and National Science Foundation (0959583, 1117457). The authors are grateful to D. Arthur, S. Vassilvitskii, J.G. Dy, S.J. Redmond, J.F. Lu, and M. Al Hasan for clarifying various points about their papers.

References

  • (1) A. K. Jain, M. N. Murty, P. J. Flynn, Data Clustering: A Review, ACM Computing Surveys 31 (3) (1999) 264–323.
  • (2) A. K. Jain, Data Clustering: 50 Years Beyond K-means, Pattern Recognition Letters 31 (8) (2010) 651–666.
  • (3) L. Kaufman, P. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley-Interscience, 1990.
  • (4) D. Aloise, A. Deshpande, P. Hansen, P. Popat, NP-Hardness of Euclidean Sum-of-Squares Clustering, Machine Learning 75 (2) (2009) 245–248.
  • (5) M. Mahajan, P. Nimbhorkar, K. Varadarajan, The Planar k-Means Problem is NP-hard, Theoretical Computer Science 442 (2012) 13–21.
  • (6) A. Tarsitano, A Computational Study of Several Relocation Methods for K-Means Algorithms, Pattern Recognition 36 (12) (2003) 2955–2966.
  • (7) S. Lloyd, Least Squares Quantization in PCM, IEEE Trans. on Information Theory 28 (2) (1982) 129–136.
  • (8) P. S. Bradley, U. Fayyad, Refining Initial Points for K-Means Clustering, in: Proc. of the 15th Int. Conf. on Machine Learning, 1998, pp. 91–99.
  • (9) M. Dash, H. Liu, X. Xu, ’1 + 1 > 2’: Merging Distance and Density Based Clustering, in: Proc. of the 7th Int. Conf. on Database Systems for Advanced Applications, 2001, pp. 32–39.
  • (10) W. Y. Chen, Y. Song, H. Bai, C. J. Lin, E. Y. Chang, Parallel Spectral Clustering in Distributed Systems, IEEE Trans. on Pattern Analysis and Machine Intelligence 33 (3) (2011) 568–586.
  • (11) T. Kanungo, D. Mount, N. Netanyahu, C. Piatko, R. Silverman, A. Wu, An Efficient K-Means Clustering Algorithm: Analysis and Implementation, IEEE Trans. on Pattern Analysis and Machine Intelligence 24 (7) (2002) 881–892.
  • (12) G. Hamerly, Making k-means Even Faster, in: Proc. of the 2010 SIAM Int. Conf. on Data Mining, 2010, pp. 130–140.
  • (13) T. W. Chen, S. Y. Chien, Bandwidth Adaptive Hardware Architecture of K-Means Clustering for Video Analysis, IEEE Trans. on Very Large Scale Integration (VLSI) Systems 18 (6) (2010) 957–966.
  • (14) C. Ordonez, E. Omiecinski, Efficient Disk-Based K-Means Clustering for Relational Databases, IEEE Trans. on Knowledge and Data Engineering 16 (8) (2004) 909–921.
  • (15) S. Z. Selim, M. A. Ismail, K-Means-Type Algorithms: A Generalized Convergence Theorem and Characterization of Local Optimality, IEEE Trans. on Pattern Analysis and Machine Intelligence 6 (1) (1984) 81–87.
  • (16) L. Bottou, Y. Bengio, Advances in Neural Information Processing Systems 7, MIT Press, 1995, Ch. Convergence Properties of the K-Means Algorithms, pp. 585–592.
  • (17) L. Vendramin, R. J. G. B. Campello, E. R. Hruschka, Relative Clustering Validity Criteria: A Comparative Overview, Statistical Analysis and Data Mining 3 (4) (2010) 209–235.
  • (18) J. Mao, A. K. Jain, A Self-Organizing Network for Hyperellipsoidal Clustering (HEC), IEEE Trans. on Neural Networks 7 (1) (1996) 16–29.
  • (19) J. S. Zhang, Y.-W. Leung, Robust Clustering by Pruning Outliers, IEEE Trans. on Systems, Man, and Cybernetics – Part B 33 (6) (2003) 983–999.
  • (20) M. E. Celebi, Improving the Performance of K-means for Color Quantization, Image and Vision Computing 29 (4) (2011) 260–271.
  • (21) J. M. Pena, J. A. Lozano, P. Larranaga, An Empirical Comparison of Four Initialization Methods for the K-Means Algorithm, Pattern Recognition Letters 20 (10) (1999) 1027–1040.
  • (22) J. He, M. Lan, C. L. Tan, S. Y. Sung, H. B. Low, Initialization of Cluster Refinement Algorithms: A Review and Comparative Study, in: Proc. of the 2004 IEEE Int. Joint Conf. on Neural Networks, 2004, pp. 297–302.
  • (23) E. Forgy, Cluster Analysis of Multivariate Data: Efficiency vs. Interpretability of Classification, Biometrics 21 (1965) 768.
  • (24) M. R. Anderberg, Cluster Analysis for Applications, Academic Press, 1973.
  • (25) R. C. Jancey, Multidimensional Group Analysis, Australian Journal of Botany 14 (1) (1966) 127–130.
  • (26) J. MacQueen, Some Methods for Classification and Analysis of Multivariate Observations, in: Proc. of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1967, pp. 281–297.
  • (27) M. J. Norušis, IBM SPSS Statistics 19 Statistical Procedures Companion, Addison Wesley, 2011.
  • (28) G. H. Ball, D. J. Hall, A Clustering Technique for Summarizing Multivariate Data, Behavioral Science 12 (2) (1967) 153–155.
  • (29) J. T. Tou, R. C. Gonzales, Pattern Recognition Principles, Addison-Wesley, 1974.
  • (30) SAS Institute Inc., SAS/STAT 9.2 User’s Guide, SAS Publishing, 2009.
  • (31) H. Späth, Computational Experiences with the Exchange Method: Applied to Four Commonly Used Partitioning Cluster Analysis Criteria, European Journal of Operational Research 1 (1) (1977) 23–31.
  • (32) T. Gonzalez, Clustering to Minimize the Maximum Intercluster Distance, Theoretical Computer Science 38 (2–3) (1985) 293–306.
  • (33) I. Katsavounidis, C.-C. J. Kuo, Z. Zhang, A New Initialization Technique for Generalized Lloyd Iteration, IEEE Signal Processing Letters 1 (10) (1994) 144–146.
  • (34) M. Al-Daoud, S. Roberts, New Methods for the Initialisation of Clusters, Pattern Recognition Letters 17 (5) (1996) 451–455.
  • (35) C. Pizzuti, D. Talia, G. Vonella, A Divisive Initialisation Method for Clustering Algorithms, in: Proc. of the 3rd European Conf. on Principles and Practice of Knowledge Discovery in Databases, 1999, pp. 484–491.
  • (36) D. Arthur, S. Vassilvitskii, k-means++: The Advantages of Careful Seeding, in: Proc. of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, 2007, pp. 1027–1035.
  • (37) T. Su, J. G. Dy, In Search of Deterministic Methods for Initializing K-Means and Gaussian Mixture Clustering, Intelligent Data Analysis 11 (4) (2007) 319–338.
  • (38) H. Hotelling, Simplified Calculation of Principal Components, Psychometrika 1 (1) (1936) 27–35.
  • (39) J. F. Lu, J. B. Tang, Z. M. Tang, J. Y. Yang, Hierarchical Initialization Approach for K-Means Clustering, Pattern Recognition Letters 29 (6) (2008) 787–795.
  • (40) T. Onoda, M. Sakai, S. Yamada, Careful Seeding Method based on Independent Components Analysis for k-means Clustering, Journal of Emerging Technologies in Web Intelligence 4 (1) (2012) 51–59.
  • (41) A. Hyvärinen, Fast and Robust Fixed-Point Algorithms for Independent Component Analysis, IEEE Trans. on Neural Networks 10 (3) (1999) 626–634.
  • (42) J. A. Hartigan, M. A. Wong, Algorithm AS 136: A K-Means Clustering Algorithm, Journal of the Royal Statistical Society C 28 (1) (1979) 100–108.
  • (43) M. Al-Daoud, A New Algorithm for Cluster Initialization, in: Proc. of the 2nd World Enformatika Conf., 2005, pp. 74–76.
  • (44) S. J. Redmond, C. Heneghan, A Method for Initialising the K-means Clustering Algorithm Using kd-trees, Pattern Recognition Letters 28 (8) (2007) 965–973.
  • (45) M. Al Hasan, V. Chaoji, S. Salem, M. Zaki, Robust Partitional Clustering by Outlier and Density Insensitive Seeding, Pattern Recognition Letters 30 (11) (2009) 994–1002.
  • (46) M. M. Breunig, H. P. Kriegel, R. T. Ng, J. Sander, LOF: Identifying Density-Based Local Outliers, ACM SIGMOD Record 29 (2) (2000) 93–104.
  • (47) M. M. Astrahan, Speech Analysis by Clustering, or the Hyperphoneme Method, Tech. Rep. AIM-124, Stanford University (1970).
  • (48) G. N. Lance, W. T. Williams, A General Theory of Classificatory Sorting Strategies - II. Clustering Systems, The Computer Journal 10 (3) (1967) 271–277.
  • (49) G. W. Milligan, An Examination of the Effect of Six Types of Error Perturbation on Fifteen Clustering Algorithms, Psychometrika 45 (3) (1980) 325–342.
  • (50) F. Cao, J. Liang, G. Jiang, An Initialization Method for the K-Means Algorithm Using Neighborhood Model, Computers and Mathematics with Applications 58 (3) (2009) 474–483.
  • (51) Y. Linde, A. Buzo, R. Gray, An Algorithm for Vector Quantizer Design, IEEE Trans. on Communications 28 (1) (1980) 84–95.
  • (52) C. M. Huang, R. W. Harris, A Comparison of Several Vector Quantization Codebook Generation Approaches, IEEE Trans. on Image Processing 2 (1) (1993) 108–112.
  • (53) A. Likas, N. Vlassis, J. Verbeek, The Global K-Means Clustering Algorithm, Pattern Recognition 36 (2) (2003) 451–461.
  • (54) G. Babu, M. Murty, Simulated Annealing for Selecting Optimal Initial Seeds in the K-Means Algorithm, Indian Journal of Pure and Applied Mathematics 25 (1–2) (1994) 85–94.
  • (55) G. P. Babu, M. N. Murty, A Near-Optimal Initial Seed Value Selection in K-Means Algorithm Using a Genetic Algorithm, Pattern Recognition Letters 14 (10) (1993) 763–769.
  • (56) D. Aloise, P. Hansen, L. Liberti, An Improved Column Generation Algorithm for Minimum Sum-of-Squares Clustering, Mathematical Programming (2010) 1–26.
  • (57) S. H. Huang, S. H. Chen, Fast Encoding Algorithm for VQ-based Image Coding, Electronics Letters 26 (19) (1990) 1618–1619.
  • (58) C. D. Bei, R. M. Gray, An Improvement of the Minimum Distortion Encoding Algorithm for Vector Quantization, IEEE Trans. on Communications 33 (10) (1985) 1132–1133.
  • (59) A. Frank, A. Asuncion, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml, University of California, Irvine, School of Information and Computer Sciences (2011).
  • (60) R. Maitra, V. Melnykov, Simulating Data to Study Performance of Finite Mixture Modeling and Clustering Algorithms, Journal of Computational and Graphical Statistics 19 (2) (2010) 354–376.
  • (61) S. K. Pal, D. D. Majumder, Fuzzy Sets and Decision Making Approaches in Vowel and Speaker Recognition, IEEE Trans. on Systems, Man, and Cybernetics 7 (8) (1977) 625–629, http://www.isical.ac.in/~sushmita/patterns/vowel.dat.
  • (62) W. Cook, World TSP, http://www.tsp.gatech.edu/world, Georgia Institute of Technology (2011).
  • (63) G. Milligan, M. C. Cooper, A Study of Standardization of Variables in Cluster Analysis, Journal of Classification 5 (2) (1988) 181–204.
  • (64) L. Hubert, P. Arabie, Comparing Partitions, Journal of Classification 2 (1) (1985) 193–218.
  • (65) S. van Dongen, Performance Criteria for Graph Clustering and Markov Cluster Experiments, Tech. Rep. INS-R0012, Centrum voor Wiskunde en Informatica (2000).
  • (66) M. Meilă, Comparing Clusterings — An Information Based Distance, Journal of Multivariate Analysis 98 (5) (2007) 873–895.
  • (67) J. Wu, H. Xiong, J. Chen, Adapting the Right Measures for K-means Clustering, in: Proc. of the 15th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2009, pp. 877–885.
  • (68) J. Wu, J. Chen, H. Xiong, M. Xie, External Validation Measures for K-Means Clustering: A Data Distribution Perspective, Expert Systems With Applications 36 (3) (2009) 6050–6061.
  • (69) M. Matsumoto, T. Nishimura, Mersenne Twister: A 623-Dimensionally Equidistributed Uniform Pseudo-Random Number Generator, ACM Trans. on Modeling and Computer Simulation 8 (1) (1998) 3–30.
  • (70) S. Garcia, F. Herrera, An Extension on ”Statistical Comparisons of Classifiers over Multiple Data Sets” for all Pairwise Comparisons, Journal of Machine Learning Research 9 (2008) 2677–2694.
  • (71) M. Friedman, The Use of Ranks to Avoid the Assumption of Normality Implicit in the Analysis of Variance, Journal of the American Statistical Association 32 (200) (1937) 675–701.
  • (72) R. L. Iman, J. M. Davenport, Approximations of the Critical Region of the Friedman Statistic, Communications in Statistics - Theory and Methods 9 (6) (1980) 571–595.
  • (73) J. Luengo, S. Garcia, F. Herrera, A Study on the Use of Statistical Tests for Experimentation with Neural Networks: Analysis of Parametric Test Conditions and Non-Parametric Tests, Expert Systems With Applications 36 (4) (2009) 7798–7808.
  • (74) S. Garcia, A. Fernandez, J. Luengo, F. Herrera, A Study of Statistical Techniques and Performance Measures for Genetics-Based Machine Learning: Accuracy and Interpretability, Soft Computing 13 (10) (2009) 959–977.
  • (75) W. W. Daniel, Applied Nonparametric Statistics, Duxbury Press, 2000.
  • (76) B. Bergmann, G. Hommel, Multiple Hypotheses Testing, Springer, 1988, Ch. Improvements of General Multiple Test Procedures for Redundant Systems of Hypotheses, pp. 100–115.
  • (77) T. Voss, N. Hansen, C. Igel, Improved Step Size Adaptation for the MO-CMA-ES, in: Proc. of the 12th Annual Conf. on Genetic and Evolutionary Computation, 2010, pp. 487–494.
F M X B K G V P
1 min 239 239 239 239 239 239 239 239
mean 2390 2390 2390 2390 2390 2390 2390 2390
2 min 38 38 41 38 38 38 39 38
mean 444 401 453 391 391 391 390 380
3 min 167 167 167 167 167 167 173 167
mean 1768 1767 1726 1714 1745 1734 1730 1670
4 min 10057 10057 10057 10057 10058 10058 10101 10100
mean 1011579 1008020 1007715 1007618 1008323 1008018 101010 101000
5 min 66224 66224 66224 66224 66224 66224 66238 66238
mean 66990890 671961048 67350834 66431360 66948876 66930773 662380 662380
6 min 17 17 19 17 17 17 17 18
mean 192 193 201 181 181 181 170 180
7 min 1167 1167 1267 1167 1167 1167 1167 1168
mean 123183 1250103 130353 118425 123061 119835 11670 11680
8 min 18 18 19 18 18 18 19 19
mean 201 202 222 201 202 201 190 190
9 min 243 243 243 243 243 243 248 243
mean 2518 2528 2538 2517 2528 2497 2480 2430
10 min 629 629 629 629 629 629 629 629
mean 6290 63328 67181 63739 63534 63535 6290 6290
11 min 117891 117924 120898 117863 117719 117995 118495 118386
mean 1199311060 1196551061 1233881264 119050699 119538894 119176710 1184950 1183860
12 min 1742 1742 1742 1742 1742 1742 1742 1742
mean 17420 17420 17420 17420 174428 174741 17420 17420
13 min 2723 2718 2721 2719 2718 2715 2735 2745
mean 277528 275622 276517 274215 275418 275217 27350 27450
14 min 2923 2923 2923 2923 2923 2923 2923 2923
mean 29230 29230 29230 29230 29230 29230 29230 29230
15 min 3127 3128 3180 3127 3128 3127 3137 3214
mean 316430 316828 324722 315729 317333 314920 31370 32140
16 min 2802 2802 2802 2802 2802 2802 21983 2802
mean 82298685 125189667 28020 119359656 57226944 37744236 219830 28020
17 min 36373 36373 36373 36373 36373 36373 36373 36373
mean 377552829 37046916 36738754 371521340 374401906 371031639 363730 363730
18 min 14559 14559 14559 14559 14559 14559 14581 14807
mean 14653140 14763273 14774293 1462766 14735234 14719214 145810 148070
19 min 215 215 230 215 215 215 227 215
mean 2174 2174 25432 2196 21910 2174 2270 2150
20 min 235 219 233 218 217 217 220 219
mean 2517 2242 2413 2222 2202 2191 2200 2190
21 min 4930 4930 4930 4930 4930 4930 4930 5004
mean 5130131 5091110 5036106 501270 5111116 504675 49300 50040
22 min 1177 1177 1195 1177 1177 1177 1182 1177
mean 117910 118718 120425 118212 119327 118314 11820 11770
23 min 121 121 121 121 121 121 121 121
mean 1212 1225 1223 1223 1225 1225 1210 1210
24 min 387 387 411 387 387 387 410 405
mean 40723 41420 43021 40216 41019 40213 4100 4050
25 min 235 235 411 235 235 235 235 274
mean 30739 27523 930105 24418 27139 24621 2350 2740
26 min 214 214 214 214 214 214 214 214
mean 2140 2140 2140 2140 2140 2140 2140 2140
27 min 22 22 22 22 22 22 23 23
mean 232 231 230 231 231 230 230 230
28 min 223 223 224 223 223 223 224 224
mean 2242 2264 2371 2286 2265 2253 2240 2240
29 min 7772 7772 7772 7772 7772 7772 7774 7774
mean 779891 7808102 7854160 77731 7831140 7811106 77740 77740
30 min 334 334 348 334 334 334 335 334
mean 3352 3362 37417 3375 3363 3363 3350 3340
31 min 11039 11039 11039 11039 11039 11039 11483 12422
mean 140411686 123671057 11714627 11128231 11773872 11493626 114830 124220
32 min 58 58 61 58 58 58 69 59
mean 645 706 611 666 635 591 690 590
Table 2: Final SSE (Real Data Sets)
F M X B K G V P
1 0 0 0 0 0 0 0 0
2 3 3 2 4 3 2 10 0
3 2 2 2 4 2 2 0 10
4 2295 2248 2173 3624 2332 2459 1900 2540
5 2183 2229 2714 3604 2273 2274 1730 2120
6 0 0 0 1 0 0 0 0
7 8 8 7 12 9 9 0 20
8 0 0 0 1 0 0 0 0
9 0 1 0 1 0 1 0 0
10 0 0 0 1 0 0 0 0
11 3730 3469 2469 4063 3537 3915 6940 12200
12 28 32 40 40 34 32 40 50
13 700 693 729 852 698 693 950 800
14 20 19 28 30 21 19 30 20
15 54 56 62 70 57 58 30 70
16 252 283 58 417 112 96 230 570
17 22 20 24 34 25 26 20 220
18 112 116 131 137 121 125 60 140
19 9 10 7 12 8 10 10 10
20 109 100 110 150 97 118 60 70
21 59 53 52 67 56 59 30 50
22 430 524 314 718 469 513 730 480
23 1 1 1 1 1 0 0 0
24 6 5 7 8 6 7 10 0
25 84 82 19 122 79 81 100 80
26 0 0 0 1 0 0 10 0
27 1 1 0 1 1 1 0 0
28 3 2 1 3 2 3 10 10
29 20 20 20 28 19 21 10 20
30 38 38 24 51 36 38 60 40
31 748 949 1377 2439 918 1044 580 840
32 5 6 5 9 6 6 0 0
Table 3: CPU Time (Real Data Sets)
Statistic IM Ranking
Minimum
Mean
Standard Deviation
Table 4: Final SSE Rankings (Real Data Sets)
Data Set Complexity IM Ranking
Initial SSE
Easy
Moderate
Difficult
Final SSE
Easy
Moderate
Difficult
Final RAND
Easy
Moderate
Difficult
Final VD
Easy
Moderate
Difficult
Final VI
Easy
Moderate
Difficult
Number of Iterations
Easy
Moderate
Difficult
Table 5: Minimum (Synthetic Data Sets)
Data Set Complexity IM Ranking
Initial SSE
Easy
Moderate
Difficult
Final SSE
Easy
Moderate
Difficult
Final RAND
Easy
Moderate
Difficult
Final VD
Easy
Moderate
Difficult
Final VI
Easy
Moderate
Difficult
Number of Iterations
Easy
Moderate
Difficult
Table 6: Mean (Synthetic Data Sets)
Data Set Complexity IM Ranking
Initial SSE
Easy
Moderate Same as Easy
Difficult
Final SSE
Easy
Moderate
Difficult
Final RAND
Easy
Moderate
Difficult
Final VD
Easy
Moderate
Difficult
Final VI
Easy
Moderate
Difficult
Number of Iterations
Easy
Moderate
Difficult
Table 7: Standard Deviation (Synthetic Data Sets)