Fast K-Means Clustering with Anderson Acceleration


Juyong Zhang    Yuxin Yao    Yue Peng    Hao Yu
University of Science and Technology of China
juyong@ustc.edu.cn  yaoyuxin1@126.com  {echoyue,yh8971}@mail.ustc.edu.cn
Bailin Deng (Corresponding author)
Cardiff University
DengB3@cardiff.ac.uk
Abstract

We propose a novel method to accelerate Lloyd’s algorithm for K-Means clustering. Unlike previous acceleration approaches that reduce the computational cost per iteration or improve initialization, our approach focuses on reducing the number of iterations required for convergence. This is achieved by treating the assignment step and the update step of Lloyd’s algorithm as a fixed-point iteration, and applying Anderson acceleration, a well-established technique for accelerating fixed-point solvers. Classical Anderson acceleration utilizes the previous $m$ iterates to find an accelerated iterate, and its performance on K-Means clustering can be sensitive to the choice of $m$ and the distribution of samples. We propose a new strategy to dynamically adjust the value of $m$, which achieves robust and consistent speedups across different problem instances. Our method complements existing acceleration techniques, and can be combined with them to achieve state-of-the-art performance. We perform extensive experiments to evaluate the performance of the proposed method, where it outperforms other algorithms in 106 out of 120 test cases, and the mean decrease ratio of computational time is more than 33%.

 

Preprint. Work in progress.

1 Introduction

K-Means clustering is a fundamental method with various applications such as data compression, classification, and density estimation. Given a set of samples $X = \{x_1, \dots, x_N\} \subset \mathbb{R}^d$ and a positive integer $k$, K-Means clustering partitions the samples into $k$ clusters by minimizing the total squared distance from the samples to the centroids of their respective clusters, resulting in the optimization problem

$$\min_{c_1, \dots, c_k} \; E(c_1, \dots, c_k) = \sum_{i=1}^{N} \| x_i - c_{\sigma(i)} \|^2, \qquad (1)$$

where $c_1, \dots, c_k \in \mathbb{R}^d$ are the cluster centroids, and $c_{\sigma(i)}$ is the centroid closest to the sample $x_i$. This is a non-convex optimization problem, and it is NP-hard to find the global minimum (Aloise et al., 2009). On the other hand, there exist effective algorithms for finding a local minimum, the most popular being Lloyd’s algorithm, often simply referred to as the K-Means algorithm.
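To make the objective concrete, the following C++ sketch evaluates the energy (1) for given samples and centroids; the flat row-major data layout and the function name kmeans_energy are illustrative choices for this example rather than part of our implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// Evaluates the K-Means energy (1): the total squared distance from each sample
// to its closest centroid. Samples (N x d) and centroids (k x d) are stored as
// row-major flat arrays (an illustrative layout).
double kmeans_energy(const std::vector<double>& samples, std::size_t N, std::size_t d,
                     const std::vector<double>& centroids, std::size_t k) {
    double energy = 0.0;
    for (std::size_t i = 0; i < N; ++i) {
        double best = std::numeric_limits<double>::max();
        for (std::size_t j = 0; j < k; ++j) {
            double dist = 0.0;
            for (std::size_t s = 0; s < d; ++s) {
                const double diff = samples[i * d + s] - centroids[j * d + s];
                dist += diff * diff;
            }
            best = std::min(best, dist);  // squared distance to the closest centroid
        }
        energy += best;
    }
    return energy;
}
```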

The classical Lloyd’s algorithm is equivalent to minimizing the target energy with both the cluster assignment and the centroid positions as variables:

$$\min_{\{a_i\}, \{c_j\}} \; \sum_{i=1}^{N} \| x_i - c_{a_i} \|^2, \qquad (2)$$

where $a_i \in \{1, \dots, k\}$ is the index of the cluster that $x_i$ is assigned to. Lloyd’s algorithm performs the minimization in a two-step iterative process:

  • In the assignment step, each sample $x_i$ is assigned to the cluster whose centroid is the closest to $x_i$:

    $$a_i^{(t+1)} = \operatorname*{arg\,min}_{j \in \{1, \dots, k\}} \| x_i - c_j^{(t)} \|^2. \qquad (3)$$

  • In the update step, each centroid $c_j$ is updated to the mean of all samples assigned to cluster $j$ (a code sketch of both steps follows this list):

    $$c_j^{(t+1)} = \frac{1}{n_j} \sum_{i :\, a_i^{(t+1)} = j} x_i, \qquad (4)$$

    where $n_j$ is the number of samples assigned to cluster $j$.
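For concreteness, the following C++ sketch performs one iteration of Lloyd’s algorithm, i.e., the assignment step (3) followed by the update step (4); the data layout and the function name lloyd_iteration are illustrative choices for this example rather than part of our implementation.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// One iteration of Lloyd's algorithm: assignment step (3) + update step (4).
// samples: N x d row-major; centroids: k x d row-major (updated in place);
// assignment: output, assignment[i] = index of the nearest centroid of sample i.
void lloyd_iteration(const std::vector<double>& samples, std::size_t N, std::size_t d,
                     std::vector<double>& centroids, std::size_t k,
                     std::vector<std::size_t>& assignment) {
    assignment.assign(N, 0);
    // Assignment step: each sample goes to its closest centroid.
    for (std::size_t i = 0; i < N; ++i) {
        double best = std::numeric_limits<double>::max();
        for (std::size_t j = 0; j < k; ++j) {
            double dist = 0.0;
            for (std::size_t s = 0; s < d; ++s) {
                const double diff = samples[i * d + s] - centroids[j * d + s];
                dist += diff * diff;
            }
            if (dist < best) { best = dist; assignment[i] = j; }
        }
    }
    // Update step: each centroid becomes the mean of its assigned samples.
    std::vector<double> sums(k * d, 0.0);
    std::vector<std::size_t> counts(k, 0);
    for (std::size_t i = 0; i < N; ++i) {
        ++counts[assignment[i]];
        for (std::size_t s = 0; s < d; ++s)
            sums[assignment[i] * d + s] += samples[i * d + s];
    }
    for (std::size_t j = 0; j < k; ++j)
        if (counts[j] > 0)   // empty clusters keep their previous centroid
            for (std::size_t s = 0; s < d; ++s)
                centroids[j * d + s] = sums[j * d + s] / counts[j];
}
```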

This iterative process can be considered as alternating minimization of the target function with respect to the assignment variables and the centroid variables. As a result, Lloyd’s algorithm monotonically decreases the target function and guarantees convergence. In practice, however, Lloyd’s algorithm can suffer from a slow convergence rate and require a large number of iterations. Different acceleration techniques have been proposed to improve its performance:

  • In each iteration of Lloyd’s algorithm, the assignment step is the main computational bottleneck: a naïve implementation would require $N \cdot k$ distance computations to determine the nearest centroids for all $N$ samples, which can be a significant cost especially for a large number of samples or clusters. Therefore, many existing works focus on reducing the computational cost per iteration, e.g., by maintaining distance bounds between samples and centroids to avoid unnecessary calculations (Elkan, 2003; Hamerly, 2010; Hamerly and Drake, 2015; Ding et al., 2015; Newling and Fleuret, 2016).

  • As the target function of K-Means clustering is non-convex, different initial centroid positions might lead to different clustering results and different numbers of iterations in Lloyd’s algorithm. Various algorithms aim at improving the initialization to achieve better clustering and reduced computational time (Ng and Han, 1994; Bradley and Fayyad, 1998; Arthur and Vassilvitskii, 2007; Bachem et al., 2016; Newling and Fleuret, 2017).

Despite the improved performance provided by these methods, they do not alter the convergence rate of Lloyd’s algorithm and may still suffer from slow convergence. In this paper, we propose a novel approach that accelerates Lloyd’s algorithm from a different perspective. Our approach aims at modifying the iterates from Lloyd’s algorithm to improve the convergence rate and reduce the number of required iterations. Our contribution is two-fold:

  • First, we treat the sequence of centroid positions generated by Lloyd’s algorithm as the results of a fixed-point iteration solver. This enables us to improve their convergence rate using Anderson acceleration (Anderson, 1965; Walker and Ni, 2011), a well-established technique for accelerating fixed-point solvers (Fang and Saad, 2009; Toth and Kelley, 2015; Higham and Strabić, 2016; Toth et al., 2017). We adopt the modification proposed by Peng et al. (2018) to reduce the computational overhead and improve stability. The resulting solver greatly improves the performance of Lloyd’s algorithm in many instances.

  • Moreover, we show that the performance of classical Anderson acceleration, which relies on the previous $m$ iterates to determine an accelerated iterate, can be further improved by dynamically adjusting the value of $m$ according to the decrease of the target energy. Using this strategy, the accelerated solver achieves consistent acceleration across all problem instances and initializations in our experiments.

Our approach can not only be used as a standalone technique to accelerate Lloyd’s algorithm, but also be used in conjunction with existing approaches that reduce the computational cost per iteration or improve initialization, to deliver state-of-the-art performance for K-Means clustering. The effectiveness of our approach is tested on a large variety of datasets.

2 Algorithm

Before presenting our acceleration technique, let us first analyze the classical Lloyd’s algorithm to see why it can suffer from slow convergence. The main strength of Lloyd’s algorithm is the monotonic decrease of the target energy and the resulting guarantee of convergence. This property stems from the fact that each iteration of Lloyd’s algorithm computes the new centroids by minimizing a convex surrogate (i.e., upper bound function) of the target energy in (1) that is constructed from the current centroid positions $C^{(t)} = (c_1^{(t)}, \dots, c_k^{(t)})$. Specifically, using the correspondence between samples and centroids determined in the assignment step (3), we can derive a surrogate of $E$:

$$\hat{E}(c_1, \dots, c_k \mid C^{(t)}) = \sum_{j=1}^{k} \sum_{i \in I_j^{(t)}} \| x_i - c_j \|^2, \qquad (5)$$

where $I_j^{(t)}$ is the index set of samples assigned to centroid $c_j^{(t)}$. In particular, it can be shown that

$$\hat{E}(c_1, \dots, c_k \mid C^{(t)}) \geq E(c_1, \dots, c_k) \quad \text{for all } c_1, \dots, c_k, \qquad \hat{E}(C^{(t)} \mid C^{(t)}) = E(C^{(t)}).$$

The minimization of this surrogate function is exactly the update step (4). The surrogate property then guarantees the decrease of the target energy:

$$E(C^{(t+1)}) \leq \hat{E}(C^{(t+1)} \mid C^{(t)}) \leq \hat{E}(C^{(t)} \mid C^{(t)}) = E(C^{(t)}).$$

From this perspective, Lloyd’s algorithm is an instance of the majorization-minimization algorithm (Lange, 2016), which performs energy minimization by iteratively constructing and minimizing surrogate functions.

On the other hand, the surrogate function can be both a blessing and a curse. It is only guaranteed to be accurate within a small neighborhood of the current iterate $C^{(t)}$, and can deviate significantly from the true energy far away from $C^{(t)}$. Such deviation is incurred when changed centroid positions alter the assignment of samples to centroids, which occurs frequently unless the samples are highly separated into different clusters. In other words, there may be a rapid loss of accuracy for the surrogate function when moving away from $C^{(t)}$, which can hinder the energy decrease in the update step and lead to slow convergence.

Indeed, it has been pointed out by Bottou and Bengio (1994) that locally Lloyd’s algorithm is equivalent to Newton’s algorithm for minimizing the energy $E$. Since infinitesimal changes of the centroid positions do not alter the sample–centroid assignment, the surrogate function (5) is the second-order Taylor expansion of the target energy, and its minimization is equivalent to a Newton step. If the new centroid positions $C^{(t+1)}$ result in the same sample assignment as $C^{(t)}$, then the Newton algorithm converges in one step and $C^{(t+1)}$ is a local minimum. However, in many instances a large step will change the sample assignment and invalidate the Taylor expansion, which slows down the convergence.

From the discussions above, we can see that an important reason for slow convergence of Lloyd’s algorithm is that the centroid update only considers the local landscape of the target energy. Therefore, a natural idea to improve convergence is to incorporate more global information, which is the essence of the technique proposed in this paper.

2.1 Anderson Acceleration for Lloyd’s Algorithm

The key idea of our approach is that the sequence of centroid positions $C^{(1)}, C^{(2)}, \dots$ generated by Lloyd’s algorithm can be considered as the result of a fixed-point iteration scheme. Since the new centroid positions $C^{(t+1)}$ depend on the sample–centroid assignments, which in turn depend on the previous centroids $C^{(t)}$, we can combine the assignment step and the update step and write $C^{(t+1)}$ as a function of $C^{(t)}$:

$$C_{\mathrm{AU}}^{(t+1)} = G(C^{(t)}). \qquad (6)$$

Here the subscript AU indicates an iterate using the combined assignment and update step, which we distinguish from the accelerated iterate introduced later. If Lloyd’s algorithm converges to a local minimum $C^*$, then $C^*$ must be a fixed point of the mapping $G$, i.e., $G(C^*) = C^*$. Based on this observation, we can apply Anderson acceleration, a well-established technique for accelerating fixed-point solvers (Anderson, 1965; Walker and Ni, 2011), to improve the convergence of Lloyd’s algorithm. Note that Lloyd’s algorithm searches for centroid positions for which the residual

$$F(C) = G(C) - C$$

vanishes. Instead of simply taking $G(C^{(t)})$ as the new iterate, Anderson acceleration also makes use of the previous $m$ iterates to determine an accelerated iterate that achieves better reduction of the residual. Let us denote $F^{(j)} = F(C^{(j)})$. Then using the iterates $C^{(t-m)}, \dots, C^{(t)}$, we can approximate the high-dimensional graph of $F$ with the affine subspace spanned by the points $(C^{(t-m)}, F^{(t-m)}), \dots, (C^{(t)}, F^{(t)})$. Anderson acceleration searches within this affine subspace for a point whose residual components are as close to zero as possible, and takes the corresponding $G$ components as the new iterate $C^{(t+1)}$. This is equivalent to solving a linear least-squares problem (Walker and Ni, 2011; Toth and Kelley, 2015)

$$(\alpha_0^*, \dots, \alpha_m^*) = \operatorname*{arg\,min}_{\alpha_0 + \dots + \alpha_m = 1} \Big\| \sum_{j=0}^{m} \alpha_j F^{(t-m+j)} \Big\|^2, \qquad (7)$$

and then computing $C^{(t+1)}$ as

$$C^{(t+1)} = \sum_{j=0}^{m} \alpha_j^* \, G(C^{(t-m+j)}). \qquad (8)$$

In this way, the accelerated iterate is determined using a more global landscape of the residual, which often results in faster convergence as shown by various authors (Lipnikov et al., 2013; Higham and Strabić, 2016; Pratapa et al., 2016; An et al., 2017; Ho et al., 2017). Indeed, it has been shown in (Fang and Saad, 2009) that Anderson acceleration corresponds to a quasi-Newton method for finding the residual root, which approximates the inverse Jacobian by enforcing secant conditions using previous iterates.
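For illustration, the following sketch computes one accelerated iterate from stored values of $G$ and of the residual $F$, using the equivalent unconstrained difference form of the least-squares problem (7) described by Walker and Ni (2011) and Eigen for the solve; the container layout and the function name anderson_step are assumptions made for this example, not part of our implementation.

```cpp
#include <Eigen/Dense>
#include <deque>

// One Anderson-acceleration update (Eqs. (7)-(8)), written in the equivalent
// unconstrained form with difference vectors (Walker and Ni, 2011). Gs and Fs
// hold G(C) and F(C) = G(C) - C for the current and up to m previous iterates,
// most recent last; each centroid set is flattened into a single vector.
Eigen::VectorXd anderson_step(const std::deque<Eigen::VectorXd>& Gs,
                              const std::deque<Eigen::VectorXd>& Fs) {
    const int m = static_cast<int>(Fs.size()) - 1;   // number of previous iterates used
    const Eigen::VectorXd& G_cur = Gs.back();
    const Eigen::VectorXd& F_cur = Fs.back();
    if (m <= 0) return G_cur;                        // no history yet: plain fixed-point step
    Eigen::MatrixXd dF(F_cur.size(), m), dG(G_cur.size(), m);
    for (int j = 0; j < m; ++j) {                    // differences to the (j+1)-th previous iterate
        dF.col(j) = F_cur - Fs[m - 1 - j];
        dG.col(j) = G_cur - Gs[m - 1 - j];
    }
    // Solve the linear least-squares problem (7) for the combination weights.
    const Eigen::VectorXd theta = dF.colPivHouseholderQr().solve(F_cur);
    // Accelerated iterate (8): combine the corresponding G values.
    return G_cur - dG * theta;
}
```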

Although Anderson acceleration works well for many problems, it does not guarantee energy decrease in general, and can stagnate at a wrong solution (Walker and Ni, 2011; Potra and Engler, 2013). Recently, Peng et al. (2018) proposed a modification of Anderson acceleration that improves stability. For a fixed-point solver that guarantees energy decrease in each iteration, they check whether the accelerated iterate decreases the energy compared with the previous iterate; if not, they revert to the unaccelerated iterate from the fixed-point solver. This approach guarantees monotonic decrease of the target energy and is shown to be effective in improving the robustness of Anderson acceleration. We adopt the same strategy to stabilize the accelerated solver, by checking the decrease of the target energy (1): if the accelerated iterate does not decrease the energy, we revert to the iterate from Lloyd’s algorithm, which guarantees that the new energy value is no larger than that of the previous iteration. An additional benefit of this strategy is the low computational overhead compared with the unaccelerated solver. For the K-Means clustering problem, the overhead per iteration consists of two parts: (i) computing the accelerated iterate according to Equations (7) and (8); (ii) checking the energy of the accelerated iterate. Part (i) involves inner products between $kd$-dimensional vectors as well as solving a small $m \times m$ linear system, which is typically a small cost. For Part (ii), the evaluation of the target energy (1) requires assigning the sample points to the accelerated centroid positions, and this assignment can be reused in the next iteration unless we need to revert to the unaccelerated iterate. Therefore, if the accelerated iterate is accepted, Part (ii) only involves computing the distance from each sample to its assigned centroid, with a total time complexity of $O(Nd)$, which is often only a small portion of the computation per iteration (Peng et al., 2018). If the accelerated iterate is not accepted, an additional assignment step is incurred, but the overhead is still acceptable if we adopt a fast assignment method such as the one from (Hamerly, 2010). In our experiments, the majority of accelerated iterates are accepted (see Tables 2 and 3), thus the overhead from Part (ii) is often small.

For the classical Lloyd’s algorithm, its convergence is indicated by the same assignment of samples to centroids between two consecutive iterations, because in this case the iterative algorithm can no longer decrease the target energy. For our accelerated algorithm, it is easy to show that the convergence criterion remains the same: since we only accept an accelerated iterate when it decreases the target energy, our algorithm runs until an accelerated iterate is rejected and the fall-back iterate using Lloyd’s algorithm results in the same assignment as the previous iteration, which indicates a local minimum of the energy.
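The accept-or-revert logic can be sketched as follows; the energy callback is assumed to evaluate Eq. (1) (e.g., along the lines of the kmeans_energy sketch above, caching the resulting assignment for reuse), and the struct and function names are illustrative rather than those of our implementation.

```cpp
#include <functional>
#include <vector>

// Stabilization check adopted from Peng et al. (2018): accept the accelerated
// centroids only if they decrease the K-Means energy (1); otherwise fall back
// to the unaccelerated Lloyd iterate, whose energy is guaranteed not to exceed
// that of the previous iteration.
struct IterateChoice {
    std::vector<double> centroids;  // the iterate carried into the next iteration
    bool accepted;                  // true if the accelerated iterate was kept
};

IterateChoice choose_iterate(const std::vector<double>& accelerated,
                             const std::vector<double>& lloyd_fallback,
                             double previous_energy,
                             const std::function<double(const std::vector<double>&)>& energy) {
    if (energy(accelerated) < previous_energy)
        return {accelerated, true};   // accelerated iterate decreases the energy
    return {lloyd_fallback, false};   // revert to the unaccelerated Lloyd iterate
}
```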

2.2 Dynamic Adjustment of $m$

Data: $X = \{x_1, \dots, x_N\}$: sample points; $C^{(0)}$: initial cluster centroids;
$m$: the number of previous iterates used for acceleration;
Assignment-Step($C$): assigning samples to clusters according to Eq. (3);
Update-Step($A$): updating cluster centroids according to Eq. (4);
$G(C)$ = Update-Step(Assignment-Step($C$)): the mapping combining the assignment and update steps;
$E(C, A)$: evaluating the K-Means clustering energy (1) using the pre-computed assignment $A$;
$m_{\max}$: the maximum number of previous iterates used for acceleration.
Result: A sequence $\{C^{(t)}\}$ converging to a local minimum of the energy $E$.
1       $C^{(1)} = G(C^{(0)})$;   $E^{(0)} = E(C^{(0)}, \text{Assignment-Step}(C^{(0)}))$;   initialize $m$;
2       for $t = 1, 2, \dots$ do
              // Check for convergence
3             $A^{(t)}$ = Assignment-Step($C^{(t)}$);
4             if $A^{(t)}$ is identical to the assignment of the previous iteration then
5                   return $C^{(t)}$;
6             end if
              // Adjust the value of $m$
7             $E^{(t)} = E(C^{(t)}, A^{(t)})$;
8             if $E^{(t)} > E^{(t-1)}$, or the decrease $E^{(t-1)} - E^{(t)}$ is small compared with the previous decrease $E^{(t-2)} - E^{(t-1)}$ then
9                   decrease $m$ (keeping $m \geq 1$);
10            else if the decrease $E^{(t-1)} - E^{(t)}$ is large compared with the previous decrease then
11                  increase $m$ (keeping $m \leq m_{\max}$);
12            end if
              // Make sure $C^{(t)}$ decreases the energy
13            if $E^{(t)} \geq E^{(t-1)}$ then
14                  $C^{(t)} = C_{\mathrm{AU}}^{(t)}$;   $A^{(t)}$ = Assignment-Step($C^{(t)}$);   $E^{(t)} = E(C^{(t)}, A^{(t)})$;
15            end if
              // Anderson acceleration
16            $C_{\mathrm{AU}}^{(t+1)}$ = Update-Step($A^{(t)}$);
17            $F^{(t)} = C_{\mathrm{AU}}^{(t+1)} - C^{(t)}$;   record $F^{(t)}$ and $C_{\mathrm{AU}}^{(t+1)}$;
18            solve the least-squares problem (7) using the latest $m$ previous iterates;
19            $C^{(t+1)}$ = the accelerated iterate given by Eq. (8);
20      end for
Algorithm 1: Anderson acceleration for the K-Means algorithm.

By default, the number $m$ of previous iterates used in Anderson acceleration is fixed during the whole solving process, and its value can influence the convergence rate (Higham and Strabić, 2016; Peng et al., 2018). With a larger value of $m$, more information is utilized for approximating the inverse Jacobian. On the other hand, a larger $m$ increases the computational cost, and may overfit iterates that are far away. In general, the optimal choice of $m$ is problem-dependent and difficult to determine beforehand. To achieve a more robust speedup, we propose an adaptive strategy to dynamically adjust the value of $m$ according to the decrease of energy. The key idea is to compare the energy decrease from the current iterate with the decrease from the last iterate. In particular:

  • If the current iterate increases the energy, or the energy decrease in the current iteration is small compared to the previous iteration, then $m$ should become smaller.

  • If the energy decrease in the current iteration is large enough compared to the previous iteration, then $m$ should become larger.

  • Otherwise, $m$ is not altered.

This strategy is similar to the trust-region method, which enlarges or shrinks the trust region according to the effectiveness of the current step (Nocedal and Wright, 2006). The full acceleration algorithm for K-Means clustering is shown in Algorithm 1, where lines 7–12 perform the adjustment of $m$. Our experiments show that dynamic adjustment of $m$ achieves better performance than a fixed value for many problem instances (see Section 3.1).
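As a rough illustration of this adjustment rule, the following sketch compares the energy decrease of the current iteration with that of the previous one; the ratio thresholds and the ±1 step for changing $m$ are assumptions made for this example, not the exact rule used in Algorithm 1.

```cpp
#include <algorithm>

// Heuristic adjustment of m (Section 2.2): compare the energy decrease of the
// current iteration with that of the previous one. Thresholds are illustrative.
int adjust_m(int m, int m_max, double E_prev2, double E_prev, double E_cur) {
    const double prev_decrease = E_prev2 - E_prev;   // decrease in the previous iteration
    const double cur_decrease  = E_prev - E_cur;     // decrease in the current iteration
    const double shrink_ratio = 0.25;                // assumed threshold for "small" progress
    const double grow_ratio   = 1.0;                 // assumed threshold for "large" progress
    if (cur_decrease <= 0.0 || cur_decrease < shrink_ratio * prev_decrease)
        return std::max(1, m - 1);                   // little or no progress: use fewer previous iterates
    if (cur_decrease > grow_ratio * prev_decrease)
        return std::min(m_max, m + 1);               // good progress: use more previous iterates
    return m;                                        // otherwise keep m unchanged
}
```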

3 Results

We evaluate the performance of our algorithm and compare it with other K-Means clustering algorithms in a series of experiments. All algorithms are implemented in C++ and parallelized on the CPU using OpenMP. The Assignment-Step in Algorithm 1 is implemented based on Hamerly’s method (Hamerly, 2010). Although it can be further optimized using more recent methods such as those from (Ding et al., 2015) and (Newling and Fleuret, 2016), this would not affect the reduction ratio in the number of iterations.

We conduct the experiments on 20 datasets of a variety of sizes. They consist of 19 real-world datasets from the UCI machine learning repository (Dheeru and Karra Taniskidou, 2017), and the synthetic Birch dataset (Zhang et al., 1997) that contains clusters in a regular grid structure (https://cs.joensuu.fi/sipu/datasets/). Table 1 shows the size and dimension of each dataset ($N$ for the number of samples, $d$ for the dimension). All algorithms are started with the same initial centroids and iterated until convergence, and are tested on a desktop PC with a quad-core Intel i7 CPU and 4GB RAM. For Algorithm 1, we use the same parameter settings in all tests and initialize $m$ to 2 by default.

3.1 Settings of $m$

Table 2 compares the performance of our method using fixed and dynamic values of $m$, showing the computational time and the number of iterations required by each strategy. We can observe that in the majority of cases, dynamic adjustment of $m$ reduces the computational time as well as the number of iterations compared with a fixed $m$. The total computational time is reduced by more than 20 percent for most datasets, and by an even larger margin for some of them. Although we use the same parameter settings for all datasets, the dynamic adjustment strategy performs consistently better than a fixed-$m$ strategy.

 

No. Name N d No. Name N d
1 UCIHARDATAXtrain 7352 561 11 Colorment 68040 9
2 Slicelocalization 53500 385 12 Conflongdemo 164860 3
3 RelationNetwork 53413 22 13 Birch 100000 2
4 Letterrecognition 20000 16 14 Shuttle 43500 9
5 HTRU2 17898 8 15 Covtype 581012 55
6 Household 2049280 6 16 SkinNonSkin 245057 4
7 FrogsMFCCs 7195 21 17 Finalgeneral 10104 72
8 Eb 45781 2 18 ColorHistogram 68040 32
9 AllUsers 78095 8 19 USCensus1990 2458285 69
10 MiniBoone 130064 50 20 Kddcup99 4898431 37

 

Table 1: The 20 datasets used in our experiments.

 

Dataset Fixed Dynamic Fixed Dynamic
#Iter Time (s) MSE #Iter Time (s) MSE #Iter Time (s) MSE #Iter Time (s) MSE
1 22 / 27 0.22 15.08 27 / 31 0.25 15.08 36 / 45 0.46 15.02 27 / 35 0.36 15.03
2 27 / 30 1.36 15.16 18 / 21 0.98 15.16 32 / 38 1.97 15.16 26 / 29 1.40 15.16
3 39 / 46 0.48 2.61 31 / 38 0.43 2.61 12 / 13 0.14 2.63 12 / 13 0.15 2.63
4 50 / 61 0.30 2.89 17 / 22 0.11 2.89 35 / 38 0.20 2.90 19 / 21 0.11 2.90
5 38 / 49 0.15 1.24 23 / 26 0.07 1.24 38 / 54 0.21 1.24 34 / 43 0.15 1.24
6 30 / 35 11.19 1.17 26 / 29 9.21 1.17 18 / 22 8.97 1.16 16 / 18 6.89 1.16
7 20 / 21 0.03 2.57 16 / 18 0.02 2.57 12 / 13 0.02 2.59 10 / 11 0.01 2.59
8 30 / 35 0.12 0.45 35 / 46 0.20 0.45 21 / 23 0.08 0.45 14 / 16 0.07 0.45
9 41 / 51 0.88 1.93 34 / 43 0.77 1.93 38 / 55 1.21 1.93 36 / 53 1.21 1.93
10 39 / 41 1.30 1.70 26 / 29 0.96 1.70 27 / 34 1.23 1.70 19 / 22 0.76 1.70
11 99 / 119 1.71 2.06 71 / 83 1.16 2.06 139 / 179 2.74 2.06 91 / 115 1.68 2.06
12 16 / 18 0.39 0.83 19 / 21 0.45 0.83 19 / 20 0.45 0.83 18 / 19 0.41 0.83
13 14 / 21 0.19 0.42 17 / 24 0.22 0.42 9 / 11 0.08 0.41 10 / 11 0.08 0.41
14 6 / 9 0.07 1.63 6 / 9 0.06 1.63 7 / 9 0.06 1.68 7 / 9 0.06 1.68
15 22 / 25 3.86 6.47 20 / 24 3.89 6.47 21 / 26 4.39 6.47 17 / 22 3.82 6.47
16 9 / 12 0.30 0.55 8 / 11 0.28 0.55 12 / 16 0.39 0.55 12 / 16 0.39 0.55
17 46 / 47 0.13 7.63 29 / 32 0.10 7.63 43 / 53 0.22 7.63 18 / 19 0.07 7.63
18 19 / 22 0.37 4.53 21 / 26 0.47 4.53 29 / 35 0.72 4.53 16 / 19 0.38 4.53
19 14 / 15 14.37 6.02 15 / 17 16.91 6.02 14 / 20 22.50 6.02 15 / 21 21.09 6.02
20 8 / 10 6.11 3.91 8 / 10 6.04 3.91 8 / 12 7.71 3.78 9 / 12 7.33 3.78

 

Table 2: Performance comparison between the fixed-$m$ and dynamic-$m$ strategies for our method. For the dynamic strategies, the initial value of $m$ is shown in the header. For each strategy, an entry $a$ / $b$ in the #Iter column means that the solver takes $b$ iterations in total, out of which $a$ iterations accept the accelerated iterate. Between each pair of strategies with the same $m$ value, the shorter computational time for each dataset is highlighted in bold font.

3.2 Comparison with Lloyd’s Algorithm

In Table 3, we compare our method with Lloyd’s algorithm using the assignment method from Hamerly (2010). We show the number of iterations, the total computational time, and the final mean squared error (MSE) between the samples and their corresponding centroids. To demonstrate the robustness of our method to the initial centroid positions, we test four different initialization techniques: K-Means++ (Arthur and Vassilvitskii, 2007), afk-mc² (Bachem et al., 2016), bf (Bradley and Fayyad, 1998), and CLARANS (Newling and Fleuret, 2017). The initial centroids are generated by the code accompanying the paper (Newling and Fleuret, 2017).

For all 20 datasets, we test the methods with a fixed number of clusters, starting from each of the four sets of initial centroids, resulting in 80 test cases. Table 3 shows that our method requires less computational time than Lloyd’s algorithm in the majority of cases. In particular, our method reduces the computational time by more than 25% in 42 cases, and by more than 50% in 15 cases. With the initialization techniques K-Means++, afk-mc², bf, and CLARANS, our method outperforms Lloyd’s algorithm on 15, 20, 18, and 19 datasets out of 20, respectively.

To demonstrate the robustness of our method to the number of clusters $k$, we also test each dataset using different values of $k$. The last three columns of Table 3 show the performance of both methods initialized with CLARANS for three increasing numbers of clusters. We can observe that our method consistently outperforms Lloyd’s algorithm for different values of $k$.

 

Dataset K-Means++ afk-mc² bf CLARANS
1 #Iter 26 22 / 25 39 21 / 23 23 12 / 14 30 27 / 31 53 26 / 27 14 10 / 11
Time (s) 0.31 0.30 0.43 0.29 0.33 0.16 0.35 0.25 5.74 2.97 19.38 14.41
MSE 15.25 15.25 15.03 15.03 15.97 15.97 15.08 15.08 12.40 12.41 9.63 9.63
2 #Iter 87 61 / 79 27 22 / 24 15 10 / 11 42 18 / 21 49 49 / 55 40 31 / 33
Time (s) 5.32 5.71 1.55 1.37 0.83 0.61 2.40 0.98 28.14 34.28 167.11 143.06
MSE 15.26 15.26 15.27 15.27 16.78 16.78 15.16 15.16 12.55 12.55 8.72 8.72
3 #Iter 31 23 / 27 33 20 / 24 28 16 / 18 54 31 / 38 41 26 / 28 20 16 / 17
Time (s) 0.49 0.46 0.42 0.35 0.41 0.22 0.68 0.43 3.54 2.50 13.58 11.44
MSE 2.68 2.68 2.74 2.74 2.77 2.77 2.61 2.61 1.47 1.47 0.51 0.51
4 #Iter 51 51 / 61 57 31 / 35 75 42 / 50 66 17 / 22 98 43 / 49 21 15 / 16
Time (s) 0.32 0.42 0.34 0.21 0.39 0.29 0.39 0.11 4.57 2.51 9.69 7.42
MSE 2.90 2.90 2.92 2.92 2.90 2.90 2.89 2.89 1.88 1.88 1.13 1.13
5 #Iter 137 78 / 104 110 38 / 52 127 63 / 82 51 23 / 26 65 26 / 31 28 25 / 28
Time (s) 0.50 0.46 0.38 0.25 0.40 0.32 0.20 0.07 2.13 1.15 10.72 11.46
MSE 1.24 1.24 1.24 1.24 1.24 1.24 1.24 1.24 0.62 0.62 0.31 0.31
6 #Iter 209 53 / 89 241 38 / 55 67 29 / 33 61 26 / 29 642 282 / 377 403 549 / 610
Time (s) 87.86 54.79 96.47 30.81 27.67 14.78 22.48 9.21 2043.86 1546.83 10350.36 17077.98
MSE 1.18 1.18 1.20 1.20 1.33 1.33 1.17 1.17 0.63 0.63 0.29 0.29
7 #Iter 15 14 / 17 23 13 / 15 12 9 / 10 22 16 / 18 70 35 / 46 17 12 / 13
Time (s) 0.02 0.03 0.04 0.03 0.02 0.01 0.04 0.02 1.42 1.16 4.47 3.42
MSE 2.63 2.63 2.84 2.84 2.91 2.91 2.57 2.57 1.62 1.62 0.98 0.98
8 #Iter 97 51 / 67 87 24 / 28 83 20 / 30 82 35 / 46 175 82 / 89 33 32 / 35
Time (s) 0.48 0.40 0.44 0.16 0.36 0.18 0.40 0.20 5.86 3.20 13.53 14.43
MSE 0.45 0.45 0.45 0.45 0.45 0.45 0.45 0.45 0.14 0.14 0.04 0.04
9 #Iter 106 57 / 74 101 35 / 45 57 24 / 28 65 34 / 43 83 31 / 37 35 30 / 32
Time (s) 2.25 1.93 2.08 1.09 1.22 0.68 1.36 0.77 14.51 7.34 43.43 39.93
MSE 1.94 1.94 1.93 1.93 2.19 2.19 1.93 1.93 1.27 1.27 0.73 0.73
10 #Iter 142 91 / 125 85 41 / 53 37 14 / 16 57 26 / 29 361 65 / 80 129 48 / 53
Time (s) 6.40 7.11 3.36 2.62 1.12 0.50 2.33 0.96 150.54 41.07 430.11 193.56
MSE 1.41 1.41 1.70 1.70 1.85 1.85 1.70 1.70 0.87 0.87 0.60 0.60
11 #Iter 106 72 / 88 169 69 / 93 89 26 / 32 142 71 / 83 316 81 / 88 95 59 / 61
Time (s) 1.52 1.49 2.09 1.45 1.32 0.57 1.83 1.16 42.87 13.10 128.37 83.78
MSE 2.05 2.05 2.06 2.06 2.19 2.19 2.06 2.06 1.38 1.38 0.91 0.91
12 #Iter 73 28 / 44 38 24 / 28 48 28 / 35 36 19 / 21 190 81 / 92 201 102 / 110
Time (s) 2.06 1.79 1.06 0.94 1.40 1.26 0.94 0.45 41.45 22.73 353.33 215.20
MSE 0.83 0.83 0.85 0.85 0.91 0.91 0.83 0.83 0.40 0.40 0.18 0.18
13 #Iter 25 13 / 18 27 16 / 18 27 11 / 14 39 17 / 24 101 44 / 57 61 33 / 35
Time (s) 0.34 0.31 0.39 0.30 0.30 0.17 0.39 0.22 7.06 4.74 56.28 32.52
MSE 0.45 0.45 0.46 0.46 0.42 0.42 0.42 0.42 0.09 0.09 0.03 0.03
14 #Iter 30 19 / 22 39 20 / 28 10 7 / 9 19 6 / 9 25 7 / 9 25 16 / 17
Time (s) 0.16 0.14 0.25 0.25 0.07 0.08 0.14 0.06 1.42 0.65 16.23 11.32
MSE 1.77 1.77 1.71 1.71 1.69 1.69 1.63 1.63 0.35 0.35 0.14 0.14
15 #Iter 48 20 / 25 113 60 / 80 33 14 / 19 30 20 / 24 62 35 / 39 226 150 / 168
Time (s) 10.36 6.23 21.38 19.96 6.81 4.75 5.91 3.89 100.54 67.53 3087.60 2541.44
MSE 6.50 6.50 6.51 6.51 6.48 6.48 6.47 6.47 2.69 2.69 1.30 1.30
16 #Iter 12 9 / 10 47 19 / 29 34 15 / 22 20 8 / 11 42 20 / 24 25 35 / 39
Time (s) 0.37 0.28 1.35 1.15 0.77 0.72 0.48 0.28 7.84 5.19 74.01 110.09
MSE 0.60 0.60 0.63 0.63 0.72 0.72 0.55 0.55 0.15 0.15 0.05 0.05
17 #Iter 31 28 / 36 49 25 / 28 3 2 / 3 58 29 / 32 35 18 / 22 15 12 / 13
Time (s) 0.15 0.18 0.21 0.13 0.01 0.01 0.25 0.10 1.37 1.03 6.81 5.60
MSE 7.62 7.62 7.74 7.74 7.78 7.78 7.63 7.63 6.12 6.12 4.70 4.69
18 #Iter 44 28 / 33 75 31 / 46 55 25 / 28 43 21 / 26 163 46 / 51 77 83 / 85
Time (s) 1.03 0.77 1.51 1.20 1.15 0.69 0.91 0.47 31.06 10.33 120.59 134.91
MSE 4.53 4.53 4.55 4.55 5.23 5.23 4.53 4.53 2.61 2.61 1.78 1.78
19 #Iter 69 34 / 60 57 23 / 34 15 11 / 13 17 15 / 17 60 32 / 32 1 46 / 1
Time (s) 255.37 72.35 52.41 41.87 13.22 12.22 16.66 16.91 543.64 288.18 754.89 630.99
MSE 6.17 6.17 6.11 6.11 6.66 6.66 6.02 6.02 4.18 4.18 0.01 0.00
20 #Iter 13 11 / 12 20 13 / 15 8 9 / 10 14 8 / 10 83 38 / 53 163 58 / 72
Time (s) 11.09 11.09 14.83 13.37 5.84 6.80 7.72 6.04 220.72 191.36 10408.44 7212.02
MSE 3.92 3.92 4.07 4.07 3.89 3.89 3.91 3.91 1.86 1.86 0.77 0.77

 

Table 3: Comparison between our method and Lloyd’s algorithm with the assignment method of (Hamerly, 2010), for different initialization strategies and different numbers of clusters. In each cell, the performance of our method is shown on the right; an entry $a$ / $b$ in a #Iter row means that our solver takes $b$ iterations in total, out of which $a$ iterations accept the accelerated iterate.

4 Conclusion

In this paper, we apply a modified Anderson acceleration scheme to improve the performance of Lloyd’s algorithm. We also propose a strategy for dynamically adjusting the parameter $m$ to improve the robustness of Anderson acceleration. Experimental results show that our method can improve the convergence rate of Lloyd’s algorithm regardless of the dimension and size of the datasets, the number of clusters, and the initial centroids. The proposed method can be combined with existing acceleration schemes that reduce the assignment cost and improve the initialization.

Although our algorithm performs consistently better than the classical Lloyd’s algorithm, its acceleration ratio varies between problem instances. One interesting direction for future work is to investigate the characteristics of the samples that influence the effectiveness of our approach. In addition, many optimization algorithms used in machine learning have a similar form to Lloyd’s algorithm; extending the proposed acceleration to such problems is another interesting direction for future research.

References

  • Aloise et al. (2009) Aloise, D., Deshpande, A., Hansen, P., and Popat, P. (2009). NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75(2), 245–248.
  • An et al. (2017) An, H., Jia, X., and Walker, H. F. (2017). Anderson acceleration and application to the three-temperature energy equations. Journal of Computational Physics, 347, 1–19.
  • Anderson (1965) Anderson, D. G. (1965). Iterative procedures for nonlinear integral equations. J. ACM, 12(4), 547–560.
  • Arthur and Vassilvitskii (2007) Arthur, D. and Vassilvitskii, S. (2007). k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics.
  • Bachem et al. (2016) Bachem, O., Lucic, M., Hassani, H., and Krause, A. (2016). Fast and provably good seedings for k-means. In Advances in Neural Information Processing Systems, pages 55–63.
  • Bottou and Bengio (1994) Bottou, L. and Bengio, Y. (1994). Convergence properties of the k-means algorithms. In Advances in Neural Information Processing Systems, pages 585–592.
  • Bradley and Fayyad (1998) Bradley, P. S. and Fayyad, U. M. (1998). Refining initial points for k-means clustering. In International Conference on Machine Learning, volume 98, pages 91–99.
  • Dheeru and Karra Taniskidou (2017) Dheeru, D. and Karra Taniskidou, E. (2017). UCI machine learning repository.
  • Ding et al. (2015) Ding, Y., Zhao, Y., Shen, X., Musuvathi, M., and Mytkowicz, T. (2015). Yinyang k-means: A drop-in replacement of the classic k-means with consistent speedup. In International Conference on Machine Learning, pages 579–587.
  • Elkan (2003) Elkan, C. (2003). Using the triangle inequality to accelerate k-means. In International Conference on Machine Learning, pages 147–153.
  • Fang and Saad (2009) Fang, H.-r. and Saad, Y. (2009). Two classes of multisecant methods for nonlinear acceleration. Numerical Linear Algebra with Applications, 16(3), 197–221.
  • Hamerly (2010) Hamerly, G. (2010). Making k-means even faster. In SIAM international conference on data mining, pages 130–140.
  • Hamerly and Drake (2015) Hamerly, G. and Drake, J. (2015). Accelerating Lloyd’s algorithm for k-means clustering. In Partitional Clustering Algorithms, pages 41–78. Springer.
  • Higham and Strabić (2016) Higham, N. J. and Strabić, N. (2016). Anderson acceleration of the alternating projections method for computing the nearest correlation matrix. Numerical Algorithms, 72(4), 1021–1042.
  • Ho et al. (2017) Ho, N., Olson, S. D., and Walker, H. F. (2017). Accelerating the Uzawa algorithm. SIAM Journal on Scientific Computing, 39(5), S461–S476.
  • Lange (2016) Lange, K. (2016). MM Optimization Algorithms. SIAM.
  • Lipnikov et al. (2013) Lipnikov, K., Svyatskiy, D., and Vassilevski, Y. (2013). Anderson acceleration for nonlinear finite volume scheme for advection-diffusion problems. SIAM Journal on Scientific Computing, 35(2), A1120–A1136.
  • Newling and Fleuret (2016) Newling, J. and Fleuret, F. (2016). Fast k-means with accurate bounds. In International Conference on Machine Learning, pages 936–944.
  • Newling and Fleuret (2017) Newling, J. and Fleuret, F. (2017). K-medoids for k-means seeding. In Advances in Neural Information Processing Systems, pages 5201–5209.
  • Ng and Han (1994) Ng, R. T. and Han, J. (1994). Efficient and effective clustering methods for spatial data mining. In Proceedings of 20th International Conference on Very Large Data Bases, pages 144–155.
  • Nocedal and Wright (2006) Nocedal, J. and Wright, S. J. (2006). Numerical Optimization. Springer, New York, NY, USA, second edition.
  • Peng et al. (2018) Peng, Y., Deng, B., Zhang, J., Geng, F., Qin, W., and Liu, L. (2018). Anderson acceleration for geometry optimization and physics simulation. ACM Transactions on Graphics, 37(4).
  • Potra and Engler (2013) Potra, F. A. and Engler, H. (2013). A characterization of the behavior of the Anderson acceleration on linear problems. Linear Algebra and its Applications, 438(3), 1002–1011.
  • Pratapa et al. (2016) Pratapa, P. P., Suryanarayana, P., and Pask, J. E. (2016). Anderson acceleration of the Jacobi iterative method: An efficient alternative to Krylov methods for large, sparse linear systems. Journal of Computational Physics, 306, 43–54.
  • Toth and Kelley (2015) Toth, A. and Kelley, C. T. (2015). Convergence analysis for Anderson acceleration. SIAM Journal on Numerical Analysis, 53(2), 805–819.
  • Toth et al. (2017) Toth, A., Ellis, J. A., Evans, T., Hamilton, S., Kelley, C. T., Pawlowski, R., and Slattery, S. (2017). Local improvement results for anderson acceleration with inaccurate function evaluations. SIAM Journal on Scientific Computing, 39(5), S47–S65.
  • Walker and Ni (2011) Walker, H. F. and Ni, P. (2011). Anderson acceleration for fixed-point iterations. SIAM Journal on Numerical Analysis, 49(4), 1715–1735.
  • Zhang et al. (1997) Zhang, T., Ramakrishnan, R., and Livny, M. (1997). Birch: A new data clustering algorithm and its applications. Data Mining and Knowledge Discovery, 1(2), 141–182.