Fast KMeans Clustering with Anderson Acceleration
Abstract
We propose a novel method to accelerate Lloyd’s algorithm for KMeans clustering. Unlike previous acceleration approaches that reduce computational cost per iterations or improve initialization, our approach is focused on reducing the number of iterations required for convergence. This is achieved by treating the assignment step and the update step of Lloyd’s algorithm as a fixedpoint iteration, and applying Anderson acceleration, a wellestablished technique for accelerating fixedpoint solvers. Classical Anderson acceleration utilizes previous iterates to find an accelerated iterate, and its performance on KMeans clustering can be sensitive to choice of and the distribution of samples. We propose a new strategy to dynamically adjust the value of , which achieves robust and consistent speedups across different problem instances. Our method complements existing acceleration techniques, and can be combined with them to achieve stateoftheart performance. We perform extensive experiments to evaluate the performance of the proposed method, where it outperforms other algorithms in 106 out of 120 test cases, and the mean decrease ratio of computational time is more than 33%.
Fast KMeans Clustering with Anderson Acceleration
Juyong Zhang Yuxin Yao Yue Peng Hao Yu University of Science and Technology of China juyong@ustc.edu.cn yaoyuxin1@126.com {echoyue,yh8971}@mail.ustc.edu.cn Bailin Deng^{†}^{†}thanks: Corresponding author. Cardiff University DengB3@cardiff.ac.uk
noticebox[b]Preprint. Work in progress.\end@float
1 Introduction
KMeans clustering is a fundamental method with various applications such as data compression, classification, and density estimation. Given a set of samples , and a positive number , KMeans clustering partitions the samples into clusters, by minimizing the total squared distances from the samples to the centroids of their respective clusters, resulting in an optimization problem
(1) 
where , and is the centroid closest to the sample . This is a nonconvex optimization problem, and it is NPhard to find the global minimum (Aloise et al., 2009). On the other hand, there exist effective algorithms to find a local minimum, the most popular being Lloyd’s algorithm, often simply referred to as the KMeans algorithm.
The classical Lloyd’s algorithm is equivalent to minimizing the target energy with both the cluster assignment and the centroid positions as variables:
(2) 
where () is the index of the cluster that is assigned to. Lloyd’s algorithm performs the minimization in a twostep iterative process:

In the assignment step, each sample is assigned to the cluster whose centroid is the closest to :
(3) 
In the update step, each centroid is updated to the mean of all samples assigned to cluster :
(4) where is the number of assigned samples for cluster .
This iterative process can be considered as alternating minimization of the target function with respective to the assignment variables and the centroid variables, respectively. As a result, Lloyd’s algorithm monotonically decreases the target function and guarantees convergence. In practice, however, Lloyd’s algorithm can suffer from slow convergence rate and require a large number of iterations. Different acceleration techniques have been proposed to improve its performance:

In each iteration of Lloyd’s algorithm, the assignment step is the main computational bottleneck: a naïve implementation would require distance computations to determine the nearest centroids for all samples, which can be a significant cost especially for a large number of samples or clusters. Therefore, many existing works focus on reducing the computational cost per iteration, e.g., by maintaining distance bounds between samples and centroids to avoid unnecessary calculations (Elkan, 2003; Hamerly, 2010; Hamerly and Drake, 2015; Ding et al., 2015; Newling and Fleuret, 2016).

As the target function of KMeans clustering is nonconvex, different initial centroid positions might lead to different clustering results and different number of iterations used in Lloyd’s algorithm. Various algorithms aim at improving the initialization to achieve better clustering and reduced computational time (Ng and Han, 1994; Bradley and Fayyad, 1998; Arthur and Vassilvitskii, 2007; Bachem et al., 2016; Newling and Fleuret, 2017).
Despite the improved performance provided by these methods, they do not alter the convergence rate of Lloyd’s algorithm and may still suffer from slow convergence. In this paper, we propose a novel approach that accelerates Lloyd’s algorithm from a different perspective. Our approach aims at modifying the iterates from Lloyd’s algorithm to improve the convergence rate and reduce the number of required iterations. Our contribution is twofold:

First, we treat the sequence of centroid positions generated by Lloyd’s algorithm as the results of a fixedpoint iteration solver. This enables us to improve their convergence rate using Anderson acceleration (Anderson, 1965; Walker and Ni, 2011), a wellestablished technique for accelerating fixedpoint solvers (Fang and Saad, 2009; Toth and Kelley, 2015; Higham and Strabić, 2016; Toth et al., 2017). We adopt the modification proposed by Peng et al. (2018) to reduce the computational overhead and improve stability. The resulting solver greatly improves the performance of Lloyd’s algorithm in many instances.

Moreover, we show that the performance of classical Anderson acceleration, which relies on the previous iterates to determine an accelerated iterate, can be further improved by dynamically adjusting the value of according to the decrease of target energy. Using this strategy, the accelerated solver achieves consistent acceleration across all problem instances and initializations in our experiments.
Our approach can not only be used as a standalone technique to accelerate Lloyd’s algorithm, but also be used in conjugation with existing approaches that reduce computational cost per iteration and improve initialization to deliver stateoftheart performance for KMeans clustering. The effectiveness of our approach is tested on a large variety of datasets.
2 Algorithm
Before presenting our acceleration technique, let us first analyze the classical Lloyd’s algorithm to see why it can suffer from slow convergence. The main strength of Lloyd’s algorithm is the monotonic decrease of target energy and the resulting guarantee of convergence. This property stems from the fact that each iteration of Lloyd’s algorithm computes the new centroids by minimizing a convex surrogate (i.e., upper bound function) of the target energy in (1) that is constructed from the current centroid positions . Specifically, using the correspondence between samples and centroids determined in the assignment step (3), we can derive a surrogate of
(5) 
where is the index set of samples assigned to centroid . In particular, it can be shown that
The minimization of this surrogate function is exactly the update step (4). The surrogate property then guarantees the decrease of target energy :
From this perspective, Lloyd’s algorithm is an instance of the majorizationminimization algorithm (Lange, 2016), which performs energy minimization by iteratively constructing and minimizing surrogate functions.
On the other hand, the surrogate function can be both a blessing and a curse. It is only guaranteed to be accurate within a small neighborhood of the current iterate , and can deviate significantly from the true energy when being far away from . Such deviation is incurred when changed centroid positions alter the assignment of samples to centroids, which occurs frequently unless the samples are highly separated into different clusters. In other words, there may be a rapid loss of accuracy for the surrogate function when moving away from , which can hinder the energy decrease in the update step and leads to slow convergence.
Indeed, it has been pointed out by Bottou and Bengio (1994) that locally Lloyd’s algorithm is equivalent to Newton’s algorithm for minimizing the energy . Since infinitesimal changes of centroid positions do not alter the samplecentroid assignment, the surrogate function (5) is the secondorder Taylor expansion of the target energy, and its minimization is equivalent to a Newton step. If the new centroid positions result in the same sample assignment as , then the Newton algorithm converges in one step and is a local minimum. However, in many instances a large step will result in changes of sample assignment and invalidates the Taylor expansion, which slows down the convergence.
From the discussions above, we can see that an important reason for slow convergence of Lloyd’s algorithm is that the centroid update only considers the local landscape of the target energy. Therefore, a natural idea to improve convergence is to incorporate more global information, which is the essence of the technique proposed in this paper.
2.1 Anderson Acceleration for Lloyd’s Algorithm
The key idea of our approach is that a sequence of centroid positions generated by Lloyd’s algorithm can be considered as the results of a fixedpoint iteration scheme. Since the new centroid positions depend on the samplecentroid assignments which in turn depend on the previous centroids , we can combine the assignment step and the update step and write as a function of :
(6) 
Here the subscript AU indicates an iterate using the combined assignment and update step, which we differentiate from the accelerated iterate that will be introduced later. If Lloyd’s algorithm converges to a local minimum , then it must be a fixed point of the mapping , i.e., . Based on this observation, we can apply Anderson acceleration, a wellestablished technique for accelerating fixedpoint solvers (Anderson, 1965; Walker and Ni, 2011), to improve the convergence of Lloyd’s algorithm. Note that Lloyd’s algorithm searches for centroid positions for which the residual vanishes. Instead of simply taking as the new iterate, Anderson acceleration also makes use of previous iterates to determine an accelerated iterate that achieves better reduction of the residual. Let us denote . Then using the iterates , we can approximate the highdimensional graph of with the affine subspace spanned by . Anderson acceleration searches within this affine space for a point whose residual components are as close to zero as possible, and takes the corresponding components as the new iterate . This is equivalent to solving a linear leastsquares problem (Walker and Ni, 2011; Toth and Kelley, 2015)
(7) 
and then computing as
(8) 
In this way, the accelerated iterate is determined using a more global landscape of the residual, which often results in faster convergence as shown by various authors (Lipnikov et al., 2013; Higham and Strabić, 2016; Pratapa et al., 2016; An et al., 2017; Ho et al., 2017). Indeed, it has been shown in (Fang and Saad, 2009) that Anderson acceleration corresponds to a quasiNewton method for finding the residual root, which approximates the inverse Jacobian by enforcing secant conditions using previous iterates.
Although Anderson acceleration works well for many problems, it does not guarantee energy decrease in general, and can stagnate at a wrong solution (Walker and Ni, 2011; Potra and Engler, 2013). Recently, Peng et al. (2018) proposed a modification of Anderson Acceleration that improves stability. For a fixedpoint solver that guarantees energy decrease in each iteration, they check the accelerated iterate to see whether it decreases the energy compared with the previous iterate; if not, then they revert to the unaccelerated iterate from the fixedpoint solver. This approach guarantees monotonic decrease of the target energy and is shown to be effective in improving robustness of Anderson acceleration. We adopt the same strategy to stabilize the accelerated solver, by checking the decrease of target energy (1): if the accelerated iterate does not decrease the energy, then we revert to the iterate using Lloyd’s algorithm which guarantees that the new energy value is no larger than the previous iteration. An additional benefit of this strategy is the low computational overhead compared with the unaccelerated solver. For the KMeans clustering problem, the overhead per iteration consists of two parts: (i) computing the accelerated iterate according to Equations (7) and (8); (ii) checking the energy of the accelerated iterate. Part (i) involves inner products between dimensional vectors as well as solving an linear system, which is typically a small cost. For Part (ii), the evaluation of target energy (1) requires assigning the sample points to the accelerated centroid positions, which can be reused in the next iteration unless we need to revert to the unaccelerated iterate. Therefore, if the accelerated iterate is accepted, then Part (ii) only involves computing the distance from each sample to its assigned centroid, with a total time complexity of which is often only a small portion of the computation per iteration (Peng et al., 2018). If the accelerated iterate is not accepted, then an additional assignment step will be incurred, but the overhead is still acceptable if we adopt a fast assignment method such as the one from (Hamerly, 2010). In our experiments, the majority of accelerated iterates are accepted (see Tables 2 and 3), thus the overhead from Part (ii) is often small.
For the classical Lloyd’s algorithm, its convergence is indicated by the same assignment of samples to centroids between two consecutive iterations, because in this case the iterative algorithm can no longer decrease the target energy. For our accelerated algorithm, it is easy to show that the convergence criterion remains the same: since we only accept an accelerated iterate when it decreases the target energy, our algorithm runs until an accelerated iterate is rejected and the fallback iterate using Lloyd’s algorithm results in the same assignment as the previous iteration, which indicates a local minimum of the energy.
2.2 Dynamic Adjustment of
By default, the number of previous iterates used in Anderson acceleration is fixed during the whole solving process, and can influence the convergence rate (Higham and Strabić, 2016; Peng et al., 2018). With a larger value of , more information is utilized for approximating the inverse Jacobian. On the other hand, a larger increases the computational cost, and may overfit iterates that are far away. In general, the optimal choice of is problemdependent and is difficult to determine beforehand. To achieve more robust speedup, we propose an adaptive strategy to dynamically adjust the value of according to the decrease of energy. The key idea is to compare the energy decrease from the current iterate with the decrease from the last iterate. In particular:

If the current iterate increases the energy, or the energy decrease in the current iteration is small compared to the previous iteration, then should be become smaller.

If the energy decrease in the current iteration is large enough compared to the previous iteration, then should be larger.

Otherwise, is not altered.
This strategy is similar to the trust region method that enlarges or shrinks the trust region according to the effectiveness of the current step (Nocedal and Wright, 2006). The full acceleration algorithm for KMeans clustering is shown in Algorithm 1, and lines 48 are the adjustment steps for the value. Our experiments show that dynamic adjustment of achieves better performance than a fixed value for many problem instances (see Section 3.1).
3 Results
We evaluate the performance of our algorithm and compare it with other Kmeans clustering algorithms in a series of experiments. All algorithms are implemented in C++ and parallelized on the CPU using OpenMP. The AssignmentStep in Algorithm 1 is implemented based on Hamerly’s method (Hamerly, 2010). Although the it can be further optimized using more recent methods such as those from (Ding et al., 2015) and (Newling and Fleuret, 2016), this will not affect the reduction ratio in the number of iterations.
We conduct the experiments on 20 datasets in a variety of sizes. They consist of 19 realworld datasets from the UCI machine learning repository (Dheeru and Karra Taniskidou, 2017), and the synthetic Birch dataset (Zhang et al., 1997) that contains clusters in regular grid structure^{1}^{1}1https://cs.joensuu.fi/sipu/datasets/. Table 1 shows the size and dimension of each data set ( for the number of samples, for the dimension). All algorithms are started with the same initial centroids and iterated until convergence, and are tested on a desktop PC with a quadcore Intel CPU i7 and 4GB RAM. For Algorithm 1, in all tests we set and , , and initialize to 2 by default.
3.1 Settings of
Table 2 compares the performance of our method using fixed and dynamic values, showing the computational time and the number of iterations required by each strategy. We can observe that in the majority of cases, dynamic adjustment of reduces the computational time as well as the number of iterations compared with a fixed strategy. The total computational time is reduced by more than 20 percent for most datasets, and even better for some datasets such as , and . Although we use the same values of and for all datasets, the dynamic adjustment strategy performs consistently better than a fixed strategy.


No.  Name  No.  Name  
1  UCIHARDATAXtrain  7352  561  11  Colorment  68040  9 
2  Slicelocalization  53500  385  12  Conflongdemo  164860  3 
3  RelationNetwork  53413  22  13  Birch  100000  2 
4  Letterrecognition  20000  16  14  Shuttle  43500  9 
5  HTRU2  17898  8  15  Covtype  581012  55 
6  Household  2049280  6  16  SkinNonSkin  245057  4 
7  FrogsMFCCs  7195  21  17  Finalgeneral  10104  72 
8  Eb  45781  2  18  ColorHistogram  68040  32 
9  AllUsers  78095  8  19  USCensus1990  2458285  69 
10  MiniBoone  130064  50  20  Kddcup99  4898431  37 



Dataset  Fixed  Dynamic  Fixed  Dynamic  
#Iter  Time (s)  MSE  #Iter  Time (s)  MSE  #Iter  Time (s)  MSE  #Iter  Time (s)  MSE  
1  22 / 27  0.22  15.08  27 / 31  0.25  15.08  36 / 45  0.46  15.02  27 / 35  0.36  15.03 
2  27 / 30  1.36  15.16  18 / 21  0.98  15.16  32 / 38  1.97  15.16  26 / 29  1.40  15.16 
3  39 / 46  0.48  2.61  31 / 38  0.43  2.61  12 / 13  0.14  2.63  12 / 13  0.15  2.63 
4  50 / 61  0.30  2.89  17 / 22  0.11  2.89  35 / 38  0.20  2.90  19 / 21  0.11  2.90 
5  38 / 49  0.15  1.24  23 / 26  0.07  1.24  38 / 54  0.21  1.24  34 / 43  0.15  1.24 
6  30 / 35  11.19  1.17  26 / 29  9.21  1.17  18 / 22  8.97  1.16  16 / 18  6.89  1.16 
7  20 / 21  0.03  2.57  16 / 18  0.02  2.57  12 / 13  0.02  2.59  10 / 11  0.01  2.59 
8  30 / 35  0.12  0.45  35 / 46  0.20  0.45  21 / 23  0.08  0.45  14 / 16  0.07  0.45 
9  41 / 51  0.88  1.93  34 / 43  0.77  1.93  38 / 55  1.21  1.93  36 / 53  1.21  1.93 
10  39 / 41  1.30  1.70  26 / 29  0.96  1.70  27 / 34  1.23  1.70  19 / 22  0.76  1.70 
11  99 / 119  1.71  2.06  71 / 83  1.16  2.06  139 / 179  2.74  2.06  91 / 115  1.68  2.06 
12  16 / 18  0.39  0.83  19 / 21  0.45  0.83  19 / 20  0.45  0.83  18 / 19  0.41  0.83 
13  14 / 21  0.19  0.42  17 / 24  0.22  0.42  9 / 11  0.08  0.41  10 / 11  0.08  0.41 
14  6 / 9  0.07  1.63  6 / 9  0.06  1.63  7 / 9  0.06  1.68  7 / 9  0.06  1.68 
15  22 / 25  3.86  6.47  20 / 24  3.89  6.47  21 / 26  4.39  6.47  17 / 22  3.82  6.47 
16  9 / 12  0.30  0.55  8 / 11  0.28  0.55  12 / 16  0.39  0.55  12 / 16  0.39  0.55 
17  46 / 47  0.13  7.63  29 / 32  0.10  7.63  43 / 53  0.22  7.63  18 / 19  0.07  7.63 
18  19 / 22  0.37  4.53  21 / 26  0.47  4.53  29 / 35  0.72  4.53  16 / 19  0.38  4.53 
19  14 / 15  14.37  6.02  15 / 17  16.91  6.02  14 / 20  22.50  6.02  15 / 21  21.09  6.02 
20  8 / 10  6.11  3.91  8 / 10  6.04  3.91  8 / 12  7.71  3.78  9 / 12  7.33  3.78 

3.2 Comparison with Lloyd’s Algorithm
In Table 3, we compare our method with Lloyd’s algorithm using the assignment method from Hamerly (2010). We show the number of iterations, the total computational time, and the final mean squared error (MSE) between the samples and their corresponding centroids. To demonstrate the robustness of our method to the initial centroid positions, we test four different initialization techniques: KMeans++ (Arthur and Vassilvitskii, 2007), afk (Bachem et al., 2016), bf (Bradley and Fayyad, 1998), and CLARANS (Newling and Fleuret, 2017). The initial centroids are generated by the code accompanying the paper (Newling and Fleuret, 2017).
For all 20 datasets, we test the methods by setting the number of clusters and starting from each of the four sets of initial centroids, resulting in 80 test cases. Table 3 shows that our method requires less computational time than Lloyd’s algorithm in the majority of cases. In particular, our method reduces the computational time by more than 25% in 42 cases, and more than 50% in 15 cases. With the initialization techniques KMeans++, afk, bf, and CLARANS, our method outperforms Lloyd’s algorithm in 15, 20, 18, and 19 datasets out of 20, respectively.
To demonstrate the robustness of our method to the number of clusters , we also test each dataset using different values of . The last three columns of Table 3 shows the performance of both methods initialized with CLARANS and for cluster numbers , , and , respectively. We can observe that our method consistently outperforms Lloyds’ algorithm for different values.


Dataset  KMeans++  afk  bf  CLARANS  
1  #Iter  26  22 / 25  39  21 / 23  23  12 / 14  30  27 / 31  53  26 / 27  14  10 / 11 
Time (s)  0.31  0.30  0.43  0.29  0.33  0.16  0.35  0.25  5.74  2.97  19.38  14.41  
MSE  15.25  15.25  15.03  15.03  15.97  15.97  15.08  15.08  12.40  12.41  9.63  9.63  
2  #Iter  87  61 / 79  27  22 / 24  15  10 / 11  42  18 / 21  49  49 / 55  40  31 / 33 
Time (s)  5.32  5.71  1.55  1.37  0.83  0.61  2.40  0.98  28.14  34.28  167.11  143.06  
MSE  15.26  15.26  15.27  15.27  16.78  16.78  15.16  15.16  12.55  12.55  8.72  8.72  
3  #Iter  31  23 / 27  33  20 / 24  28  16 / 18  54  31 / 38  41  26 / 28  20  16 / 17 
Time (s)  0.49  0.46  0.42  0.35  0.41  0.22  0.68  0.43  3.54  2.50  13.58  11.44  
MSE  2.68  2.68  2.74  2.74  2.77  2.77  2.61  2.61  1.47  1.47  0.51  0.51  
4  #Iter  51  51 / 61  57  31 / 35  75  42 / 50  66  17 / 22  98  43 / 49  21  15 / 16 
Time (s)  0.32  0.42  0.34  0.21  0.39  0.29  0.39  0.11  4.57  2.51  9.69  7.42  
MSE  2.90  2.90  2.92  2.92  2.90  2.90  2.89  2.89  1.88  1.88  1.13  1.13  
5  #Iter  137  78 / 104  110  38 / 52  127  63 / 82  51  23 / 26  65  26 / 31  28  25 / 28 
Time (s)  0.50  0.46  0.38  0.25  0.40  0.32  0.20  0.07  2.13  1.15  10.72  11.46  
MSE  1.24  1.24  1.24  1.24  1.24  1.24  1.24  1.24  0.62  0.62  0.31  0.31  
6  #Iter  209  53 / 89  241  38 / 55  67  29 / 33  61  26 / 29  642  282 / 377  403  549 / 610 
Time (s)  87.86  54.79  96.47  30.81  27.67  14.78  22.48  9.21  2043.86  1546.83  10350.36  17077.98  
MSE  1.18  1.18  1.20  1.20  1.33  1.33  1.17  1.17  0.63  0.63  0.29  0.29  
7  #Iter  15  14 / 17  23  13 / 15  12  9 / 10  22  16 / 18  70  35 / 46  17  12 / 13 
Time (s)  0.02  0.03  0.04  0.03  0.02  0.01  0.04  0.02  1.42  1.16  4.47  3.42  
MSE  2.63  2.63  2.84  2.84  2.91  2.91  2.57  2.57  1.62  1.62  0.98  0.98  
8  #Iter  97  51 / 67  87  24 / 28  83  20 / 30  82  35 / 46  175  82 / 89  33  32 / 35 
Time (s)  0.48  0.40  0.44  0.16  0.36  0.18  0.40  0.20  5.86  3.20  13.53  14.43  
MSE  0.45  0.45  0.45  0.45  0.45  0.45  0.45  0.45  0.14  0.14  0.04  0.04  
9  #Iter  106  57 / 74  101  35 / 45  57  24 / 28  65  34 / 43  83  31 / 37  35  30 / 32 
Time (s)  2.25  1.93  2.08  1.09  1.22  0.68  1.36  0.77  14.51  7.34  43.43  39.93  
MSE  1.94  1.94  1.93  1.93  2.19  2.19  1.93  1.93  1.27  1.27  0.73  0.73  
10  #Iter  142  91 / 125  85  41 / 53  37  14 / 16  57  26 / 29  361  65 / 80  129  48 / 53 
Time (s)  6.40  7.11  3.36  2.62  1.12  0.50  2.33  0.96  150.54  41.07  430.11  193.56  
MSE  1.41  1.41  1.70  1.70  1.85  1.85  1.70  1.70  0.87  0.87  0.60  0.60  
11  #Iter  106  72 / 88  169  69 / 93  89  26 / 32  142  71 / 83  316  81 / 88  95  59 / 61 
Time (s)  1.52  1.49  2.09  1.45  1.32  0.57  1.83  1.16  42.87  13.10  128.37  83.78  
MSE  2.05  2.05  2.06  2.06  2.19  2.19  2.06  2.06  1.38  1.38  0.91  0.91  
12  #Iter  73  28 / 44  38  24 / 28  48  28 / 35  36  19 / 21  190  81 / 92  201  102 / 110 
Time (s)  2.06  1.79  1.06  0.94  1.40  1.26  0.94  0.45  41.45  22.73  353.33  215.20  
MSE  0.83  0.83  0.85  0.85  0.91  0.91  0.83  0.83  0.40  0.40  0.18  0.18  
13  #Iter  25  13 / 18  27  16 / 18  27  11 / 14  39  17 / 24  101  44 / 57  61  33 / 35 
Time (s)  0.34  0.31  0.39  0.30  0.30  0.17  0.39  0.22  7.06  4.74  56.28  32.52  
MSE  0.45  0.45  0.46  0.46  0.42  0.42  0.42  0.42  0.09  0.09  0.03  0.03  
14  #Iter  30  19 / 22  39  20 / 28  10  7 / 9  19  6 / 9  25  7 / 9  25  16 / 17 
Time (s)  0.16  0.14  0.25  0.25  0.07  0.08  0.14  0.06  1.42  0.65  16.23  11.32  
MSE  1.77  1.77  1.71  1.71  1.69  1.69  1.63  1.63  0.35  0.35  0.14  0.14  
15  #Iter  48  20 / 25  113  60 / 80  33  14 / 19  30  20 / 24  62  35 / 39  226  150 / 168 
Time (s)  10.36  6.23  21.38  19.96  6.81  4.75  5.91  3.89  100.54  67.53  3087.60  2541.44  
MSE  6.50  6.50  6.51  6.51  6.48  6.48  6.47  6.47  2.69  2.69  1.30  1.30  
16  #Iter  12  9 / 10  47  19 / 29  34  15 / 22  20  8 / 11  42  20 / 24  25  35 / 39 
Time (s)  0.37  0.28  1.35  1.15  0.77  0.72  0.48  0.28  7.84  5.19  74.01  110.09  
MSE  0.60  0.60  0.63  0.63  0.72  0.72  0.55  0.55  0.15  0.15  0.05  0.05  
17  #Iter  31  28 / 36  49  25 / 28  3  2 / 3  58  29 / 32  35  18 / 22  15  12 / 13 
Time (s)  0.15  0.18  0.21  0.13  0.01  0.01  0.25  0.10  1.37  1.03  6.81  5.60  
MSE  7.62  7.62  7.74  7.74  7.78  7.78  7.63  7.63  6.12  6.12  4.70  4.69  
18  #Iter  44  28 / 33  75  31 / 46  55  25 / 28  43  21 / 26  163  46 / 51  77  83 / 85 
Time (s)  1.03  0.77  1.51  1.20  1.15  0.69  0.91  0.47  31.06  10.33  120.59  134.91  
MSE  4.53  4.53  4.55  4.55  5.23  5.23  4.53  4.53  2.61  2.61  1.78  1.78  
19  #Iter  69  34 / 60  57  23 / 34  15  11 / 13  17  15 / 17  60  32 / 32  1  46 / 1 
Time (s)  255.37  72.35  52.41  41.87  13.22  12.22  16.66  16.91  543.64  288.18  754.89  630.99  
MSE  6.17  6.17  6.11  6.11  6.66  6.66  6.02  6.02  4.18  4.18  0.01  0.00  
20  #Iter  13  11 / 12  20  13 / 15  8  9 / 10  14  8 / 10  83  38 / 53  163  58 / 72 
Time (s)  11.09  11.09  14.83  13.37  5.84  6.80  7.72  6.04  220.72  191.36  10408.44  7212.02  
MSE  3.92  3.92  4.07  4.07  3.89  3.89  3.91  3.91  1.86  1.86  0.77  0.77  

4 Conclusion
In this paper, we apply a modified Anderson acceleration scheme to improve the performance of Lloyd’s algorithm. We also propose a dynamic adjustment strategy of the parameter to improve the robustness of Anderson acceleration. Experimental results show that our method can improve the convergence rate of Lloyd’s algorithm regardless of the dimension and size of the data sets, the number of clusters, and initial values. The proposed method can be combined with existing acceleration schemes that reduce the assignment cost and improve the initialization.
Although our algorithm performs consistently better than classical Lloyd’s algorithm, its acceleration ratio varies between different problem instances. One interesting future work would be to investigate characteristics of samples that influence the effectiveness of our approach. In addition, many optimization algorithms used in machine learning have similar forms as Lloyd’s algorithm. Therefore, another interesting direction for future research is to extend the proposed algorithm to other problems.
References
 Aloise et al. (2009) Aloise, D., Deshpande, A., Hansen, P., and Popat, P. (2009). NPhardness of Euclidean sumofsquares clustering. Machine Learning, 75(2), 245–248.
 An et al. (2017) An, H., Jia, X., and Walker, H. F. (2017). Anderson acceleration and application to the threetemperature energy equations. Journal of Computational Physics, 347, 1–19.
 Anderson (1965) Anderson, D. G. (1965). Iterative procedures for nonlinear integral equations. J. ACM, 12(4), 547–560.
 Arthur and Vassilvitskii (2007) Arthur, D. and Vassilvitskii, S. (2007). kmeans++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACMSIAM symposium on Discrete algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics.
 Bachem et al. (2016) Bachem, O., Lucic, M., Hassani, H., and Krause, A. (2016). Fast and provably good seedings for kmeans. In Advances in Neural Information Processing Systems, pages 55–63.
 Bottou and Bengio (1994) Bottou, L. and Bengio, Y. (1994). Convergence properties of the kmeans algorithms. In Advances in Neural Information Processing Systems, pages 585–592.
 Bradley and Fayyad (1998) Bradley, P. S. and Fayyad, U. M. (1998). Refining initial points for kmeans clustering. In International Conference on Machine LearningL, volume 98, pages 91–99.
 Dheeru and Karra Taniskidou (2017) Dheeru, D. and Karra Taniskidou, E. (2017). UCI machine learning repository.
 Ding et al. (2015) Ding, Y., Zhao, Y., Shen, X., Musuvathi, M., and Mytkowicz, T. (2015). Yinyang kmeans: A dropin replacement of the classic kmeans with consistent speedup. In International Conference on Machine Learning, pages 579–587.
 Elkan (2003) Elkan, C. (2003). Using the triangle inequality to accelerate kmeans. In International Conference on Machine Learning, pages 147–153.
 Fang and Saad (2009) Fang, H.r. and Saad, Y. (2009). Two classes of multisecant methods for nonlinear acceleration. Numerical Linear Algebra with Applications, 16(3), 197–221.
 Hamerly (2010) Hamerly, G. (2010). Making kmeans even faster. In SIAM international conference on data mining, pages 130–140.
 Hamerly and Drake (2015) Hamerly, G. and Drake, J. (2015). Accelerating lloyd’s algorithm for kmeans clustering. In Partitional clustering algorithms, pages 41–78. Springer.
 Higham and Strabić (2016) Higham, N. J. and Strabić, N. (2016). Anderson acceleration of the alternating projections method for computing the nearest correlation matrix. Numerical Algorithms, 72(4), 1021–1042.
 Ho et al. (2017) Ho, N., Olson, S. D., and Walker, H. F. (2017). Accelerating the Uzawa algorithm. SIAM Journal on Scientific Computing, 39(5), S461–S476.
 Lange (2016) Lange, K. (2016). MM Optimization Algorithms. SIAM.
 Lipnikov et al. (2013) Lipnikov, K., Svyatskiy, D., and Vassilevski, Y. (2013). Anderson acceleration for nonlinear finite volume scheme for advectiondiffusion problems. SIAM Journal on Scientific Computing, 35(2), A1120–A1136.
 Newling and Fleuret (2016) Newling, J. and Fleuret, F. (2016). Fast kmeans with accurate bounds. In International Conference on Machine Learning, pages 936–944.
 Newling and Fleuret (2017) Newling, J. and Fleuret, F. (2017). Kmedoids for kmeans seeding. In Advances in Neural Information Processing Systems, pages 5201–5209.
 Ng and Han (1994) Ng, R. T. and Han, J. (1994). Efficient and effective clustering methods for spatial data mining. In Proceedings of 20th International Conference on Very Large Data Bases, pages 144–155.
 Nocedal and Wright (2006) Nocedal, J. and Wright, S. J. (2006). Numerical Optimization. Springer, New York, NY, USA, second edition.
 Peng et al. (2018) Peng, Y., Deng, B., Zhang, J., Geng, F., Qin, W., and liu, L. (2018). Anderson acceleration for geometry optimization and physics simulation. ACM Transactions on Graphics, 37(4).
 Potra and Engler (2013) Potra, F. A. and Engler, H. (2013). A characterization of the behavior of the anderson acceleration on linear problems. Linear Algebra and its Applications, 438(3), 1002–1011.
 Pratapa et al. (2016) Pratapa, P. P., Suryanarayana, P., and Pask, J. E. (2016). Anderson acceleration of the jacobi iterative method: An efficient alternative to Krylov methods for large, sparse linear systems. Journal of Computational Physics, 306, 43–54.
 Toth and Kelley (2015) Toth, A. and Kelley, C. T. (2015). Convergence analysis for anderson acceleration. SIAM Journal on Numerical Analysis, 53(2), 805–819.
 Toth et al. (2017) Toth, A., Ellis, J. A., Evans, T., Hamilton, S., Kelley, C. T., Pawlowski, R., and Slattery, S. (2017). Local improvement results for anderson acceleration with inaccurate function evaluations. SIAM Journal on Scientific Computing, 39(5), S47–S65.
 Walker and Ni (2011) Walker, H. F. and Ni, P. (2011). Anderson acceleration for fixedpoint iterations. SIAM Journal on Numerical Analysis, 49(4), 1715–1735.
 Zhang et al. (1997) Zhang, T., Ramakrishnan, R., and Livny, M. (1997). Birch: A new data clustering algorithm and its applications. Data Mining and Knowledge Discovery, 1(2), 141–182.