# Nested Mini-Batch K-Means

###### Abstract

A new algorithm is proposed which accelerates the mini-batch k-means algorithm of Sculley (2010) by using the distance bounding approach of Elkan (2003). We argue that, when incorporating distance bounds into a mini-batch algorithm,
already used data should preferentially be reused. To this end we propose using *nested* mini-batches, whereby data in a mini-batch at iteration is automatically reused at iteration .

Using nested mini-batches presents two difficulties. The first is that unbalanced use of data can bias estimates, which we resolve by ensuring that each data sample contributes exactly once to centroids. The second is in choosing mini-batch sizes, which we address by balancing premature fine-tuning of centroids with redundancy induced slow-down. Experiments show that the resulting algorithm is very effective, often arriving within of the empirical minimum earlier than the standard mini-batch algorithm.

## 1 Introduction

The -means problem is to find centroids to minimise the mean distance between samples and their nearest centroids. Specifically, given training samples in vector space , one must find in to minimise energy defined by,

(1) |

where . In general the -means problem is NP-hard, and so a trade off must be made between low energy and low run time. The -means problem arises in data compression, classification, density estimation, and many other areas.

A popular algorithm for -means is Lloyd’s algorithm, henceforth . It relies on a two-step iterative refinement technique. In the *assignment* step, each sample is assigned to the cluster whose centroid is nearest. In the *update* step, cluster centroids are updated in accordance with assigned samples. is also referred to as the *exact* algorithm, which can lead to confusion as it does not solve the -means problem exactly. Similarly, *approximate* -means algorithms often refer to algorithms which perform an approximation in either the assignment or the update step of .

### 1.1 Previous works on accelerating the exact algorithm

Several approaches for accelerating have been proposed, where the required computation is reduced without changing the final clustering. Hamerly (2010) shows that approaches relying on triangle inequality based distance bounds (Phillips, 2002; Elkan, 2003; Hamerly, 2010) always provide greater speed-ups than those based on spatial data structures (Pelleg and Moore, 1999; Kanungo et al., 2002). Improving bounding based methods remains an active area of research (Drake, 2013; Ding et al., 2015). We discuss the bounding based approach in § 2.1.

### 1.2 Previous approximate -means algorithms

The assignment step of requires more computation than the update step. The majority of approximate algorithms thus focus on relaxing the assignment step, in one of two ways. The first is to assign all data approximately, so that centroids are updated using all data, but some samples may be incorrectly assigned. This is the approach used in Wang et al. (2012) with cluster closures. The second approach is to exactly assign a fraction of data at each iteration. This is the approach used in Agarwal et al. (2005), where a representative core-set is clustered, and in Bottou and Bengio (1995), and Sculley (2010), where random samples are drawn at each iteration. Using only a fraction of data is effective in reducing redundancy induced slow-downs.

The mini-batch -means algorithm of Sculley (2010), henceforth , proceeds as follows. Centroids are initialised as a random selection of samples. Then at every iteration, of samples are selected uniformly at random and assigned to clusters. Cluster centroids are updated as the mean of all samples ever assigned to them, and are therefore running averages of assignments. Samples randomly selected more often have more influence on centroids as they reappear more frequently in running averages, although the law of large numbers smooths out any discrepancies in the long run. is presented in greater detail in § 2.2.

### 1.3 Our contribution

The underlying goal of this work is to accelerate by using triangle inequality based distance bounds. In so doing, we hope to merge the complementary strengths of two powerful and widely used approaches for accelerating .

The effective incorporation of bounds into requires a new sampling approach. To see this, first note that bounding can only accelerate the processing of samples which have already been visited, as the first visit is used to establish bounds. Next, note that the expected proportion of visits during the first epoch which are *re*visits is at most , as shown in SM-A. Thus the majority of visits are first time visits and hence cannot be accelerated by bounds. However, for highly redundant datasets, often obtains satisfactory clustering in a single epoch, and so bounds need to be effective during the first epoch if they are to contribute more than a minor speed-up.

To better harness bounds, one must preferentially reuse already visited samples. To this end, we propose nested mini-batches. Specifically, letting be the mini-batch indices used at iteration , we enforce that . One concern with nesting is that samples entering in early iterations have more influence than samples entering at late iterations, thereby introducing bias. To resolve this problem, we enforce that samples appear at most once in running averages. Specifically, when a sample is revisited, its old assignment is first removed before it is reassigned. The idea of nested mini-batches is discussed in § 3.1.

The second challenge introduced by using nested mini-batches is determining the size of . On the one hand, if grows too slowly, then one may suffer from premature fine-tuning. Specifically, when updating centroids using , one is using energy estimated on samples indexed by as a proxy for energy over all training samples. If is small and the energy estimate is poor, then minimising the energy estimate exactly is a waste of computation, as as soon as the mini-batch is augmented the proxy energy loss function will change. On the other hand, if grows too rapidly, the problem of redundancy arises. Specifically, if centroid updates obtained with a small fraction of are similar to the updates obtained with , then it is waste of computation using in its entirety. These ideas are pursued in § 3.2.

## 2 Related works

### 2.1 Exact acceleration using the triangle inequality

The standard approach to perform the assignment step of requires distance calculations. The idea introduced in Elkan (2003) is to eliminate certain of these calculations by maintaining bounds on distances between samples and centroids. Several novel bounding based algorithms have since been proposed, the most recent being the algorithm of Ding et al. (2015). A thorough comparison of bounding based algorithms was presented in Drake (2013). We illustrate the basic idea of Elkan (2003) in Alg. 1, where for every sample , one maintains lower bounds, for , each bound satisfying . Before computing on line 4 of Alg. 1, one checks that , where is the distance from sample to the nearest currently found centroid. If then , and thus can automatically be eliminated as a nearest centroid candidate.

The fully-fledged algorithm of Elkan (2003) uses additional tests to the one shown in Alg. 1, and includes upper bounds and inter-centroid distances. The most recently published bounding based algorithm, of Ding et al. (2015), is like that of Elkan (2003) but does not maintain bounds on all distances to centroids, rather it maintains lower bounds on groups of centroids simultaneously.

To maintain the validity of bounds, after each centroid update one performs , where is the distance moved by centroid during the centroid update, the validity of this correction follows from the triangle inequality. Lower bounds are initialised as exact distances in the first iteration, and only in subsequent iterations can bounds help in eliminating distance calculations. Therefore, the algorithm of Elkan (2003) and its derivatives are all at least as slow as during the first iteration.

### 2.2 Mini-batch k-means

The work of Sculley (2010) introduces , presented in Alg. 4, as a scalable alternative to . Reusing notation, we let the mini-batch size be , and the total number of assignments ever made to cluster be . Let be the cumulative sum of data samples assigned to cluster . The centroid update, line 9 of Alg. 4, is then . Sculley (2010) present in the context sparse datasets, and at the end of each round an -sparsification operation is performed to encourage sparsity. In this paper we are interested in in a more general context and do not consider sparsification.

## 3 Nested mini-batch k-means :

The bottleneck of is the assignment step, on line 5 of Alg. 4, which requires distance calculations per sample. The underlying motivation of this paper is to reduce the number of distance calculations at assignment by using distance bounds. However, as already discussed in § 1.3, simply wrapping line 5 in a bound test would not result in much gain, as only a minority of visited samples would benefit from bounds in the first epoch. For this reason, we will replace random mini-batches at line 3 of Alg. 4 by nested mini-batches. This modification motivates a change to the running average centroid updates, discussed in Section 3.1. It also introduces the need for a scheme to choose mini-batch sizes, discussed in 3.2. The resulting algorithm, which we refer to as , is presented in Alg. 5.

There is no random sampling in , although an initial random shuffling of samples can be performed to remove any ordering that may exist. Let be the size of the mini-batch at iteration , that is . We simply take to be the first indices, that is . Thus corresponds to . Let be the number of iterations of before terminating. We use as stopping criterion that no assignments change on the full training set, although this is not important and can be modified.

### 3.1 One sample, one vote : modifying cumulative sums to prevent duplicity

In , a sample used times makes contributions to one or more centroids, through line 6 of Alg. 4. Due to the extreme and systematic difference in the number of times samples are used with nested mini-batches, it is necessary to curtail any potential bias that duplicitous contribution may incur. To this end, we only alow a sample’s most recent assignment to contribute to centroids. This is done by removing old assignments before samples are reused, shown on lines 15 and 16 of Alg. 5.

### 3.2 Finding the sweet spot : balancing premature fine-tuning with redundancy

We now discuss how to sensibly select mini-batch size , where recall that the sample indices of the mini-batch at iteration are . The only constraint imposed so far is that for , that is that does not decrease. We consider two extreme schemes to illustrate the importance of finding a scheme where grows neither too rapidly nor too slowly.

The first extreme scheme is for . This is just a return to full batch -means, and thus redundancy is a problem, particularly at early iterations. The second extreme scheme, where grows very slowly, is the following: if any assignment changes at iteration , then , otherwise . The problem with this second scheme is that computation may be wasted in finding centroids which accurately minimise the energy estimated on unrepresentative subsets of the full training set. This is what we refer to as premature fine-tuning.

To develop a scheme which balances redundancy and premature fine-tuning, we need to find sensible definitions for these terms. A first attempt might be to define them in terms of energy (1), as this is ultimately what we wish to minimise. Redundancy would correspond to a slow decrease in energy caused by long iteration times, and premature fine-tuning would correspond to approaching a local minimum of a poor proxy for (1). A difficulty with an energy based approach is that we do not want to compute (1) at each iteration and there is no clear way to quantify the underestimation of (1) using a mini-batch. We instead consider definitions based on centroid statistics.

#### 3.2.1 Balancing intra-cluster standard deviation with centroid displacement

Let denote centroid at iteration , and let be when , so that is the update to using samples . Consider two options, with resulting update , and with update . If,

(2) |

then it makes little difference if centroid is updated with or , as illustrated in Figure 1, left. Using would therefore be redundant. If on the other hand,

(3) |

this suggests premature fine-tuning, as illustrated in Figure 1, right. Balancing redundancy and premature fine-tuning thus equates to balancing the terms on the left and right hand sides of (2) and (3). Let us denote by the indices of samples in assigned to cluster . In SM-B we show that the term on the left hand side of (2) and (3) can be estimated by , where

(4) |

may underestimate as samples have not been used by centroids at iteration , however our goal here is to establish dimensional homogeneity. The right hand sides of (2) and (3) can be estimated by the distance moved by centroid in the preceding iteration, which we denote by . Balancing redundancy and premature fine-tuning thus equates to preventing from getting too large or too small.

It may be that differs significantly between clusters . It is not possible to independently control the number of samples per cluster, and so a joint decision needs to be made by clusters as to whether or not to increase . We choose to make the decision based on the minimum ratio, on line 37 of Alg. 5, as premature fine-tuning is less costly when performed on a small mini-batch, and so it makes sense to allow slowly converging centroids to catch-up with rapidly converging ones.

The decision to use a double-or-nothing scheme for growing the mini-batch is motivated by the fact that drops by a constant factor when the mini-batch doubles in size. A linearly increasing mini-batch would be prone to premature fine-tuning as the mini-batch would not be able to grow rapidly enough.

Starting with an initial mini-batch size , iterates until is above some threshold , at which point mini-batch size increases as , shown on line 37 of Alg. 5. The mini-batch size is guaranteed to eventually reach , as eventually goes to zero. The doubling threshold reflects the relative costs of premature fine-tuning and redundancy.

### 3.3 A note on parallelisation

The parallelisation of can be done in the same way as in , whereby a mini-batch is simply split into sub-mini-batches to be distributed. For , the only constraint on sub-mini-batches is that they are of equal size to guarantee equal processing times. With the constraint is slightly stricter, as the time required to process a sample depends on its time of entry into the mini-batch, due to bounds. Samples from all iterations should thus be balanced, the constraint becoming that each sub-mini-batch contains an equal number of samples from for all .

## 4 Results

We have performed experiments on 3 dense datasets and sparse dataset used in Sculley (2010). The INFMNIST dataset (Loosli et al., 2007) is an extension of MNIST, consisting of 2828 hand-written digits (). We use 400,000 such digits for performing -means and 40,000 for computing a validation energy . STL10P (Coates et al., 2011) consists of 663 image patches (), we train with 960,000 patches and use 40,000 for validation. KDDC98 contains 75,000 training samples and 20,000 validation samples, in 310 dimensions. Finally, the sparse RCV1 dataset of Lewis et al. (2004) consists of data in 47,237 dimensions, with two partitions containing 781,265 and 23,149 samples respectively. As done in Sculley (2010), we use the larger partition to learn clusters.

The experimental setup used on each of the datasets is the following: for 20 random seeds, the training dataset is shuffled and the first datapoints are taken as initialising centroids. Then, for each of the algorithms, -means is run on the shuffled training set. At regular intervals, a validation energy is computed on the validation set. The time taken to compute is not included in run times. The batchsize for and initial batchsize for are , and clusters. The mean and standard deviation of over the 20 runs are computed, and this is what is plotted in Figure 2, relative to the lowest obtained validation energy over all runs on a dataset, . Before comparing algorithms, we note that our implementation of the baseline is competitive with existing implementations, as shown in Appendix A.

In Figure 2, we plot time-energy curves for with three baselines. We use , as described in the following paragraph. On the 3 dense datasets, we see that is much faster than , obtaining a solution within of between and earlier than . On the sparse dataset RCV1, the speed-up becomes noticeable within of . Note that in a single epoch gets very near to , whereas the full batch algorithms and only complete one iteration. The mean final energies of and the exact algorithms are consistently within one initialisation standard deviation. This means that the random initialisation seed has a larger impact on final energy than the choose between and an exact algorithm.

We now discuss the choice of . Recall that the mini-batch size doubles when . Thus a large means smaller s are needed to invoke a doubling, which means less robustness against premature fine-tuning. The relative costs of premature fine-tuning and redundancy are influenced by the use of bounds. Consider the case of premature fine-tuning with bounds. becomes small, and thus bound tests become more effective as they decrease more slowly (line 10 of Alg. 5). Thus, while premature fine-tuning does result in more samples being visited than necessary, each visit is processed rapidly and so is less costly. We have found that taking to be large works well for . Indeed, there is little difference in performance for . To test that our formulation is sensible, we performed tests with the bound test (line 3 of Alg. 1) deactivated. When deactivated, is in general better than larger values of , as seen in Figure 3. Full time-energy curves for different values are provided in SM-C.

## 5 Conclusion and future work

We have shown how triangle inequality based bounding can be used to accelerate mini-batch -means. The key is the use of nested batches, which enables rapid processing of already used samples. The idea of replacing uniformly sampled mini-batches with nested mini-batches is quite general, and applicable to other mini-batch algorithms. In particular, we believe that the sparse dictionary learning algorithm of Mairal et al. (2009) could benefit from nesting. One could also consider adapting nested mini-batches to stochastic gradient descent, although this is more speculative.

Celebi et al. (2013) show that specialised initialisation schemes such as k-means++ can result in better clusterings. While this is not the case for the datasets we have used, it would be interesting to consider adapting such initialisation schemes to the mini-batch context.

Our nested mini-batch algorithm nmbatch uses a very simple bounding scheme. We believe that further improvements could be obtained through more advanced bounding, and that the memory footprint of could be reduced by using a scheme where, as the mini-batch grows, the number of bounds maintained decreases, so that bounds on groups of clusters merge.

## Appendix A Comparing Baseline Implementations

We compare our implementation of with two publicly available implementations, that accompanying Sculley (2010) in C++, and that in scikit-learn Pedregosa et al. (2011), written in Cython. Comparisons are presented in Table 1, where our implementations are seen to be competitive. Experiments were all single threaded. Our C++ and Python code is available at https://github.com/idiap/eakmeans.

INFMNIST (dense) | RCV1 (sparse) | |||
---|---|---|---|---|

ours | sklearn | ours | sklearn | sofia |

12.4 | 20.6 | 15.2 | 63.6 | 23.3 |

## Acknowledgments

James Newling was funded by the Hasler Foundation under the grant 13018 MASH2.

## References

- Agarwal et al. (2005) Agarwal, P. K., Har-Peled, S., and Varadarajan, K. R. (2005). Geometric approximation via coresets. In COMBINATORIAL AND COMPUTATIONAL GEOMETRY, MSRI, pages 1–30. University Press.
- Bottou and Bengio (1995) Bottou, L. and Bengio, Y. (1995). Convergence properties of the K-means algorithm. pages 585–592.
- Celebi et al. (2013) Celebi, M. E., Kingravi, H. A., and Vela, P. A. (2013). A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert Syst. Appl., 40(1):200–210.
- Coates et al. (2011) Coates, A., Lee, H., and Ng, A. (2011). An analysis of single-layer networks in unsupervised feature learning. In Gordon, G., Dunson, D., and Dudík, M., editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of JMLR Workshop and Conference Proceedings, pages 215–223. JMLR W&CP.
- Ding et al. (2015) Ding, Y., Zhao, Y., Shen, X., Musuvathi, M., and Mytkowicz, T. (2015). Yinyang k-means: A drop-in replacement of the classic k-means with consistent speedup. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 579–587.
- Drake (2013) Drake, J. (2013). Faster k-means clustering. Accessed online 19 August 2015.
- Elkan (2003) Elkan, C. (2003). Using the triangle inequality to accelerate k-means. In Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), August 21-24, 2003, Washington, DC, USA, pages 147–153.
- Hamerly (2010) Hamerly, G. (2010). Making k-means even faster. In SDM, pages 130–140.
- Kanungo et al. (2002) Kanungo, T., Mount, D., Netanyahu, N., Piatko, C., Silverman, R., and Wu, A. (2002). An efficient k-means clustering algorithm: analysis and implementation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 24(7):881–892.
- Lewis et al. (2004) Lewis, D. D., Yang, Y., Rose, T. G., and Li, F. (2004). Rcv1: A new benchmark collection for text categorization research. JOURNAL OF MACHINE LEARNING RESEARCH, 5:361–397.
- Loosli et al. (2007) Loosli, G., Canu, S., and Bottou, L. (2007). Training invariant support vector machines using selective sampling. In Bottou, L., Chapelle, O., DeCoste, D., and Weston, J., editors, Large Scale Kernel Machines, pages 301–320. MIT Press, Cambridge, MA.
- Mairal et al. (2009) Mairal, J., Bach, F., Ponce, J., and Sapiro, G. (2009). Online dictionary learning for sparse coding. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages 689–696, New York, NY, USA. ACM.
- Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
- Pelleg and Moore (1999) Pelleg, D. and Moore, A. (1999). Accelerating exact k-means algorithms with geometric reasoning. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’99, pages 277–281, New York, NY, USA. ACM.
- Phillips (2002) Phillips, S. (2002). Acceleration of k-means and related clustering algorithms. volume 2409 of Lecture Notes in Computer Science. Springer.
- Sculley (2010) Sculley, D. (2010). Web-scale k-means clustering. In Proceedings of the 19th International Conference on World Wide Web, WWW ’10, pages 1177–1178, New York, NY, USA. ACM.
- Wang et al. (2012) Wang, J., Wang, J., Ke, Q., Zeng, G., and Li, S. (2012). Fast approximate k-means via cluster closures. In CVPR, pages 3037–3044. IEEE Computer Society.

## Appendix SM-A Showing that there are more first time visits than revisits in the first epoch

Let the probability that a sample is not visited in an epoch be , where recall that an epoch consists of drawing mini-batches, where we assume . Denote by the probability that the visit of a sample is a revisit. We argue that : the number of samples not visited exactly corresponds to the number of revisits, as the number of visits is the number of samples, by definition of an epoch. Clearly, , from which it can be shown that . Thus as we want. In other words, there are at least first time visits for 1 revisit.

## Appendix SM-B Showing that the expectations and are approximately the same

Recall that are the samples used to obtain , that is

The mean squared distance of samples in to we denote by ,

We compute the expectation of , where the expectation is over all possible shufflings of the data. Recall that is centroid at iteration if the mini-batch at size is , where is the mini-batch size at iteration . Recall that we use to denote samples assigned to . We will now denote by the sample indices assigned to and the sample indices assigned to . Thus,

We now assume that the number of samples per centroid does not change significantly between iterations and for a fixed batch size, so that and . Continuing we have, | ||||

The two summation terms are independant and the second has expectation approximately zero assuming the centroids do not move too much between rounds, so | ||||

Finally, each of the two expectation terms can be approximated by . Approximating the first term by , may be an underestimation as the summation is over data which was not used to obtain , whereas is obtained using data used by . Using this estimation we get, | ||||

The final equality following from the definition of .

## Appendix SM-C Time-energy curves with various doubling thresholds

## Appendix SM-D On algorithms intermediate to and

The primary argument presented in this paper for removing old assignments is to prevent a biased use of samples in . However, a second reason for removing old assignments is that they can contaminate centroids if left unremoved. This second reason in favour of removing old assignments is also applicable to , and so it is interesting to see if can be improved by removing old assignments, without the inclusion of triangle inequality based bounds. We call this algorithm . In addition, it is interesting to consider the performance of without bound testing. We here call without bound testing .

In Figure 6 we see that is indeed improved by removing old assignments: outperforms , especially at later iterations. The algorithm does not perform as well as , as expected, however it is comparable to , if not slightly better. There is no algorithmic reason why should be better than , as nesting was proposed purely as way to better harness bounds. One possible explanation for the good performance of is better memory usage: when samples are reused there are fewer cache memory misses.

## Appendix SM-E Premature fine-tuning, one more time please

The loss function being minimised changes when the mini-batch grows. With samples, it is

and then with it is

Minima of these two loss functions are different, although as gets large they approach each each. Premature fine-tuning refers to putting a large amount of effort into getting very close to a minimum with samples, when we know that as soon as we switch to samples the minimum will move, undoing our effort to get very close to a mimumum.

Coffee break definition: It’s like a glazed cherry without a cake, that finishing touch which is useless until the main project is complete. Donald Knuth once wrote that *premature optimization is the root of all evil*, where optimisations to code performed too early on in a project become useless as software evolves. This is roughly what we’re talking about.