A Appendix A: complete version of Section 2

# Convergence rate of stochastic k-means

## Abstract

We analyze online [5] and mini-batch [16] -means variants. Both scale up the widely used -means algorithm via stochastic approximation, and have become popular for large-scale clustering and unsupervised feature learning. We show, for the first time, that starting with any initial solution, they converge to a “local optimum” at rate (in terms of the -means objective) under general conditions. In addition, we show if the dataset is clusterable, when initialized with a simple and scalable seeding algorithm, mini-batch -means converges to an optimal -means solution at rate with high probability. The -means objective is non-convex and non-differentiable: we exploit ideas from recent work on stochastic gradient descent for non-convex problems [8, 3] by providing a novel characterization of the trajectory of -means algorithm on its solution space, and circumvent the non-differentiability problem via geometric insights about -means update.

## 1 Introduction

Lloyd’s algorithm (batch -means) [13] is one of the most popular heuristics for clustering [9]. However, at every iteration it requires computation of the closest centroid to every point in the dataset. Even with fast implementations such as [7], which reduces the computation for finding the closest centroid of each point, the per-iteration running time still depends linearly on , making it a computational bottleneck for large datasets. To scale up the centroid-update phase, a plausible recipe is the “stochastic approximation” scheme [4]: the overall idea is, at each iteration, the centroids are updated using one (online [5]) or a few (mini-batch [16]) randomly sampled points instead of the entire dataset. In the rest of the paper, we refer to both as stochastic -means, which we formally present as Algorithm 1. Empirically, stochastic -means has gained increasing attention for large-scale clustering and is included in widely used machine learning packages, such as Sofia-ML [16] and scikit-learn [14]. Figure 3 demonstrates the efficiency of stochastic -means against batch -means on the RCV1 dataset [12]. The advantage is clear, and the results raise some natural questions: Can we characterize the convergence rate of stochastic -means? Why do the algorithms appear to converge to different “local optima”? Why and how does mini-batch size affect the quality of the final solution? Our goal is to address these questions rigorously.

Our main contribution1 is the global convergence of stochastic -means. Our key idea is to keep track of the distance from the algorithm’s current clustering solution to the set of all “local optima” of batch -means: when the distance is large with respect to all local optima, the drop in -means objective after one iteration of stochastic -means update is lower bounded; when the algorithm is close enough to a local optimum, the algorithm will be trapped by it and maintains the convergence rate thanks to a geometric property of local optima.

#### Notation

The input to our clustering problem is a discrete dataset ; , . We use letter to denote clusterings of ; we use letter to denote a set of -centroids. Superscripts are used to indicate a particular clustering, e.g., denotes the clustering at the -th iteration in Algorithm 1 (or batch -means); subscripts indicate individual clusters in a clustering: denotes the -th centroid in corresponding to the -th cluster . We use letter to denote cardinality, , , etc. Fix a point set , we let denote the mean of . Each clustering induces a unique set of centroids . Fix , we let denote the Voronoi cell belonging to centroid , i.e., , and we use to denote the Voronoi diagram induced by a set of centroids . Fix with -centroids, we denote its -means cost with respect to a -clustering by (or simply when is induced by the Voronoi diagram of ). We let , , and let () denote the -means cost of cluster (). and are similarly defined for the cost of an optimal -means clustering. We use to denote permutations between two sets of the same cardinality.

## 2 A framework for tracking batch k-means in the solution space

First, we develop insights for batch -means to facilitate our analysis of stochastic -means.2 Batch -means initializes the position of centroids via a seeding algorithm. Then ,

1. Obtain by assigning each to its closest centroid (Voronoi cell).

2. For all that is not empty, obtain .

The algorithm alternates between two solution spaces: the continuous space of all -centroids, which we denote by , and the finite set of all -clusterings, which we denote by . One problem with -means is it may produce degenerate solutions: if the solution has centroids, it is possible that data points are mapped to only centroids. To handle degenerate cases, starting with , we consider an enlarged clustering space , which is the union of all -clusterings with . Our key idea is that can be partitioned into equivalence classes by the clustering they induce on , and the algorithm stops if and only if two consecutive iterations stay within the same equivalence class in {C}. Specifically, for any , let denote the clustering they induce on via their Voronoi diagram. Then are in the same equivalence class in if . In lieu of this construction, the algorithm goes from to via mapping ; it goes from to via the mean operation . Figure 3 illustrates how batch -means alternates between two solution spaces until convergence. We use the pre-image to denote the equivalence class induced by clustering . When batch -means visits a clustering , if , the algorithms jumps to another clustering . If , the algorithm stops because and . We thus formalize the idea of “local optima” of batch -means as below.

###### Definition 1 (Stationary clusterings).

We call a stationary clustering of , if . We let denote the set of all stationary clusterings of with number of clusters .

The operator denotes the “closure” of an equivalence class , which includes its “boundary points”, a set of centroids that induces ambiguous clusterings (this happens when there is a data point on the bisector of two centroids in a solution; see Appendix A for details). For each , we define a matching centroidal solution .

###### Definition 2 (Stationary points).

For a stationary clustering with clusters, we define to be a stationary point corresponding to , so that , . We let denote the corresponding set of all stationary points of with .

As preluded in the introduction, defining a distance measure on is important to our subsequent analysis; For and , we let , where . This distance is asymmetric and non-negative, and evaluates to zero if and only if two sets of centroids coincide. In addition, if is a stationary point, then for any solution , upper bounds the difference of -means objective, (Lemma 11).

### 2.1 Stochastic k-means

Algorithm 1 and stochastic -means [5, 16] are equivalent up to the choice of learning rate and sampling scheme (the proof of equivalence is in Appendix A). In [5, 16], the per-cluster learning rate is chosen as ; in our analysis, we choose a flat learning rate for all clusters, where are some fixed constants (empirically, no obvious differences are observed; see Section 3.1 for more discussion and Section 4 for empirical comparison).

Unlike a usual gradient-based algorithm on a continuous domain, the discrete nature of the -means problem causes major differences when stochastic approximation is applied.

First, for batch -means, at every iteration , is chosen as the means, , of the current -clustering . Since is finite and is unique for a fixed , the set of “legal moves” of batch -means is finite, while is continuous. In stochastic -means, however, can be any member of due to both stochasticity and the learning rate. As such, its effective solution space on is continuous and thus infinite. Our framework enables us to impose a finite structure on by mapping points in to points in .

Second, in stochastic -means, centroids are usually updated asynchronously, especially when the mini-batch size is small compared to ; in the extreme case of online -means, centroids are updated one at a time. Since the positions of centroids implicitly determine the clustering assignment, asynchronous -means updates will lead to a different solution path than fully synchronized ones, even if we ignore the noise introduced by sampling and choose the learning rate to be 1. Due to asynchronicity, it is also hard to detect when stochastic -means produces a degenerate solution, since centroids may fail to be updated for a long time due to sampling. In practice, implementation such as scikit-learn re-locate a centroid randomly if it fails to be updated for too long. In Algorithm 1, we do not allow re-location of centroids and our analysis subsumes the degenerate cases.

When analyzing Algorithm 1, we let denote the sample space of all outcomes and the natural filtration of the stochastic process up to ( is also used in our statements of Theorems 1 and 2 as the Big-Omega). We let .

## 3 Main result

With two weak assumptions on the properties of stationary points of a dataset , our first result shows stochastic -means has an overall convergence rate to a stationary point of batch -means, regardless of initialization.

#### (A0)

We assume all stationary points are non-boundary points, i.e., , (By Lemma 5, under this assumption, such that all stationary points are -stable).

#### (A1)

, we assume there is an upper bound on

 max{E[∑r∑x∈At+1r∥x−^ct+1r∥2+ϕt|Ft],∑rn∗r⟨ct−1r−c∗r,^ctr−E[^ctr|Ft−1]⟩,∑rn∗r∥^ctr−c∗r∥2}

where are the cardinality and centroid of the -th cluster in a stationary clustering .

###### Theorem 1.

Assume (A0)(A1) holds. Fix any , if we run Algorithm 1 with learning rate such that for all ,

 where~{} p+min(m):=minr∈[k];ptr(m)>0ptr(m),ρ(m):=1−(1−p∗min)m,~{}and~{}p∗min:=minA∗∗∈{A∗}[k]minrn∗∗rn

Then starting from any initial set of -centroids , Let denote the event . Then , and there exists events parametrized by , denoted by , such that . And for any , we have ,

This result guarantees the expected convergence of Algorithm 1 towards a stationary point under a general condition. However, it does not guarantee Algorithm 1 converges to the same stationary clustering as its batch counterpart, even with the same initialization. Moreover, it is even possible that the final solution becomes degenerate.

#### Algorithm parameters in the convergence bound

The exact convergence bound of Theorem 1, which we hide in the Big-O notation, reveals dependence of the convergence on the algorithm’s parameters. When is sufficiently large, the exact bound we obtain is likely dominated by where and . The bound becomes tighter when decreases. Since the -order convergence requires a sufficiently large , our analysis suggests we should choose to be neither too large nor too small. The bound also becomes tighter when and becomes closer to . Both depend on ; less obviously, they also depend on the number of non-degenerate clusters. Since , which equals zero if and only if , , where is the smallest number of non-degenerate centroids in a run. The similar holds for . This suggests the convergence rate may be slower with larger , which depends on the initial number of clusters , and smaller . For experimental results of the effect of algorithm parameters on convergence, see Section 4. A detailed explanation on the exact bound is included in Appendix B.

The general convergence result in Theorem 1 is applicable to any seeding algorithm, including random sampling, which is probably the most scalable seeding heuristic one can use. However, it does not provide performance guarantee with respect to the -means objective. We next show that stochastic -means initialized by Algorithm 2 converges to a global optimum of -means objective at rate , under additional geometric assumptions on the dataset. The major advantage of Algorithm 2 over other choices of initialization, such as the k-means++ [1] or random sampling, is that its running time is independent of the data size while providing seeding guarantee. When using it with stochastic -means, both the seeding and the update phases are independent of the data size, making the overall algorithm highly scalable.

Let denote an optimal -means clustering of . We assume, similar to [11], that the means of each pair of clusters are well-separated and that the points from the two clusters are separated by a “margin”, that is, , the distance from to the bisector of and is lower bounded. Formally, let denote the projection of onto the line joining , the margin between the two clusters is defined as . In addition, we require that the size of the smallest cluster is not too small compared to the data size. The geometric assumptions are formally defined as below.

#### (B1)

Mean separation: , , with for some that is sufficiently small.

#### (B2)

Existence of margin: , , for some .

#### (B3)

Cluster balance: , where .

###### Theorem 2.

Assume (B1)(B2) (B3) hold for an optimal -means clustering . Fix any , if in addition satisfies , and we choose in Algorithm 2 such that . And if we run Algorithm 1 initialized by Algorithm 2 and choosemini-batch size and learning rate of the form so that

 m>1~{}~{}~{}and~{}~{}~{}c′>12[1−√α−(1−√α)m]

and

 to=Ω{[(c′)2ρ(m)2(2c′(ρ(m)−√α)−1)ln1δ]22c′(ρ(m)−√α)−1}

where . Then , there exists event s.t.

 Pr{Gt}≥(1−δ)(1−ξ)~{}~{}~{}and~{}~{}~{}E[ϕt|Gt]−ϕopt≤E[Δ(Ct,Copt)|Gt]=O(1t)

Interestingly, we cannot provide performance guarantee for online -means () in Theorem 2. Our intuition is, instead of allowing stochastic -means to converge to any stationary point as in Theorem 1, it studies convergence to a fixed optimal solution; a larger provides more stability to the algorithm and prevents it from straying away from the target solution. The proposed clustering scheme is reminiscent to the Buckshot algorithm [6], widely used in the domain of document clustering. Readers may wonder how can the algorithm approach for an NP-hard problem. The reason is that our geometric assumptions softens the problem. In this case, Algorithm 1 converges to the same (optimal) solution as its batch counterpart, provided the same initialization.

### 3.1 Related work and proof outline

Our major source of inspiration comes from recent advances in non-convex stochastic optimization for unsupervised learning problems [8, 3]. [8] studies the convergence of stochastic gradient descent (SGD) for the tensor decomposition problem, which amounts to finding a local optimum of a non-convex objective function composed exclusively of saddle points and local optima. Inspired by their analysis framework, we divide our analysis of Algorithm 1 into two phases, that of global convergence and local convergence, indicated by the distance from the current solution to stationary points, . We use as a shorthand.

#### Significant decrease in k-means objective when Δt is large

In the global convergence phase, the algorithm is not close to any stationary point, i.e., , is lower bounded. We first prove a black-box result showing that on non-stationary points, stochastic -means decreases the -means objective in expectation at every iteration.

###### Lemma 1.

Suppose , w.p. . Then, , where .

By Lemma 1, the term lower bounds the per iteration drop in -means objective. Since , , and (the second equality is by Lemma 12), the drop is always lower bounded by zero. We show that if is bounded away from zero (for any stationary point ), then so is : the rough idea is, in case is a non-stationary point, and must belong to different equivalence classes, as discussed in Section 2, and their distance must be lower bounded by Lemma 4; otherwise, is a stationary point, and by our assumption their distance is lower bounded. Thus, is lower bounded by a positive constant, and so is . Since we chose , the expected per iteration drop of cost is of order , which forms a divergent series; after a sufficient number of iterations the expected drop can be arbitrarily large. We conclude that cannot be bounded away from zero asymptotically, since the -means cost of any clustering is positive. This result is presented in Lemma 2.

Before proceeding, note the drop increases with . This means if we can set a cluster-dependent learning rate that adapts to , the drop could be larger than choosing a flat rate, as we do in our analysis. The learning rate in [16, 5], where , is intuitively making this adaptation: in the case when the clustering assignment does not change between iterations, it can be seen that , so the effective learning rate is balanced for different clusters and is roughly . However, in practice, the clustering assignment changes when the centroids are updated, and it is hard to analyze this adaptive learning rate due to its random dependence on all previous sampling outcomes.

###### Lemma 2.

Assume (A1) holds. If we run Algorithm 1 on with , with as defined in Theorem 1, and , and choose any initial set of centroids . Then for any , s.t. with for some .

Lemma 2 suggests that, starting from any initial point in , the algorithm always approaches a stationary point asymptotically, ending its global convergence phase after a finite number of iterations. We next examine its local behavior around stationary points.

#### Local convergence to stationary points when Δt is small

To obtain local convergence result, we first define “local attractors” and “basin of attraction” for batch -means; the natural candidate for the former is the set of stationary points; a basin of attraction is a subset of the solution space so that once batch -means enters it, it will not escape.

###### Definition 3.

We call a -stable stationary point of batch -means if it is a stationary point and for any clustering such that , , the Voronoi partition induced by , denoted by , satisfies , where is the permutation achieving , with for some .

We can view as the radius of the basin and the degree of the “slope” leading to the attractor. A key lemma to our analysis characterizes the iteration-wise local convergence property of batch -means around stable stationary points, whose convergence rate depends on the slope .

###### Lemma 3.

Let be a -stable stationary point. For any such that , , apply one step of batch -means update on results in a new solution such that .

Lemma 3 resembles the standard iteration-wise convergence statement in SGD analysis, typically via convexity or smoothness of a function [15]. Here, we have neither at our dispense (we do not even have a gradient). Instead, our analysis relies on the geometric property of Voronoi diagram and the mean operation used in a -means iteration, similar to those in recent works on batch -means [11, 2, 17]. Although this lemma applies only to batch -means, our hope is that stochastic -means has similar iteration-wise convergence behavior in expectation even in the presence of noise.

The difficulty here is, due to non-convexity, the convergence result only holds within a local basin of attraction: if the algorithm’s solution is driven off the current basin of attraction by stochastic noise at any iteration, it may converge to a different attractor, causing trouble to our analysis. To deal with this, we exploit probability tools developed in [3]. [3] studies the convergence of stochastic PCA algorithms, where the objective function is the non-convex Rayleigh quotient, which has a plateau-like component. The tools developed there were used to show that stochastic PCA gradually escapes the plateau. Here, we adapted their analysis to show Algorithm 1 stays within a basin of attraction with high probability, and converges to the attractor at rate . We define a nested sequence of events :

 Ω⊃Ω1⊃⋯⊃Ωi⊃…

where . Then if Algorithm 1 is within the basin of attraction of a stable stationary point at time , the event that it escapes this local basin of attraction is contained in the event . We upper bound the probability of this bad event (Proposition 1) using techniques that derive tight concentration inequality via moment generating functions from [3], which in turn implies a lower bound on the probability of , . Then conditioning on and adapting Lemma 3 from batch to stochastic -means proves the expected local convergence rate of .

###### Theorem 3.

Assume (A1) holds. Let be a -stable stationary point, and let for some at some iteration in Algorithm 1. Let . Suppose we set and sufficiently large so that

 β:=2c′minr,tptr(m)(1−maxtat√α))>1

Fix any , and let . If in addition,

 t0≥max{(16a2maxBΔi)1β−1,[48a2maxB2(β−1)(Δi)2ln1δ]2β−1,(ln1δ)2β−1}

Then for all ,

 Pr(Ωt)≥1−δ~{}~{}and~{}~{}E[Δt|Ωt]≤t0+i+1t0+tΔi+a2maxBβ−1(t0+i+2t0+i+1)β+11t0+t+1

Note how, in contrast to Theorem 1, the local convergence result does not allow degeneracy here, by implicitly requiring that . This is reasonable, since a degenerate set of centroids cannot converge to a fixed stationary point with centroids.

Building on the ideas introduced above, we present the key ingredients of our main theorems.

#### Proof idea of Theorem 1

To prove Theorem 1, we first show that all stationary points satisfying (A0) must be -stable for some (Lemma 5). Then we apply results from both the global and local convergence analysis of Algorithm 1. We define the global convergence phase as when , . As discussed, since is bounded away from zero, the per-iteration drop of -means cost is of order , thus we get the expected convergence rate. Lemma 2 suggests that this phase will eventually end and at some iteration , must hold for some stationary point . The first time when this happens signals the beginning of the local convergence phase: at this point, is in the basin of attraction of since is -stable. Thus, applying Theorem 3 implies that in this phase the algorithm converges to locally at rate . Hence, the overall global convergence rate is .

#### Proof idea of Theorem 2

Here we only apply the local convergence result. The key step is to show that our clusterability assumption on the dataset, (B1) and (B2), implies that its optimal -means solution, , is a -stable stationary point with a sufficiently large (Proposition 2). Then we adapt results from [17] to show that the seeds obtained from Algorithm 2 are within the basin of attraction of with high probability (Lemmas 13). Using the other geometric assumption, (B3), we apply Theorem 3 to show an convergence to with high probability.

## 4 Experiments

To verify the global convergence rate of Theorem 1, we run stochastic -means with varying learning rate, mini-batch size, and on RCV1 [12]. The dataset is relatively large in size: it has manually categorized newswire stories with topics, where each story is a -dimensional sparse vector; it was used in [16] for empirical evaluation of mini-batch -means. We used Python and its scikit-learn package [14] for our experiments, which has stochastic -means implemented. We disabled centroid relocation and modified their source code to allow a user-defined learning rate (their learning rate is fixed as , as in [5, 16], which we refer to as BBS-rate).

Figure 5 shows the convergence in -means cost of stochastic -means algorithm over iterations for varying and ; fix each pair , we initialize Algorithm 1 with a same set of random centroids and run stochastic -means with varying learning rate parameters , and we average the performance of each learning rate setup over runs to obtain the original convergence plot. Figure 5 is an example of a convergence plot before transformation. The dashed black line in each log-log figure is , a function of order . To compare the performance of stochastic -means with this baseline, we first transform the original vs plot to that of vs . By Theorem 1, , so we expect the slope of the log-log plot of vs to be at least as large as that of . Although we do not know the exact cost of the stationary point, since the algorithm has reached a stable phase over iterations, as illustrated by Figure 5, we simply use as an estimate of . Most log-log convergence graphs fluctuate around a line with a slope at least as steep as that of , and do not seem to be sensitive to the choice of learning rate in our experiment. Note in some plots, such as Figure 5, the initial phase is flat. This is because we force the plot to start at instead of their true intercept on the -axis. BBS-rate exhibits similar behavior to our flat learning rates. Our experiment suggests the convergence rate of stochastic -means may be sensitive to the ratio ; for larger or smaller , faster and more stable convergence is observed.

## Appendix A Appendix A: complete version of Section 2

To facilitate our analysis of mini-batch -means, we build a framework to track the solution path produced by batch -means; it alternates between two solutions spaces: the space of all -centroids, which we denote by , and the space of all -clusterings, which we denote by . Note the latter is a finite set. Throughout the paper, we use to denote permutations between two sets of the same cardinality.

#### Degenerate cases

One problem with batch -means algorithm is it may produce degenerate solutions: if the solution has centroids, it is possible that data points are mapped to only centroids. To handle degenerate cases, starting with , we will consider the enlarged clustering space , the union of all -clusterings, which we denote by , with .

#### Partitioning {C} via {A}

Our first observation is that most part of (with exception discussed below) can be partitioned into equivalence classes by the clustering they induce on ; for any , let to formally denote the clustering they induce via their Voronoi diagram. For most points in , there is only one such clustering so is uniquely determined. We define as the set of points inducing a unique -clustering , with . Then we let be in the same partition in if are both unique and .

#### Ambiguous cases

However, it is not always clear which partition belongs to: if such that , where denote the centroids in that are closest and second closest to , can be clustered into either the cluster of centroid or that of . Centroids with this property can induce two or more clusterings due to ambiguity of Voronoi partition. Intuitively, they are at the boundary of , for some clusterings . We formalize and boundary points as below.

###### Definition 4 (members of v−1(A)).

Fix a clustering , we define if it contains a matching set of centroids and there exists a permutation of s.t. ,

 ∥x−cπ(r)∥<∥x−cπ(s)∥,∀s≠r

We say is a boundary point of , if ,

 ∥x−cπ(r)∥≤∥x−cπ(s)∥,∀s≠r

with equality attained for at least one data point . Let denote the set of all boundary points of . Let denote the closure of .

Note that in the case has centroids, implies all centroids in are degenerate.