Convergence rate of stochastic means
Abstract
We analyze online [5] and minibatch [16] means variants. Both scale up the widely used means algorithm via stochastic approximation, and have become popular for largescale clustering and unsupervised feature learning. We show, for the first time, that starting with any initial solution, they converge to a “local optimum” at rate (in terms of the means objective) under general conditions. In addition, we show if the dataset is clusterable, when initialized with a simple and scalable seeding algorithm, minibatch means converges to an optimal means solution at rate with high probability. The means objective is nonconvex and nondifferentiable: we exploit ideas from recent work on stochastic gradient descent for nonconvex problems [8, 3] by providing a novel characterization of the trajectory of means algorithm on its solution space, and circumvent the nondifferentiability problem via geometric insights about means update.
1 Introduction
Lloyd’s algorithm (batch means) [13] is one of the most popular heuristics for clustering [9]. However, at every iteration it requires computation of the closest centroid to every point in the dataset. Even with fast implementations such as [7], which reduces the computation for finding the closest centroid of each point, the periteration running time still depends linearly on , making it a computational bottleneck for large datasets. To scale up the centroidupdate phase, a plausible recipe is the “stochastic approximation” scheme [4]: the overall idea is, at each iteration, the centroids are updated using one (online [5]) or a few (minibatch [16]) randomly sampled points instead of the entire dataset. In the rest of the paper, we refer to both as stochastic means, which we formally present as Algorithm 1. Empirically, stochastic means has gained increasing attention for largescale clustering and is included in widely used machine learning packages, such as SofiaML [16] and scikitlearn [14]. Figure 3 demonstrates the efficiency of stochastic means against batch means on the RCV1 dataset [12]. The advantage is clear, and the results raise some natural questions: Can we characterize the convergence rate of stochastic means? Why do the algorithms appear to converge to different “local optima”? Why and how does minibatch size affect the quality of the final solution? Our goal is to address these questions rigorously.
Our main contribution
Notation
The input to our clustering problem is a discrete dataset ; , . We use letter to denote clusterings of ; we use letter to denote a set of centroids. Superscripts are used to indicate a particular clustering, e.g., denotes the clustering at the th iteration in Algorithm 1 (or batch means); subscripts indicate individual clusters in a clustering: denotes the th centroid in corresponding to the th cluster . We use letter to denote cardinality, , , etc. Fix a point set , we let denote the mean of . Each clustering induces a unique set of centroids . Fix , we let denote the Voronoi cell belonging to centroid , i.e., , and we use to denote the Voronoi diagram induced by a set of centroids . Fix with centroids, we denote its means cost with respect to a clustering by (or simply when is induced by the Voronoi diagram of ). We let , , and let () denote the means cost of cluster (). and are similarly defined for the cost of an optimal means clustering. We use to denote permutations between two sets of the same cardinality.
2 A framework for tracking batch means in the solution space
First, we develop insights for batch means to facilitate our analysis of stochastic means.

Obtain by assigning each to its closest centroid (Voronoi cell).

For all that is not empty, obtain .
The algorithm alternates between two solution spaces: the continuous space of all centroids, which we denote by , and the finite set of all clusterings, which we denote by . One problem with means is it may produce degenerate solutions: if the solution has centroids, it is possible that data points are mapped to only centroids. To handle degenerate cases, starting with , we consider an enlarged clustering space , which is the union of all clusterings with . Our key idea is that can be partitioned into equivalence classes by the clustering they induce on , and the algorithm stops if and only if two consecutive iterations stay within the same equivalence class in {C}. Specifically, for any , let denote the clustering they induce on via their Voronoi diagram. Then are in the same equivalence class in if . In lieu of this construction, the algorithm goes from to via mapping ; it goes from to via the mean operation . Figure 3 illustrates how batch means alternates between two solution spaces until convergence. We use the preimage to denote the equivalence class induced by clustering . When batch means visits a clustering , if , the algorithms jumps to another clustering . If , the algorithm stops because and . We thus formalize the idea of “local optima” of batch means as below.
Definition 1 (Stationary clusterings).
We call a stationary clustering of , if . We let denote the set of all stationary clusterings of with number of clusters .
The operator denotes the “closure” of an equivalence class , which includes its “boundary points”, a set of centroids that induces ambiguous clusterings (this happens when there is a data point on the bisector of two centroids in a solution; see Appendix A for details). For each , we define a matching centroidal solution .
Definition 2 (Stationary points).
For a stationary clustering with clusters, we define to be a stationary point corresponding to , so that , . We let denote the corresponding set of all stationary points of with .
As preluded in the introduction, defining a distance measure on is important to our subsequent analysis; For and , we let , where . This distance is asymmetric and nonnegative, and evaluates to zero if and only if two sets of centroids coincide. In addition, if is a stationary point, then for any solution , upper bounds the difference of means objective, (Lemma 11).
2.1 Stochastic means
Algorithm 1 and stochastic means [5, 16] are equivalent up to the choice of learning rate and sampling scheme (the proof of equivalence is in Appendix A). In [5, 16], the percluster learning rate is chosen as ; in our analysis, we choose a flat learning rate for all clusters, where are some fixed constants (empirically, no obvious differences are observed; see Section 3.1 for more discussion and Section 4 for empirical comparison).
Unlike a usual gradientbased algorithm on a continuous domain, the discrete nature of the means problem causes major differences when stochastic approximation is applied.
First, for batch means, at every iteration , is chosen as the means, , of the current clustering . Since is finite and is unique for a fixed , the set of “legal moves” of batch means is finite, while is continuous. In stochastic means, however, can be any member of due to both stochasticity and the learning rate. As such, its effective solution space on is continuous and thus infinite. Our framework enables us to impose a finite structure on by mapping points in to points in .
Second, in stochastic means, centroids are usually updated asynchronously, especially when the minibatch size is small compared to ; in the extreme case of online means, centroids are updated one at a time. Since the positions of centroids implicitly determine the clustering assignment, asynchronous means updates will lead to a different solution path than fully synchronized ones, even if we ignore the noise introduced by sampling and choose the learning rate to be 1. Due to asynchronicity, it is also hard to detect when stochastic means produces a degenerate solution, since centroids may fail to be updated for a long time due to sampling. In practice, implementation such as scikitlearn relocate a centroid randomly if it fails to be updated for too long. In Algorithm 1, we do not allow relocation of centroids and our analysis subsumes the degenerate cases.
3 Main result
With two weak assumptions on the properties of stationary points of a dataset , our first result shows stochastic means has an overall convergence rate to a stationary point of batch means, regardless of initialization.
(A0)
We assume all stationary points are nonboundary points, i.e., , (By Lemma 5, under this assumption, such that all stationary points are stable).
(A1)
, we assume there is an upper bound on
where are the cardinality and centroid of the th cluster in a stationary clustering .
Theorem 1.
Assume (A0)(A1) holds. Fix any , if we run Algorithm 1 with learning rate such that for all ,
Then starting from any initial set of centroids , Let denote the event . Then , and there exists events parametrized by , denoted by , such that . And for any , we have ,
This result guarantees the expected convergence of Algorithm 1 towards a stationary point under a general condition. However, it does not guarantee Algorithm 1 converges to the same stationary clustering as its batch counterpart, even with the same initialization. Moreover, it is even possible that the final solution becomes degenerate.
Algorithm parameters in the convergence bound
The exact convergence bound of Theorem 1, which we hide in the BigO notation, reveals dependence of the convergence on the algorithm’s parameters. When is sufficiently large, the exact bound we obtain is likely dominated by where and . The bound becomes tighter when decreases. Since the order convergence requires a sufficiently large , our analysis suggests we should choose to be neither too large nor too small. The bound also becomes tighter when and becomes closer to . Both depend on ; less obviously, they also depend on the number of nondegenerate clusters. Since , which equals zero if and only if , , where is the smallest number of nondegenerate centroids in a run. The similar holds for . This suggests the convergence rate may be slower with larger , which depends on the initial number of clusters , and smaller . For experimental results of the effect of algorithm parameters on convergence, see Section 4. A detailed explanation on the exact bound is included in Appendix B.
The general convergence result in Theorem 1 is applicable to any seeding algorithm, including random sampling, which is probably the most scalable seeding heuristic one can use. However, it does not provide performance guarantee with respect to the means objective. We next show that stochastic means initialized by Algorithm 2 converges to a global optimum of means objective at rate , under additional geometric assumptions on the dataset. The major advantage of Algorithm 2 over other choices of initialization, such as the kmeans++ [1] or random sampling, is that its running time is independent of the data size while providing seeding guarantee. When using it with stochastic means, both the seeding and the update phases are independent of the data size, making the overall algorithm highly scalable.
Let denote an optimal means clustering of . We assume, similar to [11], that the means of each pair of clusters are wellseparated and that the points from the two clusters are separated by a “margin”, that is, , the distance from to the bisector of and is lower bounded. Formally, let denote the projection of onto the line joining , the margin between the two clusters is defined as . In addition, we require that the size of the smallest cluster is not too small compared to the data size. The geometric assumptions are formally defined as below.
(B1)
Mean separation: , , with for some that is sufficiently small.
(B2)
Existence of margin: , , for some .
(B3)
Cluster balance: , where .
Theorem 2.
Interestingly, we cannot provide performance guarantee for online means () in Theorem 2. Our intuition is, instead of allowing stochastic means to converge to any stationary point as in Theorem 1, it studies convergence to a fixed optimal solution; a larger provides more stability to the algorithm and prevents it from straying away from the target solution. The proposed clustering scheme is reminiscent to the Buckshot algorithm [6], widely used in the domain of document clustering. Readers may wonder how can the algorithm approach for an NPhard problem. The reason is that our geometric assumptions softens the problem. In this case, Algorithm 1 converges to the same (optimal) solution as its batch counterpart, provided the same initialization.
3.1 Related work and proof outline
Our major source of inspiration comes from recent advances in nonconvex stochastic optimization for unsupervised learning problems [8, 3]. [8] studies the convergence of stochastic gradient descent (SGD) for the tensor decomposition problem, which amounts to finding a local optimum of a nonconvex objective function composed exclusively of saddle points and local optima. Inspired by their analysis framework, we divide our analysis of Algorithm 1 into two phases, that of global convergence and local convergence, indicated by the distance from the current solution to stationary points, . We use as a shorthand.
Significant decrease in means objective when is large
In the global convergence phase, the algorithm is not close to any stationary point, i.e., , is lower bounded. We first prove a blackbox result showing that on nonstationary points, stochastic means decreases the means objective in expectation at every iteration.
Lemma 1.
Suppose , w.p. . Then, , where .
By Lemma 1, the term lower bounds the per iteration drop in means objective. Since , , and (the second equality is by Lemma 12), the drop is always lower bounded by zero. We show that if is bounded away from zero (for any stationary point ), then so is : the rough idea is, in case is a nonstationary point, and must belong to different equivalence classes, as discussed in Section 2, and their distance must be lower bounded by Lemma 4; otherwise, is a stationary point, and by our assumption their distance is lower bounded. Thus, is lower bounded by a positive constant, and so is . Since we chose , the expected per iteration drop of cost is of order , which forms a divergent series; after a sufficient number of iterations the expected drop can be arbitrarily large. We conclude that cannot be bounded away from zero asymptotically, since the means cost of any clustering is positive. This result is presented in Lemma 2.
Before proceeding, note the drop increases with . This means if we can set a clusterdependent learning rate that adapts to , the drop could be larger than choosing a flat rate, as we do in our analysis. The learning rate in [16, 5], where , is intuitively making this adaptation: in the case when the clustering assignment does not change between iterations, it can be seen that , so the effective learning rate is balanced for different clusters and is roughly . However, in practice, the clustering assignment changes when the centroids are updated, and it is hard to analyze this adaptive learning rate due to its random dependence on all previous sampling outcomes.
Lemma 2.
Lemma 2 suggests that, starting from any initial point in , the algorithm always approaches a stationary point asymptotically, ending its global convergence phase after a finite number of iterations. We next examine its local behavior around stationary points.
Local convergence to stationary points when is small
To obtain local convergence result, we first define “local attractors” and “basin of attraction” for batch means; the natural candidate for the former is the set of stationary points; a basin of attraction is a subset of the solution space so that once batch means enters it, it will not escape.
Definition 3.
We call a stable stationary point of batch means if it is a stationary point and for any clustering such that , , the Voronoi partition induced by , denoted by , satisfies , where is the permutation achieving , with for some .
We can view as the radius of the basin and the degree of the “slope” leading to the attractor. A key lemma to our analysis characterizes the iterationwise local convergence property of batch means around stable stationary points, whose convergence rate depends on the slope .
Lemma 3.
Let be a stable stationary point. For any such that , , apply one step of batch means update on results in a new solution such that .
Lemma 3 resembles the standard iterationwise convergence statement in SGD analysis, typically via convexity or smoothness of a function [15]. Here, we have neither at our dispense (we do not even have a gradient). Instead, our analysis relies on the geometric property of Voronoi diagram and the mean operation used in a means iteration, similar to those in recent works on batch means [11, 2, 17]. Although this lemma applies only to batch means, our hope is that stochastic means has similar iterationwise convergence behavior in expectation even in the presence of noise.
The difficulty here is, due to nonconvexity, the convergence result only holds within a local basin of attraction: if the algorithm’s solution is driven off the current basin of attraction by stochastic noise at any iteration, it may converge to a different attractor, causing trouble to our analysis. To deal with this, we exploit probability tools developed in [3]. [3] studies the convergence of stochastic PCA algorithms, where the objective function is the nonconvex Rayleigh quotient, which has a plateaulike component. The tools developed there were used to show that stochastic PCA gradually escapes the plateau. Here, we adapted their analysis to show Algorithm 1 stays within a basin of attraction with high probability, and converges to the attractor at rate . We define a nested sequence of events :
where . Then if Algorithm 1 is within the basin of attraction of a stable stationary point at time , the event that it escapes this local basin of attraction is contained in the event . We upper bound the probability of this bad event (Proposition 1) using techniques that derive tight concentration inequality via moment generating functions from [3], which in turn implies a lower bound on the probability of , . Then conditioning on and adapting Lemma 3 from batch to stochastic means proves the expected local convergence rate of .
Theorem 3.
Assume (A1) holds. Let be a stable stationary point, and let for some at some iteration in Algorithm 1. Let . Suppose we set and sufficiently large so that
Fix any , and let . If in addition,
Then for all ,
Note how, in contrast to Theorem 1, the local convergence result does not allow degeneracy here, by implicitly requiring that . This is reasonable, since a degenerate set of centroids cannot converge to a fixed stationary point with centroids.
Building on the ideas introduced above, we present the key ingredients of our main theorems.
Proof idea of Theorem 1
To prove Theorem 1, we first show that all stationary points satisfying (A0) must be stable for some (Lemma 5). Then we apply results from both the global and local convergence analysis of Algorithm 1. We define the global convergence phase as when , . As discussed, since is bounded away from zero, the periteration drop of means cost is of order , thus we get the expected convergence rate. Lemma 2 suggests that this phase will eventually end and at some iteration , must hold for some stationary point . The first time when this happens signals the beginning of the local convergence phase: at this point, is in the basin of attraction of since is stable. Thus, applying Theorem 3 implies that in this phase the algorithm converges to locally at rate . Hence, the overall global convergence rate is .
Proof idea of Theorem 2
Here we only apply the local convergence result. The key step is to show that our clusterability assumption on the dataset, (B1) and (B2), implies that its optimal means solution, , is a stable stationary point with a sufficiently large (Proposition 2). Then we adapt results from [17] to show that the seeds obtained from Algorithm 2 are within the basin of attraction of with high probability (Lemmas 13). Using the other geometric assumption, (B3), we apply Theorem 3 to show an convergence to with high probability.
4 Experiments
To verify the global convergence rate of Theorem 1, we run stochastic means with varying learning rate, minibatch size, and on RCV1 [12]. The dataset is relatively large in size: it has manually categorized newswire stories with topics, where each story is a dimensional sparse vector; it was used in [16] for empirical evaluation of minibatch means. We used Python and its scikitlearn package [14] for our experiments, which has stochastic means implemented. We disabled centroid relocation and modified their source code to allow a userdefined learning rate (their learning rate is fixed as , as in [5, 16], which we refer to as BBSrate).
Figure 5 shows the convergence in means cost of stochastic means algorithm over iterations for varying and ; fix each pair , we initialize Algorithm 1 with a same set of random centroids and run stochastic means with varying learning rate parameters , and we average the performance of each learning rate setup over runs to obtain the original convergence plot. Figure 5 is an example of a convergence plot before transformation. The dashed black line in each loglog figure is , a function of order . To compare the performance of stochastic means with this baseline, we first transform the original vs plot to that of vs . By Theorem 1, , so we expect the slope of the loglog plot of vs to be at least as large as that of . Although we do not know the exact cost of the stationary point, since the algorithm has reached a stable phase over iterations, as illustrated by Figure 5, we simply use as an estimate of . Most loglog convergence graphs fluctuate around a line with a slope at least as steep as that of , and do not seem to be sensitive to the choice of learning rate in our experiment. Note in some plots, such as Figure 5, the initial phase is flat. This is because we force the plot to start at instead of their true intercept on the axis. BBSrate exhibits similar behavior to our flat learning rates. Our experiment suggests the convergence rate of stochastic means may be sensitive to the ratio ; for larger or smaller , faster and more stable convergence is observed.
Appendix A Appendix A: complete version of Section 2
To facilitate our analysis of minibatch means, we build a framework to track the solution path produced by batch means; it alternates between two solutions spaces: the space of all centroids, which we denote by , and the space of all clusterings, which we denote by . Note the latter is a finite set. Throughout the paper, we use to denote permutations between two sets of the same cardinality.
Degenerate cases
One problem with batch means algorithm is it may produce degenerate solutions: if the solution has centroids, it is possible that data points are mapped to only centroids. To handle degenerate cases, starting with , we will consider the enlarged clustering space , the union of all clusterings, which we denote by , with .
Partitioning via
Our first observation is that most part of (with exception discussed below) can be partitioned into equivalence classes by the clustering they induce on ; for any , let to formally denote the clustering they induce via their Voronoi diagram. For most points in , there is only one such clustering so is uniquely determined. We define as the set of points inducing a unique clustering , with . Then we let be in the same partition in if are both unique and .
Ambiguous cases
However, it is not always clear which partition belongs to: if such that , where denote the centroids in that are closest and second closest to , can be clustered into either the cluster of centroid or that of . Centroids with this property can induce two or more clusterings due to ambiguity of Voronoi partition. Intuitively, they are at the boundary of , for some clusterings . We formalize and boundary points as below.
Definition 4 (members of ).
Fix a clustering , we define if it contains a matching set of centroids and there exists a permutation of s.t. ,
We say is a boundary point of , if ,
with equality attained for at least one data point . Let denote the set of all boundary points of . Let denote the closure of .
Note that in the case has centroids, implies all centroids in are degenerate.
Representing means updates
For now, let denote a “local minimum” of batch means and suppose is not a boundary point. Let . One run of batch means can be represented as
Figure 3 illustrates how the algorithm alternates between two solution spaces. When batch means visits a clustering , if , the algorithms jumps to another clustering . Otherwise, if , the algorithm stops because and . In the special case where is a boundary stationary point, since the algorithm arbitrarily breaks the tie, then it will continue to operate if the new clustering is chosen such that , or stops if . In practice, it is unlikely to encounter a boundary point due to perturbations in the computing system, and regardless, a sufficient condition for batch means to converge at the th iteration is . This motivates us to formalize “local optima” of batch means as below.
Definition 5 (Stationary clusterings).
We call a stationary clustering of , if