# A Constant-Factor Bi-Criteria Approximation Guarantee for $k$-means++

## Abstract

This paper studies the -means++ algorithm for clustering as well as the class of sampling algorithms to which -means++ belongs. It is shown that for any constant factor , selecting cluster centers by sampling yields a constant-factor approximation to the optimal clustering with centers, in expectation and without conditions on the dataset. This result extends the previously known guarantee for the case to the constant-factor bi-criteria regime. It also improves upon an existing constant-factor bi-criteria result that holds only with constant probability.

## 1 Introduction

The -means problem and its variants constitute one of the most popular paradigms for clustering (Jain, 2010). Given a set of data points, the task is to group them into clusters, each defined by a cluster center, such that the sum of distances from points to cluster centers (raised to a power ) is minimized. Optimal clustering in this sense is known to be NP-hard, in particular for -means () (Dasgupta, 2008; Aloise et al., 2009; Mahajan et al., 2009; Awasthi et al., 2015) and -medians () (Jain et al., 2002). In practice, the most widely used algorithm remains Lloyd’s (1957; 1982) (often referred to as the -means algorithm), which alternates between updating centers given cluster assignments and re-assigning points to clusters.

In this paper, we study an enhancement to Lloyd’s algorithm known as -means++ (Arthur and Vassilvitskii, 2007) and the more general class of sampling algorithms to which -means++ belongs. These algorithms select cluster centers randomly from the given data points with probabilities proportional to their current costs. The clustering can then be refined using Lloyd’s algorithm. sampling is attractive for two reasons: First, it is guaranteed to yield an expected approximation to the optimal clustering with centers (Arthur and Vassilvitskii, 2007). Second, it is as simple as Lloyd’s algorithm, both conceptually as well as computationally with running time in dimensions.

The particular focus of this paper is on the setting where an optimal -clustering remains the benchmark but more than cluster centers can be sampled to improve the approximation. Specifically, it is shown (see Theorem 1 and Corollary 1) that for any constant factor , if centers are chosen by sampling, then a constant-factor approximation to the optimal -clustering is obtained. This guarantee holds in expectation and for all datasets, like the one in Arthur and Vassilvitskii (2007), and improves upon the factor therein. Such a result is known as a constant-factor bi-criteria approximation since both the optimal cost and the relevant degrees of freedom ( in this case) are exceeded but only by constant factors.

In the context of clustering, bi-criteria approximations can be valuable because an appropriate number of clusters is almost never known or pre-specified in practice. Approaches to determining from the data are all ideally based on knowing how the optimal cost decreases as increases, but obtaining this optimal trade-off between cost and is NP-hard as mentioned earlier. Alternatively, a simpler algorithm that has a constant-factor bi-criteria guarantee would ensure that the trade-off curve generated by this algorithm deviates by no more than constant factors along both axes from the optimal curve. This may be more appealing than a deviation along the cost axis that grows with . Furthermore, if a solution with a specified number of clusters is truly required, then linear programming techniques can be used to select a -subset from the cluster centers while still maintaining a constant-factor approximation (Aggarwal et al., 2009; Charikar et al., 2002).

The main result in this paper differs from the constant-factor bi-criteria approximation established in Aggarwal et al. (2009) in that the latter holds only with constant probability as opposed to in expectation. Using Markov’s inequality, a constant-probability corollary can be derived from Theorem 1 herein, and doing so improves upon the approximation factor of Aggarwal et al. (2009) by more than a factor of . The present paper also differs from recent work on more general bi-criteria approximation of -means by Makarychev et al. (2015), which analyzes substantially more complex algorithms.

In the next section, existing work on sampling and clustering approximations in general is reviewed in more detail. Section 2 gives a formal statement of the problem, the sampling algorithm, and existing lemmas regarding the algorithm. Section 3 states the main results of the paper and compares them to previous results. Proofs are presented in Section 4 and the paper concludes in Section 5.

### 1.1 Related Work

There is a considerable literature on approximation algorithms for -means, -medians, and related problems, spanning a wide range in the trade-off between tighter approximation factors and lower algorithm complexity. At one end, exact algorithms (Inaba et al., 1994) and several polynomial-time approximation schemes (PTAS) (Matoušek, 2000; Badoiu et al., 2002; de la Vega et al., 2003; Har-Peled and Mazumdar, 2004; Kumar et al., 2010; Chen, 2009; Feldman et al., 2007; Jaiswal et al., 2014) have been proposed for -means and -medians. While these have polynomial running times in , the dependence on and sometimes on the dimension is exponential or worse. A simpler local search algorithm was shown to yield a approximation for -means () in Kanungo et al. (2004) and -medians () in Arya et al. (2004), the latter under the additional constraint that centers are chosen from a finite set. This local search however requires a polynomial number of iterations of complexity , and Kanungo et al. (2004) also rely on a discretization to an -approximate centroid set (Matoušek, 2000) of size . Linear programming algorithms offer similar constant-factor guarantees with similar running times for -medians (again the finite set variant) and the related problem of facility location (Charikar et al., 2002; Jain and Vazirani, 2001).

In contrast to the above, this paper focuses on simpler algorithms in the sampling class, including -means++. In Arthur and Vassilvitskii (2007), it was proved that sampling results in an approximation, in expectation and for all datasets. The current work builds upon Arthur and Vassilvitskii (2007) to extend the guarantee to the constant-factor bi-criteria regime. Arthur and Vassilvitskii (2007) also provided a matching lower bound, exhibiting a dataset on which -means++ achieves an expected approximation.

Sampling algorithms have been shown to yield improved approximation factors provided that the dataset satisfies certain conditions. Such a result was established in Ostrovsky et al. (2012) for -means++ and other variants of Lloyd’s algorithm under the condition that the dataset is well-suited in a sense to partitioning into clusters. In Mettu and Plaxton (2004), an approximation was shown for a somewhat more complicated algorithm called successive sampling with running time, subject to a bound on the dispersion of the points. A constant-factor approximation with slightly superlinear running time has also been obtained in the streaming setting (Guha et al., 2003).

For -means++, the lower bound in Arthur and Vassilvitskii (2007), which holds in expectation, has spurred follow-on works on the question of whether -means++ might guarantee a constant-factor approximation with reasonably large probability. Negative answers were provided by Brunsch and Röglin (2013), who showed that an approximation factor better than cannot be achieved with probability higher than a decaying exponential in , and Bhattacharya et al. (2014), who showed that a similar statement holds even in dimensions.

In a similar direction to the one pursued in the present work, Aggarwal et al. (2009) showed that if the number of cluster centers can be increased to a constant factor times , then a constant-factor approximation can be achieved with constant probability. Specifically, they prove that using centers gives an approximation factor of with probability , together with a general bi-criteria guarantee but without explicit constants. An factor was also obtained independently by Ailon et al. (2009) using more centers, of order . As mentioned, the result of Aggarwal et al. (2009) differs from Theorem 1 herein in being true with constant probability as opposed to in expectation. Furthermore, Section 3.1 shows that a constant-probability corollary of Theorem 1 improves significantly upon Aggarwal et al. (2009).

Recently, Makarychev et al. (2015) has also established constant-factor bi-criteria results for -means. Their work differs from the present paper in studying more complex algorithms. First, similar to Kanungo et al. (2004), Makarychev et al. (2015) reduce the -means problem to an -approximate, finite-set instance of -medians of size . Subsequently, linear programming and local search algorithms are considered, the latter the same as in Kanungo et al. (2004); Arya et al. (2004), and both with polynomial complexity in the size of the -medians instance.

## 2 Preliminaries

### 2.1 Problem Definition

We are given points in a real metric space with metric . The objective is to choose cluster centers in and assign points to the nearest cluster center to minimize the potential function

(1) |

A cluster is thus defined by the points assigned to a center , where ties (multiple closest centers) are broken arbitrarily. For a subset of points , define to be the contribution to the potential from ; is the contribution from a single point .

The exponent in (1) is regarded as a problem parameter. Letting and be Euclidean distance, we have what is usually known as the -means problem, so-called because the optimal cluster centers are means of the points assigned to them. The choice is also popular and corresponds to the -medians problem.

Throughout this paper, an optimal clustering will always refer to one that minimizes (1) over solutions with clusters, where is given. Likewise, the term optimal cluster and symbol will refer to one of the clusters from this optimal solution. The goal is to approximate the potential of this optimal -clustering using cluster centers for .

### 2.2 Sampling Algorithm

The sampling algorithm chooses cluster centers randomly from with probabilities proportional to their current contributions to the potential, as detailed in Algorithm 1. Following Arthur and Vassilvitskii (2007), the case is referred to as the -means++ algorithm and the probabilities used after the first iteration are referred to as weighting (hence in general). For cluster centers, the running time of sampling is in dimensions.

### 2.3 Existing Lemmas Regarding Sampling

The following lemmas synthesize results from Arthur and Vassilvitskii (2007) that bound the expected potential within a single optimal cluster due to selecting a center from that cluster with uniform or weighting, as in Algorithm 1. These lemmas define the constant appearing in the main results below and are also used in their proof.

###### Lemma 1.

(Arthur and Vassilvitskii, 2007, Lemmas 3.1 and 5.1) Given an optimal cluster , let be the potential resulting from selecting a first cluster center randomly from with uniform weighting. Then for any , where

###### Lemma 2.

(Arthur and Vassilvitskii, 2007, Lemma 3.2) Given an optimal cluster and an initial potential , let be the potential resulting from adding a cluster center selected randomly from with weighting. Then for any , where .

The factor of between and for general is explained just before Theorem 5.1 in Arthur and Vassilvitskii (2007).

## 3 Main Results

The main results of this paper are stated below in terms of the single-cluster approximation ratio defined by Lemma 2. Subsequently in Section 3.1, the results are discussed in the context of previous work.

###### Theorem 1.

Let be the potential resulting from selecting cluster centers according to Algorithm 1, where . The expected approximation ratio is then bounded as

where is the golden ratio and is the th harmonic number.

In the proof of Theorem 1 in Section 4.2, it is shown that the term is indeed non-positive and can therefore be omitted, with negligible loss for large .

The approximation ratio bound in Theorem 1 is stated as a function of . The following corollary confirms that the theorem also implies a constant-factor bi-criteria approximation.

###### Corollary 1.

With the same definitions as in Theorem 1, the expected approximation ratio is bounded as

###### Proof.

The minimum appearing in Theorem 1 is bounded from above by its first term. This term is in turn increasing in with asymptote , which can therefore be taken as a -independent bound. ∎

It follows from Corollary 1 that a constant “oversampling” ratio leads to a constant-factor approximation. Theorem 1 offers a further refinement for finite .

The bounds in Theorem 1 and Corollary 1 consist of two factors. As increases, the second, parenthesized factor decreases to either exactly or approximately as . The first factor of however is no smaller than , and is a direct consequence of Lemma 2. Any improvement of Lemma 2 would therefore strengthen the approximation factors above. This subject is briefly discussed in Section 5.

### 3.1 Comparisons to Existing Results

A comparison of Theorem 1 to results in Arthur and Vassilvitskii (2007) is implicit in its statement since the term in the minimum comes directly from Arthur and Vassilvitskii (2007, Theorems 3.1 and 5.1). For , the first term in the minimum is smaller than for any , and hence Theorem 1 is always an improvement. For , Theorem 1 improves upon Arthur and Vassilvitskii (2007) for greater than the critical value

Numerical evaluation of shows that it reaches a maximum value of at and then decreases back toward roughly as . It can be concluded that for any , at most oversampling is required for Theorem 1 to guarantee a better approximation than Arthur and Vassilvitskii (2007).

The most closely related result to Theorem 1 and Corollary 1 is found in Aggarwal et al. (2009, Theorem 1). The latter establishes a constant-factor bi-criteria approximation that holds with constant probability, as opposed to in expectation. Since a bound on the expectation implies a bound with constant probability via Markov’s inequality, a direct comparison with Aggarwal et al. (2009) is possible. Specifically, for and the cluster centers assumed in Aggarwal et al. (2009), Theorem 1 in the present work implies that

after taking . Then by Markov’s inequality,

with probability at least as in Aggarwal et al. (2009). This approximation factor is less than half the factor of in Aggarwal et al. (2009).

Corollary 1 may also be compared to the results in Makarychev et al. (2015), although it should be re-emphasized that the latter analyzes different, substantially more complex algorithms, with running time at least for reasonably small . The main difference between Corollary 1 and the bounds in Makarychev et al. (2015) is the extra factor of since the factor of is comparable, at least for moderate values of that are of practical interest. As discussed above and in Section 5, the factor of is due to Lemma 2 and is unlikely to be intrinsic to the sampling algorithm.

## 4 Proofs

The overall strategy used to prove Theorem 1 is similar to that in Arthur and Vassilvitskii (2007). The key intermediate result is Lemma 3 below, which relates the potential at a later iteration in Algorithm 1 to the potential at an earlier iteration. Section 4.1 is devoted to proving Lemma 3. Subsequently in Section 4.2, Theorem 1 is proven by an application of Lemma 3.

In the sequel, we say that an optimal cluster is covered by a set of cluster centers if at least one of the centers lies in . Otherwise is uncovered. Also define as an abbreviation.

###### Lemma 3.

For an initial set of centers leaving optimal clusters uncovered, let denote the potential, the union of uncovered clusters, and the union of covered clusters. Let denote the potential resulting from adding centers, each selected randomly with weighting as in Algorithm 1. Then the new potential is bounded in expectation as

for coefficients and that depend only on . This holds in particular for

(2a) | ||||

(2b) |

### 4.1 Proof of Lemma 3

Lemma 3 is proven using induction, showing that if it holds for and , then it also holds for , similar to the proof of Arthur and Vassilvitskii (2007, Lemma 3.3). The proof is organized into three parts. Section 4.1.1 provides base cases. In Section 4.1.2, sufficient conditions on the coefficients , are derived that allow the inductive step to be completed. In Section 4.1.3, it is shown that the closed-form expressions in (2) are consistent with the base cases in Section 4.1.1 and satisfy the sufficient conditions from Section 4.1.2, thus completing the proof.

#### Base cases

This subsection exhibits two base cases of Lemma 3. While the second of these base cases does not conform to the functional forms in (2), it is shown later in Section 4.1.3 that the same base cases also hold with coefficients given by (2).

The first case corresponds to , for which we have . Since adding centers cannot increase the potential, i.e. deterministically, Lemma 3 holds with

(3) |

The second base case occurs for , . For this purpose, a slightly strengthened version of Arthur and Vassilvitskii (2007, Lemma 3.3) is used, as given next.

###### Lemma 4.

#### Sufficient conditions on coefficients

In this subsection, it is assumed inductively that Lemma 3 holds for and . The induction to the case is then completed under the following sufficient conditions on the coefficients:

(5a) | |||

(5b) |

and

(6a) | ||||

(6b) |

The first pair of conditions (5) applies to the coefficients involved in the inductive hypothesis for and . The second pair (6) can be seen as a recursive specification of the new coefficients for . This inductive step together with base cases (3) and (4) are sufficient to extend Lemma 3 to all , starting with from and .

The inductive step is broken down into a series of three lemmas, each building upon the last. The first lemma applies the inductive hypothesis to derive a bound on the potential that depends not only on and but also on .

###### Lemma 5.

Assume that Lemma 3 holds for and . Then for the case , i.e. corresponding to uncovered clusters and resulting after adding centers,

###### Proof.

We consider the two cases in which the first of the new centers is chosen from either the covered set or the uncovered set , similar to the proof of Lemma 4. Denote by the potential after adding the first new center.

Covered case: This case occurs with probability and leaves the covered and uncovered sets unchanged. We then invoke Lemma 3 with (one fewer center to add) and playing the role of . The contribution to from this case is then bounded by

(7) |

noting that for any set .

Uncovered case: We consider each uncovered cluster separately. With probability , the first new center is selected from , moving from the uncovered to the covered set and reducing the number of uncovered clusters by one. Applying Lemma 3 for , the contribution to is bounded by

Taking the expectation with respect to possible centers in and using Lemma 2 and , we obtain the further bound

Summing over yields

(8) |

using the inner product bound (18).

As noted above, the bound in Lemma 5 depends on , the potential over uncovered clusters. This quantity can be arbitrarily large or small. In the next lemma, is eliminated by maximizing with respect to it.

###### Proof.

The result is obtained by maximizing the bound in Lemma 5 with respect to . Let and denote the two terms in the minimum. The derivative of is given by

which does not change sign as a function of . The two cases and are considered separately below. Taking the maximum of the resulting bounds (9), (10) establishes the lemma.

Case : Both and are non-decreasing functions of . The former has the finite supremum

(9) |

whereas the latter increases without bound. Therefore eventually becomes the smaller of the two and (9) can be taken as an upper bound on .

Case : At , we have and . The assumption implies that . Since is now a decreasing function, the two functions must intersect and the point of intersection then provides an upper bound on .

Solving for the intersection leads after some algebra to a quadratic equation in :

Again by the assumption , the constant term in this quadratic equation is non-positive, implying that one of the roots is also non-positive and can be discarded. The remaining positive root is given by

after simplifying the discriminant to match the stated expression for . Evaluating either or at this root gives

(10) |

∎

The bound in Lemma 6 is a function of and only but is nonlinear, in contrast to the desired form in Lemma 3. The next step is to linearize the bound by imposing additional conditions (5) on the coefficients.

###### Lemma 7.

###### Proof.

It suffices to linearize the term in Lemma 6. In particular, we aim to bound the quadratic function from above by the square for all and some choice of . The cases and require that

Setting these inequalities to equalities, the remaining condition for the cross-term is

Equivalently for ,

We rearrange to obtain

the last of which is true by assumption (5). Thus we conclude that

Combining this last inequality with Lemma 6 proves the result. ∎

Given conditions (5) and Lemma 7, the inductive step for Lemma 3 can be completed by defining and recursively as in (6).

Equations (5) and (6) provide sufficient conditions on the coefficients and to establish Lemma 3 by induction. Section 4.1.3 shows that these conditions are satisfied by (2). To motivate the functional form chosen in (2), we first explore the behavior of solutions that satisfy (6) in particular. This is done by treating (6) as a recursion, taking the inequalities to be equalities, and numerically evaluating and starting from the base cases (3) and (4) as boundary conditions. More specifically, the computation is carried out as an outer loop over increasing starting from , and an inner loop over starting from . Figure 1 plots the resulting values for over the region ( is simply a shifted copy). The most striking feature of Figure 1 is that the level contours appear to be lines emanating from the origin. Sampling values at multiple points suggests that . The plot also has the properties that is decreasing in for fixed and increasing in for fixed . These observations lead to the functional form for proposed in Section 4.1.3.

As for conditions (5), it can be verified directly using the base cases (3) and (4) that they are satisfied for . The subsequent numerical values in Figure 1 were found to satisfy (5) for all as well. This suggests that recursion (6) is self-perpetuating in the sense that if (5) are satisfied for , then the values for , resulting from (6) will also satisfy (5) for and , i.e. points to the right and upper-right. This self-perpetuating property is not proven however in the present paper. Instead, it is shown that the proposed functional form (11) satisfies (5) directly.

#### Proof with specific form for coefficients

We now prove that Lemma 3 holds for coefficients , given by (11) below. These expressions are more general than (2) and are based on the observations drawn from Figure 1.

(11a) | ||||

(11b) |

Here and are parameters introduced to add flexibility to the basic form suggested by Figure 1, subject to the constraints , ,

(12a) | ||||

(12b) |

Equation (2) is obtained at the end from (11) by optimizing the parameters and . Note that with , (11a) is decreasing in for fixed and increasing in for fixed .

Given the inductive approach and the results established in Sections 4.1.1 and 4.1.2, the proof requires the remaining steps below. First, it is shown that the base cases (3), (4) from Section 4.1.1 imply that Lemma 3 is true for the same base cases but with , given by (11) instead. Second, (11) is shown to satisfy conditions (5) for all , thus permitting Lemma 7 to be used. Third, (11) is also shown to satisfy (6), which combined with Lemma 7 completes the induction.

Considering the base cases, for , (3) and (11) coincide so there is nothing to prove. For the case , , Lemma 3 with coefficients given by (4) implies the same with coefficients given by (11) provided that

This in turn is ensured if the coefficients satisfy for all . The most stringent case is and is met by assumption (12a).

For the second step of establishing (5), it is clear that (5a) is satisfied by (11a). A direct calculation presented in Appendix B shows that (5b) is also true.

Similarly for the third step, it suffices to show that (11a) satisfies recursion (6a) since (11b) automatically satisfies (6b). A proof is provided in Appendix C.

Having shown that Lemma 3 is true for coefficients given by (11) and (12), the specific expressions in (2) are obtained by minimizing in (11a) with respect to , , subject to (12). For fixed , minimizing with respect to yields in light of (12a), and

Minimizing with respect to then results in from (12b). The solution satisfying is and .

### 4.2 Proof of Theorem 1

Denote by the number of points in optimal cluster . In the first iteration of Algorithm 1, the first cluster center is selected from some with probability . Conditioned on this event, Lemma 3 is applied with covered set , uncovered clusters, and remaining cluster centers. This bounds the final potential as

where , are given by (2). Taking the expectation over possible centers in and using Lemma 1,

Taking the expectation over clusters and recalling that ,

(13) |

where

Next we aim to further bound the last term in (13). Using (2) and from Lemma 2,

The last expression for is seen to be non-negative for , , and . Furthermore, since (a singleton cluster) implies that , we have

(14) |

with equality if is completely concentrated in clusters of size . Substituting (2b) and (14) into (13), we obtain

(15) |

The last step is to recall Arthur and Vassilvitskii (2007, Theorems 3.1 and 5.1), which together state that

(16) |

for resulting from selecting exactly cluster centers. In fact, (16) also holds for centers, , since adding centers cannot increase the potential. The proof is completed by taking the minimum of (15) and (16).

## 5 Conclusion and Future Work

This paper has shown that simple sampling algorithms, including -means++, are guaranteed in expectation to attain a constant-factor bi-criteria approximation to an optimal clustering. The contributions herein extend and improve upon previous results concerning sampling (Arthur and Vassilvitskii, 2007; Aggarwal et al., 2009).

As noted in Section 3, the constant in Theorem 1 and Corollary 1 represents an opportunity to further improve the approximation bounds. One possibility is to tighten Lemmas 3.2 and 5.1 in Arthur and Vassilvitskii (2007), which are the lemmas responsible for the factor. A more significant improvement may result from considering not only the covering of optimal clusters by at least one cluster center, but also the effect of selecting more than one center from a single optimal cluster. As the number of selected centers increases, an approximation factor analogous to would be expected to decrease. Analysis of algorithms with similar simplicity to sampling is also of interest.

## Appendix A Proof of Lemma 4

The proof follows the inductive proof of Arthur and Vassilvitskii [2007, Lemma 3.3] with the notational changes , , and . For brevity, only the differences are presented.

For the first base case , , Arthur and Vassilvitskii [2007] already show that the lemma holds with coefficients , , and . Similarly for the second base case , Arthur and Vassilvitskii [2007] show that , as required for the stronger version here.

For the first “covered” case considered in the inductive step, the argument is the same and the upper bound on the contribution to is changed to

(17) |

For the second “uncovered” case, the first displayed expression in the right-hand column of Arthur and Vassilvitskii [2007, page 1030] becomes (after applying the bound from Lemma 2)

Summing over all uncovered clusters , the contribution to is bounded from above by

The inner product above can be bounded as

(18) |

with equality if both , are completely concentrated in the same cluster . The sum of squares term can be bounded using the power-mean inequality as in Arthur and Vassilvitskii [2007]. Hence the contribution to