A tight lower bound instance for k-means++ in constant dimension

A tight lower bound instance for k-means++ in constant dimension

Anup Bhattacharya IIT Delhi, India Ragesh Jaiswal Corresponding author: rjaiswal@cse.iitd.ac.in IIT Delhi, India Nir Ailon Nir Ailon acknowledges the support of a Marie Curie International Reintegration Grant PIRG07-GA-2010-268403, as well as the support of The Israel Science Foundation (ISF) no. 1271/13. Technion, Haifa, Israel
Abstract

The k-means++ seeding algorithm is one of the most popular algorithms that is used for finding the initial centers when using the k-means heuristic. The algorithm is a simple sampling procedure and can be described as follows:

Pick the first center randomly from the given points. For , pick a point to be the center with probability proportional to the square of the Euclidean distance of this point to the closest previously chosen centers.

The k-means++ seeding algorithm is not only simple and fast but also gives an approximation in expectation as shown by Arthur and Vassilvitskii [7]. There are datasets [7, 3] on which this seeding algorithm gives an approximation factor of in expectation. However, it is not clear from these results if the algorithm achieves good approximation factor with reasonably high probability (say ). Brunsch and Röglin [9] gave a dataset where the k-means++ seeding algorithm achieves an approximation ratio with probability that is exponentially small in . However, this and all other known lower-bound examples [7, 3] are high dimensional. So, an open problem was to understand the behavior of the algorithm on low dimensional datasets. In this work, we give a simple two dimensional dataset on which the seeding algorithm achieves an approximation ratio with probability exponentially small in . This solves open problems posed by Mahajan et al. [13] and by Brunsch and Röglin [9].

1 Introduction

The k-means clustering problem is one of the most important problems in Data Mining and Machine Learning that has been widely studied. The problem is defined as follows:

(k-means problem): Given a set of points in a -dimensional space, find a set of points (these are called centers) such that the cost function is minimized. Here denotes the square of the Euclidean distance between points and . In the discrete version of this problem, the centers are constrained to be a subset of the given points .

The problem is known to be NP-hard even for small values of the parameters such as when  [10] and when  [14, 13]. There are various approximation algorithms for the problem. However, in practice, a heuristic known as the k-means algorithm (also known as Lloyd’s algorithm) is used because of its excellent performance on real datasets even though it does not give any performance guarantees. This algorithm is simple and can be described as follows:

(k-means Algorithm): (i) Arbitrarily, pick points as centers. (ii) Cluster the given points based on the nearest distance to centers in . (iii) For all clusters, find the mean of all points within a cluster and replace the corresponding member of with this mean. Repeat steps (ii) and (iii) until convergence.

Even though the above algorithm performs very well on real datasets, it guarantees only convergence to local minima. This means that this local search algorithm may either converge to a local optimum solution or may take a large amount of time to converge [5, 6]. Poor choice of the initial centers (step (i)) is one of the main reasons for its bad performance with respect to approximation factor. A number of seeding heuristics have been suggested for choosing the initial centers. One such seeding algorithm that has become popular is the k-means++ seeding algorithm. The algorithm is extremely simple and runs very fast in practice. Moreover, this simple randomized algorithm also gives an approximation factor of in expectation [7]. In practice, this seeding technique is used for finding the initial centers to be used with the k-means algorithm and this ensures a theoretical approximation guarantee. The simplicity of the algorithm can be seen by its simple description below:

(k-means++ seeding): Pick the first center randomly from the given points. After picking centers, pick the center to be a point with probability proportional to the square of the Euclidean distance of to the closest previously chosen centers.

A lot of recent work has been done in understanding the power of this simple sampling based approach for clustering. We discuss these in the following paragraph.

1.1 Related work

Arthur and Vassilvitskii [7] showed that the sampling algorithm gives an approximation guarantee of in expectation. They also give an example dataset on which this approximation guarantee is best possible. Ailon et al. [4] and Aggarwal et al. [3] showed that sampling more than centers in the manner described above gives a constant pseudo-approximation.111Here pseudo-approximation means that the algorithm is allowed to output more than centers but the approximation factor is computed by comparing with the optimal solution with centers. Ackermann and Blömer [1] showed that the results of Arthur and Vassilvitskii [7] may be extended to a large class of other distance measures. Jaiswal et al. [12] showed that the seeding algorithm may be appropriately modified to give a -approximation algorithm for the k-means problem. Jaiswal and Garg [11] and Agarwal et al. [2] showed that if the dataset satisfies certain separation conditions, then the seeding algorithm gives constant approximation with probability . Bahmani et al. [8] showed that the seeding algorithm performs well even when fewer than sampling iterations are executed provided that more than one center is chosen in a sampling iteration. We now discuss our main results.

1.2 Main results

The lower-bound examples of Arthur and Vassilvitskii [7] and Aggarwal et al. [3] have the following two properties: (a) the examples are high dimensional and (b) the examples lower-bound the expected approximation factor. Brunsch and Röglin [9] showed that the k-means++ seeding gives an approximation ratio of at most only with probability that is exponentially small in . They constructed a high dimensional example where this is not true and showed that an approximation is achieved with probability exponentially small in . An important open problem mentioned in their work is to understand the behavior of the seeding algorithm on low-dimensional datasets. This problem is also mentioned as an open problem by Mahajan et al. [13] who showed that the planar (dimension=2) k-means problem is NP-hard. In this work, we construct a two dimensional dataset on which the k-means++ seeding algorithm achieves an approximation ratio with probability exponentially small in . More formally, here is the main theorem that we prove in this work.

Theorem 1 (Main Theorem).

Let for a fixed real . There exists a family of instances for which k-means++ achieves an -approximation with probability at most .

Note that the theorem refutes the conjecture by Brunsch and Röglin [9]. They conjectured that the k-means++ seeding algorithm gives an -approximation for any -dimensional instance.

1.3 Our techniques

All the known lower-bound examples [7, 3, 9] have the following general properties:

  1. All optimal clusters have equal number of points.

  2. The optimal clusters are high dimensional simplices.

In order to construct a counterexample for the two dimensional case, we consider datasets that have different number of points in different optimal clusters. Our counterexample is shown in Figure 1. The optimal clusters (indicated in the figure using shaded areas) are along the vertical lines drawn along the -axis. In the next section, we will show that these are indeed the optimal clusters. Note that the cluster sizes decrease exponentially going from left to right. We say that an optimal cluster is covered by the algorithm if the algorithm picks a center from that optimal cluster. We will use the following two high level observations to show the main theorem:

  • Observation 1: The algorithm needs to cover more than a certain minimum fraction of clusters to achieve a required approximation.

  • Observation 2: After any number of iterations, the probability of sampling the next center from an uncovered cluster is not too large compared to the probability of sampling from a covered cluster.

We bound the probability of covering more than a certain minimum fraction of clusters by analyzing a simple Markov chain. This Markov chain is almost the same as the chain used by Brunsch and Röglin [9]. We also borrow the analysis of the Markov chain from [9]. So, in some sense, the main contribution of this paper is to come up with a two dimensional instance the analysis of which may be reduced to the Markov chain analysis in [9].

In the next section, we give the details of our construction and proof.

2 The Bad Instance

We provide a family of -dimensional instances on which performance of k-means++ is bad in the sense of Theorem 1. This family is depicted in Figure 1. We first recursively define certain quantities that will be useful in describing the construction. Here is any positive integer, is any positive real number, and is a positive real number dependent on (we will define this dependency later during analysis).

Note that the input points may overlap in our construction. We will consider groups of points . These groups are shown as shaded areas in Figure 1. They are located at only distinct -coordinates. These distinct -coordinates are given by , where . The group, , consists of points that have the -coordinate . We will later show that is actually the optimal k-means clustering for our instance. Group has points located at . For all , group has points located at , and for all , has points located at each of and .

Let the total number of points on group be denoted by . Therefore, we can write summing points across all locations on that cluster to get the following:

(1)

Note that .

Figure 1: -D example instance showing the , , , and optimal clusters only. Note that this figure is not to scale.

2.1 Optimal solution for our instance

We consider the following partitioning of the given points: Let denote the subset of points on the -axis and for , let denote the subset of all points that are located at -coordinate . For any point , we say that point is in level . Given the above definitions of group and level, the location of a point may be defined by a tuple , where denotes the index of the group to which this point belongs and denotes the level of this point.

Given a set of centers and a subset of points , the potential of with respect to is given by . Furthermore, the potential of a location with respect to is defined as . Here, . Given a set of locations and a subset of points , denotes the potential of points with respect to a set of centers located at locations in .

We start by showing certain basic properties of our instance.

Lemma 2.

Let . The total number of points at level in group is .

Proof.

The points for group are located at distance . Since , this means that the points in are located at . So, the number of points is given by . ∎

Lemma 3.

For all and ,

Proof.

From Lemma 2, we know that the total number of points at level of any group is either or . The net change in the squared Euclidean distance of any point in with respect to locations and is . So, the total change in potential is at most . ∎

Lemma 4.

For all and ,

Proof.

WLOG assume that . From Lemma 2, we can get an upper bound in the following manner:

Let denote a set of optimal centers for the k-means problem. Let denote the set of locations of these centers. We will show that . We start by showing some simple properties of the set . We will need the following additional definitions: We say that a group is covered with respect to if has at least one center from group . Group is said to be uncovered otherwise.

Lemma 5.

.

Proof.

Let . Then we have:

Let be any set of locations that do not include , then (since the nearest location to is ). So, necessarily includes the location . ∎

Lemma 6.

For any , if group is covered with respect to , then .

Proof.

For the sake of contradiction, assume that . Let be the location that is farthest from the -axis among the locations of the form . Consider the set of locations . We will now show that . WLOG let us assume that is positive. The change in center location does not decrease the potential of , does not increase the potential of , and does not increase the potential of points on the -axis. From Lemmas 3 and 4, we have that the increase in potential is at most . On the other hand, since the contribution of the points located at to the total potential changes from to , the total decrease in potential is at least . So, we have that the total potential decreases and hence . This contradicts the fact that denotes the location of the optimal centers. ∎

Lemma 7.

All groups are covered with respect to .

Proof.

For the sake of contradiction, assume that there is a group that is uncovered. This means that there is another group such that there are at least two locations from that is present in . Note that from the previous lemma . Let for some . We now consider the set of locations . We will now show that . Since , the change in center location does not decrease the potential of , does not increase the potential of and does not increase the potential of points on the -axis. From Lemmas 3 and 4, we have that the increase in potential is at most . On the other hand, since the contribution of the points located at to the total potential changes from to , the total decrease in potential is at least . So, we have that the total potential decreases and hence . This contradicts the fact that denotes the location of the optimal centers. ∎

The following is a simple corollary of Lemmas 6 and 7.

Corollary 8.

Let denote the optimal set of centers for our k-means problem instance and let denote the location of these optimal centers. Then .

2.2 Potential of the optimal solution

Let us denote the potential of the optimum solution by . Since optimum chooses its centers only from locations on the -axis, we can compute as follows:

(2)

3 Analysis of k-means++ for our instance

We will first show that with very high probability, the first center chosen by the k-means++ seeding algorithm is located at the location . This is simply due to the large number of points located at the location and the fact that the first center is chosen uniformly at random from all the given points.

Lemma 9.

Let be the location of the first center chosen by the k-means++ seeding algorithm. Then .

Proof.

For any let . Since the first center is chosen uniformly at random, we have:

(since from (1), )

Let us define the following event:

Definition 10.

denotes the event that the location of the first chosen center is .

Lemma 9 shows that happens with a very high probability. We will do the remaining analysis conditioned on the event . We will later use the above lemma to remove the conditioning. The advantage of using this event is that once the first center has the location , computing an upper-bound on the potential of any location becomes easy. This is because we can compute potential with respect to the center at location . Computing such upper bounds will be crucial in our analysis.

Our analysis closely follows that of [9]. Let us analyze the situation after iterations of the k-means++ seeding algorithm (given that the event happens). Let denote the set of chosen centers. Let denote the number of optimal clusters among that are covered by . Let denote the points in these covered clusters and denote the points in the uncovered clusters. Conditioned on , the probability that the next center will be chosen from is . So, the probability of covering a previously uncovered cluster in iteration depends on the ratio . The smaller this ratio, the smaller is the chance of covering a new cluster. We will show that this ratio is small for most iterations of the algorithm. This means that even when the algorithm terminates, there are a number of uncovered clusters. This implies that the algorithm gives a solution that is worse compared to the optimal solution. In order to upper-bound the ratio , we will upper bound the value of and lower-bound the value of . We state these bounds formally in the next two lemmas.

Lemma 11.

.

Proof.

For any covered cluster for , we know that has points at levels . For any such location (except location ) such that does not have a center at this location, the contribution of the points at this location to is at least . Furthermore, the contribution of points at location in case does not contain a center from this location, is at least . Therefore,

Lemma 12.

.

Proof.

Since the number of covered clusters among is , the number of uncovered clusters is given by . Let be any such uncovered cluster. Since happens, there is a center at location . Therefore, the contribution of to can be upper bounded by the quantity . This can be computed in the following manner:

Hence, the total contribution from the uncovered clusters is upper bounded by . ∎

We will also need a lower bound on . This is given in the next lemma.

Lemma 13.

.

Proof.

Let be an uncovered cluster for some . For any location , the contribution of the points at this location to is at least times the number of points at that location. So we have:

Since most of our bounds have the term , we define and do the remaining analysis in terms of . Note that all the bounds on and are dependent only on and not on . This allows us to define the following quantity that will be used in the remaining analysis. This is an upper bound on the ratio obtained from Lemmas 11 and 12.

(3)

We now get a bound on the number of clusters among that are needed to be covered to achieve an approximation factor of for a fixed . For any such fixed approximation factor , we define the following quantities that will be used in the analysis.

(4)
Lemma 14.

Any -approximate clustering covers and at least clusters among .

Proof.

The optimal potential is given by (by (2)). Consider any -approximate clustering. Suppose this clustering covers clusters among . Let the covered and uncovered clusters be denoted by and respectively. Then we have:

The second inequality above is using Lemma 13. This means that the number of covered clusters among should satisfy

Figure 2: Markov chain used for analyzing the algorithm.

We analyze the behavior of the k-means++ seeding algorithm with respect to the number of covered optimal clusters using a Markov chain (see Figure 2). This Markov chain is almost the same as the Markov chain used to analyze the bad instance by Brunsch and Röglin [9]. In fact, the remaining analysis will mostly mimic that analysis in [9]. The next lemma formally relates the probability that the algorithm achieves an approximation to the Markov chain reaching its end state. We analyze this Markov chain in the next subsection.

Lemma 15.

Let and for , let We consider the linear Markov chain with states with starting state (see Figure 2). Edges have transition probabilities and the self-loops have transition probabilities . Then the probability that the k-means++ seeding algorithm gives an -approximate solution is upper bounded by the probability that the state is reached by the Markov chain within steps.

Proof.

The proof is trivial from the observation that the probability that a previously uncovered cluster will be covered in iteration is given by . ∎

3.1 Definitions and inequalities

A number of quantities will be used for the analysis of the Markov chain. The reader is advised to refer to this subsection when reading the next subsection dealing with the analysis of the Markov chain. The following quantities written as a function of will be used in the analysis:

(5)
(6)
(7)
(8)
(9)
(10)
(11)

We will also use the following inequalities. Here, whenever we say that for two functions and , we actually mean to say that for all sufficiently large .

(12)
(13)
(14)
(15)
(16)

Except for inequality (13), all the inequalities are the same as in [9]. We refer the reader to [9] for the correctness of these inequalities. As for (13), note that and . So, for sufficiently large values of , the inequality is true.

For the remaining analysis, we will assume that the value of is fixed such that inequalities (12), (13), (14), (15), and (16) are true. Given this, we will avoid using the functional notation and simply use the name of the quantities. For example, we will use instead of and instead of etc.

3.2 Analysis of Markov chain

We now analyze the Markov chain and upper bound the probability of this Markov chain reaching the state within steps. To be able to do so, we define random variables , where the denotes the number of steps to move from state to state . We consider the random variable . We would like to show that the expected value of is much larger than and then use the Hoeffding inequality to bound the probability. To do this using the well known Hoeffding bound, we need to have a bound on the value of each of the random variables. So, we define related random variables , where . We will analyze the random variable . We will use the following lemma from [9].

Lemma 16 (Claim 5 from [9]).

The expected value of is and the expected value of is .

The next lemma relates the expected values of and .

Lemma 17 (Similar to Lemma 6 in [9]).

.

Proof.

First we get a lower bound on in the following manner:

Also, from the previous lemma we have:

The last inequality used (13). ∎

Next, we we get a lower bound on .

Lemma 18 (Similar to Lemma 7 in [9]).

.

Proof.

We can lower-bound in the following manner:

Since , we can write,

Using this, we have:

Using the previous two lemmas, we can now obtain a lower bound on .

Lemma 19 (Same as Corollary 8 in [9]).

.

Proof.

Using the last two lemmas, we have