# A Unified Framework for Clustering Constrained Data without Locality Property

This work was supported in part by NSF through grants IIS-1115220, IIS-1422591, CCF-1422324, CNS-1547167, CCF-1656905, and CCF-1716400. The first author was also supported by a start-up fund from Michigan State University. A preliminary version of this paper appeared in the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2015) DX15 ().

Hu Ding
Department of Computer Science and Engineering, Michigan State University
School of Computer Science and Technology, University of Science and Technology of China
Email: huding@msu.edu, huding@ustc.edu.cn

Jinhui Xu
Department of Computer Science and Engineering, State University of New York at Buffalo
Email: jinhui@buffalo.edu

###### Abstract

In this paper, we consider a class of constrained clustering problems of points in $\mathbb{R}^d$, where $d$ could be rather high. A common feature of these problems is that their optimal clusterings no longer have the locality property (due to the additional constraints), which is a key property required by many algorithms for their unconstrained counterparts. To overcome the difficulty caused by the loss of locality, we present a unified framework, called Peeling-and-Enclosing (PnE), to iteratively solve two variants of the constrained clustering problems, constrained $k$-means clustering ($k$-CMeans) and constrained $k$-median clustering ($k$-CMedian). Our framework generalizes Kumar et al.’s elegant $k$-means clustering approach KSS () from unconstrained data to constrained data, and is based on two standalone geometric techniques, called Simplex Lemma and Weaker Simplex Lemma, for $k$-CMeans and $k$-CMedian, respectively. The simplex lemma (or weaker simplex lemma) enables us to efficiently approximate the mean (or median) point of an unknown set of points by searching a small-size grid, independent of the dimensionality of the space, in a simplex (or the surrounding region of a simplex), and thus can be used to handle high dimensional data. If $k$ and $\frac{1}{\epsilon}$ are fixed numbers, our framework generates, in nearly linear time (i.e., $O(n(\log n)^{k+1}d)$), $O((\log n)^{k})$ $k$-tuple candidates for the mean or median points, and one of them induces a $(1+\epsilon)$-approximation for $k$-CMeans or $k$-CMedian, where $n$ is the number of points. Combining this unified framework with a problem-specific selection algorithm (which determines the best $k$-tuple candidate), we obtain a $(1+\epsilon)$-approximation for each of the constrained clustering problems. Our framework improves considerably the best known results for these problems. We expect that our technique will be applicable to other constrained clustering problems without locality.

###### Keywords:
constrained clustering, $k$-means/median, approximation algorithms, high dimensions

## 1 Introduction

Clustering is one of the most fundamental problems in computer science, and finds numerous applications in many different areas, such as data management, machine learning, bioinformatics, and networking JMF (). The common goal of many clustering problems is to partition a set of given data items into a number of clusters so as to minimize the total cost measured by a certain objective function. For example, the popular $k$-means (or $k$-median) clustering seeks $k$ mean (or median) points to induce a partition of the given data items so that the average squared distance (or the average distance) from each data item to its closest mean (or median) point is minimized. Most existing clustering techniques assume that the data items are independent from each other and therefore can “freely” determine their memberships in the resulting clusters (i.e., a data item does not need to pay attention to the clustering of others). However, in many real-world applications, data items are often constrained or correlated, and handling such additional constraints requires a great deal of effort. In recent years, considerable attention has been paid to various types of constrained clustering problems, and a number of techniques have been obtained for problems such as $l$-diversity clustering LYZ (), $r$-gather clustering APF (); EHR (); HR13 (), capacitated clustering ABS13 (); CHK (); KS00 (), chromatic clustering DX11 (); ADH (), and probabilistic clustering GM09 (); CM08 (); LSS (). In this paper, we study a class of constrained clustering problems of points in Euclidean space.

Given a set of points $P$ in $\mathbb{R}^d$, a positive integer $k$, and a constraint $\mathbb{C}$, the constrained $k$-means (or $k$-median) clustering problem is to partition $P$ into $k$ clusters so as to minimize the objective function of the ordinary $k$-means (or $k$-median) clustering and satisfy the constraint $\mathbb{C}$. In general, the problems are denoted by $k$-CMeans and $k$-CMedian, respectively.

The detailed definition for each individual problem is given in Section 4. Roughly speaking, data constraints can be imposed at either cluster or item level. Cluster level constraints are restrictions on the resulting clusters, such as the size of the clusters APF () or their mutual differences ZLM (), while item level constraints are mainly on data items inside each cluster, such as the coloring constraint which prohibits items of the same color being clustered into one cluster ADH (); DX11 (); LYZ ().

The additional constraints could considerably change the nature of the clustering problems. For instance, one key property exhibited in many unconstrained clustering problems is the so-called locality property, which indicates that each cluster is located entirely inside the Voronoi cell of its center (e.g., the mean, median, or center point) in the Voronoi diagram of all the centers IKI () (see Figure 1(a)). Existing algorithms for these clustering problems often rely on such a property KSS (); BHI (); AV07 (); C09 (); FKK (); IKI (); M00 (); OSS (). However, due to the additional constraints, the locality property may no longer hold (see Figure 1(b)). Therefore, we need new techniques to overcome this challenge.
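To see concretely how a constraint can break locality, the following minimal sketch brute-forces a tiny one-dimensional instance, once without constraints and once with a size cap on each cluster (a capacitated constraint). The point set, the cap value, and all function names are illustrative assumptions and do not come from the paper.

```python
from itertools import product

def kmeans_cost(points, labels, k):
    """Sum of squared distances from each point to the mean of its cluster."""
    cost = 0.0
    for c in range(k):
        members = [p for p, l in zip(points, labels) if l == c]
        if members:
            mean = sum(members) / len(members)
            cost += sum((p - mean) ** 2 for p in members)
    return cost

def best_clustering(points, k, cap=None):
    """Brute-force the cheapest labelling; cap (if given) bounds cluster sizes."""
    best = None
    for labels in product(range(k), repeat=len(points)):
        if cap is not None and max(labels.count(c) for c in range(k)) > cap:
            continue  # violates the hypothetical capacity constraint
        cost = kmeans_cost(points, labels, k)
        if best is None or cost < best[0]:
            best = (cost, labels)
    return best

def violates_locality(points, labels, k):
    """True if some point is strictly closer to another cluster's mean."""
    means = {}
    for c in range(k):
        members = [p for p, l in zip(points, labels) if l == c]
        if members:
            means[c] = sum(members) / len(members)
    return any(
        abs(p - means[l]) > min(abs(p - m) for m in means.values()) + 1e-12
        for p, l in zip(points, labels)
    )

points = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0]
free_cost, free_labels = best_clustering(points, k=2)
cap_cost, cap_labels = best_clustering(points, k=2, cap=3)
print(violates_locality(points, free_labels, 2))  # prints False
print(violates_locality(points, cap_labels, 2))   # prints True
```

With the cap of 3, the cheapest feasible clustering is {0, 1, 2} and {3, 10, 11}, so the point 3 ends up in the cluster whose mean (8) is farther away than the other mean (1).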

### 1.1 Our Main Results

In this paper we present a unified framework called Peeling-and-Enclosing (PnE), based on two standalone geometric techniques called Simplex Lemma and Weaker Simplex Lemma, to solve a class of constrained clustering problems without the locality property in Euclidean space, where the dimensionality of the space could be rather high and the number $k$ of clusters is assumed to be some fixed number. Particularly, we investigate the constrained $k$-means ($k$-CMeans) and $k$-median ($k$-CMedian) versions of these problems. For the $k$-CMeans problem, our unified framework generates in $O(n(\log n)^{k+1}d)$ time a set of $k$-tuple candidates of cardinality $O((\log n)^{k})$ for the to-be-determined $k$ mean points. We show that one of these candidates induces a $(1+\epsilon)$-approximation for $k$-CMeans. To find the best $k$-tuple candidate, a problem-specific selection algorithm is needed for each individual constrained clustering problem (note that due to the additional constraints, the selection problems may not be trivial). Combining the unified framework with the selection algorithms, we obtain a $(1+\epsilon)$-approximation for each constrained clustering problem in the considered class. Our results considerably improve (in various ways) the best known algorithms for all these problems (see Table 1 in Section 1.2). Our techniques can also be extended to $k$-CMedian to achieve $(1+\epsilon)$-approximations for these problems with the same time complexities. Below is a list of the constrained clustering problems considered in this paper. We expect that our technique will be applicable to other clustering problems without the locality property, as long as the corresponding selection problems can be solved.

1. $l$-Diversity Clustering. In this problem, each input point is associated with a color, and each cluster has no more than a fraction $\alpha$ (for some constant $0<\alpha\leq 1$) of its points sharing the same color. The problem is motivated by a widely-used privacy preserving model called $l$-diversity MGK (); LYZ () in data management, which requires that each block contains no more than a fraction $1/l$ of items sharing the same sensitive attribute.

2. Chromatic Clustering. In DX11 (), Ding and Xu introduced a new clustering problem called chromatic clustering, which requires that points with the same color be placed in different clusters. It is motivated by a biological application for clustering chromosome homologs in a population of cells, where homologs from the same cell should be clustered into different clusters. A similar problem also appears in applications related to transportation system design ADH ().

3. Fault Tolerant Clustering. The problem of fault tolerant clustering assigns each point to its $l$ nearest cluster centers for some $l\geq 1$, and counts all the $l$ distances as its cost. The problem has been extensively studied in various applications for achieving better fault tolerance CGR (); KPS (); SS03 (); KR13 (); HHL ().

4. $r$-Gather Clustering. This clustering problem requires each of the clusters to contain at least $r$ points for some $r>0$. It is motivated by the $r$-anonymity model for privacy preserving S02 (); APF (), where each block contains at least $r$ items (we use $r$ here, instead of $k$, since $k$ often denotes the number of clusters in a clustering problem).

5. Capacitated Clustering. This clustering problem has an upper bound on the size of each cluster, and finds various applications in data mining and resource assignment KS00 (); CHK ().

6. Semi-Supervised Clustering. Many existing clustering techniques, such as ensemble clustering SG02 (); Sin10 () and consensus clustering ACN (); CW10 (), make use of a priori knowledge. Since such clusterings are not always based on the geometric cost (e.g., $k$-means cost) of the input, a more accurate way of clustering is to consider both the a priori knowledge and the geometric cost. We consider the following semi-supervised clustering problem: given a set $P$ of points and a clustering $\overline{\mathcal{S}}$ of $P$ (based on the a priori knowledge), partition $P$ into $k$ clusters so as to minimize the sum (or some function) of the geometric cost and the difference with the given clustering $\overline{\mathcal{S}}$. Another related problem is evolutionary clustering CKT (), where the clustering at each time point needs to minimize not only the geometric cost, but also the total shifting from the clustering at the previous time point (which can be viewed as $\overline{\mathcal{S}}$).

7. Uncertain Data Clustering. Due to unavoidable errors, data for clustering are not always precise. This motivates us to consider the following probabilistic clustering problem GM09 (); CM08 (); LSS (): given a set of “nodes” with each represented as a probabilistic distribution over a point set in $\mathbb{R}^d$, group the nodes into $k$ clusters so as to minimize the expected cost with respect to the probabilistic distributions.
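Several of the constraints listed above are simple predicates on a candidate clustering. As a rough illustration, the sketch below checks four of them for a clustering given as a list of clusters, each a list of `(point_id, color)` pairs. This representation, the parameter names, and the function names are assumptions made for the example only.

```python
from collections import Counter

def max_color_count(cluster):
    """Largest number of points in the cluster that share one color."""
    counts = Counter(color for _, color in cluster)
    return max(counts.values(), default=0)

def is_chromatic(clusters):
    """Chromatic constraint: no two points of the same color share a cluster."""
    return all(max_color_count(c) <= 1 for c in clusters)

def is_diverse(clusters, fraction):
    """Diversity constraint: no color exceeds the given fraction of a cluster."""
    return all(max_color_count(c) <= fraction * len(c) for c in clusters)

def is_r_gather(clusters, r):
    """r-gather constraint: every cluster holds at least r points."""
    return all(len(c) >= r for c in clusters)

def is_capacitated(clusters, cap):
    """Capacity constraint: every cluster holds at most cap points."""
    return all(len(c) <= cap for c in clusters)

clusters = [
    [(0, "red"), (1, "blue")],
    [(2, "red"), (3, "blue"), (4, "green")],
]
print(is_chromatic(clusters))       # prints True
print(is_diverse(clusters, 0.4))    # prints False
print(is_r_gather(clusters, 2))     # prints True
print(is_capacitated(clusters, 2))  # prints False
```

Checking feasibility is the easy direction; the hard part, addressed by the selection algorithms, is finding a minimum-cost clustering that passes such a check for a given tuple of centers.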

Note: Following our work published in DX15 (), Bhattacharya et al. BJK () improved the running time for finding the candidates of $k$-cluster centers from nearly linear to linear based on the elegant $D^2$-sampling. Their work also follows the framework for clustering constrained data presented in this paper, i.e., generating the candidates and selecting the best one by a problem-specific selection algorithm. Our paper represents the first systematic theoretical study of the constrained clustering problems. Some of the underlying techniques, such as Simplex Lemma and Weaker Simplex Lemma, are interesting in their own right, and have already been used to solve other problems DGX () (e.g., the popular “truth discovery” problem in data mining).

### 1.2 Related Works

The above constrained clustering problems have been extensively studied in the past and a number of theoretical results have been obtained (in addition to many heuristic/practical solutions). Table 1 lists the best known theoretical results for each of them. It is clear that most existing results are either constant approximations or only for some restricted versions (e.g., constant dimensional space, etc.), and therefore can be improved by our techniques.

For the related traditional Euclidean $k$-means and $k$-median clustering problems, extensive research has been done in the past. Inaba et al. IKI () showed that an exact $k$-means clustering can be computed in $O(n^{O(dk)})$ time for $n$ points in $\mathbb{R}^d$. Arthur and Vassilvitskii AV07 () presented the $k$-means++ algorithm that achieves the expected $O(\log k)$ approximation ratio. Ostrovsky et al. OSS () provided a $(1+\epsilon)$-approximation for well-separated points. Based on the concept of stability, Awasthi et al. ABS10 () presented a PTAS for the problems of $k$-means and $k$-median clustering. Matousek M00 () obtained a nearly linear time $(1+\epsilon)$-approximation for any fixed $d$ and $k$. A similar result for $k$-median has also been achieved by Kolliopoulos and Rao KR07 (). Later, Fernandez de la Vega et al. FKK () and Badŏiu et al. BHI () achieved nearly linear time $(1+\epsilon)$-approximations for high dimensional $k$-means and $k$-median clustering, respectively, for fixed $k$. Kumar et al. KSS () showed that linear-time randomized $(1+\epsilon)$-approximation algorithms can be obtained for several Euclidean clustering problems (such as $k$-means and $k$-median) in any dimensional space. Recently, this technique has been further extended to several clustering problems with non-metric distance functions ABS (). Later, Jaiswal et al. JKS () applied a non-uniform sampling technique, called $D^2$-sampling, to simplify and improve the result in KSS (); their algorithm can also handle the non-metric distance clustering problems studied in ABS (). Using the core-set technique, a series of improvements have been achieved for high dimensional clustering problems FL11 ().

As for the hardness of the problem, Dasgupta D08 () showed that $k$-means clustering is NP-hard in high dimensional space even if $k=2$; Awasthi et al. ACK15 () proved that there is no PTAS for $k$-means clustering if both $d$ and $k$ are large, unless $P=NP$. Guruswami and Indyk GI03 () showed that it is NP-hard to obtain any PTAS for $k$-median clustering if $k$ is not a constant and $d$ is $\Omega(\log n)$.

Besides the traditional clustering models, Balcan et al. considered the problem of finding the clustering with small difference from the unknown ground truth BBG (); BB09 ().

### 1.3 Our Main Ideas

Most existing $k$-means or $k$-median clustering algorithms in Euclidean space consist of two main steps: (1) identify the set of $k$ mean or median points and (2) partition the input points into $k$ clusters based on these mean or median points (we call this step Partition). Note that for some constrained clustering problems, the Partition step may not be trivial. More formally, we have the following definition.

###### Definition 1 (Partition Step)

Given an instance $P$ of $k$-CMeans (or $k$-CMedian) and $k$ cluster centers (i.e., the mean or median points), the Partition step is to form $k$ clusters of $P$, where the clusters should satisfy the constraint and each cluster is assigned to an individual cluster center, such that the objective function of the ordinary $k$-means (or $k$-median) clustering is minimized.

To determine the set of $k$ mean or median points in step (1), most existing algorithms (either explicitly or implicitly) rely on the locality property. To shed some light on this, consider a representative and elegant approach by Kumar et al. KSS () for $k$-means clustering. Let $\{Opt_1,\cdots,Opt_k\}$ be the set of $k$ unknown optimal clusters in non-increasing order of their sizes. Their approach uses random sampling and sphere peeling to iteratively find the $k$ mean points. At the $j$-th iterative step, it draws $j-1$ peeling spheres centered at the $j-1$ already obtained mean points, and takes a random sample of the points outside the peeling spheres to find the $j$-th mean point. Due to the locality property, the points belonging to the first $j-1$ clusters lie inside their corresponding $j-1$ Voronoi cells; that is, for each peeling sphere, most of the covered points belong to its corresponding cluster, which ensures the correctness of the peeling step.

However, when the additional constraint (such as coloring or size) is imposed on the points, the locality property may no longer hold (see Figure 1(b)), and thus the correctness of the peeling step cannot always be guaranteed. In this scenario, the core-set technique FL11 () is also unlikely to be able to resolve the issue. The main reason is that although the core-set can greatly reduce the size of the input points, it is quite challenging to impose the constraint through the core-set.

To overcome this challenge, we present a unified framework, called Peeling-and-Enclosing (PnE), based on a standalone new geometric technique called Simplex Lemma. The simplex lemma aims to address the major obstacle encountered by the peeling strategy in KSS () for constrained clustering problems. More specifically, due to the loss of locality, at the $j$-th peeling step, the points of the $j$-th cluster $Opt_j$ could be scattered over all the Voronoi cells of the first $j-1$ mean points, and therefore their mean point can no longer be simply determined by the sample outside the $j-1$ peeling spheres. To resolve this issue, our main idea is to view $Opt_j$ as the union of $j$ unknown subsets, $Q_1,\cdots,Q_j$, with each $Q_l$, $1\leq l\leq j-1$, being the set of points inside the Voronoi cell (or peeling sphere) of the obtained $l$-th mean point and $Q_j$ being the set of remaining points of $Opt_j$. After approximating the mean point of each unknown subset by using random sampling, we build a simplex to enclose a region which contains the mean point of $Opt_j$, and then search the simplex region for a good approximation of the $j$-th mean point. To make this approach work, we need to overcome two difficulties: (a) how to generate the desired simplex to contain the $j$-th mean point, and (b) how to efficiently search for the (approximate) $j$-th mean point inside the simplex.

For difficulty (a), our idea is to use the already determined $j-1$ mean points (which can be shown to be also approximate mean points of $Q_1,\cdots,Q_{j-1}$, respectively) and another point, which is the mean of those points in $Opt_j$ outside the peeling spheres (or Voronoi cells) of the first $j-1$ mean points (i.e., $Q_j$), to build a ($j-1$)-dimensional simplex to contain the $j$-th mean point. Since we do not know how $Opt_j$ is partitioned (i.e., how $Opt_j$ intersects the $j-1$ peeling spheres), we vary the radii of the peeling spheres $O(\log n)$ times to guess the partition and generate a set of simplexes, where the radius candidates are based on an upper bound of the optimal value determined by a novel estimation algorithm (in Section 3.4). We show that among the set of simplexes, one of them contains the $j$-th (approximate) mean point.

For difficulty (b), our simplex lemma (in Section 2) shows that if each vertex $v_l$ of the simplex $\mathcal{V}$ is the (approximate) mean point of $Q_l$, then we can find a good approximation of the mean point of $Opt_j$ by searching a small-size grid inside $\mathcal{V}$. A nice feature of the simplex lemma is that the grid size is independent of the dimensionality of the space and thus can be used to handle high dimensional data. In some sense, our simplex lemma can be viewed as a considerable generalization of the well-known sampling lemma (i.e., Lemma 4 in this paper) in IKI (), which has been widely used for estimating the mean of a point set through random sampling FMS (); IKI (); KSS (). Different from Lemma 4, which requires a global view of the point set (meaning that the sample needs to be taken from the whole point set), our simplex lemma only requires some partial views (e.g., sample sets are taken from those unknown subsets whose size might be quite small). If $Opt_j$ is the point set, our simplex lemma enables us to bound the error by the variance of $Opt_j$ (i.e., a local measure) and the optimal value of the clustering problem on the whole instance $P$ (i.e., a global measure), and thus helps us to ensure the quality of our solution. Here, the “variance” of a point set in Euclidean space is the average of the squared distances from the points to their mean point.

For the $k$-CMedian problem, we show that although the simplex lemma no longer holds since the median point may lie outside the simplex, a weaker version (in Section 5.1) exists, which searches a surrounding region of the simplex. Thus our Peeling-and-Enclosing framework works for both $k$-CMeans and $k$-CMedian. It generates in total $O((\log n)^{k})$ $k$-tuple candidates for the constrained $k$ mean or median points. To determine the best $k$ mean or median points, we need to use the property of each individual problem to design a selection algorithm. The selection algorithm takes each $k$-tuple candidate, computes a clustering (i.e., completing the Partition step) satisfying the additional constraint, and outputs the $k$-tuple with the minimum cost. We present a selection algorithm for each considered problem in Sections 4 and 5.4.
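The selection step described above has a simple generic shape: try each candidate tuple of centers, solve the Partition step for it, and keep the cheapest result. The sketch below shows that loop with a pluggable partition routine. The default routine (nearest-center assignment) is only a stand-in that is valid for the unconstrained problem; each constrained problem would need its own routine in its place. All names and the sample data are hypothetical.

```python
def nearest_center_partition(points, centers):
    """Stand-in Partition step: assign each point to its nearest center.

    Only correct without constraints; a constrained problem must supply
    its own routine with the same (cost, labels) return shape.
    """
    cost, labels = 0.0, []
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
        j = min(range(len(centers)), key=dists.__getitem__)
        labels.append(j)
        cost += dists[j]  # k-means style cost: squared distance to the center
    return cost, labels

def select_best(points, candidates, partition=nearest_center_partition):
    """Return (cost, centers, labels) for the cheapest candidate tuple."""
    best = None
    for centers in candidates:
        cost, labels = partition(points, centers)
        if best is None or cost < best[0]:
            best = (cost, centers, labels)
    return best

points = [(0, 0), (0, 1), (5, 5), (6, 5)]
candidates = [
    ((0, 0), (1, 1)),
    ((0, 0.5), (5.5, 5)),
    ((3, 3), (6, 6)),
]
best_cost, best_centers, best_labels = select_best(points, candidates)
print(best_cost, best_centers, best_labels)
```

Keeping the partition routine as a parameter mirrors the split in the text between the unified candidate generation and the problem-specific selection algorithms.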

## 2 Simplex Lemma

In this section, we present the Simplex Lemma for approximating the mean point of an unknown point set $Q$, where the only known information is a set of points with each of them being an approximate mean point of an unknown subset of $Q$. In Section 5.1, we show how to extend the idea to approximate the median point by the Weaker Simplex Lemma. The two lemmas are keys to solving the $k$-CMeans and $k$-CMedian problems.

###### Lemma 1 (Simplex Lemma I)

Let $Q$ be a set of points in $\mathbb{R}^d$ with a partition of $Q=\cup_{l=1}^{j}Q_l$ and $Q_{l_1}\cap Q_{l_2}=\emptyset$ for any $l_1\neq l_2$. Let $o$ be the mean point of $Q$, and $o_l$ be the mean point of $Q_l$ for $1\leq l\leq j$. Let the variance of $Q$ be $\delta^2=\frac{1}{|Q|}\sum_{q\in Q}\|q-o\|^2$, and $\mathcal{V}$ be the simplex determined by $\{o_1,\cdots,o_j\}$. Then for any $0<\epsilon\leq 1$, it is possible to construct a grid of size $O((8j/\epsilon)^{j})$ inside $\mathcal{V}$ such that at least one grid point $\tau$ satisfies the inequality $\|\tau-o\|\leq\sqrt{\epsilon}\delta$.

Figure 2(a) gives an example for Lemma 1. To prove Lemma 1, we first introduce the following lemma.

###### Lemma 2

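Let $Q$ be a set of points in $\mathbb{R}^d$, and $Q_1$ be a subset containing $\alpha|Q|$ points for some $0<\alpha\leq 1$. Let $o$ and $o_1$ be the mean points of $Q$ and $Q_1$, respectively. Then $\|o_1-o\|\leq\sqrt{\frac{1-\alpha}{\alpha}}\,\delta$, where $\delta^2=\frac{1}{|Q|}\sum_{q\in Q}\|q-o\|^2$.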

###### Proof

The following claim has been proved in KSS ().

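Claim 1 Let $Q$ be a set of points in $\mathbb{R}^d$ space, and $o$ be the mean point of $Q$. For any point $\tilde{o}\in\mathbb{R}^d$, $\sum_{q\in Q}\|q-\tilde{o}\|^2=\sum_{q\in Q}\|q-o\|^2+|Q|\times\|o-\tilde{o}\|^2$.

Let $Q_2=Q\setminus Q_1$, and $o_2$ be its mean point. By Claim 1, we have the following two equalities.

$$\sum_{q\in Q_1}\|q-o\|^2=\sum_{q\in Q_1}\|q-o_1\|^2+|Q_1|\times\|o_1-o\|^2, \tag{1}$$

$$\sum_{q\in Q_2}\|q-o\|^2=\sum_{q\in Q_2}\|q-o_2\|^2+|Q_2|\times\|o_2-o\|^2. \tag{2}$$

Let $L=\|o_1-o_2\|$. By the definition of mean point, we have

$$o=\frac{1}{|Q|}\sum_{q\in Q}q=\frac{1}{|Q|}\Big(\sum_{q\in Q_1}q+\sum_{q\in Q_2}q\Big)=\frac{1}{|Q|}\big(|Q_1|o_1+|Q_2|o_2\big). \tag{3}$$

Thus the three points $\{o,o_1,o_2\}$ are collinear, while $\|o_1-o\|=(1-\alpha)L$ and $\|o_2-o\|=\alpha L$. Meanwhile, by the definition of $\delta$, we have

$$\delta^2=\frac{1}{|Q|}\Big(\sum_{q\in Q_1}\|q-o\|^2+\sum_{q\in Q_2}\|q-o\|^2\Big). \tag{4}$$

Combining (1), (2), and (4), we have

$$\begin{aligned}
\delta^2 &= \frac{1}{|Q|}\Big(\sum_{q\in Q_1}\|q-o_1\|^2+|Q_1|\times\|o_1-o\|^2+\sum_{q\in Q_2}\|q-o_2\|^2+|Q_2|\times\|o_2-o\|^2\Big)\\
&\geq \frac{1}{|Q|}\big(|Q_1|\times\|o_1-o\|^2+|Q_2|\times\|o_2-o\|^2\big)\\
&= \alpha\big((1-\alpha)L\big)^2+(1-\alpha)(\alpha L)^2\\
&= \alpha(1-\alpha)L^2.
\end{aligned} \tag{5}$$

Thus, we have $L\leq\frac{\delta}{\sqrt{\alpha(1-\alpha)}}$, which means that $\|o_1-o\|=(1-\alpha)L\leq\sqrt{\frac{1-\alpha}{\alpha}}\,\delta$. ∎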
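The bound that the derivation in (5) yields, namely $\|o_1-o\|\leq\sqrt{(1-\alpha)/\alpha}\,\delta$ for a subset holding an $\alpha$ fraction of the points, is easy to check numerically. The sketch below does so on synthetic data, including a deliberately skewed subset. The data, the seed, and the helper names are assumptions for illustration only.

```python
import math
import random

def mean(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def lemma2_gap(Q, Q1):
    """Return (||o1 - o||, sqrt((1 - alpha) / alpha) * delta) for Q1 within Q."""
    o, o1 = mean(Q), mean(Q1)
    alpha = len(Q1) / len(Q)
    delta = math.sqrt(sum(dist(q, o) ** 2 for q in Q) / len(Q))
    return dist(o1, o), math.sqrt((1 - alpha) / alpha) * delta

rng = random.Random(0)
Q = [tuple(rng.gauss(0, 1) for _ in range(3)) for _ in range(200)]
# A skewed subset: the 20 points with the largest first coordinate.
Q1 = sorted(Q, key=lambda q: q[0])[-20:]
lhs, rhs = lemma2_gap(Q, Q1)
print(lhs <= rhs)  # prints True
```

The smaller the fraction $\alpha$, the looser the bound becomes, which is why the proof of Lemma 1 treats very small subsets separately.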

###### Proof (of Lemma 1)

We prove this lemma by induction on .

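Base case: For $j=1$, since $Q_1=Q$, $o_1=o$. Thus, the simplex $\mathcal{V}$ and the grid are all simply the point $o_1$. Clearly $\tau=o_1$ satisfies the inequality.

Induction step: Assume that the lemma holds for any $j\leq j_0$ for some $j_0\geq 1$ (i.e., the induction hypothesis). Now we consider the case of $j=j_0+1$. First, we assume that $\frac{|Q_l|}{|Q|}\geq\frac{\epsilon}{4j}$ for each $1\leq l\leq j$. Otherwise, we can reduce the problem to the case of a smaller $j$ in the following way. Let $I=\{l\mid 1\leq l\leq j,\ \frac{|Q_l|}{|Q|}<\frac{\epsilon}{4j}\}$ be the index set of small subsets. Then, $\frac{\sum_{l\in I}|Q_l|}{|Q|}<\frac{\epsilon}{4}$, and $\frac{\sum_{l\notin I}|Q_l|}{|Q|}\geq 1-\frac{\epsilon}{4}$. By Lemma 2, we know that $\|o'-o\|\leq\sqrt{\frac{\epsilon/4}{1-\epsilon/4}}\,\delta$, where $o'$ is the mean point of $\cup_{l\notin I}Q_l$. Let $(\delta')^2$ be the variance of $\cup_{l\notin I}Q_l$. Then, we have $(\delta')^2\leq\frac{|Q|}{|\cup_{l\notin I}Q_l|}\delta^2\leq\frac{1}{1-\epsilon/4}\delta^2$. Thus, if we replace $Q$ and $\epsilon$ by $\cup_{l\notin I}Q_l$ and $\frac{\epsilon}{16}$, respectively, and find a point $\tau$ such that $\|\tau-o'\|^2\leq\frac{\epsilon}{16}(\delta')^2\leq\frac{\epsilon/16}{1-\epsilon/4}\delta^2$, then we have

$$\|\tau-o\|^2\leq\big(\|\tau-o'\|+\|o'-o\|\big)^2\leq\frac{9}{16}\cdot\frac{\epsilon}{1-\epsilon/4}\,\delta^2\leq\epsilon\delta^2, \tag{6}$$

where the last inequality is due to the fact $\epsilon\leq 1$. This means that we can reduce the problem to a problem with the point set $\cup_{l\notin I}Q_l$ and a smaller $j$ (i.e., $j-|I|$). By the induction hypothesis, we know that the reduced problem can be solved, where the new simplex would be a subset of $\mathcal{V}$ determined by $\{o_l\mid 1\leq l\leq j,\ l\notin I\}$, and therefore the induction step holds for this case. Note that in general, we do not know $I$, but we can enumerate all the $2^j$ possible combinations to guess $I$ if $j$ is a fixed number, as is the case in the algorithm in Section 3.2. Thus, in the following discussion, we can assume that $\frac{|Q_l|}{|Q|}\geq\frac{\epsilon}{4j}$ for each $1\leq l\leq j$.

For each $1\leq l\leq j$, since $\frac{|Q_l|}{|Q|}\geq\frac{\epsilon}{4j}$, by Lemma 2, we know that $\|o_l-o\|\leq\sqrt{\frac{1-\frac{\epsilon}{4j}}{\frac{\epsilon}{4j}}}\,\delta\leq 2\sqrt{\frac{j}{\epsilon}}\,\delta$. This, together with the triangle inequality, implies that for any $1\leq l,l'\leq j$,

$$\|o_l-o_{l'}\|\leq\|o_l-o\|+\|o_{l'}-o\|\leq 4\sqrt{j/\epsilon}\,\delta. \tag{7}$$

Thus, if we pick any index $l_0$, and draw a ball $\mathcal{B}$ centered at $o_{l_0}$ and with radius $r=\max_{1\leq l\leq j}\{\|o_l-o_{l_0}\|\}\leq 4\sqrt{j/\epsilon}\,\delta$ (by (7)), the whole simplex $\mathcal{V}$ will be inside $\mathcal{B}$. Note that $o=\sum_{l=1}^{j}\frac{|Q_l|}{|Q|}o_l$, so $o$ lies inside the simplex $\mathcal{V}$. To guarantee that $o$ is contained by the ball $\mathcal{B}$, we can construct $\mathcal{B}$ only in the ($j-1$)-dimensional space spanned by $\{o_1,\cdots,o_j\}$, rather than the whole $\mathbb{R}^d$ space. Also, if we build a grid inside $\mathcal{B}$ with grid length $\frac{\epsilon r}{4j}$, i.e., generating a uniform mesh with each cell being a ($j-1$)-dimensional hypercube of edge length $\frac{\epsilon r}{4j}$, the total number of grid points is no more than $O((\frac{8j}{\epsilon})^{j})$. With this grid, we know that for any point $p$ inside $\mathcal{V}$, there exists a grid point $g$ such that $\|g-p\|\leq\sqrt{j\big(\frac{\epsilon r}{4j}\big)^2}=\frac{\epsilon}{4\sqrt{j}}\,r\leq\sqrt{\epsilon}\delta$. This means that we can find a grid point $\tau$ inside $\mathcal{V}$, such that $\|\tau-o\|\leq\sqrt{\epsilon}\delta$. Thus, the induction step holds, and the lemma is true for any $j\geq 1$. ∎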
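The key point of the proof is that the mean of the whole set is a convex combination of the subset means, so a grid whose size depends only on the number of subsets (not on the dimension) must contain a point near it. The sketch below illustrates this with a simplified grid over barycentric coordinates with step $1/m$. This is a swapped-in construction chosen for brevity, not the uniform mesh inside a ball used in the proof, and the data, the resolution `m`, and the function names are assumptions.

```python
import math
from itertools import product

def mean(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def barycentric_grid(vertices, m):
    """Yield sum_l (c_l / m) * v_l over integers c_l >= 0 with sum c_l = m."""
    j, d = len(vertices), len(vertices[0])
    for head in product(range(m + 1), repeat=j - 1):
        if sum(head) > m:
            continue
        coeffs = head + (m - sum(head),)  # last weight is forced by the sum
        yield tuple(
            sum(c * v[i] for c, v in zip(coeffs, vertices)) / m for i in range(d)
        )

# Three subsets of a point set in 4 dimensions (illustrative data).
parts = [
    [(0, 0, 0, 0), (2, 0, 0, 0)],
    [(0, 4, 0, 0), (0, 6, 0, 0), (0, 5, 3, 0)],
    [(7, 7, 7, 7)],
]
vertices = [mean(part) for part in parts]      # the subset means o_l
o = mean([q for part in parts for q in part])  # the mean of the whole set
grid = list(barycentric_grid(vertices, m=10))
# In the algorithm o is unknown and every grid point is kept as a candidate;
# the closest one is picked here only to inspect the approximation error.
best = min(grid, key=lambda g: dist(g, o))
print(len(grid), dist(best, o))
```

The number of grid points here is the number of ways to split `m` into as many non-negative parts as there are vertices, so it is unaffected by the ambient dimension, which matches the feature of the lemma that makes it usable for high dimensional data.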

In the above lemma, we assume that the exact positions of $\{o_1,\cdots,o_j\}$ are known (see Figure 2(a)). However, in some scenarios (e.g., in the algorithm in Section 3.2), we only know an approximate position of each mean point $o_l$ (see Figure 2(b)). The following lemma shows that an approximate position of $o$ can still be similarly determined (see Section A.1 for the proof).

###### Lemma 3 (Simplex Lemma II)

Let $Q$, $o$, $Q_l$, and $o_l$ ($1\leq l\leq j$) be defined as in Lemma 1. Let $\{o'_1,\cdots,o'_j\}$ be $j$ points in $\mathbb{R}^d$ such that $\|o'_l-o_l\|\leq L$ for $1\leq l\leq j$ and $L>0$, and $\mathcal{V}'$ be the simplex determined by $\{o'_1,\cdots,o'_j\}$. Then for any $0<\epsilon\leq 1$, it is possible to construct a grid of size $O((8j/\epsilon)^{j})$ inside $\mathcal{V}'$ such that at least one grid point $\tau$ satisfies the inequality $\|\tau-o\|\leq\sqrt{\epsilon}\delta+(1+\epsilon)L$.

## 3 Peeling-and-Enclosing Algorithm for k-CMeans

In this section, we present a new Peeling-and-Enclosing (PnE) algorithm for generating a set of candidates for the mean points of $k$-CMeans. Our algorithm uses peeling spheres and the simplex lemma to iteratively find a good candidate for each unknown cluster. An overview of the algorithm is given in Section 3.1.

Some notations: Let $P=\{p_1,\cdots,p_n\}$ be the set of $\mathbb{R}^d$ points in $k$-CMeans, and $\{Opt_1,\cdots,Opt_k\}$ be the $k$ unknown optimal constrained clusters with $m_j$ being the mean point of $Opt_j$ for $1\leq j\leq k$. Without loss of generality, we assume that $|Opt_1|\geq|Opt_2|\geq\cdots\geq|Opt_k|$. Denote by $\delta^2_{opt}$ the optimal objective value, i.e., $\delta^2_{opt}=\frac{1}{n}\sum_{j=1}^{k}\sum_{p\in Opt_j}\|p-m_j\|^2$. We also set $\epsilon>0$ as the parameter related to the quality of the approximate clustering result.

### 3.1 Overview of the Peeling-and-Enclosing Algorithm

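Our Peeling-and-Enclosing algorithm needs an upper bound $\Delta$ on the optimal value $\delta^2_{opt}$. Specifically, $\delta^2_{opt}$ satisfies the condition $\Delta/c\leq\delta^2_{opt}\leq\Delta$ for some constant $c\geq 1$. In Section 3.4, we will present a novel algorithm to determine such an upper bound for general constrained $k$-means clustering problems. Then, it searches for a $(1+\epsilon)$-approximation of $\delta^2_{opt}$ in the set

$$H=\{\Delta/c,\ (1+\epsilon)\Delta/c,\ (1+\epsilon)^2\Delta/c,\ \cdots,\ (1+\epsilon)^{\lceil\log_{1+\epsilon}c\rceil}\Delta/c\geq\Delta\}. \tag{8}$$

Obviously, there exists one element of $H$ lying inside the interval $[\delta^2_{opt},(1+\epsilon)\delta^2_{opt}]$, and the size of $H$ is $O(\frac{1}{\epsilon}\log c)$.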
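The guessing set in (8) is a geometric sequence that starts at $\Delta/c$ and grows by a factor of $(1+\epsilon)$ until it reaches $\Delta$. A minimal sketch of its construction follows; the function name and the numeric values, including the pretended optimal value `opt`, are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import math

def guess_set(upper, c, eps):
    """Guesses (upper / c) * (1 + eps)^t for t = 0 .. ceil(log_{1+eps} c)."""
    steps = math.ceil(math.log(c) / math.log(1 + eps))
    return [upper / c * (1 + eps) ** t for t in range(steps + 1)]

upper, c, eps = 100.0, 8.0, 0.5
H = guess_set(upper, c, eps)

# A hypothetical optimal value satisfying upper / c <= opt <= upper.
opt = 30.0
hits = [h for h in H if opt <= h <= (1 + eps) * opt]
print(len(H), hits)  # prints 7 [42.1875]
```

Since consecutive guesses differ by a factor of exactly $(1+\epsilon)$ and the sequence spans $[\Delta/c,\Delta]$, some guess always lands in the target interval, whatever the true optimal value is within that range.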

At each searching step, our algorithm performs a sphere-peeling and simplex-enclosing procedure to iteratively generate $k$ approximate mean points for the constrained clusters. Initially, our algorithm uses Lemmas 4 and 5 to find an approximate mean point $p_{v_1}$ for $Opt_1$ (note that since $Opt_1$ is the largest cluster, $|Opt_1|/n\geq 1/k$ and the sampling lemma applies). At the $j$-th iteration ($j\geq 2$), it already has the approximate mean points $p_{v_1},\cdots,p_{v_{j-1}}$ for $Opt_1,\cdots,Opt_{j-1}$, respectively (see Figure 3(a)). Due to the lack of locality, some points of $Opt_j$ could be scattered over the regions (e.g., Voronoi cells or peeling spheres) of $Opt_1,\cdots,Opt_{j-1}$ and are difficult to distinguish from the points in these clusters. Since the number of such points could be small (compared to that of the first $j-1$ clusters), they need to be handled differently from the remaining points. Our idea is to separate them using $j-1$ peeling spheres, $B_{j,1},\cdots,B_{j,j-1}$, centered at the $j-1$ approximate mean points respectively and with some properly guessed radius (see Figure 3(b)). Let $\mathcal{A}$ be the set of unknown points in $Opt_j\setminus(\cup_{l=1}^{j-1}B_{j,l})$. Our algorithm considers two cases, (a) $|\mathcal{A}|$ is large enough and (b) $|\mathcal{A}|$ is small. For case (a), since $|\mathcal{A}|$ is large enough, we can use Lemma 4 and Lemma 5 to find an approximate mean point $\pi$ of $\mathcal{A}$, and then construct a simplex determined by $\pi$ and $p_{v_1},\cdots,p_{v_{j-1}}$ to contain the $j$-th mean point (see Figure 3(c)). Note that $\mathcal{A}$ and $Opt_j\cap B_{j,l}$, $1\leq l\leq j-1$, can be viewed as a partition of $Opt_j$, where the points covered by multiple peeling spheres can be assigned to any one of them, and $p_{v_l}$ can be shown to be an approximate mean point of $Opt_j\cap B_{j,l}$; thus the simplex lemma applies. For case (b), it directly constructs a simplex determined just by $p_{v_1},\cdots,p_{v_{j-1}}$. For either case, our algorithm builds a grid inside the simplex and uses Lemma 3 to find an approximate mean point for $Opt_j$ (i.e., $p_{v_j}$, see Figure 3(d)). The algorithm repeats the Peeling-and-Enclosing procedure $k$ times to generate the $k$ approximate mean points.

### 3.2 Peeling-and-Enclosing Algorithm

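Before presenting our algorithm, we first introduce two basic lemmas from IKI (); DX14 () for random sampling. Let $S$ be a set of $n$ points in $\mathbb{R}^d$ space, and $T$ be a randomly selected subset of size $t$ from $S$. Denote by $\overline{x}(S)$ and $\overline{x}(T)$ the mean points of $S$ and $T$ respectively.

###### Lemma 4 (IKI ())

With probability $1-\eta$, $\|\overline{x}(S)-\overline{x}(T)\|<\sqrt{\frac{1}{\eta t}}\,\delta$, where $\delta^2=\frac{1}{n}\sum_{s\in S}\|s-\overline{x}(S)\|^2$ and $0<\eta<1$.

###### Lemma 5 (DX14 ())

Let $\Gamma$ be a set of elements, and $S$ be a subset of $\Gamma$ with $\frac{|S|}{|\Gamma|}=\alpha$ for some $\alpha\in(0,1)$. If we randomly select $\frac{t\ln\frac{t}{\eta}}{\ln(1+\alpha)}=O(\frac{t}{\alpha}\ln\frac{t}{\eta})$ elements from $\Gamma$, then with probability at least $1-\eta$, the sample contains at least $t$ elements from $S$ for $0<\eta<1$ and $t\in\mathbb{Z}^+$.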
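The sampling lemma of Inaba et al. is commonly stated as follows: the mean of a uniform random sample of size $t$ is, with probability at least $1-\eta$, within squared distance $\delta^2/(\eta t)$ of the true mean, where $\delta^2$ is the variance of the set. The sketch below estimates the failure rate empirically on synthetic data. The data, the seed, the parameter values, and the use of sampling without replacement are all assumptions made for illustration.

```python
import random

def mean(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def failure_rate(S, t, eta, trials, rng):
    """Fraction of trials with ||x(T) - x(S)||^2 >= delta^2 / (eta * t)."""
    xS = mean(S)
    delta2 = sum(sq_dist(s, xS) for s in S) / len(S)  # variance of S
    bad = 0
    for _ in range(trials):
        T = rng.sample(S, t)  # uniform random subset of size t
        if sq_dist(mean(T), xS) >= delta2 / (eta * t):
            bad += 1
    return bad / trials

rng = random.Random(0)
S = [tuple(rng.gauss(0, 1) for _ in range(5)) for _ in range(1000)]
eta = 0.25
rate = failure_rate(S, t=20, eta=eta, trials=500, rng=rng)
print(rate <= eta)
```

The bound comes from Markov's inequality and is therefore usually loose in practice; the observed failure rate is typically far below $\eta$. Note also that the sample size needed depends on $\eta$ and the target accuracy, not on the dimension or on the size of the set.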

Our Peeling-and-Enclosing algorithm is shown in Algorithm 1.

###### Theorem 3.1

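Let $P$ be the set of $n$ $\mathbb{R}^d$ points and $k\in\mathbb{Z}^+$ be a fixed constant. In $O(2^{poly(\frac{k}{\epsilon})}n(\log n)^{k+1}d)$ time, Algorithm 1 outputs $O(2^{poly(\frac{k}{\epsilon})}(\log n)^{k})$ $k$-tuple candidate mean points. With constant probability, there exists one $k$-tuple candidate in the output which is able to induce a $(1+O(\epsilon))$-approximation of $k$-CMeans (together with the solution for the corresponding Partition step).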

###### Remark 1

(1) To increase the success probability to be close to $1$, e.g., $1-\frac{1}{n}$, one only needs to repeatedly run the algorithm $O(\log n)$ times; both the time complexity and the number of $k$-tuple candidates increase by a factor of $O(\log n)$. (2) In general, the Partition step may be challenging to solve. As shown in Section 4, the constrained clustering problems considered in this paper admit efficient selection algorithms for their Partition steps.

### 3.3 Proof of Theorem 3.1

Let $\beta_j=\frac{|Opt_j|}{n}$, and $\delta^2_j=\frac{1}{|Opt_j|}\sum_{p\in Opt_j}\|p-m_j\|^2$, where $m_j$ is the mean point of $Opt_j$. By our assumption in the beginning of Section 3, we know that $\beta_1\geq\cdots\geq\beta_k$. Clearly, $\sum_{j=1}^{k}\beta_j=1$ and the optimal objective value $\delta^2_{opt}=\sum_{j=1}^{k}\beta_j\delta^2_j$.

Proof Synopsis: Instead of directly proving Theorem 3.1, we consider the following Lemma 6 and Lemma 7, which jointly ensure the correctness of Theorem 3.1. In Lemma 6, we show that there exists a root-to-leaf path in one of the returned trees such that its associated $k$ points along the path, denoted by $\{p_{v_1},\cdots,p_{v_k}\}$, are close enough to the mean points $\{m_1,\cdots,m_k\}$ of the $k$ optimal clusters, respectively. The proof is based on mathematical induction; each step needs to build a simplex, and applies Simplex Lemma II to bound the error, i.e., $\|p_{v_j}-m_j\|$ in (9). The error is estimated by considering both the local (i.e., the variance of cluster $Opt_j$) and global (i.e., the optimal value $\delta_{opt}$) measurements. This is a more accurate estimation, compared to the widely used Lemma 4, which considers only the local measurement. Such an improvement is due to the increased flexibility in Simplex Lemma II, and is a key to our proof. In Lemma 7, we further show that the $k$ points, $\{p_{v_1},\cdots,p_{v_k}\}$, in Lemma 6 induce a $(1+O(\epsilon))$-approximation of $k$-CMeans.

###### Lemma 6

Among all the trees generated by Algorithm 1, with constant probability, there exists at least one tree, $\mathcal{T}$, which has a root-to-leaf path with each of its nodes $v_j$ at level $j$ ($1\leq j\leq k$) associating with a point $p_{v_j}$ and satisfying the inequality

$$\|p_{v_j}-m_j\|\leq\epsilon\delta_j+(1+\epsilon)j\sqrt{\frac{\epsilon}{\beta_j}}\,\delta_{opt}. \tag{9}$$

Before proving this lemma, we first show its implication.

###### Lemma 7

If Lemma 6 is true, then $\{p_{v_1},\cdots,p_{v_k}\}$ is able to induce a $(1+O(\epsilon))$-approximation of $k$-CMeans (together with the solution for the corresponding Partition step).

###### Proof

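We assume that Lemma 6 is true. Then for each $1\leq j\leq k$, we have

$$\begin{aligned}
\sum_{p\in Opt_j}\|p-p_{v_j}\|^2 &= \sum_{p\in Opt_j}\|p-m_j\|^2+|Opt_j|\times\|m_j-p_{v_j}\|^2\\
&\leq \sum_{p\in Opt_j}\|p-m_j\|^2+|Opt_j|\times 2\Big(\epsilon^2\delta^2_j+(1+\epsilon)^2j^2\frac{\epsilon}{\beta_j}\delta^2_{opt}\Big)\\
&= (1+2\epsilon^2)|Opt_j|\delta^2_j+2(1+\epsilon)^2j^2\epsilon n\delta^2_{opt},
\end{aligned} \tag{10}$$

where the first equation follows from Claim 1 in the proof of Lemma 2 (note that $m_j$ is the mean point of $Opt_j$), the inequality follows from Lemma 6 and the fact that $(a+b)^2\leq 2(a^2+b^2)$ for any two real numbers $a$ and $b$, and the last equality follows from the fact that $\frac{|Opt_j|}{\beta_j}=n$. Summing both sides of (10) over $j$, we have

$$\begin{aligned}
\sum_{j=1}^{k}\sum_{p\in Opt_j}\|p-p_{v_j}\|^2 &\leq \sum_{j=1}^{k}\Big((1+2\epsilon^2)|Opt_j|\delta^2_j+2(1+\epsilon)^2j^2\epsilon n\delta^2_{opt}\Big)\\
&\leq (1+2\epsilon^2)\sum_{j=1}^{k}|Opt_j|\delta^2_j+2(1+\epsilon)^2k^3\epsilon n\delta^2_{opt}\\
&= (1+O(k^3)\epsilon)n\delta^2_{opt},
\end{aligned} \tag{11}$$

where the last equation follows from the fact that $\sum_{j=1}^{k}|Opt_j|\delta^2_j=n\delta^2_{opt}$. By (11), we know that $\{p_{v_1},\cdots,p_{v_k}\}$ will induce a $(1+O(k^3)\epsilon)$-approximation for $k$-CMeans (together with the solution for the corresponding Partition step). Note that $k$ is assumed to be a fixed number. Thus the lemma is true. ∎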

Lemma 7 implies that Lemma 6 is indeed sufficient to ensure the correctness of Theorem 3.1 (except for the number of candidates and the time complexity). Now we prove Lemma 6.

###### Proof (of Lemma 6)

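Let $\mathcal{T}$ be the tree generated by Algorithm 2 when the guessed value taken from $H$ falls in the interval $[\delta^2_{opt},(1+\epsilon)\delta^2_{opt}]$. We will focus our discussion on $\mathcal{T}$, and prove the lemma by mathematical induction on $j$.

Base case: For $j=1$, since $\beta_1=\max\{\beta_j\mid 1\leq j\leq k\}$, we have $\beta_1\geq\frac{1}{k}$. By Lemma 4 and Lemma 5, we can find the approximate mean point through random sampling. Let $\Gamma$ and $S$ (in Lemma 5) be $P$ and $Opt_1$, respectively. Also, $p_{v_1}$ is the mean point of the random sample from $P$. Lemma 5 ensures that the sample contains a sufficient number of points from $Opt_1$, and Lemma 4 implies that $\|p_{v_1}-m_1\|\leq\epsilon\delta_1\leq\epsilon\delta_1+(1+\epsilon)\sqrt{\frac{\epsilon}{\beta_1}}\,\delta_{opt}$.

Induction step: Suppose $j>1$. We assume that there is a path in $\mathcal{T}$ from the root to the ($j-1$)-th level, such that for each $1\leq l\leq j-1$, the level-$l$ node $v_l$ on the path is associated with a point $p_{v_l}$ satisfying the inequality $\|p_{v_l}-m_l\|\leq\epsilon\delta_l+(1+\epsilon)l\sqrt{\frac{\epsilon}{\beta_l}}\,\delta_{opt}$ (i.e., the induction hypothesis). Now we consider the case of $j$. Below we will show that there is one child of $v_{j-1}$, i.e., $v_j$, such that its associated point $p_{v_j}$ satisfies the inequality $\|p_{v_j}-m_j\|\leq\epsilon\delta_j+(1+\epsilon)j\sqrt{\frac{\epsilon}{\beta_j}}\,\delta_{opt}$. First, we have the following claim (see Section A.2 for the proof).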

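Claim 2 In the set of radius candidates in Algorithm 2, there exists one value $r_j$ such that

$$r_j\in\Big[j\sqrt{\frac{\epsilon}{\beta_j}}\,\delta_{opt},\ \Big(1+\frac{\epsilon}{2}\Big)j\sqrt{\frac{\epsilon}{\beta_j}}\,\delta_{opt}\Big]. \tag{12}$$

Now, we construct the $j-1$ peeling spheres, $\{B_{j,1},\cdots,B_{j,j-1}\}$, as in Algorithm 2. For each $1\leq l\leq j-1$, $B_{j,l}$ is centered at $p_{v_l}$ and with radius $r_j$. By Markov’s inequality and the induction hypothesis, we have the following claim (see Section A.3 for the proof).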

Claim 3 For each , .

Claim 3 shows that is bounded for , which helps us to find the approximate mean point of . Induced by the peeling spheres , is divided into subsets, , , and . For ease of discussion, let denote for , denote , and denote the mean point of for . Note that the peeling spheres may intersect with each other. For any two intersecting spheres and , we arbitrarily assign the points in to either or . Thus, we can assume that are pairwise disjoint.

Now consider the size of . We have the following two cases: (a) and (b) . We show how, in each case, Algorithm 2 can obtain an approximate mean point for by using the simplex lemma (i.e., Lemma 3).

For case (a), by Claim 3, together with the fact that for , we know that

 k∑l=1|Optl∖(j−1⋃w=1