k-Means for Streaming and Distributed Big Sparse Data

# k-Means for Streaming and Distributed Big Sparse Data

Artem Barger    Dan Feldman
artem@barger.net              dannyf.post@gmail.com
University of Haifa
###### Abstract

We provide the first streaming algorithm for computing a provable approximation to the -means of sparse Big Data. Here, sparse Big Datais a set of vectors in , where each vector has non-zeroes entries, and . E.g., adjacency matrix of a graph, web-links, social network, document-terms, or image-features matrices.

Our streaming algorithm stores at most input points in memory. If the stream is distributed among machines, the running time reduces by a factor of , while communicating a total of (sparse) input points between the machines.

Our contribution is a deterministic algorithm for computing a sparse -coreset, which is a weighted subset of input points that approximates the sum of squared distances from the input points to every centers, up to factor, for any given constant . This is the first such coreset of size independent of both and .

Existing algorithms use coresets of size at least polynomial in , or project the input points on a subspace which diminishes their sparsity, thus require memory and communication even for .

Experimental results on synthetic and real public dataset shows that our algorithm boost the performance of such given heuristics even in the off-line setting. Open access to our implementation is also provided [6].

## 1 Background

### Clustering.

For a given similarity measure, clustering aims at partitioning a given set of points into coherent subset. There is a lot of different clustering techniques, but most prominent and popular is Lloyd’s algorithm or the -means algorithm [27, 4]. The input to the classical Euclidean -means problem is a set of points in , and the goal is to compute a set of -centers (also points in ) that minimizes the sum of squared distances over each input point to its nearest center.

More generally, there are some constraints on the location of the centers. For example, in the discrete -mean the set of centers must be subset of the input points. This version is preferable for example, when each input point is sparse (i.e., have small number of non-zeroes coordinates), and we wish that the centers will also be sparse. In other applications, such as in computer vision, the input is a set of images and we wish to compute representative images (centers). Since it is not clear how to interpret the sum of a vector that represents an image of a cat with a vector that represents an image of a dog, it is natural to ask that the set of centers will be a subset of the input images. In facility location problems [17, 11], the input points represent the location of clients, and the centers are called facilities. In this case we might have constraints that some of the centers will be closer to some clients.

However, most of the existing clustering algorithms, especially the ones that support streaming or large distributed data sets, assume that the dimension is significantly smaller than the number of input points. Otherwise, techniques such as PCA/SVD [25, 15] or Johnson-Lindenstraus [21] are used for reducing the dimensionality of the input. However, such projections turn sparse data into dense data, and the -means of the projected points are no longer subset of the input or sparse [15]. In fact, few projected (dense) points in a high dimensional space might take more memory than the complete original dataset, e.g. in Wikipedia document-term matrix, where .

### Sparse Big Data.

We consider sparse Big Dataas a set of vectors in , where each vector has non-zeroes entries, and both the number of vectors and their dimensionality is asymptotically large. Hence, we cannot store all the vectors, or general linear combinations of vectors, in memory. Instead, we allow to scan these vectors in a streaming fashion, while using memory that is only logarithmic in or .

In particular, unlike previous related work, this paper handle the case (e.g. ), where most of those papers assume . Our algorithm also supports input such as in NoSQL or Matlab sparse matrices, where the row is given as a stream of pairs where means that the th entry of the row is , and the rest are zeroes by default.

Sometimes a cloud or a large set of machines are available for handling Big Data. In this case, the data is distributed among them and we wish not only to process the data in parallel, but also to minimize communication between the machines. Finally, GPUs (Graphical Processing Units) are used to boost the performance of the algorithm, but only if the algorithm uses a limited set of simple operations that are supported by the GPU.

All these models are supported by our algorithm, via the framework of sparse coresets that is explained below.

### Coresets

Given a set of points in , and an error parameter , a coreset in this paper is a small set of weighted points in , such that the sum of squared distances from the original set of points to any set of centers in can be approximated by the sum of weighted squared distances from the points in the coreset. The output of running an existing clustering algorithm on the coreset would then yield approximation to the output of running the same algorithm on the original data, by the definition of the coreset.

Coresets were first suggested in [2] as a way to improve the theoretical running time of existing algorithms. Moreover, a coreset is a natural tool for handling Big Datausing all the computation models that are mentioned in the previous section. This is mainly due to the merge-and-reduce tree approach that was suggested in [19, 7] and is formalized in [14]: coresets can be computed independently for subsets of input points, e.g. on different computers, and then be merged and re-compressed again. Such a binary compression tree can also be computed using one pass over a possibly unbounded streaming set of points, where in every moment only coresets exist in memory for the points streamed so far. Here the coreset is computed only on small chunks of points, so a possibly inefficient coreset construction still yields efficient coreset constructions for large sets; see Fig. 1. Note that the coreset guarantees are preserved while using this technique, while no assumptions are made on the order of the streaming input points.

In practice, this technique can be implemented easily using the map-reduce approach of modern software for Big Datasuch as Hadoop [1].

The fact that the coreset approximates every set of centers, allows us to use the above merge-and-reduce technique where the optimal solution keeps changing, and also to solve the -means problem under different constraints or a small set of candidate centers.

## 2 Related Work

We summarize existing coresets constructions for -mean queries, as will be formally defined in Section 4.

### Importance Sampling.

Following a decade of research, coreset of size polynomial in were suggested in [8]. An improved version of size was suggested in [22] which is a special case of the algorithms in [13]. The construction is based on computing an approximation to the -mean of the input points (with no constraints on the centers) and then sample points proportionally to their distance to these centers. Each chosen point is then assigned a weight that is inverse proportional to this distance. The probability of failure in this algorithms reduces exponentially with the input size. Coresets of size , i.e., linear in , were suggested in [12], however the weight of a point may be negative or a function of the given query. For the special case , such coresets of size were suggested in [20] using uniform sampling.

Projection based coresets. Coresets of size that are based on projections on low-dimensional subspaces that diminishes the sparsity of the input data were recently suggested in [9] by improving the analysis in [14]. Other type of weak coresets approximates only the optimal solution, and not every centers. Such randomized coresets of size independent of and only polynomial in were suggested in [13] and simplified in [12].

Deterministic Constructions. The first coresets for -means [19, 18] were based on partitioning the data into cells, and take a representative point from each cell to the coreset, as is done in hashing or Hough transform [5]. However, these coresets are of size at least , i.e., exponential in . Deterministic coreset constructions of size , i.e., independent of but exponential in , were suggested in [14] by recursively computing -means clustering of the input points. Uniform sampling is then used for replacing each mean of a cluster by a subset of its points. By Markov’s inequality, the probability of failure for these algorithms reduces only linearly with the input size. Therefore they cannot be used in the streaming or distributed setting, where we need to compute union of core-sets. Our technique improves this result by suggesting a fully deterministic coreset construction of size , i.e., polynomial in .

## 3 Our contribution

Our main technical result is a deterministic algorithm that computes a -coreset of size for a set of points and every constant error parameter ; see Theorem 1 and Corollary 1 for details and exact bounds. This is the first coreset which uses number of memory words that is independent of both and , and only polynomial in where each point of has non-zeroes entries. Using this coreset with the merge-and-reduce technique above, we achieve the following deterministic algorithms:

1. An algorithm that computes a approximation for the -means of a set that is distributed (partitioned) among machines, where each machine needs to send only input points to the main server at the end of its computation.

2. A streaming algorithm that, after one pass over the data and using space returns an -approximation to the -means of .

3. Description of how to use our algorithm to boost both the running time and quality of any existing -means heuristic using only the heuristic itself, even in the classic off-line setting.

4. Experimental results on real world data-sets and open-code that demonstrate how we applied our coreset to boost the performance of the popular Llooyd’s -means heuristic [27, 4].

### Running time.

The overall running time of our algorithm is for the set of points seen so far in a possibly unbounded stream, where each point has non-zeroes entries. Computing a -approximation to the -means on the coreset takes time where is the coreset size. Running time that is exponential in is unavoidable for any -approximation algorithm that solves -means, even in the planar case  [26]. Our main contributions thus a coreset construction that uses memory that is independent of and running time that is near-linear in . This is a new result even for the case . To handle large values of in our experiments, our algorithm calls popular heuristics instead of computing the optimal -means during the coreset construction and on the resulting coreset itself.

## 4 Notation and Main Result

The input to our problem is a set of points in , where each point includes a multiplicative weight . In addition, there is an additive weight for the set. Formally, a weighted set in is a tuple , where , . In particular, an unweighted set has a unit weight for each point, and a zero additive weight.

### k-Mean clustering.

For a given set of centers (points) in , the Euclidean distance from a point to its closest center in is denoted by . The sum of these weighted squared distances over the points of is denoted by

 cost(P,Q):=∑p∈P′u(p)⋅dist2(p,Q)+ρ,

If is an unweighted set, this cost is just the sum of squared distances over each point in to its closest center in .

Let denote the subset of points in whose closest center in is , for every . Ties are broken arbitrarily. This yields a partition of by . More generally, the partition of by is the set where , and for every and every .

A set that minimizes this weighted sum over every set of centers in is called the -mean of . The -means of is called the centroid, or the center of mass, since

 μ(P)=1∑p′∈P′u(p′)∑p∈P′u(p)⋅p.

We denote the cost of the -mean of by .

### Coreset.

Computing the -mean of a weighted set is the main motivation of this paper. To this end, we wish to compute another weighed set where is small set (core-set) that can be used to approximate for any set of points. In particular, a set that minimizes the weighted cost to must also approximately minimize the cost to , over such sets in .

Formally, let be an error parameter. The weighted set is a -coreset for , if for every set of centers we have

 (1−ε)cost(P,Q)≤cost(S,Q)≤(1+ε)cost(P,Q),

That is,

 (1−ε)∑p∈Pu(p)dist2(p,Q)+ρ≤∑p∈Sw(p)⋅dist2(p,Q)+ϕ≤(1+ε)∑p∈Pu(p)⋅dist2(p,Q)+ρ.

To handle streaming data we will need to compute “coresets for union of coresets”, which is the reason that we assume that both the input and its coreset are weighted sets.

### Sparse coresets.

Unlike previous work, we add the constraints that if each point in is sparse, i.e., has few non-zeroes coordinates, then the set will also be sparse. Formally, the maximum sparsity of is the maximum number of non zeroes entries over every point in .

In particular, if each point in is a linear combination of at most points in , then . In addition, we would like that the set will be of size independent of both and .

We can now state the main result of this paper.

###### Theorem 1 (Small sparse coreset).

For every weighted set in , and an integer , there is a -coreset of size where each point in is a linear combination of points from . In particular, the maximum sparsity of is .

By plugging this result to the traditional merge-and-reduce tree below, it is straight-forward to compute a coreset using one pass over a stream of points. Previous results computed such coresets using points but memory words (coordinates of points) which is inefficient when, say, . In contrast, the coreset below is computed in memory of size that is independent of . Similarly, when the input is distributed among few machines, previous results had to communicate coresets of size at least linear in between the machines, while communicating the coreset below will take number of bits that is independent of .

###### Corollary 1.

A coreset of size and maximum sparsity can be computed for the set of the points seen so far in an unbounded stream, using memory words. The insertion time per point in the stream is . If the stream is distributed uniformly to machines, then the amortized insertion time per point is reduced by a (multiplicative) factor of to . The coreset for the union of streams can then be computed by communicating the coresets to a main server.

## 5 Coreset Construction

Our main coreset construction algorithm gets a set as input, and returns a -coreset ; see Algorithm LABEL:algk.

To obtain running time that is linear in the input size, without loss of generality, we assume that has points, and that the cardinality of the output is . This is thanks to the traditional merge-and-reduce approach: given a stream of points, we apply the coreset construction only on subsets of size from during the streaming and reduce them by half. See Fig. 1 and e.g. [10, 14] for details.

### Algorithm overview.

In Line LABEL:one we compute the smallest integer such that the cost of the -means of is close to the cost of the -means of . In Line LABEL:two we compute the corresponding partition of by its -means . In Line LABEL:four a -sparse coreset of size is computed for every , . This can be done deterministically e.g. by taking the mean of as explained in Lemma 1 or by using a gradient descent algorithm, namely Frank-Wolfe, as explained in [16] which also preserves the sparsity of the coreset as desired by our main theorem. The output of the algorithm is the union of the means of all these coresets, with appropriate weight, which is the size of the coreset.

Line LABEL:one is the only line in the algorithm that takes time exponential in . As stated in Section 3 this is unavoidable for a -approximation, where is an arbitrarily small constant. Still, the memory that is required by our algorithm is polynomial in , and the running time is near-linear in . In Section 9 this line will be replaced by -means heuristic or -approximation to yield a practical algorithm for large values of .

\@float

algocf[!h]     \end@float

## 6 Proof of Correctness

In this section we prove that a call to indeed returns a small -coreset; see Algorithm LABEL:algk. We use the variable names as defined in the algorithm (e.g. , , ) and consider their corresponding values during the last line of the algorithm. We also identify the input weighted set by .

The first lemma states the common fact that the sum of squared distances of a set of point to a center is the sum of their squared distances to their center of mass, plus the squared distance to the center (the variance of the set).

###### Lemma 1.

For every

 cost(P,x)=cost(P,μ(P))+∥μ(P)−x∥2∑p∈Pu(p).
###### Proof.

We have

 cost(P,x)−ρ=∑p∈Pu(p)∥p−x∥2=∑p∈Pu(p)∥(p−μ(P))+(μ(Pi)−x)∥2=∑p∈Pu(p)∥p−μ(P)∥2+∑p∈Pu(p)∥μ(P)−x∥2+2(μ(p)−x)⋅∑p∈Pu(p)(p−μ(P)).

The last term equals zero since , and thus

 ∑p∈Pu(p)(p−μ(P))=∑p∈Pu(p)⋅p−∑p∈Pu(p)μ(P)=∑p∈Pu(p)⋅p−∑p∈Pu(p)⋅p=0.

Hence,

 cost(P,x)=ρ+∑p∈Pu(p)∥p−μ(P)∥2+∑p∈Pu(p)∥μ(P)−x∥2=cost(P,μ(P))+∥μ(P)−x∥2∑p∈Pu(p).

The second lemma shows that assigning all the points of to the closest center in yields a small multiplicative error if the -mean and the -mean of has roughly the same cost. If , this means that we can approximate using only one center in the query; see Line LABEL:one of Algorithm LABEL:algk. Note that for .

###### Lemma 2.

For every set of centers we have

 cost(P,Q)≤minq∈Qcost(P,{q})≤cost(P,Q)⋅1+2ε1−2ε+opt(P,1)−opt(P,k)(1−2ε)ε. (1)
###### Proof.

Let denote a center that minimizes over . The left inequality of (1) is then straight-forward since

 (2)

It is left to prove the right inequality of (1). Indeed, for every , let denote the closest point to in . Ties are broken arbitrarily. Hence,

 minq∈Qcost(P,{q})−cost(P,Q)=∑p∈P′u(p)∥p−q∗∥2−∑p∈P′u(p)∥p−qp∥2.

Let denote the partition of by , where are the closest points to for every ; see Section 4. For every , let . Hence,

 ∑p∈P′iu(p)∥p−q∗∥2−∑p∈P′iu(p)∥p−qp∥2=∑p∈P′iu(p)∥p−μ(Pi)∥2+∥μ(Pi)−q∗∥2∑p∈P′iu(p) (3) −⎛⎜⎝∑p∈P′iu(p)∥p−μ(Pi)∥2+∥∥μ(Pi)−qi∥∥2∑p∈P′iu(p)⎞⎟⎠ (4) =∑p∈P′iu(p)(∥∥q∗p−q∗∥∥2−∥∥q∗p−qp∥∥2), (5)

where in (3) and (4) we substituted and respectively in Lemma 1, and in (5) we use the fact that and for every . Summing (5) over yields

 ∑p∈P′u(p)∥p−q∗∥2−∑p∈P′u(p)∥p−qp∥2=∑p∈P′u(p)(∥∥q∗p−q∗∥∥2−∥∥q∗p−qp∥∥2) =∑p∈P′u(p)(∥∥(q∗p−μ(P))+(μ(P)−q∗)∥∥2−∥∥(q∗p−μ(P))+(μ(P)−qp)∥∥2) (6) (7) −2∑p∈P′u(p)(q∗p−μ(P))(q∗−qp). (8)

To bound (7), we substitute and then in Lemma 1, and obtain that for every

where the last inequality is by the definition of . This implies that for every ,

 ∥μ(P)−q∗∥2−∥μ(P)−qp∥2≤0.

Plugging the last inequality in (7) yields

 ∑p∈P′u(p)∥p−q∗∥2−∑p∈P′u(p)∥p−qp∥2≤−2∑p∈P′u(p)(q∗p−μ(P))(q∗−qp) (9) ≤2∑p∈P′u(p)∥∥q∗p−μ(P)∥∥∥q∗−qp∥=∑p∈P′u(p)⋅2⋅∥∥q∗p−μ(P)∥∥√ε⋅√ε∥q∗−qp∥ (10) (11) =1ε∑p∈P′u(p)∥∥q∗p−μ(P)∥∥2+ε∑p∈P′u(p)∥q∗−qp∥2, (12)

where (10) is by Cauchy-Schwartz inequality, and in (11) we use the fact that for every .

To bound the left term of (12) we use the fact and substitute , in Lemma 1 for every as follows.

 (13)

To bound the right term of (12) we use to obtain

Plugging (13) and the last inequality in (12) yields

 cost(P,{q∗})−cost(P,Q)=∑p∈P′u(p)∥p−q∗∥2−∑p∈P′u(p)∥p−qp∥2≤opt(P,1)−opt(P,k)ε+2⋅ε(cost(P,{q∗})+cost(P,Q)).

Rearranging,

 cost(P,{q∗})≤cost(P,Q)⋅1+2ε1−2ε+opt(P,1)−opt(P,k)(1−2ε)ε

Together with (2) this proves Lemma 2. ∎

###### Lemma 3.

Let be a -coreset for a weighted set in . Let be a finite set. Then

 (1−ε)minq∈Qcost(P,{q})≤minq∈Qcost(S,{q})≤(1+ε)minq∈Qcost(P,{q}) (14)
###### Proof.

Let be a center such that , and let be a center such that . The right side of (14) is bounded by

 minq∈Qcost(S,{q})=cost(S,{qS})≤cost(S,{qP})≤(1+ε)cost(P,{qP})=(1+ε)minq∈Qcost(P,{q}),

where the first inequality is by the optimality of , and the second inequality is since is a coreset for . Similarly, the left hand side of (14) is bounded by

 (1−ε)minq∈Qcost(P,{q})=(1−ε)cost(P,{qP})≤(1−ε)cost(P,{qS})≤(1−ε)(1+ε)cost(S,{qS})=(1−ε2)minq∈Qcost(S,{q})≤minq∈Qcost(S,{q}).

where the last inequality follows from the assumption . ∎

###### Lemma 4.

Let be the output of a call to . Then is a -coreset for .

###### Proof.

By replacing with in Lemma 1 for each it follows that

 cost(Pi,Q)≤minq∈Qcost(Pi,Q)≤cost(Pi,Q)⋅1+2ε1−2ε+opt(Pi,1)−opt(Pi,k)(1−2ε)ε.

Summing the last inequality over each yields

 cost(P,Q)≤m∑i=1minq∈Qcost(Pi,Q)≤cost(P,Q)⋅1+2ε1−2ε+1(1−2ε)εm∑i=1(opt(Pi,1)−opt(Pi,k)). (15)

Since is the partition of the -means of we have . By letting be the -means of we have

 m∑i=1opt(Pi,k)=m∑i=1cost(Pi,Qi)≥m∑i=1cost(Pi,∪mj=1Qj)=cost(P,∪mj=1Qj)≥opt(P,mk).

Hence,

 m∑i=1(opt(Pi,1)−opt(Pi,k))≤opt(P,m)−opt(P,mk)≤ε2opt(P,k)≤ε2cost(P,Q),

where the second inequality is by Line LABEL:one of the algorithm. Plugging the last inequality in (15) yields

 cost(P,Q)≤m∑i=1minq∈Qcost(Pi,Q)≤cost(P,Q)⋅1+3ε1−2ε. (16)

Using Lemma 3, for every

By summing over we obtain

 (1−ε)m∑i=1minq∈Qcost(Pi,{q})≤m∑i=1minq∈Qcost(Si,{q})≤(1+ε)m∑i=1minq∈Qcost(Pi,{q}).

By this and Lemma 1

 (1−ε)m∑i=1minq∈Qcost(Pi,{q})≤cost(S,Q)≤(1+ε)m∑i=1minq∈Qcost(Pi,{q}).

Plugging the last inequality in (16) yields

 (1−ε)cost(P,Q)≤(1−ε)m∑i=1minq∈Qcost(Pi,Q)≤cost(S,Q)≤(1+ε)m∑i=1minq∈Qcost(Pi,{q})≤(1+ε)cost(P,Q)⋅1+3ε1−2ε≤(1+15ε)cost(P,Q). (17)

Hence, is a coreset for . ∎

###### Lemma 5.

There is an integer such that

 opt(P,kt)−opt(P,kt+1)≤ε2⋅opt(P,k). (18)
###### Proof.

Contradictively assume that (18) does not hold for every integer . Hence,

 opt(P,k)−opt(P,k⌈1/ε2⌉+1)=⌈1/ε2⌉∑i=1(opt(P,ki)−opt(P,ki+1))>⌈1/ε2⌉⋅ε2opt(P,k)≥opt(P,k).

Using the mean of in Line LABEL:four of the algorithm yields a -coreset as shown in Lemma 1. The resulting coreset is not sparse, but gives the following result.

###### Theorem 2.

There is such that the -means of is a -coreset for .

###### Proof of Theorem 1:.

We compute a mean coreset for -mean of at line LABEL:four of Algorithm LABEL:algk by using Frank-Wolfe [16] algorithm. It follows that for each , therefore the overall sparsity of is . This and Lemma 4 concludes the proof. ∎

## 7 Why does it work?

In this section we try to give an intuition of why our coreset construction yields a smaller error for the same number of samples, compared to existing coreset constructions. Roughly, this is mainly due to the “cost of independent sampling” that is used by the existing smallest coreset constructions, namely the sensitivity/importance sampling approach [8, 22, 12, 13].

In Fig. 2 the input is a set of 16 points on the plane that is distributed over clusters that are relatively far from each other. Each cluster consists of a single point, except for one cluster that has points. Given a “budget” (coreset size) of points, the “optimal coreset” seems to have all the isolated input points, including input points inside the large cluster, that are well distributed in this cluster. What would be the expected coresets of size using the existing techniques?

Algorithm -Mean-Coreset. would return exactly this “optimal coreset”, as the -means of consists of the isolated clusters and the means of the large cluster. For the case , the algorithm will pick exactly one representative in each cluster, as it is the -means of . See Fig. 2(left).

Uniform Sampling. If the large cluster is sufficiently large, all the points in a uniform sample will be from this cluster, while all the other (singleton) clusters will be missed. Since these clusters are far away from each other, the approximation error will then be very large. Even for large sample size, uniform sample misses isolated clusters that are crucial for obtaining a small error. See Fig. 2(right).

Non-Uniform Sampling. The optimal distribution that will make sure that a representative from each cluster will be selected to the coreset, is to sample a point from each of the clusters with roughly the same probability. However, due to the independent (i.i.d.) sampling approach, the number of samples that are needed in order to have a representative from each cluster is more than . In general, the expected sample size is . This phenomena is known as the coupon collector problem: if there is a coupon in each box at the supermarket, picked uniformly at random from a collection of distinct coupons, then one need to buy boxes in expectation to have the collection of all the coupons. This is compared to the deterministic construction of Algorithm 1 that always pick the desired representatives.

Even after having a representative from each of the isolated clusters a non-uniform sampling will keep sample a point from one of these clusters with probability . This means that from the total “budget” of points in the coreset, a large fraction will be used to sample the same point again and again. This is also why in Fig. 2(middle) there is less number of red points than the other constructions.

## 8 Practical and Simple Boosting of Existing Heuristics

To get the desired phenomena that is described in Section 7 there is no need to actually compute the -means for many values of and existing heuristics can be used. For example, any reasonable -means heuristics for would yield a set of red points as in Fig. 2(left).

The chicken and the egg phenomena. As in Algorithm 1, coresets for solving optimization problems usually need to solve the optimization problem in order to decide which points are important to sample. This problem is solved in theory using rough approximation algorithms or bi-criteria approximations [28, 12] that replaces the optimal solution, or using the merge-and-reduce tree that apply the coreset constuction only on small sets. In practice, algorithms that compute provable -approximations or even -approximations for the -means clustering are rarely used. Instead, heuristics such as the classic LLoyd’s -means or -Means++ [27, 4] are used.

Based on our experimental results, a rough approximation using existing heuristics seems to be suffices. In addition, plugging as an input, almost always produces coreset with error that is much smaller than . This is common also in other coresets and related to the facts that the analysis is (i) for the worst case input set and not a specific that is usually well structured, (ii) sub-optimal compared to the actual error, (iii) consider every set of centers, while we usually care about the optimal solution under some constraints.

We suggest to take the coreset size as the input and run iterations of our algorithm. In fact, our experimental results suggest the following simple approach that use a single instead of runs and yields only slightly less better results.

Boosting technique. Given a heuristic for solving the -means problem using some iterations, Algorithm 1 suggests to run the heuristic only small number of iterations. Then, we take the mean of each of the clusters, weighted by the size of the cluster, or construct a -coresets on each of the clusters. Then, we run the heuristic times on this “coreset” of size . Even if each iteration of the heuristics takes linear time of , the running time is reduced from to .

Example 1: LLyod’-means. In the case of Lloyd’s -means, each iteration takes time for computing the distances from the existing seed of centers, and then time is needed to compute the next set of centers. The boosting technique above suggests to run such iterations on to produce a weighted coreset of size . Then run the iterations on this coreset.

Example 2: KMean++. The KMean++ algorithm picks another point to the output set in each iteration, where the first point is a random seed. The next point is sampled with probability that is proportional to the distances of the input points to the center (points) that were already picked. This is very similar to the importance sampling that is used for constructing existing coresets with a crucial difference: the sampling is not independent and same for each new point, but adaptive, i.e., based on points that were already picked. This is exactly the advantage of our approach compared to the non-uniform sampling, as described in Fig. 2 and Section 7. Note that KMean++ will always select the right centers in Fig. 2, no matter what is the seed and although it is a random algorithm.

The KMean++ algorithm is very natural for using with our boosting technique: We just run it for iterations to get a coreset of size . Then we run KMean++ times on the coreset. In each of the th times we use a different seed (first point) and take the optimal among the sets of -mean candidates. Line 3 of Algoirthm 1 suggests an interesting way to choose the size of the coreset, based on our analysis in the supplementary material.

## 9 Experimental Results

Datasets. To produce experimental results we have use two well known datasets.

MNIST handwritten digits[24]. The MNIST dataset consist of grayscale images of handwritten digits. Each image of size 28x28 pixels was transformed to the the vector row of dimensions.

Pendigits[3]. This is a dataset from the UCI repository. The dataset created out of 250 samples provided by 44 writers. These writers were asked to write 250 digits in random order inside boxes of 500 by 500 tablet pixel resolution. The tablet sends and tablet coordinates and pressure level values of the pen at fixed time intervals (sampling rate) of 100 miliseconds. Digits are represented as constant length feature vectors of size the number of digits in the dataset is .

NIPS dataset[23]. The OCR data from the collection which represents 13 years of NIPS proceedings. The overall of 15,000 pages and 1958 articles. For each author there is a words count vector extracted, where ith entry in the vector represents count of the particular word which was used in one of the conference submissions by given author. There are overall authors and words corpus size is .

Expirement. We used our algorithm to boost the performance of Lloyd’s -means heuristic as explained in Section 8. Give a coreset size we run this heuristic for only iterations with centers. We compared our algorithm with uniform and importance sampling algorithms using both offline computation setting and streaming data model. For offline computation we used datasets above to produce coresets of sizes , then computed -means with values of using Lloyd’s heuristic for many iterations till convergence. While to simulate streaming data model we divided datasets into chunks and computed coresets of sizes using map-and-reduce techniques to construct a coreset tree, later repeated computation of -means for same values of .

For each set of centers that was produced, we computed sum of squared distances to the original (full) set of points, and denoted these “approximated solutions” by and for uniform, non uniform sampling and our algorithm respectively. The “ground truth” or “optimal solution” was computed using -means on entire dataset until convergence. The empirical estimated error is then defined to be for coreset number .

Results for datasets Fig.5 and Fig.6 shows results for offline setting and streaming models respectevly. The results of out algorithm are outperforms the uniform sampling and non-uniform sampling algorithms. Important to note, that our algorithm starts with very small error value compared to others and improves error value gradually with sample size, while two others starts with greater error values and succeeds to converge to significantly smaller values only for large sample subsets.

Fig.3 and Fig.4, shows the boxplot of error distribution for all three algorithms in offline and streaming settings. It’s easy to see that that our algorithm show little variance across all experiments and mean error value is very close to the median, indicating that our algorithm produces very stable results, while running on streaming data whiskers are broader due to the multiplicative factor of .

In Fig.7 we present the memory (RAM) footprint during the coreset construction based on synthetically generated random data. These results are common to other coresets papers. The oscillations corresponds to the number of coresets in the tree that each new chunk needs to update. For example, the first point in a streaming tree is updated in , however the th point for some climbs up through levels in the tree, so coresets need to be merged.

## References

• [2] P. K. Agarwal, S. Har-Peled, and K. R. Varadarajan. Approximating extent measures of points. Journal of the ACM, 51(4):606–635, 2004.
• [3] F. Alimoglu, D. Doc, E. Alpaydin, and Y. Denizhan. Combining multiple classifiers for pen-based handwritten digit recognition. 1996.
• [4] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.
• [5] D. H. Ballard. Generalizing the hough transform to detect arbitrary shapes. Pattern recognition, 13(2):111–122, 1981.
• [6] A. Barger and D. Feldman. Kmeans coreset code, http://sites.hevra.haifa.ac.il/bdrl/software/, 2015.
• [7] J. L. Bentley and J. B. Saxe. Decomposable searching problems i: Static-to-dynamic transformation. J. Algorithms, 1(4):301–358, 1980.
• [8] K. Chen. On -median clustering in high dimensions. In Proc. 17th Annu. ACM-SIAM Symp. on Discrete Algorithms (SODA), pages 1177–1185, 2006.
• [9] M. Cohen, S. Elder, C. Musco, C. Musco, and M. Persu. Dimensionality reduction for k-means clustering and low rank approximation. In (STOC), 2015.
• [10] D. Feldman, M. Faulkner, and A. Krause. Scalable training of mixture models via coresets. In Advances in Neural Information Processing Systems, pages 2142–2150, 2011.
• [11] D. Feldman, A. Fiat, and M. Sharir. Coresets forweighted facilities and their applications. In Foundations of Computer Science, 2006. FOCS’06. 47th Annual IEEE Symposium on, pages 315–324. IEEE, 2006.
• [12] D. Feldman and M. Langberg. A unified framework for approximating and clustering data. In STOC, pages 569–578, 2011. See http://arxiv.org/abs/1106.1379 for fuller version.
• [13] D. Feldman, M. Monemizadeh, and C. Sohler. A PTAS for k-means clustering based on weak coresets. In SoCG, 2007.
• [14] D. Feldman, M. Schmidt, and C. Sohler. Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. In SODA, 2013.
• [15] D. Feldman, M. Schmidt, and C. Sohler. Turning big data into tiny data: Constant-size coresets for k-means, pca and projective clustering. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1434–1453. SIAM, 2013.
• [16] D. Feldman, M. V. Volkov, and D. Rus. Dimensionality reduction of massive sparse datasets using coresets. CoRR, abs/1503.01663, 2015.
• [17] H. W. Hamacher and Z. Drezner. Facility location: applications and theory. Springer Science & Business Media, 2002.
• [18] S. Har-Peled and A. Kushal. Smaller coresets for -median and -means clustering. Discrete Comput. Geom., 37(1):3–19, 2007.
• [19] S. Har-Peled and S. Mazumdar. On coresets for -means and -median clustering. In STOC, 2004.
• [20] Inaba, Katoh, and Imai. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering. In SoCG, pages 332–339, 1994.
• [21] W. B. Johnson and J. Lindenstrauss. Extensions of lipschitz mappings into a hilbert space. Contemporary mathematics, 26(189-206):1, 1984.
• [22] M. Langberg and L. J. Schulman. Universal approximators for integrals. SODA, 2010.
• [23] Y. LeCun. Nips online web site. http://nips.djvuzone.org, 2001.
• [24] Y. LeCun and C. Cortes. The mnist database of handwritten digits, 1998.
• [25] J. A. Lee and M. Verleysen. Nonlinear dimensionality reduction. Springer Science & Business Media, 2007.
• [26] M. Mahajan, P. Nimbhorkar, and K. Varadarajan. The planar k-means problem is np-hard. In WALCOM, pages 274–285. Springer, 2009.
• [27] R. Ostrovsky, Y. Rabani, L. J. Schulman, and C. Swamy. The effectiveness of lloyd-type methods for the k-means problem. In FOCS’06, pages 165–176. IEEE, 2006.
• [28] K. Varadarajan and X. Xiao. Greedy minimization of weakly supermodular set functions.
You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters