Turning Big data into tiny data:
Constant-size coresets for
-means, PCA and projective clustering
We develop and analyze a method to reduce the size of a very large set of data points in a high dimensional Euclidean space to a small set of weighted points such that the result of a predetermined data analysis task on the reduced set is approximately the same as that for the original point set. For example, computing the first principal components of the reduced set will return approximately the first principal components of the original set or computing the centers of a -means clustering on the reduced set will return an approximation for the original set. Such a reduced set is also known as a coreset. The main new feature of our construction is that the cardinality of the reduced set is independent of the dimension of the input space and that the sets are mergable. The latter property means that the union of two reduced sets is a reduced set for the union of the two original sets (this property has recently also been called composable, see [IMMM14]). It allows us to turn our methods into streaming or distributed algorithms using standard approaches. For problems such as -means and subspace approximation the coreset sizes are also independent of the number of input points.
Our method is based on projecting the points on a low dimensional subspace and reducing the cardinality of the points inside this subspace using known methods. The proposed approach works for a wide range of data analysis techniques including -means clustering, principal component analysis and subspace clustering.
The main conceptual contribution is a new coreset definition that allows to charge costs that appear for every solution to an additive constant.
In many areas of science, progress is closely related to the capability to analyze massive amounts of data. Examples include particle physics where according to the webpage dedicated to the Large Hadron collider beauty experiment [lhc] at CERN, after a first filtering phase 35 GByte of data per second need to be processed “to explore what happened after the Big Bang that allowed matter to survive and build the Universe we inhabit today” [lhc]. The IceCube neutrino observatory “searches for neutrinos from the most violent astrophysical sources: events like exploding stars, gamma ray bursts, and cataclysmic phenomena involving black holes and neutron stars.” [Ice]. According to the webpages [Ice], the datasets obtained are of a projected size of about 10 Teta-Bytes per year. Also, in many other areas the data sets are growing in size because they are increasingly being gathered by ubiquitous information-sensing mobile devices, aerial sensory technologies (remote sensing), genome sequencing, cameras, microphones, radio-frequency identification chips, finance (such as stocks) logs, internet search, and wireless sensor networks [Hel, SH09].
The world’s technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s [HL11]; as of 2012, every day 2.5 quintillion bytes() of data were created [IBM]. Data sets as the ones described above and the challenges involved when analyzing them is often subsumed in the term Big Data. Big Data is also sometimes described by the “3Vs” model [Bey]: increasing volume (number of observations or records), its velocity (update time per new observation) and its variety (dimension of data, features, or range of sources).
In order to analyze data that for example results from the experiments above, one needs to employ automated data analysis methods that can identify important patterns and substructures in the data, find the most influential features, or reduce size and dimensionality of the data. Classical methods to analyze and/or summarize data sets include clustering, i.e., the partitioning of data into subsets of similar characteristics, and principal component analysis which allows to consider the dimensions of a data set that have the highest variance. Examples for such methods include -means clustering, principal component analysis (PCA), and subspace clustering.
The main problem with many existing approaches is that they are often efficient for large number of input records, but are not efficient enough to deal with Big Data where also the dimension is asymptotically large. One needs special algorithms that can easily handle massive streams of possibly high dimensional measurements and that can be easily parallelized and/or applied in a distributed setting.
In this paper, we address the problem of analyzing Big Data by developing and analyzing a new method to reduce the data size while approximately keeping its main characteristics in such a way that any approximation algorithm run on the reduced set will return an approximate solution for the original set. This reduced data representation (semantic compression) is sometimes called a coreset. Our method reduces any number of items to a number of items that only depends on some problem parameters (like the number of clusters) and the quality of the approximation, but not on the number of input items or the dimension of the input space.
Furthermore, we can always take the union of two data sets that were reduced in this way and the union provides an approximation for the two original data sets. The latter property is very useful in a distributed or streaming setting and allows for very simple algorithms using standard techniques. For example, to process a very large data set on a cloud, a distributed system or parallel computer, we can simply assign a part of the data set to each processor, compute the reduced representation, collect it somewhere and do the analysis on the union of the reduced sets. This merge-and-reduce method is strongly related to MapReduce and its popular implementations (e.g. Hadoop [Whi12]). If there is a stream of data or if the data is stored on a secondary storage device, we can read chunks that fit into the main memory of each individual computer and then reduce the data in this chunk. In the end, we apply our data analysis tools on the union of the reduced sets via small communication of only the coresets between the computers.
Our main result is a dimensionality reduction algorithm for points in high -dimensional space to points in dimensional space, such that the sum of squared distances to every object that is contained in a -dimensional subspace is approximated up to a -factor. This result is applicable to a wide range of problems like PCA, -means clustering and projective clustering. For the case of PCA, i.e., subspace approximation, we even get a coreset of cardinality (here is just the dimension of the subspace). The cardinality of this coreset is constant in the sense that it is independent of the input size: both its original cardinality and dimension . A coreset of such a constant cardinality is also obtained for -means queries, i.e., approximating the sum of squared distances over each input point to its closest center in the query.
For other objectives, we combine our reduction with existing coreset constructions to obtain very small coresets. A construction that computes coresets of cardinality will result in a construction that computes coresets of cardinality , i.e., independent of . This scheme works as long as there is such a coreset construction, e.g., it works for -means or -line-means. For the projective clustering problem (more precisely, the affine -subspace -clustering problem that we define below), such a coreset construction does not and can not exist. We circumvent this problem by requiring that the points are on an integer grid (the resulting size of the coreset will depend polylogarithmically on the size of the grid and n).
A more detailed (technical) description of our results is given in Section 2.4 after a detailed discussion about the studied problems and concepts.
We remark that in the conference version of this paper, some of the coreset sizes resulting from applying our new technique were incorrect. We have updated the results in this paper (see Section 2.4 for an overview). In particular, the coreset size for projective clustering is not independent of .
The main results of this work have been published in [FSS13]. However, the version at hand is significantly different. We carefully derive the concrete application to several optimization problems, develop explicit streaming algorithms, explain and re-prove some related results that we need, correct errors from the conference version (see above), provide pseudo code for most methods and add a lot of explanations compared to the conference version. The PhD thesis [Sch14] also contains a write-up of the main results, but without the streaming algorithms for subspace approximation and projective clustering.
In this section we formally define our notation and the problems that we study.
The set of all real-valued matrices is denoted by . Our input data set is a set of points in . We will represent it by a matrix , whose rows are the input points. The entry in the th row and th column of is denoted by . We use to denote the -th row of a and to denote its -th column. We use to denote the identity matrix, or just if the dimension is clear from the context. We say that a matrix has orthonormal columns if its columns are orthogonal unit vectors. Notice that every such matrix satisfies . If is also a square matrix, it is called an orthogonal matrix.
Any matrix has a singular value decomposition (SVD), which is a factorization , where is an orthogonal matrix, is a orthogonal matrix and is an rectangular diagonal matrix whose diagonal entries are non-negative and non-increasing. We use to denote the diagonal elements of . The columns of are called the left singular vectors of . Similarly, the columns of are called the right singular vectors of . Note that the right singular vectors are the eigenvectors of in order of non-increasing corresponding eigenvalues.
The number of non-zeroes entries in is the rank of , which is bounded by , that is . This motivates the thin SVD , where , and denote the first columns of , first columns/rows of and first columns of , respectively. Notice that the matrix product is still equal to (it is still a factorization). If we keep less than entries of , then we get an approximation of with respect to the squared Frobenius norm. We will use this approximation frequently in this paper.
Let and be its SVD. Let be an integer and define to be the diagonal matrix whose first diagonal entries are the same as that of and whose remaining entries are . Then the -rank approximation of is defined as .
The columns of a matrix span a linear subspace if the set of all linear combinations of columns of equals . In this case we also say that spans : A -dimensional linear subspace can be represented by a matrix with orthonormal columns that span . The projection of a point set (matrix) on a linear subspace represented by is the point set (matrix) . The projection of the point set (matrix) on using the coordinates of is the set of rows of .
Given and a set we denote by the translation of by . Similarly, is the translation of each row of by . An affine subspace is a translation of a linear subspace and as such can be written as , where is the translation vector and is a linear subspace.
Distances and norms
The origin of is denote by , where the dimension follows from the context. For a vector we write to denote its -norm, the square root of the sum of its squared entries. More generally, for an matrix we write to denote its spectral norm and
to denote its Frobenius norm. It is known that the Frobenius norm does not change under orthogonal transformations, i.e., for an matrix and an orthogonal matrix . This also implies the following observation that we will use frequently in the paper.
Let be an matrix and be a matrix with orthonormal columns. Then
Proof: Let be a orthogonal matrix whose first columns agree with . Then we have .
[Matrix form of the Pythagorean Theorem] Let be a matrix with orthonormal columns and be a matrix with orthonormal columns that spans the orthogonal complement of . Furthermore, let be any matrix. Then we have
Proof: Let be the matrix whose first columns equal and the second columns equal . Observe that is an orthogonal matrix. Since the Frobenius norm does not change under multiplication with orthogonal matrices, we get
The result follows by observing that .
For a set and a vector in , we denote the Euclidean distance between and (its closest point in) by if is non-empty (notice that the infimum always exists because the distance is lower bounded by zero, but the minimum might not exist, e. g., when is open), or otherwise. In this paper we will mostly deal with the squared Euclidean distance, which we denote by . For an matrix , we will slightly abuse notation and write to denote the distance to . The sum of the squared distances of the rows of to by If the rows of the matrix are weighted by non-negative weights then we sometimes use Let be a -dimensional linear subspace represented by a matrix with orthonormal columns that spans . Then the orthogonal complement of can also be represented by a matrix with orthonormal columns which spans . We usually name it . The distance from a point (column vector) to is the norm of its projection on , . The sum of squared distances from the rows of a matrix to is thus .
Range spaces and VC-dimension
In the following we will introduce the definitions related to range spaces and VC-dimension that are used in this paper.
A range space is a pair where is a set, called ground set and is a family (set) of subsets of , called ranges.
Definition 5 (VC-dimension)
The VC-dimension of a range space is the size of the largest subset such that
In the context of range spaces we will use the following type of approximation.
Definition 6 ([Hs11])
Let , and be a range space with finite . An -approximation of is a set such that for all we have
[LLS01] Let with finite be a range space with VC-dimension , and . There is a universal constant such that a sample of
elements drawn independently and uniformly at random from is a -approximation for with probability at least , where denotes the -dimension of .
2.1 Data Analysis Methods
In this section we briefly describe the data analysis methods for which we will develop coresets in this paper. The first two subsection define and explain two fundamental data analysis methods: -means clustering and principal component analysis. Then we discuss the other techniques considered in this paper, which can be viewed as generalizations of these problems. We always start to describe the motivation of a method, then give the technical problem definition and in the end discuss the state of the art.
The goal of clustering is to partition a given set of data items into subsets such that items in the same subset are similar and items in different subsets are dissimilar. Each of the computed subsets can be viewed as a class of items and, if done properly, the classes have some semantic interpretation. Thus, clustering is an unsupervised learning problem. In the context of Big Data, another important aspect is that many clustering formulations are based on the concept of a cluster center, which can be viewed as some form of representative of the cluster. When we replace each cluster by its representative, we obtain a concise description of the original data. This description is much smaller than the original data and can be analyzed much easier (and possibly by hand). There are many different clustering formulations, each with its own advantages and drawbacks and we focus on some of the most widely used ones. Given the centers, we can typically compute the corresponding partition by assigning each data item to its closest center. Since in the context of Big Data storing such a partition may already be difficult, we will focus on computing the centers in the problem definitions below.
Maybe the most widely used clustering method is -means clustering. Here the goal is to minimize the sum of squared error to cluster centers.
Definition 8 (The -means clustering problem (sum of squared error clustering))
Given , compute a set of centers (points) in such that its sum of squared distance to the rows of , is minimized.
The -means problem is studied since the fifties. It is NP-hard, even for two centers [ADHP09] or in the plane [MNV09]. When either the number of clusters is a constant (see, for example, [FMS07, KSS10, FL11]) or the dimension is constant [FRS16, CKM16], it is possible to compute a -approximation for fixed in polynomial time. In the general case, the -means problem is APX-hard and cannot be approximated better than 1.0013 [ACKS15, LSW17] in polynomial time. On the positive side, the best known approximation guarantee has recently been improved to 6.357 [ANSW16].
2.1.1 Principal Component Analysis
Let be an matrix whose rows are considered as data points. Let us assume that has mean , i.e., the rows sum up to the origin of . Given such a matrix , the goal of principal component analysis is to transform its representation in such a way that the individual dimensions are linearly uncorrelated and are ordered by their importance. In order to do so, one considers the co-variance matrix and computes its eigenvectors. They are sorted by their corresponding eigenvalues and normalized to form an orthonormal basis of the input space. The eigenvectors can be computed using the singular value decomposition. They are simply the right singular vectors of (sorted by their corresponding singular values).
The eigenvectors corresponding to the largest eigenvalues point into the direction(s) of highest variance. These are the most important directions. Ordering all eigenvectors according to their eigenvalues means that one gets a basis for which is ordered by importance. Consequently, one typical application of PCA is to identify the most important dimensions. This is particularly interesting in the context of high dimensional data, since maintaining a complete basis of the input space requires space. Using PCA, we can keep the most important directions. We are interested in computing an approximation of these directions.
An almost equivalent geometric formulation of this problem, which is used in this paper, is to find a linear subspace of dimension such that the variance of the projection of the points on this subspace is maximized; this is the space spanned by the first right singular vectors (note that the difference between the two problem definitions is that knowing the subspace does not imply that we know the singular vectors. The subspace may be given by any basis.) By the Pythagorean Theorem, the problem of finding this subspace is equivalent to the problem of minimizing the sum of squared distances to a subspace, i.e., find a subspace such that is minimized over all -dimensional subspaces of . We remark that in this problem formulation we are not assuming that the data is normalized, i.e., that the mean of the rows of is .
Definition 9 (linear -subspace problem)
Let and be an integer. The -subspace problem is to compute a -dimensional subspace of that minimizes .
We may also formulate the above problem as finding a matrix with orthonormal columns that minimizes over every such possible matrix . Such a matrix spans the orthogonal complement of . For the subspace that minimizes the squared Frobenius norm we have
If we would like to do a PCA on unnormalized data, the problem is better captured by the affine -subspace problem.
A coreset for -subspace queries, i.e., that approximates the sum of squared distances to a given -dimensional subspace was suggested by Ghashami, Liberty, Phillips, and Woodruff [Lib13, GLPW16], following the conference version of our paper. This coreset is composable and has cardinality of . It also has the advantage of supporting streaming input without the merge-and-reduce tree as defined in Section 1 and the additional factors it introduces. However, it is not clear how to generalize the result for affine -subspaces [JF] as defined below.
Definition 10 (affine -subspace problem)
Let and be an integer. The affine -subspace problem is to compute a -dimensional affine subspace of that minimizes .
The singular value decomposition was developed by different mathematicians in the 19th century (see [Ste93] for a historic overview). Numerically stable algorithms to compute it were developed in the sixties [GK65, GR70]. Nowadays, new challenges include very fast computations of the SVD, in particular in the streaming model (see Section 2.3). Note that the projection of to the optimal subspace (in which we are interested in the case of PCA), is called low-rank approximation in the literature since the input matrix is replaced by a matrix (representing ) that has lower rank, namely rank . Newer -approximation algorithms for low-rank approximation and subspace approximation are based on randomization and significantly reduce the running time compared to computing the SVD [CW09, CW13, DV06, DR10, DTV11, FMSW10, NDT09, Sar06, SV12]. More information on the huge body of work on this topic can be found in the surveys by Halko, Martinsson and Tropp [HMT11] and Mahoney [Mah11].
It is known [DFK04] that computing the -means on the low -rank of the input data (its first largest singular vectors), yields a -approximation for the -means of the input. Our result generalizes this claim by replacing with and with , as well as approximating the distances to any centers that are contained in a -subspace.
The coresets in this paper are not subset of the input. Following papers aimed to add this property, e.g. since it preserves the sparsity of the input, easy to interpret, and more numerically stable. However, their size is larger and the algorithms are more involved. The first coreset for the -subspace problem (as defined in this paper) of size that is independent of both and , but are also subsets of the input points, was suggested in [FVR15, FVR16]. The coreset size is larger but still polynomial in . A coreset of size that is a bit weaker (preserves the spectral norm instead of the Frobenius norm) but still satisfies our coreset definition was suggested by Cohen, Nelson, and Woodruff in [CNW16]. This coreset is a generalization of the breakthrough result by Batson, Spielman, and Srivastava [BSS12] that suggested such a coreset for . Their motivation was graph sparsification, where each point is a binary vector of 2 non-zeroes that represents an edge in the graph. An open problem is to reduce the running time and understand the intuition behind this result.
2.1.2 Subspace Clustering
A generalization of both the -means and the linear -subspace problem is linear -subspace -clustering. Here the idea is to replace the cluster centers in the -means definition by linear subspaces and then to minimize the squared Euclidean distance to the nearest subspace. The idea behind this problem formulation is that the important information of the input points/vectors lies in their direction rather than their length, i.e., vectors pointing in the same direction correspond to the same type of information (topics) and low dimensional subspaces can be viewed as combinations of topics describe by basis vectors of the subspace. For example, if we want to cluster webpages by their TFIDF (term frequency inverse document frequency) vectors that contain for each word its frequency inside a given webpage divided by its frequency over all webpages, then a subspace might be spanned by one basis vector for each of the words “computer”,“laptop”, “server”, and “notebook”, so that the subspace spanned by these vectors contains all webpages that discuss different types of computers.
A different view of subspace clustering is that it is a combination of clustering and PCA: The subspaces provide for each cluster the most important dimensions, since for one fixed cluster the subspace that minimizes the sum of squared distances is the space spanned by the right singular vectors of the restricted matrix. First provable PCA approximation of Wikipedia were obtained using coresets in [FVR16].
Definition 11 (Linear/Affine -subspace -clustering)
Let . The linear/affine -subspace -clustering problem is to find a set that is the union of linear/affine -dimensional subspaces, such that the sum of squared distances to the rows of , is minimized over every such set .
Notice that -means clustering is affine subspace clustering for and the linear/affine -subspace problem is linear/affine -subspace -clustering. An example of a linear -subspace -clustering is visualized in Figure 1.
Affine -subspace -clustering is NP-hard to approximate, even for and . This is due to a result by Megiddo and Tamir [MT82], who show that it is NP-complete to decide whether a set of points in can be covered by affine subspaces of dimension one. Any multiplicative approximation would have to decide whether it is possible to find a solution of zero cost. Feldman, Fiat and Sharir [FFS06] give a -approximation algorithm for the affine -subspace -clustering problem (which is called -line mean problem in their paper) for constant and .
Deshpande, Rademacher, Vempala and Wang propose a polynomial time -approximation algorithm for the -subspace -clustering problem [DRVW06] when and are constant. Newer algorithms with faster running time are based on the sensitivity sampling framework by Feldman and Langberg [FL11]. We discuss [FL11] and the results by Varadarajan and Xiao [VX12a] in detail in Section 8.
-Clustering under -Distance
In order to keep our notation concise we will summarize the above problems in the larger class of clustering problems under distance, which is defined as follows. Let be a family of subsets of . The set can be thought of as a set of candidate solutions and in this paper we will typically think of each as a union of centers, i.e., in -means clustering is a set of points, in -subspace -clustering, is the union of subspaces, etc.
Definition 12 (-Clustering Problem under -distance)
Given a matrix , and a set of sets in , the -clustering problem under -distance is to compute a set that minimizes .
It is easy to see that the previously mentioned problems are special cases of the -Clustering problem under -distance for different choices of . For example, when we choose to be the family of all -dimensional subspaces of we obtain the -subspace problem or when is the family of sets of centers we obtain the -means problem.
2.2 Coresets and Dimensionality Reductions
A coreset for an optimization problem is a (possibly weighted) point set that approximates the cost (sum of squared distances) of every feasible solution to the problem up to a small factor. In the case of a clustering problem as defined above, the set of feasible solutions is simply the set . There are a number different definitions for coresets for clustering problems that have different properties. A commonly used definition for the -means problem goes back to the work of Har-Peled and Mazumdar [HPM04]: a coreset is a weighted set of points that approximates the sum of squared distances of the original input set to every candidate solution, up to a factor of . In this paper we introduce a new definition for a coreset that generalizes the definition of Har-Peled and Mazumdar [HPM04]. The main difference is that we allow to have an additive constant that may be added to the coreset cost. The main idea behind this definition is that in high dimensional spaces that we can partition the input data into a ”pseudorandom” part, i.e., noise, and a structured part. The pseudorandom part can then be removed from the data and will just result in an additive constant and the true information is maintain in the structured part. The value of the additive constant may depend on the input point set , the value of and the family , but it must not depend on the particular choice of , i.e., for all the value of will be identical. Also note that this does not introduce an additive error, i.e., the desired value is approximated up to a multiplicative factor of . Below is the definition of a coreset as it is used in this paper.
Definition 13 (coreset for -clustering under -distance)
Let be a family of non-empty sets in . Let , be an integer, and . A tuple of a matrix with a vector of non-negativ weights associated with its rows and a value is an -coreset for the -clustering problem under -distance for , if for every we have
In the first place it seems to be surprising that the addition of helps us to construct smaller coresets. The intuition is that if the input is in high dimensional space and the shape is contained in a low-dimensional space, we can split the contribution of any point to the -distance into a part that corresponds to its distance to some low-dimensional subspace (that may depend on the considered shape of the cluster centers) and its contribution inside the subspace. By projecting the points to a minimum cost subspace of sufficient dimensionality, we can reduce the first part and get a low-dimensional point set. takes care of the reduced costs.
Many coreset constructions (without ) have been proposed for the -means problem. Early algorithms compute exponential grids or similar geometric structures to merge close enough points into coreset points [HPM04, FS05, HPK07]. This approach leads to a number of coreset points which is exponential in the dimension. Chen [Che09] showed how to reduce the size to a polynomial in , , and by combining geometric arguments with sampling. Further improvement was then based on refined sampling approaches. Langberg and Schulman [LS10] defined the sensitivity of an input point and showed how to compute coresets of size . The sensitivity-based framework by Feldman and Langberg [FL11] then yields coresets of size for the -means problem.
For the general -subspace -clustering problem, coresets of small size do not exist [Har04, Har06]. Edwards and Varadarajan [EV05] circumvent this problem by studying the problem under the assumption that all input points have integer coordinates. They compute a coreset for the -subspace -clustering problem with maximum distance instead of the sum of squared distances. We discuss their result together with the work of Feldman and Langberg [FL11] and Varadarajan and Xiao [VX12a] in Section 8. The latter paper proposes coresets for the general -subspace -clustering problem with integer coordinates.
Definition 13 requires that the coreset approximates the sum of squared distance for every possible solution. We require the same strong property when we talk about dimensionality reduction of a point set. The definition is verbatim except that instead of a matrix , we want a matrix with rows but of lesser (intrinsic) dimension. A famous example for this idea is the application of the Johnson-Lindenstrauss-Lemma: It allows to replace any matrix with a matrix while preserving the -means cost function up to a -factor. Boutsidis, Zouzias and Drineas [BZD10] develop a dimensionality reduction that is also based on a random projection onto dimensions. However, the approximation guarantee is instead of as for the Johnson-Lindenstrauss-Lemma.
Drineas et. al. [DFK04] developed an SVD based dimensionality reduction for the -means problem. They projected onto the most important dimensions and solved the lower dimensional instance to optimality (assuming that is a constant). This gives a -approximate solution. Boutsidis, Zouzias, Mahoney, and Drineas [BZMD15] show that the exact SVD can be replaced by an approximate SVD, giving a -approximation to dimensions with faster running time. Boutsidis et. al. [BMD09, BZMD15] combine the SVD approach with a sampling process that samples dimensions from the original dimensions, in order to obtain a projection onto features of the original point set. The approximation guarantee of their approach is , and the number of dimensions is reduced to .
2.3 Streaming algorithms
A stream is a large, possibly infinitely long list of data items that are presented in arbitrary (so possibly worst-case) order. An algorithm that works in the data stream model has to process this stream of data on the fly. It can store some amount of data, but its memory usage should be small. Indeed, reducing the space complexity is the main focus when developing streaming algorithms. In this paper, we consider algorithms that have a constant or polylogarithmic size compared to the input that they process. The main influence on the space complexity will be from the model parameters (like the number of centers in the -means problem) and from the desired approximation factor.
There are different streaming models known in the literature. A good introduction to the area is the survey by Mutukrishnan [Mut05]. We consider the Insertion-Only data stream model for geometric problems. Here, the stream consists of points from which arrive in arbitrary order. At any point in time (i.e., after seeing ) we want to be able to produce an approximate solution for the data seen so far. This does not mean that we always have a solution ready. Instead, we maintain a coreset of the input data as described in Section 2.2. Since the cost of any solution is approximated by the coreset, we can always compute an approximate solution by running any approximation algorithm on the coreset (as long as the algorithm can deal with weights, since the coreset is a weighted set).
A standard technique to maintain coresets is the merge-and-reduce method, which goes back to Bentley and Saxe [BS80] and was first used to develop streaming algorithms for geometric problems by Agarwal et al. [AHPV04b]. It processes chunks of the data and reduces each chunk to a coreset. Then the coresets are merged and reduced in a tree-fashion that guarantees that no input data point is part of more than reduce operations. Every reduce operation increases the error, but the upper bound on the number of reductions allows the adjustment of the precision of the coreset in an appropriate way (observe that this increases the coreset size). We discuss merge-and-reduce in detail in Section 7.
Har-Peled and Mazumdar initiated the development of coreset-based streaming algorithms for the -means problem. Their algorithm stores at most during the computation. The coreset construction by Chen [Che09] combined with merge-and-reduce gave the first the construction of coresets of polynomial size (in , , and ) in the streaming model. Various additional results exist that propose coresets of smaller size or coreset algorithms that have additional desirable properties like good implementability or the ability to cope with point deletions [AMR12, FGS13, FMSW10, FS05, HPK07, LS10, BFL17]. The construction with the lowest space complexity is due to Feldman and Langberg [FL11].
Recall from Section 2.1 that the -means problem can be approximated up to arbitrary precision when or is constant, and that the general case allows for a constant approximation. Since one can combine the corresponding algorithms with the streaming algorithms that compute coresets for -means, these statements are thus also true in the streaming model.
2.4 Our results and closely related work
Our main conceptual idea can be phrased as follows. For clustering problems with low dimensional centers any high dimensional input point set may be viewed as consisting of a structured part, i.e. a part that can be clustered well and a ”pseudo-random” part, i.e. a part that induces roughly the same cost for every cluster center (in this way, it behaves like a random point set). This idea is captured in the new coreset definition given in Definition 13.
Our new idea and the corresponding coreset allows us to use the following approach. We show that for any clustering problem whose centers fit into a low-dimensional subspace, we can replace the input matrix by its low-rank approximation for a certain small rank that only depends on the shape and number of clusters and the approximation parameter . The low rank approximation may be viewed as the structured part of the input. In order to take care of the “pseudo-random” part, we add the cost of projecting onto to any clustering.
Our new method allows us to obtain coresets and streaming algorithms for a number of problems. For most of the problems our coresets are independent of the dimension and the number of input points and this is the main qualitative improvement over previous results.
In particular, we obtain (for constant error probability) a coreset of size
for the linear and affine -subspace problem,
for the -means problem111When we use , then factors that are polylogarithmic in are hidden in the stated term. ,
for the -line means problem,
and for the -dimensional subspace -clustering problem when the input points are integral and have maximum -norm and where is a function that depends only on and .
We also provide detailed streaming algorithms for subspace approximation, -means, and -dimensional subspace -clustering. We do not explicitly state an algorithm that is based on coresets for -line means as it follows using similar techniques as for -means and -dimensional subspace -clustering and a weaker version also follows from the subspace -clustering problem with .
Furthermore, we develop a different method for constructing a coreset of size independent of and and show that this construction works for a restricted class of Bregman divergences.
The SVD and its ability to compute the optimal solution for the linear and affine subspace approximation problem has been known for over a century. About ten years ago, Drineas, Frieze, Kannan, Vempala, Vinay [DFK04] observed that the SVD can be used to obtain approximate solutions for the -means problem. They showed that projecting onto the first singular vectors and then optimally solving -means in the lower dimensional space yields a -approximation for the -means problem.
After the publication of the conference version of this work, Cohen, Elder, Musco, Musco and Persu [CEM15] observed that the dimensionality reduction and the coreset construction for subspace approximation can also be used for the -means problem because the -means problem can be seen as a subspace approximation problem with side constraints (in instead of ). By this insight, they show that dimensions suffice to preserve the -means cost function. Additionally, they show that this is tight, i.e., projecting to less singular vectors will no longer give a -guarantee.
3 Coresets for the linear -subspace problem
We will first develop a coreset for the problem of approximating the sum of squared distances of a point set to a single linear -dimensional subspace for an integer . Let be a -dimensional subspace represented by whose columns are orthonormal and span . Similarly, Let be the subspace that spans the orthogonal complement to , represented by a matrix with orthonormal columns.
Recall that for a given matrix containing points of dimension as its rows, the sum of squared distances (cost) of the points to is at least , where is the th singular value of (sorted non increasingly). Furthermore, the subspace that is spanned by the first right singular vectors of achieves the minimum cost .
Now, we will show that appropriately chosen vectors suffice to approximate the cost of every -dimensional subspace . We obtain these vectors by considering the singular value decomposition . Our first step is to replace the matrix by its rank approximation as defined in Definition 1. We show the following simple lemma regarding the error of this approximation with respect to squared Frobenius norm.
Let , and let be a matrix whose columns are orthonormal. Let and be an integer. Then
Proof: Using the singular value decomposition we write and . We first observe that is always non-negative. Then
holds since has orthonormal columns. Now we observe that and its rows satisfy that
We can thus continue and get
To see the inequality, recall that the spectral norm is compatible with the Euclidean norm ([QSS00]), set and and observe that
In the following we will use this result to give an estimate for the -distance to a -dimensional subspace . We will represent the orthogonal complement of by a matrix with orthonormal columns. Recall that . We then split into its low rank approximation for some suitable value of . This will be the ”structured” part of the input. Furthermore, we will view the cost of projecting onto the optimal -dimensional subspace w.r.t. the -subspace problem as taking care of the ”pseudorandom” part of the input. The argument is formalized in the next lemma.
Let , be an integer and . For every integer , and every matrix with orthonormal columns, by letting , we have
Proof: By the triangle inequality and the fact that has orthonormal columns we have
which proves that . Let by a matrix that spans the orthogonal complement of the column space of . Using Claim 3, , , and we obtain
where the inequality follows from Lemma 14.
Let , and be an integer. Let and suppose that . For and every matrix whose columns are orthonormal, we have
Proof: From our choice of it follows that
where the last inequality follows from the fact that the optimal solution to the -subspace problem has cost . Now the Corollary follows from Lemma 15.
The previous corollary implies that we can use and to approximate the cost of any -dimensional subspace. The number of rows in is and so it is not a small coreset. However, has small rank, which we can exploit to obtain the desired coreset. In order to do so, we observe that by orthonormality of the columns of we have for any matrix , which implies that . Thus, we can replace the matrix in the above corollary by . This is interesting, because all rows except the first rows of this new matrix have only entries and so they don’t contribute to . Therefore, we define our coreset to be the matrix consisting of the first rows of . The rows of this matrix are the coreset points. We summarize our coreset construction in the following algorithm.
|Input:||, an integer and an error parameter .|
|Output:||A pair that satisfies Theorem 17.|
In the following, we summarize the properties of our coreset construction.
Theorem 17 (Coreset for -subspace)
Let , be an integer and . Let be the output of a call to ; see Algorithm 1. Then where , , and for every -dimensional linear subspace of we have that
This takes time.
If one is familiar with the coreset literature it may seem a bit strange that the resulting point set is unweighted, i.e., we replace unweighted points by unweighted points. However, for this problem the weighting is implicitly done by scaling. Alternatively, we could also define our coreset to be the set of the first rows of where the th row is weighted by , and is the SVD of .
As already described in the Preliminaries, principal component analysis requires that the data is translated such that its mean is the origin of . If this is not the case, we can easily enforce this by subtracting the mean before the coreset computation. However, if we are taking the union of two or more coresets, they will have different means and cannot be easily combined. This limits the applicability to streaming algorithms to the case where we a priori know that the data set has the origin as its mean. Of course we can easily maintain the mean of the data, but a simple approach such as subtracting it from the coresets points at the end of the stream does not work as it invalidates the properties of the coreset. In the next section we will show how to develop a coreset for the affine case, which allows us to deal with data that is not a priori normalized.
4 Coresets for the Affine -Subspace Problem
We will now extend our coreset to the affine -subspace problem. The main idea of the new construction is very simple: Subtract the mean of from each input point to obtain a matrix , compute a coreset for and then add the mean to the points in the coreset to obtain a coreset . While this works in principle, there are two hurdles that we have to overcome. Firstly, we need to ensure that the mean of the coreset is before we add . Secondly, we need to scale and weight our coreset. The resulting construction is given as pseudo code in Algorithm 2.
|Input:||, an integer and an error parameter .|
|Output:||A triple that satisfies Theorem 19.|
Let with and let with be an affine -dimensional subspace of for . Then
Proof: Assume that spans . Then it holds that
First observe that Lemma 18 is only applicable to point sets with mean . This is true for , and also for any matrix that has and as its rows, even if we scale it. We know that is a coreset for for linear subspaces, satisfying that is approximately equal to for any linear subspace . Since we double the points, a likely coreset candidate would by
since for any linear subspace , and is also satisfied. What is the problem with ? Again consider Lemma 18. Assume that we have a linear subspace and start to move it around, obtaining an affine subspace for . Then the distance of and increases by a multiple of – but the multiple depends on the number of points. Thus, we either need to increase the number of points in (clearly not in line with our idea of a coreset), or we need to weight the points by . However, is not comparable to anymore. To compensate for the weighting, we need to scale by (notice that the now cancels out). This is how we obtain Line 3 of Algorithm 3. We conclude by stating and showing the coreset guarantee. Notice that all rows in receive the same weight, so we do not need to deal with the weights explicitly and rather capture the weighting by a multiplicative factor in the following theorem.
Theorem 19 (Coreset for affine -subspace)
Let , be an integer, and . Let be the output of a call to ; see Algorithm 2. Then where , , and for every affine -dimensional subspace of we have that