Chromatic Clustering in High Dimensional Space
In this paper, we study a new type of clustering problem, called Chromatic Clustering, in high dimensional space. Chromatic clustering seeks to partition a set of colored points into groups (or clusters) so that no group contains two points of the same color and a certain objective function is optimized. We consider two variants of the problem, chromatic k-means clustering (denoted as k-CMeans) and chromatic k-medians clustering (denoted as k-CMedians), and investigate their hardness and approximation solutions. For k-CMeans, we show that the additional coloring constraint destroys several key properties (such as the locality property) used in existing k-means techniques (for ordinary points), and significantly complicates the problem. There is no FPTAS for the chromatic clustering problem, even if k = 2. To overcome the additional difficulty, we develop a standalone result, called the Simplex Lemma, which enables us to efficiently approximate the mean point of an unknown point set through a fixed dimensional simplex. A nice feature of the simplex is that it is independent of the dimensionality of the original space, and thus can be used for problems in very high dimensional space. With the Simplex Lemma, together with several random sampling techniques, we show that a (1+ε)-approximation of k-CMeans can be achieved in near linear time through a sphere peeling algorithm. For k-CMedians, we show that a similar sphere peeling algorithm exists for achieving constant approximation solutions.
Clustering is one of the most fundamental problems in computer science and finds applications in many different areas [3, 4, 6, 2, 9, 11, 12, 14, 16, 7, 10]. Most existing clustering techniques assume that the data items to be clustered are independent of each other. Thus each data item can “freely” determine its membership within the resulting clusters, without paying attention to the clustering of other data items. In recent years, considerable attention has also been paid to clustering dependent data, and a number of clustering techniques, such as correlation clustering, point-set clustering, ensemble clustering, and correlation connected clustering, have been developed [4, 7, 9, 11, 10].
In this paper, we consider a new type of clustering problem, called Chromatic Clustering, for dependent data. Roughly speaking, a chromatic clustering problem takes as input a set of colored data items and groups them into clusters, according to certain objective functions, so that no pair of items with the same color is grouped together (such a requirement is called the chromatic constraint). Chromatic clustering captures the mutual exclusiveness relationship among data items and is a rather useful model for various applications. Due to the additional chromatic constraint, chromatic clustering is expected to simultaneously solve the “coloring” and clustering problems, which significantly complicates the problem. As will be shown later, the chromatic clustering problem is challenging to solve even when each color is shared by only two data items.
For chromatic clustering, we consider in this paper two variants, Chromatic k-means Clustering (k-CMeans) and Chromatic k-medians Clustering (k-CMedians), in space, where the dimensionality could be very high and k is a fixed number. In both variants, the input is a set of point-sets with each containing a maximum of k points in -dimensional space, and the objective is to partition all points of into k different clusters so that the chromatic constraint is satisfied and the total squared distance (for k-CMeans) or total distance (for k-CMedians) from each point to the center point (i.e., mean or median point) of its cluster is minimized.
Motivation: The chromatic clustering problem is motivated by several interesting applications. One of them is determining the topological structure of chromosomes in cell biology. In such applications, a set of 3D probing points (e.g., using BAC probes) is extracted from each homolog of the chromosome of interest (see Figure 6 in Appendix), and the objective is to determine, for each chromosome homolog, the common spatial distribution pattern of the probes among a population of cells. For this purpose, the set of probes from each homolog is converted into a high dimensional feature point in the feature space, where each dimension represents the distance between a particular pair of probes. Since each chromosome has two (or more, as in cancer cells) homologs, each cell contributes two or more feature points. Due to technical limitations, it is impossible to identify the same homolog across all cells. Thus, the feature points from each cell form a point-set with the same color (meaning that they are indistinguishable). To solve the problem, one could chromatically cluster all point-sets into clusters (after normalizing the cell size), with each cluster corresponding to a homolog, and use the mean or median point of each cluster as its common pattern.
Related works: As a generalization of the traditional clustering problem, chromatic clustering is naturally related to it. Due to the additional chromatic constraint, however, chromatic clustering can behave quite differently from its counterpart. For example, the k-means algorithms in [6, 15] rely on the fact that all input points in a Voronoi cell of the optimal mean points belong to the same cluster. Such a key locality property no longer holds for the k-CMeans problem.
Chromatic clustering falls under the umbrella of clustering with constraints. For this type of clustering, several solutions exist for some variants. Unfortunately, due to their heuristic nature, none of them yields quality guaranteed solutions for the chromatic clustering problem. The first quality guaranteed solution for chromatic clustering was obtained recently by Ding and Xu. They considered a special chromatic clustering problem, where every point-set has exactly k points in the first quadrant and the objective is to cluster points by cones apexed at the origin, and presented the first PTAS for constant k. The k-CMeans and k-CMedians problems considered in this paper are the general cases of the chromatic clustering problem. Very recently, Arkin et al. considered a chromatic 2D -center clustering problem and presented both approximation and exact solutions.
1.1 Main Results and Techniques
In this paper, we present three main results: a constant approximation for k-CMeans, a (1+ε)-approximation for k-CMeans, and their extensions to k-CMedians.
Constant approximation: We show that any -approximation for k-means clustering yields a -approximation for k-CMeans. This not only provides a way to generate an initial constant approximation solution for k-CMeans through some k-means algorithm, but more importantly reveals the intrinsic connection between the two clustering problems.
(1+ε)-approximation: We show that a near linear time (1+ε)-approximation solution for k-CMeans can be obtained using an interesting sphere peeling algorithm. Due to the lack of the locality property in k-CMeans, our sphere peeling algorithm is quite different from the ones used in [15, 6], which in general do not guarantee a (1+ε)-approximation solution for k-CMeans, as shown by our first result. Our sphere peeling algorithm is based on another standalone result, called the Simplex Lemma. The Simplex Lemma enables us to obtain an approximate mean point of a set of unknown points through a grid inside a simplex determined by some partial knowledge of the unknown point set. A unique feature of the Simplex Lemma is that the complexity of the grid is independent of the dimensionality, and thus it can be used to solve problems in high dimensional space. With the Simplex Lemma, our sphere peeling algorithm iteratively generates the mean points of k-CMeans, with each iteration building a simplex for the mean point.
Extensions to k-CMedians: We further extend the idea for k-CMeans to k-CMedians. In particular, we show that any -approximation for k-medians can be used to yield a -approximation for k-CMedians, where the error comes from the difficulty of computing the optimal median point (i.e., Fermat-Weber point). With this and a similar sphere peeling technique, we obtain a -approximation for k-CMedians. Note that although is a constant in this paper, a -approximation is still much better than a -approximation.
Due to space limitations, many details of our algorithms, proofs, and figures are deferred to the Appendix.
In this section, we introduce some definitions which will be used throughout the paper.
Definition 1 (Chromatic Partition)
Let be a set of point-sets with each consisting of points in space. A chromatic partition of is a partition of the points into sets, , such that each contains no more than one point from each for .
Definition 2 (Chromatic k-means Clustering (k-CMeans))
Let be a set of point-sets with each consisting of points in space. The chromatic k-means clustering (or k-CMeans) of is to find points in space and a chromatic partition of such that is minimized. The problem is called full k-CMeans if .
For both k-CMedians and k-CMeans, a problem often encountered in our approach is “How to find the best cluster for each point in if the mean or median points are already known?” An easy way to solve this problem is to first build a complete bipartite graph with the points in and as the two sides and then compute a minimum weight bipartite matching as the solution, where the edge weight is the Euclidean distance or squared distance between the two corresponding vertices. Clearly, this can be done in a total of time for all ’s. (We call this procedure bipartite matching.)
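The bipartite matching procedure can be sketched as follows. This is a minimal illustration, not the paper's implementation: it replaces a proper minimum weight matching algorithm with brute-force enumeration, which is only reasonable because the number of centers is a small constant. The names `sq_dist`, `assign_point_set`, and `chromatic_cost` are hypothetical.

```python
from itertools import permutations


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign_point_set(points, centers, cost=sq_dist):
    """Match the points of ONE point-set to distinct centers at minimum total cost.

    Brute force over all injective assignments (assumption: few centers).
    Returns (total_cost, assignment), where assignment[i] is the index of the
    center that points[i] is matched to.
    """
    best_cost, best = float("inf"), None
    for perm in permutations(range(len(centers)), len(points)):
        total = sum(cost(p, centers[c]) for p, c in zip(points, perm))
        if total < best_cost:
            best_cost, best = total, perm
    return best_cost, best


def chromatic_cost(point_sets, centers, cost=sq_dist):
    """Objective value when every point-set is matched independently."""
    return sum(assign_point_set(ps, centers, cost)[0] for ps in point_sets)


# Both points are closest to the first center, but the chromatic
# constraint forces them into different clusters.
print(assign_point_set([(1, 0), (2, 0)], [(0, 0), (10, 0)]))
```

Passing a Euclidean distance function as `cost` instead of `sq_dist` gives the median variant. The small example also shows why the Voronoi locality property fails under the chromatic constraint.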
3 Hardness of k-CMeans
It is easy to see that k-means is a special case of k-CMeans (i.e., each contains exactly one point). As shown by Dasgupta, k-means in high dimensional space is NP-hard even if k = 2. Thus, we immediately have the following theorem.
k-CMeans is NP-hard for in high dimensional space.
3.1 Is Full k-CMeans Easier?
It is interesting to know whether full k-CMeans is easier than general k-CMeans, since it is disjoint from k-means when . The following theorem gives a negative answer to this question.
Full k-CMeans is NP-hard and has no FPTAS for in high dimensional space unless P=NP (see Appendix for the proof).
The above theorem indicates that the fullness of k-CMeans does not reduce the hardness of the problem. However, this does not necessarily mean that achieving a -approximation for fixed is as difficult for full k-CMeans as for general k-CMeans. Below we show that a -approximation can be achieved relatively easily for full k-CMeans through a random sampling technique.
First we introduce a key lemma from . Let be a set of points in space, be a randomly selected subset from with points, and , be the mean points of and respectively.
Lemma 1 ()
With probability , , where .
Lemma 2
Let be a set of elements, and be a subset of such that . If we randomly select elements from , then with probability at least , the sample contains at least elements from .
Proof
If we randomly select elements from , then it is easy to see that with probability , at least one element of the sample belongs to . If we want this probability to equal , has to be (by Taylor series and , ). Thus if we perform rounds of random sampling with each round selecting elements, we get at least elements from with probability at least . ∎
Lemma 1 tells us that if we want to find an approximate mean point within a distance of to the mean point, we just need to take a random sample of size . Lemma 2 suggests that for any set and its subset of size , we can obtain a random subset of with size by randomly sampling directly from points, even if is an unknown subset of . Combining the two lemmas, we can immediately compute an approximate solution for full k-CMeans in the following way. First, we note that in full k-CMeans, each optimal cluster contains exactly points from the total of points in . This means that each cluster has a fraction of points from . Then, we can obtain an approximate mean point for each optimal cluster by (1) randomly sampling points from , (2) enumerating all possible subsets of size to find the set which is a random sample of the unknown optimal cluster, and (3) computing the mean of as the approximate mean point of the optimal cluster. Finally, we can generate the chromatic clusters from the approximate mean points by using the bipartite matching procedure (see Section 2).
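Steps (1) to (3) above can be sketched as follows. This is only an illustration under stated assumptions: `sample_size` and `subset_size` are hypothetical parameters standing in for the sizes prescribed by Lemmas 1 and 2, and since the correct subset cannot be recognized directly, the sketch returns the mean of every subset as a candidate, leaving the selection to a later evaluation step such as the bipartite matching procedure.

```python
import random
from itertools import combinations


def mean(points):
    """Coordinate-wise mean of a non-empty sequence of equal-length tuples."""
    n, d = len(points), len(points[0])
    return tuple(sum(p[i] for p in points) / n for i in range(d))


def candidate_means(all_points, sample_size, subset_size, rng=random):
    """Sample points uniformly, then return the mean of every subset of the sample.

    If enough of the sampled points fall into an (unknown) optimal cluster,
    one of the enumerated subsets is a random sample of that cluster, and
    its mean is a candidate approximate mean point.
    """
    sample = rng.sample(all_points, sample_size)
    return [mean(sub) for sub in combinations(sample, subset_size)]


pts = [(float(i), float(i % 3)) for i in range(20)]
cands = candidate_means(pts, sample_size=5, subset_size=2, rng=random.Random(0))
print(len(cands))  # number of subsets of size 2 in a sample of size 5
```

The number of candidates depends only on the two size parameters, not on the number of input points or on the dimension.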
With constant probability, a -approximation of full k-CMeans can be obtained in time.
With the above theorem, we only need to focus on the general k-CMeans problem in the remaining sections. Note that in the general case, some clusters may contain only a very small fraction (rather than ) of the points; thus we cannot use the above method to solve the general k-CMeans problem.
4 Constant Approximation from k-means
In this section, we show that a constant approximation solution for k-CMeans can be produced from an approximation solution of k-means. Below is the main theorem of this section.
Theorem 4.1
Let be an instance of k-CMeans, and be the mean points of a constant -approximation solution of k-means on the points . Then contains at least one k-tuple which could induce a -approximation of k-CMeans on , where .
To prove Theorem 4.1, we first introduce two lemmas.
Lemma 3
Let be a set of points in space, and be the mean point of . For any point , (see Appendix for the proof).
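For illustration, a standard identity of this kind is the decomposition of the total squared distance to an arbitrary point p: the sum over q in Q of ||q - p||^2 equals the sum over q in Q of ||q - m||^2 plus |Q| ||m - p||^2, where m is the mean of Q. It is assumed here, not taken from the statement above, that the lemma has this form. The sketch below checks the identity exactly with rational arithmetic; the function names are hypothetical.

```python
from fractions import Fraction


def mean(points):
    """Exact coordinate-wise mean of integer points, as Fractions."""
    n, d = len(points), len(points[0])
    return tuple(Fraction(sum(p[i] for p in points), n) for i in range(d))


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def decomposition_sides(points, p):
    """Return both sides of the assumed identity for point set `points` and point p."""
    m = mean(points)
    lhs = sum(sq_dist(q, p) for q in points)
    rhs = sum(sq_dist(q, m) for q in points) + len(points) * sq_dist(m, p)
    return lhs, rhs


print(decomposition_sides([(0, 0), (2, 0), (4, 6)], (1, 1)))
```

Both sides agree for any choice of p, which is why the mean point minimizes the total squared distance.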
Lemma 4
Let be a set of points in space, and be its subset containing points for some . Let and be the mean points of and respectively. Then , where .
Let , and be its mean point. By Lemma 3 we first have the following two equalities.
Thus, we have , which means that . ∎
Proof (of Theorem 4.1)
Let be the mean points in , and be their corresponding clusters. Let be the unknown optimal mean points of -CMeans, and be the corresponding optimal chromatic clusters. Let , and be its mean point for .
Since , by pigeonhole principle we know that there must exist some index such that . Thus by fixing , we have the following about (see Figure 1)
Since , we have . Thus the above inequality becomes
Summing both sides of (6) over , we have
where the second inequality follows from the inequality , which implies that .
It is obvious that the optimal objective value of k-means is no larger than that of k-CMeans on the same set of points in . Thus, . Plugging this inequality into inequality (7), we have
The above inequality means that if we take the k-tuple as the approximate mean points for k-CMeans, we have a -approximation solution, where the chromatic clusters can be obtained by the bipartite matching procedure. Thus, the theorem is proved. ∎
Running Time: In the above theorem, the bipartite matching procedure takes time for one k-tuple. Since there are in total such k-tuples, the total running time is for computing a -approximation of k-CMeans from a -approximation of k-means. As k is assumed to be a constant in this paper, the running time is linear.
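The search over k-tuples described above can be sketched as follows. This is a minimal illustration under stated assumptions: the k-means centers are taken as given, tuples are enumerated with repetition allowed (as the pigeonhole argument in the proof suggests), and the matching is done by brute force, which is acceptable only for small constant k. The names `matching_cost` and `best_tuple` are hypothetical.

```python
from itertools import permutations, product


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def matching_cost(points, centers):
    """Minimum cost of matching one point-set to distinct slots of `centers`."""
    return min(
        sum(sq_dist(p, centers[c]) for p, c in zip(points, perm))
        for perm in permutations(range(len(centers)), len(points))
    )


def best_tuple(point_sets, kmeans_centers):
    """Try every k-tuple of k-means centers and keep the cheapest one.

    Returns (objective_value, tuple_of_centers). Ties keep the first tuple found.
    """
    k = len(kmeans_centers)
    best = None
    for tup in product(kmeans_centers, repeat=k):
        total = sum(matching_cost(ps, tup) for ps in point_sets)
        if best is None or total < best[0]:
            best = (total, tup)
    return best


point_sets = [[(0, 0), (10, 0)], [(1, 0), (11, 0)]]
print(best_tuple(point_sets, [(0, 0), (10, 0)]))
```

Each tuple is evaluated with one matching per point-set, so for constant k the total work grows linearly with the number of point-sets.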
5 (1+ε)-Approximation Algorithm
This section presents our (1+ε)-approximation solution to the k-CMeans problem. We first introduce a standalone result, the Simplex Lemma, and then use it to achieve a (1+ε)-approximation for k-CMeans. The main idea of the algorithm is to use a sphere peeling technique to generate the chromatic clusters iteratively, where the Simplex Lemma helps to determine a proper peeling region.
5.1 Simplex Lemma
The Simplex Lemma is mainly for approximating the mean point of some unknown point set . The only known information about is a set of points, each being an approximate mean point of a subset of . The following two Simplex Lemmas show that it is possible to construct a simplex of and find the desired approximate mean point of inside the simplex.
Lemma 5 (Simplex Lemma I)
Let be a set of points in with a partition of and for any . Let be the mean point of , and be the mean point of for . Further, let , and be the simplex determined by . Then for any , it is possible to construct a grid of size inside such that at least one grid point satisfies the inequality .
We will prove this lemma by mathematical induction on .
Base case: For , since , . Thus, the simplex and the grid are all simply the point . Clearly satisfies the inequality.
Induction step: Assume that the lemma holds for any for some (i.e., Induction Hypothesis). Now we consider the case of . First, we assume that for each . Otherwise, we can reduce the problem to the case of smaller in the following way. Let be the index set of small subsets. Then, , and . By Lemma 4, we know that , where is the mean point of . Let be the variance of . Then, we have . Thus, if we replace and by and respectively, and find a point such that , we have (where the last inequality is due to the fact ). This means that we can reduce the problem to a problem with point set and a smaller (i.e., ). By the induction hypothesis, we know that the reduced problem can be solved (note that the simplex would be a subset of determined by ), and therefore the induction step holds for this case. Thus, in the following discussion, we can assume that for each .
For each , since , by Lemma 4, we know that . This, together with the triangle inequality, implies that for any , . Thus, if we pick any index , and draw a ball centered at and with radius , the whole simplex will be inside . Note that since , also lies inside . This indicates that we can construct in the -dimensional space spanned by , rather than the whole space. Also, if we build a grid inside with grid length , the total number of grid points is no more than . With this grid, we know that for any point inside , there exists a grid point such that . This means that we can find a grid point inside such that . Thus, the induction step holds.
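The key point of the argument, that the number of grid points depends on the number of simplex vertices and the grid resolution but not on the ambient dimension, can be illustrated with a small sketch. Note that this uses a different and simpler construction than the proof: instead of a grid inside a ball in the spanned subspace, it enumerates convex combinations of the vertices with weights that are multiples of 1/m. No approximation guarantee is claimed for it, and the names `compositions`, `simplex_grid`, and `m` are hypothetical.

```python
from itertools import combinations


def compositions(m, parts):
    """Yield all tuples of `parts` non-negative integers that sum to m."""
    # Stars and bars: choose positions of the parts-1 bars among m+parts-1 slots.
    for bars in combinations(range(m + parts - 1), parts - 1):
        prev, comp = -1, []
        for b in bars:
            comp.append(b - prev - 1)
            prev = b
        comp.append(m + parts - 1 - prev - 1)
        yield tuple(comp)


def simplex_grid(vertices, m):
    """Yield the points sum_i (w_i / m) * v_i over all integer weights w summing to m."""
    d = len(vertices[0])
    for w in compositions(m, len(vertices)):
        yield tuple(sum(wi * v[i] for wi, v in zip(w, vertices)) / m for i in range(d))


low = [(0, 0, 0), (2, 0, 0), (0, 2, 0)]
high = [tuple([0] * 50), tuple([2] + [0] * 49), tuple([0, 2] + [0] * 48)]
# Same number of grid points in 3 and in 50 dimensions.
print(len(list(simplex_grid(low, 2))), len(list(simplex_grid(high, 2))))
```

Every generated point lies inside the simplex because the weights are non-negative and sum to one.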
With the above base case and induction steps, the lemma holds for any . ∎
In the above lemma, we assume that the exact positions of are known (see Fig. 3). However, in some scenarios (e.g., when the exact partition of is not given, as is the case in k-CMeans), it is possible that we only know the approximate position of each mean point (see Fig. 3). The following lemma shows that an approximate position of can still be determined in a similar way.
Lemma 6 (Simplex Lemma II)
Let , , , and, be defined as in Lemma 5. Let be points in such that for and , and be the simplex determined by . Then for any , it is possible to construct a grid of size inside such that at least one grid point satisfies the inequality .
5.2 Sphere Peeling Algorithm
This section presents a sphere peeling algorithm to achieve a (1+ε)-approximation for k-CMeans.
Let be an instance of k-CMeans with (unknown) optimal chromatic clusters , and be the mean point of the cluster for . Without loss of generality, we assume that .
Algorithm overview: Our algorithm first computes a constant -approximation solution (by Theorem 4.1) to determine an upper bound of the optimal objective value , and then searches for a good approximation of in the interval of . At each search step, our algorithm performs a sphere peeling procedure to iteratively generate approximate mean points for the chromatic clusters. Initially, the sphere peeling procedure uses a random sampling technique (i.e., Lemmas 1 and 2) to find an approximate mean point for . At the -th iteration, it already has approximate mean points for respectively. Then it draws peeling spheres, , centered at the approximate mean points respectively and with a radius determined by the approximation of . Denote the set of unknown points as . Our algorithm considers two cases: (a) is large enough and (b) is small. For case (a), since is large enough, we can first use Lemma 2 to find an approximate mean point of , and then construct a simplex determined by and . For case (b), it directly constructs a simplex determined just by . In either case, our algorithm builds a grid inside the simplex (i.e., using Lemma 6) to find an approximate mean point for (i.e., ). The sphere peeling procedure is repeated times to generate the approximate mean points.
Algorithm k-CMeans
Input: , , and a small positive value ε.
Output: A (1+ε)-approximation solution for k-CMeans on .
Run the PTAS of k-means in  on , and let be the obtained objective value.
For to do
Set , and run the Sphere-Peeling-Tree algorithm.
Let be the output tree.
For each root-to-leaf path of every , use the bipartite matching procedure to compute the objective value of k-CMeans on . Output the points from the path with the smallest objective value.
Algorithm Sphere-Peeling-Tree
Input: , , .
Output: A tree of height with each node associated with a point .
Initialize with a single root node associated with no point.
Recursively grow each node in the following way:
If the height of is already , then it is a leaf.
Otherwise, let be the height of . Build the radius candidates set . For each , do
Let be the points associated with nodes on the root-to- path.
For each , , construct a ball centered at and with radius .
Take a random sample from with size . Compute the mean points of all subsets of the sample, and denote them as .
For each , construct the simplex determined by . Also construct the simplex determined by . Build a grid inside each simplex with size .
In total, there are grid points inside the simplices. For each grid point, add one child to , and associate it with the grid point.
Theorem 5.1
With constant probability, Algorithm k-CMeans yields a (1+ε)-approximation for k-CMeans in time.
5.3 Proof of Theorem 5.1
Let , and , where is the mean point of . Clearly, (by assumption) and . Let .
Lemma 7
Among all the trees generated in Algorithm k-CMeans, with constant probability, there exists at least one tree, , which has a root-to-leaf path with each node at level , , on the path associated with a point and satisfying the inequality
Before proving this lemma, we first show its implication.
If Lemma 7 is true, Algorithm k-CMeans yields a (1+ε)-approximation for k-CMeans.
We first assume that Lemma 7 is true. Then for each , we have