Greedy Strategy Works for Clustering with Outliers and Coresets Construction


Hu Ding, School of Computer Science and Engineering, University of Science and Technology of China, Hefei, China
huding@ustc.edu.cn
Abstract

We study the problems of clustering with outliers in high dimensions. Though a number of methods have been developed in the past decades, it is still quite challenging to design quality-guaranteed algorithms with low complexities for these problems. Our idea is inspired by the greedy method, Gonzalez's algorithm, for solving the problem of ordinary k-center clustering. Based on some novel observations, we show that this greedy strategy actually can handle k-center/median/means clustering with outliers efficiently, in terms of both quality and complexity. We further show that the greedy approach yields a small coreset for the problem in doubling metrics, so as to reduce the time complexity significantly. Moreover, a by-product is that the coreset construction can be applied to speed up the popular density-based clustering approach DBSCAN.

Keywords: minimum enclosing ball, high dimension, outlier recognition, dimension reduction, random sampling

1 Introduction

Clustering is one of the most fundamental problems in data analysis [37]. Given a set of elements, the goal of clustering is to partition the set into several groups based on their similarities or dissimilarities. Several clustering models have been extensively studied, such as k-center, k-median, and k-means clustering [6]. In reality, many datasets are noisy and contain outliers, and outliers can seriously distort the final results of data analysis. The problems of outlier removal have therefore attracted a great amount of attention in the past decades [42, 13]. Clustering with outliers can be viewed as a generalization of the ordinary clustering problems; however, the existence of outliers makes the problems much more challenging.

1.1 Prior Work and Our Contribution

We consider the problem of k-center clustering with outliers first. Given n points in Euclidean space and the number of outliers z, the problem is to find k balls to cover at least n − z points and minimize the maximum radius of the balls. A 3-approximation algorithm for the problem in metric graphs was proposed by [14]; for the problem in Euclidean space, where the cluster centers can appear anywhere in the space, their approximation ratio becomes 4. A subsequent streaming (4+ε)-approximation algorithm was proposed by [47]. Recently, [12] proposed a 2-approximation algorithm for metric k-center clustering with outliers (but it is unclear what the resulting approximation ratio would be in Euclidean space). Existing algorithms often have high time complexities. For example, the running times of the algorithms in [14, 47] are super-linear in the input size and also involve the ratio of the optimal radius to the shortest distance between any two distinct input points; the algorithm in [12] needs to solve a complicated linear programming model, and its exact time complexity is not provided. The coreset-based idea of [8] needs to enumerate a large number of possible cases and also yields a high complexity.

In this paper, we aim to design quality-guaranteed algorithms with low complexities in high dimensions. Our idea is inspired by the greedy method [29] for solving ordinary k-center clustering. Through some novel modifications and a reasonable relaxation, we show that this greedy method also works for the problem of k-center clustering with outliers. Our approach achieves a 2-approximation with respect to the clustering cost (i.e., the radius) while discarding slightly more than z outliers; moreover, the time complexity is linear in the input size n.

We further consider the problem in doubling metrics, motivated by the fact that many real-world datasets often manifest low intrinsic dimensions [9]. For example, image sets usually can be represented in a low-dimensional manifold even though the Euclidean dimension of the image vectors can be very high. The "doubling dimension" is widely used for measuring the intrinsic dimension of a dataset [50]. We adopt the following assumption: the inliers of the given data have a low doubling dimension ρ. Note that we make no assumption on the outliers; namely, the outliers can scatter arbitrarily in the space. We believe that this assumption captures a large range of high-dimensional instances in reality. With this assumption, we show that our approach can further improve the clustering quality. In particular, the greedy approach is able to construct a coreset for the problem of k-center clustering with outliers; as a consequence, the time complexity can be significantly reduced by running existing algorithms on the coreset. The size of our coreset depends only on k, z, ρ, and the small parameter ε measuring the quality of the coreset, and is independent of the input size n and the Euclidean dimension d (note that k and z are often much smaller than n in practice; the coefficient of z in the coreset size actually can be further reduced to be arbitrarily close to 1, by increasing the other term of the size). Moreover, our coreset is a natural "composable coreset", an emerging topic for solving problems in distributed computing [36].

We are aware of some prior work on reducing the data size for k-center clustering with outliers. [15] and [35] respectively showed that if more than z outliers are allowed to be removed, the random sampling technique can be applied to reduce the data size; but their sample sizes depend on the dimension ([15] stated the result for metric graphs; for instances in Euclidean space, the sample size will depend on the dimension d). [35] also provided coreset constructions for k-median/means clustering with outliers in doubling metrics; however, their method cannot be extended to the case of k-center. [2] considered the coreset construction for ordinary k-center clustering without outliers. A more detailed discussion is given in Remark 3.1.

Moreover, our coreset construction can be applied to speed up DBSCAN, a popular density-based clustering approach for outlier recognition [24]. Roughly speaking, DBSCAN groups the points located in each dense region into an individual cluster and labels the remaining points as outliers. Despite its wide applications [51], a major bottleneck is the high time complexity, especially when the data size is large. The running time of DBSCAN in the plane has been improved from quadratic to O(n log n) by [21, 31]. For the case in general d-dimensional space, [17] and [28] respectively provided algorithms achieving running times lower than quadratic (where d is assumed to be a constant). If d is high, the straightforward implementation takes O(n^2) time (we refer the reader to [28] for more details). It is still an open problem whether it is possible to improve the running time in high dimensions. In this paper, we show that an approximate result can be obtained by running existing DBSCAN algorithms on the coreset, if the inliers (including core and border points) have a low doubling dimension.

Finally, our greedy strategy can be extended to handle k-median/means clustering with outliers. The theoretical algorithms [18, 43, 26] often have high complexities. Several heuristic algorithms have been studied before [16, 48]. By using a local search method, [32] provided a constant-factor approximation for k-means clustering with outliers, but the violation on the number of outliers is large (it removes O(kz log(nΔ)) outliers, where Δ denotes the diameter of the points).

1.2 Preliminaries

We introduce several important definitions that are used throughout the paper.

{definition}

[k-Center Clustering with Outliers] Given a set P of n points in R^d with two positive integers k and z, the problem of k-center clustering with outliers is to find a subset P' ⊆ P, where |P'| ≥ n − z, and k cluster centers c_1, ..., c_k ∈ R^d, such that max_{p ∈ P'} min_{1 ≤ j ≤ k} ||p − c_j|| is minimized. Here, ||p − q|| denotes the Euclidean distance between two points p and q.

In this paper, we always use P_opt, a subset of P with size n − z, to denote the subset yielding the optimal solution. Also, let C_1, ..., C_k be the k clusters forming P_opt, and let the resulting optimal clustering cost be r_opt; that is, each C_j is covered by an individual ball with radius r_opt.

Usually, the optimization problems involving outliers are challenging combinatorial problems. Thus we often relax our goal and allow the algorithm to discard slightly more than z outliers in practice. Actually, the same relaxation idea has been adopted by a number of previous works on clustering with outliers [15, 35, 3].

{definition}

[(k, z)_ε-Center Clustering] Let (P, k, z) be an instance of k-center clustering with outliers, and let ε ≥ 0. (k, z)_ε-center clustering is to find a subset P' of P, where |P'| ≥ n − (1 + ε)z, such that the corresponding clustering cost of Definition 1.2 on P' is minimized.

(i) If a solution has a clustering cost, i.e., a radius, at most α · r_opt with α > 0, it is called an α-approximation. Moreover, if the solution outputs more than k cluster centers, say β · k centers with β > 1, it is called an (α, β)-approximation.

(ii) Given a set H of cluster centers (|H| could be larger than k), the resulting clustering cost on P (excluding the prescribed number of outliers) is denoted by φ(P, H).

Obviously, the problem in Definition 1.2 is a special case of (k, z)_ε-center clustering with ε = 0.

Actually, the above two definitions can be naturally extended to the weighted case: each point has a non-negative weight, and the number of outliers z is replaced by the total weight of the outliers. Further, we have the following definition of coresets.

{definition}

[Coreset] Given a small parameter ε ∈ (0, 1) and an instance (P, k, z) of k-center clustering with outliers, a weighted point set S is called an ε-coreset of P, if (1 − ε) · φ(P, H) ≤ φ(S, H) ≤ (1 + ε) · φ(P, H) for any set H of k points.

Given a large-scale instance P, we can run an existing algorithm on its coreset S to compute an approximate solution for P; if |S| is much smaller than n, the resulting running time can be significantly reduced. Formally, we have the following proposition (see the proof in our supplement).

Proposition 1.

If a set of k cluster centers is an α-approximation on the ε-coreset S, then it is an ((1+ε)/(1−ε))·α-approximation on P.
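The proof is deferred to the supplement; as a quick sanity check, the following chain of inequalities (our sketch, using the coreset guarantee of Definition 1.2 in the form (1 − ε)·φ(P, H) ≤ φ(S, H) ≤ (1 + ε)·φ(P, H), and writing H_P and H_S for optimal center sets of P and S) illustrates why such a transfer holds:

    φ(P, H) ≤ (1/(1−ε))·φ(S, H) ≤ (α/(1−ε))·φ(S, H_S) ≤ (α/(1−ε))·φ(S, H_P) ≤ ((1+ε)/(1−ε))·α·φ(P, H_P).

The four steps use, in order, the coreset guarantee for H, the α-approximation of H on S, the optimality of H_S for S, and the coreset guarantee for H_P.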

As mentioned before, we also consider the case with a low doubling dimension. For any c ∈ R^d and r ≥ 0, let B(c, r) be the ball of radius r around c.

{definition}

[Doubling Dimension] The doubling dimension of a point set X is the smallest number ρ, such that for any p ∈ X and r ≥ 0, the set X ∩ B(p, 2r) is always covered by the union of at most 2^ρ balls with radius r.

The doubling dimension describes the expansion rate of a point set. For example, if a set of points is uniformly distributed inside a ρ-dimensional hypercube, then its doubling dimension is O(ρ) while the Euclidean dimension of the ambient space can be very high. For the coreset construction, we adopt the following assumption.

{definition}

[Low Doubling Dimension Assumption] Given an instance (P, k, z) of k-center clustering with outliers, we assume that the inliers P_opt have a constant doubling dimension ρ, but the outliers can be scattered arbitrarily in the space.

Other notations. For convenience, we use dist(p, Q) to denote the shortest distance between a point p and a point set Q, i.e., dist(p, Q) = min_{q ∈ Q} ||p − q||. Further, given two point sets A and Q, we let dist(A, Q) = min_{p ∈ A} dist(p, Q).
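To make the notation concrete, here is a small NumPy sketch (ours, for illustration only) computing dist(p, Q) and dist(A, Q) for points stored as rows of arrays:

    import numpy as np

    def dist_point_to_set(p, Q):
        # dist(p, Q) = min_{q in Q} ||p - q||
        return float(np.min(np.linalg.norm(Q - p, axis=1)))

    def dist_set_to_set(A, Q):
        # dist(A, Q) = min_{p in A} dist(p, Q)
        return min(dist_point_to_set(p, Q) for p in A)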

Other related work. [29, 34] provided 2-approximations for ordinary k-center clustering, and proved that any approximation ratio lower than 2 implies P = NP. For k-means/median clustering, several constant-factor approximation algorithms have been proposed [5, 40]; if k or the dimension d is a constant, PTASs are known [44, 38, 41, 19, 27]. Recent research has also focused on distributed clustering with outliers [46, 30, 45].

In computational geometry, coreset construction is a technique for reducing the data size so as to speed up many optimization problems; we refer the reader to the surveys [49, 7] for more details. In particular, coresets can be used to improve the running times of existing clustering algorithms in Euclidean space and doubling metrics [25, 35, 2].

The rest of the paper is organized as follows. We study the problem of k-center clustering with outliers in Section 2, and consider the coreset construction and its application to DBSCAN in Section 3. Due to the space limit, we briefly summarize our results on k-median/means clustering with outliers in Section 4, and place the details in our supplement.

2 Algorithms for k-Center Clustering with Outliers

For the sake of completeness, let us briefly introduce the algorithm of [29] first. Initially, it arbitrarily selects a point from P, and iteratively selects the following k − 1 points, where each new point is the one having the largest minimum distance to the already selected points; finally, each input point is assigned to its nearest neighbor among these k selected points. It can be proved that this greedy strategy results in a 2-approximation of k-center clustering on P. In this section, we show that a modified version of Gonzalez's algorithm yields approximate solutions for the problem of k-center clustering with outliers.
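For concreteness, the following NumPy sketch (our illustration; the first center is chosen arbitrarily as index 0) implements this greedy procedure:

    import numpy as np

    def gonzalez_k_center(P, k):
        # Greedy 2-approximation for ordinary k-center (no outliers).
        # P: (n, d) array of points; returns the indices of the k chosen centers.
        centers = [0]                                  # arbitrary first center
        dist_to_E = np.linalg.norm(P - P[0], axis=1)   # distance of every point to the chosen centers
        for _ in range(1, k):
            j = int(np.argmax(dist_to_E))              # the point farthest from the current centers
            centers.append(j)
            dist_to_E = np.minimum(dist_to_E, np.linalg.norm(P - P[j], axis=1))
        return centers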

2.1 Bi-criteria Approximation

Here, we consider the bi-criteria approximation that uses more than k cluster centers. The main challenge for implementing Gonzalez's algorithm is that the outliers and inliers are mixed in P; for example, the selected point, which has the largest minimum distance to the already selected points, is very likely to be an outlier in each iteration, and therefore the clustering result could be arbitrarily bad. Instead, our strategy is to take a small random sample from the farthest subset of an appropriate size (some similar heuristics have been studied for other optimization problems [23, 22]). We realize this idea in Algorithm 1. For simplicity, let γ denote z/n in the algorithm; usually we can assume that γ is much smaller than 1. We prove the correctness of Algorithm 1 below.

  Input: An instance (P, k, z) of k-center clustering with outliers, and ε > 0; the parameters η ∈ (0, 1) and the number of rounds t.
  
  1. Let γ = z/n and initialize a set E = ∅.

  2. Initially, Q = P; randomly select ⌈(1/(1 − γ)) ln(1/η)⌉ points from Q and add them to E.

  3. Run the following steps until t rounds have been executed:

    1. Let Q be the set of the (1 + ε)z farthest points of P to the current E.

    2. Randomly select ⌈(1 + 1/ε) ln(1/η)⌉ points from Q and add them to E.

  Output E.
Algorithm 1 Bi-criteria Approximation Algorithm
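The following NumPy sketch implements the strategy of Algorithm 1. The number of rounds, the per-round sample sizes, and the size (1 + ε)z of the farthest subset are the choices used in our discussion (via Proposition 2 below) and should be read as assumptions of this illustration rather than exact constants of the formal analysis.

    import numpy as np

    def bi_criteria_k_center_outliers(P, k, z, eps=0.5, eta=0.1, t=None, seed=0):
        # Sketch of Algorithm 1: in each round, sample a few points from the
        # (1+eps)*z farthest points and add them to the center set E.
        rng = np.random.default_rng(seed)
        n = P.shape[0]
        t = t if t is not None else 2 * k                        # number of rounds (assumed O(k))
        m = int(np.ceil((1 + 1.0 / eps) * np.log(1.0 / eta)))    # per-round sample size (assumed)
        far = min(n - 1, int(np.ceil((1 + eps) * z)))            # size of the farthest subset
        m0 = int(np.ceil(np.log(1.0 / eta) / (1.0 - z / n)))     # initial sample size (assumed)
        E = [int(i) for i in rng.choice(n, size=min(m0, n), replace=False)]
        dist_to_E = np.min(
            np.linalg.norm(P[:, None, :] - P[E][None, :, :], axis=2), axis=1)
        for _ in range(t):
            Q = np.argsort(dist_to_E)[-far:]                     # indices of the farthest points
            for j in rng.choice(Q, size=min(m, len(Q)), replace=False):
                E.append(int(j))
                dist_to_E = np.minimum(dist_to_E, np.linalg.norm(P - P[j], axis=1))
        radius = float(np.sort(dist_to_E)[-(far + 1)])           # cost after excluding the far farthest points
        return E, radius

The sketch selects the farthest subset by a full sort for clarity; a linear-time selection (the PICK algorithm mentioned below) would match the stated running time.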
Proposition 2.

Let Q be a set of elements and Q' ⊆ Q with |Q'| / |Q| ≥ τ for some τ ∈ (0, 1). Given η ∈ (0, 1), if one randomly samples (1/τ) ln(1/η) elements from Q, then with probability at least 1 − η, the sample contains at least one element from Q'.

Proposition 2 can be obtained by simple calculation; the proof is placed in our supplement.
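The calculation is elementary; a sketch (with the sample size m = ⌈(1/τ) ln(1/η)⌉ as stated above, and sampling with replacement, which can only increase the failure probability):

    Pr[no element of Q' is sampled] ≤ (1 − |Q'|/|Q|)^m ≤ (1 − τ)^m ≤ e^{−τ·m} ≤ η.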

{lemma}

With probability at least 1 − η, the set E obtained in Step 2 of Algorithm 1 contains at least one point from P_opt.

Lemma 2.1 can be easily obtained from Proposition 2, since |P_opt| / |P| = (n − z)/n = 1 − γ.

Recall that C_1, ..., C_k are the k clusters forming P_opt. Denote by λ_j the number of these clusters having a non-empty intersection with E after the j-th round. For example, initially λ_0 ≥ 1 with probability at least 1 − η by Lemma 2.1; if λ_j = k, that means E ∩ C_l ≠ ∅ for any 1 ≤ l ≤ k.

{lemma}

In each round of Step 3 of Algorithm 1, with probability at least 1 − η, either (1) the farthest subset Q contains at least one point q with dist(q, E) ≤ 2 r_opt, or (2) λ_j > λ_{j−1}.

Proof.

Suppose that (1) is not true, i.e., dist(q, E) > 2 r_opt for every q ∈ Q, and we prove that (2) is true. Let J include all the indices l with C_l ∩ E ≠ ∅. We claim that Q ∩ C_l = ∅ for each l ∈ J. Otherwise, let q ∈ Q ∩ C_l and p ∈ C_l ∩ E; due to the triangle inequality, we would have ||q − p|| ≤ 2 r_opt, which is in contradiction with the assumption that dist(q, E) > 2 r_opt. Thus, besides the outliers, Q only contains points from the clusters C_l with l ∉ J. Moreover, since the number of outliers is z and |Q| = (1 + ε)z, we know that |Q ∩ P_opt| ≥ εz, i.e., the fraction of such points in Q is at least ε/(1 + ε). By Proposition 2, if we randomly select (1 + 1/ε) ln(1/η) points from Q, then with probability at least 1 − η the sample contains at least one point from Q ∩ P_opt; also, this point must come from some cluster C_l with l ∉ J. Overall, (2) is true. ∎

If (1) of Lemma 2.1 happens, i.e., some q ∈ Q has dist(q, E) ≤ 2 r_opt, then every point outside Q has distance at most 2 r_opt to E (recall that Q consists of the farthest points); moreover, since |Q| = (1 + ε)z, the set E already yields a clustering cost at most 2 r_opt for (k, z)_ε-center clustering.

Next, we assume that (1) in Lemma 2.1 never happens, and prove that λ_t = k with constant probability when t is sufficiently large. The following idea actually has been used by [1] for obtaining a bi-criteria approximation for k-means clustering. Define a random variable x_j, where x_j = 0 if λ_j > λ_{j−1}, or x_j = 1 if λ_j = λ_{j−1}, for j = 1, 2, .... So E[x_j] ≤ η and

λ_t ≥ λ_0 + Σ_{j=1}^{t} (1 − x_j) ≥ 1 + t − Σ_{j=1}^{t} x_j.   (1)

Also, let y_j = Σ_{i=1}^{j} (x_i − η) and y_0 = 0. Then, the sequence {y_0, y_1, ...} is a super-martingale with |y_{j+1} − y_j| ≤ 1 (more details are shown in our supplement). Through the Azuma-Hoeffding inequality [4], we have Pr(y_t − y_0 ≥ s) ≤ exp(−s²/(2t)) for any t ∈ Z^+ and s > 0. This inequality implies that

Pr( Σ_{j=1}^{t} x_j ≤ ηt + s ) ≥ 1 − exp(−s²/(2t)).   (2)

Combining (1) and (2), and choosing s and t appropriately (t = O(k) suffices when η is a fixed constant), we know that λ_t = k with constant probability. Moreover, λ_t = k directly implies that dist(p, E) ≤ 2 r_opt for any p ∈ P_opt based on the triangle inequality. Together with Lemma 2.1, we have the following theorem.

{theorem}

Let γ = z/n. If we set t = O(k) for Algorithm 1 (with ε and η fixed), then with constant probability, the clustering cost of the returned set E for (k, z)_ε-center clustering on P is at most 2 r_opt.

Quality and running time. If ε and η are fixed constants, Theorem 2.1 implies that E is a 2-approximation for (k, z)_ε-center clustering of P using O(k) cluster centers, with constant probability. In each round of Step 3, only a constant number of new points (depending on ε and η) is added to E; thus it takes O(nd) time to update the distances from the points of P to E. To select the set Q, we can apply the linear-time PICK algorithm [10]. Overall, the running time of Algorithm 1 is O(k·nd), i.e., linear in the input size.

Further, we consider practical instances under a reasonable assumption and provide a refined analysis of Algorithm 1. In reality, the clusters usually are not too small compared with the number of outliers; for example, it is rare to have an optimal cluster whose size is only a small fraction of the number of outliers z.

{theorem}

If we assume that each optimal cluster C_l has size at least εz for 1 ≤ l ≤ k, then the set E returned by Algorithm 1 is a 4-approximation for the problem of k-center clustering with outliers on P (i.e., excluding exactly z outliers) with constant probability.

Compared with Theorem 2.1, Theorem 2.1 shows that we can exclude exactly z outliers (rather than (1 + ε)z), though the approximation ratio with respect to the radius becomes 4. Moreover, our running time is significantly lower than those of the algorithms of [14, 47] in Euclidean space (as discussed in Section 1.1).

Proof of Theorem 2.1.

We take a more careful look at the proof of Lemma 2.1. If (1) never happens, λ eventually reaches k and thus dist(p, E) ≤ 2 r_opt for every p ∈ P_opt. So we focus on the case that (1) happens before λ reaches k. Suppose that (1) happens at the j-th round while λ_{j−1} < k. We consider two cases: (i) there exists some l ∉ J such that C_l ⊆ Q, and (ii) otherwise.

For (i), C_l ⊆ Q implies that |Q ∩ P_opt| ≥ |C_l|. Note that we assume |C_l| ≥ εz, i.e., the fraction of these points in Q is at least ε/(1 + ε). Using the same manner as in the proof of Lemma 2.1, we know that (2) happens with probability at least 1 − η, i.e., the algorithm still makes progress in this round.

For (ii), C_l \ Q ≠ ∅ for all l ∉ J. Together with the fact that (1) happens, i.e., every point outside Q has distance at most 2 r_opt to E, we know that there exists a point p_l ∈ C_l \ Q (for each l ∉ J) such that dist(p_l, E) ≤ 2 r_opt. Consequently, for any p ∈ C_l we have

dist(p, E) ≤ ||p − p_l|| + dist(p_l, E) ≤ 2 r_opt + 2 r_opt = 4 r_opt.   (3)

Overall, after t rounds, either λ_t = k, i.e., a 2-approximation is obtained, or a 4-approximation of k-center clustering with outliers (excluding exactly z outliers) is obtained, with constant probability. ∎

Figure 1: Left: a point of P_opt and its distance to E; right: a point of an optimal cluster and the centers taking charge of two different optimal clusters (used in the proof of Theorem 2.2).

2.2 2-Approximation

If k is a constant, we show that a single-criterion 2-approximation can be obtained. Actually, we use the same strategy as in Section 2.1, but run only k rounds with each round sampling only one point. See Algorithm 2. However, the success probability would be exponentially small in k; hence we need to repeat the process to guarantee a constant overall success probability.

  Input: An instance (P, k, z) of k-center clustering with outliers, and ε > 0.
  
  1. Maintain a set E that is empty at the beginning.

  2. Initially, Q = P; randomly select one point from Q and add it to E.

  3. Run the following steps until |E| = k:

    1. Let Q be the set of the (1 + ε)z farthest points of P to the current E.

    2. Randomly select one point from Q and add it to E.

  Output E.
Algorithm 2 2-Approximation Algorithm
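A minimal NumPy sketch of Algorithm 2, together with the repetition wrapper discussed below (the farthest-subset size (1 + ε)z is again an assumed choice of this illustration):

    import numpy as np

    def single_run_k_center_outliers(P, k, z, eps=0.5, rng=None):
        # Sketch of Algorithm 2: k rounds, one random point per round,
        # drawn from the (1+eps)*z farthest points to the current E.
        rng = np.random.default_rng() if rng is None else rng
        n = P.shape[0]
        far = min(n - 1, int(np.ceil((1 + eps) * z)))
        E = [int(rng.integers(n))]                        # first point sampled from the whole P
        dist_to_E = np.linalg.norm(P - P[E[0]], axis=1)
        while len(E) < k:
            Q = np.argsort(dist_to_E)[-far:]              # the farthest points to E
            j = int(rng.choice(Q))
            E.append(j)
            dist_to_E = np.minimum(dist_to_E, np.linalg.norm(P - P[j], axis=1))
        radius = float(np.sort(dist_to_E)[-(far + 1)])    # cost excluding the far farthest points
        return E, radius

    def repeated_runs(P, k, z, eps=0.5, repeats=100, seed=0):
        # Repeat Algorithm 2 and keep the best outcome; this boosts the
        # success probability of a single run (see Theorem 2.2 below).
        rng = np.random.default_rng(seed)
        runs = (single_run_k_center_outliers(P, k, z, eps, rng) for _ in range(repeats))
        return min(runs, key=lambda res: res[1])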

Denote by e_1, e_2, ..., e_k the sampled points of E in the order of their selection. Actually, the proof of Theorem 2.2 is almost identical to that of Lemma 2.1. The only difference is that the probability of (2) is now at least ε/(1 + ε), since only one point is sampled from Q in each round. Also note that e_1 ∈ P_opt with probability 1 − γ. If all of these events happen, we either obtain a 2-approximation before the k rounds finish (i.e., (1) of Lemma 2.1 happens in some round), or the k points of E fall into the k optimal clusters separately (i.e., λ = k). No matter which case happens, we always obtain a 2-approximation. So we have Theorem 2.2.

{theorem}

With probability at least (1 − γ)(ε/(1 + ε))^{k−1}, Algorithm 2 yields a 2-approximation for the problem of (k, z)_ε-center clustering on P. The running time is O(k·nd).

To boost the success probability of Theorem 2.2, we just need to repeat the algorithm. Due to the space limit, please refer to the supplement for the proof of Theorem 2.2.

{theorem}

If we run Algorithm 2 O((1/(1 − γ)) · ((1 + ε)/ε)^{k−1}) times, then with constant probability, at least one of the runs yields a 2-approximation.

Similar to Theorem 2.1, we also consider practical instances. We show that the quality of Theorem 2.2 can be preserved even when excluding exactly z outliers, if the optimal clusters are "well separated". In fact, this property has been widely studied in previous clustering work and is believed to be common for practical instances [39, 20]. Let c_1, ..., c_k be the cluster (ball) centers of the optimal clusters C_1, ..., C_k.

{theorem}

Suppose that each optimal cluster has size at least (1 + ε)z and ||c_{l_1} − c_{l_2}|| > 4 r_opt for any l_1 ≠ l_2. Then with probability at least (1 − γ)(ε/(1 + ε))^{k−1}, the result of Algorithm 2 yields a 2-approximation for the problem of k-center clustering with outliers on P (excluding exactly z outliers).

Proof.

Initially, we know that e_1 ∈ P_opt with probability 1 − γ. Suppose that at the beginning of the j-th round of Algorithm 2 with j ≤ k, E already contains j − 1 points separately falling in j − 1 distinct optimal clusters; also, we still let J be the set of the indices of these clusters. Then we have the following claim.

Claim 1.

Q ∩ C_l = ∅ for any l ∈ J, and |Q ∩ (⋃_{l ∉ J} C_l)| ≥ εz.

Proof.

For any p ∈ C_{l'} with l' ∉ J and any e ∈ E ∩ C_l with l ∈ J, we have

||p − e|| ≥ ||c_{l'} − c_l|| − ||p − c_{l'}|| − ||e − c_l|| > 4 r_opt − r_opt − r_opt = 2 r_opt   (4)

from the triangle inequality and the assumption ||c_{l_1} − c_{l_2}|| > 4 r_opt for any l_1 ≠ l_2 (see the right of Figure 1). In addition, for any p ∈ C_l with l ∈ J, we have

dist(p, E) ≤ ||p − c_l|| + ||c_l − e|| ≤ 2 r_opt,   (5)

where e is the point of E lying in C_l.

We consider two cases. If min_{q ∈ Q} dist(q, E) ≤ 2 r_opt at the current round, then every point outside Q has distance at most 2 r_opt to E, and (4) directly implies that every point of ⋃_{l ∉ J} C_l must belong to Q; thus, by the assumption that any optimal cluster has size at least (1 + ε)z = |Q|, the set Q is entirely contained in ⋃_{l ∉ J} C_l, so both statements of the claim hold.

Otherwise, min_{q ∈ Q} dist(q, E) > 2 r_opt. Then Q ∩ C_l = ∅ for any l ∈ J by (5). Moreover, since there are only z outliers and |Q| = (1 + ε)z, we know that |Q ∩ (⋃_{l ∉ J} C_l)| ≥ εz. ∎

Claim 1 reveals that, with probability at least ε/(1 + ε), the newly added point falls in ⋃_{l ∉ J} C_l, i.e., it hits an optimal cluster that has not been touched before. Overall, the k points of E fall into the k optimal clusters separately, i.e., E yields a 2-approximation of k-center clustering with outliers (excluding exactly z outliers), with probability at least (1 − γ)(ε/(1 + ε))^{k−1}. ∎

3 Coreset Construction in Doubling Metrics

In this section, we show some extensions of Algorithm 1 for the case where the inliers have a constant doubling dimension ρ. Notation: in the following analysis, we always assume that the assumption of Definition 1.2 holds by default.

From Definition 1.2, we directly know that each optimal cluster C_j of P_opt can be covered by 2^ρ balls with radius r_opt/2 (see the left figure in Figure 2). Imagine a new instance having 2^ρ·k clusters, where the optimal radius is at most r_opt/2. Therefore, we can just replace k by 2^ρ·k when running Algorithm 1, so as to reduce the approximation ratio on the clustering cost from 2 to 1 (i.e., the resulting radius is at most r_opt, at the price of using more cluster centers).

Figure 2: Illustrations for Theorem 3 and 3.1.
{theorem}

If we replace k by 2^ρ·k (and set the number of rounds t = O(2^ρ·k) accordingly) for Algorithm 1, then with constant probability the resulting clustering cost is at most 2·(r_opt/2) = r_opt. So the set E is a bi-criteria approximation with radius at most r_opt for the problem of (k, z)_ε-center clustering on P, and the running time remains linear in the input size.

3.1 Coresets for k-Center Clustering with Outliers

Inspired by Theorem 3, we can further construct a coreset for the problem of k-center clustering with outliers. Let ε ∈ (0, 1). By applying Definition 1.2 recursively, we know that each C_j is covered by roughly (2/ε)^ρ balls with radius (ε/2)·r_opt, and P_opt is covered by roughly k·(2/ε)^ρ such balls in total. See the right figure in Figure 2. We have Algorithm 3 for constructing the ε-coreset.

  Input: An instance (P, k, z) of k-center clustering with outliers, and ε ∈ (0, 1); the parameters ρ and η.
  
  1. Let k' = k·(2/ε)^ρ.

  2. Set the number of clusters to k' and run Algorithm 1 for t = O(k') rounds. Record the value r̃, the maximum distance between P and E obtained by excluding the (1 + ε)z farthest points, after the final round of Algorithm 1.

  3. Let Q be the set of the (1 + ε)z farthest points of P to E.

  4. For each point p ∈ P \ Q, assign it to its nearest neighbor in E; for each point q ∈ E, let its weight be the number of points assigned to it.

  5. Add Q to E with each point of Q having weight 1.

  Output the weighted set E as the coreset S.
Algorithm 3 The Coreset Construction
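A compact NumPy sketch of the coreset construction; the replacement k → k·(2/ε)^ρ, the round and sample counts, and the number of individually kept farthest points follow the discussion above and are assumptions of this illustration:

    import numpy as np

    def k_center_coreset_with_outliers(P, k, z, eps=0.25, rho=2, eta=0.1, seed=0):
        # Sketch of Algorithm 3: build representatives E by the farthest-subset
        # sampling of Algorithm 1 with k replaced by k*(2/eps)**rho, snap every
        # non-farthest point to its nearest representative, and keep the
        # (1+eps)*z farthest points individually with weight 1.
        rng = np.random.default_rng(seed)
        n = P.shape[0]
        k_prime = int(np.ceil(k * (2.0 / eps) ** rho))            # "virtual" number of clusters (assumed)
        m = int(np.ceil((1 + 1.0 / eps) * np.log(1.0 / eta)))     # per-round sample size (assumed)
        far = min(n - 1, int(np.ceil((1 + eps) * z)))
        E = [int(rng.integers(n))]
        dist_to_E = np.linalg.norm(P - P[E[0]], axis=1)
        for _ in range(2 * k_prime):                              # t = O(k') rounds (assumed)
            Q = np.argsort(dist_to_E)[-far:]
            for j in rng.choice(Q, size=min(m, len(Q)), replace=False):
                E.append(int(j))
                dist_to_E = np.minimum(dist_to_E, np.linalg.norm(P - P[j], axis=1))
        reps = P[E]
        D = np.linalg.norm(P[:, None, :] - reps[None, :, :], axis=2)
        d, nearest = D.min(axis=1), D.argmin(axis=1)
        far_idx = np.argsort(d)[-far:]                            # kept individually (outlier candidates)
        keep = np.ones(n, dtype=bool)
        keep[far_idx] = False
        weights = np.bincount(nearest[keep], minlength=len(E)).astype(float)
        points = np.vstack([reps, P[far_idx]])
        return points, np.concatenate([weights, np.ones(len(far_idx))])

Representatives that receive no assigned points get weight zero and may simply be dropped; the returned pair (points, weights) plays the role of the weighted coreset S.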
{theorem}

With constant probability, Algorithm 3 outputs an ε-coreset of k-center clustering with outliers on P. The size of the coreset is at most (1 + ε)z + O(k·(2/ε)^ρ), and the construction time is linear in the input size.

{remark} (1) Compared with the uniform sampling approaches [15, 35], our coreset size is independent of the Euclidean dimension d (as well as the input size n). Moreover, another benefit is that our coreset works for removing exactly z outliers. Consequently, our coreset can be used by existing algorithms for k-center clustering with outliers, such as [14], to reduce their complexities. The previous ideas based on uniform sampling cannot get rid of the violation on the number of outliers, and their sample sizes go to infinity if we do not allow removing more than z outliers.

(2) Another feature is that our coreset is a natural composable coreset. If P is partitioned into multiple parts, we can run Algorithm 3 on each part separately, and the union of the obtained coresets is a coreset of P whose size is simply the sum of the individual sizes (the proof is almost identical to the proof of Theorem 3.1 below). So our coreset construction can also handle distributed clustering with outliers.

(3) Very recently, [11] also provided a coreset for k-center clustering with outliers in doubling metrics, but both their coreset size and their construction time are considerably larger. Thus our result in Theorem 3.1 is a significant improvement in terms of the coreset size and the construction time.

(4) The coefficient of z in the coreset size actually can be further reduced (to be arbitrarily close to 1) by modifying the number of excluded farthest points in Step 2 of Algorithm 3; in general, this yields a trade-off between the coefficient of z and the remaining term of the coreset size, with the construction time changing accordingly.

Proof of Theorem 3.1.

Similar to Theorem 3, we know that r̃ ≤ ε·r_opt (the value recorded in Step 2 of Algorithm 3) with constant probability, because the imagined instance with k' = k·(2/ε)^ρ clusters has optimal radius at most (ε/2)·r_opt. Thus, the size of the output S is at most (1 + ε)z + O(k·(2/ε)^ρ). Moreover, it is easy to see that the running time of Algorithm 3 is linear in the input size.

Next, we show that S is an ε-coreset of P. For each point q ∈ S, denote by w(q) the weight of q; for the sake of convenience in our proof, we view each q as a set of w(q) overlapping unit-weight points. Thus, from the construction of S, we can see that there is a bijective mapping f between P and S, where

||p − f(p)|| ≤ r̃ ≤ ε·r_opt for each p ∈ P.   (6)

Let H = {h_1, ..., h_k} be any k points in the space. Suppose that H induces the clusters A_1, ..., A_k of P (resp., A'_1, ..., A'_k of S) with respect to the problem of k-center clustering with outliers on P (resp., S), where each A_j (resp., A'_j) has the cluster center h_j for 1 ≤ j ≤ k. Let r_P = φ(P, H) and r_S = φ(S, H), respectively; that is, r_P (resp., r_S) is the smallest value r' such that A_j ⊆ B(h_j, r') (resp., A'_j ⊆ B(h_j, r')) for every 1 ≤ j ≤ k. We need the following claim.

Claim 2.

r_S ≤ r_P + ε·r_opt and r_P ≤ r_S + ε·r_opt.

Proof.

We just need to prove the first inequality since the other one can be obtained in the same manner.

Because the clusters A_1, ..., A_k exclude exactly z points of P and each point is moved by a distance at most ε·r_opt based on (6), the images f(A_1), ..., f(A_k) exclude a total weight of exactly z from S, i.e., they form a feasible solution of the instance S with respect to the cluster centers in H.

Let q be the point realizing the cost of this feasible solution, that is, there exists some j such that q ∈ f(A_j) and ||q − h_j|| equals this cost. The triangle inequality and (6) together imply ||q − h_j|| ≤ ||f^{-1}(q) − h_j|| + ε·r_opt ≤ r_P + ε·r_opt. Hence the cost of the feasible solution is at most r_P + ε·r_opt.

Overall, we have r_S ≤ r_P + ε·r_opt. ∎

In addition, since the clusters A_1, ..., A_k also form a feasible solution for the instance of k-center clustering with outliers on P with respect to the cluster centers in H, the optimal cost cannot exceed their radius, i.e.,

r_opt ≤ r_P.   (7)

Similarly, combining (7) with Claim 2, we have

r_S ≥ r_P − ε·r_opt ≥ (1 − ε)·r_P.   (8)

Combining Claim 2, (7) and (8), we also have r_S ≤ r_P + ε·r_opt ≤ (1 + ε)·r_P, that is

(1 − ε)·φ(P, H) ≤ φ(S, H) ≤ (1 + ε)·φ(P, H).   (9)

Consequently, S is an ε-coreset of P. ∎

Though the above obtained coreset can speed up the algorithms for k-center clustering with outliers (removing exactly z outliers), we also wonder whether it can be used to handle (k, z)_ε-center clustering; namely, does an analogue of (9) hold when the clustering costs exclude (1 + ε)z outliers instead of z?

Actually this can be shown by almost the same idea as the proof of Theorem 3.1. We just need to let A_1, ..., A_k (resp., A'_1, ..., A'_k) be the clusters of P (resp., S) induced when (1 + ε)z outliers are excluded; correspondingly, r_P and r_S become the corresponding clustering costs. As a consequence, Claim 2 still holds, so the clustering costs of P and S with respect to any H again differ by at most ε·r_opt. So we can run Algorithm 2 on the coreset to compute an approximate solution for (k, z)_ε-center clustering.

{corollary}

Given ε ∈ (0, 1), there exists an algorithm yielding a (2 + O(ε))-approximation of (k, z)_ε-center clustering in

O(nd·k·(2/ε)^ρ) + T(|S|)   (10)

time, with constant probability, where the first term is the time for constructing the coreset S and T(|S|) denotes the running time of the repeated runs of Algorithm 2 on S (which depends on |S|, k, z, and ε, but not on n). Comparing with Theorems 2.2 and 2.2, we replace the input size n by the size of the coreset in (10), while adding an extra linear term for constructing the coreset. The proof of the approximation ratio is shown in the supplement.

3.2 Application for DBSCAN

Given two parameters r > 0 and MinPts ∈ Z^+ (usually, DBSCAN uses the symbol ε to denote the radius; since ε has another meaning in this paper, we use r instead), we denote an instance of DBSCAN as (P, r, MinPts). DBSCAN divides the points of P into three classes:

  1. p is a core point, if |B(p, r) ∩ P| ≥ MinPts;

  2. p is a border point, if p is not a core point but p ∈ B(q, r) for some core point q;

  3. the remaining points are all outliers.

To define a cluster of DBSCAN, we need the following definition.

{definition}

[Density-reachable] We say a point p is density-reachable from a core point q, if there exists a sequence of points p_1, p_2, ..., p_m such that:

  • p_1 = q and p_m = p;

  • p_1, p_2, ..., p_{m−1} are all core points;

  • p_{i+1} ∈ B(p_i, r) for each 1 ≤ i ≤ m − 1.

If one arbitrarily picks a core point q, then DBSCAN defines the corresponding cluster

C(q) = { p ∈ P | p is density-reachable from q }.   (11)

Namely, the cluster is the maximal subset of P containing the points that are density-reachable from q. The cluster may contain both core and border points. Also, the cluster is uniquely defined by any of its core points; that is, any two core points q_1 and q_2 define exactly the same cluster if they are density-reachable from each other. Since MinPts is always fixed in our context, we simply use C(P, r) and out(P, r) to denote the set of clusters and the set of outliers of the instance (P, r, MinPts), respectively.
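To make the above definitions concrete, here is a straightforward quadratic-time implementation (our illustration; it is exactly this O(n^2) behavior in high dimension that motivates running DBSCAN on the coreset):

    import numpy as np
    from collections import deque

    def dbscan_naive(P, r, min_pts):
        # Naive O(n^2) DBSCAN: returns (labels, is_core), where labels[i] is the
        # cluster id of point i, or -1 if point i ends up as an outlier.
        n = P.shape[0]
        D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)     # pairwise distances
        neighbors = [np.flatnonzero(D[i] <= r) for i in range(n)]     # points within distance r (incl. i)
        is_core = np.array([len(nb) >= min_pts for nb in neighbors])
        labels = np.full(n, -1)
        cid = 0
        for i in range(n):
            if not is_core[i] or labels[i] != -1:
                continue
            labels[i] = cid
            queue = deque([i])                   # grow the cluster of core point i
            while queue:
                q = queue.popleft()
                for j in neighbors[q]:           # every point within r of a core point joins the cluster
                    if labels[j] == -1:
                        labels[j] = cid
                        if is_core[j]:
                            queue.append(j)      # only core points propagate density-reachability
            cid += 1
        return labels, is_core

A border point lying within distance r of core points from two different clusters is assigned to whichever cluster reaches it first, which is the usual DBSCAN convention.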

Here, we show that the coreset construction method of Section 3.1 can be applied to reduce the data size for DBSCAN. We again follow the assumption in Definition 1.2, where the inliers now include the core and border points. Similar to Theorem 3.1, we run Algorithm 3 and obtain a much smaller set S, such that existing DBSCAN algorithms can be applied to S and yield lower time complexities. To realize this idea, we consider the following two questions.

(1) What is the value of t for Step 2 of Algorithm 3, that is, how many rounds do we need for Algorithm 1? Let Δ be the diameter of the set of core and border points, i.e., the distance between the farthest two such points. Based on the property of the doubling dimension, we know that the set of core and border points can be covered by roughly (2Δ/(εr))^ρ balls with radius (ε/2)·r. Thus, we set t to be of this order (playing the role of k' in Algorithm 3).

(2) Unlike Theorem 3.1, the values of some parameters are unknown for DBSCAN, e.g., the number of outliers z and the diameter Δ. So we have to assume that some upper bounds are given in practice. For example, let z ≤ z̄ and Δ ≤ Δ̄, where z̄ and Δ̄ are given values used in place of the exact quantities when running Algorithm 3.

To state our result clearly, we need to define a relation between two clusterings. Let U = {U_1, ..., U_{s_1}} and V = {V_1, ..., V_{s_2}}, where each U_i with 1 ≤ i ≤ s_1 (resp., V_j with 1 ≤ j ≤ s_2) is a cluster of elements. We say U ⊑ V, if for any U_i ∈ U, there exists a V_j ∈ V such that U_i ⊆ V_j. For example, if U = {{a}, {b}, {c, d}} and V = {{a, b}, {c, d}}, then U ⊑ V.
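The relation is easy to check programmatically; a small helper (our illustration, with clusters represented as Python sets of element ids):

    def is_refined_by(U, V):
        # U ⊑ V: every cluster of U is contained in some cluster of V.
        return all(any(u <= v for v in V) for u in U)

    # e.g., is_refined_by([{'a'}, {'b'}, {'c', 'd'}], [{'a', 'b'}, {'c', 'd'}]) returns True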

Following the proof of Theorem 3.1, let f be the bijective mapping between P and S.

{theorem}

Given an instance (P, r, MinPts), with constant probability, Algorithm 3 (with the parameters set as discussed above) outputs a set S such that

C(P, (1 − 2ε)r) ⊑ { f^{-1}(U) | U ∈ C(S, r) } ⊑ C(P, (1 + 2ε)r),   (12)
out(P, (1 + 2ε)r) ⊆ f^{-1}(out(S, r)) ⊆ out(P, (1 − 2ε)r).   (13)

The size of S is independent of the input size n (it is determined by the upper bounds z̄ and Δ̄/(εr) together with ρ), and the construction time is linear in the input size. {remark} Theorem 3.2 reveals that the result returned on S is "bounded" by the results of the instances (P, (1 − 2ε)r) and (P, (1 + 2ε)r): every cluster of C(P, (1 − 2ε)r) is contained in the pre-image of some cluster of C(S, r), and the pre-image of every cluster of C(S, r) is contained in some cluster of C(P, (1 + 2ε)r). As mentioned in [28], "It is well-known that the clusters of DBSCAN rarely differ considerably when r changes by just a small factor. In fact, if this really happens, it suggests that the choice of r is very bad, such that the exact clusters are not stable anyway."

Proof of Theorem 3.2.

Let p ∈ P be a point such that f(p) is a core point of the instance (S, r). Thus,

|B(f(p), r) ∩ S| ≥ MinPts,   (14)

where the points of S are counted with their weights. As mentioned before, the balls covering the core and border points now have radius (ε/2)·r; therefore, the right-hand side of (6) should be ε·r instead. Through the triangle inequality, we know that

B(f(p), r) ∩ S ⊆ f( B(p, (1 + 2ε)r) ∩ P ),   (15)

because ||p − f(p)|| ≤ ε·r and the points covered by B(f(p), r) all move by a distance at most ε·r. (14) and (15) imply that

|B(p, (1 + 2ε)r) ∩ P| ≥ MinPts.   (16)

Therefore, p is a core point of (P, (1 + 2ε)r). Let U be the cluster of C(S, r) that contains f(p), and let p' be any point of P such that f(p') ∈ U. Then f(p') is density-reachable from f(p) with respect to (S, r). Using the triangle inequality again, we know that p' is density-reachable from p with respect to (P, (1 + 2ε)r). That is, p' belongs to the cluster of C(P, (1 + 2ε)r) that contains p, say V. So we have

f^{-1}(U) ⊆ V.   (17)

Consequently, { f^{-1}(U) | U ∈ C(S, r) } ⊑ C(P, (1 + 2ε)r). The other side of (12) can be proved in the same manner.

Since the outliers are exactly the points remaining after removing the core and border points, together with (12), we know that (13) is true. ∎

4 k-Median/Means Clustering

Due to the space limit, we only overview our main idea and place the details in our supplement. For the problems of k-median/means clustering with outliers, the objective function of Definition 1.2 is replaced by the sum of the distances (respectively, the sum of the squared distances) from the points of P' to their nearest cluster centers. By Markov's inequality, we know that a large part of P_opt must be covered by k balls with some bounded radius. Therefore, we can convert the problems into instances of k-center clustering with outliers. As a consequence, Algorithms 1 and 2 are applicable to the cases of k-median/means, though the resulting approximation ratios are higher than those in Theorems 2.1 and 2.2. Similar to Theorem 3, if the assumption of Definition 1.2 holds, we can also further reduce the approximation ratios with respect to the clustering costs.
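For example, for k-means with outliers the conversion can be sketched as follows (our notation: Δ_opt denotes the optimal k-means cost over the n − z inliers and c_1, ..., c_k the optimal centers). By Markov's inequality, for any λ > 1,

    |{ p ∈ P_opt : min_{1 ≤ j ≤ k} ||p − c_j||² > λ·Δ_opt/(n − z) }| ≤ (n − z)/λ,

i.e., all but a 1/λ fraction of the inliers are covered by k balls of radius sqrt(λ·Δ_opt/(n − z)); treating the remaining inliers as additional outliers yields an instance of k-center clustering with outliers, to which Algorithms 1 and 2 can be applied.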

5 Future Work

Following our work, several interesting problems deserve to be studied in the future. For example, is there any lower bound on the size of the coresets for k-center clustering with outliers? In addition, can the coreset construction time of Algorithm 3 be further improved, e.g., by the fast net construction method proposed by [33] for doubling metrics? It is also interesting to consider solving other optimization problems involving outliers by using the greedy strategy studied in this paper.

References

  • [1] Ankit Aggarwal, Amit Deshpande, and Ravi Kannan. Adaptive sampling for k-means clustering. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 15–28. Springer, 2009.
  • [2] Sepideh Aghamolaei and Mohammad Ghodsi. A composable coreset for k-center in doubling metrics. In Proceedings of the 30th Canadian Conference on Computational Geometry, CCCG 2018, August 8-10, 2018, University of Manitoba, Winnipeg, Manitoba, Canada, pages 165–171, 2018.
  • [3] Noga Alon, Seannie Dar, Michal Parnas, and Dana Ron. Testing of clustering. SIAM Journal on Discrete Mathematics, 16(3):393–417, 2003.
  • [4] Noga Alon and Joel H Spencer. The probabilistic method. John Wiley & Sons, 2004.
  • [5] Vijay Arya, Naveen Garg, Rohit Khandekar, Adam Meyerson, Kamesh Munagala, and Vinayaka Pandit. Local search heuristics for k-median and facility location problems. SIAM Journal on computing, 33(3):544–562, 2004.
  • [6] Pranjal Awasthi and Maria-Florina Balcan. Center based clustering: A foundational perspective. 2014.
  • [7] Olivier Bachem, Mario Lucic, and Andreas Krause. Practical coreset constructions for machine learning. arXiv preprint arXiv:1703.06476, 2017.
  • [8] Mihai Badoiu, Sariel Har-Peled, and Piotr Indyk. Approximate clustering via core-sets. In Proceedings of the ACM Symposium on Theory of Computing (STOC), pages 250–257, 2002.
  • [9] Mikhail Belkin. Problems of learning on manifolds. The University of Chicago, 2003.
  • [10] Manuel Blum, Robert W. Floyd, Vaughan Pratt, Ronald L. Rivest, and Robert E. Tarjan. Time bounds for selection. Journal of Computer and System Sciences, 7(4):448–461, 1973.
  • [11] Matteo Ceccarello, Andrea Pietracaprina, and Geppino Pucci. Solving k-center clustering (with outliers) in mapreduce and streaming, almost as accurately as sequentially. CoRR, abs/1802.09205, 2018.
  • [12] Deeparnab Chakrabarty, Prachi Goyal, and Ravishankar Krishnaswamy. The non-uniform k-center problem. In 43rd International Colloquium on Automata, Languages, and Programming, ICALP 2016, July 11-15, 2016, Rome, Italy, pages 67:1–67:15, 2016.
  • [13] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3):15, 2009.
  • [14] Moses Charikar, Samir Khuller, David M Mount, and Giri Narasimhan. Algorithms for facility location problems with outliers. In Proceedings of the twelfth annual ACM-SIAM symposium on Discrete algorithms, pages 642–651. Society for Industrial and Applied Mathematics, 2001.
  • [15] Moses Charikar, Liadan O’Callaghan, and Rina Panigrahy. Better streaming algorithms for clustering problems. In Proceedings of the thirty-fifth annual ACM symposium on Theory of computing, pages 30–39. ACM, 2003.
  • [16] Sanjay Chawla and Aristides Gionis. k-means–: A unified approach to clustering and outlier detection. In Proceedings of the 2013 SIAM International Conference on Data Mining, pages 189–197. SIAM, 2013.
  • [17] Danny Z. Chen, Michiel H. M. Smid, and Bin Xu. Geometric algorithms for density-based data clustering. Int. J. Comput. Geometry Appl., 15(3):239–260, 2005.
  • [18] Ke Chen. A constant factor approximation algorithm for k-median clustering with outliers. In Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms, pages 826–835. Society for Industrial and Applied Mathematics, 2008.
  • [19] Vincent Cohen-Addad, Philip N Klein, and Claire Mathieu. Local search yields approximation schemes for k-means and k-median in euclidean and minor-free metrics. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 353–364. IEEE, 2016.
  • [20] Amit Daniely, Nati Linial, and Michael E. Saks. Clustering is difficult only when it does not matter. CoRR, abs/1205.4891, 2012.
  • [21] Mark de Berg, Ade Gunawan, and Marcel Roeloffzen. Faster dbscan and hdbscan in low-dimensional euclidean spaces. In 28th International Symposium on Algorithms and Computation, ISAAC 2017, December 9-12, 2017, Phuket, Thailand, pages 25:1–25:13, 2017.
  • [22] Hu Ding and Jinhui Xu. Random gradient descent tree: A combinatorial approach for SVM with outliers. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30, 2015, Austin, Texas, USA., pages 2561–2567, 2015.
  • [23] Hu Ding and Mingquan Ye. Solving minimum enclosing ball with outliers: Algorithm, implementation, and application. CoRR, abs/1804.09653, 2018.
  • [24] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, Oregon, USA, pages 226–231, 1996.
  • [25] Dan Feldman and Michael Langberg. A unified framework for approximating and clustering data. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pages 569–578. ACM, 2011.
  • [26] Zachary Friggstad, Kamyar Khodamoradi, Mohsen Rezapour, and Mohammad R Salavatipour. Approximation schemes for clustering with outliers. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 398–414. SIAM, 2018.
  • [27] Zachary Friggstad, Mohsen Rezapour, and Mohammad R Salavatipour. Local search yields a ptas for k-means in doubling metrics. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 365–374. IEEE, 2016.
  • [28] Junhao Gan and Yufei Tao. Dbscan revisited: mis-claim, un-fixability, and approximation. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 519–530. ACM, 2015.
  • [29] Teofilo F Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38:293–306, 1985.
  • [30] Sudipto Guha, Yi Li, and Qin Zhang. Distributed partial clustering. In Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, pages 143–152. ACM, 2017.
  • [31] Ade Gunawan. A faster algorithm for dbscan. Master’s thesis. Eindhoven University of Technology, the Netherlands, 2013.
  • [32] Shalmoli Gupta, Ravi Kumar, Kefu Lu, Benjamin Moseley, and Sergei Vassilvitskii. Local search methods for k-means with outliers. Proceedings of the VLDB Endowment, 10(7):757–768, 2017.
  • [33] Sariel Har-Peled and Manor Mendel. Fast construction of nets in low-dimensional metrics and their applications. SIAM Journal on Computing, 35(5):1148–1184, 2006.
  • [34] Dorit S Hochbaum and David B Shmoys. A best possible heuristic for the k-center problem. Mathematics of operations research, 10(2):180–184, 1985.
  • [35] Lingxiao Huang, Shaofeng Jiang, Jian Li, and Xuan Wu. Epsilon-coresets for clustering (with outliers) in doubling metrics. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 814–825. IEEE, 2018.
  • [36] Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, and Vahab S. Mirrokni. Composable core-sets for diversity and coverage maximization. In Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS’14, Snowbird, UT, USA, June 22-27, 2014, pages 100–108, 2014.
  • [37] Anil K Jain. Data clustering: 50 years beyond k-means. Pattern recognition letters, 31(8):651–666, 2010.
  • [38] Ragesh Jaiswal, Mehul Kumar, and Pulkit Yadav. Improved analysis of d2-sampling based ptas for k-means and other clustering problems. Information Processing Letters, 115(2):100–103, 2015.
  • [39] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. An efficient k-means clustering algorithm: Analysis and implementation. IEEE Trans. Pattern Anal. Mach. Intell., 24(7):881–892, 2002.
  • [40] Tapas Kanungo, David M Mount, Nathan S Netanyahu, Christine D Piatko, Ruth Silverman, and Angela Y Wu. A local search approximation algorithm for k-means clustering. Computational Geometry, 28(2-3):89–112, 2004.
  • [41] Stavros G Kolliopoulos and Satish Rao. A nearly linear-time approximation scheme for the euclidean k-median problem. SIAM Journal on Computing, 37(3):757–782, 2007.
  • [42] Hans-Peter Kriegel, Peer Kröger, and Arthur Zimek. Outlier detection techniques. Tutorial at PAKDD, 2009.
  • [43] Ravishankar Krishnaswamy, Shi Li, and Sai Sandeep. Constant approximation for k-median and k-means with outliers via iterative rounding. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 646–659. ACM, 2018.
  • [44] Amit Kumar, Yogish Sabharwal, and Sandeep Sen. Linear-time approximation schemes for clustering problems in any dimensions. Journal of the ACM (JACM), 57(2):5, 2010.
  • [45] Shi Li and Xiangyu Guo. Distributed k-clustering for data with heavy noise. In Advances in Neural Information Processing Systems, pages 7849–7857, 2018.
  • [46] Gustavo Malkomes, Matt J Kusner, Wenlin Chen, Kilian Q Weinberger, and Benjamin Moseley. Fast distributed k-center clustering with outliers on massive data. In Advances in Neural Information Processing Systems, pages 1063–1071, 2015.
  • [47] Richard Matthew McCutchen and Samir Khuller. Streaming algorithms for k-center clustering with outliers and with anonymity. In Approximation, Randomization and Combinatorial Optimization. Algorithms and Techniques, pages 165–178. Springer, 2008.