Clustering under Local Stability: Bridging the Gap between Worst-Case and Beyond Worst-Case Analysis
Authors’ addresses: firstname.lastname@example.org, email@example.com. This work was supported in part by grants NSF-CCF 1535967, NSF CCF-1422910, NSF IIS-1618714, a Sloan Fellowship, a Microsoft Research Fellowship, and a National Defense Science and Engineering Graduate (NDSEG) fellowship.
Recently, there has been substantial interest in clustering research that takes a beyond worst-case approach to the analysis of algorithms. The typical idea is to design a clustering algorithm that outputs a near-optimal solution, provided the data satisfy a natural stability notion. For example, Bilu and Linial (2010) and Awasthi et al. (2012) presented algorithms that output near-optimal solutions, assuming the optimal solution is preserved under small perturbations to the input distances. A drawback to this approach is that the algorithms are often explicitly built according to the stability assumption and give no guarantees in the worst case; indeed, several recent algorithms output arbitrarily bad solutions even when just a small section of the data does not satisfy the given stability notion.
In this work, we address this concern in two ways. First, we provide algorithms that inherit the worst-case guarantees of clustering approximation algorithms, while simultaneously guaranteeing near-optimal solutions when the data is stable. Our algorithms are natural modifications to existing state-of-the-art approximation algorithms. Second, we initiate the study of local stability, which is a property of a single optimal cluster rather than an entire optimal solution. We show our algorithms output all optimal clusters that satisfy stability locally. Specifically, we achieve strong positive results in our local framework under recent stability notions including metric perturbation resilience (Angelidakis et al. 2017) and robust perturbation resilience (Balcan and Liang 2012) for the $k$-median, $k$-means, and symmetric/asymmetric $k$-center objectives.
Clustering is a fundamental problem in combinatorial optimization with numerous real-life applications in areas ranging from bioinformatics to computer vision to text analysis. The underlying goal is to group a given set of points to maximize similarity inside a group and minimize similarity among groups. A common approach to clustering is to set up an objective function and then approximately find the optimal solution according to the objective. Given a set of points and a distance metric, common clustering objectives include finding $k$ centers to minimize the sum of the distance, or squared distance, from each point to its closest center ($k$-median and $k$-means, respectively), or to minimize the maximum distance from a point to its closest center ($k$-center). These popular objective functions are provably NP-hard to optimize [23, 28, 32], so research has focused on finding approximation algorithms. This has attracted significant attention in the theoretical computer science community [5, 15, 16, 17, 18, 23, 33].
Traditionally, the theory of clustering (and more generally, the theory of algorithms) has focused on the analysis of worst-case instances. While this approach has led to many elegant algorithms and lower bounds, it is often overly pessimistic about an algorithm’s performance on “typical” or real-world instances. A rapidly developing line of work in the algorithms community, the so-called beyond worst-case analysis of algorithms (BWCA), considers designing algorithms for instances that satisfy some natural structural properties. BWCA has given rise to many positive results [26, 31, 37], especially for clustering problems [6, 7, 10, 30]. For example, the popular notion of $\alpha$-perturbation resilience, introduced by Bilu and Linial, informally states that the optimal solution does not change when the input distances are allowed to increase by up to a factor of $\alpha$. This definition seeks to capture a phenomenon in practice: the optimal solution is often significantly better than all other solutions, so the optimal solution does not change even when the input is slightly perturbed.
However, there are two potential downsides to this approach. The first downside is that many of these algorithms aggressively exploit the given structural assumptions, which can lead to contrived algorithms with no guarantees in the worst case. Therefore, a user can only use algorithms from this line of work if she is certain her data satisfies the assumption (even though none of these assumptions are computationally efficient to verify). The second downside is that while the algorithms return the optimal solution when the input is stable, there are no partial guarantees when most, but not all, of the data is stable. For example, the algorithms of Balcan and Liang and Angelidakis et al. return the optimal clustering when the instance is resilient to perturbations; however, both algorithms use a dynamic programming subroutine that is susceptible to errors which can propagate when a small fraction of the data does not satisfy perturbation resilience (see Appendix A for more details). From these setbacks, two natural questions arise. (1) Can we find natural algorithms that achieve the worst-case approximation ratio, while outputting the optimal solution if the data is stable? (A trivial solution is to run an approximation algorithm and a BWCA algorithm in parallel, and output the better of the two solutions. We seek a more satisfactory and “natural” answer to this question.) (2) Can we construct robust algorithms that still output good results even when only part of the data satisfies stability? The current work seeks to answer both of these questions.
1.1 Our results and techniques
In this work, we answer both questions affirmatively for a variety of clustering objectives under perturbation resilience. We present algorithms that simultaneously achieve state-of-the-art approximation ratios in the worst case, while outputting the optimal solution when the data is stable. All of our algorithms are natural modifications to existing approximation algorithms. To answer question (2), we define the notion of local perturbation resilience in Section 2. This is the first definition that applies to an individual cluster, rather than the dataset as a whole. Informally, an optimal cluster satisfies $\alpha$-local perturbation resilience if it remains in the optimal solution under any $\alpha$-perturbation to the input. We show that every optimal cluster satisfying local stability will be returned by our algorithms. Therefore, given an instance with a mix of stable and non-stable data, our algorithms will return the optimal clusters over the stable data, and a worst-case approximation guarantee over the rest of the data. Specifically, we prove the following results.
Approximation algorithms under local perturbation resilience
In Section 3, we introduce a condition that is sufficient for an $\alpha$-approximation algorithm to return the optimal $k$-median or $k$-means clusters satisfying $\alpha$-local perturbation resilience. Intuitively, our condition requires that the approximation guarantee must be true locally as well as globally. We show the popular local search algorithm satisfies the property for a sufficiently large search parameter. For $k$-center, we show that any $\alpha$-approximation algorithm for $k$-center will always return the clusters satisfying $\alpha$-local metric perturbation resilience, which is a slightly weaker notion of perturbation resilience.
The asymmetric $k$-center problem admits an $O(\log^* n)$ approximation algorithm due to Vishwanathan, which is tight. In Section 4, we show a simple modification to this algorithm ensures that it returns all optimal clusters satisfying a condition slightly stronger than 2-local perturbation resilience (all neighboring clusters must satisfy 2-local perturbation resilience). The first phase of Vishwanathan’s approximation algorithm involves iteratively removing the neighborhood of special points called center-capturing vertices (CCVs). We show the centers of locally perturbation resilient clusters are CCVs and satisfy a separation condition, by constructing 2-perturbations in which neighboring clusters cannot be too close to the locally perturbation resilient centers without causing a contradiction. This allows us to modify the approximation algorithm by first removing the neighborhood around CCVs satisfying the separation condition. With careful reasoning, we maintain the original approximation guarantee while only removing points from a single locally perturbation resilient cluster at a time.
Robust perturbation resilience
In Section 5, we consider $(\alpha,\epsilon)$-local perturbation resilience, which states that at most $\epsilon n$ points can swap into or out of the cluster under any $\alpha$-perturbation. For $k$-center, we show that any 2-approximation algorithm will return the optimal $(3,\epsilon)$-locally perturbation resilient clusters, assuming a mild lower bound on optimal cluster sizes. To prove this, we show that if points from two different locally perturbation resilient clusters are close to each other, then $k-1$ centers achieve the optimal value under a carefully constructed 3-perturbation. The rest of the analysis involves building up conditional claims dictating the possible centers for each locally perturbation resilient cluster under the 3-perturbation. We utilize the idea of a cluster-capturing center along with machinery specific to handling local perturbation resilience to show that a locally perturbation resilient cluster must split into two clusters under the 3-perturbation, causing a contradiction. Finally, we show that the mild lower bound on the cluster sizes is necessary. Specifically, we show hardness of approximation for $k$-median, $k$-means, and $k$-center, even when it is guaranteed the clustering satisfies $(\alpha,\epsilon)$-perturbation resilience for any $\alpha \ge 1$ and $\epsilon > 0$. In fact, the result holds even for a stronger notion called $(\alpha,\epsilon)$-approximation stability. The hardness is based on a reduction from the general clustering instances, so the APX-hardness constants match the worst-case APX-hardness results of 2, 1.73, and 1.0013 for $k$-center, $k$-median, and $k$-means, respectively. This generalizes prior hardness results in BWCA [10, 11].
1.2 Related work
Clustering. The first constant-factor approximation algorithm for $k$-median was given by Charikar et al., and the current best approximation ratio is 2.675 by Byrka et al. Jain et al. proved $k$-median is NP-hard to approximate to a factor better than 1.73. For $k$-center, Gonzalez showed a tight 2-approximation algorithm. For $k$-means, the best approximation ratio was recently lowered to 6.357 by Ahmadian et al. $k$-means was shown to be APX-hard by Awasthi et al., and the constant was recently improved to 1.0013.
Perturbation resilience. Perturbation resilience was introduced by Bilu and Linial, who showed algorithms that outputted the optimal solution for max cut under -perturbation resilience . This result was improved by Makarychev et al. , who showed the standard SDP relaxation is integral for -perturbation resilient instances. They also show an optimal algorithm for minimum multiway cut under 4-perturbation resilience. The study of clustering under perturbation resilience was initiated by Awasthi et al., who provided an optimal algorithm for center-based clustering objectives (which includes $k$-median, $k$-means, and $k$-center clustering, as well as other objectives) under 3-perturbation resilience. This result was improved by Balcan and Liang, who showed an algorithm for center-based clustering under $(1+\sqrt{2})$-perturbation resilience. They also gave a near-optimal algorithm for $k$-median -perturbation resilience, a robust version of perturbation resilience, when the optimal clusters are not too small. Balcan et al. constructed algorithms for $k$-center and asymmetric $k$-center under 2-perturbation resilience and -perturbation resilience, and they showed no polynomial-time algorithm can solve $k$-center under -approximation stability (a notion that is stronger than perturbation resilience) unless . Recently, Angelidakis et al. gave algorithms for center-based clustering under 2-perturbation resilience and minimum multiway cut with $k$ terminals under $(2-2/k)$-perturbation resilience. They also define the more general notion of metric perturbation resilience. In Appendix A, we discuss prior work in the context of local perturbation resilience. Perturbation resilience has also been applied to other problems, such as the traveling salesman problem and finding Nash equilibria [10, 35].
Approximation stability is a related definition that is stronger than perturbation resilience. It was introduced by Balcan et al., who showed algorithms that outputted nearly optimal solutions under -approximation stability for $k$-median and $k$-means when . Balcan et al. studied a relaxed notion of approximation stability in which a specified fraction of the data satisfies approximation stability. In this setting, there may not be a unique approximation stable solution. The authors provided an algorithm which outputted a small list of clusterings, such that all approximation stable clusterings are close to one clustering in the list. We remark that the property itself is similar in spirit to local stability, although the solutions and results are quite different. Voevodski et al. gave an algorithm for empirically clustering protein sequences using the min-sum objective under approximation stability, which compares favorably to popular clustering algorithms used in practice. Gupta et al. showed an algorithm for finding near-optimal solutions for $k$-median under approximation stability in the context of finding triangle-dense graphs.
Other stability notions
Ostrovsky et al. show how to efficiently cluster instances in which the $k$-means clustering cost is much lower than the $(k-1)$-means cost. Kumar and Kannan give an efficient clustering algorithm for instances in which the projection of any point onto the line between its cluster center and any other cluster center is a large additive factor closer to its own center than to the other center. This result was later improved along multiple axes by Awasthi and Sheffet. There are many other works that show positive results for different natural notions of stability in various settings [4, 6, 25, 26, 30, 31, 37].
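A clustering instance $(S,d)$ consists of a set $S$ of $n$ points, as well as a distance function $d: S \times S \to \mathbb{R}_{\ge 0}$. For a point $u \in S$ and a set $A \subseteq S$, we define $d(u,A) = \min_{v \in A} d(u,v)$. The $k$-median, $k$-means, and $k$-center objectives are to find a set of points $X = \{x_1, \dots, x_k\} \subseteq S$ called centers to minimize $\sum_{v \in S} d(v,X)$, $\sum_{v \in S} d(v,X)^2$, and $\max_{v \in S} d(v,X)$, respectively. We denote by $\mathrm{Vor}_X(x)$ the Voronoi tile of $x \in X$ induced by $X$ on the set of points $S$, and we denote $\mathrm{Vor}_X(X') = \bigcup_{x \in X'} \mathrm{Vor}_X(x)$ for a subset $X' \subseteq X$. We refer to the Voronoi partition induced by $X$ as a clustering. Throughout the paper, we denote the clustering with minimum cost by $\mathcal{OPT} = \{C_1, \dots, C_k\}$, and we denote the optimal centers by $c_1, \dots, c_k$, where $c_i$ is the center of $C_i$ for all $1 \le i \le k$.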
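The three objectives and the Voronoi partition defined above can be sketched in a few lines. This is only an illustration: the function names, the one-dimensional example points, and the tie-breaking rule (ties go to the earliest center in the list) are assumptions made here, not part of the paper.

```python
from typing import Callable, Dict, List, Sequence

Dist = Callable[[float, float], float]


def dist_to_set(d: Dist, u: float, centers: Sequence[float]) -> float:
    """Distance from u to the closest point of the given center set."""
    return min(d(u, c) for c in centers)


def k_median_cost(d: Dist, points: Sequence[float], centers: Sequence[float]) -> float:
    """Sum of distances from each point to its closest center."""
    return sum(dist_to_set(d, v, centers) for v in points)


def k_means_cost(d: Dist, points: Sequence[float], centers: Sequence[float]) -> float:
    """Sum of squared distances from each point to its closest center."""
    return sum(dist_to_set(d, v, centers) ** 2 for v in points)


def k_center_cost(d: Dist, points: Sequence[float], centers: Sequence[float]) -> float:
    """Maximum distance from a point to its closest center."""
    return max(dist_to_set(d, v, centers) for v in points)


def voronoi_partition(
    d: Dist, points: Sequence[float], centers: Sequence[float]
) -> Dict[float, List[float]]:
    """Assign each point to its closest center (ties: earliest center in the list)."""
    tiles: Dict[float, List[float]] = {c: [] for c in centers}
    for v in points:
        closest = min(centers, key=lambda c: d(v, c))
        tiles[closest].append(v)
    return tiles


def line_metric(u: float, v: float) -> float:
    """Hypothetical example metric: points on the real line."""
    return abs(u - v)


points = [0.0, 1.0, 2.0, 10.0, 11.0]
centers = [1.0, 10.0]
clustering = voronoi_partition(line_metric, points, centers)
```

In this toy instance the two Voronoi tiles are the natural left and right groups, and the three objectives differ only in how the per-point distances are aggregated.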
All of the distance functions we study are metrics, except for Section 4, in which we study an asymmetric distance function. An asymmetric distance function satisfies all the properties of a metric space except for symmetry. In particular, an asymmetric distance function must satisfy the directed triangle inequality: for all $u, v, w \in S$, $d(u,w) \le d(u,v) + d(v,w)$.
We formally define perturbation resilience, a notion introduced by Bilu and Linial. $d'$ is called an $\alpha$-perturbation of the distance function $d$ if for all $u, v \in S$, $d(u,v) \le d'(u,v) \le \alpha d(u,v)$. (We only consider perturbations in which the distances increase because WLOG we can scale the distances to simulate decreasing distances.)
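As a small illustration of the definition of a perturbation, the check below tests whether a candidate distance function stays between the original distances and the original distances scaled by the perturbation factor. The function name, the tolerance parameter, and the example distance functions are assumptions introduced for this sketch.

```python
from typing import Callable, Sequence

Dist = Callable[[float, float], float]


def is_alpha_perturbation(
    d: Dist, d_prime: Dist, points: Sequence[float], alpha: float, tol: float = 1e-12
) -> bool:
    """Return True if d(u,v) <= d'(u,v) <= alpha * d(u,v) for all ordered pairs."""
    for u in points:
        for v in points:
            base = d(u, v)
            new = d_prime(u, v)
            if new < base - tol or new > alpha * base + tol:
                return False
    return True


def base_metric(u: float, v: float) -> float:
    """Hypothetical example metric: points on the real line."""
    return abs(u - v)


def stretched(u: float, v: float) -> float:
    """Every distance increased by a factor of 1.5."""
    return 1.5 * abs(u - v)


def shrunk(u: float, v: float) -> float:
    """Every distance decreased, which the definition does not allow."""
    return 0.9 * abs(u - v)


pts = [0.0, 1.0, 2.0, 10.0, 11.0]
```

Note that the check only verifies the two-sided bound on each pair; it does not require the perturbed distances to satisfy the triangle inequality, which matches the non-metric version of the definition.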
Definition 1. A clustering instance $(S,d)$ satisfies $\alpha$-perturbation resilience ($\alpha$-PR) if for any $\alpha$-perturbation $d'$ of $d$, the optimal clustering under $d'$ is unique and equal to $\mathcal{OPT}$.
Now we define local perturbation resilience, a property of an optimal cluster rather than a dataset.
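Definition 2. Given a clustering instance $(S,d)$ with optimal clustering $\mathcal{OPT} = \{C_1, \dots, C_k\}$, an optimal cluster $C_i$ satisfies $\alpha$-local perturbation resilience ($\alpha$-LPR) if for any $\alpha$-perturbation $d'$ of $d$, the optimal clustering under $d'$ contains $C_i$.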
We will sometimes refer to a center of an -LPR cluster as an -LPR center. Clearly, if a clustering instance is perturbation resilient, then every optimal cluster satisfies local perturbation resilience. Now we will show the converse is also true.
A clustering instance $(S,d)$ satisfies $\alpha$-PR if and only if each optimal cluster satisfies $\alpha$-LPR.
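Given a clustering instance $(S,d)$, the forward direction follows by definition: assume $(S,d)$ satisfies $\alpha$-PR, and let $C_i$ be an optimal cluster. Then for each $\alpha$-perturbation $d'$, the optimal clustering stays the same under $d'$, so $C_i$ is contained in the optimal clustering under $d'$. Now we prove the reverse direction. Given a clustering instance with optimal clustering $\mathcal{OPT}$, and given an $\alpha$-perturbation $d'$, let the optimal clustering under $d'$ be $\mathcal{C}'$. For each $C_i \in \mathcal{OPT}$, by assumption, $C_i$ satisfies $\alpha$-LPR, so $C_i \in \mathcal{C}'$. Since both clusterings partition $S$ into $k$ clusters, it follows that $\mathcal{C}' = \mathcal{OPT}$. ∎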
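In Section 4, we define a stronger version of Definition 2 specifically for $k$-center. Next, we define a more robust version of $\alpha$-PR and $\alpha$-LPR that allows a small change in the optimal clusters when the distances are perturbed. We say that two clusters $C$ and $C'$ are $\epsilon$-close if they differ by only $\epsilon n$ points, i.e., $|C \setminus C'| + |C' \setminus C| \le \epsilon n$. We say that two clusterings $\mathcal{C}$ and $\mathcal{C}'$ are $\epsilon$-close if $\min_{\sigma} \sum_{i=1}^{k} |C_i \setminus C'_{\sigma(i)}| \le \epsilon n$, where $\sigma$ ranges over matchings between the clusters of $\mathcal{C}$ and $\mathcal{C}'$.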
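Definition 4. A clustering instance $(S,d)$ satisfies $(\alpha,\epsilon)$-perturbation resilience ($(\alpha,\epsilon)$-PR) if for any $\alpha$-perturbation $d'$ of $d$, all optimal clusterings under $d'$ must be $\epsilon$-close to $\mathcal{OPT}$.
Definition 5. Given a clustering instance $(S,d)$ with optimal clustering $\mathcal{OPT}$, an optimal cluster $C_i$ satisfies $(\alpha,\epsilon)$-local perturbation resilience ($(\alpha,\epsilon)$-LPR) if for any $\alpha$-perturbation $d'$ of $d$, the optimal clustering under $d'$ contains a cluster which is $\epsilon$-close to $C_i$.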
A clustering instance satisfies -PR if and only if each optimal cluster satisfies -LPR and .
In all definitions thus far, we do not assume that the $\alpha$-perturbations satisfy the triangle inequality. Angelidakis et al. recently studied the weaker definition in which the $\alpha$-perturbations must satisfy the triangle inequality, called metric perturbation resilience. All of our definitions can be generalized accordingly, and some of our results hold under this weaker assumption. To this end, we will sometimes take the metric completion of a non-metric distance function $d'$, by setting each distance to the length of the shortest path on the graph whose edge lengths are given by $d'$.
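The metric completion described above is an all-pairs shortest-path computation. A minimal sketch using the Floyd–Warshall recurrence follows; the matrix representation (a list of lists indexed by point) and the function name are assumptions of this example. The same routine works for asymmetric inputs, where it yields distances satisfying the directed triangle inequality.

```python
from typing import List


def metric_completion(dist: List[List[float]]) -> List[List[float]]:
    """Replace each distance by the shortest-path length in the complete
    weighted graph whose edge lengths are the given distances."""
    n = len(dist)
    closed = [row[:] for row in dist]  # copy so the input is not modified
    for mid in range(n):
        for i in range(n):
            for j in range(n):
                via = closed[i][mid] + closed[mid][j]
                if via < closed[i][j]:
                    closed[i][j] = via
    return closed


# A non-metric example: going 0 -> 1 -> 2 costs 2, but the direct edge costs 5.
raw = [
    [0.0, 1.0, 5.0],
    [1.0, 0.0, 1.0],
    [5.0, 1.0, 0.0],
]
completed = metric_completion(raw)
```

Because the completion can only shrink distances, a perturbation that was valid before completion may need its lower bound rechecked afterwards, which is the issue Lemma 9 addresses for a specific class of perturbations.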
3 Approximation algorithms under local perturbation resilience
In this section, we show that local search for $k$-median will always return the $(3+\epsilon)$-LPR clusters, and for $k$-means it will return the $(9+\epsilon)$-LPR clusters. We also show that any 2-approximation for $k$-center will return the 2-LPR clusters.
We start by giving a condition on an approximate $k$-median solution, which is sufficient to show the solution contains all $\alpha$-LPR clusters.
Given a -median instance and a set of centers , if for all sets of size ,
then all -LPR clusters are contained in the clustering defined by .
Given such a set of centers $X$, we construct an $\alpha$-perturbation $d'$ as follows. Increase all distances by a factor of $\alpha$ except for the distances between each point and its closest center in $X$. Now our goal is to show that $X$ is the optimal set of centers under $d'$.
Given any other set of centers, we consider four types of points: , , , and , which we denote by , , , and , respectively (see Figure 0(a)). The distance from a point to its center in might stay the same under , or increase, depending on its type. For each point , because these points have centers in . For each point , because their centers are in . The points in were originally closest to a center in , but might switch to their center in , since it is in . Therefore, for each , . Altogether,
The cost of the clustering induced by under , using our assumption, is equal to
Therefore, is the optimal set of centers under the perturbation . Given an -LPR cluster , by definition, there exists such that under , therefore by construction, under as well. This proves the theorem. ∎
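The perturbation used in this proof is simple to build explicitly. The sketch below scales every distance by the perturbation factor except the distance between each point and its closest center in the given center set. The matrix representation, the function name, and the choice to keep both directions of each preserved pair (so that a symmetric input stays symmetric) are assumptions of this example.

```python
from typing import List, Sequence


def perturb_except_assigned(
    dist: List[List[float]], centers: Sequence[int], alpha: float
) -> List[List[float]]:
    """Scale all distances by alpha, except between each point and its
    closest center (ties broken by the order of `centers`)."""
    n = len(dist)
    d_prime = [[alpha * dist[u][v] for v in range(n)] for u in range(n)]
    for v in range(n):
        c = min(centers, key=lambda x: dist[v][x])
        d_prime[v][c] = dist[v][c]  # keep the point-to-center distance
        d_prime[c][v] = dist[c][v]  # keep the reverse direction as well
    return d_prime


# Five points on a line; indices 1 and 3 (positions 1 and 10) are the centers.
positions = [0, 1, 2, 10, 11]
line = [[abs(a - b) for b in positions] for a in positions]
perturbed = perturb_except_assigned(line, centers=[1, 3], alpha=2)
```

Under this perturbation the cost of the given centers is unchanged, while any competing set of centers pays the scaled distance for every point it serves from a center outside the given set, which is the comparison the proof carries out.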
Essentially, Lemma 7 claims that any $\alpha$-approximation algorithm will return the $\alpha$-LPR clusters as long as the approximation ratio is “uniform” across all clusters. For example, a 3-approximation algorithm that returns half of the clusters paying 1.1 times their cost in $\mathcal{OPT}$, and half of the clusters paying 5 times their cost in $\mathcal{OPT}$, would fail the property. Luckily, the local search algorithm is well-suited for this property, due to the local optimum guarantee.
The local search algorithm starts with any set of centers, and iteratively replaces centers with different centers if it leads to a better clustering (see Algorithm 1). The number of iterations is . The classic Local Search heuristic is widely used in practice, and many works have studied local search theoretically for -median and -means [5, 24, 29]. For more information on local search, see a general introduction by Aarts and Lenstra .
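To make the swap step concrete, here is a minimal sketch of local search for the median objective. It is deliberately simplified and is not the paper's Algorithm 1: it swaps a single center at a time (search size 1), starts from the first few points, and accepts any swap that improves the cost by more than a relative threshold. All names and these choices are assumptions of this example.

```python
from itertools import product
from typing import List, Tuple


def median_cost(dist: List[List[float]], centers: List[int]) -> float:
    """Sum over all points of the distance to the closest chosen center."""
    return sum(min(dist[v][c] for c in centers) for v in range(len(dist)))


def local_search_median(
    dist: List[List[float]], k: int, eps: float = 0.0
) -> Tuple[List[int], float]:
    """Single-swap local search: repeatedly replace one center by one
    non-center while that lowers the cost below (1 - eps) * current cost."""
    n = len(dist)
    centers = list(range(k))  # arbitrary starting solution
    cost = median_cost(dist, centers)
    improved = True
    while improved:
        improved = False
        for i, new in product(range(k), range(n)):
            if new in centers:
                continue
            candidate = centers[:i] + [new] + centers[i + 1:]
            cand_cost = median_cost(dist, candidate)
            if cand_cost < (1 - eps) * cost:
                centers, cost = candidate, cand_cost
                improved = True
                break  # restart the scan from the new solution
    return sorted(centers), cost


positions = [0, 1, 2, 10, 11]
line = [[abs(a - b) for b in positions] for a in positions]
found_centers, found_cost = local_search_median(line, k=2)
```

With a strictly positive threshold each accepted swap shrinks the cost by a constant factor, which is what bounds the number of iterations; larger swap sizes trade running time for a better local-optimum guarantee.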
The next theorem utilizes Lemma 7 and a result by Cohen-Addad and Schwiegelshohn , who showed that local search returns the optimal clustering under a stronger version of -PR. For the formal proof of Theorem 8, see Appendix C.
Given a -median instance , running local search with search size returns a clustering that contains every -LPR cluster, and it gives a -approximation overall.
Because the $k$-center objective takes the maximum rather than the average of all cluster costs, the equivalent of the condition in Lemma 7 is essentially satisfied by any $\alpha$-approximation algorithm. We will now prove an even stronger result: any $\alpha$-approximation for $k$-center returns the $\alpha$-LPR clusters, even for metric perturbation resilience. First we state a lemma which allows us to reason about a specific class of $\alpha$-perturbations which will be useful in this section as well as throughout the paper, for symmetric and asymmetric $k$-center. For the full proofs, see Appendix C.
Given and an asymmetric -center clustering instance with optimal radius , let denote an -perturbation such that for all , either or . Let denote the metric completion of . Then is an -metric perturbation of , and the optimal cost under is .
Proof sketch. First, is a valid -metric perturbation of because for all , . To show the optimal cost under is , it suffices to prove that for all , if , then . This is true of by construction, so we show it still holds after taking the metric completion of , which can shrink some distances. Given such that , there exists a path –––– such that and for all , . If one of the segments has length in , then it has length in and we are done. If not, all distances increase by exactly a factor of , so we sum up all distances to show . ∎
Theorem 10. Given an asymmetric $k$-center clustering instance $(S,d)$ and an $\alpha$-approximate clustering $\mathcal{C}$, each $\alpha$-LPR cluster is contained in $\mathcal{C}$, even under the weaker metric perturbation resilience condition.
Proof sketch. Similar to the proof of Lemma 7, we construct an -perturbation and argue that becomes the optimal clustering under . Let denote the optimal -center radius of . First we define an -perturbation by increasing the distance from each point to its center in to , and increase all other distances by a factor of . Then by Lemma 9, the metric completion of has optimal cost , and so is the optimal clustering. Now we finish off the proof in a manner identical to Lemma 7. ∎
We remark that Theorem 10 generalizes the result from Balcan et al. Although Theorem 10 applies more generally to asymmetric $k$-center, it is most useful for symmetric $k$-center, for which there exist several 2-approximation algorithms [22, 23, 27]. Asymmetric $k$-center is NP-hard to approximate to within a factor of $o(\log^* n)$, so Theorem 10 only guarantees returning the $O(\log^* n)$-LPR clusters. In the next section, we show how to substantially improve this result.
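For the symmetric case, one well-known 2-approximation is the farthest-first traversal usually attributed to Gonzalez. The sketch below is offered as an illustration of the kind of algorithm Theorem 10 applies to; the matrix representation, the function name, and the choice of starting point are assumptions of this example, and the paper does not single out this particular algorithm.

```python
from typing import List, Tuple


def farthest_first(
    dist: List[List[float]], k: int, start: int = 0
) -> Tuple[List[int], float]:
    """Greedy center selection for a symmetric metric: repeatedly add the
    point farthest from the centers chosen so far."""
    n = len(dist)
    centers = [start]
    to_nearest = list(dist[start])  # distance from each point to its nearest center
    while len(centers) < k:
        nxt = max(range(n), key=lambda v: to_nearest[v])
        centers.append(nxt)
        to_nearest = [min(to_nearest[v], dist[nxt][v]) for v in range(n)]
    return centers, max(to_nearest)


positions = [0, 1, 2, 10, 11]
line = [[abs(a - b) for b in positions] for a in positions]
chosen, radius = farthest_first(line, k=2)
```

On this instance the optimal radius is 1 (for example with centers at positions 1 and 10), and the greedy solution has radius 2, consistent with the factor-2 guarantee.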
4 Asymmetric $k$-center
In this section, we show that a slight modification to the approximation algorithm of Vishwanathan leads to an algorithm that maintains its performance in the worst case, while returning each cluster $C_i$ with the following property: $C_i$ is 2-LPR, and all nearby clusters are 2-LPR as well. This result also holds for metric perturbation resilience. We start by formally giving the stronger version of Definition 2. Throughout this section, we denote the optimal $k$-center radius by $r^*$.
An optimal cluster satisfies -strong local perturbation resilience (-SLPR) if for each such that there exists , and , then is -LPR.
Theorem 12. Given an asymmetric $k$-center clustering instance $(S,d)$ of size $n$, Algorithm 3 returns each 2-SLPR cluster exactly. For each 2-LPR cluster $C_i$, Algorithm 3 outputs a cluster that is a superset of $C_i$ and does not contain any other 2-LPR cluster. These statements hold for metric perturbation resilience as well. Finally, the overall clustering returned by Algorithm 3 is an $O(\log^* n)$-approximation.
Approximation algorithm for asymmetric -center
We start with a recap of the $O(\log^* n)$-approximation algorithm of Vishwanathan. This was the first nontrivial algorithm for asymmetric $k$-center, and the approximation ratio was later proven to be tight. To explain the algorithm, it is convenient to think of asymmetric $k$-center as a set covering problem. Given an asymmetric $k$-center instance $(S,d)$, define the directed graph $D = (S,A)$, where $A = \{(u,v) \mid d(u,v) \le r^*\}$. For a point $v \in S$, we define $\Gamma_{in}(v)$ and $\Gamma_{out}(v)$ as the set of vertices with an arc to and from $v$, respectively. The asymmetric $k$-center problem is equivalent to finding a subset $C \subseteq S$ of size $k$ such that $\bigcup_{c \in C} \Gamma_{out}(c) = S$. We also define $\Gamma^x_{in}(v)$ and $\Gamma^x_{out}(v)$ as the set of vertices which have a path of length at most $x$ to and from $v$ in $D$, respectively, and we define $\Gamma^x_{in}(A') = \bigcup_{v \in A'} \Gamma^x_{in}(v)$ for a set $A' \subseteq S$, and similarly for $\Gamma^x_{out}(A')$. It is standard to assume the value of $r^*$ is known; since it is one of $O(n^2)$ distances, the algorithm can search for the correct value in polynomial time. Vishwanathan’s algorithm crucially utilizes the following concept.
Definition 13. Given an asymmetric $k$-center clustering instance $(S,d)$, a point $v \in S$ is a center-capturing vertex (CCV) if $\Gamma_{in}(v) \subseteq \Gamma_{out}(v)$. In other words, for all $u \in S$, $d(u,v) \le r^*$ implies $d(v,u) \le r^*$.
As the name suggests, each CCV $v \in C_i$ “captures” its center, i.e. $c_i \in \Gamma_{out}(v)$ (see Figure 0(b)). Therefore, $v$’s entire cluster is contained inside $\Gamma^2_{out}(v)$, which is a nice property that the approximation algorithm exploits. At a high level, the approximation algorithm has two phases. In the first phase, the algorithm iteratively picks a CCV $v$ arbitrarily and removes all points in $\Gamma^2_{out}(v)$. This continues until there are no more CCVs. For every CCV picked, the algorithm is guaranteed to remove an entire optimal cluster. In the second phase, the algorithm runs $O(\log^* n)$ rounds of a greedy set-cover subroutine on the remaining points. See Algorithm 2. To prove the second phase terminates in $O(\log^* n)$ rounds, the analysis crucially assumes there are no CCVs among the remaining points. We refer the reader to the original analysis for these details.
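The center-capturing condition can be checked directly from the distance matrix once a radius is fixed. The sketch below assumes the convention that a point is a CCV when every point within the radius of it (measured toward it) is also within the radius measured away from it; the function names, the matrix representation, and the three-point asymmetric example are assumptions made for illustration.

```python
from typing import List, Set


def gamma_in(dist: List[List[float]], v: int, r: float) -> Set[int]:
    """Points u with d(u, v) <= r, i.e. points that have an arc to v."""
    return {u for u in range(len(dist)) if dist[u][v] <= r}


def gamma_out(dist: List[List[float]], v: int, r: float) -> Set[int]:
    """Points u with d(v, u) <= r, i.e. points that v has an arc to."""
    return {u for u in range(len(dist)) if dist[v][u] <= r}


def center_capturing_vertices(dist: List[List[float]], r: float) -> List[int]:
    """All points whose in-neighborhood is contained in their out-neighborhood."""
    return [
        v
        for v in range(len(dist))
        if gamma_in(dist, v, r) <= gamma_out(dist, v, r)
    ]


# Asymmetric distances satisfying the directed triangle inequality.
# Point 0 reaches everything within radius 1; point 1 reaches nothing cheaply.
asym = [
    [0, 1, 1],
    [5, 0, 5],
    [1, 2, 0],
]
ccvs = center_capturing_vertices(asym, r=1)
```

In this example point 1 is covered by point 0 but cannot reach it within the radius, so it fails the condition, while points 0 and 2 both satisfy it.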
Robust algorithm for asymmetric -center
We show a small modification to Vishwanathan’s approximation algorithm leads to simultaneous guarantees in the worst case and under local perturbation resilience. We show that each 2-LPR center is itself a CCV, and displays other structure which allows us to distinguish it from non-center CCVs. This suggests a simple modification to Algorithm 2: instead of picking CCVs arbitrarily, we first pick CCVs which display the added structure, and then when none are left, we go back to picking regular CCVs. However, we need to ensure that a CCV chosen by the algorithm marks points from at most one 2-LPR cluster, or else we will not be able to output a separate cluster for each 2-LPR cluster. Thus, the difficulty in our argument is carefully specifying which CCVs the algorithm picks, and which nearby points get marked by the CCVs, so that we do not mark other LPR clusters and simultaneously maintain the guarantee of the original approximation algorithm, namely that in every round, we mark an entire optimal cluster. To accomplish this tradeoff, we start by defining two properties. The first property will determine which CCVs are picked by the algorithm. The second property is used in the proof of correctness, but is not used explicitly by the algorithm. We give the full details of the proofs in Appendix D.
(1) A point satisfies CCV-proximity if it is a CCV, and each point in is closer to than any CCV outside of . That is, for all points and CCVs , . (This property loosely resembles -center proximity , a property defined over an entire clustering instance, which states for all , for all , , we have .)
(2) An optimal center satisfies center-separation if any point within distance of belongs to its cluster . That is, for all , .
Given an asymmetric -center clustering instance and a 2-LPR cluster , satisfies CCV-proximity and center-separation. Furthermore, given a CCV , a CCV , and a point , we have .
Proof sketch. Given an instance and a 2-LPR cluster , we show that has the desired properties.
Center Separation: Assume there exists a point for such that . The idea is to construct a -perturbation in which becomes the center for , since the distance from to each point in is by the triangle inequality. Define by increasing all distances by a factor of 2, except for the distances between and each point in , which we increase to . By Lemma 9, the metric completion of is a 2-metric perturbation with optimal cost , so we can replace with in the set of optimal centers under . However, now switches to a different cluster, contradicting 2-LPR.
Final property: Given CCVs , , and a point , assume . Again, we will use a perturbation to construct a contradiction. Since and are CCVs and thus distance to their clusters, we can construct a 2-metric perturbation with optimal cost in which and become centers for their respective clusters. Then switches clusters, so we have a contradiction.
CCV-proximity: By center-separation and the definition of , we have that , so is a CCV. Now given a point and a CCV , from center-separation and definition of , and for . Then from the property in the previous paragraph, . ∎
Now we can modify the algorithm so that it first chooses CCVs satisfying CCV-proximity. The other crucial change is instead of each chosen CCV marking all points in , it instead marks all points such that for some . See Algorithm 3. Note this new way of marking preserves the guarantee that each CCV marks its own cluster, because . It also allows us to prove that each CCV satisfying CCV-proximity can never mark an LPR center from a different cluster. Intuitively, if marks , then there exists a point , but there can never exist a point distance to two points satisfying CCV-proximity, since both would need to be closer to by definition. Finally, the last property in Lemma 15 allows us to prove that when the algorithm computes the Voronoi tiles after Phase 1, all points will be correctly assigned. Now we are ready to prove Theorem 12.
Proof sketch of Theorem 12.
First we explain why Algorithm 3 retains the approximation guarantee of Algorithm 2. Given any CCV chosen in Phase I, marks its entire cluster by definition, and we start Phase II with no remaining CCVs. This condition is sufficient for Phase II to return an approximation (Theorem 3.1 from ).
Next we claim that for each 2-LPR cluster , there exists a cluster outputted by Algorithm 3 that is a superset of and does not contain any other 2-LPR cluster. To prove this claim, we first show there exists a point from satisfying CCV-proximity that cannot be marked by any point from a different cluster in Phase I. From Lemma 15, satisfies CCV-proximity and center-separation. If a point marks , then . By center-separation, . Then from the definition of CCV-proximity, both and must be closer to than the other, causing a contradiction. At this point, we know a point will always be chosen by the algorithm in Phase I. To finish the claim, we show that each point from is closer to than to any other point chosen in Phase I. Since and are both CCVs, this follows directly from Lemma 15. However, it is possible that a center is closer to than is to , causing to “steal” ; this is unavoidable. Therefore, we forbid the algorithm from decreasing the size of the Voronoi tiles of after Phase I.
Finally, we claim that Algorithm 3 returns each 2-SLPR cluster exactly. Given a 2-SLPR cluster , by our previous argument, the algorithm chooses a CCV such that . It is left to show that . The intuition is that since is 2-SLPR, its neighboring clusters were also marked in Phase I, and these clusters “shield” from picking up superfluous points in Phase II. Specifically, there are two cases. If there exists that was marked in Phase I, then we can prove that comes from a 2-LPR cluster, so contradicts our previous argument. If there exists from Phase II, then the shortest path in from to is length at least 5 (see Figure 1(a)). The first point , on the shortest path must come from a 2-LPR cluster, and we prove that is closer to ’s cluster using CCV-proximity. ∎
5 Robust local perturbation resilience
In this section, we show that any 2-approximation algorithm for -center returns the optimal -SLPR clusters, provided the clusters are size . Our main structural result is the following theorem.
Given a -center clustering instance with optimal radius such that all optimal clusters are size and there are at least three -LPR clusters, then for each pair of -LPR clusters and , for all and , we have .
Later in this section, we will show that both added conditions (the lower bound on the size of the clusters, and that there are at least three -LPR clusters) are necessary. Since the distance from each point to its closest center is , a corollary of Theorem 16 is that any 2-approximate solution must contain the optimal -SLPR clusters, as long as the 2-approximation satisfies two sensible conditions: (1) for every edge in the 2-approximation, s.t. and are , and (2) there cannot be multiple clusters outputted in the 2-approximation that can be combined into one cluster with the same radius. Both of these properties are easily satisfied using quick pre- or post-processing steps. (Footnote 5: For condition (1), before running the algorithm, remove all edges of distance , and then take the metric completion of the resulting graph. For condition (2), given the radius of the outputted solution, for each , check if the ball of radius around captures multiple clusters. If so, combine them.) We may also combine this result with Theorem 10 to obtain a more powerful result for -center.
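The two processing steps described in the footnote can be sketched as follows. This is only an illustrative sketch under stated assumptions: the names `metric_completion_after_pruning`, `merge_capturable_clusters`, `threshold`, and `radius` are hypothetical, the distance function is assumed to be given as a full matrix, and the pruning threshold is left as a parameter because its value is not recoverable from the text above. Metric completion is computed here with the Floyd-Warshall shortest-path recurrence, which also handles asymmetric distances.

```python
import math


def metric_completion_after_pruning(d, threshold):
    """Drop every edge longer than `threshold` (a hypothetical parameter),
    then return the shortest-path metric of the remaining graph."""
    n = len(d)
    # Keep only short edges; removed edges become infinitely long.
    g = [[0 if i == j else (d[i][j] if d[i][j] <= threshold else math.inf)
          for j in range(n)] for i in range(n)]
    # Floyd-Warshall: allow each point m in turn as an intermediate stop.
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if g[i][m] + g[m][j] < g[i][j]:
                    g[i][j] = g[i][m] + g[m][j]
    return g


def merge_capturable_clusters(d, clusters, radius):
    """For each point v, merge all (non-empty) clusters that lie entirely
    inside the ball of the given radius around v."""
    clusters = [set(c) for c in clusters]
    n = len(d)
    for v in range(n):
        ball = {u for u in range(n) if d[v][u] <= radius}
        captured = [c for c in clusters if c <= ball]
        if len(captured) > 1:
            merged = set().union(*captured)
            clusters = [c for c in clusters if not c <= ball] + [merged]
    return clusters


# Toy instance: four points on a line at positions 0, 1, 2, 10.
pos = [0, 1, 2, 10]
d = [[abs(a - b) for b in pos] for a in pos]
g = metric_completion_after_pruning(d, threshold=1)
merged = merge_capturable_clusters(d, [{0}, {1}, {2, 3}], radius=1)
```

In the toy instance, the far point becomes unreachable after pruning, and the two singleton clusters that fit inside one ball of radius 1 are combined, while the cluster containing the far point is left alone.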
Given a -center clustering instance such that all optimal clusters are size and there are at least three -LPR clusters, then any 2-approximate solution satisfying conditions (1) and (2) must contain all optimal 2-LPR clusters and -SLPR clusters.
Given such a clustering instance, Theorem 16 ensures that there is no edge of length between points from two different -LPR clusters. Given a -SLPR cluster , it follows that there is no point such that . Therefore, given a 2-approximate solution satisfying condition (1), any and cannot be in the same cluster. Furthermore, by condition (2), must not be split into two clusters. Therefore, . The second part of the statement follows directly from Theorem 10. ∎
Proof idea for Theorem 16
The proof consists of two parts. The first part is to show that if two points from different LPR clusters are close together, then all points in the clustering instance must be near each other, in some sense (Lemma 22). The second part of the proof consists of showing that the points from three LPR clusters must be reasonably far from one another; therefore, we achieve the final result by contradiction.
Here is the intuition for part 1. Assume that there are points and from different LPR clusters, but . Then by the triangle inequality, the distance from to is less than . We show that under a suitable 3-perturbation, we can replace and with in the set of optimal centers. So, there is a 3-perturbation in which the optimal solution uses just centers. However, as pointed out in , we are still a long way off from showing a contradiction. Since the definition of local perturbation resilience reasons about sets of centers, we must add a “dummy center”. But adding any point as a dummy center might not immediately result in a contradiction, if the Voronoi partition “accidentally” outputs the LPR clusters. To handle this problem, we use the notion of a cluster-capturing center , intuitively, a center which is within of most of the points of a different optimal cluster (see Figure 1(b)). This allows us to construct perturbations and control which points become centers for which clusters. We show all of the points in the instance are close together, in some sense.
The second part of the argument diverges from all previous work in perturbation resilience, since finding a contradiction under local perturbation resilience poses a novel challenge. From the previous part of the proof, we are able to find two noncenters and , which are collectively close to all other points in the dataset. Then we construct a 3-perturbation such that any size subset of is an optimal set of centers. Our goal is to show that at least one of these subsets must break up a LPR cluster, causing a contradiction. There are many cases to consider, so we build up conditional structural claims dictating the possible centers for each LPR cluster under the 3-perturbation. For instance, if a center is the best center for a LPR cluster under some set of optimal centers, then or must be the best center for , otherwise we would arrive at a contradiction by definition of LPR (Lemma 24). We build up enough structural results to examine every possibility of center-cluster combinations, showing they all lead to contradictions, thus negating our original assumption.
Formal analysis of Theorem 16
Given a -center clustering instance such that all optimal clusters have size , let denote an -perturbation with optimal centers . Let denote the set of -LPR clusters. Then there exists a one-to-one function such that for all , is the center for more than half of the points in under . (Footnote 6: A non-local version of this fact appeared in .)
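The pairing asserted in this fact can be checked mechanically on a small instance. The sketch below is illustrative only: `majority_centers` and `is_one_to_one` are hypothetical names, distances are assumed to be given as a matrix indexed as `d[center][point]`, and ties between equidistant centers are broken by list order, which is an assumption not taken from the text.

```python
from collections import Counter


def majority_centers(d, centers, clusters):
    """Map each cluster index to the center serving more than half of its
    points under the Voronoi partition, or None if no such center exists."""
    result = {}
    for idx, cluster in enumerate(clusters):
        # Assign every point of the cluster to its closest center.
        counts = Counter(min(centers, key=lambda c: d[c][p]) for p in cluster)
        center, votes = counts.most_common(1)[0]
        result[idx] = center if 2 * votes > len(cluster) else None
    return result


def is_one_to_one(mapping):
    """Check that no center is paired with two different clusters."""
    chosen = [c for c in mapping.values() if c is not None]
    return len(chosen) == len(set(chosen))


# Toy instance: two well-separated groups of three points on a line.
pos = [0, 1, 2, 10, 11, 12]
d = [[abs(a - b) for b in pos] for a in pos]
pairing = majority_centers(d, centers=[1, 4], clusters=[{0, 1, 2}, {3, 4, 5}])
```

Here each group is served entirely by its own middle point, so the pairing is one-to-one.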
In words, for any set of optimal centers under an -perturbation, each LPR cluster can be paired to a unique center. This follows simply because all optimal clusters are size , yet under a perturbation, points can switch out of each LPR cluster. Next, we give the following definition, which will be a key point in the first part of the proof.
Intuitively, a center is a CCC for if is a valid center for when is removed from the set of optimal centers. This is particularly useful when is -LPR, since we can combine it with Fact 18 to show that is the unique center for the majority of points in . Another key idea in our analysis is the following concept.
A set -hits if for all , there exist points in at distance to .
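As a minimal sketch, the hitting condition can be phrased as a predicate over a distance matrix. The parameters `required` (how many points of the set must be close to each point) and `radius` (the distance bound) are hypothetical names standing in for the quantities in the definition, and distances are measured from the set to the point, `d[c][s]`, which is an assumption about direction in the asymmetric case.

```python
def hits(C, S, d, required, radius):
    """Return True if every s in S has at least `required` points of C
    within distance `radius` (measured from the point of C to s)."""
    return all(
        sum(1 for c in C if d[c][s] <= radius) >= required
        for s in S
    )


# Toy instance: four points on a line at positions 0, 1, 2, 3.
pos = [0, 1, 2, 3]
d = [[abs(a - b) for b in pos] for a in pos]
S = range(len(pos))
ok_wide = hits([1, 2], S, d, required=2, radius=2)
ok_narrow = hits([1, 2], S, d, required=2, radius=1)
```

With radius 2 both middle points are close to every point, so the condition holds; with radius 1 each endpoint sees only one of them, so it fails.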
Given a -center clustering instance , given , and given a set of size which -hits , there exists an -perturbation such that all size subsets of are optimal sets of centers under .
Consider the following perturbation .
This is an -perturbation by construction. Define as the metric completion of . Then by Lemma 9, is an -metric perturbation with optimal cost . Given any size subset , then for all , there is still at least one such that , therefore by construction, . It follows that is a set of optimal centers under . ∎
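The counting step used in this proof is a pigeonhole argument: if every point has several members of the set nearby, then after dropping fewer members than that, at least one nearby member survives. The sketch below checks this exhaustively on a toy instance; `count_within`, `every_subset_still_covers`, `dropped`, and `radius` are hypothetical names, and the sketch does not construct the perturbation itself.

```python
from itertools import combinations


def count_within(C, s, d, radius):
    """Number of points of C within distance `radius` of s."""
    return sum(1 for c in C if d[c][s] <= radius)


def every_subset_still_covers(C, S, d, dropped, radius):
    """Check that after removing any `dropped` points from C, every s in S
    still has at least one remaining point of C within `radius`."""
    keep = len(C) - dropped
    return all(
        all(count_within(sub, s, d, radius) >= 1 for s in S)
        for sub in combinations(C, keep)
    )


# Toy instance: five points on a line; C = {1, 2, 3} places at least two
# of its points within distance 2 of every point.
pos = [0, 1, 2, 3, 4]
d = [[abs(a - b) for b in pos] for a in pos]
S = range(len(pos))
C = [1, 2, 3]
at_least_two = all(count_within(C, s, d, 2) >= 2 for s in S)
survives_one_drop = every_subset_still_covers(C, S, d, dropped=1, radius=2)
survives_two_drops = every_subset_still_covers(C, S, d, dropped=2, radius=2)
```

Dropping one point is always safe here because every point started with two close members; dropping two is not, since the subset consisting only of point 1 is too far from the point at position 4.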
Given a -center clustering instance such that all optimal clusters are size and there exist two points at distance from different -LPR clusters, then there exists a partition of the non-centers such that for all pairs , , -hits .
Proof sketch. This proof is split into two main cases. The first case is the following: there exists a CCC2 for a -LPR cluster, discounting a -LPR cluster. In fact, in this case, we do not need the assumption that two points from different LPR clusters are close. If there exists a CCC to a -LPR cluster, denote the CCC by and the cluster by . Otherwise, let denote a CCC2 to a -LPR cluster , discounting a -LPR center . Then is at distance to all but points in . Therefore, and so is at distance to all points in . Now we can create a 3-perturbation by increasing all distances by a factor of 3 except for the distances between and each point , which we increase to . Then by Lemma 9, is a 3-perturbation with optimal cost . Therefore, given any non-center , the set of centers achieves the optimal score, and from Fact 18, one of the centers in must be the center for the majority of points in under . If this center is , , then by definition, is a CCC for the -LPR cluster, , which creates a contradiction because . Therefore, either or must be the center for the majority of points in under .
If is the center for the majority of points in , then because is -LPR, the corresponding cluster must contain fewer than points from . Furthermore, since for all and , , it follows that must be the center for the majority of points in . Therefore, every non-center is at distance to the majority of points in either or .
Now partition all the non-centers into two sets and , such that and . Given , then since both points are close to more than half the points in . Similarly, any two points are apart.
Now we claim that -hits for any pair , . This is because a point from is to , , and and a point is to , , and . The latter follows because . Similar statements are true for and .
Now we turn to the other case. Assume there does not exist a CCC2 to a LPR cluster, discounting a LPR center. In this case, we need to use the assumption that there exist -LPR clusters and , and , such that . Then by the triangle inequality, is distance to all points in and . Again we construct a 3-perturbation by increasing all distances by a factor of 3 except for the distances between and , which we increase to . By Lemma 9, has optimal cost . Then given any non-center , the set of centers achieves the optimal score.
From Fact 18, one of the centers in must be the center for the majority of points in under . If this center is for , then is a CCC2 for discounting , which contradicts our assumption. Similar logic applies to the center for the majority of points in . Therefore, and must be the centers for and . Since was an arbitrary non-center, all non-centers are distance to all but points in either or .
Similar to Case 1, we now partition all the non-centers into two sets and , such that and . As before, each pair of points in are distance apart, and similarly for .
Again we must show that -hits for each pair , . It is no longer true that , however, we can prove that for both and , there exist points from two distinct clusters each. From the previous paragraph, given a non-center for , we know that and are centers for and . With an identical argument, given for , we can show that and are centers for and . It follows that and both contain points from at least two distinct clusters. Now given a non-center