# When Do Birds of a Feather Flock Together? k-Means, Proximity, and Conic Programming

## Abstract

Given a set of data, one central goal is to group them into clusters based on some notion of similarity between the individual objects. One of the most popular and widely used approaches is $k$-means, despite the computational hardness of finding its global minimum. We study and compare the properties of different convex relaxations by relating them to corresponding proximity conditions, an idea originally introduced by Kumar and Kannan. Using conic duality theory, we present an improved proximity condition under which the Peng-Wei relaxation of $k$-means recovers the underlying clusters exactly. Our proximity condition improves upon that of Kumar and Kannan and is comparable to that of Awasthi and Sheffet, where proximity conditions are established for projective $k$-means. In addition, we provide a necessary proximity condition for the exactness of the Peng-Wei relaxation. For the special case of equal cluster sizes, we establish a different and completely localized proximity condition under which the Amini-Levina relaxation yields exact clustering, thereby addressing an open problem posed by Awasthi and Sheffet in the balanced case.

Our framework is not only deterministic and model-free but also comes with a clear geometric meaning which allows for further analysis and generalization. Moreover, it can be conveniently applied to analyzing various data generative models such as the stochastic ball models and Gaussian mixture models. With this method, we improve the current minimum separation bound for the stochastic ball models and achieve the state-of-the-art results of learning Gaussian mixture models.

## 1 Introduction

$k$-means clustering is one of the most well-known and widely used clustering methods in unsupervised learning. Given $N$ data points in $\mathbb{R}^m$, the goal is to partition them into $k$ clusters by minimizing the total squared distance between each data point and the corresponding cluster center. It is a problem related to Voronoi tessellations [10]. However, $k$-means is combinatorial in nature since it is essentially equivalent to an integer programming problem [22]. Thus, minimizing the $k$-means objective function turns out to be an NP-hard problem, even if there are only two clusters [2] or if the data points lie in a 2D plane [19].

Despite its hardness, numerous efforts have been made to develop effective and efficient heuristic algorithms to handle the $k$-means problem in practice. A famous example is Lloyd’s algorithm [17], which was originally introduced for vector quantization and then became popular in data clustering due to its high efficiency and simplicity of implementation. One of the earliest convergence analyses of Lloyd’s algorithm was given by Selim and Ismail [22]: under certain conditions, the algorithm converges to a stationary point within a finite number of iterations but may fail to converge to a local minimum. A smoothed analysis given by Arthur, Manthey and Röglin [4] shows that the smoothed (expected) number of iterations is polynomially bounded, while the worst-case running time can be exponential even when the data points lie in a plane [24].

We are particularly interested in the semidefinite programming (SDP) relaxation for $k$-means by Peng and Wei [21], who observed that the $k$-means objective function can be written as the inner product between a projection matrix and a distance matrix constructed from the data, and that the combinatorial constraints on the projection matrix can be convexified. Thus, whenever the Peng-Wei relaxation produces an output corresponding to a partition of the data set, the $k$-means problem is solved in polynomial time [27]. The details of the Peng-Wei relaxation will be explained in Section 2.

Theoretical properties of the Peng-Wei relaxation have also been studied under specific stochastic models in the literature. Minimum separation conditions were established in [5, 13] to guarantee exact clustering for the stochastic ball models with balanced clusters (i.e., each cluster has the same number of points), while a similar study was conducted in [20] for the Gaussian mixture model.

Despite these efforts, the Peng-Wei relaxation is not yet thoroughly understood. Several fundamental questions of vital importance remain unexplored or require better answers, such as

• How do the number of clusters and the data dimension affect the performance of the Peng-Wei relaxation?

• How does the performance of the Peng-Wei relaxation depend on the balancedness of the cluster sizes and covariance structures within each cluster?

• Can the global minimum separation condition be localized?

• Under the special case of equal cluster sizes, does the tighter Amini-Levina relaxation [3] improve the Peng-Wei relaxation? If so, in which sense?

The studies in [5, 13, 20] reveal certain information about the Peng-Wei relaxation based on the assumption of sufficient minimum center separation: guaranteed exact recovery in the case of the stochastic ball model [5, 13] and learning of centers for the Gaussian mixture model [20]. The price to obtain such information, the requirement imposed upon the minimum center separation, is the homogeneity of the criteria forced on all different clusters. In other words, each pair of clusters, regardless of their shapes and cardinalities, must have their centers separated by a uniform distance determined by the entire data set. As a consequence of this “global” condition, the effect of an isolated but huge cluster ripples throughout the entire data set by raising the minimum center separation. Thus, a more “localized” condition, i.e., a condition on the center separation for each pair of clusters that relies largely on local information, is much desired. Such a more localized condition might pave the way to address the aforementioned fundamental questions regarding the Peng-Wei relaxation.

To that end, in this paper we introduce a proximity condition enabling us to relate the pairwise center distances to more localized quantities. Interestingly, it turns out that our proximity condition improves the one in [15] and is comparable to that in [6], the state-of-the-art proximity conditions in the literature of SVD-based projective $k$-means. Furthermore, under the Amini-Levina relaxation for clusters of equal cardinality, the associated proximity condition becomes even “fully localized”, as it only involves information about pairs of clusters.

### 1.1 Organization of our paper

Our paper is organized as follows. In the remainder of this introductory section we present our aforementioned proximity condition, discuss its implications for various stochastic cluster models and briefly compare our results to the state of the art. In Section 2, we discuss $k$-means and its convex relaxation introduced by Peng and Wei. In Section 3, we show that the Peng-Wei relaxation yields the solution of the $k$-means objective as long as our proximity condition (1.1) is satisfied. A different proximity condition for the exactness of the Amini-Levina relaxation is discussed in the same section. In Section 4, we consider the application of our framework to the stochastic ball model and the Gaussian mixture model. Numerical simulations that illustrate our theoretical findings are presented in Section 5. All proofs can be found in Sections 6 to 8.

### 1.2 Proximity conditions under deterministic models

The idea of proximity conditions originates from the work [15] by Kumar and Kannan, who use a proximity condition to characterize the performance of Lloyd’s algorithm with an initialization given by an SVD-based projection under deterministic models. The result was later improved by Awasthi and Sheffet [6], who perform a finer analysis and redesign the proximity condition for the same algorithm. To the best of our knowledge, no proximity condition of this type has been established for the Peng-Wei relaxation so far, and we fill this gap in this paper.

Conceptually speaking, our proximity condition can be interpreted as follows:

For each pair of clusters, every point is closer to the center of its own cluster, while the bisector hyperplane of the centers keeps all points in the two clusters at a certain distance determined by global information of the data set.

Roughly speaking, the proximity condition characterizes for each pair of clusters how much closer each point is to the within-cluster center than the cross-cluster center. This is conceptually much more localized than minimum separation, which compares all pairwise center distances to a uniform quantity.

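Let us introduce some necessary notation before we proceed to the exact statement of our proximity condition. Given a set of data points $\Gamma = \{x_l\}_{l=1}^{N}$ with $k$ mutually disjoint clusters $\{\Gamma_a\}_{a=1}^{k}$, we can re-index according to the clusters: $\Gamma_a = \{x_{a,i}\}_{i=1}^{n_a}$ for all $1 \le a \le k$. Denote by $n_a$ the number of elements in $\Gamma_a$.

Denote the data matrix of the $a$-th cluster by

$$X_a = [x_{a,1}, \ldots, x_{a,n_a}]^\top \in \mathbb{R}^{n_a \times m}.$$

Furthermore, define

$$c_a = \frac{1}{n_a}\sum_{i=1}^{n_a} x_{a,i}, \qquad w_{a,b} = \frac{c_b - c_a}{\|c_b - c_a\|}, \qquad \text{and} \qquad \overline{X}_a = X_a - 1_{n_a} c_a^\top.$$

In other words, $c_a$ is the sample mean (cluster center) of the $a$-th cluster, $w_{a,b}$ is the unit vector pointing from $c_a$ to $c_b$, and $\overline{X}_a$ is the centered data matrix of the $a$-th cluster. Now we are ready to give a mathematical characterization of the proximity condition.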

###### Condition 1.1 (Proximity condition).

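The partition $\{\Gamma_a\}_{a=1}^{k}$ satisfies the proximity condition if for any $a \neq b$, there holds

$$\min_{1 \le i \le n_a} \left\langle x_{a,i} - \frac{c_a + c_b}{2},\; w_{b,a} \right\rangle > \frac{1}{2}\sqrt{\left(\sum_{l=1}^{k} \|\overline{X}_l\|^2\right)\left(\frac{1}{n_a} + \frac{1}{n_b}\right)}. \tag{1.1}$$

Here, $\|\overline{X}_l\|$ is the operator norm of the matrix $\overline{X}_l$.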
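
As an illustration, the following NumPy sketch shows one way Condition 1.1 could be checked for a given partition. The function name `proximity_condition_holds` and the list-of-arrays input format are assumptions made for this example and are not part of the original analysis.

```python
import numpy as np

def proximity_condition_holds(clusters):
    """Check inequality (1.1) for every ordered pair of clusters.

    `clusters` is a list of (n_a, m) arrays, one per cluster (rows are points).
    This is an illustrative sketch; the interface is hypothetical.
    """
    k = len(clusters)
    centers = [X.mean(axis=0) for X in clusters]
    # Squared operator (spectral) norms of the centered data matrices.
    sq_norms = [np.linalg.norm(X - c, 2) ** 2 for X, c in zip(clusters, centers)]
    total = sum(sq_norms)
    for a in range(k):
        for b in range(k):
            if a == b:
                continue
            n_a, n_b = len(clusters[a]), len(clusters[b])
            diff = centers[a] - centers[b]
            w_ba = diff / np.linalg.norm(diff)  # unit vector from c_b to c_a
            midpoint = 0.5 * (centers[a] + centers[b])
            lhs = np.min((clusters[a] - midpoint) @ w_ba)
            rhs = 0.5 * np.sqrt(total * (1.0 / n_a + 1.0 / n_b))
            if not lhs > rhs:
                return False
    return True

# Two tight clusters: far apart (condition holds) and nearly touching (it fails).
A = np.array([[0.0, 0.1], [0.0, -0.1], [0.1, 0.0], [-0.1, 0.0]])
far_apart = proximity_condition_holds([A, A + np.array([10.0, 0.0])])
too_close = proximity_condition_holds([A, A + np.array([0.25, 0.0])])
```

The right-hand side couples every pair to the sum over all clusters, which is the "global information" mentioned in the conceptual description of the condition.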

The proximity condition has a very intuitive geometric interpretation, see also Figure 1. Suppose the partition $\{\Gamma_a\}_{a=1}^{k}$ of the data points satisfies the proximity condition. Then each pair of clusters $\Gamma_a$ and $\Gamma_b$ can be separated by the bisector hyperplane of their sample means $c_a$ and $c_b$. Moreover, the distance between every point in those two clusters and the bisector must be greater than the right hand side of (1.1). This geometric interpretation can be further illustrated by rewriting (1.1): Denote by $h_{a,b} = \|c_a - c_b\|$ the distance between the two centers $c_a$ and $c_b$. Moreover, define

$$\tau_{a,b} = \max\{\max(u_{a,b}),\, \max(u_{b,a})\} \quad \text{where} \quad u_{a,b} = \overline{X}_a w_{a,b} \quad \text{for } 1 \le a \neq b \le k.$$

Clearly, $\tau_{a,b}$ is the maximum signed projection distance over all the data points in the clusters $\Gamma_a$ and $\Gamma_b$. As illustrated in Figure 1, one can easily check that the left hand side of proximity condition (1.1), taken over both orderings of the pair, is in fact equal to $\frac{1}{2}h_{a,b} - \tau_{a,b}$, which is the shortest distance between the midpoint $\frac{c_a + c_b}{2}$ and the projections of all the data points in $\Gamma_a$ and $\Gamma_b$ on the line connecting $c_a$ and $c_b$. This observation gives us the following proposition.

###### Proposition 1.2.

The proximity condition (1.1) is equivalent to

$$h_{a,b} > 2\tau_{a,b} + \sqrt{\sum_{l=1}^{k} \|\overline{X}_l\|^2 \left(\frac{1}{n_a} + \frac{1}{n_b}\right)}, \quad \forall a \neq b. \tag{1.2}$$
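
The step from (1.1) to (1.2) can be sketched as follows, using only the definitions above and writing $h_{a,b} = \|c_a - c_b\|$ for the distance between the two centers. This is an informal outline rather than the full proof. Since $w_{b,a} = -w_{a,b}$ and $\langle c_a - c_b, w_{b,a} \rangle = h_{a,b}$,

$$\left\langle x_{a,i} - \frac{c_a + c_b}{2},\, w_{b,a} \right\rangle = \langle x_{a,i} - c_a,\, w_{b,a} \rangle + \frac{1}{2}\langle c_a - c_b,\, w_{b,a} \rangle = \frac{1}{2} h_{a,b} - \langle x_{a,i} - c_a,\, w_{a,b} \rangle.$$

Minimizing over $1 \le i \le n_a$ gives $\frac{1}{2}h_{a,b} - \max(\overline{X}_a w_{a,b})$. Because the right hand side of (1.1) is symmetric in $a$ and $b$, requiring (1.1) for both ordered pairs $(a,b)$ and $(b,a)$ amounts to

$$\frac{1}{2} h_{a,b} - \max\{\max(\overline{X}_a w_{a,b}),\, \max(\overline{X}_b w_{b,a})\} > \frac{1}{2}\sqrt{\sum_{l=1}^{k} \|\overline{X}_l\|^2 \left(\frac{1}{n_a} + \frac{1}{n_b}\right)},$$

and multiplying by 2 yields (1.2).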

Besides showing that the proximity condition (1.1) guarantees the exactness of Peng-Wei relaxation, we also obtain a necessary proximity condition. If a deterministic mixture fails to fulfill the necessary condition, exact recovery by the Peng-Wei relaxation is provably impossible.

Awasthi and Sheffet raised an open question in [6]: can the pairwise separation condition be fully localized, i.e., depend only on information of the corresponding pair of clusters? We apply the Amini-Levina relaxation [3], originally intended to address the weak assortativity issue in community detection among networks, to convexify the $k$-means problem in the case of balanced clusters. Surprisingly, we end up with a completely localized proximity condition for the exactness of the convex relaxation, thus solving Awasthi and Sheffet’s open problem for the balanced case.

Furthermore, beyond the scope of the Peng-Wei relaxation of $k$-means, the proximity condition itself provides an algorithm that can accept answers to the NP-hard $k$-means problem (although it is not able to reject an answer). For a given solution to $k$-means, one can simply check whether the proximity condition holds, and if it does hold, then the solution is provably the unique global minimum. Assuming the number of clusters $k$ and the dimension $m$ of the data are fixed, the time complexity of this check is linear in the total number of points $N$, which improves upon the quasilinear-time algorithm proposed in [13].

### 1.3 Comparison to existing proximity conditions in the literature

As mentioned before, in the literature of projective $k$-means, proximity conditions have been proposed in [15] and later improved in [6]. In this section we compare our proximity conditions with these existing results.

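Denote $\overline{W} = [\overline{X}_1^\top, \ldots, \overline{X}_k^\top]^\top$. By our notation, the original Kumar-Kannan proximity condition [15] is equivalent to

$$h_{a,b} > 2\tau_{a,b} + Ck\left(\frac{1}{\sqrt{n_a}} + \frac{1}{\sqrt{n_b}}\right)\|\overline{W}\|, \quad \forall a \neq b,$$

for some large absolute constant $C$. The fact that $\|\overline{X}_l\| \le \|\overline{W}\|$ for all $l$ implies $\sum_{l=1}^{k}\|\overline{X}_l\|^2 \le k\|\overline{W}\|^2$. Therefore, our proximity condition (1.2) is strictly weaker than the Kumar-Kannan condition by at least a factor of $\sqrt{k}$.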
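
To make the comparison concrete, here is a short worked bound. It assumes that $\overline{W}$ denotes the matrix obtained by stacking the centered cluster matrices $\overline{X}_1, \ldots, \overline{X}_k$, so that $\|\overline{X}_l\| \le \|\overline{W}\|$ for every $l$. Under that assumption, and using $\sqrt{x + y} \le \sqrt{x} + \sqrt{y}$ for $x, y \ge 0$,

$$\sqrt{\sum_{l=1}^{k} \|\overline{X}_l\|^2 \left(\frac{1}{n_a} + \frac{1}{n_b}\right)} \;\le\; \sqrt{k}\,\|\overline{W}\| \sqrt{\frac{1}{n_a} + \frac{1}{n_b}} \;\le\; \sqrt{k}\left(\frac{1}{\sqrt{n_a}} + \frac{1}{\sqrt{n_b}}\right)\|\overline{W}\|.$$

The last expression is the Kumar-Kannan term with $Ck$ replaced by $\sqrt{k}$, so any partition satisfying the Kumar-Kannan condition with $C \ge 1$ also satisfies (1.2).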

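The comparison between (1.1) and the Awasthi-Sheffet conditions in [6] is less straightforward. Theorem 4 therein states that consistent clustering is guaranteed by projective $k$-means plus Lloyd’s algorithm as long as

$$h_{a,b} > \max\left\{2\tau_{a,b} + C\left(\frac{1}{\sqrt{n_a}} + \frac{1}{\sqrt{n_b}}\right)\|\overline{W}\|,\;\; C\sqrt{k}\left(\frac{1}{\sqrt{n_a}} + \frac{1}{\sqrt{n_b}}\right)\|\overline{W}\|\right\} \quad \forall a \neq b. \tag{1.3}$$

Compared to our proximity condition (1.1), the second term on the right-hand side of (1.3) could be more stringent given the fact $\sum_{l=1}^{k}\|\overline{X}_l\|^2 \le k\|\overline{W}\|^2$, whereas the first term is less stringent than ours since

$$\|\overline{W}\|^2 = \Big\|\sum_{l=1}^{k} \overline{X}_l^\top \overline{X}_l\Big\| \le \sum_{l=1}^{k} \|\overline{X}_l\|^2.$$

Therefore, it is fair to say our proximity condition is comparable to the Awasthi-Sheffet condition.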

### 1.4 Implications under stochastic models

We should emphasize that in order to prove our main results, we benefit a lot from the existing primal-dual analyses in [5, 13]. The major difference between our analysis and [5, 13] is that we aim at deriving proximity conditions under deterministic models rather than establishing minimum separation results under stochastic models.

However, we are still curious about what minimum separation conditions our proximity condition can yield when applied to both the stochastic ball model and the Gaussian mixture model. Before presenting conditions given by our proximity condition, we first review the state-of-the-art results on both models.

#### Existing work on the Peng-Wei relaxation:

The stochastic ball model can be viewed as a special case of mixture models where the distributions of sample data points are compactly supported on disjoint unit balls in $\mathbb{R}^m$. The clusters are balanced and the covariance structure is fairly rigid since all the distributions are assumed to be identical and isotropic.

Let $\Delta$ be the minimal separation between the cluster centers. In [5], it is proven that the Peng-Wei relaxation achieves exact recovery provided $\Delta > 2\sqrt{2}\left(1 + \frac{1}{\sqrt{m}}\right)$, where the lower bound of $\Delta$ is independent of the number of clusters $k$. Another bound of $\Delta$ is given in [13] stating that exact recovery is guaranteed if $\Delta > 2 + \frac{k^2}{m}$, which is near-optimal in the $m \gg k^2$ regime.

The Gaussian mixture model (GMM) as a stochastic model is more flexible. This model is characterized by its density function which is a weighted sum of the density functions of Gaussian or subgaussian distributions. In [20], assuming the Gaussian distributions are identical and isotropic, Mixon, Villar and Ward prove that the Peng-Wei relaxation learns the Gaussian centers for balanced clusters when the center separations are required to be above , where is the common covariance of all Gaussian distributions.

#### Existing work on other algorithms:

Clustering Gaussian mixture models has received extensive attention in the machine learning and statistics communities. Besides [20], a lot of progress has been made in developing efficient algorithms for this task. Among them is a family of algorithms here referred to as projective $k$-means [25, 1, 14, 15, 6, 9, 18]. In general, projective $k$-means works in two steps: first project all the data points onto a lower dimensional space, usually based on singular value decomposition (SVD), and then classify each point by heuristic methods such as single linkage clustering in [1] or Lloyd’s algorithm in [6].

Vempala and Wang [25] show that if each pairwise center separation is larger than a quantity determined by the number of clusters $k$, the dimension $m$ and the variances of the clusters, the projective algorithm can classify a mixture of isotropic Gaussians with high probability. Achlioptas and McSherry [1] show that SVD-based projection followed by single-linkage clustering is able to classify all the sampled data points accurately if the center separation of each pair of clusters is greater than the operator norm of the covariance matrix and the weights of the two clusters plus a term which depends on the concentration properties of the distributions in the mixture. The algorithm studied by Kumar and Kannan in [15] (the work that first devised the idea of a proximity condition) also begins with an SVD-based projection and proceeds with Lloyd’s algorithm, which is initialized by an unspecified near-optimal solution to the $k$-means problem. As stated before, its technical results are improved by Awasthi and Sheffet in [6]. Recently, Lu and Zhou [18] provide a more detailed estimate of the misclassification rate for each iteration of Lloyd’s algorithm with initialization given by spectral methods [14].

#### Our results:

We can easily apply the proximity condition to the stochastic ball model and the Gaussian mixture model. The corresponding recovery guarantees are competitive with or improve upon other state-of-the-art results.

• For the stochastic ball model, we show that is sufficient to guarantee the exact recovery of the Peng-Wei relaxation, which improves the separation condition in [13] when is large. Moreover, our result applies to a broader class of stochastic ball models where each cluster can have a different number of points and may even satisfy a different probability distribution as long as the support of density function is contained within a unit ball.

• For the Gaussian mixture model, we summarize our result for the Peng-Wei relaxation and other state-of-the-art results for both the Peng-Wei relaxation and projective -means in Table 1. It has been shown in [20] that the centers of a Gaussian mixture can be accurately estimated by Peng-Wei relaxation provided the minimal separation is . In contrast, our proximity provides a different minimal separation condition , which is smaller than if is large and not too large. Our separation condition is better than [15] and comparable to [6] for projective -means. Though our bound loses a factor vis-à-vis the one in [25] for the special case of spherical Gaussian mixtures, we can handle more general Gaussian mixtures where the density functions do not have to be spherical or identical.

### 1.5 Notation

Let $1_{\Gamma_a}$ be the indicator vector of $\Gamma_a$. $1_n$ is an $n \times 1$ vector with all entries equal to 1. Given any two real matrices $X$ and $Y$ in $\mathbb{R}^{m \times n}$, we define the inner product as $\langle X, Y \rangle = \operatorname{Tr}(X^\top Y)$. For a vector $u$, $\max(u)$ is equal to the largest entry of $u$. We denote $Z \ge 0$ if $Z$ is a nonnegative matrix, i.e., each entry is nonnegative; $Z \succeq 0$ if $Z$ is a symmetric positive semi-definite matrix. Besides, we also use the notation listed below throughout the paper.

## 2 k-means and the Peng-Wei relaxation

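In this section, we briefly review the formulation of $k$-means and its SDP relaxation introduced by Peng and Wei [21]. Let $\Gamma = \{x_l\}_{l=1}^{N}$ be a set of $N$ data points in $\mathbb{R}^m$. $k$-means attempts to divide $\Gamma$ into $k$ disjoint clusters by seeking a solution to the following minimization problem:

$$\min_{\{\Gamma_a\}_{a=1}^{k}} \; \min_{\{\gamma_a\}_{a=1}^{k}} \; \sum_{a=1}^{k} \sum_{l \in \Gamma_a} \|x_l - \gamma_a\|^2,$$

where $\{\Gamma_a\}_{a=1}^{k}$ form a partition of $\Gamma$ (i.e., $\sqcup_{a=1}^{k} \Gamma_a = \Gamma$ and $\Gamma_a \cap \Gamma_b = \emptyset$ if $a \neq b$). For any given partition $\{\Gamma_a\}_{a=1}^{k}$, choosing $\gamma_a = c_a = \frac{1}{|\Gamma_a|}\sum_{l \in \Gamma_a} x_l$ as the centroid minimizes the objective function. Therefore, the $k$-means problem is equivalent to:

$$\min_{\{\Gamma_a\}_{a=1}^{k}} \; \sum_{a=1}^{k} \sum_{l \in \Gamma_a} \|x_l - c_a\|^2. \tag{2.1}$$

Given an arbitrary partition $\{\Gamma_a\}_{a=1}^{k}$ of $\Gamma$, let $1_{\Gamma_a}$ be the indicator function of the $a$-th cluster. That is,

$$1_{\Gamma_a}(l) = \begin{cases} 1 & \text{if } l \in \Gamma_a, \\ 0 & \text{otherwise.} \end{cases}$$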

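A simple calculation reveals that

$$\frac{1}{|\Gamma_a|} \sum_{l \in \Gamma_a,\, s \in \Gamma_a} \|x_l - x_s\|^2 = 2 \sum_{l \in \Gamma_a} \|x_l - c_a\|^2$$

and hence,

$$\sum_{a=1}^{k} \sum_{l \in \Gamma_a} \|x_l - c_a\|^2 = \frac{1}{2} \sum_{a=1}^{k} \frac{1}{|\Gamma_a|} \sum_{l \in \Gamma_a,\, s \in \Gamma_a} \|x_l - x_s\|^2 = \frac{1}{2} \sum_{a=1}^{k} \frac{1}{|\Gamma_a|} \left\langle 1_{\Gamma_a} 1_{\Gamma_a}^\top,\, D \right\rangle,$$

where $D$ is the distance matrix with the $(l,s)$-th entry being given by $D_{l,s} = \|x_l - x_s\|^2$. Therefore, we can rewrite the $k$-means problem as

$$\min \; \langle Z, D \rangle \quad \text{s.t.} \quad Z = \sum_{a=1}^{k} \frac{1}{|\Gamma_a|} 1_{\Gamma_a} 1_{\Gamma_a}^\top \;\text{ with }\; \sqcup_{a=1}^{k} \Gamma_a = \Gamma \;\text{ and }\; \Gamma_a \cap \Gamma_b = \emptyset \;\text{ for } a \neq b. \tag{2.2}$$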
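
The identity behind (2.2) is easy to confirm numerically. The NumPy sketch below builds the partition matrix $Z$ and the squared-distance matrix $D$ for random data and compares $\frac{1}{2}\langle Z, D \rangle$ with the $k$-means cost. The helper names, the random data and the label vector are hypothetical choices made for this illustration.

```python
import numpy as np

def kmeans_cost(X, labels, k):
    """Sum of squared distances of each point to its cluster centroid."""
    return sum(
        np.sum((X[labels == a] - X[labels == a].mean(axis=0)) ** 2)
        for a in range(k)
    )

def partition_matrix(labels, k):
    """Z = sum_a 1_a 1_a^T / |Gamma_a| for the partition encoded by `labels`."""
    N = len(labels)
    Z = np.zeros((N, N))
    for a in range(k):
        ind = (labels == a).astype(float)
        Z += np.outer(ind, ind) / ind.sum()
    return Z

def squared_distance_matrix(X):
    """D[l, s] = ||x_l - x_s||^2."""
    sq = np.sum(X ** 2, axis=1)
    return sq[:, None] + sq[None, :] - 2.0 * X @ X.T

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))                      # 12 points in R^3
labels = np.array([0] * 5 + [1] * 4 + [2] * 3)    # an arbitrary 3-cluster partition
cost = kmeans_cost(X, labels, 3)
half_inner = 0.5 * np.sum(partition_matrix(labels, 3) * squared_distance_matrix(X))
```

The two quantities `cost` and `half_inner` agree up to floating-point error for any partition with nonempty clusters.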

It is self-evident that (2.2) is a non-convex problem due to the combinatorial nature of the feasible set. Indeed, (2.2) is an NP-hard problem [2]. Despite this, it can be easily verified that $Z$ satisfies the following four properties:

$$Z \succeq 0, \quad Z \ge 0, \quad Z 1_N = 1_N, \quad \operatorname{Tr}(Z) = k.$$

Replacing the constraint in (2.2) by the above four properties leads to the SDP relaxation of $k$-means introduced by Peng and Wei in [21],

$$\min \; \langle Z, D \rangle \quad \text{s.t.} \quad Z \succeq 0, \quad Z \ge 0, \quad Z 1_N = 1_N, \quad \operatorname{Tr}(Z) = k, \tag{2.3}$$

which will be the focus of this paper.
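
As a sanity check on the four properties, the sketch below tests whether a given matrix is feasible for (2.3) and applies the test to a partition matrix. The function name `is_peng_wei_feasible` and the tolerance are assumptions for illustration; solving the SDP itself would require a conic solver, which is not shown here.

```python
import numpy as np

def is_peng_wei_feasible(Z, k, tol=1e-9):
    """Check Z PSD, Z >= 0 entrywise, Z 1 = 1, and Tr(Z) = k, up to `tol`."""
    N = Z.shape[0]
    symmetric = np.allclose(Z, Z.T, atol=tol)
    psd = symmetric and np.linalg.eigvalsh(Z).min() >= -tol
    nonneg = Z.min() >= -tol
    rows = np.allclose(Z @ np.ones(N), np.ones(N), atol=tol)
    trace = abs(np.trace(Z) - k) <= tol
    return bool(psd and nonneg and rows and trace)

# Partition matrix of 6 points split into clusters of sizes 3, 2 and 1.
labels = np.array([0, 0, 0, 1, 1, 2])
Z = np.zeros((6, 6))
for a in range(3):
    ind = (labels == a).astype(float)
    Z += np.outer(ind, ind) / ind.sum()

partition_is_feasible = is_peng_wei_feasible(Z, 3)
identity_is_feasible = is_peng_wei_feasible(np.eye(6), 3)  # trace is 6, not 3
```

Every partition matrix passes this check, but the feasible set of (2.3) also contains matrices that do not correspond to any partition, which is why exactness needs a separate argument.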

The Peng-Wei relaxation is a convex problem and can be solved in polynomial time using the interior-point method [27]. We denote by the optimal solution to the Peng-Wei relaxation. Clearly, every feasible point of (2.2) is also feasible for (2.3); so once the optimal solution to (2.3) has the form , it must be an optimal solution to the -means problem. Therefore, the question of central importance is:

When is the solution to (2.3) of the form $\sum_{a=1}^{k} \frac{1}{|\Gamma_a|} 1_{\Gamma_a} 1_{\Gamma_a}^\top$?

## 3 Exact recovery guarantees

### 3.1 Exact clustering and proximity conditions

In a nutshell, our main theorem states that the proximity condition (1.1) implies the exactness of the Peng-Wei relaxation (2.3):

###### Theorem 3.1 (Main theorem).

Suppose the partition $\{\Gamma_a\}_{a=1}^{k}$ obeys the proximity condition (1.1). Then the minimizer of the Peng-Wei relaxation (2.3) is unique and given by $\sum_{a=1}^{k} \frac{1}{n_a} 1_{\Gamma_a} 1_{\Gamma_a}^\top$.

Since (2.3) relaxes (2.2), which is equivalent to (2.1), its global minimum can never exceed that of the $k$-means problem. Theorem 3.1 therefore implies that the proximity condition provides a simple algorithm that is able to accept answers to the $k$-means problem.

###### Corollary 3.2 (Algorithm accepting answers to k-means).

If a partition $\{\Gamma_a\}_{a=1}^{k}$ satisfies the proximity condition (1.1), then it is the unique global minimum of the $k$-means objective function.

Note that each data point appears $k-1$ times on the left hand side of (1.1), and each matrix operator norm can be computed using the Golub-Reinsch SVD algorithm [11]. Thus, for a fixed number of clusters and a fixed dimension, the time cost to examine the proximity condition is linear in the total number of points $N$.

To the best of our knowledge, it has not been shown whether the $k$-means problem is in NP. The proximity condition does not change this fact. We want to emphasize that the polynomial-time examination of the proximity condition (1.1) does not imply that an answer to the $k$-means problem can be verified in polynomial time, since it does not accept all correct answers. A different approach that leverages the dual certificate associated with the Peng-Wei relaxation to test, under certain conditions, the optimality of a candidate $k$-means solution can be found in [13]. The algorithm proposed in [13] tests the optimality of a candidate solution in quasilinear time. Hence, our method improves the time complexity by a logarithmic factor.

While the main theorem provides a sufficient condition for the Peng-Wei relaxation to exactly recover a given partition, the following theorem gives a necessary condition.

###### Theorem 3.3 (Necessary condition).

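Suppose $Z = \sum_{a=1}^{k} \frac{1}{n_a} 1_{\Gamma_a} 1_{\Gamma_a}^\top$ is a global minimum of (2.3). Then the partition $\{\Gamma_a\}_{a=1}^{k}$ must satisfy

$$h_{a,b} \ge \tau_{a,b} + \sqrt{\tau_{a,b}^2 + \max_{t} \|\overline{X}_t\|^2 \left(\frac{1}{n_a} + \frac{1}{n_b}\right)}, \quad \forall a \neq b. \tag{3.1}$$

Notice that as long as $Z = \sum_{a=1}^{k} \frac{1}{n_a} 1_{\Gamma_a} 1_{\Gamma_a}^\top$ is a solution to (2.3), $\{\Gamma_a\}_{a=1}^{k}$ must be a global minimum of $k$-means. In other words, it is harder for a deterministic mixture to be exactly recovered by the Peng-Wei relaxation than to be the global minimum of $k$-means. It remains unclear whether this necessary condition (Theorem 3.3) is only necessary for the Peng-Wei relaxation or is necessary for $k$-means itself as well.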

### 3.2 Balanced case: Amini-Levina relaxation and proximity condition

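One special case of interest is the balanced case where each cluster has the same number of points, i.e., $n_1 = \cdots = n_k = n$. We have seen in Section 2 that the $k$-means problem can be rewritten as (2.2):

$$\min \; \langle Z, D \rangle \quad \text{s.t.} \quad Z = \sum_{a=1}^{k} \frac{1}{|\Gamma_a|} 1_{\Gamma_a} 1_{\Gamma_a}^\top \;\text{ with }\; \sqcup_{a=1}^{k} \Gamma_a = \Gamma \;\text{ and }\; \Gamma_a \cap \Gamma_b = \emptyset \;\text{ for } a \neq b. \tag{3.2}$$

With the balanced assumption, i.e., the cardinalities of all clusters being the same, it is easy to verify that $Z$ obeys the following four constraints:

$$Z \succeq 0, \quad Z \ge 0, \quad Z 1_N = 1_N, \quad \operatorname{diag}(Z) = \frac{1}{n} 1_N.$$

This leads to the Amini-Levina relaxation of $k$-means, which was first introduced in [3] for community detection in the balanced case in order to address the weak assortativity issue:

$$\min \; \langle Z, D \rangle \quad \text{s.t.} \quad Z \succeq 0, \quad Z \ge 0, \quad Z 1_N = 1_N, \quad \operatorname{diag}(Z) = \frac{1}{n} 1_N. \tag{3.3}$$

As with the analysis of the Peng-Wei relaxation, once the optimal solution to (3.3) takes the form $\frac{1}{n}\sum_{a=1}^{k} 1_{\Gamma_a} 1_{\Gamma_a}^\top$, the Amini-Levina relaxation gives an optimal solution to the $k$-means problem under the balanced assumption. Once again, we ask the same question as for Peng and Wei’s relaxation: When is the solution to (3.3) of the form $\frac{1}{n}\sum_{a=1}^{k} 1_{\Gamma_a} 1_{\Gamma_a}^\top$?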

Unsurprisingly, the answer is another proximity condition specially tailored for Amini and Levina’s relaxation.

###### Condition 3.4 (Proximity condition for balanced clusters).

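A partition $\{\Gamma_a\}_{a=1}^{k}$ with $n_a = n$ for all $a$ satisfies the proximity condition for balanced clusters if for any $a \neq b$, there holds

$$\min_{1 \le i \le n} \left\langle x_{a,i} - \frac{c_a + c_b}{2},\; w_{b,a} \right\rangle > \sqrt{\frac{k}{4n}\left(\|\overline{X}_a\|^2 + \|\overline{X}_b\|^2\right)}. \tag{3.4}$$

Similar to the general case, the proximity condition for balanced clusters also has an equivalent formulation:

$$h_{a,b} > 2\tau_{a,b} + \sqrt{\frac{k}{n}\left(\|\overline{X}_a\|^2 + \|\overline{X}_b\|^2\right)}. \tag{3.5}$$
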
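
The localized nature of (3.4) can be seen in a small numerical sketch. The function below checks (3.4) for equal-size clusters; the function name and the toy data (two tight clusters plus one distant cluster with a much larger spread) are hypothetical and chosen only for illustration.

```python
import numpy as np

def balanced_proximity_holds(clusters):
    """Check inequality (3.4) for every ordered pair of equal-size clusters."""
    k = len(clusters)
    n = len(clusters[0])
    assert all(len(X) == n for X in clusters), "clusters must have equal size"
    centers = [X.mean(axis=0) for X in clusters]
    sq_norms = [np.linalg.norm(X - c, 2) ** 2 for X, c in zip(clusters, centers)]
    for a in range(k):
        for b in range(k):
            if a == b:
                continue
            diff = centers[a] - centers[b]
            w_ba = diff / np.linalg.norm(diff)  # unit vector from c_b to c_a
            midpoint = 0.5 * (centers[a] + centers[b])
            lhs = np.min((clusters[a] - midpoint) @ w_ba)
            # Only clusters a and b enter the right-hand side.
            rhs = np.sqrt(k / (4.0 * n) * (sq_norms[a] + sq_norms[b]))
            if not lhs > rhs:
                return False
    return True

A = np.array([[0.0, 0.1], [0.0, -0.1], [0.1, 0.0], [-0.1, 0.0]])
clusters = [A, A + np.array([1.0, 0.0]), 100.0 * A + np.array([1000.0, 0.0])]
localized_ok = balanced_proximity_holds(clusters)
```

In this toy example the large, distant third cluster does not affect the inequality for the first two clusters, because their right-hand side only involves $\|\overline{X}_a\|$ and $\|\overline{X}_b\|$.
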
###### Theorem 3.5 (Exact recovery for balanced clusters).

Suppose the partition $\{\Gamma_a\}_{a=1}^{k}$ with $n_a = n$ for all $a$ obeys the proximity condition for balanced clusters (3.4). Then the minimizer of the Amini-Levina relaxation (3.3) is unique and given by $\frac{1}{n}\sum_{a=1}^{k} 1_{\Gamma_a} 1_{\Gamma_a}^\top$. Therefore, the partition can be recovered exactly by the Amini-Levina relaxation.

Compared with the proximity condition for Peng and Wei’s relaxation (1.1), the proximity condition for Amini and Levina’s relaxation distinguishes itself by decoupling the clusters in the sense that each of the inequalities in (3.4) only depends on the two clusters involved in the inequality. In the case of balanced clusters, this immediately solves the open question posed by Awasthi and Sheffet [6], which asks if such a proximity condition exists.

The completely localized proximity condition is particularly meaningful when there are a few abnormal clusters whose covariance matrices are huge in operator norm but which lie far away from all the other clusters. In this case, the proximity condition for Amini and Levina’s relaxation has a far better chance than that for Peng and Wei’s relaxation of detecting a reasonable partition of the data set. Figure 2 provides such an example.

Analogously, we can also prove a necessary condition for the Amini-Levina relaxation, which can be compared with Theorem 3.3 for the general case.

###### Theorem 3.6 (Necessary condition for balanced clusters).

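Suppose $Z = \frac{1}{n}\sum_{a=1}^{k} 1_{\Gamma_a} 1_{\Gamma_a}^\top$ is a global minimum of (3.3). Then the partition $\{\Gamma_a\}_{a=1}^{k}$ must satisfy

$$h_{a,b} \ge \tau_{a,b} + \sqrt{\tau_{a,b}^2 + \frac{1}{n}\left(\|\overline{X}_a\|^2 + \|\overline{X}_b\|^2\right)}, \quad \forall a \neq b. \tag{3.6}$$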

## 4 Results under random models

Next we apply the proximity condition (1.1) to data sets generated from the generalized stochastic ball model and the Gaussian mixture model, respectively. We first give a formal definition for each model and then present the minimal separation condition which is sufficient to guarantee the exact recovery of underlying clusters by the Peng-Wei relaxation. The minimal separation conditions are established by verifying the proximity condition (1.1) for those two random models. For proofs, see Sections 8.2 and 8.3.

### 4.1 Stochastic ball model

The definition of the generalized stochastic ball model is given as follows, where we only assume the support of the density function is contained in the unit ball of $\mathbb{R}^m$ for all clusters.

###### Definition 4.1 (Generalized stochastic ball model).

Let $\{\mu_a\}_{a=1}^{k}$ be a set of $k$ deterministic vectors in $\mathbb{R}^m$. For each $1 \le a \le k$, $\mathcal{D}_a$ is a distribution supported on the unit ball of $\mathbb{R}^m$ with a covariance matrix $\Sigma_a$, and $\{r_{a,i}\}_{i=1}^{n_a}$ are i.i.d. zero-mean random vectors drawn from the distribution $\mathcal{D}_a$. The $a$-th cluster is formed by $\{x_{a,i}\}_{i=1}^{n_a}$, where $x_{a,i} = \mu_a + r_{a,i}$ for $1 \le i \le n_a$.
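
For concreteness, the sketch below draws one instance of this model in the special case where every cluster uses the uniform distribution on the unit ball. The definition allows other distributions supported on the unit ball, so the uniform choice, the function names, the centers and the cluster sizes are all assumptions of this example.

```python
import numpy as np

def sample_unit_ball(n, m, rng):
    """Draw n points uniformly from the unit ball in R^m."""
    directions = rng.normal(size=(n, m))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # Radius with density proportional to r^(m-1) gives a uniform ball sample.
    radii = rng.uniform(size=(n, 1)) ** (1.0 / m)
    return directions * radii

def sample_ball_model(centers, sizes, rng):
    """Cluster a consists of center mu_a plus n_a zero-mean ball samples."""
    return [mu + sample_unit_ball(n, len(mu), rng) for mu, n in zip(centers, sizes)]

rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 0.0], [3.0, 0.0, 0.0]])
clusters = sample_ball_model(centers, sizes=[50, 80], rng=rng)
```

Unequal cluster sizes such as 50 and 80 are allowed here, in line with the generalized model, which does not require balanced clusters.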

###### Corollary 4.1.

Denote , , , and For the generalized stochastic ball model, we draw points from the -th ball for each . The Peng-Wei relaxation achieves exact recovery with probability at least if and

$$\Delta \ge 2 + \sqrt{\frac{2}{w_{\min}}}\,\sigma_{\max} + 7\sqrt{\frac{t}{w_{\min}}}, \tag{4.1}$$

where and . In particular, if for all , and each is a uniform distribution over the unit ball of , then (4.1) can be simplified to

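$$\Delta \ge 2 + \sqrt{\frac{2k}{m+2}} + 7\sqrt{tk}$$

by noting that $w_{\min} = \frac{1}{k}$ and $\sigma_{\max}^2 = \frac{1}{m+2}$ in this case.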

###### Remark 4.2.

As the number of data points goes to infinity provided and are fixed, the value of vanishes. So asymptotically the minimal separation condition reduces to when and . Note that we only assume that the distribution is supported on the unit ball, so rotation-invariant distributions which are assumed in [13, 12] are also included. Compared with the result in [13, 12] where is required, we have achieved a better bound when is large.

We can also apply the necessary lower bound (Theorem 3.3) to the generalized stochastic ball model. To illustrate this, let us study a special case where the following Corollary holds.

###### Corollary 4.3.

For the generalized ball model, if for all we have , then with high probability, the Peng-Wei relaxation fails to achieve exact recovery provided that is large enough and

$$\Delta < 1 + \sqrt{1 + 2\sigma_{\max}^2}.$$

If for any , is the uniform distribution over the unit ball, the bound becomes

$$\Delta < 1 + \sqrt{1 + \frac{2}{m+2}}.$$

### 4.2 Gaussian mixture model

The definition of Gaussian mixture model is given below, followed by the minimal separation condition for the exactness of the Peng-Wei relaxation.

###### Definition 4.2 (Gaussian mixture model).

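Consider a mixture of $k$ Gaussian distributions $\mathcal{N}(\mu_a, \Sigma_a)$ in $\mathbb{R}^m$ with a set of weights $\{w_a\}_{a=1}^{k}$ obeying $w_a > 0$ and $\sum_{a=1}^{k} w_a = 1$. The probability density function of this mixture model is

$$p(x) = \sum_{a=1}^{k} w_a\, p_{\mathcal{N}}(x; \mu_a, \Sigma_a), \quad x \in \mathbb{R}^m,$$

where $p_{\mathcal{N}}(x; \mu_a, \Sigma_a)$ is the probability density function of the Gaussian distribution $\mathcal{N}(\mu_a, \Sigma_a)$.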
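
A sample from such a mixture can be generated by first drawing a component label according to the weights and then drawing from the corresponding Gaussian. The NumPy sketch below does this; the function name `sample_gmm` and the particular weights, means and covariances are hypothetical values chosen for illustration.

```python
import numpy as np

def sample_gmm(N, weights, means, covs, rng):
    """Draw N points from sum_a w_a N(mu_a, Sigma_a); also return the labels."""
    labels = rng.choice(len(weights), size=N, p=weights)
    X = np.empty((N, means.shape[1]))
    for a in range(len(weights)):
        idx = np.flatnonzero(labels == a)
        X[idx] = rng.multivariate_normal(means[a], covs[a], size=len(idx))
    return X, labels

rng = np.random.default_rng(0)
weights = [0.3, 0.7]
means = np.array([[0.0, 0.0], [5.0, 0.0]])
covs = np.array([[[1.0, 0.0], [0.0, 1.0]],
                 [[2.0, 0.5], [0.5, 1.0]]])   # components need not be spherical
X, labels = sample_gmm(200, weights, means, covs, rng)
```

The returned labels give the ground-truth partition against which the output of a relaxation can be compared.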

###### Corollary 4.4.

Denote , and . For the Gaussian mixture model, the Peng-Wei relaxation achieves exact recovery with probability at least if

$$\Delta \ge \sigma_{\max}\left(\frac{2}{\sqrt{w_{\min}}} + 4\sqrt{2}\,\log^{1/2}(kN^2) + q(N; m, k, w_{\min})\right),$$

where if . In particular, if and for all , then the above condition reduces to

$$\Delta \ge 2\sqrt{k} + 4\sqrt{2}\,\log^{1/2}(kN^2) + q(N; m, k, 1/k),$$

and if .

## 5 Numerical experiments

Consider applying the Peng-Wei relaxation to the generalized stochastic ball model. When the total number of the data points becomes large enough, the parameter vanishes and the sufficient lower bound predicted by Corollary 4.1 as in (4.1) becomes

$$\Delta \ge 2 + \sigma_{\max}\sqrt{\frac{2}{w_{\min}}}. \tag{5.1}$$

The state-of-the-art bound for the stochastic ball model proved in [5, 13] is

$$\Delta > \min\left\{2\sqrt{2}\left(1 + \frac{1}{\sqrt{m}}\right),\; 2 + \frac{k^2}{m}\right\}. \tag{5.2}$$
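
To get a feel for how the two sufficient bounds compare, the short sketch below evaluates them for a few values of $k$ and $m$. It assumes balanced clusters with the uniform distribution on the unit ball, so that $w_{\min} = 1/k$ and $\sigma_{\max}^2 = 1/(m+2)$ in (5.1); the function names are hypothetical.

```python
import math

def our_asymptotic_bound(k, m):
    """(5.1) with w_min = 1/k and sigma_max^2 = 1/(m+2): 2 + sqrt(2k/(m+2))."""
    return 2.0 + math.sqrt(2.0 * k / (m + 2.0))

def prior_bound(k, m):
    """Right-hand side of (5.2)."""
    return min(2.0 * math.sqrt(2.0) * (1.0 + 1.0 / math.sqrt(m)), 2.0 + k ** 2 / m)

for k, m in [(2, 2), (8, 2), (2, 100)]:
    print(k, m, round(our_asymptotic_bound(k, m), 3), round(prior_bound(k, m), 3))
```

Under these assumptions neither bound dominates the other: which one is smaller depends on the relative sizes of $k$ and $m$.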

The exact phase transition bound, above which exact recovery can be achieved by the Peng-Wei relaxation of -means, is smaller than both of the above sufficient lower bounds. As one would expect, the actual lower bound is hard to find in practice. The major difficulty occurs when the number of clusters is greater than 2. In this case, when creating an instance of the stochastic ball model with prescribed minimal separation distance , there are infinitely many possible ways to place the centers and this cannot be resolved by translation, rotation, and scaling. To address this, we investigate the worst case where centers are packed as compactly as possible while points in each cluster are chosen in the most scattered way. We have a better chance finding a more accurate lower bound under this arrangement.

Three instructive centroidal geometries, the geometries formed by the locations of the centers, are considered, and we call them circle-shaped geometry, line-shaped geometry, and hive-shaped geometry respectively. Centers are packed compactly under these shapes, especially the hive-shaped geometry. We can rescale the three geometries to change the minimal separation distance . An illustration of these geometries formed by the locations of the centers is shown in Figure 3.

We let the number of data points in each cluster be . Hence, the total number of points . As a result, . These points are equispaced points on the unit circle centered at . The data points are chosen in this way since it maximizes the variance. Because the data is isotropic and the variance is equal to , we have .
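
The point configuration described above can be reproduced with a few lines of NumPy. The sketch places equispaced points on a unit circle and computes their empirical covariance; the number of points (20) and the center are arbitrary values chosen for this example, not the values used in the experiments.

```python
import numpy as np

def circle_points(n, center):
    """n equispaced points on the unit circle around `center` in R^2."""
    angles = 2.0 * np.pi * np.arange(n) / n
    return center + np.column_stack([np.cos(angles), np.sin(angles)])

P = circle_points(20, np.array([1.0, -2.0]))
centered = P - P.mean(axis=0)
cov = centered.T @ centered / len(P)   # empirical covariance of the cluster
```

For any $n \ge 3$ the empirical covariance of such points is isotropic, equal to one half of the identity matrix, and every point sits on the boundary of the unit ball.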

For and chosen above, we can see that our bound is an improvement over the state-of-the-art result. Overall, it is still a meaningful addition to the state-of-the-art result. Nevertheless, it is not yet tight. Figure 4 shows that the actual lower bound is almost independent of the parameter , while our theory still relies on the assumption that .

Another parameter that may affect the bound is the dimension . To reveal the dependence of the bound on the dimension, we fix the number of clusters to be and let the dimension vary between and . The center separation is chosen among equispaced values between and . The number of points in each cluster is equal to , so there are in total. The distribution for each ball is the uniform distribution on the unit sphere centered at . For any fixed pair of and , we generate instances of the stochastic ball model.
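One way to generate such an instance is sketched below. It draws points uniformly from the unit sphere around each center by normalizing standard Gaussian vectors, which is a standard sampling technique and not necessarily the authors' procedure. The helper name `sample_unit_sphere` and all numerical values (dimension, cluster size, centers, separation) are illustrative assumptions.

```python
import numpy as np


def sample_unit_sphere(n: int, m: int, center: np.ndarray, rng) -> np.ndarray:
    """Draw n points uniformly from the unit sphere in R^m centered at `center`."""
    g = rng.standard_normal((n, m))
    return center + g / np.linalg.norm(g, axis=1, keepdims=True)


rng = np.random.default_rng(0)
m, n = 5, 50                                  # illustrative dimension and cluster size
separation = 3.0                              # illustrative center separation
centers = np.array([np.zeros(m), separation * np.eye(m)[0]])

# Stack the clusters and record the ground-truth labels.
X = np.vstack([sample_unit_sphere(n, m, c, rng) for c in centers])
labels = np.repeat(np.arange(len(centers)), n)
print(X.shape, labels.shape)
```

Normalizing Gaussian vectors gives the uniform distribution on the sphere because the Gaussian density is rotation invariant.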

From Figure 5, it is evident that neither our bound nor the state-of-the-art bound is tight. The blue line, which represents the bound , fits our empirical result the best. Based on the observed dependence of the empirical lower bound on the parameters and in Figures 4 and 5, we formulate the conjecture stated below.

###### Conjecture 5.1.

For a mixture generated by the generalized stochastic ball model, the Peng-Wei relaxation achieves exact recovery with high probability if

$$\Delta \ge 2 + O\left(\frac{1}{m}\right), \tag{5.3}$$

provided that the total number of points is large enough.

After the completion of this manuscript, a semidefinite relaxation based on graph cuts was proposed in [16] to overcome the performance limits of the Peng-Wei relaxation, which provides an alternative way to learn the stochastic ball models.

## 6 Proofs for Section 3.1

We will prove the main theorem and related results under the proximity condition given in Proposition 1.2. The proof for the equivalence of the two proximity conditions is presented at the end of this section. The key ingredient in the proof of the main theorem is to construct a dual variable to certify the optimality of the desired solution based on the conic duality theorem in convex optimization [7].

### 6.1 Conic duality

We first rewrite (2.3) as a cone program in standard form, which naturally leads to its dual formulation. Noting that $Z$ is a symmetric variable, the Peng-Wei relaxation of $k$-means (2.3) is equivalent to the following optimization problem:

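$$\min\ \langle Z, D\rangle \quad \text{s.t.} \quad Z \succeq 0,\quad Z \ge 0,\quad \frac{1}{2}\left(Z + Z^\top\right) 1_N = 1_N,\quad \operatorname{Tr}(Z) = k. \tag{6.1}$$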
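To make the constraints in (6.1) concrete, the sketch below checks that a normalized partition matrix is feasible and that its objective value equals twice the $k$-means cost. It rests on two assumptions not stated in this section: that $D$ is the matrix of squared pairwise Euclidean distances, and that the candidate solution is the block matrix with entries $1/n_a$ on the block of cluster $a$. The helper names and the random data are hypothetical.

```python
import numpy as np


def partition_matrix(labels) -> np.ndarray:
    """Block matrix with entries 1/n_a on the diagonal block of cluster a."""
    labels = np.asarray(labels)
    N = labels.size
    X = np.zeros((N, N))
    for a in np.unique(labels):
        idx = np.flatnonzero(labels == a)
        X[np.ix_(idx, idx)] = 1.0 / idx.size
    return X


def squared_distance_matrix(points: np.ndarray) -> np.ndarray:
    """D[i, j] = ||x_i - x_j||^2 (assumed definition of D)."""
    sq = np.sum(points**2, axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * points @ points.T, 0.0)


rng = np.random.default_rng(1)
points = rng.standard_normal((12, 3))        # illustrative data
labels = np.repeat([0, 1, 2], 4)             # three clusters of four points
k = 3
X = partition_matrix(labels)
D = squared_distance_matrix(points)

ones = np.ones(len(labels))
feasible = (
    np.linalg.eigvalsh(X).min() > -1e-10     # X is positive semidefinite
    and X.min() >= 0                         # X is entrywise nonnegative
    and np.allclose(0.5 * (X + X.T) @ ones, ones)
    and np.isclose(np.trace(X), k)
)
kmeans_cost = sum(
    np.sum((points[labels == a] - points[labels == a].mean(axis=0)) ** 2)
    for a in range(k)
)
print(feasible, np.isclose(np.sum(X * D), 2.0 * kmeans_cost))
```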

Let $\mathcal{K}$ be the intersection of two self-dual cones: the positive semi-definite cone and the nonnegative cone . By definition, it is a pointed¹ and closed convex cone with a nonempty interior. Moreover, its dual cone² is given by . Let $\mathcal{A}$ be a linear map from to defined as follows:

$$\mathcal{A}(Z): Z \mapsto \begin{bmatrix} \langle Z, I_N\rangle \\ \frac{1}{2}\left(Z + Z^\top\right) 1_N \end{bmatrix}.$$

We can express (6.1) in the form of a standard cone program,

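$$\min\ \langle Z, D\rangle, \quad \text{s.t.} \quad \mathcal{A}(Z) = \begin{bmatrix} k \\ 1_N \end{bmatrix},\quad Z \in \mathcal{K}. \tag{6.2}$$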

Thus, using the standard derivation in Lagrangian duality theory [8], the dual problem of (6.1) is given by

$$\max\ -kz - \langle \alpha, 1_N\rangle, \quad \text{s.t.} \quad D + \mathcal{A}^*(\lambda) \in \mathcal{K}^*, \tag{6.3}$$

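where $\lambda = \begin{bmatrix} z \\ \alpha \end{bmatrix}$ is the dual variable with respect to the affine constraints and

$$\mathcal{A}^*(\lambda) := \frac{1}{2}\left(\alpha 1_N^\top + 1_N \alpha^\top\right) + z I_N \tag{6.4}$$

is the adjoint operator of $\mathcal{A}$ under the canonical inner product over .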
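The adjoint relation can be sanity-checked numerically. The sketch below is a minimal check, assuming the dual variable is stored as a vector whose first entry is the scalar $z$ and whose remaining $N$ entries form $\alpha$; the function names `A` and `A_adj` are hypothetical. It compares the two inner products for a random matrix and a random dual vector.

```python
import numpy as np


def A(Z: np.ndarray) -> np.ndarray:
    """Linear map Z -> [<Z, I_N>; (1/2)(Z + Z^T) 1_N]."""
    ones = np.ones(Z.shape[0])
    return np.concatenate([[np.trace(Z)], 0.5 * (Z + Z.T) @ ones])


def A_adj(lam: np.ndarray) -> np.ndarray:
    """Adjoint map from (6.4); lam = [z, alpha] is an assumed storage layout."""
    z, alpha = lam[0], lam[1:]
    ones = np.ones(alpha.size)
    return 0.5 * (np.outer(alpha, ones) + np.outer(ones, alpha)) + z * np.eye(alpha.size)


rng = np.random.default_rng(2)
N = 6
Z = rng.standard_normal((N, N))              # an arbitrary, not necessarily symmetric, matrix
lam = rng.standard_normal(N + 1)

lhs = A(Z) @ lam                             # <A(Z), lambda>
rhs = np.sum(Z * A_adj(lam))                 # <Z, A*(lambda)> in the trace inner product
print(np.isclose(lhs, rhs))
```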

### 6.2 Optimality condition

This subsection presents a necessary and sufficient condition for $X$ to be a global minimizer of the Peng-Wei relaxation. The result is summarized in Proposition 6.5, which follows from complementary slackness in conic duality theory. Moreover, a stronger sufficient condition is established for the uniqueness of $X$ in Proposition 6.6.

###### Theorem 6.1 (Conic Duality Theorem, Theorem 2.4.1 in [7]).

The following hold:

1. If the primal problem is strictly feasible and bounded below, then the dual program is solvable³ and the optimal values of the primal/dual problems are equal to each other;

2. If the dual problem is strictly feasible and bounded above, then the primal program is solvable and the optimal values of the primal/dual problems are equal to each other;

3. Assume either the primal problem or the dual problem is bounded and strictly feasible. Then is a pair of primal/dual optima if and only if either the duality gap is zero or the complementary slackness holds.

The following lemma, tailored to (6.1) and (6.3), simply follows from the strict feasibility of (6.1) or (6.3) and Theorem 6.1.

###### Lemma 6.2.

Both primal/dual problems (6.1) and (6.3) are strictly feasible and bounded below/above. Therefore, they are solvable (so the optimal values are attained). Moreover, is a pair of primal/dual optima if and only if the complementary slackness holds: where .

###### Proof.

Consider , where for . Note that and . So is in the interior of $\mathcal{K}$. It is also easy to verify that satisfies the other two equality constraints. This shows (6.1) is strictly feasible. In addition, we can see that the objective function in (6.1) is also nonnegative since both $Z$ and $D$ are entrywise nonnegative. In conclusion, the primal problem is strictly feasible and bounded below by $0$.

Note that $J_{N\times N}$ is a strictly positive symmetric matrix. For the dual problem (6.3), we can take $\alpha = 1_N$ and let $z$ be a sufficiently large positive number such that

$$D + \mathcal{A}^*(\lambda) = \underbrace{J_{N\times N}}_{\text{a positive matrix}} + \underbrace{\left(D + zI_N - J_{N\times N}\right)}_{\text{a positive definite matrix}}$$

is in the interior of $\mathcal{K}^*$. Hence, the dual program is also strictly feasible. Its optimal value is bounded above because, by weak duality, it never exceeds the optimal value of the primal problem.
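The decomposition used in this step can be illustrated numerically. In the sketch below, $D$ is assumed to be a squared-distance matrix built from random points, the vector part of the dual variable is the all-ones vector (as the displayed decomposition implies), and the scalar part is chosen just above the spectral norm of the all-ones matrix minus $D$. That choice is one convenient way to guarantee positive definiteness and is not claimed to be the authors' choice.

```python
import numpy as np

rng = np.random.default_rng(3)
points = rng.standard_normal((10, 2))                 # illustrative data
diff = points[:, None, :] - points[None, :, :]
D = np.sum(diff**2, axis=2)                           # assumed: squared pairwise distances
N = D.shape[0]
J = np.ones((N, N))                                   # all-ones matrix, strictly positive

# Any z above the largest eigenvalue of J - D works; the spectral norm plus 1 is a safe pick.
z = np.linalg.norm(J - D, 2) + 1.0
S = D + z * np.eye(N) - J                             # second summand in the decomposition
min_eig = np.linalg.eigvalsh(S).min()

print(J.min() > 0, min_eig > 0)
```

A strictly positive matrix plus a positive definite matrix lies in the interior of the sum of the nonnegative and positive semidefinite cones, which is what strict dual feasibility requires.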

Therefore, the application of Theorem 6.1 implies that is a pair of primal/dual optima if and only if the complementary slackness holds, i.e., where and

###### Remark 6.3.

The complementary slackness is indeed equivalent to the zero duality gap since the optimal values of both problems are attained and there holds

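$$\langle D, X\rangle = -\langle \mathcal{A}^*(\lambda), X\rangle = -\langle \lambda, \mathcal{A}(X)\rangle = -\left\langle \lambda, \begin{bmatrix} k \\ 1_N \end{bmatrix}\right\rangle = -kz - \langle \alpha, 1_N\rangle.$$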
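The same adjoint identity also yields weak duality, which explains why the dual value can never exceed the primal value. As a short worked step, assume $Z$ is feasible for (6.1) and $\lambda$, with scalar part $z$ and vector part $\alpha$, is feasible for (6.3). Then the duality gap can be written as

```latex
\begin{aligned}
\langle Z, D\rangle - \bigl(-kz - \langle \alpha, 1_N\rangle\bigr)
  &= \langle Z, D\rangle + \langle \lambda, \mathcal{A}(Z)\rangle \\
  &= \langle Z, D + \mathcal{A}^*(\lambda)\rangle \;\ge\; 0,
\end{aligned}
```

where the inequality holds because $Z \in \mathcal{K}$ and $D + \mathcal{A}^*(\lambda) \in \mathcal{K}^*$. The gap vanishes exactly when this last inner product is zero.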

In the following lemma, we derive a more explicit expression for complementary slackness, which will be used in the analysis later. By the definition of $\mathcal{K}^*$, the matrix $D + \mathcal{A}^*(\lambda)$ must be of the form

$$D + \mathcal{A}^*(\lambda) = B + Q, \tag{6.5}$$

where $B \ge 0$, $Q \succeq 0$, and both of them are symmetric.

###### Lemma 6.4.

The complementary slackness is equivalent to

$$B^{(a,a)} = 0 \ \text{ for all } 1 \le a \le k, \quad \text{and} \quad QX = XQ = 0, \tag{6.6}$$

where and obeys (6.5) for some . It follows immediately that for Moreover, (6.6) implies that the dual variable satisfies