Clustering with Spectral Norm and the k-means Algorithm

There has been much progress on efficient algorithms for clustering data points generated by a mixture of $k$ probability distributions under the assumption that the means of the distributions are well-separated, i.e., the distance between the means of any two distributions is at least $\Omega(k)$ standard deviations. These results generally make heavy use of the generative model and particular properties of the distributions. In this paper, we show that a simple clustering algorithm works without assuming any generative (probabilistic) model. Our only assumption is what we call a “proximity condition”: the projection of any data point onto the line joining its cluster center to any other cluster center is $\Omega(k)$ standard deviations closer to its own center than to the other center. Here the notion of standard deviations is based on the spectral norm of the matrix whose rows represent the difference between a point and the mean of the cluster to which it belongs. We show that in the generative models studied, our proximity condition is satisfied, and so we are able to derive most known results for generative models as corollaries of our main result. We also prove some new results for generative models - e.g., we can cluster all but a small fraction of points assuming only a bound on the variance. Our algorithm relies on the well-known $k$-means algorithm, and along the way, we prove a result of independent interest: the $k$-means algorithm converges to the “true centers” even in the presence of spurious points, provided the initial (estimated) centers are close enough to the corresponding actual centers and all but a small fraction of the points satisfy the proximity condition. Finally, we present a new technique for boosting the ratio of inter-center separation to standard deviation. This allows us to prove results for learning mixtures of a class of distributions under weaker separation conditions.

1 Introduction

Clustering is in general a hard problem. But there has been a lot of research (see Section 3 for references) on proving that if we have data points generated by a mixture of $k$ probability distributions, then one can cluster the data points into $k$ clusters, one corresponding to each component, provided the means of the different components are well-separated. There are different notions of well-separated, but mainly, the (best known) results can be qualitatively stated as:

“If the means of every pair of densities are at least $\mathrm{poly}(k)$ standard deviations apart, then we can learn the mixture in polynomial time.”

These results generally make heavy use of the generative model and particular properties of the distributions (Indeed, many of them specialize to Gaussians or independent Bernoulli trials). In this paper, we make no assumptions on the generative model of the data. We are still able to derive essentially the same result (loosely stated for now as):

“If the projection of any data point onto the line joining its cluster center to any other cluster center is $\Omega(k)$ standard deviations closer to its own center than to the other center (we call this the “proximity condition”), then we can cluster correctly in polynomial time.”

First, if the points to be clustered form the rows of an $n \times d$ matrix $A$ and $C$ is the corresponding $n \times d$ matrix of cluster centers (so each row of $C$ is one of $k$ vectors, namely the centers of the $k$ clusters), then note that the maximum directional variance (no probabilities here, the variance is just the average squared distance from the center) of the data in any direction is just

$$\max_{v : |v| = 1} \; \frac{1}{n} \sum_{i=1}^n \bigl((A_i - C_i)\cdot v\bigr)^2 \;=\; \frac{1}{n}\,\|A - C\|^2,$$

where $\|\cdot\|$ denotes the spectral norm. So the spectral norm scaled by $1/\sqrt{n}$ will play the role of the standard deviation in the above assertion. To our knowledge, this is the first result proving that clustering can be done in polynomial time in a general situation with only deterministic assumptions. It settles an open question raised in [KV09].
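As a quick numerical illustration of this identity (our own sketch, assuming numpy; the matrix sizes and noise model below are arbitrary demo choices):

```python
# Numerical check that the maximum directional variance of a clustered point
# set equals ||A - C||^2 / n, where ||.|| is the spectral norm.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 300, 10, 3
centers = rng.normal(scale=5.0, size=(k, d))
labels = rng.integers(0, k, size=n)
A = centers[labels] + rng.normal(size=(n, d))   # data points, one per row
C = centers[labels]                             # row i holds the center of point i's cluster

spectral = np.linalg.norm(A - C, ord=2)         # spectral norm of A - C
max_dir_var = spectral**2 / n

# Brute-force comparison: average squared projection of A - C along many random unit directions.
V = rng.normal(size=(d, 2000))
V /= np.linalg.norm(V, axis=0)
empirical = ((A - C) @ V) ** 2
print(max_dir_var, empirical.mean(axis=0).max())  # the second value never exceeds the first
```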

We will show that in the generative models studied, our proximity condition is satisfied and so we are able to derive all known results for generative models as corollaries of our theorem (with one qualification: whereas our separation is in terms of the whole data variance, often, in the case of Gaussians, one can make do with separations depending only on individual densities’ variances – see Section 3.)

Besides Gaussians, the planted partition model (defined later) has also been studied; both these distributions have very “thin tails” and a lot of independence, so one can appeal to concentration results. In Section 6.3, we give a clustering algorithm for a mixture of general densities for which we only assume bounds on the variance (and no further concentration). Based on our algorithm, we show how to classify all but an $\epsilon$ fraction of points in this model. Section 3 has references to recent work dealing with distributions which may not even have a variance, but these results are only for the special class of product densities, with additional constraints.

One crucial technical result we prove (Theorem 5.5) may be of independent interest. It shows that the good old $k$-means algorithm [Llo82] converges to the “true centers” even in the presence of spurious points, provided the initial (estimated) centers are close enough to the corresponding actual centers and all but an $\epsilon$ fraction of the points satisfy the proximity condition. Convergence (or lack of it) of the $k$-means algorithm is again well-studied ([ORSS06, AV06, Das03, HPS05]). The result of [ORSS06] (one of the few to formulate sufficient conditions for the $k$-means algorithm to provably work) assumes that the optimal clustering with $k$ centers is substantially better than that with $k-1$ centers, and shows that one iteration of $k$-means yields a near-optimal solution. We show in Section 6.4 that their condition implies proximity for all but an $\epsilon$ fraction of the points. This allows us to prove that our algorithm, which is again based on the $k$-means algorithm, gives a PTAS.

The proof of Theorem 5.5 is based on Theorem 5.4, which shows that if the current centers are close to the true centers, then misclassified points (whose nearest current center is not the one closest to their true center) are far away from the true centers, and so there cannot be too many of them. This is based on a clean geometric argument shown pictorially in Figure 2. Our main theorem in addition allows for an $\epsilon$ fraction of “spurious” points which do not satisfy the proximity condition. Such errors have often proved difficult to account for.

As indicated, all results on generative models assume a lower bound on the inter-center separation in terms of the spectral norm. In Section 7, we describe a construction (when the data come from a generative model, i.e., a mixture of distributions) which boosts the ratio of inter-center separation to spectral norm. The construction is the following: we pick two sets of samples independently from the mixture and combine pairs of samples, one from each set, after subtracting the mean of the mixture, to form new points. Using this, we are able to reduce the dependence of the inter-center separation on the minimum weight of a component in the mixture, a dependence that all models generally need. This technique of boosting is likely to have other applications.

2 Preliminaries and the Main Theorem

For a matrix $M$, we shall use $\|M\|$ to denote its spectral norm. For a vector $v$, we use $|v|$ to denote its length. We are given $n$ points in $\mathbb{R}^d$ which are divided into $k$ clusters $T_1, \ldots, T_k$. Let $\mu_r$ denote the mean of cluster $T_r$ and $n_r$ denote $|T_r|$. Let $A$ be the $n \times d$ matrix with rows corresponding to the points. Let $C$ be the $n \times d$ matrix where $C_i = \mu_r$ for all $i \in T_r$. We shall use $A_i$ to denote the $i$-th row of $A$. Let

$$\Delta_{rs} \;=\; c\,k \left(\frac{1}{\sqrt{n_r}} + \frac{1}{\sqrt{n_s}}\right) \|A - C\|,$$

where $c$ is a large enough constant.

Definition 2.1

We say a point $A_i \in T_r$ satisfies the proximity condition if for any $s \neq r$, the projection of $A_i$ onto the line joining $\mu_r$ and $\mu_s$ is at least $\Delta_{rs}$ closer to $\mu_r$ than to $\mu_s$. We let $\mathcal{G}$ (for good) be the set of points satisfying the proximity condition.

Note that the proximity condition implies that the distance between $\mu_r$ and $\mu_s$ must be at least $\Delta_{rs}$. We are now ready to state the theorem.
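As an illustration, the following Python sketch checks Definition 2.1 directly for a labeled point set; the constant c is a user-chosen parameter standing in for the “large enough constant” above, and the function name and interface are ours, not the paper's.

```python
# Sketch of a direct check of the proximity condition for labeled data.
import numpy as np

def proximity_ok(A, labels, c=1.0):
    """Return a boolean array: True iff point i satisfies the proximity condition."""
    k = labels.max() + 1
    n, d = A.shape
    mus = np.array([A[labels == r].mean(axis=0) for r in range(k)])
    ns = np.array([(labels == r).sum() for r in range(k)])
    C = mus[labels]
    spec = np.linalg.norm(A - C, ord=2)
    good = np.ones(n, dtype=bool)
    for i in range(n):
        r = labels[i]
        for s in range(k):
            if s == r:
                continue
            u = mus[s] - mus[r]
            L = np.linalg.norm(u)
            u = u / L
            t = (A[i] - mus[r]) @ u            # signed position of the projection on the mu_r-mu_s line
            delta = c * k * (1/np.sqrt(ns[r]) + 1/np.sqrt(ns[s])) * spec
            if abs(L - t) - abs(t) < delta:    # projection is not delta-closer to its own center
                good[i] = False
    return good
```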

Theorem 2.2

If all but at most $\epsilon n$ points satisfy the proximity condition, then we can correctly classify all but $O(k^2 \epsilon n)$ points in polynomial time. In particular, if all points satisfy the proximity condition (i.e., $\epsilon = 0$), all points are classified correctly.

Often, when applying this theorem to learning a mixture of distributions, the points will be a set of $n$ independent samples from the mixture. We will denote the corresponding distributions by $F_1, \ldots, F_k$, and their relative weights by $w_1, \ldots, w_k$. Often, $\sigma_r^2$ will denote the maximum variance of the distribution $F_r$ along any direction, and $\sigma$ will denote $\max_r \sigma_r$. We denote the minimum mixing weight of a distribution by $w_{\min}$.

3 Previous Work

Learning mixtures of distributions is one of the central problems in machine learning. There is a vast amount of literature on learning mixtures of Gaussian distributions. One of the most popular methods for this is the well-known EM algorithm, which maximizes the log-likelihood function [DLR77]. However, there are few results which demonstrate that it converges to the optimal solution. Dasgupta [Das99] introduced the problem of learning mixtures under suitable separation conditions, i.e., we assume that the distance between the means of the distributions in the mixture is large, and the goal is to recover the original clustering of points (perhaps with some error).

We first summarize known results for learning mixtures of Gaussian distributions under separation conditions. We ignore logarithmic factors in the separation conditions. We also ignore the minimum number of samples required by the various algorithms; they are often bounded by a polynomial in the dimension and the inverse of the minimum mixing weight. Let $\sigma$ denote the maximum standard deviation, in any direction, of any Gaussian in the mixture. Dasgupta [Das99] gave an algorithm based on random projection to learn mixtures of Gaussians provided the mixing weights of all distributions are about the same and the separation between means is $\Omega(\sigma\sqrt{d})$. Dasgupta and Schulman [DS07] gave an EM-based algorithm that works provided the separation is $\Omega(\sigma d^{1/4})$. Arora and Kannan [AK01] also gave a learning algorithm with similar separation conditions. Vempala and Wang [VW04] were the first to demonstrate the effectiveness of spectral techniques. For spherical Gaussians, their algorithm works with a much weaker separation condition of $\Omega((\sigma_r + \sigma_s)\,k^{1/4})$ between $\mu_r$ and $\mu_s$. Achlioptas and McSherry [AM05] extended this to arbitrary Gaussians with the separation between $\mu_r$ and $\mu_s$ being at least $\Omega((\sigma_r + \sigma_s)\sqrt{k})$. Kannan et al. [KSV08] also gave an algorithm for arbitrary Gaussians, with the corresponding separation being $(\sigma_r + \sigma_s)$ times a polynomial in $k$ and $1/w_{\min}$. Recently, Brubaker and Vempala [BV08] gave a learning algorithm where the separation only depends on the variance perpendicular to a hyperplane separating two Gaussians (the so-called “parallel pancakes” problem).

Much less is known about learning mixtures of heavy-tailed distributions. Most of the known results assume that each distribution is a product distribution, i.e., the projections along the coordinate axes are independent. Often, they also assume some slope condition on the line joining any two means. These slope conditions typically say that the unit vector along such a line does not lie almost entirely along very few coordinates. Such a condition is necessary because if the only difference between two distributions were a single coordinate, then one would require much stronger separation conditions. Dasgupta et al. [DHKS05] considered the problem of learning mixtures of heavy-tailed product distributions when each component distribution satisfies a mild decay condition with respect to its half-radius $R$ (such distributions can have unbounded variance). Their algorithm can classify at least a $1-\epsilon$ fraction of the points provided the distance between any two means is sufficiently large compared to $R_{\max}$, the maximum half-radius of the distributions along any coordinate. Under even milder assumptions on the distributions and a slope condition, they can correctly classify all but an $\epsilon$ fraction of the points under a correspondingly larger separation. Their algorithm, however, requires an amount of time exponential in some of these parameters. This problem was resolved by Chaudhuri and Rao [CR08]. Dasgupta et al. [DHKM07] considered the problem of classifying samples from a mixture of arbitrary distributions with bounded variance in any direction. They showed that if the separation between the means is sufficiently large compared to the maximum directional standard deviation and a suitable slope condition holds, then all the samples can be correctly classified. Their paper also gives a general method for bounding the spectral norm of a random matrix when the rows are independent (and some additional conditions hold). We state this result formally in Section 6 and make heavy use of it.

Finally, we discuss the planted partition model [McS01]. In this model, an instance consists of a set of $n$ points, and there is an implicit partition of these points into $k$ groups. Further, there is an (unknown) $k \times k$ matrix of probabilities $P$. We are given a graph $G$ on these $n$ points, where an edge between two vertices from groups $r$ and $s$ is present with probability $P_{rs}$. The goal is to recover the actual partition of the points (and hence an approximation to the matrix $P$ as well). We can think of this as a special case of learning a mixture of distributions, where the distribution $F_r$ corresponding to the $r$-th part is as follows: $F_r$ is a distribution over $\{0,1\}^n$, with one coordinate corresponding to each vertex. The coordinate corresponding to vertex $u$ is set to 1 with probability $P_{r\psi(u)}$, where $\psi(u)$ denotes the group to which $u$ belongs. Note that the mean $\mu_r$ of $F_r$ is the vector whose coordinate corresponding to vertex $u$ equals $P_{r\psi(u)}$. McSherry [McS01] showed that if the following separation condition is satisfied, then one can recover the actual partition of the vertex set with probability at least $1-\delta$: for all $r \neq s$,

$$|\mu_r - \mu_s|^2 \;\geq\; c\,k\,\sigma^2\left(\frac{1}{w} + \log\frac{n}{\delta}\right), \qquad (1)$$

where $c$ is a large constant, $w$ is such that every group has size at least $wn$, and $\sigma^2$ denotes $\max_{r,s} P_{rs}$.
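For concreteness, here is a small self-contained simulation of the planted partition model, together with the SVD-projection-plus-$k$-means recovery strategy analyzed in Section 5; the probability matrix, group sizes, and use of scipy's kmeans2 are illustrative choices of ours, not values from [McS01].

```python
# Sample a planted-partition graph and recover the parts spectrally.
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(1)
sizes = [150, 150, 100]
k, n = len(sizes), sum(sizes)
truth = np.repeat(np.arange(k), sizes)
P = np.full((k, k), 0.05) + np.diag([0.25, 0.25, 0.30])   # demo probabilities

# Symmetric adjacency matrix: edge (u, v) present with probability P[part(u), part(v)].
probs = P[truth][:, truth]
A = (rng.random((n, n)) < probs).astype(float)
A = np.triu(A, 1); A = A + A.T

# Project rows onto the top-k singular subspace, then cluster the projected rows.
U, S, Vt = np.linalg.svd(A)
A_hat = U[:, :k] * S[:k] @ Vt[:k]
_, found = kmeans2(A_hat, k, minit='++', seed=2)

pairs = set(zip(truth.tolist(), found.tolist()))
print("parts recovered up to relabeling:",
      len(pairs) == k and len({f for _, f in pairs}) == k)
```

With these (well-separated) demo parameters the recovery typically succeeds; it is not guaranteed for every random seed.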

There is a rich body of work on the $k$-means problem and heuristic algorithms for it (see for example [KSS10, ORSS06] and the references therein). One of the most widely used algorithms for this problem was given by Lloyd [Llo82]. In this algorithm, we start with an arbitrary set of $k$ candidate centers. Each point is assigned to the closest candidate center; this partitions the points into $k$ clusters. For each cluster, we update the candidate center to the mean of the points in the cluster. This gives a new set of candidate centers, and the process is repeated until we reach a local optimum. This algorithm may take superpolynomial time to converge [AV06]. However, there is a growing body of work on proving that this algorithm gives a good clustering in polynomial time if the initial choice of centers is good [AV07, ADK09, ORSS06]. Ostrovsky et al. [ORSS06] showed that a modification of Lloyd's algorithm gives a PTAS for the $k$-means problem if there is a sufficiently large separation between the means. Our result also fits this general theme: the $k$-means algorithm, seeded with centers obtained from a simple spectral algorithm, classifies the points correctly.
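A minimal rendering of the Lloyd iteration just described, written as a Python sketch (the fixed iteration budget and the empty-cluster fallback are simplifications of ours, not part of the classical formulation):

```python
# Lloyd's k-means iteration: assign points to nearest centers, then recompute means.
import numpy as np

def lloyd(A, centers, iters=50):
    """Repeat: assign each row of A to its nearest center, then replace each center by its cluster mean."""
    assign = None
    for _ in range(iters):
        dists = np.linalg.norm(A[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        new = np.array([A[assign == r].mean(axis=0) if np.any(assign == r)
                        else centers[r] for r in range(len(centers))])
        if np.allclose(new, centers):   # reached a local optimum
            break
        centers = new
    return centers, assign
```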

4 Our Contributions

Our main contribution is to show that a set of points satisfying a deterministic proximity condition (based on the spectral norm) can be correctly classified (Theorem 2.2). The algorithm is described in Figure 1. It has two main steps: first find an initial set of $k$ centers based on the SVD, and then run the standard $k$-means algorithm with these initial centers as seeds. In Section 5, we show that after each iteration of the $k$-means algorithm, the set of centers comes exponentially closer to the true centers. Although both steps of our algorithm, the SVD and the $k$-means algorithm, have been well studied, ours is the first result which shows that combining the two leads to a provably good algorithm. In Section 6, we give several applications of Theorem 2.2. We have the following results for learning mixtures of distributions (we ignore poly-logarithmic factors in the discussion below):

  • Arbitrary Gaussian distributions with separation roughly $\sigma k \left(\frac{1}{\sqrt{w_r}} + \frac{1}{\sqrt{w_s}}\right)$ between the means of $F_r$ and $F_s$: as mentioned above, this matches known results [AM05, KSV08] except for the fact that the separation condition between two distributions depends on the maximum standard deviation $\sigma$ (as compared to the standard deviations of these two distributions only).

  • Planted partition model with a separation comparable to condition (1): this matches the result of McSherry [McS01] up to a small polynomial factor in $k$, which we can also remove with a more careful analysis.

  • Distributions with bounded variance along any direction: we can classify all but an $\epsilon$ fraction of the points if the separation between the means is a sufficiently large multiple, depending on $k$, $\epsilon$, and $w_{\min}$, of the maximum directional standard deviation $\sigma$. Although results are known for classifying all but a small fraction of the points from mixtures of distributions with unbounded variance [DHKS05, CR08], such results work for product distributions only.

  • PTAS using the $k$-means algorithm: we show that the separation condition of Ostrovsky et al. [ORSS06] is stronger than the proximity condition. Using this fact, we are also able to give a PTAS based on the $k$-means algorithm.

Further, ours is the first algorithm which applies to all of the above settings. In Section 7, we give a general technique for working with weaker separation conditions (for learning mixtures of distributions). Under certain technical conditions described in Section 7, we give a construction which increases the inter-mean distance at a much faster rate than the spectral norm of the corresponding $A - C$ matrix as we increase the number of samples. As applications of this technique, we have the following results:

  • Arbitrary Gaussians with a separation depending polynomially on $k$ and $\sigma$ but only polylogarithmically on $\frac{1}{w_{\min}}$: this is the first result for arbitrary Gaussians where the separation depends only logarithmically on the minimum mixture weight.

  • Power-law distributions with a sufficiently large (but constant) exponent (defined in equation (13)): we prove that we can learn all but an $\epsilon$ fraction of the samples provided the separation between the means is sufficiently large. For large values of the exponent, this significantly reduces the required separation.

We expect this technique to have more applications.

5 Proof of Theorem 2.2

Our algorithm for correctly classifying the points runs in several iterations. At the beginning of each iteration, it has a set of $k$ candidate centers. By a Lloyd-like step, it replaces these centers by a new set of $k$ centers. This process goes on for a polynomial number of steps.

  (Base case) Let $\hat{A}$ denote the projection of the points onto the best $k$-dimensional subspace found by computing the SVD of $A$. Let $\nu_1, \ldots, \nu_k$ denote the centers of a (near-)optimal solution to the $k$-means problem on the points $\hat{A}_1, \ldots, \hat{A}_n$. For a polynomial number of iterations do: assign each point $A_i$ to the closest center among $\nu_1, \ldots, \nu_k$; let $S_r$ denote the set of points assigned to $\nu_r$; define $\eta_r$ as the mean of the points in $S_r$; update $\nu_1, \ldots, \nu_k$ as the new centers, i.e., set $\nu_r = \eta_r$ for the next iteration.  

Figure 1: Algorithm Cluster

The iterative procedure is described in Figure 1. In the first step, we can use any constant-factor approximation algorithm for the $k$-means problem. Note that the algorithm is the same as Lloyd's algorithm, except that we start with the special set of initial centers described above. We now prove that after the first step (the base case), the estimated centers are close to the actual ones; this case follows from [KV09], but we prove it below for the sake of completeness.
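The following Python sketch mirrors the two steps of Algorithm Cluster. It assumes numpy and scipy are available and uses scipy's kmeans2 with '++' seeding as a stand-in for the constant-factor approximate $k$-means solver required by the base case; the iteration budget is an arbitrary choice.

```python
# A compact sketch of Algorithm Cluster (Figure 1): SVD projection, approximate
# k-means seeding on the projection, then Lloyd iterations on the original points.
import numpy as np
from scipy.cluster.vq import kmeans2

def cluster(A, k, iters=30):
    # Base case: project rows of A onto the best rank-k subspace via SVD,
    # and take the centers of an approximate k-means solution on the projection.
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    A_hat = U[:, :k] * S[:k] @ Vt[:k]
    centers, _ = kmeans2(A_hat, k, minit='++', seed=0)

    # Lloyd iterations on the original points, seeded with these centers.
    assign = None
    for _ in range(iters):
        assign = np.linalg.norm(A[:, None, :] - centers[None, :, :], axis=2).argmin(axis=1)
        centers = np.array([A[assign == r].mean(axis=0) if np.any(assign == r)
                            else centers[r] for r in range(k)])
    return assign, centers
```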

Lemma 5.1

(Base Case) After the first step of the algorithm above, for every $r$ there is a center $\nu_r$ such that $|\nu_r - \mu_r| \leq \frac{c_0\sqrt{k}}{\sqrt{n_r}}\,\|A - C\|$, where $c_0$ is a constant depending on the approximation guarantee of the $k$-means step used.

Proof. Suppose, for the sake of contradiction, that there exists an $r$ such that all of the centers $\nu_1, \ldots, \nu_k$ are at distance at least $\frac{c_0\sqrt{k}}{\sqrt{n_r}}\|A - C\|$ from $\mu_r$. Consider the points in $T_r$, and suppose $\hat{A}_i$, $i \in T_r$, is assigned to the center $\nu_{s(i)}$ in this solution. The assignment cost for these points in this near-optimal $k$-means solution is

$$\sum_{i \in T_r} |\hat{A}_i - \nu_{s(i)}|^2 \;\geq\; \sum_{i \in T_r} \left(\frac{|\nu_{s(i)} - \mu_r|^2}{2} - |\hat{A}_i - \mu_r|^2\right) \qquad (2)$$
$$\;\geq\; \frac{c_0^2\,k}{2}\,\|A - C\|^2 \;-\; \sum_{i \in T_r} |\hat{A}_i - \mu_r|^2, \qquad (3)$$

where inequality (2) follows from the fact that for any two vectors $a, b$ we have $|a - b|^2 \geq \frac{|a|^2}{2} - |b|^2$, and inequality (3) follows from the assumption that $|\nu_{s(i)} - \mu_r| \geq \frac{c_0\sqrt{k}}{\sqrt{n_r}}\|A - C\|$ for every $i \in T_r$. Since $\hat{A} - C$ has rank at most $2k$ and $\|\hat{A} - A\| \leq \|A - C\|$, we also have $\sum_{i \in T_r} |\hat{A}_i - \mu_r|^2 \leq \|\hat{A} - C\|_F^2 \leq 8k\,\|A - C\|^2$. But this is a contradiction, because one feasible solution to the $k$-means problem on $\hat{A}$ is to assign the points in $T_s$ to $\mu_s$ for every $s$; the cost of this solution is $\|\hat{A} - C\|_F^2 \leq 8k\,\|A - C\|^2$, and the solution we used is within a constant factor of optimal, so taking $c_0$ large enough yields the contradiction.  

Observe that the lemma above implies that there is a unique center $\nu_r$ associated with each $\mu_r$ (since the means are at distance at least $\Delta_{rs}$ from one another, no single center can satisfy the bound of the lemma for two different means once the constant $c$ is large enough). We now prove a useful lemma which states that removing a small number of points from a cluster can move the mean of the remaining points by only a small distance.

Lemma 5.2

Let $B$ be a subset of $T_r$, and let $\mu(B)$ denote the mean of the points in $B$. Then
$$|\mu(B) - \mu_r| \;\leq\; \frac{\|A - C\|}{\sqrt{|B|}}.$$

Proof. Let $v$ be the unit vector along $\mu(B) - \mu_r$. Now,
$$|\mu(B) - \mu_r| \;=\; \frac{1}{|B|}\sum_{i \in B} (A_i - \mu_r)\cdot v \;\leq\; \frac{\sqrt{|B|}}{|B|}\,\sqrt{\sum_{i \in B}\bigl((A_i - \mu_r)\cdot v\bigr)^2},$$
by the Cauchy-Schwarz inequality. But $\sum_{i \in B}\bigl((A_i - \mu_r)\cdot v\bigr)^2 \leq \|A - C\|^2$. This proves the lemma.  

Corollary 5.3

Let $B \subseteq T_r$ be such that $|B| \geq n_r - m$, where $m \leq n_r/2$. Let $\mu(B)$ denote the mean of the points in $B$. Then
$$|\mu(B) - \mu_r| \;\leq\; \frac{2\sqrt{m}}{n_r}\,\|A - C\|.$$

Proof. Let $B'$ denote $T_r \setminus B$. We know that $n_r\,\mu_r = |B|\,\mu(B) + |B'|\,\mu(B')$. So we get
$$|\mu(B) - \mu_r| \;=\; \frac{|B'|}{|B|}\,|\mu(B') - \mu_r| \;\leq\; \frac{|B'|}{|B|}\cdot\frac{\|A - C\|}{\sqrt{|B'|}} \;=\; \frac{\sqrt{|B'|}}{|B|}\,\|A - C\|,$$
where the inequality above follows from Lemma 5.2. The result now follows because $|B| \geq n_r/2$ and $|B'| \leq m$.  
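A quick numerical sanity check of the bound in Lemma 5.2 (as reconstructed above), using a single synthetic cluster; the sizes and distribution are arbitrary demo choices:

```python
# Verify numerically that the mean of any subset B of a cluster T_r is within
# ||A - C|| / sqrt(|B|) of mu_r (here ||A - C|| is restricted to this one cluster,
# which only makes the check stricter).
import numpy as np

rng = np.random.default_rng(3)
n_r, d = 500, 8
T = rng.normal(size=(n_r, d)) + 4.0
mu = T.mean(axis=0)
spec = np.linalg.norm(T - mu, ord=2)

worst = 0.0
for _ in range(2000):
    m = rng.integers(5, n_r)
    B = rng.choice(n_r, size=m, replace=False)
    gap = np.linalg.norm(T[B].mean(axis=0) - mu)
    worst = max(worst, gap * np.sqrt(m) / spec)
print("max of |mu(B) - mu_r| * sqrt(|B|) / ||A - C|| over trials:", worst)  # stays <= 1
```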

Now we show that if the estimated centers are close to the actual centers, then one iteration of the second step of the algorithm reduces this distance by at least half.

Notation :

  • $\nu_1, \ldots, \nu_k$ denote the current centers at the beginning of an iteration in the second step of the algorithm, indexed so that $\nu_r$ is the current center closest to $\mu_r$.

  • $S_r$ denotes the set of points for which the closest current center is $\nu_r$.

  • $\eta_r$ denotes the mean of the points in $S_r$; so $\eta_1, \ldots, \eta_k$ are the new centers. Let $\delta_r$ denote $|\nu_r - \mu_r|$.

The theorem below shows that the misclassified points (those which really belong to $T_r$, but for which some $\nu_s$, $s \neq r$, is the closest current center) are not too many in number. The proof first shows that any misclassified point must be far away from $\mu_r$; since the sum of the squared projections of the vectors $A_i - \mu_r$ onto any fixed direction, over all points of $T_r$, is at most $\|A - C\|^2$, there cannot be too many such points.
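The counting step alluded to here can be summarized in one display; $D$ is a placeholder symbol (not from the paper) for whatever lower bound on the distance of a misclassified point from $\mu_r$ the geometric argument provides, and $v$ is a fixed unit direction:

```latex
% D is a hypothetical symbol standing for the distance bound on misclassified points.
\[
  \#\{\text{misclassified } i \in T_r\}\cdot D^2
  \;\le\; \sum_{i \in T_r} \bigl((A_i - \mu_r)\cdot v\bigr)^2
  \;\le\; \|A - C\|^2
  \quad\Longrightarrow\quad
  \#\{\text{misclassified } i \in T_r\} \;\le\; \frac{\|A - C\|^2}{D^2}.
\]
```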

Theorem 5.4

Assume that $\delta_r \leq \frac{1}{4}\,\Delta_{rs}$ for all $r$ and all $s \neq r$. Then, for every $r$,

$$\bigl|(T_r \cap \mathcal{G}) \setminus S_r\bigr| \;\leq\; c_1 \sum_{s \neq r} \frac{(\delta_r + \delta_s)^2\,\|A - C\|^2}{\Delta_{rs}^4}. \qquad (4)$$

Further, for every pair $r \neq s$,

$$\Bigl|\sum_{i \in (T_r \cap \mathcal{G}) \cap S_s} (A_i - \mu_r)\Bigr| \;\leq\; c_1\,\frac{(\delta_r + \delta_s)\,\|A - C\|^2}{\Delta_{rs}^2}. \qquad (5)$$

Here $c_1$ is an absolute constant.

Proof. Fix $r \neq s$ and consider a point $A_i \in T_r \cap \mathcal{G}$ that is closer to $\nu_s$ than to $\nu_r$. Let $u$ be the unit vector along $\mu_s - \mu_r$. Splitting $A_i - \mu_r$ into its projection along the $\mu_r$-$\mu_s$ line and the component orthogonal to it, we can write

$$A_i - \mu_r \;=\; t\,u + w,$$

where $w$ is orthogonal to $u$. Since $A_i$ satisfies the proximity condition, its projection onto the $\mu_r$-$\mu_s$ line is at least $\Delta_{rs}$ closer to $\mu_r$ than to $\mu_s$, i.e.,

$$t \;\leq\; \frac{|\mu_s - \mu_r| - \Delta_{rs}}{2}. \qquad (6)$$

Figure 2: A misclassified point

Now, since $A_i$ is closer to $\nu_s$ than to $\nu_r$, it lies on the $\nu_s$ side of the perpendicular bisector of the segment joining $\nu_r$ and $\nu_s$ (this is the situation depicted in Figure 2). The bisector of $\nu_r\nu_s$ differs from the bisector of $\mu_r\mu_s$ by a shift and a tilt of magnitude governed by $\delta_r + \delta_s \leq \frac{1}{2}\Delta_{rs}$, while (6) places $A_i$ at distance at least $\frac{1}{2}\Delta_{rs}$ on the $\mu_r$ side of the latter bisector. For $A_i$ to nevertheless end up on the far side of the former, its projection onto the affine span of $\mu_r, \mu_s, \nu_r, \nu_s$ must lie at distance at least $\Omega\!\left(\frac{\Delta_{rs}\,|\mu_s - \mu_r|}{\delta_r + \delta_s}\right)$ from $\mu_r$.

If we take an orthonormal basis of this (at most three-dimensional) affine span, then for each basis direction $v$ we have $\sum_{i \in T_r} \bigl((A_i - \mu_r)\cdot v\bigr)^2 \leq \|A - C\|^2$. Since every misclassified point of $T_r \cap \mathcal{G}$ contributes at least $\Omega\!\left(\frac{\Delta_{rs}^2\,|\mu_s - \mu_r|^2}{(\delta_r + \delta_s)^2}\right)$ to this sum over the basis, there can be at most $O\!\left(\frac{(\delta_r + \delta_s)^2\,\|A - C\|^2}{\Delta_{rs}^2\,|\mu_s - \mu_r|^2}\right) \leq O\!\left(\frac{(\delta_r + \delta_s)^2\,\|A - C\|^2}{\Delta_{rs}^4}\right)$ of them. Summing over $s \neq r$ proves the first statement of the theorem.

For the second statement, note that for any unit vector $v$,
$$\sum_{i \in (T_r \cap \mathcal{G}) \cap S_s} (A_i - \mu_r)\cdot v \;\leq\; \sqrt{\bigl|(T_r \cap \mathcal{G}) \cap S_s\bigr|}\,\sqrt{\sum_{i \in T_r} \bigl((A_i - \mu_r)\cdot v\bigr)^2} \;\leq\; \sqrt{\bigl|(T_r \cap \mathcal{G}) \cap S_s\bigr|}\;\|A - C\|,$$
by the Cauchy-Schwarz inequality; combining this with the count bound just proved yields the second statement of the theorem.  

We are now ready to prove the main theorem of this section, which will directly imply Theorem 2.2. It shows that $k$-means converges if the starting centers are close enough to the corresponding true centers. To gain intuition, it is best to look at the case $\epsilon = 0$, when all points satisfy the proximity condition. Then the theorem says that if $|\nu_r - \mu_r| \leq \frac{\gamma}{\sqrt{n_r}}\|A - C\|$ for all $r$, then $|\eta_r - \mu_r| \leq \frac{\gamma}{2\sqrt{n_r}}\|A - C\|$, thus halving the upper bound on the distance to $\mu_r$ in each iteration.
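Written out, the $\epsilon = 0$ case is just a geometric-decay recursion (the superscript $(t)$, counting Lloyd iterations, is notation introduced here for illustration):

```latex
% d_r^{(t)} = |nu_r - mu_r| after t Lloyd iterations; the superscript is our notation.
\[
  d_r^{(t)} \;\le\; \tfrac{1}{2}\, d_r^{(t-1)}
  \qquad\Longrightarrow\qquad
  d_r^{(t)} \;\le\; 2^{-t}\, d_r^{(0)}.
\]
```

So polynomially many iterations bring the estimated centers within any desired inverse-polynomial distance of the true centers.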

Theorem 5.5

If

$$\delta_r \;\leq\; \frac{\gamma}{\sqrt{n_r}}\,\|A - C\|$$

for all $r$ and a parameter $\gamma$ that is at most a constant multiple of $\sqrt{k}$, then

$$|\eta_r - \mu_r| \;\leq\; \frac{\gamma}{2\sqrt{n_r}}\,\|A - C\| \;+\; E_r$$

for all $r$, where $E_r$ is an error term of the order of $\frac{\sqrt{\epsilon n}}{n_r}\,\|A - C\|$ (up to factors polynomial in $k$) which vanishes when $\epsilon = 0$.

Proof. First note that the hypothesis implies the assumption of Theorem 5.4, provided the constant $c$ in the definition of $\Delta_{rs}$ is large enough. Fix $r$. Let $n''_r$ and $\mu''_r$ denote the number and the mean, respectively, of the points of $T_r \cap \mathcal{G}$ lying in $S_r$; similarly, for $s \neq r$, define $m_s$ and $\theta_s$ as the size and the mean of the points of $T_s \cap \mathcal{G}$ lying in $S_r$, and let $B$ be the set of points of $S_r$ outside $\mathcal{G}$. We get

$$|S_r|\,(\eta_r - \mu_r) \;=\; n''_r\,(\mu''_r - \mu_r) \;+\; \sum_{s \neq r} m_s\,(\theta_s - \mu_r) \;+\; \sum_{i \in B}(A_i - \mu_r).$$

We have

$$n''_r\,|\mu''_r - \mu_r| \;\leq\; 2\sqrt{n_r - n''_r}\;\|A - C\|
\qquad\text{and}\qquad
m_s\,|\theta_s - \mu_r| \;\leq\; O\!\left(\frac{(\delta_r + \delta_s)\,\|A - C\|^2}{\Delta_{rs}^2}\right)\ \text{ for } s \neq r,$$

where the first bound is from Corollary 5.3 (it is easy to check from the first statement of Theorem 5.4, together with $|T_r \setminus \mathcal{G}| \leq \epsilon n$, that at most $n_r/2$ points of $T_r$ are missing from $S_r$), and the second follows from the two statements of Theorem 5.4 (the term $m_s\,|\mu_s - \mu_r|$ that arises when re-centering at $\mu_r$ is absorbed because misclassified points of $T_s$ are also far from $\mu_s$). The contribution of $B$ is bounded using $|B| \leq \epsilon n$ and the fact that every point of $B$ chose $\nu_r$ as its closest current center; this is what produces the error term $E_r$.

Now, using the fact that the length (norm) is a convex function and dividing by $|S_r| \geq n_r/2$, we see that $|\eta_r - \mu_r|$ is at most the sum of the three contributions above divided by $n_r/2$. Collecting terms, the first two contributions together are at most $\frac{\gamma}{2\sqrt{n_r}}\,\|A - C\|$ (plus a part that is absorbed into $E_r$) once the constant $c$ in the definition of $\Delta_{rs}$ is large enough, and the contribution of $B$ is at most $E_r$. Assuming $c$ to be a large enough constant proves the theorem.  

Now we can easily finish the proof of Theorem 2.2. Observe that after the base case of the algorithm, the statement of Theorem 5.5 holds with $\gamma = c_0\sqrt{k}$. So after a sufficiently large polynomial number of iterations of the second step of our algorithm, the geometrically decaying term becomes negligible and we will get

$$|\nu_r - \mu_r| \;\leq\; O(E_r)$$

for all $r$. Now substituting this into Theorem 5.4 and summing over all pairs of clusters, we get that at most $O(k^2\epsilon n)$ points of $\mathcal{G}$ are misclassified; together with the at most $\epsilon n$ points outside $\mathcal{G}$, this implies Theorem 2.2.

6 Applications

We now give applications of Theorem 2.2 to various settings. One of the main technical steps here is to bound the spectral norm of a random matrix whose rows are chosen independently. We use the following result from [DHKM07]. Here $C$ denotes the matrix $\mathbb{E}[A]$, whose $i$-th row is the mean of the distribution from which $A_i$ is drawn, and the number of samples $n$ is assumed to be sufficiently large.

Fact 6.1

Let $\sigma$ be such that the variance of each row of $A$ along any fixed direction is at most $\sigma^2$, and suppose $\sigma$ is not too small (the precise condition appears in [DHKM07]). Then $\|A - C\| \leq c\,\sigma\sqrt{n}$ with high probability.
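The following snippet is only a numerical illustration of the scale of such bounds in the simplest case of i.i.d. Gaussian rows; it is not the statement of Fact 6.1, and the sizes and constants are arbitrary demo choices.

```python
# For a matrix with independent mean-zero rows of directional variance sigma^2,
# the spectral norm concentrates around sigma * (sqrt(n) + sqrt(d)) in the
# Gaussian case, which is O(sigma * sqrt(n)) whenever n >> d.
import numpy as np

rng = np.random.default_rng(4)
n, d, sigma = 2000, 200, 1.5
A_centered = rng.normal(scale=sigma, size=(n, d))   # rows are already mean-zero here
print(np.linalg.norm(A_centered, ord=2), sigma * (np.sqrt(n) + np.sqrt(d)))
```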

6.1 Learning in the planted distribution model

In this section, we show that McSherry's result [McS01] can be derived (up to a factor of $\sqrt{k}$ in the separation) as a corollary of our main theorem. Consider an instance of the planted partition model which satisfies condition (1). We would like to show that, with high probability, the points satisfy the proximity condition. Fix a point $A_u$. We will show that the probability that it does not satisfy this condition is at most $\frac{\delta}{n}$. Using a union bound, it will then follow that the proximity condition is satisfied by all points with probability at least $1-\delta$.

Suppose $u$ belongs to group $r$, and fix $s \neq r$. Let $v$ denote the unit vector along $\mu_s - \mu_r$. Let $\ell$ denote the line joining $\mu_r$ and $\mu_s$, and let $\hat{A}_u$ be the projection of $A_u$ onto $\ell$. The following result shows that the distance between $\hat{A}_u$ and $\mu_r$ is small with high probability.

Lemma 6.2

Assume that condition (1) holds with the factor $k$ strengthened to $k^2$ (equivalently, with the separation increased by a factor of $\sqrt{k}$), where $c$ is a large enough constant. Then, with probability at least $1 - \frac{\delta}{n^2}$,

$$|\hat{A}_u - \mu_r| \;\leq\; \frac{1}{2}\bigl(|\mu_s - \mu_r| - \Delta_{rs}\bigr),$$

i.e., $\hat{A}_u$ is at least $\Delta_{rs}$ closer to $\mu_r$ than to $\mu_s$.

Proof. For a vector $x$, we use $x(j)$ to denote the coordinate of $x$ at position $j$; define $v(j)$ similarly. First observe that $(A_u - \mu_r)\cdot v = \sum_j \bigl(A_u(j) - \mu_r(j)\bigr)\,v(j)$. The coordinates of $v$ corresponding to the vertices of a particular cluster are all equal; let $\alpha_t$ denote this common value for cluster $t$. So we get

$$(A_u - \mu_r)\cdot v \;=\; \sum_{t=1}^k \alpha_t\,\bigl(Y_t - \mathbb{E}[Y_t]\bigr),$$

where $Y_t$ denotes the number of neighbours of $u$ in cluster $t$ and $n_t$ denotes the size of cluster $t$. The last observation uses the fact that $\sum_t n_t\alpha_t^2 = 1$, so $|\alpha_t| \leq \frac{1}{\sqrt{n_t}}$ for every $t$. Now, if $A_u$ does not satisfy the condition of the lemma, then there must be some cluster $t$ for which $|\alpha_t|\,|Y_t - \mathbb{E}[Y_t]|$ is at least a $\frac{1}{k}$ fraction of the required deviation.

Note that the indicator variables summed in $Y_t$ are i.i.d. 0-1 random variables with mean $P_{rt}$. We use the following (Bernstein-type) version of the Chernoff bound: if $X_1, \ldots, X_m$ are i.i.d. 0-1 random variables, each with mean $p$, then

$$\Pr\Bigl[\bigl|\textstyle\sum_i X_i - mp\bigr| > a\Bigr] \;\leq\; 2\exp\!\left(-\frac{a^2}{2\,(mp + a/3)}\right).$$

For us, $m = n_t$ and $p = P_{rt} \leq \sigma^2$. If the required deviation $a$ is at most $mp$, the probability of the bad event is at most $2\exp\!\bigl(-\Omega\bigl(\frac{a^2}{n_t\sigma^2}\bigr)\bigr)$; if $a$ exceeds $mp$, it is at most $2\exp\!\bigl(-\Omega(a)\bigr)$. In either case, the probability of the deviation needed to violate the lemma is at most $\frac{\delta}{n^2}$ (after a union bound over the $k$ clusters), where we have assumed that $\sigma^2$ is not too small compared to $\frac{\log n}{n}$ (we need an assumption of this kind anyway in order to use Wigner's theorem for bounding $\|A - C\|$).  

Assuming that condition (1) holds with the factor $k$ strengthened to $k^2$ and a large enough constant $c$,

we see that the projection of $A_u$ onto the line $\ell$ is at least $\Delta_{rs}$ closer to $\mu_r$ than to $\mu_s$

with probability at least $1 - \frac{\delta}{n^2}$. Here, we have used the fact that $\|A - C\| \leq c''\sigma\sqrt{n}$ with high probability (Wigner's theorem), so that $\Delta_{rs} = O\!\left(k\,\sigma\left(\frac{1}{\sqrt{w_r}} + \frac{1}{\sqrt{w_s}}\right)\right)$. Now, using a union bound over all points and all pairs of clusters, we get that all the points satisfy the proximity condition with probability at least $1 - \delta$.

Remark: Here we have used $C$ as the matrix whose rows are the actual means $\mu_r$ of the distributions. But when applying Theorem 2.2, $C$ should represent the means of the samples in $A$ belonging to each particular cluster. The error incurred here can be made very small and does not affect the results, so we shall assume that $\mu_r$ is the actual mean of the points in $T_r$. Similar comments apply to the other applications described next.

6.2 Learning Mixture of Gaussians

We are given a mixture of $k$ Gaussians $F_1, \ldots, F_k$ in $d$ dimensions. Let the mixing weights of these distributions be $w_1, \ldots, w_k$, and let $\mu_1, \ldots, \mu_k$ denote their means, respectively.

Lemma 6.3

Suppose we are given a set of $n$ samples from the mixture distribution, where $n$ is a sufficiently large polynomial in $d$, $k$ and $\frac{1}{w_{\min}}$. Then these points satisfy the proximity condition with high probability if

$$|\mu_r - \mu_s| \;\geq\; c\,k\,\sigma\left(\frac{1}{\sqrt{w_r}} + \frac{1}{\sqrt{w_s}}\right)\operatorname{polylog}(n)$$

for all $r \neq s$. Here $\sigma^2$ is the maximum variance in any direction of any of the distributions $F_r$.

Proof. It can be shown that $\|A - C\|$ is $O(\sigma\sqrt{n})$ with high probability (see [DHKM07]). Further, let $A_i$ be a point drawn from the distribution $F_r$. Let $\ell$ be the line joining $\mu_r$ and $\mu_s$, and let $\hat{A}_i$ be the projection of $A_i$ onto this line. Then the fact that the projection of a Gaussian onto a line is a one-dimensional Gaussian with standard deviation at most $\sigma$ implies that $|\hat{A}_i - \mu_r| \leq c\,\sigma\sqrt{\log n}$ with probability at least $1 - \frac{1}{\operatorname{poly}(n)}$. It is also easy to check that the number of points from $F_r$ in the sample is close to $w_r n$ with high probability. Thus, it follows that all the points satisfy the proximity condition with high probability.  

The above lemma and Theorem 2.2 imply that we can correctly classify all the points. Since we sample at least $\Omega(w_{\min} n)$ points from each distribution, we can then learn each of the distributions to high accuracy.
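An end-to-end illustration of the whole pipeline on a synthetic, well-separated Gaussian mixture (all parameters below are arbitrary demo choices of ours and are not the separation constants of Lemma 6.3):

```python
# SVD projection + k-means seeding + Lloyd iterations on a synthetic Gaussian mixture.
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(5)
k, d, n = 3, 20, 3000
means = rng.normal(scale=12.0, size=(k, d))
weights = np.array([0.5, 0.3, 0.2])
truth = rng.choice(k, size=n, p=weights)
A = means[truth] + rng.normal(size=(n, d))

U, S, Vt = np.linalg.svd(A, full_matrices=False)
A_hat = U[:, :k] * S[:k] @ Vt[:k]
centers, _ = kmeans2(A_hat, k, minit='++', seed=0)
for _ in range(20):
    assign = np.linalg.norm(A[:, None, :] - centers[None, :, :], axis=2).argmin(axis=1)
    centers = np.array([A[assign == r].mean(axis=0) if np.any(assign == r)
                        else centers[r] for r in range(k)])

pairs = set(zip(truth.tolist(), assign.tolist()))
print("clusters recovered up to relabeling:", len(pairs) == k)
```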

6.3 Learning Mixture of Distributions with Bounded Variance

We consider a mixture of $k$ distributions $F_1, \ldots, F_k$ with weights $w_1, \ldots, w_k$. Let $\sigma^2$ be an upper bound on the variance along any direction of a point sampled from any one of these distributions. In other words,

$$\mathbb{E}_{x \sim F_r}\Bigl[\bigl((x - \mu_r)\cdot v\bigr)^2\Bigr] \;\leq\; \sigma^2$$

for all distributions $F_r$ and each unit vector $v$.

Theorem 6.4

Suppose we are given a set of $n$ samples from the mixture distribution. Assume that $n$ is a sufficiently large polynomial in $k$, $\frac{1}{\epsilon}$ and $\frac{1}{w_{\min}}$. Then there is an algorithm to correctly classify at least a $1-\epsilon$ fraction of the points provided the separation $|\mu_r - \mu_s|$ is at least a sufficiently large multiple, polynomial in $k$ and $\frac{1}{\epsilon}$, of $\sigma\left(\frac{1}{\sqrt{w_r}} + \frac{1}{\sqrt{w_s}}\right)$ for all $r \neq s$. Here $\epsilon$ is assumed to be less than $w_{\min}$.

Proof. The algorithm is described in Figure 3. We now prove that it has the desired properties. Let $A$ denote the matrix of sampled points and $C$ the corresponding matrix of (distribution) means. We first bound the spectral norm of $A - C$. The bound obtained is quite high, but it is probably tight.

  Run the first step of Algorithm Cluster on the set of points, and let $\nu_1, \ldots, \nu_k$ denote the centers obtained. Remove centers (and the points assigned to them) to which fewer than a threshold number of points are assigned. Let $\nu_1, \ldots, \nu_{k'}$ denote the remaining centers. Remove any point whose distance from the nearest center in