Robustly Clustering a Mixture of Gaussians

Abstract

We give an efficient algorithm for robustly clustering a mixture of two arbitrary Gaussians, a central open problem in the theory of computationally efficient robust estimation, assuming only that the means of the component Gaussians are well-separated or their covariances are well-separated. Our algorithm and analysis extend naturally to robustly clustering mixtures of well-separated logconcave distributions. The mean separation required is close to the smallest possible to guarantee that most of the measure of the component Gaussians can be separated by some hyperplane (for covariances, it is the same condition in the second-degree polynomial kernel). Our main tools are a new identifiability criterion based on isotropic position, and a corresponding Sum-of-Squares convex programming relaxation.

1 Introduction

The Gaussian Mixture Model has been the quintessential generative statistical model for multi-dimensional data since its definition and application by Pearson [22] more than a century ago: A GMM is an unknown discrete distribution over components, each a Gaussian with unknown mean and covariance. Remarkably, such a model is always uniquely identifiable. It has led to the development of important tools in statistics. Over the past two decades, the study of its computational aspects has been immensely fruitful. Since the seminal paper of Dasgupta [5], there has been much progress on efficiently clustering and learning Gaussian Mixture Models. One line of results assumes that the component Gaussians are spherical and their means are sufficiently separated [24, 12]. Another line generalizes this to arbitrary Gaussians under mean separation [1, 15, 4]. A more general approach of estimating all parameters without requiring any separation was introduced by Kalai, Moitra and Valiant [14, 21] and [2] and is polynomial for any fixed number of components and desired accuracy. We discuss these developments in more detail presently.

In spite of its mathematical appeal and wide usability, the Gaussian Mixture Model and approaches to estimating it have a serious vulnerability — noise in the data. Another limitation of the model is the requirement that the components should be Gaussians; a natural and more general model is a mixture of logconcave distributions (e.g., each component is the uniform distribution over a convex body).

Robust statistics, which seeks parameters that are immune to noise, is itself a classical topic [13] and has led to the definition of robust statistical parameters such as the Tukey median [23] and the geometric median. These classical parameters are statistically sound, but are either computationally intractable in high dimension or have error factors that grow polynomially with the dimension, even for the most basic problem of estimating the mean of a single Gaussian. Over the past few years, there has been significant progress in computationally efficient robust estimation, starting with mean and covariance estimation for a large family of distributions [6, 18], including logconcave distributions. Robust estimation algorithms have also been developed for generative models, including mixtures of well-separated spherical Gaussians (early work by [3], and improved bounds more recently [6, 11, 17]), Independent Component Analysis [18, 17] and linear regression. Despite this progress, the core motivating problem of robustly estimating a mixture of two Gaussians has remained unsolved.

In this paper, we give a polynomial-time algorithm for the robust estimation (and clustering) of a mixture of two arbitrary Gaussians, assuming only that either their means are separated or their covariance matrices are separated. Our results extend to a mixture of two arbitrary logconcave distributions. We measure the separation with respect to the full distribution being in isotropic position, i.e., if the mixture has covariance , then we need that at least one of

is large (the distance between the parameters when the overall covariance is the identity), i.e., a multiple of the larger standard deviation along this inter-mean direction. This is an affine-invariant measure of the separation between the component Gaussians; as we will see shortly, it is equivalent to the existence of a hyperplane that separates (most of) the measure of the components. Before we present our main result, we recall three results in the literature that are directly relevant. First, the clustering algorithm of [4] works assuming that the means of the Gaussians are well-separated in the above affine-invariant sense. They do not consider robustness, more general component distributions or separation in the covariances, the last of which becomes relevant when the means are very close (or coincide). The second directly relevant line of work is for mixtures of spherical Gaussians. Using the Sum-of-Squares convex programming hierarchy, it is possible to cluster a mixture of spherical Gaussians assuming pairwise mean separation that is close to the best possible in quasi-polynomial time; and in polynomial time for separation that is at least  standard deviations for any  [17, 11], improving on the polynomial-time -standard-deviations separation result of [24]. While the results of [24, 4] are based on spectral methods and are not robust, the SoS-based results for spherical Gaussian mixtures are robust to adversarial noise in addition to having a weaker separation requirement. We note that all of these results assume mean separation, and all but [4] are for spherical Gaussians. The work of [16] considers separated logconcave mixtures and gives an algorithm that requires separation that grows linearly with the largest component standard deviation (in any direction), which is much stronger than hyperplane separability.

Our first result is a robust algorithm to cluster a mixture of two arbitrary distributions with logconcave densities assuming only that either their means are separated in an affine-invariant manner, i.e., the distance between the means in some direction is a small multiple of the larger standard deviation of the components in that direction. Once we get an approximately correct clustering, the samples in each cluster can be used to robustly estimate the component mixing weights, means and covariances.

Theorem 1.

Let be a mixture of two unknown logconcave distributions in with means and covariances , mixing weights , with for . Let be a noisy mixture obtained from with adversarial noise fraction , an absolute constant. Assume that there is a direction s.t.

For , there is a randomized algorithm that takes a sample of size from , and with probability at least , finds a clustering that is correct for all but fraction of the sample, in time polynomial in .

While the above condition is close to being tight in terms of mean separation (a multiple of the larger standard deviation is necessary for clustering), if the means are very close or coincide, clustering might still be possible because the covariances of the components are sufficiently different. Our next theorem applies to a mixture of Gaussians with the property that either their means are separated or their covariances are separated, once again in an affine-invariant manner.

Theorem 2.

Let be a mixture of two unknown Gaussian distributions in with means and covariances , mixing weights , with for . Let be a noisy mixture obtained from with adversarial noise fraction , an absolute constant. Assume that either there is a direction s.t.

or there is a matrix s.t.

where is the covariance of for . For , there is a randomized algorithm that takes a sample of size from , and with probability at least , finds a clustering that is correct for all but fraction of the sample, in time polynomial in .

A few remarks are in order. The separation condition for the means says that there exists some direction along which the means are separated by a multiple of the larger standard deviation along that direction, i.e., the components are hyperplane-separable; this kind of separation is necessary for mean separation to imply a correct clustering. Moreover, it is substantially weaker than the requirement in previous work on robust mixture clustering, which needs the separation to grow with the largest component variances (see Fig. 1). For covariance separation, it is the same condition in the lifted space .
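To make this concrete, the two conditions can plausibly be rendered as follows (a hedged reconstruction, writing $\mu_i$, $\Sigma_i$ for the component means and covariances and $F_i$ for the components; the constant $c$ and the exact normalization of the covariance condition are placeholders rather than the values stated in the theorems):

\exists\, v \in \mathbb{R}^n:\quad |v^{\top}(\mu_1 - \mu_2)| \;\ge\; c \max_{i \in \{1,2\}} \sqrt{v^{\top} \Sigma_i v},

\exists\, A \in \mathbb{R}^{n \times n}:\quad |\langle A,\ \Sigma_1 - \Sigma_2 \rangle| \;\ge\; c \max_{i \in \{1,2\}} \sqrt{\operatorname{Var}_{X \sim F_i}\!\left( X^{\top} A X \right)},

the first being the mean condition and the second its analogue in the lifted space.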

Figure 1: Earlier work needed mean separation determined by the largest variance (left figure); the top figure shows hyperplane separability, needing much smaller mean separation; the bottom right figure shows the same in isotropic position — the best separating hyperplane is normal to the inter-mean direction.

Second, the separation conditions above are implied by corresponding bounds on the Fisher discriminant, applied to  and  respectively (see below). Third, even when the means of the components coincide, a separation in their covariances in Frobenius norm suffices for the algorithm. We remark here that the above theorem holds for mixtures of more general distributions and for any lifting (e.g., higher-degree polynomials in the original points ), as long as the lifted variables have a logconcave density. When  is Gaussian,  has a logconcave density, and this is really all we need above. More generally, the above theorem holds for any logconcave random variable whose square is also logconcave.

The following notion will be more convenient to work with than hyperplane separation, and our main theorem can also be stated using this notion of overlap.

Definition 3.

The Fisher overlap of a mixture with components , mixing weights , for , and overall mean zero is

We let .
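For intuition, one standard way to write such a Fisher-type overlap (a hedged reconstruction of the elided definition, writing $w_i$, $\Sigma_i$ for the mixing weights and component covariances and $\Sigma$ for the covariance of the full mixture) is

\phi(p) \;=\; \frac{w_1\, p^{\top}\Sigma_1 p + w_2\, p^{\top}\Sigma_2 p}{p^{\top}\Sigma\, p}, \qquad \phi \;=\; \min_{p \neq 0} \phi(p),

i.e., the ratio of within-component variance to total variance along the direction $p$, minimized over directions.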

When  is isotropic, the denominator of  is  for any . We define the overlap for covariances similarly. The next two statements are from [4]. The first says that small overlap corresponds to hyperplane separability.

Lemma 4.

For a mixture and a vector , if

then

Lemma 5.

For an isotropic mixture, is minimized for when is the inter-mean direction.

1.1 Approach

Known efficient algorithms for Gaussian mixtures are typically based either on spectral considerations, or on more general (and less efficient) convex programming. While the former methods work well in practice, and yield relatively small polynomial bounds, they are vulnerable to noise. It is worth noting, though, that the approach of [3, 18] as well as the filtering approach of [6] build on such spectral methods for the robust estimation of mean and covariance (which includes robust estimation of a single Gaussian).

Ideally, one would like an algorithm for Gaussian mixtures to be polynomial in all parameters. The general algorithm of Kalai, Moitra and Valiant [14, 21] has complexity , even without noise. Unfortunately, this appears unavoidable, at least for any Statistical Query (SQ) algorithm [7], a model that captures most existing algorithms for problems over distributions [8, 9]. Moreover, even for two components, the assumption of Gaussians is critical to guarantee unique identifiability (since without separation assumptions, clustering is not possible).

On the other hand, the approach of [4] is polynomial in all parameters assuming a separation between the means of the components. This makes the mixture components identifiable, at least up to clustering with high probability. The separation needed is affine-invariant and considerably weaker than previous work for mixtures of arbitrary Gaussians (we will shortly draw inspiration from recent progress for the case of spherical Gaussians as well), in that the separation required depends only on the standard deviation in some direction, i.e., there is some hyperplane separating each Gaussian from the rest, and the separation needed is proportional to the standard deviation along the normal to the hyperplane (not, e.g., the largest standard deviation). Our starting goal was to find a robust version of [4] that remains polynomial in all parameters. Their technique, isotropic PCA, an affine-invariant version of PCA, is not robust (showing this is a bit more involved than for other spectral algorithms). So we turn for inspiration to the special case of spherical Gaussians, for which robust algorithms have recently been discovered, with near-optimal separation [17, 11]. The key idea there is to express the identifiability of a Gaussian component in terms of a polynomial system, solve this polynomial system using a sum-of-squares semi-definite programming relaxation, and round the fractional solution obtained to a nearly correct clustering. The requirement of the polynomial system for identification is that the means are sufficiently separated.

We combine and generalize the above approaches as follows: (1) we use the affine-invariant separation condition to formulate a polynomial program for identifiability, and show that the SoS-based approach can be extended to this setting. For this step, we crucially use the robust estimation algorithms for mean and covariance under bounded moment conditions (by showing that these conditions hold for noisy mixtures). This normalization of moments is used in our identifiability proofs. It implies that under the hyperplane separability assumption, the Fisher overlap is small along the inter-mean direction (Lemma 5). This is a fundamental departure from previous applications of SoS to clustering mixtures. (2) Then we show that even if the means are too close to guarantee clusterability, as long as the covariances are sufficiently separated, again in an affine-invariant manner, we still get polynomial identifiability. These requirements are considerably weaker than the previous affine-invariant requirements of [4]. (3) Finally, the resulting algorithms work for arbitrary logconcave mixtures, a problem that was open even for the noise-free case.

2 Background and Preliminaries

Noise model.

We assume that the data is generated as follows. First, a sample is generated from a pure mixture. Then an adversary replaces up to an fraction of the data with arbitrary points. We refer to the pure mixture as with

and each is a distribution with mean , covariance , and the nonnegative mixing weights sum to . We refer to the noisy mixture as .

Isotropic position.

We say that a distribution in is in isotropic position if satisfies

Any distribution with a bounded, full-rank covariance matrix can be brought to isotropic position by an affine transformation. Namely, if and , then the distribution of the random variable is in isotropic position. Isotropic position of a distribution can be computed to desired accuracy from a sample via the sample mean and covariance.
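As a concrete illustration, here is a minimal sketch (in Python, assuming numpy) of the isotropic transformation computed from a sample via the naive sample mean and covariance; this is only appropriate in the noise-free setting, and the algorithm in Section 5 replaces these with robust estimates.

import numpy as np

def make_isotropic(X):
    # Map a sample X (m x n, full-rank covariance) to approximate isotropic
    # position: returns the transformed sample and the affine map (mu, W),
    # so that each row x is mapped to W @ (x - mu).
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(Sigma)                  # Sigma = V diag(vals) V^T
    W = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T    # Sigma^{-1/2}
    return (X - mu) @ W.T, mu, W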

SoS relaxations.

The Sum-of-Squares hierarchy is a sequence of semi-definite programs that provide increasingly tighter relaxations of solutions to polynomial inequalities over . The basic idea is to use multilinear variables of degree up to for some , and rewrite the constraints in terms of these variables.

Definition 6 (Sum-of-Squares Relaxation).

Let . Suppose that is on variables . Define variables where is any subset of with . Define the matrix

where and are subsets of of size at most . For each polynomial inequality (or equality ), define the matrix

where and are subsets of of size at most , and is the coefficient of the term in . The resulting SoS relaxation is defined by the set of constraints:

Pseudo-expectations and Pseudo-distributions.

Any point in the convex hull of points from can be viewed as a probability distribution (convex combination) of extreme solutions and naturally defines an expectation. If is a function of interest, then the expectation corresponding to a fractional solution with where is . Any solution to a level- SoS program above can be viewed as defining a pseudo-expectation , where is the set of all multi-linear functions over , with the following properties:

The pseudo-expectation behaves like a true expectation for polynomials of degree up to , and the above constraints are implied by the SoS constraints. For more detailed background, see e.g., [10].
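To illustrate the kind of object an SoS solver returns, the following toy sketch (Python with the cvxpy library; the booleanity and size constraints are a simplified stand-in of ours, not the system of Definition 7) builds a level-2 moment matrix whose entries act as pseudo-expectations of monomials in 0/1 indicator variables.

import cvxpy as cp

def level2_pseudoexpectation(n, k):
    # M[0,0] = 1, M[0,i] = pE[w_i], M[i,j] = pE[w_i w_j]; M must be PSD.
    M = cp.Variable((n + 1, n + 1), symmetric=True)
    cons = [M >> 0, M[0, 0] == 1]
    for i in range(1, n + 1):
        cons.append(M[i, i] == M[0, i])      # w_i^2 = w_i (booleanity)
    cons.append(cp.sum(M[0, 1:]) == k)       # sum_i w_i = k
    cp.Problem(cp.Minimize(0), cons).solve() # feasibility SDP
    return M.value                           # degree-<=2 pseudo-moments

The full relaxation of Section 4 has the same shape, with rows and columns indexed by all subsets of the variables of Definition 7 of size at most the chosen degree.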

3 Identifiability with Fisher

In this section, we describe a set of polynomial equations and inequalities that will lead to the SoS relaxation and imply the desired properties of a pseudo-expectation obtained by solving the relaxation. Then we will prove that the indicator vector of every true cluster satisfies this set of constraints (Lemma 9), and the statement that the set of constraints implies the desired polynomial property has a low-degree SoS proof (Lemma 11).

Definition 7.

Let be the following system of polynomial equations and inequalities on the variables , , given data points , parameters :

  1. , for all ,

  2. for all ,

  3. , ,

  4. , ,

  5. ,

  6. ,

  7. .

The conditions above differ from previous polynomial identifiability systems in two ways: the use of two sets of variables, and more significantly, the separation condition is a single existence condition rather than a condition over all vectors [17, 11].

If  is a sample drawn from a logconcave density, then any solutions  and  are indicator vectors of subsets  and  of the given points such that the subsets approximately satisfy the Fisher criterion. Similarly, if we solve the system with data points , the solutions satisfy the Fisher criterion in the second-moment space. Our goal is to ensure that the subsets  and  identified are essentially the components of the mixture. We note that in the direction of , the subsets  and  have the sum of sample variances bounded as .
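As an illustration of the quantity these constraints control, the following sketch (Python/numpy; the normalization is ours and is only meant to convey the shape of the criterion) evaluates a Fisher-type overlap of a candidate 0/1 split of the sample along a fixed direction.

import numpy as np

def fisher_overlap_of_split(X, w, v):
    # Within-cluster variance along v divided by total variance along v,
    # for the split given by the 0/1 indicator vector w.
    y = X @ v
    y1, y2 = y[w == 1], y[w == 0]
    within = len(y1) * y1.var() + len(y2) * y2.var()
    return within / (len(y) * y.var())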

In the next section, we describe the corresponding SoS relaxation and prove that its solution is a pseudo-distribution that satisfies  (Theorem 16). The following definition captures the essential properties of a sample with respect to the constraint system.

Definition 8 (Well-separated Isotropic Sample).

We say that is a -separated isotropic sample with true clusters and if

  1. ,

  2. satisfy with , ,

  3. For any such that , , , , and for . Then for any vector ,

3.1 Mean Separation

In this subsection, we consider a nearly isotropic logconcave 2-mixture with Fisher discriminant in some direction .

Lemma 9 (Completeness).

Suppose where each is an i.i.d. sample generated from a nearly isotropic noisy logconcave mixture with true clusters such that where is the Fisher discriminant and are means of two components. Then, with , is a -separated sample with probability at least .

By assumption,  satisfies conditions 1 and 2 of Definition 8 with high probability. To prove that it satisfies condition 3, we introduce the following lemma for samples generated from a single one-dimensional logconcave distribution. Then we can extend the conclusion to the mixture.

Lemma 10.

Suppose where each is an i.i.d. sample generated from a one-dimensional logconcave density with variance . For any such that , , and , if let , , and , and , then

Next we will show that the soundness of has a low-degree SoS proof, which implies that the pseudo-expectation satisfying has the desired properties. Let be the indicator of , . The main conclusion of the soundness is stated below.

Lemma 11 (Soundness).

Suppose that is a -separated sample. Let be a degree- pseudo-expectation which satisfies . Then for ,

The following two lemmas show that we can rewrite the constraint system as a one-dimensional projection onto the direction . In this inter-mean direction, we have the same Fisher discriminant as in the whole space, and hence the same mean separation. Then we can prove Lemma 11 for this one-dimensional problem.

Lemma 12.

There is a constant degree SoS proof that implies

Lemma 13.

Suppose and are the projections of and onto direction . Then there is a constant-degree SoS proof that implies

  1. .

The next lemma shows that small Fisher overlap implies mean separation.

Lemma 14.

For an isotropic mixture , if there is a vector such that , then

Together these lemmas will allow us to prove Lemma 11.

3.2 Covariance Separation

In this subsection, we assume that is a noisy mixture of 2 Gaussians. Suppose and . We will first show that has a logconcave density for each component Gaussian.

Lemma 15.

Suppose is a random variable with a Gaussian distribution and . Then has a logconcave density function.

If we assume that  satisfies the mean separation condition in the lifted second-moment space: there is a matrix  s.t.

where is the covariance of in the ’th component, then we can apply the same algorithm as for the mean separation case. That is, after we put into isotropic position, we can get a constraint system on with by replacing the data points by in in Definition 7. Then all the lemmas and the proofs in Section 3.1 also apply to covariance separation without modification.
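Concretely, the lifting to the second-moment space can be implemented by appending the degree-2 monomials of each point before re-running the mean-separation pipeline; a minimal sketch (Python/numpy):

import numpy as np

def lift_degree2(X):
    # Append the upper-triangular entries of x x^T to each sample point x.
    m, n = X.shape
    iu = np.triu_indices(n)
    quad = np.array([np.outer(x, x)[iu] for x in X])
    return np.hstack([X, quad])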

4 SoS Relaxation

We will use the SoS relaxation given by Def. 6 applied to the identifiability constraint system . The level- SoS-relaxation of is denoted by . The solution of this SoS relaxation is a pseudo-expectation with the desired properties proved in Section 3. We can see that is a system with variables and constraints. The resulting system is defined on variables with . Using the ellipsoid method, this SDP can be solved up to an additive -error in time proportional to . We will set as a constant. The following theorem (see [10]) shows that the solution of the SDP is actually a pseudo-expectation for . We only use in our algorithm.

Theorem 16.

Let be a set of polynomial constraints on and be any feasible point in . Define the multilinearizing map as, where is the set of all multi-linear functions over ,

(1)

for every , and extend linearly. Then is a degree pseudo-expectation for .

From (1), we can compute the pseudo-expectation of each multi-linear function of degree at most  over the variables  and . In the rounding algorithm, we will only use degrees up to .

5 Robust Isotropic Position

We will need the following robust estimate of the mean and covariance of the full mixture. The theorem stated below follows by combining the algorithm of [6] with the moment condition of [18], and proving the corresponding moment bounds.

Theorem 17.

Let  be a noisy logconcave mixture with unknown noise fraction , where the underlying mixture of logconcave distributions has unknown mean  and covariance . There is a polynomial-time algorithm which, given  samples from  with , computes  and  with probability  within error

Lemma 18.

Let be a mixture of logconcave densities with mean . Then for , satisfies the following bounded moment condition (2) for :

(2)

where is the lower bound on the minimum mixing weight and is a bound on the ’th moment constant of any logconcave density.
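As a quick empirical sanity check of a bound of this type, one can compare the fourth moment of a one-dimensional logconcave mixture to the square of its variance by Monte Carlo (Python/numpy; the mixture parameters are arbitrary illustrations, and the observed ratio is not the constant of Lemma 18).

import numpy as np

rng = np.random.default_rng(0)
m = 200000
w1, mu1, s1 = 0.4, -2.0, 1.0      # component 1 of a 1-D Gaussian mixture
w2, mu2, s2 = 0.6, 3.0, 0.5       # component 2 (both components logconcave)
comp = rng.random(m) < w1
x = np.where(comp, rng.normal(mu1, s1, m), rng.normal(mu2, s2, m))

z = x - x.mean()
print(np.mean(z ** 4) / (z.var() ** 2))   # fourth moment relative to variance^2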

6 Algorithm

We can now state the main algorithm for robust clustering.

Input: A sample .

  1. Use Robust Estimation (Theorem 17) to approximate the mean and covariance of the mixture; denote the results by and . Apply the affine transformation to make the mixture nearly isotropic.

    1. Find as the maximum s.t. the SoS SDP with the polynomial constraint system and is feasible. Let the solution be a pseudo-expectation .

    2. Let . Choose a uniformly random row  of  such that . Let  be the largest entries in the ’th row of .

  2. Run the previous steps 1 and 2 with sample points . So and in Step 1 are the estimated mean and covariance of .

  3. Output the clustering with the smaller Fisher overlap among the two SoS solutions.

Algorithm 1 Mixture Clustering
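The whole pipeline can be summarized in the following high-level sketch (Python); every helper here other than lift_degree2 (sketched in Section 3.2) — robust_mean_cov, whiten, solve_sos_relaxation, round_pseudoexpectation, fisher_overlap_of — is a hypothetical placeholder for the procedures of Theorem 17, Section 4 and Section 7, not an implementation of them.

def cluster_mixture(X, eps):
    candidates = []
    for lifted in (False, True):
        Y = lift_degree2(X) if lifted else X     # Step 2 reruns Step 1 on lifted points
        mu, Sigma = robust_mean_cov(Y, eps)      # robust estimation (Theorem 17), placeholder
        Z = whiten(Y, mu, Sigma)                 # near-isotropic position, placeholder
        pE = solve_sos_relaxation(Z)             # SoS SDP for the constraint system, placeholder
        C1, C2 = round_pseudoexpectation(pE)     # random-row rounding (Section 7), placeholder
        candidates.append((fisher_overlap_of(Z, C1, C2), (C1, C2)))
    # Output the clustering with the smaller Fisher overlap (Step 3).
    return min(candidates, key=lambda t: t[0])[1]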

7 Rounding Analysis

Suppose is a pseudo-expectation of constant degree on satisfying and for . The following main lemma of the rounding analysis gives the error guarantee of the rounding algorithm.

Lemma 19.

The rounding algorithm outputs clusters such that with probability at least , for ,

where . Moreover, .

Let . For the purpose of analysis, let be a submatrix of with all the rows and columns from and . We will not use in our rounding algorithm. Let be the set of indices (of columns and rows) in corresponding to , and corresponding to . To prove the main lemma, we use Lemma 20 and Lemma 21 to analyze each entry and each row of . We can see that and have a common submatrix .
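For concreteness, the rounding step can be sketched as follows, given the matrix M of pseudo-expectations of pairwise products of the indicator variables (Python/numpy; the diagonal threshold and the cluster size k are placeholders for the quantities appearing in Lemmas 20 and 21).

import numpy as np

def round_from_moment_matrix(M, k, rng):
    # Pick a uniformly random row with a large diagonal entry and take the
    # indices of its k largest entries as one cluster; the rest form the other.
    good = np.flatnonzero(np.diag(M) >= 0.5)     # placeholder threshold
    i = rng.choice(good)
    cluster = np.argsort(M[i])[-k:]
    rest = np.setdiff1d(np.arange(M.shape[0]), cluster)
    return cluster, rest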

Lemma 20.

Let be a pseudo-expectation of constant degree on satisfying . Then

  1. For any , ,

  2. For any , and,

  3. .

Lemma 21.

Choose a uniformly random row  from  such that . Then with probability at least ,  is a “good row”, i.e.,  is in a true cluster and, denoting the submatrix of  with rows of columns in this true cluster as , we have

8 Proofs

8.1 Proof of Main Theorems

Proof of Theorem 1.

By Theorem 17, we can robustly estimate the mean of and the covariance matrix of the mixture in polynomial time. Then we can assume the sample is nearly isotropic.

By the separation assumption and Lemma 9, the set of noisy samples is a -separated sample with probability . By Lemma 11, we have

(3)

If we have

then satisfies .

Then Lemma 19 shows that the clustering is correct for all but a  fraction of the sample in cluster . So with probability , the fraction of the sample wrongly clustered is at most

If , then the failure probability of the rounding algorithm is . If we run the rounding algorithm times, the failure probability is less than .

For the covariance separation case, the proof is almost the same.  

Proof of Theorem 2.

If satisfies the mean separation condition, then we can apply Theorem 1 and draw the same conclusion since Gaussian distributions are logconcave.

If satisfies the covariance separation condition, let be a random variable generated from and . By Lemma 15, we know that has logconcave density. And we can write the covariance separation condition with respect to the means and covariances of for the two components: there is a matrix s.t.

Here . So applying Theorem 1 again to , we get the result for the covariance separation case.

 

8.2 Moments

We first prove the bounded moments condition of a mixture of logconcave densities to apply the robust estimation algorithm on the mixture.

Proof of Lemma 18.

Without loss of generality, we can assume that . It suffices to prove the inequality in the one-dimensional case, because the projection of a mixture of logconcave densities onto any direction is still a mixture of logconcave densities.

Suppose  is the expectation over the ’th logconcave component, and  and  are the mean and the variance of the ’th logconcave component. Let  be the ’th moment constant for a logconcave density, that is, . Then  as shown in [19]. For one component, we have

and

Thus

Then

Similarly, for one component we have

Thus

Then

 

8.3 Identifiability

In the proof of soundness, we will use only inequalities that can themselves be proved using low-degree SoS proofs.

We first prove Lemma 11, assuming that the means of two components are separated. Then the soundness for the covariance separation case follows.

The proof will use the following well-known facts. The notation below indicates that the proof is a sum-of-squares proof of degree at most . We omit the notation indicating this explicitly when clear from context.

Fact 22 (SoS Triangle Inequality).

Let be indeterminates. Let be a power of 2.

The next two facts apply to pseudo-expectations.

Fact 23 (SoS Hölder’s).

Let and be indeterminates. Let be a power of 2.

and

Fact 24 (Pseudo-expectation Hölder’s).

Let be a degree- polynomial. Let be a degree- pseudo-expectation on indeterminates .

Fact 25 (Pseudo-expectation Cauchy-Schwarz).

Let be a degree- pseudo-expectation on indeterminates . Let be polynomials of degree at most .

As we mentioned in Section 3, it suffices to prove Lemma 11 in one dimension. So, in the following proofs of soundness, we assume that  and  are one-dimensional values, namely the projections onto the direction .

We will use the following lemma to prove Lemma 11.

Lemma 26.

For , there exists a constant degree SoS proof that implies

Proof of Lemma 26.

Using SoS triangle inequality, we get

(4)

By Hölder’s inequality, implies

Then using Lemma 13,