# Toward Learning Gaussian Mixtures with Arbitrary Separation

Mikhail Belkin
Ohio State University, Columbus, Ohio
mbelkin@cse.ohio-state.edu

Kaushik Sinha
Ohio State University, Columbus, Ohio
sinhak@cse.ohio-state.edu
###### Abstract

In recent years, analysis of the complexity of learning Gaussian mixture models from sampled data has received significant attention in the computational machine learning and theory communities. In this paper we present the first result showing that polynomial-time learning of multidimensional Gaussian mixture distributions is possible when the separation between the component means is arbitrarily small. Specifically, we present an algorithm for learning the parameters of a mixture of $k$ identical spherical Gaussians in $n$-dimensional space with an arbitrarily small separation between the components, which is polynomial in the dimension, the inverse component separation and the other input parameters for a fixed number of components $k$. The algorithm uses a projection to $k$ dimensions and then a reduction to the 1-dimensional case. It relies on a theoretical analysis showing that two 1-dimensional mixtures whose densities are close in the $L^2$ norm must have similar means and mixing coefficients. To produce the necessary lower bound for the $L^2$ norm in terms of the distances between the corresponding means, we analyze the behavior of the Fourier transform of a mixture of Gaussians in one dimension around the origin, which turns out to be closely related to the properties of the Vandermonde matrix obtained from the component means. Analysis of minors of the Vandermonde matrix together with basic function approximation results allows us to provide a lower bound for the norm of the mixture in the Fourier domain and hence a bound in the original space. Additionally, we present a separate argument for reconstructing the variance.

## 1 Introduction

Mixture models, particularly Gaussian mixture models, are a widely used tool for many problems of statistical inference [21, 19, 18, 11, 17]. The basic problem is to estimate the parameters of a mixture distribution, such as the mixing coefficients, means and variances, within some pre-specified precision from sampled data points. While the history of Gaussian mixture models goes back to [20], in recent years the theoretical aspects of mixture learning have attracted considerable attention in theoretical computer science, starting with the pioneering work of [9], who showed that a mixture of $k$ spherical Gaussians in $n$ dimensions can be learned in time polynomial in $n$, provided certain separation conditions between the component means (separation of order $\sqrt{n}$) are satisfied. This work has been refined and extended in a number of recent papers. The separation requirement of [9] was later improved to the order of $n^{1/4}$ in [10] for spherical Gaussians and in [2] for general Gaussians. The separation requirement was further reduced and made independent of $n$, to the order of $k^{1/4}$ in [23] for spherical Gaussians and to the order of $k^{3/2}$ in [15] for logconcave distributions. In the related work [1] the separation requirement was reduced to the order of $k + \sqrt{k\log n}$. An extension of PCA called isotropic PCA was introduced in [3] to learn mixtures of Gaussians whenever any pair of Gaussian components is separated by a hyperplane having very small overlap along the hyperplane direction (the so-called "pancake layering problem").

In a slightly different direction, the recent work [13] made an important contribution to the subject by providing a polynomial time algorithm for PAC-style learning of a mixture of Gaussian distributions with arbitrary separation between the means. The authors used a grid search over the space of parameters to construct a hypothesis mixture of Gaussians whose density is close to the actual mixture generating the data. We note that the problem analyzed in [13] can be viewed as density estimation within a certain family of distributions; it differs from most other work on the subject, including our paper, which addresses parameter learning. (Note that density estimation is generally easier than parameter learning, since quite different configurations of parameters could conceivably lead to very similar density functions, while similar configurations of parameters always result in similar density functions.)

We also note several recent papers dealing with the related problems of learning mixture of product distributions and heavy tailed distributions. See for example, [12, 8, 5, 6].

In the statistics literature, [7] established the optimal convergence rate of the MLE estimator for a finite mixture of normal distributions in terms of the sample size, showing a faster rate when the number of mixture components is known in advance than when it is known only up to an upper bound. However, this result does not address the computational aspects, especially in high dimension.

In this paper we develop a polynomial-time (for a fixed number of components $k$) algorithm to identify the parameters of a mixture of identical spherical Gaussians with potentially unknown variance for an arbitrarily small separation between the components. (We point out that some non-zero separation is necessary, since the problem of learning parameters without any separation assumption at all is ill-defined.) To the best of our knowledge this is the first result of this kind, except for the simultaneous and independent work [14], which analyzes the case of a mixture of two Gaussians with arbitrary covariance matrices using the method of moments. We note that the results in [14] and in our paper are somewhat orthogonal: each paper deals with a special case of the ultimate goal (two arbitrary Gaussians in [14], and identical spherical Gaussians with unknown variance in our case), which is to show polynomial learnability for a mixture with an arbitrary number of components and arbitrary variance.

All other existing algorithms for parameter estimation require the minimum separation between the components to be an increasing function of at least one of $n$ or $k$. Our result also implies a density estimation bound along the lines of [13]. We note, however, that we do have to pay a price, as our procedure (similarly to that in [13]) is super-exponential in $k$. Despite these limitations we believe that our paper makes a step towards understanding the fundamental problem of polynomial learnability of Gaussian mixture distributions. We also think that the technique used in the paper to obtain the lower bound may be of independent interest.

The main algorithm in our paper involves a grid search over a certain space of parameters, specifically the means and mixing coefficients of the mixture (a completely separate argument is given to estimate the variance). By giving appropriate lower and upper bounds for the $L^2$ norm of the difference of two mixture distributions in terms of their means, we show that such a grid search is guaranteed to find a mixture with nearly correct values of the parameters.

To prove that, we need to provide lower and upper bounds on the $L^2$ norm of the difference of two mixtures. A key point of our paper is the lower bound showing that two mixtures with different means cannot produce similar density functions. This bound is obtained by reducing the problem to a 1-dimensional mixture distribution and analyzing the behavior of the Fourier transform (closely related to the characteristic function, whose derivatives at the origin are the moments of the random variable up to multiplication by a power of the imaginary unit $i$) of the difference between the densities near the origin. We use certain properties of minors of Vandermonde matrices to show that the norm of the mixture difference in the Fourier domain is bounded from below. Since the $L^2$ norm is invariant under the Fourier transform, this provides a lower bound on the norm in the original space.
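The Plancherel step used here, that the $L^2$ norm is invariant under the Fourier transform, can be checked numerically. The following sketch compares the norm of the difference of two 1-d mixtures computed directly in the original space and via a discrete Fourier transform; all parameter values are illustrative.

```python
import numpy as np

# Sanity check of the Plancherel step: the L2 norm of the difference of two
# 1-d mixtures, computed directly and via a discrete Fourier transform.
sigma = 1.0

def mixture_pdf(x, means, weights):
    phi = lambda m: np.exp(-(x - m) ** 2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    return sum(w * phi(m) for w, m in zip(weights, means))

x = np.linspace(-20.0, 20.0, 2**14)
dx = x[1] - x[0]
diff = mixture_pdf(x, [0.0, 0.5], [0.5, 0.5]) - mixture_pdf(x, [0.0, 0.6], [0.5, 0.5])

l2_space = np.sqrt(np.sum(diff**2) * dx)       # ||p - q|| in the original space
ft = np.fft.fft(diff) * dx                     # discrete approximation of the FT
l2_freq = np.sqrt(np.sum(np.abs(ft) ** 2) / (len(x) * dx))

assert l2_space > 0
assert abs(l2_space - l2_freq) < 1e-8
```

The two norms agree up to discretization error, which is what lets the lower bound be proved entirely in the Fourier domain.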

We also note the work [16], where Vandermonde matrices appear in the analysis of mixture distributions in the context of proving consistency of the method of moments (in fact, we rely on a result from [16] to provide an estimate for the variance).

Finally, our lower bound, together with an upper bound and some results from non-parametric density estimation and spectral projections of mixture distributions, allows us to set up a grid search algorithm over the space of parameters with the desired guarantees.

## 2 Outline of the argument

In this section we provide an informal outline of the argument that leads to the main result. To simplify the discussion, we will assume that the variance for the components is known or estimated by using the estimation algorithm provided in Section 3.3. It is straightforward (but requires a lot of technical details) to see that all results go through if the actual variance is replaced by a sufficiently (polynomially) accurate estimate.

We will denote by $N(x;\mu,\sigma^2 I)$ the density of a spherical Gaussian with mean $\mu$, where $x,\mu \in \mathbb{R}^n$ or, when appropriate, in $\mathbb{R}^k$. The notation $\|\cdot\|$ will always be used to represent the $L^2$ norm, while $d_H$ will be used to denote the Hausdorff distance between sets of points. Let $p(x,\theta) = \sum_{i=1}^k \alpha_i N(x;\mu_i,\sigma^2 I)$ be a mixture of $k$ Gaussian components with covariance matrix $\sigma^2 I$ in $\mathbb{R}^n$. The goal will be to identify the means $\mu_i$ and the mixing coefficients $\alpha_i$ under the assumption that the minimum distance between the means, $\min_{i\ne j}\|\mu_i-\mu_j\|$, is bounded from below by some given (arbitrarily small) $d_{\min} > 0$ and the minimum mixing weight is bounded from below by $\alpha_{\min} > 0$. We note that while $\sigma$ can also be estimated, we will assume that it is known in advance to simplify the arguments. The number of components $k$ needs to be known in advance, which is in line with other work on the subject. Our main result is an algorithm guaranteed to produce an approximating mixture $p(x,\theta^\ast)$ whose means and mixing coefficients are all within $\epsilon$ of their true values and whose running time is polynomial in all parameters other than $k$. The input to our algorithm is the number of components $k$, points in $\mathbb{R}^n$ sampled from $p(x,\theta)$, and an arbitrarily small positive precision parameter $\epsilon$. The algorithm has the following main steps.
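As a concrete reference for this setup, here is a minimal sketch of drawing a sample from a mixture of identical spherical Gaussians; the values of $n$, $k$, $\sigma$, the means and the mixing weights are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative instance: k = 3 identical spherical Gaussians in n = 10 dimensions.
n, k, sigma = 10, 3, 1.0
means = rng.uniform(-1.0, 1.0, size=(k, n))    # component means mu_i
alphas = np.array([0.5, 0.3, 0.2])             # mixing coefficients, sum to 1

def sample_mixture(N):
    # Draw a component index per point, then add spherical Gaussian noise.
    comps = rng.choice(k, size=N, p=alphas)
    return means[comps] + sigma * rng.standard_normal((N, n))

X = sample_mixture(5000)
assert X.shape == (5000, n)
```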

Parameters: $k$, $\sigma$, $d_{\min}$, $\alpha_{\min}$, $\delta$.
Input: $\epsilon$, points in $\mathbb{R}^n$ sampled from $p(x,\theta)$.
Output: $\theta^\ast$, the vector of approximated means and mixing coefficients.

Step 1. (Reduction to $k$ dimensions). Given a polynomial number of data points sampled from $p(x,\theta)$, it is possible to identify the $k$-dimensional span of the means $\mu_i$ in $\mathbb{R}^n$ by using singular value decomposition (see [23]). By an additional argument the problem can be reduced to analyzing a mixture of $k$ Gaussians in $\mathbb{R}^k$.
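Step 1 can be sketched as follows; the instance below uses well-separated illustrative means and equal weights, and the top-$k$ right singular vectors of the data matrix serve as the estimated span of the means.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative instance: k = 3 well-separated spherical Gaussians in n = 10
# dimensions, equal weights for simplicity.
n, k, sigma, N = 10, 3, 0.5, 20000
means = 3.0 * np.eye(n)[:k]                    # k x n true means (hypothetical)
comps = rng.integers(0, k, N)
X = means[comps] + sigma * rng.standard_normal((N, n))

# The top-k right singular vectors of the data matrix approximately span
# the span of the means (the SVD step of [23]).
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[:k].T                                   # n x k orthonormal basis
Y = X @ V                                      # data reduced to k dimensions

# Projecting the true means onto the estimated span loses almost nothing.
proj_means = (means @ V) @ V.T
err = np.max(np.linalg.norm(proj_means - means, axis=1))
assert err < 0.3
```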

Step 2. (Construction of a kernel density estimator). Using Step 1, we can assume that the mixture is in $\mathbb{R}^k$. Given a sample of points in $\mathbb{R}^k$, we construct a density function $p_{\mathrm{kde}}$ using an appropriately chosen kernel density estimator. Given sufficiently many points, $\|p_{\mathrm{kde}} - p(x,\theta)\|$ can be made arbitrarily small. Note that while $p(x,\theta)$ is a mixture of $k$ Gaussians, $p_{\mathrm{kde}}$ is not.
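A minimal 1-d sketch of Step 2, using a Gaussian kernel; the bandwidth rule below is a standard generic (Silverman-style) choice for illustration, not the one from the paper's analysis.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative 1-d version of Step 2: kernel density estimate of a
# two-component mixture.
sigma, N = 1.0, 20000
centers = np.where(rng.random(N) < 0.5, 0.0, 0.7)
data = centers + sigma * rng.standard_normal(N)

h = 1.06 * data.std() * N ** (-1 / 5)          # Silverman-style bandwidth

def p_kde(x):
    # Average of Gaussian kernels centred at the sample points.
    z = (x[:, None] - data[None, :]) / h
    return np.mean(np.exp(-0.5 * z * z), axis=1) / (h * np.sqrt(2 * np.pi))

x = np.linspace(-5.0, 6.0, 200)
phi = lambda t: np.exp(-0.5 * t * t) / np.sqrt(2 * np.pi)
p_true = 0.5 * phi(x) + 0.5 * phi(x - 0.7)
assert np.max(np.abs(p_kde(x) - p_true)) < 0.05
```

Note that `p_kde` is a sum of $N$ kernels rather than a mixture of two Gaussians, which is exactly the point made above.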

Step 3. (Grid search). Let $\Theta$ be the $(k^2+k)$-dimensional space of parameters (component means and mixing coefficients) to be estimated. Because of Step 1, we can assume (see Lemma 3) that the means lie in $\mathbb{R}^k$.

For any $\tilde\theta \in \Theta$, let $p(x,\tilde\theta)$ be the corresponding mixture distribution; $\theta$ denotes the true parameters. We obtain a grid size $G$ (polynomial in all arguments for a fixed $k$) from Theorem 3 and take a grid $M_G$ of size $G$ in $\Theta$. The estimate $\theta^\ast$ is found from a grid search according to the following equation

$$\theta^\ast = \arg\min_{\tilde\theta \in M_G}\left\{\|p(x,\tilde\theta) - p_{\mathrm{kde}}\|\right\} \tag{1}$$

We show that the means and mixing coefficients obtained by taking $\theta^\ast$ are close to the true underlying means and mixing coefficients of $p(x,\theta)$ with high probability. We note that our algorithm is deterministic; the uncertainty comes only from the sample (through the SVD projection and the density estimation).
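A toy version of the grid search in Equation 1, for two 1-d components with known variance and equal weights; the target density plays the role of $p_{\mathrm{kde}}$, and all names and grid settings are illustrative.

```python
import numpy as np

# Toy grid search: recover two 1-d means by minimizing the L2 distance
# to a target density over a parameter grid.
sigma = 1.0
x = np.linspace(-8, 8, 2000)
dx = x[1] - x[0]

def density(m1, m2):
    g = lambda m: np.exp(-(x - m) ** 2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    return 0.5 * g(m1) + 0.5 * g(m2)

p_target = density(-0.3, 0.4)                  # stand-in for p_kde

grid = np.arange(-1.0, 1.01, 0.1)              # grid step G = 0.1 per parameter
best = min(((m1, m2) for m1 in grid for m2 in grid),
           key=lambda t: np.sum((density(*t) - p_target) ** 2) * dx)
assert np.allclose(sorted(best), [-0.3, 0.4], atol=1e-6)
```

In the actual algorithm the grid also ranges over the mixing coefficients, and the distance is computed against the kernel density estimate rather than an exact density.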

While a somewhat different grid search algorithm was used in [13], the main novelty of our result is showing that the parameters estimated from the grid search are close to the true underlying parameters of the mixture. In principle, it is conceivable that two different configurations of Gaussians could give rise to very similar mixture distributions. However, we show that this is not the case. Specifically, and this is the theoretical core of this paper, we show that mixtures with different means/mixing coefficients cannot be close in the $L^2$ norm (Theorem 3), and thus the grid search yields parameter values that are close to the true values of the means and mixing coefficients. (Note that our notion of distance between two density functions is slightly different from the standard ones used in the literature, e.g., Hellinger distance or KL divergence. However, our goal is to estimate the parameters, and here we use the $L^2$ norm merely as a tool to establish that two distributions are different.)

We now give a high-level summary of the argument behind Steps 2 and 3.

1. Since we do not know the underlying probability distribution $p(x,\theta)$ directly, we construct $p_{\mathrm{kde}}$, which serves as a proxy for it. $p_{\mathrm{kde}}$ is obtained by taking an appropriate non-parametric density estimate and, given a sufficiently large polynomial sample, can be made arbitrarily close to $p(x,\theta)$ in the $L^2$ norm (see Lemma B). Thus the problem of approximating $p(x,\theta)$ in the $L^2$ norm can be replaced by that of approximating $p_{\mathrm{kde}}$.

2. The main technical part of the paper consists of the lower and upper bounds on the norm $\|p(x,\theta)-p(x,\tilde\theta)\|$ in terms of the Hausdorff distance between the sets of component means $m = \{\mu_i\}$ and $\tilde m = \{\tilde\mu_i\}$ (considered as sets of points). Specifically, in Theorem 3 and Lemma 3 we prove that

$$d_H(m,\tilde m) \le f\big(\|p(x,\theta)-p(x,\tilde\theta)\|\big) \le h\big(d_H(m,\tilde m) + \|\alpha-\tilde\alpha\|_1\big)$$

where $f$ and $h$ are some explicitly given increasing functions. The lower bound shows that $d_H(m,\tilde m)$ can be controlled by making $\|p(x,\theta)-p(x,\tilde\theta)\|$ sufficiently small, which (assuming minimum separation between the components of $p(x,\theta)$) immediately implies that each component mean of $p(x,\tilde\theta)$ is close to exactly one component mean of $p(x,\theta)$.

On the other hand, the upper bound guarantees that a search over a sufficiently fine grid in the space $\Theta$ will produce a value $\theta^\ast$ such that $\|p(x,\theta^\ast) - p_{\mathrm{kde}}\|$ is small.

3. Once the component means of the two mixtures are shown to be close, an argument using the Lipschitz property of the mixture with respect to the mean locations can be used to establish that the corresponding mixing coefficients are also close (Corollary 3).
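The Hausdorff distance between the two sets of means, which these bounds are stated in, can be computed directly; a small sketch with illustrative point sets:

```python
import numpy as np

# Hausdorff distance between two finite sets of component means.
def hausdorff(A, B):
    # d_H(A, B) = max( max_a min_b |a - b|, max_b min_a |a - b| )
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return max(D.min(axis=1).max(), D.min(axis=0).max())

m = np.array([[0.0, 0.0], [1.0, 0.0]])         # illustrative true means
mt = np.array([[0.1, 0.0], [1.0, 0.2]])        # illustrative estimates
assert abs(hausdorff(m, mt) - 0.2) < 1e-12
```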

We will now briefly outline the argument for the main theoretical contribution of this paper, which is the lower bound on the $L^2$ norm in terms of the Hausdorff distance (Theorem 3).

1. (Minimum distance; reduction from $\mathbb{R}^k$ to $\mathbb{R}$). Suppose a component mean $\mu_i$ is separated from every estimated mean by a distance of at least $d$. Then there exists a unit vector $v$ in $\mathbb{R}^k$ such that the projections of $\mu_i$ and of the estimated means onto $v$ remain separated by a proportional amount. In other words, a certain amount of separation is preserved after an appropriate projection to one dimension. See Lemma A for a proof.

2. (Norm estimation; reduction from $\mathbb{R}^k$ to $\mathbb{R}$). Let $p$ and $\tilde p$ be the true and estimated densities, respectively, and let $v$ be a unit vector in $\mathbb{R}^k$. $p_v$ and $\tilde p_v$ will denote the one-dimensional marginal densities obtained by integrating $p$ and $\tilde p$ in the directions orthogonal to $v$. It is easy to see that $p_v$ and $\tilde p_v$ are mixtures of 1-dimensional Gaussians whose means are the projections of the original means onto $v$. It is shown in Lemma A that

$$\|p-\tilde p\|^2 \ge \left(\frac{1}{c\sigma}\right)^{k}\|p_v-\tilde p_v\|^2$$

and thus, to provide a lower bound for $\|p-\tilde p\|$, it is sufficient to provide an analogous bound (with a different separation between the means) in one dimension.

3. (1-d lower bound). Finally, we consider a mixture $q$ of Gaussians in one dimension, with the assumption that one of the component means is separated from the rest of the component means by at least $t$ and that the (not necessarily positive) mixing weights exceed $\alpha_{\min}$ in absolute value. Assuming that the means lie in an interval $[-a,a]$, we show (Theorem 3.1)

$$\|q\|^2 \ge \alpha_{\min}^{4k}\left(\frac{t}{a^2}\right)^{Ck^2}$$

for some positive constant $C$ independent of $k$.

The proof of this result relies on analyzing the Taylor series of the Fourier transform of $q$ near the origin, which turns out to be closely related to a certain Vandermonde matrix.

Combining 1 and 2 above and applying the result in 3 yields the desired lower bound for $\|p-\tilde p\|$.
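The Vandermonde step can be illustrated numerically: the vector of derivatives of $g(u)=\sum_j \beta_j e^{-i\mu_j u}$ at the origin equals a Vandermonde-type matrix times the coefficient vector, so a lower bound on its smallest singular value forces some derivative away from zero. A sketch with illustrative means and signed weights:

```python
import numpy as np

# Derivatives of g(u) = sum_j beta_j e^{-i mu_j u} at u = 0 form the vector
# V @ beta with V_{rj} = (-i mu_j)^r, a Vandermonde matrix in the nodes -i mu_j.
mu = np.array([-0.4, 0.0, 0.3])                # illustrative separated 1-d means
beta = np.array([0.3, -0.5, 0.2])              # signed weights (a difference of mixtures)

V = np.vander(-1j * mu, increasing=True).T     # row r holds (-i mu_j)^r
derivs = V @ beta                              # g^{(r)}(0), up to r! factors
smin = np.linalg.svd(V, compute_uv=False)[-1]

# ||V beta|| >= smin * ||beta||, so the largest derivative is bounded below.
assert smin > 0
assert np.max(np.abs(derivs)) >= smin * np.linalg.norm(beta) / np.sqrt(len(mu)) - 1e-12
```

Since the nodes $-i\mu_j$ are distinct, the Vandermonde matrix is invertible and its smallest singular value is strictly positive, which is the quantitative content of the lemma on Vandermonde minors.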

## 3 Main Results

In this section we present our main results. First we show that we can reduce the problem in $\mathbb{R}^n$ to a corresponding problem in $\mathbb{R}^k$, where $n$ represents the dimension and $k$ is the number of components, at the cost of an arbitrarily small error. Then we solve the reduced problem in $\mathbb{R}^k$, again allowing for only an arbitrarily small error, by establishing appropriate lower and upper bounds on the mixture norm in $\mathbb{R}^k$.

{lemma}

[Reduction from $\mathbb{R}^n$ to $\mathbb{R}^k$] Consider a mixture $p(x,\theta)$ of $k$ $n$-dimensional spherical Gaussians whose means lie within a given cube and whose mixing coefficients satisfy $\alpha_i \ge \alpha_{\min}$ for all $i$. For any positive $\epsilon$ and $\delta$, given a sample of polynomial size, with probability greater than $1-\delta$ the problem of learning the parameters (means and mixing weights) of $p(x,\theta)$ within error $\epsilon$ can be reduced to learning the parameters of a $k$-dimensional mixture of spherical Gaussians whose means lie within a corresponding cube in $\mathbb{R}^k$. However, in $\mathbb{R}^k$ we need to learn the means within error $\frac{\epsilon}{2}$. {proof} Let $v_1,\dots,v_k$ be the top $k$ right singular vectors of the data matrix sampled from $p(x,\theta)$. It is well known (see [23]) that the space spanned by the means remains arbitrarily close to the space spanned by $v_1,\dots,v_k$. In particular, with probability greater than $1-\delta$, the projected means $\bar\mu_i$ satisfy $\|\mu_i-\bar\mu_i\| \le \frac{\epsilon}{2}$ for all $i$ (see Lemma A).

Note that each projected mean $\bar\mu_i$ can be represented by a $k$-dimensional vector of coefficients along the singular vectors $v_j$, that is, $\bar\mu_i = \sum_{j=1}^k \langle\bar\mu_i,v_j\rangle v_j$ for all $i$. Thus, learning the projected means amounts to learning these coefficient vectors in $\mathbb{R}^k$. Also note that each coefficient vector lies within a cube in $\mathbb{R}^k$ whose axes are along the top $k$ singular vectors $v_j$.

Now suppose we can estimate each coefficient vector within error $\frac{\epsilon}{2}$, i.e., each $\bar\mu_i$ is estimated by some $\hat\mu_i$ with $\|\bar\mu_i-\hat\mu_i\| \le \frac{\epsilon}{2}$. Then, by the triangle inequality, for each $i$, $\|\mu_i-\hat\mu_i\| \le \|\mu_i-\bar\mu_i\| + \|\bar\mu_i-\hat\mu_i\| \le \epsilon$.

From here onwards we will deal with mixtures of Gaussians in $\mathbb{R}^k$. Thus we will assume that $p(x,\theta)$ denotes the true mixture with means in $\mathbb{R}^k$, while $p(x,\tilde\theta)$ represents any other mixture in $\mathbb{R}^k$ with different means and mixing weights.

We first prove a lower bound for $\|p(x,\theta)-p(x,\tilde\theta)\|$. {theorem}[Lower bound in $\mathbb{R}^k$] Consider a mixture $p(x,\theta)$ of $k$ $k$-dimensional spherical Gaussians whose means lie within a given cube and whose mixing coefficients satisfy $\alpha_i \ge \alpha_{\min}$ for all $i$. Let $p(x,\tilde\theta)$ be some arbitrary mixture such that the Hausdorff distance between the set of true means and the set of estimated means satisfies $d_H(m,\tilde m) \ge \epsilon$. Then $\|p(x,\theta)-p(x,\tilde\theta)\|$ is bounded from below by an explicit increasing function of $\epsilon$, with constants independent of $\epsilon$. {proof} Consider a true mean $\mu_i$ whose closest estimate from $\tilde m$ is at distance at least $\epsilon$; then all the estimated means are at distance at least $\epsilon$ from $\mu_i$. Lemma A ensures the existence of a direction $v$ such that, upon projecting onto it, the projection of $\mu_i$ remains separated from all other projected means by a proportional amount. Note that after projecting onto $v$ the mixtures become mixtures of 1-dimensional Gaussians with variance $\sigma^2$ whose projected means lie within a bounded interval. Let us denote these 1-dimensional mixtures by $p_v$ and $\tilde p_v$ respectively. Then Theorem 3.1 provides a lower bound on $\|p_v-\tilde p_v\|$. Note that we obtain $p_v$ (respectively $\tilde p_v$) by integrating $p$ (respectively $\tilde p$) in all directions orthogonal to $v$. It remains to relate $\|p-\tilde p\|$ to $\|p_v-\tilde p_v\|$. This is done in Lemma A, where the constant is chosen in such a way that, in any direction, the probability mass of each projected Gaussian becomes negligible outside a bounded interval. Combining these bounds yields the claim, and since the argument holds for an arbitrary mean realizing the Hausdorff distance, we can replace $\epsilon$ by $d_H(m,\tilde m)$.

Next, we prove a straightforward upper bound for $\|p(x,\theta)-p(x,\tilde\theta)\|$. {lemma}[Upper bound in $\mathbb{R}^k$] Consider a mixture $p(x,\theta)$ of $k$ $k$-dimensional spherical Gaussians whose means lie within a given cube and whose mixing coefficients satisfy $\alpha_i \ge \alpha_{\min}$ for all $i$. Let $p(x,\tilde\theta)$ be some arbitrary mixture such that the Hausdorff distance between the set of true means and the set of estimated means satisfies $d_H(m,\tilde m) \le \frac{d_{\min}}{2}$. Then there exists a permutation $\pi$ such that $\|p(x,\theta)-p(x,\tilde\theta)\|$ is bounded from above by an explicit increasing function of $\max_i\|\mu_i-\tilde\mu_{\pi(i)}\|$ and $\|\alpha-\tilde\alpha\|_1$.

{proof}

Due to the constraint on the Hausdorff distance and the constraint on the pairwise distances between the means of $p(x,\theta)$, there exists a permutation $\pi$ such that each true mean $\mu_i$ is matched to a unique estimate $\tilde\mu_{\pi(i)}$ with $\|\mu_i-\tilde\mu_{\pi(i)}\| \le d_H(m,\tilde m)$. Due to this one-to-one correspondence, without loss of generality we can write the difference of the two densities as a sum of $k$ component-wise differences. The claim now follows by applying Lemma A to each term.

We now present our main result for learning a mixture of Gaussians with arbitrarily small separation. {theorem} Consider a mixture $p(x,\theta)$ of $k$ $n$-dimensional spherical Gaussians whose means lie within a given cube and whose mixing coefficients satisfy $\alpha_i \ge \alpha_{\min}$ for all $i$. Then given any positive $\epsilon$ and $\delta$, there exists a grid size $G$, independent of $n$, such that using a sample of polynomial size and the grid search given by Equation 1, our algorithm runs in time polynomial in all parameters except $k$ and provides mean estimates which, with probability greater than $1-\delta$, are within $\epsilon$ of their corresponding true values. {proof} The proof has several parts.
SVD projection: We have shown in Lemma 3 that after projecting to the SVD space (using a sample of polynomial size), it suffices to estimate the parameters of the mixture in $\mathbb{R}^k$, where we must estimate the means within error $\frac{\epsilon}{2}$.

Grid Search: Let us denote the parameters of the underlying mixture by $\theta$ (to keep the presentation simple we assume that the single variance parameter $\sigma^2$ is fixed and known; note that it can also be estimated), and let any approximating mixture have parameters $\tilde\theta$. We have proved the bounds $f_1(d_H(m,\tilde m)) \le \|p(x,\theta)-p(x,\tilde\theta)\| \le f_2\big(d_H(m,\tilde m)+\|\alpha-\tilde\alpha\|_1\big)$ (see Theorem 3, Lemma 3), where $f_1$ and $f_2$ are increasing functions. Let $G$ be the step/grid size (whose value we need to set) that we use for gridding along each of the parameters over the grid $M_G$. We note that the $L^2$ norm of the difference can be computed efficiently by the multidimensional trapezoidal rule or any other standard numerical analysis technique (see e.g., [4]); since this integration needs to be performed over a $k$-dimensional space, for any pre-specified precision it can be done in time exponential in $k$ but polynomial in the remaining parameters. Now note that there exists a point $\theta^\ast$ on the grid $M_G$ such that, if we could identify this point as our parameter estimate, we would make an error of at most $G$ in estimating each mixing weight and at most $G\sqrt{k}$ in estimating each mean. Since there are $k$ mixing weights and $k$ means to be estimated, $d_H(m,m^\ast)+\|\alpha-\alpha^\ast\|_1$ is bounded by an explicit increasing function of $G$. Thus,

$$f_1(d_H(m,m^\ast)) \le \|p_o(x,\theta)-p_o(x,\theta^\ast)\| \le f_2(G)$$

Now, according to Lemma B, using a sample of polynomial size we can obtain a kernel density estimate $p_{\mathrm{kde}}$ such that, with probability greater than $1-\delta$,

$$\|p_{\mathrm{kde}}-p_o(x,\theta)\| \le \epsilon^\ast \tag{2}$$

By the triangle inequality this implies,

$$f_1(d_H(m,m^\ast)) - \epsilon^\ast \le \|p_{\mathrm{kde}}-p_o(x,\theta^\ast)\| \le f_2(G) + \epsilon^\ast \tag{3}$$

Since there is a one-to-one correspondence between the set of means of $\theta$ and that of $\theta^\ast$, $d_H(m,m^\ast)$ essentially provides the maximum estimation error for any pair of a true mean and its corresponding estimate. Suppose we choose a grid size $G$ that satisfies

$$2\epsilon^\ast + f_2(G) \le f_1\!\left(\frac{\epsilon}{2}\right) \tag{4}$$

For this choice of grid size, Equation 3 and Equation 4 ensure that $\|p_{\mathrm{kde}}-p_o(x,\theta^\ast)\| \le f_1\!\left(\frac{\epsilon}{2}\right) - \epsilon^\ast$. Now consider a point $\theta_N$ on the grid such that $d_H(m,m_N) > \frac{\epsilon}{2}$. This implies,

$$f_1(d_H(m,m_N)) > f_1\!\left(\frac{\epsilon}{2}\right) \tag{5}$$

Now,

$$\|p_{\mathrm{kde}}-p_o(x,\theta_N)\| \overset{(a)}{\ge} \|p_o(x,\theta)-p_o(x,\theta_N)\| - \|p_{\mathrm{kde}}-p_o(x,\theta)\| \overset{(b)}{\ge} f_1(d_H(m,m_N)) - \epsilon^\ast \overset{(c)}{>} f_1\!\left(\frac{\epsilon}{2}\right) - \epsilon^\ast \overset{(d)}{\ge} f_2(G) + \epsilon^\ast \overset{(e)}{\ge} \|p_{\mathrm{kde}}-p_o(x,\theta^\ast)\|$$

where inequality (a) follows from the triangle inequality, inequality (b) follows from Equation 2 together with the lower bound of Theorem 3, strict inequality (c) follows from Equation 5, inequality (d) follows from Equation 4, and finally inequality (e) follows from Equation 3. Thus any grid point whose means are further than $\frac{\epsilon}{2}$ (in Hausdorff distance) from the true means has a strictly larger objective value than $\theta^\ast$, so for a grid size satisfying Equation 4 the solution obtained by Equation 1 has mean estimation error at most $\frac{\epsilon}{2}$. Once projected onto the SVD space, each projected mean lies within a fixed cube, so the grid search for the means runs in time exponential in $k$ but polynomial in the remaining parameters; the same holds for the grid search over the mixing weights.
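The numerical integration used inside the grid search, computing the $L^2$ distance between two candidate densities by the multidimensional trapezoidal rule, can be sketched as follows; the example is in two dimensions with illustrative parameters.

```python
import numpy as np

# L2 distance between two candidate 2-d mixture densities via the
# multidimensional trapezoidal rule.
sigma = 1.0
xs = np.linspace(-6.0, 6.0, 301)
dx = xs[1] - xs[0]
X, Y = np.meshgrid(xs, xs)

def gauss2(mx, my):
    return np.exp(-((X - mx) ** 2 + (Y - my) ** 2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)

p = 0.5 * gauss2(0.0, 0.0) + 0.5 * gauss2(1.0, 0.0)
q = 0.5 * gauss2(0.0, 0.0) + 0.5 * gauss2(1.1, 0.0)

def trap(a, axis):
    # trapezoidal rule along one axis with uniform spacing dx
    return (a.sum(axis=axis) - 0.5 * (a.take(0, axis=axis) + a.take(-1, axis=axis))) * dx

l2 = np.sqrt(trap(trap((p - q) ** 2, axis=1), axis=0))
assert 0.001 < l2 < 0.05
```

In $k$ dimensions the same nested quadrature applies coordinate by coordinate, which is the source of the exponential-in-$k$ cost noted above.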

We now show that not only the mean estimates but also the mixing weight estimates obtained by solving Equation 1 are within $\epsilon$ of their true values. In particular, we show that if two mixtures have almost the same means and the norm of the difference of their densities is small, then the difference of the corresponding mixing weights must also be small.

{corollary}

With sample size and grid size as in Theorem 3, the solution of Equation 1 provides mixing weight estimates which are, with high probability, within $\epsilon$ of their true values. Due to space limitations we defer the proof to the Appendix.

### 3.1 Lower Bound in 1-Dimensional Setting

In this section we prove our main theoretical result in the 1-dimensional setting. Before presenting the actual proof, we give the high-level argument that leads to it. First note that the Fourier transform of a mixture of univariate Gaussians $q(x) = \sum_j \beta_j N(x;\mu_j,\sigma^2)$ is given by

$$\hat q(u) = \Big(\sum_j \beta_j e^{-i\mu_j u}\Big) e^{-\frac{\sigma^2 u^2}{2}}$$

Thus $|\hat q(u)|^2 = |g(u)|^2 e^{-\sigma^2 u^2}$, where $g(u) = \sum_j \beta_j e^{-i\mu_j u}$. Since the $L^2$ norm of a function and that of its Fourier transform agree up to a constant factor, we can write,
Further, and we can write,

$$\|q\|^2 = \frac{1}{2\pi}\int |g(u)|^2 e^{-\sigma^2 u^2}\,du$$

where $g(u) = \sum_j \beta_j e^{-i\mu_j u}$. This is a complex-valued function of a real variable which is infinitely differentiable everywhere. In order to bound the above squared norm from below, our goal is now to find an interval on which $|g|$ is bounded away from zero. To achieve this, we write the Taylor series expansion of $g$ at the origin using $k$ terms. This expansion can be written in matrix-vector multiplication format $d = V\beta$, where $V$ is the matrix with entries $(-i\mu_j)^r$ and $d$ captures the function value and the derivative values at the origin; in particular, $\|d\|^2$ is the sum of the squares of the function and its derivatives at the origin. Noting that $V$ is a Vandermonde matrix, we establish (see Lemma A) a lower bound on $\|d\|$. This implies that at least one of the derivatives of $g$ is bounded away from zero at the origin. Once this fact is established, and noting that the next derivative of $g$ is bounded from above everywhere, it is easy to show (see Lemma A) that there is an interval on which that derivative of $g$ is bounded away from zero throughout. Then, using Lemma A, it can be shown that there is a subinterval on which the derivative of one order lower is bounded away from zero. Successively repeating this lemma, it follows that there exists a subinterval on which $|g|$ itself is bounded away from zero. Once this subinterval is found, it is easy to show that $\|q\|$ is lower bounded as well.
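The identity underlying this argument, $\|q\|^2 = \frac{1}{2\pi}\int |g(u)|^2 e^{-\sigma^2 u^2}\,du$, can be checked numerically; the means and signed weights below are illustrative.

```python
import numpy as np

# Numerical check of ||q||^2 = (1/2pi) * integral of |g(u)|^2 e^{-sigma^2 u^2} du
# for a signed combination q of univariate Gaussians.
sigma = 1.0
mu = np.array([-0.5, 0.1, 0.4])
beta = np.array([0.4, -0.7, 0.3])

x = np.linspace(-30.0, 30.0, 60001)
dx = x[1] - x[0]
q = sum(b * np.exp(-(x - m) ** 2 / (2 * sigma**2)) for b, m in zip(beta, mu))
q /= sigma * np.sqrt(2 * np.pi)
lhs = np.sum(q**2) * dx                        # ||q||^2 in the original space

u = np.linspace(-12.0, 12.0, 24001)
du = u[1] - u[0]
g = np.exp(-1j * np.outer(u, mu)) @ beta       # g(u) = sum_j beta_j e^{-i u mu_j}
rhs = np.sum(np.abs(g) ** 2 * np.exp(-sigma**2 * u**2)) * du / (2 * np.pi)

assert lhs > 0
assert abs(lhs - rhs) < 1e-6
```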

Now we present the formal statement of our result. {theorem}[Lower bound in 1-d] Consider a mixture of univariate Gaussians $q(x) = \sum_j \beta_j N(x;\mu_j,\sigma^2)$ where, for all $j$, the mixing coefficients satisfy $|\beta_j| \ge \alpha_{\min}$ and the means satisfy $\mu_j \in [-a,a]$. Suppose there exists a $j_0$ such that $|\mu_{j_0}-\mu_j| \ge t$ for all $j \ne j_0$. Then the $L^2$ norm of $q$ satisfies $\|q\|^2 \ge \alpha_{\min}^{4k}\left(\frac{t}{a^2}\right)^{Ck^2}$, where $C$ is some positive constant independent of $k$.

{proof}

Note that,

$$\|q\|^2 = \frac{1}{2\pi}\int |g(u)|^2 e^{-\sigma^2 u^2}\,du$$

where $g(u) = \sum_j \beta_j e^{-i\mu_j u}$. Thus, in order to bound the above squared norm from below, we need to find an interval on which $|g|$ is bounded away from zero. Note that $g$ is an infinitely differentiable function with $r$-th order derivative $g^{(r)}(u) = \sum_j \beta_j(-i\mu_j)^r e^{-i\mu_j u}$. (The Fourier transform is closely related to the characteristic function, and the $r$-th derivative of $g$ at the origin is related to the $r$-th order moment of the mixture.) Now we can write the Taylor series expansion of $g$ about the origin as,

$$g(u) = g(0) + g^{(1)}(0)\frac{u}{1!} + g^{(2)}(0)\frac{u^2}{2!} + \dots + g^{(k-1)}(0)\frac{u^{k-1}}{(k-1)!} + O(u^k)$$

which can be written in matrix-vector form as $d = V\beta$, where $d$ collects the values $g^{(r)}(0)$ for $r = 0,\dots,k-1$ and $V$ has entries $V_{rj} = (-i\mu_j)^r$. Note that the matrix $V$ is a Vandermonde matrix; thus, using Lemma A, $\|d\| = \|V\beta\|$ is bounded from below. This further implies that either $|g(0)|$ is bounded away from zero or there exists an $r \le k-1$ such that $|g^{(r)}(0)|$ is bounded away from zero. In the worst case we can have $r = k-1$, i.e., only the $(k-1)$-th derivative of $g$ is lower bounded at the origin, and we need to find an interval on which $|g|$ itself is lower bounded.

Next, note that for any $u$ the $k$-th derivative is uniformly bounded: $|g^{(k)}(u)| \le \sum_j |\beta_j||\mu_j|^k \le \big(\sum_j|\beta_j|\big)a^k$. Combining the lower bound on $|g^{(k-1)}(0)|$ with this upper bound, Lemma A gives an interval around the origin on which $|g^{(k-1)}|$ is bounded away from zero throughout. For simplicity, assume without loss of generality that the relevant derivative is real and positive on this interval. Now repeatedly applying Lemma A $k-1$ times yields that $|g|$ is bounded away from zero on some subinterval (or on any other subinterval of the same length within it).

In particular, this implies that $|g|$ is bounded away from zero on an interval of explicitly controlled length; substituting this into the integral above yields the claimed lower bound on $\|q\|^2$.