Gaussian Mixture Clustering Using Relative Tests of Fit
Abstract
We consider clustering based on significance tests for Gaussian Mixture Models (GMMs). Our starting point is the SigClust method developed by Liu et al. (2008), which introduces a test based on the kmeans objective (with k = 2) to decide whether the data should be split into two clusters. When applied recursively, this test yields a method for hierarchical clustering that is equipped with a significance guarantee. We study the limiting distribution and power of this approach in some examples and show that there are large regions of the parameter space where the power is low. We then introduce a new test based on the idea of relative fit. Unlike prior work, we test for whether a mixture of Gaussians provides a better fit relative to a single Gaussian, without assuming that either model is correct. The proposed test has a simple critical value and provides provable error control. One version of our test provides exact, finite sample control of the type I error. We show how our tests can be used for hierarchical clustering as well as in a sequential manner for model selection. We conclude with an extensive simulation study and a cluster analysis of a gene expression dataset.
1 Introduction
Gaussian mixture models (GMMs) are a commonly used tool for clustering. A major challenge in using GMMs for clustering is in adequately answering inferential questions regarding the number of mixture components or the number of clusters to use in data analysis. This task typically requires hypothesis testing or model selection. However, deriving rigorous tests for GMMs is notoriously difficult since the usual regularity conditions fail for mixture models (Ghosh and Sen 1984; DacunhaCastelle et al. 1999; Gassiat 2002; McLachlan and Peel 2004; McLachlan and Rathnayake 2014; Chen 2017; Gu et al. 2017).
In this direction, Liu et al. (2008) proposed an approach called SigClust. Their method starts by fitting a multivariate Gaussian to the data. Then a significance test based on means clustering, with , is applied. If the test rejects, then the data is split into two clusters. This test can be applied recursively leading to a topdown hierarchical clustering (Kimes et al. 2017). This approach roughly attempts to distinguish clusters which are actually present in the data from the natural sampling variability. The method is appealing because it is simple and because, as we further elaborate on in the sequel, it provides certain rigorous error control guarantees.
In this paper we study the power of SigClust and show that there are large regions of the parameter space where the method has poor power. A natural way to fix this would be to use another statistic designed to distinguish “a Gaussian” versus “a mixture of two Gaussians” such as the generalized likelihood ratio test. However, such an approach has two problems: first, as mentioned above, mixture models are irregular and the limiting distribution of the likelihood ratio test (and other familiar tests) is intractable. Second, such tests assume that one of the models (Gaussian or mixture of Gaussians) is correct. Instead from a practical standpoint, for the purposes of clustering, we only regard these models as approximations.
So we consider a different approach. We test whether one model is closer to the true distribution than the other without assuming either model is true. We call this a test of relative fit. Our test is based on data splitting. Half the data are used to fit the models and the other half are used to construct the test. The result is a test with a simple limiting distribution which makes it easy to determine an appropriate cutoff for it. In fact, we provide several versions of the test. One version provides exact type I error control without requiring any asymptotic approximations.
Following Kimes et al. (2017), we also apply the test recursively to obtain a hierarchical clustering of the data with significance guarantees. We develop a bottomup version of mixture clustering which can be regarded as a linkage clustering procedure where we first overfit a mixture and subsequently combine elements of the mixture. We also construct a sequential, nonhierarchical version, of the approach. We call our procedure Rift (Relative Information Fit Test).
Throughout this paper we assume that the dimension is fixed and the sample size is increasing. In contrast, Liu et al. (2008) and Kimes et al. (2017) focus on the large , fixed case which requires dealing with challenging issues such as estimating the covariance matrix in high dimensions (see also Vogt and Schmid (2017)). However, because of the challenges of high dimensional estimation, these prior works only establish results about power in very specific cases. In contrast, we provide a more detailed understanding of the power in the fixed case.
1.1 Related Work
Estimating the number of clusters has been approached in many ways (Bock 1985; Milligan and Cooper 1985; McLachlan and Peel 2004). A common approach is to find the optimal number of clusters by optimizing a criterion function, examples of which are the Hartigan index (Hartigan 1975), the silhouette statistic (Rousseeuw 1987) or the gap statistic (Tibshirani et al. 2001).
Another approach to estimating the number of clusters is to assess the statistical significance of the clusters. McShane et al. (2002) proposed a method to calculate pvalues by assuming that the cluster structure lies in the first three principal components of the data. Tibshirani and Walther (2005) use resampling techniques to quantify the prediction strength of different clusters and Suzuki and Shimodaira (2006) assess the significance of hierarchical clustering using bootstrapping procedures. More recently, Maitra et al. (2012) proposed a distributionfree bootstrap procedure which assumes that the data in a cluster is sampled from a spherically symmetric, compact and unimodal distribution. Engelman and Hartigan (1969) considered the maximal Fratio that compares between group dispersions with within group dispersions. Lee (1979) proposed a subsequent multivariate version and a robust version was recently proposed by GarciaEscudero et al. (2009). Another example is a statistical test proposed by Vogt and Schmid (2017). They develop a fairly general significance test but it relies on assuming that the number of covariates tends to infinity and that the clusters are, in a certain sense, wellseparated (i.e. can be consistently estimated as the number of features increases).
Alternatively, and closer to our approach, Gaussian mixture models can be used for cluster analysis. See for instance, the works Fraley and Raftery (2002); McLachlan and Peel (2004); McLachlan and Rathnayake (2014) for overviews. There is much prior work for testing the order of a Gaussian mixture. For example, the works Ghosh and Sen (1984); Hartigan (1985) used the likelihood ratio test with the null hypothesis that the order is one. Hartigan (1985) explored the impact of nonregularity of the mixture models and Ghosh and Sen (1984) used a separation condition in order to find the asymptotic distribution of the likelihood ratio test statistic.
Since finite normal mixture models are irregular and the limiting distribution of the likelihood ratio test statistic is difficult to derive, deriving a general theory for testing the order of a mixture is hard. Instead most of the algorithms test for homogeneity in the data. The works Charnigo and Sun (2004); Liu and Shao (2004); Chen et al. (2009) among others, are examples of this approach. More recently, Li and Chen (2010) and Chen et al. (2012) constructed a new likelihoodbased expectationmaximization (EM) test for the order of finite mixture models that uses a penalty function on the variance to obtain a bounded penalized likelihood. Further developments can be found in the works DacunhaCastelle et al. (1999); Gassiat (2002); Chen (2017); Gu et al. (2017). Our approach differs in three ways: we use a test that avoids the irregularities, it avoids assuming that the mixture model is correct and it is valid for multivariate mixtures. We only treat the mixture model as an approximate working model.
Liu et al. (2008) proposed a Monte Carlo based algorithm (SigClust) that defines a cluster as data generated from a single multivariate Gaussian distribution. The distribution of the test statistic under the null hypothesis SigClust depends on the eigenvalues of the null covariance matrix. Huang et al. (2015) proposed a softthresholding method that provides an estimate of these eigenvalues, and this softthresholding method leads to a modified version of SigClust that is better suited to highdimensional problems.
1.2 Outline
In Section 2 we review the SigClust procedure and we derive its power in some cases. We show that SigClust can have poor power against certain alternatives. This section also has results on the geometric properties of means clustering in a special case, which is a prelude to finding the power. In Section 3 we describe our new procedure. We also describe several other tests that are used for comparison. In Section 4 we show how to use our new tests in a hierarchical framework. Section 5 describes a sequential testing version of our approach which can be used for model selection for the GMM. We consider some examples in Section 6, and analyze a gene expression dataset in Section 7. Finally, concluding remarks are in Section 8. We defer the technical details of most proofs to the Appendix.
1.3 Notation
Throughout this paper we use to denote the Euclidean norm, i.e. for , . We use the symbols and to denote the standard stochastic convergence concepts of convergence in probability and in distribution respectively.
2 Setup and the SigClust Procedure
We let be i.i.d. observations from some distribution with probability measure on . We recall the means clustering algorithm which chooses cluster centers to minimize the withincluster sum of squares,
(1) 
as a function of . For each center , we can also associate a convex polyhedron which contains all points in closer to than to any other center. The sets are the Voronoi tessellation of . The tessellation defines the clustering. We also define,
and we let denote the minimizer of . When the minimizers are not unique we let and denote arbitrary minimizers of and respectively.
2.1 SigClust
In this section, we describe the SigClust procedure. Liu et al. (2008) define a cluster as a population sampled from a single Gaussian distribution. To capture the nonGaussianity due to the presence of multiple clusters, Liu et al. (2008) consider a test statistic based on means, with . Specifically, define to be the ratio between the withinclass sum of squares and the total sum of squares as,
where is the vector of optimal centers chosen by the means clustering algorithm and is the sample mean of the data. We note in passing that in their extension of this method to hierarchical clustering, Kimes et al. (2017) also consider other statistics that arise in hierarchical clustering.
Roughly, we reject the null for small values of this statistic. In order to estimate the pvalue we use a version of the parametric bootstrap. The (estimated) pvalue is an estimate of where is computed on the bootstrap samples from , where and where and is the sample covariance matrix. We note that in the high dimensional case, as discussed earlier, Liu et al. (2008) use a regularized estimator of .
2.2 Limiting Distribution of SigClust under the Null
In order to analytically understand the SigClust procedure and to develop results regarding its power we first find the limiting distribution of the test statistic under the null in a simplified setup.
We focus in this and subsequent sections on the case when under the null hypothesis, we obtain samples from where is a diagonal matrix. We assume that the two leading eigenvalues are distinct which ensures that, under the null, the kmeans objective at the populationlevel has a unique optimal solution whose optimal value is tractable to analyze in closedform. For notational convenience, we will assume that,
Our results extend in a straightforward way to the general nonspherical, axisaligned case with minor modifications. These results in turn are easily generalized to the nonspherical, not necessarily axisaligned, case by noting the invariance of the test statistic to orthonormal rotations under the null. The spherical case is more challenging since the population optimal kmeans solution is not unique and the limiting distribution is more complicated. To illustrate some of the difficulties, we derive the limiting distribution of the SigClust statistic, under the null, for the twodimensional case in Appendix B.3, but do not consider the power of the test in that setting.
Recall, that denotes the (unique) population optimal means solution, and we use to denote the corresponding Voronoi partition. Our results build on the following result from Pollard (1982) and Bock (1985):
Lemma 1 (Corollary 6.2 of Bock (1985)).
The minimum within cluster sum of squares has an asymptotically normal distribution given by,
where
To analyze the power of the SigClust procedure, and to better understand its limiting distribution, we need to calculate and the mass of the Voronoi cells. It is easy to verify that under the null the probability of each of the Voronoi cells corresponding to is 1/2. In Appendix B.1, we establish the following claims:
(2)  
(3) 
As a consequence of these calculations, we obtain the limiting distribution of the SigClust statistic under the null:
Remark: Leveraging this result, we are able to characterize the rejection region of the test and in Theorem 4 we analyze its power. The proof of Theorem 1 is quite long and technical. Most of the work is done in the Appendix. Here is a brief proof that leverages Lemma 1 which contains most of the technical details.
Proof.
From Lemma 1 we have that,
Furthermore by the Weak Law of Large Numbers we have that,
Putting these together yields the desired claim. ∎
2.3 Geometry of means under the alternative
Our goal is to find special cases where we are able to explicitly calculate SigClust’s power and understand cases in which it has high power and cases where it has low power. In order to find the power, we first need to understand the behaviour of means clustering under the alternative. In particular, we need to understand what the optimal split is and what the optimal within sum of squares is, if the data was indeed generated from the alternative.
We focus on the case when the data, under the alternative, is generated from a mixture of two Gaussian distributions of the form
(4) 
where , is a nonzero constant and is a diagonal matrix,
In this section, we will consider cases where allowing in some cases to be larger than . We treat the case when and are fixed (do not vary with the samplesize).
For technical reasons, we make a small modification to means clustering. We consider means clustering with symmertric centers. That is, we consider that minimizes the withincluster sum of squares,
(5) 
as a function of
We also introduce notation for the optimal split by considering a symmetric population version of the means clustering for the following theorems and lemmas. We define the following terms to be used in the lemmas. Let
where and denote the optimum cluster centers that minimize the within sum of squares when symmetric means clustering is performed on the data. The corresponding minimum within sum of squares is denoted by . That is,
We conjecture that this symmetric assumption has no practical effect on SigClust, since the samples are drawn from a symmetric distribution and in practice the optimum means cluster centers are close to being symmetric. Moreover, to consider the limiting distribution of , given by Theorem 6.4 (b) of Bock (1985, pp. 101), we need the population optimal centers to be unique. This is guaranteed only if the population optimal centers are symmetric about the origin, since if minimizes the population within sum of squares, then due to the symmetry of the distribution, also minimizes the population within sum of squares. Therefore for the minimizer to be unique, .
Therefore we state a result analogous to Theorem 6.4 (b) of Bock (1985, pp. 101) for symmetric means clustering for our population as follows:
Theorem 2.
Let the data be generated from , as defined above, and and are as defined above. Suppose

the vector that minimizes is unique upto relabeling of its coordinates;

the matrix is positive definite, where as defined in Pollard (1982) (as ) is a matrix made up of matrices of the form,
(6) for where , is the corresponding density function, is the dimensional Lebesgue measure, is the convex polyhedron that contains all points in that are closer to compared to and is viceversa and denotes the face common to and , and denotes the identity matrix.
Then as ,
where
Since and denote the optimum cluster centers, the corresponding optimal separating hyperplane passes through the origin. We denote the corresponding optimal separating hyperplane by
Then the corresponding within sum of squares can be written as:
The following theorem gives the optimal separating hyperplane and the optimal within sum of squares under the alternative.
Theorem 3.
For data generated from , where , is fixed and is a diagonal matrix with elements , such that are fixed.

When
(7) the unique optimal separating hyperplane which gives the minimum within sum of squares is given by , that is, the unique optimal is such that and for every . The corresponding optimal within sum of squares is given by
(8) 
When
(9) the unique optimal separating hyperplane which gives the minimum within sum of squares is given by , that is, the unique optimal is such that and for every . The corresponding optimal within sum of squares is given by
(10)
In simpler words, the theorem implies that when the condition in (7) holds, i.e. when the variance along the second covariate is small, the optimal symmetric means split at the populationlevel splits the data along the first covariate. On the other hand when the condition in (9) holds, i.e. when the variance along the second covariate is large, the optimal symmetric means split at the populationlevel is along the second covariate.
We conjecture that even for means clustering without the symmetric assumption, as long as the data is generated from , the above statement holds. That is, when the condition in (7) holds, the optimal means split at the populationlevel is along the first covariate and on the other hand when the condition in (9) holds, the optimal means split at the populationlevel is along the second covariate.
Additionally we also have the following lemma:
Lemma 2.
2.4 Power
In this section we derive the asymptotic power of the test using the previous results on the limiting distribution. Since in the previous section we assumed using a symmetric means clustering we now consider the test statistic for the symmetric means clustering. We define
(11) 
Let
denote the power of the test where denotes the level critical value. Building once again on the result in Lemma 1 and additionally on Theorem 2, we show the following result:
Theorem 4.
Suppose that samples are generated according to the model described in (4) and let then:

Consistent: If,
(12) then SigClust is consistent, i.e. as .

Inconsistent: On the other hand if,
(13) then SigClust is inconsistent, i.e. as .
Remarks:

In order to roughly understand the result, as we show more precisely in the Appendix for small values of :
where we use to mean equal up to a small error of size roughly . As a consequence, in our setup we see that when the variance of the second covariate is sufficiently large SigClust has no power in detecting departures from Gaussianity along the first covariate.

We observe a phasetransition in the power of SigClust, and we provide a precise characterization of this phasetransition. We highlight that the low power of SigClust is a persistent phenomenon, i.e. there is a large, nonvanishing part of the parameter space where the test is not consistent. We see that the power of SigClust is very sensitive to the particular values of the variances in the matrix . In the next section we consider alternative tests based on relativefit that address these drawbacks of SigClust.

The proof of this result is quite technical and we defer the details to Appendix D. At a highlevel, the proof follows from Theorem 3 which characterizes the optimal 2means split at the populationlevel, and uses it to study the distribution of the test statistic under the alternate. We then leverage our previous characterization of the distribution of the test statistic under the null to study the power of SigClust.

Despite the technical nature of the proof, the intuition behind the phasetransition is quite natural. As shown in Theorem 3, when the condition in (12) holds, the optimal means split at the populationlevel splits the data along the first covariate and as a result the test is able to detect the nonGaussianity of the first covariate. On the other hand when the condition in (13) holds, the optimal means split at the populationlevel is along the second covariate and the resulting test is asymptotically inconsistent.

Finally, we note in passing that in the case when
the means solution is no longer unique, and we are unable to use our techniques to characterize the power of the test. However, we conjecture that SigClust remains inconsistent even in this case.
3 A Test For Relative Fit of Mixtures
A natural way to improve the low power of SigClust is to formally test for whether the data are generated from a Gaussian versus a mixture of Gaussians. There is a long history of research on this problem; see, for example, DacunhaCastelle et al. (1999); Gassiat (2002); Chen (2017); Gu et al. (2017) and references therein. As we mentioned earlier, the mixture model is irregular and there has been little success in deriving a practical, simple test with valid type I error control. Furthermore, and more importantly, such tests ignore the fact that we are only using the parametric model as an approximation; we don’t expect that the true distribution is exactly Gaussian or a mixture of Gaussians. This motivates our new approach where we test the relative fit of the models without assuming that either model is correct. Also, our test is valid for multivariate mixtures whereas many of the existing tests are for the univariate case.
3.1 The Basic Test
Let denote the set of multivariate Gaussians and let denote the set of mixtures of two multivariate Gaussians. We are given a sample but we do not assume that is necessarily in either or . Note that, for notational simplicity, we denote the total sample size by .
We randomly split the data into two halves and . Assume each has size . Using , fit a Gaussian and a mixture of two Gaussians . Any consistent estimation procedure can be used; in our examples we use the Expectation Maximization (EM) algorithm. Understanding precise conditions under which EM yields a global maximizer is an area of active research (Balakrishnan et al. 2017) but we do not pursue this further in this paper.
Instead of testing versus we test whether is a significantly better fit for the data than . This is a different hypothesis from the usual one, but, arguably, it is more relevant since it is or that will be used for clustering. Furthermore, this does not require that the true distribution be in either or .
To formalize the test, let
(14) 
where is the KullbackLeibler distance and is the true density. Note that is a random variable. Formally, we will test, conditional on ,
(15) 
Since is a random variable, these are random hypotheses. Let
(16) 
where Below, we show that, conditionally on ,
where The quantity can be estimated by We reject if
and we refer to this as the Rift (Relative Information Fit Test). For technical reasons, we make a small modification to the test statistic. We replace with where and is some small positive number, for example, . This has no practical effect on the test and is only needed for the theory.
For the following result, let the fitted Gaussian density be given by and the fitted mixture of two Gaussians be given by , where and . For technical reasons, we restrict the parameter estimates to lie in a compact set. Formally, we assume that each is restricted to lie in a compact set and that the eigenvalues of and lie in some interval for , where . As a consequence of data splitting, the test of relative fit has a simple limiting distribution unlike the usual tests for mixtures which have intractable limits.
Theorem 5.
Let where Then, under
(17) 
where , and is a constant.
Remark: It is also possible to consider the normalized version of the statistic . Formally, under the conditions of the above result conditional on :
where . We note that since the constant does not depend on this result also holds unconditionally.
We now turn our attention to the power of Rift. Suppose that we consider a distribution such that,
(18) 
i.e. is a distribution for which the class of mixtures of two Gaussians provides a strictly better fit than a single Gaussian. Then we have the following result characterizing the power of Rift:
Theorem 6.
Suppose that in (18) is strictly positive, then Rift is asymptotically consistent, i.e. as .
Remark: A consequence of this result is that Rift is consistent against any fixed distribution . In other words, the power deficiency of SigClust observed in Theorem 4 does not happen for our test.
3.2 Variants of Rift
In this section we introduce and study a few variants of Rift that can be advantageous in various applications.
A Robust, Exact Test. The KullbackLeibler (KL) distance between two densities and is where . This distance can be sensitive to the tail of the distribution of . For this reason we also consider a robustified version of the KL distance, namely, , that is, the median of under (we will assume for convenience that the median is unique). In this case, the sample median of is a consistent estimator of , where .
For relative fit we define
(19) 
where . A point estimate is the sample median based on . To test versus we use the sign test. Hence, under , . We will refer to this as medianRift or MRift. This approach has two advantages: it is robust and it does not require any asymptotic approximations.
Version. The test does not have to be based on KullbackLeibler distance. We can also use the distance as we now explain. Define the relative fit by We test, conditional on ,
To estimate , note that we can write which can be estimated by
where To evaluate the integrals, we use importance sampling. We sample from a convenient density (such as a distribution) and then use
Again, for technical reasons, we make a small modification to the test statistic. We replace with where and is some tiny positive number, for example, . Again this has no practical effect on the test and is only needed for the theory. Recall that the Gaussian density is given by and the mixture of two Gaussians is given by , where and . Once again, we assume that are restricted to lie in a compact set and that the eigenvalues of and lie in the interval for and for .
Theorem 7.
Let where . Then, under ,
(20) 
where , and is a constant.
3.3 Aside: A Test for Mixtures
Our focus is on the relative fit as described in the previous section. However, it is possible to modify our test so that it tests the more traditional hypotheses
where are Normals and are the mixtures of two Normals. There is currently no available test that is simple, asymptotically valid and has easily computable critical values in the multivariate case. But we can use our split test for this hypothesis if we modify the test using the idea of Ghosh and Sen (1984) where we force the fit under the alternative to be bounded away from the null. When combined with data splitting, this results in a valid test. Specifically, when we fit , we will constrain the fitted density to satisfy for all . Here, is any small, positive constant.
Theorem 8.
If then .
Hence, combining data splitting with the GhoshSen separation idea yields an asymptotically valid test for mixtures with a simple critical value. To the best of our knowledge, this is the first such test.
3.4 Truncation
If we use Rift for topdown hierarchical clustering, as described later in Section 4, then after the first split, the null hypothesis will be a truncated Normal rather than a Normal since the test is now applied to the data in a cluster. Instead of comparing the fit of a Normal and a fit of a mixture of two Normals , we need to compare the fit of a truncated normal to a fit of a truncated mixture of two Normals. We can use exactly the same test except that should be replaced with where denotes the subset of corresponding to the cluster being tested. We can estimate as follows. First, generate for some large . Then set . Then replace with with in the test.
3.5 Other Tests
Another way to decide whether to split the Normal is to use a goodnessoffit test for Normality. In this section we describe two such tests. Note that such tests can only be used for the first split in the clustering problem. We include them in our study because they are simple and they provide a point of comparison. We also note that it is possible to use tests for goodnessoffit with minimaxoptimal power against neighborhoods defined in particular metrics, based on binning and the test, but these tests are complex and have tuning parameters that need to be carefully chosen.
1. Mardia’s multivariate Kurtosis test. Mardia (1974) proposed using the Kurtosis measure to test for normality. If is a dimensional random (column) vector with expectation and nonsingular covariance matrix , Mardia (1970) defines the multivariate Kurtosis as
The proposed test uses the Kurtosis measure to test for multivariate normality. If are independent observations from any multivariate normal distribution, then the sample analogue of Kurtosis is given by,
where
are the sample mean vector and the sample covariance matrix. Mardia (1970) shows that has a distribution under the null hypothesis, , given by
as . So we reject the null hypothesis for both large and small values of . This multivariate normality test is consistent if, and only if,
For detecting clusters, the method starts by fitting a multivariate Gaussian to the data. We then perform the multivariate normality test using the Kurtosis measure and if the test gets rejected then the data is split into two clusters. We reject at level if
2. NN Test. Nearest neighbor (NN) goodness of fit tests were developed by Bickel and Breiman (1983) and Zhou and Jammalamadaka (1993). Let be samples from with a density function . We want to test where has density .
In the clustering framework, we consider the null hypothesis that the data is drawn from a single multivariate Gaussian distribution. That is, we consider to be the multivariate Gaussian distribution, with some mean and covariance matrix . To implement these tests, we first split the data into two halves and and use in order to estimate the and . Therefore in our setting, is the estimated null.
Let . The first version of this test uses
where is the volume of a ball of radius and . Under , the ’s are approximately Uniform on and hence we can use the KolmogorovSmirnov test.
For the second version, we consider the test proposed by Zhou and Jammalamadaka (1993) that uses