Constrained Optimization for a Subset of the Gaussian Parsimonious Clustering Models
The expectation-maximization (EM) algorithm is an iterative method for finding maximum likelihood estimates when data are incomplete or are treated as being incomplete. The EM algorithm and its variants are commonly used for parameter estimation in applications of mixture models for clustering and classification. This despite the fact that even the Gaussian mixture model likelihood surface contains many local maxima and is singularity riddled. Previous work has focused on circumventing this problem by constraining the smallest eigenvalue of the component covariance matrices. In this paper, we consider constraining the smallest eigenvalue, the largest eigenvalue, and both the smallest and largest within the family setting. Specifically, a subset of the GPCM family is considered for model-based clustering, where we use a re-parameterized version of the famous eigenvalue decomposition of the component covariance matrices. Our approach is illustrated using various experiments with simulated and real data.
The expectation-maximization (EM) algorithm (Dempster et al., 1977) is an iterative procedure for finding maximum likelihood estimates when data are incomplete or treated as such. Although the EM algorithm is commonly attributed to Dempster et al. (1977), Titterington et al. (1985, Section 4.3.2) point out that similar treatments had previously been employed by Baum et al. (1970), Orchard and Woodbury (1972), and Sundberg (1974). The EM algorithm involves the iteration of two steps until convergence is attained. In the expectation step (E-step) the expected value of the complete-data log-likelihood is computed and then, in the maximization step (M-step), this expected value is maximized with respect to the model parameters. Here, ‘complete-data’ refers to the missing plus observed data.
In this paper, we are concerned with the application of the EM algorithm to Gaussian mixture models with applications in clustering and classification. The likelihood surface for Gaussian mixture models is known to be unbounded and the presence of local maxima is extremely common; one may go so far as to argue that the surface is singularity riddled (Titterington et al., 1985). When using the EM algorithm to fit Gaussian mixture models, problems like convergence to a spurious local maxima tend to arise when one fitted component has much smaller variance than the others (cf. Biernacki, 2004); an illustrative example of this phenomenon is given by Ingrassia and Rocci (2007), who we will follow by referring to such fitted components as ‘degenerate’. The behaviour of the EM algorithm near a degenerate solution has been studied by Biernacki and Chrétien (2003) and Ingrassia and Rocci (2007), who tackle the problem by constraining the value of the smallest eigenvalue of the component covariance matrices. In this paper, we consider constraining the smallest eigenvalue, the largest eigenvalue, and both the smallest and largest eigenvalues. While this approach is applicable to Gaussian mixture models in general, we impose these constraints while maintaining a parsimonious covariance structure. We focus on a subset of the famous parsimonious Gaussian clustering models (GPCM) family of mixture models of Celeux and Govaert (1995), cf. Section 2.1.
This paper illustrates the benefit of constraining the range (minimum and maximum) of eigenvalues in two applications. In what we call ‘dynamic initialization’, we begin with at a random starting value and impose stringent constraints on the eigenvalues and these constraints are slowly lifted during the first iterations of the EM algorithm. Dynamic initialization maintains the monotonicity property of the EM algorithm while reducing the risk of converging to a degeneracy. We also show that in most cases, dynamic initialization increases the changes of converging to a solution with higher log-likelihood than when compared to using the ‘standard’ (with no dynamic initialization) EM algorithm with the same starting values. The other application of constraining the range of eigenvalues is a more direct one. We use a constraint on the range of eigenvalues to fit the relevant GPCM models to two well known data sets within the model-based clustering literature. For each data set, we find solutions that are an improvement over the GPCM models; however, we must choosing constraints a priori (cf. Section 5).
The remainder of the paper is laid out as follows. In Section 2.1, the GPCM family of models is introduced. Then parameter estimation while constraining the largest and/or smallest eigenvalue is discussed and our methodology for constraining both is described (Section 3). Our approach is illustrated via several experiments using real and simulated data (Section 4) and the paper concludes with discussion and suggestions for future work (Section 5).
2.1 The GPCM and MCLUST families
Suppose that we observe -dimensional data vectors and that each must be assigned to one of clusters. The Gaussian mixture model-based approach has become popular for such problems. When mixture models are used for clustering in this way, the locution ‘model-based clustering’ is used. The Gaussian model-based clustering likelihood is
where , such that , are mixing proportions and is the density of a multivariate Gaussian random variable with mean and covariance matrix . Extensive details on finite mixture models and their applications are given by Titterington et al. (1985), McLachlan and Basford (1988), McLachlan and Peel (2000), and Frühwirth-Schnatter (2006).
Unless is small relative to , model fitting issues arise with these models because of the large number of covariance parameters; there are parameters for each component covariance matrix . Celeux and Govaert (1995) introduce parsimony into these Gaussian mixture models by proposing and giving estimation algorithms for fourteen different eigen-decompositions of the covariance matrix (Table 1). These decompositions have the form , where is the matrix of eigenvectors, is a diagonal matrix with entries proportional to the eigenvalues, and is the associated constant of proportionality. The resulting 14 models are called Gaussian parsimonious clustering models (GPCMs). Fraley and Raftery (1998) implemented ten of these fourteen models, based on the algorithms given in Celeux and Govaert (1995), as the popular mclust package for the R software (R Core Team, 2013). Such has been the popularity of the MCLUST package (Fraley et al., 2012) that only ten of the fourteen GPCMs are routinely employed. Browne and McNicholas (2013b) implemented all fourteen models in the mixture package for the R software, using the algorithms given in Celeux and Govaert (1995) and Browne and McNicholas (2013a).
|Mod.||Volume||Shape||Orient.||Free covariance parameters|
In this paper, we consider a reduced set of GPCMs, that we refer to as the rGPCM family (cf. Table 2). We are, in effect, reparameterizing the GPCM covariance structure to two parameters by writing and , where and are unconstrained matrices of eigenvalues. Not including as a parameter ties together component volume and shape in terms of whether they are constrained, i.e., and , while allowing component orientation to vary separately. One may argue that the paramterization is attractive because it has a ‘natural’ eigenvalue interpretation; however, it inherently has less flexibility than the GPCM models (i.e., ) and this loss of flexibility needs to be considered in context with the benefits of our constrained eigenvalue decomposition.
|Model||Volume/Shape||Orientation||Free covariance parameters|
We develop an EM algorithm to fit the rGPCM models while imposing constraints on the eigenvalues. In addition, we investigate algorithms that slowly lift the constraints on the eigenvalues either from below, above, or both. Ingrassia (2004) shows that having lower and upper bounds on the eigenvalues does not destroy the monotonicity property of the EM algorithm. Furthermore, keeping the eigenvalues from going below a threshold prevents degeneracy of the log-likelihood (Ingrassia and Rocci, 2011). However, algorithms that only have dynamic constraints from below prolong degeneracy. We show that having dynamic constraints from above and below reduces the risk of degeneracy and yields higher log-likelihood values at convergence.
2.2 Parameter Estimation and Model Selection
Parameter estimation for each member of the GPCM family is carried out using an EM algorithm. The EM algorithm is used to obtain maximum likelihood estimates when data are incomplete or are taken to be incomplete. In mixture model-based clustering applications, the missing data are the component membership labels, which we denote by , where if observation is in component and otherwise. These missing data together with the observed data are known as the complete-data, and the E-step of the EM algorithm involves computation of the expected value of the complete-data log-likelihood. For Gaussian mixture model-based clustering, the complete-data log-likelihood is
The M-step involves maximizing the expected value of Equation 1 with respect to the model parameters. The E- and M-steps are iterated until convergence. Details on the EM algorithm parameter estimates for the GPCMs are given by Celeux and Govaert (1995) and Fraley and Raftery (1999).
One feature of these EM algorithms is the importance of starting values: mclust utilizes a Gaussian model-based agglomerative hierarchical clustering procedure to obtain starting values (cf. Murtagh and Raftery, 1984; Banfield and Raftery, 1993). The mixture package allows the user freedom in selecting starting values, with -means clustering (Hartigan and Wong, 1979) results being the default. The clusterings for a given model arise as the maximum a posteriori (MAP) expected values (i.e., probabilities) of the . To compute these MAP values, we compute the expected values
with the parameter estimates taking the converged values. Then, if max occurs at component and otherwise. Note that the expected values are computed in each E-step, which is why we emphasize that in computation of the MAP classifications depends on the parameter values at convergence.
The Bayesian information criterion (BIC; Schwarz, 1978) is used to select the number of components and the covariance structure (i.e., the model). Although it is by far the most popular model selection criterion for mixture model-based clustering, the regularity conditions for the asymptotic approximation used in the development of the BIC are not generally satisfied by mixture models (cf. Keribin, 1998, 2000). There is, however, plenty of practical evidence to support its use in mixture model selection (e.g., Dasgupta and Raftery, 1998; Fraley and Raftery, 2002) and we use it for the analyses herein. The BIC is given by , where is the number of free parameters, is the maximized log-likelihood, and is the maximum likelihood estimate of . Dasgupta and Raftery (1998) proposed using the BIC for mixture model selection, where the model with the lowest BIC is selected.
3.1 Constrained Covariance Updates
Our constrained EM algorithm is an alternating conditional maximization algorithm (Meng and van Dyk, 1997), where the matrix of eigenvalues for each group is maximized conditional on and then vice versa. Note, that or can be equal or varying across groups, depending on the model (cf. Table 2). These conditional updates can be repeated times or until a convergence criteria is achieved. Ingrassia and Rocci (2007, 2011) give an algorithm for our VV model when the smallest eigenvalue is constrained and show that these constraints maintain the monotonicity property. Herein, we introduce and illustrate parameter estimation with constraints on both the smallest and largest eigenvalues.
Let be the range of allowable eigenvalues and let be the sample covariance matrix for group , i.e., . Consider unconstrained . From Celeux and Govaert (1995), we have
where and the superscripts in parentheses denote iteration number. If we let be the diagonal elements of , then the constrained EM uses the updates
Now, suppose we set . Then,
where . If we let be the diagonal elements of , then the constrained EM algorithm sets
Consider unconstrained and let be the eigen-decomposition of . Then we set
3.2 Dynamic Initialization
We run the EM algorithm for each member of the rGPCM family as described in Section 3.1, but for the first iterations we use a sequence of constraints . We could use such a set ; however, we instead simplify and use a sequence , where is some sequence from to . We also use the mapping
where . These equations are set up so that and implies that and . In this paper, we have set because we scale the data in our clustering applications (Section 4).
4 Data Experiments
4.1 Performance Assessment
Although our examples are all genuine clustering problems, i.e., no knowledge of labels are used, the labels are known in each case; therefore, we can asses the performance of our algorithms for the rGPCM family. We use classification tables and adjusted Rand indices (ARI; Hubert and Arabie, 1985) to summarize classification accuracy. The ARI corrects the Rand index (Rand, 1971) for chance agreement. An ARI value of 1 indicates perfect class agreement and a value of 0 would be expected under random classification.
4.2 Simulation Study 1
The first data set consists of observations generated from a four-dimensional two-component ( and ) mixture of multivariate -distributions with scale matrix and 5 degrees of freedom. As we would expect, the resulting clusters are clearly heavy tailed with several outlying points (Figure 1). Using the output from -means clustering as the initialization for , we run our algorithm for . The eigenvalues are constrained to be within the smallest and largest eigenvalues of the sample covariance matrix of the data.
The mixture package is used to fit the GPCM models to facilitate comparison with the rGPCM family. The chosen rGPCM model is a two-component EV model that gives perfect classification (), whereas the selected GPCM model is a four-component VEV () model with (Table 3). In the absence of a constraint on eigenvalues, the chosen GPCM model has additional components with relatively high variance to accommodate the heavier tails (Figure 2). From Table 3, it is clear that with appropriate merging of components, the classification performance of the best GPCM model is very close to that of the best rGPCM model. BIC values for all rGPCM models are given in Appendix A.
4.3 Simulation Study 2
The second data set consists of observations generated from a three-dimensional two-component Gaussian mixture model with , , and . As shown in Figure a, the components are very well separated in these simulated data.
Again, we run our algorithms for the rGPCM models for with the eigenvalues constrained to lie within the smallest and largest eigenvalues of the sample covariance matrix. For the same data, we also run the GPCM models for . Both algorithms selected a component EE model and give perfect classifications. BIC values for the rGPCM models are given in Appendix A.
We then added 5% uniform noise to the data set ( noise observations, see Figure b) and repeated the above analyses for the rGPCM and the GPCM families, respectively. The selected rGPCM model is a two-component EE model that absorbed the noisey observations into the two Gaussian components (Table 4). On the other hand, the chosen GPCM model is a eight-component EEE model (), where four of the components contain just one or two of the noisy points and the other two components contain the true Gaussian components along with one and three noisy points, respectively (Table 4). Again, BIC values for the rGPCM models are given in Appendix A.
We have illustrated the effect of only a very small proportion of outliers on the EM algorithms used for parameter estimation for the GPCM family versus the application of our algorithms for the rGPCM family. Please note that we are not proffering the rGPCM solution as being ideal in this context; rather, we are suggesting that our parameter estimation approach led to rGPCM results that are preferable to those obtained using the more traditional parameter estimation approach for the GPCM family. Effective methods for clustering noisy data include trimmed clustering (e.g., García-Escudero et al., 2008, 2010) and mixtures of contaminated distributions (Punzo and McNicholas, 2013).
4.4 Two Well-Known Data Sets
Forina et al. (1986) recorded 28 chemical and physical properties of three types of wine (Barolo, Grignolino, Barbera) from the Piedmont region of Italy. A subset of 13 of these variables is available in the gclus package (Hurley, 2004) for R. The leptograpsus crabs data set can be found in the MASS package (Venables and Ripley, 1999) in R. These data contain five physical measurements on two different colours of crab, further separated into gender. MCLUST is known to do poorly on these data; Raftery and Dean (2006) used these data to illustrate the superiority of their variable selection technique over MCLUST.
4.5 Illustrating Convergence From Random Starting Values
For each data set, we generate 50 random starting points. We run the four types of EM algorithm until convergence for components. Specifically, we run the EM algorithm in four circumstances: no constraints, lower constraints, upper constraints, and both upper and lower (range) constraints on the eigenvalues. For each dynamic initialization, we use an equidistant sequence of length 25 from to . For each run, we noted which algorithms achieved the highest converged log-likelihood value for a particular starting value. This is because all four algorithms could, and sometimes did, converge to the same solution.
The results are given Tables 11 to 12 in Appendix B. By inspection of these tables, the value of imposing eigenvalue constraints is clear. Specifically, the model most often converges to the ‘best’ value of the log-likelihood is very rarely from an unconstrained EM algorithm. Furthermore, the unconstrained EM algorithm yields far more degenerate solutions that its constrained counterparts.
4.6 Constrained Eigenvalues: A Comparison With The GPCM Family
For each data set, we compare results for the rGPCM models using the constrained eigenvalue approach to parameter estimation to results for the GPCM models with the traditional EM algorithm approach to parameter estimation. When estimating parameters for the rGPCM family, we constrain eigenvalues to be within the smallest and largest eigenvalues of the sample covariance matrix of each data set, i.e., for the wine data and for the crabs data. BIC values for the rGPCM models for all data sets are given in Appendix A.
For the wine data, the best rGPCM model is a component VE model with an ARI of 0.96 (Table 5). The best GPCM model is a component model with an ARI of 0.90.
For the crabs data, the best rGPCM model is a component EV model with an ARI of 0.80 (Table 6). For the crabs data, the best GPCM model is a component model with an ARI of 0.50.
|Blue & Male||38||12|
|Blue & Female||50|
|Orange & Male||50|
|Orange & Female||5||45|
In this paper, we introduced a constrained eigenvalue parameter estimation procedure for the eight of the parsimonious Gaussian clustering models of Celeux and Govaert (1995). For convenience, we have referred to this subset of models as the rGPCM family. Please note that when we discuss the performance of the rGPCM family herein, we are referring to the performance of those models with our constrained eigenvalue parameter estimation procedure. We are not suggesting that the rGPCM models are in any sense better than the other GPCM models when the same parameter estimation methods are used.
We illustrated our approach through extensive simulation studies and two real data applications. In one application, we studied dynamic initialization, where we begin with random starting values and impose stringent constraints on the eigenvalues which are slowly lifted during the first iterations of the EM algorithm. This approach is shown to maintain the monotonicity of the EM algorithm while reducing the risk of converging to a degeneracy. In another application, we constrained the range of eigenvalues and fit the rGPCM models to two well known data sets. In most cases, we find solutions that are an improvement over the famous GPCM models; however, we require constraints to be chosen a priori. Constraining the eigenvalues in this way can be viewed as a form of regularization or as placing a uniform prior on the eigenvalues. Future work will involve studying different approaches to estimating the range of allowable eigenvalues.
The fact that the rGPCM models outperformed the GPCM models on both simulation studies and real data sets shows that the eigenvalue constraints we use can lead to improved classification performance. Furthermore, if one follows the approach of only running the rGPCM models for which our eigenvalue constraints can be used, this is tantamount to merging the volume ( or ) and shape ( or ) parameters from the famous eigen-decomposition used in the GPCM family. Therefore, the importance of having separate volume and shape parameters deserves further consideration. Furthermore, even if it can be useful in some scenarios, the value of including the component volume as a separate parameter has to be judged in context with the fact that including it prevents application of the constrained eigenvalue approach to parameter estimation.
Appendix A BIC Tables
Appendix B Convergence Tables
- Banfield, J. D. and A. E. Raftery (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics 49(3), 803–821.
- Baum, L. E., T. Petrie, G. Soules, and N. Weiss (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics 41, 164–171.
- Biernacki, C. (2004). An asymptotic upper bound of the likelihood to prevent Gaussian mixtures from degenerating. Technical report, Université de Franche-Comté, Besançon, France.
- Biernacki, C. and S. Chrétien (2003). Degeneracy in the maximum likelihood estimation of univariate Gaussian mixtures with EM. Statistics & Probability Letters 61(4), 373–382.
- Browne, R. P. and P. D. McNicholas (2013a). Estimating common principal components in high dimensions. Advances in Data Analysis and Classification. To appear.
- Browne, R. P. and P. D. McNicholas (2013b). mixture: Mixture Models for Clustering and Classification. R package version 1.0.
- Celeux, G. and G. Govaert (1995). Gaussian parsimonious clustering models. Pattern Recognition 28(5), 781–793.
- Dasgupta, A. and A. E. Raftery (1998). Detecting features in spatial point processes with clutter via model-based clustering. Journal of the American Statistical Association 93, 294–302.
- Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B 39(1), 1–38.
- Flury, B. N. (1984). Common principal components in k groups. Journal of the American Statistical Association 79(388), 892–897.
- Forina, M., C. Armanino, M. Castino, and M. Ubigli (1986). Multivariate data analysis as a discriminating method of the origin of wines. Vitis 25, 189–201.
- Fraley, C. and A. E. Raftery (1998). How many clusters? Which clustering methods? Answers via model-based cluster analysis. The Computer Journal 41(8), 578–588.
- Fraley, C. and A. E. Raftery (1999). MCLUST: Software for model-based cluster analysis. Journal of Classification 16, 297–306.
- Fraley, C. and A. E. Raftery (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association 97(458), 611–631.
- Fraley, C., A. E. Raftery, T. B. Murphy, and L. Scrucca (2012). MCLUST version 4 for R: Normal mixture modeling for model-based clustering, classification, and density estimation. Department of Statistics, University of Washington.
- Frühwirth-Schnatter, S. (2006). Finite Mixture and Markov Switching Models. New York: Springer-Verlag.
- García-Escudero, L. A., A. Gordaliza, C. Matrán, and A. Mayo-Iscar (2008). A general trimming approach to robust cluster analysis. The Annals of Statistics 36(3), 1324–1345.
- García-Escudero, L. A., A. Gordaliza, C. Matrán, and A. Mayo-Iscar (2010). A review of robust clustering methods. Advances in Data Analysis and Classification 4(2), 89–109.
- Hartigan, J. A. and M. A. Wong (1979). A k-means clustering algorithm. Applied Statistics 28(1), 100–108.
- Hubert, L. and P. Arabie (1985). Comparing partitions. Journal of Classification 2(1), 193–218.
- Hurley, C. (2004). Clustering visualizations of multivariate data. Journal of Computational and Graphical Statistics 13(4), 788–806.
- Ingrassia, S. (2004). A likelihood-based constrained algorithm for multivariate normal mixture models. Statistical Methods and Applications 13, 151–166.
- Ingrassia, S. and R. Rocci (2007). Constrained monotone EM algorithms for finite mixture of multivariate Gaussians. Computational Statistics and Data Analysis 51, 5339–5351.
- Ingrassia, S. and R. Rocci (2011). Degeneracy of the EM algorithm for the MLE of multivariate Gaussian mixtures and dynamic constraints. Computational Statistics and Data Analysis 55, 1715–1725.
- Keribin, C. (1998). Estimation consistante de l’ordre de modèles de mélange. Comptes Rendus de l’Académie des Sciences. Série I. Mathématique 326(2), 243–248.
- Keribin, C. (2000). Consistent estimation of the order of mixture models. Sankhyā. The Indian Journal of Statistics. Series A 62(1), 49–66.
- McLachlan, G. J. and K. E. Basford (1988). Mixture Models: Inference and Applications to Clustering. New York: Marcel Dekker Inc.
- McLachlan, G. J. and D. Peel (2000). Finite Mixture Models. New York: John Wiley & Sons.
- Meng, X.-L. and D. van Dyk (1997). The EM algorithm — an old folk song sung to a fast new tune (with discussion). Journal of the Royal Statistical Society: Series B 59(3), 511–567.
- Murtagh, F. and A. E. Raftery (1984). Fitting straight lines to point patterns. Pattern Recognition 17(5), 479–483.
- Orchard, T. and M. A. Woodbury (1972). A missing information principle: Theory and applications. In L. M. Le Cam, J. Neyman, and E. L. Scott (Eds.), Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Theory of Statistics, pp. 697–715. Berkeley: University of California Press.
- Punzo, A. and P. D. McNicholas (2013). Outlier detection via parsimonious mixtures of contaminated Gaussian distributions. arXiv:1305.4669.
- R Core Team (2013). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.
- Raftery, A. E. and N. Dean (2006). Variable selection for model-based clustering. Journal of the American Statistical Association 101(473), 168–178.
- Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66(336), 846–850.
- Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics 6(2), 461–464.
- Sundberg, R. (1974). Maximum likelihood theory for incomplete data from an exponential family. Scandinavian Journal of Statistics 1(2), 49–58.
- Titterington, D. M., A. F. M. Smith, and U. E. Makov (1985). Statistical Analysis of Finite Mixture Distributions. Chichester: John Wiley & Sons.
- Venables, W. N. and B. D. Ripley (1999). Modern Applied Statistics with S-PLUS. Springer.