Capturing Patterns via Parsimonious t Mixture Models

Tsung-I Lin,  Paul D. McNicholas  and  Hsiu J. Ho
Institute of Statistics, National Chung Hsing University, Taichung 402, Taiwan,
Department of Public Health, China Medical University, Taichung 404, Taiwan
University Research Chair in Computational Statistics at the University of Guelph, Ontario, Canada
Corresponding author. E-mail: ; tel.: +886 4 22850420; fax: +886 4 22873028.

This paper exploits a simplified version of the mixture of multivariate t-factor analyzers (MtFA) for robust mixture modelling and clustering of high-dimensional data that frequently contain a number of outliers. Two classes of eight parsimonious t mixture models are introduced, and computation of maximum likelihood estimates of the parameters is achieved using the alternating expectation conditional maximization (AECM) algorithm. The usefulness of the methodology is illustrated through applications to image compression and compact facial representation.

Keywords:  Factor analysis; Facial representation; Image compression; PGMM; PTMM

1 Introduction

The mixture of factor analyzers (MFA) model, first introduced by Ghahramani and Hinton (1997) and further generalized by McLachlan and Peel (2000a), has provided a flexible dimensionality reduction approach to the statistical modelling of high-dimensional data arising from a wide variety of random phenomena. By combining the factor analysis model with finite mixture models, the MFA model allows simultaneous partitioning of the population into several subclasses while performing a local dimensionality reduction for each mixture component. In recent years, the study of MFA has received considerable interest; see McLachlan and Peel (2000b) and Fokoué and Titterington (2003) for some excellent reviews. McLachlan et al. (2002) and McLachlan et al. (2003) exploited the MFA approach to handle high-dimensional data such as clustering of microarray expression profiles. McNicholas and Murphy (2008) generalized the MFA model by introducing a family of parsimonious Gaussian mixture models (PGMMs).

In the MFA framework, the component errors and factors are routinely assumed to follow Gaussian distributions due to their mathematical tractability and computational convenience. In practice, noise components or badly discrepant outliers often exist; the multivariate t (MVT) distribution contains an additional tuning parameter, the degrees of freedom (df), which can be useful for outlier accommodation. Specifically, the density of a g-component multivariate t mixture model is of the form

f(y_j; Θ) = ∑_{i=1}^g w_i t_p(y_j; μ_i, Σ_i, ν_i),  (1)
where the w_i are mixing proportions and t_p(·; μ_i, Σ_i, ν_i) is the density of the p-variate t distribution with df ν_i, mean vector μ_i, and scaling covariance matrix Σ_i. Note that the general t mixture (TMIX) model (1) has a total of (g − 1) + gp + gp(p + 1)/2 + g free parameters, of which gp(p + 1)/2 and g parameters correspond to the component matrices Σ_i and df ν_i, respectively. In a more parsimonious version of (1), wherein the Σ_i and ν_i are restricted to be identical across groups, the number of free parameters reduces to (g − 1) + gp + p(p + 1)/2 + 1.
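As a quick sanity check on these counts, the following sketch (function and argument names are ours) tabulates the free parameters of the unrestricted and fully constrained TMIX models:

```python
def tmix_free_params(g, p, constrained=False):
    """Count free parameters of a g-component, p-variate t mixture.

    Unrestricted model: (g-1) mixing proportions, g*p means,
    g*p*(p+1)/2 scale-matrix entries, and g degrees of freedom.
    When Sigma_i and nu_i are common across groups, the scale and
    df terms collapse to p*(p+1)/2 + 1.
    """
    mixing = g - 1
    means = g * p
    if constrained:
        scales, dfs = p * (p + 1) // 2, 1
    else:
        scales, dfs = g * p * (p + 1) // 2, g
    return mixing + means + scales + dfs

# Example: 3 components in 4 dimensions.
print(tmix_free_params(3, 4))        # unrestricted
print(tmix_free_params(3, 4, True))  # common Sigma and nu
```

For g = 3 and p = 4, the unrestricted count is 47 while the constrained version needs only 25 parameters, which illustrates how quickly the scale matrices dominate the budget as p grows.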

The TMIX model was first considered by McLachlan and Peel (1998) and Peel and McLachlan (2000), who presented expectation-maximization (EM) algorithms for parameter estimation and showed the robustness of the model for clustering. Further developments along these directions followed, including work by Shoham (2002), Lin et al. (2004, 2009), Wang et al. (2004), Greselin and Ingrassia (2010), and Andrews et al. (2011). An extension to mixtures of t-factor analyzers (MtFA), adopting the family of MVT distributions for the component factors and errors, was first considered by McLachlan et al. (2007) and more recently by Andrews and McNicholas (2011a, b).

In this paper, our objective is to illustrate the efficacy of a class of TMIX models with several parsimonious covariance structures, called parsimonious t mixture models (PTMMs). The PTMMs are based on assuming a constrained t-factor structure on each mixture component for parsimoniously modelling population heterogeneity in the presence of fat tails; the PTMMs are t-analogues of the PGMMs. We develop additional tools for a simplified version of the MODt family of Andrews and McNicholas (2011b) and apply the proposed tools to image reconstruction tasks.

The EM algorithm (Dempster et al., 1977) and its extensions, such as the expectation conditional maximization (ECM) algorithm (Meng and Rubin, 1993) and the expectation conditional maximization either (ECME) algorithm (Liu and Rubin, 1994, 1995), have proven to be useful tools for conducting maximum likelihood (ML) estimation in a variety of mixture modelling scenarios. To improve computational efficiency in fitting the PTMMs, we adopt a three-cycle AECM algorithm (Meng and van Dyk, 1997) that allows the specification of different complete data at each cycle.

The rest of the paper is organized as follows. In Section 2, we briefly describe the single-component t-factor analysis model and study some related properties. In Section 3, we present the formulation of the PTMMs and discuss methods for fitting these models. In Section 4, we demonstrate how the PTMMs can be applied to image compression and compact facial representation tasks. Some concluding remarks are given in Section 5.

2 The t-factor analysis model

We briefly review the t-factor analysis (tFA) model, which can be thought of as a single-component MtFA model. Let Y_1, …, Y_n be a set of p-dimensional random vectors. In the tFA setting, each Y_j is modelled as

Y_j = μ + B U_j + ε_j,  j = 1, …, n,  with (U_j', ε_j')' ~ t_{q+p}(0, diag(I_q, D), ν),  (2)
where μ is a p-dimensional mean vector, B is a p × q matrix of factor loadings, U_j is a q-dimensional (q < p) vector of latent variables called common factors, ε_j is a p-dimensional vector of errors called specific factors, I_q is a q × q identity matrix, D is a diagonal matrix with positive entries, and ν is the df.

Note that U_j and ε_j are uncorrelated and marginally t-distributed, but not independent. Using the scale-mixture characterization of the t distribution, model (2) can be hierarchically represented as

Y_j | (u_j, τ_j) ~ N_p(μ + B u_j, D/τ_j),  U_j | τ_j ~ N_q(0, I_q/τ_j),  τ_j ~ Gamma(ν/2, ν/2),

where Gamma(α, β) stands for the gamma distribution with mean α/β. It follows that the marginal distribution of Y_j, after integrating out U_j and τ_j, can be expressed as Y_j ~ t_p(μ, BB' + D, ν).
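The normal-gamma hierarchy gives an easy way to simulate from the tFA model. A minimal numpy sketch (all names ours; dimensions and seed are illustrative) that draws from the marginal t_p(μ, BB' + D, ν) distribution via the hierarchy:

```python
import numpy as np

rng = np.random.default_rng(0)

def rmvt_fa(n, mu, B, D_diag, nu):
    """Draw n samples from t_p(mu, B B' + D, nu) via the hierarchy:
    tau ~ Gamma(nu/2, nu/2) (mean 1),
    U | tau ~ N(0, I_q / tau),  eps | tau ~ N(0, D / tau),
    Y = mu + B U + eps."""
    p, q = B.shape
    tau = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=n)
    U = rng.standard_normal((n, q)) / np.sqrt(tau)[:, None]
    eps = rng.standard_normal((n, p)) * np.sqrt(D_diag) / np.sqrt(tau)[:, None]
    return mu + U @ B.T + eps

p, q, nu = 5, 2, 8.0
B = rng.standard_normal((p, q))
Y = rmvt_fa(20000, np.zeros(p), B, np.full(p, 0.5), nu)

# For nu > 2 the covariance of Y is nu/(nu-2) * (B B' + D),
# so the sample covariance should approach this target.
target = nu / (nu - 2) * (B @ B.T + np.diag(np.full(p, 0.5)))
```

The division of both U and ε by the same √τ is what makes them marginally t but dependent, exactly as noted in the text.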

When handling high-dimensional data with p large relative to q, say p ≫ q, the inverse of the p × p matrix BB' + D plays an important role in computational complexity. In such a case, we use the following matrix inversion formula (Woodbury, 1950):

(BB' + D)^{-1} = D^{-1} − D^{-1}B(I_q + B'D^{-1}B)^{-1}B'D^{-1},  (3)

which can be computed quickly because it involves only the low-dimensional q × q inverse (I_q + B'D^{-1}B)^{-1} plus the inversion of a diagonal matrix. Moreover, the determinant of BB' + D can be calculated as

|BB' + D| = |D| |I_q + B'D^{-1}B|.  (4)
The formulae (3) and (4) have been used many times before, including work by McLachlan and Peel (2000a) and McNicholas and Murphy (2008).
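Both identities are easy to verify numerically; the sketch below (names ours) computes the inverse and log-determinant of BB' + D using only a q × q inverse and diagonal operations:

```python
import numpy as np

rng = np.random.default_rng(1)
p, q = 100, 3
B = rng.standard_normal((p, q))
d = rng.uniform(0.5, 2.0, size=p)          # diagonal entries of D
Sigma = B @ B.T + np.diag(d)

# Woodbury: (BB' + D)^{-1} = D^{-1} - D^{-1} B (I_q + B' D^{-1} B)^{-1} B' D^{-1}
Dinv = np.diag(1.0 / d)
core = np.linalg.inv(np.eye(q) + B.T @ Dinv @ B)   # only a q x q inverse
Sigma_inv = Dinv - Dinv @ B @ core @ B.T @ Dinv

# Determinant identity: |BB' + D| = |D| * |I_q + B' D^{-1} B|
sign, logdet = np.linalg.slogdet(Sigma)
logdet_fast = np.log(d).sum() + np.linalg.slogdet(np.eye(q) + B.T @ Dinv @ B)[1]
```

With p = 100 and q = 3, the direct inverse costs O(p^3) while the Woodbury route costs O(pq^2), which is the saving exploited in the AECM iterations.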

3 Parsimonious multivariate t mixture models

Consider n independent p-dimensional random vectors Y_1, …, Y_n that come from a heterogeneous population with g non-overlapping classes. Suppose that the density function of each feature vector y_j can be modelled by a g-component MtFA:

f(y_j; Θ) = ∑_{i=1}^g w_i t_p(y_j; μ_i, B_iB_i' + D_i, ν_i),  (5)

where the w_i are mixing proportions that are constrained to be positive and sum to one, B_i is a p × q matrix of factor loadings, and D_i is a diagonal matrix. We use Θ = {w_1, …, w_{g−1}, θ_1, …, θ_g} to represent all unknown parameters, with θ_i = {μ_i, B_i, D_i, ν_i} containing the parameters for the density of component i.

As in McLachlan et al. (2007), the MtFA model can be alternatively formulated by exploiting its link with the tFA model. Specifically, we assume that

Y_j = μ_i + B_i U_ij + ε_ij

with probability w_i, for i = 1, …, g and j = 1, …, n, where U_ij and ε_ij are, respectively, the common and specific factors corresponding to component i. We assume that U_ij and ε_ij have a joint MVT distribution to be consistent with model (5) for the marginal distribution of Y_j. Under a hierarchical mixture modelling framework, each Y_j is conceptualized to have originated from one of the g classes. It is convenient to construct unobserved allocation variables Z_j = (Z_1j, …, Z_gj)', for j = 1, …, n, whose entries are binary, with Z_ij = 1 indicating that Y_j belongs to class i, and which are constrained to satisfy ∑_{i=1}^g Z_ij = 1 for each j. More specifically, Z_j is distributed as a multinomial random vector with one trial and cell probabilities w_1, …, w_g, denoted by Z_j ~ M(1; w_1, …, w_g). After a little algebra (cf. McLachlan et al., 2007), we have

U_ij | (y_j, Z_ij = 1) ~ t_q(γ_i'(y_j − μ_i), ((ν_i + δ_ij)/(ν_i + p)) (I_q − γ_i'B_i), ν_i + p),  (6)

where Σ_i = B_iB_i' + D_i, γ_i = Σ_i^{-1}B_i, and δ_ij = δ(y_j; μ_i, Σ_i) = (y_j − μ_i)'Σ_i^{-1}(y_j − μ_i) denotes the Mahalanobis-squared distance between y_j and μ_i.
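Given the allocation-variable formulation, the posterior membership probabilities follow from Bayes' rule. A small sketch using scipy's multivariate t density (the two-component parameter values are hypothetical, chosen only for illustration):

```python
import numpy as np
from scipy.stats import multivariate_t

rng = np.random.default_rng(2)

# Hypothetical 2-component t mixture in 3 dimensions.
w = np.array([0.4, 0.6])
mus = [np.zeros(3), np.array([3.0, 3.0, 0.0])]
Sigmas = [np.eye(3), 1.5 * np.eye(3)]
nus = [5.0, 10.0]

y = rng.standard_normal((8, 3))

# Posterior probabilities Pr(Z_ij = 1 | y_j), computed on the log
# scale for numerical stability, then normalized per observation.
logp = np.stack([
    np.log(w[i]) + multivariate_t.logpdf(y, loc=mus[i], shape=Sigmas[i], df=nus[i])
    for i in range(2)
], axis=1)
logp -= logp.max(axis=1, keepdims=True)
resp = np.exp(logp)
resp /= resp.sum(axis=1, keepdims=True)
labels = resp.argmax(axis=1)   # MAP component memberships
```

These normalized quantities are exactly the ẑ_ij used later in the E-steps of the AECM algorithm.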

Following McNicholas and Murphy (2008), we extend the MtFA model by allowing constraints across groups on the matrices B_i and D_i. Specifically, we allow the following constraints on the scaling covariance matrices: B_i = B; D_i = D; or D_i = d_i I_p, for i = 1, …, g. In addition, we consider the constraint ν_i = ν: when we impose this constraint to reduce the number of parameters in the PTMMs, our family of models (Table 1) is called 'PTMM1', and when the ν_i are allowed to vary across components, we call our family 'PTMM2'. In this latter case (PTMM2), one simply adds g − 1 parameters to the last column of Table 1. The number of free covariance and df parameters in the PTMM family (the union of PTMM1 and PTMM2) can be reduced to as few as [pq − q(q − 1)/2] + 2 or inflated to as many as g[pq − q(q − 1)/2] + gp + g.

Structure  Loading        Error          Isotropic      Number of parameters
CCC        Constrained    Constrained    Constrained    [pq − q(q − 1)/2] + 2
CCU        Constrained    Constrained    Unconstrained  [pq − q(q − 1)/2] + p + 1
CUC        Constrained    Unconstrained  Constrained    [pq − q(q − 1)/2] + g + 1
CUU        Constrained    Unconstrained  Unconstrained  [pq − q(q − 1)/2] + gp + 1
UCC        Unconstrained  Constrained    Constrained    g[pq − q(q − 1)/2] + 2
UCU        Unconstrained  Constrained    Unconstrained  g[pq − q(q − 1)/2] + p + 1
UUC        Unconstrained  Unconstrained  Constrained    g[pq − q(q − 1)/2] + g + 1
UUU        Unconstrained  Unconstrained  Unconstrained  g[pq − q(q − 1)/2] + gp + 1
Table 1: The eight covariance structures, used by McNicholas and Murphy (2008), that form the basis of the PTMM family considered herein. "Constrained" in the columns "Loading", "Error", and "Isotropic" represents B_i = B, D_i = D, and D_i = d_i I_p for i = 1, …, g, respectively. Parameter counts are given under PTMM1 (common df ν); under PTMM2, add g − 1 to each count.
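The structure of the counts can be captured in a few lines. The sketch below (names ours) follows the covariance counts of McNicholas and Murphy (2008), with one df parameter added under the common-df family (PTMM1) and g under the unrestricted family (PTMM2):

```python
def ptmm_param_count(structure, g, p, q, common_df=True):
    """Covariance + df parameter count for one of the eight structures.

    structure: three letters, e.g. "CUU".
      Loading 'C'  -> one common p x q loading matrix, else one per group.
      Error   'C'  -> D_i = D common across groups.
      Isotropic 'C'-> D_i proportional to the identity (one scalar each).
    Covariance counts follow McNicholas and Murphy (2008);
    common_df=True adds 1 df parameter (PTMM1), otherwise g (PTMM2).
    """
    load_c, err_c, iso_c = (c == "C" for c in structure)
    loading = p * q - q * (q - 1) // 2   # free loading parameters
    if not load_c:
        loading *= g
    if iso_c:
        error = 1 if err_c else g        # one scalar, common or per group
    else:
        error = p if err_c else g * p    # diagonal, common or per group
    dfs = 1 if common_df else g
    return loading + error + dfs

print(ptmm_param_count("CCC", g=3, p=10, q=2))
print(ptmm_param_count("UUU", g=3, p=10, q=2, common_df=False))
```

For g = 3, p = 10, q = 2 this ranges from 21 parameters (CCC, PTMM1) to 90 (UUU, PTMM2), compared with 3 × 55 = 165 unconstrained scale parameters for a full TMIX model.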

The implementation of an efficient three-cycle AECM algorithm for estimating the PTMMs, together with some practical issues including the specification of starting values, the stopping rule, and the model selection criterion, is described in the Supplementary Material.

4 Applications

4.1 Image compression

A colour image is a digital image that includes 24-bit RGB colour information for each pixel: each pixel is composed of the three primary colours, red (R), green (G), and blue (B), with 8 bits per colour. Image compression plays a central role in transmitting and storing images at high quality. There are several unsupervised image compression approaches based on probabilistic models. The MFA model is well recognized as a dominant dimensionality reduction technique and is useful in block image transform coding (Ueda et al., 2000). In this subsection, we apply the PTMMs and the PGMM individually to colour image compression and compare the quality of the reconstructed images. The RGB colour image 'Lena' (encoded in 24 bits per pixel) is subdivided into non-overlapping RGB blocks of 4 × 4 pixels, and each block is taken as a 48-dimensional data vector y_j. Let Y = {y_1, …, y_n} be the set of collected vectors. Making a slight modification of Ueda et al. (2000), our compression procedure for the experimental image is summarized below.


Set the desired number of components g and the dimensionality of the factors q.


Estimate Θ by fitting a parsimonious t mixture model to Y.


Perform a model-based clustering according to the component membership of each data point y_j, which is decided by maximizing the posterior probability Pr(Z_ij = 1 | y_j).


For each y_j classified to component i, calculate the reconstructed vector

ŷ_j = μ̂_i + B̂_i û_ij,  (7)

where û_ij is the estimated posterior mean of U_ij obtained from (6).


Reconstruct ŷ_j, j = 1, …, n, into a compressed colour image.
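The block decomposition used in steps 1 and 5 can be sketched as follows, assuming 4 × 4 RGB blocks so that each vector has 4 · 4 · 3 = 48 entries (function names are ours):

```python
import numpy as np

def image_to_blocks(img, b=4):
    """Split an H x W x 3 image into non-overlapping b x b RGB blocks,
    each flattened to a vector of length 3*b*b (48 when b = 4)."""
    H, W, C = img.shape
    assert H % b == 0 and W % b == 0
    return (img.reshape(H // b, b, W // b, b, C)
               .transpose(0, 2, 1, 3, 4)      # group the two block axes
               .reshape(-1, b * b * C))

def blocks_to_image(blocks, H, W, b=4):
    """Inverse of image_to_blocks: reassemble block vectors into an image."""
    C = blocks.shape[1] // (b * b)
    return (blocks.reshape(H // b, W // b, b, b, C)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(H, W, C))

rng = np.random.default_rng(3)
img = rng.integers(0, 256, size=(16, 24, 3))
Y = image_to_blocks(img)                 # (16/4)*(24/4) = 24 vectors of length 48
round_trip = blocks_to_image(Y, 16, 24)  # exact inverse when no compression is applied
```

In the actual procedure, the rows of Y would be replaced by their model-based reconstructions ŷ_j before reassembly; with no compression the round trip is exact.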

In what follows, we assume that the factor loadings are common across the components of the mixture model, which leads to the model with common factor loadings (the CUU structure). The Lena image data are fitted by the CUU structure under the PGMM and PTMM approaches. To evaluate the quality of the reconstructed image, we compute the root mean squared error (RMSE) and the peak signal-to-noise ratio (PSNR), which are widely used metrics in the image coding field. For an original colour image of size M × N and a reconstructed image, the RMSE of the red channel is defined as

RMSE_R = sqrt( (1/(MN)) ∑_{i=1}^M ∑_{j=1}^N (R_ij − R̂_ij)² ),

and the other two quantities, RMSE_G and RMSE_B, are defined in the same fashion. The overall measure RMSE = sqrt((RMSE_R² + RMSE_G² + RMSE_B²)/3) represents the average per-pixel distance from the original image to the output one. For a 24-bit colour image, the PSNR is given by

PSNR = 20 log_10(255 / RMSE).

Note that a higher PSNR value indicates a compressed (i.e., reconstructed) image of better quality.
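The two metrics take only a few lines to compute; a minimal sketch (names ours, PSNR in the standard 20 log10(255/RMSE) form for 8-bit-per-channel images):

```python
import numpy as np

def channel_rmse(x, x_hat):
    """RMSE over one colour channel (or any pair of equal-shape arrays)."""
    return float(np.sqrt(np.mean((x.astype(float) - x_hat.astype(float)) ** 2)))

def psnr_24bit(rmse):
    """PSNR in dB for 8-bit-per-channel images: 20 log10(255 / RMSE)."""
    return 20.0 * np.log10(255.0 / rmse)

# Example: an RMSE of 25.5 corresponds to 255/25.5 = 10, i.e. exactly 20 dB.
print(psnr_24bit(25.5))
```

Since PSNR is a decreasing function of RMSE, ranking models by higher PSNR is equivalent to ranking them by lower RMSE.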

Experimental results are provided in Table 2, including the log-likelihoods, the numbers of parameters, and the BIC values, together with the RMSE and PSNR for the compressed images. The performances of PTMM1 and PTMM2 are almost the same in terms of RMSE and PSNR. Notably, the BIC and PSNR values obtained from the PTMM approaches are much higher than those from the PGMM in all cases.

Model   Log-likelihood   No. of parameters   BIC      RMSE   PSNR
PGMM    2104.4           189                 4206.8   7.62   30.5
PTMM1   2205.9           190                 4409.7   5.22   33.8
PTMM2   2204.6           193                 4407.1   5.26   33.7
PGMM    2235.1           363                 4466.3   7.03   31.2
PTMM1   2348.8           364                 4693.7   2.78   39.2
PTMM2   2351.2           371                 4698.5   2.43   40.4

Models with larger BIC scores are preferred.

Table 2: Comparisons of fitting adequacy and quality of image reconstruction using the three CUU models under two settings of (g, q).

Figure 1 shows the original Lena image and three compressed images obtained by fitting the CUU models. The quality of the images reconstructed using the PTMMs is much better (i.e., smoother) than that of the image reconstructed using the PGMM (for instance, observe the edge of Lena's shoulder).

Figure 1: An example of Lena colour image reconstruction using three CUU models with and .

4.2 Compact facial representation

Over the past few decades, numerous template matching approaches, along with their modifications and variants, have been proposed for face recognition; see Zhao et al. (2003) for an extensive survey. Motivated by a technique developed by Sirovitch and Kirby (1987), the eigenface method (Turk and Pentland, 1991) has become the most popular tool for facial feature extraction in computer vision research. Its implementation is primarily based on the use of principal component analysis (PCA), also known as the Karhunen-Loève transform. The central idea of PCA is to find a linear subspace whose orthonormal basis maximizes the scatter of all the projected samples.

Although the eigenface method is remarkably simple and quite straightforward, its performance depends heavily on pose, lighting, and background conditions. Because PCA is a one-sample method, a single set of eigenfaces may not be enough to represent face images having a large amount of variation. Some efforts have been made to generalize it to the mixture-of-eigenfaces method (Kim et al., 2002), which generates more than one set of eigenfaces to better represent the whole face. However, such a method produces an over-parameterized solution in many circumstances. To capture the heterogeneity among the underlying face images while maintaining parsimony in modelling, we apply the PTMM approach with common factor loadings (CUU) as a novel strategy for robust facial representation.

Consider a training set of n face images. Suppose that all images are exactly the same size, w pixels in width by h pixels in height. Let x_1, …, x_n be the transformed image vectors taking intensity values in a d-dimensional image space, where d = wh. In what follows, we briefly review the eigenface method. Define the total scatter matrix of the sample images as

S = ∑_{j=1}^n (x_j − x̄)(x_j − x̄)',

where x̄ is the mean image vector of all samples. In PCA, the optimal orthonormal basis vectors v_1, …, v_q, with q < d, are chosen to maximize the following objective function:

{v_1, …, v_q} = argmax_V |V'SV|,

where V is the set of d-dimensional eigenvectors of S corresponding to the q largest eigenvalues, which are referred to as eigenfaces in Turk and Pentland (1991) because they have exactly the same dimension as the original images. Each face image vector can then be approximated as a linear combination of the q best eigenvectors:

x̂_j = x̄ + ∑_{k=1}^q c_jk v_k,  (8)

where c_jk = v_k'(x_j − x̄).
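The eigenface construction reduces to an SVD of the centred data matrix, since the right singular vectors are the eigenvectors of the scatter matrix. A toy sketch with random "images" (sizes and names are ours):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, q = 11, 64, 5          # 11 images of 64 pixels each (toy size), q eigenfaces

X = rng.random((n, d))
x_bar = X.mean(axis=0)
Xc = X - x_bar               # centred data; S = Xc' Xc is the total scatter

# Eigenfaces = leading right singular vectors of the centred data matrix.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
V = Vt[:q].T                 # d x q orthonormal basis (the eigenfaces)

# Coefficients c_jk = v_k' (x_j - x_bar), then rank-q reconstruction.
C = Xc @ V
X_hat = x_bar + C @ V.T
```

Working with the n × d data matrix rather than the d × d scatter matrix is the usual trick when d far exceeds n, as with face images.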

As another illustration, we compare face reconstruction performance between the PGMM and PTMM approaches, using the image compression algorithm described in the example above, and the eigenface method. We use the face images in the Yale face database, which contains 165 grayscale images of 15 individuals in GIF format. There are 11 images per person with pose and lighting variations, but for illustrative purposes and simplicity we consider only the 11 images of one individual. For each image, we manually select the centers of the eyes. The four boundaries of each cropped central part of the face are determined relative to the eye centers, which gives the images used as the sample in our experiment.

In this experiment, the 11 face images are trained under seven models: the PCA method and the PGMM and PTMM1 models with the CUU structure. The reconstructed images are obtained by the following two steps.


In each model, the fitted vector is obtained as a reconstructed image, say x̂_j. For the PCA method, x̂_j is as defined in (8). For the PGMM and PTMM1 models, the solution is obtained by using (7).


If the entries of x̂_j do not fall in the colour space [0, 255], the reproductions are normalized by

x̂_j ← 255 (x̂_j − min(x̂_j)) / (max(x̂_j) − min(x̂_j)).
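The rescaling step is a one-liner; a small sketch (names ours; min-max rescaling to [0, 255] is one common choice, applied only when the reconstruction leaves the valid range):

```python
import numpy as np

def to_rgb_range(x_hat):
    """Rescale a reconstructed image vector into [0, 255] when any of its
    entries fall outside the colour space; otherwise return it unchanged."""
    lo, hi = float(x_hat.min()), float(x_hat.max())
    if lo >= 0.0 and hi <= 255.0:
        return x_hat
    return 255.0 * (x_hat - lo) / (hi - lo)

v = np.array([-10.0, 0.0, 300.0])
print(to_rgb_range(v))   # smallest entry maps to 0, largest to 255
```

Leaving in-range vectors untouched avoids needlessly altering reconstructions that are already valid images.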
Figure 2 shows the original and the reconstructed images obtained from the seven training models. PCA clearly has the worst performance, while the PGMM and PTMM lead to somewhat comparable results. To measure the reconstruction quality explicitly, we further calculate the RMSE for each image under each model. Figure 3 displays the patterns of the RMSE values. When comparing these models, smaller RMSEs indicate better reconstructions. In general, PTMM1 yields lower RMSE values than the PGMM. As a result, we conclude that PTMM1 can be a promising tool for facial coding.

Figure 2: Reconstructed images based on PCA, PGMM and PTMM approaches. The first column shows sample images for one individual of the Yale database.
Figure 3: RMSE values of seven trained models for 11 face images.

5 Concluding remarks

We have utilized a class of parsimonious t mixture models, called the PTMMs, which offer considerable flexibility for robust clustering of high-dimensional data, as they are relatively insensitive to outliers. This model-based tool allows practitioners to analyze heterogeneous multivariate data in a broad variety of settings and works particularly well in high-dimensional scenarios. Numerical results show that the proposed PTMM approach performs reasonably well for the experimental image data. As pointed out by Zhao and Yu (2008) and Wang and Lin (2013), the convergence of the AECM algorithm can be painfully slow in certain situations. It is a worthwhile task to pursue modified algorithms with faster convergence.


  • Ghahramani and Hinton (1997) Z. Ghahramani and G. E. Hinton, “The EM algorithm for factor analyzers”, Tech. Rep. CRG-TR-96-1, University Of Toronto, Toronto, 1997.
  • McLachlan and Peel (2000a) G. J. McLachlan and D. Peel, “Mixtures of factor analyzers”, in “Proceedings of the Seventh International Conference on Machine Learning”, pp. 599–606. Morgan Kaufmann, San Francisco, 2000a.
  • McLachlan and Peel (2000b) G. J. McLachlan and D. Peel, “Finite mixture models”, John Wiley & Sons, New York, 2000b.
  • Fokoué and Titterington (2003) E. Fokoué and D. M. Titterington, “Mixtures of factor analysers. Bayesian estimation and inference by stochastic simulation”, Machine Learning 50 (2003) 73–94.
  • McLachlan et al. (2002) G. J. McLachlan, R. W. Bean, and D. Peel, “A mixture model-based approach to the clustering of microarray expression data”, Bioinformatics 18 (2002), no. 3, 412–422.
  • McLachlan et al. (2003) G. J. McLachlan, D. Peel, and R. W. Bean, “Modelling high-dimensional data by mixtures of factor analyzers”, Comput. Statist. Data Anal. 41 (2003), no. 3–4, 379–388.
  • McNicholas and Murphy (2008) P. D. McNicholas and T. B. Murphy, “Parsimonious Gaussian mixture models”, Stat. Comput. 18 (2008) 285–296.
  • McLachlan and Peel (1998) G. J. McLachlan and D. Peel, “Robust cluster analysis via mixtures of multivariate t-distributions”, in “Lecture Notes in Computer Science”, vol. 1451, pp. 658–666. Springer-Verlag, Berlin, 1998.
  • Peel and McLachlan (2000) D. Peel and G. J. McLachlan, “Robust mixture modeling using the t distribution”, Stat. Comput. 10 (2000) 339–348.
  • Shoham (2002) S. Shoham, “Robust clustering by deterministic agglomeration EM of mixtures of multivariate t-distributions”, Pattern Recogn. 35 (2002), no. 5, 1127–1142.
  • Lin et al. (2004) T. I. Lin, J. C. Lee, and H. F. Ni, “Bayesian analysis of mixture modelling using the multivariate t distribution”, Stat. Comput. 14 (2004) 119–130.
  • Lin et al. (2009) T. I. Lin, H. J. Ho, and P. S. Shen, “Computationally efficient learning of multivariate t mixture models with missing information”, Comp. Stat. 24 (2009) 375–392.
  • Wang et al. (2004) H. X. Wang, Q. B. Zhang, B. Luo, and S. Wei, “Robust mixture modelling using multivariate t distribution with missing information”, Pattern Recogn. Lett. 25 (2004) 701–710.
  • Greselin and Ingrassia (2010) F. Greselin and S. Ingrassia, “Constrained monotone EM algorithms for mixtures of multivariate t distributions”, Stat. Comput. 20 (2010) 9–22.
  • Andrews et al. (2011) J. L. Andrews, P. D. McNicholas, and S. Subedi, “Model-based classification via mixtures of multivariate t-distributions”, Comput. Statist. Data Anal. 55 (2011), no. 1, 520–529.
  • McLachlan et al. (2007) G. J. McLachlan, R. W. Bean, and L. B.-T. Jones, “Extension of the mixture of factor analyzers model to incorporate the multivariate t-distribution”, Comput. Statist. Data Anal. 51 (2007), no. 11, 5327–5338.
  • Andrews and McNicholas (2011a) J. L. Andrews and P. D. McNicholas, “Extending mixtures of multivariate t-factor analyzers”, Stat. Comput. 21 (2011a), no. 3, 361–373.
  • Andrews and McNicholas (2011b) J. L. Andrews and P. D. McNicholas, “Mixtures of modified t-factor analyzers for model-based clustering, classification, and discriminant analysis”, J. Statist. Plann. Inf. 141 (2011b), no. 4, 1479–1486.
  • Dempster et al. (1977) A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm”, J. Roy. Statist. Soc. B 39 (1977), no. 1, 1–38.
  • Meng and Rubin (1993) X.-L. Meng and D. B. Rubin, “Maximum likelihood estimation via the ECM algorithm: a general framework”, Biometrika 80 (1993) 267–278.
  • Liu and Rubin (1994) C. Liu and D. B. Rubin, “The ECME algorithm: a simple extension of EM and ECM with faster monotone convergence”, Biometrika 81 (1994) 633–648.
  • Liu and Rubin (1995) C. Liu and D. B. Rubin, “ML estimation of the t distribution using EM and its extensions, ECM and ECME”, Statistica Sinica 5 (1995) 19–39.
  • Meng and van Dyk (1997) X.-L. Meng and D. van Dyk, “The EM algorithm — an old folk song sung to a fast new tune (with discussion)”, J. Roy. Statist. Soc. B 59 (1997) 511–567.
  • Woodbury (1950) M. A. Woodbury, “Inverting modified matrices”, Princeton University, Princeton, New Jersey, 1950.
  • Ueda et al. (2000) N. Ueda, R. Nakano, Z. Ghahramani, and G. E. Hinton, “SMEM algorithm for mixture models”, Neural Computation 12 (2000) 2109–2128.
  • Zhao et al. (2003) W. Zhao, R. Chellappa, A. Rosenfeld, and P. J. Phillips, “Face recognition: A literature survey”, ACM Computing Surveys 35 (2003) 399–458.
  • Sirovitch and Kirby (1987) L. Sirovitch and M. Kirby, “Low-dimensional procedure for the characterization of human faces”, J. Optical Soc. of Am. 2 (1987) 519–524.
  • Turk and Pentland (1991) M. Turk and A. Pentland, “Eigenfaces for recognition”, J. Cognitive Neuroscience 3 (1991) 71–86.
  • Kim et al. (2002) H. C. Kim, D. Lim, and S. Y. Bang, “Face recognition using the mixture-of-eigenfaces method”, Patt. Recog. Lett. 23 (2002) 1549–1558.
  • Zhao and Yu (2008) J. H. Zhao and P. L. H. Yu, “Fast ML estimation for the mixture of factor analyzers via an ECM algorithm”, IEEE Trans. on Neu. Net. 19 (2008) 1956–1961.
  • Wang and Lin (2013) W. L. Wang and T. I. Lin, “An efficient ECM algorithm for maximum likelihood estimation in mixtures of t-factor analyzers”, Comput. Statist. DOI 10.1007/s00180-012-0327-z (2013).


A: ML estimation via the AECM algorithm

We discuss how to carry out the AECM algorithm for computing the ML estimates of the parameters in the PTMMs. To formulate the algorithm, which consists of three cycles, we partition the unknown parameters as Θ = (Θ_1, Θ_2, Θ_3), where Θ_1 contains the w_i's, Θ_2 contains the μ_i's and ν_i's, while Θ_3 contains the B_i's and D_i's. Let Z = (Z_1, …, Z_n) be the collection of latent group labels. In the first cycle of the algorithm, we treat Z as missing data, so the complete data is (y, Z), and our aim is to estimate Θ_1 with Θ_2 and Θ_3 fixed at their current estimates Θ̂_2 and Θ̂_3.
       The log-likelihood of based on the complete data , apart from an additive constant, is given by


Conditioning on and , taking expectation for (9) leads to



Maximizing (10) with respect to w_i, subject to the restriction ∑_{i=1}^g w_i = 1, yields

ŵ_i = (1/n) ∑_{j=1}^n ẑ_ij,

where ẑ_ij is the current posterior probability that y_j belongs to component i.
At the second cycle, when estimating Θ_2, we treat Z and the latent weights τ as the missing data. The log-likelihood function of Θ_2 based on the complete data (y, Z, τ) takes the form of


Conditioning on and , taking expectation for (11) leads to



where τ̂_ij denotes the conditional expectation of τ_ij given y_j and Z_ij = 1, and DG(·) is the digamma function.
Maximizing (12) with respect to μ_i and ν_i yields

μ̂_i = ∑_{j=1}^n ẑ_ij τ̂_ij y_j / ∑_{j=1}^n ẑ_ij τ̂_ij,

while ν̂_i is equivalently the solution of the following equation:

log(ν_i/2) − DG(ν_i/2) + 1 + (1/n_i) ∑_{j=1}^n ẑ_ij (log τ̂_ij − τ̂_ij) + DG((ν̂_i + p)/2) − log((ν̂_i + p)/2) = 0,  where n_i = ∑_{j=1}^n ẑ_ij.
In the case of ν_i = ν (PTMM1), we obtain ν̂ as the solution of the analogous equation in which the component-specific average is replaced by the pooled average (1/n) ∑_{i=1}^g ∑_{j=1}^n ẑ_ij (log τ̂_ij − τ̂_ij).
At the third cycle, when estimating Θ_3, we treat Z, τ, and the latent factors U as the missing data. The log-likelihood function of Θ_3 based on the complete data (y, Z, τ, U) takes the form of


Therefore, the expectation of (13) conditional on the observed data y and the updated values of (Θ̂_1, Θ̂_2) yields the Q-function for Θ_3. The resulting CM-steps for the updates of B_i and D_i under the eight constrained/unconstrained situations are given in Section B. Note that the above procedure is also applicable to the tFA model by treating g = 1.
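The CM-step for the degrees of freedom amounts to a one-dimensional root-finding problem involving the digamma function. A hedged sketch for a single component (the equation form follows the standard t-mixture df update in the style of Peel and McLachlan (2000); function names and the simulated inputs are ours):

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

def update_df(z, tau, p, nu_old):
    """Solve the df CM-step equation for one component:
    log(nu/2) - DG(nu/2) + 1
      + (1/n_i) sum_j z_ij (log tau_ij - tau_ij)
      + DG((nu_old + p)/2) - log((nu_old + p)/2) = 0,
    where n_i = sum_j z_ij and DG is the digamma function."""
    n_i = z.sum()
    c = (1.0 + (z * (np.log(tau) - tau)).sum() / n_i
         + digamma((nu_old + p) / 2.0) - np.log((nu_old + p) / 2.0))
    f = lambda nu: np.log(nu / 2.0) - digamma(nu / 2.0) + c
    # f decreases from +infinity near 0 toward c, so bracket widely.
    return brentq(f, 1e-3, 1e3)

rng = np.random.default_rng(5)
nu_true, n, p = 8.0, 2000, 4
tau = rng.gamma(nu_true / 2.0, 2.0 / nu_true, size=n)  # weights as if from the model
nu_hat = update_df(np.ones(n), tau, p, nu_old=nu_true)
```

Because the left-hand side is strictly monotone in ν over the bracket, brentq converges reliably; in the full AECM iteration this solve is repeated for each component (or once, pooled, under PTMM1) at every pass.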

B: Calculations for eight parsimonious models

To facilitate the derivation, we adopt the following notation:

B.1: Model CCC

For model CCC, we have B_i = B and D_i = dI_p, for i = 1, …, g. The Q-function can be expressed as

Differentiating with respect to B and d, respectively, and setting the derivatives equal to zero, we obtain

B.2: Model CCU

For model CCU, we have B_i = B and D_i = D. The Q-function can be expressed as

Differentiating with respect to B and D, respectively, and setting the results equal to zero, we obtain

B.3: Model CUC

For model CUC, we have B_i = B and D_i = d_i I_p. The Q-function can be expressed as

Differentiating with respect to B and d_i, respectively, and setting the results equal to zero, we obtain


B.4: Model CUU

For model CUU, we have B_i = B. The Q-function can be expressed as