Capturing Patterns via Parsimonious Mixture Models
Abstract
This paper exploits a simplified version of the mixture of multivariate factor analyzers (MtFA) for robust mixture modelling and clustering of highdimensional data that frequently contain a number of outliers. Two classes of eight parsimonious mixture models are introduced and computation of maximum likelihood estimates of parameters is achieved using the alternating expectation conditional maximization (AECM) algorithm. The usefulness of the methodology is illustrated through applications of image compression and compact facial representation.
Keywords: Factor analysis; Facial representation; Image compression; PGMM; PTMM
1 Introduction
The mixture of factor analyzers (MFA) model, first introduced by Ghahramani and Hinton (1997) and further generalized by McLachlan and Peel (2000a), has provided a flexible dimensionality reduction approach to the statistical modelling of highdimensional data arising from a wide variety of random phenomena. By combining the factor analysis model with finite mixture models, the MFA model allows simultaneous partitioning of the population into several subclasses while performing a local dimensionality reduction for each mixture component. In recent years, the study of MFA has received considerable interest; see McLachlan and Peel (2000b) and Fokoué and Titterington (2003) for some excellent reviews. McLachlan et al. (2002) and McLachlan et al. (2003) exploited the MFA approach to handle highdimensional data such as clustering of microarray expression profiles. McNicholas and Murphy (2008) generalized the MFA model by introducing a family of parsimonious Gaussian mixture models (PGMMs).
In the MFA framework, component errors and factors are routinely assumed to have a Gaussian distribution due to their mathematical tractability and computational convenience. In practice, noise components or badly discrepant outliers often exist; the multivariate (MVT) distribution contains an additional tuning parameter, the degrees of freedom (df), which can be useful for outlier accommodation. Specifically, the density of a component multivariate mixture model is of the form
(1) 
where the are mixing proportions and is the density of variate distribution with df , mean vector , and scaling covariance matrix . Note that the general mixture (TMIX) model (1) has a total of free parameters, of which parameters correspond to the component matrices and df , respectively. In a more parsimonious version of (1), wherein the and are restricted to be identical across groups, the number of free parameters reduces to .
The TMIX model was first considered by McLachlan and Peel (1998) and Peel and McLachlan (2000), who presented expectationmaximization (EM) algorithms for parameter estimation and show the robustness of the model for clustering. Further developments along these directions followed, including work by Shoham (2002); Lin et al. (2004, 2009); Wang et al. (2004); Greselin and Ingrassia (2010); Andrews et al. (2011). An extension to mixture of factor analyzers (MtFA), adopting the family of MVT distributions for the component factor and errors, was first considered by McLachlan et al. (2007) and more recently by Andrews and McNicholas (2011a, b).
In this paper, our objective is to illustrate the efficacy of a class of TMIX models with several parsimonious covariance structures, called parsimonious mixture models (PTMMs). The PTMMs are based on assuming a constrained factor structure on each mixture component for parsimoniously modelling population heterogeneity in the presence of fat tails; the PTMMs are analogues of the PGMMs. We are devoted to developing additional tools for a simplified version of the MODt family of Andrews and McNicholas (2011b) and applying the proposed tools on image reconstruction tasks.
The EM algorithm (Dempster et al., 1977) and its extensions, such as the expectation conditional maximization (ECM) algorithm (Meng and Rubin, 1993) and the expectation conditional maximization either (ECME) algorithm (Liu and Rubin, 1994, 1995), have been practiced as useful tools for conducting maximum likelihood (ML) estimation in a variety of mixture modelling scenarios. To improve the computational efficiency for fitting PTMMs, we adopt a threecycle AECM algorithm (Meng and van Dyk, 1997) that allows specification of different complete data at each cycle.
The rest of the paper is organized as follows. In Section 2, we briefly describe the single factor analysis model and study some related properties. In Section 3, we present the formulation of PTMM and discuss the methods for fitting these models. In Section 4, we demonstrate how PTMM can be applied to image compression and compact facial representation tasks. Some concluding remarks are given in Section 5.
2 The factor analysis model
We briefly review the factor analysis (tFA) model, which can be thought of as a single component MtFA model. Let be a set of dimensional random vectors. In the tFA setting, each is modelled as
(2) 
with
where is a dimensional mean vector, B is a matrix of factor loadings, is a dimensional () vector of latent variables called common factors, is dimensional vector of errors called specific factors, is a dimensional identity matrix, is a diagonal matrix with positive entries, and is the df.
Note that and are uncorrelated and marginally distributed, but not independent. Using the characterization of the distribution, model (2) can be hierarchically represented as
where stands for the gamma distribution with mean . It follows that the marginal distribution of , after integrating out and , can be expressed as .
When handling highdimensional data for large relative to , say , the inverse of plays an important role in computational complexity. In such a case, we use the following matrix inversion formula (Woodbury, 1950):
(3) 
which can be done more quickly because it involves only the lowdimensional inverse plus the inversion of a diagonal matrix. Moreover, the determinant of can be calculated as
(4) 
The formulae (3) and (4) have been used many times before, including work by McLachlan and Peel (2000a) and McNicholas and Murphy (2008).
3 Parsimonious multivariate mixture models
Consider independent dimensional random vectors that come from a heterogeneous population with nonoverlapping classes. Suppose that the density function of each feature can be modelled by a component MtFA:
(5) 
where the are mixing proportions that are constrained to be positive and , is a matrix of factor loadings, and is a diagonal matrix. We use to represent all unknown parameters with containing the parameters for the density of component .
As in McLachlan et al. (2007), the MtFA model can be alternatively formulated by exploiting its link with the tFA model. Specifically, we assume that
with probability , for and , where and are, respectively, the common and specific factors corresponding to component . We assume that and have a joint MVT distribution to be consistent with model (5) for the marginal distribution of . Under a hierarchical mixture modelling framework, each is conceptualized to have originated from one of classes. It is convenient to construct unobserved allocation variables , for , whose values are a set of binary variables with indexing so that belongs to class and are constrained to be for each . More specifically, is distributed as a multinomial random vector with one trial and cell probabilities , denoted by . After a little algebra (cf. McLachlan et al., 2007), we have
(6) 
where , and denotes the Mahalanobissquared distance between and .
Following McNicholas and Murphy (2008), we extend the MtFA model by allowing constraints across groups on matrices and . Specifically, we allow the following constraints on the scaling covariance matrices: ; ; or , for . In addition, we consider the constraint when we impose this constraint to reduce the number of parameter in PTMMs, our family of models (Table 1) is called ‘PTMM1’, and when the are allowed to vary across components, we call our family ‘PTMM2’. In this latter case (PTMM2), one simply adds parameters to the last column of Table 1. The number of free covariance parameters in the PTMM family — the union of PTMM1 and PTMM2 — can be reduced to as few as or inflated to as many as .
Structure  Loading  Error  Isotropic  Number of parameters 

CCC  Constrained  Constrained  Constrained  
CCU  Constrained  Constrained  Unconstrained  
CUC  Constrained  Unconstrained  Constrained  
CUU  Constrained  Unconstrained  Unconstrained  
UCC  Unconstrained  Constrained  Constrained  
UCU  Unconstrained  Constrained  Unconstrained  
UUC  Unconstrained  Unconstrained  Constrained  
UUU  Unconstrained  Unconstrained  Unconstrained 
The implementation of an efficient 3cycle AECM algorithm for estimating PTMMs together with some practical issues including the specification of starting values, the stopping rule and the model selection criterion are described in Supplementary Material.
4 Applications
4.1 Image compression
A colour image is a digital image that includes 24bit RGB colour information for each pixel. An RGB image is derived from the three primary colours — red (R), green (G), and blue (B) — for which each colour is 8 bits. The technique of image compression plays a central role in image transmission and storage for high quality. There are several unsupervised image compression approaches based on probabilistic models. The MFA model is well recognized as a dominant dimension reduction technique and is useful in block image transform coding (Ueda et al., 2000). In this subsection, we individually apply the PTMMs and PGMM to colour image compression and compare the quality of the reconstructed images. A RGB colour image ‘Lena’ (encoded in 24 bits per pixel) is subdivided into nonoverlapping RGBblocks of pixels and each block is taken as a 48dimensional data vector . Let be a set of collected vector. Making a slight modification of Ueda et al. (2000), our compression procedure to transform the experimental image is summarized below.
 1.

Set the desired number of components and dimensionality of factors .
 2.

Estimate by fitting a parsimonious mixture model to .
 3.

Perform a modelbased clustering according to the component membership of data point , which is decided by maximizing the posterior probability .
 4.
 5.

Reconstruct , , into a compressed color image.
In what follows, we assumed that the characteristic of each components in the mixture model are the same, which leads to the MFA model with common factor loadings (CUU structure). The Lena image data are fitted by the CUU structure with and using the PGMM and PTMM approaches. To evaluate the quality of the reconstructed image, we compute the root mean squared error (RMSE) and peak signaltonoise ratio (PSNR), which are widely used metrics in the image coding field. For an original color image of size and a reconstructed image , the RMSE is defined as
where
and the other two quantities and are defined in the same fashion. The measure RMSE represents the average Euclidean distance from the original image to the output one. For a 24bit colour image, the PSNR is given by
Note that the higher PSNR value indicates a compressed (i.e., reconstructed) image of better quality.
Experimental results are provided in Table 2, including the loglikelihoods, the number of parameters, and BIC values, together with the RMSE and PSNR for the compressed images. The performance of PTMM1 and PTMM2 are almost the same in terms of RMSE and PSNR for the compressed images. Notably, the BIC and PSNR values obtained from the PTMM approaches are much higher than the PGMM for all cases.
Model  BIC  RMSE  PSNR  

PGMM  2104.4  189  4206.8  7.62  30.5 
PTMM1  2205.9  190  4409.7  5.22  33.8 
PTMM2  2204.6  193  4407.1  5.26  33.7 
Model  BIC  RMSE  PSNR  
PGMM  2235.1  363  4466.3  7.03  31.2 
PTMM1  2348.8  364  4693.7  2.78  39.2 
PTMM2  2351.2  371  4698.5  2.43  40.4 
. Models with large BIC scores are preferred.
Figure 1 shows the original Lena image and three compressed images obtained by fitted the CUU models. The quality of the reconstructed images using the PTMMs is much better (i.e., smoother) than those reconstructed using the PGMM (for instance, observe the edge of Lena’s shoulder).
4.2 Compact facial representation
Over the past few decades, numerous template matching approaches, along with their modifications and variants, have been proposed for human perception system face recognition; see Zhao et al. (2003) for an extensive survey. Motivated by a technique developed by Sirovitch and Kirby (1987), the eigenface method (Turk and Pentland, 1991) has become the most popular tool for facial feature extraction in computer vision research. Its implementation is primarily based on the use of PCA, also known as the KarhunenLoève transform. The central idea of PCA is to find a multilinear subspace whose orthonormal basis maximizes the scatter matrix of all the projected samples.
Although the eigenface method is remarkably simple and quite straightforward, its performance depends heavily on pose, lighting and background conditions. Because PCA is a onesample method, a single set of eigenfaces may not be enough to represent face images having a large amount of variation. Some efforts have been made to generalize it to the mixtureofeigenfaces method Kim et al. (2002), which generates more than one set of eigenfaces for better representation of the whole face. However, such a method will produce an overparameterized solution in many circumstances. To capture the heterogeneity among the underlying face images while maintaining parsimony in modeling, we apply the PTMM approach with common factor loadings (CUU) as a novel strategy for robust facial representation.
Consider a training set of face images . Suppose that all images are exactly the same size; pixels in width by in height. Let be the transformed image vectors taking intensity values in a dimensional image space, where . In what follows, we briefly review the eigenface method. Define the total scatter matrix of the sample images as
where is the mean image vector of all samples. In PCA, the optimal orthonormal basis vectors , where , are chosen to maximize the following objective function
where is the set of dimensional vectors of corresponding to the largest eigenvectors, which are referred to as eigenfaces in Turk and Pentland (1991) because they have exactly the same dimension as the original images. Each face image vector can be represented as a linear combination of the best eigenvectors:
(8) 
where .
As another illustration, we compare the face reconstruction performance among the PGMM and PTMM approaches using the image compression algorithm described in the above example and the eigenface method. We use the face images in the Yale face database, which contains 165 grayscale images of 15 individuals in GIF format. There are 11 images per person with pose and lighting variations, but for illustrative purposes and simplicity we consider only 11 images of one individual. For each image, we manually select the centers of eyes, denoted by . The four boundaries of each cropped central part of the face are , , , and , which gives images of pixels as a sample in our experiment.
In this experiment, the 11 face images are trained by seven models: the PCA, PGMM, and PTMM1 models with CUU structure, where and . The reconstructed images can be obtained by following two steps.
 1.
 2.

If the entries of do not fall in the color space, such as , the reproductions are normalized by
Figure 2 shows the original and the reconstructed images obtained from the seven training models. PCA clearly has the worst performance, while the PGMM and PTMM lead to somewhat comparable results. For explicitly measuring the reconstruction quality, we further calculated the RMSE, denoted by , for each image and each model. Figure 3 displays the patterns of RMSE values. When comparing these models, smaller RMSEs indicate better reconstructions. In general, PTMM1 yields lower RMSE values than those from PGMM. As a result, we conclude that PTMM1 can be a prominent tool for facial coding.
5 Concluding remarks
We have utilized a class of parsimonious mixture models, called the PTMMs, which may create tremendous flexibility in robust clustering of highdimensional data as they are relatively insensitive to outliers. This modelbased tool allows practitioners to analyze heterogeneous multivariate data in a broad variety of considerations and works particularly well in highdimensional settings. Numerical results show that the proposed PTMM approach performs reasonably well for the experimental image data. As pointed by by Zhao and Yu (2008) and Wang and Lin (2013), the convergence of the AECM algorithm can be painfully slow in certain situations. It is a worthwhile task to pursue some modified algorithms toward fast convergence.
References
 Ghahramani and Hinton (1997) Z. Ghahramani and G. E. Hinton, “The EM algorithm for factor analyzers”, Tech. Rep. CRGTR961, University Of Toronto, Toronto, 1997.
 McLachlan and Peel (2000a) G. J. McLachlan and D. Peel, “Mixtures of factor analyzers”, in “Proceedings of the Seventh International Conference on Machine Learning”, pp. 599–606. Morgan Kaufmann, San Francisco, 2000a.
 McLachlan and Peel (2000b) G. J. McLachlan and D. Peel, “Finite mixture models”, John Wiley & Sons, New York, 2000b.
 Fokoué and Titterington (2003) E. Fokoué and D. M. Titterington, “Mixtures of factor analysers. Bayesian estimation and inference by stochastic simulation”, Machine Learning 50 (2003) 73–94.
 McLachlan et al. (2002) G. J. McLachlan, R. W. Bean, and D. Peel, “A mixture modelbased approach to the clustering of microarray expression data”, Bioinformatics 18 (2002), no. 3, 412–422.
 McLachlan et al. (2003) G. J. McLachlan, D. Peel, and R. W. Bean, “Modelling highdimensional data by mixtures of factor analyzers”, Comput. Statist. Data Anal. 41 (2003), no. 3–4, 379–388.
 McNicholas and Murphy (2008) P. D. McNicholas and T. B. Murphy, “Parsimonious Gaussian mixture models”, Stat. Comput. 18 (2008) 285–296.
 McLachlan and Peel (1998) G. J. McLachlan and D. Peel, “Robust cluster analysis via mixtures of multivariate distributions”, in “Lecture Notes in Computer Science”, vol. 1451, pp. 658–666. SpringerVerlag, Berlin, 1998.
 Peel and McLachlan (2000) D. Peel and G. J. McLachlan, “Robust mixture modeling using the distribution”, Stat. Comput. 10 (2000) 339–348.
 Shoham (2002) S. Shoham, “Robust clustering by deterministic agglomeration EM of mixtures of multivariate distributions”, Pattern Recogn. 35 (2002), no. 5, 1127–1142.
 Lin et al. (2004) T. I. Lin, J. C. Lee, and H. F. Ni, “Bayesian analysis of mixture modelling using the multivariate distribution”, Stat. Comput. 14 (2004) 119–130.
 Lin et al. (2009) T. I. Lin, H. J. Ho, and P. S. Shen, “Computationally efficient learning of multivariate mixture models with missing information”, Comp. Stat. 24 (2009) 375–392.
 Wang et al. (2004) H. X. Wang, Q. B. Zhang, B. Luo, and S. Wei, “Robust mixture modelling using multivariate distribution with missing information”, Pattern Recogn. Lett. 25 (2004) 701–710.
 Greselin and Ingrassia (2010) F. Greselin and S. Ingrassia, “Constrained monotone EM algorithms for mixtures of multivariate distributions”, Stat. Comput. 20 (2010) 9–22.
 Andrews et al. (2011) J. L. Andrews, P. D. McNicholas, and S. Subedi, “Modelbased classification via mixtures of multivariate distributions”, Comput. Statist. Data Anal. 55 (2011), no. 1, 520–529.
 McLachlan et al. (2007) G. J. McLachlan, R. W. Bean, and L. B.T. Jones, “Extension of the mixture of factor analyzers model to incorporate the multivariate distribution”, Comput. Statist. Data Anal. 51 (2007), no. 11, 5327–5338.
 Andrews and McNicholas (2011a) J. L. Andrews and P. D. McNicholas, “Extending mixtures of multivariate factor analyzers”, Stat. Comput. 21 (2011)a, no. 3, 361–373.
 Andrews and McNicholas (2011b) J. L. Andrews and P. D. McNicholas, “Mixtures of modified factor analyzers for modelbased clustering, classification, and discriminant analysis”, J. Statist. Plann. Inf. 141 (2011)b, no. 4, 1479–1486.
 Dempster et al. (1977) A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm”, J. Roy. Statist. Soc. B 39 (1977), no. 1, 1–38.
 Meng and Rubin (1993) X.L. Meng and D. B. Rubin, “Maximum likelihood estimation via the ECM algorithm: a general framework”, Biometrika 80 (1993) 267–278.
 Liu and Rubin (1994) C. Liu and D. B. Rubin, “The ECME algorithm: a simple extension of EM and ECM with faster monotone convergence”, Biometrika 81 (1994) 633–648.
 Liu and Rubin (1995) C. Liu and D. B. Rubin, “ML estimation of the distribution using EM and its extensions, ECM and ECME”, Statistica Sinica 5 (1995) 19–39.
 Meng and van Dyk (1997) X.L. Meng and D. van Dyk, “The EM algorithm — an old folk song sung to a fast new tune (with discussion)”, J. Roy. Statist. Soc. B 59 (1997) 511–567.
 Woodbury (1950) M. A. Woodbury, “Inverting modified matrices”, Princeton University, Princeton, New Jersey, 1950.
 Ueda et al. (2000) N. Ueda, R. Nakano, Z. Ghahramani, and G. E. Hinton, “SMEM algorithm for mixture models”, Neural Computation 12 (2000) 2109–2128.
 Zhao et al. (2003) W. Zhao, R. Chellappa, A. Rosenfeld, and P. J. Phillips, “Face recognition: A literature survey”, ACM Computing Surveys 35 (2003) 399–458.
 Sirovitch and Kirby (1987) L. Sirovitch and M. Kirby, “Lowdimensional procedure for the characterization of human faces”, J. Optical Soc. of Am. 2 (1987) 519–524.
 Turk and Pentland (1991) M. Turk and A. Pentland, “Eigenfaces for recognition”, J. Cognitive Neuroscience 3 (1991) 71–86.
 Kim et al. (2002) H. C. Kim, D. Lim, and S. Y. Bang, “Face recognition using the mixtureofeigenfaces method”, Patt. Recog. Lett. 23 (2002) 1549–1558.
 Zhao and Yu (2008) J. H. Zhao and P. L. H. Yu, “Fast ML estimation for the mixture of factor analyzers via an ECM algorithm”, IEEE Trans. on Neu. Net. 19 (2008) 1956–1961.
 Wang and Lin (2013) W. L. Wang and T. I. Lin, “An efficient ecm algorithm for maximum likelihood estimation in mixtures of factor analyzers”, Comput. Statist. DOI 10.1007/s001800120327z (2013).
Appendix
A: ML estimation via the AECM algorithm
We discuss how to carry out the AECM algorithm for computing the ML estimates of the parameters in PTMMs.
To formulate the algorithm which consists of three cycles,
we partition the unknown parameters as , where contains ’s, contains
’s and ’s,
while contains ’s and ’s. Let be the collection of
latent group labels.
In the first cycle of the algorithm, we treat as missing data. So the complete data is
and our aim is to estimate
with and fixed at their current estimates and .
The loglikelihood of based on the complete data , apart from an additive constant, is given by
(9) 
Conditioning on and , taking expectation for (9) leads to
(10) 
with
Maximizing (10) with respect to , that restricted to , yields
At the second cycle, when estimating , we treat and as the missing data. The loglikelihood function of based on the complete data takes the form of
(11)  
Conditioning on and , taking expectation for (11) leads to
(12)  
with
where
and DG is the digamma function.
Maximizing (12) with respect to and yields
and
which is equivalently the solution of the following equation:
In the case of , we obtain as the solution of the following equation:
At the third cycle, when estimating , we treat , and as the missing data. The loglikelihood function of based on the complete data takes the form of
(13)  
Therefore, the expectation of (13) conditioning on the observed data and the updated vales of is
where , and . The resulting CMsteps for the updates of and under eight constrained/unconstrained situations are given in the next Section. Note that the above procedure is also applicable for the tFA model by treating .
B: Calculations for eight parsimonious models
To facilitate the derivation, we adopt the following notation:
B.1: Model CCC
For model CCC, we have , . The function can be expressed as
where and . Differentiating with respect to B and , respectively, and set the derivations equal to zero, we obtain
B.2: Model CCU
For model CCU, we have , . The function can be expressed as
where . Differentiating with respect to B and , respectively, and set the results equal to zero, we obtain
B.3: Model CUC
For model CUC, we have , . The function can be expressed as
where and . Differentiating with respect to B and , respectively, and set the results equal to zero, we obtain
and
B.4: Model CUU
For model CUU we have . The function can be expressed as