Capturing Patterns via Parsimonious Mixture Models
This paper exploits a simplified version of the mixture of multivariate -factor analyzers (MtFA) for robust mixture modelling and clustering of high-dimensional data that frequently contain a number of outliers. Two classes of eight parsimonious mixture models are introduced and computation of maximum likelihood estimates of parameters is achieved using the alternating expectation conditional maximization (AECM) algorithm. The usefulness of the methodology is illustrated through applications of image compression and compact facial representation.
Keywords: Factor analysis; Facial representation; Image compression; PGMM; PTMM
The mixture of factor analyzers (MFA) model, first introduced by Ghahramani and Hinton (1997) and further generalized by McLachlan and Peel (2000a), has provided a flexible dimensionality reduction approach to the statistical modelling of high-dimensional data arising from a wide variety of random phenomena. By combining the factor analysis model with finite mixture models, the MFA model allows simultaneous partitioning of the population into several subclasses while performing a local dimensionality reduction for each mixture component. In recent years, the study of MFA has received considerable interest; see McLachlan and Peel (2000b) and Fokoué and Titterington (2003) for some excellent reviews. McLachlan et al. (2002) and McLachlan et al. (2003) exploited the MFA approach to handle high-dimensional data such as clustering of microarray expression profiles. McNicholas and Murphy (2008) generalized the MFA model by introducing a family of parsimonious Gaussian mixture models (PGMMs).
In the MFA framework, component errors and factors are routinely assumed to have a Gaussian distribution due to their mathematical tractability and computational convenience. In practice, noise components or badly discrepant outliers often exist; the multivariate (MVT) distribution contains an additional tuning parameter, the degrees of freedom (df), which can be useful for outlier accommodation. Specifically, the density of a -component multivariate mixture model is of the form
where the are mixing proportions and is the density of -variate distribution with df , mean vector , and scaling covariance matrix . Note that the general mixture (TMIX) model (1) has a total of free parameters, of which parameters correspond to the component matrices and df , respectively. In a more parsimonious version of (1), wherein the and are restricted to be identical across groups, the number of free parameters reduces to .
The TMIX model was first considered by McLachlan and Peel (1998) and Peel and McLachlan (2000), who presented expectation-maximization (EM) algorithms for parameter estimation and show the robustness of the model for clustering. Further developments along these directions followed, including work by Shoham (2002); Lin et al. (2004, 2009); Wang et al. (2004); Greselin and Ingrassia (2010); Andrews et al. (2011). An extension to mixture of -factor analyzers (MtFA), adopting the family of MVT distributions for the component factor and errors, was first considered by McLachlan et al. (2007) and more recently by Andrews and McNicholas (2011a, b).
In this paper, our objective is to illustrate the efficacy of a class of TMIX models with several parsimonious covariance structures, called parsimonious mixture models (PTMMs). The PTMMs are based on assuming a constrained -factor structure on each mixture component for parsimoniously modelling population heterogeneity in the presence of fat tails; the PTMMs are -analogues of the PGMMs. We are devoted to developing additional tools for a simplified version of the MODt family of Andrews and McNicholas (2011b) and applying the proposed tools on image reconstruction tasks.
The EM algorithm (Dempster et al., 1977) and its extensions, such as the expectation conditional maximization (ECM) algorithm (Meng and Rubin, 1993) and the expectation conditional maximization either (ECME) algorithm (Liu and Rubin, 1994, 1995), have been practiced as useful tools for conducting maximum likelihood (ML) estimation in a variety of mixture modelling scenarios. To improve the computational efficiency for fitting PTMMs, we adopt a three-cycle AECM algorithm (Meng and van Dyk, 1997) that allows specification of different complete data at each cycle.
The rest of the paper is organized as follows. In Section 2, we briefly describe the single factor analysis model and study some related properties. In Section 3, we present the formulation of PTMM and discuss the methods for fitting these models. In Section 4, we demonstrate how PTMM can be applied to image compression and compact facial representation tasks. Some concluding remarks are given in Section 5.
2 The factor analysis model
We briefly review the factor analysis (tFA) model, which can be thought of as a single component MtFA model. Let be a set of -dimensional random vectors. In the tFA setting, each is modelled as
where is a -dimensional mean vector, B is a matrix of factor loadings, is a -dimensional () vector of latent variables called common factors, is -dimensional vector of errors called specific factors, is a -dimensional identity matrix, is a diagonal matrix with positive entries, and is the df.
Note that and are uncorrelated and marginally -distributed, but not independent. Using the characterization of the distribution, model (2) can be hierarchically represented as
where stands for the gamma distribution with mean . It follows that the marginal distribution of , after integrating out and , can be expressed as .
When handling high-dimensional data for large relative to , say , the inverse of plays an important role in computational complexity. In such a case, we use the following matrix inversion formula (Woodbury, 1950):
which can be done more quickly because it involves only the low-dimensional inverse plus the inversion of a diagonal matrix. Moreover, the determinant of can be calculated as
3 Parsimonious multivariate mixture models
Consider independent -dimensional random vectors that come from a heterogeneous population with non-overlapping classes. Suppose that the density function of each feature can be modelled by a -component MtFA:
where the are mixing proportions that are constrained to be positive and , is a matrix of factor loadings, and is a diagonal matrix. We use to represent all unknown parameters with containing the parameters for the density of component .
As in McLachlan et al. (2007), the MtFA model can be alternatively formulated by exploiting its link with the tFA model. Specifically, we assume that
with probability , for and , where and are, respectively, the common and specific factors corresponding to component . We assume that and have a joint MVT distribution to be consistent with model (5) for the marginal distribution of . Under a hierarchical mixture modelling framework, each is conceptualized to have originated from one of classes. It is convenient to construct unobserved allocation variables , for , whose values are a set of binary variables with indexing so that belongs to class and are constrained to be for each . More specifically, is distributed as a multinomial random vector with one trial and cell probabilities , denoted by . After a little algebra (cf. McLachlan et al., 2007), we have
where , and denotes the Mahalanobis-squared distance between and .
Following McNicholas and Murphy (2008), we extend the MtFA model by allowing constraints across groups on matrices and . Specifically, we allow the following constraints on the scaling covariance matrices: ; ; or , for . In addition, we consider the constraint when we impose this constraint to reduce the number of parameter in PTMMs, our family of models (Table 1) is called ‘PTMM1’, and when the are allowed to vary across components, we call our family ‘PTMM2’. In this latter case (PTMM2), one simply adds parameters to the last column of Table 1. The number of free covariance parameters in the PTMM family — the union of PTMM1 and PTMM2 — can be reduced to as few as or inflated to as many as .
|Structure||Loading||Error||Isotropic||Number of parameters|
The implementation of an efficient 3-cycle AECM algorithm for estimating PTMMs together with some practical issues including the specification of starting values, the stopping rule and the model selection criterion are described in Supplementary Material.
4.1 Image compression
A colour image is a digital image that includes 24-bit RGB colour information for each pixel. An RGB image is derived from the three primary colours — red (R), green (G), and blue (B) — for which each colour is 8 bits. The technique of image compression plays a central role in image transmission and storage for high quality. There are several unsupervised image compression approaches based on probabilistic models. The MFA model is well recognized as a dominant dimension reduction technique and is useful in block image transform coding (Ueda et al., 2000). In this subsection, we individually apply the PTMMs and PGMM to colour image compression and compare the quality of the reconstructed images. A RGB colour image ‘Lena’ (encoded in 24 bits per pixel) is subdivided into non-overlapping RGB-blocks of pixels and each block is taken as a 48-dimensional data vector . Let be a set of collected vector. Making a slight modification of Ueda et al. (2000), our compression procedure to transform the experimental image is summarized below.
Set the desired number of components and dimensionality of factors .
Estimate by fitting a parsimonious mixture model to .
Perform a model-based clustering according to the component membership of data point , which is decided by maximizing the posterior probability .
For each classified to , calculate
where is the estimated posterior mean of (6).
Reconstruct , , into a compressed color image.
In what follows, we assumed that the characteristic of each components in the mixture model are the same, which leads to the MFA model with common factor loadings (CUU structure). The Lena image data are fitted by the CUU structure with and using the PGMM and PTMM approaches. To evaluate the quality of the reconstructed image, we compute the root mean squared error (RMSE) and peak signal-to-noise ratio (PSNR), which are widely used metrics in the image coding field. For an original color image of size and a reconstructed image , the RMSE is defined as
and the other two quantities and are defined in the same fashion. The measure RMSE represents the average Euclidean distance from the original image to the output one. For a 24-bit colour image, the PSNR is given by
Note that the higher PSNR value indicates a compressed (i.e., reconstructed) image of better quality.
Experimental results are provided in Table 2, including the log-likelihoods, the number of parameters, and BIC values, together with the RMSE and PSNR for the compressed images. The performance of PTMM1 and PTMM2 are almost the same in terms of RMSE and PSNR for the compressed images. Notably, the BIC and PSNR values obtained from the PTMM approaches are much higher than the PGMM for all cases.
. Models with large BIC scores are preferred.
Figure 1 shows the original Lena image and three compressed images obtained by fitted the CUU models. The quality of the reconstructed images using the PTMMs is much better (i.e., smoother) than those reconstructed using the PGMM (for instance, observe the edge of Lena’s shoulder).
4.2 Compact facial representation
Over the past few decades, numerous template matching approaches, along with their modifications and variants, have been proposed for human perception system face recognition; see Zhao et al. (2003) for an extensive survey. Motivated by a technique developed by Sirovitch and Kirby (1987), the eigenface method (Turk and Pentland, 1991) has become the most popular tool for facial feature extraction in computer vision research. Its implementation is primarily based on the use of PCA, also known as the Karhunen-Loève transform. The central idea of PCA is to find a multilinear subspace whose orthonormal basis maximizes the scatter matrix of all the projected samples.
Although the eigenface method is remarkably simple and quite straightforward, its performance depends heavily on pose, lighting and background conditions. Because PCA is a one-sample method, a single set of eigenfaces may not be enough to represent face images having a large amount of variation. Some efforts have been made to generalize it to the mixture-of-eigenfaces method Kim et al. (2002), which generates more than one set of eigenfaces for better representation of the whole face. However, such a method will produce an over-parameterized solution in many circumstances. To capture the heterogeneity among the underlying face images while maintaining parsimony in modeling, we apply the PTMM approach with common factor loadings (CUU) as a novel strategy for robust facial representation.
Consider a training set of face images . Suppose that all images are exactly the same size; pixels in width by in height. Let be the transformed image vectors taking intensity values in a -dimensional image space, where . In what follows, we briefly review the eigenface method. Define the total scatter matrix of the sample images as
where is the mean image vector of all samples. In PCA, the optimal orthonormal basis vectors , where , are chosen to maximize the following objective function
where is the set of -dimensional vectors of corresponding to the largest eigenvectors, which are referred to as eigenfaces in Turk and Pentland (1991) because they have exactly the same dimension as the original images. Each face image vector can be represented as a linear combination of the best eigenvectors:
As another illustration, we compare the face reconstruction performance among the PGMM and PTMM approaches using the image compression algorithm described in the above example and the eigenface method. We use the face images in the Yale face database, which contains 165 grayscale images of 15 individuals in GIF format. There are 11 images per person with pose and lighting variations, but for illustrative purposes and simplicity we consider only 11 images of one individual. For each image, we manually select the centers of eyes, denoted by . The four boundaries of each cropped central part of the face are , , , and , which gives images of pixels as a sample in our experiment.
In this experiment, the 11 face images are trained by seven models: the PCA, PGMM, and PTMM1 models with CUU structure, where and . The reconstructed images can be obtained by following two steps.
If the entries of do not fall in the color space, such as , the reproductions are normalized by
Figure 2 shows the original and the reconstructed images obtained from the seven training models. PCA clearly has the worst performance, while the PGMM and PTMM lead to somewhat comparable results. For explicitly measuring the reconstruction quality, we further calculated the RMSE, denoted by , for each image and each model. Figure 3 displays the patterns of RMSE values. When comparing these models, smaller RMSEs indicate better reconstructions. In general, PTMM1 yields lower RMSE values than those from PGMM. As a result, we conclude that PTMM1 can be a prominent tool for facial coding.
5 Concluding remarks
We have utilized a class of parsimonious mixture models, called the PTMMs, which may create tremendous flexibility in robust clustering of high-dimensional data as they are relatively insensitive to outliers. This model-based tool allows practitioners to analyze heterogeneous multivariate data in a broad variety of considerations and works particularly well in high-dimensional settings. Numerical results show that the proposed PTMM approach performs reasonably well for the experimental image data. As pointed by by Zhao and Yu (2008) and Wang and Lin (2013), the convergence of the AECM algorithm can be painfully slow in certain situations. It is a worthwhile task to pursue some modified algorithms toward fast convergence.
- Ghahramani and Hinton (1997) Z. Ghahramani and G. E. Hinton, “The EM algorithm for factor analyzers”, Tech. Rep. CRG-TR-96-1, University Of Toronto, Toronto, 1997.
- McLachlan and Peel (2000a) G. J. McLachlan and D. Peel, “Mixtures of factor analyzers”, in “Proceedings of the Seventh International Conference on Machine Learning”, pp. 599–606. Morgan Kaufmann, San Francisco, 2000a.
- McLachlan and Peel (2000b) G. J. McLachlan and D. Peel, “Finite mixture models”, John Wiley & Sons, New York, 2000b.
- Fokoué and Titterington (2003) E. Fokoué and D. M. Titterington, “Mixtures of factor analysers. Bayesian estimation and inference by stochastic simulation”, Machine Learning 50 (2003) 73–94.
- McLachlan et al. (2002) G. J. McLachlan, R. W. Bean, and D. Peel, “A mixture model-based approach to the clustering of microarray expression data”, Bioinformatics 18 (2002), no. 3, 412–422.
- McLachlan et al. (2003) G. J. McLachlan, D. Peel, and R. W. Bean, “Modelling high-dimensional data by mixtures of factor analyzers”, Comput. Statist. Data Anal. 41 (2003), no. 3–4, 379–388.
- McNicholas and Murphy (2008) P. D. McNicholas and T. B. Murphy, “Parsimonious Gaussian mixture models”, Stat. Comput. 18 (2008) 285–296.
- McLachlan and Peel (1998) G. J. McLachlan and D. Peel, “Robust cluster analysis via mixtures of multivariate -distributions”, in “Lecture Notes in Computer Science”, vol. 1451, pp. 658–666. Springer-Verlag, Berlin, 1998.
- Peel and McLachlan (2000) D. Peel and G. J. McLachlan, “Robust mixture modeling using the distribution”, Stat. Comput. 10 (2000) 339–348.
- Shoham (2002) S. Shoham, “Robust clustering by deterministic agglomeration EM of mixtures of multivariate -distributions”, Pattern Recogn. 35 (2002), no. 5, 1127–1142.
- Lin et al. (2004) T. I. Lin, J. C. Lee, and H. F. Ni, “Bayesian analysis of mixture modelling using the multivariate distribution”, Stat. Comput. 14 (2004) 119–130.
- Lin et al. (2009) T. I. Lin, H. J. Ho, and P. S. Shen, “Computationally efficient learning of multivariate mixture models with missing information”, Comp. Stat. 24 (2009) 375–392.
- Wang et al. (2004) H. X. Wang, Q. B. Zhang, B. Luo, and S. Wei, “Robust mixture modelling using multivariate distribution with missing information”, Pattern Recogn. Lett. 25 (2004) 701–710.
- Greselin and Ingrassia (2010) F. Greselin and S. Ingrassia, “Constrained monotone EM algorithms for mixtures of multivariate distributions”, Stat. Comput. 20 (2010) 9–22.
- Andrews et al. (2011) J. L. Andrews, P. D. McNicholas, and S. Subedi, “Model-based classification via mixtures of multivariate -distributions”, Comput. Statist. Data Anal. 55 (2011), no. 1, 520–529.
- McLachlan et al. (2007) G. J. McLachlan, R. W. Bean, and L. B.-T. Jones, “Extension of the mixture of factor analyzers model to incorporate the multivariate -distribution”, Comput. Statist. Data Anal. 51 (2007), no. 11, 5327–5338.
- Andrews and McNicholas (2011a) J. L. Andrews and P. D. McNicholas, “Extending mixtures of multivariate -factor analyzers”, Stat. Comput. 21 (2011)a, no. 3, 361–373.
- Andrews and McNicholas (2011b) J. L. Andrews and P. D. McNicholas, “Mixtures of modified -factor analyzers for model-based clustering, classification, and discriminant analysis”, J. Statist. Plann. Inf. 141 (2011)b, no. 4, 1479–1486.
- Dempster et al. (1977) A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm”, J. Roy. Statist. Soc. B 39 (1977), no. 1, 1–38.
- Meng and Rubin (1993) X.-L. Meng and D. B. Rubin, “Maximum likelihood estimation via the ECM algorithm: a general framework”, Biometrika 80 (1993) 267–278.
- Liu and Rubin (1994) C. Liu and D. B. Rubin, “The ECME algorithm: a simple extension of EM and ECM with faster monotone convergence”, Biometrika 81 (1994) 633–648.
- Liu and Rubin (1995) C. Liu and D. B. Rubin, “ML estimation of the distribution using EM and its extensions, ECM and ECME”, Statistica Sinica 5 (1995) 19–39.
- Meng and van Dyk (1997) X.-L. Meng and D. van Dyk, “The EM algorithm — an old folk song sung to a fast new tune (with discussion)”, J. Roy. Statist. Soc. B 59 (1997) 511–567.
- Woodbury (1950) M. A. Woodbury, “Inverting modified matrices”, Princeton University, Princeton, New Jersey, 1950.
- Ueda et al. (2000) N. Ueda, R. Nakano, Z. Ghahramani, and G. E. Hinton, “SMEM algorithm for mixture models”, Neural Computation 12 (2000) 2109–2128.
- Zhao et al. (2003) W. Zhao, R. Chellappa, A. Rosenfeld, and P. J. Phillips, “Face recognition: A literature survey”, ACM Computing Surveys 35 (2003) 399–458.
- Sirovitch and Kirby (1987) L. Sirovitch and M. Kirby, “Low-dimensional procedure for the characterization of human faces”, J. Optical Soc. of Am. 2 (1987) 519–524.
- Turk and Pentland (1991) M. Turk and A. Pentland, “Eigenfaces for recognition”, J. Cognitive Neuroscience 3 (1991) 71–86.
- Kim et al. (2002) H. C. Kim, D. Lim, and S. Y. Bang, “Face recognition using the mixture-of-eigenfaces method”, Patt. Recog. Lett. 23 (2002) 1549–1558.
- Zhao and Yu (2008) J. H. Zhao and P. L. H. Yu, “Fast ML estimation for the mixture of factor analyzers via an ECM algorithm”, IEEE Trans. on Neu. Net. 19 (2008) 1956–1961.
- Wang and Lin (2013) W. L. Wang and T. I. Lin, “An efficient ecm algorithm for maximum likelihood estimation in mixtures of -factor analyzers”, Comput. Statist. DOI 10.1007/s00180-012-0327-z (2013).
A: ML estimation via the AECM algorithm
We discuss how to carry out the AECM algorithm for computing the ML estimates of the parameters in PTMMs.
To formulate the algorithm which consists of three cycles,
we partition the unknown parameters as , where contains ’s, contains
’s and ’s,
while contains ’s and ’s. Let be the collection of
latent group labels.
In the first cycle of the algorithm, we treat as missing data. So the complete data is
and our aim is to estimate
with and fixed at their current estimates and .
The log-likelihood of based on the complete data , apart from an additive constant, is given by
Conditioning on and , taking expectation for (9) leads to
Maximizing (10) with respect to , that restricted to , yields
At the second cycle, when estimating , we treat and as the missing data. The log-likelihood function of based on the complete data takes the form of
Conditioning on and , taking expectation for (11) leads to
and DG is the digamma function.
Maximizing (12) with respect to and yields
which is equivalently the solution of the following equation:
In the case of , we obtain as the solution of the following equation:
At the third cycle, when estimating , we treat , and as the missing data. The log-likelihood function of based on the complete data takes the form of
Therefore, the expectation of (13) conditioning on the observed data and the updated vales of is
where , and . The resulting CM-steps for the updates of and under eight constrained/unconstrained situations are given in the next Section. Note that the above procedure is also applicable for the tFA model by treating .
B: Calculations for eight parsimonious models
To facilitate the derivation, we adopt the following notation:
B.1: Model CCC
For model CCC, we have , . The -function can be expressed as
where and . Differentiating with respect to B and , respectively, and set the derivations equal to zero, we obtain
B.2: Model CCU
For model CCU, we have , . The -function can be expressed as
where . Differentiating with respect to B and , respectively, and set the results equal to zero, we obtain
B.3: Model CUC
For model CUC, we have , . The -function can be expressed as
where and . Differentiating with respect to B and , respectively, and set the results equal to zero, we obtain
B.4: Model CUU
For model CUU we have . The -function can be expressed as