Mixtures of Skewt Factor Analyzers
Abstract
In this paper, we introduce a mixture of skew factor analyzers as well as a family of mixture models based thereon. The mixture of skew distributions model that we use arises as a limiting case of the mixture of generalized hyperbolic distributions. Like their Gaussian and distribution analogues, our mixture of skew factor analyzers are very wellsuited to the modelbased clustering of highdimensional data. Imposing constraints on components of the decomposed covariance parameter results in the development of eight flexible models. The alternating expectationconditional maximization algorithm is used for model parameter estimation and the Bayesian information criterion is used for model selection. The models are applied to both real and simulated data, giving superior clustering results compared to a wellestablished family of Gaussian mixture models.
1 Introduction
Modelbased clustering employs finite mixture models to estimate the group memberships of a given set of unlabelled observations. A finite mixture model has density of the form , where , with , are the mixing proportions, is the vector of parameters, and is the th component density. Traditionally, modelbased clustering has been performed with the component densities taken to be multivariate Gaussian. However, there has recently been significant interest in nonGaussian approaches to mixture modelbased clustering (e.g., McLachlan and Peel, 1998; McLachlan et al., 2007; Karlis and Meligkotsidou, 2007; Lin, 2009; Browne et al., 2012; Lee and McLachlan, 2012; Franczak et al., 2012; Vrbik and McNicholas, 2012; Morris and McNicholas, 2013; Morris et al., 2013).
Browne and McNicholas (2013) introduced a mixture of generalized hyperbolic distributions with model density , where
(1) 
is the density of the generalized hyperbolic distribution, is a vector of parameters, and is the squared Mahalanobis distance between and . As we will discuss herein (Section 3), the skew distribution can be obtained as a special case of the generalized hyperbolic distribution (cf. BarndorffNielsen and Shephard, 2001).
Several alternative forms of the multivariate skew distribution have appeared in the literature (e.g., Sahu et al., 2003; Branco and Dey, 2001; Jones and Faddy, 2003; Ma and Genton, 2004). However, the form of the distribution arising from the generalized hyperbolic distribution is chosen for this work because of its particularly attractive form, i.e., because its form is particularly conducive to the development of skew factors (cf. Section 3.1). Furthermore, this form of the multivariate skew distribution has computational advantages (cf. Section 3.2) and it has not previously been used for modelbased clustering.
In this paper, we use this version of the skew distribution to introduce a skew analogue of the mixture of factor analyzers model, as well as a family of parsimonious models based thereon. The remainder of this paper is laid out as follows. In Section 2, we go through some important background material. Then we introduce our methodology, and illustrate our approach on real and simulated data (Section 4). The paper concludes with a discussion and suggestions for future work (Section 5).
2 Background
2.1 Mixtures of Factor Analyzers
The factor analysis model (Spearman, 1904) assumes that a dimensional vector of observed variables can be modelled by a dimensional vector of latent factors where , making it useful in the analysis of highdimensional data. We can write the factor analysis model as , where is a matrix of factor loadings, is the vector of factors, and with diag. Under this model, the marginal distribution of is .
McNicholas and Murphy (2008) introduced a family of eight parsimonious Gaussian mixture models (PGMM) that are suitable for clustering high dimensional data. These models include both mixtures of probabilistic principal component analyzers (Tipping and Bishop, 1999) as well as mixtures of factor analyzers models (Ghahramani and Hinton, 1997; McLachlan and Peel, 2000) as special cases. The most general such model is a Gaussian mixture model component covariance structure . By allowing the constraints , , and the isotropic constraint , McNicholas and Murphy (2008) introduced the eight models which comprise the PGMM family. In addition to introducing a mixture of skew distributions model herein, we also implement a family of mixtures of skew factor analyzers that is analogous to the PGMM family (Table 1). We refer to this family as parsimonious mixtures of skew factor analyzers (PMSTFA).
ID  Loading Matrix  Error Variance  Isotropic  Free Covariance Parameters 

CCC  Constrained  Constrained  Constrained  
CCU  Constrained  Constrained  Unconstrained  
CUC  Constrained  Unconstrained  Constrained  
CUU  Constrained  Unconstrained  Unconstrained  
UCC  Unconstrained  Constrained  Constrained  
UCU  Unconstrained  Constrained  Unconstrained  
UUC  Unconstrained  Unconstrained  Constrained  
UUU  Unconstrained  Unconstrained  Unconstrained 
2.2 Generalized Inverse Gaussian Distribution
We write to indicate that the random variable follows a generalized inverse Gaussian (GIG) distribution (Good, 1953; BarndorffNielsen and Halgreen, 1977; Blæsild, 1978; Halgreen, 1979; Jørgensen, 1982) with parameters . The density of is given by
(2) 
for , where , , and is the modified Bessel function of the third kind with index . The generalized inverse Gaussian distribution has some attractive features, including the tractability of the following expected values:
where and is the modified Bessel function of the third kind with index .
2.3 AECM Algorithm
The expectationmaximization (EM) algorithm (Dempster et al., 1977) is commonly used for parameter estimation in modelbased clustering. This iterative algorithm facilitates maximum likelihood estimation in the presence of incomplete data, making it wellsuited for clustering problems. The Estep of the algorithm consists of computing the expected value of the completedata loglikelihood based on the current model parameters and the Mstep maximizes this expected value with respect to the model parameters. The EM algorithm iterates between these two steps until some convergence criterion is met.
The expectationconditional maximization (ECM) algorithm (Meng and Rubin, 1993) is a variant of the EM algorithm that replaces the Mstep by multiple conditionalmaximization (CM) steps and is useful when the completedata log likelihood is relatively complicated. The alternating expectationconditional maximization (AECM) algorithm (Meng and van Dyk, 1997) is a further extension that allows the source of incomplete data to change between the CM steps. An AECM algorithm is used for parameter estimation for members of the PGMM family because there are two sources of missing data: the component membership labels and the latent factors . Note that we define such that if observation is in component and otherwise, for and . Similarly, we will use the AECM algorithms for parameter estimation for the models used herein (Section 3.2).
3 Methodology
3.1 A Mixture of Skew Factor Analyzers
The density of a skew distribution can obtained as a limiting case of the generalized hyperbolic distribution by setting and , and letting . A dimensional skew random variable arising in this way has density
(3) 
where is the location, is the scale matrix, is the skewness, and is the value for degrees of freedom. We write to denote that the random variable follows the skew distribution such that it has the density given in (3). Now, can be obtained through the relationship
(4) 
where and , where denotes the inverse Gamma distribution. We have and so, from Bayes’ theorem, .
Our skew factor analysis model arises by setting
(5) 
where is a matrix of factor loadings, is the vector of factors, and with diag. From (4) and (5), we can write our skew factor analysis model as
(6) 
The marginal distribution of arising from the model in (6) is . Accordingly, our mixture of skew factor analyzers model has density
where again denotes all model parameters.
3.2 Parameter Estimation
Parameter estimation is carried out within the AECM algorithm framework. In addition to the GIG expected value formulae given in Section 2.2, we need the expected value of the component labels, i.e.,
as well as the following conditional expectations, which are similar to those used by McNicholas and Murphy (2008) and Andrews and McNicholas (2011):
where .
In the first stage of our AECM algorithm, the completedata consist of the data , the group membership labels , and the latent , for , . Hence, the completedata loglikelihood at this stage is
By maximizing the expected value of the completedata loglikelihood, we obtain the parameter updates:
where , , , , , and . Note that these are analogous to the updates given by Browne and McNicholas (2013). The update for degrees of freedom does not exist in closed form and the equation
must be solved numerically to obtain the updated value of .
On the second CMstep, the completedata consist of the data , the group membership labels , the latent , and the latent factors , for and . The expected value of the resulting completedata loglikelihood is maximized to find updates for the parameters and . The exact nature of this completedata loglikelihood and the resulting parameter estimates will depend on which of the eight models in the PMSTFA family is under consideration. These parameter updates are analogous to those given by McNicholas and Murphy (2008) in the Gaussian case but now the ‘sample covariance’ matrix takes the more complicated form
Note that the form of the skew distribution introduced in Section 3.1 has a computational simplicity not present in other versions of the skew distribution that have been used for modelbased clustering (cf. Lee and McLachlan, 2013). The elegant parameter estimation for our skew distribution arises because of the properties of the GIG distribution (Section 2.2). Note that, in the same way described by (McNicholas et al., 2010), the Woodbury identity (Woodbury, 1950) can be used to avoid inverting any nondiagonal matrices. This latter trick is particularly useful with dealing with higher dimensional data.
3.3 Model Selection
The Bayesian information criterion (BIC; Schwarz, 1978) is used to select the best model in terms of number of mixture components, number of latent factors, and the covariance structure for each member of the PMSTFA family. The BIC is defined as BIC2 log , where is the maximized loglikelihood, is the maximum likelihood estimate of the model parameters , is the number of free parameters in the model, and is the number of observations. Support for the use of the BIC in mixture model selection is given by Campbell et al. (1997) and Dasgupta and Raftery (1998), while Lopes and West (2004) provide support for its use in selecting the number of latent factors.
4 Illustrations
4.1 Performance Assessment
Although all of our illustrations are carried out as bona fide cluster analysis, i.e., no knowledge of labels is assumed, the true labels are known and can be used to asses the classification performance of our PMSTFA models. The adjusted Rand index (ARI; Hubert and Arabie, 1985) was used to assess model performance in this study. The ARI indicates class agreement between the true and estimated group memberships and in doing so, accounts for the fact that random classification would almost certainly classify some observations correctly by chance. An ARI value of 1 indicates perfect classification, a value of 0 would be expected under random classification, and a negative ARI value indicates classification which is in some sense worse than one would expect under random classification.
4.2 Initialization
For all analyses performed herein, means clustering was performed on the data and the resulting cluster memberships were used to initialize the . The degrees of freedom were initialized at and the skewness parameters were initialized to be close to zero. The estimated group means and the sample covariance matrices were initialized based on the initial values, and the matrices and were initialized following McNicholas and Murphy (2008).
4.3 Simulation Study
A thirteendimensional toy data set was simulated from a twocomponent skew mixture model, with and . The two components are very well separated but are nonelliptical (cf. Figure 1). The PMSTFA models were applied to these simulated data for groups and factors using means starting values. The model with the highest BIC () was the CCC model with groups and latent factors. This model obtained perfect clustering results (AR1=1.00).
For comparison, the PGMM family was also fit to the simulated data using the pgmm package (McNicholas et al., 2011) for the R software (R Core Team, 2013). The model with the highest BIC () was the CUC model with components and latent factor (ARI=0.749). Note that the classification results obtained by this model (Table 2) could be considered correct if the second and third components were merged (cf. Figure 1). This nicely illustrates that Gaussian mixtures — in this case, a special case of a mixture of factor analyzers model — can sometimes capture nonelliptical components via a posteriori merging of components (cf. Figure 2).
Predicted  

1  2  3  
True 1  30  0  0 
True 2  0  17  13 
4.4 Australian Institute of Sport Data
The Australian Institute of Sport (AIS) data contains measured variables on 102 male and 100 female athletes. We consider body fat percentage and body mass index (BMI), taking gender to be unknown. The PMSTFA models were fit to the AIS data for with the number of latent factors fixed at . The model with the highest BIC () was the component CCC model (ARI=0.811). The classification results for this model are given in Table 3.
PMSTFA  PGMM  

1  2  1  2  3  
Male  97  5  82  16  4 
Female  5  95  1  9  90 
The PGMM models were also fit to the AIS data. The model with the highest value of the BIC () was the component UUC model (ARI=0.685). Comparing the true and predicted classifications (Table 3), we see that merging components still results in inferior classification performance when compared to the results from the best PMSTFA model (Table 3). The scatter plots in Figure 3 reinforce this point; here, we can see that the best PGMM model has effectively fitted a noise component (i.e., the green component) to pick up points that do not neatly fit into one of the other components. While we have previously demonstrated (Figure 2) that Gaussian mixtures can sensibly use multiple components to model a single asymmetric cluster, a different situation has emerged here. The PGMM solution on our AIS example illustrates that the imposition of a Gaussian component density is ipso facto a stringent constraint that can lead to classification results that are not sensible in any practical sense.
Note that this real data example was chosen to illustrate our PMSFTA models outperforming their Gaussian analogues in a fashion that cannot be overcome by merging Gaussian components. Our chosen model, using latent factor, has same number of misclassifications obtained by Lee and McLachlan (2012) using mixtures of skew distributions. As reported by Vrbik and McNicholas (2012), this number may decrease if deterministic annealing (Zhou and Lange, 2010) is used.
4.5 Yeast Data
The PMSTFA models were applied to data containing information on cellular localization sites of 1,484 yeast proteins. The data contains measured variables including the McGeoch’s method for signal sequence recognition (mcg), the score of the ALOM membrane spanning region prediction program (alm), and the score of discriminant analysis of amino acid content of vacuolar and extracellular proteins (vac). The data are available in the UCI Machine Learning Repository with a detailed description of the data given by Nakai and Kanehisa (1991, 1992). For this analysis, we considered two localization sites, CYT and ME3, and clustered the data based on the three variables described above.
The PMSTFA models were applied for groups with latent factors from means starting values. The best model in terms of BIC () was the component model (). The PGMM models were also fit to the data. The model with the highest BIC () was the =3 component UUC model (). Crosstabulations of the true and estimated classifications for both the PMSTFA and PGMM models are given in Table 4. Note that for the yeast data analysis, as for the AIS data analysis, the best possible merging of Gaussian components still fails to outperform our PMSTFA models (cf. Table 4 and Figure 4).
PMSTFA  PGMM  

1  2  1  2  3  
CYT  453  10  391  64  8 
ME3  13  150  16  4  143 
5 Conclusion
We have introduced a mixture of skew factor analyzers model as well as a family of mixture models based thereon, i.e., the PMSTFA family. This is the first use of a mixture of skewed factor analyzers within the literature. These models were developed using a form of the skew distribution that had not previously been used for modelbased clustering. This form of the skew distribution arises as a special case of the generalized hyperbolic distribution and offers a natural representation for skew factor analyzers. Furthermore, borrowing attractive features of the GIG distribution, the form of the skew distribution we use lends itself to elegant parameter estimation via an AECM algorithm.
We illustrated our PMSTFA models on real and simulated data, comparing them to their Gaussian analogous. The real data examples illustrate our PMSTFA family outperforming the PGMM family in ways that cannot be mitigated by merging Gaussian components. Note that the data we used to illustrate our PMSFTA models were low dimensional; however, our models are well suited to high dimensional applications by dint of the fact that the number of covariance parameters is linear in data dimensionality for all eight models. Developing a mixtures of skew distributions approach to the analysis of longitudinal data, after the fashion of McNicholas and Murphy (2010) and McNicholas and Subedi (2012), is a subject of ongoing work. Finally, although used for clustering herein, our PMSTFA models can also be used for modelbased classification (see McNicholas, 2010, for the PGMM analogue) or discriminant analysis (Hastie and Tibshirani, 1996).
References
 Andrews and McNicholas (2011) Andrews, J. L., McNicholas, P. D., 2011. Extending mixtures of multivariate factor analyzers. Statistics and Computing 21 (3), 361–373.
 BarndorffNielsen and Halgreen (1977) BarndorffNielsen, O., Halgreen, C., 1977. Infinite divisibility of the hyperbolic and generalized inverse Gaussian distributions. Z. Wahrscheinlichkeitstheorie Verw. Gebiete 38, 309–311.
 BarndorffNielsen and Shephard (2001) BarndorffNielsen, O., Shephard, N., 2001. NonGaussian OrnsteinUhlenbeckbased models and some of their uses in financial economics. Journal of the Royal Statistical: Society B 63, 167–241.
 Blæsild (1978) Blæsild, P., 1978. The shape of the generalized inverse Gaussian and hyperbolic distributions. Research Report 37, Department of Theoretical Statistics, Aarhus University, Denmark.
 Branco and Dey (2001) Branco, M., Dey, D., 2001. A general class of multivariate skewelliptical distributions. Journal of Multivariate Analysis 79, 99–113.
 Browne and McNicholas (2013) Browne, R. P., McNicholas, P. D., 2013. A mixture of generalized hyperbolic distributions. arXiv preprint arXiv:1305.1036v1.
 Browne et al. (2012) Browne, R. P., McNicholas, P. D., Sparling, M. D., 2012. Modelbased learning using a mixture of mixtures of Gaussian and uniform distributions. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (4), 814–817.
 Campbell et al. (1997) Campbell, J., Fraley, C., Murtagh, F., Raftery, A., 1997. Linear flaw detection in woven textiles using modelbased clustering. Pattern Recognition Letters 18, 1539–1548.
 Dasgupta and Raftery (1998) Dasgupta, A., Raftery, A. E., 1998. Detecting features in spatial point processed with clutter via modelbased clustering. Journal of the American Statistical Association 93, 294–302.
 Dempster et al. (1977) Dempster, A. P., Laird, N. M., Rubin, D. B., 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B 39 (1), 1–38.
 Franczak et al. (2012) Franczak, B., Browne, R. P., McNicholas, P. D., 2012. Mixtures of shifted asymmetric Laplace distributions. arXiv:1207.1727v3.
 Ghahramani and Hinton (1997) Ghahramani, Z., Hinton, G., 1997. The EM algorithm for factor analyzers. Technical Report CRGTR961, University of Toronto, Toronto.
 Good (1953) Good, I, J., 1953. The population frequencies of species and the estimation of population parameters. Biometrika 40, 237–260.
 Halgreen (1979) Halgreen, C., 1979. Selfdecomposibility of the generalized inverse Gaussian and hyperbolic distributions. Z. Wahrscheinlichkeitstheorie Verw. Gebiete 47, 13–18.
 Hastie and Tibshirani (1996) Hastie, T., Tibshirani, R., 1996. Discriminant analysis by gaussian mixtures. Journal of the Royal Statistical Society: Series B 58, 155–176.
 Hubert and Arabie (1985) Hubert, L., Arabie, P., 1985. Comparing partitions. Journal of Classification 2, 193–218.
 Jones and Faddy (2003) Jones, M., Faddy, M., 2003. A skew extension of the distribution, with applications. Journal of the Royal Statistical Society B 65, 159–174.
 Jørgensen (1982) Jørgensen, B., 1982. Statistical Properties of the Generalized Inverse Gaussian Distribution. SpringerVerlag, New York.
 Karlis and Meligkotsidou (2007) Karlis, D., Meligkotsidou, L., 2007. Finite mixtures of multivariate poisson distributions with application. Journal of Statistical Planning and Inference 137 (6), 1942–1960.
 Lee and McLachlan (2012) Lee, S., McLachlan, G. J., 2012. On the fitting of mixtures of multivariate skew tdistributions via the EM algorithm. arXiv:1109.4706v2.
 Lee and McLachlan (2013) Lee, S. X., McLachlan, G. J., 2013. On mixtures of skew normal and skew tdistributions. Advances in Data Analysis and Classification To appear.
 Lin (2009) Lin, T.I., 2009. Maximum likelihood estimation for multivariate skew normal mixture models. Journal of Multivariate Analysis 100, 257–265.
 Lopes and West (2004) Lopes, H. F., West, M., 2004. Bayesian model assessment in factor analysis. Statistica Sinica 14, 41–67.
 Ma and Genton (2004) Ma, Y., Genton, M., 2004. A flexible class of skewsymmetric distributions. Scandinavian Journal of Statistics 31, 459–468.
 McLachlan et al. (2007) McLachlan, G. J., Bean, R. W., Jones, L. B.T., 2007. Extension of the mixture of factor analyzers model to incorporate the multivariate tdistribution. Computational Statistics and Data Analysis 51 (11), 5327–5338.
 McLachlan and Peel (1998) McLachlan, G. J., Peel, D., 1998. Robust cluster analysis via mixtures of multivariate tdistributions. In: Lecture Notes in Computer Science. SpringerVerlag, pp. 658–666.
 McLachlan and Peel (2000) McLachlan, G. J., Peel, D., 2000. Mixtures of factor analyzers. In: Seventh International Conference on Machine Learning. San Francisco.
 McNicholas (2010) McNicholas, P. D., 2010. Modelbased classification using latent Gaussian mixture models. Journal of Statistical Planning and Inference 140 (5), 1175–1181.
 McNicholas et al. (2011) McNicholas, P. D., Jampani, K. R., McDaid, A. F., Murphy, T. B., Banks, L., 2011. pgmm: Parsimonious Gaussian Mixture Models. R package version 1.0.
 McNicholas and Murphy (2008) McNicholas, P. D., Murphy, T. B., 2008. Parsimonious Gaussian mixture models. Statistics and Computing 18, 285–296.
 McNicholas and Murphy (2010) McNicholas, P. D., Murphy, T. B., 2010. Modelbased clustering of longitudinal data. The Canadian Journal of Statistics 38 (1), 153–168.
 McNicholas et al. (2010) McNicholas, P. D., Murphy, T. B., McDaid, A. F., Frost, D., 2010. Serial and parallel implementations of modelbased clustering via parsimonious Gaussian mixture models. Computational Statistics and Data Analysis 54 (3), 711–723.
 McNicholas and Subedi (2012) McNicholas, P. D., Subedi, S., 2012. Clustering gene expression time course data using mixtures of multivariate tdistributions. Journal of Statistical Planning and Inference 142 (5), 1114–1127.
 Meng and Rubin (1993) Meng, X.L., Rubin, D., 1993. Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika 80, 267–278.
 Meng and van Dyk (1997) Meng, X.L., van Dyk, D., 1997. The EM algorithm an old folk song sung to a fast new tune (with discussion). Journal of the Royal Statistical Society Series B 59, 511–567.
 Morris and McNicholas (2013) Morris, K., McNicholas, P. D., 2013. Dimension reduction for modelbased clustering via mixtures of shifted asymmetric Laplace distributions. Statistics and Probability Letters 83 (9), 2088–2093.
 Morris et al. (2013) Morris, K., McNicholas, P. D., Scrucca, L., 2013. Dimension reduction for modelbased clustering via mixtures of multivariate tdistributions. Advances in Data Analysis and Classification To appear.
 Nakai and Kanehisa (1991) Nakai, K., Kanehisa, M., 1991. Expert system for predicting protein localization sites in gramnegative bacteria. Proteins: Structure, Function, and Bioinformatics 11 (2), 95–110.
 Nakai and Kanehisa (1992) Nakai, K., Kanehisa, M., 1992. A knowledge base for predicting protein localization sites in eukaryotic cells. Genomics 14, 897–911, mEDLINE Abstract.

R Core Team (2013)
R Core Team, 2013. R: A Language and Environment for Statistical Computing. R
Foundation for Statistical Computing, Vienna, Austria.
URL http://www.Rproject.org  Sahu et al. (2003) Sahu, S., Dey, D., Branco, M., 2003. A new class of multivariate skew distributions with application to Bayesian regression models. Canadian Journal of Statistics 31, 129–150.
 Schwarz (1978) Schwarz, G., 1978. Estimating the dimension of a model. Annals of Statistics 6, 461–464.
 Spearman (1904) Spearman, C., 1904. The proof and measurement of association between two things. Journal of Statistical Planning and Inference 15, 72–101.
 Tipping and Bishop (1999) Tipping, T., Bishop, C., 1999. Mixtures of probabilistic component analyzers. Neural Computation 11 (2), 443–482.
 Vrbik and McNicholas (2012) Vrbik, I., McNicholas, P. D., 2012. Analytic calculations for the EM algorithm for multivariate skewmixture models. Statistics and Probability Letters 82 (6), 1169–1174.
 Woodbury (1950) Woodbury, M., 1950. Inverting modified matrices. Tech. Rep. 42, Princeton University, Princeton, N.J.
 Zhou and Lange (2010) Zhou, H., Lange, K. L., 2010. On the bumpy road to the dominant mode. Scandinavian Journal of Statistics 37 (4), 612–631.