Adaptive Mixtures of Factor Analyzers
Abstract
A mixture of factor analyzers is a semiparametric density estimator that generalizes the well-known mixtures of Gaussians model by allowing each Gaussian in the mixture to be represented in a different lower-dimensional manifold. This paper presents a robust and parsimonious model selection algorithm for training a mixture of factor analyzers, carrying out simultaneous clustering and locally linear, globally nonlinear dimensionality reduction. By permitting a different number of factors per mixture component, the algorithm adapts the model complexity to the data complexity. We compare the proposed algorithm with related automatic model selection algorithms on a number of benchmarks. The results indicate the effectiveness of this fast and robust approach in clustering, manifold learning and class-conditional modeling.
keywords:
IMoFA, AMoFA, mixture models, clustering, model selection, dimensionality reduction, covariance modeling, mixture of factor analyzers
1 Introduction
Mixture models are widely used in various domains of machine learning and signal processing for supervised, semi-supervised and unsupervised tasks Moerland00dissertation (); mclachlan2000finite (). However, model selection remains one of the main challenges, and there is a need for efficient and parsimonious automatic model selection methods Jain2010 ().
Let $\mathbf{x}$ denote a random variable in $\mathbb{R}^d$. A mixture model represents the distribution of $\mathbf{x}$ as a mixture of $K$ component distributions:
$$p(\mathbf{x}) = \sum_{k=1}^{K} p(\mathbf{x} \mid G_k)\, p(G_k) \qquad (1)$$
where $G_k$ correspond to components, and $p(G_k)$ are the prior probabilities of the components. The priors $\pi_k = p(G_k)$ are also called the mixture proportions, and sum up to unity. The likelihood term, expressed by $p(\mathbf{x} \mid G_k)$, can be modeled by any distribution. In this paper we focus on Gaussians:
$$p(\mathbf{x} \mid G_k) = \mathcal{N}(\mathbf{x};\, \mu_k, \Sigma_k) \qquad (2)$$
where $\mu_k$ and $\Sigma_k$ denote the mean and covariance of the $k$-th component distribution, respectively. The number of parameters in the model is primarily determined by the dimensionality of the covariance matrix, which scales quadratically with the feature dimensionality $d$. When this number is large, overfitting becomes an issue. Indeed, one of the most important problems of model-based clustering methods is that they are over-parametrized in high-dimensional spaces bouveyron2014model (). One way of keeping the number of parameters small is to constrain the covariance matrices to be tied (shared) across components, which assumes similarly shaped distributions in the data space, and is typically unjustified. Another approach is to assume that each covariance matrix is diagonal or spherical, but this means valuable correlation information will be discarded.
It is possible to keep a low number of parameters for the model without sacrificing correlation information by adopting a factor analysis approach. Factor Analysis (FA) is a latent variable model, which assumes the observed variables are linear projections of a small number of independent factors with additive Gaussian noise:
$$\mathbf{x} - \mu = \Lambda \mathbf{z} + \epsilon \qquad (3)$$
where $\mathbf{z}$ is the $p$-dimensional vector of factors, $\Lambda$ is a $d \times p$ factor loading matrix and $\Psi$, the covariance of the noise $\epsilon$, is a diagonal uniquenesses matrix representing the common sensor noise. Consequently, the covariance matrix in Eq. 2 is expressed as $\Sigma_k = \Lambda_k \Lambda_k^T + \Psi_k$, effectively reducing the number of parameters from $O(d^2)$ to $O(dp)$, with $p \ll d$. If each Gaussian component is expressed in a latent space, the result is a mixture of factor analyzers (MoFA).
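The saving can be made concrete with a small count. The sketch below is illustrative only: the function names are hypothetical, and the factor analyzer count ignores the rotational redundancy of the loading matrix, so it is an upper bound rather than an exact number of free parameters.

```python
def full_covariance_params(d: int) -> int:
    """Free parameters of a symmetric d x d covariance matrix."""
    return d * (d + 1) // 2


def factor_analyzer_params(d: int, p: int) -> int:
    """Entries of the d x p loading matrix plus the d diagonal
    entries of the uniquenesses matrix (rotation not discounted)."""
    return d * p + d


d = 100  # hypothetical feature dimensionality
for p in (1, 5, 10):
    print(p, factor_analyzer_params(d, p), full_covariance_params(d))
```

For d = 100, a full covariance matrix has 5050 free parameters, whereas a factor analyzer with p = 5 factors needs at most 600 for its covariance model.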
Given a set of data points, there exist Expectation-Maximization (EM) approaches to train MoFA models mclachlan2000finite (); Ghahramani97EM4MFA (), but these approaches require the specification of hyperparameters such as the number of clusters and the number of factors per component. For the model selection problem of MoFA, an incremental algorithm (IMoFA) was proposed in Salah04imofa (), where factors and components were added to the mixture one by one. The model complexity was monitored on a separate validation set.
In this study, we propose a fast and parsimonious model selection algorithm called Adaptive Mixture of Factor Analyzers (AMoFA). Similar to IMoFA, AMoFA is capable of adapting a mixture model to data by selecting an appropriate number of components and factors per component. However, the proposed AMoFA algorithm addresses two shortcomings of the IMoFA approach: 1) Instead of relying on a validation set, AMoFA uses a Minimum Message Length (MML) based criterion to control model complexity, thereby making more samples available for training in practice. 2) AMoFA is capable of removing factors and components from the mixture when necessary. We test the proposed AMoFA approach on several benchmarks, comparing its performance with IMoFA, with a variational Bayesian MoFA approach Ghahramani1999 (), as well as with the popular Gaussian mixture model selection approach based on MML, introduced by Figueiredo and Jain Figueiredo2002 (). We show that the proposed approach is parsimonious and robust, and especially useful for high-dimensional problems.
2 Related Work
There are numerous studies on mixture model class selection. These include using information-theoretical trade-offs between likelihood and model complexity AkaikeAIC (); SchwarzBIC (); Rissanen1983 (); Rissanen:InCSM2007 (); Wallace87MML (), greedy approaches verbeek2003efficient (); Salah04imofa () and full Bayesian treatment of the problem Ghahramani1999 (); Rasmussen2000 (); GomesIncLearnNonpBayMMCVPR08 (); Shi2011 (). A brief review of related automatic model selection methods is given in Table 1; a detailed treatment can be found in bouveyron2014model (). Here we provide some detail on the automatic model selection methods that are most closely related to our work.
Table 1. Related automatic model selection methods.

| Work | Model Selection | Approach |
|---|---|---|
| Ghahramani & Beal (1999) Ghahramani1999 () | Variational Bayes | Incremental |
| Pelleg & Moore (2000) Pelleg2000XM () | MDL | Incremental |
| Rasmussen (2000) Rasmussen2000 () | MC for DPMM | Both |
| Figueiredo & Jain (2002) Figueiredo2002 () | MML | Decremental |
| Verbeek et al. (2003) verbeek2003efficient () | Fixed iteration | Incremental |
| Law et al. (2004) LawFigJainPAMI04 () | MML | Decremental |
| Zivkovic & v.d. Heijden (2004) Zivkovic04RULFMM () | MML | Decremental |
| Salah & Alpaydin (2004) Salah04imofa () | Cross Validation | Incremental |
| Shi & Xu (2006) ShiMoFAICANN06 () | Bayesian Yin-Yang | Both |
| Constantinopoulos et al. (2007) ConstantinopoulosVarComSplitNN07 () | Variational Bayes | Incremental |
| Gomes et al. (2008) GomesIncLearnNonpBayMMCVPR08 () | Variational DP | Incremental |
| Boutemedjet et al. (2009) Boutemedjet2009 () | MML | Decremental |
| Gorur & Rasmussen (2009) Gorur09SIUNonParMFA () | MC for DPMM | Both |
| Shi et al. (2011) Shi2011 () | Bayesian Yin-Yang | Both |
| Yang et al. (2012) yang2012robust () | Entropy Min. | Decremental |
| Iwata et al. (2012) Iwata1206.1846 () | MC for DPMM | Both |
| Fan & Bouguila (2013) FanVBCSNeuCom2013 () | Variational DP | Both |
| Fan & Bouguila (2014) FanOVBDMFSNeuCom2014 () | Variational Bayes | Incremental |
| Kersten (2014) Kersten2014 () | MML | Decremental |
In one of the most popular model selection approaches for Gaussian mixture models (GMMs), Figueiredo and Jain proposed to use an MML criterion for determining the number of components in the mixture, and showed that their approach is equivalent to assuming Dirichlet priors for mixture proportions Figueiredo2002 (). In their method, a large number of components (typically 25-30) is fit to the training set, and these components are eliminated one by one. At each iteration, the EM algorithm is used to find a converged set of model parameters. The algorithm generates and stores all intermediate models, and selects the one that optimizes the MML criterion.
The primary drawback of this approach is the curse of dimensionality. For a $d$-dimensional problem, fitting a single full-covariance Gaussian requires $O(d^2)$ parameters, which typically forces the algorithm to restrict its models to diagonal covariances in practice. We demonstrate empirically that this approach (unsupervised learning of finite mixture models, ULFMM) does not perform well on high-dimensional problems, despite its widespread use in the literature.
Using the parsimonious factor analysis representation described in Section 1, it is possible to explore many models that lie between full-covariance and diagonal Gaussian mixtures in their number of parameters. The resulting mixture of factor analyzers (MoFA) can be considered as a noise-robust version of the mixtures of probabilistic principal component analyzers (PPCA) approach Tipping99MPPCA (). Figure 1 summarizes the relations between the mixture representations in this area.
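If we assume that the latent variables of each component in a MoFA model are distributed unit normal ($\mathcal{N}(0, I)$) in the latent space, the corresponding data in the feature space are also Gaussian distributed:
$$p(\mathbf{x} \mid \mathbf{z}, G_k) = \mathcal{N}(\mu_k + \Lambda_k \mathbf{z},\, \Psi_k) \qquad (4)$$
where $\mathbf{z}$ denotes the latent factor values. The mixture distribution of factor analyzers is then given as Ghahramani97EM4MFA ():
$$p(\mathbf{x}) = \sum_{k=1}^{K} \int p(\mathbf{x} \mid \mathbf{z}, G_k)\, p(\mathbf{z} \mid G_k)\, p(G_k)\, d\mathbf{z} \qquad (5)$$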
The EM algorithm is used to find maximum likelihood solutions to latent variable models Dempster77MLEM (), and it can be used for training a MoFA Ghahramani97EM4MFA (). Since EM does not address the model selection problem, it requires the number of components and factors per component to be fixed beforehand.
Ghahramani and Beal Ghahramani1999 () have proposed a variational Bayes scheme (VBMoFA) for model selection in MoFA, which allows the local dimensionality of components and their total number to be automatically determined. In this study, we use VBMoFA as one of the benchmarks.
To alleviate the computational complexity of the variational approach, a greedy model selection algorithm was proposed by Salah and Alpaydın Salah04imofa (). This incremental approach (IMoFA) starts by fitting a single-component, single-factor model to the data and adds factors and components in each iteration using fast heuristic measures until a convergence criterion is met. The algorithm allows components to have as many factors as necessary, and uses a validation set to stop model adaptation, as well as to avoid overfitting. This is the third algorithm we compare with the proposed approach, which we describe in detail next.
3 Adaptive Mixtures of Factor Analyzers
We briefly summarize the proposed adaptive mixtures of factor analyzers (AMoFA) algorithm first, and then describe its details. Given a dataset $X$ with $N$ data points in $d$ dimensions, the AMoFA algorithm is initialized by fitting a 1-component, 1-factor mixture model. Here, the factor is initialized from the leading eigenvector of the covariance matrix, i.e., the principal component of the data. At each subsequent step, the algorithm considers adding more components and factors to the mixture, running EM iterations to find a parametrization. During the M-step of EM, an MML criterion is used to determine whether any weak components should be annihilated. Apart from this early component annihilation, the algorithm incorporates a second decremental scheme. When the incremental part of the algorithm no longer improves the MML criterion, a downsizing component annihilation process is initiated and all components are eliminated one by one. Similar to ULFMM, each intermediate model is stored, and the algorithm outputs the one giving the minimum message length. Figure 2 summarizes the proposed algorithm.
3.1 The Generalized Message Length Criterion
To allow local factor analyzers to have independent latent dimensionality, the MML criterion given in Figueiredo and Jain Figueiredo2002 () should be generalized to reflect the individual code length of components:
$$\mathcal{L}(\theta, X) = \sum_{k:\pi_k>0} \frac{C_k}{2} \log\frac{N\pi_k}{12} + \frac{K_{nz}}{2} \log\frac{N}{12} + \sum_{k:\pi_k>0} \frac{C_k + 1}{2} - \log p(X \mid \theta) \qquad (6)$$
where $C_k$ denotes the number of parameters for component $k$, $X$ represents the dataset with $N$ data items, $\theta$ the model parameters, and $K_{nz}$ represents the number of nonzero weight components. The first three terms in Eq. 6 comprise the code length for real-valued model parameters; the fourth term is the negative model log-likelihood. We propose to include the code length for integer hyperparameters, namely $K_{nz}$ and the component-wise latent dimensionalities $p_k$, such that the encoding becomes decodable as required by MDL theory Rissanen1983 (); Rissanen:InCSM2007 (). For this purpose, we use Rissanen’s universal prior for integers Rissanen:InCSM2007 ():
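$$w^*(k) = c^{-1}\, 2^{-\log^* k} \qquad (7)$$
which gives the (ideal) code length
$$L^*(k) = \log^* k + \log c \qquad (8)$$
where $\log^* k = \log k + \log\log k + \dots$ is the n-fold logarithmic sum with positive terms, and $c$ is the normalizing sum that is tightly approximated as $c \approx 2.865064$ Rissanen:InCSM2007 (). The $\log^*$ term in Eq. 8 can be computed via a recursive algorithm. We finally obtain $L^*(K_{nz})$, the cost to encode the number of components, and similarly $L^*(p_k)$, the cost to encode the local dimensionalities, and add them to Eq. 6 to obtain a message length criterion:
$$\hat{\mathcal{L}}(\theta, X) = \mathcal{L}(\theta, X) + L^*(K_{nz}) + \sum_{k:\pi_k>0} L^*(p_k) \qquad (9)$$
with $\mathcal{L}(\theta, X)$ the message length of Eq. 6.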
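The integer code length can be sketched in a few lines. This is a minimal illustration, assuming base-2 logarithms (code lengths in bits) and using a loop in place of the recursion mentioned above; the function names are hypothetical and not taken from the authors' MATLAB tool.

```python
import math

RISSANEN_C = 2.865064  # approximate normalizing constant of the universal prior


def log_star(k: int) -> float:
    """Sum log2(k) + log2(log2(k)) + ..., keeping only the positive terms."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    total = 0.0
    term = math.log2(k)
    while term > 0.0:
        total += term
        term = math.log2(term)
    return total


def universal_code_length(k: int) -> float:
    """Ideal code length of the positive integer k under the universal prior."""
    return log_star(k) + math.log2(RISSANEN_C)


for k in (1, 2, 4, 16):
    print(k, log_star(k), universal_code_length(k))
```

The cost grows very slowly with k, so encoding the number of components and the local dimensionalities adds only a small penalty compared with the cost of the real-valued parameters.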
3.2 Component Splitting and Factor Addition
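Adding a new component by splitting an existing one involves two decisions: which component to split, and how to split it. AMoFA splits the component that looks least Gaussian, as measured by a multivariate kurtosis metric MardiaMA (). For a multinormal distribution, the multivariate kurtosis takes the value $\beta_{2,d} = d(d+2)$, and if the underlying population is multivariate normal with mean $\mu$, the sample counterpart of $\beta_{2,d}$, namely $b_{2,d}$, has an asymptotic distribution as the number of samples $N$ goes to infinity: $\{b_{2,d} - d(d+2)\} / [8d(d+2)/N]^{1/2} \sim \mathcal{N}(0, 1)$. Salah and Alpaydın Salah04imofa () adapted this metric to the mixture model by using a “soft count” $h_{tk} \equiv p(G_k \mid \mathbf{x}_t)$:
$$\gamma_k = \left\{ b_{2,d}^{k} - d(d+2) \right\} \Big/ \left[ \frac{8d(d+2)}{\sum_{t=1}^{N} h_{tk}} \right]^{1/2} \qquad (10)$$
$$b_{2,d}^{k} = \frac{1}{\sum_{l=1}^{N} h_{lk}} \sum_{t=1}^{N} h_{tk} \left[ (\mathbf{x}_t - \mu_k)^T \Sigma_k^{-1} (\mathbf{x}_t - \mu_k) \right]^2 \qquad (11)$$
The component with the greatest $\gamma_k$ is selected for splitting. AMoFA runs a local, 2-component MoFA on the data points that fall under the component. To initialize the means of the new components prior to MoFA fitting, we use the weighted sum of all eigenvectors of the local covariance matrix, denoted $\mathbf{w}$, and set $\mu_{new} = \mu \pm \mathbf{w}$, where $\mu$ is the mean vector of the component to split.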
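The kurtosis score used for choosing the component to split can be sketched with NumPy. This is an illustration under stated assumptions rather than the authors' implementation: it applies Mardia's statistic with the posterior responsibilities as weights and the soft count in place of the sample size, and the function and argument names are hypothetical.

```python
import numpy as np


def kurtosis_score(X, h, mu, Sigma):
    """Normalized multivariate kurtosis of one mixture component.

    X: (N, d) data, h: (N,) responsibilities of the component,
    mu: (d,) component mean, Sigma: (d, d) modeled covariance.
    """
    d = X.shape[1]
    diff = X - mu
    # Squared Mahalanobis distance of every sample to the component mean.
    maha = np.sum(diff * np.linalg.solve(Sigma, diff.T).T, axis=1)
    soft_count = h.sum()
    b2d = np.sum(h * maha ** 2) / soft_count
    return (b2d - d * (d + 2)) / np.sqrt(8.0 * d * (d + 2) / soft_count)


# Toy usage: two points at -1 and +1 under a unit-variance 1-D component.
X = np.array([[-1.0], [1.0]])
print(kurtosis_score(X, np.ones(2), np.array([0.0]), np.array([[1.0]])))
```

Scores near zero are what a Gaussian component would be expected to produce; the splitting step described above picks the component with the largest score.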
The component having the largest difference between modeled and sample covariance is selected for factor addition. As in IMoFA, AMoFA uses the residual factor addition scheme. Given a component $G_k$ and a set of data points $\mathbf{x}_t$ under it, the re-estimated points after projection to the latent subspace can be written as: $\tilde{\mathbf{x}}_t = \mu_k + \Lambda_k E[\mathbf{z}_t \mid \mathbf{x}_t, G_k]$. The re-estimation error decreases with the number of factors used in $\Lambda_k$. The newly added column in the factor loading matrix, $\Lambda_{k,p+1}$, is selected to be the principal direction (the eigenvector with the largest eigenvalue) of the residual vectors $\mathbf{x}_t - \tilde{\mathbf{x}}_t$. This new factor is used in bootstrapping the EM procedure.
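Residual factor addition can be sketched as follows. The code assumes the standard factor analysis posterior mean, E[z | x] = Lambda^T (Lambda Lambda^T + Psi)^{-1} (x - mu), for the projection step; weighting the residuals by the responsibilities and scaling the new column by the square root of the leading eigenvalue are choices made for this illustration, not details confirmed by the text. All names are hypothetical.

```python
import numpy as np


def residual_factor(X, h, mu, Lambda, psi):
    """Propose a new loading column from the reconstruction residuals.

    X: (N, d) data, h: (N,) responsibilities, mu: (d,) mean,
    Lambda: (d, p) loadings, psi: (d,) diagonal of the uniquenesses matrix.
    """
    diff = X - mu
    Sigma = Lambda @ Lambda.T + np.diag(psi)
    # Posterior factor means, one row per sample: (N, p).
    Ez = np.linalg.solve(Sigma, diff.T).T @ Lambda
    # What the current factors fail to reconstruct.
    residual = diff - Ez @ Lambda.T
    # Responsibility-weighted scatter matrix of the residuals.
    scatter = (residual * h[:, None]).T @ residual / h.sum()
    eigvals, eigvecs = np.linalg.eigh(scatter)  # ascending eigenvalues
    return eigvecs[:, -1] * np.sqrt(max(eigvals[-1], 0.0))


# Toy usage: the first axis is already modeled, the second one is not.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3)) * np.array([2.0, 1.0, 0.1])
Lambda = np.array([[2.0], [0.0], [0.0]])
print(residual_factor(X, np.ones(2000), np.zeros(3), Lambda, np.full(3, 0.1)))
```

In the toy data the proposed column points almost entirely along the second axis, which carries the largest share of variance that the existing factor does not explain.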
3.3 Component Annihilation
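In a Bayesian view, the message length criterion (Eq. 9) adopted from Figueiredo and Jain Figueiredo2002 () corresponds to assuming a flat prior on the component parameters $\theta_k$, and a Dirichlet prior on the mixture proportions $\pi_k$:
$$p(\pi_1, \dots, \pi_K) \propto \exp\left\{ -\sum_{k=1}^{K} \frac{C_k}{2} \log \pi_k \right\} \qquad (12)$$
Thus, in order to minimize the adopted cost in Eq. 9, the M-step of EM is changed for $\pi_k$:
$$\hat{\pi}_k = \frac{\max\left\{0,\; \left(\sum_{t=1}^{N} h_{tk}\right) - \frac{C_k}{2}\right\}}{\sum_{j=1}^{K} \max\left\{0,\; \left(\sum_{t=1}^{N} h_{tj}\right) - \frac{C_j}{2}\right\}} \qquad (13)$$
which means that all components having a soft count ($\sum_t h_{tk}$, the sum of the posterior probabilities $h_{tk}$ of component $k$ over the data) smaller than half the number of local parameters will be annihilated. This threshold enables the algorithm to get rid of components that do not justify their existence. In the special case of AMoFA, the number of parameters per component is defined as:
$$C_k = d\,(p_k + 2) + L^*(p_k) \qquad (14)$$
where $d$ is the original dataset dimensionality, $p_k$ is the local latent dimensionality of component $k$, and $L^*(p_k)$ is the code length for $p_k$. The additive constant inside the bracket accounts for the parameter cost of the mean $\mu_k$ and the local diagonal uniquenesses matrix $\Psi_k$. Finally, the localized annihilation condition to check at the M-step of EM is simply $\sum_t h_{tk} < C_k / 2$.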
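The annihilation rule described above amounts to a clipped renormalization of the soft counts. A minimal sketch follows; the function name and the list-based interface are assumptions made for illustration, and the per-component parameter counts are passed in rather than computed.

```python
def update_mixture_proportions(soft_counts, n_params):
    """MML M-step for the mixture proportions.

    soft_counts[k]: sum of the posteriors of component k over the data.
    n_params[k]: parameter count of component k.
    A component whose soft count does not exceed half its parameter
    count receives zero weight, i.e. it is annihilated.
    """
    clipped = [max(0.0, n - c / 2.0) for n, c in zip(soft_counts, n_params)]
    total = sum(clipped)
    if total == 0.0:
        raise ValueError("all components would be annihilated")
    return [v / total for v in clipped]


# Toy usage: the third component supports only 5 points with 20 parameters.
print(update_mixture_proportions([60.0, 35.0, 5.0], [20, 20, 20]))
```

Because the threshold depends on each component's own parameter count, a component with many factors needs more supporting data to survive than a component with few.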
In AMoFA, we use an outer loop to drive the model class adaptation and an inner EM loop to fit a mixture of factor analyzers model with initialized parameters. The inner EM algorithm is an improved and more general version of ULFMM Figueiredo2002 (): after parallel EM updates we select the weakest component and check it for annihilation, as opposed to the sequential component update approach (using Component-wise EM celeux2001component ()). Automatic component annihilation may take place at any time during EM. When the incremental progress saturates, the downsizing component annihilation is initiated. The MML-based EM algorithm and relevant details are given in Appendix A.
4 Experiments
4.1 Evaluation Protocol for Clustering Performance
We compare AMoFA with two benchmark algorithms on clustering, namely the ULFMM algorithm from Figueiredo2002 () (the code is available at http://www.lx.it.pt/~mtf) and IMoFA-L from Salah04imofa ().
We use the Normalized Information Distance (NID) metric for evaluating clustering accuracy, as it possesses several important properties: in addition to being a metric, it admits an analytical adjustment for chance, and allows normalization to the [0, 1] range Vinh:2010:ITM (). NID is formulated as:
$$NID(u, v) = 1 - \frac{I(u, v)}{\max\{H(u), H(v)\}} \qquad (15)$$
where the entropy $H(u)$ and the mutual information $I(u, v)$ for clusterings $u$ and $v$ are defined as follows:
$$H(u) = -\sum_{i} \frac{a_i}{N} \log \frac{a_i}{N} \qquad (16)$$
$$I(u, v) = \sum_{i} \sum_{j} \frac{n_{ij}}{N} \log \frac{n_{ij}/N}{a_i b_j / N^2} \qquad (17)$$
Here, $a_i$ is the number of samples in cluster $i$ of clustering $u$, $b_j$ is the number of samples in cluster $j$ of clustering $v$, and $n_{ij}$ is the number of samples falling into cluster $i$ in clustering $u$ and cluster $j$ in clustering $v$. MI is a nonlinear measure of dependence between two random variables. It quantifies how much information in bits the two variables share. We compute NID between the ground truth and the clusterings obtained by the automatic model selection techniques in order to give a more precise measure of clustering quality than just the number of clusters. When there is no overlap, NID is expected to be close to 0; higher overlap of clusters might result in a higher average NID, though a relative performance comparison can still be achieved.
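The NID computation can be sketched with the standard library alone. The code below assumes base-2 logarithms (information in bits) and takes two equally long sequences of cluster labels; returning 0 when both clusterings consist of a single cluster is a convention chosen for this sketch.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy of a clustering given as a sequence of cluster labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def mutual_information(u, v):
    """Mutual information between two clusterings of the same samples."""
    n = len(u)
    count_u, count_v = Counter(u), Counter(v)
    joint = Counter(zip(u, v))
    return sum(
        (nij / n) * math.log2(nij * n / (count_u[i] * count_v[j]))
        for (i, j), nij in joint.items()
    )


def nid(u, v):
    """Normalized Information Distance: 0 for identical partitions."""
    denom = max(entropy(u), entropy(v))
    if denom == 0.0:
        return 0.0  # both clusterings put every sample in one cluster
    return 1.0 - mutual_information(u, v) / denom


print(nid([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition, labels swapped
print(nid([0, 0, 1, 1], [0, 1, 0, 1]))  # statistically independent partitions
```

The first call gives an NID of 0 because the measure is invariant to relabeling, and the second gives 1 because the two partitions share no information.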
4.2 Experiments on Benchmark Datasets for Clustering
We tested three algorithms, namely IMoFA-L, AMoFA and ULFMM, on synthetic and real benchmark datasets for clustering. For maximum comparability with previous work, we used some synthetic dataset examples from Figueiredo and Jain Figueiredo2002 (), as well as from Yang et al.’s study on automatic mixture model selection yang2012robust ().
AMoFA, as opposed to IMoFA and ULFMM, does not rely on random initialization. In IMoFA, the first factor is randomly initialized, and in ULFMM the initial cluster centers are assigned to randomly selected instances. AMoFA initializes the first factor from the principal component of the dataset. Similar to residual factor addition, this scheme can be shown to converge faster than random initializations. Given a dataset, a single simulation is sufficient to assess performance.
Because of this deterministic property of AMoFA, we report the results with multiple datasets sampled from the underlying distribution, instead of sampling once and simulating multiple times. Unless stated otherwise, in the following experiments with synthetic datasets, 100 samples are drawn and the average results are reported. For ULFMM, we give initial number of clusters in all our simulations for clustering and use free full covariances. Moreover, the EM convergence threshold is set to in all three methods.
Example 1: 3 Separable Gaussians
To illustrate the evolution of the solution with AMoFA, we generated a mixture of three Gaussians having the same mixture proportions and the same covariance matrix with separate means . We generate 100 datasets of 900 data points each from the underlying distribution. This synthetic example is used in Figueiredo2002 (); yang2012robust (). Figure 3 shows the evolution of the adaptive steps of AMoFA, with the found clusters shown as 2-std contour plots and the description length (DL) given above each plot.
Example 2: Overlapping Gaussians
To test the approach for finding the correct number of clusters, we use a synthetic example very similar to the one used in Figueiredo2002 (); yang2012robust (). Here, three of the four Gaussians overlap with the following generative model:
We use data points. As in the previous example, we generate 100 random datasets. In Figure 4 (left), the data are illustrated with a sample result of AMoFA. Out of 100 simulations, the correct number of clusters (K*=4) is found 92, 56, and 33 times by AMoFA, ULFMM, and IMoFA, respectively. The histogram in Figure 4 (right) shows the distribution of the number of automatically found clusters for the three methods. The average NID over 100 datasets is found to be 0.2549, 0.2951, and 0.3377 for AMoFA, ULFMM and IMoFA, respectively. A paired t-test (two-tailed) on NID scores indicates that AMoFA performs significantly better than ULFMM with .
4.3 Application to Classification: Modeling Class Conditional Densities
We compare AMoFA with three benchmark model selection algorithms, namely, the VBMoFA algorithm from Ghahramani1999 (), the ULFMM algorithm from Figueiredo2002 () and the IMoFA algorithm from Salah04imofa (). As baselines, we use Mixtures of Gaussians, where the data of each class are modeled by a single Gaussian with full (MoGF) or diagonal (MoGD) covariances. We compare the performances of the methods on classification tasks (via class-conditional modeling) on nine benchmark datasets: the ORL face database with a binary gender classification task samaria1994parameterisation (), the 16-class phoneme database from the LVQ package of Helsinki University of Technology LVQPAK (), the VISTEX texture database Salah04imofa (), a 6-class Japanese Phoneme database GurgenAUA94 (), the MNIST dataset lecun1998gradient (), and four datasets (Letter, Pendigits, Optdigits, and Waveform) from the UCI ML Repository UCIML (). Preprocessed versions of the VISTEX and Japanese Phoneme datasets that are used in this study can be accessed from http://www.cmpe.boun.edu.tr/~kaya/jpn_vis.zip. Table 2 gives some basic statistics about the databases. Except for MNIST, which has an explicit training and testing protocol, all experiments were carried out with 10-fold cross-validation. Simulations are replicated 10 times on MNIST, where we crop the 4-pixel padding around the images and scale them to $10 \times 10$ pixels to obtain 100-dimensional feature vectors.
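Table 2. Basic statistics of the datasets.

| Dataset | Dimensions | Classes | # of Samples |
|---|---|---|---|
| ORL | 256 | 2 | 400 |
| LVQ | 20 | 16 | 3858 |
| OPT | 60 | 10 | 4677 |
| PEN | 16 | 10 | 8992 |
| VIS | 169 | 10 | 3610 |
| WAV | 21 | 3 | 500 |
| JPN | 112 | 6 | 1200 |
| LET | 16 | 26 | 20000 |
| MNT | 100 | 10 | 70000 |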
In the experiments, we trained separate mixture models for the samples of each class, and used maximum likelihood classification. We did not use informative class priors, as it would positively bias the results, and hide the impact of likelihood modeling. In Table 3, we provide accuracy computed over 10 folds, where all six approaches used the same protocol. ULFMM column reports performance of ULFMM models with free diagonal covariances, as full covariance models invariably give poorer results.
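The classification protocol can be sketched with the simplest of the baselines, a single full-covariance Gaussian per class (the MoGF setting). The sketch assumes at least two features and adds a small ridge term to the covariance for numerical stability; that term and all function names are assumptions of this illustration. Any other density model, such as a fitted MoFA, could replace the per-class Gaussian as long as it returns log-densities.

```python
import numpy as np


def fit_gaussian(X, ridge=1e-6):
    """Fit one full-covariance Gaussian to the samples of a class."""
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False) + ridge * np.eye(X.shape[1])
    return mu, Sigma


def log_density(X, mu, Sigma):
    """Row-wise Gaussian log-density."""
    d = X.shape[1]
    diff = X - mu
    _, logdet = np.linalg.slogdet(Sigma)
    maha = np.sum(diff * np.linalg.solve(Sigma, diff.T).T, axis=1)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)


def classify(X, models):
    """Maximum likelihood decision: no class priors are used."""
    scores = np.column_stack([log_density(X, mu, S) for mu, S in models])
    return scores.argmax(axis=1)


# Toy usage with two well-separated synthetic classes.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 2))
B = rng.normal(size=(50, 2)) + 10.0
models = [fit_gaussian(A), fit_gaussian(B)]
print(classify(np.array([[0.0, 0.0], [10.0, 10.0]]), models))
```

Leaving out the class priors, as in the experiments, means the decision depends only on how well each class-conditional density explains the sample.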
Table 3. Classification accuracies.

| Dataset | IMoFA-L Salah04imofa () | VBMoFA Ghahramani1999 () | AMoFA | ULFMM Figueiredo2002 () | MoGD | MoGF |
|---|---|---|---|---|---|---|
| ORL | 97.8 ± 1.5 | 93.0 ± 2.8 | 97.5 ± 1.2 | 80.0 ± 6.5 | 89 ± 2.4 | 90 ± 0 |
| LVQ | 91.2 ± 1.9 | 91.3 ± 1.9 | 89.3 ± 1.6 | 75.4 ± 4.5 | 88.1 ± 2.6 | 92.1 ± 1.8 |
| OPT | 91.1 ± 2.7 | 95.2 ± 1.8 | 93.8 ± 2.4 | 49.5 ± 10.2 | 84.2 ± 3.1 | 94.9 ± 1.7 |
| PEN | 97.9 ± 0.7 | 97.8 ± 0.6 | 98.1 ± 0.6 | 89.9 ± 2.0 | 84.5 ± 2.0 | 97.4 ± 0.6 |
| VIS | 69.3 ± 4.6 | 67.1 ± 5.9 | 77.2 ± 4.6** | 20.6 ± 3.7 | 68.6 ± 3.9 | 44.7 ± 12.8 |
| WAV | 80.8 ± 4.5 | 85.1 ± 4.2 | 85.6 ± 4.6 | 72.1 ± 7.5 | 80.9 ± 18.2 | 84.8 ± 4.6 |
| JPN | 93.4 ± 2.4 | 93.2 ± 3.2 | 96.5 ± 2.2* | 82.4 ± 2.1 | 82.2 ± 4.9 | 92.3 ± 2.3 |
| LET | 86.6 ± 1.5 | 95.2 ± 0.7 | 95.1 ± 0.7 | 56.9 ± 2.8 | 64.2 ± 1.2 | 88.6 ± 0.9 |
| MNT | 91.5 ± 0.2 | 84.5 ± 0.1 | 93.9 ± 0 | 64.7 ± 2.0 | 78.2 ± 0 | 93.7 ± 0 |
The best results for a dataset are shown in bold. We compared the algorithms with a nonparametric sign test. For each dataset, we conducted a one-tailed paired-sample t-test with a significance level of 0.05 (and 0.01 upon rejection of the null hypothesis). Results indicate that ULFMM ranks last in all cases: it is consistently inferior even to the MoGF baseline. This is because, after randomized initialization of clusters, the ULFMM algorithm annihilates all illegitimate components, skipping intermediate models that are possibly better than the initial one. On seven datasets AMoFA attains or shares the first rank, and on the remaining two it ranks second. Note that although AMoFA and VBMoFA have a similar number of wins against each other overall, on the high-dimensional datasets, namely MNIST, VISTEX, Japanese Phoneme and ORL, AMoFA significantly outperforms VBMoFA.
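Table 4. Pairwise comparison results at the 0.05 significance level.

| | AMoFA | VBMoFA | ULFMM | MoGD | MoGF |
|---|---|---|---|---|---|
| IMoFA | 1/2/6 | 2/4/3 | 9/0/0 | 7/2/0 | 2/3/4 |
| AMoFA | * | 4/3/2 | 9/0/0 | 7/2/0 | 5/3/1 |
| VBMoFA | | * | 9/0/0 | 7/2/0 | 3/5/1 |
| ULFMM | | | * | 0/3/6 | 0/0/9 |
| MoGD | | | | * | 1/2/6 |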
The results of pairwise tests at the 0.05 significance level are shown in Table 4. We see that the adaptive MoFA algorithms dramatically outperform the GMM-based ULFMM algorithm. MoFA is capable of exploring a wider range of models between diagonal and full covariance with reduced parameterization. Among the three MoFA-based algorithms, no significant difference () was found on the Pendigits dataset. AMoFA outperforms the best results reported so far on the VISTEX dataset: the best test set accuracy reported in Salah04imofa () is 73.8 ± 1.1 using GMMs, whereas we attain 77.2 ± 4.6 with AMoFA.
5 Conclusions and Outlook
In this study, we propose a novel and adaptive model selection approach for Mixtures of Factor Analyzers. Our algorithm first adds factors and components to the mixture, and then prunes excessive parameters, thus obtaining a parsimonious model in a very time and space efficient way. Our contributions include a generalization of the adopted MML criterion to reflect local parameter costs, as well as local component annihilation thresholds.
We carry out experiments on many real datasets, and the results indicate the superiority of the proposed method in class-conditional modeling. We contrast our approach with the Incremental MoFA approach Salah04imofa (), Variational Bayesian MoFA Ghahramani1999 (), as well as the popular ULFMM algorithm Figueiredo2002 (). In high dimensions, MoFA-based automatic modeling provides significantly better classification results than GMM-based ULFMM modeling, as it is capable of exploring a much wider range of models with compact parametrization. It also makes use of the latent dimensionality of the local manifold, thus enabling an adaptive cost for the description length. The AMoFA algorithm is observed to offer the best performance on higher-dimensional datasets.
The proposed algorithm does not necessitate a validation set to control model complexity. Thanks to the optimized MML criterion and the fast component selection measures for incremental adaptation, the algorithm is not only robust, but also efficient. It does not have any requirement for parameter tuning. Using a recursive version of ULFMM Zivkovic04RULFMM (), it is also possible to extend the proposed method for online learning. A MATLAB tool for AMoFA is available from http://www.cmpe.boun.edu.tr/~kaya/amofa.zip.
Appendix A EM Algorithm for Mixture of Factor Analyzers with MML Criterion
In this section, we give the MoFA EM algorithm optimizing the generalized MML criterion given in Eq. 9. This criterion is used for automatic annihilation of components at the M-step. We provide the formulation of the MML-based EM algorithm, which is closely related to the regular EM for the MoFA model Ghahramani97EM4MFA ():
(18)  
(19)  
(20)  
(21)  
(22) 
where to keep the notation uncluttered, is defined as . Similarly, , , and
(23) 
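The above EM formulation aims to optimize the MoFA log-likelihood, which is the logarithm of the linear combination of component likelihoods:
$$\log p(X \mid \theta) = \sum_{t=1}^{N} \log \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}_t;\, \mu_k,\, \Lambda_k \Lambda_k^T + \Psi_k) \qquad (24)$$
The EM algorithm for MoFA using the MML criterion is given in Figure 5.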
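The MoFA log-likelihood can be evaluated as in the sketch below, which assumes that each component has covariance Lambda_k Lambda_k^T + Psi_k and uses the log-sum-exp trick to avoid underflow. The dense d x d solve is chosen for brevity; for large d the inverse could instead be obtained in the latent dimension via the matrix inversion lemma. The function name and argument layout are hypothetical.

```python
import numpy as np


def mofa_log_likelihood(X, pis, mus, Lambdas, psis):
    """Log-likelihood of the data under a mixture of factor analyzers.

    X: (N, d) data; pis: K mixture proportions; mus: K mean vectors (d,);
    Lambdas: K loading matrices (d, p_k); psis: K uniqueness diagonals (d,).
    """
    d = X.shape[1]
    log_terms = []
    for pi, mu, Lam, psi in zip(pis, mus, Lambdas, psis):
        Sigma = Lam @ Lam.T + np.diag(psi)
        diff = X - mu
        _, logdet = np.linalg.slogdet(Sigma)
        maha = np.sum(diff * np.linalg.solve(Sigma, diff.T).T, axis=1)
        log_terms.append(
            np.log(pi) - 0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)
        )
    log_terms = np.stack(log_terms, axis=1)  # shape (N, K)
    # log-sum-exp over the components, then sum over the samples.
    m = log_terms.max(axis=1, keepdims=True)
    return float(np.sum(m[:, 0] + np.log(np.exp(log_terms - m).sum(axis=1))))


# Toy usage: one 1-D component with zero loadings and unit noise variance.
ll = mofa_log_likelihood(
    np.array([[0.0]]), [1.0], [np.array([0.0])],
    [np.array([[0.0]])], [np.array([1.0])],
)
print(ll)
```

Since the components may have different numbers of factors, the loading matrices are passed as a list rather than as a single stacked array.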
References
 (1) P. Moerland, Mixture Models for Unsupervised and Supervised Learning, Ph.D. Thesis, The Swiss Federal Inst. of Tech. at Lausanne, 2000.
 (2) G. McLachlan, D. Peel, Finite Mixture Models, New York: Wiley, 2000.
 (3) A. K. Jain, Data clustering: 50 years beyond K-means, Pattern Recognition Letters 31 (8) (2010) 651–666. doi:10.1016/j.patrec.2009.09.011.
 (4) C. Bouveyron, C. Brunet-Saumard, Model-based clustering of high-dimensional data: A review, Computational Statistics & Data Analysis 71 (2014) 52–78.
 (5) Z. Ghahramani, G. E. Hinton, The EM algorithm for mixtures of factor analyzers, Tech. Rep. CRG-TR-96-1, University of Toronto (1997).
 (6) A. A. Salah, E. Alpaydın, Incremental mixtures of factor analysers, in: Proc. Int. Conf. on Pattern Recognition, 2004, pp. 276–279.
 (7) Z. Ghahramani, M. J. Beal, Variational Inference for Bayesian Mixtures of Factor Analysers, in: NIPS, Vol. 12, 1999, pp. 449–455.
 (8) M. A. T. Figueiredo, A. K. Jain, Unsupervised Learning of Finite Mixture Models, IEEE Trans. Pattern Analysis and Machine Intelligence 24 (3) (2002) 381–396.
 (9) H. Akaike, A new look at the statistical model identification, IEEE Trans. Automatic Control 19 (6) (1974) 716–723.
 (10) G. Schwarz, Estimating the Dimension of a Model, Annals of Statistics 6 (2) (1978) 461–464.
 (11) J. Rissanen, A Universal Prior for Integers and Estimation by MDL, The Annals of Statistics 11 (2) (1983) 416–431.
 (12) J. Rissanen, Information and complexity in statistical modeling, Information Science and Statistics, Springer, Dordrecht, 2007.
 (13) C. Wallace, P. Freeman, Estimation and inference by compact coding, Journal of Royal Statistical Society, Series B 49 (3) (1987) 240–265.
 (14) J. J. Verbeek, N. Vlassis, B. Kröse, Efficient greedy learning of Gaussian mixture models, Neural computation 15 (2) (2003) 469–485.
 (15) C. E. Rasmussen, The Infinite Gaussian Mixture Model, in: NIPS, no. 11, 2000, pp. 554–560.
 (16) R. Gomes, M. Welling, P. Perona, Incremental learning of nonparametric bayesian mixture models, in: CVPR, 2008, pp. 1–8.
 (17) L. Shi, S. Tu, L. Xu, Learning Gaussian mixture with automatic model selection: A comparative study on three Bayesian related approaches, Frontiers of Electr. and Electronic Eng. in China 6 (2) (2011) 215–244.
 (18) D. Pelleg, A. W. Moore, X-means: Extending k-means with efficient estimation of the number of clusters, in: ICML, 2000, pp. 727–734.
 (19) M. Law, M. A. T. Figueiredo, A. Jain, Simultaneous feature selection and clustering using mixture models, IEEE Trans. Pattern Analysis and Machine Intelligence 26 (9) (2004) 1154–1166.
 (20) Z. Zivkovic, F. van der Heijden, Recursive unsupervised learning of finite mixture models, IEEE Trans. Pattern Analysis and Machine Intelligence 26 (5) (2004) 651–656.
 (21) L. Shi, L. Xu, Local factor analysis with automatic model selection: A comparative study and digits recognition application, in: S. Kollias, A. Stafylopatis, W. Duch, E. Oja (Eds.), Int. Conf. on Artificial Neural Networks, Vol. 4132 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2006, pp. 260–269.
 (22) C. Constantinopoulos, A. Likas, Unsupervised learning of gaussian mixtures based on variational component splitting, IEEE Trans. Neural Networks 18 (3) (2007) 745–755.
 (23) S. Boutemedjet, N. Bouguila, D. Ziou, A Hybrid Feature Extraction Selection Approach for High-Dimensional Non-Gaussian Data Clustering, IEEE Trans. Pattern Analysis and Machine Intelligence 31 (8) (2009) 1429–1443.
 (24) D. Gorur, C. E. Rasmussen, Nonparametric mixtures of factor analyzers, in: IEEE Signal Processing and Communications Applications Conf., 2009, pp. 708–711. doi:10.1109/SIU.2009.5136494.
 (25) M.-S. Yang, C.-Y. Lai, C.-Y. Lin, A robust EM clustering algorithm for Gaussian mixture models, Pattern Recognition 45 (11) (2012) 3950–3961.
 (26) T. Iwata, D. Duvenaud, Z. Ghahramani, Warped mixtures for nonparametric cluster shapes (2012). arXiv:1206.1846.
 (27) W. Fan, N. Bouguila, D. Ziou, Variational learning of finite Dirichlet mixture models using component splitting, Neurocomputing 129 (2014) 3–16.
 (28) W. Fan, N. Bouguila, Online variational learning of generalized Dirichlet mixture models with feature selection, Neurocomputing 126 (2014) 166–179.
 (29) J. Kersten, Simultaneous feature selection and Gaussian mixture model estimation for supervised classification problems, Pattern Recognition 47 (8) (2014) 2582–2595. doi:10.1016/j.patcog.2014.02.015.
 (30) M. E. Tipping, C. M. Bishop, Mixtures of probabilistic principal component analyzers, Neural Comput. 11 (2) (1999) 443–482.
 (31) A. P. Dempster, N. M. Laird, D. B. Rubin, Maximum likelihood from incomplete data via the em algorithm, Journal of the Royal Statistical Society, Series B 39 (1) (1977) 1–38.
 (32) K. V. Mardia, J. T. Kent, J. M. Bibby, Multivariate Analysis, Probability and Mathematical Statistics, Academic Press, 1979.
 (33) G. Celeux, S. Chrétien, F. Forbes, A. Mkhadri, A componentwise EM algorithm for mixtures, Journal of Computational and Graphical Statistics 10 (4) (2001) 697–712.
 (34) V. X. Nguyen, J. Epps, J. Bailey, Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance, Journal of Machine Learning Research 11 (2010) 2837–2854.
 (35) F. S. Samaria, A. C. Harter, Parameterisation of a stochastic model for human face identification, in: Proc. WACV, IEEE, 1994, pp. 138–142.
 (36) T. Kohonen, J. Hynninen, J. Kangas, K. Torkkola, LVQ_PAK, Helsinki University of Technology (1995).
 (37) F. S. Gürgen, R. Alpaydin, U. Ünlüakin, E. Alpaydin, Distributed and local neural classifiers for phoneme recognition, Pattern Recognition Letters 15 (10) (1994) 1111–1118.
 (38) Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324.
 (39) A. Frank, A. Asuncion, UCI machine learning repository (2010). URL http://archive.ics.uci.edu/ml