Adaptive Mixtures of Factor Analyzers

Heysem Kaya, Albert Ali Salah

Department of Computer Engineering, Çorlu Faculty of Engineering, Namık Kemal University, 59860, Çorlu, Tekirdağ, TURKEY
Department of Computer Engineering, Boğaziçi University, 34342, Bebek, İstanbul, TURKEY

A mixture of factor analyzers is a semi-parametric density estimator that generalizes the well-known mixtures of Gaussians model by allowing each Gaussian in the mixture to be represented in a different lower-dimensional manifold. This paper presents a robust and parsimonious model selection algorithm for training a mixture of factor analyzers, carrying out simultaneous clustering and locally linear, globally nonlinear dimensionality reduction. By permitting a different number of factors per mixture component, the algorithm adapts the model complexity to the data complexity. We compare the proposed algorithm with related automatic model selection algorithms on a number of benchmarks. The results indicate the effectiveness of this fast and robust approach in clustering, manifold learning and class-conditional modeling.

Keywords: IMoFA, AMoFA, mixture models, clustering, model selection, dimensionality reduction, covariance modeling, mixture of factor analyzers
journal: Pattern Recognition

1 Introduction

Mixture models are widely used in various domains of machine learning and signal processing for supervised, semi-supervised and unsupervised tasks Moerland00dissertation (); mclachlan2000finite (). However, model selection remains one of the challenges, and there is a need for efficient and parsimonious automatic model selection methods Jain2010 ().

Let $\mathbf{x}$ denote a random variable in $\mathbb{R}^d$. A mixture model represents the distribution of $\mathbf{x}$ as a mixture of $K$ component distributions:

$$p(\mathbf{x}) = \sum_{k=1}^{K} p(\mathbf{x} \mid \mathcal{G}_k)\, p(\mathcal{G}_k) \qquad (1)$$

where $\mathcal{G}_k$ correspond to components, and $p(\mathcal{G}_k)$ are the prior probabilities of the components. $p(\mathcal{G}_k)$ are also called the mixture proportions, and sum up to unity. The likelihood term, expressed by $p(\mathbf{x} \mid \mathcal{G}_k)$, can be modeled by any distribution. In this paper we focus on Gaussians:

$$p(\mathbf{x} \mid \mathcal{G}_k) \sim \mathcal{N}(\mu_k, \Sigma_k) \qquad (2)$$

where $\mu_k$ and $\Sigma_k$ denote the mean and covariance of the $k$-th component distribution, respectively. The number of parameters in the model is primarily determined by the dimensionality of the covariance matrix, which scales quadratically with the feature dimensionality $d$. When this number is large, overfitting becomes an issue. Indeed, one of the most important problems of model-based clustering methods is that they are over-parametrized in high-dimensional spaces bouveyron2014model (). One way of keeping the number of parameters small is to constrain the covariance matrices to be tied (shared) across components, which assumes similar shaped distributions in the data space, and is typically unjustified. Another approach is to assume that each covariance matrix is diagonal or spherical, but this means valuable correlation information will be discarded.

It is possible to keep a low number of parameters for the model without sacrificing correlation information by adopting a factor analysis approach. Factor Analysis (FA) is a latent variable model, which assumes the observed variables are linear projections of a small number of independent factors $\mathbf{z}$ with additive Gaussian noise $\epsilon$:

$$\mathbf{x} - \mu = \Lambda \mathbf{z} + \epsilon \qquad (3)$$

where $\Lambda$ is a $d \times p$ factor loading matrix and $\Psi$ is a diagonal uniquenesses matrix representing the common sensor noise. Subsequently, the covariance matrix in Eq. 2 is expressed as $\Sigma_k = \Lambda_k \Lambda_k^T + \Psi_k$, effectively reducing the number of parameters from $O(d^2)$ to $O(dp)$, with $p \ll d$. If each Gaussian component is expressed in a latent space, the result is a mixture of factor analyzers (MoFA).
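The saving in parameters can be made concrete with a small sketch. The function names and the values of d and p below are illustrative assumptions, and the counts are raw counts that ignore the rotational indeterminacy of the loading matrix.

```python
import numpy as np

def full_cov_params(d: int) -> int:
    """Free parameters of one symmetric d x d covariance matrix."""
    return d * (d + 1) // 2

def fa_cov_params(d: int, p: int) -> int:
    """Raw count for a d x p loading matrix plus d diagonal noise variances."""
    return d * p + d

def fa_covariance(Lambda: np.ndarray, psi_diag: np.ndarray) -> np.ndarray:
    """Covariance implied by a factor analyzer: Lambda Lambda^T + diag(psi)."""
    return Lambda @ Lambda.T + np.diag(psi_diag)

# Illustrative sizes (assumed, not taken from the paper).
d, p = 100, 5
rng = np.random.default_rng(0)
Lambda = rng.normal(size=(d, p))
psi_diag = rng.uniform(0.1, 1.0, size=d)
Sigma = fa_covariance(Lambda, psi_diag)

n_full = full_cov_params(d)   # grows quadratically in d
n_fa = fa_cov_params(d, p)    # grows linearly in d for fixed p
```

The implied covariance is still a full, positive definite d x d matrix, so correlations between features are retained even though far fewer numbers are stored.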

Given a set of data points, there exist Expectation-Maximization (EM) approaches to train MoFA models mclachlan2000finite (); Ghahramani97EM4MFA (), but these approaches require the specification of hyper-parameters like the number of clusters and the number of factors per component. For the model selection problem of MoFA, an incremental algorithm (IMoFA) was proposed in Salah04imofa (), where factors and components were added to the mixture one by one. The model complexity was monitored on a separate validation set.

In this study, we propose a fast and parsimonious model selection algorithm called Adaptive Mixture of Factor Analyzers (AMoFA). Similar to IMoFA, AMoFA is capable of adapting a mixture model to data by selecting an appropriate number of components and factors per component. However, the proposed AMoFA algorithm deals with two shortcomings of the IMoFA approach: 1) Instead of relying on a validation set, AMoFA uses a Minimum Message Length (MML) based criterion to control model complexity, and can therefore use more training samples in practice. 2) AMoFA is capable of removing factors and components from the mixture when necessary. We test the proposed AMoFA approach on several benchmarks, comparing its performance with IMoFA, with a variational Bayesian MoFA approach Ghahramani1999 (), as well as with the popular Gaussian mixture model selection approach based on MML, introduced by Figueiredo and Jain Figueiredo2002 (). We show that the proposed approach is parsimonious and robust, and especially useful for high-dimensional problems.

The paper is organized as follows. In the next section we review related work in model selection. The AMoFA algorithm is introduced in Section 3. Experimental results are presented in Section 4. Section 5 discusses our findings, and concludes with future directions.

2 Related Work

There are numerous studies on mixture model class selection. These include using information theoretical trade-offs between likelihood and model complexity AkaikeAIC (); SchwarzBIC (); Rissanen1983 (); Rissanen:InCSM2007 (); Wallace87MML (), greedy approaches verbeek2003efficient (); Salah04imofa () and full Bayesian treatment of the problem Ghahramani1999 (); Rasmussen2000 (); GomesIncLearnNonpBayMMCVPR08 (); Shi2011 (). A brief review of related automatic model selection methods is given in Table 1; a detailed treatment can be found in bouveyron2014model (). Here we provide some detail on the automatic model selection methods that are most closely related to our work.

Work | Model Selection | Approach
Ghahramani & Beal (1999) Ghahramani1999 () | Variational Bayes | Incremental
Pelleg & Moore (2000) Pelleg2000XM () | MDL | Incremental
Rasmussen (2000) Rasmussen2000 () | MC for DPMM | Both
Figueiredo & Jain (2002) Figueiredo2002 () | MML | Decremental
Verbeek et al. (2003) verbeek2003efficient () | Fixed iteration | Incremental
Law et al. (2004) LawFigJainPAMI04 () | MML | Decremental
Zivkovic & v.d. Heijden (2004) Zivkovic04RULFMM () | MML | Decremental
Salah & Alpaydin (2004) Salah04imofa () | Cross Validation | Incremental
Shi & Xu (2006) ShiMoFAICANN06 () | Bayesian Yin-Yang | Both
Constantinopoulos et al. (2007) ConstantinopoulosVarComSplitNN07 () | Variational Bayes | Incremental
Gomes et al. (2008) GomesIncLearnNonpBayMMCVPR08 () | Variational DP | Incremental
Boutemedjet et al. (2009) Boutemedjet2009 () | MML | Decremental
Gorur & Rasmussen (2009) Gorur09SIUNonParMFA () | MC for DPMM | Both
Shi et al. (2011) Shi2011 () | Bayesian Yin-Yang | Both
Yang et al. (2012) yang2012robust () | Entropy Min. | Decremental
Iwata et al. (2012) Iwata1206.1846 () | MC for DPMM | Both
Fan & Bouguila (2013) FanVBCSNeuCom2013 () | Variational DP | Both
Fan & Bouguila (2014) FanOVBDMFSNeuCom2014 () | Variational Bayes | Incremental
Kersten (2014) Kersten2014 () | MML | Decremental
Table 1: Automatic Mixture Model Selection Approaches

In one of the most popular model selection approaches for Gaussian mixture models (GMMs), Figueiredo and Jain proposed to use an MML criterion for determining the number of components in the mixture, and showed that their approach is equivalent to assuming Dirichlet priors for mixture proportions Figueiredo2002 (). In their method, a large number of components (typically 25-30) is fit to the training set, and these components are eliminated one by one. At each iteration, the EM algorithm is used to find a converged set of model parameters. The algorithm generates and stores all intermediate models, and selects the one that optimizes the MML criterion.

The primary drawback of this approach is the curse of dimensionality. For a $d$-dimensional problem, fitting a single full-covariance Gaussian requires $O(d^2)$ parameters, which typically forces the algorithm to restrict its models to diagonal covariances in practice. We demonstrate empirically that this approach (unsupervised learning of finite mixture models - ULFMM) does not perform well in practice for problems with high dimensionality, despite its abundant use in the literature.

Figure 1: Relationship of MoFA with some well known latent variable and mixture models. Model parameters are given in curly brackets. $\pi$: ($K \times 1$) component priors, $\mu$: ($d \times 1$) component mean, $\Lambda$: ($d \times p$) factor loading matrix, $\Psi$: ($d \times d$) diagonal noise variances (uniqueness), $\Sigma$: ($d \times d$) component covariance. $K$ denotes the number of components, $d$ the feature dimensionality, and $p$ the subspace dimensionality with $p \ll d$.

Using the parsimonious factor analysis representation described in Section 1, it is possible to explore many models that are between full-covariance and diagonal Gaussian mixtures in their number of parameters. The resulting mixture of factor analysers (MoFA) can be considered as a noise-robust version of the mixtures of probabilistic principal component analysers (PPCA) approach Tipping99MPPCA (). Figure 1 summarizes the relations between the mixture representations in this area.

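If we assume that the latent variables of each component in a MoFA model are distributed unit normal ($\mathcal{N}(0, I)$) in the latent space, the corresponding data in the feature space are also Gaussian distributed:

$$p(\mathbf{x} \mid \mathbf{z}, \mathcal{G}_k) = \mathcal{N}(\mu_k + \Lambda_k \mathbf{z}, \Psi_k) \qquad (4)$$

where $\mathbf{z}$ denotes the latent factor values. The mixture distribution of $K$ factor analyzers is then given as Ghahramani97EM4MFA ():

$$p(\mathbf{x}) = \sum_{k=1}^{K} \int p(\mathbf{x} \mid \mathbf{z}, \mathcal{G}_k)\, p(\mathbf{z} \mid \mathcal{G}_k)\, p(\mathcal{G}_k)\, d\mathbf{z} \qquad (5)$$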

The EM algorithm is used to find maximum likelihood solutions to latent variable models Dempster77MLEM (), and it can be used for training a MoFA Ghahramani97EM4MFA (). Since EM does not address the model selection problem, it requires the number of components and factors per component to be fixed beforehand.
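As a rough illustration of the quantity that EM maximizes, the sketch below evaluates the MoFA log-likelihood of a dataset. It relies on the standard factor analysis result that integrating out the unit-normal factors gives each component the marginal covariance Lambda_k Lambda_k^T + Psi_k. The function names and the tiny example are assumptions made for illustration, not the authors' code.

```python
import numpy as np

def gaussian_logpdf(X, mu, Sigma):
    """Row-wise log density of N(mu, Sigma) for an (N, d) array X."""
    d = X.shape[1]
    diff = X - mu
    _, logdet = np.linalg.slogdet(Sigma)
    # Squared Mahalanobis distance of every row.
    maha = np.einsum("ij,ij->i", diff, np.linalg.solve(Sigma, diff.T).T)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

def mofa_loglik(X, pis, mus, Lambdas, psis):
    """Log-likelihood of X under a mixture of factor analyzers."""
    comp = np.stack(
        [np.log(pi) + gaussian_logpdf(X, mu, L @ L.T + np.diag(psi))
         for pi, mu, L, psi in zip(pis, mus, Lambdas, psis)],
        axis=1,
    )
    # Numerically stable log-sum-exp over components.
    m = comp.max(axis=1, keepdims=True)
    return float(np.sum(m[:, 0] + np.log(np.exp(comp - m).sum(axis=1))))

# Degenerate check: one component, zero loadings, unit noise variances.
X_demo = np.zeros((3, 2))
ll_demo = mofa_loglik(X_demo, [1.0], [np.zeros(2)],
                      [np.zeros((2, 1))], [np.ones(2)])
```

With zero loadings the model reduces to a standard normal, which makes the demo value easy to verify by hand.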

Ghahramani and Beal Ghahramani1999 () have proposed a variational Bayes scheme (VBMoFA) for model selection in MoFA, which allows the local dimensionality of components and their total number to be automatically determined. In this study, we use VBMoFA as one of the benchmarks.

To alleviate the computational complexity of the variational approach, a greedy model selection algorithm was proposed by Salah and Alpaydın Salah04imofa (). This incremental approach (IMoFA) starts by fitting a single component - single factor model to the data and adds factors and components in each iteration using fast heuristic measures until a convergence criterion is met. The algorithm allows components to have as many factors as necessary, and uses a validation set to stop model adaptation, as well as to avoid over-fitting. This is the third algorithm we use to compare with the proposed approach, which we describe in detail next.

3 Adaptive Mixtures of Factor Analyzers

We briefly summarize the proposed adaptive mixtures of factor analyzers (AMoFA) algorithm first, and then describe its details. Given a dataset $X$ with $N$ data points in $d$ dimensions, the AMoFA algorithm is initialized by fitting a 1-component, 1-factor mixture model. Here, the factor is initialized from the leading eigenvector of the covariance matrix, i.e., the principal component of the data. At each subsequent step, the algorithm considers adding more components and factors to the mixture, running EM iterations to find a parametrization. During the M-step of EM, an MML criterion is used to determine whether any weak components should be annihilated. Apart from this early component annihilation, the algorithm incorporates a second decremental scheme. When the incremental part of the algorithm no longer improves the MML criterion, a downsizing component annihilation process is initiated and all components are eliminated one by one. Similar to ULFMM, each intermediate model is stored, and the algorithm outputs the one giving the minimum message length. Figure 2 summarizes the proposed algorithm.
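The deterministic initialization can be sketched as follows. The scaling of the first loading column by the square root of the leading eigenvalue and the small variance floor are assumptions made here for illustration; the text only states that the factor comes from the leading eigenvector.

```python
import numpy as np

def init_single_factor(X, floor=1e-6):
    """Initialise a 1-component, 1-factor model from the principal component."""
    mu = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(S)            # eigenvalues in ascending order
    lam = vecs[:, -1:] * np.sqrt(vals[-1])    # d x 1 loading (assumed scaling)
    # Remaining per-feature variance goes to the diagonal noise (assumed floor).
    psi = np.maximum(np.diag(S) - (lam ** 2).ravel(), floor)
    return mu, lam, psi

# Toy data with one dominant direction along the first axis.
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))
X_toy = z @ np.array([[3.0, 0.0, 0.0]]) + 0.1 * rng.normal(size=(500, 3))
mu0, lam0, psi0 = init_single_factor(X_toy)
```

Because nothing here is random once the data are fixed, repeated runs on the same dataset start EM from the same point.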

algorithm AMoFA(training set X)
  /* Initialization */
  [θ] ← train a 1-component, 1-factor model
  repeat
    /* Perform a single split */
    x ← select a component for splitting via Eq. (10)
    [θ_1] ← MML_EM(split x)
    actionML(1) ← ML(θ_1) via Eq. (9)
    /* Perform a single factor addition */
    y ← select a component to add a factor
    [θ_2] ← MML_EM(add factor to y)
    actionML(2) ← ML(θ_2) via Eq. (9)
    /* Select the best action */
    z ← arg min(actionML(1), actionML(2))
    /* Update the parameters */
    [θ] ← [θ_z]
  until the MML decrease falls below a threshold
  /* Annihilation starts with K components */
  while K > 1
    /* Select the weakest component for annihilation */
    [θ] ← EM(annihilate component)
    K = K - 1
  end
  /* Select θ* that minimizes the MML criterion in Eq. (9) */
  return [θ*]
end

Figure 2: Outline of the AMoFA algorithm
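The control flow of Figure 2 can be sketched as a generic search loop. Everything here is a hypothetical skeleton: the callables `split`, `add_factor`, `annihilate`, `n_components` and `message_length` stand in for the EM-based steps of the paper, the tolerance is an assumed stopping rule, and the early annihilation inside EM is omitted.

```python
from typing import Callable, List, TypeVar

M = TypeVar("M")

def adaptive_search(
    init: M,
    split: Callable[[M], M],
    add_factor: Callable[[M], M],
    annihilate: Callable[[M], M],
    n_components: Callable[[M], int],
    message_length: Callable[[M], float],
    tol: float = 1e-6,
) -> M:
    """Incremental phase, then decremental phase, then pick the best model."""
    model = init
    history: List[M] = [model]
    # Incremental phase: try one split and one factor addition per step.
    while True:
        current = message_length(model)
        best = min([split(model), add_factor(model)], key=message_length)
        model = best
        history.append(model)
        if current - message_length(best) <= tol:
            break
    # Decremental phase: remove components one by one down to a single one.
    while n_components(model) > 1:
        model = annihilate(model)
        history.append(model)
    # Every intermediate model was stored; return the shortest message.
    return min(history, key=message_length)

# Toy stand-in: a "model" is just its number of components, and the
# message length is minimised at three components.
best_toy = adaptive_search(
    init=1,
    split=lambda k: k + 1,
    add_factor=lambda k: k,
    annihilate=lambda k: k - 1,
    n_components=lambda k: k,
    message_length=lambda k: float((k - 3) ** 2),
)
```

Storing every visited model is what lets the final selection recover a good intermediate solution even if the incremental phase overshoots.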

3.1 The Generalized Message Length Criterion

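To allow local factor analyzers to have independent latent dimensionality, the MML criterion given in Figueiredo and Jain Figueiredo2002 () should be generalized accordingly to reflect the individual code length of components:

$$\mathcal{L}(\theta, X) = \sum_{k: \pi_k > 0} \frac{C_k}{2} \log\left(\frac{N \pi_k}{12}\right) + \frac{K_{nz}}{2} \log\frac{N}{12} + \sum_{k: \pi_k > 0} \frac{C_k + 1}{2} - \log p(X \mid \theta) \qquad (6)$$

where $C_k$ denotes the number of parameters for component $k$, $X$ represents the dataset with $N$ data items, $\theta$ the model parameters, and $K_{nz}$ represents the number of non-zero weight components. The first three terms in Eq. 6 comprise the code length for real valued model parameters, the fourth term is the model log-likelihood. We propose to include the code length for integer hyper parameters, namely $K_{nz}$ and component-wise latent dimensionalities $p_k$, such that the encoding becomes decodable as required by MDL theory Rissanen1983 (); Rissanen:InCSM2007 (). For this purpose, we use Rissanen’s universal prior for integers Rissanen:InCSM2007 ():

$$w^*(k) = c^{-1}\, 2^{-\log^*(k)} \qquad (7)$$

which gives the (ideal) code length

$$L^*(k) = \log^*(k) + \log c \qquad (8)$$

where $\log^*(k) = \log k + \log\log k + \dots$ is n-fold logarithmic sum with positive terms, $c$ is the normalizing sum that is tightly approximated as $c \approx 2.865064$ Rissanen:InCSM2007 (). The $\log^*(k)$ term in Eq. 8 can be computed via a recursive algorithm. We finally obtain $L^*(K_{nz})$, the cost to encode the number of components, and similarly $L^*(p_k)$, the cost to encode the local dimensionalities, and add them to Eq. (6) to obtain a message length criterion:

$$\mathcal{L}(\theta, X) = \sum_{k: \pi_k > 0} \frac{C_k}{2} \log\left(\frac{N \pi_k}{12}\right) + \frac{K_{nz}}{2} \log\frac{N}{12} + \sum_{k: \pi_k > 0} \frac{C_k + 1}{2} + L^*(K_{nz}) + \sum_{k: \pi_k > 0} L^*(p_k) - \log p(X \mid \theta) \qquad (9)$$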
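The recursive computation of the integer code length mentioned above might look like the following sketch. It assumes base-2 logarithms (code length in bits) and uses the approximate normalizing constant 2.865064 commonly quoted for Rissanen's universal prior; the function names are illustrative.

```python
import math

LOG2_C = math.log2(2.865064)  # log of the normalizing constant, in bits

def log_star(x: float) -> float:
    """Recursive sum log2(x) + log2(log2(x)) + ..., keeping positive terms only."""
    t = math.log2(x)
    if t <= 0.0:
        return 0.0
    return t + log_star(t)

def universal_code_length(k: int) -> float:
    """Ideal code length for a positive integer k under the universal prior."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    return log_star(float(k)) + LOG2_C
```

For example, `log_star(16)` accumulates 4, then 2, then 1 before the next term reaches zero. If the rest of the criterion is measured in nats, the result would need to be multiplied by ln 2.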


3.2 Component Splitting and Factor Addition

Adding a new component by splitting an existing one involves two decisions: which component to split, and how to split it. AMoFA splits the component that looks least like a Gaussian, by looking at a multivariate kurtosis metric MardiaMA (). For a multinormal distribution, the multivariate kurtosis takes the value $\beta_{2,d} = d(d+2)$, and if the underlying population is multivariate normal with mean $\mu$, the sample counterpart of $\beta_{2,d}$, namely $b_{2,d}$, is asymptotically normal with mean $d(d+2)$ and variance $8d(d+2)/N$ as the number of samples $N$ goes to infinity. Salah and Alpaydın Salah04imofa () adapted this metric to the mixture model by using a “soft count” $h_{tk} \equiv E[\mathcal{G}_k \mid \mathbf{x}_t]$:

$$\gamma_k = \frac{b_{2,d}^k - d(d+2)}{\sqrt{8d(d+2) / \sum_{t=1}^{N} h_{tk}}}, \qquad b_{2,d}^k = \frac{1}{\sum_{l=1}^{N} h_{lk}} \sum_{t=1}^{N} h_{tk} \left[ (\mathbf{x}_t - \mu_k)^T \Sigma_k^{-1} (\mathbf{x}_t - \mu_k) \right]^2 \qquad (10)$$

The component with greatest $\gamma_k$ is selected for splitting. AMoFA runs a local, 2-component MoFA on the data points that fall under the component. To initialize the means of new components prior to MoFA fitting, we use the weighted sum of all eigenvectors of the local covariance matrix to offset the new means from $\mu_k$, the mean vector of the component to split.
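A minimal sketch of the split selection is given below, assuming the soft-count form of Mardia's statistic: the weighted mean of squared Mahalanobis distances squared, centred at d(d+2) and scaled by sqrt(8d(d+2)/n_k), where n_k is the soft count. The function name and the toy data are illustrative assumptions.

```python
import numpy as np

def kurtosis_scores(X, H, mus, Sigmas):
    """Normalised multivariate kurtosis per component.

    X: (N, d) data, H: (N, K) posteriors, mus and Sigmas: per-component
    means and covariances.
    """
    _, d = X.shape
    scores = []
    for k in range(H.shape[1]):
        h = H[:, k]
        n_k = h.sum()                       # soft count of component k
        diff = X - mus[k]
        maha = np.einsum("ij,ij->i", diff,
                         np.linalg.solve(Sigmas[k], diff.T).T)
        b = (h * maha ** 2).sum() / n_k     # weighted sample kurtosis
        scores.append((b - d * (d + 2)) / np.sqrt(8.0 * d * (d + 2) / n_k))
    return np.array(scores)

# Toy check: component 0 is Gaussian, component 1 is heavy-tailed (Laplace).
rng = np.random.default_rng(0)
G = rng.normal(size=(2000, 2))
L = rng.laplace(size=(2000, 2)) + 10.0
X_k = np.vstack([G, L])
H_k = np.zeros((4000, 2))
H_k[:2000, 0] = 1.0
H_k[2000:, 1] = 1.0
scores = kurtosis_scores(
    X_k, H_k,
    [G.mean(axis=0), L.mean(axis=0)],
    [np.cov(G, rowvar=False), np.cov(L, rowvar=False)],
)
to_split = int(np.argmax(scores))
```

The heavy-tailed component receives the larger score, so it would be the one chosen for splitting in this toy setting.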

The component having the largest difference between modeled and sample covariance is selected for factor addition. As in IMoFA, AMoFA uses the residual factor addition scheme. Given a component $\mathcal{G}_k$ and a set of data points $\mathbf{x}_t$ under it, the re-estimated points after projection to the latent subspace can be written as: $\hat{\mathbf{x}}_t = \mu_k + \Lambda_k E[\mathbf{z} \mid \mathbf{x}_t, \mathcal{G}_k]$. The re-estimation error decreases with the number of factors used in $\Lambda_k$. The newly added column in the factor loading matrix, $\Lambda_{k,p_k+1}$, is selected to be the principal direction (the eigenvector with the largest eigenvalue) of the residual vectors $\mathbf{x}_t - \hat{\mathbf{x}}_t$. This new factor is used in bootstrapping the EM procedure.
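Residual factor addition can be sketched as below. The posterior mean of the factors uses the standard factor analysis expression Lambda^T (Lambda Lambda^T + Psi)^{-1} (x - mu); weighting residuals by the posteriors and scaling the new column by the square root of the leading eigenvalue are assumptions made here for illustration.

```python
import numpy as np

def residual_factor(X, h, mu, Lambda, psi_diag):
    """Propose a new loading column from the residuals of one component."""
    Sigma = Lambda @ Lambda.T + np.diag(psi_diag)
    beta = np.linalg.solve(Sigma, Lambda).T   # p x d, equals Lambda^T Sigma^-1
    diff = X - mu
    Z = diff @ beta.T                         # posterior factor means, N x p
    resid = diff - Z @ Lambda.T               # x_t minus its re-estimate
    w = h / h.sum()
    C = (resid * w[:, None]).T @ resid        # weighted residual scatter
    vals, vecs = np.linalg.eigh(C)            # ascending eigenvalues
    return vecs[:, -1] * np.sqrt(vals[-1])    # principal residual direction

# Toy data with variance along two axes; the current model explains only one.
rng = np.random.default_rng(0)
Zt = rng.normal(size=(1000, 2))
A = np.array([[3.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
X_r = Zt @ A + 0.1 * rng.normal(size=(1000, 3))
new_col = residual_factor(
    X_r, np.ones(1000), np.zeros(3),
    np.array([[3.0], [0.0], [0.0]]), np.full(3, 0.01),
)
```

In the toy example the proposed column points along the second axis, the direction the single existing factor leaves unexplained.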

3.3 Component Annihilation

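In a Bayesian view, the message length criterion (Eq. (9)) adopted from Figueiredo and Jain Figueiredo2002 () corresponds to assuming a flat prior on component parameters $\theta_k$, and a Dirichlet prior on mixture proportions $\pi_k$:

$$p(\pi_1, \dots, \pi_K) \propto \exp\left\{ -\sum_{k=1}^{K} \frac{C_k}{2} \log \pi_k \right\}$$

Thus, in order to minimize the adopted cost in Eq. (9), the M-step of EM is changed for $\pi_k$:

$$\hat{\pi}_k = \frac{\max\left\{0, \left(\sum_{t=1}^{N} h_{tk}\right) - \frac{C_k}{2}\right\}}{\sum_{j=1}^{K} \max\left\{0, \left(\sum_{t=1}^{N} h_{tj}\right) - \frac{C_j}{2}\right\}} \qquad (14)$$

where $h_{tk}$ is the posterior probability of component $k$ for sample $\mathbf{x}_t$. This means that all components having a soft count ($\sum_t h_{tk}$) smaller than half the number of local parameters $C_k$ will be annihilated. This threshold enables the algorithm to get rid of components that do not justify their existence. In the special case of AMoFA, the number of parameters per component is defined as:

$$C_k = d\,(p_k + 2) + L^*(p_k)$$

where $d$ is the original dataset dimensionality, $p_k$ is the local latent dimensionality of component $k$, and $L^*(p_k)$ is the code length for $p_k$. The additive constant inside the bracket accounts for the parameter cost of mean $\mu_k$ and local diagonal uniquenesses matrix $\Psi_k$. Finally, the localized annihilation condition to check at the M step of EM is simply $\sum_t h_{tk} < C_k / 2$.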
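The annihilating update of the mixture proportions can be sketched as follows, assuming the local parameter count d(p_k + 2) plus an integer code length for p_k, and a thresholded, renormalized soft count as in Figueiredo and Jain's M-step. The helper names, and the choice of bits for the code length, are assumptions.

```python
import math
import numpy as np

def code_length(k: int) -> float:
    """Universal integer code length in bits (assumed constant 2.865064)."""
    total, t = math.log2(2.865064), math.log2(k)
    while t > 0.0:
        total += t
        t = math.log2(t)
    return total

def local_param_count(d: int, p_k: int) -> float:
    """Loadings (d * p_k), mean (d) and noise variances (d), plus code length."""
    return d * (p_k + 2) + code_length(p_k)

def mml_mixing_update(soft_counts, C):
    """Thresholded proportions: components below C_k / 2 get zero weight."""
    num = np.maximum(0.0, np.asarray(soft_counts) - np.asarray(C) / 2.0)
    if num.sum() == 0.0:
        raise ValueError("all components would be annihilated")
    return num / num.sum()

# Toy case: a weak one-factor component and a strong two-factor component.
C_demo = np.array([local_param_count(10, 1), local_param_count(10, 2)])
pis_demo = mml_mixing_update(np.array([5.0, 100.0]), C_demo)
```

In the toy case the first component's soft count of 5 is below half its parameter count, so its proportion is set to zero and the remaining weight goes to the second component.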

In AMoFA, we use an outer loop to drive the model class adaptation and an inner EM loop to fit a mixture of factor analyzers model with initialized parameters. The inner EM algorithm is an improved and more generalized version of ULFMM Figueiredo2002 (), where after parallel EM updates we select the weakest component and check for annihilation, as opposed to the sequential component update approach (using Component-wise EM - celeux2001component ()). Automatic component annihilation may take place at any time during EM. When the incremental progress is saturated, the downsizing component annihilation is initiated. The MML based EM algorithm and relevant details are given in Appendix A.

4 Experiments

4.1 Evaluation Protocol for Clustering Performance

We compare AMoFA with two benchmark algorithms on clustering, namely the ULFMM algorithm from Figueiredo2002 () [footnote 1: The code is available at …] and the IMoFA-L from Salah04imofa ().

We use the Normalized Information Distance (NID) metric for evaluating clustering accuracy, as it possesses several important properties: in addition to being a metric, it admits an analytical adjustment for chance, and allows normalization to the [0, 1] range Vinh:2010:ITM (). NID is formulated as:

$$NID(\mathbf{u}, \mathbf{v}) = 1 - \frac{MI(\mathbf{u}, \mathbf{v})}{\max\{H(\mathbf{u}), H(\mathbf{v})\}}$$

where entropy $H(\mathbf{u})$ and the mutual information $MI(\mathbf{u}, \mathbf{v})$ for clustering are defined as follows:

$$H(\mathbf{u}) = -\sum_{i} \frac{a_i}{N} \log \frac{a_i}{N}, \qquad MI(\mathbf{u}, \mathbf{v}) = \sum_{i} \sum_{j} \frac{n_{ij}}{N} \log \frac{n_{ij} / N}{a_i b_j / N^2}$$

Here, $a_i$ is the number of samples in cluster $i$ of clustering $\mathbf{u}$ (and $b_j$ the number of samples in cluster $j$ of clustering $\mathbf{v}$), $n_{ij}$ is the number of samples falling into cluster $i$ in clustering $\mathbf{u}$ and cluster $j$ in clustering $\mathbf{v}$. MI is a nonlinear measure of dependence between two random variables. It quantifies how much information in bits the two variables share. We compute NID between the ground truth and the clusterings obtained by the automatic model selection techniques in order to give a more precise measure of clustering than just the number of clusters. When there is no overlap, NID is expected to be close to 0; higher overlap of clusters might result in higher average NID, though a relative performance comparison can still be achieved.
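A compact way to compute NID from two label vectors is sketched below, using the contingency table of the two clusterings and base-2 logarithms. The function name is illustrative, and returning 0 when both clusterings consist of a single cluster is a convention assumed here.

```python
import numpy as np

def nid(u, v) -> float:
    """Normalized Information Distance between two labelings (in [0, 1])."""
    _, ui = np.unique(np.asarray(u), return_inverse=True)
    _, vi = np.unique(np.asarray(v), return_inverse=True)
    ui, vi = ui.ravel(), vi.ravel()
    n = np.zeros((ui.max() + 1, vi.max() + 1))
    np.add.at(n, (ui, vi), 1.0)               # contingency table n_ij
    N = n.sum()
    a, b = n.sum(axis=1), n.sum(axis=0)       # cluster sizes a_i and b_j

    def entropy(c):
        p = c / N
        return float(-np.sum(p * np.log2(p)))

    nz = n > 0                                # skip empty cells (0 log 0 = 0)
    mi = float(np.sum((n[nz] / N) * np.log2(n[nz] * N / np.outer(a, b)[nz])))
    h_max = max(entropy(a), entropy(b))
    return 0.0 if h_max == 0.0 else 1.0 - mi / h_max

same_partition = nid([0, 0, 1, 1], [1, 1, 0, 0])   # labels permuted only
independent = nid([0, 0, 1, 1], [0, 1, 0, 1])      # no shared information
```

Because the measure depends only on the contingency table, relabeling clusters does not change the distance.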

4.2 Experiments on Benchmark Datasets for Clustering

We tested three algorithms, namely IMoFA-L, AMoFA and ULFMM, on benchmark synthetic/real datasets for clustering. For maximum comparability with previous work, we used some synthetic dataset examples from Figueiredo and Jain Figueiredo2002 (), as well as from Yang et al.’s study on automatic mixture model selection yang2012robust ().

AMoFA, as opposed to IMoFA and ULFMM, does not rely on random initialization. In IMoFA, the first factor is randomly initialized, and in ULFMM the initial cluster centers are assigned to randomly selected instances. AMoFA initializes the first factor from the principal component of the dataset. Similar to residual factor addition, this scheme can be shown to converge faster than random initializations. Given a dataset, a single simulation is sufficient to assess performance.

Because of this deterministic property of AMoFA, we report the results with multiple datasets sampled from the underlying distribution, instead of sampling once and simulating multiple times. Unless stated otherwise, in the following experiments with synthetic datasets, 100 samples are drawn and the average results are reported. For ULFMM, we give initial number of clusters in all our simulations for clustering and use free full covariances. Moreover, the EM convergence threshold is set to in all three methods.

Example 1: 3 Separable Gaussians

To illustrate the evolution of the solution with AMoFA, we generated a mixture of three Gaussians having the same mixture proportions and the same covariance matrix, but separate means. We generate 100 datasets of 900 data points each from the underlying distribution. This synthetic example is used in Figueiredo2002 (); yang2012robust (). Figure 3 shows the evolution of adaptive steps of AMoFA with found clusters shown in 2-std contour plots, and the description length (DL) is given above each plot.

Figure 3: The evolution of AMoFA on a toy synthetic dataset. To keep the figure uncluttered, only the mixture models obtained at the end of adaptive steps are given. The initial step fits a single component-single factor model. The first two iterations add components to the mixture, and the next one adds a factor. The incremental phase stops when no (considerable) improvement in the message length is observed. Then, the algorithm starts to annihilate the components, until a single component is left. As expected, the DL in the decremental phase is higher, since components have two factors. Finally, the algorithm selects the 3-component solution having the minimum DL.

Example 2: Overlapping Gaussians

To test the approach for finding the correct number of clusters, we use a synthetic example very similar to the one used in Figueiredo2002 (); yang2012robust (). Here, three of the four Gaussians overlap with the following generative model:

We use data points. As in the previous example, we generate 100 random datasets. In figure 4 left plot, the data are illustrated with a sample result of AMoFA. Out of 100 simulations, the accuracy of finding K*=4 is 92, 56, and 33 for AMoFA, ULFMM, and IMoFA, respectively. The histogram in figure 4 right plot shows the distribution of number of automatically found clusters for three methods. Average NID over 100 datasets is found to be 0.2549, 0.2951, and 0.3377 for AMoFA, ULFMM and IMoFA, respectively. A paired t-test (two tailed) on NID scores indicates that AMoFA performs significantly better than ULFMM with .

Figure 4: Overlapping Gaussians data. Left: A sample AMoFA result. The real labels are shown with colors and resulting AMoFA mixture model is shown with 2-std contour plot. Right: Histograms of number of clusters found by AMoFA, IMoFA and ULFMM respectively.

4.3 Application to Classification: Modeling Class Conditional Densities

We compare AMoFA with three benchmark model selection algorithms, namely, the VBMoFA algorithm from Ghahramani1999 (), the ULFMM algorithm from Figueiredo2002 () and the IMoFA algorithm from Salah04imofa (). As a baseline, we use a Mixture of Gaussians, where the data of each class are modeled by a single Gaussian with full (MoG-F) or diagonal (MoG-D) covariances. We compare the performances of the methods on classification tasks (via class-conditional modeling) on nine benchmark datasets: the ORL face database with a binary gender classification task samaria1994parameterisation (), the 16-class phoneme database from the LVQ package of Helsinki University of Technology LVQPAK (), the VISTEX texture database Salah04imofa (), a 6-class Japanese Phoneme database GurgenAUA94 () [footnote 2: Pre-processed versions of VISTEX and Japanese Phoneme datasets that are used in this study can be accessed from …], the MNIST dataset lecun1998gradient (), and four datasets (Letter, Pendigits, Optdigits, and Waveform) from the UCI ML Repository UCIML (). Table 2 gives some basic statistics about the databases. Except for MNIST, which has an explicit train and test protocol, all experiments were carried out with 10-fold cross-validation. Simulations are replicated 10 times on MNIST, where we crop the 4 pixel padding around the images and scale them to $10 \times 10$ pixels to obtain 100-dimensional feature vectors.
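The MNIST preprocessing can be sketched as follows. The 28 x 28 input size is the standard MNIST format and the 10 x 10 target follows from the 100-dimensional feature vectors; the rescaling method is not stated in the text, so the 2 x 2 block averaging used here is an assumption.

```python
import numpy as np

def preprocess_digit(img28) -> np.ndarray:
    """Crop the 4-pixel border of a 28 x 28 digit and downscale to 10 x 10."""
    img = np.asarray(img28, dtype=float)
    if img.shape != (28, 28):
        raise ValueError("expected a 28 x 28 image")
    cropped = img[4:-4, 4:-4]                              # 20 x 20
    # Assumed rescaling: average each non-overlapping 2 x 2 block.
    pooled = cropped.reshape(10, 2, 10, 2).mean(axis=(1, 3))
    return pooled.ravel()                                   # 100 features

# Tiny example: one bright 2 x 2 block just inside the cropped region.
demo = np.zeros((28, 28))
demo[4:6, 4:6] = 4.0
features = preprocess_digit(demo)
```

Other interpolation schemes would give slightly different feature values but the same 100-dimensional layout.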

Dataset | Dimensions | Classes | # of Samples
ORL | 256 | 2 | 400
LVQ | 20 | 16 | 3858
OPT | 60 | 10 | 4677
PEN | 16 | 10 | 8992
VIS | 169 | 10 | 3610
WAV | 21 | 3 | 500
JPN | 112 | 6 | 1200
LET | 16 | 26 | 20000
MNT | 100 | 10 | 70000
Table 2: Datasets Used for Class Conditional Mixture Modeling

In the experiments, we trained separate mixture models for the samples of each class, and used maximum likelihood classification. We did not use informative class priors, as it would positively bias the results, and hide the impact of likelihood modeling. In Table 3, we provide accuracy computed over 10 folds, where all six approaches used the same protocol. ULFMM column reports performance of ULFMM models with free diagonal covariances, as full covariance models invariably give poorer results.
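The classification protocol can be illustrated with the simplest of the compared models, a single full-covariance Gaussian per class (the MoG-F baseline), combined with maximum likelihood classification without class priors. The function names and the small ridge term added for numerical stability are assumptions; in the actual experiments a mixture model is fitted per class instead.

```python
import numpy as np

def fit_class_gaussians(X, y, reg=1e-6):
    """Fit one full-covariance Gaussian to the samples of each class."""
    models = {}
    for c in np.unique(y):
        Xc = X[y == c]
        Sigma = np.cov(Xc, rowvar=False) + reg * np.eye(X.shape[1])
        models[c] = (Xc.mean(axis=0), Sigma)
    return models

def log_likelihood(X, mu, Sigma):
    """Row-wise Gaussian log density."""
    d = X.shape[1]
    diff = X - mu
    _, logdet = np.linalg.slogdet(Sigma)
    maha = np.einsum("ij,ij->i", diff, np.linalg.solve(Sigma, diff.T).T)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

def predict(models, X):
    """Assign each sample to the class with the highest likelihood."""
    classes = list(models)
    scores = np.column_stack([log_likelihood(X, *models[c]) for c in classes])
    return np.array(classes)[scores.argmax(axis=1)]

# Two well-separated toy classes.
X_train = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [-0.1, -0.1],
                    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [4.9, 4.9]])
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])
models = fit_class_gaussians(X_train, y_train)
pred = predict(models, np.array([[0.05, 0.05], [5.0, 5.1]]))
```

Swapping `fit_class_gaussians` and `log_likelihood` for a mixture fit and its log-likelihood leaves the decision rule unchanged.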

Dataset | IMoFA-L Salah04imofa () | VBMoFA Ghahramani1999 () | AMoFA
ORL | 97.8 ± 1.5 | 93.0 ± 2.8 | 97.5 ± 1.2
LVQ | 91.2 ± 1.9 | 91.3 ± 1.9 | 89.3 ± 1.6
OPT | 91.1 ± 2.7 | 95.2 ± 1.8 | 93.8 ± 2.4
PEN | 97.9 ± 0.7 | 97.8 ± 0.6 | 98.1 ± 0.6
VIS | 69.3 ± 4.6 | 67.1 ± 5.9 | 77.2 ± 4.6**
WAV | 80.8 ± 4.5 | 85.1 ± 4.2 | 85.6 ± 4.6
JPN | 93.4 ± 2.4 | 93.2 ± 3.2 | 96.5 ± 2.2*
LET | 86.6 ± 1.5 | 95.2 ± 0.7 | 95.1 ± 0.7
MNT | 91.5 ± 0.2 | 84.5 ± 0.1 | 93.9 ± 0

Dataset | ULFMM Figueiredo2002 () | MoG-D | MoG-F
ORL | 80.0 ± 6.5 | 89 ± 2.4 | 90 ± 0
LVQ | 75.4 ± 4.5 | 88.1 ± 2.6 | 92.1 ± 1.8
OPT | 49.5 ± 10.2 | 84.2 ± 3.1 | 94.9 ± 1.7
PEN | 89.9 ± 2.0 | 84.5 ± 2.0 | 97.4 ± 0.6
VIS | 20.6 ± 3.7 | 68.6 ± 3.9 | 44.7 ± 12.8
WAV | 72.1 ± 7.5 | 80.9 ± 18.2 | 84.8 ± 4.6
JPN | 82.4 ± 2.1 | 82.2 ± 4.9 | 92.3 ± 2.3
LET | 56.9 ± 2.8 | 64.2 ± 1.2 | 88.6 ± 0.9
MNT | 64.7 ± 2.0 | 78.2 ± 0 | 93.7 ± 0
Table 3: Classification Performances for Class-Conditional Models. Significantly better results compared to the first runner up are shown in bold, where * signifies 0.05 significance level, while ** corresponds to 0.01 significance level. If there are multiple best performers without pair-wise significant difference, they are shown in bold altogether.

The best results for a dataset are shown in bold. We compared the algorithms with a non-parametric sign test. For each dataset, we conducted a one-tailed paired-sample t-test with a significance level of 0.05 (0.01 upon rejection of the null hypothesis). Results indicate that ULFMM ranks last in all cases: it is consistently inferior even against the MoG-F baseline. This is because, after randomized initialization of clusters, the ULFMM algorithm annihilates all illegitimate components, skipping intermediate (possibly better than initial) models. On seven datasets AMoFA attains or shares the first rank, and on the remaining two it ranks second. Note that although AMoFA and VBMoFA have a similar number of wins against each other overall, on high dimensional datasets, namely on MNIST, VISTEX, Japanese Phoneme and ORL, AMoFA significantly outperforms VBMoFA.

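Row vs. Column | AMoFA | VBMoFA | ULFMM | MoG-D | MoG-F
IMoFA | 1/2/6 | 2/4/3 | 9/0/0 | 7/2/0 | 2/3/4
AMoFA | * | 4/3/2 | 9/0/0 | 7/2/0 | 5/3/1
VBMoFA | | * | 9/0/0 | 7/2/0 | 3/5/1
ULFMM | | | * | 0/3/6 | 0/0/9
MoG-D | | | | * | 1/2/6
Table 4: Row Wins/Ties/Loses against Column with 0.05 Significance.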

The results of pairwise tests at 0.05 significance level are shown in Table 4. We see that the adaptive MoFA algorithms dramatically outperform the GMM based ULFMM algorithm. MoFA is capable of exploring a wider range of models between diagonal and full covariance with reduced parameterization. Among the three MoFA based algorithms, no significant difference was found on the Pendigits dataset. AMoFA outperforms the best results reported so far with the VISTEX dataset. The best test set accuracy reported in Salah04imofa () is 73.8 ± 1.1 using GMMs. We attain 77.2 ± 4.6 with AMoFA.

5 Conclusions and Outlook

In this study, we propose a novel and adaptive model selection approach for Mixtures of Factor Analyzers. Our algorithm first adds factors and components to the mixture, and then prunes excessive parameters, thus obtaining a parsimonious model in a very time and space efficient way. Our contributions include a generalization of the adopted MML criterion to reflect local parameter costs, as well as local component annihilation thresholds.

We carry out experiments on many real datasets, and the results indicate the superiority of the proposed method in class-conditional modeling. We contrast our approach with the Incremental MoFA approach Salah04imofa (), Variational Bayesian MoFA Ghahramani1999 (), as well as the popular ULFMM algorithm Figueiredo2002 (). In high dimensions, MoFA based automatic modeling provides significantly better classification results than GMM based ULFMM modeling, as it is capable of exploring a much wider range of models with compact parametrization. It also makes use of the latent dimensionality of the local manifold, thus enabling an adaptive cost for the description length. The AMoFA algorithm is observed to offer the best performance on higher dimensional datasets.

The proposed algorithm does not necessitate a validation set to control model complexity. Thanks to the optimized MML criterion and the fast component selection measures for incremental adaptation, the algorithm is not only robust, but also efficient. It does not have any requirement for parameter tuning. Using a recursive version of ULFMM Zivkovic04RULFMM (), it is also possible to extend the proposed method for online learning. A MATLAB tool for AMoFA is available from

Appendix A EM Algorithm for Mixture of Factor Analyzers with MML Criterion

In this section, we give the MoFA EM algorithm optimizing the generalized MML criterion given in Eq. (9). This criterion is used for automatic annihilation of components at the M step. We provide the formulation of the MML based EM algorithm, which is closely related to regular EM for the MoFA model Ghahramani97EM4MFA ():


where to keep the notation uncluttered, is defined as . Similarly, , , and


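The above EM formulation aims to optimize the MoFA log likelihood, which is the logarithm of the linear combination of component likelihoods:

$$\log p(X \mid \theta) = \sum_{t=1}^{N} \log \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}_t;\, \mu_k,\, \Lambda_k \Lambda_k^T + \Psi_k) \qquad (24)$$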

  Require: data X, and initialized MoFA parameter set θ
  REPEAT
    E step: compute the expectations h_tk, E[z | x_t, G_k] and E[z z^T | x_t, G_k] using Eqs. (23), (18) and (19), respectively
    M step: compute model parameters using Eqs. (20)-(22)
    Compute π_k using Eq. (14)
    while any component needs annihilation
      Annihilate the weakest component k, i.e. one whose soft count Σ_t h_tk is below C_k / 2
      Update π
    end
    Compute the log-likelihood using Eq. (24)
    Compute the message length using Eq. (9)
  UNTIL the message length converges within the tolerance

Figure 5: EM Algorithm for MoFA with MML Criterion

The EM algorithm for MoFA using the MML criterion is given in Figure 5.


  • (1) P. Moerland, Mixture Models for Unsupervised and Supervised Learning, Ph.D. Thesis, The Swiss Federal Inst. of Tech. at Lausanne, 2000.
  • (2) G. McLachlan, D. Peel, Finite Mixture Models, New York: Wiley, 2000.
  • (3) A. K. Jain, Data clustering: 50 years beyond K-means, Pattern Recognition Letters 31 (8) (2010) 651–666. doi:10.1016/j.patrec.2009.09.011.
  • (4) C. Bouveyron, C. Brunet-Saumard, Model-based clustering of high-dimensional data: A review, Computational Statistics & Data Analysis 71 (2014) 52–78.
  • (5) Z. Ghahramani, G. E. Hinton, The EM algorithm for mixtures of factor analyzers, Tech. Rep. CRG-TR-96-1, University of Toronto (1997).
  • (6) A. A. Salah, E. Alpaydın, Incremental mixtures of factor analysers, in: Proc. Int. Conf. on Pattern Recognition, 2004, pp. 276–279.
  • (7) Z. Ghahramani, M. J. Beal, Variational Inference for Bayesian Mixtures of Factor Analysers, in: NIPS, Vol. 12, 1999, pp. 449–455.
  • (8) M. A. T. Figueiredo, A. K. Jain, Unsupervised Learning of Finite Mixture Models, IEEE Trans. Pattern Analysis and Machine Intelligence 24 (3) (2002) 381–396.
  • (9) H. Akaike, A new look at the statistical model identification, IEEE Trans. Automatic Control 19 (6) (1974) 716–723.
  • (10) G. Schwarz, Estimating the Dimension of a Model, Annals of Statistics 6 (2) (1978) 461–464.
  • (11) J. Rissanen, A Universal Prior for Integers and Estimation by MDL, The Annals of Statistics 11 (2) (1983) 416–431.
  • (12) J. Rissanen, Information and complexity in statistical modeling, Information Science and Statistics, Springer, Dordrecht, 2007.
  • (13) C. Wallace, P. Freeman, Estimation and inference by compact coding, Journal of Royal Statistical Society, Series B 49 (3) (1987) 240–265.
  • (14) J. J. Verbeek, N. Vlassis, B. Kröse, Efficient greedy learning of Gaussian mixture models, Neural computation 15 (2) (2003) 469–485.
  • (15) C. E. Rasmussen, The Infinite Gaussian Mixture Model, in: NIPS, no. 11, 2000, pp. 554–560.
  • (16) R. Gomes, M. Welling, P. Perona, Incremental learning of nonparametric Bayesian mixture models, in: CVPR, 2008, pp. 1–8.
  • (17) L. Shi, S. Tu, L. Xu, Learning Gaussian mixture with automatic model selection: A comparative study on three Bayesian related approaches, Frontiers of Electr. and Electronic Eng. in China 6 (2) (2011) 215–244.
  • (18) D. Pelleg, A. W. Moore, X-means: Extending k-means with efficient estimation of the number of clusters, in: ICML, 2000, pp. 727–734.
  • (19) M. Law, M. A. T. Figueiredo, A. Jain, Simultaneous feature selection and clustering using mixture models, IEEE Trans. Pattern Analysis and Machine Intelligence 26 (9) (2004) 1154–1166.
  • (20) Z. Zivkovic, F. van der Heijden, Recursive unsupervised learning of finite mixture models, IEEE Trans. Pattern Analysis and Machine Intelligence 26 (5) (2004) 651–656.
  • (21) L. Shi, L. Xu, Local factor analysis with automatic model selection: A comparative study and digits recognition application, in: S. Kollias, A. Stafylopatis, W. Duch, E. Oja (Eds.), Int. Conf. on Artificial Neural Networks, Vol. 4132 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2006, pp. 260–269.
  • (22) C. Constantinopoulos, A. Likas, Unsupervised learning of Gaussian mixtures based on variational component splitting, IEEE Trans. Neural Networks 18 (3) (2007) 745–755.
  • (23) S. Boutemedjet, N. Bouguila, D. Ziou, A Hybrid Feature Extraction Selection Approach for High-Dimensional Non-Gaussian Data Clustering, IEEE Trans. Pattern Analysis and Machine Intelligence 31 (8) (2009) 1429–1443.
  • (24) D. Gorur, C. E. Rasmussen, Nonparametric mixtures of factor analyzers, in: IEEE Signal Processing and Communications Applications Conf., 2009, pp. 708–711. doi:10.1109/SIU.2009.5136494.
  • (25) M.-S. Yang, C.-Y. Lai, C.-Y. Lin, A robust EM clustering algorithm for Gaussian mixture models, Pattern Recognition 45 (11) (2012) 3950–3961.
  • (26) T. Iwata, D. Duvenaud, Z. Ghahramani, Warped mixtures for nonparametric cluster shapes (2012). arXiv:1206.1846.
  • (27) W. Fan, N. Bouguila, D. Ziou, Variational learning of finite Dirichlet mixture models using component splitting, Neurocomputing 129 (2014) 3–16.
  • (28) W. Fan, N. Bouguila, Online variational learning of generalized Dirichlet mixture models with feature selection, Neurocomputing 126 (2014) 166–179.
  • (29) J. Kersten, Simultaneous feature selection and Gaussian mixture model estimation for supervised classification problems, Pattern Recognition 47 (8) (2014) 2582–2595.
  • (30) M. E. Tipping, C. M. Bishop, Mixtures of probabilistic principal component analyzers, Neural Comput. 11 (2) (1999) 443–482.
  • (31) A. P. Dempster, N. M. Laird, D. B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, Series B 39 (1) (1977) 1–38.
  • (32) K. V. Mardia, J. T. Kent, J. M. Bibby, Multivariate Analysis, Probability and Mathematical Statistics, Academic Press, 1979.
  • (33) G. Celeux, S. Chrétien, F. Forbes, A. Mkhadri, A component-wise EM algorithm for mixtures, Journal of Computational and Graphical Statistics 10 (4) (2001) 697–712.
  • (34) V. X. Nguyen, J. Epps, J. Bailey, Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance, Journal of Machine Learning Research 11 (2010) 2837–2854.
  • (35) F. S. Samaria, A. C. Harter, Parameterisation of a stochastic model for human face identification, in: Proc. WACV, IEEE, 1994, pp. 138–142.
  • (36) T. Kohonen, J. Hynninen, J. Kangas, K. Torkkola, LVQ-PAK, Helsinki University of Technology (1995).
  • (37) F. S. Gürgen, R. Alpaydin, U. Ünlüakin, E. Alpaydin, Distributed and local neural classifiers for phoneme recognition, Pattern Recognition Letters 15 (10) (1994) 1111–1118.
  • (38) Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324.
  • (39) A. Frank, A. Asuncion, UCI machine learning repository (2010).