Overfitting Bayesian Mixtures of Factor Analyzers with an Unknown Number of Components
Abstract
Recent advances on overfitting Bayesian mixture models provide a solid and straightforward approach for inferring the underlying number of clusters and model parameters in heterogeneous data. In this study we demonstrate the applicability of such a framework in clustering multivariate continuous data with possibly complex covariance structure. For this purpose an overfitting mixture of factor analyzers is introduced, assuming that the number of factors is fixed. A Markov chain Monte Carlo (MCMC) sampler combined with a prior parallel tempering scheme is used to estimate the posterior distribution of model parameters. The optimal number of factors is estimated using information criteria. Identifiability issues related to the label switching problem are dealt by postprocessing the simulated MCMC sample by relabelling algorithms. The method is benchmarked against stateoftheart software for maximum likelihood estimation of mixtures of factor analyzers and standard mixtures of multivariate Gaussian distributions using an extensive simulation study. Finally, the applicability of the method is illustrated in publicly available data.
keywords:
Factor Analysis, Mixture Models, Clustering, MCMCurl]panagiotis.papastamoulis@manchester.ac.uk
1 Introduction
Factor Analysis (FA) is a popular statistical model that aims to explain correlations in a highdimensional space by dimension reduction. This is typically achieved by expressing the observed multivariate data as a linear combination of a smaller set of hypothetical and uncorrelated variables known as factors. The factors are not observed, so they are treated as missing data. The reader is referred to kim1978factor (); bartholomew2011latent () for an overview of factor analysis models, estimation techniques and applications.
However, when the observed data is not homogeneous, the typical FA model will not adequately fit the data. In such a case, a Mixture of Factor Analyzers (MFA) can be used in order to take into account the underlying heterogeneity. Thus, MFA models jointly treat two inferential tasks: modelbased density estimation for high dimensional data as well as dimensionality reduction. Estimation of MFA models is straightforward by using the ExpectationMaximization (EM) algorithm Dempster:77 (); ghahramani1996algorithm (); McLachlan:00 (); mclachlan2003modelling (); mclachlan2011mixtures (). The family of parsimonious Gaussian mixture models (PGMM) is introduced in McNicholas2008 (); doi:10.1093/bioinformatics/btq498 (), which is based on Gaussian mixture models with parsimonious factor analysis like covariance structures. These models are estimated by the alternating expectationconditional maximization (AECM) algorithm RSSB:RSSB082 (). Under a Bayesian setup, Fokoue2003 () estimate the number of mixture components and factors by simulating a continuoustime stochastic birthdeath point process using a BirthDeath MCMC algorithm stephens2000 (). Their algorithm is shown to perform well in small to moderately scaled multivariate data.
Fully Bayesian approaches to estimate the number of components in a mixture model include the Reversible jump MCMC (RJMCMC) Green:95 (); Richardson:97 (); dellaportas2006multivariate (); papRJ (), Birthdeath MCMC (BDMCMC) stephens2000 () and allocation sampler Nobile2007 () algorithms. However, in recent years, there is a growing progress on the usage of overfitted mixture models in Bayesian analysis rousseau2011asymptotic (); overfitting (). An overfitting mixture model consists of a number of components which is much larger than its true (and unknown) value. Under a frequentist approach, overfitting mixture models is not a recommended practice. In this case, the true parameter lies on the boundary of the parameter space and identifiability of the model is violated due to the fact that some of the component weights can be equal to zero or some components may have equal parameters. Consequently, standard asymptotic Maximum Likelihood theory does not apply in this case li1988mixtures (). Choosing informative prior distributions that bound the posterior away from unidentifiability sets can increase the stability of the MCMC sampler, however these informative priors tend to force too many distinct components and the possibility of reducing the overfitting mixture to the true model is lost (see Section 4.2.2 in fruhwirth2006finite ()). Under suitable prior assumptions introduced in rousseau2011asymptotic (), it has been shown that asymptotically the redundant components will have zero posterior weight and force the posterior distribution to put all its mass in the sparsest way to approximate the true density. Therefore, the inference on the number of mixture components can be based on the posterior distribution of the “alive” components of the overfitted model, that is, the components which contain at least one allocated observation.
The simplicity of this approach is in stark contrast with the fully Bayesian approach of treating the number of clusters as a random variable. For example, in the RJMCMC algorithm the researcher has to design sophisticated move types that bridge models with different number of clusters. On the other hand, the allocation sampler is only applicable to cases where the model parameters can be analytically integrated out. Even in such cases though, the design of proper MetropolisHastings moves on the space of latent allocation variables of the mixture model is required to obtain a reasonable mixing of the simulated MCMC chain (see Nobile2007 (); papastamoulis2016bayesbinmix ()).
The contribution of this study is to utilise recent advances on overfitting mixture models overfitting () to the context of Bayesian MFA Fokoue2003 (). We use a Gibbs sampler geman (); gelfand () which is embedded in a prior parallel tempering scheme in order to improve the mixing of the algorithm. In addition, we explore the usage of information criteria for estimating the number of factors. After estimating the number of clusters and factors, we perform inference on the chosen model by dealing with identifiability issues related to the label switching problem papastamoulis2016label (). Our results indicate that overfitting Bayesian MFA models provide a simple and efficient approach to estimate the number of clusters in correlated highdimensional data.
The rest of the paper is organized as follows. Section 2.1 reviews the basics of FA models. Finite mixtures of FA models are presented in Section 2.2 and the corresponding prior assumptions are discussed in Section 2.3. The overfitting MFA is introduced in Section 2.4. Section 2.5 deals with estimating the number of factors using information criteria. Section 2.6 presents the prior parallel tempering scheme which is incorporated in the MCMC sampler. Identifiability issues related to the label switching phenomenon are discussed in Section 2.7 and some details of the overall implementation are given in Section 2.8. Our method is illustrated in Section 3 using a simulation study (3.1) as well as a publicly available dataset (3.2). We also provide comparisons against popular software for estimating MFA models with the EM algorithm. The paper concludes in Section 4. Some technical details regarding the implementation of the Gibbs sampler as well as the generation of synthetic data are given in A.1 and A.2, respectively. Finally, Appendix A.3 contains additional benchmarking results based on traditional finite mixtures of multivariate normal distributions.
2 Methodology
At first we introduce some conventional guidelines that will be followed in our notation throughout this paper, unless explicitly stated otherwise. In principle, we will use lower case letters to represent scalars or vectors and upper case letters to denote matrices. The notation will correspond to the th member of a vector whose elements are scalars or vectors. In addition, will denote the th member of a vector whose elements are matrices. The element of a matrix will be denoted by the corresponding lower case letter, that is, . The transpose matrix of will be denoted as . Since we use upper case letters to denote matrices, we will not differentiate the notation between random variables and their specific realizations. We use to denote the probability mass or density function of given . Whether refers to a (univariate or multivariate) mass or density function will be clear from the context. For a discrete random variable , the notation will be also used to denote the probability of the event .
2.1 Factor Analysis Model
Let denote a random sample of dimensional observations, with ; . We assume that has been generated by a linear combination of a latent vector (factors)
(1) 
The unobserved random vector lies on a lower dimensional space, that is, and it consists of uncorrelated features . In particular, we assume that
(2) 
independent for , where denotes the identity matrix. The dimensional matrix contains the factor loadings, while the dimensional vector contains the marginal mean of . Finally, denotes an error term
(3) 
independent for , where . Furthermore, is assumed independent of ; . It follows that the conditional distribution of is
(4) 
It can be easily derived that the marginal distribution of is
(5) 
independent for .
According to the last expression, the covariance matrix of is equal to . As shown in Equation (4), the knowledge of the missing data () implies that the conditional distribution of has a diagonal covariance matrix. This is the crucial characteristic of factor analysis models, where they aim to explain highdimensional dependencies using a set of lowerdimensional uncorrelated factors.
Given that there are factors, the number of effective parameters in the covariance matrix is equal to (lawley1962factor, ). The number of free parameters in the uncostrained covariance matrix of is equal to . Hence, under the factor analysis model, the number of parameters in the covariance matrix is reduced by
The last expression is positive if where , a quantity which is known as the Ledermann bound ledermann1937rank (). Hence, we assume that the number of latent factors does not exceed . When it can be shown that is almost surely unique BEKKER1997255 (). Note however that this does not necessarily imply that the model is not identified if bekker2014identification ().
Given identifiability of , a second source of identifiability problems is related to orthogonal transformations of the matrix of factor loadings. Indeed, consider a matrix with and (orthogonal matrix) and define . It follows that the representation leads to the same marginal distribution of as the one in Equation (5). Following Fokoue2003 (), we preassign values to some entries of , in particular we set the entries of the upper diagonal of the first block matrix of equal to zero.
Another identifiability problem in model (1) is related to the socalled “sign switching” phenomenon, see e.g conti2014bayesian (). Observe that Equation (1) remains invariant when simultaneously switching the signs of a given row of ; and . Thus, when using MCMC samplers to explore the posterior distribution of model parameters, both and ; are not marginally identifiable due to signswitching across the MCMC trace. However, one could use the approximate Maximum A Posteriori (MAP) estimate arising from the MCMC output in order to infer the mode of the posterior distribution of the specific parameters. These estimates correspond to the parameter values obtained at the iteration that maximizes the posterior distribution across the MCMC run and they are not affected by the signswitching problem. Another possibility is to restrict our attention to signinvariant parameter functions. For example, notice that the covariance matrix is invariant when switching the sign of each element in . Following sabatti2006bayesian (), we also define the “regularised score” of variable to factor as
(6) 
for ; . Notice that Equation (6) is invariant to simultaneously switching the signs of and ’s, therefore, the estimation of is meaningful.
2.2 Finite Mixtures of Factor Analyzers
In this section the typical factor analysis model is generalized in order to take into account unobserved heterogeneity. Assume that there are underlying groups in the population, where the number of clusters denotes a known integer. Each cluster is characterized by different structure, which is reflected in our model by assuming a distinct set of parameters ; . Consider the latent allocation parameters which assign observation to a cluster for . Then, given the cluster allocations, each observation is expressed as
Let ; and . Apriori each observation is generated from cluster with probability equal to , that is,
(7) 
independent for . We will refer to as the vector of mixing proportions of the model. Note that the allocation vector is not observed, so it should be treated as missing data.
In general, the latent factors can be different for each cluster, for instance, , where represents the latent dimensionality of cluster ; . Following Fokoue2003 () we assume that are independent and that the intrinsic latent dimension is the same for each cluster, so the marginal distribution of is still described by Equation (2).
Now, we express Equations (3), (4) conditionally on the cluster membership
independent for , where and , . Thus, Equation (5) becomes
(8) 
As shown in Equation (8), the marginal distribution of is a finite mixture of distributions with components.
A further assumption, also imposed by Fokoue2003 (), is the restriction of the errors terms to have the same variance inside each cluster. In this case are independent and the marginal distribution of is given in Equation (3). Although we consider both cases, in the following section we present the general model where the variance of errors is allowed to vary between clusters.
Finally, the generalisation of Equation (6) to the case that there are clusters is straightforward. For cluster , the regularised score of variable to factor is defined as
(9) 
for ; and , where denotes the indicator function. Note that Equation (9) is not defined in case where , however this is not a problem in our implementation due to the fact that we only make inference on the “alive” clusters, that is, the subset of defined as .
2.3 Bayesian formulation
This section introduces the Bayesian framework for the MFA model, given , and a random sample size of observations from Equation (8). Our aim is to estimate the model parameters: , where . The cluster assignments as well as the latent factors are treated as missing data. Let denote the Dirichlet distribution and denote the Gamma distribution with mean . Let also denotes the th row of the matrix of factor loadings ; ; . The following prior assumptions are imposed on the model parameters:
(10)  
(11)  
(12)  
(13)  
(14) 
where all variables are assumed mutually independent and ; ; ; . In Equation (12), denotes a vector consisting of zero entries and denotes a diagonal matrix, where the diagonal entries are distributed independently according to Equation (14). The prior assumptions are the same as the ones introduced in Fokoue2003 () for fixed and .
Given the fixed set of hyperparameters , the joint probability density function of the model is written as:
(15) 
and its graphical representation is shown in Fig 1. In order to estimate the model parameters we use the Gibss sampler to approximately sample from the posterior distribution . The reader is referred to A.1 for details.
2.4 Overfitted mixture model
Assume that the observed data has been generated from a mixture model with components
where ; denotes a member of a parametric family of distributions. Consider that an overfitted mixture model with components is fitted to the data. It has been shown rousseau2011asymptotic () that the asymptotic behaviour of the posterior distribution of the redudant components depends on the prior distribution of mixing proportions . Let denote the dimension of free parameters of the emission distribution . For the case of a Dirichlet prior distribution in Equation (10), if then the posterior weight of the extra components converges to zero.
Let denote the joint posterior distribution of model parameters and latent allocation variables for a number of components equal to . When using an overfitted mixture model, the inference on the number of clusters reduces to (a): choosing a sufficiently large value of mixture components (), (b): running a typical MCMC sampler for drawing samples from the posterior distribution of and (c) inferring the number of “alive” mixture components. Note that at MCMC iteration (c) reduces to keeping track of the number of elements in the set , where denotes the simulated allocation of observation at iteration . At this point, we underline the fact that the MCMC sampler always operates on a space with components. The empty components at a given iteration may contain allocated observations in subsequent iterations.
In our case the dimension of free parameters in the emission distribution is equal to . We set , thus the distribution of mixing proportions in Equation (10) becomes
(16) 
where denotes a prespecified positive number. Such a value is chosen for two reasons. At first, it is smaller than so the asymptotic results of rousseau2011asymptotic () ensure that extra components will be emptied as . Second, this choice can be related to standard practice when using Bayesian nonparametric clustering methods where the parameters of a mixture are drawn from a Dirichlet process ferguson (), that is, a Dirichlet process mixture model neal2000markov (). This approach is used in a variety of applications that involve clustering of subjects into an unknown number of groups, see for example medvedovic2002bayesian () where they cluster gene expression profiles using infinite mixture models. Under this point of view, the prior distribution in Equation (16) represents a finitesum random probability measure that approximates the Dirichlet process with concentration parameter ishwaran2002exact ().
2.5 Selecting the Number of Factors
Recall that the MFA model has been defined assuming that the number of factors is fixed. However, the dimensionality of the latent factor space is not known and should be estimated. In Fokoue2003 () the number of factors is treated as a random variable and a BirthDeath MCMC sampler stephens2000 () is used to draw samples from the joint posterior distribution of model parameters and . Treating as a random variable is definitely preferable from a Bayesian point of view, however it is noted that the sampler may exhibit poor mixing, especially in high dimensions. In our setup, we consider a simpler approach where we produce MCMC samples conditionally on each value of in a prespecified range and choose the best one according to model selection criteria.
The following penalized likelihood criteria were used: Akaike’s information Criterion (AIC) akaike (), Bayesian Information Criterion (BIC) Schwarz:78 (), and Deviance Information Criterion (DIC) spiegelhalter2002bayesian (). Let denotes the deviance. Then,
where is the number of free parameters of a mixture model of factor analyzers with clusters and factors, whereas denotes the effective dimension of the model. It is wellknown that DIC tends to overfit spiegelhalter2014deviance (), so we also consider an alternative version which increases the penalty term, namely linde2005dic (); van2012bayesian (). For AIC and BIC, corresponds to the Maximum Likelihood estimate of . On the other hand, definition of is not unique for DIC (see the discussion in celeux2006deviance ()). We have chosen as the simulated parameter values that maximizes the observed loglikelihood across the MCMC run. Another choice would be the values that maximize the logposterior distribution, which is suggested in cases of small sample size. In our examples, both choices gave identical answers.
It should be noted that in order to compute the observed likelihood as well as the number of free parameters in AIC and BIC we used only the values of the most probable number of clusters that correspond to “alive” components (after rescaling the weights in order to sum to 1). It does not make sense to include the redundant parameters in the computation of the number of parameters simply because the results will be inconsistent when considering various values for the total number of clusters in the overfitted mixture (). Also recall here that asymptotically the redundant components are assigned zero posterior weight. This means that the contribution of the extra clusters in the observed likelihood is zero.
2.6 Prior Parallel Tempering
It is well known that the posterior surface of mixture models can exhibit many local modes celeux2000computational (); Marin:05 (). In such cases simple MCMC algorithms may become trapped in one minor mode and demand a very large number of iterations to sufficiently explore the posterior distribution. In order to produce a wellmixing MCMC sample and improve the convergence of our algorithm we utilize ideas from parallel tempering schemes geyer1991 (); geyer1995 (); Altekar12022004 (), where different chains are running in parallel and they are allowed to switch states. Each chain corresponds to a different posterior distribution, and usually each one represents a “heated” version of the target posterior distribution. This is achieved by raising the original target to a power with , which flattens the posterior surface, thus, easier to explore when using an MCMC sampler.
In the context of overfitting mixture models, overfitting () introduced a prior parallel tempering scheme. Under this approach, each heated chain corresponds to a model with identical likelihood as the original, but with a different prior distribution. Although the prior tempering can be done in any subset of parameters, it is only applied to the Dirichlet prior distribution of the weights of the overfitted mixture. Let us denote by and ; , the posterior and prior distribution of the th chain, respectively. Obviously, . Let denote the state of chain at iteration and assume that a swap between chains and is proposed. The proposed move is accepted with probability
(17) 
where corresponds to the probability density function of the Dirichlet prior distribution related to chain . According to Equation (16), this is
(18) 
for a prespecified set of parameters for .
2.7 Label Switching Problem
The likelihood of a mixture model with components is invariant with respect to permutations of the parameters . Typically, the same invariance property holds for the prior distribution of . Therefore, the posterior distribution will be invariant to permutations of the labels and it will exhibit (a multiple of) symmetric modes. The label switching problem redner1984mixture (); Jasra (); Papastamoulis2013 () refers to the fact that an MCMC sample that has sufficiently explored the posterior surface will be switching among those symmetric areas.
Note that the symmetry of the likelihood is not of great practical importance under a frequentist point of view, since the EM algorithm will convergence to a mode of the likelihood surface. However, under a Bayesian approach it burdens the estimation procedure since the marginal posterior distribution of will be the same for all . In order to derive meaningful estimates of the marginal posterior distributions as well as estimates of the posterior means the inference should take into account the label switching problem stephens2000dealing (); Marin:05 (); Papastamoulis:10 (); rodriguez (); yao2014online (). A variety of relabelling algorithms is available in the label.switching package papastamoulis2016label (), more specifically we have used the ECR reordering algorithm Papastamoulis:10 ().
Finally we underline that reordering the MCMC sample will only deal with the label switching problem but not the sign switching problem related to the rows of ; , as shown in Figure 2. Figure 2.(a) illustrates both label and sign switching. After successfully dealing with label switching, the MCMC trace of factor loadings shown in Figure 2.(b) is identifiable up to a sign indeterminacy. But as mentioned earlier, for this particular set of parameters we restrict our attention to signinvariant parametric functions.
(a)  (b) 
2.8 Details of the implementation
Data normalization and prior parameters Before running the sampler, the raw data is standardized by applying the transformation
(19) 
where and . The main reason for using standardized data is that the sampler mixes better. Furthermore, it is easier to choose prior parameters that are not depending on the observed data, that is, using the data twice. In any other case, one could use empirical prior distributions as reported in Fokoue2003 (), see also dellaportas2006multivariate (). For the case of standardized data, the prior parameters are specified in Table 1.
value  2  1  1  2  1 

Prior parallel tempering We used a total of parallel chains where the prior distribution of mixing proportions for chain in Equation (18) is selected as
where . In our examples we used , but in general we strongly suggest to tune this parameter until a reasonable acceptance rate is achieved. Each chain runs in parallel and every 10 iterations we randomly select two adjacent chains , and propose to swap their current states. A proposed swap is accepted with probability equal to Equation (17).
Initialization strategy The sampler may require a large number of iterations to discover the high posterior probability areas when initialized from totally random starting values. Of course this behaviour is alleviated as the number of parallel chains () increases, for example overfitting () obtained good mixing when for mixtures of univariate normal distributions. In our case, the number of parameters is dramatically larger, and an even larger number of parallel chains would be required to obtain similar levels of mixing. However we would rather to test our method in cases where the number of available cores takes smaller values, e.g. or even , which are typical numbers of threads in modernday laptops. Under this point of view, the following twostage initialization scheme is adopted.
We used an initial period of 100 MCMC iterations where each chain is initialized from totally random starting values, but under a Dirichlet prior distribution with large prior parameter values. These values were chosen in a way that the asymptotic results of rousseau2011asymptotic () guarantee that the redundant mixture components will have nonnegligible posterior weights. More specifically for chain we assume with , for . We observed that such an initialization quickly reaches to a state where the true clusters are split into many subgroups. Then, we initialize the actual model by this state. In this case, the sampler will spend some time combining the split clusters to homogeneous groups. As soon as all similar subgroups are merged into homogeneous clusters, the sampler will start exploring the stationary distribution of the chain. According to simulations (not reported here), this initialization scheme can recover the high posterior density areas under a considerably smaller number of iterations compared to standard initialization schemes with random starting values.
3 Results
At first, we use an extended simulation study in order to evaluate the ability of our method to estimate the correct clustering structure. Our results are benchmarked in terms of clustering estimation accuracy against stateoftheart software for fitting MFA models with the EM algorithm as implemented in the R packages pgmm (version 1.2) pgmm () and EMMIXmfa (version 1.2.6) EMMIXmfa (). Furthermore, we also evaluate the ability of the proposed method in terms of selecting the correct number of latent factors. Finally, we illustrate the applicability of our method to a publicly available dataset where the groundtruth classification of the data is known and give interpretations of the inferred latent factors.
3.1 Simulation study
3.1.1 Clustering accuracy and benchmarking against the EM algorithm
We simulated 300 synthetic datasets from mixtures of multivariate normal distributions (8) where the dimensionality of the observed data equals . We considered that the true number of clusters ranges in the set and for each value we generated 10 datasets consisting of observations, 10 datasets with and 10 datasets with . The underlying latent factor space consists of dimensions and the selection of true values for the matrices of loadings per cluster ; , gives rise to strong positive, negative and zero correlations among subsets of the 40 observed variables. The reader is referred to A.2 for full details of the true values used to generate the data.
The EMbased methods fitted a series of mixture models with . For our method, we considered overfitted mixtures of components. MFA models were fitted according to the proposed method (fabMix) as well as EMMIXmfa and pggm with a number of factors varying in the set and selecting the best one using the BIC criterion. In our method we also considered that , a model which essentially assumes a diagonal variance structure. Our MCMC sampler ran for iterations using heated chains that run on parallel using the same number of threads. The first iterations were discarded as burnin period and then each chain was thinned by keeping every iteration. The MCMC sample corresponding to the retained iterations was reordered according to the ECR algorithm Papastamoulis:10 () available in the label.switching package papastamoulis2016label ().
It is well known that likelihood methods may be highly sensitive to the selection of starting values and that it is quite challenging to adopt optimal schemes for the initialization of the EM algorithm in mixture models (see for example papastamoulis2016estimation ()). For both pgmm and EMMIXmfa we used six starting values: five runs were initialized from random starting values and one run was initialized using the means clustering algorithm. The reported results correspond to the run that converged to the highest loglikelihood value. Regarding the model parameterization we have considered constrained and uncostrained variances in EMMIXmfa. For pgmm, we considered 7 parameterizations by enabling the option: modelSubset = c("CCC", "CCU", "CUC", "CUU", "UCC", "UCU", "UUU"). Note that the 7 models included in the modelSubset argument are a subset of the 12 models available in total in pgmm. The rest 5 models were excluded due to the fact that, based on pivotal runs, they have found to be inferior than the models considered (in terms of BIC criterion) and that in some cases they produced computational errors. All other parameters were set to their default values.
A typical output of the proposed method (fabMix) for a single dataset is shown in Fig 3. Observe that when the number of factors is less than the true one () the posterior distribution of the number of alive components supports (with high posterior probability) larger values than the true number of clusters (). As soon as the true number of factors is reached, then the posterior distribution of the number of alive clusters remains concentrated at the true value.
As shown in Fig 4 and 5 the proposed method is able to infer the true number of clusters as well as the correct clustering structure in terms of the Adjusted Rand Index (ARI) doi:10.1080/01621459.1971.10482356 (). The EM algorithm implementation in pgmm and EMMIXmfa gives quite accurate results in terms of ARI, but note that sometimes the number of clusters may be overestimated (especially when and ). All methods always select the parameterization that corresponds to the one used to generate the data. More specifically, fabMix selects the model with the same variance of errors per cluster, EMMIXmfa selects parameterization sigmaType = "unique", Dtype = "common" and pgmm selects the model corresponding to the "UCU" parameterization.
3.1.2 Estimating the number of factors
In this section we assess the ability of our algorithm to infer the number of factors. We generated synthetic data from mixtures of factor analyzers where the true number of factors varies in the set and for each value 15 datasets were simulated. As previously, it was assumed that the true number of clusters is uniformly distributed on the set . For each case we considered that and .
For each dataset, the number of factors was estimated by running our algorithm considering that and selecting the best model according to AIC, BIC, DIC and . As shown in Fig 6, both AIC and BIC are able to correctly estimate the number of factors. On the other hand, DIC and tend to overfit, as often concluded in the literature (see e.g. spiegelhalter2014deviance ()). However, even in the case that the Deviance Information Criterion is used for choosing the number of factors, the corresponding estimate of the underlying clusters (as well as the number of “alive” clusters) is the same as the one returned by BIC or AIC. This behaviour is also illustrated in Fig 3 for a single dataset. Consequently, given that the MCMC sampler has converged, using the DIC or does not necessarily means that the inferred clustering is “wrong”.
3.2 Wave dataset
We selected a randomly sampled subset of 1500 observations from the wave dataset wavedata (), available from the UCI machine learning repository uci (). According to the available “groundtruth” classification of the dataset, there are 3 equally weighted underlying classes of dimensional continuous data, where each one is generated from a combination of 2 of 3 “base” waves, shown in Fig 7.
We applied the proposed method considering that the number of factors ranges in the set . According to both AIC and BIC criteria the selected number of factors is equal to . Conditionally on , our method selects the true value of clusters as the most probable number of “alive components”. In particular the corresponding estimated posterior probability is . The same model is also chosen by pgmm and EMMIXmfa.
The estimated clusters with respect to the true allocation of each observation is shown in Table 2. It is evident that the resulting estimates are very similar and exhibit strong agreement with the true clustering of the data. The true and estimated correlation matrix of the clustered data is illustrated in Fig 8 using the corrplot package corrplot (). Observe that our method is able to correctly estimate the covariance structure of the data.
fabMix  pgmm  EMMIXmfa  
1  2  3  1  2  3  1  2  3  
true cluster 1  403  46  57  401  49  56  402  48  56 
true cluster 2  32  411  38  31  409  41  41  406  34 
true cluster 3  12  27  474  17  22  474  17  25  471 
ARI  0.622  0.615  0.608  
RI  0.831  0.829  0.825 
As far as factor interpretation is concerned, we estimated the “regularized score” of each variable to the (single) factor per cluster. For this purpose, we only considered the simulated values that correspond to and reorder the MCMC values to undo the label switching. Then, we take ergodic averages in order to estimate (single factor ) in (9) for and , as shown in Fig 9. For the first cluster () we note that takes positive values for and negative values for , while for . Note the strong agreement when comparing these values with the patterns in the correlation matrix in Fig 8 for true cluster 1. It is evident there that the block of variables has different correlation behaviour than . Moreover, observe that the block of variables is not correlated to any of the other variables. Similar conclusions can be drawn when comparing the regularized scores for clusters 2 and 3 with their correlation structure.
4 Discussion and further remarks
This study presented a solution to the complex problem of clustering multivariate data using Bayesian mixtures of factor analyzers. The proposed model builds upon the prior assumptions of Fokoue2003 () assuming a fixed number of clusters and factors. Extending the Bayesian framework of overfitting () we demonstrated that estimating an overfitting mixture model is a straightforward and efficient approach to the problem at hand. We also used prior parallel tempering schemes in order to improve the mixing of the algorithm. The posterior inference conditionally on a specific model (number of alive clusters) is possible after applying suitable algorithms to deal with label switching.
According to our findings, the proposed method can accurately infer both the number and composition of the underlying clusters. Our results were benchmarked against stateoftheart software for estimating MFA models via the EM algorithm, that is, the R packages EMMIXmfa and pgmm. We concluded that our method is quite competitive with the EM algorithm in estimation accuracy and can lead to improved inference as the number of clusters grows large. Starting values and the usage of multiple runs are very important for maximum likelihood methods. We underline that both pgmm and EMMIXmfa produce inferior results when a smaller number of EM runs is being used (results not reported here). At this point recall that our benchmarking simulation study generated 100 datasets with dimension , 100 datasets with dimension , and 100 datasets with dimension . Using a single EM run for each dataset, the number of compared models is equal to , with (maximum number of components) and (maximum number of factors), hence this number should be multiplied by the total number of runs in the case of initializing the EM from multiple starting values (as done in EMMIXmfa and pgmm). On the contrary, the total number of compared models under our Bayesian approach is equal to . In order to give an indication of the complexity of the simulated datasets we also report results arising from standard mixtures of multivariate normal distributions. As shown in A.3, the specific implementation can introduce serious biases in the estimate of the number of clusters.
The estimation of the number of factors () using information criteria gave reasonable answers. In particular, we concluded that all criteria AIC, BIC and DIC are able to select a model (number of factors) with the correct clustering structure. Moreover, both AIC and BIC were able to select the “true” value of . However, the model selected by DIC may overestimate the number of factors. Nevertheless, our main objective is to estimate the underlying clustering structure and we were able to obtain accurate results with the presented method for estimating . Thus, we avoided using more advanced information criteria, such as the complete DICs introduced by celeux2006deviance (). In principle, the calculation of complete DICs requires an increased amount of computational and modelling effort since the analytic evaluation of certain conditional expectations is involved. However it is expected that such an approach could lead to improved results compared to the observed DIC, which is an interesting direction for further research.
Maximum likelihood approaches for estimating MFA models are strongly concentrated on the usage of parsimonious representations of the model parameters, as studied in McNicholas2008 (); doi:10.1093/bioinformatics/btq498 () in the family of PGMMs. In this way, potential overfitting is avoided by using parsimonious models with smaller number of parameters than the fully unconstrained model. Adopting the notation of McNicholas2008 (), the proposed method uses variants of the parameterizations “UUU” (unconstrained factor loadings and variance of errors) and “UCU” (unconstrained factor loadings and equal variance of errors per component). A very interesting extension of our model would be to propose Bayesian estimation of overfitted mixture models using the whole set of parameterizations introduced in the family of PGMMs.
The computing time for a serial implementation of the proposed method to our presented examples for a given number of factors and with heated chains is comparable to the total time needed by pgmm for fitting mixture models for a specified model parameterization and using different initializations. On the other hand, we observed that the task of fitting MFA models in EMMIXmfa for a range of starting values is faster than pgmm and fabMix. Recall also that the researcher will usually consider different model parameterizations, such as the models with the same variance of errors across clusters in EMMIXmfa and fabMix or the 12 parameterizations in pgmm. Thus, the total running time will be (roughly) multiplied by the considered number of models. Nevertheless, as discussed in mcnicholas2010serial (), this task is trivially parallelizable. Finally, we note that the default application of the proposed method uses parallel computing in multiple threads so that the heated chains run in parallel, hence the computing time is substantially reduced compared to a serial implementation. Under the current version of our software and using a LINUX workstation, 8 heated chains in fabMix running in parallel on 8 cores for iterations demand on average , and hours for , respectively.
The R source code of the proposed algorithm as well as scripts to generate all simulated datasets are available online at https://github.com/mqbssppe/overfittingFABMix. An R package is also under development.
5 Acknowledgments
Research was funded by MRC award MR/M02010X/1. The author would like to acknowledge the assistance given by IT services and use of the Computational Shared Facility of the University of Manchester.
Appendix A Supplementary Material
a.1 Gibbs sampling updates
In the following, denotes the distribution of conditionally on . Given and the Gibbs sampler updates each parameter according to the following scheme.

Give some initial values .

At iteration

update

update

update

update

update

update

update .

All full conditional distributions are detailed in Fokoue2003 (). However, in some cases, it is easy to analytically integrate some parameters and accelerate convergence. Note that in steps (2.b) and (2.d) we have integrated out and , respectively. Thus, in steps (2.b) and (2.d), the parameters are updated from the corresponding collapsed conditional distributions, rather than the full conditional distributions.
In particular, after integrating out , it follows that
(20) 
independent for ; . The parameters on the righthand side of Equation (20) are defined as
where
for . Finally, from Equations (7) and (8) it immediately follows that
independent for , where denotes the probability density function of the multivariate normal distribution with mean and covariance matrix . Note that in order to compute the right hand side of the last equation, inversion of the matrix is required. Since is diagonal, this matrix can be efficiently inverted using the Sherman–Morrison–Woodbury formula (10.2307/2030425, ):
for .
a.2 Simulation study details
The synthetic data in Section 3.1 was simulated according to mixtures of multivariate normal distributions, as shown in Equation (8). Given the number of mixture components () and number of factors () for each dataset, the true values of parameters were generated according to the following scheme.
independent for ; . Observe that the diagonal matrix containing the variance of errors () is the same for each mixture component. Under a slight abuse of notation in the matrix , a star () denotes realizations from a random variable following a distribution, where ; . Moreover, a square () denotes realizations from a random variable. Each block matrix containing a star consists of rows and columns, where denotes the integer part of a positive number. In case that is not integer, we fill the remaining columns with squares (corresponding to the last block of ).
a.3 Classic Mixture Models
In order to highlight the challenging nature of the simulated datasets in Section 3.1, we also provide estimates arising from standard mixtures of multivariate normal distributions using the EM algorithm implementation in the popular R packages mclust (version 5.2.3) mclust1 (); mclust2 (), flexmix (version 2.15.0) flexmix1 (); flexmix2 (); flexmix3 () and EMMIX EMMIXskew () (version 1.0.1).
We note that both mclust and EMMIX fit a variety of different parameterizations in the covariance matrix of the component specific multivariate normal distributions and select the best one in terms of the BIC criterion. In addition, we considered both diagonal and full covariance structure at the flexmix package, but we only report the results corresponding to the latter case since it produced more consistent estimates. Finally, we used different random initializations for flexmix and EMMIX.
For the traditional mixture models (Fig 10) observe that flexmix underestimates the number of clusters as soon as () or (). Slightly better results are obtained from EMMIX when and . On the other hand, the best inferred model by mclust overestimates the number of clusters when (), () or ().
References
 (1) J.O. Kim, C. W. Mueller, Factor analysis: Statistical methods and practical issues, Vol. 14, Sage, 1978.
 (2) D. J. Bartholomew, M. Knott, I. Moustaki, Latent variable models and factor analysis: A unified approach, Vol. 904, John Wiley & Sons, 2011.
 (3) A. P. Dempster, N. M. Laird, D. Rubin, Maximum Likelihood from Incomplete Data via the EM Algorithm (with discussion), Journal of the Royal Statistical Society B 39 (1977) 1–38.
 (4) Z. Ghahramani, G. E. Hinton, et al., The em algorithm for mixtures of factor analyzers, Tech. rep., Technical Report CRGTR961, University of Toronto (1996).
 (5) G. McLachlan, D. Peel, Finite Mixture Models, Wiley, New York, 2000.
 (6) G. J. McLachlan, D. Peel, R. Bean, Modelling highdimensional data by mixtures of factor analyzers, Computational Statistics & Data Analysis 41 (3) (2003) 379–388.
 (7) G. J. McLachlan, J. Baek, S. I. Rathnayake, Mixtures of factor analysers for the analysis of highdimensional data, Mixtures: Estimation and Applications (2011) 189–212.
 (8) P. D. McNicholas, T. B. Murphy, Parsimonious gaussian mixture models, Statistics and Computing 18 (3) (2008) 285–296.
 (9) P. D. McNicholas, T. B. Murphy, Modelbased clustering of microarray expression data via latent gaussian mixture models, Bioinformatics 26 (21) (2010) 2705.
 (10) X.L. Meng, D. Van Dyk, The EM algorithm – an old folksong sung to a fast new tune, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 59 (3) (1997) 511–567.
 (11) E. Fokoué, D. Titterington, Mixtures of factor analysers. bayesian estimation and inference by stochastic simulation, Machine Learning 50 (1) (2003) 73–94.
 (12) M. Stephens, Bayesian analysis of mixture models with an unknown number of components – an alternative to reversible jump methods, Annals of Statistics 28 (1) (2000) 40–74.
 (13) P. J. Green, Reversible jump markov chain monte carlo computation and bayesian model determination, Biometrika 82 (4) (1995) 711–732.
 (14) S. Richardson, P. J. Green, On bayesian analysis of mixtures with an unknown number of components., Journal of the Royal Statistical Society: Series B 59 (4) (1997) 731–758.
 (15) P. Dellaportas, I. Papageorgiou, Multivariate mixtures of normals with unknown number of components, Statistics and Computing 16 (1) (2006) 57–68.
 (16) P. Papastamoulis, G. Iliopoulos, Reversible jump MCMC in mixtures of normal distributions with the same component means, Computational Statistics and Data Analysis 53 (4) (2009) 900–911.
 (17) A. Nobile, A. T. Fearnside, Bayesian finite mixtures with an unknown number of components: The allocation sampler, Statistics and Computing 17 (2) (2007) 147–162.
 (18) J. Rousseau, K. Mengersen, Asymptotic behaviour of the posterior distribution in overfitted mixture models, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73 (5) (2011) 689–710.
 (19) Z. van Havre, N. White, J. Rousseau, K. Mengersen, Overfitting bayesian mixture models with an unknown number of components, PLOS ONE 10 (7) (2015) 1–27.
 (20) L. Li, N. Sedransk, et al., Mixtures of distributions: A topological approach, The annals of Statistics 16 (4) (1988) 1623–1634.
 (21) S. FrühwirthSchnatter, Finite mixture and Markov switching models, Springer Science & Business Media, 2006.

(22)
P. Papastamoulis, M. Rattray,
BayesBinMix:
an R package for model based clustering of multivariate binary data, The
R Journal (accepted) arXiv preprint arXiv:1609.06960.
URL https://journal.rproject.org/archive/2017/RJ2017022/index.html  (23) S. Geman, D. Geman, Stochastic relaxation, gibbs distributions, and the bayesian restoration of images, IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI6 (6) (1984) 721–741.
 (24) A. Gelfand, A. Smith, Samplingbased approaches to calculating marginal densities, Journal of American Statistical Association 85 (1990) 398–409.
 (25) P. Papastamoulis, label.switching: An R package for dealing with the label switching problem in MCMC outputs, Journal of Statistical Software 69 (1) (2016) 1–24.
 (26) D. Lawley, A. Maxwell, Factor analysis as a statistical method, Journal of the Royal Statistical Society. Series D (The Statistician) 12 (3) (1962) 209–229.
 (27) W. Ledermann, On the rank of the reduced correlational matrix in multiplefactor analysis, Psychometrika 2 (2) (1937) 85–93.
 (28) P. A. Bekker, J. M. ten Berge, Generic global indentification in factor analysis, Linear Algebra and its Applications 264 (1997) 255 – 263.
 (29) P. A. Bekker, A. Merckens, T. J. Wansbeek, Identification, Equivalent Models, and Computer Algebra: Statistical Modeling and Decision Science, Academic Press, 2014.
 (30) G. Conti, S. FrühwirthSchnatter, J. J. Heckman, R. Piatek, Bayesian exploratory factor analysis, Journal of econometrics 183 (1) (2014) 31–57.
 (31) C. Sabatti, G. M. James, Bayesian sparse hidden components analysis for transcription regulation networks, Bioinformatics 22 (6) (2006) 739–746.
 (32) T. S. Ferguson, A bayesian analysis of some nonparametric problems, The Annals of Statistics 1 (2) (1973) 209–230.
 (33) R. M. Neal, Markov chain sampling methods for dirichlet process mixture models, Journal of computational and graphical statistics 9 (2) (2000) 249–265.
 (34) M. Medvedovic, S. Sivaganesan, Bayesian infinite mixture model based clustering of gene expression profiles, Bioinformatics 18 (9) (2002) 1194–1206.
 (35) H. Ishwaran, M. Zarepour, Exact and approximate sum representations for the dirichlet process, Canadian Journal of Statistics 30 (2) (2002) 269–283.
 (36) H. Akaike, A new look at the statistical model identification, IEEE Transactions on Automatic Control 19 (6) (1974) 716–723.
 (37) G. Schwarz, Estimating the dimension of a model, The Annals of Statistics 6 (2) (1978) 461–464.
 (38) D. J. Spiegelhalter, N. G. Best, B. P. Carlin, A. Van Der Linde, Bayesian measures of model complexity and fit, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64 (4) (2002) 583–639.
 (39) D. J. Spiegelhalter, N. G. Best, B. P. Carlin, A. Linde, The deviance information criterion: 12 years on, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76 (3) (2014) 485–493.
 (40) A. Linde, Dic in variable selection, Statistica Neerlandica 59 (1) (2005) 45–56.
 (41) A. van der Linde, A bayesian view of model complexity, Statistica Neerlandica 66 (3) (2012) 253–271.
 (42) G. Celeux, F. Forbes, C. P. Robert, D. M. Titterington, et al., Deviance information criteria for missing data models, Bayesian analysis 1 (4) (2006) 651–673.
 (43) G. Celeux, M. Hurn, C. P. Robert, Computational and inferential difficulties with mixture posterior distributions, Journal of the American Statistical Association 95 (451) (2000) 957–970.
 (44) J. Marin, K. Mengersen, C. Robert, Bayesian modelling and inference on mixtures of distributions, Handbook of Statistics 25 (1) (2005) 577–590.
 (45) C. J. Geyer, Markov chain Monte Carlo maximum likelihood, in: Proceedings of the 23rd Symposium on the Interface, Interface Foundation, Fairfax Station, Va, 1991, pp. 156–163.
 (46) C. J. Geyer, E. A. Thompson, Annealing Markov chain Monte Carlo with applications to ancestral inference, Journal of the American Statistical Association 90 (431) (1995) 909–920.
 (47) G. Altekar, S. Dwarkadas, J. P. Huelsenbeck, F. Ronquist, Parallel Metropolis coupled Markov chain Monte Carlo for Bayesian phylogenetic inference, Bioinformatics 20 (3) (2004) 407–415.
 (48) R. A. Redner, H. F. Walker, Mixture densities, maximum likelihood and the EM algorithm, SIAM review 26 (2) (1984) 195–239.
 (49) A. Jasra, C. Holmes, D. Stephens, Markov Chain Monte Carlo methods and the label switching problem in Bayesian mixture modeling, Statistical Science 20 (2005) 50–67.
 (50) P. Papastamoulis, G. Iliopoulos, On the convergence rate of random permutation sampler and ECR algorithm in missing data models, Methodology and Computing in Applied Probability 15 (2) (2013) 293–304.
 (51) M. Stephens, Dealing with label switching in mixture models, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 62 (4) (2000) 795–809.
 (52) P. Papastamoulis, G. Iliopoulos, An artificial allocations based solution to the label switching problem in Bayesian analysis of mixtures of distributions, Journal of Computational and Graphical Statistics 19 (2010) 313–331.
 (53) C. E. Rodriguez, S. G. Walker, Label switching in bayesian mixture models: Deterministic relabeling strategies, Journal of Computational and Graphical Statistics 23 (1) (2014) 25–45.
 (54) W. Yao, L. Li, An online bayesian mixture labelling method by minimizing deviance of classification probabilities to reference labels, Journal of statistical computation and simulation 84 (2) (2014) 310–323.

(55)
P. D. McNicholas, A. ElSherbiny, R. K. Jampani, A. F. McDaid, B. Murphy,
L. Banks, pgmm: Parsimonious
Gaussian Mixture Models, r package version 1.2 (2015).
URL http://CRAN.Rproject.org/package=pgmm 
(56)
S. Rathnayake, G. McLachlan, J. Baek,
EMMIXmfa: Mixture
models with componentwise factor analyzers (2015).
URL https://people.smp.uq.edu.au/GeoffMcLachlan/mix_soft  (57) P. Papastamoulis, M.L. MartinMagniette, C. MaugisRabusseau, On the estimation of mixtures of poisson regression models with large number of components, Computational Statistics & Data Analysis 93 (2016) 97–106.
 (58) W. M. Rand, Objective criteria for the evaluation of clustering methods, Journal of the American Statistical Association 66 (336) (1971) 846–850.
 (59) L. Breiman, J. Friedman, R. Olshen, C. Stone, Classification and Regression Trees, Wadsworth International Group: Belmont, California, 1984.

(60)
M. Lichman, UCI machine learning
repository (2013).
URL http://archive.ics.uci.edu/ml 
(61)
T. Wei, V. Simko, corrplot:
Visualization of a Correlation Matrix, r package version 0.77 (2016).
URL http://CRAN.Rproject.org/package=corrplot  (62) P. D. McNicholas, T. B. Murphy, A. F. McDaid, D. Frost, Serial and parallel implementations of modelbased clustering via parsimonious gaussian mixture models, Computational Statistics & Data Analysis 54 (3) (2010) 711–723.
 (63) W. W. Hager, Updating the inverse of a matrix, SIAM Review 31 (2) (1989) 221–239.
 (64) C. Fraley, A. E. Raftery, Modelbased clustering, discriminant analysis and density estimation, Journal of the American Statistical Association 97 (2002) 611–631.
 (65) C. Fraley, A. E. Raftery, T. B. Murphy, L. Scrucca, mclust Version 4 for R: Normal Mixture Modeling for ModelBased Clustering, Classification, and Density Estimation (2012).

(66)
F. Leisch, FlexMix: A general
framework for finite mixture models and latent class regression in R,
Journal of Statistical Software 11 (8) (2004) 1–18.
URL http://www.jstatsoft.org/v11/i08/  (67) B. Grün, F. Leisch, Fitting finite mixtures of generalized linear regressions in R, Computational Statistics & Data Analysis 51 (11) (2007) 5247–5252. doi:10.1016/j.csda.2006.08.014.
 (68) B. Grün, F. Leisch, FlexMix version 2: Finite mixtures with concomitant variables and varying and constant parameters, Journal of Statistical Software 28 (4) (2008) 1–35.

(69)
K. Wang, A. Ng, G. McLachlan,
EMMIXskew: The EM
Algorithm and Skew Mixture Distribution, r package version 1.0.1 (2013).
URL http://CRAN.Rproject.org/package=EMMIXskew