Maximum likelihood estimation in constrained parameter spaces for mixtures of factor analyzers

# Maximum likelihood estimation in constrained parameter spaces for mixtures of factor analyzers

Francesca Greselin Francesca Greselin Department of Statistics and Quantitative Methods
Milano-Bicocca University
Via Bicocca degliArcimboldi 8 - 20126 Milano (Italy). 22email: francesca.greselin@unimib.itSalvatore IngrassiaDepartment of Economics and Business
University of Catania
Corso Italia 55, - Catania (Italy). 44email: s.ingrassia@unict.it
Salvatore Ingrassia Francesca Greselin Department of Statistics and Quantitative Methods
Milano-Bicocca University
Via Bicocca degliArcimboldi 8 - 20126 Milano (Italy). 22email: francesca.greselin@unimib.itSalvatore IngrassiaDepartment of Economics and Business
University of Catania
Corso Italia 55, - Catania (Italy). 44email: s.ingrassia@unict.it
###### Abstract

Mixtures of factor analyzers are becoming more and more popular in the area of model based clustering of high-dimensional data. According to the likelihood approach in data modeling, it is well known that the unconstrained log-likelihood function may present spurious maxima and singularities and this is due to specific patterns of the estimated covariance structure, when their determinant approaches 0. To reduce such drawbacks, in this paper we introduce a procedure for the parameter estimation of mixtures of factor analyzers, which maximizes the likelihood function in a constrained parameter space. We then analyze and measure its performance, compared to the usual non-constrained approach, via some simulations and applications to real data sets.

###### Keywords:
Constrained estimation Factor Analyzers Modeling Mixture Models Model-Based Clustering.

## 1 Introduction and motivation

Finite mixture distributions have been receiving a growing interest in statistical modeling. Their central role is mainly due to their double nature: they combine the flexibility of non-parametric models with the strong and useful mathematical properties of parametric models. According to this approach, when we know that a sample of observations has been drawn from different populations, we assume a specific distributional form in each of the underlying populations. The purpose is to decompose the sample into its mixture components, which, for quantitative data, are usually modeled as a multivariate Gaussian distribution, and to estimate parameters. The assumption of underlying normality, besides the elegant analytic properties, allows also to employ the EM algorithm for the ML estimation of the parameters. On the other side, when considering a large number of observed variables, Gaussian mixture models can provide an over-parameterized solution as, besides the mixing weights, it is required to estimate the mean vector and the covariance matrix for each component (Peel and McLachlan, 2000). As a consequence, we observe at the same time an undue load of computationally intensive procedures for the estimation.

This is the reason why a number of strategies have been introduced in the literature to avoid over-parameterized solutions. Among the various proposal, some authors developed methodologies for variable selection (see, f.i., Liu et al. (2003) and Hoff (2005) in the Bayesian framework, Pan and Shen (2007) and Raftery and Dean (2006) in the frequentist one). They further motivate their approach from the observation that the presence of non-informative variables can be strongly misleading for some clustering methods. With the same purpose of parsimony, but a completely different approach, Banfield and Raftery (1993) devised a methodology to identify common patterns among the component-covariance matrices; their proposal arose a great attention in the literature. Along a slightly different line of thinking, Ghahramani and Hilton (1997) and McLachlan et al. (2003) proposed to employ latent variables to perform dimensional reduction in each component, starting from the consideration that in many phenomena some few unobserved features could be explained by the many observed ones.

In this paper we address mixtures of factor analyzers by assuming that the data have been generated by a linear factor model with latent variables modeled as Gaussian mixtures. Our purpose is to improve the performances of the EM algorithm, by facing with some of its issues and giving practical recipes to overcome them. It is well known that the EM algorithm generates a sequence of estimates, starting from an initial guess, so that the corresponding sequence of the log-likelihood values is not decreasing. However, the convergence toward the MLE is not guaranteed, because the log-likelihood is unbounded and presents local maxima. Another critical point is that the parameter estimates as well as the convergence of the whole estimation process may be affected by the starting values (see, f.i., McLachlan and Krishnan (2007)) so that the final estimate crucially depends on the initial guess. This issue has been investigated by many authors, starting from the seminal paper of Redner and Walker (1984). Along the lines of (Ingrassia, 2004), in this paper we introduce and implement a procedure for the parameters estimation of mixtures of factor analyzers, which maximizes the likelihood function in a constrained parameter space, having no singularities and a reduced number of spurious local maxima. We then analyze and measure its performance, compared to the usual non-constrained approach.

We have organized the rest of the paper as follows. In Section 2 we summarize main ideas about Gaussian Mixtures of Factor Analyzer model; in Section 3 we provide fairly extensive notes concerning the likelihood function and the AECM algorithm. Some well known considerations (Hathaway, 1985) related to spurious maximizers and singularities in the EM algorithm are recalled in Section 4, and motivate our proposal to introduce constraints on factor analyzers. Further, we give a detailed methodology to implement such constraints into the EM algorithm. In Section 6 we show and discuss the improved performance of our procedure, on the ground of some numerical results based on both simulated and real data. Section 7 contains concluding notes and provides ideas for future research.

## 2 The Gaussian Mixture of Factor analyzers

Within the Gaussian Mixture (GM) model-based approach to density estimation and clustering, the density of the -dimensional random variable of interest is modelled as a mixture of a number, say , of multivariate normal densities in some unknown proportions . That is, each data point is taken to be a realization of the mixture probability density function,

 f(x;\boldmathθ)=G∑g=1πgϕd(x;\boldmathμg,\boldmathΣg) (1)

where denotes the -variate normal density function with mean and covariance matrix . Here the vector of unknown parameters consists of the mixing proportions , the elements of the component means , and the distinct elements of the component-covariance matrices . Therefore, the -component normal mixture model (1) with unrestricted component-covariance matrices is a highly parametrized model. We crucially need some method for parsimonious parametrization of the matrices , because they requires parameters. Among the various proposals for dimensionality reduction, we are interested here in considering Mixtures of Gaussian Factor Analyzers (MGFA), which allows to explain data by explicitly modeling correlations between variables in multivariate observations. We postulate a finite mixture of linear sub-models for the distribution of the full observation vector , given the (unobservable) factors . That is we can provide a local dimensionality reduction method by assuming that the distribution of the observation can be given as

 Xi=\boldmathμg+\boldmathΛgUig+eigwith probability πg(g=1,…,G)fori=1,…,n, (2)

where is a matrix of factor loadings, the factors are distributed independently of the errors , which are independently distributed, and is a diagonal matrix . We suppose that , which means that unobservable factors are jointly explaining the observable features of the statistical units. Under these assumptions, the mixture of factor analyzers model is given by (1), where the -th component-covariance matrix has the form

 \boldmathΣg=\boldmathΛg\boldmathΛ′g+\boldmathΨg(g=1,…,G). (3)

The parameter vector now consists of the elements of the component means , the , and the , along with the mixing proportions , on putting . Note that in the case of , there is an infinity of choices for , since model (2) is still satisfied if we replace by , where is any orthogonal matrix of order . As constraints are needed for to be uniquely defined, the number of free parameters, for each component of the mixture, is

 dq+d−12q(q−1).

Comparing the two approaches and willing now to measure the gained parsimony when we use mixtures of factor analyzers, with respect to the more usual gaussian mixtures, and denoting by and , the number of the estimated parameters for the covariance matrices in the GM and MGFA models, respectively, we have to choose values of such that the following quantity is positive

 P=|\boldmathθCovGM(d,G)|−|\boldmathθCovMGFA(d,q,G)|=G2d(d+1)−G[dq−d+12q(q−1)]

i.e.:

 P=G2[(d−q)2−(d+q)].

This is the only requirement for parsimony. Now, we can express the relative reduction given by

 RR(d,q) =|\boldmathθCovGM(d,G)|−|\boldmathθCovGMFA(d,q,G)||\boldmathθCovGM(d,G)|=(d−q)2−(d+q)d(d+1).

In Table 1 we report the relative reduction, in term of lower number of estimated parameters for the covariance matrices in the MGFA models, with respect to the GM models.

The relative reduction represents the extent to which the factor model offers a simpler interpretation for the behaviour of than the alternative assumption given by the gaussian mixture model.

## 3 The likelihood function and the EM algorithm for MGFA

In this section we summarize the main steps of the EM algorithm for mixtures of Factor analyzers, see e.g. McLachlan and Peel (2000) for details.

Let be a sample of size from density (1), and let () denotes the realization of in (2). For given data , parameters in (1) can be estimated according to the likelihood approach via the EM algorithm, where the likelihood function is given by:

 L(\boldmathθ;X∼) =n∏i=1{G∑g=1ϕd(xi;\boldmathμg,\boldmathΣg)πg}=n∏i=1{G∑g=1ϕd(xi;\boldmathμg,\boldmathΛg,\boldmathΨg)πg},

where we set (). Consider the augmented data , where , with if comes from the -th population and otherwise. Then, the complete-data likelihood function can be written in the form:

 Lc(\boldmathθ;X∼)=n∏i=1G∏g=1[ϕd(xi|ui;% \boldmathμg,Λg,Ψg)ϕq(uig)πg]zig. (4)

In particular, due to the factor structure of the model, see Meng and van Dyk (1997), we have to consider the alternating expectation-conditional maximization (AECM) algorithm. Such a procedure is an extension of the EM algorithm that uses different specifications of missing data at each stage. The idea is to partition in such a way that is easy to maximize for given and vice versa. Then, we can iterate between these two conditional maximizations until convergence. In this case where the missing data are the unobserved group labels , and the second part of the parameters vector is given by where the missing data are the group labels and the unobserved latent factors . Hence, the application of the AECM algorithm consists of two cycles, and there is one E-step and one CM-step alternatively considering and in each pair of cycles.

#### First Cycle.

Here it is where the missing data are the unobserved group labels . The complete data likelihood is

 Lc1(\boldmathθ1) =n∏i=1G∏g=1[ϕd(xi;\boldmathμg,\boldmathΣg)πg]zig. (5)

The E-step on the first cycle on the -th iteration requires the calculation of which is the expected complete-data log-likelihood given the data and using the current estimate for . In practice it requires calculating and usual computations show that this step is achieved by replacing each by its current conditional expectation given the observed data , that is we replace by , where

 (6)

On the M-step, the maximization of this complete-data log-likelihood yields

 π(k+1)g =∑ni=1z(k+1)ign \boldmathμ(k+1)g =1ngn∑i=1z(k+1)igxi

where . According to notation in McLachlan and Peel (2000), we set .

#### Second Cycle.

Here it is , where the missing data are the unobserved group labels and the latent factors . Therefore, the complete data likelihood is

 Lc2(\boldmathθ2) =n∏i=1G∏g=1[ϕd(xi|uig;\boldmathμ(k+1)g,\boldmathΣg)ϕq(uig)π(k+1)g]zig =n∏i=1G∏g=1[ϕd(xi|uig;\boldmathμ(k+1)g,\boldmathΛg,\boldmathΨg)ϕq(uig)π(k+1)g]zig, (7)

where

 ϕd(xi|uig;\boldmathμ% (k+1)g,\boldmathΛg,\boldmathΨg) =1|2π\boldmathΨg|1/2exp{−12(xi−\boldmathμ(k+1)g−\boldmathΛguig)′\boldmathΨ−1g(xi−\boldmathμ(k+1)g−\boldmathΛguig)}. ϕq(uig) =1(2π)q/2exp{−12u′iguig}.

Now the complete data log-likelihood is given by

 Lc2(\boldmathθ2) =−nd2ln2π+G∑g=1nglnπg+12n∑i=1G∑g=1zigln|\boldmathΨ−1g| −12n∑i=1G∑g=1zigtr{(xi−\boldmathμ(k+1)g−\boldmathΛguig)(xi−\boldmathμ(k+1)g−\boldmathΛguig)′\boldmathΨ−1g}. (8)

Some algebras lead to the following estimate of , :

 ^\boldmathΛg =S(k+1)g\boldmathγ(k)′g[\boldmathΘ(k)g]−1 ^\boldmathΨg =diag{S(k+1)g−^\boldmathΛg\boldmathγ(k)gS(k+1)g}.

where we set

 S(k+1)g =(1/n(k+1)g)n∑i=1z(k+1)ig(xi−\boldmathμ(k+1)g)(xi−\boldmathμ(k+1)g)′ \boldmathγ(k)g =\boldmathΛ(k)′g(\boldmathΛ(k)g\boldmathΛ(k)′g+% \boldmathΨ(k)g)−1 \boldmathΘ(k)ig =Iq−\boldmathγ(k)g\boldmath% Λ(k)g+\boldmathγ(k)g(xi−% \boldmathμg)(xi−\boldmathμg)′% \boldmathγ(k)′g.

Hence the maximum likelihood estimates and for and can be obtained by alternatively computing the update estimates and , by

 \boldmathΛ+g =S(k+1)g\boldmathγ(k)′g[\boldmathΘ(k)g]−1and% \boldmathΨ+g=diag{S(k+1)g−% \boldmathΛ+g\boldmathγ(k)gS(k+1)g}, (9)

and, from the latter, evaluating the update estimates and by

 \boldmathγ+g=\boldmathΛ′g(\boldmathΛg\boldmathΛ′g+\boldmathΨg)−1and\boldmath% Θ+g=Iq−\boldmathγg\boldmathΛg+\boldmathγgS(k+1)g% \boldmathγ′g, (10)

iterating these two steps until convergence on and , so giving and .

In summary, the procedure can be described as follows. For a given initial random clustering , on the iteration, the algorithm carries out the following steps, for :

1. Compute and consequently obtain , , and ;

2. Set a starting value for and from ;

3. Repeat the following steps, until convergence on and :

1. Compute and from (10);

2. Set and ;

3. Compute and ;

4. Set and ;

To completely describe the algorithm, here we give more details on how to specify the starting values for and from , as it is needed in Step 2.

Starting from the eigen-decomposition of , say , computed on the base of , the main idea is that has to synthesize the ”more important” relations between the observed features, see McNicholas and Murphy (2008). Then, looking at the equality , the initial values of were set as

 λij=√djaij (11)

where is the th largest eigenvalue of and is the th element of the corresponding eigenvector (the th column in ), for and . Finally the matrices can be initialized by the position .

## 4 Likelihood maximization in constrained parametric spaces

Properties of maximum likelihood estimation for normal mixture models have been deeply investigated. It is well known that is unbounded on and may present many local maxima. Day (1969) was perhaps the first noting that any small number of sample points, grouped sufficiently close together, can give raise to spurious maximizers, corresponding to parameters points with greatly differing component standard deviation. To overcome this issue and to prevent from singularities, Hathaway (1985) proposed a constrained maximum likelihood formulation for mixtures of univariate normal distributions, suggesting a natural extension to the multivariate case. Let , then the following constraints

 min1≤h≠j≤kλ(\boldmathΣh\boldmathΣ−1j)≥c (12)

on the eigenvalues of leads to properly defined, scale-equivariant, consistent ML-estimators for the mixture-of-normal case, see Hennig (2004). It is easy to show that a sufficient condition for (12) is

 a≤λig≤b,i=1,…,d;g=1,…,G (13)

where denotes the th eigenvalue of i.e. , and for such that , see Ingrassia (2004). Differently from (12), condition (13) can be easily implemented in any optimization algorithm. Let us consider the constrained parameter space of :

 \boldmathΘc= {(π1,…,πG,\boldmathμ1,…,% \boldmathμG,\boldmathΣ1,…,\boldmathΣG)∈Rk[1+d+(d2+d)/2]: πg≥0,π1+⋯+πG=1,a≤λig≤b,g=1,…,Gi=1,…,d}. (14)

Due to the structure of the covariance matrix given in (3), bound in (13) yields

 λmin(\boldmathΛg\boldmathΛ′g+\boldmathΨg)≥aandλmax(\boldmathΛg\boldmathΛ′g+\boldmathΨg)≤b,g=1,…,G (15)

where and denote the smallest and the largest eigenvalue of respectively. Since and are symmetric and positive definite, then it results:

 λmin(\boldmathΛg\boldmathΛ′g+\boldmathΨg) ≥λmin(\boldmathΛg% \boldmathΛ′g)+λmin(\boldmathΨg)≥a (16) λmax(\boldmathΛg\boldmathΛ′g+\boldmathΨg) ≤λmax(\boldmathΛg% \boldmathΛ′g)+λmax(\boldmathΨg)≤b, (17)

see Lütkepohl (1996). Moreover, being a diagonal matrix, then

 λmin(\boldmathΨg) =miniψig andλmax(\boldmathΨg) =maxiψig, (18)

where denotes the -th diagonal entry of the matrix .

Concerning the square matrix (), we can get its eigenvalue decomposition, i.e. we can find and such that

 \boldmathΛg\boldmathΛ′g=% \boldmathΓg\boldmathΔg\boldmathΓ′g (19)

where is the orthonormal matrix whose rows are the eigenvectors of and is the diagonal matrix of the eigenvalues of , sorted in non increasing order, i.e. , and .

Now, we can apply the singular value decomposition to the rectangular matrix , so giving , where is a unitary matrix (i.e., such that ) and is a rectangular diagonal matrix with nonnegative real numbers on the diagonal, known as singular values, and is a unitary matrix. The columns of and the columns of are called the left singular vectors and right singular vectors of , respectively. Now we have that

 \boldmathΛg\boldmathΛ′g=(UgDgV′g)(VgD′gU′g)=UgDgIqD′gU′g=UgDgD′gU′g (20)

and equating (19) and (20) we get and , that is

 diag(δ1g,…,δqg)=diag(d21g,…,d2qg). (21)

with . In particular, it is known that only the first values of are non negative, and the remaining terms are null. Thus it results

 λmax(\boldmathΛg\boldmathΛ′g)=d21g. (22)

Supposing now to choose a value for the upper bound in such a way that for and , then constraints (16) and (17) are satisfied when

 d2ig+ψig ≥a i=1,…,d (23) dig ≤√b−\boldmathΨig i=1,…,q (24) ψig ≤b i=q+1,…,d (25)

for . In particular, we remark that condition (23) reduces to for .

## 5 Constraints on the covariance matrix for factor analyzers

The two-fold (eigenvalue and singular value) decomposition of the presented above, suggests how to modify the EM algorithm in such a way that the eigenvalues of the covariances (for ) are confined into suitable ranges. To this aim we have to implement constraints (23), (24) and (25).

We proceed as follows on the th iteration:

1. Decompose according to the singular value decomposition as ;

2. Compute the squared singular values of ;

3. Create a copy of and a copy of ;

4. For to , if , then if set else into ;

5. For to , if then set into ;

6. For to , if , then if set into else into ;

7. For to , if then set into ;

8. Set ;

9. Set .

10. Stop.

It is important to remark that the resulting EM algorithm is monotone, once the initial guess, say , satisfies the constraints. Further, as shown in the case of gaussian mixtures in Ingrassia and Rocci (2007), the maximization of the complete loglikelihood is guaranteed. From the other side, it is apparent that the above recipes require some a priori information on the covariance structure of the mixture, throughout the bounds and .

## 6 Numerical studies

In this section we present numerical studies, based on both simulated and real data sets, in order to show the performance of the constrained EM algorithm with respect to unconstrained approaches.

### 6.1 Artificial data

We consider here three mixtures of components of -variate normal distributions, for different values of the parameter . First, we point out that the point of local maximum corresponding to the consistent estimator , has been chosen to be the limit of the EM algorithm using the true parameter as initial estimate, i.e. considering the true classification. In other words, we set if the th unit comes from the th component and otherwise. In the following, such estimate will be referred to as the right maximum of the likelihood function.

To begin with, we generate a set of 100 different random initial clusterings to initialize the algorithm at each run. To this aim, for a fixed number of components of the mixture, we draw each time a set of random starting values for the from the multinomial distribution with values in with parameters . Then we run a hundred times both the unconstrained and the constrained AECM algorithms (for different values of the constraints ) using the same set of initial clusterings in both cases. The initial values for the elements of and can be obtained as described at the end of Section 3 from the eigen-decomposition of , and the algorithms run until convergence or it reaches the fixed maximum number of iterations.

The stopping criterion is based on the Aitken acceleration procedure (Aitken, 1926), to estimate the asymptotic maximum of the log-likelihood at each iteration of the EM algorithm (in such a way, a decision can be made regarding whether or not the algorithm reaches convergence; that is, whether or not the log-likelihood is sufficiently close to its estimated asymptotic value). The Aitken acceleration at iteration is given by

 a(k)=L(k+1)−L(k)L(k)−L(k−1),

where , , and are the log-likelihood values from iterations , , and , respectively. Then, the asymptotic estimate of the log-likelihood at iteration is given by

 L(k+1)∞=L(k)+11−a(k)(L(k+1)−L(k)),

see Böhning et al. (1994). In our analyses, the algorithms stop when , with . Programs have been written in the R language; the different cases and the obtained results are described below.

#### Mixture 1: G=3, d=6, q=2, N=150.

The sample has been generated with weights according to the following parameters:

 \boldmathμ1 =(0,0,0,0,0,0)′ \boldmathΨ1 =diag(0.1,0.1,0.1,0.1,0.1,0.1) \boldmathμ2 =(5,5,5,5,5,5)′ \boldmathΨ2 =diag(0.4,0.4,0.4,0.4,0.4,0.4) \boldmathμ3 =(10,10,10,10,10,10)′ \boldmathΨ3 =diag(0.2,0.2,0.2,0.2,0.2,0.2)

 \boldmathΛ1=⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝0.501.001.000.450.05−0.50−0.600.500.500.101.00−0.15⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠\boldmathΛ2=⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝0.100.200.200.501.00−1.00−0.200.501.000.701.20−0.30⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠\boldmathΛ3=⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝0.100.200.200.001.000.00−0.200.001.000.000.00−1.30⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠.

Hence, the covariance matrices () have the following eigenvalues:

 λ(\boldmathΣ1) =(3.17,1.63,0.10,0.10,0.10,0.10)′ λ(\boldmathΣ2) =(4.18,2.27,0.40,0.40,0.40,0.40)′ λ(\boldmathΣ3) =(2.29,1.93,0.20,0.20,0.20,0.20)′,

whose largest value is given by

First we run the unconstrained algorithm: the right solution has been attained in 24% of cases, without incurring in singularities. Summary statistics (minimum, first quartile , median , third quartile and maximum) about the distribution of the misclassification error over the 100 runs are reported in Table 2. Due to the choice on parameters, we rarely expect too small eigenvalues in the estimated covariance matrices: we set to protect from them; conversely, as local maxima are quite often due to large estimated eigenvalues, we consider setting also a constraint from above, taking into account some values for , the upper bound. To compare how the choice of the bounds and influences the performance of the constrained EM, we experimented with different pairs of values, and in Table 3 we report the more interesting cases. Further results are reported in Figure 1, which provides the boxplots of the distribution of the misclassification errors obtained in the sequence of runs, showing the poor performance of the unconstrained algorithm compared with the good behaviour of its constrained version. For all values of the upper bound , the third quartile of the misclassification error is steadily equal to . Indeed, for and 15 we had no misclassification error, while we observed very low and rare misclassification errors only for and (respectively 3 and 11 not null values, over 100 runs). Moreover, the robustness of the results with respect to the choice of the upper constraint is apparent.