Beyond the EM Algorithm: Constrained Optimization Methods for Latent Class Model
Latent class model (LCM), which is a finite mixture of different categorical distributions, is one of the most widely used models in statistics and machine learning fields. Because of its non-continuous nature and the flexibility in shape, researchers in practice areas such as marketing and social sciences also frequently use LCM to gain insights from their data. One likelihood-based method, the Expectation-Maximization (EM) algorithm, is often used to obtain the model estimators. However, the EM algorithm is well-known for its notoriously slow convergence. In this research, we explore alternative likelihood-based methods that can potential remedy the slow convergence of the EM algorithm. More specifically, we regard likelihood-based approach as a constrained nonlinear optimization problem, and apply quasi-Newton type methods to solve them. We examine two different constrained optimization methods to maximize the log likelihood function. We present simulation study results to show that the proposed methods not only converge in less iterations than the EM algorithm but also produce more accurate model estimators.
keywords:Constrained Optimization, Quasi-Newton’s Method, Quadratic Programming, EM Algorithm, Latent Class Model, Finite Mixture Model.
Latent class model (LCM) (mccutcheon1987latent) is a model to study latent (unobserved) categorical variables by examining a group of observed categorical variables which are regarded as the indictors of the underlying latent variables. It can be regarded as a special case of the finite mixture model (FMM) with component distributions being categorical distributions. It is widely used to analyze ordered survey data collected from real world applications. In many applications in econometrics, social sciences, biometrics, and business analytics (see Hagenaars2002latent; oser2013online for example), finite mixture of categorical distributions arises naturally when we sample from a population with heterogeneous subgroups. LCM is a powerful tool to conduct statistical inference from the collected data in such situations.
We provide a motivating example from white2014bayeslca where an LCM is applied to analyze a dataset of patient symptoms recorded in the Mercer Institute of St. James’ Hospital in Dublin, Ireland (moran2004syndromes). The data is a recording of the presence of six symptoms displayed by patients diagnosed with early onset Alzheimer’s disease. The six symptoms are as follows: hallucination, activity, aggression, agitation, diurnal and affective, and each symptom has status: either present or absent. white2014bayeslca proposed to divide patients into groups such that patients are homogeneous within each group and heterogeneous between groups. Each group’s characteristics are summarized by the LCM parameters that help doctors prepare more specialized treatments. In this sense, LCM is a typical unsupervised statistical learning method that could “learn” the group labels based on the estimated parameters.
Due to its theoretical importance and practical relevance, many different approaches have been proposed to estimate the unknown parameters in LCMs from the observed data. In general, there are mainly two different paradigms. The first one is the frequentist’s approach of maximum likelihood estimation (MLE), i.e., one maximizes the log-likelihood as a function of the unknown parameters. In contrast, a second paradigm – the Bayesian approach – where the unknown parameters obey a distribution and assumes prior distributions on them, then one either analytically or numerically obtains the posterior distributions and statistical inference is carried out based on the posterior distributions.
In recent years, significant progress has been made on Bayesian inference in LCM. white2014bayeslca, by assuming the Dirichlet distribution on each unknown parameter, used Gibbs sampling to iteratively draw samples from the posterior distribution and then conduct inference on the LCM using the samples drawn. The authors also provided an implementation of the approach in R. li2018bayesian described a similar Bayesian approach to estimate the parameters and they also utilized the Dirichlet distribution as the prior distribution. asparouhov2011using introduced a similar implementation package of Bayesian LCM in Mplus. However, compared to the fast development of the Bayesian inference via Markov chain Monte Carlo (MCMC), the frequentist’s MLE approach for LCM has largely lagged. As far as we know, researchers still heavily rely on the expectation-maximization (EM) algorithm (dempster1977maximum), even with its notoriously slow convergence (see for instance meilijson1989fast), to maximize the log-likelihood function. It is known that some authors (jamshidian1997acceleration) use Quasi-Newton methods as alternatives for the EM algorithm in Gaussian mixture models. However, the extension to LCM is not straightforward since LCM includes a lot more intrinsic constraints on the parameters than the general Gaussian mixture model when considered as an optimization problem. More sophisticated optimization methods need to be applied when maximizing the log-likelihood function.
This paper primarily focuses on the MLE paradigm. We propose the use of two widely-used constrained optimization methods to maximize the likelihood function, namely, the projected Quasi-Newton method and the sequential quadratic programming method. Our contributions include not only exploring alternatives beyond the EM algorithm, but also demonstrating that better results could be obtained by using these alternatives. The rest of this paper is organized as follows: in Section 2, we present the preliminaries including the log-likelihood function and the classical EM algorithm. In Section 3, we introduce and discuss the two constrained optimization methods in detail. Some simulation studies and a real world data analysis are presented in Section 4 to compare the performance of the proposed methods with the EM algorithm. We make concluding remarks in Section 5.
2 Latent Class Models and the EM Algorithm
In many applications, a finite mixture distribution arises naturally when we sample from a population with heterogeneous subgroups, indexed by taking values in . Consider a population composed of subgroups, mixed at random in proportion to the relative group sizes . There is a random feature , heterogeneous across and homogeneous within the subgroups. The feature obeys a different probability distribution, often from the same parametric family with differing, for each subgroup. Now we sample from this population, if it is impossible to record the subgroup label, denoted by , then the density is:
which is a finite mixture distribution. In this situation, we often need to estimate the ’s as well as based on the random samples of , when the subgroup label is known or unknown. Throughout this paper, we assume that is known.
The LCM is a special case of the FMM. In LCM, the component densities are multivariate categorical distributions. That is, with each being a categorical random variable, taking values from categories . It is assumed that ’s are independent within each subgroup with an indictor (the latent variable), which is a categorical random variable taking values in , i.e., within each subgroup, the probability density function (PDF) is written as:
where and is the Iverson bracket function, i.e.
Overall, the mixture density of latent class models is:
where, the parameters include both the weight distribution and the ’s that define the categorical distributions.
Suppose we have collected samples drawn from the LCM distribution, denoted by . We write as the data matrix. The log-likelihood function is given by
The maximum likelihood principle is to find a that maximizes the log-likelihood function (1) as the estimation of . Clearly, we can regard the problem of finding such a as an optimization problem. At the same time, we notice that the LCM implies several constraints that need to be satisfied when maximizing the log-likelihood function (1). In particular, the ’s are all nonnegative and sum up to 1. Also, for each and , the ’s are all nonnegative and sum up to 1. Let be the vector of ’s and be the vector of ’s. From an optimization point of view, the MLE in the LCM case is the following optimization problem.
As we can see, the optimization problem (2) possesses equality constraints together with nonnegativity constraints on all the individual decision variables. While there are considerable number of constraints, the feasible region in (2) is indeed the Cartesian product of probability simplexes. We recall that a probability simplex in -dimensional space is defined as
where is the nonnegative orthant of . Let for all and . The constraints in (2) can be written as and .
To maximize the log-likelihood function in (1), the EM algorithm is a classical approach. In statistics, the EM algorithm is a generic framework that is commonly used in obtaining maximum likelihood estimators. The reason why EM algorithm enjoys its popularity in finite mixture model is the fact that we can view finite mixture model as an estimation problem with missing data. More specifically, if we know the true label of each observation, we could obtain the MLE in a fairly straightforward fashion. On the other hand, if we know the true model parameters, it is also trivial to compute the probability each observation belonging to each class. Therefore, a natural idea is that we begin the process with an initial random guess of the parameters, and compute the probability each observation belonging to each class E(xpectation)-step. With those probabilities we compute the MLE, which is the M(aximization)-step. We iterate between the two steps until a convergence condition is reached. Particularly for the LCM, when the EM algorithm is applied to it, the constraints are implicitly satisfied for all the iterations thanks to the way the EM algorithm updates the values of the parameters. This nice property does not necessarily hold naturally when other non-linear optimization algorithms are applied to the optimization problem (2).
In the context of LCM, the details of the EM algorithm is given in Algorithm 1. We make two comments on Algorithm 1. First, Algorithm 1 does not produce standard errors of MLE as a by-product. In order to conduct statistical inference, one has to compute the observed Fisher information matrix and it could be algebraically tedious or might only apply to special cases. This is one of the criticisms often laid out against the EM algorithm as compared to Bayesian analysis using Gibbs samplers for example, where independent posterior samples are collected and statistical inference is easy under such circumstance. Second, the convergence of Algorithm 1 is typically slow. wu1983convergence studied the convergence issue of EM algorithm and concluded that the convergence of the EM algorithm is sublinear when the Jacobian matrix of the unknown parameters is singular. jamshidian1997acceleration also reported that the EM algorithm could well be accelerated by the Quasi-Newton method. In Section 4, we shall also empirically observe the two constrained optimization methods converge in less iterations than the EM algorithm.
3 Constrained Optimization Methods
Motivated by the significant progress in constrained non-linear optimization, as well as the constrained nature of the LCM estimation problem, we propose to apply two non-linear optimization approaches to solve the optimization problem (2). We notice that the EM algorithm is closely related to a gradient decent method wu1983convergence, whose convergence rate is at most linear. On the other hand, it is known in optimization theory that if the second order information is utilized in the algorithm, quadratic convergence may be achieved, e.g., the classical Newton’s method. However, in many applications, it is often computationally very expensive to obtain the second order information, i.e., the Hessian matrix. One remedy is to use computationally cheap approximation of the Hessian matrix. This idea leads to the family of Quasi-Newton methods in the unconstrained case. While the convergence rate is typically only superlinear, the per iteration cost (both the execution time and the memory usage) is significantly reduced. In the constrained case, sophisticated methods have been developed to allow us to deal with the constraints. Given that it is relative easy to solve a constrained optimization problem when the objective function is quadratic and the constraints are all linear, one idea in constrained non-linear optimization is to approximate the objective function (or the Lagrangian function) by a quadratic function (via second-order Taylor expansion at the current solution) and approximate the constraints by linear constraints (via first-order Taylor expansion at the current solution). A new solution is obtained by solving the approximation and hence a new approximation can be constructed at the new solution. Analogous to the idea of quasi-Newton methods in the unconstrained case, in the constrained case, we can also consider an approximated Taylor expansion without having to compute the Hessian matrix exactly. Once an approximated quadratic program is obtained, one may use different approaches to solve it. For example, one can use an active set method or an interior point method to solve the quadratic program when it does not possess any specific structure. When the feasible region of the quadratic program is easily computable (typically in strongly polynomial time), a gradient projection method can be applied to solve the quadratic program approximation. As we have seen, the feasible region of optimization problem (2) is the Cartesian product of probability simplexes. It is known that projection on a probability simplex is computable in strongly polynomial time. Therefore, it is reasonable to apply a projection method to solve the quadratic program approximation. In the following subsections, the two approaches we propose are discussed in details. In both approaches, we need to evaluate the gradient of the LCM log-likelihood function. We provide the analytical expression below. For the part, we have:
For the part, we have for all :
where . And therefore, for all ,
3.1 Limited Memory Projected Quasi-Newton Method
We first present the projected Quasi-Newton method which is proposed by schmidt2009optimizing. We augment it with the algorithm proposed by wang2013projection to project parameters onto a probability simplex in strongly polynomial time. In general, we address the problem of minimizing a differentiable function over a convex set subject to equality constraints:
In an iterative algorithm, we update the next iteration as follows:
where is the solution at the -th iteration, is the step length and is the moving direction at iteration . Different algorithms differ in how and are determined. In the projected Quasi-Newton method, a quadratic approximation of the objective function around the current iterate is constructed as follows.
where and denotes a positive-definite approximation of the Hessian . The projected quasi-Newton method then compute a feasible descent direction by minimizing this quadratic approximation subject to the original constraints:
Then the moving direction is .
To determine the step length , we ensure that a sufficient decrease condition, such as the Armijo condition is met:
Although there are many appealing theoretical properties of projected Newton method just summarized, many obstacles prevent its efficient implementation in its original form. A major shortcoming is that minimizing (7) could be as difficult as optimizing (5). In schmidt2009optimizing, the projected Newton method was modified into a more practical version which uses the limited memory BFGS update to obtain ’s and a Spectral Projected Gradient (SPG) Algorithm ((birgin2000nonmonotone)) to solve the quadratic approximation (7).
where is the feasible region given in the format of the Cartesian product of probability simplexes. This rewriting is to facilitate the projection operation. We denote as the projection of a vector on a closed convex set , i.e. is the unique solution of the following quadratic program:
As we can see, in general, a quadratic program needs to be solved to compute the projection onto a closed convex set, and hence is not computationally cheap. Fortunately, the feasible region in (9) allows for a projection computable in strongly polynomial time according to wang2013projection. This algorithm is presented in Algorithm 4. This algorithm is the building block for the SPG algorithm to solve the quadratic approximation in each iteration. More specifically, in the -th iteration, let
where and denotes a positive-definite approximation of the Hessian . The quadratic approximation is now given by
The gradient of is given by
We update using the limited memory version of BFGS. The non-limited memory BFGS update of is given by
where and . This will consume significant memory in storing ’s when the number of features increases dramatically. Therefore, in the proposed projected Quasi-Newton algorithm we only keep the most recent and arrays (the definitions of and are in Algorithm 2) and update using its compact representation described by byrd1994representations:
where and are explicitly given in equation (3.5) of byrd1994representations.
In addition, running Algorithm 2 until convergence, the matrix is outputted as a by-product. The matrix is an approximation of the observed Fisher information of the unknown parameters, which will enable us to construct asymptotic confidence intervals using the following classical results:
This is way easier than EM algorithm to conduct statistical inference. According to gower2017randomized, when is convex quadratic function with positive definite Hessian matrix, it is expected that from the Quasi-Newton method to converge to the true Hessian matrix. However, the log likelihood function is obviously not a convex function and as far as we know there is no formal theory that guarantees the convergence. Nonetheless, in Section of jamshidian1997acceleration, the authors empirically compared the estimates for standard errors to the true values and the results are satisfactory.
3.2 Sequential Quadratic Programming
Sequential quadratic programming (SQP) is a generic method for non-linear optimization with constraints. It is known as one of the most efficient computational method to solve the general nonlinear programming problem in (5) subject to both equality and inequality constraints. There are many variants of this algorithm, we use the version considered in kraft1988software. We give a brief review of this method and then we will specifically talk about how this method could be applied to optimization problem (2).
Consider the following minimization problem
where the problem functions . SQP is also an iterative method and each iteration a quadratic approximation of the original problem is also constructed and solved to obtain the moving direction. Compared to the projected Quasi-Newton method, in SQP, the quadratic approximations are typically solved by an active set method or an interior point method rather than a projection type method. This significantly complicates the algorithm, but also allows the algorithm to handle more general non-linear optimization problems, especially when the feasible region is too complex to admit an efficient projection computation. In particular, starting with a given vector of parameters , the moving direction at iteration is determined by a quadratic programming problem, which is formulated by a quadratic approximation of the Lagrangian function and a linear approximation of the constraints. Note that, in contrast to the projected Quasi-Newton method we presented in the previous subsection, the SQP algorithm here approximates the Lagrangian function instead of the objective function itself. An advantage is that the dual information can be incorporated in the algorithm to ensure better convergence property. Let
be the Lagrangian function associated with this optimization problem. This approximation is of the following standard form of quadratic programming:
as proposed in wilson1963simplicial.
In terms of the step length , han1977globally proved that a one-dimensional minimization of the non-differential exact penalty function
with , as a merit function
with and fixed, leads to a step length guaranteeing global convergence for values of the penalty parameters greater than some lower bounds. Then, powell1978fast proposed to update the penalty parameters according to
where denotes the Lagrange multiplier of the -th constraint in the quadratic problem and is the -th penalty parameter of the previous iteration, starting with some .
It is important in practical applications to not evaluate in (19) in every iteration, but to use only first order information to approximate the Hessian matrix of the Lagrange function in (17). powell1978fast proposed the following modification:
and is chosen as
which ensures that remains positive definite within the linear manifold defined by the tangent planes to active constraints at .
In LCM, the problem turns out to be simpler: the quadratic programming problem in (18) is only subject to equality constraints. In addition, unless we use projected Quasi-Newton method, for which we have to build our own solver, there is a popular implementation of SQP in Python’s SciPy package. The package uses a variant of SQP: Sequential Least SQuares Programming (SLSQP): It replaces the quadratic programming problem in (18) by a linear least squares problem using a stable factorization of the matrix .
4 Simulation Studies and Real Data Analysis
In this section, we provide five simulation studies and one real data analysis to demonstrate the performance of the proposed methods.
4.1 Example 1
In this example, we generate a dataset from a real latent class model. For , we have levels: and ; while for , we have levels . The true weights and categorical parameters are given in Table 1.
We then use the following three methods to maximize the log-likelihood function: (1) EM, (2) SQP, and (3) Projected Quasi-Newton (QN). Each method is applied times with different initial values and the best result based on the log-likelihood value is given in Table 2. Results from the true parameters are also included as comparison. Actually the log-likelihoods for each method are very similar among the replications, so we draw side by side boxplots for the misclassification rates and the number of iterations for the three methods to show the uncertainty due to different starting values. The boxplots are shown in Figure 1.
|True Parameters||EM||SQP||Projected QN|
|Number of Iterations||N.A.|
The LCM is hard to estimate: even if we substitute in the true parameters, the misclassification rate is high (). This makes sense since unlike Gaussian mixture model, for instance, where the magnitude of values could help to classify components, the latent class model is based on a discrete distribution and it is naturally much harder to attain a perfect classification, especially for a not-well separated case such as Example 1.
The proposed two methods have good performance compared to the traditional EM algorithm: both slightly beat EM in terms of misclassification rates ( compared to ) and the number of iterations ( compared to ), which suggests the usefulness of the two constrained optimization methods in learning LCM parameters in practice.
The variations among different initial values are astonishingly huge from Figure 1. It is therefore wise to try different starting values especially for an example like this, where optimization is performed in such a high dimension.
4.2 Example 2
In this example, we consider exactly the same setting as in Example , however we increase the sample size from to . We are interested in seeing the effect of a large sample size. Similar to Example 1, the results are summarized in Table 3 and Figure 2.
|True Parameters||EM||SQP||Projected QN|
|Number of Iterations||N.A.|
4.3 Example 3
In this example, we keep , but we increase the dimension from to to see how these methods perform. The true values are given in Table 4. Each method is fitted times with different initial values and the best result based on log-likelihood is given in Table 5. Results from the true parameters are also included as comparison. In addition, we draw side by side boxplots for the misclassification rates and the number of iterations only for the three methods to show the uncertainty among different starting values. The boxplots are shown in Figure 3.
|True Parameters||EM||SQP||Projected QN|
|Number of Iterations||N.A.|
This is a very interesting example. First of all, the number of iterations of EM has increased dramatically after we move from to . In contrast, the other two optimization methods see only a modest increase. Secondly, as the dimension increases, the difficulty of the problem reduces a bit since now the true parameters plugging in gets a misclassification rate. Lastly, the variations caused by different starting values are huge as we have already seen in the previous two examples.
4.4 Example 4 & 5
These results further confirm what we have observed: As dimension increases, the iteration for EM algorithm increases dramatically compared to the other two optimization methods. In addition, if we plug in the true parameters, the misclassification rates for is , and the rate for is . It is clear the number of dimensions has a much bigger impact on the difficulty.
After going through all the simulation studies, our observations are summarized as follows.
LCM is a hard in terms of estimating its parameters. A larger sample size could only slightly help with reducing the estimation difficulty, while a higher dimension can greatly reduce the difficulty.
The two proposed constrained optimization methods have an appealing performance: compared to the EM algorithm, they converge in less iterations and also produce slightly lower misclassification rates.
The variations of different starting points are astonishingly large. It is advised to use multiple initial values in practice.
4.5 An Application
We now go back to the motivating example discussed in Section 1. The data set is available in the R package BayesLCA. white2014bayeslca used a latent class model to fit the data using Gibbs sampler. It is clear that this is an binary data set. We follow the recommendation of moran2004syndromes and fit a latent class model with (1) EM, (2) SQP, and (3) Quasi-Newton methods and different initial points, and the best result of each method is recorded based on the log-likelihood value. The results are summarized in Table 6. The result from BayesLCA package (white2014bayeslca) is also included.
|Number of Iterations||N.A.|
From Table 6, SQP has the best performance in terms both the log-likelihood and the number of iterations. The results from EM and projected Quasi-Newton are very similar although EM needs way more iterations to converge. This agrees with the previous observations. The method proposed by white2014bayeslca actually has the smallest log-likelihood value.
5 Concluding Remarks
The primary research objective of the paper is to provide alternative methods to learn the unknown parameters of the latent class model. Given the log-likelihood as a function of the parameters, we aim to find estimators that can maximize the log-likelihood function. The traditional way is to use the EM algorithm. However, it is observed that EM algorithm converges slowly. Therefore, in this paper, we propose the use of two constrained optimization methods, namely sequential quadratic programming and the projected Quasi-Newton method as alternatives. Simulation studies and the real example in Section 4 reveal that the two proposed methods perform well. The obvious advantages we observed are as follows: (1) the two optimization methods have slightly larger log-likelihood values compared to EM algorithm; (2) they converge in significantly less iterations than the EM algorithm. That being said, we want to make it clear that the aim is not to completely replace EM algorithm, rather we would like to provide alternative ways of achieving the same goal using some optimization methods. Inter-disciplinary collaboration between researchers in statistics and mathematical optimization has never been as important as in the big data era.
Another important observation we made through simulation is that different initial points lead to rather different results. For example, Considering the misclassification rate of the EM algorithm in Example 1, among the replications, they vary from to as shown in Figure 1. The variation is astonishingly huge. This is sort of understandable since it is extremely hard to optimize the log-likelihood function in such a high dimensional setting. Therefore, it is strongly recommended to start with multiple initial values in practice.
Finally, we also observe that LCM is a rather difficult problem and it is hard to get a perfect fit especially with small sample size, low dimension and not well separated case. We observe that increasing sample size and/or dimension reduces the difficulty and it agrees with the observation made by chen2016analysis for fitting a Gaussian process model since more information is added to help make a better classification.
The Python sources codes for EM, projected Quasi-Newton for LCM are available upon request. The implementation of SQP is available in Python SciPy package.