Mixture of Latent Trait Analyzers for Model-Based Clustering of Categorical Data

Isabella Gollini & Thomas Brendan Murphy
National Centre for Geocomputation, National University of Ireland Maynooth, Ireland
School of Mathematical Sciences, University College Dublin, Ireland
Abstract

Model-based clustering methods for continuous data are well established and commonly used in a wide range of applications. However, model-based clustering methods for categorical data are less standard. Latent class analysis is a commonly used method for model-based clustering of binary data and/or categorical data, but due to an assumed local independence structure there may not be a correspondence between the estimated latent classes and groups in the population of interest. The mixture of latent trait analyzers model extends latent class analysis by assuming a model for the categorical response variables that depends on both a categorical latent class and a continuous latent trait variable; the discrete latent class accommodates group structure and the continuous latent trait accommodates dependence within these groups. Fitting the mixture of latent trait analyzers model is potentially difficult because the likelihood function involves an integral that cannot be evaluated analytically. We develop a variational approach for fitting the mixture of latent trait models and this provides an efficient model fitting strategy. The mixture of latent trait analyzers model is demonstrated on the analysis of data from the National Long Term Care Survey (NLTCS) and voting in the U.S. Congress. The model is shown to yield intuitive clustering results and it gives a much better fit than either latent class analysis or latent trait analysis alone.

1 Introduction

Model-based clustering methods are widely used because they offer a coherent strategy for clustering data where uncertainties can be appropriately quantified using probabilities. Many model-based clustering methods are based on the finite Gaussian mixture model (eg. Celeux and Govaert, 1995; Fraley and Raftery, 2002; McNicholas and Murphy, 2008) and these methods have been successfully applied to the clustering of multivariate continuous measurement data; more recent extensions of model-based clustering for continuous data use finite mixtures of non-Gaussian models (eg. Lin et al., 2007; Karlis and Santourian, 2008; Lin, 2010; Andrews and McNicholas, 2012).

Categorical data arise in a wide range of applications including the social sciences, health sciences and marketing. Latent class analysis and latent trait analysis are two commonly used latent variable models for categorical data (eg. Bartholomew et al., 2011). In the latent class analysis model, the dependence in the data is explained by a categorical latent variable that identifies groups (or classes) within which the response variables are independent (also known as the local independence assumption). Latent class analysis is used widely for model-based clustering of categorical data, however, if the condition of independence within the groups is violated, then the latent class model will suggest more than the true number of groups and so the results are difficult to interpret and can be potentially misleading. The latent trait analysis model uses a continuous univariate or multivariate latent variable, called a latent trait, to model the dependence in categorical response variables. If data come from a heterogeneous source or multiple groups, then a single multi-dimensional latent trait may not be sufficient to model the data. For these reasons, the proposed mixture of latent trait analyzers (MLTA) model is developed for model-based clustering of categorical data where a categorical latent variable identifies groups within the data and a latent trait is being used to accommodate the dependence between outcome variables within each cluster.

In this paper, we focus on the particular case of binary data and we propose two different mixture of latent trait analyzers models for model-based clustering of binary data: a general model that supposes that the latent trait has a different effect in each group, and a more restrictive, but more parsimonious, model that supposes that the latent trait has the same effect in all groups. Thus, the proposed family of models (MLTA) is a categorical analogue of some parsimonious model families used for clustering continuous data (eg. Fraley and Raftery, 2002; McNicholas and Murphy, 2008; Andrews and McNicholas, 2012). The MLTA family of models is most like the PGMM family (McNicholas and Murphy, 2008), which is based on the mixture of factor analyzers model (Ghahramani and Hinton, 1997); however, the MLTA model accommodates binary response variables instead of continuous variables. In addition, the model can be easily applied to nominal categorical data by coding the data in binary form. The MLTA model is a generalization of the mixture of Rasch models (Rost, 1990), and is a special case of multilevel mixture item response models (Vermunt, 2007); these connections are explored in more detail in Section 3.3, after the MLTA model has been fully introduced.

Fitting the mixture of latent trait analyzers model is potentially difficult because the likelihood function involves an integral that cannot be evaluated analytically. However, we propose using a variational approximation of the likelihood as proposed by Tipping (1999) for model fitting purposes because it is extremely easy to implement and it converges quickly, so it can easily deal with high dimensional data and high dimensional latent traits.

The mixture of latent trait analyzers model is demonstrated through the analysis of two data sets: the National Long Term Care Survey (NLTCS) data set (Erosheva et al., 2007) and a U.S. Congressional Voting data set (Asuncion and Newman, 2007). The latent class analysis and the latent trait analysis models are not sufficient to summarize these two data sets. However, the mixture of latent trait analyzers model detects the presence of several classes within these data and the dependence structure within groups is explained by a latent trait.

The paper is organized as follows. Section 2.1 provides an introduction to latent class analysis. Section 2.2 provides an introduction to latent trait analysis with a description of three of the most common techniques to evaluate the likelihood in this model: the Gauss-Hermite quadrature approach (Section 2.2.1), Monte Carlo sampling (Section 2.2.2), and the variational approach (Section 2.2.3). In Section 3 we introduce the mixture of latent trait analyzers model and a parsimonious version of the model (Section 3.1). An interpretation of the model parameters is outlined in Section 3.2, a discussion of models related to the MLTA model is given in Section 3.3, and a variational approach for estimating the model parameters is proposed in Section 3.4. The issue of model identifiability is discussed in Section 3.5. In Section 4 an adjustment to the BIC (Schwarz, 1978) is proposed as a model selection criterion and Pearson’s test and the truncated sum of squares Pearson residual criterion (Erosheva et al., 2007) are used to assess model fit. Computational aspects of fitting the MLTA model are outlined in Section 5. The two applications are presented in Section 6.1 and Section 6.2. We conclude in Section 7 by discussing the model and the results of its application.

2 Latent Class & Trait Analysis

In this section, we give an overview of latent class and latent trait analysis which are the basis of the mixture of latent trait analyzers model. We also outline how the models are fitted in practice.

2.1 Latent Class Analysis

Latent class analysis (LCA) can be used to model a set of $N$ observations each involving $M$ binary (categorical) variables. Each observation is of the form $\mathbf{x}_n = (x_{n1}, x_{n2}, \ldots, x_{nM})$, where $x_{nm}$ records the value of binary variable $m$ for observation $n$, for $n = 1, 2, \ldots, N$ and $m = 1, 2, \ldots, M$.

LCA assumes that there is a latent categorical variable $\mathbf{z}_n = (z_{n1}, \ldots, z_{nG})$ for which $z_{ng} = 1$ if observation $n$ belongs to class $g$ and $z_{ng} = 0$ otherwise. We assume that $\mathbf{z}_n \sim \text{Multinomial}(1, (\eta_1, \eta_2, \ldots, \eta_G))$, where $\eta_g$ is the prior probability that a randomly chosen individual is in the $g$th class ($\eta_g > 0$ for all $g$ and $\sum_{g=1}^{G} \eta_g = 1$). Furthermore, conditional on the latent class variable $\mathbf{z}_n$, the response variables are distributed as independent Bernoulli random variables with parameters $\pi_{g1}, \ldots, \pi_{gM}$; therefore, $\pi_{gm}$ is the conditional probability of observing $x_{nm} = 1$ if the observation is from class $g$ (that is, $\pi_{gm} = p(x_{nm} = 1 \mid z_{ng} = 1)$). Hence, the dependence between variables is explained by the latent class variable. LCA can be seen as a finite mixture model with mixing proportions $\eta_g$ in which the component distributions are multivariate independent Bernoulli distributions. Consequently, the log-likelihood function is given as,

$$\ell = \sum_{n=1}^{N} \log\left[\sum_{g=1}^{G} \eta_g \prod_{m=1}^{M} \pi_{gm}^{x_{nm}} (1 - \pi_{gm})^{1 - x_{nm}}\right]. \qquad (1)$$

Bartholomew et al. (2011) describe how to estimate the parameters by maximum likelihood estimation via the expectation-maximisation (EM) algorithm (Dempster et al., 1977).
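To make the fitting procedure concrete, the following is a minimal sketch of this EM algorithm in Python/numpy; the function name, the random initialization scheme and the fixed iteration count are our own choices, not those of Bartholomew et al. (2011).

```python
# A minimal sketch of the EM algorithm for LCA, assuming binary data X of
# shape (N, M); eta are the mixing proportions and pi the class-conditional
# Bernoulli parameters. Names and initialization are ours.
import numpy as np

def lca_em(X, G, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    N, M = X.shape
    eta = np.full(G, 1.0 / G)                  # mixing proportions
    pi = rng.uniform(0.25, 0.75, size=(G, M))  # Bernoulli parameters
    for _ in range(n_iter):
        # E-step: log p(x_n | class g) under local independence
        log_comp = X @ np.log(pi).T + (1 - X) @ np.log(1 - pi).T  # (N, G)
        log_post = np.log(eta) + log_comp
        log_post -= log_post.max(axis=1, keepdims=True)
        z = np.exp(log_post)
        z /= z.sum(axis=1, keepdims=True)      # posterior class probabilities
        # M-step: weighted proportions and response probabilities
        Ng = z.sum(axis=0)
        eta = Ng / N
        pi = np.clip((z.T @ X) / Ng[:, None], 1e-6, 1 - 1e-6)
    return eta, pi, z
```

In practice one would monitor the log-likelihood (1) for convergence rather than running a fixed number of iterations.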

When LCA is used in a model-based clustering context, the inferred latent classes are assumed to correspond to clusters in the data; this may be a valid conclusion when the model assumptions are correct but may not be otherwise.

2.2 Latent Trait Analysis

Latent trait analysis (LTA) can also be used to model a set of multivariate binary (categorical) observations. LTA assumes that there is a $D$-dimensional continuous latent variable $\mathbf{y}_n$ underlying the behavior of the categorical response variables within each observation (Bartholomew et al., 2011). The LTA model assumes that

$$p(\mathbf{x}_n) = \int p(\mathbf{x}_n \mid \mathbf{y}_n)\, p(\mathbf{y}_n)\, d\mathbf{y}_n, \qquad (2)$$

where the conditional distribution of $\mathbf{x}_n$ given $\mathbf{y}_n$ is

$$p(\mathbf{x}_n \mid \mathbf{y}_n) = \prod_{m=1}^{M} \pi_m(\mathbf{y}_n)^{x_{nm}} \left[1 - \pi_m(\mathbf{y}_n)\right]^{1 - x_{nm}} \qquad (3)$$

and the response function is a logistic function,

$$\pi_m(\mathbf{y}_n) = p(x_{nm} = 1 \mid \mathbf{y}_n) = \frac{1}{1 + \exp\{-(b_m + \mathbf{w}_m^{\top}\mathbf{y}_n)\}},$$

where $b_m$ is the intercept parameter and $\mathbf{w}_m = (w_{m1}, \ldots, w_{mD})$ are the slope parameters in the logistic function. Thus, the conditional probability that $x_{nm} = 1$ given $\mathbf{y}_n$ is an increasing function of $b_m + \mathbf{w}_m^{\top}\mathbf{y}_n$, and if $\mathbf{w}_m = \mathbf{0}$ we get a probability equal to $1/[1 + \exp(-b_m)]$ independently of the value of $\mathbf{y}_n$. Finally, it is assumed that $\mathbf{y}_n \sim N(\mathbf{0}, I)$. Thus, although the variables within $\mathbf{x}_n$ are conditionally independent given $\mathbf{y}_n$, the marginal distribution of $\mathbf{x}_n$ accommodates dependence between the variables.

Therefore, the log-likelihood is

$$\ell = \sum_{n=1}^{N} \log p(\mathbf{x}_n) = \sum_{n=1}^{N} \log \int p(\mathbf{x}_n \mid \mathbf{y}_n)\, p(\mathbf{y}_n)\, d\mathbf{y}_n. \qquad (4)$$

The integral in (4) cannot be evaluated analytically, so it is necessary to use other methods to evaluate it. These are briefly reviewed in Sections 2.2.1–2.2.3.

2.2.1 Gauss-Hermite Quadrature

Bock and Aitkin (1981) proposed a method to evaluate the integral in (4) by using Gauss-Hermite quadrature (Abramowitz and Stegun, 1964):

$$p(\mathbf{x}_n) \approx \sum_{k=1}^{K^D} h(\mathbf{e}_k)\, p(\mathbf{x}_n \mid \mathbf{e}_k),$$

where $K^D$ is the number of sets of integration points and $h(\mathbf{e}_k)$ ($k = 1, \ldots, K^D$) are the weights for the sets of points $\mathbf{e}_k$. In practice this method treats the latent variables as discrete, taking values $\mathbf{e}_k$ with probabilities $h(\mathbf{e}_k)$. Thus, the model density can be approximated by a finite mixture model with $K^D$ component densities, where $p(\mathbf{x}_n \mid \mathbf{e}_k)$ are the component distributions and $h(\mathbf{e}_k)$ are fixed mixing proportions. Bartholomew et al. (2011) explain this approach in detail and outline its drawbacks. The number of component densities required, $K^D$, increases exponentially in $D$, so the Gauss-Hermite quadrature can be hard to implement and can be quite slow. Furthermore, when the parameters are very unequal their estimates can diverge to infinity.
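As an illustration of the quadrature approximation, the following sketch (with names of our choosing) evaluates the LTA log-likelihood (4) for a one-dimensional latent trait; numpy's `hermgauss` returns nodes and weights for physicists' Hermite quadrature, which are rescaled here to integrate against a N(0, 1) density.

```python
# A sketch of evaluating the LTA log-likelihood (4) by Gauss-Hermite
# quadrature with D = 1; X is (N, M) binary, b and w are numpy arrays of
# length M, and K is the number of quadrature points. Names are ours.
import numpy as np

def lta_loglik_ghq(X, b, w, K=20):
    nodes, weights = np.polynomial.hermite.hermgauss(K)
    points = np.sqrt(2.0) * nodes        # quadrature points e_k for N(0, 1)
    h = weights / np.sqrt(np.pi)         # weights h(e_k), summing to 1
    logit = b[:, None] + np.outer(w, points)                    # (M, K)
    p1 = 1.0 / (1.0 + np.exp(-logit))                           # pi_m(e_k)
    log_px_given_e = X @ np.log(p1) + (1 - X) @ np.log(1 - p1)  # (N, K)
    return np.sum(np.log(np.exp(log_px_given_e) @ h))
```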

2.2.2 Monte Carlo Sampling

An alternative approach is the Monte Carlo method (Sammel et al., 1997), which samples $\mathbf{y}_1, \ldots, \mathbf{y}_L$ from the $D$-dimensional latent distribution $p(\mathbf{y})$ and approximates the log-likelihood as

$$\ell \approx \sum_{n=1}^{N} \log\left[\frac{1}{L}\sum_{l=1}^{L} p(\mathbf{x}_n \mid \mathbf{y}_l)\right].$$

This approximation of the model density can be seen as a finite mixture model with $L$ component densities in which the mixing proportions are all equal to $1/L$, but usually $L$ needs to be quite large, making the implementation of this method quite demanding.
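A corresponding Monte Carlo sketch, under the same assumed parameterization (intercepts `b` of length M and slopes `W` of shape M x D), is:

```python
# A sketch of the Monte Carlo estimate of the LTA log-likelihood, assuming
# a D-dimensional standard normal latent trait; L is the number of samples.
import numpy as np

def lta_loglik_mc(X, b, W, L=5000, seed=0):
    rng = np.random.default_rng(seed)
    D = W.shape[1]
    Y = rng.standard_normal((L, D))                  # y_l ~ N(0, I)
    logit = b[None, :] + Y @ W.T                     # (L, M)
    p1 = 1.0 / (1.0 + np.exp(-logit))
    log_px = X @ np.log(p1).T + (1 - X) @ np.log(1 - p1).T  # (N, L)
    return np.sum(np.log(np.exp(log_px).mean(axis=1)))
```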

2.2.3 Variational Approximation

Tipping (1999) proposed to use a variational approximation (Jaakkola and Jordan, 1996) to fit the latent trait analysis model in a manner that converges quickly and is easy to implement. The main aim of this approach is to maximize a lower bound that approximates the likelihood function. In latent trait analysis the log-likelihood function (4) is governed by the logistic sigmoid function $\sigma(t) = 1/(1 + e^{-t})$, which can be bounded below by the exponential of a quadratic form involving variational parameters $\boldsymbol{\xi}_n = (\xi_{n1}, \ldots, \xi_{nM})$, where $\xi_{nm} \neq 0$ for all $n$ and $m$. This allows for the computation of an approximate log-likelihood in closed form. In this case the lower bound of each term in the log-likelihood is given by,

$$p(\mathbf{x}_n) \geq \tilde{p}(\mathbf{x}_n \mid \boldsymbol{\xi}_n) = \int \tilde{p}(\mathbf{x}_n \mid \mathbf{y}_n, \boldsymbol{\xi}_n)\, p(\mathbf{y}_n)\, d\mathbf{y}_n,$$

where

$$\tilde{p}(\mathbf{x}_n \mid \mathbf{y}_n, \boldsymbol{\xi}_n) = \prod_{m=1}^{M} \sigma(\xi_{nm}) \exp\left\{\frac{A_{nm} - \xi_{nm}}{2} - \lambda(\xi_{nm})\left(A_{nm}^{2} - \xi_{nm}^{2}\right)\right\},$$

$A_{nm} = (2x_{nm} - 1)(b_m + \mathbf{w}_m^{\top}\mathbf{y}_n)$, and $\lambda(\xi) = \tanh(\xi/2)/(4\xi)$. This approximation has the property that $\tilde{p}(\mathbf{x}_n \mid \mathbf{y}_n, \boldsymbol{\xi}_n) \leq p(\mathbf{x}_n \mid \mathbf{y}_n)$, with equality when $\xi_{nm} = \pm A_{nm}$, and thus it follows that $\tilde{p}(\mathbf{x}_n \mid \boldsymbol{\xi}_n) \leq p(\mathbf{x}_n)$. Tipping (1999) and Bishop (2006) outline a variational EM algorithm that maximizes this lower bound and thus fits a latent trait analysis model.
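For concreteness, the following is a small numerical sketch of the Jaakkola-Jordan bound just described; the function names are ours, and the bound requires a non-zero variational parameter.

```python
# A sketch of the logistic-sigmoid lower bound used by the variational
# approach: sigma(t) >= sigma(xi) * exp((t - xi)/2 - lam(xi)*(t^2 - xi^2)),
# with lam(xi) = tanh(xi/2)/(4*xi) and equality when xi = +/- t.
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def lam(xi):
    return np.tanh(xi / 2.0) / (4.0 * xi)

def log_sigmoid_lower_bound(t, xi):
    """Lower bound on log sigma(t) at variational parameter xi (xi != 0)."""
    return np.log(sigmoid(xi)) + (t - xi) / 2.0 - lam(xi) * (t**2 - xi**2)
```

Tightening this bound over the variational parameters is the step that the variational EM algorithm alternates with the model parameter updates.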

However, the approximation of the log-likelihood obtained by using the variational approach is always less than or equal to the true log-likelihood, so it may be advantageous to get a more accurate estimate of the log-likelihood at the last step of the algorithm using Gauss-Hermite quadrature (Section 2.2.1); this is discussed further in Section 5.

2.2.4 Comparison of estimation methods

There are some advantages to using the variational approach to approximate the integral in the LTA likelihood. Firstly, the method avoids the need for user-specified quantities such as the number of quadrature points or the number of Monte Carlo samples; in addition, the variational approach involves iterating a series of closed-form updates that are easy to compute. Secondly, the algorithm usually converges considerably more quickly than Gauss-Hermite quadrature or Monte Carlo sampling, particularly for large data sets, and the approximation of the likelihood becomes more accurate as the dimensionality increases, because the likelihood becomes more Gaussian in form. However, since the variational approach estimates the true likelihood function by a lower bound, the approximation of the likelihood given by this method is always less than or equal to the true value, so for low-dimensional latent traits it may be better to use the Gauss-Hermite quadrature. Additionally, there are the usual drawbacks of the EM algorithm: the results can vary with the initialization of the algorithm and there is the risk of converging to a local maximum instead of the global maximum of the lower bound of the log-likelihood.

In Figure 1 we compare the response function estimated by the Gauss-Hermite quadrature with 6 quadrature points and by the variational approach for a unidimensional latent trait model. The data consist of $M = 16$ binary variables and $N = 21574$ observations, and are fully described in Section 6.1. In this example the two methods give very similar results, as shown by the plot of the two response functions and by their absolute differences, which are uniformly small.

Figure 1: Response function for the Gauss-Hermite Quadrature and Variational Approach. The absolute difference of the response functions is also shown.

Furthermore, Tipping (1999) investigates the accuracy of the variational approach in a high-dimensional context, comparing the estimates given by the variational and Monte Carlo sampling approaches in terms of both computing time and errors.

3 Mixture of Latent Trait Analyzers

The mixture of latent trait analyzers (MLTA) model generalizes latent class analysis and latent trait analysis by assuming that a set of observations comes from $G$ different groups, and that the behavior of the categorical response variables given by each observation depends on both the group $g$ and a $D$-dimensional continuous latent variable $\mathbf{y}_n$. Thus, the MLTA model is a mixture model for binary data, but one in which observations are not necessarily conditionally independent given the group memberships. In fact, the observations within groups are modeled using a latent trait analysis model and thus dependence is accommodated.

Suppose that each observation comes from one of $G$ groups and let $\mathbf{z}_n = (z_{n1}, \ldots, z_{nG})$ be an indicator of the group membership. We assume that $\mathbf{z}_n \sim \text{Multinomial}(1, (\eta_1, \ldots, \eta_G))$, where $\eta_g$ is the prior probability of a randomly chosen observation coming from the $g$th group ($\eta_g > 0$ and $\sum_{g=1}^{G}\eta_g = 1$). Further, we assume that the conditional distribution of $\mathbf{x}_n$, given that the observation is from group $g$ (ie. $z_{ng} = 1$), is a latent trait analysis model with parameters $b_{gm}$ and $\mathbf{w}_{gm}$; this yields the mixture of latent trait analyzers (MLTA) model.

Thus, the MLTA model is of the form,

$$p(\mathbf{x}_n) = \sum_{g=1}^{G} \eta_g \int p(\mathbf{x}_n \mid \mathbf{y}_n, z_{ng} = 1)\, p(\mathbf{y}_n)\, d\mathbf{y}_n,$$

where the conditional distribution given $\mathbf{y}_n$ and $z_{ng} = 1$ is

$$p(\mathbf{x}_n \mid \mathbf{y}_n, z_{ng} = 1) = \prod_{m=1}^{M} \pi_{gm}(\mathbf{y}_n)^{x_{nm}} \left[1 - \pi_{gm}(\mathbf{y}_n)\right]^{1 - x_{nm}} \qquad (5)$$

and the response function for each group is given by

$$\pi_{gm}(\mathbf{y}_n) = p(x_{nm} = 1 \mid z_{ng} = 1, \mathbf{y}_n) = \frac{1}{1 + \exp\{-(b_{gm} + \mathbf{w}_{gm}^{\top}\mathbf{y}_n)\}}, \qquad (6)$$

where $b_{gm}$ and $\mathbf{w}_{gm}$ are the model parameters. In addition, it is assumed that the latent variable $\mathbf{y}_n \sim N(\mathbf{0}, I)$.

The log-likelihood can be written as

$$\ell = \sum_{n=1}^{N} \log\left[\sum_{g=1}^{G} \eta_g \int p(\mathbf{x}_n \mid \mathbf{y}_n, z_{ng} = 1)\, p(\mathbf{y}_n)\, d\mathbf{y}_n\right], \qquad (7)$$

so the model is a finite mixture model in which the component distributions are latent trait analysis models and the mixing proportions are $\eta_1, \ldots, \eta_G$.
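As an illustration of (7), the following sketch (with hypothetical parameter names `eta`, `B`, `W`) evaluates the MLTA log-likelihood by Gauss-Hermite quadrature for a one-dimensional latent trait, in the same way as (13) below.

```python
# A sketch of evaluating the MLTA log-likelihood (7) by Gauss-Hermite
# quadrature with D = 1; eta has length G, and B, W are (G, M) arrays of
# intercepts and slopes. Names are ours.
import numpy as np

def mlta_loglik(X, eta, B, W, K=20):
    nodes, weights = np.polynomial.hermite.hermgauss(K)
    points, h = np.sqrt(2.0) * nodes, weights / np.sqrt(np.pi)
    mix = np.zeros(X.shape[0])
    for g in range(len(eta)):
        logit = B[g][:, None] + np.outer(W[g], points)          # (M, K)
        p1 = 1.0 / (1.0 + np.exp(-logit))                       # pi_gm(e_k)
        log_px = X @ np.log(p1) + (1 - X) @ np.log(1 - p1)      # (N, K)
        mix += eta[g] * (np.exp(log_px) @ h)
    return np.sum(np.log(mix))
```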

3.1 Parsimonious Model

It is sometimes useful to use a more parsimonious response function than the one presented in (6); this is especially important when the data set is high dimensional, comes from several different groups and the continuous latent variable is high dimensional.

Similarly to the factor analysis setting (Bartholomew et al., 2011), $D(D-1)/2$ parameters are constrained because of the indeterminacy arising from all the possible rotations of the latent trait, so the model in (6) involves $(G - 1) + GM + G\left[MD - D(D-1)/2\right]$ free parameters, of which $G\left[MD - D(D-1)/2\right]$ are the parameters $\mathbf{w}_{gm}$.

If the $\mathbf{w}_{gm}$ parameters are constrained to be the same in each group, a more parsimonious model is obtained:

$$\pi_{gm}(\mathbf{y}_n) = p(x_{nm} = 1 \mid z_{ng} = 1, \mathbf{y}_n) = \frac{1}{1 + \exp\{-(b_{gm} + \mathbf{w}_m^{\top}\mathbf{y}_n)\}}, \qquad (8)$$

which supposes that $\mathbf{w}_{gm} = \mathbf{w}_m$ for each group $g$, so that it involves $(G - 1) + GM + \left[MD - D(D-1)/2\right]$ free parameters.
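These two parameter counts can be encoded directly; the helper below, with a name of our choosing, reproduces the counting argument above.

```python
# A small sketch of the free-parameter counts for the general and the
# parsimonious MLTA models, following the counting argument above.
def n_params_mlta(G, M, D, parsimonious=False):
    slopes = M * D - D * (D - 1) // 2      # per loading matrix, after rotation
    if parsimonious:
        return (G - 1) + G * M + slopes    # common slopes w_m
    return (G - 1) + G * M + G * slopes    # group-specific slopes w_gm
```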

3.2 Interpretation of Model Parameters

In the mixture of latent trait analyzers model, $\eta_g$ is the mixing proportion for group $g$, which corresponds to the prior probability that a randomly chosen individual is in the $g$th group. The behavior of the individuals within group $g$ is characterized by the parameters $b_{gm}$ and $\mathbf{w}_{gm}$. In particular, $b_{gm}$ has a direct effect on the probability of a positive response to variable $m$ given by an individual in group $g$, through the relationship

$$\pi_{gm}(\mathbf{0}) = \frac{1}{1 + \exp(-b_{gm})}. \qquad (9)$$

The value $\pi_{gm}(\mathbf{0})$ is the probability that the median individual in group $g$ has a positive response for variable $m$, since the continuous latent variable $\mathbf{y}_n$ is distributed as a $N(\mathbf{0}, I)$. A measure of the heterogeneity of the values of variable $m$ within group $g$ is given by the slope value $\mathbf{w}_{gm}$; the larger the magnitude of $\mathbf{w}_{gm}$, the greater the differences in the probabilities of positive response to variable $m$ for observations from group $g$. The values of the slope parameters also account for the dependence between observed data variables. For example, if two variables $m$ and $m'$ yield a positive (negative) value for $\mathbf{w}_{gm}^{\top}\mathbf{w}_{gm'}$, then they will both simultaneously have a probability of a positive outcome greater (lesser) than the median probability for group $g$ more often than expected under local independence.

The quantity $\mathbf{w}_{gm}$ can be used to calculate the correlation coefficient, within each group $g$, between the observed variable $m$ and the latent variable, which is given by its standardized value (Bartholomew et al., 2002),

$$w_{gmd}^{*} = \frac{w_{gmd}}{\sqrt{1 + \sum_{d'=1}^{D} w_{gmd'}^{2}}}. \qquad (10)$$

Another useful quantity for analyzing the dependence within groups is a version of the lift (Brin et al., 1997). We use the lift within each group $g$ to quantify the effect of the dependence on the probability of two positive responses compared to the probability of two positive responses under an independence model. The lift is defined as,

$$\text{lift}_g(m, m') = \frac{p(x_{nm} = 1, x_{nm'} = 1 \mid z_{ng} = 1)}{p(x_{nm} = 1 \mid z_{ng} = 1)\, p(x_{nm'} = 1 \mid z_{ng} = 1)}, \qquad (11)$$

where $p(x_{nm} = 1 \mid z_{ng} = 1) = \int \pi_{gm}(\mathbf{y})\, p(\mathbf{y})\, d\mathbf{y}$ and $p(x_{nm} = 1, x_{nm'} = 1 \mid z_{ng} = 1) = \int \pi_{gm}(\mathbf{y})\, \pi_{gm'}(\mathbf{y})\, p(\mathbf{y})\, d\mathbf{y}$. Two independent positive responses have lift equal to 1: the more the variables are dependent, the further the value of the lift is from 1. It is possible to estimate the lift by using the Gauss-Hermite quadrature as,

$$\widehat{\text{lift}}_g(m, m') = \frac{\sum_{k} h(\mathbf{e}_k)\, \hat{\pi}_{gm}(\mathbf{e}_k)\, \hat{\pi}_{gm'}(\mathbf{e}_k)}{\left[\sum_{k} h(\mathbf{e}_k)\, \hat{\pi}_{gm}(\mathbf{e}_k)\right]\left[\sum_{k} h(\mathbf{e}_k)\, \hat{\pi}_{gm'}(\mathbf{e}_k)\right]}, \qquad (12)$$

where $h(\mathbf{e}_k)$ are the appropriate weights associated with the quadrature set of points $\mathbf{e}_k$ and the parameters are estimated using the variational method (Section 3.4). Lift values that are much less than 1 are evidence of negative dependence within groups and lift values that are greater than 1 are evidence of positive dependence within groups.

Finally, the posterior distribution of the latent variable conditional on the observation belonging to a particular group can be obtained from the model outputs (see Section 3.4) and the posterior mean estimates of these scores can be used to interpret the latent variables within each group.

3.3 Related Models

The MLTA model shares common characteristics with a number of models in the statistical and wider scientific literature.

The MLTA model can be seen as a discrete response variable analogue of the continuous response variable mixture of factor analyzers (MFA) model (Ghahramani and Hinton, 1997; McLachlan et al., 2003) and of more recently developed parsimonious analogues of this model (McNicholas and Murphy, 2008, 2010; Baek et al., 2010). In these models, a discrete latent class variable accounts for grouping structure and a latent factor score accounts for correlation between response variables within groups. The component means in the MFA model are analogous to the intercepts in the MLTA model, the loading matrix is analogous to the slope parameters, and the mixing proportions take an identical role in both models. Additionally, Muthén (2001) describes a general latent variable model for continuous outcome variables as used by the Mplus software; this model also uses both continuous and discrete latent variables in the model framework.

The Rasch model (Rasch, 1960) is a commonly used model for analyzing binary (and categorical) data and it has been widely used in educational statistics. The Rasch model has a very similar structure to the LTA model. In the Rasch model, the probability of a positive outcome is given as

$$p(x_{nm} = 1 \mid \theta_n, \beta_m) = \frac{\exp(\theta_n - \beta_m)}{1 + \exp(\theta_n - \beta_m)},$$

where the $\theta_n$ values are called ability parameters and the $\beta_m$ values are called difficulty parameters; these correspond to the latent trait and the intercept parameters in the LTA model. Furthermore, the model makes a local independence assumption when constructing the probability

$$p(\mathbf{x}_n \mid \theta_n, \boldsymbol{\beta}) = \prod_{m=1}^{M} p(x_{nm} \mid \theta_n, \beta_m),$$

where $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_M)$. In some formulations of the Rasch model, the ability parameters are treated as fixed unknown parameters and in other formulations they are treated as random effects, thus making the model equivalent to the univariate LTA model.

Consequently, the mixture of latent trait analyzers (MLTA) model has a similar structure to the mixture Rasch model (Rost, 1990; Rost and von Davier, 1995; von Davier and Yamamoto, 2007), where the response function is:

$$p(x_{nm} = 1 \mid z_{ng} = 1, \theta_n) = \frac{\exp(\theta_n - \beta_{gm})}{1 + \exp(\theta_n - \beta_{gm})}.$$

However, as with the Rasch model, the mixture Rasch model usually includes a univariate ability parameter that is the same across all the variables and groups, which would be equivalent to assuming in the MLTA model that $D = 1$ and $w_{gm} = 1$ for all $g$ and $m$.

von Davier et al. (2007) extend the mixture Rasch model in order to allow multivariate ability parameters:

$$p(x_{nm} = 1 \mid z_{ng} = 1, \boldsymbol{\theta}_n) = \frac{\exp(\mathbf{a}_m^{\top}\boldsymbol{\theta}_n - \beta_{gm})}{1 + \exp(\mathbf{a}_m^{\top}\boldsymbol{\theta}_n - \beta_{gm})},$$

where the $\mathbf{a}_m$ are variable-specific parameters, and $\boldsymbol{\theta}_n$ is the $D$-dimensional ability parameter. This is equivalent to an MLTA model with $\mathbf{w}_{gm} = \mathbf{w}_m$ for all $g$, as in our parsimonious model.

Thus, many versions of the mixture Rasch model are included as special cases of the MLTA model. Again, the latent ability parameter is either assumed to be a fixed or random effect that varies across observations to accommodate different outcome probabilities; the latent ability parameter is the primary quantity of interest in many Rasch modeling applications.

Another model that has a very similar structure to the mixture Rasch model is the mixed latent trait (MLT) model described in Uebersax (1999); this model also has a univariate latent trait, uses a probit structure, and requires numerical integration for computing the likelihood, so it can be difficult to apply to large heterogeneous data sets. Qu et al. (1996) and Hadgu and Qu (1998) use a two-component model with a similar form to the MLT model in medical diagnosis.

Additionally, Uebersax (1999) describes a probit latent class model (PLCA) which has a similar structure to the MLTA model, but this model uses a multivariate latent trait which has the same dimensionality as the outcome variable. Vermunt and Magidson (2005) developed a latent class factor analysis (LCFA) model which uses discrete latent variables to achieve similar modeling aims to latent trait analysis, but which can be fitted in a much more computationally efficient manner.

Perhaps the closest connections of the proposed MLTA model are to the family of models identified as multilevel mixture item response models (Vermunt, 2007). Key differences between our model and the multilevel mixture item response model are that we focus on a multivariate trait parameter (which is optional only in the model of Vermunt) and that we also introduce a constrained parsimonious version of the model. In addition, we offer a computationally efficient alternative algorithm for fitting this model without the need to resort to numerical quadrature methods. The MLTA and multilevel mixture item response models can be estimated using the software Latent GOLD (Vermunt and Magidson, 2008), but the fact that this software uses quadrature methods for the numerical integration of continuous latent variables probably makes it less suitable for analyzing large datasets with underlying high-dimensional latent trait structures.

3.4 Model Fitting

When fitting the MLTA model, the integral in the log-likelihood (7) cannot be solved analytically, since it is exactly of the same form as in latent trait analysis (4). To obtain an approximation of the likelihood it is possible to use an EM algorithm:

  1. E-step: Estimate $\hat{z}_{ng}$ as the approximate posterior probability that an individual with response vector $\mathbf{x}_n$ belongs to group $g$.

  2. M-step: Estimate $\eta_g$ as the proportion of observations in group $g$.

    Estimate the integral in the complete-data log-likelihood by using the variational approach.

    Estimate the parameters $b_{gm}$ and $\mathbf{w}_{gm}$ for $g = 1, \ldots, G$ and $m = 1, \ldots, M$.

    Also, estimate the approximate log-likelihood value.

  3. Return to step 1 until convergence is attained.

The advantages and the drawbacks of the different estimation methods for the likelihood are the same as in latent trait analysis (Section 2.2.4).

We further focus on the variational approach to approximating the log-likelihood because it can be efficiently implemented. Similarly to the latent trait analysis case, we introduce variational parameters $\boldsymbol{\xi}_{ng} = (\xi_{ng1}, \ldots, \xi_{ngM})$, where $\xi_{ngm} \neq 0$ for all $n$, $g$ and $m$, to approximate the component densities with a lower bound:

$$p(\mathbf{x}_n \mid z_{ng} = 1) \geq \tilde{p}(\mathbf{x}_n \mid z_{ng} = 1, \boldsymbol{\xi}_{ng}) = \int \tilde{p}(\mathbf{x}_n \mid \mathbf{y}_n, z_{ng} = 1, \boldsymbol{\xi}_{ng})\, p(\mathbf{y}_n)\, d\mathbf{y}_n,$$

where the conditional distribution is approximated by,

$$\tilde{p}(\mathbf{x}_n \mid \mathbf{y}_n, z_{ng} = 1, \boldsymbol{\xi}_{ng}) = \prod_{m=1}^{M} \sigma(\xi_{ngm}) \exp\left\{\frac{A_{ngm} - \xi_{ngm}}{2} - \lambda(\xi_{ngm})\left(A_{ngm}^{2} - \xi_{ngm}^{2}\right)\right\},$$

where $A_{ngm} = (2x_{nm} - 1)(b_{gm} + \mathbf{w}_{gm}^{\top}\mathbf{y}_n)$, $\sigma(\xi) = 1/(1 + e^{-\xi})$ and $\lambda(\xi) = \tanh(\xi/2)/(4\xi)$, as in Section 2.2.3.

To obtain the approximation of the log-likelihood it is necessary to use a double EM algorithm; on the $t$th iteration:

  1. E-step: Estimate $\hat{z}_{ng}$ with:

    $$\hat{z}_{ng} = \frac{\hat{\eta}_g\, \tilde{p}(\mathbf{x}_n \mid z_{ng} = 1, \boldsymbol{\xi}_{ng})}{\sum_{g'=1}^{G} \hat{\eta}_{g'}\, \tilde{p}(\mathbf{x}_n \mid z_{ng'} = 1, \boldsymbol{\xi}_{ng'})}.$$

  2. M-step: Estimate $\eta_g$ using:

    $$\hat{\eta}_g = \frac{1}{N}\sum_{n=1}^{N} \hat{z}_{ng}.$$

  3. Estimate the likelihood:

    1. E-step: Compute the latent posterior statistics for $p(\mathbf{y}_n \mid \mathbf{x}_n, z_{ng} = 1, \boldsymbol{\xi}_{ng})$, which is a $N(\boldsymbol{\mu}_{ng}, \boldsymbol{\Sigma}_{ng})$ density:

      $$\boldsymbol{\Sigma}_{ng} = \left[I_D + 2\sum_{m=1}^{M} \lambda(\xi_{ngm})\, \mathbf{w}_{gm}\mathbf{w}_{gm}^{\top}\right]^{-1}$$

      and

      $$\boldsymbol{\mu}_{ng} = \boldsymbol{\Sigma}_{ng}\sum_{m=1}^{M}\left[x_{nm} - \frac{1}{2} - 2\lambda(\xi_{ngm})\, b_{gm}\right]\mathbf{w}_{gm}.$$

    2. M-step: Optimize the variational parameters $\xi_{ngm}$ in order to make the approximation as close as possible to $p(\mathbf{x}_n \mid z_{ng} = 1)$ for all $n$, $g$ and $m$, using:

      $$\xi_{ngm}^{2} = \mathbf{w}_{gm}^{\top}\left(\boldsymbol{\Sigma}_{ng} + \boldsymbol{\mu}_{ng}\boldsymbol{\mu}_{ng}^{\top}\right)\mathbf{w}_{gm} + 2 b_{gm}\, \mathbf{w}_{gm}^{\top}\boldsymbol{\mu}_{ng} + b_{gm}^{2};$$

      since $\tilde{p}(\mathbf{x}_n \mid \mathbf{y}_n, z_{ng} = 1, \boldsymbol{\xi}_{ng})$ is symmetric in $\xi_{ngm}$, choosing either the positive or the negative root of $\xi_{ngm}^{2}$ yields the same result (a code sketch of these two sub-steps is given after this algorithm).

    3. Optimise the model parameters $b_{gm}$ and $\mathbf{w}_{gm}$ in order to increase the approximate likelihood using:

      $$\hat{\tilde{\mathbf{w}}}_{gm} = \left[\sum_{n=1}^{N} \hat{z}_{ng}\, 2\lambda(\xi_{ngm})\, E\!\left(\hat{\mathbf{y}}_{ng}\hat{\mathbf{y}}_{ng}^{\top}\right)\right]^{-1}\left[\sum_{n=1}^{N} \hat{z}_{ng}\left(x_{nm} - \frac{1}{2}\right) E\!\left(\hat{\mathbf{y}}_{ng}\right)\right],$$

      where $\tilde{\mathbf{w}}_{gm} = (b_{gm}, \mathbf{w}_{gm}^{\top})^{\top}$, $\hat{\mathbf{y}}_{ng} = (1, \mathbf{y}_n^{\top})^{\top}$, so that $E(\hat{\mathbf{y}}_{ng}) = (1, \boldsymbol{\mu}_{ng}^{\top})^{\top}$ and

      $$E\!\left(\hat{\mathbf{y}}_{ng}\hat{\mathbf{y}}_{ng}^{\top}\right) = \begin{pmatrix} 1 & \boldsymbol{\mu}_{ng}^{\top} \\ \boldsymbol{\mu}_{ng} & \boldsymbol{\Sigma}_{ng} + \boldsymbol{\mu}_{ng}\boldsymbol{\mu}_{ng}^{\top} \end{pmatrix}.$$

    4. Estimate the lower bound:

      $$\log \tilde{p}(\mathbf{x}_n \mid z_{ng} = 1, \boldsymbol{\xi}_{ng}) = \frac{1}{2}\log|\boldsymbol{\Sigma}_{ng}| + \frac{1}{2}\boldsymbol{\mu}_{ng}^{\top}\boldsymbol{\Sigma}_{ng}^{-1}\boldsymbol{\mu}_{ng} + \sum_{m=1}^{M}\left[\log\sigma(\xi_{ngm}) - \frac{\xi_{ngm}}{2} + \lambda(\xi_{ngm})\,\xi_{ngm}^{2} + \left(x_{nm} - \frac{1}{2}\right) b_{gm} - \lambda(\xi_{ngm})\, b_{gm}^{2}\right]$$

      and the log-likelihood:

      $$\tilde{\ell} = \sum_{n=1}^{N} \log\left[\sum_{g=1}^{G} \hat{\eta}_g\, \tilde{p}(\mathbf{x}_n \mid z_{ng} = 1, \boldsymbol{\xi}_{ng})\right].$$

  4. Return to step 1 until convergence is attained.

  5. Estimate the log-likelihood by using the Gauss-Hermite quadrature:

    $$\ell \approx \sum_{n=1}^{N} \log\left[\sum_{g=1}^{G} \hat{\eta}_g \sum_{k=1}^{K^D} h(\mathbf{e}_k)\, p(\mathbf{x}_n \mid \mathbf{e}_k, z_{ng} = 1)\right], \qquad (13)$$

    where $h(\mathbf{e}_k)$ are the appropriate weights associated to the quadrature set of points $\mathbf{e}_k$ and $p(\mathbf{x}_n \mid \mathbf{e}_k, z_{ng} = 1)$ is given by (5) and (6).
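For concreteness, here is a minimal sketch in Python of the posterior-statistics and variational-parameter updates (sub-steps 1 and 2 of step 3) for a single observation and a single group, under the equations as reconstructed above; the function and argument names are ours, and this is an illustration, not the authors' implementation.

```python
# A sketch of the inner variational updates for one observation x (length M)
# and one group g, with intercepts B (length M), slopes W (M x D), and
# current variational parameters xi (length M, non-zero). Names are ours.
import numpy as np

def variational_updates(x, B, W, xi):
    """One pass of posterior statistics and the xi update."""
    M, D = W.shape
    L = np.tanh(xi / 2.0) / (4.0 * xi)      # lambda(xi_m), length M
    # Posterior covariance and mean of y | x, group g (step 3, sub-step 1)
    Sigma = np.linalg.inv(np.eye(D) + 2.0 * (W.T * L) @ W)
    mu = Sigma @ (W.T @ (x - 0.5 - 2.0 * L * B))
    # Variational update (step 3, sub-step 2): xi_m^2 = E[(b_m + w_m'y)^2]
    Eyy = Sigma + np.outer(mu, mu)
    xi_new = np.sqrt(np.einsum('md,de,me->m', W, Eyy, W)
                     + 2.0 * B * (W @ mu) + B**2)
    return mu, Sigma, xi_new
```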

To fit the parsimonious model outlined in Section 3.1 it is possible to use the variational approach with the EM algorithm described above, except for the estimate of the model parameters at step 3c, where the common slope parameters are estimated by pooling across groups:

$$\hat{\mathbf{w}}_{m} = \left[\sum_{n=1}^{N}\sum_{g=1}^{G} \hat{z}_{ng}\, 2\lambda(\xi_{ngm})\, E\!\left(\mathbf{y}_{ng}\mathbf{y}_{ng}^{\top}\right)\right]^{-1}\left[\sum_{n=1}^{N}\sum_{g=1}^{G} \hat{z}_{ng}\left(x_{nm} - \frac{1}{2} - 2\lambda(\xi_{ngm})\, b_{gm}\right) E\!\left(\mathbf{y}_{ng}\right)\right],$$

where $E(\mathbf{y}_{ng}) = \boldsymbol{\mu}_{ng}$ and $E(\mathbf{y}_{ng}\mathbf{y}_{ng}^{\top}) = \boldsymbol{\Sigma}_{ng} + \boldsymbol{\mu}_{ng}\boldsymbol{\mu}_{ng}^{\top}$, and,

$$\hat{b}_{gm} = \frac{\sum_{n=1}^{N} \hat{z}_{ng}\left[x_{nm} - \frac{1}{2} - 2\lambda(\xi_{ngm})\, \mathbf{w}_{m}^{\top}\boldsymbol{\mu}_{ng}\right]}{\sum_{n=1}^{N} \hat{z}_{ng}\, 2\lambda(\xi_{ngm})}.$$

The derivation of these parameter estimates is given in Section 1 of the supplementary material.

3.5 Model Identifiability

Model identifiability is an important issue for a model involving many parameters. Goodman (1974), Bartholomew et al. (2011) and Dean and Raftery (2010) give a detailed explanation of model identifiability in latent class analysis and Bartholomew (1980) introduces this issue in the latent trait analysis context. Allman et al. (2009) argue that the classical definition of identifiability is too strong for many latent variable models, and they introduce the concept of "generic identifiability", which implies that the set of points for which identifiability does not hold has measure zero. They explore generic identifiability for different models, including LCA.

A necessary condition for model identifiability is that the number of free estimated parameters not exceed the number of possible data patterns. Nevertheless, this condition is not sufficient, as the actual information in a dataset can be less, depending on the size of the dataset or the frequency of pattern occurrences within it.

As with the loading matrices in mixture models with a factor analytic structure (eg. Ghahramani and Hinton, 1997; McNicholas and Murphy, 2008) the slope parameters are only identifiable up to a rotation of the factors. The rotational freedom of the factor scores is important when determining the number of parameters in the model (Section 3.1 and Section 4).

In addition, model identifiability holds if the observed information matrix is of full rank. As a result, possible non-identifiability can manifest itself through high standard errors of the parameter estimates. Another empirical method that can be used to assess non-identifiability consists of checking whether the same maximized likelihood value is reached with different estimates of the model parameter values when starting the EM algorithm from different starting values.

We found that these checks for identifiability were all satisfied in the empirical examples discussed in Section 6.

4 Model Selection

The Bayesian Information Criterion (BIC) (Schwarz, 1978) can be used to select the model,

$$\text{BIC} = -2\ell + k \log N,$$

where $k$ is the number of free parameters in the model and $N$ is the number of observations. The model with the lower value of BIC is preferable. It is important to remember that the BIC could be overestimated if the log-likelihood is approximated by using the variational approach; hence the proposal to use Gauss-Hermite quadrature to evaluate the maximized log-likelihood for model selection purposes.

In the context of MLTA, the values of $G$ and $D$, and whether the $\mathbf{w}_{gm}$ values are constrained to be equal across groups, need to be determined.

In the mixture of latent trait analyzers context (and more widely within finite mixture models), the BIC penalizes models with high $G$ and/or high $D$ too heavily. The BIC as defined by Schwarz (1978) implicitly assumes that all the parameter estimates depend on the entire set of observations, but in MLTA the estimates of $b_{gm}$ and $\mathbf{w}_{gm}$ depend just on the observations that belong to group $g$. So, an alternative penalized criterion is given as

$$\text{BIC}^{*} = -2\ell + (G - 1)\log N + \sum_{g=1}^{G} k_g \log(N \hat{\eta}_g),$$

where $k_g$ is the number of free parameters depending on each group of observations. This criterion penalizes parameters to a lesser extent than BIC because only the estimated number of observations involved in the estimation of each parameter is used in the penalty. This version of BIC has previously been proposed by Steele (2002) and similar criteria have been proposed by Pauler (1998) and Raftery et al. (2007). It is worth noting that BIC* will behave in a similar way to BIC for large sample sizes. Hence, it can be seen as a small sample adjustment to BIC.
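As an illustration, both criteria can be computed as in the following sketch; the function and argument names are ours, and `loglik` is assumed to be the quadrature-evaluated maximized log-likelihood.

```python
# A small sketch of BIC and the group-adjusted BIC* written above; k is the
# total free-parameter count, eta the fitted mixing proportions, and k_group
# the per-group parameter counts (an array of length G). Names are ours.
import numpy as np

def bic(loglik, k, N):
    return -2.0 * loglik + k * np.log(N)

def bic_star(loglik, eta, k_group, N):
    eta = np.asarray(eta)
    G = len(eta)
    penalty = (G - 1) * np.log(N) + np.sum(k_group * np.log(N * eta))
    return -2.0 * loglik + penalty
```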

The Pearson's $\chi^2$ test can be used to check the goodness of the model fit. The statistic is calculated as

$$X^{2} = \sum_{s=1}^{S} \frac{(O_s - E_s)^{2}}{E_s},$$

where $S$ is the total number of possible patterns, $S_o$ is the number of observed patterns, and $O_s$ and $E_s$ represent the observed and expected frequencies for the $s$-th response pattern, respectively. Under the null hypothesis the statistic is asymptotically distributed as a $\chi^2$ as $N \to \infty$. If $M$ is large it is common to have a large number of very small counts and the Pearson's $\chi^2$ test is not applicable. In this case, Erosheva et al. (2007) suggest the truncated SSPR criterion to examine deviations between expected and observed frequencies via the sum of squared Pearson residuals only for the patterns with large observed frequencies.
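Once observed and expected frequencies for the response patterns are available, the truncated SSPR is straightforward to compute; the following sketch uses hypothetical argument names.

```python
# A sketch of the truncated SSPR criterion: the sum of squared Pearson
# residuals over response patterns whose observed frequency is at or above
# a chosen cutoff.
import numpy as np

def truncated_sspr(observed, expected, cutoff):
    """observed, expected: frequencies for the same response patterns."""
    observed, expected = np.asarray(observed), np.asarray(expected)
    keep = observed >= cutoff
    r = (observed[keep] - expected[keep]) / np.sqrt(expected[keep])
    return np.sum(r**2)
```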

5 Computational Aspects

The estimates of the parameters in the latent class analysis models are the exact maximum likelihood estimates. Since the dimensionality of the data in the two data sets is large, the parameters for the models with $D \geq 1$ are estimated by using the variational approach, and the log-likelihood has been calculated at the last step of the algorithm by using Gauss-Hermite quadrature with 5 quadrature points per dimension; the results were not sensitive to the number of quadrature points but the computational time was very heavily dependent on this number.

The categorical latent variables $\mathbf{z}_n$ have been initialized by randomly assigning each observation to one of the $G$ possible groups. The variational parameters $\xi_{ngm}$ have been initialized at a common fixed value; this choice of initial approximation of the conditional distribution reduces the dependence of the final estimates on the initializing values. The model parameters $b_{gm}$ and $\mathbf{w}_{gm}$ have been initialized by randomly generated numbers from a standard normal distribution. Ten random starts of the algorithm were used and the solution with the maximum likelihood value was selected.

The standard errors of the model parameters have been calculated using the jackknife method (Efron, 1981). It is worth noting that, when employing this method, the estimates of the parameters without the $n$-th observation can be obtained in just a few iterations.

Since the EM algorithm is linearly convergent, a criterion based on the Aitken acceleration (McLachlan and Peel, 2000) has been used to determine the convergence of the algorithm. The EM algorithm has been stopped when

$$\left|\ell_{\infty}^{(t+1)} - \ell^{(t)}\right| < \epsilon,$$

where $t$ is the iteration index, $\epsilon$ is the desired tolerance, and

$$\ell_{\infty}^{(t+1)} = \ell^{(t)} + \frac{\ell^{(t+1)} - \ell^{(t)}}{1 - a^{(t)}}, \qquad a^{(t)} = \frac{\ell^{(t+1)} - \ell^{(t)}}{\ell^{(t)} - \ell^{(t-1)}},$$

is the Aitken estimate of the limiting value of the log-likelihood.
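A sketch of this stopping rule follows; it implements the variant written above (the literature differs slightly on exactly which difference is monitored), and the names are ours.

```python
# A minimal sketch of the Aitken-acceleration stopping rule; ll is assumed
# to hold the sequence of (approximate) log-likelihood values so far.
def aitken_converged(ll, tol=1e-4):
    """Check convergence from the last three log-likelihood values."""
    if len(ll) < 3:
        return False
    l0, l1, l2 = ll[-3], ll[-2], ll[-1]   # l^(t-1), l^(t), l^(t+1)
    a = (l2 - l1) / (l1 - l0)             # Aitken acceleration factor a^(t)
    l_inf = l1 + (l2 - l1) / (1.0 - a)    # estimated limiting log-likelihood
    return abs(l_inf - l1) < tol
```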

6 Applications

6.1 National Long Term Care Survey (NLTCS)

Erosheva (2002, 2003, 2004) and Erosheva et al. (2007) report on 16 binary outcome variables recorded for 21574 elderly (age 65 and above) people who took part in the National Long-Term Care Survey in the years 1982, 1984, 1989 and 1994. The outcome variables record the level of disability and can be divided into two subsets: the "activities of daily living" (ADLs) and the "instrumental activities of daily living" (IADLs). The first subset is composed of the first six variables, which include basic activities of hygiene and personal care: (1) eating, (2) getting in/out of bed, (3) getting around inside, (4) dressing, (5) bathing, (6) using the toilet. The second subset concerns the basic activities necessary to reside in the community: (7) doing heavy house work, (8) doing light house work, (9) doing laundry, (10) cooking, (11) grocery shopping, (12) getting about outside, (13) travelling, (14) managing money, (15) taking medicine, (16) telephoning. The responses are coded as 1 = disabled and 0 = able.

The MLTA model is fitted to these data for $G = 1, \ldots, 11$ and $D = 0, 1, 2, 3$; the case $D = 0$ corresponds to the LCA model and the case $G = 1$ (with $D \geq 1$) corresponds to the LTA model.

Table 1 records the log-likelihood evaluated for the parameter estimates found using the algorithm outlined in Section 2.2.3. Estimates of the log-likelihood and the lower bound are reported to emphasize the importance of estimating the log-likelihood at the last step of the variational approach by using the Gauss-Hermite quadrature instead of the lower bound values.

G    D=0           D=1                        D=2                        D=3
1    -200085.10    -140318.06 (-145827.00)    -136169.53 (-144483.30)    -136075.66 (-144483.30)
2    -152527.30    -135301.29 (-139930.20)    -134273.79 (-139283.00)    -134275.46 (-139091.30)
3    -141277.10    -134362.61 (-136349.10)    -133025.27 (-136080.80)    -133008.17 (-136066.50)
4    -137464.20    -133120.36 (-134540.70)    -131839.77 (-134253.60)    -132116.82 (-134145.00)
5    -135216.20    -131813.29 (-133203.40)    -131505.23 (-133061.10)    -131393.42 (-132919.80)
6    -133643.80    -131396.59 (-132261.50)    -131154.94 (-132161.60)    -130992.52 (-132123.00)
7    -132659.70    -131120.79 (-131840.30)    -130729.39 (-131785.80)    -130607.37 (-131731.10)
8    -132202.90    -130708.20 (-131565.90)    -130450.55 (-131146.60)    -130403.20 (-131106.60)
9    -131367.70    -130342.81 (-130866.20)    -130164.32 (-130855.20)    -130155.19 (-130798.00)
10   -131155.90    -130135.91 (-130806.60)    -130049.64 (-130681.20)    -129936.33 (-130544.50)
11   -130922.60    -130110.22 (-130574.40)    -129860.74 (-130475.10)    -129881.83 (-130404.50)
Table 1: Approximated log-likelihood and lower bound (in parentheses) for the models with G = 1, ..., 11 and D = 0, 1, 2, 3 (D = 0 corresponds to the LCA model, for which the likelihood is exact).

The estimated BIC and BIC* have been used to select the best model (Table 2); both criteria indicate the model with 10 groups and a one-dimensional latent variable as the best model. Fienberg et al. (2009) show that the LCA model that minimizes the BIC has nineteen latent classes, so the mixture of latent trait analyzers suggests that there are far fewer groups. The nineteen latent class model has a lower BIC (BIC = 262165.07), but they argue that this number of classes is not sensible in the context of the application. This suggests that the LCA is using multiple groups to account for dependence and that the latent classes do not necessarily correspond to data clusters.

BIC:
G    D=0          D=1          D=2          D=3
1    400329.84    280955.46    272808.09    272760.05
2    305383.99    271251.24    269495.61    269778.37
3    283053.38    269703.18    267477.56    267862.51
4    275597.01    267548.00    265585.58    266698.51
5    271270.65    265263.18    265395.51    265870.43
6    268295.46    264759.09    265173.93    265687.34
7    266496.97    264536.80    264801.82    265535.74
8    264882.64    264040.94    264723.15    265746.12
9    264252.28    263639.47    264629.69    265868.82
10   263998.39    263554.99    264879.33    266049.81
11   263701.27    263832.93    264980.55    266559.53

BIC*:
G    D=0          D=1          D=2          D=3
1    400329.84    280955.46    272808.09    272760.05
2    305360.21    271204.68    269427.88    269690.47
3    282997.58    269596.55    267322.64    267661.41
4    275504.40    267367.43    265322.50    266345.65
5    271136.22    264993.58    265006.92    265366.66
6    268113.98    264403.40    264614.73    264973.09
7    266267.39    264069.53    264102.10    264643.31
8    264591.64    263464.76    263877.58    264649.68
9    263910.91    262952.98    263618.30    264544.87
10   263600.50    262766.36    263684.61    264524.20
11   263242.58    262921.59    263581.68    264754.70
Table 2: The estimated BIC (top) and BIC* (bottom) for the models with G = 1, ..., 11 and D = 0, 1, 2, 3.

Since there is a large number of response patterns with a very small number of observations (of all the $2^{16} = 65536$ possible response patterns, 62384 (95.2%) contain zero counts and only 481 (0.7%) contain more than 5 counts), the Pearson's $\chi^2$ test is not applicable and the truncated SSPR criterion has been used to test the goodness of fit of the model. Three different levels of truncation are shown in Table 3.

The best model selected by the BIC and the BIC* is one of the best fits as indicated by the SSPR over all three levels of truncation. Table 1 of the supplementary material shows a comparison of the observed and the expected frequencies for the response patterns with more than 100 observations under the best model selected. The table shows that there is a close match between the observed and expected frequencies under this model.

From Tables 2 and 3 it is also possible to see that the mixture of latent trait analyzers model is more appropriate than latent class analysis and latent trait analysis, in terms of both BIC and goodness of fit.

First truncation level:
G    D=0          D=1      D=2      D=3
1    6.1e+09      2441     1924     4082
2    80199        1600     1318     1718
3    5304         1470     1410     1597
4    4875         1439     638      806
5    2717         598      490      545
6    1434         323      533      462
7    561          561      561      418
8    413          413      413      234
9    417          253      183      229
10   348          160      292      150
11   280          196      171      118

Second truncation level:
G    D=0          D=1      D=2      D=3
1    6.3e+09      20524    21513    25795
2    129642       4614     3590     4822
3    16588        4610     2974     3318
4    10762        3060     1874     2530
5    6321         1788     1498     1683
6    3901         1423     1787     1530
7    2412         2412     2412     1450
8    1837         1837     1837     1228
9    1647         987      938      1054
10   1348         723      827      946
11   1336         872      760      700

Third truncation level:
G    D=0          D=1      D=2      D=3
1    6.3e+09      38927    25130    30009
2    182265       8126     6614     9654
3    61331        7610     4898     5814
4    17405        5498     2993     4609
5    10886        3414     3037     3255
6    6652         2498     2758     2817
7    4808         4808     4808     2468
8    3792         3792     3792     2115
9    3010         1725     1534     1761
10   2695         1367     1495     1558
11   2246         1518     1348     1481
Table 3: The sum of squared Pearson residuals for different levels of truncation of the observed pattern frequencies, for models with G = 1, ..., 11 and D = 0, 1, 2, 3.

The parameter estimates and standard errors for the selected model are reported in Table 4. The standardized values $w^{*}_{gm}$ and median probabilities $\pi_{gm}(\mathbf{0})$ are also reported to aid interpretation of the groups found by this model.