Extending Growth Mixture Models Using Continuous NonElliptical Distributions
Abstract
Growth mixture models (GMMs) incorporate both conventional random effects growth modeling and latent trajectory classes as in finite mixture modeling; therefore, they offer a way to handle the unobserved heterogeneity between subjects in their development. GMMs with Gaussian random effects dominate the literature. When the data are asymmetric and/or have heavier tails, more than one latent class is required to capture the observed variable distribution. Therefore, a GMM with continuous nonelliptical distributions is proposed to capture skewness and heavier tails in the data set. Specifically, multivariate skewt distributions and generalized hyperbolic distributions are introduced to extend GMMs. When extending GMMs, four statistical models are considered with differing distributions of measurement errors and random effects. The mathematical development of GMMs with nonelliptical distributions relies on their expression as normal variancemean mixtures and the resultant relationship with the generalized inverse Gaussian distribution. Parameter estimation is outlined within the expectationmaximization framework before the performance of our GMMs with nonelliptical distributions is illustrated on simulated and real data.
1 Introduction
Many longitudinal studies focus on investigating how individuals change over time with respect to a characteristic that is measured repeatedly for each participant. Conventional random effect growth modeling has provided a number of tools for modeling intraindividual change and interindividual differences in change (e.g., Laird and Ware, 1982; Bryk and Raudenbush, 1987, 1992; McArdle and Epstein, 1987; Singer and Willett, 2003), where withinclass changes are described as a function of time and betweenclass changes are described by random effects and coefficients. Conventional random effects models provide a basis for formulating growth mixture models (GMMs) for longitudinal data. In the latent variable framework, a quadratic random effects growth model with a continuous outcome for individual at time , and timeinvariant covariates , is specified according to
(1)  
(2)  
where are time scores () centred at , is the random intercept, is the random linear slope, is the quadratic growth rate, and and are normally distributed residuals, i.e., and . Formally, , , and are continuous latent variables (called the growth factors) representing the growth patterns, the are the mean parameters for the growth factors if there are no covariates , and the are the effects of covariates on the growth factors. In conventional growth modeling applications, the individual growth parameters (e.g., individual intercept and slope factors) are usually assumed to be identically distributed, i.e., drawn from a single homogeneous population. However, we are often interested in and deal with samples from multiple populations and, in most cases, the class memberships are either unknown or unobserved.
Simultaneous modeling of change over time and unobserved multiple populations (heterogeneity) in the data can be accommodated using GMMs and latent class growth analysis (LCGA), wherein parameters describing growth patterns are estimated and each individual’s most likely class membership is obtained via maximum a posteriori (MAP) probabilities. GMMs were introduced by Verbeke and Lesaffre (1996) and then extended by Muthén and Shedden (1999), Muthén (2004), and Muthén and Asparouhov (2008). For convenience, define so that if individual falls in class and otherwise. The quadratic growth model in (1) and (2) can be extended to a simple GMM in class () via
(3)  
(4)  
where parameters vary across classes to capture different trajectories; the parameters , and remain the same across classes but this could be relaxed to allow variation across classes with respect to how a covariate affects the growth factors; and and are still normally distributed residuals but with classspecific covariance matrices, i.e., and . The latent class growth analysis (LCGA) developed by Nagin and Land (Nagin and Land, 1993; Nagin, 1999, 2005) can be thought of as one special case of GMMs in the sense that it assumes zero withinclass growth factor variances, i.e., for .
One common fundamental assumption for GMMs is that model errors are normally distributed. However, simulation studies in Bauer and Curran (2003) show that, when the data are drawn from a single nonGaussian distribution, a twoclass Gaussian GMM is preferred when fitting the data. In such cases, the withinclass parameter estimates become uninterpretable because there are too many groups. Muthén and Asparouhov (2015) give an example of strongly nonnormal outcomes, i.e., body mass index (BMI) development over age, and show that more than one latent class is required to capture the observed variable distribution — interpreting mixture components as subpopulations will lead to overestimation of the number of subpopulations. In general, nonelliptical distributions are used in multivariate analysis for the study of asymmetric data; such distributions can have a concentration parameter to account for heavy tailed data. Intuitively, in the twodimensional case, the joint distribution forms a nonelliptical shape in the isodensity plot. Relaxing the normality assumption for asymmetry and skewness, Muthén and Asparouhov (2015) develop a GMM with a normally distributed model error and “classical” multivariate skewt (cMST) random effects, i.e., and . An alternative specification of GMMs (called nonlinear mixed effect mixture models) was developed by Lu and Huang (2014) within a Bayesian framework, wherein is assumed to follow a cMST while remains normally distributed. Note that the “classical” formulation of the multivariate skewt distribution is that given by Azzalini and Valle (1996) and Azzalini and Capitanio (1999). In this paper, we outline a more general extension of GMMs to the generalized hyperbolic distribution while also considering the formulation of the multivariate skewt distribution that arises as its special and limiting case (Section 2). The advantage of the generalized hyperbolic distribution is its flexibility. Many other wellknown distributions are special or limiting cases of the generalized hyperbolic distribution; please refer to McNeil et al. (2005) for details on a variety of limiting cases of the generalized hyperbolic distribution.
The remainder of this article is laid out as follows. In Section 2, we go through some important background material on the generalized hyperbolic distribution as well as a special and limiting case that gives a formulation of the multivariate skewt distribution. In Section 3, we outline the extension of GMMs to generalized hyperbolic and multivariate skewt distributions, respectively. Section 3.5 presents an expectationmaximization (EM) algorithm for obtaining maximum likelihood estimates of model parameters. Then, our approach is illustrated on simulated and real data (Section 4). The paper concludes with some discussion and suggestions for future work (Section 5).
2 Background
2.1 Generalized hyperbolic distribution
A multivariate generalized hyperbolic distribution arises from a multivariate meanvariance mixture where the weight function is the density of a generalized inverse Gaussian (GIG) distribution. The density of the GIG distribution is given by
(5) 
for , where is a scale parameter, is a concentration parameter, denotes the modified Bessel function of the third kind with index , and characterizes certain subclasses and considerably influences the size of tail weights. Write to denote that the random variable has the density in (5). The GIG distribution has some attractive features, including tractable expected values. Consider , then the following expected values hold:
(6)  
Extensive details on GIG distribution can be found in Jørgensen (1982).
Browne and McNicholas (2015) show that a dimensional generalized hyperbolic random variable can be generated using the relationship
(7) 
where , and are vectors that play the role of location and skewness parameters, respectively, and . From (7), it follows that . Now, recalling that and that the unconditional distribution of is a generalized hyperbolic, Bayes’ theorem gives
(8) 
Under this parameterization, a dimensional multivariate generalized hyperbolic distribution has density
(9) 
with index parameter , concentration parameter , skewness parameter , mean vector , and scale matrix . Here, is the squared Mahalanobis distance between and , i.e., , and are modified Bessel functions of the third kind with indices and , respectively, and denotes the model parameters. Herein, let represent a dimensional generalized hyperbolic random variable with density as per (9). Note that the parameterization in (9) is one of several available for multivariate generalized hyperbolic distributions (see McNeil et al., 2005; Browne and McNicholas, 2015).
There are a number of special and limiting cases that can be derived from the generalized hyperbolic distribution. However, the presence of the index parameter enables a flexibility that is not found in its special and limiting cases. Figure 1 illustrates the Gaussian distribution as well as a skewt distribution with degrees of freedom and the generalized hyperbolic distribution for two different values of ; clearly, the different values of lead to very different densities.
2.2 Multivariate skewt distribution
Several alternative formulations of the multivariate skewt distribution have appeared in the literature, e.g., Azzalini and Valle (1996), Sahu et al. (2003), and ArellanoValle and Genton (2005). Some recent discussion about some of these formulations is given by Azzalini et al. (2016). Muthén and Asparouhov (2015) developed a GMM with the SDB version of the restricted multivariate skewt distribution, i.e., the version of Sahu et al. (2003). The formulation of the multivariate skewt distribution used herein arises as a special and limiting case of the generalized hyperbolic distribution by setting and while also letting . This formulation of the multivariate skewt distribution has been used by Murray et al. (2014a) to develop a mixture of skewt factor analyzer models and by Murray et al. (2014b) to develop a mixture of common skewt factor analyzers. A dimensional skewt random variable , with this formulation, has the density function
(10) 
where is the location parameter, is the scale parameter, is the skew parameter, is the degree of freedom parameter, and and are as defined in (9). We write to denote that the random variable follows the skewt distribution with the density in (10). Now, can be obtained through the relationship in (7) with , where denotes the inversegamma distribution. We have , and so, from Bayes’s theorem, .
2.3 The EM algorithm and its convergence criterion
The EM algorithm (Dempster et al., 1977) is an iterative algorithm for finding maximum likelihood estimates when data are incomplete or treated as such, and is widely used to estimate model parameters in the context of modelbased clustering. The Estep computes the expected value of the completedata loglikelihood given the current model parameters, and the Mstep maximizes this expected value with respect to the model parameters. After each E and Mstep, the loglikelihood is driven uphill, and the method iterates towards a maximum until some convergence criterion is satisified. Many variants of the EM algorithm have been proposed over the years, such as the expectationconditionalmaximization (ECM) algorithm (Meng and Rubin, 1993), the alternating ECM (AECM) algorithm (Meng and Van Dyk, 1997), and the FisherEM algorithm (Bouveyron and Brunet, 2012). Herein, we will make use of the EM algorithm for parameter estimation and a stopping criterion based on the Aitken acceleration (Aitken, 1926) is used to determine the convergence. The Aitken acceleration at iteration is
(11) 
where is the (observed) loglikelihood value at iteration . This yields an asymptotic estimate of the loglikelihood at iteration , given by
(Böhning et al., 1994; Lindsay, 1995), and the EM algorithm is stopped when , provided this difference is positive (McNicholas et al., 2010). Note that this criterion is at least as strict as the lack of progress criterion in the neighbourhood of a maximum.
2.4 Model selection
In modelbased clustering, a penalized loglikelihoodbased criterion is typically used to determine the “best” fitting model among a family of models. The most popular such criterion is the Bayesian information criterion (BIC; Schwarz, 1978), which can be motivated as an approximation to a Bayes factor (see Kass and Raftery, 1995; Kass and Wasserman, 1995). The BIC is defined as
(12) 
where is the maximum likelihood estimate of model parameters , is the maximized loglikelihood, is the number of free parameters, and is the number of observations. Some theoretical support for use of the BIC in mixture model selection is given by Leroux (1992) and Keribin (2000).
3 Methodology
3.1 Conventional GMM with Gaussian random effects
Suppose a longitudinal study features subjects and time points or measurement occasions. For subject (), let be a vector where represents the outcome on occasion (), let be an vector of observed timeinvariant covariates, let be a vector containing continuous latent variables, and note that has a multinomial distribution, where if individual is in class and otherwise. The conventional GMM with Gaussian random effects can be represented using a hierarchical threelevel formulation as follows.
At level 1 of the GMM, the continuous outcome variables are related to the continuous latent variables via
(13) 
for , where is a vector of residuals or measurement errors that is assumed to follow a multivariate normal distribution, i.e., , and is a design matrix consisting of factor loadings with each column corresponding to specific aspects of change. The matrix and the vector determine the growth trajectory of the model. For instance, when , , and is a matrix. Assuming are agerelated time scores centred at age , the design matrix is given by
At level 2 of the GMM, the continuous latent variables are related to the latent categorical variables and to the observed timeinvariant covariate vector by the relation
(14) 
where denotes the intercept parameter for class , is a dimensional vector of residuals assumed to follow a multivariate normal distribution , and is a parameter matrix representing the effect of on the latent continuous variables and assumed to be different among classes. Note that the level 2 errors are uncorrelated with the measurement errors . We may allow for classspecific effects in (14) that are equal across classes.
By combining the first two levels of the GMM, we have
(15) 
where is the th class probability or mixing proportion satisfying and , and is a multivariate Gaussian density with mean and covariance matrix . Notice that the GMM in (15) assumes that class probability is constant for each class.
At level 3 of the GMM, we assume that the class probabilities are not constant for each class, but depend on the observed covariates. In other words, we want to know how is related to an individual’s background variables, e.g., gender and income. At this level, the categorical latent variables represent membership of mixture components that are related to through a multinomial logit regression for unordered categorical responses. Define , i.e., the probability that subject falls into the th class depending on the covariates . Let and
Then,
(16) 
where is a vector of parameters and is a parameter matrix. By combining these three levels of the GMM, we have
(17) 
where is defined as in (15). Note that the right hand side of (17) is not a finite mixture model because the class probabilities are not constant with resect to .
3.2 GMM with generalized hyperbolic random effects
The conventional GMM assumes that the residuals and have multivariate Gaussian distribution with zero means and withinclass covariance matrices, respectively. We are interested in constructing a GMM with generalized hyperbolic distribution model errors, denoted by GHDGMM. The generalized hyperbolic distribution can be represented as a normal meanvariance mixture, where the mixing weight has a GIG distribution (see Section 2.1). To this end, we introduce a latent continuous variable with . Accordingly, conditional on and , we assume that model errors and are noncentered Gaussian error terms with distinct covariance matrices:
(18)  
(19) 
where is the diagonal covariance matrix for , and is the covariance matrix for . The dimensional vector is a vector of skewness parameters, which we refer to as the skewness parameter for the measurement errors. The dimensional vector is the vector of skewness parameters for the continuous latent variables . Then, based on (14) and (18), the observed random variables , conditional on , , and , follow a conditional Gaussian distribution of the form
(20) 
(21) 
and, from the preceding equations, we have the conditional distribution
(22) 
where and . From (8), we obtain the conditional distributions
(23)  
(24) 
By combining the preceding setup and level 3 of the GMM from Section 3.1, we arrive at a GMM with density
(25) 
where is the density of a dimensional random variable following a generalized hyperbolic distribution. Note that the overall skewness for is . Note also that, within this setup, the dependent observed variable , the latent growth factors , and residual variables and all have generalized hyperbolic distributions. Note that the distribution of the covariates is not modelled; please refer to Muthén and Asparouhov (2015) for detailed explanations.
3.3 GMM with multivariate skewt random effects
In this section, we are interested in extending the conventional GMM to have multivariate skewt distribution model errors, denoted by GSTGMM. As in the case for the generalized hyperbolic distribution, the formulation of the multivariate skewt distribution we use has a convenient representation as a normal meanvariance mixture; this time, the weight has an inversegamma distribution (see Section 2.2). In analogous fashion to the GHDGMM, a latent continuous random variable is first introduced, where . Accordingly, we assume that and are noncentered Gaussian error terms with their own covariance matrices as in (18) and (19), and and are conditionally normally distributed as in (20) and (21). Form this characterization of the multivariate skewt distribution, the following conditional distributions are obtained:
(26)  
(27) 
where and are as described above and is a concentration parameter (i.e., the degrees of freedom). Similarly, we arrive at a GMM with a multivariate skewt distribution
(28) 
In this setup, the random variable , the latent growth factors , and the residual variables and all follow multivariate skewt distributions.
3.4 Comments on the GHDGMM and GSTGMM
Recalling that the overall skewness for is , there are a total of skewness parameters in our GMM extensions. Hence, the skewness parameters and are subject to identifiability issues because no more than skewness parameters can be identified from dimensional . Therefore, two special formulations are considered in this paper. The first formulation is where . In this formulation, the residuals for or the measurement errors are not skewed, i.e., , and all of the skewness in the data is assumed to come from the distribution of latent factors. The second special formulation is the case where . In this formulation, the residuals for the latent factors are symmetric, i.e., . Accordingly, all of the skewness in the data is assumed to come from the residuals of or the measurement errors. In practice, we would want as much of the skewness as possible in the observed data to be explained through the latent factors. There appears to be no optimal strategy with respect to which skewness parameter to estimate. Accordingly, four statistical models, differing with respect to the distributions of measurement errors and random effects for the first two levels of the GMM, are employed and compared. These models are as follows:

Model I: A model with independent multivariate generalized hyperbolic random effects and measurement errors while assuming all of the skewness in the data comes from the distribution of latent factors (i.e., GHDGMM under ).

Model II: A model with independent multivariate generalized hyperbolic random effects and measurement errors while assuming all of the skewness in the data comes from the residuals of (i.e., GHDGMM under ).

Model III: A model with independent multivariate skewt random effects and measurement errors while assuming all of the skewness in the data comes from the distribution of latent factors (i.e., GSTGMM under ).

Model IV: A model with independent multivariate skewt random effects and measurement errors while assuming all of the skewness in the data comes from the residuals of (i.e., GSTGMM under ).
Take Model I (i.e., GHDGMM under ) as an example. For different trajectory classes, the parameters , and may be different across classes, or may be the same across the classes. By imposing constraints on all these parameters (different or the same across classes), we obtain a family of GHDGMM models. In this paper, we only consider two models, one model assumes that the parameters , and are different across classes, we call this model the general model. The second model assumes that only the parameter is different across classes while all the other parameters are the same across classes, i.e., , , , , , and for ; we call this model the most constrained model.
To this end, eight parameterizations in Table 1 are considered. Models II and IV allow a more general representation of the class skewness parameters (i.e., ). However, in terms of model complexity, Models II and IV have more parameters than Models I and III. Hence, Models II and IV need larger sample sizes as small class sizes can create problems, such as singularity of the covariance matrix and slow or nonconvergence of the EM algorithm. In addition, Model III is the most parsimonious and it may be useful when the number of classes is large.
Model  Dist.  Skewness  Number of free parameters  

Model I  General  GHD  Latent  
Constrained  GHD  Latent  
Model II  General  GHD  Observed  
Constrained  GHD  Observed  
Model III  General  Skewt  Latent  
Constrained  Skewt  Latent  
Model IV  General  Skewt  Observed  
Constrained  Skewt  Observed 
3.5 Parameter estimation
To fit the models, we adopt the wellknown EM algorithm. In our case, the missing data comprise the latent categorical variables , the latent growth factors , and the latent weight parameter . Therefore, the completedata consist of the observed outcome data , the covariates together with the , , and , and completedata likelihood is given by
where for GHDGMM and for GSTGMM.
In the Estep, we compute the expected value of the complete data loglikelihood, denoted , conditional on the current model parameters. Then, in the Mstep, we obtain the updated model parameters by maximizing . Detailed parameter updates for Models I, II and III are outlined in Appendix B.
4 Illustrations
4.1 Performance assessment
Although all of our illustrations are treated as genuine clustering analysis, i.e., no prior knowledge of labels is assumed, the true labels are known in each case and can be used to evaluate the performance of our GHDGMM and GSTGMM models. We use misclassification rates (ERR) and the adjusted Rand index (ARI; Hubert and Arabie, 1985) to assess classification performance. The ERR is simply the proportion of misclassified observations. The ARI indicates the pairwise agreement between true and predicted group memberships while also accounting for the fact that random classification would classify some observations correctly by chance. An ARI value of 1 indicates perfect classification, its expected value is 0 under random classification, and a negative ARI value indicates classification that is worse than one would expect under random classification. Further details and discussion of the ARI are given by Steinley (2004).
4.2 Alcoholic consumption data from the National Longitudinal Survey of Youth
4.2.1 The data
The National Longitudinal Survey of Youth (NLSY) is a longitudinal study conducted by the United States Bureau of Labor Statistics with the goal of understanding the interaction between labor force participation, education, and health behaviors in children and adolescents. The sample for this study was a cohort of children who were between the ages of 12 and 17 when first interviewed in 1997. The data of interest were gathered each year between 1997 and 2011 and again in 2013 (15 total possible interviews). Each respondent provided a number between 0 and 98 that represents the number of alcoholic drinks they typically consume on a given day on which they are drinking. Because we are interested in modeling drinking behaviour over the life span, the data are shifted from representing year of interview to age. We follow individuals who were first interviewed at age 16 until they are 19; all individuals with missing inputs are excluded and no covariates are adopted. To this end, 1151 observations with four time points (i.e., at ages 16, 17, 18, and 19) are used for the following analyses.
4.2.2 Model selection
We implement the Gaussian GMM via Mplus Version 7.1 (Muthén and Muthén, 2012). Our proposed GHDGMM and GSTGMM are implemented in R and run with until the best model is obtained under each scenario. Table 2 shows the results of fitting all of the models as aforementioned for a varying number of latent classes. The BIC values show that more than eight classes are needed with the conventional GMM, two are needed with constrained Models I and IV, and three are needed for all of the other models. The BIC values for the GSTGMM and GHDGMM are always better than the BIC for the normal GMM. Notably, the BIC values for the GHDGMM do not always improve on those for the GSTGMM. Among all fitted models, the threecluster general GSTGMM under (i.e., general Model III) is preferable according to the BIC. It is worth mentioning that, even though the skewt distribution is a special case of the generalized hyperbolic distribution, the GSTGMM seems to be useful in addition to the GHDGMM.
GMM–Normal (constrained)  GMM–Normal (general)  
Class  Loglikelihood  Free paras  BIC  Loglikelihood  Free paras  BIC 
1  –14983.42  9  –30030.27  –14983.42  9  –30030.27 
2  –14623.41  12  –29331.40  –12671.88  19  –25477.69 
3  –14330.00  15  –28765.72  –12233.95  29  –24672.31 
4  –14182.66  18  –28492.19  –12119.21  39  –24513.30 
5  –14076.42  21  –28300.85  –12027.60  49  –24400.58 
6  –14015.58  24  –28200.32  –11950.97  59  –24317.78 
7  –13980.78  27  –28151.86  –11906.09  69  –24298.53 
8  –13937.17  30  –28085.80  –11870.44  79  –24297.69 
9  –13916.40  33  –28065.38  
GHD–GMM (Model I, constrained)  GHD–GMM (Model I, general)  
Class  Loglikelihood  Free paras  BIC  Loglikelihood  Free paras  BIC 
1  –12403.20  13  –24898.19  –12403.28  13  –24898.19 
2  –12315.75  16  –24744.27  –12119.53  27  –24429.36 
3  –12315.50  19  –24764.91  –11958.92  41  –24206.82 
4  –11953.84  55  –24295.33  
GHD–GMM (Model II, constrained)  GHD–GMM (Model II, general)  
Class  Loglikelihood  Free paras  BIC  Loglikelihood  Free paras  BIC 
1  –12399.68  15  –24883.94  –12399.68  15  –24905.09 
2  –12312.27  18  –24737.32  –12166.47  31  –24551.45 
3  –12288.26  21  –24717.49  –12002.12  47  –24335.51 
4  –12287.98  24  –24745.12  –11956.47  63  –24356.99 
GST–GMM (Model III, constrained)  GST–GMM (Model III, general)  
Classes  Loglikelihood  Free paras  BIC  Loglikelihood  Free paras  BIC 
1  –12421.85  12  –24928.28  –12421.92  12  –24928.42 
2  –12352.31  15  –24810.34  –12151.6  25  –24479.41 
3  –12340.61  18  –24808.10  –11966.84  38  –24201.52 
4  –12348.28  21  –24844.58  –11925.67  51  –24210.82 
GST–GMM (Model IV, constrained)  GST–GMM (Model IV, general)  
Classes  Loglikelihood  Free paras  BIC  Loglikelihood  Free paras  BIC 
1  –12418.18  14  –24935.05  –12418.19  14  –24935.05 
2  –12348.01  17  –24748.06  –12118.12  29  –24440.64 
3  –12347.48  20  –24756.18  –11990.60  44  –24291.33 
4  –11938.08  59  –24292.00 
4.2.3 Interpretation of the best model
The bestfitting model, the threeclass Model III, breaks the data into three groups. From Table 3, it can be seen that Class 1 comprising 56% of the population, begins with lowmoderate drinking ( drink per drinking day), slightly increases during adolescence, and by age 19 the average drinks per drinking day is at about 1. These can be considered “consistent low” drinkers. Although the intercept for this class is heavily positively skewed (intercept skewness ), the slope is not (intercept skewness ), which indicates that the individual slopes are nearly normally distributed around the class slope of 0.21. The second class, comprising 24% of the population, are what will be called the “decreasing” drinkers. This class has an intercept of around five drinks per drinking day (a drinking binge) and ends at about 3 drinks per drinking day (just below the amount considered a drinking binge).^{1}^{1}1The World Health Organization defines heavy episodic drinking (also called a drinking “binge”) as the consumption of 60 or more grams of alcohol on one occasion (www.who.int/gho/alcohol/consumption_patterns/heavy_episodic_drinkers_text/en/), which is about four standard drinks (www.niaaa.nih.gov/alcoholhealth/overviewalcoholconsumption/what standarddrink). The intercept is again positively skewed (intercept skewness ) but the slope is negatively skewed (slope skewness ), suggesting that individuals in this class decrease their consumption quickly over the period of adolescence. The third class, comprising 20% of the population, will be called the “increasing moderate” drinkers. Their initial level of drinking is around 2.87 drinks per drinking day (less than a binge) and this increases during adolescence, ending at age 19 around 7 drinks per drinking day (far above a drinking binge). Both the slope and intercept are slightly positively skewed (intercept skewness , slope skewness ; see Table 3).
Class  
645  276  230  
0.56  0.24  0.20  
6.83  2.97  2.62  
These results suggest that, during adolescence, which is typically a time when alcohol consumption is initiated, individuals will have different reactions to the exposure to alcohol given their previous experience. Those individuals who are low drinkers will tend to continue to be low drinkers, those who have already consumed alcohol heavily will begin to taper back to safe levels (alluding to these individuals “knowing their limits” when it comes to alcohol), and those who are only at moderate levels tend to increase to heavy drinking. This model may be useful because for indicating which 15yearolds should be the target of interventions if the goal is to prevent heavy drinking in late adolescence. Although the high drinkers may appear to be the most likely to develop problems related to alcohol, they may “grow out” of their alcohol consumption; most especially, the 15yearolds that only drink at moderate levels should not be neglected.
Kerr et al. (2002) find similar patterns using three different youth surveys. They measure the stability of alcohol consumption over four time periods. Time 1 cohort are separated into three classes: abstainers, moderate drinkers, and heavy drinkers. They find a high proportion of abstainers continue to abstain and very few drink more than once or drink heavily. Moderate drinkers also show considerable stability in these samples, with 70% or more staying in the moderate category. Time 1 heavy drinkers are the least stable. In all three surveys, less than half of the heavy drinkers remain heavy drinkers. Our results also aligned better with those in Warner et al. (2007), who found three groups of adolescent drinking initiation: a large group with no or low drinking (our “consistent low” drinkers), a group that drank exclusively in adolescence and then decreased (analogous to our “decreasing” drinkers), and a group that started low and increased (analogous to our “increasing moderate” drinkers).
4.2.4 Partition study
Other models tend to find similar latent classes. For instance, the threeclass Model I (which has a very similar BIC to the threeclass skewt model) demonstrates a similar partition (Table 4). This suggests that the same pattern endures regardless of the distributional assumptions. However, the cluster proportions differ slightly (58%, 24%, and 18%, for the low, high/decreasing, and moderate/increasing classes, respectively), which seems to suggest that the GHD model classifies more individuals into the “low” class than the skewt model. If the goal of the analysis is to identify groups to target for interventions for the prevention of alcoholism, the proportions found in the skewt model might be preferred as they create population groups that are larger. Therefore, interventions targeting this group may have a greater impact on the population than those targeting a smaller group.
Model III, general  

Consistent low  Decreasing  Increasing moderate  
Model I, general  Consistent low  645  0  23 
Decreasing  0  276  0  
Increasing moderate  0  0  207 
4.3 Simulation studies
In addition to real data application of our proposed model, we perform simulation studies with data generated in a number of scenarios: linear and quadratic GMMs with different distributions of the measurement errors and random effects, resulting in four distinct simulated data examples (see Table 5 for details). Individual trajectories for these four simulation experiments are plotted in Figure 2.
Simulation  Model  Partition  

1  Model I  400  2  50  3  Overlapping  
2  Model II  800  2  5  2  Separated  
3  Model III  1500  3  20  2  Separated  
4  Model IV  1000  2  8  3  Overlapping 
We assess the performance of the family of nonelliptical GMMs in several different ways: Section 4.3.1 illustrates the ability of our proposed family of models to recover underlying parameters when the number of classes and the model are correctly specified; then we present a comparison of the proposed models on BIC, ARI, and ERR from the clustering result for each simulation in Section 4.3.2; last but not least, we compare our proposed family of nonelliptical GMMs with Gaussian GMM developed by Muthén and colleagues, which dominates the literature on GMMs in Section 4.3.3
4.3.1 Parameter recovery under the true model
First, we evaluate the ability of our proposed model to recover underlying parameters when the number of classes and the model are correctly specified. To this end, 100 datasets are generated for each of the four simulation experiments. True values and the means of the parameter estimates with their associated standard deviations are summarized in Tables 6–9. The results for each of the first three simulation experiments show that the means of all parameter estimates are close to the true values, with small standard deviations. The means of the last four elements of and are different from the true values due to the overlap between classes.
True values  Means  Standard deviations  
1  0.52  0.70  
2  1.05  1.28  
2  2.02  0.31  
3  2.92  0.51  
True values  Means  Standard deviations  
1  0.7  0.73  
2  1.05  0.85  
2  2.02  0.32  
3  3.11  0.58  
True values  Means  Standard deviations  

7  7.09  0.61  
5  4.97  0.41  
6  6.08  0.50  