A Distribution of \mbox{\boldmath$\eta$}_{i}~{}|~{}\mathbf{y}_{i},\mathbf{x}_{i},w_{ik},c_{ik}=1

# Extending Growth Mixture Models Using Continuous Non-Elliptical Distributions

## Abstract

Growth mixture models (GMMs) incorporate both conventional random effects growth modeling and latent trajectory classes as in finite mixture modeling; therefore, they offer a way to handle the unobserved heterogeneity between subjects in their development. GMMs with Gaussian random effects dominate the literature. When the data are asymmetric and/or have heavier tails, more than one latent class is required to capture the observed variable distribution. Therefore, a GMM with continuous non-elliptical distributions is proposed to capture skewness and heavier tails in the data set. Specifically, multivariate skew-t distributions and generalized hyperbolic distributions are introduced to extend GMMs. When extending GMMs, four statistical models are considered with differing distributions of measurement errors and random effects. The mathematical development of GMMs with non-elliptical distributions relies on their expression as normal variance-mean mixtures and the resultant relationship with the generalized inverse Gaussian distribution. Parameter estimation is outlined within the expectation-maximization framework before the performance of our GMMs with non-elliptical distributions is illustrated on simulated and real data.

## 1 Introduction

Many longitudinal studies focus on investigating how individuals change over time with respect to a characteristic that is measured repeatedly for each participant. Conventional random effect growth modeling has provided a number of tools for modeling intra-individual change and inter-individual differences in change (e.g., Laird and Ware, 1982; Bryk and Raudenbush, 1987, 1992; McArdle and Epstein, 1987; Singer and Willett, 2003), where within-class changes are described as a function of time and between-class changes are described by random effects and coefficients. Conventional random effects models provide a basis for formulating growth mixture models (GMMs) for longitudinal data. In the latent variable framework, a quadratic random effects growth model with a continuous outcome for individual at time , and time-invariant covariates , is specified according to

 yit =η0i+η1i(at−a0)+η2i(at−a0)2+ϵit, (1) η0i =α0+\boldmathγ′0xi+ζ0i, η1i =α1+\boldmathγ′1xi+ζ1i, (2) η2i =α2+\boldmathγ′2xi+ζ2i,

where are time scores () centred at , is the random intercept, is the random linear slope, is the quadratic growth rate, and and are normally distributed residuals, i.e., and . Formally, , , and are continuous latent variables (called the growth factors) representing the growth patterns, the are the mean parameters for the growth factors if there are no covariates , and the are the effects of covariates on the growth factors. In conventional growth modeling applications, the individual growth parameters (e.g., individual intercept and slope factors) are usually assumed to be identically distributed, i.e., drawn from a single homogeneous population. However, we are often interested in and deal with samples from multiple populations and, in most cases, the class memberships are either unknown or unobserved.

Simultaneous modeling of change over time and unobserved multiple populations (heterogeneity) in the data can be accommodated using GMMs and latent class growth analysis (LCGA), wherein parameters describing growth patterns are estimated and each individual’s most likely class membership is obtained via maximum a posteriori (MAP) probabilities. GMMs were introduced by Verbeke and Lesaffre (1996) and then extended by Muthén and Shedden (1999), Muthén (2004), and Muthén and Asparouhov (2008). For convenience, define so that if individual falls in class and otherwise. The quadratic growth model in (1) and (2) can be extended to a simple GMM in class () via

 Yit∣cik=1 =η0i+η1i(at−a0)+η2i(at−a0)2+ϵit, (3) η0i∣cik=1 =α0k+\boldmathγ′0xi+ζ0i, η1i∣cik=1 =α1k+\boldmathγ′1xi+ζ1i, (4) η2i∣cik=1 =α2k+\boldmathγ′2xi+ζ2i,

where parameters vary across classes to capture different trajectories; the parameters , and remain the same across classes but this could be relaxed to allow variation across classes with respect to how a covariate affects the growth factors; and and are still normally distributed residuals but with class-specific covariance matrices, i.e., and . The latent class growth analysis (LCGA) developed by Nagin and Land (Nagin and Land, 1993; Nagin, 1999, 2005) can be thought of as one special case of GMMs in the sense that it assumes zero within-class growth factor variances, i.e., for .

One common fundamental assumption for GMMs is that model errors are normally distributed. However, simulation studies in Bauer and Curran (2003) show that, when the data are drawn from a single non-Gaussian distribution, a two-class Gaussian GMM is preferred when fitting the data. In such cases, the within-class parameter estimates become uninterpretable because there are too many groups. Muthén and Asparouhov (2015) give an example of strongly non-normal outcomes, i.e., body mass index (BMI) development over age, and show that more than one latent class is required to capture the observed variable distribution — interpreting mixture components as subpopulations will lead to overestimation of the number of subpopulations. In general, non-elliptical distributions are used in multivariate analysis for the study of asymmetric data; such distributions can have a concentration parameter to account for heavy tailed data. Intuitively, in the two-dimensional case, the joint distribution forms a non-elliptical shape in the iso-density plot. Relaxing the normality assumption for asymmetry and skewness, Muthén and Asparouhov (2015) develop a GMM with a normally distributed model error and “classical” multivariate skew-t (cMST) random effects, i.e., and . An alternative specification of GMMs (called nonlinear mixed effect mixture models) was developed by Lu and Huang (2014) within a Bayesian framework, wherein is assumed to follow a cMST while remains normally distributed. Note that the “classical” formulation of the multivariate skew-t distribution is that given by Azzalini and Valle (1996) and Azzalini and Capitanio (1999). In this paper, we outline a more general extension of GMMs to the generalized hyperbolic distribution while also considering the formulation of the multivariate skew-t distribution that arises as its special and limiting case (Section 2). The advantage of the generalized hyperbolic distribution is its flexibility. Many other well-known distributions are special or limiting cases of the generalized hyperbolic distribution; please refer to McNeil et al. (2005) for details on a variety of limiting cases of the generalized hyperbolic distribution.

The remainder of this article is laid out as follows. In Section 2, we go through some important background material on the generalized hyperbolic distribution as well as a special and limiting case that gives a formulation of the multivariate skew-t distribution. In Section 3, we outline the extension of GMMs to generalized hyperbolic and multivariate skew-t distributions, respectively. Section 3.5 presents an expectation-maximization (EM) algorithm for obtaining maximum likelihood estimates of model parameters. Then, our approach is illustrated on simulated and real data (Section 4). The paper concludes with some discussion and suggestions for future work (Section 5).

## 2 Background

### 2.1 Generalized hyperbolic distribution

A multivariate generalized hyperbolic distribution arises from a multivariate mean-variance mixture where the weight function is the density of a generalized inverse Gaussian (GIG) distribution. The density of the GIG distribution is given by

 h(w∣ω,η,λ)=(w/η)λ−12ηKλ(ω)exp{−ω2(wη+ηw)} (5)

for , where is a scale parameter, is a concentration parameter, denotes the modified Bessel function of the third kind with index , and characterizes certain subclasses and considerably influences the size of tail weights. Write to denote that the random variable has the density in (5). The GIG distribution has some attractive features, including tractable expected values. Consider , then the following expected values hold:

 E[W] =ηKλ+1(ω)Kλ(ω), E[1/W] =1ηKλ+1(ω)Kλ(ω)−2λωη, (6) E[logW] =logη+1Kλ(ω)∂∂λKλ(ω).

Extensive details on GIG distribution can be found in Jørgensen (1982).

Browne and McNicholas (2015) show that a -dimensional generalized hyperbolic random variable can be generated using the relationship

 Y=\boldmathμ+W\boldmathβ+√WZ, (7)

where , and are -vectors that play the role of location and skewness parameters, respectively, and . From (7), it follows that . Now, recalling that and that the unconditional distribution of is a generalized hyperbolic, Bayes’ theorem gives

 W∣y∼GIG(ω+\boldmathβ′\boldmathΣ−1\boldmathβ,ω+δ(y,\boldmathμ∣\boldmathΣ),λ−p/2). (8)

Under this parameterization, a -dimensional multivariate generalized hyperbolic distribution has density

 Extra open brace or missing close brace (9)

with index parameter , concentration parameter , skewness parameter , mean vector , and scale matrix . Here, is the squared Mahalanobis distance between and , i.e., , and are modified Bessel functions of the third kind with indices and , respectively, and denotes the model parameters. Herein, let represent a -dimensional generalized hyperbolic random variable with density as per (9). Note that the parameterization in (9) is one of several available for multivariate generalized hyperbolic distributions (see McNeil et al., 2005; Browne and McNicholas, 2015).

There are a number of special and limiting cases that can be derived from the generalized hyperbolic distribution. However, the presence of the index parameter enables a flexibility that is not found in its special and limiting cases. Figure 1 illustrates the Gaussian distribution as well as a skew-t distribution with degrees of freedom and the generalized hyperbolic distribution for two different values of ; clearly, the different values of lead to very different densities.

### 2.2 Multivariate skew-t distribution

Several alternative formulations of the multivariate skew-t distribution have appeared in the literature, e.g., Azzalini and Valle (1996), Sahu et al. (2003), and Arellano-Valle and Genton (2005). Some recent discussion about some of these formulations is given by Azzalini et al. (2016). Muthén and Asparouhov (2015) developed a GMM with the SDB version of the restricted multivariate skew-t distribution, i.e., the version of Sahu et al. (2003). The formulation of the multivariate skew-t distribution used herein arises as a special and limiting case of the generalized hyperbolic distribution by setting and while also letting . This formulation of the multivariate skew-t distribution has been used by Murray et al. (2014a) to develop a mixture of skew-t factor analyzer models and by Murray et al. (2014b) to develop a mixture of common skew-t factor analyzers. A -dimensional skew-t random variable , with this formulation, has the density function

 Missing or unrecognized delimiter for \right (10)

where is the location parameter, is the scale parameter, is the skew parameter, is the degree of freedom parameter, and and are as defined in (9). We write to denote that the random variable follows the skew-t distribution with the density in (10). Now, can be obtained through the relationship in (7) with , where denotes the inverse-gamma distribution. We have , and so, from Bayes’s theorem, .

### 2.3 The EM algorithm and its convergence criterion

The EM algorithm (Dempster et al., 1977) is an iterative algorithm for finding maximum likelihood estimates when data are incomplete or treated as such, and is widely used to estimate model parameters in the context of model-based clustering. The E-step computes the expected value of the complete-data log-likelihood given the current model parameters, and the M-step maximizes this expected value with respect to the model parameters. After each E- and M-step, the log-likelihood is driven uphill, and the method iterates towards a maximum until some convergence criterion is satisified. Many variants of the EM algorithm have been proposed over the years, such as the expectation-conditional-maximization (ECM) algorithm (Meng and Rubin, 1993), the alternating ECM (AECM) algorithm (Meng and Van Dyk, 1997), and the Fisher-EM algorithm (Bouveyron and Brunet, 2012). Herein, we will make use of the EM algorithm for parameter estimation and a stopping criterion based on the Aitken acceleration (Aitken, 1926) is used to determine the convergence. The Aitken acceleration at iteration is

 a(s)=l(s+1)−l(s)l(s)−l(s−1), (11)

where is the (observed) log-likelihood value at iteration . This yields an asymptotic estimate of the log-likelihood at iteration , given by

 l(s+1)∞=l(s)+11−a(s)(l(s+1)−l(s))

(Böhning et al., 1994; Lindsay, 1995), and the EM algorithm is stopped when , provided this difference is positive (McNicholas et al., 2010). Note that this criterion is at least as strict as the lack of progress criterion in the neighbourhood of a maximum.

### 2.4 Model selection

In model-based clustering, a penalized log-likelihood-based criterion is typically used to determine the “best” fitting model among a family of models. The most popular such criterion is the Bayesian information criterion (BIC; Schwarz, 1978), which can be motivated as an approximation to a Bayes factor (see Kass and Raftery, 1995; Kass and Wasserman, 1995). The BIC is defined as

 BIC=2l(^\boldmathϑ)−ρlogn, (12)

where is the maximum likelihood estimate of model parameters , is the maximized log-likelihood, is the number of free parameters, and is the number of observations. Some theoretical support for use of the BIC in mixture model selection is given by Leroux (1992) and Keribin (2000).

## 3 Methodology

### 3.1 Conventional GMM with Gaussian random effects

Suppose a longitudinal study features subjects and time points or measurement occasions. For subject (), let be a vector where represents the outcome on occasion (), let be an vector of observed time-invariant covariates, let be a vector containing continuous latent variables, and note that has a multinomial distribution, where if individual is in class and otherwise. The conventional GMM with Gaussian random effects can be represented using a hierarchical three-level formulation as follows.

At level 1 of the GMM, the continuous outcome variables are related to the continuous latent variables via

 Yi∣(cik=1)=\boldmathΛy\boldmathηi+\boldmathϵi (13)

for , where is a vector of residuals or measurement errors that is assumed to follow a multivariate normal distribution, i.e., , and is a design matrix consisting of factor loadings with each column corresponding to specific aspects of change. The matrix and the vector determine the growth trajectory of the model. For instance, when , , and is a matrix. Assuming are age-related time scores centred at age , the design matrix is given by

 \boldmathΛy=⎛⎜ ⎜ ⎜ ⎜ ⎜⎝1a1−a0(a1−a0)21a2−a0(a2−a0)2⋮⋮⋮1aT−1−a0(aT−1−a0)2⎞⎟ ⎟ ⎟ ⎟ ⎟⎠.

At level 2 of the GMM, the continuous latent variables are related to the latent categorical variables and to the observed time-invariant covariate vector by the relation

 \boldmathηi∣(cik=1)=\boldmathαk+% \boldmathΓkxi+\boldmathζi, (14)

where denotes the intercept parameter for class , is a -dimensional vector of residuals assumed to follow a multivariate normal distribution , and is a parameter matrix representing the effect of on the latent continuous variables and assumed to be different among classes. Note that the level 2 errors are uncorrelated with the measurement errors . We may allow for class-specific effects in (14) that are equal across classes.

By combining the first two levels of the GMM, we have

 p(yi∣xi)=K∑k=1πkϕ(yi;\boldmathμk,\boldmathΣk), (15)

where is the th class probability or mixing proportion satisfying and , and is a multivariate Gaussian density with mean and covariance matrix . Notice that the GMM in (15) assumes that class probability is constant for each class.

At level 3 of the GMM, we assume that the class probabilities are not constant for each class, but depend on the observed covariates. In other words, we want to know how is related to an individual’s background variables, e.g., gender and income. At this level, the categorical latent variables represent membership of mixture components that are related to through a multinomial logit regression for unordered categorical responses. Define , i.e., the probability that subject falls into the th class depending on the covariates . Let and

 logit(\boldmathπi)=(log(πi1πiK),log(πi2πiK),…,log(πiK−1πiK))′.

Then,

 logit(\boldmathπi)=\boldmathαc+% \boldmathΓcxi, (16)

where is a -vector of parameters and is a parameter matrix. By combining these three levels of the GMM, we have

 p(yi∣xi)=K∑k=1πikϕ(yi;\boldmathμk,\boldmathΣk), (17)

where is defined as in (15). Note that the right hand side of (17) is not a finite mixture model because the class probabilities are not constant with resect to .

### 3.2 GMM with generalized hyperbolic random effects

The conventional GMM assumes that the residuals and have multivariate Gaussian distribution with zero means and within-class covariance matrices, respectively. We are interested in constructing a GMM with generalized hyperbolic distribution model errors, denoted by GHD-GMM. The generalized hyperbolic distribution can be represented as a normal mean-variance mixture, where the mixing weight has a GIG distribution (see Section 2.1). To this end, we introduce a latent continuous variable with . Accordingly, conditional on and , we assume that model errors and are non-centered Gaussian error terms with distinct covariance matrices:

 \boldmathϵi∣wik,cik =1∼N(wik\boldmathβyk,wik% \boldmathΘk), (18) \boldmathζi∣wik,cik =1∼N(wik\boldmathβηk,wikΨk), (19)

where is the diagonal covariance matrix for , and is the covariance matrix for . The -dimensional vector is a vector of skewness parameters, which we refer to as the skewness parameter for the measurement errors. The -dimensional vector is the vector of skewness parameters for the continuous latent variables . Then, based on (14) and (18), the observed random variables , conditional on , , and , follow a conditional Gaussian distribution of the form

 Yi∣\boldmathηi,wik,cik=1∼N(\boldmathΛy\boldmathηi+wik\boldmathβyk,wik\boldmathΘk). (20)

Based on (14) and (19),

 \boldmathηi∣xi,wik,cik=1∼N(\boldmathαk+\boldmathΓkxi+wik\boldmathβηk,wikΨk) (21)

and, from the preceding equations, we have the conditional distribution

 Yi∣xi,wik,cik=1∼N(% \boldmathμk+wik(\boldmathΛy\boldmathβ% ηk+\boldmathβyk),wik\boldmathΣk), (22)

where and . From (8), we obtain the conditional distributions

 \boldmathηi∣xi,cik=1 ∼GHDq(λk,ωk,\boldmathα% k+\boldmathΓkxi,Ψk,% \boldmathβηk), (23) Yi∣xi,cik=1 ∼GHDT(λk,ωk,\boldmathμk,\boldmathΣk,\boldmathΛy\boldmathβηk+\boldmathβyk). (24)

By combining the preceding setup and level 3 of the GMM from Section 3.1, we arrive at a GMM with density

 p(yi∣xi)=K∑k=1πikfGHDT(yi;λk,ωk,\boldmathμk,% \boldmathΣk,\boldmathΛy\boldmathβηk+\boldmathβyk), (25)

where is the density of a -dimensional random variable following a generalized hyperbolic distribution. Note that the overall skewness for is . Note also that, within this setup, the dependent observed variable , the latent growth factors , and residual variables and all have generalized hyperbolic distributions. Note that the distribution of the covariates is not modelled; please refer to Muthén and Asparouhov (2015) for detailed explanations.

### 3.3 GMM with multivariate skew-t random effects

In this section, we are interested in extending the conventional GMM to have multivariate skew-t distribution model errors, denoted by GST-GMM. As in the case for the generalized hyperbolic distribution, the formulation of the multivariate skew-t distribution we use has a convenient representation as a normal mean-variance mixture; this time, the weight has an inverse-gamma distribution (see Section 2.2). In analogous fashion to the GHD-GMM, a latent continuous random variable is first introduced, where . Accordingly, we assume that and are non-centered Gaussian error terms with their own covariance matrices as in (18) and (19), and and are conditionally normally distributed as in (20) and (21). Form this characterization of the multivariate skew-t distribution, the following conditional distributions are obtained:

 \boldmathηi∣xi,cik=1 ∼GSTq(\boldmathαk+\boldmathΓkxi,Ψk,\boldmathβηk,νk), (26) Yi∣xi,cik=1 ∼GSTT(\boldmathμk,\boldmathΣk,\boldmathβyk+\boldmathΛy% \boldmathβηk,νk), (27)

where and are as described above and is a concentration parameter (i.e., the degrees of freedom). Similarly, we arrive at a GMM with a multivariate skew-t distribution

 p(yi∣xi)=K∑k=1πikfGSTT(yi;\boldmathμk,\boldmathΣk,% \boldmathβyk+\boldmathΛy\boldmathβηk,νk). (28)

In this setup, the random variable , the latent growth factors , and the residual variables and all follow multivariate skew-t distributions.

### 3.4 Comments on the GHD-GMM and GST-GMM

Recalling that the overall skewness for is , there are a total of skewness parameters in our GMM extensions. Hence, the skewness parameters and are subject to identifiability issues because no more than skewness parameters can be identified from -dimensional . Therefore, two special formulations are considered in this paper. The first formulation is where . In this formulation, the residuals for or the measurement errors are not skewed, i.e., , and all of the skewness in the data is assumed to come from the distribution of latent factors. The second special formulation is the case where . In this formulation, the residuals for the latent factors are symmetric, i.e., . Accordingly, all of the skewness in the data is assumed to come from the residuals of or the measurement errors. In practice, we would want as much of the skewness as possible in the observed data to be explained through the latent factors. There appears to be no optimal strategy with respect to which skewness parameter to estimate. Accordingly, four statistical models, differing with respect to the distributions of measurement errors and random effects for the first two levels of the GMM, are employed and compared. These models are as follows:

• Model I: A model with independent multivariate generalized hyperbolic random effects and measurement errors while assuming all of the skewness in the data comes from the distribution of latent factors (i.e., GHD-GMM under ).

• Model II: A model with independent multivariate generalized hyperbolic random effects and measurement errors while assuming all of the skewness in the data comes from the residuals of (i.e., GHD-GMM under ).

• Model III: A model with independent multivariate skew-t random effects and measurement errors while assuming all of the skewness in the data comes from the distribution of latent factors (i.e., GST-GMM under ).

• Model IV: A model with independent multivariate skew-t random effects and measurement errors while assuming all of the skewness in the data comes from the residuals of (i.e., GST-GMM under ).

Take Model I (i.e., GHD-GMM under ) as an example. For different trajectory classes, the parameters , and may be different across classes, or may be the same across the classes. By imposing constraints on all these parameters (different or the same across classes), we obtain a family of GHD-GMM models. In this paper, we only consider two models, one model assumes that the parameters , and are different across classes, we call this model the general model. The second model assumes that only the parameter is different across classes while all the other parameters are the same across classes, i.e., , , , , , and for ; we call this model the most constrained model.

To this end, eight parameterizations in Table 1 are considered. Models II and IV allow a more general representation of the class skewness parameters (i.e., ). However, in terms of model complexity, Models II and IV have more parameters than Models I and III. Hence, Models II and IV need larger sample sizes as small class sizes can create problems, such as singularity of the covariance matrix and slow or non-convergence of the EM algorithm. In addition, Model III is the most parsimonious and it may be useful when the number of classes is large.

### 3.5 Parameter estimation

To fit the models, we adopt the well-known EM algorithm. In our case, the missing data comprise the latent categorical variables , the latent growth factors , and the latent weight parameter . Therefore, the complete-data consist of the observed outcome data , the covariates together with the , , and , and complete-data likelihood is given by

 Extra open brace or missing close brace

where for GHD-GMM and for GST-GMM.

In the E-step, we compute the expected value of the complete data log-likelihood, denoted , conditional on the current model parameters. Then, in the M-step, we obtain the updated model parameters by maximizing . Detailed parameter updates for Models I, II and III are outlined in Appendix B.

## 4 Illustrations

### 4.1 Performance assessment

Although all of our illustrations are treated as genuine clustering analysis, i.e., no prior knowledge of labels is assumed, the true labels are known in each case and can be used to evaluate the performance of our GHD-GMM and GST-GMM models. We use misclassification rates (ERR) and the adjusted Rand index (ARI; Hubert and Arabie, 1985) to assess classification performance. The ERR is simply the proportion of misclassified observations. The ARI indicates the pairwise agreement between true and predicted group memberships while also accounting for the fact that random classification would classify some observations correctly by chance. An ARI value of 1 indicates perfect classification, its expected value is 0 under random classification, and a negative ARI value indicates classification that is worse than one would expect under random classification. Further details and discussion of the ARI are given by Steinley (2004).

### 4.2 Alcoholic consumption data from the National Longitudinal Survey of Youth

#### The data

The National Longitudinal Survey of Youth (NLSY) is a longitudinal study conducted by the United States Bureau of Labor Statistics with the goal of understanding the interaction between labor force participation, education, and health behaviors in children and adolescents. The sample for this study was a cohort of children who were between the ages of 12 and 17 when first interviewed in 1997. The data of interest were gathered each year between 1997 and 2011 and again in 2013 (15 total possible interviews). Each respondent provided a number between 0 and 98 that represents the number of alcoholic drinks they typically consume on a given day on which they are drinking. Because we are interested in modeling drinking behaviour over the life span, the data are shifted from representing year of interview to age. We follow individuals who were first interviewed at age 16 until they are 19; all individuals with missing inputs are excluded and no covariates are adopted. To this end, 1151 observations with four time points (i.e., at ages 16, 17, 18, and 19) are used for the following analyses.

#### Model selection

We implement the Gaussian GMM via Mplus Version 7.1 (Muthén and Muthén, 2012). Our proposed GHD-GMM and GST-GMM are implemented in R and run with until the best model is obtained under each scenario. Table 2 shows the results of fitting all of the models as aforementioned for a varying number of latent classes. The BIC values show that more than eight classes are needed with the conventional GMM, two are needed with constrained Models I and IV, and three are needed for all of the other models. The BIC values for the GST-GMM and GHD-GMM are always better than the BIC for the normal GMM. Notably, the BIC values for the GHD-GMM do not always improve on those for the GST-GMM. Among all fitted models, the three-cluster general GST-GMM under (i.e., general Model III) is preferable according to the BIC. It is worth mentioning that, even though the skew-t distribution is a special case of the generalized hyperbolic distribution, the GST-GMM seems to be useful in addition to the GHD-GMM.

#### Interpretation of the best model

The best-fitting model, the three-class Model III, breaks the data into three groups. From Table 3, it can be seen that Class 1 comprising 56% of the population, begins with low-moderate drinking ( drink per drinking day), slightly increases during adolescence, and by age 19 the average drinks per drinking day is at about 1. These can be considered “consistent low” drinkers. Although the intercept for this class is heavily positively skewed (intercept skewness ), the slope is not (intercept skewness ), which indicates that the individual slopes are nearly normally distributed around the class slope of 0.21. The second class, comprising 24% of the population, are what will be called the “decreasing” drinkers. This class has an intercept of around five drinks per drinking day (a drinking binge) and ends at about 3 drinks per drinking day (just below the amount considered a drinking binge).1 The intercept is again positively skewed (intercept skewness ) but the slope is negatively skewed (slope skewness ), suggesting that individuals in this class decrease their consumption quickly over the period of adolescence. The third class, comprising 20% of the population, will be called the “increasing moderate” drinkers. Their initial level of drinking is around 2.87 drinks per drinking day (less than a binge) and this increases during adolescence, ending at age 19 around 7 drinks per drinking day (far above a drinking binge). Both the slope and intercept are slightly positively skewed (intercept skewness , slope skewness ; see Table 3).

These results suggest that, during adolescence, which is typically a time when alcohol consumption is initiated, individuals will have different reactions to the exposure to alcohol given their previous experience. Those individuals who are low drinkers will tend to continue to be low drinkers, those who have already consumed alcohol heavily will begin to taper back to safe levels (alluding to these individuals “knowing their limits” when it comes to alcohol), and those who are only at moderate levels tend to increase to heavy drinking. This model may be useful because for indicating which 15-year-olds should be the target of interventions if the goal is to prevent heavy drinking in late adolescence. Although the high drinkers may appear to be the most likely to develop problems related to alcohol, they may “grow out” of their alcohol consumption; most especially, the 15-year-olds that only drink at moderate levels should not be neglected.

Kerr et al. (2002) find similar patterns using three different youth surveys. They measure the stability of alcohol consumption over four time periods. Time 1 cohort are separated into three classes: abstainers, moderate drinkers, and heavy drinkers. They find a high proportion of abstainers continue to abstain and very few drink more than once or drink heavily. Moderate drinkers also show considerable stability in these samples, with 70% or more staying in the moderate category. Time 1 heavy drinkers are the least stable. In all three surveys, less than half of the heavy drinkers remain heavy drinkers. Our results also aligned better with those in Warner et al. (2007), who found three groups of adolescent drinking initiation: a large group with no or low drinking (our “consistent low” drinkers), a group that drank exclusively in adolescence and then decreased (analogous to our “decreasing” drinkers), and a group that started low and increased (analogous to our “increasing moderate” drinkers).

#### Partition study

Other models tend to find similar latent classes. For instance, the three-class Model I (which has a very similar BIC to the three-class skew-t model) demonstrates a similar partition (Table 4). This suggests that the same pattern endures regardless of the distributional assumptions. However, the cluster proportions differ slightly (58%, 24%, and 18%, for the low, high/decreasing, and moderate/increasing classes, respectively), which seems to suggest that the GHD model classifies more individuals into the “low” class than the skew-t model. If the goal of the analysis is to identify groups to target for interventions for the prevention of alcoholism, the proportions found in the skew-t model might be preferred as they create population groups that are larger. Therefore, interventions targeting this group may have a greater impact on the population than those targeting a smaller group.

### 4.3 Simulation studies

In addition to real data application of our proposed model, we perform simulation studies with data generated in a number of scenarios: linear and quadratic GMMs with different distributions of the measurement errors and random effects, resulting in four distinct simulated data examples (see Table 5 for details). Individual trajectories for these four simulation experiments are plotted in Figure 6.

We assess the performance of the family of non-elliptical GMMs in several different ways: Section 4.3.1 illustrates the ability of our proposed family of models to recover underlying parameters when the number of classes and the model are correctly specified; then we present a comparison of the proposed models on BIC, ARI, and ERR from the clustering result for each simulation in Section 4.3.2; last but not least, we compare our proposed family of non-elliptical GMMs with Gaussian GMM developed by Muthén and colleagues, which dominates the literature on GMMs in Section 4.3.3

#### Parameter recovery under the true model

First, we evaluate the ability of our proposed model to recover underlying parameters when the number of classes and the model are correctly specified. To this end, 100 datasets are generated for each of the four simulation experiments. True values and the means of the parameter estimates with their associated standard deviations are summarized in Tables 69. The results for each of the first three simulation experiments show that the means of all parameter estimates are close to the true values, with small standard deviations. The means of the last four elements of and are different from the true values due to the overlap between classes.