
An Adaptive LASSO-Penalized BIC

Abstract

Mixture models are becoming a popular tool for the clustering and classification of high-dimensional data. In such high-dimensional applications, model selection is problematic. The Bayesian information criterion, which is popular in lower-dimensional applications, tends to underestimate the true number of components in high dimensions. We introduce an adaptive LASSO-penalized BIC (ALPBIC) to mitigate this problem. The efficacy of the ALPBIC is illustrated via applications of parsimonious mixtures of factor analyzers. The selection of the best model by the ALPBIC is shown to be consistent with increasing numbers of observations, based on simulated and real data analyses.

1 Introduction

The idiom ‘model-based clustering’ refers to the application of mixture models for clustering. The first such application of mixture models for clustering dates back at least fifty years (Wolfe, 1963) and, until relatively recently, the Gaussian mixture model has dominated. Consider realizations $\mathbf{x}_1,\ldots,\mathbf{x}_n$ of a $p$-dimensional random variable $\mathbf{X}$ that follows a $G$-component finite Gaussian mixture model. The likelihood is given by $\mathcal{L}(\boldsymbol{\vartheta})=\prod_{i=1}^{n}\sum_{g=1}^{G}\pi_g\,\phi(\mathbf{x}_i\mid\boldsymbol{\mu}_g,\boldsymbol{\Sigma}_g)$, where the $\pi_g>0$, with $\sum_{g=1}^{G}\pi_g=1$, are the mixing proportions, $\phi(\mathbf{x}_i\mid\boldsymbol{\mu}_g,\boldsymbol{\Sigma}_g)$ is the multivariate Gaussian density with mean $\boldsymbol{\mu}_g$ and covariance matrix $\boldsymbol{\Sigma}_g$, and $\boldsymbol{\vartheta}$ denotes the model parameters. In model-based clustering applications, the expectation-maximization (EM) algorithm (Dempster et al., 1977) is most often used for parameter estimation. The EM algorithm is based on the complete-data likelihood, i.e., the likelihood of the observed data plus the unobserved data. In the context of model-based clustering using a Gaussian mixture model, the component labels are the unobserved data and the complete-data log-likelihood is given by $l_c(\boldsymbol{\vartheta})=\sum_{i=1}^{n}\sum_{g=1}^{G}z_{ig}\left[\log\pi_g+\log\phi(\mathbf{x}_i\mid\boldsymbol{\mu}_g,\boldsymbol{\Sigma}_g)\right]$, where the $z_{ig}$ denote the component membership labels, so that $z_{ig}=1$ if observation $i$ is in component $g$ and $z_{ig}=0$ otherwise.
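As a concrete illustration of these quantities, the short sketch below (Python with NumPy and SciPy; the function names are ours and are not taken from any package cited here) evaluates the E-step expectations $\hat{z}_{ig}=\mathbb{E}[Z_{ig}\mid\mathbf{x}_i]$ and the complete-data log-likelihood for a generic Gaussian mixture.

    import numpy as np
    from scipy.stats import multivariate_normal

    def responsibilities(X, pi, mu, Sigma):
        """E-step: expected memberships E[Z_ig | x_i] under current parameter values."""
        G = len(pi)
        dens = np.column_stack([
            pi[g] * multivariate_normal.pdf(X, mean=mu[g], cov=Sigma[g])
            for g in range(G)
        ])  # n x G array of pi_g * phi(x_i | mu_g, Sigma_g)
        return dens / dens.sum(axis=1, keepdims=True)

    def complete_data_loglik(X, z, pi, mu, Sigma):
        """l_c = sum_i sum_g z_ig [log pi_g + log phi(x_i | mu_g, Sigma_g)]."""
        G = len(pi)
        logdens = np.column_stack([
            np.log(pi[g]) + multivariate_normal.logpdf(X, mean=mu[g], cov=Sigma[g])
            for g in range(G)
        ])
        return float((z * logdens).sum())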

Typically, the number of components, $G$, is selected using a model selection criterion. The Bayesian information criterion (BIC; Schwarz, 1978) is by far the most popular choice (cf. Kass and Wasserman, 1995; Kass and Raftery, 1995; Dasgupta and Raftery, 1998) and is given by $\text{BIC}=2l(\hat{\boldsymbol{\vartheta}})-\rho\log n$, where $\hat{\boldsymbol{\vartheta}}$ is the MLE of $\boldsymbol{\vartheta}$, $l(\hat{\boldsymbol{\vartheta}})$ is the maximized log-likelihood, and $\rho$ is the number of free parameters. In addition to the selection of $G$, the BIC may also be used to select the number of latent variables, where relevant. In high-dimensional applications, the BIC tends to underestimate $G$. Bhattacharya and McNicholas (2014) addressed the problem by introducing a LASSO-penalized BIC (LPBIC), where the parameter estimates and the model selection criterion were derived by maximizing a penalized log-likelihood with a penalty akin to the LASSO (Tibshirani, 1996). The LPBIC is defined as

(1)

where the quantities entering (1) are, respectively, the number of estimated parameters that are non-zero, the number of non-zero elements of the estimated component means, the estimate of $\boldsymbol{\vartheta}$ obtained by maximizing the LASSO-penalized log-likelihood, the unit information matrix evaluated at that estimate, and the tuning parameter. Bhattacharya and McNicholas (2014) give several examples where the LPBIC outperforms the BIC in high-dimensional settings; however, the LPBIC is not without its drawbacks. Because the criterion is derived using a LASSO-penalized log-likelihood, it inherits the limitations of the LASSO. In particular, the criterion does not satisfy the oracle properties (cf. Fan and Li, 2001), i.e., it cannot simultaneously achieve consistency, sparsity, and asymptotic normality.
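For reference, here is a minimal sketch (Python/NumPy; the helper names are ours) of the baseline BIC under the $2l(\hat{\boldsymbol{\vartheta}})-\rho\log n$ convention used above, together with the free-parameter count for an unconstrained Gaussian mixture; this is offered purely as an illustration of the bookkeeping involved, not as the penalized criteria discussed in this paper.

    import numpy as np

    def bic(loglik, n_free_params, n):
        """BIC = 2 * l(theta_hat) - rho * log(n); larger is better under this convention."""
        return 2.0 * loglik - n_free_params * np.log(n)

    def gmm_free_params(G, p):
        """Free parameters in an unconstrained G-component, p-dimensional Gaussian
        mixture: (G - 1) mixing proportions, G*p means, and G*p*(p+1)/2 covariance terms."""
        return (G - 1) + G * p + G * p * (p + 1) // 2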

As an alternative, we propose a model selection criterion that is derived via a penalized log-likelihood, where the penalty is akin to the adaptive LASSO (Zou, 2006). The adaptive LASSO was proposed to mitigate the consistency problems of the LASSO, and Zou (2006) considered the case where the number of variables is fixed. Further progress on establishing the consistency of the adaptive LASSO in a sparse high-dimensional linear regression set-up was made by Huang et al. (2008), and Zhou et al. (2009) proved consistency in high-dimensional settings for Gaussian graphical modelling. Although these works mainly concern a sparse high-dimensional linear regression set-up, parallels can readily be drawn with mixture modelling. For example, Pan et al. (2006) used the adaptive LASSO procedure in mixture models, but their model selection criterion did not place any constraint on the tuning parameter or on the weights used in the penalty term, making the consistency of the criterion questionable. Pan and Shen (2007) and Pan et al. (2006) used two different penalization approaches, one with the LASSO and another with the adaptive LASSO; however, they used the same model selection criterion for both. The work presented herein shows that a model selection criterion derived from the adaptive LASSO approach should differ significantly from one derived from the conventional LASSO approach.

In deriving the estimates of the unknown parameters, we maximize the penalized log-likelihood

$l_P(\boldsymbol{\vartheta}) = l(\boldsymbol{\vartheta}) - \mathcal{P}(\boldsymbol{\vartheta}),$   (2)

where $\mathcal{P}(\boldsymbol{\vartheta})$ is an adaptive LASSO-like penalty. In particular, $\mathcal{P}(\boldsymbol{\vartheta})=\lambda_n\sum_{g=1}^{G}\sum_{j=1}^{p}w_{gj}\,|\mu_{gj}|$, where $\mu_{gj}$ is the $j$th element of $\boldsymbol{\mu}_g$, the $w_{gj}$ are the weights, and $\lambda_n$ is a tuning parameter that depends on $n$. The existing literature on the adaptive LASSO suggests different choices for the weights $w_{gj}$. The present paper uses the choice suggested by Zou (2006), i.e., $w_{gj}=1/|\tilde{\mu}_{gj}|^{\gamma}$ for some $\gamma>0$, where $\tilde{\mu}_{gj}$ is a $\sqrt{n}$-consistent estimate of $\mu_{gj}$ found by maximizing the (unpenalized) likelihood. In our theoretical discussions, we keep $\gamma$ general; when illustrating our method (Section 4), we set $\gamma=1$, which is the most popular choice. Zou (2006) and Pan et al. (2006) use weights of this form, but other choices may work better. For example, Zou (2006) noted that in high-dimensional cases an $L_2$-penalized (ridge) estimator can be a better choice for constructing the weights, and Zhou et al. (2009) proposed weights built from an initial LASSO estimate. Also, Meinshausen and Bühlmann (2006) and Zhang and Lu (2007) pointed out that estimation by maximizing the unpenalized likelihood is not always feasible in high dimensions; this is particularly true for regression problems, especially when there is strong collinearity among the covariates.
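A minimal sketch of the penalty and the Zou-style weights just described (Python/NumPy; the small constant guarding against division by zero is our own addition, not part of the method as stated):

    import numpy as np

    def adaptive_lasso_weights(mu_tilde, gamma=1.0, eps=1e-8):
        """Zou (2006)-style weights w_gj = 1 / |mu_tilde_gj|^gamma, computed from an
        unpenalized (root-n consistent) estimate mu_tilde of the component means."""
        return 1.0 / (np.abs(mu_tilde) + eps) ** gamma

    def adaptive_lasso_penalty(mu, weights, lam):
        """Adaptive LASSO-like penalty: lambda_n * sum_g sum_j w_gj * |mu_gj|."""
        return lam * np.sum(weights * np.abs(mu))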

Herein, we develop and illustrate our ALPBIC for parsimonious mixtures of factor analyzers models. We choose these models because they are well suited to the analysis of high-dimensional data and because model selection is more involved than just selecting the number of components. Ghahramani and Hinton (1997) introduced a mixture of factor analyzers model, which was further developed by Tipping and Bishop (1999) and McLachlan and Peel (2000). By imposing constraints on the covariance structure, McNicholas and Murphy (2008, 2010) developed a family of parsimonious Gaussian mixture models (PGMMs) based on the mixture of factor analyzers model. The mixture of factor analyzers model has the density of a Gaussian mixture model with $\boldsymbol{\Sigma}_g=\boldsymbol{\Lambda}_g\boldsymbol{\Lambda}_g'+\boldsymbol{\Psi}_g$, where $\boldsymbol{\Lambda}_g$ is a $p\times q$ matrix of factor loadings, $\boldsymbol{\Psi}_g$ is a diagonal matrix, $p$ is the dimensionality of the data, and $q$ is the dimensionality of the latent factors. The PGMM family arises by imposing, or not, the constraints $\boldsymbol{\Lambda}_g=\boldsymbol{\Lambda}$, $\boldsymbol{\Psi}_g=\boldsymbol{\Psi}$, and $\boldsymbol{\Psi}_g=\psi_g\mathbf{I}_p$ on the component covariance matrices $\boldsymbol{\Sigma}_g$. We show that, when $p$ is large relative to $n$, the ALPBIC generally outperforms the LPBIC and the BIC in the PGMM setting, in terms of both consistently selecting the right number of components and more accurately classifying observations into components.
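To illustrate the constrained covariance structure (a sketch in our notation, not code from any PGMM implementation; the isotropic option below simply averages the diagonal of $\boldsymbol{\Psi}_g$ as a crude stand-in), the following builds $\boldsymbol{\Sigma}_g=\boldsymbol{\Lambda}_g\boldsymbol{\Lambda}_g'+\boldsymbol{\Psi}_g$ under the three constraints:

    import numpy as np

    def pgmm_covariances(Lambdas, Psis, constrain_loadings=False,
                         constrain_psi=False, isotropic=False):
        """Build component covariances Sigma_g = Lambda_g Lambda_g' + Psi_g.

        Lambdas: list of G loading matrices, each p x q.
        Psis:    list of G length-p vectors holding the diagonal of Psi_g.
        constrain_loadings: Lambda_g = Lambda (common loadings across components).
        constrain_psi:      Psi_g = Psi (common Psi across components).
        isotropic:          Psi_g = psi_g * I (illustrated by averaging the diagonal).
        """
        G, p = len(Lambdas), Lambdas[0].shape[0]
        Sigmas = []
        for g in range(G):
            Lam = Lambdas[0] if constrain_loadings else Lambdas[g]
            psi = Psis[0] if constrain_psi else Psis[g]
            if isotropic:
                psi = np.full(p, psi.mean())
            Sigmas.append(Lam @ Lam.T + np.diag(psi))
        return Sigmas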

2 Methodology

Recall that we observe $\mathbf{x}_1,\ldots,\mathbf{x}_n$, where each $\mathbf{x}_i$ is $p$-dimensional, and that the penalized log-likelihood has the form given in (2).

Here we make two assumptions. First, because the penalty function is non-concave and singular at the origin, it can be locally approximated by a quadratic function, as suggested by Fan and Li (2001). The parameters are estimated by successive iterations; suppose $\hat{\mu}_{gj}^{(k)}$ is the estimate of $\mu_{gj}$ after $k$ iterations. Then the penalty can be locally approximated as

(3)

where $d_g$ is the number of non-zero elements in $\hat{\boldsymbol{\mu}}_g^{(k)}$. Second, we assume that the marginal distribution of the mixing proportions is uniform on the simplex and that the remaining component parameters have prior $N\!\left(\hat{\boldsymbol{\theta}}_g,\,\mathcal{I}^{-1}(\hat{\boldsymbol{\theta}}_g)\right)$, for $g=1,\ldots,G$, where $\hat{\boldsymbol{\theta}}_g$ is the MLE derived by maximizing the penalized likelihood and $\mathcal{I}(\hat{\boldsymbol{\theta}}_g)$ is the unit information matrix at $\hat{\boldsymbol{\theta}}_g$.

With these assumptions, parameter estimation is carried out using an alternating expectation-conditional maximization (AECM) algorithm (Meng and van Dyk, 1997). The AECM algorithm allows a different specification of the complete-data at each stage. At the first stage, the mixing proportions and component means are estimated and, at the second stage, the elements of the component covariance matrices (i.e., the factor loadings and the diagonal matrices $\boldsymbol{\Psi}_g$) are estimated. The two stages are alternated iteratively until convergence; see McNicholas and Murphy (2008) for extensive details on the application of the AECM algorithm to the PGMM models.
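To make the alternating structure concrete, here is a deliberately simplified sketch of such a two-stage loop (Python/NumPy/SciPy). For brevity, stage 2 updates a plain diagonal covariance rather than the factor-analyzer structure $\boldsymbol{\Lambda}_g\boldsymbol{\Lambda}_g'+\boldsymbol{\Psi}_g$, and stage 1 uses the unpenalized mean update, so this illustrates only the alternation, not the actual PGMM/AECM updates of McNicholas and Murphy (2008) or the penalized update derived below.

    import numpy as np
    from scipy.stats import multivariate_normal

    def e_step(X, pi, mu, Sigma):
        """Expected component memberships under the current parameter values."""
        dens = np.column_stack([
            pi[g] * multivariate_normal.pdf(X, mean=mu[g], cov=Sigma[g])
            for g in range(len(pi))
        ])
        return dens / dens.sum(axis=1, keepdims=True)

    def fit_two_stage(X, G, n_iter=50, seed=0):
        """Alternate two stages: (1) update pi and mu; (2) update the covariances."""
        rng = np.random.default_rng(seed)
        n, p = X.shape
        pi = np.full(G, 1.0 / G)
        mu = [X[i].copy() for i in rng.choice(n, size=G, replace=False)]  # random starts
        Sigma = [np.eye(p) for _ in range(G)]
        for _ in range(n_iter):
            # Stage 1: E-step, then conditional maximization over pi and mu.
            z = e_step(X, pi, mu, Sigma)
            ng = z.sum(axis=0)
            pi = ng / n
            mu = [(z[:, [g]] * X).sum(axis=0) / ng[g] for g in range(G)]
            # (The penalized algorithm would soft-threshold these means; see below.)
            # Stage 2: E-step again, then conditional maximization over the covariances.
            z = e_step(X, pi, mu, Sigma)
            ng = z.sum(axis=0)
            Sigma = [np.diag((z[:, [g]] * (X - mu[g]) ** 2).sum(axis=0) / ng[g] + 1e-6)
                     for g in range(G)]
        return pi, mu, Sigma, z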

At the first stage of the AECM algorithm, the complete-data comprise the observed $\mathbf{x}_1,\ldots,\mathbf{x}_n$ and the component labels $z_{ig}$. The estimation of the mixing proportions involves some additional complexity for a penalized likelihood. However, Bhattacharya and McNicholas (2014) note that, for practical applications, the difference between the analytical estimate and the estimate derived by the conventional EM algorithm for an unpenalized likelihood is negligible. Hence, the mixing proportions are estimated via $\hat{\pi}_g=\frac{1}{n}\sum_{i=1}^{n}\hat{z}_{ig}$, where

$\hat{z}_{ig}$ denotes the expected value of $Z_{ig}$ conditional on the current parameter estimates. For the mean parameters, we differentiate the expected value of the penalized complete-data log-likelihood with respect to $\boldsymbol{\mu}_g$ and equate the result to zero, i.e.,

(4)

where the penalty contributes a vector with $p$ elements, its $j$th element being proportional to $w_{gj}\operatorname{sign}(\mu_{gj})$. Now, writing $\tilde{\boldsymbol{\mu}}_g$ for the solution of (4) when no penalty term is present, (4) can be written

(5)

where $\tilde{\mu}_{gj}$ is the update of $\mu_{gj}$ if no penalty term were involved. Furthermore, imposing an adaptive LASSO penalty on the mean components implies that the estimate for the non-penalized case, i.e., $\tilde{\mu}_{gj}$, is shrunken towards zero, leading to a sparse solution. Therefore, the new estimate $\hat{\mu}_{gj}$ is either zero or takes a value lying between zero and $\tilde{\mu}_{gj}$. In other words, either $\hat{\mu}_{gj}=0$ or $\hat{\mu}_{gj}$ has the same sign as $\tilde{\mu}_{gj}$, because the shrinkage term is positive. Thus, if $\hat{\mu}_{gj}\neq 0$, then

(6)

Using (5) in (6) together with the sparsity described above, and noting that the resulting shrinkage term is non-negative, equation (5) and the above arguments lead to the following estimate of $\mu_{gj}$:

where, for any real number $a$, $(a)_{+}=a$ if $a>0$ and $(a)_{+}=0$ otherwise.
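The resulting update is a component-wise soft-thresholding of the unpenalized mean update. A minimal sketch (Python/NumPy; the threshold argument stands in for whatever shrinkage amount the tuning parameter and weights imply, which we do not reproduce exactly here):

    import numpy as np

    def soft_threshold(mu_tilde, threshold):
        """Soft-thresholding: sign(mu_tilde) * (|mu_tilde| - threshold)_+ .

        mu_tilde:  unpenalized mean update(s), scalar or array.
        threshold: non-negative shrinkage amount implied by the adaptive LASSO
                   penalty (e.g., proportional to lambda_n * w_gj); assumed given.
        """
        return np.sign(mu_tilde) * np.maximum(np.abs(mu_tilde) - threshold, 0.0)

    # Small estimates are set exactly to zero; larger ones are shrunk towards zero
    # while keeping their sign.
    print(soft_threshold(np.array([-2.0, -0.1, 0.05, 1.5]), 0.3))  # [-1.7 -0.  0.  1.2]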

At the second stage of the AECM algorithm, the complete-data comprise the observed $\mathbf{x}_1,\ldots,\mathbf{x}_n$, the group labels $z_{ig}$, and the unobserved latent factors. At this stage, we estimate the elements of the component covariance matrices under the PGMM set-up; details are given by McNicholas and Murphy (2008, 2010).

To derive a model selection criterion from the penalized log-likelihood we apply the same method as used by Bhattacharya and McNicholas (2014). We maximize (2). Using (3), the second term of (2) becomes

where $d_g$ is the number of non-zero mean components in component $g$. We assume that, for a given model, the parameters for any two components are independent. Hence, using the Weak Law of Large Numbers, the penalized criterion is

(7)

where the count of parameters entering (7) is the number of estimated parameters that are non-zero. The derivation of (7) is discussed in detail in Appendix A.

3 Asymptotic Properties

The ALPBIC is a consistent model selection criterion, subject to conditions imposed on the tuning parameter $\lambda_n$. To prove consistency, we use some of the arguments proposed by Khalili and Chen (2007), who originally studied the asymptotic behaviour within a low-dimensional framework; accordingly, we modify their results for high-dimensional cases. Suppose the true parameter set is decomposed into a sub-vector of non-zero effects and a sub-vector containing only the zero effects, and suppose any estimated parameter that is sufficiently close to the truth is decomposed in the same way. Then, to satisfy consistency, the necessary criteria are that, as $n\to\infty$, the estimates of the zero effects converge to zero in probability and the estimates of the non-zero effects are consistent. Thus, the criterion should choose the non-zero effects as it would if the true number of clusters and the true parameters were known. Based on this idea, we assume some regularity conditions, as stated below:

  1. The information matrix is finite and positive-definite for all parameter values under consideration.

  2. For all satisfying , there exist finite real numbers and (possibly depending on ) such that

    with

  3. For all satisfying ,

    (8)

The asymptotic properties of the ALPBIC are given in the following theorem.

  • Theorem. Suppose the relevant nuisance quantities are known and let the weights be as defined in Section 2. If the tuning parameter satisfies two rate conditions (see the discussion following the theorem), then we have the following:

    (a) There exists a local maximizer of the ALPBIC that converges to the true parameter at the stated rate.

    (b) For such a local maximizer, with probability tending to 1, the estimated zero effects are exactly zero, provided the stated condition on the tuning parameter holds for sufficiently large $n$.

    (c) The estimated non-zero effects are asymptotically normal, with asymptotic covariance involving the Fisher information computed under the reduced model when all zero effects are removed; the relevant dimension is the number of non-zero elements in the $g$th component of the true parameter.

    If the nuisance quantities above are unknown and are estimated consistently and separately, then the results in Parts a, b, and c still hold when those estimates are subsequently used in the model selection procedure.

Part a of the above theorem guarantees the existence of a consistent local maximizer. Part b establishes the sparsity condition and Part c the asymptotic normality condition. Together, these results constitute the oracle property. To prove that the oracle property is satisfied, we impose two conditions on the tuning parameter. As an aside, this can be regarded as an explanation of why the LASSO cannot satisfy the oracle property: for the LASSO, the weights are constant (i.e., $\gamma=0$) and so the two conditions cannot be satisfied simultaneously. Another observation worth noting is that, generally, for high-dimensional settings we take $p$ to grow with $n$, and we do so in the proof of the asymptotic properties in Appendix B. However, the same results hold for fixed $p$; the proof for fixed $p$ in a regression setting is detailed by Zou (2006) and Huang et al. (2008).
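For intuition, the two conditions in the fixed-$p$ regression setting of Zou (2006) take the following form; we quote them here as a reference point rather than as the exact conditions of the theorem above.

    \[
      \frac{\lambda_n}{\sqrt{n}} \to 0
      \qquad \text{and} \qquad
      \lambda_n\, n^{(\gamma-1)/2} \to \infty .
    \]

With the ordinary LASSO the weights are constant, which corresponds to $\gamma=0$; the second condition then reads $\lambda_n/\sqrt{n}\to\infty$ and contradicts the first. For the adaptive LASSO with, say, $\gamma=1$, a choice such as $\lambda_n=n^{1/4}$ satisfies both.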

4 Data Analysis

4.1 Overview

We analyze two data sets and compare the performance of the ALPBIC with the LPBIC, the BIC, the AIC (Akaike, 1974), and the CAIC (Bozdogan, 1987) for the PGMM family. The first is a high-dimensional simulated data set and the second is a real high-dimensional data set. For the first data set, we keep $p$ large and increase $n$ to study the effect on the consistency of the model selection criteria under consideration. The performance of a criterion is measured in terms of choosing the correct number of components and latent factors, and the correct covariance structure, as well as the resulting classification agreement. We use the adjusted Rand index (ARI; Rand, 1971; Hubert and Arabie, 1985) to measure the latter. An ARI value of 1 indicates perfect agreement and a value of 0 would be expected under random classification. Steinley (2004) provides extensive details on the ARI and its properties. Note that the tuning parameters are set separately for the ALPBIC and for the LPBIC.
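The ARI is available in standard software; for example, scikit-learn's adjusted_rand_score in Python (used here purely for illustration):

    from sklearn.metrics import adjusted_rand_score

    true_labels      = [0, 0, 0, 1, 1, 1, 2, 2, 2]
    predicted_labels = [0, 0, 1, 1, 1, 1, 2, 2, 0]

    # ARI = 1 for perfect agreement; its expected value is 0 under random labelling.
    print(adjusted_rand_score(true_labels, predicted_labels))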

4.2 Simulated Data

We generate $p$-dimensional Gaussian mixture data with three components. The component densities are given different covariance structures: the first is isotropic, the second is diagonal, and the third has a full covariance matrix. Simulations are run in two steps. In the first step, we simulate three data sets, using increasing values of $n$ and setting the ratio of the numbers of observations in the three groups to be the same in each case. The PGMM family was applied to these data over ranges of $G$ and $q$, using the ALPBIC, the LPBIC, and the BIC, respectively, for model selection. In the second step, we run 25 simulations for each value of $n$, varying $n$ as before but also varying the relative sizes of the clusters for each $n$. We then compare the performance of the ALPBIC with the LPBIC, the BIC, the AIC, and the CAIC in this scenario, to study how the model selection criteria behave for varying $n$ and varying relative component sizes.
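A sketch of this kind of data-generation step (Python/NumPy; the dimension, sample size, means, and mixing proportions below are placeholders, not the values used in the study):

    import numpy as np

    rng = np.random.default_rng(1)
    p, n = 50, 300                        # placeholder dimension and sample size
    props = np.array([0.4, 0.3, 0.3])     # placeholder mixing proportions
    mus = [np.zeros(p), np.full(p, 1.5), np.full(p, -1.5)]   # placeholder means

    # Three covariance structures: isotropic, diagonal, and full.
    Sigma1 = np.eye(p)                                        # isotropic
    Sigma2 = np.diag(rng.uniform(0.5, 2.0, size=p))           # diagonal
    A = rng.normal(size=(p, p))
    Sigma3 = A @ A.T / p + 0.5 * np.eye(p)                    # full, positive-definite
    Sigmas = [Sigma1, Sigma2, Sigma3]

    # Draw component labels, then draw each observation from its component density.
    labels = rng.choice(3, size=n, p=props)
    X = np.vstack([rng.multivariate_normal(mus[g], Sigmas[g]) for g in labels])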

The results for the first step of the experiment (Table 1) show that the ALPBIC consistently chooses the correct number of components, a stable number of latent factors, and the CUU covariance structure as $n$ gets larger. The LPBIC, on the other hand, consistently chooses the correct number of components but fails to consistently choose the number of factors or the covariance structure. The BIC, however, performs the worst: it fails to select the correct number of components. In fact, the BIC lives up to form by underestimating $G$ for each value of $n$; unsurprisingly, the associated ARI values are low.

ALPBIC    LPBIC    BIC
Model     Model    Model
CUU       CUC      CCC
CUU       CUC      CUC
CUU       CUU      CUC
CUU       CUU      CUC
Table 1: Best model chosen by the ALPBIC, LPBIC, and BIC for the high-dimensional simulated data; rows correspond to increasing values of $n$ and show the selected covariance structure.

In the second step, we vary $n$ as well as the relative sizes of the clusters, and generate 25 simulations for each set of parameter values. For each $n$, three choices of relative component sizes are considered: 4:3:3, 3:4:3, and 3:3:4. In each of these 12 settings, the ratios of component sizes are chosen such that the proportion of data points is higher in one component and the remaining data are equally divided between the other two components. This is done to generate different covariance structures for varying sample sizes. We compare the performance of the ALPBIC, the LPBIC, the BIC, the AIC, and the CAIC in each situation. First, we evaluate the ALPBIC on its ability to detect the correct number of components, comparing its performance with the other methods and using the same ranges of $G$ and $q$ as before. The left-hand plot of Figure 1 shows that, while $n$ and the relative cluster sizes vary, the ALPBIC is the most consistent in choosing the correct number of clusters. Out of 25 simulations in each case, the number of correct selections of $G$ is highest for the ALPBIC, closely followed by the LPBIC and the CAIC. The ALPBIC has the highest number of correct selections of $G$ in 10 of the 12 settings. The AIC and BIC are the poorest performers, although the latter's performance improves as $n$ increases.

Because we do not know the correct $q$, it is difficult to study the performance of the ALPBIC in selecting $q$. However, the results show that the ALPBIC chooses a particular value of $q$ most consistently. Specifically, out of 25 simulations in each of the 12 settings, the ALPBIC chooses this value of $q$ an average of 72% of the time, while the LPBIC does so an average of only 43% of the time. The BIC and AIC show the lowest consistency in this respect; the average rate is 12% for both criteria. The CAIC gives the performance most comparable to the ALPBIC, choosing this value of $q$ an average of 61% of the time. Although we do not know the true value of $q$, this value seems a reasonable guess, based both on this experiment and on the previous experiment (cf. Table 1).

We also compare the classification performance of these five criteria across the various scenarios. The right-hand plot of Figure 1 shows the average classification accuracy (ARI) over 25 simulations for each setting and each criterion. The ALPBIC gives consistent classification results, with the average ARI usually lying between 0.7 and 0.85 (with one setting as the exception). For one setting, the LPBIC outperforms the ALPBIC, and for another, the CAIC does better. However, the LPBIC and CAIC generally return lower average ARI values and are less consistent (cf. Figure 1). The BIC and AIC, on the other hand, do not fluctuate much but have low average ARI values.

      
Figure 1: Performance of the ALPBIC, LPBIC, BIC, AIC, and CAIC over 25 simulations with varying $n$ and varying cluster sizes. The left-hand plot shows the number of times (out of 25 simulations) each criterion selects the correct number of components; the right-hand plot shows the average ARI (over 25 simulations) of the models selected by each criterion.

4.3 Leukaemia Data

Golub et al. (1999) presented data on two forms of acute leukaemia: acute lymphoblastic leukaemia (ALL) and acute myeloid leukaemia (AML). Affymetrix arrays were used to collect measurements for 7,129 genes on 72 tissues; there were 47 ALL tissues and 25 AML tissues in total. McNicholas and Murphy (2010) reduced the number of genes to 2,030 prior to analysis, and we analyze these 2,030 genes using 20 different random starts for the initial component memberships. We run our approach over ranges of values of $G$ and $q$.

            Model   ARI
ALPBIC      CUC     0.51
LPBIC       CUC     0.47
BIC         CCU     0.29

            ALPBIC          LPBIC
            1       2       1       2
ALL         41      4       39      3
AML         6       21      8       22
Table 2: Performance of the ALPBIC, LPBIC, and BIC for the leukaemia data. The left-hand table gives the selected model and associated ARI for each criterion. The right-hand table gives cross-tabulations of the predicted classifications against the true labels for the models selected by the ALPBIC and the LPBIC, respectively.

Table 2 summarizes the performance of the three model selection criteria. The ALPBIC and LPBIC are comparable in terms of model selection and classification agreement, as shown in Table 2 (a). Both the ALPBIC and the LPBIC choose a CUC model with two components; however, they select different numbers of latent factors. The BIC chooses a CCU model. The ARI for the model chosen by the ALPBIC is the highest (0.51), and only 10 samples are misclassified (Table 2 (b)). The model chosen by the LPBIC gives slightly worse classification performance (ARI of 0.47 and 11 misclassifications; cf. Table 2 (a) and (b)). The model selected by the BIC exhibits much worse classification performance, with an associated ARI of 0.29.

5 Summary

A model selection criterion, the ALPBIC, is proposed for mixture model selection in high-dimensional data applications. The ALPBIC is developed through a penalized likelihood-based approach, where the penalty term is akin to the adaptive LASSO. The ALPBIC is especially designed for high-dimensional data, where the BIC has known drawbacks. By penalizing the likelihood with an adaptive LASSO-type penalty, the resulting model selection criterion is consistent, meaning that it can consistently select the right number of components and the right covariance structure as the number of observations increases. The criterion also satisfies the oracle property, which means the estimated parameters of the best model selected by the criterion satisfy consistency, sparsity, and asymptotic normality.

We used simulations to illustrate the consistency of the ALPBIC. These simulations also illustrate that, while the LPBIC is a useful criterion for large $p$, it suffers from a lack of consistency, i.e., it cannot consistently choose the same model for varying $n$. This is because the LPBIC is derived through penalization of the likelihood with the LASSO penalty, which fails to satisfy the oracle properties; this drawback is resolved herein by using the adaptive LASSO.

Although we used the PGMM family to illustrate our approach, the ALPBIC can be applied to any mixture model selection problem. A likelihood-based approach that also penalizes the covariance parameters is a logical extension of the present work, as it would lead to greater parsimony. The main obstacle to doing so is the complex sampling distribution of the off-diagonal elements of the component covariance matrices. We aim to mitigate this problem by first extending the penalization to diagonal covariance matrices, before further extending it to a full covariance matrix by imposing a Wishart distribution on its maximum likelihood estimator.

Appendix A Derivation of ALPBIC

The ALPBIC is derived in a way similar to the derivation of the LPBIC (Bhattacharya and McNicholas, 2014). To derive the ALPBIC, we have to maximize (2). Using (3), the second term becomes

where $d_g$ is the number of non-zero mean components in class $g$. Under the assumption made in Section 2, the dependence among the terms is at most finite and the Weak Law of Large Numbers holds. For large $n$, the number of observations contributing to each component is large, and so

Thus the second term becomes

A Taylor series expansion of the first term gives

where the matrix appearing here is the second derivative matrix of the log-likelihood. Since the estimate is derived by maximizing the penalized likelihood, the second term within the integral is

where the LASSO-type penalty function appears. Since the estimate is close to the value at which the expansion is taken, using (3) and the mean-value theorem, the second term within the integral becomes

The third term within the integral is, similarly,

where the non-zero parameters and their estimates appear. Using a Laplace approximation and applying the Weak Law of Large Numbers, as in the usual BIC, we arrive at

where the leading term is as defined above. This, combined with the second term of (2), gives (7).

Appendix B Proof of the Asymptotic Properties of the ALPBIC

To prove Part a of the theorem in Section 3, fix the notation as above. It is sufficient to prove that, for any given $\epsilon>0$, there exists a constant $C$ such that

Thus, we have to show that, with large probability, there is a local maximum inside the corresponding neighbourhood.

Using (7), we can write

(9)

where the counts involved are the numbers of non-zero elements in the corresponding parameter sub-vectors, and the remaining term is the $j$th element corresponding to the $g$th component. If we write the number of non-zero elements in the true parameter accordingly, then

the above term is negligible in the order comparison, as we shall see later in the proof. Also, for sufficiently large $n$, the required bound can be achieved.

Using a Taylor expansion, we can write

Also, for sufficiently large $n$, the corresponding term is bounded as required. For the third term of (9), using the stated assumption, we have

because the estimator involved is $\sqrt{n}$-consistent and hence the corresponding term is bounded in probability; on the other hand, the other estimator involved is also $\sqrt{n}$-consistent, and hence the analogous bound holds. Using this result,

where

From the regularity conditions, the stated bounds hold, and

Therefore, this term vanishes as $n\to\infty$.

Because

by order comparison, we conclude that the quadratic term is the sole leading term of (9). Because the information matrix is positive-definite,

and this concludes the proof of Part a.

To prove Part b, we use the mean-value theorem,

where the intermediate point lies between the estimate and the true value. Also,

From the regularity conditions, this term is of the stated order, and so

Therefore, from these order assessments, we conclude that

where the notation is as in (7). Also, the remaining term is of smaller order. Hence,

(10)

Because of the condition on the tuning parameter, we can conclude that

(11)

with probability tending to 1. This is due to regularity condition 3 and the fact that

Also,

Therefore, by order assessment, as $n\to\infty$, (10) reduces to the leading term