Variable selection for mixed data clustering: a model-based approach
We propose two approaches for selecting variables in latent class analysis (i.e., a mixture model assuming within-component independence), which is the most common model-based clustering method for mixed data. The first approach optimizes the BIC with a modified version of the EM algorithm, performing model selection and parameter inference simultaneously. The second approach maximizes the MICL, which accounts for the clustering task, with an alternating optimization algorithm. It performs model selection without requiring the maximum likelihood estimates for model comparison; parameter inference is then carried out for the single selected model. Thus, both approaches avoid computing the maximum likelihood estimates for each model comparison. Moreover, they avoid the standard algorithms for variable selection (e.g., stepwise methods), which are often suboptimal and computationally expensive. The case of data with missing values is also discussed. The benefits of both proposed criteria are shown on simulated and real data.
Keywords: Information criterion, Missing values, Mixed data, Model-based clustering, Variable selection
Clustering summarizes large datasets by grouping observations into a few homogeneous classes. Finite mixture models (McLachlan:04; McNicholas2016) allow the assessment of this unknown partition among observations. They can deal with continuous (Ban93; Cel95; morris2016clustering), categorical (Mei01; mcparland2013clustering; marbac2016), integer (karlis2007finite) or mixed data (browne2012model; kosmidis2015mixture; marbac2015copules). When observations are described by many variables, the within-component independence assumption makes the clustering goal achievable by limiting the number of parameters (Goo74; Han01; Mou05). As in regression (Davis20113256; Huang20062020) or classification (Greenshtein2009385; HUANG1994205), variable selection should be performed in clustering. Indeed, in many cases, the partition may be explained by only a subset of the observed variables (biernacki2015). Selecting the variables therefore facilitates both model fitting and result interpretation: for a fixed sample size, it improves the accuracy of the parameter estimates and of the class identification, and it brings out the variables characterizing the classes.
The first approaches for selecting variables were developed for clustering continuous data. Tadesse:05 consider two types of variables: the relevant variables (having a different distribution among components) and the irrelevant variables (having the same distribution among components), which are independent of the relevant ones. This method has been extended by considering a set of redundant variables, whose distribution is modelled by linear regressions on all the discriminative variables (Raftery:06) or on a subset of them (Maugis:09b). These authors propose to perform model selection by maximizing the Bayesian Information Criterion (BIC, Schwarz:78). However, this maximization is complex, because many models are in competition and because each model comparison requires calls to an EM algorithm to obtain the maximum likelihood estimates (MLE). The optimization can be carried out by a greedy search algorithm, which converges toward a local optimum in the space of models, with no guarantee that this optimum is the global one. This algorithm is feasible for quite large datasets, but it is computationally expensive when many variables are observed. Considering the latent class model (Goo74), dean2010latent and then white2016bayesian introduced similar approaches for selecting variables in categorical data clustering.
As the first contribution of this paper, we show how to select the variables, according to the BIC, for a model-based clustering of mixed data. The model considers two kinds of variables (relevant and irrelevant) and assumes within-component independence. Note that this model is especially useful when the number of variables is large (Han01), which is the most common situation where variable selection is needed. The within-component independence allows us to easily implement a modified version of the Expectation-Maximization (EM) algorithm proposed by green1990use, which maximizes the penalized likelihood. Thus, the proposed method selects variables in clustering of mixed data according to any likelihood-based information criterion, like the AIC (Aka73) or the BIC. Other penalized criteria accounting for the complexity of the model space could also be optimized (massart2007concentration; meynet2012selection; bontemps2013clustering).
The BIC approximates the logarithm of the integrated likelihood within an error term of $\mathcal{O}(1)$. This term can deteriorate its performances when few observations are available. Moreover, the BIC does not focus on the clustering goal, which motivated the introduction of the Integrated Complete-data Likelihood criterion (ICL) by Biernacki:00. This criterion makes a trade-off between the model fit to the data and the component entropy. Moreover, when the components belong to the exponential family and when conjugate prior distributions are used, this criterion does not involve any approximation. Biernacki:10 shows that this exact criterion can outperform the BIC. However, selecting the variables according to the ICL is complex. Therefore, marbac2015variable introduced the Maximum Integrated Complete-data Likelihood criterion (MICL) for selecting the variables of a diagonal Gaussian mixture model. The ICL and the MICL are quite similar, because both are based on the integrated complete-data likelihood. However, the MICL uses the partition maximizing this function, while the ICL uses the partition provided by a MAP rule associated with the MLE.
As the second contribution of the paper, we show that the MICL keeps a closed form for a mixture model for mixed data, provided that the prior distributions are conjugate. Hence, model selection is carried out by a simple and fast procedure which alternates between two maximizations to provide the model maximizing the MICL. We show that this exact criterion (i.e., involving no approximation) can outperform asymptotic criteria like the BIC. Finally, we show that the two contributions of this paper improve the clustering results when data have missing values. To manage missing values, we assume that values are missing at random (Little:2014).
Section 2 presents the mixture model for mixed data. Section 3 details the selection of variables according to the BIC, while Section 4 details the selection according to the MICL. Section 5 focuses on the missing values. Section 6 compares the proposed approaches to well-established methods on simulated data and illustrates their benefits on real data. Section 7 concludes this work.
2 Model-based clustering for mixed data
2.1 The model
Data to analyze consist of $n$ observations $\mathbf{x}=(\mathbf{x}_1,\ldots,\mathbf{x}_n)$, where each observation $\mathbf{x}_i=(x_{i1},\ldots,x_{id})$ is defined on the space $\mathcal{X}=\mathcal{X}_1\times\cdots\times\mathcal{X}_d$, the space $\mathcal{X}_j$ depending on the nature of variable $j$. Hence, $\mathcal{X}_j=\mathbb{R}$ (respectively $\mathbb{N}$, $\{1,\ldots,m_j\}$) if variable $j$ is continuous (respectively integer and categorical with $m_j$ levels). Observations are assumed to arise independently from the mixture of $g$ components defined by its probability distribution function (pdf)
$$p(\mathbf{x}_i\mid\boldsymbol{\theta})=\sum_{k=1}^{g}\pi_k\, p(\mathbf{x}_i\mid\boldsymbol{\alpha}_k)\quad\text{with}\quad p(\mathbf{x}_i\mid\boldsymbol{\alpha}_k)=\prod_{j=1}^{d}p(x_{ij}\mid\boldsymbol{\alpha}_{kj}), \tag{1}$$
where $\boldsymbol{\theta}$ groups the model parameters, $\pi_k$ is the proportion of component $k$ such that $\pi_k>0$ and $\sum_{k=1}^{g}\pi_k=1$, $p(\cdot\mid\boldsymbol{\alpha}_k)$ is the pdf of component $k$ parametrized by $\boldsymbol{\alpha}_k=(\boldsymbol{\alpha}_{k1},\ldots,\boldsymbol{\alpha}_{kd})$, and $p(\cdot\mid\boldsymbol{\alpha}_{kj})$ is the pdf of variable $j$ for component $k$ parametrized by $\boldsymbol{\alpha}_{kj}$. The univariate marginal distribution of variable $j$ depends on its definition space: $p(\cdot\mid\boldsymbol{\alpha}_{kj})$ is the pdf of a Gaussian distribution (respectively a Poisson and a multinomial distribution) if variable $j$ is continuous (respectively integer and categorical), with $\boldsymbol{\alpha}_{kj}=(\mu_{kj},\sigma_{kj}^2)$ (respectively $\lambda_{kj}$ and $\boldsymbol{\tau}_{kj}$).
In clustering, a variable is said to be irrelevant if its univariate margins are invariant over the mixture components. Considering the model defined by (1), variable $j$ is irrelevant if $\boldsymbol{\alpha}_{1j}=\ldots=\boldsymbol{\alpha}_{gj}$, and it is relevant otherwise. The role of the variables is defined by the binary vector $\boldsymbol{\omega}=(\omega_1,\ldots,\omega_d)$, where $\omega_j=0$ if variable $j$ is irrelevant and $\omega_j=1$ otherwise. Hence, the couple $\boldsymbol{m}=(g,\boldsymbol{\omega})$ defines the model at hand, because it defines the parameter space. Therefore, for a model $\boldsymbol{m}$, the pdf of $\mathbf{x}_i$ is
$$p(\mathbf{x}_i\mid\boldsymbol{m},\boldsymbol{\theta})=\sum_{k=1}^{g}\pi_k\prod_{\{j:\,\omega_j=1\}}p(x_{ij}\mid\boldsymbol{\alpha}_{kj})\prod_{\{j:\,\omega_j=0\}}p(x_{ij}\mid\boldsymbol{\beta}_j),$$
where $\boldsymbol{\beta}_j$ denotes the parameters of the common distribution of an irrelevant variable $j$ and $\boldsymbol{\theta}=(\pi_1,\ldots,\pi_g,\boldsymbol{\alpha},\boldsymbol{\beta})$.
2.2 Maximum likelihood inference
The observed-data log-likelihood of model $\boldsymbol{m}$ is defined by $\ell(\boldsymbol{\theta};\mathbf{x})=\sum_{i=1}^{n}\log p(\mathbf{x}_i\mid\boldsymbol{m},\boldsymbol{\theta})$. Hence, the equalities between the parameters defined by $\boldsymbol{\omega}$ imply that
$$\ell(\boldsymbol{\theta};\mathbf{x})=\sum_{i=1}^{n}\log\Big(\sum_{k=1}^{g}\pi_k\prod_{\{j:\,\omega_j=1\}}p(x_{ij}\mid\boldsymbol{\alpha}_{kj})\Big)+\sum_{\{j:\,\omega_j=0\}}\sum_{i=1}^{n}\log p(x_{ij}\mid\boldsymbol{\beta}_j).$$
The MLE of the parameters corresponding to the irrelevant variables are explicit, but not those of the proportions and of the relevant variables. Thus, it is standard to use an EM algorithm (Dem77; McLachlan:08) to maximize the observed-data log-likelihood. Here, the partition among the observations is unobserved. We denote this partition by $\mathbf{z}=(\mathbf{z}_1,\ldots,\mathbf{z}_n)$ with $\mathbf{z}_i=(z_{i1},\ldots,z_{ig})$, where $z_{ik}=1$ if observation $i$ arises from component $k$ and $z_{ik}=0$ otherwise. Hence, the complete-data log-likelihood of model $\boldsymbol{m}$ (log-likelihood computed on the observed and unobserved variables) is defined by
$$\ell(\boldsymbol{\theta};\mathbf{x},\mathbf{z})=\sum_{i=1}^{n}\sum_{k=1}^{g}z_{ik}\Big(\log\pi_k+\sum_{\{j:\,\omega_j=1\}}\log p(x_{ij}\mid\boldsymbol{\alpha}_{kj})\Big)+\sum_{\{j:\,\omega_j=0\}}\sum_{i=1}^{n}\log p(x_{ij}\mid\boldsymbol{\beta}_j).$$
The EM algorithm alternates between two steps: the Expectation step (E-step) consisting in computing the expectation of the complete-data likelihood under the current parameters, and the maximization step (M-step) consisting in maximizing this expectation over the model parameters. Thus, this algorithm starts from the initial value of the model parameter randomly sampled and its iteration is defined by
E-step Computation of the fuzzy partition $t_{ik}(\boldsymbol{\theta}^{(r)}):=\mathbb{E}[z_{ik}\mid\mathbf{x}_i,\boldsymbol{\theta}^{(r)}]$, hence
$$t_{ik}(\boldsymbol{\theta}^{(r)})=\frac{\pi_k^{(r)}\prod_{\{j:\,\omega_j=1\}}p(x_{ij}\mid\boldsymbol{\alpha}_{kj}^{(r)})}{\sum_{\ell=1}^{g}\pi_\ell^{(r)}\prod_{\{j:\,\omega_j=1\}}p(x_{ij}\mid\boldsymbol{\alpha}_{\ell j}^{(r)})},$$
M-step Maximization of the expected value of the complete-data log-likelihood over the parameters, hence $\pi_k^{(r+1)}=n_k^{(r)}/n$,
where $n_k^{(r)}=\sum_{i=1}^{n}t_{ik}(\boldsymbol{\theta}^{(r)})$, $\boldsymbol{\beta}_j$ is estimated by the MLE for an irrelevant variable, and $\boldsymbol{\alpha}_{kj}$ is estimated by the weighted MLE (with weights $t_{ik}(\boldsymbol{\theta}^{(r)})$) for a relevant variable. This algorithm converges to a local optimum of the observed-data log-likelihood. Thus, the MLE for the model $\boldsymbol{m}$, denoted by $\hat{\boldsymbol{\theta}}_{\boldsymbol{m}}$, is obtained by performing many different random initializations.
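The two steps above can be sketched numerically for the continuous case. The following is a minimal illustration, assuming a diagonal Gaussian mixture in which all variables are treated as relevant; the function name and array layout are ours, not the package implementation.

```python
import numpy as np
from scipy.stats import norm

def em_step(x, pi, mu, sigma):
    """One EM iteration for a Gaussian mixture with within-component
    independence; x is (n, d), pi is (g,), mu and sigma are (g, d)."""
    # E-step: fuzzy partition t[i, k] proportional to
    # pi_k * prod_j N(x_ij; mu_kj, sigma_kj)
    logp = np.log(pi)[None, :] + norm.logpdf(
        x[:, None, :], mu[None, :, :], sigma[None, :, :]).sum(axis=2)
    logp -= logp.max(axis=1, keepdims=True)   # numerical stability
    t = np.exp(logp)
    t /= t.sum(axis=1, keepdims=True)
    # M-step: closed-form weighted updates under independence
    nk = t.sum(axis=0)
    pi_new = nk / len(x)
    mu_new = (t.T @ x) / nk[:, None]
    var = (t.T @ x ** 2) / nk[:, None] - mu_new ** 2
    sigma_new = np.sqrt(np.maximum(var, 1e-8))
    return pi_new, mu_new, sigma_new, t
```

Iterating this function from several random starting points, and keeping the run with the highest observed-data log-likelihood, mimics the multiple-initialization strategy described above.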
3 Model selection by optimizing the BIC
3.1 Information criterion for data modelling
Model selection generally aims to find the model which obtains the greatest posterior probability among a collection of competing models. The number of components of the competing models is usually bounded by a value $g_{\max}$. Thus,
$$\boldsymbol{m}^{\star}=\underset{\boldsymbol{m}}{\operatorname{argmax}}\; p(\boldsymbol{m}\mid\mathbf{x}).$$
By assuming uniformity for the prior distribution of $\boldsymbol{m}$, $\boldsymbol{m}^{\star}$ maximizes the integrated likelihood defined by
$$p(\mathbf{x}\mid\boldsymbol{m})=\int_{\Theta_{\boldsymbol{m}}}p(\mathbf{x}\mid\boldsymbol{m},\boldsymbol{\theta})\,p(\boldsymbol{\theta}\mid\boldsymbol{m})\,d\boldsymbol{\theta},$$
where $\Theta_{\boldsymbol{m}}$ is the parameter space of model $\boldsymbol{m}$, $p(\mathbf{x}\mid\boldsymbol{m},\boldsymbol{\theta})$ is the likelihood function, and $p(\boldsymbol{\theta}\mid\boldsymbol{m})$ is the pdf of the prior distribution of the parameters. Unfortunately, the integrated likelihood is intractable, but many methods permit approximations of its value (Fri12). The most popular approach consists of using the BIC (Schwarz:78; Ker00), which approximates the logarithm of the integrated likelihood by a Laplace approximation, and thus requires the MLE. The BIC is defined by
$$\mathrm{BIC}(\boldsymbol{m})=\ell(\hat{\boldsymbol{\theta}}_{\boldsymbol{m}};\mathbf{x})-\frac{\nu_{\boldsymbol{m}}}{2}\log n,$$
where $\nu_{\boldsymbol{m}}$ is the number of independent parameters required by $\boldsymbol{m}$.
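To fix ideas, the number of free parameters $\nu_{\boldsymbol{m}}$ of a model $(g,\boldsymbol{\omega})$ under within-component independence, and the resulting BIC, can be computed as follows. This is a sketch; the function names and the encoding of variable types are ours.

```python
import numpy as np

def n_params(g, omega, var_types, n_levels):
    """Free-parameter count nu_m for model m = (g, omega):
    g - 1 proportions, plus per-variable parameters repeated g times
    when the variable is relevant and once when it is irrelevant."""
    nu = g - 1                                   # mixing proportions
    for j, w in enumerate(omega):
        if var_types[j] == "cont":
            per = 2                              # Gaussian mean + variance
        elif var_types[j] == "int":
            per = 1                              # Poisson rate
        else:
            per = n_levels[j] - 1                # multinomial probabilities
        nu += per * (g if w else 1)
    return nu

def bic(loglik, nu, n):
    """BIC(m) = loglik at the MLE minus (nu/2) log n."""
    return loglik - 0.5 * nu * np.log(n)
```

For instance, a two-component model with one relevant continuous variable and one irrelevant three-level categorical variable has $1 + 2\times 2 + 2 = 7$ free parameters.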
3.2 Optimizing the penalized likelihood
For a fixed number of components $g$, selecting the variables necessitates the comparison of $2^d$ models. Therefore, an exhaustive approach approximating the integrated likelihood for each competing model is not feasible. Instead, Raftery:06 carry out model selection with deterministic algorithms (like a stepwise method). However, this approach cannot ensure that the model maximizing the BIC is obtained. Moreover, it can be computationally expensive if many variables are observed. In this paper, model selection is an easier problem, because the model assumes within-component independence. This assumption permits the direct maximization of any penalized log-likelihood function defined by
$$\ell_{\mathrm{pen}}(\boldsymbol{m},\boldsymbol{\theta};\mathbf{x})=\ell(\boldsymbol{\theta};\mathbf{x})-c\,\nu_{\boldsymbol{m}},$$
for any constant $c>0$. This function is maximized by using a modified version of the EM algorithm (green1990use). Hence, we introduce the penalized complete-data log-likelihood function
$$\ell_{\mathrm{pen}}(\boldsymbol{m},\boldsymbol{\theta};\mathbf{x},\mathbf{z})=\ell(\boldsymbol{\theta};\mathbf{x},\mathbf{z})-c\,\nu_{\boldsymbol{m}},$$
where $\nu_j$ is the number of parameters for one univariate marginal distribution of variable $j$ (i.e., $\nu_j=2$ if variable $j$ is continuous, $\nu_j=1$ if variable $j$ is integer and $\nu_j=m_j-1$ if variable $j$ is categorical with $m_j$ levels). This modified version of the EM algorithm finds the model maximizing the penalized log-likelihood for a fixed number of components. It starts at an initial point randomly sampled, and its iteration is composed of two steps:
E-step Computation of the fuzzy partition
M-step Maximization of the expectation of the penalized complete-data log-likelihood over $(\boldsymbol{\omega},\boldsymbol{\theta})$, hence $\omega_j^{(r+1)}=1$ if $\Delta_j^{(r)}>0$ and $\omega_j^{(r+1)}=0$ otherwise,
where $\Delta_j^{(r)}$ is the difference between the maximum of the expected value of the penalized complete-data log-likelihood obtained when variable $j$ is relevant and when it is irrelevant. To obtain the couple $(\boldsymbol{\omega},\boldsymbol{\theta})$ maximizing the penalized observed-data log-likelihood for a fixed number of components, many random initializations of this algorithm should be performed. Hence, the couple maximizing the penalized observed-data log-likelihood is obtained by performing this algorithm for every value of $g$ between one and $g_{\max}$. By setting $c=\tfrac{1}{2}\log n$, this algorithm carries out the model selection according to the BIC. Moreover, other criteria can also be considered, like the AIC by setting $c=1$.
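The per-variable decision in the M-step can be sketched for a continuous variable. The following is an illustration under our own naming, comparing the penalized weighted Gaussian log-likelihood when the variable is relevant (one Gaussian per component, $2g$ parameters) and irrelevant (a single Gaussian, $2$ parameters); $c=\log(n)/2$ recovers the BIC penalty.

```python
import numpy as np

def relevance_gain(xj, t, c):
    """Penalized gain Delta_j of declaring variable j relevant, given the
    (n, g) fuzzy partition t and the penalty constant c."""
    n, g = t.shape

    def wloglik(w):
        # Maximized weighted Gaussian log-likelihood:
        # -0.5 * sum(w) * (log(2*pi*var_hat) + 1) at the weighted MLE
        mu = np.average(xj, weights=w)
        var = np.average((xj - mu) ** 2, weights=w)
        return -0.5 * w.sum() * (np.log(2 * np.pi * var) + 1)

    relevant = sum(wloglik(t[:, k]) for k in range(g))   # 2g parameters
    irrelevant = wloglik(np.ones(n))                     # 2 parameters
    return (relevant - c * 2 * g) - (irrelevant - c * 2)

# omega_j is set to 1 when relevance_gain(...) > 0, and to 0 otherwise
```

A variable whose component-wise distributions differ strongly yields a positive gain, while a pure noise variable is penalized into irrelevance.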
4 Model selection by optimizing the MICL
4.1 Information criterion
Although the BIC has good consistency properties, it does not focus on the clustering goal. Moreover, it involves an approximation of order $\mathcal{O}(1)$, which can deteriorate its performances, especially when $n$ is small or when $\nu_{\boldsymbol{m}}$ is large. To circumvent this issue, exact criteria could be preferred (Biernacki:10). Criteria based on the complete-data likelihood have been introduced, like the ICL (Biernacki:00) or the MICL (marbac2015variable). The integrated complete-data likelihood is defined by
$$p(\mathbf{x},\mathbf{z}\mid\boldsymbol{m})=\int_{\Theta_{\boldsymbol{m}}}p(\mathbf{x},\mathbf{z}\mid\boldsymbol{m},\boldsymbol{\theta})\,p(\boldsymbol{\theta}\mid\boldsymbol{m})\,d\boldsymbol{\theta},$$
where $p(\mathbf{x},\mathbf{z}\mid\boldsymbol{m},\boldsymbol{\theta})$ is the complete-data likelihood. When conjugate prior distributions are used, the integrated complete-data likelihood has a closed form. We assume independence between the prior distributions, such that
$$p(\boldsymbol{\theta}\mid\boldsymbol{m})=p(\boldsymbol{\pi})\prod_{\{j:\,\omega_j=1\}}\prod_{k=1}^{g}p(\boldsymbol{\alpha}_{kj})\prod_{\{j:\,\omega_j=0\}}p(\boldsymbol{\beta}_j),$$
where $\boldsymbol{\pi}=(\pi_1,\ldots,\pi_g)$. We use conjugate prior distributions, thus $\boldsymbol{\pi}$ follows a Dirichlet distribution. If variable $j$ is continuous, the variance follows an Inverse-Gamma distribution and the mean, conditionally on the variance, follows a Gaussian distribution. If variable $j$ is integer, then the Poisson rate follows a Gamma distribution, while the multinomial parameters follow a Dirichlet distribution if variable $j$ is categorical with $m_j$ levels. If there is no a priori information on the parameters, we use the Jeffreys non-informative prior distributions (Rob07) for the proportions (i.e., a Dirichlet distribution with all parameters equal to 1/2) and for the parameters of the categorical variables (i.e., Dirichlet distributions with all parameters equal to 1/2). Such prior distributions do not exist for the parameters of the Gaussian and Poisson distributions, so we use flat prior distributions (see Section 6).
The conjugate prior distributions imply the following closed form of the integrated complete-data likelihood
$$\log p(\mathbf{x},\mathbf{z}\mid\boldsymbol{m})=\log p(\mathbf{z}\mid\boldsymbol{m})+\sum_{\{j:\,\omega_j=1\}}\log p(\mathbf{x}_{\bullet j}\mid\mathbf{z},\boldsymbol{m})+\sum_{\{j:\,\omega_j=0\}}\log p(\mathbf{x}_{\bullet j}\mid\boldsymbol{m}),$$
where $\mathbf{x}_{\bullet j}=(x_{1j},\ldots,x_{nj})$, and each term is detailed in Appendix A.
The MICL corresponds to the greatest value of the integrated complete-data likelihood among all the possible partitions. Thus, the MICL is defined by
$$\mathrm{MICL}(\boldsymbol{m})=\log p(\mathbf{x},\mathbf{z}^{\star}_{\boldsymbol{m}}\mid\boldsymbol{m}),\quad\text{where}\quad \mathbf{z}^{\star}_{\boldsymbol{m}}=\underset{\mathbf{z}}{\operatorname{argmax}}\;\log p(\mathbf{x},\mathbf{z}\mid\boldsymbol{m}).$$
Obviously, this criterion is quite similar to the ICL and inherits its main properties. In particular, it is less sensitive to model misspecification than the BIC. Unlike the ICL and the BIC, it does not require the MLE and thus avoids the multiple calls to the EM algorithm. Because $\boldsymbol{\omega}$ does not impact the dimension of $\mathbf{z}$, we can maximize the integrated complete-data likelihood over $(\boldsymbol{\omega},\mathbf{z})$, and thus the best model according to the MICL can be obtained for a fixed number of components $g$.
4.2 Optimizing the MICL
An iterative algorithm is used for finding the model maximizing the MICL, for a fixed number of components. Starting at an initial point $(\boldsymbol{\omega}^{(0)},\mathbf{z}^{(0)})$, the algorithm alternates between two optimizations of the integrated complete-data likelihood: optimization over $\mathbf{z}$ given $\boldsymbol{\omega}$, and maximization over $\boldsymbol{\omega}$ given $\mathbf{z}$.
The algorithm is initialized as follows: each $\omega_j^{(0)}$ is sampled with probability 0.5, then $\mathbf{z}^{(0)}$ is the partition provided by a MAP rule associated with the model $(g,\boldsymbol{\omega}^{(0)})$ and its MLE. Iteration $(r)$ of the algorithm is written as
Partition step: find $\mathbf{z}^{(r+1)}$ such that $\log p(\mathbf{x},\mathbf{z}^{(r+1)}\mid g,\boldsymbol{\omega}^{(r)})\geq\log p(\mathbf{x},\mathbf{z}^{(r)}\mid g,\boldsymbol{\omega}^{(r)})$
Model step: find $\boldsymbol{\omega}^{(r+1)}=\underset{\boldsymbol{\omega}}{\operatorname{argmax}}\;\log p(\mathbf{x},\mathbf{z}^{(r+1)}\mid g,\boldsymbol{\omega})$
At iteration $(r)$, the model step consists in finding the vector $\boldsymbol{\omega}^{(r+1)}$ maximizing the integrated complete-data likelihood for the current partition $\mathbf{z}^{(r+1)}$. This optimization can be performed independently for each element $\omega_j$, thanks to the within-component independence assumption. The partition step is more complex, hence $\mathbf{z}^{(r+1)}$ is defined as a partition increasing the value of the integrated complete-data likelihood for the current model. It is obtained by an iterative method initialized at the partition $\mathbf{z}^{(r)}$. Each iteration consists in sampling uniformly an individual, which is then affiliated to the component maximizing the integrated complete-data likelihood, while the other component memberships are unchanged (details are given in marbac2015variable). Like the EM algorithm, the proposed algorithm converges to a local optimum of the integrated complete-data likelihood, so many different initializations should be performed.
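For a categorical variable, the per-variable decision of the model step is explicit. The following sketch (our own naming) uses the Dirichlet-multinomial closed form of the integrated complete-data likelihood with a Jeffreys prior ($a = 1/2$), and compares the relevant specification (one term per component) with the irrelevant one (a single term on the pooled counts).

```python
import numpy as np
from scipy.special import gammaln

def log_int_cat(counts, a=0.5):
    """Dirichlet-multinomial closed form:
    log of integral of prod_h tau_h^{n_h} under a Dirichlet(a, ..., a) prior."""
    counts = np.asarray(counts, dtype=float)
    H = counts.size
    return (gammaln(H * a) - H * gammaln(a)
            + gammaln(counts + a).sum() - gammaln(counts.sum() + H * a))

def model_step_cat(xj, z, g, a=0.5):
    """Choose omega_j for one categorical variable given a fixed partition z
    (label vector in {0, ..., g-1})."""
    levels = np.unique(xj)
    counts = np.array([[np.sum((z == k) & (xj == h)) for h in levels]
                       for k in range(g)])
    relevant = sum(log_int_cat(counts[k], a) for k in range(g))
    irrelevant = log_int_cat(counts.sum(axis=0), a)
    return 1 if relevant > irrelevant else 0
```

A variable perfectly aligned with the partition is declared relevant, while a variable whose level frequencies are identical across components is declared irrelevant.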
5 Missing values
Data can contain missing values, so we denote by $\mathcal{O}_i$ the set of indices of the variables where observation $i$ is observed. Assuming that data are missing at random, the pdf of $\mathbf{x}_i$ is defined by the marginal pdf of the observed values, given by
$$p(\mathbf{x}_i^{\mathrm{obs}}\mid\boldsymbol{m},\boldsymbol{\theta})=\sum_{k=1}^{g}\pi_k\prod_{\{j\in\mathcal{O}_i:\,\omega_j=1\}}p(x_{ij}\mid\boldsymbol{\alpha}_{kj})\prod_{\{j\in\mathcal{O}_i:\,\omega_j=0\}}p(x_{ij}\mid\boldsymbol{\beta}_j).$$
The EM algorithm maximizing the BIC can be used on data with missing values. Its M-step should consider only the observed values. Moreover, the within-component independence avoids the computation of the conditional expected values of the missing observations at the E-step. The steps of this algorithm are detailed in Appendix B. Alternatively, variables can be selected according to the MICL. Note that this criterion is particularly relevant in this case, because it considers the number of missing values in the sample, while the penalty of the BIC neglects this quantity. Indeed, the integrated complete-data likelihood involves one integral per variable that is computed on the observed values only.
This integral keeps a closed form, which is detailed in Appendix C.
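The restriction of the M-step to the observed values can be sketched for one continuous variable, with missing entries encoded as NaN (the function name and layout are ours):

```python
import numpy as np

def m_step_missing(xj, t):
    """Component-wise Gaussian M-step for one variable with missing values:
    each estimate uses only the observed entries, as allowed by the
    within-component independence assumption. t is the (n, g) fuzzy partition."""
    obs = ~np.isnan(xj)
    w = t[obs]                      # fuzzy weights restricted to observed rows
    x = xj[obs]
    mu = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)
    var = (w * (x[:, None] - mu) ** 2).sum(axis=0) / w.sum(axis=0)
    return mu, var
```

No imputation of the missing entries is needed: the weights of the unobserved rows are simply dropped.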
6 Numerical experiments
Implementation of the proposed method
Our method is implemented in the R package VarSelLCM. When the MICL is used, the hyper-parameters must be specified. For the proportions and the parameters of the categorical data, we use the Jeffreys prior distributions (i.e., Dirichlet distributions with parameters equal to 1/2). Because non-informative Jeffreys prior distributions do not exist for the Gaussian mixture, the hyper-parameters are chosen to be fairly flat in the region where the likelihood is substantial and not much greater elsewhere. In the same spirit, flat hyper-parameters are used for the Poisson distribution. The purpose of these experiments is to show the relevance of selecting variables in clustering. Two families of information criteria are considered: the model-fitting criterion (BIC) and the clustering-task criterion. When the clustering-task criterion is applied, the ICL is used if there is no selection of the variables, while the MICL is used if variables are selected.
First, methods of variable selection are compared for model-based clustering of continuous data. Thus, we compare our approaches with the selection of variables by the deterministic algorithms maximizing the BIC implemented in the R package clustvarsel (Scr14). This package considers redundant variables and different covariance matrices. The optimization of the BIC is proposed through two algorithms: forward and backward searches. Note that comprehensive comparisons of methods for selecting variables in continuous data clustering already exist (Cel14; marbac2015variable). Second, the impact of the missing values is illustrated on simulated mixed data. Third, the benefits of the proposed approaches are illustrated on real datasets. In this section, methods are compared on a clustering task. Thus, the partitioning accuracy is measured with the Adjusted Rand Index (ARI, Hub85), because it permits the comparison between two partitions having possibly different numbers of classes. When it is close to one, the partitions are strongly similar, while they are strongly different when this index is close to zero.
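The ARI can be computed from the contingency table of the two partitions; a self-contained sketch:

```python
import numpy as np
from scipy.special import comb

def ari(u, v):
    """Adjusted Rand Index between two partitions given as label vectors.
    Returns 1 for identical partitions (up to label permutation), and is
    close to 0 for independent partitions."""
    u, v = np.asarray(u), np.asarray(v)
    cu, cv = np.unique(u), np.unique(v)
    # contingency table of the two partitions
    c = np.array([[np.sum((u == a) & (v == b)) for b in cv] for a in cu])
    sum_comb = comb(c, 2).sum()
    sum_a = comb(c.sum(axis=1), 2).sum()
    sum_b = comb(c.sum(axis=0), 2).sum()
    n_pairs = comb(len(u), 2)
    expected = sum_a * sum_b / n_pairs           # expectation under chance
    max_index = (sum_a + sum_b) / 2
    return (sum_comb - expected) / (max_index - expected)
```

The chance correction is what allows the comparison of partitions having different numbers of classes.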
6.1 Simulated data: continuous case
This experiment compares the methods of model selection for clustering continuous data. We generate 200 observations from a bi-component Gaussian mixture with equal proportions. Under each component, the relevant variables follow a Gaussian distribution whose mean defines the class overlap and whose covariance matrix involves a correlation parameter. We add noisy variables following a standard Gaussian distribution. We consider different numbers of variables (10, 25, 50, 100), a theoretical misclassification rate of 5% and two values of the correlation parameter (0 and 0.4). Thus, when the correlation is 0, the model used for sampling the data belongs to the list of the competing models, while it does not belong to this list when the correlation is 0.4. For each case, 20 replicates are generated.
For each replicate, we perform the clustering, with an unknown number of classes, according to a modelling criterion (BIC) and to a clustering criterion (ICL/MICL). Model selection is performed by considering a maximum number of components equal to three. The ARI is computed for each selected model, and the values are presented in Table 1.
For both criterion families, selecting the variables increases the clustering accuracy. Even if the data arise from a model with intra-component dependencies (correlation of 0.4), the proposed approach, which assumes within-component independence, stays relevant. Indeed, it obtains a better ARI than the models implemented in clustvarsel. This phenomenon can be explained by different reasons. First, the information about the intra-component dependency vanishes when the number of irrelevant variables increases. Thus, the results of clustvarsel deteriorate when the number of variables increases. Second, the BIC can raise issues when the number of parameters increases, due to its approximation term in $\mathcal{O}(1)$. Because clustvarsel considers a richer family of models, it can be more sensitive to this issue. Finally, the independence assumption permits finding the model maximizing the BIC, while a richer family of models implies a sub-optimal optimization of this criterion.
Table 2 gives information about the models selected by the competing methods (number of components and rate of relevant variables). It shows that, without variable selection, the selected number of components tends to one when we handle a very large number of irrelevant variables. This problem is circumvented when the proposed procedure of variable selection is used with the BIC or the MICL. Moreover, this approach permits the detection of the role of the variables. Indeed, for 10, 25, 50 and 100 variables, the rate of relevant variables (i.e., 0.60, 0.25, 0.13 and 0.06 respectively) is found.
6.2 Simulated data: mixed case
This experiment evaluates the benefits of variable selection for clustering mixed data with missing values, when either a modelling criterion (BIC) or a clustering criterion (ICL/MICL) is used. We generate 200 observations from a bi-component model with equal proportions and assuming within-component independence. We consider six relevant variables (two continuous, two integer and two binary). Under each component, the univariate margins are defined by parameters fixing the theoretical misclassification error at 10%. Noisy variables are added (equal numbers of continuous, integer and binary variables), then missing values are added randomly. Thus, 20 replicates are generated for different numbers of variables (12, 24, 48) and different rates of missing values (0%, 10% and 20%). Table 3 presents the results.
Results show that selecting the variables increases the clustering performances for both types of criteria. Indeed, the values of the ARI are improved when variables are selected, especially when the clustering criteria are used. Moreover, the true number of components (two) is more often found when variables are selected. Finally, clustering interpretation is facilitated, because only a subset of the observed variables characterizes the classes. As expected, when missing values are added, the results deteriorate. However, the results are more impacted when all the variables are used for clustering. Note that the BIC obtains better results for detecting the role of the variables. Indeed, the rate of discriminative variables is 0.50, 0.25 and 0.125 when the number of variables is equal to 12, 24 and 48 respectively. For this simulation, the overall behaviour of the BIC is better than that of the MICL. This phenomenon is explained by the quite large class overlap: information criteria based on the integrated complete-data likelihood are known to work better when the classes are well separated. To illustrate this phenomenon, we perform a similar simulation by considering a theoretical misclassification rate of 5% and 20% of missing values. Table 4 shows the results obtained when the variables are selected according to the BIC and the MICL. In this case, both criteria obtain equivalent results for the ARI, the number of components and the detection of the variables. In this section, the model used for sampling the data belongs to the list of the competing models, which favours the BIC. The next section shows that the MICL is at least as relevant as the BIC for selecting the variables when real data are analysed.
6.3 Method comparison on real data
We now compare the competing methods on six real datasets presented in Table 5.
Table 6 presents the results obtained without variable selection (No selection), with a variable selection according to the BIC (BIC-selection) and with a variable selection according to the MICL (MICL-selection). For three datasets (birds, coffee, banknote), the ARIs obtained by the three approaches are equal. However, selecting the variables facilitates the clustering interpretation. For example, the selection with the MICL perfectly detects the clusters with only 42% of the variables. For the three other datasets, selecting the variables increases the ARI.
In many applications, the number of classes is unknown. Thus, we perform the clustering, with an unknown number of classes, according to a modelling criterion (BIC) and to a clustering criterion (ICL/MICL). For both approaches, the clustering is done with and without selection of the variables. Table 7 presents the results obtained for the real datasets. For both approaches, the selection of the variables increases the ARI for almost all the datasets. The only case where the ARI is deteriorated by a selection of variables is the clustering of the Golub dataset with the BIC. This dataset is really challenging, because it is composed of many variables (3051) and few observations (83). This large number of variables implies a huge number of competing models. In this case, the BIC can lead to poor results because of its approximation term in $\mathcal{O}(1)$. For this type of dataset, the exact criteria (involving no approximation) are more relevant. Thus, the clustering with variable selection according to the MICL finds the true number of components (2) and a relevant partition.
We have proposed a new model-based approach for selecting variables in a cluster analysis of mixed data with missing values. The purpose of the variable selection is to increase the accuracy of the model fitting and to facilitate its interpretation. The model at hand assumes within-component independence. This assumption is relevant, because variable selection mainly matters when many variables are observed. Moreover, numerical experiments have shown robustness properties with respect to model misspecification. The within-component independence assumption permits the maximization of the BIC and of the MICL. The first criterion performs the selection of the variables and the clustering with a model-fitting purpose. The second criterion achieves these objectives with a clustering purpose. Both criteria have provided relevant results in numerical experiments. For a dataset composed of several variables but very few observations, the MICL should be preferred to the BIC, because it does not involve any approximation. However, if many observations are available, the maximization of the MICL can be time consuming, while remaining doable in practice for moderate sample sizes. In this case, the BIC could be preferred, because its issues due to the $\mathcal{O}(1)$ term vanish when $n$ tends toward infinity.
Finally, this approach could be extended to perform a more elaborate variable selection. Indeed, by using the approach of Mau09, a group of redundant variables could be considered.
Appendix A Details on the closed-form of the integrated complete-data log-likelihood
To compute the integrated complete-data log-likelihood, we give the value for any type of data (continuous, integer and categorical).
If variable $j$ is continuous, then
where the quantities involved are the component-wise sufficient statistics of variable $j$ and the corresponding updated hyper-parameters.
If variable $j$ is integer, then
where the quantities involved are the component-wise counts and sums of variable $j$ and the corresponding updated hyper-parameters.
If variable $j$ is categorical with $m_j$ levels, then
$$p(\mathbf{x}_{\bullet j}\mid\mathbf{z},\boldsymbol{m})=\prod_{k=1}^{g}\frac{\Gamma(m_j/2)}{\Gamma(1/2)^{m_j}}\,\frac{\prod_{h=1}^{m_j}\Gamma(n_{kjh}+1/2)}{\Gamma(n_k+m_j/2)},\quad\text{with}\quad n_{kjh}=\sum_{i=1}^{n}z_{ik}\mathbf{1}\{x_{ij}=h\},\; n_k=\sum_{i=1}^{n}z_{ik}.$$
Appendix B EM algorithm to optimize the BIC criterion for data with missing values
The EM algorithm starts at an initial point randomly sampled, and its iteration is composed of two steps:
E step Computation of the fuzzy partition
M step Maximization of the expectation of the penalized complete-data log-likelihood over the model and the parameters,
where the estimates are computed using only the observed values of each variable.
Appendix C Details on the closed-form of the integrated complete-data log-likelihood for data with missing values
To compute the integrated complete-data log-likelihood for data with missing values, we give the value for any type of variable (continuous, integer and categorical) containing missing values.
If variable $j$ is continuous, then
where the quantities involved are the sufficient statistics and updated hyper-parameters computed on the observed values of variable $j$ only.
If variable $j$ is integer, then
where the quantities involved are computed on the observed values of variable $j$ only.
If variable $j$ is categorical with $m_j$ levels, then the closed form of Appendix A applies with the counts computed on the observed values only.