Variable selection for mixed data clustering: a model-based approach


Matthieu Marbac and Mohammed Sedki
Abstract

We propose two approaches for selecting variables in latent class analysis (i.e., mixture models assuming within component independence), which is the common model-based clustering method for mixed data. The first approach consists in optimizing the BIC with a modified version of the EM algorithm. This approach simultaneously performs model selection and parameter inference. The second approach consists in maximizing the MICL, which considers the clustering task, with an algorithm of alternate optimization. This approach performs model selection without requiring the maximum likelihood estimates for model comparison; parameter inference is then done for the single selected model. Thus, the benefit of both approaches is to avoid the computation of the maximum likelihood estimates for each model comparison. Moreover, they also avoid the use of the standard algorithms for variable selection, which are often suboptimal (e.g., stepwise methods) and computationally expensive. The case of data with missing values is also discussed. The interest of both proposed criteria is shown on simulated and real data.

Keywords: Information criterion, Missing values, Mixed data, Model-based clustering, Variable selection

1 Introduction

Clustering summarizes large datasets by grouping observations into a few homogeneous classes. Finite mixture models (McLachlan:04; McNicholas2016) allow assessment of this unknown partition among observations. They permit dealing with continuous (Ban93; Cel95; morris2016clustering), categorical (Mei01; mcparland2013clustering; marbac2016), integer (karlis2007finite) or mixed data (browne2012model; kosmidis2015mixture; marbac2015copules). When observations are described by many variables, the within component independence assumption permits achievement of the clustering goal by limiting the number of parameters (Goo74; Han01; Mou05). As in regression (Davis20113256; Huang20062020) or classification (Greenshtein2009385; HUANG1994205), variable selection should be done in clustering. Indeed, in many cases, the partition may be explained by only a subset of the observed variables (biernacki2015). Thus, by performing a selection of the variables in clustering, both model fitting and result interpretation are facilitated. Indeed, for a fixed sample size, selecting the variables improves the accuracy of parameter estimation and class identification. Moreover, such a method brings out the variables characterizing the classes, thus facilitating the interpretation of the clustering results.

The first approaches for selecting variables were developed to cluster continuous data. Thus, Tadesse:05 consider two types of variables: the set of relevant variables (having a different distribution among components) and the set of irrelevant variables (having the same distribution among components), which are independent of the relevant ones. This method has been extended by considering a set of redundant variables. The distribution of the redundant variables is modelled by linear regressions on the whole set of discriminative variables (Raftery:06) or on a subset of the discriminative variables (Maugis:09b). The authors propose to perform model selection by maximizing the Bayesian Information Criterion (BIC, Schwarz:78). However, this maximization is complex because many models are in competition, and because each model comparison requires calls to an EM algorithm to obtain the maximum likelihood estimates (MLE). This optimization can be carried out by a greedy search algorithm. Such an algorithm converges toward a local optimum in the space of models, but there is no guarantee that this optimum is the global one. It is feasible for quite large datasets, but it is computationally expensive when many variables are observed. Considering the latent class model (Goo74), dean2010latent and then white2016bayesian introduced a similar way of selecting variables in categorical data clustering.

As the first contribution of this paper, we show how to select the variables, according to the BIC, for a model-based clustering of mixed data. The model considers two kinds of variables (relevant and irrelevant) and assumes within component independence. Note that this model is especially useful when the number of variables is large (Han01), which is the most common situation where variable selection is needed. The within component independence allows us to easily implement a modified version of the Expectation-Maximization (EM) algorithm proposed by green1990use, which permits the maximization of the penalized likelihood. Thus, the proposed method permits the selection of variables in clustering of mixed data according to any likelihood-based information criterion, like the AIC (Aka73) or the BIC. Other penalized criteria considering the complexity of the model space could also be optimized (massart2007concentration; meynet2012selection; bontemps2013clustering).

The BIC approximates the logarithm of the integrated likelihood with an error term of $O(1)$. This term can deteriorate its performances when few observations are available. Moreover, it does not focus on the clustering goal, hence the Integrated Complete-data Likelihood criterion (ICL) has been introduced by Biernacki:00. This criterion makes a trade-off between the model fit to the data and the component entropy. Moreover, when the components belong to the exponential family and when conjugate prior distributions are used, this criterion does not imply any approximation. Biernacki:10 shows that this exact criterion can outperform the BIC. However, selecting the variables according to the ICL is complex. Therefore, marbac2015variable introduced the Maximum Integrated Complete-data Likelihood criterion (MICL) for selecting the variables of a diagonal Gaussian mixture model. The ICL and the MICL are quite similar, because both of them are based on the integrated complete-data likelihood. However, the MICL uses the partition maximizing this function, while the ICL uses the partition provided by a MAP rule associated to the MLE.

As the second contribution of the paper, we show that the MICL keeps a closed form for a mixture model for mixed data, if conjugate prior distributions are used. Hence, model selection is carried out by a simple and fast procedure which alternates between two maximizations, providing the model maximizing the MICL. We show that this exact criterion (i.e., implying no approximation) can outperform asymptotic criteria (like the BIC). Finally, we show that the two contributions of this paper improve the clustering results when data have missing values. To manage missing values, we assume that values are missing at random (Little:2014).

Section 2 presents the mixture model for mixed data. Section 3 details the selection of variables according to the BIC, while Section 4 details the selection according to the MICL. Section 5 focuses on missing values. Section 6 compares the proposed approaches to well-established methods on simulated data and illustrates their benefits on real data. Section 7 concludes this work.

2 Model-based clustering for mixed data

2.1 The model

Data to analyze consist of $n$ observations $\mathbf{x} = (x_1, \ldots, x_n)$, where each observation $x_i = (x_{i1}, \ldots, x_{id})$ is defined on the space $\mathcal{X} = \mathcal{X}_1 \times \ldots \times \mathcal{X}_d$, $\mathcal{X}_j$ depending on the nature of variable $j$. Hence, $\mathcal{X}_j = \mathbb{R}$ (respectively $\mathcal{X}_j = \mathbb{N}$, $\mathcal{X}_j = \{1, \ldots, m_j\}$) if variable $j$ is continuous (respectively integer and categorical with $m_j$ levels). Observations are assumed to arise independently from the mixture of $g$ components defined by its probability distribution function (pdf)

$$p(x_i \mid \theta) = \sum_{k=1}^{g} \pi_k \, p(x_i \mid \alpha_k) \quad \text{with} \quad p(x_i \mid \alpha_k) = \prod_{j=1}^{d} p(x_{ij} \mid \alpha_{kj}), \qquad (1)$$

where $\theta$ groups the model parameters, $\pi_k$ is the proportion of component $k$ such that $\pi_k > 0$ and $\sum_{k=1}^{g} \pi_k = 1$, $p(\cdot \mid \alpha_k)$ is the pdf of component $k$ parametrized by $\alpha_k = (\alpha_{k1}, \ldots, \alpha_{kd})$, and $p(\cdot \mid \alpha_{kj})$ is the pdf of variable $j$ for component $k$ parametrized by $\alpha_{kj}$. The univariate marginal distribution of variable $j$ depends on its definition space; therefore $p(\cdot \mid \alpha_{kj})$ is the pdf of a Gaussian distribution (respectively a Poisson and a multinomial distribution) if variable $j$ is continuous (respectively integer and categorical), with $\alpha_{kj} = (\mu_{kj}, \sigma_{kj})$ (respectively $\alpha_{kj} = \lambda_{kj}$ and $\alpha_{kj} = (\tau_{kj1}, \ldots, \tau_{kjm_j})$).

In clustering, a variable is said to be irrelevant if its univariate margins are invariant over the mixture components. Considering the model defined by (1), variable $j$ is irrelevant if $\alpha_{1j} = \ldots = \alpha_{gj}$, and it is relevant otherwise. The role of the variables is defined by the binary vector $\boldsymbol{\omega} = (\omega_1, \ldots, \omega_d)$, with $\omega_j = 0$ if variable $j$ is irrelevant and $\omega_j = 1$ otherwise. Hence, the couple $m = (g, \boldsymbol{\omega})$ defines the model at hand, because it defines the parameter space. Therefore, for a model $m$, the pdf of $x_i$ is

$$p(x_i \mid m, \theta) = \Big( \prod_{j:\, \omega_j = 0} p(x_{ij} \mid \alpha_{1j}) \Big) \sum_{k=1}^{g} \pi_k \prod_{j:\, \omega_j = 1} p(x_{ij} \mid \alpha_{kj}), \qquad (2)$$

where $\theta = (\pi_1, \ldots, \pi_g, \alpha_{11}, \ldots, \alpha_{gd})$ and the constraint $\alpha_{1j} = \ldots = \alpha_{gj}$ holds for every irrelevant variable $j$.
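To fix ideas, the following Python sketch evaluates the pdf (2) for a single observation of mixed data. It is an illustration of the model rather than the authors' implementation; the function name, the parameter layout (params_rel, params_irr) and the variable-type encoding are our own assumptions.

    import numpy as np
    from scipy import stats

    def mixture_pdf(x, var_types, omega, pi, params_rel, params_irr):
        """Evaluate the pdf (2) of one mixed-data observation x under a
        g-component mixture with within component independence.
        x          : list of length d (floats, integers or 0-based category levels)
        var_types  : list of "cont", "int" or "cat"
        omega      : binary list, omega[j] = 1 if variable j is relevant
        pi         : sequence of the g component proportions
        params_rel : params_rel[k][j] = parameters of variable j in component k
        params_irr : params_irr[j]    = common parameters of an irrelevant variable j
        """
        g, d = len(pi), len(x)

        def marg(xj, typ, par):
            # univariate margin: Gaussian, Poisson or multinomial
            if typ == "cont":
                mu, sigma = par
                return stats.norm.pdf(xj, mu, sigma)
            if typ == "int":
                return stats.poisson.pmf(xj, par)
            return par[xj]  # "cat": par is a probability vector over the levels

        # irrelevant variables factor out of the mixture sum, see (2)
        dens_irr = np.prod([marg(x[j], var_types[j], params_irr[j])
                            for j in range(d) if omega[j] == 0])
        dens_mix = sum(pi[k] * np.prod([marg(x[j], var_types[j], params_rel[k][j])
                                        for j in range(d) if omega[j] == 1])
                       for k in range(g))
        return dens_irr * dens_mix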

2.2 Maximum likelihood inference

The general form of the observed-data log-likelihood of model $m$ is defined by $\ell(\theta; \mathbf{x}, m) = \sum_{i=1}^{n} \ln p(x_i \mid m, \theta)$. Hence, the equalities between the parameters defined by $\boldsymbol{\omega}$ imply that

$$\ell(\theta; \mathbf{x}, m) = \sum_{i=1}^{n} \ln \Big( \sum_{k=1}^{g} \pi_k \prod_{j:\, \omega_j = 1} p(x_{ij} \mid \alpha_{kj}) \Big) + \sum_{j:\, \omega_j = 0} \sum_{i=1}^{n} \ln p(x_{ij} \mid \alpha_{1j}). \qquad (3)$$

The MLE of the parameters corresponding to the irrelevant variables are explicit, but not those of the proportions and of the relevant variables. Thus, it is standard to use an EM algorithm (Dem77; McLachlan:08) to maximize the observed-data log-likelihood. Here, the partition among the observations is unobserved. We denote this partition by $\mathbf{z} = (z_1, \ldots, z_n)$ with $z_i = (z_{i1}, \ldots, z_{ig})$, where $z_{ik} = 1$ if observation $i$ arises from component $k$ and $z_{ik} = 0$ otherwise. Hence, the complete-data log-likelihood of model $m$ (log-likelihood computed on the observed and unobserved variables) is defined by

$$\ell(\theta; \mathbf{x}, \mathbf{z}, m) = \sum_{i=1}^{n} \sum_{k=1}^{g} z_{ik} \Big( \ln \pi_k + \sum_{j:\, \omega_j = 1} \ln p(x_{ij} \mid \alpha_{kj}) \Big) + \sum_{j:\, \omega_j = 0} \sum_{i=1}^{n} \ln p(x_{ij} \mid \alpha_{1j}). \qquad (4)$$

The EM algorithm alternates between two steps: the expectation step (E-step), consisting in computing the expectation of the complete-data log-likelihood under the current parameters, and the maximization step (M-step), consisting in maximizing this expectation over the model parameters. Thus, the algorithm starts from a randomly sampled initial value $\theta^{(0)}$ of the model parameters, and its iteration $(r)$ is defined by
E-step Computation of the fuzzy partition $t_{ik}^{(r)} := \mathbb{E}[z_{ik} \mid x_i, m, \theta^{(r-1)}]$, hence

$$t_{ik}^{(r)} = \frac{\pi_k^{(r-1)} \prod_{j:\, \omega_j = 1} p(x_{ij} \mid \alpha_{kj}^{(r-1)})}{\sum_{\ell=1}^{g} \pi_\ell^{(r-1)} \prod_{j:\, \omega_j = 1} p(x_{ij} \mid \alpha_{\ell j}^{(r-1)})}.$$

M-step Maximization of the expected value of the complete-data log-likelihood over the parameters, hence $\pi_k^{(r)} = n_k^{(r)} / n$,

where $n_k^{(r)} = \sum_{i=1}^{n} t_{ik}^{(r)}$, $\alpha_{kj}^{(r)}$ is the MLE computed on all the observations for an irrelevant variable $j$, and is the weighted MLE (with weights $t_{ik}^{(r)}$) for a relevant variable $j$. This algorithm converges to a local optimum of the observed-data log-likelihood. Thus, the MLE for the model $m$, denoted by $\hat{\theta}_m$, is obtained by performing many different random initializations.
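As an illustration of this EM scheme, here is a minimal Python sketch restricted to continuous (Gaussian) margins and a fixed relevance vector omega; the initialization, the fixed number of iterations and all variable names are simplifying assumptions, not the algorithm as implemented by the authors.

    import numpy as np
    from scipy.stats import norm

    def em_lcm_gaussian(x, omega, g, n_iter=100, seed=0):
        """EM for a diagonal Gaussian mixture in which only the variables with
        omega[j] == 1 are class specific.  x is an (n, d) array."""
        rng = np.random.default_rng(seed)
        n, d = x.shape
        omega = np.asarray(omega)
        rel, irr = np.flatnonzero(omega == 1), np.flatnonzero(omega == 0)
        # irrelevant variables: explicit MLE, computed once on all observations
        mu_irr, sd_irr = x[:, irr].mean(axis=0), x[:, irr].std(axis=0)
        # random initialization of the class-specific parameters
        pi = np.full(g, 1.0 / g)
        mu = x[rng.choice(n, g, replace=False)][:, rel]
        sd = np.tile(x[:, rel].std(axis=0), (g, 1))
        for _ in range(n_iter):
            # E-step: fuzzy partition t[i, k] based on the relevant variables only
            logp = np.log(pi) + np.stack(
                [norm.logpdf(x[:, rel], mu[k], sd[k]).sum(axis=1) for k in range(g)],
                axis=1)
            logp -= logp.max(axis=1, keepdims=True)
            t = np.exp(logp)
            t /= t.sum(axis=1, keepdims=True)
            # M-step: proportions and weighted Gaussian MLE for the relevant variables
            nk = t.sum(axis=0)
            pi = nk / n
            mu = (t.T @ x[:, rel]) / nk[:, None]
            sd = np.sqrt((t.T @ x[:, rel] ** 2) / nk[:, None] - mu ** 2)
        return pi, mu, sd, mu_irr, sd_irr, t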

3 Model selection by optimizing the BIC

3.1 Information criterion for data modelling

Model selection generally aims to find the model $m^\star$ which obtains the greatest posterior probability among a collection of competing models $\mathcal{M}$. The number of components of the competing models is usually bounded by a value $g_{\max}$. Thus,

$$m^\star = \operatorname*{argmax}_{m \in \mathcal{M}} \, p(m \mid \mathbf{x}). \qquad (5)$$

By assuming uniformity for the prior distribution of $m$, $m^\star$ maximizes the integrated likelihood defined by

$$p(\mathbf{x} \mid m) = \int_{\Theta_m} p(\mathbf{x} \mid m, \theta)\, p(\theta \mid m)\, d\theta, \qquad (6)$$

where $\Theta_m$ is the parameter space of model $m$, $p(\mathbf{x} \mid m, \theta)$ is the likelihood function, and $p(\theta \mid m)$ is the pdf of the prior distribution of the parameters. Unfortunately, the integrated likelihood is intractable, but many methods permit approximations of its value (Fri12). The most popular approach consists of using the BIC (Schwarz:78; Ker00), which approximates the logarithm of the integrated likelihood by a Laplace approximation, and thus requires the MLE. The BIC is defined by

$$\mathrm{BIC}(m) = \ell(\hat{\theta}_m; \mathbf{x}, m) - \frac{\nu_m}{2} \ln n, \qquad (7)$$

where $\nu_m$ is the number of independent parameters required by $m$.
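As a small illustration (with assumed variable names and encodings), the criterion (7) can be computed from the maximized log-likelihood by counting the free parameters of $m = (g, \boldsymbol{\omega})$ under the model of Section 2: $g - 1$ proportions, plus the parameters of each univariate margin, counted once per component only when the variable is relevant.

    import numpy as np

    def n_free_params(g, omega, var_types, n_levels):
        """nu_m: (g - 1) proportions, plus nu_j parameters per variable,
        counted g times when the variable is relevant and once otherwise.
        n_levels[j] is only used for categorical variables (put 0 elsewhere)."""
        nu_one = {"cont": 2, "int": 1}   # Gaussian: mean + variance, Poisson: rate
        nu = g - 1
        for j, typ in enumerate(var_types):
            nu_j = nu_one[typ] if typ in nu_one else n_levels[j] - 1
            nu += nu_j * (g if omega[j] == 1 else 1)
        return nu

    def bic(loglik, g, omega, var_types, n_levels, n):
        """Criterion (7): maximized log-likelihood minus (nu_m / 2) ln n."""
        return loglik - 0.5 * n_free_params(g, omega, var_types, n_levels) * np.log(n)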

3.2 Optimizing the penalized likelihood

For a fixed number of components $g$, selecting the variables necessitates the comparison of $2^d$ models. Therefore, an exhaustive approach approximating the integrated likelihood of each competing model is not feasible. Instead, Raftery:06 carry out model selection by deterministic algorithms (like a stepwise method). However, this approach cannot ensure that the model maximizing the BIC is obtained. Moreover, it can be computationally expensive if many variables are observed. In this paper, model selection is an easier problem, because the model assumes within component independence. This assumption permits the direct maximization of any penalized log-likelihood function defined by

$$\ell_{\mathrm{pen}}(\theta, m; \mathbf{x}) = \ell(\theta; \mathbf{x}, m) - c\, \nu_m, \qquad (8)$$

for any positive constant $c$. This function is maximized by using a modified version of the EM algorithm (green1990use). Hence, we introduce the penalized complete-data log-likelihood function

$$\ell_{\mathrm{pen}}(\theta, m; \mathbf{x}, \mathbf{z}) = \ell(\theta; \mathbf{x}, \mathbf{z}, m) - c \Big( (g - 1) + \sum_{j=1}^{d} \nu_j \big( 1 + \omega_j (g - 1) \big) \Big), \qquad (9)$$

where $\nu_j$ is the number of parameters of one univariate marginal distribution of variable $j$ (i.e., $\nu_j = 2$ if the variable is continuous, $\nu_j = 1$ if the variable is integer and $\nu_j = m_j - 1$ if the variable is categorical with $m_j$ levels). This modified version of the EM algorithm finds the model maximizing the penalized log-likelihood for a fixed number of components. It starts at an initial point $(m^{(0)}, \theta^{(0)})$ randomly sampled with $m^{(0)} = (g, \boldsymbol{\omega}^{(0)})$, and its iteration is composed of two steps:
E-step Computation of the fuzzy partition $t_{ik}^{(r)}$, as in Section 2.2, using the current model $m^{(r-1)}$ and parameters $\theta^{(r-1)}$.

M-step Maximization of the expectation of the penalized complete-data log-likelihood over $(\boldsymbol{\omega}, \theta)$, hence $\omega_j^{(r)} = 1$ if $\Delta_j^{(r)} > 0$ and $\omega_j^{(r)} = 0$ otherwise,

where $\Delta_j^{(r)}$ is the difference between the maximum of the expected value of the penalized complete-data log-likelihood obtained when variable $j$ is relevant and when it is irrelevant. To obtain the couple $(m, \theta)$ maximizing the penalized observed-data log-likelihood, for a fixed number of components, many random initializations of this algorithm should be done. Hence, the couple maximizing the penalized observed-data log-likelihood is obtained by running this algorithm for every value of $g$ between one and $g_{\max}$. By considering $c = \frac{1}{2} \ln n$, this algorithm carries out the model selection according to the BIC. Moreover, other criteria can also be considered, like the AIC by setting $c = 1$.
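The decision on each $\omega_j$ at the M-step can be sketched as follows for a continuous variable (a hypothetical helper, not the packaged code): the expected penalized complete-data log-likelihood is maximized with variable $j$ relevant ($g$ Gaussian fits weighted by the fuzzy partition, penalty $c\,g\,\nu_j$) and irrelevant (one Gaussian fit, penalty $c\,\nu_j$), and $\omega_j$ is set to 1 exactly when the difference is positive.

    import numpy as np
    from scipy.stats import norm

    def delta_j(xj, t, c, nu_j=2):
        """Delta_j: expected penalized complete-data log-likelihood maximized with
        variable j relevant minus the same quantity with variable j irrelevant.
        xj : (n,) continuous variable, t : (n, g) fuzzy partition, c : penalty constant."""
        n, g = t.shape
        nk = t.sum(axis=0)
        # relevant: one weighted Gaussian MLE per component
        mu_k = (t * xj[:, None]).sum(axis=0) / nk
        var_k = (t * (xj[:, None] - mu_k) ** 2).sum(axis=0) / nk
        ll_rel = (t * norm.logpdf(xj[:, None], mu_k, np.sqrt(var_k))).sum()
        # irrelevant: a single Gaussian fitted on all the observations
        ll_irr = norm.logpdf(xj, xj.mean(), xj.std()).sum()
        return (ll_rel - c * g * nu_j) - (ll_irr - c * nu_j)

    # at the M-step, omega[j] is set to 1 when delta_j(...) > 0 and to 0 otherwise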

4 Model selection by optimizing the MICL

4.1 Information criterion

Although the BIC has good consistency properties, it does not focus on the clustering goal. Moreover, it involves an approximation in $O(1)$ which can deteriorate its performances, especially when $n$ is small. To circumvent this issue, exact criteria can be preferred (Biernacki:10). Criteria based on the complete-data likelihood have been introduced, like the ICL (Biernacki:00) or the MICL (marbac2015variable). The integrated complete-data likelihood is defined by

$$p(\mathbf{x}, \mathbf{z} \mid m) = \int_{\Theta_m} p(\mathbf{x}, \mathbf{z} \mid m, \theta)\, p(\theta \mid m)\, d\theta, \qquad (10)$$

where $p(\mathbf{x}, \mathbf{z} \mid m, \theta)$ is the complete-data likelihood. When conjugate prior distributions are used, the integrated complete-data likelihood has the following closed form. Thus, we assume independence between the prior distributions, such that

$$p(\theta \mid m) = p(\boldsymbol{\pi}) \prod_{j=1}^{d} \prod_{k=1}^{g} p(\alpha_{kj}), \qquad (11)$$

where $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_g)$. We use conjugate prior distributions, thus $\boldsymbol{\pi}$ follows a Dirichlet distribution. If variable $j$ is continuous, $\alpha_{kj} = (\mu_{kj}, \sigma_{kj})$, where $\sigma_{kj}^2$ follows an Inverse-Gamma distribution and $\mu_{kj}$, conditionally on $\sigma_{kj}$, follows a Gaussian distribution. If variable $j$ is integer, then $\lambda_{kj}$ follows a Gamma distribution, while $(\tau_{kj1}, \ldots, \tau_{kjm_j})$ follows a Dirichlet distribution if variable $j$ is categorical with $m_j$ levels. If there is no a priori information on the parameters, we use the Jeffreys non-informative prior distributions (Rob07) for the proportions and for the parameters of the categorical variables (i.e., Dirichlet distributions with all parameters equal to 1/2). Such prior distributions do not exist for the parameters of the Gaussian and Poisson distributions, so we use flat prior distributions (see Section 6).

The conjugate prior distributions imply the following closed form of the integrated complete-data likelihood

$$p(\mathbf{x}, \mathbf{z} \mid m) = p(\mathbf{z} \mid m) \prod_{j=1}^{d} p(\mathbf{x}_{\bullet j} \mid \mathbf{z}, m), \qquad (12)$$

where $p(\mathbf{z} \mid m) = \int p(\mathbf{z} \mid \boldsymbol{\pi})\, p(\boldsymbol{\pi})\, d\boldsymbol{\pi}$, $\mathbf{x}_{\bullet j} = (x_{1j}, \ldots, x_{nj})$, and

$$p(\mathbf{x}_{\bullet j} \mid \mathbf{z}, m) = \begin{cases} \displaystyle \prod_{k=1}^{g} \int \prod_{i:\, z_{ik} = 1} p(x_{ij} \mid \alpha_{kj})\, p(\alpha_{kj})\, d\alpha_{kj} & \text{if } \omega_j = 1, \\[2mm] \displaystyle \int \prod_{i=1}^{n} p(x_{ij} \mid \alpha_{1j})\, p(\alpha_{1j})\, d\alpha_{1j} & \text{if } \omega_j = 0. \end{cases} \qquad (13)$$

The integrals defined by (13) are explicit, thus providing a closed form of the integrated complete-data likelihood. Their value depends on $\omega_j$, and their form is detailed in Appendix A.
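To make the closed form concrete, the sketch below computes the logarithm of the per-block marginals entering (13) for a categorical variable (Dirichlet(1/2, ..., 1/2) prior, i.e. the Dirichlet-multinomial marginal) and for an integer variable (an assumed Gamma(a, b) prior on the Poisson rate, i.e. the Poisson-Gamma marginal); the hyper-parameter names are ours, not necessarily those of Appendix A.

    import numpy as np
    from scipy.special import gammaln

    def log_marg_categorical(counts, a=0.5):
        """Dirichlet-multinomial log-marginal of one block of observations:
        counts[h] = number of occurrences of level h, prior Dirichlet(a, ..., a)."""
        counts = np.asarray(counts, dtype=float)
        m = counts.size
        return (gammaln(m * a) - m * gammaln(a)
                + gammaln(counts + a).sum() - gammaln(counts.sum() + m * a))

    def log_marg_poisson(xj, a=1.0, b=1.0):
        """Poisson-Gamma log-marginal of one block of integer observations,
        with an assumed Gamma(shape=a, rate=b) prior on the Poisson rate."""
        xj = np.asarray(xj, dtype=float)
        n, s = xj.size, xj.sum()
        return (a * np.log(b) - gammaln(a) + gammaln(a + s)
                - (a + s) * np.log(b + n) - gammaln(xj + 1).sum())

    def log_term_13(blocks, log_marg):
        """Per-variable term of (13): for a relevant variable, blocks are the g
        groups defined by z; for an irrelevant one, a single pooled block."""
        return sum(log_marg(block) for block in blocks)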

The MICL corresponds to the greatest value of the integrated complete-data likelihood among all the possible partitions. Thus, the MICL is defined by

$$\mathrm{MICL}(m) = \ln p(\mathbf{x}, \mathbf{z}_m^\star \mid m), \quad \text{with} \quad \mathbf{z}_m^\star = \operatorname*{argmax}_{\mathbf{z}} \ln p(\mathbf{x}, \mathbf{z} \mid m). \qquad (14)$$

Obviously, this criterion is quite similar to the ICL and inherits its main properties. In particular, it is less sensitive to model misspecification than the BIC. Unlike the ICL and the BIC, it does not require the MLE and thus avoids the multiple calls to the EM algorithm. Because $\boldsymbol{\omega}$ does not impact the dimension of $\mathbf{z}$, we can maximize the integrated complete-data likelihood over $(\boldsymbol{\omega}, \mathbf{z})$, and thus the best model according to the MICL can be obtained, for a fixed number of components $g$.

4.2 Optimizing the MICL

An iterative algorithm is used for finding the model maximizing the MICL, for a fixed number of components. Starting at an initial point $(\boldsymbol{\omega}^{(0)}, \mathbf{z}^{(0)})$, the algorithm alternates between two optimizations of the integrated complete-data likelihood: optimization on $\mathbf{z}$ given $\boldsymbol{\omega}$, and maximization on $\boldsymbol{\omega}$ given $\mathbf{z}$. The algorithm is initialized as follows: $\omega_j^{(0)} = 1$ with probability 0.5, then $\mathbf{z}^{(0)}$ is the partition provided by a MAP rule associated to the model $(g, \boldsymbol{\omega}^{(0)})$ and to its MLE. Iteration $(s)$ of the algorithm is written as
Partition step: find $\mathbf{z}^{(s)}$ such that $p(\mathbf{x}, \mathbf{z}^{(s)} \mid g, \boldsymbol{\omega}^{(s-1)}) \geq p(\mathbf{x}, \mathbf{z}^{(s-1)} \mid g, \boldsymbol{\omega}^{(s-1)})$.

Model step: find $\boldsymbol{\omega}^{(s)} = \operatorname*{argmax}_{\boldsymbol{\omega}} p(\mathbf{x}, \mathbf{z}^{(s)} \mid g, \boldsymbol{\omega})$.

At iteration $(s)$, the model step consists in finding the vector $\boldsymbol{\omega}$ maximizing the integrated complete-data likelihood for the current partition $\mathbf{z}^{(s)}$. This optimization can be performed independently for each element $\omega_j$, thanks to the within component independence assumption. The partition step is more complex, hence $\mathbf{z}^{(s)}$ is only required to increase the value of the integrated complete-data likelihood for the current model. It is obtained by an iterative method initialized at the partition $\mathbf{z}^{(s-1)}$. Each iteration consists in sampling uniformly an individual, which is then affiliated to the component maximizing the integrated complete-data likelihood, while the other component memberships are unchanged (details are given in marbac2015variable). Like the EM algorithm, the proposed algorithm converges to a local optimum of the integrated complete-data likelihood, so many different initializations should be done.
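A schematic version of this alternate optimization could look as follows; log_icl(z, omega) is a hypothetical function returning the closed-form log integrated complete-data likelihood of Section 4.1, the single-observation moves only loosely mirror the partition step (a full random sweep instead of uniform sampling), and none of this is the packaged implementation.

    import numpy as np

    def maximize_micl(log_icl, n, d, g, n_sweeps=50, seed=0):
        """Alternate maximization of the log integrated complete-data likelihood
        over the partition z (n labels in {0, ..., g-1}) and the binary relevance
        vector omega (d entries)."""
        rng = np.random.default_rng(seed)
        omega = rng.integers(0, 2, size=d)   # omega_j = 1 with probability 0.5
        z = rng.integers(0, g, size=n)       # stand-in for the MAP-based initialization
        for _ in range(n_sweeps):
            # model step: each omega_j can be optimized separately
            # (done naively here by two full evaluations of the criterion)
            for j in range(d):
                omega[j] = 0
                val0 = log_icl(z, omega)
                omega[j] = 1
                val1 = log_icl(z, omega)
                omega[j] = 0 if val0 >= val1 else 1
            # partition step: single-observation moves that increase the criterion
            for i in rng.permutation(n):
                vals = []
                for k in range(g):
                    z[i] = k
                    vals.append(log_icl(z, omega))
                z[i] = int(np.argmax(vals))
        return z, omega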

5 Missing values

Data can contain missing values, so we denote by $O_i$ the indices of the variables for which $x_i$ is observed. Assuming that the data are missing at random, the pdf of $x_i$ is defined by the marginal pdf of the observed values, given by

$$p(x_i^{\mathrm{obs}} \mid m, \theta) = \Big( \prod_{j \in O_i:\, \omega_j = 0} p(x_{ij} \mid \alpha_{1j}) \Big) \sum_{k=1}^{g} \pi_k \prod_{j \in O_i:\, \omega_j = 1} p(x_{ij} \mid \alpha_{kj}). \qquad (15)$$

The EM algorithm maximizing the BIC can be used on data with missing values. Its M-step should consider only the observed values. Moreover, the within component independence avoids the computation of the conditional expected values of the missing observations at the E-step. The steps of this algorithm are detailed in Appendix B. Alternatively, variables can be selected according to the MICL. Note that this criterion is particularly relevant in this case, because it considers the number of missing values in the sample, while the penalty of the BIC neglects this quantity. Indeed, the integrated complete-data likelihood considers a number $n_j$ of observations per variable $j$ (the number of observations for which variable $j$ is observed), because

$$p(\mathbf{x}^{\mathrm{obs}}, \mathbf{z} \mid m) = p(\mathbf{z} \mid m) \prod_{j=1}^{d} p(\{x_{ij} : i \text{ such that } j \in O_i\} \mid \mathbf{z}, m). \qquad (16)$$

Each factor of (16) is the integral of (13) restricted to the observed entries of variable $j$; it keeps a closed form, detailed in Appendix C.
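For instance, with the convention (ours) that missing entries are encoded as NaN, the weighted estimates of a continuous variable at the M-step only involve its observed entries:

    import numpy as np

    def weighted_gaussian_mle_missing(xj, t):
        """Weighted Gaussian MLE of one continuous variable with missing values.
        xj : (n,) array with np.nan for missing entries, t : (n, g) fuzzy partition.
        Only the observed entries of variable j contribute to the estimates."""
        obs = ~np.isnan(xj)
        tw, xo = t[obs], xj[obs]
        nk = tw.sum(axis=0)                       # class weights on the observed entries
        mu = (tw * xo[:, None]).sum(axis=0) / nk
        var = (tw * (xo[:, None] - mu) ** 2).sum(axis=0) / nk
        return mu, var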

6 Numerical experiments

Implementation of the proposed method

Our method is implemented in the R package VarSelLCM. When the MICL is used, the hyper-parameters must be specified. For the proportions and the parameters of the categorical variables, we use the Jeffreys prior distributions (i.e., Dirichlet distributions with parameters equal to 1/2). Because non-informative Jeffreys prior distributions do not exist for the Gaussian mixture, the hyper-parameters are chosen to be fairly flat in the region where the likelihood is substantial and not much greater elsewhere. In the same spirit, we use flat hyper-parameters for the Poisson distribution. The purpose of these experiments is to show the relevance of selecting variables in clustering. Two families of information criteria are considered: the model-fitting criterion (BIC) and the clustering-task criteria. When we apply the clustering-task criteria, the ICL is used if there is no selection of the variables, while the MICL is used if variables are selected.

Simulation map

First, methods of variable selection are compared for model-based clustering of continuous data. Thus, we compare our approaches with the selection of variables by the deterministic algorithms maximizing the BIC implemented in the R package clustvarsel (Scr14). This package considers redundant variables and different covariance matrices. The optimization of the BIC is performed by two algorithms: forward and backward searches. Note that comprehensive comparisons of methods for selecting variables in continuous data clustering already exist (Cel14; marbac2015variable). Second, the impact of missing values is illustrated on simulated mixed data. Third, the benefits of the proposed approaches are illustrated on six real datasets. In this section, the methods are compared on the clustering task. Thus, the partitioning accuracy is measured with the Adjusted Rand Index (ARI, Hub85), because it permits the comparison between two partitions having possibly different numbers of classes. When it is close to one, the partitions are strongly similar, while they are strongly different when this index is close to zero.
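For reference, the ARI between an estimated and a reference partition can be computed with standard tools, e.g. scikit-learn (used here purely for illustration):

    from sklearn.metrics import adjusted_rand_score

    reference = [0, 0, 1, 1, 1, 2]
    estimated = [1, 1, 0, 0, 2, 2]   # labels and numbers of classes may differ
    print(adjusted_rand_score(reference, estimated))
    # close to 1 for strongly similar partitions, around 0 for unrelated ones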

6.1 Simulated data: continuous case

This experiment compares the methods of model selection for the clustering of continuous data. We generate 200 observations from a bi-component Gaussian mixture with equal proportions. Under component $k$, the relevant variables follow a Gaussian distribution whose mean depends on the component (the distance between the component means defines the class overlap) and whose covariance matrix has unit variances and a common correlation $\rho$ between the relevant variables.

We add noisy variables sampled from a standard Gaussian distribution $\mathcal{N}(0, 1)$. We consider different numbers of variables $d$ (10, 25, 50, 100), a theoretical misclassification rate of 5% and two values of $\rho$ (0 and 0.4). Thus, when $\rho = 0$, the model used for sampling the data belongs to the list of the competing models, while it does not belong to this list when $\rho = 0.4$. For each case, 20 replicates are generated.
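A sketch of such a data-generating process is given below; the number of relevant variables, the mean separation delta and the common correlation rho are our own knobs (delta would be tuned to reach the targeted 5% theoretical misclassification rate), so this is an approximation of the design rather than the exact protocol.

    import numpy as np

    def simulate_continuous(n=200, d=10, d_rel=6, delta=1.0, rho=0.0, seed=0):
        """Two balanced Gaussian components differing only on the first d_rel
        variables; delta sets the mean separation (hence the class overlap) and
        rho the common within-component correlation of the relevant variables.
        The remaining d - d_rel variables are standard Gaussian noise."""
        rng = np.random.default_rng(seed)
        z = rng.integers(0, 2, size=n)                          # balanced components
        cov = np.full((d_rel, d_rel), rho) + (1 - rho) * np.eye(d_rel)
        means = np.stack([np.zeros(d_rel), np.full(d_rel, delta)])
        x_rel = np.array([rng.multivariate_normal(means[k], cov) for k in z])
        x_noise = rng.standard_normal((n, d - d_rel))
        return np.hstack([x_rel, x_noise]), z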

For each replicate, we perform the clustering, with an unknown number of classes, according to a modelling criterion (BIC) and to a clustering criterion (ICL/MICL). Model selection is performed by considering a maximum number of components equal to three. The ARI is computed for each selected model, and the values are presented in Table 1.

                       BIC                                               ICL/MICL
 ρ     d     no selection   clustvarsel    clustvarsel    VarSelLCM      no selection   VarSelLCM
                            (forward)      (backward)
 0     10    0.78           0.53           0.71           0.78           0.78           0.78
       25    0.31           0.34           0.71           0.77           0.04           0.77
       50    0.00           0.13           0.04           0.80           0.00           0.80
       100   0.00           0.10           0.00           0.77           0.00           0.77
 0.4   10    0.78           0.52           0.69           0.72           0.78           0.78
       25    0.78           0.44           0.63           0.77           0.78           0.78
       50    0.50           0.30           0.03           0.79           0.08           0.80
       100   0.00           0.18           0.00           0.73           0.00           0.77
Table 1: Mean of the ARI obtained by different methods selecting the variables in a continuous data clustering.

For both criterion families, selecting the variables increases the clustering accuracy. Even if the data arise from a model with intra-component dependencies ($\rho = 0.4$), the proposed approach, which assumes within component independence, stays relevant. Indeed, it obtains a better ARI than the models implemented in clustvarsel. This phenomenon can be explained by different reasons. First, the information about the intra-component dependency vanishes when the number of irrelevant variables increases. Thus, the results of clustvarsel deteriorate when $d$ increases. Second, the BIC can suffer when $d$ increases, due to its approximation term in $O(1)$. Because clustvarsel considers a richer family of models, it can be more sensitive to this issue. Finally, the independence assumption permits finding the model maximizing the BIC, while a richer family of models implies a sub-optimal optimization of this criterion.

Table 2 gives information about the models selected by the competing methods (number of components and rate of relevant variables). It shows that, when no variable selection is performed, the estimated number of components tends to one when we handle a very large number of irrelevant variables. This problem is circumvented when the proposed procedure of variable selection is used with the BIC or the MICL. Moreover, this approach permits the detection of the role of the variables. Indeed, for $d = 10$, 25, 50 and 100, the rate of relevant variables (i.e., 0.60, 0.25, 0.13, 0.06) is found.

                   BIC                                                               ICL/MICL
            no sel.    clustvarsel      clustvarsel      VarSelLCM        no sel.    VarSelLCM
                       (forward)        (backward)
 ρ     d    g          g      rel.      g      rel.      g      rel.      g          g      rel.
 0     10   2.00       2.45   0.52      2.55   0.66      2.00   0.60      2.00       2.00   0.60
       25   1.40       2.80   0.22      2.40   0.50      2.00   0.25      1.05       2.00   0.24
       50   1.00       2.90   0.08      2.95   0.43      2.00   0.13      1.00       2.00   0.12
       100  1.00       2.95   0.04      3.00   0.72      2.00   0.06      1.00       2.00   0.06
 0.4   10   2.00       2.60   0.28      2.55   0.40      2.25   0.61      2.00       2.00   0.60
       25   2.00       2.85   0.17      2.65   0.40      2.05   0.25      2.00       2.00   0.24
       50   1.65       2.85   0.12      2.85   0.44      2.05   0.13      1.10       2.00   0.12
       100  1.00       2.95   0.04      3.00   0.70      2.15   0.06      1.00       2.00   0.06
Table 2: Mean of the number of components (g) and rate of relevant variables (rel.) obtained by the different methods selecting the variables in a continuous data clustering.

6.2 Simulated data: mixed case

This experiment evaluates the benefits of variable selection for clustering mixed data with missing values, when either a modelling criterion (BIC) or a clustering criterion (ICL/MICL) is used. We generate 200 observations from a bi-component model with equal proportions and assuming within component independence. We consider six variables (two continuous, two integer and two binary). Under component $k$, the univariate margins are defined by component-specific parameters chosen so as to fix the theoretical misclassification error at 10%. Noisy variables are added (equal numbers of continuous, integer and binary variables), then missing values are added randomly. Thus, 20 replicates are generated for different numbers of variables $d$ (12, 24, 48) and different rates of missing values (0%, 10% and 20%). Table 3 presents the results.

                        BIC                                      ICL/MICL
 missing           no selection    selection               no selection    selection
 values      d     ARI    g        ARI    g      rel.      ARI    g        ARI    g      rel.
 0%          12    0.42   1.90     0.57   2.00   0.48      0.25   1.40     0.34   1.60   0.78
             24    0.52   1.90     0.61   2.00   0.26      0.46   1.80     0.50   2.05   0.41
             48    0.35   1.60     0.59   2.00   0.14      0.33   1.95     0.39   2.00   0.22
 10%         12    0.29   2.00     0.51   2.00   0.47      0.16   1.30     0.19   1.40   0.82
             24    0.43   2.20     0.55   2.00   0.26      0.32   1.55     0.38   1.80   0.60
             48    0.16   2.00     0.52   2.00   0.13      0.13   1.80     0.17   2.05   0.38
 20%         12    0.12   2.10     0.43   2.00   0.44      0.05   1.10     0.05   1.10   0.94
             24    0.20   2.30     0.48   2.00   0.24      0.13   1.30     0.15   1.40   0.79
             48    0.08   2.00     0.41   2.00   0.12      0.05   1.85     0.09   2.05   0.35
Table 3: Results of the different approaches to cluster mixed data, considering different numbers of variables and rates of missing values (misclassification rate of 10%): mean of the ARI (ARI), mean of the number of components (g) and mean of the rate of relevant variables (rel.).

Results show that selecting the variables increases the clustering performances for both types of criteria. Indeed, the values of the ARI are improved when variables are selected, especially when the clustering criteria are used. Moreover, the true number of components ($g = 2$) is more often found when variables are selected. Finally, clustering interpretation is facilitated, because only a subset of the observed variables characterizes the classes. As expected, when missing values are added, the results deteriorate. However, the results are more impacted when all the variables are used to cluster. Note that the BIC obtains better results for detecting the role of the variables. Indeed, the rate of discriminative variables is 0.50, 0.25 and 0.125 when $d$ is equal to 12, 24 and 48 respectively. For this simulation, the overall behaviour of the BIC criterion is better than that of the MICL. This phenomenon is explained by the quite large class overlap. It is known that information criteria based on the integrated complete-data likelihood work better when the classes are well separated. To illustrate this phenomenon, we perform a similar simulation by considering a theoretical misclassification rate of 5% and 20% of missing values. Table 4 shows the results obtained when the variables are selected according to the BIC and the MICL. In this case, both criteria obtain equivalent results for the ARI, the number of components and the detection of the role of the variables. In this section, the model used for sampling the data belongs to the list of the competing models. This favours the BIC criterion. The next section shows that the MICL is at least as relevant as the BIC for selecting the variables when real data are analysed.

        BIC                      MICL
 d      ARI    g      rel.       ARI    g      rel.
 12     0.69   2.00   0.50       0.68   2.00   0.49
 24     0.69   2.00   0.27       0.65   2.00   0.30
 48     0.70   2.00   0.13       0.62   2.00   0.16
Table 4: Criterion comparison for selecting the variables in mixed data clustering (misclassification rate of 5% and 20% of missing values): mean of the ARI (ARI), mean of the number of components (g) and mean of the rate of relevant variables (rel.).

6.3 Method comparison on real data

We now compare the competing methods on six real datasets presented in Table 5.

Name            n      d      Classes   Reference       R package/website
Birds           69     5      2         Bretagnolle07   Rmixmod
Banknote        200    6      2         Flu88           VarSelLCM
Coffee          43     12     2         Str73           ppgm
Statlog-Heart   270    13     2         Brown04         UCI-database
Congress        435    16     2         Schlimmer:87    UCI-database
Golub           83     3051   2         Gol99           multtest
Table 5: Information about the benchmark datasets: number of observations (n), number of variables (d), number of classes, reference and availability.

Table 6 presents the results obtained without variable selection (No selection), with a variable selection according to the BIC (BIC-selection) and with a variable selection according to the MICL (MICL-selection). For three datasets (Birds, Coffee, Banknote), the ARIs obtained by the three approaches are equal. However, selecting the variables facilitates the clustering interpretation. For example, on the Coffee dataset, the selection with the MICL perfectly detects the clusters with only 42% of the variables. For the three other datasets, selecting the variables increases the ARI.

Dataset          No selection      BIC-selection      MICL-selection
                 ARI     rel.      ARI     rel.       ARI     rel.
Birds            0.50    1.00      0.50    0.60       0.50    0.60
Banknote         0.96    1.00      0.96    0.83       0.96    0.83
Coffee           1.00    1.00      1.00    0.67       1.00    0.42
Statlog-Heart    0.25    1.00      0.31    0.69       0.33    0.69
Congress         0.56    1.00      0.57    0.88       0.57    0.88
Golub            0.53    1.00      0.70    0.38       0.79    0.18
Table 6: Results obtained on the real datasets with known number of components: ARI and proportion of relevant variables (rel.).

In many applications, the number of classes is unknown. Thus, we perform the clustering, with an unknown number of classes, according to a modelling criterion (BIC) and to a clustering criterion (ICL/MICL). For both approaches, the clustering is done with and without selection of the variables. Table 7 presents the results obtained for the real datasets. For both approaches, the selection of the variables increases the ARI for almost all the datasets. The only case where the ARI is deteriorated by a selection of variables is the clustering of the Golub dataset with the BIC. This dataset is really challenging because it is composed of many variables (3051) and few observations (83). This large number of variables implies a huge number of competing models. In this case, the BIC can lead to poor results because of its approximation implying a term in $O(1)$. For this type of dataset, the exact criteria (implying no approximation) are more relevant. Thus, the clustering with variable selection according to the MICL criterion finds the true number of components (2) and a relevant partition.

                     BIC                                        ICL/MICL
Dataset          No selection    Selection                  No selection    Selection
                 ARI     g       ARI     g       rel.       ARI     g       ARI     g       rel.
Birds            0.50    2       0.50    2       0.60       0.50    2       0.50    2       0.60
Banknote         0.48    4       0.48    4       1.00       0.61    3       0.61    3       1.00
Coffee           0.38    3       0.38    3       0.67       1.00    2       1.00    2       0.42
Statlog-Heart    0.25    2       0.31    2       0.69       0.25    2       0.33    2       0.69
Congress         0.40    4       0.46    4       0.88       0.47    5       0.47    5       0.94
Golub            0.53    2       0.32    4       0.32       0.00    1       0.79    2       0.18
Table 7: Results of the method comparison with unknown number of components: ARI, best number of components (g) and proportion of relevant variables (rel.).

7 Discussion

We have proposed a new model-based approach for selecting variables in a cluster analysis of mixed data with missing values. The purpose of the variable selection is to increase the accuracy of the model fitting and to facilitate its interpretation. The model at hand assumes within component independence. This assumption is relevant, because variable selection mainly matters when many variables are observed. Moreover, the numerical experiments have shown some robustness to model misspecification. The within component independence assumption permits the maximization of the BIC and of the MICL. The first criterion performs the selection of the variables and the clustering with a model-fitting purpose. The second criterion achieves these objectives with a clustering purpose. Both criteria have provided relevant results in the numerical experiments. Considering a dataset composed of several variables but very few observations, the MICL should be preferred to the BIC, because it does not imply any approximation. However, if many observations are available, the maximization of the MICL can be time consuming (in practice, it is doable for moderate values of $n$). Thus, the BIC could be preferred in this case, because the issues due to its $O(1)$ term vanish when $n$ tends toward infinity.

Finally, this approach could be extended to perform a more elaborate variable selection. Indeed, by using the approach of Mau09, a group of redundant variables could be considered.

References

Appendix A Details on the closed-form of the integrated complete-data log-likelihood

To compute the integrated complete-data log-likelihood, we give the value of each per-block integral in (13) for any type of variable (continuous, integer and categorical).

  • If variable $j$ is continuous, then

    where , , , , , , , and .

  • If variable $j$ is integer, then

    where , , and .

  • If variable $j$ is categorical with $m_j$ levels, then
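For completeness, the standard conjugate marginal likelihoods on which such closed forms rely can be written as follows for a block of N observations of variable j (generic hyper-parameters a, b, c, e of our own choosing, not necessarily the notation used by the authors):

    % Categorical variable with m_j levels, prior Dirichlet(a, ..., a), level counts N_1, ..., N_{m_j}:
    \int \prod_{h=1}^{m_j} \tau_h^{N_h}\, p(\tau)\, d\tau
      = \frac{\Gamma(m_j a)}{\Gamma(a)^{m_j}}\,
        \frac{\prod_{h=1}^{m_j} \Gamma(N_h + a)}{\Gamma(N + m_j a)}.

    % Integer variable with Poisson rate lambda, prior Gamma(a, b):
    \int \prod_{i=1}^{N} \frac{\lambda^{x_i} e^{-\lambda}}{x_i!}\, p(\lambda)\, d\lambda
      = \frac{b^{a}}{\Gamma(a)}\,
        \frac{\Gamma\big(a + \sum_{i} x_i\big)}{(b + N)^{a + \sum_{i} x_i}}\,
        \prod_{i=1}^{N} \frac{1}{x_i!}.

    % Continuous variable with Gaussian margin, prior sigma^2 ~ IG(b, c) and mu | sigma^2 ~ N(e, sigma^2 / a):
    \int \prod_{i=1}^{N} \phi(x_i \mid \mu, \sigma^2)\, p(\mu, \sigma^2)\, d\mu\, d\sigma^2
      = (2\pi)^{-N/2} \sqrt{\frac{a}{a + N}}\,
        \frac{\Gamma(b + N/2)}{\Gamma(b)}\,
        \frac{c^{b}}{\Big(c + \tfrac{1}{2}\sum_{i}(x_i - \bar{x})^2
            + \tfrac{a N (\bar{x} - e)^2}{2(a + N)}\Big)^{b + N/2}}.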

Appendix B EM algorithm to optimize the BIC criterion for data with missing values

The EM algorithm starts at an initial point $(m^{(0)}, \theta^{(0)})$, with $\boldsymbol{\omega}^{(0)}$ randomly sampled, and its iteration is composed of two steps:
E-step Computation of the fuzzy partition $t_{ik}^{(r)}$, using only the observed values of observation $i$ (i.e., the variables $j \in O_i$).

M-step Maximization of the expectation of the penalized complete-data log-likelihood over $(\boldsymbol{\omega}, \theta)$, where the estimates of the parameters of variable $j$ are computed on the observations for which variable $j$ is observed.

Appendix C Details on the closed-form of the integrated complete-data log-likelihood for data with missing values

To compute the integrated complete-data log-likelihood for data with missing values, we give the value of each per-block integral for any type of variable (continuous, integer and categorical) containing missing values.

  • If variable $j$ is continuous, then

    where , , , , , , , , and .

  • If variable $j$ is integer, then

    where , , , and .

  • If variable $j$ is categorical with $m_j$ levels, then
