Clustering and Variable Selection in the Presence of Mixed Variable Types and Missing Data

# Clustering and Variable Selection in the Presence of Mixed Variable Types and Missing Data

C. B. Storlie, S. M. Myers, S. K. Katusic, A. L. Weaver, R. Voigt, R. C. Colligan, P. E. Croarkin, R. E. Stoeckel, J. D. Port
###### Abstract

We consider the problem of model-based clustering in the presence of many correlated, mixed continuous and discrete variables, some of which may have missing values. Discrete variables are treated with a latent continuous variable approach and the Dirichlet process is used to construct a mixture model with an unknown number of components. Variable selection is also performed to identify the variables that are most influential for determining cluster membership. The work is motivated by the need to cluster patients thought to potentially have autism spectrum disorder (ASD) on the basis of many cognitive and/or behavioral test scores. There are a modest number of patients (480) in the data set along with many (100) test score variables (many of which are discrete valued and/or missing). The goal of the work is to (i) cluster these patients into similar groups to help identify those with similar clinical presentation, and (ii) identify a sparse subset of tests that inform the clusters in order to eliminate unnecessary testing. The proposed approach compares very favorably to other methods via simulation of problems of this type. The results of the ASD analysis suggested three clusters to be most likely, while only four test scores had high () posterior probability of being informative. This will result in much more efficient and informative testing. The need to cluster observations on the basis of many correlated, continuous/discrete variables with missing values, is a common problem in the health sciences as well as in many other disciplines.

Some key words: Model-Based Clustering; Dirichlet Process; Missing Data; Hierarchical Bayesian Modeling; Mixed Variable Types; Variable Selection.

\@xsect

Model-based clustering has become a very popular means for unsupervised learning (Fraley and Raftery, 2002; Basu and Chib, 2003; Quintana and Iglesias, 2003; Tadesse et al., 2005). This is due in part to the ability to use the model likelihood to inform, not only the cluster membership, but also the number of clusters which has been a heavily researched problem for many years. The most widely used model-based approach is the normal mixture model which is not suitable for mixed continuous/discrete variables. For example, this work is motivated by the need to cluster patients thought to potentially have autism spectrum disorder (ASD) on the basis of many correlated test scores. There are a modest number of patients (480) in the training data set along with many () test score/self-report variables, many of which are discrete valued or have left or right boundaries. Figure 1 provides a look at the data across three of the variables; Beery_standard is discrete valued and ABC_irritability is continuous, but with significant mass at the left boundary of zero. The goals of this problem are to (i) cluster these patients into similar groups to help identify those with similar clinical presentation, and (ii) identify a sparse subset of tests that inform the clusters in an effort to eliminate redundant testing. This problem is also complicated by the fact that many patients in the training data have missing test scores. The need to cluster incomplete observations on the basis of many correlated continuous/discrete variables is a common problem in the health sciences as well as in many other disciplines.

When clustering in high dimensions, it becomes critically important to use some form of dimension reduction or variable selection to achieve accurate cluster formation. A common approach to deal with this is a principle components or factor approach (e.g., Liu et al., 2003). However, such a solution does not address goal (ii) above for the ASD clustering problem. The problem of variable selection in regression or conditional density estimation has been well studied from both the penalization (see, e.g., Tibshirani, 1996; Zou and Hastie, 2005; Lin and Zhang, 2006) and Bayesian perspectives (George and McCulloch, 1993; Reich et al., 2009; Chung and Dunson, 2012). However, variable selection in clustering is more challenging than that in regression as there is no response to guide (supervise) the selection. Still, there have been several articles considering this topic. For example, Raftery and Dean (2006) propose a partition of the variables into informative (dependent on cluster membership even after conditioning on all of the other variables) and non-informative (conditionally independent of cluster membership given the values of the other variables). They use BIC to accomplish variable selection with a greedy search. Similar approaches are used by Maugis et al. (2009) and Fop et al. (2015). The popular LASSO or L1 type penalization has also been applied to shrink cluster means together for variable selection (Pan and Shen, 2007; Wang and Zhu, 2008; Xie et al., 2008). There have also been several approaches developed for sparse K-means and distance based clustering (Friedman and Meulman, 2004; Hoff, 2006; Witten and Tibshirani, 2012).

In the Bayesian literature Tadesse et al. (2005) consider variable selection in the finite normal mixture model using reversible jump (RJ) Markov chain Monte Carlo (MCMC) (Richardson and Green, 1997). Kim et al. (2006) extend that work to the nonparametric Bayesian mixture model via the Dirichlet process model (DPM) (Ferguson, 1973; Neal, 2000; Teh et al., 2006; Lid Hjort et al., 2010). The DPM has the advantage of allowing for a countably infinite number of possible components, while providing a posterior distribution for how many components have been observed in the data set at hand. Both Tadesse et al. (2005) and Kim et al. (2006) use a point mass prior to achieve sparse representation of the informative variables. However, for convenience they assume all non-informative variables are (unconditionally) independent of the informative variables. This assumption is frequently violated in practice and it is particularly problematic in the case of the ASD analysis as it would force far too many variables to be included into the informative set as is demonstrated later in this paper.

There is not a generally accepted best practice to clustering with mixed discrete and continuous variables. Hunt and Jorgensen (2003) and Murray and Reiter (2016) meld mixtures of independent multinomials for the categorical variables and mixtures of Gaussian for the continuous variables. However, it may not be desirable for the dependency between the discrete variables to be entirely represented by mixture components when clustering is the primary objective. As pointed out in Hennig and Liao (2013), mixture models can approximate any distribution arbitrarily well so care must be taken to ensure the mixtures fall in line with the goals of clustering. When using mixtures of Gaussian combined with independent multinomials, a data set with many correlated discrete variables will tend to result in more clusters than a comparable dataset with mostly continuous variables. A discrete variable measure of some quantity instead of the continuous version could therefore result in very different clusters. Thus, a Gaussian latent variable approach (Muthen, 1983; Dunson, 2000) would seem more appropriate for treating discrete variables when clustering is the goal. An observed ordinal variable , for example, is assumed to be the result of thresholding a latent Gaussian variable . For binary variables, this reduces to the multivariate probit model (Lesaffre and Molenberghs, 1991; Chib and Greenberg, 1998). There are also extensions of this approach to allow for unordered categorical variables.

In this paper, we propose a Bayesian nonparametric approach to perform simultaneous estimation of the number of clusters, cluster membership, and variable selection while explicitly accounting for discrete variables and partially observed data. To the best of our knowledge, this is the first model-based clustering approach to allow for this complex but common data structure. The discrete variables as well as continuous variables with boundaries are treated with a Gaussian latent variable approach. The informative variable construct of Raftery and Dean (2006) for normal mixtures is then adopted. However, in order to effectively handle the missing values and account for uncertainty in the variable selection and number of clusters, the proposed model is cast in a fully Bayesian framework via the Dirichlet process. This is then similar to the work of Kim et al. (2006), however, they did not consider discrete variables or missing data. Further, a key result of this paper is a solution to allow for dependence between informative and non-informative variables in the nonparametric Bayesian mixture model. This solution takes a particularly simple form and also provides an intuitive means with which to define the prior distribution in a manner that decreases prior sensitivity. The component parameters are marginalized out to facilitate more efficient MCMC sampling via a modified version of the split-merge algorithm of Jain and Neal (2004). Finally, missing data is then handled in a principled manner by treating missing values as unknown parameters in the Bayesian framework (see, e.g., Storlie et al., 2015, 2017). This approach implicitly assumes a missing at random (MAR) mechanism (Rubin, 1976), which implies that the likelihood of a missing value can depend on the value of the unobserved variable(s), marginally, just not after conditioning on the observed variables.

The rest of the paper is laid out as follows. Section Clustering and Variable Selection in the Presence of Mixed Variable Types and Missing Data describes the proposed nonparametric Bayesian approach to clustering with mixed discrete and continuous variables with variable selection. Section Clustering and Variable Selection in the Presence of Mixed Variable Types and Missing Data evaluates the performance of this approach when compared to other methods on several simulation cases. The approach is then applied to the problem for which it was designed in Section Clustering and Variable Selection in the Presence of Mixed Variable Types and Missing Data where a comprehensive analysis of the ASD problem is presented. Section Clustering and Variable Selection in the Presence of Mixed Variable Types and Missing Data concludes the paper. This paper also has supplementary material which contains likelihood derivations, full exposition of the MCMC algorithm used to fit the model, and MCMC trace plots.

\@xsect\@xsect

As discussed above, the proposed model for clustering uses mixture distributions with a countably infinite number of components via the Dirichlet process prior (Ferguson, 1973; Escobar and West, 1995; MacEachern and Müller, 1998). Let be a -variate random vector and let , , denote the observation of . It is assumed that are independent random vectors coming from distribution . The model parameters are assumed to come from a mixing distribution which has a Dirichlet process prior, i.e., the familiar model,

 \boldmathyi∣θi∼F(θi),θi∼G,G∼DP(G0,α), (0)

where DP represents a Dirichlet Process distribution, is the base distribution and is a precision parameter, determining the concentration of the prior for about (Escobar and West, 1995). The prior distribution for in terms of successive conditional distributions is obtained by integrating over , i.e.,

 θi ∣ θ1,…,θi−1∼1i−1+αi−1∑i′=1δ(θi′)+αi−1+αG0,

where is a point mass distribution at . The representation in (Clustering and Variable Selection in the Presence of Mixed Variable Types and Missing Data) makes it clear that (Clustering and Variable Selection in the Presence of Mixed Variable Types and Missing Data) can be viewed as a countably infinite mixture model. Alternatively, let denote the unique values of the and let be the index for the component to which observation belongs, i.e., so that . The following model (Neal, 2000) is equivalent to (Clustering and Variable Selection in the Presence of Mixed Variable Types and Missing Data)

 P(ϕi=m∣ϕ1,…,ϕi−1)=⎧⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎩\ignorespaces1if i=1 and m=1.ni,mi−1+αif ϕi′=m for any i′

with , and is the number of for . Thus, a new observation is allocated to an existing cluster with probability proportional to the cluster size or it is assigned to a new cluster with probability proportional to . This is often called the Chinese restaurant representation of the Dirichlet process. It is often assumed that is a Gaussian distribution in which case describes the mean and covariance for the component. This results in a normal mixture model with a countably infinite number of components.

\@xsect

Normal mixture models are not effective for clustering when some of the variables are too discretized as demonstrated in Section Clustering and Variable Selection in the Presence of Mixed Variable Types and Missing Data. This is also a problem when the data have left or right boundaries that can be achieved (e.g., several people score the minimum or maximum on a test). However, a Gaussian latent variable approach can be used to circumvent these issues. Suppose that variables for are discrete, ordinal variables taking on possible values and that for are continuous variables with lower and upper limits of and , which could be infinite. Assume for some latent, -variate, continuous random vector that

 yj=⎧⎨⎩\ignorespaces∑Ljl=1dj,lI{aj,l−1cj}for j∈C\endlinenomath\vspace−.0in (0)

where is the indicator function equal to 1 if and 0 otherwise, , , and for . That is, the discrete are the result of thresholding the latent variable on the respective cut-points. The continuous variables are simply equal to the unless the cross the left or right boundary of what can be observed for . That is, if there are finite limits for , then is assumed to be a left and/or right censored version of , thus producing a positive mass at the boundary values of .

A joint mixture model for mixed discrete and continuous variables can then be represented as

 \boldmathzi∣ϕi,Ω∼N(\boldmathμϕi,\boldmathΣϕi), (0)

with prior distributions for and as in (Clustering and Variable Selection in the Presence of Mixed Variable Types and Missing Data).

Binary such as gender can be accommodated by setting . However, if there is only one cut-point then the model must be restricted for identifiability (Chib and Greenberg, 1998); namely, if is binary, then we must set . The restriction that for binary complicates posterior inference, however, this problem has been relatively well studied in the multinomial probit and multivariate probit settings and various proposed solutions exist (McCulloch et al., 2000; Imai and van Dyk, 2005). It is also straight-forward to use the latent Gaussian variable approach to allow for unordered categorical variables (McCulloch et al., 2000; Imai and van Dyk, 2005; Zhang et al., 2008; Bhattacharya and Dunson, 2012), however, inclusion of categorical variables also complicates notation and there are no such variables in the ASD data. For brevity, attention is restricted here to continuous and ordinal discrete variables.

\@xsect

Variable selection in clustering problems is more challenging than in regression problems due to the lack of targeted information with which to guide the selection. Using a model-based clustering approach allows a likelihood based approach to model selection, but exactly how the parameter space should be restricted when a variable is “out of the model” requires some care. Raftery and Dean (2006) defined a variable to be non-informative if conditional on the values of the other variables, it is independent of cluster membership. This implies that a non-informative may still be quite dependent on cluster membership through its dependency with other variables. They assumed a Gaussian mixture distribution for the informative variables, with a conditional Gaussian distribution for the non-informative variables and used maximum likelihood to obtain the change in BIC between candidate models. Thus, they accomplished variable selection with a greedy search to minimize BIC. They further considered restricted covariance parameterizations to reduce the parameter dimensionality (e.g., diagonal, common volume, common shape, common orientation, and combinations of these restrictions). We instead take a Bayesian approach to this problem via Stochastic Search Variable Selection (SSVS) (George and McCulloch, 1993, 1997) as this allows for straight-forward treatment of uncertainty in the selected variables and that due to missing values. Kim et al. (2006) used such an approach with a DPM for infinite normal mixtures, however, due to the difficulty imposed they did not use the same definition as Raftery and Dean (2006) for a non-informative variable. They defined a non-informative variable to be one that is (unconditionally) independent of cluster membership and all other variables. This is not reasonable in many cases, particularly in the ASD problem, and can result in negative consequences as seen in Section Clustering and Variable Selection in the Presence of Mixed Variable Types and Missing Data. Below, we layout a more flexible model specification akin to that taken in Raftery and Dean (2006) to allow for dependence between informative and non-informative variables in the nonparametric Bayesian mixture model.

Let the informative variables be represented by the model , a vector of binary values such that is the set of informative variables. A priori it is assumed that . Without loss of generality assume that has elements ordered such that , with and , and similarly for and . The model in (Clustering and Variable Selection in the Presence of Mixed Variable Types and Missing Data) becomes,

 \boldmathzi∣\boldmathγ,ϕi,Ω∼N(\boldmathμϕi,\boldmathΣϕi),\vspace−.12in (0)

with

 \boldmathμm=(\ignorespaces\boldmathμm1\boldmathμm2\endlinenomath),\boldmath% Σm=(\ignorespaces\boldmathΣm11\boldmathΣm12\boldmathΣm21\boldmathΣm22\endlinenomath). (0)

Then from standard multivariate normal theory, with and . Now in order for the non-informative variables to follow the definition of Raftery and Dean (2006), the and must be parameterized so that do not depend on . In order to accomplish this, it is helpful to make use of the canonical parameterization of the Gaussian (Rue and Held, 2005),

 \boldmathz∣\boldmathγ,Ω,ϕ=m∼NC(\boldmathbm,\boldmathQm),

with precision and . Partition the canonical parameters as,

 \boldmathbm=(\ignorespaces\boldmathbm1\boldmathb2\endlinenomath),\boldmathQ% m=(\ignorespaces\boldmathQm11% \boldmathQ12\boldmathQ21\boldmathQ22\endlinenomath). (0)

Result 1. The parameterization in Clustering and Variable Selection in the Presence of Mixed Variable Types and Missing Data results in that does not depend on .

Proof. The inverse of a partitioned matrix directly implies that , which does not depend on . It also implies that , and substituting for in gives , which also does not depend on .

Now the problem reduces to defining a prior distribution for , i.e., , , conditional on the model , that maintains the form of (Clustering and Variable Selection in the Presence of Mixed Variable Types and Missing Data). Let and . The prior distribution for will be defined first unconditionally for and then for , conditional on . There are several considerations in defining these distributions: (i) the resulting must be positive definite, (ii) it is desirable for the marginal distribution of to remain unchanged for any model to limit the influence of the prior for on variable selection, and (iii) it is desirable for them to be conjugate to facilitate MCMC sampling (Neal, 2000; Jain and Neal, 2004). Let be a positive definite matrix, partitioned just as , and for a given assume the following distribution for ,

 \ignorespaces\boldmathQ22∼W(\boldmathΨ−122∣1,η),\boldmathb2∣\boldmathQ22∼N(\boldmath0,1λ\boldmathQ22),\boldmathQ21∣\boldmathQ22∼MN(−\boldmathQ22\boldmathΨ21% \boldmathΨ−111,\boldmathQ22,\boldmathΨ−111),\endlinenomath (0)

where denotes the Wishart distribution, and denotes the matrix normal distribution. The distribution of , conditional on is then defined implicitly below in terms of ,

 \boldmathΣm11\lx@stackreliid∼W−1(\boldmathΨ11,η−p2),\boldmathμm1∣\boldmathΣm11\lx@stackrelind∼N(\boldmath0,1λ\boldmathΣm11), (0)

where denotes the inverse-Wishart distribution and are independent of . This is not to say that is independent of , rather the distribution imposed on via (Clustering and Variable Selection in the Presence of Mixed Variable Types and Missing Data) and (Clustering and Variable Selection in the Presence of Mixed Variable Types and Missing Data) is quite dependent on via the relations, , and .

Result 2. The prior distribution defined in Clustering and Variable Selection in the Presence of Mixed Variable Types and Missing Data and Clustering and Variable Selection in the Presence of Mixed Variable Types and Missing Data results in a marginal distribution for of , i.e., the same normal-inverse-Wishart regardless of .

Proof. It follows from Theorem 3 of Bodnar and Okhrin (2008) that . It remains to show . However, according to (Clustering and Variable Selection in the Presence of Mixed Variable Types and Missing Data) and (Clustering and Variable Selection in the Presence of Mixed Variable Types and Missing Data) and the independence assumption,

 (\ignorespaces\boldmathμm1\boldmathb2\endlinenomath)∣∣∣\boldmathΣm∼N((\ignorespaces% \boldmath0\boldmath0\endlinenomath),1λ(\ignorespaces\boldmathΣm11\boldmath% 0\boldmath0\boldmathQ22\endlinenomath)).

Also, implies,

 (\ignorespaces\boldmathμm1\boldmathμm2\endlinenomath)=(\ignorespaces\boldmathI\boldmath0−\boldmathQ−122\boldmathQ21\boldmathQ−122\endlinenomath)(\ignorespaces\boldmathμm1\boldmathb2\endlinenomath).

Using the relation for gives the desired result.

As mentioned above, the normal-inverse-Wishart distribution is conjugate for in the unrestricted (no variable selection) setting. It turns out that the distribution defined in (Clustering and Variable Selection in the Presence of Mixed Variable Types and Missing Data) and (Clustering and Variable Selection in the Presence of Mixed Variable Types and Missing Data) is conjugate for the parameterization in (Clustering and Variable Selection in the Presence of Mixed Variable Types and Missing Data) as well, so that the component parameters can be integrated out of the likelihood. Let the (latent) observations be denoted as , and the data likelihood as .

Result 3. The marginal likelihood of is given by

 f(\boldmathZ∣\boldmathγ,\boldmathϕ)=∫f(\boldmathZ∣\boldmathγ,\boldmathϕ,Ω)f(Ω∣\boldmathγ)dΩ\vspace−.19in

where (i) , i.e., the number of observed components, (ii) is the number of informative variables, (iii) , (iv) is the number of , (v) is the multivariate gamma function, and (vi) , , are defined as,

 \boldmathVm11=∑ϕi=m(\boldmathz(1)i−¯\boldmathzm1)(\boldmathz(1)i−¯\boldmathzm1)′+nmλnm+λ¯\boldmathzm1¯\boldmathz′m1+\boldmathΨ11,\vspace−.22in
 \boldmathV11=n∑i=1(\boldmathz(1)i−¯\boldmathz1)(\boldmathz(1)i−¯\boldmathz1)′+nλn+λ¯\boldmathz1¯\boldmathz′1+%\boldmath$Ψ$11,\boldmathV2∣1=% \boldmathV22−\boldmathV21\boldmathV−111\boldmathV′21,

with , , ,

 \boldmathV22 = n∑i=1(\boldmathz(2)i−¯\boldmathz2)(\boldmathz(2)i−¯% \boldmathz2)′+nλn+λ¯% \boldmathz2¯\boldmathz′2+\boldmathΨ22,and (0) \boldmathV21 = n∑i=1(\boldmathz(2)i−¯\boldmathz2)(\boldmathz(1)i−¯% \boldmathz1)′+nλn+λ¯% \boldmathz2¯\boldmathz′1+\boldmathΨ21.

The derivation of Result 3 is provided in the Supplementary Material.

\@xsect

Kim et al. (2006) found there to be a lot of prior sensitivity due to the choice of prior for the component parameters. This is in part due to the separate prior specification for the parameters corresponding to informative and non-informative variables, respectively. The specification above treats all component parameters collectively, in a single prior, so that the choice will not be sensitive to the interplay between the priors chosen for informative and non-informative variables. A further stabilization can be obtained by rationale similar to that used in Raftery and Dean (2006) for restricted forms of the covariance (such as equal shape, orientation, etc.). We do not enforce such restrictions exactly, but one might expect the components to have similar covariances or similar means for some of the components. Thus it makes sense to put hierarchical priors on , , and , to encourage such similarity if warranted by the data. A Gamma prior is also placed on the concentration parameter , i.e.,

 \ignorespacesλ∼Gamma(Aλ,Bλ),η−(p+1)∼Gamma(Aη,Bη),\boldmathΨ∼W(\boldmathP,N),α∼Gamma(Aα,Bα).\endlinenomath (0)

In the analyses below, relatively vague priors were used with . The prior for was set to , , to encourage anywhere from 1 to 10 clusters from 100 observations. The results will still have some sensitivity to the choice of . In addition, there are well known issues with using a Wishart prior on variables of differing scale. In order to alleviate these issues, we recommend first standardizing the columns of the data to mean zero and unit variance, then using , . Finally, the prior probability for variable inclusion was set to for all . The data model defined in (Clustering and Variable Selection in the Presence of Mixed Variable Types and Missing Data) and (Clustering and Variable Selection in the Presence of Mixed Variable Types and Missing Data), the component prior distribution defined in (Clustering and Variable Selection in the Presence of Mixed Variable Types and Missing Data) and (Clustering and Variable Selection in the Presence of Mixed Variable Types and Missing Data), along with the hyper-priors in (Clustering and Variable Selection in the Presence of Mixed Variable Types and Missing Data), completes the model specification.

\@xsect

Complete MCMC details are provided in the Supplementary Material. However, an overview is provided here to illustrate the main idea. The complete list of parameters to be sampled in the MCMC are , where contains any latent element of (i.e., either corresponding to missing data, discrete variable, or boundary/censored observation). Because the component parameters are integrated out, the can be updated with simple Gibbs sampling (Neal, 2000), however, this approach has known mixing issues (Jain and Neal, 2004; Ishwaran and James, 2011). Thus, a modified split-merge algorithm (Jain and Neal, 2004) similar to that used in (Kim et al., 2006) was developed to sample from the posterior distribution of . The remaining parameters are updated in a hybrid Gibbs, Metropolis Hastings (MH) fashion. The vector is updated with MH by proposing an add, delete, or swap move (George and McCulloch, 1997). The are block updated, each with a MH step, but with a proposal that looks almost conjugate, and is therefore accepted with high probability; the block size can be adjusted to trade-off between acceptance and speed (e.g., acceptance %). A similar strategy is taken with the update, i.e., a nearly conjugate update is proposed and accepted/rejected via an MH step. The parameters have standard MH random walk updates on log-scale. The MCMC routine then consists of applying each of the above updates in turn to complete a single MCMC iteration, with the exception that the update be applied times each iteration.

Two modifications were also made to the above computational strategy to improve mixing. The algorithm above would at times have trouble breaking away from local modes when proposing and updates separately. Thus, an additional joint update is proposed for and each iteration. Also, as described in more detail in the Supplementary Material, the traditional split merge algorithm proposes an update by first selecting two points, and , at random. If they are from the same cluster (according to the current ) it then assigns them to separate clusters and assigns the remaining points from that cluster to each of the two new clusters at random. It then conducts several () restricted (to one of the two clusters) Gibbs sampling updates to the remaining from the original cluster. The resulting then becomes the proposal for a split move. We found that the following adjustment resulted in much higher acceptance of split/merge moves. Instead of assigning the remaining points to the two clusters at random, simply assign them to the closest of the two observations or . Then conduct restricted Gibbs sample updates to produce the proposal. We found very little performance gain beyond .

It would also be possible to instead use a finite mixture approximation and sample via the kernel stick breaking representation of a DPM (Sethuraman, 1994; Ishwaran and James, 2011) to alleviate the slow mixing concerns with Gibbs updates for the . However, this approach would be complicated by the dependency between and the structure and dimensionality of the component parameters. This issue is entirely avoided with the proposed approach.

\@xsect

The estimated cluster membership for all of the methods were taken to be the respective mode of the estimated cluster membership probability. For the DPM methods, the cluster membership probability matrix (which is an matrix in principle) is not sampled in the MCMC, and is not identified due to many symmetric modes (and thus their can be label switching in the posterior samples). However, the information theoretic approach of Stephens (2000) (applied to the DPM in Fu et al. (2013)) can be used to address this issue and relabel the posterior samples of to provide an estimate of . The resulting estimate has row, column that can be thought of as the proportion of the relabeled posterior samples of that have the value . While technically is an matrix, all columns after have zero entries in , where is the maximum number of clusters observed in the posterior. For the results below, the point estimate of the model is determined by if .

\@xsect

In this section the performance of the proposed approach for clustering is evaluated on two simulation cases similar in nature to the ASD clustering problem. Each of the cases is examined (i) without missing data or discrete variables/censoring, (ii) with missing data, but no discrete variables/censoring, (iii) with missing data and several discrete and/or censored variables.

The approaches to be compared are: (i) DPM-vs: the proposed method, (ii) DPM-cont: the proposed method without accounting for discrete variables/censoring (i.e., assuming all continuous variables), (ii) DPM: the proposed method with variable selection turned off (i.e., a prior probability ), (iv) DPM-ind: the approach of Kim et al. (2006) when all variables are continuous (i.e., assuming non-informative variables are independent of the rest), but modified to treat discrete variables/censoring and missing data when applicable just as the proposed approach, (v) Mclust-vs: the approach of Raftery and Dean (2006) implemented with the clustvarsel package in R. When there are missing data, Random Forest Imputation (Stekhoven and Bühlmann, 2012) implemented with the missForest package in R is used prior to application of clustvarsel. However, the Mclust-vs approach does not treat discrete variables differently and thus treats all variables as continuous and uncensored.

Each simulation case is described below. Figure 2 provides a graphical depiction of the problem for the first eight variables from the first of the 100 realizations of Case 2(c). Case 1 simulations resulted in very similar data patterns as well.

Case 1(a). , . The true model has components with mixing proportions 0.5, 0.25, 0.25, respectively, and is a multivariate normal with no censoring nor missing data. Only two variables are informative, with means of , , , unit variances, and correlations of 0.5, 0.5, -0.5 in each component, respectively. The non-informative variables are generated as iid .

Case 1(b). Same as the setup in 1(a) only the non-informative variables are correlated with through the relation , where is a matrix whose elements are distributed as iid , and , with .

Case 1(c). Same as in 1(b), but variables are discretized to the closest integer, variables are left censored at -1.4 (8% of the observations), and are right censored at 1.4.

Case 1(d). Same as 1(c), but the even numbered have % of the observations MAR.

Case 2(a). , . The true model has components with mixing proportions 0.5, 0.25, 0.25, respectively, is a multivariate normal with no censoring nor missing data. Only four variables are informative, with means of , , and all variables with unit variance for each of the three components, respectively. All correlations among informative variables are equal to 0.5 in components 1 and 2, while component 3 has correlation matrix, .

The non-informative variables are generated as iid .

Case 2(b). Same as the setup in Case 2(a) only the non-informative variables are correlated with through the relation , where is a matrix whose elements are distributed as iid , and , with .

Case 2(c). Same setup as in Case 2(b), but now variables are discretized to the closest integer, variables are left censored at -1.4 (8% of the observations), and variables are right censored at 1.4.

Case 2(d). Same as Case 2(c), but the even numbered have % MAR.

For each of the eight simulation cases, 100 data sets were randomly generated and each of the five methods above was fit to each data set. The methods are compared on the basis of the following statistics: (i) Accuracy (Acc), calculated as the proportion of observations in the estimated clustering that are in the same group as they are in the true clustering, when put in the arrangement (relabeling) that best matches the true clusters. (ii) Fowlkes-Mallows index (FI) of relative to the true clusters. (iii) Adjusted Rand index (ARI). (iv) The number of groups in the estimated clustering (M). (vi) The model size, . (vii) The proportion of variables correctly included/excluded from the model (PVC). (vii) The computation time (CompT) in minutes (using 10,000 iterations for the Bayesian methods). These measures are summarized in the tables below by their mean (and stdev) over the 100 data sets.

The simulation results from Cases 1(a)-(d) are summarized in Table 1. The summary score for the best method for each summary is in bold font along with that for any other method that was not statistically different from the best method on the basis of the 100 trials (via an uncorrected paired -test with a level of significance of 0.05). As might be expected, DPM-ind is one of the best methods on Case 1(a). However, it is not significantly better than DPM-vs or Mclust-vs on any of the metrics. All of the variable selection methods solidly outperform DPM, which had a difficult time finding more than a single cluster since it had to include all 10 variables. In Case 1(b) the assumptions of DPM-ind are being violated and it is unable to perform adequate variable selection. It must include far too many non-informative variables due to the correlation within and between and . The clustering performance suffers as a result and like DPM, it has difficulty finding more than a single cluster. Mclust-vs performs well in this case, but DPM-vs (and DPM-cont) is significantly better on two of the metrics. In case 1(c) DPM-vs is now explicitly accounting for the discrete and left/right censored variables, while DPM-cont does not. When the discrete variables are incorrectly assumed to be continuous it tends to create separate clusters at some of the unique values of the discrete variables. This is because a very high likelihood can be obtained by normal distributions that are almost singular along the direction of the discrete variables. Thus, DPM-vs substantially outperforms DPM-cont and Mclust-vs, demonstrating the importance of explicitly treating the discrete nature of the data when clustering. Finally, Case 1(d) shows that the loss of 30% of the data for half of the variables (including an informative variable) does not degrade the performance of DPM-vs by much. Mclust-vs has much faster run-time than the Bayesian methods, however, when there are discrete variables and/or missing data, Mclust-vs did not perform nearly as well as DPM-vs.

The simulation results from Cases 2(a)-(d) are summarized in Table 2. A similar story line from Case 1 carries over into Case 2 where there are now (four informative) variables and observations. Namely, DPM-vs is not significantly different from DPM-ind or Mclust-vs on any of the summary measures for Case 2(a), with the exception of computation time. DPM-vs is the best method on all summary statistics (except CompT) by a sizeable margin on the remaining cases. While Mclust is much faster than DPM-vs, the cases of the most interest in this paper are those with discrete variables, censoring and/or missing data (i.e., Cases 1(c), 1(d), 2(c), and 2(d)). In these cases, the additional computation time of DPM-vs might seem inconsequential relative to the enormous gain in accuracy. It is interesting that DPM-vs suffers far less from the missing values when moving from Case 2(c) to 2(d) than it did from Case 1(c) to 1(d). This is likely due to the fact that there are a larger number of observations to offset the additional complexity of a larger . However, it is also likely that the additional (correlated) variables may help to reduce the posterior variance of the imputed values.

\@xsect

The cohort for this study consists of subjects falling in the criteria for “potential ASD” (PASD) on the basis of various combinations of developmental and psychiatric diagnoses obtained from comprehensive medical and educational records as described in Katusic et al. (2016). The population of individuals with PASD is important because this group represents the pool of patients with developmental/behavioral symptoms from which clinicians have to determine who has ASD and/or other disorders. Subjects 18 years of age or older were invited to participate in a face-to-face session to complete psychometrist-administered assessments of autism symptoms, cognition/intelligence, memory/learning, speech and language, adaptive functions, and maladaptive behavior. In addition, guardians were asked to complete several self-reported, validated questionnaires. The goal is to describe how the patients’ test scores separate them into different types of clinical presentation and which test scores are the most useful for this purpose. This falls in line with the new Research Domain Criteria (RDoC) philosophy that has gained traction in the field of mental health research. RDoC is a new research framework for studying mental disorders. It aims to integrate many levels of information (cognitive/self-report tests, imaging, genetics) to understand how all of these might be related to similar clinical presentations.

A total of 87 test scores measuring cognitive and/or behavioral characteristics were considered from a broad list of commonly used tests for assessing such disorders. Using expert judgment to include several commonly used aggregates in place of individual subtest scores, this list was reduced to 55 test score variables to be considered in the clustering procedure. Five of the 55 variables have fewer than 15 possible values and are treated here as discrete, ordinal variables. A majority (46) of the 55 variables also have a lower bound, which is attained by a significant portion of the individuals, and are treated as left censored. Five of the variables have an upper bound that is attained by many of the individuals and are thus treated as right censored. There are 479 observations (individuals) in the dataset, however, only 67 individuals have complete data, i.e., a complete case analysis would throw out 412 (86%) of the observations.

DPM-vs was applied to these data; four chains with random starting points were run in parallel for 85,000 iterations each, which took hours on a 2.2GHz processor. The first 10,000 iterations were discarded as burn-in. MCMC trace plots are provides in the Supplementary Material. All chains converged to the same distribution (aside from relabeling) and were thus combined.

Four of the tests (Beery standard, CompTsc_ol, WJ_Pass_Comprehen, and Adaptive Composite) had a high () posterior probability of being informative (Table 3). There is also evidence that Ach_abc_Attention and Ach_abc_AnxDep are informative. The posterior samples were split on which of these two should be included in the model (they were only informative together for 0.1% of the MCMC samples). The next highest posterior inclusion probability for any of the remaining variables was 0.17 and the sum of the inclusion probabilities for all remaining variables was only 0.28. Thus, there is strong evidence to suggest that five variables are sufficient to inform the cluster membership.

A majority (54%) of the posterior samples identified three components/clusters, with 0.12 and 0.25 posterior probability of two and four clusters, respectively. The calculation of also resulted in three components. Figure 3 displays the estimated cluster membership via pairwise scatterplots of the five most informative variables on a standardized scale. Ach_abc_Attention has also been multiplied by minus one so that higher values imply better functioning for all tests. The corresponding mean vectors of the three main components are also provided in Table 3. It appears as though there are two groups that are very distinct (i.e., Clusters 1 and 2 are the “high” and “low” groups, respectively), but there is also a “middle” group (Cluster 3). Cluster 3 subjects generally have medium-to-high Adaptive_Composite, WJ_Pass_Comprehen, and Ach_abc_Attention scores, but low-to-medium Beery_standard and WJ_Pass_Comprehen.

Figure 4(a) provides a 3D plot of the clusters on the three most informative variables, highlighting the separation between Cluster 1 and Clusters 2 and 3. However, Clusters 2 and 3 are not well differentiated in this plot. Figure 4(b) shows a 3D scatter plot on the variables CompTsc_ol, WJ_Pass_Comprehen, and Ach_abc_Attention, to better illustrate some differentiation between Clusters 2 and 3. As these results are based on a bottom-up data-driven method, they may prove useful for determining imaging biomarkers that correspond better with cluster assignment than an arbitrary diagnosis provided by a physician. This will be the subject of future work.

\@xsect

In this paper we developed a general approach to clustering via the Dirichlet process model that explicitly allows for (i) discrete and censored variables with a latent variable approach, (ii) missing data, (iii) correlation among informative and non-informative variables. The MCMC computation proceeds via a split/merge type algorithm by integrating out the component parameters. This approach was shown to perform markedly better than other approaches on several simulated test cases. The approach was developed for moderate in the range of . The computation is , which makes it ill suited for extremely large dimensions. However, it may be possible to use a sparse matrix approach, i.e., graphical model (Giudici and Green, 1999; Wong et al., 2003), within the proposed framework to alleviate this burden for very large