Mixtures of Generalized Hyperbolic Distributions and Mixtures of Skew-t Distributions for Model-Based Clustering with Incomplete Data
Abstract
Robust clustering from incomplete data is an important topic because, in many practical situations, real data sets are heavy-tailed, asymmetric, and/or have arbitrary patterns of missing observations. Flexible methods and algorithms for model-based clustering are presented via the mixture of generalized hyperbolic distributions and its limiting case, the mixture of multivariate skew-t distributions. An analytically feasible EM algorithm is formulated for parameter estimation and imputation of missing values for mixture models under a missing-at-random mechanism. The proposed methodologies are investigated through a simulation study with varying proportions of synthetic missing values and illustrated using a real dataset. Comparisons are made with results obtained from the traditional mixture of generalized hyperbolic distribution counterparts, for which the missing data are filled in using mean imputation.
Keywords: Clustering; generalized hyperbolic; missing data; mixture models; skew-t.
1 Introduction
Finite mixture models are powerful and flexible tools for discovering unobserved heterogeneity in multivariate datasets. Assuming no prior knowledge of class labels, the application of finite mixture models in this way is known as model-based clustering. As McNicholas (2016a) points out, the association between mixture models and clustering goes back at least as far as Tiedeman (1955), who uses the former as a means of defining the latter. Gaussian mixture models are historically the most popular tool for model-based clustering and dominated the literature for quite some time (e.g., Celeux and Govaert, 1995; Fraley and Raftery, 1998; McLachlan et al., 2003; Bouveyron et al., 2007; McNicholas and Murphy, 2008, 2010). The multivariate t distribution, being a heavy-tailed alternative to the multivariate Gaussian distribution, made (robust) mixture modelling based on mixtures of multivariate t distributions the most natural extension (e.g., Peel and McLachlan, 2000; Andrews and McNicholas, 2011, 2012; Lin et al., 2014). In many practical situations, however, real-world datasets exhibit clusters that are not just heavy-tailed but also asymmetric; furthermore, clusters can also be asymmetric yet not heavy-tailed. Over the past few years, much attention has been paid to non-Gaussian approaches to model-based clustering and classification, including work on multivariate skew distributions (e.g., Lin, 2010; Vrbik and McNicholas, 2012; Lee and McLachlan, 2014; Murray et al., 2014a,b, 2017b), shifted asymmetric Laplace distributions (Franczak et al., 2014), multivariate power exponential distributions (Dang et al., 2015), multivariate normal inverse Gaussian distributions (Karlis and Santourian, 2009; O'Hagan et al., 2016), generalized hyperbolic distributions (Browne and McNicholas, 2015; Morris and McNicholas, 2016; Tortora et al., 2016), and hidden truncation hyperbolic distributions (Murray et al., 2017a).
A comprehensive review of model-based clustering work, up to and including some recent work on non-Gaussian mixtures, is given by McNicholas (2016b).
Unobserved or missing observations are frequently a hindrance in multivariate datasets, so developing mixture models that can accommodate incomplete data is an important issue in model-based clustering. The maximum likelihood and Bayesian approaches are two common paradigms for analyzing data with incomplete observations. Herein, the missing-data mechanism is assumed to be missing at random (MAR), as per Rubin (1976) and Little and Rubin (1987), meaning that the probability that a variable is missing for a particular individual depends only on the observed data and not on the value of the missing variable. Note that missing completely at random (MCAR) is a special case of MAR. Under MAR, the missing-data mechanism is ignorable for methods using the maximum likelihood approach.
The maximum likelihood approach to clustering incomplete data has been well studied and is often used, particularly for Gaussian mixture models (e.g., Ghahramani and Jordan, 1994; Lin et al., 2006; Browne et al., 2013). Wang et al. (2004) present a framework for maximum likelihood estimation using an expectation-maximization (EM) algorithm (Dempster et al., 1977) to fit a mixture of multivariate t distributions with arbitrary missing-data patterns, which was generalized by Lin et al. (2009) to efficient supervised learning via the parameter-expanded EM (PX-EM) algorithm (Liu et al., 1998) through two auxiliary indicator matrices. Lin (2014) further develops a family of multivariate t mixture models with 14 eigen-decomposed scale matrices in the presence of missing data through a computationally flexible EM algorithm that incorporates two auxiliary indicator matrices. Wang and Lin (2015) use a formulation of the mixture of skew-t distributions for model-based clustering with missing data.
We consider fitting mixtures of generalized hyperbolic distributions (MGHD) and mixtures of multivariate skew-t distributions (MST) with missing information. In each case, an EM algorithm is used for parameter estimation. The chosen formulation of the (multivariate) generalized hyperbolic distribution (GHD) is that used by Browne and McNicholas (2015) and has formulations of several well-known distributions as special cases, such as the multivariate skew-t, normal inverse Gaussian, variance-gamma, Laplace, and Gaussian distributions (cf. McNeil et al., 2005). In addition to considering missing data, we develop families of MGHD and MST mixture models, each with 14 parsimonious eigen-decomposed scale matrices corresponding to the well-known Gaussian parsimonious clustering models (GPCMs) of Banfield and Raftery (1993) and Celeux and Govaert (1995).
2 Background
2.1 Generalized Inverse Gaussian Distribution
A random variable $Y>0$ is said to have a generalized inverse Gaussian (GIG) distribution, introduced by Good (1953), with parameters $\psi$, $\chi$, and $\lambda$ if its probability density function is given by
$$p(y\mid\psi,\chi,\lambda)=\frac{(\psi/\chi)^{\lambda/2}\,y^{\lambda-1}}{2K_{\lambda}(\sqrt{\psi\chi})}\exp\!\left\{-\frac{\psi y+\chi/y}{2}\right\},\qquad(1)$$
for $y>0$, where $\psi,\chi\in\mathbb{R}^{+}$, $\lambda\in\mathbb{R}$, and $K_{\lambda}$ is the modified Bessel function of the third kind with index $\lambda$. Herein, we write $Y\sim\mathrm{GIG}(\psi,\chi,\lambda)$ to indicate that a random variable $Y$ has the GIG density as parameterized in (1). The GIG distribution has some attractive properties (Barndorff-Nielsen and Halgreen, 1977; Blæsild, 1978; Halgreen, 1979; Jørgensen, 1982), including the tractability of the expectations:
$$\mathbb{E}\left[Y^{\alpha}\right]=\left(\frac{\chi}{\psi}\right)^{\alpha/2}\frac{K_{\lambda+\alpha}(\sqrt{\psi\chi})}{K_{\lambda}(\sqrt{\psi\chi})}\qquad(2)$$
for $\alpha\in\mathbb{R}$, and
$$\mathbb{E}[\log Y]=\frac{1}{2}\log\left(\frac{\chi}{\psi}\right)+\frac{\partial}{\partial\lambda}\log K_{\lambda}(\sqrt{\psi\chi}).\qquad(3)$$
Specifically, for $\alpha=1$ and $\alpha=-1$, we have
$$\mathbb{E}[Y]=\sqrt{\frac{\chi}{\psi}}\,\frac{K_{\lambda+1}(\sqrt{\psi\chi})}{K_{\lambda}(\sqrt{\psi\chi})},\qquad\mathbb{E}\left[\frac{1}{Y}\right]=\sqrt{\frac{\psi}{\chi}}\,\frac{K_{\lambda+1}(\sqrt{\psi\chi})}{K_{\lambda}(\sqrt{\psi\chi})}-\frac{2\lambda}{\chi}.$$
Browne and McNicholas (2015) introduce another parameterization of the GIG distribution by setting $\omega=\sqrt{\psi\chi}$ and $\eta=\sqrt{\chi/\psi}$. Write $Y\sim\mathrm{I}(\omega,\eta,\lambda)$; its density is given by
$$p(y\mid\omega,\eta,\lambda)=\frac{(y/\eta)^{\lambda-1}}{2\eta K_{\lambda}(\omega)}\exp\!\left\{-\frac{\omega}{2}\left(\frac{y}{\eta}+\frac{\eta}{y}\right)\right\}\qquad(4)$$
for $y>0$, where $\eta>0$ is a scale parameter and $\omega>0$ is a concentration parameter. These two parameterizations of the GIG distribution are important ingredients for building the generalized hyperbolic distribution presented later.
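As a quick numerical check (not part of the original development), the moment formula (2) can be evaluated with SciPy's modified Bessel routine `kv`; SciPy's `geninvgauss` distribution matches the $(\omega,\eta,\lambda)$ parameterization in (4) with `p` $=\lambda$, `b` $=\omega$, and `scale` $=\eta$. The parameter values below are arbitrary illustrations.

```python
import numpy as np
from scipy.special import kv          # modified Bessel function of the third kind, K_lambda
from scipy.stats import geninvgauss

def gig_moment(alpha, psi, chi, lam):
    """E[Y^alpha] for Y ~ GIG(psi, chi, lambda), as in equation (2)."""
    s = np.sqrt(psi * chi)
    return (chi / psi) ** (alpha / 2) * kv(lam + alpha, s) / kv(lam, s)

psi, chi, lam = 2.0, 3.0, 0.5
omega, eta = np.sqrt(psi * chi), np.sqrt(chi / psi)   # the parameterization of (4)

# SciPy's geninvgauss(p, b, scale) matches (4) with p = lambda, b = omega, scale = eta
dist = geninvgauss(lam, omega, scale=eta)
assert abs(gig_moment(1.0, psi, chi, lam) - dist.mean()) < 1e-6          # E[Y]
assert abs(gig_moment(-1.0, psi, chi, lam) - dist.expect(lambda y: 1.0 / y)) < 1e-6  # E[1/Y]
```

The agreement between the closed-form Bessel ratio and the numerically computed moments confirms that the two parameterizations describe the same distribution.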
2.2 Generalized Hyperbolic Distribution
Several alternative parameterizations of the GHD have appeared in the literature, e.g., Barndorff-Nielsen and Blæsild (1981), McNeil et al. (2005), and Browne and McNicholas (2015). Barndorff-Nielsen (1977) introduces the generalized hyperbolic distribution (GHD) to model the distribution of sand grain sizes, and subsequent reports described its statistical properties (e.g., Barndorff-Nielsen, 1978; Barndorff-Nielsen and Blæsild, 1981). However, under this standard parameterization, the parameters of the mixing distribution are not invariant under affine transformations. An important innovation was made by McNeil et al. (2005), who gave a new parameterization of the GHD. Under this new parameterization, an affine transformation of a GHD random vector remains within the same subfamily characterized by the parameters of the mixing distribution. However, an identifiability issue arises under this parameterization. To solve this problem, Browne and McNicholas (2015) give an alternative parameterization.
Following McNeil et al. (2005), a random vector $\mathbf{X}$ is said to follow a generalized hyperbolic distribution with index parameter $\lambda$, concentration parameters $\chi$ and $\psi$, location vector $\boldsymbol{\mu}$, dispersion matrix $\boldsymbol{\Sigma}$, and skewness vector $\boldsymbol{\gamma}$, denoted by $\mathbf{X}\sim\mathrm{GHD}_{p}(\lambda,\chi,\psi,\boldsymbol{\mu},\boldsymbol{\Sigma},\boldsymbol{\gamma})$, if it can be represented by
$$\mathbf{X}=\boldsymbol{\mu}+W\boldsymbol{\gamma}+\sqrt{W}\,\mathbf{Z},\qquad(5)$$
where $\mathbf{Z}\sim\mathcal{N}_{p}(\mathbf{0},\boldsymbol{\Sigma})$, $W\sim\mathrm{GIG}(\psi,\chi,\lambda)$, the symbol $\perp$ indicates independence ($\mathbf{Z}\perp W$), and it follows that $\mathbf{X}\mid W=w\sim\mathcal{N}_{p}(\boldsymbol{\mu}+w\boldsymbol{\gamma},w\boldsymbol{\Sigma})$. So, the density of the generalized hyperbolic random vector $\mathbf{X}$ is given by
$$f(\mathbf{x}\mid\boldsymbol{\theta})=\left[\frac{\chi+\delta(\mathbf{x},\boldsymbol{\mu}\mid\boldsymbol{\Sigma})}{\psi+\boldsymbol{\gamma}'\boldsymbol{\Sigma}^{-1}\boldsymbol{\gamma}}\right]^{\frac{\lambda-p/2}{2}}\frac{(\psi/\chi)^{\lambda/2}\,K_{\lambda-p/2}\!\left(\sqrt{[\psi+\boldsymbol{\gamma}'\boldsymbol{\Sigma}^{-1}\boldsymbol{\gamma}][\chi+\delta(\mathbf{x},\boldsymbol{\mu}\mid\boldsymbol{\Sigma})]}\right)}{(2\pi)^{p/2}|\boldsymbol{\Sigma}|^{1/2}K_{\lambda}(\sqrt{\chi\psi})\exp\{-(\mathbf{x}-\boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}\boldsymbol{\gamma}\}},\qquad(6)$$
where $\delta(\mathbf{x},\boldsymbol{\mu}\mid\boldsymbol{\Sigma})=(\mathbf{x}-\boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})$ is the squared Mahalanobis distance between $\mathbf{x}$ and $\boldsymbol{\mu}$, $K_{\lambda-p/2}$ is the modified Bessel function of the third kind with index $\lambda-p/2$, and $\boldsymbol{\theta}=(\lambda,\chi,\psi,\boldsymbol{\mu},\boldsymbol{\Sigma},\boldsymbol{\gamma})$ denotes the model parameters. The mean and covariance matrix of $\mathbf{X}$ are
$$\mathbb{E}[\mathbf{X}]=\boldsymbol{\mu}+\mathbb{E}[W]\boldsymbol{\gamma}\qquad\text{and}\qquad\mathrm{Cov}(\mathbf{X})=\mathbb{E}[W]\boldsymbol{\Sigma}+\mathrm{Var}(W)\boldsymbol{\gamma}\boldsymbol{\gamma}',\qquad(7)$$
respectively, where $\mathbb{E}[W]$ and $\mathrm{Var}(W)$ are the mean and variance of the random variable $W$, respectively.
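A short Monte Carlo sketch of the stochastic representation (5) illustrates the mean formula in (7), $\mathbb{E}[\mathbf{X}]=\boldsymbol{\mu}+\mathbb{E}[W]\boldsymbol{\gamma}$; the parameter values are arbitrary and SciPy's `geninvgauss` is used to draw the GIG mixing variable.

```python
import numpy as np
from scipy.special import kv
from scipy.stats import geninvgauss

rng = np.random.default_rng(1)
p, n = 2, 200_000
lam, psi, chi = 1.0, 2.0, 2.0
mu = np.array([0.0, 1.0])
gamma = np.array([1.5, -0.5])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])

# representation (5): X = mu + W*gamma + sqrt(W)*Z, with Z ~ N(0, Sigma), W ~ GIG(psi, chi, lambda)
w = geninvgauss.rvs(lam, np.sqrt(psi * chi), scale=np.sqrt(chi / psi), size=n, random_state=rng)
z = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
x = mu + w[:, None] * gamma + np.sqrt(w)[:, None] * z

s = np.sqrt(psi * chi)
EW = np.sqrt(chi / psi) * kv(lam + 1, s) / kv(lam, s)   # E[W] from (2) with alpha = 1
assert np.allclose(x.mean(axis=0), mu + EW * gamma, atol=0.02)   # mean formula in (7)
```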
Note that, in this parameterization, we need to hold $|\boldsymbol{\Sigma}|=1$ to ensure identifiability. Using $|\boldsymbol{\Sigma}|=1$ solves the identifiability problem but would be prohibitively restrictive for model-based clustering and classification applications. Hence, Browne and McNicholas (2015) develop a new parameterization of the GHD with index parameter $\lambda$, concentration parameter $\omega$, location vector $\boldsymbol{\mu}$, dispersion matrix $\boldsymbol{\Sigma}$, and skewness vector $\boldsymbol{\gamma}$, denoted by $\mathbf{X}\sim\mathrm{GHD}_{p}(\omega,1,\boldsymbol{\mu},\boldsymbol{\Sigma},\boldsymbol{\gamma},\lambda)$. Note that $\omega=\sqrt{\chi\psi}$. This formulation is given by
$$\mathbf{X}=\boldsymbol{\mu}+W\boldsymbol{\gamma}+\sqrt{W}\,\mathbf{Z},\qquad(8)$$
where $\mathbf{Z}\sim\mathcal{N}_{p}(\mathbf{0},\boldsymbol{\Sigma})$, $W\sim\mathrm{I}(\omega,1,\lambda)$ with $\mathbf{Z}\perp W$, and $\mathbf{X}\mid W=w\sim\mathcal{N}_{p}(\boldsymbol{\mu}+w\boldsymbol{\gamma},w\boldsymbol{\Sigma})$. Under this parameterization, the density of the generalized hyperbolic random vector $\mathbf{X}$ is
$$f(\mathbf{x}\mid\boldsymbol{\theta})=\left[\frac{\omega+\delta(\mathbf{x},\boldsymbol{\mu}\mid\boldsymbol{\Sigma})}{\omega+\boldsymbol{\gamma}'\boldsymbol{\Sigma}^{-1}\boldsymbol{\gamma}}\right]^{\frac{\lambda-p/2}{2}}\frac{K_{\lambda-p/2}\!\left(\sqrt{[\omega+\boldsymbol{\gamma}'\boldsymbol{\Sigma}^{-1}\boldsymbol{\gamma}][\omega+\delta(\mathbf{x},\boldsymbol{\mu}\mid\boldsymbol{\Sigma})]}\right)}{(2\pi)^{p/2}|\boldsymbol{\Sigma}|^{1/2}K_{\lambda}(\omega)\exp\{-(\mathbf{x}-\boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}\boldsymbol{\gamma}\}},\qquad(9)$$
where $\delta(\mathbf{x},\boldsymbol{\mu}\mid\boldsymbol{\Sigma})$ and $K_{\lambda}$ are as described earlier, and $\boldsymbol{\theta}=(\omega,\boldsymbol{\mu},\boldsymbol{\Sigma},\boldsymbol{\gamma},\lambda)$. We use this parameterization when we describe parameter estimation (cf. Section 3).
The following result shows an appealing closure property of the generalized hyperbolic distribution under affine transformation and conditioning, as well as in the formation of marginal distributions, which is useful for developing the new methods presented later. Suppose that $\mathbf{X}$ is a $p$-dimensional random vector having a generalized hyperbolic distribution as in (9), i.e., $\mathbf{X}\sim\mathrm{GHD}_{p}(\omega,1,\boldsymbol{\mu},\boldsymbol{\Sigma},\boldsymbol{\gamma},\lambda)$. Assume that $\mathbf{X}$ is partitioned as $\mathbf{X}=(\mathbf{X}_{1}',\mathbf{X}_{2}')'$, where $\mathbf{X}_{1}$ takes values in $\mathbb{R}^{p_{1}}$ and $\mathbf{X}_{2}$ in $\mathbb{R}^{p_{2}}$, with $p=p_{1}+p_{2}$ and
$$\boldsymbol{\Sigma}=\begin{pmatrix}\boldsymbol{\Sigma}_{11}&\boldsymbol{\Sigma}_{12}\\\boldsymbol{\Sigma}_{21}&\boldsymbol{\Sigma}_{22}\end{pmatrix},$$
where $\boldsymbol{\mu}=(\boldsymbol{\mu}_{1}',\boldsymbol{\mu}_{2}')'$ and $\boldsymbol{\gamma}=(\boldsymbol{\gamma}_{1}',\boldsymbol{\gamma}_{2}')'$ have similar partitions. Furthermore, $\boldsymbol{\Sigma}_{11}$ is $p_{1}\times p_{1}$ and $\boldsymbol{\Sigma}_{22}$ is $p_{2}\times p_{2}$.
Proposition 1.
Affine transformation of the generalized hyperbolic distribution. If $\mathbf{X}\sim\mathrm{GHD}_{p}(\omega,1,\boldsymbol{\mu},\boldsymbol{\Sigma},\boldsymbol{\gamma},\lambda)$ and $\mathbf{Y}=\mathbf{A}\mathbf{X}+\mathbf{b}$, where $\mathbf{A}\in\mathbb{R}^{q\times p}$ and $\mathbf{b}\in\mathbb{R}^{q}$, then
$$\mathbf{Y}\sim\mathrm{GHD}_{q}(\omega,1,\mathbf{A}\boldsymbol{\mu}+\mathbf{b},\mathbf{A}\boldsymbol{\Sigma}\mathbf{A}',\mathbf{A}\boldsymbol{\gamma},\lambda).\qquad(10)$$
Proof.
The result follows by substituting (8) into $\mathbf{Y}=\mathbf{A}\mathbf{X}+\mathbf{b}$. ∎
Proposition 2.
The marginal distribution of $\mathbf{X}_{1}$ is a generalized hyperbolic distribution as in (9) with index parameter $\lambda$, concentration parameter $\omega$, location vector $\boldsymbol{\mu}_{1}$, dispersion matrix $\boldsymbol{\Sigma}_{11}$, and skewness vector $\boldsymbol{\gamma}_{1}$, i.e., $\mathbf{X}_{1}\sim\mathrm{GHD}_{p_{1}}(\omega,1,\boldsymbol{\mu}_{1},\boldsymbol{\Sigma}_{11},\boldsymbol{\gamma}_{1},\lambda)$.
Proof.
The result follows by applying Proposition 1 and choosing $\mathbf{A}=[\mathbf{I}_{p_{1}}\;\mathbf{0}]$ and $\mathbf{b}=\mathbf{0}$. The parameters inherited from the mixing distribution remain the same under affine transformation and marginalization. ∎
Proposition 3.
The conditional distribution of $\mathbf{X}_{2}$ given $\mathbf{X}_{1}=\mathbf{x}_{1}$ is a generalized hyperbolic distribution as in (6), i.e., $\mathbf{X}_{2}\mid\mathbf{x}_{1}\sim\mathrm{GHD}_{p_{2}}(\lambda^{*},\chi^{*},\psi^{*},\boldsymbol{\mu}^{*},\boldsymbol{\Sigma}^{*},\boldsymbol{\gamma}^{*})$, where
$$\lambda^{*}=\lambda-\frac{p_{1}}{2},\qquad\chi^{*}=\omega+\delta(\mathbf{x}_{1},\boldsymbol{\mu}_{1}\mid\boldsymbol{\Sigma}_{11}),\qquad\psi^{*}=\omega+\boldsymbol{\gamma}_{1}'\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\gamma}_{1},$$
$$\boldsymbol{\mu}^{*}=\boldsymbol{\mu}_{2}+\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}(\mathbf{x}_{1}-\boldsymbol{\mu}_{1}),\qquad\boldsymbol{\Sigma}^{*}=\boldsymbol{\Sigma}_{22}-\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12},\qquad\boldsymbol{\gamma}^{*}=\boldsymbol{\gamma}_{2}-\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\gamma}_{1}.$$
The proof of Proposition 3 is given in Appendix B.
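The block formulas in Proposition 3 translate directly into code. The helper below is an illustrative sketch (the function name and the numerical example are ours, not the paper's) that computes the conditional GH parameters from a partitioned parameter set.

```python
import numpy as np

def gh_conditional_params(x1, mu, gamma, Sigma, omega, lam, p1):
    """Parameters of X_2 | X_1 = x1 for X ~ GHD_p(omega, 1, mu, Sigma, gamma, lambda),
    following the block formulas of Proposition 3."""
    mu1, mu2 = mu[:p1], mu[p1:]
    g1, g2 = gamma[:p1], gamma[p1:]
    S11, S12 = Sigma[:p1, :p1], Sigma[:p1, p1:]
    S21, S22 = Sigma[p1:, :p1], Sigma[p1:, p1:]
    S11inv = np.linalg.inv(S11)
    delta1 = (x1 - mu1) @ S11inv @ (x1 - mu1)        # squared Mahalanobis distance in block 1
    return {
        "lambda": lam - p1 / 2,
        "chi": omega + delta1,
        "psi": omega + g1 @ S11inv @ g1,
        "mu": mu2 + S21 @ S11inv @ (x1 - mu1),
        "Sigma": S22 - S21 @ S11inv @ S12,           # Schur complement
        "gamma": g2 - S21 @ S11inv @ g1,
    }

out = gh_conditional_params(
    x1=np.array([0.5]), mu=np.zeros(3), gamma=np.array([1.0, 0.5, -0.5]),
    Sigma=np.eye(3) + 0.2, omega=2.0, lam=1.0, p1=1,
)
assert out["Sigma"].shape == (2, 2) and np.all(np.linalg.eigvalsh(out["Sigma"]) > 0)
```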
2.3 The Multivariate Skew-t Distribution
There are several alternative formulations of multivariate skew-t distributions in the literature (e.g., Branco and Dey, 2001; Sahu, Dey, and Branco, 2003; Murray, Browne, and McNicholas, 2014a; Lee and McLachlan, 2014). Lin and Lin (2011) develop a mixture of multivariate skew-t distributions for incomplete data using the formulation of Sahu et al. (2003). Herein, the formulation of the multivariate skew-t distribution arising as a limiting case of the generalized hyperbolic distribution is used. This formulation has been used by Murray et al. (2014a) to develop a mixture of skew-t factor analyzers model.
Following McNeil et al. (2005), a $p\times1$ random vector $\mathbf{X}$ is said to follow a multivariate skew-t distribution with degrees of freedom $\nu$, location vector $\boldsymbol{\mu}$, dispersion matrix $\boldsymbol{\Sigma}$, and skewness vector $\boldsymbol{\gamma}$, denoted by $\mathbf{X}\sim\mathrm{GST}_{p}(\boldsymbol{\mu},\boldsymbol{\Sigma},\boldsymbol{\gamma},\nu)$, if it can be represented by
$$\mathbf{X}=\boldsymbol{\mu}+W\boldsymbol{\gamma}+\sqrt{W}\,\mathbf{Z},\qquad(11)$$
where $\mathbf{Z}\sim\mathcal{N}_{p}(\mathbf{0},\boldsymbol{\Sigma})$ and $W\sim\mathrm{IG}(\nu/2,\nu/2)$ with $\mathbf{Z}\perp W$, with $\mathrm{IG}$ denoting the inverse-gamma distribution. It follows that $\mathbf{X}\mid W=w\sim\mathcal{N}_{p}(\boldsymbol{\mu}+w\boldsymbol{\gamma},w\boldsymbol{\Sigma})$ and the pdf of the multivariate skew-t random vector $\mathbf{X}$ is given by
$$f(\mathbf{x}\mid\boldsymbol{\theta})=\frac{2^{1-\frac{\nu}{2}}\nu^{\nu/2}}{(2\pi)^{p/2}|\boldsymbol{\Sigma}|^{1/2}\Gamma(\nu/2)}\left[\frac{\nu+\delta(\mathbf{x},\boldsymbol{\mu}\mid\boldsymbol{\Sigma})}{\boldsymbol{\gamma}'\boldsymbol{\Sigma}^{-1}\boldsymbol{\gamma}}\right]^{-\frac{\nu+p}{4}}K_{-\frac{\nu+p}{2}}\!\left(\sqrt{[\nu+\delta(\mathbf{x},\boldsymbol{\mu}\mid\boldsymbol{\Sigma})]\,\boldsymbol{\gamma}'\boldsymbol{\Sigma}^{-1}\boldsymbol{\gamma}}\right)e^{(\mathbf{x}-\boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}\boldsymbol{\gamma}}.\qquad(12)$$
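The representation (11) suggests a simple way to simulate skew-t draws: sample the inverse-gamma mixing variable and plug it into the normal mean-variance mixture. A sketch with arbitrary parameter values; for $\nu>2$, $\mathbb{E}[W]=\nu/(\nu-2)$, which the sample mean should approximate.

```python
import numpy as np
from scipy.stats import invgamma

rng = np.random.default_rng(7)
nu, n = 8.0, 100_000
mu = np.array([0.0, 0.0])
gamma = np.array([2.0, 0.0])
Sigma = np.eye(2)

# representation (11): W ~ IG(nu/2, nu/2), then X = mu + W*gamma + sqrt(W)*Z
w = invgamma.rvs(nu / 2, scale=nu / 2, size=n, random_state=rng)
z = rng.multivariate_normal(np.zeros(2), Sigma, size=n)
x = mu + w[:, None] * gamma + np.sqrt(w)[:, None] * z

# for nu > 2, E[W] = nu/(nu - 2), so E[X] = mu + gamma * nu/(nu - 2)
assert abs(w.mean() - nu / (nu - 2)) < 0.05
assert abs(x[:, 0].mean() - 2.0 * nu / (nu - 2)) < 0.1
```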
This formulation of the multivariate skew-t distribution can be obtained as a special case of the generalized hyperbolic distribution by setting $\lambda=-\nu/2$ and $\chi=\nu$, and letting $\psi\to0$. Similarly, this formulation of the multivariate skew-t distribution has a closed form under affine transformation and conditioning, and in the formation of marginal distributions, which is useful for developing the new methods presented later. Suppose that $\mathbf{X}$ is a $p$-dimensional random vector having the multivariate skew-t distribution as in (12), i.e., $\mathbf{X}\sim\mathrm{GST}_{p}(\boldsymbol{\mu},\boldsymbol{\Sigma},\boldsymbol{\gamma},\nu)$. Assume that $\mathbf{X}$ is partitioned as $\mathbf{X}=(\mathbf{X}_{1}',\mathbf{X}_{2}')'$, where $\mathbf{X}_{1}$ takes values in $\mathbb{R}^{p_{1}}$ and $\mathbf{X}_{2}$ in $\mathbb{R}^{p_{2}}$, with $p=p_{1}+p_{2}$ and
$$\boldsymbol{\Sigma}=\begin{pmatrix}\boldsymbol{\Sigma}_{11}&\boldsymbol{\Sigma}_{12}\\\boldsymbol{\Sigma}_{21}&\boldsymbol{\Sigma}_{22}\end{pmatrix},$$
where $\boldsymbol{\mu}=(\boldsymbol{\mu}_{1}',\boldsymbol{\mu}_{2}')'$ and $\boldsymbol{\gamma}=(\boldsymbol{\gamma}_{1}',\boldsymbol{\gamma}_{2}')'$ have similar partitions. Furthermore, $\boldsymbol{\Sigma}_{11}$ is $p_{1}\times p_{1}$ and $\boldsymbol{\Sigma}_{22}$ is $p_{2}\times p_{2}$.
Proposition 4.
Affine transformation of the multivariate skew-t distribution. If $\mathbf{X}\sim\mathrm{GST}_{p}(\boldsymbol{\mu},\boldsymbol{\Sigma},\boldsymbol{\gamma},\nu)$ and $\mathbf{Y}=\mathbf{A}\mathbf{X}+\mathbf{b}$, where $\mathbf{A}\in\mathbb{R}^{q\times p}$ and $\mathbf{b}\in\mathbb{R}^{q}$, then
$$\mathbf{Y}\sim\mathrm{GST}_{q}(\mathbf{A}\boldsymbol{\mu}+\mathbf{b},\mathbf{A}\boldsymbol{\Sigma}\mathbf{A}',\mathbf{A}\boldsymbol{\gamma},\nu).\qquad(13)$$
Proof.
The proof follows easily by substituting (11) into $\mathbf{Y}=\mathbf{A}\mathbf{X}+\mathbf{b}$. ∎
Proposition 5.
The marginal distribution of $\mathbf{X}_{1}$ is a multivariate skew-t distribution as in (12) with degrees of freedom $\nu$, location vector $\boldsymbol{\mu}_{1}$, dispersion matrix $\boldsymbol{\Sigma}_{11}$, and skewness vector $\boldsymbol{\gamma}_{1}$, i.e., $\mathbf{X}_{1}\sim\mathrm{GST}_{p_{1}}(\boldsymbol{\mu}_{1},\boldsymbol{\Sigma}_{11},\boldsymbol{\gamma}_{1},\nu)$.
Proof.
The proof follows easily by applying Proposition 4 and choosing $\mathbf{A}=[\mathbf{I}_{p_{1}}\;\mathbf{0}]$ and $\mathbf{b}=\mathbf{0}$. The degrees of freedom parameter inherited from the mixing distribution remains invariant under affine transformation and marginalization. ∎
Proposition 6.
The conditional distribution of $\mathbf{X}_{2}$ given $\mathbf{X}_{1}=\mathbf{x}_{1}$ is a generalized hyperbolic distribution as in (6), i.e., $\mathbf{X}_{2}\mid\mathbf{x}_{1}\sim\mathrm{GHD}_{p_{2}}(\lambda^{*},\chi^{*},\psi^{*},\boldsymbol{\mu}^{*},\boldsymbol{\Sigma}^{*},\boldsymbol{\gamma}^{*})$, where
$$\lambda^{*}=-\frac{\nu+p_{1}}{2},\qquad\chi^{*}=\nu+\delta(\mathbf{x}_{1},\boldsymbol{\mu}_{1}\mid\boldsymbol{\Sigma}_{11}),\qquad\psi^{*}=\boldsymbol{\gamma}_{1}'\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\gamma}_{1},$$
$$\boldsymbol{\mu}^{*}=\boldsymbol{\mu}_{2}+\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}(\mathbf{x}_{1}-\boldsymbol{\mu}_{1}),\qquad\boldsymbol{\Sigma}^{*}=\boldsymbol{\Sigma}_{22}-\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12},\qquad\boldsymbol{\gamma}^{*}=\boldsymbol{\gamma}_{2}-\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\gamma}_{1}.$$
The proof of Proposition 6 is similar to that of Proposition 3 and is hence omitted. Results similar to Propositions 4, 5, and 6 have been obtained by Arellano-Valle and Genton (2010).
3 MGHD with Incomplete Data
Let $\mathbf{X}_{1},\ldots,\mathbf{X}_{n}$ be $p$-dimensional random vectors arising from a heterogeneous population with $G$ disjoint MGHD subpopulations. That is, each $\mathbf{X}_{i}$ has the density
$$f(\mathbf{x}_{i}\mid\boldsymbol{\Theta})=\sum_{g=1}^{G}\pi_{g}\,f_{\mathrm{GHD}}(\mathbf{x}_{i}\mid\boldsymbol{\theta}_{g}),\qquad(14)$$
where $\pi_{g}>0$, such that $\sum_{g=1}^{G}\pi_{g}=1$, are the mixing proportions, $\boldsymbol{\Theta}=(\pi_{1},\ldots,\pi_{G},\boldsymbol{\theta}_{1},\ldots,\boldsymbol{\theta}_{G})$ denotes the model parameters with $\boldsymbol{\theta}_{g}=(\omega_{g},\boldsymbol{\mu}_{g},\boldsymbol{\Sigma}_{g},\boldsymbol{\gamma}_{g},\lambda_{g})$, and $f_{\mathrm{GHD}}$ is the GHD density defined in (9).
To apply the MGHD model (14) in the clustering paradigm, introduce the component membership indicator $\mathbf{z}_{i}=(z_{i1},\ldots,z_{iG})'$, where $z_{ig}=1$ if observation $i$ is in component $g$ and $z_{ig}=0$ otherwise. The corresponding random variable $\mathbf{Z}_{i}$ follows a multinomial distribution with one trial and cell probabilities $\pi_{1},\ldots,\pi_{G}$, i.e., $\mathbf{Z}_{i}\sim\mathcal{M}(1;\pi_{1},\ldots,\pi_{G})$.
A three-level hierarchical representation of the MGHD model (14) can be expressed by
$$\mathbf{Z}_{i}\sim\mathcal{M}(1;\pi_{1},\ldots,\pi_{G}),\qquad W_{i}\mid z_{ig}=1\sim\mathrm{I}(\omega_{g},1,\lambda_{g}),\qquad\mathbf{X}_{i}\mid w_{i},z_{ig}=1\sim\mathcal{N}_{p}(\boldsymbol{\mu}_{g}+w_{i}\boldsymbol{\gamma}_{g},w_{i}\boldsymbol{\Sigma}_{g}).\qquad(15)$$
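The hierarchy (15) also gives a direct recipe for generating data from the MGHD model: draw a component label, then the mixing variable, then the Gaussian layer. The following sketch (with arbitrary two-component parameters of our choosing) does exactly that.

```python
import numpy as np
from scipy.stats import geninvgauss

rng = np.random.default_rng(3)
n, p = 500, 2
pi = np.array([0.6, 0.4])                           # mixing proportions
mus = [np.array([-2.0, 0.0]), np.array([3.0, 1.0])]
gammas = [np.array([0.5, 0.0]), np.array([-0.5, 0.5])]
Sigmas = [np.eye(p), np.array([[1.0, 0.4], [0.4, 1.0]])]
omegas, lams = [1.0, 2.0], [0.5, -0.5]

# hierarchy (15): Z_i ~ Mult(1; pi); W_i | z_ig=1 ~ I(omega_g, 1, lambda_g);
# X_i | w_i, z_ig=1 ~ N(mu_g + w_i*gamma_g, w_i*Sigma_g)
z = rng.choice(2, size=n, p=pi)
x = np.empty((n, p))
for i, g in enumerate(z):
    w = geninvgauss.rvs(lams[g], omegas[g], random_state=rng)   # eta = 1, so scale = 1
    x[i] = rng.multivariate_normal(mus[g] + w * gammas[g], w * Sigmas[g])

assert x.shape == (n, p) and np.isfinite(x).all()
```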
The complete-data consist of the observed $\mathbf{x}_{i}$ together with the missing group memberships $z_{ig}$ and the latent $w_{i}$, for $i=1,\ldots,n$ and $g=1,\ldots,G$, and the complete-data log-likelihood is given by
$$\ell_{c}(\boldsymbol{\Theta})=\sum_{i=1}^{n}\sum_{g=1}^{G}z_{ig}\left[\log\pi_{g}+\log\phi_{p}(\mathbf{x}_{i}\mid\boldsymbol{\mu}_{g}+w_{i}\boldsymbol{\gamma}_{g},w_{i}\boldsymbol{\Sigma}_{g})+\log h(w_{i}\mid\omega_{g},1,\lambda_{g})\right],\qquad(16)$$
where $\phi_{p}(\cdot)$ denotes the $p$-dimensional Gaussian density and $h(\cdot)$ the GIG density in (4).
Browne and McNicholas (2015) present an EM algorithm for parameter estimation with the MGHD when there are no missing data in $\mathbf{x}_{i}$. We are interested in parameter estimation for the MGHD model (14) when $\mathbf{x}_{1},\ldots,\mathbf{x}_{n}$ are partially observed with arbitrary missing-data patterns. The missing-data mechanism is assumed to be MAR. Assume now that we split $\mathbf{x}_{i}$ into two components, $\mathbf{x}_{i}^{o}$ and $\mathbf{x}_{i}^{m}$, that denote the observed and missing components of $\mathbf{x}_{i}$, respectively. In general, each data vector may have a different pattern of missing features, but the notation is simplified here for the sake of clarity.
For each $\mathbf{x}_{i}$, partition the mean vector as $\boldsymbol{\mu}_{g}=(\boldsymbol{\mu}_{g}^{o\prime},\boldsymbol{\mu}_{g}^{m\prime})'$, where $\boldsymbol{\mu}_{g}^{o}$ and $\boldsymbol{\mu}_{g}^{m}$ denote the subvectors of $\boldsymbol{\mu}_{g}$ matching the observed and missing components of $\mathbf{x}_{i}$, respectively. Similarly, the skewness vector is $\boldsymbol{\gamma}_{g}=(\boldsymbol{\gamma}_{g}^{o\prime},\boldsymbol{\gamma}_{g}^{m\prime})'$ and the dispersion matrix is
$$\boldsymbol{\Sigma}_{g}=\begin{pmatrix}\boldsymbol{\Sigma}_{g}^{oo}&\boldsymbol{\Sigma}_{g}^{om}\\\boldsymbol{\Sigma}_{g}^{mo}&\boldsymbol{\Sigma}_{g}^{mm}\end{pmatrix},\qquad(17)$$
corresponding to $\mathbf{x}_{i}=(\mathbf{x}_{i}^{o\prime},\mathbf{x}_{i}^{m\prime})'$. As a result, in addition to the observed $\mathbf{x}_{i}^{o}$, the missing group memberships $z_{ig}$, and the latent variables $w_{i}$, the complete-data also include the missing data $\mathbf{x}_{i}^{m}$. In the framework of the EM algorithm, the missing data $\mathbf{x}_{i}^{m}$ are treated as random variables that are updated in each iteration. Hence, the complete-data log-likelihood (16) is rewritten with $\mathbf{x}_{i}$ replaced by $(\mathbf{x}_{i}^{o\prime},\mathbf{x}_{i}^{m\prime})'$.
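In code, the observed/missing partition in (17) is just boolean indexing on the missingness pattern of each observation. A minimal sketch, using NaN to flag missing entries (a convention we adopt here for illustration, not one from the paper):

```python
import numpy as np

def partition_by_missing(x, mu, gamma, Sigma):
    """Split mu, gamma, and Sigma into observed/missing blocks as in (17),
    using the NaN pattern of a single observation x."""
    o = ~np.isnan(x)                 # observed coordinates
    m = np.isnan(x)                  # missing coordinates
    return {
        "mu_o": mu[o], "mu_m": mu[m],
        "gamma_o": gamma[o], "gamma_m": gamma[m],
        "S_oo": Sigma[np.ix_(o, o)], "S_om": Sigma[np.ix_(o, m)],
        "S_mo": Sigma[np.ix_(m, o)], "S_mm": Sigma[np.ix_(m, m)],
    }

x = np.array([1.2, np.nan, -0.3, np.nan])
blocks = partition_by_missing(x, np.zeros(4), np.ones(4), np.eye(4))
assert blocks["S_oo"].shape == (2, 2) and blocks["S_om"].shape == (2, 2)
```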
Given (15), we establish the following:

The marginal distribution of $\mathbf{X}_{i}^{o}$ given $z_{ig}=1$ is
$$\mathbf{X}_{i}^{o}\mid z_{ig}=1\sim\mathrm{GHD}_{p^{o}}(\omega_{g},1,\boldsymbol{\mu}_{g}^{o},\boldsymbol{\Sigma}_{g}^{oo},\boldsymbol{\gamma}_{g}^{o},\lambda_{g}),$$
where $p^{o}$ is the dimension corresponding to the observed component $\mathbf{x}_{i}^{o}$, which should strictly be written as $p_{i}^{o}$ but is simplified here.

The conditional distribution of $\mathbf{X}_{i}^{m}$ given $\mathbf{x}_{i}^{o}$ and $z_{ig}=1$, according to Proposition 3, is
$$\mathbf{X}_{i}^{m}\mid\mathbf{x}_{i}^{o},z_{ig}=1\sim\mathrm{GHD}_{p-p^{o}}(\lambda_{ig}^{*},\chi_{ig}^{*},\psi_{ig}^{*},\boldsymbol{\mu}_{ig}^{*},\boldsymbol{\Sigma}_{ig}^{*},\boldsymbol{\gamma}_{ig}^{*}),\qquad(18)$$
where
$$\lambda_{ig}^{*}=\lambda_{g}-\frac{p^{o}}{2},\qquad\chi_{ig}^{*}=\omega_{g}+\delta(\mathbf{x}_{i}^{o},\boldsymbol{\mu}_{g}^{o}\mid\boldsymbol{\Sigma}_{g}^{oo}),\qquad\psi_{ig}^{*}=\omega_{g}+\boldsymbol{\gamma}_{g}^{o\prime}(\boldsymbol{\Sigma}_{g}^{oo})^{-1}\boldsymbol{\gamma}_{g}^{o},$$
$$\boldsymbol{\mu}_{ig}^{*}=\boldsymbol{\mu}_{g}^{m}+\boldsymbol{\Sigma}_{g}^{mo}(\boldsymbol{\Sigma}_{g}^{oo})^{-1}(\mathbf{x}_{i}^{o}-\boldsymbol{\mu}_{g}^{o}),\qquad\boldsymbol{\Sigma}_{ig}^{*}=\boldsymbol{\Sigma}_{g}^{mm}-\boldsymbol{\Sigma}_{g}^{mo}(\boldsymbol{\Sigma}_{g}^{oo})^{-1}\boldsymbol{\Sigma}_{g}^{om},\qquad\boldsymbol{\gamma}_{ig}^{*}=\boldsymbol{\gamma}_{g}^{m}-\boldsymbol{\Sigma}_{g}^{mo}(\boldsymbol{\Sigma}_{g}^{oo})^{-1}\boldsymbol{\gamma}_{g}^{o}.$$

The conditional distribution of $\mathbf{X}_{i}^{m}$ given $w_{i}$, $\mathbf{x}_{i}^{o}$, and $z_{ig}=1$ is
$$\mathbf{X}_{i}^{m}\mid w_{i},\mathbf{x}_{i}^{o},z_{ig}=1\sim\mathcal{N}_{p-p^{o}}\!\left(\boldsymbol{\mu}_{ig}^{*}+w_{i}\boldsymbol{\gamma}_{ig}^{*},\,w_{i}\boldsymbol{\Sigma}_{ig}^{*}\right).\qquad(19)$$

The conditional distribution of $W_{i}$ given $\mathbf{x}_{i}^{o}$ and $z_{ig}=1$ is
$$W_{i}\mid\mathbf{x}_{i}^{o},z_{ig}=1\sim\mathrm{GIG}(\psi_{ig}^{*},\chi_{ig}^{*},\lambda_{ig}^{*}).\qquad(20)$$
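The Gaussian conditional distribution of the missing block given $w_i$ and $\mathbf{x}_i^o$ yields a conditional-mean imputation step. A minimal sketch of that mean computation (the function name and the small numerical example are ours):

```python
import numpy as np

def impute_conditional_mean(x_o, w, mu_o, mu_m, g_o, g_m, S_oo, S_mo):
    """Conditional mean of X^m given w and x^o from the Gaussian conditional:
    X^m | w, x^o ~ N(mu_m + w*g_m + S_mo S_oo^{-1}(x^o - mu_o - w*g_o), w * Schur complement)."""
    reg = S_mo @ np.linalg.inv(S_oo)
    return mu_m + w * g_m + reg @ (x_o - mu_o - w * g_o)

x_o = np.array([1.0])
m = impute_conditional_mean(
    x_o, w=1.5,
    mu_o=np.array([0.0]), mu_m=np.array([0.0]),
    g_o=np.array([0.5]), g_m=np.array([0.2]),
    S_oo=np.array([[1.0]]), S_mo=np.array([[0.4]]),
)
# with these numbers: 0 + 1.5*0.2 + 0.4*(1.0 - 0 - 1.5*0.5) = 0.3 + 0.4*0.25 = 0.4
assert np.allclose(m, [0.4])
```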
After a little algebra, the complete-data log-likelihood function is
$$\ell_{c}(\boldsymbol{\Theta})=\sum_{i=1}^{n}\sum_{g=1}^{G}z_{ig}\left[\log\pi_{g}+\log\phi_{p}\!\left((\mathbf{x}_{i}^{o\prime},\mathbf{x}_{i}^{m\prime})'\mid\boldsymbol{\mu}_{g}+w_{i}\boldsymbol{\gamma}_{g},w_{i}\boldsymbol{\Sigma}_{g}\right)+\log h(w_{i}\mid\omega_{g},1,\lambda_{g})\right].\qquad(21)$$
On the $t$th iteration of the E-step, the expected value of the complete-data log-likelihood is computed given the observed data and the current parameter estimates $\boldsymbol{\Theta}^{(t)}$. That is, we need to compute the conditional expectations $\mathbb{E}[Z_{ig}\mid\mathbf{x}_{i}^{o}]$, $\mathbb{E}[W_{i}\mid\mathbf{x}_{i}^{o},z_{ig}=1]$, $\mathbb{E}[1/W_{i}\mid\mathbf{x}_{i}^{o},z_{ig}=1]$, $\mathbb{E}[\log W_{i}\mid\mathbf{x}_{i}^{o},z_{ig}=1]$, $\mathbb{E}[\mathbf{X}_{i}^{m}\mid\mathbf{x}_{i}^{o},z_{ig}=1]$, and $\mathbb{E}[\mathbf{X}_{i}^{m}\mathbf{X}_{i}^{m\prime}\mid\mathbf{x}_{i}^{o},z_{ig}=1]$.
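On the E-step, the conditional expectations of $W_i$, $1/W_i$, and $\log W_i$ under a GIG distribution follow from (2) and (3); the derivative with respect to the Bessel index has no convenient closed form, so a numerical central difference is a common device. An illustrative sketch (function name and parameter values are ours):

```python
import numpy as np
from scipy.special import kv

def gig_estep_expectations(psi, chi, lam, eps=1e-6):
    """E[W], E[1/W], and E[log W] for W ~ GIG(psi, chi, lambda), via (2) and (3).
    The derivative of log K_lambda with respect to the index is taken numerically."""
    s = np.sqrt(psi * chi)
    r = np.sqrt(chi / psi)
    a = r * kv(lam + 1, s) / kv(lam, s)                       # E[W]
    b = kv(lam + 1, s) / (r * kv(lam, s)) - 2 * lam / chi     # E[1/W]
    dlogK = (np.log(kv(lam + eps, s)) - np.log(kv(lam - eps, s))) / (2 * eps)
    c = np.log(r) + dlogK                                      # E[log W]
    return a, b, c

a, b, c = gig_estep_expectations(psi=3.0, chi=2.0, lam=1.0)
assert a > 0 and b > 0 and np.isfinite(c)
assert a * b >= 1.0   # Cauchy-Schwarz: E[W] E[1/W] >= 1
```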