
# Mixtures of Generalized Hyperbolic Distributions and Mixtures of Skew-t Distributions for Model-Based Clustering with Incomplete Data

Yuhong Wei, Yang Tang and Paul D. McNicholas
Dept. of Mathematics & Statistics, McMaster University, Hamilton, Ontario, Canada.
###### Abstract

Robust clustering from incomplete data is an important topic because, in many practical situations, real data sets are heavy-tailed, asymmetric, and/or have arbitrary patterns of missing observations. Flexible methods and algorithms for model-based clustering are presented via mixtures of generalized hyperbolic distributions and their limiting case, mixtures of multivariate skew-t distributions. An analytically feasible EM algorithm is formulated for parameter estimation and imputation of missing values for mixture models under a missing at random mechanism. The proposed methodologies are investigated through a simulation study with varying proportions of synthetic missing values and illustrated using a real dataset. Comparisons are made with results obtained from the traditional mixture of generalized hyperbolic distributions counterpart, where missing values are filled in via mean imputation.

Keywords: Clustering; generalized hyperbolic; missing data; mixture models; skew-t.

## 1 Introduction

Finite mixture models are powerful and flexible tools for discovering unobserved heterogeneity in multivariate datasets. Assuming no prior knowledge of class labels, the application of finite mixture models in this way is known as model-based clustering. As McNicholas (2016a) points out, the association between mixture models and clustering goes back at least as far as Tiedeman (1955), who uses the former as a means of defining the latter. Gaussian mixture models are historically the most popular tool for model-based clustering and dominated the literature for quite some time (e.g., Celeux and Govaert, 1995; Fraley and Raftery, 1998; McLachlan et al., 2003; Bouveyron et al., 2007; McNicholas and Murphy, 2008, 2010). The multivariate t-distribution, being a heavy-tailed alternative to the multivariate Gaussian distribution, made (robust) mixture modelling based on mixtures of multivariate t-distributions the most natural extension (e.g., Peel and McLachlan, 2000; Andrews and McNicholas, 2011, 2012; Lin et al., 2014). In many practical situations, however, real world datasets exhibit clusters that are not just heavy tailed but also asymmetric; furthermore, clusters can also be asymmetric yet not heavy tailed. Over the past few years, much attention has been paid to non-Gaussian approaches to model-based clustering and classification, including work on multivariate skew-t distributions (e.g., Lin, 2010; Vrbik and McNicholas, 2012; Lee and McLachlan, 2014; Murray et al., 2014a, b, 2017b), shifted asymmetric Laplace distributions (Franczak et al., 2014), multivariate power exponential distributions (Dang et al., 2015), multivariate normal inverse Gaussian distributions (Karlis and Santourian, 2009; O'Hagan et al., 2016), generalized hyperbolic distributions (Browne and McNicholas, 2015; Morris and McNicholas, 2016; Tortora et al., 2016), and hidden truncation hyperbolic distributions (Murray et al., 2017a).
A comprehensive review of model-based clustering work, up to and including some recent work on non-Gaussian mixtures, is given by McNicholas (2016b).

Unobserved or missing observations are frequently a hindrance in multivariate datasets and so developing mixture models that can accommodate incomplete data is an important issue in model-based clustering. The maximum likelihood and Bayesian approaches are two common imputation paradigms for analyzing data with incomplete observations. Herein, the missing data mechanism is assumed to be missing at random (MAR), as per Rubin (1976) and Little and Rubin (1987), meaning that the probability that a variable is missing for a particular individual depends only on the observed data and not on the value of the missing variable. Note that missing completely at random (MCAR) is a special case of MAR. Under MAR, the missing data mechanisms are ignorable for methods using the maximum likelihood approach.

The maximum likelihood approach to clustering incomplete data has been well studied and is often used, particularly for Gaussian mixture models (e.g., Ghahramani and Jordan, 1994; Lin et al., 2006; Browne et al., 2013). Wang et al. (2004) present a framework for maximum likelihood estimation using an expectation-maximization (EM) algorithm (Dempster et al., 1977) to fit a mixture of multivariate t-distributions with arbitrary missing data patterns, which was generalized by Lin et al. (2009) to efficient supervised learning via the parameter expanded (PX-EM) algorithm (Liu et al., 1998) through two auxiliary indicator matrices. Lin (2014) further develops a family of multivariate-t mixture models with 14 eigen-decomposed scale matrices in the presence of missing data through a computationally flexible EM algorithm that incorporates two auxiliary indicator matrices. Wang and Lin (2015) use a formulation of the mixture of skew-t distributions for model-based clustering with missing data.

We consider fitting mixtures of generalized hyperbolic distributions (MGHD) and mixtures of multivariate skew-t distributions (MST) with missing information. In each case, an EM algorithm is used for parameter estimation. The chosen formulation of the (multivariate) generalized hyperbolic distribution (GHD) is that used by Browne and McNicholas (2015) and has formulations of several well-known distributions as special cases, such as the multivariate skew-t, normal inverse Gaussian, variance-gamma, Laplace, and Gaussian distributions (cf. McNeil et al., 2005). In addition to considering missing data, we develop families of MGHD and MST mixture models, each with 14 parsimonious eigen-decomposed scale matrices corresponding to the famous Gaussian parsimonious clustering models (GPCMs) of Banfield and Raftery (1993) and Celeux and Govaert (1995).

## 2 Background

### 2.1 Generalized Inverse Gaussian Distribution

A random variable W is said to have a generalized inverse Gaussian (GIG) distribution, introduced by Good (1953), with parameters λ, χ, and ψ if its probability density function is given by

 f_GIG(w | λ, χ, ψ) = ( (ψ/χ)^{λ/2} w^{λ−1} / ( 2 K_λ(√(ψχ)) ) ) exp{ −(ψw + χ/w)/2 }, (1)

where w > 0, χ, ψ > 0, and K_λ is the modified Bessel function of the third kind with index λ. Herein, we write W ∼ GIG(λ, χ, ψ) to indicate that a random variable has the GIG density as parameterized in (1). The GIG distribution has some attractive properties (Barndorff-Nielsen and Halgreen, 1977; Blæsild, 1978; Halgreen, 1979; Jørgensen, 1982), including the tractability of the expectations:

 E[W^α] = (χ/ψ)^{α/2} K_{λ+α}(√(ψχ)) / K_λ(√(ψχ)), (2)

for α ∈ ℝ, and

 E[log W] = log( √(χ/ψ) ) + (∂/∂λ) log( K_λ(√(ψχ)) ). (3)

Specifically, for α = 1 and α = −1, we have

 E[W] = √(χ/ψ) K_{λ+1}(√(ψχ)) / K_λ(√(ψχ)),
 E[1/W] = √(ψ/χ) K_{λ−1}(√(ψχ)) / K_λ(√(ψχ)) = √(ψ/χ) K_{λ+1}(√(ψχ)) / K_λ(√(ψχ)) − 2λ/χ.
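For numerical work, these Bessel-function moment formulas are straightforward to evaluate. The following Python sketch (an illustration only; the function names are hypothetical and SciPy's `kv` is assumed for K_λ) encodes E[W^α] from (2) and the second expression for E[1/W]:

```python
import numpy as np
from scipy.special import kv  # modified Bessel function of the third kind, K_lambda


def gig_moment(alpha, lam, chi, psi):
    """E[W^alpha] for W ~ GIG(lam, chi, psi), per equation (2)."""
    s = np.sqrt(psi * chi)
    return (chi / psi) ** (alpha / 2) * kv(lam + alpha, s) / kv(lam, s)


def gig_mean_inv(lam, chi, psi):
    """E[1/W] via the identity sqrt(psi/chi) K_{lam+1}(s)/K_lam(s) - 2 lam/chi."""
    s = np.sqrt(psi * chi)
    return np.sqrt(psi / chi) * kv(lam + 1, s) / kv(lam, s) - 2 * lam / chi
```

The equality of the two expressions for E[1/W] follows from the Bessel recurrence K_{λ+1}(z) = K_{λ−1}(z) + (2λ/z) K_λ(z).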

Browne and McNicholas (2015) introduce another parameterization of the GIG distribution by setting ω = √(ψχ) and η = √(χ/ψ). Write W ∼ I(λ, η, ω); its density is given by

 f_I(w | λ, η, ω) = ( (w/η)^{λ−1} / ( 2η K_λ(ω) ) ) exp{ −(ω/2)( w/η + η/w ) } (4)

for w > 0, where η > 0 is a scale parameter and ω > 0 is a concentration parameter. These two parameterizations of the GIG distribution are important ingredients for building the generalized hyperbolic distribution presented later.
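The correspondence between the two parameterizations can be sketched numerically. The helper names below are hypothetical, and SciPy's `kv` is assumed; the densities are direct transcriptions of (1) and (4):

```python
import numpy as np
from scipy.special import kv


def gig_pdf(w, lam, chi, psi):
    """GIG density in the (lam, chi, psi) parameterization, equation (1)."""
    s = np.sqrt(psi * chi)
    return (psi / chi) ** (lam / 2) * w ** (lam - 1) / (2 * kv(lam, s)) \
        * np.exp(-(psi * w + chi / w) / 2)


def i_pdf(w, lam, eta, omega):
    """GIG density in the (lam, eta, omega) parameterization, equation (4)."""
    return (w / eta) ** (lam - 1) / (2 * eta * kv(lam, omega)) \
        * np.exp(-(omega / 2) * (w / eta + eta / w))


def to_eta_omega(chi, psi):
    """Map (chi, psi) to the scale/concentration pair (eta, omega)."""
    return np.sqrt(chi / psi), np.sqrt(chi * psi)
```

Substituting η = √(χ/ψ) and ω = √(ψχ) into (4) recovers (1) exactly, so the two functions agree pointwise.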

### 2.2 Generalized Hyperbolic Distribution

Several alternative parameterizations of the GHD have appeared in the literature, e.g., Barndorff-Nielsen and Blæsild (1981), McNeil et al. (2005), and Browne and McNicholas (2015). Barndorff-Nielsen (1977) introduces the generalized hyperbolic distribution (GHD) to model the distribution of sand grain sizes, and subsequent reports describe its statistical properties (e.g., Barndorff-Nielsen, 1978; Barndorff-Nielsen and Blæsild, 1981). However, under this standard parameterization, the parameters of the mixing distribution are not invariant under affine transformations. An important innovation was made by McNeil et al. (2005), who give a new parameterization of the GHD. Under this new parameterization, a linear transformation of a GHD random vector remains in the same sub-family characterized by the parameters of the mixing distribution. However, an identifiability issue arises under this parameterization. To solve this problem, Browne and McNicholas (2015) give an alternative parameterization.

Following McNeil et al. (2005), a random vector X is said to follow a generalized hyperbolic distribution with index parameter λ, concentration parameters χ and ψ, location vector μ, dispersion matrix Σ, and skewness vector α, denoted by X ∼ GH_p(λ, χ, ψ, μ, Σ, α), if it can be represented by

 X = μ + Wα + √W U, U ⊥ W, (5)

where U ∼ N(0, Σ), W ∼ GIG(λ, χ, ψ), the symbol ⊥ indicates independence, and it follows that X | W = w ∼ N(μ + wα, wΣ). So, the density of the generalized hyperbolic random vector X is given by

 f(x | ϑ) = [ (χ + δ(x, μ | Σ)) / (ψ + α^⊺Σ^{−1}α) ]^{(λ−p/2)/2} × ( (ψ/χ)^{λ/2} K_{λ−p/2}( √( (χ + δ(x, μ | Σ))(ψ + α^⊺Σ^{−1}α) ) ) / ( (2π)^{p/2} |Σ|^{1/2} K_λ(√(χψ)) ) ) exp{ (x − μ)^⊺Σ^{−1}α }, (6)

where δ(x, μ | Σ) = (x − μ)^⊺Σ^{−1}(x − μ) is the squared Mahalanobis distance between x and μ, K_{λ−p/2} is the modified Bessel function of the third kind with index λ − p/2, and ϑ = (λ, χ, ψ, μ, Σ, α) denotes the model parameters. The mean and covariance matrix of X are

 E(X) = μ + E(W)α and Var(X) = E(W)Σ + Var(W)αα^⊺, (7)

respectively, where E(W) and Var(W) are the mean and variance of the random variable W, respectively.
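The representation (5) and the moment identities (7) can be checked by simulation. The sketch below assumes SciPy's `geninvgauss`, which implements the η = 1 case of the GIG, so a GIG(λ, χ, ψ) draw is obtained by scaling an η = 1 variate by η = √(χ/ψ); all variable names are illustrative:

```python
import numpy as np
from scipy.special import kv
from scipy.stats import geninvgauss

rng = np.random.default_rng(1)
lam, chi, psi = 2.0, 1.0, 2.0
mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])
alpha = np.array([0.5, 0.2])

# W ~ GIG(lam, chi, psi): scale the eta = 1 variate by eta = sqrt(chi/psi)
eta, omega = np.sqrt(chi / psi), np.sqrt(chi * psi)
n = 200_000
W = eta * geninvgauss.rvs(lam, omega, size=n, random_state=rng)
U = rng.multivariate_normal(np.zeros(2), Sigma, size=n)
X = mu + W[:, None] * alpha + np.sqrt(W)[:, None] * U  # representation (5)

# Analytic moments of W from (2)
EW = np.sqrt(chi / psi) * kv(lam + 1, omega) / kv(lam, omega)
EW2 = (chi / psi) * kv(lam + 2, omega) / kv(lam, omega)
VW = EW2 - EW ** 2

mean_theory = mu + EW * alpha                          # E(X) in (7)
cov_theory = EW * Sigma + VW * np.outer(alpha, alpha)  # Var(X) in (7)
```

With a fixed seed, the sample mean and covariance of `X` agree with `mean_theory` and `cov_theory` up to Monte Carlo error.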

Note that, in this parameterization, we need to hold |Σ| = 1 to ensure identifiability. Using |Σ| = 1 solves the identifiability problem but would be prohibitively restrictive for model-based clustering and classification applications. Hence, Browne and McNicholas (2015) develop a new parameterization of the GHD with index parameter λ, concentration parameter ω, location vector μ, dispersion matrix Σ, and skewness vector β, denoted by X ∼ GHD_p(λ, ω, μ, Σ, β). Note that this corresponds to fixing η = 1 in the mixing distribution I(λ, η, ω). This formulation is given by

 X = μ + Wβ + √W U, (8)

where U ∼ N(0, Σ) and W ∼ I(λ, 1, ω), with U ⊥ W, and X | W = w ∼ N(μ + wβ, wΣ). Under this parameterization, the density of the generalized hyperbolic random vector X is

 f(x | ϑ) = [ (ω + δ(x, μ | Σ)) / (ω + β^⊺Σ^{−1}β) ]^{(λ−p/2)/2} × ( K_{λ−p/2}( √( (ω + δ(x, μ | Σ))(ω + β^⊺Σ^{−1}β) ) ) / ( (2π)^{p/2} |Σ|^{1/2} K_λ(ω) ) ) exp{ (x − μ)^⊺Σ^{−1}β }, (9)

where δ(x, μ | Σ) and K_{λ−p/2} are as described earlier, and ϑ = (λ, ω, μ, Σ, β) denotes the model parameters. We use this parameterization when we describe parameter estimation (cf. Section 3).

The following result shows an appealing closure property of the generalized hyperbolic distribution under affine transformation and conditioning as well as the formation of marginal distributions, which is useful for developing the new methods presented later. Suppose that X is a p-dimensional random vector having a generalized hyperbolic distribution as in (9), i.e., X ∼ GHD_p(λ, ω, μ, Σ, β). Assume that X is partitioned as X = (X₁^⊺, X₂^⊺)^⊺, where X₁ takes values in ℝ^{d₁} and X₂ in ℝ^{d₂} = ℝ^{p−d₁}, with

 μ = (μ₁^⊺, μ₂^⊺)^⊺, β = (β₁^⊺, β₂^⊺)^⊺, Σ = ( Σ₁₁ Σ₁₂ ; Σ₂₁ Σ₂₂ ),

where μ₁, β₁, and Σ₁₁ have dimensions conforming to X₁, and μ₂, β₂, and Σ₂₂ have similar partitions conforming to X₂. Furthermore, Σ₁₁ is d₁ × d₁ and Σ₂₂ is d₂ × d₂.

###### Proposition 1.

Affine transformation of the generalized hyperbolic distribution. If X ∼ GHD_p(λ, ω, μ, Σ, β) and Y = BX + b, where B ∈ ℝ^{k×p} and b ∈ ℝ^k, then

 Y ∼ GHD_k(λ, ω, Bμ + b, BΣB^⊺, Bβ). (10)
###### Proof.

The result follows by substituting (8) into Y = BX + b. ∎

###### Proposition 2.

The marginal distribution of X₁ is a generalized hyperbolic distribution as in (9) with index parameter λ, concentration parameter ω, location vector μ₁, dispersion matrix Σ₁₁, and skewness vector β₁, i.e., X₁ ∼ GHD_{d₁}(λ, ω, μ₁, Σ₁₁, β₁).

###### Proof.

The result follows by applying Proposition 1 and choosing B = [I_{d₁} O] and b = 0. The parameters inherited from the mixing distribution remain the same under the affine transformation and the marginal distribution. ∎

###### Proposition 3.

The conditional distribution of X₂ given X₁ = x₁ is a generalized hyperbolic distribution as in (6), i.e., X₂ | x₁ ∼ GH_{d₂}(λ_{2|1}, χ_{2|1}, ψ_{2|1}, μ_{2|1}, Σ_{2|1}, β_{2|1}), where

 λ_{2|1} = λ − d₁/2,
 χ_{2|1} = ω + (x₁ − μ₁)^⊺Σ₁₁^{−1}(x₁ − μ₁),
 ψ_{2|1} = ω + β₁^⊺Σ₁₁^{−1}β₁,
 μ_{2|1} = μ₂ + Σ₁₂^⊺Σ₁₁^{−1}(x₁ − μ₁),
 Σ_{2|1} = Σ₂₂ − Σ₁₂^⊺Σ₁₁^{−1}Σ₁₂,
 β_{2|1} = β₂ − Σ₁₂^⊺Σ₁₁^{−1}β₁.

The proof of Proposition 3 is given in Appendix B.
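Computationally, Proposition 3 amounts to a Schur-complement calculation on the partitioned parameters. The following Python sketch (function name and dictionary layout hypothetical) assembles the conditional parameters:

```python
import numpy as np


def ghd_conditional_params(x1, lam, omega, mu, Sigma, beta, d1):
    """Parameters of X2 | X1 = x1 from Proposition 3 (a GHD as in (6))."""
    mu1, mu2 = mu[:d1], mu[d1:]
    b1, b2 = beta[:d1], beta[d1:]
    S11, S12 = Sigma[:d1, :d1], Sigma[:d1, d1:]
    S22 = Sigma[d1:, d1:]
    S11_inv = np.linalg.inv(S11)
    r = x1 - mu1
    return {
        "lam": lam - d1 / 2,
        "chi": omega + r @ S11_inv @ r,
        "psi": omega + b1 @ S11_inv @ b1,
        "mu": mu2 + S12.T @ S11_inv @ r,
        "Sigma": S22 - S12.T @ S11_inv @ S12,  # Schur complement of Sigma_11
        "beta": b2 - S12.T @ S11_inv @ b1,
    }
```

Note that the location, dispersion, and skewness updates are exactly the Gaussian conditional-distribution formulas; only the mixing parameters change.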

### 2.3 The Multivariate Skew-t Distribution

There are several alternative formulations of multivariate skew-t distributions in the literature (e.g., Branco and Dey, 2001; Sahu, Dey, and Branco, 2003; Murray, Browne, and McNicholas, 2014a; Lee and McLachlan, 2014). Lin and Lin (2011) develop a mixture of multivariate skew-t distributions for incomplete data using the formulation of Sahu et al. (2003). Herein, the formulation of the multivariate skew-t distribution arising from the generalized hyperbolic distribution is used. This formulation of the multivariate skew-t distribution has been used by Murray et al. (2014a) to develop a mixture of skew-t factor analyzers model.

Following McNeil et al. (2005), a p × 1 random vector X is said to follow a multivariate skew-t distribution with degrees of freedom parameter ν, location vector μ, dispersion matrix Σ, and skewness vector β, denoted by X ∼ ST_p(ν, μ, Σ, β), if it can be represented by

 X = μ + Wβ + √W U, (11)

where U ∼ N(0, Σ) and W ∼ IG(ν/2, ν/2), with IG denoting the inverse Gamma distribution. It follows that X | W = w ∼ N(μ + wβ, wΣ) and the pdf of the multivariate skew-t random vector X is given by

 f(x | ϑ) = [ (ν + δ(x, μ | Σ)) / (β^⊺Σ^{−1}β) ]^{(−ν−p)/4} × ( ν^{ν/2} K_{(−ν−p)/2}( √( (β^⊺Σ^{−1}β)(ν + δ(x, μ | Σ)) ) ) exp{ (x − μ)^⊺Σ^{−1}β } / ( (2π)^{p/2} |Σ|^{1/2} Γ(ν/2) 2^{ν/2−1} ) ). (12)

This formulation of the multivariate skew-t distribution can be obtained as a limiting case of the generalized hyperbolic distribution by setting λ = −ν/2 and χ = ν, and letting ψ → 0. Similarly, this formulation of the multivariate skew-t distribution has a closed form under affine transformation and conditioning, and under the formation of marginal distributions, which is useful for developing the new methods presented later. Suppose that X is a p-dimensional random vector having the multivariate skew-t distribution as in (12), i.e., X ∼ ST_p(ν, μ, Σ, β). Assume that X is partitioned as X = (X₁^⊺, X₂^⊺)^⊺, where X₁ takes values in ℝ^{d₁} and X₂ in ℝ^{d₂} = ℝ^{p−d₁}, with

 μ = (μ₁^⊺, μ₂^⊺)^⊺, β = (β₁^⊺, β₂^⊺)^⊺, Σ = ( Σ₁₁ Σ₁₂ ; Σ₂₁ Σ₂₂ ),

where μ₁, β₁, and Σ₁₁ have dimensions conforming to X₁, and μ₂, β₂, and Σ₂₂ have similar partitions conforming to X₂. Furthermore, Σ₁₁ is d₁ × d₁ and Σ₂₂ is d₂ × d₂.

###### Proposition 4.

Affine transformation of the multivariate skew-t distribution. If X ∼ ST_p(ν, μ, Σ, β) and Y = BX + b, where B ∈ ℝ^{k×p} and b ∈ ℝ^k, then

 Y ∼ ST_k(ν, Bμ + b, BΣB^⊺, Bβ). (13)
###### Proof.

The proof follows easily by substituting (11) into Y = BX + b. ∎

###### Proposition 5.

The marginal distribution of X₁ is a multivariate skew-t distribution as in (12) with degrees of freedom parameter ν, location vector μ₁, dispersion matrix Σ₁₁, and skewness vector β₁, i.e., X₁ ∼ ST_{d₁}(ν, μ₁, Σ₁₁, β₁).

###### Proof.

The proof follows easily by applying Proposition 4 and choosing B = [I_{d₁} O] and b = 0. The degrees of freedom parameter inherited from the mixing distribution remains invariant under the affine transformation and the marginal distribution. ∎

###### Proposition 6.

The conditional distribution of X₂ given X₁ = x₁ is a generalized hyperbolic distribution as in (6), i.e., X₂ | x₁ ∼ GH_{d₂}(λ_{2|1}, χ_{2|1}, ψ_{2|1}, μ_{2|1}, Σ_{2|1}, β_{2|1}), where

 λ_{2|1} = −(ν + d₁)/2,
 χ_{2|1} = ν + (x₁ − μ₁)^⊺Σ₁₁^{−1}(x₁ − μ₁),
 ψ_{2|1} = β₁^⊺Σ₁₁^{−1}β₁,
 μ_{2|1} = μ₂ + Σ₁₂^⊺Σ₁₁^{−1}(x₁ − μ₁),
 Σ_{2|1} = Σ₂₂ − Σ₁₂^⊺Σ₁₁^{−1}Σ₁₂,
 β_{2|1} = β₂ − Σ₁₂^⊺Σ₁₁^{−1}β₁.

The proof of Proposition 6 is similar to that of Proposition 3 and is hence omitted. Similar results to Propositions 4, 5, and 6 have been obtained by Arellano-Valle and Genton (2010).

## 3 MGHD with Incomplete Data

Let X₁, …, X_n be p-dimensional random variables arising from a heterogeneous population with G disjoint MGHD subpopulations. That is, each X_i has the density

 f_MGHD(x_i | Θ) = Σ_{g=1}^G π_g f_GHD(x_i | λ_g, ω_g, μ_g, Σ_g, β_g), (14)

where π_g > 0, such that Σ_{g=1}^G π_g = 1, are the mixing proportions, Θ denotes the model parameters, and f_GHD(·) is the GHD density defined in (9).

To apply the MGHD model (14) in the clustering paradigm, introduce component indicators z_i = (z_i1, …, z_iG), where z_ig = 1 if observation i is in component g and z_ig = 0 otherwise. The corresponding random variable Z_i follows a multinomial distribution with one trial and cell probabilities π₁, …, π_G, i.e., Z_i ∼ M(1; π₁, …, π_G).

A three-level hierarchical representation of the MGHD model (14) can be expressed by

 X_i | w_ig, z_ig = 1 ∼ N(μ_g + w_ig β_g, w_ig Σ_g),
 W_ig | z_ig = 1 ∼ I(λ_g, η = 1, ω_g), (15)
 Z_i ∼ M(1; π₁, …, π_G).
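The hierarchy (15) also gives a direct recipe for simulating from the MGHD model. A sketch (function name hypothetical; SciPy's `geninvgauss` is assumed to supply the I(λ, 1, ω) draws):

```python
import numpy as np
from scipy.stats import geninvgauss


def sample_mghd(n, pis, lams, omegas, mus, Sigmas, betas, rng):
    """Draw n observations via the three-level hierarchy (15)."""
    G = len(pis)
    z = rng.choice(G, size=n, p=pis)  # Z_i ~ M(1; pi_1, ..., pi_G)
    p = mus[0].shape[0]
    X = np.empty((n, p))
    for g in range(G):
        idx = np.where(z == g)[0]
        # W_ig | z_ig = 1 ~ I(lam_g, eta = 1, omega_g)
        w = geninvgauss.rvs(lams[g], omegas[g], size=idx.size, random_state=rng)
        u = rng.multivariate_normal(np.zeros(p), Sigmas[g], size=idx.size)
        # X_i | w_ig, z_ig = 1 ~ N(mu_g + w beta_g, w Sigma_g)
        X[idx] = mus[g] + w[:, None] * betas[g] + np.sqrt(w)[:, None] * u
    return X, z
```

Each component mean should then be close to μ_g + E(W)β_g, with E(W) = K_{λ_g+1}(ω_g)/K_{λ_g}(ω_g) from (2) with η = 1.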

The complete-data consist of the observed x_i together with the missing group memberships z_ig and the latent w_ig, for i = 1, …, n and g = 1, …, G, and the complete-data log-likelihood is given by

 l_c(Θ) = Σ_{i=1}^n Σ_{g=1}^G z_ig [ log π_g + log φ(x_i | μ_g + w_ig β_g, w_ig Σ_g) + log h_I(w_ig | λ_g, ω_g) ], (16)

where φ(· | μ, Σ) denotes the multivariate Gaussian density and h_I(· | λ, ω) denotes the density in (4) with η = 1.

Browne and McNicholas (2015) present an EM algorithm for parameter estimation with the MGHD when there are no missing data in x₁, …, x_n. We are interested in parameter estimation for the MGHD model (14) when x₁, …, x_n are partially observed with arbitrary missing patterns. The missing data mechanism is assumed to be MAR. Assume now that we split x_i into two components, x_i^o and x_i^m, that denote the observed and missing components of x_i, respectively. In general, each data vector x_i may have a different pattern of missing features, i.e., x_i = (x_i^{o_i}, x_i^{m_i}), but the notation is simplified for the sake of clarity.

For each x_i, partition the mean vector μ_g = (μ_{g,i}^o, μ_{g,i}^m), where μ_{g,i}^o and μ_{g,i}^m denote the sub-vectors of μ_g matching the observed and missing components of x_i, respectively. Similarly, partition the skewness vector β_g = (β_{g,i}^o, β_{g,i}^m) and the covariance matrix as

 Σ_g = ( Σ_{g,i}^{oo} Σ_{g,i}^{om} ; Σ_{g,i}^{mo} Σ_{g,i}^{mm} ), with Σ_g^{−1} partitioned conformably, (17)

corresponding to x_i = (x_i^o, x_i^m). As a result, in addition to the observed x_i^o, the missing group memberships z_ig, and the latent variables w_ig, the complete-data also include the missing components x_i^m. In the framework of the EM algorithm, the missing data x_i^m are treated as random variables that are updated at each iteration. Hence, the complete-data log-likelihood (16) is rewritten as

 l_c(Θ) = Σ_{i=1}^n Σ_{g=1}^G z_ig [ log π_g + log φ(x_i^o, x_i^m | μ_g + w_ig β_g, w_ig Σ_g) + log h_I(w_ig | λ_g, ω_g) ].

Given (15), we establish the following:

• The marginal distribution of X_i^o is

 X_i^o ∼ Σ_{g=1}^G π_g f_{GHD, p_i^o}(λ_g, ω_g, μ_{g,i}^o, Σ_{g,i}^{oo}, β_{g,i}^o),

where p_i^o is the dimension of the observed component x_i^o; strictly, this should be written p_i^{o_i}, but the notation is simplified here.

• The conditional distribution of X_i^m given x_i^o and z_ig = 1, according to Proposition 3, is

 X_i^m | x_i^o, z_ig = 1 ∼ GH_{p−p_i^o}(λ_{g,i}^{m|o}, χ_{g,i}^{m|o}, ψ_{g,i}^{m|o}, μ_{g,i}^{m|o}, Σ_{g,i}^{m|o}, β_{g,i}^{m|o}), (18)

where

 λ_{g,i}^{m|o} = λ_g − p_i^o/2,
 χ_{g,i}^{m|o} = ω_g + (x_i^o − μ_{g,i}^o)^⊺(Σ_{g,i}^{oo})^{−1}(x_i^o − μ_{g,i}^o),
 ψ_{g,i}^{m|o} = ω_g + (β_{g,i}^o)^⊺(Σ_{g,i}^{oo})^{−1}β_{g,i}^o,
 μ_{g,i}^{m|o} = μ_{g,i}^m + (Σ_{g,i}^{om})^⊺(Σ_{g,i}^{oo})^{−1}(x_i^o − μ_{g,i}^o),
 Σ_{g,i}^{m|o} = Σ_{g,i}^{mm} − (Σ_{g,i}^{om})^⊺(Σ_{g,i}^{oo})^{−1}Σ_{g,i}^{om},
 β_{g,i}^{m|o} = β_{g,i}^m − (Σ_{g,i}^{om})^⊺(Σ_{g,i}^{oo})^{−1}β_{g,i}^o.
• The conditional distribution of X_i^m given x_i^o, w_ig, and z_ig = 1 is

 X_i^m | x_i^o, w_ig, z_ig = 1 ∼ N_{p−p_i^o}(μ_{g,i}^{m|o} + w_ig β_{g,i}^{m|o}, w_ig Σ_{g,i}^{m|o}). (19)
• The conditional distribution of W_ig given x_i^o and z_ig = 1 is

 W_ig | x_i^o, z_ig = 1 ∼ GIG( λ_g − p_i^o/2, ω_g + δ(x_i^o, μ_{g,i}^o | Σ_{g,i}^{oo}), ω_g + (β_{g,i}^o)^⊺(Σ_{g,i}^{oo})^{−1}β_{g,i}^o ), (20)

with arguments ordered (λ, χ, ψ) as in (1).
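For an arbitrary missing pattern, the conditional parameters in (18) can be assembled directly from a boolean mask over the coordinates. The following sketch is hypothetical (names and the dictionary layout are illustrative):

```python
import numpy as np


def conditional_missing_params(x, obs, lam_g, omega_g, mu_g, Sigma_g, beta_g):
    """Parameters of X_i^m | x_i^o, z_ig = 1 in (18), for an arbitrary
    missing pattern. obs is a boolean mask marking the observed coordinates."""
    mis = ~obs
    Soo = Sigma_g[np.ix_(obs, obs)]
    Som = Sigma_g[np.ix_(obs, mis)]
    Smm = Sigma_g[np.ix_(mis, mis)]
    Soo_inv = np.linalg.inv(Soo)
    r = x[obs] - mu_g[obs]
    po = int(obs.sum())
    return {
        "lam": lam_g - po / 2,
        "chi": omega_g + r @ Soo_inv @ r,
        "psi": omega_g + beta_g[obs] @ Soo_inv @ beta_g[obs],
        "mu": mu_g[mis] + Som.T @ Soo_inv @ r,
        "Sigma": Smm - Som.T @ Soo_inv @ Som,
        "beta": beta_g[mis] - Som.T @ Soo_inv @ beta_g[obs],
    }
```

Because only `x[obs]` is ever read, missing entries of `x` can be stored as `np.nan` without affecting the result.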

After a little algebra, the complete-data log-likelihood function can be written as

 l_c(Θ) = Σ_{i=1}^n Σ_{g=1}^G z_ig [ log π_g − (1/2) log|Σ_g| + (λ_g − p/2 − 1) log w_ig − log K_{λ_g}(ω_g) − (1/(2w_ig))( ω_g + δ(x_i, μ_g | Σ_g) ) + (x_i − μ_g)^⊺Σ_g^{−1}β_g − (w_ig/2)( ω_g + β_g^⊺Σ_g^{−1}β_g ) ] + C, (21)

where x_i = (x_i^o, x_i^m) and C is a constant not depending on the parameters.

On the kth iteration of the E-step, the expected value of the complete-data log-likelihood is computed given the observed data x_1^o, …, x_n^o and the current parameter estimates Θ^(k). That is, we need to compute the conditional expectations of Z_ig, of W_ig, 1/W_ig, and log W_ig, and of the terms involving the missing data X_i^m.

First, let ẑ_ig^(k) denote the a posteriori probability that the i-th observation belongs to the g-th component of the mixture, based on the observed data:

 ẑ_ig^(k) := E(Z_ig | x_i^o, Θ^(k)) = π_g^(k) f_{GHD, p_i^o}(x_i^o; λ_g^(k), ω_g^(k), μ_{g,i}^{o(k)}, Σ_{g,i}^{oo(k)}, β_{g,i}^{o(k)}) / Σ_{l=1}^G π_l^(k) f_{GHD, p_i^o}(x_i^o; λ_l^(k), ω_l^(k), μ_{l,i}^{o(k)}, Σ_{l,i}^{oo(k)}, β_{l,i}^{o(k)}).

Given (2), (3), and (20), we have the following expectations for the latent variable W_ig:

 a_ig^(k) := E(W_ig | x_i^o, z_ig = 1; Θ^(k)) = √( χ_ig^(k)/ψ_ig^(k) ) K_{λ_ig^(k)+1}(u_ig^(k)) / K_{λ_ig^(k)}(u_ig^(k)),
 b_ig^(k) := E(1/W_ig | x_i^o, z_ig = 1; Θ^(k)) = −2λ_ig^(k)/χ_ig^(k) + √( ψ_ig^(k)/χ_ig^(k) ) K_{λ_ig^(k)+1}(u_ig^(k)) / K_{λ_ig^(k)}(u_ig^(k)),
 c_ig^(k) := E(log W_ig | x_i^o, z_ig = 1; Θ^(k)) = log √( χ_ig^(k)/ψ_ig^(k) ) + (∂/∂λ) log K_λ(u_ig^(k)) |_{λ = λ_ig^(k)},

where λ_ig^(k) = λ_g^(k) − p_i^o/2, χ_ig^(k) = ω_g^(k) + δ(x_i^o, μ_{g,i}^{o(k)} | Σ_{g,i}^{oo(k)}), ψ_ig^(k) = ω_g^(k) + (β_{g,i}^{o(k)})^⊺(Σ_{g,i}^{oo(k)})^{−1}β_{g,i}^{o(k)}, and u_ig^(k) = √( χ_ig^(k) ψ_ig^(k) ).