# Mixtures of Skewed Matrix Variate Bilinear Factor Analyzers

Michael P.B. Gallaugher and Paul D. McNicholas
Dept. of Mathematics & Statistics, McMaster University, Hamilton, Ontario, Canada.
###### Abstract

In recent years, data have become increasingly high dimensional and, therefore, an increased need has arisen for dimension reduction techniques for clustering. Although such techniques are firmly established in the literature for multivariate data, there is a relative paucity in the area of matrix variate, or three-way, data. Furthermore, the few methods that are available all assume matrix variate normality, which is not always sensible if cluster skewness or excess kurtosis is present. Mixtures of bilinear factor analyzers using skewed matrix variate distributions are proposed. In all, four such mixture models are presented, based on matrix variate skew-$t$, generalized hyperbolic, variance-gamma, and normal inverse Gaussian distributions, respectively.

Keywords: Clustering; factor analysis; kurtosis; skewed; matrix variate distribution; mixture models.

## 1 Introduction

Classification is the process of finding and analyzing underlying group structure in heterogeneous data. This problem can be framed as the search for class labels of unlabelled observations. In general, some (non-trivial) proportion of observations have known labels. A special case of classification, known as clustering, occurs when none of the observations have known labels. One common approach for clustering is mixture model-based clustering, which makes use of a finite mixture model. In general, a $G$-component finite mixture model assumes that a multivariate random variable $\mathbf{X}$ has density

$$f(\mathbf{x} \mid \boldsymbol{\vartheta}) = \sum_{g=1}^{G} \pi_g f_g(\mathbf{x} \mid \boldsymbol{\theta}_g), \tag{1}$$

where $\boldsymbol{\vartheta} = (\pi_1, \ldots, \pi_G, \boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_G)$, $f_g$ is the $g$th component density, and $\pi_g > 0$ is the $g$th mixing proportion such that $\sum_{g=1}^{G} \pi_g = 1$. Note that the notation used in (1) corresponds to the multivariate case and, save for Appendix A, $\mathscr{X}$ will hereafter represent a matrix variate random variable with $\mathbf{X}$ denoting its realization.

McNicholas (2016) traces the relationship between mixture models and clustering to Tiedeman (1955), who uses a component of a mixture model to define a cluster. The mixture model was first used for clustering by Wolfe (1965), who considered a Gaussian mixture model. Other early uses of Gaussian mixture models for clustering can be found in Baum et al. (1970) and Scott & Symons (1971). Although the Gaussian mixture model is attractive due to its mathematical properties, it can be problematic when dealing with outliers and asymmetry in clusters, and thus there has been an interest in non-Gaussian mixtures in the multivariate setting. Some examples of mixtures of symmetric distributions that parameterize tail weight include the $t$ distribution (Peel & McLachlan 2000, Andrews & McNicholas 2011, 2012, Lin et al. 2014) and the power exponential distribution (Dang et al. 2015). There has also been work in the area of skewed distributions such as the skew-$t$ distribution (Lin 2010, Vrbik & McNicholas 2012, 2014, Lee & McLachlan 2014, Murray, Browne & McNicholas 2014, Murray, McNicholas & Browne 2014), the normal inverse Gaussian (NIG) distribution (Karlis & Santourian 2009), the shifted asymmetric Laplace (SAL) distribution (Morris & McNicholas 2013, Franczak et al. 2014), the variance-gamma distribution (McNicholas et al. 2017), the generalized hyperbolic distribution (Browne & McNicholas 2015), the hidden truncation hyperbolic distribution (Murray et al. 2017), the joint generalized hyperbolic distribution (Tang et al. 2018), and the coalesced generalized hyperbolic distribution (Tortora et al. 2019).

There has also been an increased interest in model-based clustering of matrix variate, or three-way, data such as multivariate longitudinal data and images. Examples include the work of Viroli (2011a) and Anderlucci et al. (2015), who consider mixtures of matrix variate normal distributions for clustering. Viroli (2011b) further considers model-based clustering with matrix variate normal distributions in the Bayesian paradigm. More recently, Gallaugher & McNicholas (2018a) investigate mixtures of four skewed matrix variate distributions, namely the matrix variate skew-$t$, generalized hyperbolic, variance-gamma, and normal inverse Gaussian (NIG) distributions (see Gallaugher & McNicholas 2019, for details about these matrix variate distributions), and consider classification of greyscale Arabic numerals. Melnykov & Zhu (2018, 2019) consider modelling skewness by means of transformations.

The main problem with all of the aforementioned methods, for both the multivariate and matrix variate cases, arises when the dimensionality of the data increases. Although the problem of dealing with high-dimensional data has been thoroughly addressed in the case of multivariate data, there is a relative paucity of work for matrix variate data. In the matrix variate case, matrix variate bilinear probabilistic principal component analysis was developed by Zhao et al. (2012). More recently, Gallaugher & McNicholas (2018b) considered the closely-related mixture of matrix variate bilinear factor analyzers (MMVBFA) model for clustering. The MMVBFA model can be viewed as a matrix variate generalization of the mixture of factor analyzers model (Ghahramani & Hinton 1997) in the multivariate case. Although these methods allow for simultaneous dimension reduction and clustering, they both assume matrix variate normality, which is not sensible if cluster skewness or heavy tails are present. Herein we present an extension of the MMVBFA model to skewed distributions; specifically, the matrix variate skew-$t$, generalized hyperbolic, variance-gamma, and NIG distributions.

## 2 Background

### 2.1 Generalized Inverse Gaussian Distribution

The generalized inverse Gaussian distribution has two different parameterizations, both of which will be utilized herein. A random variable $Y$ has a generalized inverse Gaussian (GIG) distribution parameterized by $a, b \in \mathbb{R}^+$ and $\lambda \in \mathbb{R}$, denoted $\text{GIG}(a, b, \lambda)$, if its probability density function can be written as

$$f(y \mid a, b, \lambda) = \frac{(a/b)^{\lambda/2}\, y^{\lambda - 1}}{2 K_\lambda(\sqrt{ab})} \exp\left\{-\frac{ay + b/y}{2}\right\},$$

for $y > 0$, $a, b \in \mathbb{R}^+$, and $\lambda \in \mathbb{R}$, where

$$K_\lambda(u) = \frac{1}{2} \int_0^\infty y^{\lambda - 1} \exp\left\{-\frac{u}{2}\left(y + \frac{1}{y}\right)\right\} dy$$

is the modified Bessel function of the third kind with index $\lambda$. Expectations of some functions of a GIG random variable have a mathematically tractable form, e.g.:

$$\mathbb{E}(Y) = \sqrt{\frac{b}{a}}\, \frac{K_{\lambda+1}(\sqrt{ab})}{K_\lambda(\sqrt{ab})}, \tag{2}$$
$$\mathbb{E}(1/Y) = \sqrt{\frac{a}{b}}\, \frac{K_{\lambda+1}(\sqrt{ab})}{K_\lambda(\sqrt{ab})} - \frac{2\lambda}{b}, \tag{3}$$
$$\mathbb{E}(\log Y) = \log\left(\sqrt{\frac{b}{a}}\right) + \frac{1}{K_\lambda(\sqrt{ab})} \frac{\partial}{\partial \lambda} K_\lambda(\sqrt{ab}). \tag{4}$$
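The moment identities (2) and (3) are easy to verify numerically. The following sketch (not from the paper; the parameter values are arbitrary) evaluates the GIG density and compares the closed-form expressions for $\mathbb{E}(Y)$ and $\mathbb{E}(1/Y)$ against direct numerical integration, using SciPy's modified Bessel function `kv`.

```python
# Numerical check of the GIG moment formulas (2)-(3); (a, b, lambda) are
# arbitrary illustrative values.
import numpy as np
from scipy.special import kv
from scipy.integrate import quad

def gig_pdf(y, a, b, lam):
    """GIG density f(y | a, b, lambda) in the (a, b, lambda) parameterization."""
    norm = (a / b) ** (lam / 2) / (2 * kv(lam, np.sqrt(a * b)))
    return norm * y ** (lam - 1) * np.exp(-(a * y + b / y) / 2)

def gig_mean(a, b, lam):
    """E(Y) from (2)."""
    sab = np.sqrt(a * b)
    return np.sqrt(b / a) * kv(lam + 1, sab) / kv(lam, sab)

def gig_mean_inv(a, b, lam):
    """E(1/Y) from (3)."""
    sab = np.sqrt(a * b)
    return np.sqrt(a / b) * kv(lam + 1, sab) / kv(lam, sab) - 2 * lam / b

a, b, lam = 2.0, 3.0, 1.5
num_mean = quad(lambda y: y * gig_pdf(y, a, b, lam), 0, np.inf)[0]
num_mean_inv = quad(lambda y: gig_pdf(y, a, b, lam) / y, 0, np.inf)[0]
```

The identity (3) follows from the fact that $1/Y \sim \text{GIG}(b, a, -\lambda)$ together with the Bessel recurrence $K_{\lambda+1}(u) = K_{\lambda-1}(u) + (2\lambda/u) K_\lambda(u)$.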

Although this parameterization of the GIG distribution will be useful for parameter estimation, the alternative parameterization given by

$$g(y \mid \omega, \eta, \lambda) = \frac{(y/\eta)^{\lambda - 1}}{2 \eta K_\lambda(\omega)} \exp\left\{-\frac{\omega}{2}\left(\frac{y}{\eta} + \frac{\eta}{y}\right)\right\}, \tag{5}$$

where $\omega = \sqrt{ab}$ and $\eta = \sqrt{b/a}$, is used when deriving the generalized hyperbolic distribution (see Browne & McNicholas 2015). For notational clarity, we will denote the parameterization given in (5) by $\text{I}(\omega, \eta, \lambda)$.
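The mapping between the two parameterizations is $a = \omega/\eta$ and $b = \omega\eta$, i.e., $\omega = \sqrt{ab}$ and $\eta = \sqrt{b/a}$. A quick numerical check (illustrative values only) that the two densities coincide:

```python
# Verify that the (a, b, lambda) and (omega, eta, lambda) parameterizations
# of the GIG density agree under omega = sqrt(ab), eta = sqrt(b/a).
import numpy as np
from scipy.special import kv

def gig_pdf_ab(y, a, b, lam):
    return (a / b) ** (lam / 2) * y ** (lam - 1) / (2 * kv(lam, np.sqrt(a * b))) \
        * np.exp(-(a * y + b / y) / 2)

def gig_pdf_omega(y, omega, eta, lam):
    return (y / eta) ** (lam - 1) / (2 * eta * kv(lam, omega)) \
        * np.exp(-(omega / 2) * (y / eta + eta / y))

a, b, lam = 1.3, 0.7, -0.5          # arbitrary values
omega, eta = np.sqrt(a * b), np.sqrt(b / a)
ys = np.linspace(0.1, 5.0, 25)
vals_ab = gig_pdf_ab(ys, a, b, lam)
vals_om = gig_pdf_omega(ys, omega, eta, lam)
```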

### 2.2 Matrix Variate Distributions

As in the multivariate case, the most mathematically tractable matrix variate distribution is the matrix variate normal. An $n \times p$ random matrix $\mathscr{X}$ follows a matrix variate normal distribution with location matrix $\mathbf{M}$ and scale matrices $\boldsymbol{\Sigma}$ and $\boldsymbol{\Psi}$, of dimensions $n \times n$ and $p \times p$, respectively, denoted $\mathcal{N}_{n \times p}(\mathbf{M}, \boldsymbol{\Sigma}, \boldsymbol{\Psi})$, if the density of $\mathscr{X}$ is

$$f(\mathbf{X} \mid \mathbf{M}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) = \frac{1}{(2\pi)^{np/2} |\boldsymbol{\Sigma}|^{p/2} |\boldsymbol{\Psi}|^{n/2}} \exp\left\{-\frac{1}{2} \operatorname{tr}\left(\boldsymbol{\Sigma}^{-1} (\mathbf{X} - \mathbf{M}) \boldsymbol{\Psi}^{-1} (\mathbf{X} - \mathbf{M})'\right)\right\}. \tag{6}$$

The matrix variate normal distribution is related to the multivariate normal distribution, as discussed in Harrar & Gupta (2008), via $\mathscr{X} \sim \mathcal{N}_{n \times p}(\mathbf{M}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) \iff \operatorname{vec}(\mathscr{X}) \sim \mathcal{N}_{np}(\operatorname{vec}(\mathbf{M}), \boldsymbol{\Psi} \otimes \boldsymbol{\Sigma})$, where $\mathcal{N}_{np}(\cdot)$ is the multivariate normal distribution with dimension $np$, $\operatorname{vec}(\cdot)$ is the vectorization operator, and $\otimes$ denotes the Kronecker product. Although the matrix variate normal distribution is popular, there are other well-known examples of matrix variate distributions. For example, the Wishart distribution (Wishart 1928) is the distribution of the sample covariance matrix in the multivariate normal case. There are also a few formulations of a matrix variate skew normal distribution (Chen & Gupta 2005, Domínguez-Molina et al. 2007, Harrar & Gupta 2008).
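This relationship is straightforward to verify numerically. The sketch below (illustrative only, with arbitrary positive-definite scale matrices) evaluates the matrix variate normal density (6) directly and via the vectorized multivariate normal with covariance $\boldsymbol{\Psi} \otimes \boldsymbol{\Sigma}$; with column-wise vectorization the two agree.

```python
# Check the matrix normal / multivariate normal link:
# vec(X) ~ N(vec(M), Psi kron Sigma) with column-wise vec.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
n, p = 3, 2
M = rng.standard_normal((n, p))
A1 = rng.standard_normal((n, n)); Sigma = A1 @ A1.T + n * np.eye(n)
A2 = rng.standard_normal((p, p)); Psi = A2 @ A2.T + p * np.eye(p)
X = rng.standard_normal((n, p))

def matrix_normal_pdf(X, M, Sigma, Psi):
    """Density (6) of the n x p matrix variate normal."""
    n, p = X.shape
    R = X - M
    expo = -0.5 * np.trace(np.linalg.solve(Sigma, R) @ np.linalg.solve(Psi, R.T))
    const = (2 * np.pi) ** (-n * p / 2) \
        * np.linalg.det(Sigma) ** (-p / 2) * np.linalg.det(Psi) ** (-n / 2)
    return const * np.exp(expo)

f_mat = matrix_normal_pdf(X, M, Sigma, Psi)
f_vec = multivariate_normal.pdf(X.flatten('F'), mean=M.flatten('F'),
                                cov=np.kron(Psi, Sigma))
```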

More recently, Gallaugher & McNicholas (2017, 2019) derived a total of four skewed matrix variate distributions using a variance-mean matrix variate mixture approach. This assumes that a random matrix $\mathscr{X}$ can be written as

$$\mathscr{X} = \mathbf{M} + W \mathbf{A} + \sqrt{W}\, \mathscr{V}, \tag{7}$$

where $\mathbf{M}$ and $\mathbf{A}$ are $n \times p$ matrices representing the location and skewness, respectively, $\mathscr{V} \sim \mathcal{N}_{n \times p}(\mathbf{0}, \boldsymbol{\Sigma}, \boldsymbol{\Psi})$, and $W > 0$ is a random variable with density $h(w \mid \boldsymbol{\theta})$. Gallaugher & McNicholas (2017) show that the matrix variate skew-$t$ distribution, with $\nu$ degrees of freedom, arises from (7) with $W \sim \text{IGamma}(\nu/2, \nu/2)$, where $\text{IGamma}(a, b)$ denotes the inverse-gamma distribution with density

$$f(y \mid a, b) = \frac{b^a}{\Gamma(a)}\, y^{-a-1} \exp\left\{-\frac{b}{y}\right\},$$

for $y > 0$, with $a, b > 0$. The resulting density of $\mathscr{X}$ is

$$\begin{aligned}
f_{\text{MVST}}(\mathbf{X} \mid \boldsymbol{\vartheta}) = {} & \frac{2 \left(\frac{\nu}{2}\right)^{\nu/2} \exp\left\{\operatorname{tr}\left(\boldsymbol{\Sigma}^{-1} (\mathbf{X} - \mathbf{M}) \boldsymbol{\Psi}^{-1} \mathbf{A}'\right)\right\}}{(2\pi)^{np/2} |\boldsymbol{\Sigma}|^{p/2} |\boldsymbol{\Psi}|^{n/2} \Gamma(\nu/2)} \left(\frac{\delta(\mathbf{X}; \mathbf{M}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) + \nu}{\rho(\mathbf{A}, \boldsymbol{\Sigma}, \boldsymbol{\Psi})}\right)^{-(\nu + np)/4} \\
& \times K_{-(\nu + np)/2}\left(\sqrt{[\rho(\mathbf{A}, \boldsymbol{\Sigma}, \boldsymbol{\Psi})][\delta(\mathbf{X}; \mathbf{M}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) + \nu]}\right),
\end{aligned}$$

where $\delta(\mathbf{X}; \mathbf{M}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) = \operatorname{tr}\left(\boldsymbol{\Sigma}^{-1} (\mathbf{X} - \mathbf{M}) \boldsymbol{\Psi}^{-1} (\mathbf{X} - \mathbf{M})'\right)$, $\rho(\mathbf{A}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) = \operatorname{tr}\left(\boldsymbol{\Sigma}^{-1} \mathbf{A} \boldsymbol{\Psi}^{-1} \mathbf{A}'\right)$, and $\nu > 0$. For notational clarity, this distribution will be denoted by $\text{MVST}_{n \times p}(\mathbf{M}, \mathbf{A}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}, \nu)$.
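The stochastic representation (7) also gives a direct sampling recipe. The sketch below (not from the paper; the dimensions, $\nu$, and skewness matrix are arbitrary) draws matrix variate skew-$t$ variates by mixing a matrix normal with an inverse-gamma weight; since $\mathbb{E}(W) = \nu/(\nu - 2)$ for $W \sim \text{IGamma}(\nu/2, \nu/2)$ with $\nu > 2$, the sample mean should be close to $\mathbf{M} + \frac{\nu}{\nu - 2}\mathbf{A}$.

```python
# Sampling sketch for the variance-mean mixture (7), skew-t case:
# W ~ inverse-gamma(nu/2, nu/2), V ~ N_{n x p}(0, Sigma, Psi).
import numpy as np

rng = np.random.default_rng(7)
n, p, nu, N = 2, 3, 10.0, 100_000
M = np.zeros((n, p))
A = np.array([[1.0, 0.0, -0.5], [0.0, 0.5, 0.0]])   # skewness (arbitrary)
Sigma, Psi = np.eye(n), np.eye(p)

# W ~ IGamma(nu/2, nu/2): reciprocal of a Gamma(shape=nu/2, rate=nu/2) draw
W = 1.0 / rng.gamma(shape=nu / 2, scale=2 / nu, size=N)

# V ~ N(0, Sigma, Psi) via Cholesky factors of the two scale matrices
Ls, Lp = np.linalg.cholesky(Sigma), np.linalg.cholesky(Psi)
Z = rng.standard_normal((N, n, p))
V = Ls @ Z @ Lp.T
X = M + W[:, None, None] * A + np.sqrt(W)[:, None, None] * V
```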

In Gallaugher & McNicholas (2019), one of the distributions considered is a matrix variate generalized hyperbolic distribution. This again arises from (7), now with $W \sim \text{I}(\omega, 1, \lambda)$. This distribution will be denoted by $\text{MVGH}_{n \times p}(\mathbf{M}, \mathbf{A}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}, \lambda, \omega)$, and the density is

$$\begin{aligned}
f_{\text{MVGH}}(\mathbf{X} \mid \boldsymbol{\vartheta}) = {} & \frac{\exp\left\{\operatorname{tr}\left(\boldsymbol{\Sigma}^{-1} (\mathbf{X} - \mathbf{M}) \boldsymbol{\Psi}^{-1} \mathbf{A}'\right)\right\}}{(2\pi)^{np/2} |\boldsymbol{\Sigma}|^{p/2} |\boldsymbol{\Psi}|^{n/2} K_\lambda(\omega)} \left(\frac{\delta(\mathbf{X}; \mathbf{M}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) + \omega}{\rho(\mathbf{A}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) + \omega}\right)^{(\lambda - np/2)/2} \\
& \times K_{\lambda - np/2}\left(\sqrt{[\rho(\mathbf{A}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) + \omega][\delta(\mathbf{X}; \mathbf{M}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) + \omega]}\right),
\end{aligned}$$

where $\lambda \in \mathbb{R}$ and $\omega \in \mathbb{R}^+$.

The matrix variate variance-gamma distribution, also derived in Gallaugher & McNicholas (2019) and denoted $\text{MVVG}_{n \times p}(\mathbf{M}, \mathbf{A}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}, \gamma)$, arises from (7) with $W \sim \text{gamma}(\gamma, \gamma)$, where $\text{gamma}(a, b)$ denotes the gamma distribution with density

$$f(y \mid a, b) = \frac{b^a}{\Gamma(a)}\, y^{a-1} \exp\{-by\},$$

for $y > 0$, with $a, b > 0$. The density of the random matrix $\mathscr{X}$ with this distribution is

$$\begin{aligned}
f_{\text{MVVG}}(\mathbf{X} \mid \boldsymbol{\vartheta}) = {} & \frac{2 \gamma^\gamma \exp\left\{\operatorname{tr}\left(\boldsymbol{\Sigma}^{-1} (\mathbf{X} - \mathbf{M}) \boldsymbol{\Psi}^{-1} \mathbf{A}'\right)\right\}}{(2\pi)^{np/2} |\boldsymbol{\Sigma}|^{p/2} |\boldsymbol{\Psi}|^{n/2} \Gamma(\gamma)} \left(\frac{\delta(\mathbf{X}; \mathbf{M}, \boldsymbol{\Sigma}, \boldsymbol{\Psi})}{\rho(\mathbf{A}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) + 2\gamma}\right)^{(\gamma - np/2)/2} \\
& \times K_{\gamma - np/2}\left(\sqrt{[\rho(\mathbf{A}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) + 2\gamma][\delta(\mathbf{X}; \mathbf{M}, \boldsymbol{\Sigma}, \boldsymbol{\Psi})]}\right),
\end{aligned}$$

where $\gamma > 0$.

Finally, the matrix variate NIG distribution arises when $W \sim \text{IGauss}(1, \kappa)$, where $\text{IGauss}(\delta, \kappa)$ denotes the inverse-Gaussian distribution with density

$$f(y \mid \delta, \kappa) = \frac{\delta}{\sqrt{2\pi}} \exp\{\delta \kappa\}\, y^{-3/2} \exp\left\{-\frac{1}{2}\left(\frac{\delta^2}{y} + \kappa^2 y\right)\right\},$$

for $y > 0$, with $\delta, \kappa > 0$. The density of $\mathscr{X}$ is

$$\begin{aligned}
f_{\text{MVNIG}}(\mathbf{X} \mid \boldsymbol{\vartheta}) = {} & \frac{2 \exp\left\{\operatorname{tr}\left(\boldsymbol{\Sigma}^{-1} (\mathbf{X} - \mathbf{M}) \boldsymbol{\Psi}^{-1} \mathbf{A}'\right) + \kappa\right\}}{(2\pi)^{(np+1)/2} |\boldsymbol{\Sigma}|^{p/2} |\boldsymbol{\Psi}|^{n/2}} \left(\frac{\delta(\mathbf{X}; \mathbf{M}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) + 1}{\rho(\mathbf{A}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) + \kappa^2}\right)^{-(1 + np)/4} \\
& \times K_{-(1 + np)/2}\left(\sqrt{[\rho(\mathbf{A}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) + \kappa^2][\delta(\mathbf{X}; \mathbf{M}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) + 1]}\right),
\end{aligned}$$

where $\kappa > 0$. This distribution is denoted by $\text{MVNIG}_{n \times p}(\mathbf{M}, \mathbf{A}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}, \kappa)$.

### 2.3 Matrix Variate Factor Analysis

Readers who may benefit from the context provided by the mixture of factor analyzers model should consult Appendix A. Xie et al. (2008) and Yu et al. (2008) consider a matrix variate extension of probabilistic principal components analysis (PPCA), which assumes an $n \times p$ random matrix $\mathscr{X}$ can be written

$$\mathscr{X} = \mathbf{M} + \boldsymbol{\Lambda} \mathscr{U} \boldsymbol{\Delta}' + \mathscr{E}, \tag{8}$$

where $\mathbf{M}$ is an $n \times p$ location matrix, $\boldsymbol{\Lambda}$ is an $n \times q$ matrix of column factor loadings, $\boldsymbol{\Delta}$ is a $p \times r$ matrix of row factor loadings, $\mathscr{U} \sim \mathcal{N}_{q \times r}(\mathbf{0}, \mathbf{I}_q, \mathbf{I}_r)$, and $\mathscr{E}$ is a matrix variate normal error term with isotropic covariance structure. It is assumed that $\mathscr{U}$ and $\mathscr{E}$ are independent of each other. The main disadvantage of this model is that, in general, $\mathscr{X}$ does not follow a matrix variate normal distribution.

Zhao et al. (2012) present bilinear probabilistic principal component analysis (BPPCA), which extends (8) by adding two projected error terms. The resulting model assumes $\mathscr{X}$ can be written as $\mathscr{X} = \mathbf{M} + \boldsymbol{\Lambda} \mathscr{U} \boldsymbol{\Delta}' + \boldsymbol{\Lambda} \mathscr{E}^B + \mathscr{E}^A \boldsymbol{\Delta}' + \mathscr{E}$, where $\mathbf{M}$, $\boldsymbol{\Lambda}$, $\boldsymbol{\Delta}$, and $\mathscr{U}$ are defined as in (8), and $\mathscr{E}^B$, $\mathscr{E}^A$, and $\mathscr{E}$ are isotropic matrix variate normal error terms of dimensions $q \times p$, $n \times r$, and $n \times p$, respectively. In this model, it is assumed that $\mathscr{U}$, $\mathscr{E}^B$, $\mathscr{E}^A$, and $\mathscr{E}$ are all independent of each other. Gallaugher & McNicholas (2018b) further extend this to matrix variate factor analysis and consider a mixture of matrix variate bilinear factor analyzers (MMVBFA) model. For MMVBFA, Gallaugher & McNicholas (2018b) generalize BPPCA by removing the isotropic constraints so that $\mathscr{E}^B \sim \mathcal{N}_{q \times p}(\mathbf{0}, \mathbf{I}_q, \boldsymbol{\Psi})$, $\mathscr{E}^A \sim \mathcal{N}_{n \times r}(\mathbf{0}, \boldsymbol{\Sigma}, \mathbf{I}_r)$, and $\mathscr{E} \sim \mathcal{N}_{n \times p}(\mathbf{0}, \boldsymbol{\Sigma}, \boldsymbol{\Psi})$, where $\boldsymbol{\Sigma} = \operatorname{diag}(\sigma_1, \ldots, \sigma_n)$ with $\sigma_j > 0$, and $\boldsymbol{\Psi} = \operatorname{diag}(\psi_1, \ldots, \psi_p)$ with $\psi_j > 0$. With these slight modifications, it can be shown that $\mathscr{X} \sim \mathcal{N}_{n \times p}(\mathbf{M}, \boldsymbol{\Sigma} + \boldsymbol{\Lambda}\boldsymbol{\Lambda}', \boldsymbol{\Psi} + \boldsymbol{\Delta}\boldsymbol{\Delta}')$, similarly to its multivariate counterpart (Appendix A).
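The marginal scale matrices $\boldsymbol{\Sigma} + \boldsymbol{\Lambda}\boldsymbol{\Lambda}'$ and $\boldsymbol{\Psi} + \boldsymbol{\Delta}\boldsymbol{\Delta}'$ can be checked at the covariance level using the identity $\operatorname{vec}(\mathbf{A}\mathbf{X}\mathbf{B}') = (\mathbf{B} \otimes \mathbf{A})\operatorname{vec}(\mathbf{X})$: summing the covariance contributions of the four error terms gives exactly $(\boldsymbol{\Psi} + \boldsymbol{\Delta}\boldsymbol{\Delta}') \otimes (\boldsymbol{\Sigma} + \boldsymbol{\Lambda}\boldsymbol{\Lambda}')$. A minimal numerical check with arbitrary matrices:

```python
# Check that the four error terms of the MMVBFA model sum, in vec form, to
# Cov(vec(X - M)) = (Psi + Delta Delta') kron (Sigma + Lam Lam').
import numpy as np

rng = np.random.default_rng(3)
n, p, q, r = 4, 3, 2, 2
Lam = rng.standard_normal((n, q))
Delta = rng.standard_normal((p, r))
Sigma = np.diag(rng.uniform(0.5, 2.0, n))
Psi = np.diag(rng.uniform(0.5, 2.0, p))

# Covariances of the vectorized terms, via vec(A X B') = (B kron A) vec(X):
cov_U  = np.kron(Delta @ Delta.T, Lam @ Lam.T)   # Lam U Delta'
cov_EB = np.kron(Psi, Lam @ Lam.T)               # Lam E^B
cov_EA = np.kron(Delta @ Delta.T, Sigma)         # E^A Delta'
cov_E  = np.kron(Psi, Sigma)                     # E
total = cov_U + cov_EB + cov_EA + cov_E
target = np.kron(Psi + Delta @ Delta.T, Sigma + Lam @ Lam.T)
```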

It is important to note that the term "column factors" refers to a reduction in the dimension of the columns, which is equivalent to the number of rows, and not a reduction in the number of columns. Likewise, the term "row factors" refers to a reduction in the dimension of the rows (the number of columns). As discussed by Zhao et al. (2012), the terms $\mathscr{E}^A \boldsymbol{\Delta}'$ and $\boldsymbol{\Lambda} \mathscr{E}^B$ are interpreted as the row and column noise, respectively, whereas the term $\mathscr{E}$ is the common noise.

## 3 Mixture of Skewed Matrix Variate Bilinear Factor Analyzers

### 3.1 Model Specification

We now consider a mixture of skewed bilinear factor analyzers based on one of the four skewed distributions discussed previously. Each random matrix $\mathscr{X}_i$ in a random sample distributed according to one of the four distributions can be written as

$$\mathscr{X}_i = \mathbf{M}_g + W_{ig} \mathbf{A}_g + \mathscr{V}_{ig}$$

with probability $\pi_g$, for $i = 1, \ldots, N$ and $g = 1, \ldots, G$, where $\mathbf{M}_g$ is the location of the $g$th component, $\mathbf{A}_g$ is the skewness, and $W_{ig} > 0$ is a random variable with density $h(w_{ig} \mid \boldsymbol{\theta}_g)$. The distribution of the random variable $W_{ig}$, and so the density $h(w_{ig} \mid \boldsymbol{\theta}_g)$, will change depending on the distribution of $\mathscr{X}_i$, i.e., skew-$t$, generalized hyperbolic, variance-gamma, or NIG. Assume also that $\mathscr{V}_{ig}$ can be written as

$$\mathscr{V}_{ig} = \boldsymbol{\Lambda}_g \mathscr{U}_{ig} \boldsymbol{\Delta}_g' + \boldsymbol{\Lambda}_g \mathscr{E}^B_{ig} + \mathscr{E}^A_{ig} \boldsymbol{\Delta}_g' + \mathscr{E}_{ig},
$$

where

$$\mathscr{U}_{ig} \mid w_{ig} \sim \mathcal{N}_{q \times r}(\mathbf{0}, w_{ig} \mathbf{I}_q, \mathbf{I}_r), \qquad \mathscr{E}^B_{ig} \mid w_{ig} \sim \mathcal{N}_{q \times p}(\mathbf{0}, w_{ig} \mathbf{I}_q, \boldsymbol{\Psi}_g),$$
$$\mathscr{E}^A_{ig} \mid w_{ig} \sim \mathcal{N}_{n \times r}(\mathbf{0}, w_{ig} \boldsymbol{\Sigma}_g, \mathbf{I}_r), \qquad \mathscr{E}_{ig} \mid w_{ig} \sim \mathcal{N}_{n \times p}(\mathbf{0}, w_{ig} \boldsymbol{\Sigma}_g, \boldsymbol{\Psi}_g).$$

Note that $\mathscr{U}_{ig}$, $\mathscr{E}^B_{ig}$, $\mathscr{E}^A_{ig}$, and $\mathscr{E}_{ig}$ are all independent of each other.

To facilitate clustering, introduce the indicator $z_{ig}$, where $z_{ig} = 1$ if observation $i$ belongs to group $g$ and $z_{ig} = 0$ otherwise. Then, it can be shown that

$$\mathscr{X}_i \mid z_{ig} = 1 \sim \mathcal{D}_{n \times p}(\mathbf{M}_g, \mathbf{A}_g, \boldsymbol{\Sigma}_g + \boldsymbol{\Lambda}_g \boldsymbol{\Lambda}_g', \boldsymbol{\Psi}_g + \boldsymbol{\Delta}_g \boldsymbol{\Delta}_g', \boldsymbol{\theta}_g),$$

where $\mathcal{D}$ is the distribution in question and $\boldsymbol{\theta}_g$ is the set of parameters related to the distribution of $W_{ig}$.

As in the matrix variate normal case, this model has a two-stage interpretation given by

$$\mathscr{X}_i = \mathbf{M}_g + W_{ig} \mathbf{A}_g + \boldsymbol{\Lambda}_g \mathscr{Y}^B_{ig} + \mathscr{R}^B_{ig}, \qquad \mathscr{Y}^B_{ig} = \mathscr{U}_{ig} \boldsymbol{\Delta}_g' + \mathscr{E}^B_{ig}, \qquad \mathscr{R}^B_{ig} = \mathscr{E}^A_{ig} \boldsymbol{\Delta}_g' + \mathscr{E}_{ig},$$

and

$$\mathscr{X}_i = \mathbf{M}_g + W_{ig} \mathbf{A}_g + \mathscr{Y}^A_{ig} \boldsymbol{\Delta}_g' + \mathscr{R}^A_{ig}, \qquad \mathscr{Y}^A_{ig} = \boldsymbol{\Lambda}_g \mathscr{U}_{ig} + \mathscr{E}^A_{ig}, \qquad \mathscr{R}^A_{ig} = \boldsymbol{\Lambda}_g \mathscr{E}^B_{ig} + \mathscr{E}_{ig},$$

which will be useful for parameter estimation.
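As a sanity check, both two-stage decompositions recombine algebraically to the one-stage error term $\boldsymbol{\Lambda}_g \mathscr{U}_{ig} \boldsymbol{\Delta}_g' + \boldsymbol{\Lambda}_g \mathscr{E}^B_{ig} + \mathscr{E}^A_{ig} \boldsymbol{\Delta}_g' + \mathscr{E}_{ig}$. A minimal numerical illustration (arbitrary matrices, group subscripts dropped):

```python
# Check that both two-stage decompositions reproduce the one-stage error
# term V = Lam U Delta' + Lam EB + EA Delta' + E.
import numpy as np

rng = np.random.default_rng(0)
n, p, q, r = 4, 3, 2, 2
Lam, Delta = rng.standard_normal((n, q)), rng.standard_normal((p, r))
U = rng.standard_normal((q, r))
EB = rng.standard_normal((q, p))
EA = rng.standard_normal((n, r))
E = rng.standard_normal((n, p))

V = Lam @ U @ Delta.T + Lam @ EB + EA @ Delta.T + E   # one stage

YB = U @ Delta.T + EB; RB = EA @ Delta.T + E          # stage "B"
YA = Lam @ U + EA;     RA = Lam @ EB + E              # stage "A"
V_B = Lam @ YB + RB
V_A = YA @ Delta.T + RA
```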

### 3.2 Parameter Estimation

Suppose we observe $N$ matrices $\mathbf{X}_1, \ldots, \mathbf{X}_N$ distributed according to one of the four distributions. We assume that these data are incomplete and employ an alternating expectation conditional maximization (AECM) algorithm (Meng & van Dyk 1997). This algorithm is now described, after initialization.

#### AECM Stage 1

The complete-data in the first stage consist of the observed data $\mathbf{X}_i$, the latent variables $W_{ig}$, and the unknown group labels $z_{ig}$, for $i = 1, \ldots, N$ and $g = 1, \ldots, G$. In this case, the complete-data log-likelihood is

$$\begin{aligned}
\ell_{C1} = C + \sum_{i=1}^{N} \sum_{g=1}^{G} z_{ig} \bigg[ \log \pi_g + \log h(w_{ig} \mid \boldsymbol{\theta}_g) - \frac{1}{2} \operatorname{tr}\bigg\{ & \frac{1}{w_{ig}} (\boldsymbol{\Sigma}^*_g)^{-1} (\mathbf{X}_i - \mathbf{M}_g) (\boldsymbol{\Psi}^*_g)^{-1} (\mathbf{X}_i - \mathbf{M}_g)' - (\boldsymbol{\Sigma}^*_g)^{-1} (\mathbf{X}_i - \mathbf{M}_g) (\boldsymbol{\Psi}^*_g)^{-1} \mathbf{A}_g' \\
& - (\boldsymbol{\Sigma}^*_g)^{-1} \mathbf{A}_g (\boldsymbol{\Psi}^*_g)^{-1} (\mathbf{X}_i - \mathbf{M}_g)' + w_{ig} (\boldsymbol{\Sigma}^*_g)^{-1} \mathbf{A}_g (\boldsymbol{\Psi}^*_g)^{-1} \mathbf{A}_g' \bigg\} \bigg],
\end{aligned}$$

where $\boldsymbol{\Sigma}^*_g = \boldsymbol{\Sigma}_g + \boldsymbol{\Lambda}_g \boldsymbol{\Lambda}_g'$, $\boldsymbol{\Psi}^*_g = \boldsymbol{\Psi}_g + \boldsymbol{\Delta}_g \boldsymbol{\Delta}_g'$, and $C$ is constant with respect to the parameters.

In the E-step, we calculate the conditional expectations $\hat{z}_{ig} = \mathbb{E}(Z_{ig} \mid \mathbf{X}_i)$, $a_{ig} = \mathbb{E}(W_{ig} \mid \mathbf{X}_i, z_{ig} = 1)$, $b_{ig} = \mathbb{E}(1/W_{ig} \mid \mathbf{X}_i, z_{ig} = 1)$, and $c_{ig} = \mathbb{E}(\log W_{ig} \mid \mathbf{X}_i, z_{ig} = 1)$. As usual, all expectations are conditional on current parameter estimates; however, to avoid cluttered notation, we do not use iteration-specific notation. Although these expectations depend on the distribution in question, it can be shown that

$$W^{\text{ST}}_{ig} \mid \mathbf{X}_i, z_{ig} = 1 \sim \text{GIG}\big(\rho(\mathbf{A}_g, \boldsymbol{\Sigma}^*_g, \boldsymbol{\Psi}^*_g),\; \delta(\mathbf{X}_i; \mathbf{M}_g, \boldsymbol{\Sigma}^*_g, \boldsymbol{\Psi}^*_g) + \nu_g,\; -(\nu_g + np)/2\big),$$
$$W^{\text{GH}}_{ig} \mid \mathbf{X}_i, z_{ig} = 1 \sim \text{GIG}\big(\rho(\mathbf{A}_g, \boldsymbol{\Sigma}^*_g, \boldsymbol{\Psi}^*_g) + \omega_g,\; \delta(\mathbf{X}_i; \mathbf{M}_g, \boldsymbol{\Sigma}^*_g, \boldsymbol{\Psi}^*_g) + \omega_g,\; \lambda_g - np/2\big),$$
$$W^{\text{VG}}_{ig} \mid \mathbf{X}_i, z_{ig} = 1 \sim \text{GIG}\big(\rho(\mathbf{A}_g, \boldsymbol{\Sigma}^*_g, \boldsymbol{\Psi}^*_g) + 2\gamma_g,\; \delta(\mathbf{X}_i; \mathbf{M}_g, \boldsymbol{\Sigma}^*_g, \boldsymbol{\Psi}^*_g),\; \gamma_g - np/2\big),$$
$$W^{\text{NIG}}_{ig} \mid \mathbf{X}_i, z_{ig} = 1 \sim \text{GIG}\big(\rho(\mathbf{A}_g, \boldsymbol{\Sigma}^*_g, \boldsymbol{\Psi}^*_g) + \kappa_g^2,\; \delta(\mathbf{X}_i; \mathbf{M}_g, \boldsymbol{\Sigma}^*_g, \boldsymbol{\Psi}^*_g) + 1,\; -(1 + np)/2\big).$$

Therefore, the exact updates are obtained by using the expectations given in (2)–(4) with the appropriate values of $a$, $b$, and $\lambda$.
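Concretely, for the skew-$t$ case the E-step weights $a_{ig}$ and $b_{ig}$ are GIG moments with $a = \rho(\mathbf{A}_g, \boldsymbol{\Sigma}^*_g, \boldsymbol{\Psi}^*_g)$, $b = \delta(\mathbf{X}_i; \mathbf{M}_g, \boldsymbol{\Sigma}^*_g, \boldsymbol{\Psi}^*_g) + \nu_g$, and $\lambda = -(\nu_g + np)/2$. A minimal sketch (the function names and input values are illustrative, not from the paper):

```python
# First-stage E-step weights for the skew-t component: the posterior of
# W_ig is GIG(rho, delta + nu, -(nu + n p)/2), so a_ig = E(W | ...) and
# b_ig = E(1/W | ...) follow from the GIG moments (2)-(3).
import numpy as np
from scipy.special import kv

def gig_moments(a, b, lam):
    """Return (E(Y), E(1/Y)) for Y ~ GIG(a, b, lambda), via (2)-(3)."""
    sab = np.sqrt(a * b)
    ratio = kv(lam + 1, sab) / kv(lam, sab)
    return np.sqrt(b / a) * ratio, np.sqrt(a / b) * ratio - 2 * lam / b

def estep_weights_skew_t(rho, delta, nu, n, p):
    """a_ig and b_ig for the matrix variate skew-t component."""
    return gig_moments(rho, delta + nu, -(nu + n * p) / 2)

# Illustrative scalar inputs
a_ig, b_ig = estep_weights_skew_t(rho=1.0, delta=4.0, nu=5.0, n=2, p=2)
```

By Jensen's inequality, $a_{ig} b_{ig} = \mathbb{E}(W)\mathbb{E}(1/W) \geq 1$, a useful sanity check on any implementation.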

In the M-step, we update $\pi_g$, $\mathbf{M}_g$, $\mathbf{A}_g$, and $\boldsymbol{\theta}_g$, for $g = 1, \ldots, G$. We have:

$$\hat{\pi}_g = \frac{N_g}{N}, \qquad \hat{\mathbf{M}}_g = \frac{\sum_{i=1}^{N} \hat{z}_{ig} (\overline{a}_g b_{ig} - 1) \mathbf{X}_i}{\sum_{i=1}^{N} \hat{z}_{ig} \overline{a}_g b_{ig} - N_g}, \qquad \hat{\mathbf{A}}_g = \frac{\sum_{i=1}^{N} \hat{z}_{ig} (\overline{b}_g - b_{ig}) \mathbf{X}_i}{\sum_{i=1}^{N} \hat{z}_{ig} \overline{a}_g b_{ig} - N_g},$$

where

$$N_g = \sum_{i=1}^{N} \hat{z}_{ig}, \qquad \overline{a}_g = \frac{1}{N_g} \sum_{i=1}^{N} \hat{z}_{ig} a_{ig}, \qquad \overline{b}_g = \frac{1}{N_g} \sum_{i=1}^{N} \hat{z}_{ig} b_{ig}.$$

The update for $\boldsymbol{\theta}_g$ depends on the distribution and will be identical to one of those given in Gallaugher & McNicholas (2018a).
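The first-stage M-step updates above are simple weighted sums. A vectorized sketch follows (the function name and array layout are our own, not from the paper). Note that when the $b_{ig}$ are constant within a group, $\hat{\mathbf{A}}_g = \mathbf{0}$ and $\hat{\mathbf{M}}_g$ reduces to the $\hat{z}_{ig}$-weighted mean, as expected for a symmetric fit.

```python
# Sketch of the first-stage M-step updates for pi_g, M_g, A_g, given the
# E-step quantities zhat, a (= a_ig), and b (= b_ig); toy inputs below.
import numpy as np

def mstep_stage1(X, zhat, a, b):
    """X: (N, n, p); zhat, a, b: (N, G). Returns pi (G,), M and A (G, n, p)."""
    N, G = zhat.shape
    Ng = zhat.sum(axis=0)
    abar = (zhat * a).sum(axis=0) / Ng
    bbar = (zhat * b).sum(axis=0) / Ng
    pi = Ng / N
    M = np.empty((G,) + X.shape[1:])
    A = np.empty_like(M)
    for g in range(G):
        denom = (zhat[:, g] * abar[g] * b[:, g]).sum() - Ng[g]
        wM = zhat[:, g] * (abar[g] * b[:, g] - 1)     # weights for M_g
        wA = zhat[:, g] * (bbar[g] - b[:, g])         # weights for A_g
        M[g] = np.einsum('i,ijk->jk', wM, X) / denom
        A[g] = np.einsum('i,ijk->jk', wA, X) / denom
    return pi, M, A

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2, 3))
zhat = np.ones((50, 1))               # one group, all responsibility
a = np.full((50, 1), 1.2)
b = np.full((50, 1), 0.9)             # constant b_ig => A_g should vanish
pi, M, A = mstep_stage1(X, zhat, a, b)
```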

#### AECM Stage 2

In the second stage, the complete-data consist of the observed data $\mathbf{X}_i$, the latent variables $W_{ig}$, the unknown group labels $z_{ig}$, and the latent matrices $\mathbf{Y}^B_{ig}$, for $i = 1, \ldots, N$ and $g = 1, \ldots, G$. The complete-data log-likelihood at this stage is

$$\begin{aligned}
\ell_{C2} = {} & C + \sum_{i=1}^{N} \sum_{g=1}^{G} z_{ig} \Big[ \log \pi_g + \log h(w_{ig} \mid \boldsymbol{\theta}_g) + \log \phi_{q \times p}(\mathbf{Y}^B_{ig} \mid \mathbf{0}, w_{ig} \mathbf{I}_q, \boldsymbol{\Psi}^*_g) + \log \phi_{n \times p}(\mathbf{X}_i \mid \mathbf{M}_g + w_{ig} \mathbf{A}_g + \boldsymbol{\Lambda}_g \mathbf{Y}^B_{ig}, w_{ig} \boldsymbol{\Sigma}_g, \boldsymbol{\Psi}^*_g) \Big] \\
= {} & C - \frac{1}{2} \sum_{i=1}^{N} \sum_{g=1}^{G} z_{ig} \Big[ p \log |\boldsymbol{\Sigma}_g| + \operatorname{tr}\Big\{ \frac{1}{w_{ig}} \boldsymbol{\Sigma}_g^{-1} (\mathbf{X}_i - \mathbf{M}_g) (\boldsymbol{\Psi}^*_g)^{-1} (\mathbf{X}_i - \mathbf{M}_g)' - \boldsymbol{\Sigma}_g^{-1} (\mathbf{X}_i - \mathbf{M}_g) (\boldsymbol{\Psi}^*_g)^{-1} \mathbf{A}_g' \\
& \quad - \frac{1}{w_{ig}} \boldsymbol{\Sigma}_g^{-1} (\mathbf{X}_i - \mathbf{M}_g) (\boldsymbol{\Psi}^*_g)^{-1} \mathbf{Y}^{B\prime}_{ig} \boldsymbol{\Lambda}_g' - \boldsymbol{\Sigma}_g^{-1} \mathbf{A}_g (\boldsymbol{\Psi}^*_g)^{-1} (\mathbf{X}_i - \mathbf{M}_g)' + w_{ig} \boldsymbol{\Sigma}_g^{-1} \mathbf{A}_g (\boldsymbol{\Psi}^*_g)^{-1} \mathbf{A}_g' + \boldsymbol{\Sigma}_g^{-1} \mathbf{A}_g (\boldsymbol{\Psi}^*_g)^{-1} \mathbf{Y}^{B\prime}_{ig} \boldsymbol{\Lambda}_g' \\
& \quad - \frac{1}{w_{ig}} \boldsymbol{\Sigma}_g^{-1} \boldsymbol{\Lambda}_g \mathbf{Y}^B_{ig} (\boldsymbol{\Psi}^*_g)^{-1} (\mathbf{X}_i - \mathbf{M}_g)' + \boldsymbol{\Sigma}_g^{-1} \boldsymbol{\Lambda}_g \mathbf{Y}^B_{ig} (\boldsymbol{\Psi}^*_g)^{-1} \mathbf{A}_g' + \frac{1}{w_{ig}} \boldsymbol{\Sigma}_g^{-1} \boldsymbol{\Lambda}_g \mathbf{Y}^B_{ig} (\boldsymbol{\Psi}^*_g)^{-1} \mathbf{Y}^{B\prime}_{ig} \boldsymbol{\Lambda}_g' \Big\} \Big],
\end{aligned}$$

where terms not involving $\boldsymbol{\Sigma}_g$ or $\boldsymbol{\Lambda}_g$ have been absorbed into $C$ in the second expression.

In the E-step, it can be shown that

$$\mathbf{Y}^B_{ig} \mid \mathbf{X}_i, w_{ig}, z_{ig} = 1 \sim \mathcal{N}_{q \times p}\big( (\mathbf{I}_q + \boldsymbol{\Lambda}_g' \boldsymbol{\Sigma}_g^{-1} \boldsymbol{\Lambda}_g)^{-1} \boldsymbol{\Lambda}_g' \boldsymbol{\Sigma}_g^{-1} (\mathbf{X}_i - \mathbf{M}_g - w_{ig} \mathbf{A}_g),\; w_{ig} (\mathbf{I}_q + \boldsymbol{\Lambda}_g' \boldsymbol{\Sigma}_g^{-1} \boldsymbol{\Lambda}_g)^{-1},\; \boldsymbol{\Psi}^*_g \big)$$

and so we can calculate the expectations $\mathbf{E}^{(2)}_{1ig} = \mathbb{E}(\mathbf{Y}^B_{ig} \mid \mathbf{X}_i, z_{ig} = 1)$, $\mathbf{E}^{(2)}_{2ig} = \mathbb{E}(W_{ig}^{-1} \mathbf{Y}^B_{ig} \mid \mathbf{X}_i, z_{ig} = 1)$, and $\mathbf{E}^{(2)}_{3ig} = \mathbb{E}(W_{ig}^{-1} \mathbf{Y}^B_{ig} (\boldsymbol{\Psi}^*_g)^{-1} \mathbf{Y}^{B\prime}_{ig} \mid \mathbf{X}_i, z_{ig} = 1)$, which follow from the conditional distribution above together with the GIG moments in (2)–(4).

In the M-step, the updates for $\boldsymbol{\Lambda}_g$ and $\boldsymbol{\Sigma}_g$ are calculated. These updates are given by

$$\hat{\boldsymbol{\Lambda}}_g = \sum_{i=1}^{N} \hat{z}_{ig} \Big[ (\mathbf{X}_i - \hat{\mathbf{M}}_g) (\hat{\boldsymbol{\Psi}}^*_g)^{-1} \mathbf{E}^{(2)\prime}_{2ig} - \hat{\mathbf{A}}_g (\hat{\boldsymbol{\Psi}}^*_g)^{-1} \mathbf{E}^{(2)\prime}_{1ig} \Big] \left( \sum_{i=1}^{N} \hat{z}_{ig} \mathbf{E}^{(2)}_{3ig} \right)^{-1}$$

and $\hat{\boldsymbol{\Sigma}}_g = \operatorname{diag}\{\mathbf{S}^L_g\}$, respectively, where

$$\begin{aligned}
\mathbf{S}^L_g = \frac{1}{N_g p} \sum_{i=1}^{N} \hat{z}_{ig} \Big[ & b_{ig} (\mathbf{X}_i - \hat{\mathbf{M}}_g) (\hat{\boldsymbol{\Psi}}^*_g)^{-1} (\mathbf{X}_i - \hat{\mathbf{M}}_g)' - \big( \hat{\mathbf{A}}_g + \hat{\boldsymbol{\Lambda}}_g \mathbf{E}^{(2)}_{2ig} \big) (\hat{\boldsymbol{\Psi}}^*_g)^{-1} (\mathbf{X}_i - \hat{\mathbf{M}}_g)' - (\mathbf{X}_i - \hat{\mathbf{M}}_g) (\hat{\boldsymbol{\Psi}}^*_g)^{-1} \hat{\mathbf{A}}_g' \\
& + a_{ig} \hat{\mathbf{A}}_g (\hat{\boldsymbol{\Psi}}^*_g)^{-1} \hat{\mathbf{A}}_g' + \hat{\boldsymbol{\Lambda}}_g \mathbf{E}^{(2)}_{1ig} (\hat{\boldsymbol{\Psi}}^*_g)^{-1} \hat{\mathbf{A}}_g' - (\mathbf{X}_i - \hat{\mathbf{M}}_g) (\hat{\boldsymbol{\Psi}}^*_g)^{-1} \mathbf{E}^{(2)\prime}_{2ig} \hat{\boldsymbol{\Lambda}}_g' + \hat{\mathbf{A}}_g (\hat{\boldsymbol{\Psi}}^*_g)^{-1} \mathbf{E}^{(2)\prime}_{1ig} \hat{\boldsymbol{\Lambda}}_g' + \hat{\boldsymbol{\Lambda}}_g \mathbf{E}^{(2)}_{3ig} \hat{\boldsymbol{\Lambda}}_g' \Big].
\end{aligned}$$

#### AECM Stage 3

In the third stage, the complete-data consist of the observed data $\mathbf{X}_i$, the latent variables $W_{ig}$, the labels $z_{ig}$, and the latent matrices $\mathbf{Y}^A_{ig}$, for $i = 1, \ldots, N$ and $g = 1, \ldots, G$. The complete-data log-likelihood at this stage is

$$\begin{aligned}
\ell_{C3} = {} & C + \sum_{i=1}^{N} \sum_{g=1}^{G} z_{ig} \Big[ \log \pi_g + \log h(w_{ig} \mid \boldsymbol{\theta}_g) + \log \phi_{n \times r}(\mathbf{Y}^A_{ig} \mid \mathbf{0}, w_{ig} \boldsymbol{\Sigma}^*_g, \mathbf{I}_r) + \log \phi_{n \times p}(\mathbf{X}_i \mid \mathbf{M}_g + w_{ig} \mathbf{A}_g + \mathbf{Y}^A_{ig} \boldsymbol{\Delta}_g', w_{ig} \boldsymbol{\Sigma}^*_g, \boldsymbol{\Psi}_g) \Big] \\
= {} & C - \frac{1}{2} \sum_{i=1}^{N} \sum_{g=1}^{G} z_{ig} \Big[ n \log |\boldsymbol{\Psi}_g| + \operatorname{tr}\Big\{ \frac{1}{w_{ig}} \boldsymbol{\Psi}_g^{-1} (\mathbf{X}_i - \mathbf{M}_g)' (\boldsymbol{\Sigma}^*_g)^{-1} (\mathbf{X}_i - \mathbf{M}_g) - \boldsymbol{\Psi}_g^{-1} (\mathbf{X}_i - \mathbf{M}_g)' (\boldsymbol{\Sigma}^*_g)^{-1} \mathbf{A}_g \\
& \quad - \frac{1}{w_{ig}} \boldsymbol{\Psi}_g^{-1} (\mathbf{X}_i - \mathbf{M}_g)' (\boldsymbol{\Sigma}^*_g)^{-1} \mathbf{Y}^A_{ig} \boldsymbol{\Delta}_g' - \boldsymbol{\Psi}_g^{-1} \mathbf{A}_g' (\boldsymbol{\Sigma}^*_g)^{-1} (\mathbf{X}_i - \mathbf{M}_g) + w_{ig} \boldsymbol{\Psi}_g^{-1} \mathbf{A}_g' (\boldsymbol{\Sigma}^*_g)^{-1} \mathbf{A}_g + \boldsymbol{\Psi}_g^{-1} \mathbf{A}_g' (\boldsymbol{\Sigma}^*_g)^{-1} \mathbf{Y}^A_{ig} \boldsymbol{\Delta}_g' \\
& \quad - \frac{1}{w_{ig}} \boldsymbol{\Psi}_g^{-1} \boldsymbol{\Delta}_g \mathbf{Y}^{A\prime}_{ig} (\boldsymbol{\Sigma}^*_g)^{-1} (\mathbf{X}_i - \mathbf{M}_g) + \boldsymbol{\Psi}_g^{-1} \boldsymbol{\Delta}_g \mathbf{Y}^{A\prime}_{ig} (\boldsymbol{\Sigma}^*_g)^{-1} \mathbf{A}_g + \frac{1}{w_{ig}} \boldsymbol{\Psi}_g^{-1} \boldsymbol{\Delta}_g \mathbf{Y}^{A\prime}_{ig} (\boldsymbol{\Sigma}^*_g)^{-1} \mathbf{Y}^A_{ig} \boldsymbol{\Delta}_g' \Big\} \Big],
\end{aligned}$$

where terms not involving $\boldsymbol{\Psi}_g$ or $\boldsymbol{\Delta}_g$ have been absorbed into $C$ in the second expression.

In the E-step, it can be shown that

$$\mathbf{Y}^A_{ig} \mid \mathbf{X}_i, w_{ig}, z_{ig} = 1 \sim \mathcal{N}_{n \times r}\big( (\mathbf{X}_i - \mathbf{M}_g - w_{ig} \mathbf{A}_g) \boldsymbol{\Psi}_g^{-1} \boldsymbol{\Delta}_g (\mathbf{I}_r + \boldsymbol{\Delta}_g' \boldsymbol{\Psi}_g^{-1} \boldsymbol{\Delta}_g)^{-1},\; w_{ig} \boldsymbol{\Sigma}^*_g,\; (\mathbf{I}_r + \boldsymbol{\Delta}_g' \boldsymbol{\Psi}_g^{-1} \boldsymbol{\Delta}_g)^{-1} \big)$$

and so we can calculate the expectations $\mathbf{E}^{(3)}_{1ig} = \mathbb{E}(\mathbf{Y}^A_{ig} \mid \mathbf{X}_i, z_{ig} = 1)$, $\mathbf{E}^{(3)}_{2ig} = \mathbb{E}(W_{ig}^{-1} \mathbf{Y}^A_{ig} \mid \mathbf{X}_i, z_{ig} = 1)$, and $\mathbf{E}^{(3)}_{3ig} = \mathbb{E}(W_{ig}^{-1} \mathbf{Y}^{A\prime}_{ig} (\boldsymbol{\Sigma}^*_g)^{-1} \mathbf{Y}^A_{ig} \mid \mathbf{X}_i, z_{ig} = 1)$, which again follow from the conditional distribution above together with the GIG moments in (2)–(4).

In the M-step, the updates for $\boldsymbol{\Delta}_g$ and $\boldsymbol{\Psi}_g$ are calculated. These updates are given by

$$\hat{\boldsymbol{\Delta}}_g = \sum_{i=1}^{N} \hat{z}_{ig} \Big[ (\mathbf{X}_i - \hat{\mathbf{M}}_g)' (\hat{\boldsymbol{\Sigma}}^*_g)^{-1} \mathbf{E}^{(3)}_{2ig} - \hat{\mathbf{A}}_g' (\hat{\boldsymbol{\Sigma}}^*_g)^{-1} \mathbf{E}^{(3)}_{1ig} \Big] \left( \sum_{i=1}^{N} \hat{z}_{ig} \mathbf{E}^{(3)}_{3ig} \right)^{-1}$$

and $\hat{\boldsymbol{\Psi}}_g = \operatorname{diag}\{\mathbf{S}^D_g\}$, respectively, where
$$\begin{aligned}
\mathbf{S}^D_g = \frac{1}{N_g n} \sum_{i=1}^{N} \hat{z}_{ig} \Big[ & b_{ig} (\mathbf{X}_i - \hat{\mathbf{M}}_g)' (\hat{\boldsymbol{\Sigma}}^*_g)^{-1} (\mathbf{X}_i - \hat{\mathbf{M}}_g) - \big( \hat{\mathbf{A}}_g' + \hat{\boldsymbol{\Delta}}_g \mathbf{E}^{(3)\prime}_{2ig} \big) (\hat{\boldsymbol{\Sigma}}^*_g)^{-1} (\mathbf{X}_i - \hat{\mathbf{M}}_g) - (\mathbf{X}_i - \hat{\mathbf{M}}_g)' (\hat{\boldsymbol{\Sigma}}^*_g)^{-1} \hat{\mathbf{A}}_g \\
& + a_{ig} \hat{\mathbf{A}}_g' (\hat{\boldsymbol{\Sigma}}^*_g)^{-1} \hat{\mathbf{A}}_g + \hat{\boldsymbol{\Delta}}_g \mathbf{E}^{(3)\prime}_{1ig} (\hat{\boldsymbol{\Sigma}}^*_g)^{-1} \hat{\mathbf{A}}_g - (\mathbf{X}_i - \hat{\mathbf{M}}_g)' (\hat{\boldsymbol{\Sigma}}^*_g)^{-1} \mathbf{E}^{(3)}_{2ig} \hat{\boldsymbol{\Delta}}_g' + \hat{\mathbf{A}}_g' (\hat{\boldsymbol{\Sigma}}^*_g)^{-1} \mathbf{E}^{(3)}_{1ig} \hat{\boldsymbol{\Delta}}_g' + \hat{\boldsymbol{\Delta}}_g \mathbf{E}^{(3)}_{3ig} \hat{\boldsymbol{\Delta}}_g' \Big].
\end{aligned}$$