Mixtures of Matrix Variate Bilinear Factor Analyzers

# Mixtures of Matrix Variate Bilinear Factor Analyzers

Michael P.B. Gallaugher and Paul D. McNicholas
Dept. of Mathematics & Statistics, McMaster University, Hamilton, Ontario, Canada.
###### Abstract

Over the years data has become increasingly higher dimensional, which has prompted an increased need for dimension reduction techniques. This is perhaps especially true for clustering (unsupervised classification) as well as semi-supervised and supervised classification. Although dimension reduction in the area of clustering for multivariate data has been quite thoroughly discussed in the literature, there is relatively little work in the area of three way, or matrix variate, data. Herein, we develop a mixture of matrix variate bilinear factor analyzers (MMVBFA) model for use in clustering high-dimensional matrix variate data. This work can be considered both the first matrix variate bilinear factor analyzers model as well as the first MMVBFA model. Parameter estimation is discussed, and the MMVBFA model is illustrated using simulated and real data.

Keywords: Factor analysis; matrix variate distribution; mixture models.

## 1 Introduction

Dimensionality is an ever present concern with data becoming increasingly higher dimensional over the last few years. To combat this issue, dimension reduction techniques have become very important tools, especially in the area of clustering (unsupervised classification) as well as semi-supervised and supervised classification. For multivariate data, the mixture of factor analyzers model has proved to be very useful in this regard as the model performs clustering and dimension reduction simultaneously, details in Section 2. However, there is relative paucity in the area of dimension reduction for use in model-based clustering for matrix variate data. Matrix variate distributions have been shown to be useful for modelling three way data such as images and multivariate longitudinal data; however, the methods presented in the literature do suffer from dimensionality concerns. In this paper, we present a mixture of matrix variate bilinear factor analyzers (MMVBFA) model for use in clustering for higher dimensional matrix data. The matrix variate bilinear factor analyzers model can be viewed as a generalization of bilinear principal component analysis (BPCA; Zhao et al. 2012), and contains BPCA as a special case. An alternating expectation conditional maximization (AECM) algorithm (Meng & van Dyk 1997) is used for parameter estimation. The proposed method is illustrated using both simulated and real datasets.

## 2 Background

### 2.1 Model-Based Clustering

Model-based clustering makes use of a finite mixture model. A -component finite mixture model assumes a random variate has density

 f(x | ϑ)=G∑g=1πgfg(x | θg),

where , is the th component density, and is the th mixing proportion such that . The association between clustering and mixture models, as discussed in McNicholas (2016a), can be traced all the way back to Tiedeman (1955). The earliest use of a mixture model, specifically a Gaussian mixture model, for model-based clustering can be found in Wolfe (1965). Other early work in this area can be found in Baum et al. (1970) and Scott & Symons (1971) with a recent review given by McNicholas (2016b).

The Gaussian mixture model is well-established for clustering both multivariate and matrix variate data because of its mathematical tractability, however; there are many examples of non-Gaussian distributions used for clustering. In the multivariate case, work has been done using symmetric component densities that parameterize concentration (tail weight), e.g., the distribution (Peel & McLachlan 2000, Andrews & McNicholas 2011, 2012, Lin et al. 2014) and the power exponential distribution (Dang et al. 2015). Furthermore, there has been some work in the area of multivariate skewed-distributions, for example the normal-inverse Gaussian distribution (Karlis & Santourian 2009), the skew- distribution (Lin 2010, Vrbik & McNicholas 2012, 2014, Lee & McLachlan 2014, Murray et al. 2014, 2017), the shifted asymmetric Laplace distribution (Franczak et al. 2014), the generalized hyperbolic distribution (Browne & McNicholas 2015), and the variance-gamma distribution (McNicholas et al. 2017).

In the area of matrix variate data, Viroli (2011) consider a mixture of matrix variate normal distributions for clustering, and Doğru et al. (2016) consider a mixture of matrix variate distributions. More recently, Gallaugher & McNicholas (2018) consider mixtures of four skewed matrix variate distributions, specifically the matrix variate skew- (Gallaugher & McNicholas 2017b) as well as generalized hyperbolic, variance-gamma and normal-inverse Gaussian distributions (Gallaugher & McNicholas 2017a), with an application in handwritten digit recognition. As pointed out by Gallaugher & McNicholas (2018), these approaches are limited by the dimensionality of the data and this work aims to help address that limitation.

### 2.2 Matrix Variate Normal Distribution

An random matrix follows a matrix variate normal distribution with location parameter and scale matrices and of dimensions and , respectively, denoted by , if the density of can be written as

 f(X | M,Σ,Ψ)=1(2π)np2|Σ|p2|Ψ|n2exp{−12tr(Σ−1(X−M)Ψ−1(X−M)′)}. (1)

One notable property of the matrix variate normal distribution (Harrar & Gupta 2008) is

 X∼Nn×p(M,Σ,Ψ)⟺vec(X)∼Nnp(vec(M),Ψ⊗Σ), (2)

where is the multivariate normal density with dimension , is the vectorization operator, and is the Kronecker product.

### 2.3 Mixture of Factor Analyzers Model

For the purpose of this section, we temporarily revert back to the notation where represents a -dimensional random vector, with as its realization. The factor analysis model for is given by

 Xi=\boldmathμ+ΛUi+\boldmathεi,

where is a location vector, is a matrix of factor loadings with , denotes the latent factors, , where , and and are each independently distributed and independent of one another. Under this model, the marginal distribution of is . Probabilistic principal component analysis (PPCA) arises as a special case with the isotropic constraint (Tipping & Bishop 1999b).

Ghahramani & Hinton (1997) develop the mixture of factor analyzers model, which is a Gaussian mixture model with covariance structure . A small extension was presented by McLachlan & Peel (2000), who utilize the more general structure . Tipping & Bishop (1999a) introduce the closely-related mixture of PPCAs with . McNicholas & Murphy (2008) constructed a family of eight parsimonious Gaussian models by considering the constraint in addition to and . There has also been work on extending the mixture of factor analyzers to other distributions, such as the skew- distribution (Murray et al. 2014, 2017), the generalized hyperbolic distribution (Tortora et al. 2016), the skew-normal distribution (Lin et al. 2016), the variance-gamma distribution (McNicholas et al. 2017), and others (e.g., Murray et al. 2017).

### 2.4 Previous Work on Matrix Variate Factor Analysis

Xie et al. (2008) and Yu et al. (2008) consider a matrix variate extension of PPCA in a linear fashion. The model assumes an random matrix can be written

 X=M+AUB′+E, (3)

where is an location matrix, is an matrix of column factor loadings, is a matrix of row factor loadings, , and . It is assumed that and are independent of each other. The main disadvantage of this model is that, in general, does not follow a matrix variate normal distribution.

Zhao et al. (2012) present bilinear probabilistic principal component analysis (BPPCA) which extends (3) by adding two projected error terms. The resulting model assumes can be written

 X=M+AUB′+AEB+EAB′g+E, (4)

where is the same as in (3), , . In this model it is assumed that , and are all independent of each other. It is important to note that the term “column factors” refers to reduction in the dimension of the columns, which is equivalent to the number of rows, and not a reduction in the number of columns. Likewise, the term “row factors” refers to the reduction in the dimension of the rows (number of columns). As discussed by Zhao et al. (2012) the interpretation of the terms and are the row and column noise respectively, whereas the final term is the common noise. It can be shown using property (2) that under this model . Note that the covariance structure for the two covariance matrices of the matrix variate normal are analogous to the covariance structure for the (multivariate) factor analysis model.

## 3 Methodology

### 3.1 MMVBFA Model

An MMVBFA model is derived here by extending (4). Specifically, we remove the isotropic constraint and assume

 Xi=Mg+AgUigB′g+AgEBig+EAigB′g+Eig (5)

with probability , for , where is an location matrix, is an column factor loading matrix, with , is a row factor loading matrix, with , and

 Uig∼Nq×r(0,Iq,Ir),EBig∼Nq×p(0,Iq,Ψg),EAig∼Nn×r(0,Σg,Ir),Eig∼Nn×p(0,Σg,Ψg)

are independent of each other, , with , and , with .

Let denote the component membership for , where

 zig={1if Xi belongs to component g,0otherwise,

for and . Using the vectorization of , and property (2), it can be shown that

 Xi | zig=1∼Nn×p(Mg,Σg+AgA′g,Ψg+BgB′g).

Therefore, the density of can be written

 f(Xi|ϑ)=G∑g=1πgφn×p(Xi|Mg,Σg+AgA′g,Ψg+BgB′g),

where denotes the matrix variate normal density. Following a similar procedure described by Zhao et al. (2012), by introducing latent variables and , (5) can be written

The two stage interpretation of this formulation of the model is the same as that given by Zhao et al. (2012) where this can viewed as first projecting in the column direction onto the latent matrix , and then and are further projected in the row direction. Likewise, introducing and , (5) can be written

 Xi=Mg+YAigB′g+VAig,YAig=AgUig+EAig,VAig=AgEBig+Eig.

The interpretation is the same as before only we project in the row direction first followed by the column direction. It can be shown that

 YBig|Xi,zig=1∼Nq×p(WAg−1A′gΣ−1g(Xi−Mg),WAg−1,ΛBg)

and

 YAig|Xi,zig=1∼Nn×r((Xi−Mg)Ψ−1gBgWBg−1,ΛAg,WBg−1),

where , , , and

### 3.2 Parameter Estimation

Suppose we observe N observations then the log-likelihood is given by

 L(ϑ)=N∑i=1logG∑g=1πgφn×p(Xi|Σg+AgA′g,Ψg+BgB′g). (6)

To maximize (6), the observed data is viewed as incomplete and an AECM is then to maximize (6). There are three different sources of missingingness: the component memberships as well as the latent variables and . A three-stage AECM algorithm is now described for parameter estimation.

AECM Stage 1: In the first stage, the complete-data is taken to be the observed matrices and the component memberships , and the update for is calculated. The complete-data log-likelihood in the first stage is then

 ℓ(1)=C+G∑g=1N∑i=1zig{logπg−12tr[Λ−1Ag(Xi−Mg)Λ−1Bg(Xi−Mg)′]},

where is a constant independent of , and . In the E-Step, the updates for the component memberships are given by

 ^zig=πgφn×p(Xi | ^Mg,^ΛAg,^ΛBg)∑Gh=1πgφn×p(Xi | ^Mh,^ΛAh,^ΛBh),

where denotes the matrix variate normal density. In the CM-step, the update for is calculated using

 ^Mg=∑Ni=1^zigXiNg,

where .

AECM Stage 2: In the second stage, the complete-data is taken to be the observed , the component memberships and the latent factors . The complete-data log-likelihood is then

 (7)

In the E-Step, the following expectations are calculated:

 aBig\vbox:=E[YBig | Xi,zig=1]=WAg−1A′gΣ−1g(Xi−Mg),bBig\vbox:\vbox:=E[YBig^Λ−1BgYBig′]=pWAg−1+aBigΛ−1BgaBig′. (8)

As usual, these expectations are calculated using the current estimates of the parameters. In the CM-step and are updated via

 ^Ag=N∑i=1^zig(Xi−^Mg)^Λ−1BgaBig′(N∑i=1^zigbBig)−1,^Σg=1Ngp% diag{^SBg},

where

 SBg=N∑i=1^zig[(Xi−^Mg)^Λ−1Bg(Xi−^Mg)′−^AgaBig^Λ−1Bg(Xi−^Mg)′].

AECM Stage 3: In the last stage of the AECM algorithm, the complete data is taken to be the observed , the component memberships and the latent factors . In this step, the complete-data log-likelihood is

 ℓ(3)=C−Ngn2log|Ψg|−12G∑g=1N∑i=1zigtr[Ψ−1g(Xi−Mg)′Λ−1Ag(Xi−Mg)−Ψ−1gBgYAig′Λ−1Ag(Xi−Mg)−Ψ−1g(Xi−Mg)′Λ−1AgYAigB′g+Ψ−1gBgYAig′Λ−1AgYAigB′g].

In the E-Step, expectations similar to those in the second step are calculated.

 aAig:=E[YAig | Xi,zig=1]=(Xi−Mg)Ψ−1gBgWBg−1

and

 bAig:=E[YAig′Λ−1BgYAig]=nWBg−1+aAig′Λ−1AgaAig.

In the CM-step we update and given by

 ^Bg=N∑i=1^zig(Xi−^Mg)′^Λ−1AgaAig(N∑i=1^zigbAig)−1,^Ψg=1Ngn% diag{^SAg},

where

 SAg=N∑i=1^zig[(Xi−^Mg)′^Λ−1Ag(Xi−^Mg)−^BgaAig′^Λ−1Ag(Xi−^Mg)].

### 3.3 Semi-Supervised Classification

The model presented herein for clustering may also be used for semi-supervised classification. Suppose matrices are observed, and of these observations have known labels from one of classes. Without loss of generality order the matrices so that the first have known labels and the remaining observations have unknown labels. The observed likelihood is then

 L(ϑ)=K∏i=1G∏g=1[πgφn×p(Xi|Mg,Σg+AgA′g,Ψg+BgB′g)]zig×N∏j=K+1H∑h=1πhφn×p(Xi|Mh,Σh+AhA′h,Ψh+BhB′h).

It is possible for , however; for our analyses we assume that . Parameter estimation then proceeds in a similar manner for the clustering scenario. For more information on semi-supervised classification refer to McNicholas (2016a).

### 3.4 Model Selection, Initialization and Convergence

For a general dataset the number of components and/or the number of factors will not be known a priori and therefore we will have to select them. One common selection criterion is the Bayesian information criterion (BIC; Schwarz 1978) and is given by

 BIC=2ℓ(^ϑ)−flogN,

where is the number of free parameters. The BIC is used as the selection criterion for all of our analyses.

To initialize the AECM algorithm, we employ an alternating emEM strategy (Biernacki et al. 2003). This consists of running the AECM algorithm for a small number of iterations for different random starting values of the parameters and then use the parameters that maximize the likelihood to continue with the AECM algorithm until convergence.

The simplest convergence criterion would be to use lack of progress in the log-likelihood, however; it is possible for the log-likelihood to “plateau” and then increase again thus terminating the algorithm prematurely. One alternative is to use a criterion based on the Aitken acceleration (Aitken 1926). The acceleration at iteration is

 a(t)=l(t+1)−l(t)l(t)−l(t−1),

where is the observed likelihood at iteration . We then define

 l(t+1)∞=l(t)+(l(t+1)−l(t))1−a(t),

(refer to Böhning et al. 1994, Lindsay 1995). This quantity is an estimate of the observed log likelihood after many iterations at iteration . As in McNicholas et al. (2010), we terminate the algorithm when , provided the difference is positive. It is important to note that, in each AECM algorithm run for the analyses herein, we make the choice of based on the magnitude of the log-likelihood. Specifically, after running the 10 iterations of the emEM algorithm, we choose  to be four orders of magnitude lower than the log-likelihood.

### 3.5 Reduction in Number of Free Covariance Parameters

Because the covariance structure of both covariance matrices in the MVVBFA model is equivalent to the covariance structure in the multivariate MFA model many of the results on the number of free covariance parameters may be used here. Specifically there are free covariance parameters in and free covariance parameters in (Lawley & Maxwell 1962). Therefore, reduction in the number of free covariance parameters for the row covariance matrix is

which is positive for . Likewise for the column covariance matrix the reduction in the number of parameters is

which is positive for .

In applications herein, the model is fit for a range of row factors and column factors. If the number of factors chosen by the BIC is the maximum in the range, the number of factors should be increased so long as the above conditions are met.

## 4 Data Analyses

### 4.1 Simulations

#### Simulation 1

In the first simulation, two groups were considered with matrices. The mixing proportions were taken to be , and . Observations were simulated from (5) with column factors and row factors. For each value of , 50 datasets were simulated. For each dataset, for each , the correct number of groups, column and row factors were selected. In addition, perfect classification was achieved (). In Table  1, we show the average value of , for and for each value of , over the 50 datasets. Note that if is an matrix then

 ||W||1=max1≤j≤pn∑i=1|wij|.

As expected, the estimates of get closer to the true values as the sample size in increased. Moreover, the variability of decreases as the sample size increases.

#### Simulation 2

The second simulation considered three groups with matrices. The mixing proportions were and , and . Again, 50 datasets were simulated for each with column factors and row factors. As in Simulation 1, the correct number of groups, column and row factors were chosen and perfect classification was achieved. In Table 2, we again show the average 1-norms for the difference between the true and estimated location parameters.

### 4.2 MNIST Digit Recognition

We considered the MNIST digit dataset (LeCun et al. 1998), which contains over 60,000 greyscale images of handwritten Arabic digits 0 to 9. The images are represented by pixel matrices with greyscale intensities ranging from 0 to 255. Because of the lack of variability in the outer rows and columns, some random noise was added while adding 50 to each of the non-zero elements to avoid confusing the noise with a true signal. We were interested in comparing digit 1 to digit 7, as this was considered in Gallaugher & McNicholas (2018). Similar to Gallaugher & McNicholas (2018), we consider semi-supervised classification with 25%, 50% and 75% supervision. In each case, 25 datasets were considered each consisting of 200 observations from each digit and we fit the model for 10 to 20 column and row factors.

In Table 3 we show an aggregated classification table between the true and predicted classifications at each level of supervision for the points considered unlabelled. As expected, when the level of supervision is increased slightly better classification performance is obtained. Moreover, there is a more substantial difference when going from 25% supervision to 50% supervision than 50% to 75%. Table 4 shows the average ARI and misclassification rate (MCR) over the 25 datasets with the respective standard deviations for each level of supervision. We note that we obtain better results than Gallaugher & McNicholas (2018) even with a lower level of supervision; however, the results in Gallaugher & McNicholas (2018) were based on resized images due to dimensionality constraints whereas this analysis was performed on the original images.

In Table 5 the frequency of the number of factors chosen for each level of supervision over the 25 datasets is shown. For the majority of the datasets, the number of row and column factors lie between 13 and 15 factors. Figure 1: Heatmaps for the average estimated location matrices taken over the 25 runs for digit 1 in a, b, and c at 25%, 50% and 75% supervision respectively, and digit 7 in d, e and f at 25%, 50% and 75% supervision respectively.

Finally in Figure 1 heatmaps are displayed for the average estimates of the location matrices over the 25 runs for each level of supervision for both digits. We see a slight increase in quality when going from 25% to 50% supervision for digit 7 with the centre of the digit being a little smoother with 50% supervision. There is no noticeable different when going from 50% to 75% supervision. This similarity across the three levels of supervision illustrates the power of semi-supervised classification.

### 4.3 Olivetti Faces Dataset

Finally, consider the Olivetti faces dataset from the R package RnavGraphImageData (Waddell & Oldford 2013). The dataset consists of greyscale images of faces that were taken between 1992 and 1994 at AT&T laboratories in Cambridge. There were 40 individuals with 10 images of each individual for a total of 400 images. The images were taken with varied lighting, expressions (eyes open/closed, smile/frown etc.) and glasses or no glasses. We fit the model for 15 to 30 column and row factors, and 1 to 9 components. The BIC choose three components with 23 column factors and 26 row factors. The estimated mixing proportions are . In Figure 2, we show a heatmap of the estimated location parameters for each component. The heatmap for component 3 arguably shows the clearest image and appears to display the glasses feature. Figure 2: Estimated location matrices for (a) component 1, (b) component 2 and (c) component 3 for the faces dataset.

Upon looking at individual faces classified to component 3 (Figure 2), all the faces have glasses. Moreover, all faces with glasses are classified to component 3 with the exception of two which were classified to component 2. The faces with closed eyes are scattered throughout the three different components and were not classified to any one component. Although it is a difficult to determine the main feature that differentiates component 1 from component 2, it is apparent that the eyebrows for the faces classified to component 1 tend to be more prominent and higher above the eyelid. Of course, a semi-supervised approach to these data could be used to detect specific classes, similar to the MNIST analysis (Section 4.2). However, the unsupervised analysis here has shown that the MMVBFA approach can be effective at detecting subgroups without training.

## 5 Summary

In this paper, we developed a mixture of bilinear factor analyzers model for use in clustering and classification of matrix variate data. Two simulations as well as two real data examples were used for illustration. For each of the simulations, the correct number of components and column/row factors were chosen by the BIC for all of the datasets. Perfect classification performance was also obtained in the simulations. In the MNIST digit application, even with a lower level of supervision, we obtained better results than Gallaugher & McNicholas (2018). However, this is almost certainly due to the fact that we used the full image compared to a scaled image. In the faces application, the BIC choose three groups with the third group being defined by the presence of the glasses facial feature. The matrix normality of in the presented model will allow for direct extensions to mixtures of matrix variate factor analyzers, as well as skewed matrix variate factor analyzers analogous to their multivariate counterparts.

## References

• (1)
• Aitken (1926) Aitken, A. C. (1926), ‘A series formula for the roots of algebraic and transcendental equations’, Proceedings of the Royal Society of Edinburgh 45, 14–22.
• Andrews & McNicholas (2011) Andrews, J. L. & McNicholas, P. D. (2011), ‘Extending mixtures of multivariate t-factor analyzers’, Statistics and Computing 21(3), 361–373.
• Andrews & McNicholas (2012) Andrews, J. L. & McNicholas, P. D. (2012), ‘Model-based clustering, classification, and discriminant analysis via mixtures of multivariate -distributions: The EIGEN family’, Statistics and Computing 22(5), 1021–1029.
• Baum et al. (1970) Baum, L. E., Petrie, T., Soules, G. & Weiss, N. (1970), ‘A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains’, Annals of Mathematical Statistics 41, 164–171.
• Biernacki et al. (2003) Biernacki, C., Celeux, G. & Govaert, G. (2003), ‘Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models’, Computational Statistics and Data Analysis 41, 561–575.
• Böhning et al. (1994) Böhning, D., Dietz, E., Schaub, R., Schlattmann, P. & Lindsay, B. (1994), ‘The distribution of the likelihood ratio for mixtures of densities from the one-parameter exponential family’, Annals of the Institute of Statistical Mathematics 46, 373–388.
• Browne & McNicholas (2015) Browne, R. P. & McNicholas, P. D. (2015), ‘A mixture of generalized hyperbolic distributions’, Canadian Journal of Statistics 43(2), 176–198.
• Dang et al. (2015) Dang, U. J., Browne, R. P. & McNicholas, P. D. (2015), ‘Mixtures of multivariate power exponential distributions’, Biometrics 71(4), 1081–1089.
• Doğru et al. (2016) Doğru, F. Z., Bulut, Y. M. & Arslan, O. (2016), ‘Finite mixtures of matrix variate t distributions’, Gazi University Journal of Science 29(2), 335–341.
• Franczak et al. (2014) Franczak, B. C., Browne, R. P. & McNicholas, P. D. (2014), ‘Mixtures of shifted asymmetric Laplace distributions’, IEEE Transactions on Pattern Analysis and Machine Intelligence 36(6), 1149–1157.
• Gallaugher & McNicholas (2017a) Gallaugher, M. P. B. & McNicholas, P. D. (2017a), ‘Three skewed matrix variate distributions’. arXiv preprint arXiv:1704.02531.
• Gallaugher & McNicholas (2018) Gallaugher, M. P. B. & McNicholas, P. D. (2018), ‘Finite mixtures of skewed matrix variate distributions’, Pattern Recognition . In press.
• Gallaugher & McNicholas (2017b) Gallaugher, M. P. & McNicholas, P. D. (2017b), ‘A matrix variate skew-t distribution’, Stat 6(1), 160–170.
• Ghahramani & Hinton (1997) Ghahramani, Z. & Hinton, G. E. (1997), The EM algorithm for factor analyzers, Technical Report CRG-TR-96-1, University of Toronto, Toronto, Canada.
• Harrar & Gupta (2008) Harrar, S. W. & Gupta, A. K. (2008), ‘On matrix variate skew-normal distributions’, Statistics 42(2), 179–194.
• Karlis & Santourian (2009) Karlis, D. & Santourian, A. (2009), ‘Model-based clustering with non-elliptically contoured distributions’, Statistics and Computing 19(1), 73–83.
• Lawley & Maxwell (1962) Lawley, D. N. & Maxwell, A. E. (1962), ‘Factor analysis as a statistical method’, Journal of the Royal Statistical Society: Series D 12(3), 209–229.
• LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. (1998), ‘Gradient-based learning applied to document recognition’, Proceedings of the IEEE 86(11), 2278–2324.
• Lee & McLachlan (2014) Lee, S. & McLachlan, G. J. (2014), ‘Finite mixtures of multivariate skew t-distributions: some recent and new results’, Statistics and Computing 24, 181–202.
• Lin (2010) Lin, T.-I. (2010), ‘Robust mixture modeling using multivariate skew t distributions’, Statistics and Computing 20(3), 343–356.
• Lin et al. (2014) Lin, T.-I., McNicholas, P. D. & Hsiu, J. H. (2014), ‘Capturing patterns via parsimonious t mixture models’, Statistics and Probability Letters 88, 80–87.
• Lin et al. (2016) Lin, T., McLachlan, G. J. & Lee, S. X. (2016), ‘Extending mixtures of factor models using the restricted multivariate skew-normal distribution’, Journal of Multivariate Analysis 143, 398–413.
• Lindsay (1995) Lindsay, B. G. (1995), Mixture models: Theory, geometry and applications, in ‘NSF-CBMS Regional Conference Series in Probability and Statistics’, Vol. 5, Hayward, California: Institute of Mathematical Statistics.
• McLachlan & Peel (2000) McLachlan, G. J. & Peel, D. (2000), Mixtures of factor analyzers, in ‘Proceedings of the Seventh International Conference on Machine Learning’, Morgan Kaufmann, San Francisco, pp. 599–606.
• McNicholas (2016a) McNicholas, P. D. (2016a), Mixture Model-Based Classification, Chapman & Hall/CRC Press, Boca Raton.
• McNicholas (2016b) McNicholas, P. D. (2016b), ‘Model-based clustering’, Journal of Classification 33.
• McNicholas & Murphy (2008) McNicholas, P. D. & Murphy, T. B. (2008), ‘Parsimonious Gaussian mixture models’, Statistics and Computing 18(3), 285–296.
• McNicholas et al. (2010) McNicholas, P. D., Murphy, T. B., McDaid, A. F. & Frost, D. (2010), ‘Serial and parallel implementations of model-based clustering via parsimonious Gaussian mixture models’, Computational Statistics and Data Analysis 54(3), 711–723.
• McNicholas et al. (2017) McNicholas, S. M., McNicholas, P. D. & Browne, R. P. (2017), A mixture of variance-gamma factor analyzers, in S. E. Ahmed, ed., ‘Big and Complex Data Analysis: Methodologies and Applications’, Springer International Publishing, Cham, pp. 369–385.
• Meng & van Dyk (1997) Meng, X.-L. & van Dyk, D. (1997), ‘The EM algorithm — an old folk song sung to a fast new tune (with discussion)’, Journal of the Royal Statistical Society: Series B 59(3), 511–567.
• Murray et al. (2014) Murray, P. M., Browne, R. B. & McNicholas, P. D. (2014), ‘Mixtures of skew-t factor analyzers’, Computational Statistics and Data Analysis 77, 326–335.
• Murray et al. (2017) Murray, P. M., Browne, R. B. & McNicholas, P. D. (2017), ‘A mixture of SDB skew-t factor analyzers’, Econometrics and Statistics 3, 160–168.
• Peel & McLachlan (2000) Peel, D. & McLachlan, G. J. (2000), ‘Robust mixture modelling using the t distribution’, Statistics and Computing 10(4), 339–348.
• Schwarz (1978) Schwarz, G. (1978), ‘Estimating the dimension of a model’, The Annals of Statistics 6(2), 461–464.
• Scott & Symons (1971) Scott, A. J. & Symons, M. J. (1971), ‘Clustering methods based on likelihood ratio criteria’, Biometrics 27, 387–397.
• Tiedeman (1955) Tiedeman, D. V. (1955), On the study of types, in S. B. Sells, ed., ‘Symposium on Pattern Analysis’, Air University, U.S.A.F. School of Aviation Medicine, Randolph Field, Texas.
• Tipping & Bishop (1999a) Tipping, M. E. & Bishop, C. M. (1999a), ‘Mixtures of probabilistic principal component analysers’, Neural Computation 11(2), 443–482.
• Tipping & Bishop (1999b) Tipping, M. E. & Bishop, C. M. (1999b), ‘Probabilistic principal component analysers’, Journal of the Royal Statistical Society. Series B 61, 611–622.
• Tortora et al. (2016) Tortora, C., McNicholas, P. D. & Browne, R. P. (2016), ‘A mixture of generalized hyperbolic factor analyzers’, Advances in Data Analysis and Classification 10(4), 423–440.
• Viroli (2011) Viroli, C. (2011), ‘Finite mixtures of matrix normal distributions for classifying three-way data’, Statistics and Computing 21(4), 511–522.
• Vrbik & McNicholas (2012) Vrbik, I. & McNicholas, P. D. (2012), ‘Analytic calculations for the EM algorithm for multivariate skew-t mixture models’, Statistics and Probability Letters 82(6), 1169–1174.
• Vrbik & McNicholas (2014) Vrbik, I. & McNicholas, P. D. (2014), ‘Parsimonious skew mixture models for model-based clustering and classification’, Computational Statistics and Data Analysis 71, 196–210.
• Waddell & Oldford (2013) Waddell, A. R. & Oldford, R. W. (2013), RnavGraphImageData: Some image data used in the RnavGraph package demos. R package version 0.0.3.
• Wolfe (1965) Wolfe, J. H. (1965), A computer program for the maximum likelihood analysis of types, Technical Bulletin 65-15, U.S. Naval Personnel Research Activity.
• Xie et al. (2008) Xie, X., Yan, S., Kwok, J. T. & Huang, T. S. (2008), ‘Matrix-variate factor analysis and its applications’, IEEE Transactions on Neural Networks 19(10), 1821–1826.
• Yu et al. (2008) Yu, S., Bi, J. & Ye, J. (2008), Probabilistic interpretations and extensions for a family of 2D PCA-style algorithms, in ‘Workshop Data Mining Using Matrices and Tensors (DMMT ‘08): Proceedings of a Workshop held in Conjunction with the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD 2008)’.
• Zhao et al. (2012) Zhao, J., Philip, L. & Kwok, J. T. (2012), ‘Bilinear probabilistic principal component analysis’, IEEE Transactions on Neural Networks and Learning Systems 23(3), 492–503.
You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters   