Mixtures of Shifted Asymmetric Laplace Distributions

# Mixtures of Shifted Asymmetric Laplace Distributions

Brian C. Franczak, Ryan P. Browne, and Paul D. McNicholas Department of Mathematics & Statistics, University of Guelph, Guelph, Ontario, N1G 2W1, Canada. E-mail: paul.mcnicholas@uoguelph.ca.
Department of Mathematics & Statistics, University of Guelph.
###### Abstract

A mixture of shifted asymmetric Laplace distributions is introduced and used for clustering and classification. A variant of the EM algorithm is developed for parameter estimation by exploiting the relationship with the general inverse Gaussian distribution. This approach is mathematically elegant and relatively computationally straightforward. Our novel mixture modelling approach is demonstrated on both simulated and real data to illustrate clustering and classification applications. In these analyses, our mixture of shifted asymmetric Laplace distributions performs favourably when compared to the popular Gaussian approach. This work, which marks an important step in the non-Gaussian model-based clustering and classification direction, concludes with discussion as well as suggestions for future work.

## 1 Introduction

Finite mixture models are based on the underlying assumption that a population is a convex combination of a finite number of densities. They therefore lend themselves quite naturally to classification and clustering problems. Formally, a random vector arises from a parametric finite mixture distribution if, for all , we can write its density as

 f(x∣\boldmathϑ)=G∑g=1πgfg(x∣\boldmathθg),

where , such that are the mixing proportions, are called component densities, and is the vector of parameters with . The component densities are usually taken to be of the same type, most often multivariate Gaussian. In the event that the component densities are multivariate Gaussian, the density of the mixture model is , where is the multivariate Gaussian density with mean  and covariance matrix . The popularity of the multivariate Gaussian distribution is due to its mathematical tractability and its flexibility in capturing densities; we will return to this latter point in Section 4.1. Herein, we shall follow convention and use the term model-based clustering to mean clustering using mixture models. Model-based classification (e.g., McNicholas, 2010), or partial classification (cf. McLachlan, 1992, Section 2.7), can be regarded as a semi-supervised version of model-based clustering.

At the time of the review paper of Fraley and Raftery (2002), almost all work on clustering and classification using mixture models had been based on Gaussian mixture models (e.g., Banfield and Raftery (1993), Celeux and Govaert (1995), Ghahramani and Hinton (1997), Fraley and Raftery (1998), Tipping and Bishop (1999), McLachlan and Peel (2000a), and McLachlan and Peel (2000b), amongst others). An important example of non-Gaussian work from this time is the early work on clustering using mixtures of multivariate -distributions carried out by McLachlan and Peel (1998) and Peel and McLachlan (2000). This work was the forerunner to several papers on clustering using mixtures of multivariate -distributions, including those by McLachlan et al. (2007), Andrews and McNicholas (2011), and Baek and McLachlan (2011). Work has also burgeoned on skew-normal distributions (e.g., Lin, 2009), skew- distributions (e.g., Lin, 2010; Lee and McLachlan, 2011; Vrbik and McNicholas, 2012), and other non-elliptically contoured distributions (e.g., Karlis and Meligkotsidou, 2007; Karlis and Santourian, 2009; Browne et al., 2012).

The recent burgeoning of non-Gaussian approaches to model-based clustering and classification has coincided with yet more papers on Gaussian approaches. These include work on extensions of mixtures of factor analyzers (McNicholas and Murphy, 2008, 2010; Baek et al., 2010) and developments on variable selection and dimension reduction for Gaussian model-based clustering (Raftery and Dean, 2006; Maugis et al., 2009; Scrucca, 2010). Nevertheless, the fecundity of non-Gaussian approaches has certainly been more potent than that of Gaussian approaches over the last few years. This paper introduces a non-Gaussian approach that allows for skewness while also parameterizing location and scale. Our approach is effective while also being mathematically elegant and relatively computationally straightforward. The methodology is developed in Section 2, parameter estimation via deterministic annealing and an expectation-maximization algorithm is outlined in Section 3, and both simulated and real data analyses are used to illustrate our approach (Section 4). The paper concludes with a summary and suggestions for future work (Section 5).

## 2 Methodology

### 2.1 Generalized Inverse Gaussian Distribution

The density of a random variable following a generalized inverse Gaussian (GIG) distribution is given by

 q(x)=(a/b)ν/2xν−12Kν(√ab)exp{−ax+b/x2}, (1)

for , where , , and is the modified Bessel function of the third kind with index . There are several special cases of the GIG distribution, such as the gamma distribution (, ) and the inverse Gaussian distribution (). The GIG distribution was introduced by Good (1953) and its statistical properties were laid down by Barndorff-Nielsen and Halgreen (1977), Blæsild (1978), Halgreen (1979), and Jørgensen (1982). It has some attractive properties including the tractability of the following expected values:

 E[X]=√bKν+1(√ab)√aKν(√ab),E[1/X]=√aKν+1(√ab)√bKν(√ab)−2νb. (2)

### 2.2 Shifted Asymmetric Laplace Distribution

Consider a -dimensional random vector from a centralized asymmetric Laplace (CAL) distribution (Kotz et al., 2001). The density of is given by

 f(z∣\boldmathα,% \boldmathΣ)=2Kν(u)(2π)p/2|\boldmath% Σ|1/2(z′\boldmathΣ−1z2+\boldmathα′\boldmathΣ−1\boldmathα)ν/2exp{z′\boldmathΣ−1\boldmathα}, (3)

where , , is a covariance matrix, and represents the skewness in each dimension. Kotz et al. (2001) use the notation to indicate that the random variable follows a -dimensional CAL distribution and provide an extensive properties list.

The CAL density (3) is prohibitive for model-based clustering and classification applications because it would force each component density to be joined at the same origin. To address this problem, consider and introduce a shift parameter by considering a random vector , where denotes a -dimensional shifted (non-centralized) asymmetric Laplace (SAL) distribution with density given by

 (4)

where , is the squared Mahalanobis distance between and , and , , and are defined as before.

Kotz et al. (2001) note that the random variable can be generated through the relationship , where is a random variable from an exponential distribution with mean 1 and is generated independent of . Therefore, the random variable can be generated through the relationship and so .

The distribution of conditional on the data can be computed through the use of Bayes’ theorem, i.e.,

 fW(w∣X=x)=fX(x∣W=w)h(w)/fX(x),

where , , and is the density of the shifted asymmetric Laplace distribution given in (4). It follows that

 fW(w∣X=x)=wν−12(δ(x,\boldmathμ∣\boldmathΣ)2+\boldmathα′% \boldmathΣ−1\boldmathα)−ν/2exp{−12wδ(x,\boldmathμ∣% \boldmathΣ)−w2(2+\boldmathα′\boldmathΣ−1\boldmathα)}Kν(√(2+\boldmathα′\boldmathΣ−1\boldmathα)δ(x,\boldmathμ∣% \boldmathΣ)), (5)

where , , , , and are as defined for (4). Recalling the density of a GIG random variable (1), it then follows from (5) that is the GIG density with and , cf. (Barndorff-Nielsen, 1997).

We introduce a finite mixture of SAL distributions so that the th component density is , where the parameters are as defined for (4). The density of a mixture of SAL distributions is , where is the vector of all model parameters and is the SAL density from (4).

## 3 Parameter Estimation

### 3.1 EM Algorithm

The expectation-maximization (EM) algorithm (Dempster et al., 1977) is an iterative procedure for finding the maximum likelihood estimates when data are incomplete or are treated as such. EM algorithm computations are based on the complete-data likelihood, i.e., the likelihood of the observed data plus the latent or missing data. In the E-step, the expected value of the complete-data log-likelihood is computed, and in the M-step, this value is maximized with respect to the model parameters. The E- and M-steps are then iterated until some convergence criterion is attained.

A common criticism of the use of the EM algorithm for model-based clustering is that the singularity-riddled likelihood surface makes parameter estimation unreliable and heavily dependent on the starting values. To help overcome this problem, Zhou and Lange (2010) introduced a deterministic annealing algorithm that flattens the likelihood surface by introducing an auxiliary variable . We will illustrate this approach by using a version of this deterministic annealing algorithm in conjunction with an EM algorithm to fit our SAL mixture model (cf. Section 3.3). Note that Eltoft et al. (2006) give an example of an EM-type algorithm for fitting a multivariate Laplace distribution.

### 3.2 Application to Mixture of SAL Distributions

For our SAL mixture models, the complete-data comprise the observed , the component membership labels , and the variable . For each , we have , where if observation  is in component  and otherwise, for and . The complete-data likelihood is given by

 L=n∏i=1G∏g=1[πgϕ(xi∣\boldmathμg+wi\boldmathαg,wi% \boldmathΣg)h(wi)]τig, (6)

with the same notation used previously and where is the density of a multivariate Gaussian distribution with mean and covariance matrix . The expected-value of the complete-data log-likelihood is given by

 Q=G∑g=1nglogπg−np2log2π−np2n∑i=1E[logWi∣xi]+G∑g=1ng2log∣∣\boldmath% Σ−1g∣∣+2n∑i=1G∑g=1^τig(xi−\boldmathμg)′\boldmathΣ−1g\boldmathαg−12n∑i=1G∑g=1^τig(xi−\boldmathμg)′E[1/Wi∣xi]\boldmathΣ−1g(xi−\boldmathμg)−12n∑i=1G∑g=1^τigE[Wi∣xi]\boldmathα′g\boldmathΣ−1g\boldmathαg −n∑i=1G∑g=1^τigE[Wi∣xi],

where and

 ^τig\vbox:=E[τig∣xi]=πgξ(xi∣\boldmathαg,% \boldmathΣg,\boldmathμg)∑Gj=1πjξ(xi∣\boldmathαj,\boldmathΣj,\boldmathμj). (7)

The expected values and are computed using the formulae in (2).

In the M-step, we maximize with respect to the model parameters to get the updates. Specifically, the mixing proportions, skewness parameter, and shift parameter are updated by ,

 ^\boldmathαg=∑ni=1^τigE[1/Wi∣xi]∑nj=1^τjgxj−ng∑ni=1^τigE[1/Wi∣xi]xi∑ni=1^τigE[Wi∣xi]∑nj=1^τjgE[1/Wj∣xj]−n2g,

and

 ^\boldmathμg=∑ni=1^τigE[Wi∣xi]∑nj=1E[1/Wj∣xj]xj−ng∑ni=1^τigxi∑ni=1^τigE[Wi∣xi]∑nj=1E[1/Wj∣xj]−n2g,

respectively. Each component covariance matrix is updated by

 ^\boldmathΣg=Sg−^\boldmathα% gr′g−rg^\boldmathα′g+1ng^\boldmathαg^% \boldmathα′gn∑i=1^τigE[Wi∣xi], (8)

where and .

The E- and M-steps are iterated until convergence. Practically, the E-step consists of updating the values of the expected values , , and . At the first iteration, these updates are based on the initialized values of the parameter estimates , , , and (cf. Section 3.3). For all other iterations, these updates are based on the values of the parameter estimates from the previous iteration. The M-step consists of updating the values of the parameters , , , and based on the expected values from the E-step. The E- and M-steps are iterated until convergence (cf. Section 3.4).

At convergence, the are the a posteriori probabilities of component membership for each observation and can be used to cluster the observations into groups. Predicted classifications are obtained via maximum a posteriori (MAP) probabilities, where if max occurs at component and otherwise.

### 3.3 Initialization

The deterministic annealing algorithm (Zhou and Lange, 2010) is the same as the EM algorithm described in Section 3.1, except that now

 E[τig∣xi]=[πgξ(xi∣\boldmathαg,\boldmathΣg,\boldmathμg)]v∑Gh=1[πhξ(xi∣\boldmathαh,\boldmathΣh,% \boldmathμh)]v (9)

in each E-step. Here, the auxiliary parameter , which is drawn from an increasing sequence of user-specified length, transforms the likelihood surface to improve the chances of finding the dominant mode. The user-specified sequence runs from to and its length determines how many iterations of the deterministic annealing algorithm will be preformed. The annealing algorithm itself is initialized using random starting values of , , , and . In our analyses (Section 4), we run the deterministic annealing algorithm ten times and choose the values that give the highest likelihood as the starting values for our EM algorithms.

### 3.4 Convergence

#### 3.4.1 Aitken acceleration

The Aitken acceleration (Aitken, 1926) is used to determine convergence of our EM algorithms. An EM algorithm can be considered to have converged when , where is the log-likelihood at iteration and

 l(k+1)∞=l(k)+l(k+1)−l(k)1−a(k)

is an asymptotic estimate of the log-likelihood at iteration (cf. Böhning et al., 1994). The value is the Aitken acceleration at iteration . For the analyses herein, we use a slightly modified convergence criterion, stopping our EM algorithms when (cf. Lindsay, 1995).

#### 3.4.2 Dealing with infinite likelihood

As our EM algorithm iterates, we must handle the complications that arise when computing . Specifically, as the algorithm iterates towards convergence, the value of will tend to an observation . This happens because of the term in the multivariate SAL density (4). Although these estimates of maximize the likelihood, they create computational issues when trying to determine the remaining parameter values and, specifically, the expected value .

To overcome this problem, we stop searching for when has the same value as some because the log-likelihood becomes infinite at this point. We proceed by taking the value of at the iteration before it becomes equal to any (we denote this value ) as the estimate for . We then update of using

 ^\boldmathα∗g=⎡⎢ ⎢⎣∑ni=1^τig(xi−\boldmathμ∗g)′∑ni=1^τigE[Wi∣xi]⎤⎥ ⎥⎦′,

and update given the update in (11). We use a different approach to overcome this problem in the deterministic annealing algorithm, simply restricting the expected value of from exceeding a value of at each iteration. We acknowledge that our solution to this problem is a simple-minded one. However, we have found it to be quite effective and a more thorough exploration of this problem in general is the subject of ongoing work.

### 3.5 Model-Based Classification

Suppose we have observations and of these observations have known group memberships. We can use the group memberships of these observations to estimate memberships for the remaining observations within a joint likelihood framework. This approach, known as model-based classification, is a semi-supervised version of model-based clustering. Without loss of generality, we order the observations so that the first  have known group memberships. Therefore, the values of are known for and the SAL model-based classification likelihood is given by

 L(x1,…,xn,\boldmathτ1,…,\boldmathτn∣\boldmathϑ)=k∏i=1G∏g=1[πgξ(xi∣\boldmathαg,\boldmathΣg,\boldmathμg)]τign∏j=k+1H∑h=1πhξ(xj∣\boldmathαh,% \boldmathΣh,\boldmathμh), (10)

where . Parameter estimation is carried out in an analogous fashion to model-based clustering.

### 3.6 Model Selection and Performance

The Bayesian information criterion (BIC; Schwarz, 1978) is commonly used for Gaussian mixture model selection:

 BIC=2l(x∣^\boldmathϑ)−νlogn,

where is the maximized log-likelihood, is the maximum likelihood estimate of , is the number of free parameters in the model, and is the number of observations. In the analyses herein, we prefer to use a model selection approach that is specifically designed for clustering and classification applications. To this end, we use the integrated classification likelihood (ICL; Biernacki et al., 2000), which is calculated using

 ICL≈BIC+n∑i=k+1G∑g=1MAP{^zig}log^zig,

where , the estimated mean entropy, reflects the uncertainty in the classification of observation into group .

To assess clustering and classification performance, we use a cross tabulation of our MAP classifications against the true group memberships. We then compute the adjusted Rand index (ARI; Hubert and Arabie, 1985). The Rand index (Rand, 1971) was introduced to compare partitions. It is the ratio of pairs that should be and are together plus pairs that should be and are apart, divided by the total number of pairs. The Rand index takes a value between 0 and 1, where 1 indicates perfect agreement. An unattractive feature of the Rand index is that it has a positive expected value under random classification. To correct this undesirable property, Hubert and Arabie (1985) introduced the ARI to account for chance agreement. The ARI also takes a value of 1 when classification is perfect but has an expected value of 0 under random classification. The ARI can also take negative values and this happens for classifications that are worse than would be expected by chance.

## 4 Data Analyses

### 4.1 Introduction

The SAL mixture models are applied to simulated data (Section 4.2) and to two real data sets: the famous Old Faithful geyser data (Section 4.3) and data on cellular localization sites for proteins in yeast (Section 4.4). The simulation study was conducted to assess the accuracy of our SAL parameter estimates, including selection of the number of components, and to draw comparison with the Gaussian mixture model approach. In the real analyses, we illustrate the SAL approach, judging its performance alongside that of Gaussian mixtures.

On the face of it, a comparison of our SAL approach to Gaussian mixtures might be considered a little empty because one would expect Gaussian mixtures to use more than one component to model a cluster with skewness. Indeed, there has been a lot of work within the literature on merging Gaussian components (e.g., Baudry et al., 2010; Hennig, 2010). However, our examples illustrate that when predicted classifications differ for the SAL and Gaussian approaches, merging Gaussian components cannot always be used to rectify shortcomings in the Gaussian results (cf. Sections 4.2 and 4.4). One may also argue that the mixture of multivariate skew- distributions could also be used to accommodate skewness. This is true, but two different representations of the mixture of skew- distributions have appeared within the literature (cf. Sahu et al., 2003; Pyne et al., 2009) and neither have the elegance of SAL mixtures; this is most apparent in the intractability of the E-step for the skew- approach (cf. Lin, 2010; Lee and McLachlan, 2011; Vrbik and McNicholas, 2012).

Note that, for all analyses, we use the same deterministic annealing starting values for the mixtures of Gaussian distributions as for our SAL mixtures. Further, we treat each analysis as a genuine clustering problem by removing all labels; we then apply SAL and Gaussian mixture models.

### 4.2 Simulation Study

We used the relationship between the SAL and Gaussian distributions (cf. Section 2.2) to generate multivariate SAL data. Specifically, we simulated 25 data sets for with dimensions and components (e.g., Figure 1). The data were generated using skewness parameters and , shift parameters and , and covariance matrices

The clustering results (Table 1) show that the SAL mixtures gave almost perfect clustering performance over all 25 runs (average ARI=). The Gaussian approach, however, gave relatively poor clustering performance (average ARI=) and never returned the correct number of clusters. The most common Gaussian solution (chosen 21 times) has five components, e.g., Figure 2, and each of the other four solutions had four components. Figure 2: One of the G=5 Gaussian solutions for the simulated data, coloured by MAP classification results.

The five-component Gaussian solution depicted in Figure 2 is typical of 21 of the cases observed in this simulation. Particularly striking here is that the components cannot be combined to give the correct cluster memberships. Instead, there is a clear ‘noise’ group comprising points that do not fit neatly within any of the other four components. Although one might argue that this problem is obvious by inspection in two dimensions, and might be resolved in some post hoc fashion, it would be virtually impossible to detect or resolve in all but very low dimensional examples.

### 4.3 Old Faithful Geyser Data

The famous Old Faithful geyser data, which are available as faithful in R (R Development Core Team, 2012), comprise a two-variable data set measuring the waiting time between eruptions and the duration of 272 eruptions of the Old Faithful geyser in Yellowstone National Park. These data are well-known as an example of skewness and they have been used many times to illustrate approaches to analyzing skewed data (e.g., Ali et al., 2010). Our mixture of SAL distributions and mixtures of Gaussian distributions were fitted to the geyser data. There are no ‘true’ classifications for these data but both approaches selected sensible groups with identical classifications. The associated contour plots (Figure 3) illustrate that the fit to the data is better for the mixture of SAL distributions. Note that applying a mixture of skew- distributions to these data will also give nicely fitting contours (cf. Vrbik and McNicholas, 2012). However, as discussed in Section 4.1, the mixture of SAL distributions is less computationally cumbersome. Figure 3: Model-based clustering results with contours for the SAL and Gaussian mixture models (GMMs) on the Old Faithful data.

### 4.4 Yeast Data

#### 4.4.1 The Data

The yeast data, which are available from the UCI machine learning repository, contain cellular localization sites of 1,484 proteins. The development of these data, as well as classification results using a ‘rule-based expert system’, are discussed by Nakai and Kanehisa (1991, 1992). For illustration, we consider three variables: McGeoch’s method for signal sequence recognition (mcg), the score of the ALOM membrane spanning region prediction program (alm), and the score of discriminant analysis of the amino acid content of vacuolar and extracellular proteins (vac). The goal of our cluster analysis is to distinguish between the two localization sites, CYT (cytosolic or cytoskeletal) and ME3 (membrane protein, no N-terminal signal), cf. Figure 4.

#### 4.4.2 Model-Based Clustering Results

The SAL and Gaussian mixture models were fitted to these data for components and the best fitting model was chosen using the ICL. The chosen SAL mixture model had two components (ICL=) and the chosen Gaussian mixture model had three components (ICL=). The classification performance of the SAL mixture model (ARI=) is superior to that of the Gaussian mixture model (ARI=), as illustrated in Table 2 and Figure 5. Again, this superiority cannot be negated by merging Gaussian components. Figure 5: Clustering results for the SAL and Gaussian mixture models on the yeast data; the colours highlight the predicted component memberships.

Now, one may argue that the Gaussian mixture model would perform better under a different model selection criterion. Therefore, we also investigated the Gaussian mixture model with (Table 3). Surprisingly, this Gaussian mixture model gives very poor clustering performance, producing classifications that are worse than would be expected by guessing (i.e., ARI=).

#### 4.4.3 Model-Based Classification Results

To compare model-based classification within the SAL and Gaussian mixture modelling frameworks, we analyze the yeast data with 70% of the group memberships taken to be known. We set and each model was fitted using 25 different random 70/30 partitions of the data. The aggregate classification results (Table 4) and ARIs ( and , respectively) indicate that the SAL mixture models outperform their Gaussian counterparts by some margin.

The poor performance of the Gaussian mixture modelling approach here is surprising because the parameter estimates are computed with 70% of the location sites taken as known. This result raises questions around the efficacy of the semi-supervised Gaussian model-based classification approach, as well as further reinforcing the need for more flexible non-Gaussian approaches. Once again, the poor performance of the Gaussian approach on these data cannot be mitigated by merging components.

## 5 Summary and Discussion

A mixture of SAL distributions model was introduced and applied for both clustering and classification. An EM algorithm was used for parameter estimation, with starting values obtained using the deterministic annealing approach of Zhou and Lange (2010). To account for both parsimony and entropy, the ICL is used to select the number of mixture components. Our model-based clustering approach was illustrated on both real and simulated data.

In the simulation, data were generated from a SAL mixture model with two components that were very well separated. The SAL mixtures gave near-perfect results on these data whereas the Gaussian mixture models consistently overestimated the number of components. Furthermore, it is particularly notable that the overestimation of the number of components by the Gaussian mixture models could not be resolved by merging components.

We also analyzed two real data sets. For the Old Faithful data, we considered only model-based clustering with the SAL and Gaussian mixture models and both gave the same predicted group memberships. However, a contour plot revealed that the SAL mixture model captured the shape of the data far more effectively than its Gaussian counterpart. The yeast protein location data presented a much more difficult clustering problem, and so we also used these data to illustrate model-based classification. For model-based clustering, the chosen SAL model gave very good clustering performance (, ARI=) and outperformed its Gaussian counterpart (, ARI=). Furthermore, when we forced components, the Gaussian mixture modelling approach gave worse clustering performance than would be expect by guessing (ARI). In the model-based classification applications, we set and considered 25 random subsets of the data for which 70% of the locations were taken as known. The SAL mixtures again gave excellent performance (ARI=0.86) but the Gaussian mixtures had negative ARIs; it is surprising that the Gaussian approach gave very poor classification performance in the semi-supervised case when 70% of the locations were taken as known. This result calls into question the efficacy of the Gaussian model-based classification approach. Again, as with the simulation study, the poor performance of Gaussian mixtures on the yeast data could not be mitigated by merging components.

A decade on from the landmark paper of Fraley and Raftery (2002), we have put forth a case for substantial departure from the Gaussian model-based clustering paradigm. Unlike the skew-normal and skew- approaches, which are perhaps less substantial departures, our approach is elegant and computationally straightforward. This paper reinforces the fact that the literature is moving away from Gaussian approaches with some alacrity but it should not be taken as being pejorative of Gaussian mixture models. Gaussian mixture models remain a highly effective method for clustering and classifying certain types of data, and will forever be the midwife that introduced mixture model-based clustering. Gaussian mixtures can, however, give misleading results and, as we have illustrated, one cannot rely on rectifying poor performance by merging components.

As mentioned in Section 3.4.2, a better solution to the issue of infinite log-likelihood values in our EM algorithm is the subject of ongoing work. Future work will focus on the introduction of parsimony into these models by decomposing the covariance structure in the same way as described by Celeux and Govaert (1995) for the Gaussian mixture model (). Analysis of very high-dimensional data with our SAL mixtures will be explored along the lines of work by McNicholas and Murphy (2008, 2010) and Baek et al. (2010), who focused on extensions of mixtures of factor analyzers. Handling noise will be considered, both via the addition of a uniform component (cf. Banfield and Raftery, 1993) and through the trimmed clustering procedures of Gallegos and Ritter (2005). Work will also be conducted on developing a model selection technique to compare clustering results from SAL mixtures with those from other mixtures, such as mixtures of skew- distributions.

## Acknowledgements

This work was supported by an Ontario Graduate Scholarship, the University Research Chair in Computational Statistics, and an Early Researcher Award from the Ontario Ministry of Research and Innovation.

## References

• Aitken (1926) Aitken, A. C. (1926). On Bernoulli’s numerical solution of algebraic equations. Proceedings of the Royal Society of Edinburgh 46, 289–305.
• Ali et al. (2010) Ali, M. M., J. Woo, and S. Nadarajah (2010). Some skew symmetric inverse reflected distributions. Brazilian Journal of Probability and Statistics 24(1), 1–23.
• Andrews and McNicholas (2011) Andrews, J. L. and P. D. McNicholas (2011). Mixtures of modified t-factor analyzers for model-based clustering, classification, and discriminant analysis. Journal of Statistical Planning and Inference 141(4), 1479–1486.
• Baek and McLachlan (2011) Baek, J. and G. J. McLachlan (2011). Mixtures of common t-factor analyzers for clustering high-dimensional microarray data. Bioinformatics 27, 1269–1276.
• Baek et al. (2010) Baek, J., G. J. McLachlan, and L. K. Flack (2010). Mixtures of factor analyzers with common factor loadings: Applications to the clustering and visualization of high-dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence 32, 1298–1309.
• Banfield and Raftery (1993) Banfield, J. D. and A. E. Raftery (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics 49(3), 803–821.
• Barndorff-Nielsen and Halgreen (1977) Barndorff-Nielsen, O. and C. Halgreen (1977). Infinite divisibility of the hyperbolic and generalized inverse Gaussian distributions. Z. Wahrscheinlichkeitstheorie Verw. Gebiete 38, 309–311.
• Barndorff-Nielsen (1997) Barndorff-Nielsen, O. E. (1997). Normal inverse Gaussian distributions and stochastic volatility modelling. Scandinavian Journal of Statistics 24(1), 1–13.
• Baudry et al. (2010) Baudry, J.-P., Raftery, G. A. E., Celeux, K. Lo, and R. Gottardo (2010). Combining mixture components for clustering. Journal of Computational and Graphical Statistics 19, 332–353.
• Biernacki et al. (2000) Biernacki, C., G. Celeux, and G. Govaert (2000). Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(7), 719–725.
• Blæsild (1978) Blæsild, P. (1978). The shape of the generalized inverse Gaussian and hyperbolic distributions. Research Report 37, Department of Theoretical Statistics, Aarhus University, Denmark.
• Böhning et al. (1994) Böhning, D., E. Dietz, R. Schaub, P. Schlattmann, and B. Lindsay (1994). The distribution of the likelihood ratio for mixtures of densities from the one-parameter exponential family. Annals of the Institute of Statistical Mathematics 46, 373–388.
• Browne et al. (2012) Browne, R. P., P. D. McNicholas, and M. D. Sparling (2012). Model-based learning using a mixture of mixtures of Gaussian and uniform distributions. IEEE Transactions on Pattern Analysis and Machine Intelligence 34(4), 814–817.
• Celeux and Govaert (1995) Celeux, G. and G. Govaert (1995). Gaussian parsimonious clustering models. Pattern Recognition 28, 781–793.
• Dempster et al. (1977) Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B 39(1), 1–38.
• Eltoft et al. (2006) Eltoft, T., T. Kim, and T. Lee (2006). On the multivariate Laplace distribution. IEEE Signal Processing Letters 13(5), 300–303.
• Fraley and Raftery (1998) Fraley, C. and A. E. Raftery (1998). How many clusters? Which clustering methods? Answers via model-based cluster analysis. The Computer Journal 41(8), 578–588.
• Fraley and Raftery (2002) Fraley, C. and A. E. Raftery (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association 97(458), 611–631.
• Gallegos and Ritter (2005) Gallegos, M. T. and G. Ritter (2005). A robust method for cluster analysis. Annals of Statistics 33, 347–380.
• Ghahramani and Hinton (1997) Ghahramani, Z. and G. E. Hinton (1997). The EM algorithm for factor analyzers. Technical Report CRG-TR-96-1, University of Toronto, Toronto.
• Good (1953) Good, I. J. (1953). The population frequencies of species and the estimation of population parameters. Biometrika 40, 237–260.
• Halgreen (1979) Halgreen, C. (1979). Self-decomposibility of the generalized inverse Gaussian and hyperbolic distributions. Z. Wahrscheinlichkeitstheorie Verw. Gebiete 47, 13–18.
• Hennig (2010) Hennig, C. (2010). Methods for merging Gaussian mixture components. Advances in Data Analysis and Classification 4, 3–34.
• Hubert and Arabie (1985) Hubert, L. and P. Arabie (1985). Comparing partitions. Journal of Classification 2, 193–218.
• Jørgensen (1982) Jørgensen, B. (1982). Statistical Properties of the Generalized Inverse Gaussian Distribution. New York: Springer-Verlag.
• Karlis and Meligkotsidou (2007) Karlis, D. and L. Meligkotsidou (2007). Finite mixtures of multivariate Poisson distributions with application. Journal of Statistical Planning and Inference 137(6), 1942–1960.
• Karlis and Santourian (2009) Karlis, D. and A. Santourian (2009). Model-based clustering with non-elliptically contoured distributions. Statistics and Computing 19(1), 73–83.
• Kotz et al. (2001) Kotz, S., T. J. Kozubowski, and K. Podgorski (2001). The Laplace Distribution and Generalizations: A Revisit with Applications to Communications, Economics, Engineering, and Finance (1st ed.). Burkhauser Boston.
• Lee and McLachlan (2011) Lee, S. and G. J. McLachlan (2011). On the fitting of mixtures of multivariate skew -distributions via the EM algorithm. arXiv:1109.4706.
• Lin (2009) Lin, T.-I. (2009). Maximum likelihood estimation for multivariate skew normal mixture models. Journal of Multivariate Analysis 100, 257–265.
• Lin (2010) Lin, T.-I. (2010). Robust mixture modeling using multivariate skew t distributions. Statistics and Computing 20, 343–356.
• Lindsay (1995) Lindsay, B. G. (1995). Mixture models: Theory, geometry and applications. In NSF-CBMS Regional Conference Series in Probability and Statistics, Volume 5. California: Institute of Mathematical Statistics: Hayward.
• Maugis et al. (2009) Maugis, C., G. Celeux, and M.-L. Martin-Magniette (2009). Variable selection for clustering with Gaussian mixture models. Biometrics 65(3), 701–709.
• McLachlan (1992) McLachlan, G. J. (1992). Discriminant Analysis and Statistical Pattern Recognition. New Jersey: John Wiley & Sons.
• McLachlan et al. (2007) McLachlan, G. J., R. W. Bean, and L. B.-T. Jones (2007). Extension of the mixture of factor analyzers model to incorporate the multivariate t-distribution. Computational Statistics and Data Analysis 51(11), 5327–5338.
• McLachlan and Peel (1998) McLachlan, G. J. and D. Peel (1998). Robust cluster analysis via mixtures of multivariate t-distributions. In Lecture Notes in Computer Science, Volume 1451, pp. 658–666. Berlin: Springer-Verlag.
• McLachlan and Peel (2000a) McLachlan, G. J. and D. Peel (2000a). Finite Mixture Models. New York: John Wiley & Sons.
• McLachlan and Peel (2000b) McLachlan, G. J. and D. Peel (2000b). Mixtures of factor analyzers. In Proceedings of the Seventh International Conference on Machine Learning, San Francisco, pp. 599–606. Morgan Kaufmann.
• McNicholas (2010) McNicholas, P. D. (2010). Model-based classification using latent Gaussian mixture models. Journal of Statistical Planning and Inference 140(5), 1175–1181.
• McNicholas and Murphy (2008) McNicholas, P. D. and T. B. Murphy (2008). Parsimonious Gaussian mixture models. Statistics and Computing 18, 285–296.
• McNicholas and Murphy (2010) McNicholas, P. D. and T. B. Murphy (2010). Model-based clustering of microarray expression data via latent Gaussian mixture models. Bioinformatics 26(21), 2705–2712.
• Nakai and Kanehisa (1991) Nakai, K. and M. Kanehisa (1991). Expert system for predicting protein localization sites in gram-negative bacteria. Protiens 11(2), 95–110.
• Nakai and Kanehisa (1992) Nakai, K. and M. Kanehisa (1992). A knowledge base for predicting protein localization sites in eukaryotic cells. Genomics 14(4), 897–911.
• Peel and McLachlan (2000) Peel, D. and G. J. McLachlan (2000). Robust mixture modelling using the t distribution. Statistics and Computing 10(4), 339–348.
• Pyne et al. (2009) Pyne, S., X. Hu, K. Wang, E. Rossin, T.-I. Lin, L. M. Maier, C. Baecher-Allan, G. J. McLachlan, P. Tamayo, D. A. Hafler, P. L. D. Jager, and J. P. Mesirow (2009). Automated high-dimensional flow cytometric data analysis. Proceedings of the National Academy of Sciences 106, 8519–8524.
• R Development Core Team (2012) R Development Core Team (2012). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.
• Raftery and Dean (2006) Raftery, A. E. and N. Dean (2006). Variable selection for model-based clustering. Journal of the American Statistical Association 101(473), 168–178.
• Rand (1971) Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66, 846–850.
• Sahu et al. (2003) Sahu, S. K., D. K. Dey, and M. D. Branco (2003). A new class of multivariate skew distributions with application to Bayesian regression models. Canadian Journal of Statistics 31, 129–150.
• Schwarz (1978) Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics 6(2), 461–464.
• Scrucca (2010) Scrucca, L. (2010). Dimension reduction for model-based clustering. Statistics and Computing 20(4), 471–484.
• Tipping and Bishop (1999) Tipping, T. E. and C. M. Bishop (1999). Mixtures of probabilistic principal component analysers. Neural Computation 11(2), 443–482.
• Vrbik and McNicholas (2012) Vrbik, I. and P. D. McNicholas (2012). Analytic calculations for the EM algorithm for multivariate skew-t mixture models. Statistics and Probability Letters 82(6), 1169–1174.
• Zhou and Lange (2010) Zhou, H. and K. L. Lange (2010). On the bumpy road to the dominant mode. Scandinavian Journal of Statistics 37(4), 612–631.
You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters   