# Unsupervised Learning via Mixtures of Skewed Distributions with Hypercube Contours

## Abstract

Mixture models whose components have skewed hypercube contours are developed via a generalization of the multivariate shifted asymmetric Laplace density. Specifically, we develop mixtures of multiple scaled shifted asymmetric Laplace distributions. The component densities have two unique features: they include a multivariate weight function, and the marginal distributions are also asymmetric Laplace. We use these mixtures of multiple scaled shifted asymmetric Laplace distributions for clustering applications, but they could equally well be used in the supervised or semi-supervised paradigms. The expectation-maximization algorithm is used for parameter estimation and the Bayesian information criterion is used for model selection. Simulated and real data sets are used to illustrate the approach and, in some cases, to visualize the skewed hypercube structure of the components.

Keywords. Finite Mixture Models; Shifted Asymmetric Laplace; EM Algorithm; Skewed Distribution; Multiple Scaled Distributions.

## 1 Introduction

Unsupervised learning, or cluster analysis, can be lucidly defined as the process of sorting like-objects into groups. Finite mixture models are a convex combination of probability densities; accordingly they are natural choice for performing cluster analysis. The general finite mixture model has density

where , such that , are the mixing proportions and are the component densities. To date, multivariate Gaussian component densities have been the focal point in the development of finite mixtures for clustering (e.g., Fraley and Raftery, 2002; Banfield and Raftery, 1993; Celeux and Govaert, 1995; Ghahramani and Hinton, 1997). Their popularity can be attributed to their mathematical tractability and they continue to be prominent in clustering applications (e.g., Maugis et al., 2009; Scrucca, 2010; Punzo and McNicholas, 2013).

Around the beginning of the st century work using mixtures of multivariate- distributions began to surface (Peel and McLachlan, 2000), and in the last 5 years has flourished (e.g., Greselin and Ingrassia, 2010; Andrews and McNicholas, 2011; Baek and McLachlan, 2011). One limitation of the multivariate- distribution is that the degrees of freedom parameter is constant across dimensions. Forbes and Wraith (2014) exploited the fact that a random variable from a multivariate- distribution is a normal variance-mean mixture to give a generalized multivariate- distribution, where the degrees of freedom parameter can be uniquely estimated in each dimension of the parameter space. Herein, we discuss the development and application of a mixture of multiple scaled shifted asymmetric Laplace (MSSAL) distributions which, unlike the multivariate- generalizations, have the ability to parameterize skewness in addition to location and scale. Furthermore, the level sets of our MSSAL density are guaranteed to be convex, making mixtures thereof ideal for clustering applications.

## 2 Mixtures of Multiple Scaled Shifted Asymmetric Laplace distributions

### 2.1 Shifted Asymmetric Laplace

Kotz et al. (2001) show that a random vector arising from a multivariate asymmetric Laplace distribution can be generated through the relationship , where is a -dimensional random vector from a multivariate Gaussian distribution with mean and covariance matrix , and , independent of , is a random variable following an exponential distribution with rate 1. To facilitate cluster analysis, Franczak et al. (2014) introduce a -dimensional shift parameter and consider a random vector . It follows that will have the stochastic representation

(1) |

where and are as previously defined. Herein, we follow Franczak et al. (2014) and use the notation to mean that the random vector is distributed multivariate shifted asymmetric Laplace (SAL) with skewness parameter , scale matrix , and location parameter .

We can see from (1) that the random vector is multivariate Gaussian with mean and scale matrix . Therefore, is a normal variance-mean mixture with density

(2) |

where is the multivariate Gaussian density with mean and covariance matrix , and . Formally, the random vector has density

(3) |

where , , is the squared Mahalanobis distance between and , is the modified Bessel function of the third kind with index , and , , and are previously defined. In the one-dimensional case, (3) reduces to

(4) |

where is a location parameter, , is a skewness parameter and is a scale parameter (cf. Kotz et al., 2001).

### 2.2 Multiple Scaled Distributions

Forbes and Wraith (2014) show that the density of a random variable arising from a normal variance-mean mixture can be written

(5) |

where is the multivariate Gaussian density with mean and covariance matrix , is a matrix of eigenvectors, is a diagonal matrix of eigenvalues, is a diagonal weight matrix where are independent, and is a -variate density function.

Notably,

(6) | ||||

(7) |

and, therefore,

(8) |

where is the th element of , is the univariate Gaussian density with mean 0 and variance , is the density of an univariate random variable , and is the th eigenvalue of the matrix .

Amalgamating (2) and (5) gives an expression for the density of a generalized multivariate SAL distribution. Explicitly, this density is given by

(9) |

where is the multivariate Gaussian density with mean and scale matrix , and is the density of a -variate exponential distribution with for .

To simplify the derivation of our parameter estimates we set , where and . Given this parameterization it follows from (6) that (9) can be written

where is the th element of , is the univariate Gaussian density with mean and covariance matrix , and , for .

Thus, the density of MSSAL distribution is given by

(10) |

where and , , , , and are previously defined. It follows that the density of a mixture of MSSAL distributions is

where are the mixing proportions and is the density of the MSSAL distribution given in (10) with , component eigenvector matrix , component eigenvalue matrix , and component location parameter .

## 3 Parameter Estimation

The EM algorithm was formulated in the seminal paper of Dempster et al. (1977) and is commonly used to estimate the parameters of mixture models. On each iteration of the EM algorithm two steps are preformed: an expectation (E-) step, and a maximization (M-) step. On each E-step the expected value of the complete-data log-likelihood, denoted , is calculated based on the current parameter values and on each M-step is maximized with respect to the model parameters. Note that the complete-data refers to the combination of the unobserved (i.e., the missing or latent data) and the observed data.

For our mixtures of MSSAL distributions, the complete-data are composed of the observed data , the latent , and the component indicators . Note that for each and , comprise the diagonal elements of the multidimensional weight variable , i.e.,

for and . Furthermore, for each , such that if observation is in group and otherwise, for .

Using the conditional distributions given in Franczak et al. (2014), it follows that

where for , and with and , where denotes the th element of and and are defined for (8). Note that represents the exponential distribution with rate and denotes the generalized inverse Gaussian distribution with parameters and . The statistical properties of the GIG distribution are well established and thoroughly reviewed in Jørgensen (1982). For our purposes, the most useful properties of the GIG distribution are the tractability of its expected values

where .

Using the conditional distributions given above we can now formulate the complete-data log-likelihood for the MSSAL mixtures. Formally, the complete-data log-likelihood is given by

(11) |

where are the mixing proportions, is the density of the multivariate Gaussian distribution with mean and covariance , and and are previously defined.

### 3.1 E-step

For the MSSAL mixtures the expected value of the complete-data log-likelihood on the th iteration is

where , , and , , and are, respectively, the expected values of the sufficient statistics of the component indicators and latent variables. For ease of notation, let . To compute the value of on iteration , we calculate:

(12) |

and let the off-diagonal elements of and be equal to zero and their diagonal elements be equal to

respectively, where , , for , and , , , and are the values of the model parameters on iteration .

### 3.2 M-step

On the -step of the th iteration the update for is, in the usual way, given by . The updates for and are given by

(13) |

respectively. To obtain an estimate of we employ an iterative optimization procedure. Specifically, the goal is to minimize the function

(14) |

with respect to , where

and all expected values and model parameters are previously defined. Typically, the Fleury-Gautschi (FG) algorithm is employed (Celeux and Govaert, 1995; Forbes and Wraith, 2014) but, as noted in Lefkovitch (1993); Boik (2002); Bouveyron et al. (2007) and Browne and McNicholas (2014b), the FG algorithm slows down considerably and becomes computationally expensive as the dimension of the data increases. To circumvent this issue we exploit the convexity of the objective function and construct computationally simpler majorization-minimization (MM) algorithms (see Hunter and Lange, 2000, 2004, for examples). Specifically, we follow Kiers (2002) and Browne and McNicholas (2014a) and preform the following procedure to minimize (14). Note that our MM algorithms use the surrogate function

where is a constant that does not depend on and the matrices , for , are explicitly defined below.

On iteration compute

given the current parameter estimates and expected values, where and are the largest eigenvalues of the matrices and , respectively. Then calculate the elements of the singular value decomposition of , i.e., set and find , , and where and are orthonormal and is a diagonal matrix containing the singular values of . It follows that the initial estimate of on iteration of this optimization procedure is given by . Given this estimate, denoted , compute

where and are, respectively, the largest eigenvalues of and . Then set to obtain , the final estimate of on iteration , where and are also orthonormal.

We repeat the calculations for , , and until the difference in (14) over consecutive iterations is small. At convergence, we take the final estimate of on iteration to be the estimate of .

We maximize with respect to the diagonal matrix via

(15) |

where , is the th element of the matrix , is the th element of the matrix , and all off-diagonal elements of are equal to zero.

This EM algorithm is considered to have converged when the difference between an asymptotic estimate of the log-likelihood, , and the log-likelihood value on iteration , , is less than some small value (Aitken, 1926; Böhning et al., 1994; Lindsay, 1995). At convergence we use the final estimates of the to obtain the maximum a posteriori (MAP) classification values. Specifically, if occurs in component , and otherwise.

## 4 Applications

### 4.1 Model Selection and Performance Assessment

The Bayesian information criterion (BIC; Schwarz, 1978) is used to select the best fitting MSSAL mixture. The BIC was derived via a Laplace approximation and its precision is influenced by the specific form of the model parameters prior densities and by the correlation structure between observations. The BIC is given by , where is the maximized log-likelihood, is the maximum likelihood estimate of , is the number of free parameters in the model, and is the number of observations. The BIC is commonly used for Gaussian mixture model selection and has some useful asymptotic properties, for example, as the BIC is shown to consistently choose the correct model (see Leroux et al., 1992; Dasgupta and Raftery, 1998, for example).

The Rand index (Rand, 1971) compares partitions based on pairwise agreements. It takes a value between 0 and 1, where 1 indicates perfect agreement. An unattractive feature of the Rand index is that it has a positive expected value under random classification. To correct this, Hubert and Arabie (1985) introduced the adjusted Rand index (ARI) to account for chance agreement. The ARI also takes a value of 1 when classification agreement is perfect but has an expected value of 0 under random classification. The ARI can also take negative values and this happens for classifications that are worse than would be expected by chance. Steinley (2004) gives general properties of the ARI and provides evidence supporting its use for assessing classification performance.

### 4.2 Illustrative Example: Leptograpsus Crabs

Campbell and Mahon (1974) give data on 200 crabs of the species Leptograpsus variegatus collected at Fremantle, Western Australia. The data are available in the R (R Core Team, 2014) package MASS (Venables and Ripley, 2002) and contain five morphological measurements. Not surprisingly, the variables in the crabs data are highly correlated with one another and analysis of the covariance matrix using principal components reveals two clusters corresponding to the gender of the crabs (see Panel 1 of Figure 1).

For strictly illustrative purposes, we remove all group labels and fitted component MSSAL mixtures to the first and third principal components of the Leptograpsus crabs data set. Note that for this and each application herein, our MSSAL mixtures are initialized using 50 random starting values.

The BIC (-771.3386) selects a component model where the MAP classifications correspond perfectly to gender (i.e., male or female; ). The contour plot for this model (Figure 2) shows the unique skewed hypercube shapes of this multivariate generalization. Panel 2 of Figure 1 shows the path of the log-likelihood values obtained for the best fitting model on 56 iterations until convergence.

### 4.3 Computational Cost

In this application we evaluate the speed of our EM algorithm. Specifically, we measure how long it takes to complete one and one hundred iterations of the proposed parameter estimation scheme using one-component, two-component and three-component MSSAL mixtures. We fitted each mixture to subsets of the 27-variable wine data set, available in the R package pgmm (McNicholas et al., 2014). In total there were 5 subsets with variables, respectively. Panel 1 of Figure 3 shows the average elapsed times, in seconds, for the one-component (blue), two-component (green) and three-component (red) MSSAL mixtures to complete one hundred EM iterations. Panel 2 shows the average elapsed time, in seconds, for each MSSAL mixture to complete one EM iteration.

As expected, the elapsed system time increases with the number of dimensions. Notably, the MSSAL mixtures appear to scale well with dimension as it takes, on average, 61 seconds for our EM algorithm to complete 100 iterations when , 118 seconds when , and 174 seconds when .

### 4.4 Simulation Study: Classification Performance

We use a simulation study to evaluate the classification ability of our MSSAL mixtures. Specifically, we investigate how the mixtures of MSSAL distributions handle symmetric data, skewed Gaussian data, and data generated from a mixture of MSSAL distributions.

In total, we generate 75 bivariate data sets: 25 from a two-component Gaussian distribution (Scenario I), 25 from a two-component skew-normal distribution (Scenario II) and 25 from a two-component MSSAL distribution (Scenario III). Row 1 of Figure S.1 (Supplementary Material) displays the typical shapes from each scenario. We expect good classification performances from each of the models as the data have very little overlap.

Table 1 gives the average ARI values, with standard deviations in parenthesis, obtained for the best fitting MSSAL mixtures. We compare these values to the average ARI values obtained from the best fitting mixtures of multivariate SAL, multivariate restricted skew-normal (MSN) and skew- (MST) and multivariate Gaussian distributions, as chosen by the BIC. Note: the Gaussian mixture models (GMM) were fitted using the R package mixture (Browne and McNicholas, 2013) and the skew-normal and skew- mixtures were fitted using the EMMIXskew package (Wang et al., 2014). For this, and the subsequent real data application, all approaches are initialized using 50 random starting values and we remove all group labels. For each scenario the mixtures were fitted for components.

Scenario I | Scenario II | Scenario III | |
---|---|---|---|

MSSAL | () | () | 0.995 () |

SAL | () | () | () |

MST | () | () | () |

MSN | () | 0.995 () | () |

GMM | () | () | () |

The chosen mixtures of MSSAL distributions give excellent classification results in all three scenarios. For the MSSAL mixtures the BIC chooses the correct number of components of the time in Scenarios I and III and of the time in Scenario II. In Scenario III, the chosen MSSAL mixtures outperform the other mixtures by a substantial margin. With the exception of the mixtures of SAL distributions (where the BIC chooses components for 24/25 data sets) the MST, MSN and GMM mixtures return average ARI values that are essentially no better than random classification. In Row 2 of Figure S.1, the 2nd and 3rd panels show the typical fits of the most popular Gaussian and skew-normal mixtures. It is clear that merging components (see Baudry et al., 2010; Hennig, 2010) would not be able to rectify the Gaussian solution.

Interestingly, the mixtures of multivariate Gaussian distributions gave very good performance on the skew-normal data without the benefit of component merging. The BIC selected component mixtures for of the simulated skew-normal data sets with the other chosen mixtures having three-components. Panel 1 in row 2 of Figure 4 displays the typical three-component Gaussian solution for the skew-normal data. It is clear this solution would benefit from merging to give a two component mixture with .

Interestingly, the mixtures of multivariate Gaussian distributions fitted the skew-normal data quite well. Specifically, the BIC selects a two-component mixture for of the simulated skew-normal data sets. For the other skew-normal data sets the BIC chooses three-component mixtures. Panel 1 in row 2 of Figure S.1 displays the typical three-component Gaussian solution for the skew-normal data. Not surprisingly, this solution could be merged to give a two component mixture with .

### 4.5 Swiss Banknotes

The Swiss banknotes data (Flury and Fiedwyl, 1988) are available in the R packages alr3 and gclus. In total there are six physical measurements for 100 counterfeit and 100 genuine banknotes. Our goal is to differentiate between each type of banknote. We fitted the mixtures considered in Section 4.4 for components using the initialization procedure previously described. Table 2 summarizes the performance of the best fitting mixtures.

Model | MSSAL | SAL | MST | MSN | GMM |
---|---|---|---|---|---|

Components | |||||

ARI | 0.98 | ||||

BIC |

MSSAL | MSN | GMM | |||||
---|---|---|---|---|---|---|---|

A | B | A | B | A | B | C | |

Genuine | |||||||

Counterfeit |

The results show that the best fitting MSSAL mixture does an excellent job at discriminating between the counterfeit and genuine banknotes, misclassifying only one genuine banknote. On the other hand, both the multivariate skew-normal and skew- mixtures identified the correct number of groups but misclassified 17 banknotes, the chosen multivariate SAL mixture fail to identify any group structure in this data, and the best fitting GMM uses an extra component. Interestingly, merging Gaussian components would not benefit this solution. Table 3 gives the classifications of best fitting MSSAL, MSN (which is identical to the MST) and Gaussian mixtures.

## 5 Summary

A mixture of MSSAL distributions was introduced and gives mixtures with components whose contours are skewed hypercubes. Crucially, the level sets of our MSSAL density are guaranteed to be convex, making mixtures thereof ideal for unsupervised learning applications. Specifically, the MSSAL distribution is guaranteed to have convex level sets, similar to elliptical distributions like the Gaussian, because it has the same concentration in each direction from the mode. In contrast, the contours for the multiple scaled multivariate -distribution will have levels that are not convex; therefore, situations will arise where one component is used to model two clusters, e.g., X-shaped components. In addition to the distinct advantage of having convex level sets, our MSSAL distribution has great modelling flexibility. For example, consider the contour plot for the crabs data (Figure 2) — the diamond-like shapes illustrated are a far cry from any of the spherical or tear-drop like densities commonly displayed in the mainstream non-elliptical clustering literature. In addition to its suitability for unsupervised learning, our MSSAL mixtures should also perform well for supervised and semi-supervised learning, and this will be a focus of future work. Another subject of future work will be exploring evolutionary computation as an alternative to the EM algorithm for parameter estimation, cf. Andrews and McNicholas (2013) for related work on Gaussian mixtures.

## Acknowledgements

This work was supported by an Ontario Graduate Scholarship (Franczak), the University Research Chair in Computational Statistics (McNicholas), and an Early Researcher Award from the Ontario Ministry of Research and Innovation (McNicholas).

This work is currently under consideration at Pattern Recognition Letters

### References

- Aitken, A. C. (1926). On Bernoulli’s numerical solution of algebraic equations. Proceedings of the Royal Society of Edinburgh 46, 289–305.
- Andrews, J. L. and P. D. McNicholas (2011). Extending mixtures of multivariate -factor analyzers. Statistics and Computing 21(3), 361–373.
- Andrews, J. L. and P. D. McNicholas (2013). Using evolutionary algorithms for model-based clustering. Pattern Recognition Letters 34(9), 987–992.
- Baek, J. and G. J. McLachlan (2011). Mixtures of common t-factor analyzers for clustering high-dimensional microarray data. Bioinformatics 27, 1269–1276.
- Banfield, J. D. and A. E. Raftery (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics 49(3), 803–821.
- Baudry, J.-P., Raftery, G. A. E., Celeux, K. Lo, and R. Gottardo (2010). Combining mixture components for clustering. Journal of Computational and Graphical Statistics 19, 332–353.
- Böhning, D., E. Dietz, R. Schaub, P. Schlattmann, and B. G. Lindsay (1994). The distribution of the likelihood ratio for mixtures of densities from the one-parameter exponential family. Annals of the Institute of Statistical Mathematics 46(2), 373–388.
- Boik, R. J. (2002). Spectral models for covariance matrices. Biometrika 89(1), 159–182.
- Bouveyron, C., S. Girard, and C. Schmid (2007). High-dimensional data clustering. Computational Statistics & Data Analysis 52(1), 502–519.
- Browne, R. P. and P. D. McNicholas (2013). mixture: Mixture Models for Clustering and Classification. R package version 1.0.
- Browne, R. P. and P. D. McNicholas (2014a). Estimating common principal components in high dimensions. Advances in Data Analysis and Classification 8(2), 217–226. To appear.
- Browne, R. P. and P. D. McNicholas (2014b). Orthogonal Stiefel manifold optimization for eigen-decomposed covariance parameter estimation in mixture models. Statistics and Computing 24(2), 203–210.
- Campbell, N. A. and R. J. Mahon (1974). A multivariate study of variation in two species of rock crab of genus leptograpsus. Australian Journal of Zoology 22, 417–425.
- Celeux, G. and G. Govaert (1995). Gaussian parsimonious clustering models. Pattern Recognition 28, 781–793.
- Dasgupta, A. and A. E. Raftery (1998). Detecting features in spatial point processes with clutter via model-based clustering. Journal of the American Statistical Association 93, 294–302.
- Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B 39(1), 1–38.
- Flury, B. and H. Fiedwyl (1988). Multivariate Statistics: A Practical Approach. London: Chapman and Hall.
- Forbes, F. and D. Wraith (2014). A new family of multivariate heavy-tailed distributions with variable marginal amounts of tailweight: application to robust clustering. Statistics and Computing 24, 971–984.
- Fraley, C. and A. E. Raftery (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association 97(458), 611–631.
- Franczak, B. C., R. P. Browne, and P. D. McNicholas (2014). Mixtures of shifted asymmetric Laplace distributions. IEEE Transactions in Pattern Analysis and Machine Intelligence 36(6), 1149–1157.
- Ghahramani, Z. and G. E. Hinton (1997). The EM algorithm for factor analyzers. Technical Report CRG-TR-96-1, University of Toronto, Toronto.
- Greselin, F. and S. Ingrassia (2010). Constrained monotone EM algorithms for mixtures of multivariate -distributions. Statistics and Computing 20(1), 9–22.
- Hennig, C. (2010). Methods for merging Gaussian mixture components. Advances in Data Analysis and Classification 4, 3–34.
- Hubert, L. and P. Arabie (1985). Comparing partitions. Journal of Classification 2, 193–218.
- Hunter, D. L. and K. Lange (2004). A tutorial on MM algorithms. The American Statistician 58(1), 30–37.
- Hunter, D. R. and K. Lange (2000). Quantile regression via an mm algorithm. Journal of Computational and Graphical Statistics 9(1), 60–77.
- Jørgensen, B. (1982). Statistical Properties of the Generalized Inverse Gaussian Distribution. New York: Springer-Verlag.
- Kiers, H. A. (2002). Setting up alternating least squares and iterative majorization algorithms for solving various matrix optimization problems. Computational Statistics and Data Analysis 41(1), 157–170.
- Kotz, S., T. J. Kozubowski, and K. Podgorski (2001). The Laplace Distribution and Generalizations: A Revisit with Applications to Communications, Economics, Engineering, and Finance (1st ed.). Boston: Burkhauser.
- Lefkovitch, L. P. (1993). Consensus principal components. Biometrical Journal 35(5), 567–580.
- Leroux, B. G. et al. (1992). Consistent estimation of a mixing distribution. The Annals of Statistics 20(3), 1350–1360.
- Lindsay, B. G. (1995). Mixture Models: Theory, Geometry and Applications, Volume 5. California: Institute of Mathematical Statistics: Hayward.
- Maugis, C., G. Celeux, and M.-L. Martin-Magniette (2009). Variable selection for clustering with Gaussian mixture models. Biometrics 65(3), 701–709.
- McNicholas, P. D., K. R. Jampani, A. F. McDaid, T. B. Murphy, and L. Banks (2014). pgmm: Parsimonious Gaussian Mixture Models. R package version 1.1.
- Peel, D. and G. J. McLachlan (2000). Robust mixture modelling using the t distribution. Statistics and Computing 10(4), 339–348.
- Punzo, A. and P. D. McNicholas (2013). Outlier detection via parsimonious mixtures of contaminated Gaussian distributions. arXiv:1305.4669.
- R Core Team (2014). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.
- Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66, 846–850.
- Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics 6(2), 461–464.
- Scrucca, L. (2010). Dimension reduction for model-based clustering. Statistics and Computing 20(4), 471–484.
- Steinley, D. (2004). Properties of the Hubert-Arable adjusted Rand index. Psychological methods 9(3), 386.
- Venables, W. N. and B. D. Ripley (2002). Modern Applied Statistics with S (Fourth ed.). New York: Springer.
- Wang, K., A. Ng, and G. McLachlan (2014). EMMIXskew: The EM Algorithm and Skew Mixture Distribution. R package version 1.0.1.