A universal framework for learning based on the elliptical mixture model (EMM)
Abstract
The increasing prominence of unbalanced and noisy data highlights the importance of elliptical mixture models (EMMs), which exhibit enhanced robustness, flexibility and stability over the widely applied Gaussian mixture model (GMM). However, existing studies of the EMM are typically of an ad hoc nature, without a universal analysis framework or existence and uniqueness considerations. To this end, we propose a general framework for estimating the EMM, which makes use of Riemannian manifold optimisation to convert the original constrained optimisation problem into an unconstrained one. We first revisit the statistics of elliptical distributions, to give a rationale for the use of Riemannian metrics as well as the reformulation of the problem in the Riemannian space. We then derive the EMM learning framework, based on Riemannian gradient descent, which ensures the same optimum as the original problem but accelerates convergence. We also unify the treatment of the existing elliptical distributions to build a universal EMM, providing a simple and intuitive way to deal with the nonconvex nature of this optimisation problem. Numerical results demonstrate the ability of the proposed framework to accommodate EMMs with different properties of individual functions, and also verify the robustness and flexibility of the proposed framework over the standard GMM.
Shengxi Li, Department of Electrical and Electronic Engineering, Imperial College London, London, SW7 2AZ, shengxi.li17@imperial.ac.uk
Danilo P. Mandic, Department of Electrical and Electronic Engineering, Imperial College London, London, SW7 2AZ, d.mandic@imperial.ac.uk
Zeyang Yu, Department of Electrical and Electronic Engineering, Imperial College London, London, SW7 2AZ, z.yu17@imperial.ac.uk
Preprint. Work in progress.
1 Introduction
Finite mixture models have a prominent role in statistical machine learning, as they provide enhanced probabilistic awareness in many learning paradigms, including clustering, feature extraction and density estimation [1]. This is achieved in a very intuitive and elegant way, through a linear combination of well understood distributions, which is powerful enough to approximate arbitrarily complex distributions [2]. The Gaussian mixture model (GMM) is the most widely used such model, whose popularity stems from its simple formulation and the conjugate property of the Gaussian distribution. Despite its mathematical elegance, the standard Gaussian-based mixture model estimator is subject to robustness issues, and even a slight deviation from the Gaussian assumption or a single outlier can significantly degrade the performance or even break down the estimator [3]. Alternative mixture models are therefore rapidly being sought for robust learning.
Another rapidly emerging issue in modern applications is the requirement for flexibility in mixture models. This is due to an exponential emergence of multifaceted data which are almost invariably unbalanced; sources of such imbalance may be vastly different natures of the data channels involved, different powers in the constituent channels, or temporal misalignment [4]. Another less obvious but equally important obstacle to the use of current mixture models is that of the different scales of information within multivariate data; for example, in biomedical recordings, respiration and heart beats occupy totally different scales, of Hz and 13 Hz respectively, but their harmonics overlap spectrally.
An important class of multivariate analysis techniques is based on elliptical distributions, which are quite general and flexible and include as special cases a range of standard distributions, such as the Gaussian distribution, the exponential family and the t-distribution [5]. A desirable property of elliptical distributions is their robustness; indeed, their use results in robust M-estimators [6], thus making them a natural candidate for robust mixture modelling. In addition to the robustness, it is reported that members of the elliptical mixture model (EMM) class can also effectively mitigate the singular covariance problem experienced in the GMM [7]. Moreover, EMMs are more flexible in capturing intrinsic data structures than the GMM, as the EMM can even use different types of distributions in a single mixture. By virtue of their robustness and flexibility, EMMs are therefore well suited to dealing with data acquired from imperfect sensors, a typical case in modern applications.
Existing mixture models related to elliptical distributions are most frequently based on the t-distribution [7; 8; 9], the Laplace distribution [10], or the hyperbolic distribution [11]. Table 1 summarises the existing results which adopt elliptical distributions that belong to the class of scale mixtures of normals [12], where the expectation-maximisation (EM) process, employed in model tuning, is guaranteed to converge. Despite all their desirable properties, a general estimation method for fitting arbitrary elliptical distributions is still lacking.
The development of a general method for estimating the EMM, however, is nontrivial, owing to both theoretical and practical difficulties; for example, unlike in the GMM, there is no closed-form solution for the maximisation step within EMM learning. Specifically, the convergence of the iterative reweighting algorithm, the de facto standard in estimation of elliptical distributions, requires constraints on both the functional forms of elliptical distributions and the data structure. For more detail, readers are referred to [13]. Although these limitations have recently been somewhat relaxed [14; 15], applications of the EMM are still severely restricted.
To this end, we consider Riemannian manifold optimisation for parameter estimation in this context, which has proven to be extremely effective in problems related to positive definite matrices, as it naturally casts a general ill-posed constrained problem onto that of optimising on a convex half-cone, which can be solved via the vector space of matrices. In contrast, it is always difficult to handle the positive definite constraint in Euclidean coordinates. Along this direction, Hosseini and Sra successfully applied gradient descent along the Riemannian manifold to the GMM problem, and achieved fast convergence without any sacrifice in accuracy [16; 17]. It is therefore natural to ask whether a general estimation method for the EMM can be approached from the manifold optimisation perspective.
1.1 Challenges and contributions
The first step towards our aim to introduce a class of feasible and computable EMMs is to define a proper Riemannian metric for elliptical distributions, as the metric completely determines the optimisation procedure of the EMM. A wide variety of works related to positive definite matrices adopt an intrinsic Riemannian metric which comes from the statistics (the Hessian of entropy [18] or the Fisher information [19]) of multivariate normal distributions. Such a metric is also adopted in [16; 17] for estimating the GMM. It is therefore natural to first investigate whether such a “Gaussian”-based metric is an appropriate choice for the EMM. To this end, we start from the statistics, and first assess the rationale of this metric. Then, in addition to the covariance matrices, the EMM also requires estimation of the location vector. Location-covariance estimation is typically more complicated but less theoretically supported than the covariance problem [20]. A common current strategy is to reformulate the location-estimation problem into that of solely covariance estimation with one more dimension [13]. As reported in [16; 17], this reformulation significantly accelerates convergence. For the EMM, the reformulation is not direct due to the nonexistence of a closed-form representation. We thus develop the corresponding reformulation for the EMM, and further find that such a reformulation during manifold optimisation attains the same metric as a natural gradient [21] descent for the location vector combined with the standard covariance estimation. Finally, we propose a general estimation method for the EMM, which overcomes the above limitations [14; 15].
Robust EMM estimation is therefore badly needed and is rapidly emerging; for example, a toolbox in [22], which was originally designed for the GMM [17], has already included several types of elliptical distributions. However, the existing toolbox has not been generalised to the EMM. This paper therefore sets out to fill the void in the literature by rigorously establishing a new unifying framework for the analysis of EMMs, thus opening a new avenue for practical approaches based on real-world data. Unlike the current inconsistent solutions, the proposed framework is generic and can be considered a natural generalisation of the parametrisation from the GMM. Our contributions can be summarised as follows:

We rigorously unify typical elliptical distributions by means of their intrinsic relationships, which enables simple ways to generate samples and conduct further analysis;

We introduce a Riemannian metric for the location vector and the covariance matrices within elliptical distributions, which provides us with further understanding of this reformulation. The approach is shown to admit straightforward physical interpretability and to include asymmetric distributions in a seamless and natural way;

The proposed estimation approach for the EMM is general and generic, and includes the mean-shift algorithm as one of its special cases.
1.2 Related works
GMM-based estimation is well established and its importance has been widely acknowledged in the machine learning community. Since our focus is on the EMM, we omit the review of the GMM and refer the reader to [23] for a comprehensive review. To robustify the GMM, mixtures of the t-distribution have been thoroughly studied [7; 8; 9], on the basis of a generalised EM algorithm (expectation-conditional maximisation). A more general mixture model has been proposed in [24] based on the Pearson type VII distribution (which includes the t-distribution as a special case). Moreover, as the transformed coefficients in the wavelet domain tend to be Laplace distributed, a mixture of Laplace distributions has been proposed in [10] for image denoising. Its more general version, a mixture of hyperbolic distributions, has also been recently introduced in [11]. Typically, these approaches employ generalised EM algorithms because, contrary to the GMM, there is no closed-form solution at each maximisation step. Fortunately, the above distributions belong to the scale mixture of normals class, which can be regarded as a convolution of a Gamma distribution and a Gaussian distribution, and this ensures the convergence of the generalised EM algorithm. However, these approaches lack generality; for other elliptical distributions, for example, convergence is no longer guaranteed. It is important to notice that despite several attempts, current mixture models, including [25; 26; 27], are of a rather ad hoc nature.
For a comprehensive text on the optimisation on the Riemannian manifold, we refer to [28], together with a seminal book on information geometry by Amari [29]. We here mainly focus on manifold optimisation of positive definite matrices. Specifically, pioneering in this direction is the work of Rao, which introduced the Rao distance to define the statistical difference between two multivariate normal distributions [30]. This work was later generalised by [19; 31; 32]. In the last decade, Wiesel proved the convergence of the iterative reweighting algorithm in [33] via the concept of geodesic convexity, and Zhang et al. further relaxed the convergence conditions in [14]. Sra and Hosseini [15] provided similar results from another perspective of the Riemannian manifold. For more details on the manifold of positive definite matrices, readers are referred to comprehensive works in [34; 20]. Recently, Hosseini and Sra directly adopted the gradient descent on the Riemannian manifold for estimating the GMM, and achieved significant improvement over the traditional EM algorithm [16; 17].
2 Preliminaries and notations
We first provide a brief introduction and notations of elliptical distributions, focusing especially on their relationships with commonly used distributions in statistical machine learning. Then, several key concepts in manifold optimisation are presented.
2.1 Elliptical distributions
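A random vector $\mathbf{x} \in \mathbb{R}^{M}$ is said to have an elliptical distribution if and only if it admits the following stochastic representation [35],
$$\mathbf{x} \stackrel{d}{=} \boldsymbol{\mu} + \mathcal{R}\,\boldsymbol{\Lambda}\,\mathbf{u}, \tag{1}$$
where $\mathcal{R} \geq 0$ is a non-negative real scalar random variable which models the tail properties of the elliptical distribution, $\mathbf{u} \in \mathbb{R}^{M'}$ is a random vector that is uniformly distributed on the unit spherical surface, with the pdf $\Gamma(M'/2)/(2\pi^{M'/2})$, $\boldsymbol{\mu} \in \mathbb{R}^{M}$ is a location (mean) vector, while $\boldsymbol{\Lambda} \in \mathbb{R}^{M \times M'}$ is a matrix that transforms $\mathbf{u}$ from a sphere to an ellipse, and the symbol “$\stackrel{d}{=}$” designates “the same distribution”. For a comprehensive review of elliptical distributions, we refer to [5; 36].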
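The stochastic representation lends itself to a direct sampling recipe: draw a radius, draw a direction on the unit sphere, and map through a factor of the scatter matrix. The following is a minimal sketch rather than the authors' implementation; the function names, the use of a Cholesky factor for the transforming matrix, and the two radial laws (a chi-square law for the Gaussian case and a scaled F law for the Student-t case) are assumptions made for this illustration.

```python
import numpy as np

def sample_unit_sphere(rng, n, dim):
    """Draw n points uniformly on the unit sphere in R^dim."""
    z = rng.standard_normal((n, dim))
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def sample_elliptical(rng, n, mu, scatter, radius_sq_sampler):
    """Sample x = mu + R * Lambda * u, with Lambda a Cholesky factor of scatter."""
    dim = mu.shape[0]
    lam = np.linalg.cholesky(scatter)  # lam @ lam.T equals scatter
    u = sample_unit_sphere(rng, n, dim)
    r = np.sqrt(radius_sq_sampler(rng, n, dim))
    return mu + (r[:, None] * u) @ lam.T

def gaussian_radius_sq(rng, n, dim):
    """Gaussian case: R^2 follows a chi-square law with dim degrees of freedom."""
    return rng.chisquare(dim, size=n)

def student_t_radius_sq(dof):
    """Student-t case: R^2 / dim follows an F(dim, dof) law."""
    def sampler(rng, n, dim):
        return dim * rng.f(dim, dof, size=n)
    return sampler

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
scatter = np.array([[2.0, 0.6], [0.6, 1.0]])
x_gauss = sample_elliptical(rng, 100_000, mu, scatter, gaussian_radius_sq)
x_t = sample_elliptical(rng, 100_000, mu, scatter, student_t_radius_sq(5.0))
```

Only the one-dimensional radius sampler changes between the two families, which is the property exploited later when the elliptical family is summarised through the radial variable.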
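Note that an elliptical distribution does not necessarily possess an explicit pdf, but can always be formulated by its characteristic function. However, when $\mathrm{rank}(\boldsymbol{\Lambda}) = M$, that is, for a non-singular scatter matrix $\boldsymbol{\Sigma} = \boldsymbol{\Lambda}\boldsymbol{\Lambda}^{T}$, the pdf for elliptical distributions does exist and has the following form
$$p(\mathbf{x}) = c_{M}\,\det(\boldsymbol{\Sigma})^{-1/2}\, g\big((\mathbf{x}-\boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\big), \tag{2}$$
where $g(\cdot)$ is called the density generator and $c_{M}$ is a constant solely related to the dimension $M$.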
Remark.
Observe that the constant $c_{M}$ in (2) serves as a normalisation term, while the density generator $g(t) = \exp(-t/2)$ yields the multivariate Gaussian distribution, thus indicating the generality of elliptical distributions.
For simplicity, the elliptical distribution in (2) will be denoted by $\mathcal{E}(\boldsymbol{\mu}, \boldsymbol{\Sigma}, g)$.
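To make the role of the density generator concrete, the sketch below evaluates the log-pdf of an elliptical distribution, assuming the pdf takes the standard form of a normalising constant times $\det(\boldsymbol{\Sigma})^{-1/2}$ times the generator applied to the squared Mahalanobis distance. The function names are hypothetical, and only the Gaussian generator $g(t) = \exp(-t/2)$ with constant $(2\pi)^{-M/2}$ is shown.

```python
import numpy as np

def elliptical_logpdf(x, mu, scatter, log_g, log_c):
    """log p(x) = log c - 0.5 * log det(scatter) + log g(squared Mahalanobis distance)."""
    diff = np.atleast_2d(x) - mu
    solved = np.linalg.solve(scatter, diff.T).T   # rows hold scatter^{-1} (x - mu)
    maha_sq = np.einsum("ij,ij->i", diff, solved)
    _, logdet = np.linalg.slogdet(scatter)
    return log_c - 0.5 * logdet + log_g(maha_sq)

def gaussian_log_g(t):
    """Log of the Gaussian density generator g(t) = exp(-t / 2)."""
    return -0.5 * t

def gaussian_log_c(dim):
    """Log of the Gaussian normalising constant (2 * pi)^(-dim / 2)."""
    return -0.5 * dim * np.log(2.0 * np.pi)

# Usage: a point at unit Mahalanobis distance from the location.
value = elliptical_logpdf(np.array([2.0, 0.0]), np.zeros(2),
                          np.diag([4.0, 1.0]), gaussian_log_g, gaussian_log_c(2))
```

Swapping `gaussian_log_g` and `gaussian_log_c` for another generator and constant changes the family without touching the rest of the evaluation.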
2.2 Riemannian manifold
A Riemannian manifold $\mathcal{M}$ is a smooth (differentiable) manifold (i.e., locally homeomorphic to the Euclidean space) equipped with a smoothly varying inner product on its tangent space. The inner product also defines the Riemannian metric on the tangent space, so that the length of a curve and the angle between two vectors can be correspondingly defined. Curves on the manifold with the shortest paths are called geodesics, which exhibit constant instantaneous speed and generalise straight lines in the Euclidean space. The distance between two points on $\mathcal{M}$ is defined as the minimum length of all geodesics connecting these two points.
We use the symbol $T_{\boldsymbol{\Sigma}}\mathcal{M}$ to denote the tangent space at the point $\boldsymbol{\Sigma}$, which is the first-order approximation of $\mathcal{M}$ at $\boldsymbol{\Sigma}$. Consequently, vectors on $T_{\boldsymbol{\Sigma}}\mathcal{M}$ generalise the directional derivative, and the Riemannian gradient of a function $f$ is defined with regard to the equivalence between its inner product with an arbitrary vector on $T_{\boldsymbol{\Sigma}}\mathcal{M}$ and the Fréchet derivative of $f$ at $\boldsymbol{\Sigma}$. Moreover, a smooth mapping from $T_{\boldsymbol{\Sigma}}\mathcal{M}$ to $\mathcal{M}$ is called a retraction, whereby the exponential mapping obtains the point on the geodesic in the given direction. Because the tangent spaces vary across different points on $\mathcal{M}$, parallel transport across different tangent spaces can be introduced on the basis of the Levi-Civita connection, which preserves the inner product and norm. Then, we can convert a complex optimisation problem on $\mathcal{M}$ into a more analysis-friendly space, that is, $T_{\boldsymbol{\Sigma}}\mathcal{M}$.
For covariance matrices, or more generally, positive definite matrices, although there are various metrics designed for measuring the distance between matrices [37; 38; 39; 40], not all of them arise from a smoothly varying inner product (i.e., a Riemannian manifold), which would consequently give a “true” geodesic distance. The most popular such metric comes from the statistical manifold in which each point is defined as a probability distribution. The inner product in such a manifold was adopted by Skovgaard [19] to measure dissimilarities through covariance matrices of two multivariate Gaussian distributions, in the form of $\mathrm{d}s^{2} = \tfrac{1}{2}\mathrm{tr}\big(\boldsymbol{\Sigma}^{-1}\mathrm{d}\boldsymbol{\Sigma}\,\boldsymbol{\Sigma}^{-1}\mathrm{d}\boldsymbol{\Sigma}\big)$, and its effectiveness has been comprehensively verified [16; 17; 34; 37]. It is also possible to obtain a closed-form solution for the geodesic between two positive definite matrices $\boldsymbol{\Sigma}_{1}$ and $\boldsymbol{\Sigma}_{2}$, $\gamma(t) = \boldsymbol{\Sigma}_{1}^{1/2}\big(\boldsymbol{\Sigma}_{1}^{-1/2}\boldsymbol{\Sigma}_{2}\boldsymbol{\Sigma}_{1}^{-1/2}\big)^{t}\boldsymbol{\Sigma}_{1}^{1/2}$, to yield its geodesic distance [41]. A geodesic convex function $f$ can be defined as one satisfying $f(\gamma(t)) \leq (1-t)f(\boldsymbol{\Sigma}_{1}) + t f(\boldsymbol{\Sigma}_{2})$ with $t \in [0, 1]$. We should point out that a function with a geodesic convex form ensures that the global optimum can be found, although such a function may not be Euclidean convex.
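The closed-form geodesic between two positive definite matrices can be checked numerically. The sketch below assumes the standard affine-invariant geodesic and measures its length with the Frobenius norm of the matrix logarithm; the overall scale of the distance depends on the constant placed in front of the metric, which is treated here as a convention. All function names are hypothetical.

```python
import numpy as np

def spd_power(a, p):
    """Matrix power of a symmetric positive definite matrix via eigendecomposition."""
    w, v = np.linalg.eigh(a)
    return (v * w**p) @ v.T

def spd_log(a):
    """Matrix logarithm of a symmetric positive definite matrix."""
    w, v = np.linalg.eigh(a)
    return (v * np.log(w)) @ v.T

def whiten(s1, s2):
    """Return the symmetrised matrix s1^{-1/2} s2 s1^{-1/2}."""
    root_inv = spd_power(s1, -0.5)
    inner = root_inv @ s2 @ root_inv
    return 0.5 * (inner + inner.T)

def geodesic(s1, s2, t):
    """Point at parameter t on the geodesic from s1 (t = 0) to s2 (t = 1)."""
    root = spd_power(s1, 0.5)
    return root @ spd_power(whiten(s1, s2), t) @ root

def geodesic_distance(s1, s2):
    """Affine-invariant distance, up to the constant convention of the metric."""
    return np.linalg.norm(spd_log(whiten(s1, s2)), "fro")

s1 = np.array([[2.0, 0.5], [0.5, 1.0]])
s2 = np.array([[1.0, -0.3], [-0.3, 3.0]])
midpoint = geodesic(s1, s2, 0.5)
```

A simple instance of geodesic convexity is the log-determinant, which is linear along this geodesic although it is concave in the Euclidean sense.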
3 Statistical Riemannian metrics for elliptical distributions
Typically, on the manifold $\mathcal{M}$, the metric has been widely defined through the Fisher information, which results in the natural gradient in manifold gradient descent [21] and represents an information geodesic measure, as outlined in a seminal book by Amari [29]. However, for positive definite matrices, the natural gradient is not guaranteed to be explicitly obtainable [21]. Alternatively, Burbea and Rao introduced the “entropy differential metric” [42] on the basis of entropy, which was later used by Hiai and Petz to define the Riemannian metric for positive definite matrices [18]. It needs to be pointed out that the intrinsic metric, that is, the trace metric of multivariate normal distributions described in Section 2.2, is obtained via the Hessian of the Boltzmann entropy of multivariate normal distributions.
This allows us here to calculate the corresponding Riemannian metrics for the elliptical distributions.
Proposition 1.
Consider the class of elliptical distributions, . Then, the Riemannian metric for the location vector is given by
(3) 
and the Riemannian metric for the covariance by
(4) 
Proof.
The Riemannian metric for the mean vectors is directly obtained from the Fisher information matrix [43]. To obtain the Riemannian metric for the covariance matrix, we first calculate the Hessian of the Boltzmann entropy, as follows,
(5)  
Because is irrelevant to , the Hessian of can be calculated
(6) 
The resulting Riemannian metric is the same as that for multivariate normal distributions, which is the most widely used metric. ∎
Finally, on the basis of [44], we can now provide the following treatment for the elliptical distributions.
Theorem 1.
Consider the class of elliptical distributions, . Then, upon reformulating and as
(7) 
gives the following Riemannian metric for the reformulated covariance,
(8) 
where .
Remark.
From Theorem 1, we can see that, after reformulation, manifold optimisation is actually performed under the same Riemannian metric as a simultaneous estimation of the location and the covariance in their respective Riemannian manifolds. This provides another perspective for understanding the proposed reformulation and enhances its physical interpretability.
4 Manifold optimisation for the elliptical mixture model (EMM)
We next introduce a concise summary of the elliptical distributions, in order to provide clarity in handling different types of elliptical distribution in later sections. Then, we lay out the EMM optimisation problem, followed by the reformulation and manifold optimisation.
4.1 Elliptical family of distributions
The elliptical family of distributions is quite general, and includes many widely used standard distributions as special cases, e.g., the Gaussian distribution. A comprehensive summary can be found in Chapter 3 of [5], but it involves complicated closed-form formulations for each type of elliptical distribution. In addition, the open literature employs different notations and formulations to categorise these distributions, which may lead to confusion. To this end, we here first provide a unifying summary of elliptical distributions, which is achieved through the stochastic representation in (1). This makes it possible to avoid complicated formulations, and instead to classify different categories simply through several typical distributions of $\mathcal{R}$. Uniquely, this makes it possible to generate high-dimensional samples from the one-dimensional $\mathcal{R}$ for a range of elliptical distributions, and also further clarifies the commonalities between the members of the elliptical family of distributions.
In general, according to (1), an arbitrary elliptical distribution can be represented by $\mathcal{R}$ and $\mathbf{u}$. As the uniformly distributed $\mathbf{u}$ only relates to the dimension, we focus on the parameter $\mathcal{R}$, or equivalently, $\mathcal{R}^{2}$, in order to provide a unifying summary of typical elliptical distributions, which are listed in Table 1. The term $\mathcal{R}^{2}$ is frequently used in practice because it has the same distribution as the quadratic form (i.e., the Mahalanobis distance). Table 1 makes use of the Gamma distribution, the generalised inverse Gaussian distribution, the Kolmogorov-Smirnov distribution, the Beta distribution and the symmetric stable distribution, together with the Bessel function of the second kind and the Gamma function. The proof can be achieved by direct validation, and is omitted here due to space limitations.
Types  &  Typical Multivariate Distributions  
Kotz Type [5]  , ,  Gamma dist.:  
Weibull dist.:  
Generalised Gaussian dist.:  
Normal dist.:  
Scale Mixture Type ,  Pearson Type VII [5]  dist.:  
,  Cauchy dist.: ,  
,  
Hyperbolic Type [45] ,  ,  InverseGaussian dist.:  
dist. [46]:  
Laplace dist.:  
Other Types [12; 47]  Logistic dist.  
,  stable dist.  
Pearson Type II [5]  ,  ,  – 
4.2 The elliptical mixture model (EMM)
Generally, we assume the EMM consists of $K$ mixtures, each elliptically distributed. To make the proposed EMM flexible enough to capture inherent structures in data, in our framework it is not necessary for every elliptical distribution to have the same density generator (denoted by $g_{k}$ for the $k$-th mixture).
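In finite mixture models, latent variables $z_{k} \in \{0, 1\}$ are binary, to represent membership to the $k$-th mixture. The probability of choosing the $k$-th mixture is denoted by $\pi_{k}$, so that $\pi_{k} \geq 0$ and $\sum_{k=1}^{K}\pi_{k} = 1$. Upon rearranging the scalars $z_{k}$ into a vector $\mathbf{z}$, we can further simply write $p(\mathbf{z}) = \prod_{k=1}^{K}\pi_{k}^{z_{k}}$. For a set of observed i.i.d. samples $\{\mathbf{x}_{n}\}_{n=1}^{N}$, the negative log-likelihood can be obtained as
$$\mathcal{J} = -\sum_{n=1}^{N}\ln\sum_{k=1}^{K}\pi_{k}\,p_{k}(\mathbf{x}_{n}), \tag{9}$$
where $p_{k}$ denotes the pdf in (2) with location $\boldsymbol{\mu}_{k}$, scatter matrix $\boldsymbol{\Sigma}_{k}$ and density generator $g_{k}$.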
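The negative log-likelihood of a mixture whose components use different density generators can be evaluated stably with a log-sum-exp over components. This is a minimal sketch under the assumption that each component is described by a log-generator and a log-normalising constant; the helper names and the choice of Gaussian and Student-t components are illustrative only.

```python
import math
import numpy as np

def log_density(x, mu, scatter, log_g, log_c):
    """Log-pdf of one elliptical component evaluated at the rows of x."""
    diff = x - mu
    maha_sq = np.einsum("ij,ij->i", diff, np.linalg.solve(scatter, diff.T).T)
    _, logdet = np.linalg.slogdet(scatter)
    return log_c - 0.5 * logdet + log_g(maha_sq)

def gaussian_component(dim):
    """Log-generator and log-constant of the Gaussian distribution."""
    return (lambda t: -0.5 * t), -0.5 * dim * math.log(2.0 * math.pi)

def student_t_component(dim, dof):
    """Log-generator and log-constant of the multivariate Student-t distribution."""
    log_c = (math.lgamma(0.5 * (dof + dim)) - math.lgamma(0.5 * dof)
             - 0.5 * dim * math.log(dof * math.pi))
    return (lambda t: -0.5 * (dof + dim) * np.log1p(t / dof)), log_c

def emm_nll(x, weights, mus, scatters, components):
    """Negative log-likelihood of the mixture, using log-sum-exp over components."""
    log_terms = np.stack(
        [math.log(w) + log_density(x, m, s, log_g, log_c)
         for w, m, s, (log_g, log_c) in zip(weights, mus, scatters, components)],
        axis=1)
    top = log_terms.max(axis=1, keepdims=True)
    log_mix = top[:, 0] + np.log(np.exp(log_terms - top).sum(axis=1))
    return -log_mix.sum()

# Usage: a two-component mixture with one Gaussian and one Student-t component.
rng = np.random.default_rng(0)
data = rng.standard_normal((50, 2))
nll = emm_nll(data, [0.4, 0.6],
              [np.zeros(2), np.ones(2)], [np.eye(2), 2.0 * np.eye(2)],
              [gaussian_component(2), student_t_component(2, 3.0)])
```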
4.3 Manifold optimisation and reformulation
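The estimation of the mixing probabilities, the location vectors and the covariance matrices requires the minimisation of the negative log-likelihood in (9), which is not possible to achieve in closed form. We therefore proceed to introduce the manifold optimisation for the EMM framework, by first reformulating the location vector and the covariance matrix of each mixture into a single augmented matrix on the basis of Theorem 2.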
Theorem 2.
Proof.
The proof of the first property rests upon a generalisation of the result in the GMM [17], and its proof is analogous to that of Theorem in [17]. In fact, for Gaussian distribution, leads to , which is the reformulation adopted in the GMM in [17].
The second property can be proved via the relationship . This relationship can be easily verified through a decomposition of to [48]. ∎
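For the Gaussian special case mentioned in the proof, the augmented parametrisation can be verified numerically. The sketch below assumes the form used for the GMM in [16; 17], in which each sample is extended by a constant 1 and the location and covariance are merged into one matrix of one more dimension; the general EMM reformulation of Theorem 2 is not reproduced here, and the function name is hypothetical.

```python
import numpy as np

def augment_gaussian(mu, sigma):
    """Merge (mu, sigma) into one positive definite matrix of one more dimension."""
    dim = mu.shape[0]
    aug = np.empty((dim + 1, dim + 1))
    aug[:dim, :dim] = sigma + np.outer(mu, mu)
    aug[:dim, dim] = mu
    aug[dim, :dim] = mu
    aug[dim, dim] = 1.0
    return aug

mu = np.array([1.0, -2.0])
sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
x = np.array([0.5, 0.3])

aug = augment_gaussian(mu, sigma)
y = np.append(x, 1.0)                              # augmented sample [x; 1]

maha_sq = (x - mu) @ np.linalg.solve(sigma, x - mu)
aug_quadratic = y @ np.linalg.solve(aug, y)        # equals maha_sq + 1
```

Since the determinant is preserved and the quadratic form only shifts by one, the Gaussian likelihood in the augmented variables differs from the original by a constant factor, which is what makes a covariance-only optimisation possible in this case.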
On the basis of the reformulated , we have the Euclidean gradient for the reformulated :
Before moving to the Riemannian manifold optimisation, we shall first inspect the properties of the introduced gradient. The optimum occurs when this gradient vanishes. The resulting weights, that is, the M-estimators, decrease to 0 when the Mahalanobis distance increases. It is well known that the Mahalanobis distance is a scale-free metric, which is particularly suited to measuring outliers. Thus, samples with large Mahalanobis distances result in small weights and have little impact on the estimates, which yields the robustness of the EMM. Furthermore, the existence of these weights also mitigates the problem of singular distributions during estimation.
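The decay of the weights with the Mahalanobis distance can be illustrated for two density generators. The sketch assumes the weight is defined as $w(t) = -2\,g'(t)/g(t)$, a normalisation chosen here so that the Gaussian weight equals 1; the paper's exact scaling is not recoverable from the text, and the function names are hypothetical.

```python
import numpy as np

def gaussian_weight(t):
    """w(t) = -2 g'(t) / g(t) for g(t) = exp(-t / 2): constant, hence not robust."""
    return np.ones_like(np.asarray(t, dtype=float))

def student_t_weight(t, dof, dim):
    """w(t) = -2 g'(t) / g(t) for g(t) = (1 + t / dof)^(-(dof + dim) / 2)."""
    return (dof + dim) / (dof + np.asarray(t, dtype=float))

maha_sq = np.array([0.0, 1.0, 10.0, 100.0, 1e6])
w_gauss = gaussian_weight(maha_sq)
w_t = student_t_weight(maha_sq, dof=1.0, dim=2)
```

Under this assumption, a sample far from the bulk of the data keeps full influence in the Gaussian case, whereas its influence vanishes in the Student-t case.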
Next, we obtain the Riemannian gradient for the manifold optimisation [16], through the retraction of
which is an approximation of the exponential mapping whilst ensuring computational feasibility. This makes it possible to implement the steepest manifold gradient descent algorithm to minimise the cost in (9) iteratively. The convergence speed can be further improved via conjugate gradient descent and L-BFGS, with the parallel transport given in [17].
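A steepest-descent iteration on the manifold of positive definite matrices can be sketched as follows. The retraction expression of the paper is not reproduced here; the sketch assumes a commonly used second-order retraction, $\boldsymbol{\Sigma} + \boldsymbol{\xi} + \tfrac{1}{2}\boldsymbol{\xi}\boldsymbol{\Sigma}^{-1}\boldsymbol{\xi}$, and the trace metric without a constant factor, and it uses a single-Gaussian cost as a stand-in for the EMM cost. All names and the fixed step size are assumptions.

```python
import numpy as np

def retraction(sigma, xi):
    """Second-order retraction sigma + xi + 0.5 * xi sigma^{-1} xi (stays positive definite)."""
    return sigma + xi + 0.5 * xi @ np.linalg.solve(sigma, xi)

def riemannian_gradient(sigma, euclid_grad):
    """Riemannian gradient under the metric tr(sigma^{-1} a sigma^{-1} b)."""
    return sigma @ euclid_grad @ sigma

def gaussian_cost_gradient(sigma, sample_cov):
    """Euclidean gradient of log det(sigma) + tr(sigma^{-1} sample_cov)."""
    inv = np.linalg.inv(sigma)
    return inv - inv @ sample_cov @ inv

def riemannian_descent(sample_cov, step=0.5, iters=100):
    """Fixed-step steepest descent on the positive definite manifold."""
    sigma = np.eye(sample_cov.shape[0])
    for _ in range(iters):
        grad = riemannian_gradient(sigma, gaussian_cost_gradient(sigma, sample_cov))
        sigma = retraction(sigma, -step * grad)
        sigma = 0.5 * (sigma + sigma.T)   # remove round-off asymmetry
    return sigma

sample_cov = np.array([[2.0, 0.6], [0.6, 1.0]])
sigma_hat = riemannian_descent(sample_cov)
```

No projection or constraint handling is needed: the retraction keeps every iterate positive definite, which is the practical appeal of working on the manifold rather than in Euclidean coordinates.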
When it comes to the estimation of the mixing probabilities, it is stated in [16] that they form a product manifold with the reformulated covariance matrices, in which gradient descent can be conducted in their respective manifolds. Therefore, for given reformulated covariance matrices, we can solve for the mixing probabilities in closed form, by means of setting the derivative to 0 and considering the constraint that the mixing probabilities sum to one.
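Setting the derivative of the cost to zero under the sum-to-one constraint leads, in the standard mixture-model derivation, to mixing probabilities equal to the averaged posterior memberships. The sketch below assumes this standard update and takes the per-component log-densities as given; the function name and the input layout (one row per sample, one column per mixture) are assumptions.

```python
import numpy as np

def update_mixing_weights(log_dens, weights):
    """Average the posterior memberships computed from per-component log-densities."""
    log_post = np.log(weights) + log_dens
    log_post = log_post - log_post.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)
    return post.mean(axis=0)

# Usage: three samples, two mixtures; the first mixture explains the data better.
log_dens = np.array([[-1.0, -3.0], [-0.5, -2.5], [-2.0, -2.0]])
new_weights = update_mixing_weights(log_dens, np.array([0.5, 0.5]))
```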
4.4 Regularisation
We have shown that the EMM can relieve the problem of singular distributions; however, it cannot completely alleviate the problem for all elliptical distributions. We therefore resort to regularisation of the covariance matrix, which basically imposes sparsity on the precision matrix. While such regularisers cannot in general ensure the geodesic convex property, we here follow the approach of Ollila and Tyler [49] in the choice of the regulariser, whose weight is controlled by a scalar parameter. In fact, similar regularisation forms can be obtained when adding an inverse-Wishart prior distribution, followed by maximum a posteriori instead of maximum likelihood estimation [50]. Specific to the EMM, the advantages of this regulariser are that it is strictly geodesic convex and that the solutions are ensured to exist for any data configuration [49]. In this case, the reformulation turns to
Therefore, the estimated covariance matrices are always full-rank, and the singular distributions are thus completely avoided during estimation, as desired.
When reaching the optimum, i.e., , we can obtain the following equations for and :
(11)  
We can find from (11) that when the regularisation weight increases, the estimated covariance is more likely to be an identity matrix, in which case the estimation of the location vector is actually the mean-shift algorithm.
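The connection to mean-shift can be illustrated with a weighted-mean fixed-point iteration for the location. The sketch below is not the update in (11); it assumes an identity scatter matrix and Student-t weights of the form $(\nu + M)/(\nu + d^{2})$, with hypothetical names throughout, and it uses synthetic data with a small cluster of outliers.

```python
import numpy as np

def mean_shift_location(x, dof=1.0, iters=50):
    """Weighted-mean iteration for the location with the scatter fixed to the identity."""
    dim = x.shape[1]
    mu = x.mean(axis=0)
    for _ in range(iters):
        dist_sq = np.sum((x - mu) ** 2, axis=1)
        w = (dof + dim) / (dof + dist_sq)          # Student-t weights, small for outliers
        mu = (w[:, None] * x).sum(axis=0) / w.sum()
    return mu

rng = np.random.default_rng(1)
inliers = rng.standard_normal((200, 2))                              # true location at the origin
outliers = np.array([10.0, 0.0]) + 0.1 * rng.standard_normal((20, 2))
data = np.vstack([inliers, outliers])

mu_mean = data.mean(axis=0)              # pulled towards the outliers
mu_robust = mean_shift_location(data)    # stays close to the inlier cluster
```

Each step moves the current estimate to a locally weighted average of the samples, which is the defining operation of mean-shift.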
5 Numerical results
We verified the proposed framework on both synthetic and real-world data. The synthetic dataset was generated according to [23] and consisted of two Gaussian-distributed clusters, with their location vectors satisfying
(12) 
where is a constant that controls the separation. We set here for clear illustration. Each Gaussian-distributed cluster contained samples, and the small set of points centred at (10, 0) was treated as noise, which contained samples from a spherical Gaussian distribution. The elliptical distribution used was the distribution with , as in Table 1. The estimation results were generated without (in orange) and with the noise (in red) in Figure 1.
As can be seen from this figure, the outliers at (10, 0) have dramatically biased the location estimation in the GMM away from the ground truth, while estimates based on the EMM remain almost unchanged. This demonstrates the robustness of the EMM, compared with the GMM.
Then, in order to assess the quality of features extracted by the EMM, we considered the probability of the location vector in an image for reconstruction. We adopted five types of EMMs for comprehensive comparison: the t-distribution under two parameter settings (denoted by Tdist.1 and Tdist.100), the Kotz type under two parameter settings (denoted by Kotz1 and Kotz2), and the logistic distribution (denoted by Logi). We randomly chose 50 images from the BSDS500 database [51] for mixture modelling. We also used k-means in the VLFeat toolbox [52] for initialisation. After estimation of each EMM together with the GMM, the mean value was assigned to each pixel with regard to the posterior distribution, followed by the reconstruction. We used the peak signal-to-noise ratio (PSNR) as the quality assessment metric, whereby a higher PSNR value indicates lower reconstruction error, that is, more accurate and effective features were captured by the elliptical mixture model. The results are presented in Table 2.
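For reference, the PSNR between an image and its reconstruction can be computed as below. This is a minimal sketch assuming 8-bit images with a peak value of 255; the function name and the peak value are assumptions, and the evaluation pipeline of the experiments is not reproduced.

```python
import numpy as np

def psnr(reference, reconstruction, peak=255.0):
    """Peak signal-to-noise ratio in dB; higher values mean lower reconstruction error."""
    ref = np.asarray(reference, dtype=float)
    rec = np.asarray(reconstruction, dtype=float)
    mse = np.mean((ref - rec) ** 2)
    if mse == 0.0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

# Usage: a 2 x 2 image with a single pixel off by 10 grey levels.
reference = np.zeros((2, 2))
reconstruction = np.array([[0.0, 0.0], [0.0, 10.0]])
score = psnr(reference, reconstruction)
```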
Mixture numbers  Tdist.1  Tdist.100  Kotz1  Kotz2  Logi 

2  0.51  0.63  0.52  1.11  0.90 
5  2.33  1.36  0.6  1.32  3.1 
10  4.23  4.10  0.7  1.00  4.5 
Some representative results are visualised in Fig. 2. By comparing Fig. 2(b) and (c), we can see that because the t-distribution in Tdist.1 is more heavy-tailed than that in Tdist.100, its reconstruction is clearer and it also reconstructs the details of the ship well. Furthermore, neither Kotz1 nor Kotz2 is geodesic convex, which means they cannot be estimated via the iterative reweighting algorithm. However, the proposed manifold optimisation ensures that an optimum for these distributions is found. Moreover, the Kotz1 model in Fig. 2(d) is the only one which is inferior to the GMM. This may be due to its lighter tails, which indicate that it is sensitive to the details. In contrast, the other elliptical distributions given here are all heavier-tailed than the Gaussian distribution, and are thus more robust to outliers. This all demonstrates the flexibility of the EMM framework, and the ability of the proposed manifold optimisation algorithm to provide a general solution to the EMM.
6 Conclusions
We have proposed a universal framework for estimating the EMM for the general case of unbalanced data and mixtures of different members of the class of elliptical distributions. We have revisited the statistics of elliptical distributions to justify the effectiveness of the Riemannian metrics adopted in the EMM learning process. We have also analysed the rationale for the problem reformulation under the framework of the Riemannian manifold, and have introduced its EMM version. The existing elliptical distributions have also been unified in this paper, to provide much needed flexibility in choosing the EMM. Numerical results have not only demonstrated the robustness and flexibility of the proposed EMM framework, but also further highlighted the physical interpretability and the effects of individual distributions on the EMM.
References
 Figueiredo and Jain [2002] Mario A. T. Figueiredo and Anil K. Jain. Unsupervised learning of finite mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3):381–396, 2002.
 McLachlan and Peel [2004] Geoffrey McLachlan and David Peel. Finite mixture models. John Wiley & Sons, 2004.
 Zoubir et al. [2012] Abdelhak M Zoubir, Visa Koivunen, Yacine Chakhchoukh, and Michael Muma. Robust estimation in signal processing: A tutorial-style treatment of fundamental concepts. IEEE Signal Processing Magazine, 29(4):61–80, 2012.
 Mandic et al. [2005] Danilo P Mandic, Dragan Obradovic, Anthony Kuh, Tülay Adali, Udo Trutschell, Martin Golz, Philippe De Wilde, Javier Barria, Anthony Constantinides, and Jonathon Chambers. Data fusion for modern engineering applications: An overview. In International Conference on Artificial Neural Networks, pages 715–721. Springer, 2005.
 Fang et al. [1990] Kai Wang Fang, Samuel Kotz, and Kai Wang Ng. Symmetric multivariate and related distributions. London, U.K.: Chapman & Hall, 1990.
 Huber [2011] Peter J Huber. Robust statistics. In International Encyclopedia of Statistical Science, pages 1248–1251. Springer, 2011.
 Peel and McLachlan [2000] David Peel and Geoffrey J McLachlan. Robust mixture modelling using the t distribution. Statistics and computing, 10(4):339–348, 2000.
 Andrews and McNicholas [2012] Jeffrey L Andrews and Paul D McNicholas. Model-based clustering, classification, and discriminant analysis via mixtures of multivariate t-distributions. Statistics and Computing, 22(5):1021–1029, 2012.
 Lin et al. [2014] Tsung-I Lin, Paul D McNicholas, and Hsiu J Ho. Capturing patterns via parsimonious t mixture models. Statistics & Probability Letters, 88:80–87, 2014.
 Tan and Jiao [2007] Shan Tan and Licheng Jiao. Multivariate statistical models for image denoising in the wavelet domain. International Journal of Computer Vision, 75(2):209–230, 2007.
 Browne and McNicholas [2015] Ryan P Browne and Paul D McNicholas. A mixture of generalized hyperbolic distributions. Canadian Journal of Statistics, 43(2):176–198, 2015.
 Andrews and Mallows [1974] David F Andrews and Colin L Mallows. Scale mixtures of normal distributions. Journal of the Royal Statistical Society. Series B (Methodological), pages 99–102, 1974.
 Kent and Tyler [1991] John T Kent and David E Tyler. Redescending M-estimates of multivariate location and scatter. The Annals of Statistics, pages 2102–2119, 1991.
 Zhang et al. [2013] Teng Zhang, Ami Wiesel, and Maria Sabrina Greco. Multivariate generalized Gaussian distribution: Convexity and graphical models. IEEE Transactions on Signal Processing, 61(16):4141–4148, 2013.
 Sra and Hosseini [2013] Suvrit Sra and Reshad Hosseini. Geometric optimisation on positive definite matrices for elliptically contoured distributions. In Advances in Neural Information Processing Systems, pages 2562–2570, 2013.
 Hosseini and Sra [2015] Reshad Hosseini and Suvrit Sra. Matrix manifold optimization for Gaussian mixtures. In Advances in Neural Information Processing Systems, pages 910–918, 2015.
 Hosseini and Sra [2017] Reshad Hosseini and Suvrit Sra. An alternative to EM for Gaussian mixture models: Batch and stochastic Riemannian optimization. arXiv preprint arXiv:1706.03267, 2017.
 Hiai and Petz [2009] Fumio Hiai and Dénes Petz. Riemannian metrics on positive definite matrices related to means. Linear Algebra and its Applications, 430(11–12):3105–3130, 2009.
 Skovgaard [1984] Lene Theil Skovgaard. A Riemannian geometry of the multivariate normal model. Scandinavian Journal of Statistics, pages 211–223, 1984.
 Duembgen and Tyler [2016] Lutz Duembgen and David E Tyler. Geodesic convexity and regularized scatter estimators. arXiv preprint arXiv:1607.05455, 2016.
 Amari [1998] Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
 Hosseini and Mash’al [2015] Reshad Hosseini and Mohamadreza Mash’al. MixEst: An estimation toolbox for mixture models. arXiv preprint arXiv:1507.06065, 2015.
 Lindsay [1995] Bruce G Lindsay. Mixture models: theory, geometry and applications. In NSF-CBMS regional conference series in probability and statistics, pages i–163. JSTOR, 1995.
 Sun et al. [2010] Jianyong Sun, Ata Kaban, and Jonathan M Garibaldi. Robust mixture modeling using the Pearson type VII distribution. In Neural Networks (IJCNN), The 2010 International Joint Conference on, pages 1–7. IEEE, 2010.
 Karlis and Meligkotsidou [2007] Dimitris Karlis and Loukia Meligkotsidou. Finite mixtures of multivariate Poisson distributions with application. Journal of Statistical Planning and Inference, 137(6):1942–1960, 2007.
 López-Rubio [2011] Ezequiel López-Rubio. Stochastic approximation learning for mixtures of multivariate elliptical distributions. Neurocomputing, 74(17):2972–2984, 2011.
 Browne et al. [2012] Ryan P Browne, Paul D McNicholas, and Matthew D Sparling. Model-based learning using a mixture of mixtures of Gaussian and uniform distributions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(4):814–817, 2012.
 Absil et al. [2009] P-A Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.
 Amari and Nagaoka [2007] Shun-ichi Amari and Hiroshi Nagaoka. Methods of information geometry, volume 191. American Mathematical Soc., 2007.
 RAO [1945] CR RAO. Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc., 37:81–91, 1945.
 Atkinson and Mitchell [1981] Colin Atkinson and Ann FS Mitchell. Rao’s distance measure. Sankhyā: The Indian Journal of Statistics, Series A, pages 345–365, 1981.
 Amari [1982] Shun-Ichi Amari. Differential geometry of curved exponential families-curvatures and information loss. The Annals of Statistics, pages 357–385, 1982.
 Wiesel [2012] Ami Wiesel. Geodesic convexity and covariance estimation. IEEE Transactions on Signal Processing, 60(12):6182–6189, 2012.
 Sra and Hosseini [2015] Suvrit Sra and Reshad Hosseini. Conic geometric optimization on the manifold of positive definite matrices. SIAM Journal on Optimization, 25(1):713–739, 2015.
 Cambanis et al. [1981] Stamatis Cambanis, Steel Huang, and Gordon Simons. On the theory of elliptically contoured distributions. Journal of Multivariate Analysis, 11(3):368–385, 1981.
 Frahm [2004] Gabriel Frahm. Generalized elliptical distributions: Theory and applications. PhD thesis, Universität zu Köln, 2004.
 Jeuris et al. [2012] Ben Jeuris, Raf Vandebril, and Bart Vandereycken. A survey and comparison of contemporary algorithms for computing the matrix geometric mean. Electronic Transactions on Numerical Analysis, 39:379–402, 2012.
 Sra [2012] Suvrit Sra. A new metric on the manifold of kernel matrices with application to matrix geometric means. In Advances in Neural Information Processing Systems, pages 144–152, 2012.
 Jayasumana et al. [2013] Sadeep Jayasumana, Richard Hartley, Mathieu Salzmann, Hongdong Li, and Mehrtash Harandi. Kernel methods on the Riemannian manifold of symmetric positive definite matrices. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 73–80. IEEE, 2013.
 Faraki et al. [2018] Masoud Faraki, Mehrtash T Harandi, and Fatih Porikli. A comprehensive look at coding techniques on Riemannian manifolds. IEEE Transactions on Neural Networks and Learning Systems, 2018.
 Bhatia [2009] Rajendra Bhatia. Positive definite matrices. Princeton University Press, 2009.
 Burbea and Rao [1982] Jacob Burbea and C. Radhakrishna Rao. Entropy differential metric, distance and divergence measures in probability spaces: A unified approach. Journal of Multivariate Analysis, 12(4):575–596, 1982.
 Mitchell [1989] Ann FS Mitchell. The information matrix, skewness tensor and α-connections for the general multivariate elliptic distribution. Annals of the Institute of Statistical Mathematics, 41(2):289–304, 1989.
 Calvo and Oller [1990] Miquel Calvo and Josep M Oller. A distance between multivariate normal distributions based in an embedding into the Siegel group. Journal of Multivariate Analysis, 35(2):223–242, 1990.
 Barndorff-Nielsen et al. [1982] Ole Barndorff-Nielsen, John Kent, and Michael Sørensen. Normal variance-mean mixtures and z distributions. International Statistical Review/Revue Internationale de Statistique, pages 145–159, 1982.
 Ollila et al. [2012] Esa Ollila, David E Tyler, Visa Koivunen, and H Vincent Poor. Complex elliptically symmetric distributions: Survey, new results and applications. IEEE Transactions on Signal Processing, 60(11):5597–5625, 2012.
 Stefanski [1991] Leonard A Stefanski. A normal scale mixture representation of the logistic distribution. Statistics & Probability Letters, 11(1):69–70, 1991.
 Dümbgen et al. [2015] Lutz Dümbgen, Markus Pauly, and Thomas Schweizer. M-functionals of multivariate scatter. Statistics Surveys, 9:32–105, 2015.
 Ollila and Tyler [2014] Esa Ollila and David E Tyler. Regularized estimators of scatter matrix. IEEE Transactions on Signal Processing, 62(22):6059–6070, 2014.
 Diebolt and Robert [1994] Jean Diebolt and Christian P Robert. Estimation of finite mixture distributions through Bayesian sampling. Journal of the Royal Statistical Society. Series B (Methodological), pages 363–375, 1994.
 Arbelaez et al. [2011] Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour detection and hierarchical image segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 33(5):898–916, May 2011. ISSN 0162-8828. doi: 10.1109/TPAMI.2010.161. URL http://dx.doi.org/10.1109/TPAMI.2010.161.
 Vedaldi and Fulkerson [2008] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms. http://www.vlfeat.org/, 2008.