A universal framework for learning based on the elliptical mixture model (EMM)
The increasing prominence of unbalanced and noisy data highlights the importance of elliptical mixture models (EMMs), which exhibit enhanced robustness, flexibility and stability over the widely applied Gaussian mixture model (GMM). However, existing studies of the EMM are typically of an ad hoc nature, without a universal analysis framework or existence and uniqueness considerations. To this end, we propose a general framework for estimating the EMM, which makes use of Riemannian manifold optimisation to convert the original constrained optimisation paradigm into an unconstrained one. We first revisit the statistics of elliptical distributions, to give a rationale for the use of Riemannian metrics as well as for the reformulation of the problem in the Riemannian space. We then derive the EMM learning framework, based on Riemannian gradient descent, which attains the same optimum as the original problem while accelerating convergence. We also unify the treatment of the existing elliptical distributions to build a universal EMM, providing a simple and intuitive way to deal with the non-convex nature of this optimisation problem. Numerical results demonstrate the ability of the proposed framework to accommodate EMMs with different properties of individual functions, and also verify the robustness and flexibility of the proposed framework over the standard GMM.
Shengxi Li, Department of Electrical and Electronic Engineering, Imperial College London, London, SW7 2AZ, firstname.lastname@example.org
Danilo P. Mandic, Department of Electrical and Electronic Engineering, Imperial College London, London, SW7 2AZ, email@example.com
Zeyang Yu, Department of Electrical and Electronic Engineering, Imperial College London, London, SW7 2AZ, z.yu17@imperial.ac.uk
Preprint. Work in progress.
1 Introduction
Finite mixture models have a prominent role in statistical machine learning, as they provide enhanced probabilistic awareness in many learning paradigms, including clustering, feature extraction and density estimation. This is achieved in a very intuitive and elegant way, through a linear combination of well-understood distributions, which is powerful enough to approximate arbitrarily complex distributions. The Gaussian mixture model (GMM) is the most widely used such model; its popularity stems from its simple formulation and the conjugate property of the Gaussian distribution. Despite their mathematical elegance, standard Gaussian-based mixture model estimators are subject to robustness issues: even a slight deviation from the Gaussian assumption, or a single outlier, can significantly degrade the performance or even break down the estimator. Alternative mixture models are therefore being actively sought for robust learning.
Another rapidly emerging issue in modern applications is the requirement for flexibility in mixture models. This is due to the rapid emergence of multi-faceted data, which are almost invariably unbalanced; sources of such imbalance may be the vastly different natures of the data channels involved, different powers in the constituent channels, or temporal misalignment. Another less obvious but equally important obstacle to the use of current mixture models is that of the different scales of information within multivariate data; for example, in biomedical recordings, respiration and heart beats occupy totally different scales, of Hz and 1-3 Hz respectively, but their harmonics overlap spectrally.
An important class of multivariate models is that of elliptical distributions, which are quite general and flexible and include as special cases a range of standard distributions, such as the Gaussian distribution, the exponential family and the $t$-distribution. A desirable property of elliptical distributions is their robustness; indeed, their use results in robust M-estimators, thus making them a natural candidate for robust mixture modelling. In addition to robustness, it is reported that members of the elliptical mixture model (EMM) class can also effectively mitigate the singular covariance problem experienced in the GMM. Moreover, EMMs are more flexible in capturing intrinsic data structures than the GMM, as the EMM can even use different types of distributions in a single mixture. By virtue of their robustness and flexibility, EMMs are therefore well suited to dealing with data acquired from imperfect sensors, a typical case in modern applications.
Existing mixture models related to elliptical distributions are most frequently based on the $t$-distribution [7; 8; 9], the Laplace distribution, or the hyperbolic distribution. Table 1 summarises the existing results which adopt elliptical distributions that belong to the class of scale mixtures of normals, where the expectation-maximisation (EM) process, employed in model tuning, is guaranteed to converge. Despite all their desirable properties, a general estimation method for fitting arbitrary elliptical distributions is still lacking.
The development of a general method for estimating the EMM, however, is non-trivial, owing to both theoretical and practical difficulties; for example, unlike in the GMM, there is no closed-form solution for the maximisation step within EMM learning. Specifically, the convergence of the iterative re-weighting algorithm, the de facto standard in the estimation of elliptical distributions, requires constraints on both the functional forms of elliptical distributions and the data structure. For more detail, readers are referred to . Although these limitations have recently been somewhat relaxed [14; 15], applications of the EMM are still severely restricted.
To this end, we consider Riemannian manifold optimisation for parameter estimation in this context, which has proven to be extremely effective in problems related to positive definite matrices, as it naturally casts a general ill-posed constrained problem onto that of optimising on a convex half-cone, which can be solved via the vector space of matrices. In contrast, it is always difficult to handle the positive definite constraint in the Euclidean coordinates. Along this direction, Hosseini and Sra successfully applied gradient descent along the Riemannian manifold to the GMM problem, and achieved fast convergence speed without any sacrifice in the accuracy [16; 17]. It is therefore natural to ask, whether a general estimation method based on the EMM can be approached from the manifold optimisation perspective?
1.1 Challenges and contributions
The first step towards our aim to introduce a class of feasible and computable EMMs is to define a proper Riemannian metric for elliptical distributions, as the metric completely determines the optimisation procedure of the EMM. A wide variety of works related to positive definite matrices adopt an intrinsic Riemannian metric which comes from the statistics (the Hessian of entropy  or the Fisher information ) of multivariate normal distributions. Such a metric is also adopted in [16; 17] for estimating the GMM. It is therefore natural to first investigate whether such a “Gaussian”-based metric is an appropriate choice for the EMM. To this end, we start from the statistics, and first assess the rationale of this metric. Then, in addition to the covariance matrices, EMM also needs to estimate the location vector. Location-covariance estimation is typically more complicated but less theoretically supported compared with the covariance problem . A common current strategy is to reformulate the location-estimation problem into that of solely covariance estimation with one more dimension . As reported in [16; 17], this reformulation significantly accelerates the convergence speed. For the EMM, the reformulation is not direct due to the non-existence of a closed-form representation. We thus develop the corresponding reformulation for the EMM, and further find that such a reformulation during manifold optimisation attains the same metric of a natural gradient  descent for the location vector and the standard covariance estimation. Finally, we propose a general estimation method for the EMM, which overcomes the above limitations [14; 15].
Robust EMM estimation is therefore badly needed and is rapidly emerging; for example, a toolbox in  which was originally designed for the GMM , has already included several types of elliptical distributions. However, the existing toolbox has not been generalised to the EMM. This paper therefore sets out to fill the void in the literature by rigorously establishing a whole new unifying framework for the analysis of EMMs, thus opening a new avenue for practical approaches based on real-world data. Unlike the current inconsistent solutions, the proposed framework is generic and can be considered a natural generalisation of parametrisation from the GMM. Our contributions can be summarised in the following:
We rigorously unify typical elliptical distributions by means of their intrinsic relationships, which enables simple ways to generate samples and conduct further analysis;
We introduce a Riemannian metric for the location vector and the covariance matrices within elliptical distributions, which provides us with further understanding of this reformulation. The approach is shown to admit straightforward physical interpretability and to include asymmetric distributions in a seamless and natural way;
The proposed estimation approach for the EMM is general and generic, and includes the mean-shift algorithm as one of the special cases.
1.2 Related works
GMM-based estimation is well established and its importance has been widely acknowledged in the machine learning community. Since our focus is on the EMM, we omit a review of the GMM; readers are referred to  for a comprehensive review. To robustify the GMM, mixtures of the $t$-distribution have been thoroughly studied [7; 8; 9], on the basis of a generalised EM algorithm (expectation-conditional maximisation). A more general mixture model has been proposed in  based on the Pearson type VII distribution (which includes the $t$-distribution as a special case). Moreover, as the transformed coefficients in the wavelet domain tend to be Laplace distributed, a mixture of Laplace distributions has been proposed in  for image denoising. Its more general version, a mixture of hyperbolic distributions, has also been recently introduced in . Typically, these approaches employ generalised EM algorithms because, contrary to the GMM, there is no closed-form solution at each maximisation step. Fortunately, the above distributions belong to the class of scale mixtures of normals, which can be regarded as a convolution of a Gamma distribution and a Gaussian distribution, and this ensures the convergence of the generalised EM algorithm. However, these approaches lack generality; for other elliptical distributions, for example, convergence is no longer guaranteed. It is important to notice that, despite several attempts, current mixture models, including [25; 26; 27], are of a rather ad hoc nature.
For a comprehensive text on the optimisation on the Riemannian manifold, we refer to , together with a seminal book on information geometry by Amari . We here mainly focus on manifold optimisation of positive definite matrices. Specifically, pioneering in this direction is the work of Rao, which introduced the Rao distance to define the statistical difference between two multivariate normal distributions . This work was later generalised by [19; 31; 32]. In the last decade, Wiesel proved the convergence of the iterative reweighting algorithm in  via the concept of geodesic convexity, and Zhang et al. further relaxed the convergence conditions in . Sra and Hosseini  provided similar results from another perspective of the Riemannian manifold. For more details on the manifold of positive definite matrices, readers are referred to comprehensive works in [34; 20]. Recently, Hosseini and Sra directly adopted the gradient descent on the Riemannian manifold for estimating the GMM, and achieved significant improvement over the traditional EM algorithm [16; 17].
2 Preliminaries and notations
We first provide a brief introduction and notations of elliptical distributions, focusing especially on their relationships with commonly used distributions in statistical machine learning. Then, several key concepts in manifold optimisation are presented.
2.1 Elliptical distributions
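A random variable $\mathbf{x} \in \mathbb{R}^{M}$ is said to have an elliptical distribution if and only if it admits the following stochastic representation ,
$$\mathbf{x} \overset{d}{=} \boldsymbol{\mu} + \mathcal{R}\,\boldsymbol{\Lambda}\,\mathbf{u}, \tag{1}$$
where $\mathcal{R} \geq 0$ is a non-negative real scalar random variable which models the tail properties of the elliptical distribution, $\mathbf{u} \in \mathbb{R}^{M'}$ is a random vector that is uniformly distributed on the unit spherical surface, $\boldsymbol{\mu} \in \mathbb{R}^{M}$ is a location (mean) vector, while $\boldsymbol{\Lambda} \in \mathbb{R}^{M \times M'}$ is a matrix that transforms $\mathbf{u}$ from a sphere to an ellipse, and the symbol "$\overset{d}{=}$" designates "the same distribution". For a comprehensive review of elliptical distributions, we refer to [5; 36].
Note that an elliptical distribution does not necessarily possess an explicit pdf, but can always be formulated by its characteristic function. However, when $\mathrm{rank}(\boldsymbol{\Lambda}) = M$, that is, for a non-singular scatter matrix $\boldsymbol{\Sigma} = \boldsymbol{\Lambda}\boldsymbol{\Lambda}^{T}$, the pdf for elliptical distributions does exist and has the following form
$$p(\mathbf{x}) = c_M\, |\boldsymbol{\Sigma}|^{-1/2}\, g\big((\mathbf{x}-\boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\big), \tag{2}$$
where $g(\cdot)$ is called the density generator and $c_M$ is a constant solely related to the dimension $M$.
Observe that the term $c_M$ in (2) serves as a normalisation term, while for $g(t) \propto \exp(-t/2)$ the pdf in (2) becomes that of the multivariate Gaussian distribution, thus indicating the generality of elliptical distributions.
For simplicity, the elliptical distribution in (2) will be denoted by $\mathcal{E}(\boldsymbol{\mu}, \boldsymbol{\Sigma}, g)$.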
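As a worked instance of the density in (2), two familiar members of the elliptical family can be written out explicitly. The notation below ($t$ for the quadratic form, $g$ for the density generator, $c_M$ for the normalising constant, $\nu$ for the degrees of freedom) is assumed for illustration and may differ from the symbols used elsewhere in this paper.

```latex
% Quadratic form (squared Mahalanobis distance), assumed notation
t = (\mathbf{x}-\boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})

% Gaussian distribution
g(t) = \exp(-t/2), \qquad c_M = (2\pi)^{-M/2}

% Student-t distribution with \nu degrees of freedom
g(t) = \left(1 + \frac{t}{\nu}\right)^{-\frac{\nu+M}{2}}, \qquad
c_M = \frac{\Gamma\!\left(\frac{\nu+M}{2}\right)}
           {\Gamma\!\left(\frac{\nu}{2}\right)(\nu\pi)^{M/2}}
```

In both cases the density is $c_M\,|\boldsymbol{\Sigma}|^{-1/2}\,g(t)$, and letting $\nu \to \infty$ in the Student-t generator recovers the Gaussian one.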
2.2 Riemannian manifold
A Riemannian manifold (, ) is a smooth (differentiable) manifold (i.e., locally homeomorphic to the Euclidean space) equipped with a smoothly varying inner product on its tangent space. This inner product defines the Riemannian metric on the tangent space, so that the length of a curve and the angle between two vectors can be defined accordingly. Locally shortest curves on the manifold are called geodesics; they exhibit constant instantaneous speed and generalise straight lines in the Euclidean space. The distance between two points on the manifold is defined as the minimum length of all geodesics connecting these two points.
We use the symbol to denote the tangent space at the point , which is the first-order approximation of the manifold at that point. Consequently, vectors in the tangent space generalise the directional derivative, and the Riemannian gradient of a function is defined through the equivalence between its inner product with an arbitrary tangent vector and the Fréchet derivative of the function at that point. Moreover, a smooth mapping from the tangent space to the manifold is called a retraction, of which the exponential mapping, which yields the point reached along the geodesic in a given direction, is a special case. Because the tangent spaces vary across different points on the manifold, parallel transport across different tangent spaces can be introduced on the basis of the Levi-Civita connection, which preserves the inner product and norm. In this way, we can convert a complex optimisation problem on the manifold into one in a more analysis-friendly space, that is, the tangent space.
For covariance matrices, or more generally, positive definite matrices, although various metrics have been designed for measuring the distance between matrices [37; 38; 39; 40], not all of them arise from a smoothly varying inner product (i.e., from a Riemannian manifold), which would consequently give a “true” geodesic distance. The most popular such metric comes from the statistical manifold, in which each point is defined as a probability distribution. The inner product in such a manifold was adopted by Skovgaard  to measure dissimilarities through the covariance matrices of two multivariate Gaussian distributions, in the form of , and its effectiveness has been comprehensively verified [16; 17; 34; 37]. It is also possible to obtain a closed-form solution for the geodesic between two positive definite matrices and , , to yield its geodesic distance . A geodesic convex function can be defined as with . We should point out that geodesic convexity of a function ensures that any local optimum is also a global optimum, although such a function may not be convex in the Euclidean sense.
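To make the geodesic distance concrete, the following minimal sketch evaluates it for 2x2 positive definite matrices. It assumes the standard affine-invariant form, in which the distance is the square root of the sum of squared logarithms of the eigenvalues of $\boldsymbol{\Sigma}_1^{-1}\boldsymbol{\Sigma}_2$; any constant factor attached to the metric in the text is ignored, and the helper names `inv2`, `matmul2` and `geodesic_distance` are illustrative only.

```python
import math


def inv2(a):
    """Inverse of a 2x2 matrix given as ((a11, a12), (a21, a22))."""
    (p, q), (r, s) = a
    det = p * s - q * r
    return ((s / det, -q / det), (-r / det, p / det))


def matmul2(a, b):
    """Product of two 2x2 matrices."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )


def geodesic_distance(a, b):
    """Affine-invariant distance between 2x2 SPD matrices a and b.

    Uses d(a, b) = sqrt(sum_i log(lambda_i)^2), where lambda_i are the
    eigenvalues of a^{-1} b (real and positive when a and b are SPD).
    """
    m = matmul2(inv2(a), b)
    tr = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    # Eigenvalues of a 2x2 matrix from its trace and determinant.
    disc = math.sqrt(max(tr * tr - 4.0 * det, 0.0))
    lam1, lam2 = (tr + disc) / 2.0, (tr - disc) / 2.0
    return math.sqrt(math.log(lam1) ** 2 + math.log(lam2) ** 2)


identity = ((1.0, 0.0), (0.0, 1.0))
a = ((2.0, 0.5), (0.5, 1.0))
b = ((1.0, -0.3), (-0.3, 3.0))
print(geodesic_distance(identity, a), geodesic_distance(a, b))
```

The distance is zero only when the two matrices coincide and is symmetric in its arguments, which is what one expects from a true geodesic distance.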
3 Statistical Riemannian metrics for elliptical distributions
Typically, in the manifold , the metric has been widely defined through the Fisher information, which results in the natural gradient in manifold gradient descent  and represents the information geodesic measure, as outlined in the seminal book by Amari . However, for positive definite matrices, the natural gradient is not guaranteed to be explicitly obtainable . Alternatively, Burbea and Rao introduced the “entropy differential metric”  on the basis of entropy, which was later used by Hiai and Petz to define the Riemannian metric for positive definite matrices . It needs to be pointed out that the intrinsic metric, that is, , is obtained via the Hessian of the Boltzmann entropy of multivariate normal distributions.
This allows us here to calculate the corresponding Riemannian metrics for the elliptical distributions.
Consider the class of elliptical distributions, . Then, the Riemannian metric for the location vector is given by
and the Riemannian metric for the covariance by
The Riemannian metric for the mean vectors is directly obtained from the Fisher information matrix . To obtain the Riemannian metric for the covariance matrix, we first calculate the Hessian of the Boltzmann entropy, as follows,
Because is irrelevant to , the Hessian of can be calculated
The Riemannian metric can thus be obtained as , which is the same as in the case of multivariate normal distributions and is the most widely used metric. ∎
Finally, on the basis of , we can now provide the following treatment for the elliptical distributions.
Consider the class of elliptical distributions, . Then, upon reformulating and as
gives the following Riemannian metric for the reformulated covariance,
From Theorem 1, we can see that, after reformulation, manifold optimisation is actually performed under the same Riemannian metric as a simultaneous estimation of the location and the covariance in their respective Riemannian manifolds. This provides another perspective for understanding the proposed reformulation and enhances its physical interpretability.
4 Manifold optimisation for the elliptical mixture model (EMM)
We next provide a concise summary of elliptical distributions, in order to provide clarity in handling different types of elliptical distributions in later sections. Then, we lay out the EMM optimisation problem, followed by the reformulation and manifold optimisation.
4.1 Elliptical family of distributions
The elliptical family of distributions is quite general, and includes many widely used standard distributions as special cases, e.g., the Gaussian distribution. A comprehensive summary can be found in Chapter 3 in , but involves complicated closed-form formulations for each type of elliptical distribution. In addition, the open literature employs different notations and formulations to categorise these distributions, which may lead to confusion. To this end, we here first provide a unifying summary of elliptical distributions, which is achieved through the stochastic representation in (1). This makes it possible to avoid complicated formulations, and instead to classify different categories simply through several typical distributions of . Uniquely, this makes it possible to generate high-dimensional samples from the one-dimensional for a range of elliptical distributions, and also further clarifies the commonalities between the members of the elliptical family of distributions.
In general, according to (1), an arbitrary elliptical distribution can be represented by and . As the uniformly distributed only relates to the dimension , we focus on the parameter , or equivalently, , in order to provide a unifying summary of typical elliptical distributions (see footnote 1), which are listed in Table 1 (see footnote 2). The proof of this is straightforward and can be achieved by direct validation; we omit it here due to space limitations.
Footnote 1: The term is frequently used in practice because it has the same distribution as the quadratic form (i.e., the Mahalanobis distance).
Footnote 2: The symbol represents the Gamma distribution, ; is the generalised inverse Gaussian distribution, ; denotes the Kolmogorov-Smirnov distribution, ; is the Beta distribution, and is the symmetric $\alpha$-stable distribution with index . In addition, is the Bessel function of the second kind, whilst is the Gamma function.
|Types||&||Typical Multivariate Distributions|
|Kotz Type ||, ,||Gamma dist.:|
|Generalised Gaussian dist.:|
|Scale Mixture Type ,||Pearson Type VII ||-dist.:|
|,||Cauchy dist.: ,|
|Hyperbolic Type  ,||,||Inverse-Gaussian dist.:|
|Other Types [12; 47]||Logistic dist.|
|Pearson Type II ||,||,||–|
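The idea of generating multivariate samples from a one-dimensional radius can be sketched as follows for the two-dimensional case. The radius distributions used here are the standard ones for the Gaussian and Student-t cases (a chi-square variable, and a ratio of chi-square variables, respectively); the function names, the choice $\nu = 3$ and the matrix `lam` are illustrative assumptions rather than the settings of this paper.

```python
import math
import random


def sample_elliptical(mu, lam, radius_sampler, rng):
    """Draw one 2-D sample x = mu + R * lam * u (stochastic representation)."""
    theta = rng.uniform(0.0, 2.0 * math.pi)
    u = (math.cos(theta), math.sin(theta))  # uniform on the unit circle
    r = radius_sampler(rng)                 # non-negative scalar radius R
    return tuple(
        mu[i] + r * (lam[i][0] * u[0] + lam[i][1] * u[1]) for i in range(2)
    )


def gaussian_radius(rng, dim=2):
    # Gaussian case: R^2 ~ chi-square(dim) = Gamma(shape=dim/2, scale=2)
    return math.sqrt(rng.gammavariate(dim / 2.0, 2.0))


def student_t_radius(rng, dim=2, nu=3.0):
    # Student-t case: R^2 = chi2(dim) / (chi2(nu) / nu)
    num = rng.gammavariate(dim / 2.0, 2.0)
    den = rng.gammavariate(nu / 2.0, 2.0) / nu
    return math.sqrt(num / den)


rng = random.Random(0)
mu = (1.0, -2.0)
lam = ((2.0, 0.0), (0.5, 1.0))  # hypothetical mixing matrix
gauss = [sample_elliptical(mu, lam, gaussian_radius, rng) for _ in range(20000)]
heavy = [sample_elliptical(mu, lam, student_t_radius, rng) for _ in range(20000)]
mean_gauss = tuple(sum(p[i] for p in gauss) / len(gauss) for i in range(2))
print("Gaussian sample mean:", mean_gauss)
```

Only `radius_sampler` changes between the two families, while the uniform direction and the linear map are shared, which is the commonality that a summary in terms of the radius is meant to expose.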
4.2 The elliptical mixture model (EMM)
Generally, we assume the EMM consists of mixtures, each elliptically distributed. To make the proposed EMM flexible enough to capture inherent structures in data, in our framework it is not necessary for every elliptical distribution to have the same density generator (denoted by ).
In finite mixture models, latent variables are binary, to represent membership to the -th mixture. The probability of choosing the -th mixture is denoted by , so that and . Upon rearranging the scalars into a vector , we can further simply write . For a set of observed i.i.d samples , the negative log-likelihood can be obtained as
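A negative log-likelihood of this kind can be sketched for the one-dimensional case as below, with one Gaussian and one Student-t component to illustrate that the components need not share a density generator. The function names and the parameter values are hypothetical, and a log-sum-exp step is used purely for numerical stability.

```python
import math


def gaussian_logpdf(x, mu, sigma2):
    """Log-density of a 1-D Gaussian with location mu and variance sigma2."""
    t = (x - mu) ** 2 / sigma2
    return -0.5 * math.log(2.0 * math.pi * sigma2) - 0.5 * t


def student_t_logpdf(x, mu, sigma2, nu):
    """Log-density of a 1-D Student-t with location mu, scale sigma2, dof nu."""
    t = (x - mu) ** 2 / sigma2
    log_c = (math.lgamma((nu + 1.0) / 2.0) - math.lgamma(nu / 2.0)
             - 0.5 * math.log(nu * math.pi))
    return log_c - 0.5 * math.log(sigma2) - 0.5 * (nu + 1.0) * math.log1p(t / nu)


def emm_negative_log_likelihood(samples, weights, component_logpdfs):
    """Return -sum_n log sum_k pi_k p_k(x_n), using log-sum-exp."""
    total = 0.0
    for x in samples:
        terms = [math.log(w) + logpdf(x)
                 for w, logpdf in zip(weights, component_logpdfs)]
        m = max(terms)
        total -= m + math.log(sum(math.exp(v - m) for v in terms))
    return total


samples = [-0.2, 0.1, 0.3, 2.8, 3.1, 9.0]
weights = [0.6, 0.4]
components = [
    lambda x: gaussian_logpdf(x, 0.0, 1.0),
    lambda x: student_t_logpdf(x, 3.0, 0.5, 1.0),
]
nll = emm_negative_log_likelihood(samples, weights, components)
print(nll)
```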
4.3 Manifold optimisation and reformulation
The estimation of , and requires the minimisation of in (9), which is not possible to achieve in closed form. We therefore proceed to introduce the manifold optimisation for the EMM framework, by first reformulating the terms and to on the basis of Theorem 2.
The proof of the first property rests upon a generalisation of the result in the GMM , and its proof is analogous to that of Theorem in . In fact, for Gaussian distribution, leads to , which is the reformulation adopted in the GMM in .
The second property can be proved via the relationship . This relationship can be easily verified through a decomposition of to . ∎
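The kind of relationship on which such a reformulation rests can be checked numerically in the Gaussian case. The sketch below assumes the augmentation used by Hosseini and Sra for the GMM, namely $\mathbf{y} = [\mathbf{x}; 1]$ and $\mathbf{S} = [\boldsymbol{\Sigma} + \boldsymbol{\mu}\boldsymbol{\mu}^{T}, \boldsymbol{\mu}; \boldsymbol{\mu}^{T}, 1]$, restricted to a scalar observation; the EMM reformulation in this paper may differ in detail, and all function names are illustrative.

```python
def augmented_scatter(mu, sigma2):
    """S = [[sigma2 + mu^2, mu], [mu, 1]] for a scalar location and variance."""
    return ((sigma2 + mu * mu, mu), (mu, 1.0))


def det2(s):
    """Determinant of a 2x2 matrix."""
    return s[0][0] * s[1][1] - s[0][1] * s[1][0]


def quadratic_form_inverse(s, y):
    """Return y^T S^{-1} y for a 2x2 matrix S and a 2-vector y."""
    d = det2(s)
    inv = ((s[1][1] / d, -s[0][1] / d), (-s[1][0] / d, s[0][0] / d))
    return sum(y[i] * inv[i][j] * y[j] for i in range(2) for j in range(2))


mu, sigma2 = 1.5, 0.8
s = augmented_scatter(mu, sigma2)
for x in (-1.0, 0.0, 2.5):
    lhs = quadratic_form_inverse(s, (x, 1.0))   # augmented quadratic form
    rhs = (x - mu) ** 2 / sigma2 + 1.0          # Mahalanobis term plus one
    print(x, lhs, rhs)
print(det2(s), sigma2)                          # determinants agree
```

Under this assumed augmentation the quadratic form of the augmented problem equals the original squared Mahalanobis distance plus one, and the determinant is unchanged, so the location no longer appears as a separate parameter.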
On the basis of the reformulated , we have the Euclidean gradient for the reformulated :
Before moving to the Riemannian manifold optimisation, we shall first inspect the properties of the introduced gradient. The optimum value occurs when , and we arrive at . , that is, the M-estimators, decrease to 0 when the Mahalanobis distance increases. It is well known that the Mahalanobis distance is a scale-free metric, which is particularly suited to measuring outliers. Thus, samples with a large Mahalanobis distance result in small values of , and would have little impact on the , which therefore gives rise to the robustness of the EMM. Furthermore, the existence of also mitigates the problem of singular distributions during estimation.
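The decay of the M-estimator weights can be illustrated with the Student-t case. The sketch assumes the usual weight $w(t) = -2\,g'(t)/g(t)$, which gives $(\nu + M)/(\nu + t)$ for the Student-t generator and the constant 1 for the Gaussian one; the symbol names and the finite-difference check are illustrative additions.

```python
import math


def gaussian_weight(t):
    """Gaussian generator exp(-t/2): every sample gets the same weight."""
    return 1.0


def student_t_weight(t, nu, dim):
    """Student-t weight, decreasing in the squared Mahalanobis distance t."""
    return (nu + dim) / (nu + t)


def student_t_log_generator(t, nu, dim):
    """log g(t) for the Student-t generator (1 + t/nu)^(-(nu + dim)/2)."""
    return -0.5 * (nu + dim) * math.log1p(t / nu)


def numerical_weight(log_g, t, h=1e-6):
    """Central finite-difference estimate of -2 * d/dt log g(t)."""
    return -2.0 * (log_g(t + h) - log_g(t - h)) / (2.0 * h)


for t in (0.0, 1.0, 10.0, 100.0):
    print(t, gaussian_weight(t), student_t_weight(t, nu=3.0, dim=2))
```

A sample at a squared distance of 100 receives a far smaller weight than a sample at the centre under the Student-t model, whereas the Gaussian model weights both equally.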
Next, we obtain the Riemannian gradient for the manifold optimisation , through the retraction of
which is an approximation of the exponential mapping whilst ensuring computational feasibility. This makes it possible to implement the steepest manifold gradient descent algorithm to minimise iteratively. The convergence speed can be further improved via conjugate-gradient descent and L-BFGS, with the parallel transport given in .
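A steepest-descent iteration with a retraction can be sketched on a toy problem. The example below makes several assumptions that are not taken from this paper: it uses the second-order retraction $\boldsymbol{\Sigma} + \boldsymbol{\xi} + \tfrac{1}{2}\boldsymbol{\xi}\boldsymbol{\Sigma}^{-1}\boldsymbol{\xi}$, a common choice for positive definite matrices; it takes the Riemannian gradient as $\boldsymbol{\Sigma}\,\nabla f\,\boldsymbol{\Sigma}$; and it minimises the Gaussian cost $\log|\boldsymbol{\Sigma}| + \mathrm{tr}(\boldsymbol{\Sigma}^{-1}\mathbf{S})$ for a fixed 2x2 matrix $\mathbf{S}$ instead of the EMM cost. For this cost the Riemannian gradient simplifies to $\boldsymbol{\Sigma} - \mathbf{S}$.

```python
def matmul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def inverse(a):
    """Inverse of a 2x2 matrix."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det], [-a[1][0] / det, a[0][0] / det]]


def add(a, b):
    return [[a[i][j] + b[i][j] for j in range(2)] for i in range(2)]


def scale(a, c):
    return [[c * a[i][j] for j in range(2)] for i in range(2)]


def retract(sigma, xi):
    """Assumed retraction: sigma + xi + 0.5 * xi sigma^{-1} xi (stays SPD)."""
    second_order = matmul(matmul(xi, inverse(sigma)), xi)
    return add(add(sigma, xi), scale(second_order, 0.5))


def fit_scatter(sample_cov, sigma0, step=0.5, iters=60):
    """Riemannian steepest descent on log|Sigma| + tr(Sigma^{-1} S)."""
    sigma = sigma0
    for _ in range(iters):
        riemannian_grad = add(sigma, scale(sample_cov, -1.0))  # Sigma - S
        sigma = retract(sigma, scale(riemannian_grad, -step))
    return sigma


sample_cov = [[2.0, 0.5], [0.5, 1.0]]
estimate = fit_scatter(sample_cov, [[1.0, 0.0], [0.0, 1.0]])
print(estimate)
```

Because the retraction can be rewritten as $\tfrac{1}{2}\boldsymbol{\Sigma} + \tfrac{1}{2}(\boldsymbol{\Sigma} + \boldsymbol{\xi})\boldsymbol{\Sigma}^{-1}(\boldsymbol{\Sigma} + \boldsymbol{\xi})$, every iterate remains positive definite without any explicit constraint handling.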
When it comes to the estimation of , it is stated in  that it forms a product manifold with , in which gradient descent can be conducted in their respective manifolds. Therefore, for a given , we can solve for in closed form to yield , by setting its derivative to 0 and taking into account the constraint of .
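A closed-form update of the mixing probabilities can be sketched as follows. The code assumes the familiar stationarity condition in which each probability equals the average posterior membership of its component; whether this matches the exact expression intended above is an assumption, and the one-dimensional densities and function names are illustrative.

```python
import math


def gaussian_pdf(x, mu):
    """Unit-variance 1-D Gaussian density."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)


def cauchy_pdf(x, mu):
    """Unit-scale 1-D Cauchy density (Student-t with one degree of freedom)."""
    return 1.0 / (math.pi * (1.0 + (x - mu) ** 2))


def responsibilities(x, weights, pdfs):
    """Posterior membership probabilities of one sample."""
    joint = [w * pdf(x) for w, pdf in zip(weights, pdfs)]
    total = sum(joint)
    return [j / total for j in joint]


def update_mixing_weights(samples, weights, pdfs):
    """New weights: average responsibility of each component."""
    sums = [0.0] * len(weights)
    for x in samples:
        for k, r in enumerate(responsibilities(x, weights, pdfs)):
            sums[k] += r
    return [s / len(samples) for s in sums]


samples = [-2.1, -1.9, -2.0, 1.9, 2.0, 2.1, 2.05, 1.95]
pdfs = [lambda x: gaussian_pdf(x, -2.0), lambda x: cauchy_pdf(x, 2.0)]
new_weights = update_mixing_weights(samples, [0.5, 0.5], pdfs)
print(new_weights)
```

The updated probabilities are non-negative and sum to one by construction, so the simplex constraint is satisfied without a separate projection step.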
We have shown that the EMM can relieve the problem of singular distributions; however, it cannot completely alleviate the problem for all elliptical distributions. We therefore resort to the regularisation of the covariance matrix, which essentially imposes sparsity on the precision matrix. While such regularisers cannot ensure geodesic convexity, we here follow the approach by Ollila and Tyler  and impose the as the regulariser, where controls the weight of the regulariser. In fact, similar regularisation forms can be obtained by adding an inverse-Wishart prior distribution on , followed by maximum a posteriori instead of maximum likelihood estimation . Specific to the EMM, the advantages of this regulariser are that it is strictly geodesically convex in and that solutions are ensured to exist for any data configuration . In this case, the reformulation turns to
Therefore, the are always full-rank, and thus completely avoid the singular distributions during estimation as desired.
When reaching the optimum, i.e., , we can obtain the following equations for and :
We can find from (11) that when increases, is more likely to be close to an identity matrix, in which case the estimation of reduces to the mean-shift algorithm.
5 Numerical results
We verified the proposed framework on both synthetic and real-world data. The synthetic dataset was generated according to  and consisted of two Gaussian-distributed clusters, with their location vectors satisfying
where is a constant that controls the separation. We set here for clear illustration. Each Gaussian-distributed cluster contained samples, and the small set of points centred at was treated as noise, which contained samples from a spherical Gaussian distribution. The elliptical distribution used was the $t$-distribution with , as in Table 1. Figure 1 shows the estimation results without (in orange) and with (in red) the noise.
As can be seen from this figure, the outliers at (10,0) have dramatically biased the location estimation in the GMM away from the ground-truth, while estimations based on the EMM model remain almost unchanged. This demonstrates the robustness of the EMM, compared with the GMM.
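The effect described above can be reproduced in miniature with a one-dimensional location estimate. The sketch below is a simplified stand-in rather than the experiment of this section: it uses hand-picked data with a single outlier at 10, a fixed unit scale, and a plain fixed-point iteration with Student-t weights ($\nu = 1$) instead of manifold optimisation; all names are illustrative.

```python
def sample_mean(data):
    """Gaussian maximum-likelihood location estimate."""
    return sum(data) / len(data)


def student_t_location(data, nu=1.0, iters=200):
    """Fixed-point iteration m <- sum(w x) / sum(w) with Student-t weights."""
    m = sample_mean(data)
    for _ in range(iters):
        # Weight (nu + 1) / (nu + squared distance), unit scale assumed.
        weights = [(nu + 1.0) / (nu + (x - m) ** 2) for x in data]
        m = sum(w * x for w, x in zip(weights, data)) / sum(weights)
    return m


data = [-1.0, -0.5, 0.0, 0.5, 1.0, 10.0]  # last point plays the outlier
gaussian_estimate = sample_mean(data)
robust_estimate = student_t_location(data)
print(gaussian_estimate, robust_estimate)
```

The sample mean is pulled towards the outlier, whereas the heavy-tailed estimate stays close to the centre of the five inlying points.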
Then, in order to assess the quality of features extracted by the EMM, we considered the probability of the location vector in image for reconstruction. We adopted five types of EMMs for comprehensive comparison: the $t$-distribution with (denoted by Tdist.1), the $t$-distribution with (denoted by Tdist.100), the Kotz type with (denoted by Kotz1), the Kotz type with (denoted by Kotz2), and the logistic distribution (denoted by Logi). We randomly chose 50 images from the BSDS500 database  for mixture modelling, and used the k-means implementation in the VLFeat toolbox  for initialisation. After estimating each EMM together with the GMM, the mean value was assigned to each pixel with regard to the posterior distribution of , followed by the reconstruction. We used the peak signal-to-noise ratio (PSNR) as the quality assessment metric, whereby a higher PSNR value indicates a lower reconstruction error, that is, more accurate and effective features captured by the elliptical mixture model. The results are presented in Table 2.
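For reference, the PSNR between an original and a reconstructed image can be computed as in the sketch below. It assumes 8-bit pixel values (peak 255) supplied as flat sequences; the function name and the tiny inputs are illustrative only.

```python
import math


def psnr(original, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(original) != len(reconstructed):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak * peak / mse)


original = [0, 64, 128, 255]
print(psnr(original, [0, 66, 127, 250]))
print(psnr(original, [10, 80, 100, 200]))
```

The second reconstruction deviates more from the original and therefore yields the lower PSNR.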
Some representative results are visualised in Figure 2. By comparing Figure 2-(b) and (c), we can see that because the $t$-distribution with is more heavy-tailed than that with , its reconstruction is clearer and it also reconstructs the details of the ship well. Furthermore, both Kotz1 and Kotz2 are not geodesic convex, which means they cannot be estimated via the iterative re-weighting algorithm. However, the proposed manifold optimisation ensures that an optimum for these distributions is found. Moreover, the Kotz1 model in Figure 2-(d) is the only one which is inferior to the GMM. This may be due to its lighter tails (), which indicates it is sensitive to the details. In contrast, the other elliptical distributions given here are all heavier-tailed than the Gaussian distribution, and are thus more robust to outliers. This all demonstrates the flexibility of the EMM framework, and the ability of the proposed manifold optimisation algorithm to provide a general solution to the EMM.
6 Conclusion
We have proposed a universal framework for estimating the EMM for the general case of unbalanced data and mixtures of different members of the class of elliptical distributions. We have revisited the statistics of elliptical distributions to justify the effectiveness of the Riemannian metrics adopted in the EMM learning process. We have also analysed the rationale for the problem reformulation under the framework of the Riemannian manifold, and have introduced its EMM version. The existing elliptical distributions have also been unified in this paper, to provide much needed flexibility in choosing the EMM. Numerical results have not only demonstrated the robustness and flexibility of the proposed EMM framework, but also further highlighted the physical interpretability and the effects of individual distributions on the EMM.
- Figueiredo and Jain  Mario A. T. Figueiredo and Anil K. Jain. Unsupervised learning of finite mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3):381–396, 2002.
- McLachlan and Peel  Geoffrey McLachlan and David Peel. Finite mixture models. John Wiley & Sons, 2004.
- Zoubir et al.  Abdelhak M Zoubir, Visa Koivunen, Yacine Chakhchoukh, and Michael Muma. Robust estimation in signal processing: A tutorial-style treatment of fundamental concepts. IEEE Signal Processing Magazine, 29(4):61–80, 2012.
- Mandic et al.  Danilo P Mandic, Dragan Obradovic, Anthony Kuh, Tülay Adali, Udo Trutschell, Martin Golz, Philippe De Wilde, Javier Barria, Anthony Constantinides, and Jonathon Chambers. Data fusion for modern engineering applications: An overview. In International Conference on Artificial Neural Networks, pages 715–721. Springer, 2005.
- Fang et al.  Kai Wang Fang, Samuel Kotz, and Kai Wang Ng. Symmetric multivariate and related distributions. London, U.K.: Chapman & Hall, 1990.
- Huber  Peter J Huber. Robust statistics. In International Encyclopedia of Statistical Science, pages 1248–1251. Springer, 2011.
- Peel and McLachlan  David Peel and Geoffrey J McLachlan. Robust mixture modelling using the t distribution. Statistics and computing, 10(4):339–348, 2000.
- Andrews and McNicholas  Jeffrey L Andrews and Paul D McNicholas. Model-based clustering, classification, and discriminant analysis via mixtures of multivariate t-distributions. Statistics and Computing, 22(5):1021–1029, 2012.
- Lin et al.  Tsung-I Lin, Paul D McNicholas, and Hsiu J Ho. Capturing patterns via parsimonious t mixture models. Statistics & Probability Letters, 88:80–87, 2014.
- Tan and Jiao  Shan Tan and Licheng Jiao. Multivariate statistical models for image denoising in the wavelet domain. International Journal of Computer Vision, 75(2):209–230, 2007.
- Browne and McNicholas  Ryan P Browne and Paul D McNicholas. A mixture of generalized hyperbolic distributions. Canadian Journal of Statistics, 43(2):176–198, 2015.
- Andrews and Mallows  David F Andrews and Colin L Mallows. Scale mixtures of normal distributions. Journal of the Royal Statistical Society. Series B (Methodological), pages 99–102, 1974.
- Kent and Tyler  John T Kent and David E Tyler. Redescending m-estimates of multivariate location and scatter. The Annals of Statistics, pages 2102–2119, 1991.
- Zhang et al.  Teng Zhang, Ami Wiesel, and Maria Sabrina Greco. Multivariate generalized Gaussian distribution: Convexity and graphical models. IEEE Transactions on Signal Processing, 61(16):4141–4148, 2013.
- Sra and Hosseini  Suvrit Sra and Reshad Hosseini. Geometric optimisation on positive definite matrices for elliptically contoured distributions. In Advances in Neural Information Processing Systems, pages 2562–2570, 2013.
- Hosseini and Sra  Reshad Hosseini and Suvrit Sra. Matrix manifold optimization for Gaussian mixtures. In Advances in Neural Information Processing Systems, pages 910–918, 2015.
- Hosseini and Sra  Reshad Hosseini and Suvrit Sra. An alternative to EM for Gaussian mixture models: Batch and stochastic riemannian optimization. arXiv preprint arXiv:1706.03267, 2017.
- Hiai and Petz  Fumio Hiai and Dénes Petz. Riemannian metrics on positive definite matrices related to means. Linear Algebra and its Applications, 430(11-12):3105–3130, 2009.
- Skovgaard  Lene Theil Skovgaard. A Riemannian geometry of the multivariate normal model. Scandinavian Journal of Statistics, pages 211–223, 1984.
- Duembgen and Tyler  Lutz Duembgen and David E Tyler. Geodesic convexity and regularized scatter estimators. arXiv preprint arXiv:1607.05455, 2016.
- Amari  Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
- Hosseini and Mash’al  Reshad Hosseini and Mohamadreza Mash’al. Mixest: An estimation toolbox for mixture models. arXiv preprint arXiv:1507.06065, 2015.
- Lindsay  Bruce G Lindsay. Mixture models: theory, geometry and applications. In NSF-CBMS regional conference series in probability and statistics, pages i–163. JSTOR, 1995.
- Sun et al.  Jianyong Sun, Ata Kaban, and Jonathan M Garibaldi. Robust mixture modeling using the Pearson type VII distribution. In Neural Networks (IJCNN), The 2010 International Joint Conference on, pages 1–7. IEEE, 2010.
- Karlis and Meligkotsidou  Dimitris Karlis and Loukia Meligkotsidou. Finite mixtures of multivariate Poisson distributions with application. Journal of Statistical Planning and Inference, 137(6):1942–1960, 2007.
- López-Rubio  Ezequiel López-Rubio. Stochastic approximation learning for mixtures of multivariate elliptical distributions. Neurocomputing, 74(17):2972–2984, 2011.
- Browne et al.  Ryan P Browne, Paul D McNicholas, and Matthew D Sparling. Model-based learning using a mixture of mixtures of Gaussian and uniform distributions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(4):814–817, 2012.
- Absil et al.  P-A Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.
- Amari and Nagaoka  Shun-ichi Amari and Hiroshi Nagaoka. Methods of information geometry, volume 191. American Mathematical Soc., 2007.
- Rao  C. R. Rao. Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc., 37:81–91, 1945.
- Atkinson and Mitchell  Colin Atkinson and Ann FS Mitchell. Rao’s distance measure. Sankhyā: The Indian Journal of Statistics, Series A, pages 345–365, 1981.
- Amari  Shun-Ichi Amari. Differential geometry of curved exponential families-curvatures and information loss. The Annals of Statistics, pages 357–385, 1982.
- Wiesel  Ami Wiesel. Geodesic convexity and covariance estimation. IEEE Transactions on Signal Processing, 60(12):6182–6189, 2012.
- Sra and Hosseini  Suvrit Sra and Reshad Hosseini. Conic geometric optimization on the manifold of positive definite matrices. SIAM Journal on Optimization, 25(1):713–739, 2015.
- Cambanis et al.  Stamatis Cambanis, Steel Huang, and Gordon Simons. On the theory of elliptically contoured distributions. Journal of Multivariate Analysis, 11(3):368–385, 1981.
- Frahm  Gabriel Frahm. Generalized elliptical distributions: Theory and applications. PhD thesis, Universität zu Köln, 2004.
- Jeuris et al.  Ben Jeuris, Raf Vandebril, and Bart Vandereycken. A survey and comparison of contemporary algorithms for computing the matrix geometric mean. Electronic Transactions on Numerical Analysis, 39:379–402, 2012.
- Sra  Suvrit Sra. A new metric on the manifold of kernel matrices with application to matrix geometric means. In Advances in Neural Information Processing Systems, pages 144–152, 2012.
- Jayasumana et al.  Sadeep Jayasumana, Richard Hartley, Mathieu Salzmann, Hongdong Li, and Mehrtash Harandi. Kernel methods on the Riemannian manifold of symmetric positive definite matrices. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 73–80. IEEE, 2013.
- Faraki et al.  Masoud Faraki, Mehrtash T Harandi, and Fatih Porikli. A comprehensive look at coding techniques on Riemannian manifolds. IEEE Transactions on Neural Networks and Learning Systems, 2018.
- Bhatia  Rajendra Bhatia. Positive definite matrices. Princeton University Press, 2009.
- Burbea and Rao  Jacob Burbea and C. Radhakrishna Rao. Entropy differential metric, distance and divergence measures in probability spaces: A unified approach. Journal of Multivariate Analysis, 12(4):575–596, 1982.
- Mitchell  Ann FS Mitchell. The information matrix, skewness tensor and α-connections for the general multivariate elliptic distribution. Annals of the Institute of Statistical Mathematics, 41(2):289–304, 1989.
- Calvo and Oller  Miquel Calvo and Josep M Oller. A distance between multivariate normal distributions based in an embedding into the Siegel group. Journal of Multivariate Analysis, 35(2):223–242, 1990.
- Barndorff-Nielsen et al.  Ole Barndorff-Nielsen, John Kent, and Michael Sørensen. Normal variance-mean mixtures and z distributions. International Statistical Review/Revue Internationale de Statistique, pages 145–159, 1982.
- Ollila et al.  Esa Ollila, David E Tyler, Visa Koivunen, and H Vincent Poor. Complex elliptically symmetric distributions: Survey, new results and applications. IEEE Transactions on Signal Processing, 60(11):5597–5625, 2012.
- Stefanski  Leonard A Stefanski. A normal scale mixture representation of the logistic distribution. Statistics & Probability Letters, 11(1):69–70, 1991.
- Dümbgen et al.  Lutz Dümbgen, Markus Pauly, and Thomas Schweizer. M-functionals of multivariate scatter. Statistics Surveys, 9:32–105, 2015.
- Ollila and Tyler  Esa Ollila and David E Tyler. Regularized M-estimators of scatter matrix. IEEE Transactions on Signal Processing, 62(22):6059–6070, 2014.
- Diebolt and Robert  Jean Diebolt and Christian P Robert. Estimation of finite mixture distributions through Bayesian sampling. Journal of the Royal Statistical Society. Series B (Methodological), pages 363–375, 1994.
- Arbelaez et al.  Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):898–916, 2011. doi: 10.1109/TPAMI.2010.161.
- Vedaldi and Fulkerson  A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms. http://www.vlfeat.org/, 2008.