Extreme deconvolution: Inferring complete distribution functions from noisy, heterogeneous and incomplete observations


J. Bovy and D. W. Hogg
Center for Cosmology and Particle Physics, Department of Physics, New York University, 4 Washington Place, New York, New York 10003, USA
E-mail: jo.bovy@nyu.edu, david.hogg@nyu.edu

S. T. Roweis
Courant Institute of Mathematical Sciences, New York University, 251 Mercer Street, New York, New York 10012, USA
E-mail: roweis@cs.nyu.edu

Received October 2009; revised November 2010.
Abstract

We generalize the well-known mixtures of Gaussians approach to density estimation and the accompanying Expectation–Maximization technique for finding the maximum likelihood parameters of the mixture to the case where each data point carries an individual $d$-dimensional uncertainty covariance and has unique missing data properties. This algorithm reconstructs the error-deconvolved or “underlying” distribution function common to all samples, even when the individual data points are samples from different distributions, obtained by convolving the underlying distribution with the heteroskedastic uncertainty distribution of the data point and projecting out the missing data directions. We show how this basic algorithm can be extended with conjugate priors on all of the model parameters and a “split-and-merge” procedure designed to avoid local maxima of the likelihood. We demonstrate the full method by applying it to the problem of inferring the three-dimensional velocity distribution of stars near the Sun from noisy two-dimensional, transverse velocity measurements from the Hipparcos satellite.

DOI: 10.1214/10-AOAS439. The Annals of Applied Statistics, Volume 5, Issue 2B (2011), pages 1657–1677.

Running title: Extreme deconvolution.

Supported in part by NASA (Grant NNX08AJ48G) and the NSF (Grant AST-0908357). During part of the period in which this research was performed, D. W. Hogg was a research fellow of the Alexander von Humboldt Foundation of Germany at the Max-Planck-Institut für Astronomie, Heidelberg, Germany. Sam T. Roweis is deceased.

Keywords: Bayesian inference; density estimation; Expectation–Maximization; missing data; multivariate estimation; noise.

1 Introduction

Inferring a distribution function given a finite set of samples from this distribution function and the related problem of finding clusters and/or overdensities in the distribution is a problem of significant general interest [e.g., McLachlan and Basford (1988); Rabiner and Biing-Hwang (1993); Dehnen (1998); Helmi et al. (1999); Skuljan, Hearnshaw and Cottrell (1999); Hogg et al. (2005)]. Performing this inference given only a noisy set of measurements is a problem commonly encountered in many of the sciences [see examples in Carroll et al. (2006)]. In many cases of interest, the noise properties of the observations are different from one measurement to the next (i.e., they are heteroskedastic), even though the uncertainties are well characterized for each observation. This is, for example, often the case in astronomy, where in many cases the dominant source of uncertainty is due to well-characterized photon-counting statistics, while spatial and temporal variations in the atmosphere cause the uncertainties to significantly vary even for sources of the same apparent brightness observed with the same telescope.

The description of interest to a scientist is not that of the observed distribution; what you really want is the distribution you would have obtained with good data, that is, data with vanishingly small uncertainties and with all of the dimensions measured. In the low signal-to-noise regime the data never have these two properties, so the underlying, true distribution cannot be found without taking the noise properties of the data into account. If you want to infer the underlying distribution, then in order to compare your model with the data you need to convolve the model with the data uncertainties rather than deconvolve the data. When the given set of data has heterogeneous noise properties, that is, when the uncertainty convolution is different for each data point, each data point is a sample from a different distribution: the distribution obtained by convolving the true, underlying distribution with the noise of that particular observation. Incomplete data pose a similar problem when the part of the data that is missing differs from one data point to the next.

Most existing approaches to density estimation only apply in the high signal-to-noise regime [e.g., McLachlan and Basford (1988); Silverman (1986); Diebolt and Robert (1994)], and most approaches to density estimation from noisy samples are nonparametric techniques that assume that the noise is homoskedastic [e.g., Stefanski and Carroll (1990); Zhang (1990)]. The case of heteroskedastic uncertainties has only recently attracted attention [e.g., Delaigle and Meister (2008); Staudenmayer, Ruppert and Buonaccorsi (2008)], and all of the approaches that have been developed so far are nonparametric. None of these approaches can be used when only incomplete data are available, although parametric techniques that properly account for incomplete, but noiseless, data have been developed [Ghahramani and Jordan (1994a, 1994b)].

In this paper we show that the frequently used Gaussian-mixture-model approach to density estimation can be generalized to handle noisy, heterogeneous and incomplete data. The likelihood of the model for each data point is given by the model convolved with the (unique) uncertainty distribution of that data point; the objective function is obtained by simply multiplying these individual likelihoods together for the various data points. Optimizing this objective function yields a maximum likelihood estimate of the distribution (more specifically, of its parameters).

While optimization of this objective function can, in principle, be performed by a generic optimizer, we develop an Expectation–Maximization (EM) algorithm that optimizes the objective function. This algorithm works in much the same way as the normal EM algorithm for mixture-of-Gaussians density estimation, except that an additional degree of incompleteness is given by the actual values of the observables, since we only have access to noisy projections of these; in the expectation step these actual values are estimated based on the noisy and incomplete measured values and the current estimate of the distribution function. In the limit in which the noise is absent but the data are lower dimensional projections of the quantities of interest, this algorithm reduces to the algorithm described in Ghahramani and Jordan (1994a, 1994b).

We also show how prior distributions for a Bayesian version of the calculation reporting a MAP estimate can be naturally included in this algorithm as well as how a split-and-merge procedure that heuristically searches parameter space for better approximations to the global maximum can also be incorporated in this approach. These priors and the split-and-merge procedure can be important when applying the EM algorithm developed here in situations with real data where the likelihood surface can have a very complicated structure. We also briefly discuss the practical issues having to do with model selection in the mixture model approach.

An application to a real data set is given in Section 5, where we fit the distribution of stellar velocities near the Sun. The observed velocities of stars that we use for this purpose have all of the properties that the approach developed in this paper handles correctly: The velocity measurements are noisy, and since we only use observations of the velocity components in the plane of the sky, the data are incomplete; this incompleteness is different for each velocity measurement, since the sample covers the full sky. Nevertheless, we are able to obtain good agreement with other fits of the velocity distribution based on complete data.

The technique we describe below has many applications besides returning a maximum likelihood fit to the error-deconvolved distribution function of a data sample. For instance, when an estimate of the uncertainty in the estimated parameters or distribution function is desired, or when a full Bayesian analysis of the mixture model is preferred, the outcome of the maximum likelihood technique developed here can be used as a seed for Markov Chain Monte Carlo (MCMC) methods for finite mixture modeling [e.g., Diebolt and Robert (1994); Richardson and Green (1997)].

2 Likelihood of a mixture of Gaussian distributions given a set of heterogeneous, noisy samples

Our goal is to fit a model for the distribution of a $d$-dimensional quantity $\mathbf{v}$ using a set of observational data points $\mathbf{w}_i$. Therefore, we need to write down the probability of the data under the model for the distribution. The observations $\mathbf{w}_i$ are assumed to be noisy projections of the true values $\mathbf{v}_i$,

$$\mathbf{w}_i = \mathbf{R}_i\,\mathbf{v}_i + \mathrm{noise}, \qquad (1)$$

where the noise is drawn from a Gaussian with zero mean and known covariance matrix $\mathbf{S}_i$. The case in which there is missing data occurs when the projection matrix $\mathbf{R}_i$ is rank-deficient. Alternatively, we can handle the missing data case by describing the missing data as directions of the covariance matrix that have a formally infinite eigenvalue. In practice, we use very large eigenvalues in the noise matrix. When the data have only a small degree of incompleteness, that is, when each data point has only a small number of unmeasured dimensions, this latter approach is often the best choice, since one often has some idea about the unmeasured values. For example, in the example given below of inferring the velocity distribution of stars near the Sun, we know that the stars are moving at velocities that do not exceed the speed of light, which is not very helpful, but also that none of the velocities exceed the local Galactic escape speed, since we can safely assume that all the stars are bound to the Galaxy. However, in situations in which each data point has observations of a dimensionality much smaller than $d$, using the projection matrices will greatly reduce the computational cost, since, as will become clear below, the most computationally expensive operations all take place in the lower dimensional space of the observations.
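To make these two bookkeeping options concrete, the sketch below (Python with NumPy; the variable names are ours and not part of any published code) sets up the projection matrix and noise covariance for a single data point for which only the first two of three dimensions are measured, once with a rank-deficient projection and once with a square projection and a very large variance in the unmeasured direction.

```python
import numpy as np

# Option 1: rank-deficient projection -- the observation lives in the
# two-dimensional measured subspace of the three-dimensional true quantity.
R_lowdim = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])   # shape (2, 3)
S_lowdim = np.diag([0.05, 0.08])         # 2x2 measurement covariance

# Option 2: keep the full dimensionality and give the unmeasured direction
# a huge variance (a "formally infinite" eigenvalue of the noise matrix).
R_full = np.eye(3)
S_full = np.diag([0.05, 0.08, 1.0e8])    # third component effectively unmeasured

# In either case the observation is modeled as w = R v + noise with
# noise ~ N(0, S); option 1 keeps all linear algebra in the smaller observed
# space, which is cheaper when many dimensions are missing.
```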

We will model the probability density of the true values $\mathbf{v}$ as a mixture of $K$ Gaussians,

$$p(\mathbf{v}) = \sum_{j=1}^{K} \alpha_j\, N(\mathbf{v} \mid \mathbf{m}_j, \mathbf{V}_j), \qquad (2)$$

where the amplitudes $\alpha_j$ sum to unity and the function $N(\mathbf{v} \mid \mathbf{m}, \mathbf{V})$ is the Gaussian probability density function with mean $\mathbf{m}$ and variance matrix $\mathbf{V}$,

$$N(\mathbf{v} \mid \mathbf{m}, \mathbf{V}) = (2\pi)^{-d/2}\,|\mathbf{V}|^{-1/2} \exp\!\Bigl(-\tfrac{1}{2}(\mathbf{v}-\mathbf{m})^\top \mathbf{V}^{-1} (\mathbf{v}-\mathbf{m})\Bigr). \qquad (3)$$

For a given observation $\mathbf{w}_i$ the likelihood of the model parameters $\theta \equiv \{\alpha_j, \mathbf{m}_j, \mathbf{V}_j\}_{j=1}^{K}$ given that observation and the noise covariance $\mathbf{S}_i$, which we will write as $p(\mathbf{w}_i \mid \theta)$, can be written as

$$p(\mathbf{w}_i \mid \theta) = \int \mathrm{d}\mathbf{v}\; p(\mathbf{w}_i \mid \mathbf{v})\, p(\mathbf{v} \mid \theta), \qquad (4)$$

where

$$p(\mathbf{w}_i \mid \mathbf{v}) = N(\mathbf{w}_i \mid \mathbf{R}_i\mathbf{v}, \mathbf{S}_i). \qquad (5)$$

This likelihood works out to be itself a mixture of Gaussians

$$p(\mathbf{w}_i \mid \theta) = \sum_{j=1}^{K} \alpha_j\, N(\mathbf{w}_i \mid \mathbf{R}_i\mathbf{m}_j, \mathbf{T}_{ij}), \qquad (6)$$

where

$$\mathbf{T}_{ij} = \mathbf{R}_i \mathbf{V}_j \mathbf{R}_i^\top + \mathbf{S}_i. \qquad (7)$$

The free parameters of this model can now be chosen so as to maximize an explicit, justified, scalar objective function $\phi$, given here by the logarithm (log) of the likelihood of the model given the data, that is,

$$\phi = \sum_{i=1}^{N} \ln p(\mathbf{w}_i \mid \theta) = \sum_{i=1}^{N} \ln \sum_{j=1}^{K} \alpha_j\, N(\mathbf{w}_i \mid \mathbf{R}_i\mathbf{m}_j, \mathbf{T}_{ij}). \qquad (8)$$

This function can be optimized in several ways, one of which is to use a generic optimizer to increase the likelihood until it reaches a maximum. This approach is complicated by parameter constraints (e.g., the amplitudes $\alpha_j$ must all be nonnegative and add up to one, and the variance matrices $\mathbf{V}_j$ must be positive definite and symmetric) and by the multimodality of the likelihood surface. In what follows we describe a different approach that is natural in this setting: an EM algorithm that iteratively and monotonically maximizes the likelihood, while naturally respecting the restrictions on the parameters.
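As a concrete illustration of equations (6)–(8), the sketch below (Python with NumPy/SciPy; the function and variable names are our own, not those of the released implementation) evaluates the objective function for a set of observations with per-point projection matrices and noise covariances.

```python
import numpy as np
from scipy.stats import multivariate_normal

def xd_loglikelihood(w, R, S, alpha, m, V):
    """Objective of eq. (8): sum_i ln sum_j alpha_j N(w_i | R_i m_j, T_ij),
    with T_ij = R_i V_j R_i^T + S_i (eq. 7).

    w, R, S: per-point observations, projection matrices and noise covariances;
    alpha, m, V: mixture amplitudes, means and covariances."""
    logl = 0.0
    for w_i, R_i, S_i in zip(w, R, S):
        like_i = 0.0
        for alpha_j, m_j, V_j in zip(alpha, m, V):
            T_ij = R_i @ V_j @ R_i.T + S_i
            like_i += alpha_j * multivariate_normal.pdf(w_i, R_i @ m_j, T_ij)
        logl += np.log(like_i)
    return logl
```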

3 Fitting mixtures with heterogeneous, noisy data using an EM algorithm

To optimize the likelihood in equation (8), we can extend the standard EM approach to Gaussian mixture estimation. In the case of complete and precise observations, the problem is framed as a tractable missing-data problem by positing that the labels or indicator variables indicating which Gaussian a data point was drawn from are missing [Dempster, Laird and Rubin (1977)]. We extend this approach by including the true values as extra missing data. This is a well-known approach for handling measurement uncertainty in latent variable or random effects models [e.g., Schafer (1993); Schafer and Purdy (1996)].

We write down the “full data” log likelihood—the likelihood we would write down if we had the true values $\mathbf{v}_i$ and the indicator variables $z_{ij}$, where $z_{ij} = 1$ if data point $i$ was drawn from component $j$ and $z_{ij} = 0$ otherwise,

$$\ln \mathcal{L}_{\mathrm{full}} = \sum_{i=1}^{N} \sum_{j=1}^{K} z_{ij}\, \ln\bigl[\alpha_j\, N(\mathbf{v}_i \mid \mathbf{m}_j, \mathbf{V}_j)\bigr] + \mathrm{const.}, \qquad (9)$$

where the constant collects the terms $\ln N(\mathbf{w}_i \mid \mathbf{R}_i\mathbf{v}_i, \mathbf{S}_i)$, which do not depend on the model parameters.

We will now show how we can use the EM methodology to find straightforward update steps that maximize the full data likelihood of the model. In Appendix A we prove that these updates also maximize the likelihood of the model given the noisy observations.

The E-step consists as usual of taking the expectation of the full data likelihood, evaluated at the current model parameters $\theta$. Writing out the expectation of the full data log likelihood from equation (9), we find

$$\mathbb{E}\bigl[\ln \mathcal{L}_{\mathrm{full}}\bigr] = \sum_{i}\sum_{j} \mathbb{E}[z_{ij}]\Bigl[\ln\alpha_j - \tfrac{1}{2}\ln|\mathbf{V}_j| - \tfrac{1}{2}\,\mathbb{E}\bigl[(\mathbf{v}_i-\mathbf{m}_j)^\top \mathbf{V}_j^{-1} (\mathbf{v}_i-\mathbf{m}_j) \,\big|\, \mathbf{w}_i, j\bigr]\Bigr] + \mathrm{const.}, \qquad (10)$$

which shows that in addition to the expectation of the indicator variables $z_{ij}$ for each component we also need the expectation of the terms $\mathbf{v}_i$ and the expectation of the terms $\mathbf{v}_i\mathbf{v}_i^\top$ given the data, the current model estimate and the component $j$. The expectation of the $z_{ij}$ is equal to the posterior probability $q_{ij}$ that data point $i$ was drawn from the component $j$. The expectation of the $\mathbf{v}_i$ and the $\mathbf{v}_i\mathbf{v}_i^\top$ can be found as follows: Consider the joint distribution for the true and observed velocities, denoted by the stacked vector $(\mathbf{v}_i^\top, \mathbf{w}_i^\top)^\top$, given the model estimate and the component $j$. From the description of the problem, we can see that this vector is distributed normally with mean

$$\begin{pmatrix} \mathbf{m}_j \\ \mathbf{R}_i \mathbf{m}_j \end{pmatrix} \qquad (11)$$

and covariance matrix

$$\begin{pmatrix} \mathbf{V}_j & \mathbf{V}_j \mathbf{R}_i^\top \\ \mathbf{R}_i \mathbf{V}_j & \mathbf{T}_{ij} \end{pmatrix}. \qquad (12)$$

The conditional distribution of the $\mathbf{v}_i$ given the data $\mathbf{w}_i$ is then normal with mean

$$\mathbf{b}_{ij} \equiv \mathbf{m}_j + \mathbf{V}_j \mathbf{R}_i^\top \mathbf{T}_{ij}^{-1} (\mathbf{w}_i - \mathbf{R}_i \mathbf{m}_j) \qquad (13)$$

and covariance matrix

$$\mathbf{B}_{ij} \equiv \mathbf{V}_j - \mathbf{V}_j \mathbf{R}_i^\top \mathbf{T}_{ij}^{-1} \mathbf{R}_i \mathbf{V}_j. \qquad (14)$$

Thus, we see that the expectation of $\mathbf{v}_i$ given the data $\mathbf{w}_i$, the model estimate $\theta$ and the component $j$ is given by $\mathbf{b}_{ij}$, whereas the expectation of the $\mathbf{v}_i\mathbf{v}_i^\top$ given the same is given by $\mathbf{B}_{ij} + \mathbf{b}_{ij}\mathbf{b}_{ij}^\top$.

Given this, the expectation of the full data log likelihood is given by

$$\mathbb{E}\bigl[\ln \mathcal{L}_{\mathrm{full}}\bigr] = \sum_{i}\sum_{j} q_{ij}\Bigl[\ln\alpha_j - \tfrac{1}{2}\ln|\mathbf{V}_j| - \tfrac{1}{2}(\mathbf{b}_{ij}-\mathbf{m}_j)^\top \mathbf{V}_j^{-1} (\mathbf{b}_{ij}-\mathbf{m}_j) - \tfrac{1}{2}\operatorname{tr}\bigl(\mathbf{V}_j^{-1}\mathbf{B}_{ij}\bigr)\Bigr] + \mathrm{const.} \qquad (15)$$

Straightforward optimization of this with respect to the model parameters gives the following algorithm:

E-step:

$$q_{ij} \leftarrow \frac{\alpha_j\, N(\mathbf{w}_i \mid \mathbf{R}_i\mathbf{m}_j, \mathbf{T}_{ij})}{\sum_k \alpha_k\, N(\mathbf{w}_i \mid \mathbf{R}_i\mathbf{m}_k, \mathbf{T}_{ik})}, \qquad \mathbf{b}_{ij} \leftarrow \mathbf{m}_j + \mathbf{V}_j\mathbf{R}_i^\top\mathbf{T}_{ij}^{-1}(\mathbf{w}_i - \mathbf{R}_i\mathbf{m}_j), \qquad \mathbf{B}_{ij} \leftarrow \mathbf{V}_j - \mathbf{V}_j\mathbf{R}_i^\top\mathbf{T}_{ij}^{-1}\mathbf{R}_i\mathbf{V}_j;$$

M-step:

$$\alpha_j \leftarrow \frac{1}{N}\sum_i q_{ij}, \qquad \mathbf{m}_j \leftarrow \frac{1}{q_j}\sum_i q_{ij}\,\mathbf{b}_{ij}, \qquad \mathbf{V}_j \leftarrow \frac{1}{q_j}\sum_i q_{ij}\bigl[(\mathbf{m}_j - \mathbf{b}_{ij})(\mathbf{m}_j - \mathbf{b}_{ij})^\top + \mathbf{B}_{ij}\bigr], \qquad (16)$$

where $q_j \equiv \sum_i q_{ij}$.

In Appendix A we prove that this procedure for maximizing the full data likelihood also monotonically increases the likelihood of the data given the model, as is the case for the EM algorithm for noiseless and complete measurements [Dempster, Laird and Rubin (1977); Wu (1983)].
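The update equations translate directly into code. The sketch below (Python with NumPy/SciPy; our own naming, not the released C implementation) performs a single iteration of the E- and M-steps of equations (13), (14) and (16); for clarity it loops explicitly over data points and components and does not guard against numerical underflow.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(w, R, S, alpha, m, V):
    """One EM iteration for the noisy, projected mixture (Section 3)."""
    N, K = len(w), len(alpha)
    d = len(m[0])
    q = np.empty((N, K))            # posterior probabilities q_ij
    b = np.empty((N, K, d))         # conditional means b_ij       (eq. 13)
    B = np.empty((N, K, d, d))      # conditional covariances B_ij (eq. 14)
    # E-step
    for i, (w_i, R_i, S_i) in enumerate(zip(w, R, S)):
        for j in range(K):
            T_ij = R_i @ V[j] @ R_i.T + S_i
            q[i, j] = alpha[j] * multivariate_normal.pdf(w_i, R_i @ m[j], T_ij)
            gain = V[j] @ R_i.T @ np.linalg.inv(T_ij)
            b[i, j] = m[j] + gain @ (w_i - R_i @ m[j])
            B[i, j] = V[j] - gain @ R_i @ V[j]
        q[i] /= q[i].sum()
    # M-step (eq. 16)
    qj = q.sum(axis=0)
    alpha_new = qj / N
    m_new = [(q[:, j, None] * b[:, j]).sum(axis=0) / qj[j] for j in range(K)]
    V_new = []
    for j in range(K):
        diff = b[:, j] - m_new[j]
        scatter = diff[:, :, None] * diff[:, None, :] + B[:, j]
        V_new.append((q[:, j, None, None] * scatter).sum(axis=0) / qj[j])
    return alpha_new, m_new, V_new
```

Iterating this step from an initial guess monotonically increases the objective of equation (8); the proof is given in Appendix A.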

4 Extensions to the basic algorithm

Singularities and local maxima are two problems that can severely limit the generalization capabilities of the computed density estimates for inferring the densities of unknown data points. These are commonly encountered when using the EM algorithm to iteratively compute the maximum likelihood estimates of Gaussian mixtures: Singularities arise when the covariance in equation (3) becomes singular; the EM updates might get stuck in a local maximum because of the monotonic increase in likelihood ensured by the EM algorithm. The latter can be avoided through the use of a stochastic EM procedure [Broniatowski, Celeux and Diebolt (1983); Celeux and Diebolt (1985); Celeux and Diebolt (1986)] or through the split and merge procedure described below.

4.1 Bayesian-inspired regularization

The problem of singular covariances can be mitigated through the use of priors on the model parameters in a Bayesian setting [Ormoneit and Tresp (1996)]. It should be emphasized that this calculation is only Bayesian in the sense of producing a maximum a posteriori (MAP) point estimate rather than a maximum likelihood estimate, and that this is no different from penalized maximum likelihood. We briefly show that this procedure can be applied in our setting as well.

The regularization scheme of Ormoneit and Tresp (1996) introduces conjugate priors on the Gaussian mixtures parameters space as penalty terms. These conjugate priors are the following: A normal density for the mean of each Gaussian, a Wishart density [Gelman et al. (2000)],

(17)

with a normalization constant, for the covariance of each Gaussian, and a Dirichlet density , given by

(18)

where the prefactor is a normalizing factor, for the amplitudes $\alpha_j$. Optimizing the posterior distribution for the model parameters replaces the M-step of equation (16) with

Hyperparameters can be set by leave-one-out cross-validation. Vague priors on the amplitude and the means are obtained by setting

(20)

Since we are only interested in the MAP estimate, propriety of the resulting posterior is not an issue with the improper prior resulting from this choice.

The label-switching problem in Bayesian mixtures [Jasra, Holmes and Stephens (2005)] is not an issue for the maximization of the posterior distribution here.

4.2 Avoiding local maxima

The split and merge algorithm starts from the basic EM algorithm, with or without the Bayesian regularization of the variances, and jumps into action after the EM algorithm has reached a maximum, which more often than not will only be a local maximum. At this point, three of the Gaussians in the mixture are singled out; two of these Gaussians are merged, while the third Gaussian is split into two Gaussians [Ueda et al. (1998)]. An alternative, but similar, approach to local maxima avoidance is given by the birth and death moves in reversible jump MCMC [Richardson and Green (1997)] or variational approaches [Ghahramani and Beal (2000); Beal (2003)] to mixture modeling. These moves do not conserve the number of mixture components and are therefore less suited for our fixed-$K$ approach to mixture modeling.

Full details of the split and merge algorithm are given in Appendix B.

4.3 Setting the remaining free parameters

No real world application of Gaussian mixture density estimation is complete without a well-specified methodology for setting the number of Gaussian components $K$ and any hyperparameters introduced in the Bayesian regularization described above, in particular the covariance regularization $w$. If we further assume that the scale matrix of the Wishart prior is proportional to the identity matrix, then this covariance regularization parameter basically sets the square of the smallest scale of the distribution function on which we can reliably infer small-scale features. Therefore, this scale could be set by hand to the smallest scale we believe we have access to based on the properties of the data set.

In order to get the best results, the parameters $K$ and $w$ should be set by some objective procedure. As mentioned above, leave-one-out cross-validation [Stone (1974)] could be used to set the regularization parameter $w$, and the number of Gaussians $K$ could be set by this procedure as well. Other techniques include methods based on Bayesian model selection [Roberts et al. (1998)] as well as approaches based on minimum encoding inference [Wallace and Boulton (1968); Oliver, Baxter and Wallace (1996); Rissanen (1978); Schwartz (1978)], although these methods have difficulty dealing with significant overlap between components (such as the overlap we see in the example in Figure 1); there are, however, methods to deal with these situations [Baxter (1995)]. Alternatively, when a separate, external data set is available, we can use it as a test data set to validate the obtained distribution function, as sketched below. All of these methods are explored in an accompanying paper on the velocity distribution of stars in the Solar neighborhood from measurements from the Hipparcos satellite [see below; Bovy, Hogg and Roweis (2009)].
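As a sketch of the external-validation idea, one can select the number of components by refitting for each candidate value and scoring the fits on data not used in the fit; the regularization parameter w can be scanned in the same loop. All names below are our own placeholders: em_step and xd_loglikelihood stand for routines of the form sketched in Sections 2 and 3, and w_train, R_train, S_train, w_valid, R_valid, S_valid for training and validation data.

```python
import numpy as np

def fit_xd(w, R, S, K, n_iter=200, rng=np.random):
    """Crude fit: random initialization followed by a fixed number of EM steps."""
    d = R[0].shape[1]
    alpha = np.ones(K) / K
    m = [rng.standard_normal(d) for _ in range(K)]
    V = [np.eye(d) for _ in range(K)]
    for _ in range(n_iter):
        alpha, m, V = em_step(w, R, S, alpha, m, V)
    return alpha, m, V

best = None
for K in (5, 10, 15):                       # candidate numbers of components
    alpha, m, V = fit_xd(w_train, R_train, S_train, K)
    score = xd_loglikelihood(w_valid, R_valid, S_valid, alpha, m, V)
    if best is None or score > best[0]:
        best = (score, K)
print("selected K =", best[1])
```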

Figure 1: Two-dimensional projections of the three-dimensional velocity distribution of Hipparcos stars using 10 Gaussians and $w = 4$ km² s⁻². The top right plot shows 1-sigma covariance ellipses around each individual Gaussian in the corresponding plane; the thickness of each covariance ellipse is proportional to the natural logarithm of its amplitude $\alpha_j$. In the other three panels the density grayscale is linear and contours contain, from the inside outward, 2, 6, 12, 21, 33, 50, 68, 80, 90, 95, 99 and 99.9 percent of the distribution; 50 percent of the distribution is contained within the innermost dark contour. The sparsely populated feature at strongly lagging rotational velocity is real and corresponds to a known feature in the velocity distribution, the Arcturus moving group; indeed, all the features that appear in these projections are real and correspond to known features.

A rather different approach to the model selection problem is to avoid it altogether. That is, by introducing priors over the hyperparameters and including them as part of the model it is often possible to infer, or fully marginalize over, them simultaneously with the parameters of the components of the mixture. These methods also address uncertainty quantification throughout the model. Such approaches include reversible jump MCMC methods [Richardson and Green (1997)], mixtures consisting of an infinite number of components based on the Dirichlet process [Rasmussen (2000)], or approximate, variational algorithms [Ghahramani and Beal (2000); Beal (2003)]. Extending these approaches to deal with noisy, heterogeneous and incomplete data is beyond the scope of this paper, but it is clear that this extension is, in principle, straightforward: the MCMC methods mentioned above can include the true values of the observations —known in the Bayesian MCMC literature as data augmentation—and these can be Gibbs sampled given the current model and the observed values in an MCMC sweep from the Gaussian with mean given in equation (13) and variance given in equation (14).

5 The velocity distribution from Hipparcos data

We have applied the technique developed in this paper to the problem of inferring the velocity distribution of stars in the Solar neighborhood from transverse angular data from the Hipparcos satellite and we present in this section some results of this study to demonstrate the performance of the algorithm on a real data set. A more detailed and complete account of this study is presented elsewhere [Bovy, Hogg and Roweis (2009)].

The local velocity distribution is interesting because we can learn about the structure and evolution of the Galactic disk from deviations from the smooth, close to Gaussian velocity distribution expected in simple, axisymmetric models of the disk. It has been shown that the dynamical effects of a bar-shaped distribution of stars in the central region of our Galaxy can resonate in the outer parts of the disk and give rise to structure in the velocity distribution [e.g., Dehnen (2000); Bovy (2010)]. Similarly, steady-state or transient spiral structure can affect the velocities of stars in a coherent way, such that we can see this effect locally [e.g., Quillen and Minchev (2005); De Simone, Wu and Tremaine (2004)]. Inferring the local velocity distribution from observational data is therefore necessary to assess whether these dynamical signatures are observed.

Velocities of stars are not directly observable. Rather, they need to be patched together from observations of the stars’ directions on the sky at different times—the branch of astronomy known as astrometry—and spectroscopic observations to determine the velocity along the line of sight. The annual motion of the Earth around the Sun gives rise to an apparent displacement of a star relative to background objects that is inversely proportional to the distance to the star. Measurements of this apparent shift, or parallax, can thus be used to determine the distance to stars. Parallaxes are traditionally reported in units of arcseconds; a star with a parallax of 1 arcsecond is defined to be at a distance of 1 parsec (pc), which equals $3.09 \times 10^{16}$ m. The intrinsic motion of a star also gives rise to a systematic shift in its position relative to background sources, such that its angular motion—known as its proper motion—can be measured. Combining the distance and angular velocity gives the components of the space velocity of a star that are perpendicular to the line of sight.
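To make the unit conventions concrete: a proper motion of 1 arcsec per year at a distance of 1 pc corresponds to a transverse velocity of one astronomical unit per year, or about 4.74 km s⁻¹. A small illustrative conversion (the numbers are made up, not taken from the Hipparcos catalogue):

```python
# Illustrative astrometry-to-velocity conversion; the input values are invented.
parallax = 0.025          # arcsec, so the distance is 1/0.025 = 40 pc
pm_ra    = 0.150          # proper motion in right ascension, arcsec per yr
pm_dec   = -0.080         # proper motion in declination, arcsec per yr

KMS_PER_AU_YR = 4.74      # 1 AU per year expressed in km/s
dist_pc = 1.0 / parallax
v_ra  = KMS_PER_AU_YR * pm_ra  * dist_pc    # km/s, transverse component
v_dec = KMS_PER_AU_YR * pm_dec * dist_pc    # km/s, transverse component
# The third component (along the line of sight) requires a spectroscopic
# radial velocity and is missing for the purely astrometric Hipparcos sample.
```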

The astrometric ESA space mission Hipparcos, which collected data over a 3.2 year period around 1990, provided for the first time an all-sky catalogue of absolute parallaxes and proper motions, with typical uncertainties in these quantities on the order of milli-arcseconds [ESA (1997)]. From this catalogue of 100,000 stars, kinematically unbiased samples of stars with accurate positions and velocities can be extracted [Dehnen and Binney (1998)]. Since this was a purely astrometric mission, and the only components of a star’s velocity that can be measured astrometrically are the components perpendicular to the line of sight, the line-of-sight velocities of the stars in the Hipparcos sample were not obtained during the mission.

Distances in astronomy are notoriously hard to measure precisely, and at the accuracy level of the Hipparcos mission distances can only be reliably obtained for stars near the Sun (out to 100 pc; the diameter of the Galactic disk is about 30,000 pc). In addition, since distances are measured as inverse distances (parallaxes), only distances that are measured relatively precisely will have approximately Gaussian uncertainties associated with them. Balancing the size of the sample with the accuracy of the distance measurement leaves us with distance uncertainties that are typically 10 percent, such that the velocities perpendicular to the line of sight that are obtained from the proper motions and the distances have low signal-to-noise. Since the dominant source of noise is due to simple photon-counting statistics [van Leeuwen (2007a)], the uncertainties are well characterized and can be assumed to be known, as is necessary for the technique developed in this paper to apply. Star-to-star correlations are negligible and can be ignored [van Leeuwen (2007a)].

Of course, if we want to describe the distribution of the velocities of the stars in this sample, we need to express the velocities in a common reference frame, which for kinematical studies of stars around the Sun is generally chosen to be the Galactic coordinate system, in which the $x$-axis points toward the Galactic center, the $y$-axis points in the direction of Galactic rotation, and the $z$-axis points toward the North Galactic Pole [Blaauw et al. (1960); Binney and Merrifield (1998)]. The measured velocities perpendicular to the line of sight are then projections of the three-dimensional velocity of a star with respect to the Sun onto the two-dimensional plane perpendicular to the line of sight to the star. Therefore, this projection is different for each individual star.

Observations of celestial objects are expressed in the equatorial coordinate system, in which the Earth’s geographic poles and equator are projected onto the celestial sphere. The components of the three-dimensional velocities of the stars in the Galactic coordinate system in terms of the observed quantities in the equatorial coordinate frame—angular position on the sky ($\alpha$, $\delta$), inverse distance (parallax $\varpi$), angular motion on the sky ($\mu_\alpha$, $\mu_\delta$), and line-of-sight velocity ($v_r$)—are given by

(21)

where , , . The matrix transforms the velocities from the equatorial reference frame in which the observations are made to the Galactic coordinate frame; it depends on a few parameters defining this coordinate transformation and is given by

(23)

The matrix depends on the epoch that the reduced data are referred to (1991.25 for Hipparcos) through the values of $\alpha_{\mathrm{NGP}}$, $\delta_{\mathrm{NGP}}$ and $\theta_0$ (the position in equatorial coordinates of the north Galactic pole, and the Galactic longitude of the north Celestial pole, respectively). These quantities were defined for the epoch 1950.0 as follows [Blaauw et al. (1960)]: $\alpha_{\mathrm{NGP}} = 12^{\mathrm{h}}49^{\mathrm{m}}$, $\delta_{\mathrm{NGP}} = +27.4^\circ$ and $\theta_0 = 123^\circ$.

The matrix depends on the position of the source on the sky,

(24)

In the context of the deconvolution technique described above, the observations $\mathbf{w}_i$ are these velocity components and the projection matrix $\mathbf{R}_i$ is given by the product of the two matrices defined above. Since we do not use the radial velocities of the stars, we set the line-of-sight velocity $v_r$ to zero and use a large uncertainty variance for this component of the uncertainty covariance $\mathbf{S}_i$; equivalently, we could remove $v_r$ from the observations and restrict $\mathbf{R}_i$ to the projection onto the plane of the sky.

We have studied the velocity distribution of a sample of main sequence stars selected to have accurate distance measurements (relative parallax uncertainties smaller than 10 percent) and to be kinematically unbiased (in that the sample of stars faithfully represents the kinematics of similar stars). In detail, we use the sample of 11,865 stars from Dehnen and Binney (1998), but we use the new reduction of the Hipparcos raw data, which has improved the accuracy of the astrometric quantities [van Leeuwen (2007a; 2007b)]. A particular reconstruction of the underlying velocity distribution of the stars is shown in Figure 1, in which 10 Gaussians are used, a prior is placed on the variances only (with vague priors on the amplitudes and means, as in Section 4.1), and the regularization parameter is set to $w = 4$ km² s⁻².

These values for the hyperparameters were set using an external data set rather than any of the other methods described in Section 4.3. For this we use a set of 7682 stars from the Geneva–Copenhagen Survey [Nordström et al. (2004); Holmberg, Nordström and Andersen (2009)] for which the line-of-sight velocity (perpendicular to the plane of the sky) has been measured spectroscopically. This is a separate data set from the one considered above; it partially overlaps with the previous data set and it is also kinematically unbiased. We then fit the velocity distribution for different choices of the hyperparameters and evaluate the probability of the line-of-sight velocities under the best-fit velocity distribution based on tangential velocities. The values of the hyperparameters that maximize this probability are $K = 10$ Gaussians and $w = 4$ km² s⁻².

The recovered distribution compares favorably with other reconstructions of the velocity distribution of stars in the Solar neighborhood based on the same sample of stars [using a maximum penalized likelihood density estimation technique, Dehnen (1998)], as well as with those based on other samples of stars for which three-dimensional velocities are available [Skuljan, Hearnshaw and Cottrell (1999); Nordström et al. (2004); Famaey et al. (2005); Antoja et al. (2008)]. In particular, this means that the main shape of the velocity distribution agrees with that found in previous studies and that the number and location of the peaks in the distribution, all real and known features, are consistent with those found before. This includes the very sparsely populated feature at strongly lagging rotational velocity, which is known as the Arcturus moving group. Therefore, we conclude that the method developed in this paper performs very well on this complicated data set. In contrast to previous determinations of the velocity distribution, our method allows us to study the structures found quantitatively, since it turns out that individual structures in the velocity distribution are well represented by individual components in the mixture model. Thus, we were able to conclude that these structures are not the remnants of large groups of stars that formed together in a cluster, but rather that they are probably caused by dynamical effects related to the bar at the center of the Milky Way or spiral structure [Bovy and Hogg (2010)].

The convergence of the algorithm is shown in Figure 2. Only split-and-merge steps that improved the likelihood are shown in this plot; therefore, the actual number of iterations is much higher than the number given on the horizontal axis. It is clear that all of the split-and-merge steps only slightly improve the initial estimate from the first EM procedure, but since what is shown is the likelihood per data point, the improvement of the total likelihood is more significant.

Figure 2: Convergence of the full algorithm: total log likelihood at each iteration step. Only split-and-merge steps that improve the likelihood are shown; each vertical gray line corresponds to a point at which a successful split and merge is performed. For clarity’s sake, we show in black only the parts of the split-and-merge steps at which the likelihood is larger than the likelihood right before that split-and-merge procedure; the log likelihoods of the steps in a split-and-merge procedure in which the likelihood is still climbing back up to the previous maximum in likelihood have been replaced by horizontal gray segments. The vertical axis has been cut off for display purposes: the log likelihood of the initial condition was $2.39 \times 10^{5}$.

6 Implementation and code availability

The algorithm presented in this paper was implemented in the C programming language, depending only on the standard C library and the GNU Scientific Library (http://www.gnu.org/software/gsl/). The code is available at http://code.google.com/p/extreme-deconvolution/; instructions for its installation and use are given there. The code can be compiled into a shared object library, which can then be incorporated into other projects or accessed through IDL (http://www.ittvis.com/ProductServices/IDL.aspx) or Python (http://www.python.org/) wrapper functions supplied with the C code.

The code can do everything described above. The convergence of a single run is quick, but when including split-and-merge iterations the convergence is rather slow because of the large number of split-and-merge steps that can be taken by the algorithm (the split-and-merge aspect of the algorithm, however, can easily be turned off or restricted by setting the parameter specifying the number of steps to go down the split-and-merge hierarchy).
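For orientation, a call through the Python wrapper might look roughly like the sketch below; the function name and argument order are quoted from memory and should be checked against the documentation at the project page above, so treat this as an assumed interface rather than a verbatim example.

```python
import numpy as np
# Assumed interface of the Python wrapper shipped with the C library; the
# actual signature should be verified against the project documentation.
from extreme_deconvolution import extreme_deconvolution

N, d, K = 1000, 2, 3
ydata  = np.random.randn(N, d)                    # noisy observations w_i
ycovar = np.tile(0.1 * np.eye(d), (N, 1, 1))      # per-point noise covariances S_i

# Initial guess for the mixture; these arrays are updated in place.
xamp   = np.ones(K) / K
xmean  = np.random.randn(K, d)
xcovar = np.tile(np.eye(d), (K, 1, 1))

avg_loglike = extreme_deconvolution(ydata, ycovar, xamp, xmean, xcovar)
```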

7 Conclusions and future work

We have generalized the mixture of Gaussians approach to density estimation such that it can be applied to noisy, heterogeneous and incomplete data. The objective function is obtained by integrating over the unknown true values of the quantities for which we only have noisy and/or incomplete observations. In order to optimize the objective function resulting from this marginalization, we have derived an EM algorithm that monotonically increases the model likelihood; this EM algorithm, in which the E-step involves finding the expected value of the first and second moments of the true values of the observables given the current model and the noisy observations, reduces to the basic EM algorithm for Gaussian mixture modeling in the limit of noiseless data. We have shown that the model can incorporate conjugate priors on all of the model parameters without losing any of its analytical attractiveness and that the algorithm can accommodate the split-and-merge algorithm to deal with the presence of local maxima, which this EM algorithm, as many other EM algorithms, suffers from.

The work presented here can be extended to be incorporated in various more nonparametric approaches to density modeling, for example, in mixture models with an infinite number of components based on the Dirichlet Process [e.g., Rasmussen (2000)]. In this way current advances in nonparametric modeling can be applied to the low signal-to-noise sciences where the situation of complete and noise-free data is more often than not an untenable and unattainable approximation.

Appendix A Proof that the proposed algorithm maximizes the likelihood

We use Jensen’s inequality in the continuous case for the concave function $\ln$ and a nonnegative, integrable function $q$, where we have assumed that $q$ is normalized, that is, $q$ is a probability distribution. For each observation $\mathbf{w}_i$ we can then introduce a distribution $q_i(\mathbf{v}_i, j)$ over the missing data such that

$$\ln p(\mathbf{w}_i \mid \theta) = \ln \sum_j \int \mathrm{d}\mathbf{v}_i\; p(\mathbf{w}_i, \mathbf{v}_i, j \mid \theta) \;\geq\; \sum_j \int \mathrm{d}\mathbf{v}_i\; q_i(\mathbf{v}_i, j)\, \ln p(\mathbf{w}_i, \mathbf{v}_i, j \mid \theta) + H(q_i), \qquad (25)$$

where $H(q_i) \equiv -\sum_j \int \mathrm{d}\mathbf{v}_i\; q_i(\mathbf{v}_i, j) \ln q_i(\mathbf{v}_i, j)$ is the entropy of the distribution $q_i$. This inequality becomes an equality when we take

$$q_i(\mathbf{v}_i, j) = p(\mathbf{v}_i, j \mid \mathbf{w}_i, \theta). \qquad (26)$$

The above holds for each data point, and we can write

$$p(\mathbf{v}_i, j \mid \mathbf{w}_i, \theta) = p(\mathbf{v}_i \mid j, \mathbf{w}_i, \theta)\; p(j \mid \mathbf{w}_i, \theta). \qquad (27)$$

The last factor reduces to calculating the posterior probabilities $q_{ij} = p(j \mid \mathbf{w}_i, \theta)$, and we can write the resulting lower bound (we drop the entropy term here, since it plays no role in the optimization, as it does not depend on the model parameters) as the expectation of the full data log likelihood of the previous section.

This shows that the bound reduces exactly to the procedure described above, that is, to taking the expectation of the $\mathbf{v}_i$ and $\mathbf{v}_i\mathbf{v}_i^\top$ terms with respect to the distribution of the $\mathbf{v}_i$ given the data $\mathbf{w}_i$, the current parameter estimate and the component $j$. We conclude that the E-step as described above ensures that the expectation of the full data log likelihood becomes equal (up to the constant entropy term) to the log likelihood of the model given the observed data. Optimizing this expected log likelihood in the M-step then also increases the log likelihood of the model given the observations. Therefore, the EM algorithm we described will increase the likelihood of the model in every iteration, and the algorithm will approach local maxima of the likelihood. Convergence is identified, as usual, as extremely small incremental improvement in the log likelihood per iteration.

Appendix B Split and merge algorithm

Let us denote the indices of the three selected Gaussians as $j_1$, $j_2$ and $j_3$, where the former two are to be merged while $j_3$ will be split. The Gaussians corresponding to the indices $j_1$ and $j_2$ will be merged as follows: the model parameters of the merged Gaussian $j'$ are

$$\alpha_{j'} = \alpha_{j_1} + \alpha_{j_2}, \qquad \mathbf{m}_{j'} = \frac{\alpha_{j_1}\mathbf{m}_{j_1} + \alpha_{j_2}\mathbf{m}_{j_2}}{\alpha_{j_1} + \alpha_{j_2}}, \qquad \mathbf{V}_{j'} = \frac{\alpha_{j_1}\mathbf{V}_{j_1} + \alpha_{j_2}\mathbf{V}_{j_2}}{\alpha_{j_1} + \alpha_{j_2}}, \qquad (28)$$

where the primed quantities denote the parameters of the merged Gaussian. Thus, the mean and the variance of the new Gaussian are a weighted average of the means and variances of the two merging Gaussians.

The Gaussian corresponding to $j_3$ is split as follows:

$$\alpha_{j_3'} = \alpha_{j_3''} = \tfrac{1}{2}\,\alpha_{j_3}, \qquad \mathbf{V}_{j_3'} = \mathbf{V}_{j_3''} = \det(\mathbf{V}_{j_3})^{1/d}\,\mathbf{I}, \qquad (29)$$

Thus, the Gaussian is split into two equally contributing Gaussians, with each new Gaussian having a covariance matrix that has the same volume (determinant) as $\mathbf{V}_{j_3}$. The means $\mathbf{m}_{j_3'}$ and $\mathbf{m}_{j_3''}$ can be initialized by adding a random perturbation vector to $\mathbf{m}_{j_3}$, for example,

$$\mathbf{m}_{j_3'} = \mathbf{m}_{j_3} + \boldsymbol{\epsilon}', \qquad \mathbf{m}_{j_3''} = \mathbf{m}_{j_3} + \boldsymbol{\epsilon}'', \qquad (30)$$

where $\boldsymbol{\epsilon}'$ and $\boldsymbol{\epsilon}''$ are small, independent random perturbation vectors.
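A minimal sketch of the merge and split initializations of equations (28)–(30) (Python with NumPy; the function names and the size of the random perturbation are our own choices):

```python
import numpy as np

def merge_components(alpha, m, V, j1, j2):
    """Eq. (28): amplitude-weighted merge of components j1 and j2."""
    a = alpha[j1] + alpha[j2]
    mean = (alpha[j1] * m[j1] + alpha[j2] * m[j2]) / a
    cov  = (alpha[j1] * V[j1] + alpha[j2] * V[j2]) / a
    return a, mean, cov

def split_component(alpha, m, V, j, eps=0.1, rng=np.random):
    """Eqs. (29)-(30): split component j into two equal-amplitude Gaussians
    whose covariances have the same determinant as V_j, with perturbed means."""
    d = len(m[j])
    a = 0.5 * alpha[j]
    cov = np.linalg.det(V[j]) ** (1.0 / d) * np.eye(d)
    mean1 = m[j] + eps * rng.standard_normal(d)
    mean2 = m[j] + eps * rng.standard_normal(d)
    return (a, mean1, cov), (a, mean2, cov)
```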

After this split and merge initialization, the parameters of the three affected Gaussians need to be re-optimized in a model in which the parameters of the unaffected Gaussians are held fixed. This can be done by using the M-step of equation (16) for the parameters of the three affected Gaussians, while keeping the parameters of the other Gaussians fixed, including the amplitudes. This ensures that the sum of the amplitudes of the three affected Gaussians remains fixed. This procedure is called the partial EM procedure. After convergence this is then followed by the full EM algorithm on the resulting model parameters. Finally, the resulting parameters are accepted if the total log likelihood of this model is greater than the log likelihood before the split and merge step. If the likelihood does not increase, the same split and merge procedure is performed on the next triplet of split and merge candidates.

The question that remains to be answered is how to choose the two Gaussians that should be merged and the Gaussian that should be split. In general, there are $K(K-1)(K-2)/2$ possible triplets like this, a number that quickly becomes large when the number of Gaussians grows. In order to rank these triplets, one can define a merge criterion and a split criterion.

The merge criterion is constructed based on the observation that if many data points have equal posterior probabilities for two Gaussians, these Gaussians are good candidates to be merged. Therefore, one can define the merge criterion:

$$J_{\mathrm{merge}}(j, k) = \mathbf{q}_j^\top \mathbf{q}_k, \qquad (31)$$

where $\mathbf{q}_j \equiv (q_{1j}, \ldots, q_{Nj})^\top$ is the $N$-dimensional vector of posterior probabilities for the $j$th Gaussian. Pairs of Gaussians with larger $J_{\mathrm{merge}}$ are good candidates for a merger.

We can define a split criterion based on the Kullback–Leibler distance between the local data density around the $j$th Gaussian, which can be written in the case of complete data as $p_j(\mathbf{v}) = \sum_i q_{ij}\,\delta(\mathbf{v} - \mathbf{v}_i) \big/ \sum_i q_{ij}$, and the $j$th Gaussian density specified by the current model estimates $\mathbf{m}_j$ and $\mathbf{V}_j$. The Kullback–Leibler divergence between two distributions $p$ and $q$ is given by [MacKay (2003)]

$$D_{\mathrm{KL}}(p \parallel q) = \int \mathrm{d}\mathbf{x}\; p(\mathbf{x}) \ln \frac{p(\mathbf{x})}{q(\mathbf{x})}. \qquad (32)$$

Since the local data density is only nonzero at a finite number of values, we can write this as

$$J_{\mathrm{split}}(j) = \sum_i p_j(\mathbf{v}_i) \ln \frac{p_j(\mathbf{v}_i)}{N(\mathbf{v}_i \mid \mathbf{m}_j, \mathbf{V}_j)}. \qquad (33)$$

The larger the distance between the local density and the Gaussian representing it, the larger $J_{\mathrm{split}}(j)$ and the better a candidate this Gaussian is to be split.

When dealing with incomplete data determining the local data density is more problematic. One possible way to estimate how well a particular Gaussian describes the local data density is to calculate the Kullback–Leibler divergence between the model Gaussian under consideration and each individual data point perpendicular to the unobserved directions for that data point. Thus, we can write

(34)

Candidates for merging and splitting are then ranked as follows: first the merge criterion is calculated for all pairs and the pairs are ranked by decreasing $J_{\mathrm{merge}}$. For each pair in this ranking the remaining Gaussians are then ranked by decreasing $J_{\mathrm{split}}$.
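The two criteria are computed directly from the posterior probabilities q_ij and the fitted components. A sketch of the merge criterion of equation (31) and of a discrete split criterion in the spirit of equations (32)–(33) for the complete-data case (our own naming; the incomplete-data variant of equation (34) is not reproduced here):

```python
import numpy as np
from scipy.stats import multivariate_normal

def merge_criterion(q, j, k):
    """Eq. (31): inner product of the posterior-probability vectors of
    components j and k; large values indicate good merger candidates."""
    return np.dot(q[:, j], q[:, k])

def split_criterion(v, q, j, m, V):
    """Discrete KL distance between the posterior-weighted local data density
    around component j and the component itself (complete-data case, eq. 33)."""
    p_local = q[:, j] / q[:, j].sum()                 # local data density at the v_i
    p_model = multivariate_normal.pdf(v, m[j], V[j])  # component density at the v_i
    keep = p_local > 0
    return np.sum(p_local[keep] * np.log(p_local[keep] / p_model[keep]))
```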

To summarize the full algorithm, we briefly list all the steps involved:

  1. Run the EM algorithm as specified in Section 3 (equation (16)), optionally with the regularization of Section 4.1. Store the resulting model parameters and the corresponding model log likelihood.

  2. Compute the merge criterion for all pairs of Gaussians and the split criterion for all Gaussians. Sort the split and merge candidates based on these criteria as detailed above.

  3. For the first triplet in this sorted list set the initial parameters of the merged Gaussian using equation (28) and the parameters of the two Gaussians resulting from splitting the third Gaussian using equations (29) and (30). Then run the partial EM procedure on the parameters of the three affected Gaussians, that is, run EM while keeping the parameters of the unaffected Gaussians fixed, and follow this up by running the full EM procedure on all the Gaussians. If after convergence the new log likelihood is greater than the stored log likelihood, accept the new parameter values and return to step two. If it is not, return to the beginning of this step and use the next triplet in the list.

  4. Halt this procedure when none of the split and merge candidates improves the log likelihood or, if this list is too long, when none of the first several candidates considered leads to an improvement.

Deciding when to stop going down the split-and-merge hierarchy will be dictated in any individual application of this technique by computational constraints. This is an essential feature of any search-based approach to finding global maxima of (likelihood) functions.

Acknowledgments

It is a pleasure to thank Frédéric Arenou and Phil Marshall for comments and assistance and the anonymous referee and Associate Editor for valuable criticism.

References

  • Antoja, T., Figueras, F., Fernández, D. and Torra, J. (2008). Origin and evolution of moving groups. I. Characterization in the observational kinematic-age-metallicity space. Astron. Astrophys. 490 135.
  • Baxter, R. A. (1995). Finding overlapping distributions with MML. Technical Report No. 244, Dept. Computer Science, Monash Univ., Clayton, Australia.
  • Beal, M. J. (2003). Variational algorithms for approximate Bayesian inference. Ph.D. thesis, Gatsby Computational Neuroscience Unit, Univ. College London.
  • Binney, J. and Merrifield, M. (1998). Galactic Astronomy. Princeton Univ. Press, Princeton, NJ.
  • Blaauw, A., Gum, C. S., Pawsey, J. L. and Westerhout, G. (1960). The new IAU system of galactic coordinates (1958 revision). Mon. Not. R. Astron. Soc. 121 123.
  • Bovy, J. (2010). Tracing the Hercules stream around the galaxy. Astrophys. J. 725 1676.
  • Bovy, J., Hogg, D. W. and Roweis, S. T. (2009). The velocity distribution of nearby stars from Hipparcos data I. The significance of the moving groups. Astrophys. J. 700 1794.
  • Bovy, J. and Hogg, D. W. (2010). The velocity distribution of nearby stars from Hipparcos data II. The nature of the low-velocity moving groups. Astrophys. J. 717 617.
  • Broniatowski, M., Celeux, G. and Diebolt, J. (1983). Reconaissance de densités par un algorithme d'apprentissage probabiliste. In Data Analysis and Informatics, Vol. 3 359–373. North-Holland, Amsterdam.
  • Carroll, R. J., Ruppert, D., Stefanski, L. A. and Crainiceanu, C. M. (2006). Measurement Error in Nonlinear Models: A Modern Perspective, 2nd ed. Chapman and Hall/CRC, Boca Raton, FL.
  • Celeux, G. and Diebolt, J. (1985). The SEM algorithm: A probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Comput. Statist. Quart. 2 73.
  • Celeux, G. and Diebolt, J. (1986). L'algorithme SEM: un algorithme d'apprentissage probabiliste pour la reconnaissance de mélanges de densités. Rev. Stat. Appl. 34 35.
  • De Simone, R., Wu, X. and Tremaine, S. (2004). The stellar velocity distribution in the solar neighbourhood. Mon. Not. R. Astron. Soc. 350 627.
  • Dehnen, W. (1998). The distribution of nearby stars in velocity space inferred from Hipparcos data. Astron. J. 115 2384.
  • Dehnen, W. (2000). The effect of the outer Lindblad resonance of the galactic bar on the local stellar velocity distribution. Astron. J. 119 800.
  • Dehnen, W. and Binney, J. J. (1998). Local stellar kinematics from Hipparcos data. Mon. Not. R. Astron. Soc. 298 387.
  • Delaigle, A. and Meister, A. (2008). Density estimation with heteroscedastic error. Bernoulli 14 562–579.
  • Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 39 1–38.
  • Diebolt, J. and Robert, C. P. (1994). Estimation of finite mixture distributions through Bayesian sampling. J. R. Stat. Soc. Ser. B 56 363–375.
  • ESA (1997). The Hipparcos and Tycho Catalogues. ESA SP-1200, Noordwijk.
  • Famaey, B., Jorissen, A., Luri, X., Mayor, M., Udry, S., Dejonghe, H. and Turon, C. (2005). Local kinematics of K and M giants from CORAVEL/Hipparcos/Tycho-2 data. Revisiting the concept of superclusters. Astron. Astrophys. 430 165.
  • Gelman, A., Carlin, J. B., Stern, H. S. and Rubin, D. B. (2000). Bayesian Data Analysis. Chapman and Hall/CRC, Boca Raton, FL.
  • Ghahramani, Z. and Beal, M. J. (2000). Variational inference for Bayesian mixtures of factor analysers. In Advances in Neural Information Processing Systems 12 (S. A. Solla, T. K. Leen and K. R. Muller, eds.) 449. MIT Press, Cambridge, MA.
  • Ghahramani, Z. and Jordan, M. I. (1994a). Learning from incomplete data. CBCL Technical Report No. 108, Center for Biological and Computational Learning, MIT.
  • Ghahramani, Z. and Jordan, M. I. (1994b). Supervised learning from incomplete data via an EM approach. In Advances in Neural Information Processing Systems 6 (J. D. Cowan, G. Tesauro and J. Alspector, eds.) 120–127. Morgan Kaufmann, San Francisco.
  • Helmi, A., White, S. D. M., de Zeeuw, P. T. and Zhao, H. (1999). Debris streams in the solar neighbourhood as relicts from the formation of the milky way. Nature 402 53–55.
  • Hogg, D. W., Blanton, M. R., Roweis, S. T. and Johnston, K. V. (2005). Modeling complete distributions with incomplete observations: The velocity ellipsoid from Hipparcos data. Astrophys. J. 629 268.
  • Holmberg, J., Nordström, B. and Andersen, J. (2009). The Geneva–Copenhagen survey of the solar neighbourhood III. Improved distances, ages, and kinematics. Astron. Astrophys. 501 941.
  • Jasra, A., Holmes, C. C. and Stephens, D. A. (2005). Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modeling. Statist. Sci. 20 50–67.
  • MacKay, D. J. C. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge Univ. Press, Cambridge.
  • McLachlan, G. J. and Basford, K. (1988). Mixture Models: Inference and Application to Clustering. Dekker, New York.
  • Nordström, B., Mayor, M., Andersen, J., Holmberg, J., Pont, F., Jørgensen, B. R., Olsen, E. H., Udry, S. and Mowlavi, N. (2004). The Geneva–Copenhagen survey of the solar neighbourhood. Ages, metallicities, and kinematic properties of ∼14,000 F and G dwarfs. Astron. Astrophys. 418 989.
  • Oliver, J. J., Baxter, R. A. and Wallace, C. S. (1996). Unsupervised learning using MML. In Machine Learning: Proceedings of the Thirteenth International Conference (ICML 96) 364. Morgan Kaufmann, San Francisco.
  • Ormoneit, D. and Tresp, V. (1996). Improved Gaussian mixture density estimates using Bayesian penalty terms and network averaging. In Advances in Neural Information Processing Systems 8, NIPS, Denver, CO, November 27–30, 1995 (D. S. Touretzky, M. Mozer and M. E. Hasselmo, eds.) 542–548. MIT Press, Cambridge.
  • Quillen, A. C. and Minchev, I. (2005). The effect of spiral structure on the stellar velocity distribution in the solar neighborhood. Astron. J. 130 576.
  • Rabiner, L. and Biing-Hwang, J. (1993). Fundamentals of Speech Recognition. Prentice-Hall, New York.
  • Rasmussen, C. (2000). The infinite Gaussian mixture model. In Advances in Neural Information Processing Systems 12 (S. A. Solla, T. K. Leen and K. R. Muller, eds.) 554–560. MIT Press, Cambridge.
  • Richardson, S. and Green, P. J. (1997). On Bayesian analysis of mixtures with an unknown number of components. J. R. Stat. Soc. Ser. B 59 731–792.
  • Rissanen, J. (1978). Modeling by shortest data description. Automatica 14 465.
  • Roberts, S. J., Husmeier, D., Rezek, I. and Penny, W. (1998). Bayesian approaches to Gaussian mixture modeling. IEEE Trans. Pattern Anal. Mach. Intell. 20 1133.
  • Schafer, D. W. (1993). Likelihood analysis for probit regression with measurement errors. Biometrika 80 899.
  • Schafer, D. W. and Purdy, K. G. (1996). Likelihood analysis for errors-in-variables regression with replicate measurements. Biometrika 83 813–824.
  • Schwartz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6 461–464.
  • Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall, Boca Raton, FL.
  • Skuljan, J., Hearnshaw, J. B. and Cottrell, P. L. (1999). Velocity distribution of stars in the solar neighbourhood. Mon. Not. R. Astron. Soc. 308 731.
  • Staudenmayer, J., Ruppert, D. and Buonaccorsi, J. (2008). Density estimation in the presence of heteroscedastic measurement error. J. Amer. Statist. Assoc. 103 726–736.
  • Stefanski, L. A. and Carroll, R. J. (1990). Deconvoluting kernel density estimators. Statistics 21 169–184.
  • Stone, M. (1974). Cross-validation choice and assessment of statistical predictions. J. R. Stat. Soc. Ser. B 36 111–147.
  • Ueda, N., Nakano, R., Ghahramani, Z. and Hinton, G. E. (1998). Split and merge EM algorithm for improving Gaussian mixture density estimates. In Neural Networks for Signal Processing VIII: Proceedings of the 1998 IEEE Signal Processing Society Workshop 274–283. IEEE.
  • van Leeuwen, F. (2007a). Hipparcos, the New Reduction of the Raw Data. Astrophysics and Space Science Library 250. Springer, Dordrecht.
  • van Leeuwen, F. (2007b). Validation of the new Hipparcos reduction. Astron. Astrophys. 474 653.
  • Wallace, C. S. and Boulton, D. M. (1968). An information measure for classification. Comput. J. 11 185.
  • Wu, C. F. J. (1983). On the convergence properties of the EM algorithm. Ann. Statist. 11 95–103.
  • Zhang, C. H. (1990). Fourier methods for estimating mixing densities and distributions. Ann. Statist. 18 806–831.