Extreme deconvolution: Inferring complete distribution functions from noisy, heterogeneous and incomplete observations
Abstract
We generalize the well-known mixtures of Gaussians approach to density estimation and the accompanying Expectation–Maximization technique for finding the maximum likelihood parameters of the mixture to the case where each data point carries an individual uncertainty covariance and has unique missing-data properties. This algorithm reconstructs the error-deconvolved or “underlying” distribution function common to all samples, even when the individual data points are samples from different distributions, obtained by convolving the underlying distribution with the heteroskedastic uncertainty distribution of the data point and projecting out the missing-data directions. We show how this basic algorithm can be extended with conjugate priors on all of the model parameters and a “split-and-merge” procedure designed to avoid local maxima of the likelihood. We demonstrate the full method by applying it to the problem of inferring the three-dimensional velocity distribution of stars near the Sun from noisy two-dimensional, transverse velocity measurements from the Hipparcos satellite.
DOI: 10.1214/10-AOAS439. The Annals of Applied Statistics, Volume 5, Issue 2B (2011), pages 1657–1677.
Extreme deconvolution
Jo Bovy (jo.bovy@nyu.edu), David W. Hogg (david.hogg@nyu.edu) and Sam T. Roweis (roweis@cs.nyu.edu). Supported in part by NASA (Grant NNX08AJ48G) and the NSF (Grant AST-0908357). During part of the period in which this research was performed, D. W. Hogg was a research fellow of the Alexander von Humboldt Foundation of Germany at the Max-Planck-Institut für Astronomie, Heidelberg, Germany. Sam T. Roweis is deceased.
Keywords: Bayesian inference, density estimation, Expectation–Maximization, missing data, multivariate estimation, noise.
1 Introduction
Inferring a distribution function given a finite set of samples from this distribution function and the related problem of finding clusters and/or overdensities in the distribution is a problem of significant general interest [e.g., McLachlan and Basford (1988); Rabiner and Biing-Hwang (1993); Dehnen (1998); Helmi et al. (1999); Skuljan, Hearnshaw and Cottrell (1999); Hogg et al. (2005)]. Performing this inference given only a noisy set of measurements is a problem commonly encountered in many of the sciences [see examples in Carroll et al. (2006)]. In many cases of interest, the noise properties of the observations are different from one measurement to the next (i.e., they are heteroskedastic), even though the uncertainties are well characterized for each observation. This is, for example, often the case in astronomy, where in many cases the dominant source of uncertainty is due to well-characterized photon-counting statistics, while spatial and temporal variations in the atmosphere cause the uncertainties to vary significantly even for sources of the same apparent brightness observed with the same telescope.
The distribution you are interested in as a scientist is not the observed distribution; what you really want is a description of the distribution that you would have if you had good data, that is, data with vanishingly small uncertainties and with all of the dimensions measured. In the low signal-to-noise regime the data never have these two properties, so the underlying, true distribution cannot be found without taking the noise properties of the data into account. To compare a model of the underlying distribution with the data, you need to convolve the model with the data uncertainties, not deconvolve the data. When the given set of data has heterogeneous noise properties, that is, when the uncertainty convolution is different for each data point, each data point is a sample from a different distribution: the distribution obtained by convolving the true, underlying distribution with the noise of that particular observation. Incomplete data pose a similar problem when the part of the data that is missing differs from one data point to the next.
Most existing approaches to density estimation only apply in the high signaltonoise regime [e.g., McLachlan and Basford (1988); Silverman (1986); Diebolt and Robert (1994)], and most approaches to density estimation from noisy samples are nonparametric techniques that assume that the noise is homoskedastic [e.g., Stefanski and Carroll (1990); Zhang (1990)]. The case of heteroskedastic uncertainties has only recently attracted attention [e.g., Delaigle and Meister (2008); Staudenmayer, Ruppert and Buonaccorsi (2008)], and all of the approaches that have been developed so far are nonparametric. None of these approaches can be used when only incomplete data are available, although parametric techniques that properly account for incomplete, but noiseless, data have been developed [Ghahramani and Jordan (1994a, 1994b)].
In this paper we show that the frequently used Gaussian-mixture-model approach to density estimation can be generalized to handle noisy, heterogeneous and incomplete data. The likelihood of the model for each data point is given by the model convolved with the (unique) uncertainty distribution of that data point; the objective function is obtained by simply multiplying these individual likelihoods together for the various data points. Optimizing this objective function, one obtains a maximum likelihood estimate of the distribution (more specifically, of its parameters).
While optimization of this objective function can, in principle, be performed by a generic optimizer, we develop an Expectation–Maximization (EM) algorithm that optimizes the objective function. This algorithm works in much the same way as the normal EM algorithm for mixture-of-Gaussians density estimation, except that an additional degree of incompleteness is given by the actual values of the observables, since we only have access to noisy projections of these; in the expectation step these actual values are estimated based on the noisy and incomplete measured values and the current estimate of the distribution function. In the limit in which the noise is absent but the data are lower-dimensional projections of the quantities of interest, this algorithm reduces to the algorithm described in Ghahramani and Jordan (1994a, 1994b).
We also show how prior distributions for a Bayesian version of the calculation reporting a MAP estimate can be naturally included in this algorithm, as well as how a split-and-merge procedure that heuristically searches parameter space for better approximations to the global maximum can be incorporated in this approach. These priors and the split-and-merge procedure can be important when applying the EM algorithm developed here in situations with real data, where the likelihood surface can have a very complicated structure. We also briefly discuss the practical issues of model selection in the mixture model approach.
An application to a real data set is given in Section 5, where we fit the distribution of stellar velocities near the Sun. The observed velocities of stars that we use for this purpose have all of the properties that the approach developed in this paper handles correctly: The velocity measurements are noisy, and since we only use observations of the velocity components in the plane of the sky, the data are incomplete; this incompleteness is different for each velocity measurement, since the sample covers the full sky. Nevertheless, we are able to obtain good agreement with other fits of the velocity distribution based on complete data.
The technique we describe below has many applications besides returning a maximum likelihood fit to the error-deconvolved distribution function of a data sample. For instance, when an estimate of the uncertainty in the estimated parameters or distribution function is desired, or when a full Bayesian analysis of the mixture model is preferred, the outcome of the maximum likelihood technique developed here can be used as a seed for Markov Chain Monte Carlo (MCMC) methods for finite mixture modeling [e.g., Diebolt and Robert (1994); Richardson and Green (1997)].
2 Likelihood of a mixture of Gaussian distributions given a set of heterogeneous, noisy samples
Our goal is to fit a model for the distribution of a d-dimensional quantity v using a set of observational data points w_i. Therefore, we need to write down the probability of the data under the model for the distribution. The observations w_i are assumed to be noisy projections of the true values v_i,
(1) w_i = R_i v_i + noise,
where the noise is drawn from a Gaussian with zero mean and known covariance matrix S_i. The case in which there is missing data occurs when the projection matrix R_i is rank-deficient. Alternatively, we can handle the missing-data case by describing the missing data as directions of the covariance matrix that have a formally infinite eigenvalue; in practice, we use very large eigenvalues in the noise matrix S_i. When the data have only a small degree of incompleteness, that is, when each data point has only a small number of unmeasured dimensions, this latter approach is often the best choice, since one often has some idea about the unmeasured values. For example, in the example given below of inferring the velocity distribution of stars near the Sun, we know that the stars are moving at velocities that do not exceed the speed of light, which is not very helpful, but also that none of the velocities exceed the local Galactic escape speed, since we can safely assume that all the stars are bound to the Galaxy. However, in situations in which each data point has observations of a dimensionality much smaller than d, using the projection matrices will greatly reduce the computational cost, since, as will become clear below, the most computationally expensive operations all take place in the lower-dimensional space of the observations.
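The two encodings of missing data described above (a rank-deficient projection matrix, or a formally huge noise eigenvalue in the unmeasured direction) can be checked against each other numerically. The following sketch, with made-up numbers, evaluates the likelihood of a single Gaussian component under both encodings; they agree up to the constant normalization of the nearly flat missing direction.

```python
import numpy as np

# One mixture component in d = 2 dimensions (illustrative numbers).
m = np.array([1.0, -2.0])
V = np.array([[2.0, 0.3], [0.3, 1.0]])

# Only the first coordinate is measured, with noise variance 0.25.
w_obs = 1.4

# Encoding 1: rank-deficient projection onto the measured subspace.
R = np.array([[1.0, 0.0]])           # 1 x 2 projection matrix
S = np.array([[0.25]])
T = R @ V @ R.T + S                  # convolved covariance
resid = w_obs - (R @ m)[0]
ll_proj = -0.5 * (np.log(2 * np.pi * T[0, 0]) + resid**2 / T[0, 0])

# Encoding 2: keep both coordinates; give the missing one a huge variance.
BIG = 1e10
S2 = np.diag([0.25, BIG])
T2 = V + S2                          # the projection is the identity here
d = np.array([w_obs, 0.0]) - m       # 0.0 is a placeholder for the missing datum
ll_full = -0.5 * (np.log(np.linalg.det(2 * np.pi * T2))
                  + d @ np.linalg.solve(T2, d))

# The encodings differ only by the flat direction's constant normalization.
offset = -0.5 * np.log(2 * np.pi * BIG)
```

As the text notes, the projection-matrix encoding is preferable when many dimensions are missing, since all expensive operations then happen in the lower-dimensional observed space.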
We will model the probability density of the true values v as a mixture of K Gaussians,
(2) p(v) = Σ_{j=1}^{K} α_j N(v | m_j, V_j),
where the amplitudes α_j sum to unity and the function N(v | m, V) is the Gaussian probability density function with mean m and variance matrix V,
(3) N(v | m, V) = |2π V|^{−1/2} exp[ −(1/2) (v − m)^T V^{−1} (v − m) ].
For a given observation w_i, the likelihood of the model parameters θ ≡ {α_j, m_j, V_j} given that observation and the noise covariance S_i, which we will write as p(w_i | θ), can be written as
(4) p(w_i | θ) = ∫ dv p(w_i | v) p(v | θ),
where
(5) p(w_i | v) = N(w_i | R_i v, S_i).
This likelihood works out to be itself a mixture of Gaussians
(6) p(w_i | θ) = Σ_j α_j N(w_i | R_i m_j, T_ij),
where
(7) T_ij ≡ R_i V_j R_i^T + S_i.
The free parameters θ of this model can now be chosen so as to maximize an explicit, justified, scalar objective function φ, given here by the logarithm (log) of the likelihood of the model given the data, that is,
(8) φ = Σ_i ln p(w_i | θ) = Σ_i ln Σ_j α_j N(w_i | R_i m_j, T_ij).
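As a concrete illustration, the objective function of equation (8) can be evaluated directly with a few lines of linear algebra. The toy two-component model, the random one-dimensional projections and all variable names below are our own illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_gauss(w, mu, T):
    """log N(w | mu, T) for a small dense covariance T."""
    d = w - mu
    _, logdet = np.linalg.slogdet(2 * np.pi * T)
    return -0.5 * (logdet + d @ np.linalg.solve(T, d))

def objective(ws, Rs, Ss, alpha, means, covs):
    """phi of eq. (8): sum_i ln sum_j alpha_j N(w_i | R_i m_j, R_i V_j R_i^T + S_i)."""
    phi = 0.0
    for w, R, S in zip(ws, Rs, Ss):
        terms = [np.log(a) + log_gauss(w, R @ m, R @ V @ R.T + S)
                 for a, m, V in zip(alpha, means, covs)]
        phi += np.logaddexp.reduce(terms)
    return phi

# Toy model: two 2-D Gaussians; each datum is a noisy 1-D projection.
alpha = [0.6, 0.4]
means = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
covs = [np.eye(2), 0.5 * np.eye(2)]

ws, Rs, Ss = [], [], []
for _ in range(300):
    j = rng.choice(2, p=alpha)
    v = rng.multivariate_normal(means[j], covs[j])       # true value
    th = rng.uniform(0, np.pi)
    R = np.array([[np.cos(th), np.sin(th)]])             # datum-specific projection
    s2 = rng.uniform(0.1, 1.0)                           # heteroskedastic noise
    ws.append(R @ v + rng.normal(0.0, np.sqrt(s2), 1))
    Rs.append(R)
    Ss.append(np.array([[s2]]))

phi_true = objective(ws, Rs, Ss, alpha, means, covs)
phi_shifted = objective(ws, Rs, Ss, alpha, [m + 3.0 for m in means], covs)
```

With a few hundred samples the true parameters should score a higher likelihood than badly shifted means.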
This function can be optimized in several ways, one of which is to use a generic optimizer to increase the likelihood until it reaches a maximum. This approach is complicated by parameter constraints (e.g., the amplitudes must all be nonnegative and add up to one, the variance matrices must be positive definite and symmetric) and multimodality of the likelihood surface. In what follows we will describe a different approach that is natural in this setting: an EM algorithm that iteratively and monotonically maximizes the likelihood, while naturally respecting the restrictions on the parameters.
3 Fitting mixtures with heterogeneous, noisy data using an EM algorithm
To optimize the likelihood in equation (8), we can extend the standard EM approach to Gaussian mixture estimation. In the case of complete and precise observations, the problem is framed as a tractable missing-data problem by positing that the labels or indicator variables q_ij indicating which Gaussian a data point was drawn from are missing [Dempster, Laird and Rubin (1977)]. We extend this approach by including the true values v_i as extra missing data. This is a well-known approach for handling measurement uncertainty in latent variable or random effects models [e.g., Schafer (1993); Schafer and Purdy (1996)].
We write down the “full data” log likelihood—the likelihood we would write down if we had the true values v_i and the labels q_ij,
(9) ln L_full = Σ_i Σ_j q_ij [ ln α_j + ln N(w_i | R_i v_i, S_i) + ln N(v_i | m_j, V_j) ].
We will now show how we can use the EM methodology to find straightforward update steps that maximize the full data likelihood of the model. In Appendix A we prove that these updates also maximize the likelihood of the model given the noisy observations.
The E-step consists as usual of taking the expectation of the full data likelihood with respect to the current model parameters θ. Writing out the full data log likelihood from equation (9) and taking this expectation, we find
(10) ⟨ln L_full⟩ = Σ_i Σ_j ⟨q_ij⟩ [ ln α_j + ⟨ln N(w_i | R_i v_i, S_i)⟩_j + ⟨ln N(v_i | m_j, V_j)⟩_j ],
where ⟨·⟩_j denotes the expectation with respect to the conditional distribution of v_i given the data and the component j,
which shows that in addition to the expectation of the indicator variables q_ij for each component, we also need the expectation of the terms v_i and the expectation of the terms v_i v_i^T given the data, the current model estimate and the component j. The expectation of the q_ij is equal to the posterior probability that data point i was drawn from component j. The expectations of the v_i and the v_i v_i^T can be found as follows: Consider the joint distribution of the true and observed values, denoted by the expanded vector (v_i, w_i), given the model estimate and the component j. From the description of the problem, we can see that this vector is distributed normally with mean
(11) ( m_j , R_i m_j )^T
and covariance matrix
(12) [ [ V_j , V_j R_i^T ] , [ R_i V_j , T_ij ] ].
The conditional distribution of the v_i given the data w_i is normal with mean
(13) b_ij ≡ m_j + V_j R_i^T T_ij^{−1} (w_i − R_i m_j)
and covariance matrix
(14) B_ij ≡ V_j − V_j R_i^T T_ij^{−1} R_i V_j.
Thus, we see that the expectation of v_i given the data w_i, the model estimate and the component j is given by b_ij, whereas the expectation of v_i v_i^T given the same is given by B_ij + b_ij b_ij^T.
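The conditional moments of equations (13) and (14) are ordinary Gaussian conditioning formulas. The sketch below, with hypothetical numbers, computes b_ij and B_ij and cross-checks them against the algebraically equivalent information-form expressions (a Woodbury identity).

```python
import numpy as np

def conditional_moments(w, R, S, m, V):
    """Posterior mean b and covariance B of the true value v given the
    observation w, as in equations (13) and (14)."""
    T = R @ V @ R.T + S                  # eq. (7)
    G = V @ R.T @ np.linalg.inv(T)       # "gain" matrix
    b = m + G @ (w - R @ m)              # eq. (13)
    B = V - G @ R @ V                    # eq. (14)
    return b, B

# Hypothetical numbers: a 2-D truth observed through a noisy 1-D projection.
m = np.array([1.0, 2.0])
V = np.array([[1.0, 0.4], [0.4, 2.0]])
R = np.array([[1.0, 1.0]])
S = np.array([[0.5]])
w = np.array([4.0])

b, B = conditional_moments(w, R, S, m, V)

# Cross-check against the equivalent information form:
# B = (V^-1 + R^T S^-1 R)^-1 and b = B (V^-1 m + R^T S^-1 w).
B_info = np.linalg.inv(np.linalg.inv(V) + R.T @ np.linalg.inv(S) @ R)
b_info = B_info @ (np.linalg.inv(V) @ m + R.T @ np.linalg.inv(S) @ w)
```

The gain-matrix form is preferable when the observed dimensionality is lower than d, since the only inverse it needs is that of the small matrix T.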
Given this, the expectation of the full data log likelihood is given by
(15) ⟨ln L_full⟩ = Σ_i Σ_j q_ij [ ln α_j − (1/2) ln |2π V_j| − (1/2) tr( V_j^{−1} [ B_ij + (b_ij − m_j)(b_ij − m_j)^T ] ) ] + const,
where the constant collects terms that do not depend on the model parameters.
Straightforward optimization of this with respect to the model parameters gives the following algorithm:
(16)
q_ij ← α_j N(w_i | R_i m_j, T_ij) / Σ_k α_k N(w_i | R_i m_k, T_ik),
α_j ← (1/N) Σ_i q_ij,
m_j ← (1/q_j) Σ_i q_ij b_ij,
V_j ← (1/q_j) Σ_i q_ij [ (m_j − b_ij)(m_j − b_ij)^T + B_ij ],
where q_j ≡ Σ_i q_ij and N is the number of data points.
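A minimal NumPy sketch of one full E-step plus M-step of equation (16) on simulated data follows; the toy setup and all variable names are illustrative, not the paper's. Each sweep is guaranteed not to decrease the objective of equation (8), which is a useful invariant to assert in any implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_gauss(w, mu, T):
    d = w - mu
    _, logdet = np.linalg.slogdet(2 * np.pi * T)
    return -0.5 * (logdet + d @ np.linalg.solve(T, d))

def em_step(ws, Rs, Ss, alpha, means, covs):
    """One E-step plus M-step of eq. (16).  Returns the updated parameters
    and the log likelihood of eq. (8) at the *incoming* parameters."""
    N, K = len(ws), len(alpha)
    dim = means[0].size
    logq = np.empty((N, K))
    bs = np.empty((N, K, dim))
    Bs = np.empty((N, K, dim, dim))
    for i, (w, R, S) in enumerate(zip(ws, Rs, Ss)):
        for j in range(K):
            T = R @ covs[j] @ R.T + S
            logq[i, j] = np.log(alpha[j]) + log_gauss(w, R @ means[j], T)
            G = covs[j] @ R.T @ np.linalg.inv(T)
            bs[i, j] = means[j] + G @ (w - R @ means[j])     # eq. (13)
            Bs[i, j] = covs[j] - G @ R @ covs[j]             # eq. (14)
    ll = np.logaddexp.reduce(logq, axis=1)
    q = np.exp(logq - ll[:, None])        # posterior q_ij; rows sum to one
    qj = q.sum(axis=0)
    alpha_new = qj / N
    means_new = [(q[:, j, None] * bs[:, j]).sum(axis=0) / qj[j] for j in range(K)]
    covs_new = []
    for j in range(K):
        diff = means_new[j] - bs[:, j]
        outer = diff[:, :, None] * diff[:, None, :] + Bs[:, j]
        covs_new.append((q[:, j, None, None] * outer).sum(axis=0) / qj[j])
    return alpha_new, means_new, covs_new, ll.sum()

# Simulated data: 2-D two-component mixture seen through noisy 1-D projections.
true_means = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
ws, Rs, Ss = [], [], []
for _ in range(400):
    j = rng.choice(2)
    v = rng.multivariate_normal(true_means[j], np.eye(2))
    th = rng.uniform(0, np.pi)
    R = np.array([[np.cos(th), np.sin(th)]])
    s2 = rng.uniform(0.1, 0.5)
    ws.append(R @ v + rng.normal(0.0, np.sqrt(s2), 1))
    Rs.append(R)
    Ss.append(np.array([[s2]]))

# Deliberately poor starting point.
alpha = np.array([0.5, 0.5])
means = [np.array([1.0, -1.0]), np.array([3.0, 6.0])]
covs = [4.0 * np.eye(2), 4.0 * np.eye(2)]
lls = []
for _ in range(30):
    alpha, means, covs, ll = em_step(ws, Rs, Ss, alpha, means, covs)
    lls.append(ll)
```

Note that the covariance update uses the freshly updated means, as in equation (16); this double loop is a didactic sketch, not an efficient implementation.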
4 Extensions to the basic algorithm
Singularities and local maxima are two problems that can severely limit the generalization capabilities of the computed density estimates for inferring the densities of unknown data points. These are commonly encountered when using the EM algorithm to iteratively compute the maximum likelihood estimates of Gaussian mixtures: Singularities arise when a covariance matrix V_j in equation (3) becomes singular; the EM updates might get stuck in a local maximum because of the monotonic increase in likelihood ensured by the EM algorithm. The latter can be avoided through the use of a stochastic EM procedure [Broniatowski, Celeux and Diebolt (1983); Celeux and Diebolt (1985); Celeux and Diebolt (1986)] or through the split-and-merge procedure described below.
4.1 Bayesianinspired regularization
The problem of singular covariances can be mitigated through the use of priors on the model parameters in a Bayesian setting [Ormoneit and Tresp (1996)]. It should be emphasized that this calculation is only Bayesian in the sense of producing a maximum a posteriori (MAP) point estimate rather than a maximum likelihood estimate, and that this is no different from penalized maximum likelihood. We briefly show that this procedure can be applied here as well.
The regularization scheme of Ormoneit and Tresp (1996) introduces conjugate priors on the Gaussian-mixture parameter space as penalty terms. These conjugate priors are the following: a normal density for the mean of each Gaussian, a Wishart density [Gelman et al. (2000)],
(17) W(V_j^{−1} | ν_j, Ω_j) = c |V_j^{−1}|^{(ν_j − d − 1)/2} exp[ −(1/2) tr( Ω_j V_j^{−1} ) ],
with c a normalization constant, for the covariance of each Gaussian, and a Dirichlet density D(α_1, …, α_K | γ_1, …, γ_K), given by
(18) D(α_1, …, α_K | γ_1, …, γ_K) = c̃ Π_j α_j^{γ_j − 1},
where c̃ is a normalizing factor, for the amplitudes α_j. Optimizing the posterior distribution for the model parameters replaces the M-step of equation (16) with the standard conjugate-prior MAP updates
(19)
α_j ← ( Σ_i q_ij + γ_j − 1 ) / ( N + Σ_k γ_k − K ),
m_j ← ( Σ_i q_ij b_ij + η_j μ_j ) / ( q_j + η_j ),
V_j ← ( Σ_i q_ij [ (m_j − b_ij)(m_j − b_ij)^T + B_ij ] + η_j (m_j − μ_j)(m_j − μ_j)^T + Ω_j ) / ( q_j + ν_j − d ),
where μ_j and η_j are the mean and precision scale of the normal prior on m_j.
Hyperparameters can be set by leave-one-out cross-validation. Vague priors on the amplitudes and the means are obtained by setting
(20) γ_j = 1, η_j = 0.
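To illustrate why the prior on the covariances matters, here is a sketch of a regularized covariance M-step in a generic conjugate-prior form (the hyperparameter names ν and Ω and their exact placement are our assumptions, not necessarily the paper's convention): even when all responsibility collapses onto a single point, the additive scale matrix keeps the update positive definite.

```python
import numpy as np

def regularized_cov_update(q, outer_plus_B, Omega, nu):
    """Covariance M-step with a conjugate (Wishart-type) penalty: the prior
    scale matrix Omega is added to the weighted scatter, and nu acts as a
    number of extra 'prior observations'.  Hyperparameter placement here is
    illustrative."""
    dim = Omega.shape[0]
    scatter = np.einsum('i,ijk->jk', q, outer_plus_B)
    return (scatter + Omega) / (q.sum() + nu - dim)

# Degenerate case: all responsibility on a single point with zero spread,
# which would make the unregularized update singular.
q = np.array([1.0, 0.0])
outer_plus_B = np.zeros((2, 2, 2))    # (m_j - b_ij)(...)^T + B_ij, all zero
w = 4.0                               # scalar regularization parameter
V = regularized_cov_update(q, outer_plus_B, w * np.eye(2), nu=3)
# V = 4*I / (1 + 3 - 2) = 2*I: positive definite despite the degenerate data.
```

Without the Ω term this update would return the zero matrix, the singularity the regularization is designed to prevent.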
Since we are only interested in the MAP estimate, propriety of the resulting posterior is not an issue with the improper prior resulting from this choice.
The labelswitching problem in Bayesian mixtures [Jasra, Holmes and Stephens (2005)] is not an issue for the maximization of the posterior distribution here.
4.2 Avoiding local maxima
The split-and-merge algorithm starts from the basic EM algorithm, with or without the Bayesian regularization of the variances, and jumps into action after the EM algorithm has reached a maximum, which more often than not is only a local maximum. At this point, three of the Gaussians in the mixture are singled out: two of these Gaussians are merged, while the third Gaussian is split into two Gaussians [Ueda et al. (1998)]. An alternative, but similar, approach to avoiding local maxima is given by the birth and death moves in reversible jump MCMC [Richardson and Green (1997)] or variational approaches [Ghahramani and Beal (2000); Beal (2003)] to mixture modeling. These moves do not conserve the number of mixture components and are therefore less suited for our approach, in which the number of components is held fixed.
Full details of the split and merge algorithm are given in Appendix B.
4.3 Setting the remaining free parameters
No real-world application of Gaussian mixture density estimation is complete without a well-specified methodology for setting the number of Gaussian components K and any hyperparameters introduced in the Bayesian regularization described above, such as the covariance regularization parameter w. If we further assume that Ω_j = w 1, with 1 the unit matrix, then this covariance regularization parameter basically sets the square of the smallest scale of the distribution function on which we can reliably infer small-scale features. Therefore, this scale could be set by hand to the smallest scale we believe we have access to based on the properties of the data set.
In order to get the best results, the parameters K and w should be set by some objective procedure. As mentioned above, leave-one-out cross-validation [Stone (1974)] could be used to set the regularization parameter w, and the number of Gaussians K could be set by this procedure as well. Other techniques include methods based on Bayesian model selection [Roberts et al. (1998)] as well as approaches based on minimum encoding inference [Wallace and Boulton (1968); Oliver, Baxter and Wallace (1996); Rissanen (1978); Schwarz (1978)]. These methods have difficulty dealing with significant overlap between components (such as the overlap we see in the example in Figure 1), although there are methods to deal with such situations [Baxter (1995)]. Alternatively, when a separate, external data set is available, we can use it as a test data set to validate the obtained distribution function. All of these methods are explored in an accompanying paper on the velocity distribution of stars in the Solar neighborhood from measurements from the Hipparcos satellite [see below; Bovy, Hogg and Roweis (2009)].
A rather different approach to the model selection problem is to avoid it altogether. That is, by introducing priors over the hyperparameters and including them as part of the model, it is often possible to infer, or fully marginalize over, them simultaneously with the parameters of the components of the mixture. These methods also address uncertainty quantification throughout the model. Such approaches include reversible jump MCMC methods [Richardson and Green (1997)], mixtures consisting of an infinite number of components based on the Dirichlet process [Rasmussen (2000)], or approximate, variational algorithms [Ghahramani and Beal (2000); Beal (2003)]. Extending these approaches to deal with noisy, heterogeneous and incomplete data is beyond the scope of this paper, but it is clear that this extension is, in principle, straightforward: the MCMC methods mentioned above can include the true values v_i of the observations (known in the Bayesian MCMC literature as data augmentation), and these can be Gibbs sampled, given the current model and the observed values, in an MCMC sweep from the Gaussian with mean given in equation (13) and variance given in equation (14).
5 The velocity distribution from Hipparcos data
We have applied the technique developed in this paper to the problem of inferring the velocity distribution of stars in the Solar neighborhood from transverse angular data from the Hipparcos satellite and we present in this section some results of this study to demonstrate the performance of the algorithm on a real data set. A more detailed and complete account of this study is presented elsewhere [Bovy, Hogg and Roweis (2009)].
The local velocity distribution is interesting because we can learn about the structure and evolution of the Galactic disk from deviations from the smooth, close to Gaussian velocity distribution expected in simple, axisymmetric models of the disk. It has been shown that the dynamical effects of a bar-shaped distribution of stars in the central region of our Galaxy can resonate in the outer parts of the disk and give rise to structure in the velocity distribution [e.g., Dehnen (2000); Bovy (2010)]. Similarly, steady-state or transient spiral structure can affect the velocities of stars in a coherent way, such that we can see this effect locally [e.g., Quillen and Minchev (2005); De Simone, Wu and Tremaine (2004)]. Inferring the local velocity distribution from observational data is therefore necessary to assess whether these dynamical signatures are observed.
Velocities of stars are not directly observable. Rather, they need to be patched together from observations of the stars’ directions on the sky at different times—the branch of astronomy known as astrometry—and spectroscopic observations to determine the velocity along the line of sight. The annual motion of the Earth around the Sun gives rise to an apparent displacement of a star relative to background objects that is inversely proportional to the distance to the star. Measurements of this apparent shift, or parallax, can thus be used to determine the distance to stars. Parallaxes are traditionally reported in units of arcseconds; a star with a parallax of 1 arcsecond is defined to be at a distance of 1 parsec (pc), which equals 3.08 × 10^16 m. The intrinsic motion of a star also gives rise to a systematic shift in its position relative to background sources, such that its angular motion—known as its proper motion—can be measured. Combining the distance and angular velocity gives the components of the space velocity of a star that are perpendicular to the line of sight.
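The conversion from astrometric observables to a transverse velocity described above can be sketched in a couple of lines; the constant k = 4.74047 km s⁻¹ is one astronomical unit per year expressed in km s⁻¹, and the example numbers are hypothetical.

```python
# k = 4.74047 km/s converts (proper motion in arcsec/yr) / (parallax in
# arcsec) into a velocity in km/s.
K_AU_PER_YR = 4.74047

def transverse_velocity(parallax_arcsec, mu_arcsec_per_yr):
    """Transverse velocity in km/s from a parallax and a total proper motion."""
    return K_AU_PER_YR * mu_arcsec_per_yr / parallax_arcsec

# Hypothetical star at 100 pc (parallax 0.01") moving 0.05"/yr on the sky.
v_t = transverse_velocity(0.01, 0.05)
```

Note that the measured quantity is the parallax, not the distance, which is why (as discussed below) the derived velocities inherit strongly non-Gaussian errors unless the parallax is measured precisely.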
The astrometric ESA space mission Hipparcos, which collected data over a 3.2 year period around 1990, provided for the first time an all-sky catalogue of absolute parallaxes and proper motions, with typical uncertainties in these quantities on the order of milliarcseconds [ESA (1997)]. From this catalogue of about 100,000 stars, kinematically unbiased samples of stars with accurate positions and velocities can be extracted [Dehnen and Binney (1998)]. Since this was a purely astrometric mission, and the only components of a star’s velocity that can be measured astrometrically are the components perpendicular to the line of sight, the line-of-sight velocities of the stars in the Hipparcos sample were not obtained during the mission.
Distances in astronomy are notoriously hard to measure precisely, and at the accuracy level of the Hipparcos mission distances can only be reliably obtained for stars near the Sun (out to about 100 pc; the diameter of the Galactic disk is about 30,000 pc). In addition to this, since distances are measured as inverse distances (parallaxes), only distances that are measured relatively precisely will have approximately Gaussian uncertainties associated with them. Balancing the size of the sample with the accuracy of the distance measurement leaves us with distance uncertainties that are typically 10 percent, such that the velocities perpendicular to the line of sight that are obtained from the proper motions and the distances have low signal-to-noise. Since the dominant source of noise is due to simple photon-counting statistics [van Leeuwen (2007a)], the uncertainties are well characterized and can be assumed to be known, as is necessary for the technique developed in this paper to apply. Star-to-star correlations are negligible and can be ignored [van Leeuwen (2007a)].
Of course, if we want to describe the distribution of the velocities of the stars in this sample, we need to express the velocities in a common reference frame, which for kinematical studies of stars around the Sun is generally chosen to be the Galactic coordinate system, in which the x axis points toward the Galactic center, the y axis points in the direction of Galactic rotation, and the z axis points toward the North Galactic Pole [Blaauw et al. (1960); Binney and Merrifield (1998)]. The measured velocities perpendicular to the line of sight are then projections of the three-dimensional velocity of a star with respect to the Sun onto the two-dimensional plane perpendicular to the line of sight to the star. Therefore, this projection is different for each individual star.
Observations of celestial objects are expressed in the equatorial coordinate system, in which the Earth’s geographic poles and equator are projected onto the celestial sphere. The components of the three-dimensional velocities of the stars in the Galactic coordinate system in terms of the observed quantities in the equatorial coordinate frame—angular position on the sky (α, δ), inverse distance (the parallax ϖ), angular motion on the sky (μ_α*, μ_δ), and line-of-sight velocity (v_r)—are given by
(21) ( v_x , v_y , v_z )^T = T A ( v_r , k μ_α*/ϖ , k μ_δ/ϖ )^T,
where μ_α* ≡ μ_α cos δ, the parallax ϖ is expressed in arcsec, the proper motions in arcsec yr⁻¹, and k ≡ 4.74047 km s⁻¹, the velocity equivalent of one astronomical unit per year. The matrix T transforms the velocities from the equatorial reference frame in which the observations are made to the Galactic coordinate frame; it depends on a few parameters defining this coordinate transformation and is given by
(23) T = R_z(90° − θ_0) R_x(90° − δ_NGP) R_z(α_NGP + 90°),
where R_x(φ) and R_z(φ) denote rotations by the angle φ about the x and z axes, respectively.
The matrix T depends on the epoch that the reduced data are referred to (1991.25 for Hipparcos) through the values of α_NGP, δ_NGP and θ_0 (the position in equatorial coordinates of the north Galactic pole, and the Galactic longitude of the north Celestial pole, respectively). These quantities were defined for the epoch 1950.0 as follows [Blaauw et al. (1960)]: α_NGP = 12h49m, δ_NGP = 27.4°, and θ_0 = 123°.
The matrix A depends on the position (α, δ) of the source on the sky,
(24) A = [ [ cos α cos δ , −sin α , −cos α sin δ ] , [ sin α cos δ , cos α , −sin α sin δ ] , [ sin δ , 0 , cos δ ] ].
In the context of the deconvolution technique described above, the observations are w_i = ( v_r , k μ_α*/ϖ , k μ_δ/ϖ )^T and the projection matrix is R_i = (T A_i)^T. Since we do not use the radial velocities of the stars, we set v_r to zero in w_i and use a large uncertainty variance for this component of the uncertainty variance S_i; equivalently, we could remove v_r from w_i and restrict R_i to the projection on the sky.
We have studied the velocity distribution of a sample of main sequence stars selected to have accurate distance measurements (relative parallax uncertainties of at most 10 percent) and to be kinematically unbiased (in that the sample of stars faithfully represents the kinematics of similar stars). In detail, we use the sample of 11,865 stars from Dehnen and Binney (1998), but we use the new reduction of the Hipparcos raw data, which has improved the accuracy of the astrometric quantities [van Leeuwen (2007a; 2007b)]. A particular reconstruction of the underlying velocity distribution of the stars is shown in Figure 1, in which 10 Gaussians are used and a prior on the variances was applied (with the prior scale matrix restricted to the form Ω_j = w 1); this regularization parameter is set to w = 4 km² s⁻².
These values for the hyperparameters were set using an external data set rather than any of the other methods described in Section 4.3. For this we use a set of 7682 stars from the Geneva–Copenhagen Survey [Nordström et al. (2004); Holmberg, Nordström and Andersen (2009)] for which the line-of-sight velocity (perpendicular to the plane of the sky) has been measured spectroscopically. This is a separate data set from the one considered above; it partially overlaps with the previous data set and is also kinematically unbiased. We then fit the velocity distribution for different choices of the hyperparameters and evaluate the probability of the line-of-sight velocities under the best-fit velocity distribution based on tangential velocities. The values of the hyperparameters that maximize this probability are K = 10 and w = 4 km² s⁻².
The recovered distribution compares favorably with other reconstructions of the velocity distribution of stars in the Solar neighborhood, based on the same sample of stars [using a maximum penalized likelihood density estimation technique, Dehnen (1998)], as well as with those based on other samples of stars for which three-dimensional velocities are available [Skuljan, Hearnshaw and Cottrell (1999); Nordström et al. (2004); Famaey et al. (2005); Antoja et al. (2008)]. In particular, this means that the main shape of the velocity distribution agrees with that found in previous studies and that the number and location of the peaks in the distribution, all real and known features, are consistent with those found before. This includes the very sparsely populated feature at v ≈ −100 km s⁻¹, which is known as the Arcturus moving group. Therefore, we conclude that the method developed in this paper performs very well on this complicated data set. In contrast to previous determinations of the velocity distribution, our method allows us to study the structures found quantitatively, since it turns out that individual structures in the velocity distribution are well represented by individual components in the mixture model. Thus, we were able to conclude that these structures are not the remnants of a large group of stars that formed together in a cluster, but rather that they are probably caused by dynamical effects related to the bar at the center of the Milky Way or spiral structure [Bovy and Hogg (2010)].
The convergence of the algorithm is shown in Figure 2. Only split-and-merge steps that improved the likelihood are shown in this plot; therefore, the actual number of iterations is much higher than the number given on the horizontal axis. It is clear that the split-and-merge steps only slightly improve the initial estimate from the first EM procedure, but since what is shown is the likelihood per data point, the improvement in the total likelihood is more significant.
6 Implementation and code availability
The algorithm presented in this paper was implemented in the C programming language, depending only on the standard C library and the GNU Scientific Library (http://www.gnu.org/software/gsl/). The code is available at http://code.google.com/p/extremedeconvolution/; instructions for its installation and use are given there. The code can be compiled into a shared object library, which can then be incorporated into other projects or accessed through IDL (http://www.ittvis.com/ProductServices/IDL.aspx) or Python (http://www.python.org/) wrapper functions supplied with the C code.
The code can do everything described above. The convergence of a single run is quick, but when split-and-merge iterations are included the convergence is rather slow because of the large number of split-and-merge steps that the algorithm can take (the split-and-merge aspect of the algorithm, however, can easily be turned off or restricted by setting the parameter specifying the number of steps to go down the split-and-merge hierarchy).
7 Conclusions and future work
We have generalized the mixture of Gaussians approach to density estimation such that it can be applied to noisy, heterogeneous and incomplete data. The objective function is obtained by integrating over the unknown true values of the quantities for which we only have noisy and/or incomplete observations. In order to optimize the objective function resulting from this marginalization, we have derived an EM algorithm that monotonically increases the model likelihood; this EM algorithm, in which the E-step involves finding the expected value of the first and second moments of the true values of the observables given the current model and the noisy observations, reduces to the basic EM algorithm for Gaussian mixture modeling in the limit of noiseless data. We have shown that the model can incorporate conjugate priors on all of the model parameters without losing any of its analytical attractiveness, and that the algorithm can accommodate the split-and-merge algorithm to deal with the presence of local maxima, from which this EM algorithm, like many other EM algorithms, suffers.
The work presented here can be extended and incorporated into more nonparametric approaches to density modeling, for example, mixture models with an infinite number of components based on the Dirichlet process [e.g., Rasmussen (2000)]. In this way, current advances in nonparametric modeling can be brought to the low signal-to-noise sciences, where the assumption of complete and noise-free data is more often than not untenable and unattainable.
Appendix A Proof that the proposed algorithm maximizes the likelihood
We use Jensen's inequality in the continuous case,
\[
f\left(\int \mathrm{d}x\, p(x)\, g(x)\right) \geq \int \mathrm{d}x\, p(x)\, f(g(x)),
\]
for a concave function $f(x)$ and a nonnegative integrable function $p(x)$, where we have assumed that $p(x)$ is normalized, that is, $p(x)$ is a probability distribution. For each observation $\mathbf{w}_i$ we can then introduce a function $q_i(\mathbf{v}_i, j)$ such that
(25)
\[
\begin{aligned}
\ln p(\mathbf{w}_i | \theta) &= \ln \sum_j \int \mathrm{d}\mathbf{v}_i\, p(\mathbf{w}_i, \mathbf{v}_i, j | \theta)
= \ln \sum_j \int \mathrm{d}\mathbf{v}_i\, q_i(\mathbf{v}_i, j)\, \frac{p(\mathbf{w}_i, \mathbf{v}_i, j | \theta)}{q_i(\mathbf{v}_i, j)} \\
&\geq \sum_j \int \mathrm{d}\mathbf{v}_i\, q_i(\mathbf{v}_i, j) \ln \frac{p(\mathbf{w}_i, \mathbf{v}_i, j | \theta)}{q_i(\mathbf{v}_i, j)}
= \sum_j \int \mathrm{d}\mathbf{v}_i\, q_i(\mathbf{v}_i, j) \ln p(\mathbf{w}_i, \mathbf{v}_i, j | \theta) + H[q_i],
\end{aligned}
\]
where $H[q_i]$ is the entropy of the distribution $q_i$. This inequality becomes an equality when we take
(26)
\[
q_i(\mathbf{v}_i, j) = p(\mathbf{v}_i, j | \mathbf{w}_i, \theta).
\]
The above holds for each data point, and summing over data points we can write
(27)
\[
\sum_i \ln p(\mathbf{w}_i | \theta) \geq \sum_i \sum_j \int \mathrm{d}\mathbf{v}_i\, q_i(\mathbf{v}_i, j) \ln p(\mathbf{w}_i, \mathbf{v}_i, j | \theta) + \sum_i H[q_i],
\]
with equality when each $q_i$ is the posterior (26) evaluated at the current parameter estimate $\theta^{(t)}$.
The posterior (26) factors as $p(\mathbf{v}_i, j | \mathbf{w}_i, \theta^{(t)}) = p(j | \mathbf{w}_i, \theta^{(t)})\, p(\mathbf{v}_i | \mathbf{w}_i, j, \theta^{(t)})$; the first factor reduces to calculating the posterior probabilities $q_{ij}$, and we can write the objective function as (we drop the entropy term here, since it plays no role in the optimization, as it does not depend on the model parameters $\theta$)
\[
\phi(\theta | \theta^{(t)}) = \sum_{ij} q_{ij} \int \mathrm{d}\mathbf{v}_i\, p(\mathbf{v}_i | \mathbf{w}_i, j, \theta^{(t)}) \ln p(\mathbf{w}_i, \mathbf{v}_i, j | \theta).
\]
This shows that the procedure reduces exactly to the one described above, that is, to taking the expectation of the $\mathbf{v}_i$ and $\mathbf{v}_i \mathbf{v}_i^T$ terms with respect to the distribution of the $\mathbf{v}_i$ given the data $\mathbf{w}_i$, the current parameter estimate $\theta^{(t)}$ and the component $j$. We conclude that the E-step as described above ensures that the expectation of the full-data log likelihood becomes equal to the log likelihood of the model given the observed data. Optimizing this expectation in the M-step then also increases the log likelihood of the model given the observations. Therefore, the EM algorithm we described increases the likelihood of the model in every iteration, and the algorithm approaches local maxima of the likelihood. Convergence is identified, as usual, as extremely small incremental improvement in the log likelihood per iteration.
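As a concrete illustration of the E-step expectations just described, the sketch below computes, for one observation, the posterior probabilities and the conditional moments of the true value. It assumes the notation of the main text (observation $\mathbf{w} = \mathbf{R}\mathbf{v} + \mathrm{noise}$ with projection matrix $\mathbf{R}$ and noise covariance $\mathbf{S}$); the NumPy formulation and function names are ours, not the paper's C implementation.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate normal density N(x | mean, cov)."""
    d = len(mean)
    r = x - mean
    return np.exp(-0.5 * r @ np.linalg.solve(cov, r)) / np.sqrt(
        (2.0 * np.pi) ** d * np.linalg.det(cov))

def e_step_moments(w, R, S, alpha, means, covars):
    """For one observation w (projection R, noise covariance S), return the
    posterior probabilities q_j of each mixture component and the expected
    first moment b_j and conditional covariance B_j of the true value v."""
    K = len(alpha)
    q = np.empty(K)
    b, B = [], []
    for j in range(K):
        T = R @ covars[j] @ R.T + S          # covariance of w under component j
        q[j] = alpha[j] * gaussian_pdf(w, R @ means[j], T)
        gain = covars[j] @ R.T @ np.linalg.inv(T)
        b.append(means[j] + gain @ (w - R @ means[j]))   # E[v | w, j]
        B.append(covars[j] - gain @ R @ covars[j])       # Var[v | w, j]
    q /= q.sum()
    return q, b, B
```

In the noiseless, complete-data limit (R the identity and S tending to zero) the gain matrix tends to the identity, so b_j tends to w and B_j to zero, recovering the standard Gaussian-mixture E-step as stated above.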
Appendix B Split-and-merge algorithm
Let us denote the indices of the three selected Gaussians as $j$, $j'$ and $j''$, where the former two are to be merged while $j''$ will be split. The Gaussians corresponding to the indices $j$ and $j'$ will be merged as follows: the model parameters of the merged Gaussian $j^*$ are
(28)
\[
\alpha_{j^*} = \alpha_j + \alpha_{j'},\qquad
\mathbf{m}_{j^*} = \frac{\alpha_j \mathbf{m}_j + \alpha_{j'} \mathbf{m}_{j'}}{\alpha_j + \alpha_{j'}},\qquad
\mathbf{V}_{j^*} = \frac{\alpha_j \mathbf{V}_j + \alpha_{j'} \mathbf{V}_{j'}}{\alpha_j + \alpha_{j'}},
\]
where $\alpha$, $\mathbf{m}$ and $\mathbf{V}$ stand for the amplitude, mean and covariance of a component. Thus, the mean and the variance of the new Gaussian are amplitude-weighted averages of the means and variances of the two merging Gaussians.
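The merge initialization can be sketched in a few lines (our own Python rendering of the amplitude-weighted averages described above; the result serves only as a starting point for the subsequent partial EM run):

```python
import numpy as np

def merge_gaussians(a1, m1, V1, a2, m2, V2):
    """Merge two mixture components into one: the new amplitude is the sum,
    and the new mean and covariance are amplitude-weighted averages."""
    a = a1 + a2
    m = (a1 * m1 + a2 * m2) / a
    V = (a1 * V1 + a2 * V2) / a
    return a, m, V
```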
The Gaussian corresponding to $j''$ is split as follows:
(29)
\[
\alpha_{j''_1} = \alpha_{j''_2} = \tfrac{1}{2}\alpha_{j''},\qquad
\mathbf{V}_{j''_1} = \mathbf{V}_{j''_2} = \det(\mathbf{V}_{j''})^{1/d}\,\mathbf{I},
\]
where $d$ is the dimension of the data space and $\mathbf{I}$ is the identity matrix. Thus, the Gaussian is split into two equally contributing Gaussians, with each new Gaussian having a covariance matrix that has the same volume as $\mathbf{V}_{j''}$. The means $\mathbf{m}_{j''_1}$ and $\mathbf{m}_{j''_2}$ can be initialized by adding a random perturbation vector to $\mathbf{m}_{j''}$, for example,
(30)
\[
\mathbf{m}_{j''_{1,2}} = \mathbf{m}_{j''} + \epsilon_{1,2},
\]
where $\epsilon_1$ and $\epsilon_2$ are small random perturbation vectors.
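The split initialization can be sketched similarly (our Python rendering; the perturbation scale `eps_scale` is an illustrative choice, since the text only requires a small random offset):

```python
import numpy as np

def split_gaussian(a, m, V, rng=None, eps_scale=0.1):
    """Split one component into two equal halves: each child gets half the
    amplitude and an isotropic covariance det(V)^(1/d) I with the same
    volume (determinant) as V; the child means are randomly perturbed."""
    rng = np.random.default_rng(0) if rng is None else rng
    d = len(m)
    scale = np.linalg.det(V) ** (1.0 / d)
    V_child = scale * np.eye(d)              # same determinant as V
    m1 = m + eps_scale * np.sqrt(scale) * rng.standard_normal(d)
    m2 = m + eps_scale * np.sqrt(scale) * rng.standard_normal(d)
    return (a / 2.0, m1, V_child), (a / 2.0, m2, V_child)
```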
After this split-and-merge initialization, the parameters of the three affected Gaussians need to be reoptimized in a model in which the parameters of the unaffected Gaussians are held fixed. This can be done by using the M-step in equation (3) for the parameters of the three affected Gaussians, while keeping the parameters of the other Gaussians fixed, including the amplitudes. This ensures that the sum of the amplitudes of the three affected Gaussians remains fixed. This procedure is called the partial EM procedure. After convergence it is followed by the full EM algorithm on the resulting model parameters. Finally, the resulting parameters are accepted if the total log likelihood of this model is greater than the log likelihood before the split-and-merge step. If the likelihood does not increase, the same split-and-merge procedure is performed on the next triplet of split-and-merge candidates.
The question that remains to be answered is how to choose the two Gaussians that should be merged and the Gaussian that should be split. In general, there are $K(K-1)(K-2)/2$ possible triplets, a number that quickly becomes large as the number of Gaussians $K$ grows. In order to rank these triplets, one can define a merge criterion and a split criterion.
The merge criterion is constructed based on the observation that if many data points have equal posterior probabilities for two Gaussians, these Gaussians are good candidates to be merged. Therefore, one can define the merge criterion:
(31)
\[
J^{\mathrm{merge}}(j, j') = \mathbf{q}_j^T \mathbf{q}_{j'},
\]
where $\mathbf{q}_j$ is the $N$-dimensional vector of posterior probabilities $q_{ij}$ for the $j$th Gaussian. Pairs of Gaussians with larger $J^{\mathrm{merge}}$ are good candidates for a merger.
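In matrix form, with `q` an N-by-K array of posterior probabilities, all pairwise merge criteria can be computed at once (our sketch; the helper `best_merge_pair` is ours):

```python
import numpy as np

def merge_criteria(q):
    """Return the K x K matrix J[j, jp] of merge criteria q_j . q_jp, where
    q[:, j] is the N-vector of posterior probabilities for Gaussian j.
    Only the off-diagonal entries are meaningful for ranking merge pairs."""
    return q.T @ q

def best_merge_pair(q):
    """Indices (j, jp), j < jp, of the pair with the largest merge criterion."""
    J = merge_criteria(q)
    np.fill_diagonal(J, -np.inf)             # exclude self-pairs
    j, jp = np.unravel_index(np.argmax(J), J.shape)
    return min(j, jp), max(j, jp)
```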
We can define a split criterion based on the Kullback–Leibler distance between the local data density around the $j$th Gaussian, which can be written in the case of complete data as
\[
f_j(\mathbf{v}) = \frac{\sum_i \delta(\mathbf{v} - \mathbf{v}_i)\, q_{ij}}{\sum_i q_{ij}},
\]
and the $j$th Gaussian density specified by the current model estimates $\mathbf{m}_j$ and $\mathbf{V}_j$. The Kullback–Leibler divergence between two distributions $p(\mathbf{x})$ and $q(\mathbf{x})$ is given by [MacKay (2003)]
(32)
\[
D_{\mathrm{KL}}(p \,\|\, q) = \int \mathrm{d}\mathbf{x}\, p(\mathbf{x}) \ln \frac{p(\mathbf{x})}{q(\mathbf{x})}.
\]
Since the local data density is only nonzero at a finite number of values, we can write this as
(33)
\[
J^{\mathrm{split}}(j) = \sum_i \frac{q_{ij}}{\sum_l q_{lj}} \ln \frac{q_{ij} / \sum_l q_{lj}}{N(\mathbf{v}_i | \mathbf{m}_j, \mathbf{V}_j)}.
\]
The larger the distance between the local density and the Gaussian representing it, the larger $J^{\mathrm{split}}(j)$ and the better a candidate this Gaussian is to be split.
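For complete data, the empirical form (33) of the split criterion can be sketched as follows (our Python rendering; larger values flag better split candidates):

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate normal density N(x | mean, cov)."""
    d = len(mean)
    r = x - mean
    return np.exp(-0.5 * r @ np.linalg.solve(cov, r)) / np.sqrt(
        (2.0 * np.pi) ** d * np.linalg.det(cov))

def split_criterion(X, q_j, m, V):
    """Kullback-Leibler distance between the local data density around a
    Gaussian (data points X weighted by their posteriors q_j) and the
    Gaussian N(m, V) itself."""
    f = q_j / q_j.sum()                      # local density weights
    p = np.array([gaussian_pdf(x, m, V) for x in X])
    mask = f > 0                             # 0 log 0 terms contribute nothing
    return float(np.sum(f[mask] * np.log(f[mask] / p[mask])))
```

A Gaussian whose covariance badly misrepresents the data around it receives a larger criterion than one that fits well, which is exactly the ranking property the text requires.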
When dealing with incomplete data, determining the local data density is more problematic. One possible way to estimate how well a particular Gaussian describes the local data density is to calculate the Kullback–Leibler divergence between the model Gaussian under consideration and each individual data point, projected perpendicular to the unobserved directions for that data point. Thus, we can write
(34)
\[
J^{\mathrm{split}}(j) = \sum_i q_{ij}\, D_{\mathrm{KL}}\bigl( N(\cdot\,; \mathbf{R}_i \mathbf{m}_j, \mathbf{R}_i \mathbf{V}_j \mathbf{R}_i^T + \mathbf{S}_i) \,\big\|\, N(\cdot\,; \mathbf{w}_i, \mathbf{S}_i) \bigr),
\]
in which each term compares the model Gaussian, projected onto the observed subspace of data point $i$ and convolved with its uncertainty, to the Gaussian uncertainty distribution of the data point itself.
Candidates for merging and splitting are then ranked as follows: first the merge criterion $J^{\mathrm{merge}}(j, j')$ is calculated for all pairs and the pairs are ranked by decreasing $J^{\mathrm{merge}}$. For each pair in this ranking, the remaining Gaussians $j''$ are then ranked by decreasing $J^{\mathrm{split}}(j'')$.
To summarize the full algorithm, we briefly list all the steps involved:

1. Run the full EM algorithm to convergence and denote the resulting log likelihood $L^{\mathrm{old}}$.

2. Compute the merge criterion $J^{\mathrm{merge}}(j, j')$ for all pairs $(j, j')$ and the split criterion $J^{\mathrm{split}}(j'')$ for all $j''$. Sort the split-and-merge candidates based on these criteria as detailed above.

3. For the first triplet $(j, j', j'')$ in this sorted list, set the initial parameters of the merged Gaussian using equation (28) and the parameters of the two Gaussians resulting from splitting the third Gaussian using equations (29) and (30). Then run the partial EM procedure on the parameters of the three affected Gaussians, that is, run EM while keeping the parameters of the unaffected Gaussians fixed, and follow this up by running the full EM procedure on all the Gaussians. If after convergence the new log likelihood $L^{\mathrm{new}}$ is greater than $L^{\mathrm{old}}$, accept the new parameter values and return to step 2. If $L^{\mathrm{new}} \leq L^{\mathrm{old}}$, return to the beginning of this step and use the next triplet in the list.

4. Halt this procedure when none of the split-and-merge candidates improves the log likelihood or, if this list is too long, when none of the first few candidates leads to an improvement.
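The control flow of these steps can be sketched abstractly. In the sketch below, `loglike` and `try_candidate` are hypothetical callables standing in for the EM machinery: the latter is assumed to run the merge/split initialization plus the partial and full EM runs for a triplet, returning the resulting log likelihood and accepting or rolling back the parameters internally.

```python
from itertools import permutations

def split_and_merge_pass(K, loglike, try_candidate, max_candidates=5):
    """Try split-and-merge triplets (j, jp merged; jpp split) until one
    improves the log likelihood; return True on improvement, False to halt.
    Triplets are simply enumerated here; a real implementation would rank
    them by the merge and split criteria as described in the text."""
    L_old = loglike()
    triplets = [t for t in permutations(range(K), 3) if t[0] < t[1]]
    for j, jp, jpp in triplets[:max_candidates]:
        if try_candidate(j, jp, jpp) > L_old:
            return True      # accept and re-enter the search from step 2
    return False             # no candidate improved the likelihood: halt
```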
Deciding when to stop going down the split-and-merge hierarchy will be dictated in any individual application of this technique by computational constraints. This is an essential feature of any search-based approach to finding global maxima of (likelihood) functions.
Acknowledgments
It is a pleasure to thank Frédéric Arenou and Phil Marshall for comments and assistance and the anonymous referee and Associate Editor for valuable criticism.
References
Antoja, T., Figueras, F., Fernández, D. and Torra, J. (2008). Origin and evolution of moving groups. I. Characterization in the observational kinematic-age-metallicity space. Astron. Astrophys. 490 135.
Baxter, R. A. (1995). Finding overlapping distributions with MML. Technical Report No. 244, Dept. Computer Science, Monash Univ., Clayton, Australia.
Beal, M. J. (2003). Variational algorithms for approximate Bayesian inference. Ph.D. thesis, Gatsby Computational Neuroscience Unit, Univ. College London.
Binney, J. and Merrifield, M. (1998). Galactic Astronomy. Princeton Univ. Press, Princeton, NJ.
Blaauw, A., Gum, C. S., Pawsey, J. L. and Westerhout, G. (1960). The new IAU system of galactic coordinates (1958 revision). Mon. Not. R. Astron. Soc. 121 123.
Bovy, J. (2010). Tracing the Hercules stream around the galaxy. Astrophys. J. 725 1676.
Bovy, J., Hogg, D. W. and Roweis, S. T. (2009). The velocity distribution of nearby stars from Hipparcos data I. The significance of the moving groups. Astrophys. J. 700 1794.
Bovy, J. and Hogg, D. W. (2010). The velocity distribution of nearby stars from Hipparcos data II. The nature of the low-velocity moving groups. Astrophys. J. 717 617.
Broniatowski, M., Celeux, G. and Diebolt, J. (1983). Reconnaissance de densités par un algorithme d'apprentissage probabiliste. In Data Analysis and Informatics, Vol. 3 359–373. North-Holland, Amsterdam.
Carroll, R. J., Ruppert, D., Stefanski, L. A. and Crainiceanu, C. M. (2006). Measurement Error in Nonlinear Models: A Modern Perspective, 2nd ed. Chapman and Hall/CRC, Boca Raton, FL.
Celeux, G. and Diebolt, J. (1985). The SEM algorithm: A probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Comput. Statist. Quart. 2 73.
Celeux, G. and Diebolt, J. (1986). L'algorithme SEM: un algorithme d'apprentissage probabiliste pour la reconnaissance de mélanges de densités. Rev. Stat. Appl. 34 35.
De Simone, R., Wu, X. and Tremaine, S. (2004). The stellar velocity distribution in the solar neighbourhood. Mon. Not. R. Astron. Soc. 350 627.
Dehnen, W. (1998). The distribution of nearby stars in velocity space inferred from Hipparcos data. Astron. J. 115 2384.
Dehnen, W. (2000). The effect of the outer Lindblad resonance of the galactic bar on the local stellar velocity distribution. Astron. J. 119 800.
Dehnen, W. and Binney, J. J. (1998). Local stellar kinematics from Hipparcos data. Mon. Not. R. Astron. Soc. 298 387.
Delaigle, A. and Meister, A. (2008). Density estimation with heteroscedastic error. Bernoulli 14 562–579.
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 39 1–38.
Diebolt, J. and Robert, C. P. (1994). Estimation of finite mixture distributions through Bayesian sampling. J. R. Stat. Soc. Ser. B 56 363–375.
ESA (1997). The Hipparcos and Tycho Catalogues. ESA SP-1200, Noordwijk.
Famaey, B., Jorissen, A., Luri, X., Mayor, M., Udry, S., Dejonghe, H. and Turon, C. (2005). Local kinematics of K and M giants from CORAVEL/Hipparcos/Tycho-2 data. Revisiting the concept of superclusters. Astron. Astrophys. 430 165.
Gelman, A., Carlin, J. B., Stern, H. S. and Rubin, D. B. (2000). Bayesian Data Analysis. Chapman and Hall/CRC, Boca Raton, FL.
Ghahramani, Z. and Beal, M. J. (2000). Variational inference for Bayesian mixtures of factor analysers. In Advances in Neural Information Processing Systems 12 (S. A. Solla, T. K. Leen and K. R. Muller, eds.) 449. MIT Press, Cambridge, MA.
Ghahramani, Z. and Jordan, M. I. (1994a). Learning from incomplete data. CBCL Technical Report No. 108, Center for Biological and Computational Learning, MIT.
Ghahramani, Z. and Jordan, M. I. (1994b). Supervised learning from incomplete data via an EM approach. In Advances in Neural Information Processing Systems 6 (J. D. Cowan, G. Tesauro and J. Alspector, eds.) 120–127. Morgan Kaufmann, San Francisco.
Helmi, A., White, S. D. M., de Zeeuw, P. T. and Zhao, H. (1999). Debris streams in the solar neighbourhood as relicts from the formation of the Milky Way. Nature 402 53–55.
Hogg, D. W., Blanton, M. R., Roweis, S. T. and Johnston, K. V. (2005). Modeling complete distributions with incomplete observations: The velocity ellipsoid from Hipparcos data. Astrophys. J. 629 268.
Holmberg, J., Nordström, B. and Andersen, J. (2009). The Geneva–Copenhagen survey of the solar neighbourhood III. Improved distances, ages, and kinematics. Astron. Astrophys. 501 941.
Jasra, A., Holmes, C. C. and Stephens, D. A. (2005). Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modeling. Statist. Sci. 20 50–67.
MacKay, D. J. C. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge Univ. Press, Cambridge.
McLachlan, G. J. and Basford, K. (1988). Mixture Models: Inference and Application to Clustering. Dekker, New York.
Nordström, B., Mayor, M., Andersen, J., Holmberg, J., Pont, F., Jørgensen, B. R., Olsen, E. H., Udry, S. and Mowlavi, N. (2004). The Geneva–Copenhagen survey of the solar neighbourhood. Ages, metallicities, and kinematic properties of 14 000 F and G dwarfs. Astron. Astrophys. 418 989.
Oliver, J. J., Baxter, R. A. and Wallace, C. S. (1996). Unsupervised learning using MML. In Machine Learning: Proceedings of the Thirteenth International Conference (ICML 96) 364. Morgan Kaufmann, San Francisco.
Ormoneit, D. and Tresp, V. (1996). Improved Gaussian mixture density estimates using Bayesian penalty terms and network averaging. In Advances in Neural Information Processing Systems 8 (D. S. Touretzky, M. Mozer and M. E. Hasselmo, eds.) 542–548. MIT Press, Cambridge.
Quillen, A. C. and Minchev, I. (2005). The effect of spiral structure on the stellar velocity distribution in the solar neighborhood. Astron. J. 130 576.
Rabiner, L. and Biing-Hwang, J. (1993). Fundamentals of Speech Recognition. Prentice-Hall, New York.
Rasmussen, C. (2000). The infinite Gaussian mixture model. In Advances in Neural Information Processing Systems 12 (S. A. Solla, T. K. Leen and K. R. Muller, eds.) 554–560. MIT Press, Cambridge.
Richardson, S. and Green, P. J. (1997). On Bayesian analysis of mixtures with an unknown number of components. J. R. Stat. Soc. Ser. B 59 731–792.
Rissanen, J. (1978). Modeling by shortest data description. Automatica 14 465.
Roberts, S. J., Husmeier, D., Rezek, I. and Penny, W. (1998). Bayesian approaches to Gaussian mixture modeling. IEEE Trans. Pattern Anal. Mach. Intell. 20 1133.
Schafer, D. W. (1993). Likelihood analysis for probit regression with measurement errors. Biometrika 80 899.
Schafer, D. W. and Purdy, K. G. (1996). Likelihood analysis for errors-in-variables regression with replicate measurements. Biometrika 83 813–824.
Schwartz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6 461–464.
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall, Boca Raton, FL.
Skuljan, J., Hearnshaw, J. B. and Cottrell, P. L. (1999). Velocity distribution of stars in the solar neighbourhood. Mon. Not. R. Astron. Soc. 308 731.
Staudenmayer, J., Ruppert, D. and Buonaccorsi, J. (2008). Density estimation in the presence of heteroscedastic measurement error. J. Amer. Statist. Assoc. 103 726–736.
Stefanski, L. A. and Carroll, R. J. (1990). Deconvoluting kernel density estimators. Statistics 21 169–184.
Stone, M. (1974). Cross-validation choice and assessment of statistical predictions. J. R. Stat. Soc. Ser. B 36 111–147.
Ueda, N., Nakano, R., Ghahramani, Z. and Hinton, G. E. (1998). Split and merge EM algorithm for improving Gaussian mixture density estimates. In Neural Networks for Signal Processing VIII: Proceedings of the 1998 IEEE Signal Processing Society Workshop 274–283. IEEE.
van Leeuwen, F. (2007a). Hipparcos, the New Reduction of the Raw Data. Astrophysics and Space Science Library 250. Springer, Dordrecht.
van Leeuwen, F. (2007b). Validation of the new Hipparcos reduction. Astron. Astrophys. 474 653.
Wallace, C. S. and Boulton, D. M. (1968). An information measure for classification. Comput. J. 11 185.
Wu, C. F. J. (1983). On the convergence properties of the EM algorithm. Ann. Statist. 11 95–103.
Zhang, C. H. (1990). Fourier methods for estimating mixing densities and distributions. Ann. Statist. 18 806–831.