Estimating multidimensional probability fields using the Field Estimator for Arbitrary Spaces (FiEstAS) with applications to Astrophysics
The Field Estimator for Arbitrary Spaces (FiEstAS) computes the continuous probability density field underlying a given discrete data sample in multiple, non-commensurate dimensions. The algorithm works by constructing a metric-independent tessellation of the data space based on a recursive binary splitting. Individual, data-driven bandwidths are assigned to each point, scaled so that a constant “mass” is enclosed. Kernel density estimation may then be performed for different kernel shapes, and a combination of balloon and sample point estimators is proposed as a compromise between resolution and variance. A bias correction is evaluated for the particular (yet common) case where the density is computed exactly at the locations of the data points rather than at an uncorrelated set of locations. By default, the algorithm combines a top-hat kernel with with the balloon estimator and applies the corresponding bias correction. These settings are shown to yield reasonable results for a simple test case, a two-dimensional ring, that illustrates the performance for oblique distributions, as well as for a six-dimensional Hernquist sphere, a fairly realistic model of the dynamical structure of stellar bulges in galaxies and dark matter haloes in cosmological N-body simulations. Results for different parameter settings are discussed in order to provide a guideline to select an optimal configuration in other cases. Source code is available upon request.
keywords:Kernel density estimation, multivariate data analysis
Pacs:02.50.-r, 02.50.Sk, 02.50.Ng
Given a point process where the -dimensional probability density field is sampled by random points , the goal of density estimation is to infer the continuous function from the discrete set of . One of the most popular approaches to the problem is kernel density estimation, in which the field is estimated by
where the kernel is an even function that integrates to unity, and the bandwidth is a matrix that specifies the scale, shape, and orientation of the kernel. The choice of this matrix has been thoroughly discussed in different contexts, and extensive reviews exist in the literature (e.g. Silverman86, ; WandJones95, ).
The importance of density estimation cannot be overstressed. Quite often, one is directly interested in the density itself; the FiEstAS algorithm was originally developed AscasibarBinney05 () to evaluate the density of particles in the six-dimensional phase space of positions and velocities. Although the problem has recently arisen considerable interest (e.g. Vogelsberger+08, ; Wojtak+08, ; Vass+09, ; Maciejewski+09, ), it is of course only an anecdotical example. Nevertheless, it illustrates the difficulty of defining a metric (and related concepts, such as neighbourhood) in the general, non-Euclidean case. Although distances can be trivially defined in both three-dimensional subspaces, it is not clear how positions and velocities should be combined in order to produce a meaningful six-dimensional distance. It can be shown that a global scaling will only be appropriate for a certain region of the phase space, but not for the whole system (see the discussion in AscasibarBinney05, ; Maciejewski+09, ). In other words, the metric must adapt to the local structure of the data in order to recover the underlying density field.
In terms of applications, density estimation can be helpful in data mining problems. Unsupervised classification may be performed by identifying independent clusters with local density maxima, with boundaries set by the saddle points. In supervised classification, one can compute the probability distribution for each group in the training set, , from the data points belonging to it. Applying Bayes’ theorem, the probability that a new datum belongs to class is given by
where denotes the prior probability of each class, and the sum in the denominator runs over all classes.
2 Description of the algorithm
FiEstAS provides, for a given dataset in dimensions, the value of at any arbitrary point . The algorithm involves the following steps:
Tesselation of the -dimensional space.
Assignment of bandwidths to every data point.
Estimation of .
Bias correction (if necessary).
Each of them is described below, along with the different options and parameters that apply in each case.
The first step of the algorithm is the division of the data space in cells containing exactly one point. An important issue is the absence of a well-defined metric, which greatly increases the range of applicability of the method. Rather than using distances between data points, FiEstAS recursively divides the space by means of a -d tree, one dimension at a time, until there is only one point per leaf.
There are several criteria to select the dimension to split at each step. The original version of FiEstAS AscasibarBinney05 () was fine-tuned to estimate densities in phase space, and it used the information that both the position and velocity subspaces are Euclidean. Moreover, it was imposed that divisions should take place alternatively in each subspace. A significant improvement over this scheme, proposed by SharmaSteinmetz06 (), is the selection of the dimension with lower Shannon entropy. Such a choice results in more divisions along the dimensions that show more structure, and therefore it adapts better to the distribution of the data. A very similar scheme was implemented in Ascasibar08 () to use FiEstAS in the context of Monte Carlo numerical integration: when a tree node has to be split, a histogram with bins is built for each dimension, from the minimum to the maximum value attained by the corresponding coordinate. The log-likelihood for the histogram counts to arise from a Poissonian distribution is given by
where the indices and denote the dimension and the bin number, respectively, is the number of points in each bin, and is the total number of points in the node. The dimension with smaller is divided at the point , where is the maximum of all points lying on the “left” side () and is the minimum of the points lying on the “right” () side. The bin is chosen in order that the number of points on each side is as close as possible to .
A crude estimate of the density can be obtained as the inverse of the cell volume. As shown in AscasibarBinney05 (), this estimate is very noisy, and it dramatically underestimates the density of particles near the boundary of the system. This becomes a critical problem in many dimensions, because the fraction of points affected quickly approaches unity as increases. A simple correction was applied in AscasibarBinney05 () to data points at the boundary of the hypercubical domain, and a scheme based on the mean interparticle separation was used in SharmaSteinmetz06 () to adjust the shape of every tree node. In the present version of FiEstAS, such a correction is not necessary.
2.2 Bandwith assignment
In principle, one should compute the independent coefficients of the bandwith matrix that minimize the mean integrated square error. However, doing that for every single datapoint can be impractical for large samples, and a simpler prescription has been adopted.
First, the bandwidth matrices are constrained to be diagonal. Although this is far from optimal when the data are distributed obliquely with respect to the coordinate axes (see e.g. WandJones93, ; DuongHazelton03, ; DuongHazelton05, ), there is a substantial gain in speed, memory consumption, and code simplicity, by reducing the number of free parameters. This prescription will work well if the field is well sampled, although anisotropic kernels would perform better in oblique regions where the sampling is sparse.
The relation between the smoothing lengths of each point is estimated from the local dispersion of the data along each axis
where the index refers to the neighbours defined by the FiEstAS tessellation. The smoothing lengths are then set to
This measure is less sensitive to the presence of outliers than the simpler prescription (see Figure 1).
In addition, FiEstAS offers the possibility of imposing a particular metric to any subspace by specifying a list of dimensions and the relative scale between them . Defining and ,
all other dimensions remaining unaltered. For instance, in phase space one could set dimensions (positions) to scale as and then impose the same Euclidean metric to the velocities, . The relation between both spaces is not specified, and can vary freely from point to point.
Finally, the overall scale of the bandwidths is set so that the mass contained within the hypercube they define is equal to the user-defined parameter . The value of controls the degree of smoothing, and can be thought of as a constant (not necessarily integer) “number of neighbours” of the smoothing kernel. In order to compute it, each data point (of unit mass) is uniformly distributed over its cell, without any boundary correction,
where if lies inside the -th FiEstAS cell and 0 otherwise, and the bandwidths are scaled until within a 10 per cent tolerance. This is the only case in which the mass of the data is distributed like in the original implementation of FiEstAS.
2.3 Field estimation
At this point, it would be possible to estimate the density as
where we have used a “product kernel” . The current implementation includes top hat, , triangular-shaped cloud, , and Epanechnikov, , kernels, where .
Apart from this possibility, FiEstAS can also combine with a top-hat balloon estimator
based on a local bandwidth
interpolated from the individual particle bandwidths by using the same kernel as in equation (9).
2.4 Bias correction
In many, if not most, practical applications of the algorithm, one is interested in the value of the density field precisely at the locations of the sample points, and only is evaluated. As discussed in SharmaSteinmetz06 (), a positive bias that depends on the chosen kernel and its bandwidth arises in this particular case because we are not evaluating the density at a completely independent set of locations. The magnitude of this bias can be easily estimated for a uniform probability distribution by considering the average values of and . In a uniform Poissonian distribution, , all the smoothing lengths would be given by
whereas, for the balloon estimator,
Therefore, assuming , the algorithm can apply a correction when only the are requested, where and . It is important to bear in mind that this correction factor must not be applied in the general case, where the sample and evaluation points do not coincide. In particular, it should not be confused with the bias arising from the derivatives of (note that, in fact, the values of have been derived for a constant density), that has not been accounted for due to the difficulties associated to the estimation of local derivatives.
The accuracy of the density reconstruction has been tested in two benchmark cases: a two-dimensional ring and a six-dimensional Hernquist sphere. We compare the performance of differnet kernels, as well as the scaling with the number of sample points. Regarding the smoothing parameter, arguably represents a reasonable minimum, with smaller values yielding results (bandwidths and densities) that are dominated by the nearest data point. As will be shown below, increasing this parameter reduces the statistical variance of the estimator at the expense of resolution. A value is considered for reference, but higher values may be suitable depending on the user requirements, especially as the number of dimensions increases.
3.1 Two-dimensional ring
The first distribution is a ring in two dimensions with uniform density between an inner and an outer radius of 0.95 and 1.05, respectively, in arbitrary units. A random realization with 100 sample points is depicted in Figure 2, together with the density field returned by the FiEstAS algorithm under different parameter configurations. In all cases, the shape of the ring is correctly recovered, although some artifacts arise when the cells of the FiEstAS tessellation become extremely elongated. Since these artifacts are associated to individual points, they become more evident for large values of . As can be seen in the bottom panels, they are completely absent when a locally Euclidean metric (arguably the most appropriate for this problem, at least globally) is imposed.
The reconstruction obtained by the top-hat kernel has the obvious drawback of the sharp square edges, and the results obtained with the triangular-shaped cloud (not shown) or the Epanechnikov kernel are much more satisfactory in that sense. For , the Epanechnikov kernel with tends to severely oversmooth the density distribution. When the metric is constrained to be locally Euclidean ( at every point), the width of the ring is systematically overestimated, but the recovered shape is perfectly circular. For the unrestricted metric, the density distribution is deformed into a slightly square shape aligned with the coordinate axes. This is due to the combined effect of the hypercubical FiEstAS tessellation (see Maciejewski+09 () for a comparison of different schemes) and the diagonal bandwith matrix. As a result, kernel shapes in “horizontal” or “vertical” regions tend to be more elongated, whereas in the “diagonal” regions, causing the “diamond” and “square” shapes observed for the inner and outer boundaries of the distribution. As stated above, it is in these oblique regions, poorly sampled within a smoothing volume, where an anisotropic kernel would certainly provide a significant advantage. Finally, combining a top-hat kernel with with the balloon estimator (10) yields a density field that is bracketed by the results of the Epanechnikov kernel with and .
More quantitatively, the probability distribution of the variable is shown in Figure 3 for several values of the number of sample points between and . The bias and the variance of each estimator are quoted in Table 1. Since the density could already be properly reconstructed with points, the probability distribution of for this two-dimensional problem does not change much with , with the exception of the oversmoothing shown by all estimators for . The bias correction was of the order of per cent ( dex) in all cases but the Epanechnikov kernel with , for which it was about a factor of two. The variance also depends on the choice of a specific kernel and smoothing parameter , ranging from percent in the Epanechnikov kernel with to more than a factor of two for the top-hat kernel. It may be argued, though, that some of this dispersion is indeed physical, in the sense that it reflects the Poisson fluctuations inherent to the random realization of the ideal uniform distribution. In other words, there really are several clumps in the point distribution, and they are clearly visible in Figure 2. If one is interested in the actual physical density of these regions, its value should be higher than in those others that happen to contain less points. If, on the other hand, one is interested in the probability density field from which the sample was drawn, some statistical criterion has to be devised in order to test whether the fluctuations correspond to real variations of the field or are simply due to Poisson noise.
3.2 Hernquist sphere
The performance of the algorithm has also been tested by recovering the density of a six-dimensional Hernquist sphere Hernquist90 (). This distribution is often used to model the central bulges of galaxies, as well as their dark matter haloes. The density of particles in the phase space of three-dimensional positions and velocities can be written as
in terms of the dimensionless specific binding energy of the particle
and the total mass and characteristic radius of the system. The generation of a random realization of this distribution is described in AscasibarBinney05 ().
Results obtained for different values of are displayed in Figure 4 and Table 2. Overall, they are qualitatively similar to the example discussed in the previous section, with only minor differences due to the higher dimensionality of the problem and the very inhomogeneous nature of the Hernquist density distribution. In particular, the bias correction is much more important in six dimensions, reaching values as high as a factor of for the Epanechnikov kernel with . Moreover, many more points are necessary in order to achieve an adequate sampling, and a clear evolution with is now evident in the probability distribution of . The negative bias observed at low is mostly due to oversmoothing of the central regions, which contain the majority of the particles. The Hernquist distribution becomes optimally resolved for : the sampling within a smoothing volume becomes close to Poissonian, and the probability distribution of approaches the asymptotic for the chosen kernel. As in the two-dimensional ring, the specification of a metric based on external knowledge of the problem (in this case, and ) affects the results only mildly.
Kernel density estimation has been implemented within the Field Estimator for Arbitrary Spaces (FiEstAS) algorithm, using different kernels and opening the possibility of combining sample point and balloon estimators. The only free parameters are the specific form of the kernel function (top-hat, triangular-shaped cloud and Epanechnikov kernels are provided by default) and the smoothing parameter . The bandwidth matrix, constrained to be diagonal, is automatically computed for every point. Additional constraints can be imposed by the user, but the test cases considered do not suggest that this results in a significant advantage. In fact, it has already been established for a wide range of cases (see e.g. WandJones93, ) that independent bandwidths (arbitrary metric) do not lose power against the Euclidean metric, even if the latter is true. A bias correction must be applied when one is only interested in the values of the density field exactly at the sample points . The magnitude of this correction depends on the details of the kernel, but it is already significant at and tends to increase with dimensionality.
The optimal choice of kernel and smoothing parameter are, of course, problem-dependent. Based on the results presented in the previous section, the combination of a top-hat kernel with with the balloon estimator given by equation (10) seems to yield a reasonable compromise between accuracy (low dispersion) and resolution (small number of points required) for any number of dimensions. This, however, may not hold in the general case, and the user is encouraged to experiment with different options. In particular, smaller values of the smoothing parameter are unlikely to provide useful results, but larger bandwidths may be helpful in order to reduce the statistical noise of the estimator at the expense of losing information about the small-scale structure of the data. The kernel shape has a much milder effect, but in some cases (e.g. if exact mass conservation is required), a sample point may be preferable to a balloon estimator. In this case, the Epanechnikov kernel is optimal for an loss criterion with fixed bandwidths WandJones95 (), and this would be, in principle, the recommended choice.
Financial support for this work has been provided by the Spanish Ministerio de Educación y Ciencia (project AYA2007-67965-C03-03) and the European Science Foundation (ESF) for the activity entitled “Computational Astrophysics and Cosmology” (reference ASTROSIM 2027).
- (1) B. W. Silverman, Density estimation for statistics and data analysis, Monographs on Statistics and Applied Probability, London: Chapman and Hall, 1986, 1986.
- (2) M. P. Wand, M. C. Jones, Kernel Smoothing (Monographs on Statistics and Applied Probability), Chapman & Hall/CRC, 1995.
- (3) Y. Ascasibar, J. Binney, Numerical estimation of densities, MNRAS 356 (2005) 872–882. arXiv:arXiv:astro-ph/0409233, doi:10.1111/j.1365-2966.2004.08480.x.
- (4) M. Vogelsberger, S. D. M. White, A. Helmi, V. Springel, The fine-grained phase-space structure of cold dark matter haloes, MNRAS 385 (2008) 236–254. arXiv:0711.1105, doi:10.1111/j.1365-2966.2007.12746.x.
- (5) R. Wojtak, E. L. Łokas, G. A. Mamon, S. Gottlöber, A. Klypin, Y. Hoffman, The distribution function of dark matter in massive haloes, MNRAS 388 (2008) 815–828. arXiv:0802.0429, doi:10.1111/j.1365-2966.2008.13441.x.
- (6) I. M. Vass, M. Valluri, A. V. Kravtsov, S. Kazantzidis, Evolution of the dark matter phase-space density distributions of CDM haloes, MNRAS 395 (2009) 1225–1236. arXiv:0810.0277, doi:10.1111/j.1365-2966.2009.14614.x.
- (7) M. Maciejewski, S. Colombi, C. Alard, F. Bouchet, C. Pichon, Phase-space structures - I. A comparison of 6D density estimators, MNRAS 393 (2009) 703–722. arXiv:0810.0504, doi:10.1111/j.1365-2966.2008.14121.x.
- (8) S. Sharma, M. Steinmetz, Multidimensional density estimation and phase-space structure of dark matter haloes, MNRAS 373 (2006) 1293–1307. doi:10.1111/j.1365-2966.2006.11043.x.
- (9) Y. Ascasibar, FiEstAS sampling – a Monte Carlo algorithm for multidimensional numerical integration, Computer Physics Communications 179 (2008) 881–887. arXiv:0807.4479, doi:10.1016/j.cpc.2008.07.011.
M. P. Wand, M. C. Jones, Comparison
of smoothing parameterizations in bivariate kernel density estimation,
Journal of the American Statistical Association 88 (422) (1993) 520–528.
- (11) T. Duong, M. L. Hazelton, Plug-in bandwidth selectors for bivariate kernel density estimation, Journal of Nonparametric Statistics 15 (2003) 17–30.
- (12) T. Duong, M. L. Hazelton, Cross-validation bandwidth matrices for multivariate kernel density estimation, Scandinavian Journal of Statistics 32 (3) (2005) 485–506.
- (13) L. Hernquist, An analytical model for spherical galaxies and bulges, ApJ 356 (1990) 359–364. doi:10.1086/168845.