A non-parametric -nearest neighbour entropy estimator
A non-parametric -nearest neighbour based entropy estimator is proposed. It improves on the classical Kozachenko-Leonenko estimator by considering non-uniform probability densities in the region of -nearest neighbours around each sample point. It aims at improving the classical estimators in three situations: first, when the dimensionality of the random variable is large; second, when near-functional relationships leading to high correlation between components of the random variable are present; and third, when the marginal variances of random variable components vary significantly with respect to each other. Heuristics on the error of the proposed and classical estimators are presented. Finally, the proposed estimator is tested for a variety of distributions in successively increasing dimensions and in the presence of a near-functional relationship. Its performance is compared with a classical estimator and shown to be a significant improvement.
Entropy is a fundamental quantity in information theory that finds applications in various areas such as coding theory and data compression Cover and Thomas (2012). It is also a building block for other important measures, such as mutual information and interaction information, that are widely employed in the areas of computer science, machine learning, and data analysis. In most realistic applications, the underlying true probability density function (pdf) is rarely known, but samples from it can be obtained via data-acquisition, experiments, or numerical simulations. An interesting problem then, is to estimate the entropy of the underlying distribution only from a finite number of samples. The approaches to perform such a task can broadly be classified into two categories: parametric and non-parametric. In the parametric approach the form of the pdf is assumed to be known and its parameters are identified from the samples. This, however, is a strong assumption and in most realistic cases an a priori assumption on the form of the pdf is not justified. Consequently, non-parametric approaches where no such assumption is made have been proposed Beirlant et al. (1997). One such approach is to first estimate the pdf through histograms or kernel density estimators (KDE) Silverman (1986); Devroye and Györfi (1985); Scott (2015), and then to compute the entropy by either numerical or Monte-Carlo (MC) integration. Other alternatives include methods based on sample spacings for one-dimensional distributions Hall (1984); Dudewicz and Van der Meulen (1987) and -nearest neighbours (kNN) Kozachenko and Leonenko (1987); Tsybakov and Van der Meulen (1996); Singh et al. (2003); Kraskov et al. (2004).
While KDE based entropy estimation is generally accurate and efficient in low dimensions, the method suffers from the curse of dimensionality Gray and Moore (2003). On the other hand the kNN based estimators are computationally efficient in high dimensions, but not necessarily accurate, especially in the presence of large correlations or functional dependencies Gao et al. (2015). The latter problem has recently been addressed by estimating the local non-uniformity through principal component analysis (PCA) in Gao et al. (2015). In the current work, a different approach to overcome the aforementioned limitations associated with kNN based entropy estimators is presented. The central idea is to estimate the probability mass around each sample point by a local Gaussian approximation. The local approximation is obtained by looking at -neighbours around the sample point. This procedure has two distinct advantages: first, that the tails of the true probability distribution are better captured; and second, that if the probability mass in one or more directions is small due to large correlations (near-functional dependencies), or due to significant variation in the marginal variances of the random variable components, the non-uniformity is inherently taken into account. These two features allow the entropy to be estimated in high dimensions with a significantly lower error when compared to classical estimators.
The structure of the work is as follows: first, the classical and the new kNN estimators are presented in section II; then, the heuristics on the errors of the two estimators are presented in section III; and finally, numerical test cases are presented in section IV for a variety of distributions in successively increasing dimensions.
Ii Formulation of the entropy estimator
Let the random variable under consideration be and its probability density be denoted by . Its entropy is defined as
where is the support of . The goal is to estimate from finite samples, , from the distribution . A Monte-Carlo estimate of the entropy can be written as
However, since is unknown, an estimate must be substituted in equation (2) to obtain .
The key idea is to estimate through -nearest neighbours (kNN) of . Consider the probability density of , the distance from to its kNN (see Figure 1). The probability is the probability that exactly one point is in , exactly points are at distances less than the kNN, and the remaining points are farther than the kNN. Then it follows that
where, is the probability mass of an -ball centered at a sample point . The region inside the -ball is and is denoted by denoted by . The probability mass in is
If the probability mass in can be written in the following form
In what follows the classical manner to obtain equation (6) and the new estimator are presented.
ii.1 Classical estimators
The classical estimates by Kozachenko and Leonenko Kozachenko and Leonenko (1987); Kraskov et al. (2004), and similarly by Singh et. al. Singh et al. (2003), assume that the probability density is constant inside . For example Kozachenko and Leonenko Kozachenko and Leonenko (1987); Kraskov et al. (2004) assume that
where is the volume of the -dimensional unit-ball ( with ). The expression for depends on the type of norm used to calculate the distances; for example, for maximum () norm and for euclidean () norm , where is the Gamma function. Using equation (8) in equations (6) and (7), the entropy estimate can be written as
where is the distance of the sample to its nearest neighbour. This estimator is referred as KL estimator in the remainder of this article.
ii.2 The kpN estimator
Although the classical estimator works well in low-dimensions, it presents with large errors when the dimensionality of the random variable is high or the pdf in shows high non-uniformity. The latter may result from: i) presence of a near-functional relationship (leading to high correlation) between two or more components of the random variable Gao et al. (2015); and ii) high variability in the marginal variances of in . In the remainder of the manuscript the term non-uniformity is used to imply the aforementioned features. The primary cause of high error in the KL estimator is the assumption of constant density in each . This may be unjustified when the true probability mass is likely to be high only on a small sub-region of . In such cases, a constant density assumption in leads to and overestimation of the probability mass and hence the entropy estimate Gao et al. (2015). To remedy this, an alternate formulation for in equation (6) is sought. Contrary to a constant density assumption, the probability density in is represented as
where and represent the empirical mean and covariance matrix of the neighbours of the point . Essentially, the probability density is assumed to be proportional to a Gaussian function approximated by using -nearest neighbours of . The idea is that the -neighbours would capture the local non-uniformity of the true probability density inside . This approach is contrary to Gao et al. (2015) where the assumption of constant density is kept, and the ball is transformed using local PCA. In the proposed approach, the ball is kept constant but the probability density is assumed non-uniform. From a physical point of view, is reflective of the characteristic length of changes in the true probability distribution.
Consequently, the probability mass in can be written as
The above estimator for entropy is referred as the kpN estimator. In this estimator, while the evaulation of is straightforward, the evaluation of in equation (15) for each sample point is not trivial, especially in high dimensions. Before describing a computationally efficient method to evaluate this integral in the next section, a graphical demonstration of the difference in the integrals of probability density considered by the KL and kpN estimators is shown in Figure 2. Two different points – one near the tails and one near the mode – of a Gaussian distribution are shown. While near the mode of the distribution the approximations to the integral of the probability density are similar for the two estimators, in the tails the integral is better captured by the kpN estimator as a local Gaussian is constructed. This difference, while insignificant in low dimensions can have a significant impact in higher dimensions (demonstrated in section IV).
ii.3 Gaussian integral in boxes
In order to compute the function a multivariate Gaussian definite integral inside has to be computed. Since we adopt the distance, this operation amounts to computing the integral of a multivariate Gaussian inside a box. Among the methods proposed in the literature (see for instance Genz (1992)), the Expectation Propagation Multivariate Gaussian Probability (EPMGP) method, proposed in Cunningham et al. (2012), is chosen. The method is based on the introduction of a fictitious probability distribution, whose Kullback-Leibler distance with respect to the original distribution is minimised, inside the box. Since the minimisation of the Kullback-Leibler distance is equivalent, for the present setting, to the moment matching, the zero-th, first and second moments of the fictitious distribution match the ones of the original distribution. The zero-th order moment, in particular, is the sought integral value.
This method, as shown in Cunningham et al. (2012), is precise in computing the definite Gaussian integral when the domain is a box.
Algorithm 1 shows the steps to obtain the kpN estimate.
Iii Heuristics on the error
In this section analytical heuristics on the error are presented to motivate the approach proposed in this work. First, the error of the KL estimator is derived. The result shows that the estimate is sensitive to both the space dimension and non-uniformity of the pdf in .
In what follows, . Let be the probability density in . In each ball, it is supposed to be . Albeit quite strong, this regularity is introduced for sake of simplicity of the heuristics. The probability mass is .
iii.1 KL estimator error analysis
The error of the KL estimator is analysed. It comprises of two contributions: a statistical error related to the MC integration and an analytical error, resulting from the hypothesis of constant density in .
iii.1.1 Error in the approximation of probability mass
The analytical contribution to the error is analysed in this section (see details in Appendix). By considering a second order Taylor expansion of the pdf in , the probability mass can be approximated by:
where is the probability mass resulting from constant density assumption in the KL estimator, and is the Hessian of the pdf computed at .
Let the error in the approximation of be . Then:
where denote respectively the minimum and maximum eigenvalues of the Hessian. The lower bound can thus vanish. Concerning the upper bound, note the dependence on the dimension as well as on the maximum eigenvalue, which can be very large in the presence of non-uniformity of the pdf.
iii.1.2 Error in entropy estimation
Let denote the KL entropy estimate. After some derivation and by introducing the approximation of the KL estimator in the ball, it holds:
where . After some algebra, the following expression for the entropy estimation is obtained:
where is the statistical error due to the MC approximation, and the last term on the right hand side is the analytical error.
Eq.(37) and the standard -inequality (see Appendix) allows to state the upper and lower bounds for the error:
The error is thus bounded by the statistical error and an analytical contribution. If the distribution is piecewise linear, then the analytical contribution vanishes ( in Eq.(47)) since the Hessian vanishes. This corresponds to a particular case that hardly represents realistic probability distributions. The lower bound Eq.(45) can vanish for particular distributions. The analysis of the expressions reveals that, given a target distribution, the error in the entropy estimate can be significant in the presence of non-uniformity (high ), and when the dimension () is high.
iii.2 The kpN estimator error analysis
The analysis presented for the KL estimator is repeated in this section for the kpN estimator. The error analysis shows that the choice made allows to keep the structure of the KL estimator while mitigating the analytical contribution to the error. The main difference is in the approximation of the probability mass.
iii.2.1 Error in the approximation of the probability mass
The details of the computation are presented in the Appendix. The main difference with respect to the KL estimator consists in the fact that, by constructing a Gaussian osculatory interpolant (empirically identified by using neighbours), an approximation of the Hessian of the distribution is obtained. This estimate can be rough, but is beneficial in two cases: when the probability distributions are in a high dimensional space, or the pdf in exhibits non-uniformity.
The probability mass approximation in the kpN estimator is denoted by and it is defined as:
so that is it the sum of the probability mass of the KL estimator and a term that approximate the Hessian of the distribution. The error estimate is:
where is the difference between the target distribution and its gaussian approximation inside the box.
iii.2.2 Error in the approximation of the entropy
By repeating the same analysis as for the KL estimator, the following upper and lower bounds are obtained:
where are the maximum and minimum eigenvalues of the Hessian of the residual .
Let us remark that the behaviour is the same for the KL estimator and the kpN estimator in terms of functional dependence with respect to the space dimension. However, by approximating the Hessian (avoiding a bad choice of and is important to this end), can be significantly lower than . This has two potential advantages: first, that in the presence of non-uniformity, the upper bound on the kpN error is smaller; and second that, even if correlations are not significant, a lower results in a lower rate of increase of error with increasing dimensions.
Iv Numerical testcases
In this section, the numerical experiments are presented. The first test case aims at validating the proposed approach against analytical results, in simple settings.
Then, several relevant properties of the methods are investigated in more complicated settings, that frequently occur when realistic datasets are considered. First, the robustness in dimension increase is investigated. Then, the entropy estimation in presence of functional dependency leading to high correlation is shown.
2-D Multivariate Normal
(correlation coefficient )
3-D Gamma distribution
(Independent along each dimension)
4-D Beta distribution
(Independent along each dimension)
iv.1 Analysis of estimator: effect of , , and
To assess the effect of the parameters , , and , in the kpN estimator, three probability distributions in two, three, and four dimensions are considered. A summary of the these distributions is presented in Table 1. For all the three distributions, the number of samples are varied from 1000 to 32000, is varied from 1 to 10, and is varied from 0.01 to 0.10. For each set of these parameters an independent kpN entropy estimates are calculated and the corresponding mean and variance of the error with respect to the analytically known true entropy is calculated. These results for the 2-D Gaussian, 3-D Gamma, and 4-D Beta distributions are shown in Figures 3, 4, and 5, respectively. From these plots it is observed that the variance of the error decreases with increasing as expected. Furthermore, the variance appears to be high for and then lower and approximately invariant with increasing . This is consistent with the behaviour of the KL estimator Kraskov et al. (2004). Recall that the parameter is reflective of the length-scale of changes in the probability density. For a Gaussian distribution, it is clear that a higher will result in lower error as the local Gaussian approximations will better approximate the true distribution. This is observed in Figure (a)a. A similar behaviour is observed for the Gamma distribution (Figure (a)a), but for the Beta distribution (Figure (a)a) a clear optimal range of varying from 0.01 to 0.05 can be identified. The length-scale of the density variation will in general not be known a priori (especially in higher dimensions where the samples are hard to visualise) and consequently a large should be avoided. From Figures (a)a, (a)a, and (a)a, it is observed that unless a particularly bad combination of the , , and , parameters – specifically low , high , and small – is chosen, the errors across the entire spectrum of parameter variations are less than 10%. Overall, based on the performance of the estimator across the three significantly different distributions considered, is recommended to be chosen between 3 and 5, and between 0.02 and 0.05.
iv.2 Dimension increase
The properties of the kpN estimator regarding robustness to dimension increase are investigated. For all these tests , , are fixed. The method is compared to the standard KL estimator in multi-dimensional uncorrelated Gaussian, Gamma and Beta distributions. The dimension ranges from to . For all the cases, the quantity of interest is the relative error, defined as , where is the analytical value. The distributions used to test the method are quite regular and smooth. Moreover, no correlation is considered, the only focus being the behaviour with respect to the dimension increase. For all the tests, the computations were repeated times, and corresponding mean-values and variances are reported.
iv.2.1 Multi-dimensional Gaussian
The first test case is the entropy computation of a multi-dimensional Gaussian:
where is the space dimension, . The variance ranges uniformly in , i.e. .
The results are summarised in Fig.6. The relative error at low dimension is higher than that of the KL estimator. This is due to the fact that the parameters adopted are not optimal for this distribution, given the number of sample (a higher value of would provide a better result). The kpN error is significantly smaller when the dimension increases: namely, at dimension , it has an error which is less than , while the KL estimator has an error which is about three times larger, despite the fact that the probability distribution is quite regular.
iv.2.2 Multi-dimensional Gamma
The case of a multivariate Gamma distribution is commented. Similarly to earlier case, the distribution is defined as a product of univariate distributions:
where and are the shape and scale parameters of the distribution. The shape parameter varies uniformly in while the scale parameter varies in .
The results are shown in Fig.7. For this case, the kpN estimator always outperforms the KL estimator. Note that the error is not necessarily monotonic with respect to the dimension of the space. This depends on the particular nature of the distribution as well as on the parameters and adopted. Nonetheless, the kpN error is less than across the entire range of dimensions considered, while the KL error grows up to .
iv.2.3 Multi-dimensional Beta
The last test case shown concerns the entropy estimation for a multivariate Beta distribution of the form:
where varies in and varies in .
The results of the kpN and KL entropy estimates are shown in Figure 8. This test appears to be most critical as, on average, the errors on both KL and kpN estimates are higher when compared to the previous Gaussian and Gamma distributions. This may partly be due to pathological nature of the Beta distribution for particular choices of the and parameters (for example ), or (although unclear why) due to the fact that the Beta distribution has only a finite support over [0.0,1.0] in all dimensions. From Figure 8, the error of the kpN estimator is always less than whereas the KL estimate has a relative error of about , which is almost one order of magnitude higher.
The three tests presented aim at investigating the behaviour of the estimator with respect to the dimension of the space. The error of the KL estimator is monotonic and grows quite fast, because the analytical contribution to the error grows significantly with the dimension, when the number of sample is kept fixed. On the contrary, the kpN estimator proposed manages to mitigate this error by providing a rough estimate of the Hessian of the distribution in each box. The proposed kpN estimator is more robust to the dimension increase, or, conversely, given a certain dimension of the space, it allows to estimate the entropy by using a smaller number of samples. This feature is particularly appealing when dealing with the analysis of realistic datasets.
iv.3 Functional dependency and correlation
Another interesting aspect that occurs frequently when realistic applications are considered is the possible presence of correlation. In this section, the robustness of the entropy estimators is investigated: the kpN method is compared to the KL method for fixed parameters: , , . A simple test case is proposed: the entropy of a Gaussian distribution on a linear manifold is computed, with different levels of noise. The system is
where is a normal random variable with zero mean and unit variance, is a positive scalar, and is a normal random variable with zero mean and variance .
The system output is observed at discrete times , providing . The objective is to estimate the entropy of the joint probability distribution of for increasing . For this test case, two different levels of noise are considered, namely . The joint dimension increases up to . The results (in terms of absolute error and variance) are shown in Fig.9 for and . When the dimension is low, the performances of the KL estimator and that of the proposed kpN estimator are comparable, i.e. no significant difference in error is observed in terms of both the means and the variances. When the dimension increases, depending on the level of noise, the KL estimator starts deviating from the true estimate, whereas the proposed kpN estimator provides a significantly better result. The higher the noise level, the better is the behaviour of the classical KL estimator. This apparently paradoxical result can be explained by considering the analytical heuristics proposed. When the level of noise is higher, the samples are less correlated and, thus, the maximum eigenvalue of the Hessian is, on average, smaller. The joint distribution being more regular, a better entropy estimate is obtained by the classical KL estimator. The kpN estimator, on the other hand, is more robust to variations in noise-levels as based on the -neighbours the covariance of the local Gaussian approximation adjusts accordingly.
V Conclusions and Perspectives
A new -nearest neighbour based entropy estimator, that is efficient in high dimensions and in the presence of large non-uniformity, is proposed. The proposed idea relies on the introduction of a Gaussian osculatory interpolation, which in-turn is based on an empirical evaluation of -nearest neighbours. By this introduction, the local non-uniformity of the underlying probability distribution is captured, while retaining all the appealing computational advantages of classical kNN estimators. The robustness of the new estimator is tested for a variety of distributions – ranging from infinite support Gamma distributions to finite support Beta distributions – in successively increasing dimensions (up to 80). Furthermore, a case of direct functional relationship leading to high correlations between the components of a random variable is considered. Across all the tests, the new estimator is shown to consistently outperform the classical kNN estimator.
The main perspective of the current work is that the proposed estimator can be used as a building block to construct estimators for other quantities of interest such as mutual information, particularly in high dimensions. Another perspective is the development of strategies to automatically adapt based on properties of the cloud of local samples.
In this Appendix, the details of the error heuristics are presented. Let us recall the notation and the main hypotheses. The -ball is denoted by . Let be the probability density in . The probability mass in is .
vi.1 KL estimator error analysis
The error of the KL estimator is analysed. First, the error on the probability mass in a generic is computed, and the result is used to compute the error on the entropy.
vi.1.1 Error in the approximation of the probability mass
The analytical contribution to the error is due to the approximation of the probability mass . Consider a Taylor expansion of centered around :
where is the Hessian computed in . The first term of the series yields the KL approximation , the second term vanishes since it is the integral of an even function over a symetric interval, the third term represent the error of the approximation:
obtained by discarding the higher order terms. Since , let us make the hypothesis that this hold even for the truncated approximation, i.e.:
The integral is estimated. A standard result on the quadratic forms is used:
where are the minimum and maximum eigenvalues of . Then, the bounds on are simply obtained by computing the integral over :
By virtue of the symetry of the ball, this integral can be computed for just one and then multiplied by . Let . It holds:
vi.1.2 Error in the approximation of the entropy
The error on the entropy estimate is obtained by a derivation of the KL estimator. The KL estimator is obtained by equating . Let denote the entropy estimated by using the KL estimator. We write:
The properties of the logarithm are used, leading to:
The KL approximation of is introduced:
After some algebra, it holds:
where is the statistical error due to the MC approximation, and the last term on the right hand side is the analytical error.
The use of the result presented in Eq.(37) and of a standard -inequality allows to state upper and lower bounds for the error.
Indeed, the hypothesis in Eq.(33) allows to make use of the following:
After having set , we have:
In order to get a lower bound, the left hand side is studied. It holds:
The use of this results allows to state the lower bound for the error:
In order to derive the upper bound, the right hand side of the logarithmic inequality is studied:
By using this, the upper bound reads:
vi.2 Analysis of the kpN estimator
As commented above, the main difference is in the approximation of the probability mass in . In particular, an osculatory interpolation with an empirically estimated multivariate Gaussian is constructed.
vi.2.1 Error in the approximation of the probability mass
The probability density distribution inside the ball is approximated by:
where , where are the empirically evaluated mean and covariance, is the residual of the approximation. Since by construction of the approximation, it follows . The Taylor expansion of the probability density distribution centred around is computed for the gaussian approximation:
where is the Hessian computed for the gaussian approximation.
The expression is used to compute the probability mass. Remark that, as before, the linear contribution vanishes identically due to the symetry of the ball. It holds, at second order:
where denotes the Hessian of the target distribution. On the other hand:
The expression for is introduced, allowing to understand what the gaussian approximation does in terms of approximating the mass:
What is retained in the present approximation is the first term, the error thus reducing to the last term of the expansion (equate the Taylor expansion Eq.(50) with Eq.(VI.2.1)). The mass approximation is denoted by and it can be defined as:
Roughly speaking, the mass is the sum of the mass obtained by the KL hypothesis plus an additional term that results from the approximation of Hessian of the target distribution by means of the Hessian of the empirically estimated gaussian.
The error is denoted by :
If the distribution is Gaussian and it is perfectly estimated through the samples, this term vanishes. Remark that, the behaviour of the error as function of the dimension is exactly the same as for the KL estimator, but if the Hessian of the Gaussian estimates the Hessian of the target distribution, the upper bound on the error will be smaller.
vi.2.2 Error in the approximation of the entropy
The error on the entropy estimate is computed by following exactly the same strategy as for the KL estimator. The upper and lower bounds have the same expression, except that the eigenvalues appearing (namely ) in the expressions are those of the Hessian of the residual .
The lower bound reads:
And the upper bound is:
- Cover and Thomas (2012) T. M. Cover and J. A. Thomas, Elements of information theory (John Wiley & Sons, 2012).
- Beirlant et al. (1997) J. Beirlant, E. J. Dudewicz, L. Györfi, and E. C. Van der Meulen, Int. J. Math. Stat. Sci. 6, 17 (1997).
- Silverman (1986) B. W. Silverman, Density estimation for statistics and data analysis, Vol. 26 (CRC press, 1986).
- Devroye and Györfi (1985) L. Devroye and L. Györfi, Nonparametric density estimation: the L1 view, Wiley series in probability and mathematical statistics (Wiley, 1985).
- Scott (2015) D. W. Scott, Multivariate density estimation: theory, practice, and visualization (John Wiley & Sons, 2015).
- Hall (1984) P. Hall, in Mathematical Proceedings of the Cambridge Philosophical Society, Vol. 96 (Cambridge Univ Press, 1984) pp. 517–532.
- Dudewicz and Van der Meulen (1987) E. Dudewicz and E. Van der Meulen, in New Perspectives in Theoretical and Applied Statistics, edited by W. W. E. M.L. Puri, J. Vilaplana (Wiley, New York, 1987).
- Kozachenko and Leonenko (1987) L. F. Kozachenko and N. N. Leonenko, Probl. Inf. Transm. 23, 9 (1987).
- Tsybakov and Van der Meulen (1996) A. B. Tsybakov and E. Van der Meulen, Scandinavian Journal of Statistics , 75 (1996).
- Singh et al. (2003) H. Singh, N. Misra, V. Hnizdo, A. Fedorowicz, and E. Demchuk, Amer. J. Math. Management Sci. 23, 301 (2003).
- Kraskov et al. (2004) A. Kraskov, H. Stögbauer, and P. Grassberger, Phys. Rev. E 69, 066138 (2004).
- Gray and Moore (2003) A. G. Gray and A. W. Moore, in SDM (SIAM, 2003) pp. 203–211.
- Gao et al. (2015) S. Gao, G. V. Steeg, and A. Galstyan, Arxiv (http://arxiv.org/abs/1411.2003) (2015).
- Genz (1992) A. Genz, J. Comp. Graph. Stat. 1, 141 (1992).
- Cunningham et al. (2012) J. Cunningham, P. Hennig, and S. Lacoste-Julien, Arxiv (http://arxiv.org/abs/1111.6832) (2012).