Analysis of Nearest Neighbor Distances
with Application to
Entropy Estimation
Abstract
Estimating entropy and mutual information consistently is important for many machine learning applications. The Kozachenko-Leonenko (KL) estimator (kozachenko87statistical) is a widely used nonparametric estimator of the entropy of multivariate continuous random variables, as well as the basis of the mutual information estimator of Kraskov04estimating, perhaps the most widely used estimator of mutual information in this setting. Despite the practical importance of these estimators, major theoretical questions regarding their finite-sample behavior remain open. This paper proves finite-sample bounds on the bias and variance of the KL estimator, showing that it achieves the minimax convergence rate for certain classes of smooth functions. In proving these bounds, we analyze the finite-sample behavior of nearest-neighbor (NN) distance statistics (on which the KL estimator is based). We derive concentration inequalities for NN distances and a general expectation bound for statistics of NN distances, which may be useful for other analyses of NN methods.
Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213 USA
1 Introduction
Estimating entropy and mutual information in a consistent manner is important in a number of problems in machine learning. For example, entropy estimators have applications in goodness-of-fit testing (goria05new), parameter estimation in semiparametric models (Wolsztynski85minimum), studying fractal random walks (Alemany94fractal), and texture classification (hero02alpha; hero2002aes). Mutual information estimators have applications in feature selection (peng05feature), clustering (aghagolzadeh07hierarchical), causality detection (Hlavackova07causality), optimal experimental design (lewi07realtime; poczos09identification), fMRI data processing (chai09exploring), prediction of protein structures (adami04information), and boosting and facial expression recognition (Shan05conditionalmutual). Both entropy estimators and mutual information estimators have been used for independent component and subspace analysis (radical03; szabo07undercomplete_TCC; poczos05geodesic; Hulle08constrained), as well as for image registration (kybic06incremental; hero02alpha; hero2002aes). For further applications, see (LeonenkoPronzatoSavani2008).
In this paper, we focus on the problem of estimating the Shannon entropy of a continuous random variable given samples from its distribution. All of our results extend to the estimation of mutual information, since the latter can be written as a sum of entropies. ¹Specifically, for random variables $X$ and $Y$, $I(X; Y) = H(X) + H(Y) - H(X, Y)$. In our setting, we assume we are given $n$ IID samples $X_1, \dots, X_n$ from an unknown probability measure $P$. Under nonparametric assumptions (on the smoothness and tail behavior of $P$), our task is then to estimate the differential Shannon entropy $H(X)$.
Estimators of entropy and mutual information come in many forms (as reviewed in Section 2), but one common approach is based on statistics of $k$-nearest-neighbor ($k$-NN) distances (i.e., the distance from a sample to its $k^{\text{th}}$-nearest neighbor amongst the other samples, in some metric on the space). These nearest-neighbor estimators are largely based on initial work by kozachenko87statistical, who proposed an estimator of differential Shannon entropy and showed its weak consistency. Henceforth, we refer to this historic estimator as the 'KL estimator', after its discoverers. Although there has been much work on the problem of entropy estimation in the nearly three decades since the KL estimator was proposed, there are still major open questions about the finite-sample behavior of the KL estimator. The goal of this paper is to address some of these questions in the form of finite-sample bounds on the bias and variance of the estimator.
Specifically, our main contributions are the following:

We derive bounds on the bias of the KL estimator in terms of $\beta$, $D$, and $n$, where $\beta$ is a measure of the smoothness (i.e., Hölder continuity) of the sampling density, $D$ is the intrinsic dimension of the support of the distribution, and $n$ is the sample size.

We derive bounds on the variance of the KL estimator.

We derive concentration inequalities for NN distances, as well as general bounds on expectations of NN distance statistics, with important special cases:

We bound the moments $\mathbb{E}[\varepsilon_k^\alpha]$ of NN distances $\varepsilon_k$, which play a role in the analysis of many applications of NN methods, including both the bias and variance of the KL estimator. In particular, we significantly relax strong assumptions underlying previous results by evans02KNNmoments, such as compact support and smoothness of the sampling density. Our results are also the first which apply to negative moments (i.e., with $\alpha < 0$); these are important for bounding the variance of the KL estimator.

We give upper and lower bounds on the expected logarithms of NN distances. These are important for bounding the variance of the KL estimator, as well as for NN estimators of divergences and mutual information.

We present our results in the general setting of a set equipped with a metric, a base measure, a probability density, and an appropriate definition of dimension. This setting subsumes Euclidean spaces, in which NN methods have traditionally been analyzed,² but also includes, for instance, Riemannian manifolds, and perhaps other spaces of interest. We also strive to weaken some of the restrictive assumptions, such as compact support and boundedness of the density, on which most related work depends. ²A recent exception, in the context of classification, is chaudhuri14KNNrates, which considers general metric spaces.
We anticipate that some of the tools developed here may be used to derive error bounds for NN estimators of mutual information, divergences (WangKulkarniVerdu2009), their generalizations (e.g., Rényi and Tsallis quantities (LeonenkoPronzatoSavani2008)), norms, and other functionals of probability densities. We leave such bounds to future work.
Organization
Section 2 discusses related work. Section 3 gives the theoretical context and assumptions underlying our work. In Section 4, we prove concentration bounds for NN distances, and we use these in Section 5 to derive bounds on the expectations of NN distance statistics. Section 6 describes the KL estimator, for which we prove bounds on the bias and variance in Sections 7 and 8, respectively.
2 Related Work
Here, we review previous work on the analysis of nearest neighbor statistics and their role in estimating information theoretic functionals, as well as other approaches to estimating information theoretic functionals.
2.1 The KozachenkoLeonenko Estimator of Entropy
In general contexts, only weak consistency of the KL estimator is known (kozachenko87statistical). biau15EntropyKNN recently reviewed the finite-sample results known for the KL estimator. They show (Theorem 7.1) that, if the density has compact support, then the variance of the KL estimator decays as $O(1/n)$. They also claim (Theorem 7.2) to bound the bias of the KL estimator by $O(n^{-\beta})$, under the assumptions that the density is $\beta$-Hölder continuous ($\beta \in (0, 1]$), bounded away from $0$, and supported on the interval $[0, 1]$. However, in their proof, biau15EntropyKNN neglect the additional bias incurred at the boundaries of $[0, 1]$, where the density cannot simultaneously be bounded away from $0$ and continuous. In fact, because the KL estimator does not attempt to correct for boundary bias, for densities bounded away from $0$ on $[0, 1]$, the estimator may suffer bias worse than the claimed rate.
The KL estimator is also important for its role in the mutual information estimator proposed by Kraskov04estimating, which we refer to as the KSG estimator. The KSG estimator expands the mutual information as a sum of entropies, which it estimates via the KL estimator with a particular random (i.e., data-dependent) choice of the nearest-neighbor parameter $k$. The KSG estimator is perhaps the most widely used estimator of the mutual information between continuous random variables, despite the fact that it currently appears to have no theoretical guarantees, even asymptotically. In fact, one of the few theoretical results concerning the KSG estimator, due to gao15stronglyDependent, is a negative one: when estimating the mutual information between strongly dependent variables, the KSG estimator tends to systematically underestimate mutual information, due to increased boundary bias. ³To alleviate this, gao15stronglyDependent provide a heuristic correction based on using local PCA to estimate the support of the distribution. gao15localGaussian propose, and prove asymptotic unbiasedness of, another estimator, based on local Gaussian density estimation, that directly adapts to the boundary. Nevertheless, the widespread use of the KSG estimator motivates study of its behavior. We hope that our analysis of the KL estimator, in terms of which the KSG estimator can be written, will lead to a better understanding of the latter.
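To make the construction above concrete, here is a minimal brute-force sketch of the KSG estimator (the first variant of Kraskov04estimating, which uses the max-norm in the joint space). The implementation and function names are ours, for illustration only; a $k$-d tree would replace the quadratic distance computation in practice.

```python
import numpy as np

def psi(m: int) -> float:
    """Digamma at a positive integer: psi(m) = -gamma + sum_{j=1}^{m-1} 1/j."""
    euler_gamma = 0.5772156649015329
    return -euler_gamma + sum(1.0 / j for j in range(1, m))

def ksg_mutual_information(x: np.ndarray, y: np.ndarray, k: int = 3) -> float:
    """Brute-force KSG estimate (in nats) of I(X;Y) from paired samples.

    x, y have shapes (n, d_x) and (n, d_y); rows are paired observations.
    """
    n = x.shape[0]
    # Pairwise max-norm distances in each marginal space.
    dx = np.max(np.abs(x[:, None, :] - x[None, :, :]), axis=-1)
    dy = np.max(np.abs(y[:, None, :] - y[None, :, :]), axis=-1)
    # Joint-space distance is the max of the marginal distances.
    dz = np.maximum(dx, dy)
    np.fill_diagonal(dz, np.inf)  # a point is not its own neighbor
    # eps[i]: distance from point i to its k-th nearest neighbor in joint space.
    eps = np.sort(dz, axis=1)[:, k - 1]
    # n_x[i], n_y[i]: points strictly within eps[i] in each marginal space.
    avg = 0.0
    for i in range(n):
        n_x = int(np.sum(dx[i] < eps[i])) - 1  # exclude the point itself
        n_y = int(np.sum(dy[i] < eps[i])) - 1
        avg += (psi(n_x + 1) + psi(n_y + 1)) / n
    return psi(k) + psi(n) - avg
```

On independent samples the estimate is close to zero, while on strongly dependent samples it is large but, consistent with the boundary-bias phenomenon discussed above, tends to fall below the true mutual information.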
2.2 Analysis of nearestneighbor distance statistics
evans08SLLNforKNN derives a law of large numbers for NN statistics with uniformly bounded (central) kurtosis as the sample size $n \to \infty$. Although it is not obvious that the kurtosis of NN distances is uniformly bounded (indeed, each NN distance approaches $0$ almost surely), we show in Section 8 that this is indeed the case, and we apply the results of evans08SLLNforKNN to bound the variance of the KL estimator.
evans02KNNmoments derives asymptotic limits and convergence rates for moments of NN distances, for sampling densities with bounded derivatives and compact domain. In contrast, we use weaker assumptions to simply prove bounds on the moments of NN distances. Importantly, whereas the results of evans02KNNmoments apply only to nonnegative moments (i.e., $\mathbb{E}[\varepsilon_k^\alpha]$ with $\alpha \ge 0$), our results also hold for certain negative moments ($\alpha < 0$), which is crucial for our bounds on the variance of the KL estimator.
2.3 Other Approaches to Estimating Information Theoretic Functionals
Analysis of convergence rates: For densities over $\mathbb{R}^D$ satisfying a Hölder smoothness condition parametrized by $\beta$, the minimax rate for estimating entropy has been known since birge95estimation to be $O\!\left(n^{-\min\left\{\frac{8\beta}{4\beta + D},\, 1\right\}}\right)$ in mean squared error, where $n$ is the sample size.
Quite recently, there has been much work on analyzing new estimators for entropy, mutual information, divergences, and other functionals of densities. Most of this work follows one of three approaches. One series of papers (liu12exponential; singh14divergence; singh14densityfuncs) studied a boundary-corrected plug-in approach based on undersmoothed kernel density estimation. This approach has strong finite-sample guarantees, but requires prior knowledge of the support of the density and can necessitate computationally demanding numerical integration. A second approach (krishnamurthy14divergences; kandasamy15vonMises) uses a von Mises expansion to correct the bias of optimally smoothed density estimates. This approach shares the difficulties of the previous approach, but is statistically more efficient. Finally, a long line of work (perez08estimation; pal10estimation; sricharan12ensemble; sricharan10confidence; moon14ensemble) has studied entropy estimation based on continuum limits of certain properties of graphs (including NN graphs, spanning trees, and other sample-based graphs).
Most of these estimators are known to achieve the parametric rate $O(n^{-1})$ in mean squared error only for sufficiently smooth densities, falling short of the minimax rate in general. Only the von Mises approach of krishnamurthy14divergences is known to achieve the minimax rate for general $\beta$ and $D$, but, due to its high computational demand, the authors suggest the use of other, statistically less efficient, estimators for moderately sized datasets. In this paper, we prove that, for $\beta \in (0, 2]$, the KL estimator converges at the rate $O\big(n^{-\min\{2\beta/D,\,1\}}\big)$ in mean squared error. It is also worth noting the relative computational efficiency of the KL estimator ($O(n^2)$ by brute force, or $O(n \log n)$ using $k$-d trees for small $D$).
Boundedness of the density: For all of the above approaches, the theoretical finite-sample results known so far assume that the sampling density is lower and upper bounded by positive constants. This excludes most distributions with unbounded support, and hence many distributions of practical relevance. A distinctive feature of our results is that they hold for a variety of densities that approach $0$ or $\infty$ on their domain, which may be unbounded. Our bias bounds apply, for example, to densities that decay exponentially, such as Gaussian densities. To our knowledge, the only previous results that apply to unbounded densities are those of tsybakov96rootn, who show consistency of a truncated modification of the KL estimator for a class of densities with exponentially decaying tails. In fact, components of our analysis are inspired by tsybakov96rootn, and some of our assumptions are closely related. Their analysis applies only to the case $k = 1$ and $D = 1$, for which our results also imply consistency, so our results can be seen in some respects as a generalization of this work.
3 Setup and Assumptions
While most prior work on NN estimators has been restricted to $\mathbb{R}^D$, we present our results in a more general setting. This includes, for example, Riemannian manifolds embedded in higher-dimensional spaces, in which case we note that our results depend on the intrinsic, rather than extrinsic, dimension. Such data can be better behaved in their native space than when mapped into a Euclidean space (e.g., working directly on the unit circle avoids the boundary bias caused by mapping data to the interval $[0, 2\pi)$).
Definition 1.
(Metric Measure Space): A quadruple $(\mathcal{X}, d, \Sigma, \nu)$ is called a metric measure space if $\mathcal{X}$ is a set, $d$ is a metric on $\mathcal{X}$, $\Sigma$ is a $\sigma$-algebra on $\mathcal{X}$ containing the Borel $\sigma$-algebra induced by $d$, and $\nu$ is a $\sigma$-finite measure on the measurable space $(\mathcal{X}, \Sigma)$.
Definition 2.
(Dimension): A metric measure space $(\mathcal{X}, d, \Sigma, \nu)$ is said to have dimension $D > 0$ if there exist constants $c_\ell, c_u > 0$ such that, for all $x \in \mathcal{X}$ and all $r \in (0, \operatorname{diam}(\mathcal{X}))$, $c_\ell r^D \le \nu(B(x, r)) \le c_u r^D$. ⁴Here and in what follows, $B(x, r) := \{y \in \mathcal{X} : d(x, y) < r\}$ denotes the open ball of radius $r$ centered at $x$.
Definition 3.
(Full Dimension): Given a metric measure space $(\mathcal{X}, d, \Sigma, \nu)$ of dimension $D$, a measure $P$ on $(\mathcal{X}, \Sigma)$ is said to have full dimension on a set $E \in \Sigma$ if there exist functions $p_\ell, p_u : \mathcal{X} \to (0, \infty)$ such that, for all sufficiently small $r > 0$ and $\nu$-almost all $x \in E$,
$$p_\ell(x)\, r^D \;\le\; P(B(x, r)) \;\le\; p_u(x)\, r^D.$$
Remark 4.
If $\mathcal{X} = \mathbb{R}^D$, $d$ is the Euclidean metric, and $\nu$ is the Lebesgue measure, then the dimension of the metric measure space is $D$. However, if $\mathcal{X}$ is a lower-dimensional subspace of $\mathbb{R}^D$, then the dimension may be less than $D$. For example, if $\mathcal{X} = \mathbb{S}^{D-1}$ (the unit sphere in $\mathbb{R}^D$), $d$ is the geodesic distance on $\mathbb{S}^{D-1}$, and $\nu$ is the $(D-1)$-dimensional surface measure, then the dimension is $D - 1$.
Remark 5.
In previous work on NN statistics (evans02KNNmoments; biau15EntropyKNN) and estimation of information-theoretic functionals (sricharan10confidence; krishnamurthy14divergences; singh14divergence; moon14ensemble), it has been common to assume that the sampling distribution has full dimension with constant $p_\ell$ and $p_u$ (or, equivalently, that the density is lower and upper bounded by positive constants). This excludes distributions with densities approaching $0$ or $\infty$ on their domain, and hence also densities with unbounded support. By letting $p_\ell$ and $p_u$ be functions, our results extend to unbounded densities that instead satisfy certain tail bounds.
In order to ensure that the entropy is well defined, we assume that $P$ is a probability measure absolutely continuous with respect to $\nu$, and that its probability density function $p := \frac{dP}{d\nu}$ is such that the entropy
(1)  $H(X) := -\int_{\mathcal{X}} p(x) \log p(x) \, d\nu(x)$
is well defined and finite.⁵ ⁵See (baccetti13infiniteEntropy) for discussion of sufficient conditions for this.
Finally, we assume we have $n$ samples $X_1, \dots, X_n$ drawn IID from $P$. We would like to use these samples to estimate the entropy $H(X)$ as defined in Equation (1).
Our analysis and methods relate to the $k$-nearest-neighbor distance $\varepsilon_k(x)$, defined for any $x \in \mathcal{X}$ by $\varepsilon_k(x) := d(x, x_{(k)})$, where $x_{(k)}$ is the $k^{\text{th}}$-nearest neighbor of $x$ in the set $\{X_1, \dots, X_n\}$. Note that, since the definition of dimension used precludes the existence of atoms (i.e., for all $x \in \mathcal{X}$, $P(\{x\}) = 0$), $\varepsilon_k(x) > 0$ holds $P$-almost everywhere. This is important, since we will study $\log \varepsilon_k(x)$.
Initially (i.e., in Sections 4 and 5), we will study $\varepsilon_k(x)$ with fixed $x \in \mathcal{X}$, for which we will derive bounds in terms of $p_\ell(x)$ and $p_u(x)$. When we apply these results to analyze the KL estimator in Sections 7 and 8, we will need to take expectations such as $\mathbb{E}[\log \varepsilon_k(X)]$ (for which we reserve an extra sample $X$), leading to 'tail bounds' in terms of the functions $p_\ell$ and $p_u$.
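To fix ideas, the statistic $\varepsilon_k(x)$ is straightforward to compute. The following minimal numpy sketch (ours, for illustration; it assumes $\mathcal{X} = \mathbb{R}^D$ with the Euclidean metric) computes both $\varepsilon_k$ at a fixed query point and the leave-one-out distances $\varepsilon_k(X_i)$, each measured among the other $n - 1$ samples.

```python
import numpy as np

def nn_distance(x: np.ndarray, data: np.ndarray, k: int = 1) -> float:
    """epsilon_k(x): distance from x to its k-th nearest neighbor among `data`."""
    dists = np.linalg.norm(data - x, axis=1)
    return float(np.sort(dists)[k - 1])

def loo_nn_distances(data: np.ndarray, k: int = 1) -> np.ndarray:
    """Leave-one-out distances epsilon_k(X_i), each among the other n-1 samples."""
    d = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbor
    return np.sort(d, axis=1)[:, k - 1]
```

Consistent with the remark that $\varepsilon_k(x) \to 0$ almost surely, the leave-one-out distances of an IID sample from a continuous distribution are strictly positive but shrink as $n$ grows.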
4 Concentration of NN Distances
We begin with a consequence of the multiplicative Chernoff bound, asserting a sort of concentration of the distance from any point $x \in \mathcal{X}$ to its $k^{\text{th}}$-nearest neighbor in $\{X_1, \dots, X_n\}$. Since the results of this section concern a fixed $x$, for notational simplicity, we suppress the dependence of $p_\ell$ and $p_u$ on $x$.
Lemma 6.
Let be a metric measure space of dimension . Suppose is an absolutely continuous probability measure with full dimension on and density function . For , if , then
and, if , then
5 Bounds on Expectations of k-NN Statistics
Here, we use the concentration bounds of Section 4 to bound expectations of functions of nearest neighbor distances. Specifically, we give a simple formula for deriving bounds that applies to many functions of interest, including logarithms and (positive and negative) moments. As in the previous section, the results apply to a fixed $x \in \mathcal{X}$, and we continue to suppress the dependence of $p_\ell$ and $p_u$ on $x$.
Theorem 7.
Let $(\mathcal{X}, d, \Sigma, \nu)$ be a metric measure space of dimension $D$. Suppose $P$ is an absolutely continuous probability measure with full dimension and density function $p$ that satisfies the tail condition⁶ ⁶Since $p_\ell$ need not be surjective, we use its generalized inverse.
(2) 
for some constant $c > 0$. Suppose $f$ is continuously differentiable, with $f' > 0$. Fix $x \in \mathcal{X}$. Then, we have the upper bound
(3)  
and the lower bound
(4) 
($f_+$ and $f_-$ denote the positive and negative parts of $f$, respectively).
Remark 8.
If $f$ is continuously differentiable with $f' < 0$, we can apply Theorem 7 to $-f$. Also, similar techniques can be used to prove analogous lower bounds (i.e., lower bounds on the positive part and upper bounds on the negative part).
Remark 9.
The tail condition (2) is difficult to validate directly for many distributions. Clearly, it is satisfied when the support of $P$ is bounded. However, tsybakov96rootn show that, for the functions we are interested in (i.e., logarithms and power functions), when $\mathcal{X} = \mathbb{R}^D$, $d$ is the Euclidean metric, and $\nu$ is the Lebesgue measure, (2) is also satisfied by upper-bounded densities with exponentially decreasing tails; more precisely, densities for which there exist constants $C, \alpha, r_0 > 0$ such that $p(x) \le C e^{-\alpha \|x\|}$ whenever $\|x\| > r_0$. This permits, for example, Gaussian distributions. It should be noted that the constant $c$ in (2) depends only on the metric measure space, the distribution $P$, and the function $f$, and, in particular, not on $n$.
5.1 Applications of Theorem 7
We can apply Theorem 7 to several functions of interest. Here, we demonstrate the cases $f = \log$ and $f(z) = z^\alpha$ for certain $\alpha \in \mathbb{R}$, as we will use these bounds when analyzing the KL estimator.
6 The KL Estimator for Entropy
Recall that, for a random variable $X$ sampled from a probability density $p$ with respect to a base measure $\nu$, the Shannon entropy is defined as
$$H(X) := -\int_{\mathcal{X}} p(x) \log p(x) \, d\nu(x).$$
As discussed in Section 1, many applications call for an estimate of $H(X)$ given $n$ IID samples $X_1, \dots, X_n$. For a positive integer $k < n$, the KL estimator is typically written as
$$\hat{H}_k(X) := \psi(n) - \psi(k) + \frac{1}{n} \sum_{i=1}^{n} \log \nu\big(B(X_i, \varepsilon_k(X_i))\big),$$
where $\psi$ denotes the digamma function. The motivating insight is the observation that, independent of the sampling distribution,⁷
$$\mathbb{E}\big[\log P\big(B(X_i, \varepsilon_k(X_i))\big)\big] = \psi(k) - \psi(n).$$
⁷See (Kraskov04estimating) for a concise proof of this fact.
Hence,
$$\mathbb{E}\big[\hat{H}_k(X)\big] = -\mathbb{E}\big[\log \tilde{p}_{\varepsilon_k(X_i)}(X_i)\big],$$
where, for any $x \in \mathcal{X}$ and $r > 0$,
$$\tilde{p}_r(x) := \frac{P(B(x, r))}{\nu(B(x, r))}$$
denotes the local average of $p$ in a ball of radius $r$ around $x$. Since $\tilde{p}_r$ is a smoothed approximation of $p$ (with smoothness increasing with $r$), the KL estimate can be intuitively thought of as a plug-in estimator for $H(X)$, using a density estimate with an adaptive smoothing parameter.
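The motivating identity behind the KL estimator, $\mathbb{E}\big[\log P\big(B(X_i, \varepsilon_k(X_i))\big)\big] = \psi(k) - \psi(n)$, can be checked via a standard order-statistics sketch (given here under our continuity assumptions; see Kraskov04estimating for a careful proof):

```latex
% Sketch: by the probability integral transform, the P-mass of the
% k-NN ball around X_i, over the other n-1 samples, is a Beta order statistic.
\begin{align*}
  Z &:= P\bigl(B(X_i, \varepsilon_k(X_i))\bigr) \sim \mathrm{Beta}(k,\; n - k)
      && \text{($k^{\text{th}}$ order statistic of $n-1$ IID uniforms)} \\
  \mathbb{E}[\log W] &= \psi(a) - \psi(a + b)
      && \text{for } W \sim \mathrm{Beta}(a, b) \\
  \Rightarrow\quad
  \mathbb{E}\bigl[\log P\bigl(B(X_i, \varepsilon_k(X_i))\bigr)\bigr]
    &= \psi(k) - \psi(k + (n - k)) = \psi(k) - \psi(n). &&
\end{align*}
```

Notably, the argument does not involve the density $p$ at all, which is why the identity holds independently of the sampling distribution.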
In the next two sections, we utilize the bounds derived in Section 5 to bound the bias and variance of the KL estimator. We note that, for densities in the Hölder smoothness class ($\beta \in (0, 2]$), our results imply a mean squared error of $O(n^{-1})$ when $\beta \ge D/2$ and $O\big(n^{-2\beta/D}\big)$ when $\beta < D/2$.
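For concreteness, in the Euclidean case ($\mathcal{X} = \mathbb{R}^D$ with Lebesgue measure), where $\nu(B(x, r)) = c_D r^D$ with $c_D$ the volume of the unit ball, the KL estimator admits a short implementation. The following brute-force numpy sketch is ours, for illustration only (a $k$-d tree would replace the $O(n^2)$ distance computation in practice):

```python
import math
import numpy as np

def psi(m: int) -> float:
    """Digamma at a positive integer: psi(m) = -gamma + sum_{j=1}^{m-1} 1/j."""
    euler_gamma = 0.5772156649015329
    return -euler_gamma + sum(1.0 / j for j in range(1, m))

def kl_entropy(data: np.ndarray, k: int = 3) -> float:
    """Kozachenko-Leonenko entropy estimate (in nats) for samples in R^D."""
    n, D = data.shape
    # Leave-one-out distance from each sample to its k-th nearest neighbor.
    dist = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    eps = np.sort(dist, axis=1)[:, k - 1]
    # nu(B(x, r)) = c_D * r^D, with c_D the volume of the Euclidean unit ball.
    c_D = math.pi ** (D / 2) / math.gamma(D / 2 + 1)
    # H_hat = psi(n) - psi(k) + (1/n) * sum_i log nu(B(X_i, eps_k(X_i)))
    return psi(n) - psi(k) + float(np.mean(np.log(c_D * eps ** D)))
```

As a sanity check, for $\text{Uniform}[0, 1]$ the true entropy is $0$, and for a standard Gaussian it is $\frac{1}{2}\log(2\pi e) \approx 1.419$ nats; on a few thousand samples the estimate lands near these values.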
7 Bias Bound
In this section, we prove bounds on the bias of the KL estimator, first in a relatively general setting, and then, as a corollary, in a more specific but better understood setting.
Theorem 10.
Suppose and satisfy the conditions of Theorem 7, and there exist with
and suppose $P$ satisfies a 'tail bound'
(9) 
Then,
where .
We now show that the conditions of Theorem 10 are satisfied by densities in the commonly used nonparametric class of Hölder continuous densities on $\mathbb{R}^D$.
Definition 11.
Given a constant $\beta > 0$ and an open set $\Omega \subseteq \mathbb{R}^D$, a function $f : \Omega \to \mathbb{R}$ is called $\beta$-Hölder continuous if $f$ is $\ell$ times differentiable and there exists $L > 0$ such that, for any multi-index $\alpha$ with $|\alpha| = \ell$ and any $x, y \in \Omega$,
$$\left| D^\alpha f(x) - D^\alpha f(y) \right| \le L \|x - y\|^{\beta - \ell},$$
where $\ell$ is the greatest integer strictly less than $\beta$.
Definition 12.
Given an open set $\Omega \subseteq \mathbb{R}^D$ and a function $f : \Omega \to \mathbb{R}$, $f$ is said to vanish on the boundary of $\Omega$ if, for any sequence $\{x_i\}_{i=1}^\infty$ in $\Omega$ with $\operatorname{dist}(x_i, \partial\Omega) \to 0$ as $i \to \infty$, $f(x_i) \to 0$ as $i \to \infty$. Here,
$$\partial\Omega := \overline{\Omega} \setminus \Omega$$
denotes the boundary of $\Omega$.
Corollary 13.
Consider the metric measure space $(\mathbb{R}^D, d, \Sigma, \nu)$, where $d$ is the Euclidean metric and $\nu$ is the Lebesgue measure. Let $P$ be an absolutely continuous probability measure with full dimension and density $p$ supported on an open set $\Omega \subseteq \mathbb{R}^D$. Suppose $P$ satisfies (9) and the conditions of Theorem 7, and $p$ is $\beta$-Hölder continuous ($\beta \in (0, 2]$) with constant $L$. Assume $p$ vanishes on $\partial\Omega$. If $\beta > 1$, assume $\|\nabla p\|$ also vanishes on $\partial\Omega$. Then,
where .
Remark 14.
The assumption that $p$ (and, when $\beta > 1$, $\nabla p$) vanishes on the boundary of $\Omega$ can be thought of as ensuring that the trivial continuation of $p$ to $\mathbb{R}^D$ (i.e., setting $p = 0$ outside $\Omega$) is $\beta$-Hölder continuous. This reduces boundary bias, for which the KL estimator does not correct. ⁸Several estimators controlling for boundary bias have been proposed (e.g., sricharan10confidence give a modified NN estimator that accomplishes this without prior knowledge of $\Omega$).
8 Variance Bound
We first use the bounds proven in Section 5 to prove uniform (in $n$) bounds on the moments of $\log \varepsilon_k(x)$. We show that, for any fixed $k$, although $\varepsilon_k(x) \to 0$ almost surely as $n \to \infty$, the variance of $\log \varepsilon_k(x)$, and indeed all of its higher central moments, are bounded, uniformly in $n$. In fact, there exist exponential bounds, independent of $n$, on the density of $\log \varepsilon_k(x)$.
8.1 Moment Bounds on Logarithmic NN distances
Lemma 15.
Suppose $P$ and $p$ satisfy the conditions of Theorem 7. Let $x \in \mathcal{X}$, and assume the following expectations are finite:
(10) 
(11) 
(12) 
Then, for any integer $m \ge 1$, the $m^{\text{th}}$ central moment of $\log \varepsilon_k(x)$
satisfies
(13)
where $C$ is a constant independent of $n$, $k$, and $x$.
Remark 16.
The conditions (10), (11), and (12) are mild. For example, when $\mathcal{X} = \mathbb{R}^D$, $d$ is the Euclidean metric, and $\nu$ is the Lebesgue measure, it suffices that $p$ is Lipschitz continuous⁹ and that there exist constants $C, \alpha, r_0 > 0$ such that $p(x) \le C e^{-\alpha \|x\|}$ whenever $\|x\| > r_0$. The condition is more prohibitive, but still permits many (possibly unbounded) distributions of interest. ⁹Significantly milder conditions than Lipschitz continuity suffice, but are difficult to state here due to space limitations.
Remark 17.
If the terms $\log \varepsilon_k(X_i)$ were independent, a Bernstein inequality, together with the moment bound (13), would imply a sub-Gaussian concentration inequality for the KL estimator about its expectation. Such a result may follow from one of the several more refined concentration results, relaxing the independence assumption, that have been proposed.
8.2 Bound on the Variance of the KL Estimate
Bounds on the variance of the KL estimator now follow from the law of large numbers of evans08SLLNforKNN (itself an application of the Efron-Stein inequality to NN statistics).
Theorem 18.
Remark 19.
The constant depends only on $k$ and the geometry of the metric space $(\mathcal{X}, d)$. For example, Corollary A.2 of evans08SLLNforKNN shows that, when $\mathcal{X} = \mathbb{R}^D$ and $d$ is the Euclidean metric, it can be bounded in terms of the kissing number of $\mathbb{R}^D$.
9 Bounds on the Mean Squared Error
10 Conclusions and Future Work
This paper derives finite-sample bounds on the bias and variance of the KL estimator under general conditions, including for certain classes of unbounded distributions. As intermediate results, we proved concentration inequalities for NN distances and bounds on the expectations of statistics of NN distances. We hope these results and methods may lead to convergence rates for the widely used KSG mutual information estimator, or to generalizations of convergence rates for other estimators of entropy and related functionals to unbounded distributions.
Acknowledgements
This material is based upon work supported by a National Science Foundation Graduate Research Fellowship to the first author under Grant No. DGE1252522.