# Analysis of k-Nearest Neighbor Distances with Application to Entropy Estimation

Shashank Singh    Barnabás Póczos
###### Abstract

Estimating entropy and mutual information consistently is important for many machine learning applications. The Kozachenko-Leonenko (KL) estimator (kozachenko87statistical) is a widely used nonparametric estimator for the entropy of multivariate continuous random variables, as well as the basis of the mutual information estimator of Kraskov04estimating, perhaps the most widely used estimator of mutual information in this setting. Despite the practical importance of these estimators, major theoretical questions regarding their finite-sample behavior remain open. This paper proves finite-sample bounds on the bias and variance of the KL estimator, showing that it achieves the minimax convergence rate for certain classes of smooth functions. In proving these bounds, we analyze the finite-sample behavior of $k$-nearest neighbor ($k$-NN) distance statistics (on which the KL estimator is based). We derive concentration inequalities for $k$-NN distances and a general expectation bound for statistics of $k$-NN distances, which may be useful for other analyses of $k$-NN methods.

entropy, nonparametric statistics, nearest neighbor

Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213 USA

## 1 Introduction

Estimating entropy and mutual information in a consistent manner is of importance in a number of problems in machine learning. For example, entropy estimators have applications in goodness-of-fit testing (goria05new), parameter estimation in semi-parametric models (Wolsztynski85minimum), studying fractal random walks (Alemany94fractal), and texture classification (hero02alpha; hero2002aes). Mutual information estimators have applications in feature selection (peng05feature), clustering (aghagolzadeh07hierarchical), causality detection (Hlavackova07causality), optimal experimental design (lewi07realtime; poczos09identification), fMRI data processing (chai09exploring), prediction of protein structures (adami04information), and boosting and facial expression recognition (Shan05conditionalmutual). Both entropy estimators and mutual information estimators have been used for independent component and subspace analysis (radical03; szabo07undercomplete_TCC; poczos05geodesic; Hulle08constrained), as well as for image registration (kybic06incremental; hero02alpha; hero2002aes). For further applications, see (Leonenko-Pronzato-Savani2008).

In this paper, we focus on the problem of estimating the Shannon entropy of a continuous random variable given samples from its distribution. All of our results extend to the estimation of mutual information, since the latter can be written as a sum of entropies (specifically, for random variables $X$ and $Y$, $I(X;Y) = H(X) + H(Y) - H(X,Y)$). In our setting, we assume we are given $n$ IID samples from an unknown probability measure $P$. Under nonparametric assumptions (on the smoothness and tail behavior of $P$), our task is then to estimate the differential Shannon entropy of $P$.

Estimators of entropy and mutual information come in many forms (as reviewed in Section 2), but one common approach is based on statistics of $k$-nearest neighbor ($k$-NN) distances (i.e., the distance from a sample to its $k^{th}$ nearest neighbor amongst the samples, in some metric on the space). These nearest-neighbor estimates are largely based on initial work by kozachenko87statistical, who proposed an estimate for differential Shannon entropy and showed its weak consistency. Henceforth, we refer to this historic estimator as the ‘KL estimator’, after its discoverers. Although there has been much work on the problem of entropy estimation in the nearly three decades since the KL estimator was proposed, there are still major open questions about the finite-sample behavior of the KL estimator. The goal of this paper is to address some of these questions in the form of finite-sample bounds on the bias and variance of the estimator.

Specifically, our main contributions are the following:

1. We derive $O\left((k/n)^{\beta/D}\right)$ bounds on the bias of the KL estimate, where $\beta$ is a measure of the smoothness (i.e., Hölder continuity) of the sampling density, $D$ is the intrinsic dimension of the support of the distribution, and $n$ is the sample size.

2. We derive bounds on the variance of the KL estimator.

3. We derive concentration inequalities for $k$-NN distances, as well as general bounds on expectations of $k$-NN distance statistics, with important special cases:

   1. We bound the moments of $k$-NN distances, which play a role in the analysis of many applications of $k$-NN methods, including both the bias and variance of the KL estimator. In particular, we significantly relax strong assumptions underlying previous results by evans02KNNmoments, such as compact support and smoothness of the sampling density. Our results are also the first which apply to negative moments (i.e., $\mathbb{E}[\varepsilon_k^\alpha(x)]$ with $\alpha < 0$); these are important for bounding the variance of the KL estimator.

   2. We give upper and lower bounds on the logarithms of $k$-NN distances. These are important for bounding the variance of the KL estimator, as well as $k$-NN estimators for divergences and mutual information.

We present our results in the general setting of a set equipped with a metric, a base measure, a probability density, and an appropriate definition of dimension. This setting subsumes Euclidean spaces, in which $k$-NN methods have traditionally been analyzed (a recent exception, in the context of classification, is chaudhuri14KNNrates, which considers general metric spaces), but also includes, for instance, Riemannian manifolds, and perhaps other spaces of interest. We also strive to weaken some of the restrictive assumptions, such as compact support and boundedness of the density, on which most related work depends.

We anticipate that some of the tools developed here may be used to derive error bounds for $k$-NN estimators of mutual information, divergences (Wang-Kulkarni-Verdu2009), their generalizations (e.g., Rényi and Tsallis quantities (Leonenko-Pronzato-Savani2008)), norms, and other functionals of probability densities. We leave such bounds to future work.

### Organization

Section 2 discusses related work. Section 3 gives theoretical context and assumptions underlying our work. In Section 4, we prove concentration bounds for $k$-NN distances, and we use these in Section 5 to derive bounds on the expectations of $k$-NN distance statistics. Section 6 describes the KL estimator, for which we prove bounds on the bias and variance in Sections 7 and 8, respectively.

## 2 Related Work

Here, we review previous work on the analysis of $k$-nearest neighbor statistics and their role in estimating information theoretic functionals, as well as other approaches to estimating information theoretic functionals.

### 2.1 The Kozachenko-Leonenko Estimator of Entropy

In general contexts, only weak consistency of the KL estimator is known (kozachenko87statistical). biau15EntropyKNN recently reviewed finite-sample results known for the KL estimator. They show (Theorem 7.1) that, if the density $p$ has compact support, then the variance of the KL estimator decays as . They also claim (Theorem 7.2) to bound the bias of the KL estimator by , under the assumptions that $p$ is $\beta$-Hölder continuous (), bounded away from $0$, and supported on the interval . However, in their proof, biau15EntropyKNN neglect the additional bias incurred at the boundaries of the support, where the density cannot simultaneously be bounded away from $0$ and continuous. In fact, because the KL estimator does not attempt to correct for boundary bias, for densities bounded away from $0$, the estimator may suffer bias worse than .

The KL estimator is also important for its role in the mutual information estimator proposed by Kraskov04estimating, which we refer to as the KSG estimator. The KSG estimator expands the mutual information as a sum of entropies, which it estimates via the KL estimator with a particular random (i.e., data-dependent) choice of the nearest-neighbor parameter $k$. The KSG estimator is perhaps the most widely used estimator for the mutual information between continuous random variables, despite the fact that it currently appears to have no theoretical guarantees, even asymptotically. In fact, one of the few theoretical results concerning the KSG estimator, due to gao15stronglyDependent, is a negative result: when estimating the mutual information between strongly dependent variables, the KSG estimator tends to systematically underestimate mutual information, due to increased boundary bias. (To alleviate this, gao15stronglyDependent provide a heuristic correction based on using local PCA to estimate the support of the distribution. gao15localGaussian provide, and prove asymptotic unbiasedness of, another estimator, based on local Gaussian density estimation, that directly adapts to the boundary.) Nevertheless, the widespread use of the KSG estimator motivates study of its behavior. We hope that our analysis of the KL estimator, in terms of which the KSG estimator can be written, will lead to a better understanding of the latter.

### 2.2 Analysis of nearest-neighbor distance statistics

evans08SLLNforKNN derives a law of large numbers for $k$-NN statistics with uniformly bounded (central) kurtosis as the sample size $n \to \infty$. Although it is not obvious that the kurtosis of $\log$-$k$-NN distances is uniformly bounded (indeed, each $\log$-$k$-NN distance approaches $-\infty$ almost surely), we show in Section 8 that this is indeed the case, and we apply the results of evans08SLLNforKNN to bound the variance of the KL estimator.

evans02KNNmoments derives asymptotic limits and convergence rates for moments of $k$-NN distances, for sampling densities with bounded derivatives and compact domain. In contrast, we use weaker assumptions to simply prove bounds on the moments of $k$-NN distances. Importantly, whereas the results of evans02KNNmoments apply only to non-negative moments (i.e., $\mathbb{E}[\varepsilon_k^\alpha(x)]$ with $\alpha \geq 0$), our results also hold for certain negative moments, which is crucial for our bounds on the variance of the KL estimator.

### 2.3 Other Approaches to Estimating Information Theoretic Functionals

Analysis of convergence rates: For densities over $\mathbb{R}^D$ satisfying a Hölder smoothness condition parametrized by $\beta$, the minimax rate for estimating entropy has been known since birge95estimation to be $O\left(n^{-\min\left\{\frac{8\beta}{4\beta + D}, 1\right\}}\right)$ in mean squared error, where $n$ is the sample size.

Quite recently, there has been much work on analyzing new estimators for entropy, mutual information, divergences, and other functionals of densities. Most of this work has followed one of three approaches. One series of papers (liu12exponential; singh14divergence; singh14densityfuncs) studied a boundary-corrected plug-in approach based on under-smoothed kernel density estimation. This approach has strong finite-sample guarantees, but requires prior knowledge of the support of the density and can necessitate computationally demanding numerical integration. A second approach (krishnamurthy14divergences; kandasamy15vonMises) uses von Mises expansion to correct the bias of optimally smoothed density estimates. This approach shares the difficulties of the previous approach, but is statistically more efficient. Finally, a long line of work (perez08estimation; pal10estimation; sricharan12ensemble; sricharan10confidence; moon14ensemble) has studied entropy estimation based on continuum limits of certain properties of graphs (including $k$-NN graphs, spanning trees, and other sample-based graphs).

Most of these estimators achieve rates of or . Only the von Mises approach of krishnamurthy14divergences is known to achieve the minimax rate for general $\beta$ and $D$, but due to its high computational demand (), the authors suggest the use of other statistically less efficient estimators for moderately sized datasets. In this paper, we prove that, for $\beta \in (0, 2]$, the KL estimator converges at the rate $O\left(n^{-\min\{2\beta/D, 1\}}\right)$. It is also worth noting the relative computational efficiency of the KL estimator (, or using $k$-d trees for small $D$).

Boundedness of the density: For all of the above approaches, theoretical finite-sample results known so far assume that the sampling density is lower and upper bounded by positive constants. This also excludes most distributions with unbounded support, and hence, many distributions of practical relevance. A distinctive feature of our results is that they hold for a variety of densities that approach $0$ and $\infty$ on their domain, which may be unbounded. Our bias bounds apply, for example, to densities that decay exponentially, such as Gaussian distributions. To our knowledge, the only previous results that apply to unbounded densities are those of tsybakov96rootn, who show $\sqrt{n}$-consistency of a truncated modification of the KL estimate for a class of functions with exponentially decaying tails. In fact, components of our analysis are inspired by tsybakov96rootn, and some of our assumptions are closely related. Their analysis only applies to the case and , for which our results also imply $\sqrt{n}$-consistency, so our results can be seen in some respects as a generalization of this work.

## 3 Setup and Assumptions

While most prior work on $k$-NN estimators has been restricted to $\mathbb{R}^D$, we present our results in a more general setting. This includes, for example, Riemannian manifolds embedded in higher dimensional spaces, in which case we note that our results depend on the intrinsic, rather than extrinsic, dimension. Such data can be better behaved in their native space than when embedded in a lower dimensional Euclidean space (e.g., working directly on the unit circle avoids boundary bias caused by mapping data to the interval ).

###### Definition 1.

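(Metric Measure Space): A quadruple $(\mathcal{X}, d, \Sigma, \mu)$ is called a metric measure space if $\mathcal{X}$ is a set, $d$ is a metric on $\mathcal{X}$, $\Sigma$ is a $\sigma$-algebra on $\mathcal{X}$ containing the Borel $\sigma$-algebra induced by $d$, and $\mu$ is a $\sigma$-finite measure on the measurable space $(\mathcal{X}, \Sigma)$.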

###### Definition 2.

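(Dimension): A metric measure space $(\mathcal{X}, d, \Sigma, \mu)$ is said to have dimension $D$ if there exist constants $c_D, \rho > 0$ such that, for all $r \in [0, \rho]$ and $x \in \mathcal{X}$, $\mu(B(x, r)) = c_D r^D$. (Here and in what follows, $B(x, r)$ denotes the open ball of radius $r$ centered at $x$.)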

###### Definition 3.

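(Full Dimension): Given a metric measure space $(\mathcal{X}, d, \Sigma, \mu)$ of dimension $D$, a measure $P$ on $(\mathcal{X}, \Sigma)$ is said to have full dimension on a set $\mathcal{X}' \subseteq \mathcal{X}$ if there exist functions $\gamma_*, \gamma^* : \mathcal{X}' \to (0, \infty)$ such that, for all $r \in [0, \rho]$ and $\mu$-almost all $x \in \mathcal{X}'$,

$$\gamma_*(x) r^D \leq P(B(x, r)) \leq \gamma^*(x) r^D.$$
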
###### Remark 4.

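If $\mathcal{X} = \mathbb{R}^D$, $d$ is the Euclidean metric, and $\mu$ is the Lebesgue measure, then the dimension of the metric measure space is $D$. However, if $\mathcal{X}$ is a lower dimensional subspace of $\mathbb{R}^D$, then the dimension may be less than $D$. For example, if $\mathcal{X}$ is the unit sphere $\mathbb{S}^{D-1} := \{x \in \mathbb{R}^D : \|x\|_2 = 1\}$, $d$ is the geodesic distance on $\mathbb{S}^{D-1}$, and $\mu$ is the $(D-1)$-dimensional surface measure, then the dimension is $D - 1$.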

###### Remark 5.

In previous work on $k$-NN statistics (evans02KNNmoments; biau15EntropyKNN) and estimation of information theoretic functionals (sricharan10confidence; krishnamurthy14divergences; singh14divergence; moon14ensemble), it has been common to make the assumption that the sampling distribution has full dimension with constant $\gamma_*$ and $\gamma^*$ (or, equivalently, that the density is lower and upper bounded by positive constants). This excludes distributions with densities approaching $0$ or $\infty$ on their domain, and hence also densities with unbounded support. By letting $\gamma_*$ and $\gamma^*$ be functions, our results extend to unbounded densities that instead satisfy certain tail bounds.

In order to ensure that entropy is well defined, we assume that $P$ is a probability measure absolutely continuous with respect to $\mu$, and that its probability density function $p$ satisfies (see baccetti13infiniteEntropy for discussion of sufficient conditions for the entropy to be finite)

$$H(p) := -\mathbb{E}_{X \sim P}\left[\log p(X)\right] = -\int_{\mathcal{X}} p(x) \log p(x) \, d\mu(x) \in \mathbb{R}. \tag{1}$$

Finally, we assume we have $n + 1$ samples $X, X_1, \dots, X_n$ drawn IID from $P$. We would like to use these samples to estimate the entropy $H(p)$ as defined in Equation (1).

Our analysis and methods relate to the $k$-nearest neighbor distance $\varepsilon_k(x)$, defined for any $x \in \mathcal{X}$ by $\varepsilon_k(x) = d(x, X_i)$, where $X_i$ is the $k^{th}$-nearest neighbor of $x$ in the set $\{X_1, \dots, X_n\}$. Note that, since the definition of dimension used precludes the existence of atoms (i.e., for all $x \in \mathcal{X}$, $\mu(\{x\}) = 0$), $\varepsilon_k(x) > 0$, $\mu$-almost everywhere. This is important, since we will study $\log \varepsilon_k(x)$.

Initially (i.e., in Sections 4 and 5), we will study $\varepsilon_k(x)$ with fixed $x \in \mathcal{X}$, for which we will derive bounds in terms of $\gamma_*(x)$ and $\gamma^*(x)$. When we apply these results to analyze the KL estimator in Sections 7 and 8, we will need to take expectations such as $\mathbb{E}[\log \varepsilon_k(X)]$ (for which we reserve the extra sample $X$), leading to ‘tail bounds’ on $p$ in terms of the functions $\gamma_*$ and $\gamma^*$.

## 4 Concentration of k-NN Distances

We begin with a consequence of the multiplicative Chernoff bound, asserting a sort of concentration of the distance of any point in $\mathcal{X}$ from its $k^{th}$-nearest neighbor in $\{X_1, \dots, X_n\}$. Since the results of this section are concerned with fixed $x \in \mathcal{X}$, for notational simplicity, we suppress the dependence of $\gamma_*$ and $\gamma^*$ on $x$.

###### Lemma 6.

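Let $(\mathcal{X}, d, \Sigma, \mu)$ be a metric measure space of dimension $D$. Suppose $P$ is an absolutely continuous probability measure with full dimension on $\mathcal{X}$ and density function $p$. For $x \in \mathcal{X}$, if $r > \left(\frac{k}{\gamma_* n}\right)^{1/D}$, then

$$\mathbb{P}\left[\varepsilon_k(x) > r\right] \leq e^{-\gamma_* r^D n} \left(\frac{e \gamma_* r^D n}{k}\right)^{k},$$

and, if $r < \left(\frac{k}{\gamma^* n}\right)^{1/D}$, then

$$\mathbb{P}\left[\varepsilon_k(x) \leq r\right] \leq \left(\frac{e \gamma^* r^D n}{k}\right)^{k \gamma_* / \gamma^*}.$$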
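
To make the two inequalities concrete, the following sketch checks them in one setting where everything can be computed exactly. It assumes (this example is not from the paper) the uniform distribution on a circle of circumference 1 with geodesic distance, so that $D = 1$ and $P(B(x, r)) = 2r$ for $r \leq 1/2$, i.e. $\gamma_* = \gamma^* = 2$. In that case $\varepsilon_k(x) > r$ exactly when fewer than $k$ of the $n$ samples fall in $B(x, r)$, which is a binomial event. All function names are hypothetical.

```python
import math


def binom_pmf(n, p, j):
    """P[Binomial(n, p) = j]."""
    return math.comb(n, j) * p**j * (1.0 - p) ** (n - j)


def prob_knn_dist_exceeds(n, k, r, gamma=2.0):
    """Exact P[eps_k(x) > r] in the assumed circle example.

    eps_k(x) > r iff fewer than k of the n samples land in B(x, r),
    and each sample lands there with probability gamma * r (r <= 1/2).
    """
    p = gamma * r
    return sum(binom_pmf(n, p, j) for j in range(k))


def upper_tail_bound(n, k, r, gamma=2.0):
    """First bound of Lemma 6, stated for r > (k / (gamma n))^(1/D), D = 1."""
    mu = gamma * r * n
    return math.exp(-mu) * (math.e * mu / k) ** k


def lower_tail_bound(n, k, r, gamma=2.0):
    """Second bound of Lemma 6 with gamma_* = gamma^*, so the exponent is k."""
    mu = gamma * r * n
    return (math.e * mu / k) ** k


if __name__ == "__main__":
    n, k = 100, 3
    for r in (0.02, 0.05, 0.1):  # all above the threshold k / (2 n) = 0.015
        print(r, prob_knn_dist_exceeds(n, k, r), upper_tail_bound(n, k, r))
    for r in (0.001, 0.005, 0.01):  # all below the threshold
        print(r, 1.0 - prob_knn_dist_exceeds(n, k, r), lower_tail_bound(n, k, r))
```

In this toy setting the exact probabilities sit below the bounds, and the gap shows how much slack the Chernoff argument leaves for moderate $n$.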

## 5 Bounds on Expectations of KNN Statistics

Here, we use the concentration bounds of Section 4 to bound expectations of functions of $k$-nearest neighbor distances. Specifically, we give a simple formula for deriving bounds that applies to many functions of interest, including logarithms and (positive and negative) moments. As in the previous section, the results apply to a fixed $x \in \mathcal{X}$, and we continue to suppress the dependence of $\gamma_*$ and $\gamma^*$ on $x$.

###### Theorem 7.

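Let $(\mathcal{X}, d, \Sigma, \mu)$ be a metric measure space of dimension $D$. Suppose $P$ is an absolutely continuous probability measure with full dimension and density function $p$ that satisfies the tail condition (since $f$ need not be surjective, we use the generalized inverse $f^{-1}$ defined by $f^{-1}(\varepsilon) := \inf\{x \in (0, \infty) : f(x) \geq \varepsilon\}$)

$$\mathbb{E}_{X \sim P}\left[\int_\rho^\infty \left[1 - P\left(B\left(X, f^{-1}(r)\right)\right)\right]^n dr\right] \leq \frac{C_T}{n} \tag{2}$$

for some constant $C_T > 0$. Suppose $f : (0, \infty) \to \mathbb{R}$ is continuously differentiable, with $f' > 0$. Fix $x \in \mathcal{X}$. Then, we have the upper bound

$$\mathbb{E}\left[f_+(\varepsilon_k(x))\right] \leq f_+\left(\left(\frac{k}{\gamma_* n}\right)^{\frac{1}{D}}\right) + \frac{C_T}{n} + \frac{(e/k)^k}{D (n \gamma_*)^{\frac{1}{D}}} \int_k^\infty e^{-y} y^{k + \frac{1}{D} - 1} f'\left(\left(\frac{y}{n \gamma_*}\right)^{\frac{1}{D}}\right) dy \tag{3}$$

and the lower bound

$$\mathbb{E}\left[f_-(\varepsilon_k(x))\right] \leq f_-\left(\left(\frac{k}{\gamma^* n}\right)^{\frac{1}{D}}\right) + \frac{C_T}{n} + \left(\frac{e n \gamma^*}{k}\right)^{\frac{k \gamma_*}{\gamma^*}} \int_0^{\left(\frac{k}{\gamma^* n}\right)^{\frac{1}{D}}} y^{D k \gamma_* / \gamma^*} f'(y) \, dy \tag{4}$$

($f_+(x) = \max\{0, f(x)\}$ and $f_-(x) = -\min\{0, f(x)\}$ denote the positive and negative parts of $f$, respectively).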

###### Remark 8.

If $f : (0, \infty) \to \mathbb{R}$ is continuously differentiable with $f' < 0$, we can apply Theorem 7 to $-f$. Also, similar techniques can be used to prove analogous lower bounds (i.e., lower bounds on the positive part and upper bounds on the negative part).

###### Remark 9.

The tail condition (2) is difficult to validate directly for many distributions. Clearly, it is satisfied when the support of $p$ is bounded. However, tsybakov96rootn show that, for the functions $f$ we are interested in (i.e., logarithms and power functions), when $\mathcal{X} = \mathbb{R}^D$, $d$ is the Euclidean metric, and $\mu$ is the Lebesgue measure, (2) is also satisfied by upper-bounded densities with exponentially decreasing tails. More precisely, that is when there exist and such that, whenever ,

$$a e^{-\alpha \|x\|^\beta} \leq p(x) \leq b e^{-\alpha \|x\|^\beta},$$

which permits, for example, Gaussian distributions. It should be noted that the constant $C_T$ depends only on the metric measure space, the distribution $P$, and the function $f$, and, in particular, not on .

### 5.1 Applications of Theorem 7

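We can apply Theorem 7 to several functions of interest. Here, we demonstrate the cases $f(x) = \log x$ and $f(x) = x^\alpha$ for certain $\alpha$, as we will use these bounds when analyzing the KL estimator.

When $f(x) = \log x$, (3) gives

$$\mathbb{E}\left[\log_+(\varepsilon_k(x))\right] \leq \frac{1}{D} \log_+\left(\frac{k}{\gamma_* n}\right) + \left(\frac{e}{k}\right)^k \frac{\Gamma(k, k)}{D} \leq \frac{1}{D}\left(1 + \log_+\left(\frac{k}{\gamma_* n}\right)\right) \tag{5}$$

(where $\Gamma(s, x) := \int_x^\infty t^{s-1} e^{-t} \, dt$ denotes the upper incomplete Gamma function, and we used the bound $\Gamma(k, k) \leq (k/e)^k$), and (4) gives

$$\mathbb{E}\left[\log_-(\varepsilon_k(x))\right] \leq \frac{1}{D} \log_-\left(\frac{k}{\gamma^* n}\right) + C_1, \tag{6}$$

for a constant $C_1$. For $\alpha > 0$ and $f(x) = x^\alpha$, (3) gives

$$\mathbb{E}\left[\varepsilon_k^\alpha(x)\right] \leq \left(\frac{k}{\gamma_* n}\right)^{\frac{\alpha}{D}} + \left(\frac{e}{k}\right)^k \frac{\alpha \, \Gamma(k + \alpha/D, k)}{D (n \gamma_*)^{\alpha/D}} \leq C_2 \left(\frac{k}{\gamma_* n}\right)^{\frac{\alpha}{D}}, \tag{7}$$

where $C_2$ is a constant. For any $\alpha \in (-D k \gamma_* / \gamma^*, 0)$, when $f(x) = -x^\alpha$, (4) gives

$$\mathbb{E}\left[\varepsilon_k^\alpha(x)\right] \leq C_3 \left(\frac{k}{\gamma^* n}\right)^{\frac{\alpha}{D}}, \tag{8}$$

where $C_3$ is a constant.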
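
As a sanity check on the logarithmic case, the integral terms of (3) and (4) can be evaluated directly for $f(x) = \log x$, where $f'(y) = 1/y$. The computation below is a sketch that uses only the displayed forms of (3) and (4); the shorthand $\kappa := k \gamma_* / \gamma^*$ is introduced here for readability and is not notation from the paper, and the tail term $C_T / n$ is set aside.

$$\frac{(e/k)^k}{D (n \gamma_*)^{1/D}} \int_k^\infty e^{-y} y^{k + \frac{1}{D} - 1} \left(\frac{n \gamma_*}{y}\right)^{\frac{1}{D}} dy = \frac{(e/k)^k}{D} \int_k^\infty e^{-y} y^{k - 1} \, dy = \left(\frac{e}{k}\right)^k \frac{\Gamma(k, k)}{D},$$

$$\left(\frac{e n \gamma^*}{k}\right)^{\kappa} \int_0^{\left(\frac{k}{\gamma^* n}\right)^{1/D}} y^{D \kappa - 1} \, dy = \left(\frac{e n \gamma^*}{k}\right)^{\kappa} \frac{1}{D \kappa} \left(\frac{k}{\gamma^* n}\right)^{\kappa} = \frac{e^{\kappa}}{D \kappa}.$$

The first line reproduces the middle term of (5). The second line does not depend on $n$, which is consistent with the additive term in (6) being a constant.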

## 6 The KL Estimator for Entropy

Recall that, for a random variable $X$ sampled from a probability density $p$ with respect to a base measure $\mu$, the Shannon entropy is defined as

$$H(X) = -\int_{\mathcal{X}} p(x) \log p(x) \, d\mu(x).$$

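As discussed in Section 1, many applications call for an estimate of $H(X)$ given $n$ IID samples $X_1, \dots, X_n \sim p$. For a positive integer $k$, the KL estimator is typically written as

$$\hat{H}_k(X) = \psi(n) - \psi(k) + \log c_D + \frac{D}{n} \sum_{i=1}^n \log \varepsilon_k(X_i),$$

where $\psi$ denotes the digamma function. The motivating insight is the observation that, independent of the sampling distribution (see Kraskov04estimating for a concise proof of this fact),

$$\mathbb{E}\left[\log P(B(X_i, \varepsilon_k(X_i)))\right] = \psi(k) - \psi(n).$$

Hence,

$$\begin{aligned}
\mathbb{E}\left[\hat{H}_k(X)\right]
&= \mathbb{E}\left[-\log P(B(X_i, \varepsilon_k(X_i))) + \log c_D + \frac{D}{n} \sum_{i=1}^n \log \varepsilon_k(X_i)\right] \\
&= -\mathbb{E}\left[\frac{1}{n} \sum_{i=1}^n \log\left(\frac{P(B(X_i, \varepsilon_k(X_i)))}{c_D \, \varepsilon_k^D(X_i)}\right)\right] \\
&= -\mathbb{E}\left[\frac{1}{n} \sum_{i=1}^n \log p_{\varepsilon_k(X_i)}(X_i)\right] = -\mathbb{E}\left[\log p_{\varepsilon_k(X_1)}(X_1)\right],
\end{aligned}$$

where, for any $x \in \mathcal{X}$, $\varepsilon > 0$,

$$p_\varepsilon(x) = \frac{1}{c_D \varepsilon^D} \int_{B(x, \varepsilon)} p(y) \, d\mu(y) = \frac{P(B(x, \varepsilon))}{c_D \varepsilon^D}$$

denotes the local average of $p$ in a ball of radius $\varepsilon$ around $x$. Since $p_\varepsilon$ is a smoothed approximation of $p$ (with smoothness increasing with $\varepsilon$), the KL estimate can be intuitively thought of as a plug-in estimator for $H(X)$, using a density estimate with an adaptive smoothing parameter.

In the next two sections, we utilize the bounds derived in Section 5 to bound the bias and variance of the KL estimator. We note that, for densities in the $\beta$-Hölder smoothness class ($\beta \in (0, 2]$), our results imply a mean-squared error of $O(n^{-2\beta/D})$ when $\beta < D/2$ and $O(n^{-1})$ when $\beta \geq D/2$.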
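
The estimator formula above translates almost line by line into code. The following is a minimal sketch, assuming Euclidean data in $\mathbb{R}^D$ with Lebesgue base measure, so that $c_D$ is the volume of the unit ball; it uses a brute-force $O(n^2 \log n)$ neighbor search rather than a tree, and only the Python standard library. The function names are hypothetical and are not taken from the paper or from any library.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def digamma_int(m):
    """Digamma at a positive integer: psi(m) = -gamma + sum_{j=1}^{m-1} 1/j."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, m))


def unit_ball_volume(D):
    """c_D for the Euclidean metric: Lebesgue volume of the unit ball in R^D."""
    return math.pi ** (D / 2.0) / math.gamma(D / 2.0 + 1.0)


def kl_entropy(samples, k=3):
    """KL entropy estimate (in nats) from a list of equal-length tuples.

    Implements psi(n) - psi(k) + log c_D + (D / n) * sum_i log eps_k(X_i),
    where eps_k(X_i) is the distance from X_i to its k-th nearest neighbor
    among the other samples. Assumes no duplicate points (eps_k > 0).
    """
    n = len(samples)
    D = len(samples[0])
    log_eps_sum = 0.0
    for i, x in enumerate(samples):
        dists = sorted(math.dist(x, y) for j, y in enumerate(samples) if j != i)
        log_eps_sum += math.log(dists[k - 1])
    return (digamma_int(n) - digamma_int(k)
            + math.log(unit_ball_volume(D)) + D * log_eps_sum / n)


if __name__ == "__main__":
    random.seed(0)
    data = [(random.gauss(0.0, 1.0),) for _ in range(500)]
    true_entropy = 0.5 * math.log(2.0 * math.pi * math.e)  # N(0, 1), in nats
    print("estimate:", kl_entropy(data, k=3), "truth:", true_entropy)
```

For larger samples a spatial index such as a $k$-d tree would replace the inner sort; the brute-force loop is kept here only so that the correspondence with the formula stays visible.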

## 7 Bias Bound

In this section, we prove bounds on the bias of the KL estimator, first in a relatively general setting, and then, as a corollary, in a more specific but better understood setting.

###### Theorem 10.

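Suppose $(\mathcal{X}, d, \Sigma, \mu)$ and $P$ satisfy the conditions of Theorem 7, and there exist $C_\beta, \beta > 0$ with

$$\sup_{x \in \mathcal{X}} \left|p(x) - p_\varepsilon(x)\right| \leq C_\beta \varepsilon^\beta,$$

and suppose $p$ satisfies a ‘tail bound’

$$\Gamma_B := \mathbb{E}_{X \sim P}\left[\left(\gamma_*(X)\right)^{-\frac{\beta + D}{D}}\right] < \infty. \tag{9}$$

Then,

$$\left|\mathbb{E}\left[H(X) - \hat{H}_k(X)\right]\right| \leq C_B \left(\frac{k}{n}\right)^{\frac{\beta}{D}},$$

where $C_B$ is a constant.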

We now show that the conditions of Theorem 10 are satisfied by densities in the commonly used nonparametric class of $\beta$-Hölder continuous densities on $\mathbb{R}^D$.

###### Definition 11.

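Given a constant $\beta > 0$ and an open set $\mathcal{X} \subseteq \mathbb{R}^D$, a function $f : \mathcal{X} \to \mathbb{R}$ is called $\beta$-Hölder continuous if $f$ is $\ell$ times differentiable and there exists $L > 0$ such that, for any multi-index $\alpha \in \mathbb{N}^D$ with $|\alpha| = \ell$,

$$\sup_{x \neq y \in \mathcal{X}} \frac{\left|D^\alpha f(x) - D^\alpha f(y)\right|}{\|x - y\|^{\beta - \ell}} \leq L,$$

where $\ell$ is the greatest integer strictly less than $\beta$.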

###### Definition 12.

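Given an open set $\mathcal{X} \subseteq \mathbb{R}^D$ and a function $f : \mathcal{X} \to \mathbb{R}$, $f$ is said to vanish on the boundary $\partial \mathcal{X}$ of $\mathcal{X}$ if, for any sequence $\{x_i\}_{i=1}^\infty$ in $\mathcal{X}$ with $\inf_{x' \in \partial \mathcal{X}} \|x_i - x'\|_2 \to 0$ as $i \to \infty$, $f(x_i) \to 0$ as $i \to \infty$. Here,

$$\partial \mathcal{X} := \left\{x \in \mathbb{R}^D : \forall \delta > 0, \; B(x, \delta) \not\subseteq \mathcal{X} \text{ and } B(x, \delta) \not\subseteq \mathcal{X}^c\right\},$$

denotes the boundary of $\mathcal{X}$.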

###### Corollary 13.

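Consider the metric measure space $(\mathbb{R}^D, d, \Sigma, \mu)$, where $d$ is Euclidean and $\mu$ is the Lebesgue measure. Let $P$ be an absolutely continuous probability measure with full dimension and density $p$ supported on an open set $\mathcal{X} \subseteq \mathbb{R}^D$. Suppose $p$ satisfies (9) and the conditions of Theorem 7 and is $\beta$-Hölder continuous ($\beta \in (0, 2]$) with constant $L$. Assume $p$ vanishes on $\partial \mathcal{X}$. If $\beta > 1$, assume $\|\nabla p\|_2$ vanishes on $\partial \mathcal{X}$. Then,

$$\left|\mathbb{E}\left[\hat{H}_k(X) - H(X)\right]\right| \leq C_H \left(\frac{n}{k}\right)^{-\frac{\beta}{D}},$$

where $C_H$ is a constant.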

###### Remark 14.

The assumption that $p$ (and perhaps $\|\nabla p\|_2$) vanish on the boundary of $\mathcal{X}$ can be thought of as ensuring that the trivial continuation

$$q(x) = \begin{cases} p(x) & x \in \mathcal{X} \\ 0 & x \in \mathbb{R}^D \setminus \mathcal{X} \end{cases}$$

of $p$ to $\mathbb{R}^D$ is $\beta$-Hölder continuous. This reduces boundary bias, for which the KL estimator does not correct. (Several estimators controlling for boundary bias have been proposed; e.g., sricharan10confidence give a modified $k$-NN estimator that accomplishes this without prior knowledge of $\mathcal{X}$.)

## 8 Variance Bound

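We first use the bounds proven in Section 5 to prove uniform (in $n$) bounds on the moments of $\log \varepsilon_k(X)$. We show that, for any fixed $k$, although $\log \varepsilon_k(X) \to -\infty$ almost surely as $n \to \infty$, the variance of $\log \varepsilon_k(X)$, and indeed all higher central moments of $\log \varepsilon_k(X)$, are bounded, uniformly in $n$. In fact, there exist exponential bounds, independent of $n$, on the density of $\log \varepsilon_k(X) - \mathbb{E}[\log \varepsilon_k(X)]$.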

### 8.1 Moment Bounds on Logarithmic k-NN distances

###### Lemma 15.

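Suppose $(\mathcal{X}, d, \Sigma, \mu)$ and $P$ satisfy the conditions of Theorem 7. Suppose also that . Let and assume the following expectations are finite:

$$\Gamma := \mathbb{E}_{X \sim P}\left[\frac{\gamma^*(X)}{\gamma_*(X)}\right] < \infty. \tag{10}$$

$$\Gamma_*(\lambda) := \mathbb{E}_{X \sim P}\left[\left(\gamma_*(X)\right)^{-\lambda/D}\right] < \infty. \tag{11}$$

$$\Gamma^*(\lambda) := \mathbb{E}_{X \sim P}\left[\left(\gamma^*(X)\right)^{\lambda/D}\right] < \infty. \tag{12}$$

Then, for any integer $\ell > 1$, the $\ell^{th}$ central moment

$$M_\ell := \mathbb{E}\left[\left(\log \varepsilon_k(X) - \mathbb{E}\left[\log \varepsilon_k(X)\right]\right)^\ell\right]$$

satisfies

$$M_\ell \leq C_M \, \ell! / \lambda^\ell, \tag{13}$$

where $C_M$ is a constant independent of $n$, $\ell$, and $\lambda$.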
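
The qualitative content of the lemma, that $\log \varepsilon_k$ drifts to $-\infty$ while its spread stays bounded, is easy to see in simulation. The sketch below assumes (this setting is chosen for illustration and is not from the paper) the uniform distribution on a circle of circumference 1 with geodesic distance, evaluates $\log \varepsilon_k(x)$ at the fixed point $x = 0$, and compares the sample mean and variance at two sample sizes. Function names are hypothetical.

```python
import heapq
import math
import random
import statistics


def log_knn_distance_on_circle(n, k, rng):
    """log eps_k(x) at x = 0 for n uniform samples on a unit-circumference circle."""
    # The geodesic distance from 0 to a point u in [0, 1) is min(u, 1 - u).
    dists = (min(u, 1.0 - u) for u in (rng.random() for _ in range(n)))
    # The k-th smallest distance is the k-NN distance of x = 0.
    return math.log(heapq.nsmallest(k, dists)[-1])


def simulate(n, k, trials, seed=0):
    """Monte Carlo mean and variance of log eps_k(0) over independent samples."""
    rng = random.Random(seed)
    values = [log_knn_distance_on_circle(n, k, rng) for _ in range(trials)]
    return statistics.fmean(values), statistics.variance(values)


if __name__ == "__main__":
    for n in (50, 800):
        mean, var = simulate(n, k=3, trials=1000)
        print(f"n={n}: mean log eps_k = {mean:.3f}, variance = {var:.3f}")
```

In this toy case the mean decreases roughly like $-\log n$ while the variance stays of the same order across sample sizes, which is the behavior the uniform moment bound (13) formalizes.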

###### Remark 16.

The conditions (10), (11), and (12) are mild. For example, when $\mathcal{X} = \mathbb{R}^D$, $d$ is the Euclidean metric, and $\mu$ is the Lebesgue measure, it suffices that $p$ is Lipschitz continuous (significantly milder conditions than Lipschitz continuity suffice, but are difficult to state here due to space limitations) and there exist such that whenever . The condition is more prohibitive, but still permits many (possibly unbounded) distributions of interest.

###### Remark 17.

If the terms $\log \varepsilon_k(X_i)$ were independent, a Bernstein inequality, together with the moment bound (13), would imply a sub-Gaussian concentration bound on the KL estimator about its expectation. This may follow from one of several more refined concentration results relaxing the independence assumption that have been proposed.

### 8.2 Bound on the Variance of the KL Estimate

Bounds on the variance of the KL estimator now follow from the law of large numbers in evans08SLLNforKNN (itself an application of the Efron-Stein inequality to $k$-NN statistics).

###### Theorem 18.

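Suppose $(\mathcal{X}, d, \Sigma, \mu)$ and $P$ satisfy the conditions of Lemma 15, and that there exists a constant $N_k \in \mathbb{N}$ such that, for any finite $F \subseteq \mathcal{X}$, any $x \in F$ can be among the $k$-NN of at most $N_k$ other points in that set. Then, $\hat{H}_k(X) \to \mathbb{E}[\hat{H}_k(X)]$ almost surely (as $n \to \infty$), and, for and $M_4$ satisfying (13),

$$\mathbb{V}\left[\hat{H}_k(X)\right] \leq \frac{5 (3 + k N_k)(3 + 64 k) M_4}{n} \in O\left(\frac{1}{nk}\right).$$
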
###### Remark 19.

$N_k$ depends only on $k$ and the geometry of the metric space $(\mathcal{X}, d)$. For example, Corollary A.2 of evans08SLLNforKNN shows that, when $\mathcal{X} = \mathbb{R}^D$ and $d$ is the Euclidean metric, then $N_k \leq k K(D)$, where $K(D)$ is the kissing number of $\mathbb{R}^D$.

## 9 Bounds on the Mean Squared Error

The bias and variance bounds (Theorems 10 and 18) imply a bound on the mean squared error of the KL estimator:

###### Corollary 20.

Suppose

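1. $p$ is $\beta$-Hölder continuous with $\beta \in (0, 2]$.

2. $p$ vanishes on $\partial \mathcal{X}$. If $\beta > 1$, then also suppose $\|\nabla p\|_2$ vanishes on $\partial \mathcal{X}$.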

[TODO: Other assumptions.] satisfies the assumptions of Theorems 10 and 18. Then,

$$\mathbb{E}\left[\left(\hat{H}_k(X) - H(X)\right)^2\right] \leq C_B^2 \left(\frac{k}{n}\right)^{2\beta/D} + \frac{C_V}{nk}. \tag{14}$$

If we let scale as this gives an overall convergence rate of

$$\mathbb{E}\left[\left(\hat{H}_k(X) - H(X)\right)^2\right] \leq C_B^2 \left(\frac{k}{n}\right)^{2\beta/D} + \frac{C_V}{nk}. \tag{15}$$

## 10 Conclusions and Future Work

This paper derives finite-sample bounds on the bias and variance of the KL estimator under general conditions, including for certain classes of unbounded distributions. As intermediate results, we proved concentration inequalities for $k$-NN distances and bounds on the expectations of statistics of $k$-NN distances. We hope these results and methods may lead to convergence rates for the widely used KSG mutual information estimator, or help generalize convergence rates for other estimators of entropy and related functionals to unbounded distributions.

## Acknowledgements

This material is based upon work supported by a National Science Foundation Graduate Research Fellowship to the first author under Grant No. DGE-1252522.

## References
