# Minimax Optimal Estimation of KL Divergence for Continuous Distributions

## Abstract

Estimating Kullback-Leibler divergence from independent and identically distributed samples is an important problem in various domains. One simple and effective estimator is based on the k nearest neighbor (kNN) distances between these samples. In this paper, we analyze the convergence rates of the bias and variance of this estimator. Furthermore, we derive a lower bound on the minimax mean square error and show that the kNN method is asymptotically rate optimal.

## I Introduction

Kullback-Leibler (KL) divergence has a broad range of applications in information theory, statistics and machine learning. For example, KL divergence can be used in hypothesis testing [1], text classification [10], outlying sequence detection [5], multimedia classification [18], speech recognition [20], etc. In many applications, we wish to know the value of the KL divergence, but the distributions are unknown. Therefore, it is important to estimate KL divergence based only on independent and identically distributed (i.i.d.) samples. This problem has been widely studied [6, 27, 28, 2, 21, 7, 8, 30].

The estimation method depends on whether the underlying distributions are discrete or continuous. For discrete distributions, an intuitive method is the plug-in estimator, which first estimates the probability mass function (PMF) by counting the number of occurrences of each possible value and then calculates the KL divergence from the estimated PMF. However, since it is always possible that the number of occurrences at some locations is zero, this method has infinite bias and variance for arbitrarily large sample sizes. As a result, it is necessary to design new estimators such that both the bias and the variance converge to zero. Several methods have been proposed in [30, 7, 8]. These methods perform well for distributions with fixed alphabet size. Recently, there has been growing interest in designing estimators that are suitable for distributions with growing alphabet size. [6] provided an ‘augmented plug-in estimator’, which is a modification of the simple plug-in method. The basic idea of this method is to add a term to both the numerator and the denominator when calculating the ratio of the probability masses. Although this modification introduces some additional bias, the overall bias is reduced. Moreover, a minimax lower bound was also derived in [6], which shows that the augmented plug-in estimator proposed there is rate optimal.

For continuous distributions, there are also many interesting methods. A simple one is to divide the support into many bins, so that continuous values are quantized and each distribution is converted to a discrete one. The KL divergence can then be estimated from these two discrete distributions. However, compared with other methods, this method is usually inefficient, especially when the distributions have heavy tails, as the probability mass of a bin in the tail of a distribution is hard to estimate. An improvement was proposed in [27], which is based on data dependent partitions of the densities with an appropriate bias correction technique. Compared with the direct partition method mentioned above, this adaptive method constructs more bins in regions with higher density, and fewer bins elsewhere, to ensure that the probability mass in each bin is approximately equal. It is shown in [27] that this method is strongly consistent. Another estimator was designed in [19], which uses a kernel based approach to estimate the density ratio. There are also some previous works that focus on the more general problem of estimating $f$-divergence, with KL divergence being a special case. For example, [17] constructed an estimator based on a weighted ensemble of plug-in estimators, whose parameters need to be tuned properly to get a good bias and variance tradeoff. Another method for estimating $f$-divergence in general was proposed in [21], under certain structural assumptions.

Among all the methods for the estimation of KL divergence between two continuous distributions, a simple and effective one is the k nearest neighbor (kNN) based estimator. The kNN method, which was first proposed in [11], is a powerful tool for nonparametric statistics. Kozachenko and Leonenko [15] designed a kNN based method for the estimation of differential entropy, which is convenient to use and requires little parameter tuning. Both theoretical analysis and numerical experiments show that this method has desirable accuracy [25, 23, 22, 12, 3, 31, 14]. In particular, [31] shows that this estimator is nearly minimax rate optimal under some assumptions. The estimation of KL divergence shares some similarity with entropy estimation, since the KL divergence between $f$ and $g$, which denote the probability density functions (pdfs) of two distributions, is the difference between the cross entropy between $f$ and $g$ and the entropy of $f$. As a result, the idea of the Kozachenko-Leonenko entropy estimator can be used to construct a kNN based estimator for KL divergence, which was first proposed in [28]. The basic idea of this estimator is to obtain an approximate value of the ratio between $f$ and $g$ based on the ratio of kNN distances. It has been discussed in [28] that, compared with other KL divergence estimators, the kNN based estimator has a much lower sample complexity, and is easier to generalize and implement for high dimensional data. Moreover, it was proved in [28] that the kNN based estimator is consistent, which means that both the bias and the variance converge to zero as the sample sizes increase. However, the convergence rate remains unknown.

In this paper, we make the following two contributions. Our first main contribution is the analysis of the convergence rates of the bias and variance of the kNN based KL divergence estimator proposed in [28]. For the bias, we discuss two significantly different types of distributions separately. In the first type, both pdfs $f$ and $g$ have bounded support and are bounded away from zero. One such example is when both distributions are uniform. This implies that the distribution has boundaries, where the pdf changes suddenly. There are two main sources of estimation bias of the kNN method in this case. The first source is the boundary effect, as the kNN method tends to underestimate the pdf values in the region near the boundary. The second source is the local non-uniformity of the pdf. It can be shown that the bias caused by the second source converges fast enough to be negligible. As a result, the boundary bias is the main cause of bias of the kNN based KL divergence estimator for the first type of distributions. In the second type, we assume that both $f$ and $g$ are continuous everywhere. For example, a pair of Gaussian distributions with different means or variances belongs to this case. For this type of distributions, the boundary effect does not exist. However, as the density values can be arbitrarily close to zero, we need to consider the bias caused by the tail region, in which $f$ or $g$ is too low and thus the kNN distances are too large for us to obtain an accurate estimate of the density ratio $f/g$. For the variance of this estimator, we bound the convergence rate under a unified assumption, which holds for both cases discussed above. The convergence rate of the mean square error can then be obtained from those of the bias and variance. In this paper, we assume that $k$ is fixed. We will show that with fixed $k$, the convergence rate of the mean square error over the sample sizes is already minimax optimal.

Our second main contribution is to derive a minimax lower bound of the mean square error of KL divergence estimation, which characterizes the theoretical limit of the convergence rate of any method. For discrete distributions, the minimax lower bound has already been derived in [13] and [6]. However, for continuous distributions, the minimax lower bound has not been established. In fact, there exist no estimators that are uniformly consistent for all continuous distributions. For example, let , in which is the indicator function, and is uniform in . Then the estimation error of KL divergence between and equals the estimation error of the entropy of . Since can be arbitrarily large, according to the lower bound derived in [29], there exists no uniformly consistent estimator. As a result, to find a minimax lower bound, it is necessary to impose some restrictions on the distributions. In this paper, we analyze the minimax lower bound for two cases that match our assumptions for deriving the upper bound, i.e. distributions with bounded support and densities bounded away from zero, and distributions that are smooth everywhere and whose densities can be arbitrarily close to zero. For each case, we show that the minimax lower bound nearly matches our upper bound for the kNN method. This result indicates that the kNN based KL divergence estimator is nearly minimax optimal. To the best of our knowledge, our work is the first attempt to analyze the convergence rate of the kNN based KL divergence estimator and to prove its minimax optimality.

The remainder of this paper is organized as follows. In Section II, we provide the problem statements. In Sections III and IV, we characterize the convergence rates of the bias and variance of the kNN based KL divergence estimator respectively. In Section V, we show the minimax lower bound. We then provide numerical examples in Section VI, and concluding remarks in Section VII.

## II Problem Statement

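Consider two pdfs $f, g: \mathbb{R}^d \rightarrow \mathbb{R}$, where $f(\mathbf{x}) > 0$ only if $g(\mathbf{x}) > 0$. The KL divergence between $f$ and $g$ is defined as

$$D(f\|g) = \int f(\mathbf{x}) \ln \frac{f(\mathbf{x})}{g(\mathbf{x})}\, d\mathbf{x}. \tag{1}$$

$f$ and $g$ are unknown. However, we are given a set of samples $\{\mathbf{X}_1, \ldots, \mathbf{X}_N\}$ drawn i.i.d. from pdf $f$, and another set of samples $\{\mathbf{Y}_1, \ldots, \mathbf{Y}_M\}$ drawn i.i.d. from pdf $g$. The goal is to estimate $D(f\|g)$ based on these samples.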

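[28] proposed a kNN based estimator:

$$\hat{D}(f\|g) = \frac{d}{N}\sum_{i=1}^{N} \ln \frac{\nu_i}{\epsilon_i} + \ln \frac{M}{N-1}, \tag{2}$$

in which $\epsilon_i$ is the distance between $\mathbf{X}_i$ and its $k$-th nearest neighbor in $\{\mathbf{X}_1, \ldots, \mathbf{X}_{i-1}, \mathbf{X}_{i+1}, \ldots, \mathbf{X}_N\}$, while $\nu_i$ is the distance between $\mathbf{X}_i$ and its $k$-th nearest neighbor in $\{\mathbf{Y}_1, \ldots, \mathbf{Y}_M\}$, and $d$ is the dimension. The distance between any two points $\mathbf{u}, \mathbf{v}$ is defined as $\|\mathbf{u} - \mathbf{v}\|$, in which $\|\cdot\|$ can be an arbitrary norm. The basic idea of this estimator is to use the kNN method to estimate the density ratio. An estimate of $f$ at $\mathbf{X}_i$ is

$$\hat{f}(\mathbf{X}_i) = \frac{k}{N-1} \cdot \frac{1}{V(B(\mathbf{X}_i, \epsilon_i))}, \tag{3}$$

in which $B(\mathbf{x}, r)$ is the ball of radius $r$ centered at $\mathbf{x}$ and $V(S)$ is the volume of a set $S$. (3) can be understood as follows. Apart from $\mathbf{X}_i$, there are another $N-1$ samples from $f$, among which $k$ points fall in $B(\mathbf{X}_i, \epsilon_i)$. Therefore, $k/(N-1)$ is an estimate of $P_f(B(\mathbf{X}_i, \epsilon_i))$, in which $P_f$ is the probability mass with respect to the distribution with pdf $f$. As the distribution is continuous, we have $P_f(B(\mathbf{X}_i, \epsilon_i)) \approx f(\mathbf{X}_i) V(B(\mathbf{X}_i, \epsilon_i))$. We can then use (3) to estimate $f(\mathbf{X}_i)$. Similarly, as there are $M$ samples generated from $g$, we can obtain an estimate of $g(\mathbf{X}_i)$ by

$$\hat{g}(\mathbf{X}_i) = \frac{k}{M} \cdot \frac{1}{V(B(\mathbf{X}_i, \nu_i))}. \tag{4}$$

As

$$D(f\|g) = \mathbb{E}_{\mathbf{X} \sim f}\left[\ln \frac{f(\mathbf{X})}{g(\mathbf{X})}\right] \approx \frac{1}{N}\sum_{i=1}^{N} \ln \frac{f(\mathbf{X}_i)}{g(\mathbf{X}_i)}, \tag{5}$$

by replacing $f(\mathbf{X}_i)$ and $g(\mathbf{X}_i)$ with (3) and (4) respectively, we obtain the expression of the KL divergence estimator in (2).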
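As a worked check of this substitution step, the short derivation below assumes the notation $\epsilon_i$ and $\nu_i$ for the kNN distances of $\mathbf{X}_i$ within the first sample and to the second sample, and $c_d$ for the volume of the unit ball. These symbol names are chosen here for illustration. It uses only the fact that a ball of radius $r$ in $d$ dimensions has volume $c_d r^d$.

```latex
\begin{align*}
\ln\frac{\hat f(\mathbf{X}_i)}{\hat g(\mathbf{X}_i)}
  &= \ln\frac{k}{(N-1)\,c_d\,\epsilon_i^{d}} - \ln\frac{k}{M\,c_d\,\nu_i^{d}} \\
  &= d\,\ln\frac{\nu_i}{\epsilon_i} + \ln\frac{M}{N-1}, \\
\frac{1}{N}\sum_{i=1}^{N}\ln\frac{\hat f(\mathbf{X}_i)}{\hat g(\mathbf{X}_i)}
  &= \frac{d}{N}\sum_{i=1}^{N}\ln\frac{\nu_i}{\epsilon_i} + \ln\frac{M}{N-1}.
\end{align*}
```

Both $k$ and $c_d$ cancel in the ratio, so the estimator depends on the data only through the ratio of the two kNN distances.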

[28] has proved that this estimator is consistent, but the convergence rate remains unknown. In this paper, we analyze the convergence rates of the bias and variance of this estimator, and derive the minimax lower bound.
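The estimator of [28], $\frac{d}{N}\sum_i \ln(\nu_i/\epsilon_i) + \ln\frac{M}{N-1}$, can be sketched in a few lines. The following is a minimal brute-force illustration, not the authors' implementation: it assumes the Euclidean norm, samples stored as lists of equal-length tuples, and no duplicate points. The function names `kth_nn_distance` and `knn_kl_divergence` are hypothetical.

```python
import math
import random


def kth_nn_distance(point, others, k):
    """Euclidean distance from `point` to its k-th nearest neighbor in `others`."""
    dists = sorted(math.dist(point, other) for other in others)
    return dists[k - 1]


def knn_kl_divergence(xs, ys, k=1):
    """kNN estimate of D(f||g) from xs ~ f and ys ~ g (lists of d-tuples).

    Assumes continuous data without ties, so every kNN distance is positive.
    Brute-force search: O(N * (N + M)) distance evaluations.
    """
    n, m = len(xs), len(ys)
    d = len(xs[0])
    total = 0.0
    for i, x in enumerate(xs):
        # eps: k-th NN distance among the other N - 1 samples from f
        eps = kth_nn_distance(x, xs[:i] + xs[i + 1:], k)
        # nu: k-th NN distance among the M samples from g
        nu = kth_nn_distance(x, ys, k)
        total += math.log(nu / eps)
    return d * total / n + math.log(m / (n - 1))


if __name__ == "__main__":
    rng = random.Random(0)
    # Illustrative choice: f = N(0, 1), g = N(1, 1), whose true KL divergence is 0.5
    xs = [(rng.gauss(0.0, 1.0),) for _ in range(500)]
    ys = [(rng.gauss(1.0, 1.0),) for _ in range(500)]
    print(knn_kl_divergence(xs, ys, k=3))
```

For large samples, a tree-based neighbor search would replace the sorted scan; the brute-force version is kept here only because it mirrors the definition directly.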

## III Bias Analysis

In this section, we derive the convergence rate of the bias of the estimator (2). We consider two different cases depending on whether the support is bounded or not, as they have different sources of bias.

### III-A The Case with Bounded Support

We first discuss the case in which the distributions have bounded support and the densities are bounded away from zero. The main source of bias in this case is the boundary effect. The analysis is based on the following assumptions:

###### Assumption 1.

Assume the following conditions:

(a) , in which and are the supports of and ;

(b) There exist constants , , , such that for all and for all ;

(c) The surface areas (or Hausdorff measure) of and are bounded by and ;

(d) The diameters of and are bounded by , i.e. ;

(e) There exists a constant such that for all and , , and for all , , in which denotes the volume of a set;

(f) The Hessian of and are both bounded by .

Assumption (a) is necessary to ensure that the definition of KL divergence in (1) is valid. (b) bounds the pdf values from below and above. (c) restricts the surface area of the supports of $f$ and $g$. Since the kNN divergence estimator tends to incur significant bias in the region near the boundary, the estimation bias for distributions with irregular supports of large surface area is usually large. (d) requires the boundedness of the support. The case with unbounded support will be considered in Section III-B. (e) ensures that the angles at the corners of the support sets have a lower bound, so that there will not be significant bias in the corner regions. (f) ensures the smoothness of the distributions in the support sets. Note that (3) and (4) actually estimate the averages of $f$ and $g$ over the corresponding kNN balls. If $f$ and $g$ are smooth, then these averages will not deviate too much from the pdf values at the centers of the balls.

Based on the above assumptions, we have the following theorem regarding the bias of estimator (2).

###### Theorem 1.

Under Assumption 1, the convergence rate of the bias of kNN based KL divergence estimator is bounded by:

(6) |

###### Proof.

(Outline) Considering that

$$D(f\|g) = -h(f) - \mathbb{E}_{\mathbf{X} \sim f}\left[\ln g(\mathbf{X})\right], \tag{7}$$

in which $h(f)$ denotes the differential entropy of $f$, we decompose the KL divergence estimator into an estimator of the differential entropy of $f$ and an estimator of the cross entropy between $f$ and $g$. We then bound the bias of these two estimators. In particular, we can write

(8) |

with

(9) |

in which is the digamma function, , with being the Gamma function. Due to the property of Gamma distribution, we know that , and . Hence decays sufficiently fast and can be negligible for large sample sizes and .

has the same form as the bias of the Kozachenko-Leonenko entropy estimator [15], which has been analyzed extensively in the literature [4, 12, 23, 3, 31]. With some modifications, the proofs related to the entropy estimator can also be used to bound , which is actually the bias of a cross entropy estimator. However, as our assumptions differ from those made in the previous literature, we need to derive (6) in a different way.

In our proof, for both the entropy estimator and the cross entropy estimator, we divide the support into two parts, the central region and the boundary region. In the central region, will be within and will be within with high probability. Since $f$ and $g$ are smooth, the expected estimates and are very close to the truth, and thus will not cause significant bias. The main bias comes from the boundary region, in which the density estimates and are no longer accurate, as or exceeds the supports and . We bound the boundary bias by letting the boundary region shrink at a suitable rate.

The detailed proof is shown in Appendix A. ∎

### III-B The Case with Smooth Distributions

We now consider the second case where the density is smooth everywhere and the density can be arbitrarily close to zero. For this case, the main source of bias is tail effects. We make the following assumptions:

###### Assumption 2.

Assume the following conditions:

(a) If , then ;

(b) and for some constants and , in which follows a distribution with pdf ;

(c), for some constant ;

(d) , and for some constants , .

Assumption (a) ensures that the definition of KL divergence in (1) is valid. (b) is the tail assumption. A lower indicates a stronger tail, and thus the convergence of bias of the KL divergence estimator will be slower. For example, for Gaussian distribution and for Cauchy distribution. (c) is the smoothness assumption. (d) is an additional tail assumption, which is actually very weak and holds for almost all of the common distributions, since can be arbitrarily small. However, this assumption is important since it prevents very large and . Based on the above assumptions, we have the following theorem regarding the bias of estimator (2).

###### Theorem 2.

Under Assumption 2, the convergence rate of the bias of kNN based KL divergence estimator is bounded by:

(10) |

###### Proof.

(Outline) Similar to the proof of Theorem 1, we again decompose the KL divergence estimator into two estimators that estimate the entropy of $f$ and the cross entropy between $f$ and $g$, respectively. In particular, we can still decompose the bias using (8). For simplicity, we only provide the convergence bound of , which is the error of the cross entropy estimator. The bound for the entropy estimator follows similarly.

For the cross entropy estimator, we divide the support into two parts, including a central region , in which or is relatively high, and a tail region , in which or is relatively low. According to the results of order statistics [9, 4], , in which is the probability mass of with respect to the distribution with pdf . Therefore, can be bounded by

(11) |

We bound the two terms in (11) separately. To derive the bound of bias in , we find a high probability upper bound of , denoted as . The bound of bias can be obtained by bounding the local non-uniformity of in if . On the contrary, if , we use assumption (d) to ensure that will not be too large, and thus will not cause significant estimation error. We let decay with at a suitable rate, to maximize the overall convergence rate of the bias.

To bound the bias in , we let the threshold between and decay with sample size , so that the probability mass of also decreases with . We then combine the bounds of and , and adjust the decay rate of the threshold between and properly.

The detailed proof can be found in Appendix B. ∎
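The order-statistics identity invoked in the outline above can be checked numerically. A standard fact, stated here in generic notation that is an assumption of this sketch rather than the notation of Appendix B, is that the probability mass of the ball reaching the $k$-th nearest neighbor among $n$ i.i.d. samples is distributed as the $k$-th smallest of $n$ independent Uniform(0,1) variables, so its expected logarithm equals $\psi(k) - \psi(n+1)$. The function names below are hypothetical.

```python
import math
import random


def mean_log_kth_uniform(n, k, trials, rng):
    """Monte Carlo estimate of E[ln U_(k)], where U_(k) is the k-th smallest
    of n independent Uniform(0, 1) draws."""
    total = 0.0
    for _ in range(trials):
        draws = sorted(rng.random() for _ in range(n))
        total += math.log(draws[k - 1])
    return total / trials


def digamma_difference(n, k):
    """psi(k) - psi(n + 1) for integers 1 <= k <= n.

    Uses the recurrence psi(m + 1) = psi(m) + 1/m, which gives
    psi(n + 1) - psi(k) = sum_{j=k}^{n} 1/j.
    """
    return -sum(1.0 / j for j in range(k, n + 1))


if __name__ == "__main__":
    rng = random.Random(0)
    print(mean_log_kth_uniform(50, 3, 20000, rng), digamma_difference(50, 3))
```

With $n = N - 1$ neighbors drawn from the same sample, the identity yields $\psi(k) - \psi(N)$; with $n = M$ it yields $\psi(k) - \psi(M + 1)$.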

## IV Variance Analysis

We now discuss the variance of this divergence estimator, based on the following unifying assumptions.

###### Assumption 3.

Assume that the following conditions hold:

(a) and are continuous almost everywhere;

(b) , such that

(12) | |||

(13) | |||

(14) | |||

(15) |

in which

(16) |

is the average of over . is similarly defined;

(c) and for two finite constants ;

(d) There exist two constants and , such that for all , and .

Assumption 3 (a)-(c) are satisfied if either Assumption 1 or Assumption 2 is satisfied. (a) only requires that the pdfs are continuous almost everywhere, and thus holds not only for distributions that are smooth everywhere, but also for distributions that have boundaries. (b) is obviously satisfied under Assumption 1, since that assumption requires the densities to be bounded both from above and from below. From Assumption 2, it is also straightforward to show that and . This property, combined with the smoothness condition (Assumption 2 (c)), implies that (15) holds for sufficiently small . (c) is the same as Assumption 2 (d) and weaker than Assumption 1 (d). Therefore, (a)-(c) are weaker than both sets of assumptions used previously in the analysis of bias. (d) is a new assumption which restricts the density ratio. This is important since if the density ratio can be too large, which means that there exists a region in which there are many samples from , but far fewer samples from , then will be large and unstable for too many . Therefore we use assumption (d) to bound the density ratio.

Under these assumptions, the variance of the divergence estimator can be bounded using the following theorem.

###### Theorem 3.

###### Proof.

(Outline) From (2), we have

(18) | |||||

Our proof uses some techniques from [4], which proved the convergence of variance of Kozachenko-Leonenko entropy estimator with for one dimensional distributions, and [31], which generalizes the result to arbitrary fixed dimension and , without restrictions on the boundedness of the support. The basic idea is that if one sample is replaced by another i.i.d sample, then it can be shown that the -NN distance will change only for a tiny fraction of the samples.

The first term in (18) is just the variance of the Kozachenko-Leonenko entropy estimator. Therefore we can follow a proof procedure similar to that of Theorem 2 in [31]. [31] analyzed a truncated Kozachenko-Leonenko entropy estimator, which means that is truncated by an upper bound . We prove the same convergence bound for the estimator without truncation.

For the second term in (18), the analysis becomes much harder, since the -NN distance may change for many more samples from , instead of only a tiny fraction of samples. For this term, we design a new method to obtain a high probability bound of the deviation of from its mean. The basic idea of our new method can be briefly stated as follows: Define two sets and , in which is a subset of such that for any , is among the nearest neighbors of in . Similarly, define to be a set such that for all , is among the nearest neighbors of . If we replace with , the kNN distance of will only change if or . With this observation, we give a high probability bound of the number of samples from that are in and respectively, and then bound the maximum difference of the estimated result caused by replacing with . Based on this bound, we can then bound the second term in (18) using the Efron-Stein inequality.

The detailed proof can be found in Appendix C. ∎
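For reference, the Efron-Stein inequality used in the last step can be stated in its standard form. The notation below ($Z$, $Z^{(i)}$, $W_i$, $\phi$) is generic and chosen for illustration; it is not taken from Appendix C.

```latex
% W_1, ..., W_n are independent; W_i' is an independent copy of W_i.
\begin{align*}
Z &= \phi(W_1,\ldots,W_n), \qquad
Z^{(i)} = \phi(W_1,\ldots,W_{i-1},W_i',W_{i+1},\ldots,W_n), \\
\operatorname{Var}(Z) &\le \frac{1}{2}\sum_{i=1}^{n}\mathbb{E}\left[\left(Z - Z^{(i)}\right)^2\right].
\end{align*}
```

This is why a bound on the change of the estimate when a single sample is replaced translates directly into a bound on the variance.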

## V Minimax Analysis

In this section, we derive the minimax lower bound of the mean square error of KL divergence estimation, which holds for all methods (not necessarily kNN based) that have no knowledge of the distributions $f$ and $g$. The minimax analysis also considers two cases, i.e. distributions whose densities are bounded away from zero, and those whose densities can approach zero.

For the first case, the following theorem holds.

###### Theorem 4.

###### Proof.

(Outline) The minimax lower bound of functional estimation can be bounded using Le Cam’s method [26]. For the proof of Theorem 4, we use some techniques from [29], which derived the minimax bound of entropy estimation for discrete distributions. The main idea is to construct a subset of distributions that satisfy Assumptions 1 and 3, and then conduct Poisson sampling. These operations can help us calculate the distance between two distributions in a more convenient way, which is important for using Le Cam’s method. Details of the proof can be found in Appendix D. ∎
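To illustrate the mechanism behind Le Cam's method, its simplest two-point form is shown below. This is a textbook statement in generic notation ($P_0$, $P_1$, $T$, $\delta$ are assumptions made for illustration); the proof in Appendix D may rely on a more refined version.

```latex
% P_0, P_1: two candidate distributions of the observed data.
% T(P): the functional to estimate; TV: total variation distance.
\begin{align*}
|T(P_0) - T(P_1)| \ge 2\delta
\;\Longrightarrow\;
\inf_{\hat T}\,\max_{j\in\{0,1\}} \mathbb{E}_{P_j}\left[\left(\hat T - T(P_j)\right)^2\right]
\;\ge\; \frac{\delta^2}{2}\left(1 - \mathrm{TV}(P_0, P_1)\right).
\end{align*}
```

A good lower bound therefore requires two hypotheses whose functional values are far apart while their data distributions are hard to distinguish.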

We remark that in Theorem 4, the support set and of pdfs and are unknown. If we assume that and are known, then with some boundary correction methods, such as the mirror reflection method proposed in [16], the convergence rate can be faster than that in (22). However, in Theorem 4, instead of using fixed support sets, contains distributions with a broad range of different support sets. These support sets are only restricted by Assumption 1 (c) and (d), which require that the surface area of all the elements in are bounded by and , and the diameters are bounded by . As a result, the minimax convergence rate becomes slower. This result indicates the inherent difficulty caused by the boundary effect for distributions with densities bounded away from zero.

For the second case, the corresponding result is shown in Theorem 5.

###### Theorem 5.

###### Proof.

(Outline) The minimax convergence rate of differential entropy estimation under similar assumptions was derived in [31]. We can extend the analysis to the minimax convergence rate of cross entropy estimation between $f$ and $g$. Combining the bounds for entropy and cross entropy, we can then obtain the minimax lower bound of the mean square error of KL divergence estimation. The detailed proof is shown in Appendix E. ∎

Comparing (23) with (19), as well as (25) with (20), we observe that the convergence rate of the upper bound of the mean square error of the kNN based KL divergence estimator nearly matches the minimax lower bound for both cases. These results indicate that the kNN method with fixed $k$ is nearly minimax rate optimal.

## VI Numerical Examples

In this section, we provide numerical experiments to illustrate the theoretical results in this paper. In the simulation, we plot the estimated bias and variance against the sample size. For simplicity of illustration, we assume that the sample sizes for the two distributions are equal, i.e. $M = N$. For each sample size, the bias and variance are estimated by repeating the simulation times and then calculating the sample mean and the sample variance over all these trials. For low dimensional distributions, the bias is relatively small, therefore it is necessary to conduct more trials than for high dimensional distributions. In the following experiments, we repeat times if , and times if . In all of the figures, we use log-log plots with base . In all of the trials, we fix .
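The repeated-trial procedure described above can be sketched generically. This is a minimal illustration, not the authors' simulation code: `estimator`, `sampler` and `monte_carlo_bias_variance` are hypothetical names, and the toy demo uses the sample mean of Gaussian draws rather than a divergence estimator.

```python
import random
import statistics


def monte_carlo_bias_variance(estimator, sampler, truth, trials, rng):
    """Estimate the bias and variance of `estimator` from independent trials.

    sampler(rng) draws one fresh data set; estimator(data) returns a float;
    truth is the known value of the estimated quantity.
    """
    estimates = [estimator(sampler(rng)) for _ in range(trials)]
    bias = statistics.fmean(estimates) - truth
    variance = statistics.variance(estimates)  # unbiased sample variance
    return bias, variance


if __name__ == "__main__":
    rng = random.Random(1)
    # Toy stand-in: estimate the mean of N(0, 1) from 50 draws per trial
    draw = lambda r: [r.gauss(0.0, 1.0) for _ in range(50)]
    print(monte_carlo_bias_variance(statistics.fmean, draw, 0.0, 2000, rng))
```

For a divergence estimator, `sampler` would return a pair of samples and `truth` would be the closed-form KL divergence of the chosen pair of distributions.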

Figure 1 shows the convergence rate of the kNN based KL divergence estimator for two uniform distributions with different supports. This case is an example that satisfies Assumption 1. In Figure 2, $f$ and $g$ are two Gaussian distributions with different means but equal variance. In Figure 3, $f$ and $g$ are two Gaussian distributions with the same mean but different variances. These two cases are examples that satisfy Assumption 2.
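Measuring the bias in such experiments requires the true divergence, which is available in closed form for these families. The sketch below assumes isotropic Gaussians and axis-aligned cubes anchored at the origin. These parametrizations and function names are illustrative assumptions, since the exact parameters behind the figures are not stated here.

```python
import math


def kl_isotropic_gaussians(mean_f, sigma_f, mean_g, sigma_g):
    """D(f||g) for f = N(mean_f, sigma_f^2 I) and g = N(mean_g, sigma_g^2 I)."""
    d = len(mean_f)
    sq_dist = sum((a - b) ** 2 for a, b in zip(mean_f, mean_g))
    per_dim = math.log(sigma_g / sigma_f) + sigma_f ** 2 / (2 * sigma_g ** 2) - 0.5
    return d * per_dim + sq_dist / (2 * sigma_g ** 2)


def kl_uniform_cubes(side_f, side_g, d):
    """D(f||g) for f uniform on [0, side_f]^d and g uniform on [0, side_g]^d."""
    if side_f > side_g:
        # The support of f is not contained in that of g, so D is infinite
        return math.inf
    return d * math.log(side_g / side_f)


if __name__ == "__main__":
    print(kl_isotropic_gaussians((0.0,), 1.0, (1.0,), 1.0))
    print(kl_uniform_cubes(1.0, 2.0, 2))
```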

For all of these distributions above, we compare the empirical convergence rates of the bias and variance with the theoretical prediction. The empirical convergence rates are calculated by finding the negative slope of the curves in these figures by linear regression, while the theoretical ones come from Theorems 1, 2 and 3 respectively. The results are shown in Table I. For the convenience of expression, we say that the theoretical convergence rate of bias or variance is , if it decays with either or for arbitrarily small , given the condition .
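The empirical rate computation described above, the negative slope of a log-log curve fitted by linear regression, can be sketched as follows. The function name `empirical_rate` is hypothetical, and the inputs are assumed to be positive (for the bias, its absolute value).

```python
import math


def empirical_rate(sample_sizes, errors):
    """Negative least-squares slope of log10(error) against log10(sample size).

    errors must be positive, e.g. |bias| or variance at each sample size.
    The slope does not depend on the base of the logarithm.
    """
    xs = [math.log10(n) for n in sample_sizes]
    ys = [math.log10(e) for e in errors]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return -slope


if __name__ == "__main__":
    sizes = [100, 1000, 10000, 100000]
    # Synthetic errors decaying exactly as N^(-1/2)
    print(empirical_rate(sizes, [3.0 * n ** -0.5 for n in sizes]))
```

On real simulation output the fitted rate is only an approximation, since logarithmic factors and finite-sample effects bend the curve.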

| | Bias, Empirical/Theoretical | | | Variance, Empirical/Theoretical | | |
|---|---|---|---|---|---|---|
| Fig.1 | 1.01/1.00 | 0.51/0.50 | 0.34/0.33 | 1.00/1.00 | 0.98/1.00 | 0.96/1.00 |
| Fig.2 | 0.68/0.67 | 0.47/0.50 | 0.36/0.40 | 0.94/– | 0.85/– | 0.81/– |
| Fig.3 | 0.90/0.67 | 0.68/0.50 | 0.45/0.40 | 0.99/1.00 | 1.00/1.00 | 0.99/1.00 |

In Table I, we observe that for the distribution used in Figure 1, the empirical convergence rates of both bias and variance agree well with the theoretical prediction, in which the theoretical bound of bias comes from Theorem 1, while the variance comes from Theorem 3.

For the distribution in Figure 2, the empirical convergence of bias matches the theoretical prediction from Theorem 2. For Gaussian distributions with different means, it can be shown that for any , there exists a constant such that Assumption 2 (b) holds. Therefore, according to Theorem 2, the convergence rate of bias is for arbitrarily small . Accordingly, in the second row of Table I, the theoretical rates of bias are 0.67, 0.50 and 0.40, respectively. Now we discuss the convergence rate of variance. Note that no theoretical result about the variance is available here, since the density ratio can reach infinity, thus Assumption 3 (d) is not satisfied, and Theorem 3 does not hold. We observe that the empirical convergence rate is slower than that in the other cases. Such a result may indicate that it is harder to estimate the KL divergence if the density ratio is unbounded.

For the distribution in Figure 3, the empirical and theoretical convergence rates of the variance match well, while the empirical rate of bias is faster than the theoretical prediction. Note that the bound we have derived holds universally for all distributions that satisfy the assumptions. For certain specific distributions, the convergence rate can be faster. In particular, there is a uniform bound on the Hessian of $f$ and $g$ in Assumption 2 (c). However, for Gaussian distributions, the Hessian is smaller where the pdf value is small. Therefore, the local non-uniformity is not as serious as in the worst case that satisfies the assumptions.

## VII Conclusion

In this paper, we have analyzed the convergence rates of the bias and variance of the kNN based KL divergence estimator proposed in [28]. For the bias, we have discussed two types of distributions depending on the main causes of the bias. In the first case, the distribution has bounded support, and the pdf is bounded away from zero. In the second case, the distribution is smooth everywhere and the pdf can be arbitrarily close to zero. For the variance, we have derived the convergence rate under a more general assumption. Furthermore, we have derived the minimax lower bound of KL divergence estimation. The bound holds for all possible estimators. We have shown that for both types of distributions, the kNN based KL divergence estimator is nearly minimax rate optimal. We have also used numerical experiments to illustrate that the practical performance of the kNN based KL divergence estimator is consistent with our theoretical analysis.

## Appendix A Proof of Theorem 1

According to (2),

(26) | |||||

in which

(27) | |||||

(28) | |||||

(29) |

and is the volume of unit ball. Here, we omit , since and are the same for all .

In the following, we provide details on how to bound . can then be bounded using similar method.

To begin with, we denote as the probability mass of under pdf , i.e. . We have the following lemma.

###### Lemma 1.

There exists a constant , such that, if , we have

###### Proof.

(30) | |||||

in which the first inequality uses Assumption 1 (f). ∎

From order statistics [9], , therefore

(31) |

Define

(32) | |||||

(33) |

in which , and . From (31), we observe that the bias is determined by the difference between the average pdf in and the pdf at its center . is the region that is relatively far from the boundary. For all , with high probability, . In this case, the bias is caused by the non-uniformity of the density. As the sample size increases, the effect of such non-uniformity converges to zero. is the region near the boundary, in which the probability that is not negligible, hence can deviate significantly from . Therefore, the bias in this region will not converge to zero. However, we let the size of converge to zero, so that the overall bound of the bias converges.

For sufficiently large ,

(34) | |||||

In step (a), we use Lemma 1, Assumption 1 (b) and Assumption 1 (e). In step (b), the first term uses the fact that for sufficiently large , will be sufficiently small, hence . The second term of step (b) comes from the Chernoff bound, which indicates that for all and sufficiently large ,

(35) | |||||

## Appendix B Proof of Theorem 2

In this section, we derive the bound of the bias for distributions that satisfy Assumption 2. These distributions are smooth everywhere and the densities can approach zero. We begin with the following lemmas, whose proofs can be found in Appendix B-A, B-B, and B-C, respectively.

###### Lemma 2.

There exist constants and such that and for all .

###### Lemma 3.

There exists a constant , such that

for sufficiently small , in which follows a distribution with pdf .

###### Lemma 4.

For sufficiently small ,

(39) |

Similar to the proof of Theorem 1, we decompose the bias as . Then

(40) |

Divide into two parts.

(41) | |||||

(42) |

in which , . will be determined later. is the constant in Lemma 1.

We first consider the region .

(43) | |||||

in which (a) comes from Lemma 1. For (b), note that according to (41), for , and for any . (c) uses Lemma 4.

For , note that according to Lemma 1,

(44) |

Based on this fact, if , we show the following two lemmas:

###### Lemma 5.

There exists a constant , such that