# Estimation of KL Divergence: Optimal Minimax Rate

###### Abstract

The problem of estimating the Kullback-Leibler divergence $D(P\|Q)$ between two unknown distributions $P$ and $Q$ is studied, under the assumption that the alphabet size $k$ of the distributions can scale to infinity. The estimation is based on $m$ independent samples drawn from $P$ and $n$ independent samples drawn from $Q$. It is first shown that there does not exist any consistent estimator that guarantees asymptotically small worst-case quadratic risk over the set of all pairs of distributions. A restricted set that contains pairs of distributions, with density ratio bounded by a function $f(k)$, is further considered. An augmented plug-in estimator is proposed, and its worst-case quadratic risk is shown to be within a constant factor of $\big(\frac{k}{m}+\frac{kf(k)}{n}\big)^2+\frac{\log^2 f(k)}{m}+\frac{f(k)}{n}$, if $m$ and $n$ exceed a constant factor of $k$ and $kf(k)$, respectively. Moreover, the minimax quadratic risk is characterized to be within a constant factor of $\big(\frac{k}{m\log m}+\frac{kf(k)}{n\log n}\big)^2+\frac{\log^2 f(k)}{m}+\frac{f(k)}{n}$, if $m$ and $n$ exceed a constant factor of $k/\log k$ and $kf(k)/\log k$, respectively. The lower bound on the minimax quadratic risk is characterized by employing a generalized Le Cam’s method. A minimax optimal estimator is then constructed by employing both the polynomial approximation and the plug-in approaches.

## I Introduction

As an important quantity in information theory, the Kullback-Leibler (KL) divergence between two distributions has a wide range of applications in various domains. For example, KL divergence can be used as a similarity measure in nonparametric outlier detection [2], multimedia classification [3, 4], text classification [5], and the two-sample problem [6]. In these contexts, it is often desired to estimate KL divergence efficiently based on available data samples. This paper studies such a problem.

Consider the estimation of the KL divergence between two probability distributions $P=(P_1,\dots,P_k)$ and $Q=(Q_1,\dots,Q_k)$ defined as

$$D(P\|Q)=\sum_{i=1}^{k}P_i\log\frac{P_i}{Q_i},\tag{1}$$

where $P$ and $Q$ are supported on a common alphabet set $[k]\triangleq\{1,\dots,k\}$, and $P$ is absolutely continuous with respect to $Q$, i.e., if $Q_i=0$, then $P_i=0$, for $i\in[k]$. We use $\mathcal{M}_k$ to denote the collection of all such pairs of distributions.
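For concreteness, the divergence in (1) can be computed directly when $P$ and $Q$ are known. The sketch below (purely illustrative; the function name is ours) follows the usual convention that a term with $P_i=0$ contributes zero:

```python
import math

def kl_divergence(P, Q):
    """D(P||Q) = sum_i P_i log(P_i / Q_i) for discrete distributions.

    Requires absolute continuity: Q_i = 0 must imply P_i = 0.
    Terms with P_i = 0 contribute zero (the 0 log 0 = 0 convention).
    """
    total = 0.0
    for p, q in zip(P, Q):
        if p > 0:
            if q == 0:
                raise ValueError("P is not absolutely continuous w.r.t. Q")
            total += p * math.log(p / q)
    return total

# Example on a 3-letter alphabet: D = 0.5*log 2 + 0.5*log 2 = log 2.
print(kl_divergence([0.5, 0.5, 0.0], [0.25, 0.25, 0.5]))
```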

Suppose $P$ and $Q$ are unknown, and that $m$ independent and identically distributed (i.i.d.) samples $S_1,\dots,S_m$ drawn from $P$ and $n$ i.i.d. samples $T_1,\dots,T_n$ drawn from $Q$ are available for estimation. The sufficient statistics for estimating $D(P\|Q)$ are the histograms of the samples $X=(X_1,\dots,X_k)$ and $Y=(Y_1,\dots,Y_k)$, where

$$X_i=\sum_{j=1}^{m}\mathbf{1}\{S_j=i\}\quad\text{and}\quad Y_i=\sum_{j=1}^{n}\mathbf{1}\{T_j=i\}\tag{2}$$

record the numbers of occurrences of $i\in[k]$ in the samples drawn from $P$ and $Q$, respectively. Then $X\sim\text{Multinomial}(m,P)$ and $Y\sim\text{Multinomial}(n,Q)$. An estimator of $D(P\|Q)$ is then a function of the histograms $X$ and $Y$, denoted by $\hat{D}(X,Y)$.
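The histograms in (2) are what an implementation would actually store; a small sketch (symbols assumed to be labeled $0,\dots,k-1$; the specific $P$ and $Q$ are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(0)
k, m, n = 5, 1000, 800
P = np.full(k, 1.0 / k)
Q = np.array([0.4, 0.3, 0.1, 0.1, 0.1])

S = rng.choice(k, size=m, p=P)  # m i.i.d. samples from P
T = rng.choice(k, size=n, p=Q)  # n i.i.d. samples from Q

# X_i and Y_i count occurrences of symbol i; X ~ Multinomial(m, P).
X = np.bincount(S, minlength=k)
Y = np.bincount(T, minlength=k)
print(X, Y)
```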

We adopt the following worst-case quadratic risk to measure the performance of estimators of the KL divergence:

$$R(\hat{D},k,m,n)\triangleq\sup_{(P,Q)\in\mathcal{M}_k}\mathbb{E}\Big[\big(\hat{D}(X,Y)-D(P\|Q)\big)^2\Big].\tag{3}$$

We further define the minimax quadratic risk as:

$$R^*(k,m,n)\triangleq\inf_{\hat{D}}R(\hat{D},k,m,n).\tag{4}$$

In this paper, we are interested in the large-alphabet regime with $k\to\infty$. Furthermore, the numbers $m$ and $n$ of samples are functions of $k$, which are allowed to scale with $k$ to infinity.

###### Definition 1.

A sequence of estimators $\hat{D}_k$, indexed by $k$, is said to be consistent under sample complexity $m(k)$ and $n(k)$ if

$$\lim_{k\to\infty}R\big(\hat{D}_k,k,m(k),n(k)\big)=0.\tag{5}$$

We are also interested in the following set:

$$\mathcal{M}_{k,f(k)}\triangleq\Big\{(P,Q)\in\mathcal{M}_k:\frac{P_i}{Q_i}\le f(k)\ \text{for all}\ i\in[k]\Big\},\tag{6}$$

which contains pairs of distributions with density ratio bounded by $f(k)$.

We define the worst-case quadratic risk over $\mathcal{M}_{k,f(k)}$ as

$$R\big(\hat{D},k,m,n,f(k)\big)\triangleq\sup_{(P,Q)\in\mathcal{M}_{k,f(k)}}\mathbb{E}\Big[\big(\hat{D}(X,Y)-D(P\|Q)\big)^2\Big],\tag{7}$$

and define the corresponding minimax quadratic risk as

$$R^*\big(k,m,n,f(k)\big)\triangleq\inf_{\hat{D}}R\big(\hat{D},k,m,n,f(k)\big).\tag{8}$$

### I-A Notations

We adopt the following notation to express the asymptotic scaling of quantities with $k$: $a\lesssim b$ represents that there exists a constant $c$ s.t. $a\le cb$; $a\gtrsim b$ represents that there exists a constant $c$ s.t. $a\ge cb$; $a\asymp b$ when $a\lesssim b$ and $a\gtrsim b$ hold simultaneously; $a\ll b$ represents that for all $c>0$, there exists $k_0$ s.t. for all $k>k_0$, $a\le cb$; and $a\gg b$ represents that for all $c>0$, there exists $k_0$ s.t. for all $k>k_0$, $a\ge cb$.

### I-B Comparison to Related Problems

Several estimators of KL divergence when and are continuous have been proposed and shown to be consistent. The estimator proposed in [7] is based on data-dependent partition on the densities, the estimator proposed in [8] is based on a K-nearest neighbor approach, and the estimator developed in [9] utilizes a kernel-based approach for estimating the density ratio. A more general problem of estimating the -divergence was studied in [10], where an estimator based on a weighted ensemble of plug-in estimators was proposed to trade bias with variance. All of these approaches exploit the smoothness of continuous densities or density ratios, which guarantees that samples falling into a certain neighborhood can be used to estimate the local density or density ratio accurately. However, such a smoothness property does not hold for discrete distributions, whose probabilities over adjacent point masses can vary significantly. In fact, an example is provided in [7] to show that the estimation of KL divergence can be difficult even for continuous distributions if the density has sharp dips.

Estimation of KL divergence when the distributions $P$ and $Q$ are discrete has been studied in [11, 12, 13] for the regime with fixed alphabet size $k$ and large sample sizes $m$ and $n$. Such a regime is very different from the large-alphabet regime in which we are interested, with $k$ scaling to infinity. Clearly, as $k$ increases, the sample sizes $m$ and $n$ must scale fast enough with respect to $k$ in order to guarantee consistent estimation.

In the large-alphabet regime, KL divergence estimation is closely related to entropy estimation with a large alphabet, recently studied in [14, 15, 16, 17, 18, 19, 20]. Compared to entropy estimation, KL divergence estimation has one more dimension of uncertainty, namely that regarding the distribution $Q$. Some distributions $Q$ can contain very small point masses that contribute significantly to the value of the divergence, but are difficult to estimate because samples at these point masses occur rarely. In particular, such distributions dominate the risk in (3), and make the construction of consistent estimators challenging.

### I-C Summary of Main Results

We summarize our main results in the following three theorems; more details are given respectively in Sections II, III and IV.

Our first result, based on Le Cam’s two-point method [21], is that there is no consistent estimator of the KL divergence over the distribution set $\mathcal{M}_k$.

###### Theorem 1.

For any $k\ge2$ and any sample sizes $m$ and $n$, the minimax risk $R^*(k,m,n)$ is infinite. Therefore, there does not exist any consistent estimator of KL divergence over the set $\mathcal{M}_k$.

The intuition behind this result is that the set $\mathcal{M}_k$ contains distribution pairs $(P,Q)$ in which $Q$ has arbitrarily small components that contribute significantly to the KL divergence but require an arbitrarily large number of samples to estimate accurately. However, in practical applications, it is reasonable to assume that the ratio of $P_i$ to $Q_i$ is bounded, so that the KL divergence is bounded. Thus, we further focus on the set $\mathcal{M}_{k,f(k)}$ given in (6) that contains distribution pairs with their density ratio bounded by $f(k)$.

###### Remark 1.

Consider the Rényi divergence of order $\alpha$ between distributions $P$ and $Q$, which is defined as

$$D_{\alpha}(P\|Q)\triangleq\frac{1}{\alpha-1}\log\sum_{i=1}^{k}P_i^{\alpha}Q_i^{1-\alpha}.\tag{9}$$

Then the KL divergence is equivalent to the Rényi divergence of order one, i.e., $D(P\|Q)=\lim_{\alpha\to1}D_{\alpha}(P\|Q)$. Moreover, the bounded density ratio condition is equivalent to the following upper bound on the Rényi divergence of order infinity,

$$D_{\infty}(P\|Q)=\log\max_{i\in[k]}\frac{P_i}{Q_i}\le\log f(k).\tag{10}$$

It is shown in [22] that $D_{\alpha}(P\|Q)$ is non-decreasing in $\alpha$, i.e., $D_{\alpha}(P\|Q)\le D_{\beta}(P\|Q)$ for $\alpha\le\beta$. This implies that a bound on the Rényi divergence of any order $\alpha>1$ can be used to bound the KL divergence. For the sake of simplicity and convenience, we adopt the density ratio bound in this paper.
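The monotonicity of the Rényi divergence in its order, and its approach to the order-infinity value $\log\max_i P_i/Q_i$, can be checked numerically (a sketch with arbitrary example distributions):

```python
import numpy as np

def renyi_divergence(P, Q, alpha):
    """Renyi divergence of order alpha (alpha != 1) between discrete P, Q."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    return np.log(np.sum(P**alpha * Q**(1.0 - alpha))) / (alpha - 1.0)

P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.3, 0.3, 0.4])

orders = [0.5, 2.0, 4.0, 16.0, 64.0]
vals = [renyi_divergence(P, Q, a) for a in orders]
d_inf = np.log(np.max(P / Q))  # Renyi divergence of order infinity

# Non-decreasing in the order, and bounded above by the order-infinity value.
print(vals, d_inf)
```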

We construct an augmented plug-in estimator $\hat{D}_{\text{A-plug-in}}$, defined in (16), and characterize its worst-case quadratic risk over the set $\mathcal{M}_{k,f(k)}$ in the following theorem.

###### Theorem 2.

For any $k$, $m\gtrsim k$, and $n\gtrsim kf(k)$, the worst-case quadratic risk of the augmented plug-in estimator $\hat{D}_{\text{A-plug-in}}$ defined in (16) over the set $\mathcal{M}_{k,f(k)}$ satisfies

$$R\big(\hat{D}_{\text{A-plug-in}},k,m,n,f(k)\big)\asymp\bigg(\frac{k}{m}+\frac{kf(k)}{n}\bigg)^2+\frac{\log^2 f(k)}{m}+\frac{f(k)}{n}.\tag{11}$$

The upper bound is derived by evaluating the bias and variance of the estimator separately. In order to prove the term $\big(\frac{k}{m}+\frac{kf(k)}{n}\big)^2$ in the lower bound, we analyze the bias of the estimator and construct different pairs of “worst-case” distributions, respectively, for the cases where the bias caused by insufficient samples from $P$ or the bias caused by insufficient samples from $Q$ dominates. The terms $\frac{\log^2 f(k)}{m}$ and $\frac{f(k)}{n}$ in the lower bound are due to the variance and follow from the minimax lower bound given by Le Cam’s two-point method with a judiciously chosen pair of distributions.

Theorem 2 implies that the augmented plug-in estimator is consistent over $\mathcal{M}_{k,f(k)}$ if and only if

$$m\gg k\quad\text{and}\quad n\gg kf(k).\tag{12}$$

Thus, the numbers of samples $m$ and $n$ should be larger than $k$ and $kf(k)$, respectively, for the augmented plug-in estimator to be consistent. This naturally inspires the question of whether the plug-in estimator achieves the minimax risk, and if not, what estimator is minimax optimal and what is the corresponding minimax risk.

We show that the augmented plug-in estimator is not minimax optimal, and that minimax optimality can be achieved by an estimator that employs both the polynomial approximation and plug-in approaches. The following theorem characterizes the minimax risk.

###### Theorem 3.

If $m\gtrsim\frac{k}{\log k}$, $n\gtrsim\frac{kf(k)}{\log k}$, $\log m\le C\log k$, and $\log n\le C\log k$, where $C$ is any positive constant, then the minimax risk satisfies

$$R^*\big(k,m,n,f(k)\big)\asymp\bigg(\frac{k}{m\log m}+\frac{kf(k)}{n\log n}\bigg)^2+\frac{\log^2 f(k)}{m}+\frac{f(k)}{n}.\tag{13}$$

The key idea in the construction of the minimax optimal estimator is the application of a polynomial approximation to reduce the bias in the regime where the bias of the plug-in estimator is large. Compared to entropy estimation [17, 19], the challenge here is that the KL divergence is a function of two distributions, for which a joint polynomial approximation is difficult to derive. We solve this problem by employing separate polynomial approximations for functions involving $P$ and $Q$, as well as judiciously using the density ratio constraint to bound the estimation error. The proof of the lower bound on the minimax risk is based on a generalized Le Cam’s method involving two composite hypotheses, as in the case of entropy estimation [17]. But the challenge here that requires special technical treatment is the construction of prior distributions for $(P,Q)$ that satisfy the bounded density ratio constraint.

We note that the first term in (13) captures the squared bias, and the remaining terms correspond to the variance. If we compare the worst-case quadratic risk of the augmented plug-in estimator in (11) with the minimax risk in (13), there is a logarithmic-factor rate improvement in the bias.

Theorem 3 directly implies that in order to estimate the KL divergence over the set $\mathcal{M}_{k,f(k)}$ with vanishing mean squared error, the sufficient and necessary conditions on the sample complexity are given by

$$m\gg\frac{k}{\log k}\quad\text{and}\quad n\gg\frac{kf(k)}{\log k}.\tag{14}$$

The comparison of (12) with (14) shows that the augmented plug-in estimator is strictly sub-optimal.

While our results are proved under a ratio upper bound constraint $f(k)$, under certain upper bounds on $m$ and $n$, and by taking a worst case over $\mathcal{M}_{k,f(k)}$, we make the following (non-rigorous) observation that is implied by our results, by disregarding the aforementioned assumptions to some extent. If $Q$ is known to be the uniform distribution (and hence $f(k)=k$), then $D(P\|Q)=\log k-H(P)$, and the KL divergence estimation problem (without taking the worst case over $Q$) reduces to the problem of estimating the entropy $H(P)$ of distribution $P$. More specifically, letting $f(k)=k$ and $n\to\infty$ in (13) directly yields $\big(\frac{k}{m\log m}\big)^2+\frac{\log^2 k}{m}$, which is the same as the minimax risk for entropy estimation [17, 19].
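The identity behind this reduction, $D(P\|\mathrm{Unif})=\log k-H(P)$, is easy to sanity-check numerically (an illustrative check, not part of the formal argument):

```python
import numpy as np

def entropy(P):
    P = np.asarray(P, float)
    nz = P[P > 0]
    return -np.sum(nz * np.log(nz))

def kl(P, Q):
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    nz = P > 0
    return np.sum(P[nz] * np.log(P[nz] / Q[nz]))

k = 6
P = np.array([0.3, 0.25, 0.2, 0.15, 0.05, 0.05])
U = np.full(k, 1.0 / k)

# KL divergence to the uniform distribution is log k minus the entropy, so
# divergence estimation with known uniform Q is entropy estimation in disguise.
print(kl(P, U), np.log(k) - entropy(P))
```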

We note that after our initial conference submission was accepted by ISIT 2016 and during our preparation of this full version of the work, an independent study of the same problem of KL divergence estimation was posted on arXiv [23].

A comparison between our results and the results in [23] shows that the constraints for the minimax rate to hold in Theorem 3 are weaker than those in [23]. On the other hand, our estimator requires the knowledge of $k$ to determine whether to use the polynomial approximation or the plug-in approach, whereas the estimator proposed in [23] based on the same idea does not require such knowledge. Our estimator also requires the knowledge of $f(k)$ to keep the estimate between $0$ and $\log f(k)$ in order to simplify the theoretical analysis, but our experiments demonstrate that desirable performance can be achieved even without such a step (and correspondingly without exploiting $f(k)$). However, in practice, the knowledge of $k$ is typically useful for determining the number of samples that should be taken in order to achieve a certain level of estimation accuracy. In situations without the knowledge of $k$, the estimator in [18] has the advantage of being independent of $k$ and performs well adaptively.

## II No Consistent Estimator over $\mathcal{M}_k$

Theorem 1 states that the minimax risk over the set $\mathcal{M}_k$ is unbounded for arbitrary alphabet size $k$ and sample sizes $m$ and $n$, which implies that there is no consistent estimator of the KL divergence over $\mathcal{M}_k$.

This result follows from Le Cam’s two-point method [21]: if two pairs of distributions $(P,Q)$ and $(P',Q')$ are sufficiently close such that it is impossible to reliably distinguish between them using $m$ samples from $P$ (or $P'$) and $n$ samples from $Q$ (or $Q'$) with error probability less than some constant, then any estimator suffers a quadratic risk proportional to the squared difference between the divergence values, $\big(D(P\|Q)-D(P'\|Q')\big)^2$.

We next give examples to illustrate how the distributions used in Le Cam’s two-point method can be constructed.

For the case $k=2$, we let $P=(\tfrac12,\tfrac12)$, $Q=(1-\delta,\delta)$, and $Q'=(1-\tfrac{\delta}{M},\tfrac{\delta}{M})$, where $M>1$. For any $n$, we choose $1/\delta$ sufficiently large such that $n\delta\le1$, so that with constant probability no sample drawn from $Q$ or $Q'$ takes the second symbol. Thus, the error probability for distinguishing $Q$ and $Q'$ with $n$ samples is greater than a constant. However, $D(P\|Q')-D(P\|Q)=\frac12\log M+\frac12\log\frac{1-\delta}{1-\delta/M}\ge\frac12\log M-\delta$. Hence, the minimax risk, which is lower bounded by the difference of the above divergences, can be made arbitrarily large by letting $M\to\infty$. This example demonstrates that two pairs of distributions $(P,Q)$ and $(P,Q')$ can be very close so that the data samples are almost indistinguishable, but the KL divergences $D(P\|Q)$ and $D(P\|Q')$ can still be far apart.
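A two-point pair of this kind is easy to explore numerically. The sketch below uses an illustrative instantiation (our own choice of $P$, $Q$, $Q'$, $\delta$, and $M$): the divergence gap grows without bound while the two sample distributions remain statistically indistinguishable.

```python
import math

def kl(P, Q):
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

n = 1000
delta = 1e-6          # delta << 1/n: the rare symbol is essentially unseen
P = [0.5, 0.5]

for M in [10, 100, 1000]:
    Q = [1 - delta, delta]
    Qp = [1 - delta / M, delta / M]
    gap = kl(P, Qp) - kl(P, Q)        # approximately 0.5 * log(M)
    p_rare = 1 - (1 - delta) ** n     # chance n samples from Q show symbol 1
    print(M, gap, p_rare)

# The divergence gap grows like 0.5*log(M) without bound, while samples from
# Q and Q' remain indistinguishable with probability about 1 - n*delta.
```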

For the case with $k>2$, the distributions can be constructed based on the same idea, by splitting one of the masses in the above $P$, $Q$, and $Q'$ uniformly over the remaining $k-1$ bins. Thus, it is impossible to estimate the KL divergence accurately over the set $\mathcal{M}_k$ under the minimax setting.

## III Augmented Plug-in Estimator over $\mathcal{M}_{k,f(k)}$

Since there does not exist any consistent estimator of KL divergence over the set $\mathcal{M}_k$, we study KL divergence estimators over the restricted set $\mathcal{M}_{k,f(k)}$.

The “plug-in” approach is a natural way to estimate the KL divergence, namely, first estimate the distributions and then substitute these estimates into the divergence function. This leads to the following plug-in estimator, i.e., the empirical divergence

$$\hat{D}_{\text{plug-in}}(X,Y)\triangleq D(\hat{P}\|\hat{Q})=\sum_{i=1}^{k}\hat{P}_i\log\frac{\hat{P}_i}{\hat{Q}_i},\tag{15}$$

where $\hat{P}$ and $\hat{Q}$ denote the empirical distributions with $\hat{P}_i=\frac{X_i}{m}$ and $\hat{Q}_i=\frac{Y_i}{n}$, respectively, for $i\in[k]$.

Unlike the entropy estimation problem, where the plug-in estimator is asymptotically efficient in the “fixed $k$, large $m$” regime, the direct plug-in estimator (15) of the KL divergence has an infinite bias. This is because of the non-zero probability of the event that $\hat{Q}_i=0$ and $\hat{P}_i>0$ for some $i$, which leads to an infinite $\mathbb{E}\big[\hat{D}_{\text{plug-in}}(X,Y)\big]$.

We can get around the above issue associated with the direct plug-in estimator as follows. We first add a fraction of a sample to each mass point of $Y$, and then normalize and take $\hat{Q}'_i=\frac{Y_i+c}{n+kc}$ as an estimate of $Q_i$, where $c$ is a constant. We refer to $\hat{Q}'_i$ as the add-constant estimator [24] of $Q_i$. Clearly, $\hat{Q}'_i$ is non-zero for all $i\in[k]$. We therefore propose the following “augmented plug-in” estimator based on $\hat{Q}'$:

$$\hat{D}_{\text{A-plug-in}}(X,Y)\triangleq\sum_{i=1}^{k}\hat{P}_i\log\frac{\hat{P}_i}{\hat{Q}'_i}.\tag{16}$$
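A minimal implementation of the augmented plug-in estimator in (16) can be sketched as follows (the add-constant is set to $c=1$ here purely as an illustrative choice):

```python
import numpy as np

def augmented_plugin_kl(X, Y, m, n, c=1.0):
    """Augmented plug-in estimator: empirical P-hat against the add-c
    smoothed Q-hat, so that every log ratio is finite."""
    k = len(X)
    P_hat = X / m                   # empirical distribution from P-samples
    Q_hat = (Y + c) / (n + c * k)   # add-constant estimator of Q
    nz = P_hat > 0
    return np.sum(P_hat[nz] * np.log(P_hat[nz] / Q_hat[nz]))

# Tiny worked example: X = (3, 1), Y = (0, 4), m = n = 4.
# P_hat = (0.75, 0.25), Q_hat = (1/6, 5/6); the Y_1 = 0 bin causes no trouble.
print(augmented_plugin_kl(np.array([3, 1]), np.array([0, 4]), 4, 4))
```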

Theorem 2 characterizes the worst-case quadratic risk of the augmented plug-in estimator over $\mathcal{M}_{k,f(k)}$. The proof of Theorem 2 involves the following two propositions, which provide upper and lower bounds on $R\big(\hat{D}_{\text{A-plug-in}},k,m,n,f(k)\big)$, respectively.

###### Proposition 1.

For all $k$, $m$, $n$, and $f(k)$,

$$R\big(\hat{D}_{\text{A-plug-in}},k,m,n,f(k)\big)\lesssim\bigg(\frac{k}{m}+\frac{kf(k)}{n}\bigg)^2+\frac{\log^2 f(k)}{m}+\frac{f(k)}{n}.\tag{17}$$

###### Outline of Proof.

The proof consists of separately bounding the bias and variance of the augmented plug-in estimator. The details are provided in Appendix A. ∎

It can be seen that in the risk bound (17), the first term captures the squared bias, and the remaining terms correspond to the variance.

###### Proposition 2.

For all $k\ge2$, $f(k)\ge2$, $m\gtrsim k$, and $n\gtrsim kf(k)$,

$$R\big(\hat{D}_{\text{A-plug-in}},k,m,n,f(k)\big)\gtrsim\bigg(\frac{k}{m}+\frac{kf(k)}{n}\bigg)^2+\frac{\log^2 f(k)}{m}+\frac{f(k)}{n}.\tag{18}$$

###### Outline of Proof.

We provide the basic idea of the proof here with the details provided in Appendix B.

We first derive the terms corresponding to the squared bias in the lower bound by choosing two different pairs of worst-case distributions. Note that the bias of the augmented plug-in estimator can be decomposed into: (1) the bias due to estimating $\sum_{i=1}^{k}P_i\log P_i$; and (2) the bias due to estimating $-\sum_{i=1}^{k}P_i\log Q_i$. Since $x\log x$ is a convex function, we can show that the first bias term is always positive. But the second bias term can be negative, so that the two bias terms may cancel out partially or even fully. Thus, to derive the minimax lower bound, we first determine which bias term dominates, and then construct a pair of distributions such that the dominant bias term is either lower bounded by some positive terms or upper bounded by negative terms.

Case I: If $n\ge cmf(k)$ for some constant $c$, which implies that the number of samples drawn from $P$ is relatively smaller than the number of samples drawn from $Q$, the first bias term dominates. Setting $Q$ to be uniform and choosing $P$ appropriately, it can be shown that the bias is lower bounded by $\frac{k}{m}$ in the order sense.

Case II: If $n<cmf(k)$, which implies that the number of samples drawn from $P$ is relatively larger than the number of samples drawn from $Q$, the second bias term dominates. Choosing $P$ and $Q$ appropriately, it can be shown that the bias is negative with magnitude lower bounded by $\frac{kf(k)}{n}$ in the order sense.

The proof of the terms corresponding to the variance in the lower bound can be done using the minimax lower bound given by Le Cam’s two-point method. ∎

## IV Minimax Quadratic Risk over $\mathcal{M}_{k,f(k)}$

Our third main result, Theorem 3, characterizes the minimax quadratic risk (within a constant factor) of estimating the KL divergence over $\mathcal{M}_{k,f(k)}$. In this section, we describe the ideas and central arguments used to show this theorem, with detailed proofs relegated to the appendices.

### IV-A Poisson Sampling

The sufficient statistics for estimating $D(P\|Q)$ are the histograms of the samples $X$ and $Y$, where $X\sim\text{Multinomial}(m,P)$ and $Y\sim\text{Multinomial}(n,Q)$. However, the components of each histogram are not independent across different bins, which makes the analysis difficult. In this subsection, we introduce the Poisson sampling technique to handle the dependency of the multinomial distribution across different bins, as in [17] for entropy estimation. Such a technique is used in our proofs to develop the lower and upper bounds on the minimax risk in Sections IV-B and IV-C.

In Poisson sampling, we replace the deterministic sample sizes $m$ and $n$ with Poisson random variables $m'$ with mean $m$ and $n'$ with mean $n$, respectively. Under this model, we draw $m'$ and $n'$ i.i.d. samples from $P$ and $Q$, respectively. The sufficient statistics $X_i\sim\mathrm{Poi}(mP_i)$ and $Y_i\sim\mathrm{Poi}(nQ_i)$ are then independent across different bins, which significantly simplifies the analysis.
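Under Poisson sampling the per-bin counts can be generated directly, which is how the model is typically simulated (a sketch; the distributions are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(2)
k, m, n = 50, 2000, 3000
P = np.full(k, 1.0 / k)
Q = rng.dirichlet(np.ones(k))

# Equivalent view of Poisson sampling: rather than drawing Poi(m) samples
# and binning them, draw each bin count independently as X_i ~ Poi(m P_i).
X = rng.poisson(m * P)
Y = rng.poisson(n * Q)

# The effective sample sizes concentrate around their means m and n.
print(X.sum(), Y.sum())
```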

Analogous to the minimax risk (8), we define its counterpart under the Poisson sampling model as

$$R^*_P\big(k,m,n,f(k)\big)\triangleq\inf_{\hat{D}}\sup_{(P,Q)\in\mathcal{M}_{k,f(k)}}\mathbb{E}\Big[\big(\hat{D}(X,Y)-D(P\|Q)\big)^2\Big],\tag{19}$$

where the expectation is taken over $X_i\sim\mathrm{Poi}(mP_i)$ and $Y_i\sim\mathrm{Poi}(nQ_i)$ for $i\in[k]$. Since the Poisson sample sizes are concentrated near their means $m$ and $n$ with high probability, the minimax risk under Poisson sampling is close to that with fixed sample sizes, as stated in the following lemma.

###### Lemma 1.

There exists a constant $c>0$ such that

$$R^*_P\big(k,2m,2n,f(k)\big)-c\big(e^{-m/4}+e^{-n/4}\big)\log^2 f(k)\ \le\ R^*\big(k,m,n,f(k)\big)\ \le\ 2R^*_P\big(k,\tfrac{m}{2},\tfrac{n}{2},f(k)\big).\tag{20}$$

###### Proof.

See Appendix C. ∎

Thus, in order to show Theorem 3, it suffices to bound the Poisson risk $R^*_P\big(k,m,n,f(k)\big)$. In Section IV-B, a lower bound on the minimax risk with deterministic sample sizes is derived, and in Section IV-C, an upper bound on the minimax risk with Poisson sampling is derived, which further yields an upper bound on the minimax risk with deterministic sample sizes. It can be shown that the upper and lower bounds match each other (up to a constant factor).

### IV-B Minimax Lower Bound

In this subsection, we develop the following lower bound on the minimax risk for the estimation of the KL divergence over the set $\mathcal{M}_{k,f(k)}$.

###### Proposition 3.

If $m\gtrsim\frac{k}{\log k}$, $n\gtrsim\frac{kf(k)}{\log k}$, and $\log m\le C\log k$ and $\log n\le C\log k$ for some constant $C$, then

$$R^*\big(k,m,n,f(k)\big)\gtrsim\bigg(\frac{k}{m\log m}+\frac{kf(k)}{n\log n}\bigg)^2+\frac{\log^2 f(k)}{m}+\frac{f(k)}{n}.\tag{21}$$

###### Outline of Proof.

We describe the main idea in the development of the lower bound, with the detailed proof provided in Appendix D.

To show Proposition 3, it suffices to show that the minimax risk is lower bounded separately by each individual term in (21) in the order sense. The proof for the last two terms requires Le Cam’s two-point method, and the proof for the first term requires a more general method, as we outline in the following.

Le Cam’s two-point method: The last two terms in the lower bound correspond to the variance of the estimator.

The bound $R^*\big(k,m,n,f(k)\big)\gtrsim\frac{\log^2 f(k)}{m}$ can be shown by setting

(22) | ||||

(23) | ||||

(24) |

with the specific parameter choices given in Appendix D.

The bound $R^*\big(k,m,n,f(k)\big)\gtrsim\frac{f(k)}{n}$ can be shown by choosing

(25) | ||||

(26) | ||||

(27) |

with the specific parameter choices again given in Appendix D.

Generalized Le Cam’s method: In order to show $R^*\gtrsim\big(\frac{k}{m\log m}+\frac{kf(k)}{n\log n}\big)^2$, it suffices to show that $R^*\gtrsim\big(\frac{k}{m\log m}\big)^2$ and $R^*\gtrsim\big(\frac{kf(k)}{n\log n}\big)^2$. These two lower bounds can be shown by applying a generalized Le Cam’s method, which involves the following two composite hypotheses [21]: $H_0:D(P\|Q)\le t$ versus $H_1:D(P\|Q)\ge t+d$.

Le Cam’s two-point approach is a special case of this generalized method. If no test can distinguish $H_0$ and $H_1$ reliably, then we obtain a lower bound on the quadratic risk of order $d^2$. Furthermore, the optimal probability of error for composite hypothesis testing is equivalent to the Bayesian risk under the least favorable priors. Our goal here is to construct two prior distributions on $(P,Q)$ (one for each hypothesis), such that the two corresponding divergence values are separated (by $d$), but the error probability of distinguishing between the two hypotheses is large. However, it is difficult to design joint prior distributions on $(P,Q)$ that satisfy all of the above desired properties. In order to simplify this procedure, we set one of the distributions $P$ and $Q$ to be known. Then the minimax risk when both $P$ and $Q$ are unknown is lower bounded by the minimax risk with only either $P$ or $Q$ being unknown. In this way, we only need to design priors on one distribution, which can be shown to be sufficient for the proof of the lower bound.

As shown in [17], choosing priors via two random variables whose moments match up to a certain degree ensures the impossibility of reliably testing between the two hypotheses in the minimax entropy estimation problem. The minimax lower bound is then obtained by maximizing the expected separation subject to the moment matching condition. For our KL divergence estimation problem, this approach also yields the optimal minimax lower bound, but the challenge here that requires special technical treatment is the construction of prior distributions for $(P,Q)$ that satisfy the bounded density ratio constraint.

In order to show $R^*\gtrsim\big(\frac{k}{m\log m}\big)^2$, we set $Q$ to be the uniform distribution and assume it is known. Therefore, the estimation of $D(P\|Q)$ reduces to the estimation of $\sum_{i=1}^{k}P_i\log P_i$, which is the negative entropy of $P$. Following steps similar to those in [17], we can obtain the desired result.

In order to show $R^*\gtrsim\big(\frac{kf(k)}{n\log n}\big)^2$, we set

$$P=\Big(\frac{1}{k},\dots,\frac{1}{k}\Big),\tag{28}$$

and assume $P$ is known. Therefore, the estimation of $D(P\|Q)$ reduces to the estimation of $-\sum_{i=1}^{k}P_i\log Q_i$. We then properly design priors on $Q$ and apply the generalized Le Cam’s method to obtain the desired result. ∎

### IV-C Minimax Upper Bound via Optimal Estimator

Comparing the lower bound in Proposition 3 with the upper bound in Proposition 1 on the risk of the augmented plug-in estimator, it is clear that there is a gap of a logarithmic factor in the bias terms, which implies that the augmented plug-in estimator is not minimax optimal. A promising approach to closing this gap is to design an improved estimator. Entropy estimation [17, 19] suggests incorporating a polynomial approximation into the estimator in order to reduce the bias at the price of an increased variance. In this subsection, we construct an estimator using this approach, and characterize an upper bound on the minimax risk in Proposition 4.

The KL divergence can be written as

$$D(P\|Q)=\sum_{i=1}^{k}P_i\log P_i-\sum_{i=1}^{k}P_i\log Q_i.\tag{29}$$

The first term equals the negative entropy of $P$, and the minimax optimal entropy estimator in [17] can be applied to estimate it. The major challenge in estimating $D(P\|Q)$ arises due to the second term. We overcome the challenge by using a polynomial approximation to reduce the bias when $Q_i$ is small. Under the Poisson sampling model, unbiased estimators can be constructed for any polynomial of $P_i$ and $Q_i$. Thus, if we approximate $P_i\log Q_i$ by polynomials, and then construct unbiased estimators for the polynomials, the bias of estimating $P_i\log Q_i$ is reduced to the error in the approximation of $P_i\log Q_i$ using polynomials.

A natural idea is to construct a polynomial approximation for $P_i\log Q_i$ in two dimensions, exploiting the fact that $\frac{P_i}{Q_i}$ is bounded by $f(k)$ in the order sense. The authors of [23] also discuss the idea of a two-dimensional polynomial approximation. However, it is challenging to find the explicit form of the two-dimensional polynomial approximation for estimating the KL divergence, as shown in [25]. Nevertheless, for some problems it is still worth exploring the two-dimensional approximation directly, as shown in [26], where no one-dimensional approximation can achieve the minimax rate.

On the other hand, a one-dimensional polynomial approximation of $\log Q_i$ also appears challenging to develop. First of all, the function $\log x$ on the interval $(0,1]$ is not bounded due to the singularity at $x=0$. Hence, the approximation of $\log Q_i$ when $Q_i$ is near zero is inaccurate. Secondly, such an approach implicitly ignores the fact that $P_i\le f(k)Q_i$, which implies that when $Q_i$ is small, the value of $P_i$ should also be small.

Another approach is to rewrite the function $P_i\log Q_i$ as $\frac{P_i}{Q_i}\big(Q_i\log Q_i\big)$, and then estimate $Q_i\log Q_i$ and $\frac{P_i}{Q_i}$ separately. Although the function $Q_i\log Q_i$ can be approximated using polynomial approximation and then estimated accurately (see [27, Section 7.5.4] and [17]), it is difficult to find a good estimator for the ratio $\frac{P_i}{Q_i}$.

Motivated by those unsuccessful approaches, we design our estimator as follows. We rewrite $P_i\log Q_i$ as $\frac{P_i}{Q_i}\big(Q_i\log Q_i\big)$. When $Q_i$ is small, we construct a polynomial approximation $q_L(x)$ for $x\log x$ which does not contain a zero-degree term. Then, $\frac{q_L(x)}{x}$ is also a polynomial, which can be used to approximate $\log x$. Thus, an unbiased estimator for $P_i\frac{q_L(Q_i)}{Q_i}$ can be constructed. Note that the error in the approximation of $\log x$ using $\frac{q_L(x)}{x}$ is not bounded, which implies that the bias of using the unbiased estimator of $\frac{q_L(Q_i)}{Q_i}$ to estimate $\log Q_i$ is not bounded. However, we can show that the bias of estimating $P_i\log Q_i$ is bounded, which is due to the density ratio constraint $P_i\le f(k)Q_i$. The fact that when $Q_i$ is small, $P_i$ is also small helps to reduce the bias. In the following, we introduce how we construct our estimator in detail.

By Lemma 1, we apply Poisson sampling to simplify the analysis. We first draw $m'\sim\mathrm{Poi}(m)$ and $m''\sim\mathrm{Poi}(m)$ independently, and then draw $m'$ and $m''$ i.i.d. samples from distribution $P$, where we use $X'=(X'_1,\dots,X'_k)$ and $X''=(X''_1,\dots,X''_k)$ to denote the histograms of the $m'$ samples and the $m''$ samples, respectively. We then use these samples to estimate $\sum_{i=1}^{k}P_i\log P_i$ following the entropy estimator proposed in [17]. Next, we draw $n'\sim\mathrm{Poi}(n)$ and $n''\sim\mathrm{Poi}(n)$ independently. We then draw $n'$ and $n''$ i.i.d. samples from distribution $Q$, where we use $Y'=(Y'_1,\dots,Y'_k)$ and $Y''=(Y''_1,\dots,Y''_k)$ to denote the histograms of the $n'$ samples and the $n''$ samples, respectively. We note that $X'_i,X''_i\sim\mathrm{Poi}(mP_i)$ and $Y'_i,Y''_i\sim\mathrm{Poi}(nQ_i)$.

We then focus on the estimation of $\sum_{i=1}^{k}P_i\log Q_i$. If $Y'_i\le c_2\log k$, where $c_2$ is a constant to be determined later, we construct a polynomial approximation for $\log Q_i$ and further estimate the polynomial function. And if $Y'_i>c_2\log k$, we use the bias-corrected augmented plug-in estimator. We use $Y'_i$ to determine whether to use the polynomial estimator or the plug-in estimator, and use $Y''_i$ to estimate $\log Q_i$. Intuitively, if $Y'_i$ is large, then $Q_i$ is more likely to be large, and vice versa. Based on the generation scheme, $Y'_i$ and $Y''_i$ are independent. Such independence significantly simplifies the analysis.

We let $L=c_1\log k$, where $c_1$ is a constant to be determined later, and denote the degree-$L$ best polynomial approximation of the function $x\log x$ over the interval $[0,1]$ as $\sum_{j=0}^{L}a_jx^j$. We further scale the interval $[0,1]$ to $[0,\alpha]$, where $\alpha\triangleq\frac{2c_2\log k}{n}$. Then we have the best polynomial approximation of the function $x\log x$ over the interval $[0,\alpha]$ as follows:

$$p_L(x)\triangleq\sum_{j=0}^{L}a_j\alpha^{1-j}x^j+x\log\alpha.\tag{30}$$

Following the result in [27, Section 7.5.4], the error in approximating $x\log x$ by $p_L(x)$ over the interval $[0,\alpha]$ can be upper bounded as follows:

$$\sup_{x\in[0,\alpha]}\big|p_L(x)-x\log x\big|\lesssim\frac{\alpha}{L^2}\asymp\frac{1}{n\log k}.\tag{31}$$

Therefore, we have $|p_L(0)|\lesssim\frac{1}{n\log k}$, which implies that the zero-degree term $a_0\alpha$ in $p_L(x)$ satisfies:

$$|a_0\alpha|\lesssim\frac{1}{n\log k}.\tag{32}$$

Now, subtracting the zero-degree term from $p_L(x)$ in (30) yields the following polynomial:

$$q_L(x)\triangleq p_L(x)-a_0\alpha=\sum_{j=1}^{L}a_j\alpha^{1-j}x^j+x\log\alpha.\tag{33}$$

The error in approximating $x\log x$ by $q_L(x)$ over the interval $[0,\alpha]$ can also be upper bounded in the order of $\frac{1}{n\log k}$, because

$$\big|q_L(x)-x\log x\big|\le\big|p_L(x)-x\log x\big|+|a_0\alpha|\lesssim\frac{1}{n\log k}.\tag{34}$$
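The effect of removing the zero-degree term from the approximating polynomial can be illustrated numerically. The sketch below uses Chebyshev interpolation as a convenient stand-in for the best (minimax) polynomial approximation, with an arbitrary interval and degree: after subtracting the zero-degree term, dividing by $x$ gives a polynomial that approximates $\log x$ poorly near zero, yet the error in approximating $x\log x$ stays uniformly small.

```python
import numpy as np
from numpy.polynomial import chebyshev

alpha, L = 0.1, 20  # illustrative interval endpoint and degree

def f(x):
    # x * log(x), extended continuously by 0 at x = 0
    x = np.asarray(x, float)
    return np.where(x > 0, x * np.log(np.maximum(x, 1e-300)), 0.0)

# Chebyshev interpolant of x*log(x) on [0, alpha] (a proxy for the best
# polynomial approximation).
p = chebyshev.Chebyshev.interpolate(f, L, domain=[0, alpha])
a0 = p(0.0)                 # zero-degree term equals the value at x = 0
q = lambda x: p(x) - a0     # remove it, so q(0) = 0 and q(x)/x is a polynomial

xs = np.linspace(1e-9, alpha, 20000)
err_q = np.max(np.abs(q(xs) - f(xs)))        # uniform error of q for x log x
err_log = np.abs(q(xs) / xs - np.log(xs))    # error of q(x)/x for log x
print(err_q, err_log.max())                  # small vs. huge (near x = 0)
```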

The bound in (34) implies that although $q_L(x)$ is not the best polynomial approximation of $x\log x$, the error in the approximation by $q_L(x)$ has the same order as that by $p_L(x)$. Compared to $p_L(x)$, there is no zero-degree term in $q_L(x)$, and hence $\frac{q_L(x)}{x}$ is a valid polynomial approximation of $\log x$. Although the approximation error of $\log x$ using $\frac{q_L(x)}{x}$ is unbounded, the error in the approximation of $P_i\log Q_i$ using $P_i\frac{q_L(Q_i)}{Q_i}$ can be bounded. More importantly, by the way in which we constructed $q_L(x)$, $P_i\frac{q_L(Q_i)}{Q_i}$ is a polynomial function of $P_i$ and $Q_i$, for which an unbiased estimator can be constructed. More specifically, the error in using $P_i\frac{q_L(Q_i)}{Q_i}$ to approximate $P_i\log Q_i$ can be bounded as follows:

$$\bigg|P_i\frac{q_L(Q_i)}{Q_i}-P_i\log Q_i\bigg|=\frac{P_i}{Q_i}\big|q_L(Q_i)-Q_i\log Q_i\big|\lesssim\frac{f(k)}{n\log k},\tag{35}$$

for $Q_i\in(0,\alpha]$. We further define the factorial moment of $y$ by $(y)_r\triangleq\frac{y!}{(y-r)!}$. If $Y\sim\mathrm{Poi}(\lambda)$, then $\mathbb{E}[(Y)_r]=\lambda^r$. Based on this fact, we construct an unbiased estimator for $P_i\frac{q_L(Q_i)}{Q_i}$ as follows:

$$\frac{X''_i}{m}\Bigg(\sum_{j=1}^{L}a_j\alpha^{1-j}\frac{(Y''_i)_{j-1}}{n^{j-1}}+\log\alpha\Bigg).\tag{36}$$
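The factorial-moment identity used in (36), $\mathbb{E}[(Y)_r]=\lambda^r$ for $Y\sim\mathrm{Poi}(\lambda)$, can be verified by simulation (a sketch; the sample size and $\lambda$ are arbitrary choices):

```python
import numpy as np

def falling_factorial(y, r):
    """(y)_r = y (y-1) ... (y-r+1); equals y!/(y-r)! for integer y >= r."""
    out = np.ones_like(y, dtype=float)
    for s in range(r):
        out = out * (y - s)
    return out

rng = np.random.default_rng(3)
lam = 3.0
Y = rng.poisson(lam, size=2_000_000)

for r in (1, 2, 3):
    print(r, falling_factorial(Y, r).mean(), lam**r)

# Hence (Y_i)_r / n^r is unbiased for Q_i^r when Y_i ~ Poi(n Q_i), which is
# what makes every monomial in (36) estimable without bias.
```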

We then construct our estimator for $\sum_{i=1}^{k}P_i\log Q_i$ as follows:

$$\sum_{i=1}^{k}\Bigg[\frac{X''_i}{m}\Bigg(\sum_{j=1}^{L}a_j\alpha^{1-j}\frac{(Y''_i)_{j-1}}{n^{j-1}}+\log\alpha\Bigg)\mathbf{1}\{Y'_i\le c_2\log k\}+\frac{X''_i}{m}\bigg(\log\frac{Y''_i+1}{n}-\frac{1}{2(Y''_i+1)}\bigg)\mathbf{1}\{Y'_i>c_2\log k\}\Bigg].\tag{37}$$

Note that we set $c=1$ for the bias-corrected augmented plug-in estimator used here, and we do not normalize the estimate of $Q_i$. When we normalize with $n+k$ instead of $n$, each estimate of $Q_i$ differs by a factor of $\frac{n}{n+k}$, which requires a different bias correction term. For the purpose of convenience, we do not normalize the estimate of $Q_i$.

For the term $\sum_{i=1}^{k}P_i\log P_i$ in (29), we use the minimax optimal entropy estimator proposed in [17]. We note that an analogous polynomial $g_L(x)=\sum_{j=0}^{L}b_jx^j$ is the best polynomial approximation of the function $x\log x$ over the corresponding interval for $P_i$. And an unbiased estimator of $g_L(P_i)$ is as follows:

$$\sum_{j=0}^{L}b_j\frac{(X''_i)_j}{m^j}.\tag{38}$$

Based on (38), the estimator for $\sum_{i=1}^{k}P_i\log P_i$ is constructed as follows: