Minimax Estimation of the $L_1$ Distance
Abstract
We consider the problem of estimating the $L_1$ distance between two discrete probability measures $P$ and $Q$ from empirical data in a nonasymptotic and large alphabet setting. When $Q$ is known and one obtains $n$ samples from $P$, we show that for every $Q$, the minimax rate-optimal estimator with $n$ samples achieves performance comparable to that of the maximum likelihood estimator (MLE) with $n \ln n$ samples. When both $P$ and $Q$ are unknown, we construct minimax rate-optimal estimators whose worst-case performance is essentially that of the known-$Q$ case with $Q$ being uniform, implying that $Q$ being uniform is essentially the most difficult case. The effective sample size enlargement phenomenon, identified in Jiao et al. (2015), holds both in the known-$Q$ case for every $Q$ and in the unknown-$Q$ case. However, the construction of optimal estimators for $\|P - Q\|_1$ requires new techniques and insights beyond the Approximation methodology of functional estimation in Jiao et al. (2015).
I Introduction
I-A Problem formulation
Statistical functionals are widely used to quantify the fundamental limits of data processing tasks such as data compression (e.g., Shannon entropy [1]), data transmission (e.g., mutual information [1]), and estimation and testing (e.g., Kullback–Leibler divergence [2, Thm. 11.8.3], $L_1$ distance [3, Chap. 13]). They measure the difficulty of the corresponding data processing tasks and shed light on how much improvement one may expect beyond current state-of-the-art approaches. In this sense, it is of great value to obtain accurate estimates of these functionals in various problems.
In this paper, we consider estimating the $L_1$ distance between two discrete distributions $P = (p_1, \ldots, p_S)$ and $Q = (q_1, \ldots, q_S)$, which is defined as:
(1) $\|P - Q\|_1 = \sum_{i=1}^{S} |p_i - q_i|$
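For concreteness, the functional in (1) can be computed directly from two probability vectors; the following is a minimal sketch (the function name is ours, not the paper's):

```python
def l1_distance(p, q):
    """L1 distance between two discrete distributions given as
    equal-length probability vectors, as in (1)."""
    assert len(p) == len(q)
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

# Example: uniform vs. a point mass on a 4-symbol alphabet.
print(l1_distance([0.25, 0.25, 0.25, 0.25], [1.0, 0.0, 0.0, 0.0]))  # 1.5
```

Note that $\|P - Q\|_1$ always lies in $[0, 2]$, with the maximum attained by distributions of disjoint support.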
Throughout we use the squared error loss; i.e., the risk function of an estimator $\hat{T}$ is defined as
(2) $R(P, Q; \hat{T}) \triangleq \mathbb{E}\big(\hat{T} - \|P - Q\|_1\big)^2,$
where the expectation is taken with respect to the observed samples. The maximum risk of an estimator $\hat{T}$ and the minimax risk in estimating $\|P - Q\|_1$ are defined as
(3) $\sup_{P \in \mathcal{P},\, Q \in \mathcal{Q}} R(P, Q; \hat{T}),$
(4) $\inf_{\hat{T}} \sup_{P \in \mathcal{P},\, Q \in \mathcal{Q}} R(P, Q; \hat{T}),$
respectively, where $\mathcal{P}$ and $\mathcal{Q}$ are given collections (uncertainty sets) of probability measures, and the infimum is taken over all possible estimators $\hat{T}$.
The $L_1$ distance is closely related to the Bayes error, i.e., the fundamental limit, in classification problems. Specifically, for a two-class classification problem with equal prior probabilities for the two classes, the minimum probability of error achieved by the optimal classifier is given by
(5) $P_e = \frac{1}{2}\left(1 - \frac{1}{2}\|P_1 - P_2\|_1\right),$
where $P_1$ and $P_2$ are the class-conditional distributions. Hence, estimating $P_e$ in this classification problem reduces to estimating the $L_1$ distance between the two class-conditional distributions from the empirical data. In the statistical learning theory literature, most work on Bayes classification error estimation deals with the case where the class-conditional distributions are continuous, and it turns out that even in an asymptotic setting it is very difficult to estimate this quantity in that general continuous case. Indeed, we know from [4, Section 8.5] the negative result that for every sample size $n$ and every estimate $\hat{P}_e$ of the Bayes error $P_e$, there exist class-conditional distributions for which the estimation error remains bounded away from zero.
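As a quick numerical check of the relation between the Bayes error and the $L_1$ distance: with equal priors, the optimal classifier errs exactly on the overlap of the two class-conditional pmfs, which is equivalent to the $L_1$ formula above (the function names below are ours):

```python
def bayes_error(p1, p2):
    # With equal priors, the optimal classifier's error is the overlap:
    # P_e = (1/2) * sum_i min(p1_i, p2_i).
    return 0.5 * sum(min(a, b) for a, b in zip(p1, p2))

def bayes_error_via_l1(p1, p2):
    # Equivalent expression through the L1 distance:
    # P_e = (1/2) * (1 - ||P1 - P2||_1 / 2).
    l1 = sum(abs(a - b) for a, b in zip(p1, p2))
    return 0.5 * (1.0 - 0.5 * l1)

p1 = [0.5, 0.3, 0.2]
p2 = [0.2, 0.2, 0.6]
assert abs(bayes_error(p1, p2) - bayes_error_via_l1(p1, p2)) < 1e-9
```

The two expressions agree because $\sum_i \min(a_i, b_i) = 1 - \frac{1}{2}\sum_i |a_i - b_i|$ for probability vectors.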
This negative result shows that one needs to look at special classes of the classconditional distributions in order to obtain meaningful and consistent estimates. In the discrete setting, the seminal work of Valiant and Valiant [5] deserves special mention. They constructed an estimator for and showed that when , it achieves error , and it takes at least samples to achieve consistent estimation of . Valiant and Valiant [6] constructed another estimator of using linear programming which achieves the error when . We argue in this paper that the simplest estimator for , namely plugging in the empirical distribution and obtaining achieves error rate for . In this sense, the optimal estimator seems to enlarge the sample size to in the error rate expression. This phenomenon was termed the effective sample size enlargement in [7].
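As an illustrative sanity check (ours, not from the paper) of the plug-in bias that the preceding discussion quantifies: when $P$ equals the uniform distribution $U$, the true distance $\|P - U\|_1$ is zero, yet the plug-in estimate built from the empirical distribution is strictly positive with overwhelming probability:

```python
import random

def empirical_l1_to_uniform(S, n, rng):
    """Draw n i.i.d. samples from the uniform distribution on S symbols
    and return the plug-in estimate ||P_n - U||_1 (true value: 0)."""
    counts = [0] * S
    for _ in range(n):
        counts[rng.randrange(S)] += 1
    return sum(abs(c / n - 1.0 / S) for c in counts)

rng = random.Random(0)
est = empirical_l1_to_uniform(100, 200, rng)
# True distance is 0, yet the plug-in estimate is strictly positive with
# overwhelming probability: this bias dominates in the large-alphabet regime.
```

This positive bias at $P = U$ is precisely why the plug-in approach needs a sample size growing with the alphabet size to be consistent.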
I-B Approximation: the general recipe
We emphasize that the effective sample size enlargement observed here is another manifestation of a recently discovered phenomenon in the functional estimation of high-dimensional objects. There has been a recent wave of work on functional estimation of high-dimensional parameters [6, 7, 8, 9], and it was shown in Jiao et al. [7] that for a wide class of functional estimation problems (including the Shannon entropy, power sum functionals $\sum_i p_i^\alpha$, and mutual information), there exists a general methodology, termed Approximation, that can be applied to design minimax rate-optimal estimators whose performance with $n$ samples is essentially that of the MLE (maximum likelihood estimator, or the plug-in approach) with $n \ln n$ samples.
The general Approximation methodology in [7] is as follows. Consider estimating a functional $F(\theta)$ of a parameter $\theta$ for an experiment $\{P_\theta : \theta \in \Theta\}$, given a consistent estimator $\hat{\theta}_n$ of $\theta$, where $n$ is the number of observations. Suppose the functional $F$ is analytic$^1$ everywhere except on a subset $\Theta_0 \subset \Theta$. ($^1$A function $f$ is analytic at a point $x_0$ if and only if its Taylor series about $x_0$ converges to $f$ in some neighborhood of $x_0$.) A natural estimator for $F(\theta)$ is $F(\hat{\theta}_n)$, and we know from classical asymptotics [10, Lemma 8.14] that under the benign LAN (Local Asymptotic Normality) condition [10], $F(\hat{\theta}_n)$ is asymptotically efficient for $F(\theta)$ whenever $\theta \notin \Theta_0$, provided $\hat{\theta}_n$ is asymptotically efficient for $\theta$. In the estimation of functionals of discrete distributions, $\Theta$ is the $(S-1)$-dimensional probability simplex, and a natural candidate for $\hat{\theta}_n$ is the empirical distribution, which is unbiased for every $\theta \in \Theta$.
We propose the following two-step procedure for estimating $F(\theta)$.

Classify the Regime: Compute $\hat{\theta}_n$, and declare that we are in the “non-smooth” regime if $\hat{\theta}_n$ is “close” enough to $\Theta_0$. Otherwise declare we are in the “smooth” regime;

Estimate:

If $\hat{\theta}_n$ falls in the “smooth” regime, use an estimator “similar” to $F(\hat{\theta}_n)$ to estimate $F(\theta)$;

If $\hat{\theta}_n$ falls in the “non-smooth” regime, replace the functional $F(\theta)$ in the “non-smooth” regime by an approximating functional $\tilde{F}(\theta)$ which can be estimated without bias, then apply an unbiased estimator for the functional $\tilde{F}$.

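The two-step recipe above can be sketched in code. The threshold below, of order $\ln n / n$, and the two estimator hooks are illustrative placeholders standing in for the paper's actual construction:

```python
import math

def two_step_estimate(counts_classify, counts_estimate, n, smooth_est, nonsmooth_est):
    """Skeleton of the two-step recipe, coordinate by coordinate: one
    (split) half of the samples classifies the regime, the other half
    estimates.  The log(n)/n threshold is illustrative only."""
    total = 0.0
    for z1, z2 in zip(counts_classify, counts_estimate):
        p_hat = z1 / n
        if p_hat < 2.0 * math.log(n) / n:   # "non-smooth": near the singularity
            total += nonsmooth_est(z2, n)   # e.g. polynomial-approximation based
        else:                               # "smooth": safely away from it
            total += smooth_est(z2, n)      # e.g. plug-in style
    return total
```

Using independent halves for classification and estimation is what makes the regime choice independent of the counts that are plugged into the chosen per-coordinate estimator.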
Approaches of this nature appeared before [7] in Lepski, Nemirovski, and Spokoiny [11], Cai and Low [12], Vinck et al. [13], and Valiant and Valiant [5]. The approach was developed independently for entropy estimation by Wu and Yang [8], and the ideas proved to be very fruitful in Acharya et al. [9], Wu and Yang [14], Orlitsky, Suresh, and Wu [15], and Wu and Yang [16]. However, we emphasize that in all the examples above, except for the $L_1$ distance estimator in Valiant and Valiant [5], the functionals considered take the form $\int f(p)\,d\mu$ or $\sum_i f(p_i)$, where $p$ is a univariate density or probability mass function. In particular, the functions $f$ considered are everywhere analytic except at zero, e.g., $f(p) = p^\alpha$ for non-integer $\alpha > 0$ and $f(p) = -p \ln p$. Most of these features are violated in the $L_1$ distance estimation problem. If we write $\|P - Q\|_1 = \sum_{i=1}^{S} f(p_i, q_i)$ with $f(x, y) = |x - y|$, then we have:

a bivariate function $f(x, y) = |x - y|$ in the sum;

a function which is analytic everywhere except on the segment $\{(x, y) : x = y\}$.
As discussed in Jiao et al. [7], approximation of multivariate functions is much more involved than that of univariate functions, and the fact that the “non-smooth” regime here surrounds a line segment makes the application of the Approximation approach quite difficult: what shape should we use to specify the “non-smooth” regime? We provide a comprehensive answer to this question in this paper, thereby substantially generalizing the applicability of the Approximation methodology and demonstrating the intricacy of functional estimation problems in high dimensions. Our recent work [17] presents the most up-to-date version of the general Approximation methodology, which is applied to construct minimax rate-optimal estimators for the KL divergence (see also Bu et al. [18]), the $\chi^2$ divergence, and the squared Hellinger distance. The effective sample size enlargement phenomenon holds in all these cases as well.
We emphasize that the complications triggered by the bivariate function $f(x, y) = |x - y|$ make the $L_1$ distance estimation problem highly challenging. Indeed, prior to our work, the only known estimators requiring sublinear sample sizes were those in [5, 6], whose error guarantees apply in the near-critical sampling regime but not for all larger sample sizes, and the matching lower bound was proved only for the constant error regime. A complete characterization of the minimax rates, and an estimator achieving them, were unknown prior to this work.
Our main contributions in this paper are the following:

We apply the Approximation methodology to construct minimax rate-optimal estimators with linear complexity for $\|P - Q\|_1$ when $Q$ is known, and show that for any fixed $Q$, our estimator with $n$ samples performs at least as well as the plug-in estimator with $n \ln n$ samples. Precisely, for any fixed $Q$, the worst-case performances of the plug-in estimator and of the minimax rate-optimal estimator are each dictated by a functional of $Q$, and the latter functional reflects the enlarged effective sample size. Furthermore, we show that no plug-in approach works. As we argue in Lemma 20, for estimating $\|P - Q\|_1$ with known $Q$, for any distribution estimate $\hat{P}$ constructed from the samples from $P$ that does not depend on $Q$, the estimator $\|\hat{P} - Q\|_1$ does not achieve the minimax rates in the worst case. Concretely, the worst-case performance of any such plug-in rule is essentially that of the MLE.

We generalize the Approximation methodology in [7] to construct a minimax rate-optimal estimator for $\|P - Q\|_1$ when both $P$ and $Q$ are unknown. We illustrate the novelty of our scheme via the following results:

The performance of our estimator with $n$ samples is essentially that of the MLE with $n \ln n$ samples.

Any algorithm that only conducts approximation around the origin fails to achieve the minimax rates. Indeed, as we argue in Lemma 17, any algorithm that employs the MLE whenever the empirical probabilities are large cannot achieve the minimax rates. The reason why the estimator of Valiant and Valiant [5] cannot achieve the minimax rates in this regime is that [5] did not conduct approximation when $p_i$ and $q_i$ are large. One of our key contributions is to show how to conduct approximation when $p_i$ and $q_i$ are large but close to each other, thereby achieving the minimax rates.

Best polynomial approximation is not sufficient for achieving minimax rate-optimality in this problem. As we argue in Lemma 18, for approximation in the lower left corner (where both $p_i$ and $q_i$ are small), any one-dimensional polynomial that achieves the best approximation error rate cannot be used in constructing the optimal estimator; it is necessary to use a multivariate polynomial with certain pointwise error guarantees. One of our key contributions is to construct a proper multivariate polynomial with the desired pointwise approximation error.

Approximation over the union of the “non-smooth” regime does not work. As we show in Lemma 19, there does not exist a single multivariate polynomial that achieves the desired approximation error over the whole “non-smooth” regime. Instead, in our approach, we construct polynomial approximations of the function $|x - y|$ over a random regime determined by the empirical data. To our knowledge, this is the first time a random approximation regime appears in the functional estimation literature.

Our estimator is agnostic to the potentially unknown support size $S$, yet performs as well as the minimax rate-optimal estimator that knows $S$.

The rest of the paper is organized as follows. In Sections II and III, we present a thorough performance analysis of the MLE and explicitly construct the minimax rate-optimal estimators, where Section II covers the known-$Q$ case and Section III generalizes to the case of unknown $Q$. The discussions in Section IV highlight the significance and novelty of our approaches by reviewing several other approaches which are shown to be suboptimal. The auxiliary lemmas used throughout this paper are collected in Appendix A. Proofs of all the lemmas in the main text can be found in Appendix B, and proofs of all the auxiliary lemmas are collected in Appendix C.
Notation: for non-negative sequences $\{a_n\}, \{b_n\}$, we use the notation $a_n \lesssim b_n$ to denote that there exists a universal constant $C$ such that $a_n \le C b_n$, and $a_n \gtrsim b_n$ is equivalent to $b_n \lesssim a_n$. The notation $a_n \asymp b_n$ is equivalent to $a_n \lesssim b_n$ and $b_n \lesssim a_n$. The notation $a_n \gg b_n$ means that $\liminf_{n \to \infty} a_n / b_n = \infty$, and $a_n \ll b_n$ is equivalent to $b_n \gg a_n$. We write $a \wedge b = \min\{a, b\}$ and $a \vee b = \max\{a, b\}$. Moreover, $\mathrm{poly}^K_d$ denotes the set of all $K$-variate polynomials of degree no more than $d$ in each variable, and $E_d[f; A]$ denotes the distance of the function $f$ to the space $\mathrm{poly}^K_d$ in the uniform norm on $A$. The space $\mathrm{poly}^1_d$ is also abbreviated as $\mathrm{poly}_d$. All logarithms are in the natural base. The notation $a \lesssim B$, where $a$ is a real number and $B$ is a set of real numbers, is equivalent to $a \lesssim b$ for all $b \in B$.
Throughout this paper, we utilize the Poisson sampling model instead of the binomial (fixed sample size) model; the minimax risks under the two models can be shown to be on the same scale, as in [7, Lemma 16].
II $L_1$ Distance Estimation with Known $Q$
First we consider the case where $Q$ is known while $P$ is an unknown distribution supported on at most $S$ symbols. In other words, the uncertainty set for $P$ consists of all probability distributions with support size at most $S$, while the uncertainty set for $Q$ is the singleton $\{Q\}$. We analyze the performance of the MLE in this case, and then construct the approximation-based minimax rate-optimal estimator.
We utilize the Poisson sampling model, in which we observe a Poisson random vector
(6) $Z = (Z_1, Z_2, \ldots, Z_S),$
where the coordinates of $Z$ are mutually independent, and $Z_i \sim \mathrm{Poi}(n p_i)$. We define $\hat{p}_i = Z_i / n$ as the empirical probabilities.
II-A Performance of the MLE
The MLE serves as a natural estimator for the $L_1$ distance; it can be expressed as $\|P_n - Q\|_1 = \sum_{i=1}^{S} |\hat{p}_i - q_i|$, where $P_n = (\hat{p}_1, \ldots, \hat{p}_S)$ is the empirical distribution. Since we are using the Poisson sampling model, we have $\mathbb{E}[\hat{p}_i] = p_i$ for every $i$.
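A minimal simulation of the Poisson sampling model and of the plug-in (MLE) estimate can be sketched as follows; the inversion sampler is a generic textbook method and the function names are ours, not the paper's:

```python
import math
import random

def poisson_sample(p, n, rng):
    """Poisson sampling model (6): Z_i ~ Poi(n * p_i), independent
    across i.  Knuth-style inversion sampler; fine for a demo with
    moderate rates (it underflows for very large n * p_i)."""
    def poi(lam):
        threshold, k, prod = math.exp(-lam), 0, rng.random()
        while prod > threshold:
            k += 1
            prod *= rng.random()
        return k
    return [poi(n * pi) for pi in p]

def mle_l1(q_known, z, n):
    """Plug-in (MLE) estimate of ||P - Q||_1 with Q known: plug the
    empirical probabilities Z_i / n into the functional."""
    return sum(abs(zi / n - qi) for qi, zi in zip(q_known, z))
```

For example, `mle_l1([0.5, 0.5], [3, 7], 10)` plugs in the empirical distribution $(0.3, 0.7)$ and returns $0.4$.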
The following lemma provides sharp estimates of the expectation $\mathbb{E}|\hat{p}_i - q_i|$, which can be viewed as an analog of the binomial case studied in [19].
Lemma 1.
Suppose . Then,
(7) 
Hence,
(8) 
Lemma 2.
Suppose . Then, for any ,
(9) 
Further,
(10) 
The next lemma upper bounds the variance of $|\hat{p}_i - q_i|$.
Lemma 3.
Suppose . Then, for any ,
(11) 
Combining these lemmas, we obtain upper and lower bounds for the worst-case mean squared error of the MLE $\|P_n - Q\|_1$.
Theorem 1.
The maximum likelihood estimator satisfies
(12) 
We can also lower bound the worst case mean squared error as
(13) 
Proof.
We have
(14) 
To analyze the variance, due to the mutual independence of the coordinates $Z_i$, we have
(17)  
(18)  
(19) 
where we used Lemma 3 in the second step.
The proof of the upper bound is complete. Regarding the lower bound, setting $P = Q$, we have
(20)  
(21)  
(22) 
∎
The following corollary is a straightforward consequence of Theorem 1.
Corollary 1.
If , we have
(23) 
Hence, it is necessary and sufficient for the MLE to have $n \gg S$ samples to be consistent in terms of the worst-case mean squared error.
II-B Construction of the optimal estimator
We apply our general recipe to construct the minimax rate-optimal estimator. For simplicity of analysis, we conduct the classical “splitting” operation [20] on the Poisson random vector $Z$, and obtain two independent identically distributed random vectors $Z^{(1)}, Z^{(2)}$, such that each component $Z_i^{(j)}$ has distribution $\mathrm{Poi}(n p_i / 2)$, and all coordinates of $Z^{(1)}$ and $Z^{(2)}$ are mutually independent. Concretely, for each coordinate $i$, the splitting process assigns each of the $Z_i$ observations independently and equiprobably to one of the two halves; all these assignment variables are conditionally independent given our observation $Z$. The “split” empirical probabilities are defined as $\hat{p}_i^{(j)} = Z_i^{(j)} / (n/2)$, $j = 1, 2$. To simplify notation, we redefine $n/2$ as $n$, so that $Z_i^{(j)} \sim \mathrm{Poi}(n p_i)$. We emphasize that the sample splitting approach is not needed for the actual estimator construction.
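The splitting operation can be sketched via binomial thinning: routing each of the $Z_i$ observations to the first half with probability $1/2$ turns $Z_i \sim \mathrm{Poi}(2\lambda)$ into two independent $\mathrm{Poi}(\lambda)$ halves (a standard Poisson thinning fact). The helper below is our sketch:

```python
import random

def split_poisson(z, rng):
    """Sample splitting by binomial thinning: each of the Z_i
    observations goes to the first half with probability 1/2; the
    second half keeps the remainder.  If Z_i ~ Poi(2*lam), the two
    halves are independent Poi(lam) random variables."""
    z1 = [sum(1 for _ in range(zi) if rng.random() < 0.5) for zi in z]
    z2 = [zi - a for zi, a in zip(z, z1)]
    return z1, z2
```

By construction the halves always add back up to the original counts, which is the structural property the analysis relies on.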
We construct two set-valued functions of the empirical probability, defined as:
(24)
(25)
Here the constants appearing in these definitions are universal and will be determined later. The first set function is constructed to satisfy the following property:
Lemma 4.
It is clear that this property holds for any input. The constants will be chosen later to make sure that the following three “good” events have overwhelming probability:
(27)  
(28)  
(29) 
Here $A \Rightarrow B$ represents the logical implication, which is equivalent to $A^c \cup B$. The intuition behind the construction of these “good” events is as follows. Since we use the first half of the samples to classify the regime, and will later use three different estimators depending on whether $q_i$ lies to the left of, to the right of, or inside the chosen set, it is desirable that we can infer the relationship between $p_i$ and $q_i$ based on the location of $\hat{p}_i^{(1)}$. The reason why these events can be controlled to have high probabilities is that the first set is specifically designed to be a strict subset of the second, and both sets are designed to satisfy Lemma 4, which ensures that the size of each set is essentially the length of the confidence interval when the empirical probability is observed.
We have the following lemma controlling the probabilities of these events.
Lemma 5.
Now we construct the estimator. In the “smooth” regime, we simply employ the plug-in approach to estimate $|p_i - q_i|$. In the “non-smooth” regime, we need to approximate $|p_i - q_i|$ by another functional which can be estimated without bias. We consider the best polynomial approximation of the function $x \mapsto |x - q_i|$ over the set, which is defined as
(32)
where the minimization is over all polynomials of degree no more than $K$. Once we obtain the best approximating polynomial, we can construct an unbiased estimate of it, since every monomial $p_i^k$ admits the unbiased estimator $\frac{1}{n^k} \prod_{j=0}^{k-1} (Z_i - j)$ under $Z_i \sim \mathrm{Poi}(n p_i)$. As a result, the bias of the estimator in the “non-smooth” regime is exactly the approximation error of the polynomial in approximating $|p_i - q_i|$ over the set, which can be significantly smaller than that of the MLE. The following lemma gives the bias and variance bounds of the resulting estimator.
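The unbiasedness that makes polynomials estimable under Poisson sampling is the identity $\mathbb{E}\prod_{j=0}^{k-1}(Z - j) = \lambda^k$ for $Z \sim \mathrm{Poi}(\lambda)$, so $(1/n^k)\prod_{j=0}^{k-1}(Z_i - j)$ is unbiased for $p_i^k$. This can be checked numerically (helper names are ours):

```python
import math

def falling_factorial(z, k):
    """(z)_k = z * (z - 1) * ... * (z - k + 1)."""
    out = 1
    for j in range(k):
        out *= (z - j)
    return out

def expected_falling_factorial(lam, k, trunc=120):
    """E[(Z)_k] for Z ~ Poi(lam), computed by summing the pmf with the
    stable recurrence P(z+1) = P(z) * lam / (z + 1).  Equals lam**k."""
    total, pmf = 0.0, math.exp(-lam)   # pmf at z = 0
    for z in range(trunc):
        total += pmf * falling_factorial(z, k)
        pmf *= lam / (z + 1)
    return total

# E[(Z)_4] = 3^4 = 81 for Z ~ Poi(3); hence (Z)_k / n^k is unbiased
# for p^k when Z ~ Poi(n * p).
assert abs(expected_falling_factorial(3.0, 4) - 81.0) < 1e-6
```

Estimating each monomial of the approximating polynomial this way yields an unbiased estimate of the whole polynomial by linearity.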
Lemma 6.
Hence, the bias improves upon the result in Lemma 2 by a logarithmic factor. In summary, we have the following construction.
Estimator Construction 1.
The performance of this estimator is presented in the following theorem.
Theorem 2.
The performance of satisfies
(37) 
where the first constant can be made arbitrarily small, and the second constant can be made arbitrarily large.
Proof.
Recall the “good” events defined in (27), (28), (29), and let $E$ denote their intersection. We have
(38)  
(39)  
(40) 
where we have applied Lemma 5.
Define the random variables
(41)  
(42)  
(43) 
where the random index sets are defined as
(44)  
(45)  
(46) 
The random index sets are independent of the random variables $Z_i^{(2)}$. Since
(47) 
it follows from the Cauchy–Schwarz inequality that
(48) 
where is defined in (31).
It follows from the law of total variance that
(49)  
(50)  
(51)  
(52) 
where we have used the fact that with probability one and Lemma 3. Similarly we have .
∎
The following corollary is immediate.
Corollary 2.
Suppose , we have
(56) 
In particular, if , we have
(57) 
Proof.
For the first part, when , one may choose large enough to ensure that . When , one may choose small enough to ensure that .
The second part is proved upon noticing that
(58)  
(59) 
∎
II-C Minimax lower bound
It was shown in Valiant and Valiant [5] that if $Q$ is the uniform distribution and $n \lesssim S/\ln S$, the minimax risk of estimating $\|P - Q\|_1$ is a constant.
We prove a minimax lower bound for every $Q$, and show that the performance achieved by our estimator in Theorem 2 is minimax rate-optimal for every fixed $Q$.
The main tool we employ is the so-called method of two fuzzy hypotheses presented in Tsybakov [21]. Suppose we observe a random vector $Z \in (\mathcal{Z}, \mathcal{A})$ which has distribution $P_\theta$, where $\theta \in \Theta$. Let $\sigma_0$ and $\sigma_1$ be two prior distributions supported on $\Theta$. Write $F_i$ for the marginal distribution of $Z$ when the prior is $\sigma_i$, for $i = 0, 1$. Let $\hat{T} = \hat{T}(Z)$ be an arbitrary estimator of a functional $T(\theta)$ based on $Z$. We have the following general minimax lower bound.
Lemma 7.
[21, Thm. 2.15] Given the setting above, suppose there exist $\zeta \in \mathbb{R}$, $s > 0$, and $0 \le \beta_0, \beta_1 < 1$ such that
(60) $\sigma_0(\theta : T(\theta) \le \zeta - s) \ge 1 - \beta_0,$
(61) $\sigma_1(\theta : T(\theta) \ge \zeta + s) \ge 1 - \beta_1.$
If $V(F_1, F_0) < 1 - \beta_0 - \beta_1$, then
(62) $\inf_{\hat{T}} \sup_{\theta \in \Theta} P_\theta\big(|\hat{T} - T(\theta)| \ge s\big) \ge \frac{1 - V(F_1, F_0) - \beta_0 - \beta_1}{2},$
where $F_0, F_1$ are the marginal distributions of $Z$ when the priors are $\sigma_0, \sigma_1$, respectively.
Here $V(F_1, F_0)$ is the total variation distance between the two probability measures $F_1$ and $F_0$ on the measurable space $(\mathcal{Z}, \mathcal{A})$. Concretely, we have
(63) $V(F_1, F_0) = \sup_{A \in \mathcal{A}} |F_1(A) - F_0(A)| = \frac{1}{2} \int |f_1 - f_0| \, d\nu,$
where $f_i = \frac{dF_i}{d\nu}$, $i = 0, 1$, and $\nu$ is a dominating measure so that $F_i \ll \nu$.
The following lemma was shown in Cai and Low [12]:
Lemma 8.
For any given even integer $K$, there exist two probability measures $\nu_0$ and $\nu_1$ on $[-1, 1]$ that satisfy the following conditions:

$\nu_0$ and $\nu_1$ are symmetric around $0$;
