Minimax Estimation of the $L_1$ Distance


Jiantao Jiao, Yanjun Han, and Tsachy Weissman. The authors are with the Department of Electrical Engineering, Stanford University, CA, USA. Email: {jiantao, yjhan, tsachy}@stanford.edu. The material in this paper was presented in part at the 2016 IEEE International Symposium on Information Theory, Barcelona, Spain.
June 30, 2019
Abstract

We consider the problem of estimating the $L_1$ distance between two discrete probability measures $p$ and $q$ from empirical data in a nonasymptotic and large alphabet setting. When $q$ is known and one obtains $n$ samples from $p$, we show that for every $q$, the minimax rate-optimal estimator with $n$ samples achieves performance comparable to that of the maximum likelihood estimator (MLE) with $n\ln n$ samples. When both $p$ and $q$ are unknown, we construct minimax rate-optimal estimators whose worst case performance is essentially that of the known-$q$ case with $q$ being uniform, implying that $q$ being uniform is essentially the most difficult case. The effective sample size enlargement phenomenon, identified in Jiao et al. (2015), holds both in the known-$q$ case for every $q$ and in the case where $q$ is unknown. However, the construction of optimal estimators for the $L_1$ distance requires new techniques and insights beyond the Approximation methodology of functional estimation in Jiao et al. (2015).

Divergence estimation, total variation distance, multivariate approximation theory, functional estimation, optimal classification error, high dimensional statistics, Hellinger distance

I Introduction

I-A Problem formulation

Statistical functionals are often used to quantify the fundamental limits of data processing tasks such as data compression (e.g., Shannon entropy [1]), data transmission (e.g., mutual information [1]), and estimation and testing (e.g., Kullback–Leibler divergence [2, Thm. 11.8.3], $L_1$ distance [3, Chap. 13]). They measure the difficulties of the corresponding data processing tasks and shed light on how much improvement one may expect beyond the current state-of-the-art approaches. In this sense, it is of great value to obtain accurate estimates of these functionals in various problems.

In this paper, we consider estimating the $L_1$ distance between two discrete distributions $p = (p_1, \ldots, p_S)$ and $q = (q_1, \ldots, q_S)$, which is defined as

$$\|p - q\|_1 = \sum_{i=1}^{S} |p_i - q_i|. \qquad (1)$$

Throughout we use the squared error loss, i.e., the risk function for an estimator $\hat{L}$ of the $L_1$ distance is defined as

$$R(\hat{L}; p, q) \triangleq \mathbb{E}\left(\hat{L} - \|p - q\|_1\right)^2, \qquad (2)$$

where the expectation is taken with respect to the samples. The maximum risk of an estimator $\hat{L}$, and the minimax risk in estimating $\|p - q\|_1$, are defined as

$$\sup_{p \in \mathcal{M}_P,\, q \in \mathcal{M}_Q} R(\hat{L}; p, q), \qquad (3)$$
$$\inf_{\hat{L}} \sup_{p \in \mathcal{M}_P,\, q \in \mathcal{M}_Q} R(\hat{L}; p, q), \qquad (4)$$

respectively, where $\mathcal{M}_P$ and $\mathcal{M}_Q$ are given collections (uncertainty sets) of probability measures for $p$ and $q$, respectively, and the infimum is taken over all possible estimators $\hat{L}$.

The $L_1$ distance is closely related to the Bayes error, i.e., the fundamental limit, in classification problems. Specifically, for a two-class classification problem, if the prior probabilities for the two classes are equal, then the minimum probability of error achieved by the optimal classifier is given by

$$P_e = \frac{1}{2}\left(1 - \frac{1}{2}\|p_1 - p_2\|_1\right), \qquad (5)$$

where the label $Y \in \{1, 2\}$ indicates the class, and $p_1, p_2$ are the class-conditional distributions. Hence, the problem of estimating the Bayes error $P_e$ in this classification problem is reduced to estimating the $L_1$ distance between the two class-conditional distributions from the empirical data. In the statistical learning theory literature, most work on Bayes classification error estimation deals with the case where $p_1$ and $p_2$ are continuous distributions, and it turns out that even in an asymptotic setting it is very difficult to estimate this quantity in this general continuous case. Indeed, we know from [4, Section 8.5] the negative result that for every sample size $n$ and any estimate of the Bayes error, there exist some class-conditional distributions for which the expected estimation error remains bounded away from zero.
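As a quick numerical check of (5), consider the following minimal sketch; the two class-conditional distributions are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical class-conditional distributions on a 4-symbol alphabet.
p1 = np.array([0.4, 0.3, 0.2, 0.1])
p2 = np.array([0.1, 0.2, 0.3, 0.4])

l1 = np.abs(p1 - p2).sum()            # L1 distance, here 0.8
bayes_error = 0.5 * (1.0 - 0.5 * l1)  # optimal error under equal priors, here 0.3
print(l1, bayes_error)
```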

This negative result shows that one needs to consider special classes of class-conditional distributions in order to obtain meaningful and consistent estimates. In the discrete setting, the seminal work of Valiant and Valiant [5] deserves special mention. They constructed an estimator for the $L_1$ distance and showed that it achieves vanishing error once the sample size satisfies $n \gg S/\ln S$, where $S$ is the support size, and that at least $\Omega(S/\ln S)$ samples are required for consistent estimation. Valiant and Valiant [6] constructed another estimator using linear programming which also achieves vanishing error in the regime $n \gg S/\ln S$. We argue in this paper that the simplest estimator, namely plugging in the empirical distributions and computing $\|\hat{p} - \hat{q}\|_1$, achieves error rate $\sqrt{S/n}$ once $n \gtrsim S$. In this sense, the optimal estimator effectively enlarges the sample size from $n$ to $n\ln n$ in the error rate expression. This phenomenon was termed the effective sample size enlargement in [7].
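For concreteness, a minimal sketch of the plug-in estimator discussed above is given below; the support size, sample size, and distributions are arbitrary illustrative choices, not the regimes analyzed in this paper.

```python
import numpy as np

def plug_in_l1(x_counts, y_counts):
    """Plug-in (MLE) estimate of the L1 distance: substitute the empirical
    distributions for the unknown p and q."""
    p_hat = x_counts / x_counts.sum()
    q_hat = y_counts / y_counts.sum()
    return np.abs(p_hat - q_hat).sum()

# Illustrative experiment: S = 1000 symbols, n = 2000 samples from each distribution.
rng = np.random.default_rng(0)
S, n = 1000, 2000
p = np.full(S, 1.0 / S)                 # uniform p
q = rng.dirichlet(np.ones(S))           # a randomly drawn q
x = rng.multinomial(n, p)
y = rng.multinomial(n, q)
print(plug_in_l1(x, y), np.abs(p - q).sum())   # estimate vs. true L1 distance
```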

I-B Approximation: the general recipe

We emphasize that the observed effective sample size enlargement here is another manifestation of a recently discovered phenomenon in functional estimation of high dimensional objects. There has been a recent wave of study on functional estimation of high dimensional parameters [6, 7, 8, 9], and it was shown in Jiao et al. [7] that for a wide class of functional estimation problems (including the Shannon entropy $H(P) = \sum_{i=1}^S -p_i \ln p_i$, the power sums $F_\alpha(P) = \sum_{i=1}^S p_i^\alpha$, and mutual information), there exists a general methodology, termed Approximation, that can be applied to design minimax rate-optimal estimators whose performance with $n$ samples is essentially that of the MLE (maximum likelihood estimator, or the plug-in approach) with $n\ln n$ samples.

The general methodology of Approximation in [7] is as follows. Consider estimating a functional $F(\theta)$ of a parameter $\theta \in \Theta$ for an experiment $\{P_\theta : \theta \in \Theta\}$, with a consistent estimator $\hat{\theta}_n$ for $\theta$, where $n$ is the number of observations. Suppose the functional $F(\theta)$ is analytic\footnote{A function $f$ is analytic at a point $x_0$ if and only if its Taylor series about $x_0$ converges to $f$ in some neighborhood of $x_0$.} everywhere except on a subset $\Theta_0 \subset \Theta$. A natural estimator for $F(\theta)$ is $F(\hat{\theta}_n)$, and we know from classical asymptotics [10, Lemma 8.14] that, given the benign LAN (Local Asymptotic Normality) condition [10], $F(\hat{\theta}_n)$ is asymptotically efficient for $F(\theta)$ whenever $\theta \notin \Theta_0$, provided $\hat{\theta}_n$ is asymptotically efficient for $\theta$. In the estimation of functionals of discrete distributions, $\Theta$ is the $(S-1)$-dimensional probability simplex, and a natural candidate for $\hat{\theta}_n$ is the empirical distribution, which is unbiased for any $\theta \in \Theta$.

We propose the following two-step procedure for estimating $F(\theta)$; a schematic code sketch follows the list.

  1. Classify the Regime: Compute $\hat{\theta}_n$, and declare that we are in the “non-smooth” regime if $\hat{\theta}_n$ is “close” enough to $\Theta_0$. Otherwise declare that we are in the “smooth” regime;

  2. Estimate:

    • If $\hat{\theta}_n$ falls in the “smooth” regime, use an estimator “similar” to $F(\hat{\theta}_n)$ to estimate $F(\theta)$;

    • If $\hat{\theta}_n$ falls in the “non-smooth” regime, replace the functional $F(\theta)$ in the “non-smooth” regime by an approximation $F_{\mathrm{appr}}(\theta)$ (another functional) which can be estimated without bias, then apply an unbiased estimator for the functional $F_{\mathrm{appr}}(\theta)$.
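The following schematic sketch illustrates the overall structure of such a two-step estimator for a functional of the form $\sum_i f(p_i)$; the threshold and the two per-symbol estimators are placeholders, not the specific choices constructed later in this paper.

```python
import numpy as np

def two_stage_estimate(x1, x2, n, smooth_term, nonsmooth_term, c=1.0):
    """Schematic two-stage estimator for a functional sum_i f(p_i).

    x1, x2 : independent count vectors (e.g. obtained by sample splitting);
             x1 is used only to classify the regime, x2 only for estimation.
    smooth_term(phat, n)    : bias-corrected plug-in estimate of f(p_i).
    nonsmooth_term(count, n): unbiased estimate of a polynomial approximating
                              f near its non-smooth point.
    c : placeholder constant defining the "non-smooth" regime.
    """
    cutoff = c * np.log(n) / n          # symbols below this are "non-smooth"
    total = 0.0
    for x1_i, x2_i in zip(x1, x2):
        if x1_i / n > cutoff:           # "smooth" regime: plug-in style estimate
            total += smooth_term(x2_i / n, n)
        else:                           # "non-smooth" regime: polynomial estimate
            total += nonsmooth_term(x2_i, n)
    return total
```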

Approaches of this nature appeared before [7] in Lepski, Nemirovski, and Spokoiny [11], Cai and Low [12], Vinck et al. [13], and Valiant and Valiant [5]. The idea was developed independently for entropy estimation by Wu and Yang [8], and proved to be very fruitful in Acharya et al. [9], Wu and Yang [14], Orlitsky, Suresh, and Wu [15], and Wu and Yang [16]. However, we emphasize that in all the examples above, except for the $L_1$ distance estimator in Valiant and Valiant [5], the functionals considered take the form $\sum_{i=1}^S f(p_i)$ or $\int f(p(x))\,dx$, where $f$ is a univariate function and $p$ is a discrete distribution or a density. In particular, the functions $f$ considered are everywhere analytic except at zero, e.g., $f(p) = -p\ln p$ for the Shannon entropy and $f(p) = p^\alpha$ for $F_\alpha(P)$. Most of these features are violated in the $L_1$ distance estimation problem. If we write $\|p - q\|_1 = \sum_{i=1}^S f(p_i, q_i)$ with $f(x, y) = |x - y|$, then we have:

  1. a bivariate function $f(x, y) = |x - y|$ in the sum;

  2. a function which is analytic everywhere except on the segment $\{(x, y) : x = y\}$.

As discussed in Jiao et al. [7], approximation of multivariate functions is much more involved than that of univariate functions, and the fact that the “non-smooth” regime here surrounds a line segment makes the application of the Approximation approach quite difficult: what shape should we use to specify the “non-smooth” regime? We provide a comprehensive answer to this question in this paper, thereby substantially generalizing the applicability of the Approximation methodology and demonstrating the intricacy of functional estimation problems in high dimensions. Our recent work [17] presents the most up-to-date version of the general Approximation methodology, which is applied to construct minimax rate-optimal estimators for the KL divergence (also see Bu et al. [18]), the $\chi^2$-divergence, and the squared Hellinger distance. The effective sample size enlargement phenomenon holds in all these cases as well.

We emphasize that the complications triggered by the bivariate function $|x - y|$ make the $L_1$ distance estimation problem highly challenging. Indeed, prior to our work, the only known estimators requiring a sublinear number of samples were those in [5, 6], whose error guarantees cover the regime $n \gg S/\ln S$ but not the full range of sample sizes, and whose accompanying lower bound was proved only for the constant-error regime. The complete characterization of the minimax rates, and an estimator achieving them, were unknown prior to this work.

Our main contributions in this paper are the following:

  1. We apply the Approximation methodology to construct minimax rate-optimal estimators with linear complexity for $\|p - q\|_1$ when $q$ is known, and show that for any fixed $q$, our estimator with $n$ samples performs at least as well as the plug-in estimator with $n\ln n$ samples. Precisely, the performance of the plug-in estimator for any fixed $q$ is dictated by a functional of $q$ and the sample size $n$, while that of the minimax rate-optimal estimator is dictated by the same functional with $n$ replaced by $n\ln n$. Furthermore, we show that no plug-in approach works: as we argue in Lemma 20, for estimating $\|p - q\|_1$ with known $q$, for any distribution estimate $\hat{p}$ constructed from the samples from $p$, the estimator $\|\hat{p} - q\|_1$ does not achieve the minimax rates in the worst case if $\hat{p}$ does not depend on $q$. Concretely, the worst case performance of any such plug-in rule is essentially that of the MLE.

  2. We generalize the Approximation methodology in [7] to construct a minimax rate-optimal estimator for $\|p - q\|_1$ when both $p$ and $q$ are unknown. We illustrate the novelty of our scheme via the following results:

    1. The performance of our estimator with $n$ samples is essentially that of the MLE with $n\ln n$ samples.

    2. Any algorithm that only conducts approximation around the origin does not achieve the minimax rates. Indeed, as we argue in Lemma 17, any algorithm that employs the MLE whenever the empirical probabilities are not both small cannot achieve the minimax rates. The reason why the estimator of Valiant and Valiant [5] cannot achieve the minimax rates is that [5] did not conduct approximation when $p_i$ and $q_i$ are both large but close to each other. One of our key contributions is to figure out how to conduct approximation in this regime and thereby achieve the minimax rates.

    3. Best polynomial approximation alone is not sufficient for achieving minimax rate-optimality in this problem. As we argue in Lemma 18, for approximation in the lower left corner, any one-dimensional polynomial that achieves the best approximation error rate cannot be used in constructing the optimal estimator, and it is necessary to use a multivariate polynomial with certain pointwise error guarantees. One of our key contributions is to construct a proper multivariate polynomial with the desired pointwise approximation error.

    4. Approximation over the whole “non-smooth” regime at once does not work. As we show in Lemma 19, there does not exist a single multivariate polynomial that achieves the desired approximation error over the whole “non-smooth” regime. Instead, in our approach, we construct polynomial approximations of the function over a random regime that is determined by the empirical data. To our knowledge, this is the first time that a random approximation regime appears in the functional estimation literature.

    5. Our estimator is agnostic to the potentially unknown support size $S$, but performs as well as the minimax rate-optimal estimator that knows $S$.

The rest of the paper is organized as follows. In Sections II and III, we present a thorough performance analysis of the MLE and explicitly construct the minimax rate-optimal estimators, where Section II covers the case of known $q$ and Section III generalizes to the case of unknown $q$. The discussions in Section IV highlight the significance and novelty of our approach by reviewing several other approaches which are shown to be suboptimal. The auxiliary lemmas used throughout this paper are collected in Appendix A. Proofs of all the lemmas in the main text can be found in Appendix B, and proofs of all the auxiliary lemmas are collected in Appendix C.

Notation: for non-negative sequences $\{a_n\}$ and $\{b_n\}$, we use the notation $a_n \lesssim b_n$ to denote that there exists a universal constant $C > 0$ such that $a_n \le C b_n$, and $a_n \gtrsim b_n$ is equivalent to $b_n \lesssim a_n$. The notation $a_n \asymp b_n$ is equivalent to $a_n \lesssim b_n$ and $a_n \gtrsim b_n$. The notation $a_n \ll b_n$ means that $\limsup_{n\to\infty} a_n/b_n = 0$, and $a_n \gg b_n$ is equivalent to $b_n \ll a_n$. We write $a \wedge b = \min\{a, b\}$ and $a \vee b = \max\{a, b\}$. Moreover, $\mathsf{Poly}_K^d$ denotes the set of all $d$-variate polynomials of degree of each variable no more than $K$, and $E_K[f; A]$ denotes the distance of the function $f$ to the space $\mathsf{Poly}_K^d$ in the uniform norm on the set $A$. The space $\mathsf{Poly}_K^1$ is also abbreviated as $\mathsf{Poly}_K$. All logarithms are in the natural base. The notation $x \gtrsim A$, where $x$ is a real number and $A$ is a set of real numbers, means that $x \gtrsim y$ for all $y \in A$.

Throughout this paper, we utilize the Poisson sampling model instead of the binomial (fixed sample size) model; the minimax risks under the two models can be shown to be of the same order, as in [7, Lemma 16].

II Divergence Estimation with Known $q$

First we consider the case where $q$ is known while $p$ is an unknown distribution supported on at most $S$ elements. In other words, $\mathcal{M}_Q = \{q\}$ and $\mathcal{M}_P$ is the set of all probability distributions with support size at most $S$. We analyze the performance of the MLE in this case, and construct the approximation-based minimax rate-optimal estimator.

We utilize the Poisson sampling model, in which we observe a Poisson random vector $X = (X_1, X_2, \ldots, X_S)$ with

$$X_i \sim \mathsf{Poi}(n p_i), \quad i = 1, 2, \ldots, S, \qquad (6)$$

where the coordinates of $X$ are mutually independent. We define $\hat{p}_i = X_i/n$ as the empirical probabilities.
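A minimal simulation of the Poisson sampling model, together with the resulting plug-in estimate of $\|p - q\|_1$, is sketched below; the particular $p$, $q$, $S$, and $n$ are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
S, n = 500, 1000
p = rng.dirichlet(np.ones(S))            # unknown distribution p
q = np.full(S, 1.0 / S)                  # known distribution q (uniform here)

X = rng.poisson(n * p)                   # Poisson sampling: X_i ~ Poi(n p_i), independent
p_hat = X / n                            # empirical probabilities

mle = np.abs(p_hat - q).sum()            # plug-in (MLE) estimate of ||p - q||_1
print(mle, np.abs(p - q).sum())          # estimate vs. true distance
```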

II-A Performance of the MLE

The MLE serves as a natural estimator for the $L_1$ distance and can be expressed as $\|\hat{p} - q\|_1 = \sum_{i=1}^S |\hat{p}_i - q_i|$, where $\hat{p} = (\hat{p}_1, \ldots, \hat{p}_S)$ is the empirical distribution. Since we are using the Poisson sampling model, we have $n\hat{p}_i = X_i \sim \mathsf{Poi}(n p_i)$ with mutually independent coordinates.

The following lemma provides the required sharp estimates, which can be viewed as an analog of the binomial case studied in [19].

Lemma 1.

Suppose . Then,

(7)

Hence,

(8)
Lemma 2.

Suppose . Then, for any ,

(9)

Further,

(10)

The next lemma upper bounds the variance of .

Lemma 3.

Suppose . Then, for any ,

(11)

We now obtain upper and lower bounds for the mean squared error of the MLE.

Theorem 1.

The maximum likelihood estimator satisfies

(12)

We can also lower bound the worst case mean squared error as

(13)
Proof.

We have

(14)

Hence,

(15)
(16)

where we applied Lemma 2.

To analyze the variance, due to the mutual independence of , we have

(17)
(18)
(19)

where we used Lemma 3 in the second step.

The proof of the upper bound is complete. Regarding the lower bound, setting , we have

(20)
(21)
(22)

The following corollary is straightforward.

Corollary 1.

If , we have

(23)

Hence, it is necessary and sufficient for the MLE to have $n \gg S$ samples in order to be consistent in terms of the worst case mean squared error.

II-B Construction of the optimal estimator

We apply our general recipe to construct the minimax rate-optimal estimator. For simplicity of analysis, we conduct the classical “splitting” operation [20] on the Poisson random vector $X$, and obtain two independent identically distributed random vectors $X' = (X_1', \ldots, X_S')$ and $X'' = (X_1'', \ldots, X_S'')$, such that each component $X_i'$ of $X'$ has distribution $\mathsf{Poi}(np_i/2)$, and all coordinates of $X'$ are independent. For each coordinate $i$, the splitting process generates a sequence of independent fair coin flips and assigns each of the $X_i$ observations to $X_i'$ or $X_i''$ accordingly; all these coin flips are conditionally independent given our observation $X$. The “splitted” empirical probabilities are defined accordingly. To simplify notation, we redefine $n/2$ as $n$ to ensure that $X_i', X_i'' \sim \mathsf{Poi}(np_i)$. We emphasize that the sample splitting approach is not needed for the actual estimator construction.
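The splitting step admits a simple implementation via binomial thinning; the following is a minimal sketch of this operation (the factor-of-two bookkeeping follows the convention above, and the example counts are arbitrary).

```python
import numpy as np

def poisson_split(X, rng):
    """Split X with X_i ~ Poi(n p_i) into two independent vectors X1, X2 with
    X1_i, X2_i ~ Poi(n p_i / 2), by assigning each of the X_i observations to
    the first or second half with an independent fair coin flip."""
    X1 = rng.binomial(X, 0.5)
    X2 = X - X1
    return X1, X2

rng = np.random.default_rng(2)
X = rng.poisson(1000 * np.array([0.01, 0.005, 0.985]))
X1, X2 = poisson_split(X, rng)
```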

We construct two set functions with the variable $x$ as input, defined as:

(24)
(25)

Here the constants appearing in (24) and (25) are universal constants that will be determined later. The set function in (24) is constructed to satisfy the following property:

Lemma 4.

Suppose . Then,

(26)

where the set function is defined in (24).

It is clear that the set defined in (25) is contained in the set defined in (24) for any $x$. The constants will be chosen later to make sure that the following three “good” events have overwhelming probability:

(27)
(28)
(29)

Here $\Rightarrow$ represents logical implication: the event $A \Rightarrow B$ is equivalent to $A^c \cup B$. The intuition behind the construction of these “good” events is as follows. Since we use the first half of the samples to classify the regime, and will later use three different estimators depending on whether $q_i$ lies to the left of, to the right of, or inside the set determined by the first-half empirical probability, it is desirable that we can infer the relationship between $p_i$ and $q_i$ from the location of $q_i$ relative to this set. The reason why these events can be controlled to have high probability is that the set in (25) is specifically designed to be a strict subset of the set in (24), and these sets are designed to satisfy Lemma 4, which ensures that the size of the set is essentially the length of the confidence interval associated with the observed empirical probability.

We have the following lemma controlling the probabilities of these events.

Lemma 5.

Denote the overall “good” event , where are defined in (27),(28),(29). Then,

(30)

where

(31)

Now we construct the estimator. In the “smooth” regime, we simply employ the plug-in approach to estimate $|p_i - q_i|$. In the “non-smooth” regime, we need to approximate $|p_i - q_i|$ by another functional which can be estimated without bias. We consider the best polynomial approximation of $|x - q_i|$ on the “non-smooth” regime, which is defined as

(32)

where $\mathsf{Poly}_K$ denotes the set of polynomials with degree no more than $K$. Once we obtain this polynomial, we can construct an unbiased estimate of it, since under Poisson sampling every polynomial of $p_i$ admits an unbiased estimator. As a result, the bias of the estimator in the “non-smooth” regime is exactly the error of the best polynomial approximation of $|x - q_i|$ on the “non-smooth” regime, which can be significantly smaller than the bias of the MLE. Lemma 6 below gives bias and variance bounds for this estimator.
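Before stating the lemma, we recall the fact underlying the unbiased-estimation step: the factorial moments of a Poisson random variable are powers of its mean, i.e., if $X \sim \mathsf{Poi}(np)$ then $\mathbb{E}[X(X-1)\cdots(X-k+1)] = (np)^k$, so every polynomial of $p$ admits an unbiased estimator. A minimal sketch follows; the polynomial coefficients are made up for illustration and are not the approximating polynomial of (32).

```python
import numpy as np

def falling_factorial(x, k):
    """x (x-1) ... (x-k+1); for X ~ Poi(lam), its expectation equals lam**k."""
    out = np.ones_like(x, dtype=float)
    for j in range(k):
        out = out * (x - j)
    return out

def unbiased_poly_estimate(X, coeffs, n):
    """Unbiased estimate of g(p) = sum_k coeffs[k] * p**k from X ~ Poi(n p),
    obtained by replacing p**k with falling_factorial(X, k) / n**k."""
    return sum(c * falling_factorial(X, k) / n**k for k, c in enumerate(coeffs))

# Monte Carlo sanity check with made-up coefficients: g(p) = 0.1 - 2p + 5p^2.
rng = np.random.default_rng(3)
n, p, coeffs = 1000, 0.003, [0.1, -2.0, 5.0]
X = rng.poisson(n * p, size=200_000)
print(unbiased_poly_estimate(X, coeffs, n).mean())   # close to g(p)
print(sum(c * p**k for k, c in enumerate(coeffs)))   # g(p) = 0.094045
```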

Lemma 6.

For with , we have

(33)
(34)

for some universal constant , where is the unique unbiased estimate of defined in (32), is defined in (24) and .

Hence, the bias in the “non-smooth” regime is improved by a logarithmic factor compared to the result in Lemma 2. In summary, we have the following construction.

Estimator Construction 1.

We use the first half of the samples to classify regimes and the second half for estimation. Denote

(35)

and define

(36)

where the set functions are given by (24) and (25), and the remaining constants are properly chosen universal constants.

The performance of this estimator is presented in the following theorem.

Theorem 2.

The estimator of Construction 1 satisfies

(37)

where the first constant can be made arbitrarily small, and the second constant can be made arbitrarily large.

Proof.

Recall the “good” events defined in (27),(28),(29) and define . We have

(38)
(39)
(40)

where we have applied Lemma 5.

Define the random variables

(41)
(42)
(43)

where the random index sets are defined as

(44)
(45)
(46)

The indices are independent of the random variables . Since

(47)

it follows from Cauchy’s inequality that

(48)

where is defined in (31).

It follows from the law of total variance that

(49)
(50)
(51)
(52)

where we have used the fact that with probability one and Lemma 3. Similarly we have .

Regarding , it follows from Lemma 6 and the mutual independence of that

(53)
(54)

where .

Hence,

(55)

where is defined in (31) and .

The following corollary is immediate.

Corollary 2.

Suppose , we have

(56)

In particular, if , we have

(57)
Proof.

For the first part, when , one may choose large enough to ensure that . When , one may choose small enough to ensure that .

The second part is proved upon noticing that

(58)
(59)

II-C Minimax lower bound

It was shown in Valiant and Valiant [5] that if $q$ is the uniform distribution and $n \lesssim S/\ln S$, then the minimax risk of estimating $\|p - q\|_1$ remains of constant order.

We prove a minimax lower bound for every $q$, and show that the performance achieved by our estimator in Theorem 2 is minimax rate-optimal for every fixed $q$.

The main tool we employ is the so-called method of two fuzzy hypotheses presented in Tsybakov [21]. Suppose we observe a random vector $Z \in (\mathcal{Z}, \mathcal{A})$ which has distribution $P_\theta$, where $\theta \in \Theta$. Let $\sigma_0$ and $\sigma_1$ be two prior distributions supported on $\Theta$. Write $F_i$ for the marginal distribution of $Z$ when the prior is $\sigma_i$, for $i = 0, 1$. Let $\hat{T} = \hat{T}(Z)$ be an arbitrary estimator of a functional $T(\theta)$ based on $Z$. We have the following general minimax lower bound.

Lemma 7.

[21, Thm. 2.15] Given the setting above, suppose there exist $\zeta \in \mathbb{R}$, $s > 0$, and $0 \le \beta_0, \beta_1 < 1$ such that

$$\sigma_0\left(\theta : T(\theta) \le \zeta - s\right) \ge 1 - \beta_0, \qquad (60)$$
$$\sigma_1\left(\theta : T(\theta) \ge \zeta + s\right) \ge 1 - \beta_1. \qquad (61)$$

If $V(F_1, F_0) \le \eta < 1$, then

$$\inf_{\hat{T}} \sup_{\theta \in \Theta} P_\theta\left(|\hat{T} - T(\theta)| \ge s\right) \ge \frac{1 - \eta - \beta_0 - \beta_1}{2}, \qquad (62)$$

where $F_0, F_1$ are the marginal distributions of $Z$ when the priors are $\sigma_0, \sigma_1$, respectively.

Here $V(P, Q)$ is the total variation distance between two probability measures $P$ and $Q$ on the measurable space $(\mathcal{Z}, \mathcal{A})$. Concretely, we have

$$V(P, Q) = \sup_{A \in \mathcal{A}} |P(A) - Q(A)| = \frac{1}{2}\int |f_P - f_Q|\, d\nu, \qquad (63)$$

where $f_P = \frac{dP}{d\nu}$, $f_Q = \frac{dQ}{d\nu}$, and $\nu$ is a dominating measure such that $P \ll \nu$ and $Q \ll \nu$.
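For discrete measures, (63) reduces to half the $L_1$ distance between the probability mass functions; the following minimal sketch (with arbitrary example measures) checks this against the supremum form.

```python
import numpy as np

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.2, 0.6])

tv = 0.5 * np.abs(P - Q).sum()                       # V(P, Q) via (63)
subsets = [(), (0,), (1,), (2,), (0, 1), (0, 2), (1, 2), (0, 1, 2)]
tv_sup = max(abs(P[list(A)].sum() - Q[list(A)].sum()) for A in subsets)
print(tv, tv_sup)                                    # both equal 0.4
```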

The following lemma was shown in Cai and Low [12]:

Lemma 8.

For any given even integer $L > 0$, there exist two probability measures $\nu_0$ and $\nu_1$ on $[-1, 1]$ that satisfy the following conditions:

  1. $\nu_0$ and $\nu_1$ are symmetric around $0$;