Dualizing Le Cam’s method, with applications to estimating the unseens
One of the most commonly used techniques for proving statistical lower bounds, Le Cam’s method, has been the method of choice for functional estimation. This paper aims at explaining the effectiveness of Le Cam’s method from an optimization perspective. Under a variety of settings it is shown that the maximization problem that searches for the best lower bound provided by Le Cam’s method, upon dualizing, becomes a minimization problem that optimizes the bias-variance tradeoff among a family of estimators. While Le Cam’s method can be used with an arbitrary distance, our duality result applies specifically to the $\chi^2$-divergence, thus singling it out as a natural choice for quadratic risk. For estimating linear functionals of a distribution our work strengthens prior results of Donoho-Liu [DL91] (for quadratic loss) by dropping the Hölderian assumption on the modulus of continuity. For exponential families our results improve those of Juditsky-Nemirovski [JN09] by characterizing the minimax risk for the quadratic loss under weaker assumptions on the exponential family.
We also provide an extension to the high-dimensional setting for estimating separable functionals. Notably, coupled with tools from complex analysis, this method is particularly effective for characterizing the “elbow effect” – the phase transition from parametric to nonparametric rates. As the main application of our methodology, we consider three problems in the area of “estimating the unseens”, recovering the prior result of [PSW17] on population recovery and, in addition, obtaining two new ones:
Distinct elements problem: Randomly sampling a fraction of colored balls from an urn containing balls in total, the optimal normalized estimation error of the number of distinct colors in the urn is within logarithmic factors of , exhibiting an elbow at ;
Fisher’s species problem: Given independent samples drawn from an unknown distribution, the optimal normalized prediction error of the number of unseen symbols in the next (unobserved) samples is within logarithmic factors of , exhibiting an elbow at .
- 1 Introduction
- 2 Linear functionals
- 3 Extension 1: High-dimensional functional estimation
- 4 Extension 2: Exponential families
- 5 Additional proofs
- A Auxiliary results from convex analysis
- B Proof of technical results
One of the most commonly used tools for proving statistical lower bounds is Le Cam’s method (or the two-point method) [LC86]. To explain its rationale, consider the following general setup of functional estimation: Let $X_1,\ldots,X_n$ be iid samples drawn from some distribution $P_\theta$ parameterized by $\theta\in\Theta$. Given these samples, the goal is to estimate some real-valued functional $T(\theta)$. The minimax quadratic risk (mean-squared error) is defined as follows
\begin{equation}
R^*(n) \triangleq \inf_{\hat T}\sup_{\theta\in\Theta}\mathbb{E}_\theta\big[(\hat T-T(\theta))^2\big],
\end{equation}
where the infimum is taken over all estimators $\hat T$ that are measurable with respect to $X_1,\ldots,X_n$. Then Le Cam’s method yields the following lower bound (cf., e.g., [Tsy09, Sec 2.3]):
\begin{equation}
R^*(n) \ge c(\delta)\sup_{\theta,\theta'\in\Theta}\big\{(T(\theta)-T(\theta'))^2 : \mathrm{TV}(P_\theta^{\otimes n},P_{\theta'}^{\otimes n})\le 1-\delta\big\},
\label{eq:LC-intro}
\end{equation}
where $\delta$ is typically chosen to be a small constant and $c(\delta)$ is some constant that only depends on $\delta$; the rationale is that testing is easier (statistically) than estimation. Indeed, the constraint in \prettyref{eq:LC-intro} ensures that the two hypotheses cannot be reliably tested and hence the worst-case statistical risk is lower bounded by the separation of the functional values. A more convenient form that avoids product distributions is the following in terms of the $\chi^2$-divergence:
\begin{equation}
R^*(n) \ge c\sup_{\theta,\theta'\in\Theta}\Big\{(T(\theta)-T(\theta'))^2 : \chi^2(P_\theta\|P_{\theta'})\le \frac{1}{n}\Big\},
\label{eq:LC-intro-chi2}
\end{equation}
for some absolute constant $c$, thanks to a standard bound on the total variation in terms of the $\chi^2$-divergence [Tsy09, Sec. 2.4] and the tensorization property $\chi^2(P^{\otimes n}\|Q^{\otimes n})=(1+\chi^2(P\|Q))^n-1$. Similar lower bounds can be obtained from \prettyref{eq:LC-intro} by using the squared Hellinger distance $H^2$ or the Kullback-Leibler (KL) divergence in place of the $\chi^2$-divergence; nevertheless, the $\chi^2$-version is perhaps the most popular since the second-moment nature of the $\chi^2$-divergence frequently renders it easy to compute. In virtually all problems of functional estimation, the lower bound follows from applying \prettyref{eq:LC-intro-chi2} or variants thereof (such as the version with two priors), which often turn out to be rate-optimal.
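The tensorization property invoked above, $\chi^2(P^{\otimes n}\|Q^{\otimes n})=(1+\chi^2(P\|Q))^n-1$, can be checked numerically on a small finite alphabet. The following is a minimal sketch; the distributions `P`, `Q` and the helper names are illustrative assumptions, not objects from the paper. It also checks that a constraint of the form $\chi^2(P\|Q)\le 1/n$ keeps the $\chi^2$-divergence between the $n$-fold products below $e-1$, which is why a single-letter constraint suffices.

```python
from itertools import product
from math import e, isclose, prod


def chi2(p, q):
    """chi^2(P||Q) = sum_x (p(x) - q(x))^2 / q(x), for finite p, q with q > 0."""
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))


def product_dist(p, n):
    """Probability vector of the n-fold product of p, enumerated over all n-tuples."""
    return [prod(p[i] for i in idx) for idx in product(range(len(p)), repeat=n)]


# Hypothetical example distributions on a 3-letter alphabet.
P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]
n = 4

# chi^2 between the product measures, computed by brute-force enumeration.
direct = chi2(product_dist(P, n), product_dist(Q, n))
# Closed form given by the tensorization property.
closed_form = (1 + chi2(P, Q)) ** n - 1

# If chi^2(P||Q) <= 1/n, the product chi^2 is at most (1 + 1/n)^n - 1 <= e - 1.
worst_case_under_constraint = (1 + 1 / n) ** n - 1

print(isclose(direct, closed_form, rel_tol=1e-9))
print(worst_case_under_constraint <= e - 1)
```

Brute-force enumeration is exponential in $n$ and is used here only to confirm the identity on a toy example.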
This paper aims at explaining the effectiveness of Le Cam’s method, specifically the version \prettyref{eq:LC-intro-chi2} based on the $\chi^2$-divergence, from an optimization perspective. The main observation is the following: For certain problems such as estimating linear functionals in the density model (with possibly indirect observations), under suitable conditions, the maximization in \prettyref{eq:LC-intro-chi2} can be viewed as a convex optimization problem, whose dual problem corresponds to (within constant factors) a minimization problem that optimizes the bias-variance tradeoff. This perspective yields the following characterization of the minimax rate in terms of the $\chi^2$-modulus of continuity:\footnote{Throughout the paper, for any sequences $\{a_n\}$ and $\{b_n\}$ of positive numbers, we write $a_n\gtrsim b_n$ if $a_n\ge cb_n$ holds for all $n$ and some absolute constant $c>0$, $a_n\lesssim b_n$ if $b_n\gtrsim a_n$, and $a_n\asymp b_n$ if both $a_n\gtrsim b_n$ and $a_n\lesssim b_n$ hold.}
which strengthens the prior result of Donoho-Liu [DL91] for linear functionals. In addition, we show that the result holds for exponential families for estimating functionals linear in the mean parameters, where the $\chi^2$-divergence in \prettyref{eq:mainresult-intro} is replaced by the squared Hellinger distance, extending the result of Juditsky-Nemirovski [JN09] to quadratic risk and relaxing the assumptions. See \prettyref{sec:related} for more discussion.
We also provide an extension to the high-dimensional setting for estimating separable functionals, where the parameter is a high-dimensional vector belonging to the parameter space defined by moment constraint for some cost function . Given observations drawn from , the goal is to estimate a separable functional . Under certain assumptions, we show that the minimax quadratic risk is within constant factors of
where the supremum is taken over all pairs of priors in the constraint set and denotes the mixture distribution. This result gives conditions under which the generalized version of Le Cam’s method using two priors (also known as fuzzy hypotheses testing [Tsy09, Sec. 2.7.4]) is tight.
The duality view in this paper is in fact natural. Indeed, the classical minimax theorem in decision theory states that, under regularity assumptions,
This can also be interpreted from the duality perspective,\footnote{This follows from standard arguments in optimization by rewriting the left-hand side as a constrained minimization problem, whose Lagrange multipliers correspond to priors. When both the parameter and the estimator are finitely-valued, \prettyref{eq:minimax} is simply the duality of linear programming (LP).} where the primal variables correspond to (randomized) estimators and the dual variables correspond to priors. However, the duality view of \prettyref{eq:minimax} is unwieldy except in special cases or simple univariate problems, because finding the least favorable prior that maximizes the Bayes risk is a difficult infinite-dimensional optimization problem. In this vein, results such as \prettyref{eq:mainresult-intro} and \prettyref{eq:LC-twoprior} can be viewed as approximate versions of the general minimax theorem that apply to functional estimation.
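To make the primal/dual correspondence concrete, here is a toy sketch of the finite case, where the minimax identity reduces to LP duality. The $2\times 2$ loss matrix and the grid search are illustrative assumptions chosen so that the optimum lies exactly on the grid; a real instance would be solved with an LP solver. Randomized estimators play the role of primal variables and priors the role of dual variables.

```python
# LOSS[a][t]: loss of action (estimate) a when the true parameter is t.
# Hypothetical 0-1 loss with two actions and two parameter values.
LOSS = [[0.0, 1.0],
        [1.0, 0.0]]

GRID = [k / 100 for k in range(101)]  # candidate mixing weights in [0, 1]


def minimax_value(loss):
    """min over randomized estimators (p, 1 - p) of the worst-case risk over t."""
    return min(
        max(p * loss[0][t] + (1 - p) * loss[1][t] for t in range(2))
        for p in GRID
    )


def maximin_value(loss):
    """max over priors (w, 1 - w) on the parameter of the Bayes risk."""
    return max(
        min(w * loss[a][0] + (1 - w) * loss[a][1] for a in range(2))
        for w in GRID
    )


primal = minimax_value(LOSS)
dual = maximin_value(LOSS)

# Without randomization and without priors the two sides differ.
deterministic_minimax = min(max(row) for row in LOSS)
pointwise_maximin = max(min(LOSS[a][t] for a in range(2)) for t in range(2))

print(primal, dual, deterministic_minimax, pointwise_maximin)
```

In this example the randomized minimax risk and the best Bayes risk coincide, whereas restricting to deterministic rules and point-mass priors leaves a gap, which is the reason the duality is stated over randomized estimators and priors.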
To produce concrete results on rates of convergence, one needs to evaluate the value of maximization programs such as \prettyref{eq:LC-twoprior}. Using tools from complex analysis, we do so for a number of problems and obtain new results on the sharp rate of convergence, characterizing, in particular, the “elbow effect”, that is, the phase transition from parametric to nonparametric rates. As the main application of our methodology, we consider three problems in the area of “estimating the unseens”, namely, population recovery, the distinct elements problem, and Fisher’s species problem. In addition to recovering the prior result of [PSW17] on the sharp rate of population recovery, we establish the following new results:
Distinct elements problem: Randomly sampling a fraction of colored balls from an urn containing balls in total, the goal is to estimate the number of distinct colors in the urn [RRSS09, Val11, WY18]. We show that, as , the optimal normalized estimation error is within logarithmic factors of , exhibiting an elbow at ;
Fisher’s species problem: Given independent samples drawn from an unknown distribution, the goal is to predict the number of unseen symbols in the next (unobserved) samples [FCW43, ET76, OSW16]. We show that, as , the optimal normalized prediction error is within logarithmic factors of , exhibiting an elbow at .
We emphasize that in obtaining the above results, we do not demonstrate an explicit choice of the optimal estimator; instead, capitalizing on the duality between the minimization problem over the linear estimators and the maximization that produces the best Le Cam lower bound, we bound the value of the dual problem from above, thereby showing the achievability of the optimal rates. This is conceptually distinct from previous explicit constructions of linear estimators such as kernel-based methods for density estimation [Tsy09] or smoothed estimators in the context of species problems [OSW16] (which do not attain the optimal rate). Nevertheless, the estimators can be constructed in polynomial time as solutions to certain linear programs.
Before proceeding to the discussion of the related literature, let us mention that the duality view in this paper need not be limited to functional estimation. In a companion paper [JPW19] we extend the methods to estimating the distribution itself (with respect to the total variation loss) in the context of the distinct elements problem. The connection to functional estimation is that estimating the distribution in total variation is equivalent to simultaneously estimating all bounded linear functionals; this view enables us to analyze minimum-distance estimators in the duality framework.
1.1 Related work
A celebrated result of Donoho-Liu [DL91] relates the minimax rate of estimating linear functionals to the Hellinger modulus of continuity. For the density estimation models, under certain assumptions, it is shown that the minimax rate coincides with the right-hand side of \prettyref{eq:LC-intro-chi2} with $H^2$ in place of the $\chi^2$-divergence.\footnote{The resulting moduli of continuity are in fact the same up to constant factors, as we show in \prettyref{prop:deltaproperty}.} However, the constant factors may not be universal and may depend on the problem or its hyper-parameters, thus precluding the application to high-dimensional problems. More importantly, the proof (of the upper bound) in [DL91] is based on constructing an estimator via pairwise hypotheses tests, by means of a binary search on the functional value. While this method can deal with general loss functions, its limitation is that it assumes the Hölderianity of the modulus of continuity in order to show tightness. We refer the readers to \prettyref{sec:dl_compare} for a detailed comparison of the results.
The prior work that is closest to ours in spirit is that of Juditsky-Nemirovski [JN09], where the main technology was also convex optimization and the minimax theorem. As opposed to the squared loss, they considered the -quantile loss and the corresponding minimax risk:
For exponential families, under certain convexity assumptions, it is shown (cf. [JN09, Theorem 3.1 and Proposition 3.1]) that is within absolute constant factors of the Hellinger modulus of continuity, provided that . We extend this result to quadratic risk under more relaxed assumptions (see \prettyrefsec:jn_compare for details). Note that the quadratic risk result cannot be obtained through the usual route of integrating the high-probability risk bound, since the estimator for -quantile loss potentially depends on . On the other hand, one can deduce the result on -quantile loss for constant from that for quadratic risk by applying the Markov inequality.444However, for small the results of [JN09] are not implied by the quadratic risk result in this paper. Notwithstanding these improvements, the main advantage of our approach is its versatility, as witnessed, e.g., by the treatment of the high-dimensional case.
Other examples that operationalized the duality perspective for statistical estimation include the following:
The linear programming (LP) duality between the risk of the optimal linear estimator and the best Le Cam lower bound based on the total variation was recognized in [PSW17, Theorem 4] for linear functional estimation in discrete problems; this is the precursor to the present paper. However, this result in general has a -gap in the convergence rate, which was mended in an ad hoc manner in [PSW17] for specific problems. In fact, similar proof technique was previously employed by Moitra-Saks in [MS13] to upper bound the value of the dual LP in order to establish statistical upper bounds, although the connection that the dual program in fact corresponds to the minimax lower bound was missing.
The duality between the best polynomial approximation and the moment matching problem was leveraged in [WY16, WY19, JVHW15] for estimating symmetric functionals, such as the Shannon entropy and support size, of distributions supported on large domains. As opposed to optimizing over general linear estimators, the construction uses approximating polynomials whose uniform approximation error bounds the bias. A matching minimax lower bound is obtained by using the solution of the dual problem (moment matching) to construct priors. In a similar context of estimating distribution functionals, general sample complexity bounds are obtained in [VV11] based on linear programming duality.
The rest of the paper is organized as follows. \prettyref{sec:linear} presents the main result for estimating linear functionals of a distribution (with possibly indirect observations) under a general setup. We provide two examples: population recovery (\prettyref{sec:poprec}) and density estimation (\prettyref{sec:density}), which are finite-dimensional and infinite-dimensional applications of the main theorem, respectively. \prettyref{sec:hd} extends the result to estimating separable functionals in high-dimensional models. The methods are then applied to the distinct elements problem (\prettyref{sec:de}) and Fisher’s species extrapolation problem (\prettyref{sec:species}) to yield sharp minimax rates of convergence. Finally, in \prettyref{sec:exp} we extend the result to exponential families under weaker assumptions than those in [JN09]. To present a simple motivating example and to exhibit the duality perspective in a familiar problem, in \prettyref{sec:whitenoise} we revisit the classical Gaussian white noise model and re-derive the classical result of Ibragimov and Has’minskii [IH84]. For readers unfamiliar with this type of argument, it might be helpful to start with \prettyref{sec:whitenoise}.
2 Linear functionals
Let and be measurable spaces and a transition probability kernel between them. Denote by the set of all probability distributions on and let be a (given) subset of . Let be a functional of . We define the minimax rate of estimating using samples as:
When is the identity kernel, the samples are simply drawn from ; otherwise, the samples are indirect observations.
We also define the modulus of continuity of functional with respect to various distances (and quasi-distances) between distributions :
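where $\mathrm{TV}(P,Q)=\sup_{E}|P(E)-Q(E)|$ is the total variation, $H^2(P,Q)=\int\left(\sqrt{\frac{dP}{d\mu}}-\sqrt{\frac{dQ}{d\mu}}\right)^2 d\mu$ is the squared Hellinger distance (with $\mu$ being any dominating measure s.t. $P\ll\mu$ and $Q\ll\mu$, e.g. $\mu=P+Q$). Finally, the $\chi^2$-divergence is defined as $\chi^2(P\|Q)=\int\left(\frac{dP}{dQ}\right)^2 dQ-1$ if $P\ll Q$ and otherwise $\chi^2(P\|Q)=\infty$. We note that $\mathrm{TV}$ and $H$ are distances on the space of probability distributions. For a signed measure $\nu$ its total variation norm is denoted $\|\nu\|_{\mathrm{TV}}$, so that $\mathrm{TV}(P,Q)=\frac{1}{2}\|P-Q\|_{\mathrm{TV}}$.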
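For finite alphabets the three quantities above are simple sums, and the classical comparison inequalities between them can be checked directly. The sketch below assumes the unnormalized convention $H^2(P,Q)=\sum_x(\sqrt{p(x)}-\sqrt{q(x)})^2$ (so $H^2\in[0,2]$) and $\mathrm{TV}=\frac12\sum_x|p(x)-q(x)|$; the example distributions and function names are illustrative.

```python
from math import inf, sqrt


def tv(p, q):
    """Total variation: half the L1 distance between the probability vectors."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def hellinger2(p, q):
    """Squared Hellinger distance, unnormalized convention (values in [0, 2])."""
    return sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(p, q))


def chi2(p, q):
    """chi^2(P||Q); infinite when P is not absolutely continuous w.r.t. Q."""
    if any(b == 0 and a > 0 for a, b in zip(p, q)):
        return inf
    return sum((a - b) ** 2 / b for a, b in zip(p, q) if b > 0)


# Hypothetical example distributions.
P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]

t, h2, c2 = tv(P, Q), hellinger2(P, Q), chi2(P, Q)

# Classical comparisons (under the conventions stated above):
#   H^2 / 2 <= TV <= H * sqrt(1 - H^2 / 4),   H^2 <= chi^2,   TV <= sqrt(chi^2) / 2.
print(0.5 * h2 <= t <= sqrt(h2) * sqrt(1 - h2 / 4))
print(h2 <= c2)
print(t <= 0.5 * sqrt(c2))
```

The asymmetry of the $\chi^2$-divergence shows up in the support check: `chi2([0.5, 0.5], [1.0, 0.0])` is infinite, while the reverse direction is finite.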
2.1 General properties of
Let be affine in . Then
(Concavity) and are concave.
(Subadditivity) For any and we have:
(10) (11) (12)
(Comparison of various ’s) For all we have
(Superlinearity) Let , then
The first property follows from the fact that and are both convex in the pair (in fact they are distances). The second one for TV and follows from the first and the fact that , while for it follows from the convexity of and hence the concavity of . For the third, we recall standard bounds (cf. e.g. [Tsy09, Sec. 2.4.1]): For any pair of distributions we have
Thus, for any that are feasible for the problem, then and are feasible for the problem, since according to (17), and satisfy .
Finally, (14) follows from \prettyrefeq:delchi_sub and the observation that since by definition. ∎
2.2 Main result: Minimax rate for linear functionals
Our main result is the following:
Suppose that satisfy the following assumptions:
The functional is affine;
The set is convex;
There exists a vector space of functions on such that contains constants and is dense in for every ;
There exists a topology on such that:
It is coarse enough that is compact;
It is fine enough that , and are continuous in for all .
Some remarks are in order:
If and are finite, then can be taken to be all functions on and assumptions A3 and A4 are automatic.
If is a normal topological space, then every probability measure is regular [DS58, IV.6.2] and the set of all bounded continuous functions is dense in , cf. [DS58, IV.8.19]. Other convenient choices of are all Lipschitz functions (and Wasserstein -convergence), all polynomials, trigonometric polynomials or sums of exponentials.
The continuity of under the weak topology on can be assured by demanding a (strong Feller) property for kernel : For any bounded measurable , is bounded continuous.
The lower bound simply follows from the -version of Le Cam’s method. Consider a pair of distributions such that for some to be optimized. From the tensorization property of -divergence we have
Using Brown-Low’s two-point lower bound [BL96] and optimizing over the pair , we have
Using \prettyrefeq:delchi_sub and optimizing over , we obtain
To prove an upper bound we consider estimators of the form
where . For convenience we denote . We analyze the quadratic risk of this estimator by decomposing it into bias and variance part:
Taking worst-case and optimizing over we get
The proof is completed by applying the next proposition. ∎
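The bias-variance decomposition used in the upper bound can be illustrated on a finite alphabet. The sketch below assumes an empirical-mean estimator $\hat T_g=\frac1n\sum_{i=1}^n g(X_i)$ and verifies, by exact enumeration over all samples, that its mean-squared error equals the squared bias plus $\mathrm{Var}(g(X))/n$. The distribution `P`, the function `g`, and the target value are hypothetical choices made only for the illustration.

```python
from itertools import product
from math import isclose, prod


def exact_mse(p, g, target, n):
    """E[(mean of g(X_i) - target)^2] for X_1..X_n iid ~ p, by full enumeration."""
    total = 0.0
    for xs in product(range(len(p)), repeat=n):
        prob = prod(p[x] for x in xs)
        estimate = sum(g[x] for x in xs) / n
        total += prob * (estimate - target) ** 2
    return total


def bias_and_variance(p, g, target, n):
    """Squared bias and variance of the empirical-mean estimator."""
    mean_g = sum(pi * gi for pi, gi in zip(p, g))
    var_g = sum(pi * (gi - mean_g) ** 2 for pi, gi in zip(p, g))
    return (mean_g - target) ** 2, var_g / n


P = [0.5, 0.3, 0.2]      # hypothetical distribution of a single observation
g = [0.9, 0.1, -0.05]    # hypothetical (slightly biased) choice of g
target = P[0]            # a linear functional: the mass of symbol 0
n = 3

mse = exact_mse(P, g, target, n)
bias_sq, variance = bias_and_variance(P, g, target, n)
print(isclose(mse, bias_sq + variance, rel_tol=1e-9))
```

Choosing `g` as the indicator of symbol 0 removes the bias entirely; allowing biased `g` is what lets the optimization trade bias against variance.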
Under the conditions of Theorem 2, we have
Furthermore, the supremum over in the definition of is achieved: There exist s.t. and .
Before proving the proposition, we recall the minimax theorem due to Ky Fan [Fan53, Theorem 2]:555There it is stated for Hausdorff , but this condition is not necessary, e.g., [BZ86]. Note that in defining convex-concave-like property we mandate it hold for all in (26), but it is also known that minimax theorem holds for functions that only satisfy, e.g., , see [Kön68].
Theorem 4 (Ky Fan).
Let be a compact space and an arbitrary set (not topologized). Let be such that for every , is upper semicontinuous on . If is concave-convex-like on , then
We remind that the function is concave-convex-like on if a) for any two and there exists such that for all :
and b) for any two and there exists such that for all :
Proof of Proposition 3.
We aim to apply the minimax theorem in order to get a more convenient expression for . The function
satisfies all the conditions except for the concavity in due to the last term (it is convex instead of concave). To mend this consider the following upper bound
Indeed, if , take ; otherwise, take .
So letting we consider the following function on :
We claim it is concave-convex-like. Convexity in is easy: the term is clearly convex, whereas the convexity of follows from observation that without loss of generality we may assume and then is a norm (hence convex).
We proceed to checking the concave-like property of in . Define for convenience,
It is clear that is affine, whereas is concave. Indeed, is a concave and increasing scalar function, whereas is concave in . So for we have
Consider and and . First, suppose that . We see that in this case
And set . We claim that
Indeed, we have from affinity of :
Therefore, we have
Knowing that is concave-convex-like, for applying the minimax theorem we only need to check that is continuous for all and that is compact. This is satisfied by the assumption of Theorem 2. Applying \prettyrefthm:minimax, we have
Next, to evaluate the rightmost term, fix and consider the optimization
We claim that
which implies the desired \prettyrefeq:delachi by continuing (29):
To prove (31), we first recall that contains constants. Thus if , we have that the first term in (30) can be driven to , while keeping the second term zero, by taking and . So fix . Recall a variational characterization of the -divergence:666For completeness, here is short proof of (32). First, assume . Denoting and assuming without loss of generality that we have , which completes the proof since . For the other direction, simply approximate by elements of . If , set and let .
where is any subset that is dense in . Thus, if (in particular, if ) there must exists such that
and thus , while is achievable by taking . ∎
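The variational characterization of the $\chi^2$-divergence used in this proof can be checked numerically. The sketch assumes the standard form $\chi^2(P\|Q)=\sup_g\frac{(\mathbb{E}_Pg-\mathbb{E}_Qg)^2}{\mathrm{Var}_Q(g)}$, with the supremum attained at the likelihood ratio $g=dP/dQ$; the distributions and helper names are illustrative.

```python
import random


def chi2(p, q):
    """chi^2(P||Q) for finite distributions with q > 0."""
    return sum((a - b) ** 2 / b for a, b in zip(p, q))


def variational_ratio(p, q, g):
    """(E_P g - E_Q g)^2 / Var_Q(g) for a test function g on the alphabet."""
    mean_p = sum(a * v for a, v in zip(p, g))
    mean_q = sum(b * v for b, v in zip(q, g))
    var_q = sum(b * (v - mean_q) ** 2 for b, v in zip(q, g))
    return (mean_p - mean_q) ** 2 / var_q


P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]

# The likelihood ratio attains the supremum.
likelihood_ratio = [a / b for a, b in zip(P, Q)]
attained = variational_ratio(P, Q, likelihood_ratio)

# Random test functions never exceed chi^2(P||Q), by Cauchy-Schwarz.
rng = random.Random(0)
random_ratios = [
    variational_ratio(P, Q, [rng.uniform(-1, 1) for _ in P])
    for _ in range(1000)
]

print(attained, chi2(P, Q), max(random_ratios))
```

The ratio is invariant to shifting and rescaling `g`, which is why the proof can normalize $\mathbb{E}_Qg=0$ without loss of generality.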
2.3 Application: Population recovery
For a positive integer , consider the following three specializations of Theorem 2, namely the following tuples :
, , where is the all-zero string, , , and the kernel is given by
(i.e. each coordinate of is erased independently with probability ).
, , , and equals
where stands for the binomial distribution with independent trials and success probability .
, , , and equals
We will denote the minimax quadratic risk for estimating based on iid samples by and the modulus of continuity function by , for , respectively.
The first model corresponds to the so-called “lossy population recovery” – a problem initially considered in [DRWY12, WY12] in the context of learning DNFs with partial observations, and further investigated in [BIMP13, MS13, LZ15, DST16, PSW17]. This problem can also be viewed as a special instance of learning mixtures of discrete distributions in the framework of [KMR94]. Here the parameter is an arbitrary distribution on the -dimensional Hamming space . For iid random binary strings drawn from , we observe their erased version, where each bit is erased with probability . The goal is to estimate the weight of the all-zero string . It has been shown in [DRWY12] (cf. [PSW17, Appendix A]) estimating the entire distribution in the sup norm can be reduced to estimating in terms of both sample and time complexity.
It is easy to see that from permutation invariance, in the context of , to estimate it is sufficient to summarize each sample into its number of 1’s and 0’s. Correspondingly, the set of distributions in the definition of the minimax risk can be safely restricted to permutation invariant distributions on . With these reductions we arrive at the second model which is statistically equivalent. Thus,
The third setting corresponds to ignoring the number of 0’s in the second setting (i.e. restricting to estimators that only depend on the number of ’s in each sample). Since we reduce the observation space, it is clear that
In fact, the reverse direction is almost true, since the number of 0’s provides negligible information for estimating [PSW17].
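The three observation models can be simulated in a few lines. This is a sketch under stated assumptions: the erasure symbol `"?"`, the parameter name `eps` for the erasure probability, and the function names are all illustrative. The first function applies the erasure channel coordinate-wise, the second reduces an observation to its counts of surviving 1's and 0's (the second model), and the third keeps only the number of surviving 1's (the third model).

```python
import random

ERASED = "?"  # assumed placeholder for an erased coordinate


def erase(bits, eps, rng):
    """Erase each coordinate independently with probability eps."""
    return [ERASED if rng.random() < eps else b for b in bits]


def ones_and_zeros(observation):
    """Sufficient summary under permutation invariance: surviving 1's and 0's."""
    return observation.count(1), observation.count(0)


def ones_only(observation):
    """Coarser summary that ignores the number of surviving 0's."""
    return observation.count(1)


rng = random.Random(0)
x = [1, 0, 1, 1, 0, 0, 0, 1]   # a hypothetical binary string
y = erase(x, eps=0.3, rng=rng)

print(y, ones_and_zeros(y), ones_only(y))
```

Each 1 in `x` survives independently with probability `1 - eps`, so the number of surviving 1's is binomial with as many trials as there are 1's in the original string, matching the binomial description of the reduced models.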
The minimax risk of population recovery has been characterized within logarithmic factors in [PSW17]. Next we deduce this result from the general \prettyrefth:linear, which boils down to characterizing the function. The following result can be distilled from [PSW17] (a proof is given in \prettyrefapp:lmm for completeness):
For any we have
Conversely, for there exists and such that
provided that and . Furthermore, if then also
Applying the general \prettyrefth:linear together with \prettyreflmm:horo, we obtain the following characterization of the minimax risks, where the rate of convergence exhibits an elbow effect at erasure probability :
Corollary 6 ([PSW17]).
For all three minimax risks , the following holds:
If , then for any ,
If , then there exists a constant such that we have
where the lower bound holds provided that .
2.4 Application: Density estimation
As another application of \prettyrefth:linear, we consider the classical setting of density estimation under smoothness conditions. For simplicity, we focus on the one-dimensional setting where is a probability density function on and belongs to the Hölder class , namely, for any . Given iid samples drawn from , the goal is to estimate the value of the density at point zero . So the minimax risk is given by
We now verify that this setting fulfills the assumptions of \prettyrefth:linear. First, we have , the identity kernel . We take to be all continuous functions on . Note that by identifying a measure on with its density , we can set and view as a subset of :
If we endow and with the topology of uniform convergence, then becomes a closed convex subset of and the Arzela-Ascoli theorem [DS58, IV.6.7] implies that is in fact compact. Finally, it is clear that , and are all continuous on for any .
So all assumptions A1-A4 of the theorem are satisfied and the minimax quadratic risk is determined within absolute constant factors by the $\chi^2$-modulus of continuity. It is well-known that the modulus of continuity here satisfies the following (a proof is given in \prettyref{app:lmm} for completeness):
There exist constants depending on and , such that for all ,
Applying \prettyrefth:linear, we recover the classical result:
Furthermore, \prettyref{th:linear} ensures that empirical-mean estimators of the form $\frac{1}{n}\sum_{i=1}^n g(X_i)$ are rate optimal for some appropriately chosen function $g$. Indeed, kernel density estimates are of this form, which achieve the minimax rate for a suitably chosen kernel and bandwidth (cf. e.g. [Tsy09, Section 1.2]).
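The remark that kernel density estimates are empirical-mean estimators can be made explicit: a kernel estimate of the density at zero is the sample average of $g(x)=\frac1hK(x/h)$. The sketch below uses a Gaussian kernel and a fixed bandwidth purely for illustration; these are assumptions, and attaining the minimax rate over a given Hölder class requires a kernel and bandwidth matched to the smoothness.

```python
from math import exp, pi, sqrt


def gaussian_kernel(u):
    """Standard Gaussian density, used here as an illustrative kernel."""
    return exp(-0.5 * u * u) / sqrt(2 * pi)


def make_g(bandwidth, kernel=gaussian_kernel):
    """Return g(x) = K(x / h) / h, so that the mean of g(X_i) is the KDE at zero."""
    return lambda x: kernel(x / bandwidth) / bandwidth


def empirical_mean(g, samples):
    """The empirical-mean estimator (1/n) * sum_i g(X_i)."""
    return sum(g(x) for x in samples) / len(samples)


samples = [-0.8, -0.1, 0.0, 0.3, 1.2]   # hypothetical observations
h = 0.5                                  # hypothetical bandwidth

estimate = empirical_mean(make_g(h), samples)

# Same quantity written as the textbook kernel density estimate at zero.
kde_at_zero = sum(gaussian_kernel((0.0 - x) / h) for x in samples) / (len(samples) * h)

print(estimate, kde_at_zero)
```

The bandwidth controls the tradeoff: a smaller `h` reduces the bias of `g` as a proxy for point evaluation but inflates its variance.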