Maximum Likelihood Estimation of Functionals of Discrete Distributions
We consider the problem of estimating functionals of discrete distributions, and focus on tight (up to universal multiplicative constants for each specific functional) nonasymptotic analysis of the worst case squared error risk of widely used estimators. We apply concentration inequalities to analyze the random fluctuation of these estimators around their expectations, and the theory of approximation using positive linear operators to analyze the deviation of their expectations from the true functional, namely their bias.
We explicitly characterize the worst case squared error risk incurred by the Maximum Likelihood Estimator (MLE) in estimating the Shannon entropy $H(P) = \sum_{i=1}^{S} -p_i \ln p_i$, and the power sum $F_\alpha(P) = \sum_{i=1}^{S} p_i^\alpha$, $\alpha > 0$, up to universal multiplicative constants for each fixed functional, for any alphabet size $S$ and sample size $n$ for which the risk may vanish. As a corollary, for Shannon entropy estimation, we show that it is necessary and sufficient to have $n \gg S$ observations for the MLE to be consistent. In addition, we establish that it is necessary and sufficient to consider $n \gg S^{1/\alpha}$ samples for the MLE to consistently estimate $F_\alpha(P)$, $0 < \alpha < 1$. The minimax rate-optimal estimators for both problems require $S / \ln S$ and $S^{1/\alpha} / \ln S$ samples, respectively, which implies that the MLE has a strictly sub-optimal sample complexity. When $1 < \alpha < 3/2$, we show that the worst-case squared error rate of convergence for the MLE is $n^{-2(\alpha - 1)}$ for infinite alphabet size, while the minimax squared error rate is $(n \ln n)^{-2(\alpha - 1)}$. When $\alpha \geq 3/2$, the MLE achieves the minimax optimal rate $n^{-1}$ regardless of the alphabet size.
As an application of the general theory, we analyze the Dirichlet prior smoothing techniques for Shannon entropy estimation. In this context, one approach is to plug the Dirichlet prior smoothed distribution into the entropy functional, while the other is to calculate the Bayes estimator for entropy under the Dirichlet prior for squared error, which is the conditional expectation. We show that in general such estimators do not improve over the maximum likelihood estimator. No matter how we tune the parameters in the Dirichlet prior, this approach cannot achieve the minimax rates in entropy estimation. The performance of the minimax rate-optimal estimator with $n$ samples is essentially at least as good as that of Dirichlet smoothed entropy estimators with $n \ln n$ samples.
Entropy and related information measures arise in information theory, statistics, machine learning, biology, neuroscience, image processing, linguistics, secrecy, ecology, physics, and finance, among other fields. Numerous inferential tasks rely on data-driven procedures to estimate these quantities (see, e.g., [1, 2, 3, 4, 5, 6]). We focus on two concrete and well-motivated examples of information measures, namely the Shannon entropy
$$H(P) \triangleq \sum_{i=1}^{S} -p_i \ln p_i$$
and the power sum $F_\alpha(P)$:
$$F_\alpha(P) \triangleq \sum_{i=1}^{S} p_i^\alpha, \quad \alpha > 0.$$
Consider estimating the Shannon entropy $H(P)$ based on $n$ i.i.d. samples drawn from an unknown discrete distribution $P$ with unknown alphabet size $S$. This problem has a rich history and has been studied extensively in fields including information theory, statistics, neuroscience, physics, psychology, and medicine. We refer the reader to  for a review. One of the most widely used estimators for this purpose is the Maximum Likelihood Estimator (MLE), which is simply the empirical entropy. The empirical entropy is an instantiation of the plug-in principle in functional estimation, where a point estimate of the parameter (the distribution in this case) is plugged into the functional to obtain an estimator of the functional. The idea of using the MLE for estimating information measures of interest (in this case entropy) is not only intuitive, but also has sound justification: asymptotic efficiency.
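As an illustration of the plug-in principle described above, the following is a minimal Python sketch of the empirical entropy. The function name `empirical_entropy` and the use of natural logarithms (entropy in nats) are assumptions made for this example.

```python
import math
from collections import Counter


def empirical_entropy(samples):
    """Plug-in (MLE) entropy estimate in nats: entropy of the empirical distribution."""
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one sample")
    counts = Counter(samples)
    # Symbols that never appear contribute nothing (convention 0 * ln 0 = 0),
    # so the estimator does not need to know the alphabet size.
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```

For instance, `empirical_entropy(["a", "b", "a", "b"])` evaluates the entropy of the empirical distribution (1/2, 1/2), which is ln 2.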
The beautiful theory of Hájek and Le Cam [11, 12, 13] shows that, as the number of observed samples grows without bound while the finite parameter dimension (e.g., alphabet size) remains fixed, the MLE performs optimally in estimating any differentiable functional when the statistical model complies with the benign LAN (Local Asymptotic Normality) condition . Thus, for finite dimensional problems, the problems of parameter and functional estimation are well understood in an asymptotic sense, and the MLE appears to be not only natural but also theoretically justified. But does it make sense to employ the MLE to estimate the entropy in most practical applications?
As it turns out, while asymptotically optimal in entropy estimation, the MLE is by no means sacrosanct in many real applications, especially in regimes where the alphabet size is comparable to, or even larger than the number of observations. It was shown that the MLE for entropy is strictly sub-optimal in the large alphabet regime [14, 15]. Therefore, classical asymptotic theory does not satisfactorily address high dimensional settings, which are becoming increasingly important in the modern era of high dimensional statistics.
There has been a wave of recent research activities focusing on analyzing existing approaches of functional estimation, as well as proposing new estimators that are provably near optimal in the large alphabet regime. Paninski  showed that the MLE needs samples to consistently estimate the Shannon entropy, and Paninski  established the existence of a (non-explicit) estimator that only required samples. It implies that the MLE is strictly sub-optimal in terms of sample complexity. It was Valiant and Valiant  who first explicitly constructed a linear programming based estimator (later modified in ) that achieves consistency in entropy estimation with samples, which they also proved to be necessary. Valiant and Valiant  constructed another approximation based estimator that achieved better theoretical properties than the linear programming ones, which was not yet shown to be minimax rate-optimal for all ranges of and . The authors  constructed the first minimax rate-optimal estimators for and based on best polynomial approximation, which are agnostic to the alphabet size . Utilizing the released MATLAB and Python packages of the estimators in , [19, 20] demonstrated that these minimax rate-optimal estimators can lead to significant performance boosts in various machine learning tasks. Wu and Yang  independently applied the best polynomial approximation idea to entropy estimation and obtained the minimax rates. However, their estimator requires the knowledge of the alphabet size . The approximation ideas proved to be very fruitful in Acharya et al. , Wu and Yang , Han, Jiao, and Weissman , Jiao, Han, and Weissman , Bu et al. , Orlitsky, Suresh, and Wu , Wu and Yang .
The main contribution of this paper is an explicit characterization of the worst case squared error risk of estimating $H(P)$ and $F_\alpha(P)$ using the MLE up to a universal multiplicative constant for each specific functional, for all ranges of $S$ and $n$ in which the risk may vanish. Understanding the benefits and limitations of the MLE in a nonasymptotic setting serves two key purposes. First, the approach is a natural benchmark for comparing other more nuanced procedures for estimation of functionals. Second, performance analysis for the MLE reveals regimes where the problem is difficult, and motivates the development of improvements, which have been validated in [14, 15, 16, 17, 18, 10, 21, 22]. As a byproduct of the analysis, we explicitly point out an equivalence between bias analysis of functional estimators using plug-in rules and approximation theory using positive linear operators. We believe these powerful tools introduced from approximation theory may have far-reaching impacts in various applications in the information theory community.
We mention that there exist numerous other approaches proposed in various disciplines to estimate entropy, many of which are difficult to analyze theoretically. Among them are the Miller–Madow bias-corrected estimator and its variants [29, 30, 31], the jackknife estimator, the shrinkage estimator, the coverage adjusted estimator, the Best Upper Bound (BUB) estimator, the B-Splines estimator, and [36, 37]. For a Bayesian statistician, a natural approach is to first impose a prior on the unknown discrete distribution before considering estimating entropy. The Dirichlet prior, being the conjugate prior to the multinomial distribution, appears to be particularly popular in the Bayesian approach to entropy estimation. Dirichlet smoothing may have two connotations in the context of entropy estimation: plugging the Dirichlet prior smoothed distribution into the entropy functional, and computing the Bayes estimator for entropy under the Dirichlet prior for squared error, which is the conditional expectation.
Nemenman, Shafee, and Bialek  argued in an intuitive way why Dirichlet prior is bad for entropy estimation and proposed to use mixtures of Dirichlet priors. Archer, Park, and Pillow  have come up with priors that perform better than the Dirichlet prior. Also see [44, 45].
Another contribution of this paper is an explicit characterization of the worst case squared error risk of estimating $H(P)$ using the Dirichlet prior plug-in approach up to a universal multiplicative constant, for all ranges of $S$ and $n$ in which the risk may vanish. We show rigorously that neither of the two approaches utilizing the Dirichlet prior results in improvements over the MLE in the large alphabet regime. Specifically, these approaches require at least $n \gg S$ samples to be consistent, while the minimax rate-optimal estimators such as the ones in  only need $n \gg S / \ln S$ samples to achieve consistency.
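The Dirichlet distribution of order $S \geq 2$ with parameters $\alpha_1, \ldots, \alpha_S > 0$ has a probability density function with respect to Lebesgue measure on the Euclidean space $\mathbb{R}^{S-1}$ given by
$$f(x_1, \ldots, x_S; \alpha_1, \ldots, \alpha_S) = \frac{1}{B(\boldsymbol{\alpha})} \prod_{i=1}^{S} x_i^{\alpha_i - 1}$$
on the open $(S-1)$-dimensional simplex defined by:
$$x_1, \ldots, x_{S-1} > 0, \quad x_1 + \cdots + x_{S-1} < 1, \quad x_S = 1 - x_1 - \cdots - x_{S-1},$$
and zero elsewhere. The normalizing constant is the multinomial Beta function, which can be expressed in terms of the Gamma function:
$$B(\boldsymbol{\alpha}) = \frac{\prod_{i=1}^{S} \Gamma(\alpha_i)}{\Gamma\left(\sum_{i=1}^{S} \alpha_i\right)}, \quad \boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_S).$$
Assuming the unknown discrete distribution $P = (p_1, \ldots, p_S)$ follows the prior distribution $\mathrm{Dir}(\boldsymbol{\alpha})$, and we observe a vector $X = (X_1, \ldots, X_S)$ with multinomial distribution $\mathrm{Multi}(n; P)$, then one can show that the posterior distribution of $P$ given $X$ is also a Dirichlet distribution with parameters
$$\boldsymbol{\alpha} + X = (\alpha_1 + X_1, \ldots, \alpha_S + X_S).$$
Furthermore, the posterior mean (conditional expectation) of $p_i$ given $X$ is given by [46, Example 5.4.4]
$$\mathbb{E}[p_i \mid X] = \frac{\alpha_i + X_i}{n + \sum_{j=1}^{S} \alpha_j}.$$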
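The posterior mean estimator is widely used in practice for various choices of $\boldsymbol{\alpha}$. For example, if $\alpha_i = \sqrt{n}/S$ for all $i$, then the corresponding posterior mean is the minimax estimator for $P$ under squared loss [46, Example 5.4.5]. However, it is no longer minimax under other loss functions such as $\ell_1$ loss, which was investigated in .
Note that the posterior mean estimator subsumes the MLE as a special case, since we can take the limit $\boldsymbol{\alpha} \to 0$ to obtain the MLE. We denote the empirical distribution by $P_n = (X_1/n, \ldots, X_S/n)$. The Dirichlet prior smoothed distribution estimate is denoted as $\hat{P}_B$, where
$$\hat{P}_B = \frac{n}{n + \sum_{i=1}^{S} \alpha_i} P_n + \frac{\sum_{i=1}^{S} \alpha_i}{n + \sum_{i=1}^{S} \alpha_i} \cdot \frac{\boldsymbol{\alpha}}{\sum_{i=1}^{S} \alpha_i}.$$
Note that the smoothed distribution $\hat{P}_B$ can be viewed as a convex combination of the empirical distribution $P_n$ and the prior distribution $\boldsymbol{\alpha} / \sum_{i=1}^{S} \alpha_i$. We call the estimator $H(\hat{P}_B)$ the Dirichlet prior smoothed plug-in estimator.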
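The smoothed plug-in construction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the smoothed probabilities are taken to be the Dirichlet posterior means (count plus prior parameter, divided by the sample size plus the sum of the prior parameters), and entropy is measured in nats.

```python
import math


def entropy(probs):
    """Shannon entropy in nats, with the convention 0 * ln 0 = 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def dirichlet_smoothed_distribution(counts, alphas):
    """Posterior-mean estimate of the distribution under a Dirichlet prior.

    counts[i] is the number of times symbol i was observed and alphas[i] is
    the corresponding prior parameter (both lists span the whole alphabet).
    """
    total = sum(counts) + sum(alphas)
    return [(x + a) / total for x, a in zip(counts, alphas)]


def dirichlet_plugin_entropy(counts, alphas):
    """Plug the smoothed distribution into the entropy functional."""
    return entropy(dirichlet_smoothed_distribution(counts, alphas))
```

Unlike the empirical entropy, this sketch needs one count per symbol of the full alphabet, including symbols that were never observed, so the alphabet size has to be known.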
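Another way to apply the Dirichlet prior in entropy estimation is to compute the Bayes estimator for $H(P)$ under squared error, given that $P$ follows a Dirichlet prior. It is well known that the Bayes estimator under squared error is the conditional expectation. It was shown in Wolpert and Wolf  that
$$\mathbb{E}[H(P) \mid X] = \psi\left(\sum_{i=1}^{S} (\alpha_i + X_i) + 1\right) - \sum_{i=1}^{S} \frac{\alpha_i + X_i}{\sum_{j=1}^{S} (\alpha_j + X_j)} \psi(\alpha_i + X_i + 1),$$
where $\alpha_i$ are the Dirichlet parameters, $X_i$ are the observed symbol counts, and $\psi(z) = \frac{d}{dz} \ln \Gamma(z)$ is the digamma function. We call the estimator $\mathbb{E}[H(P) \mid X]$ the Bayes estimator under Dirichlet prior.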
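A hedged Python sketch of the Bayes estimator follows. It assumes the standard closed form of the posterior mean of the entropy under a Dirichlet prior, namely the digamma function at the posterior total plus one, minus the posterior-mean-weighted average of the digamma function at each posterior parameter plus one. The function names and the simple `digamma` routine (recurrence plus a truncated asymptotic series) are choices made for this example only.

```python
import math


def digamma(x):
    """Digamma function for x > 0 via recurrence plus an asymptotic series."""
    if x <= 0:
        raise ValueError("x must be positive")
    result = 0.0
    # Shift the argument upward using psi(x) = psi(x + 1) - 1/x.
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # Series: ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6).
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def bayes_entropy_dirichlet(counts, alphas):
    """Posterior mean of the entropy (nats) given counts and prior parameters."""
    post = [x + a for x, a in zip(counts, alphas)]
    total = sum(post)
    return digamma(total + 1.0) - sum(
        (b / total) * digamma(b + 1.0) for b in post
    )
```

As a sanity check of the formula, with no observations and a uniform prior on a binary alphabet the expression reduces to $\psi(3) - \psi(2) = 1/2$, which matches the average of the binary entropy over a uniformly distributed success probability.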
Throughout this paper, we observe i.i.d. samples from an unknown discrete distribution . We denote the samples as i.i.d. random variables taking values in with probability . Defining
we know that follows a multinomial distribution with parameter . Denote . The Maximum Likelihood Estimator (MLE) for and are defined, respectively, as and , with being the empirical distribution. We assume the functional takes the form
Then it is evident that the MLE for estimating functional in (13) can be alternatively represented as the following linear function of :
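Recall that the risk function under squared error for any estimator $\hat{F}$ in estimating functional $F(P)$ may be decomposed as
$$\mathbb{E}_P \left(\hat{F} - F(P)\right)^2 = \left(\mathbb{E}_P \hat{F} - F(P)\right)^2 + \mathrm{Var}_P(\hat{F}),$$
where $\left(\mathbb{E}_P \hat{F} - F(P)\right)^2$ represents the squared bias, and $\mathrm{Var}_P(\hat{F})$ represents the variance. The subscript $P$ means that the expectation is taken with respect to the distribution $P$ that generates the i.i.d. observations. We omit the subscript for the expectation operator if the meaning of the expectation is clear from the context.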
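To make the decomposition concrete, the sketch below computes the squared bias, the variance, and the squared error risk of the empirical entropy exactly on a binary alphabet, where the count of the first symbol is binomial. The binary setting and the function names are assumptions chosen to keep the computation exact and short.

```python
import math


def binary_entropy(q):
    """Entropy in nats of the distribution (q, 1 - q)."""
    return -sum(x * math.log(x) for x in (q, 1.0 - q) if x > 0)


def mle_risk_decomposition(n, p):
    """Exact (squared bias, variance, risk) of the empirical entropy.

    The data are n i.i.d. draws from the distribution (p, 1 - p), so the
    count of the first symbol follows a Binomial(n, p) law.
    """
    truth = binary_entropy(p)
    pmf = [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    values = [binary_entropy(k / n) for k in range(n + 1)]
    mean = sum(w * v for w, v in zip(pmf, values))
    variance = sum(w * (v - mean) ** 2 for w, v in zip(pmf, values))
    risk = sum(w * (v - truth) ** 2 for w, v in zip(pmf, values))
    return (mean - truth) ** 2, variance, risk
```

Calling `mle_risk_decomposition(20, 0.3)` returns three numbers whose first two sum to the third, up to floating point error.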
Notation: denotes , denotes . For two non-negative series , notation means that there exists a positive universal constant such that , for all . The notation is equivalent to and . Notation means that . Throughout this paper, the notations involve absolute constants that may only depend on but not or . We denote by the space of discrete distributions with alphabet size .
III Main results
Theorem 1 (Upper bounds).
We have the following upper bounds on the worst case squared error risk of MLE in estimating :
where satisfies for , and is the second-order Ditzian–Totik modulus of smoothness introduced in Section IV-B.
Moreover, in all the bounds presented above, the first term bounds the square of the bias, and the second term bounds the variance.
Theorem 2 (Lower bounds).
We have the following lower bounds on the worst case squared error risk of MLE in estimating :
: there exists a constant such that for all ,
: if , for any , then
: if , then
: if , then
There are several interesting implications of this result, highlighted in the following corollaries.
Corollary 1.
For any fixed , there exist universal convergence rates for :
Corollary 1 implies that, when , estimation of is extremely simple in terms of convergence rate: plug-in estimation achieves the best possible rate (as shown in the theory of regular statistical experiments of classical asymptotic theory, see [48, Chap. 1.7.]). Results of this form have appeared in the literature, for example, Antos and Kontoyiannis  showed that it suffices to take samples to consistently estimate . However, when , the rate is considerably slower. Interestingly, there exist estimators that demonstrate better convergence rates for estimating . Jiao et al.  showed that the minimax rate in estimating , is as long as , which is achieved using the general methodology developed therein for constructing minimax rate-optimal estimators for nonsmooth functionals.
Let us now examine the case $0 < \alpha < 1$, another interesting regime that has not been characterized before. In this regime, we observe a significant increase in the difficulty of the estimation problem. In particular, the relative scaling between the number of observations $n$ and the alphabet size $S$ for consistent estimation of $F_\alpha(P)$ exhibits a phase transition, encapsulated in the following.
Corollary 2.
Fix . The worst case squared error risk of the MLE in estimating is characterized as follows when :
Corollary 2 follows directly from Theorem 1 and Theorem 2. In particular, it implies that it is necessary and sufficient to take $n \gg S^{1/\alpha}$ samples to consistently estimate $F_\alpha(P)$, $0 < \alpha < 1$, using the MLE. Thus, as one might expect, the scale of the number of measurements required for consistent estimation increases as $\alpha$ decreases. When $\alpha \to 0$, the number of samples required for the MLE grows super-polynomially in $S$, which is consistent with the intuition that $F_\alpha(P)$ with $\alpha \to 0$ is essentially equivalent to the alphabet size of a distribution, whose estimation is known to be very hard when there may exist symbols with very small probabilities .
We exhibit some of our findings by plotting the value required of for consistent estimation of using the MLE , as a function of , in Figure 1.
It turns out that one can construct estimators that are better than the MLE in terms of the sample complexity required for consistent estimation in the regime $0 < \alpha < 1$. Indeed, Jiao et al.  showed that the minimax rate-optimal estimator requires $n \gg S^{1/\alpha} / \ln S$ samples to achieve consistency, which attains a logarithmic improvement in the sample complexity over the MLE.
We consider not only the MLE $H(P_n)$, but also the so-called Miller–Madow bias-corrected estimator defined as
$$H^{\mathrm{MM}}(P_n) = H(P_n) + \frac{S - 1}{2n}.$$
Theorem 3.
The worst case squared error risk of admits the following upper bound for all :
If , then
Moreover, if , the Miller–Madow bias-corrected estimator satisfies
where the positive constant in both expressions does not depend on or .
Theorem 3 implies the following corollary.
Corollary 3.
The worst case squared error risk of the MLE $H(P_n)$ in estimating $H(P)$ is characterized as follows when :
Here the first term corresponds to the squared bias, and the second term corresponds to the variance.
Paninski  showed that if , where is a constant, the maximum squared error risk of , and the Miller–Madow bias-corrected estimator , would be bounded from zero. Paninski  also showed that when , the MLE is consistent for estimating entropy. Corollary 3 implies that it is necessary and sufficient to take samples for the MLE to be consistent for estimating entropy. Comparing the results for with those for , we see that the intuition that being viewed close to when is indeed approximately correct as coincides with on the phase transition curve shown in Figure 1.
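For reference, the two estimators compared above can be sketched as follows. The sketch assumes the classical form of the Miller–Madow correction, which adds $(S - 1)/(2n)$ to the empirical entropy with $S$ the (known) alphabet size. Variants that replace $S$ by the number of observed symbols also exist and are not covered here. The function names are hypothetical.

```python
import math
from collections import Counter


def empirical_entropy(samples):
    """Entropy in nats of the empirical distribution of the samples."""
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def miller_madow_entropy(samples, alphabet_size):
    """Empirical entropy plus the first-order bias correction (S - 1) / (2n)."""
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one sample")
    return empirical_entropy(samples) + (alphabet_size - 1) / (2 * n)
```

The correction term stays of constant order when the sample size is proportional to the alphabet size, which is the regime in which neither estimator is guaranteed to be consistent.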
Table I summarizes the minimax squared error rates and the worst case squared error rates of the MLE in estimating $H(P)$ and $F_\alpha(P)$. It is clear that the MLE cannot achieve the minimax rates for estimation of $H(P)$, and $F_\alpha(P)$ when $0 < \alpha < 3/2$. In these cases, there exist strictly better estimators whose performance with $n$ samples is roughly the same as that of the MLE with $n \ln n$ samples. This phenomenon was termed effective sample size enlargement in .
III-C Dirichlet prior techniques applied to entropy estimation
For symmetry, we restrict attention to the case where the parameter in the Dirichlet distribution takes the form .
In comparison to MLE , where is the empirical distribution, the Dirichlet smoothing scheme has a disadvantage: it requires the knowledge of the alphabet size in general. We define
It is clear that
where stands for the empirical distribution, is the true distribution, and denotes the uniform distribution on the same alphabet with size .
If , then the maximum squared error risk of in estimating is upper bounded as
Here the first term bounds the squared bias, and the second term bounds the variance.
If , then the maximum risk of in estimating is lower bounded as
where is a universal constant that does not depend on , or .
If , then we have
If , then we have
If , then we have
where is the largest integer that does not exceed , and represents the positive part of .
If and is upper bounded by a constant, then the maximum squared error risk of vanishes. Conversely, if , then the maximum squared error risk of is bounded away from zero.
The next theorem presents a lower bound on the maximum risk of the Bayes estimator under Dirichlet prior. Since we have assumed that all , the Bayes estimator under Dirichlet prior is
If , then
where $\gamma$ is the Euler–Mascheroni constant.
Evident from Theorems 4, 5, and 6 is the fact that in the best situation (i.e., when the Dirichlet parameter is not too large), both the Dirichlet prior smoothed plug-in estimator and the Bayes estimator under Dirichlet prior still require at least $n \gg S$ samples to be consistent, which is the same as the MLE. In contrast, the estimators in Valiant and Valiant [16, 18, 17], Jiao et al., and Wu and Yang are consistent if $n \gg S / \ln S$, which is the optimal sample complexity. Thus, we can conclude that the Dirichlet smoothing technique does not solve the entropy estimation problem.
IV Fundamental ideas of our analysis
In this section, we discuss the fundamental tools we employed to obtain the results in Section III, as well as general recipes we suggest for analyzing performances of functional estimators.
The variance characterizes the degree to which the random variable fluctuates around its expectation, and concentration inequalities are well suited to bounding it. For all the functionals we consider, it turns out that the Efron–Stein inequality and the bounded differences inequality give very tight bounds. For completeness we state them below.
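[52, Efron–Stein inequality, Theorem 3.1] Let $X_1, \ldots, X_n$ be independent random variables and let $Z = f(X_1, \ldots, X_n)$ be a square integrable function. Moreover, if $X_1', \ldots, X_n'$ are independent copies of $X_1, \ldots, X_n$ and if we define, for every $i = 1, \ldots, n$,
$$Z_i' = f(X_1, \ldots, X_{i-1}, X_i', X_{i+1}, \ldots, X_n),$$
then
$$\mathrm{Var}(Z) \leq \frac{1}{2} \sum_{i=1}^{n} \mathbb{E}\left[(Z - Z_i')^2\right].$$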
The following inequality, which is called the bounded differences inequality, is a useful corollary of the Efron–Stein inequality.
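[52, Bounded differences inequality, Corollary 3.2] If function $f: \mathcal{X}^n \to \mathbb{R}$ has the bounded differences property, i.e., for some nonnegative constants $c_1, \ldots, c_n$,
$$\sup_{x_1, \ldots, x_n, x_i' \in \mathcal{X}} \left| f(x_1, \ldots, x_n) - f(x_1, \ldots, x_{i-1}, x_i', x_{i+1}, \ldots, x_n) \right| \leq c_i$$
for every $1 \leq i \leq n$, then
$$\mathrm{Var}\left(f(X_1, \ldots, X_n)\right) \leq \frac{1}{4} \sum_{i=1}^{n} c_i^2,$$
given that $X_1, \ldots, X_n$ are independent random variables.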
We refer the readers to Boucheron et al.  for a modern exposition of the concentration inequality toolbox.
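As a small numerical illustration of how the bounded differences inequality is used, the sketch below compares the exact variance of the empirical entropy on a binary alphabet with the bound obtained from one quarter of the sum of the squared difference constants. The binary setting, the function names, and the way the constants are computed (the largest change in the estimate when a single sample is replaced) are assumptions of this example.

```python
import math


def binary_entropy(q):
    """Entropy in nats of the distribution (q, 1 - q)."""
    return -sum(x * math.log(x) for x in (q, 1.0 - q) if x > 0)


def bounded_differences_bound(n):
    """(1/4) * sum of c_i^2 for the empirical entropy of n binary samples."""
    # Replacing one sample moves the count of the first symbol by at most 1,
    # so each c_i is the largest jump between neighbouring count values.
    c = max(
        abs(binary_entropy((k + 1) / n) - binary_entropy(k / n))
        for k in range(n)
    )
    return n * c * c / 4.0


def exact_variance(n, p):
    """Exact variance of the empirical entropy under the distribution (p, 1 - p)."""
    pmf = [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    values = [binary_entropy(k / n) for k in range(n + 1)]
    mean = sum(w * v for w, v in zip(pmf, values))
    return sum(w * (v - mean) ** 2 for w, v in zip(pmf, values))
```

The bound does not depend on the underlying distribution, which is what makes this style of argument suitable for worst case variance analysis.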
It turns out that the bias analysis in estimation, albeit widely studied in statistics, seems to still largely bear an asymptotic and expansion nature in the mainstream statistical literature [53, 54]. In particular, the bootstrap  as a method for estimating functionals was essentially only analyzed in an asymptotic setting . Among asymptotic analysis techniques, probably the most popular one is the Taylor expansion. We will show that the Taylor expansion may encounter great difficulties in analyzing the bias of MLE in information measure estimation. Then, we will introduce the field of approximation theory using positive linear operators and demonstrate that it is essentially equivalent to nonasymptotic bias analysis for plug-in functional estimators. In doing so, we present the readers with abundant handy tools from approximation theory, which could be readily applicable to many problems that may seem highly intractable with standard expansion methods.
We start from entropy estimation. In the literature, considerable effort has been devoted to understanding the non-asymptotic performance of the MLE in estimating . One of the earliest investigations in this direction is due to Miller  in 1955, who showed that, for any fixed distribution ,
Harris’s result reveals an undesirable consequence of the Taylor expansion method: one cannot obtain uniform bounds on the bias of the MLE. Indeed, the term can be arbitrarily large for some distribution . However, it is evident that both and are bounded above by , since the maximum entropy of any distribution supported on elements is . Conceivably, for such a distribution that would make very large, we need to compute even higher order Taylor expansions to obtain more accuracy, but even with such efforts we cannot obtain a uniform bias bound for all .
We gain one of our key insights into the bias of the MLE by relating it to the approximation error induced by the Bernstein polynomial approximation of the function , which was first observed in Paninski . To see this, we first compute the bias of in estimating the functional in (13).
The bias of the estimator $F(P_n)$ is given by
$$\mathrm{Bias}(F(P_n)) \triangleq \mathbb{E}_P F(P_n) - F(P) = \sum_{i=1}^{S} \left( \sum_{j=0}^{n} f\left(\frac{j}{n}\right) \binom{n}{j} p_i^j (1 - p_i)^{n-j} - f(p_i) \right).$$
The bias term in (3) can be equivalently expressed as
$$\mathrm{Bias}(F(P_n)) = \sum_{i=1}^{S} \left( B_n[f](p_i) - f(p_i) \right),$$
where $p_{n,j}(x) \triangleq \binom{n}{j} x^j (1 - x)^{n-j}$ is the well-known Bernstein polynomial basis, and $B_n[f](x) \triangleq \sum_{j=0}^{n} f(j/n)\, p_{n,j}(x)$ is the so-called Bernstein polynomial for function $f$. (In the literature of combinatorics, the sum $\sum_{j=0}^{n} f(j/n)\, p_{n,j}(x)$ is called the Bernoulli sum, and various approaches have been proposed to evaluate its asymptotics.) Bernstein in 1912 provided an insightful constructive proof of the Weierstrass theorem on approximation of continuous functions using polynomials, by showing that the Bernstein polynomial of any continuous function converges uniformly to that function. From a functional analytic viewpoint, the Bernstein polynomial is an operator that maps a continuous function $f \in C[0,1]$ to another continuous function $B_n[f] \in C[0,1]$. This operator is linear in $f$, and is positive because $B_n[f]$ is also pointwise non-negative if $f$ is pointwise non-negative. Evidently, bounding the approximation error incurred by the Bernstein polynomial is equivalent to bounding the bias of the MLE $F(P_n)$, where $F(P) = \sum_{i=1}^{S} f(p_i)$. Fortunately, the theory of approximation using positive linear operators provides us with advanced tools that are very effective for the bias analysis our problem calls for. A century ago, probability theory served Bernstein in breaking new ground in function approximation. It is therefore very satisfying that advancements in the latter have come full circle to help us better understand probability theory and statistics. We briefly review the general theory of approximation using positive linear operators below.
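The link between the Bernstein polynomial and the bias of the plug-in estimator can be checked numerically. The sketch below is an illustration only: it assumes the functional is a sum of a scalar function over the symbol probabilities, evaluates the Bernstein polynomial as the expectation of that function at a scaled binomial count, and sums the resulting approximation errors. The function names are hypothetical.

```python
import math


def bernstein_polynomial(f, n, x):
    """B_n[f](x) = sum_j f(j/n) * C(n, j) * x^j * (1 - x)^(n - j)."""
    return sum(
        f(j / n) * math.comb(n, j) * x**j * (1 - x) ** (n - j)
        for j in range(n + 1)
    )


def entropy_term(x):
    """The scalar function -x ln x, with the convention 0 * ln 0 = 0."""
    return -x * math.log(x) if x > 0 else 0.0


def plugin_bias(f, n, probs):
    """Bias of sum_i f(X_i / n) when (X_1, ..., X_S) is multinomial(n, probs).

    Each count X_i is marginally Binomial(n, p_i), so the expectation of
    f(X_i / n) is exactly the Bernstein polynomial of f evaluated at p_i.
    """
    return sum(bernstein_polynomial(f, n, p) - f(p) for p in probs)
```

For the entropy term the function is concave, so each Bernstein polynomial value lies below the function value and the computed bias of the empirical entropy is negative, for example in `plugin_bias(entropy_term, 50, [0.5, 0.3, 0.2])`.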
IV-B1 Approximation theory using positive linear operators
Generally speaking, for any estimator of a parametric model indexed by , the expectation is a positive linear operator for , and analyzing the bias is equivalent to analyzing the approximation properties of the positive linear operator in approximating . Hence, analyzing the bias of any plug-in estimator for functionals of parameters from any parametric families can be recast as a problem of approximation theory using positive linear operators .
Conversely, given a positive linear operator that operates on the space of continuous functions, the Riesz–Markov–Kakutani theorem implies that under mild conditions the operator may be written as
where is a set of probability measures parametrized by , which may be viewed as a parameter. If we view the random variable as a summary statistics to plug-in the functional , the positive linear operator is nothing but the expectation of the plug-in estimator . In this sense, there exists a one-to-one correspondence between essentially the most general bias analysis problem in statistics, and the most general positive linear operator approximation problem in approximation theory.
After more than a century’s active research on approximation using positive linear operators, we now have highly non-trivial tools for positive linear operators of functions on one dimensional compact sets, but the general theory for vector valued multivariate functions on non-compact sets is still far from complete . In the next subsection, we present a sample of existing results in approximation using positive linear operators, corollaries of which will be used to analyze the bias of the MLE for two examples: and .
IV-B2 Some general results in bias analysis
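First, some elementary approximation theoretic concepts need to be introduced in order to characterize the degree of smoothness of functions. For $I \subset \mathbb{R}$ an interval, the first-order modulus of smoothness $\omega^1(f, t)$, $t > 0$, is defined as
$$\omega^1(f, t) \triangleq \sup \left\{ |f(u) - f(v)| : u, v \in I, \ |u - v| \leq t \right\}.$$
The second-order modulus of smoothness is defined as
$$\omega^2(f, t) \triangleq \sup \left\{ \left| f(u) - 2 f\left(\frac{u + v}{2}\right) + f(v) \right| : u, v \in I, \ |u - v| \leq 2t \right\}.$$
Ditzian and Totik introduced a class of moduli of smoothness, which proves to be extremely useful in characterizing the incurred approximation errors. For simplicity, for functions defined on $[0, 1]$, with $\varphi(x) = \sqrt{x(1 - x)}$, the first-order Ditzian–Totik modulus of smoothness is defined as
$$\omega^1_\varphi(f, t) \triangleq \sup \left\{ |f(u) - f(v)| : u, v \in [0, 1], \ |u - v| \leq t \varphi\left(\frac{u + v}{2}\right) \right\},$$
and the second-order Ditzian–Totik modulus of smoothness is defined as
$$\omega^2_\varphi(f, t) \triangleq \sup \left\{ \left| f(u) + f(v) - 2 f\left(\frac{u + v}{2}\right) \right| : u, v \in [0, 1], \ |u - v| \leq 2t \varphi\left(\frac{u + v}{2}\right) \right\}.$$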