Maximum Likelihood Estimation of Functionals of Discrete Distributions
Abstract
We consider the problem of estimating functionals of discrete distributions, and focus on tight (up to universal multiplicative constants for each specific functional) nonasymptotic analysis of the worst case squared error risk of widely used estimators. We apply concentration inequalities to analyze the random fluctuation of these estimators around their expectations, and the theory of approximation using positive linear operators to analyze the deviation of their expectations from the true functional, namely their bias.
We explicitly characterize the worst-case squared error risk incurred by the Maximum Likelihood Estimator (MLE) in estimating the Shannon entropy H(P) and the power sum F_α(P) = Σ_i p_i^α, α > 0, up to universal multiplicative constants for each fixed functional, for any alphabet size S and sample size n for which the risk may vanish. As a corollary, for Shannon entropy estimation, we show that it is necessary and sufficient to have n ≫ S observations for the MLE to be consistent. In addition, we establish that it is necessary and sufficient to consider n ≫ S^{1/α} samples for the MLE to consistently estimate F_α(P), 0 < α < 1. The minimax rate-optimal estimators for both problems require n ≫ S/ln S and n ≫ S^{1/α}/ln S samples, respectively, which implies that the MLE has a strictly suboptimal sample complexity. When 1 < α < 3/2, we show that the worst-case squared error rate of convergence for the MLE is n^{−2(α−1)} for infinite alphabet size, while the minimax squared error rate is (n ln n)^{−2(α−1)}. When α ≥ 3/2, the MLE achieves the minimax optimal rate n^{−1} regardless of the alphabet size.
As an application of the general theory, we analyze Dirichlet prior smoothing techniques for Shannon entropy estimation. In this context, one approach is to plug the Dirichlet prior smoothed distribution into the entropy functional, while the other is to calculate the Bayes estimator for entropy under the Dirichlet prior for squared error, which is the conditional expectation. We show that in general such estimators do not improve over the maximum likelihood estimator: no matter how we tune the parameters in the Dirichlet prior, this approach cannot achieve the minimax rates in entropy estimation. The performance of the minimax rate-optimal estimator with n samples is essentially at least as good as that of Dirichlet smoothed entropy estimators with n ln n samples.
I Introduction
Entropy and related information measures arise in information theory, statistics, machine learning, biology, neuroscience, image processing, linguistics, secrecy, ecology, physics, and finance, among other fields. Numerous inferential tasks rely on data-driven procedures to estimate these quantities (see, e.g., [1, 2, 3, 4, 5, 6]). We focus on two concrete and well-motivated examples of information measures, namely the Shannon entropy [7]
(1)  H(P) ≜ −Σ_{i=1}^S p_i ln p_i,
and the power sum of order α > 0:
(2)  F_α(P) ≜ Σ_{i=1}^S p_i^α.
The power sum functional F_α(P) often emerges in various operational problems [8]. It also has connections to the Rényi entropy H_α(P) [9] via the formula H_α(P) = (1/(1−α)) ln F_α(P), α ≠ 1.
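To make the two functionals concrete, here is a minimal Python sketch computing H(P), F_α(P), and the Rényi entropy via the formula above; the function names are ours, and the natural logarithm is used throughout (entropies in nats).

```python
import math

def shannon_entropy(p):
    """Shannon entropy H(P) = -sum_i p_i ln p_i (in nats)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def power_sum(p, alpha):
    """Power sum F_alpha(P) = sum_i p_i^alpha."""
    return sum(pi ** alpha for pi in p if pi > 0)

def renyi_entropy(p, alpha):
    """Renyi entropy of order alpha != 1, via H_alpha = ln F_alpha / (1 - alpha)."""
    return math.log(power_sum(p, alpha)) / (1.0 - alpha)

# For the uniform distribution on S symbols, H = ln S and F_alpha = S^(1-alpha),
# so every Renyi entropy also equals ln S.
S = 4
uniform = [1.0 / S] * S
print(shannon_entropy(uniform))      # ln 4 ≈ 1.3863
print(power_sum(uniform, 2.0))       # 4^(1-2) = 0.25
print(renyi_entropy(uniform, 2.0))   # ln 4 ≈ 1.3863
```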
Consider estimating the Shannon entropy H(P) based on n i.i.d. samples following an unknown discrete distribution P with unknown alphabet size S. This problem has a rich history, with extensive study in fields ranging from information theory and statistics to neuroscience, physics, psychology, and medicine. We refer the reader to [10] for a review. One of the most widely used estimators for this purpose is the Maximum Likelihood Estimator (MLE), which is simply the empirical entropy. The empirical entropy is an instantiation of the plug-in principle in functional estimation, where a point estimate of the parameter (the distribution in this case) is used to construct an estimator for a functional of the parameter via the plug-in approach. The idea of using the MLE for estimating information measures of interest (in this case entropy) is not only intuitive, but has sound justification: asymptotic efficiency.
The beautiful theory of Hájek and Le Cam [11, 12, 13] shows that, as the number of observed samples n grows without bound while the finite parameter dimension (e.g., the alphabet size S) remains fixed, the MLE performs optimally in estimating any differentiable functional when the statistical model complies with the benign LAN (Local Asymptotic Normality) condition [13]. Thus, for finite dimensional problems, the problems of parameter and functional estimation are well understood in an asymptotic sense, and the MLE appears to be not only natural but also theoretically justified. But does it make sense to employ the MLE to estimate the entropy in most practical applications?
As it turns out, while asymptotically optimal in entropy estimation, the MLE is by no means sacrosanct in many real applications, especially in regimes where the alphabet size is comparable to, or even larger than the number of observations. It was shown that the MLE for entropy is strictly suboptimal in the large alphabet regime [14, 15]. Therefore, classical asymptotic theory does not satisfactorily address high dimensional settings, which are becoming increasingly important in the modern era of high dimensional statistics.
There has been a wave of recent research activity focusing on analyzing existing approaches to functional estimation, as well as on proposing new estimators that are provably near optimal in the large alphabet regime. Paninski [14] showed that the MLE needs n ≫ S samples to consistently estimate the Shannon entropy, and Paninski [15] established the existence of a (non-explicit) estimator that requires only n = o(S) samples. This implies that the MLE is strictly suboptimal in terms of sample complexity. It was Valiant and Valiant [16] who first explicitly constructed a linear programming based estimator (later modified in [17]) that achieves consistency in entropy estimation with n ≫ S/ln S samples, which they also proved to be necessary. Valiant and Valiant [18] constructed another approximation based estimator with better theoretical properties than the linear programming ones, which has not yet been shown to be minimax rate-optimal for all ranges of S and n. The authors [10] constructed the first minimax rate-optimal estimators for H(P) and F_α(P) based on best polynomial approximation, which are agnostic to the alphabet size S. Utilizing the released MATLAB and Python packages of the estimators in [10], [19, 20] demonstrated that these minimax rate-optimal estimators can lead to significant performance boosts in various machine learning tasks. Wu and Yang [21] independently applied the best polynomial approximation idea to entropy estimation and obtained the minimax rates. However, their estimator requires knowledge of the alphabet size S. The approximation ideas proved to be very fruitful in Acharya et al. [22], Wu and Yang [23], Han, Jiao, and Weissman [24], Jiao, Han, and Weissman [25], Bu et al. [26], Orlitsky, Suresh, and Wu [27], and Wu and Yang [28].
The main contribution of this paper is an explicit characterization of the worst-case squared error risk of estimating H(P) and F_α(P) using the MLE, up to a universal multiplicative constant for each specific functional, for all ranges of S and n in which the risk may vanish. Understanding the benefits and limitations of the MLE in a nonasymptotic setting serves two key purposes. First, the approach is a natural benchmark for comparing other, more nuanced procedures for estimation of functionals. Second, performance analysis of the MLE reveals regimes where the problem is difficult, and motivates the development of improvements, which have been validated in [14, 15, 16, 17, 18, 10, 21, 22]. As a byproduct of the analysis, we explicitly point out an equivalence between bias analysis of plug-in functional estimators and approximation theory using positive linear operators. We believe these powerful tools from approximation theory may have far-reaching impact in various applications in the information theory community.
We mention that there exist numerous other approaches, proposed across various disciplines, to estimate entropy, many of which are difficult to analyze theoretically. Among them we mention the Miller–Madow bias-corrected estimator and its variants [29, 30, 31], the jackknife estimator [32], the shrinkage estimator [33], the coverage adjusted estimator [34], the Best Upper Bound (BUB) estimator [14], and the B-splines estimator [35], among others [36, 37]. For a Bayesian statistician, a natural approach is to first impose a prior on the unknown discrete distribution before considering estimating entropy. The Dirichlet prior, being the conjugate prior to the multinomial distribution, appears to be particularly popular in the Bayesian approach to entropy estimation. Dirichlet smoothing may have two connotations in the context of entropy estimation: one may plug the Dirichlet prior smoothed distribution estimate into the entropy functional, or one may compute the Bayes estimator of entropy under the Dirichlet prior for squared error, i.e., the conditional expectation of the entropy given the observations.
Nemenman, Shafee, and Bialek [42] argued in an intuitive way why the Dirichlet prior is poorly suited for entropy estimation and proposed to use mixtures of Dirichlet priors. Archer, Park, and Pillow [43] have come up with priors that perform better than the Dirichlet prior. See also [44, 45].
Another contribution of this paper is an explicit characterization of the worst-case squared error risk of estimating H(P) using the Dirichlet prior plug-in approach, up to a universal multiplicative constant, for all ranges of S and n in which the risk may vanish. We show rigorously that neither of the two approaches utilizing the Dirichlet prior results in improvements over the MLE in the large alphabet regime. Specifically, these approaches require at least n ≫ S samples to be consistent, while the minimax rate-optimal estimators such as the ones in [10], [21] only need n ≫ S/ln S to achieve consistency.
II Preliminaries
The Dirichlet distribution of order S with parameters α_1, …, α_S > 0 has a probability density function with respect to Lebesgue measure on the (S−1)-dimensional Euclidean space given by
(3)  f(x_1, …, x_S; α_1, …, α_S) = (1/B(α)) Π_{i=1}^S x_i^{α_i − 1}
on the open (S−1)-dimensional simplex defined by:
(4)  x_1, …, x_{S−1} > 0,
(5)  x_1 + x_2 + … + x_{S−1} < 1,
(6)  x_S = 1 − x_1 − x_2 − … − x_{S−1},
and zero elsewhere. The normalizing constant B(α) is the multinomial Beta function, which can be expressed in terms of the Gamma function:
(7)  B(α) = (Π_{i=1}^S Γ(α_i)) / Γ(Σ_{i=1}^S α_i),  α = (α_1, …, α_S).
Assuming the unknown discrete distribution P follows the prior distribution Dir(α_1, …, α_S), and we observe a vector X = (X_1, …, X_S) with multinomial distribution Multi(n; p_1, …, p_S), one can show that the posterior distribution is also a Dirichlet distribution, with parameters
(8)  α_i' = α_i + X_i,  1 ≤ i ≤ S.
Furthermore, the posterior mean (conditional expectation) of p_i given X is given by [46, Example 5.4.4]
(9)  p̂_i = E[p_i | X] = (X_i + α_i) / (n + Σ_{j=1}^S α_j).
The estimator p̂ is widely used in practice for various choices of α. For example, if α_i = √n/S for all i, then the corresponding p̂ is the minimax estimator for P under squared loss [46, Example 5.4.5]. However, it is no longer minimax under other loss functions such as the ℓ_1 loss, which was investigated in [47].
Note that the estimator p̂ subsumes the MLE as a special case, since we can take the limit α_i → 0 for all i to obtain the MLE. We denote the empirical distribution by P_n. The Dirichlet prior smoothed distribution estimate is denoted by P̂_α, where
(10)  P̂_α(i) = (X_i + α_i) / (n + Σ_{j=1}^S α_j).
Note that the smoothed distribution P̂_α can be viewed as a convex combination of the empirical distribution P_n and the normalized prior mean (α_1/A, …, α_S/A), where A = Σ_{j=1}^S α_j. We call the estimator H(P̂_α) the Dirichlet prior smoothed plug-in estimator.
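As an illustration of the smoothing formula and the convex-combination view, here is a small Python sketch for the symmetric case α_i = a (the case studied in Section III-C); the function names and the example histogram are ours.

```python
import math

def dirichlet_smoothed(counts, a):
    """Dirichlet(a,...,a)-smoothed distribution: p_i = (X_i + a) / (n + S*a)."""
    n, S = sum(counts), len(counts)
    return [(x + a) / (n + S * a) for x in counts]

def entropy(p):
    """Shannon entropy of a distribution, in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

counts = [5, 3, 0, 2]     # histogram X with n = 10, S = 4 (our toy example)
a = 1.0                   # smoothing parameter (our choice)
p_hat = dirichlet_smoothed(counts, a)

# The same estimate, written as a convex combination of the empirical
# distribution and the uniform distribution on S symbols:
n, S = sum(counts), len(counts)
lam = n / (n + S * a)
mix = [lam * (x / n) + (1 - lam) * (1.0 / S) for x in counts]
assert all(abs(u - v) < 1e-12 for u, v in zip(p_hat, mix))

print(entropy(p_hat))     # Dirichlet prior smoothed plug-in estimate of H(P)
```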
Another way to apply the Dirichlet prior in entropy estimation is to compute the Bayes estimator for H(P) under squared error, given that P follows a Dirichlet prior. It is well known that the Bayes estimator under squared error is the conditional expectation. It was shown in Wolpert and Wolf [40] that
(11)  E[H(P) | X] = ψ(n + A + 1) − Σ_{i=1}^S ((X_i + α_i)/(n + A)) ψ(X_i + α_i + 1),  A = Σ_{j=1}^S α_j,
where ψ(x) = Γ'(x)/Γ(x) is the digamma function. We call the estimator E[H(P) | X] the Bayes estimator under Dirichlet prior.
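A sketch of the Bayes estimator under a symmetric Dirichlet(a, …, a) prior, following the digamma formula attributed to Wolpert and Wolf above. The Python standard library has no digamma, so we hand-roll one via the recurrence ψ(x) = ψ(x+1) − 1/x and a standard asymptotic series; all function names are ours.

```python
import math

def digamma(x):
    """Digamma psi(x), via psi(x) = psi(x+1) - 1/x plus an asymptotic
    series that is accurate once the argument is pushed above 6."""
    result = 0.0
    while x < 6:
        result -= 1.0 / x
        x += 1.0
    # psi(x) ~ ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6) + ...
    inv2 = 1.0 / (x * x)
    result += math.log(x) - 0.5 / x \
        - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result

def bayes_entropy_dirichlet(counts, a):
    """Bayes estimate of H(P) under squared error with a Dir(a,...,a) prior:
    psi(n + S*a + 1) - sum_i ((X_i + a)/(n + S*a)) psi(X_i + a + 1)."""
    n, S = sum(counts), len(counts)
    A = n + S * a
    return digamma(A + 1) - sum((x + a) / A * digamma(x + a + 1) for x in counts)

# Sanity check on the digamma routine: psi(1) = -gamma (Euler-Mascheroni).
print(digamma(1.0))                            # ≈ -0.57722
print(bayes_entropy_dirichlet([5, 3, 0, 2], 1.0))
```

The estimate always lies strictly between 0 and ln S, since the posterior is nondegenerate and entropy is bounded by ln S.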
Throughout this paper, we observe n i.i.d. samples from an unknown discrete distribution P = (p_1, …, p_S). We denote the samples by Z_1, …, Z_n, i.i.d. random variables taking values in {1, 2, …, S} with P(Z_1 = i) = p_i. Defining
(12)  X_i ≜ Σ_{j=1}^n 1(Z_j = i),  1 ≤ i ≤ S,
we know that X = (X_1, …, X_S) follows a multinomial distribution with parameter (n; p_1, …, p_S). Denote p̂_i = X_i/n. The Maximum Likelihood Estimators (MLE) for H(P) and F_α(P) are defined, respectively, as H(P_n) and F_α(P_n), with P_n being the empirical distribution. We assume the functional F(P) takes the form
(13)  F(P) = Σ_{i=1}^S f(p_i).
Then it is evident that the MLE for estimating the functional F(P) in (13) can be alternatively represented as the following plug-in functional of the empirical counts:
(14)  F(P_n) = Σ_{i=1}^S f(X_i/n).
Recall that the risk function under squared error for an arbitrary estimator F̂ in estimating the functional F(P) may be decomposed as
(15)  E_P (F̂ − F(P))² = (E_P F̂ − F(P))² + Var_P(F̂),
where (E_P F̂ − F(P))² represents the squared bias, and Var_P(F̂) = E_P (F̂ − E_P F̂)² represents the variance. The subscript P means that the expectation is taken with respect to the distribution P that generates the i.i.d. observations. We omit the subscript of the expectation operator if the meaning of the expectation is clear from the context.
Notation: ln denotes the natural logarithm. For two nonnegative series {a_n}, {b_n}, the notation a_n ≲ b_n means that there exists a positive universal constant C such that a_n ≤ C b_n for all n. The notation a_n ≳ b_n is equivalent to b_n ≲ a_n, and a_n ≍ b_n means both a_n ≲ b_n and a_n ≳ b_n. The notation a_n ≫ b_n means that lim_{n→∞} b_n/a_n = 0. Throughout this paper, these notations involve absolute constants that may depend only on α but not on S or n. We denote by M_S the space of discrete distributions with alphabet size S.
III Main Results
III-A Estimating F_α(P)
We split the upper bounds and the lower bounds into two theorems, and present succinct summaries in Corollaries 1 and 2.
Theorem 1 (Upper bounds).
We have the following upper bounds on the worst-case squared error risk of the MLE F_α(P_n) in estimating F_α(P):
0 < α ≤ 1/2:
(16)
1/2 < α < 1:
(17)
where ω_φ²(f, t) is the second-order Ditzian–Totik modulus of smoothness introduced in Section IV-B.
1 < α < 3/2:
(18)
α ≥ 3/2:
(19)
Moreover, in all the bounds presented above, the first term bounds the square of the bias, and the second term bounds the variance.
Theorem 2 (Lower bounds).
We have the following lower bounds on the worst-case squared error risk of the MLE F_α(P_n) in estimating F_α(P):
0 < α ≤ 1/2: there exists a constant c(α) > 0 such that for all n and S,
(20)
1/2 < α < 1: under suitable conditions on n and S,
(21)
1 < α < 3/2: under suitable conditions on n and S,
(22)
α ≥ 3/2: under suitable conditions on n and S,
(23)
There are several interesting implications of this result, highlighted in the following corollaries.
Corollary 1.
For any fixed α > 1, the following convergence rates hold universally in the alphabet size:
(24)  sup_S sup_{P ∈ M_S} E (F_α(P_n) − F_α(P))² ≍ n^{−2(α−1)} for 1 < α < 3/2, and ≍ n^{−1} for α ≥ 3/2.
Corollary 1 implies that, when α ≥ 3/2, estimation of F_α(P) is extremely simple in terms of convergence rate: plug-in estimation achieves the best possible rate n^{−1} (as shown in the theory of regular statistical experiments in classical asymptotic theory; see [48, Chap. 1.7]). Results of this form have appeared in the literature; for example, Antos and Kontoyiannis [49] showed that the plug-in estimator is consistent for F_α(P), α > 1, irrespective of the alphabet size. However, when 1 < α < 3/2, the rate n^{−2(α−1)} is considerably slower. Interestingly, there exist estimators that demonstrate better convergence rates for estimating F_α(P) in this regime. Jiao et al. [10] showed that the minimax rate in estimating F_α(P), 1 < α < 3/2, is (n ln n)^{−2(α−1)} for sufficiently large alphabet sizes, which is achieved using the general methodology developed therein for constructing minimax rate-optimal estimators for nonsmooth functionals.
Let us now examine the case 0 < α < 1, another interesting regime that has not been characterized before. In this regime, we observe a significant increase in the difficulty of the estimation problem. In particular, the relative scaling between the number of observations n and the alphabet size S for consistent estimation of F_α(P) exhibits a phase transition, encapsulated in the following.
Corollary 2.
Fix 0 < α < 1. The worst-case squared error risk of the MLE F_α(P_n) in estimating F_α(P) is characterized as follows when n ≳ S^{1/α}:
(25)  sup_{P ∈ M_S} E (F_α(P_n) − F_α(P))² ≍ S²/n^{2α} + S^{2−2α}/n.
Corollary 2 follows directly from Theorems 1 and 2. In particular, it implies that it is necessary and sufficient to take n ≫ S^{1/α} samples to consistently estimate F_α(P), 0 < α < 1, using the MLE. Thus, as one might expect, the number of measurements required for consistent estimation grows as α decreases. As α → 0, the number of samples required by the MLE grows superpolynomially in S, which is consistent with the intuition that F_0(P) is essentially the support size of a distribution, whose estimation is known to be very hard when there may exist symbols with very small probabilities [50].
We exhibit some of our findings by plotting the order of the sample size n required for consistent estimation of F_α(P) using the MLE F_α(P_n), as a function of α, in Figure 1.
It turns out that one can construct estimators that are better than the MLE in terms of the sample complexity required for consistent estimation in the regime 0 < α < 1. Indeed, Jiao et al. [10] showed that the minimax rate-optimal estimator requires n ≫ S^{1/α}/ln S samples to achieve consistency, which attains a logarithmic improvement in the sample complexity over the MLE.
III-B Estimating H(P)
We consider not only the MLE H(P_n), but also the so-called Miller–Madow bias-corrected estimator [29], defined as
(26)  Ĥ_MM ≜ H(P_n) + (Ŝ − 1)/(2n),
where Ŝ is the number of distinct symbols observed in the sample.
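A minimal Python sketch of the plug-in estimator and the Miller–Madow correction just defined (function names are ours; Ŝ is taken to be the number of symbols with nonzero count):

```python
import math

def mle_entropy(counts):
    """Plug-in (MLE) entropy of the empirical distribution, in nats."""
    n = sum(counts)
    return -sum((x / n) * math.log(x / n) for x in counts if x > 0)

def miller_madow(counts):
    """Miller-Madow bias-corrected estimator: plug-in plus (S_hat - 1)/(2n),
    where S_hat is the number of symbols actually observed."""
    n = sum(counts)
    s_observed = sum(1 for x in counts if x > 0)
    return mle_entropy(counts) + (s_observed - 1) / (2.0 * n)

counts = [5, 3, 0, 2]          # our toy histogram: n = 10, 3 observed symbols
print(mle_entropy(counts))     # plug-in estimate
print(miller_madow(counts))    # plug-in + (3 - 1)/(2*10) = plug-in + 0.1
```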
Theorem 3.
The worst-case squared error risk of the MLE H(P_n) admits the following upper bound for all n and S:
(27) 
If n ≳ S/ln S, then
(28) 
Moreover, if n ≳ S/ln S, the Miller–Madow bias-corrected estimator Ĥ_MM satisfies
(29) 
where the positive constants in these bounds do not depend on S or n.
Theorem 3 implies the following corollary.
Corollary 3.
The worst-case squared error risk of the MLE H(P_n) in estimating H(P) is characterized as follows when n ≳ S/ln S:
(30)  sup_{P ∈ M_S} E (H(P_n) − H(P))² ≍ (S/n)² + (ln S)²/n.
Here the first term corresponds to the squared bias, and the second term corresponds to the variance.
Paninski [14] showed that if n/S is bounded by a constant, the maximum squared error risks of the MLE H(P_n) and of the Miller–Madow bias-corrected estimator Ĥ_MM are bounded away from zero. Paninski [14] also showed that when n ≫ S, the MLE is consistent for estimating entropy. Corollary 3 implies that it is necessary and sufficient to take n ≫ S samples for the MLE to be consistent for estimating entropy. Comparing the results for H(P) with those for F_α(P), we see that the intuition of viewing H(P) as close to F_α(P) when α → 1 is indeed approximately correct, as the sample complexity S^{1/α} coincides with S at α = 1 on the phase transition curve shown in Figure 1.
Table I summarizes the minimax squared error rates and the worst-case squared error rates of the MLE in estimating H(P) and F_α(P). It is clear that the MLE cannot achieve the minimax rates for the estimation of H(P), and of F_α(P) when 0 < α < 3/2. In these cases, there exist strictly better estimators whose performance with n samples is roughly the same as that of the MLE with n ln n samples. This phenomenon was termed effective sample size enlargement in [10].
III-C Dirichlet prior techniques applied to entropy estimation
For symmetry, we restrict attention to the case where the parameter vector α in the Dirichlet distribution takes the form α = (a, a, …, a), a > 0.
In comparison to the MLE H(P_n), where P_n is the empirical distribution, the Dirichlet smoothing scheme has a disadvantage: in general it requires knowledge of the alphabet size S. We define
(31)  P̂_a(i) ≜ (X_i + a)/(n + Sa),  1 ≤ i ≤ S,
and
(32)  Ĥ_a ≜ H(P̂_a).
It is clear that
(33)  P̂_a(i) = (n/(n + Sa)) p̂_i + (Sa/(n + Sa)) (1/S),  1 ≤ i ≤ S,
(34)  P̂_a = (n/(n + Sa)) P_n + (Sa/(n + Sa)) U(S),
where P_n stands for the empirical distribution, P is the true distribution, and U(S) denotes the uniform distribution on the same alphabet, of size S.
Theorem 4.
If the sample size n is sufficiently large relative to S, then the maximum squared error risk of Ĥ_a in estimating H(P) is upper bounded as
(35) 
Here the first term bounds the squared bias, and the second term bounds the variance.
Theorem 5.
Under suitable conditions on n, S, and a, the maximum risk of Ĥ_a in estimating H(P) is lower bounded as
(36) 
where C is a universal constant that does not depend on n, S, or a.
Moreover, in three further parameter regimes we have, respectively,
(37)
(38)
(39)
where ⌊x⌋ denotes the largest integer that does not exceed x, and (x)₊ ≜ max{x, 0} denotes the positive part of x.
Corollary 4.
If n ≫ S and a is upper bounded by a constant, then the maximum squared error risk of Ĥ_a vanishes. Conversely, if n ≲ S, then the maximum squared error risk of Ĥ_a is bounded away from zero.
The next theorem presents a lower bound on the maximum risk of the Bayes estimator under Dirichlet prior. Since we have assumed that all α_i = a, the Bayes estimator under Dirichlet prior is
(40)  Ĥ_Bayes = ψ(n + Sa + 1) − Σ_{i=1}^S ((X_i + a)/(n + Sa)) ψ(X_i + a + 1).
Theorem 6.
Under suitable conditions on n, S, and a, we have
(41) 
where γ is the Euler–Mascheroni constant.
Evident from Theorems 4, 5, and 6 is the fact that even in the best situation (i.e., a not too large), both the Dirichlet prior smoothed plug-in estimator and the Bayes estimator under Dirichlet prior still require at least n ≫ S samples to be consistent, which is the same as the MLE. In contrast, the estimators in Valiant and Valiant [16, 18, 17], Jiao et al. [10], and Wu and Yang [21] are consistent if n ≫ S/ln S, which is the optimal sample complexity. Thus, we can conclude that the Dirichlet smoothing technique does not solve the entropy estimation problem.
IV Fundamental ideas of our analysis
In this section, we discuss the fundamental tools we employ to obtain the results in Section III, as well as general recipes we suggest for analyzing the performance of functional estimators.
IV-A Variance
The variance characterizes the degree to which a random variable fluctuates around its expectation, and the field of concentration inequalities is perfectly suited to deliver the desired results. For all the functionals we consider, it turns out that the Efron–Stein inequality [51] and the bounded differences inequality give very tight bounds. For completeness we state them below.
Lemma 1.
[52, Efron–Stein inequality, Theorem 3.1] Let Z_1, …, Z_n be independent random variables and let Z = f(Z_1, …, Z_n) be a square-integrable function of them. Moreover, if Z_1', …, Z_n' are independent copies of Z_1, …, Z_n and if we define, for every 1 ≤ i ≤ n,
(42)  Z^{(i)} ≜ f(Z_1, …, Z_{i−1}, Z_i', Z_{i+1}, …, Z_n),
then
(43)  Var(Z) ≤ (1/2) Σ_{i=1}^n E (Z − Z^{(i)})².
The following inequality, which is called the bounded differences inequality, is a useful corollary of the Efron–Stein inequality.
Lemma 2.
[52, Bounded differences inequality, Corollary 3.2] If a function f has the bounded differences property, i.e., for some nonnegative constants c_1, …, c_n,
(44)  sup_{z_1, …, z_n, z_i'} |f(z_1, …, z_n) − f(z_1, …, z_{i−1}, z_i', z_{i+1}, …, z_n)| ≤ c_i,  1 ≤ i ≤ n,
then
(45)  Var(Z) ≤ (1/4) Σ_{i=1}^n c_i²,
where Z = f(Z_1, …, Z_n), given that Z_1, …, Z_n are independent random variables.
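As a sanity check on how Lemma 2 is used, the following Python sketch computes, by exhaustive enumeration over a tiny example, the exact variance of the empirical entropy H(P_n) together with a brute-force bounded-differences constant c, and compares the variance with the resulting bound n·c²/4. The example distribution and sizes are ours; c also satisfies the standard bound 2 ln n / n.

```python
import math
from itertools import product

def empirical_entropy(seq, S):
    """Entropy of the empirical distribution of a sample sequence."""
    n = len(seq)
    counts = [seq.count(s) for s in range(S)]
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)

S, n = 3, 5
P = [0.5, 0.3, 0.2]          # an arbitrary small true distribution

# Exact moments of H(P_n) by enumerating all S^n sample sequences.
mean = second = 0.0
for seq in product(range(S), repeat=n):
    prob = math.prod(P[s] for s in seq)
    h = empirical_entropy(seq, S)
    mean += prob * h
    second += prob * h * h
exact_var = second - mean * mean

# Brute-force the bounded-differences constant: changing one sample
# changes the empirical entropy by at most c.
c = 0.0
for seq in product(range(S), repeat=n):
    h = empirical_entropy(seq, S)
    for i in range(n):
        for s in range(S):
            if s != seq[i]:
                seq2 = seq[:i] + (s,) + seq[i + 1:]
                c = max(c, abs(h - empirical_entropy(seq2, S)))

# Lemma 2 gives Var(H(P_n)) <= n * c^2 / 4; the exact variance is far smaller.
print(exact_var, n * c * c / 4)
```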
We refer the reader to Boucheron et al. [52] for a modern exposition of the concentration-inequality toolbox.
IV-B Bias
It turns out that bias analysis in estimation, albeit widely studied in statistics, still largely bears an asymptotic, expansion-based nature in the mainstream statistical literature [53, 54]. In particular, the bootstrap [55] as a method for estimating functionals has essentially only been analyzed in an asymptotic setting [56]. Among asymptotic analysis techniques, probably the most popular is the Taylor expansion. We will show that the Taylor expansion may encounter great difficulties in analyzing the bias of the MLE in information measure estimation. Then, we will introduce the field of approximation theory using positive linear operators and demonstrate that it is essentially equivalent to nonasymptotic bias analysis for plug-in functional estimators. In doing so, we present the reader with abundant handy tools from approximation theory, which are readily applicable to many problems that may seem highly intractable with standard expansion methods.
We start with entropy estimation. In the literature, considerable effort has been devoted to understanding the nonasymptotic performance of the MLE H(P_n) in estimating H(P). One of the earliest investigations in this direction is due to Miller [29] in 1955, who showed that, for any fixed distribution P,
(46)  E_P H(P_n) = H(P) − (S − 1)/(2n) + O(1/n²).
Equation (46) was later refined by Harris [57] using higher-order Taylor series expansions to yield
(47)  E_P H(P_n) = H(P) − (S − 1)/(2n) + (1/(12n²)) (1 − Σ_{i=1}^S 1/p_i) + O(1/n³).
Harris's result reveals an undesirable consequence of the Taylor expansion method: one cannot obtain uniform bounds on the bias of the MLE. Indeed, the term Σ_{i=1}^S 1/p_i can be arbitrarily large for some distribution P. However, it is evident that both H(P) and E_P H(P_n) are bounded above by ln S, since the maximum entropy of any distribution supported on S elements is ln S. Conceivably, for a distribution that makes Σ_{i=1}^S 1/p_i very large, we would need to compute even higher-order Taylor expansions to obtain better accuracy, but even with such efforts we cannot obtain a uniform bias bound valid for all P.
We gain one of our key insights into the bias of the MLE by relating it to the approximation error induced by the Bernstein polynomial approximation of the function f, an observation first made in Paninski [14]. To see this, we first compute the bias of F(P_n) in estimating the functional F(P) in (13).
Lemma 3.
The bias of the estimator F(P_n) is given by
(48)  E_P F(P_n) − F(P) = Σ_{i=1}^S ( E_P f(X_i/n) − f(p_i) ).
The bias term in Lemma 3 can be equivalently expressed as¹
(49)  E_P f(X_i/n) = Σ_{k=0}^n f(k/n) C(n,k) p_i^k (1 − p_i)^{n−k} ≜ B_n[f](p_i),
(50)  E_P F(P_n) − F(P) = Σ_{i=1}^S ( B_n[f](p_i) − f(p_i) ),
¹In the combinatorics literature, sums of the form Σ_{k=0}^n f(k/n) C(n,k) p^k (1 − p)^{n−k} are called Bernoulli sums, and various approaches have been proposed to evaluate their asymptotics [58], [59], [60].
where C(n,k) x^k (1 − x)^{n−k} is the well-known Bernstein polynomial basis, and B_n[f](x) ≜ Σ_{k=0}^n f(k/n) C(n,k) x^k (1 − x)^{n−k} is the so-called Bernstein polynomial for the function f. Bernstein in 1912 [61] provided an insightful constructive proof of the Weierstrass theorem on approximation of continuous functions using polynomials, by showing that the Bernstein polynomial of any continuous function converges uniformly to that function. From a functional-analytic viewpoint, the Bernstein polynomial is an operator that maps a continuous function f to another continuous function B_n[f]. This operator is linear in f, and is positive because B_n[f] is pointwise nonnegative whenever f is pointwise nonnegative. Evidently, bounding the approximation error incurred by the Bernstein polynomial is equivalent to bounding the bias of the MLE F(P_n). Fortunately, the theory of approximation using positive linear operators [62] provides us with advanced tools that are very effective for the bias analysis our problem calls for. A century ago, probability theory served Bernstein in breaking new ground in function approximation. It is therefore very satisfying that advancements in the latter have come full circle to help us better understand probability theory and statistics. We briefly review the general theory of approximation using positive linear operators below.
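The identity between the bias of the plug-in estimator and the Bernstein approximation error can be checked numerically. The following Python sketch does so for a binary distribution (S = 2) and f(x) = −x ln x; the names are ours.

```python
import math

def f(x):
    """f(x) = -x ln x, with the convention f(0) = 0."""
    return 0.0 if x == 0 else -x * math.log(x)

def bernstein(g, p, n):
    """Bernstein polynomial B_n[g](p) = sum_k g(k/n) C(n,k) p^k (1-p)^(n-k)."""
    return sum(g(k / n) * math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n + 1))

n, p = 10, 0.3
P = [p, 1 - p]                       # binary distribution, S = 2

# Bias of the plug-in entropy estimator, computed directly from the
# binomial distribution of X_1 ~ Bin(n, p) (note X_2 = n - X_1):
direct_bias = sum(
    math.comb(n, k) * p**k * (1 - p)**(n - k) * (f(k / n) + f((n - k) / n))
    for k in range(n + 1)
) - (f(p) + f(1 - p))

# The same bias via the Bernstein operator, symbol by symbol:
bernstein_bias = sum(bernstein(f, pi, n) - f(pi) for pi in P)
print(direct_bias, bernstein_bias)   # the two agree; the bias is negative
```

The bias is negative here because f is concave, so B_n[f] lies below f on (0, 1).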
IV-B1 Approximation theory using positive linear operators
Generally speaking, for any estimator θ̂ in a parametric model indexed by θ, the map F ↦ E_θ F(θ̂) is a positive linear operator, and analyzing the bias of the plug-in estimator F(θ̂) is equivalent to analyzing the approximation properties of this positive linear operator in approximating F(θ). Hence, analyzing the bias of any plug-in estimator for functionals of parameters from any parametric family can be recast as a problem of approximation theory using positive linear operators [62].
Conversely, given a positive linear operator L that acts on the space of continuous functions, the Riesz–Markov–Kakutani representation theorem implies that under mild conditions the operator may be written as
(51)  L[f](θ) = ∫ f(x) μ_θ(dx),
where {μ_θ} is a set of probability measures parametrized by θ, which may be viewed as a parameter. If we view a random variable X ~ μ_θ as a summary statistic to plug into the functional f, the positive linear operator L is nothing but the expectation of the plug-in estimator f(X). In this sense, there exists a one-to-one correspondence between essentially the most general bias analysis problem in statistics and the most general positive linear operator approximation problem in approximation theory.
After more than a century of active research on approximation using positive linear operators, we now have highly nontrivial tools for positive linear operators of functions on one-dimensional compact sets, but the general theory for vector-valued multivariate functions on noncompact sets is still far from complete [62]. In the next subsection, we present a sample of existing results on approximation using positive linear operators, corollaries of which will be used to analyze the bias of the MLE for two examples: H(P) and F_α(P).
IV-B2 Some general results in bias analysis
First, some elementary approximation-theoretic concepts need to be introduced in order to characterize the degree of smoothness of functions. For an interval A ⊂ R, the first-order modulus of smoothness of f: A → R is defined as [62]
(52)  ω¹(f, t) ≜ sup { |f(u) − f(v)| : u, v ∈ A, |u − v| ≤ t }.
The second-order modulus of smoothness [62] is defined as
(53)  ω²(f, t) ≜ sup_{0 < h ≤ t} sup_{x−h, x+h ∈ A} |f(x + h) − 2f(x) + f(x − h)|.
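The moduli of smoothness can be estimated numerically on a grid. The following Python sketch approximates the second-order modulus for f(x) = −x ln x, the function underlying entropy estimation; the grid-based routine is ours and only lower-bounds the true supremum.

```python
import math

def f_ent(x):
    """f(x) = -x ln x on [0, 1], with f(0) = 0."""
    return 0.0 if x == 0 else -x * math.log(x)

def omega2(g, t, grid=400):
    """Grid estimate of the second-order modulus of smoothness
    omega^2(g, t) = sup_{0<h<=t} sup_{x-h, x+h in [0,1]} |g(x+h) - 2 g(x) + g(x-h)|."""
    best = 0.0
    for j in range(1, grid + 1):
        h = t * j / grid
        for i in range(grid + 1):
            x = i / grid
            if x - h >= 0.0 and x + h <= 1.0:
                best = max(best, abs(g(x + h) - 2.0 * g(x) + g(x - h)))
    return best

print(omega2(f_ent, 0.05), omega2(f_ent, 0.1))   # nondecreasing in t
print(omega2(lambda x: 2.0 * x - 1.0, 0.1))      # vanishes for affine functions
```

Vanishing on affine functions is what makes the second-order modulus a finer smoothness gauge than the first-order one.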
Ditzian and Totik [63] introduced a class of moduli of smoothness which proves to be extremely useful in characterizing the incurred approximation errors. For simplicity, for functions defined on [0, 1] and φ(x) ≜ √(x(1 − x)), the first-order Ditzian–Totik modulus of smoothness is defined as
(54)  ω_φ¹(f, t) ≜ sup_{0 < h ≤ t} sup_x |f(x + hφ(x)/2) − f(x − hφ(x)/2)|,
and the second-order Ditzian–Totik modulus of smoothness is defined as
(55)  ω_φ²(f, t) ≜ sup_{0 < h ≤ t} sup_x |f(x + hφ(x)) − 2f(x) + f(x − hφ(x))|,
where the suprema are taken over x such that all arguments of f lie in [0, 1].