On the Estimation of Information Measures of Continuous Distributions
Abstract
The estimation of information measures of continuous distributions based on samples is a fundamental problem in statistics and machine learning. In this paper, we analyze estimates of differential entropy in finite-dimensional Euclidean space, computed from a finite number of samples, when the probability density function belongs to a predetermined convex family of densities. First, estimating differential entropy to any accuracy is shown to be infeasible if the differential entropy of the densities in this family is unbounded, clearly showing the necessity of additional assumptions. Subsequently, we investigate sufficient conditions that enable confidence bounds for the estimation of differential entropy. In particular, we provide confidence bounds for simple histogram-based estimation of differential entropy from a fixed number of samples, assuming that the probability density function is Lipschitz continuous with known Lipschitz constant and known, bounded support. Our focus is on differential entropy, but we provide examples that show that similar results hold for mutual information and relative entropy as well.
1 Introduction
Many learning tasks, especially in unsupervised/semi-supervised settings, use information-theoretic quantities, such as relative entropy, mutual information, differential entropy, or other divergence functionals, as target functions in numerical optimization problems [19, 20, 25, 38, 16, 8, 5]. Furthermore, estimators for information-theoretic quantities are useful in other fields, such as neuroscience [26]. As these quantities typically cannot be computed directly, surrogate functions, either upper/lower bounds or estimates, are used in their place. Here, we investigate the problem of estimating differential entropy using a finite number of samples. Throughout, we restrict our attention to differential entropy, but similar results also hold for conditional differential entropy, mutual information and relative entropy (cf. Section 4).
1.1 Our Contribution
The contributions of this work can be summarized as follows:

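First, we explore the following basic but fundamental question: Given a fixed number of samples from a probability density function (pdf) that belongs to a family of pdfs on Euclidean space, is it possible to obtain an estimate of the differential entropy that satisfies a confidence bound (1), i. e., that deviates from the true differential entropy by more than a fixed accuracy only with a fixed, small probability, for every pdf in the family? In Section 2, we show that the answer to this question is negative (Proposition 2) if the family is convex and the differential entropy of the pdfs in the family is unbounded.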

Subsequently, we investigate sufficient conditions on the class of pdfs that enable estimation of differential entropy with such a confidence bound, and in Section 3 (Theorem 3) we show that a known, bounded support together with an $L$-Lipschitz continuous pdf for fixed $L$ suffices. (Footnote 1: These assumptions ensure that the differential entropy of the pdfs in the class is bounded. A known, bounded support bounds the differential entropy from above, and Lipschitz continuity bounds it from below.) For a simple histogram-based estimator we explicitly compute a relation between probability of correct estimation, accuracy, dimension, sample size, and Lipschitz constant. It is shown that estimation becomes impossible if either assumption is removed.

Finally, in Section 4 we obtain impossibility results, similar to Proposition 2, for the estimation of other information measures.
1.2 Previous Work
The problem of estimating information measures from a finite number of samples is as old as information theory itself. Shortly after his seminal paper [31], Shannon worked on estimating the entropy rate of English text [32]. Since then, there have been numerous works on the estimation of information measures, such as entropy, mutual information, and differential entropy. There are many different approaches for estimating information measures, including kernel-based methods, nearest-neighbor methods, methods based on sample distances, as well as multiple variants of plug-in estimates. Many estimators have been shown to be consistent and/or asymptotically unbiased under various constraints, e. g., in [17, 1, 12, 21, 36]. An excellent overview can be found in [6].
In [36], rate-of-convergence results as well as a central limit theorem are provided for differential entropy and Rényi entropy. However, the confidence bounds and the constants involved in the rate-of-convergence results depend on the underlying distribution, which is typically unknown. Similarly, [18] obtains a rate-of-convergence result, assuming a Lipschitz ball smoothness assumption combined with known compact support, but the involved constants remain unspecified. In a similar spirit, [23] provides asymptotic results for the estimation of differential entropy in two dimensions, when certain smoothness conditions are satisfied and the pdf is bounded away from zero. The related task of estimating relative entropy is studied, e. g., in [39, 27], and partition-based estimation of mutual information is analyzed in [9]. While [39] only shows consistency, convergence rates are obtained in [27, Th. 2], but again the constants involved remain unspecified.
In contrast to our present work, the existing results for the estimation of differential entropy mentioned above fall short when addressing the practical problem of a finite sample size. However, some results are available in a more general context. In [35], a finite-sample analysis is conducted. Similar to our approach (cf. Section 3), the authors of [35] assume a fixed support, but assume Hölder continuity instead of Lipschitz continuity. Additionally, strict positivity on the interior of the support is required, and the constants bounding the approximation error depend on the underlying, unknown distribution. These additional complications are likely due to the extended scope, as [35] is not focused on differential entropy, but on the expectation of arbitrary functionals of the probability density. The same authors also provide a finite-sample analysis for the estimation of Rényi divergence under similarly strong conditions in [34].
There are several negative results, which clearly show that information measures are hard to estimate from a finite number of samples. It was shown in [3, Th. 4] that rate-of-convergence results cannot be obtained for any consistent estimator of entropy on a countable alphabet; such results were obtained only when imposing various assumptions on the true distribution. More negative results on the estimation of entropy and mutual information can be found in [28]. In fact, obtaining confidence bounds for information measures from samples is inherently difficult and requires regularity assumptions about the involved distributions, which are not subject to empirical test. In the seminal work of [4] as well as subsequent works [13, 10, 29, 11] (and references therein), such necessary conditions for the estimation of statistical parameters with confidence bounds are discussed in great detail and generality. The results of [10, 29] can be applied to differential entropy estimation and yield a result very similar to Proposition 2, essentially showing that differential entropy cannot be bounded using a finite number of samples, unless additional assumptions on the distribution are made.
Especially in the context of unsupervised and semi-supervised machine learning, it recently became popular to use variational bounds or estimates of information measures as part of the loss function for training neural networks [7, 30, 19]. Criticism of this approach, in particular of the use of variational bounds, has been voiced [24]. The current paper has a more general scope, dealing with the estimation problem of information measures in general, not limited to specific variational bounds or techniques.
The information flow in neural networks is also a recent topic of investigation. In [33], an argument for successive compression in the layers of a deep neural network is given, along the lines of the information bottleneck method [37]. While flaws in this argument were pointed out [2, 22], the authors of [15] found that a clustering phenomenon might elucidate the behavior of deep neural networks. These insights were obtained by estimating the differential entropy of a sum of two random vectors, one sub-Gaussian and the other an independent Gaussian vector. This is similar in spirit to the work conducted here; however, our assumption of compact support is replaced by the sub-Gaussianity assumption. (Footnote 2: Similar to our assumption of an arbitrary but fixed compact support in Section 3, the constant in the definition of the sub-Gaussian is assumed to be fixed in [14, eq. (1)].) Note that the pdf of this sum is Lipschitz continuous with a fixed Lipschitz constant, so [15] is implicitly also using a Lipschitz assumption.
2 The Nonexistence of Confidence Sets
Consider a family of pdfs on Euclidean space, each of which has finite differential entropy.
Suppose we observe a fixed number of i.i.d. copies of some random vector whose pdf belongs to this family and want to obtain an estimate of its differential entropy from these samples. Such an estimator is a function that maps the samples to a real number approximating the differential entropy. Its accuracy can be measured by a confidence interval, a widely used tool in statistical practice for indicating the precision of point estimators. For a given error probability, we would like the estimate to deviate from the true differential entropy by more than a fixed amount only with probability less than this error probability, i. e., a confidence interval of fixed size with the corresponding confidence. However, there is no free lunch when estimating differential entropy, as evidenced by the following result, a corollary of a more general result in [10], here specialized to a bound on differential entropy. It is based on the abstract notion of a dense graph condition (DGC). (Footnote 3: A functional satisfies the DGC over a family if its graph over that family is dense in its own epigraph [10, eq. (2.4)].)
Theorem 1 ([10, Th. 2.1]).
Assume that satisfies the DGC over and define , where, e. g., . If for any , , then
(2) 
A similar result follows from [29, Prop. 3.1].
We will not work with the dense graph condition, but make two practical assumptions: the family is convex and the differential entropy of the pdfs in the family is unbounded (either from above or from below). Under these assumptions we show that, for any estimator and any prescribed accuracy, there is a pdf in the family for which the estimate lies within this accuracy of the true differential entropy only with small probability, i. e., the estimate is far from the true value with high probability. Fundamentally, this follows from the fact that the family contains pdfs with a large difference in differential entropy, which cannot be accurately distinguished based on samples. Similar results hold true for mutual information and relative entropy and are given in Section 4.
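The intuition behind this statement can be illustrated numerically. The following Python sketch is not the construction used in the proof; it assumes, purely for illustration, a mixture of two densities with disjoint supports, for which the differential entropy of the mixture is available in closed form. The mixing weight `delta`, the sample size `n_samples`, and the two component entropies are hypothetical values.

```python
import math


def binary_entropy(delta: float) -> float:
    """Shannon entropy (in nats) of a Bernoulli(delta) random variable."""
    if delta <= 0.0 or delta >= 1.0:
        return 0.0
    return -delta * math.log(delta) - (1.0 - delta) * math.log(1.0 - delta)


def mixture_entropy(h0: float, h1: float, delta: float) -> float:
    """Differential entropy (nats) of the mixture (1 - delta) * p0 + delta * p1.

    The formula is exact only if p0 and p1 have disjoint supports, because
    the mixture label is then a deterministic function of the sample.
    """
    return binary_entropy(delta) + (1.0 - delta) * h0 + delta * h1


def prob_no_sample_from_p1(delta: float, n: int) -> float:
    """Probability that all n i.i.d. samples are drawn from the component p0."""
    return (1.0 - delta) ** n


# Hypothetical numbers, chosen only for illustration.
n_samples = 1000  # sample size
delta = 1e-4      # weight of the rare component p1
h0 = 0.0          # p0: uniform on [0, 1], differential entropy 0 nats
h1 = 1e6          # p1: uniform on an interval of length exp(1e6), disjoint from [0, 1]

entropy_shift = mixture_entropy(h0, h1, delta) - h0
p_indistinguishable = prob_no_sample_from_p1(delta, n_samples)

print(f"entropy of mixture minus entropy of p0: {entropy_shift:.3f} nats")
print(f"probability that no sample comes from p1: {p_indistinguishable:.3f}")
```

Under these assumed numbers, the mixture and p0 differ by more than 100 nats in differential entropy, yet with probability of roughly 0.9 the samples contain no draw from the rare component, so any estimator sees data that could equally well have come from p0.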
Proposition 2.
Let be a convex family of pdfs with unbounded differential entropy, i. e., for any and , we have as well as . Then, for any pair of constants , there exists a continuous random vector , satisfying
(3) 
Remark 1.
Before proceeding with the proof of Proposition 2, we note that this result could be proved as a consequence of Theorem 1. However, this would require showing that our conditions imply the DGC. Furthermore, the proof of [10, Th. 2.1] itself hinges on deep statistical results, and thus we opted to provide a short, self-contained proof.
Proof of Proposition 2.
The function , constants , and the sample size are arbitrary, but fixed. Choose an arbitrary and let . Then fix , such that , where are i.i.d. copies of . Furthermore, let be a Bernoulli random variable with parameter , independent of , where . Choose such that . By our assumption , we can find with such that .
Define , which yields where denotes mutual information. For convenience we use , where and define the event . By the union bound, we have and obtain
(4)  
(5)  
(6)  
(7) 
We thus found such that
(8)  
(9)  
(10)  
(11)  
(12)  
(13) 
Remark 2.
Proposition 2 shows that in order to obtain confidence bounds, one needs to make assumptions about the underlying distribution. However, as pointed out in [10, p. 1395], when making these assumptions, one uses information external to the samples.
Remark 3.
Note that the family of all pdfs with support satisfies the requirements of Proposition 2. It also satisfies the DGC, but it is not strongly nonparametric, as defined in [10, p. 1395].
3 Lipschitz Density Assumption
One way to avoid the problems outlined in Section 2 is to impose additional assumptions on the underlying probability distribution that bound the differential entropy from above and from below. We will show that the differential entropy of an $L$-Lipschitz continuous pdf with fixed, known $L$ and known, compact support can be well approximated from samples. In the following, let be supported on , i. e., , where denotes the Lebesgue measure on . (Footnote 4: Any known compact support suffices. An affine transformation then yields , while possibly resulting in a different Lipschitz constant.) The pdf of is assumed to be $L$-Lipschitz continuous on with some fixed $L$, where is equipped with the norm . (Footnote 5: The norm is only chosen to facilitate subsequent computations. By the equivalence of norms on , any norm suffices.) Hence,
(14) 
Given i.i.d. copies of , let be distributed according to the empirical distribution of , i. e., , where is a uniform random variable on . Let the discrete random vector be the element-wise quantization of , where is the step discretization of for some . Additionally, define the continuous random vector , i. e., independent uniform noise is added. Note also that , where denotes Shannon entropy.
We will estimate differential entropy by , i. e., the Shannon entropy of the discretized and binned samples with a correction factor.
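A histogram-based estimator of this kind can be sketched in a few lines of Python. The sketch below rests on assumptions that are not taken from the text: the support is the unit cube, every coordinate is split into `bins_per_dim` equal bins, and the correction is the logarithm of the cell volume, which is the standard relation between the Shannon entropy of a quantized variable and the differential entropy of the corresponding piecewise constant density. The function name and all parameter values are hypothetical.

```python
import math
import random
from collections import Counter
from typing import Sequence


def histogram_entropy(samples: Sequence[Sequence[float]], bins_per_dim: int) -> float:
    """Histogram estimate (nats) of differential entropy on the unit cube.

    Each sample is a point in [0, 1]^K. Every coordinate is quantized into
    bins_per_dim equal bins, the Shannon entropy of the resulting empirical
    cell distribution is computed, and the log cell volume is added.
    """
    n = len(samples)
    k = len(samples[0])
    # Count samples per cell; the min() places the boundary value 1.0 in the last bin.
    counts = Counter(
        tuple(min(int(x * bins_per_dim), bins_per_dim - 1) for x in point)
        for point in samples
    )
    shannon = -sum((c / n) * math.log(c / n) for c in counts.values())
    # Correction term: log of the cell volume (1 / bins_per_dim)^K.
    return shannon - k * math.log(bins_per_dim)


# Usage on a one-dimensional Beta(2, 2) density 6x(1 - x) on [0, 1],
# which is Lipschitz continuous with constant 6.
random.seed(0)
data = [(random.betavariate(2.0, 2.0),) for _ in range(50_000)]
estimate = histogram_entropy(data, bins_per_dim=32)
true_value = -math.log(6.0) + 5.0 / 3.0  # closed-form differential entropy of Beta(2, 2)

print(f"estimate: {estimate:.4f} nats, true value: {true_value:.4f} nats")
```

The number of bins trades quantization error against sampling error: more bins reduce the former but leave fewer samples per cell.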
In the following, we shall also use the two constants
(15)  
(16) 
Theorem 3.
For and any , we have with probability greater than that
(17a)  
(17b)  
(17c) 
The proof will be given in Appendix A.
Remark 4.
Of the three error terms in Equations 17a, 17b and 17c, the terms in Equations 17a and 17c constitute the bias and the term in Equation 17b is a variance-like error term. While the variance-like term in Equation 17b vanishes as , the term in Equation 17a does not depend on the sample size as it merely measures the error incurred due to the quantization , which is bounded by the Lipschitz constraint and approaches zero as . The final term in Equation 17c results from ensuring that samples suffice to suitably approximate the empirical distribution over quantization steps. Thus, it ties the quantization to the sample size and approaches zero if . In total, the right-hand side (RHS) of Equation 17 approaches zero for provided that .
Remark 5.
Theorem 3 should be regarded as a proof of concept rather than a practical tool for performing differential entropy estimation. While analytically tractable, the estimation strategy is crude, and the bounds are known to be loose. This holds especially for the term in Equation 17b, which is a completely universal bound, as pointed out in [28, p. 1200].
Remark 6.
We want to note that requiring both a fixed Lipschitz constant and a known bounded support is necessary. Consider, for instance, the set of Lipschitz continuous pdfs on a fixed, known support with arbitrary Lipschitz constant, or the set of pdfs that are Lipschitz continuous with a fixed Lipschitz constant but have arbitrary, bounded support. Both families satisfy the conditions of Proposition 2, i. e., they are convex and have unbounded differential entropy.
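The unboundedness claimed in this remark can be checked on a concrete one-dimensional example, which we add here for illustration only. The sketch assumes symmetric triangular ("tent") densities of a given width; their differential entropy and Lipschitz constant are available in closed form. Narrow tents fit into a fixed unit interval but need an ever larger Lipschitz constant, while wide tents satisfy any fixed Lipschitz constant but need an ever larger support. Convexity of the two families is not examined by this sketch.

```python
import math


def tent_entropy(width: float) -> float:
    """Differential entropy (nats) of a symmetric triangular density of the given width."""
    return 0.5 + math.log(width / 2.0)


def tent_lipschitz(width: float) -> float:
    """Lipschitz constant of the tent density: peak height 2/width over half-width width/2."""
    return 4.0 / width**2


# Fixed support [0, 1], arbitrary Lipschitz constant: entropy is unbounded from below.
for width in (1.0, 1e-1, 1e-2, 1e-3):
    print(f"width={width:g}: L={tent_lipschitz(width):g}, h={tent_entropy(width):.3f} nats")

# Fixed Lipschitz constant (any L >= 4), arbitrary bounded support:
# entropy is unbounded from above.
for width in (1.0, 1e1, 1e2, 1e3):
    print(f"width={width:g}: L={tent_lipschitz(width):g}, h={tent_entropy(width):.3f} nats")
```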
In principle, Theorem 3 also allows for the approximation of mutual information with a confidence bound. Let $X$ and $Y$ be two random vectors, each with known, bounded support. Assuming that their joint pdf is Lipschitz continuous, it is clear that the marginal pdfs are Lipschitz continuous as well. Thus, Theorem 3 can be used to approximate all three terms in
\[ I(X;Y) = h(X) + h(Y) - h(X,Y). \tag{18} \]
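As a rough illustration of this decomposition, the following Python sketch combines three histogram entropy estimates into a mutual information estimate. It assumes, beyond what the text states, that both vectors live on unit cubes and that the same number of bins per coordinate is used for all three terms; in that case the volume corrections of the three differential entropy estimates cancel and only Shannon entropies of the quantized samples remain. All function names and parameter values are hypothetical.

```python
import math
import random
from collections import Counter


def plugin_entropy(labels) -> float:
    """Shannon entropy (nats) of the empirical distribution of hashable labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def quantize(point, bins_per_dim: int):
    """Map a point in the unit cube to the index tuple of its histogram cell."""
    return tuple(min(int(v * bins_per_dim), bins_per_dim - 1) for v in point)


def histogram_mutual_information(xs, ys, bins_per_dim: int) -> float:
    """Estimate I(X;Y) as H(Xq) + H(Yq) - H(Xq, Yq) for quantized samples.

    With equal bin widths the log-volume corrections of the three
    differential entropy estimates cancel exactly.
    """
    qx = [quantize(x, bins_per_dim) for x in xs]
    qy = [quantize(y, bins_per_dim) for y in ys]
    return plugin_entropy(qx) + plugin_entropy(qy) - plugin_entropy(list(zip(qx, qy)))


random.seed(1)
n = 20_000
xs = [(random.random(),) for _ in range(n)]

# Independent pair: the true mutual information is 0.
ys_indep = [(random.random(),) for _ in range(n)]
mi_indep = histogram_mutual_information(xs, ys_indep, bins_per_dim=8)

# Dependent pair: Y = (X + U/2) mod 1 with U uniform, true mutual information log 2.
# The joint density is discontinuous, hence not Lipschitz, so no guarantee is implied.
ys_dep = [((x[0] + 0.5 * random.random()) % 1.0,) for x in xs]
mi_dep = histogram_mutual_information(xs, ys_dep, bins_per_dim=8)

print(f"independent: {mi_indep:.4f} nats, dependent: {mi_dep:.4f} nats")
```

In the dependent example the coarse quantization discards part of the dependence, so the estimate stays below the true value; this is the quantization bias that the Lipschitz assumption is meant to control.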
4 Estimation of other Measures
In this Section, we show that statements similar to Proposition 2 also hold for mutual information and relative entropy. For simplicity we will not assume for some family of probability density functions, but merely require . Only proof sketches are provided, as the examples in this Section are similar to the proof of Proposition 2.
Here we show that in general, it is not possible to accurately estimate mutual information and relative entropy from samples .
4.1 Mutual Information
For any , let be a measurable function, which represents an estimate of the mutual information from . For convenience we use . Let , , and be independent random variables. Define
(19) 
We have
(20)  
(21)  
(22) 
The random vectors [] are i.i.d. realizations of []. For any , we can find such that . Letting , we have . Thus, when choosing ,
(23)  
(24)  
(25)  
(26) 
We may choose . Then, for arbitrary and , we found , and , such that , yet .
Remark 7.
Note that [7, Th. 3] claims a confidence bound for mutual information that, together with the approximation result [7, Lem. 1], seemingly contradicts our result. However, the confidence bound proved in [7, Th. 3] requires strong conditions on the functions (Footnote 6: Here we use the notation of [7].), and [7, Lem. 1] does not necessarily hold under these conditions. Moreover, both approximation results [7, Lem. 1 and Lem. 2] do not hold uniformly for a family of distributions, but implicitly assume a fixed, underlying distribution. This is especially evident in [7, Lem. 2], which also seemingly contradicts our result, when assuming that the optimal function is in the family . However, this apparent contradiction is resolved by noting that the chosen depends on the underlying, true distribution.
4.1.1 One Discrete Random Variable
In the following we show that a similar result holds if has a fixed finite alphabet, say . Again, for every , let be an estimator that estimates from . Note that the result for continuous cannot carry over unchanged as we have . We shall assume that is consistent in the sense that in probability as , where we use .
Let and be independent random variables. Fix and by consistency find such that for all . In the following consider fixed. Fix and , and define the quantization . The random variable is simply . We use the notation to highlight that depends on the particular choice of and wish to show that for at least one .
Assume to the contrary, that for all . Let be the event that two elements of fall in the same “bin.” Note that is the quantization of . For large enough, we obtain
(27)  
(28)  
(29) 
Defining , independent of , we obtain for small enough
(30)  
(31)  
(32)  
(33)  
(34)  
(35)  
(36) 
leading to a contradiction.
To summarize, for arbitrary and large enough, there exists such that , but clearly is a deterministic function of and hence .
4.2 Relative Entropy
Let and be two continuous pdfs (w.r.t. ) and , be i.i.d. random variables distributed according to and , respectively. For any , let be an estimator that estimates from . For convenience we use . Let , be two independent i.i.d. vectors with components uniformly distributed on . For an arbitrary we can find such that . Consider and an arbitrary .
Define the pdfs
(37)  
(38) 
for .
With , where , we have , with the function
(39)  
(40)  
(41) 
Let and be the events that every component of and , respectively, is negative. Then,
(42)  
(43) 
Choose and such that and . We can now bound the probability
(44)  
(45)  
(46)  
(47) 
In summary, for an estimator and any and , we can find distributions , and , such that , even though .
5 Discussion and Perspectives
We showed that, under mild assumptions on the family of allowed distributions, differential entropy cannot be reliably estimated solely based on samples, no matter how many samples are available. In particular, as first noted in [10], no nontrivial bound or estimate of an information measure can be obtained based only on samples. External information about the regularity of the underlying probability distribution needs to be taken into account. However, such regularity assumptions are not subject to empirical verification and thus the existence of statistical guarantees for an empirical estimate cannot be empirically tested. This shows that researchers should take great care when approximating or bounding information measures, and specifically explore the necessary assumptions for the underlying distribution.
Regarding the use of information measures in machine learning, we note that our results apply to all estimators of information measures. In particular, empirical versions of variational bounds cannot provide estimates of information measures with high reliability in general.
It would be interesting to investigate the type of assumptions on the underlying distributions that may hold in typical machine learning setups. However, as pointed out previously, these properties cannot be deduced from data, but must result from the model under consideration. On a related note, it would be interesting to see whether the confidence bounds for differential entropy estimation under bounded support and Lipschitz condition from Section 3 carry over to empirical versions of variational bounds. Extensions of these results to other information measures, e. g., Rényi entropy, Rényi divergences, or other divergences, could also be of particular interest for future work.
Acknowledgments
The authors are very grateful to Prof. Elisabeth Gassiat for pointing out the connection between the present work and the reference [10].
Appendix A Proof of Theorem 3
We shall first introduce auxiliary random variables, which are depicted in Figure 1. Let be the element-wise discretization . The continuous random vector is obtained by adding independent uniform random noise. Let be the pdf of . It is straightforward to see that is distributed according to the empirical distribution of i.i.d. copies of . Also note that .
In order to prove Theorem 3 we use the triangle inequality twice to obtain
(48)  
(49)  
(50)  
(51) 
noting that . Note that is a random quantity that depends on . We thus split the bound in three terms, where the first term in Equation 51 is variance-like and the second and third terms constitute the bias. In the remainder of this Appendix, we will complete the proof by showing that all three terms in Equation 51 can be bounded as follows. First, we have
(52)  
(53) 
And with probability greater than we also have
(54) 
As is distributed according to the empirical distribution of i.i.d. copies of , on an alphabet of size , the inequalities Equations 54 and 52 follow directly from the following well-known lemma concerning the estimation of (discrete) Shannon entropy.
Lemma 4 ([28, eq. (3.4), and Prop. 1] and [3, Remark iii, p.168]).
Let be a random variable on and distributed according to the empirical measure of i.i.d. copies of . We then have , where
(55) 
and for any , with probability greater than ,
(56) 
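The bias statement of this lemma can be checked by simulation. The Python sketch below assumes that the bias bound has the form commonly cited from [28, Prop. 1], namely that the expected plug-in entropy estimate lies between the true entropy minus log(1 + (M - 1)/N) and the true entropy, for alphabet size M and sample size N. The symbols M and N, the uniform source distribution, and all numerical values are illustrative choices and are not taken from the text.

```python
import math
import random
from collections import Counter


def plugin_entropy(samples) -> float:
    """Shannon entropy (nats) of the empirical distribution of the samples."""
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in Counter(samples).values())


alphabet_size = 20  # M (illustrative)
n_samples = 50      # N (illustrative)
n_trials = 2000     # Monte Carlo repetitions

random.seed(0)
true_entropy = math.log(alphabet_size)  # the source is uniform on the alphabet

# Average the plug-in estimate over many independent sample sets.
mean_estimate = sum(
    plugin_entropy(random.choices(range(alphabet_size), k=n_samples))
    for _ in range(n_trials)
) / n_trials

bias = mean_estimate - true_entropy
bias_lower_bound = -math.log(1.0 + (alphabet_size - 1) / n_samples)

print(f"simulated bias: {bias:.4f} nats")
print(f"assumed lower bound on the bias: {bias_lower_bound:.4f} nats")
```

The simulated bias is negative, reflecting the systematic underestimation of entropy by the plug-in estimator, and it stays above the assumed lower bound.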
In order to show Equation 53, we will first obtain some preliminary results and then conclude the proof in Lemma 10. We start by bounding the difference between and the approximation using the following auxiliary results.
Lemma 5.
Let be an arbitrary Lipschitz continuous function and assume for some . (Footnote 7: We use the notation .) Then
(57) 
In particular, for , , we have .
Proof.
For we have and hence
(58)  
(59)  
(60) 
Lemma 6.
For an $L$-Lipschitz continuous pdf on and the pdf of , as defined above, we have for every .
Proof.
Let and . The function is constant on and given by for all , where . Thus, since , we obtain
(61)  
(62)  
(63) 
where we applied Lemma 5 to in Equation 63. ∎
Lemma 7.
If is an $L$-Lipschitz continuous pdf on , then .
Proof.
Let and define the ball with radius , centered at . We then have and hence,
(64)  
(65)  
(66)  
(67) 
where the fact that for all is used in Equation 66. ∎
Using the previous lemmas to bound the distance between and , the following two results will allow us to bound the difference of the differential entropies.
Lemma 8.
For , , and we have
(68) 
Proof.
In the following, we assume w.l.o.g. that . If , then is monotonically decreasing in and thus maximal at and hence, Equation 68 follows. If, on the other hand, , then necessarily . Define the function , and . Note that by the mean value theorem there are and such that and . Inequality Equation 68 then follows by observing that whenever and . ∎
Lemma 9.
Let and be two pdfs supported on with finite differential entropies. Assume that for all we have and , and that holds. Then,
(69) 
Proof.
Define and for . We have , as well as
(70)  
(71)  
(72)  
(73)  
(74) 
where Lemma 8 was applied in Equation 73. ∎
We can now finish the proof of Theorem 3 by showing Equation 53.
Lemma 10.
If we have
(75) 
Proof.
By Lemma 6, and, by Lemma 7, . We can thus apply Lemma 9 with and provided that , which is equivalent to . Inserting and in Equation 69 proves the result. ∎
References
 [1] I. Ahmad and P.E. Lin. A nonparametric estimation of the entropy for absolutely continuous distributions. IEEE Trans. Inf. Theory, 22(3):372–375, May 1976.
 [2] R. A. Amjad and B. C. Geiger. Learning representations for neural network-based classification using the information bottleneck principle. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019. to appear.
 [3] A. Antos and I. Kontoyiannis. Convergence properties of functional estimates for discrete distributions. Random Structures & Algorithms, 19(3-4):163–193, Nov. 2001.
 [4] R. R. Bahadur and L. J. Savage. The nonexistence of certain statistical procedures in nonparametric problems. Ann. Math. Statist., 27(4):1115–1122, 1956.
 [5] D. Barber and F. Agakov. The IM algorithm: A variational approach to information maximization. In NIPS’03, volume 16 of Advances in neural information processing systems, pages 201–208, Cambridge, MA, USA, Dec. 2003.
 [6] J. Beirlant, E. J. Dudewicz, L. Györfi, and E. C. Van der Meulen. Nonparametric entropy estimation: An overview. International Journal of Mathematical and Statistical Sciences, 6(1):17–39, 1997.
 [7] M. I. Belghazi, A. Baratin, S. Rajeshwar, S. Ozair, Y. Bengio, A. Courville, and D. Hjelm. Mutual information neural estimation. In ICML’18, volume 80 of PMLR, pages 531–540, Stockholm, Sweden, July 2018.
 [8] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS’16, volume 29 of Advances in neural information processing systems, pages 2172–2180, Barcelona, Spain, 2016.
 [9] G. A. Darbellay and I. Vajda. Estimation of the information by an adaptive partitioning of the observation space. IEEE Trans. Inf. Theory, 45(4):1315–1321, May 1999.
 [10] D. L. Donoho. One-sided inference about functionals of a density. Ann. Statist., 16(4):1390–1420, 1988.
 [11] D. L. Donoho and R. C. Liu. Geometrizing rates of convergence, II. Ann. Statist., 19(2):633–667, 1991.
 [12] W. Gao, S. Kannan, S. Oh, and P. Viswanath. Estimating mutual information for discrete-continuous mixtures. In NIPS’17, volume 30 of Advances in Neural Information Processing Systems, pages 5986–5997, Long Beach, CA, USA, 2017.
 [13] L. J. Gleser and J. T. Hwang. The nonexistence of 100(1−α)% confidence sets of finite expected diameter in errors-in-variables and related models. Ann. Statist., pages 1351–1362, 1987.
 [14] Z. Goldfeld, K. Greenewald, Y. Polyanskiy, and J. Weed. Convergence of smoothed empirical measures with applications to entropy estimation. arXiv preprint, 2019.
 [15] Z. Goldfeld, E. Van Den Berg, K. Greenewald, I. Melnyk, N. Nguyen, B. Kingsbury, and Y. Polyanskiy. Estimating information flow in deep neural networks. In ICML’19, volume 97 of PMLR, pages 2299–2308, Long Beach, CA, USA, 2019.
 [16] S. Gordon, H. Greenspan, and J. Goldberger. Applying the information bottleneck principle to unsupervised clustering of discrete and continuous image representations. In Proc. Ninth IEEE Int. Conf. Comput. Vision, pages 370–377, Nice, France, Oct. 2003.
 [17] L. Györfi and E. C. Van der Meulen. Density-free convergence properties of various estimators of entropy. Comput. Stat. Data Anal., 5(4):425–436, Sept. 1987.
 [18] Y. Han, J. Jiao, T. Weissman, and Y. Wu. Optimal rates of entropy estimation over Lipschitz balls. arXiv preprint, 2019.
 [19] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio. Learning deep representations by mutual information estimation and maximization. In ICLR, New Orleans, LA, USA, 2019.
 [20] W. Hu, T. Miyato, S. Tokui, E. Matsumoto, and M. Sugiyama. Learning discrete representations via information maximizing self-augmented training. In ICML’17, volume 70 of PMLR, pages 1558–1567, Sydney, Australia, 2017.
 [21] K. Kandasamy, A. Krishnamurthy, B. Poczos, L. Wasserman, and J. M. Robins. Nonparametric von Mises estimators for entropies, divergences and mutual informations. In NIPS’15, volume 28 of Advances in Neural Information Processing Systems, pages 397–405, Montréal, Canada, 2015.
 [22] A. Kolchinsky, B. D. Tracey, and S. V. Kuyk. Caveats for information bottleneck in deterministic scenarios. In ICLR, New Orleans, LA, USA, 2019.
 [23] H. Liu, L. Wasserman, and J. D. Lafferty. Exponential concentration for mutual information estimation with application to forests. In NIPS’12, volume 26 of Advances in Neural Information Processing Systems, pages 2537–2545, Lake Tahoe, NV, USA, 2012.
 [24] D. McAllester and K. Stratos. Formal limitations on the measurement of mutual information. In ICLR, New Orleans, LA, USA, 2019.
 [25] T. Miyato, S.-i. Maeda, M. Koyama, K. Nakae, and S. Ishii. Distributional smoothing with virtual adversarial training. In ICLR, San Juan, Puerto Rico, 2016.
 [26] I. Nemenman, W. Bialek, and R. D. R. Van Steveninck. Entropy and information in neural spike trains: Progress on the sampling problem. Phys. Rev. E, 69(5):056111, 2004.
 [27] X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Trans. Inf. Theory, 56(11):5847–5861, Nov. 2010.
 [28] L. Paninski. Estimation of entropy and mutual information. Neural Comput., 15(6):1191–1253, 2003.
 [29] J. Pfanzagl. The nonexistence of confidence sets for discontinuous functionals. J. Stat. Plan. Inference, 75(1):9–20, 1998.
 [30] B. Poole, S. Ozair, A. van den Oord, A. A. Alemi, and G. Tucker. On variational bounds of mutual information. In ICML’19, volume 97 of PMLR, pages 5171–5180, Long Beach, CA, USA, 2019.
 [31] C. E. Shannon. A mathematical theory of communication. Bell Syst. Tech. J., 27(3):379–423, July 1948.
 [32] C. E. Shannon. Prediction and entropy of printed English. Bell Syst. Tech. J., 30(1):50–64, 1951.
 [33] R. Shwartz-Ziv and N. Tishby. Opening the black box of deep neural networks via information. arXiv preprint, 2017.
 [34] S. Singh and B. Poczos. Generalized exponential concentration inequality for Rényi divergence estimation. In ICML’14, volume 32 of PMLR, pages 333–341, Beijing, China, 2014.
 [35] S. Singh and B. Poczos. Finite-sample analysis of fixed-k nearest neighbor density functional estimators. In NIPS’16, volume 29 of Advances in Neural Information Processing Systems, pages 1217–1225, Barcelona, Spain, 2016.
 [36] K. Sricharan, R. Raich, and A. O. Hero. k-nearest neighbor estimation of entropies with confidence. In Proc. IEEE Int. Symp. Inf. Theory (ISIT 2011), pages 1205–1209, Saint Petersburg, Russia, July 2011.
 [37] N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. In Annu. Allerton Conf. Commun., Control, and Comput., pages 368–377, Monticello, IL, Sept. 1999.
 [38] A. van den Oord, Y. Li, and O. Vinyals. Representation learning with contrastive predictive coding. arXiv preprint, 2018.
 [39] Q. Wang, S. R. Kulkarni, and S. Verdú. Divergence estimation of continuous distributions based on datadependent partitions. IEEE Trans. Inf. Theory, 51(9):3064–3074, Sept. 2005.