Estimating Differential Entropy under Gaussian Convolutions

Ziv Goldfeld, Kristjan Greenewald and Yury Polyanskiy This work was partially supported by the MIT-IBM Watson AI Lab. The work of Z. Goldfeld and Y. Polyanskiy was also supported in part by the National Science Foundation CAREER award under grant agreement CCF-12-53205, by the Center for Science of Information (CSoI), an NSF Science and Technology Center under grant agreement CCF-09-39370, and a grant from Skoltech–MIT Joint Next Generation Program (NGP). The work of Z. Goldfeld was also supported by the Rothschild postdoc fellowship.
This work will be presented in part at the 2018 IEEE International Conference on the Science of Electrical Engineering (ICSEE-2018), Eilat, Israel.
Z. Goldfeld and Y. Polyanskiy are with the Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139 US (e-mails: zivg@mit.edu, yp@mit.edu). K. Greenewald is with IBM Research, Cambridge, MA 02142 US (email: kristjan.h.greenewald@ibm.com)
Abstract

This paper studies the problem of estimating the differential entropy $h(S)$, where $S = X + Z$ and $X, Z$ are independent $d$-dimensional random variables with $Z \sim \mathcal{N}(0, \sigma^2 \mathrm{I}_d)$. The distribution of $X$ is unknown, but $n$ independently and identically distributed (i.i.d.) samples from it are available. The question is whether having access to samples of $X$, as opposed to samples of $S = X + Z$, can improve estimation performance. We show that the answer is positive.

More concretely, we first show that despite the regularizing effect of the noise, the number of required samples still needs to scale exponentially in $d$. This result is proven via a random-coding argument that reduces the question to estimating the Shannon entropy on an alphabet whose size is exponential in $d$. Next, for fixed $d$ and $\sigma$, it is shown that a simple plugin estimator, given by the differential entropy of the empirical distribution of the $X$ samples convolved with the Gaussian density, achieves an absolute-error loss of order $\frac{1}{\sqrt{n}}$, with a prefactor that depends exponentially on $d$. Note that the plugin estimator here amounts to the differential entropy of an $n$-mode, $d$-dimensional Gaussian mixture, for which we propose an efficient Monte Carlo computation algorithm. At the same time, estimating $h(S)$ via generic differential entropy estimators applied to samples from $S$ would only attain much slower rates, whose exponents deteriorate with $d$, despite the smoothness of the distribution of $S$.

As an application, which was in fact our original motivation for the problem, we estimate information flows in deep neural networks and discuss Tishby’s Information Bottleneck and the compression conjecture, among others.

Index Terms: Deep neural networks, differential entropy, estimation, minimax rates, mutual information.

I Introduction

This work studies a new nonparametric and high-dimensional differential entropy estimation problem. The goal is to estimate the differential entropy of a sum of two independent random variables, based on samples of one of them while knowing the distribution of the other. Specifically, let $X$ be an arbitrary (continuous / discrete / mixed) random variable with values in $\mathbb{R}^d$ and let $Z \sim \mathcal{N}(0, \sigma^2 \mathrm{I}_d)$ be an independent isotropic Gaussian. Upon observing $n$ i.i.d. samples $X^n = (X_1, \ldots, X_n)$ from $P_X$ and assuming $\sigma$ is known, we aim to estimate $h(X+Z) = h(P_X * \varphi_\sigma)$, where $\varphi_\sigma$ denotes the probability density function (PDF) of a centered isotropic Gaussian with parameter $\sigma$. (See the notation section at the end of the introduction for a precise definition of $h(X+Z)$ when $X$ is discrete / continuous / mixed.) To investigate the decision-theoretic fundamental limit, we consider the minimax absolute-error risk of differential entropy estimation:

\[
\mathcal{R}^*\big(n,\sigma,\mathcal{F}_d\big) \triangleq \inf_{\hat h}\, \sup_{P_X\in\mathcal{F}_d} \mathbb{E}\,\Big|h\big(P_X*\varphi_\sigma\big)-\hat h\big(X^n,\sigma\big)\Big|, \tag{1}
\]

where $\mathcal{F}_d$ is a nonparametric class of $d$-dimensional distributions and $\hat h$ is the estimator. The sample complexity is the smallest number of samples for which estimation within an additive gap $\eta$ is possible. The goal of this work is to study whether having access to 'clean' samples of $X$ can improve estimation performance compared to the case where only 'noisy' samples of $S = X + Z$ are available and the distribution of $X$ is unknown.
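Writing, for concreteness, $n^*(\eta,\sigma,\mathcal{F}_d)$ for the sample complexity at additive gap $\eta$ (illustrative notation), the verbal definition above can be expressed via the risk in (1) as
\[
n^*\big(\eta,\sigma,\mathcal{F}_d\big) \triangleq \min\Big\{ n \in \mathbb{N} \,:\, \mathcal{R}^*\big(n,\sigma,\mathcal{F}_d\big) \le \eta \Big\}.
\]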

I-A Motivation

Our motivation to study the considered differential entropy estimation problem stems from mutual information estimation over deep neural networks (DNNs). There has been a recent surge of interest in estimating the mutual information between selected groups of neurons in a DNN [1, 2, 3, 4, 5], partially driven by the Information Bottleneck (IB) theory [6, 7]. Attention mostly focuses on the mutual information $I(X;T)$ between the input feature $X$ and a hidden activity vector $T$. However, as explained in [5], this quantity is vacuous in deterministic DNNs (i.e., DNNs that, upon fixing their parameters, define a deterministic map from input to output) and becomes meaningful only when a mechanism for discarding information (e.g., noise) is integrated into the system. Such a noisy DNN framework was proposed in [5], where each neuron adds a small amount of Gaussian noise (i.i.d. across neurons) after applying the activation function. While the injection of noise renders $I(X;T)$ meaningful for studying deep learning, the concatenation of Gaussian noises and nonlinearities makes this mutual information impossible to compute analytically or even evaluate numerically. Specifically, the distribution of $T$ (marginal or conditioned on $X$) is highly convoluted and the appropriate mode of operation becomes treating it as unknown, belonging to some nonparametric class of distributions. This work sets the groundwork for estimating $I(X;T)$ (or any other mutual information between layers) in DNN classifiers while providing theoretical guarantees that are not vacuous when $d$ is relatively large.

To achieve this, we distill the estimation of $I(X;T)$ to the problem of differential entropy estimation under Gaussian convolutions described above. In a noisy DNN each hidden layer can be written as $T = S + Z$, where $S$ is a deterministic function of the previous layer and $Z$ is a centered isotropic Gaussian vector. The DNN's generative model enables sampling $S$ by feeding data samples up the network; the distribution of $Z$ is known since the noise is injected by design. Estimating mutual information over noisy DNNs thus boils down to the considered differential entropy estimation setup, which is the focus of this work.

I-B Past Works for Unstructured Differential Entropy Estimation

General-purpose differential entropy estimators are applicable for estimating $h(S)$ by accessing noisy i.i.d. samples of $S = X + Z$. However, the theoretical guarantees for unstructured differential entropy estimation commonly found in the literature are invalid for our framework. There are two prevailing approaches for estimating the nonsmooth differential entropy functional: the first relying on kernel density estimators (KDEs) [8, 9, 10], and the second using k-nearest-neighbor (kNN) techniques [11, 12, 13, 14, 15, 16, 17, 18] (see also [19, 20] for surveys). However, performance analyses of these estimators often restrict attention to nonparametric classes of smooth densities that are bounded away from zero. Various works require uniform boundedness from zero [21, 22, 14, 8, 9], while others restrict it on average [23, 24, 13, 25, 16]. Since the convolved density $P_X * \varphi_\sigma$ can attain arbitrarily small values, these results do not apply in the considered scenario.

To the best of our knowledge, the only two works that drop the boundedness-from-zero assumption are [10] and [18], where the minimax risk of a KDE-based method and of the Kozachenko-Leonenko (KL) entropy estimator [12] are, respectively, analyzed. These results assume that the densities are supported inside the unit hypercube, satisfy periodic boundary conditions and have a (Lipschitz or Hölder) smoothness parameter. The convolved density violates the two former assumptions. Notably, the analysis from [10] was also extended to densities supported on all of Euclidean space that have sub-Gaussian tails. The derived upper bound on the minimax risk in this sub-Gaussian regime applies to our estimation setup when $X$ is compactly supported or sub-Gaussian. However, the obtained risk convergence rate (overlooking some multiplicative polylogarithmic factors) quickly deteriorates with dimension and is unable to fully exploit the smoothness of the convolved density due to the restriction on the smoothness parameter. (This convergence rate is typical of unstructured differential entropy estimation: all the results cited above, applicable in our estimation problem or otherwise, show that the estimation risk decays at a rate whose exponent deteriorates with the dimension and may depend on the smoothness parameter.) Consequently, this risk bound is ineffective for evaluating the error of implemented estimators, even for moderate dimensions. We also note that all the above results include implicit constants that depend (possibly exponentially) on $d$, which may, when combined with the weak decay with respect to $n$, significantly increase the number of samples required to achieve a desired estimation accuracy. We therefore ask whether exploiting the explicit modeling of $S = X + Z$, together with the 'clean' samples from $P_X$ and the knowledge of $\sigma$, can improve estimation performance.

I-C This Work

We begin the study of estimating $h(P_X * \varphi_\sigma)$ by showing that an exponential dependence of the sample complexity on dimension is unavoidable. Specifically, we prove that the sample complexity grows exponentially with $d$, with an exponent $\gamma(\sigma)$ that is a positive, monotonically decreasing function of $\sigma$. The proof relates the estimation of $h(P_X * \varphi_\sigma)$ to estimating the discrete entropy of a distribution over a capacity-achieving codebook for the additive white Gaussian noise (AWGN) channel. Viewing $h(P_X * \varphi_\sigma)$ as a functional of $P_X$ with parameter $\sigma$, we then analyze the performance of the plugin estimator. Specifically, based on the empirical measure $\hat{P}_{X^n} = \frac{1}{n}\sum_{i=1}^n \delta_{X_i}$, where $\delta_{X_i}$ is the Dirac measure associated with $X_i$, we consider the estimator

\[
\hat{h}\big(X^n,\sigma\big) \triangleq h\big(\hat{P}_{X^n} * \varphi_\sigma\big). \tag{2}
\]

Note that $\hat{h}$ approximates $h(P_X * \varphi_\sigma)$ via the differential entropy of a Gaussian mixture with centers at the sample points $X_1,\ldots,X_n$. When $P_X$ belongs to a class of compactly supported distributions (corresponding, for instance, to tanh/sigmoid DNNs) or when it has sub-Gaussian marginals (corresponding to ReLU DNNs with a sub-Gaussian input), we show that the minimax absolute-error risk is bounded by $\frac{c_{\sigma,d}}{\sqrt{n}}$, with the constant $c_{\sigma,d}$ (which depends exponentially on $d$) explicitly characterized. This convergence rate is a significant improvement over the rates derived for general-purpose differential entropy estimators. This is, of course, expected, since $\hat{h}$ is tailored to our particular setup, while generic KDE- or kNN-based estimators are designed to exploit neither the structure of $S = X + Z$ nor the 'clean' samples $X^n$.
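To make the construction concrete, the following minimal NumPy sketch (illustrative only; the function and variable names are ours) evaluates the density of the Gaussian mixture $\hat{P}_{X^n} * \varphi_\sigma$ underlying the estimator in (2). Computing its entropy is then a numerical-integration task, addressed via Monte Carlo in Section II-E.

```python
import numpy as np

def smoothed_empirical_density(x, samples, sigma):
    """Evaluate the Gaussian-mixture density (P_hat_n * phi_sigma)(x),
    i.e., the equally weighted mixture with centers at the data points.

    x       : (d,) query point
    samples : (n, d) array of the i.i.d. samples X_1, ..., X_n
    sigma   : Gaussian noise parameter (per-coordinate standard deviation)
    """
    n, d = samples.shape
    # log phi_sigma(x - X_i) for every center i
    sq_dists = np.sum((x - samples) ** 2, axis=1)
    log_kernels = -sq_dists / (2 * sigma**2) - 0.5 * d * np.log(2 * np.pi * sigma**2)
    # average the n components in a numerically stable way (log-sum-exp)
    return np.exp(np.logaddexp.reduce(log_kernels) - np.log(n))
```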

Our proof exploits the modulus of continuity of the differential entropy with respect to the underlying density to bound the absolute estimation error in terms of the pointwise mean squared error (MSE) of $\hat{P}_{X^n} * \varphi_\sigma$ as a proxy of the true density $P_X * \varphi_\sigma$. The analysis then reduces to integrating the modulus of continuity evaluated at the square root of the MSE bound. Functional optimization and concentration of measure arguments are used to control the integral and obtain the result. A similar result is derived for the nonparametric class of distributions that have sub-Gaussian marginals. The bounded-support and sub-Gaussian results essentially capture all cases of interest; in particular, they correspond respectively to DNNs with bounded nonlinearities and to unbounded nonlinearities with weight regularization.

We then focus on the practical implementation of $\hat{h}$. While our performance guarantees give sufficient conditions on the number of samples needed to drive the estimation error below a desired threshold, these are worst-case results by definition. In practice, the unknown distribution may not be one that follows the minimax rates, and the resulting decay of error could be faster. However, while the variance of the estimator can be empirically evaluated using bootstrapping, there is no empirical test for the bias. We derive a lower bound on the bias of our estimator to have a guideline for the least number of samples needed for unbiased estimation. Our last step is to propose an efficient implementation of $\hat{h}$ based on Monte Carlo (MC) integration. Since $\hat{h}$ is simply the entropy of a known Gaussian mixture, MC integration using samples from this mixture allows a simple computation of $\hat{h}$. We bound the MSE of the computed value by an expression that decays with both the number of centers in the mixture (the number of centers equals the number of samples used for estimation) and the number of MC samples, with an explicit constant that depends only linearly on $d$. The proof leverages the Gaussian Poincaré inequality to reduce the analysis to that of the gradient of the log-mixture density. Several simulations (including an estimation experiment over a small DNN trained to classify a spiral dataset) illustrate the gain of the ad-hoc estimator over its general-purpose counterparts, both in the rate of error decay and in its scalability with dimension.

The remainder of this paper is organized as follows. In Section II we set up the estimation problem, state our main results and discuss them. Section III presents applications of the considered estimation problem, focusing on mutual information estimation over DNNs. Simulation results are shown in Section IV-B, while Section V provides proofs. Our main insights from this work and appealing future directions are discussed in Section VI.

Notation: Throughout this work logarithms are with respect to the natural base. For an integer $k$, we set $[k] \triangleq \{1,\ldots,k\}$. For a real number $p \ge 1$, the $\ell_p$-norm of $x \in \mathbb{R}^d$ is denoted by $\|x\|_p$, while $\|x\|$ stands for the Euclidean norm $\|x\|_2$. Matrices are denoted by non-italic letters, e.g., $\mathrm{A}$; the $d$-dimensional identity matrix is $\mathrm{I}_d$. We use calligraphic letters, such as $\mathcal{A}$, to denote sets. The cardinality of a finite set $\mathcal{A}$ is $|\mathcal{A}|$. Probability distributions are denoted by uppercase letters such as $P$ or $Q$. The support of a $d$-dimensional distribution $P$, denoted by $\mathrm{supp}(P)$, is the smallest set $\mathcal{A} \subseteq \mathbb{R}^d$ such that $P(\mathcal{A}) = 1$. If $P$ is discrete, the corresponding probability mass function (PMF) is designated by $p$, i.e., $p(x) = P(\{x\})$, for $x \in \mathrm{supp}(P)$. With some abuse of notation, the PDF associated with a continuous distribution $P$ is also denoted by $p$. Whether $p$ is a PMF or a PDF is of no consequence for most of our results; whenever the distinction is important, the nature of $P$ will be clarified. The $n$-fold product distribution associated with $P$ is denoted by $P^{\otimes n}$. To highlight that an expectation or a probability measure is with respect to an underlying distribution $P$ we write $\mathbb{E}_P$ or $\mathbb{P}_P$; if $P$ has a PMF/density $p$, we use $\mathbb{E}_p$ or $\mathbb{P}_p$ instead. For a random variable $X$ and a deterministic function $f$, we sometimes highlight that the expectation of $f(X)$ is with respect to the underlying distribution of $X$ by writing $\mathbb{E}_X f(X)$. For a continuous random variable $X$ with density $p$, we interchangeably use $h(X)$ and $h(p)$ for its differential entropy.

Lastly, since our estimation setting considers the sum of independent random variables $S = X + Z$, we oftentimes deal with convolutions. For two probability measures $P$ and $Q$ on $\mathbb{R}^d$, their convolution is defined by
\[
(P * Q)(\mathcal{A}) \triangleq \int\!\!\int \mathbb{1}_{\mathcal{A}}(x+y)\, \mathrm{d}P(x)\, \mathrm{d}Q(y),
\]
where $\mathbb{1}_{\mathcal{A}}$ is the indicator of the Borel set $\mathcal{A} \subseteq \mathbb{R}^d$. If $X \sim P$ and $Y \sim Q$ are independent random variables, then $X + Y \sim P * Q$. In this work, $Z$ is always an isotropic Gaussian with parameter $\sigma$, whose PDF is denoted by $\varphi_\sigma$. The random variable $X$, however, may be discrete, continuous or mixed. Regardless of the nature of $X$, the random variable $S = X + Z$ is always continuous and its PDF is denoted by $p_S$. By the latter we mean $p_S(s) = \int p_X(x)\,\varphi_\sigma(s-x)\,\mathrm{d}x$ when $X$ is continuous with density $p_X$. If $X$ is discrete with PMF $p_X$, then $p_S(s) = \sum_{x \in \mathrm{supp}(P_X)} p_X(x)\,\varphi_\sigma(s-x)$. For a mixed distribution $P_X$, Lebesgue's decomposition theorem allows one to write $p_S$ as the sum of two expressions as above. Henceforth, we typically overlook the exact structure of $p_S$, only mentioning it when it is consequential.

II Main Results

II-A Preliminary Definitions

Let $\mathcal{F}_d$ be the set of distributions $P_X$ with $\mathrm{supp}(P_X) \subseteq [-1,1]^d$. (Any support included in a compact subset of $\mathbb{R}^d$ would do; we focus on $[-1,1]^d$ due to its correspondence to a noisy DNN with tanh nonlinearities.) We also consider the class of distributions whose marginals are sub-Gaussian [26]. Denoting the sub-Gaussian norm of a scalar random variable by $\|\cdot\|_{\psi_2}$, we set $\mathcal{F}_d^{\mathrm{SG}}$ as the class of distributions of a $d$-dimensional random variable $X$ whose coordinates satisfy $\|X_i\|_{\psi_2} \le K$, for all $i \in [d]$ and a fixed constant $K$. The class $\mathcal{F}_d^{\mathrm{SG}}$ will be used in Section III to handle DNNs with unbounded activation functions, such as ReLUs. Clearly, for any $X$ with $P_X \in \mathcal{F}_d$ the coordinates $X_i$ are bounded and hence sub-Gaussian, for all $i \in [d]$, and therefore $\mathcal{F}_d$ is contained in $\mathcal{F}_d^{\mathrm{SG}}$ for an appropriate constant.
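For concreteness, one standard textbook definition of the sub-Gaussian norm of a scalar random variable $U$ (up to the normalization convention) is
\[
\|U\|_{\psi_2} \triangleq \inf\big\{ t > 0 \,:\, \mathbb{E}\, e^{U^2/t^2} \le 2 \big\},
\]
and $U$ is called sub-Gaussian when this quantity is finite; Gaussian and bounded random variables are sub-Gaussian under this definition.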

II-B Lower Bounds on Risk

We give two converse claims showing that the sample complexity is exponential in the dimension $d$.

Theorem 1 (Exponential Sample-Complexity)

The following claims are true:

  1. Fix $\sigma > 0$. Then there exist a dimension threshold, a gap threshold, and an exponent $\gamma(\sigma) > 0$ (monotonically decreasing in $\sigma$), such that for all sufficiently large $d$ and sufficiently small gaps $\eta$, the sample complexity over $\mathcal{F}_d$ grows at least exponentially in $d$ with exponent $\gamma(\sigma)$.

  2. Fix $d$. Then there exist a gap threshold and a critical noise level, such that for all sufficiently small gaps $\eta$ and all $\sigma$ below the critical level, the sample complexity over $\mathcal{F}_d$ is at least exponential in $d$.

Theorem 1 is proven in Section V-B, based on channel-coding arguments. For instance, the proof of Part 1) relates the estimation of $h(P_X * \varphi_\sigma)$ to the output sequence of a peak-constrained AWGN channel. Then, we show that estimating the entropy of interest is equivalent to estimating the entropy of a discrete random variable with some distribution over a capacity-achieving codebook. The positive capacity of the considered AWGN channel means that the size of this codebook is exponential in $d$. Therefore, (discrete) entropy estimation over the codebook within a small additive gap cannot be done with fewer than order $\frac{2^{\gamma(\sigma)d}}{d}$ samples, since estimating the Shannon entropy of a distribution over an alphabet of size $k$ requires order $\frac{k}{\log k}$ samples. Furthermore, the exponent is monotonically decreasing in $\sigma$, implying that larger values of $\sigma$ are favorable for estimation. The second part of the theorem relies on a similar argument but for a $d$-dimensional AWGN channel and an input constellation that comprises the vertices of the $d$-dimensional hypercube $[-1,1]^d$.

Remark 1 (Exponential Sample Complexity for Restricted Classes of Distributions)

Note that restricting $\mathcal{F}_d$ by imposing smoothness or lower-boundedness assumptions on the distributions in the class would not alleviate the exponential dependence on $d$ from Theorem 1. For instance, consider convolving any $P_X \in \mathcal{F}_d$ with a narrow Gaussian, i.e., replacing each $X$ with a slightly smoothed version of itself. These distributions are smooth, but if one could accurately estimate the entropy of interest over the convolved class, then it could be estimated over $\mathcal{F}_d$ as well. Therefore, an exponential sample complexity lower bound applies also to the class of such smooth distributions.

Remark 2 (Critical Value of Noise Parameter)

We state Theorem 1 in asymptotic form for simplicity; the full bounds are found in the proof (Section V-B). We also note that, for any $d$, the critical value of $\sigma$ from the second part can be extracted by following the constants through the proof (which relies on Proposition 3 from [27]). These critical values are not unreasonably small: for moderate dimensions, a careful analysis gives an explicit threshold on $\sigma$ below which Theorem 1 holds. This threshold changes very slowly when increasing $d$, due to the rapid decay of the Gaussian density. As a reference point, note that the per-neuron noise parameters used in the noisy DNNs from [5] fall within a comparable range.

II-C Upper Bound on Risk

This is our main section, where we analyze the performance of the estimator $\hat{h}$ from (2). Recall that $\hat{h}(X^n,\sigma) = h(\hat{P}_{X^n} * \varphi_\sigma)$, where $\hat{P}_{X^n}$ is the empirical measure associated with the data $X^n$. The following theorem shows that the expected absolute error of $\hat{h}$ decays like $\frac{1}{\sqrt{n}}$ for all dimensions $d$. We provide explicit constants (in terms of $\sigma$ and $d$), which exhibit an exponential dependence on the dimension, in accordance with the results of Theorem 1.

Theorem 2 (Absolute-Error Risk for Bounded Support)

Fix $\sigma > 0$ and $d \ge 1$. Then

\[
\sup_{P_X \in \mathcal{F}_d} \mathbb{E}\,\Big| h\big(P_X * \varphi_\sigma\big) - \hat{h}\big(X^n,\sigma\big) \Big| = O_{\sigma,d}\!\left(\frac{1}{\sqrt{n}}\right). \tag{3}
\]
Dimension 1 2 3 5 7 9 10 11 12
Risk bound 0.00166 0.00369 0.0179 0.0856 0.402 0.869 1.87 4.02
TABLE I: Evaluation of the absolute-error risk bound from Theorem 2 (via the full formula (52)), for fixed values of $n$ and $\sigma$ and different dimensions. The bound produces satisfactory values up to moderate dimensions but quickly deteriorates for larger ones. The exponential growth of the bound with $d$ is also evident from the table.

The proof of Theorem 2 is given in Section V-C. While the theorem is stated in asymptotic form, a full expression, with all constants explicit, is given as part of the proof (see (52)). Table I evaluates this bound for a fixed number of samples and noise parameter, for dimensions up to $d = 12$. Several things to note about the result are the following:

  1. The theorem does not assume any smoothness conditions on the distributions in $\mathcal{F}_d$. This is possible due to the inherent smoothing introduced by the convolution with the Gaussian density. Specifically, while the differential entropy is not a smooth functional of the underlying density in general, our functional is $p \mapsto h(p * \varphi_\sigma)$, which is smooth.

  2. The result does not rely on the density being bounded away from zero. We circumvent the need for such an assumption by observing that although the convolved density can be arbitrarily close to zero, it is easily lower bounded inside an enlargement of the support (i.e., a Minkowski sum of $[-1,1]^d$ with a $d$-dimensional ball of appropriate radius). The analysis inside this region exploits the modulus of continuity of the entropy functional combined with some functional optimization arguments; the integral outside the region is controlled using tail bounds for the chi-squared distribution.

Theorem 2 provides convergence rates when estimating differential entropy (or mutual information) over DNNs with bounded activation functions, such as tanh or sigmoid. To account for networks with unbounded nonlinearities, such as the popular ReLU networks, the following theorem gives a more general result for estimation over the nonparametric class of $d$-dimensional distributions with sub-Gaussian marginals.

Theorem 3 (Absolute-Error Risk for sub-Gaussian Distributions)

Fix $\sigma > 0$ and $d \ge 1$. Then

\[
\sup_{P_X \in \mathcal{F}_d^{\mathrm{SG}}} \mathbb{E}\,\Big| h\big(P_X * \varphi_\sigma\big) - \hat{h}\big(X^n,\sigma\big) \Big| = O_{\sigma,d}\!\left(\frac{1}{\sqrt{n}}\right). \tag{4}
\]

The proof of Theorem 3 is given in Section V-D. Again, while (4) only states the asymptotic behavior of the risk, an explicit expression is given in (59) at the end of the proof. The derivation relies on the same decomposition of the absolute error and the technical lemmas employed in the proof of Theorem 2. The main difference is the analysis of the probability that $S$ falls outside the region on which the density is lower bounded, which is taken here as a $d$-dimensional hypercube of appropriate side length.

II-D Necessary Number of Samples for Unbiased Estimation

The results of the previous subsection are in minimax form, that is, they state worst-case convergence rates of the risk over a certain nonparametric class of distributions. In practice, the true distribution may very well not be one that attains these worst-case rates, and convergence may be faster. However, while the variance of $\hat{h}$ can be empirically evaluated using bootstrapping, there is no empirical test for the bias. Even if multiple estimates of $h(P_X * \varphi_\sigma)$ via $\hat{h}$ consistently produce similar values, this does not necessarily suggest that these values are close to the true entropy. To have a guideline for the least number of samples needed to avoid biased estimation, we present the following lower bound on the estimator bias.

Theorem 4 (Bias Lower Bound)

Fix $\sigma > 0$ and $d \ge 1$, and define a threshold in terms of the Q-function, where the Q-function is $Q(x) \triangleq \frac{1}{\sqrt{2\pi}}\int_x^\infty e^{-t^2/2}\,\mathrm{d}t$ and $Q^{-1}$ denotes its inverse. By this choice, the bias of $\hat{h}$ over the class $\mathcal{F}_d$ is bounded as

(5)

Consequently, the bias cannot be less than a given level so long as the number of samples remains below a threshold that grows exponentially with $d$.

The theorem is proven in Section V-E. Since the threshold shrinks with $\sigma$, for sufficiently small $\sigma$ values the lower bound from (5) essentially shows that our estimator will not have negligible bias unless the number of samples grows exponentially with $d$. The accompanying condition is non-restrictive in any relevant regime of $\sigma$ and $d$. For the latter, the values we have in mind are inspired by [5], where noisy DNNs with a small per-neuron noise parameter were studied. For such values, the relevant lower bound on the threshold is at most 0.0057 for all dimensions of interest, and the corresponding base of the exponent equals 3 or 2, depending on the exact choice of parameters. Thus, with these parameters, a negligible bias requires $n$ to be at least exponential in $d$, for any conceivably relevant dimension.

II-E Computing the Estimator

Evaluating $\hat{h}$ requires computing the differential entropy of a Gaussian mixture. Although this entropy cannot be computed in closed form, this section presents a method for approximate computation via MC integration [28]. To simplify the presentation, we describe the method for an arbitrary Gaussian mixture without referring to the notation of the estimation setup.

Let $g(x) \triangleq \frac{1}{n}\sum_{i=1}^n \varphi_\sigma(x-\mu_i)$ be a $d$-dimensional, $n$-mode Gaussian mixture, with centers $\mu_1,\ldots,\mu_n \in \mathbb{R}^d$. Let $Z \sim \mathcal{N}(0,\sigma^2\mathrm{I}_d)$ be independent of an index $C$ drawn uniformly from $[n]$, and note that $G \triangleq \mu_C + Z$ has density $g$. First, rewrite $h(g)$ as follows:

\[
h(g) = -\mathbb{E}\log g(G) = -\frac{1}{n}\sum_{i=1}^n \mathbb{E}\log g\big(\mu_i + Z\big), \tag{6}
\]

where the last step uses the independence of $C$ and $Z$. Let $Z_1,\ldots,Z_{n_{\mathrm{MC}}}$ be i.i.d. samples from $\mathcal{N}(0,\sigma^2\mathrm{I}_d)$. For each $i \in [n]$, we estimate the $i$-th summand on the RHS of (6) by

\[
\hat{h}_i \triangleq -\frac{1}{n_{\mathrm{MC}}}\sum_{j=1}^{n_{\mathrm{MC}}} \log g\big(\mu_i + Z_j\big), \tag{7a}
\]
which produces
\[
\hat{h}_{\mathrm{MC}} \triangleq \frac{1}{n}\sum_{i=1}^n \hat{h}_i \tag{7b}
\]

as our estimate of $h(g)$. Note that since $g$ is a mixture of Gaussians, it can be efficiently evaluated using off-the-shelf KDE software packages, many of which require far fewer than $n$ kernel evaluations on average per evaluation of $g$ (e.g., via tree-based approximations).
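As a concrete illustration, the following Python sketch (all function and variable names are ours) implements the computation in (6)-(7b) directly with NumPy, evaluating the mixture density in a numerically stable way instead of calling a KDE package:

```python
import numpy as np

def gaussian_mixture_entropy_mc(centers, sigma, n_mc=1000, rng=None):
    """Monte Carlo estimate of the differential entropy of the Gaussian mixture
    g(x) = (1/n) * sum_i N(x; mu_i, sigma^2 I), following (6)-(7b):
    h(g) = -(1/n) * sum_i E[ log g(mu_i + Z) ],  Z ~ N(0, sigma^2 I).

    centers : (n, d) array of mixture centers mu_1, ..., mu_n
    sigma   : per-coordinate noise standard deviation
    n_mc    : number of Monte Carlo noise samples
    """
    rng = np.random.default_rng(rng)
    n, d = centers.shape
    log_norm = -0.5 * d * np.log(2 * np.pi * sigma**2) - np.log(n)

    # i.i.d. noise samples Z_1, ..., Z_{n_mc}, shared across all centers as in (7a)
    noise = rng.normal(scale=sigma, size=(n_mc, d))

    h_sum = 0.0
    for mu in centers:                        # one summand of (6) per mixture mode
        points = mu + noise                   # query points mu_i + Z_j
        # squared distances from each query point to every center: shape (n_mc, n)
        sq = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        log_g = np.logaddexp.reduce(-sq / (2 * sigma**2), axis=1) + log_norm
        h_sum += -log_g.mean()                # Monte Carlo estimate of -E[log g(mu_i + Z)]
    return h_sum / n                          # (7b)
```

The plug-in estimator (2) is then obtained by taking the centers to be the observed samples $X_1,\ldots,X_n$.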

Define the mean squared error (MSE) of $\hat{h}_{\mathrm{MC}}$ as

\[
\mathrm{MSE}\big(\hat{h}_{\mathrm{MC}}\big) \triangleq \mathbb{E}\Big(\hat{h}_{\mathrm{MC}} - h(g)\Big)^2. \tag{8}
\]

We have the following bounds on the MSE for tanh/sigmoid and ReLU networks, i.e., when the support or the second moment of the mixture centers is bounded, respectively.

Theorem 5 (MSE Bounds for the MC Estimator)

  1. Assume the mixture centers are bounded almost surely (i.e., tanh / sigmoid networks); then

    (9)
  2. Assume the mixture centers have bounded second moment (e.g., ReLU networks with weight regularization); then

    (10)

The proof is given in Section V-F. The bounds on the MSE scale only linearly with the dimension $d$, making the number of MC samples $n_{\mathrm{MC}}$ in the denominator the dominating factor in practice.

III Applications for Deep Neural Networks

A main application of the developed theory is estimating the mutual information between selected groups of neurons in DNNs. Much attention has recently been devoted to this task [1, 2, 3, 4, 5], mostly motivated by the Information Bottleneck (IB) theory for DNNs [6, 7]. The theory tracks the mutual information pair $\big(I(X;T), I(Y;T)\big)$, where $X$ is the DNN's input (i.e., the feature), $Y$ is the true label and $T$ is the hidden activity. An intriguing claim from [7] is that the mutual information $I(X;T)$ undergoes a so-called 'compression' phase as the DNN's training progresses. Namely, after a short 'fitting' phase at the beginning of training (during which $I(X;T)$ and $I(Y;T)$ both grow), $I(X;T)$ exhibits a slow long-term decrease, which, according to [7], explains the excellent generalization performance of DNNs. The main caveat in the supporting empirical results provided in [7] (and the partially opposing results from the followup work [1]) is that in a deterministic DNN the mapping $X \mapsto T$ is almost always injective when the activation functions are strictly monotone. As a result, $I(X;T)$ is either infinite (when the data distribution is continuous) or a constant (when $X$ is discrete; the mapping from a discrete set of values to $T$ is almost always, except for a measure-zero set of weights, injective whenever the nonlinearities are, thereby making $I(X;T)$ equal to the entropy of $X$ for any hidden layer $T$, even if it consists of a single neuron). Thus, when the DNN is deterministic, $I(X;T)$ is not an informative quantity to consider. As explained in [5], the reason [7] and [1] miss this fact stems from an inadequate application of a binning-based mutual information estimator for $I(X;T)$.

As a remedy for this constant/infinite mutual information issue, [5] proposed the framework of noisy DNNs, in which each neuron adds a small amount of Gaussian noise (i.i.d. across all neurons) after applying the activation function. The injected noise makes the map $X \mapsto T$ a stochastic parameterized channel, and as a consequence, $I(X;T)$ is a finite quantity that depends on the network's parameters. Interestingly, although the primary purpose of the noise injection in [5] was to ensure that $I(X;T)$ is a meaningful quantity, experimentally it was found that the DNN's performance is optimized at non-zero noise variance, thus providing a natural way for selecting this parameter. In the following, we first properly define noisy DNNs and then show that estimating $I(X;T)$, or any other mutual information term between layers of a noisy DNN, can be reduced to differential entropy estimation under Gaussian convolutions. The reduction relies on a sampling procedure that leverages the DNN's generative model.

III-A Noisy DNNs and Mutual Information between Layers

We start by describing the noisy DNN setup from [5]. Consider the learning problem for a feature-label pair $(X,Y) \sim P_{X,Y}$, where $P_{X,Y}$ is the (unknown) true distribution. The labeled dataset comprises i.i.d. samples from $P_{X,Y}$.

An $L$-layered noisy DNN for learning this model has layers $T_1,\ldots,T_L$, with input $X$ and output $T_L = \hat{Y}$ (i.e., the output is an estimate of $Y$). For each $\ell \in [L]$, the $\ell$-th hidden layer is given by $T_\ell = S_\ell + Z_\ell$, where $S_\ell = f_\ell(T_{\ell-1})$ is a deterministic function of the previous layer and $Z_\ell \sim \mathcal{N}(0,\sigma^2\mathrm{I}_{d_\ell})$ is the noise injected at layer $\ell$. The functions $f_\ell$ can represent any type of layer (fully connected, convolutional, max-pooling, etc.). For instance, $f_\ell(t) = a\big(\mathrm{W}_\ell\, t + b_\ell\big)$ for a fully connected or a convolutional layer, where $a$ is the activation function which operates on a vector component-wise, $\mathrm{W}_\ell$ is the weight matrix and $b_\ell$ is the bias. For fully connected layers $\mathrm{W}_\ell$ is arbitrary, while for convolutional layers $\mathrm{W}_\ell$ is Toeplitz. Fig. 1 shows a neuron in a noisy DNN.

Fig. 1: The $k$-th noisy neuron in a fully connected or a convolutional layer with activation function $a$; $w_k$ and $b_k$ are the $k$-th row of the weight matrix and the $k$-th entry of the bias vector, respectively.
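As a minimal illustration of the fully connected case above (a NumPy sketch with our own names; the tanh default corresponds to the bounded-nonlinearity setting), a single noisy layer can be simulated as follows. Returning the noiseless part $S_\ell$ alongside $T_\ell$ is exactly what the sampling procedure of Section III-B requires.

```python
import numpy as np

def noisy_layer(t_prev, W, b, sigma, activation=np.tanh, rng=None):
    """One noisy fully connected layer: T = a(W t_prev + b) + Z,
    with i.i.d. Gaussian noise Z ~ N(0, sigma^2 I) added after the activation.

    t_prev : (batch, d_in) activities of the previous layer
    W, b   : weight matrix (d_out, d_in) and bias vector (d_out,)
    sigma  : per-neuron noise standard deviation
    """
    rng = np.random.default_rng(rng)
    s = activation(t_prev @ W.T + b)           # deterministic part S = f(T_prev)
    z = rng.normal(scale=sigma, size=s.shape)  # injected noise Z
    return s + z, s                            # layer output T = S + Z, and S itself
```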

The noisy DNN induces a stochastic map from $X$ to the rest of the network, described by the conditional distribution $P_{T_1,\ldots,T_L|X}$. Under the resulting joint distribution of $(X,T_1,\ldots,T_L)$, the layers form a Markov chain $X \to T_1 \to \cdots \to T_L$. For each $\ell \in [L]$, the PDF of $T_\ell$ or any of its conditional versions is denoted by a lowercase $p$ with the appropriate subscripts (e.g., $p_{T_\ell}$ is the PDF of $T_\ell$, while $p_{T_\ell|X}$ is its conditional PDF given $X$). For any $\ell \in [L]$, consider the mutual information between the hidden layer $T_\ell$ and the input $X$ (see Remark 4 for an account of $I(Y;T_\ell)$):

\[
I(X;T_\ell) = h(T_\ell) - h(T_\ell \,|\, X) = h(T_\ell) - \int h\big(T_\ell \,\big|\, X = x\big)\, \mathrm{d}P_X(x). \tag{11}
\]

Since $p_{T_\ell}$ has a highly complicated structure (due to the composition of Gaussian noises and nonlinearities), this mutual information cannot be computed analytically and must be estimated. Based on the expansion from (11), an estimator of $I(X;T_\ell)$ is constructed by estimating the unconditional and each of the conditional differential entropy terms, while approximating the expectation by an empirical average. As explained next, all these entropy estimation tasks are instances of our framework of estimating $h(P * \varphi_\sigma)$ based on samples from $P$ and knowledge of $\sigma$.

III-B From Differential Entropy to Mutual Information

Recall that $T_\ell = S_\ell + Z_\ell$, where $S_\ell = f_\ell(T_{\ell-1})$ and $Z_\ell \sim \mathcal{N}(0,\sigma^2\mathrm{I}_{d_\ell})$ are independent. Thus,

\[
h(T_\ell) = h\big(P_{S_\ell} * \varphi_\sigma\big) \tag{12a}
\]
and
\[
h\big(T_\ell \,\big|\, X = x\big) = h\big(P_{S_\ell|X=x} * \varphi_\sigma\big), \tag{12b}
\]

where $P_{S_\ell|X=x}$ is the conditional distribution of $S_\ell$ given $X = x$. The DNN's generative model enables sampling from $P_{S_\ell}$ and $P_{S_\ell|X=x}$ as follows:

  1. Unconditional Sampling: To generate a sample set from $P_{S_\ell}$, feed each feature sample $x_i$, for $i \in [m]$ (with $m$ the dataset size), into the DNN and collect the outputs it produces at the $(\ell-1)$-th layer. The function $f_\ell$ is then applied to each collected output to obtain $\{s_{\ell,i}\}_{i=1}^m$, which is a set of i.i.d. samples from $P_{S_\ell}$.

  2. Conditional Sampling Given $x$: To generate i.i.d. samples from $P_{S_\ell|X=x_i}$, for $i \in [m]$, we feed $x_i$ into the DNN multiple times, collect the layer-$(\ell-1)$ outputs corresponding to different noise realizations, and apply $f_\ell$ on each. Denote the obtained samples by $\{s^{(x_i)}_{\ell,j}\}_j$. (The described sampling procedure is valid for any layer $\ell \ge 2$. For $\ell = 1$, the unconditional sampling above still applies, but the conditional samples are degenerate, since given $X = x$ the quantity $S_1 = f_1(x)$ is deterministic. Nonetheless, noting that for the first layer $h(T_1 \,|\, X = x) = h(Z_1)$ is known in closed form, no estimation of the conditional entropy is needed; the mutual information estimator given in (14) is modified by replacing the subtracted term with this exact value.)

The knowledge of $\sigma$ and these generated samples can be used to estimate the unconditional and the conditional entropies from (12a) and (12b), respectively.

For notational simplicity, we henceforth omit the layer index $\ell$ (writing $s_i$ and $s^{(x_i)}_j$ for the samples generated above). Based on the above sampling procedure we construct an estimator of $I(X;T)$ using a given estimator $\hat{h}$ of $h(P * \varphi_\sigma)$ for $P$ supported inside $[-1,1]^d$ (i.e., a tanh / sigmoid network), based on $n$ i.i.d. samples from $P$ and knowledge of $\sigma$. Assume that $\hat{h}$ attains

\[
\sup_{P \in \mathcal{F}_d} \mathbb{E}\,\big| h(P * \varphi_\sigma) - \hat{h}\, \big| \le \Delta_{\sigma,d}(n). \tag{13}
\]

An example of such an $\hat{h}$ is the estimator from (2); the corresponding bound $\Delta_{\sigma,d}(n)$ is given in Theorem 2. Our estimator for the mutual information is

\[
\hat{I}(X;T) \triangleq \hat{h}\big(\{s_i\}_{i=1}^{m},\sigma\big) - \frac{1}{m}\sum_{i=1}^{m} \hat{h}\big(\{s^{(x_i)}_j\}_{j},\sigma\big). \tag{14}
\]
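Before stating the error bound, here is a minimal Python sketch of the overall procedure (our own illustrative names; `clean_layer` stands for a forward pass that returns the noiseless activity $S$ of the layer of interest, and `entropy_estimator` can be the Monte Carlo plug-in routine sketched in Section II-E):

```python
import numpy as np

def estimate_mutual_information(x_data, clean_layer, sigma, entropy_estimator,
                                n_cond=100):
    """Estimate I(X; T) for a noisy layer T = S + Z, Z ~ N(0, sigma^2 I),
    following the decomposition in (14).

    x_data            : (m, d_in) array of input samples x_1, ..., x_m
    clean_layer       : callable x -> S, the noiseless activity of the layer of
                        interest (stochastic whenever earlier layers are noisy)
    sigma             : known noise standard deviation of the layer
    entropy_estimator : callable (samples, sigma) -> estimate of h(P * phi_sigma)
    n_cond            : number of noise realizations per conditional sample set
    """
    # Unconditional samples of S: one forward pass per data point.
    s_uncond = np.stack([clean_layer(x) for x in x_data])
    h_uncond = entropy_estimator(s_uncond, sigma)

    # Conditional entropies: feed each x_i repeatedly to capture the randomness
    # of the preceding noisy layers, then estimate h(P_{S|X=x_i} * phi_sigma).
    h_cond = [entropy_estimator(np.stack([clean_layer(x) for _ in range(n_cond)]), sigma)
              for x in x_data]

    return h_uncond - np.mean(h_cond)
```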

The expected absolute error of $\hat{I}(X;T)$ is bounded in the following proposition, proven in Section V-A.

Proposition 1 (Input-Hidden Layer Mutual Information Estimation Error)

For the above described estimation setting, we have

(15)

Interestingly, the quantity appearing in the bound plays the role of a signal-to-noise ratio (SNR) between $S$ and $Z$. The larger $\sigma$ is, the easier estimation becomes, since the noise smooths out the complicated distribution. Also note that the dimension of the ambient space in which $X$ lies does not appear in the absolute-risk bound for estimating $I(X;T)$. The bound depends only on the dimension of $T$ (through the entropy estimation risk). This is because the additive noise resides in the $T$ domain, limiting the possibility of encoding the rich structure of $X$ into $T$ in full. On a technical level, the blurring effect caused by the noise enables uniformly lower bounding the conditional densities and thereby controlling the variance of the estimator for each conditional entropy. In turn, this reduces the impact of $X$ on the estimation of $I(X;T)$ to that of an empirical average converging to its expected value at the usual square-root rate in the number of dataset samples.

Remark 3 (The sub-Gaussian Class and Noisy ReLU DNNs)

We provide performance guarantees for the plugin estimator also over the more general class of distributions with sub-Gaussian marginals. This class accounts for the following important cases:

  1. Distributions with bounded support, which correspond to noisy DNNs with bounded nonlinearities. This case is directly studied through the bounded support class $\mathcal{F}_d$.

  2. Discrete distributions over a finite set, which is a special case of bounded support.

  3. Distributions of a random variable that is a hidden layer of a noisy ReLU DNN, so long as the input to the network is itself sub-Gaussian. To see this, recall that linear combinations of independent sub-Gaussian random variables are also sub-Gaussian. Furthermore, for any (scalar) random variable $U$ we have $|\mathrm{ReLU}(U)| \le |U|$ almost surely, so the ReLU does not increase the sub-Gaussian norm. Each layer in a noisy ReLU DNN is a coordinate-wise ReLU applied to a linear transformation of the previous layer plus a Gaussian noise. Consequently, for a $d$-dimensional hidden layer $T$ and any $i \in [d]$, one may upper bound $\|T_i\|_{\psi_2}$ by a constant, provided that the input is coordinate-wise sub-Gaussian. This constant will depend on the network's weights and biases, the depth of the hidden layer, the sub-Gaussian norm of the input and the noise variance. In the context of estimation of mutual information over DNNs, the input distribution is typically taken as uniform over the dataset [7, 1, 5]. Such a discrete (finitely supported) distribution satisfies the required input sub-Gaussianity assumption. A short sketch of this propagation argument is given right after this list.
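To make the propagation step in item 3 explicit, here is a minimal sketch under those assumptions (the absolute constant $C$ and the $\ell_1$-norm dependence are artifacts of this particular argument): if every coordinate of the previous layer $T^{(\ell-1)}$ has sub-Gaussian norm at most $K_{\ell-1}$, then for a weight row $w$, bias $b$ and noise $Z_i \sim \mathcal{N}(0,\sigma^2)$,
\[
\big\| \mathrm{ReLU}\big(\langle w, T^{(\ell-1)}\rangle + b\big) + Z_i \big\|_{\psi_2}
\;\le\; \big\| \langle w, T^{(\ell-1)}\rangle + b \big\|_{\psi_2} + \|Z_i\|_{\psi_2}
\;\le\; C\big( \|w\|_1 K_{\ell-1} + |b| + \sigma \big),
\]
using the triangle inequality for $\|\cdot\|_{\psi_2}$, the almost-sure bound $|\mathrm{ReLU}(u)| \le |u|$, and the sub-Gaussianity of constants and Gaussians; iterating over layers yields the claimed per-coordinate bound for every hidden layer.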

Remark 4 (Mutual Information Between Hidden Layer and Label)

Another information-theoretic quantity of possible interest is the mutual information between the hidden layer and the true label (see, e.g., [7]). Let $(X,Y)$ be a feature-label pair distributed according to $P_{X,Y}$. If $T$ is a hidden layer in a noisy DNN with input $X$, then under the joint distribution of $(Y,X,T)$ the variables form a Markov chain $Y \to X \to T$ (in fact, the chain extends through all layers, but this is inconsequential here). The mutual information of interest is then

\[
I(Y;T) = h(T) - h(T \,|\, Y) = h(T) - \sum_{y \in \mathcal{Y}} P_Y(y)\, h\big(T \,\big|\, Y = y\big), \tag{16}
\]

where $\mathcal{Y}$ is the (known and) finite set of labels. Just like for $I(X;T)$, estimating $I(Y;T)$ reduces to differential entropy estimation under Gaussian convolutions. Namely, an estimator for $I(Y;T)$ can be constructed by estimating the unconditional and each of the conditional differential entropy terms from (16), while approximating the expectation by an empirical average. There are several required modifications in estimating $I(Y;T)$ as compared to $I(X;T)$. Most notable is the procedure for sampling conditionally on a label value, which results in a sample set whose size is random (a Binomial random variable). In Appendix A, the process of estimating $I(Y;T)$ is described in detail and a bound on the estimation error is derived.

This section, and in particular the result of Proposition 1 (see also Proposition 2 from Appendix A), shows that the performance in estimating mutual information depends on our ability to estimate $h(P * \varphi_\sigma)$. In Section IV-B we present experimental results for this entropy estimation task, including the case where the distribution is induced by a DNN.

IV Comparison to Past Works on Differential Entropy Estimation

In the considered estimation setup, one could always sample $Z^n$ from $\mathcal{N}(0,\sigma^2\mathrm{I}_d)$ and add the obtained noise samples to $X^n$, thus producing a sample set from $P_S = P_X * \varphi_\sigma$. This set can be used to estimate $h(P_S)$ via general-purpose differential entropy estimators, such as those based on kNN or KDE techniques. In the following we theoretically and empirically compare the performance of $\hat{h}$ to state-of-the-art instances of these two methods.

IV-A Comparison of Theoretical Results

The main thing to note here is that convergence guarantees commonly found in the literature for KDE- and kNN-based differential entropy estimation methods do not apply in the considered setup. Most past risk analyses [23, 21, 22, 24, 13, 25, 14, 16, 8, 9] rely on the density being bounded away from zero, an assumption that is violated by $P_X * \varphi_\sigma$. The only two works we are aware of that drop this assumption, [10, 18], the first for a KDE-based method and the second for the kNN-based KL estimator [11], assume that the density is supported inside the unit hypercube, satisfies periodic boundary conditions and has a (Lipschitz or Hölder) smoothness parameter. The convolved density does not satisfy the first two conditions. It is noteworthy that the analysis from [10] was also extended to sub-Gaussian densities supported on the entire Euclidean space. This extension is applicable for estimation based on samples from $P_S$, but as explained next, the obtained risk converges slowly when $d$ is large and is unable to exploit the smoothness of $P_X * \varphi_\sigma$ due to the restriction on the smoothness parameter.

Because $\hat{h}$ is constructed to exploit the particular structure of our genie-aided estimation setup, it achieves a fast, parametric convergence rate of order $\frac{1}{\sqrt{n}}$. The risk associated with unstructured differential entropy estimators typically converges at a slower rate whose exponent deteriorates with the dimension (overlooking polylogarithmic factors that appear in many of the results), with relatively small constants in the exponent. In particular, the sub-Gaussian result from [10] upper bounds the risk by a term whose rate depends on the Lipschitz smoothness and a norm parameter. This rate (as well as the ones from previous works) is too slow to guarantee satisfactory estimation accuracy even for moderate dimensions, especially when taking into account the (possibly huge) multiplicative constants that this asymptotic expression hides. This highlights the advantage of ad-hoc estimation as opposed to an unstructured approach.

IV-B Simulations

In the following we present empirical results illustrating the convergence of the $\hat{h}$ estimator, comparing it to two such state-of-the-art methods: the KDE-based estimator of [8] and the KL estimator from [11, 18].

IV-B1 Simulations for Differential Entropy Estimation

The convergence rates in the bounded support regime are illustrated first. We set $P_X$ as a mixture of Gaussians truncated to have support in $[-1,1]^d$. Before truncation, the mixture consists of Gaussian components with means at the corners of $[-1,1]^d$. This produces a distribution that is, on one hand, complicated ($2^d$ mixture components) while, on the other hand, still simple to implement. The entropy $h(P_X * \varphi_\sigma)$ is estimated for various values of $n$, $d$ and $\sigma$.
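For reproducibility of this type of source, here is a minimal NumPy sketch of such a distribution (the component standard deviation, the rejection-based truncation and all names are our own illustrative choices):

```python
import numpy as np

def sample_truncated_corner_mixture(n, d, comp_std=0.25, rng=None):
    """Draw n samples from a Gaussian mixture with means at the 2^d corners of
    [-1, 1]^d, truncated to the cube by rejection sampling.

    comp_std is an illustrative per-component standard deviation.
    """
    rng = np.random.default_rng(rng)
    corners = np.array([[1.0 if (i >> j) & 1 else -1.0 for j in range(d)]
                        for i in range(2 ** d)])
    samples = np.empty((n, d))
    count = 0
    while count < n:
        centers = corners[rng.integers(0, len(corners), size=n - count)]
        cand = centers + rng.normal(scale=comp_std, size=centers.shape)
        cand = cand[np.all(np.abs(cand) <= 1.0, axis=1)]  # keep points inside the cube
        take = min(len(cand), n - count)
        samples[count:count + take] = cand[:take]
        count += take
    return samples

# X_samples = sample_truncated_corner_mixture(n=1000, d=5)
# Noisy samples of S (for generic estimators) are obtained by adding N(0, sigma^2 I)
# noise to X_samples.
```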

Fig. 2: Estimation results for $\hat{h}$ compared to state-of-the-art kNN-based and KDE-based differential entropy estimators. The differential entropy of $X + Z$ is estimated, where $X$ is drawn from a truncated $d$-dimensional mixture of Gaussians and $Z$ is isotropic Gaussian noise. Results are shown as a function of the number of samples $n$, for several values of $d$ and $\sigma$. The $\hat{h}$ estimator exhibits faster convergence, improved stability and better scalability with dimension than the two competing methods.
Fig. 3: Entropy estimation results for $\hat{h}$ compared to state-of-the-art kNN-based and KDE-based differential entropy estimators. The setup is the same as in Figure 2, with results shown as a function of $n$ for an additional dimension and choice of $\sigma$.

Estimation results as a function of the number of samples $n$ are shown for two dimensions in Fig. 2, and for a third dimension in Fig. 3. The kernel width for the KDE estimate was chosen via cross-validation, varying with both $n$ and $d$; the kNN estimator and our $\hat{h}$ require no tuning parameters. Observe that the KDE estimate is rather unstable and, while not shown here, the estimated value is highly sensitive to the chosen kernel width (varying widely if the kernel width is perturbed from the cross-validated value). Note that both the kNN and the KDE estimators converge slowly, at a rate that degrades with increased $d$. This rate is significantly worse than that of our proposed estimator, which also lower bounds the true entropy (in accordance with our theory; see (61)). We also note that the difference between the performance of the KDE estimator and $\hat{h}$ decreases for smaller $\sigma$. This is because for small enough $\sigma$ the distribution of $X$ and that of $S = X + Z$ become close, making the KDE estimator and our estimator (which bears some similarities to a KDE estimate on $X^n$ directly) more similar. However, when $\sigma$ is larger, the KDE estimate does not coincide with the true entropy even for the maximal number of samples used in our simulations, for all considered dimensions. Finally, we note that in accordance with the upper bound from Theorem 2, the absolute estimation errors increase with larger $d$ and smaller $\sigma$.

In Fig. 4, we show the convergence rates in the unbounded support regime by considering the same setting but without truncating the $2^d$-mode Gaussian mixture. Observe that good convergence of $\hat{h}$ is still attained, outperforming the competing methods.

Fig. 4: Estimation results for the $\hat{h}$ estimator, the kNN-based estimator and the KDE-based estimator in the unbounded support regime. Estimation of $h(X+Z)$ is considered, where $X$ is drawn from a (non-truncated) $d$-dimensional mixture of Gaussians and $Z$ is isotropic Gaussian noise. Results for several values of $d$ and $\sigma$ are presented.

IV-B2 Monte Carlo Integration

Fig. 5 illustrates the convergence of the MC integration method for computing $\hat{h}$. The figure shows the root-MSE (RMSE) as a function of the number of MC samples $n_{\mathrm{MC}}$, for the truncated Gaussian mixture distribution, with the number of samples $n$ (which corresponds to the number of modes in the Gaussian mixture whose entropy approximates $h(P_X * \varphi_\sigma)$), $d$ and $\sigma$ varied. Note that the error decays approximately as $\frac{1}{\sqrt{n_{\mathrm{MC}}}}$, in accordance with Theorem 5, and that the convergence does not vary excessively for different $d$ and $\sigma$ values.

Fig. 5: Convergence of the Monte Carlo computation of the proposed estimator. Shown is the decay of the RMSE as the number of Monte Carlo samples $n_{\mathrm{MC}}$ increases, for a variety of $d$ and $\sigma$ values. The MC integrator computes the estimate of the entropy of $X + Z$, where $X$ is drawn from a truncated $d$-dimensional mixture of Gaussians and $Z$ is isotropic Gaussian noise. The number of samples of $X$ used by $\hat{h}$ is kept fixed.

IV-B3 Estimation in a Noisy Deep Neural Network

We next illustrate entropy estimation in a noisy DNN. The dataset is a 2-dimensional 3-class spiral (shown in Fig. 6(a)). The network has 3 fully connected layers of sizes 8-9-10, with tanh activations and Gaussian noise of parameter $\sigma$ added to the output of each neuron. We estimate the entropy of the output of the 10-dimensional third layer in the network trained to achieve 98% classification accuracy. Estimation results are shown in Fig. 6(b), comparing our method to the kNN and KDE estimators. As before, our method converges faster than the competing methods, illustrating its efficiency for entropy and mutual information estimation over noisy DNNs. Observe that the KDE estimate is particularly poor in this regime. Indeed, KDE is known to be ill-suited for high-dimensional problems and to underperform on distributions with widely varying smoothness characteristics (as in these nonlinear-activation DNN hidden-layer distributions). In our companion work [5], extensive additional examples of mutual information estimation in DNN classifiers based on the proposed estimator are provided.

(a)
(b)
Fig. 6: 10-dimensional entropy estimation in a 3-layer neural network trained on the 2-dimensional 3-class spiral dataset shown on the left. Estimation results for $\hat{h}$ compared to state-of-the-art kNN-based and KDE-based differential entropy estimators are shown on the right. The differential entropy of the output of the third (10-dimensional) layer is estimated. Results are shown as a function of the number of samples $n$. As in the previous examples, our estimator converges faster and is more stable than the two competing methods.

IV-B4 Mutual Information of Reed-Muller Codes over AWGN Channels

Consider data transmission over an AWGN channel via binary phase-shift keying (BPSK) modulation of an error-correcting Reed-Muller code. Denote a Reed-Muller code of parameters $(r,m)$, where $0 \le r \le m$, by $\mathrm{RM}(r,m)$. An $\mathrm{RM}(r,m)$ code encodes messages of length $k = \sum_{i=0}^{r}\binom{m}{i}$ into binary codewords of length $2^m$. Let $\mathcal{C} \subset \{\pm 1\}^{2^m}$ be the set of BPSK-modulated sequences corresponding to $\mathrm{RM}(r,m)$ (with, e.g., the bits 0 and 1 mapped to $+1$ and $-1$, respectively). The number of bits reliably transmittable over the $2^m$-dimensional AWGN channel with noise $Z \sim \mathcal{N}(0,\sigma^2\mathrm{I}_{2^m})$ is given by

\[
I(X; X + Z), \qquad X \sim \mathrm{Unif}(\mathcal{C}), \tag{17}
\]

where $Z$ is independent of $X$. Despite being a well-behaved function of $\sigma$, an exact computation of this mutual information is infeasible.
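A minimal Python sketch of this estimation procedure (our own names; it reuses the `gaussian_mixture_entropy_mc` routine sketched in Section II-E and takes the modulated codebook as an input, omitting the construction of the Reed-Muller code itself):

```python
import numpy as np

def awgn_codebook_mutual_information(codebook, sigma, num_samples, rng=None):
    """Estimate I(X; X + Z) in nats, for X uniform over a BPSK codebook and
    Z ~ N(0, sigma^2 I), via I(X; X + Z) = h(X + Z) - h(Z) and the plug-in
    entropy estimator (Gaussian mixture centered at samples of X).

    codebook    : (M, d) array of modulated codewords (entries +1 / -1)
    sigma       : noise standard deviation
    num_samples : number of 'clean' codeword samples fed to the estimator
    """
    rng = np.random.default_rng(rng)
    M, d = codebook.shape
    x_samples = codebook[rng.integers(0, M, size=num_samples)]
    h_s = gaussian_mixture_entropy_mc(x_samples, sigma)  # plug-in estimate of h(X + Z)
    h_z = 0.5 * d * np.log(2 * np.pi * np.e * sigma**2)  # h(Z) in closed form
    return h_s - h_z
```

Since $X$ is discrete, $h(X + Z \,|\, X) = h(Z)$ is available in closed form, so only the unconditional entropy needs to be estimated.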

Using our estimator for differential entropy under Gaussian convolutions, the mutual information in (17) can be readily estimated based on samples of $X$. The estimation results for two Reed-Muller codes (containing different numbers of codewords) are shown in Fig. 7 for various values of $\sigma$ and of the number of samples of $X$ used for estimation. In Fig. 7(a) we plot our estimate of $I(X;X+Z)$ for one of these codes as a function of $\sigma$, for different numbers of estimation samples. This subfigure also shows that, as expected,