Information Theoretic Interpretation of Deep learning

# Information Theoretic Interpretation of Deep learning

Tianchen Zhao
Department of Mathematics
University of Michigan
Ann Arbor, MI 48104
ericolon@umich.edu
Yuekai Sun
Department of Statistics
University of Michigan
Ann Arbor, MI 48104
yuekai@umich.edu
###### Abstract

We interpret part of the experimental results of Shwartz-Ziv and Tishby (2017). Inspired by these results, we establish a conjecture about the dynamics of the machinery of deep neural networks. This conjecture can be used to explain the counterpart results of Saxe et al. (2018).

## 1 Introduction

### 1.1 Set up

Consider a deep learning problem in which the training data set $\{(x_i, y_i)\}_{i=1}^n$ is known, where $x_i$ are the objects of interest and $y_i$ are the corresponding labels, sampled from random variables $(X, Y)$ with unknown joint distribution. For example, $X$ could be a continuous random variable over $\mathbb{R}^d$ for some large integer $d$, representing the image of a single digit, and $Y$ a discrete random variable taking integer values from 0 to 9.

In practice, the data is fed into a deep neural network parametrized by $w$, which is of very high dimension. For each object $x_i$, the network assigns a probability to each label, and the prediction $\tilde{y}_i$ is taken to be the label with the highest probability. In practice this is usually achieved with a softmax function. The goal is to use an optimization algorithm to train the parameter $w$ such that the joint probability of $(X, \tilde{Y})$ matches the joint probability of $(X, Y)$.

### 1.2 Motivation

Our work is primarily motivated by Zhang et al. (2017a) and Poggio et al. (2017). In an underdetermined problem, where the number of parameters is much larger than the number of samples, there are typically infinitely many solutions for the parameters. According to Occam’s razor, "simple" solutions are usually desired. In practice, the training of a deep neural network is not explicitly regularized, so there is no obvious guarantee that the trained solution is "simple". However, experimental results report that the model keeps improving steadily after the set of solutions has already been reached. It is commonly believed that the optimization algorithm used, stochastic gradient descent (SGD), keeps improving among the solutions of the underdetermined system.

The experiments performed by Shwartz-Ziv and Tishby (2017) give an answer to this phenomenon from an information theoretic perspective. In their experiment, they designed a binary classification problem and trained a fully connected feed-forward neural network using SGD and the cross-entropy loss function.

For each epoch during the training, they discretized the values of the last feature layer into 30 bins between -1 and 1 (the range that follows from the sigmoid activation function). Then $P(\tilde{Y}, x)$ can be directly approximated by running the neural network. They then compute

$$P(\tilde{Y}|x) = \frac{P(\tilde{Y}, x)}{P(x)},$$

which is used to compute

$$P(\tilde{Y}, Y) = \sum_x P(x, Y) P(\tilde{Y}|x, Y) = \sum_x P(x, Y) P(\tilde{Y}|x).$$

$P(\tilde{Y}, x)$ and $P(\tilde{Y}, Y)$ are all we need to estimate $I(X;\tilde{Y})$ and $I(Y;\tilde{Y})$.

Their goal is to estimate the dynamics of the mutual information $I(X;\tilde{Y})$ and $I(Y;\tilde{Y})$. Here $I(X;\tilde{Y})$ can be interpreted as a measure of how much information $\tilde{Y}$ retains ("encodes") from $X$, and similarly $I(Y;\tilde{Y})$ is how much information $\tilde{Y}$ preserves ("decodes") from $Y$. We encourage the reader to watch the fantastic video of the optimization process in the information plane at https://goo.gl/rygyIT.
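The binning estimate described above can be sketched as follows. This is a minimal illustration and not the code of Shwartz-Ziv and Tishby (2017): the function names, the assumption that activations are stored as a 2-D array of shape (samples, units) with values in [-1, 1], and the use of natural logarithms are all choices made here for the example.

```python
import numpy as np

def discrete_mutual_information(a, b):
    """Mutual information (in nats) between two integer-coded sample arrays."""
    # Re-code both arrays as 0..k-1 so they can index a joint count table.
    _, a_idx = np.unique(np.asarray(a), return_inverse=True)
    _, b_idx = np.unique(np.asarray(b), return_inverse=True)
    a_idx, b_idx = a_idx.ravel(), b_idx.ravel()
    joint = np.zeros((a_idx.max() + 1, b_idx.max() + 1))
    np.add.at(joint, (a_idx, b_idx), 1.0)   # joint histogram
    joint /= joint.sum()                    # empirical joint distribution
    pa = joint.sum(axis=1, keepdims=True)   # marginal of a
    pb = joint.sum(axis=0, keepdims=True)   # marginal of b
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (pa @ pb)[mask])))

def bin_activations(t, n_bins=30, low=-1.0, high=1.0):
    """Discretize each unit into n_bins bins and give each row pattern one code."""
    edges = np.linspace(low, high, n_bins + 1)
    binned = np.digitize(t, edges[1:-1])    # bin index in 0..n_bins-1 per unit
    _, codes = np.unique(binned, axis=0, return_inverse=True)
    return codes.ravel()
```

With integer ids for the inputs and labels (hypothetical arrays `x_ids` and `y`), the two coordinates of the information plane would then be estimated as `discrete_mutual_information(x_ids, bin_activations(t))` and `discrete_mutual_information(y, bin_activations(t))`.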

In this paper we are interested in the behaviour of their last layer (in orange), which is essentially the behaviour of $I(X;\tilde{Y})$ and $I(Y;\tilde{Y})$. At the beginning of the training, both $I(X;\tilde{Y})$ and $I(Y;\tilde{Y})$ increase, meaning the network is "memorizing" the data from $X$ and outputting more meaningful information to $\tilde{Y}$; this process is fast. Then $I(X;\tilde{Y})$ starts to decrease while $I(Y;\tilde{Y})$ keeps increasing, meaning the network is "forgetting" the data it just memorized but is still improving to give more information about $Y$; this process is slow. The first part is named the fitting phase and the second part is named the compression phase.

Saxe et al. (2018) started a similar line of work measuring the dynamics of the mutual information. They proposed the following:

1. The compression phase is not explicit if the network uses the ReLU activation function throughout instead.

2. The compression phase happens alongside the fitting phase if the input data is Gaussian-like and the labels are assigned randomly.

3. It is theoretically implausible to measure the information in the intermediate layer, given that mutual information between continuous random variables is in general ill-defined.

### 1.3 Our contribution

#### 1.3.1 Decomposition of Deep Network

We propose the following conjecture about the machinery of deep neural networks (see Figure 1): the data is transformed into linearly separable features by the nonlinear ReLU network, which is "almost invertible" in a sense explained in detail below. The subsequent fully connected layer and sigmoid/softmax function together behave like a multiclass SVM, giving the best prediction among all available classes.

#### 1.3.2 Interpretation of Information Theory

Given this structure, we can interpret the experimental results of Shwartz-Ziv and Tishby (2017). The fitting phase corresponds to finding parameters of the ReLU network that organize the features in the last layer in a linearly separable fashion, and finding fully connected parameters that represent hyperplanes separating the features. The compression phase corresponds to finding parameters that maximize the margin of the linear separation, which is essentially driven by the randomness of the SGD algorithm. This progresses slowly, as we will prove in Section 2.1. Note that margin maximization proceeds by finding the "support features" and "forgetting" the other features, which is enabled by the sigmoid/softmax function. It follows that the mutual information between input and prediction diminishes during this phase.

The phenomena proposed by Saxe et al. (2018) can therefore be explained:

1. If the activation function used in the last layer is replaced by ReLU, then the SVM structure will be destroyed. The network is not "forgetting" approximately half of the features, so the compression phase will not be explicit.

2. The Gaussian input is isotropic, so no network can transform this data into linearly separable data. We argue that the compression phase only exists if there is a margin between data from different classes.

3. We agree that it is often dangerous to define mutual information between continuous-valued random variables. In general, if the joint probability of two continuous random variables is not degenerate over an open neighborhood, then there exists an invertible mapping, which needs an infinite amount of information to describe the mutual relationship. In fact, we do believe that the neural network will perform better if the intermediate layers can fully preserve the information of $X$. In this case ReLU does a better job than the sigmoid or tanh functions, which matches the experimental reports of Nair and Hinton (2010). But the last layer should be fully connected and activated by a sigmoid function, as indicated by Krizhevsky et al. (2012).

However, the mutual information between a continuous-valued random variable and a discrete-valued random variable is well defined in the sense that the discretized measurement converges (see Appendix C). So we argue that the empirical measurement by Shwartz-Ziv and Tishby (2017) on the last layer remains theoretically valid, which supports our hypothesis. Throughout our theoretical analysis below, both , represent a discretized prediction random variable when comparing with .

#### 1.3.3 Interpretation of Res-Net

Residual Network by He et al. (2015), the winner of ILSVRC2015, is one of the best existing deep network structures. Here we explain why it is so successful from our theory.

It is well known that a network that is too deep does not work very well. One of the reasons could be, from our perspective, that too much information is lost in the first part, $F$, of our proposed structure (Figure 1). Res-Net is designed to allow the model to "learn" the identity map easily. Specifically, Kraskov et al. (2004) mention in their appendix that mutual information is fully preserved under homeomorphisms (smooth and uniquely invertible maps). In Res-Net, the input vector $x$ and output vector $y$ of a building block are related as:

$$y = L(x) + x = (L + I)(x), \tag{1}$$

where the operator $L$ could be a composition of activation functions, convolution, drop-out (Srivastava et al. (2014)) and batch normalization (Ioffe and Szegedy (2015)). See Section 3.1 for more details.

It can be shown (see Appendix E for a proof) that if the operator norm $\|L\| < 1$, then $L + I$ is theoretically guaranteed to have an inverse, which enables information preservation between intermediate layers. In Section 3.3 we experimentally verified that $\|L\| < 1$ for all intermediate layers.

The main strength of Res-Net, in our understanding, is that it allows the deep mapping $F$ in Figure 1 to be invertible, independent of $L$, which is almost never invertible due to the use of ReLU, convolution, drop-out and other singular operations.

Jacobsen et al. (2018) built a deep invertible network and showed that a network can be successful without losing any information in the intermediate layers.

### 1.4 Related work

Zheng et al. (2018) understood DNNs from a maximum entropy perspective. Chaudhari et al. (2016) used entropy to detect wide valleys for the SGD algorithm. Shamir et al. (2010) proved a generalization property of the IB framework. Achille and Soatto (2016) investigated the amount of information lost through various operations in a deep network. Troost et al. (2018) proved an upper bound on the number of nodes needed for a two-layer network to separate the data. Alemi et al. (2016) designed a variational approximation to the IB framework.

## 2 Proofs

In this section we prove our conjecture in Figure 1 as follows:

1. We prove that it takes time of polynomial order for $I(Y;\tilde{Y})$ to grow during the fitting phase, and time of exponential order for it to grow during the compression phase, which matches the video of the information plane dynamics by Shwartz-Ziv and Tishby (2017).

2. We assume that $F$ in Figure 1 is almost invertible in the sense that $I(X;\tilde{Y}) = I(F(X);\tilde{Y})$.

We argue that the Deep Network , treated as an abstract function describing relationship between and , is governed, if the word "parametrized" is not proper, by the quantities and .

From an information theoretic setup, the IB problem can be considered as:

$$I(X;\tilde{Y}) - \alpha I(Y;\tilde{Y}) = I(F(X);\tilde{Y}) - \alpha I(Y;\tilde{Y}), \tag{2}$$

where $\alpha$ is some positive constant. Therefore we can abuse the notation between $X$ and $F(X)$ as they are equivalent in the information theoretic setting. In particular, we have where is defined in Figure 1.

Then we prove there is a direct relationship between the IB problem and the SVM problem in the linear case.

We conclude that a deep neural network, from an information theoretic point of view, can be reduced to a hard-margin SVM problem.

3. We discuss the notion of generalization in our context.

### 2.1 Machinery of SGD

There is a history of work on the behaviour of discrete SGD. In particular, Roberts and Tweedie (1996) show that the discrete Langevin diffusion converges to the target distribution exponentially fast in time. This is a general result applying to Markov chain Monte Carlo (MCMC) and is not practical because SGD is a "path-finder" with a diminishing step length. Raginsky et al. (2017) and Tzen et al. (2018) showed that the search path of SGD needs time of exponential order to jump from one local minimum to another, which is intuitively why the compression phase is slow. Zhang et al. (2017b) showed that the SGD path needs time of polynomial order to enter the first "good" local minimum, which is intuitively why the fitting phase is fast.

In our work, we present a similar result from an information theoretic perspective, by using an analogy to standard stochastic convex analysis in Bottou (1998).

Recall this video from Shwartz-Ziv and Tishby (2017) on information plane at https://goo.gl/rygyIT.

###### Claim 2.1.

The general form of SGD is given by:

$$w_{t+1} = w_t - a_t \nabla F(w_t) - b_t B, \tag{3}$$

where $a_t, b_t$ are positive constants varying with time and $B$ is a standard Gaussian, $B \sim N(0, I)$.

In particular, consider a Langevin Monte Carlo (LMC) setting where $a_t = 1/t$ and $b_t = 1/\sqrt{t}$.

Then $I(Y;\tilde{Y})$ reaches a local maximum in time of polynomial order, and hits a higher local maximum in time of exponential order.
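To make the update rule in the claim concrete, the following sketch simulates it with step sizes $a_t = 1/t$ and $b_t = 1/\sqrt{t}$, the choice under which the bound (9) turns into (10) in the proof. The quadratic objective, the constant `m`, and the function name `langevin_sgd` are assumptions made purely for illustration; they are not part of the analysis.

```python
import numpy as np

def langevin_sgd(grad, w0, n_steps, seed=0):
    """Iterate w_{t+1} = w_t - a_t * grad(w_t) - b_t * B, a_t = 1/t, b_t = 1/sqrt(t)."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float).copy()
    path = [w.copy()]
    for t in range(1, n_steps + 1):
        a_t = 1.0 / t
        b_t = 1.0 / np.sqrt(t)
        noise = rng.standard_normal(w.shape)  # B ~ N(0, I)
        w = w - a_t * grad(w) - b_t * noise
        path.append(w.copy())
    return np.array(path)

# Toy objective F(w) = 0.5 * m * ||w||^2 (an assumption), minimized at w* = 0.
m = 0.5
path = langevin_sgd(lambda w: m * w, w0=[5.0, -5.0], n_steps=2000)
dist_sq = np.sum(path ** 2, axis=1)  # squared distance to w* along the path
```

Plotting `dist_sq` against the step index shows the fast initial approach to the minimizer followed by noise-driven fluctuation around it.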

###### Proof.

From Appendix A, we have:

$$I(X;\tilde{Y}) \le 2\left(\mathrm{Var}(p(\tilde{y})) + \mathrm{Var}(p(\tilde{y}|x))\right). \tag{4}$$

From Appendix B, we have:

$$I(Y;\tilde{Y}) \ge A + \log\left(\prod_{i=1}^{n} f(y_i|x_i; w_t)\right) + \sum_{i=1}^{n} \log(p(y_i|x_i)). \tag{5}$$

Now we focus on showing that this lower bound (5) first increases polynomially fast, then logarithmically slowly, driven by the machinery of SGD.

Denote and consider the SGD mechanics in general form as follows:

$$w_{t+1} = w_t - a_t \nabla F(w_t) - b_t B, \tag{6}$$

where $a_t, b_t$ are positive constants varying with time and $B$ is a standard Gaussian, $B \sim N(0, I)$.

Denote a discrete set of local minima and saddle points of $F$ as $C$.

Define the metric $\rho$ as:

$$\rho(w, C) = \inf\{\|w - c\|_2 : c \in C\}. \tag{7}$$

Assumption:

(1) is smooth with respect to :

Let , then , for all satisfying .

(2) is strongly convex with respect to :

Let , then , for all satisfying .

Now suppose is closed to , where is more than O(t) away from other element in , then if we consider the training process restricted to a polynomial time regime, we have , and we have the following:

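$$
\begin{aligned}
\|w_{t+1} - w^*\|^2 &= \|w_t - a_t \nabla F(w_t) - b_t B - w^*\|^2 \\
&= \|w_t - w^*\|^2 - 2\langle w_t - w^*, a_t \nabla F(w_t) + b_t B\rangle + \|a_t \nabla F(w_t) + b_t B\|^2
\end{aligned}
\tag{8}
$$

Taking expectation with respect to the filtration $\mathcal{F}_t$ at time $t$ gives:

$$
\begin{aligned}
E(\|w_{t+1} - w^*\|^2 \mid \mathcal{F}_t) &= \|w_t - w^*\|^2 - 2a_t\langle w_t - w^*, \nabla F(w_t)\rangle + a_t^2\|\nabla F(w_t)\|^2 + b_t^2 \\
&\le \|w_t - w^*\|^2 - 2a_t m\|w_t - w^*\|^2 + a_t^2 M^2\|w_t - w^*\|^2 + b_t^2 \\
&= (1 - 2a_t m + a_t^2 M^2)\|w_t - w^*\|^2 + b_t^2
\end{aligned}
\tag{9}
$$

Now consider a Langevin Monte Carlo (LMC) setting where $a_t = 1/t$ and $b_t = 1/\sqrt{t}$. Then (9) becomes:

$$
\begin{aligned}
E(\|w_{t+1} - w^*\|^2 \mid \mathcal{F}_t) &\le \left(1 - \frac{2m}{t} + \frac{M^2}{t^2}\right)\|w_t - w^*\|^2 + \frac{1}{t} \\
E(\|w_{t+1} - w^*\|^2) &\le \left(1 - \frac{2m}{t} + \frac{M^2}{t^2}\right)E(\|w_t - w^*\|^2) + \frac{1}{t} \\
E(\|w_{t+1} - w^*\|^2) &\le \prod_{s=1}^{t}\left(1 - \frac{2m}{s} + \frac{M^2}{s^2}\right)E(\|w_1 - w^*\|^2) + \sum_{s=1}^{t}\frac{1}{s} \\
&= \prod_{s=1}^{t}\left(1 - \frac{2m}{s} + \frac{M^2}{s^2}\right)E(\|w_1 - w^*\|^2) + O(\log(t) + 1)
\end{aligned}
\tag{10}
$$

According to L’Hôpital’s rule:

$$\lim_{s\to\infty} \frac{\log\left(1 - \frac{2m}{s} + \frac{M^2}{s^2}\right)}{\frac{1}{s}} = \lim_{s\to\infty} \frac{\dfrac{2m s^{-2} - 2M^2 s^{-3}}{1 - \frac{2m}{s} + \frac{M^2}{s^2}}}{-s^{-2}} = -2m. \tag{11}$$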

So is comparable with , and is comparable with and therefore is of order .

So our conclusion is:

$$E(\|w_{t+1} - w^*\|^2) = O\left(\frac{1}{t}\right)E(\|w_1 - w^*\|^2) + O(\log(t)) \tag{12}$$

This shows that, in a polynomial time regime, $w_t$ converges to a local minimum of $F$ linearly fast.

But if we consider the exponential time regime, the above analysis is invalid because the last term of (10) is no longer negligible. There is a possibility that $w_t$ may jump to another element of $C$.

Here we conclude that the lower bound (5) of the quantity of interest reaches a local maximum in polynomial time, then moves to a higher, "better" local maximum in exponential time, which explains Tishby’s video. ∎

### 2.2 Relationship between IB and SVM

For simplicity we prove the result for binary classification.

###### Claim 2.2.

Consider a binary classification problem where the data set is linearly separable. $\tilde{Y}$ is modeled by $p(\tilde{y} = 1|x) = \sigma(w^\top x)$ and $p(\tilde{y} = -1|x) = 1 - \sigma(w^\top x)$. Recall that .

Given an IB problem:

$$\text{Minimize } I(X;\tilde{Y}) - \alpha I(Y;\tilde{Y}) \tag{13}$$

it can be formulated as a hard-margin SVM problem.

###### Proof.

From Appendix A we have that:

$$I(X;\tilde{Y}) \le 2\left(\mathrm{Var}(p(\tilde{y})) + \mathrm{Var}(p(\tilde{y}|x))\right) \tag{14}$$

Here we make the assumption that, for the models we trained over time, the output is approximately uniformly distributed over the finite labels. So we treat as constant for all . In particular, we assume: , then the first term on the RHS of (14) can be controlled:

$$\mathrm{Var}(p(\tilde{y})) \le D, \tag{15}$$

for some constant $D$.

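We can bound the second term on the RHS of (14) by:

$$
\begin{aligned}
\mathrm{Var}(p(\tilde{y}|x)) &\le \int_{\tilde{Y}}\int_X p(x,\tilde{y})\, p(\tilde{y}|x)^2 \, dx\, d\tilde{y} \\
&= \int_X p(\tilde{y}=1|x)^3 p(x)\, dx + \int_X p(\tilde{y}=-1|x)^3 p(x)\, dx \\
&= \int_X \sigma(w^\top x)^3 p(x)\, dx + \int_X (1 - \sigma(w^\top x))^3 p(x)\, dx \\
&= \int_X \left(1 - 3\sigma(w^\top x) + 3\sigma(w^\top x)^2\right) p(x)\, dx \\
&= \frac{1}{4} + 3\int_X \left(\sigma(w^\top x) - \frac{1}{2}\right)^2 p(x)\, dx
\end{aligned}
\tag{16}
$$

Consider the first-order Taylor expansion:

$$\sigma(w^\top x) = \sigma(0) + \sigma'(c)(w^\top x) \le \frac{1}{2} + \frac{1}{4} w^\top x. \tag{17}$$

Substitute it into (16) to get:

$$
\begin{aligned}
\mathrm{Var}(p(\tilde{y}|x)) &\le \frac{1}{4} + 3\int_X \left(\sigma(w^\top x) - \frac{1}{2}\right)^2 p(x)\, dx \\
&\le \frac{1}{4} + \frac{3}{16}\int_X (w^\top x)^2 p(x)\, dx \\
&\le \frac{1}{4} + \left(\frac{3}{16}\int_X \|x\|^2 p(x)\, dx\right)\|w\|^2
\end{aligned}
\tag{18}
$$

To conclude, we have a bound on (14) of the form:

$$I(X;\tilde{Y}) \le A + B\|w\|^2. \tag{19}$$

From Appendix B, we approximate the mutual information by:

$$
\begin{aligned}
I(Y;\tilde{Y}) &= \sum_{i}\log\left(\frac{p(y_i,\tilde{y}_i)}{p(y_i)p(\tilde{y}_i)}\right) \\
&= \sum_{i=1}^{n}\log\left(\frac{\sum_{j=1}^{n} p(y_i,\tilde{y}_i|x_j)}{p(y_i)p(\tilde{y}_i)}\right) \\
&= \sum_{i=1}^{n}\log\left(\frac{\sum_{j=1}^{n} p(y_i|\tilde{y}_i,x_j)\,p(\tilde{y}_i|x_j)}{p(y_i)p(\tilde{y}_i)}\right) \\
&= \sum_{i=1}^{n}\log\left(\frac{\sum_{j=1}^{n} p(y_i|x_j)\,p(\tilde{y}_i|x_j)}{p(y_i)p(\tilde{y}_i)}\right)
\end{aligned}
\tag{20}
$$

with high probability.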

Also note that is given by the model .

where the prediction satisfies:

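$$p(\tilde{y}_i = 1|x_i) = f(x_i, w) = \sigma(w^\top x_i). \tag{21}$$

So $I(Y;\tilde{Y})$ is now of the form:

$$
\begin{aligned}
I(Y;\tilde{Y}) &\ge A + \sum_{i=1}^{n}\log\left(\sum_{j=1}^{n} p(y_i|x_j)\, f(\tilde{y}_i|x_j; w)\right) \\
&\ge A + \sum_{i=1}^{n}\log\left(p(y_i|x_i)\, f(\tilde{y}_i|x_i; w)\right) \\
&\ge A + \sum_{i=1}^{n}\log\left(f(\tilde{y}_i|x_i; w)\right) + \sum_{i=1}^{n}\log(p(y_i|x_i)) \\
&\ge A + \sum_{i=1}^{n}\log\left(\left|\sigma(w^\top x_i) - \frac{1}{2}\right| + \frac{1}{2}\right) + \sum_{i=1}^{n}\log(p(y_i|x_i))
\end{aligned}
\tag{22}
$$

with high probability for some constant $A$.

Finally we put (19) and (22) together:

$$I(X;\tilde{Y}) - \alpha I(Y;\tilde{Y}) \le A' + B\|w\|^2 - \alpha\sum_{i}\log\left(\left|\sigma(w^\top x_i) - \frac{1}{2}\right| + \frac{1}{2}\right). \tag{23}$$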

This is a Lagrangian form of an optimal margin classifier. ∎
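As an illustration of the right-hand side of (23), the sketch below evaluates the $w$-dependent part of the bound on a small feature matrix. The constant $A'$ is dropped, and the values chosen for `alpha` and `B` as well as the function names are placeholders assumed for the example rather than quantities derived in the proof.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ib_surrogate(w, X, alpha=1.0, B=1.0):
    """B * ||w||^2 - alpha * sum_i log(|sigmoid(w^T x_i) - 1/2| + 1/2)."""
    scores = X @ w                                    # w^T x_i for every row x_i
    confidence = np.abs(sigmoid(scores) - 0.5) + 0.5  # lies in [1/2, 1)
    return float(B * np.dot(w, w) - alpha * np.sum(np.log(confidence)))

# Hypothetical feature matrix with one row per sample.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-2.0, -1.0]])
value_at_zero = ib_surrogate(np.zeros(2), X)  # every confidence equals 1/2 here
```

The first term penalizes the norm of $w$, while the second term rewards predictions that are far from the decision boundary, which is the trade-off the proof identifies with an optimal margin classifier.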

### 2.3 Generalization

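Consider the target random variable $X$ with unknown distribution. We have a collection of samples of it, denoted as $\{x_i\}_{i=1}^{n}$. Also denote the hypothesis class as $\mathcal{H}$.

The loss function is defined as:

$$l : \mathcal{X} \times \mathcal{H} \mapsto \mathbb{R}_+ \tag{24}$$

And the risk:

$$R(f) = E_X[l(x, f)]. \tag{25}$$

The optimal function is defined as:

$$f^* \in \operatorname*{argmin}_{f \in \mathcal{H}} R(f). \tag{26}$$

Empirical risk minimization (ERM) is given as:

$$\hat{f} \in \operatorname*{argmin}_{f \in \mathcal{H}} \hat{R}(f), \tag{27}$$

where $\hat{R}$ is given as:

$$\hat{R}(f) = \frac{1}{n}\sum_{i=1}^{n} l(x_i, f). \tag{28}$$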

Consider the following decomposition:

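$$R(\hat{f}) = \left[R(\hat{f}) - \hat{R}(\hat{f})\right] + \left[\hat{R}(\hat{f}) - \hat{R}(f^*)\right] + \left[\hat{R}(f^*) - R(f^*)\right] + R(f^*). \tag{29}$$

Here in (29), the second term $\hat{R}(\hat{f}) - \hat{R}(f^*)$ is bounded above by zero by definition; the third term $\hat{R}(f^*) - R(f^*)$ is small, guaranteed by the law of large numbers under the assumption that the number of samples is sufficiently large; and $R(f^*)$ is a constant depending on the hypothesis class $\mathcal{H}$.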
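The definitions (27) and (28) can be illustrated on a finite hypothesis class. The sketch below is a toy setting assumed for the example (constant predictors with squared loss); it only demonstrates that the empirical minimizer never has a larger empirical risk than any other member of the class, which is the "by definition" step used above.

```python
import numpy as np

def empirical_risk(loss, samples, f):
    """Average loss of hypothesis f over the samples, as in (28)."""
    return float(np.mean([loss(x, f) for x in samples]))

def erm(loss, samples, hypotheses):
    """Return a hypothesis with the smallest empirical risk, as in (27)."""
    risks = [empirical_risk(loss, samples, f) for f in hypotheses]
    return hypotheses[int(np.argmin(risks))]

# Toy setting (hypothetical): hypotheses are constants c, loss is squared error.
squared_loss = lambda x, c: (x - c) ** 2
samples = [1.0, 2.0, 3.0]
hypotheses = [0.0, 1.0, 2.0, 3.0]
f_hat = erm(squared_loss, samples, hypotheses)
```

Here `f_hat` is the constant closest to the sample mean among the candidates.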

The bound for is typically controlled by the VC theory in the literature, see Vapnik (1995). But as pointed out in Zhang et al. (2017a), if the number of parameters is much larger than the number of samples, some form of regularization is needed to ensure small generalization error. Neyshabur et al. (2017b) mentioned that a sharper bound can be obtained by making it dependent on the choice of .

We would like to provide more insight into generalization by formally establishing a relationship between generalization in deep learning and margin. In our model, the SVM in the last layer generalizes better if the margin is larger. Neyshabur et al. (2017a) carefully analyzed the notion of norm-based control on margin in relation to generalization. Bartlett et al. (2017) also proved a margin-based bound on generalization using Rademacher complexity.

## 3 Experiments

In this section we interpret existing experimental results using our theory. We also ran an experiment on Res-Net to verify features of our own interest.

### 3.1 Deep Residual Network

Here we refer in full to the experimental results of He et al. (2016).

Figure 3: Various usages of activation in Table 1. All these units consist of the same components; only the orders are different.

Here they compared different structures for building blocks and reported the performance in Table 1. We interpret their result according to our theory: (a) and (b) both apply ReLU after adding the identity mapping, which makes the whole mapping non-invertible; (c) makes $L$ defined in (1) a nonnegative operator, which may enlarge its operator norm beyond our theoretical guarantee (Appendix E); (d) applies ReLU directly to the input, which loses information, and applying batch normalization after convolution is also meaningless.

### 3.2 Flat Basins

Figure 4: Large batch training immediately followed by small batch training on the full dataset of CIFAR10 with a raw version of AlexNet.

Here we refer in full to the experimental results of Sagun et al. (2017).

On the generalization property of GD, Poggio et al. (2017) proved in their appendix that GD converges to the optimal solution of an underdetermined least squares problem under proper initialization; Soudry et al. (2018) proved that GD converges slowly to the optimal solution of a hard-margin SVM.

Their work motivated us to look for the corresponding result in deep learning. The difference is that, in our theory, the separable data is not deterministic, as it is controlled by the feature learning map in Figure 1. The network is looking for an $F$ that creates a larger margin between data of different categories and a "weighted" linear separator to achieve that maximum margin.

Back to their experiment: they trained a neural network with batch gradient descent (GD) for the first 2.5k steps and switched to stochastic gradient descent (SGD) for the remaining 2.5k steps. Notice that there is a significant jump at step 2.5k, when the optimization algorithm is changed.

They argue that despite the jump, there exists a linear interpolation between LHS and RHS, so GD and SGD lead to essentially the same "basin". As pointed out by Dauphin et al. (2014), in a very high dimensional problem it is very hard to encounter a strict local minimum: almost all critical points are saddle points, as there could always exist some direction that is not "going up" in the landscape. So we do suspect that almost all "basins" are connected together in some sense. Dinh et al. (2017) argued that a universal definition of flatness of the error surface is still unclear. In particular, they proved that the geometry of a local minimum can be changed arbitrarily by reparametrization, without changing the function it represents.

Instead of concerning ourselves with the notion of flatness, we can talk about margin. According to our theory, GD only separates the feature data but does not maximize the margin. At step 2.5k, there are some feature points sitting right at the boundary of the linear separator, and SGD will break this balance, which leads to the sudden jump in the training error. In the end SGD generalizes better because the linear separator in the last layer separates the feature data by a larger margin. The notion of margin needs to be defined carefully, as it scales with the norm of the weights and data.

### 3.3 Norm of L

Figure 5: Values of $\|L(x)\|/\|x\|$ for each building block over the training steps.

We used the TensorFlow code uploaded on GitHub by wenxinxu, ran a 32-layer residual network on CIFAR-10 for 80k steps, and computed $\|L(x)\|/\|x\|$ for each building block at each step. Here $\|\cdot\|$ is defined to be the square root of the sum of squares of all entries in the tensor. We exported the output for all 15 building blocks over the training from TensorBoard. We conclude that the operator norm of every building block is smaller than 1, which meets our hypothesis.
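The measured quantity can be written down in a few lines. The sketch below is not the TensorFlow code used for the experiment; it is a NumPy illustration in which `residual_branch` is a hypothetical callable standing in for the operator $L$ of one building block.

```python
import numpy as np

def residual_norm_ratio(x, residual_branch):
    """Return ||L(x)|| / ||x||, with ||.|| the square root of the sum of squared entries."""
    lx = residual_branch(x)
    return float(np.sqrt(np.sum(lx ** 2)) / np.sqrt(np.sum(x ** 2)))

# Hypothetical residual branch that simply halves its input tensor.
x = np.ones((2, 3))
ratio = residual_norm_ratio(x, lambda v: 0.5 * v)
```

The ratio is evaluated on a given input tensor, so in practice it would be logged for the activations entering each block at each training step.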

### 3.4 Advantage of ReLU over Sigmoid/Tanh

Figure 6: Stochastic training and the information plane. (A) tanh network trained with SGD. (B) tanh network trained with BGD. (C) ReLU network trained with SGD. (D) ReLU network trained with BGD. Both random and non-random training procedures show similar information plane dynamics.

Here we refer in full to the experimental results of Saxe et al. (2018).

Although the sigmoid and tanh functions are mathematically invertible, they push a large amount of information to the boundary of the range, which in practice will be classified as a single bin, making them highly non-invertible. On the other hand, ReLU keeps at least half of the information from the input.

Their experimental result matches our theory: the tanh function compresses information and ReLU keeps a fair amount of information.

We would like to emphasize again that a network only needs one compressive activation function, in the last layer, where it plays a role similar to an SVM.

### 3.5 iRevNet

Figure 7: Accuracy of a linear SVM and nearest neighbor against the number of principal components retained.

Here we refer in full to the experimental results of Jacobsen et al. (2018).

They projected the feature data of the last layer of the network to a low dimensional space by PCA and ran an SVM on them. The good performance of the SVM shows that the intrinsic dimensionality of the feature data in the last layer is low, which supports our separability assumptions.

## 4 Conclusion

In this paper, we analyzed the dynamics of the information plane proposed by Shwartz-Ziv and Tishby (2017). More importantly, we gave a hypothesis for the learning structure of deep neural networks, and answered the questions arising from Saxe et al. (2018).

## References

• Achille and Soatto  Alessandro Achille and Stefano Soatto. Information dropout: learning optimal representations through noise. CoRR, abs/1611.01353, 2016.
• Alemi et al.  Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, and Kevin Murphy. Deep variational information bottleneck. CoRR, abs/1612.00410, 2016.
• Bartlett et al.  Peter L. Bartlett, Dylan J. Foster, and Matus Telgarsky. Spectrally-normalized margin bounds for neural networks. CoRR, abs/1706.08498, 2017.
• Bottou  Léon Bottou. On-line learning in neural networks. chapter On-line Learning and Stochastic Approximations, pages 9–42. Cambridge University Press, New York, NY, USA, 1998. ISBN 0-521-65263-4.
• Chaudhari et al.  Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer T. Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-sgd: Biasing gradient descent into wide valleys. CoRR, abs/1611.01838, 2016.
• Dauphin et al.  Yann Dauphin, Razvan Pascanu, Çaglar Gülçehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. CoRR, abs/1406.2572, 2014.
• Dinh et al.  Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. CoRR, abs/1703.04933, 2017.
• He et al.  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
• He et al.  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. CoRR, abs/1603.05027, 2016.
• Ioffe and Szegedy  Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015.
• Jacobsen et al.  Jörn-Henrik Jacobsen, Arnold Smeulders, and Edouard Oyallon. i-RevNet: Deep Invertible Networks. In ICLR 2018 - International Conference on Learning Representations, Vancouver, Canada, April 2018.
• Kraskov et al.  Alexander Kraskov, Harald Stögbauer, and Peter Grassberger. Estimating mutual information. Phys. Rev. E, 69:066138, Jun 2004.
• Krizhevsky et al.  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
• Lax  P.D. Lax. Functional analysis. Pure and applied mathematics. Wiley, 2002. ISBN 9780471556046.
• Nair and Hinton  Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML’10, pages 807–814, USA, 2010. Omnipress. ISBN 978-1-60558-907-7.
• Neyshabur et al. [2017a] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. Exploring generalization in deep learning. CoRR, abs/1706.08947, 2017a.
• Neyshabur et al. [2017b] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. Exploring generalization in deep learning. In NIPS, 2017b.
• Poggio et al.  Tomaso Poggio, Qianli Liao, Brando Miranda, Lorenzo Rosasco, Xavier Boix, Jack Hidary, and Hrushikesh Mhaskar. Theory of deep learning iii: explaining the non-overfitting puzzle. 12/2017 2017.
• Raginsky et al.  Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via stochastic gradient langevin dynamics: a nonasymptotic analysis. In Satyen Kale and Ohad Shamir, editors, Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 1674–1703, Amsterdam, Netherlands, 07–10 Jul 2017. PMLR.
• Roberts and Tweedie  Gareth O. Roberts and Richard L. Tweedie. Exponential convergence of langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363, 12 1996.
• Sagun et al.  Levent Sagun, Utku Evci, V. Ugur Güney, Yann Dauphin, and Léon Bottou. Empirical analysis of the hessian of over-parametrized neural networks. CoRR, abs/1706.04454, 2017.
• Saxe et al.  Andrew Michael Saxe, Yamini Bansal, Joel Dapello, Madhu Advani, Artemy Kolchinsky, Brendan Daniel Tracey, and David Daniel Cox. On the information bottleneck theory of deep learning. In International Conference on Learning Representations, 2018.
• Shamir et al.  Ohad Shamir, Sivan Sabato, and Naftali Tishby. Learning and generalization with the information bottleneck. Theoretical Computer Science, 411(29):2696 – 2711, 2010. ISSN 0304-3975. Algorithmic Learning Theory (ALT 2008).
• Shwartz-Ziv and Tishby  Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.
• Soudry et al.  Daniel Soudry, Elad Hoffer, and Nathan Srebro. The implicit bias of gradient descent on separable data. In International Conference on Learning Representations, 2018.
• Srivastava et al.  Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
• Troost et al.  Marjolein Troost, Katja Seeliger, and Marcel van Gerven. Generalization of an upper bound on the number of nodes needed to achieve linear separability. 02 2018.
• Tzen et al.  Belinda Tzen, Tengyuan Liang, and Maxim Raginsky. Local optimality and generalization guarantees for the langevin algorithm via empirical metastability. CoRR, abs/1802.06439, 2018.
• Vapnik  Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc., New York, NY, USA, 1995. ISBN 0-387-94559-8.
• Vershynin  Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018.
• Zhang et al. [2017a] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. 2017a.
• Zhang et al. [2017b] Yuchen Zhang, Percy Liang, and Moses Charikar. A hitting time analysis of stochastic gradient langevin dynamics. In COLT, 2017b.
• Zheng et al.  Guanhua Zheng, Jitao Sang, and Changsheng Xu. Understanding deep learning generalization by maximum entropy, 2018.

## Appendix A Bound for I(X;~Y)

The intuition is Jensen’s inequality is loose if the term in deviates from constant by a lot.

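Consider the second quantity given by:

$$I(X;\tilde{Y}) = \int_{\tilde{Y}}\int_X p(x,\tilde{y}) \log\left(\frac{p(x,\tilde{y})}{p(x)p(\tilde{y})}\right) dx\, d\tilde{y} \tag{30}$$

Our goal is to show that (30) is decreasing at a rate upper bounded by a polynomial.

Denote $f(x,\tilde{y}) = \frac{p(x)p(\tilde{y})}{p(x,\tilde{y})}$, and consider the second-order Taylor form around 1:

$$\log(f(x,\tilde{y})) = \log(1) + (f(x,\tilde{y}) - 1) - \frac{1}{2c^2}(f(x,\tilde{y}) - 1)^2, \tag{31}$$

where $c$ is between $f(x,\tilde{y})$ and 1.

Observing that $\int_{\tilde{Y}}\int_X p(x,\tilde{y})(f(x,\tilde{y}) - 1)\, dx\, d\tilde{y} = 0$, substitute (31) into (30) to get:

$$I(X;\tilde{Y}) = \int_{\tilde{Y}}\int_X p(x,\tilde{y}) \frac{1}{2c^2}(f(x,\tilde{y}) - 1)^2\, dx\, d\tilde{y}. \tag{32}$$

If $f(x,\tilde{y}) \ge 1$, then $\frac{1}{c^2} \le 1$; if $f(x,\tilde{y}) < 1$, then $\frac{1}{c^2} \le \frac{1}{f(x,\tilde{y})^2}$.

It follows that we can upper bound (32) by:

$$
\begin{aligned}
I(X;\tilde{Y}) &\le \frac{1}{2}\int_{\tilde{Y}}\int_X p(x,\tilde{y})\left[(f(x,\tilde{y}) - 1)^2 + \left(1 - \frac{1}{f(x,\tilde{y})}\right)^2\right] dx\, d\tilde{y} \\
&= \mathrm{Var}(f(X,\tilde{Y})) + \mathrm{Var}\left(\frac{1}{f(X,\tilde{Y})}\right) \\
&= \mathrm{Var}\left(\frac{p(\tilde{y})}{p(\tilde{y}|x)}\right) + \mathrm{Var}\left(\frac{p(\tilde{y}|x)}{p(\tilde{y})}\right) \\
&\le 2\left(\mathrm{Var}(p(\tilde{y})) + \mathrm{Var}(p(\tilde{y}|x))\right)
\end{aligned}
\tag{33}
$$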

## Appendix B Approximation for I(Y;~Y)

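Consider

$$I(Y;\tilde{Y}) = \int_{\tilde{Y}}\int_Y p(y,\tilde{y}) \log\left(\frac{p(y,\tilde{y})}{p(y)p(\tilde{y})}\right) dy\, d\tilde{y}, \tag{34}$$

which can be regarded as $E[Z]$ for some random variable $Z$ with probability density:

$$p_Z\left(\log\left(\frac{p_{\tilde{Y},Y}(y,\tilde{y})}{p_{\tilde{Y}}(\tilde{y})\,p_Y(y)}\right)\right) = p_{\tilde{Y},Y}(y,\tilde{y}). \tag{35}$$

By Appendix F, $Z$ is subexponential, so by Bernstein’s inequality (see Vershynin (2018)), for a sample size large enough, a high probability bound guarantees the empirical approximation of $I(Y;\tilde{Y})$:

$$\sum_{i=1}^{n}\log\left(\frac{p(y_i,\tilde{y}_i)}{p(y_i)p(\tilde{y}_i)}\right), \tag{36}$$

which has another empirical version:

$$
\begin{aligned}
\sum_{i=1}^{n}\log\left(\frac{\sum_{j=1}^{n} p(y_i,\tilde{y}_i|x_j)}{p(y_i)p(\tilde{y}_i)}\right) &= \sum_{i=1}^{n}\log\left(\frac{\sum_{j=1}^{n} p(y_i|\tilde{y}_i,x_j)\,p(\tilde{y}_i|x_j)}{p(y_i)p(\tilde{y}_i)}\right) \\
&= \sum_{i=1}^{n}\log\left(\frac{\sum_{j=1}^{n} p(y_i|x_j)\,p(\tilde{y}_i|x_j)}{p(y_i)p(\tilde{y}_i)}\right).
\end{aligned}
\tag{37}
$$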

Here we make an assumption that for the models we trained over time, it’s output is approximately uniform distributed over the finite labels. So we treat as constant for all . Also note that is given by the model .

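The prediction satisfies:

$$\tilde{y}_i = \operatorname*{argmax}_{y} f(y|x_i, w_t), \tag{38}$$

So $I(Y;\tilde{Y})$ is now of the form:

$$
\begin{aligned}
I(Y;\tilde{Y}) &\ge A + \sum_{i=1}^{n}\log\left(\sum_{j=1}^{n} p(y_i|x_j)\, f(\tilde{y}_i|x_j; w_t)\right) \\
&\ge A + \sum_{i=1}^{n}\log\left(p(y_i|x_i)\, f(\tilde{y}_i|x_i; w_t)\right) \\
&\ge A + \log\left(\prod_{i=1}^{n} f(\tilde{y}_i|x_i; w_t)\right) + \sum_{i=1}^{n}\log(p(y_i|x_i)) \\
&\ge A + \log\left(\prod_{i=1}^{n} f(y_i|x_i; w_t)\right) + \sum_{i=1}^{n}\log(p(y_i|x_i))
\end{aligned}
\tag{39}
$$

with high probability for some constant $A$.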

## Appendix C Continuous entropy

The natural definition of the mutual information between a discrete random variable $X$ and $g(X)$, where $g$ is a deterministic function, if we try to define it at all, is as follows:

$$I(X; g(X)) = -\sum_{x} p(x)\log p(x), \tag{40}$$

where for simplicity we assume the range of is in .

If instead we consider that $X$ has a continuous range, for example $\mathbb{R}$, with density $f$, we can take a uniform mesh over $\mathbb{R}$ with interval $\Delta$ and do the approximation:

$$p(X \in [x, x+\Delta]) = \int_{x}^{x+\Delta} f(a)\, da \approx f(x)\Delta. \tag{41}$$

And therefore:

$$I(X; g(X)) \approx -\sum_{i=-\infty}^{\infty} f(x_i)\Delta \log(f(x_i)\Delta), \tag{42}$$

where $\{x_i\}$ is the mesh on $\mathbb{R}$. We expect this estimation to be precise if we take $\Delta \to 0$.

But this limit is different from the intuitive definition of the differential entropy $I(X)$, provided it exists:

$$I(X) = -\int_{\mathbb{R}} f(x)\log(f(x))\, dx = -\lim_{\Delta\to 0}\sum_{i=-\infty}^{\infty} f(x_i)\Delta\log(f(x_i)). \tag{43}$$

Intuitively, the $\log\Delta$ term in (42) diverges to $-\infty$ as $\Delta \to 0$.

In practice people estimate the mutual information by (42), but it does not yield a meaningful quantity. In particular, we do not know whether (42) converges or not if we take $\Delta \to 0$.
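The dependence of the discretized sum (42) on the mesh width can be checked numerically. The sketch below assumes a standard normal density, truncates the mesh to a finite window, and uses natural logarithms; all three are choices made only for this illustration.

```python
import numpy as np

def discretized_entropy(delta, half_width=12.0):
    """-sum_i f(x_i) * delta * log(f(x_i) * delta) for a standard normal density f."""
    x = np.arange(-half_width, half_width, delta)   # uniform mesh with interval delta
    f = np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)
    p = f * delta                                   # approximate bin probabilities
    return float(-np.sum(p * np.log(p)))

# Differential entropy of a standard normal, for comparison.
h_normal = 0.5 * np.log(2.0 * np.pi * np.e)
coarse = discretized_entropy(0.1)
fine = discretized_entropy(0.01)
```

For this density the discretized value behaves like `h_normal - log(delta)`, so refining the mesh from 0.1 to 0.01 raises the estimate by about `log(10)` instead of letting it settle.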

Here key point here is that for general random variables , if is not degenerate over some open interval , then by inverse function theorem, there exists some invertible relationship between in and we need infinite amount of information to describe what is happening in .

But if we consider instead the mutual information between a discrete random variable with finite range and a continuous random variable with continuous density, then the estimation would take the form:

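$$I(X;Y) \approx \sum_{i=1}^{n}\sum_{j=-\infty}^{\infty} p(x_j, y_i)\Delta \log\left(\frac{p(x_j|y_i)\Delta}{\sum_{l} p(x_j|y_l)\Delta}\right), \tag{44}$$

which has a limit:

$$I(X;Y) = \sum_{i=1}^{n}\int_{\mathbb{R}} f(x, y_i)\log\left(\frac{f(x|y_i)}{\sum_{l} f(x|y_l)}\right) dx. \tag{45}$$

So if the analytical form (45) of $I(X;Y)$ is finite, we know our practical approximation is meaningful.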

In practice, the true quantity is usually finite. For example, in MNIST, both image and its label are essentially discrete so their mutual information can be defined in a strict discrete sense.

As a conclusion, it’s always sensible to define the mutual information between discrete finite random variable and continuous random variable in an exact integral form.

## Appendix D Heat Equation

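Given a Cauchy problem for the heat equation:

$$u_t = k u_{xx}, \qquad u(x, 0) = \phi(x) \tag{46}$$

The solution is known as:

$$u(x,t) = \frac{1}{\sqrt{4\pi k t}}\int_{-\infty}^{\infty} e^{-\frac{(x-y)^2}{4kt}}\phi(y)\, dy = K(x,t) * \phi(x), \tag{47}$$

where $K(x,t) = \frac{1}{\sqrt{4\pi k t}} e^{-\frac{x^2}{4kt}}$.

Without loss of generality assuming $k = 1$, consider:

$$\|K(x,t)\|_2^2 = C t^{-1}\int_{-\infty}^{\infty} e^{-\frac{x^2}{2t}}\, dx. \tag{48}$$

Substituting $x' = x/\sqrt{t}$ gives:

$$\|K(x,t)\|_2^2 = C t^{-1/2}\int_{-\infty}^{\infty} e^{-\frac{x'^2}{2}}\, dx' = C' t^{-1/2}. \tag{49}$$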
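The $t^{-1/2}$ scaling in (49) can be verified numerically. The sketch below integrates the squared heat kernel on a truncated grid; the grid size, the window, and the function name are assumptions made for the example, and the closed form $1/\sqrt{8\pi k t}$ used for comparison follows from the Gaussian integral.

```python
import numpy as np

def heat_kernel_l2_sq(t, k=1.0, half_width=60.0, n=200001):
    """Riemann-sum approximation of the squared L2 norm of K(., t)."""
    x = np.linspace(-half_width, half_width, n)
    K = np.exp(-x ** 2 / (4.0 * k * t)) / np.sqrt(4.0 * np.pi * k * t)
    dx = x[1] - x[0]
    return float(np.sum(K ** 2) * dx)

norm_sq_t1 = heat_kernel_l2_sq(1.0)
norm_sq_t4 = heat_kernel_l2_sq(4.0)
```

Quadrupling $t$ should halve the squared norm, which is what the ratio `norm_sq_t4 / norm_sq_t1` shows.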

## Appendix E Convergence in Banach Space

###### Claim E.1.

Consider Holder Space , where is the closure of some bounded open set , with equiped Holder norm , here is some positive scalar. If and , then there exists such that .

###### Proof.

It’s well known that is a Banach space (Lax ). Note that can be scale down to an arbitrary small constant in practice, which has no influence to our proof.

Define

$$B = \sum_{n=0}^{\infty} (-L)^n. \tag{50}$$

Since $\|L\| < 1$, the sequence of partial sums in (50) is a Cauchy sequence, so it converges in the Banach space. A convergent series can be multiplied termwise, so it follows that

$$BL = L\sum_{n=0}^{\infty} (-L)^n = -\sum_{n=1}^{\infty} (-L)^n = -(B - I). \tag{51}$$

So $B(I + L) = I$. The other equality can be shown similarly. ∎
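The series construction in the proof can be tried out in finite dimensions. In the sketch below a random matrix, rescaled to have spectral norm 0.5, stands in for the operator $L$; the matrix size, the number of terms, and the function name are assumptions made for the illustration.

```python
import numpy as np

def neumann_inverse(L, n_terms=200):
    """Approximate (I + L)^{-1} by the partial sum of (-L)^n for n = 0..n_terms-1."""
    d = L.shape[0]
    B = np.zeros((d, d))
    term = np.eye(d)            # (-L)^0
    for _ in range(n_terms):
        B += term
        term = term @ (-L)      # next power of -L
    return B

rng = np.random.default_rng(0)
L = rng.standard_normal((4, 4))
L *= 0.5 / np.linalg.norm(L, 2)  # rescale so the spectral norm is 0.5 < 1
B = neumann_inverse(L)
```

Multiplying `B` by `np.eye(4) + L` on either side recovers the identity up to floating point error, mirroring the two equalities in the claim.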

## Appendix F Subexponential

###### Claim F.1.

Let be two discrete real-valued random variable with probability density . Then the real valued random variable with probability density is subexponential.

###### Proof.

Let , then

 es=pAB(a,b)pA(a)pB(b)≤p