An Information-Theoretic View for Deep Learning

Abstract

Deep learning has transformed computer vision, natural language processing, and speech recognition. However, the following two critical questions remain obscure: (1) why do deep neural networks generalize better than shallow networks? (2) Does it always hold that a deeper network leads to better performance? Specifically, letting L be the number of convolutional and pooling layers in a deep neural network and n be the size of the training sample, we derive an upper bound on the expected generalization error for this network, i.e.,

 E[R(W) - R_S(W)] \le \exp\left(-\frac{L}{2}\log\frac{1}{\eta}\right)\sqrt{\frac{2\sigma^2}{n} I(S; W)} ,

where σ is a constant depending on the loss function, 0 < η < 1 is a constant depending on the information loss for each convolutional or pooling layer, and I(S;W) is the mutual information between the training sample S and the output hypothesis W. This upper bound reveals: (1) as the network increases its number L of convolutional and pooling layers, the expected generalization error will decrease exponentially to zero. Layers with strict information loss, such as the convolutional layers, reduce the generalization error of deep learning algorithms. This answers the first question. However, (2) an algorithm with zero expected generalization error does not imply a small test error R(W) or E[R(W)]. This is because E[R_S(W)] will be large when the information for fitting the data is lost as the number of layers increases. This suggests that the claim "the deeper the better" is conditioned on a small training error R_S(W) or E[R_S(W)].

1 Introduction

We study the standard framework of statistical learning, where the instance space is denoted by 𝒵 and the hypothesis space is denoted by 𝒲. We denote the training sample by an n-tuple S = (Z_1, …, Z_n), where each element Z_i is drawn i.i.d. from an unknown distribution D. A learning algorithm can be seen as a randomized mapping from the training sample space 𝒵^n to the hypothesis space 𝒲. We characterize the learning algorithm by a Markov kernel P_{W|S}, which means that given the training sample S, the algorithm picks a hypothesis W in 𝒲 according to the conditional distribution P_{W|S}.

We introduce a loss function ℓ : 𝒲 × 𝒵 → ℝ to measure the quality of a prediction w.r.t. a hypothesis. For any hypothesis W learned by the algorithm, we define the expected risk

 R(W) = E_{Z \sim D}[\ell(W, Z)] ,  (1)

and the empirical risk

 R_S(W) = \frac{1}{n}\sum_{i=1}^n \ell(W, Z_i) .  (2)

For a learning algorithm P_{W|S}, the generalization error is defined as

 G_S(D, P_{W|S}) = R(W) - R_S(W) .  (3)

A small generalization error implies that the learned hypothesis will have similar performance on the training and test datasets.
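As a concrete toy instance of definitions (1)-(3), the following sketch estimates the empirical risk R_S(W) on a training sample and the expected risk R(W) by Monte Carlo. The linear model, squared loss, and data distribution are illustrative assumptions, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data distribution D: Z = (x, y) with y = 2x + small Gaussian noise.
def sample(n):
    x = rng.standard_normal(n)
    y = 2.0 * x + 0.1 * rng.standard_normal(n)
    return np.stack([x, y], axis=1)

def risk(w, Z):
    """Average squared loss of slope w over a sample Z of (x, y) pairs."""
    return float(np.mean((Z[:, 1] - w * Z[:, 0]) ** 2))

w = 1.9                                  # a fixed (slightly wrong) hypothesis
S = sample(200)                          # training sample of size n = 200
empirical_risk = risk(w, S)              # R_S(w)
expected_risk = risk(w, sample(10**6))   # Monte Carlo estimate of R(w)

print(expected_risk - empirical_risk)    # realized generalization error
```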

In this paper, we study the following expected generalization error for deep learning algorithms

 G(D, P_{W|S}) = E[R(W) - R_S(W)] ,  (4)

where the expectation is taken over the joint distribution P_{S,W}.

We have the following decomposition

 E[R(W)] = G(D, P_{W|S}) + E[R_S(W)] ,  (5)

where the first term on the right-hand side is the expected generalization error and the second term reflects how well the learned hypothesis fits the training data in expectation.

When designing a learning algorithm, we want the expectation of the expected risk, i.e., E[R(W)], to be as small as possible. However, it is not easy to achieve small values for the expected generalization error G(D, P_{W|S}) and the expected empirical risk E[R_S(W)] at the same time. Usually, if a model fits the training data too well, it may generalize poorly on the test data. This is known as the bias-variance trade-off problem (Domingos 2000). Surprisingly, deep learning algorithms have empirically shown their power in minimizing G(D, P_{W|S}) and E[R_S(W)] at the same time. They achieve small E[R_S(W)] because neural networks with deep architectures can compactly represent highly-varying functions (Sonoda and Murata 2015). However, the theoretical justification for their small expected generalization errors remains elusive.

In this paper, we study the expected generalization error for deep learning algorithms from an information-theoretic point of view. We will show that as the number of layers grows, the expected generalization error will decrease exponentially to zero. Specifically, in Theorem 2, we prove that

 G(D, P_{W|S}) = E[R(W) - R_S(W)] \le \exp\left(-\frac{L}{2}\log\frac{1}{\eta}\right)\sqrt{\frac{2\sigma^2}{n} I(S; W)} ,

where L is the number of information loss layers in the deep neural network, 0 < η < 1 is a constant depending on the information loss of each layer, σ is a constant depending on the loss function, n is the size of the training sample S, and I(S;W) is the mutual information between the input training sample S and the output hypothesis W.

Our conclusion is based on two important results related to information theory. The first is the strong data processing inequality (SDPI) proposed by Ahlswede and Gács (1976), which states that for a Markov chain U → V → W, if there is information loss in the channel from V to W, then I(U;W) ≤ η I(U;V), where 0 ≤ η < 1 is a nonnegative information loss factor. The other result is from the line of work of Russo and Zou (2015) and Xu and Raginsky (2017), which states that the mutual information between the input and output of a learning algorithm controls its generalization error.

Our result does not conflict with the bias-variance trade-off. Although the expected generalization error decreases exponentially to zero as the number L of information loss layers increases, the expected empirical risk will increase, because the information loss is harmful to the fitting of the training data. This implies that, when designing deep learning algorithms, more effort should be placed on balancing the information loss and the training error.

The advantage of using the mutual information between the input and output to bound the expected generalization error is that it depends on almost every aspect of the learning algorithm, including the data distribution, the complexity of the hypothesis class, and the properties of the learning algorithm itself, while traditional frameworks for proving PAC-learnability (Mohri et al. 2012) may focus on only some of these aspects; examples include the VC dimension (Vapnik 2013), covering number (Zhang 2002), Rademacher complexity (Bartlett and Mendelson 2002, Bartlett et al. 2005, Liu et al. 2017), PAC-Bayes (Langford and Shawe-Taylor 2003), algorithmic stability (Liu et al. 2017, Bousquet and Elisseeff 2002), and robustness (Xu and Mannor 2012) based frameworks.

The rest of this paper is organized as follows: in Section 2, we relate DNNs to a Markov chain; Section 3 exploits the strong data processing inequality to derive how the mutual information between intermediate feature representations and the output varies in deep neural networks; our main results are given in Section 4, which gives the exponential generalization error bound for DNNs in terms of the depth L; we then give the proof of our main theorem in Section 5; finally, we conclude our paper and point out some implications in Section 6.

2 The Hierarchical Feature Mapping of DNNs and Its Relations to Markov Chain

Let’s introduce some notation for deep neural networks (DNNs). As shown in figure 1, a DNN with L hidden layers can be seen as L feature maps that sequentially conduct L feature transformations on the input. After the L feature transformations, the learned feature will be the input of a classifier (or regressor) h at the output layer. If the distribution of a single input is D, then we denote the distribution of the feature after going through the k-th hidden layer by D_k, and the corresponding variable by Z^k, where k = 1, …, L; the weight of the whole network is denoted by W = (w_1, …, w_L) ∈ 𝒲, where 𝒲 is the space of all possible weights. As shown in figure 2, the training sample S is transformed layer by layer, and the output of the k-th hidden layer is T_k, where k = 1, …, L. We also denote the i-th sample after going through the k-th hidden layer by Z_i^k. In words, we have the following relationships:

 Z \sim D  (6)
 Z^k \sim D_k, \quad k = 1, \ldots, L  (7)
 S = \{Z_1, \ldots, Z_n\} \sim D^n  (8)
 T_k = \{Z_1^k, \ldots, Z_n^k\} \sim D_k^n, \quad k = 1, \ldots, L.  (9)

We now have a Markov model for DNNs, as shown in figure 2. From the Markov property, we know that if X → Y → Z forms a Markov chain, then X is conditionally independent of Z given Y. Furthermore, from the data processing inequality (Cover and Thomas 2012), we have I(X;Z) ≤ I(X;Y), and the equality holds if and only if X → Z → Y also forms a Markov chain. Applying the data processing inequality to the Markov chain, we have

 I(T_L; h) \le I(T_{L-1}; h) \le I(T_{L-2}; h) \le \ldots \le I(S; h) \le I(S; W) .  (10)

This means that the mutual information between input and output is non-increasing as it goes through the network layer by layer. As the feature map in each layer is likely to be non-invertible, the mutual information between the input and output is likely to be strictly decreasing when it goes through each layer. This encourages the study of the strong data processing inequality (Polyanskiy and Wu 2015). In the next section, we will prove that the strong data processing inequality holds for DNNs in general.
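The data processing inequality can be checked numerically for discrete chains. The following sketch (illustrative Python, assuming NumPy; the helper `mutual_information` is ours) builds a random Markov chain X → Y → Z from row-stochastic kernels and verifies I(X;Z) ≤ I(X;Y):

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) in nats from a joint probability matrix p_xy."""
    px = p_xy.sum(axis=1, keepdims=True)
    py = p_xy.sum(axis=0, keepdims=True)
    m = p_xy > 0
    return float((p_xy[m] * np.log(p_xy[m] / (px @ py)[m])).sum())

rng = np.random.default_rng(0)

# A random Markov chain X -> Y -> Z on 4-letter alphabets: p_x is the
# input law; K1 = P(Y|X) and K2 = P(Z|Y) are row-stochastic kernels.
p_x = rng.dirichlet(np.ones(4))
K1 = rng.dirichlet(np.ones(4), size=4)
K2 = rng.dirichlet(np.ones(4), size=4)

p_xy = p_x[:, None] * K1   # joint law of (X, Y)
p_xz = p_xy @ K2           # joint law of (X, Z): sum_y p(x, y) P(z|y)

I_xy = mutual_information(p_xy)
I_xz = mutual_information(p_xz)
print(I_xz <= I_xy)  # data processing inequality; strict for a noisy K2
```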

3 Information Loss in DNNs

In the previous section, we have modeled the DNN as a Markov chain and, by using the data processing inequality, we conclude that the mutual information between input and output in DNNs is non-increasing. The equalities in equation (10) will not hold for most cases and therefore we can apply the strong data processing inequality to achieve tighter inequalities.

For a Markov chain U → V → W, the random transformation P_{W|V} can be seen as a channel from an information-theoretic point of view. Strong data processing inequalities (SDPIs) quantify the intuitive observation that the noise inside the channel reduces the mutual information between U and W. That is, there exists η < 1 such that

 I(U; W) \le \eta\, I(U; V) .  (11)

Formally,

Theorem 1 (Ahlswede and Gács 1976).

Consider a Markov chain W → X → Y and the corresponding random mapping P_{Y|X}. If the mapping is not noiseless, that is, we cannot recover any X perfectly with probability 1 from the observed random variable Y, then there exists η < 1 such that

 I(W; Y) \le \eta\, I(W; X) .  (12)

More details can be found in the comprehensive survey about SDPIs (Polyanskiy and Wu 2015).
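For some concrete channels the information loss factor is known in closed form; e.g., for a binary symmetric channel with crossover probability δ, the KL contraction coefficient is (1 − 2δ)², a standard result covered in the Polyanskiy and Wu survey. A numerical sanity check (illustrative Python; the helper `mi` is ours) over random input joints:

```python
import numpy as np

def mi(p_ab):
    """I(A;B) in nats from a joint probability matrix."""
    pa = p_ab.sum(axis=1, keepdims=True)
    pb = p_ab.sum(axis=0, keepdims=True)
    m = p_ab > 0
    return float((p_ab[m] * np.log(p_ab[m] / (pa @ pb)[m])).sum())

delta = 0.1                        # BSC crossover probability
eta = (1.0 - 2.0 * delta) ** 2     # known KL contraction coefficient of BSC
bsc = np.array([[1 - delta, delta],
                [delta, 1 - delta]])   # channel P(W|V)

rng = np.random.default_rng(1)
for _ in range(100):
    p_uv = rng.dirichlet(np.ones(4)).reshape(2, 2)  # random joint of (U, V)
    p_uw = p_uv @ bsc                               # joint of (U, W)
    assert mi(p_uw) <= eta * mi(p_uv) + 1e-12
print("I(U;W) <= eta * I(U;V) held in all trials; eta =", eta)
```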

Let’s consider the k-th hidden layer (1 ≤ k ≤ L) in figure 1. It can be seen as a randomized transformation P_{T_k|T_{k-1}} that maps the distribution D_{k-1} to the distribution D_k (when k = 1, we denote T_0 = S and D_0 = D). We then denote the parameters of the k-th hidden layer by w_k. Without loss of generality, let w_k be a matrix of dimension d_k × d_{k-1}. Also, we denote the activation function in this layer by σ_k.

We give the definition of the contraction layer.

Definition 1 (Contraction Layer).

A layer in deep neural networks is called a contraction layer if it causes information loss.

We now give the first result of this paper, which quantifies the information loss in DNNs.

Corollary 1 (Information Loss in DNNs).

Consider a DNN as shown in figure 1 and its corresponding Markov model in figure 2. If its k-th (1 ≤ k ≤ L) hidden layer is a contraction layer, then there exists η_k < 1 such that

 I(T_k; h) \le \eta_k\, I(T_{k-1}; h) .  (13)

We show that the commonly used convolutional and pooling layers are contraction layers.

Lemma 1.

For any layer in a DNN with parameters w_k ∈ ℝ^{d_k × d_{k-1}}, if rank(w_k) < d_{k-1}, it is a contraction layer.

Proof.

For the k-th hidden layer, consider any input x^{k-1} and the corresponding output x^k of the k-th hidden layer; we have

 x^k = \sigma_k(w_k x^{k-1}) .  (14)

As rank(w_k) < d_{k-1}, the dimension of the right null space of w_k is greater than or equal to d_{k-1} − rank(w_k) ≥ 1. We denote the right null space of w_k by N(w_k); then we can pick a nonzero vector α ∈ N(w_k) such that w_k α = 0.

Then, we have

 \sigma_k(w_k(x^{k-1} + \alpha)) = \sigma_k(w_k x^{k-1}) = x^k .  (15)

Therefore, for any input x^{k-1} of the k-th layer, there exists a distinct input x^{k-1} + α such that their corresponding outputs are the same, which means that for any x^{k-1}, we cannot recover it perfectly with probability 1 from the output.

We conclude that the mapping is noisy and the corresponding layer will cause information loss. ∎

Theorem 1 shows that the mutual information will decrease after it goes through a contraction layer. From Lemma 1, we know that the convolutional and pooling layers are often guaranteed to be contraction layers. For a fully connected layer, the contraction property may not hold, as the weight matrix may sometimes be of full rank, which leads to a noiseless and invertible mapping. However, the activation function applied subsequently can contribute to forming a contraction layer. Without loss of generality, we let all hidden layers be contraction layers, e.g., convolutional or pooling layers.
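The argument of Lemma 1 can be checked numerically: a rank-deficient weight matrix maps distinct inputs to the same output. A small sketch (illustrative Python; the ReLU activation and matrix sizes are assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# A layer with more inputs than outputs: w_k is 2 x 3, so
# rank(w_k) <= 2 < 3 and its right null space is nontrivial.
w_k = rng.standard_normal((2, 3))

def relu(v):                      # the activation sigma_k (an assumption)
    return np.maximum(v, 0.0)

# The last right singular vector of w_k lies in its null space.
_, _, vt = np.linalg.svd(w_k)
alpha = vt[-1]                    # w_k @ alpha == 0 (up to rounding)

x = rng.standard_normal(3)
out1 = relu(w_k @ x)
out2 = relu(w_k @ (x + alpha))    # a distinct input with identical output

print(np.allclose(out1, out2))    # the layer mapping is not invertible
```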

4 Exponential Bounds on the Generalization Error of DNNs

Before we introduce our main theorem, we need to restrict the loss function ℓ(w, Z) to be σ-sub-Gaussian with respect to Z for any w ∈ 𝒲.

Definition 2 (σ-sub-Gaussian).

A random variable X is said to be σ-sub-Gaussian if the following inequality holds for any λ ∈ ℝ:

 E[\exp(\lambda(X - E[X]))] \le \exp\left(\frac{\sigma^2\lambda^2}{2}\right) .  (16)
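A classical source of sub-Gaussian variables is Hoeffding's lemma: any random variable bounded in [a, b] is (b − a)/2-sub-Gaussian. The following sketch (illustrative Python) checks the definition numerically for a Bernoulli variable, whose centered moment generating function is available in closed form:

```python
import numpy as np

# Hoeffding's lemma: a variable bounded in [0, 1] is 1/2-sub-Gaussian.
# Check the sub-Gaussian definition for X ~ Bernoulli(p) on a lambda grid.
p, sigma = 0.3, 0.5
lambdas = np.linspace(-10, 10, 2001)

# E[exp(lambda * (X - E[X]))] in closed form for Bernoulli(p)
mgf_centered = (1 - p) * np.exp(-lambdas * p) + p * np.exp(lambdas * (1 - p))
bound = np.exp(sigma**2 * lambdas**2 / 2)

print(np.all(mgf_centered <= bound + 1e-12))
```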

We now present our main theorem, which gives an exponential bound for the expected generalization error of deep learning algorithms.

Theorem 2.

Consider a DNN with L hidden layers, input S, and parameters W = (w_1, …, w_L). Assume that the loss function ℓ(w, Z) is σ-sub-Gaussian with respect to Z for any w ∈ 𝒲. Without loss of generality, let all hidden layers be contraction layers (e.g., convolutional or pooling layers). Then the expected generalization error of the corresponding deep learning algorithm can be upper bounded as follows:

 E[R(W) - R_S(W)] \le \exp\left(-\frac{L}{2}\log\frac{1}{\eta}\right)\sqrt{\frac{2\sigma^2}{n} I(S; W)} ,  (17)

where η < 1 is the maximum information loss factor over all L contraction layers, that is,

 \eta = \max_{i \in \{1, \ldots, L\}} \eta_i .  (18)

The upper bound in Theorem 2 is loose w.r.t. the mutual information, since we used the inequality I(S; h) ≤ I(S; W) in its derivation. However, we have

 I(S; W) \le \min\{H(S), H(W)\} \le H(S) ,  (19)

which implies that the mutual information term stays bounded; therefore, as the number L of contraction layers increases, the expected generalization error decreases exponentially to zero. We give the detailed proof of the main theorem in the next section.
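To get a feel for the bound, the following sketch evaluates the right-hand side of (17) as the depth L grows. The values of σ, η, n, and the mutual information I(S;W) are purely illustrative, not quantities measured from any network:

```python
import numpy as np

# Illustrative constants (assumptions, not measurements):
sigma, eta, n, mi_sw = 1.0, 0.9, 10_000, 50.0

def bound(L):
    """Right-hand side of (17) as a function of the number of layers L."""
    return np.exp(-L / 2 * np.log(1 / eta)) * np.sqrt(2 * sigma**2 * mi_sw / n)

for L in (1, 5, 10, 50):
    print(L, bound(L))

# The depth-dependent factor is just eta**(L/2): exponential decay in L.
assert np.isclose(bound(10), eta**5 * np.sqrt(2 * mi_sw / n))
```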

Theorem 2 implies that deeper neural networks lead to smaller expected generalization error. However, it does not mean that the deeper the better. Recall that E[R(W)] = G(D, P_{W|S}) + E[R_S(W)]. A small G(D, P_{W|S}) does not imply a small E[R(W)], because the expected empirical risk E[R_S(W)] will increase due to information loss. Specifically, if the information that captures the relationship between the observation and the target is lost, fitting the training data will become difficult and the empirical training error will increase. Our results point out a research direction for designing deep learning algorithms: we should increase the number of contraction layers while keeping the empirical training error small.

The information loss factor η plays an essential role in the generalization of deep learning algorithms. A successful deep learning algorithm should filter out as much redundant information as possible while keeping sufficient information to fit the training data. Some deep learning operations, such as convolution, pooling, and activation, serve very well at filtering redundant information. This further supports the information bottleneck theory (Shwartz-Ziv and Tishby 2017): with more contraction layers, more redundant information is removed while the prediction information is preserved.

5 Proof of Theorem 2

First, by the law of total expectation, we have,

 E[R(W) - R_S(W)] = E\big[E[R(W) - R_S(W) \mid w_1, \ldots, w_L]\big] .  (20)

We now give an upper bound on E[R(W) − R_S(W) | w_1, …, w_L], following the approach of Russo and Zou (2015) and Xu and Raginsky (2017).

Lemma 2.

Under the same conditions as in Theorem 2, E[R(W) − R_S(W) | w_1, …, w_L] can be upper bounded as

 E[R(W) - R_S(W) \mid w_1, \ldots, w_L] \le \sqrt{\frac{2\sigma^2}{n} I(T_L; h)} .  (21)
Proof.

We have,

 E[R(W) - R_S(W) \mid w_1, \ldots, w_L]
 = E_{h,S}\left[E_{Z \sim D}[\ell(W, Z)] - \frac{1}{n}\sum_{i=1}^n \ell(W, Z_i) \,\Big|\, w_1, \ldots, w_L\right]
 = E_{h,T_L}\left[E_{Z^L \sim D_L}[\ell(h, Z^L)] - \frac{1}{n}\sum_{i=1}^n \ell(h, Z_i^L)\right] .  (22)

We are now going to upper bound

 E_{h,T_L}\left[E_{Z^L \sim D_L}[\ell(h, Z^L)] - \frac{1}{n}\sum_{i=1}^n \ell(h, Z_i^L)\right] .  (23)

Note that, given w_1, …, w_L, the loss ℓ(W, Z) equals ℓ(h, Z^L) because of the Markov property. We adopt the classical idea of a ghost sample from statistical learning theory. That is, we sample another n-tuple T'_L:

 T'_L = \{Z_1^{L'}, \ldots, Z_n^{L'}\} ,  (24)

where each element is drawn i.i.d. from the distribution D_L. We now have,

 E_{h,T_L}\left[E_{Z^L \sim D_L}[\ell(h, Z^L)] - \frac{1}{n}\sum_{i=1}^n \ell(h, Z_i^L)\right]
 = E_{h,T_L}\left[E_{T'_L}\left[\frac{1}{n}\sum_{i=1}^n \ell(h, Z_i^{L'})\right] - \frac{1}{n}\sum_{i=1}^n \ell(h, Z_i^L)\right]
 = E_{h,T_L,T'_L}\left[\frac{1}{n}\sum_{i=1}^n \ell(h, Z_i^{L'})\right] - E_{h,T_L}\left[\frac{1}{n}\sum_{i=1}^n \ell(h, Z_i^L)\right] .  (25)

We know that the classifier h at the output layer follows the distribution P_h. We denote the joint distribution of h and T_L by P_{h,T_L}, and the marginal distributions of h and T_L by P_h and P_{T_L}, respectively. Therefore, we have,

 E_{h,T_L,T'_L}\left[\frac{1}{n}\sum_{i=1}^n \ell(h, Z_i^{L'})\right] - E_{h,T_L}\left[\frac{1}{n}\sum_{i=1}^n \ell(h, Z_i^L)\right]
 = E_{h' \sim P_h,\, T'_L \sim P_{T_L}}\left[\frac{1}{n}\sum_{i=1}^n \ell(h', Z_i^{L'})\right] - E_{(h,T_L) \sim P_{h,T_L}}\left[\frac{1}{n}\sum_{i=1}^n \ell(h, Z_i^L)\right] .  (26)

We now bound the above term by the mutual information by employing the following lemma.

Lemma 3 (Donsker and Varadhan 1983).

Let P and Q be two probability distributions on the same measurable space. Then the KL-divergence between P and Q can be represented as

 D(P\|Q) = \sup_F \left[ E_P[F] - \log E_Q[e^F] \right] ,  (27)

where the supremum is taken over all measurable functions F such that E_Q[e^F] < ∞.

Using lemma 3, we have,

 I(T_L; h) = D(P_{h,T_L} \,\|\, P_h \times P_{T_L})
 \ge E_{(h,T_L) \sim P_{h,T_L}}\left[\frac{\lambda}{n}\sum_{i=1}^n \ell(h, Z_i^L)\right] - \log E_{h' \sim P_h,\, T'_L \sim P_{T_L}}\left[e^{\frac{\lambda}{n}\sum_{i=1}^n \ell(h', Z_i^{L'})}\right] .  (28)

As the loss function ℓ(h, Z^L) is σ-sub-Gaussian w.r.t. Z^L for any h, and the Z_i^{L'} are i.i.d. for i = 1, …, n, the average (1/n)∑_{i=1}^n ℓ(h', Z_i^{L'}) is σ/√n-sub-Gaussian. By definition, we have,

 \log E_{h' \sim P_h,\, T'_L \sim P_{T_L}}\left[e^{\frac{\lambda}{n}\sum_{i=1}^n \ell(h', Z_i^{L'})}\right] \le \frac{\sigma^2\lambda^2}{2n} + E_{h' \sim P_h,\, T'_L \sim P_{T_L}}\left[\frac{\lambda}{n}\sum_{i=1}^n \ell(h', Z_i^{L'})\right] .  (29)

Substituting inequality (29) into inequality (28), we have,

 E_{(h,T_L) \sim P_{h,T_L}}\left[\frac{\lambda}{n}\sum_{i=1}^n \ell(h, Z_i^L)\right] - \frac{\sigma^2\lambda^2}{2n} - E_{h' \sim P_h,\, T'_L \sim P_{T_L}}\left[\frac{\lambda}{n}\sum_{i=1}^n \ell(h', Z_i^{L'})\right] - I(T_L; h)
 = -\frac{\sigma^2\lambda^2}{2n} - I(T_L; h) - \lambda\left[E_{h' \sim P_h,\, T'_L \sim P_{T_L}}\left[\frac{1}{n}\sum_{i=1}^n \ell(h', Z_i^{L'})\right] - E_{(h,T_L) \sim P_{h,T_L}}\left[\frac{1}{n}\sum_{i=1}^n \ell(h, Z_i^L)\right]\right]
 \le 0 .  (30)

The left-hand side of the above inequality is a quadratic function of λ that is nonpositive for every λ ∈ ℝ; hence its discriminant must be nonpositive. Therefore we have,

 \left| E[R(W) - R_S(W) \mid w_1, \ldots, w_L] \right|^2 \le \frac{2\sigma^2}{n} I(T_L; h) ,  (31)

which completes the proof.
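The discriminant step at the end of the proof can be spelled out as follows, where G denotes the bracketed gap from equation (26), i.e., the conditional expected generalization error:

```latex
% Inequality (30) states that, for every real \lambda,
\frac{\sigma^2}{2n}\lambda^2 + G\lambda + I(T_L;h) \;\ge\; 0 .
% A quadratic that is nonnegative for all \lambda must have a
% nonpositive discriminant, hence
G^2 - 4\cdot\frac{\sigma^2}{2n}\cdot I(T_L;h) \;\le\; 0
\quad\Longleftrightarrow\quad
G^2 \;\le\; \frac{2\sigma^2}{n}\, I(T_L;h) ,
% which is exactly inequality (31).
```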

By Theorem 1, we can apply the strong data processing inequality recursively to the Markov chain in figure 2. Thus, we have,

 \sqrt{\frac{2\sigma^2}{n} I(T_L; h)} \le \sqrt{\frac{2\sigma^2}{n} \eta_L I(T_{L-1}; h)} \le \sqrt{\frac{2\sigma^2}{n} \eta_L \eta_{L-1} I(T_{L-2}; h)}
 \le \ldots \le \sqrt{\frac{2\sigma^2}{n} \left(\prod_{k=1}^L \eta_k\right) I(S; h)} \le \sqrt{\frac{2\sigma^2}{n} \eta^L I(S; h)}
 = \exp\left(-\frac{L}{2}\log\frac{1}{\eta}\right)\sqrt{\frac{2\sigma^2}{n} I(S; h)} ,  (32)

where

 \eta = \max_{i \in \{1, \ldots, L\}} \eta_i < 1 .  (33)

Therefore, we have

 E[R(W) - R_S(W)] = E\big[E[R(W) - R_S(W) \mid w_1, \ldots, w_L]\big]
 \le E\left[\exp\left(-\frac{L}{2}\log\frac{1}{\eta}\right)\sqrt{\frac{2\sigma^2}{n} I(S; h)}\right]
 = \exp\left(-\frac{L}{2}\log\frac{1}{\eta}\right)\sqrt{\frac{2\sigma^2}{n} I(S; h)}
 \le \exp\left(-\frac{L}{2}\log\frac{1}{\eta}\right)\sqrt{\frac{2\sigma^2}{n} I(S; W)} ,  (34)

which completes the proof of Theorem 2.

6 Conclusions

In this paper, we obtain an exponential bound for the expected generalization error of deep learning algorithms. Our results have valuable implications for other critical problems in deep learning and deserve further investigation. (1) Traditional statistical learning theory can validate the success of deep neural networks, because the mutual information decreases with an increasing number of layers, and smaller mutual information implies higher algorithmic stability (Raginsky et al. 2016) and smaller complexity of the algorithmic hypothesis class (Liu et al. 2017). (2) The information loss factor η has the potential to characterize various pooling and activation functions as well as deep learning tricks (how they contribute to the reduction of the expected generalization error).


References

1. Ahlswede, R. and Gács, P. (1976). Spreading of sets in product spaces and hypercontraction of the Markov operator. The Annals of Probability, pages 925–939.
2. Bartlett, P. L., Bousquet, O., Mendelson, S., et al. (2005). Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537.
3. Bartlett, P. L. and Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482.
4. Bousquet, O. and Elisseeff, A. (2002). Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526.
5. Cover, T. M. and Thomas, J. A. (2012). Elements of information theory. John Wiley & Sons.
6. Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of the 17th International Conference on Machine Learning, pages 231–238.
7. Donsker, M. D. and Varadhan, S. S. (1983). Asymptotic evaluation of certain Markov process expectations for large time. IV. Communications on Pure and Applied Mathematics, 36(2):183–212.
8. Langford, J. and Shawe-Taylor, J. (2003). PAC-Bayes & margins. In Advances in Neural Information Processing Systems, pages 439–446.
9. Liu, T., Lugosi, G., Neu, G., and Tao, D. (2017). Algorithmic stability and hypothesis complexity. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2159–2167. PMLR.
10. Mohri, M., Rostamizadeh, A., and Talwalkar, A. (2012). Foundations of machine learning. MIT press.
11. Polyanskiy, Y. and Wu, Y. (2015). Strong data-processing inequalities for channels and Bayesian networks. ArXiv e-prints.
12. Raginsky, M., Rakhlin, A., Tsao, M., Wu, Y., and Xu, A. (2016). Information-theoretic analysis of stability and bias of learning algorithms. In Information Theory Workshop (ITW), 2016 IEEE, pages 26–30. IEEE.
13. Russo, D. and Zou, J. (2015). How much does your data exploration overfit? Controlling bias via information usage. ArXiv e-prints.
14. Shwartz-Ziv, R. and Tishby, N. (2017). Opening the Black Box of Deep Neural Networks via Information. ArXiv e-prints.
15. Sonoda, S. and Murata, N. (2015). Neural network with unbounded activation functions is universal approximator. arXiv preprint arXiv:1505.03654.
16. Vapnik, V. (2013). The nature of statistical learning theory. Springer science & business media.
17. Xu, A. and Raginsky, M. (2017). Information-theoretic analysis of generalization capability of learning algorithms. In Advances in Neural Information Processing Systems 30, pages 2524–2533. Curran Associates, Inc.
18. Xu, H. and Mannor, S. (2012). Robustness and generalization. Machine learning, 86(3):391–423.
19. Zhang, T. (2002). Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2(Mar):527–550.