Flow of Information in Feed-Forward Deep Neural Networks
Abstract
Feedforward deep neural networks have been used extensively in various machine learning applications. Developing a precise understanding of the underlying behavior of neural networks is crucial for their efficient deployment. In this paper, we use an information theoretic approach to study the flow of information in a neural network and to determine how the entropy of information changes between consecutive layers. Moreover, using the Information Bottleneck principle, we develop a constrained optimization problem that can be used in the training process of a deep neural network. Furthermore, we determine a lower bound for the level of data representation that can be achieved in a deep neural network with an acceptable level of distortion.
I Introduction
With the increasing demand for data analytics, Big Data, and artificial intelligence, efficient machine learning algorithms are required now more than ever before [1]. Deep learning and deep neural networks (DNNs) have been shown to be among the most efficient machine learning paradigms, specifically for supervised learning tasks. Due to their strong performance, different deep learning structures have been deployed in various applications in the past decades [1, 2, 3]. However, despite their great performance, more theoretical effort is required to understand the dynamic behavior of DNNs from both the learning and the design perspectives.
Deep neural networks are considered as multilayer structures, constructed by simple processing units known as neurons that process the input information to generate a desired output [4, 5]. These structures have been used previously in a variety of applications, such as dimensionality reduction [6], face representation [2], robotic grasps detection [3], and object detection [7].
While DNNs have shown their capability in solving machine learning problems, they have traditionally been deployed in a heuristic manner [8, 9]. However, to be able to use these structures more efficiently, we need a deeper understanding of their underlying dynamic behavior [9].
Mehta and Schwab showed in [10] that deep learning is related to renormalization groups in theoretical physics and provided a mapping between deep learning methods and variational renormalization groups using Restricted Boltzmann Machines. In [8], Tishby and Zaslavsky proposed a theoretical framework, based on the principle of Information Bottleneck [11], to analyze DNNs, in which the ultimate goal of deep learning is formulated as a tradeoff between compression and prediction. In [8], the authors claim that an optimal point exists on the compression-distortion plane that can efficiently address that tradeoff. Moreover, they suggest that an Information Bottleneck based learning algorithm may achieve the optimal information representation.
In this paper, we analyze the flow of information in a deep neural network using an information theoretic approach. While different structures have been developed for DNNs, in this paper, we consider the multilayer feedforward structure and we assume that the network is used in a supervised setting. We determine an upper bound on the total compression rate in a neural network that can be achieved with an acceptable level of distortion. Furthermore, using the fundamental concepts of Information Bottleneck and based on the approach of Tishby and Zaslavsky in [8], we develop an optimization problem that can be used in the learning process of DNNs. A case study supports the justifications of the paper. Thus, our contributions and the structure of the paper are as follows:

In Section II, we focus on the information flow across any two consecutive layers of a DNN by characterizing the relative change in the entropy across layers and also developing some properties of the same.

In Section III, motivated by the Information Bottleneck principle, we define an optimization problem for training a DNN, in which the goal is to minimize the overall log-loss distortion. Moreover, we prove an upper bound on the total data compression which is achievable in a DNN with an acceptable level of distortion.

In Section IV we modify the optimization problem of Section III to address the practical limitations of neural computation. We first illustrate how the results of the original optimization model may be unfeasible and then, by adding sufficient constraints to the model, we propose a modified optimization problem that can be used in the training process of a DNN.
II Flow of Entropy in Deep Neural Networks
The typical structure of a feedforward DNN is illustrated in Fig. 1. In this figure, the input layer is represented by $X$ and the representation of data in the $l$-th hidden layer is shown by $X_l$. We assume that the network has $L$ layers and the output layer, i.e. $X_L$, should estimate a desired output, $Y$. Each layer is constructed by multiple neurons that process the information in parallel and the output of the $i$-th neuron in layer $l$ is calculated as follows:
(1) $x_l^i = f_l\left(\sum_{j=1}^{n_{l-1}} w_l^{ij}\, x_{l-1}^j + b_l^i\right)$
where $n_{l-1}$ is the number of neurons in layer $l-1$ and $w_l^{ij}$ is a weight that connects the output of the $j$-th neuron in layer $l-1$ to the input of the $i$-th neuron in layer $l$. Also, $b_l^i$ is the bias of the $i$-th neuron in layer $l$ and $f_l$ is the output function of the neurons in this layer. To illustrate (1) in vector notation we say:
(2) $\mathbf{x}_l = f_l\left(W_l\, \mathbf{x}_{l-1} + \mathbf{b}_l\right)$
where $\mathbf{x}_l$ is a combination of neuron outputs in layer $l$, i.e. $\mathbf{x}_l = (x_l^1, \dots, x_l^{n_l})$. The total number of possible output combinations in the $l$-th layer is denoted by $N_l$, which depends on the output functions and the input space. As an example, with binary output functions, $N_l \le 2^{n_l}$. At the $l$-th layer, $\mathcal{X}_l = \{\mathbf{x}_l^1, \dots, \mathbf{x}_l^{N_l}\}$ is the set of all possible output combinations. It should be noted that a single neuron has a very limited processing capability. As a matter of fact, an individual neuron can only implement a hyperplane in its input space and hence, not all the mappings from its input to its output are feasible. This limitation is one of the main reasons that neural networks are constructed by multiple layers of interacting neurons.
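The layer computation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the step output function, the helper name `layer_output`, and the example weights are hypothetical and not taken from the paper.

```python
from itertools import product

def step(z):
    """Binary output function (an assumed choice): 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def layer_output(x_prev, weights, biases, f=step):
    """One layer: neuron i computes f(sum_j weights[i][j] * x_prev[j] + biases[i])."""
    return tuple(
        f(sum(w_ij * x_j for w_ij, x_j in zip(w_i, x_prev)) + b_i)
        for w_i, b_i in zip(weights, biases)
    )

# Hypothetical layer with two binary inputs and two neurons (an OR and an AND).
weights = [[1.0, 1.0], [1.0, 1.0]]
biases = [-0.5, -1.5]

inputs = list(product([0, 1], repeat=2))
outputs = {x: layer_output(x, weights, biases) for x in inputs}

# Output combinations that actually occur; there can be fewer than 2**2.
distinct = set(outputs.values())
print(outputs)
print(len(distinct))
```

In this example two input combinations collapse onto the same output combination, so the layer produces three distinct output combinations rather than four.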
The neural computation that occurs in each layer performs a mapping between the outputs of consecutive layers. In other words, various output combinations in layer $l-1$ are mapped to certain output combinations in layer $l$. Depending on the weights, bias, and the output function in layer $l$, there are two possibilities:

Each unique combination of neuron outputs in layer $l-1$ is uniquely mapped to a combination of neuron outputs in layer $l$. In other words: any two distinct combinations $\mathbf{x}_{l-1}^i \neq \mathbf{x}_{l-1}^j$ are mapped to distinct combinations in layer $l$.
In this case, regardless of the number of neurons in each layer, we have $N_l = N_{l-1}$, where $N_l$ is the number of possible output combinations in layer $l$.

Multiple (at least two) combinations of neuron outputs in layer $l-1$ are mapped to a single combination of neuron outputs in layer $l$. In other words: there exist $\mathbf{x}_{l-1}^i \neq \mathbf{x}_{l-1}^j$ that are mapped to the same combination in layer $l$,
and in this case, we have $N_l < N_{l-1}$.
It is worth mentioning that the mapping in each layer of a DNN is also a partitioning of the layer's input space and hence, what a DNN does is a sequence of consecutive partitioning and mapping steps with the goal of estimating a desired output.
Definition 1
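In the $l$-th layer of a feedforward DNN, the layer partition is the partitioning of the layer's input space which occurs due to the neural computations performed in layer $l$. We denote this partitioning by $\mathcal{A}_l = \{A_l^1, \dots, A_l^{N_l}\}$ where $A_l^k$ is the set of all the output combinations in layer $l-1$ that are mapped to the $k$-th output combination in layer $l$, i.e. $\mathbf{x}_l^k$. In other words, $A_l^k = \{\mathbf{x}_{l-1}^j \in \mathcal{X}_{l-1} : \mathbf{x}_{l-1}^j \text{ is mapped to } \mathbf{x}_l^k\}$.
Note that we have: $A_l^i \cap A_l^j = \emptyset$ for $i \neq j$, and $\bigcup_{k=1}^{N_l} A_l^k = \mathcal{X}_{l-1}$.
Let us assume that $P(\mathbf{x}_l^k)$ is the probability of $\mathbf{x}_l^k$. Then, it can be observed that
(3) $P(\mathbf{x}_l^k) = \sum_{\mathbf{x}_{l-1}^j \in A_l^k} P(\mathbf{x}_{l-1}^j)$
Furthermore, considering the fact that $\mathcal{A}_l$ is a partitioning of $\mathcal{X}_{l-1}$, one can easily show that $P(\mathbf{x}_l^k)$ is the probability of partition $A_l^k$. It can be shown that
(4) $P(\mathbf{x}_{l-1}^j \mid A_l^k) = \dfrac{P(\mathbf{x}_{l-1}^j)}{P(\mathbf{x}_l^k)}, \quad \mathbf{x}_{l-1}^j \in A_l^k$
is the probability distribution of all the combinations in $A_l^k$.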
Definition 2
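In a feedforward DNN, the entropy of layer $l$, $H(X_l)$, is the entropy of the neuron outputs at this layer and we have: $H(X_l) = -\sum_{k=1}^{N_l} P(\mathbf{x}_l^k) \log P(\mathbf{x}_l^k)$, where $\mathbf{x}_l^k$ is the $k$-th of the $N_l$ possible output combinations in layer $l$.
Definition 3
The entropy of partition $A_l^k$, $H(A_l^k)$, is the entropy of the output combinations of layer $l-1$ that belong to $A_l^k$, i.e. that are mapped to $\mathbf{x}_l^k$. In other words: $H(A_l^k) = -\sum_{\mathbf{x}_{l-1}^j \in A_l^k} \dfrac{P(\mathbf{x}_{l-1}^j)}{P(\mathbf{x}_l^k)} \log \dfrac{P(\mathbf{x}_{l-1}^j)}{P(\mathbf{x}_l^k)}$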
In the following lemma we show how entropy of information changes in a feedforward neural network.
Lemma 1
In a feedforward DNN, the entropy of information that flows from layer $l-1$ to layer $l$ is decreased by the expected entropy of the partitions on the possible output combinations in layer $l-1$. The amount of this reduction is denoted by $\Delta_l$ and we have
(5) $\Delta_l = H(X_{l-1}) - H(X_l) = \sum_{k=1}^{N_l} P(\mathbf{x}_l^k)\, H(A_l^k)$
Proof
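Let us assume that the information has been processed up to layer $l-1$. Using Definition 2, the entropy of layer $l-1$ is
(6) $H(X_{l-1}) = -\sum_{j=1}^{N_{l-1}} P(\mathbf{x}_{l-1}^j) \log P(\mathbf{x}_{l-1}^j)$
To determine $H(X_l)$, we can say
(7) $H(X_l) = H(X_{l-1}) + \sum_{k=1}^{N_l} \left[ \sum_{\mathbf{x}_{l-1}^j \in A_l^k} P(\mathbf{x}_{l-1}^j) \log P(\mathbf{x}_{l-1}^j) - P(\mathbf{x}_l^k) \log P(\mathbf{x}_l^k) \right]$
In other words, we started from the entropy of $X_{l-1}$ and substituted the individual terms of all the output combinations that belong to $A_l^k$ with their new equivalent term, i.e. $-P(\mathbf{x}_l^k) \log P(\mathbf{x}_l^k)$. On the other hand, we have:
$P(\mathbf{x}_l^k) = \sum_{\mathbf{x}_{l-1}^j \in A_l^k} P(\mathbf{x}_{l-1}^j)$
then, considering (6), we can rewrite (7) as follows:
(8) $H(X_l) = H(X_{l-1}) + \sum_{k=1}^{N_l} \sum_{\mathbf{x}_{l-1}^j \in A_l^k} P(\mathbf{x}_{l-1}^j) \left[ \log P(\mathbf{x}_{l-1}^j) - \log P(\mathbf{x}_l^k) \right]$
Equivalently, we have
(9) $H(X_l) = H(X_{l-1}) + \sum_{k=1}^{N_l} P(\mathbf{x}_l^k) \sum_{\mathbf{x}_{l-1}^j \in A_l^k} \dfrac{P(\mathbf{x}_{l-1}^j)}{P(\mathbf{x}_l^k)} \log \dfrac{P(\mathbf{x}_{l-1}^j)}{P(\mathbf{x}_l^k)}$
Then, based on Definition 3 we can say:
(10) $H(X_l) = H(X_{l-1}) - \sum_{k=1}^{N_l} P(\mathbf{x}_l^k)\, H(A_l^k)$
and from here, we can observe that $H(X_l) \le H(X_{l-1})$. Furthermore, the difference between the entropy in layers $l-1$ and $l$ is $\sum_{k=1}^{N_l} P(\mathbf{x}_l^k)\, H(A_l^k)$, which is the expected entropy of the partitions on $\mathcal{X}_{l-1}$. This proves the lemma.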
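The statement of Lemma 1 can be checked numerically on a small example. The sketch below is a minimal illustration: the helper name `lemma1_terms`, the labels of the combinations, and the example mapping are assumptions made here, and all probabilities are assumed to be strictly positive.

```python
from collections import defaultdict
from math import log2

def entropy(probs):
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

def lemma1_terms(p_prev, mapping):
    """p_prev: {combination in layer l-1: probability}.
    mapping: {combination in layer l-1: combination in layer l}.
    Returns (H(X_{l-1}), H(X_l), expected entropy of the partitions)."""
    cells = defaultdict(list)              # one partition cell per layer-l combination
    for x, p in p_prev.items():
        cells[mapping[x]].append(p)
    h_prev = entropy(p_prev.values())
    h_next = entropy(sum(ps) for ps in cells.values())
    expected = sum(
        sum(ps) * entropy(p / sum(ps) for p in ps) for ps in cells.values()
    )
    return h_prev, h_next, expected

# Four equiprobable combinations; two of them are merged by the layer.
p_prev = {"00": 0.25, "01": 0.25, "10": 0.25, "11": 0.25}
mapping = {"00": "a", "01": "b", "10": "b", "11": "c"}

h_prev, h_next, expected = lemma1_terms(p_prev, mapping)
print(h_prev, h_next, expected)
```

Here the entropy drops from 2 bits to 1.5 bits, and the difference of 0.5 bits equals the expected entropy of the partition cells, as the lemma states.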
The following lemma proves a similar result for the flow of conditional entropy in a deep neural network.
Lemma 2
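In a feedforward DNN, the conditional entropy of each layer, conditioned on the desired outputs, $Y$, i.e. $H(X_l \mid Y)$, is a nonincreasing function of $l$ and we have:
(11) $H(X_l \mid Y) = H(X_{l-1} \mid Y) - \Delta_l^Y$
where $\Delta_l^Y = \sum_{y} P(y) \sum_{k=1}^{N_l} P(\mathbf{x}_l^k \mid y)\, H(A_l^k \mid y)$
and $H(A_l^k \mid y) = -\sum_{\mathbf{x}_{l-1}^j \in A_l^k} \dfrac{P(\mathbf{x}_{l-1}^j \mid y)}{P(\mathbf{x}_l^k \mid y)} \log \dfrac{P(\mathbf{x}_{l-1}^j \mid y)}{P(\mathbf{x}_l^k \mid y)}$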
Proof
The proof of Lemma 2 follows along the same lines as that of Lemma 1, with every probability conditioned on the random variable $Y$, and is therefore omitted.
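The monotonic behavior claimed by Lemma 2 can be illustrated on a toy network. The following sketch assumes uniform binary inputs, a desired output equal to XOR, and a hypothetical two-layer network whose hidden layer computes (OR, NAND) and whose output neuron computes AND; none of these specific choices come from the paper.

```python
from collections import defaultdict
from math import log2

def conditional_entropy(joint):
    """H(A | B) in bits from a dict {(a, b): probability}."""
    p_b = defaultdict(float)
    for (_, b), p in joint.items():
        p_b[b] += p
    return -sum(p * log2(p / p_b[b]) for (_, b), p in joint.items() if p > 0)

def joint_with_target(p_x, layer_map, target):
    """Joint distribution of (layer output, desired output) induced by the inputs."""
    joint = defaultdict(float)
    for x, p in p_x.items():
        joint[(layer_map(x), target(x))] += p
    return dict(joint)

xor = lambda x: x[0] ^ x[1]                            # desired output Y
identity = lambda x: x                                 # the input layer itself
hidden = lambda x: (x[0] | x[1], 1 - (x[0] & x[1]))    # assumed hidden layer (OR, NAND)
output = lambda x: hidden(x)[0] & hidden(x)[1]         # assumed output neuron (AND)

p_x = {(a, b): 0.25 for a in (0, 1) for b in (0, 1)}
chain = [
    conditional_entropy(joint_with_target(p_x, f, xor))
    for f in (identity, hidden, output)
]
print(chain)
```

For this network the conditional entropy given the desired output decreases from 1 bit at the input to 0.5 bits at the hidden layer and 0 bits at the output.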
III Optimal Mapping and Information Bottleneck
Extraction of relevant information from input data with respect to a desired output is one of the central issues in supervised learning [12]. In information theoretic terms, assuming that $X$ and $Y$ are the input and output random variables, $I(X;Y)$ is the relevant information between $X$ and $Y$. In order to improve the efficiency of a machine learning task, we generally need to extract the minimal sufficient statistics of $X$ with respect to $Y$. Hence, as the Information Bottleneck principle [11] indicates, we need to find a maximally compressed representation of $X$ by extracting the relevant information with respect to $Y$ [8]. In this section, we follow the approach of [8] to define an optimization problem that can be used in the learning process of a DNN. We also prove an upper bound on the achievable compression rate of the input in DNNs.
Consider a DNN with $L$ layers. Let us assume that $X_1, \dots, X_L$ denote the outputs of these layers. To find the maximally compressed representation of $X$, we need to minimize the mutual information between the input, i.e. $X$, and the representation of data by the DNN, $\{X_1, \dots, X_L\}$. This can be formulated as minimizing the mutual information between the input and the representation of data in each layer, i.e. $I(X; X_l)$, which can be modeled as
(12) $\min \sum_{l=1}^{L} I(X; X_l)$
To measure the fidelity of the training process of a feedforward DNN, we focus on the logarithmic-loss distortion function. Let $P(Y \mid X)$ denote the conditional distribution of $Y$ given $X$ (i.e., the original input data) and similarly, let $P(Y \mid X_1, \dots, X_L)$ denote the conditional distribution of $Y$ given $X_1, \dots, X_L$ (i.e., given the outputs of all the layers). Then, one measure of fidelity is the KL-divergence between these distributions:
(13) $d = D_{KL}\left( P(Y \mid X) \,\|\, P(Y \mid X_1, \dots, X_L) \right)$
and taking the expectation of $d$ we get:
(14) $E[d] = I(X; Y \mid X_1, \dots, X_L)$
Using the Markov chain properties of the feedforward network, one can show that
(15) $I(X; Y \mid X_1, \dots, X_L) = I(X; Y \mid X_L)$
and hence, the overall distortion of the DNN will be $I(X; Y \mid X_L)$. Therefore, using the Information Bottleneck principle, the training criterion for a feedforward DNN can be formulated as follows
(16) $\min \sum_{l=1}^{L} I(X; X_l) \quad \text{subject to} \quad I(X; Y \mid X_L) \le \epsilon$
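The distortion term discussed above, i.e. the conditional mutual information between the input and the desired output given the last layer, can be evaluated directly for small discrete examples. The sketch below is an illustration only: it assumes uniform two-bit inputs, a desired output equal to XOR, and two hypothetical candidates for the last layer (a single OR neuron and a layer that reproduces XOR exactly). The function names are made up for this example.

```python
from collections import defaultdict
from math import log2

def conditional_mutual_information(joint):
    """I(X; Y | T) in bits from a dict {(x, y, t): probability}."""
    p_t = defaultdict(float)
    p_xt = defaultdict(float)
    p_yt = defaultdict(float)
    for (x, y, t), p in joint.items():
        p_t[t] += p
        p_xt[(x, t)] += p
        p_yt[(y, t)] += p
    return sum(
        p * log2(p * p_t[t] / (p_xt[(x, t)] * p_yt[(y, t)]))
        for (x, y, t), p in joint.items() if p > 0
    )

def distortion(target, last_layer):
    """Distortion for uniform two-bit inputs, a target function and a last-layer map."""
    inputs = [(a, b) for a in (0, 1) for b in (0, 1)]
    joint = {(x, target(x), last_layer(x)): 0.25 for x in inputs}
    return conditional_mutual_information(joint)

xor = lambda x: x[0] ^ x[1]
or_neuron = lambda x: x[0] | x[1]

d_or = distortion(xor, or_neuron)   # last layer is a single OR neuron
d_xor = distortion(xor, xor)        # last layer reproduces the target exactly
print(round(d_or, 4), round(d_xor, 4))
```

A last layer that determines the desired output has zero distortion, whereas the OR neuron leaves roughly 0.69 bits of information about the target unresolved.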
As we have mentioned in Section II, in the $l$-th layer, each input combination is mapped to a specific output combination. Therefore, it can be easily shown that $H(X_l \mid X) = 0$ and hence, $I(X; X_l) = H(X_l)$. Then, using Lemma 1 (with $X_0 = X$) we have:
(17) $I(X; X_l) = H(X) - \sum_{i=1}^{l} \Delta_i$
Note that in the above equation, $H(X)$ is constant, i.e. it does not depend on the neural network settings.
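Regarding the constraint of (16), it can be shown that
(18) $I(X; Y \mid X_L) = I(X; Y) - I(X_L; Y)$
On the other hand, we know that $I(X; Y) = H(X) - H(X \mid Y)$ and $I(X_L; Y) = H(X_L) - H(X_L \mid Y)$, and using Lemma 1 and Lemma 2, we have
$H(X) - H(X_L) = \sum_{l=1}^{L} \Delta_l$ and $H(X \mid Y) - H(X_L \mid Y) = \sum_{l=1}^{L} \Delta_l^Y$, where $\Delta_l^Y$ is the reduction in conditional entropy at layer $l$,
which results in the following equation:
(19) $I(X; Y \mid X_L) = \sum_{l=1}^{L} \left( \Delta_l - \Delta_l^Y \right)$
Using (17) and (19) and by minor manipulation, the optimization problem of (16) can be rewritten as follows:
(20) $\max \sum_{l=1}^{L} \sum_{i=1}^{l} \Delta_i \quad \text{subject to} \quad \sum_{l=1}^{L} \left( \Delta_l - \Delta_l^Y \right) \le \epsilon$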
This is a convex optimization problem and, due to its complexity, it is generally difficult to find an analytic solution for it. However, numerical solutions (such as algorithms based on Blahut-Arimoto [13]) may be deployed to solve (20).
In the optimization problem of (20), $\Delta_l$ is the amount of entropy reduction in layer $l$, which can be interpreted as the amount of data compression that has occurred at this layer. Moreover, we can observe that the total data compression (i.e. reduction in entropy) that occurs in a DNN is $\sum_{l=1}^{L} \Delta_l$ and is defined in the following definition:
Definition 4
The total compression in a feedforward DNN with $L$ layers is denoted by $C$ and we have: $C = \sum_{l=1}^{L} \Delta_l = H(X) - H(X_L)$
The following lemma shows an upper bound on $C$:
Lemma 3
In a multilayer neural network with input $X$, output layer $X_L$, and the desired output $Y$, the maximum possible entropy reduction from the input space to the output space that satisfies the distortion constraint $I(X; Y \mid X_L) \le \epsilon$ is $H(X \mid Y) + \epsilon$.
Proof
The first consequence of Lemma 3 is that, regardless of the number of layers, the total compression $C$ cannot be greater than $H(X \mid Y) + \epsilon$. In fact, considering Lemma 2, $H(X_l \mid Y)$ is a nonincreasing function of $l$, and hence we have:
(23) $H(X \mid Y) \ge H(X_1 \mid Y) \ge \dots \ge H(X_L \mid Y) \ge 0$
However, it should be noted that a higher number of layers may result in a more compressed representation of data. In other words, based on Lemma 2, for $l_1 < l_2$ we have $H(X_{l_2} \mid Y) \le H(X_{l_1} \mid Y)$ and, by Lemma 1, $H(X_{l_2}) \le H(X_{l_1})$. In the next section, we indicate that due to the structural limitations of an artificial neuron, not all the mappings determined by (20) can be implemented using one layer. Therefore, in addition to providing a more compressed representation of information, multiple layers may be required in a neural network to achieve feasible mappings from the input space to the output space.
IV Feasible Optimal Mappings
While the optimization problem of (20) can be used to find the optimal mappings between consecutive layers of a neural network, it may result in unfeasible solutions. As a matter of fact, a single neuron implements a single hyperplane in the input space and hence, only linearly separable classification problems can be solved with a single neuron. As an example, in a binary input/output space (i.e. when the inputs and the output of the neuron are binary), an XOR function cannot be implemented with a single neuron. Examples of feasible and unfeasible mappings with a single neuron are illustrated in Fig. 2(a). In this figure, black and white circles are mapped to outputs '1' and '0', respectively. A single neuron can only divide the space into two parts and hence, the top mapping (i.e. the boolean OR function) can be implemented by a single neuron while the bottom mapping (i.e. the boolean XOR function) is not implementable. The unfeasible mappings that result from the optimization problem of (20) for a boolean XOR function are illustrated in Fig. 2(b). Therefore, we need to add more constraints to the above optimization problem to exclude the unfeasible mappings.
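The single-neuron limitation can be illustrated with a brute-force search. The sketch below looks for integer weights and a bias on a small grid such that a threshold neuron reproduces a given truth table. The function name `single_neuron_solution`, the grid, and the firing rule (output 1 when the weighted sum plus bias is positive) are assumptions of this example. An empty result on a finite grid is not by itself a proof, but for XOR it agrees with the well-known fact that the function is not linearly separable.

```python
from itertools import product

def single_neuron_solution(truth_table, grid=range(-3, 4)):
    """Return (weights, bias) of a threshold neuron realizing `truth_table`,
    or None if no parameters on the grid work.
    truth_table: {input tuple: 0 or 1}; the neuron fires when w.x + b > 0."""
    n = len(next(iter(truth_table)))
    for *w, b in product(grid, repeat=n + 1):
        if all(
            (sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == bool(y)
            for x, y in truth_table.items()
        ):
            return tuple(w), b
    return None

inputs2 = list(product([0, 1], repeat=2))
or_table = {x: x[0] | x[1] for x in inputs2}
xor_table = {x: x[0] ^ x[1] for x in inputs2}

or_solution = single_neuron_solution(or_table)
xor_solution = single_neuron_solution(xor_table)
print(or_solution is not None, xor_solution is None)
```

The search finds parameters for OR but none for XOR, matching the two cases discussed for Fig. 2(a).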
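Using (5) and the definition of $H(A_l^k)$ we can show that
(24) $\Delta_l = -\sum_{k=1}^{N_l} \sum_{\mathbf{x}_{l-1}^j \in A_l^k} P(\mathbf{x}_{l-1}^j) \log \dfrac{P(\mathbf{x}_{l-1}^j)}{P(\mathbf{x}_l^k)}$
Let us define the following parameter:
(25) $q_l^{jk} = \begin{cases} 1 & \text{if } \mathbf{x}_{l-1}^j \in A_l^k \\ 0 & \text{otherwise} \end{cases}$
where $j = 1, \dots, N_{l-1}$ and $k = 1, \dots, N_l$. Moreover, in vector notation, $Q_l$ is an $N_{l-1} \times N_l$ matrix such that $[Q_l]_{jk} = q_l^{jk}$. Then, (24) can be written as follows
(26) $\Delta_l = -\sum_{j=1}^{N_{l-1}} \sum_{k=1}^{N_l} q_l^{jk}\, P(\mathbf{x}_{l-1}^j) \log \dfrac{P(\mathbf{x}_{l-1}^j)}{\sum_{i=1}^{N_{l-1}} q_l^{ik}\, P(\mathbf{x}_{l-1}^i)}$
Furthermore, using a similar notation, it can be shown that
(27) $\Delta_l^Y = -\sum_{y} P(y) \sum_{j=1}^{N_{l-1}} \sum_{k=1}^{N_l} q_l^{jk}\, P(\mathbf{x}_{l-1}^j \mid y) \log \dfrac{P(\mathbf{x}_{l-1}^j \mid y)}{\sum_{i=1}^{N_{l-1}} q_l^{ik}\, P(\mathbf{x}_{l-1}^i \mid y)}$
As we mentioned before, not all the mappings between the inputs and outputs of a neuron are feasible. Unfeasible mappings depend on the structure of the network, the number of neurons in each layer, and the corresponding output functions in each layer. Let us assume that at the $l$-th layer, $\mathcal{F}_l$ is the set of forbidden mappings. Then, the optimization problem of (20) can be modified as follows:
(28) $\max \sum_{l=1}^{L} \sum_{i=1}^{l} \Delta_i \quad \text{subject to} \quad \sum_{l=1}^{L} \left( \Delta_l - \Delta_l^Y \right) \le \epsilon, \quad q_l^{jk} \in \{0, 1\}, \quad \sum_{k=1}^{N_l} q_l^{jk} = 1 \ \ \forall j, \quad Q_l \notin \mathcal{F}_l$
where the solution to (28) is the set of $Q_l$'s. Note that the last statement is used to exclude the forbidden mappings from the set of solutions. The optimization problem of (28) finds the optimal mappings between any two consecutive layers in a feedforward DNN. These mappings can then be implemented by proper selection of neuron weights.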
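For very small problems, the constrained search described above can be imitated by exhaustive enumeration. The following sketch is an assumption-laden illustration rather than the paper's algorithm: it considers a single layer with one binary threshold neuron, uniform two-bit inputs, and zero allowed distortion. It treats a mapping as forbidden when no weights on a small integer grid realize it, and it selects the admissible mapping with the largest entropy reduction. All function names are hypothetical.

```python
from itertools import product
from math import log2

INPUTS = list(product([0, 1], repeat=2))

def is_feasible(mapping, grid=range(-2, 3)):
    """True if some threshold neuron with parameters on the grid realizes `mapping`."""
    return any(
        all((w1 * x1 + w2 * x2 + b > 0) == bool(mapping[(x1, x2)]) for x1, x2 in INPUTS)
        for w1, w2, b in product(grid, repeat=3)
    )

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

def best_single_neuron_mapping(target):
    """Among all 16 binary mappings, keep the feasible ones with zero distortion
    (the output must determine the target) and return (mapping, entropy reduction)
    for the one with the largest reduction, or None if none qualifies."""
    best = None
    for bits in product([0, 1], repeat=len(INPUTS)):
        mapping = dict(zip(INPUTS, bits))
        if not is_feasible(mapping):
            continue                      # forbidden mapping
        lossless = all(
            target[a] == target[b]
            for a in INPUTS for b in INPUTS if mapping[a] == mapping[b]
        )
        if not lossless:
            continue                      # violates the zero-distortion constraint
        p_out = [bits.count(v) / len(bits) for v in (0, 1)]
        reduction = log2(len(INPUTS)) - entropy(p_out)
        if best is None or reduction > best[1]:
            best = (mapping, reduction)
    return best

and_target = {x: x[0] & x[1] for x in INPUTS}
xor_target = {x: x[0] ^ x[1] for x in INPUTS}

and_result = best_single_neuron_mapping(and_target)
xor_result = best_single_neuron_mapping(xor_target)
print(round(and_result[1], 3), xor_result)
```

Under these assumptions the AND target admits a feasible mapping with an entropy reduction of about 1.189 bits, while no feasible single-neuron mapping exists for XOR.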
V Case Study: Boolean Functions
In this section, we perform a case study to observe how the proposed optimization problem of (28) may be used to determine optimal mappings between consecutive layers in a DNN. For this study, we try to implement basic boolean functions and show how entropy changes from the input to the output layer. In this set of experiments we use AND, OR, and XOR functions with two and three inputs and we assume that $\epsilon = 0$. Results are illustrated in Table I. It is clear from these results that feasible mappings cannot be determined for two and three input XOR functions when $L = 1$. As we have mentioned before, this is due to the processing limitations of a single neuron. However, for AND and OR functions a single neuron was able to implement the function. Figure 3 shows an example set of mappings between consecutive layers for the XOR function using a two-layer neural network.
Table I also illustrates the achievable compression rate, i.e. $C$, and its corresponding upper bound, i.e. $H(X \mid Y)$. As we can observe, in the cases where the function was implementable using the neural network, $C$ is equal to $H(X \mid Y)$, which means that for these functions we have been able to achieve the minimum representation of data at the output layer. As we proved in Lemma 3, we observe that the maximum achievable level of data compression in a feedforward DNN, for $\epsilon = 0$, is $H(X \mid Y)$. Moreover, results indicate that the main reason to add an extra layer to a DNN is to achieve feasible mappings. However, as Lemma 3 shows, extra layers may lead to a more compressed representation of the input data.
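The entropy flow through a two-layer XOR network can be traced with a short script. The hidden mapping used here (one OR neuron and one NAND neuron, followed by an AND output neuron) is an assumed example and is not necessarily the mapping shown in Fig. 3. Inputs are taken to be uniform.

```python
from collections import Counter
from math import log2

def entropy_of_layer(layer, inputs):
    """Entropy in bits of a layer's output under uniformly distributed inputs."""
    counts = Counter(layer(x) for x in inputs)
    n = len(inputs)
    return -sum(c / n * log2(c / n) for c in counts.values())

inputs = [(a, b) for a in (0, 1) for b in (0, 1)]
identity = lambda x: x
hidden = lambda x: (x[0] | x[1], 1 - (x[0] & x[1]))   # assumed hidden layer (OR, NAND)
output = lambda x: hidden(x)[0] & hidden(x)[1]        # AND of the hidden outputs = XOR

h = [entropy_of_layer(f, inputs) for f in (identity, hidden, output)]
deltas = [h[i] - h[i + 1] for i in range(2)]          # per-layer entropy reductions
total_compression = sum(deltas)
cumulative_sum = sum(sum(deltas[: l + 1]) for l in range(2))
print(h, deltas, total_compression, cumulative_sum)
```

With this mapping the entropy goes from 2 bits to 1.5 bits to 1 bit, so the total compression is 1 bit. The sum of the cumulative per-layer reductions is 1.5, which coincides with the optimization function value reported for the two-layer XOR row of Table I.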
Function (Inputs)  Layers  Neurons per Layer  Solution Exists?  Optimization Function  Compression Rate  Upper Bound
AND (2)            1       [1]                Yes               1.189                  1.189             1.189
OR (2) (*)         1       [1]                Yes               0.888                  0.888             0.888
XOR (2)            1       [1]                No                -                      -                 -
AND (2)            2       [2 1]              Yes               2.377                  1.189             1.189
XOR (2)            2       [2 1]              Yes               1.500                  1.000             1.000
AND (2)            3       [2 2 1]            Yes               3.566                  1.189             1.189
XOR (2)            3       [2 2 1]            Yes               2.500                  1.000             1.000
AND (3)            1       [1]                Yes               2.456                  2.456             2.456
XOR (3)            1       [1]                No                -                      -                 -
(*) Distribution is not uniform: the probability of one input combination is 0.7 and the probability of each of the others is 0.1.
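The upper-bound values in the table can be recomputed as the conditional entropy of the input given the desired output. The sketch below assumes uniform inputs for AND and XOR. For the non-uniform OR case it further assumes, as an illustration only, that the input (1, 1) is the one with probability 0.7. The function names are hypothetical.

```python
from collections import defaultdict
from itertools import product
from math import log2

def input_entropy_given_output(p_x, target):
    """H(X | Y) in bits for input distribution p_x and deterministic target Y = target(X)."""
    p_y = defaultdict(float)
    for x, p in p_x.items():
        p_y[target(x)] += p
    return -sum(p * log2(p / p_y[target(x)]) for x, p in p_x.items() if p > 0)

def uniform(n):
    """Uniform distribution over all n-bit input combinations."""
    return {x: 1 / 2 ** n for x in product([0, 1], repeat=n)}

and_fn = lambda x: int(all(x))
or_fn = lambda x: int(any(x))
xor_fn = lambda x: sum(x) % 2

bound_and2 = input_entropy_given_output(uniform(2), and_fn)
bound_xor2 = input_entropy_given_output(uniform(2), xor_fn)
bound_and3 = input_entropy_given_output(uniform(3), and_fn)

# Assumed non-uniform distribution for the OR row: (1, 1) has probability 0.7.
p_or = {(0, 0): 0.1, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.7}
bound_or2 = input_entropy_given_output(p_or, or_fn)

print(round(bound_and2, 3), round(bound_or2, 3), round(bound_xor2, 3), round(bound_and3, 3))
```

Under these assumptions the computed values round to 1.189, 0.888, 1.000, and 2.456, in agreement with the corresponding table entries.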
VI Conclusion
In this paper, we used information theoretic methods to study the flow of information in DNNs. We determined how the entropy and conditional entropy of information change between consecutive layers and, using the Information Bottleneck principle, we modeled the learning process of a feedforward neural network as a constrained optimization problem. Furthermore, we proved an upper bound for the total compression rate of information that can be achieved in a neural network while the overall distortion in the output layer with respect to a desired output is in an acceptable range. In this paper, we assumed that the neural network is used for supervised learning tasks and that the input/output spaces are based on discrete alphabets. In future work, we aim to extend our results to a broader range of learning problems and to include continuous input/output spaces in our model.
References
 [1] X.-W. Chen and X. Lin, “Big data deep learning: Challenges and perspectives,” IEEE Access, vol. 2, pp. 514–525, 2014.
 [2] Y. Sun, X. Wang, and X. Tang, “Deep learning face representation from predicting 10,000 classes,” in Proc. of the CVPR’14, 2014, pp. 1891–1898.
 [3] I. Lenz, H. Lee, and A. Saxena, “Deep learning for detecting robotic grasps,” The International Journal of Robotics Research, vol. 34, no. 4-5, April 2015.
 [4] T. M. Mitchell, Machine Learning, McGraw-Hill, 1997.
 [5] J. Schmidhuber, “Deep learning in neural networks: An overview,” Tech. Rep. IDSIA-03-14, The Swiss AI Lab IDSIA, University of Lugano & SUPSI, Switzerland, 2014.
 [6] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
 [7] C. Szegedy, A. Toshev, and D. Erhan, “Deep neural networks for object detection,” in Proc. of NIPS’13, 2013, pp. 2553–2561.
 [8] N. Tishby and N. Zaslavsky, “Deep learning and the information bottleneck principle,” in Proc. of ITW’15, 2015.
 [9] C. M. Bishop, “Theoretical foundations of neural networks,” in Proceedings of Physics Computing, 1996, pp. 500–507.
 [10] P. Mehta and D. J. Schwab, “An exact mapping between the variational renormalization group and deep learning,” arXiv e-prints, Oct. 2014.
 [11] N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,” in Proceedings of 37th Annual Allerton Conference on Communication, Control and Computing, 1999.
 [12] T. G. Dietterich, “Machine learning for sequential data: A review,” in Structural, Syntactic, and Statistical Pattern Recognition: Joint IAPR International Workshops SSPR 2002 and SPR 2002, Windsor, Ontario, Canada, August 6–9, 2002, Proceedings, pp. 15–30, Springer Berlin Heidelberg, 2002.
 [13] R. E. Blahut, “Computation of channel capacity and rate-distortion functions,” IEEE Trans. on Information Theory, vol. 18, no. 4, pp. 460–473, 1972.