Layerwise Learning of Stochastic Neural Networks with Information Bottleneck
Abstract
Information Bottleneck (IB) is a generalization of rate distortion theory that naturally incorporates compression and relevance tradeoffs for learning. Though the original IB has been extensively studied, little is understood about multiple bottlenecks, which better fit the context of neural networks. In this work, we propose Information Multi-Bottlenecks (IMBs) as an extension of IB to multiple bottlenecks, with a direct application to training neural networks by considering layers as multiple bottlenecks and weights as parameterized encoders and decoders. We show that the multiple optimality conditions of IMBs are not simultaneously achievable for stochastic encoders. We thus propose a simple compromised scheme of IMBs which in turn generalizes the maximum likelihood estimate (MLE) principle in the context of stochastic neural networks. We demonstrate the effectiveness of IMBs on classification tasks and adversarial robustness on MNIST and CIFAR10.
1 Introduction
The Information Bottleneck (IB) principle [Tishby99] extracts relevant information about a target variable $Y$ from an input variable $X$ via a single bottleneck variable $Z$. In detail, the IB framework constructs a bottleneck variable $Z$ that is a compressed version of $X$ but preserves as much relevant information in $X$ about $Y$ as possible. The compression of the representation $Z$ is quantified by $I(Z;X)$, the mutual information of $Z$ and $X$. The relevance in $Z$, the amount of information $Z$ contains about $Y$, is specified by $I(Z;Y)$. An optimal representation $Z$ satisfying a certain compression-relevance tradeoff constraint is then determined via minimization of the Lagrangian $\mathcal{L}[p(z|x)] = I(Z;X) - \beta I(Z;Y)$, where $\beta$ is a positive Lagrangian multiplier that controls the tradeoff.
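As a concrete illustration of the Lagrangian above, the following is a minimal sketch for discrete variables, assuming the standard form $I(Z;X) - \beta I(Z;Y)$ and the Markov chain $Y \rightarrow X \rightarrow Z$. The function names `mutual_information` and `ib_lagrangian` are hypothetical helpers introduced only for this example.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information I(A;B) in nats from a joint probability table p(a, b)."""
    joint = np.asarray(joint, dtype=float)
    pa = joint.sum(axis=1, keepdims=True)   # marginal p(a), shape (|A|, 1)
    pb = joint.sum(axis=0, keepdims=True)   # marginal p(b), shape (1, |B|)
    mask = joint > 0                        # skip zero cells (0 log 0 = 0)
    return float(np.sum(joint[mask] * np.log(joint[mask] / (pa @ pb)[mask])))

def ib_lagrangian(p_xy, p_z_given_x, beta):
    """I(Z;X) - beta * I(Z;Y) for a discrete encoder p(z|x), assuming Y -> X -> Z."""
    p_xy = np.asarray(p_xy, dtype=float)          # shape (|X|, |Y|)
    enc = np.asarray(p_z_given_x, dtype=float)    # shape (|X|, |Z|), rows sum to 1
    p_x = p_xy.sum(axis=1)
    p_xz = p_x[:, None] * enc                     # p(x, z) = p(x) p(z|x)
    p_zy = enc.T @ p_xy                           # p(z, y) = sum_x p(z|x) p(x, y)
    return mutual_information(p_xz) - beta * mutual_information(p_zy)
```

With $Y = X$ uniform over two values and $\beta = 2$, an identity encoder attains a lower value of the Lagrangian than an encoder that maps every input to the same $z$, which is the tradeoff the multiplier controls.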
Deep neural networks (DNNs) have demonstrated state-of-the-art performance in several important machine learning tasks including image recognition [DBLP:conf/nips/KrizhevskySH12], natural language translation [cho2014learning, bahdanau2014neural] and game playing [DBLP:journals/nature/SilverHMGSDSAPL16]. Behind the practical success of DNNs there are various revolutionary techniques such as data-specific design of network architecture (e.g., convolutional neural network architecture), regularization techniques (e.g., early stopping, weight decay, dropout [DBLP:journals/jmlr/SrivastavaHKSS14], and batch normalization [DBLP:conf/icml/IoffeS15]), and optimization methods [DBLP:journals/corr/KingmaB14]. For learning DNNs, the maximum likelihood estimate (MLE) principle (in its various forms such as maximum log-likelihood or Kullback-Leibler divergence) has generally become a de facto standard. The MLE principle maximizes the likelihood of the model for observing the entire training data. This principle is, however, generic and not specially tailored to hierarchy-structured models like neural networks. In particular, MLE treats the entire neural network as a collective body without considering the explicit contribution of its hidden layers to model learning. As a result, the information contained within the hidden structure may not be adequately shaped to capture the data regularities reflecting a target variable. Thus, a reasonable question to ask is whether the MLE principle effectively and sufficiently exploits a neural network’s representational power, and whether there is a better alternative.
In this work, we propose a unifying perspective to bridge IB and neural networks via Information Multi-Bottlenecks (IMBs). The core idea is that we extend the original IB to multiple bottlenecks, whereby neural networks can be viewed as a parameterized version of IMBs. We show a conflicting optimality for multiple bottlenecks under some mild conditions, which hold for stochastic neural networks. When applied to stochastic neural networks, IMBs readily benefit from various mutual information approximations (such as Variational Information Bottleneck [DBLP:journals/corr/AlemiFD016]) and gradient estimation. Consequently, we show how IMBs provide an alternative learning principle for stochastic neural networks as compared to the standard MLE principle. Finally, we demonstrate the superior or at least competitive empirical performance of IMBs, compared to MLE, in terms of generalization and adversarial robustness for stochastic neural networks on MNIST and CIFAR10. We also show that IMBs empirically learn neural representations better in terms of information exploitation.
This paper is organized as follows. We first review related literature in Section 2. Section 3 introduces our Information Multi-Bottlenecks with some important insights and shows how they are applied to stochastic neural networks. Section 4 presents a case study in which we apply IMB to binary stochastic neural networks. Finally, Section 5 presents the empirical results of our framework on MNIST and CIFAR10, in comparison with MLE and Variational Information Bottleneck.
2 Related Work
Our IMBs are a direct multi-bottleneck extension of the Information Bottleneck (IB) proposed in [Tishby99]. IB generalizes rate distortion theory to better fit the scenario of learning. IB extracts the relevant information in one variable $X$ about another variable $Y$ via some intermediate variable $Z$. This IB problem has been solved efficiently in the following three cases only: (1) $X$, $Y$ and $Z$ are all discrete [Tishby99]; (2) $X$, $Y$ and $Z$ are mutually joint Gaussian [DBLP:journals/jmlr/ChechikGTW05]; (3) $(X, Y)$ has a meta-Gaussian distribution [DBLP:conf/nips/ReyR12]. However, the extension of the IB principle to multiple bottlenecks is not straightforward. In addition, the IB principle has been proven to be mathematically equivalent to the MLE principle for the multinomial mixture model for the clustering problem when the input distribution is uniform or the sample size is large [Slonim03]. It is, however, not clear how the IB principle is related to the MLE principle in the context of DNNs. In IMBs, we extend IB to multiple bottlenecks and show the connection between IMBs and MLE in stochastic neural networks.
Our work is also related to the literature on training DNNs. Perhaps the most common way to generalize the MLE principle in DNNs is to apply Bayesian modeling to the weights of DNNs [Neal:1995:BLN:922680, DBLP:journals/neco/MacKay92a, DBLP:journals/neco/DayanHNZ95]. This approach equips DNNs with uncertainty reasoning and takes good advantage of the well-studied tools in probability theory. As a result, this idea has achieved interpretability and state-of-the-art performance on many tasks [Gal2016Uncertainty]. The main challenges of this approach are to approximate the intractable posterior and to scale the model to high-dimensional data. IMBs, on the other hand, take a different but very natural perspective on DNNs by reasoning about learning in terms of the information contained in the neural representations.
Some works have considered applying the Information Bottleneck to multiple layers in neural networks. For example, [DBLP:conf/itw/TishbyZ15] proposes the use of the mutual information of a hidden layer with the input layer and the output layer to quantify the performance of DNNs. A different family of work along this line approximates the mutual information of high-dimensional variables arising in DNNs [DBLP:conf/uai/StrouseS16, DBLP:conf/nips/ChalkMT16, DBLP:journals/corr/AlemiFD016]. While we also apply IB to neural networks, the key difference in our work is that we explicitly develop a framework of Information Multi-Bottlenecks which are specifically calibrated to each layer of a neural network.
A possibly (but not closely) related line of work is layer-wise learning, in which certain aspects of each layer are considered to train a neural network. A great deal of work along this line concerns greedy layer-wise training of DNNs [Hinton2006, Bengio2006]. While this line of work uses unsupervised greedy pre-training of DNNs, IMB is neither necessarily a pre-training method nor a greedy algorithm. In particular, the unsupervised greedy pre-training of stacked autoencoders has been proven to be equivalent to maximizing the mutual information between the input variable and the latent variable [Vincent2008]. In contrast, IMBs follow a different principle which aims at preserving the relevant information while compressing at the same time.
3 Information Multi-Bottlenecks
Notations
We denote random variables (RVs) by capital letters, e.g., $X$, and their specific realization values by the corresponding lowercase letters, e.g., $x$. Note that $X$ can be vector-valued. We write $X \rightarrow Y \rightarrow Z$ to denote a Markov chain where $X$ and $Z$ are independent given $Y$. We write $X \perp Y$ (respectively, $X \not\perp Y$) to indicate that $X$ and $Y$ are independent (respectively, not independent), and abuse the integral notation, e.g., $\int p(x)\,dx$, regardless of whether the variable is real-valued or discrete-valued.
3.1 Information Multi-Bottlenecks
Information Bottleneck [Tishby99] extends the notion of rate distortion theory to incorporate compression and relevance tradeoffs, which are more suitable for learning than rate distortion theory. The information about data variables $X$ and $Y$ (e.g., $X$ represents MNIST images and $Y$ represents the MNIST labels) is squeezed through a bottleneck (possibly multivariate) variable $Z$. The goal is to find an optimal encoder $p(z|x)$ such that $Z$ preserves only the relevant information in $X$ about $Y$ and compresses away the irrelevant information. This goal can be formally described as a constrained optimization in terms of mutual information.
Here we consider a natural extension of IB to multiple bottlenecks such that the bottlenecks form a Markov chain. The extension simply introduces multiple constrained Information Bottleneck optimization problems for each of the bottlenecks:
Despite being simple, this multi-objective optimization is very challenging, especially when the bottlenecks are high-dimensional. This extension is also essential in the context of neural networks as it addresses an important problem that cannot be handled by the conventional IB. Approaches that apply the conventional IB to neural networks (e.g., [DBLP:journals/corr/AlemiFD016] directly applies IB to neural networks by considering the entire neural network as a parameterized encoder in IB) do not fully leverage the multi-layered structure of neural networks. Each hidden layer makes an important contribution to the compression and relevance process in a neural network and thus should be explicitly considered in the learning process via the information bottleneck principle. Therefore, multiple bottlenecks are more suitable for neural networks, and it is important to understand them. In the following theorem, we identify when it is possible to obtain non-conflicting optimal solutions for multiple bottlenecks.
Theorem 3.1 (Conflicting MultiInformation Optimality).
Given four random variables and such that and and two constrained minimization problems:
(1) 
where , , and . Then, the following two statements are equivalent:

(i) $Z_1$ and $Z_2$ satisfy neither of the following conditions:
(a) $Z_2$ is a sufficient statistic of $Z_1$ for $X$ and $Y$ (i.e., $Y \rightarrow X \rightarrow Z_2 \rightarrow Z_1$),
(b) $Z_2$ is independent of $Z_1$.
(ii) The two constrained minimization problems defined above are conflicting, i.e., there does not exist a single solution that minimizes both objectives simultaneously.
Sketch proof.
Leveraging the Markovian structure of four random variables in the two Information Bottleneck objectives and using Lagrangian multipliers. The detailed proof is in Appendix A (the supplementary). ∎
Theorem 3.1 suggests that the multiinformation optimality (the optimal conditions for each of the individual Information Bottleneck optimization problems in IMBs) conflicts for most cases of interest, e.g. stochastic neural networks which we will present in detail in next subsection. The values of and are important to control the number of bits we extract the relevant information into the bottlenecks and determine the conflictability of multiple bottlenecks on the edge cases. Recall that for , we have . If and go to infinity, the optimal bottlenecks and are both deterministic function of and they do not conflict. When , the information about in is maximally compressed in and (i.e., ), and they do not conflict. But the optimal solutions conflict when and as the former leads to a maximally compressed while the latter prefers an informative (this contradicts the Markov structure which indicates that maximal compression in leads to maximal compression in ). We can also easily construct some nonconflicting IMBs for that violates the conditions. For example, if and are jointly Gaussian, the optimal bottlenecks and are linear transform of and jointly Gaussian with and ([DBLP:journals/jmlr/ChechikGTW05]). In this case, is a sufficient statistic of for . In the case of neural networks, we can also construct a simple but nontrivial neural network that can obtain a nonconflicting multiinformation optimality. For example, consider a neural network of two hidden layers and where is arbitrarily mapped from the input layer but is a sample mean of samples i.i.d. drawn from the normal distribution . This construction guarantees that is a sufficient statistic of for , thus there is nonconflicting multiinformation optimality.
3.2 Stochastic Neural Networks
Stochastic neural networks have been studied in the literature [DBLP:conf/nips/TangS13, DBLP:journals/corr/RaikoBAD14, DBLP:journals/access/ShafieeSW16]. One important advantage of stochastic neural networks is that they can induce rich multi-modal distributions in the output space [DBLP:conf/nips/TangS13] and enable exploration in reinforcement learning [DBLP:conf/iclr/FlorensaDA17]. Here we consider a stochastic neural network with $L$ hidden layers without any feedback or skip connection. We view the input layer $X$, the output $Z_l$ of the $l$-th hidden layer, and the network output layer $\hat{Y}$ as random variables (RVs). Without any feedback or skip connection, $Y$, $X$, $Z_1, \dots, Z_L$ and $\hat{Y}$ form a Markov chain in that order, denoted as:
(2) $Y \rightarrow X \rightarrow Z_1 \rightarrow \cdots \rightarrow Z_L \rightarrow \hat{Y}$
The role of the neural network is, therefore, reduced to transforming from one RV to another via the Markov chain, where $\hat{Y}$ is used as a surrogate for $Y$. We call the transition distribution $p(z_l|x)$ from $X$ to $Z_l$ an encoder, as it encodes the data $X$ into the representation $Z_l$ (if the mapping from $X$ to $Z_l$ is deterministic, then $p(z_l|x)$ is simply a delta function). For each encoder $p(z_l|x)$, there is a unique corresponding decoder, namely the relevance decoder, that decodes the relevant information in $X$ about $Y$ from the representation $Z_l$:
(3) $p(y|z_l) = \int p(x, y)\, \frac{p(z_l|x)}{p(z_l)}\, dx$
It follows from the Markov chain in Equation (2) that inference in a neural network can be done as:
(4) 
where , and .
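The inference step can be approximated by ancestral sampling along the Markov chain. The following is a minimal sketch, assuming binary (Bernoulli) hidden layers with sigmoid activations and a softmax output layer; the function `predict` and its parameter names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def predict(x, weights, biases, w_out, b_out, num_samples, rng):
    """Monte Carlo estimate of p(y_hat | x) by sampling z_1, ..., z_L layer by layer."""
    probs = []
    for _ in range(num_samples):
        z = x
        for W, b in zip(weights, biases):
            p = sigmoid(z @ W + b)                        # Bernoulli means of the next layer
            z = (rng.random(p.shape) < p).astype(float)   # sample the binary layer
        probs.append(softmax(z @ w_out + b_out))          # output distribution given z_L
    return np.mean(probs, axis=0)                         # average over sampled paths
```

Averaging the output distribution over sampled hidden states approximates the marginalization over the bottlenecks; more samples reduce the variance of the estimate.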
3.3 Stochastic Neural Networks as Information Multi-Bottlenecks
We consider a stochastic neural network as a parameterized version of Information Multi-Bottlenecks in which each layer is a bottleneck and the weights connecting the layers are parameterized encoders and relevance decoders. Specifically, the encoder $p(z_l|x)$ is parameterized by the subnetwork from the input layer to layer $l$. In this perspective, a stochastic neural network is a data-processing system that transforms a data distribution via a series of bottlenecks $Z_1, \dots, Z_L$. Thus, the role of each layer can be interpreted as an information filter that compresses irrelevant information and preserves relevant information. This notion of compression and relevance can be captured with the mutual information $I(Z_l;X)$ and $I(Z_l;Y)$, respectively. Subsequently, it is natural to interpret the learning of a stochastic neural network as a multi-objective optimization:
(5) 
where $\beta_l$ are the positive Lagrange multipliers for the constraints.
We first present how to approximate the mutual information $I(Z_l;X)$ and $I(Z_l;Y)$ using variational mutual information. After that, we present how to solve the multi-objective optimization in Eq. (5).
3.3.1 Approximate Relevance
The relevance $I(Z_l;Y)$ is intractable due to the intractable relevance decoder $p(y|z_l)$ in Equation (3). It follows from the non-negativity of the Kullback-Leibler divergence that:
(6) 
where is any probability distribution. Note that where which can be ignored in the minimization of . Specifically in IMB, we propose to use the network architecture connecting to to define the variational relevance decoder for layer , i.e., where is determined by the network architecture:
(7) 
For the rest of this work, we refer to with as the variational conditional relevance (VCR) of the layer. Theorem 3.2 addresses the relation between VCR and the MLE principle.
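The variational argument can be checked numerically on a small discrete example. The sketch below assumes the standard bound $I(Z;Y) \geq H(Y) + \mathbb{E}_{p(z,y)}[\log q(y|z)]$ for any decoder $q$, which follows from the non-negativity of the KL divergence; the helper `relevance_and_bound` is hypothetical and assumes every value of $z$ has positive probability.

```python
import numpy as np

def relevance_and_bound(p_zy, q_y_given_z):
    """Return (I(Z;Y), variational lower bound) in nats for discrete Z and Y.

    p_zy:        joint table p(z, y), shape (|Z|, |Y|), with p(z) > 0 for all z
    q_y_given_z: any decoder q(y|z), shape (|Z|, |Y|), rows sum to 1
    """
    p_zy = np.asarray(p_zy, dtype=float)
    q = np.asarray(q_y_given_z, dtype=float)
    p_y = p_zy.sum(axis=0)
    p_z = p_zy.sum(axis=1, keepdims=True)
    h_y = -np.sum(p_y[p_y > 0] * np.log(p_y[p_y > 0]))              # entropy H(Y)
    true_decoder = p_zy / p_z                                        # exact p(y|z)
    mask = p_zy > 0
    relevance = h_y + np.sum(p_zy[mask] * np.log(true_decoder[mask]))  # H(Y) - H(Y|Z)
    bound = h_y + np.sum(p_zy[mask] * np.log(q[mask]))               # H(Y) - cross-entropy
    return float(relevance), float(bound)
```

The bound is tight exactly when the variational decoder equals the true relevance decoder; the gap is the expected KL divergence between the two.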
Theorem 3.2 (Information on the extreme layers).
The VCR of the lowestlevel (socalled super) layer (i.e., ) is the negative loglikelihood (NLL) function of the neural network, i.e.,
(8) 
Similarly, the VCR of the highestlevel layer (i.e., ) equals that of the compositional layer , a composite of all hidden layers; in addition, their VCR is an upper bound on the NLL:
(9) 
Sketch Proof.
Using the definition of VCR, the Markov Chain assumption, and Jensen’s inequality. The detailed proof is in Appendix B (the supplementary). The details of how to decompose VCR for multivariate variable can also be found in Appendix F (the supplementary). ∎
An interpretation of MLE in terms of VCR, and vice versa, follows immediately from Theorem 3.2. That is, the MLE principle optimizes the super-level VCR, while VCR allows an explicit extension of this concept to any layer.
3.3.2 Approximate Compression
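While the layer transition $p(z_l|z_{l-1})$ in DNNs has an analytical form, the encoder $p(z_l|x)$ for $l > 1$ generally does not, as it is a mixture of $p(z_l|z_{l-1})$. We thus propose to avoid directly estimating $I(Z_l;X)$ by instead resorting to its upper bound $I(Z_l;Z_{l-1})$ as its surrogate in the optimization. However, $I(Z_l;Z_{l-1})$ is still intractable as it involves the intractable distribution $p(z_l)$. We then approximate $I(Z_l;Z_{l-1})$ using a mean-field (factorized) variational distribution $r(z_l)$ in place of $p(z_l)$: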
(10) 
The detailed derivations can be found in the Appendix C (the supplementary).
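For Bernoulli layers, a variational compression term of this kind can be computed in closed form. The sketch below assumes the usual upper bound $I(Z_l;Z_{l-1}) \leq \mathbb{E}\left[\mathrm{KL}\big(p(z_l|z_{l-1}) \,\|\, r(z_l)\big)\right]$ with a factorized Bernoulli $r$; the symbol $r$ and both function names are assumptions made for illustration.

```python
import numpy as np

def bernoulli_kl(p, r, eps=1e-12):
    """KL(Bern(p) || Bern(r)) summed over the units of a layer (last axis)."""
    p = np.clip(p, eps, 1.0 - eps)   # avoid log(0)
    r = np.clip(r, eps, 1.0 - eps)
    return np.sum(p * np.log(p / r) + (1.0 - p) * np.log((1.0 - p) / (1.0 - r)), axis=-1)

def compression_upper_bound(layer_probs, r):
    """Monte Carlo average of KL(p(z_l | z_{l-1}) || r(z_l)).

    layer_probs: Bernoulli means of layer l for each sampled z_{l-1}, shape (N, units)
    r:           means of the factorized variational marginal, shape (units,)
    """
    return float(np.mean(bernoulli_kl(layer_probs, r)))
```

The bound is zero when every conditional equals the variational marginal, i.e., when the layer carries no information about its input.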
3.4 Compromised Information Optimality
Due to Theorem 3.1, we cannot achieve the information optimality for all layers simultaneously; thus we need a compromised approach that instead obtains a compromised optimality. We propose two natural compromised strategies, namely JointIMB and GreedyIMB. JointIMB (Algorithm 1) minimizes a weighted sum of the variational IB objectives of the layers, where each variational objective approximates the layer's IB objective using the approximate relevance (Eq. (6)) and the approximate compression (Eq. (10)). The main idea of JointIMB is to simultaneously optimize all encoders and variational relevance decoders. Even though each layer might not achieve its individual optimality, their joint optimality encourages a joint compromise. On the other hand, GreedyIMB applies IMB progressively in a greedy manner. In other words, GreedyIMB tries to obtain the conditional optimality of the current layer conditioned on the achieved conditional optimality of the previous layers.
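The joint strategy can be sketched as a plain weighted sum of per-layer terms. The following is a minimal illustration, assuming each layer's variational objective takes the form "compression bound plus $\beta_l$ times VCR" (constants such as $H(Y)$ dropped) and assuming per-layer weights `gammas`; neither the exact form nor the names are taken from Algorithm 1.

```python
def joint_imb_objective(compression_bounds, vcrs, betas, gammas):
    """Weighted sum over layers of (compression bound + beta_l * VCR_l).

    All four arguments are sequences with one entry per layer; the weighting
    scheme is an assumption made for this sketch.
    """
    if not (len(compression_bounds) == len(vcrs) == len(betas) == len(gammas)):
        raise ValueError("expected one entry per layer in every argument")
    total = 0.0
    for comp, vcr, beta, gamma in zip(compression_bounds, vcrs, betas, gammas):
        total += gamma * (comp + beta * vcr)   # per-layer variational IB term
    return total
```

A greedy variant would instead minimize one layer's term at a time while holding the previously trained layers fixed.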
4 A Case Study: Binary Stochastic Neural Networks
To analyze IMB, we apply it to a simple network architecture: binary stochastic feed-forward (fully-connected) neural networks (SFNN), though the extension to real-valued stochastic neural networks is straightforward using the reparameterization trick [DBLP:journals/corr/KingmaW13]. In binary SFNN, we use the sigmoid as the activation function: $p(z_l = 1 | z_{l-1}) = \sigma(W_{l-1} z_{l-1} + b_{l-1})$, where $\sigma(\cdot)$ is the (element-wise) sigmoid function, $W_{l-1}$ is the network weights connecting layer $l-1$ to layer $l$, $b_{l-1}$ is a bias vector and $Z_0 = X$. We also make each factor of the variational distribution a learnable Bernoulli distribution.
It has so far not been specified how the gradient is computed in a stochastic neural network in Algorithm 1. The sampling operation in stochastic neural networks precludes backpropagation through the computation graph. It becomes even more challenging with binary stochastic neural networks, as gradients w.r.t. discrete-valued variables are not well-defined. Fortunately, there are approximate gradient estimators that have proved efficient in practice: the REINFORCE estimator [DBLP:journals/ml/Williams92, DBLP:journals/corr/BengioLC13], the straight-through estimator [Hintonlecture], the generalized EM algorithm [DBLP:conf/nips/TangS13], and the Raiko (biased) estimator [DBLP:journals/corr/RaikoBAD14]. We found that the Raiko gradient estimator works best in our specific setting and thus deployed it in this application. In the Raiko estimator, the gradient of a bottleneck particle is propagated only through the deterministic term, i.e., the sigmoid activation probability, while the sampling noise is treated as a constant.
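The estimator described above can be sketched with explicit forward and backward functions. This is a minimal illustration, assuming the decomposition $z = \sigma(a) + \epsilon$ with the noise $\epsilon = z - \sigma(a)$ held constant in the backward pass; the function names are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def binary_layer_forward(a, rng):
    """Sample binary units z ~ Bernoulli(sigmoid(a)); also return the means p."""
    p = sigmoid(a)
    z = (rng.random(p.shape) < p).astype(float)
    return z, p

def binary_layer_backward(grad_z, p):
    """Biased gradient w.r.t. the pre-activation a.

    Writing z = p + (z - p) and treating the noise (z - p) as a constant gives
    dz/da = dp/da = p * (1 - p), so the gradient flows only through p.
    """
    return grad_z * p * (1.0 - p)
```

The estimator is biased because the dependence of the noise on the pre-activation is ignored, but it has much lower variance than a score-function estimator.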
5 Experiments
Model  Classification  Adv. Robustness (%)  

MNIST  CIFAR10  Targeted  Untargeted  
(Error %)  (Accuracy %)  
DET  
VIB ([DBLP:journals/corr/AlemiFD016])  
SFNN ([DBLP:journals/corr/RaikoBAD14])  
GreedyIMB  57.61  
JointIMB  1.36  96.00 
In this section, we evaluate IMB with the GreedyIMB and JointIMB algorithms on MNIST [LeCun98] and CIFAR10 [cifar10] for classification, learning dynamics and robustness against adversarial attacks. The MNIST dataset consists of 28x28 pixel greyscale images of handwritten digits 0-9, with 60,000 training and 10,000 test examples. The CIFAR10 dataset consists of 60,000 (50,000 for training and 10,000 for testing) 32x32 colour images in 10 classes, with 6,000 images per class.
5.1 Image classification
We compared JointIMB and GreedyIMB with three other comparative models which used the same network architecture without any explicit regularizer: (1) a standard deterministic neural network (DET), which simply treats each hidden layer as deterministic; (2) a Stochastic Feed-forward Neural Network (SFNN) [DBLP:journals/corr/RaikoBAD14], which is a binary stochastic neural network as in IMB but is trained with the MLE principle; (3) Variational Information Bottleneck (VIB) [DBLP:journals/corr/AlemiFD016], which employs the entire deterministic network as an encoder, adds an extra stochastic layer as an out-of-network bottleneck variable, and is then trained with the IB principle on that single bottleneck layer. The base network architecture in this experiment had two hidden layers of sigmoid-activated neurons. The experimental setup details can be found in Appendix D (the supplementary).
The results are shown in Table 1. Even though we did not optimize the hyperparameters of IMB, JointIMB already outperforms MLE and VIB on MNIST, and GreedyIMB outperforms the other models on CIFAR10. The performance of JointIMB on CIFAR10 is also comparable to MLE. The results suggest that explicitly inducing relevant but compressed information into each layer of a neural network via IMBs is a promising approach.
5.2 Robustness against adversarial attacks
We consider here the adversarial robustness of neural networks trained by IMBs. Neural networks are prone to adversarial attacks which perturb the input pixels by small amounts imperceptible to humans [DBLP:journals/corr/SzegedyZSBEGF13, DBLP:conf/cvpr/NguyenYC15]. Adversarial attacks generally fall into two categories: untargeted and targeted attacks. An untargeted adversarial attack $A$ maps the target model $f$ and an input image $x$ into an adversarially perturbed image $x'$: $A(f, x) = x'$, and is considered successful if it can fool the model $f$. A targeted attack, on the other hand, has an additional target label $l$: $A(f, x, l) = x'$, and is considered successful if $f(x') = l$.
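The two success criteria, and the accuracy on perturbed inputs used to rank robustness, can be written down directly. The sketch below assumes a classifier exposed as a function `predict` that returns a label for a single input; all names are illustrative, and no particular attack algorithm is implemented.

```python
import numpy as np

def untargeted_success(predict, x_adv, y_true):
    """An untargeted attack succeeds if the perturbed input is misclassified."""
    return predict(x_adv) != y_true

def targeted_success(predict, x_adv, target):
    """A targeted attack succeeds if the perturbed input is assigned the target label."""
    return predict(x_adv) == target

def robust_accuracy(predict, adv_images, labels):
    """Accuracy (in %) of the model on adversarially perturbed inputs."""
    correct = [predict(x) == y for x, y in zip(adv_images, labels)]
    return 100.0 * float(np.mean(correct))
```

A higher accuracy on the perturbed test inputs indicates a more robust model under the given attack.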
We performed adversarial attacks on the neural networks trained by MLE and IMB, and used the accuracy on adversarially perturbed versions of the test set to rank a model’s robustness. For both targeted and untargeted attacks, we use the attack method of [DBLP:conf/sp/Carlini017], which has been shown to be the most effective attack algorithm with smaller perturbations. Specifically, we attacked the same four comparative models described in the previous experiment on the first samples of the MNIST test set. For the targeted attacks, we targeted each image at the labels other than its true label.
The results are also shown in Table 1. We see that the deterministic model DET is totally fooled by the attacks. It is known that stochasticity in neural networks improves adversarial robustness, which is consistent with our experiment, as SFNN is significantly more adversarially robust than DET. VIB has adversarial robustness comparable to SFNN even though VIB has "less stochasticity" than SFNN (VIB has one stochastic layer while all hidden layers of SFNN are stochastic). This is because VIB compensates with the IB principle for its stochastic layer. Finally, JointIMB is more adversarially robust than the other models. Explicitly inducing compression and relevance into each layer thus has the potential to yield greater adversarial robustness.
5.3 Learning dynamics
To better understand how IMB modifies the information within the layers during the learning process, we visualize the compression and relevance of each layer over the course of training of SFNN and JointIMB (the visualization for GreedyIMB is in Appendix E (the supplementary)). To simplify our analysis, we considered a binary decision problem where $X$ is binary inputs making up equally likely input patterns and $Y$ is a binary variable equally distributed among input patterns [DBLP:journals/corr/ShwartzZivT17]. The base neural network architecture had 4 hidden layers with widths 10-8-6-4 neurons. Since the network architecture is small, we could precisely compute the (true) compression $I(Z_l;X)$ and (true) relevance $I(Z_l;Y)$ over training epochs. We fixed for both JointIMB, trained five different randomly initialized neural networks for each comparative model with SGD up to 20,000 epochs on of the data, and averaged the mutual information.
Figure 1 provides a visualization of the learning dynamics of SFNN versus JointIMB on the information plane $(I(Z_l;X), I(Z_l;Y))$. We can observe a common trend in the learning dynamics offered by both MLE (in the SFNN model) and the JointIMB framework. Both principles allow the network to gradually encode more information about $X$ and the relevant information about $Y$ into the hidden layers at the beginning, as $I(Z_l;X)$ and $I(Z_l;Y)$ both increase. In particular, compression does occur in SFNN, which is consistent with the result reported in [DBLP:journals/corr/ShwartzZivT17].
What distinguishes IMB from MLE is the maximum level of relevance at each layer and the number of epochs needed to encode the same level of relevance. Firstly, JointIMB at needs only about of the training epochs to achieve at least the same level of relevance in all layers of SFNN at the final epoch. Secondly, MLE is unable to encode the network layers to reach the maximum level of relevance enabled by IMB (We also trained SFNN up to epochs and observed that the level of relevance of each layer degrades before ever reaching the value of bits.). We also see that the compression constraints within the IMB framework keep the layer representation from shifting to the right (in the information plane) during the encoding of relevant information.
We believe the reason that the relevance for IMB increases up to some point before decreasing, while the relevance for SFNN increases up to some point and then almost stays there (without a decrease), is that IMB can explicitly exploit the information from each layer more effectively than MLE. The IMB objective allows the encoding of relevant information into each layer to eventually reach its optimal information tradeoff at some point. If we continue training after this point, then due to the mismatch between the exact IMB objective and its variational bound (Eqs. (6), (7) and (10)), further minimization of the variational bound would decrease $I(Z_l;Y)$ (consequently, in order to keep the objective small, $I(Z_l;X)$ also needs to decrease after this point to compensate for the decrease in $I(Z_l;Y)$). In the case of SFNN (trained with MLE), the MLE objective reaches its local minimum before the information of each layer can even reach its optimal information tradeoff (if ever). This explains why the relevance and compression of SFNN almost stay the same after some point. This suggests that IMB is better than MLE at exploiting information for each layer during learning.
The claim that compression does occur in SFNN, consistent with [DBLP:journals/corr/ShwartzZivT17], deserves elaboration. A more precise wording, based on Fig. 1.a, is that the increase of $I(Z_l;X)$ slows down at some point for deeper layers. Intuitively, for the representations to make sense of the task, they should encode enough information about the input $X$; thus $I(Z_l;X)$ should increase. (This is especially true for shallow layers because, due to the Markov chain property, the shallower a layer, the greater its burden of carrying enough information to make sense of the task. This might also explain why the compression force, explained below, cannot dominate the force of increasing $I(Z_l;X)$ for shallow layers.) However, instead of letting $I(Z_l;X)$ keep increasing at the same rate, MLE trained with SGD slows down its increase at some point for deep layers (e.g., Fig. 1.a). We refer to the force that slows down the increase of $I(Z_l;X)$ as the compression force. Depending on how strong that compression force is (which in turn, we believe, depends on the task and the structure used), the slowing down might have the stronger effect of bending over the increase of $I(Z_l;X)$. The stronger the compression force, the more severe the bending over of the increase, though in Fig. 1.a the bending effect is not as strong as that in the experiments of Shwartz-Ziv and Tishby. That MLE trained with SGD exhibits compression is very interesting because the MLE principle does not explicitly aim for compression, yet still has a compression effect.
Shwartz-Ziv and Tishby have explained the compression effects of MLE trained with SGD very well; to our understanding, most of their explanation is from the SGD perspective, not from the perspective of the learning principle (MLE in this case). We believe that there is an alternative explanation from a learning principle perspective, and such an explanation would be very interesting. In the meantime, IMB takes over the implicit role of MLE and makes it explicit: it explicitly encourages such compression-relevance tradeoffs in all layers. In IMB (Fig. 1.b), the compression force is stronger (i.e., the bending-over effect is stronger than that in MLE) while the relevance is pushed higher in far fewer epochs (represented by the colors). This suggests that MLE does not fully encourage compression-relevance tradeoffs for each layer, even though it happens to do so implicitly in a limited way. Consequently, explicitly encouraging compression and relevance in each layer could bring potential benefits for classification, adversarial learning and multi-modal learning.
6 Conclusion
In this work, we introduce Information Multi-Bottlenecks (IMBs), which provide a principled information-theoretic approach to learning neural networks. We provide important insights about when IMBs are possible and how they can be approximated in stochastic neural networks. The principled inducing of compression and relevance into each layer of DNNs via IMBs also provides a promising improvement in classification problems and adversarial robustness. In general, IMBs also better exploit the neural representations for learning. We demonstrate the results on MNIST and CIFAR10.
References
Appendix
Appendix A. Proof of Theorem 3.1
Proof of Theorem 3.1.
We present a detailed proof of Theorem 3.1, which uses proof by contradiction and the following lemmas.
Lemma 1.
Given , we have
(11)  
(12) 
Proof.
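A short derivation sketch, under the assumption that Lemma 1 states the chain-rule decomposition of mutual information along the Markov chain $Y \rightarrow X \rightarrow Z_1 \rightarrow Z_2$ (the displayed equations are not reproduced here, so this form is an assumption):

```latex
\begin{align*}
I(Z_1, Z_2; X) &= I(Z_1; X) + I(Z_2; X \mid Z_1) = I(Z_1; X), \\
I(Z_1, Z_2; X) &= I(Z_2; X) + I(Z_1; X \mid Z_2), \\
\Rightarrow \quad I(Z_1; X) &= I(Z_2; X) + I(Z_1; X \mid Z_2).
\end{align*}
```

The first line uses $I(Z_2; X \mid Z_1) = 0$, which holds because $X$ and $Z_2$ are independent given $Z_1$. The same argument with $Y$ in place of $X$ gives $I(Z_1; Y) = I(Z_2; Y) + I(Z_1; Y \mid Z_2)$. Since conditional mutual information is non-negative, these identities also imply $0 \leq I(Z_1; X \mid Z_2) \leq I(Z_1; X)$ and $0 \leq I(Z_1; Y \mid Z_2) \leq I(Z_1; Y)$.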
Lemma 2.
Given and , let us define the conditional Information Bottleneck objective:
(13) 
If and do not satisfy either of the following conditions:

is a sufficient statistic of for and (i.e., ),

is independent of .
Then, depends on . Informally, if the conditional variable in the conditional Information Bottleneck objective is not a "trivial" transform of the bottleneck variable , induces a nontrivial topology into the conditional Information Bottleneck objective.
Proof.
By the definition of the conditional mutual information
depends on as long as the presence of in the conditional Information Bottleneck objective does not vanish (we will discuss the conditions for to vanish in the final part of this proof). Note that due to the Markov chain , we have . Thus, depends on as long as does not vanish in the objective. The same result is applied to ; hence depends on (note that prevents the collapse of when summing two mutual information) if does not vanish in the objective.
Now we discuss the vanishing condition for in the objective. Note that it follows from Lemma 1 that:
It is easy to see that vanishes in the conditional Information Bottleneck objective iff each of the mutual information terms in does not depend on , iff the equality conditions for the inequalities immediately above are met. Note that if , then we have (i.e., is a sufficient statistic for and ). This implies that . Similarly, implies that is independent of , which in turn implies that . ∎
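The two boundary cases discussed in the proof above can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses a generic discrete Markov chain X → Z1 → Z2 (these names and the random conditional tables are hypothetical and are not the paper's notation) and relies on the standard chain-rule identity I(X;Z1|Z2) = I(X;Z1) − I(X;Z2) that holds for such a chain. It then verifies that the conditional term vanishes when Z2 is a copy of Z1 and equals I(X;Z1) when Z2 is independent of Z1.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats of a probability table of any shape."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def mi(p_ab):
    """I(A;B) = H(A) + H(B) - H(A,B) for a joint table p_ab[a, b]."""
    return entropy(p_ab.sum(axis=1)) + entropy(p_ab.sum(axis=0)) - entropy(p_ab)

def cond_mi(p_abc):
    """I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C) for p_abc[a, b, c]."""
    return (entropy(p_abc.sum(axis=1)) + entropy(p_abc.sum(axis=0))
            - entropy(p_abc) - entropy(p_abc.sum(axis=(0, 1))))

def random_conditional(rng, n_in, n_out):
    """Random row-stochastic table p(out | in); purely illustrative."""
    t = rng.random((n_in, n_out))
    return t / t.sum(axis=1, keepdims=True)

def chain_joint(p_x, p_z1_given_x, p_z2_given_z1):
    """Joint p(x, z1, z2) = p(x) p(z1|x) p(z2|z1) of the chain X -> Z1 -> Z2."""
    return (p_x[:, None, None] * p_z1_given_x[:, :, None]
            * p_z2_given_z1[None, :, :])

rng = np.random.default_rng(0)
p_x = np.array([0.5, 0.3, 0.2])
p_z1_given_x = random_conditional(rng, 3, 4)

# Generic case: I(X;Z1|Z2) = I(X;Z1) - I(X;Z2).
p = chain_joint(p_x, p_z1_given_x, random_conditional(rng, 4, 2))
generic_gap = cond_mi(p) - (mi(p.sum(axis=2)) - mi(p.sum(axis=1)))

# Z2 is a copy of Z1 (a sufficient statistic): the conditional term vanishes.
p_copy = chain_joint(p_x, p_z1_given_x, np.eye(4))
copy_term = cond_mi(p_copy)

# Z2 is independent of Z1: conditioning on Z2 changes nothing.
p_indep = chain_joint(p_x, p_z1_given_x, np.full((4, 2), 0.5))
indep_gap = cond_mi(p_indep) - mi(p_indep.sum(axis=2))

print(generic_gap, copy_term, indep_gap)
```

All three printed values should be zero up to floating-point error; between the two boundary cases the conditional term genuinely depends on the choice of p(z2|z1).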
We now prove Theorem 3.1 by contradiction, using the lemmas above.
First we prove that if and satisfy neither condition (a) nor (b) in the theorem, the constrained minimization problems are conflicting. Assume, by contradiction, that there exists a solution that minimizes both and simultaneously, i.e., s.t. has a minimum at and has a minimum at . Note that and are independent variables for the optimizations. By introducing Lagrangian multipliers and for the constraint of and , respectively, we obtain:
(14)  
(15) 
where
(16)  
(17) 
It follows from Lemma 1 that:
(18) 
where (defined as in Lemma 2). Taking the derivative of both sides of Eq. 18 w.r.t. and noting that , we have:
(19) 
Notice that the left-hand side of Eq. 19 strictly depends on (Lemma 2) while the right-hand side is independent of . This contradiction implies that the initial existence assumption is invalid, which establishes the conclusion of Theorem 3.1.
The converse direction is straightforward. When and satisfy condition (a) (i.e., and ) or (b) (i.e., and ) in the theorem, there is effectively only one optimization problem for , which reduces to the original Information Bottleneck (with a single bottleneck) [Tishby99]. After solving for from the Information Bottleneck optimization, we can construct as a sufficient statistic of . ∎
Appendix B. Proof of Theorem 3.2
Proof of Theorem 3.2.
The first claim of the theorem immediately follows from the definition of VCR:
(20) 
For the second claim of the theorem, it follows from the Markov assumption in Equation (1) and from Jensen’s inequality, respectively, that:
(21) 
(22) 
Thus, we have
(23)  
(24) 
Appendix C. Detailed derivations of Approximate Compression
(25) 
Appendix D. Detailed experimental description in classification experiment
Following common practice, we used the last 10,000 images of the training set as a validation (holdout) set for tuning hyperparameters. We then retrained the models from scratch on the full training set with the best validated configuration. We trained each of the five models with the same set of 5 different initializations and reported the average results over the set. For the stochastic models (all except DET), we drew samples per stochastic layer during both training and inference, and performed inference times at test time to report the mean classification error for MNIST and the mean classification accuracy for CIFAR10. For JointIMB and GreedyIMB, we set (in JointIMB only) and , tuned on a linear log scale . We found worked best for both models. For VIB, we found that and worked best on MNIST and CIFAR10, respectively. We trained the models on MNIST with Adadelta [DBLP:journals/corr/abs12125701] and on CIFAR10 with Adagrad [DBLP:journals/jmlr/DuchiHS11] (except for VIB, for which we used Adam [DBLP:journals/corr/KingmaB14]), as we found that these optimizers worked best on the validation set.
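The evaluation protocol described above can be sketched as follows. This is a minimal illustration, not the code used for the experiments: `dummy_train_and_eval` is a hypothetical stand-in for training a model and returning its test error, the seed values are arbitrary, and the placeholder arrays only mimic the size of the MNIST training set (60,000 examples).

```python
import numpy as np

def split_holdout(images, labels, n_val=10_000):
    """Hold out the last n_val training examples as a validation set."""
    train = (images[:-n_val], labels[:-n_val])
    val = (images[-n_val:], labels[-n_val:])
    return train, val

def average_over_seeds(train_and_eval, seeds):
    """Run the same configuration once per seed and average the scores."""
    scores = np.array([train_and_eval(seed) for seed in seeds])
    return float(scores.mean()), float(scores.std())

def dummy_train_and_eval(seed):
    """Hypothetical stand-in: returns a fake test error (%) that depends on the seed."""
    rng = np.random.default_rng(seed)
    return 1.0 + 0.1 * rng.random()

# Placeholder data with the size of the MNIST training set.
images = np.arange(60_000).reshape(-1, 1)
labels = np.arange(60_000) % 10
(train_x, train_y), (val_x, val_y) = split_holdout(images, labels)

seeds = [0, 1, 2, 3, 4]  # five initializations, shared by all models
mean_err, std_err = average_over_seeds(dummy_train_and_eval, seeds)
print(f"train={len(train_x)}, val={len(val_x)}, "
      f"error={mean_err:.3f} +/- {std_err:.3f}")
```

Sharing the same list of seeds across models keeps the comparison paired: differences in the averaged score then reflect the learning principle rather than a lucky initialization.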
Appendix E. Learning dynamics of GreedyIMB
We further visualize the learning dynamics of GreedyIMB in Figure 2. GreedyIMB at needs only about of the training epochs to achieve at least the same level of relevance in all layers of SFNN at the final epoch. Recall that in GreedyIMB at the PIB principle is applied to the first hidden layer only. The layer representation at the final epoch gradually shifts to the left (i.e., becomes more compressed) without degrading the relevance as the greedy training proceeds from layer to layer in Figure 2. We also see that the compression constraints within the IMB framework keep the layer representation from shifting to the right (in the information plane) during the encoding of relevant information.
Appendix F. VCR decomposition for a multivariate target variable
We prove that the VCR of level for a multivariate variable can be decomposed as the sum of the VCRs of its vector elements. Indeed, consider . Since the neurons within a layer are conditionally independent given the previous layer, we have:
which implies the claim about the decomposability of the VCR for a multivariate target variable. ∎
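The factorization argument can be illustrated numerically. The sketch below is an assumption-laden illustration rather than the paper's definition: it treats the relevance term as an expected log-likelihood of a Bernoulli output layer whose units are conditionally independent given the previous layer, and the weights `W`, bias `b`, hidden samples `h` and targets `y` are random placeholders. Under that assumption the log-probability of the whole target vector is the sum of the per-element log-probabilities, so the sample average decomposes element by element.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def bernoulli_log_prob(y, p):
    """Element-wise log p(y_i | h) for Bernoulli units with success probability p."""
    return y * np.log(p) + (1.0 - y) * np.log(1.0 - p)

rng = np.random.default_rng(0)
n_samples, n_hidden, n_out = 5, 8, 3

h = rng.random((n_samples, n_hidden))              # placeholder previous-layer samples
W = rng.normal(size=(n_hidden, n_out))             # placeholder weights
b = rng.normal(size=n_out)                         # placeholder bias
y = rng.integers(0, 2, size=(n_samples, n_out)).astype(float)  # multivariate target

p = sigmoid(h @ W + b)                             # p(y_i = 1 | h) for each unit

# Joint likelihood of the whole vector: product over conditionally independent units.
joint_likelihood = np.prod(np.where(y == 1.0, p, 1.0 - p), axis=1)
joint_term = float(np.mean(np.log(joint_likelihood)))

# Per-element terms: one averaged log-likelihood per target component.
element_terms = bernoulli_log_prob(y, p).mean(axis=0)

print(joint_term, element_terms.sum())
```

The two printed numbers agree up to floating-point error, which is the decomposition used in the claim above, here for one particular choice of output distribution.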