# Good Initializations of Variational Bayes for Deep Models

Simone Rossi (EURECOM, France, rossi@eurecom.fr), Pietro Michiardi (EURECOM, France, michiard@eurecom.fr), Maurizio Filippone (EURECOM, France, filippone@eurecom.fr)

###### Abstract

Stochastic variational inference is an established way to carry out approximate Bayesian inference for deep models. While there have been effective proposals for good initializations for loss minimization in deep learning, far less attention has been devoted to the issue of initialization of stochastic variational inference. We address this by proposing a novel layer-wise initialization strategy based on Bayesian linear models. The proposed method is extensively validated on regression and classification tasks, including Bayesian DeepNets and ConvNets, showing faster convergence compared to alternatives inspired by the literature on initializations for loss minimization.

## 1 Introduction

Deep Neural Networks (dnns) and Convolutional Neural Networks (cnns) have become the preferred choice to tackle various supervised learning problems, such as regression and classification, due to their ability to model complex problems and the mature development of regularization techniques to control overfitting (LeCun et al., 2015; Srivastava et al., 2014). There has been a recent surge of interest in the issues associated with their overconfidence in predictions, and proposals to mitigate these (Guo et al., 2017; Kendall and Gal, 2017; Lakshminarayanan et al., 2017). Bayesian techniques offer a natural framework to deal with such issues, but they are characterized by computational intractability (Bishop, 2006; Ghahramani, 2015).

A popular way to recover tractability is to use variational inference (Jordan et al., 1999). In variational inference, an approximate posterior distribution is introduced and its parameters are adapted by optimizing a variational objective, which is a lower bound to the marginal likelihood. The variational objective can be written as the sum of an expectation of the log-likelihood under the approximate posterior and a regularization term which is the negative Kullback-Leibler (kl) divergence between the approximating distribution and the prior over the parameters. Stochastic Variational Inference (svi) offers a practical way to carry out stochastic optimization of the variational objective. In svi, stochasticity is introduced with a doubly stochastic approximation of the expectation term, which is unbiasedly approximated using Monte Carlo and by selecting a subset of the training points (mini-batching) (Graves, 2011; Kingma and Welling, 2014).

While svi is an attractive and practical way to perform approximate inference for dnns, there are limitations. For example, the form of the approximating distribution can be too simple to accurately approximate complex posterior distributions (Ha et al., 2016; Ranganath et al., 2015; Rezende and Mohamed, 2015). Furthermore, svi increases the number of optimization parameters compared to optimizing model parameters through, e.g., loss minimization; for example, a fully factorized Gaussian posterior over model parameters doubles the number of parameters in the optimization compared to loss minimization. This has motivated research into other ways to perform approximate Bayesian inference for dnns by establishing connections between variational inference and dropout (Gal and Ghahramani, 2016a, b; Gal et al., 2017).

The development of a theory to fully understand the optimization landscape of dnns and cnns is still in its infancy (Dziugaite and Roy, 2017) and most works have focused on the practical aspects characterizing the optimization of their parameters (Duchi et al., 2011; Kingma and Ba, 2015; Srivastava et al., 2014). If this lack of theory is apparent for optimization of model parameters, this is even more so for the understanding of the optimization landscape of the objective in variational inference, where variational parameters enter in a nontrivial way in the objective (Graves, 2011; Rezende et al., 2014). However, the use of stochastic gradient optimization, which is guaranteed to asymptotically converge to a local optimum, motivates us to investigate ways to position the optimizer closer to a local optimum from the outset. Initialization in svi plays a major role in the convergence of the learning process; the illustrative example in Figure 1 shows how a poor initialization can prevent svi from converging to good solutions in a short amount of time, even for simple problems.

In this work, we focus on this issue affecting svi for dnns and cnns. While there is an established literature on ways to initialize model parameters of dnns when minimizing their loss (Glorot and Bengio, 2010; Saxe et al., 2013; Mishkin and Matas, 2015), to the best of our knowledge, there is no study on the equivalent for svi for Bayesian dnns and cnns. Inspired by the literature on residual networks (He et al., 2016) and greedy initialization of dnns (Bengio et al., 2006; Mishkin and Matas, 2015), we propose Iterative Bayesian Linear Modeling (i-blm), which is an initialization strategy for svi grounded on Bayesian linear modeling. Iterating from the first layer, i-blm initializes the posterior at layer $l$ by learning Bayesian linear models which regress from the input, propagated up to layer $l$, to the labels.

We show how i-blm can be applied in a scalable way and without considerable overhead to regression and classification problems, and how it can be applied to initialize svi not only for dnns but also for cnns. Through a series of experiments, we demonstrate that i-blm leads to faster convergence compared to other initializations inspired by the work on loss minimization for dnns. Furthermore, we show that i-blm makes it possible for svi with a Gaussian approximation applied to cnns to compete with Monte Carlo Dropout (mcd; Gal and Ghahramani (2016b)), which is currently the state-of-the-art method to perform approximate inference for cnns. In all, thanks to the proposed initialization, we make it possible to reconsider Gaussian svi for dnns and cnns as a valid competitor to mcd, as well as highlight the limitations of svi with a Gaussian posterior in applications involving cnns.

In summary, in this work we make the following contributions: (1) we propose a novel way to initialize svi for dnns based on Bayesian linear models; (2) we show how this can be done for regression and classification; (3) we show how to apply our initialization strategy to cnns; (4) we empirically demonstrate that our proposal allows us to achieve performance superior to other initializations of svi inspired by the literature on loss minimization; (5) for the first time, we achieve state-of-the-art performance with Gaussian svi for cnns.

## 2 Related Works

The problem of initialization of weights and biases in dnns for gradient-based loss minimization has been extensively tackled in the literature since early breakthroughs in the field (Rumelhart et al., 1986; Baldi and Hornik, 1989). LeCun (1998) is one of the seminal papers discussing practical tricks to achieve an efficient loss minimization through back-propagation.

More recently, Bengio et al. (2006) propose a greedy layer-wise unsupervised pre-training that proved to help optimization and generalization. A justification can be found in Erhan et al. (2010), where the authors show that pre-training can act as regularization; by initializing the parameters in a region corresponding to a better basin of attraction for the optimization procedure, the model can reach a better local minimum and increase its generalization capabilities. Glorot and Bengio (2010) propose a simple way to estimate the variance for random initialization of weights that makes it possible to avoid saturation both in forward and back-propagation steps. Another possible strategy can be found in Saxe et al. (2013), who investigate the dynamics of gradient descent optimization and propose an initialization based on random orthogonal initial conditions. This algorithm takes a weight matrix filled with Gaussian noise, decomposes it onto an orthonormal basis using a singular value decomposition and replaces the weights with one of the components. Building on this work, Mishkin and Matas (2015) propose a data-driven weight initialization by scaling the orthonormal matrix of weights to make the variance of the output as close as possible to one.

Variational inference addresses the problem of intractable Bayesian inference by reinterpreting inference as an optimization problem. Its origins can be traced back to early works by MacKay (1992), Hinton and van Camp (1993), and Neal (1997). More recently, Graves (2011) proposes a practical way to carry out variational inference using stochastic optimization (Duchi et al., 2011; Zeiler, 2012; Sutskever et al., 2013; Kingma and Ba, 2015). Kingma and Welling (2014) propose a reparameterization trick that allows for the optimization of the variational lower bound through automatic differentiation. To decrease the variance of stochastic gradients, which impacts convergence speed, this work was extended using the so-called local reparameterization trick, where the sampling from the approximate posterior over model parameters is replaced by the sampling from the resulting distribution over the dnn units (Kingma et al., 2015).

In the direction of finding richer posterior families for variational inference, we mention the works on Normalizing Flows, a series of transformations that, starting from simple distributions, can capture complex structures in the variational posterior (Rezende and Mohamed, 2015; Kingma et al., 2016; Louizos and Welling, 2017; Huang et al., 2018). Alternatives can be found in Stein variational inference (Liu and Wang, 2016), quasi-Monte Carlo variational inference (Buchholz et al., 2018) and variational boosting (Miller et al., 2017).

To the best of our knowledge, there is no study that either empirically or theoretically addresses the problem of initialization of parameters for variational inference, and this work aims to fill this gap.

## 3 Preliminaries

In this section we introduce some background material on Bayesian dnns and svi.

### 3.1 Bayesian Deep Neural Networks

Bayesian dnns are statistical models whose parameters (weights and biases) are assigned a prior distribution and inferred using Bayesian inference techniques. Bayesian dnns inherit the modeling capacity of dnns while allowing for quantification of uncertainty in model parameters and predictions. Considering an input $x$ and a corresponding output $y$, the relation between inputs and outputs can be seen as a composition of nonlinear vector-valued functions $f^{(l)}$, one for each hidden layer:

$$y = f(x) = \left(f^{(L-1)} \circ \cdots \circ f^{(0)}\right)(x). \tag{1}$$

Let $W$ be the collection of all model parameters (weights and biases) at all hidden layers. Each neuron computes its output as

$$f_i^{(l)} = \phi\left(w_i^{(l)\top} f^{(l-1)}\right), \tag{2}$$

where $\phi(\cdot)$ denotes a so-called activation function which introduces a nonlinearity at each layer. Note that we absorbed the biases in the weight vectors $w_i^{(l)}$.
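As an illustration, the layer composition in Equations (1) and (2) can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' implementation: the function name `forward`, the choice of `tanh` as activation, and the convention of absorbing the bias by appending a constant-one feature to each layer input are all assumptions made for the example.

```python
import numpy as np


def forward(x, weights, phi=np.tanh):
    """Evaluate f = f^(L-1) o ... o f^(0) on a batch x of shape (n, d_in).

    Each entry of `weights` has shape (d_prev + 1, d_out): the extra row
    plays the role of the bias (hypothetical convention for this sketch).
    """
    h = x
    for w in weights:
        # Absorb the bias by appending a constant-one input feature.
        h = np.hstack([h, np.ones((h.shape[0], 1))])
        # Equation (2): linear map followed by the elementwise nonlinearity.
        h = phi(h @ w)
    return h
```

For a network with 3 inputs, 5 hidden units and 2 outputs, `weights` would hold arrays of shape `(4, 5)` and `(6, 2)`.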

Given a prior over $W$, the objective of Bayesian inference is to find the posterior distribution over all model parameters using the available input data $X$ associated with labels $Y$:

$$p(W \mid X, Y) = \frac{p(Y \mid X, W)\, p(W)}{p(Y \mid X)}. \tag{3}$$

Bayesian inference for dnns is analytically intractable and it is necessary to resort to approximations. One way to recover tractability is through the use of variational inference techniques as described next.

### 3.2 Stochastic Variational Inference

In variational inference, we introduce a family of distributions $q_\theta(W)$, parameterized through $\theta$, and attempt to find an element of this family which is as close to the posterior distribution of interest as possible (Jordan et al., 1999). This can be formulated as a minimization with respect to $\theta$ of the kl divergence (Kullback, 1959) between the elements of the family and the posterior:

$$\tilde{q}_\theta(W) = \arg\min_\theta \left\{ \textsc{kl}\left(q_\theta(W) \,\|\, p(W \mid X, Y)\right) \right\}. \tag{4}$$

Simple manipulations allow us to rewrite this expression as the negative lower bound (nelbo) to the marginal likelihood of the model (see supplementary material)

$$\textsc{nelbo} = \textsc{nll} + \textsc{kl}\left(q_\theta(W) \,\|\, p(W)\right), \tag{5}$$

where the first term is the expected negative log-likelihood $\textsc{nll} = \mathbb{E}_{q_\theta}[-\log p(Y \mid X, W)]$, and the second term acts as a regularizer, penalizing distributions that deviate too much from the prior. When the likelihood factorizes across data points, we can obtain an unbiased estimate of the expectation term by randomly selecting a mini-batch $B$ of $m$ out of $n$ training points:

$$\textsc{nll} \approx -\frac{n}{m} \sum_{(x, y) \in B} \mathbb{E}_{q_\theta}\left[\log p(y \mid x, W)\right]. \tag{6}$$

Each term in the sum can be further unbiasedly estimated using $N_{\mathrm{MC}}$ Monte Carlo samples as

$$\mathbb{E}_{q_\theta}\left[\log p(y \mid x, W)\right] \approx \frac{1}{N_{\mathrm{MC}}} \sum_{i=1}^{N_{\mathrm{MC}}} \log p(y \mid x, W_i), \tag{7}$$

where $W_i \sim q_\theta(W)$. Following Kingma and Welling (2014), each sample is constructed using the reparameterization trick, which yields a deterministic dependence of the nelbo on $\theta$. Alternatively, it is possible to determine the distribution of the dnn units before activation from $q_\theta(W)$. This trick, known as the local reparameterization trick, considerably reduces the variance of the stochastic gradient w.r.t. $\theta$ and achieves faster convergence, as shown by Kingma et al. (2015).
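To make the doubly stochastic estimate concrete, the following NumPy sketch evaluates the nelbo for the simplest possible case: a single linear layer with a Gaussian likelihood, a fully factorized Gaussian $q_\theta$ and a zero-mean isotropic Gaussian prior. The function names, the noise variance `noise_var`, the prior variance `prior_var` and the default number of samples are assumptions chosen for illustration; the sketch only computes the objective and leaves gradients to an automatic differentiation framework.

```python
import numpy as np


def kl_factorized_gaussians(mu, var, prior_var):
    """KL( N(mu, diag(var)) || N(0, prior_var * I) ), summed over all weights."""
    return 0.5 * np.sum(var / prior_var + mu ** 2 / prior_var
                        - 1.0 + np.log(prior_var / var))


def nelbo_estimate(x_batch, y_batch, n_train, mu, var,
                   noise_var=0.1, prior_var=1.0, n_mc=8, rng=None):
    """Doubly stochastic estimate of the nelbo for a linear-Gaussian model.

    x_batch: (m, d) mini-batch inputs, y_batch: (m,) mini-batch targets,
    mu, var: (d,) means and variances of the factorized Gaussian q.
    """
    rng = np.random.default_rng() if rng is None else rng
    m = x_batch.shape[0]
    expected_loglik = 0.0
    for _ in range(n_mc):
        # Reparameterization trick: W_i = mu + sigma * eps with eps ~ N(0, I).
        eps = rng.standard_normal(mu.shape)
        w = mu + np.sqrt(var) * eps
        resid = y_batch - x_batch @ w
        # Gaussian log-likelihood of the mini-batch under the sampled weights.
        loglik = -0.5 * np.sum(resid ** 2 / noise_var
                               + np.log(2.0 * np.pi * noise_var))
        expected_loglik += loglik / n_mc
    # Rescale the mini-batch term by n / m to estimate the full-data nll.
    nll = -(n_train / m) * expected_loglik
    return nll + kl_factorized_gaussians(mu, var, prior_var)
```

When `mu` is zero and `var` equals `prior_var`, the kl term vanishes and the estimate reduces to the rescaled mini-batch negative log-likelihood.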

## 4 Proposed Method

In this section, we introduce our proposed Iterative Bayesian Linear Model (i-blm) initialization for svi. We first introduce i-blm for regression with dnns, and we then show how this can be extended to classification and to cnns.

### 4.1 Initialization of DNNs for Regression

In order to initialize the weights of dnns, we proceed iteratively as follows. Given a fully-factorized Gaussian approximation $q_\theta(W)$, starting from the first layer, we can set the parameters of $q_\theta(W^{(0)})$ by running Bayesian linear regression with inputs $X$ and labels $Y$. After this, we initialize the approximate posterior over the weights at the second layer, $q_\theta(W^{(1)})$, by running Bayesian linear regression with inputs $\phi(X \tilde{W}^{(0)})$ and labels $Y$. Here, $\phi(\cdot)$ denotes the elementwise application of the activation function to the argument, whereas $\tilde{W}^{(0)}$ is a sample from $q_\theta(W^{(0)})$. We then proceed iteratively in the same way up to the last layer.

The intuition behind i-blm is as follows. If one layer is enough to capture the complexity of a regression task, we expect to be able to learn an effective mapping right after the initialization of the first layer. In this case, we also expect that the mapping at the next layers implements simple transformations, close to the identity. Learning a set of weights with these characteristics starting from a random initialization is extremely hard; this motivated the work in He et al. (2016) that proposed the residual network architecture. Our i-blm initialization takes this observation as an inspiration to perform initialization for svi for general deep models.

In practice, i-blm proceeds as follows. Before applying the nonlinearity through the activation function, each layer in a Bayesian dnn can be seen as a multivariate Bayesian linear regression model. This is equivalent to a set of univariate Bayesian linear models, one for each output neuron of the layer. Instead of using the entire training set to learn the linear models, each one of these is inferred based on a random mini-batch of data, whose inputs are propagated through the previous layers. The complexity of i-blm is linear in the batch size and cubic in the number of neurons to be initialized. Figure 2 gives an illustration of the proposed method for a simple architecture.
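The layer-wise procedure described above can be sketched as follows. This is a simplified illustration under several assumptions that are not taken from the paper: scalar regression targets, no bias terms, a zero-mean isotropic Gaussian prior with precision `prior_prec`, a fixed noise precision `noise_prec`, and `tanh` activations. The names `blr_posterior` and `iblm_init` are hypothetical.

```python
import numpy as np


def blr_posterior(phi, y, prior_prec=1.0, noise_prec=10.0):
    """Posterior mean and precision of Bayesian linear regression.

    Model: y ~ N(phi @ w, 1 / noise_prec), prior w ~ N(0, I / prior_prec).
    """
    d = phi.shape[1]
    precision = prior_prec * np.eye(d) + noise_prec * phi.T @ phi
    mean = noise_prec * np.linalg.solve(precision, phi.T @ y)
    return mean, precision


def iblm_init(x, y, layer_sizes, batch_size=64, act=np.tanh, rng=None):
    """Initialize factorized Gaussian posteriors layer by layer.

    x: (n, d_in) inputs, y: (n,) targets, layer_sizes: output width of
    each layer. Returns one (mean, var) pair of shape (d_prev, d_out) per layer.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = x.shape[0]
    params = []
    for d_out in layer_sizes:
        d_prev = x.shape[1] if not params else params[-1][0].shape[1]
        mean = np.empty((d_prev, d_out))
        var = np.empty((d_prev, d_out))
        for j in range(d_out):
            # One univariate linear model per output neuron, each fitted
            # on its own random mini-batch.
            idx = rng.choice(n, size=min(batch_size, n), replace=False)
            h = x[idx]
            # Propagate the mini-batch through the layers initialized so far,
            # using one weight sample from each approximate posterior.
            for m_prev, v_prev in params:
                w = m_prev + np.sqrt(v_prev) * rng.standard_normal(m_prev.shape)
                h = act(h @ w)
            post_mean, post_prec = blr_posterior(h, y[idx])
            mean[:, j] = post_mean
            # Factorized approximation: variances from the precision diagonal.
            var[:, j] = 1.0 / np.diag(post_prec)
        params.append((mean, var))
    return params
```

Each call to `blr_posterior` solves a linear system whose size is the width of the previous layer, which is where the cubic cost arises in this sketch; the cost in the mini-batch size is linear.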

#### 4.1.1 Fully factorized Gaussian posterior

In the case of a fully factorized approximate posterior over the weights, we are interested in obtaining the best approximation to the solution of Bayesian linear regression, which in general yields a posterior that is not factorized. For simplicity of notation, let $w$ be the parameters of interest in Bayesian linear regression for a given output $y$. We can formulate this problem as minimizing the kl divergence from the factorized Gaussian $q(w)$ to the actual posterior $p(w \mid X, y)$. This results in the mean of $q(w)$ being equal to the mean of $p(w \mid X, y)$ and in the variances $s_i^2 = 1 / (\Sigma^{-1})_{ii}$, where $\Sigma$ denotes the covariance of $p(w \mid X, y)$; see the supplementary material for the full derivation.
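For reference, the standard argument behind this result can be sketched as follows. The notation is an assumption made for this sketch: $m$ and $s_i^2$ denote the mean and variances of the factorized Gaussian, and $\mu$ and $\Sigma$ the mean and covariance of the exact Gaussian posterior over $d$ weights. The full derivation by the authors is in their supplementary material and may be organized differently.

```latex
\begin{align*}
q(w) &= \mathcal{N}\big(w \mid m, \mathrm{diag}(s_1^2, \dots, s_d^2)\big), \qquad
p(w \mid X, y) = \mathcal{N}(w \mid \mu, \Sigma), \\
\mathrm{KL}(q \,\|\, p) &= \tfrac{1}{2} \Big[ \sum_{i=1}^{d} (\Sigma^{-1})_{ii}\, s_i^2
  + (m - \mu)^\top \Sigma^{-1} (m - \mu) - d
  + \log\det\Sigma - \sum_{i=1}^{d} \log s_i^2 \Big], \\
\frac{\partial\, \mathrm{KL}}{\partial m} &= \Sigma^{-1}(m - \mu) = 0
  \;\Rightarrow\; m = \mu, \\
\frac{\partial\, \mathrm{KL}}{\partial s_i^2} &= \tfrac{1}{2} \Big[ (\Sigma^{-1})_{ii} - \frac{1}{s_i^2} \Big] = 0
  \;\Rightarrow\; s_i^2 = \frac{1}{(\Sigma^{-1})_{ii}}.
\end{align*}
```

The optimal variances are therefore the reciprocals of the diagonal of the posterior precision matrix, not the diagonal of the posterior covariance, so the factorized approximation is never wider than the exact marginals.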

### 4.2 Initialization for Classification

In this section we show how our proposal can be extended to multi-class classification problems. We assume a one-hot encoding of the labels, so that $Y$ is a matrix of zeros and ones with one row per training point and one column per class (with a single one in each row of $Y$). Recently, it has been shown that it is possible to obtain an accurate modeling of the posterior over classification functions by applying regression on a transformation of the labels (Milios et al., 2018). This is interesting because it allows us to apply Bayesian linear regression as before in order to initialize svi for dnns.

The transformation of the labels is based on the formalization of a simple intuition, which is the inversion of the softmax transformation. One-hot encoded labels are viewed as a set of parameters of a degenerate Dirichlet distribution. We resolve the degeneracy of the Dirichlet distribution by adding a small regularization constant to the parameters. At this point, we leverage the fact that Dirichlet distributed random variables can be constructed as a ratio of Gamma random variables, that is, if $x_i \sim \mathrm{Gamma}(a_i, b)$ independently, then

$$\frac{x_i}{\sum_j x_j} \sim \mathrm{Dir}(a). \tag{8}$$

We can then approximate the Gamma random variables with log-Normals by moment matching. By doing so, we obtain a representation of the labels which allows us to use standard regression with a Gaussian likelihood, and which retrieves an approximate Dirichlet when mapping predictions back using the softmax transformation (8). As a result, the latent functions obtained represent probabilities of class labels.

The only small complication is that the transformation imposes a different noise level for labels that are $0$ or $1$, and this is because of the non-symmetric nature of the transformation. Nevertheless, it is a simple matter to extend Bayesian linear regression to handle heteroscedasticity; see the supplement for a derivation of heteroscedastic Bayesian linear regression and Milios et al. (2018) for more insights on the transformation to apply regression on classification problems.
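The label transformation can be illustrated with a short sketch. It matches the mean and variance of a Gamma variable with shape $\alpha$ and unit rate (both equal to $\alpha$) to those of a log-Normal, which gives a log-variance $\tilde{\sigma}^2 = \log(1/\alpha + 1)$ and a regression target $\tilde{y} = \log\alpha - \tilde{\sigma}^2/2$. The function names and the default value `alpha_eps=0.01` are assumptions made for this example, not values taken from the text.

```python
import math


def dirichlet_label_transform(one_hot_row, alpha_eps=0.01):
    """Map one-hot labels to Gaussian regression targets and noise variances.

    Each Dirichlet parameter alpha = label + alpha_eps is treated as the shape
    of a unit-rate Gamma variable, which is moment-matched by a log-Normal.
    """
    targets, variances = [], []
    for label in one_hot_row:
        alpha = label + alpha_eps
        var = math.log(1.0 / alpha + 1.0)       # log-Normal variance parameter
        targets.append(math.log(alpha) - 0.5 * var)  # log-Normal mean parameter
        variances.append(var)
    return targets, variances


def softmax(values):
    """Map regression outputs back to class probabilities."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]
```

The returned variances differ between entries equal to 0 and entries equal to 1, which is the heteroscedasticity that the regression step has to account for.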

### 4.3 Initialization of CNNs

The same method can also be applied to cnns. Convolutional layers are commonly implemented as a matrix multiplication (i.e., as a linear model) between a batched matrix of patches and a reshaped filter matrix (Jia, 2014). Rather than using the outputs of the previous layer as they are, for convolutional layers each Bayesian linear model learns the mapping from patches to output features.
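The view of a convolution as a linear model on patches can be sketched as follows. The sketch assumes channels-last images, stride 1 and no padding, and the names `im2col` and `conv2d_as_matmul` are chosen for illustration. The rows of the patch matrix are the inputs on which a Bayesian linear model would be fitted, and each column of the reshaped filter matrix corresponds to one output feature.

```python
import numpy as np


def im2col(images, k):
    """Extract all k x k patches (stride 1, no padding).

    images: (n, h, w, c). Returns a (n * out_h * out_w, k * k * c) patch
    matrix together with the output layout (n, out_h, out_w).
    """
    n, h, w, c = images.shape
    out_h, out_w = h - k + 1, w - k + 1
    patches = np.empty((n, out_h, out_w, k * k * c))
    for i in range(out_h):
        for j in range(out_w):
            # Flatten each patch in (row, column, channel) order.
            patches[:, i, j, :] = images[:, i:i + k, j:j + k, :].reshape(n, -1)
    return patches.reshape(-1, k * k * c), (n, out_h, out_w)


def conv2d_as_matmul(images, filters):
    """Convolution (cross-correlation) as a single matrix product.

    filters: (k, k, c, n_filters), flattened in the same order as the patches.
    """
    k, _, c, n_filters = filters.shape
    patches, (n, out_h, out_w) = im2col(images, k)
    out = patches @ filters.reshape(k * k * c, n_filters)
    return out.reshape(n, out_h, out_w, n_filters)
```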

## 5 Results

In this section, we compare different initialization algorithms for svi to demonstrate the effectiveness of i-blm. We propose a number of competitors inspired by the literature developed for loss minimization in dnns and cnns. In the case of cnns, we also compare with Monte Carlo Dropout (mcd; Gal and Ghahramani (2016a)), which is the state-of-the-art for inference in Bayesian cnns.

As we discussed, in the case of a fully factorized Gaussian approximation we double the number of parameters compared to loss minimization, as we have a mean and a variance parameter for each weight. At each layer, given the fully-factorized variational distribution over its weights, we initialize the means and variances with the following methods.

##### Uninformative

The optimization of the posterior starts from the prior distribution. Note that this yields an initial kl divergence in the nelbo equal to zero.

##### Random Heuristic

An extension to commonly used heuristic with and , with the number of input features at layer .

##### Xavier Normal

Originally proposed by Glorot and Bengio (2010), it samples all weights independently from a Gaussian distribution with zero mean and variance $2 / (D_{\mathrm{in}} + D_{\mathrm{out}})$, where $D_{\mathrm{in}}$ and $D_{\mathrm{out}}$ are the numbers of input and output units of the layer. This variance-based scaling avoids issues with vanishing or exploding gradients. Although this work considers only the case with linear activations, this initialization has been shown to work well in many applications. In this case, it is straightforward to extend it to svi; indeed, instead of sampling, we directly set the means to zero and the variances to the value above, knowing that the sampling is performed during the Monte Carlo estimate of the log-likelihood.
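A minimal sketch of this adaptation is given below. It assumes the usual Xavier normal variance $2 / (D_{\mathrm{in}} + D_{\mathrm{out}})$, and the function name `xavier_normal_svi_init` and its arguments `fan_in` and `fan_out` are hypothetical.

```python
import numpy as np


def xavier_normal_svi_init(fan_in, fan_out):
    """Variational parameters for one layer under the Xavier normal scheme.

    Instead of drawing the weights, the means are set to zero and every
    variance is set to the Xavier value; sampling happens later, inside the
    Monte Carlo estimate of the log-likelihood.
    """
    variance = 2.0 / (fan_in + fan_out)
    mean = np.zeros((fan_in, fan_out))
    var = np.full((fan_in, fan_out), variance)
    return mean, var
```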

##### Orthogonal

Starting from an analysis of learning dynamics of dnns with linear activations, Saxe et al. (2013) propose an initialization scheme with orthonormal weight matrices. The idea is to decompose a Gaussian random matrix onto an orthonormal basis, and use the resulting orthogonal matrix. We adapt this method for svi by initializing the mean matrix with the orthogonal matrix and . For our experiments, we use the implementation in pytorch (Paszke et al., 2017) provided by the Authors, which uses a QR-decomposition.
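The orthogonal mean matrix can be sketched with a QR decomposition in NumPy, in the spirit of the QR-based implementation mentioned above. This is an illustrative stand-in rather than the pytorch routine itself; the function name `orthogonal_mean_init` is hypothetical, and the choice of variances is deliberately left to the caller.

```python
import numpy as np


def orthogonal_mean_init(fan_in, fan_out, rng=None):
    """Mean matrix of shape (fan_in, fan_out) with orthonormal rows or columns."""
    rng = np.random.default_rng() if rng is None else rng
    rows, cols = max(fan_in, fan_out), min(fan_in, fan_out)
    gaussian = rng.standard_normal((rows, cols))
    # Reduced QR: q has shape (rows, cols) with orthonormal columns.
    q, r = np.linalg.qr(gaussian)
    # Flip column signs so that the diagonal of r is non-negative,
    # which makes the factorization unique.
    q = q * np.sign(np.diag(r))
    return q if fan_in >= fan_out else q.T
```

When `fan_in >= fan_out` the columns of the result are orthonormal; otherwise the rows are.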

##### Layer-Sequential Unit-Variance (LSUV)

Starting from orthogonal initialization, Mishkin and Matas (2015) propose to perform a layer sequential variance scaling of the weight matrix. By implementing a data-driven greedy initialization, it generalizes the results to any nonlinear activation function and even to any type of layers that can impact the variance of the activations. We implement Layer-Sequential Unit-Variance (lsuv) for the means, while the variances are set to .

### 5.1 Experiments

Throughout the experiments, we use the adam optimizer (Kingma and Ba, 2015) with learning rate , batch size , and Monte Carlo samples at training time and at test time. All experiments are run on a workstation equipped with two 16c/32t Intel Xeon CPUs and four NVIDIA Tesla P100 GPUs, with a maximum time budget of 24 hours (never reached). To better understand the effectiveness of different initializations, all learning curves are plotted w.r.t. training iteration rather than wall-clock time. In all experiments, even the ones involving millions of parameters, the initialization costs only a few seconds (a couple of minutes in the case of AlexNet), so we do not shift the curves as this cost is negligible in the whole training procedure.

##### Toy example

With this simple example we want to illustrate once more how i-blm works and how it can speed up the convergence of svi. We set up a regression problem considering the function corrupted by noise , with sampled uniformly in the interval . Figure 3 reports the output of a -layer dnn after different initializations. The figure shows that i-blm obtains a sensible initialization compared to the competitors.

##### Regression with a shallow architecture

In this experiment we compare initialization methods for a shallow dnn architecture on two datasets. The architecture used in these experiments has one single hidden layer with hidden neurons and relu activations. We impose that the approximate posterior has fully factorized covariance. Figure 4 shows the learning curves on the powerplant (, ) and protein (, ) datasets, repeated over five different train/test splits. i-blm allows for a better initialization compared to the competitors, leading to a lower root mean square error (rmse) and lower mean negative log-likelihood (mnll) on the test set for a given computational budget. We refer the reader to the supplementary material for a more detailed analysis of the results.

##### Regression with a deeper architecture

Similar considerations hold when increasing the depth of the model, keeping the same experimental setup. Figure 5 shows the progression of the rmse and mnll error metrics when using svi to infer parameters of a dnn with five hidden layers and hidden neurons per layer, and relu activations. Again, the proposed initialization allows svi to converge faster than when using other initializations.

##### Classification with a deep architecture

Using the same deep dnn architecture as in the last experiment (five hidden layers with neurons), we tested i-blm with classification problems on mnist (, ), eeg (, ), credit (, ) and spam (, ). Interestingly, with this architecture, some initialization strategies struggled to converge, e.g., uninformative on mnist and lsuv on eeg. The gains offered by i-blm are most striking on mnist. After less than training steps (less than an epoch), it can already reach a test accuracy greater than ; other initializations reach such performance much later during training. Even after epochs, svi initialized with i-blm provides on average an increase up to of accuracy at test time. Full results are reported in the supplementary material.

##### Experiments on CNNs

For this experiment, we implemented a Bayesian version of the original LeNet-5 architecture proposed by LeCun et al. (1998) with two convolutional layers of and filters, respectively and relu activations applied after all convolutional layers and fully-connected layers. We tested our framework on mnist and on cifar10. The only initialization strategies that achieve convergence are orthogonal and lsuv, along with i-blm; the other methods did not converge, meaning that they push the posterior back to the prior. Figure 7 reports the progression of error rate and mnll. For both mnist and cifar10, i-blm places the parameters where the network can consistently deliver better performance both in terms of error rate and mnll throughout the entire learning procedure.

##### Comparison with Monte Carlo Dropout

Monte Carlo Dropout (mcd; Gal and Ghahramani (2016b)) offers a simple and effective way to perform approximate Bayesian cnn inference, thanks to the connection that the authors have established between dropout and variational inference. In this experiment, we aim to compare and discuss benefits and disadvantages of using a Gaussian posterior approximation with respect to the Bernoulli approximation that characterizes mcd. For a fair comparison, we implemented the same LeNet-5 architecture and the same learning procedure as in Gal and Ghahramani (2016b). In particular, for mnist, the two convolutional layers have and filters, respectively. Dropout layers are placed after every convolutional and fully-connected layer with a dropout probability of . To replicate the results in Gal and Ghahramani (2016b), we used the same learning rate policy with , , and weight decay of . Figure 8 shows the learning curves. Monte Carlo Dropout achieves a state-of-the-art error rate, but the form assumed by mcd for the posterior is reflected in a higher mnll compared to svi with a Gaussian posterior. Provided with a nontrivial initialization, Gaussian svi can better fit the model and deliver a better quantification of uncertainty.

##### Analysis of out-of-sample uncertainty

One of the advantages of Bayesian inference is the possibility to reason about uncertainty. With this experiment, we aim to demonstrate that svi with a Gaussian approximate posterior is competitive with mcd in capturing uncertainty in predictions. To show this, we focus on a cnn with the LeNet-5 architecture. We run mcd and svi with a Gaussian approximate posterior with the proposed initialization on mnist. At test time, we carry out predictions on both mnist and not-mnist; the latter is a dataset equivalent to mnist in input dimensions () and number of classes, but it represents letters rather than numbers. This experimental setup is often used to check that the entropy of the predictions on not-mnist is actually higher than the entropy of the predictions on mnist. We report the entropy of the predictions on mnist and not-mnist in Figure 9. mcd and svi behave similarly on mnist, but on not-mnist the histogram of the entropy indicates that svi yields a slightly higher uncertainty compared to mcd.
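The entropy used in this kind of analysis can be computed as in the following sketch, which assumes that the predictive distribution for a test input is obtained by averaging the class probabilities over Monte Carlo samples of the weights. The function name `predictive_entropy` is hypothetical, and the exact evaluation protocol behind Figure 9 is not specified here.

```python
import math


def predictive_entropy(mc_probs):
    """Entropy (in nats) of the Monte Carlo averaged predictive distribution.

    mc_probs: list of class-probability vectors for a single test input,
    one vector per Monte Carlo sample of the weights.
    """
    n_mc = len(mc_probs)
    n_classes = len(mc_probs[0])
    # Average the class probabilities over the Monte Carlo samples.
    mean_probs = [sum(p[k] for p in mc_probs) / n_mc for k in range(n_classes)]
    # Terms with zero probability contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in mean_probs if p > 0.0)
```

The entropy is zero for a confident prediction and reaches the logarithm of the number of classes for a uniform predictive distribution; samples that are individually confident but disagree with each other also yield a high entropy.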

##### Deeper CNNs - AlexNet

We report here error rate and mnll for svi with i-blm and mcd on AlexNet (Krizhevsky et al., 2012). The cnn is composed of a stack of five convolutional layers and three fully-connected layers for a total of more than M parameters (M for svi). In this experiment, we have experienced the situation in which, due to the overparameterization of the model, the nelbo is completely dominated by the kl divergence. Therefore, the prior has a large influence on the optimization, so we decided to follow the approach in Graves (2011), allowing for a phase of optimization of the variances of the prior over the parameters. The results are reported in Figure 8. Once again, we show that svi with i-blm provides a lower negative log-likelihood with respect to the Bernoulli approximation in mcd. In light of these results and considerations, we can consider Gaussian svi as a competitive method to carry out Bayesian inference on cnns.

## 6 Conclusions

This work fills an important gap in the literature of Bayesian deep learning, namely how to effectively initialize variational parameters in svi. We proposed a novel way to do so, i-blm, which is an iterative layer-wise initialization based on Bayesian linear models. Through a series of experiments, including regression and classification with dnns and cnns, we demonstrated the ability of our approach to consistently initialize the optimization in a way that makes convergence faster than alternatives inspired by the state-of-the-art in loss minimization for deep learning.

Thanks to i-blm, it was possible to carry out an effective comparison with mcd, which is currently the state-of-the-art method to carry out approximate inference for dnns and cnns. This suggests a number of directions to investigate to improve on svi and Bayesian cnns. We found that the choice of the prior plays an important role in the behavior of the optimization, so we are investigating ways to define sensible priors for these models. Furthermore, we are looking into ways to extend our initialization strategy to more complex posterior distributions beyond Gaussian and to other deep models, such as Deep Gaussian Processes.

##### Acknowledgements

MF gratefully acknowledges support from the AXA Research Fund.

## References

• Baldi and Hornik (1989) P. Baldi and K. Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53–58, 1989. ISSN 08936080.
• Bengio et al. (2006) Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In Adv. in Neural Inf. Proc. Syst. 19, pages 153–160, 2006.
• Bishop (2006) C. M. Bishop. Pattern recognition and machine learning. Springer, 1st ed. 2006. corr. 2nd printing 2011 edition, Aug. 2006. ISBN 0387310738.
• Buchholz et al. (2018) A. Buchholz, F. Wenzel, and S. Mandt. Quasi-Monte Carlo variational inference. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 668–677, Stockholm, Sweden, 10–15 Jul 2018. PMLR.
• Duchi et al. (2011) J. Duchi, E. Hazan, and Y. Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12:2121–2159, July 2011.
• Dziugaite and Roy (2017) G. K. Dziugaite and D. M. Roy. Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data, Oct. 2017. arXiv:1703.11008.
• Erhan et al. (2010) D. Erhan, A. Courville, and P. Vincent. Why Does Unsupervised Pre-training Help Deep Learning? Journal of Machine Learning Research, 11:625–660, 2010. ISSN 15324435.
• Gal and Ghahramani (2016a) Y. Gal and Z. Ghahramani. Dropout As a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pages 1050–1059. JMLR.org, 2016a.
• Gal and Ghahramani (2016b) Y. Gal and Z. Ghahramani. Bayesian Convolutional Neural Networks with Bernoulli Approximate Variational Inference, Jan. 2016b. arXiv:1506.02158.
• Gal et al. (2017) Y. Gal, J. Hron, and A. Kendall. Concrete Dropout. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 3581–3590. Curran Associates, Inc., 2017.
• Ghahramani (2015) Z. Ghahramani. Probabilistic machine learning and artificial intelligence. Nature, 521(7553):452–459, May 2015. ISSN 0028-0836.
• Glorot and Bengio (2010) X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. PMLR, 9:249–256, 2010. ISSN 15324435.
• Graves (2011) A. Graves. Practical Variational Inference for Neural Networks. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2348–2356. Curran Associates, Inc., 2011.
• Guo et al. (2017) C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On Calibration of Modern Neural Networks. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1321–1330, International Convention Centre, Sydney, Australia, Aug. 2017. PMLR.
• Ha et al. (2016) D. Ha, A. M. Dai, and Q. V. Le. HyperNetworks, 2016. arXiv:1609.09106.
• He et al. (2016) K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 770–778, 2016.
• Hinton and van Camp (1993) G. E. Hinton and D. van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the sixth annual conference on Computational learning theory - COLT ’93, 1993. ISBN 0897916115.
• Huang et al. (2018) C.-W. Huang, D. Krueger, A. Lacoste, and A. Courville. Neural autoregressive flows. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2078–2087, Stockholm, Sweden, 10–15 Jul 2018. PMLR.
• Jia (2014) Y. Jia. Learning Semantic Image Representations at a Large Scale. PhD thesis, Berkeley, California, 2014.
• Jordan et al. (1999) M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An Introduction to Variational Methods for Graphical Models. Machine Learning, 37(2):183–233, Nov. 1999.
• Kendall and Gal (2017) A. Kendall and Y. Gal. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5574–5584. Curran Associates, Inc., 2017.
• Kingma and Ba (2015) D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations (ICLR), San Diego, may 2015.
• Kingma and Welling (2014) D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. In Proceedings of the Second International Conference on Learning Representations (ICLR 2014), Apr. 2014.
• Kingma et al. (2015) D. P. Kingma, T. Salimans, and M. Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems 28, pages 2575–2583. 2015.
• Kingma et al. (2016) D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improving Variational Inference with Inverse Autoregressive Flow. In Advances in Neural Information Processing Systems 29, pages 4743–4751. 2016.
• Krizhevsky et al. (2012) A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. Advances In Neural Information Processing Systems, 2012. ISSN 10495258.
• Kullback (1959) S. Kullback. Information Theory and Statistics. 1959. ISBN 0780327616.
• Lakshminarayanan et al. (2017) B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6402–6413. Curran Associates, Inc., 2017.
• LeCun (1998) Y. LeCun. Efficient backprop. Neural networks: tricks of the trade, 53(9):1689–1699, 1998. ISSN 1098-6596.
• LeCun et al. (1998) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2323, 1998. ISSN 00189219.
• LeCun et al. (2015) Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
• Liu and Wang (2016) Q. Liu and D. Wang. Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm. In Advances in Neural Information Processing Systems 29, pages 2378–2386. 2016.
• Louizos and Welling (2017) C. Louizos and M. Welling. Multiplicative Normalizing Flows for Variational Bayesian Neural Networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2218–2227, Sydney, Australia, 06–11 Aug 2017. PMLR.
• MacKay (1992) D. J. C. MacKay. A Practical Bayesian Framework for Backpropagation Networks. Neural Computation, 4(3):448–472, may 1992. ISSN 0899-7667.
• Milios et al. (2018) D. Milios, R. Camoriano, P. Michiardi, L. Rosasco, and M. Filippone. Dirichlet-based Gaussian Processes for Large-scale Calibrated Classification. In Advances in Neural Information Processing Systems 31 (to appear), May 2018.
• Miller et al. (2017) A. C. Miller, N. J. Foti, and R. P. Adams. Variational Boosting: Iteratively Refining Posterior Approximations. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2420–2429, Sydney, Australia, 06–11 Aug 2017. PMLR.
• Mishkin and Matas (2015) D. Mishkin and J. Matas. All you need is a good init. arXiv preprint arXiv:1511.06422, nov 2015. ISSN 08981221.
• Neal (1997) R. M. Neal. Bayesian Learning for Neural Networks. Journal of the American Statistical Association, 1997. ISSN 01621459.
• Paszke et al. (2017) A. Paszke, G. Chanan, Z. Lin, S. Gross, E. Yang, L. Antiga, and Z. Devito. Automatic differentiation in PyTorch. Advances in Neural Information Processing Systems 30, 2017.
• Ranganath et al. (2015) R. Ranganath, L. Tang, L. Charlin, and D. M. Blei. Deep Exponential Families. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2015, San Diego, California, USA, May 9-12, 2015, 2015.
• Rezende and Mohamed (2015) D. Rezende and S. Mohamed. Variational Inference with Normalizing Flows. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1530–1538, Lille, France, 07–09 Jul 2015. PMLR.
• Rezende et al. (2014) D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In T. Jebara and E. P. Xing, editors, Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1278–1286. JMLR Workshop and Conference Proceedings, 2014.
• Rumelhart et al. (1986) D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, oct 1986. ISSN 0028-0836.
• Saxe et al. (2013) A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. CoRR, abs/1312.6:1–22, dec 2013.
• Srivastava et al. (2014) N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15(1):1929–1958, Jan. 2014.
• Sutskever et al. (2013) I. Sutskever, J. Martens, G. Dahl, and G. E. Hinton. On the importance of initialization and momentum in deep learning. Proceedings of Machine Learning Research, (2010):1139–1147, feb 2013. ISSN 15206149.
• Zeiler (2012) M. D. Zeiler. ADADELTA: An Adaptive Learning Rate Method. arXiv, page 6, dec 2012. ISSN 09252312.

## Appendix A Full derivation of variational lower bound

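$$
\begin{aligned}
\mathrm{KL}\left(q_\theta(W)\,\|\,p(W \mid X, Y)\right)
&= \mathbb{E}_{q_\theta}\left[\log \frac{q_\theta(W)}{p(W \mid X, Y)}\right] \\
&= \mathbb{E}_{q_\theta}\left[\log q_\theta(W) - \log p(W \mid X, Y)\right] \\
&= \mathbb{E}_{q_\theta}\left[-\log p(Y \mid X, W)\right] + \mathbb{E}_{q_\theta}\left[\log q_\theta(W) - \log p(W)\right] + \log p(Y \mid X) \\
&= \mathrm{NLL} + \mathrm{KL}\left(q_\theta(W)\,\|\,p(W)\right) + \log p(Y \mid X)
\end{aligned}
$$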
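The bound that gives this appendix its name follows in one more step. The rearrangement below is a short sketch that uses only the quantities defined in the derivation above, together with the standard fact that a KL divergence is non-negative:

$$
\begin{aligned}
\log p(Y \mid X)
&= \mathrm{KL}\left(q_\theta(W)\,\|\,p(W \mid X, Y)\right) - \mathrm{NLL} - \mathrm{KL}\left(q_\theta(W)\,\|\,p(W)\right) \\
&\geq \mathbb{E}_{q_\theta}\left[\log p(Y \mid X, W)\right] - \mathrm{KL}\left(q_\theta(W)\,\|\,p(W)\right)
\end{aligned}
$$

The right-hand side is the variational objective described in the introduction: an expected log-likelihood plus the negative KL divergence to the prior. Because $\log p(Y \mid X)$ does not depend on $\theta$, maximizing this bound is equivalent to minimizing the KL divergence to the true posterior.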

## Appendix B Bayesian linear regression

Denote by $X$ the matrix containing the input vectors $\mathbf{x}_i$, and let $Y$ be the set consisting of the corresponding multivariate labels $\mathbf{y}_i$. In Bayesian linear regression we introduce a set of latent variables that we compute as a linear combination of the input through a set of weights $W$, and we express the likelihood and the prior on the parameters as follows:

$$
p(Y \mid W, L) = \prod_i p(Y_{\cdot i} \mid X W_{\cdot i}, L) = \prod_i \mathcal{N}(Y_{\cdot i} \mid X W_{\cdot i}, L)
$$

and

$$
p(W \mid \Lambda) = \prod_i p(W_{\cdot i}) = \prod_i \mathcal{N}(W_{\cdot i} \mid 0, \Lambda)
$$

The posterior of this model is:

$$
p(W \mid Y, L) \propto \prod_i \mathcal{N}(Y_{\cdot i} \mid X W_{\cdot i}, L)\, \mathcal{N}(W_{\cdot i} \mid 0, \Lambda)
$$

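which implies that the posterior factorizes across the columns of $W$, with factors

$$
p(W_{\cdot i} \mid Y, X, L, \Lambda) = \mathcal{N}(W_{\cdot i} \mid \Sigma_i X^\top L^{-1} Y_{\cdot i}, \Sigma_i)
$$

with $\Sigma_i = (\Lambda^{-1} + X^\top L^{-1} X)^{-1}$. Similarly, the marginal likelihood factorizes as the product of the following factors

$$
p(Y_{\cdot i} \mid X, L, \Lambda) = \mathcal{N}(Y_{\cdot i} \mid 0, L + X \Lambda X^\top)
$$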
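As an illustration of the posterior formulas in this appendix, the following NumPy sketch computes the posterior mean and covariance for all output columns at once. The function name `blr_posterior`, the toy sizes, and the synthetic data are assumptions made for the example and are not taken from the paper.

```python
import numpy as np

def blr_posterior(X, Y, L, Lam):
    """Return (mean, Sigma); column i of `mean` is the posterior mean of W[:, i]."""
    # Sigma = (Lam^{-1} + X^T L^{-1} X)^{-1}; it is shared by all output columns
    precision = np.linalg.inv(Lam) + X.T @ np.linalg.solve(L, X)
    Sigma = np.linalg.inv(precision)
    # mean_i = Sigma X^T L^{-1} Y[:, i], evaluated for every column i together
    mean = Sigma @ X.T @ np.linalg.solve(L, Y)
    return mean, Sigma

# Toy data (hypothetical sizes and noise level)
rng = np.random.default_rng(0)
n, d, o = 50, 3, 2
X = rng.normal(size=(n, d))
W_true = rng.normal(size=(d, o))
Y = X @ W_true + 0.1 * rng.normal(size=(n, o))

L = 0.01 * np.eye(n)   # likelihood covariance
Lam = np.eye(d)        # prior covariance
mean, Sigma = blr_posterior(X, Y, L, Lam)
```

With an isotropic likelihood covariance and an identity prior covariance, as in this toy setup, the posterior mean coincides with the ridge regression solution, which gives a quick sanity check.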

## Appendix C Heteroscedastic Bayesian linear regression

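We can extend Bayesian linear regression to the heteroscedastic case where $L = \mathrm{diag}(\sigma^2)$ and $\Lambda = I$. These yield

$$
p(W_{\cdot i} \mid Y, X, \sigma^2) = \mathcal{N}(W_{\cdot i} \mid \mu_i, \Sigma_i) \quad \text{with} \quad
\mu_i = \Sigma_i X^\top \mathrm{diag}(\sigma^{-2})\, Y_{\cdot i}, \quad
\Sigma_i = \left(I + X^\top \mathrm{diag}(\sigma^{-2})\, X\right)^{-1}
$$

and

$$
p(Y_{\cdot i} \mid X, \sigma^2) = \mathcal{N}(Y_{\cdot i} \mid 0, \mathrm{diag}(\sigma^2) + X X^\top)
$$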

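The expression for the marginal likelihood is computationally inconvenient because it involves a matrix whose size is the number of training points. We can use the Woodbury identities to express this calculation in terms of the smaller matrix $\Sigma_i$, whose size is the number of input dimensions. In particular,

$$
\log p(Y_{\cdot i} \mid X, \sigma^2) = -\frac{1}{2} \log\left|\mathrm{diag}(\sigma^2) + X X^\top\right| - \frac{1}{2} Y_{\cdot i}^\top \left(\mathrm{diag}(\sigma^2) + X X^\top\right)^{-1} Y_{\cdot i} + \mathrm{const.}
$$

Using the Woodbury identities, we can rewrite the algebraic operations as follows:

$$
\log\left|\mathrm{diag}(\sigma^2) + X X^\top\right| = \sum_j \log \sigma_j^2 + \log\left|I + X^\top \mathrm{diag}(\sigma^{-2})\, X\right|
$$

and

$$
\left(\mathrm{diag}(\sigma^2) + X X^\top\right)^{-1} = \mathrm{diag}(\sigma^{-2}) - \mathrm{diag}(\sigma^{-2})\, X \left(I + X^\top \mathrm{diag}(\sigma^{-2})\, X\right)^{-1} X^\top \mathrm{diag}(\sigma^{-2})
$$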

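So, wrapping up, we can express all quantities of interest as:

$$
\Sigma_i^{-1} = I + X^\top \mathrm{diag}(\sigma^{-2})\, X
$$

$$
\log p(Y_{\cdot i} \mid X, \sigma^2) = -\frac{1}{2}\left(\sum_j \log \sigma_j^2 + \log\left|\Sigma_i^{-1}\right|\right) - \frac{1}{2} Y_{\cdot i}^\top \left(\mathrm{diag}(\sigma^{-2}) - \mathrm{diag}(\sigma^{-2})\, X \Sigma_i X^\top \mathrm{diag}(\sigma^{-2})\right) Y_{\cdot i} + \mathrm{const.}
$$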
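The equivalence between the direct expression and the Woodbury form can be checked numerically. The sketch below is a minimal NumPy example under assumed toy data; the function names `log_marginal_direct` and `log_marginal_woodbury` are hypothetical, and both functions drop the additive constant.

```python
import numpy as np

def log_marginal_direct(X, y, sigma2):
    """-1/2 log|K| - 1/2 y^T K^{-1} y with K = diag(sigma2) + X X^T (n x n)."""
    K = np.diag(sigma2) + X @ X.T
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * logdet - 0.5 * y @ np.linalg.solve(K, y)

def log_marginal_woodbury(X, y, sigma2):
    """Same quantity using only the d x d matrix Sigma^{-1} = I + X^T diag(1/sigma2) X."""
    d = X.shape[1]
    Sigma_inv = np.eye(d) + X.T @ (X / sigma2[:, None])
    _, logdet_Sigma_inv = np.linalg.slogdet(Sigma_inv)
    y_scaled = y / sigma2                      # diag(sigma^{-2}) y
    b = X.T @ y_scaled                         # X^T diag(sigma^{-2}) y
    # y^T (diag(sigma^{-2}) - diag(sigma^{-2}) X Sigma X^T diag(sigma^{-2})) y
    quad = y @ y_scaled - b @ np.linalg.solve(Sigma_inv, b)
    return -0.5 * (np.sum(np.log(sigma2)) + logdet_Sigma_inv) - 0.5 * quad

# Toy data (hypothetical): many more points than input dimensions
rng = np.random.default_rng(1)
n, d = 200, 4
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
sigma2 = rng.uniform(0.05, 0.5, size=n)        # per-point noise variances

direct = log_marginal_direct(X, y, sigma2)
woodbury = log_marginal_woodbury(X, y, sigma2)
```

The Woodbury version only ever factorizes or solves against a matrix of size `d`, which is the reason for preferring it when the number of training points is large.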

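If we factorize $\Sigma_i^{-1} = Q Q^\top$ with $Q$ triangular, as in a Cholesky decomposition, we obtain:

$$
\log p(Y_{\cdot i} \mid X, \sigma^2) = -\frac{1}{2}\left(\sum_j \log \sigma_j^2 + \sum_k 2 \log Q_{kk}\right) - \frac{1}{2} Y_{\cdot i}^\top \tilde{Y}_{\cdot i} + \frac{1}{2} \tilde{Y}_{\cdot i}^\top X Q^{-\top} Q^{-1} X^\top \tilde{Y}_{\cdot i} + \mathrm{const.}
$$

where $\tilde{Y}_{\cdot i} = \mathrm{diag}(\sigma^{-2})\, Y_{\cdot i}$.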

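Predictions follow from the same identities as before. Looking at the predicted latent process, we have

$$
p(f_{*i} \mid X, Y, x_*) = \int p(f_{*i} \mid W, x_*)\, p(W \mid X, Y)\, dW
$$

We can again remove the dependence on the dimensions of $W$ that do not affect the prediction for the $i$th function:

$$
p(f_{*i} \mid X, Y, x_*) = \int p(f_{*i} \mid W_{\cdot i}, x_*)\, p(W_{\cdot i} \mid X, Y)\, dW_{\cdot i}
$$

Now:

$$
p(f_{*i} \mid W_{\cdot i}, x_*) = \mathcal{N}(f_{*i} \mid x_*^\top W_{\cdot i}, 0) \quad \text{and} \quad p(W_{\cdot i} \mid X, Y) = \mathcal{N}(W_{\cdot i} \mid \mu_i, \Sigma_i)
$$

giving

$$
p(f_{*i} \mid X, Y, x_*) = \mathcal{N}(f_{*i} \mid x_*^\top \mu_i, x_*^\top \Sigma_i x_*)
$$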
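The heteroscedastic posterior and the predictive distribution of the latent function can be sketched in a few lines of NumPy. The helper names `hetero_posterior` and `predict_latent` and the synthetic data are assumptions made for this example.

```python
import numpy as np

def hetero_posterior(X, Y, sigma2):
    """Posterior over W with prior covariance I and noise covariance diag(sigma2)."""
    d = X.shape[1]
    # Sigma = (I + X^T diag(sigma^{-2}) X)^{-1}, shared by all outputs
    Sigma = np.linalg.inv(np.eye(d) + X.T @ (X / sigma2[:, None]))
    # mu_i = Sigma X^T diag(sigma^{-2}) Y[:, i], stacked as columns
    mu = Sigma @ X.T @ (Y / sigma2[:, None])
    return mu, Sigma

def predict_latent(x_star, mu, Sigma):
    """Mean x*^T mu_i for each output i, and the variance x*^T Sigma x*."""
    mean = x_star @ mu
    var = x_star @ Sigma @ x_star
    return mean, var

# Toy data (hypothetical)
rng = np.random.default_rng(2)
n, d, o = 100, 3, 2
X = rng.normal(size=(n, d))
Y = X @ rng.normal(size=(d, o)) + 0.2 * rng.normal(size=(n, o))
sigma2 = rng.uniform(0.02, 0.1, size=n)

mu, Sigma = hetero_posterior(X, Y, sigma2)
x_star = rng.normal(size=d)
f_mean, f_var = predict_latent(x_star, mu, Sigma)
```

Because $\Sigma_i$ does not depend on $i$ in this setting, the predictive variance of the latent function is the same for every output, while the predictive mean differs per column of $Y$.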

## Appendix D Full derivation of fully factorized Gaussian posterior approximation to Bayesian linear regression posterior

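For simplicity of notation, let $w$ be the parameters of interest in Bayesian linear regression for a given output $y$. We can formulate the problem of obtaining the best approximate factorized posterior of a Bayesian linear model as a minimization of the Kullback-Leibler divergence between $q(w) = \mathcal{N}(w \mid m, \mathrm{diag}(s^2))$ and the actual posterior $p(w \mid X, y) = \mathcal{N}(w \mid \mu, \Sigma)$. The expression of the KL divergence between multivariate Gaussians $p_0 = \mathcal{N}(\mu_0, \Sigma_0)$ and $p_1 = \mathcal{N}(\mu_1, \Sigma_1)$ of dimension $D$ is as follows:

$$
\mathrm{KL}[p_0 \,\|\, p_1] = \frac{1}{2} \mathrm{Tr}\left(\Sigma_1^{-1} \Sigma_0\right) + \frac{1}{2} (\mu_1 - \mu_0)^\top \Sigma_1^{-1} (\mu_1 - \mu_0) - \frac{D}{2} + \frac{1}{2} \log\left(\frac{\det \Sigma_1}{\det \Sigma_0}\right)
$$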

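The KL divergence is not symmetric, so the order of its arguments matters. If we consider $\mathrm{KL}[p(w \mid X, y) \,\|\, q(w)]$, the expression becomes:

$$
\mathrm{KL}[p(w \mid X, y) \,\|\, q(w)] = \frac{1}{2} \mathrm{Tr}\left(\mathrm{diag}(s^2)^{-1} \Sigma\right) + \frac{1}{2} (m - \mu)^\top \mathrm{diag}(s^2)^{-1} (m - \mu) - \frac{D}{2} + \frac{1}{2} \log\left(\frac{\prod_i s_i^2}{\det \Sigma}\right)
$$

It is a simple matter to show that the optimal mean is $m = \mu$, as $m$ appears only in the quadratic form, which is clearly minimized when $m = \mu$. For the variances $s_i^2$, we need to take the derivative of the KL divergence and set it to zero:

$$
\frac{\partial\, \mathrm{KL}[p(w \mid X, y) \,\|\, q(w)]}{\partial s_i^2} = \frac{1}{2} \frac{\partial\, \mathrm{Tr}\left(\mathrm{diag}(s^2)^{-1} \Sigma\right)}{\partial s_i^2} + \frac{1}{2} \frac{\partial \sum_j \log s_j^2}{\partial s_i^2} = 0
$$

Rewriting the trace term as the sum of the entries of the Hadamard product of the matrices in the product $\mathrm{diag}(s^2)^{-1} \Sigma$, this yields

$$
\frac{\partial\, \mathrm{KL}[p(w \mid X, y) \,\|\, q(w)]}{\partial s_i^2} = \frac{1}{2} \frac{\partial \left(\Sigma_{ii} / s_i^2\right)}{\partial s_i^2} + \frac{1}{2} \frac{\partial \log s_i^2}{\partial s_i^2} = 0
$$

This results in $s_i^2 = \Sigma_{ii}$, which is the simplest way to approximate the correlated posterior over $w$, but it is going to inflate the variance in case of strong correlations.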

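If we consider $\mathrm{KL}[q(w) \,\|\, p(w \mid X, y)]$ instead, the expression of the KL becomes:

$$
\mathrm{KL}[q(w) \,\|\, p(w \mid X, y)] = \frac{1}{2} \mathrm{Tr}\left(\Sigma^{-1} \mathrm{diag}(s^2)\right) + \frac{1}{2} (m - \mu)^\top \Sigma^{-1} (m - \mu) - \frac{D}{2} + \frac{1}{2} \log\left(\frac{\det \Sigma}{\prod_i s_i^2}\right)
$$

Again, the optimal mean is $m = \mu$. For the variances $s_i^2$, we need to take the derivative of the KL divergence and set it to zero:

$$
\frac{\partial\, \mathrm{KL}[q(w) \,\|\, p(w \mid X, y)]}{\partial s_i^2} = \frac{1}{2} \frac{\partial\, \mathrm{Tr}\left(\Sigma^{-1} \mathrm{diag}(s^2)\right)}{\partial s_i^2} - \frac{1}{2} \frac{\partial \sum_j \log s_j^2}{\partial s_i^2} = 0
$$

Rewriting the trace term as the sum of the entries of the Hadamard product of the matrices in the product $\Sigma^{-1} \mathrm{diag}(s^2)$, this yields

$$
\frac{\partial\, \mathrm{KL}[q(w) \,\|\, p(w \mid X, y)]}{\partial s_i^2} = \frac{1}{2} \frac{\partial \left(s_i^2\, (\Sigma^{-1})_{ii}\right)}{\partial s_i^2} - \frac{1}{2} \frac{\partial \log s_i^2}{\partial s_i^2} = 0
$$

This results in $s_i^2 = 1 / (\Sigma^{-1})_{ii}$. This approximation has the opposite effect of underestimating the variance for each variable.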
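The two closed-form solutions can be compared on a small correlated Gaussian. The sketch below is an illustration under assumed values: the 2x2 covariance with correlation 0.9 and the helper names `kl_gauss`, `forward_kl`, and `reverse_kl` are hypothetical choices for the example.

```python
import numpy as np

def kl_gauss(mu0, S0, mu1, S1):
    """KL[N(mu0, S0) || N(mu1, S1)] for full covariance matrices."""
    D = mu0.shape[0]
    diff = mu1 - mu0
    _, logdet0 = np.linalg.slogdet(S0)
    _, logdet1 = np.linalg.slogdet(S1)
    trace_term = np.trace(np.linalg.solve(S1, S0))
    quad_term = diff @ np.linalg.solve(S1, diff)
    return 0.5 * (trace_term + quad_term - D + logdet1 - logdet0)

# Strongly correlated "true" posterior N(mu, Sigma) (hypothetical values)
mu = np.zeros(2)
Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])

def forward_kl(s2):
    """KL[p || q] with q = N(mu, diag(s2))."""
    return kl_gauss(mu, Sigma, mu, np.diag(s2))

def reverse_kl(s2):
    """KL[q || p] with q = N(mu, diag(s2))."""
    return kl_gauss(mu, np.diag(s2), mu, Sigma)

s2_forward = np.diag(Sigma)                        # s_i^2 = Sigma_ii
s2_reverse = 1.0 / np.diag(np.linalg.inv(Sigma))   # s_i^2 = 1 / (Sigma^{-1})_ii

print("forward-KL variances:", s2_forward)
print("reverse-KL variances:", s2_reverse)
```

In this example the reverse-KL variances are much smaller than the marginal variances on the diagonal of $\Sigma$, which matches the underestimation described above, while the forward-KL solution reproduces the marginal variances exactly.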

## Appendix E Extended results

### E.4 Classification with deep architecture
