SLANG: Fast Structured Covariance Approximations for Bayesian Deep Learning with Natural Gradient
Abstract
Uncertainty estimation in large deep-learning models is a computationally challenging task, where it is difficult to form even a Gaussian approximation to the posterior distribution. In such situations, existing methods usually resort to a diagonal approximation of the covariance matrix, despite the fact that these approximations are known to give poor uncertainty estimates. To address this issue, we propose a new stochastic, low-rank, approximate natural-gradient (SLANG) method for variational inference in large, deep models. Our method estimates a "diagonal plus low-rank" structure based solely on back-propagated gradients of the network log-likelihood. This requires strictly fewer gradient computations than methods that compute the gradient of the whole variational objective. Empirical evaluations on standard benchmarks confirm that SLANG enables faster and more accurate estimation of uncertainty than mean-field methods, and performs comparably to state-of-the-art methods.
Aaron Mishkin* (University of British Columbia, Vancouver, Canada; amishkin@cs.ubc.ca), Frederik Kunstner* (Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland; frederik.kunstner@epfl.ch), Didrik Nielsen (RIKEN Center for AI Project, Tokyo, Japan; didrik.nielsen@riken.jp), Mark Schmidt (University of British Columbia, Vancouver, Canada; schmidtm@cs.ubc.ca), Mohammad Emtiyaz Khan (RIKEN Center for AI Project, Tokyo, Japan; emtiyaz.khan@riken.jp)

*Equal contributions. This work was conducted during an internship at the RIKEN Center for AI Project.
Preprint. Work in progress.
1 Introduction
Deep learning has had enormous recent success in fields such as speech recognition and computer vision. In these problems, our goal is to predict well and we are typically less interested in the uncertainty behind the predictions. However, deep learning is now becoming increasingly popular in applications such as robotics and medical diagnostics, where accurate measures of uncertainty are crucial for reliable decisions. For example, uncertainty estimates are important for physicians who use automated diagnosis systems to choose effective and safe treatment options. Lack of such estimates may lead to decisions that have disastrous consequences.
The goal of Bayesian deep learning is to provide uncertainty estimates by integrating over the posterior distribution of the parameters. Unfortunately, the complexity of deep-learning models makes it infeasible to perform the integration exactly. Sampling methods such as stochastic-gradient Markov chain Monte Carlo [chen2014stochastic] have been applied to deep models, but they usually converge slowly. They also require a large amount of memory to store the samples and often need large preconditioners to mix well [ahn2012bayesian; balan2015bayesian; simsekli2016stochastic]. In contrast, variational inference (VI) methods require much less memory and can scale to large problems by exploiting stochastic-gradient methods [blundell2015weight; graves2011practical; ranganath2013black]. However, they often make crude simplifications, such as the mean-field approximation, to reduce the memory and computation cost. This can result in poor uncertainty estimates [turner2011two]. Fast and accurate estimation of uncertainty for large models remains a challenging problem in Bayesian deep learning.
In this paper, we propose a new variational inference method to estimate Gaussian approximations with a "diagonal plus low-rank" covariance structure. This gives more accurate and flexible approximations than the mean-field approach. Our method also enables fast estimation by using an approximate natural-gradient algorithm that builds the covariance estimate solely from the back-propagated gradients of the network log-likelihood. We call our method stochastic low-rank approximate natural-gradient (SLANG). SLANG requires strictly fewer gradient computations than methods that require gradients of the variational objective obtained using the reparameterization trick [miller2017variational; ong2017gaussian; titsias2014doubly]. Our empirical comparisons demonstrate the improvements obtained over mean-field methods (see Figure 1 for an example) and show that SLANG gives results comparable to the state of the art on standard benchmarks.
The code to reproduce the experimental results in this paper is available at https://github.com/aaronpmishkin/SLANG.
1.1 Related Work
Gaussian variational distributions with full covariance matrices have been used extensively for shallow models [barber1998ensemble; Challis:11; Jaakkola96b; marlin2011piecewise; miller2017variational; seeger2000bayesian; tan2018gaussian; titsias2014doubly]. Several efficient ways of computing the full covariance matrix are discussed by Seeger [seeger2010gaussian]. Other works have considered various structured covariance approximations based on the Cholesky decomposition [Challis:11; titsias2014doubly], sparse covariance matrices [tan2018gaussian], and low-rank plus diagonal structure [barber1998ensemble; ong2017gaussian; seeger2000bayesian]. Recently, several works [miller2017variational; ong2017gaussian] have applied stochastic gradient descent to the variational objective to estimate such a structure. These methods often employ an adaptive learning-rate method, such as Adam or RMSprop, which increases the memory cost. All of these methods have only been applied to shallow models, and it remains unclear how they will perform on deep models (and whether they can be adapted to them). Moreover, a natural-gradient method is preferable to gradient-based methods when optimizing the parameters of a distribution [amari1998natural; hoffman2013stochastic; khan2018fast]. Our work shows that a natural-gradient method not only has better convergence properties, but also has lower computation and memory cost than gradient-based methods.
For deep models, a variety of methods have been proposed based on mean-field approximations. These methods optimize the variational objective using stochastic-gradient methods and differ from each other in the way they compute those gradients [blundell2015weight; graves2011practical; hernandez15pbp; khan2017vprop; ranganath2013black; zhang2017noisy]. They all give poor uncertainty estimates in the presence of strong posterior correlations, and they also shrink the variances [turner2011two]. SLANG is designed to add extra covariance structure and thereby ensure better performance than mean-field approaches.
A few recent works have explored structured covariance approximations for deep models. In [zhang2017noisy], the Kronecker-factored approximate curvature (K-FAC) method is applied to perform approximate natural-gradient VI. Another recent work applies K-FAC to find a Laplace approximation [ritter2018scalable]. However, the Laplace approximation can perform worse than variational inference in many scenarios, e.g., when the posterior distribution is not symmetric [nickisch2008approximations]. Other types of approximation methods include Bayesian dropout [yarin16dropout] and methods that use matrix-variate Gaussians [louizos2016structured; sun2017learning]. All of these approaches make structural assumptions that differ from our low-rank plus diagonal structure. However, similarly to our work, they provide new ways to improve the speed and accuracy of uncertainty estimation in deep learning.
2 Gaussian Approximation with Natural-Gradient Variational Inference
Our goal is to estimate the uncertainty in deep models using Bayesian inference. Given $N$ data examples $\mathcal{D} = \{\mathcal{D}_1, \dots, \mathcal{D}_N\}$, a Bayesian version of a deep model can be specified by using a likelihood $p(\mathcal{D}_i \,|\, \theta)$ parametrized by a deep network with parameters $\theta \in \mathbb{R}^D$, together with a prior distribution $p(\theta)$. For simplicity, we assume that the prior is a Gaussian distribution, such as an isotropic Gaussian $p(\theta) = \mathcal{N}(\theta \,|\, 0, \lambda^{-1} I)$ with scalar precision parameter $\lambda > 0$. However, the methods presented in this paper can easily be modified to handle many other types of prior distributions. Given such a model, Bayesian approaches compute an estimate of uncertainty by using the posterior distribution $p(\theta \,|\, \mathcal{D}) = p(\mathcal{D} \,|\, \theta)\, p(\theta) / p(\mathcal{D})$. This requires the marginal likelihood $p(\mathcal{D})$, a high-dimensional integral that is difficult to compute.
Variational inference (VI) simplifies the problem by approximating $p(\theta \,|\, \mathcal{D})$ with a distribution $q(\theta)$. In this paper, our focus is on approximations that have a Gaussian form, i.e., $q(\theta) := \mathcal{N}(\theta \,|\, \mu, \Sigma)$ with mean $\mu$ and covariance $\Sigma$. The parameters $\mu$ and $\Sigma$ are referred to as the variational parameters and can be obtained by maximizing a lower bound on $\log p(\mathcal{D})$ called the evidence lower bound (ELBO),

\[
\mathcal{L}(\mu, \Sigma) := \mathbb{E}_{q}\big[\log p(\mathcal{D} \,|\, \theta)\big] - D_{\mathrm{KL}}\big(q(\theta) \,\|\, p(\theta)\big), \tag{1}
\]

where $D_{\mathrm{KL}}$ denotes the Kullback-Leibler divergence.
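For concreteness, the expectation in the ELBO is typically estimated with Monte-Carlo samples from $q$, while the KL term is available in closed form for Gaussians. Below is a minimal numpy sketch for a hypothetical two-parameter logistic-regression model; the data, the helper names `log_lik` and `elbo_estimate`, and the prior precision `lam` are all illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy Bayesian logistic regression: 2 parameters, 4 data points.
X = np.array([[1.0, 0.5], [0.2, -1.0], [-0.7, 0.3], [0.4, 0.8]])
y = np.array([1.0, 0.0, 0.0, 1.0])
lam = 1.0  # prior precision: p(theta) = N(0, I / lam)

def log_lik(theta):
    """Bernoulli log-likelihood of the whole dataset at parameters theta."""
    logits = X @ theta
    return np.sum(y * logits - np.log1p(np.exp(logits)))

def elbo_estimate(mu, sigma2, num_samples=500):
    """Monte-Carlo ELBO estimate for a mean-field q = N(mu, diag(sigma2))."""
    eps = rng.standard_normal((num_samples, mu.size))
    thetas = mu + np.sqrt(sigma2) * eps             # reparameterized samples
    exp_ll = np.mean([log_lik(t) for t in thetas])  # E_q[log p(D | theta)]
    # Closed-form KL( N(mu, diag(sigma2)) || N(0, I / lam) ).
    kl = 0.5 * np.sum(lam * (sigma2 + mu**2) - 1.0 - np.log(lam * sigma2))
    return exp_ll - kl

elbo = elbo_estimate(np.zeros(2), np.ones(2))
```

The full-covariance case replaces the diagonal sampling step with a matrix square root, which is exactly the operation SLANG later makes cheap.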
A straightforward and popular approach to optimizing $\mathcal{L}$ is to use a stochastic-gradient method [miller2017variational; ong2017gaussian; ranganath2013black; titsias2014doubly]. However, natural gradients are preferable when optimizing the parameters of a distribution [amari1998natural; hoffman2013stochastic; khan2018fast]. This is because natural-gradient methods perform optimization on the Riemannian manifold of the parameters, which can lead to faster convergence compared to gradient-based methods. Natural-gradient methods are typically difficult to implement, but many easy-to-implement updates have been derived in recent works [hoffman2013stochastic; khan2017conjugate; khan2017vprop; zhang2017noisy]. We build upon the approximate natural-gradient method proposed in [khan2017vprop] and modify it to estimate structured covariance approximations.
Specifically, we extend the Variational Online Gauss-Newton (VOGN) method [khan2017vprop]. This method uses the following update for $\mu$ and $\Sigma$ (a derivation is in Appendix A),

\[
\mu_{t+1} = \mu_t - \alpha_t\, \Sigma_{t+1} \big[\hat{g}(\theta_t) + \lambda \mu_t\big], \qquad
\Sigma_{t+1}^{-1} = (1 - \beta_t)\, \Sigma_t^{-1} + \beta_t \big[\hat{G}(\theta_t) + \lambda I\big], \tag{2}
\]

where $t$ is the iteration number, $\alpha_t, \beta_t > 0$ are learning rates, $\theta_t \sim \mathcal{N}(\theta \,|\, \mu_t, \Sigma_t)$, $g_i(\theta) := -\nabla_\theta \log p(\mathcal{D}_i \,|\, \theta)$ is the back-propagated gradient obtained on the $i$'th data example, $\hat{G}(\theta) := \tfrac{N}{M} \sum_{i \in \mathcal{M}} g_i(\theta)\, g_i(\theta)^\top$ is an empirical Fisher (EF) matrix, $\hat{g}(\theta) := \tfrac{N}{M} \sum_{i \in \mathcal{M}} g_i(\theta)$ is the corresponding stochastic gradient, and $\mathcal{M}$ is a minibatch of $M$ data examples. This update is an approximate natural-gradient update obtained by using the EF matrix as an approximation of the Hessian [martens2014new] in a method called Variational Online Newton (VON) [khan2017vprop]. This is explained in more detail in Appendix A. As discussed in [khan2017vprop], the VOGN method is an approximate natural-gradient update which may not have the same properties as the exact natural-gradient update. However, an advantage of the update (2) is that it only requires back-propagated gradients, which is a desirable feature when working with deep models.
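To make the quantities in update (2) concrete, the following numpy sketch builds the stochastic gradient and the EF matrix from a stack of per-example gradients and performs one precision update. The per-example gradients here are random stand-ins and all variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, D = 100, 5, 3              # dataset size, minibatch size, num. parameters
G = rng.standard_normal((M, D))  # row i: per-example gradient g_i (stand-in)

g_hat = (N / M) * G.sum(axis=0)  # stochastic gradient estimate
G_hat = (N / M) * G.T @ G        # empirical Fisher: (N / M) * sum_i g_i g_i^T

# One precision update from (2): a convex combination of the previous precision
# and the (EF + prior) term, so positive-definiteness is preserved.
lam, beta = 1.0, 0.1
Sigma_inv = np.eye(D)
Sigma_inv_new = (1 - beta) * Sigma_inv + beta * (G_hat + lam * np.eye(D))
```

Note that `G_hat` is built from outer products of gradients only, with no second-order derivatives, which is the property SLANG exploits.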
The update (2) is computationally infeasible for large deep models because it requires the storage and inversion of the $D \times D$ covariance matrix. Storage takes $O(D^2)$ memory and inversion requires $O(D^3)$ computation, which makes the update very costly for large models. We cannot even form $\Sigma$, let alone invert it, when $D$ is in the millions. Mean-field approximations avoid this issue by restricting $\Sigma$ to be a diagonal matrix, but they often give poor Gaussian approximations. Our idea is to estimate a low-rank plus diagonal approximation of the precision matrix that reduces the computational cost while preserving some off-diagonal covariance structure. In the next section, we propose modifications to the update (2) that yield a method whose time and space complexity are both linear in $D$.
3 Stochastic, Low-rank, Approximate Natural-Gradient (SLANG) Method
Our goal is to modify the update (2) to obtain a method whose time and space complexity is linear in $D$. We propose to approximate the inverse of $\Sigma_t$ by a "low-rank plus diagonal" matrix:

\[
\Sigma_t^{-1} \approx \hat{\Sigma}_t^{-1} := U_t U_t^\top + D_t, \tag{3}
\]

where $U_t$ is a $D \times L$ matrix with $L \ll D$ and $D_t$ is a diagonal matrix. The cost of storing and inverting this matrix is linear in $D$ and reasonable when $L$ is small. We now derive an update for $U_t$ and $D_t$ such that the resulting $\hat{\Sigma}_{t+1}^{-1}$ closely approximates the update shown in (2). We start by writing an approximation to the update of $\Sigma^{-1}$ in which the covariance matrices are replaced by their structured approximations:
\[
\hat{\Sigma}_{t+1}^{-1} = (1 - \beta_t)\big(U_t U_t^\top + D_t\big) + \beta_t \big[\hat{G}(\theta_t) + \lambda I\big]. \tag{4}
\]

This update cannot be performed exactly without potentially increasing the rank of the low-rank component, since the structured terms on the right-hand side have rank at most $L + M$, where $M$ is the size of the minibatch. This is shown in (5) below, where we have rearranged the right-hand side of (4) as the sum of a structured component and a diagonal component. To obtain a rank-$L$ approximation, we approximate the structured component by its top-$L$ eigenvalue decomposition, as shown in (6) below,

\[
\hat{\Sigma}_{t+1}^{-1} = \big[(1 - \beta_t)\, U_t U_t^\top + \beta_t\, \hat{G}(\theta_t)\big] + \big[(1 - \beta_t)\, D_t + \beta_t \lambda I\big] \tag{5}
\]
\[
\approx Q_{1:L}\, \Lambda_{1:L}\, Q_{1:L}^\top + \big[(1 - \beta_t)\, D_t + \beta_t \lambda I\big], \tag{6}
\]

where $Q_{1:L}$ is a $D \times L$ matrix containing the $L$ leading eigenvectors of the structured component $(1 - \beta_t)\, U_t U_t^\top + \beta_t\, \hat{G}(\theta_t)$ and $\Lambda_{1:L}$ is an $L \times L$ diagonal matrix containing the corresponding eigenvalues. Figure 2 visualizes the step from (5) to (6).
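The eigendecomposition in (6) never needs the $D \times D$ matrix explicitly: the structured component is available in factored form, so a randomized method in the style of Halko et al. can work with matrix products against tall-thin factors only. A numpy sketch of this idea (not the paper's fast_eig implementation; all names, including `matvec`, are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
D_dim, L, M = 50, 3, 5
beta = 0.1
U = rng.standard_normal((D_dim, L))   # current low-rank factor
G = rng.standard_normal((D_dim, M))   # scaled per-example gradients (stand-ins)

def matvec(X):
    """Apply A = (1 - beta) U U^T + beta G G^T without forming the D x D matrix."""
    return (1 - beta) * (U @ (U.T @ X)) + beta * (G @ (G.T @ X))

# Randomized top-L eigendecomposition in the style of Halko et al.
sketch = L + M                       # A has rank <= L + M, so this sketch is exact here
Q, _ = np.linalg.qr(matvec(rng.standard_normal((D_dim, sketch))))
T = Q.T @ matvec(Q)                  # small (L + M) x (L + M) projection of A
evals, evecs = np.linalg.eigh(T)
top = np.argsort(evals)[::-1][:L]    # keep the L leading eigenpairs
Q_top = Q @ evecs[:, top]            # approximate leading eigenvectors of A
U_new = Q_top * np.sqrt(np.maximum(evals[top], 0.0))  # U_new U_new^T ~ top-L part of A
```

Every operation above multiplies a $D \times (L{+}M)$ factor by a small matrix, so the cost stays linear in $D$.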
The low-rank component can now be updated to mirror the low-rank term of (6),

\[
U_{t+1} := Q_{1:L}\, \Lambda_{1:L}^{1/2}, \tag{7}
\]

and the diagonal $D_{t+1}$ can be updated so that the diagonals of the left- and right-hand sides of (4) match, i.e.,

\[
\mathrm{diag}\big(U_{t+1} U_{t+1}^\top + D_{t+1}\big) = \mathrm{diag}\big((1 - \beta_t)(U_t U_t^\top + D_t) + \beta_t [\hat{G}(\theta_t) + \lambda I]\big). \tag{8}
\]

This gives us the following update for $D_{t+1}$, using a diagonal correction $\Delta_t$:

\[
D_{t+1} = (1 - \beta_t)\, D_t + \beta_t \lambda I + \Delta_t, \tag{9}
\]
\[
\Delta_t := \mathrm{diag}\big((1 - \beta_t)\, U_t U_t^\top + \beta_t\, \hat{G}(\theta_t)\big) - \mathrm{diag}\big(U_{t+1} U_{t+1}^\top\big). \tag{10}
\]

This step is cheap, since computing the diagonal of the EF matrix requires only $O(MD)$ operations, which is linear in $D$.
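The numpy sketch below performs one full precision update with the diagonal correction and checks that the diagonal of the "low-rank plus diagonal" estimate matches the diagonal of the dense update (4), which is the property claimed above. All inputs are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(3)
D_dim, L, M = 20, 2, 4
beta, lam = 0.2, 1.0
U = rng.standard_normal((D_dim, L))   # current low-rank factor
d = rng.uniform(1.0, 2.0, D_dim)      # current diagonal component
g = rng.standard_normal((D_dim, M))
G_hat = g @ g.T                       # stand-in empirical Fisher matrix

# Dense reference: the exact precision update (4), formed here only for checking.
target = (1 - beta) * (U @ U.T + np.diag(d)) + beta * (G_hat + lam * np.eye(D_dim))

# Steps (5)-(7): top-L eigendecomposition of the structured component.
A = (1 - beta) * U @ U.T + beta * G_hat
evals, evecs = np.linalg.eigh(A)
top = np.argsort(evals)[::-1][:L]
U_new = evecs[:, top] * np.sqrt(np.maximum(evals[top], 0.0))

# Steps (8)-(10): the diagonal correction keeps the diagonal of (4) exact.
delta = np.diag(A) - np.sum(U_new**2, axis=1)
d_new = (1 - beta) * d + beta * lam + delta

approx = U_new @ U_new.T + np.diag(d_new)
print(np.allclose(np.diag(approx), np.diag(target)))  # True
```

Only the off-diagonal entries of the precision are approximated; the diagonal is reproduced exactly by construction.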
The new covariance approximation can now be used to update $\mu$ according to (2), as shown below:

\[
\mu_{t+1} = \mu_t - \alpha_t \big(U_{t+1} U_{t+1}^\top + D_{t+1}\big)^{-1} \big[\hat{g}(\theta_t) + \lambda \mu_t\big]. \tag{11}
\]

The above update uses a stochastic, low-rank covariance estimate to approximate natural-gradient updates, which is why we use the name SLANG.

When $L = D$, $\hat{\Sigma}^{-1}$ is full-rank and SLANG is identical to the approximate natural-gradient update (2). When $L < D$, SLANG produces matrices whose diagonals match those of (2) at every iteration. The diagonal correction $\Delta_t$ ensures that no diagonal information is lost during the low-rank approximation of the covariance. A formal statement and proof are given in Appendix D.
We also tried an alternative method where the low-rank component is learned using an exponential moving-average of eigenvalue decompositions. This earlier iteration of SLANG is discussed in Appendix B, where we show that it gives worse results than the SLANG update presented here.
Next, we give implementation details of SLANG.
3.1 Details of the SLANG Implementation
At every iteration, we generate a sample $\theta_t \sim \mathcal{N}(\theta \,|\, \mu_t, \hat{\Sigma}_t)$. This is implemented in line 4 using the function fast_sample (see Algorithm 3 for pseudocode). This function uses the Woodbury identity and the symmetric factorization algorithm of [ambikasaran2014fast] to compute a matrix $B_t$ such that $B_t B_t^\top = \hat{\Sigma}_t = (U_t U_t^\top + D_t)^{-1}$. A sample is then computed as $\theta_t = \mu_t + B_t \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$. The function fast_sample requires $O(D(L + S)^2)$ computations to generate $S$ samples, which is linear in $D$. More details are given in Appendix C.4.
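A minimal numpy sketch of low-rank Gaussian sampling in the spirit of fast_sample. Instead of the symmetric factorization of Ambikasaran and O'Neil used by the paper, it applies an equivalent spectral identity through the small $L \times L$ matrix $K = V^\top V$, which likewise costs $O(DL^2)$ per batch of samples; all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
D_dim, L, S = 30, 3, 5
U = rng.standard_normal((D_dim, L))
d = rng.uniform(0.5, 2.0, D_dim)   # diagonal of the precision
mu = rng.standard_normal(D_dim)

# Goal: draw theta ~ N(mu, Sigma) with Sigma = (U U^T + diag(d))^{-1} in O(D L^2).
# Write Sigma = D^{-1/2} (I + V V^T)^{-1} D^{-1/2} with V = D^{-1/2} U, and apply
# (I + V V^T)^{-1/2} through the eigendecomposition K = P diag(w) P^T of K = V^T V:
#   (I + V V^T)^{-1/2} = I + V P diag(h(w)) P^T V^T,  h(w) = ((1 + w)^{-1/2} - 1) / w.
V = U / np.sqrt(d)[:, None]
w, P = np.linalg.eigh(V.T @ V)
h = (1.0 / np.sqrt(1.0 + w) - 1.0) / w

eps = rng.standard_normal((D_dim, S))             # one column per sample
half = eps + V @ ((P * h) @ (P.T @ (V.T @ eps)))  # (I + V V^T)^{-1/2} eps
theta = mu[:, None] + half / np.sqrt(d)[:, None]  # S samples from N(mu, Sigma)
```

The identity can be checked directly: the implied factor $B = D^{-1/2}(I + V P\,\mathrm{diag}(h)\,P^\top V^\top)$ satisfies $B B^\top = (U U^\top + \mathrm{diag}(d))^{-1}$.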
Given a sample, we need to compute and store the individual stochastic gradients $g_i(\theta_t)$ for all examples in a minibatch $\mathcal{M}$. The standard backpropagation implementation does not allow this. We instead use a version of the backpropagation algorithm outlined in a note by Goodfellow [goodfellow2015efficient], which enables efficient computation of the individual gradients. This is shown in line 6. More details on the function backprop_goodfellow are given in Appendix C.1.
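The idea behind Goodfellow's note can be seen on a single linear layer: backprop already materializes, for every example, the layer input and the gradient of the loss with respect to the layer output, and the per-example weight gradient is just their outer product. An illustrative numpy sketch with a squared loss and random data (not the paper's backprop_goodfellow):

```python
import numpy as np

rng = np.random.default_rng(5)
M, d_in, d_out = 4, 3, 2
X = rng.standard_normal((M, d_in))      # minibatch of layer inputs (activations)
W = rng.standard_normal((d_in, d_out))  # layer weights
Y = rng.standard_normal((M, d_out))     # targets

# For the loss 0.5 * sum_i ||x_i W - y_i||^2, backprop computes, per example,
# the gradient of the loss w.r.t. the layer output ("delta"):
delta = X @ W - Y                       # shape (M, d_out)

# The per-example weight gradient of example i is the outer product x_i delta_i^T,
# assembled for the whole batch with a single einsum:
per_example = np.einsum('mi,mo->mio', X, delta)  # shape (M, d_in, d_out)

# The ordinary batch gradient is the sum of the per-example gradients.
batch_grad = X.T @ delta
```

The same bookkeeping applies layer by layer in a deep network, which is why no extra backward passes are needed.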
In line 7, we compute the eigenvalue decomposition of the structured component using the fast_eig function. The function fast_eig implements a randomized eigenvalue-decomposition method discussed in [halko2011finding]. It computes the top-$L$ eigenvalue decomposition of a low-rank matrix in time linear in $D$. More details on the function are given in Appendix C.2. The matrices $U_t$ and $D_t$ are then updated in lines 8, 9 and 10 using this eigenvalue decomposition.
In lines 11 and 12, we compute the update direction for $\mu$, which requires solving a linear system. We use the function fast_inverse shown in Algorithm 2. This function uses the Woodbury identity to compute the inverse efficiently, at a cost of $O(DL^2)$. More details are given in Appendix C.3. Finally, in line 13, we update $\mu$.
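A sketch of the Woodbury-based solve used conceptually by fast_inverse: the $D \times D$ system is reduced to an $L \times L$ one. The function name and signature here are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(6)
D_dim, L = 40, 3
U = rng.standard_normal((D_dim, L))
d = rng.uniform(0.5, 2.0, D_dim)
b = rng.standard_normal(D_dim)

def fast_inverse_solve(U, d, b):
    """Solve (U U^T + diag(d)) x = b in O(D L^2) using the Woodbury identity:
    (U U^T + D)^{-1} = D^{-1} - D^{-1} U (I_L + U^T D^{-1} U)^{-1} U^T D^{-1}."""
    Dinv_b = b / d
    Dinv_U = U / d[:, None]
    small = np.eye(U.shape[1]) + U.T @ Dinv_U   # only an L x L system
    return Dinv_b - Dinv_U @ np.linalg.solve(small, U.T @ Dinv_b)

x = fast_inverse_solve(U, d, b)
```

Since only diagonal scalings, tall-thin matrix products, and an $L \times L$ solve appear, the cost is linear in $D$.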
The overall computational complexity of SLANG is $O(DL^2 + DLM)$ and its memory cost is $O(D(L + M))$. Both are linear in $D$ and $M$. The cost is quadratic in $L$, but since $L$ is small (e.g., 5 or 10), this only adds a small multiplicative constant to the runtime. SLANG thus reduces the cost of the update (2) significantly while preserving some posterior correlations.
4 Experiments
In this section, our goal is to present experimental results that support the following claims: (1) SLANG gives reasonable posterior approximations, and (2) SLANG performs well on standard benchmarks for Bayesian neural networks. We present evaluations on several LIBSVM datasets, the UCI regression benchmark, and MNIST classification. SLANG beats mean-field methods on almost all tasks considered and performs comparably to state-of-the-art methods. SLANG also converges faster than mean-field methods.
4.1 Bayesian Logistic Regression
Table 1: Bayesian logistic regression results (negative ELBO, test NLL, and symmetric KL divergence to the Full-Gaussian Exact method).

| Dataset    | Metric | MF EF | MF Hess. | MF Exact | SLANG L=1 | SLANG L=5 | SLANG L=10 | Full EF | Full Hess. | Full Exact |
|------------|--------|-------|----------|----------|-----------|-----------|------------|---------|------------|------------|
| Australian | ELBO   | 0.614 | 0.613    | 0.593    | 0.574     | 0.569     | 0.566      | 0.560   | 0.558      | 0.559      |
|            | NLL    | 0.348 | 0.347    | 0.341    | 0.342     | 0.339     | 0.338      | 0.340   | 0.339      | 0.338      |
|            | KL     | 2.240 | 2.030    | 0.195    | 0.033     | 0.008     | 0.002      | 0.000   | 0.000      | 0.000      |
| —          | ELBO   | 0.122 | 0.121    | 0.121    | 0.112     | 0.111     | 0.111      | 0.111   | 0.109      | 0.109      |
|            | NLL    | 0.095 | 0.094    | 0.094    | 0.092     | 0.092     | 0.092      | 0.092   | 0.091      | 0.091      |
|            | KL     | 8.019 | 9.071    | 7.771    | 0.911     | 0.842     | 0.638      | 0.637   | 0.002      | 0.000      |
| a1a        | ELBO   | 0.384 | 0.383    | 0.383    | 0.377     | 0.374     | 0.373      | 0.369   | 0.368      | 0.368      |
|            | NLL    | 0.339 | 0.339    | 0.339    | 0.339     | 0.339     | 0.339      | 0.339   | 0.339      | 0.339      |
|            | KL     | 2.590 | 2.208    | 1.295    | 0.305     | 0.173     | 0.118      | 0.014   | 0.000      | 0.000      |
| —          | ELBO   | 0.268 | 0.268    | 0.267    | 0.210     | 0.198     | 0.193      | 0.189   | 0.186      | 0.186      |
|            | NLL    | 0.139 | 0.139    | 0.138    | 0.132     | 0.132     | 0.131      | 0.131   | 0.130      | 0.130      |
|            | KL     | 7.684 | 7.188    | 7.083    | 1.492     | 0.755     | 0.448      | 0.180   | 0.001      | 0.000      |
We considered four benchmark datasets for our comparison: USPS 3vs5, Australian, Breast Cancer, and a1a. Details of the datasets are given in Table 8 in Appendix E.2, along with implementation details of the methods we compare to. We use L-BFGS [wright1999numerical] to compute the optimal full-Gaussian variational approximation that maximizes the ELBO, using the method described in Marlin et al. [marlin2011piecewise]. We refer to this optimal full-Gaussian variational approximation as the "Full-Gaussian Exact" method. We also compute the optimal mean-field Gaussian approximation and refer to it as "MF Exact".
Figure 1 shows a qualitative comparison of the estimated posterior means, variances, and covariances for the USPS 3vs5 dataset. The left-hand plot compares covariance approximations obtained with SLANG to the Full-Gaussian Exact method; only off-diagonal entries are shown. We see that the approximation becomes more accurate as the rank is increased. The plots on the right compare the means and variances. The means match closely for all methods, but the variance is heavily underestimated by the MF Exact method: the mean-field approximation estimates a high variance where Full-Gaussian Exact has a low variance, and vice versa. This "trend reversal" is due to the typical shrinking behavior of mean-field methods [turner2011two]. In contrast, SLANG corrects the trend-reversal problem even when $L = 1$. Similar results for other datasets are shown in Figure 7 in Appendix E.1.
The complete results for Bayesian logistic regression are summarized in Table 1, where we also compare to four additional methods called "Full-Gaussian EF", "Full-Gaussian Hessian", "Mean-Field EF", and "Mean-Field Hessian". The Full-Gaussian EF method is the natural-gradient update (2), which uses the EF matrix $\hat{G}(\theta)$, while the Full-Gaussian Hessian method uses the Hessian instead of the EF matrix (the updates are given in (12) and (13) in Appendix A). The last two methods are the mean-field versions of the Full-Gaussian EF and Full-Gaussian Hessian methods, respectively. We compare the negative ELBO, the test log-loss measured with the cross-entropy, and the symmetric KL divergence between the approximations and the Full-Gaussian Exact method. We report averages over 20 random 50%-50% training-test splits of the dataset. Variances and results for SLANG with additional values of $L$ are omitted here due to space constraints, but are reported in Table 6 in Appendix E.1.
We find that SLANG nearly always produces better approximations than the mean-field methods, even with $L = 1$. As expected, increasing $L$ improves the quality of the variational distribution found by SLANG according to all three metrics. We also note that the Full-Gaussian EF method has performance similar to the Full-Gaussian Hessian method, which indicates that the EF approximation may be acceptable for Bayesian logistic regression.
The left side of Figure 4 shows convergence results on the USPS 3vs5 and Breast Cancer datasets. The three methods SLANG(1, 2, 3) refer to SLANG with three increasing values of $L$. We compare these three SLANG methods to Mean-Field Hessian and Full-Gaussian Hessian. SLANG converges faster than the mean-field method, and matches the convergence of the full-Gaussian method as $L$ is increased.
4.2 Bayesian Neural Networks (BNNs)
An example for Bayesian neural networks on a synthetic regression dataset is given in Appendix F.1, where we illustrate the quality of SLANG's posterior covariance.
The right side of Figure 4 shows convergence results for Bayesian neural networks trained on the USPS 3vs5 and Breast Cancer datasets. Here, the three methods SLANG(1, 2, 3) again refer to SLANG with three increasing values of $L$. We compare SLANG to a mean-field method called Bayes by Backprop [blundell2015weight]. As in the Bayesian logistic regression experiment, SLANG converges much faster than the mean-field method. However, the ELBO convergence for SLANG shows that the optimization procedure does not necessarily converge to a local minimum. This issue does not appear to affect the test log-likelihood. While it might only be due to stochasticity, it is possible that the problem is exacerbated by the replacement of the Hessian with the EF matrix. We have not determined the specific cause, and it warrants further investigation in future work.
Table 2: Test RMSE and test log-likelihood on the UCI regression benchmarks (mean ± standard error).

| Dataset  | BBB (RMSE)  | Dropout (RMSE) | SLANG (RMSE) | BBB (log-lik.) | Dropout (log-lik.) | SLANG (log-lik.) |
|----------|-------------|----------------|--------------|----------------|--------------------|------------------|
| Boston   | 3.43 ± 0.20 | 2.97 ± 0.19    | 3.21 ± 0.19  | -2.66 ± 0.06   | -2.46 ± 0.06       | -2.58 ± 0.05     |
| Concrete | 6.16 ± 0.13 | 5.23 ± 0.12    | 5.58 ± 0.19  | -3.25 ± 0.02   | -3.04 ± 0.02       | -3.13 ± 0.03     |
| Energy   | 0.97 ± 0.09 | 1.66 ± 0.04    | 0.64 ± 0.03  | -1.45 ± 0.10   | -1.99 ± 0.02       | -1.12 ± 0.01     |
| Kin8nm   | 0.08 ± 0.00 | 0.10 ± 0.00    | 0.08 ± 0.00  | 1.07 ± 0.00    | 0.95 ± 0.01        | 1.06 ± 0.00      |
| Naval    | 0.00 ± 0.00 | 0.01 ± 0.00    | 0.00 ± 0.00  | 4.61 ± 0.01    | 3.80 ± 0.01        | 4.76 ± 0.00      |
| Power    | 4.21 ± 0.03 | 4.02 ± 0.04    | 4.16 ± 0.04  | -2.86 ± 0.01   | -2.80 ± 0.01       | -2.84 ± 0.01     |
| Wine     | 0.64 ± 0.01 | 0.62 ± 0.01    | 0.65 ± 0.01  | -0.97 ± 0.01   | -0.93 ± 0.01       | -0.97 ± 0.01     |
| Yacht    | 1.13 ± 0.06 | 1.11 ± 0.09    | 1.08 ± 0.06  | -1.56 ± 0.02   | -1.55 ± 0.03       | -1.88 ± 0.01     |
Next, we present results on the UCI regression datasets, which are common benchmarks for Bayesian neural networks [hernandez15pbp]. We repeat the setup used by Gal and Ghahramani [yarin16dropout], with the data splits available at https://github.com/yaringal/DropoutUncertaintyExps. Following their work, we use neural networks with one hidden layer of 50 hidden units and ReLU activation functions. We compare SLANG with $L = 1$ to the Bayes By Backprop (BBB) method [blundell2015weight] and the Bayesian dropout method of [yarin16dropout]. For the 5 smallest datasets, we used a minibatch size of 10 and 4 Monte-Carlo samples during training; for the 3 larger datasets, we used a minibatch size of 100 and 2 Monte-Carlo samples. More details are given in Appendix F.3. We report test RMSE and test log-likelihood in Table 2. SLANG with just one rank outperforms BBB on 7 out of 8 datasets for RMSE and on 5 out of 8 datasets for log-likelihood. Moreover, SLANG shows performance comparable to dropout.
Finally, we report results for classification on MNIST. We train a BNN with two hidden layers of 400 hidden units each. The training set consists of 50,000 examples, and the remaining 10,000 are used as a validation set. The test set is a separate set of 10,000 examples. We use SLANG with $L \in \{1, 2, 4, 8, 16, 32\}$. For each value of $L$, we choose the prior precision and the learning rate based on performance on the validation set. Further details can be found in Appendix F.4. The test accuracies are reported in Table 3 and compared to the results obtained in [blundell2015weight] using BBB. For SLANG, good performance can be obtained with a moderate $L$. Note that there might be small differences between our experimental setup and the one used in [blundell2015weight], since the BBB implementation is not publicly available; the results might therefore not be directly comparable. Nevertheless, SLANG appears to perform well compared to BBB.
Table 3: Test error on MNIST.

|            | BBB   | SLANG L=1 | L=2   | L=4   | L=8   | L=16  | L=32  |
|------------|-------|-----------|-------|-------|-------|-------|-------|
| Test Error | 1.82% | 2.00%     | 1.95% | 1.81% | 1.92% | 1.77% | 1.73% |
5 Conclusion
We consider the challenging problem of uncertainty estimation in large deep models. For such problems, it is infeasible to form a full-Gaussian approximation to the posterior distribution. We address this issue by estimating a Gaussian approximation whose covariance has a low-rank plus diagonal structure, and we propose an approximate natural-gradient algorithm to estimate this structured covariance matrix. Our method, called SLANG, relies only on back-propagated gradients to estimate the covariance structure, which is a desirable feature when working with deep models. Empirical results strongly suggest that our method is more accurate than mean-field methods. Moreover, we observe that, unlike mean-field methods, our method does not drastically shrink the marginal variances. Experiments also show that SLANG is highly flexible and that its accuracy can be improved by increasing the rank of the covariance's low-rank component. Finally, our method converges faster than mean-field methods and can sometimes converge as fast as VI methods that use a full-Gaussian approximation.
The experiments presented in this paper are restricted to feed-forward neural networks. This is partly because existing deep-learning software packages do not support individual gradient computations. Individual gradients, which are required in line 6 of Algorithm 1, must be manually implemented for other types of architectures. Further work is therefore necessary to establish the usefulness of our method on other network architectures.
SLANG is based on a natural-gradient method that employs the empirical Fisher approximation [khan2017vprop]. Our empirical results suggest that this approximation is reasonably accurate, but it is not clear whether this is always the case. It is important to investigate this issue, both theoretically and empirically, to gain a better understanding of the effect of the approximation.
During this work, we also found that comparing the quality of covariance approximations is a nontrivial task for deep neural networks. We believe that existing benchmarks are not sufficient to establish the quality of an approximate Bayesian inference method for deep models. An interesting and useful area of further research is the development of good benchmarks that better reflect the quality of posterior approximations. This will facilitate the design of better inference algorithms.
Acknowledgements
We thank the anonymous reviewers for their helpful feedback. We greatly appreciate useful discussions with Shunichi Amari (RIKEN), Rio Yokota (Tokyo Institute of Technology), Kazuki Oosawa (Tokyo Institute of Technology), Wu Lin (University of British Columbia), and Voot Tangkaratt (RIKEN). We are also thankful for the RAIDEN computing system and its support team at the RIKEN Center for Advanced Intelligence Project, which we used extensively for our experiments.
References
 [1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek Gordon Murray, Benoit Steiner, Paul A. Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensorflow: A system for largescale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI, pages 265–283, 2016.
 [2] Sungjin Ahn, Anoop Korattikara Balan, and Max Welling. Bayesian posterior sampling via stochastic gradient fisher scoring. In Proceedings of the 29th International Conference on Machine Learning, ICML, 2012.
 [3] Shunichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
 [4] Sivaram Ambikasaran and Michael O’Neil. Fast symmetric factorization of hierarchical matrices with applications. CoRR, abs/1405.0223, 2014.
 [5] Anoop Korattikara Balan, Vivek Rathod, Kevin P. Murphy, and Max Welling. Bayesian dark knowledge. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems, pages 3438–3446, 2015.
 [6] David Barber and Christopher M. Bishop. Ensemble learning for multilayer networks. In Advances in Neural Information Processing Systems 10, pages 395–401, 1997.
 [7] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. CoRR, abs/1505.05424, 2015.
 [8] Edward Challis and David Barber. Concave gaussian variational approximations for inference in largescale bayesian linear models. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS, pages 199–207, 2011.
 [9] Tianqi Chen, Emily B. Fox, and Carlos Guestrin. Stochastic gradient hamiltonian monte carlo. In Proceedings of the 31th International Conference on Machine Learning, ICML, pages 1683–1691, 2014.
 [10] Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33nd International Conference on Machine Learning, ICML, pages 1050–1059, 2016.
 [11] Ian J. Goodfellow. Efficient perexample gradient computations. CoRR, abs/1510.01799, 2015.
 [12] Alex Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems., pages 2348–2356, 2011.
 [13] Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.
 [14] José Miguel Hernández-Lobato and Ryan P. Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In Proceedings of the 32nd International Conference on Machine Learning, ICML, pages 1861–1869, 2015.
 [15] Matthew D. Hoffman, David M. Blei, Chong Wang, and John William Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347, 2013.
 [16] Tommi Jaakkola and Michael Jordan. A variational approach to Bayesian logistic regression problems and their extensions. In Proceedings of the Sixth International Workshop on Artificial Intelligence and Statistics, AISTATS, 1997.
 [17] Mohammad Emtiyaz Khan and Wu Lin. Conjugate-computation variational inference: Converting variational inference in non-conjugate models to inferences in conjugate models. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS, pages 878–887, 2017.
 [18] Mohammad Emtiyaz Khan and Didrik Nielsen. Fast yet simple natural-gradient descent for variational inference in complex models. CoRR, abs/1807.04489, 2018.
 [19] Mohammad Emtiyaz Khan, Didrik Nielsen, Voot Tangkaratt, Wu Lin, Yarin Gal, and Akash Srivastava. Fast and scalable Bayesian deep learning by weight-perturbation in Adam. In Proceedings of the 35th International Conference on Machine Learning, ICML, pages 2611–2620, 2018.
 [20] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
 [21] Christos Louizos and Max Welling. Structured and efficient variational deep learning with matrix Gaussian posteriors. In Proceedings of the 33rd International Conference on Machine Learning, ICML, pages 1708–1716, 2016.
 [22] Benjamin M. Marlin, Mohammad Emtiyaz Khan, and Kevin P. Murphy. Piecewise bounds for estimating Bernoulli-logistic latent Gaussian models. In Proceedings of the 28th International Conference on Machine Learning, ICML, pages 633–640, 2011.
 [23] James Martens. New perspectives on the natural gradient method. CoRR, abs/1412.1193, 2014.
 [24] Andrew C. Miller, Nicholas J. Foti, and Ryan P. Adams. Variational boosting: Iteratively refining posterior approximations. In Proceedings of the 34th International Conference on Machine Learning, ICML, pages 2420–2429, 2017.
 [25] Hannes Nickisch and Carl Edward Rasmussen. Approximations for binary Gaussian process classification. Journal of Machine Learning Research, 9(Oct):2035–2078, 2008.
 [26] Victor M.-H. Ong, David J. Nott, and Michael S. Smith. Gaussian variational approximation with a factor covariance structure. Journal of Computational and Graphical Statistics, 27(3):465–478, 2018.
 [27] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In Autodiff Workshop, during the Annual Conference on Neural Information Processing Systems, 2017.
 [28] Rajesh Ranganath, Sean Gerrish, and David M. Blei. Black box variational inference. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, AISTATS, pages 814–822, 2014.
 [29] Hippolyt Ritter, Aleksandar Botev, and David Barber. A scalable Laplace approximation for neural networks. In International Conference on Learning Representations, 2018.
 [30] Matthias W. Seeger. Bayesian model selection for support vector machines, Gaussian processes and other kernel classifiers. In Advances in Neural Information Processing Systems 12, pages 603–609, 1999.
 [31] Matthias W. Seeger. Gaussian covariance and scalable variational inference. In Proceedings of the 27th International Conference on Machine Learning, ICML, pages 967–974, 2010.
 [32] Umut Simsekli, Roland Badeau, A. Taylan Cemgil, and Gaël Richard. Stochastic quasi-Newton Langevin Monte Carlo. In Proceedings of the 33rd International Conference on Machine Learning, ICML, pages 642–651, 2016.
 [33] Shengyang Sun, Changyou Chen, and Lawrence Carin. Learning structured weight uncertainty in Bayesian neural networks. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS, pages 1283–1292, 2017.
 [34] Linda S. L. Tan and David J. Nott. Gaussian variational approximation with sparse precision matrices. Statistics and Computing, 28(2):259–275, 2018.
 [35] Michalis K. Titsias and Miguel Lázaro-Gredilla. Doubly stochastic variational Bayes for non-conjugate inference. In Proceedings of the 31st International Conference on Machine Learning, ICML, pages 1971–1979, 2014.
 [36] Richard E. Turner and Maneesh Sahani. Two problems with variational expectation maximisation for time-series models. In D. Barber, T. Cemgil, and S. Chiappa, editors, Bayesian Time Series Models, chapter 5, pages 109–130. Cambridge University Press, 2011.
 [37] Stephen J. Wright and Jorge Nocedal. Numerical Optimization. Springer New York, 1999.
 [38] Guodong Zhang, Shengyang Sun, David K. Duvenaud, and Roger B. Grosse. Noisy natural gradient as variational inference. In Proceedings of the 35th International Conference on Machine Learning, ICML, pages 5847–5856, 2018.
Appendix A Derivation of the VOGN Update
We can derive the VOGN update by using the Variational Online Newton (VON) method derived in Appendix D of [19]. The updates are given as follows:
(12)  $\mu_{t+1} = \mu_t - \beta_t \Sigma_{t+1} \left[ \hat{g}(\theta_t) + \lambda \mu_t \right]$
(13)  $\Sigma_{t+1}^{-1} = (1-\beta_t)\,\Sigma_t^{-1} + \beta_t \left[ \hat{H}(\theta_t) + \lambda I \right]$
where $t$ is the iteration number, $\beta_t$ is the learning rate, $\theta_t \sim \mathcal{N}(\theta \mid \mu_t, \Sigma_t)$, and $\hat{g}(\theta_t)$ and $\hat{H}(\theta_t)$ are the stochastic gradient and Hessian. Recall that the stochastic gradient and the Hessian are defined as follows:
(14)  $\hat{g}(\theta) := -\frac{N}{M} \sum_{i \in \mathcal{M}} g_i(\theta)$
(15)  $\hat{H}(\theta) := -\frac{N}{M} \sum_{i \in \mathcal{M}} \nabla^2_{\theta\theta} \log p(\mathcal{D}_i \mid \theta)$
where $g_i(\theta) := \nabla_\theta \log p(\mathcal{D}_i \mid \theta)$ is the backpropagated gradient obtained on the $i$'th data example, and $\mathcal{M}$ is a minibatch of $M$ data examples.
Dealing with Hessians can be difficult because they may not always be positive-definite and may produce invalid Gaussian approximations. Following [19], we approximate the Hessian by the Empirical Fisher (EF) matrix:
(16)  $\hat{H}(\theta) \approx \hat{G}(\theta) := \frac{N}{M} \sum_{i \in \mathcal{M}} g_i(\theta)\, g_i(\theta)^\top$
This is also known as the Generalized Gauss–Newton approximation. Using this approximation in (13) gives us the VOGN update of (2).
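For concreteness, the EF matrix in (16) can be assembled from the per-example gradients with a single matrix product. The numpy sketch below is purely illustrative (the function name and shapes are ours, not the paper's implementation); it stacks the $M$ gradients as rows of an $M \times D$ matrix:

```python
import numpy as np

def empirical_fisher(grads, N):
    """EF approximation (N / M) * sum_i g_i g_i^T, where the rows of
    `grads` (shape (M, D)) are the per-example gradients g_i."""
    M = grads.shape[0]
    return (N / M) * grads.T @ grads

rng = np.random.default_rng(0)
grads = rng.standard_normal((4, 3))   # M = 4 examples, D = 3 parameters
F = empirical_fisher(grads, N=100)    # D x D, symmetric positive semi-definite
```

By construction the result is symmetric and positive semi-definite, which is exactly why the EF is a convenient stand-in for a possibly indefinite Hessian.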
The VON update is an exact natural-gradient method and uses a single learning rate $\beta_t$. The VOGN update, on the other hand, is an approximate natural-gradient method because it uses the EF approximation. Due to this approximation, a single learning rate might not give good results, and it can be sensible to use different learning rates for the $\mu$ and $\Sigma^{-1}$ updates. In (2), we therefore use a different learning rate for the $\mu$ update (denoted by $\alpha_t$).
Appendix B An Alternative Low-Rank Update
We tried an alternative approach to learning the low-rank plus diagonal covariance approximation, which we call SLANG-OnlineEig. It forms the low-rank term in the precision approximation from an online estimate of the leading eigenvectors of the EF matrix $\hat{G}(\theta_t)$. We now describe this procedure in detail, present experimental results, and compare it with SLANG.
B.1 Approximating Natural Gradients by Online Estimation of the Eigendecomposition (SLANG-OnlineEig)
The following top-$L$ eigenvalue decomposition forms the basis of SLANG-OnlineEig:
(17)  $\hat{Q}_t \hat{\Lambda}_t \hat{Q}_t^\top \leftarrow \mathrm{eig}_L\!\left[\hat{G}(\theta_t)\right]$
We emphasize that SLANG-OnlineEig involves the decomposition of $\hat{G}(\theta_t)$, rather than the updated matrix $(1-\beta_t) U_t U_t^\top + \beta_t \hat{G}(\theta_t)$ as in SLANG. This is cheaper by a factor of $(L+M)/M$, which is a marginal difference as $L \ll M$ in most applications. To mimic the update of $\Sigma^{-1}$ in (2), we use the following "moving-average" update for $U_t$:
(18)  $U_{t+1} = (1-\beta_t)\, U_t + \beta_t\, \hat{Q}_t \hat{\Lambda}_t^{1/2}$
where $\beta_t$ is a scalar learning rate. $U_{t+1}$ is thus an online estimate of the weighted eigenvectors of the EF.
Similar to SLANG, the diagonal $D_t$ is updated to capture the curvature information lost in the projection of $\hat{G}(\theta_t)$ onto its top $L$ eigenvectors, i.e., the remaining eigenvectors:
(19)  $D_{t+1} = (1-\beta_t)\, D_t + \beta_t \left[ \mathrm{diag}\!\left(\hat{G}(\theta_t)\right) - \mathrm{diag}\!\left(\hat{Q}_t \hat{\Lambda}_t \hat{Q}_t^\top\right) + \lambda I \right]$
The updated covariance is then given by
(20)  $\Sigma_{t+1} = \left( U_{t+1} U_{t+1}^\top + D_{t+1} \right)^{-1}$
The final step of SLANG-OnlineEig is identical to equation (11):
(21)  $\mu_{t+1} = \mu_t - \alpha_t\, \Sigma_{t+1} \left[ \hat{g}(\theta_t) + \lambda \mu_t \right]$
SLANG-OnlineEig is amenable to the same algorithmic tools as SLANG, which were discussed in Section 3.1. Goodfellow's trick can be used to compute the Jacobian needed for the EF, and Algorithms 2 and 3 are available for fast covariance-vector products and fast sampling, respectively. The overall computational complexity is $O(DL^2 + DLM)$ and the memory cost is $O(D(L+M))$.
B.2 Comparison with SLANG
SLANG-OnlineEig has several promising properties. It has a slightly better computational complexity than SLANG, and its update closely resembles the natural-gradient update in (2). However, update (18) involves the product approximation
$U_{t+1} U_{t+1}^\top = \left[(1-\beta_t) U_t + \beta_t \hat{Q}_t \hat{\Lambda}_t^{1/2}\right]\left[(1-\beta_t) U_t + \beta_t \hat{Q}_t \hat{\Lambda}_t^{1/2}\right]^\top \approx (1-\beta_t)\, U_t U_t^\top + \beta_t\, \hat{G}(\theta_t),$
where the second line is the true product of interest. The update assumes that a moving average of the factors of the eigendecomposition well-approximates a moving average of the eigendecompositions themselves. It is not clear that this holds in practice. We note that a similar approximation also appears in noisy K-FAC, where it is used to estimate the Kronecker factors separately [38].
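The mismatch introduced by the product approximation is easy to demonstrate numerically. The following numpy sketch, with illustrative shapes and values only, compares the moving average of the factors against the moving average of the products:

```python
import numpy as np

rng = np.random.default_rng(0)
D, L, beta = 10, 2, 0.1
U = rng.standard_normal((D, L))   # current low-rank factor
V = rng.standard_normal((D, L))   # weighted eigenvectors of a new EF estimate

U_new = (1 - beta) * U + beta * V            # moving average of the factors
approx = U_new @ U_new.T                     # what the factor update actually produces
target = (1 - beta) * (U @ U.T) + beta * (V @ V.T)   # the product of interest

gap = np.linalg.norm(approx - target)        # nonzero: the two updates differ
```

Expanding the square shows why: the factor update produces $(1-\beta)^2 U U^\top + \beta(1-\beta)(UV^\top + VU^\top) + \beta^2 VV^\top$, which only matches the target when the cross terms vanish.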
SLANG does not require the product approximation for efficient covariance learning. Instead, the precision is updated exactly before the projection onto the space of rank-$L$ matrices. This is why SLANG with $L = D$ reduces to the update (2) with the EF approximation. SLANG-OnlineEig does not possess this property, as a direct result of the product approximation. It also does not necessarily match the precision diagonals obtained with the $L = D$ update, as SLANG does for all $L$. These are highly desirable properties.
A final issue is that SLANG-OnlineEig also requires matching stochastic eigenvector estimates $\hat{Q}_t$ to their corresponding online estimates in $U_t$. This may introduce additional approximation error when $\hat{G}(\theta_t)$ is highly stochastic.
B.3 Experimental Results for SLANG-OnlineEig
Table 4: Logistic regression results for SLANG-OnlineEig and SLANG.

                           SLANG-OnlineEig             SLANG
Dataset         Metric     L = 1   L = 2   L = 5     L = 1   L = 2   L = 5
Australian      ELBO       0.580   0.578   0.652     0.574   0.574   0.569
                NLL        0.344   0.343   0.347     0.342   0.342   0.339
                KL         0.057   0.032   0.012     0.033   0.031   0.008
Breast Cancer   ELBO       0.114   0.116   0.123     0.112   0.111   0.111
                NLL        0.092   0.093   0.093     0.092   0.092   0.092
                KL         1.544   2.128   4.402     0.911   0.756   0.842
a1a             ELBO       0.380   0.380   0.383     0.377   0.376   0.374
                NLL        0.339   0.339   0.339     0.339   0.339   0.339
                KL         0.351   0.293   0.253     0.305   0.249   0.173
USPS 3vs5       ELBO       0.210   0.208   0.210     0.210   0.206   0.198
                NLL        0.133   0.132   0.133     0.132   0.132   0.132
                KL         1.497   1.353   1.432     1.492   1.246   0.755
Table 4 compares SLANG and SLANG-OnlineEig for logistic regression on the Australian, Breast Cancer, USPS 3vs5, and a1a datasets from LIBSVM. The results presented for SLANG are identical to those in Table 1. SLANG always matches or beats the best results for SLANG-OnlineEig. Furthermore, as $L$ is increased, the quality of the posterior approximations computed by SLANG improves, while the SLANG-OnlineEig approximations sometimes degrade. We speculate that this is due to the product approximation.
Figure 5 presents results on the convergence of SLANG-OnlineEig for logistic regression and regression with a Bayesian neural network. We see that, despite the product approximation, SLANG-OnlineEig is able to learn a better structured approximation when $L$ is increased.
Table 5 shows regression on the UCI datasets using Bayesian neural networks. The setup for this experiment was similar to the experiment in Section 4.2, except that the learning rates were fixed across all datasets, both for SLANG-OnlineEig and for the Adam optimizer used for BBB. Moreover, the search spaces for the Bayesian optimization were fixed (using a normalized scale for the noise precision) and not adjusted to the individual datasets. Finally, BBB used 40 MC samples for the 5 smallest datasets and 20 MC samples for the 3 largest datasets in this experiment.
Table 5: Regression on UCI datasets using Bayesian neural networks.

                         Test RMSE                              Test log-likelihood
Dataset      BBB           Dropout       OnlineEig      BBB           Dropout       OnlineEig
Boston       4.04 ± 0.28   2.97 ± 0.19   3.17 ± 0.17    2.75 ± 0.07   2.46 ± 0.06   2.61 ± 0.06
Concrete     6.16 ± 0.14   5.23 ± 0.12   5.79 ± 0.13    3.22 ± 0.02   3.04 ± 0.02   3.19 ± 0.02
Energy       0.86 ± 0.04   1.66 ± 0.04   0.59 ± 0.01    1.20 ± 0.05   1.99 ± 0.02   1.05 ± 0.01
Kin8nm       0.09 ± 0.00   0.10 ± 0.00   0.08 ± 0.00    0.97 ± 0.01   0.95 ± 0.01   1.13 ± 0.00
Naval        0.00 ± 0.00   0.01 ± 0.00   0.00 ± 0.00    5.34 ± 0.07   3.80 ± 0.01   5.09 ± 0.08
Power        4.28 ± 0.03   4.02 ± 0.04   4.09 ± 0.04    2.87 ± 0.01   2.80 ± 0.01   2.83 ± 0.01
Wine         0.66 ± 0.01   0.62 ± 0.01   0.64 ± 0.01    0.99 ± 0.01   0.93 ± 0.01   0.98 ± 0.01
Yacht        1.07 ± 0.08   1.11 ± 0.09   0.72 ± 0.04    1.45 ± 0.03   1.55 ± 0.03   1.41 ± 0.01
Appendix C Additional Algorithmic Details for SLANG
The following section goes into more detail about the individual components of the algorithm required to leverage the low-rank plus diagonal structure. In general, we obtain efficient algorithms by operating only on $D \times L$ or $D \times M$ matrices. This avoids the $O(D^2)$ storage cost and the computational cost of working in the $D \times D$ space.
C.1 Fast Computation of Individual Gradients
Most deep-learning automatic differentiation packages, such as PyTorch [27] and TensorFlow [1], are optimized to return the overall gradient of a minibatch, not individual gradients for each example passed through the network. It is true that the naïve option of doing a separate forward and backward pass for each example has a computational complexity similar to that of a fully parallel version. In practice, however, most of the computations involved can either be reused across examples or sped up drastically by batching them. Implementations that use matrix-matrix multiplications instead of repeated matrix-vector operations are far more efficient on GPUs.
Ian Goodfellow’s note [11] outlines a method for efficiently computing per-example gradients. It suggests saving the neuron activations and the linear combinations of activations computed during the minibatch’s forward pass through the neural network. These values are then used in a manual implementation of the per-example gradients, which avoids the summation defined by the cost function. While much more efficient than the sequential approach, Goodfellow’s approach requires greater implementation effort: a separate implementation is required for each type of layer used. This is partly why the experiments presented here are limited to standard multi-layer perceptrons.
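To illustrate the idea, the numpy sketch below applies the trick to a single fully-connected layer with a squared-error loss (all names and shapes are illustrative, not the paper's implementation). The inputs saved during the forward pass and the backpropagated output gradients combine in one batched outer product to yield every per-example weight gradient at once:

```python
import numpy as np

rng = np.random.default_rng(0)
M, D_in, D_out = 5, 3, 2
X = rng.standard_normal((M, D_in))       # layer inputs saved during the forward pass
W = rng.standard_normal((D_in, D_out))   # layer weights
T = rng.standard_normal((M, D_out))      # regression targets

Y = X @ W                                # forward pass
G_out = Y - T                            # d(loss_i)/d(Y_i) for loss_i = 0.5 * ||Y_i - T_i||^2

# All M per-example weight gradients as one batched outer product: shape (M, D_in, D_out).
per_example_grads = np.einsum('mi,mo->mio', X, G_out)

# The usual minibatch gradient is just their sum, X^T (Y - T).
batch_grad = per_example_grads.sum(axis=0)
```

The batched `einsum` replaces $M$ separate backward passes with a single tensor contraction, which is the source of the speed-up on GPUs.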
C.2 Fast Top-$L$ Eigendecomposition
The goal is to compute the top $L$ eigenvalues and eigenvectors of the $D \times D$ matrix defined in (6). Since we do not want to compute this matrix explicitly, we use the low-rank structure of the update matrix to compute matrix-vector or matrix-matrix products in $O(D(L+M))$ time. This can be seen by rewriting the matrix as follows:
(22)  $(1-\beta_t)\, U_t U_t^\top + \beta_t\, \hat{G}(\theta_t) = \left[\sqrt{1-\beta_t}\, U_t \;\; \sqrt{\tfrac{\beta_t N}{M}}\, G_t\right] \left[\sqrt{1-\beta_t}\, U_t \;\; \sqrt{\tfrac{\beta_t N}{M}}\, G_t\right]^\top,$
where $G_t$ is the $D \times M$ matrix whose columns are the backpropagated gradients $g_i(\theta_t)$. These products can be used to compute the eigendecomposition efficiently by using a randomized algorithm.
The main idea behind the randomized eigendecomposition is to project a matrix $A$ onto a randomly selected subspace by sampling $K$ vectors $z_i \in \mathbb{R}^D$, each entry being selected uniformly at random, and computing $y_i = A z_i$, where $K$ is larger than $L$. A traditional eigendecomposition can then be performed on the smaller projected matrix to recover the top $L$ eigenvectors, with $K$ acting as a precision-computation trade-off parameter. More details on randomized eigenvalue methods can be found in [13].
Our implementation of this procedure closely follows Facebook’s Fast Randomized SVD (https://github.com/facebook/fbpca, https://research.fb.com/fastrandomizedsvd/). For stability, this method alternates between multiplying by the matrix and applying a QR decomposition to the result for a few iterations, similarly to Lanczos iterations. As all operations are done on the smaller $D \times K$ matrix, with $K \ll D$, the computational costs of the QR decomposition and the eigendecomposition are limited to $O(DK^2)$, leading to an overall $O(D(L+M)K + DK^2)$ algorithm.
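As an illustration of these steps, here is a minimal numpy sketch of a randomized top-$L$ eigensolver in this style. It is a simplified subspace iteration with QR re-orthogonalization, not fbpca itself; the function name, parameters, and the symmetric-PSD-input assumption are ours:

```python
import numpy as np

def randomized_eig(matmat, D, L, K=None, n_iter=4, seed=0):
    """Top-L eigenpairs of a symmetric PSD D x D matrix A, accessed only
    through `matmat`, which maps a (D, K) block V to A @ V. K > L trades
    extra cost for accuracy; QR keeps the iterated subspace well-conditioned."""
    K = K or L + 2
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(matmat(rng.standard_normal((D, K))))  # random projection
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(matmat(Q))                        # subspace iteration
    B = Q.T @ matmat(Q)                                       # small K x K problem
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:L]                          # top L, descending
    return vals[idx], Q @ vecs[:, idx]

# Example: A = F F^T is a rank-4 PSD matrix accessed only through products.
rng = np.random.default_rng(1)
F = rng.standard_normal((50, 4))
vals, vecs = randomized_eig(lambda V: F @ (F.T @ V), D=50, L=2)
```

Because the example matrix has rank 4 and $K = 4$, the random projection generically captures its full range, so the recovered eigenpairs match a dense eigendecomposition.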
C.3 Fast Multiplication by the Inverse of a Low-Rank + Diagonal Matrix
To apply the natural-gradient update in Eq. (11), we need to be able to multiply an arbitrary vector $v$ by $\Sigma_{t+1}$, whose inverse is given in low-rank + diagonal form, $\Sigma_{t+1}^{-1} = UU^\top + D$. We can use Woodbury’s identity to do so without forming the $D \times D$ matrix and performing the costly inversion. Simplified for our special case, the identity gives
(23)  $\left(UU^\top + D\right)^{-1} = D^{-1} - D^{-1} U \left(I_L + U^\top D^{-1} U\right)^{-1} U^\top D^{-1}.$
The only remaining inversions involve diagonal or $L \times L$ matrices. Correct ordering of the operations avoids the $O(D^2)$ storage cost when computing the product and ensures that we only need to store $L \times L$ or diagonal matrices,
(24)  $\left(UU^\top + D\right)^{-1} v = D^{-1}v - D^{-1}\left(U\left(\left(I_L + U^\top D^{-1} U\right)^{-1}\left(U^\top \left(D^{-1} v\right)\right)\right)\right),$
yielding an $O(DL^2)$ algorithm.
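As a sanity check on (23)–(24), the ordering can be written out in a short numpy sketch (illustrative names; $U$ is $D \times L$ and the diagonal $D$ is stored as a vector $d$):

```python
import numpy as np

def woodbury_solve(U, d, v):
    """Compute (U U^T + diag(d))^{-1} v without forming any D x D matrix,
    via the Woodbury identity; O(D L^2) for U of shape (D, L)."""
    Dinv_v = v / d                            # diag(d)^{-1} v
    Dinv_U = U / d[:, None]                   # diag(d)^{-1} U
    K = np.eye(U.shape[1]) + U.T @ Dinv_U     # small L x L system
    return Dinv_v - Dinv_U @ np.linalg.solve(K, U.T @ Dinv_v)

rng = np.random.default_rng(0)
D, L = 20, 3
U = rng.standard_normal((D, L))
d = rng.uniform(0.5, 2.0, size=D)
v = rng.standard_normal(D)
x = woodbury_solve(U, d, v)
```

Only the $L \times L$ matrix `K` is ever factorized, which is the entire point of the identity.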
C.4 Fast Sampling
To generate a sample from $\mathcal{N}(\mu, \Sigma)$, it is sufficient to generate a sample $\epsilon \sim \mathcal{N}(0, I)$ and compute $\theta = \mu + A^{-\top}\epsilon$, where $A A^\top = \Sigma^{-1}$. Assume that we have an efficient algorithm for computing a symmetric factorization for matrices of the form $I + WW^\top$. As $\Sigma^{-1}$ is given in low-rank + diagonal form,
(25)  $\Sigma^{-1} = UU^\top + D$
(26)  $\phantom{\Sigma^{-1}} = D^{1/2}\left(D^{-1/2} U U^\top D^{-1/2} + I\right) D^{1/2}$
(27)  $\phantom{\Sigma^{-1}} = D^{1/2}\left(I + WW^\top\right) D^{1/2}, \quad \text{where } W := D^{-1/2} U.$
Letting $C$ be a symmetric factor for $I + WW^\top$, we then have that $A = D^{1/2} C$ is a symmetric factor for $\Sigma^{-1}$. Such a factorization can be found using the work of [4], which showed that by taking
(28)  $C = I + W X W^\top, \quad \text{where } X := K^{-1}\left(\left(I_L + K\right)^{1/2} - I_L\right) \text{ and } K := W^\top W,$
$C$ is a symmetric factorization for $I + WW^\top$. We can then use Woodbury’s identity to avoid taking the inverse in the $D \times D$ space,
(29)  $C^{-1} = I - W \left(X^{-1} + K\right)^{-1} W^\top,$
and careful ordering of operations, as above, leads to an $O(DL^2)$ complexity.
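Putting (25)–(29) together, a minimal numpy sketch of the resulting multiplication by $A^{-\top}$ might look as follows (the helper name is ours; the $L \times L$ square root is taken via an eigendecomposition of $K$, and the diagonal is stored as a vector $d$):

```python
import numpy as np

def inv_symmetric_factor_apply(U, d, eps):
    """Apply A^{-T} to eps, where A A^T = U U^T + diag(d). Uses the symmetric
    square root C of I + W W^T (with W = diag(d)^{-1/2} U) and Woodbury's
    identity, so only L x L matrices are ever inverted."""
    d_sqrt = np.sqrt(d)
    W = U / d_sqrt[:, None]                       # W = D^{-1/2} U
    K = W.T @ W                                   # L x L
    # X = K^{-1}((I + K)^{1/2} - I), so C = I + W X W^T satisfies C^2 = I + W W^T.
    vals, vecs = np.linalg.eigh(K)
    X = vecs @ np.diag((np.sqrt(1 + vals) - 1) / vals) @ vecs.T
    # Woodbury: C^{-1} = I - W (X^{-1} + K)^{-1} W^T.
    middle = np.linalg.solve(np.linalg.inv(X) + K, W.T @ eps)
    return (eps - W @ middle) / d_sqrt            # A^{-T} eps = D^{-1/2} C^{-1} eps

rng = np.random.default_rng(0)
D, L = 15, 2
U = rng.standard_normal((D, L))
d = rng.uniform(0.5, 2.0, size=D)
eps = rng.standard_normal(D)
theta_step = inv_symmetric_factor_apply(U, d, eps)
```

Applying this map to $\epsilon \sim \mathcal{N}(0, I)$ yields samples with covariance $A^{-\top} A^{-1} = (A A^\top)^{-1} = \Sigma$, as required.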
Appendix D Diagonal Correction
In this section, we prove that the diagonal of the precision computed by SLANG with $L < D$ is identical to the diagonal computed with $L = D$.
Consider the precision $\hat{\Sigma}_{t+1}^{-1} = U_{t+1} U_{t+1}^\top + D_{t+1}$, where the diagonal and low-rank components are updated by (7), (9) and (10). Recall that when $L = D$,
$\hat{\Sigma}_{t+1}^{-1} = \Sigma_{t+1}^{-1},$
where $\Sigma_{t+1}^{-1}$ is the precision matrix updated by (2). Assume both methods use the same initial diagonal precision matrix and that they are updated by the same sequence of EF matrices $\hat{G}(\theta_t)$. Then we will show that at every iteration $t$,
$\mathrm{diag}\left(\hat{\Sigma}_t^{-1}\right) = \mathrm{diag}\left(\Sigma_t^{-1}\right).$
Proof:
$\hat{\Sigma}_0^{-1}$ and $\Sigma_0^{-1}$ are initialized as the same diagonal matrix, so the claim holds trivially at $t = 0$.
Assume that the claim holds at some iteration . The inductive hypothesis implies