Reliable training and estimation of variance networks
Abstract
We propose and investigate new complementary methodologies for estimating predictive variance networks in regression neural networks. We derive a locally aware minibatching scheme that results in sparse robust gradients, and show how to make unbiased weight updates to a variance network. Further, we formulate a heuristic for robustly fitting both the mean and variance networks post hoc. Finally, we take inspiration from posterior Gaussian processes and propose a network architecture with similar extrapolation properties to Gaussian processes. The proposed methodologies are complementary and individually improve upon baseline methods. Experimentally, we investigate the impact on predictive uncertainty on multiple datasets and tasks ranging from regression and active learning to generative modeling. Experiments consistently show significant improvements in predictive uncertainty estimation over state-of-the-art methods across tasks and datasets.
1 Introduction
The quality of mean predictions has dramatically increased in the last decade with the rediscovery of neural networks (LeCun et al., 2015). The predictive variance, however, has turned out to be a more elusive target, with established solutions being subpar. The general finding is that neural networks tend to make overconfident predictions (Guo et al., 2017), which can be harmful or offensive (Amodei et al., 2016). This may be explained by neural networks being general function estimators that do not come with principled uncertainty estimates. Another explanation is that variance estimation is a fundamentally different task from mean estimation, and that the tools for mean estimation perhaps do not generalize. We focus on the latter hypothesis within regression.
To illustrate the main practical problems in variance estimation, we consider a toy problem where data is generated with noise whose variance grows with the input $x$, and $x$ is drawn uniformly on an interval (Fig. 1). As is common, we perform maximum likelihood estimation of $p(y \mid x) = \mathcal{N}(y \mid \mu(x), \sigma^2(x))$, where $\mu$ and $\sigma^2$ are neural nets. While $\mu$ provides an almost perfect fit to the ground truth, $\sigma^2$ shows three problems:

1. $\sigma^2$ is significantly underestimated.

2. $\sigma^2$ is almost independent of $x$, i.e. it is near constant.

3. $\sigma^2$ does not increase outside the data support to capture the poor mean predictions.
These findings are general (Sec. 4), and alleviating them is the main purpose of the present paper. We find that this can be achieved by a combination of methods that 1) change the usual minibatching to be location aware; 2) only optimize the variance conditioned on the mean; and 3) enforce well-behaved interpolation and extrapolation of variances through a suitable network architecture. We empirically demonstrate that these new tools significantly improve on the state-of-the-art across datasets in tasks ranging from regression and active learning to generative modeling.
2 Related work
Gaussian processes (GPs) are well-known function approximators with built-in uncertainty estimates (Rasmussen and Williams, 2006). GPs are robust in settings with little data, and can model a rich class of functions with few hyperparameters. However, GPs are computationally intractable for large amounts of data and limited by the expressiveness of the chosen kernel. Advances like sparse and deep GPs (Snelson and Ghahramani, 2006; Damianou and Lawrence, 2013) partially alleviate this, but neural nets still tend to have more accurate mean predictions.
Uncertainty aware neural networks model the predictive mean and variance as two separate neural networks, often as multilayer perceptrons. This originates with the work of Nix and Weigend (1994) and Bishop (1994); today, the approach is commonly used for making variational approximations (Kingma and Welling, 2013; Rezende et al., 2014), and it is this general approach we investigate.
Bayesian neural networks (BNN) (MacKay, 1992) assume a prior distribution over the network parameters, and approximate the posterior distribution. This gives direct access to the approximate predictive uncertainty. In practice, placing an informative prior over the parameters is nontrivial. Even with advances in stochastic variational inference (Kingma and Welling, 2013; Rezende et al., 2014; Hoffman et al., 2013) and expectation propagation (HernándezLobato and Adams, 2015), it is still challenging to perform inference in BNNs.
Ensemble methods represent the current state-of-the-art. Monte Carlo (MC) Dropout (Gal and Ghahramani, 2016) measures the uncertainty induced by Dropout layers (Hinton et al., 2012), arguing that this is a good proxy for predictive uncertainty. Deep Ensembles (Lakshminarayanan et al., 2017) form an ensemble from multiple neural networks trained with different initializations. Both approaches obtain ensembles of correlated networks, and it is unclear to what extent this biases the predictive uncertainty. Alternatives include estimating confidence intervals instead of variances (Pearce et al., 2018), and gradient-based Bayesian model averaging (Maddox et al., 2019).
Applications of uncertainty include reinforcement learning, active learning, and Bayesian optimization (Szepesvári, 2010; Huang et al., 2010; Frazier, 2018). Here uncertainty is the crucial ingredient that allows for systematically trading off exploration and exploitation. It has also been shown that uncertainty is required to learn the topology of data manifolds (Hauberg, 2018).
The main categories of uncertainty are epistemic and aleatoric uncertainty (Kiureghian and Ditlevsen, 2009; Kendall and Gal, 2017). Aleatoric uncertainty is induced by unknown or unmeasured features and, hence, does not vanish in the limit of infinite data. Epistemic uncertainty is often referred to as model uncertainty, as it is the uncertainty due to model limitations. It is this type of uncertainty that Bayesian and ensemble methods generally estimate. We focus on the overall predictive uncertainty, which reflects both epistemic and aleatoric uncertainty.
3 Methods
The opening remarks (Sec. 1) highlighted three common problems that appear when $\mu$ and $\sigma^2$ are neural networks. In this section we analyze these problems and propose solutions.
Preliminaries.
We assume that datasets $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$ contain i.i.d. observations. The targets are assumed to be conditionally Gaussian, $y_i \mid x_i \sim \mathcal{N}(\mu(x_i), \sigma^2(x_i))$, where $\mu$ and $\sigma^2$ are continuous functions parametrized by $\theta$. The maximum likelihood estimate (MLE) of the variance of $N$ i.i.d. observations is $\hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^N (x_i - \hat{\mu})^2$, where $\hat{\mu}$ is the sample mean. This MLE does not exist based on a single observation, unless the mean is known, i.e. the mean is not a free parameter. When the data is Gaussian, the squared residuals $(x_i - \hat{\mu})^2$ are gamma distributed.
3.1 A local likelihood model analysis
By assuming that both $\mu$ and $\sigma^2$ are continuous functions, we are implicitly saying that $\mu(x)$ is correlated with $\mu(x + \delta)$ for sufficiently small $\delta$, and similarly for $\sigma^2$. Consider the local likelihood estimation problem (Loader, 1999; Tibshirani and Hastie, 1987) at a point $x_0$,
$$\ell(\theta; x_0) = \sum_{i=1}^N w_i(x_0)\, \log p_\theta(y_i \mid x_i), \qquad w_i(x_0) = K\big(\lVert x_i - x_0 \rVert\big), \quad (1)$$
where the kernel $K$ is a function that declines as $\lVert x_i - x_0 \rVert$ increases, implying that the local likelihood at $x_0$ depends on the points nearest to $x_0$. Notice that Eq. 1 reduces to the standard log-likelihood if $w_i(x_0) = 1$ for all $i$. Consider, with this $w_i$, a uniformly drawn subsample (i.e. a standard minibatch) of the data and its corresponding stochastic gradient of Eq. 1 with respect to $\theta$. If, for a point $x_j$, no points near it are in the subsample, then no other point will influence the gradient of $\sigma^2(x_j)$, which will point in the direction of the single-observation MLE; this is highly uninformative, as that MLE does not exist unless $\mu(x_j)$ is known. Local data scarcity thus implies that while we have sufficient data for fitting a mean, we, locally, have insufficient data for fitting a variance. Essentially, if a point is isolated in a minibatch, all information it carries goes to updating $\mu$ and none is present for $\sigma^2$.
If we do not use minibatches, we find that the gradients with respect to $\mu$ and $\sigma^2$ are both scaled by $\sigma^{-2}(x_i)$, meaning that points with small variances effectively have higher learning rates (Nix and Weigend, 1994). This implies a bias towards low-noise regions of the data.
3.2 Horvitz–Thompson adjusted stochastic gradients
We will now consider a solution to this problem within the local likelihood framework, which will give us a reliable, but biased, stochastic gradient for the usual (nonlocal) loglikelihood. We will then show how this can be turned into an unbiased estimator.
If we are to add local information to obtain more reliable gradients, we should choose a $w_i$ in Eq. 1 that reflects this. Assume for simplicity that $w_i(x_0) = 1$ if $x_i$ is among the $n$ nearest neighbors of $x_0$, and $0$ otherwise. The gradient of $\sigma^2$ will then be informative, as more than one observation will contribute to the local variance if $n$ is chosen appropriately. Accordingly, we suggest a practical minibatching algorithm that samples a random point and lets the minibatch consist of the $n$ nearest neighbors of that point (by convention, the nearest neighbor of a point is the point itself). In order to allow for more variability in a minibatch, we suggest to sample $m$ points uniformly, and then sample $k$ points among the $n$ nearest neighbors of each of the initially sampled points. Notice this is a more informative sample, as all units in the sample are likely to influence the same subset of parameters in $\sigma^2$, effectively increasing the degrees of freedom and hence the quality of the variance estimate. In the supplementary material, we provide a small empirical analysis of the sparsity and variation in gradients to emphasize the increase in information in the gradients of $\sigma^2$. For pseudo-code of this sampling scheme, see the supplementary material.
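A minimal sketch of such a locality sampler (our own illustrative implementation, not the paper's code; the brute-force neighbor search stands in for whatever neighbor structure is used in practice):

```python
import numpy as np

def locality_sampler(X, m, n, k, seed=None):
    """Locality sampler sketch: draw m anchor points uniformly, then
    draw k points among the n nearest neighbors of each anchor. By
    convention a point is its own nearest neighbor, so anchors can
    appear in their own minibatch. Names are illustrative."""
    rng = np.random.default_rng(seed)
    N = len(X)
    # Brute-force neighbor search; a KD-tree is preferable for large N.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    nn = np.argsort(dists, axis=1)[:, :n]      # (N, n) neighbor indices
    anchors = rng.choice(N, size=m, replace=False)
    batch = np.concatenate(
        [rng.choice(nn[a], size=k, replace=False) for a in anchors]
    )
    return np.unique(batch)                    # deduplicated minibatch
```

The returned batch contains at most $m \cdot k$ points, all concentrated around the $m$ anchors, so each minibatch carries several observations per local neighborhood.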
While such a minibatch would give rise to an informative stochastic gradient, it would not be an unbiased stochastic gradient of the (non-local) log-likelihood. This can, however, be adjusted by using the Horvitz–Thompson (HT) algorithm (Horvitz and Thompson, 1952), i.e. rescaling the log-likelihood contribution of each sample by its inclusion probability $\pi_i$. With this, an unbiased estimate of the log-likelihood (up to an additive constant) becomes
$$\hat{\ell}(\theta) = \sum_{i \in \mathcal{B}} \frac{1}{\pi_i}\, \log p_\theta(y_i \mid x_i), \quad (2)$$
where $\mathcal{B}$ denotes the minibatch. With the nearest-neighbor minibatching, the inclusion probabilities can be calculated as follows. Observation $i$ is in the sample if it is among the $n$ nearest neighbors of one of the $m$ initial points, each of which is chosen with probability $m/N$, i.e.
$$\pi_i = \frac{m}{N}\, \big|\{\, j : x_i \in \mathrm{NN}_n(x_j) \,\}\big|, \quad (3)$$
where $\mathrm{NN}_n(x_j)$ denotes the $n$ nearest neighbors of $x_j$.
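The inclusion probabilities and the HT-adjusted log-likelihood described by Eqs. 2 and 3 can be sketched as follows (illustrative code, not the paper's implementation; `inclusion_probabilities` counts, for each observation, how many neighborhoods contain it):

```python
import numpy as np

def inclusion_probabilities(nn, m):
    """Inclusion probability of each observation under the locality
    sampler: each anchor is drawn with probability m/N, and observation
    i enters the minibatch when it lies in an anchor's neighborhood.
    `nn` is an (N, n) integer array of nearest-neighbor indices."""
    N = nn.shape[0]
    n_covering = np.bincount(nn.ravel(), minlength=N)  # |{j : i in NN_n(j)}|
    return np.clip(m / N * n_covering, 1e-12, 1.0)

def ht_log_likelihood(log_p, batch, pi):
    """Horvitz-Thompson adjusted log-likelihood: minibatch terms are
    re-weighted by inverse inclusion probabilities, giving an unbiased
    estimate of the full-data log-likelihood."""
    return float(np.sum(log_p[batch] / pi[batch]))
```

Points that appear in many neighborhoods are sampled often, so their contribution is down-weighted accordingly; rarely covered points are up-weighted.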
3.3 Mean-variance split training
The most common training strategy is to first optimize $\mu$ assuming a constant $\sigma^2$, and then proceed to optimize $\mu$ and $\sigma^2$ jointly, i.e. a warm-up of $\mu$. As previously noted, the MLE of $\sigma^2$ does not exist when only a single observation is available and $\mu$ is unknown. However, the MLE does exist when $\mu$ is known, in which case it is $\hat{\sigma}^2(x_i) = (y_i - \mu(x_i))^2$, assuming that the continuity of $\sigma^2$ is not crucial. This observation suggests that the usual training strategy is suboptimal, as $\sigma^2$ is never optimized assuming $\mu$ is known. This is easily solved: we suggest never updating $\mu$ and $\sigma^2$ simultaneously, i.e. only optimizing $\sigma^2$ conditioned on $\mu$, and vice versa. This amounts to sequentially optimizing $\mu$ with $\sigma^2$ fixed and $\sigma^2$ with $\mu$ fixed, as under these conditional updates we may think of $\sigma^2$ and $\mu$ as known, respectively. We will refer to this as mean-variance split training (MV).
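As a toy illustration of the MV schedule, consider fitting a constant mean and a constant log-variance by gradient descent, alternating the two updates so that neither parameter moves while the other is being updated (the function name and the constant model are our own simplification; in the paper both quantities are neural networks):

```python
import numpy as np

def mv_split_training(y, steps=500, lr=0.1):
    """Mean-variance split training for the simplest possible model:
    a constant mean mu and a constant log-variance s. mu and s are
    never updated simultaneously: we alternate between updating mu
    with s fixed and updating s conditioned on mu."""
    mu, s = 0.0, 0.0                       # s = log sigma^2
    for t in range(steps):
        if t % 2 == 0:                     # mean step, variance fixed
            # gradient of the Gaussian NLL wrt mu: exp(-s) * (mu - y)
            mu -= lr * np.exp(-s) * np.mean(mu - y)
        else:                              # variance step, mean fixed
            # gradient wrt s: 0.5 * (1 - (y - mu)^2 * exp(-s))
            resid2 = np.mean((y - mu) ** 2)
            s -= lr * 0.5 * (1.0 - resid2 * np.exp(-s))
    return mu, np.exp(s)
```

With $\mu$ held fixed, the variance step descends toward the conditional MLE $\overline{(y - \mu)^2}$, which is exactly the "known mean" estimator discussed above.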
3.4 Variance network architecture
When $\sigma^2(x)$ is influenced by only a few observations, underestimation is still likely, due to the skewness of the gamma distribution of the squared residuals, whose mode lies below its mean. As always in a low-data regime, it is sensible to be Bayesian about it; hence, instead of point estimating $\sigma^2(x)$, we seek a distribution over it. Notice we are not imposing a prior on the network weights; we are training the parameters of a Bayesian model. We choose the inverse-Gamma distribution, as this is the conjugate prior of $\sigma^2$ when data is Gaussian. This means $\sigma^2(x) \sim \mathrm{IG}(\alpha(x), \beta(x))$, where $\alpha(x)$ and $\beta(x)$ are the shape and scale parameters of the inverse-Gamma, respectively. The log-likelihood is now calculated by integrating out $\sigma^2$:
$$p_\theta(y \mid x) = \int_0^\infty \mathcal{N}\big(y \mid \mu(x), \sigma^2\big)\, \mathrm{IG}\big(\sigma^2 \mid \alpha(x), \beta(x)\big)\, \mathrm{d}\sigma^2, \quad (4)$$
where $\alpha$ and $\beta$ are modeled as neural networks. Having an inverse-Gamma prior changes the predictive distribution to a located-scaled Student-t distribution with $\nu = 2\alpha(x)$ degrees of freedom, location $\mu(x)$ and scale $\sqrt{\beta(x)/\alpha(x)}$ (the explicit density can be found in the supplementary material). Further, the t-distribution is often used as a replacement for the Gaussian when data is scarce and the true variance is unknown, and it yields robust regression (Gelman et al., 2014; Lange et al., 1989). We let $\alpha$ and $\beta$ be neural networks that implicitly determine the degrees of freedom and the scaling of the distribution. Recall that the higher the degrees of freedom, the better the Gaussian approximation of the t-distribution.
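The marginal in Eq. 4 is available in closed form, which is what makes this construction practical. A sketch of the resulting negative log-likelihood, using the standard Normal-inverse-Gamma conjugacy result (the function name is ours; in the paper `mu`, `alpha`, `beta` would be network outputs at $x$):

```python
import math

def nig_marginal_nll(y, mu, alpha, beta):
    """Negative log-likelihood of the closed-form marginal of the
    Normal-inverse-Gamma model: a located-scaled Student-t with
    nu = 2*alpha degrees of freedom, location mu and squared scale
    beta/alpha."""
    nu = 2.0 * alpha
    scale2 = beta / alpha
    z2 = (y - mu) ** 2 / scale2
    log_pdf = (math.lgamma((nu + 1.0) / 2.0) - math.lgamma(nu / 2.0)
               - 0.5 * math.log(math.pi * nu * scale2)
               - (nu + 1.0) / 2.0 * math.log1p(z2 / nu))
    return -log_pdf
```

As $\alpha \to \infty$ with $\beta/\alpha$ fixed, the degrees of freedom grow and this loss recovers the usual Gaussian negative log-likelihood.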
If we evaluate the local log-likelihood (Eq. 1) at a point $x_0$ far away from all data points, then the weights $w_i(x_0)$ will all be near-zero (or exactly zero). Consequently, the local log-likelihood is approximately 0 regardless of the observed value $y$, which should be interpreted as a large entropy of $y$. Since we are working with Gaussian and t-distributed variables, whose entropy is an increasing function of the variance, we can recreate this behavior by letting the variance tend to an a priori determined value $\eta$ as $x$ tends away from the training data. Let $c_1, \ldots, c_K$ be points in input space that represent the training data, akin to inducing points in sparse GPs (Snelson and Ghahramani, 2006). Then define $\delta(x) = \min_k \lVert x - c_k \rVert$ and
$$\hat{\sigma}^2(x) = \big(1 - \upsilon(\delta(x))\big)\, \sigma^2(x) + \upsilon(\delta(x))\, \eta, \quad (5)$$
where $\upsilon$ is a surjective increasing function onto the unit interval. The variance estimate will then go to $\eta$ as $\delta(x) \to \infty$, at a rate determined by $\upsilon$. In practice, we choose $\upsilon$ to be a scaled-and-translated sigmoid, $\upsilon(\delta) = \mathrm{sigmoid}(a\delta + b)$, where the scale $a$ is a free parameter we optimize during training and the translation $b$ is chosen to ensure that $\upsilon$ is near zero at the training data. The inducing points are initialized with $k$-means and optimized during training. This choice of architecture is similar to that attained by posterior Gaussian processes when the associated covariance function is stationary; it is indeed the behavior of these established models that we aim to mimic with Eq. 5.
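A sketch of this extrapolating variance head (names such as `sigma2_net` and the particular blending form are our own reading of the construction above, not the paper's code):

```python
import numpy as np

def extrapolating_variance(x, inducing, sigma2_net, eta, a, b):
    """Variance architecture sketch: delta(x) is the distance to the
    nearest inducing point; a scaled-and-translated sigmoid upsilon
    blends the network variance with the prior level eta, so the
    estimate tends to eta away from the data."""
    delta = np.min(np.linalg.norm(inducing - x, axis=-1))
    upsilon = 1.0 / (1.0 + np.exp(-(a * delta + b)))  # increasing in delta
    return (1.0 - upsilon) * sigma2_net(x) + upsilon * eta
```

Near the inducing points the sigmoid is close to zero and the learned variance dominates; far away it saturates and the prior level takes over, mimicking the growing posterior variance of a stationary GP.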
4 Experiments
4.1 Regression
To test our methodologies we conduct multiple experiments in various settings. We compare our method to state-of-the-art methods for quantifying uncertainty: Bayesian neural networks (BNN) (Hernández-Lobato and Adams, 2015), Monte Carlo Dropout (MC-Dropout) (Gal and Ghahramani, 2016) and Deep Ensembles (EnsNN) (Lakshminarayanan et al., 2017). Additionally, we compare to two baseline methods: a standard mean-variance neural network (NN) (Nix and Weigend, 1994) and GPs (sparse GPs (SGP) when standard GPs are not applicable) (Rasmussen and Williams, 2006). We refer to our own method as Combined, since we apply all the methodologies described in Sec. 3. Implementation details and code can be found in the supplementary material.
Toy regression.
We first return to the toy problem of Sec. 1, where we consider 500 points drawn from the same generative process. In this example, the variance is heteroscedastic, and models should estimate larger variance for larger values of $x$. The results can be seen in Figs. 4.1 and 4.1 (the standard deviation plotted for Combined is the root mean of the inverse-Gamma). Our approach is the only one to capture the increasing variance, whereas e.g. MC-Dropout seems to increase the variance where the first derivative changes. This is most likely because MC-Dropout mostly quantifies model uncertainty.
Variance calibration.
To our knowledge, no benchmark for quantifying the quality of variance estimation exists. We propose a simple dataset with known uncertainty information. More precisely, we consider weather data from over 130 years (https://mrcc.illinois.edu/CLIMATE/Station/Daily/StnDyBTD2.jsp). Each day the maximum temperature is measured, and the uncertainty is then given as the variance in temperature over the 130 years. The fitted models can be seen in Fig. 2. Here we measure performance by the mean error in uncertainty, reported above each fit. We observe that our Combined model achieves the lowest error of all the models, tightly followed by EnsNN and GP. NN, BNN and MC-Dropout all severely underestimate the uncertainty.
Ablation study.
To determine the influence of each methodology from Sec. 3, we experimented with four UCI regression datasets (Fig. 3). We split our contributions in three: the locality sampler (LS), the mean-variance split (MV), and our Combined model, which includes the LS, the MV split, and the variance architecture. The results show that LS has the most impact on performance. Combined with MV, results improve further, and adding the variance architecture improves the results yet again. This indicates that the proposed methodologies are complementary.
UCI benchmark.
We now follow the experimental setup of Hernández-Lobato and Adams (2015), evaluating models on a number of regression datasets from the UCI machine learning database. In addition to the standard benchmark, we have added 4 datasets. Test set log-likelihoods can be seen in Table 1, and the corresponding RMSE scores can be found in the supplementary material.
Dataset  N  d  GP  SGP  NN  BNN  MC-Dropout  EnsNN  Combined
Boston  506  13  
Carbon  10721  7    
Concrete  1030  9  
Energy  768  8  
Kin8nm  8192  8    
Naval  11934  16      
Power plant  9568  3    
Protein  45730  9      
Superconduct  21263  81    
Wine (red)  1599  11  
Wine (white)  4898  11    
Yacht  308  7  
Year  515345  90     
Our Combined model performs best on 10 of the 13 datasets. For the small Boston and Yacht datasets, the standard GP performs best, in line with the experience that GPs perform well when data is scarce; on these datasets our model is the best-performing neural network. On the Energy and Protein datasets EnsNN performs best, closely followed by our Combined model. One clear advantage of our model compared to EnsNN is that we only need to train one model, whereas EnsNN needs to train five or more (see the supplementary material for training times for each model). The worst-performing model in all cases is the baseline NN model, which clearly indicates that the usual tools for mean estimation do not carry over to variance estimation.
Active learning.
The performance of active learning depends on the predictive uncertainty (Settles, 2009), so we use it to demonstrate the improvements induced by our method. We use the same network architectures and datasets as in the UCI benchmark. Each dataset is split into 20% train, 60% pool and 20% test. For each active learning iteration, we first train a model, evaluate the performance on the test set, and then estimate the uncertainty for all points in the pool. We then select the points with the highest variance (corresponding to the highest entropy (Houlsby et al., 2012)) and add them to the training set, acquiring a fixed fraction of the initial pool size per iteration. This is repeated 10 times, such that the last model is trained on 30% of the data. We repeat this on 10 random training-test splits to compute standard errors.
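The acquisition step described above can be sketched as follows (illustrative helper with our own names, not the paper's implementation):

```python
import numpy as np

def acquire_most_uncertain(train_idx, pool_idx, pool_var, n_acquire):
    """One active-learning acquisition round: move the n_acquire pool
    points with the largest predictive variance (equivalently, largest
    Gaussian entropy) into the training set."""
    order = np.argsort(pool_var)[::-1]          # most uncertain first
    chosen = pool_idx[order[:n_acquire]]
    new_train = np.concatenate([train_idx, chosen])
    new_pool = pool_idx[order[n_acquire:]]
    return new_train, new_pool
```

Better-calibrated variances directly change which points are acquired, which is why this task serves as a downstream probe of uncertainty quality.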
In Fig. 4, we visualize the evolution of average RMSE for each method during the data collection process for the Boston, Superconduct and Wine (white) datasets (all remaining UCI datasets are visualized in supplementary material). In general, we observe two trends. For some datasets we observe that our Combined model outperforms all other models, achieving significantly faster learning. This indicates that our model is better at predicting the uncertainty of the data in the pool set. On datasets where the sampling process does not increase performance, we are on par with other models.
4.2 Generative models
To show a broader application of our approach, we also explore it in the context of generative modeling. We focus on variational autoencoders (VAEs) (Kingma and Welling, 2013; Rezende et al., 2014), which are popular deep generative models. A VAE models the generative process:
$$p_\theta(x \mid z) = \mathcal{N}\big(x \mid \mu_\theta(z), \sigma^2_\theta(z)\big), \quad (6)$$
where $z \sim \mathcal{N}(0, I)$. This is trained by introducing a variational approximation $q_\phi(z \mid x)$ and then jointly training $\theta$ and $\phi$. For our purposes, it suffices to note that a VAE estimates both a mean and a variance function. Thus, using standard training methods, the same problems arise as in the regression setting. Mattei and Frellsen (2018) have recently shown that estimating a VAE is ill-posed unless the variance is bounded from below. In the literature, we often find that
1. Variance networks are avoided by using a Bernoulli distribution, even if data is not binary.
2. Optimizing VAEs with a Gaussian posterior is considerably harder than the Bernoulli case. To overcome this, the variance is often set to a constant. The consequence is that the log-likelihood reconstruction term in the ELBO collapses into an L2 reconstruction term.
3. Even though the generative process is given by Eq. 6, samples shown in the literature are often reduced to the mean $\mu_\theta(z)$. This is probably due to the wrong or meaningless variance term.
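The collapse in point 2 follows directly from the Gaussian log-density: with a constant variance $\sigma^2$ and data dimension $D$,

```latex
\log \mathcal{N}\big(x \mid \mu_\theta(z), \sigma^2 I\big)
  = -\frac{1}{2\sigma^2}\, \lVert x - \mu_\theta(z) \rVert^2
    - \frac{D}{2} \log\big(2\pi\sigma^2\big),
```

so the second term does not depend on $\theta$, and maximizing the reconstruction term is equivalent to minimizing the squared (L2) reconstruction error.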
We aim to fix this by training the posterior variance with our Combined method. We do not change the encoder variance and leave this to future study.
Artificial data.
We first evaluate the benefits of more reliable variance networks in VAEs on artificial data. We generate data inspired by the two-moons dataset (https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_moons.html), which we map into four dimensions. The mapping is thoroughly described in the supplementary material, and we emphasize that we have deliberately used mappings that MLPs struggle to learn; thus, with a low-capacity network, the only way to compensate is to learn a meaningful variance function.
In Fig. 6 we visualize the results by plotting pairs of output dimensions using 5000 generated samples (all pairwise combinations are shown in the supplementary material). We observe that samples from our CombVAE capture the data distribution in more detail than a standard VAE, whose variance seems to be underestimated, similar to the results from regression. The poor sample quality of a standard VAE can partially be explained by the arbitrariness of the decoder variance function away from data. In Fig. 5 we calculate the accumulated variance over a grid of latent points. We clearly see that for the standard VAE the variance is low where we have data and arbitrary away from data, whereas our method produces a low-variance region where the two half moons are and high variance away from data. We note that Arvanitidis et al. (2018) also dealt with the arbitrariness of the decoder variance; however, their method relies on post-fitting the variance, whereas ours is fitted during training. Additionally, Takahashi et al. (2018) also successfully modeled the decoder of a VAE as a Student t-distribution similar to our proposed method, but without the extrapolation and with a different training procedure.
                     MNIST            FashionMNIST     CIFAR10           SVHN
ELBO       VAE       2053.01 ± 1.60   1506.31 ± 2.71   1980.84 ± 3.32    3696.35 ± 2.94
           CombVAE   2152.31 ± 3.32   1621.29 ± 7.23   2057.32 ± 8.13    3701.41 ± 5.84
log p(x)   VAE       1914.77 ± 2.15   1481.38 ± 3.68   1809.43 ± 10.32   3606.28 ± 2.75
           CombVAE   2018.37 ± 4.35   1567.23 ± 4.82   1891.39 ± 20.21   3614.39 ± 7.91
Image data.
For our last set of experiments, we fit a standard VAE and our CombVAE to four datasets: MNIST, FashionMNIST, CIFAR10 and SVHN. We want to measure whether better variance estimation improves generative modeling. Details about the network architectures and training can be found in the supplementary material. Training set ELBO and test set log-likelihood can be viewed in Table 2. On all datasets we observe, on average, tighter bounds and higher log-likelihoods, indicating that we better fit the data distribution. We qualitatively observe (see Fig. 6) that the variance has a more local structure for CombVAE and that it reflects the underlying latent structure.
5 Discussion & Conclusion
While variance networks are commonly used for modeling the predictive uncertainty in regression and in generative modeling, there has been no systematic study of how to fit them to data. We have demonstrated that tools developed for fitting mean networks are subpar when applied to variance estimation. The key underlying issue appears to be that, where data is scarce, it is not feasible to estimate both a mean and a variance at the same time.
We have proposed a new minibatching scheme that samples locally to ensure that variances are better defined during model training. We have further argued that variance estimation is more meaningful when conditioned on the mean, which implies a change to the usual training procedure of joint mean-variance estimation. Finally, we have highlighted that variance networks need to extrapolate differently from mean networks, which implies architectural differences between such networks. We specifically propose a new architecture for variance networks that ensures variance extrapolation similar to posterior Gaussian processes with stationary priors.
Experimentally, we have demonstrated that the methods are complementary and provide significant improvements over the state-of-the-art. In particular, on benchmark data our method improves the test set log-likelihood without improving the RMSE, which demonstrates that the gain stems from better uncertainty estimates. Another indicator of improved uncertainty estimation is that our method speeds up active learning tasks compared to the state-of-the-art. Due to the similarities between active learning, Bayesian optimization, and reinforcement learning, we expect that our approach carries significant value to these fields as well. Furthermore, we have demonstrated that variational autoencoders can be improved through better generative variance estimation. Finally, we note that our approach is directly applicable alongside ensemble methods, which may further improve results.
Acknowledgments
This work was supported by a research grant (15334) from VILLUM FONDEN. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement n^{o} 757360).
References
 Abadi et al. [2015] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Largescale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.
 Amodei et al. [2016] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané. Concrete problems in ai safety. arXiv preprint arXiv:1606.06565, 2016.
 Arvanitidis et al. [2018] G. Arvanitidis, L. K. Hansen, and S. Hauberg. Latent space oddity: on the curvature of deep generative models. In International Conference on Learning Representations, 2018.
 Bishop [1994] C. M. Bishop. Mixture density networks. Technical report, Citeseer, 1994.
 Damianou and Lawrence [2013] A. Damianou and N. D. Lawrence. Deep gaussian processes. Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS), 2013.
 Frazier [2018] P. I. Frazier. A tutorial on bayesian optimization. arXiv preprint arXiv:1807.02811, 2018.
 Gal and Ghahramani [2016] Y. Gal and Z. Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international Conference on Machine Learning, pages 1050–1059, 2016.
 Gelman et al. [2014] A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin. Bayesian Data Analysis. CRC Press, 2014.
 GPy [since 2012] GPy. GPy: A gaussian process framework in python. http://github.com/SheffieldML/GPy, since 2012.
 Guo et al. [2017] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1321–1330. JMLR.org, 2017.
 Hauberg [2018] S. Hauberg. Only bayes should learn a manifold (on the estimation of differential geometric structure from data). arXiv preprint arXiv:1806.04994, 2018.
 Hernández-Lobato and Adams [2015] J. M. Hernández-Lobato and R. Adams. Probabilistic backpropagation for scalable learning of bayesian neural networks. In International Conference on Machine Learning, pages 1861–1869, 2015.
 Hinton et al. [2012] G. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint, 07 2012.
 Hoffman et al. [2013] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
 Horvitz and Thompson [1952] D. G. Horvitz and D. J. Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663–685, 1952.
 Houlsby et al. [2012] N. Houlsby, F. Huszar, Z. Ghahramani, and J. M. Hernández-Lobato. Collaborative gaussian processes for preference learning. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 2096–2104. Curran Associates, Inc., 2012.
 Huang et al. [2010] S.J. Huang, R. Jin, and Z.H. Zhou. Active learning by querying informative and representative examples. In Advances in neural information processing systems, pages 892–900, 2010.
 Kendall and Gal [2017] A. Kendall and Y. Gal. What uncertainties do we need in bayesian deep learning for computer vision? In Advances in neural information processing systems, pages 5574–5584, 2017.
 Kingma and Ba [2015] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 79, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980.
 Kingma and Welling [2013] D. P. Kingma and M. Welling. Autoencoding variational bayes. ICLR, 12 2013.
 Kiureghian and Ditlevsen [2009] A. D. Kiureghian and O. Ditlevsen. Aleatory or epistemic? does it matter? Structural Safety, 31(2):105 – 112, 2009. Risk Acceptance and Risk Communication.
 Lakshminarayanan et al. [2017] B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402–6413, 2017.
 Lange et al. [1989] K. L. Lange, R. J. A. Little, and J. M. G. Taylor. Robust statistical modeling using the t distribution. Journal of the American Statistical Association, 84(408):881–896, 1989.
 LeCun et al. [2015] Y. LeCun, Y. Bengio, and G. E. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
 Loader [1999] C. Loader. Local Regression and Likelihood. Springer, New York, 1999.
 MacKay [1992] D. J. C. MacKay. A Practical Bayesian Framework for Backpropagation Networks. Neural Comput., 4(3):448–472, may 1992.
 Maddox et al. [2019] W. Maddox, T. Garipov, P. Izmailov, D. Vetrov, and A. G. Wilson. A Simple Baseline for Bayesian Uncertainty in Deep Learning. CoRR, feb 2019.
 Mattei and Frellsen [2018] P.-A. Mattei and J. Frellsen. Leveraging the exact likelihood of deep latent variable models. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pages 3859–3870, USA, 2018. Curran Associates Inc.
 Nix and Weigend [1994] D. Nix and A. Weigend. Estimating the mean and variance of the target probability distribution. In Proc. 1994 IEEE Int. Conf. Neural Networks, pages 55–60 vol.1. IEEE, 1994.
 Paszke et al. [2017] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.
 Pearce et al. [2018] T. Pearce, M. Zaki, A. Brintrup, and A. Neely. High-Quality Prediction Intervals for Deep Learning: A Distribution-Free, Ensembled Approach. In Proceedings of the 35th International Conference on Machine Learning, feb 2018.
 Rasmussen and Williams [2006] C. E. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. University Press Group Limited, 2006.
 Rezende et al. [2014] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In E. P. Xing and T. Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1278–1286, Bejing, China, 22–24 Jun 2014. PMLR.
 Settles [2009] B. Settles. Active learning literature survey. Technical report, University of Wisconsin–Madison Department of Computer Sciences, 2009.
 Snelson and Ghahramani [2006] E. Snelson and Z. Ghahramani. Sparse gaussian processes using pseudoinputs. In Advances in neural information processing systems, pages 1257–1264, 2006.
 Sønderby et al. [2016] C. K. Sønderby, T. Raiko, L. Maaløe, S. K. Sønderby, and O. Winther. Ladder variational autoencoders. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 3738–3746. Curran Associates, Inc., 2016. URL http://papers.nips.cc/paper/6275laddervariationalautoencoders.pdf.
 Szepesvári [2010] C. Szepesvári. Algorithms for reinforcement learning. Synthesis lectures on artificial intelligence and machine learning, 4(1):1–103, 2010.
 Takahashi et al. [2018] H. Takahashi, T. Iwata, Y. Yamanaka, M. Yamada, and S. Yagi. Student-t variational autoencoder for robust density estimation. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pages 2696–2702. International Joint Conferences on Artificial Intelligence Organization, 7 2018.
 Tibshirani and Hastie [1987] R. Tibshirani and T. Hastie. Local likelihood estimation. Journal of the American Statistical Association, 82(398):559–567, 1987.
 Wen et al. [2018] Y. Wen, P. Vicol, J. Ba, D. Tran, and R. Grosse. Flipout: Efficient pseudo-independent weight perturbations on minibatches. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rJNpifWAb.
Appendix A Further results
Gradient experiments
In Fig. 7 we plot the sparsity index and the variance of the gradient for both the mean function (top row) and the variance function (bottom row), for both standard minibatching and our proposed locality sampler. Sparsity is measured as , and the sparsity index is then given by . For the mean function, the sparsity index and variance are similar for the two methods, indicating that, as expected, our locality sampler does not improve the fitting of the mean function. For the variance function, however, we see a clear gap in both sparsity and variance, indicating that our locality sampler gives more local and stable updates to variance networks.
UCI benchmark (RMSE)
In Table 3 we report the test-set RMSE on the UCI regression benchmark. All neural-network-based methods achieve nearly identical RMSE on all datasets, indicating that the mean function is trained equally well by all methods.
Table 3: Test-set RMSE on the UCI benchmark datasets (mean ± standard deviation). N is the number of data points and d the input dimensionality.

| Dataset | N | d | GP | SGP | NN | BNN | MCDropout | EnsNN | Combined |
|---|---|---|---|---|---|---|---|---|---|
| Boston | 506 | 13 | 2.79±0.52 | 2.98±0.55 | 4.45±1.41 | 3.45±0.87 | 3.01±0.99 | 3.33±1.33 | 3.11±0.35 |
| Carbon | 10721 | 7 | – | 1.01±0.01 | 0.41±0.0 | 0.18±0.1 | 0.29±0.0 | 0.41±0.0 | 0.35±0.01 |
| Concrete | 1030 | 9 | 6.03±0.59 | 6.45±0.64 | 7.71±1.32 | 5.78±0.21 | 5.33±0.65 | 5.65±0.55 | 5.75±0.41 |
| Energy | 768 | 8 | 1.98±0.76 | 2.12±0.56 | 1.67±0.44 | 1.89±0.04 | 1.69±0.19 | 2.13±0.46 | 1.70±0.21 |
| Kin8nm | 8192 | 8 | – | 0.08±0.0 | 0.21±0.01 | 0.18±0.02 | 0.12±0.01 | 0.01±0.01 | 0.12±0.01 |
| Naval | 11934 | 16 | – | – | 0.01±0.0 | 0.01±0.0 | 0.01±0.0 | 0.01±0.0 | 0.01±0.0 |
| Power plant | 9568 | 3 | – | 4.65±0.12 | 4.23±0.33 | 4.12±0.45 | 4.13±0.13 | 4.11±0.21 | 4.12±0.13 |
| Protein | 45730 | 9 | – | – | 4.38±0.07 | 4.67±0.94 | 4.19±0.08 | 4.36±0.07 | 4.52±0.19 |
| Superconduct | 21263 | 81 | – | 11.32±0.38 | 11.73±0.46 | 11.07±1.7 | 11.44±0.39 | 11.63±0.49 | 11.65±0.65 |
| Wine (red) | 1599 | 11 | 0.88±0.06 | 0.65±0.04 | 0.66±0.06 | 0.69±0.41 | 0.64±0.06 | 0.67±0.06 | 0.68±0.11 |
| Wine (white) | 4898 | 11 | – | 0.65±0.03 | 0.67±0.04 | 0.68±0.32 | 0.71±0.04 | 0.78±0.04 | 0.72±0.09 |
| Yacht | 308 | 7 | 0.42±0.21 | 0.72±0.21 | 1.63±0.61 | 1.05±0.11 | 1.11±0.48 | 1.58±0.58 | 1.27±0.11 |
| Year | 515345 | 90 | – | – | 12.47±0.96 | 9.01±0.45 | 8.92±0.23 | 8.88±0.13 | 8.85±0.05 |
Timings of models
In Table 4 we show the average computation time for each model. The experiments were conducted with an Intel Xeon E5-2620 v4 CPU and an Nvidia GTX TITAN X GPU. We note that our model suffers from long computation times on very large datasets, mainly due to the construction of the neighborhood graph in the locality sampler. This could be reduced by using fast approximate k-nearest-neighbor methods and by reducing the data dimensionality, e.g. with PCA.
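As a concrete illustration of this speed-up, the neighborhood graph could be built on a PCA projection of the data before the k-nearest-neighbor search. The sketch below uses plain NumPy with brute-force distances and illustrative names (it is not the paper's implementation); on very large datasets an approximate k-NN library would replace the argsort step.

```python
import numpy as np

def knn_graph(x, k, n_components=None):
    """Build a k-nearest-neighbor list for each point, optionally after a
    PCA projection to `n_components` dimensions to cut the cost of the
    distance computations."""
    if n_components is not None:
        xc = x - x.mean(axis=0)
        _, _, vt = np.linalg.svd(xc, full_matrices=False)
        x = xc @ vt[:n_components].T            # PCA projection
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    # k nearest neighbors of each point, excluding the point itself
    return np.argsort(d2, axis=1)[:, 1:k + 1]

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 81))                  # e.g. Superconduct has 81 features
nbrs = knn_graph(x, k=10, n_components=5)
assert nbrs.shape == (200, 10)
```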
Table 4: Average computation time for each model (mean ± standard deviation).

| Dataset | N | d | GP | SGP | NN | BNN | MCDropout | EnsNN | Combined |
|---|---|---|---|---|---|---|---|---|---|
| Boston | 506 | 13 | 8.37±2.92 | 91.86±30.12 | 94.04±2.24 | 81.32±1.83 | 98.08±2.0 | 479.1±12.49 | 93.39±1.82 |
| Carbon | 10721 | 7 | – | 192.57±72.28 | 90.05±3.4 | 80.95±2.04 | 98.62±1.85 | 439.45±23.01 | 123.61±2.84 |
| Concrete | 1030 | 9 | 7.48±1.2 | 173.73±4.0 | 92.91±4.17 | 81.02±2.04 | 97.94±1.84 | 468.94±10.81 | 97.65±6.4 |
| Energy | 768 | 8 | 10.48±4.12 | 121.39±52.2 | 92.91±2.14 | 80.92±2.14 | 97.86±2.13 | 475.04±12.61 | 93.01±1.29 |
| Kin8nm | 8192 | 8 | – | 1526.29±20.28 | 92.22±4.65 | 80.65±2.11 | 97.85±2.87 | 459.81±27.66 | 123.15±2.86 |
| Naval | 11934 | 16 | – | 9.79±0.23 | 91.2±3.28 | 82.97±1.84 | 98.64±1.89 | 482.74±8.54 | 136.25±3.29 |
| Power plant | 9568 | 3 | – | 1267.82±783.28 | 92.23±3.28 | 81.29±1.98 | 98.26±1.87 | 472.39±9.73 | 118.26±1.86 |
| Protein | 45730 | 9 | – | – | 138.49±2.51 | 124.72±1.86 | 140.73±3.32 | 707.69±11.02 | 658.63±11.75 |
| Superconduct | 21263 | 81 | – | 313.9±2.9 | 95.27±3.43 | 90.71±1.86 | 90.25±1.67 | 477.57±11.91 | 235.72±3.8 |
| Wine (red) | 1599 | 11 | 35.33±19.09 | 262.69±10.14 | 93.51±2.09 | 80.67±2.04 | 97.78±2.58 | 416.02±23.9 | 130.61±7.91 |
| Wine (white) | 4898 | 11 | – | 781.13±8.2 | 91.97±2.41 | 80.94±1.75 | 97.61±2.73 | 451.65±19.04 | 99.44±1.33 |
| Yacht | 308 | 7 | 0.93±0.32 | 22.74±11.98 | 92.82±3.47 | 81.0±1.92 | 98.27±1.68 | 422.9±35.85 | 105.33±2.32 |
| Year | 515345 | 90 | – | – | 139.45±6.15 | 643.96±4.17 | 77.96±2.09 | 725.27±20.08 | 2453.62±18.7 |
Ablation study (RMSE)
In Fig. 8 we plot the RMSE for different combinations of our methodologies. Since the RMSE depends only on how well the mean function is fitted, the differences in log-likelihood that we observe between the models (see paper) must be explained by how well the variance function is fitted.
Active learning
In Figs. 9 and 10 we show, respectively, the progress of RMSE and log-likelihood on all 13 datasets. We observe that on some of the datasets (Boston, Superconduct, Power plant) our proposed Combined model learns faster than the other methods. On all other datasets it performs on par with the best-performing model.
Generative modeling toy data
In Fig. 12 we show the marginal distributions, pairwise scatter plots and pairwise joint distributions for the artificial dataset used in the generative setting. The top row shows the ground-truth data, the middle row shows reconstructions and samples from a standard VAE, and the bottom row shows reconstructions and samples from our proposed CombVAE model. The reconstructions are similar for the two models, but the quality of the generative samples is much better for CombVAE.
Generative modeling of image data
In Fig. 11 we show a grid of samples from the VAE and CombVAE on the MNIST dataset. We clearly see how the proper extrapolation of the variance in CombVAE can be used to "mask" whether we are inside or outside the data region. For the standard VAE we observe a near-constant variance added to the images.
Appendix B On the parametrization of the t-distributed predictive distribution
We parametrize the Student-t distribution by letting the variance follow an inverse-Gamma distribution with shape parameter $\alpha$ and scale parameter $\beta$. We use that if $\sigma^2 \sim \Gamma^{-1}(\alpha, \beta)$, then $1/\sigma^2$ follows a Gamma distribution with shape $\alpha$ and rate $\beta$. Then
\begin{align}
p(y \mid \mu, \alpha, \beta)
&= \int_0^\infty \mathcal{N}(y \mid \mu, \sigma^2)\, \Gamma^{-1}(\sigma^2 \mid \alpha, \beta)\, \mathrm{d}\sigma^2 \\
&= \int_0^\infty \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(y-\mu)^2}{2\sigma^2} \right) \frac{\beta^\alpha}{\Gamma(\alpha)} (\sigma^2)^{-\alpha-1} \exp\!\left( -\frac{\beta}{\sigma^2} \right) \mathrm{d}\sigma^2 \\
&= \frac{\beta^\alpha}{\Gamma(\alpha)\sqrt{2\pi}} \int_0^\infty z^{\alpha - \frac{1}{2}} \exp\!\left( -z\left( \beta + \frac{(y-\mu)^2}{2} \right) \right) \mathrm{d}z \\
&= \frac{\Gamma\!\left(\alpha + \frac{1}{2}\right)}{\Gamma(\alpha)\sqrt{2\pi\beta}} \left( 1 + \frac{(y-\mu)^2}{2\beta} \right)^{-\left(\alpha + \frac{1}{2}\right)},
\end{align}
where we substituted $z = 1/\sigma^2$ and used that the remaining integral is a Gamma integral. This is a Student-t density with $\nu = 2\alpha$ degrees of freedom, location $\mu$, and scale $\sqrt{\beta/\alpha}$.
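The closed form can be checked numerically. The sketch below (with arbitrary example values for $\mu$, $\alpha$, $\beta$) integrates the Gaussian likelihood against the inverse-Gamma density and compares the result with the Student-t density with $\nu = 2\alpha$ degrees of freedom, location $\mu$, and scale $\sqrt{\beta/\alpha}$.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Arbitrary example values for the mean and the inverse-Gamma parameters.
mu, alpha, beta = 0.5, 3.0, 2.0

def integrand(s2, y):
    # N(y | mu, s2) * InvGamma(s2 | alpha, beta)
    return (stats.norm.pdf(y, loc=mu, scale=np.sqrt(s2))
            * stats.invgamma.pdf(s2, a=alpha, scale=beta))

y = 1.3
numeric, _ = quad(integrand, 0, np.inf, args=(y,))

# Closed form: Student-t with nu = 2*alpha, location mu, scale sqrt(beta/alpha).
closed = stats.t.pdf(y, df=2 * alpha, loc=mu, scale=np.sqrt(beta / alpha))

assert np.isclose(numeric, closed, atol=1e-6)
```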
Appendix C Parameters of the locality sampler
In Algorithm 1 a pseudo-implementation of our proposed locality sampler is shown. The two important parameters of this algorithm are the number of primary sampling units (psu) and secondary sampling units (ssu). In Figs. 13 and 14 we visualize the effect of these two parameters.
For all our experiments we set psu = 3 and ssu = 40 when training the mean function, and psu = 1 and ssu = 10 when training the variance function.
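A minimal sketch of the sampler, assuming a precomputed neighbor list sorted by distance; the function names and the brute-force neighbor construction are illustrative, not the paper's implementation:

```python
import numpy as np

def locality_sampler(neighbors, psu, ssu, rng):
    """Draw one minibatch: pick `psu` primary sampling units at random,
    then take each primary point's `ssu` nearest neighbors as the
    secondary sampling units."""
    n = len(neighbors)
    primaries = rng.choice(n, size=psu, replace=False)
    batch = set()
    for p in primaries:
        batch.update(neighbors[p][:ssu])   # ssu nearest neighbors of p
    return np.array(sorted(batch))

# Toy data: 1D points with brute-force neighbor lists sorted by distance.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
dist = np.abs(x[:, None] - x[None, :])
neighbors = np.argsort(dist, axis=1)       # row i: indices sorted by distance to x[i]

batch = locality_sampler(neighbors, psu=3, ssu=10, rng=rng)
assert 10 <= len(batch) <= 30              # overlapping neighborhoods may shrink the batch
```

Because each minibatch is a union of small neighborhoods, gradient updates only touch parameters relevant to a local region, which is what produces the sparser, lower-variance variance-network gradients reported in Appendix A.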
Appendix D Implementation details for regression experiments
All neural-network-based models were implemented in PyTorch (Paszke et al., 2017), except for the Bayesian neural network, which was implemented in TensorFlow (Abadi et al., 2015). GP models were implemented in GPy (GPy, since 2012). Below we state the details for all models:
 GP

Fitted using an ARD kernel and the default settings of GPy.
 SGP

Fitted using an ARD kernel and the default settings of GPy. The number of inducing points was set to .
 NN

The model uses two networks, one for the predictive mean and one for the predictive variance. The model was trained to optimize the log-likelihood of the data.
 BNN

We use a standard factored Gaussian prior for the weights and use the Flipout approximation (Wen et al., 2018) for the variational approximation.
 MCDropout

The model uses a single network, with dropout on all weights. The dropout rate was set to 0.05. The model was trained to optimize the RMSE of the data.
 EnsNN

The model consists of an ensemble of 5 individual NN models, each modeled as two separate networks. Each is trained to optimize the log-likelihood of the data. The only difference between ensemble members is the initialization.
 Combined

The model uses three networks: one for the mean function, one for the α parameter and one for the β parameter. We set the number of inducing points to . The parameter of the scaled-and-translated sigmoid function is initialized to 1.5 and minimized during training. The model was trained to optimize the log-likelihood of the data.
For each neural-network-based approach, we follow the experimental setup of Hernández-Lobato and Adams (2015). All individual networks were modeled as single-hidden-layer MLPs with 50 neurons for all datasets other than "Protein" and "Year", where we use 100 neurons. Except at the outputs, the activation function is ReLU. No activation function is applied at the output of the mean networks, while the Softplus activation is applied at the output of the variance networks to ensure positive variance. All neural-network-based models were trained using the Adam optimizer (Kingma and Ba, 2015) with a learning rate of and a batch size of 256. All models were trained for 10,000 iterations.
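The resulting baseline architecture can be sketched as follows. This is a minimal NumPy forward pass with random stand-in weights, illustrating the single-hidden-layer structure with a Softplus variance head, not the actual PyTorch implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

def mlp_head(x, w1, b1, w2, b2, positive=False):
    """Single-hidden-layer MLP; Softplus on the output when a positive
    value (a variance) is required."""
    h = np.maximum(0.0, x @ w1 + b1)                # ReLU hidden layer
    out = h @ w2 + b2
    if positive:
        out = np.log1p(np.exp(out))                 # Softplus -> output > 0
    return out

d_in, d_hid = 8, 50                                 # 50 hidden neurons, as in the setup
params = lambda: (rng.normal(scale=0.1, size=(d_in, d_hid)),
                  np.zeros(d_hid),
                  rng.normal(scale=0.1, size=(d_hid, 1)),
                  np.zeros(1))
mean_net, var_net = params(), params()              # two separate networks

x = rng.normal(size=(4, d_in))
mu = mlp_head(x, *mean_net)                         # unconstrained mean
var = mlp_head(x, *var_net, positive=True)          # strictly positive variance
assert (var > 0).all()

# Gaussian negative log-likelihood, the training objective for the NN baseline:
y = rng.normal(size=(4, 1))
nll = 0.5 * (np.log(2 * np.pi * var) + (y - mu) ** 2 / var).mean()
```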
The code will be published alongside the paper, but currently an anonymous version is available at https://anonymous.4open.science/r/f8733507d83f490f950795c2d1fcb73e/.
Appendix E Generative network architecture
Pixel values of the images were scaled to the interval [0, 1]. Each pixel is assumed to be i.i.d. Gaussian distributed. For the encoders and decoders we use multilayer perceptron networks; see the table below.
| Network | Layer 1 | Layer 2 | Layer 3 | Layer 4 |
|---|---|---|---|---|
| Encoder μ | 512 (BN + ReLU) | 256 (BN + ReLU) | 128 (BN + ReLU) | d (BN + Linear) |
| Encoder σ² | 512 (BN + ReLU) | 256 (BN + ReLU) | 128 (BN + ReLU) | d (Softplus) |
| Decoder μ | 128 (BN + ReLU) | 256 (BN + ReLU) | 512 (BN + ReLU) | D (ReLU) |
| Decoder σ² | 128 (BN + ReLU) | 256 (BN + ReLU) | 512 (BN + ReLU) | D (Softplus) |
The numbers correspond to the size of each layer, and the parentheses state the activation function used and whether batch normalization (BN) was applied. D indicates the size of the images, i.e. height × width × channels: for MNIST and FashionMNIST this is (28, 28, 1) and for CIFAR10 and SVHN it is (32, 32, 3). d indicates the size of the latent space; for MNIST and FashionMNIST we choose a small d for visualization purposes, and for CIFAR10 and SVHN a larger d to capture the higher complexity of these datasets.
To train the networks we used the Adam optimizer (Kingma and Ba, 2015) with learning rate and a batch size of 512. We train for 20,000 iterations without early stopping. Additionally, we use warm-up for the KL term (Sønderby et al., 2016), scaling it by a weight that grows with the current iteration number until reaching 1 at a threshold set to half the number of iterations. This ensures that we converge to stable reconstructions before introducing too much structure in the latent space.
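The exact warm-up expression did not survive extraction; assuming the standard linear schedule of Sønderby et al. (2016), the KL weight would be computed as:

```python
def kl_weight(t, tau):
    """Linear KL warm-up: ramps from 0 to 1 over the first `tau` iterations,
    then stays at 1."""
    return min(1.0, t / tau)

tau = 10_000            # half of the 20,000 training iterations
assert kl_weight(0, tau) == 0.0
assert kl_weight(5_000, tau) == 0.5
assert kl_weight(20_000, tau) == 1.0
```

During training, the per-batch loss would then be `recon_loss + kl_weight(t, tau) * kl_loss`.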
Appendix F Artificial Data
The data considered in Section 4 is generated in the following way: first, we sample points in the plane in a two-moon configuration; see details in Algorithm 2. We generate 500 points in this way to establish a 'known' latent space. We then map these to four dimensions by
(7)  
(8)  
(9)  
(10) 
where the noise terms are all independent.
A typical dataset from this procedure is shown in Fig. 15.
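A sketch of the generation procedure, assuming a two-moon sampler along the lines of Algorithm 2; the four coordinate functions below are illustrative stand-ins, since Eqs. (7)–(10) are not reproduced here:

```python
import numpy as np

def two_moons(n, noise=0.05, rng=None):
    """Sample n points on two interleaved half-circles in the plane."""
    rng = rng or np.random.default_rng()
    theta = rng.uniform(0, np.pi, size=n)
    upper = rng.random(n) < 0.5
    x = np.where(upper, np.cos(theta), 1 - np.cos(theta))
    y = np.where(upper, np.sin(theta), 0.5 - np.sin(theta))
    z = np.stack([x, y], axis=1)
    return z + noise * rng.normal(size=z.shape)

rng = np.random.default_rng(0)
z = two_moons(500, rng=rng)                 # the 'known' 2D latent space

# Hypothetical stand-in for the 4D map of Eqs. (7)-(10): four nonlinear
# coordinate functions of z plus independent Gaussian noise.
eps = 0.05 * rng.normal(size=(500, 4))
data = np.stack([np.sin(z[:, 0]), np.cos(z[:, 1]),
                 z[:, 0] * z[:, 1], z[:, 0] ** 2], axis=1) + eps

assert z.shape == (500, 2) and data.shape == (500, 4)
```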