Can You Trust Your Model’s Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift
Abstract
Modern machine learning methods including deep learning have achieved great success in predictive accuracy for supervised learning tasks, but may still fall short in giving useful estimates of their predictive uncertainty. Quantifying uncertainty is especially critical in real-world settings, which often involve input distributions that are shifted from the training distribution due to a variety of factors including sample bias and non-stationarity. In such settings, well-calibrated uncertainty estimates convey information about when a model’s output should (or should not) be trusted. Many probabilistic deep learning methods, including Bayesian and non-Bayesian methods, have been proposed in the literature for quantifying predictive uncertainty, but to our knowledge there has not previously been a rigorous large-scale empirical comparison of these methods under dataset shift. We present a large-scale benchmark of existing state-of-the-art methods on classification problems and investigate the effect of dataset shift on accuracy and calibration. We find that traditional post-hoc calibration does indeed fall short, as do several other previous methods. However, some methods that marginalize over models give surprisingly strong results across a broad spectrum of tasks.
1 Introduction
Recent successes across a variety of domains have led to the widespread deployment of deep neural networks (DNNs) in practice. Consequently, the predictive distributions of these models are increasingly being used to make decisions in important applications ranging from machine-learning-aided medical diagnoses from imaging (Esteva et al., 2017) to self-driving cars (Bojarski et al., 2016). Such high-stakes applications require not only point predictions but also accurate quantification of predictive uncertainty, i.e. meaningful confidence values in addition to class predictions. With sufficient independent labeled samples from a target data distribution, one can estimate how well a model’s confidence aligns with its accuracy and adjust the predictions accordingly. However, in practice, once a model is deployed the distribution over observed data may shift and eventually be very different from the original training data distribution. Consider, e.g., online services for which the data distribution may change with the time of day, seasonality or popular trends. Indeed, robustness under conditions of distributional shift and out-of-distribution (OOD) inputs is necessary for the safe deployment of machine learning (Amodei et al., 2016). For such settings, calibrated predictive uncertainty is important because it enables accurate assessment of risk, allows practitioners to know how accuracy may degrade, and allows a system to abstain from decisions due to low confidence.
A variety of methods have been developed for quantifying predictive uncertainty in DNNs. Probabilistic neural networks such as mixture density networks (MacKay and Gibbs, 1999) capture the inherent ambiguity in outputs for a given input, also referred to as aleatoric uncertainty (Kendall and Gal, 2017). Bayesian neural networks learn a posterior distribution over parameters that quantifies parameter uncertainty, a type of epistemic uncertainty that can be reduced through the collection of additional data. Popular approximate Bayesian approaches include the Laplace approximation (MacKay, 1992), variational inference (Graves, 2011; Blundell et al., 2015), dropout-based variational inference (Gal and Ghahramani, 2016; Kingma et al., 2015), expectation propagation (Hernández-Lobato and Adams, 2015) and stochastic gradient MCMC (Welling and Teh, 2011). Non-Bayesian methods include training multiple probabilistic neural networks with bootstrap or ensembling (Osband et al., 2016; Lakshminarayanan et al., 2017). Another popular non-Bayesian approach involves recalibration of probabilities on a held-out validation set through temperature scaling (Platt, 1999), which was shown by Guo et al. (2017) to lead to well-calibrated predictions on the i.i.d. test set.
Using Distributional Shift to Evaluate Predictive Uncertainty While previous work has evaluated the quality of predictive uncertainty on OOD inputs (Lakshminarayanan et al., 2017), there has not to our knowledge been a comprehensive evaluation of uncertainty estimates from different methods under dataset shift. Indeed, we suggest that effective evaluation of predictive uncertainty is most meaningful under conditions of distributional shift. One reason for this is that post-hoc calibration gives good results in independent and identically distributed (i.i.d.) regimes, but can fail under even a mild shift in the input data. And in real-world applications, as described above, distributional shift is widely prevalent. Understanding questions of risk, uncertainty, and trust in a model’s output becomes increasingly critical as shift from the original training data grows larger.
Contributions In the spirit of calls for more rigorous understanding of existing methods (Lipton and Steinhardt, 2018; Sculley et al., 2018; Rahimi and Recht, 2017), this paper provides a benchmark for evaluating uncertainty that focuses not only on the i.i.d. setting but also on uncertainty under distributional shift. We present a large-scale evaluation of popular approaches in probabilistic deep learning, focusing on methods that operate well in large-scale settings, and evaluate them on a diverse range of classification benchmarks across image, text, and categorical modalities. We use these experiments to address the following questions:

How trustworthy are the uncertainty estimates of different methods under dataset shift?

Does calibration in the i.i.d. setting translate to calibration under dataset shift?

How do uncertainty and accuracy of different methods covary under dataset shift? Are there methods that consistently do well in this regime?
In addition to answering the questions above, we open-source our code along with our model predictions so that researchers can easily evaluate their approaches on these benchmarks.
2 Background
Notation and Problem Setup Let $x \in \mathbb{R}^D$ represent a set of $D$-dimensional features and $y \in \{1, \ldots, K\}$ denote corresponding labels (targets) for $K$-class classification. We assume that a training dataset $\mathcal{D}$ consists of $N$ i.i.d. samples $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}$.
Let $p^*(x, y)$ denote the true distribution (unknown, observed only through the samples $\mathcal{D}$), also referred to as the data generating process. We focus on classification problems, in which the true distribution is assumed to be a discrete distribution over $K$ classes, and the observed $y \in \{1, \ldots, K\}$ is a sample from the conditional distribution $p^*(y|x)$. We use a neural network to model $p_\theta(y|x)$ and estimate the parameters $\theta$ using the training dataset. At test time, we evaluate the model predictions against a test set, sampled from the same distribution as the training dataset. However, here we also evaluate the model against OOD inputs sampled from a shifted distribution $q(x) \neq p^*(x)$. In particular, we consider two kinds of shifts:

shifted versions of the test inputs where the ground truth label belongs to one of the $K$ classes. We use shifts such as corruptions and perturbations proposed by Hendrycks and Dietterich (2019), and ideally would like the model predictions to become more uncertain with increased shift, assuming shift degrades accuracy. This is also referred to as covariate shift (Sugiyama et al., 2009).

a completely different OOD dataset, where the ground truth label is not one of the $K$ classes. Here we check whether the model exhibits higher predictive uncertainty for those new instances, and to this end report diagnostics that rely only on predictions and not on ground truth labels.
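The first kind of shift can be made concrete with a small sketch. Below is a minimal pure-Python illustration (not the paper's actual pipeline) of the cyclic image translations applied to MNIST later in the paper; a real implementation would operate on arrays of pixel intensities.

```python
# Toy illustration of covariate shift via cyclic translation: each offset
# yields an increasingly shifted input whose ground truth label is unchanged.

def cyclic_translate(image, offset):
    """Cyclically shift each row of a 2-D image (list of lists) to the right."""
    k = offset % len(image[0])
    if k == 0:
        return [row[:] for row in image]
    return [row[-k:] + row[:-k] for row in image]

image = [[1, 2, 3, 4],
         [5, 6, 7, 8]]

# A family of increasingly shifted versions of the same input.
shifted_versions = [cyclic_translate(image, o) for o in (0, 1, 2, 3)]
```

Ideally, a model's predictive uncertainty would grow as the offset (and hence the shift) increases.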
High-level overview of existing methods A large variety of methods have been developed to either provide higher-quality uncertainty estimates or perform OOD detection to inform model confidence. These can roughly be divided into:

Methods which model the conditional distribution $p(y|x)$ only; we discuss these in more detail in Section 3.

Methods which additionally involve the input distribution $p(x)$, e.g. OOD-detection approaches that flag inputs unlike the training data.
We refer to Shafaei et al. (2018) for a recent summary of these methods. Due to the differences in modeling assumptions, a fair comparison between these different classes of methods is challenging; for instance, some OOD detection methods rely on knowledge of a known OOD set, or train using a none-of-the-above class, and it may not always be meaningful to compare predictions from these methods with those obtained from a Bayesian DNN. We focus on the first class of methods above, as this allows us to compare methods which make the same modeling assumptions about the data and differ only in how they quantify predictive uncertainty.
3 Methods and Metrics
We select a subset of methods from the probabilistic deep learning literature for their prevalence, scalability, and practical applicability:

(Vanilla) Maximum softmax probability (Hendrycks and Gimpel, 2017)

(Temp Scaling) Post-hoc calibration by temperature scaling using a validation set (Guo et al., 2017)

(Ensembles) Ensembles of networks trained independently on the entire dataset using random initialization (Lakshminarayanan et al., 2017) (we use an ensemble size of 10 in the experiments below)

(Dropout) Monte Carlo dropout applied at test time (Gal and Ghahramani, 2016)

(SVI) Stochastic variational Bayesian inference over the network weights (Graves, 2011; Blundell et al., 2015)

(LL) Approx. Bayesian inference for the parameters of the last layer only (Riquelme et al., 2018)

(LL SVI) Mean field stochastic variational inference on the last layer only

(LL Dropout) Dropout only on the activations before the last layer
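To make two of the methods above concrete, here is a minimal pure-Python sketch (illustrative only, not the training code used in the paper) of deep-ensemble probability averaging and post-hoc temperature scaling. A coarse grid search over candidate temperatures stands in for the gradient-based optimization of the temperature used by Guo et al. (2017).

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature > 1 softens the distribution."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_predict(per_model_logits):
    """Average the softmax probabilities of M independently trained models."""
    probs = [softmax(logits) for logits in per_model_logits]
    m = len(probs)
    return [sum(p[k] for p in probs) / m for k in range(len(probs[0]))]

def fit_temperature(val_logits, val_labels, candidates=(0.5, 1.0, 2.0, 4.0)):
    """Pick the temperature minimizing validation NLL over a small grid."""
    def nll(t):
        return -sum(math.log(softmax(z, t)[y])
                    for z, y in zip(val_logits, val_labels))
    return min(candidates, key=nll)
```

On overconfident validation errors, `fit_temperature` selects a temperature above 1, softening the predictive distribution without changing the predicted class.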

In addition to metrics that do not depend on predictive uncertainty, such as classification accuracy ↑, the following metrics are commonly used (we use arrows to indicate which direction is better):
Negative Log-Likelihood (NLL) ↓ Commonly used to evaluate the quality of model uncertainty on some held-out set. Drawbacks: Although a proper scoring rule (Gneiting and Raftery, 2007), it can over-emphasize tail probabilities (Quinonero-Candela et al., 2006).
Brier Score ↓ (Brier, 1950) Proper scoring rule for measuring the accuracy of predicted probabilities. It is computed as the squared error of a predicted probability vector, $p(y|x_n, \theta)$, and the one-hot encoded true response, $y_n$. That is,

$\mathrm{BS} = |\mathcal{Y}|^{-1} \sum_{y \in \mathcal{Y}} \big( p(y|x_n, \theta) - \delta(y - y_n) \big)^2 . \quad (1)$
The Brier score has a convenient interpretation as $\mathrm{BS} = \text{uncertainty} - \text{resolution} + \text{reliability}$, where uncertainty is the marginal uncertainty over labels, resolution measures the deviation of individual predictions against the marginal, and reliability measures calibration as the average violation of long-term true label frequencies. We refer to DeGroot and Fienberg (1983) for the decomposition of the Brier score into calibration and refinement for classification and to Bröcker (2009) for the general decomposition for any proper scoring rule. Drawbacks: The Brier score is insensitive to predicted probabilities associated with infrequent events.
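The per-example Brier score of Eq. (1) can be sketched as follows (pure Python, assuming the class-count normalization used in this paper):

```python
def brier_score(prob, label, num_classes):
    """Per-example Brier score: mean squared error between the predicted
    probability vector and the one-hot encoding of the true label."""
    one_hot = [1.0 if k == label else 0.0 for k in range(num_classes)]
    return sum((p - t) ** 2 for p, t in zip(prob, one_hot)) / num_classes
```

A perfect prediction scores 0; a uniform prediction over two classes scores 0.25.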
Both the Brier score and the negative log-likelihood are proper scoring rules, and therefore the optimum score corresponds to a perfect prediction. In addition to these two metrics, we also evaluate two others: expected calibration error and entropy. Neither of these is a proper scoring rule, and thus there exist trivial solutions which yield optimal scores; for example, returning the marginal probability for every instance will yield perfectly calibrated but uninformative predictions. Each proper scoring rule induces a calibration measure (Bröcker, 2009). However, ECE is not the result of such a decomposition and has no corresponding proper scoring rule; we include ECE nonetheless because it is popular and intuitive. Each proper scoring rule is also associated with a corresponding entropy function, and Shannon entropy is the one associated with log probability (Gneiting and Raftery, 2007).
Expected Calibration Error (ECE) ↓ Measures the correspondence between predicted probabilities and empirical accuracy (Naeini et al., 2015). It is computed as the average gap between within-bucket accuracy and within-bucket predicted probability for $S$ buckets $B_1, \ldots, B_S$. That is,

$\mathrm{ECE} = \sum_{s=1}^{S} \frac{|B_s|}{N} \left| \mathrm{acc}(B_s) - \mathrm{conf}(B_s) \right|,$

where $\mathrm{acc}(B_s) = |B_s|^{-1} \sum_{n \in B_s} \mathbb{1}[\hat{y}_n = y_n]$, $\mathrm{conf}(B_s) = |B_s|^{-1} \sum_{n \in B_s} p(\hat{y}_n | x_n, \theta)$, and $\hat{y}_n$ is the $n$-th prediction. When the bins $\{B_s\}$ are quantiles of the held-out predicted probabilities, $|B_s| \approx N/S$ and the estimation error is approximately constant. Drawbacks: Due to binning, ECE does not monotonically increase as predictions approach ground truth. If $|B_s|$ varies across bins, so does the estimation error.
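ECE with equal-width buckets can be sketched as follows (pure Python; quantile bucketing, which equalizes $|B_s|$ across bins, is the alternative discussed above):

```python
def expected_calibration_error(confidences, correctness, num_buckets=10):
    """ECE over equal-width confidence buckets: the weighted average gap
    between within-bucket accuracy and within-bucket mean confidence."""
    buckets = [[] for _ in range(num_buckets)]
    for conf, correct in zip(confidences, correctness):
        idx = min(int(conf * num_buckets), num_buckets - 1)
        buckets[idx].append((conf, correct))
    n = len(confidences)
    ece = 0.0
    for bucket in buckets:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        acc = sum(1.0 for _, correct in bucket if correct) / len(bucket)
        ece += (len(bucket) / n) * abs(acc - avg_conf)
    return ece
```

For example, two predictions at confidence 0.95 of which only one is correct yield an ECE of 0.45, reflecting severe overconfidence.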
There is no ground truth label for fully OOD inputs. Thus we report histograms of confidence and predictive entropy on known and OOD inputs, as well as accuracy-versus-confidence plots (Lakshminarayanan et al., 2017): given the prediction $p(y|x_n, \theta)$, we define the predicted label as $\hat{y}_n = \arg\max_y p(y|x_n, \theta)$ and the confidence as $p(y = \hat{y}_n|x_n, \theta) = \max_y p(y|x_n, \theta)$. We filter out test examples below a particular confidence threshold $\tau \in [0, 1]$ and compute the accuracy on the remaining set.
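The accuracy-versus-confidence computation can be sketched as follows (pure Python; probability vectors are plain lists):

```python
def accuracy_vs_confidence(probs, labels, thresholds):
    """For each threshold tau, accuracy over examples whose confidence
    (max predicted probability) is at least tau."""
    points = []
    for tau in thresholds:
        kept = [(p, y) for p, y in zip(probs, labels) if max(p) >= tau]
        if not kept:
            points.append((tau, None))  # no examples above this threshold
            continue
        correct = sum(1 for p, y in kept if p.index(max(p)) == y)
        points.append((tau, correct / len(kept)))
    return points
```

A well-behaved model traces an increasing curve: the predictions it is most confident about should also be its most accurate.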
4 Experiments and Results
We evaluate the behavior of the predictive uncertainty of deep learning models on a variety of datasets across three different modalities: images, text and categorical (online ad) data. For each we follow standard training, validation and testing protocols, but we additionally evaluate results on increasingly shifted data and an OOD dataset. We detail the models and implementations used in Appendix A. Hyperparameters were tuned for all methods using Bayesian optimization (Golovin et al., 2017) (except on ImageNet) as detailed in Appendix A.8.
4.1 An illustrative example: MNIST
We first illustrate the problem setup and experiments using the MNIST dataset. We used the LeNet (LeCun et al., 1998) architecture, and, as with all our experiments, we follow standard training, validation, testing and hyperparameter tuning protocols. However, we also compute predictions on increasingly shifted data (in this case increasingly rotated or horizontally translated images) and study the behavior of the predictive distributions of the models. In addition, we predict on a completely OOD dataset, NotMNIST (Bulatov, 2011), and observe the entropy of the model’s predictions. We summarize some of our findings in Figure 1 and discuss below.
What we would like to see: Naturally, we expect the accuracy of a model to degrade as it predicts on increasingly shifted data, and ideally this reduction in accuracy would coincide with increased forecaster entropy. A model that was well-calibrated on the training and validation distributions would ideally remain so on shifted data. If calibration (ECE or Brier reliability) remained as consistent as possible, practitioners and downstream tasks could take into account that a model is becoming increasingly uncertain. On the completely OOD data, one would expect the predictive distributions to be of high entropy. Essentially, we would like the predictions to indicate that a model “knows what it does not know” due to the inputs straying away from the training data distribution.
What we observe: We see in Figures 1(a) and 1(b) that accuracy certainly degrades as a function of shift for all methods tested, and they are difficult to disambiguate on that metric. However, the Brier score paints a clearer picture, and we see a significant difference between methods, i.e. prediction quality degrades more significantly for some methods than others. An important observation is that while calibrating on the validation set leads to well-calibrated predictions on the test set, it does not guarantee calibration on shifted data. In fact, nearly all other methods (except vanilla) perform better than state-of-the-art post-hoc calibration (temperature scaling) in terms of Brier score under shift. While SVI achieves the worst accuracy on the test set, it actually outperforms all other methods by a much larger margin when exposed to significant shift. In Figures 1(c) and 1(d) we look at the distribution of confidences for each method to understand the discrepancy between metrics. We see in Figure 1(d) that SVI has the lowest confidence in general, but in Figure 1(c) we observe that SVI gives the highest accuracy at high confidence (or conversely is much less frequently confidently wrong), which can be important for high-stakes applications. Most methods demonstrate very low entropy (Figure 1(e)) and give high-confidence predictions (Figure 1(f)) on data that is entirely OOD, i.e. they are confidently wrong about completely OOD data.
4.2 Image Models: CIFAR-10 and ImageNet
We now study the predictive distributions of residual networks (He et al., 2016) trained on two benchmark image datasets, CIFAR-10 (Krizhevsky, 2009) and ImageNet (Deng et al., 2009), under distributional shift. We use 20-layer and 50-layer ResNets for CIFAR-10 and ImageNet, respectively. For shifted data we use 80 different distortions (16 different types with 5 levels of intensity each; see Appendix B for illustrations) introduced by Hendrycks and Dietterich (2019). To evaluate predictions of CIFAR-10 models on entirely OOD data, we use the SVHN dataset (Netzer et al., 2011).
Figure 2 summarizes the accuracy and ECE for CIFAR-10 (top) and ImageNet (bottom) across all 80 combinations of corruptions and intensities from Hendrycks and Dietterich (2019). Figure 3 inspects the predictive distributions of the models on CIFAR-10 (top) and ImageNet (bottom) for shifted (Gaussian blur) and OOD data. Classifiers on both datasets show poorer accuracy and calibration with increasing shift. Comparing accuracy across methods, we see that ensembles achieve the highest accuracy under distributional shift. Comparing ECE across methods, we observe that while the methods achieve comparably low ECE for small values of shift, ensembles outperform the other methods for larger values of shift. To test whether this result is due simply to the larger aggregate capacity of the ensemble, we trained models with double the number of filters for the Vanilla and Dropout methods. The higher-capacity models showed no better accuracy or calibration for medium to high shift than the corresponding lower-capacity models (see Appendix C). In Figures S8 and S9 we also explore, on CIFAR-10, the effect of the number of samples used by dropout, SVI and last-layer methods, and of the ensemble size. We found that while increasing ensemble size up to 50 did help, most of the gains of ensembling could be achieved with only 5 models. Interestingly, while temperature scaling achieves low ECE for small values of shift, the ECE increases significantly as the shift increases, which indicates that calibration on the i.i.d. validation dataset does not guarantee calibration under distributional shift. (For ImageNet, we found similar trends when considering just the top-5 predicted classes; see Figure S5.) Furthermore, the results show that while temperature scaling helps significantly over the vanilla method, ensembles and dropout tend to be better. In Figure 3, we see that ensembles and dropout are more accurate at higher confidence.
However, in Figure 3(c) we see that temperature scaling gives the highest entropy on OOD data. Ensembles consistently have high accuracy but also high entropy on OOD data. We refer to Appendix C for additional results; Figures S4 and S5 report additional metrics on CIFAR-10 and ImageNet, such as the Brier score (and its component terms), as well as top-5 error for increasing values of shift.
Overall, ensembles consistently perform best across metrics, and dropout consistently performs better than temperature scaling and last-layer methods. While the relative ordering of methods is consistent on both CIFAR-10 and ImageNet (ensembles perform best), the ordering is quite different from that on MNIST, where SVI performs best. Interestingly, LL-SVI and LL-Dropout perform worse than the vanilla method on shifted datasets as well as on SVHN. We also evaluate a variational Gaussian process as a last-layer method in Appendix E, but it did not outperform LL-SVI and LL-Dropout.
4.3 Text Models
Following Hendrycks and Gimpel (2017), we train an LSTM (Hochreiter and Schmidhuber, 1997) on the 20 Newsgroups dataset (Lang, 1995) and assess the model’s robustness under distributional shift and OOD text. We use the even-numbered classes (10 classes out of 20) as in-distribution data and the 10 odd-numbered classes as shifted data. We provide additional details in Appendix A.4.
We look at confidence versus accuracy when the test data consist of a mix of in-distribution data and either shifted or completely OOD data, in this case the One Billion Word Benchmark (LM1B) (Chelba et al., 2013). Figure 4 (bottom row) shows the results. Ensembles significantly outperform all other methods and achieve a better trade-off between accuracy and confidence. Surprisingly, LL-Dropout and LL-SVI perform worse than the vanilla method, giving higher-confidence incorrect predictions, especially when tested on fully OOD data.
Figure 4 reports histograms of predictive entropy on in-distribution data and compares them to those for the shifted and OOD datasets. This reflects how amenable each method is to abstaining from prediction by applying a threshold on the entropy. As expected, most methods achieve the highest predictive entropy on the completely OOD dataset, followed by the shifted dataset and then the in-distribution test dataset. Only ensembles have consistently higher entropy on the shifted data, which explains why they perform best on the confidence-versus-accuracy curves in the second row of Figure 4. Compared with the vanilla model, Dropout and LL-SVI show a more distinct separation between in-distribution and shifted or OOD data. While Dropout and LL-Dropout perform similarly on in-distribution data, LL-Dropout exhibits less uncertainty than Dropout on shifted and OOD data. Temperature scaling does not appear to increase uncertainty significantly on the shifted data.
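The entropy-based abstention described above can be sketched as follows (pure Python; the threshold is a free parameter chosen by the practitioner, not a value prescribed by the paper):

```python
import math

def predictive_entropy(prob):
    """Shannon entropy of a predictive distribution; higher means more uncertain."""
    return -sum(p * math.log(p) for p in prob if p > 0.0)

def abstain(prob, threshold):
    """Abstain from predicting when entropy exceeds the chosen threshold."""
    return predictive_entropy(prob) > threshold
```

A confident one-hot prediction has entropy 0, while a uniform distribution over K classes attains the maximum log K, so thresholding entropy separates inputs the model "knows" from those it does not.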
4.4 Ad-Click Model with Categorical Features
Finally, we evaluate the performance of different methods on the Criteo Display Advertising Challenge dataset, a click-prediction task over largely categorical features, and simulate distributional shift by randomizing a fraction of the categorical feature values at test time.
Results from these experiments are depicted in Figure 5. (Figure S7 in Appendix C shows additional results, including ECE and the Brier score decomposition.) We observe that ensembles are superior in terms of both AUC and Brier score for most values of shift, with the performance gap between ensembles and other methods generally increasing as the shift increases. Both Dropout model variants yielded improved AUC on shifted data, and Dropout surpassed ensembles in Brier score at shift-randomization values above 60%. SVI proved challenging to train, and the resulting model uniformly performed poorly; LL-SVI fared better but generally did not improve upon the vanilla model. Strikingly, temperature scaling has a worse Brier score than the vanilla method, indicating that post-hoc calibration on the validation set can actually harm calibration under dataset shift.
5 Takeaways and Recommendations
We presented a largescale evaluation of different methods for quantifying predictive uncertainty under dataset shift, across different data modalities and architectures. Our takehome messages are the following:

Along with accuracy, the quality of uncertainty consistently degrades with increasing dataset shift regardless of method.

Better calibration and accuracy on the i.i.d. test dataset do not usually translate to better calibration under dataset shift (shifted versions as well as completely different OOD data).

Post-hoc calibration (on an i.i.d. validation set) with temperature scaling leads to well-calibrated uncertainty on the i.i.d. test set and under small values of shift, but is significantly outperformed by methods that take epistemic uncertainty into account as the shift increases.

Last-layer Dropout (LL-Dropout) exhibits less uncertainty on shifted and OOD datasets than Dropout.

SVI is very promising on MNIST/CIFAR-10, but it is difficult to get it to work on larger datasets such as ImageNet and on other architectures such as LSTMs.

The relative ordering of methods is mostly consistent across our experiments, with the exception of MNIST, where the ordering is not reflective of that on other datasets.

Deep ensembles seem to perform best across most metrics and are more robust to dataset shift. We found that a relatively small ensemble size (e.g. 5) may be sufficient (Appendix D).

We also compared the set of methods on a challenging real-world genomics problem from Ren et al. (2019). Our observations were consistent with the other experiments in the paper: deep ensembles performed best, but there remains significant room for improvement. See Appendix F for details.
We hope that this benchmark is useful to the community and inspires more research on uncertainty under dataset shift, which seems challenging for existing methods. While we focused only on the quality of predictive uncertainty, applications may also need to consider computational and memory costs of the methods; Table S1 in Appendix A.9 discusses these costs, and the best performing methods tend to be more expensive. Reducing the computational and memory costs, while retaining the same performance under dataset shift, would also be a key research challenge.
Acknowledgements
We thank Alexander D’Amour, Jakub Świa̧tkowski and our reviewers for helpful feedback that improved the manuscript.
Can You Trust Your Model’s Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift: Appendix
Appendix A Model Details
A.1 MNIST
We evaluated both LeNet and a fully-connected neural network (MLP) under shift on MNIST. We observed similar trends across metrics for both models, so we report results only for LeNet in Section 4.1. LeNet and the MLP were trained for 20 epochs using the Adam optimizer (Kingma and Ba, 2014) and used ReLU activation functions. For stochastic methods, we averaged 300 sample predictions to yield a predictive distribution, and the ensemble model used 10 instances trained from independent random initializations. The MLP architecture consists of two hidden layers of 200 units each with dropout applied before every dense layer. The LeNet architecture (LeCun et al., 1998) applies two convolutional layers (3x3 kernels with 32 and 64 filters respectively) followed by two fully-connected layers with one hidden layer of 128 activations; dropout was applied before each fully-connected layer. We employed hyperparameter tuning (see Section A.8) to select the training batch size, learning rate, and dropout rate.
A.2 CIFAR-10
Our CIFAR-10 model used the ResNet-20 V1 architecture with ReLU activations. Model parameters were trained for 200 epochs using the Adam optimizer with a learning rate schedule that multiplied an initial learning rate by 0.1, 0.01, 0.001, and 0.0005 at steps 80, 120, 160, and 180 respectively. Training inputs were randomly distorted using horizontal flips and random crops preceded by 4-pixel padding, as described in He et al. (2016). For relevant methods, dropout was applied before each convolutional and dense layer (excluding the raw inputs), and stochastic methods drew 128 predictions per example. Hyperparameter tuning was used to select the initial learning rate, training batch size, and dropout rate.
A.3 ImageNet 2012
Our ImageNet model used the ResNet-50 V1 architecture with ReLU activations and was trained for 90 epochs using SGD with Nesterov momentum. The learning rate schedule linearly ramps up to a base rate over 5 epochs and scales down by a factor of 10 at each of epochs 30, 60, and 80. As with the CIFAR-10 model, stochastic methods used a sample size of 128. Training images were distorted with random horizontal flips and random crops.
A.4 20 Newsgroups
We use a preprocessing strategy similar to the one proposed by Hendrycks and Gimpel (2017) for 20 Newsgroups. We build a vocabulary of 30,000 words, indexed by word frequency; rare words are encoded as unknown words. We fix the length of each text input at 250 words: longer inputs are truncated and shorter inputs are padded with zeros. Text from the even-numbered classes is used as in-distribution input, and text from the odd-numbered classes is used as shifted OOD input. A dataset with the same number of randomly selected text inputs from the LM1B dataset (Chelba et al., 2013) is used as a completely different OOD dataset. The classifier is trained and evaluated using only the text from the even-numbered in-distribution classes in the training dataset. The final results are evaluated on the in-distribution test dataset, the shifted OOD test dataset, and the LM1B dataset.
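The fixed-length preprocessing can be sketched as follows (pure Python over token-id lists; the 250-word limit and zero padding match the description above):

```python
def pad_or_truncate(token_ids, max_len=250, pad_id=0):
    """Fix a text input to max_len tokens: truncate long inputs and
    zero-pad short ones."""
    if len(token_ids) >= max_len:
        return token_ids[:max_len]
    return token_ids + [pad_id] * (max_len - len(token_ids))
```

Every input then has the same shape, as required by the LSTM classifier.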
The vanilla model uses a one-layer LSTM of size 32 and a dense layer to predict the 10 class probabilities based on word embeddings of size 128. A dropout rate of 0.1 is applied to both the LSTM layer and the dense layer for the Dropout model. The LL-SVI model replaces the last dense layer with a Bayesian layer, the ensemble model aggregates 10 vanilla models, and stochastic methods sample 5 predictions per example. The vanilla model accuracy on the in-distribution test data is 0.955.
A.5 Criteo
Each categorical feature from the Criteo dataset was encoded by hashing the string token into a fixed number of buckets and either encoding the hash-bin as a one-hot vector (when the number of buckets is small) or embedding each bucket as a dense vector otherwise. This dense feature vector, concatenated with the 13 numerical features, feeds into a batch-norm layer followed by a 3-hidden-layer MLP.
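A minimal sketch of the hashing trick described above (pure Python; the bucket counts here are arbitrary placeholders, not the tuned per-feature values):

```python
import hashlib

def hash_bucket(token, num_buckets):
    """Deterministically hash a string token into one of num_buckets bins.
    hashlib is used instead of the built-in hash() so results are stable
    across runs and processes."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

def one_hot_bucket(token, num_buckets):
    """One-hot encoding of the hashed bucket, as used for small bucket counts."""
    vec = [0.0] * num_buckets
    vec[hash_bucket(token, num_buckets)] = 1.0
    return vec
```

For large bucket counts, the bucket index would instead look up a learned embedding row rather than a one-hot vector.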
The hash-bucket counts and embedding dimensions were tuned to maximize log-likelihood for a vanilla model, and the resulting architectural parameters were applied to all methods. This tuning yielded hidden layers of size 2572, 1454, and 1596, and the hash-bucket counts and embedding dimensions listed below:
Learning rate, batch size, and dropout rate were further tuned for each method. Stochastic methods used 128 prediction samples per example.
A.6 Stochastic Variational Inference Details
For MNIST we used Flipout (Wen et al., 2018), replacing each dense and convolutional layer with mean-field variational dense and convolutional Flipout layers respectively. Variational inference for deep ResNets (He et al., 2016) is non-trivial, so for CIFAR-10 we replaced a single linear layer per residual branch with a Flipout layer, removed batch normalization, added SELU nonlinearities (Klambauer et al., 2017), used empirical Bayes for the prior standard deviations as in Wu et al. (2019), and carefully tuned the initialization via Bayesian optimization.
A.7 Variational Gaussian Process Details
For the experiments where Gaussian processes were compared, we used variational Gaussian processes to fit the model logits as in Hensman et al. (2015). These were then passed through a categorical distribution and numerically integrated over using Gauss-Hermite quadrature. Each class was treated as a separate Gaussian process, with 100 inducing points used for each class. The inducing points were initialized with model outputs on random dataset examples for CIFAR-10, and with Gaussian noise for MNIST. Uniform-noise inducing point initialization was also tested, but there was negligible difference between the three initialization schemes. All-zero inducing point initializations failed numerically early in training. Exponentiated quadratic plus linear kernels were used for all experiments. During training, 250 samples were drawn from the logit distribution to get a better estimate of the ELBO to backpropagate through, and 250 logit samples were likewise drawn at test time. A small jitter term was added to the diagonal of the covariance matrix to ensure positive definiteness.
We used 100 trials of random hyperparameter settings, selecting the configuration with the best final validation accuracy. The learning rate, the initial kernel amplitude, and the initial kernel length scale were tuned on log scales; the variational distribution covariance was initialized to a scaled identity matrix whose scale was also tuned, as was Adam’s epsilon parameter (on a log scale).
The Adam optimizer with a batch size of 512 was used, training for the same number of epochs as the other methods. The same learning rate schedule as the other methods was used for the model and kernel parameters, but the learning rate for the variational parameters also included a 5-epoch warmup to help with numerical stability.
A.8 Hyperparameter Tuning
Hyperparameters were optimized through Bayesian optimization using Google Vizier (Golovin et al., 2017). We maximized the log-likelihood on a validation set that was held out from training (10K examples for MNIST and CIFAR-10, 125K examples for ImageNet). We optimized log-likelihood rather than accuracy since the former is a proper scoring rule.
A.9 Computational and Memory Complexity of Different Methods
In addition to performance, applications may also need to consider computational and memory costs; Table S1 discusses them for each method.
Table S1: compute and storage cost per method (Vanilla, Temp Scaling, LL-Dropout, LL-SVI, SVI, Dropout, Gaussian Process, Ensembles).
Appendix B Shifted Images
We distorted MNIST images using rotations (with spline-filter interpolation) and cyclic translations, as depicted in Figure S1.
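A cyclic translation of the kind used above can be sketched with NumPy alone; the rotation with spline interpolation would typically use `scipy.ndimage.rotate`, noted in a comment rather than executed here. Function name and argument conventions are ours, not from the benchmark code.

```python
import numpy as np

def cyclic_translate(image, shift_x, shift_y):
    """Cyclically translate an image: pixels that fall off one edge
    wrap around to the opposite edge. `image` is a 2-D array indexed
    as (row, column); `shift_x` moves columns, `shift_y` moves rows."""
    return np.roll(np.roll(image, shift_y, axis=0), shift_x, axis=1)

# Rotation with spline interpolation would use, e.g.,
# scipy.ndimage.rotate(image, angle, order=3, reshape=False).
```

Because the translation is cyclic, shifting a 28x28 MNIST image by 28 pixels in either axis recovers the original image.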
For the corrupted ImageNet dataset, we used ImageNet-C (Hendrycks and Dietterich, 2019). Figure S2 shows examples of ImageNet-C images at varying corruption intensities. Figure S3 shows ImageNet-C images with the 16 corruptions analyzed in this paper, at intensity 3 (on a scale of 1 to 5).
Appendix C Evaluating uncertainty under distributional shift: Additional Results
Figures S4, S5 and S7 show comprehensive results on CIFAR-10, ImageNet and Criteo respectively across various metrics, including the Brier score along with its components: reliability (lower values mean better calibration) and resolution (higher values indicate better predictive quality). Ensembles and dropout outperform all other methods across corruptions, while LL-SVI shows no improvement over the baseline model. Figure S6 shows accuracy and ECE for models with double the number of ResNet filters; the higher-capacity models are not better calibrated than their lower-capacity counterparts, suggesting that the good calibration performance of ensembles is not simply due to higher capacity.
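The ECE metric reported in these figures can be sketched as follows, using the standard equal-width binning scheme; the bin count and function signature here are illustrative assumptions.

```python
import numpy as np

def expected_calibration_error(confidences, correct, num_bins=10):
    """Expected Calibration Error with equal-width confidence bins.

    Partitions predictions into half-open bins (lo, hi] by confidence,
    then averages the absolute gap between mean confidence and accuracy
    in each bin, weighted by the fraction of examples in the bin.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, num_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece
```

An ECE of zero means that within every bin, the model's average confidence exactly matches its accuracy.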
Appendix D Effect of the number of samples on the quality of uncertainty
Figure S8 shows the effect of the number of samples used by Dropout, SVI (and their last-layer variants) on the quality of predictive uncertainty, as measured by the Brier score. Increasing the number of samples has little effect on the last-layer variants, whereas it improves performance for SVI and Dropout, with diminishing returns beyond 5 samples.
Figure S9 shows the effect of ensemble size on CIFAR-10 (top) and ImageNet (bottom). As with SVI and Dropout, increasing the number of models in the ensemble improves performance, with diminishing returns beyond size 5. As mentioned earlier, the Brier score can be further decomposed as BS = reliability - resolution + uncertainty, where reliability measures calibration as the average violation of long-term true label frequencies, uncertainty is the marginal uncertainty over labels (independent of predictions), and resolution measures the deviation of individual predictions from the marginal.
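The decomposition above can be computed exactly when examples are grouped by their discrete forecast values (the Murphy decomposition). The sketch below is for binary outcomes; the function name is ours.

```python
import numpy as np

def brier_decomposition(forecasts, outcomes):
    """Murphy decomposition of the Brier score for binary outcomes.

    Groups examples by their (discrete) forecast value, so the identity
    BS = reliability - resolution + uncertainty holds exactly.
    `forecasts` are predicted probabilities of the positive class and
    `outcomes` are 0/1 labels.
    """
    forecasts = np.asarray(forecasts, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    n = len(forecasts)
    base_rate = outcomes.mean()
    uncertainty = base_rate * (1.0 - base_rate)
    reliability = 0.0
    resolution = 0.0
    for p in np.unique(forecasts):
        mask = forecasts == p
        obs_freq = outcomes[mask].mean()   # observed frequency in group
        weight = mask.sum() / n            # fraction of examples in group
        reliability += weight * (p - obs_freq) ** 2
        resolution += weight * (obs_freq - base_rate) ** 2
    return reliability, resolution, uncertainty
```

Note that uncertainty depends only on the label marginal, so methods can only improve the Brier score by lowering reliability (better calibration) or raising resolution (sharper, more discriminative predictions).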
Appendix E Variational Gaussian Process Results
Appendix F OOD detection for genomic sequences
We studied the same set of methods for detecting OOD genomic sequences, a challenging and realistic OOD detection problem proposed by Ren et al. (2019). Classifiers are trained on 10 in-distribution bacteria classes and tested for OOD detection against 60 OOD bacteria classes. The model architecture is the same as in Ren et al. (2019): a convolutional neural network with 1000 filters of length 20, followed by a global max pooling layer, a dense layer of 1000 units, and a final dense layer that outputs class prediction logits. For the Dropout method, we add dropout layers after the max pooling layer and after the dense layer; for the LL-Dropout method, only the dropout layer after the dense layer is added. We use a dropout rate of 0.2. For the LL-SVI method, we replace the last dense layer with a stochastic variational inference dense layer. In-distribution classification accuracy is around 0.8 for the various classifiers.
Figure S11 shows confidence versus (a) accuracy and (b) count when the test data consist of a mix of in-distribution and OOD data. Ensembles significantly outperform all other methods and achieve a better tradeoff between accuracy and confidence. Dropout performs better than Temp Scaling, and both perform better than LL-Dropout, LL-SVI, and the Vanilla method. Note that even for the best method, accuracy on these examples is still below 65%, suggesting that this realistic genomic sequence dataset is a challenging benchmark for future methods.
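The confidence-versus-accuracy curves in Figure S11 amount to measuring accuracy on the subset of predictions that clear each confidence threshold, i.e. the accuracy a system would see if it abstained below that level. A minimal sketch (function name and signature are ours):

```python
import numpy as np

def accuracy_vs_confidence(confidences, correct, thresholds):
    """Accuracy on the subset of examples whose confidence meets each
    threshold. A well-calibrated model should become more accurate as
    the threshold rises; poorly calibrated models stay confidently wrong."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    accuracies = []
    for tau in thresholds:
        kept = confidences >= tau
        # If nothing clears the threshold, report NaN for that point.
        accuracies.append(correct[kept].mean() if kept.any() else float("nan"))
    return np.array(accuracies)
```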
Appendix G Tables of Metrics
The tables below report quartiles of Brier score, negative log-likelihood, and ECE for each model and dataset, where quartiles are computed over all corrupted variants of the dataset.
G.1 CIFAR-10
Method  Vanilla  Temp. Scaling  Ensembles  Dropout  LL-Dropout  SVI  LL-SVI 

Brier Score (25th)  0.243  0.227  0.165  0.215  0.259  0.250  0.246 
Brier Score (50th)  0.425  0.392  0.299  0.349  0.416  0.363  0.431 
Brier Score (75th)  0.747  0.670  0.572  0.633  0.728  0.604  0.732 
NLL (25th)  0.578  0.473  0.342  0.446  0.626  0.533  0.591 
NLL (50th)  1.120  0.871  0.653  0.771  1.086  0.823  1.158 
NLL (75th)  2.356  1.685  1.543  1.684  2.275  1.628  2.352 
ECE (25th)  0.057  0.022  0.031  0.021  0.069  0.029  0.058 
ECE (50th)  0.127  0.049  0.037  0.034  0.136  0.064  0.135 
ECE (75th)  0.288  0.180  0.110  0.174  0.292  0.187  0.275 
G.2 ImageNet
Method  Vanilla  Temp. Scaling  Ensembles  Dropout  LL-Dropout  LL-SVI 

Brier Score (25th)  0.553  0.551  0.503  0.577  0.550  0.590 
Brier Score (50th)  0.733  0.726  0.667  0.754  0.723  0.766 
Brier Score (75th)  0.914  0.899  0.835  0.922  0.896  0.938 
NLL (25th)  1.859  1.848  1.621  1.957  1.830  2.218 
NLL (50th)  2.912  2.837  2.446  3.046  2.858  3.504 
NLL (75th)  4.305  4.186  3.661  4.567  4.208  5.199 
ECE (25th)  0.057  0.031  0.022  0.017  0.034  0.065 
ECE (50th)  0.102  0.072  0.032  0.043  0.071  0.106 
ECE (75th)  0.164  0.129  0.053  0.109  0.123  0.148 
G.3 Criteo
Method  Vanilla  Temp. Scaling  Ensembles  Dropout  LL-Dropout  SVI  LL-SVI 

Brier Score (25th)  0.353  0.355  0.336  0.350  0.353  0.512  0.361 
Brier Score (50th)  0.385  0.391  0.366  0.373  0.379  0.512  0.396 
Brier Score (75th)  0.409  0.416  0.395  0.393  0.403  0.512  0.421 
NLL (25th)  0.581  0.594  0.508  0.532  0.542  7.479  0.554 
NLL (50th)  0.788  0.829  0.552  0.577  0.600  7.479  0.633 
NLL (75th)  0.986  1.047  0.608  0.624  0.664  7.479  0.711 
ECE (25th)  0.041  0.055  0.044  0.043  0.052  0.254  0.066 
ECE (50th)  0.097  0.113  0.100  0.085  0.100  0.254  0.127 
ECE (75th)  0.135  0.149  0.141  0.116  0.136  0.254  0.162 
Footnotes
 https://github.com/google-research/google-research/tree/master/uq_benchmark_2019
 The methods used scale well for training and prediction (see Appendix A.9). We also explored methods such as scalable extensions of Gaussian processes (Hensman et al., 2015), but these were challenging to train on the 37M-example Criteo dataset or on the 1000 classes of ImageNet.
 https://www.kaggle.com/c/criteo-display-ad-challenge
References
 Uncertainty in the variational information bottleneck. arXiv preprint arXiv:1807.00906. Cited by: item 2.
 Concrete problems in AI safety. arXiv preprint arXiv:1606.06565. Cited by: §1.
 Invertible residual networks. arXiv preprint arXiv:1811.00995. Cited by: item 2.
 Novelty Detection and Neural Network Validation. IEE Proceedings - Vision, Image and Signal Processing 141 (4), pp. 217–222. Cited by: item 3.
 Weight uncertainty in neural networks. In ICML, Cited by: §1, 5th item.
 End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316. Cited by: §1.
 Verification of forecasts expressed in terms of probability. Monthly weather review. Cited by: §3.
 Reliability, sufficiency, and the decomposition of proper scores. Quarterly Journal of the Royal Meteorological Society 135 (643), pp. 1512–1519. Cited by: §3, §3.
 NotMNIST dataset. External Links: Link Cited by: §4.1.
 One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005. Cited by: §A.4, §4.3.
 The comparison and evaluation of forecasters. The statistician. Cited by: §3.
 ImageNet: A Large-Scale Hierarchical Image Database. In Computer Vision and Pattern Recognition, Cited by: §4.2.
 Dermatologist-level classification of skin cancer with deep neural networks. Nature 542. Cited by: §1.
 Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In ICML, Cited by: §1, 3rd item.
 Selective classification for deep neural networks. In NeurIPS, Cited by: item 3.
 Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102 (477), pp. 359–378. Cited by: §3, §3.
 Google Vizier: a service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1487–1495. Cited by: §A.8, §4.
 Practical variational inference for neural networks. In NeurIPS, Cited by: §1, 5th item.
 On calibration of modern neural networks. In International Conference on Machine Learning, Cited by: §1, 2nd item.
 Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §A.2, §A.6, §4.2.
 Benchmarking neural network robustness to common corruptions and perturbations. In ICLR, Cited by: Appendix B, 1st item, §4.2, §4.2.
 A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks. In ICLR, Cited by: §A.4, 1st item, §4.3.
 Scalable variational Gaussian process classification. In International Conference on Artificial Intelligence and Statistics, Cited by: §A.7, footnote 2.
 Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks. In ICML, Cited by: §1.
 Long short-term memory. Neural Comput. 9 (8), pp. 1735–1780. Cited by: §4.3.
 What uncertainties do we need in Bayesian deep learning for computer vision?. In NeurIPS, Cited by: §1.
 Adam: A Method for Stochastic Optimization. In ICLR, Cited by: §A.1.
 Semi-supervised learning with deep generative models. In NeurIPS, Cited by: item 2.
 Variational dropout and the local reparameterization trick. In NeurIPS, Cited by: §1.
 Self-normalizing neural networks. In NeurIPS, Cited by: §A.6.
 Learning multiple layers of features from tiny images. Cited by: §4.2.
 Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles. In NeurIPS, Cited by: §1, §1, 4th item, §3.
 Newsweeder: learning to filter netnews. In Machine Learning, Cited by: §4.3.
 Gradient-based learning applied to document recognition. In Proceedings of the IEEE, Cited by: §A.1, §4.1.
 A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In NeurIPS, Cited by: item 3.
 Enhancing the Reliability of Out-of-Distribution Image Detection in Neural Networks. ICLR. Cited by: item 3.
 Troubling trends in machine learning scholarship. arXiv preprint arXiv:1807.03341. Cited by: §1.
 Structured and efficient variational deep learning with matrix Gaussian posteriors. arXiv preprint arXiv:1603.04733. Cited by: 5th item.
 Multiplicative Normalizing Flows for Variational Bayesian Neural Networks. In ICML, Cited by: 5th item.
 Density Networks. Statistics and Neural Networks: Advances at the Interface. Cited by: §1.
 Bayesian methods for adaptive models. Ph.D. Thesis, California Institute of Technology. Cited by: §1.
 Obtaining Well Calibrated Probabilities Using Bayesian Binning. In AAAI, pp. 2901–2907. Cited by: §3.
 Hybrid models with deep and invertible features. arXiv preprint arXiv:1902.02767. Cited by: item 2.
 Reading Digits in Natural Images with Unsupervised Feature Learning. In NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning, Cited by: §4.2.
 Deep exploration via bootstrapped DQN. In NeurIPS, Cited by: §1.
 Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, pp. 61–74. Cited by: §1.
 Evaluating predictive uncertainty challenge. In Machine Learning Challenges, Cited by: §3.
 An addendum to alchemy. Cited by: §1.
 Likelihood ratios for out-of-distribution detection. arXiv preprint arXiv:1906.02845. Cited by: Appendix F, 8th item.
 Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling. In ICLR, Cited by: 6th item.
 Winner’s curse? On pace, progress, and empirical rigor. Cited by: §1.
 Does Your Model Know the Digit 6 Is Not a Cat? A Less Biased Evaluation of "Outlier" Detectors. arXiv preprint arXiv:1809.04729. Cited by: §2.
 Training Very Deep Networks. In NeurIPS, Cited by: 3rd item.
 Dataset shift in machine learning. The MIT Press. Cited by: 1st item.
 Bayesian Learning via Stochastic Gradient Langevin Dynamics. In ICML, Cited by: §1.
 Flipout: efficient pseudo-independent weight perturbations on mini-batches. arXiv preprint arXiv:1803.04386. Cited by: §A.6, 5th item.
 Deterministic Variational Inference for Robust Bayesian Neural Networks. In ICLR, Cited by: §A.6.