Bayesian Neural Networks With Maximum Mean Discrepancy Regularization

Bayesian Neural Networks (BNNs) are trained to optimize an entire distribution over their weights instead of a single set, having significant advantages in terms of, e.g., interpretability, multi-task learning, and calibration. Because of the intractability of the resulting optimization problem, most BNNs are either sampled through Monte Carlo methods, or trained by minimizing a suitable Evidence Lower BOund (ELBO) on a variational approximation. In this paper, we propose an optimized version of the latter, wherein we replace the Kullback-Leibler divergence in the ELBO term with a Maximum Mean Discrepancy (MMD) estimator, inspired by recent work in variational inference. After motivating our proposal based on the properties of the MMD term, we proceed to show a number of empirical advantages of the proposed formulation over the state-of-the-art. In particular, our BNNs achieve higher accuracy on multiple benchmarks, including several image classification tasks. In addition, they are more robust to the selection of a prior over the weights, and they are better calibrated. As a second contribution, we provide a new formulation for estimating the uncertainty on a given prediction, showing it performs in a more robust fashion against adversarial attacks and the injection of noise over their inputs, compared to more classical criteria such as the differential entropy.

Bayesian learning, variational approximation, maximum mean discrepancy, calibration

1 Introduction

Deep Neural Networks (DNNs) are currently the most widely used and studied models in the machine learning field, due to the large number of problems that can be solved very well with these architectures, such as image classification He et al. (2015), speech processing Rethage et al. (2018), image generation Li et al. (2017), and several others. Despite their empirical success, however, these models have a number of open research problems. Among them, how to quantify the uncertainty of an individual prediction remains challenging Kendall and Gal (2017). A concrete measure of uncertainty is critical for several real-world applications, e.g., driverless vehicles, out-of-distribution detection, and medical applications Kwon et al. (2020).

The principal approach to model the uncertainty in these models is based on Bayesian statistics. Bayesian Neural Networks (BNNs) model the parameters of a DNN as a probability distribution computed via the application of Bayes' rule, instead of a single fixed point in the space of parameters MacKay (1992). Despite a wealth of theoretical and applied research, the challenge with these BNNs is that codifying a distribution over the weights remains difficult, mainly because: (1) the minimization problem is intractable in the general case, and (2) we need to specify prior knowledge (in the form of a prior distribution) over the parameters of the network, and in most datasets the results can be sensitive to this choice Wenzel et al. (2020). On top of this, applying these principles to a Convolutional Neural Network (CNN) is even harder, because of the depth of these networks in practice.

In recent years, different approaches have been proposed to build and/or train a BNN, outlined later in Section 1.2. In general, these approaches can either avoid the minimization problem altogether and sample from the posterior distribution Chen et al. (2015), or they can solve the optimization problem in a restricted class of variational approximations Blundell et al. (2015). The latter approach (referred to as Variational Inference, VI) has become extremely popular thanks to the possibility of straightforwardly leveraging automatic differentiation routines common in deep learning frameworks Blei et al. (2017), and of avoiding a large quantity of sampling operations during the inference phase. However, the empirical results of BNNs remain sub-optimal in practice Wenzel et al. (2020), and ample margins exist to further increase their accuracy, robustness to the choice of the prior distribution, and calibration of the classification models.

1.1 Contributions of the work

In this paper, we partially address the aforementioned problems with two innovations related to BNNs. Firstly, we propose a modification of the commonly used optimization procedure in VI for BNNs. In particular, we build on recent work in variational auto-encoding Zhao et al. (2017) to propose a modification of the standard Evidence Lower BOund (ELBO) objective minimized during BNN training. In the proposed approach, we replace the Kullback-Leibler term on the variational approximation with a more flexible Maximum Mean Discrepancy (MMD) estimator Gretton et al. (2012). After motivating our proposal, we perform an extensive empirical evaluation showing that the proposed BNN can significantly improve over the state-of-the-art in terms of classification accuracy, calibration, and robustness in the selection of the prior distribution. Secondly, we provide a new definition to measure the uncertainty on the prediction over a single point. Different from previous state-of-the-art approaches Kwon et al. (2020), our formulation provides a single scalar measure also in the multi-class case. In the experimental evaluation, we show that it performs better when defending against adversarial attacks on the BNN using a simple thresholding mechanism.

1.2 Related work

VI training for BNNs

The application of Bayesian methods to neural networks has been studied widely over the years. In Buntine (1991) the authors were the first to propose several Bayesian methods applied to neural networks, but only in Hinton and Van Camp (1993) was the first VI method proposed, as a regularization approach. In MacKay (2003) and Neal (1996) approximations of the posterior were investigated, in the first case using a Laplace approximation and in the second one through a Monte Carlo approach. Only more recently were the first practical VI training techniques advanced in Graves (2011). In Blundell et al. (2015), this approach was extended and an unbiased way of updating the posterior was found. Dropout has also been proposed as an approximation of VI Gal and Ghahramani (2015b); Kingma et al. (2015).

While these methods can be applied in general for most DNNs, few works were carried out in the context of image classification, due to the complexity and the depth of the networks involved in these tasks, combined with the inner difficulties of VI methods. In Gal and Ghahramani (2015a) and Shridhar et al. (2019) the authors used Bayesian methods to train CNNs, while in Kendall and Gal (2017) and Kwon et al. (2020) the authors proposed two alternatives that work also for CNNs, to measure the uncertainty of a classification, using the posterior distribution.

Almost all the works devoted to VI training of BNNs have considered the standard ELBO formulation Blei et al. (2017), where we minimize the sum of a likelihood term and the Kullback-Leibler (KL) divergence with respect to the variational approximation. However, several recent works have put forward alternative formulations of the ELBO term, replacing the KL term with different divergences Zhao et al. (2017). To the best of our knowledge, these works have focused mostly on generative scenarios Zhao et al. (2017), and not on generic BNNs. The target of this paper is to leverage these proposals to improve the training procedure of a Bayesian CNN and the estimation of the classification's uncertainty.

Uncertainty quantification in BNNs

Quantifying the uncertainty of a prediction is a fundamental task in modern deep learning. In the context of BNNs, entropy provides a simple measure of uncertainty Leibig et al. (2017). The work in Kendall and Gal (2017), however, analyzed the difference between aleatoric uncertainty (due to the noise in the data), and epistemic uncertainty (due to volatility in the model specification) Der Kiureghian and Ditlevsen (2009); Leibig et al. (2017); Hüllermeier and Waegeman (2019). In order to properly model the former (which is not captured by standard entropy), they propose a modification of the BNN to also output an additional term necessary to quantify the aleatoric component. A further extension that does not require additional outputs is proposed in Kwon et al. (2020). Their formulation, however, does not allow for a simple scalar definition in the multi-class case.

2 Bayesian Neural Networks

The core idea of Bayesian approaches is to estimate uncertainty using an entire distribution over the parameters, as opposed to the frequentist approach, in which we estimate the solution of a problem as a fixed point. This is accomplished by using Bayes' theorem:

$$p(w \mid D) = \frac{p(D \mid w)\, p(w)}{p(D)} \tag{1}$$

where $D$ is a dataset and $w$ is a set of parameters that we want to estimate. A BNN is a neural network with a distribution over the parameters specified according to (1) Neal (1996). In particular, we can see a DNN as a function $f(x, w)$ that, given a sample $x$ and parameters $w$, computes the associated output $y$. Bayesian methods give us the possibility to have a distribution of functions (the posterior $p(w \mid D)$) for a particular dataset $D$, starting from a prior belief on the shape of the functions (the prior $p(w)$) and its likelihood on a single point, defined as $p(y \mid x, w)$.

Once we have the posterior, the inference step consists in integrating over all the possible configurations of parameters:

$$p(y \mid x, D) = \int p(y \mid x, w)\, p(w \mid D)\, dw \tag{2}$$

This equation represents a Bayesian Model Average (BMA): instead of choosing only one hypothesis (a single setting of the parameters $w$), we ideally want to use every possible setting of the parameters, weighted by its posterior probability. This process is called marginalization over the parameters $w$.

2.1 Bayes by back-propagation

In general, the posterior in (1) is intractable. As outlined in Section 1.2, several techniques can be used to handle this intractability, and in this paper we focus on VI approximations, as described next. VI is an alternative to Markov Chain Monte Carlo (MCMC) methods that approximates the posterior of Bayesian models faster than MCMC, but with fewer guarantees. For a complete review we refer to Blei et al. (2017). Generally speaking, the nature of BNNs (e.g., highly non-convex minimization problem, millions of parameters, etc.) makes these models very challenging for standard Bayesian methods.

Bayes By Back-propagation (BBB, Graves (2011); Blundell et al. (2015)) is a VI method to fit a variational distribution $q(w \mid \theta)$ with variational parameters $\theta$ to the true posterior, from which the weights $w$ can be sampled. The set of variational parameters $\theta$ can easily be found by exploiting the back-propagation algorithm, as shown below. This approximate posterior answers queries about unseen data points: given a sample $x$ and variational parameters $\theta$, the prediction is obtained by taking the expectation with respect to the variational distribution. To make the process computationally viable, the expectation is generally approximated by sampling the weights from the posterior $T$ times; each set of weights gives us a DNN from which we predict the output, and the expectation is then calculated as the average of all the predictions. Thus, Eq. (2) can be approximated as:

$$p(y \mid x, D) \approx \mathbb{E}_{q(w \mid \theta)}\left[p(y \mid x, w)\right] \approx \frac{1}{T} \sum_{t=1}^{T} p(y \mid x, w_t) \tag{3}$$

where $w_1, \dots, w_T \sim q(w \mid \theta)$ are the sets of sampled weights. In the most common case, the variational family is chosen as a diagonal Gaussian distribution over the weights of the network. In this case, the variational parameters $\theta$ are composed, for each weight of the DNN, of a mean $\mu$ and a value $\rho$, which is used to calculate the variance of the parameter (to ensure that the variance is always positive). Sampling a weight from the posterior is achieved by: $w = \mu + \log(1 + \exp(\rho)) \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$; this technique is called the re-parametrization trick Blei et al. (2017). Note that there exist alternative ways to codify the posterior over the parameters, and we explore some simplifications in the experimental section.
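As an illustration of the sampling procedure described above, the following minimal sketch draws weights through the re-parametrization trick and averages the predictions of the sampled networks, as in Eq. (3). The toy linear classifier, the softplus mapping from $\rho$ to the standard deviation (assumed here, following Blundell et al. (2015)), and all function names are hypothetical choices made only for this example; they are not the implementation used in the experiments.

```python
import math
import random


def softplus(rho):
    """Map an unconstrained value to a positive standard deviation (assumed mapping)."""
    return math.log1p(math.exp(rho))


def sample_weights(mu, rho, rng):
    """Draw one weight vector: w = mu + softplus(rho) * eps, with eps ~ N(0, 1)."""
    return [m + softplus(r) * rng.gauss(0.0, 1.0) for m, r in zip(mu, rho)]


def softmax(logits):
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(x, w, n_classes):
    """Toy linear classifier (hypothetical model): one row of weights per class."""
    n_features = len(x)
    logits = [
        sum(w[c * n_features + j] * x[j] for j in range(n_features))
        for c in range(n_classes)
    ]
    return softmax(logits)


def predictive_mean(x, mu, rho, n_classes, n_samples, rng):
    """Monte Carlo average of the predictions of n_samples sampled networks."""
    mean = [0.0] * n_classes
    for _ in range(n_samples):
        probs = predict(x, sample_weights(mu, rho, rng), n_classes)
        mean = [m + p / n_samples for m, p in zip(mean, probs)]
    return mean


rng = random.Random(0)
mu = [1.0, -1.0, -1.0, 1.0]      # posterior means (2 classes x 2 features)
rho = [-3.0, -3.0, -3.0, -3.0]   # softplus(-3) is about 0.049
p_bar = predictive_mean([2.0, 0.5], mu, rho, n_classes=2, n_samples=10, rng=rng)
```

Because the noise enters only through `eps`, the sampled weights remain a differentiable function of `mu` and `rho`, which is what allows back-propagation to update the variational parameters.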

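With this formulation, the parameters of the approximated posterior can be found by minimizing the Kullback-Leibler (KL, Kullback and Leibler (1951)) divergence with respect to the true posterior:

$$\theta^{*} = \arg\min_{\theta}\, \mathrm{KL}\left[q(w \mid \theta) \,\|\, p(w \mid D)\right]$$

The optimal parameters are the ones that satisfy both the complexity of the dataset $D$ and the prior distribution $p(w)$. The final objective function to minimize is:

$$\mathcal{L}(D, \theta) = \beta\, \mathrm{KL}\left[q(w \mid \theta) \,\|\, p(w)\right] - \mathbb{E}_{q(w \mid \theta)}\left[\log p(D \mid w)\right] \tag{4}$$

where $\beta$ is an additional scale factor to weight the two terms. This objective is the negative of the ELBO (in the following we refer to it simply as the ELBO): minimizing it is equivalent to minimizing the Kullback-Leibler divergence between the approximated posterior and the real one. We can also look at this equation as a loss over the dataset plus a regularization term (the KL divergence).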
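To make the regularization term concrete, the sketch below evaluates the KL divergence between a diagonal Gaussian posterior and a Gaussian prior, both with the well-known closed form for Gaussians and with a sampling-based estimate (the average of $\log q(w) - \log p(w)$ over draws from $q$). The Gaussian prior, the function names, and the numerical values are assumptions made for illustration only.

```python
import math
import random


def kl_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL[N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2)] for scalars."""
    return (
        math.log(sigma_p / sigma_q)
        + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
        - 0.5
    )


def kl_diagonal(mus, sigmas, prior_mu=0.0, prior_sigma=1.0):
    """KL for a factorized posterior: the per-weight terms simply add up."""
    return sum(kl_gaussians(m, s, prior_mu, prior_sigma) for m, s in zip(mus, sigmas))


def log_normal_pdf(w, mu, sigma):
    """Log-density of N(mu, sigma^2) evaluated at w."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (w - mu) ** 2 / (2.0 * sigma ** 2)


def kl_monte_carlo(mu_q, sigma_q, mu_p, sigma_p, n_samples, rng):
    """Sampling-based estimate: mean of log q(w) - log p(w) with w ~ q."""
    total = 0.0
    for _ in range(n_samples):
        w = rng.gauss(mu_q, sigma_q)
        total += log_normal_pdf(w, mu_q, sigma_q) - log_normal_pdf(w, mu_p, sigma_p)
    return total / n_samples


def regularized_loss(nll, kl, beta):
    """Data-fit term (negative log-likelihood) plus the beta-weighted KL regularizer."""
    return nll + beta * kl


exact = kl_gaussians(0.5, 0.8, 0.0, 1.0)
approx = kl_monte_carlo(0.5, 0.8, 0.0, 1.0, n_samples=20000, rng=random.Random(0))
```

The sampling-based form is the one that generalizes to priors without a closed-form divergence, at the cost of a noisier estimate.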

The ELBO function has limitations. One of them is that it might fail to learn an amortized posterior that correctly approximates the true posterior. This can happen in two cases: when the ELBO is minimized even though the posterior is inaccurate, and when the model capacity is not sufficient to achieve both a good posterior and a good fit to the data. For further information we refer to Alemi et al. (2017) and Zhao et al. (2017).

2.2 Measuring the uncertainty of a prediction

As introduced in Section 1.2.2, the uncertainty of a prediction vector $p$ can be calculated in many ways. In a classification setting, where the vector $p$ contains the probabilities associated with the $C$ classes, the most straightforward way is the entropy:

$$H(p) = -\sum_{c=1}^{C} p_c \log p_c \tag{5}$$

Combining this formulation with Eq. (3), the classification entropy can be calculated as:

$$H(\bar{p}) = -\sum_{c=1}^{C} \bar{p}_c \log \bar{p}_c \tag{6}$$

where $\bar{p} = \frac{1}{T} \sum_{t=1}^{T} p(y \mid x, w_t)$ and $T$ is the number of sets of weights sampled from the posterior. This entropy formulation allows the calculation of the uncertainty also for a BNN. However, a more suitable measure of uncertainty can be formulated Kwon et al. (2020), which exploits the possibility of sampling the weights to calculate the cross uncertainty between the classes (a covariance matrix). To this end, we define the variance of the predictive distribution (3) as:


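$$\mathrm{Var}[y] = \int \left[\mathrm{diag}\left(\mathbb{E}_{p(y \mid x, w)}[y]\right) - \mathbb{E}_{p(y \mid x, w)}[y]^{\otimes 2}\right] q(w \mid \theta)\, dw + \int \left[\mathbb{E}_{p(y \mid x, w)}[y] - \mathbb{E}_{q}[y]\right]^{\otimes 2} q(w \mid \theta)\, dw \tag{7}$$

where $u^{\otimes 2} = u u^{\top}$, $\mathrm{diag}(u)$ is the diagonal matrix built from the vector $u$, and $\mathbb{E}_{q}[y] = \int \mathbb{E}_{p(y \mid x, w)}[y]\, q(w \mid \theta)\, dw$ is the mean of the predictive distribution. For further information about the derivation, we refer to Kwon et al. (2020). The first term in the variance formula is called aleatoric uncertainty, while the second one is the epistemic uncertainty Der Kiureghian and Ditlevsen (2009). The first quantity measures the inherent uncertainty of the dataset $D$: it does not depend on the model, and more data might not reduce it. The second term, instead, incorporates the uncertainty of the model itself, and can be decreased by augmenting the dataset or by redefining the model. In Kendall and Gal (2017) and Kwon et al. (2020) the authors have proposed different ways to approximate these quantities. In Kendall and Gal (2017) the authors constructed a BNN and used the mean and the standard deviation of the logits (the output of the last layer before the softmax activation function) to calculate the variance: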


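$$\frac{1}{T} \sum_{t=1}^{T} \mathrm{diag}\left(\hat{\sigma}_t^{2}\right) + \frac{1}{T} \sum_{t=1}^{T} \left(\hat{\mu}_t - \bar{\mu}\right)\left(\hat{\mu}_t - \bar{\mu}\right)^{\top}$$

where $\bar{\mu} = \frac{1}{T} \sum_{t=1}^{T} \hat{\mu}_t$, with $\hat{\mu}_t$ and $\hat{\sigma}_t^{2}$ the mean and the variance of the logits obtained with the $t$-th of the $T$ sampled sets of weights. In Kwon et al. (2020) the authors highlighted the problems of this approach: it models the variability of the logits (and not of the predictive probabilities), ignoring that the covariance matrix is a function of the mean vector; moreover, the aleatoric uncertainty does not reflect the correlation between classes, due to the diagonal matrix modeling. To overcome these limitations, they proposed an improvement:

$$\frac{1}{T} \sum_{t=1}^{T} \left[\mathrm{diag}(\hat{p}_t) - \hat{p}_t \hat{p}_t^{\top}\right] + \frac{1}{T} \sum_{t=1}^{T} \left(\hat{p}_t - \bar{p}\right)\left(\hat{p}_t - \bar{p}\right)^{\top} \tag{8}$$

where $\bar{p} = \frac{1}{T} \sum_{t=1}^{T} \hat{p}_t$ and $\hat{p}_t$ is the vector of class probabilities predicted with the $t$-th sampled set of weights. This formulation converges in probability to Eq. (7) as the number of samples increases. In the case of binary classification, the formula simplifies to:

$$\frac{1}{T} \sum_{t=1}^{T} \hat{p}_t \left(1 - \hat{p}_t\right) + \frac{1}{T} \sum_{t=1}^{T} \left(\hat{p}_t - \bar{p}\right)^{2} \tag{9}$$

where $\hat{p}_t$ is now the scalar probability of the positive class. This definition is more viable because it calculates a scalar instead of a matrix, but it cannot be used trivially if the problem involves more than two classes, unless all the probabilities that are lower than the maximum one are collapsed into a single probability and the problem is treated as a binary one. In this paper, we also present a modified version of the definition (8), which can be used to evaluate the uncertainty as a scalar also in multiclass scenarios.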
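The quantities discussed in this section can be computed directly from a set of sampled prediction vectors. The sketch below is a minimal, dependency-free illustration and not the authors' code: it computes the entropy of the averaged prediction, and the aleatoric and epistemic matrices in the form given by Kwon et al. (2020), i.e., the average of $\mathrm{diag}(\hat{p}_t) - \hat{p}_t \hat{p}_t^{\top}$ plus the average of $(\hat{p}_t - \bar{p})(\hat{p}_t - \bar{p})^{\top}$. Function names and the example predictions are hypothetical.

```python
import math


def mean_prediction(samples):
    """Average the T sampled prediction vectors class by class."""
    t = len(samples)
    return [sum(p[c] for p in samples) / t for c in range(len(samples[0]))]


def predictive_entropy(samples):
    """Entropy of the averaged prediction; 0 * log(0) is taken as 0."""
    return -sum(p * math.log(p) for p in mean_prediction(samples) if p > 0.0)


def uncertainty_matrices(samples):
    """Return the (aleatoric, epistemic) C x C matrices from T prediction vectors."""
    t = len(samples)
    c = len(samples[0])
    p_bar = mean_prediction(samples)
    aleatoric = [[0.0] * c for _ in range(c)]
    epistemic = [[0.0] * c for _ in range(c)]
    for p in samples:
        for i in range(c):
            for j in range(c):
                diag = p[i] if i == j else 0.0
                aleatoric[i][j] += (diag - p[i] * p[j]) / t
                epistemic[i][j] += (p[i] - p_bar[i]) * (p[j] - p_bar[j]) / t
    return aleatoric, epistemic


def binary_uncertainty(positive_probs):
    """Scalar binary case: mean of p(1 - p) plus the variance of p across samples."""
    t = len(positive_probs)
    p_bar = sum(positive_probs) / t
    aleatoric = sum(p * (1.0 - p) for p in positive_probs) / t
    epistemic = sum((p - p_bar) ** 2 for p in positive_probs) / t
    return aleatoric + epistemic


# Three sampled predictions for one input of a binary problem.
samples = [[0.9, 0.1], [0.6, 0.4], [0.8, 0.2]]
aleatoric, epistemic = uncertainty_matrices(samples)
total = [[a + e for a, e in zip(ra, re)] for ra, re in zip(aleatoric, epistemic)]
scalar = binary_uncertainty([p[0] for p in samples])
```

In the binary case the top-left entry of the matrix coincides with the scalar formula, which is why the scalar can stand in for the whole matrix when there are only two classes.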

3 Proposed approaches

In this section, we introduce the proposed variations for the training of BNNs. Firstly, we outline a new way to approximate the weights’ posteriors, leading to a better posterior approximation, higher accuracy, and an easier minimization problem. Secondly, we provide an improvement of the measure of uncertainty (8), which is more suited for problems that are not binary classification tasks.

3.1 Posterior approximation via Maximum Mean Discrepancy regularization

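Given a prior distribution $p(w)$, a kernel function $k$, the set of variational parameters $\theta = (\mu, \rho)$, a loss function $\ell$, and the number of sampling steps $N$.
for each batch $b$ in the dataset do
     for $i = 1, \dots, N$ do
          sample $w_i = \mu + \log(1 + \exp(\rho)) \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$
          sample $w'_i$ from the prior $p(w)$
          evaluate the loss $\ell$ on the batch $b$ using the weights $w_i$
     end for
     combine the averaged loss with the MMD estimate between the samples $w_i$ and $w'_i$, weighted by the scale factor, as in (12)
     update $\theta$ using the gradient of the resulting objective
end for
Algorithm 1 One epoch of the training procedure.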

The MMD estimator was originally introduced as a non-parametric test for distinguishing samples from two separate distributions Gretton et al. (2012). It can be used to build a universal estimator that is robust to outliers Chérief-Abdellatif and Alquier (2019), and it has also been applied in Bayesian statistics Cherief-Abdellatif and Alquier (2020) and to build generative models Briol et al. (2019). Formally, denote by $a$ and $a'$ two independent random variables with distribution $P$, by $b$ and $b'$ two independent random variables with distribution $Q$, and by $k$ a characteristic positive-definite kernel. The square of the MMD distance between the two distributions is defined as:

$$\mathrm{MMD}^2(P, Q) = \mathbb{E}_{a, a'}\left[k(a, a')\right] - 2\, \mathbb{E}_{a, b}\left[k(a, b)\right] + \mathbb{E}_{b, b'}\left[k(b, b')\right]$$

We have that $\mathrm{MMD}(P, Q) = 0$ if and only if $P = Q$. Inspired by Zhao et al. (2017) and Li et al. (2017) (who considered a similar approach for generative models), we propose to replace the KL term in (4) with an MMD estimator, i.e., we propose to search for a variational set of parameters $\theta$ that minimizes the MMD distance with respect to the prior $p(w)$:

$$\theta^{*} = \arg\min_{\theta}\, \mathrm{MMD}^2\left(q(w \mid \theta),\, p(w)\right) \tag{10}$$
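In practice, the quantity $\mathrm{MMD}^2$ can be estimated using finite samples from the two distributions. Given a sample $w = (w_1, \dots, w_N)$ from $q(w \mid \theta)$ and a sample $w' = (w'_1, \dots, w'_N)$ from $p(w)$, an unbiased estimator of $\mathrm{MMD}^2$ is given by:

$$\mathrm{MMD}_u^2(w, w') = \frac{1}{N(N-1)} \sum_{i \neq j} k(w_i, w_j) - \frac{2}{N^2} \sum_{i, j} k(w_i, w'_j) + \frac{1}{N(N-1)} \sum_{i \neq j} k(w'_i, w'_j) \tag{11}$$

where $w_i$ is the $i$th element of $w$ (and similarly for $w'$), and both vectors have size $N$. Using the unbiased version, the result can be negative if the two distributions are very close to each other. For this reason, and to speed up the convergence, we use a different formulation in which we eliminate the negative part: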

As stated earlier, the idea of using the MMD distance in connection with neural network models was originally explored in Dziugaite et al. (2015) and Li et al. (2015), who were mostly focused on generative models. The power of minimizing the MMD distance relies on the fact that it is equivalent to minimizing a distance between all the moments of the two distributions, under an affine kernel. In (10) we use this metric as a regularization approach, which minimizes the distance between the posterior over the parameters and the chosen prior. Summarizing, we propose to estimate the posterior by minimizing:

$$\mathcal{L}(D, \theta) = \beta\, \widehat{\mathrm{MMD}}^2(w, w') - \mathbb{E}_{q(w \mid \theta)}\left[\log p(D \mid w)\right] \tag{12}$$

where $\widehat{\mathrm{MMD}}^2(w, w')$ is the finite-sample estimate of the MMD computed on the samples $w \sim q(w \mid \theta)$ and $w' \sim p(w)$, $N$ is the number of times that we sample from the two distributions, and $\beta$ is an additional scale factor to balance the classification loss and the posterior's one. As in Blundell et al. (2015), we set $\beta$ to $\frac{2^{M-i}}{2^{M}-1}$, where $i$ is the current batch in the training phase and $M$ is the total number of batches; in this way, the first optimization steps are influenced by the prior more than the later ones, which are influenced almost exclusively by the data samples. The pseudocode for training our model is summarized in Algorithm 1.
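The regularizer can be illustrated with the two standard finite-sample estimators from Gretton et al. (2012): the unbiased one, which can become negative, and the biased V-statistic, which keeps the $i = j$ terms and is never negative. The sketch below is only an illustration under stated assumptions: weights are treated as scalars, the RBF bandwidth is arbitrary, the function names are hypothetical, and the biased estimator is used as a generic non-negative stand-in rather than as the exact formulation adopted in this paper. The last helper reproduces the batch-dependent scale factor of Blundell et al. (2015), $2^{M-i} / (2^{M} - 1)$.

```python
import math
import random


def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def mmd2_unbiased(xs, ys, kernel=rbf_kernel):
    """Unbiased estimate of MMD^2 for equal-size samples; it may be negative."""
    n = len(xs)
    assert n == len(ys) and n > 1
    within_x = sum(kernel(xs[i], xs[j]) for i in range(n) for j in range(n) if i != j)
    within_y = sum(kernel(ys[i], ys[j]) for i in range(n) for j in range(n) if i != j)
    cross = sum(kernel(x, y) for x in xs for y in ys)
    return (within_x + within_y) / (n * (n - 1)) - 2.0 * cross / (n * n)


def mmd2_biased(xs, ys, kernel=rbf_kernel):
    """Biased (V-statistic) estimate of MMD^2: non-negative up to rounding error."""
    n, m = len(xs), len(ys)
    within_x = sum(kernel(a, b) for a in xs for b in xs) / (n * n)
    within_y = sum(kernel(a, b) for a in ys for b in ys) / (m * m)
    cross = sum(kernel(a, b) for a in xs for b in ys) / (n * m)
    return within_x + within_y - 2.0 * cross


def batch_weight(i, n_batches):
    """Scale factor 2^(M - i) / (2^M - 1), with batches indexed from 1 to M."""
    return 2.0 ** (n_batches - i) / (2.0 ** n_batches - 1.0)


rng = random.Random(0)
posterior_sample = [rng.gauss(0.0, 1.0) for _ in range(200)]
prior_sample = [rng.gauss(0.0, 1.0) for _ in range(200)]
shifted_sample = [rng.gauss(3.0, 1.0) for _ in range(200)]

close = mmd2_biased(posterior_sample, prior_sample)    # same distribution: near zero
far = mmd2_biased(posterior_sample, shifted_sample)    # shifted mean: clearly larger
```

Since both estimators only need samples from the two distributions, any prior that can be sampled from can be plugged in, without requiring a closed-form divergence.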

3.2 Bayesian Cross Uncertainty (BCU)

In this section, we propose a modified version of the uncertainty measure formulated in Eq. (8), that we call Bayesian Cross Uncertainty (BCU).

The variance formulated in (8) gives us a $C \times C$ matrix, with $C$ the number of classes of our classification problem. Sometimes, it is useful to have a scalar value that indicates the uncertainty of our prediction and that can be easily used or visualized, so that different uncertainties (or models) can be compared easily.

The most straightforward approach to reduce a matrix to a scalar is to calculate its determinant: the sparser the matrix is, the closer the resulting value will be to zero. This approach comes with an inconvenience: in a binary classification problem, if we have, for a sample $x$, two vectors of predictions $p_1 = [1, 0]$, which codifies the absolute certainty of the prediction, and $p_2 = [0.5, 0.5]$, indicating that the network is maximally uncertain, the determinant of the matrix in Eq. (8) is equal to zero in both cases. To avoid these cases, we propose to modify the formulation of Eq. (8) as follows:


where $C$ is the number of classes and $I$ is the identity matrix. In this case, we have that the determinant of Eq. (14) is lower bounded when we have utmost confidence, and this bound is equal to the determinant of the matrix : . To calculate the upper bound, we need to study when the utmost uncertainty can emerge. There are two possible scenarios: in the first one, the network produces the same probability, $1/C$, for each class (utmost aleatoric uncertainty), while in the second one we have a sample that is classified $T$ times, with $T = C$, and at each prediction the network assigns a probability equal to $1$ to a different class, and zero to the others (utmost epistemic uncertainty). In these cases, the upper bound is: . These two values can be used to normalize the result of Eq. (14) between zero (maximum certainty) and one (utmost uncertainty), given that this formulation ensures a bounded measure of uncertainty. In this way, the uncertainty is well defined for a BNN model, since it reaches its maximum only when one of the two terms, epistemic or aleatoric, reaches it. The final measure of uncertainty that we propose is the normalized version of (13):


where and are the minimum and maximum values as defined above. Furthermore, we define a way of discarding a sample based on its classification’s uncertainty. When the training of the DNN is over, we collect all the measures of uncertainty associated to the samples that have been classified correctly in a set that we call . From this set of uncertainties , we define a threshold as:


where and are, respectively, two functions that return the first and the third quartile of the set , and is a hyper-parameter. Once a threshold is calculated, a new sample can be discarded if its associated uncertainty exceeds it.

We underscore that this way of discarding images is not related to the formulation of variance in Eq. (7) or the BCU, nor to the BNNs, but can be used with every combination of DNN and measures of uncertainty.
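The degeneracy that motivates the BCU can be checked numerically. The sketch below builds the matrix in the form of Kwon et al. (2020) (average of $\mathrm{diag}(\hat{p}_t) - \hat{p}_t \hat{p}_t^{\top}$ plus average of $(\hat{p}_t - \bar{p})(\hat{p}_t - \bar{p})^{\top}$) and evaluates its determinant for a fully certain prediction, a maximally uncertain one, and a set of fully disagreeing predictions. It is an illustrative check with hypothetical function names and does not implement the modified formulation proposed above.

```python
def determinant(matrix):
    """Determinant via Laplace expansion along the first row (fine for small matrices)."""
    n = len(matrix)
    if n == 1:
        return matrix[0][0]
    total = 0.0
    for col in range(n):
        minor = [row[:col] + row[col + 1:] for row in matrix[1:]]
        total += (-1) ** col * matrix[0][col] * determinant(minor)
    return total


def variance_matrix(samples):
    """Aleatoric plus epistemic matrix computed from T sampled prediction vectors."""
    t = len(samples)
    c = len(samples[0])
    p_bar = [sum(p[k] for p in samples) / t for k in range(c)]
    var = [[0.0] * c for _ in range(c)]
    for p in samples:
        for i in range(c):
            for j in range(c):
                aleatoric = (p[i] if i == j else 0.0) - p[i] * p[j]
                epistemic = (p[i] - p_bar[i]) * (p[j] - p_bar[j])
                var[i][j] += (aleatoric + epistemic) / t
    return var


certain = variance_matrix([[1.0, 0.0]])        # absolute certainty
uncertain = variance_matrix([[0.5, 0.5]])      # maximal aleatoric uncertainty
disagreeing = variance_matrix(                 # maximal epistemic uncertainty
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
)

determinants = [determinant(m) for m in (certain, uncertain, disagreeing)]
```

All three determinants vanish (up to rounding), because every row of this matrix sums to zero whenever the predictions are probability vectors; the plain determinant therefore cannot separate certainty from uncertainty.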

4 Neural networks calibration

BNNs are more suitable for a real world decision making application, due to the possibility to give an interval of confidence for the prediction, as explored in Section 2.2. However, another important aspect in these scenarios, apart from the correctness of the predictions, is the ability of a model to provide a good calibration: the more the network is confident about a prediction, the more the probability associated with the predicted class label should reflect the likelihood of a correct classification.

In Niculescu-Mizil and Caruana (2005) the authors proved that shallow neural networks are typically well calibrated for a binary classification task. On the other hand, when considering deeper models, while the networks’ predictions become more accurate, due to the growing complexity, they also become less calibrated, as pointed out in Guo et al. (2017). In this work, we also analyze how calibrated BNNs are. In particular, we show in the experimental section that the proposed MMD estimator leads to better calibrated models.

Given a sample $x$, the associated ground truth label $y$, and the predicted class $\hat{y}$ with its associated probability of correctness $\hat{p}$, we want that:

$$P\left(\hat{y} = y \mid \hat{p} = p\right) = p, \quad \forall p \in [0, 1] \tag{16}$$

This quantity cannot be computed with a finite set of samples, since $\hat{p}$ is a continuous random variable, but it can be approximated and visually represented (as proposed in DeGroot and Fienberg (1983) and Niculescu-Mizil and Caruana (2005)) using the following formula:

$$\mathrm{acc}(B_k) = \frac{1}{|B_k|} \sum_{i \in B_k} \mathbb{1}\left(\hat{y}_i = y_i\right) \tag{17}$$

where $\hat{y}_i$ and $y_i$ are the predicted and the true label for the sample $i$. Having chosen the number of splits $K$ of the range $[0, 1]$ (each one of size $1/K$), we group the predictions into interval bins $B_1, \dots, B_K$. Each $B_k$ is the set of indices of samples with a prediction confidence that falls into the range $\left(\frac{k-1}{K}, \frac{k}{K}\right]$. Eq. (17) can be combined with a measure of confidence calculated as:

$$\mathrm{conf}(B_k) = \frac{1}{|B_k|} \sum_{i \in B_k} \hat{p}_i$$

to understand if a model is calibrated, which is true when $\mathrm{acc}(B_k) = \mathrm{conf}(B_k)$ for each bin $B_k$ with $k \in \{1, \dots, K\}$. These formulas provide a good visualization tool, namely the reliability diagram, but it is also useful to have a scalar value which summarizes the calibration statistics. The metric that we use is called Expected Calibration Error (ECE, Naeini et al. (2015)):

$$\mathrm{ECE} = \sum_{k=1}^{K} \frac{|B_k|}{n} \left|\mathrm{acc}(B_k) - \mathrm{conf}(B_k)\right| \tag{18}$$

where $n$ is the total number of samples. The resulting scalar gives us the calibration gap between a perfectly calibrated network and the evaluated one.
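The calibration metric described above can be sketched in a few lines. The code below is a minimal illustration, assuming equal-width bins over the confidence range and the standard definition of the ECE (the average of the per-bin gap between accuracy and confidence, weighted by the fraction of samples in the bin); the function names and the toy predictions are hypothetical.

```python
import math


def calibration_bins(confidences, n_bins=10):
    """Group sample indices into equal-width confidence bins ((k-1)/K, k/K]."""
    bins = [[] for _ in range(n_bins)]
    for idx, conf in enumerate(confidences):
        # ceil-style index so that a confidence of exactly k/K falls into bin k
        k = min(n_bins - 1, max(0, math.ceil(conf * n_bins) - 1))
        bins[k].append(idx)
    return bins


def expected_calibration_error(confidences, predictions, labels, n_bins=10):
    """Weighted average over the bins of |accuracy - mean confidence|."""
    n = len(confidences)
    ece = 0.0
    for members in calibration_bins(confidences, n_bins):
        if not members:
            continue  # empty bins do not contribute
        acc = sum(1.0 for i in members if predictions[i] == labels[i]) / len(members)
        conf = sum(confidences[i] for i in members) / len(members)
        ece += len(members) / n * abs(acc - conf)
    return ece


# Toy example: two confident predictions (both right) and two hesitant ones (one right).
confidences = [0.95, 0.95, 0.55, 0.55]
predictions = [1, 0, 1, 1]
labels = [1, 0, 1, 0]
ece = expected_calibration_error(confidences, predictions, labels)
```

The per-bin accuracy and confidence computed inside the loop are the same pairs that a reliability diagram plots against each other.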

5 Experiments

To evaluate the proposed VI method, we start with a toy regression task for visualization purposes, before moving to different datasets for image classification. We compare our proposed model with other state-of-the-art approaches. We put particular emphasis on evaluating different priors and on how this choice affects the final results (robustness), on the calibration of BNNs, and on how well our measure of uncertainty behaves when we want to discard images on which the network is uncertain (e.g., adversarial attacks). The code to replicate the experiments can be found in a public repository.

5.1 Case study 1: Regression

Figure 1: Results obtained on the heteroscedastic regression problem with additive noise equal to . Panels: (a) DNN, (b) MC Dropout, (c) BBB, (d) MMD. The line represents the prediction, the smaller points are the training dataset, while the bigger ones are test points outside the training range, to check how the function evolves; we also show the variances of the prediction.
Figure 2: Results obtained on the homoscedastic regression problem with additive noise equal to . Panels: (a) DNN, (b) MC Dropout, (c) BBB, (d) MMD. The line represents the prediction, the smaller points are the training dataset, while the bigger ones are test points outside the training range, to check how the function evolves; we also show the variances of the prediction.

In this section, we evaluate the models on a toy regression problem, in which the networks should learn the underlying distribution of the points, and then be able to provide reasonable predictions even in regions outside the training one.

Each regression dataset is generated randomly using a Gaussian Process, with the RBF kernel, given a range in which the points lie, the number of points to generate and the variance of the additive noise. We generated two different kinds of regression problems: homoscedastic and heteroscedastic. In the first one, the variance is shared across all the random variables, while in the second one each random variable has its own. Each experiment consists in 100 training points in the range and 100 testing points outside this range. In the MMD experiments, we used the RBF kernel with to regularize the posterior.

We compare our approach with a standard DNN, the BBB method and a network which uses the dropout layer to approximate the variational inference (proposed in Gal and Ghahramani (2015b)) by keeping the dropout turned on even during the test phase. This technique is called Monte Carlo Dropout (MC Dropout). For all the experiments we trained the network for 100 epochs using RMSprop with a learning rate equal to .

As prior for BBB and MMD we used a Gaussian distribution . We initialize the as proposed in He et al. (2015) and , to keep the resulting weights around a value that guarantees the convergence of the optimization procedure. In these experiments we do not vary the prior distribution.

In Fig. 1 we show the results obtained from the heteroscedastic experiment with additive noise equal to . We can see that the BNN trained with MMD is the only one capable of reasonably estimating the interval of confidence in regions outside the training one. While the DNN and MC Dropout are too confident about their predictions, BBB gives the least confident predictions, but we can see that it fails to understand that the uncertainty should increase outside the training range. In Fig. 2 the results obtained on a homoscedastic regression problem are shown, with similar trends.

5.2 Case study 2: Image classification

Dataset | DNN | MC Dropout | BBB (neuron-wise) | BBB (weight-wise) | MMD (neuron-wise) | MMD (weight-wise)
MNIST | 98.59 ± 0.08 | 98.30 ± 0.09 | 98.16 ± 0.02 | 28.17 ± 2.69 | 98.64 ± 0.061 | 98.84 ± 0.02
CIFAR10 | 74.73 ± 0.36 | 75.56 ± 0.01 | 65.73 ± 0.50 | - | 75.24 ± 0.29 | 75.64 ± 0.12
CIFAR100 | 39.89 ± 0.33 | 38.85 ± 0.20 | 35.31 ± 0.38 | - | 42.2 ± 0.51 | 42.36 ± 0.36
Table 1: Accuracy and associated standard deviation for each method, both expressed in percentage, obtained on the classification benchmarks. Some results are missing because no combination of parameters led to convergence of the classification task.

In this section, we present the results obtained on image classification experiments. To the best of our knowledge, no competitive results in this field have been obtained using BNNs; the best results are reported in Shridhar et al. (2019), in which the authors used the local re-parametrization trick, described in Kingma et al. (2015): a technique in which the output of a layer is sampled instead of the weights. The main problem of this technique applied to CNNs is that it doubles the number of operations inside a layer (e.g., in the CNN case we have two convolutions, one for the mean and the other for the variance of the layer's output). For this reason, we believe that it is not computationally reasonable, especially with deeper architectures, and MMD could be a step towards better Bayesian CNNs.

Our main concern is to show that the MMD approach works even with a “bad” prior, which implies having little knowledge about the problem. For this purpose, we studied different priors: the Gaussian distribution , the Laplace distribution , the uniform distribution and the Scaled Gaussian Mixture from Blundell et al. (2015). In addition, we evaluate the introduced measure of uncertainty under the Fast Gradient Sign Method (FGSM, Goodfellow et al. (2014)). Finally, we evaluate the calibration of each network.

To this end, we evaluate the methods on three datasets: the first is MNIST LeCun et al. (2010), the second one is CIFAR10, and the last one is a harder version of CIFAR10 called CIFAR100, which contains the same number of images, but 100 classes instead of 10. For all the experiments, we used the Adam optimizer Kingma and Ba (2014) with the learning rate set to , and the weights initialized as in the regression experiments, to ensure a good gradient flow. For MNIST, we used a simple network composed of one CNN layer, with 64 kernels, followed by max pooling and two linear layers. For CIFAR10, we used a network composed of three blocks of convolutions and max pooling, respectively with , and kernels, followed by three linear layers; for CIFAR100, we used the same architecture but doubled the number of kernels. In all the architectures, the activation function is the ReLU.

We trained all the networks for 20 epochs; we also implemented an early stopping criterion, in which training is stopped if the validation score does not improve for 5 consecutive epochs. For BBB, MMD, and MC Dropout we sampled one set of weights during the training phase and 10 sets during the test phase. To have better statistics of the results, we repeated each experiment times.

Since the posterior over the weights doubles the number of parameters, leading to a minimization problem which is harder to solve, we also decided to test a simplification of it, called neuron-wise posterior. This posterior is defined as , in which each weight connecting neuron to neuron in layer has its own mean, but the variance is given by scaled by a parameter which is defined neuron-wise. In this way, we have fewer parameters and the minimization problem could benefit from it.

Prior choice

Neuron-wise Neuron-wise Weight-wise
66.26 75.43 75.43
33.28 74.59 75.04
- 75.47 75.64
52.84 75.47 75.32
12.11 74.90 75.30
- 75.58 75.47
- 74.46 74.89
66.23 74.67 75.70
66.23 75.6 75.93
66.23 74.95 74.89
66.23 74.67 75.70
66.23 75.60 75.93
66.23 74.95 74.89
66.23 74.95 74.89
66.23 74.95 74.89
Table 2: Accuracy results on CIFAR10 showing the robustness with respect to the prior choice. Some results are missing because no combination of parameters led to convergence on the classification task.

We evaluated all the priors described above to understand how much the prior choice impacts the optimization problem and the final results. Due to the large number of priors, only one set of results is shown; the best-performing prior for each method is then used to train the models in all the experiments. The overall classification results will be presented later.

Table 2 shows the results obtained on CIFAR10 with all the tested priors. BBB clearly fails to converge with spiky priors, because the KL divergence forces the distributions to collapse to zero. A clear case of this behaviour can be observed with the Laplace prior, as shown in Fig. 3.

Overall, MMD works better than BBB even with an uninformative prior, such as a uniform distribution that only specifies a range for the parameters, because its sample-based nature leaves more freedom than BBB. Moreover, Fig. 3 also shows that BNNs trained with MMD are capable of approximating a more complex posterior.
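The sample-based nature of the regularizer can be illustrated with a small sketch. It computes the standard biased estimate of the squared MMD (Gretton et al., 2012) between two sets of scalar samples. The Gaussian kernel, its bandwidth, and the toy samples are assumptions made for illustration and are not necessarily the choices used in the experiments.

```python
import math
import random


def gaussian_kernel(a: float, b: float, bandwidth: float = 1.0) -> float:
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def mmd2_biased(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between samples xs and ys."""

    def mean_kernel(us, vs):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy "posterior" weight samples and samples from a uniform prior.
    posterior = [rng.gauss(0.5, 0.3) for _ in range(200)]
    prior = [rng.uniform(-1.0, 1.0) for _ in range(200)]
    print(mmd2_biased(posterior, prior))
```

Only samples from the prior are needed, so a prior such as the uniform distribution can be used directly; no closed-form density ratio is required.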

(a) MMD
(b) BBB
Figure 3: Posterior distribution of the weights obtained on CIFAR10 with the prior . The BBB method fails when combined with the peaked prior, because it forces the distributions to collapse to zero, neglecting the minimization problem associated with the classification task. Each color represents the weights of a specific layer.
(a) Discarded images while varying the threshold .
(b) Classification score while varying the threshold .
(c) Difference between the scores obtained.
Figure 4: The plots show, respectively, how many images are discarded, the score calculated over the samples that have not been discarded, and the difference between the classification scores obtained using the BCU and the entropy-based thresholds. We tested different thresholds. The results refer to the best BNN trained on CIFAR100 with the proposed MMD method, under the FGSM attack with .
(a) Discarded images while varying the threshold .
(b) Classification score while varying the threshold .
(c) Difference between the scores obtained.
Figure 5: The plots show, respectively, how many images are discarded, the score calculated over the samples that have not been discarded, and the difference between the classification scores obtained using the BCU and the entropy-based thresholds. We tested different thresholds. The results refer to the best model trained on CIFAR10 with the MC Dropout approach, under the FGSM attack with .
Dataset     DNN             MC Dropout     MC Dropout (no weight decay)   MMD (weight-wise)
MNIST       0.73 ± 0.09     0.41 ± 0.11    0.49 ± 0.23                    0.50 ± 0.08
CIFAR10     14.56 ± 0.32    3.43 ± 0.57    6.00 ± 0.17                    5.93 ± 0.91
CIFAR100    13.11 ± 6.75    2.22 ± 0.63    5.92 ± 0.58                    3.89 ± 1.76
Table 3: Calibration results for each method, with the associated standard deviation, measured as ECE score (%, Eq. (18)); lower is better.
(a) Reliability diagram of DNN.
(b) Reliability diagram of MC Dropout.
(c) Reliability diagram of MMD.
Figure 6: The images show the reliability diagram for each method compared in Table 3. In these images the correlation between the ECE score and the gap bars is shown visually. The methods are trained on CIFAR10.

Classification results

Table 1 shows the results obtained on the classification experiments. The BBB method fails drastically with a weight-wise posterior, and it also fails to reach good performance when the posterior is neuron-wise and the dataset becomes harder (CIFAR100). Overall, the networks trained with the original ELBO loss fail when the models become bigger and the dataset harder; we also show that they are more sensitive to the choice of the prior.

FGSM test

In this test we compare the proposed BCU measure (14) with the normalized entropy formulation in (6), under the FGSM attack Goodfellow et al. (2014), in which, given an image and its label , we modify the image as: , where is the input-output Jacobian of a randomly sampled network.
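For reference, the standard FGSM perturbation of Goodfellow et al. (2014) moves each input component by a fixed step in the direction of the sign of the loss gradient with respect to the input. The sketch below applies it to a toy logistic-regression model with an analytic gradient; this model is only a stand-in for a randomly sampled network, and all names and values are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def loss_gradient_wrt_input(x, y, w, b):
    """Gradient of the binary cross-entropy w.r.t. the input x of a logistic model."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    # d(loss)/d(logit) = p - y, and d(logit)/d(x_i) = w_i.
    return [(p - y) * wi for wi in w]


def sign(v: float) -> int:
    return (v > 0) - (v < 0)


def fgsm(x, y, w, b, epsilon):
    """Move each input component by epsilon in the direction that increases the loss."""
    grad = loss_gradient_wrt_input(x, y, w, b)
    return [xi + epsilon * sign(gi) for xi, gi in zip(x, grad)]


if __name__ == "__main__":
    x, y = [0.2, -0.1], 1          # toy input and label
    w, b = [1.0, -2.0], 0.0        # toy model parameters
    print(fgsm(x, y, w, b, epsilon=0.1))
```

With a BNN, the gradient would be computed through one sampled set of weights, so the perturbation itself depends on the sample.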

The purpose is to discard images on which the network is less confident; therefore, we study how the threshold, defined as in Eq. (15), behaves when we change the uncertainty measure.
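A threshold-based rejection of this kind can be sketched as follows. The exact threshold definition of Eq. (15) is not reproduced here: the sketch simply assumes that a sample is discarded when its uncertainty exceeds a given value, and reports the number of discarded samples together with the accuracy on the retained ones. The function name and the toy data are hypothetical.

```python
def reject_and_score(uncertainties, predictions, labels, threshold):
    """Discard samples whose uncertainty exceeds the threshold; score the rest."""
    kept = [
        (p, t)
        for u, p, t in zip(uncertainties, predictions, labels)
        if u <= threshold
    ]
    discarded = len(labels) - len(kept)
    # Accuracy over the retained samples (undefined if everything is discarded).
    accuracy = sum(p == t for p, t in kept) / len(kept) if kept else float("nan")
    return discarded, accuracy


if __name__ == "__main__":
    uncertainties = [0.1, 0.9, 0.3, 0.7]
    predictions = [1, 0, 2, 2]
    labels = [1, 1, 2, 0]
    print(reject_and_score(uncertainties, predictions, labels, threshold=0.5))
```

The same routine can be run with either uncertainty measure, which is how the curves for the two measures can be compared at a given threshold.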

In Figs. 5 and 4, we show the results obtained, respectively, on CIFAR10 with MC Dropout and on CIFAR100 with MMD. The number of discarded images decreases exponentially when the threshold is applied to the entropy-based uncertainty measure; the score also drops, since more noisy images are evaluated instead of being discarded. This is because the entropy measure does not take into account the correlation between the classes: only the distribution obtained with a single set of weights is evaluated at a time, so the entropy does not encode the overall uncertainty across all the possible models, nor how a class can influence the others. Both pieces of information are taken into account by the proposed measure of uncertainty.

Network calibration

In this section, we evaluate the calibration of each network. To show the calibration of these models visually, we used reliability diagrams DeGroot and Fienberg (1983); Niculescu-Mizil and Caruana (2005). Fig. 6 shows these diagrams, while Table 3 contains the results achieved in terms of ECE score. Only the networks that achieve a classification result near their best one, presented in Table 1, are considered in these experiments; for this reason, BNNs trained with the BBB method are not evaluated, since they do not reach competitive scores. To make the comparison fairer, we also compared two versions of MC Dropout, because the original one uses weight decay, which leads to a better ECE score (as pointed out in Guo et al. (2017)); consequently, we also trained an MC Dropout network without weight regularization. We can observe that the DNN never achieves good calibration and that, while MC Dropout networks are well calibrated due to the weight decay, our method achieves good calibration even if no regularization is used. Compared with MC Dropout without weight regularization, our method achieves a better ECE score on CIFAR10 and CIFAR100 and a comparable one on MNIST. Overall, the BNNs trained using MMD are generally well calibrated and do not require external normalization techniques to achieve it.
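The ECE score can be sketched in its common binned form (Naeini et al., 2015; Guo et al., 2017): predictions are grouped into equal-width confidence bins, and the gaps between accuracy and mean confidence are averaged with weights proportional to the bin sizes. The number of bins and the half-open binning convention below are assumptions, and the code is not meant to reproduce Eq. (18) exactly.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between accuracy and mean confidence over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Bin i covers [i / n_bins, (i + 1) / n_bins); a confidence of 1.0
        # is placed in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


if __name__ == "__main__":
    confidences = [0.95, 0.95, 0.55, 0.55]  # toy maximum softmax probabilities
    correct = [1, 1, 1, 0]                  # 1 if the prediction was right
    print(expected_calibration_error(confidences, correct))
```

The per-bin gaps computed here are the same quantities that the gap bars of a reliability diagram display.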

6 Conclusion

In this paper, we proposed a new VI method to approximate the posterior over the weights of a BNN, which uses the MMD as a regularization term between the posterior and the prior. This method has advantageous characteristics compared to other VI methods such as MC Dropout and BBB. First, the BNNs trained with this technique achieve better results, and they are capable of approximating a more complex posterior. Second, the method is more robust to the prior choice than BBB, an important aspect in these models. Third, when combined with the right prior, it can lead to a very well calibrated network that also achieves good performance.

We also proposed and tested a new method to compute the classification uncertainty of a BNN. We showed that this measure, combined with a threshold-based rejection technique, is better at discarding samples on which the BNN is less certain, leading to a better score on noisy samples than the entropy measure.

Our MMD method suggests interesting lines of further research, in which a BNN can be trained using VI methods with a regularization term other than the KL divergence, leading to better and more interesting posteriors.


  1. journal: Neurocomputing


  1. Alemi, A.A., Poole, B., Fischer, I., Dillon, J.V., Saurous, R.A., Murphy, K., 2017. Fixing a broken ELBO. arXiv preprint arXiv:1711.00464 .
  2. Blei, D.M., Kucukelbir, A., McAuliffe, J.D., 2017. Variational inference: A review for statisticians. Journal of the American statistical Association 112, 859–877.
  3. Blundell, C., Cornebise, J., Kavukcuoglu, K., Wierstra, D., 2015. Weight uncertainty in neural networks, in: Proceedings of the 32nd International Conference on Machine Learning (ICML).
  4. Briol, F.X., Barp, A., Duncan, A.B., Girolami, M., 2019. Statistical inference for generative models with maximum mean discrepancy. arXiv preprint arXiv:1906.05944 .
  5. Buntine, W.L., 1991. Bayesian back-propagation. Complex Systems 5, 603–643.
  6. Chen, C., Ding, N., Carin, L., 2015. On the convergence of stochastic gradient MCMC algorithms with high-order integrators, in: Advances in Neural Information Processing Systems, pp. 2278–2286.
  7. Chérief-Abdellatif, B.E., Alquier, P., 2019. Finite sample properties of parametric MMD estimation: robustness to misspecification and dependence. arXiv preprint arXiv:1912.05737 .
  8. Chérief-Abdellatif, B.E., Alquier, P., 2020. MMD-Bayes: Robust Bayesian estimation via maximum mean discrepancy, PMLR. pp. 1–21.
  9. DeGroot, M.H., Fienberg, S.E., 1983. The comparison and evaluation of forecasters. Journal of the Royal Statistical Society: Series D (The Statistician) 32, 12–22.
  10. Der Kiureghian, A., Ditlevsen, O., 2009. Aleatory or epistemic? does it matter? Structural Safety 31, 105–112.
  11. Gal, Y., Ghahramani, Z., 2015a. Bayesian convolutional neural networks with Bernoulli approximate variational inference. arXiv preprint arXiv:1506.02158 .
  12. Gal, Y., Ghahramani, Z., 2015b. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. arXiv preprint arXiv:1506.02142 .
  13. Goodfellow, I.J., Shlens, J., Szegedy, C., 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 .
  14. Graves, A., 2011. Practical variational inference for neural networks, in: Advances in Neural Information Processing Systems, pp. 2348–2356.
  15. Gretton, A., Borgwardt, K.M., Rasch, M.J., Schölkopf, B., Smola, A., 2012. A kernel two-sample test. The Journal of Machine Learning Research 13, 723–773.
  16. Guo, C., Pleiss, G., Sun, Y., Weinberger, K.Q., 2017. On calibration of modern neural networks, in: Proceedings of the 34th International Conference on Machine Learning-Volume 70, JMLR. org. pp. 1321–1330.
  17. He, K., Zhang, X., Ren, S., Sun, J., 2015. Delving deep into rectifiers: Surpassing human-level performance on Imagenet classification, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034.
  18. Hinton, G.E., Van Camp, D., 1993. Keeping the neural networks simple by minimizing the description length of the weights, in: Proceedings of the sixth annual conference on Computational learning theory, pp. 5–13.
  19. Hüllermeier, E., Waegeman, W., 2019. Aleatoric and epistemic uncertainty in machine learning: A tutorial introduction. arXiv preprint arXiv:1910.09457 .
  20. Dziugaite, G.K., Roy, D.M., Ghahramani, Z., 2015. Training generative neural networks via Maximum Mean Discrepancy optimization. arXiv e-prints, arXiv:1505.03906.
  21. Kendall, A., Gal, Y., 2017. What uncertainties do we need in Bayesian deep learning for computer vision?, in: Advances in Neural Information Processing Systems, pp. 5574–5584.
  22. Kingma, D.P., Ba, J., 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 .
  23. Kingma, D.P., Salimans, T., Welling, M., 2015. Variational dropout and the local reparameterization trick, in: Advances in Neural Information Processing Systems, pp. 2575–2583.
  24. Kullback, S., Leibler, R.A., 1951. On information and sufficiency. The Annals of Mathematical Statistics 22, 79–86.
  25. Kwon, Y., Won, J.H., Kim, B.J., Paik, M.C., 2020. Uncertainty quantification using Bayesian neural networks in classification: Application to biomedical image segmentation. Computational Statistics & Data Analysis 142, 106816.
  26. LeCun, Y., Cortes, C., Burges, C., 2010. MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist 2.
  27. Leibig, C., Allken, V., Ayhan, M.S., Berens, P., Wahl, S., 2017. Leveraging uncertainty information from deep neural networks for disease detection. Scientific Reports 7, 1–14.
  28. Li, C.L., Chang, W.C., Cheng, Y., Yang, Y., Póczos, B., 2017. MMD GAN: Towards deeper understanding of moment matching network, in: Advances in Neural Information Processing Systems, pp. 2203–2213.
  29. Li, Y., Swersky, K., Zemel, R.S., 2015. Generative moment matching networks. arXiv:1502.02761 .
  30. MacKay, D.J., 1992. A practical Bayesian framework for backpropagation networks. Neural Computation 4, 448–472.
  31. MacKay, D.J., 2003. Information theory, inference and learning algorithms. Cambridge University Press.
  32. Naeini, M.P., Cooper, G., Hauskrecht, M., 2015. Obtaining well calibrated probabilities using bayesian binning, in: Twenty-Ninth AAAI Conference on Artificial Intelligence.
  33. Neal, R.M., 1996. Bayesian Learning for Neural Networks. Springer-Verlag, Berlin, Heidelberg.
  34. Niculescu-Mizil, A., Caruana, R., 2005. Predicting good probabilities with supervised learning, in: Proceedings of the 22nd International Conference on Machine Learning, Association for Computing Machinery, New York, NY, USA. pp. 625–632. doi:10.1145/1102351.1102430.
  35. Rethage, D., Pons, J., Serra, X., 2018. A wavenet for speech denoising, in: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE. pp. 5069–5073.
  36. Shridhar, K., Laumann, F., Liwicki, M., 2019. A Comprehensive guide to Bayesian Convolutional Neural Network with Variational Inference. arXiv e-prints, arXiv:1901.02731.
  37. Wenzel, F., Roth, K., Veeling, B.S., Światkowski, J., Tran, L., Mandt, S., Snoek, J., Salimans, T., Jenatton, R., Nowozin, S., 2020. How good is the Bayes posterior in deep neural networks really? arXiv:2002.02405 .
  38. Zhao, S., Song, J., Ermon, S., 2017. InfoVAE: Information Maximizing Variational Autoencoders. arXiv e-prints, arXiv:1706.02262.