Abstract


Neural networks utilize the softmax as a building block in classification tasks, but the softmax suffers from overconfidence and lacks the ability to represent uncertainty. As a Bayesian alternative to the softmax, we consider a random variable of a categorical probability over class labels. In this framework, the prior distribution explicitly models the presumed noise inherent in the observed label, which provides consistent gains in generalization performance on multiple challenging tasks. The proposed method inherits the advantages of Bayesian approaches, achieving better uncertainty estimation and model calibration. Our method can be implemented as a plug-and-play loss function with negligible computational overhead compared to the softmax with the cross-entropy loss function.


1 Introduction

Softmax (Bridle, 1990) is the de facto standard for post-processing the logits of neural networks (NNs) for classification. When combined with the maximum likelihood objective, it enables efficient gradient computation with respect to logits and has achieved state-of-the-art performance on many benchmark datasets. However, softmax lacks the ability to represent the uncertainty of predictions (Blundell et al., 2015; Gal and Ghahramani, 2016) and is poorly calibrated (Guo et al., 2017). For instance, an NN with softmax can easily be fooled into confidently producing wrong outputs; when the digit 3 is rotated, the NN predicts it as the digit 8 or 4 with high confidence (Louizos and Welling, 2017). Another concern is that the confident predictive behavior of softmax makes NNs prone to overfitting (Xie et al., 2016; Pereyra et al., 2017). This issue raises the need for effective regularization techniques for improving generalization performance.

Bayesian NNs (BNNs; MacKay, 1992) can address the aforementioned issues of softmax. BNNs provide quantifiable measures of uncertainty such as predictive entropy and mutual information (Gal, 2016) and enable automatic embodiment of Occam’s razor (MacKay, 1995). However, some practical obstacles have impeded the wide adoption of BNNs. First, the intractable posterior inference in BNNs demands approximate methods such as variational inference (VI; Graves, 2011; Blundell et al., 2015) and Monte Carlo (MC) dropout (Gal and Ghahramani, 2016). Even with such approximation methods, concerns arise regarding both the quality of the approximation and the computationally expensive posterior inference (Wu et al., 2019; Osawa et al., 2019). In addition, under the extreme non-linearity between parameters and outputs in NNs, determining a meaningful weight prior distribution is challenging (Sun et al., 2019). Last but not least, BNNs often require considerable modifications to existing baselines, or they result in performance degradation (Lakshminarayanan et al., 2017).

In this paper, we apply the Bayesian principle to construct the target distribution for learning classifiers. Specifically, we regard a categorical probability as a random variable and construct the target distribution over the categorical probability by means of Bayesian inference, which is approximated by NNs. The resulting target distribution can be thought of as being regularized via the prior belief, whose impact is controlled by the number of observations. By considering only the random variable of categorical probability, the Bayesian principle can be efficiently applied to existing deep learning building blocks without huge modifications. Our extensive experiments show the effectiveness of being Bayesian about the categorical probability in improving generalization performance, uncertainty estimation, and calibration.

Our contributions can be summarized as follows: 1) we show the importance of considering categorical probability as a random variable instead of being determined by the label; 2) we provide experimental results showing the usefulness of the Bayesian principle in improving generalization performance of large models on standard benchmark datasets, e.g., ResNext-101 on ImageNet; 3) we enable NNs to inherit the advantages of the Bayesian methods in better uncertainty representation and well-calibrated behavior with a negligible increase in computational complexity.

(a) Softmax
(b) BM
Figure 1: Illustration of the difference between softmax and BM when each image is unique in the training set. In softmax, the label “cat” is directly transformed into the target categorical distribution. In BM, the label “cat” is combined with the prior Dirichlet distribution over the categorical probability. Then, the Bayes’ rule updates the belief about categorical probability, which produces the target distribution.

2 Method

2.1 Preliminary

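This paper focuses on classification problems in which, given i.i.d. training samples $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N} \in (\mathcal{X} \times \mathcal{Y})^{N}$, we construct a classifier $\mathcal{F}: \mathcal{X} \to \mathcal{Y}$. Here, $\mathcal{X}$ is an input space and $\mathcal{Y} = \{1, \dots, K\}$ is a set of labels. We denote $\mathbf{x}$ and y as random variables whose unknown probability distributions generate inputs and labels, respectively. Also, we let $\tilde{y} \in \{0, 1\}^{K}$ be a one-hot representation of y.

Let $f^{W}: \mathcal{X} \to \mathcal{X}'$ be a NN with parameters $W$, where $\mathcal{X}' = \mathbb{R}^{K}$ is a logit space. In this paper, we assume $\mathcal{F} = \arg\max_{j} f^{W}_{j}$ is the classification model, where $f^{W}_{j}$ denotes the $j$-th output basis of $f^{W}$, and we concentrate on the problem of learning $W$.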

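Given the logit $f^{W}(x)$ of an input $x$ and a label $y$, the standard framework of learning $W$ is to apply softmax on the logit and minimize the cross-entropy between the one-hot encoded label $\tilde{y}$ and the softmax output (Figure 1(a)). Softmax, denoted by $\phi: \mathcal{X}' \to \Delta^{K-1}$, is motivated by transforming the logit into a categorical probability. That is, it transforms the logit into a normalized exponential form:

$$\phi_{k}(f^{W}(x)) = \frac{\exp(f^{W}_{k}(x))}{\sum_{j=1}^{K} \exp(f^{W}_{j}(x))} \tag{1}$$

where $k \in \{1, \dots, K\}$ and $\Delta^{K-1} = \{a \in \mathbb{R}^{K} : a_{k} \ge 0, \sum_{k} a_{k} = 1\}$. Here, we can see that the softmax output can be viewed as the parameter of $P^{C}(\phi(f^{W}(x)))$, where $P^{C}(\theta)$ is a categorical distribution with parameter $\theta$. Then, the cross-entropy can be computed by $l_{CE}(\tilde{y}, \phi(f^{W}(x))) = -\sum_{k} \tilde{y}_{k} \log \phi_{k}(f^{W}(x))$.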
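The softmax and cross-entropy computation described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function names and the example logits are hypothetical and not taken from the paper.

```python
import math

def softmax(logits):
    """Normalized exponential of a list of logits (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(label, probs):
    """Cross-entropy between a one-hot label (given as a class index) and probs."""
    return -math.log(probs[label])

logits = [2.0, 0.5, -1.0]      # hypothetical logits for K = 3 classes
probs = softmax(logits)        # a point on the simplex
loss = cross_entropy(0, probs) # loss when the observed label is class 0
```

Because the label is one-hot, the sum over classes in the cross-entropy collapses to the negative log-probability of the observed class.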

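We can reformulate the minimization of the softmax cross-entropy as a collection of distribution matching problems. To this end, let $c^{\mathcal{D}}$ be a vector-valued function that counts the label frequency at $x \in \mathcal{X}$ in $\mathcal{D}$, which is defined as $c^{\mathcal{D}}(x) = \sum_{(x', y') \in \mathcal{D}} \tilde{y}' \, \mathbb{1}_{\{x\}}(x')$, where $\mathbb{1}_{A}(x)$ is an indicator function that takes 1 when $x \in A$ and 0 otherwise. The empirical risk on $\mathcal{D}$ can be expressed as follows:

$$\hat{L}_{\mathcal{D}}(W) = -\frac{1}{N} \sum_{i=1}^{N} \log \phi_{y^{(i)}}(f^{W}(x^{(i)})) = \frac{1}{N} \sum_{x \in G(\mathcal{D})} n_{x} \, l^{\mathcal{D}}_{x}(W) + C \tag{2}$$

where $G(\mathcal{D})$ is the set of unique inputs in $\mathcal{D}$, $n_{x} = \sum_{k} c^{\mathcal{D}}_{k}(x)$, $C$ is a constant with respect to $W$, and $l^{\mathcal{D}}_{x}(W)$ measures the KL divergence between the empirical target distribution and the categorical distribution modeled by the NN at location $x$, which is given by:

$$l^{\mathcal{D}}_{x}(W) = \mathrm{KL}\left( P^{C}\left( \frac{c^{\mathcal{D}}(x)}{n_{x}} \right) \,\Big\|\, P^{C}\left( \phi(f^{W}(x)) \right) \right) \tag{3}$$

Therefore, the normalized counting function $c^{\mathcal{D}}(x) / n_{x}$ becomes the estimator of the categorical probability of the target distribution. However, directly approximating this target distribution can cause overfitting, because the estimator uses a single sample or very few samples: most inputs are unique or very rare in the training set.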

One simple heuristic to handle this problem is label smoothing (Szegedy et al., 2016), which constructs a regularized target estimator in which a one-hot encoded label $\tilde{y}$ is relaxed to $(1 - \epsilon) \tilde{y} + \frac{\epsilon}{K} \mathbf{1}$ with hyperparameter $\epsilon$. Under the smoothing operation, the target estimator is regularized by a mixture of the empirical counts and the parameter of the discrete uniform distribution, such that $(1 - \epsilon) \frac{c^{\mathcal{D}}(x)}{n_{x}} + \frac{\epsilon}{K} \mathbf{1}$, where $c^{\mathcal{D}}(x)$ counts the labels observed at $x$ and $n_{x}$ is their total. One concern about label smoothing is that the mixing coefficient is constant with respect to the number of observations, which can prevent exploitation of the empirical counting information when it is needed.

Another more principled approach is BNNs, which prevent full exploitation of the noisy estimation by balancing the distance to the target distribution with model complexity and by maintaining a weight ensemble instead of choosing a single best configuration. Specifically, in BNNs with the Gaussian weight prior $\mathcal{N}(0, \tau^{-1} I)$, the score of a configuration $W$ is measured by the posterior density $p(W \mid \mathcal{D}) \propto p(\mathcal{D} \mid W) \, p(W)$, where we have $\log p(W) = -\frac{\tau}{2} \|W\|_{2}^{2} + \text{const}$. Therefore, the complexity penalty term induced by the prior prevents the softmax output from exactly matching a one-hot encoded target. In modern deep NNs, however, $\|W\|_{2}^{2}$ may be a poor proxy for the model complexity due to the extremely non-linear relationship between weights and outputs (Hafner et al., 2018; Sun et al., 2019) as well as the weight-scaling invariance induced by batch normalization (Ioffe and Szegedy, 2015).

2.2 Being Bayesian about Categorical Probability

In this paper, we take a Bayesian approach to construct the target distribution, which we refer to as the belief matching framework (BM; Figure 1(b)). Specifically, BM regards the categorical probability as a random variable $z$ and approximates the target posterior distribution $p_{z|x,y}$ by NNs. We first explain the construction of the target distribution and then specify the post-processing module on logits that makes NNs represent the approximate distribution $q^{W}_{z|x}$. We conclude this subsection by giving the objective function for matching the approximate distribution to the posterior belief $p_{z|x,y}$, which can be optimized by standard gradient-based optimization techniques.

Target Distribution

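Because we have $y \mid x, z \sim P^{C}(z)$ (see footnote 1), once we specify the prior distribution over $z$, the target distribution is automatically determined via Bayes’ rule: $p_{z|x,y} \propto p_{y|x,z} \, p_{z|x}$. In this paper, we consider a conjugate prior for simplicity, i.e., the Dirichlet distribution. A random variable $z$ (given $x$) following the Dirichlet distribution with concentration parameter vector $\beta$, denoted as $P^{D}(\beta)$, has the following density:

$$p_{z|x}(z) = \frac{\Gamma(\beta_{0})}{\prod_{k} \Gamma(\beta_{k})} \prod_{k=1}^{K} z_{k}^{\beta_{k} - 1} \tag{4}$$

where $\Gamma(\cdot)$ is the gamma function, $z \in \Delta^{K-1}$ meaning that $z$ belongs to the $(K-1)$-simplex, $\beta_{k} > 0$, and $\beta_{0} = \sum_{k} \beta_{k}$. Here, the mean of $z$ is $\beta / \beta_{0}$, and $\beta_{0}$ controls the sharpness of the density such that more mass is centered around the mean as $\beta_{0}$ becomes larger.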

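By the characteristics of the conjugate family, given $\mathcal{D}$ we have the following posterior distribution:

$$p_{z|x,y} = P^{D}\left( \beta + c^{\mathcal{D}}(x) \right) \tag{5}$$

where $c^{\mathcal{D}}(x)$ is the vector of label counts observed at $x$ in $\mathcal{D}$. We can see that the target posterior mean is explicitly smoothed by the prior belief, and the smoothing operation is performed in the principled way of applying Bayes’ rule. Specifically, the posterior mean is given by $\frac{\beta + c^{\mathcal{D}}(x)}{\beta_{0} + n_{x}}$ with $n_{x} = \sum_{k} c^{\mathcal{D}}_{k}(x)$, in which the prior distribution acts as adding pseudo counts. We note that the relative strength between the prior belief and the empirical count information becomes adaptive with respect to each data point.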
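To make the contrast with label smoothing concrete, the following sketch compares the two target estimators for an input observed once and an input observed ten times with the same label. It is only an illustration under assumed values: the function names, the smoothing coefficient `eps`, and the all-ones prior are hypothetical choices, not settings from the paper.

```python
def label_smoothing_target(counts, eps):
    """Mix normalized label counts with the uniform distribution (fixed weight eps)."""
    n, num_classes = sum(counts), len(counts)
    return [(1.0 - eps) * c / n + eps / num_classes for c in counts]

def dirichlet_posterior_mean(counts, beta):
    """Mean of the Dirichlet posterior: prior concentrations act as pseudo counts."""
    total = sum(beta) + sum(counts)
    return [(b + c) / total for b, c in zip(beta, counts)]

beta = [1.0, 1.0, 1.0]        # assumed prior concentration for K = 3
once = [1, 0, 0]              # label 0 observed once at this input
ten_times = [10, 0, 0]        # label 0 observed ten times at this input

smooth_once = label_smoothing_target(once, eps=0.1)
smooth_ten = label_smoothing_target(ten_times, eps=0.1)
bayes_once = dirichlet_posterior_mean(once, beta)        # [0.5, 0.25, 0.25]
bayes_ten = dirichlet_posterior_mean(ten_times, beta)    # [11/13, 1/13, 1/13]
```

The label-smoothing targets are identical in both cases, whereas the posterior mean moves toward the empirical frequency as the number of observations grows, which is the adaptivity the text refers to.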

Approximate Distribution

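Now, we specify the approximate posterior distribution modeled by the NNs, which aims to approximate $p_{z|x,y}$. In this paper, we model the approximate posterior as the Dirichlet distribution. To this end, we use an exponential function to transform logits to the concentration parameter, and we let $\alpha^{W}(x) = \exp(f^{W}(x))$. Then, the NN can represent the density over $\Delta^{K-1}$:

$$q^{W}_{z|x}(z) = \frac{\Gamma(\alpha^{W}_{0}(x))}{\prod_{k} \Gamma(\alpha^{W}_{k}(x))} \prod_{k=1}^{K} z_{k}^{\alpha^{W}_{k}(x) - 1} \tag{6}$$

where $\alpha^{W}_{0}(x) = \sum_{k} \alpha^{W}_{k}(x)$. It can be easily shown that the approximate posterior mean corresponds to the softmax, since $\alpha^{W}_{k}(x) / \alpha^{W}_{0}(x) = \phi_{k}(f^{W}(x))$. Therefore, BM can be considered as a generalization of softmax in terms of changing the moment matching problem to the distribution matching problem over $\Delta^{K-1}$.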
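The claim that the mean of the approximate Dirichlet equals the softmax output can be checked numerically. The sketch below assumes the exponential link described above; the helper names and the example logits are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dirichlet_from_logits(logits):
    """Exponentiate logits into concentrations; return (alpha, alpha_0, mean)."""
    alpha = [math.exp(v) for v in logits]
    alpha0 = sum(alpha)
    return alpha, alpha0, [a / alpha0 for a in alpha]

logits = [2.0, 0.5, -1.0]                      # hypothetical logits
alpha, alpha0, mean = dirichlet_from_logits(logits)

# Adding a constant to every logit leaves the mean (softmax) unchanged
# but multiplies the total concentration, i.e. the sharpness of the density.
shifted = [v + 3.0 for v in logits]
alpha_s, alpha0_s, mean_s = dirichlet_from_logits(shifted)
```

One consequence worth noting: softmax is invariant to a common shift of the logits, but the Dirichlet parameterization is not, so the overall logit scale carries information about how concentrated the belief is.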

To understand the distribution matching objective in BM, we reformulate equation 6 as follows:


In the limit of , we can see that mean of the target posterior (equation 5) becomes a virtual label, for which individual ought to match; the penalty for ambiguous configuration is determined by the number of observations. Therefore, the distribution matching in BM can be thought as learning to score a categorical probability based on closeness to the target posterior mean, in which exploitation of the closeness information is automatically controlled by the data.

Target Distribution Matching

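We solve the distribution matching problem by maximizing the evidence lower bound (ELBO), denoted by $l_{EB}$, which can be derived as follows (Jordan et al., 1999):

$$l_{EB}(y, \alpha^{W}(x)) = \log p(y \mid x) - \mathrm{KL}\left( q^{W}_{z|x} \,\|\, p_{z|x,y} \right) \tag{8}$$

$$\phantom{l_{EB}(y, \alpha^{W}(x))} = \mathbb{E}_{q^{W}_{z|x}}\left[ \log p(y \mid x, z) \right] - \mathrm{KL}\left( q^{W}_{z|x} \,\|\, p_{z|x} \right) \tag{9}$$

where we can see that maximizing $l_{EB}$ corresponds to minimizing $\mathrm{KL}(q^{W}_{z|x} \,\|\, p_{z|x,y})$ because the KL-divergence is non-negative and $\log p(y \mid x)$ is a constant with respect to $W$. Using the ELBO is further motivated by the fact that it removes the need for the counting operations contained in the target distribution construction (equation 5).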

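Each term of $l_{EB}$ in equation 9 can be analytically computed by:

$$\mathbb{E}_{q^{W}_{z|x}}\left[ \log p(y \mid x, z) \right] = \mathbb{E}_{q^{W}_{z|x}}\left[ \log z_{y} \right] = \psi(\alpha^{W}_{y}(x)) - \psi(\alpha^{W}_{0}(x)) \tag{10}$$

where $\psi$ is the digamma function (the logarithmic derivative of $\Gamma$), and

$$\mathrm{KL}\left( q^{W}_{z|x} \,\|\, p_{z|x} \right) = \log \frac{\Gamma(\alpha^{W}_{0}(x)) \prod_{k} \Gamma(\beta_{k})}{\prod_{k} \Gamma(\alpha^{W}_{k}(x)) \, \Gamma(\beta_{0})} + \sum_{k} \left( \alpha^{W}_{k}(x) - \beta_{k} \right) \left( \psi(\alpha^{W}_{k}(x)) - \psi(\alpha^{W}_{0}(x)) \right) \tag{11}$$

where $p_{z|x}$ is assumed to be an input-independent conjugate prior for simplicity; that is, $p_{z|x} = P^{D}(\beta)$.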

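Given the analytic solutions in equation 10 and equation 11, we obtain the derivative of $l_{EB}$ with respect to the weight $W$:

$$\frac{\partial l_{EB}}{\partial W} = \sum_{k} \frac{\partial l_{EB}}{\partial \alpha^{W}_{k}} \frac{\partial \alpha^{W}_{k}}{\partial f^{W}_{k}} \frac{\partial f^{W}_{k}}{\partial W} \tag{12}$$

where $\partial \alpha^{W}_{k} / \partial f^{W}_{k} = \alpha^{W}_{k}$, $\partial f^{W}_{k} / \partial W$ can be computed by back-propagation (Rumelhart et al., 1986), and

$$\frac{\partial l_{EB}}{\partial \alpha^{W}_{k}} = \left( \tilde{y}_{k} - (\alpha^{W}_{k} - \beta_{k}) \right) \psi'(\alpha^{W}_{k}) - \left( 1 - (\alpha^{W}_{0} - \beta_{0}) \right) \psi'(\alpha^{W}_{0}) \tag{13}$$

where $\psi'$ is the derivative of the digamma function, and we can see that a local optimum is achieved when $\alpha^{W}(x) = \beta + \tilde{y}$. We note that the gradient computation in equation 13 has a complexity of $O(K)$ per sample, which is equal to that of softmax. As in standard practice, we evaluate this gradient by Monte Carlo estimation, i.e., by averaging the gradient over mini-batch samples.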
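The ELBO for a Dirichlet approximate posterior and a Dirichlet prior, together with its gradient with respect to the concentration parameters, can be sketched as follows. This is an illustrative standard-library implementation rather than the authors' code: the digamma and trigamma helpers use a recurrence plus a truncated asymptotic series (an approximation chosen here to avoid external dependencies), and all function names are hypothetical.

```python
import math

def digamma(x):
    """psi(x) for x > 0 via psi(x) = psi(x + 1) - 1/x and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

def trigamma(x):
    """psi'(x) for x > 0 via the same recurrence-plus-series strategy."""
    result = 0.0
    while x < 6.0:
        result += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return result + inv + 0.5 * inv2 + inv * inv2 * (1.0 / 6 - inv2 * (1.0 / 30 - inv2 / 42))

def kl_dirichlet(alpha, beta):
    """KL(Dir(alpha) || Dir(beta)) in closed form."""
    a0, b0 = sum(alpha), sum(beta)
    log_norm = (math.lgamma(a0) - sum(math.lgamma(a) for a in alpha)
                - math.lgamma(b0) + sum(math.lgamma(b) for b in beta))
    return log_norm + sum((a - b) * (digamma(a) - digamma(a0)) for a, b in zip(alpha, beta))

def elbo(label, alpha, beta):
    """Expected log-likelihood of the label minus the KL to the prior."""
    expected_log_lik = digamma(alpha[label]) - digamma(sum(alpha))
    return expected_log_lik - kl_dirichlet(alpha, beta)

def elbo_grad_alpha(label, alpha, beta):
    """Partial derivatives of the ELBO with respect to each concentration."""
    a0, b0 = sum(alpha), sum(beta)
    grads = []
    for k, (a, b) in enumerate(zip(alpha, beta)):
        y_k = 1.0 if k == label else 0.0
        grads.append((y_k - (a - b)) * trigamma(a) - (1.0 - (a0 - b0)) * trigamma(a0))
    return grads

beta = [1.0, 1.0, 1.0]          # assumed prior for K = 3
alpha_star = [2.0, 1.0, 1.0]    # prior plus one-hot label for class 0
grad_at_optimum = elbo_grad_alpha(0, alpha_star, beta)
```

At `alpha_star` the gradient vanishes and the ELBO equals the log marginal likelihood of the label under the prior, here log(1/3), which is a convenient sanity check for any implementation of this loss.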

2.3 On Prior Distributions

We can see that the target posterior mean in equation 5 becomes the counting estimator as $\beta_{0} \to 0$. On the contrary, as $\beta_{0}$ becomes higher, the effect of the empirical counting information is weakened, and it eventually disappears in the limit of $\beta_{0} \to \infty$. Therefore, considering that most of the inputs are unique in $\mathcal{D}$, choosing a small $\beta_{0}$ is appropriate to prevent the resulting posterior distribution from being dominated by the prior. In an ideal fully Bayesian treatment, $\beta$ can be modeled hierarchically; however, we leave this as an important future research direction.

However, a prior distribution with small $\beta_{0}$ implicitly makes $\alpha^{W}(x)$ small, which poses significant challenges for gradient-based optimization. This is because the gradient of the ELBO (cf. equation 13) is notoriously large in the small-value regimes of $\alpha$, e.g., $\psi'(0.01) > 10000$. In addition, various building blocks including normalization (Ioffe and Szegedy, 2015), initialization (He et al., 2015), and architecture (He et al., 2016a) are implicitly or explicitly designed to make $f^{W}(x) \approx 0$ on average; that is, $\alpha^{W}(x) \approx 1$. Therefore, making $\alpha^{W}(x)$ small can be wasteful or can require huge modifications to the existing building blocks. Also, zero-centered logits are encouraged in the sense of natural gradient (Amari, 1998), since they improve the conditioning of the Fisher information matrix (Schraudolph, 1998; LeCun et al., 1998; Raiko et al., 2012; Wiesler et al., 2014).

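Therefore, we set $\beta = \mathbf{1}$ for the prior distribution and then multiply the KL divergence term in the ELBO by a coefficient $\lambda$: $l^{\lambda}_{EB}(y, \alpha^{W}(x)) = \mathbb{E}_{q^{W}_{z|x}}[\log p(y \mid x, z)] - \lambda \, \mathrm{KL}(q^{W}_{z|x} \,\|\, P^{D}(\mathbf{1}))$, whose gradient with respect to $\alpha^{W}_{k}$ can be obtained by the following derivative:

$$\frac{\partial l^{\lambda}_{EB}}{\partial \alpha^{W}_{k}} = \left( \tilde{y}_{k} - (\tilde{\alpha}^{W}_{k} - \lambda) \right) \psi'(\alpha^{W}_{k}) - \left( 1 - (\tilde{\alpha}^{W}_{0} - \lambda K) \right) \psi'(\alpha^{W}_{0}) \tag{14}$$

where $\tilde{\alpha}^{W} = \lambda \alpha^{W}$. We can see that $\alpha^{W}(x) = \mathbf{1} + \tilde{y} / \lambda$ is a local optimum, in which the ratio between $\alpha^{W}_{i}$ and $\alpha^{W}_{j}$ is equal to that of the local optimal point in equation 13 with $\beta = \lambda \mathbf{1}$ for every pair of $i$ and $j$. Therefore, in a naive sense, searching for $\lambda$ with $\beta = \mathbf{1}$ and then multiplying $\alpha^{W}$ by $\lambda$ after training corresponds to the process of searching for the prior distribution’s parameter $\beta$ with $\beta = \lambda \mathbf{1}$.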
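The correspondence between a KL coefficient and a rescaled prior can be verified numerically. The sketch below assumes an all-ones Dirichlet prior and a coefficient, called `lam` here, multiplying the KL term; under those assumptions the stationary point is one plus the one-hot label divided by `lam`, and rescaling it by `lam` gives the stationary point for a prior whose concentrations all equal `lam`. The function names and the value of `lam` are hypothetical.

```python
def trigamma(x):
    """psi'(x) for x > 0 via a recurrence and a truncated asymptotic series."""
    result = 0.0
    while x < 6.0:
        result += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return result + inv + 0.5 * inv2 + inv * inv2 * (1.0 / 6 - inv2 * (1.0 / 30 - inv2 / 42))

def scaled_elbo_grad(label, alpha, lam):
    """Gradient of E[log z_label] - lam * KL(Dir(alpha) || Dir(1)) w.r.t. alpha."""
    num_classes, a0 = len(alpha), sum(alpha)
    grads = []
    for k, a in enumerate(alpha):
        y_k = 1.0 if k == label else 0.0
        grads.append((y_k - lam * (a - 1.0)) * trigamma(a)
                     - (1.0 - lam * (a0 - num_classes)) * trigamma(a0))
    return grads

lam = 0.01                                        # assumed KL coefficient
one_hot = [1.0, 0.0, 0.0]
alpha_opt = [1.0 + y / lam for y in one_hot]      # stationary point: [101, 1, 1]
grad_at_opt = scaled_elbo_grad(0, alpha_opt, lam)
rescaled = [lam * a for a in alpha_opt]           # equals lam * 1 + one_hot
```

The practical benefit is that the network can operate with concentrations near one, where the trigamma terms are moderate, while still emulating a weak prior.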

3 Related Work

BNNs, which model the weight posterior distribution, are the dominant approach for applying Bayesian principles in neural networks. Because BNNs require intractable posterior inference, many posterior approximation schemes have been developed to reduce the approximation gap and improve scalability (e.g., VI (Graves, 2011; Blundell et al., 2015; Wu et al., 2019) and stochastic gradient Markov Chain Monte Carlo (Welling and Teh, 2011; Ma et al., 2015; Gong et al., 2019)). However, even with these approximation techniques, BNNs do not scale to state-of-the-art architectures on large-scale datasets, or they often reduce the generalization performance in practice, which impedes the wide adoption of BNNs despite their numerous potential benefits.

Other approaches avoid explicit modeling of the weight posterior distribution. MC dropout (Gal and Ghahramani, 2016) reinterprets dropout (Srivastava et al., 2014) as approximate VI. As a result, it retains the standard NN training procedure and modifies only the inference procedure for posterior MC approximation. Similar strategies have been adopted in the context of multiplicative noise (Kingma et al., 2015; Louizos and Welling, 2017). In a similar spirit, some approaches (Mandt et al., 2017; Zhang et al., 2018; Maddox et al., 2019; Osawa et al., 2019) sequentially estimate the mean and covariance of the weight posterior distribution by using gradients computed at each step. Unlike BNNs, deep kernel learning (Wilson et al., 2016a, b) places Gaussian processes (GPs) on top of “deterministic” NNs, which combines the NNs’ capability of handling complex high-dimensional data with the GPs’ capability of principled uncertainty representation and robust extrapolation.

Non-Bayesian approaches also help to resolve the limitations of softmax. Lakshminarayanan et al. (2017) propose an ensemble-based method to achieve better uncertainty representation and improved self-calibration. Both Guo et al. (2017) and Neumann et al. (2018) proposed temperature scaling-based methods for post-hoc modifications of softmax for improved calibration. To improve generalization by penalizing over-confidence, Pereyra et al. (2017) propose an auxiliary loss function that penalizes low predictive entropy, and Szegedy et al. (2016) and Xie et al. (2016) consider the types of noise included in ground-truth labels.

We also note that some recent studies use NNs to model the concentration parameter of the Dirichlet distribution, but with a different purpose than BM. Sensoy et al. (2018) use a loss function that explicitly minimizes prediction variances on training samples, which can help to produce high-uncertainty predictions for out-of-distribution or adversarial samples. The prior network (Malinin and Gales, 2018) investigates two types of auxiliary losses computed on in-distribution and out-of-distribution samples, respectively. Similar to the prior network, Chen et al. (2018) consider an auxiliary loss computed on adversarially generated samples.

4 Experiment

We conduct extensive experiments for careful evaluation of BM. We first verify its improvement of the generalization error in image classification tasks, comparing against the softmax and MC dropout as a simple and efficient BNN (section 4.1). We then verify whether BM inherits the advantages of the Bayesian method by placing a prior distribution only on the categorical distribution (sections 4.3 and 4.4). We conclude this section by providing further applications that show the versatility of BM (section 4.5). To support reproducibility, we will submit our code and make it publicly available after the review process. We performed all experiments on a single workstation with 8 GPUs (NVIDIA GeForce RTX 2080 Ti).

4.1 Generalization Performance

Model      Method      C-10   C-100
ResNet-18  Softmax     6.13   26.44
           MC dropout  6.13   26.15
           BM          5.93   24.19
ResNet-50  Softmax     5.76   25.00
           MC dropout  5.75   25.17
           BM          5.59   23.86
Table 1: Test classification error rates (%) on CIFAR. Numbers are means computed across five trials; BM attains the minimum mean error rate in every configuration. Model and hyperparameter are selected based on validation error rates.

Our first experiment uses the pre-activation ResNet (He et al., 2016b) on CIFAR (Krizhevsky, 2009). CIFAR-10 and CIFAR-100 contain 50K training and 10K test images, and each 32x32x3-sized image belongs to one of 10 categories in CIFAR-10 and one of 100 categories in CIFAR-100.

We follow the experimental configurations of He et al. (2016b) except for the following: for warm-up, we use learning rates of [0.01, 0.02, 0.04, 0.06, 0.08] for the first five epochs; we clip the gradient when its norm exceeds 1.0; and we use a 40K/10K training and validation split. We note that the learning rate warm-up and the gradient clipping are extremely helpful for stable training of BM. We searched for the coefficients of BM over and MC dropout over . Here, dropout is applied only on logits before final fully-connected layers with 100 MC samples at test time as in Gal and Ghahramani (2016) (see footnote 2).
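The two stabilization measures mentioned above, learning-rate warm-up and gradient-norm clipping, can be sketched as follows. The warm-up values and the clipping threshold come from the text; everything else is an assumption for illustration, in particular the constant post-warm-up rate `base_lr` (the actual schedule follows He et al. and is not reproduced here) and the choice to clip by the global norm over a flat list of gradient values.

```python
import math

WARMUP_LRS = [0.01, 0.02, 0.04, 0.06, 0.08]   # first five epochs, as in the text

def learning_rate(epoch, base_lr=0.1):
    """Warm-up rate for epochs 0-4; afterwards a placeholder constant base_lr."""
    return WARMUP_LRS[epoch] if epoch < len(WARMUP_LRS) else base_lr

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale the gradient vector so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

clipped = clip_by_global_norm([3.0, 4.0])      # norm 5 -> rescaled to norm 1
untouched = clip_by_global_norm([0.3, 0.4])    # norm 0.5 -> unchanged
```

Clipping bounds the size of each update, which matters for this loss because its gradient can become very large when concentrations are small.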

Table 1 lists the classification error rates on test sets. BM presents the best error rates on all configurations. While MC dropout sometimes improves softmax generalization errors, we can see that BM improves the performance substantially, which shows the effectiveness of regarding categorical probability as a random variable.

Next, we performed a large-scale experiment using ResNext-50 32x4d and ResNext-101 32x8d (Xie et al., 2017) on ImageNet (Russakovsky et al., 2015). ImageNet contains approximately 1.3M training samples and 50K validation samples. Each sample is resized to 224x224x3 and belongs to one of the 1K categories.

Our training strategy follows Xie et al. (2017) except that we clip the gradient when its norm exceeds 1.0, the same as in the CIFAR experiment. Since we use only for the BM hyperparameter, we measure the validation error rates directly.

Consistent with the results on CIFAR, BM improves test errors of softmax (Table 2). This result is appealing because improving the generalization error of deep NNs on large-scale datasets by adopting a Bayesian principle without computational overhead has rarely been reported in the literature.

Model        Method   Top1    Top5
ResNext-50   Softmax  22.23   6.36
             BM       22.03   6.23
ResNext-101  Softmax  20.72   5.59
             BM       20.23   5.26
Table 2: Classification error rates (%) on ImageNet. Due to computational constraints, we report the result obtained from a single experiment.

4.2 Regularization Effects Analysis

We analyze the regularization effects of BM by an ablation study of the KL term in the ELBO (equation 9). In theory, the target distribution in BM is regularized both by the prior distribution, which smooths the target posterior mean by adding pseudo counts, and by averaging over all possible categorical probabilities with the Dirichlet distribution. Therefore, ablating the KL term, which removes the effect of the prior distribution, helps to examine these two effects separately.

We first examine its impact on the generalization performance by training a ResNet-50 on CIFAR without the KL term. The resulting test error rates were 5.68% on CIFAR-10 and on CIFAR-100. Significant reduction in generalization performances by exclusion of KL term indicates the powerful regularization effect of the prior distribution (cf. Table 1). The result that BM without the KL term still achieves lower test error rates compared to softmax demonstrates the regularization effect of considering all possible categorical probabilities by the Dirichlet distribution instead of choosing single categorical probability.

Considering the role of the prior distribution on smoothing the posterior mean, we conjecture that the impact of the prior distribution is similar to the effect of label smoothing. In a recent study (Müller et al., 2019), it is shown that label smoothing makes learned representation reveal tight clusters of data points within the same classes and smaller deviations among the data points. Inspired by this result, we also visualize the activations in the penultimate layer in two-dimensional space. As Figure 2 presents, we can see the result is consistent with Müller et al. (2019), which clearly shows the regularization effect of the prior distribution. To see this, assume that two images belonging to the same class has close distance in the data manifold. Then, the difference between logits of same class examples is a good proxy for the gradient of along the data manifold, denoted by , since the gradient measures changes in the output space with respect to small changes in the input space. Therefore, the tight clusters induced by the prior distribution indicate reduced norm of . In addition, the smaller deviations among the data points implies the reduced norm of since we have .

Figure 2: Penultimate layer’s activations of examples belonging to one of three classes (beaver, dolphin, and otter; indexed by 0,1,2 in CIFAR-100) based on the visualization method proposed in (Müller et al., 2019).

4.3 Calibration

Calibration refers to a model’s ability to match its probabilistic output associated with an event to the actual long-term frequency of the event (Dawid, 1982). The notion of calibration in NNs is associated with how well a network's confidence matches its actual accuracy. For instance, if we gather predictions of a well-calibrated classifier such that each prediction has a confidence of around $p$, the accuracy of the group of predictions will be around $p$. Calibration has become an important property of NNs as they are deployed in many real-world systems: underconfident NNs can produce many false alarms, which leads humans to ignore the predictions of NNs; conversely, overconfident NNs can exclude humans from the decision-making loop, which can result in catastrophic accidents.

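Following Guo et al. (2017), we regard $\max_{k} \phi_{k}(f^{W}(x))$ as a measure of prediction confidence for the input $x$. We then measure the calibration performance by the expected calibration error (ECE; Naeini et al., 2015). ECE is calculated by grouping predictions based on the confidence score and then computing the absolute difference between the average accuracy and average confidence for each group; that is, the ECE of $f^{W}$ on $\mathcal{D}$ with $M$ groups is as follows:

$$\mathrm{ECE}_{M}(f^{W}, \mathcal{D}) = \sum_{i=1}^{M} \frac{|\mathcal{G}_{i}|}{|\mathcal{D}|} \left| \mathrm{acc}(\mathcal{G}_{i}) - \mathrm{conf}(\mathcal{G}_{i}) \right| \tag{15}$$

where $\mathcal{G}_{i}$ is a set of samples in the $i$-th group, defined as $\mathcal{G}_{i} = \{ j : (i-1)/M < \max_{k} \phi_{k}(f^{W}(x^{(j)})) \le i/M \}$, $\mathrm{acc}(\mathcal{G}_{i}) = |\mathcal{G}_{i}|^{-1} \sum_{j \in \mathcal{G}_{i}} \mathbb{1}_{\{y^{(j)}\}}(\arg\max_{k} \phi_{k}(f^{W}(x^{(j)})))$, and $\mathrm{conf}(\mathcal{G}_{i}) = |\mathcal{G}_{i}|^{-1} \sum_{j \in \mathcal{G}_{i}} \max_{k} \phi_{k}(f^{W}(x^{(j)}))$.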
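The grouping-and-averaging procedure behind ECE can be sketched as follows. This is a minimal illustration with equal-width confidence bins; the function name, the toy confidences, and the toy correctness flags are hypothetical.

```python
def expected_calibration_error(confidences, correct, num_bins=15):
    """Weighted average of |accuracy - confidence| over equal-width confidence bins.

    confidences: max predicted probability per sample, each in (0, 1].
    correct: 1 if the prediction was right, else 0, aligned with confidences.
    """
    n = len(confidences)
    ece = 0.0
    for i in range(num_bins):
        lower, upper = i / num_bins, (i + 1) / num_bins
        members = [j for j, c in enumerate(confidences) if lower < c <= upper]
        if not members:
            continue  # empty bins contribute nothing
        acc = sum(correct[j] for j in members) / len(members)
        conf = sum(confidences[j] for j in members) / len(members)
        ece += len(members) / n * abs(acc - conf)
    return ece

# Four predictions made with confidence 0.9, of which three are correct.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
# Two predictions made with confidence 0.5, of which one is correct.
calibrated = expected_calibration_error([0.5, 0.5], [1, 0])
```

In the first toy case the accuracy is 0.75 against a confidence of 0.9, giving an ECE of 0.15; in the second the two match and the ECE is zero.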

(a) CIFAR-10
(b) CIFAR-100
Figure 3: Reliability plots of ResNet-50 with BM and softmax. Here, ECE is computed with 15 groups.

We analyze the calibration property of the ResNet-50 examined in section 4.1. As Figure 3 presents, BM’s predictive probability is better matched to its accuracy than that of softmax; that is, BM improves the calibration property of NNs. We note that this result is automatically induced by the framework.

4.4 Uncertainty Representation

One of the most attractive benefits of Bayesian methods is their ability to represent the uncertainty of their predictions. In a naive sense, uncertainty representation ability is the ability to “know what it doesn’t know.” This ability is extremely useful in many promising applications, which includes balancing exploitation and exploration in reinforcement learning (Gal and Ghahramani, 2016) and detecting out-of-distribution samples (Malinin and Gales, 2018).

To assess the ability to produce uncertain predictions, we quantify uncertainty against out-of-distribution samples. This experiment can be thought of as constructing examples for which the NN should output “I don’t know” because the examples belong to classes unseen during training. In this experiment, we quantify uncertainty by predictive entropy, which measures the uncertainty of the predictive categorical probability $\phi(f^{W}(x))$ as follows: $H[\phi(f^{W}(x))] = -\sum_{k=1}^{K} \phi_{k}(f^{W}(x)) \log \phi_{k}(f^{W}(x))$. This uncertainty measure has an intuitive interpretation: an “I don’t know” answer is close to the uniform distribution; conversely, an “I confidently know” answer has one dominating categorical probability.
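Predictive entropy is straightforward to compute from a predicted categorical probability. The sketch below is illustrative only; the function name and the example distributions are hypothetical.

```python
import math

def predictive_entropy(probs):
    """Shannon entropy (in nats) of a categorical probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

uniform = [0.25, 0.25, 0.25, 0.25]     # an "I don't know" answer
confident = [1.0, 0.0, 0.0, 0.0]       # an "I confidently know" answer
peaked = [0.97, 0.01, 0.01, 0.01]      # nearly confident

h_uniform = predictive_entropy(uniform)
h_confident = predictive_entropy(confident)
h_peaked = predictive_entropy(peaked)
```

The entropy is maximal, log K, for the uniform distribution and zero for a one-hot prediction, so density plots of this quantity separate the two kinds of answers.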

Figure 4 presents density plots of the predictive entropy, showing that BM provides notably better uncertainty estimation compared to softmax. Specifically, BM produces clear peaks of predictive entropy in the high-uncertainty region for out-of-distribution samples (Figure 4(b)). In contrast, softmax produces relatively flat uncertainty for out-of-distribution samples (Figure 4(a)). We note that this result is obtained by being Bayesian only about the categorical probability.

Note that some in-distribution samples should be answered as “I don’t know” because the network does not achieve perfect test accuracy. As Figure 4 shows, BM contains more samples of high uncertainty for in-distribution samples compared to softmax that is almost certain in its predictions. This result consistently supports the previous result that BM resolves the overconfidence problem of softmax.

(a) Softmax
(b) BM
Figure 4: Uncertainty representation for in-distribution (CIFAR-100) and out-of-distribution (SVHN) of ResNet-50 under softmax and BM.

4.5 Further Applications

Transfer Learning

Compared to BNNs, BM adopts the Bayesian principle outside the NNs, which enables BM to be applied to models already trained on different tasks without considering Bayesian principles. This interaction is appealing when we consider recent efforts of the deep learning community to construct general baseline models from extremely large-scale datasets and then transfer the baselines to multiple down-stream tasks (e.g., BERT (Devlin et al., 2018) and MoCo (He et al., 2019)).

To address this aspect, we downloaded ResNet-50 pretrained on the ImageNet, and trained only the last linear layer with BM and softmax on CIFAR-10, Food-101 (Bossard et al., 2014; 100K images with 101 food categories), and Cars (Krause et al., 2013; 16K images with 196 car categories). Specifically, we trained the networks for 100 epochs and used the Adam optimizer (Kingma and Ba, 2015) with learning rate of 3e-4. For BM, is used.

Table 3 compares softmax and BM on the three datasets, in which BM achieves significantly better generalization performances compared to softmax. We also observed that better uncertainty estimation ability of BM is preserved in the transfer learning scenario (Figure 5).

Method   C-10   Food-101  Cars
Softmax  5.44   28.49     42.99
BM       5.03   26.41     39.99
Table 3: Transfer learning performance (test error rates, %) from ResNet-50 pretrained on ImageNet to smaller datasets. Means are obtained from five experiments; BM attains the minimum mean error rate on every dataset.

Method   Π-Model  VAT
Softmax  16.52    13.33
BM       16.01    12.40
Table 4: Classification error rates (%) on CIFAR-10. Means are obtained from five experiments; BM attains the minimum mean error rate for both baselines.

Semi-Supervised Learning

While deep learning has achieved impressive performance on many tasks, its success heavily relies on large amounts of supervised training examples. However, constructing such datasets is expensive and time-consuming due to the human labeling process. In this regard, semi-supervised learning aims to make better use of the information contained in unlabeled input data.

Among semi-supervised methods, consistency-based losses, which are our focus here, regularize variations of NNs in the data manifold by penalizing prediction differences under stochastic perturbations (Belkin et al., 2006; Oliver et al., 2018). Whereas the prediction differences under softmax are measured by the divergence between two categorical probabilities, BM can provide a more delicate measure of the prediction consistency, namely the divergence between Dirichlet distributions, which can capture richer probabilistic structures, e.g., (co)variances of categorical probabilities.

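We investigate two baselines that construct consistency targets under stochastic perturbations on inputs (VAT; Miyato et al., 2018) and networks (Π-model; Laine and Aila, 2017). Specifically, VAT generates an adversarial direction $r$, then measures the KL-divergence between predictions at $x$ and $x + r$:

$$\mathcal{L}_{VAT}(x) = \mathrm{KL}\left( \phi(f^{W}(x)) \,\|\, \phi(f^{W}(x + r)) \right) \tag{16}$$

The Π-model measures the distance between predictions with and without enabling stochastic parts in NNs:

$$\mathcal{L}_{\Pi}(x) = \left\| \phi(\bar{f}^{W}(x)) - \phi(f^{W}(x)) \right\|_{2}^{2} \tag{17}$$

where $\phi(\bar{f}^{W}(x))$ is a prediction without the stochastic parts.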

(a) Softmax
(b) BM
Figure 5: Uncertainty representation for in-distribution samples (CIFAR-100) and out-of-distribution samples (SVHN, Foods, and Cars) in transfer learning tasks. BM produces clear peaks in high uncertainty region on SVHN and Food-101. We note that BM confidently predicts examples in Cars because CIFAR-10 contains the object category of “automobile”. On the other hand, softmax produces confident predictions on all datasets compared to BM.

We can see that both models measure consistency in the space of with KL-divergence or norm. Therefore, if we replace the consistency measures in equation 16 by and those in equation 17 to , the consistency can be measured in the space of ; the generalization of the moment matching problem to the distribution matching as in the supervised learning tasks.
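A divergence between Dirichlet distributions, as opposed to one between categorical probabilities, can be sketched as follows. The example uses the closed-form KL divergence between two Dirichlets as one possible consistency measure; whether this exact form matches the replacement used in the experiments is an assumption, and the function names and concentration values are hypothetical. The digamma helper is a standard-library approximation (recurrence plus truncated asymptotic series).

```python
import math

def digamma(x):
    """psi(x) for x > 0 via psi(x) = psi(x + 1) - 1/x and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

def kl_dirichlet(alpha, beta):
    """KL(Dir(alpha) || Dir(beta)) in closed form."""
    a0, b0 = sum(alpha), sum(beta)
    log_norm = (math.lgamma(a0) - sum(math.lgamma(a) for a in alpha)
                - math.lgamma(b0) + sum(math.lgamma(b) for b in beta))
    return log_norm + sum((a - b) * (digamma(a) - digamma(a0)) for a, b in zip(alpha, beta))

def kl_categorical(p, q):
    """KL between two categorical probability vectors with positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def mean(alpha):
    total = sum(alpha)
    return [a / total for a in alpha]

alpha_a = [2.0, 1.0, 1.0]
alpha_b = [20.0, 10.0, 10.0]    # same mean as alpha_a, ten times the concentration

kl_between_means = kl_categorical(mean(alpha_a), mean(alpha_b))   # zero
kl_between_dirichlets = kl_dirichlet(alpha_a, alpha_b)            # strictly positive
```

The two predictions have identical means, so a categorical consistency measure sees no difference, while the Dirichlet divergence still registers the change in concentration. This is the kind of additional structure a distribution-level measure can capture.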

With the new consistency measures, we trained a wide ResNet 28-2 (Zagoruyko and Komodakis, 2016) on CIFAR-10 with 4K labeled and 41K unlabeled examples. Validation and test sets contain 5K and 10K examples, respectively. We matched the experimental configurations to those of Oliver et al. (2018) except for a consistency loss coefficient of 0.03 for VAT and 0.5 for the Π-model to match the scale between supervised and unsupervised losses. We only use for the BM hyperparameter. Our experimental results show that the distribution matching metric of BM is more effective than the moment matching metric under the softmax in reducing the error rates (Table 4). BM’s more sophisticated consistency measures are valuable because prediction differences are frequently used in other domains such as knowledge distillation (Hinton et al., 2015) and model interpretation (Zintgraf et al., 2017).

5 Conclusion

We adopted the Bayesian principle for constructing the target distribution by considering the categorical probability as a random variable rather than as being given by the training label. The proposed method can be flexibly applied to standard deep learning models by replacing only the softmax and the cross-entropy loss, and it provides consistent improvements in generalization performance, better uncertainty estimation, and well-calibrated behavior. We believe that BM shows promising advantages of being Bayesian about categorical probability.


  1. In this paper, is read as a random variable follows a probability distribution with parameter .
  2. We also tried applying dropout after every non-linear activation function, but it consistently reduced performance.


  1. Natural gradient works efficiently in learning. Neural Computation 10 (2), pp. 251–276. Cited by: §2.3.
  2. Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research 7 (Nov), pp. 2399–2434. Cited by: §4.5.
  3. Weight uncertainty in neural networks. In International Conference on Machine Learning, Cited by: §1, §1, §3.
  4. Food-101 – mining discriminative components with random forests. In European Conference on Computer Vision, Cited by: §4.5.
  5. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing, pp. 227–236. Cited by: §1.
  6. A variational Dirichlet framework for out-of-distribution detection. arXiv preprint arXiv:1811.07308. Cited by: §3.
  7. The well-calibrated Bayesian. Journal of the American Statistical Association 77 (379), pp. 605–610. Cited by: §4.3.
  8. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §4.5.
  9. Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning, Cited by: §1, §1, §3, §4.1, §4.4.
  10. Uncertainty in deep learning. Ph.D. Thesis, University of Cambridge. Cited by: §1.
  11. Meta-learning for stochastic gradient MCMC. In International Conference on Learning Representations, Cited by: §3.
  12. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, Cited by: §1, §3.
  13. On calibration of modern neural networks. In International Conference on Machine Learning, Cited by: §1, §3, §4.3.
  14. Noise contrastive priors for functional uncertainty. arXiv preprint arXiv:1807.09289. Cited by: §2.1.
  15. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722. Cited by: §4.5.
  16. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In IEEE International Conference on Computer Vision, Cited by: §2.3.
  17. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.3.
  18. Identity mappings in deep residual networks. In European Conference on Computer Vision, Cited by: §4.1, §4.1.
  19. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §4.5.
  20. Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, Cited by: §2.1, §2.3.
  21. An introduction to variational methods for graphical models. Machine Learning 37 (2), pp. 183–233. Cited by: §2.2.
  22. Adam: a method for stochastic optimization. In International Conference on Learning Representations, Cited by: §4.5.
  23. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, Cited by: §3.
  24. 3D object representations for fine-grained categorization. In ICCV Workshop on 3D Representation and Recognition, pp. 554–561. Cited by: §4.5.
  25. Learning multiple layers of features from tiny images. Technical report. Cited by: §4.1.
  26. Temporal ensembling for semi-supervised learning. In International Conference on Learning Representations, Cited by: §4.5.
  27. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, Cited by: §1, §3.
  28. Efficient backprop. In Neural Networks: Tricks of the Trade, pp. 9–50. Cited by: §2.3.
  29. Multiplicative normalizing flows for variational Bayesian neural networks. In International Conference on Machine Learning, Cited by: §1, §3.
  30. A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems, Cited by: §3.
  31. A practical Bayesian framework for backpropagation networks. Neural Computation 4 (3), pp. 448–472. Cited by: §1.
  32. Probable networks and plausible predictions—a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems 6 (3), pp. 469–505. Cited by: §1.
  33. A simple baseline for Bayesian uncertainty in deep learning. In Advances in Neural Information Processing Systems, Cited by: §3.
  34. Predictive uncertainty estimation via prior networks. In Advances in Neural Information Processing Systems, Cited by: §3, §4.4.
  35. Stochastic gradient descent as approximate Bayesian inference. Journal of Machine Learning Research 18 (1), pp. 4873–4907. Cited by: §3.
  36. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (8), pp. 1979–1993. Cited by: §4.5.
  37. When does label smoothing help?. In Advances in Neural Information Processing Systems, Cited by: Figure 2, §4.2.
  38. Obtaining well calibrated probabilities using Bayesian binning. In AAAI Conference on Artificial Intelligence, Cited by: §4.3.
  39. Relaxed softmax: efficient confidence auto-calibration for safe pedestrian detection. In NIPS Workshop on Machine Learning for Intelligent Transportation Systems, Cited by: §3.
  40. Realistic evaluation of deep semi-supervised learning algorithms. In Advances in Neural Information Processing Systems, Cited by: §4.5, §4.5.
  41. Practical deep learning with Bayesian principles. In Advances in Neural Information Processing Systems, Cited by: §1, §3.
  42. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548. Cited by: §1, §3.
  43. Deep learning made easier by linear transformations in perceptrons. In International Conference on Artificial Intelligence and Statistics, Cited by: §2.3.
  44. Learning representations by back-propagating errors. Nature 323 (6088), pp. 533–536. Cited by: §2.2.
  45. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 115 (3), pp. 211–252. Cited by: §4.1.
  46. Accelerated gradient descent by factor-centering decomposition. Technical report. Cited by: §2.3.
  47. Evidential deep learning to quantify classification uncertainty. In Advances in Neural Information Processing Systems, Cited by: §3.
  48. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §3.
  49. Functional variational Bayesian neural networks. In International Conference on Learning Representations, Cited by: §1, §2.1.
  50. Rethinking the inception architecture for computer vision. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.1, §3.
  51. Bayesian learning via stochastic gradient Langevin dynamics. In International Conference on Machine Learning, Cited by: §3.
  52. Mean-normalized stochastic gradient for large-scale deep learning. In IEEE International Conference on Acoustics, Speech and Signal Processing, Cited by: §2.3.
  53. Deep kernel learning. In International Conference on Artificial Intelligence and Statistics, Cited by: §3.
  54. Stochastic variational deep kernel learning. In Advances in Neural Information Processing Systems, Cited by: §3.
  55. Deterministic variational inference for robust Bayesian neural networks. In International Conference on Learning Representations, Cited by: §1, §3.
  56. DisturbLabel: regularizing CNN on the loss layer. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §3.
  57. Aggregated residual transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §4.1, §4.1.
  58. Wide residual networks. In British Machine Vision Conference, Cited by: §4.5.
  59. Noisy natural gradient as variational inference. In International Conference on Machine Learning, Cited by: §3.
  60. Visualizing deep neural network decisions: prediction difference analysis. In International Conference on Learning Representations, Cited by: §4.5.