A Review of Uncertainty Quantification in Deep Learning: Techniques, Applications and Challenges


Abstract

Uncertainty quantification (UQ) plays a pivotal role in the reduction of uncertainties during both optimization and decision-making processes. It can be applied to solve a variety of real-world problems in science and engineering. Bayesian approximation and ensemble learning techniques are the two most widely used UQ methods in the literature. In this regard, researchers have proposed different UQ methods and examined their performance in a variety of applications such as computer vision (e.g., self-driving cars and object detection), image processing (e.g., image restoration), medical image analysis (e.g., medical image classification and segmentation), natural language processing (e.g., text classification, social media texts and recidivism risk-scoring), bioinformatics, etc. This study reviews recent advances in UQ methods used in deep learning. Moreover, we investigate the application of these methods in reinforcement learning (RL). Then, we outline a few important applications of UQ methods. Finally, we briefly highlight the fundamental research challenges faced by UQ methods and discuss future research directions in this field.

Artificial intelligence, Uncertainty quantification, Deep learning, Machine learning, Bayesian statistics, Ensemble learning, Reinforcement learning.

1 Introduction

In everyday scenarios, we deal with uncertainty in numerous fields, from investment opportunities and medical diagnosis to sporting events and weather forecasting, with the objective of making decisions based on collected observations and uncertain domain knowledge. Nowadays, models developed using machine and deep learning techniques can quantify such uncertainties to accomplish statistical inference [309]. It is very important to evaluate the efficacy of artificial intelligence (AI) systems before their usage [209]. The predictions made by such models are uncertain, as they are prone to noise and wrong model inference, in addition to the inductive assumptions that are inherent under uncertainty. Thus, it is highly desirable to represent uncertainty in a trustworthy manner in any AI-based system. Such automated systems should be able to perform accurately by handling uncertainty effectively. The principle of uncertainty plays an important role in AI settings such as concrete learning algorithms [329] and active learning (AL) [348, 5].

Fig. 1: A schematic view of main differences between aleatoric and epistemic uncertainties.

Knowledge uncertainty arises when the test and training data are mismatched, while data uncertainty arises because of class overlap or the presence of noise in the data [379]. Estimating knowledge uncertainty is more difficult than estimating data uncertainty, which is naturally measured as a result of maximum likelihood training. Identifying the sources of uncertainty in a prediction is essential to tackle the uncertainty estimation problem [130]. There are two main sources of uncertainty, conceptually called aleatoric and epistemic uncertainty [200] (see Fig. 1).

(a) Monte Carlo (MC) dropout
(b) Bootstrap model
(c) Gaussian Mixture model (GMM)
Fig. 2: Schematic view of three different uncertainty models with the related network architectures, reproduced based on [198].

Irreducible uncertainty in the data, giving rise to uncertainty in predictions, is aleatoric uncertainty (also known as data uncertainty). This type of uncertainty is not a property of the model, but rather an inherent property of the data distribution; hence it is irreducible. The other type, epistemic uncertainty (also known as knowledge uncertainty), occurs due to inadequate knowledge and data. One can define models to answer different questions posed in model-based prediction. In data-rich problems, there is a massive collection of data, but it may be informatively poor [335]. In such cases, AI-based methods can be used to define efficient models that characterize the emergent features of the data. Very often these data are incomplete, noisy, discordant and multimodal [309].
Uncertainty quantification (UQ) underpins many critical decisions today, and predictions made without UQ are usually neither trustworthy nor accurate. To understand the Deep Learning (DL) [220, 514] process life cycle, we need to comprehend the role of UQ in DL. DL models start with the collection of the most comprehensive and potentially relevant datasets available for the decision-making process. The DL scenarios are designed to meet certain performance goals, so that the most appropriate DL architecture can be selected after training the model on the labeled data. The iterative training process optimizes different learning parameters, which are 'tweaked' until the network provides a satisfactory level of performance.
There are several uncertainties that need to be quantified in the steps involved. The uncertainties that are obvious in these steps are the following: (i) selection and collection of training data, (ii) completeness and accuracy of training data, (iii) understanding the DL (or traditional machine learning) model, with its performance bounds and limitations, and (iv) uncertainties corresponding to the performance of the model based on operational data [28]. Data-driven approaches such as DL associated with UQ pose at least four overlapping groups of challenges: (i) absence of theory, (ii) absence of causal models, (iii) sensitivity to imperfect data, and (iv) computational expense. To mitigate such challenges, ad hoc solutions like the study of model variability and sensitivity analysis are sometimes employed. Uncertainty estimation and quantification have been extensively studied in DL and traditional machine learning. In the following, we provide a brief summary of some recent studies that examined the effectiveness of various methods to deal with uncertainties.
A schematic comparison of three different uncertainty models [198] (MC dropout, Bootstrap model and GMM) is provided in Fig. 2. In addition, a graphical representation of two uncertainty-aware models (BNN vs. OoD classifier) is illustrated in Fig. 3.

(a) BNN
(b) OoD classifier
Fig. 3: A graphical representation of two different uncertainty-aware (UA) models, reproduced based on [156].

1.1 Research Objectives and Outline

In the era of big data, ML and DL, intelligent use of different raw data has enormous potential to benefit a wide variety of areas. However, UQ in different ML and DL methods can significantly increase the reliability of their results. Ning et al. [349] summarized and classified the main contributions of the data-driven optimization paradigm under uncertainty; as can be observed, that paper reviewed data-driven optimization only. In another study, Kabir et al. [221] reviewed neural-network-based UQ. The authors focused on probabilistic forecasting and prediction intervals (PIs), as these are among the most widely used UQ techniques in the literature.
We have noticed that, from 2010 to 2020 (end of June), more than 2500 papers on UQ in AI were published in various fields (e.g., computer vision, image processing, medical image analysis, signal processing, natural language processing, etc.). On one hand, we ignored a large number of papers due to a lack of adequate connection with the subject of our review. On the other hand, although many of the papers we reviewed were published in related conferences and journals, many others were found on open-access repositories as electronic preprints (i.e., arXiv); we reviewed these due to their high quality and full relevance to the subject. We have tried our best to cover most of the related articles in this review. This review can therefore serve as a comprehensive guide for readers seeking to navigate this fast-growing research field.
Unlike previous review papers in the field of UQ, this study reviews the most recent articles on quantifying uncertainty in AI (ML and DL) using different approaches. In addition, we are keen to find out how UQ can impact real-world cases and how resolving uncertainty in AI can help obtain reliable results. Meanwhile, identifying important gaps in existing methods is a great way to shed light on the path for future research; in this regard, this review provides useful input to future researchers working on UQ in ML and DL. We investigated the most recent studies in the domain of UQ applied to ML and DL methods and summarized the few existing surveys on UQ in ML and DL. It is worth mentioning that the main purpose of this study is not to compare the performance of the different UQ methods proposed, because these methods were introduced for different data and specific tasks; we argue that comparing the performance of all methods is beyond the scope of this study. Instead, this study mainly focuses on important areas including DL, ML and Reinforcement Learning (RL). Hence, the main contributions of this study are as follows:

  • To the best of our knowledge, this is the first comprehensive review paper regarding UQ methods used in ML and DL methods which is worthwhile for researchers in this domain.

  • A comprehensive review of newly proposed UQ methods is provided.

  • Moreover, the main categories of important applications of UQ methods are also listed.

  • The main research gaps of UQ methods are pointed out.

  • Finally, few solid future directions are discussed.

2 Preliminaries

In this section, we explain the structure of feed-forward neural networks, followed by Bayesian modeling, to discuss uncertainty in detail.

2.1 Feed-forward neural network

In this section, the structure of a single-hidden-layer neural network [418] is explained, which can be extended to multiple layers. Suppose $x \in \mathbb{R}^D$ is a $D$-dimensional input vector. We use a linear map $W_1$ and bias $b$ to transform $x$ into a row vector with $Q$ elements, i.e., $W_1 x + b$. Next, a non-linear transfer function $\sigma(\cdot)$, such as the rectified linear unit (ReLU), can be applied to obtain the output of the hidden layer. Then another linear map $W_2$ can be used to take the hidden layer to the output:

$$\hat{y} = W_2\,\sigma(W_1 x + b) \qquad (1)$$

For classification, to compute the probability of belonging to a label $c$ in the set $\{1, \dots, C\}$, the normalized score is obtained by passing the model output through a softmax function, $p_c = \exp(\hat{y}_c) / \sum_{c'} \exp(\hat{y}_{c'})$. Then the softmax loss is used:

$$E(X, Y) = -\frac{1}{N} \sum_{i=1}^{N} \log p_{y_i}(x_i) \qquad (2)$$

where $X = \{x_1, \dots, x_N\}$ and $Y = \{y_1, \dots, y_N\}$ are the inputs and their corresponding outputs, respectively.

For regression, the Euclidean loss can be used:

$$E(X, Y) = \frac{1}{2N} \sum_{i=1}^{N} \|y_i - \hat{y}_i\|^2 \qquad (3)$$
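As a concrete illustration, the forward pass and the two losses above can be sketched in a few lines of NumPy. The dimensions, random weights and random data below are illustrative assumptions, not taken from any reviewed work:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: D-dim input, Q hidden units, K classes, N samples.
D, Q, K, N = 4, 8, 3, 5
W1, b = rng.normal(size=(Q, D)), rng.normal(size=Q)
W2 = rng.normal(size=(K, Q))

def forward(x):
    # Eq. (1): y_hat = W2 * sigma(W1 x + b), with ReLU as sigma.
    h = np.maximum(0.0, W1 @ x + b)
    return W2 @ h

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

X = rng.normal(size=(N, D))
y = rng.integers(0, K, size=N)

# Softmax (cross-entropy) loss as in Eq. (2).
ce = -np.mean([np.log(softmax(forward(x))[c]) for x, c in zip(X, y)])

# Euclidean loss for regression as in Eq. (3), with dummy targets.
t = rng.normal(size=(N, K))
mse = np.mean([0.5 * np.sum((ti - forward(x)) ** 2) for x, ti in zip(X, t)])
```

Both losses are non-negative by construction; the cross-entropy term vanishes only when the model assigns probability one to every true class.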

2.2 Uncertainty Modeling

As mentioned above, there are two main uncertainties: epistemic (model uncertainty) and aleatoric (data uncertainty) [238]. Aleatoric uncertainty has two types: homoscedastic and heteroscedastic [232].
The predictive uncertainty (PU) consists of two parts, (i) epistemic uncertainty (EU) and (ii) aleatoric uncertainty (AU), and can be written as the sum of the two:

$$PU = EU + AU \qquad (4)$$

Epistemic uncertainty can be formulated as a probability distribution over the model parameters. Let $\{X, Y\} = \{(x_i, y_i)\}_{i=1}^{N}$ denote a training dataset with inputs $X = \{x_1, \dots, x_N\}$ and their corresponding classes $Y = \{y_1, \dots, y_N\}$, where $y_i \in \{1, \dots, C\}$ and $C$ represents the number of classes. The aim is to optimize the parameters $\omega$ of a function $y = f^{\omega}(x)$ that can produce the desired output. To achieve this, the Bayesian approach defines a model likelihood $p(y \mid x, \omega)$. For classification, the softmax likelihood can be used:

$$p(y = c \mid x, \omega) = \frac{\exp\big(f_c^{\omega}(x)\big)}{\sum_{c'} \exp\big(f_{c'}^{\omega}(x)\big)} \qquad (5)$$

and the Gaussian likelihood can be assumed for regression:

$$p(y \mid x, \omega) = \mathcal{N}\big(y;\, f^{\omega}(x),\, \tau^{-1} I\big) \qquad (6)$$

where $\tau$ represents the model precision.

The posterior distribution $p(\omega \mid X, Y)$ over $\omega$ for a given dataset can be written, by applying Bayes' theorem, as follows:

$$p(\omega \mid X, Y) = \frac{p(Y \mid X, \omega)\, p(\omega)}{p(Y \mid X)} \qquad (7)$$

For a given test sample $x^*$, a class label $y^*$ can be predicted with regard to the posterior:

$$p(y^* \mid x^*, X, Y) = \int p(y^* \mid x^*, \omega)\, p(\omega \mid X, Y)\, d\omega \qquad (8)$$

This process is called inference or marginalization. However, $p(\omega \mid X, Y)$ cannot be computed analytically, but it can be approximated by a distribution $q_\theta(\omega)$ with variational parameters $\theta$. The aim is to approximate a distribution that is as close as possible to the posterior distribution obtained by the model. As such, the Kullback-Leibler (KL) divergence [250] needs to be minimized with regard to $\theta$. The level of similarity between the two distributions can be measured as follows:

$$KL\big(q_\theta(\omega)\,\|\,p(\omega \mid X, Y)\big) = \int q_\theta(\omega) \log \frac{q_\theta(\omega)}{p(\omega \mid X, Y)}\, d\omega \qquad (9)$$
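When both distributions are factorized (mean-field) Gaussians, as in the KL-to-prior term that appears in the ELBO below, the KL divergence has a well-known closed form. A minimal NumPy sketch, with illustrative test distributions:

```python
import numpy as np

def kl_diag_gaussians(mu_q, sig_q, mu_p, sig_p):
    """Closed-form KL(q || p) between diagonal Gaussians, summed over
    dimensions: log(sig_p/sig_q) + (sig_q^2 + (mu_q-mu_p)^2)/(2 sig_p^2) - 1/2.
    """
    return np.sum(
        np.log(sig_p / sig_q)
        + (sig_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sig_p ** 2)
        - 0.5
    )

# KL is zero iff the two distributions coincide, positive otherwise.
same = kl_diag_gaussians(np.zeros(3), np.ones(3), np.zeros(3), np.ones(3))
diff = kl_diag_gaussians(np.ones(3), np.ones(3), np.zeros(3), np.ones(3))
```

Here `same` evaluates to 0 and `diff` to 1.5 (0.5 per shifted dimension), matching the non-negativity property of Eq. (9).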

The predictive distribution can be approximated by minimizing the KL divergence, as follows:

$$p(y^* \mid x^*, X, Y) \approx \int p(y^* \mid x^*, \omega)\, q_\theta^*(\omega)\, d\omega \qquad (10)$$

where $q_\theta^*(\omega)$ denotes the optimized variational distribution.

KL divergence minimization can also be rearranged into evidence lower bound (ELBO) maximization [40]:

$$\mathcal{L}_{VI}(\theta) = \int q_\theta(\omega) \log p(Y \mid X, \omega)\, d\omega - KL\big(q_\theta(\omega)\,\|\,p(\omega)\big) \qquad (11)$$

where $q_\theta(\omega)$ is able to describe the data well by maximizing the first term, while being as close as possible to the prior by minimizing the second term. This process is called variational inference (VI). Dropout VI is one of the most common approaches that has been widely used to approximate inference in complex models [126]. The minimization objective is as follows [212]:

$$\mathcal{L}(\theta, p) = -\frac{1}{N} \sum_{i=1}^{N} \log p(y_i \mid x_i, \omega) + \frac{1 - p}{2N}\, \|\theta\|^2 \qquad (12)$$

where $N$ and $p$ represent the number of samples and the dropout probability, respectively.

To obtain data-dependent uncertainty, the precision $\tau$ in (6) can be formulated as a function of the data. One approach to obtain epistemic uncertainty is to mix two functions: the predictive mean $f^{\theta}(x)$ and the model precision $g^{\theta}(x)$, so that the likelihood function can be written as $y_i \sim \mathcal{N}\big(f^{\theta}(x_i),\, g^{\theta}(x_i)^{-1}\big)$. A prior distribution is placed over the weights of the model, and then the amount of change in the weights for given data samples is computed. The Euclidean distance loss function (3) can be adapted as follows:

$$E(X, Y) = \frac{1}{N} \sum_{i=1}^{N} \Big( \frac{g^{\theta}(x_i)}{2}\, \|y_i - \hat{y}_i\|^2 - \frac{1}{2} \log g^{\theta}(x_i) \Big) \qquad (13)$$

The predictive variance can then be obtained from $T$ stochastic forward passes with sampled weights $\tilde{\omega}_t$, where $\hat{y}_t^* = f^{\tilde{\omega}_t}(x^*)$:

$$\widehat{\mathrm{Var}}(x^*) \approx \frac{1}{T} \sum_{t=1}^{T} \Big( g^{\tilde{\omega}_t}(x^*)^{-1} + \hat{y}_t^{*\top} \hat{y}_t^{*} \Big) - \mathbb{E}\big[\hat{y}^*\big]^\top \mathbb{E}\big[\hat{y}^*\big] \qquad (14)$$
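In practice, the predictive variance in Eq. (14) is approximated with $T$ stochastic forward passes, summing a spread (epistemic) term and an average inverse-precision (aleatoric) term. A minimal sketch, where the toy dropout "model" and its constant aleatoric term are purely illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def mc_dropout_predict(f, x, T=100):
    """Approximate predictive mean and variance via T stochastic passes of
    a dropout model f (dropout kept ON at test time).

    f(x) is assumed to return (y_hat, tau_inv): a scalar prediction and a
    data-dependent inverse precision (the aleatoric term).
    """
    ys, taus_inv = [], []
    for _ in range(T):
        y_hat, tau_inv = f(x)
        ys.append(y_hat)
        taus_inv.append(tau_inv)
    ys = np.asarray(ys)
    mean = ys.mean(axis=0)
    # Epistemic part: spread of the T predictions; aleatoric: mean tau^-1.
    var = ys.var(axis=0) + np.mean(taus_inv)
    return mean, var

# Toy stochastic "model": a Bernoulli dropout mask on a fixed prediction.
def toy_model(x):
    mask = rng.random(x.shape) > 0.5     # drop each unit with prob. 0.5
    return (x * mask * 2.0).sum(), 0.1   # constant aleatoric term

mean, var = mc_dropout_predict(toy_model, np.ones(4), T=500)
```

With this toy model the epistemic term dominates; switching dropout off would collapse the variance to the aleatoric floor of 0.1.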

3 Uncertainty Quantification using Bayesian techniques

3.1 Bayesian Deep Learning/Bayesian Neural Networks

Despite the success of standard DL methods in solving various real-world problems, they cannot provide information about the reliability of their predictions. To alleviate this issue, BNNs/BDL [213, 508, 313] can be used to interpret the model parameters. BNNs/BDL are robust to the over-fitting problem and can be trained on both small and big datasets [248].

3.2 Monte Carlo (MC) dropout

As stated earlier, it is difficult to compute the exact posterior, but it can be approximated. In this regard, Monte Carlo (MC) [340] is an effective method. Nonetheless, it is slow and computationally expensive when integrated into a deep architecture. To combat this, MC dropout has been introduced, which uses dropout [449] as a regularization term to compute the prediction uncertainty [127]. Dropout is an effective technique that has been widely used to solve the over-fitting problem in DNNs. During the training process, dropout randomly drops some units of the NN to prevent them from co-adapting too much. Assume an NN with $L$ layers, where $W_l$, $b_l$ and $K_l$ denote the weight matrix, bias vector and dimensions of the $l$th layer, respectively. The output of the NN and the target class of the $i$th input $x_i$ are indicated by $\hat{y}_i$ and $y_i$, respectively. The objective function using $L_2$ regularization can be written as:

$$\mathcal{L}_{dropout} = \frac{1}{N} \sum_{i=1}^{N} E(y_i, \hat{y}_i) + \lambda \sum_{l=1}^{L} \big( \|W_l\|_2^2 + \|b_l\|_2^2 \big) \qquad (15)$$

Dropout samples binary variables for each input data point and every network unit in each layer (except the output layer), with probability $p_l$ for the $l$th layer; if the sampled value is 0, the unit is dropped for that input. The same values are used in the backward pass to update the parameters. Fig. 4 shows several visualizations of variational distributions on a simple NN [317].
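The Bernoulli mask sampling just described can be sketched as follows. The layer size and the non-MC test-time rescaling convention (scaling activations by the keep probability, as in the original dropout formulation) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_layer(h, p, train=True):
    """Drop each unit independently with probability p.

    At training time a fresh Bernoulli mask z ~ Bernoulli(1 - p) is sampled
    per input; MC dropout keeps this same sampling active at test time,
    whereas standard dropout rescales deterministically instead.
    """
    if not train:
        return h * (1.0 - p)  # standard (non-MC) test-time rescaling
    z = (rng.random(h.shape) >= p).astype(h.dtype)
    return h * z

h = np.ones(10_000)
kept = dropout_layer(h, p=0.3).mean()  # about 0.7 of the units survive
```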

(a) Baseline neural network
(b) Bernoulli DropConnect
(c) Gaussian DropConnect
(d) Bernoulli Dropout
(e) Gaussian Dropout
(f) Spike-and-Slab Dropout
Fig. 4: A graphical representation of several different visualization of variational distributions on a simple NN which is reproduced based on [317].

Several studies used MC dropout [47] to estimate uncertainty. Wang et al. [506] analyzed epistemic and aleatoric uncertainties for deep CNN-based medical image segmentation problems at both pixel and structure levels. They augmented the input image during the test phase to estimate the transformation uncertainty; specifically, MC sampling was used to estimate the distribution of the output segmentation. Liu et al. [280] proposed a unified model using SGD to approximate both epistemic and aleatoric uncertainties of CNNs in the presence of universal adversarial perturbations. The epistemic uncertainty was estimated by applying MC dropout with a Bernoulli distribution at the output of the neurons. In addition, they introduced a texture bias to better approximate the aleatoric uncertainty. Nasir et al. [337] conducted MC dropout to estimate uncertainties, including the variance of MC samples, predictive entropy and Mutual Information (MI), in a 3D CNN to segment lesions from MRI sequences.

Fig. 5: A general view demonstrating the semi-supervised UA-MT framework applied to LA segmentation which is reproduced based on [551].

In [10], two dropout methods, i.e., element-wise Bernoulli dropout [449] and spatial Bernoulli dropout [477], were implemented to compute the model uncertainty in BNNs for end-to-end autonomous vehicle control. McClure and Kriegeskorte [317] expressed that sampling of weights using Bernoulli or Gaussian distributions can lead to a more accurate depiction of uncertainty than sampling of units. However, according to the outcomes obtained in [317], it can be argued that using either Bernoulli or Gaussian dropout can improve the classification accuracy of CNNs. Based on these findings, they proposed a novel model (called spike-and-slab sampling) by combining Bernoulli and Gaussian dropout.
Do et al. [95] modified U-Net [414], a CNN-based deep model, to segment myocardial arterial spin labeling images and estimate uncertainty. Specifically, batch normalization and dropout were added after each convolutional layer and resolution scale, respectively. Later, Teye et al. [471] proposed MC batch normalization (MCBN), which can be used to estimate the uncertainty of networks with batch normalization; they showed that batch normalization can be considered an approximate Bayesian model. Yu et al. [551] proposed a semi-supervised model to segment the left atrium from 3D MR images. It consists of two modules, a teacher and a student, used in an uncertainty-aware framework called the UA self-ensembling mean teacher (UA-MT) model (see Fig. 5), in which the student model learns from the teacher model by minimizing the segmentation loss on the labeled samples and the consistency loss with respect to the targets of the teacher model. In addition, a UA framework based on MC dropout was designed to help the student model learn a better model by exploiting the uncertainty information obtained from the teacher model. Table I lists studies that directly applied MC dropout to approximate uncertainty along with their applications.

(a) One worker
(b) Synchronous
(c) Asynchronous
(d) Asynchronous and periodic
Fig. 6: A graphical implementation of different SG-MCMC models which is reproduced based on [268].
Study Year Method Application Code
Kendall et al. [228] 2015 SegNet [24] semantic segmentation
Leibig et al. [266] 2017 CNN diabetic retinopathy
Choi et al. [70] 2017 mixture density network (MDN) [39] regression
Jung et al. [215] 2018 full-resolution ResNet [382] brain tumor segmentation
Wickstrom et al. [522] 2018 FCN [437] and SegNet [24] polyps segmentation
Jungo et al. [217] 2018 FCN brain tumor segmentation
Vandal et al. [497] 2018 Variational LSTM predict flight delays
Devries and Taylor [91] 2018 CNN medical image segmentation
Tousignant et al. [480] 2019 CNN MRI images
Norouzi et al. [350] 2019 FCN MRI images segmentation
Roy et al. [415] 2019 Bayesian FCNN brain images (MRI) segmentation
Filos et al. [119] 2019 CNN diabetic retinopathy
Harper and Southern [162] 2020 RNN and CNN emotion prediction
TABLE I: A summary of studies that applied the original MC dropout to approximate uncertainty along with their applications (Sorted by year).

Comparison of MC dropout with other UQ methods

Recently, several studies have compared different UQ methods. For example, Foong et al. [122] empirically and theoretically studied MC dropout and mean-field Gaussian VI. They found that both models can express uncertainty well in shallow BNNs; however, mean-field Gaussian VI could not approximate the posterior well enough to estimate uncertainty for deep BNNs. Ng et al. [344] compared MC dropout with BBB using U-Net [414] as a base classifier. Siddhant et al. [442] empirically studied various deep AL models for NLP; during prediction, they applied dropout to CNNs and RNNs to estimate the uncertainty. Hubschneider et al. [198] compared MC dropout with a bootstrap-ensembling-based method and a Gaussian mixture for the task of vehicle control. In addition, Mukhoti [336] applied MC dropout with several models to estimate uncertainty in regression problems. Kennamer et al. [233] empirically studied MC dropout under astronomical observing conditions.

3.3 Markov chain Monte Carlo (MCMC)

Markov chain Monte Carlo (MCMC) [252] is another effective method that has been used to approximate inference. It starts by taking a random draw $z_0$ from an initial distribution $q(z_0)$ or $q(z_0 \mid x)$. Then, it applies a stochastic transition to $z_{t-1}$, as follows:

$$z_t \sim q(z_t \mid z_{t-1}, x) \qquad (16)$$

This transition operator is chosen and repeated for $T$ steps, and the outcome, which is a random variable, converges in distribution to the exact posterior. Salakhutdinov et al. [420] used MCMC to approximate the predictive distribution of the rating values of movies. Despite the success of conventional MCMC, the sufficient number of iterations is unknown; in addition, MCMC requires a long time to converge to the desired distribution [340]. Several studies have been conducted to overcome these shortcomings. For example, Salimans et al. [422] expanded the space into a set of auxiliary random variables and interpreted the stochastic Markov chain as a variational approximation.
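A random-walk Metropolis sampler is one concrete instance of such a transition operator. The one-dimensional standard-normal target below is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_post(z):
    # Illustrative target: standard normal, log p(z) up to a constant.
    return -0.5 * z ** 2

def metropolis(z0, n_steps=20_000, step=1.0):
    """Random-walk Metropolis: one choice of the stochastic transition
    q(z_t | z_{t-1}) in Eq. (16). Accepted proposals move the chain;
    rejected ones repeat the previous state."""
    z, chain = z0, []
    for _ in range(n_steps):
        prop = z + step * rng.normal()
        if np.log(rng.random()) < log_post(prop) - log_post(z):
            z = prop
        chain.append(z)
    return np.asarray(chain)

chain = metropolis(z0=5.0)[5_000:]  # discard burn-in draws
```

After burn-in, the empirical mean and standard deviation of the chain approach those of the target (0 and 1), illustrating convergence in distribution to the exact posterior.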

The stochastic gradient MCMC (SG-MCMC) [63, 94] was proposed to train DNNs; it only needs to estimate the gradient on small mini-batches. In addition, SG-MCMC converges to the true posterior by decreasing the step sizes [62, 269]. Gong et al. [143] combined amortized inference with SG-MCMC to increase the generalization ability of the model. Li et al. [268] proposed an accelerated SG-MCMC to improve the speed of conventional SG-MCMC (see Fig. 6 for implementations of different SG-MCMC models). However, over short runs, SG-MCMC suffers from a bounded estimation error [469], and its performance degrades when applied to multi-layer networks [71]. In this regard, Zhang et al. [565] developed a cyclical SG-MCMC (cSG-MCMC) to compute the posterior over the weights of neural networks. Specifically, a cyclical stepsize was used instead of a decreasing one: a large stepsize allows the sampler to take large moves, while a small stepsize encourages the sampler to explore local modes.
Although SG-MCMC reduces the computational complexity by using a smaller subset, i.e., a mini-batch, of the dataset at each iteration to update the model parameters, those small subsets of data add noise into the model and consequently increase the uncertainty of the system. To alleviate this, Luo et al. [297] introduced a sampling method called thermostat-assisted continuously-tempered Hamiltonian Monte Carlo, an extended version of conventional Hamiltonian MC (HMC) [99]; note that HMC is an MCMC method [178]. Specifically, they used Nosé-Hoover thermostats [185, 352] to handle the noise generated by mini-batch datasets. Later, dropout HMC (D-HMC) [178] was proposed for uncertainty estimation and compared with SG-MCMC [63] and SGLD [518].
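The SGLD update mentioned above adds properly scaled Gaussian noise to a (stochastic) gradient step on the log-posterior. A minimal sketch on an illustrative one-parameter Gaussian posterior; the full-batch gradient and fixed step size are simplifying assumptions (SGLD proper uses mini-batch gradients and a decreasing step-size schedule):

```python
import numpy as np

rng = np.random.default_rng(0)

def sgld_step(theta, grad_log_post, eps):
    """One Langevin update:
    theta <- theta + (eps / 2) * grad_log_post(theta) + N(0, eps)."""
    noise = rng.normal(scale=np.sqrt(eps), size=np.shape(theta))
    return theta + 0.5 * eps * grad_log_post(theta) + noise

# Illustrative posterior: N(0, 1), so grad log p(theta) = -theta.
theta, samples = 3.0, []
for t in range(20_000):
    theta = sgld_step(theta, lambda th: -th, eps=0.1)
    if t > 2_000:
        samples.append(theta)
samples = np.asarray(samples)
```

With the fixed step size the chain has a small discretization bias, but the collected samples still closely match the target's mean 0 and unit standard deviation.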
Besides, MCMC has been integrated into generative methods to approximate the posterior. For example, in [580], MCMC was applied to stochastic object models, learned by generative adversarial networks (GANs), to approximate the ideal observer. In [253], a visual tracking system based on a variational autoencoder MCMC (VAE-MCMC) was proposed.

Fig. 7: A summary of various VI methods for BDL which is reproduced based on [459]. Note that an additional component is added based on the proposed method in [459].

3.4 Variational Inference (VI)

Variational inference (VI) is an approximation method that learns the posterior distribution over BNN weights. VI-based methods cast the Bayesian inference problem as an optimization problem, which can be solved with the SGD used to train DNNs. Fig. 7 summarizes various VI methods for BNNs [459].
For BNNs, VI-based methods aim to approximate the posterior distribution over the weights of the NN. To achieve this, the loss can be defined as follows:

$$\mathcal{L}(\theta) \approx \frac{1}{M} \sum_{i=1}^{M} \mathcal{L}_i(\theta) + \frac{1}{N}\, KL\big(q_\theta(w)\,\|\,p(w)\big) \qquad (17)$$

where $M$ indicates the number of samples in the mini-batch, and, under a fully factorized Gaussian variational posterior with the reparameterization trick,

$$w = \mu + \sigma \odot \epsilon \qquad (18)$$
$$\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{1}) \qquad (19)$$
$$\sigma = \log\big(\mathbf{1} + \exp(\rho)\big) \qquad (20)$$

where $\odot$ and $\mathbf{1}$ represent the element-wise product and a vector filled with ones, respectively. Eq. (17) can be used to compute (10).

Posch et al. [388] defined the variational distribution using a product of Gaussian distributions with diagonal covariance matrices, representing the posterior uncertainty of the network parameters for each network layer. Later, in [387], they replaced the diagonal covariance matrices with full ones to allow the network parameters to correlate with each other. Inspired by transfer learning and empirical Bayes (EB) [411], MOPED [243] used deterministic weights, derived from a pre-trained DNN with the same architecture, to select meaningful prior distributions over the weight space. Later, in [245], they integrated an approach based on parametric EB into MOPED for mean-field VI in Bayesian DNNs and used a fully factorized Gaussian distribution to model the weights. In addition, they used a real-world case study, i.e., diabetic retinopathy diagnosis, to evaluate their method. Subedar et al. [454] proposed an uncertainty-aware framework based on multi-modal Bayesian fusion for activity recognition, scaling the BDNN to a deeper structure by combining deterministic and variational layers. Marino et al. [312] proposed a stochastic-modeling-based approach to model uncertainty; specifically, a DBNN was used for stochastic learning of the system. The variational BNN [260], a generative-based model, was proposed to predict the superconducting transition temperature; specifically, the VI was adapted to compute the distribution in the latent space of the model.

Louizos and Welling [292] adopted stochastic gradient VI [236] to compute the posterior distributions over the weights of NNs. Hubin and Storvik [197] proposed a stochastic VI method that jointly considers both model and parameter uncertainties in BNNs, introducing latent binary variables to include or exclude certain weights of the model. Liu et al. [286] integrated VI into a spatial-temporal NN to approximate the posterior parameter distribution of the network and estimate the probability of the prediction. Ryu et al. [419] integrated a graph convolutional network (GCN) into the Bayesian framework to learn representations and predict molecular properties. Swiatkowski et al. [459] empirically studied Gaussian mean-field VI; they decomposed the variational parameters into a low-rank factorization to obtain a more compact approximation and to improve the signal-to-noise ratio of the stochastic gradients when estimating the variational lower bound. Farquhar et al. [115] used mean-field VI to better train deep models, arguing that a deeper linear mean-field network can provide a function-space distribution analogous to that of a shallower full-covariance network. A schematic view of the proposed approach is demonstrated in Fig. 8.

Fig. 8: A general architecture of the deeper linear mean-field network with three mean-field weight layers or more which is reproduced based on [115].

3.5 Bayesian Active Learning (BAL)

Active learning (AL) methods aim to learn from unlabeled samples by querying an oracle [186]. Defining the right acquisition function, i.e., the condition under which a sample is most informative for the model, is the main challenge of AL-based methods. Although existing AL frameworks have shown promising results in a variety of tasks, they lack scalability to high-dimensional data [478]. In this regard, Bayesian approaches can be integrated into the DL structure to represent uncertainty and then combined with a deep AL acquisition function to probe for the most uncertain samples in the oracle.

DBAL [129], i.e., deep Bayesian AL, combines an AL framework with Bayesian DL to deal with high-dimensional data problems, i.e., image data. DBAL uses batch acquisition to select the top samples with the highest Bayesian AL by disagreement (BALD) [187] score. Model priors from empirical Bayes (MOPED) [244] used BALD to evaluate uncertainty; in addition, MC dropout was applied to estimate the model uncertainty. Later, Kirsch et al. [237] proposed BatchBALD, which uses a greedy algorithm to select a batch in linear time, reducing the run time. They modeled the uncertainty by leveraging Bayesian AL (BAL) using dropout sampling. In [53], two types of uncertainty measures, namely entropy and BALD [187], were compared.
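The BALD score used by these methods is the mutual information between the prediction and the model posterior, estimated from MC-dropout samples as the entropy of the mean prediction minus the mean per-pass entropy. A minimal sketch with illustrative probability arrays:

```python
import numpy as np

def entropy(p, axis=-1):
    # Shannon entropy in nats, with clipping for numerical safety.
    return -np.sum(p * np.log(np.clip(p, 1e-12, 1.0)), axis=axis)

def bald_score(probs):
    """BALD acquisition from T stochastic forward passes.

    probs: array of shape (T, K) of per-pass class probabilities.
    Score = H[mean prediction] - mean per-pass H: high when the passes
    are individually confident but disagree with one another.
    """
    mean_p = probs.mean(axis=0)
    return entropy(mean_p) - entropy(probs, axis=-1).mean()

# Confident but disagreeing passes -> high BALD; identical passes -> ~0.
disagree = np.array([[0.99, 0.01], [0.01, 0.99]])
agree = np.array([[0.7, 0.3], [0.7, 0.3]])
```

Note that `agree` has nonzero predictive entropy yet a BALD score of zero: BALD targets epistemic disagreement, not aleatoric ambiguity, which is exactly the contrast drawn in the entropy-vs-BALD comparison above.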

ActiveHARNet [149], an AL-based framework for human action recognition, modeled the uncertainty by linking BNNs with GPs using dropout. To achieve this, dropout was applied before each fully connected layer to estimate the mean and variance of the BNN. DeepBASS [316], a deep AL semi-supervised learning method, is an expectation-maximization [88] based technique paired with an AL component. It applies MC dropout to estimate the uncertainty.

Scandalea et al. [93] proposed a framework based on the U-Net structure for deep AL to segment biomedical images, using the uncertainty measure obtained by MC dropout to suggest samples to be annotated. Specifically, the uncertainty was defined based on the standard deviation of the MC samples' posterior probabilities. Zheng et al. [559] varied the number of Bayesian layers and their positions to estimate uncertainty through AL on the MNIST dataset. The outcome indicated that a few Bayesian layers near the output layer are enough to fully estimate the uncertainty of the model.

Inspired by [199], Bayesian batch AL [380], which selects a batch of samples at each AL iteration to perform posterior inference over the model parameters, was proposed for large-scale problems. Active user training [434], a BAL-based crowdsourcing model, was proposed to tackle high-dimensional and complex classification problems. In addition, the Bayesian inference proposed in [443] was used to consider the uncertainty of the annotators' confusion matrices.

Several generative-based AL frameworks have been introduced. In [146], a semi-supervised Bayesian AL model was developed: a deep generative model that uses BNNs to obtain the discriminative component. Tran et al. [483] proposed a Bayesian-based generative deep AL method (BGADL) (Fig. 9) for image classification problems. They first used the concept of DBAL to select the most informative samples, and then a VAE-ACGAN was applied to generate new samples based on the selected ones. Akbari et al. [7] proposed a unified BDL framework to quantify both aleatoric and epistemic uncertainties for activity recognition. They used an unsupervised DL model to extract features from the time series and then learned their posterior distributions through a VAE model. Finally, dropout [127] was applied after each dense layer, and at the test phase, to randomize the model weights and sample from the approximate posterior, respectively.

Fig. 9: Bayesian generative active deep learning (note that ACGAN stands for the auxiliary-classifier GAN) which is reproduced based on [483].

3.6 Bayes by Backprop (BBB)

The learning process of a probability distribution using the weights of neural networks plays significant role for having better predictions results. Blundell et al. [42] proposed a novel yet efficient algorithm named Bayes by Backprop (BBB) to quantify uncertainty of these weights. The proposed BBB minimizes the compression cost which is known as the variational free energy (VFE) or the lower bound (expected) of the marginal likelihood. To do so, they defined a cost function as follows:

\mathcal{F}(\mathcal{D},\theta) = \mathrm{KL}\big[q(\mathbf{w}\mid\theta)\,\|\,p(\mathbf{w})\big] - \mathbb{E}_{q(\mathbf{w}\mid\theta)}\big[\log p(\mathcal{D}\mid\mathbf{w})\big] \qquad (21)

The BBB algorithm uses unbiased gradient estimates of the cost function in (21) to learn the distribution over the weights of the neural network. In another research, Fortunato et al. [124] proposed new Bayesian recurrent neural networks (BRNNs) using the BBB algorithm. To improve BBB, they used a simple adaptation of truncated backpropagation through time. The proposed BRNN model is shown in Fig. 10.
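To make the BBB objective concrete, the following is a minimal numpy sketch (illustrative only, not the authors' implementation): a single-weight linear model with a Gaussian variational posterior, where the variational free energy is estimated by Monte Carlo using the reparameterization trick.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 2x + noise, modelled by a single weight w.
x = rng.normal(size=50)
y = 2.0 * x + 0.1 * rng.normal(size=50)

def cost(mu, sigma, n_samples=1000):
    """MC estimate of the variational free energy in (21) for
    q(w | theta) = N(mu, sigma^2) and prior p(w) = N(0, 1)."""
    eps = rng.normal(size=n_samples)
    w = mu + sigma * eps                  # reparameterization trick
    log_q = -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * eps**2
    log_prior = -0.5 * np.log(2 * np.pi) - 0.5 * w**2
    # Gaussian likelihood with observation noise std 0.1
    resid = y[None, :] - w[:, None] * x[None, :]
    log_lik = np.sum(-0.5 * np.log(2 * np.pi * 0.01) - resid**2 / 0.02, axis=1)
    return np.mean(log_q - log_prior - log_lik)

# The cost is lower near the true weight than far from it.
print(cost(2.0, 0.05) < cost(0.0, 0.05))
```

In the full algorithm these MC estimates are differentiated with respect to (mu, sigma) and minimized by gradient descent; the sketch only evaluates the objective.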

Fig. 10: Bayesian RNNs (BRNNs) which is reproduced based on the proposed model by Fortunato et al. [124].

Ebrahimi et al. [107] proposed an uncertainty-guided continual learning approach with BNNs (named UCB, for uncertainty-guided continual learning technique with BNNs). Continual learning aims to learn a variety of new tasks while retaining the knowledge obtained from previously learned ones. The proposed UCB exploits the predicted uncertainty of the posterior distribution to regulate the change in "important" parameters, either through a hard threshold or in a soft way. Recognizing different actions in videos requires not only large amounts of data but is also time-consuming. To deal with this issue, de la Riva and Mettes [87] proposed a Bayesian deep learning method (named Bayesian 3D ConvNet) to analyze a small number of videos. BBB was extended to 3D CNNs and then employed to deal with the uncertainty over the convolution weights in the proposed model. Specifically, a Gaussian distribution parameterized by a mean and standard deviation (STD) was applied to approximate the true posterior in the proposed 3D convolution layers, as follows:

(22)

where the terms in the equation denote the input, the output, the filter height, the filter width and the time dimension, respectively. In another research, Ng et al. [344] compared the performance of two well-known uncertainty methods (MC dropout and BBB) for medical image segmentation (cardiac MRI) on a U-Net model. The obtained results showed that MC dropout and BBB performed almost identically on the medical image segmentation task.

3.7 Variational Autoencoders

Fig. 11: Pairwise Supervised Hashing-Bernoulli VAE (PSHBVAE) which is reproduced based on [81].

An autoencoder is a type of DL model that consists of two components: (i) an encoder, and (ii) a decoder. The encoder maps a high-dimensional input sample x to a low-dimensional latent variable z, while the decoder reproduces the original sample from the latent variable. The latent variables are constrained to conform to a given prior distribution. Variational autoencoders (VAEs) [236] are effective methods for modeling the posterior. They cast learning representations for high-dimensional distributions as a VI problem [139]. A probabilistic model of a sample x in data space with a latent variable z in latent space can be written as follows:

p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, \mathrm{d}z \qquad (23)

VI can be used to derive the evidence lower bound (ELBO) as follows:

\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big) \qquad (24)

where q_\phi(z|x) and p_\theta(x|z) are the encoder and decoder models, respectively, and \phi and \theta indicate their parameters.
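The bound above can be estimated by Monte Carlo with the reparameterization trick; a minimal univariate numpy sketch (the toy encoder, decoder, and data below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def elbo(x, enc_mu, enc_sigma, decode, n_samples=100):
    """MC estimate of the ELBO: E_q[log p(x|z)] - KL(q(z|x) || N(0,1))."""
    eps = rng.normal(size=n_samples)
    z = enc_mu + enc_sigma * eps                 # reparameterized samples
    x_hat = decode(z)                            # decoder mean
    recon = np.mean(-0.5 * np.log(2 * np.pi) - 0.5 * (x - x_hat) ** 2)
    # Closed-form KL between two univariate Gaussians
    kl = 0.5 * (enc_sigma**2 + enc_mu**2 - 1.0 - 2.0 * np.log(enc_sigma))
    return recon - kl

decode = lambda z: 0.5 * z                       # toy linear decoder
x = 0.5 * 0.3                                    # observation generated from z = 0.3
good = elbo(x, enc_mu=0.3, enc_sigma=0.1, decode=decode)
bad = elbo(x, enc_mu=3.0, enc_sigma=0.1, decode=decode)
print(good > bad)
```

An encoder that places its posterior near the latent value that generated the observation attains a higher bound than one that does not.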

Zamani et al. [81] developed a discrete VAE framework with Bernoulli latent variables as binary hashing codes (Fig. 11), using stochastic gradients to learn the model. They proposed a pairwise supervised hashing (PSH) framework to derive better hashing codes. PSH maximizes the ELBO with weighted KL regularization to learn more informative binary codes, and adopts a pairwise loss function that rewards within-class similarity and between-class dissimilarity, minimizing the distance between the hashing codes of samples from the same class and maximizing it otherwise.
Bohm et al. [43] studied UQ for linear inverse problems using VAEs. Specifically, a vanilla VAE with a mean-field Gaussian posterior was trained on uncorrupted samples under the ELBO, and the ELO method [430] was adopted to approximate the posterior. Edupuganti et al. [108] studied UQ in magnetic resonance image recovery (see Fig. 12). A VAE-GAN, which is a probabilistic recovery scheme, was developed to map low-quality images to high-quality ones. The VAE-GAN consists of a VAE and a multi-layer CNN as the generator and discriminator, respectively. In addition, Stein's unbiased risk estimator (SURE) was leveraged as a proxy for the prediction error to estimate the uncertainty of the model.

Fig. 12: A schematic view of the proposed VAE model by Edupuganti et al. which is reproduced based on [108].

In [210], a framework based on the variational U-Net [112] architecture was proposed for UQ tasks in reservoir simulations. Both the simple U-Net and the variational U-Net (VUNet) are illustrated in Fig. 13. CosmoVAE [545], a U-Net-based VAE, was proposed to restore missing observations of the cosmic microwave background (CMB) map; the variational Bayes approximation was used to determine the ELBO of the likelihood of the reconstructed image. Mehrasa et al. [320] proposed the action point process VAE (APP-VAE) for action sequences. APP-VAE consists of two LSTMs that estimate the prior and posterior distributions. Sato et al. [424] proposed a VAE-based UA model for anomaly detection, using MC sampling to estimate the posterior.

(a) U-Net
(b) VUNet
Fig. 13: A general view of U-Net and VUNet which are reproduced based on [210].

Since VAEs are not stochastic processes, they are limited to encoding finite-dimensional priors. To alleviate this limitation, Mishra et al. [328] developed the prior encoding VAE. Inspired by Gaussian processes [133], it is a stochastic process that learns a distribution over functions. To achieve this, the encoder first transforms the locations into a high-dimensional feature space and then uses a linear mapping to link the feature space to the outputs, while the decoder aims to recreate the linear mapping from the lower-dimensional probabilistic embedding; the recreated mapping is then used to reconstruct the outputs. Guo et al. [151] used a VAE to deal with data uncertainty under a just-in-time learning framework. A Gaussian distribution was employed to describe the latent-space features variable-wise, and the KL divergence was used to ensure that the selected samples are the most relevant to a new sample. Daxberger et al. [85] tried to detect OoD samples during the test phase. They developed an unsupervised, probabilistic framework based on a Bayesian VAE and estimated the posterior over the decoder parameters by applying SG-MCMC.

4 Other methods

In this section, we discuss a few other UQ methods used in machine and deep learning algorithms.

4.1 Deep Gaussian processes

Deep Gaussian processes (DGPs) [84, 105, 423, 45, 549, 458, 355] are effective multi-layer decision-making models that can accurately model uncertainty. They represent a multi-layer hierarchy of Gaussian processes (GPs) [401, 470]. GPs are non-parametric Bayesian models that encode the similarity between samples using a kernel function; they represent the distribution over the latent variables with respect to the input samples as a Gaussian distribution, and the output is then distributed according to a likelihood function. However, conventional GPs cannot effectively scale to large datasets. To alleviate this issue, inducing samples can be used, and the following variational lower bound can be optimized:

\mathcal{L} = \sum_{i=1}^{N} \mathbb{E}_{q(f_i)}\big[\log p(y_i \mid f_i)\big] - \mathrm{KL}\big(q(\mathbf{u})\,\|\,p(\mathbf{u})\big) \qquad (25)

where Z denotes the locations of the inducing samples and q(\mathbf{u}) is the variational approximation to the distribution of the inducing variables \mathbf{u}.
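As background for this bound, the exact GP posterior (before inducing points are introduced for scalability) can be sketched in a few lines of numpy; the kernel, data, and noise level below are illustrative:

```python
import numpy as np

def rbf(a, b, ls=1.0):
    """Squared-exponential kernel encoding similarity between samples."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)

# Exact GP posterior for noisy regression on a toy sine dataset.
x_train = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_train = np.sin(x_train)
noise = 1e-2

K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
K_inv = np.linalg.inv(K)

def predict(x_test):
    Ks = rbf(x_test, x_train)
    mean = Ks @ K_inv @ y_train
    var = 1.0 - np.sum((Ks @ K_inv) * Ks, axis=1)    # k(x, x) = 1 for RBF
    return mean, var

mean, var = predict(np.array([0.5, 10.0]))
# Predictive variance is small near the data and grows far from it.
print(var[0] < var[1])
```

The cubic cost of inverting K is exactly what the inducing-sample bound in (25) avoids on large datasets.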

Oh et al. [358] proposed the hedged instance embedding (HIB), which hedges the position of each sample in the embedding space, to model the uncertainty when the input sample is ambiguous. The probability of two samples matching was extended to stochastic embeddings and approximated via MC sampling, with a mixture of Gaussians used to represent the uncertainty. Havasi et al. [167] applied SGHMC to DGPs to approximate the posterior distribution; they introduced a moving-window MC expectation maximization to obtain the maximum likelihood, addressing the problem of optimizing the large number of parameters in DGPs. Maddox et al. [301] used stochastic weight averaging (SWA) [203] to build a Gaussian-based model to approximate the true posterior. Later, they proposed SWA-G (SWA-Gaussian) [302] to model Bayesian averaging and estimate uncertainty.

Most weight-perturbation-based algorithms suffer from high variance of the gradient estimates because all samples in a mini-batch share the same perturbation. To alleviate this problem, flipout [520] was proposed. Flipout samples pseudo-independent weight perturbations for each input to decorrelate the gradients within a mini-batch; it reduces both the variance and the computational time when training NNs with multiplicative Gaussian perturbations.
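The flipout idea can be sketched in numpy (shapes and names below are illustrative; real implementations perturb the weights inside a network layer):

```python
import numpy as np

rng = np.random.default_rng(2)

def flipout_outputs(x_batch, w_mean, w_pert):
    """Flipout: share one sampled weight perturbation across the mini-batch,
    but decorrelate it per example with random sign vectors r_n and s_n."""
    n, d_in = x_batch.shape
    d_out = w_mean.shape[1]
    r = rng.choice([-1.0, 1.0], size=(n, d_out))
    s = rng.choice([-1.0, 1.0], size=(n, d_in))
    base = x_batch @ w_mean
    pert = ((x_batch * s) @ w_pert) * r      # per-example pseudo-independent noise
    return base + pert

x = rng.normal(size=(4, 3))
w_mean = rng.normal(size=(3, 2))
w_pert = 0.1 * rng.normal(size=(3, 2))      # one shared perturbation sample
out = flipout_outputs(x, w_mean, w_pert)
print(out.shape)
```

Because the sign flips differ per example, each row of the batch effectively sees a different weight perturbation at the cost of a single sampled perturbation matrix.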

Despite the success of DNNs in dealing with complex and high-dimensional image data, they are not robust to adversarial examples [460]. Bradshaw et al. [48] proposed a hybrid model of GP and DNNs (GPDNNs) to deal with uncertainty caused by adversarial examples (see Fig. 14).

Fig. 14: A general Gaussian-based DNN model proposed by Bradshaw et al. [48] which is reproduced based on the same reference.

Choi et al. [69] proposed a Gaussian-based model to predict the localization uncertainty in YOLOv3 [405]. They applied a single Gaussian model to the bbox coordinates of the detection layer: the coordinates of each bbox are modeled with a mean and a variance, and the variance is used to predict the uncertainty of the bbox.
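The effect of attaching a Gaussian to each coordinate can be illustrated with the Gaussian negative log-likelihood that such models minimize (a toy sketch, not the YOLOv3 loss as implemented):

```python
import numpy as np

def gaussian_nll(pred_mu, pred_var, target):
    """Negative log-likelihood of a box coordinate under N(mu, var).
    The learned variance acts as the localization uncertainty."""
    return 0.5 * np.log(2 * np.pi * pred_var) + (target - pred_mu) ** 2 / (2 * pred_var)

# A confident (low-variance) accurate prediction is rewarded, while a
# confident but wrong one is penalized more than an uncertain wrong one.
accurate = gaussian_nll(0.50, 0.01, 0.50)
wrong_confident = gaussian_nll(0.50, 0.01, 0.80)
wrong_uncertain = gaussian_nll(0.50, 0.25, 0.80)
print(accurate < wrong_uncertain < wrong_confident)
```

This asymmetry is what drives the network to report high variance on boxes it cannot localize precisely.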

Khan et al. [234] proposed a natural-gradient-based algorithm for Gaussian mean-field VI. A Gaussian distribution with diagonal covariance was used to estimate the probability. The proposed algorithm was implemented within the Adam optimizer: the network weights were perturbed during the gradient evaluation, and a vector was used to adapt the learning rate in order to estimate uncertainty.

Sun et al. [456] considered the structural information of the model weights. They used the matrix-variate Gaussian (MVG) [152] distribution to model structured correlations within the weights of DNNs, and introduced a reparametrization of the MVG posterior to make posterior inference feasible. The resulting MVG model was applied in a probabilistic BP framework to perform posterior inference. Louizos and Welling [291] also used the MVG distribution to estimate the weight posterior uncertainty, treating the weight matrix as a whole rather than treating each of its components independently. As mentioned earlier, GPs have been widely used for UQ in deep learning methods; Van der Wilk et al. [494], Blomqvist et al. [41], Tran et al. [481], Dutordoir et al. [103] and Shi et al. [439] introduced convolutional structure into GPs.

Fig. 15: A schematic view of the TCP model which is reproduced based on [76].

In another study, Corbière et al. [76] argued that estimating the confidence of DNNs and predicting their failures is of key importance for the practical deployment of these methods. In this regard, they showed that the true class probability (TCP) is more suitable than the maximum class probability (MCP) for failure prediction in such deep learning methods, as follows:

\mathrm{TCP}(x, y^{*}) = P(Y = y^{*} \mid x), \qquad \mathrm{MCP}(x) = \max_{y} P(Y = y \mid x) \qquad (26)

where x represents a d-dimensional feature vector and y* is its correct class. They then introduced a new normalized variant of the TCP confidence criterion:

\mathrm{TCP}^{\mathrm{r}}(x, y^{*}) = \frac{P(Y = y^{*} \mid x)}{P(Y = \hat{y} \mid x)} \qquad (27)

A general view of the proposed model in [76] is illustrated in Fig. 15.
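The difference between the two confidence criteria can be illustrated with a toy softmax output (the values below are illustrative):

```python
import numpy as np

probs = np.array([0.45, 0.40, 0.15])   # softmax output for one sample
true_class = 1                          # the true class differs from the argmax

mcp = probs.max()                       # maximum class probability
tcp = probs[true_class]                 # true class probability
tcp_norm = tcp / mcp                    # normalized TCP criterion

# On this misclassified sample, MCP stays relatively high (overconfident)
# while TCP is lower, which is why TCP is the better failure signal.
print(mcp, tcp, round(tcp_norm, 3))
```

Since the true class is unknown at test time, Corbière et al. train an auxiliary confidence network to regress the TCP value, which this sketch omits.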
In another research, Atanov et al. [21] introduced a probabilistic model and showed that the batch normalization (BN) approach can maximize the lower bound of its related marginalized log-likelihood. Since exact inference was computationally inefficient, they proposed the stochastic BN (SBN) approach to approximate the inference procedure, serving as an uncertainty estimation method. Moreover, induced noise is generally employed to capture uncertainty, counter overfitting and slightly improve performance via test-time averaging, whereas ordinary stochastic neural networks typically depend on the expected values of their weights to form predictions. Neklyudov et al. [341] proposed a different kind of stochastic layer, called variance layers. A variance layer is parameterized by its variance, and each of its weights follows a zero-mean distribution, implying that each object is represented by a zero-mean distribution in the space of activations. They demonstrated that these layers provide a robust defense against adversarial attacks and can serve as a crucial exploration tool in reinforcement learning tasks.

4.2 Laplace approximations

Laplace approximations (LAs) are another popular family of UQ methods, used to approximate Bayesian inference [300]. They build a Gaussian distribution around the true posterior using a Taylor expansion around the MAP estimate w*, as follows:

p(\mathbf{w} \mid \mathcal{D}) \approx \mathcal{N}\big(\mathbf{w};\, \mathbf{w}^{*},\, \mathbf{H}^{-1}\big) \qquad (28)

where H indicates the Hessian of the (negative log) likelihood evaluated at the MAP estimate. Ritter et al. [410] introduced a scalable LA (SLA) approach for different NNs. The proposed model was then compared with other well-known methods, such as dropout and a diagonal LA, for the uncertainty estimation of networks.
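A minimal numpy sketch of a (diagonal) Laplace approximation for a one-parameter logistic model, using a grid-search MAP estimate and a finite-difference Hessian (all choices illustrative):

```python
import numpy as np

# Toy binary classification data for a 1-D logistic-regression weight.
x = np.array([-1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0, 0, 1, 1, 1])

def neg_log_post(w):
    """Negative log posterior: logistic NLL plus an N(0, 1) prior on w."""
    logits = w * x
    nll = np.sum(np.log1p(np.exp(-logits)) + (1 - y) * logits)
    return nll + 0.5 * w**2

# Crude MAP estimate by grid search, then Hessian by finite differences.
grid = np.linspace(-5, 5, 2001)
w_map = grid[np.argmin([neg_log_post(w) for w in grid])]
h = 1e-4
hess = (neg_log_post(w_map + h) - 2 * neg_log_post(w_map)
        + neg_log_post(w_map - h)) / h**2
post_var = 1.0 / hess          # Gaussian posterior N(w_map, H^{-1})
print(w_map > 0 and post_var > 0)
```

Practical LAs for networks replace the grid search with the trained weights and the finite difference with (approximations of) the curvature, e.g. a diagonal or Kronecker-factored Hessian.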

5 Uncertainty Quantification in Reinforcement Learning

In the decision-making process, uncertainty plays a key role in decision performance in various fields such as reinforcement learning (RL) [96]. Different UQ methods in RL have been widely investigated in the literature [572]. Lee et al. [261] formulated the model uncertainty problem as a Bayes-adaptive Markov decision process (BAMDP). A general BAMDP is defined by a tuple comprising the underlying MDP's observable state space S, the latent space, the action space A, and the parameterized transition and reward functions T and R, respectively. Given an initial belief, a Bayes filter updates the posterior as follows:

b_{t+1}(\phi) \propto b_{t}(\phi)\, T\big(s_{t+1} \mid s_{t}, a_{t}; \phi\big) \qquad (29)
(a) Training procedure
(b) Network structure
Fig. 16: A general view of BPO which is reproduced based on [261].

Then, the Bayesian policy optimization (BPO) method (see Fig. 16) is applied to POMDPs, with a Bayes filter computing the belief over the hidden state as follows:

b'(s') \propto O(o \mid s', a) \sum_{s} T(s' \mid s, a)\, b(s) \qquad (30)

In another research, O’Donoghue et al. [354] proposed the uncertainty Bellman equation (UBE) to quantify uncertainty. The authors used a Bellman-style equation to propagate the uncertainty (here, the variance) of the Bayesian posterior distribution. Kahn et al. [223] presented a new uncertainty-aware model for a learning algorithm to control a mobile robot. A review of past studies in RL shows that different Bayesian approaches have been used for handling parameter uncertainty [137]; Bayesian RL was comprehensively reviewed by Ghavamzadeh et al. [137] in 2015. Due to page limitations, we do not discuss the application of UQ in RL in depth, but we summarize some of the recent studies here.
Kahn et al. [223] used both bootstrapping and dropout methods to estimate uncertainty in NNs, which was then used in a UA collision prediction model. Besides Bayesian statistical methods, ensemble methods have been used to quantify uncertainty in RL [484]. In this regard, Tschantz et al. [484] applied an ensemble of point-estimate parameters, each trained on a different batch of the dataset, which was maintained and treated as an approximation of the posterior distribution. The ensemble method helped to capture both aleatoric and epistemic uncertainty. There are more UQ techniques used in RL; however, we cannot discuss all of them in detail here due to page restrictions and the breadth of the literature. Table II summarizes different UQ methods used in a variety of RL subjects.

Study Application Goal/Objective UQ method Code
Tegho et al. [468] Dialogue management context Dialogue policy optimisation BBB propagation deep Q-networks (BBQN)
Janz et al. [206] Temporal difference learning Posterior sampling for RL (PSRL) Successor Uncertainties (SU)
Shen and How [438] Discriminating potential threats Stochastic belief space policy Soft-Q learning
Benatan and Pyzer-Knapp [32] Safe RL (SRL) The weights in RNN using mean and variance weights Probabilistic Backpropagation (PBP)
Kalweit and Boedecker [224] Continuous Deep RL (CDRL) Minimizing real-world interaction Model-assisted Bootstrapped Deep Deterministic Policy Gradient (MA-BDDPG)
Riquelme et al. [409] Approximating the posterior sampling Balancing both exploration and exploitation in different complex domains Deep Bayesian Bandits Showdown using Thompson sampling
Huang et al. [194] Model-based RL (MRL) Better decision and improve performance Bootstrapped model-based RL (BMRL)
Eriksson and Dimitrakakis [111] Risk measures and leveraging preferences Risk-Sensitive RL (RSRL) Epistemic Risk Sensitive Policy Gradient (EPPG)
Lötjens et al. [290] SRL UA navigation Ensemble of MC dropout (EMCD) and Bootstrapping
Clements et al. [74] Designing risk-sensitive algorithm Disentangling aleatoric and epistemic uncertainties Combination of distributional RL (DRL) and Approximate Bayesian computation (ABC) methods with NNs
D’Eramo et al. [82] Drive exploration Multi-Armed Bandit (MAB) Bootstrapped deep Q-network with TS (BDQNTS)
TABLE II: Further information of some UQ methods used in RL.

6 Ensemble Techniques

Deep neural networks (DNNs) have been effectively employed in a wide variety of machine learning tasks and have achieved state-of-the-art performance in different domains such as bioinformatics, natural language processing (NLP), speech recognition and computer vision [562, 283]. In supervised learning benchmarks, NNs have yielded competitive accuracies, yet poor predictive uncertainty quantification; hence, they are inclined to generate overconfident predictions. Incorrect overconfident predictions can be harmful, so it is important to handle UQ properly in real-world applications [256]. Since ground-truth uncertainty estimates are generally unavailable, evaluating the quality of predictive uncertainty is challenging. Two evaluation notions, calibration and domain shift, are commonly applied, both inspired by practical applications of NNs. Calibration measures the discrepancy between long-run frequencies and subjective forecasts. The second notion concerns generalization of the predictive uncertainty under domain shift, that is, estimating whether the network knows what it knows. An ensemble of models enhances predictive performance, but it is not evident why and when an ensemble of NNs generates good uncertainty estimates. Bayesian model averaging (BMA) assumes that the true model lies within the hypothesis class of the prior and performs soft model selection to locate the single best model within that class. In contrast, ensembles combine models to discover a more powerful model; ensembles can therefore be expected to do better when the true model does not lie within the hypothesis class.
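Calibration in the above sense can be quantified with, for example, the expected calibration error; a minimal numpy sketch (the binning scheme and toy data are illustrative):

```python
import numpy as np

def ece(confidences, correct, n_bins=10):
    """Expected calibration error: the gap between confidence and accuracy
    (long-run frequency), averaged over confidence bins."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    err = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            err += mask.mean() * gap
    return err

# A perfectly calibrated model: 80% confidence, right 80% of the time.
conf = np.full(10, 0.8)
correct = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
print(round(ece(conf, correct), 3))
```

An overconfident model (high confidence, lower accuracy) yields a correspondingly larger error under the same measure.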
The authors in [204] devised the Maximize Overall Diversity (MOD) model to estimate ensemble-based uncertainty by taking into account the diversity of ensemble predictions across possible future inputs. Gustafsson et al. [153] presented an evaluation approach for measuring uncertainty estimation in order to investigate robustness in the computer vision domain. Researchers in [319] proposed a deep ensemble echo state network model for spatio-temporal forecasting with uncertainty quantification. Chua et al. [72] devised a novel method called probabilistic ensembles with trajectory sampling, which integrates sampling-based uncertainty propagation with a UA deep network dynamics approach. The authors in [562] demonstrated that prevailing calibration error estimators are unreliable in the small-data regime, and hence proposed a kernel-density-based estimator for evaluating calibration performance, proving its consistency and unbiasedness. Liu et al. [281] presented a Bayesian nonparametric ensemble method that augments a model's distribution functions and prediction mechanism using Bayesian nonparametric machinery. Hu et al. [189] proposed a model called margin-based Pareto deep ensemble pruning, utilizing a deep ensemble network, that yielded competitive uncertainty estimates with high prediction interval coverage probability and small prediction interval width. In another study, the researchers in [307] examined the challenges of obtaining uncertainty estimates for structured prediction tasks and presented baselines for sequence-level out-of-domain input detection, sequence-level prediction rejection and token-level error detection using ensembles.
Ensembles involve memory and computational costs that are not acceptable in many applications [308]. There has been noteworthy work on distilling an ensemble into a single model; such approaches achieve accuracy comparable to ensembles while mitigating the computational cost. The model uncertainty is captured in the posterior distribution. Consider an ensemble of models sampled from the posterior as follows [308]:

\big\{ P\big(y \mid x^{*}, \theta^{(m)}\big) \big\}_{m=1}^{M}, \qquad \theta^{(m)} \sim p(\theta \mid \mathcal{D}) \qquad (31)

where x* is a test input and \theta^{(m)} represents the parameters of a categorical distribution P(y | x*, \theta^{(m)}). Taking the expectation with respect to the model posterior yields the predictive posterior, or expected predictive distribution, for a test input:

P(y \mid x^{*}, \mathcal{D}) = \mathbb{E}_{p(\theta \mid \mathcal{D})}\big[P(y \mid x^{*}, \theta)\big] \approx \frac{1}{M} \sum_{m=1}^{M} P\big(y \mid x^{*}, \theta^{(m)}\big) \qquad (32)

Each of the models yields a different estimate of data uncertainty. The 'disagreement', or spread, of an ensemble sampled from the posterior arises from uncertainty in the predictions caused by model uncertainty. Given an ensemble that yields the expected set of behaviors, the entropy of the expected distribution can be used as an estimate of total uncertainty in the prediction, while measures of spread or 'disagreement' of the ensemble, such as mutual information (MI), can be used to assess uncertainty in predictions due to knowledge uncertainty as follows:

\mathcal{I}\big[y, \theta \mid x^{*}, \mathcal{D}\big] = \mathcal{H}\big[P(y \mid x^{*}, \mathcal{D})\big] - \mathbb{E}_{p(\theta \mid \mathcal{D})}\big[\mathcal{H}\big[P(y \mid x^{*}, \theta)\big]\big] \qquad (33)

The total uncertainty can be decomposed into expected data uncertainty and knowledge uncertainty via the MI formulation. If the model is uncertain, both in out-of-domain regions and in regions of severe class overlap, the entropy of the predictive posterior (total uncertainty) is high. If the models disagree, the difference between the entropy of the predictive posterior and the expected entropy of the individual models will be non-zero. For example, in regions of class overlap, MI will be low, the expected and predictive posterior entropies will be similar, and each member of the ensemble will yield a high-entropy distribution; in this scenario, data uncertainty dominates total uncertainty. For out-of-domain inputs, on the other hand, the predictive posterior will be near uniform while the expected entropy of each model may be low, as the individual models yield diverse distributions over classes; in this region of input space, knowledge uncertainty is high because the model's understanding of the data is low. In ensemble distribution distillation, the aim is to capture not only the diversity of the ensemble but also its mean. An ensemble can be viewed as a set of samples from an implicit distribution of output distributions:

\big\{ P\big(y \mid x^{*}, \theta^{(m)}\big) \big\}_{m=1}^{M} \sim p(\pi \mid x^{*}, \mathcal{D}) \qquad (34)
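The entropy-based decomposition in (33) can be sketched with numpy (the toy ensemble outputs below are illustrative):

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p), axis=-1)

def decompose(ensemble_probs):
    """Split total uncertainty into expected data uncertainty and
    knowledge uncertainty (mutual information)."""
    mean_probs = ensemble_probs.mean(axis=0)
    total = entropy(mean_probs)                    # entropy of predictive posterior
    data = entropy(ensemble_probs).mean()          # expected entropy of members
    return total, data, total - data               # MI = knowledge uncertainty

# Class overlap: members agree on a flat distribution -> MI near zero.
overlap = np.array([[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]])
# Out-of-domain: members confident but disagreeing -> MI is large.
ood = np.array([[0.99, 0.01], [0.01, 0.99], [0.99, 0.01]])

print(decompose(overlap)[2], decompose(ood)[2])
```

The two toy cases reproduce the behavior described above: near-zero MI under class overlap, and large MI when confident members disagree.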

Prior Networks, a new class of models, were proposed to explicitly parameterize a conditional distribution over output distributions using a single neural network parameterized by a point estimate of the model parameters. A Prior Network can effectively emulate an ensemble and hence yield the same measures of uncertainty. By parameterizing a Dirichlet distribution, the Prior Network represents a distribution over categorical output distributions. Ensembling performance is often measured by uncertainty estimation, and deep learning ensembles produce benchmark results in this respect. The authors in [20] explored in-domain uncertainty, examined standards for its quantification, and revealed pitfalls of prevailing metrics. They presented the deep ensemble equivalent (DEE) score and demonstrated how an ensemble of only a few trained networks can be equivalent to many sophisticated ensembling methods with respect to test performance. They also proposed test-time augmentation (TTA) to improve the performance of different ensemble learning techniques (see Fig. 17).

Fig. 17: A schematic view of TTA for ensembling techniques which is reproduced based on [20].

Deep ensembles [385] are a simple approach that provides independent samples from various modes of the loss landscape. Under a fixed test-time compute budget, deep ensembles can be regarded as a powerful baseline for the performance of other ensembling methods. Comparing the performance of ensembling methods is challenging: different models achieve different metric values on different datasets, and metric values lack interpretability because performance gains are compared against dataset- and model-specific baselines. Hence, Ashukha et al. [20] proposed DEE with the aim of providing an interpretable perspective that uses deep ensembles to measure the performance of other ensembling methods. The DEE score tries to answer the question: what size of deep ensemble demonstrates the same performance as a particular ensembling technique? The DEE score is based on the calibrated log-likelihood (CLL). DEE is defined for an ensembling technique m, with lower and upper bounds, as follows [20]:

\mathrm{DEE}_{m}(k) = \min\big\{ l \in \mathbb{N} \,:\, \mathrm{CLL}_{\mathrm{mean}}^{\mathrm{DE}}(l) \geq \mathrm{CLL}_{\mathrm{mean}}^{m}(k) \big\} \qquad (35)
\mathrm{DEE}_{m}^{\mathrm{lower/upper}}(k) = \min\big\{ l \in \mathbb{N} \,:\, \mathrm{CLL}_{\mathrm{mean}}^{\mathrm{DE}}(l) \mp \mathrm{CLL}_{\mathrm{std}}^{\mathrm{DE}}(l) \geq \mathrm{CLL}_{\mathrm{mean}}^{m}(k) \big\} \qquad (36)

where CLL_mean^m(k) and CLL_std^m(k) denote the mean and standard deviation of the calibrated log-likelihood achieved by an ensembling technique m with k samples. They measured these quantities for natural numbers k and applied linear interpolation to define them for real-valued k, reporting the DEE score for different numbers of samples and different methods, together with its upper and lower bounds.

The Bayesian nonparametric ensemble (BNE) model devised by Liu et al. [281] accounts for different sources of model uncertainty by incorporating an ensemble technique. Bayesian nonparametric machinery is utilized to augment a model's distribution functions and predictions. The BNE measures uncertainty patterns in the data distribution and decomposes the uncertainty into discrete components due to noise and error. The model yields precise uncertainty estimates under observational noise, and its utility was demonstrated for bias detection and uncertainty decomposition in ensemble prediction. The predictive mean of the BNE can be expressed as follows [281]:

(37)

The predictive mean for the full BNE comprises three components:

  1. the predictive mean of the original ensemble;

  2. BNE's direct correction to the prediction function; and

  3. BNE's indirect correction to the prediction, derived from relaxing the Gaussian assumption in the model's cumulative distribution function. In addition, two error-correction terms are also present.

BNE's predictive uncertainty estimate is based on the predictive cumulative distribution function of the original ensemble. The BNE's predictive interval is presented as [281]:

(38)

Comparing the above equation to the predictive interval of the original ensemble, it can be observed that the residual process adjusts the locations of the BNE predictive interval endpoints, while the relaxation of the Gaussian assumption calibrates the spread of the predictive interval.
As an important part of ensemble techniques, loss functions play a significant role in the performance of ensemble methods; choosing an appropriate loss function can dramatically improve results. Due to page limitations, we summarize the most important loss functions applied for UQ in Table III.

Study Dataset type Base classifier(s) Method’s name Loss equation Code
TV et al. [488] Sensor data Neural Networks (LSTM) Ordinal Regression (OR)
Sinha et al. [444] Image Neural Networks Diverse Information Bottleneck in Ensembles (DIBS)
Zhang et al. [562] Image Neural Networks Mix-n-Match Calibration (the standard square loss)
Lakshminarayanan et al. [256] Image Neural Networks Deep Ensembles
Jain et al. [204] Image and Protein DNA binding Deep Ensembles Maximize Overall Diversity (MOD)
Gustafsson et al. [153] Video Neural Networks Scalable BDL Regression: , Classification:
Chua et al. [72] Robotics (video) Neural Networks Probabilistic ensembles with trajectory sampling (PETS)
Hu et al. [189] Image and tabular data Neural Networks margin-based Pareto deep ensemble pruning (MBPEP)
Malinin et al. [308] Image Neural Networks Ensemble Distribution Distillation ()
Ashukha et al. [20] Image Neural Networks Deep ensemble equivalent score (DEE)
Pearce et al. [374] Tabular data Neural Networks Quality-Driven Ensembles (QD-Ens)
Ambrogioni et al. [9] Tabular data Bayesian logistic regression Wasserstein variational gradient descent (WVG)
Hu et al. [191] Image Neural Networks Bias-variance decomposition
TABLE III: Main loss functions used by ensemble techniques for UQ.

6.1 Deep Ensemble

Deep ensembles are another powerful method used to measure uncertainty and have been extensively applied in many real-world applications [189]. To achieve good learning results, the data distributions of the test sets should be as close as possible to those of the training sets. In many situations, however, the distributions of test sets are unknown, especially in uncertainty prediction problems, so it is difficult for traditional learning models to yield competitive performance. Some researchers have applied MCMC and BNNs, which rely on a prior distribution over the data, to tackle uncertainty prediction problems [204]; when these approaches are employed in large networks, they become computationally expensive. Model ensembling is an effective technique for enhancing the predictive performance of supervised learners. Deep ensembles produce better predictions on test data and also provide model uncertainty estimates when learners are presented with OoD data. The success of ensembles depends on the variance reduction obtained by combining predictions that are individually prone to several types of errors. Hence, the improvement in predictions is realized by utilizing a large ensemble with numerous base models, and such ensembles also generate distributional estimates of model uncertainty. A deep ensemble echo state network (D-EESN) model, with two versions for spatio-temporal forecasting and associated uncertainty measurement, was presented in [319]. The first framework applies a bootstrap ensemble approach, while the second is devised within a hierarchical Bayesian framework; the general hierarchical Bayesian approach accommodates multiple levels of uncertainty and non-Gaussian data types. The authors in [319] broadened some of the deep ESN components presented by Antonelo et al. [15] and Ma et al. [299] to fit a spatio-temporal ensemble approach within the D-EESN model.
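The variance-reduction-and-disagreement idea above can be sketched with a toy bootstrapped ensemble (polynomial base models stand in for deep networks; all choices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy regression data on [-1, 1].
x = np.linspace(-1, 1, 30)
y = np.sin(3 * x) + 0.1 * rng.normal(size=x.size)

# Train M base models on bootstrap resamples of the data.
models = []
for _ in range(5):
    idx = rng.integers(0, x.size, x.size)            # bootstrap resample
    models.append(np.polyfit(x[idx], y[idx], deg=3))

def ensemble_predict(x_test):
    preds = np.stack([np.polyval(c, x_test) for c in models])
    return preds.mean(axis=0), preds.std(axis=0)     # mean and disagreement

mean_in, std_in = ensemble_predict(np.array([0.0]))
mean_out, std_out = ensemble_predict(np.array([3.0]))
# Disagreement (epistemic uncertainty) grows outside the training range.
print(std_in[0] < std_out[0])
```

The member spread serves as the distributional estimate of model uncertainty: small where the models agree (in-distribution), large where they extrapolate.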
As in the previous section, we summarize a few loss functions used by deep ensemble techniques in Table IV.

Study Dataset type Base classifier(s) Method’s name Loss equation Code
Fan et al. [113] GPS-log Neural Networks Online Deep Ensemble Learning (ODEL)
Yang et al. [538] Smart grid K-means Least absolute shrinkage and selection operator (LASSO)
van Amersfoort et al. [492] Image Neural Networks Deterministic UQ (DUQ)
TABLE IV: Main loss functions used by deep ensemble techniques for UQ.

6.2 Deep Ensemble Bayesian

The expressive power of various ensemble techniques has been extensively shown in the literature. However, traditional learning techniques suffer from several drawbacks and limitations, as listed in [117]. To overcome these limitations, Fersini et al. [117] utilized an ensemble learning approach to mitigate the noise sensitivity related to language ambiguity and to obtain more accurate polarity predictions. The proposed ensemble method employed Bayesian model averaging, where both the reliability and the uncertainty of each single model were considered. The study in [373] presented an alteration to prevailing approximate Bayesian inference by regularizing parameters about values drawn from a distribution that can be set equal to the prior. The analysis suggested that the recovered posterior was centered correctly but tended to have overestimated correlations and underestimated marginal variances. One of the most promising frameworks for obtaining uncertainty estimates is deep BAL (DBAL) with MC dropout. Pop et al. [385] argued that in variational inference methods, the mode collapse phenomenon is responsible for the overconfident predictions of DBAL methods; they devised deep ensemble BAL, which addresses the mode collapse issue and improves the MC dropout method. In another study, Pop et al. [386] proposed a novel AL technique, especially for DNNs, in which the statistical properties and expressive power of model ensembles are employed to enhance the state-of-the-art deep BAL techniques that suffer from the mode collapse problem. In another research, Pearce et al. [371] proposed a new approximately Bayesian ensembling approach for NNs, which regularizes the parameters about values drawn from a distribution.

6.3 Uncertainty Quantification in Traditional Machine Learning domain using Ensemble Techniques

It is worth noting that UQ in traditional machine learning algorithms has been extensively studied using different ensemble techniques and a few other UQ methods (see, e.g., [489]) in the literature. However, due to the page limit, we only summarize some of the ensemble techniques (as UQ methods) used in the traditional machine learning domain. For example, Tzelepis et al. [489] proposed a maximum margin classifier to deal with uncertainty in the input data. The proposed model is applied to classification using the SVM (Support Vector Machine) algorithm with multi-dimensional Gaussian distributions. The proposed model, named SVM-GSU (SVM with Gaussian Sample Uncertainty), is illustrated in Fig. 18:

Fig. 18: A schematic view of SVM-GSU, which is reproduced based on [489].

In another study, Pereira et al. [375] examined various techniques for transforming classifiers into uncertainty methods, in which predictions are accompanied by probability estimates reflecting their uncertainty. They applied various uncertainty methods: Venn-ABERS predictors, Conformal Predictors, Platt Scaling and Isotonic Regression. Partalas et al. [365] presented a novel measure, called Uncertainty Weighted Accuracy (UWA), for ensemble pruning through directed hill climbing that accounts for the uncertainty of the current ensemble decision. The experimental results demonstrated that pruning a heterogeneous ensemble with the new measure significantly enhanced the accuracy compared with baseline methods and other state-of-the-art measures. Peterson et al. [377] studied the different types of errors that might creep into atomistic machine learning and addressed how uncertainty analysis validates machine-learning predictions. They applied a bootstrap ensemble of neural-network-based calculators and showed that the width of the ensemble can provide an approximation of the uncertainty.
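The bootstrap-ensemble idea above can be sketched with a toy regressor; a least-squares line stands in for the neural-network calculators of Peterson et al., so the model and data below are purely illustrative of the mechanism (resample, refit, measure the spread of predictions), not of their implementation.

```python
import random
import statistics

def bootstrap_ensemble_width(xs, ys, x_query, n_models=50, seed=0):
    """Bootstrap ensemble (sketch): refit a least-squares line on data
    resampled with replacement; the standard deviation of the ensemble's
    predictions at x_query serves as an uncertainty proxy."""
    rng = random.Random(seed)
    n = len(xs)
    preds = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        mx, my = statistics.fmean(bx), statistics.fmean(by)
        denom = sum((x - mx) ** 2 for x in bx)
        # guard against degenerate resamples with no x-spread
        slope = 0.0 if denom < 1e-12 else sum(
            (x - mx) * (y - my) for x, y in zip(bx, by)) / denom
        preds.append(my + slope * (x_query - mx))
    return statistics.fmean(preds), statistics.pstdev(preds)
```

As expected of such an ensemble, the width grows when querying far outside the training range, mirroring how ensemble spread flags extrapolation.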

Fig. 19: A single block of the architecture search space, which is reproduced based on [17].

7 Further Studies of UQ Methods

In this section, we cover other methods used to estimate uncertainty. We present a summary of the proposed methods rather than their theoretical details. Due to the page limit and the large number of references, we are not able to review all the methods in detail; we therefore recommend that readers consult the cited references for further details on each method.
OoD inputs are a common source of error in machine and deep learning systems, arising when the test data follow a distribution different from that of the training data. To address this issue, Ardywibowo et al. [17] introduced a new uncertainty-aware (UA) architecture search method called NADS. The proposed NADS finds an appropriate distribution over architectures that perform well on a specified task. A single block of the architecture search space is presented in Fig. 19.
Unlike previous architecture design methods, NADS makes it possible to identify blocks that are common to all of the UA architectures. On the other hand, the cost functions for uncertainty-oriented neural networks (NNs) do not always converge; moreover, a converged NN does not always generate optimized prediction intervals (PIs). With such cost functions, the convergence of training is uncertain and the PIs are not customizable. To construct optimal PIs, Kabir et al. [222] presented a smooth, customizable cost function for training NNs. The PI coverage probability (PICP), PI-failure distances and the optimized average width of the PIs were computed to lessen the variation in PI quality, enhance the convergence probability and speed up training. They tested their method on electricity demand and wind power generation data. In non-Bayesian deep neural classification, uncertainty estimation methods introduce biased estimates for instances whose predictions are highly accurate. Geifman et al. [135] argued that this limitation occurs because of the dynamics of training with SGD-like optimizers, which exhibit characteristics similar to overfitting. They proposed an uncertainty estimation method that computes the uncertainty of highly confident points by utilizing snapshots of the trained model taken before their approximations are jittered; the proposed algorithm outperformed all well-known techniques. In another study, Tagasovska et al. [462] proposed single-model estimates of epistemic and aleatoric uncertainty for DNNs. They suggested a loss function, called Simultaneous Quantile Regression (SQR), to discover the conditional quantiles of a target variable in order to assess aleatoric uncertainty; well-calibrated prediction intervals can be derived from these quantiles. To estimate epistemic uncertainty, they devised Orthonormal Certificates (OCs), a collection of non-constant functions that map the training samples to zero; OoD examples are mapped by these certificates to non-zero values.
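The quantile-regression core of SQR is the pinball loss. The sketch below shows how minimizing it recovers a target quantile; the grid-search minimizer is an illustrative stand-in for gradient-based training of a network, not the authors' implementation.

```python
def pinball_loss(y_true, y_pred, tau):
    """Pinball (quantile) loss: minimising it over y_pred recovers the
    tau-quantile of the target distribution (the building block of SQR)."""
    diff = y_true - y_pred
    return max(tau * diff, (tau - 1.0) * diff)

def empirical_quantile(samples, tau, grid):
    """The grid value minimising the average pinball loss approximates
    the tau-quantile of the samples."""
    return min(grid, key=lambda q: sum(pinball_loss(y, q, tau) for y in samples))
```

A prediction interval then follows by fitting two quantiles, e.g. the 0.05 and 0.95 levels, and taking their gap as an aleatoric-uncertainty estimate.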
van Amersfoort et al. [492, 493] presented a method for detecting and rejecting out-of-distribution data points, based on training a deterministic deep model with a single forward pass at test time. They exploited ideas from RBF networks to devise deterministic UQ (DUQ), which is presented in Fig. 20. They made training scalable with a centroid updating scheme and a new loss function. Their method can detect out-of-distribution data consistently by utilizing a gradient penalty to track changes in the input; it improves upon deep ensembles and scales well to large datasets.

Fig. 20: A general view of the DUQ architecture which is reproduced based on [492, 493].
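The centroid-based scoring at the heart of DUQ can be sketched as follows; the embedding, centroids and length scale below are illustrative placeholders rather than the trained components of the cited model, which learns both the feature extractor and the per-class centroids.

```python
import math

def duq_predict(embedding, centroids, length_scale=0.5):
    """DUQ-style scoring (sketch): an RBF kernel measures the closeness
    of an embedding to one centroid per class; the highest kernel value
    is both the predicted class and a certainty score, so inputs far
    from every centroid score near zero (flagging them as OoD)."""
    scores = {
        label: math.exp(-sum((e - c) ** 2 for e, c in zip(embedding, centre))
                        / (2.0 * length_scale ** 2))
        for label, centre in centroids.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]
```

A single forward pass suffices, which is what distinguishes this deterministic approach from sampling-based methods such as MC dropout.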

Tagasovska et al. [461] demonstrated frequentist estimates of epistemic and aleatoric uncertainty for DNNs. They proposed a loss function, simultaneous quantile regression, to estimate all the conditional quantiles of a given target variable for aleatoric uncertainty; well-calibrated prediction intervals can be measured using these quantiles. For the estimation of epistemic uncertainty, they proposed a collection of non-trivial, diverse functions that map all training samples to zero, dubbed training certificates; the certificates signal high epistemic uncertainty by mapping OoD examples to non-zero values. By using Bayesian deep networks, it is possible to know what the DNNs do not know in domains where safety is a major concern; a flawed decision may lead to severe penalties in domains such as autonomous driving, security and medical diagnosis. Traditional approaches, however, are incapable of scaling to large, complex neural networks. Mobiny et al. [330] proposed an approach that imposes a Bernoulli distribution on the model weights to approximate Bayesian inference for DNNs. Their framework, dubbed MC-DropConnect, captured model uncertainty with only a small alteration to the model structure or computational cost. They validated their technique on various datasets and architectures for semantic segmentation and classification tasks, and also introduced novel uncertainty quantification metrics. Their experimental results showed considerable enhancements in uncertainty estimation and prediction accuracy compared with prior approaches.
Uncertainty measures are crucial tools in the machine learning domain: they make it possible to evaluate the similarity and dependence between two feature subsets and can be utilized to verify the importance of features in clustering and classification algorithms. In classical rough set theory there are a few uncertainty measures for a feature subset, including rough entropy, information entropy, roughness and accuracy; these measures are suited to discrete-valued information systems and are not appropriate for real-valued datasets. Chen et al. [65] proposed the neighborhood rough set model, in which each object is related to a neighborhood subset, dubbed a neighborhood granule. Different uncertainty measures of neighborhood granules were introduced, namely information granularity, neighborhood entropy, information quantity and neighborhood accuracy. Further, they confirmed that these uncertainty measures satisfy monotonicity, invariance and non-negativity. Their theoretical analysis and experimental results demonstrated that, in neighborhood systems, information granularity, neighborhood entropy and information quantity perform better than the neighborhood accuracy measure. On the other hand, reliable and accurate machine learning systems depend on techniques for reasoning under uncertainty. Bayesian methods provide a framework for UQ, but Bayesian uncertainty estimates are often imprecise because of approximate inference and model misspecification. Kuleshov et al. [249] devised a simple method for calibrating any regression algorithm; when applied to probabilistic and Bayesian models, it is guaranteed to provide calibrated uncertainty estimates given enough data. They assessed their technique on feedforward and recurrent neural networks as well as Bayesian linear regression, and found that it produced well-calibrated credible intervals while enhancing performance on model-based RL and time-series forecasting tasks.
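The recalibration idea of Kuleshov et al. can be sketched with an empirical step-function calibration map standing in for the isotonic regression used in their work; the held-out CDF values below are illustrative.

```python
def recalibration_map(predicted_cdf_values):
    """Calibrated regression (sketch): predicted_cdf_values[i] is the
    model's CDF evaluated at the i-th held-out observation.  The
    returned map R sends a nominal confidence level p to the empirical
    frequency with which outcomes fell below the model's p-quantile;
    for a well-calibrated model, R(p) is approximately p."""
    vals = sorted(predicted_cdf_values)
    n = len(vals)

    def R(p):
        # empirical fraction of held-out points whose CDF value is <= p
        return sum(1 for v in vals if v <= p) / n

    return R
```

At prediction time, a requested credible level p is first passed through R (or its inverse) so that the reported intervals achieve their nominal coverage.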
Gradient-based optimization techniques have shown their efficacy in learning overparameterized and complex neural networks from non-convex objectives. Nevertheless, the precise theoretical relationship between gradient-based optimization methods, the training dynamics they induce, and generalization in DNNs is still unclear. Rudner et al. [416] examined the training dynamics of overparameterized neural networks under natural gradient descent. They demonstrated that the discrepancy between the functions obtained from non-linearized and linearized natural gradient descent is smaller than for standard gradient descent. They showed empirically that there is no need to formulate a limit argument about the width of the neural network layers, as the discrepancy is small for overparameterized neural networks. Finally, they demonstrated that the discrepancy was small on a set of regression benchmark problems, and that their theoretical results were consistent with the empirical discrepancy between the functions obtained from non-linearized and linearized natural gradient descent. Patro et al. [368] devised gradient-based certainty estimates with visual attention maps, addressing the visual question answering (VQA) task. The gradients for the estimates were enhanced by incorporating probabilistic deep learning techniques. There are two key advantages: 1. the certainty estimates correlate better with misclassified samples, and 2. state-of-the-art results are obtained by improving attention maps so that they correlate with human attention regions. The enhanced attention maps consistently improved different techniques for VQA. The presented method thus yields improved certainty estimates and better explanations of deep learning techniques. They provided empirical results on all VQA benchmarks and compared them with standard techniques.
BNNs have been used as a solution for neural network predictions, but specifying their prior is still an open challenge. An independent normal prior in weight space imposes weak constraints on the function posterior, permitting the network to generalize in unanticipated ways on inputs outside the training distribution. Hafner et al. [156] presented noise contrastive priors (NCPs) to estimate consistent uncertainty. The main idea was to train the model to output high uncertainty for data points outside the training distribution. The NCPs rely on an input prior, which adds noise to the inputs of the current mini-batch, and an output prior, which is a wide distribution given these inputs. The NCPs restricted overfitting outside the training distribution and produced useful uncertainty estimates for AL. BNNs with latent variables are flexible and scalable probabilistic models: they can capture complex noise patterns in the data through latent variables, while the network weights account for model uncertainty. Depeweg et al. [90] derived a decomposition of uncertainty into its aleatoric and epistemic components for decision support systems. This enabled them to detect informative points for AL of functions with bimodal and heteroscedastic noise. They further described a new risk-sensitive criterion to identify policies for RL that balance expected cost, model bias and noise aversion by applying the decomposition.
Uncertainty modelling in DNNs remains an open problem despite advances in the area. BNNs, in which the prior over the network weights is a design choice, are a powerful solution; frequently a normal prior, or another distribution that promotes sparsity, is used. Such priors are agnostic to the generative process of the input data, which may lead to unwarranted generalization on out-of-distribution test data. Rohekar et al. [413] suggested a confounder for the relation between the input data and the discriminative function given the target label, and proposed modelling the confounder by sharing neural connectivity patterns between the discriminative and generative networks. Hence, a novel deep architecture was framed in which networks are coupled into a compact hierarchy and sampled from the posterior of local causal structures (see Fig. 21).

Fig. 21: A causal view demonstrating the main assumptions taken by Rohekar et al. [413] (this figure is reproduced based on the reference).

They showed that networks could be sampled from the hierarchy efficiently, in proportion to their posterior, and that different types of uncertainty could be estimated. Learning unbiased models on imbalanced datasets is a challenging job: the concentrated representation of rare classes in the classification space hinders the generalization of learned boundaries to novel test examples. Khan et al. [235] showed that the difficulty level of individual samples and the rarity of classes are directly correlated with Bayesian uncertainty estimates. They presented a new approach to uncertainty-based class imbalance learning that exploits two insights: 1. in rare (uncertain) classes, the classification boundaries should be broadened to avoid overfitting and to improve generalization; 2. each sample's uncertainty is modelled by a multivariate Gaussian distribution with a mean vector and a covariance matrix, and the learned boundaries should take account of individual samples and their distribution in the feature space. Class and sample uncertainty information was used to obtain generalizable classification techniques and robust features. They formulated a loss function for max-margin learning based on a Bayesian uncertainty measure. Their technique exhibited key performance enhancements on six benchmark databases for skin lesion detection, digit/object classification, attribute prediction and face verification.
Neural networks do not measure uncertainty meaningfully, as they tend to be overconfident on incorrectly labelled, noisy or unseen data. Variational approximations such as Multiplicative Normalising Flows or BBB are utilized in BDL to overcome this limitation; however, current methods have shortcomings regarding scalability and flexibility. Pawlowski et al. [370] proposed a novel variational approximation technique, termed Bayes by Hypernet (BbH), that interprets hypernetworks as implicit distributions. It scales naturally to deep learning architectures and utilizes neural networks to model arbitrarily complex distributions. Their method was robust against adversarial attacks and yielded competitive accuracies. On the other hand, recent deep learning models have achieved significant increases in prediction accuracy, but at an increased cost of rendering predictions. Wang et al. [513] speculated that recently created deep learning models tend to "over-think" on many simple real-world inputs. They proposed the "I Don't Know" (IDK) prediction cascades approach, which systematically builds on a set of pre-trained models to speed up inference without a loss in prediction accuracy. They introduced two search-based techniques for producing a new cost-aware objective as well as the cascades. Their IDK cascade approach can be adopted in a model without further retraining. They tested its efficacy on a variety of benchmarks.
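A minimal sketch of an IDK-style cascade follows, assuming each model exposes a (label, confidence) interface; this simplifies away the cost-aware search of the cited work and only illustrates the fall-through mechanism.

```python
def idk_cascade(models, x, thresholds):
    """'I Don't Know' prediction cascade (sketch): cheap models answer
    first; whenever a stage's confidence falls below its threshold, the
    input is deferred to the next, more expensive model.  The last
    model always answers.  Each model maps x to (label, confidence)."""
    for model, threshold in zip(models[:-1], thresholds):
        label, confidence = model(x)
        if confidence >= threshold:
            return label            # confident enough: stop early
    return models[-1](x)[0]         # fall through to the final model
```

Because each stage only defers the inputs it is uncertain about, most examples are served by the cheap models and average inference cost drops without retraining any of them.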
Yang et al. [539] proposed a deep learning approach for propagating and quantifying uncertainty in models governed by non-linear differential equations, inspired by physics-informed neural networks. Probabilistic representations of the system states are produced by latent variable models, while their predictions are constrained to satisfy physical laws described by partial differential equations. They also put forward an adversarial inference method for training these models on data. Such physics-informed constraints provide a regularization approach for efficiently training deep generative models as surrogates of physical models, for which training datasets are usually small and data acquisition is costly. The framework characterizes the outputs of physical systems subject to noise in the observations or randomness in the inputs, bypassing the need for sampling costly experiments or numerical simulators. They proved the efficacy of their method via a series of examples demonstrating uncertainty propagation in non-linear conservation laws and the detection of constitutive laws. For autonomous driving, 3D scene flow estimation techniques recover the 3D motion and 3D geometry of a scene. Brickwedde et al. [50] devised a new monocular 3D scene flow estimation technique, dubbed Mono-SF, that assesses both the motion and the 3D structure of the scene by integrating single-view depth information with multi-view geometry. A CNN, termed ProbDepthNet, was devised for combining single-view depth in a statistical manner, and a new recalibration technique was presented within ProbDepthNet to guarantee well-calibrated distributions for regression problems. Both the ProbDepthNet design and the Mono-SF method proved their efficacy in comparison with state-of-the-art approaches.
Mixup is a DNN training technique in which additional samples are produced during training by convexly combining random pairs of images and their labels. The method has demonstrated its effectiveness in improving image classification performance. Thulasidasan et al. [474] investigated the predictive uncertainty and calibration of models trained with mixup, and revealed that DNNs trained with mixup are notably better calibrated than those trained in the regular fashion. They tested their technique on large datasets and observed that mixup-trained networks are less prone to overconfident predictions on random-noise and OoD data. Label smoothing in mixup-trained DNNs played a crucial role in enhancing calibration; they concluded that training with hard labels causes the overconfidence observed in neural networks. The transparency, fairness and reliability of machine learning methods can be improved by explaining black-box models; however, the explanations of such models exhibit considerable uncertainty, which raises concerns about model robustness and users' trust. Zhang et al. [568] illustrated the incidence of three sources of uncertainty, viz. randomness in the sampling procedure, variation with sampling proximity, and variation in explained model credibility across different data points, by concentrating on a particular local explanation technique called Local Interpretable Model-Agnostic Explanations (LIME). Even black-box models with high accuracy yielded uncertain explanations. They tested the uncertainty of the LIME technique on two publicly available datasets and on synthetic data.
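The mixup operation described above is simple to state; the sketch below mixes one pair of examples with a Beta-distributed coefficient, where the choice of alpha and the one-hot label encoding are illustrative defaults rather than values from the cited study.

```python
import random

def mixup_pair(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Mixup (sketch): form a virtual training example as a convex
    combination of two inputs and their one-hot labels, with the mixing
    coefficient lam drawn from Beta(alpha, alpha)."""
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam
```

Because the mixed labels are soft (they sum to one but are rarely one-hot), training on them acts like label smoothing, which is the effect the calibration analysis above attributes much of mixup's benefit to.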
The presence of even small adversarial perturbations severely restricts the deployment of DNNs in safety-critical environments. Sheikholeslami et al. [435] devised a randomized approach for identifying such perturbations that relies on minimum-uncertainty metrics obtained by sampling at the hidden layers during DNN inference; adversarially corrupted inputs are identified via the sampling probabilities. The new adversary detector can exploit any pre-trained DNN with no additional training. The output uncertainty of the DNN can be quantified from a BNN perspective by choosing the units to sample per hidden layer, where the layer-wise components represent the overall uncertainty. Low-complexity approximate solvers were obtained by simplifying the objective function; these approximations connect state-of-the-art randomized adversarial detectors with the new approach, in addition to delivering meaningful insights. Moreover, a consistency loss between predictions under random perturbations is the basis of one of the effective strategies in semi-supervised learning. For a student model to succeed, the teacher's pseudo-labels must be of good quality, otherwise the learning process suffers; however, prevailing models do not evaluate the quality of the teacher's pseudo-labels. Li et al. [274] presented a new certainty-driven consistency loss (CCL) that employs predictive uncertainty information in the consistency loss so that the student learns dynamically from reliable targets. They devised two strategies, Temperature CCL and Filtering CCL, to either pay less attention to uncertain predictions or filter them out in the consistency regularization, and combined the two into FT-CCL to further enhance the consistency-learning approach. FT-CCL demonstrated robustness to noisy labels and improvements on a semi-supervised learning task. They also presented a new mutual learning technique in which each student is detached from its own teacher and gains additional knowledge from another student's teacher.
Englesson et al. [110] introduced a modified knowledge distillation method to achieve computationally efficient uncertainty estimates with deep networks, aiming to yield competitive uncertainty estimates for both in- and out-of-distribution samples. Their major contributions were as follows: 1. adapting and demonstrating distillation's regularization effect, 2. presenting a new target teacher distribution, 3. enhancing OoD uncertainty estimates via a simple augmentation method, and 4. executing a wide-ranging set of experiments to shed light on the distillation method. On the other hand, Bayesian inference provides well-calibrated uncertainty and accurate full predictive distributions, but the high dimensionality of the parameter space limits the scaling of Bayesian inference methods to DNNs. Izmailov et al. [202] designed low-dimensional subspaces of the parameter space comprising diverse sets of high-performing models. They applied variational inference and elliptical slice sampling in these subspaces; their method yielded well-calibrated predictive uncertainty and accurate predictions for both image classification and regression by exploiting Bayesian model averaging over the induced posterior in the subspaces.
Csáji et al. [78] introduced a data-driven strategy for uncertainty quantification of kernel-based models. The method requires only a few mild regularity assumptions about the noise rather than distributional assumptions such as exponential families or GPs. The uncertainty about the model can be estimated by perturbing the residuals in the gradient of the objective function. They devised an algorithm that builds exact, non-asymptotically guaranteed, distribution-free confidence regions for the noise-free, ideal representation of the function they estimate. For symmetric noises and standard convex quadratic problems, the regions are star convex centred on a given small-sample estimate, and ellipsoidal outer approximations can also be computed efficiently. On the other hand, uncertainty estimates can also be improved through pre-training. Hendrycks et al. [173] demonstrated that pre-training enhances the uncertainty estimates and model robustness even when it does not improve the classification metrics. They showed the key gains from pre-training by performing empirical experiments on confidence calibration, OoD detection, class imbalance, label corruption and adversarial examples. Their adversarial pre-training method demonstrated an approximately 10% improvement over existing methods in adversarial robustness. Pre-training without task-specific techniques surpassed the state-of-the-art, highlighting the importance of pre-training when examining future techniques on uncertainty and robustness.
High-risk domains require trustworthy confidence estimates from predictive models. Deep latent variable models suffer from the rigid variational distributions utilized for tractable inference, which err on the side of overconfidence. Veeling et al. [499] devised Stochastic Quantized Activation Distributions (SQUAD), which implement a tractable yet flexible distribution over discretized latent variables. The presented technique is sample efficient, self-normalizing and scalable. Their method yielded predictive uncertainty of high quality, learnt interesting non-linearities, and made full use of the flexible distribution. Multi-task learning (MTL) is another domain in which the impact of uncertainty methods can be considered. For example, MTL has demonstrated its efficacy for MR-only radiotherapy planning, as it can jointly automate the contouring of organs-at-risk (a segmentation task) and the simulation of a synthetic CT (synCT) scan (a regression task) from MRI scans. Bragman et al. [49] suggested utilizing a probabilistic deep-learning technique to estimate the parameter and intrinsic uncertainty. Parameter uncertainty was estimated through approximate Bayesian inference, whilst intrinsic uncertainty was modelled using a heteroscedastic noise technique. This yielded an approach for measuring uncertainty over the predictions of both tasks and for data-driven, voxel-wise adaptation of the task losses. They demonstrated competitive performance in the segmentation and regression of prostate cancer scans. More information can be found in Tables V and VI.

UQ category Studies
Bayesian Balan et al. [25] (BPE: Bayesian parameter estimation), Houthooft et al. [188] (VIME: VI Maximizing Exploration), Springenberg et al. [448], Ilg et al. [201], Heo et al. [177], Henderson et al. [172], Ahn et al. [6], Zhang et al. [563], Sensoy et al. [433], Khan et al. [234], Acerbi [3] (VBMC: Variational Bayesian Monte Carlo), Haußmann et al. [165], Gong et al. [143], De Ath et al. [86], Foong et al. [123], Hasanzadeh et al. [164], Chang et al. [60], Stoean et al. [451], Xiao et al. [529], Repetti et al. [407], Tóthová et al. [479], Moss et al. [333], Dutordoir et al. [104], Luo et al. [298], Gafni et al. [125], Jin et al. [211], Han et al. [158], Stoean et al. [452], Oh et al. [356], Dusenberry et al. [101], Havasi et al. [168], Krishnan et al. [246] (MOPED: MOdel Priors with Empirical Bayes using DNN), Jesson et al. [208], Filos et al. [120], Huang et al. [195], Amit and Meir [12], Bhattacharyya et al. [34], Yao et al. [540], Laves et al. [258] (UCE: uncertainty calibration error), Yang et al. [536] (OC-BNN: Output-Constrained BNN), Thakur et al. [472] (LUNA: Learned UA), Yacoby et al. [534] (NCAI: Noise Constrained Approximate Inference), Masood and Doshi-Velez [315] (PVI: Particle-based VI), Abdolshah et al. [2] (MOBO: Multi-objective Bayesian optimisation), White et al. [521] (BO), Balandat et al. [26] (BOTORCH), Galy-Fajou et al. [131] (CMGGPC: Conjugate multi-class GP classification), Lee et al. [262] (BTAML: Bayesian Task Adaptive Meta Learning), Vadera and Marlin [490] (BDK: Bayesian Dark Knowledge), Siahkoohi et al. [441] (SGLD: stochastic gradient Langevin dynamics), Sun et al. [457], Patacchiola et al. [367], Cheng et al. [66], Caldeira and Nord [57], Wandzik et al. [504] (MCSD: Monte Carlo Stochastic Depth), Deng et al. [89] (DBSN: DNN Structures), González-López et al. [144], Foong et al. [121] (ConvNP: Convolutional Neural Process), Yao et al. [542] (SI: Stacked inference), Prijatelj et al. [392], Herzog et al. [180], Prokudin et al.
[393] (CVAE: conditional VAE), Tuo and Wang [487], Acerbi [4] (VBMC+EIG (expected information gain)/VIQR (variational interquantile range)), Zhao et al. [571] (GEP: generalized expectation propagation), Li et al. [273] (DBGP: deep Bayesian GP), He et al. [169] (NTK: Neural Tangent Kernel), Wang and Ročková [516] (Gaussian approximability)
Ensemble Zhang et al. [563], Chang et al. [60], He et al. [169] (BDE: Bayesian Deep Ensembles), Schwab et al. [426], Smith et al. [446], Malinin and Gales [306], Jain et al. [205], Valdenegro-Toro [491], Juraska et al. [219], Oh et al. [357], Brown et al. [51], Salem et al. [421], Wen et al. [519]
Others Jiang et al. [209] (Trust score), Qin et al. [395] (infoCAM: informative class activation map), Wu et al. [525] (Deep Dirichlet mixture networks), Qian et al. [394] (Margin preserving metric learning), Gomez et al. [142] (Targeted dropout), Malinin and Gales [305] (Prior networks), Dunlop et al. [100] (DGP: deep GP), Hendrycks et al. [175] (Self-supervision), Kumar et al. [251] (Scaling-binning calibrator), [176] (AugMix as a data processing approach), Możejko et al. [334] (Softmax output), Boiarov et al. [44] (SPSA: Simultaneous Perturbation Stochastic Approximation), Ye et al. [544] (Lasso bootstrap), Monteiro et al. [332] (SSNs: Stochastic Segmentation Networks), Maggi et al. [303] (Superposition semantics), Amiri et al. [11] (LCORPP: learning-commonsense reasoning and probabilistic planning), Sensoy et al. [432] (GEN: Generative Evidential Neural Network), Belakaria et al. [29] (USeMO: UA Search framework for optimizing Multiple Objectives), Liu et al. [287] (UaGGP: UA Graph GP), Northcutt et al. [351] (CL: Confident learning), Manders et al. [310] (Class Prediction Uncertainty Alignment), Chun et al. [73] (Regularization Method), Mehta et al. [322] (Uncertainty metric), Liu et al. [282] (SNGP: Spectral-normalized Neural GP), Scillitoe et al. [427] (MFs: Mondrian forests), Ovadia et al. [361] (Dataset shift), Biloš et al. [37] (FD-Dir (Function Decomposition-Dirichlet) and WGP-LN (Weighted GP-Logistic-Normal)), Zheng and Yang [576] (MR: memory regularization), Zelikman et al. [558] (CRUDE: Calibrating Regression Uncertainty Distributions Empirically), Da Silva et al. [80] (RCMP: Requesting Confidence-Moderated Policy advice), Thiagarajan et al. [473] (Uncertainty matching), Zhou et al. [578] (POMBU: Policy Optimization method with Model-Based Uncertainty), Standvoss et al. [450] (RGNN: recurrent generative NN), Wang et al. [512] (TransCal: Transferable Calibration), Grover and Ermon [148] (UAE: uncertainty autoencoders), Cakir et al.
[56, 55] (MI), Yildiz et al. [546] (Ordinary Differential Equation VAE), Titsias et al. [476] and Lee et al. [264] (GP), Ravi and Beatson [403] (AVI: Amortized VI), Lu et al. [293] (DGPM: DGP with Moments), Wang et al. [505] (NLE loss: negative log-likelihood error), Tai et al. [464] (UIA: UA imitation learning), Selvan et al. [431] (cFlow: conditional Normalizing Flow), Poggi et al. [381] (Self-Teaching), Cui et al. [79] (MMD: Maximum Mean Discrepancy)
TABLE V: Additional UQ methods in the three main categories proposed in the literature. In the row for other methods we give the name of the UQ method proposed in each reference; because these names are informative, we do the same in some other parts of the table (general information).

As discussed earlier, the GP is a powerful technique for quantifying uncertainty. However, it is difficult to form a Gaussian approximation to the posterior distribution even in the context of uncertainty estimation in large deep-learning models. In such scenarios, prevailing techniques generally resort to a diagonal approximation of the covariance matrix, despite the poor uncertainty estimates such matrices produce. To tackle this issue, Mishkin et al. [327] designed a novel stochastic, low-rank, approximate natural-gradient (SLANG) technique for VI in large deep models. Their technique computes a "diagonal plus low-rank" structure based on back-propagated gradients of the network log-likelihood. Their findings indicated that the proposed technique is effective in forming a Gaussian approximation to the posterior distribution. Indeed, the safety of AI systems can be enhanced by estimating uncertainty in predictions; such uncertainties arise from a distributional mismatch between the training and test data, irreducible data uncertainty, and uncertainty in the model parameters. Malinin et al. [305] devised a novel framework for predictive uncertainty, dubbed Prior Networks (PNs), that models distributional uncertainty explicitly by parameterizing a prior distribution over predictive distributions. Their work targets uncertainty for classification and scrutinizes PNs on the tasks of recognizing OoD samples and identifying misclassifications on the CIFAR-10 and MNIST datasets. Empirical results indicate that PNs, unlike non-Bayesian methods, can successfully discriminate between distributional and data uncertainty.
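The Dirichlet parameterization that Prior Networks rely on can be sketched as follows; the concentration values below are illustrative, and the sketch omits the training criterion of the cited work, showing only how the output parameterization separates the two kinds of uncertainty.

```python
import math

def prior_network_uncertainties(alphas):
    """Prior Network view (sketch): instead of a single categorical, the
    model outputs Dirichlet concentration parameters `alphas`.  The
    Dirichlet mean gives the class probabilities; a low total
    concentration (precision) flags distributional (OoD) uncertainty,
    while a flat mean (high entropy) flags data uncertainty."""
    precision = sum(alphas)
    probs = [a / precision for a in alphas]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return probs, entropy, precision
```

A confident in-distribution input yields a sharp mean with high precision; an OoD input yields low precision even when the mean looks sharp, which is the distinction softmax outputs alone cannot make.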

7.1