A Review of Uncertainty Quantification in Deep Learning: Techniques, Applications and Challenges
Abstract
Uncertainty quantification (UQ) plays a pivotal role in the reduction of uncertainties during both optimization and decision-making processes. It can be applied to solve a variety of real-world problems in science and engineering. Bayesian approximation and ensemble learning techniques are the two most widely used UQ methods in the literature. In this regard, researchers have proposed different UQ methods and examined their performance in a variety of applications such as computer vision (e.g., self-driving cars and object detection), image processing (e.g., image restoration), medical image analysis (e.g., medical image classification and segmentation), natural language processing (e.g., text classification, social media texts and recidivism risk scoring), bioinformatics, etc. This study reviews recent advances in UQ methods used in deep learning. Moreover, we also investigate the application of these methods in reinforcement learning (RL). Then, we outline a few important applications of UQ methods. Finally, we briefly highlight the fundamental research challenges faced by UQ methods and discuss future research directions in this field.
1 Introduction
In everyday scenarios, we deal with uncertainty in numerous fields, from investment opportunities and medical diagnosis to sporting games and weather forecasting, with the objective of making decisions based on collected observations and uncertain domain knowledge. Nowadays, we can rely on models developed using machine and deep learning techniques to quantify these uncertainties and accomplish statistical inference [309]. It is very important to evaluate the efficacy of artificial intelligence (AI) systems before their usage [209].
The predictions made by such models are uncertain, as they are prone to noise and wrong model inference, in addition to the inductive assumptions that are inherent under uncertainty. Thus, it is highly desirable to represent uncertainty in a trustworthy manner in any AI-based system. Such automated systems should be able to perform accurately by handling uncertainty effectively. The principle of uncertainty plays an important role in AI settings such as concrete learning algorithms [329] and active learning (AL) [348, 5].
Knowledge uncertainty occurs when the test and training data are mismatched, while data uncertainty occurs because of class overlap or the presence of noise in the data [379]. Estimating knowledge uncertainty is more difficult than estimating data uncertainty, which is naturally measured as a result of maximum likelihood training. Identifying the sources of uncertainty in a prediction is essential to tackling the uncertainty estimation problem [130]. There are two main sources of uncertainty, conceptually called aleatoric and epistemic uncertainties [200] (see Fig. 1).
Irreducible uncertainty in data giving rise to uncertainty in predictions is an aleatoric uncertainty (also known as data uncertainty). This type of uncertainty is not the property of the model, but rather is an inherent property of the data distribution; hence it is irreducible. Another type of uncertainty is epistemic uncertainty (also known as knowledge uncertainty) that occurs due to inadequate knowledge and data.
One can define models to answer different human questions posed in model-based prediction. In a data-rich problem, a massive collection of data is available, but it may be informatively poor [335]. In such cases, AI-based methods can be used to build efficient models that characterize the emergent features of the data. Very often these data are incomplete, noisy, discordant and multimodal [309].
Uncertainty quantification (UQ) underpins many critical decisions today, and predictions made without UQ are usually neither trustworthy nor accurate. To understand the Deep Learning (DL) [220, 514] process life cycle, we need to comprehend the role of UQ in DL. DL models start with the collection of the most comprehensive and potentially relevant datasets available for the decision-making process. DL scenarios are designed to meet certain performance goals, and the most appropriate DL architecture is selected after training the model using the labeled data. The iterative training process optimizes different learning parameters, which are "tweaked" until the network provides a satisfactory level of performance.
Several uncertainties need to be quantified in the steps involved. The most obvious uncertainties in these steps are: (i) selection and collection of training data, (ii) completeness and accuracy of training data, (iii) understanding the DL (or traditional machine learning) model, with its performance bounds and limitations, and (iv) uncertainties corresponding to the performance of the model based on operational data [28]. Data-driven approaches such as DL associated with UQ pose at least four overlapping groups of challenges: (i) absence of theory, (ii) absence of causal models, (iii) sensitivity to imperfect data, and (iv) computational expense. To mitigate such challenges, ad hoc solutions such as the study of model variability and sensitivity analysis are sometimes employed.
Uncertainty estimation and quantification have been extensively studied in DL and traditional machine learning. In the following, we provide a brief summary of some recent studies that examined the effectiveness of various methods to deal with uncertainties.
A schematic comparison of three different uncertainty models [198] (MC dropout, Bootstrap model and GMM) is provided in Fig. 2. In addition, two graphical representations of uncertainty-aware models (BNN vs. OoD classifier) are illustrated in Fig. 3.
1.1 Research Objectives and Outline
In the era of big data, ML and DL, intelligent use of different raw data has an enormous potential to benefit a wide variety of areas. However, UQ in different ML and DL methods can significantly increase the reliability of their results.
Ning et al. [349] summarized and classified the main contributions of the data-driven optimization paradigm under uncertainty. As can be observed, that paper reviewed data-driven optimization only. In another study, Kabir et al. [221] reviewed neural-network-based UQ. The authors focused on probabilistic forecasting and prediction intervals (PIs), as these are among the most widely used UQ techniques in the literature.
We have noticed that, from 2010 to 2020 (end of June), more than 2500 papers on UQ in AI were published in various fields (e.g., computer vision, image processing, medical image analysis, signal processing, natural language processing, etc.). On the one hand, we ignored a large number of papers due to a lack of adequate connection with the subject of our review. On the other hand, although many of the papers we reviewed were published in related conferences and journals, many others were found on open-access repositories as electronic preprints (i.e., arXiv); we reviewed them due to their high quality and full relevance to the subject. We have tried our best to cover most of the related articles in this review. This review can therefore serve as a comprehensive guide for readers seeking to navigate this fast-growing research field.
Unlike previous review papers in the field of UQ, this study reviews the most recent articles published on quantifying uncertainty in AI (ML and DL) using different approaches. In addition, we are keen to find out how UQ can impact real cases and how resolving uncertainty in AI can help obtain reliable results. Meanwhile, identifying important gaps in existing methods is a great way to shed light on the path of future research. In this regard, this review gives useful inputs to future researchers working on UQ in ML and DL. We investigated the most recent studies in the domain of UQ applied to ML and DL methods, and accordingly summarized a few existing studies on UQ in ML and DL. It is worth mentioning that the main purpose of this study is not to compare the performance of the different proposed UQ methods, because these methods were introduced for different data and specific tasks; we argue that comparing the performance of all methods is beyond the scope of this study. Instead, this study mainly focuses on important areas including DL, ML and Reinforcement Learning (RL).
Hence, the main contributions of this study are as follows:

To the best of our knowledge, this is the first comprehensive review paper regarding UQ methods used in ML and DL methods which is worthwhile for researchers in this domain.

A comprehensive review of newly proposed UQ methods is provided.

Moreover, the main categories of important applications of UQ methods are also listed.

The main research gaps of UQ methods are pointed out.

Finally, a few solid future directions are discussed.
2 Preliminaries
In this section, we explain the structure of a feedforward neural network, followed by Bayesian modeling, to discuss uncertainty in detail.
2.1 Feedforward neural network
In this section, the structure of a single-hidden-layer neural network [418] is explained, which can be extended to multiple layers. Suppose $x \in \mathbb{R}^D$ is a $D$-dimensional input vector; we use a linear map $W_1 \in \mathbb{R}^{D \times Q}$ and bias $b \in \mathbb{R}^{Q}$ to transform $x$ into a row vector with $Q$ elements, i.e., $xW_1 + b$. Next, a nonlinear transfer function $\sigma(\cdot)$, such as the rectified linear unit (ReLU), can be applied to obtain the output of the hidden layer. Then another linear map $W_2 \in \mathbb{R}^{Q \times K}$ can be used to map the hidden layer to the output:

$$\hat{y} = \sigma(x W_1 + b)\, W_2 \qquad (1)$$
For classification, to compute the probability of $x$ belonging to a label $c$ in the set $\{1, \dots, K\}$, the normalized score is obtained by passing the model output $\hat{y}$ through a softmax function, $\hat{p} = \mathrm{softmax}(\hat{y})$. Then the softmax loss is used:

$$\mathcal{E}(W_1, W_2, b) = -\frac{1}{N} \sum_{i=1}^{N} \log\big(\hat{p}_{i, c_i}\big) \qquad (2)$$

where $x_i$ and $c_i$ are the inputs and their corresponding class labels, respectively.
For regression, the Euclidean loss can be used:

$$\mathcal{E}(W_1, W_2, b) = \frac{1}{2N} \sum_{i=1}^{N} \|y_i - \hat{y}_i\|^2 \qquad (3)$$
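The forward pass and the two losses above can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation of any reviewed method; the function names (`forward`, `softmax_loss`, `euclidean_loss`) are our own.

```python
import numpy as np

def relu(a):
    # nonlinear transfer function sigma in Eq. (1)
    return np.maximum(a, 0.0)

def forward(X, W1, b, W2):
    """Single-hidden-layer network of Eq. (1): y_hat = sigma(X W1 + b) W2."""
    return relu(X @ W1 + b) @ W2

def softmax(scores):
    z = scores - scores.max(axis=1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def softmax_loss(scores, labels):
    """Eq. (2): mean negative log-probability of the correct class."""
    p = softmax(scores)
    return -np.mean(np.log(p[np.arange(len(labels)), labels]))

def euclidean_loss(y_hat, y):
    """Eq. (3): squared-error loss for regression."""
    return 0.5 * np.mean(np.sum((y - y_hat) ** 2, axis=1))
```

For instance, uniform scores over $K = 3$ classes give a softmax loss of $\log 3$, matching Eq. (2).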
2.2 Uncertainty Modeling
As mentioned above, there are two main uncertainties: epistemic (model uncertainty) and aleatoric (data uncertainty) [238]. The aleatoric uncertainty has two types: homoscedastic and heteroscedastic [232].
The predictive uncertainty (PU) consists of two parts, (i) epistemic uncertainty (EU) and (ii) aleatoric uncertainty (AU), and can be written as the sum of these two parts:

$$\mathrm{PU} = \mathrm{EU} + \mathrm{AU} \qquad (4)$$
Epistemic uncertainty can be formulated as a probability distribution over the model parameters. Let $D = \{(x_i, y_i)\}_{i=1}^{N}$ denote a training dataset with inputs $x_i$ and their corresponding classes $y_i \in \{1, \dots, C\}$, where $C$ represents the number of classes. The aim is to optimize the parameters $\omega$ of a function $f^{\omega}(x)$ that can produce the desired output. To achieve this, the Bayesian approach defines a model likelihood $p(y \mid x, \omega)$. For classification, the softmax likelihood can be used:

$$p(y = c \mid x, \omega) = \frac{\exp\big(f_c^{\omega}(x)\big)}{\sum_{c'} \exp\big(f_{c'}^{\omega}(x)\big)} \qquad (5)$$

and the Gaussian likelihood can be assumed for regression:

$$p(y \mid x, \omega) = \mathcal{N}\big(y;\, f^{\omega}(x),\, \tau^{-1} I\big) \qquad (6)$$

where $\tau$ represents the model precision.

The posterior distribution $p(\omega \mid X, Y)$ over $\omega$ for a given dataset $\{X, Y\}$ can be written, by applying Bayes' theorem, as follows:

$$p(\omega \mid X, Y) = \frac{p(Y \mid X, \omega)\, p(\omega)}{p(Y \mid X)} \qquad (7)$$

For a given test sample $x^*$, a class label $y^*$ can be predicted with regard to the posterior:

$$p(y^* \mid x^*, X, Y) = \int p(y^* \mid x^*, \omega)\, p(\omega \mid X, Y)\, d\omega \qquad (8)$$
This process is called inference or marginalization. However, $p(\omega \mid X, Y)$ cannot be computed analytically; it can be approximated by a variational distribution $q_\theta(\omega)$ with variational parameters $\theta$. The aim is to approximate a distribution that is close to the posterior distribution obtained by the model. As such, the Kullback-Leibler (KL) divergence [250] needs to be minimized with respect to $\theta$. The level of similarity between the two distributions can be measured as follows:

$$KL\big(q_\theta(\omega)\, \|\, p(\omega \mid X, Y)\big) = \int q_\theta(\omega)\, \log \frac{q_\theta(\omega)}{p(\omega \mid X, Y)}\, d\omega \qquad (9)$$

The predictive distribution can then be approximated by minimizing the KL divergence, as follows:

$$p(y^* \mid x^*, X, Y) \approx \int p(y^* \mid x^*, \omega)\, q_\theta^{*}(\omega)\, d\omega \qquad (10)$$

where $q_\theta^{*}(\omega)$ indicates the optimized variational distribution.

KL divergence minimization can also be rearranged into evidence lower bound (ELBO) maximization [40]:

$$\mathcal{L}_{VI}(\theta) = \int q_\theta(\omega)\, \log p(Y \mid X, \omega)\, d\omega - KL\big(q_\theta(\omega)\, \|\, p(\omega)\big) \qquad (11)$$

where $q_\theta(\omega)$ is able to describe the data well by maximizing the first term, while staying as close as possible to the prior by minimizing the second term. This process is called variational inference (VI). Dropout VI is one of the most common approaches that has been widely used to approximate inference in complex models [126]. The minimization objective is as follows [212]:

$$\mathcal{L}(\theta, p) = -\frac{1}{N} \sum_{i=1}^{N} \log p(y_i \mid x_i, \omega) + \frac{1 - p}{2N}\, \|\theta\|^2 \qquad (12)$$

where $N$ and $p$ represent the number of samples and the dropout probability, respectively.
To obtain data-dependent uncertainty, the precision $\tau$ in (6) can be formulated as a function of the data. One approach to obtaining epistemic uncertainty is to mix two functions, the predictive mean $f^{\theta}(x)$ and the model precision $g^{\theta}(x)$, so that the likelihood can be written as $y_i \sim \mathcal{N}\big(f^{\theta}(x_i),\, g^{\theta}(x_i)^{-1}\big)$. A prior distribution is placed over the weights of the model, and then the amount of change in the weights for the given data samples is computed. The Euclidean loss function (3) can be adapted as follows:

$$\mathcal{E} := \frac{1}{N} \sum_{i=1}^{N} \frac{1}{2}\, g^{\theta}(x_i)\, \|y_i - f^{\theta}(x_i)\|^2 - \frac{1}{2} \log g^{\theta}(x_i) \qquad (13)$$

The predictive variance can be obtained as follows:

$$\widehat{\mathrm{Var}}(y^*) \approx \frac{1}{T} \sum_{t=1}^{T} g^{\tilde{\omega}_t}(x^*)^{-1} + \frac{1}{T} \sum_{t=1}^{T} f^{\tilde{\omega}_t}(x^*)^2 - \Big(\frac{1}{T} \sum_{t=1}^{T} f^{\tilde{\omega}_t}(x^*)\Big)^2 \qquad (14)$$

where $\tilde{\omega}_t$ denotes the $t$-th of $T$ sampled parameter sets; the first term captures the aleatoric part and the remaining terms the epistemic part of the variance.
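The variance decomposition of Eq. (14) is straightforward to compute once $T$ stochastic forward passes are available. The following NumPy sketch is illustrative only; `mu_samples` and `prec_samples` stand for the per-pass predictive means $f(x)$ and precisions $g(x)$, and the function name is our own.

```python
import numpy as np

def predictive_variance(mu_samples, prec_samples):
    """Eq. (14)-style estimate from T stochastic forward passes.

    mu_samples:   array (T, ...) of predictive means  f(x)  per pass
    prec_samples: array (T, ...) of model precisions  g(x)  per pass
    Returns (total, aleatoric, epistemic) variance estimates.
    """
    aleatoric = np.mean(1.0 / prec_samples, axis=0)   # averaged inverse precision
    epistemic = np.mean(mu_samples ** 2, axis=0) - np.mean(mu_samples, axis=0) ** 2
    return aleatoric + epistemic, aleatoric, epistemic
```

When all passes agree (identical means), the epistemic term vanishes and only the averaged noise variance remains, as Eq. (14) predicts.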
3 Uncertainty Quantification using Bayesian techniques
3.1 Bayesian Deep Learning/Bayesian Neural Networks
Despite the success of standard DL methods in solving various real-world problems, they cannot provide information about the reliability of their predictions. To alleviate this issue, BNNs/BDL [213, 508, 313] can be used to interpret the model parameters. BNNs/BDL are robust to the overfitting problem and can be trained on both small and big datasets [248].
3.2 Monte Carlo (MC) dropout
As stated earlier, exact posterior inference is difficult to compute, but it can be approximated. In this regard, Monte Carlo (MC) sampling [340] is an effective method. Nonetheless, it is slow and computationally expensive when integrated into a deep architecture. To combat this, MC dropout has been introduced, which uses dropout [449] as a regularization term to compute the prediction uncertainty [127]. Dropout is an effective technique that has been widely used to mitigate the overfitting problem in DNNs. During the training process, dropout randomly drops some units of the NN to prevent them from co-tuning too much. Assume an NN with $L$ layers, where $W_l$, $b_l$ and $K_l$ denote the weight matrix, bias vector and dimension of the $l$-th layer, respectively. The output of the NN and the target class of the $i$-th input $x_i$ are denoted by $\hat{y}_i$ and $y_i$, respectively. The objective function using $L_2$ regularization can be written as:

$$\mathcal{L}_{dropout} := \frac{1}{N} \sum_{i=1}^{N} E(y_i, \hat{y}_i) + \lambda \sum_{l=1}^{L} \big(\|W_l\|_2^2 + \|b_l\|_2^2\big) \qquad (15)$$

Dropout samples binary variables for every input data point and every network unit in each layer (except the output layer), with probability $p_l$ for the $l$-th layer; if the sampled value is 0, the unit is dropped for that input. The same values are used in the backward pass to update the parameters. Fig. 4 shows several visualizations of variational distributions on a simple NN [317].
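At test time, MC dropout keeps the dropout masks active and averages several stochastic passes; the spread of the passes approximates the model (epistemic) uncertainty. The following NumPy sketch assumes the single-hidden-layer network of Section 2.1 and uses our own illustrative function name.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_dropout_predict(x, W1, b, W2, p_drop=0.5, T=100):
    """MC dropout: keep dropout ON at test time, run T stochastic passes,
    and return the predictive mean and variance across the passes."""
    outs = []
    for _ in range(T):
        # sample a binary mask for the hidden units, rescaled to keep the mean
        mask = rng.binomial(1, 1.0 - p_drop, size=W1.shape[1]) / (1.0 - p_drop)
        h = np.maximum(x @ W1 + b, 0.0) * mask
        outs.append(h @ W2)
    outs = np.stack(outs)
    return outs.mean(axis=0), outs.var(axis=0)
```

The variance across passes is nonzero whenever the dropped units actually influence the output, which is what MC dropout exploits as an uncertainty signal.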
Several studies have used MC dropout [47] to estimate uncertainty.
Wang et al. [506] analyzed epistemic and aleatoric uncertainties for deep CNNbased medical image segmentation problems at both pixel and structure levels.
They augmented the input image during test phase to estimate the transformation uncertainty.
Specifically, the MC sampling was used to estimate the distribution of the output segmentation.
Liu et al. [280] proposed a unified model using SGD to approximate both epistemic and aleatoric uncertainties of CNNs in the presence of universal adversarial perturbations.
The epistemic uncertainty was estimated by applying MC dropout with Bernoulli distribution at the output of neurons.
In addition, they introduced the texture bias to better approximate the aleatoric uncertainty.
Nasir et al. [337] conducted MC dropout to estimate four types of uncertainties, including the variance of MC samples, predictive entropy, and Mutual Information (MI), in a 3D CNN to segment lesions from MRI sequences.
In [10], two dropout methods, i.e., element-wise Bernoulli dropout [449] and spatial Bernoulli dropout [477], were implemented to compute the model uncertainty in BNNs for end-to-end autonomous vehicle control.
McClure and Kriegeskorte [317] expressed that sampling weights using Bernoulli or Gaussian distributions can lead to a more accurate depiction of uncertainty than sampling units. However, according to the outcomes obtained in [317], it can be argued that using either Bernoulli or Gaussian dropout can improve the classification accuracy of CNNs. Based on these findings, they proposed a novel model (called spike-and-slab sampling) by combining Bernoulli and Gaussian dropout.
Do et al. [95] modified U-Net [414], which is a CNN-based deep model, to segment myocardial arterial spin labeling and estimate uncertainty. Specifically, batch normalization and dropout were added after each convolutional layer and resolution scale, respectively.
Later, Teye et al. [471] proposed MC batch normalization (MCBN) that can be used to estimate uncertainty of networks with batch normalization.
They showed that batch normalization can be considered as an approximate Bayesian model.
Yu et al. [551] proposed a semi-supervised model to segment the left atrium from 3D MR images.
It consists of two modules, teacher and student, used in an uncertainty-aware (UA) framework called the UA self-ensembling mean teacher (UA-MT) model (see Fig. 5).
As such, the student model learns from the teacher model by minimizing the segmentation and consistency losses of the labeled samples and the targets of the teacher model, respectively.
In addition, a UA framework based on MC dropout was designed to help the student model learn a better model by using the uncertainty information obtained from the teacher model.
Table I lists studies that directly applied MC dropout to approximate uncertainty along with their applications.
| Study | Year | Method | Application | Code |
|---|---|---|---|---|
| Kendall et al. [228] | 2015 | SegNet [24] | semantic segmentation | |
| Leibig et al. [266] | 2017 | CNN | diabetic retinopathy | |
| Choi et al. [70] | 2017 | mixture density network (MDN) [39] | regression | |
| Jung et al. [215] | 2018 | full-resolution ResNet [382] | brain tumor segmentation | |
| Wickstrom et al. [522] | 2018 | FCN [437] and SegNet [24] | polyp segmentation | |
| Jungo et al. [217] | 2018 | FCN | brain tumor segmentation | |
| Vandal et al. [497] | 2018 | variational LSTM | predicting flight delays | |
| DeVries and Taylor [91] | 2018 | CNN | medical image segmentation | |
| Tousignant et al. [480] | 2019 | CNN | MRI images | |
| Norouzi et al. [350] | 2019 | FCN | MRI image segmentation | |
| Roy et al. [415] | 2019 | Bayesian FCNN | brain image (MRI) segmentation | |
| Filos et al. [119] | 2019 | CNN | diabetic retinopathy | |
| Harper and Southern [162] | 2020 | RNN and CNN | emotion prediction | |
Comparison of MC dropout with other UQ methods
Recently, several studies have compared different UQ methods. For example, Foong et al. [122] empirically and theoretically studied MC dropout and mean-field Gaussian VI. They found that both models can express uncertainty well in shallow BNNs. However, mean-field Gaussian VI could not approximate the posterior well enough to estimate uncertainty for deep BNNs. Ng et al. [344] compared MC dropout with BBB using U-Net [414] as a base classifier. Siddhant et al. [442] empirically studied various deep AL models for NLP; during prediction, they applied dropout to CNNs and RNNs to estimate the uncertainty. Hubschneider et al. [198] compared MC dropout with a bootstrap ensembling-based method and a Gaussian mixture for the task of vehicle control. In addition, Mukhoti [336] applied MC dropout with several models to estimate uncertainty in regression problems. Kennamer et al. [233] empirically studied MC dropout under astronomical observing conditions.
3.3 Markov chain Monte Carlo (MCMC)
Markov chain Monte Carlo (MCMC) [252] is another effective method used to approximate inference. It starts by taking a random draw $z_0$ from an initial distribution $q(z_0)$ or a prior $p(z_0)$. Then, it applies a stochastic transition to $z_{t-1}$, as follows:

$$z_t \sim q(z_t \mid z_{t-1}, x) \qquad (16)$$

This transition operator is chosen and repeated $T$ times, and the outcome, which is a random variable, converges in distribution to the exact posterior. Salakhutdinov et al. [420] used MCMC to approximate the predictive distribution of movie rating values. Despite the success of conventional MCMC, the sufficient number of iterations is unknown. In addition, MCMC requires a long time to converge to the desired distribution [340]. Several studies have been conducted to overcome these shortcomings. For example, Salimans et al. [422] expanded the space into a set of auxiliary random variables and interpreted the stochastic Markov chain as a variational approximation.
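One concrete choice of the stochastic transition in Eq. (16) is random-walk Metropolis, sketched below in NumPy on a one-dimensional log-posterior. This is a didactic sketch of the generic MCMC recipe, not any of the specific samplers reviewed here; the function name and step parameters are our own.

```python
import numpy as np

rng = np.random.default_rng(1)

def metropolis_hastings(log_post, z0, n_steps=20000, step=0.8):
    """Random-walk Metropolis: repeat a stochastic transition q(z_t | z_{t-1})
    until the chain's samples approximate the target posterior."""
    z = z0
    lp = log_post(z)
    samples = np.empty(n_steps)
    for t in range(n_steps):
        z_prop = z + step * rng.normal()           # propose a local move
        lp_prop = log_post(z_prop)
        if np.log(rng.uniform()) < lp_prop - lp:   # accept with prob min(1, ratio)
            z, lp = z_prop, lp_prop
        samples[t] = z
    return samples
```

Run against a standard-normal log-posterior `lambda z: -0.5 * z * z`, the chain's post-burn-in mean and variance approach 0 and 1, illustrating convergence in distribution to the target.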
The stochastic gradient MCMC (SGMCMC) [63, 94] was proposed to train DNNs.
It only needs to estimate the gradient on small sets of minibatches.
In addition, SGMCMC converges to the true posterior by decreasing the step sizes [62, 269].
Gong et al. [143] combined amortized inference with SGMCMC to increase the generalization ability of the model.
Li et al. [268] proposed an accelerating SGMCMC to improve the speed of the conventional SGMCMC (see Fig. 6 for implementation of different SGMCMC models).
However, within a short sampling time, SGMCMC suffers from a bounded estimation error [469], and it loses efficiency when applied to multilayer networks [71].
In this regard, Zhang et al. [565] developed a cyclical SGMCMC (cSGMCMC) to compute the posterior over the weights of neural networks.
Specifically, a cyclical stepsize was used instead of the decreasing one.
A large step size allows the sampler to take large moves, while a small step size encourages the sampler to explore local modes.
Although SGMCMC reduces the computational complexity by using a small subset, i.e., a mini-batch, of the dataset at each iteration to update the model parameters, these small subsets of data add noise into the model and consequently increase the uncertainty of the system.
To alleviate this, Luo et al. [297] introduced a sampling method called the thermostatassisted continuously tempered Hamiltonian Monte Carlo, which is an extended version of the conventional Hamiltonian MC (HMC) [99].
Note that HMC is a MCMC method [178].
Specifically, they used Nosé-Hoover thermostats [185, 352] to handle the noise generated by mini-batch datasets.
Later, dropout HMC (DHMC) [178] was proposed for uncertainty estimation, and compared with SGMCMC [63] and SGLD [518].
Besides, MCMC has been integrated into generative-based methods to approximate the posterior. For example, in [580], MCMC was applied to stochastic object models, which are learned by generative adversarial networks (GANs), to approximate the ideal observer. In [253], a visual tracking system based on a variational autoencoder MCMC (VAE-MCMC) was proposed.
3.4 Variational Inference (VI)
The variational inference (VI) is an approximation method that learns the posterior distribution over BNN weights.
VI-based methods cast the Bayesian inference problem as an optimization problem of the kind solved by the SGD procedures used to train DNNs.
Fig. 7 summarizes various VI methods for BNNs [459].
For BNNs, VIbased methods aim to approximate posterior distributions over the weights of NN.
To achieve this, the loss can be defined as follows:

$$\mathcal{L}(\theta) = -\sum_{i=1}^{N} \mathbb{E}_{q_\theta(\omega)}\big[\log p(y_i \mid x_i, \omega)\big] + KL\big(q_\theta(\omega)\, \|\, p(\omega)\big) \qquad (17)$$

where $N$ indicates the number of samples. The variational distribution over each layer's weights can be taken as a factorized Gaussian with a multiplicative reparameterization:

$$q_\theta(W_l) = \mathcal{N}\big(W_l;\, M_l,\, \sigma_l^2\big) \qquad (18)$$

$$W_l = M_l \odot \big(\mathbf{1} + \sigma_l \odot \epsilon_l\big) \qquad (19)$$

$$\epsilon_l \sim \mathcal{N}(0, I) \qquad (20)$$

where $\odot$ and $\mathbf{1}$ represent the element-wise product and the vector filled with ones, respectively. Eq. (17) can be used to compute (10).
Posch et al. [388] defined the variational distribution using a product of Gaussian distributions with diagonal covariance matrices, representing a posterior uncertainty of the network parameters for each network layer. Later, in [387], they replaced the diagonal covariance matrices with full ones to allow the network parameters to correlate with each other. Inspired by transfer learning and empirical Bayes (EB) [411], MOPED [243] used deterministic weights, derived from a pre-trained DNN with the same architecture, to select meaningful prior distributions over the weight space. Later, in [245], they integrated an approach based on parametric EB into MOPED for mean-field VI in Bayesian DNNs, and used a fully factorized Gaussian distribution to model the weights. In addition, they used a real-world case study, i.e., diabetic retinopathy diagnosis, to evaluate their method. Subedar et al. [454] proposed an uncertainty-aware framework based on multi-modal Bayesian fusion for activity recognition. They scaled BDNNs to deeper structures by combining deterministic and variational layers. Marino et al. [312] proposed a stochastic-modeling-based approach to model uncertainty; specifically, a DBNN was used for stochastic learning of the system. Variational BNN [260], which is a generative-based model, was proposed to predict the superconducting transition temperature; specifically, VI was adapted to compute the distribution in the latent space of the model.
Louizos and Welling [292] adopted a stochastic gradient VI [236] to compute the posterior distributions over the weights of NNs. Hubin and Storvik [197] proposed a stochastic VI method that jointly considers both model and parameter uncertainties in BNNs, and introduced latent binary variables to include or exclude certain weights of the model. Liu et al. [286] integrated VI into a spatial-temporal NN to approximate the posterior parameter distribution of the network and estimate the probability of the prediction. Ryu et al. [419] integrated a graph convolutional network (GCN) into the Bayesian framework to learn representations and predict molecular properties. Swiatkowski et al. [459] empirically studied Gaussian mean-field VI. They decomposed the variational parameters into a low-rank factorization to obtain a more compact approximation and improve the signal-to-noise ratio of the stochastic gradient in estimating the variational lower bound. Farquhar et al. [115] used mean-field VI to better train deep models. They argued that a deeper linear mean-field network can provide a function-space distribution analogous to that of a shallower full-covariance network. A schematic view of the proposed approach is demonstrated in Fig. 8.
3.5 Bayesian Active Learning (BAL)
Active learning (AL) methods aim to learn from unlabeled samples by querying an oracle [186]. Defining the right acquisition function, i.e., the condition under which a sample is most informative for the model, is the main challenge of AL-based methods. Although existing AL frameworks have shown promising results in a variety of tasks, they lack scalability to high-dimensional data [478]. In this regard, Bayesian approaches can be integrated into the DL structure to represent the uncertainty, and then combined with a deep AL acquisition function to probe for the most uncertain samples in the oracle.
DBAL [129], i.e., deep Bayesian AL, combines an AL framework with Bayesian DL to deal with high-dimensional data problems, i.e., image data. DBAL used batch acquisition to select the top samples with the highest Bayesian AL by disagreement (BALD) [187] score. Model priors from empirical Bayes (MOPED) [244] used BALD to evaluate the uncertainty; in addition, MC dropout was applied to estimate the model uncertainty. Later, Kirsch et al. [237] proposed BatchBALD, which uses a greedy algorithm to select a batch in linear time, reducing the run time. They modeled the uncertainty by leveraging Bayesian AL (BAL) using dropout sampling. In [53], two types of uncertainty measures, namely entropy and BALD [187], were compared.
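The BALD score underlying these acquisition strategies can be computed directly from the softmax outputs of $T$ stochastic (e.g., MC dropout) passes. The NumPy sketch below is illustrative; the function name is our own, and it covers only the single-sample score, not BatchBALD's joint batch selection.

```python
import numpy as np

def bald_score(probs, eps=1e-12):
    """BALD acquisition from T stochastic forward passes.

    probs: array (T, K) of softmax outputs for a single unlabeled sample.
    Score = H[mean_t p_t] - mean_t H[p_t]: predictive entropy minus expected
    entropy. It is high when the passes are individually confident but
    disagree with each other, i.e., under high epistemic uncertainty."""
    mean_p = probs.mean(axis=0)
    predictive_entropy = -np.sum(mean_p * np.log(mean_p + eps))
    expected_entropy = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))
    return predictive_entropy - expected_entropy
```

If all passes agree, the score is zero; if each pass is confident but the passes disagree (e.g., alternating one-hot predictions over two classes), the score approaches $\ln 2$.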
ActiveHARNet [149], which is an AL-based framework for human action recognition, modeled the uncertainty by linking BNNs with GPs using dropout. To achieve this, dropout was applied before each fully connected layer to estimate the mean and variance of the BNN. DeepBASS [316], a deep AL semi-supervised learning method, is an expectation-maximization [88] based technique paired with an AL component. It applied MC dropout to estimate the uncertainty.
Scandalea et al. [93] proposed a framework based on the U-Net structure for deep AL to segment biomedical images, and used the uncertainty measure obtained by MC dropout to suggest which sample should be annotated. Specifically, the uncertainty was defined based on the standard deviation of the posterior probabilities of the MC samples. Zheng et al. [559] varied the number of Bayesian layers and their positions to estimate uncertainty through AL on the MNIST dataset. The outcome indicated that a few Bayesian layers near the output layer are enough to fully estimate the uncertainty of the model.
Inspired by [199], Bayesian batch AL [380], which selects a batch of samples at each AL iteration to perform posterior inference over the model parameters, was proposed for large-scale problems. Active user training [434], a BAL-based crowdsourcing model, was proposed to tackle high-dimensional and complex classification problems. In addition, the Bayesian inference proposed in [443] was used to consider the uncertainty of the annotators' confusion matrices.
Several generative-based AL frameworks have been introduced. In [146], a semi-supervised Bayesian AL model was developed; it is a deep generative model that uses BNNs to provide the discriminative component. Tran et al. [483] proposed a Bayesian generative deep AL approach (BGADL) (Fig. 9) for image classification problems. They first used the concept of DBAL to select the most informative samples, and then VAE-ACGAN was applied to generate new samples based on the selected ones. Akbari et al. [7] proposed a unified BDL framework to quantify both aleatoric and epistemic uncertainties for activity recognition. They used an unsupervised DL model to extract features from the time series, and then learned their posterior distributions through a VAE model. Finally, dropout [127] was applied after each dense layer and at the test phase to randomize the model weights and to sample from the approximate posterior, respectively.
3.6 Bayes by Backprop (BBB)
Learning a probability distribution over the weights of neural networks plays a significant role in obtaining better prediction results. Blundell et al. [42] proposed a novel yet efficient algorithm named Bayes by Backprop (BBB) to quantify the uncertainty of these weights. BBB minimizes the compression cost, known as the variational free energy (VFE) or the (expected) lower bound of the marginal likelihood. To do so, they defined a cost function as follows:
$$F(\mathcal{D}, \theta) = KL\big(q(\omega \mid \theta)\, \|\, p(\omega)\big) - \mathbb{E}_{q(\omega \mid \theta)}\big[\log p(\mathcal{D} \mid \omega)\big] \qquad (21)$$
The BBB algorithm uses unbiased gradient estimates of the cost function in (21) to learn a distribution over the weights of neural networks. In another study, Fortunato et al. [124] proposed a new Bayesian recurrent neural network (BRNN) using the BBB algorithm. To improve the BBB algorithm, they used a simple adaptation of truncated backpropagation through time. The proposed Bayesian RNN (BRNN) model is shown in Fig. 10.
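The unbiased gradients in BBB come from the Gaussian reparameterization trick, which the NumPy sketch below illustrates for a diagonal Gaussian posterior with a standard-normal prior. This is a sketch of the idea rather than the paper's implementation; the softplus parameterization of the standard deviation follows the common convention, and the function names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_weights(mu, rho):
    """Reparameterize w = mu + softplus(rho) * eps, eps ~ N(0, I), so the BBB
    cost in Eq. (21) becomes differentiable w.r.t. theta = (mu, rho)."""
    sigma = np.log1p(np.exp(rho))   # softplus keeps sigma positive
    eps = rng.normal(size=np.shape(mu))
    return mu + sigma * eps, sigma

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL(q(w|theta) || p(w)) for a diagonal Gaussian q against a
    standard-normal prior, i.e., the complexity term of Eq. (21)."""
    return np.sum(-np.log(sigma) + 0.5 * (sigma**2 + mu**2) - 0.5)
```

When $q$ equals the prior ($\mu = 0$, $\sigma = 1$), the KL term vanishes, so the cost in (21) reduces to the expected negative log-likelihood alone.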
Ebrahimi et al. [107] proposed an uncertainty-guided continual approach with BNNs (named UCB, which stands for Uncertainty-guided Continual learning with BNNs). Continual learning aims to learn a variety of new tasks while retaining the knowledge obtained from previously learned ones. The proposed UCB exploits the predicted uncertainty of the posterior distribution to formulate the modification of "important" parameters, both by setting a hard threshold and in a soft way. Recognizing different actions in videos not only needs big data but is also a time-consuming process. To deal with this issue, de la Riva and Mettes [87] proposed a Bayesian DL method (named Bayesian 3D ConvNet) to analyze a small number of videos. In this regard, BBB was extended for use with 3D CNNs and then employed to deal with uncertainty over the convolution weights of the proposed model. To do so, a Gaussian distribution was applied to approximate the correct posterior in the proposed 3D convolution layers using the mean and standard deviation (STD):

$$y = x \ast \omega, \qquad \omega \sim \mathcal{N}(\mu, \sigma^2),\ \omega \in \mathbb{R}^{H \times W \times T} \qquad (22)$$

where $x$ represents the input, $y$ is the output, $H$ is the filter height, $W$ is the filter width and $T$ is the time dimension. In another study, Ng et al. [344] compared the performance of two well-known uncertainty methods (MC dropout and BBB) for medical image segmentation (cardiac MRI) on a U-Net model. The obtained results showed that MC dropout and BBB demonstrated almost similar performance in the medical image segmentation task.
3.7 Variational Autoencoders
An autoencoder is a type of deep learning architecture that consists of two components: (i) an encoder, and (ii) a decoder. The encoder aims to map a high-dimensional input sample x to a low-dimensional latent variable z, while the decoder reproduces the original sample x from the latent variable z. The latent variables are compelled to conform to a given prior distribution P(z). Variational Autoencoders (VAEs) [236] are effective methods to model the posterior. They cast learning representations for high-dimensional distributions as a VI problem [139]. A probabilistic model of a sample x in data space with a latent variable z in latent space can be written as follows:
P(x) = ∫ P(x|z) P(z) dz    (23)
The VI can be used to model the evidence lower bound as follows:
log P(x) ≥ E_{q_φ(z|x)}[ log P_θ(x|z) ] − KL( q_φ(z|x) ‖ P(z) )    (24)
where q_φ(z|x) and P_θ(x|z) are the encoder and decoder models, respectively, and φ and θ indicate their parameters.
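A minimal numpy sketch of a Monte Carlo ELBO estimate in the spirit of Eq. (24) follows; the linear "encoder" and "decoder" and the Gaussian likelihood are illustrative assumptions rather than any specific VAE from the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)

def elbo(x, enc_mu, enc_logvar, decode, n_samples=50):
    """MC estimate of Eq. (24): E_q[log p(x|z)] - KL(q(z|x) || N(0, I))."""
    mu, logvar = enc_mu(x), enc_logvar(x)
    # Closed-form KL between the diagonal Gaussian encoder and the prior
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
    recon = 0.0
    for _ in range(n_samples):
        # Reparameterization trick: z = mu + sigma * eps
        z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
        x_hat = decode(z)
        recon += np.sum(-0.5 * (x - x_hat)**2)  # Gaussian log-likelihood
    return recon / n_samples - kl

# Hypothetical linear encoder/decoder standing in for trained networks
A = np.array([[0.5, 0.0], [0.0, 0.5]])
x = np.array([1.0, -1.0])
val = elbo(x, lambda x: A @ x, lambda x: np.full(2, -2.0), lambda z: 2.0 * z)
print(val)
```

In a real VAE, both mappings are neural networks and the ELBO is maximized by gradient ascent over their parameters.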
Zamani et al. [81] developed a discrete VAE framework with Bernoulli latent variables as binary hashing codes (Fig. 11). Stochastic gradients were exploited to learn the model. They proposed a pairwise supervised hashing (PSH) framework to derive better hashing codes. PSH maximizes the ELBO with weighted KL regularization to learn more informative binary codes, and adopts a pairwise loss function that rewards within-class similarity and between-class dissimilarity, minimizing the distance between the hashing codes of samples from the same class and maximizing it otherwise.
Bohm et al. [43] studied UQ for linear inverse problems using VAEs. Specifically, a vanilla VAE with a mean-field Gaussian posterior was trained on uncorrupted samples under the ELBO. In addition, the ELO method [430] was adopted to approximate the posterior. Edupuganti et al. [108] studied UQ in magnetic resonance image recovery (see Fig. 12). As such, a VAE-GAN, a probabilistic recovery scheme, was developed to map low-quality images to high-quality ones. The VAE-GAN consists of a VAE and a multi-layer CNN as the generator and discriminator, respectively. In addition, Stein's unbiased risk estimator (SURE) was leveraged as a proxy to predict error and estimate the uncertainty of the model.
In [210], a framework based on a variational U-Net [112] architecture was proposed for UQ in reservoir simulations. Both the simple U-Net and the variational U-Net (VUNet) are illustrated in Fig. 13. CosmoVAE [545], a U-Net-based VAE, was proposed to restore missing observations of the cosmic microwave background (CMB) map. As such, variational Bayes approximation was used to determine the ELBO of the likelihood of the reconstructed image. Mehrasa et al. [320] proposed the action point process VAE (APP-VAE) for action sequences. APP-VAE consists of two LSTMs to estimate the prior and posterior distributions. Sato et al. [424] proposed a VAE-based uncertainty-aware (UA) approach for anomaly detection. They used MC sampling to estimate the posterior.
Since VAEs are not stochastic processes, they are limited to encoding finite-dimensional priors. To alleviate this limitation, Mishra et al. [328] developed the prior-encoding VAE, i.e., πVAE. Inspired by the Gaussian process [133], πVAE is a stochastic process that learns a distribution over functions. To achieve this, the πVAE encoder first transforms the locations to a high-dimensional space and then uses a linear mapping to link the feature space to the outputs, while the πVAE decoder aims to recreate the linear mapping from the lower-dimensional probabilistic embedding. Finally, the recreated mapping is used to reconstruct the outputs. Guo et al. [151] used a VAE to deal with data uncertainty under a just-in-time learning framework. A Gaussian distribution was employed to describe latent space features variable-wise, and the KL-divergence was then used to ensure that the selected samples are the most relevant to a new sample. Daxberger et al. [85] tried to detect OoD samples during the test phase. As such, they developed an unsupervised, probabilistic framework based on a Bayesian VAE. Besides, they estimated the posterior over the decoder parameters by applying SGMCMC.
4 Other methods
In this section, we discuss a few other UQ methods used in machine and deep learning algorithms.
4.1 Deep Gaussian processes
Deep Gaussian processes (DGPs) [84, 105, 423, 45, 549, 458, 355] are effective multi-layer decision making models that can accurately model uncertainty. They represent a multi-layer hierarchy of Gaussian processes (GPs) [401, 470]. GPs are a nonparametric type of Bayesian model that encodes the similarity between samples using a kernel function. They represent the distribution over the latent function values f with respect to the input samples X as a Gaussian distribution p(f|X) = N(m(X), K(X, X)). The output y is then distributed according to a likelihood function p(y|f). However, conventional GPs cannot scale effectively to large datasets. To alleviate this issue, inducing samples can be used, and the following variational lower bound can be optimized:
L = Σ_{n=1}^{N} E_{q(f_n)}[ log p(y_n | f_n) ] − KL( q(u) ‖ p(u|Z) )    (25)
where Z and q(u) are the locations of the inducing samples and the variational approximation to the distribution of the inducing variables u, respectively.
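For intuition about how GPs express uncertainty through the kernel, the following numpy sketch computes the exact GP posterior, the quantity that inducing-point approximations such as the bound above try to scale up; the RBF kernel, noise level and toy data are assumptions for illustration.

```python
import numpy as np

def rbf(a, b, ell=1.0):
    """Squared-exponential kernel encoding similarity between samples."""
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

def gp_posterior(X, y, X_star, noise=1e-2):
    """Exact GP posterior mean and per-point variance at test inputs."""
    K = rbf(X, X) + noise * np.eye(len(X))
    K_s = rbf(X, X_star)
    K_ss = rbf(X_star, X_star)
    alpha = np.linalg.solve(K, y)
    mean = K_s.T @ alpha
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    return mean, np.diag(cov)

X = np.array([-1.0, 0.0, 1.0])
y = np.sin(X)
mean, var = gp_posterior(X, y, np.array([0.0, 5.0]))
# Variance is small at an observed input and large far from the data
print(var[0] < var[1])
```

The exact solve costs O(N³) in the number of training points, which is precisely why the inducing-point lower bound above is needed for large datasets.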
Oh et al. [358] proposed hedged instance embedding (HIB), which hedges the position of each sample in the embedding space, to model uncertainty when the input sample is ambiguous. As such, the probability of two samples matching was extended to stochastic embeddings, and MC sampling was used to approximate it. Specifically, a mixture of Gaussians was used to represent the uncertainty. Havasi et al. [167] applied SGHMC to DGPs to approximate the posterior distribution. They introduced a moving-window MC expectation maximization to obtain the maximum likelihood, dealing with the problem of optimizing the large number of parameters in DGPs. Maddox et al. [301] used stochastic weight averaging (SWA) [203] to build a Gaussian-based model to approximate the true posterior. Later, they proposed SWAG [302], i.e., SWA-Gaussian, to model Bayesian averaging and estimate uncertainty.
Most weight perturbation-based algorithms suffer from high variance of the gradient estimates because all samples in a mini-batch share the same perturbation. To alleviate this problem, flipout [520] was proposed. Flipout samples pseudo-independent weight perturbations for each input to decorrelate the gradients within the mini-batch. It is able to reduce variance and computational time when training NNs with multiplicative Gaussian perturbations.
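A rough numpy sketch of the flipout idea for a single linear layer is shown below; the random-sign construction follows the scheme just described, but the toy shapes and Gaussian perturbation scale are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def flipout_linear(X, W_mu, W_sigma):
    """Flipout-style pseudo-independent perturbations: one shared Gaussian
    noise matrix, decorrelated across the mini-batch with random signs."""
    n, d_in = X.shape
    d_out = W_mu.shape[1]
    dW = W_sigma * rng.standard_normal(W_mu.shape)  # shared perturbation
    R = rng.choice([-1.0, 1.0], size=(n, d_in))     # per-example signs
    S = rng.choice([-1.0, 1.0], size=(n, d_out))
    # ((X * R) @ dW) * S applies a sign-flipped perturbation per example
    return X @ W_mu + ((X * R) @ dW) * S

X = rng.standard_normal((4, 3))
out = flipout_linear(X, np.zeros((3, 2)), np.full((3, 2), 0.1))
print(out.shape)  # (4, 2)
```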
Despite the success of DNNs in dealing with complex and highdimensional image data, they are not robust to adversarial examples [460]. Bradshaw et al. [48] proposed a hybrid model of GP and DNNs (GPDNNs) to deal with uncertainty caused by adversarial examples (see Fig. 14).
Choi et al. [69] proposed a Gaussian-based model to predict the localization uncertainty in YOLOv3 [405]. As such, they applied a single Gaussian model to the bbox coordinates of the detection layer. Specifically, each bbox coordinate is modeled as a Gaussian with mean μ and variance σ², predicting the uncertainty of the bbox.
Khan et al. [234] proposed a natural-gradient-based algorithm for Gaussian mean-field VI. A Gaussian distribution with diagonal covariance was used to estimate the probability. The proposed algorithm was implemented within the Adam optimizer. To achieve this, the network weights were perturbed during the gradient evaluation. In addition, they used a vector to adapt the learning rate to estimate uncertainty.
Sun et al. [456] considered the structural information of the model weights. They used the matrix variate Gaussian (MVG) [152] distribution to model structured correlations within the weights of DNNs, and introduced a reparametrization for the MVG posterior to make posterior inference feasible. The resulting MVG model was applied to a probabilistic BP framework for posterior inference. Louizos and Welling [291] used the MVG distribution to estimate the weight posterior uncertainty. They treated the weight matrix as a whole rather than treating each of its components independently. As mentioned earlier, GPs have been widely used for UQ in deep learning methods. Van der Wilk et al. [494], Blomqvist et al. [41], Tran et al. [481], Dutordoir et al. [103] and Shi et al. [439] introduced convolutional structure into GPs.
In another study, Corbière et al. [76] noted that the confidence of DNNs and the ability to predict their failures are of key importance for the practical deployment of these methods. In this regard, they showed that the True Class Probability (TCP) is more suitable than the Maximum Class Probability (MCP, i.e., max_k P(Y = k|x)) for failure prediction in such deep learning methods, as follows:
TCP(x, y*) = P(Y = y* | x)    (26)
where x ∈ ℝ^d is a d-dimensional feature vector and y* is its correct class. Then, they introduced a normalized variant of the TCP confidence criterion:
TCP^r(x, y*) = P(Y = y* | x) / max_k P(Y = k | x)    (27)
A general view of the proposed model in [76] is illustrated by Fig. 15.
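Both confidence scores can be computed directly from softmax outputs, as in this small numpy sketch; the probability values are made up for illustration.

```python
import numpy as np

def mcp(probs):
    """Maximum Class Probability: the usual softmax confidence."""
    return probs.max(axis=1)

def tcp(probs, labels):
    """True Class Probability: confidence assigned to the correct class."""
    return probs[np.arange(len(labels)), labels]

probs = np.array([[0.6, 0.3, 0.1],    # correct and confident
                  [0.5, 0.4, 0.1]])   # misclassified: true class is 1
labels = np.array([0, 1])
print(mcp(probs))                     # [0.6 0.5]
print(tcp(probs, labels))             # [0.6 0.4]
print(tcp(probs, labels) / mcp(probs))  # normalized TCP
```

On the misclassified sample, TCP is lower than MCP, which is exactly why TCP is the better failure-prediction signal; at test time the true label is unknown, so Corbière et al. train an auxiliary network to regress it.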
In another study, Atanov et al. [21] introduced a probabilistic model and showed that the Batch Normalization (BN) approach can maximize the lower bound of its related marginalized log-likelihood. Since inference was not computationally efficient, they proposed the Stochastic BN (SBN) approach to approximate the proper inference procedure as an uncertainty estimation method. Moreover, induced noise is generally employed to capture uncertainty, check overfitting and slightly improve performance via test-time averaging, whereas ordinary stochastic neural networks typically depend on the expected values of their weights to formulate predictions. Neklyudov et al. [341] proposed a different kind of stochastic layer called the variance layer. It is parameterized by its variance, and each weight of a variance layer follows a zero-mean distribution. This implies that each object is represented by a zero-mean distribution in the space of activations. They demonstrated that these layers provide a robust defense against adversarial attacks and could serve as a crucial exploration tool in reinforcement learning tasks.
4.2 Laplace approximations
Laplace approximations (LAs) are another popular family of UQ methods used to estimate Bayesian inference [300]. They build a Gaussian distribution around the true posterior using a Taylor expansion around the MAP estimate, θ_MAP, as follows:
p(θ | D) ≈ N( θ_MAP, H^{-1} )    (28)
where H indicates the Hessian of the negative log-likelihood evaluated at the MAP estimate. Ritter et al. [410] introduced a scalable LA (SLA) approach for different NNs. The proposed model was then compared with other well-known methods, such as Dropout and a diagonal LA, for uncertainty estimation in networks.
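A minimal numpy sketch of a 1-D Laplace approximation follows: Newton's method finds the MAP and a finite-difference Hessian supplies the Gaussian variance, as in Eq. (28); the toy Gaussian likelihood and N(0, 10) prior are assumptions for illustration.

```python
import numpy as np

def laplace_approximation(neg_log_post, theta0, iters=50, h=1e-4):
    """Fit N(theta_map, H^{-1}) to a 1-D posterior via Newton's method
    and a central finite-difference Hessian."""
    theta = theta0
    for _ in range(iters):
        g = (neg_log_post(theta + h) - neg_log_post(theta - h)) / (2 * h)
        H = (neg_log_post(theta + h) - 2 * neg_log_post(theta)
             + neg_log_post(theta - h)) / h**2
        theta -= g / H                   # Newton step toward the MAP
    return theta, 1.0 / H                # Gaussian mean and variance

# Toy posterior: Gaussian likelihood of 10 points plus an N(0, 10) prior
data = np.array([1.2, 0.8, 1.1, 0.9, 1.0, 1.3, 0.7, 1.0, 1.1, 0.9])
nlp = lambda t: 0.5 * np.sum((data - t)**2) + 0.5 * t**2 / 10.0
theta_map, var = laplace_approximation(nlp, 0.0)
print(theta_map, var)
```

For this conjugate toy case the exact posterior is Gaussian with precision n + 1/10 = 10.1, so the Laplace fit is essentially exact; for NNs the Hessian is high-dimensional, which is what scalable variants such as SLA approximate.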
5 Uncertainty Quantification in Reinforcement Learning
In decision making processes, uncertainty plays a key role in decision performance across various fields such as Reinforcement Learning (RL) [96]. Different UQ methods in RL have been widely investigated in the literature [572]. Lee et al. [261] formulated the model uncertainty problem as a Bayes-Adaptive Markov Decision Process (BAMDP). A BAMDP is defined by a tuple ⟨S, Φ, A, T, R, P₀, γ⟩, where S is the underlying MDP's observable state space, Φ is the latent space, A is the action space, T is the parameterized transition function, R is the reward function, P₀ is the initial distribution and γ is the discount factor. Let b₀ be an initial belief; a Bayes filter updates the posterior belief as follows:
b′(φ) ∝ b(φ) T(s′ | s, φ, a)    (29)
Then, the Bayesian Policy Optimization (BPO) method (see Fig. 16) is applied to POMDPs, where a Bayes filter computes the belief over the hidden state as follows:
b′(s′) ∝ O(o | s′, a) Σ_{s∈S} T(s′ | s, a) b(s)    (30)
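This belief computation is a standard discrete Bayes filter and can be sketched in a few lines of numpy; the two-state transition and observation models below are hypothetical values for illustration.

```python
import numpy as np

def bayes_filter_update(belief, T, obs_lik):
    """One discrete Bayes-filter step: propagate the belief through the
    transition model, weight by the observation likelihood, renormalize."""
    predicted = T.T @ belief            # sum_s T(s'|s) b(s)
    posterior = predicted * obs_lik     # multiply by p(o|s')
    return posterior / posterior.sum()

# Hypothetical 2-state problem with fairly sticky transitions
T = np.array([[0.9, 0.1],
              [0.2, 0.8]])             # T[i, j] = p(s'=j | s=i)
belief = np.array([0.5, 0.5])
obs_lik = np.array([0.9, 0.1])         # observation strongly favors state 0
belief = bayes_filter_update(belief, T, obs_lik)
print(belief.round(3))                 # mass shifts toward state 0
```

In BPO, a policy network consumes this belief vector (rather than the raw state) so that the agent's actions account for its model uncertainty.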
In another study, O'Donoghue et al. [354] proposed the uncertainty Bellman equation (UBE) to quantify uncertainty. The authors used a Bellman-based approach which propagated the uncertainty (here, the variance) of the Bayesian posterior distribution. Kahn et al. [223] presented a new UA model for a learning algorithm to control a mobile robot. A review of past studies in RL shows that different Bayesian approaches have been used for handling parameter uncertainty [137]; Bayesian RL was thoroughly reviewed by Ghavamzadeh et al. [137] in 2015. Due to page limitations, we do not discuss the application of UQ in RL in detail, but we summarise some of the recent studies here.
Kahn et al. [223] used both bootstrapping and dropout methods to estimate uncertainty in NNs, which were then used in a UA collision prediction model.
Besides Bayesian statistical methods, ensemble methods have been used to quantify uncertainty in RL [484]. In this regard, Tschantz et al. [484] applied an ensemble of point-estimate parameters, trained on different batches of a dataset, maintained and treated as samples from the posterior distribution. The ensemble method helped capture both aleatoric and epistemic uncertainty. More UQ techniques have been used in RL; however, we are not able to discuss all of them in detail in this work for various reasons, such as page restrictions and the breadth of articles. Table II summarizes different UQ methods used in a variety of RL subjects.
Study  Application  Goal/Objective  UQ method  Code 

Tegho et al. [468]  Dialogue management context  Dialogue policy optimisation  BBB propagation deep Qnetworks (BBQN)  
Janz et al. [206]  Temporal difference learning  Posterior sampling for RL (PSRL)  Successor Uncertainties (SU)  
Shen and How [438]  Discriminating potential threats  Stochastic belief space policy  SoftQ learning  
Benatan and PyzerKnapp [32]  Safe RL (SRL)  The weights in RNN using mean and variance weights  Probabilistic Backpropagation (PBP)  
Kalweit and Boedecker [224]  Continuous Deep RL (CDRL)  Minimizing realworld interaction  Modelassisted Bootstrapped Deep Deterministic Policy Gradient (MABDDPG)  
Riquelme et al. [409]  Approximating the posterior sampling  Balancing both exploration and exploitation in different complex domains  Deep Bayesian Bandits Showdown using Thompson sampling  
Huang et al. [194]  Modelbased RL (MRL)  Better decision and improve performance  Bootstrapped modelbased RL (BMRL)  
Eriksson and Dimitrakakis [111]  Risk measures and leveraging preferences  RiskSensitive RL (RSRL)  Epistemic Risk Sensitive Policy Gradient (EPPG)  
Lötjens et al. [290]  SRL  UA navigation  Ensemble of MC dropout (EMCD) and Bootstrapping  
Clements et al. [74]  Designing risksensitive algorithm  Disentangling aleatoric and epistemic uncertainties  Combination of distributional RL (DRL) and Approximate Bayesian computation (ABC) methods with NNs  
D'Eramo et al. [82]  Drive exploration  Multi-Armed Bandit (MAB)  Bootstrapped deep Q-network with TS (BDQN-TS) 
6 Ensemble Techniques
Deep neural networks (DNNs) have been effectively employed in a wide variety of machine learning tasks and have achieved state-of-the-art performance in different domains such as bioinformatics, natural language processing (NLP), speech recognition and computer vision [562, 283]. In supervised learning benchmarks, NNs yield competitive accuracies, yet poor predictive uncertainty quantification; hence, they are inclined to generate overconfident predictions. Incorrect overconfident predictions can be harmful, so it is important to handle UQ properly in real-world applications [256]. As ground truth for uncertainty estimates is not available in general, evaluating the quality of predictive uncertainty is a challenging task. Two evaluation notions, calibration and domain shift, are applied, usually inspired by the practical applications of NNs. Calibration measures the discrepancy between long-run frequencies and subjective forecasts. The second notion concerns generalization of the predictive uncertainty under domain shift, that is, estimating whether the network knows what it knows. An ensemble of models enhances predictive performance. However, it is not evident why and when an ensemble of NNs generates good uncertainty estimates. Bayesian model averaging (BMA) assumes that the true model lies within the hypothesis class of the prior and performs soft model selection to locate the single best model within the hypothesis class. In contrast, ensembles combine models to discover a more powerful model; ensembles can be expected to be better when the true model does not lie within the hypothesis class.
The authors in [204] devised the Maximize Overall Diversity (MOD) model to estimate ensemble-based uncertainty by taking into account diversity in ensemble predictions across possible future inputs. Gustafsson et al. [153] presented an evaluation approach for measuring uncertainty estimation to investigate robustness in the computer vision domain. Researchers in [319] proposed a deep ensemble echo state network model for spatio-temporal forecasting with uncertainty quantification. Chua et al. [72] devised a novel method called probabilistic ensembles with trajectory sampling that integrated sampling-based uncertainty propagation with a UA deep network dynamics approach. The authors in [562] demonstrated that prevailing calibration error estimators were unreliable in the small data regime, and hence proposed a kernel density-based estimator for calibration performance evaluation, proving its consistency and unbiasedness. Liu et al. [281] presented a Bayesian nonparametric ensemble method which enhanced an ensemble model by augmenting the model's distribution functions using Bayesian nonparametric machinery and a prediction mechanism. Hu et al. [189] proposed a model called margin-based Pareto deep ensemble pruning, utilizing a deep ensemble network, that yielded competitive uncertainty estimation with high prediction interval coverage probability and small prediction interval width. In another study, the researchers in [307] explored the challenges associated with obtaining uncertainty estimates for structured prediction tasks and presented baselines for sequence-level out-of-domain input detection, sequence-level prediction rejection and token-level error detection utilizing ensembles.
Ensembles involve memory and computational costs which are not acceptable in many applications [308]. There has been noteworthy work on distilling an ensemble into a single model. Such approaches achieve accuracy comparable to ensembles while mitigating the computational costs. The uncertainty of the model is captured in the posterior distribution p(θ|D). Consider an ensemble of models sampled from the posterior as follows [308]:
{ P(y | x*, θ^{(m)}) }_{m=1}^{M},   θ^{(m)} ∼ p(θ | D)    (31)
where x* is a test input and θ^{(m)} represents the parameters of a categorical distribution P(y | x*, θ^{(m)}). By taking the expectation with respect to the model posterior, the predictive posterior, or expected predictive distribution, for a test input is acquired:
P(y | x*, D) = E_{p(θ|D)}[ P(y | x*, θ) ] ≈ (1/M) Σ_{m=1}^{M} P(y | x*, θ^{(m)})    (32)
Each of the models P(y | x*, θ^{(m)}) yields a different estimate of data uncertainty. The "disagreement", or level of spread, of an ensemble sampled from the posterior arises from uncertainty in the predictions due to model uncertainty. Given an ensemble that yields the expected set of behaviors, the entropy of the expected distribution can be utilized as an estimate of the total uncertainty in the prediction. Measures of spread or "disagreement" of the ensemble, such as mutual information (MI), can be used to assess the uncertainty in predictions due to knowledge uncertainty as follows:
I[y, θ | x*, D] = H[ E_{p(θ|D)} P(y | x*, θ) ] − E_{p(θ|D)}[ H[ P(y | x*, θ) ] ]    (33)
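A minimal numpy sketch of this entropy-based decomposition for a single input is given below; the two toy ensembles are constructed so that data uncertainty dominates in one and knowledge uncertainty in the other.

```python
import numpy as np

def entropy(p, axis=-1):
    """Shannon entropy in nats."""
    return -np.sum(p * np.log(np.clip(p, 1e-12, 1.0)), axis=axis)

def decompose(ensemble_probs):
    """Split total uncertainty into expected data uncertainty and
    knowledge uncertainty (mutual information)."""
    mean_p = ensemble_probs.mean(axis=0)         # predictive posterior
    total = entropy(mean_p)                      # H of expected distribution
    data = entropy(ensemble_probs).mean(axis=0)  # expected H of each model
    return total, data, total - data             # MI = knowledge uncertainty

# Members agree but are individually unsure -> data uncertainty dominates
agree = np.array([[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]])
# Members confidently disagree -> knowledge uncertainty dominates
disagree = np.array([[0.99, 0.01], [0.01, 0.99], [0.5, 0.5]])
print(decompose(agree)[2] < decompose(disagree)[2])  # True
```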
The MI formulation decomposes total uncertainty into expected data uncertainty and knowledge uncertainty. If the model is uncertain, both in out-of-domain regions and in regions of severe class overlap, the entropy of the predictive posterior (the total uncertainty) is high. If the models disagree, the difference between the entropy of the predictive posterior and the expected entropy of the individual models will be non-zero. For example, in regions of class overlap, each member of the ensemble produces a high-entropy distribution, the expected entropy and the predictive posterior entropy will be similar, and MI will be low; in this scenario, data uncertainty dominates total uncertainty. For out-of-domain inputs, on the other hand, the predictive posterior is near uniform while the expected entropy of each model may be low, since the members yield diverse distributions over classes; in this region of input space, knowledge uncertainty is high because the model's understanding of the data is low. In ensemble distribution distillation, the aim is to capture not only the mean of the ensemble but also its diversity. An ensemble can be viewed as a set of samples from an implicit distribution of output distributions:
{ P(y | x*, θ^{(m)}) }_{m=1}^{M} ∼ p(π | x*, D)    (34)
Prior Networks, a new class of models, were proposed that explicitly parameterize a conditional distribution over output distributions using a single neural network with a point estimate of the model parameters. A Prior Network can effectively emulate an ensemble and hence express the same measures of uncertainty. By parameterizing a Dirichlet distribution, the Prior Network represents a distribution over categorical output distributions. Ensembling performance can be measured through uncertainty estimation, and deep learning ensembles produce benchmark results in uncertainty estimation. The authors in [20] explored in-domain uncertainty, examined standards for its quantification and revealed pitfalls of the prevailing metrics. They presented the deep ensemble equivalent (DEE) score and demonstrated how an ensemble of only a few trained networks can be equivalent to many more sophisticated ensembling methods with respect to test performance. They also proposed test-time augmentation (TTA) for one ensemble in order to improve the performance of different ensemble learning techniques (see Fig. 17).
Deep ensembles [385] are a simple approach that provides independent samples from different modes of the loss landscape. Under a fixed test-time compute budget, deep ensembles can be regarded as a powerful baseline for the performance of other ensembling methods. Comparing the performance of ensembling methods is a challenging task: different models achieve different metric values on different datasets, and metric values lack interpretability since performance gains are compared against dataset- and model-specific baselines. Hence, Ashukha et al. [20] proposed DEE with the aim of introducing an interpretable perspective that uses deep ensembles to measure the performance of other ensembling methods. The DEE score tries to answer the question: what size of deep ensemble demonstrates the same performance as a particular ensembling technique? The DEE score is based on the calibrated log-likelihood (CLL). DEE is defined for an ensembling technique m, and its lower and upper bounds are given as below [20]:
DEE_m(k) = min{ l ∈ ℝ, l ≥ 1 | CLL^{DE}_{mean}(l) ≥ CLL^{m}_{mean}(k) }    (35)
DEE^{upper/lower}_m(k) = min{ l ∈ ℝ, l ≥ 1 | CLL^{DE}_{mean}(l) ∓ CLL^{DE}_{std}(l) ≥ CLL^{m}_{mean}(k) }    (36)
where CLL^{m}_{mean}(k) and CLL^{m}_{std}(k) denote the mean and standard deviation of the calibrated log-likelihood yielded by an ensembling technique m with k samples. They measured DEE for natural numbers k and applied linear interpolation to define it for real values; they depict DEE_m(k) for different numbers of samples k and different methods, together with the upper and lower bounds.
Different sources of model uncertainty can be handled by the Bayesian nonparametric ensemble (BNE) model devised by Liu et al. [281], which incorporates an existing ensemble technique. Bayesian nonparametric machinery was utilized to augment the distribution functions and predictions of a model. The BNE measures the uncertainty patterns in the data distribution and decomposes the uncertainty into discrete components that are due to error and noise. The model yielded precise uncertainty estimates under observational noise and demonstrated its utility for bias detection and uncertainty decomposition for an ensemble method used in prediction. The predictive mean of BNE can be expressed as below [281]:
(37) 
The predictive mean for the full BNE comprises three parts:

The predictive mean of the original ensemble;

BNE's direct correction to the prediction function, represented by the second term; and

BNE's indirect correction to the prediction, derived from the relaxation of the Gaussian assumption in the model cumulative distribution function, represented by the third term. In addition, two error correction terms are also presented.
To denote BNE's predictive uncertainty estimation, the predictive cumulative distribution function of the original ensemble (with its mean and variance) is used. The BNE's predictive interval is presented as [281]:
(38) 
Comparing the above equation to the predictive interval of the original ensemble, it can be observed that the residual process adjusts the locations of the BNE predictive interval endpoints and calibrates the spread of the predictive interval.
As an important part of ensemble techniques, loss functions play a significant role in achieving good performance. In other words, choosing an appropriate loss function can dramatically improve results. Due to page limitations, we summarise the most important loss functions applied for UQ in Table III.
Study  Dataset type  Base classifier(s)  Method’s name  Loss equation  Code 

TV et al. [488]  Sensor data  Neural Networks (LSTM)  Ordinal Regression (OR)  
Sinha et al. [444]  Image  Neural Networks  Diverse Information Bottleneck in Ensembles (DIBS)  
Zhang et al. [562]  Image  Neural Networks  MixnMatch Calibration  (the standard square loss)  
Lakshminarayanan et al. [256]  Image  Neural Networks  Deep Ensembles  
Jain et al. [204]  Image and Protein DNA binding  Deep Ensembles  Maximize Overall Diversity (MOD)  
Gustafsson et al. [153]  Video  Neural Networks  Scalable BDL  Regression: , Classification:  
Chua et al. [72]  Robotics (video)  Neural Networks  Probabilistic ensembles with trajectory sampling (PETS)  
Hu et al. [189]  Image and tabular data  Neural Networks  marginbased Pareto deep ensemble pruning (MBPEP)  
Malinin et al. [308]  Image  Neural Networks  Ensemble Distribution Distillation ()  
Ashukha et al. [20]  Image  Neural Networks  Deep ensemble equivalent score (DEE)  
Pearce et al. [374]  Tabular data  Neural Networks  QualityDriven Ensembles (QDEns)  
Ambrogioni et al. [9]  Tabular data  Bayesian logistic regression  Wasserstein variational gradient descent (WVG)  
Hu et al. [191]  Image  Neural Networks  Biasvariance decomposition 
6.1 Deep Ensemble
Deep ensembles are another powerful method used to measure uncertainty and have been extensively applied in many real-world applications [189]. To achieve good learning results, the data distribution of the test set should be as close as possible to that of the training set. In many situations, the distributions of test sets are unknown, especially in uncertainty prediction problems; hence, it is difficult for traditional learning models to yield competitive performance. Some researchers applied MCMC and BNNs, which rely on prior distributions over the data, to work out uncertainty prediction problems [204]. When these approaches are employed in large networks, they become computationally expensive. Model ensembling is an effective technique for enhancing the predictive performance of supervised learners. Deep ensembles are applied to obtain better predictions on test data and also produce model uncertainty estimates when learners are provided with OoD data. The success of ensembles depends on the variance reduction obtained by combining predictions that are individually prone to several types of errors. Hence, improved predictions are achieved by utilizing a large ensemble with numerous base models, and such ensembles also generate distributional estimates of model uncertainty. A deep ensemble echo state network (DEESN) model with two versions for spatio-temporal forecasting and associated uncertainty measurement was presented in [319]. The first framework applies a bootstrap ensemble approach, and the second is devised within a hierarchical Bayesian framework. Multiple levels of uncertainty and non-Gaussian data types were accommodated by the general hierarchical Bayesian approach. The authors in [319] broadened several components of the deep ESN technique presented by Antonelo et al. [15] and Ma et al. [299] to fit a spatio-temporal ensemble approach in the DEESN model.
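A minimal numpy sketch of how a deep ensemble in the style of [256] combines per-network Gaussian predictions is shown below; the per-network means and variances are hypothetical values for illustration.

```python
import numpy as np

def ensemble_predict(means, variances):
    """Combine the per-network Gaussian predictions of a deep ensemble
    into a single mixture mean and variance."""
    means, variances = np.asarray(means), np.asarray(variances)
    mu = means.mean(axis=0)
    # Law of total variance: average aleatoric part + disagreement part
    var = (variances + means**2).mean(axis=0) - mu**2
    return mu, var

# Hypothetical predictions from M = 3 independently trained networks
means = [[1.0, 0.0], [1.2, 0.0], [0.8, 0.0]]
variances = [[0.1, 0.1], [0.1, 0.1], [0.1, 0.1]]
mu, var = ensemble_predict(means, variances)
print(mu)   # ensemble mean
print(var)  # disagreement on the first output inflates its variance
```

The disagreement term is what lets a deep ensemble flag OoD inputs: networks trained from different initializations tend to diverge precisely where training data are scarce.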
As in the previous section, we summarise a few loss functions of deep ensembles in Table IV.
Study  Dataset type  Base classifier(s)  Method’s name  Loss equation  Code 

Fan et al. [113]  GPSlog  Neural Networks  Online Deep Ensemble Learning (ODEL)  
Yang et al. [538]  Smart grid  Kmeans  Least absolute shrinkage and selection operator (LASSO)  
van Amersfoort et al. [492]  Image  Neural Networks  Deterministic UQ (DUQ) 
6.2 Deep Ensemble Bayesian
The expressive power of various ensemble techniques has been extensively shown in the literature. However, traditional learning techniques suffer from several drawbacks and limitations, as listed in [117]. To overcome these limitations, Fersini et al. [117] utilized an ensemble learning approach to mitigate the noise sensitivity related to language ambiguity so that more accurate polarity predictions can be obtained. The proposed ensemble method employed Bayesian model averaging, where both the reliability and the uncertainty of each single model were considered. The study in [373] presented an alteration to the prevailing approximate Bayesian inference by regularizing parameters about values drawn from a distribution that could be set equal to the prior. The analysis of the process suggested that the recovered posterior was centered correctly but tended to have overestimated correlations and underestimated marginal variances. One of the most promising frameworks for obtaining uncertainty estimates is deep Bayesian active learning (DBAL) with MC dropout. Pop et al. [385] argued that, in variational inference methods, the mode collapse phenomenon was responsible for the overconfident predictions of DBAL methods. They devised deep ensemble BAL, which addressed the mode collapse issue and improved the MC dropout method. In another study, Pop et al. [386] proposed a novel AL technique especially for DNNs, in which the statistical properties and expressive power of model ensembles were employed to enhance the state-of-the-art deep BAL technique, which suffered from the mode collapse problem. In another work, Pearce et al. [371] proposed a new, approximately Bayesian ensembling approach for NNs. The proposed approach regularises the parameters about values drawn from a distribution.
6.3 Uncertainty Quantification in Traditional Machine Learning domain using Ensemble Techniques
It is worth noting that UQ in traditional machine learning algorithms has been extensively studied using different ensemble techniques and several other UQ methods (see, e.g., [489]) in the literature. However, due to page limitations, we only summarize some of the ensemble techniques (as UQ methods) used in the traditional machine learning domain. For example, Tzelepis et al. [489] proposed a maximum margin classifier to deal with uncertainty in the input data. The proposed model, named SVM-GSU (SVM with Gaussian Sample Uncertainty), is applied to classification tasks using the SVM (Support Vector Machine) algorithm with multidimensional Gaussian distributions, and is illustrated in Fig. 18.
In another study, Pereira et al. [375] examined various techniques for transforming classifiers into uncertainty methods where predictions are harmonized with probability estimates and their uncertainty. They applied various uncertainty methods: Venn-ABERS predictors, Conformal Predictors, Platt Scaling and Isotonic Regression. Partalas et al. [365] presented a novel measure, called Uncertainty Weighted Accuracy (UWA), for ensemble pruning through directed hill climbing that accounts for the uncertainty of the current ensemble decision. The experimental results demonstrated that the new measure significantly enhances accuracy when pruning a heterogeneous ensemble compared to baseline methods and other state-of-the-art measures. Peterson et al. [377] examined different types of errors that might creep into atomistic machine learning and addressed how uncertainty analysis validates machine-learning predictions. They applied a bootstrap ensemble of neural network-based calculators and showed that the width of the ensemble can provide an approximation of the uncertainty.
7 Further Studies of UQ Methods
In this section, we cover further methods used to estimate uncertainty. In this regard, we present a summary of the proposed methods rather than their theoretical details. Due to page limitations and the large number of references, we are not able to review all the details of each method; we therefore recommend that readers consult the corresponding reference for more details where needed.
OoD samples are a common source of error in machine and deep learning systems when the test data are drawn from a distribution different from that of the training data. To address this issue, Ardywibowo et al. [17] introduced a new UA architecture search framework called NADS. The proposed NADS finds an appropriate distribution over architectures that perform well on a given task. A block diagram of the architecture search space is presented in Fig. 19. Unlike previous architecture design methods, NADS makes it possible to identify common building blocks among all the UA architectures.
On the other hand, the cost functions for uncertainty-oriented neural networks (NNs) do not always converge, and a converged NN does not always generate optimized prediction intervals (PIs). With such cost functions, the convergence of training is uncertain and the cost functions are not customizable. To construct optimal PIs, Kabir et al. [222] presented a smooth, customizable cost function for training NNs. The PI coverage probability (PICP), PI-failure distances and the optimized average width of PIs were computed to lessen the variation in PI quality, enhance the convergence probability and speed up training. They tested their method on electricity demand and wind power generation data. In the case of non-Bayesian deep neural classification, uncertainty estimation methods introduce biased estimates for instances whose predictions are highly accurate. Geifman et al. [135] argued that this limitation occurs because of the dynamics of training with SGD-like optimizers, which exhibit characteristics similar to overfitting. They proposed an uncertainty estimation method that computes the uncertainty of highly confident points by utilizing snapshots of the trained model taken before their estimates are jittered. The proposed algorithm outperformed all well-known techniques. In another research, Tagasovska et al. [462] proposed single-model estimates of epistemic and aleatoric uncertainty for DNNs. To assess aleatoric uncertainty, they suggested a loss function called Simultaneous Quantile Regression (SQR) to discover the conditional quantiles of a target variable; well-calibrated prediction intervals can be derived from these quantiles. To estimate epistemic uncertainty, they devised Orthonormal Certificates (OCs), a collection of non-constant functions that map training samples to zero; OoD examples are mapped by these certificates to non-zero values.
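Quantile regression of the SQR kind rests on the pinball loss, whose minimizer is the requested conditional quantile. A minimal numpy sketch, in which a grid search stands in for gradient training and the Gaussian sample is an illustrative assumption:

```python
import numpy as np

def pinball_loss(y_true, y_pred, tau):
    """Pinball (quantile) loss; minimized when y_pred is the tau-quantile."""
    e = y_true - y_pred
    return float(np.mean(np.maximum(tau * e, (tau - 1.0) * e)))

# Estimate the 0.9 quantile of a standard normal sample by grid search.
rng = np.random.default_rng(2)
y = rng.normal(size=10_000)
candidates = np.linspace(-3, 3, 601)
losses = [pinball_loss(y, c, tau=0.9) for c in candidates]
q90 = float(candidates[int(np.argmin(losses))])
# q90 should sit near the true N(0, 1) 0.9-quantile (about 1.28)
```

Fitting two such quantiles (e.g. tau = 0.05 and tau = 0.95) yields a 90% prediction interval, which is exactly how SQR-style methods turn quantile estimates into calibrated intervals.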
van Amersfoort et al. [492, 493] presented a method for detecting and rejecting OoD data points when training a deterministic deep model with a single forward pass at test time. They exploited ideas from RBF networks to devise deterministic UQ (DUQ), which is presented in Fig. 20. They scaled training in this setting with a centroid-updating scheme and a new loss function. Their method can detect OoD data consistently by utilizing a gradient penalty to track changes in the input; it enhances deep ensembles and scales well to large datasets.
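The RBF intuition at the core of DUQ can be sketched as follows: confidence is the maximum RBF kernel value between a feature vector and the class centroids, and inputs far from every centroid are flagged as OoD. The centroids and test points below are illustrative stand-ins for quantities that DUQ learns and updates during training.

```python
import numpy as np

# Class centroids in feature space (in DUQ these are learned and updated
# with an exponential moving average; fixed here for illustration).
centroids = np.array([[0.0, 0.0], [4.0, 4.0]])

def rbf_scores(z, sigma=1.0):
    """Per-class RBF kernel values; the maximum acts as a confidence score."""
    d2 = ((z[None, :] - centroids) ** 2).sum(axis=1)
    return np.exp(-d2 / (2 * sigma ** 2))

in_dist = float(rbf_scores(np.array([0.2, -0.1])).max())  # near a centroid
ood = float(rbf_scores(np.array([10.0, -8.0])).max())     # far from all
# in_dist is high while ood is near zero, so the second input is rejected
```

A single deterministic forward pass thus suffices at test time, which is the efficiency argument made for DUQ over ensembles.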
Tagasovska et al. [461] demonstrated frequentist estimates of epistemic and aleatoric uncertainty for DNNs. For aleatoric uncertainty, they proposed a loss function, simultaneous quantile regression, to estimate all the conditional quantiles of a given target variable, from which well-calibrated prediction intervals can be measured. For epistemic uncertainty, they proposed a collection of non-trivial diverse functions that map all training samples to zero, dubbed training certificates; the certificates signal high epistemic uncertainty by mapping OoD examples to non-zero values. By using Bayesian deep networks, it is possible to know what DNNs do not know in domains where safety is a major concern: flawed decisions may incur severe penalties in domains such as autonomous driving, security and medical diagnosis. Traditional approaches, however, do not scale to complex, large neural networks. Mobiny et al. [330] proposed an approach that imposes a Bernoulli distribution on the model weights to approximate Bayesian inference for DNNs. Their framework, dubbed MC-DropConnect, delivered model uncertainty with only a small alteration in the model structure or computational cost. They validated their technique on various datasets and architectures for semantic segmentation and classification tasks, and also introduced novel uncertainty quantification metrics. Their experimental results showed considerable enhancements in uncertainty estimation and prediction accuracy compared to prior approaches.
Uncertainty measures are crucial estimation tools in machine learning: they can be used to evaluate the similarity and dependence between two feature subsets and to verify the importance of features in clustering and classification algorithms. In classical rough set theory, a few uncertainty measures exist for assessing a feature subset, including rough entropy, information entropy, roughness and accuracy. These measures are, however, suited to discrete-valued information systems and are not appropriate for real-valued datasets. Chen et al. [65] proposed the neighborhood rough set model, in which each object is associated with a neighborhood subset, dubbed a neighborhood granule. Different uncertainty measures of neighborhood granules were introduced, namely information granularity, neighborhood entropy, information quantity and neighborhood accuracy. Further, they confirmed that these uncertainty measures satisfy monotonicity, invariance and non-negativity. In neighborhood systems, their experimental results and theoretical analysis demonstrated that information granularity, neighborhood entropy and information quantity performed better than the neighborhood accuracy measure. On the other hand, reliable and accurate machine learning systems depend on techniques for reasoning under uncertainty. Bayesian methods provide a framework for UQ, but Bayesian uncertainty estimates are often imprecise because of approximate inference and model misspecification. Kuleshov et al. [249] devised a simple method for calibrating any regression algorithm; given enough data, it is guaranteed to provide calibrated uncertainty estimates when applied to probabilistic and Bayesian models. They assessed their technique on feedforward and recurrent neural networks as well as Bayesian linear regression, and found that it output well-calibrated credible intervals while enhancing performance on model-based RL and time series forecasting tasks.
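The diagnosis behind such recalibration can be illustrated with a short numpy sketch: an overconfident forecaster's nominal probability levels are mapped to the empirical frequencies actually observed, exposing the miscoverage. The Gaussian forecaster and data below are illustrative assumptions, not the experiments of [249].

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(3)

# An overconfident forecaster: it predicts N(0, 0.5) while the data are
# actually N(0, 1), so its credible intervals are too narrow.
y = rng.normal(size=5000)
pred_cdf = np.array([0.5 * (1 + erf(v / (0.5 * sqrt(2)))) for v in y])

def empirical_level(p):
    """Recalibration map: the empirical frequency with which the
    forecaster's CDF values fall at or below the nominal level p."""
    return float(np.mean(pred_cdf <= p))

# The nominal 90% central interval actually covers far less than 90%
# of the data, which the recalibration map exposes.
coverage_90 = empirical_level(0.95) - empirical_level(0.05)
```

Fitting a monotone regressor (e.g. isotonic regression) to this map, and composing it with the forecaster's CDF, produces recalibrated intervals in the spirit of Kuleshov et al.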
Gradient-based optimization techniques have shown their efficacy in learning overparameterized and complex neural networks from non-convex objectives. Nevertheless, the precise theoretical relationship between gradient-based optimization methods, the training dynamics they induce, and generalization in DNNs is still unclear. Rudner et al. [416] examined the training dynamics of overparameterized neural networks under natural gradient descent. They demonstrated that the discrepancy between the functions obtained from non-linearized and linearized natural gradient descent is smaller than for standard gradient descent, and showed empirically that no limit argument about the width of the network layers is needed, as the discrepancy is already small for overparameterized networks. Finally, they demonstrated that the discrepancy was small on a set of regression benchmark problems, and that their theoretical results were consistent with the empirical discrepancy between the functions obtained from non-linearized and linearized natural gradient descent. Patro et al. [368] devised gradient-based certainty estimates with visual attention maps, addressing the visual question answering (VQA) task. The gradients for the estimates were enhanced by incorporating probabilistic deep learning techniques. There are two key advantages: 1. the certainty estimates correlate better with misclassified samples, and 2. state-of-the-art results are obtained by improving attention maps so that they correlate with human attention regions. The enhanced attention maps consistently improved different VQA techniques. Improved certainty estimates and explanations of deep learning techniques can be achieved through the presented method. They provided empirical results on all VQA benchmarks and compared against standard techniques.
BNNs have been used as a solution for neural network predictions, but specifying their prior is still an open challenge. An independent normal prior in weight space leads to weak constraints on the function posterior, permitting it to generalize in unanticipated ways on inputs outside the training distribution. Hafner et al. [156] presented noise contrastive priors (NCPs) to estimate consistent uncertainty. The key idea was to train the model to output elevated uncertainty for data points outside the training distribution. NCPs rely on an input prior, which adds noise to the inputs of the current mini-batch, and an output prior, which is a wide distribution given these inputs. NCPs restricted overfitting outside the training distribution and produced useful uncertainty estimates for AL. BNNs with latent variables are flexible and scalable probabilistic models: latent variables can capture complex noise patterns in the data, while the network weights account for model uncertainty. Depeweg et al. [90] exhibited a decomposition of predictive uncertainty into its aleatoric and epistemic components for decision support systems. This enabled them to detect informative points for AL of functions with bimodal and heteroscedastic noise. Applying the decomposition, they further described a new risk-sensitive criterion to recognize policies for RL that balance expected cost, model bias and noise aversion.
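For categorical predictions, such a decomposition has a simple information-theoretic form: the entropy of the mean prediction splits into the mean entropy of the members (aleatoric) plus the mutual information between prediction and model (epistemic). A minimal numpy sketch over an illustrative two-member posterior sample:

```python
import numpy as np

def entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def decompose(member_probs):
    """Split total predictive uncertainty: total = aleatoric + epistemic,
    given one predictive distribution per posterior sample (row)."""
    total = float(entropy(member_probs.mean(axis=0)))   # entropy of the mean
    aleatoric = float(entropy(member_probs).mean())     # mean of the entropies
    return total, aleatoric, total - aleatoric          # epistemic = mutual info

# Members agree on a 50/50 prediction: all uncertainty is aleatoric.
t1, a1, e1 = decompose(np.array([[0.5, 0.5], [0.5, 0.5]]))
# Members confidently disagree: uncertainty is mostly epistemic.
t2, a2, e2 = decompose(np.array([[0.99, 0.01], [0.01, 0.99]]))
```

The second case is exactly the signature used to pick informative points in AL: the model is sure on each draw yet the draws disagree.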
Uncertainty modelling in DNNs remains an open problem despite advances in the area. BNNs, in which the prior over network weights is a design choice, are a powerful solution; frequently a normal distribution, or another distribution that promotes sparsity, is used. However, such priors are agnostic to the generative process of the input data, which may lead to unwarranted generalization on OoD test data. Rohekar et al. [413] suggested a confounder for the relation between the input data and the discriminative function, given the target label, and proposed modelling this confounder by sharing neural connectivity patterns between the discriminative and generative networks. Hence, a novel deep architecture was framed in which networks are coupled into a compact hierarchy and sampled from the posterior of local causal structures (see Fig. 21).
They showed that networks could be sampled from the hierarchy efficiently, in proportion to their posterior, and that different types of uncertainty could be estimated. Learning unbiased models on imbalanced datasets is a challenging job: the generalization of learned boundaries to novel test examples is hindered by the concentrated representation of rare classes in the classification space. Khan et al. [235] showed that the difficulty level of individual samples and the rarity of classes have a direct correlation with Bayesian uncertainty estimates. They presented a new approach to uncertainty-based class imbalance learning that exploited two insights: 1. in rare (uncertain) classes, the classification boundaries should be broadened to avoid overfitting and to improve generalization; 2. each sample's uncertainty was modelled by a multivariate Gaussian distribution with a mean vector and a covariance matrix. The learned boundaries should thus account both for individual samples and for their distribution in the feature space. Class and sample uncertainty information was used to obtain generalizable classification techniques and robust features. They formulated a loss function for max-margin learning based on a Bayesian uncertainty measure. Their technique exhibited key performance enhancements on six benchmark datasets for skin lesion detection, digit/object classification, attribute prediction and face verification.
Neural networks do not measure uncertainty meaningfully, as they tend to be overconfident on incorrectly labelled, noisy or unseen data. BDL overcomes this limitation with variational approximations such as Multiplicative Normalising Flows or Bayes by Backprop (BBB); however, current methods have shortcomings regarding scalability and flexibility. Pawlowski et al. [370] proposed a novel variational approximation technique, termed Bayes by Hypernet (BbH), that interprets hypernetworks as implicit distributions. It naturally scales to deep learning architectures and utilizes neural networks to model arbitrarily complex distributions. Their method was robust against adversarial attacks and yielded competitive accuracies. On the other hand, recent deep learning models record significant increases in prediction accuracy, but at a growing cost of rendering predictions. Wang et al. [513] speculated that many recent deep learning models tend to 'overthink' simple real-world inputs. They proposed the 'I Don't Know' (IDK) prediction cascades approach, which systematically builds a set of pretrained models to speed up inference without a loss in prediction accuracy. They introduced two search-based techniques for producing the cascades as well as a new cost-aware objective. Their IDK cascade approach can be adopted with an existing model without further retraining. They tested its efficacy on a variety of benchmarks.
Yang et al. [539] proposed a deep learning approach for propagating and quantifying uncertainty in models governed by nonlinear differential equations, inspired by physics-informed neural networks. Probabilistic representations of the system states were produced by latent variable models, while their predictions were constrained to satisfy physical laws described by partial differential equations; an adversarial inference method was put forward for training on data. Such physics-informed constraints provide a regularization approach for efficiently training deep generative models as surrogates of physical models, where training datasets are usually small and data acquisition is costly. The framework characterized the outputs of physical systems affected by noise in the observations or randomness in the inputs, bypassing the need to sample costly experiments or numerical simulators. They proved the efficacy of their method via a series of examples demonstrating uncertainty propagation in nonlinear conservation laws and the detection of constitutive laws. For autonomous driving, 3D scene flow estimation techniques recover both the 3D motion and the 3D geometry of a scene. Brickwedde et al. [50] devised a new monocular 3D scene flow estimation technique, dubbed Mono-SF, that assesses both the motion and the 3D structure of the scene by integrating single-view depth information and multi-view geometry. A CNN, termed ProbDepthNet, was devised to estimate single-view depth in a statistical manner, and a new recalibration technique for regression problems was presented to guarantee well-calibrated distributions. Both the ProbDepthNet design and the Mono-SF method proved effective in comparison to state-of-the-art approaches.
Mixup is a DNN training technique in which additional samples are produced during training by convexly combining random pairs of images and their labels. The method has demonstrated its effectiveness in improving image classification performance. Thulasidasan et al. [474] investigated the predictive uncertainty and calibration of models trained with mixup, and revealed that DNNs trained with mixup are notably better calibrated than those trained in the regular fashion. They tested their technique on large datasets and observed that mixup-trained models are less prone to overconfident predictions on random-noise and OoD data. Label smoothing in mixup-trained DNNs played a crucial role in enhancing calibration, and they concluded that training with hard labels causes the overconfidence observed in neural networks. Explaining black-box machine learning models can improve their transparency, fairness and reliability, but the considerable uncertainty exhibited by such explanations raises concerns about model robustness and users' trust. Zhang et al. [568] illustrated the incidence of three sources of uncertainty, viz. randomness in the sampling procedure, variation with sampling proximity, and variation in explained model credibility across different data points, concentrating on a specific local explanation technique called Local Interpretable Model-Agnostic Explanations (LIME). Even black-box models with high accuracy yielded uncertain explanations. They tested the uncertainty in the LIME technique on two publicly available datasets and on synthetic data.
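Mixup itself is a one-liner. The sketch below, with toy tensors of hypothetical shape, shows how a convex combination of two inputs yields a soft label; training on such soft targets is what discourages overconfident predictions.

```python
import numpy as np

rng = np.random.default_rng(4)

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Convexly combine a random pair of inputs and their one-hot labels."""
    lam = rng.beta(alpha, alpha)   # mixing coefficient, as in mixup
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

x_a, y_a = rng.random((3, 3)), np.array([1.0, 0.0])  # toy "image", one-hot
x_b, y_b = rng.random((3, 3)), np.array([0.0, 1.0])
x_mix, y_mix = mixup(x_a, y_a, x_b, y_b)
# y_mix is a soft label (e.g. mostly class 0, a little class 1)
```

With a small alpha the Beta draw is usually near 0 or 1, so most mixed samples stay close to a real example while still supplying label smoothing.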
The deployment of DNNs in safety-critical environments is severely restricted by the incidence of even small adversarial perturbations. Sheikholeslami et al. [435] devised a randomized approach to identify such perturbations, based on minimum-uncertainty metrics obtained by sampling at the hidden layers during DNN inference; adversarially corrupted inputs are identified from the sampling probabilities. The new adversary detector can exploit any pretrained DNN at no additional training cost. By choosing which units to sample per hidden layer, the output uncertainty of the DNN can be quantified from the BNN perspective, with layer-wise components contributing to the overall uncertainty. Low-complexity approximate solvers were obtained by simplifying the objective function; these approximations connect state-of-the-art randomized adversarial detectors with the new approach, in addition to delivering meaningful insights. Moreover, consistency loss between different predictions under random perturbations is the basis of one of the most effective strategies in semi-supervised learning. For a successful student model, the teacher's pseudo labels must be of good quality, otherwise the learning process will suffer; yet prevailing models do not evaluate the quality of the teacher's pseudo labels. Li et al. [274] presented a new certainty-driven consistency loss (CCL) that employs predictive uncertainty information in the consistency loss so that students learn dynamically from reliable targets. They devised two strategies, Temperature CCL and Filtering CCL, to either pay less attention to uncertain predictions or filter them out in the consistency regularization, and integrated the two strategies, termed FT-CCL, to enhance the consistency learning approach. FT-CCL demonstrated robustness to noisy labels and improvements on a semi-supervised learning job. They also presented a new mutual learning technique in which one student is detached from its teacher and gains additional knowledge from another student's teacher.
Englesson et al. [110] introduced a modified knowledge distillation method to achieve computationally efficient uncertainty estimates with deep networks, aiming to yield competitive uncertainty estimates for both in- and out-of-distribution samples. Their major contributions were as follows: 1. adapting and demonstrating distillation's regularization effect, 2. presenting a novel target teacher distribution, 3. enhancing OoD uncertainty estimates with a simple augmentation method, and 4. executing a widespread set of experiments to shed light on the distillation method. On the other hand, Bayesian inference provides well-calibrated uncertainty and accurate full predictive distributions, but the high dimensionality of the parameter space limits the scaling of Bayesian inference methods to DNNs. Izmailov et al. [202] designed low-dimensional subspaces of the parameter space that comprise diverse sets of high-performing models. They applied variational inference and elliptical slice sampling in these subspaces. By exploiting Bayesian model averaging over the induced posterior in the subspaces, their method yielded well-calibrated predictive uncertainty and accurate predictions for both image classification and regression.
Csáji et al. [78] introduced a data-driven strategy for the uncertainty quantification of models based on kernel techniques. The method requires only a few mild regularity conditions on the noise rather than distributional assumptions such as exponential families or GPs. The uncertainty about the model can be estimated by perturbing the residuals in the gradient of the objective function. They devised an algorithm yielding exact, distribution-free, non-asymptotically guaranteed confidence regions for the ideal, noise-free representation of the estimated function. For symmetric noises and typical convex quadratic problems, the regions are star convex, centred on a given estimate, and ellipsoidal outer approximations can be computed efficiently. On the other hand, uncertainty estimates can be improved during the pretraining process. Hendrycks et al. [173] demonstrated that pretraining enhances uncertainty estimates and model robustness even though it might not improve classification metrics. They showed the key gains of pretraining through empirical experiments on confidence calibration, OoD detection, class imbalance, label corruption and adversarial examples. Their adversarial pretraining method demonstrated an improvement of approximately 10% over existing methods in adversarial robustness. Pretraining, without any task-specific techniques, surpassed the state-of-the-art, highlighting the need to consider pretraining when examining future techniques for uncertainty and robustness.
High-risk domains require trustworthy confidence estimates from predictive models. Deep latent variable models suffer from the rigid variational distributions utilized for tractable inference, which err on the side of overconfidence. Veeling et al. [499] devised Stochastic Quantized Activation Distributions (SQUAD), which implements a tractable yet flexible distribution over discretized latent variables. The presented technique is sample efficient, self-normalizing and scalable. Their method yielded predictive uncertainty of high quality, learnt interesting non-linearities, and made full use of the flexible distribution. Multi-task learning (MTL) is another domain in which the impact of uncertainty methods can be considered. For example, MTL has demonstrated its efficacy for MR-only radiotherapy planning, as it can jointly automate the contouring of organs-at-risk (a segmentation task) and the simulation of a synthetic CT (synCT) scan (a regression task) from MRI scans. Bragman et al. [49] suggested utilizing a probabilistic deep-learning technique to estimate both parameter and intrinsic uncertainty: parameter uncertainty was estimated through approximate Bayesian inference, whilst intrinsic uncertainty was modelled using a heteroscedastic noise technique. This yielded an approach for measuring uncertainty over the task predictions and for data-driven adaptation of the task losses on a voxel-wise basis. They demonstrated competitive performance on the segmentation and regression of prostate cancer scans. More information can be found in Tables V and VI.
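One common way to realize such uncertainty-driven task-loss adaptation is homoscedastic uncertainty weighting in the style of Kendall et al., sketched below. The loss form and constants here are illustrative assumptions for a segmentation-plus-regression pair, not the exact objective of [49].

```python
import numpy as np

def multitask_loss(seg_loss, reg_loss, log_var_seg, log_var_reg):
    """Uncertainty-weighted multi-task objective: each task loss is scaled
    by a learned precision exp(-log_var), with the log-variance terms acting
    as regularizers that stop the variances from growing without bound."""
    return (np.exp(-log_var_seg) * seg_loss + log_var_seg
            + 0.5 * np.exp(-log_var_reg) * reg_loss + 0.5 * log_var_reg)

# A task whose learned log-variance grows contributes less to the total.
equal_weight = multitask_loss(10.0, 1.0, 0.0, 0.0)
down_weighted = multitask_loss(10.0, 1.0, 2.0, 0.0)  # noisy first task
```

In practice the log-variances are trainable parameters optimized jointly with the network, so noisier tasks are automatically down-weighted.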
UQ category: Studies
Bayesian: Balan et al. [25] (BPE: Bayesian parameter estimation), Houthooft et al. [188] (VIME: VI Maximizing Exploration), Springenberg et al. [448], Ilg et al. [201], Heo et al. [177], Henderson et al. [172], Ahn et al. [6], Zhang et al. [563], Sensoy et al. [433], Khan et al. [234], Acerbi [3] (VBMC: Variational Bayesian Monte Carlo), Haußmann et al. [165], Gong et al. [143], De Ath et al. [86], Foong et al. [123], Hasanzadeh et al. [164], Chang et al. [60], Stoean et al. [451], Xiao et al. [529], Repetti et al. [407], Tóthová et al. [479], Moss et al. [333], Dutordoir et al. [104], Luo et al. [298], Gafni et al. [125], Jin et al. [211], Han et al. [158], Stoean et al. [452], Oh et al. [356], Dusenberry et al. [101], Havasi et al. [168], Krishnan et al. [246] (MOPED: MOdel Priors with Empirical Bayes using DNN), Jesson et al. [208], Filos et al. [120], Huang et al. [195], Amit and Meir [12], Bhattacharyya et al. [34], Yao et al. [540], Laves et al. [258] (UCE: uncertainty calibration error), Yang et al. [536] (OC-BNN: Output-Constrained BNN), Thakur et al. [472] (LUNA: Learned UA), Yacoby et al. [534] (NCAI: Noise Constrained Approximate Inference), Masood and Doshi-Velez [315] (PVI: Particle-based VI), Abdolshah et al. [2] (MOBO: Multi-objective Bayesian optimisation), White et al. [521] (BO), Balandat et al. [26] (BOTORCH), Galy-Fajou et al. [131] (CMGGPC: Conjugate multi-class GP classification), Lee et al. [262] (BTAML: Bayesian Task Adaptive Meta Learning), Vadera and Marlin [490] (BDK: Bayesian Dark Knowledge), Siahkoohi et al. [441] (SGLD: stochastic gradient Langevin dynamics), Sun et al. [457], Patacchiola et al. [367], Cheng et al. [66], Caldeira and Nord [57], Wandzik et al. [504] (MCSD: Monte Carlo Stochastic Depth), Deng et al. [89] (DBSN: DNN Structures), González-López et al. [144], Foong et al. [121] (ConvNP: Convolutional Neural Process), Yao et al. [542] (SI: Stacked inference), Prijatelj et al. [392], Herzog et al. [180], Prokudin et al. [393] (CVAE: conditional VAE), Tuo and Wang [487], Acerbi [4] (VBMC+EIG (expected information gain)/VIQR (variational interquantile range)), Zhao et al. [571] (GEP: generalized expectation propagation), Li et al. [273] (DBGP: deep Bayesian GP), He et al. [169] (NTK: Neural Tangent Kernel), Wang and Ročková [516] (Gaussian approximability)
Ensemble: Zhang et al. [563], Chang et al. [60], He et al. [169] (BDE: Bayesian Deep Ensembles), Schwab et al. [426], Smith et al. [446], Malinin and Gales [306], Jain et al. [205], Valdenegro-Toro [491], Juraska et al. [219], Oh et al. [357], Brown et al. [51], Salem et al. [421], Wen et al. [519]
Others: Jiang et al. [209] (Trust score), Qin et al. [395] (infoCAM: informative class activation map), Wu et al. [525] (Deep Dirichlet mixture networks), Qian et al. [394] (Margin preserving metric learning), Gomez et al. [142] (Targeted dropout), Malinin and Gales [305] (Prior networks), Dunlop et al. [100] (DGP: deep GP), Hendrycks et al. [175] (Self-supervision), Kumar et al. [251] (Scaling-binning calibrator), [176] (AugMix as a data processing approach), Możejko et al. [334] (Softmax output), Boiarov et al. [44] (SPSA: Simultaneous Perturbation Stochastic Approximation), Ye et al. [544] (Lasso bootstrap), Monteiro et al. [332] (SSNs: Stochastic Segmentation Networks), Maggi et al. [303] (Superposition semantics), Amiri et al. [11] (LCORPP: learning-commonsense reasoning and probabilistic planning), Sensoy et al. [432] (GEN: Generative Evidential Neural Network), Belakaria et al. [29] (USeMO: UA Search framework for optimizing Multiple Objectives), Liu et al. [287] (UaGGP: UA Graph GP), Northcutt et al. [351] (CL: Confident learning), Manders et al. [310] (Class Prediction Uncertainty Alignment), Chun et al. [73] (Regularization method), Mehta et al. [322] (Uncertainty metric), Liu et al. [282] (SNGP: Spectral-normalized Neural GP), Scillitoe et al. [427] (MFs: Mondrian forests), Ovadia et al. [361] (Dataset shift), Biloš et al. [37] (FD-Dir (Function Decomposition-Dirichlet) and WGP-LN (Weighted GP-Logistic-Normal)), Zheng and Yang [576] (MR: memory regularization), Zelikman et al. [558] (CRUDE: Calibrating Regression Uncertainty Distributions Empirically), Da Silva et al. [80] (RCMP: Requesting Confidence-Moderated Policy advice), Thiagarajan et al. [473] (Uncertainty matching), Zhou et al. [578] (POMBU: Policy Optimization method with Model-Based Uncertainty), Standvoss et al. [450] (RGNN: recurrent generative NN), Wang et al. [512] (TransCal: Transferable Calibration), Grover and Ermon [148] (UAE: uncertainty autoencoders), Cakir et al. [56, 55] (MI), Yildiz et al. [546] (Ordinary Differential Equation VAE), Titsias et al. [476] and Lee et al. [264] (GP), Ravi and Beatson [403] (AVI: Amortized VI), Lu et al. [293] (DGPM: DGP with Moments), Wang et al. [505] (NLE loss: negative log-likelihood error), Tai et al. [464] (UIA: UA imitation learning), Selvan et al. [431] (cFlow: conditional Normalizing Flow), Poggi et al. [381] (Self-Teaching), Cui et al. [79] (MMD: Maximum Mean Discrepancy)
As discussed earlier, the GP is a powerful technique for quantifying uncertainty. However, forming a Gaussian approximation to the posterior distribution is difficult for uncertainty estimation in large deep-learning models. In such scenarios, prevailing techniques generally resort to a diagonal approximation of the covariance matrix, even though these matrices yield poor uncertainty estimates. To tackle this issue, Mishkin et al. [327] designed a novel stochastic, low-rank, approximate natural-gradient (SLANG) technique for VI in large deep models. Their technique computes a 'diagonal plus low-rank' structure from the back-propagated gradients of the network log-likelihood. Their findings indicate that the proposed technique is effective in forming a Gaussian approximation to the posterior distribution. Indeed, the safety of AI systems can be enhanced by estimating uncertainty in predictions; such uncertainties arise from distributional mismatch between the training and test data, irreducible data uncertainty, and uncertainty in the model parameters. Malinin and Gales [305] devised a novel framework for predictive uncertainty, dubbed Prior Networks (PNs), that models distributional uncertainty explicitly by parameterizing a prior distribution over predictive distributions. Their work targeted uncertainty for classification and scrutinized PNs on the tasks of recognizing OoD samples and identifying misclassifications on the MNIST and CIFAR-10 datasets. Empirical results indicate that PNs, unlike non-Bayesian methods, can successfully discriminate between distributional and data uncertainty.
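PNs parameterize a Dirichlet over categorical predictions, so distributional uncertainty shows up as a low total concentration: the expected class probabilities can be uniform (as with pure data uncertainty) while the Dirichlet itself is flat. A minimal numpy sketch with illustrative concentration values, using inverse total concentration as a simple proxy rather than the mutual-information measures of [305]:

```python
import numpy as np

def dirichlet_uncertainty(alpha):
    """Expected class probabilities of a Dirichlet with concentration
    parameters alpha, plus a simple distributional-uncertainty proxy:
    the inverse total concentration (small alpha0 -> flat Dirichlet)."""
    alpha0 = float(alpha.sum())
    probs = alpha / alpha0
    return probs, alpha.size / alpha0

# Confident in-distribution input: sharp Dirichlet peaked on one class.
p_in, u_in = dirichlet_uncertainty(np.array([50.0, 1.0, 1.0]))
# OoD input: flat Dirichlet with tiny concentration, i.e. high
# distributional uncertainty despite uniform expected probabilities.
p_ood, u_ood = dirichlet_uncertainty(np.array([0.3, 0.3, 0.3]))
```

The softmax entropy of p_ood alone cannot distinguish this case from genuine data uncertainty; the concentration-based score can, which is the core argument for PNs over non-Bayesian baselines.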