# Aleatoric and Epistemic Uncertainty in Machine Learning: A Tutorial Introduction

Eyke Hüllermeier
Heinz Nixdorf Institute and Department of Computer Science
eyke@upb.de

Willem Waegeman
Ghent University, Department of Mathematical Modelling, Statistics and Bioinformatics
willem.waegeman@ugent.be

###### Abstract

The notion of uncertainty is of major importance in machine learning and constitutes a key element of machine learning methodology. In line with the statistical tradition, uncertainty has long been perceived as almost synonymous with standard probability and probabilistic predictions. Yet, due to the steadily increasing relevance of machine learning for practical applications and related issues such as safety requirements, new problems and challenges have recently been identified by machine learning scholars, and these problems may call for new methodological developments. In particular, this includes the importance of distinguishing between (at least) two different types of uncertainty, often referred to as aleatoric and epistemic. In this paper, we provide an introduction to the topic of uncertainty in machine learning as well as an overview of attempts so far at handling uncertainty in general and formalizing this distinction in particular.

## 1 Introduction

Machine learning is essentially concerned with extracting models from data and using these models to make predictions. As such, it is inseparably connected with uncertainty. Indeed, learning in the sense of generalizing beyond the data seen so far is necessarily based on a process of induction, i.e., replacing specific observations by general models of the data-generating process. Such models are never provably correct but only hypothetical and therefore uncertain, and the same holds true for the predictions produced by a model. In addition to the uncertainty inherent in inductive inference, other sources of uncertainty exist, including incorrect model assumptions and noisy or imprecise data.

Needless to say, a trustworthy representation of uncertainty is desirable and should be considered as a key feature of any machine learning method, all the more in safety-critical application domains such as medicine (Yang et al., 2009; Lambrou et al., 2011) or socio-technical systems (Varshney, 2016; Varshney and Alemzadeh, 2016). Besides, uncertainty is also a major concept within machine learning methodology itself; for example, the principle of uncertainty reduction plays a key role in settings such as active learning (Aggarwal et al., 2014; Nguyen et al., 2019), or in concrete learning algorithms such as decision tree induction (Mitchell, 1980).

Traditionally, uncertainty is modeled in a probabilistic way, and indeed, in fields like statistics and machine learning, probability theory has always been perceived as the ultimate tool for uncertainty handling. Without questioning the probabilistic approach in general, one may argue that conventional methods fail to distinguish two inherently different sources of uncertainty, which are often referred to as aleatoric and epistemic uncertainty (Hora, 1996). Roughly speaking, aleatoric (aka statistical) uncertainty refers to the notion of randomness, that is, the variability in the outcome of an experiment which is due to inherently random effects. The prototypical example of aleatoric uncertainty is coin flipping: The data-generating process in this type of experiment has a stochastic component that cannot be reduced by any additional source of information (except Laplace’s demon). Consequently, even the best model of this process will only be able to provide probabilities for the two possible outcomes, heads and tails, but no definite answer. As opposed to this, epistemic (aka systematic) uncertainty refers to uncertainty caused by a lack of knowledge, i.e., it refers to the epistemic state of the agent or decision maker. This uncertainty can in principle be reduced on the basis of additional information. For example, what does the word “kichwa” mean in the Swahili language, head or tail? The possible answers are the same as in coin flipping, and one might be equally uncertain about which one is correct. Yet, the nature of the uncertainty is different, as one could easily get rid of it. In other words, epistemic uncertainty refers to the reducible part of the (total) uncertainty, whereas aleatoric uncertainty refers to the non-reducible part.

In machine learning, where the agent is a learning algorithm, the two sources of uncertainty are usually not distinguished. In some cases, such a distinction may indeed appear unnecessary. For example, if an agent is forced to make a decision or prediction, the source of its uncertainty—aleatoric or epistemic—might actually be irrelevant. This argument is often put forward by Bayesians in favor of a purely probabilistic approach (and classical Bayesian decision theory). One should note, however, that this scenario does not always apply. Instead, the ultimate decision can often be refused or delayed, like in classification with a reject option (Hellman, 1970), or actions can be taken that are specifically directed at reducing uncertainty, like in active learning (Aggarwal et al., 2014).

Motivated by such scenarios, and advocating a trustworthy representation of uncertainty in machine learning, Senge et al. (2014) explicitly refer to the distinction between aleatoric and epistemic uncertainty. They propose a quantification of these uncertainties and show the usefulness of their approach in the context of medical decision making. A very similar motivation is given by Kull and Flach (2014) in the context of their work on reliability maps. They distinguish between a predicted probability score and the uncertainty in that prediction, and illustrate this distinction with an example from weather forecasting: “… a weather forecaster can be very certain that the chance of rain is 50 %; or her best estimate at 20 % might be very uncertain due to lack of data.” Roughly, the 50 % chance corresponds to what one may understand as aleatoric uncertainty, whereas the uncertainty in the 20 % estimate is akin to the notion of epistemic uncertainty. On the more practical side, Varshney and Alemzadeh (2016) give an example of a recent accident of a self-driving car, which led to the death of the driver (for the first time after 130 million miles of testing). They explain the car’s failure by the extremely rare circumstances, and emphasize “the importance of epistemic uncertainty or “uncertainty on uncertainty” in these AI-assisted systems”.

More recently, a distinction between aleatoric and epistemic uncertainty has also been advocated in the literature on deep learning (Kendall and Gal, 2017), where the limited awareness of neural networks of their own competence has been demonstrated quite nicely. For example, experiments on image classification have shown that a trained model often fails on specific instances, despite being very confident in its prediction (cf. Fig. 1). Moreover, such models often lack robustness and can easily be fooled by “adversarial examples” (Papernot and McDaniel, 2018): Drastic changes of a prediction may already be provoked by minor, actually unimportant changes of an object. This problem has been observed not only for images but also for other types of data, such as natural language text (cf. Fig. 2 for an example).

This paper provides an overview of machine learning methods for handling uncertainty, with a specific focus on the distinction between aleatoric and epistemic uncertainty in the common setting of supervised learning. This topic will be explained and analyzed in some detail in Section 2. Concrete approaches for modeling and handling uncertainty in machine learning are then discussed in Section 3, prior to concluding the paper in Section 4. In addition, Appendix A provides some general background on uncertainty modeling (independent of applications in machine learning), specifically focusing on the distinction between set-based and distributional (probabilistic) representations and emphasizing the potential usefulness of combining them.

Table 1 summarizes some important notation. Let us note that, for the sake of readability, a somewhat simplified notation will be used for probability measures and associated distribution functions. Most of the time, these will be denoted by $P$ and $p$, respectively, even if they refer to different measures and distributions on different spaces (which will be clear from the arguments). For example, we will write $p(h)$ and $p(y)$ for the probability (density) of hypotheses $h$ (as elements of the hypothesis space $\mathcal{H}$) and outcomes $y$ (as elements of the output space $\mathcal{Y}$), instead of using different symbols, such as $p_{\mathcal{H}}(h)$ and $p_{\mathcal{Y}}(y)$.

## 2 Uncertainty in machine learning

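Uncertainty occurs in various facets in machine learning, and different settings and learning problems will usually require a different handling from an uncertainty modeling point of view. Here, we consider the most standard setting of supervised learning (cf. Fig. 3), in which a learner is given access to a set of training data

$$
\mathcal{D} := \{(x_1, y_1), \ldots, (x_N, y_N)\} \subset \mathcal{X} \times \mathcal{Y}, \tag{1}
$$

where $\mathcal{X}$ is an instance space and $\mathcal{Y}$ the set of outcomes that can be associated with an instance. Typically, the training examples $(x_i, y_i)$ are assumed to be independent and identically distributed according to some unknown probability measure $P$ on $\mathcal{X} \times \mathcal{Y}$. Given a hypothesis space $\mathcal{H} \subset \mathcal{Y}^{\mathcal{X}}$ and a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, the goal of the learner is to induce a hypothesis $h^* \in \mathcal{H}$ with low risk (expected loss)

$$
R(h) := \int_{\mathcal{X} \times \mathcal{Y}} \ell(h(x), y) \, dP(x, y). \tag{2}
$$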

Thus, given the training data $\mathcal{D}$, the learner needs to “guess” a good hypothesis $h$. This choice is commonly guided by the empirical risk

$$
R_{emp}(h) := \frac{1}{N} \sum_{i=1}^{N} \ell(h(x_i), y_i), \tag{3}
$$

i.e., the performance of a hypothesis on the training data. However, since $R_{emp}(h)$ is only an estimate of the true risk $R(h)$, the hypothesis (empirical risk minimizer)

$$
\hat{h} := \operatorname*{argmin}_{h \in \mathcal{H}} R_{emp}(h) \tag{4}
$$

favored by the learner will normally not coincide with the true risk minimizer

$$
h^* := \operatorname*{argmin}_{h \in \mathcal{H}} R(h). \tag{5}
$$

Correspondingly, there remains uncertainty regarding $h^*$ as well as the approximation quality of $\hat{h}$ (in the sense of its proximity to $h^*$) and its true risk $R(\hat{h})$.
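The gap between the empirical risk minimizer (4) and the true risk minimizer (5) can be made concrete with a small simulation. The following is only a minimal sketch under assumptions that are not part of the paper: a hypothetical toy problem with instances uniform on $[0, 1)$, labels $1[x \geq 0.5]$ flipped with probability `NOISE`, the 0/1 loss, and a finite hypothesis space of threshold classifiers. For this toy distribution the true risk can be written in closed form, which is what `true_risk` does.

```python
import random

NOISE = 0.2                               # hypothetical label-noise rate (assumption)
THRESHOLDS = [i / 10 for i in range(11)]  # finite hypothesis space: h_t(x) = 1[x >= t]


def sample(n, rng):
    """Draw n i.i.d. examples: x uniform on [0, 1), label 1[x >= 0.5] flipped w.p. NOISE."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x >= 0.5)
        if rng.random() < NOISE:
            y = 1 - y
        data.append((x, y))
    return data


def empirical_risk(t, data):
    """Average 0/1 loss of h_t on the training data, cf. (3)."""
    return sum(int(x >= t) != y for x, y in data) / len(data)


def true_risk(t):
    """Expected 0/1 loss of h_t under the toy distribution, cf. (2).

    h_t disagrees with the noise-free labeling on a region of mass |t - 0.5|;
    there it errs with probability 1 - NOISE, elsewhere with probability NOISE.
    """
    return NOISE + (1 - 2 * NOISE) * abs(t - 0.5)


rng = random.Random(0)
data = sample(20, rng)
t_hat = min(THRESHOLDS, key=lambda t: empirical_risk(t, data))  # empirical risk minimizer (4)
t_star = min(THRESHOLDS, key=true_risk)                         # true risk minimizer (5)

print("t_hat =", t_hat, "R_emp =", empirical_risk(t_hat, data), "R =", true_risk(t_hat))
print("t_star =", t_star, "R_emp =", empirical_risk(t_star, data), "R =", true_risk(t_star))
```

With only 20 examples, the selected threshold need not equal the optimal one. By construction, its empirical risk is never larger and its true risk never smaller than that of the optimal threshold.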

Eventually, one is often interested in predictive uncertainty, i.e., the uncertainty related to the prediction $\hat{y}_q$ for a concrete query instance $x_q \in \mathcal{X}$. In other words, given a partial observation $(x_q, \cdot)$, we are wondering what can be said about the missing outcome, especially about the uncertainty related to a prediction of that outcome. Indeed, estimating and quantifying uncertainty in a transductive way, in the sense of tailoring it for individual instances, is arguably important and practically more relevant than a kind of average accuracy or confidence, which is often reported in machine learning. In medical diagnosis, for example, a patient will be interested in the reliability of a test result in her particular case, not in the reliability of the test on average. This view is also expressed, for example, by Kull and Flach (2014): “Being able to assess the reliability of a probability score for each instance is much more powerful than assigning an aggregate reliability score […] independent of the instance to be classified.”

### 2.1 Types of uncertainties

As the prediction $\hat{y}_q$ constitutes the end of a process that consists of different learning and approximation steps, all errors and uncertainties related to these steps may also contribute to the uncertainty about $\hat{y}_q$ (cf. Fig. 4):

• Since the dependency between $\mathcal{X}$ and $\mathcal{Y}$ is typically non-deterministic, the description of a new prediction problem in the form of an instance $x_q$ gives rise to a conditional probability distribution

$$
p(y \mid x_q) = \frac{p(x_q, y)}{p(x_q)} \tag{6}
$$

on $\mathcal{Y}$, but it normally does not identify a single outcome $y$ in a unique way. Thus, even given full information in the form of the measure $P$ (and its density $p$), uncertainty about the actual outcome $y$ remains. This uncertainty is of an aleatoric nature. In some cases, the distribution (6) itself (called the predictive posterior distribution in Bayesian inference) might be delivered as a prediction. Yet, when being forced to commit to point estimates, the best predictions (in the sense of minimizing the expected loss) are prescribed by the pointwise Bayes predictor $f^*$, which is defined by

$$
f^*(x) := \operatorname*{argmin}_{\hat{y} \in \mathcal{Y}} \int_{\mathcal{Y}} \ell(y, \hat{y}) \, dP(y \mid x) \tag{7}
$$

for each $x \in \mathcal{X}$.

• The Bayes predictor (5) does not necessarily coincide with the pointwise Bayes predictor (7). This discrepancy between $h^*$ and $f^*$ is connected to the uncertainty regarding the right type of model to be fit, and hence the choice of the hypothesis space $\mathcal{H}$ (which is part of what is called “background knowledge” in Fig. 3). We shall refer to this uncertainty as model uncertainty. Thus, due to this uncertainty, one cannot guarantee that $h^*(x) = f^*(x)$, or, in case the hypothesis $h^*$ (e.g., a probabilistic classifier) delivers probabilistic predictions $p(y \mid x, h^*)$ instead of point predictions, that $p(\cdot \mid x, h^*) = p(\cdot \mid x)$.

• The hypothesis $\hat{h}$ produced by the learning algorithm, for example the empirical risk minimizer (4), is only an estimate of $h^*$, and the quality of this estimate strongly depends on the quality and the amount of training data. We shall refer to the discrepancy between $\hat{h}$ and $h^*$, i.e., the uncertainty about how well the former approximates the latter, as approximation uncertainty.

As already said, aleatoric uncertainty refers to the irreducible part of the uncertainty, which is due to the non-deterministic nature of the sought input/output dependency. Model uncertainty and approximation uncertainty, on the other hand, are subsumed under the notion of epistemic uncertainty, that is, uncertainty due to a lack of knowledge about the perfect predictor (7). Obviously, this lack of knowledge will strongly depend on the amount of data seen so far: The larger the number $N$ of observations, the less ignorant the learner will be when having to make a new prediction. In the limit, when $N \to \infty$, a consistent learner will be able to identify $h^*$ (see Fig. 5 for an illustration).
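The pointwise Bayes predictor (7) can be illustrated for a finite outcome space, where the integral reduces to a sum. The sketch below is an illustration only: the conditional distribution `P_Y_GIVEN_X` and the helper names are hypothetical, and the distribution is assumed to be fully known, which is exactly the situation in which only aleatoric uncertainty remains.

```python
def bayes_prediction(cond_dist, loss):
    """Return (y_hat, expected loss) minimizing sum_y loss(y, y_hat) * p(y | x), cf. (7)."""
    def expected_loss(y_hat):
        return sum(p * loss(y, y_hat) for y, p in cond_dist.items())

    y_hat = min(cond_dist, key=expected_loss)
    return y_hat, expected_loss(y_hat)


def zero_one(y, y_hat):
    return float(y != y_hat)


def squared(y, y_hat):
    return float((y - y_hat) ** 2)


# Hypothetical, fully known conditional distribution p(y | x_q) on Y = {0, 1, 2}.
P_Y_GIVEN_X = {0: 0.4, 1: 0.3, 2: 0.3}

y_01, risk_01 = bayes_prediction(P_Y_GIVEN_X, zero_one)
y_sq, risk_sq = bayes_prediction(P_Y_GIVEN_X, squared)
print("0/1 loss:     prediction", y_01, "expected loss", risk_01)
print("squared loss: prediction", y_sq, "expected loss", risk_sq)
```

Two things are worth noting. The optimal point prediction depends on the loss (the mode for the 0/1 loss, a central value for the squared loss). The expected loss of the optimal prediction is also strictly positive, even though nothing about the distribution is unknown.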

For obvious reasons, model uncertainty is very difficult to capture, let alone quantify. In fact, the learning itself as well as all sorts of inference from the data are normally done under the assumption that the model is valid. Otherwise, since some assumptions are indeed always needed, it will be difficult to derive any useful conclusions (we will come back to this issue in Section 2.5). Still, even when taking the model assumptions for granted, and hence $h^* = f^*$, the approximation uncertainty remains a source of epistemic uncertainty.

In the following, we illustrate and further elaborate on the points made above by looking at two specific though arguably natural and important cases, namely version space learning and Bayesian inference. In both cases, an explicit distinction is made between the uncertainty about $h^*$, and how this uncertainty translates into uncertainty about the outcome for a query $x_q$. Bayesian inference can be seen as the main representative of probabilistic methods and provides a coherent framework for statistical reasoning that is well-established in machine learning (and beyond). Version space learning can be seen as a “logical” (and in a sense simplified) counterpart of Bayesian inference, in which hypotheses and predictions are not assessed numerically in terms of probabilities, but only qualified (deterministically) as being possible or impossible. In spite of its limited practical usefulness, version space learning is interesting for various reasons. In particular, in light of our discussion about uncertainty, it constitutes an interesting case, as it is free of aleatoric uncertainty, i.e., all uncertainty in version space learning is epistemic.

### 2.2 Version space learning

In the idealized setting of version space learning, we assume a deterministic dependency $f^* : \mathcal{X} \to \mathcal{Y}$, i.e., the distribution (6) degenerates to

$$
p(y \mid x_q) = \begin{cases} 1 & \text{if } y = f^*(x_q) \\ 0 & \text{if } y \neq f^*(x_q) \end{cases} \tag{8}
$$

Moreover, the training data (1) is free of noise. Correspondingly, we also assume that classifiers produce deterministic predictions $h(x) \in \{0, 1\}$ in the form of probabilities 0 or 1. Finally, we assume that $f^* \in \mathcal{H}$, and therefore $h^* = f^*$.

Under these assumptions, a hypothesis $h \in \mathcal{H}$ can be eliminated as a candidate as soon as it makes at least one mistake on the training data: in that case, the risk of $h$ is necessarily higher than the risk of $h^*$ (which is 0). The idea of the candidate elimination algorithm (Mitchell, 1977) is to maintain the version space $\mathcal{V} \subseteq \mathcal{H}$ that consists of the set of all hypotheses consistent with the data seen so far:

$$
\mathcal{V} = \mathcal{V}(\mathcal{H}, \mathcal{D}) := \{ h \in \mathcal{H} \mid h(x_i) = y_i \text{ for } i = 1, \ldots, N \} \tag{9}
$$

Obviously, the version space shrinks with an increasing amount of training data, i.e., $\mathcal{V}' = \mathcal{V}(\mathcal{H}, \mathcal{D}') \subseteq \mathcal{V}(\mathcal{H}, \mathcal{D}) = \mathcal{V}$ for $\mathcal{D} \subseteq \mathcal{D}'$.

If a prediction $\hat{y}_q$ for a query instance $x_q$ is sought, this query is submitted to all members of the version space. Obviously, a unique prediction can only be made if all members agree on the outcome of $x_q$. Otherwise, several outcomes may still appear possible. Formally, we can express the degree of possibility or plausibility of an outcome $y \in \mathcal{Y}$ as follows, where $[\![ \cdot ]\!]$ denotes the indicator function:

$$
\pi(y) := \max_{h \in \mathcal{H}} \min \big( [\![ h \in \mathcal{V} ]\!], [\![ h(x_q) = y ]\!] \big) \tag{10}
$$

Thus, $\pi(y) = 1$ if there exists a candidate hypothesis $h \in \mathcal{V}$ such that $h(x_q) = y$, and $\pi(y) = 0$ otherwise. In other words, the prediction produced in version space learning is a subset

$$
Y = Y(x_q) := \{ h(x_q) \mid h \in \mathcal{V} \} = \{ y \mid \pi(y) = 1 \} \subseteq \mathcal{Y} \tag{11}
$$

See Fig. 6 for an illustration.
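The construction of the version space (9) and of the prediction set (11) can be sketched in a few lines. The example below rests on illustrative assumptions that do not come from the paper: a hypothetical finite hypothesis space of integer threshold classifiers $h_t(x) = 1[x \geq t]$ and a tiny noise-free data set.

```python
HYPOTHESES = range(0, 11)  # hypothetical hypothesis space: thresholds t = 0, ..., 10


def h(t, x):
    """Threshold classifier h_t(x) = 1[x >= t]."""
    return int(x >= t)


def version_space(hypotheses, data):
    """All hypotheses consistent with every training example, cf. (9)."""
    return [t for t in hypotheses if all(h(t, x) == y for x, y in data)]


def prediction_set(vs, x_q):
    """Set of outcomes predicted by at least one member of the version space, cf. (11)."""
    return {h(t, x_q) for t in vs}


data = [(2, 0), (7, 1)]                 # noise-free training data
V = version_space(HYPOTHESES, data)
print("version space:", V)
for x_q in (1, 5, 9):
    print("query", x_q, "->", prediction_set(V, x_q))

# More data can only shrink the version space.
V_larger_sample = version_space(HYPOTHESES, data + [(4, 0)])
print("after one more example:", V_larger_sample)
```

Queries far from the undetermined region between the two training points receive a unique prediction, whereas a query inside that region yields the full set $\{0, 1\}$. This is the purely epistemic, instance-dependent uncertainty discussed in the text.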

Note that the inference (10) can be seen as a kind of constraint propagation, in which the constraint $h \in \mathcal{V}$ on the hypothesis space $\mathcal{H}$ is propagated to a constraint on $\mathcal{Y}$, expressed in the form of the subset (11) of possible outcomes; or symbolically:

$$
\mathcal{H}, \mathcal{D}, x_q \models Y \tag{12}
$$

This view highlights the interaction between prior knowledge and data: It shows that what can be said about the possible outcomes depends not only on the data $\mathcal{D}$ but also on the hypothesis space $\mathcal{H}$, i.e., the model assumptions the learner starts with. The specification of $\mathcal{H}$ always comes with an inductive bias, which is indeed essential for learning from data (Mitchell, 1980). In general, both aleatoric and epistemic uncertainty (ignorance) depend on the way in which prior knowledge and data interact with each other. Roughly speaking, the stronger the knowledge the learning process starts with, the less data is needed to resolve uncertainty. In the extreme case, the true model is already known, and data is completely superfluous. Normally, however, prior knowledge is specified by assuming a certain type of model, for example a linear relationship between inputs $x$ and outputs $y$. Then, all else (namely the data) being equal, the degree of predictive uncertainty depends on how flexible the corresponding model class is. Informally speaking, the more restrictive the model assumptions are, the smaller the uncertainty will be. This is illustrated in Fig. 7 for the case of binary classification.

Coming back to our discussion about uncertainty, it is clear that version space learning as outlined above does not involve any kind of aleatoric uncertainty. Instead, the only source of uncertainty is a lack of knowledge about $h^*$, which is hence of an epistemic nature. On the model level, the amount of uncertainty is in direct correspondence with the size of the version space $\mathcal{V}$ and reduces with an increasing sample size. Likewise, the predictive uncertainty could be measured in terms of the size of the set (11) of candidate outcomes. Obviously, this uncertainty may differ from instance to instance, or, stated differently, approximation uncertainty may translate into prediction uncertainty in different ways.

In version space learning, uncertainty is represented in a purely set-based manner: the version space $\mathcal{V}$ and prediction set $Y(x_q)$ are subsets of $\mathcal{H}$ and $\mathcal{Y}$, respectively. In other words, hypotheses and outcomes are only qualified in terms of being possible or not. In the following, we discuss the Bayesian approach, in which hypotheses and predictions are qualified more gradually in terms of probabilities.

### 2.3 Bayesian inference

Consider a hypothesis space $\mathcal{H}$ consisting of probabilistic predictors, that is, hypotheses $h$ that deliver probabilistic predictions $p_h(y \mid x) = p(y \mid x, h)$ of outcomes $y$ given an instance $x$. In the Bayesian approach, $\mathcal{H}$ is supposed to be equipped with a prior distribution $p(\cdot)$, and learning essentially consists of replacing the prior by the posterior distribution:

$$
p(h \mid \mathcal{D}) = \frac{p(h) \cdot p(\mathcal{D} \mid h)}{p(\mathcal{D})} \propto p(h) \cdot p(\mathcal{D} \mid h), \tag{13}
$$

where $p(\mathcal{D} \mid h)$ is the probability of the data given $h$ (the likelihood of $h$). Intuitively, $p(\cdot \mid \mathcal{D})$ captures the learner’s state of knowledge, and hence its epistemic uncertainty: The more “peaked” this distribution, i.e., the more concentrated the probability mass in a small region in $\mathcal{H}$, the less uncertain the learner is. Just like the version space $\mathcal{V}$ in version space learning, the posterior distribution on $\mathcal{H}$ provides global instead of per-instance information. For a given query instance $x_q$, this information may translate into quite different representations of the uncertainty regarding the prediction $\hat{y}_q$ (cf. Fig. 8).

More specifically, the representation of uncertainty about a prediction $\hat{y}_q$ is given by the image of the posterior $p(h \mid \mathcal{D})$ under the mapping $h \mapsto p(y \mid x_q, h)$ from hypotheses to probabilities of outcomes. This yields the predictive posterior distribution

$$
p(y \mid x_q) = \int_{\mathcal{H}} p(y \mid x_q, h) \, dP(h \mid \mathcal{D}). \tag{14}
$$

In this type of (proper) Bayesian inference, a final prediction is thus produced by model averaging: The predicted probability of an outcome $y \in \mathcal{Y}$ is the expected probability $p(y \mid x_q, h)$, where the expectation over the hypotheses is taken with respect to the posterior distribution on $\mathcal{H}$; or, stated differently, the probability of an outcome is a weighted average over its probabilities under all hypotheses in $\mathcal{H}$, with the weight of each hypothesis $h$ given by its posterior probability $p(h \mid \mathcal{D})$. Since model averaging is often difficult and computationally costly, sometimes only the single hypothesis

$$
h^{map} := \operatorname*{argmax}_{h \in \mathcal{H}} p(h \mid \mathcal{D}) \tag{15}
$$

with the highest posterior probability is adopted, and predictions on outcomes are derived from this hypothesis.

In (14), aleatoric and epistemic uncertainty are not distinguished any more, because epistemic uncertainty is “averaged out.” Consider the example of coin flipping, and let the hypothesis space be given by $\mathcal{H} := \{ h_\alpha \mid 0 \leq \alpha \leq 1 \}$, where $h_\alpha$ is modeling a biased coin landing heads up with a probability of $\alpha$ and tails up with a probability of $1 - \alpha$. According to (14), we derive a probability of $1/2$ for heads and tails, regardless of whether the (posterior) distribution on $\mathcal{H}$ is given by the uniform distribution (all coins are equally probable, i.e., the case of complete ignorance) or the one-point measure assigning probability 1 to $h_{1/2}$ (the coin is known to be fair with complete certainty):

$$
p(y) = \int_{\mathcal{H}} \alpha \, dP = \frac{1}{2} = \int_{\mathcal{H}} \alpha \, dP'
$$

for the uniform measure $P$ (with probability density function $p(\alpha) \equiv 1$) as well as the measure $P'$ with probability mass function $p'(\alpha) = 1$ if $\alpha = 1/2$ and $p'(\alpha) = 0$ for $\alpha \neq 1/2$. Obviously, MAP inference (15) does not capture epistemic uncertainty either.
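The coin example can be checked numerically. The sketch below makes one simplifying assumption that is not in the text: the continuum of coins is discretized to the grid $\alpha \in \{0, 0.01, \ldots, 1\}$, so that the integral in (14) becomes a finite sum. Exact rational arithmetic avoids rounding effects, and the function and variable names are hypothetical.

```python
from fractions import Fraction

# Discretized hypothesis space: coins h_alpha with alpha = 0, 0.01, ..., 1 (assumption).
ALPHAS = [Fraction(i, 100) for i in range(101)]


def predictive_heads(posterior):
    """Model averaging as in (14): posterior-weighted average of alpha."""
    return sum(weight * alpha for alpha, weight in posterior.items())


def posterior_variance(posterior):
    """Spread of the posterior over alpha; this information is lost in (14)."""
    mean = predictive_heads(posterior)
    return sum(weight * (alpha - mean) ** 2 for alpha, weight in posterior.items())


# Complete ignorance: every coin is equally probable.
uniform = {alpha: Fraction(1, len(ALPHAS)) for alpha in ALPHAS}
# Complete certainty: the coin is known to be fair.
point_mass = {alpha: Fraction(int(alpha == Fraction(1, 2))) for alpha in ALPHAS}

for name, posterior in (("uniform", uniform), ("point mass", point_mass)):
    print(name, "-> P(heads) =", predictive_heads(posterior),
          "| posterior variance =", posterior_variance(posterior))
```

Both posteriors yield the same predictive probability of $1/2$, although they differ sharply in spread. The posterior variance is used here only as one possible indicator of that difference, not as a recommended measure of epistemic uncertainty.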

More generally, consider the case of binary classification with $\mathcal{Y} := \{-1, +1\}$ and $p(+1 \mid x_q, h)$ the probability predicted by the hypothesis $h$ for the positive class. Instead of deriving a distribution on $\mathcal{Y}$ according to (14), one could also derive a predictive distribution for the (unknown) probability $q := p(+1 \mid x_q)$ of the positive class:

$$
p(q \mid x_q) = \int_{\mathcal{H}} [\![ p(+1 \mid x_q, h) = q ]\!] \, dP(h \mid \mathcal{D}). \tag{16}
$$

This is a second-order probability, which still contains both aleatoric and epistemic uncertainty. The question of how to quantify the epistemic part of the uncertainty is not at all obvious, however. Intuitively, epistemic uncertainty should be reflected by the variability of the distribution (16): the more spread the probability mass over the unit interval, the higher the uncertainty seems to be. But how to put this into quantitative terms? Entropy is arguably not a good choice, for example, because this measure is invariant against redistribution of probability mass. For instance, the distributions and with and both have the same entropy, although they correspond to quite different states of information. From this point of view, the variance of the distribution would be better suited, but this measure has other deficiencies (for example, it is not maximized by the uniform distribution, which could be considered as a case of minimal informedness).
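The behavior of entropy and variance as candidate measures can be explored on a discrete analogue of the second-order distribution (16). The sketch below is illustrative only: the five-point grid for $q$ and the four example distributions are hypothetical and are not the distributions referred to in the text.

```python
import math

GRID = [0.0, 0.25, 0.5, 0.75, 1.0]  # hypothetical support for the probability q


def entropy(probs):
    """Shannon entropy; depends only on the masses, not on where they are placed."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def variance(probs):
    """Variance of q under the given masses on GRID."""
    mean = sum(p * q for p, q in zip(probs, GRID))
    return sum(p * (q - mean) ** 2 for p, q in zip(probs, GRID))


peaked = [0.1, 0.2, 0.4, 0.2, 0.1]
rearranged = [0.2, 0.1, 0.4, 0.1, 0.2]   # same masses as `peaked`, moved outward
uniform = [0.2] * 5
extremes = [0.5, 0.0, 0.0, 0.0, 0.5]     # all mass on q = 0 and q = 1

for name, probs in (("peaked", peaked), ("rearranged", rearranged),
                    ("uniform", uniform), ("extremes", extremes)):
    print(f"{name:10s} entropy = {entropy(probs):.4f}  variance = {variance(probs):.4f}")
```

The first two distributions have the same entropy (up to rounding) but different variance, which reflects the invariance of entropy under redistribution of mass. The last two show that variance is maximized by the two-point distribution on the extremes rather than by the uniform distribution.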

The difficulty of specifying epistemic uncertainty in the Bayesian approach is rooted in the more general difficulty of representing a lack of knowledge in probability theory. This issue will be discussed next.

### 2.4 Sets versus distributions for representing uncertainty

As already said, uncertainty is captured in a purely set-based way in version space learning: $\mathcal{V} \subseteq \mathcal{H}$ is a set of candidate hypotheses, which translates into a set of candidate outcomes $Y \subseteq \mathcal{Y}$ for a query $x_q$. In the case of sets, there is a rather obvious correspondence between the degree of uncertainty in the sense of a lack of knowledge and the size of the set of candidates: Proceeding from a reference set $\Omega$ of alternatives, assuming some ground-truth $\omega^* \in \Omega$, and expressing knowledge about the latter in terms of a subset $C \subseteq \Omega$ of possible candidates, we can clearly say that the bigger $C$, the larger the lack of knowledge. More specifically, for finite $\Omega$, a common uncertainty measure in information theory is $\log(|C|)$. Consequently, knowledge gets weaker by adding additional elements to $C$ and stronger by removing candidates.

In probability, it is much less obvious how to “weaken” the knowledge conveyed by a distribution on $\Omega$. This is mainly because the total amount of belief is fixed in terms of a unit mass that can be distributed among the elements $\omega \in \Omega$. Unlike for sets, where an additional candidate $\omega$ can be added or removed without changing the plausibility of all other candidates, increasing the weight of one alternative $\omega$ requires decreasing the weight of another alternative $\omega'$ by exactly the same amount.

Of course, there are also measures of uncertainty for probability distributions, most notably the (Shannon) entropy, which, for finite $\Omega$, is given as follows:

$$
H(p) := - \sum_{\omega \in \Omega} p(\omega) \log p(\omega)
$$

However, these measures are primarily capturing the shape of the distribution, namely its “peakedness” or non-uniformity (Dubois and Hüllermeier, 2007), and hence inform about the predictability of the outcome of a random experiment. Seen from this point of view, they are more akin to aleatoric uncertainty, whereas the set-based approach is arguably better suited for capturing epistemic uncertainty.

For these reasons, it has been argued that probability distributions are less suitable for representing ignorance in the sense of a lack of knowledge (Dubois et al., 1996). For example, the case of complete ignorance is typically modeled in terms of the uniform distribution $p \equiv 1/|\Omega|$ in probability theory; this is justified by the “principle of indifference” invoked by Laplace, or by referring to the principle of maximum entropy (obviously, there is a technical problem in defining the uniform distribution in the case where $\Omega$ is not finite). Then, however, it is not possible to distinguish between (i) precise (probabilistic) knowledge about a random event, such as tossing a fair coin, and (ii) a complete lack of knowledge due to an incomplete description of the experiment. This was already pointed out by the famous Ronald Fisher, who noted that “not knowing the chance of mutually exclusive events and knowing the chance to be equal are two quite different states of knowledge.”
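The contrast between the set-based measure and the Shannon entropy can be made tangible with a two-outcome example. This is a minimal sketch with hypothetical helper names, and logarithms are taken to base 2 (an arbitrary choice), so that values are in bits.

```python
import math


def set_uncertainty(candidates):
    """Logarithm of the number of candidates: grows as candidates are added."""
    return math.log2(len(candidates))


def shannon_entropy(dist):
    """Shannon entropy (in bits) of a distribution given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


# Set-based view: knowledge is a set of outcomes still considered possible.
print(set_uncertainty({"heads", "tails"}))   # nothing is known about the outcome
print(set_uncertainty({"heads"}))            # the outcome is known

# Distributional view: the same uniform distribution has to serve for both
# "the coin is known to be fair" and "nothing is known about the coin".
known_fair_coin = {"heads": 0.5, "tails": 0.5}
unknown_coin_by_indifference = {"heads": 0.5, "tails": 0.5}
print(shannon_entropy(known_fair_coin), shannon_entropy(unknown_coin_by_indifference))
```

In the set-based view, removing a candidate reduces the uncertainty without affecting anything else. In the distributional view, the two states of knowledge in Fisher's remark are represented by the same object and therefore receive the same entropy.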

Another problem in this regard is caused by the measure-theoretic grounding of probability and its additive nature. For example, the uniform distribution is not invariant under reparametrization (a uniform distribution on a parameter $\omega$ does not translate into a uniform distribution on $1/\omega$, although ignorance about $\omega$ implies ignorance about $1/\omega$). Likewise, expressing ignorance about the length $x$ of a cube in terms of a uniform distribution on an interval $[l, u]$ does not yield a uniform distribution of $x^3$ on $[l^3, u^3]$, thereby suggesting some degree of informedness about its volume. Problems of this kind render the use of a uniform prior distribution, often interpreted as representing epistemic uncertainty in Bayesian inference, at least debatable (this problem is inherited by hierarchical Bayesian modeling; see, however, work on “non-informative” priors (Jeffreys, 1946; Bernardo, 1979)).
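The cube example can be verified with a short calculation. The sketch below assumes, purely for illustration, a side length that is uniform on the hypothetical interval $[1, 2]$, so that the volume lies in $[1, 8]$. It compares the probability that the volume falls below a value $v$ with the probability one would obtain if the volume itself were uniform.

```python
def prob_volume_below(v, lo=1.0, hi=2.0):
    """P(X**3 <= v) when the side length X is uniform on [lo, hi]."""
    x = v ** (1.0 / 3.0)          # X**3 <= v  is equivalent to  X <= v**(1/3)
    x = min(max(x, lo), hi)       # clip to the support of X
    return (x - lo) / (hi - lo)


def prob_if_volume_uniform(v, lo=1.0, hi=2.0):
    """P(V <= v) if the volume V itself were uniform on [lo**3, hi**3]."""
    v = min(max(v, lo ** 3), hi ** 3)
    return (v - lo ** 3) / (hi ** 3 - lo ** 3)


midpoint = (1.0 ** 3 + 2.0 ** 3) / 2   # midpoint of the volume range [1, 8]
print("uniform side length:", prob_volume_below(midpoint))
print("uniform volume:     ", prob_if_volume_uniform(midpoint))
```

Under a uniform side length, noticeably more than half of the probability mass lies below the midpoint of the volume range. A prior that is meant to express ignorance about the length therefore implicitly favors small volumes.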

The argument that a single (probability) distribution is not enough for representing uncertain knowledge is quite prominent in the literature. Correspondingly, various generalizations of standard probability theory have been proposed, including imprecise probability (Walley, 1991), evidence theory (Shafer, 1976; Smets and Kennes, 1994), and possibility theory (Dubois and Prade, 1988). These formalisms essentially seek to take advantage of the complementary nature of sets and distributions, and to combine both representations in a meaningful way — we refer to Appendix A for a brief overview. As pointed out by Senge et al. (2014), such approaches could also be useful for modeling and quantifying aleatoric and epistemic uncertainty in machine learning.

### 2.5 Model uncertainty

According to our discussion so far, aleatoric uncertainty relates to the stochastic dependency between instances and outcomes , as expressed by the conditional probability (14), whereas epistemic uncertainty relates to the lack of knowledge about the latter, expressed by the distribution (13). Indeed, assuming , and hence , we have , so uncertainty about the predictive posterior is essentially equivalent to uncertainty about the Bayes predictor.

Assuming a correctly specified hypothesis space $\mathcal{H}$, such that $f^* \in \mathcal{H}$, actually means neglecting model uncertainty, which might be caused, for example, by model misspecification. An assumption of that kind appears implausible but unavoidable at the same time. Indeed, model induction, like statistical inference in general, is not possible without underlying assumptions, and conclusions drawn from data are always conditional on these assumptions. This is made very explicit in Bayesian inference, but also, for example, in version space learning, where a conflict in the data will immediately cause an empty version space.

But how could model uncertainty be captured, and perhaps even be measured? An answer to this question is far from obvious. In a sense, it would require a kind of meta-analysis: Instead of expressing uncertainty about the ground-truth hypothesis $h^*$ within a hypothesis space $\mathcal{H}$, one has to express uncertainty about which $\mathcal{H}$ among a set of candidate hypothesis spaces might be the right one.

For some learning methods, such as nearest neighbor classification or (deep) neural networks, the hypothesis space $\mathcal{H}$ is very large. Thus, the learner has a high capacity (or “universal approximation” capability) and can express hypotheses in a very flexible way. In such cases, $h^* = f^*$ or at least $h^* \approx f^*$ can safely be assumed, and model uncertainty essentially disappears. Moreover, since there is no assumption about any global structure of the dependency between inputs $x$ and outcomes $y$, inductive inference will essentially be of a local nature: A class is approximated by the region in the instance space in which examples from that class have been seen, aleatoric uncertainty occurs where such regions are overlapping, and epistemic uncertainty where no examples have been encountered so far (cf. Fig. 9).

### 2.6 Reducible versus irreducible uncertainty

Our basic distinction between epistemic and aleatoric uncertainty refers to the distinction between reducible and irreducible uncertainty. But what does “reducible” actually mean? So far, the only source of additional information we considered was the training data $\mathcal{D}$: The learner’s uncertainty can be reduced by observing more data, while the setting of the learning problem (the instance space $\mathcal{X}$, output space $\mathcal{Y}$, hypothesis space $\mathcal{H}$, and joint probability $P$ on $\mathcal{X} \times \mathcal{Y}$) remains fixed.

In practice, this is of course not always the case. Imagine, for example, that a learner can decide to extend the description of instances by additional features, which essentially means replacing the current instance space $\mathcal{X}$ by another space $\mathcal{X}'$. This change of the setting may have an influence on uncertainty. An example is shown in Fig. 10: In a low-dimensional space (here defined by a single feature $x_1$), two class distributions are overlapping, which causes (aleatoric) uncertainty in a certain region of the instance space. By embedding the data in a higher-dimensional space (here accomplished by adding a second feature $x_2$), the two classes become separable, and the uncertainty can be resolved. More generally, embedding data in a higher-dimensional space will reduce aleatoric and increase epistemic uncertainty, because fitting a model will become more difficult and require more data.
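The effect of adding a feature can be illustrated on a tiny, hypothetical data set in which two classes overlap with respect to the first feature but are separated by the second. As a rough stand-in for aleatoric uncertainty, the sketch uses the empirical conditional entropy of the class label given the selected features. This proxy is an assumption made for illustration, not a measure proposed in the text.

```python
import math
from collections import Counter, defaultdict

# Hypothetical examples ((x1, x2), class): the classes overlap at x1 = 1.
DATA = [((0, 0), "a"), ((1, 0), "a"), ((1, 1), "b"), ((2, 1), "b")]


def conditional_entropy(data, features):
    """Empirical entropy (in bits) of the class given the selected feature indices."""
    groups = defaultdict(Counter)
    for x, y in data:
        key = tuple(x[i] for i in features)
        groups[key][y] += 1
    total = len(data)
    result = 0.0
    for counts in groups.values():
        n = sum(counts.values())
        h = -sum((c / n) * math.log2(c / n) for c in counts.values())
        result += (n / total) * h
    return result


print("using x1 only:   ", conditional_entropy(DATA, [0]))
print("using x1 and x2: ", conditional_entropy(DATA, [0, 1]))
```

With the first feature alone, the label remains undetermined for the instances at $x_1 = 1$. Once the second feature is added, every observed feature vector determines its class. The price of the richer description, namely that more data is needed to cover the larger space, is not visible in such a small example.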

What this example shows is that aleatoric and epistemic uncertainty should not be seen as absolute notions. Instead, they are context-dependent in the sense of depending on the setting $(\mathcal{X}, \mathcal{Y}, \mathcal{H}, P)$. Changing the context will also change the sources of uncertainty: aleatoric may turn into epistemic uncertainty and vice versa. Consequently, by allowing the learner to change the setting, the distinction between these two types of uncertainty will be somewhat blurred (and their quantification will become even more difficult).

Note that a change of the setting may also be caused externally, for example through a non-stationary data-generating process as commonly assumed in learning on data streams (Gama, 2012). Besides, changes may also concern other components of the setting. For example, there are many classification problems in which certain classes may “disappear” while new classes emerge, i.e., in which $\mathcal{Y}$ may change in the course of time. Again, the quantification of uncertainty will become more difficult in flexible scenarios of that kind.

## 3 Machine learning methods

This section presents several important machine learning methods that allow for representing the learner’s uncertainty in a prediction. They differ with regard to the type of prediction produced and the way in which uncertainty is represented. Another interesting question is whether they allow for distinguishing between aleatoric and epistemic uncertainty, and perhaps even for quantifying the amount of uncertainty in terms of degrees of aleatoric and epistemic (and total) uncertainty.

### 3.1 Probability estimation via scoring, calibration, and ensembling

There are various methods in machine learning for inducing probabilistic predictors. These are hypotheses $h$ that do not merely output point predictions $h(x) \in \mathcal{Y}$, i.e., elements of the output space $\mathcal{Y}$, but probability estimates $p(\cdot \mid x)$, i.e., complete probability distributions on $\mathcal{Y}$. In the case of classification, this means predicting a single (conditional) probability $p(y \mid x)$ for each class $y \in \mathcal{Y}$, whereas in regression, $p(\cdot \mid x)$ is a density function on $\mathbb{R}$. Such predictors can be learned in a discriminative way, i.e., in the form of a mapping $x \mapsto p(\cdot \mid x)$, or in a generative way, which essentially means learning a joint distribution on $\mathcal{X} \times \mathcal{Y}$. Moreover, the approaches can be parametric (assuming specific parametric families of probability distributions) or non-parametric. Well-known examples include classical statistical methods such as logistic and linear regression, Bayesian approaches such as Bayesian networks and Gaussian processes, as well as various techniques in the realm of (deep) neural networks.

Training probabilistic predictors is typically accomplished by minimizing suitable loss functions, i.e., loss functions that enforce “correct” (conditional) probabilities as predictions. In this regard, proper scoring rules (Gneiting and Raftery, 2005) play an important role, including the log-loss as a well-known special case. Sometimes, however, estimates are also obtained in a very simple way, following basic frequentist techniques for probability estimation, like in Naïve Bayes or nearest neighbor classification.

The predictions delivered by corresponding methods are at best “pseudo-probabilities” that are often not very accurate. Besides, there are many methods that deliver natural scores, intuitively expressing a degree of confidence (like the distance from the separating hyperplane in support vector machines), but which do not immediately qualify as probabilities either. The idea of scaling or calibration methods is to turn such scores into proper, well-calibrated probabilities, that is, to learn a mapping from scores to the unit interval that can be applied to the output of a predictor as a kind of post-processing step (Flach, 2017). Examples of such methods include binning (Zadrozny and Elkan, 2001), isotonic regression (Zadrozny and Elkan, 2002), logistic scaling (Platt, 1999) and improvements thereof (Kull et al., 2017), as well as the use of Venn predictors (Johansson et al., 2018). Calibration is still a topic of ongoing research.
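To make the idea of calibration as a post-processing step concrete, the following is a minimal sketch of histogram binning for binary classification. It assumes scores already lie in the unit interval and uses equal-width bins; the function names `fit_binning` and `calibrate`, the number of bins, and the fallback for empty bins are illustrative choices, not taken from the cited works.

```python
def fit_binning(scores, labels, n_bins=5):
    """Learn one calibrated probability per equal-width score bin.

    scores: uncalibrated scores in [0, 1]; labels: 0/1 outcomes.
    """
    positives = [0] * n_bins
    totals = [0] * n_bins
    for s, y in zip(scores, labels):
        b = min(int(s * n_bins), n_bins - 1)  # bin index of score s
        positives[b] += y
        totals[b] += 1
    # Empirical positive rate per bin; empty bins fall back to the
    # bin midpoint (an arbitrary, illustrative choice).
    return [
        positives[b] / totals[b] if totals[b] else (b + 0.5) / n_bins
        for b in range(n_bins)
    ]


def calibrate(score, bin_rates):
    """Map a raw score to the calibrated probability of its bin."""
    n_bins = len(bin_rates)
    return bin_rates[min(int(score * n_bins), n_bins - 1)]


# Hypothetical held-out scores and labels used to fit the mapping.
rates = fit_binning([0.1, 0.15, 0.9, 0.95, 0.85], [0, 1, 1, 1, 0], n_bins=2)
calibrated = calibrate(0.2, rates)
```

In practice, the mapping would be fitted on data not used for training the underlying predictor, since otherwise the calibrated probabilities tend to be overconfident.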

Another important class of methods is ensemble learning, such as bagging or boosting, which was originally invented to improve the accuracy of (point) predictions. Since such methods produce a (large) set of predictors instead of a single hypothesis, it is tempting to produce probability estimates following basic frequentist principles. In the simplest case (of classification), each prediction can be interpreted as a “vote” in favor of a class $y \in \mathcal{Y}$, and probabilities can be estimated by relative frequencies. Especially important in this field are tree-based methods such as random forests (Breiman, 2001; Kruppa et al., 2014).
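The vote-counting idea can be sketched in a few lines. The function name `vote_probabilities` and the example votes are hypothetical; the sketch only illustrates the estimation of class probabilities by relative frequencies.

```python
from collections import Counter


def vote_probabilities(votes, classes):
    """Estimate class probabilities as relative frequencies of ensemble votes."""
    counts = Counter(votes)
    return {c: counts[c] / len(votes) for c in classes}


# Hypothetical point predictions of a five-member ensemble for one query.
probs = vote_probabilities(["a", "b", "a", "a", "c"], classes=["a", "b", "c"])
```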

Obviously, while standard probability estimation is a viable approach to representing uncertainty in a prediction, there is no explicit distinction between different types of uncertainty.

### 3.2 Gaussian processes

In standard Bayesian inference (cf. Section 2.3), the hypothesis space is parametrized in the sense that each hypothesis is (uniquely) identified by a parameter (vector) $\theta$ of fixed dimensionality $d$. Thus, computation of the posterior (13) essentially comes down to updating beliefs about the true (or best) parameter, which is treated as a multivariate random variable:

$$p(\theta \mid \mathcal{D}) \propto p(\theta) \cdot p(\mathcal{D} \mid \theta). \tag{17}$$

Gaussian processes (Seeger, 2004) generalize the Bayesian approach from inference about multivariate (but finite-dimensional) random variables to inference about (infinite-dimensional) functions. Thus, they can be thought of as distributions not just over random vectors but over random functions.

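More specifically, a stochastic process in the form of a collection of random variables $\{f(x) \mid x \in \mathcal{X}\}$ with index set $\mathcal{X}$ is said to be drawn from a Gaussian process with mean function $m$ and covariance function $k$, denoted $f \sim \mathcal{GP}(m, k)$, if for any finite set of elements $x_1, \ldots, x_n \in \mathcal{X}$, the associated finite set of random variables $f(x_1), \ldots, f(x_n)$ has the following multivariate normal distribution:

$$
\begin{bmatrix} f(x_1) \\ \vdots \\ f(x_n) \end{bmatrix}
\sim \mathcal{N}\left(
\begin{bmatrix} m(x_1) \\ \vdots \\ m(x_n) \end{bmatrix},
\begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{bmatrix}
\right)
$$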

Note that the above properties imply

$$m(x) = \mathbb{E}\big(f(x)\big), \qquad k(x, x') = \mathbb{E}\big((f(x) - m(x))(f(x') - m(x'))\big),$$

and that $k$ needs to obey the properties of a kernel function (so as to guarantee proper covariance matrices). Intuitively, a function $f$ drawn from a Gaussian process prior can be thought of as a (very) high-dimensional vector drawn from a (very) high-dimensional multivariate Gaussian. Here, each dimension of the Gaussian corresponds to an element $x$ from the index set $\mathcal{X}$, and the corresponding component of the random vector represents the value of $f(x)$.

Gaussian processes allow for doing proper Bayesian inference (13) in a non-parametric way: Starting with a prior distribution on functions $h \in \mathcal{H}$, specified by a mean function $m$ (often the zero function) and kernel $k$, this distribution can be replaced by a posterior in light of observed data $\mathcal{D}$. Likewise, a posterior predictive distribution can be obtained on outcomes $y \in \mathcal{Y}$ for a new query $x_q \in \mathcal{X}$. In the general case, the computations involved are intractable, but they turn out to be rather simple in the case of regression ($\mathcal{Y} = \mathbb{R}$) with Gaussian noise. In this case, the posterior predictive distribution is again a Gaussian with mean $\mu$ and variance $\sigma^2$ as follows:

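$$
\begin{aligned}
\mu &= K(x_q, X)\big(K(X, X) + \sigma_\epsilon^2 I\big)^{-1} y, \\
\sigma^2 &= K(x_q, x_q) + \sigma_\epsilon^2 - K(x_q, X)\big(K(X, X) + \sigma_\epsilon^2 I\big)^{-1} K(X, x_q),
\end{aligned}
$$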

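where $\sigma_\epsilon^2$ is the variance of the error term, $y$ is the vector of observed training outcomes, $X$ is an $N \times d$ matrix summarizing the training inputs (the $i$-th row of $X$ is given by $x_i^\top$), and $K(X, X)$ is the kernel matrix with entries $(K(X, X))_{i,j} = k(x_i, x_j)$.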
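The two formulas above can be evaluated directly. The following NumPy sketch assumes a zero mean function and a squared-exponential kernel; the kernel choice, the hyperparameter values, and the function names `rbf_kernel` and `gp_predict` are assumptions made for illustration. Besides the predictive mean and variance, the sketch also returns the variance without the noise term, since the formula for the predictive variance is the sum of this quantity and the noise variance.

```python
import numpy as np


def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel matrix between the rows of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dists / length_scale**2)


def gp_predict(X, y, x_q, noise_var=0.1, length_scale=1.0):
    """Posterior predictive mean and variance at a single query x_q.

    Returns (mean, total variance, variance without the noise term).
    """
    x_q = np.atleast_2d(x_q)
    K = rbf_kernel(X, X, length_scale) + noise_var * np.eye(len(X))
    k_q = rbf_kernel(x_q, X, length_scale)  # K(x_q, X), shape (1, N)
    mean = k_q @ np.linalg.solve(K, y)
    latent_var = rbf_kernel(x_q, x_q, length_scale) - k_q @ np.linalg.solve(K, k_q.T)
    total_var = latent_var + noise_var
    return mean.item(), total_var.item(), latent_var.item()


# Hypothetical one-dimensional training data.
X = np.array([[0.0], [1.0], [2.0]])
y = np.sin(X).ravel()
near_mean, near_total, near_latent = gp_predict(X, y, [1.0])
far_mean, far_total, far_latent = gp_predict(X, y, [10.0])
```

For a query far away from the training inputs, the noise-free part of the variance approaches the prior variance of the kernel, whereas it shrinks close to observed data.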

Problems with discrete outcomes $y$, such as binary classification with $\mathcal{Y} = \{-1, +1\}$, are made amenable to Gaussian processes by suitable link functions, linking these outcomes with the real values $h = h(x)$ as underlying (latent) variables. For example, using the logistic link function

$$p(y \mid h) = s(h) = \frac{1}{1 + \exp(-y\,h)},$$

the following posterior predictive distribution is obtained:

$$p(y = +1 \mid X, y, x_q) = \int s(h')\, p(h' \mid X, y, x_q)\, dh',$$

where

$$p(h' \mid X, y, x_q) = \int p(h' \mid X, x_q, h)\, p(h \mid X, y)\, dh.$$

However, since the likelihood of the data will no longer be Gaussian, approximate inference techniques (e.g., Laplace, expectation propagation, MCMC) will be needed.

As for uncertainty representation, our general remarks on Bayesian inference (cf. Section 2.3) obviously apply to Gaussian processes as well. In the case of regression, the variance of the posterior predictive distribution for a query $x_q$ is a meaningful indicator of epistemic uncertainty, whereas the variance of the error term, $\sigma_\epsilon^2$, corresponds to the aleatoric uncertainty. Both have an influence on the width of confidence intervals that are typically obtained for per-instance predictions, and which reflect the total amount of uncertainty (cf. Fig. 11). Of course, unless $\sigma_\epsilon^2$ is known, the two sources of uncertainty cannot easily be separated. In the case of classification, the situation gets even more difficult.

### 3.3 Deep neural networks

Work on uncertainty in deep learning fits the general framework we introduced so far, especially the Bayesian approach, quite well, although the methods proposed are specifically tailored for neural networks as a model class. A standard neural network can be seen as a probabilistic classifier $h$: in the case of classification, given a query $x \in \mathcal{X}$, the final layer of the network typically outputs a probability distribution (using transformations such as softmax) on the set of classes $\mathcal{Y}$, and in the case of regression, a distribution (e.g., a Gaussian) is placed over a point prediction (typically conceived as the expectation of the distribution). Training a neural network can essentially be seen as doing maximum likelihood inference. As such, it yields probabilistic predictions, but no information about the confidence in these probabilities. In other words, it captures aleatoric but no epistemic uncertainty.

Epistemic uncertainty is commonly understood as uncertainty about the model parameters, that is, the weights $w$ of the neural network (which correspond to the parameter $\theta$ in (17)). To capture this kind of epistemic uncertainty, Bayesian neural networks (BNNs) have been proposed as a Bayesian extension of deep neural networks (Denker and LeCun, 1991; Kay, 1992; Neal, 2012). In BNNs, each weight is represented by a probability distribution (again, typically a Gaussian) instead of a real number, and learning comes down to Bayesian inference, i.e., computing the posterior $p(w \mid \mathcal{D})$. The predictive distribution of an outcome $y \in \mathcal{Y}$ given a query instance $x_q$ is then given by

$$p(y \mid x_q, \mathcal{D}) = \int p(y \mid x_q, w)\, p(w \mid \mathcal{D})\, dw.$$

Since the posteriors on the weights cannot be obtained analytically, approximate variational techniques are used (Jordan et al., 1999; Graves, 2011), seeking a variational distribution $q$ on the weights that minimizes the Kullback-Leibler divergence between $q$ and the true posterior. An important example is Dropout variational inference (Gal and Ghahramani, 2016), which establishes an interesting connection between (variational) Bayesian inference and the idea of using Dropout as a learning technique for neural networks. Likewise, given a query $x_q$, the predictive distribution for the outcome $y$ is approximated using techniques like Monte Carlo sampling, i.e., drawing model weights from the approximate posterior distributions (again, a connection to Dropout can be established). The total uncertainty can then be quantified in terms of common measures such as entropy (in the case of classification) or variance (in the case of regression) of this distribution.

The total uncertainty, say the variance in regression, still contains both the residual error, i.e., the variance of the observation error (aleatoric uncertainty), $\sigma_\epsilon^2$, and the variance due to parameter uncertainty (epistemic uncertainty). The former can be heteroscedastic, which means that $\sigma_\epsilon^2$ is not constant but a function of $x \in \mathcal{X}$. Kendall and Gal (2017) propose a way to learn heteroscedastic aleatoric uncertainty as loss attenuation. Roughly speaking, the idea is to let the neural net not only predict the conditional mean of $y$ given $x_q$, but also the residual error. The corresponding loss function to be minimized is constructed in such a way (or actually derived from the probabilistic interpretation of the model) that prediction errors for points with a high residual variance are penalized less, but at the same time, a penalty is also incurred for predicting a high variance.
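A loss with the two properties just described is obtained from the negative log-likelihood of a Gaussian with input-dependent variance. The following sketch is one common way to write it and is not claimed to reproduce the exact formulation of Kendall and Gal (2017); in particular, letting the model output the log-variance (for numerical stability) and the function name `heteroscedastic_loss` are assumptions.

```python
import math


def heteroscedastic_loss(y_true, mean_pred, log_var_pred):
    """Average Gaussian negative log-likelihood, up to an additive constant.

    The squared error is down-weighted by exp(-log_var), while the term
    0.5 * log_var penalizes the prediction of a large variance.
    """
    total = 0.0
    for y, mu, s in zip(y_true, mean_pred, log_var_pred):
        total += 0.5 * math.exp(-s) * (y - mu) ** 2 + 0.5 * s
    return total / len(y_true)


# A large error costs less if a large variance was predicted ...
loss_low_var = heteroscedastic_loss([2.0], [0.0], [0.0])
loss_high_var = heteroscedastic_loss([2.0], [0.0], [math.log(4.0)])
# ... but predicting a large variance for an accurate prediction is penalized.
loss_exact_low_var = heteroscedastic_loss([0.0], [0.0], [0.0])
loss_exact_high_var = heteroscedastic_loss([0.0], [0.0], [math.log(4.0)])
```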

An explicit attempt at measuring and separating aleatoric and epistemic uncertainty (in the context of regression) is made by Depeweg et al. (2018). Their idea is as follows: They quantify the total uncertainty as well as the aleatoric uncertainty, and then obtain the epistemic uncertainty as the difference. More specifically, they propose to measure the total uncertainty in terms of the (differential) entropy of the predictive posterior distribution, $H[p(y \mid x)]$. This uncertainty also includes the (epistemic) uncertainty about the network weights $w$. Fixing a set of weights, i.e., considering a distribution $p(y \mid w, x)$, thus removes the epistemic uncertainty. Therefore, the expectation over the entropies of these distributions, $\mathbb{E}_{p(w \mid \mathcal{D})} H[p(y \mid w, x)]$, is a measure of the aleatoric uncertainty. Finally, the epistemic uncertainty is obtained as the difference

$$H[p(y \mid x)] - \mathbb{E}_{p(w \mid \mathcal{D})} H[p(y \mid w, x)] = I(y, w), \tag{18}$$

which equals the mutual information between $y$ and $w$. Intuitively, epistemic uncertainty thus captures the amount of information about the model parameters $w$ that would be gained through knowledge of the true outcome $y$. A similar approach was recently adopted by Mobiny et al. (2017), who also propose a technique to compute (18) approximately.
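The decomposition (18) is stated for regression and differential entropy. As an illustration only, the following sketch applies the same decomposition to classification with discrete (Shannon) entropy, where the expectation over the posterior is approximated by averaging over a finite sample of weight vectors (for example, ensemble members or Monte Carlo Dropout passes). The function names and the example distributions are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)


def decompose_uncertainty(member_probs):
    """Split total uncertainty into aleatoric and epistemic parts.

    member_probs: one class distribution p(y | w, x) per sampled w.
    Returns (total, aleatoric, epistemic) in the sense of Eq. (18).
    """
    n = len(member_probs)
    k = len(member_probs[0])
    mean_p = [sum(p[c] for p in member_probs) / n for c in range(k)]
    total = entropy(mean_p)
    aleatoric = sum(entropy(p) for p in member_probs) / n
    return total, aleatoric, total - aleatoric


# All sampled models agree that the outcome is a coin flip: purely aleatoric.
agree = decompose_uncertainty([[0.5, 0.5], [0.5, 0.5]])
# The sampled models are individually certain but contradict each other:
# purely epistemic.
disagree = decompose_uncertainty([[1.0, 0.0], [0.0, 1.0]])
```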

Bayesian model averaging establishes a natural connection between Bayesian inference and ensemble learning, and indeed, as already mentioned in Section 3.3, the variance of the predictions produced by an ensemble is a good indicator of the (epistemic) uncertainty in a prediction. In fact, the variance is inversely related to the “peakedness” of a posterior distribution and can be seen as a measure of the discrepancy between the (most probable) hypotheses. Based on this idea, Lakshminarayanan et al. (2017) propose a simple ensemble approach as an alternative to Bayesian NNs, which is easy to implement, readily parallelizable, and requires little hyperparameter tuning.

### 3.4 Reliable prediction

The approach to reliable prediction by Senge et al. (2014) combines probabilistic and constraint-based (set-based) inference, and thus can be positioned in-between Bayesian inference and version space learning (cf. Sections 2.2 and 2.3).

#### 3.4.1 Modeling the plausibility of predictions

Consider the simplest case of binary classification with classes $\mathcal{Y} = \{-1, +1\}$, which suffices to explain the basic idea (and which has been generalized to the multi-class case by Nguyen et al. (2018)). Recall that, in version space learning, the plausibility of both hypotheses and outcomes is expressed in a purely bivalent way: according to (10), a hypothesis $h$ is either considered possible/plausible or not, and an outcome $y$ is either supported or not. Senge et al. (2014) generalize both parts of the prediction from bivalent to graded plausibility and support.

More specifically, the first step consists of defining a degree of plausibility for each hypothesis $h \in \mathcal{H}$. To this end, a (posterior) distribution on $\mathcal{H}$ is exploited, just like in Bayesian inference, but adopting the idea of likelihood inference to avoid the need for specifying a prior. Referring to the notion of normalized likelihood (cf. Section A.4), a plausibility distribution on $\mathcal{H}$ is defined as follows:

$$\pi_{\mathcal{H}}(h) := \frac{L(h)}{\sup_{h' \in \mathcal{H}} L(h')} = \frac{L(h)}{L(h^{ml})}, \tag{19}$$

where $h^{ml} \in \mathcal{H}$ is the maximum likelihood (ML) estimate on the data $\mathcal{D}$. Thus, the plausibility of a hypothesis is in proportion to its likelihood, with the ML estimate having the highest plausibility of 1. (In principle, the same idea can of course also be applied in Bayesian inference, namely by defining plausibility in terms of a normalized posterior distribution.)

The second step, both in version space and Bayesian learning, consists of translating uncertainty on $\mathcal{H}$ into uncertainty about the prediction for a query $x_q$. To this end, all predictions $h(x_q)$ need to be aggregated, taking into account the plausibility of the hypotheses $h \in \mathcal{H}$. Due to the problems of the averaging approach (14) in Bayesian inference, a generalization of the “existential” aggregation (10) used in version space learning is adopted:

$$\pi(+1 \mid x_q) := \sup_{h \in \mathcal{H}} \min\big(\pi_{\mathcal{H}}(h),\, \pi(+1 \mid h, x_q)\big), \tag{20}$$

where $\pi(+1 \mid h, x_q)$ is the degree of support of the positive class provided by $h$ (indeed, $\pi(\cdot \mid h, x_q)$ should not be interpreted as a measure of uncertainty). This measure of support, which generalizes the all-or-nothing support in (10), is defined as follows:

$$\pi(+1 \mid h, x_q) := \max\big(2h(x_q) - 1,\, 0\big) \tag{21}$$

Thus, the support is 0 as long as the probability predicted by $h$ is $\leq 1/2$, and linearly increases afterward, reaching 1 for $h(x_q) = 1$. Recalling that $h$ is a probabilistic classifier, this clearly makes sense, since values $h(x_q) \leq 1/2$ are actually more in favor of the negative class, and therefore no evidence for the positive class. Also, as will be seen further below, this definition assures a maximal degree of aleatoric uncertainty in the case of full certainty about the uniform distribution (probability $1/2$ for each class), which is a desirable property. Moreover, note that the supremum operator in (20) can be seen as a generalized existential quantifier. Therefore, the expression (20) can be read as follows: The class $+1$ is plausible insofar as there exists a hypothesis $h$ that is plausible and that strongly supports $+1$. Analogously, the plausibility for $-1$ is defined as follows:

$$\pi(-1 \mid x_q) := \sup_{h \in \mathcal{H}} \min\big(\pi_{\mathcal{H}}(h),\, \pi(-1 \mid h, x_q)\big), \tag{22}$$

with $\pi(-1 \mid h, x_q) := \max\big(1 - 2h(x_q),\, 0\big)$.

The computation of $\pi(+1 \mid x_q)$ according to (20) is illustrated in Fig. 12, where the hypothesis space $\mathcal{H}$ is shown schematically as one of the axes. In comparison to Bayesian inference (14), two important differences are notable:

• First, evidence of hypotheses is represented in terms of normalized likelihood $\pi_{\mathcal{H}}(h)$ instead of posterior probabilities $p(h \mid \mathcal{D})$, and support for a class $y$ in terms of $\pi(y \mid h, x_q)$ instead of probabilities $p(y \mid h, x_q)$.

• Second, the “sum-product aggregation” in Bayesian inference is replaced by a “max-min aggregation”.

More formally, the meaning of sum-product aggregation is that (14) corresponds to the computation of the standard (Lebesgue) integral of the class probability $p(y \mid h, x_q)$ with respect to the (posterior) probability distribution $p(h \mid \mathcal{D})$. Here, instead, the definition of $\pi(y \mid x_q)$ corresponds to the Sugeno integral (Sugeno, 1974) of the support $\pi(y \mid h, x_q)$ with respect to the possibility measure $\Pi_{\mathcal{H}}$ induced by the distribution (19) on $\mathcal{H}$:

$$\pi(y \mid x_q) = S\!\!\int_{\mathcal{H}} \pi(y \mid h, x_q) \circ \Pi_{\mathcal{H}} \tag{23}$$

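In general, given a measurable space $(\Omega, \mathcal{A})$ and an $\mathcal{A}$-measurable function $f: \Omega \to [0, 1]$, the Sugeno integral of $f$ with respect to a monotone measure $g$ (i.e., a measure on $\mathcal{A}$ such that $g(\emptyset) = 0$, $g(\Omega) = 1$, and $g(A) \leq g(B)$ for $A \subseteq B$) is defined as

$$S\!\!\int_{\Omega} f(\omega) \circ g := \sup_{A \in \mathcal{A}} \Big[ \min\Big( \inf_{\omega \in A} f(\omega),\, g(A) \Big) \Big] = \sup_{\alpha \in [0, 1]} \Big[ \min\big( \alpha,\, g(F_\alpha) \big) \Big], \tag{24}$$

where $F_\alpha = \{ \omega \mid f(\omega) \geq \alpha \}$.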

In comparison to sum-product aggregation, max-min aggregation avoids the loss of information due to averaging and is more in line with the “existential” aggregation in version space learning. In fact, it can be seen as a graded generalization of (12). Note that max-min inference requires the two measures $\pi_{\mathcal{H}}(\cdot)$ and $\pi(+1 \mid \cdot, x_q)$ to be commensurable. This is why the normalization of the likelihood according to (19) is important.

Compared to MAP inference (15), max-min inference takes more information into account. Indeed, MAP inference only looks at the probability of hypotheses but ignores the probabilities assigned to the classes. In contrast, a class can be considered plausible according to (20) even if it is not strongly supported by the most likely hypothesis $h^{ml}$; this merely requires sufficient support by another hypothesis $h$ that is not much less likely than $h^{ml}$.

#### 3.4.2 From plausibility to aleatoric and epistemic uncertainty

Given the plausibilities $\pi(+1) = \pi(+1 \mid x_q)$ and $\pi(-1) = \pi(-1 \mid x_q)$ of the positive and negative class, respectively, and having to make a prediction, one would naturally decide in favor of the more plausible class. Perhaps more interestingly, meaningful definitions of epistemic and aleatoric uncertainty can be given on the basis of the two degrees of plausibility: the degree of epistemic uncertainty as

$$u_e := \min\big(\pi(+1),\, \pi(-1)\big), \tag{25}$$

that is, the degree to which both $+1$ and $-1$ are plausible (the minimum plays the role of a generalized logical conjunction (Klement et al., 2002)), and the degree of aleatoric uncertainty as

$$u_a := \min\big(1 - \pi(+1),\, 1 - \pi(-1)\big) = 1 - \max\big(\pi(+1),\, \pi(-1)\big),$$

that is, the degree to which neither $+1$ nor $-1$ is plausible. Since these two degrees of uncertainty satisfy $u_a + u_e \leq 1$, the total uncertainty (aleatoric $+$ epistemic) is upper-bounded by 1. The following special cases are of interest:

• Full epistemic uncertainty: $u_e = 1$ requires the existence of at least two fully plausible hypotheses (i.e., both with the highest likelihood), the one fully supporting the positive and the other fully supporting the negative class. This situation is likely to occur (at least approximately) in the case of a small sample size, for which the likelihood is not very peaked.

• No epistemic uncertainty: $u_e = 0$ requires either $\pi(+1) = 0$ or $\pi(-1) = 0$, which in turn means that $h(x_q) \leq 1/2$ for all hypotheses with non-zero plausibility, or $h(x_q) \geq 1/2$ for all these hypotheses. In other words, there is no disagreement about which of the two classes should be favored. Specifically, suppose that all plausible hypotheses agree on the same conditional probability distribution $p(+1 \mid x_q) = \alpha$ and $p(-1 \mid x_q) = 1 - \alpha$, and let $\beta = \max(\alpha, 1 - \alpha)$. In this case, $u_e = 0$, and the degree of aleatoric uncertainty $u_a = 2(1 - \beta)$ depends on how close $\beta$ is to 1.

• Full aleatoric uncertainty: This is a special case of the previous one, in which $\beta = 1/2$. Indeed, $u_a = 1$ means that all plausible hypotheses assign a probability of $1/2$ to both classes. In other words, there is an agreement that the query instance is a boundary case.

• No uncertainty: Again, this is a special case of the second one, with $\beta = 1$. A clear preference (close to 1) in favor of one of the two classes means that all plausible hypotheses, i.e., all hypotheses with a high likelihood, provide full support to that class.

Although algorithmic aspects are not in the focus of this paper, it is worth mentioning that the computation of (20), and likewise of (22), may become rather complex. In fact, the computation of the supremum comes down to solving an optimization problem, the complexity of which strongly depends on the hypothesis space $\mathcal{H}$.
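For a very simple hypothesis space, the supremum can be approximated by a grid search. The following sketch is a toy illustration and not the algorithm of Senge et al. (2014): it assumes that each hypothesis predicts a constant probability $\theta \in [0, 1]$ for the positive class at the query, and that the likelihood is the Bernoulli likelihood of `n_pos` positive and `n_neg` negative observations. The function name and the grid resolution are arbitrary choices.

```python
def plausibility_uncertainty(n_pos, n_neg, grid_size=1001):
    """Plausibilities and uncertainty degrees for constant hypotheses.

    Returns (pi_pos, pi_neg, u_e, u_a), following Eqs. (19)-(22) and (25).
    """
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    lik = [t**n_pos * (1 - t) ** n_neg for t in thetas]
    max_lik = max(lik)
    pi_h = [l / max_lik for l in lik]  # normalized likelihood, Eq. (19)
    # Max-min aggregation over the grid, Eqs. (20)-(22).
    pi_pos = max(min(p, max(2 * t - 1, 0)) for p, t in zip(pi_h, thetas))
    pi_neg = max(min(p, max(1 - 2 * t, 0)) for p, t in zip(pi_h, thetas))
    u_e = min(pi_pos, pi_neg)      # epistemic uncertainty, Eq. (25)
    u_a = 1 - max(pi_pos, pi_neg)  # aleatoric uncertainty
    return pi_pos, pi_neg, u_e, u_a


no_data = plausibility_uncertainty(0, 0)       # nothing observed yet
balanced = plausibility_uncertainty(200, 200)  # many conflicting labels
one_sided = plausibility_uncertainty(200, 0)   # many consistent labels
```

Without data, all hypotheses are fully plausible and the epistemic uncertainty is maximal; with many balanced observations the uncertainty becomes mostly aleatoric; with many consistent observations both degrees are close to zero.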

### 3.5 Conformal prediction

In contrast to the previous approaches, conformal prediction (Vovk et al., 2003; Shafer and Vovk, 2008; Balasubramanian et al., 2014) is a framework for reliable prediction that is rooted in classical frequentist statistics, more specifically in hypothesis testing. Given a sequence of training observations and a new query $x_{N+1}$ (which we denoted by $x_q$ before) with unknown outcome $y_{N+1}$,

$$(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N), (x_{N+1}, \bullet), \tag{26}$$

the basic idea is to hypothetically replace $\bullet$ by each candidate, i.e., to test the hypothesis $y_{N+1} = y$ for all $y \in \mathcal{Y}$. Only those outcomes $y$ for which this hypothesis can be rejected at a predefined level of confidence are excluded, while those for which the hypothesis cannot be rejected are collected to form the prediction set or prediction region $Y^\epsilon \subseteq \mathcal{Y}$. The construction of a set-valued prediction $Y^\epsilon$ that is guaranteed to cover the true outcome $y_{N+1}$ with a given probability $1 - \epsilon$ (for example 95 %), instead of producing a point prediction $\hat{y}_{N+1} \in \mathcal{Y}$, is the basic idea of conformal prediction. In the case of classification, $Y^\epsilon$ is a subset of the set of classes $\mathcal{Y}$, whereas in regression, a prediction region is commonly represented in terms of an interval. (Obviously, since $\mathcal{Y}$ is infinite in regression, a hypothesis test cannot be conducted explicitly for each candidate outcome $y$.)

Hypothesis testing is done in a nonparametric way: Consider any “nonconformity” function $f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ that assigns scores $\alpha = f(x, y)$ to input/output tuples; the latter can be interpreted as a measure of “strangeness” of the pattern $(x, y)$, i.e., the higher the score, the less the data point $(x, y)$ conforms to what one would expect to observe. Applying this function to the sequence (26), with a specific (though hypothetical) choice of $y = y_{N+1}$, yields a sequence of scores

$$\alpha_1, \alpha_2, \ldots, \alpha_N, \alpha_{N+1}.$$

Denote by $\sigma$ the permutation of $\{1, \ldots, N+1\}$ that sorts the scores in increasing order, i.e., such that $\alpha_{\sigma(1)} \leq \ldots \leq \alpha_{\sigma(N+1)}$. Under the assumption that the hypothetical choice of $y$ is in agreement with the true data-generating process, and that this process has the property of exchangeability (which is weaker than the assumption of independence and essentially means that the order of observations is irrelevant), every permutation $\sigma$ has the same probability of occurrence. Consequently, the probability that $\alpha_{N+1}$ is among the fraction $\epsilon$ of highest nonconformity scores should be low. This notion can be captured by the $p$-values associated with the candidate $y$, defined as

$$p(y) := \frac{\#\{ i \mid \alpha_i \geq \alpha_{N+1} \}}{N+1}$$

According to what we said, the probability that $p(y) < \epsilon$ (i.e., $\alpha_{N+1}$ is among the fraction $\epsilon$ of highest $\alpha$-values) is upper-bounded by $\epsilon$. Thus, the hypothesis $y_{N+1} = y$ can be rejected for those candidates $y$ for which $p(y) < \epsilon$.
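The procedure can be sketched for a classification problem with one-dimensional inputs. The nonconformity function used here (distance of an input to the mean of all examples with the same label) is a hypothetical choice made for illustration, as are the function names and the data; a candidate is kept in the prediction set unless its $p$-value falls below $\epsilon$.

```python
def nonconformity(x, y, data):
    """Hypothetical score: distance of x to the mean input of class y."""
    same_class = [xi for xi, yi in data if yi == y]
    return abs(x - sum(same_class) / len(same_class))


def conformal_prediction_set(train, x_new, classes, epsilon):
    """Collect all candidate labels whose hypothesis is not rejected."""
    prediction_set = []
    for y in classes:
        augmented = train + [(x_new, y)]  # sequence (26) with candidate y
        scores = [nonconformity(xi, yi, augmented) for xi, yi in augmented]
        # p-value: fraction of scores at least as large as the new one.
        p_value = sum(a >= scores[-1] for a in scores) / len(scores)
        if p_value >= epsilon:
            prediction_set.append(y)
    return prediction_set


train = [(0.0, "a"), (0.1, "a"), (0.2, "a"), (0.9, "b"), (1.0, "b"), (1.1, "b")]
typical = conformal_prediction_set(train, 0.1, ["a", "b"], epsilon=0.2)
cautious = conformal_prediction_set(train, 0.55, ["a", "b"], epsilon=0.1)
```

Note that the smallest attainable $p$-value is $1/(N+1)$, so with only six training examples no candidate can be rejected for $\epsilon = 0.1$, and the second call returns both classes.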

Conformal prediction as outlined above realizes transductive inference, although inductive variants also exist (Papadopoulos, 2008). The error bounds are provided on a per-instance basis (unlike, for example, in PAC theory), and are valid and well calibrated by construction, regardless of the nonconformity function $f$. However, the choice of this function has an important influence on the efficiency of conformal prediction, that is, the size of prediction regions: The more suitable the nonconformity function is chosen, the smaller these sets will be.

Although conformal prediction is mainly concerned with constructing prediction regions, the scores produced in the course of this construction can also be used for quantifying uncertainty. In this regard, the notions of confidence and credibility have been introduced (Gammerman and Vovk, 2002): Let $p_1, \ldots, p_K$ denote the $p$-values that correspond, respectively, to the candidate outcomes $y_1, \ldots, y_K$ in a classification problem. If a definite choice (point prediction) has to be made, it is natural to pick the $y_i$ that conforms best, i.e., the one with the highest $p$-value or, stated differently, the highest “plausibility” $p_i$. The credibility in this prediction is then simply given by $p_i$, and the confidence by $1 - p_j$, where $p_j$ is the second-highest plausibility. Besides, other methods for quantifying the uncertainty of a point prediction in the context of conformal prediction have been proposed (Linusson et al., 2016).
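As a small illustration, the following sketch computes a point prediction together with its credibility and confidence from a set of $p$-values. It assumes the convention in which credibility is the largest $p$-value and confidence is one minus the second-largest $p$-value; the function name and the example values are hypothetical.

```python
def credibility_and_confidence(p_values):
    """Return (predicted label, credibility, confidence).

    p_values: dict mapping each candidate label to its conformal p-value
    (at least two candidates are required).
    """
    ranked = sorted(p_values.items(), key=lambda item: item[1], reverse=True)
    (best_label, best_p), (_, second_p) = ranked[0], ranked[1]
    return best_label, best_p, 1 - second_p


label, credibility, confidence = credibility_and_confidence(
    {"a": 0.8, "b": 0.1, "c": 0.05}
)
```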

With its main concern of constructing valid prediction regions, conformal prediction differs from most other machine learning methods, which produce point predictions $\hat{y} \in \mathcal{Y}$, whether equipped with a degree of uncertainty or not. In a sense, conformal prediction can even be seen as being orthogonal: It predefines the degree of uncertainty (level of confidence) and adjusts its prediction correspondingly, rather than the other way around.

### 3.6 Set-valued prediction based on utility maximization

In line with classical statistics, but unlike decision theory and machine learning, the setting of conformal prediction does not involve any notion of loss function. In this regard, it differs from methods for set-valued prediction based on utility maximization (or loss minimization), which are primarily used for multi-class classification problems. Similar to conformal prediction, such methods also return a set of classes when the classifier is too uncertain with respect to the class label to predict, but the interpretation of this set is different. Instead of returning a set that contains the true class label with high probability, sets that maximize a set-based utility score are sought.

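Let $u(y, \hat{Y})$ be a set-based utility score, where $y$ denotes the ground truth outcome and $\hat{Y}$ the predicted set. Then, adopting a decision-theoretic perspective, the Bayes-optimal solution $\hat{Y}^*_u$ is found by maximizing the following objective:

$$\hat{Y}^*_u(x_q) = \operatorname*{argmax}_{\hat{Y} \in 2^{\mathcal{Y}} \setminus \{\emptyset\}} \mathbb{E}_{p(y \mid x_q)}\big(u(y, \hat{Y})\big) = \operatorname*{argmax}_{\hat{Y} \in 2^{\mathcal{Y}} \setminus \{\emptyset\}} \sum_{y \in \mathcal{Y}} u(y, \hat{Y})\, p(y \mid x_q). \tag{27}$$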

Solving (27) as a brute-force search requires checking all subsets of $\mathcal{Y}$, resulting in an exponential time complexity. However, for many utility scores, the Bayes-optimal prediction can be found more efficiently. Various methods in this direction have been proposed under different names and qualifications of predictions, such as “non-deterministic” (Del Coz et al., 2009), “credal” (Corani and Zaffalon, 2008a), and “cautious” (Yang et al., 2017b). Although the methods typically differ in the exact shape of the utility function $u(y, \hat{Y})$, most functions are specific members of the following family:

$$u(y, \hat{Y}) = \begin{cases} 0 & \text{if } y \notin \hat{Y}, \\ g(|\hat{Y}|) & \text{if } y \in \hat{Y}, \end{cases} \tag{28}$$

where $|\hat{Y}|$ denotes the cardinality of the predicted set $\hat{Y}$. This family is characterized by a sequence $(g(1), \ldots, g(K))$ with $K$ the number of classes. Ideally, $g$ should obey the following properties:

1. $g(1) = 1$, i.e., the utility should be maximal when the classifier returns the true class label as a singleton set.

2. $g(s)$ should be non-increasing, i.e., the utility should be higher if the true class is contained in a smaller set of predicted classes.

3. $g(s) \geq 1/s$, i.e., the utility of predicting a set containing the true and $s - 1$ additional classes should not be lower than the expected utility of randomly guessing one of these $s$ classes. This requirement formalizes the idea of risk-aversion: in the face of uncertainty, abstaining should be preferred to random guessing (Zaffalon et al., 2012).

Many existing set-based utility scores are recovered as special cases of (28), including the three classical measures from information retrieval discussed by Del Coz et al. (2009): precision with $g_P(s) = 1/s$, recall with $g_R(s) = 1$, and the F-measure with $g_{F1}(s) = 2/(1+s)$. Other utility functions with specific choices for $g$ are studied in the literature on credal classification (Corani and Zaffalon, 2008b, 2009; Zaffalon et al., 2012; Yang et al., 2017b; Nguyen et al., 2018), such as

$$g_{\delta,\gamma}(s) := \frac{\delta}{s} - \frac{\gamma}{s^2}, \qquad g_{\exp}(s) := 1 - \exp\left(-\frac{\delta}{s}\right), \qquad g_{\log}(s) := \log\left(1 + \frac{1}{s}\right).$$

Especially $g_{\delta,\gamma}(s)$ is commonly used in this community, where $\delta$ and $\gamma$ can only take certain values to guarantee that the utility is in the interval $[0, 1]$. Precision (here called discounted accuracy) corresponds to the case $(\delta, \gamma) = (1, 0)$. However, typical choices for $(\delta, \gamma)$ are $(1.6, 0.6)$ and $(2.2, 1.2)$ (Nguyen et al., 2018), implementing the idea of risk aversion. The measure $g_{\exp}(s)$ is an exponentiated version of precision, where the parameter $\delta$ also defines the degree of risk aversion.
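For utilities of the family (28), the expected utility of a set $\hat{Y}$ equals $g(|\hat{Y}|)$ times the total probability of the classes in $\hat{Y}$. Hence, provided that $g$ is non-negative, the best set of a given size $s$ consists of the $s$ most probable classes, and it suffices to compare $K$ candidate sets instead of all subsets. The following sketch implements this observation; the function names, the example probabilities, and the values $\delta = 1.6$ and $\gamma = 0.6$ are chosen for illustration.

```python
def bayes_optimal_set(probs, g):
    """Maximize the expected utility (27) for a utility of the family (28).

    probs: dict mapping each class to p(y | x_q); g: function of the set size.
    Assumes g(s) >= 0, so that the top-s classes are optimal for each size s.
    """
    ranked = sorted(probs, key=probs.get, reverse=True)
    best_set, best_value, mass = None, float("-inf"), 0.0
    for s, label in enumerate(ranked, start=1):
        mass += probs[label]
        value = g(s) * mass  # expected utility of the s most probable classes
        if value > best_value:
            best_set, best_value = ranked[:s], value
    return best_set, best_value


def g_risk_averse(s):
    """Member of the family g_{delta, gamma} with delta = 1.6, gamma = 0.6."""
    return 1.6 / s - 0.6 / s**2


def g_precision(s):
    """Precision (discounted accuracy)."""
    return 1.0 / s


probs = {"a": 0.5, "b": 0.45, "c": 0.05}
averse_set, averse_value = bayes_optimal_set(probs, g_risk_averse)
precise_set, precise_value = bayes_optimal_set(probs, g_precision)
```

With two almost equally probable classes, the risk-averse utility prefers returning both of them, whereas precision always favors the single most probable class.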

Another example appears in the literature on binary or multi-class classification with reject option (Herbei and Wegkamp, 2006; Linusson et al., 2018; Ramaswamy et al., 2015). Here, the prediction can only be a singleton or the full set $\mathcal{Y}$ containing $K$ classes. The first case typically gets a reward of one (if the predicted class is correct), while the second case should receive a lower reward, e.g., $1 - \alpha$. The latter corresponds to abstaining, i.e., not predicting any class label, and the (user-defined) parameter $\alpha$ specifies a penalty for doing so, with the requirement $1 - \alpha \geq 1/K$ to be risk-averse.

Set-valued predictions are also considered in hierarchical multi-class classification, mostly in the form of internal nodes of the hierarchy (Freitas, 2007; Rangwala and Naik, 2017; Yang et al., 2017a). Compared to the “flat” multi-class case, the prediction space is thus restricted, because only sets of classes that correspond to nodes of the hierarchy can be returned as a prediction. Some of the above utility scores also appear here. For example, Yang et al. (2017a) evaluate various members of the family (28) in a framework where hierarchies are considered for computational reasons, while Oh (2017) optimizes recall by fixing $|\hat{Y}|$ as a user-defined parameter. Popular in hierarchical classification is the tree-distance loss, which could also be interpreted as a way of evaluating set-valued predictions (Bi and Kwok, 2015). This loss is not a member of the family (28), however. Besides, from the perspective of abstention in case of uncertainty, it is a less interesting loss function.

Quite obviously, methods that maximize set-based utility scores are closely connected to the quantification of uncertainty, since the decision about a suitable set of predictions is necessarily derived from information of that kind. The overwhelming majority of the above-mentioned methods start from conditional class probabilities that are estimated in a classical frequentist way, so that uncertainties in decisions are of aleatoric nature. Exceptions include (Yang et al., 2017a) and (Nguyen et al., 2018), who further explore ideas from imprecise probability theory and reliable classification to generate label sets that capture both aleatoric and epistemic uncertainty.

### 3.7 Generative models

In Section 2.5, we explained that, in local learning methods with sufficient flexibility, a class in multi-class classification is approximated by the region in the instance space in which examples from that class have been seen. Aleatoric uncertainty occurs where such regions overlap, and epistemic uncertainty where no examples have been encountered so far (cf. Fig. 9). An intuitive idea is hence to consider generative models to quantify epistemic uncertainty. Such approaches typically look at the density $p(\boldsymbol{x})$ to decide whether input points are located in regions with high or low density, where the latter acts as a proxy for high epistemic uncertainty. The density can be estimated with traditional methods such as kernel density estimation or Gaussian mixtures, yet novel density estimation methods still appear in the machine learning literature. Some more recent methods in this area are isolation forests (Liu et al., 2009), auto-encoders (Goodfellow et al., 2016), and radial basis function networks (Bazargami and Mac-Namee, 2019).
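To make the density-based proxy concrete, the following minimal sketch estimates a one-dimensional density with a Gaussian kernel density estimator and flags query points in low-density regions. The training points, the bandwidth, the threshold, and the function names are illustrative assumptions and are not taken from any of the cited methods.

```python
import math

def gaussian_kde(train, bandwidth):
    """Return a function estimating the density p(x) from 1-D training points."""
    norm = 1.0 / (len(train) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum of Gaussian kernels centered at the training points.
        return norm * sum(math.exp(-0.5 * ((x - t) / bandwidth) ** 2) for t in train)

    return density

# Hypothetical 1-D training inputs, clustered around 0 and around 5.
train = [-0.4, -0.1, 0.0, 0.2, 0.5, 4.6, 4.9, 5.0, 5.3]
p = gaussian_kde(train, bandwidth=0.5)

def low_support(x, threshold=0.01):
    """Flag x as lying in a low-density region (proxy for high epistemic uncertainty)."""
    return p(x) < threshold

for query in (0.1, 2.5, 20.0):
    print(query, round(p(query), 5), low_support(query))
```

In this toy setting, a query close to the training data receives a high density, whereas queries between or far away from the two clusters fall below the threshold. How to choose the threshold is exactly the open question discussed at the end of this section.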

Density estimation is also a central building block in many anomaly and outlier detection methods. Often, a threshold is applied on top of the density to decide whether a data point is an outlier or not. For example, in auto-encoders, a low-dimensional representation of the original input is constructed, and the reconstruction error can be used as a measure of support of a data point within the underlying distribution. Methods of that kind can be classified as semi-supervised outlier detection methods, in contrast to supervised methods, which use annotated outliers during training. Many semi-supervised outlier detection methods are inspired by one-class support vector machines (SVMs), which fit a hyperplane that separates outliers from “normal” data points in a high-dimensional space (Khan and Madden, 2014). Some variations exist, where for example a hypersphere instead of a hyperplane is fitted to the data (Tax and Duin, 2004). Most one-class SVM methods have the disadvantage that outliers are assumed to lie close to the origin, and normal data points farther away.

Outlier detection is also somewhat related to the setting of classification with a reject option, e.g., when the classifier refuses to predict a label in low-density regions. For example, Ziyin et al. (2019) adopt this viewpoint in an optimization framework where an outlier detector and a standard classifier are jointly optimized. Since a data point is rejected if it is an outlier, the focus is here on epistemic uncertainty. In contrast, most papers on classification with reject option employ a reasoning on conditional class probabilities $p(y \mid \boldsymbol{x})$, using specific utility scores. These papers model aleatoric uncertainty, as discussed in Section 3.6. Thus, seemingly related papers adopting the same terminology can have very different notions of uncertainty in mind.

In recent papers, an outlier detector is often combined with a set-based prediction framework. Here, three scenarios can occur for a multi-class classifier:

1. A single class is predicted. In this case, the epistemic and aleatoric uncertainty are both assumed to be sufficiently low.

2. The “null set” (empty set) is predicted, i.e., the classifier abstains when there is not enough support for making a prediction. Here, the epistemic uncertainty is too high to make a prediction.

3. A set of cardinality bigger than one is predicted. Here, the aleatoric uncertainty is too high, while the epistemic uncertainty is assumed to be sufficiently low.

Hechtlinger et al. (2019) implement this idea by fitting per class a generative model $p(\boldsymbol{x} \mid y)$, and predicting null sets for data points that have a too low density for any of those generative models. A set of cardinality one is predicted when the data point has a sufficiently high density for exactly one of the models $p(\boldsymbol{x} \mid y)$, and a set of higher cardinality is returned when this happens for more than one of the models. Sondhi et al. (2019) propose a different framework with the same reasoning in mind. Here, a pair of models is fitted in a joint optimization problem. The first model acts as an outlier detector that intends to reject as few instances as possible, whereas the second model optimizes a set-based utility score. The joint optimization problem is therefore a linear combination of two objectives that capture, respectively, epistemic and aleatoric uncertainty components.
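The three scenarios can be illustrated with a small sketch that thresholds class-conditional densities. This is only a toy illustration of the general idea, not the procedure of Hechtlinger et al. (2019): the one-dimensional Gaussian class models, the common threshold, and all names are assumptions made for the example.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# Hypothetical per-class generative models p(x | y): one 1-D Gaussian (mean, std) per class.
class_models = {"a": (0.0, 1.0), "b": (2.0, 1.0), "c": (10.0, 1.0)}

def predict_set(x, threshold=0.05):
    """Return all classes whose class-conditional density at x reaches the threshold."""
    return {y for y, (mean, std) in class_models.items()
            if normal_pdf(x, mean, std) >= threshold}

print(predict_set(-1.0))  # supported by class "a" only: singleton prediction
print(predict_set(1.0))   # supported by "a" and "b": overlap, aleatoric uncertainty
print(predict_set(6.0))   # supported by no class: empty set, epistemic uncertainty
```

The size of the returned set thus separates the two sources of uncertainty: an empty set signals missing support, a set with several classes signals overlapping class regions.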

Generative models often deliver intuitive solutions to the quantification of epistemic and aleatoric uncertainty. Yet, like the other methods that we discussed, they also have disadvantages. An inherent problem with semi-supervised outlier detection methods is how to define the threshold that decides whether a data point is an outlier or not. Another issue is how to choose the model class for the generative model. Density estimation is a difficult problem, which, to be successful, typically requires a lot of data. When the sample size is small, specifying the right model class is not an easy task, so the model uncertainty will typically be high.

## 4 Discussion and conclusion

The importance of distinguishing between different types of uncertainty has recently been recognized in machine learning. The goal of this paper was to sketch existing approaches in this direction, with an emphasis on the quantification of aleatoric and epistemic uncertainty about predictions in supervised learning. Looking at the problem from the perspective of the standard setting of supervised learning, it is natural to associate epistemic uncertainty with the lack of knowledge about the true (or Bayes-optimal) hypothesis within the hypothesis space $\mathcal{H}$, i.e., the uncertainty that is in principle reducible by the learner and could be eliminated with additional data. Aleatoric uncertainty, on the other hand, is the irreducible part of the uncertainty in a prediction, which is due to the inherently stochastic nature of the dependency between instances $\boldsymbol{x}$ and outcomes $y$.

In a Bayesian setting, epistemic uncertainty is hence reflected by the (posterior) probability on $\mathcal{H}$: The less peaked this distribution, the less informed the learner is, and the higher its (epistemic) uncertainty. As we argued, however, important information about this uncertainty might get lost through Bayesian model averaging. In this regard, we also argued that a (graded) set-based representation of epistemic uncertainty could be a viable alternative to a probabilistic representation, especially due to the difficulty of representing ignorance with probability distributions. This idea is perhaps most explicit in version space learning, where epistemic uncertainty is in direct correspondence with the size of the version space. The approach by Senge et al. (2014) combines concepts from (generalized) version space learning and Bayesian inference.

There are also other methods for modeling uncertainty, such as conformal prediction, for which the role of aleatoric and epistemic uncertainty is not immediately clear. Currently, the field develops very dynamically and is far from being settled. New proposals for modeling and quantifying uncertainty appear on a regular basis, some of them rather ad hoc and others better justified. Eventually, it would be desirable to “derive” a measure of total uncertainty as well as its decomposition into aleatoric and epistemic parts from basic principles in a mathematically rigorous way, i.e., to develop a kind of axiomatic basis for such a decomposition, comparable to axiomatic justifications of the entropy measure (Csiszár, 2008).

In addition to theoretical problems of this kind, there are also many open practical questions. This includes, for example, the question of how to perform an empirical evaluation of methods for quantifying uncertainty, whether aleatoric, epistemic, or total. In fact, unlike for the prediction of a target variable, the data normally does not contain information about any sort of “ground truth” uncertainty. What is often done, therefore, is to evaluate predicted uncertainties indirectly, that is, by assessing their usefulness for improved prediction and decision making. An example is accuracy-rejection curves, which depict the accuracy of a predictor as a function of the percentage of rejections (Hühn and Hüllermeier, 2009): A classifier that is allowed to abstain on a certain percentage $p$ of predictions will predict on those $(100 - p)\,\%$ on which it feels most certain. If it is able to quantify its own uncertainty well, its accuracy should improve with increasing $p$, hence the accuracy-rejection curve should be monotone increasing (unlike the flat curve obtained for random abstention).
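An accuracy-rejection curve is simple to compute once each prediction comes with a confidence score. The sketch below is a minimal illustration under assumed inputs: the confidence values, the correctness indicators, and the function name are hypothetical, and rejection rates are given as fractions rather than percentages.

```python
def accuracy_rejection_curve(confidences, correct, rejection_rates):
    """Accuracy on the retained predictions for each rejection rate in [0, 1)."""
    # Sort predictions from most to least confident.
    order = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)
    curve = []
    for rate in rejection_rates:
        # Keep the most confident predictions; always retain at least one.
        keep = max(1, round(len(order) * (1.0 - rate)))
        kept = order[:keep]
        curve.append(sum(correct[i] for i in kept) / keep)
    return curve

# Hypothetical confidences and correctness indicators (1 = correct prediction).
confidences = [0.95, 0.90, 0.85, 0.80, 0.70, 0.65, 0.60, 0.55, 0.52, 0.51]
correct     = [1,    1,    1,    1,    1,    0,    1,    0,    0,    1]

curve = accuracy_rejection_curve(confidences, correct, [0.0, 0.2, 0.5])
print(curve)
```

For these toy data, accuracy rises as more of the least confident predictions are rejected, which is the behavior one expects from a predictor whose confidence is informative.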

Most approaches so far neglect model uncertainty, in the sense of proceeding from the (perhaps implicit) assumption of a correctly specified hypothesis space $\mathcal{H}$, that is, the assumption that $\mathcal{H}$ contains a (probabilistic) predictor which is coherent with the data. In (Senge et al., 2014), for example, this assumption is reflected by the definition of plausibility in terms of normalized likelihood, which always ensures the existence of at least one fully plausible hypothesis; indeed, (19) is a measure of relative, not absolute plausibility. On the one hand, it is true that model induction, like statistical inference in general, is not possible without underlying assumptions, and that conclusions drawn from data are always conditional on these assumptions. On the other hand, misspecification of the model class is a common problem in practice, and should therefore not be ignored. In principle, conflict and inconsistency can be seen as another source of uncertainty, in addition to randomness and a lack of knowledge. Therefore, it would be useful to reflect this source of uncertainty as well, for example in the form of non-normalized plausibility functions $\pi_{\mathcal{H}}$, in which the gap $1 - \sup_{h \in \mathcal{H}} \pi_{\mathcal{H}}(h)$ serves as a measure of inconsistency between $\mathcal{H}$ and the data $\mathcal{D}$, and hence as a measure of model uncertainty.

Related to this is the “closed world” assumption (Deng, 2014), which is often violated in contemporary machine learning applications. Modeling uncertainty and allowing the learner to express ignorance is obviously important in scenarios where new classes may emerge in the course of time, which implies a change of the underlying data-generating process, or the learner might be confronted with out-of-distribution queries. Some first proposals for dealing with this case can already be found in the literature (DeVries and Taylor, 2018; Lee et al., 2018; Malinin and Gales, 2018; Sensoy et al., 2018).

Finally, there are many ways in which other machine learning methodology may benefit from a proper quantification of uncertainty, and in which corresponding measures could be used for “uncertainty-informed” decisions. For example, Nguyen et al. (2019) take advantage of the distinction between epistemic and aleatoric uncertainty in active learning, arguing that the former is more relevant as a selection criterion for uncertainty sampling than the latter. Likewise, as already said, a suitable quantification of uncertainty can be very useful in set-valued prediction, where the learner is allowed to predict a subset of outcomes (and hence to partially abstain) in cases of uncertainty.

### Acknowledgment

The authors would like to thank Sebastian Destercke, Karlson Pfannschmidt, and Ammar Shaker for helpful remarks and comments on the content of this paper.

## Appendix A Background on uncertainty modeling

The notion of uncertainty has been studied in various branches of science and scientific disciplines. It has long played a major role in fields like economics, psychology, and the social sciences, typically in the guise of applied statistics. Likewise, its importance for artificial intelligence was recognized very early on (the “Annual Conference on Uncertainty in Artificial Intelligence” (UAI) was launched in the mid 1980s), at the latest with the emergence of expert systems, which came along with the need for handling inconsistency, incompleteness, imprecision, and vagueness in knowledge representation (Kruse et al., 1991). More recently, the phenomenon of uncertainty has also attracted a lot of attention in engineering, where it is studied under the notion of “uncertainty quantification” (Owhadi et al., 2012); interestingly, a distinction between aleatoric and epistemic uncertainty, very much in line with our machine learning perspective, is also made there.

The contemporary literature on uncertainty is rather broad (cf. Fig. 13). In the following, we give a brief overview, specifically focusing on the distinction between set-based and distributional (probabilistic) representations. Against the background of our discussion about aleatoric and epistemic uncertainty, this distinction is arguably important. Roughly speaking, while aleatoric uncertainty is appropriately modeled in terms of probability distributions, one may argue that a set-based approach is more suitable for modeling ignorance and a lack of knowledge, and hence more apt at capturing epistemic uncertainty.

### A.1 Sets versus distributions

A generic way of describing situations of uncertainty is to proceed from an underlying reference set $\Omega$, sometimes called the frame of discernment (Shafer, 1976). This set consists of all hypotheses, or pieces of precise information, that ought to be distinguished in the current context. Thus, the elements $\omega \in \Omega$ are exhaustive and mutually exclusive, and one of them, $\omega^*$, corresponds to the truth. For example, $\Omega = \{H, T\}$ in the case of coin tossing, $\Omega = \{\text{win}, \text{loss}, \text{tie}\}$ in predicting the outcome of a football match, or $\Omega = \mathbb{R} \times \mathbb{R}_+$ in the estimation of the parameters (expected value and standard deviation) of a normal distribution from data. For ease of exposition and to avoid measure-theoretic complications, we will subsequently assume that $\Omega$ is a discrete (finite or countable) set.

As an aside, we note that the assumption of exhaustiveness of $\Omega$ could be relaxed. In a classification problem in machine learning, for example, not all possible classes might be known beforehand, or new classes may emerge in the course of time (Hendrycks and Gimpel, 2017; Liang et al., 2018; DeVries and Taylor, 2018). In the literature, this is often called the “open world assumption”, whereas an exhaustive $\Omega$ is considered as a “closed world” (Deng, 2014). Although this distinction may look technical at first sight, it has important consequences with regard to the representation and processing of uncertain information, which specifically concern the role of the empty set. While the empty set is logically excluded as a valid piece of information under the closed world assumption, it may suggest that the true state is outside $\Omega$ under the open world assumption.

There are two basic ways of expressing uncertain information about $\omega^*$, namely, in terms of subsets and in terms of distributions. A subset $C \subseteq \Omega$ corresponds to a constraint suggesting that $\omega^* \in C$. Thus, information or knowledge expressed in this way (we do not distinguish between the notions of information and knowledge in this paper) distinguishes between values that are (at least provisionally) considered possible and those that are definitely excluded. As suggested by common examples such as specifying incomplete information about a numerical quantity in terms of an interval, a set-based representation is appropriate for capturing uncertainty in the sense of imprecision.

Going beyond this rough dichotomy, a distribution assigns a weight $p(\omega)$ to each element $\omega \in \Omega$, which can generally be understood as a degree of belief. At first sight, this appears to be a proper generalization of the set-based approach. Indeed, without any constraints on the weights, each subset $C \subseteq \Omega$ can be characterized in terms of its indicator function on $\Omega$ (which is a specific distribution assigning a weight of 1 to each $\omega \in C$ and 0 to all $\omega \notin C$). However, for the specifically important case of probability distributions, this view is actually not valid.

First, probability distributions need to obey a normalization constraint. In particular, a probability distribution requires the weights to be nonnegative and to sum to 1. A corresponding probability measure on $\Omega$ is a set-function $P: 2^{\Omega} \longrightarrow [0,1]$ such that $P(\emptyset) = 0$, $P(\Omega) = 1$, and

$$P(A \cup B) = P(A) + P(B) \tag{29}$$

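for all disjoint sets (events) $A, B \subseteq \Omega$. With $p(\omega) = P(\{\omega\})$ for all $\omega \in \Omega$ it follows that $\sum_{\omega \in \Omega} p(\omega) = 1$, and hence $p(\omega) \leq 1$. Since the set-based approach does not (need to) satisfy this constraint, it is no longer a special case.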

Second, in addition to the question of how information is represented, it is of course important to ask how the information is processed. In this regard, the probabilistic calculus differs fundamentally from constraint-based (set-based) information processing. The characteristic property of probability is its additivity (29), suggesting that the belief in the disjunction (union) of two (disjoint) events $A$ and $B$ is the sum of the belief in either of them. In contrast to this, the set-based approach is more in line with a logical interpretation and calculus. Interpreting a constraint $C \subseteq \Omega$ as a logical proposition $(\omega \in C)$, an event $A \subseteq \Omega$ is possible as soon as $A \cap C \neq \emptyset$ and impossible otherwise. Thus, the information $C$ can be associated with a set-function $\Pi: 2^{\Omega} \longrightarrow \{0,1\}$ such that $\Pi(A) = 1$ if $A \cap C \neq \emptyset$ and $\Pi(A) = 0$ otherwise. Obviously, this set-function satisfies $\Pi(\emptyset) = 0$, $\Pi(\Omega) = 1$, and

$$\Pi(A \cup B) = \max\big(\Pi(A), \Pi(B)\big) \tag{30}$$

Thus, $\Pi$ is “maxitive” instead of additive (Shilkret, 1971; Dubois, 2006). Roughly speaking, an event $A$ is evaluated according to its (logical) consistency with a constraint $C$, whereas in probability theory, an event $A$ is evaluated in terms of its probability of occurrence. The latter is reflected by the probability mass assigned to $A$, and requires a comparison of this mass with the mass of other events (since only one outcome is possible, the elementary events compete with each other). Consequently, the calculus of probability, including rules for combination of information, conditioning, etc., is quite different from the corresponding calculus of constraint-based information processing (Dubois, 2006).
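The difference between the additive rule (29) and the maxitive rule (30) can be checked on a small finite example. In the sketch below, the frame of discernment, the probability weights, and the constraint set (named `C` here) are hypothetical choices made purely for illustration.

```python
omega = {"win", "loss", "tie"}

# Probabilistic representation: hypothetical weights p(w) summing to 1.
p = {"win": 0.5, "loss": 0.3, "tie": 0.2}

def prob(event):
    """Additive measure P(A): sum of the weights of the elements of A."""
    return sum(p[w] for w in event)

# Set-based representation: hypothetical constraint C ("the match was not lost").
C = {"win", "tie"}

def poss(event):
    """Maxitive measure: 1 if A is consistent with C (A and C intersect), else 0."""
    return 1 if event & C else 0

A, B = {"win"}, {"tie"}  # two disjoint events

print(prob(A | B), prob(A) + prob(B))      # additivity (29)
print(poss(A | B), max(poss(A), poss(B)))  # maxitivity (30)
print(poss(A) + poss(B))                   # 2: the set-based measure is not additive
```

Both disjoint events are fully possible under the constraint, so their possibility degrees cannot be added without exceeding 1, whereas probability masses of disjoint events always can.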

### A.2 Representation of ignorance

From the discussion so far, it is clear that a probability distribution is essentially modeling the phenomenon of chance rather than imprecision. One may wonder, therefore, to what extent it is suitable for representing epistemic uncertainty in the sense of a lack of knowledge.

In the set-based approach, there is an obvious correspondence between the degree of uncertainty or imprecision and the cardinality of the set $C$: the larger $|C|$, the larger the lack of knowledge (in information theory, a common uncertainty measure is $\log(|C|)$). Consequently, knowledge gets weaker by adding additional elements to $C$. In probability, the total amount of belief is fixed in terms of a unit mass that is distributed among the elements $\omega \in \Omega$, and increasing the weight of one element requires decreasing the weight of another element by the same amount. Therefore, the knowledge expressed by a probability measure cannot be “weakened” in a straightforward way.

Of course, there are also measures of uncertainty for probability distributions, most notably the (Shannon) entropy

$$H(p) := -\sum_{\omega \in \Omega} p(\omega) \log p(\omega).$$

However, these are primarily capturing the shape of the distribution, namely its “peakedness” or non-uniformity (Dubois and Hüllermeier, 2007), and hence inform about the predictability of the outcome of a random experiment. Seen from this point of view, they are more akin to aleatoric uncertainty, whereas the set-based approach is arguably better suited for capturing epistemic uncertainty.

For these reasons, it has been argued that probability distributions are less suitable for representing ignorance in the sense of a lack of knowledge (Dubois et al., 1996). For example, the case of complete ignorance is typically modeled in terms of the uniform distribution in probability theory; this is justified by the “principle of indifference” invoked by Laplace, or by referring to the principle of maximum entropy (obviously, there is a technical problem in defining the uniform distribution in the case where $\Omega$ is not finite). Then, however, it is not possible to distinguish between (i) precise (probabilistic) knowledge about a random event, such as tossing a fair coin, and (ii) a complete lack of knowledge, for example due to an incomplete description of the experiment. This was already pointed out by the famous Ronald Fisher, who noted that “not knowing the chance of mutually exclusive events and knowing the chance to be equal are two quite different states of knowledge”.
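The indistinguishability of cases (i) and (ii) can be made explicit in a few lines. In the sketch below, both states of knowledge about a coin are encoded as the same uniform distribution and therefore have the same entropy. As an assumed alternative, ignorance is also encoded as a set of candidate values for the probability of heads (discretized on a grid purely for illustration); all names and values are hypothetical.

```python
import math

def shannon_entropy(p):
    """Shannon entropy (natural log) of a discrete distribution given as a dict."""
    return -sum(q * math.log(q) for q in p.values() if q > 0)

# (i) Precise knowledge: the coin is known to be fair.
fair_coin = {"H": 0.5, "T": 0.5}
# (ii) Complete ignorance about the bias, encoded via the principle of indifference.
ignorance_as_uniform = {"H": 0.5, "T": 0.5}

# Assumed alternative: represent knowledge as the set of candidate values of P(H).
known_fair = [0.5]
unknown_bias = [i / 10 for i in range(11)]

def bias_range(candidates):
    """Width of the interval of P(H) values still considered possible."""
    return max(candidates) - min(candidates)

print(shannon_entropy(fair_coin), shannon_entropy(ignorance_as_uniform))
print(bias_range(known_fair), bias_range(unknown_bias))
```

The entropy is identical in both cases, whereas the set-based encoding separates them: a single candidate for the known fair coin, and the full range of biases under ignorance.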

Another problem in this regard is caused by the measure-theoretic grounding of probability and its additive nature. For example, the uniform distribution is not invariant under reparametrization: a uniform distribution on a parameter does not translate into a uniform distribution on a nonlinear transformation of that parameter, although ignorance about the parameter implies ignorance about its transformation. For instance, expressing ignorance about the length $x$ of a cube in terms of a uniform distribution on an interval $[l, u]$ does not yield a uniform distribution of $x^3$ on $[l^3, u^3]$, thereby suggesting some degree of informedness about its volume. Problems of this kind render the use of a uniform prior distribution, often interpreted as representing epistemic uncertainty in Bayesian inference, at least debatable. This problem is inherited by hierarchical Bayesian modeling; see, however, work on “non-informative” priors (Jeffreys, 1946; Bernardo, 1979).
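The cube example can be verified with a short calculation. The sketch below assumes a side length that is uniform on an interval with hypothetical bounds 1 and 2, and computes the probability that the volume falls into the lower half of its range; a uniform distribution on the volume would give exactly 0.5.

```python
def prob_volume_below(v, l, u):
    """P(x**3 <= v) when the side length x is uniform on [l, u], with 0 <= l < u."""
    side = v ** (1.0 / 3.0)
    side = min(max(side, l), u)  # clip to the support of x
    return (side - l) / (u - l)

l, u = 1.0, 2.0                   # hypothetical bounds on the side length
v_mid = (l ** 3 + u ** 3) / 2.0   # midpoint of the volume interval [l**3, u**3]
p_lower_half = prob_volume_below(v_mid, l, u)

print(v_mid, p_lower_half)
```

Here roughly 65% of the probability mass lies on the lower half of the volume range, so the supposedly uninformative prior on the length expresses a clear preference for small volumes.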

### A.3 Sets of distributions

Given the complementary nature of sets and distributions, and the observation that both have advantages and disadvantages, one may wonder whether the two could not be combined in a meaningful way. Indeed, the argument that a single (probability) distribution is not enough for representing uncertain knowledge is quite prominent in the literature, and many generalized theories of uncertainty can be considered as a combination of that kind (Dubois and Prade, 1988; Walley, 1991; Shafer, 1976; Smets and Kennes, 1994).

Since we are specifically interested in aleatoric and epistemic uncertainty, and since these two types of uncertainty are reasonably captured in terms of probability distributions and sets, respectively, a natural idea is to consider sets of probability distributions. In the literature on imprecise probability, these are also called credal sets (Cozman, 2000; Zaffalon, 2002). An illustration is given in Fig. 14, where probability distributions on $\Omega$ are represented as points in a barycentric coordinate system. A credal set then corresponds to a subset of such points, suggesting a lack of knowledge about the true distribution but restricting it in terms of a set of possible candidates.

Credal sets are typically assumed to be convex subsets of the class of all probability distributions on $\Omega$. Such sets can be specified in different ways, for example in terms of upper and lower bounds on the probabilities of events $A \subseteq \Omega$. A specifically simple approach (albeit of limited expressivity) is the use of so-called possibility distributions and related possibility measures (Dubois and Prade, 1988). A possibility distribution is a mapping $\pi: \Omega \longrightarrow [0,1]$, and the associated measure is given by

$$\Pi: 2^{\Omega} \longrightarrow [0,1], \quad A \mapsto \sup_{\omega \in A} \pi(\omega). \tag{31}$$

A measure of that kind can be interpreted as an upper bound, and thus defines a set of dominated probability distributions:

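$$\mathcal{P}_{\pi} := \big\{ P \,:\, P(A) \leq \Pi(A) \text{ for all } A \subseteq \Omega \big\} \tag{32}$$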
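On a finite reference set, membership in the set of dominated probability distributions can be checked by enumerating all events. The following sketch does this for a hypothetical possibility distribution on three elements; the element names, the degrees of possibility, and the function names are assumptions for illustration.

```python
from itertools import combinations

omega = ["a", "b", "c"]
# Hypothetical possibility distribution; at least one element has possibility 1.
pi = {"a": 1.0, "b": 0.6, "c": 0.2}

def possibility(event):
    """Pi(A) = sup of pi over A, with Pi(empty set) = 0."""
    return max((pi[w] for w in event), default=0.0)

def events(universe):
    """All subsets of the (finite) universe."""
    return [set(c) for r in range(len(universe) + 1)
            for c in combinations(universe, r)]

def is_dominated(p, tol=1e-12):
    """Check P(A) <= Pi(A) for every event A, where P is induced by the weights p."""
    return all(sum(p[w] for w in A) <= possibility(A) + tol for A in events(omega))

print(is_dominated({"a": 0.7, "b": 0.2, "c": 0.1}))  # respects all upper bounds
print(is_dominated({"a": 0.2, "b": 0.3, "c": 0.5}))  # violates the bound on {"c"}
```

The first distribution stays below the upper bound on every event, while the second puts more mass on `c` than its possibility degree allows and is therefore excluded from the set.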

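Formally, a possibility measure $\Pi$ on $\Omega$ satisfies $\Pi(\emptyset) = 0$, $\Pi(\Omega) = 1$, and $\Pi(A \cup B) = \max(\Pi(A), \Pi(B))$ for all $A, B \subseteq \Omega$. Thus, it generalizes the maxitivity (30) of sets in the sense that $\pi$ is not (necessarily) an indicator function, i.e., $\Pi(A)$ is in $[0,1]$ and not restricted to $\{0,1\}$. A related necessity measure is defined as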