Evaluating Scalable Uncertainty Estimation Methods for DNN-Based Molecular Property Prediction
Department of Electronics, Information and Bioengineering, Politecnico di Milano, 20133 Milano, Italy; Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA; Department of Chemical Engineering, National Taiwan University, Taipei 10617, Taiwan. Abbreviations: QSPR
1 Abstract
Advances in deep neural network (DNN) based molecular property prediction have recently led to the development of models of remarkable accuracy and generalization ability, with graph convolutional neural networks (GCNNs) reporting state-of-the-art performance for this task. However, some challenges remain, and one of the most important that needs to be fully addressed concerns uncertainty quantification. DNN performance is affected by the volume and the quality of the training samples. Therefore, establishing when and to what extent a prediction can be considered reliable is just as important as outputting accurate predictions, especially when out-of-domain molecules are targeted. Recently, several methods to account for uncertainty in DNNs have been proposed, most of which are based on approximate Bayesian inference. Among these, only a few scale to the large datasets required in applications. Evaluating and comparing these methods has recently attracted great interest, but results are generally fragmented and absent for molecular property prediction. In this paper, we aim to quantitatively compare scalable techniques for uncertainty estimation in GCNNs. We introduce a set of quantitative criteria to capture different uncertainty aspects, and then use these criteria to compare MC-Dropout, deep ensembles, and bootstrapping, both theoretically in a unified framework that separates aleatoric/epistemic uncertainty and experimentally on the QM9 dataset. Our experiments quantify the performance of the different uncertainty estimation methods and their impact on uncertainty-related error reduction. Our findings indicate that ensembling and bootstrapping consistently outperform MC-Dropout, with different context-specific pros and cons. Our analysis also leads to a better understanding of the role of aleatoric/epistemic uncertainty and highlights the challenge posed by out-of-domain uncertainty.
2 Introduction
Deep Neural Network (DNN) based molecular property prediction has recently received renewed attention with the development of models capable of promising performance on large and heterogeneous datasets^{Yang2019, wu2018moleculenet, mayr2018large}. In particular, recent progress in graph convolutional neural networks^{duvenaud2015convolutional} (GCNNs), also known as message passing neural networks (MPNNs), has led to state-of-the-art performance for property prediction across a range of public and proprietary datasets^{Yang2019}, demonstrating both accuracy and generalization gains. However, some limitations still hold, and uncertainty quantification has recently been highlighted as an important direction to be investigated^{Yang2019}.
The need for an effective uncertainty quantification is driven by both intrinsic characteristics of DNN models and by peculiar features of chemical space. In general, standard DNN models do not output confidence estimates, since regression models only output a mean, while classification outputs cannot be reliably interpreted as confidence scores^{gal2016uncertainty}.
DNN performance strongly depends on the volume and the quality of training data, hence the need to assess when and to what extent a prediction can be considered reliable. While this has emerged in the context of DNNs in several heterogeneous applications, most of which are based on computer vision^{Kendall2017}, DNNs for chemistry are characterized by additional challenges. First of all, chemical training data are intrinsically biased^{Zhang2019}, since the chemical space has an extremely large variability and therefore a training dataset cannot represent the whole space. Moreover, chemical training data are often limited in volume and quality, which is directly reflected in DNN outputs. Additionally, making predictions on molecules rather different from those seen during training is often the actual goal in the field, for example in drug discovery applications. This demands good generalization performance on one side, but also being able to identify the model’s knowledge boundary, i.e. assessing to what extent the model knows what it knows.
While uncertainty estimation in this domain has been investigated in the context of shallow models in the last few years^{proppe2017reliable}, uncertainty in DNN and GCNN models for molecular property prediction has been addressed only recently and is still limited and fragmented.
Bayesian Neural Networks (BNNs) have long been studied as an effective and principled way to take into account model uncertainty in the predictions of a DNN^{neal1995bayesian}, but the intractability of exact Bayesian inference, together with the limited practicality of the approaches proposed until the last few years, has prevented the widespread adoption of these solutions in applications until recently^{gal2016uncertainty}. The work by gal2016dropout made a decisive contribution to the spread of approximate BNNs in applications, proposing Monte Carlo Dropout (MC-Dropout), a practical method based on the widely used dropout regularization technique, to account for model uncertainty. Moreover, Kendall2017 proposed a framework to separate epistemic uncertainty, which refers to uncertainty in the model predictions, from aleatoric uncertainty, which captures noise inherent in the data. MC-Dropout has been used in various applications, including, very recently, molecular property prediction^{Zhang2019, Ryu2018}.
Other techniques to efficiently approximate BNNs have been proposed since then, highlighting how finding a good trade-off between effective approximation and scalability remains an important open challenge. Notably, the ensemble-based approach proposed by lakshminarayanan2017simple constitutes a simple and scalable technique to obtain well-calibrated uncertainty estimates and has already been used in several applications across different fields (e.g., Ron18, Tomasev2019116). Moreover, even if originally proposed as a non-Bayesian alternative to estimate uncertainty in DNNs^{lakshminarayanan2017simple}, recent work highlighted how ensembling in DNNs can be traced back to Bayesian inference^{duvenaud2016early, pearce2018uncertainty, gustafsson2019evaluating}.
In parallel to the development of methods to efficiently approximate BNNs, their evaluation, and in particular their comparative assessment, has recently attracted great interest given the challenges it poses^{guo2017calibration, ilg2018uncertainty, mukhoti2018evaluating, gustafsson2019evaluating}. Indeed, we usually do not have “ground truth uncertainties”, which prevents using traditional benchmarks. Furthermore, evaluating uncertainty involves measuring the model’s unknowns and taking into account domain-specific features. First comparative assessments have been conducted for computer vision tasks^{ilg2018uncertainty, mukhoti2018evaluating, gustafsson2019evaluating}. However, results are still fragmented and no comparisons have been carried out for GCNNs in the chemistry domain, which poses specific challenges such as uncertainty generalization in the chemical space. Moreover, many metrics traditionally used to evaluate uncertain forecasts, like calibration, have been defined in a classification setting, while their extension to regression (needed for scalar molecular properties) has been discussed only recently^{kuleshov2018accurate, levi2019evaluating}.
Comparative analysis of different methods calls for multiple metrics and quantitative indices. By contrast, recent works targeting uncertainty estimation for DNN-based molecular property prediction only employ a single technique, such as confidence-error diagrams, and qualitative evaluations^{Ryu2018, Zhang2019}.
The goals of this work are as follows. First of all, we review existing methods for uncertainty estimation in DNNs/GCNNs, focusing on scalable techniques that can be used in applications. We contextualize them in a unified framework to estimate aleatoric and epistemic uncertainty, also in light of their recent interpretations, and we draw a theoretical comparison. Secondly, we introduce a set of uncertainty evaluation criteria, based both on existing benchmarks used in other fields and on chemistry-specific features. Finally, we implement the presented uncertainty estimation methods using as base model a recently published state-of-the-art GCNN for molecular property prediction (chemprop^{Yang2019}) and we experimentally compare them through the introduced evaluation criteria on the QM9 dataset for the regression task. In doing so, we highlight the behaviors characterizing all the methods in the context of GCNN-based molecular property prediction and their differences due to different approximation schemes. Furthermore, we discuss and quantify the positive impact of modelling uncertainty in the network on the prediction error.
3 Methods
This section is organized as follows. We first summarize GCNNs, which constitute the state of the art for DNN-based molecular property prediction. We then review Bayesian Uncertainty Estimation in DNNs, detailing the methods that will be tested. Finally, we discuss uncertainty evaluation and related metrics. An overview of a GCNN extended as a DNN is shown in Figure 1.
3.1 Graph Convolutional Neural Networks
In general, a GCNN used for property prediction takes as input a molecular graph $G$, where the nodes are atoms and the edges are bonds, with each atom $v$ initialized with the feature vector $x_v$ and each bond $vw$ with the feature vector $e_{vw}$, and then operates in two phases (see Figure 1). During the first phase, message passing, each atom’s feature vector is updated based on the neighbors’ features and related bond representations. This phase is executed $T$ times, iteratively, so that in the steps following the first one each atom’s feature vector is updated based on the already updated neighbors’ features. This allows distant atoms to interact in the resulting representations. At the end, the molecule representation is given by the sum of its atoms’ representations. The second phase, readout, is based on a feed-forward neural network that uses the final representation of the molecule to predict some properties of interest. Intuitively, the message passing phase allows the model to learn its own feature representations directly from data, while the readout phase allows learning the relationship between such representations and output properties. The model is trained as a whole to maximize the likelihood.
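The two phases can be illustrated with a deliberately simplified sketch. It is not the chemprop architecture: the update rule (each atom adds the sum of its neighbors' vectors, with no learned weights, bond features, or nonlinearity) and the names `message_passing` and `readout_sum` are assumptions made here purely for illustration.

```python
# Toy message passing on a molecular graph (illustrative only, no learned weights).

def message_passing(atom_features, bonds, steps):
    """Update each atom's vector with the sum of its neighbors' vectors, `steps` times."""
    neighbors = {a: [] for a in atom_features}
    for a, b in bonds:  # bonds are undirected pairs of atom indices
        neighbors[a].append(b)
        neighbors[b].append(a)
    h = {a: list(x) for a, x in atom_features.items()}
    for _ in range(steps):
        new_h = {}
        for a, x in h.items():
            msg = [0.0] * len(x)
            for n in neighbors[a]:
                msg = [m + v for m, v in zip(msg, h[n])]
            new_h[a] = [xi + mi for xi, mi in zip(x, msg)]
        h = new_h  # synchronous update: every atom reads the previous step's vectors
    return h


def readout_sum(h):
    """Molecule representation as the sum of the final atom representations."""
    dim = len(next(iter(h.values())))
    return [sum(vec[i] for vec in h.values()) for i in range(dim)]


# A three-atom chain 0-1-2 with one-hot initial features.
atoms = {0: [1.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0], 2: [0.0, 0.0, 1.0]}
bonds = [(0, 1), (1, 2)]
one_step = message_passing(atoms, bonds, steps=1)
two_steps = message_passing(atoms, bonds, steps=2)
```

After one step, atom 0 carries no information about atom 2 (its third component is still zero); after two steps it does, which is the sense in which repeated message passing lets distant atoms interact.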
Starting from this general description, several specific network improvements have been recently proposed^{coley2017convolutional, wu2018moleculenet, mayr2018large, Ryu2018, Yang2019}. Given that the goal of this paper is to evaluate large-scale uncertainty estimation, we start from a well-tested network, chemprop^{Yang2019}, that recently reported state-of-the-art performance on multiple datasets. One of the peculiar features of this network is the usage of messages associated with directed edges (bonds) instead of vertices (atoms), improving the effectiveness of the messages exchanged. Interested readers can refer to the original work by Yang2019 for the details.
The techniques explored in this paper do not depend on a specific network, and the resulting comparative performance should hold for any GCNN model. We extended the chemprop model for this work to include the uncertainty estimation and evaluation methods presented next. The software developed has been made available at https://github.com/gscalia/chemprop.
3.2 Bayesian Uncertainty Estimation
Uncertainty can be the result of inherent data noise or could be related to what the model does not yet know. These two kinds of uncertainty, aleatoric and epistemic, are reviewed in the next two sections, together with scalable techniques which have been proposed for their approximate computation. At the end, we discuss how these two kinds of uncertainty can be combined to obtain the total uncertainty of a prediction.
3.2.1 Aleatoric Uncertainty
When not explicitly modeled, the inherent observation noise is assumed constant for every observed molecule. This defines a homoscedastic aleatoric uncertainty, i.e. an uncertainty which does not vary over the data distribution and is essentially only task-dependent^{kendall2018multi}. However, this assumption does not hold in many realistic settings, where input-dependent noise needs to be modeled. For chemistry applications, it is usually difficult to obtain a large amount of high-quality data; therefore, one often needs to use multiple data sources to compose a large enough dataset to train a model. Data derived from different sources are often measured or calculated with different methods, and thus are associated with different levels of intrinsic noise. Data-dependent aleatoric uncertainty is referred to as heteroscedastic^{le2005heteroscedastic} and its importance for DNNs has been recently highlighted^{Kendall2017}, also for molecular property prediction^{Ryu2018}.
Since aleatoric uncertainty is a property of data, it can be learned directly from data by adapting the model and the loss function. Assuming an underlying Gaussian error, the model $f$ (parameters $\theta$) can estimate both the mean and the variance of the output distribution given an input $x$:
(1) $$\left[\hat{\mu}(x),\ \hat{\sigma}^2(x)\right] = f_\theta(x)$$
This does not require “noise labels” but only changing the loss function. Indeed, by performing maximum a posteriori (MAP) inference we obtain^{nix1994estimating}:
(2) $$\mathcal{L}(\theta) = \frac{1}{N}\sum_{i=1}^{N}\left[\frac{1}{2\hat{\sigma}^2(x_i)}\left(y_i - \hat{\mu}(x_i)\right)^2 + \frac{1}{2}\log\hat{\sigma}^2(x_i)\right]$$
with an additional weight decay term. Notice that, assuming a homoscedastic uncertainty (constant $\hat{\sigma}^2$), minimizing Eq. (2) coincides with minimizing the usual MSE. In practice, the last layer of the DNN is split to predict both $\hat{\mu}$ and $\hat{\sigma}^2$, and the network is trained using Eq. (2), with $\hat{\sigma}^2$ implicitly learned. The output $\hat{\sigma}^2(x)$ corresponds to the heteroscedastic aleatoric uncertainty: $U_a(x) = \hat{\sigma}^2(x)$. This is shown in Figure 2.
Interestingly, $\hat{\sigma}^2$ in Eq. (2) can be interpreted as a learned loss attenuation^{Kendall2017}. Intuitively, the network can learn to increase $\hat{\sigma}^2$ to reduce the impact of uncertain predictions on the overall loss. The second term prevents outputting an infinite uncertainty for every point.
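The loss attenuation effect can be checked numerically with a minimal sketch of a Gaussian negative log-likelihood with per-sample variance. The function name `heteroscedastic_nll` and the choice of having the model output the log-variance (a common trick for numerical stability) are assumptions of this example, not details taken from the text.

```python
import math


def heteroscedastic_nll(y_true, y_pred, log_var):
    """Mean Gaussian negative log-likelihood with a predicted variance per sample.

    log_var holds log(sigma^2) for each sample (an assumed parametrization that
    keeps the variance positive). Constant terms of the likelihood are dropped.
    """
    total = 0.0
    for y, mu, s in zip(y_true, y_pred, log_var):
        total += 0.5 * math.exp(-s) * (y - mu) ** 2 + 0.5 * s
    return total / len(y_true)


y_true = [1.0, 2.0, 10.0]
y_pred = [1.1, 2.1, 4.0]  # the third prediction is badly off

# Unit variance everywhere: the loss reduces to half of the MSE.
confident = heteroscedastic_nll(y_true, y_pred, [0.0, 0.0, 0.0])

# Predicting a large variance for the bad sample attenuates its contribution,
# at the price of the log-variance penalty.
attenuated = heteroscedastic_nll(y_true, y_pred, [0.0, 0.0, math.log(36.0)])
```

In this toy case the attenuated loss is lower than the confident one, but inflating the variance of the two accurate predictions would raise the loss through the logarithmic term, which is what keeps the predicted uncertainty finite.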
This approach is very practical, requiring minimal modifications to the original network, and can be used independently of the technique chosen to model weight uncertainty (epistemic uncertainty). Indeed, it has been used in conjunction with both MC-Dropout^{Kendall2017} and ensembling^{lakshminarayanan2017simple}.
The output distribution does not need to be necessarily Gaussian (see Figure 1 for a general case). In some cases, a Gaussian distribution might not be enough to model the output properties, and more complex models could be used, such as Mixture Density Networks (MDN)^{bishop1994mixture}, which have been recently employed to model aleatoric uncertainty in DNN^{choi2018uncertainty}, or Compound Density Networks^{kristiadi2019predictive}, which represent a continuous extension of MDN. These solutions allow more flexible output distributions at the cost of more complex loss functions that may translate into less optimized and stable training. These extensions are beyond the scope of this paper.
Being predicted as a data variance, aleatoric uncertainty cannot account for uncertainty in the model’s parameters or for other data-independent factors. Moreover, the MAP estimate does not take into account multiple plausible values for the parameters $\theta$ but only the most probable one. This can be overcome by performing Bayesian inference, as discussed next.
3.2.2 Epistemic uncertainty
In a BNN the weights $\theta$ are modeled as distributions learned from training data $\mathcal{D}$, instead of point estimates, and therefore it is possible to predict the output distribution $y^*$ of some new input $x^*$ through the predictive posterior distribution, Eq. (3).
(3) $$p(y^* \mid x^*, \mathcal{D}) = \int p(y^* \mid x^*, \theta)\, p(\theta \mid \mathcal{D})\, d\theta$$
Equation (3) allows taking into account the epistemic uncertainty because a prediction is the “weighted sum” of each outcome for each possible configuration of the model, with more probable configurations having a higher weight. The probability of a configuration, $p(\theta \mid \mathcal{D})$, depends on training data $\mathcal{D}$.
Monte Carlo integration over samples of the posterior distribution $p(\theta \mid \mathcal{D})$ can approximate the intractable integral; however, obtaining samples directly from the posterior distribution is virtually impossible for neural networks. Therefore, an approximate distribution $q(\theta) \approx p(\theta \mid \mathcal{D})$ is introduced.
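As a hedged illustration of the Monte Carlo step, let $x^*$ be a new input, $y^*$ its output, $\mathcal{D}$ the training data, $q(\theta)$ the approximate distribution, and $\theta_m$ the $m$-th of $M$ samples drawn from it ($M$ and $\theta_m$ are notation assumed here for the example). The predictive posterior can then be approximated by a simple average:

```latex
p(y^* \mid x^*, \mathcal{D})
  \;\approx\; \int p(y^* \mid x^*, \theta)\, q(\theta)\, d\theta
  \;\approx\; \frac{1}{M} \sum_{m=1}^{M} p(y^* \mid x^*, \theta_m),
  \qquad \theta_m \sim q(\theta)
```

Each term corresponds to one forward pass of the network with a sampled set of weights, so the cost of the estimate grows linearly with $M$.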
Several methods to sample from an approximate posterior $q(\theta)$ have been introduced. The pioneering work by Neal^{neal1995bayesian}, employing the MCMC variant Hamiltonian Monte Carlo (HMC), is currently considered the gold standard, but its applicability is limited to small networks and datasets. Stochastic and optimized variations have since been explored to enhance scalability at the expense of approximation performance^{NIPS2015_5891, zhang2019cyclical}.
Variational Inference (VI) is an alternative paradigm to derive $q(\theta)$. In this case, a class of approximating distributions $q_\omega(\theta)$ parameterized by $\omega$ is explicitly chosen, so that posterior approximation becomes the optimization problem of finding the $\omega$ that minimizes the Kullback-Leibler (KL) divergence with respect to the true posterior $p(\theta \mid \mathcal{D})$. The set of approximating distributions is predefined, and performance will depend on the search space and the employed optimization procedure.
VI methods constitute a standard technique in Bayesian modelling. However, scalability requirements and NN-specific features have led to the design of new methods for this class of models in the last few years^{graves2011practical, hernandez2015probabilistic, gal2016dropout, duvenaud2016early, liu2016stein}. Nonetheless, some of these approaches, such as Stein Variational Gradient Descent^{liu2016stein}, do not actually scale up to training-intensive applications such as active learning based molecular property prediction^{Zhang2019}.
MC-Dropout and ensembling-based methods are currently the most popular approaches for large-scale uncertainty estimation in NNs^{gustafsson2019evaluating} and, within chemistry, both have been very recently introduced^{Ryu2018, Zhang2019, Li2019, smith2018less, peterson2017addressing}. In addition to their scalability, these methods owe their popularity to the relative ease of implementation, since both leverage well-known techniques for regularization and accuracy improvement. For this reason, in the following we will focus on MC-Dropout and ensembling, describing the original methods, their main variations (in particular, bootstrapping), and recent improvements and interpretations.
Monte Carlo Dropout
MC-Dropout^{gal2016dropout, Kendall2017} is a simple and scalable VI approach. The algorithm consists of training a network with dropout before every layer and then, at testing time, keeping dropout active to sample $M$ outputs $\hat{y}_m$ with different random masks. Each random dropout mask corresponds to a sample from the approximate posterior $q(\theta)$. The model prediction is the mean of the different outputs, while the epistemic uncertainty can be captured by the variance of the different outputs. If the aleatoric uncertainty is also computed (as in Figure 2), the output aleatoric uncertainty is the mean of the different aleatoric uncertainty estimates $\hat{\sigma}_m^2$ (and, in this case, the $\hat{y}_m$ are substituted by the predicted means $\hat{\mu}_m$):
(4) $$\hat{y}(x) = \frac{1}{M}\sum_{m=1}^{M}\hat{y}_m(x), \qquad U_e(x) = \frac{1}{M}\sum_{m=1}^{M}\left(\hat{y}_m(x) - \hat{y}(x)\right)^2, \qquad U_a(x) = \frac{1}{M}\sum_{m=1}^{M}\hat{\sigma}_m^2(x)$$
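The aggregation described above (mean of the sampled outputs, their variance as epistemic uncertainty, and the mean of the predicted variances as aleatoric uncertainty) can be sketched in a few lines. The function name `aggregate_predictions` is hypothetical, and normalizing the variance by the number of samples rather than by that number minus one is an assumption of this sketch.

```python
from statistics import fmean, pvariance


def aggregate_predictions(means, variances=None):
    """Combine the sampled predictions for a single input.

    means:     outputs of the sampled models (dropout masks or ensemble members)
    variances: optional aleatoric variances predicted by the same models
    Returns (prediction, epistemic, aleatoric); aleatoric is None if not modelled.
    """
    prediction = fmean(means)
    # Population variance of the sampled outputs (divides by the sample count).
    epistemic = pvariance(means, mu=prediction)
    aleatoric = fmean(variances) if variances is not None else None
    return prediction, epistemic, aleatoric


pred, u_e, u_a = aggregate_predictions([1.0, 2.0, 3.0], [0.5, 0.1, 0.3])
```

The same function applies unchanged to an ensemble, where each entry comes from a separately trained model instead of a different dropout mask.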
Formally, the MC-Dropout algorithm approximates the posterior with a product of Bernoulli distributions. Indeed, given a dropout probability $p$, each unit of the network with parameters $\theta$ has probability $p$ of being dropped and set to zero. Equivalently, the approximating distribution can be seen as a mixture of two Gaussians with small variances, where the mean of one of the Gaussians is fixed at zero^{gal2016dropout, Kendall2017}.
A drawback of the MC-Dropout approach is the introduction of the dropout rate $p$ as a hyperparameter. Such a choice has an important impact both on the model’s accuracy and on the uncertainty estimation. Indeed, $p$ contributes to determining the magnitude of the epistemic uncertainty. Moreover, this complicates hyperparameter tuning, especially if $p$ is chosen to be layer-dependent.
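The dependence of the epistemic estimate on the dropout rate can be seen even in a toy setting. The sketch below assumes a single linear layer with fixed weights and dropout applied to its inputs, which is far simpler than a GCNN; the names `dropout_pass` and `mc_dropout` and all numeric values are illustrative assumptions.

```python
import random
from statistics import fmean, pvariance


def dropout_pass(x, weights, p, rng):
    """One stochastic forward pass: each input unit is dropped with probability p.

    Kept units are rescaled by 1 / (1 - p) (inverted dropout), so the expected
    output matches the deterministic model.
    """
    masked = [xi / (1.0 - p) if rng.random() >= p else 0.0 for xi in x]
    return sum(w * xi for w, xi in zip(weights, masked))


def mc_dropout(x, weights, p, n_samples=2000, seed=0):
    """Mean and variance over n_samples passes with dropout kept on at test time."""
    rng = random.Random(seed)
    outputs = [dropout_pass(x, weights, p, rng) for _ in range(n_samples)]
    return fmean(outputs), pvariance(outputs)


x = [1.0, 2.0, 3.0, 4.0]
w = [0.5, -0.25, 0.1, 0.3]
_, var_low = mc_dropout(x, w, p=0.1)
_, var_high = mc_dropout(x, w, p=0.5)
mean_off, var_off = mc_dropout(x, w, p=0.0)
```

For the same input and weights, the variance of the sampled outputs grows with the dropout rate and vanishes when dropout is switched off, so in this toy model the scale of the estimated uncertainty is set by the hyperparameter as much as by the data.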
Among the methods proposed in the literature to automatically tune the dropout probability, Concrete Dropout^{gal2017concrete} represents a practical gradient-based solution which follows dropout’s variational interpretation. This approach has demonstrated comparable performance with respect to a grid-searched dropout probability^{gal2017concrete} and an improvement in model calibration with respect to standard MC-Dropout^{mukhoti2018evaluating}. Therefore, we will compare this non-parametric version of MC-Dropout to the intrinsically non-parametric ensembling approach.
Ensembling
Ensembling has been introduced as a practical non-Bayesian alternative to estimate uncertainty by lakshminarayanan2017simple. The algorithm consists of training the same network multiple times with a random initialization, optimizing the maximum likelihood (MLE) objective each time. The output of the ensemble is given by the mean of the predictions, while the variance corresponds to the ensemble uncertainty, as in Equation (4) for MC-Dropout.
It is possible to draw a parallel between ensembling and MC-Dropout, since the latter can also be interpreted as a form of ensembling^{lakshminarayanan2017simple, srivastava2014dropout} with weight sharing between the models. Even if ensembling was originally proposed as a non-Bayesian solution^{lakshminarayanan2017simple}, recent literature has shown how, with minor modifications to the original ensembling methodology, it is possible to interpret it as a Bayesian inference technique^{duvenaud2016early, pearce2018uncertainty}. Nonetheless, even without the modifications, ensembling can be interpreted as a Bayesian approximation with an implicit distribution^{gustafsson2019evaluating}.
Ensemble methods have long been recognized as very effective to improve the predictive performance of machine learning^{dietterich2000ensemble} and deep learning models^{Goodfellow2016}, and their effectiveness for this purpose has been assessed even recently in chemistry for QSPR^{Yang2019}. The reason why ensembling allows reducing the overall error with respect to each of its components resides in the diversity of their errors. Indeed, perfectly correlated errors do not bring any advantage to the ensemble error, while perfectly uncorrelated errors reduce the expected squared error of the ensemble by a factor equal to the number of employed instances^{Goodfellow2016}. Different solutions can be easily reached by deep models given their non-convexity and the suboptimal optimization strategies employed.
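The effect of error correlation can be made explicit with a standard worked result. Assume, as an idealization, $k$ ensemble members whose errors $\epsilon_i$ have zero mean, variance $\mathbb{E}[\epsilon_i^2] = v$, and pairwise covariance $\mathbb{E}[\epsilon_i \epsilon_j] = c$ (the symbols $k$, $v$, and $c$ are introduced here for illustration). The expected squared error of the averaged prediction is then:

```latex
\mathbb{E}\left[\left(\frac{1}{k}\sum_{i=1}^{k}\epsilon_i\right)^{2}\right]
  = \frac{1}{k^{2}}\,\mathbb{E}\left[\sum_{i}\epsilon_i^{2} + \sum_{i \neq j}\epsilon_i\epsilon_j\right]
  = \frac{v}{k} + \frac{k-1}{k}\,c
```

With perfectly correlated errors ($c = v$) the result is $v$, so averaging brings no gain; with uncorrelated errors ($c = 0$) it drops to $v/k$.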
The intuition behind the interpretation of the ensemble variance as model uncertainty is simple. Different instances of the ensemble of models will tend to output similar values when the inputs are similar to the observed training data, because each instance’s weights, even if different, are optimized for those data. In contrast, as inputs become less similar to the training data, the outputs of each instance tend to be more affected by the specificities of the suboptimal solution reached, thus the higher variance. Given this, it seems clear that diversity in the ensembled models should be promoted both for error reduction and uncertainty improvement.
Traditional regularization techniques, such as weight decay and early stopping, affect the solutions reached by NNs. Recently, the usage of these techniques has been proposed not only as a practical strategy to increase ensemble diversity, but also as formal evidence for a Bayesian interpretation of ensembling^{pearce2018uncertainty, duvenaud2016early}. This is discussed in the next paragraph.
Anchored Ensembles and early stopping
Anchored ensembling^{pearce2018uncertainty} modifies traditional ensembling by leveraging the randomised MAP sampling technique. This technique exploits the fact that injecting some noise in the loss function of a MAP estimate allows sampling from the true posterior. Therefore, an ensemble of such models is a simple and scalable approach for approximate Bayesian inference.
It is known that the commonly used $L^2$ regularization for NNs (weight decay) corresponds to the MAP estimate with Gaussian priors^{Goodfellow2016}, which can be interpreted as pulling toward zero the weights for which the network does not express a strong preference. The anchored ensembling algorithm proposes to add noise to this loss function by changing the priors’ means. For regression, this leads to the following loss for the $j$th model in the ensemble:
(5) $$\mathcal{L}_j = \frac{1}{N}\lVert y - \hat{y}_j \rVert_2^2 + \frac{\lambda}{N}\lVert \theta_j - \theta_{0,j} \rVert_2^2$$
where $y$ are the target outputs, $\hat{y}_j$ and $\theta_j$ are the predictions and the parameters of the $j$th model, and $\theta_{0,j}$, which equals zero for standard regularization, is the prior’s mean of the $j$th model.
Following this approach, each model in the ensemble has its parameters anchored to a different $\theta_{0,j}$, and this promotes the diversity of the solutions reached by the different models.
An important limitation of this approach is the need for additional hyperparameters that must be tuned. They include at least the regularization coefficient $\lambda$, which expresses the ratio between data variance and weights’ prior variance, and the noise distribution from which $\theta_{0,j}$ is drawn. As originally described^{pearce2018uncertainty}, the algorithm also employs a regularization matrix $\Gamma$ instead of the scalar $\lambda$, to allow specifying per-layer regularization.
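A minimal sketch of an anchored loss follows, assuming a scalar regularization coefficient, a flat list of parameters, and anchors drawn from a Gaussian prior. The names `anchored_loss` and `sample_anchor`, the normalization of both terms by the number of data points, and the prior settings are assumptions of this example rather than details confirmed by the text.

```python
import random


def anchored_loss(y_true, y_pred, theta, theta_anchor, lam):
    """Mean squared error plus an L2 penalty pulling the weights toward an anchor."""
    n = len(y_true)
    data_term = sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n
    reg_term = lam / n * sum((t - a) ** 2 for t, a in zip(theta, theta_anchor))
    return data_term + reg_term


def sample_anchor(n_params, prior_mean=0.0, prior_std=1.0, rng=None):
    """Draw one anchor per ensemble member from an assumed Gaussian prior."""
    rng = rng or random.Random()
    return [rng.gauss(prior_mean, prior_std) for _ in range(n_params)]


y_true = [1.0, 2.0]
y_pred = [1.5, 2.5]
theta = [0.2, -0.4]

# Anchor at zero: the penalty is ordinary weight decay.
plain_decay = anchored_loss(y_true, y_pred, theta, [0.0, 0.0], lam=0.1)

# Weights sitting exactly on their anchor: only the data term remains.
at_anchor = anchored_loss(y_true, y_pred, theta, theta, lam=0.1)

# Each ensemble member would get its own anchor, fixed before training.
anchors = [sample_anchor(len(theta), rng=random.Random(j)) for j in range(3)]
```

Because every member is pulled toward a different point in weight space, the members are discouraged from collapsing onto the same solution, which is the diversity mechanism described above.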
The work presented in duvenaud2016early gives an interesting interpretation of a commonly exploited regularization method, early stopping, as approximate non-parametric Bayesian VI. In particular, the authors show how training a model to minimize the negative log-likelihood with stochastic gradient descent (SGD; the approach is also compatible with minibatches) can be interpreted as obtaining the approximate posterior $q_t(\theta)$ parametrized by the number of SGD steps $t$, and demonstrate how early stopping leads to an optimal $t$. Within this context, the initial distribution of the model is interpreted as the prior.
In practice, each training run allows sampling from the variational posterior, and therefore ensembling different random restarts allows obtaining independent samples from the posterior, which can then be used as in traditional ensembling (Eq. (4)). Even if the approach, as originally described, does not take into consideration SGD with momentum, recent work also shows how SGD with momentum can be interpreted as Bayesian inference^{mandt2017stochastic}.
Not only is this approach practical, but ensembling with early stopping is usually already exploited for property prediction in state-of-the-art systems^{Yang2019}. In this work we use it as a Bayesian alternative for uncertainty estimation.
We can draw a parallel between the two approaches described above. It has been shown that early stopping for NNs is conceptually similar to $L^2$ regularization, while an exact equivalence holds in the simpler case of a linear model with a quadratic loss function^{Goodfellow2016}. Intuitively, both approaches restrict the optimization procedure to the vicinity of a predefined value: the prior’s mean for $L^2$ regularization, the initial configuration for early stopping. In our case, we notice that these two values have the same role of prior in the two approaches^{pearce2018uncertainty, duvenaud2016early}, highlighting an interesting similarity. Even though they are based on different theoretical foundations, in practice both approaches increase the diversity in the ensembled instances by injecting some randomness into their regularization. An intrinsic advantage of early stopping over weight decay is that early stopping automatically determines the correct amount of regularization, instead of requiring external hyperparameter optimization^{Goodfellow2016}. Therefore, given the objective of this paper of evaluating scalable and practical uncertainty quantification techniques, in the following we will focus on early stopping for our extensive tests. Anchored ensembling and the impact of different priors for uncertainty estimation will be the subject of future work.
Bootstrapping
Also referred to as bagging, bootstrapping is a popular technique where ensemble members, instead of being trained on the whole dataset, are trained on different bootstrap samples of the original training set. Each bootstrap sample $\mathcal{D}_j$ is obtained by drawing $N$ samples with replacement from the dataset $\mathcal{D}$ and therefore will include a fraction of the elements in $\mathcal{D}$ together with duplicates. If the original dataset is a good approximator of the underlying distribution, each $\mathcal{D}_j$ will also be.
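The size of the retained fraction can be checked with a short sketch, under the assumption that each bootstrap sample has the same size as the original dataset. In that case each element is included with probability $1 - (1 - 1/N)^N$, which tends to $1 - 1/e \approx 0.632$ for large $N$. The function name `bootstrap_sample` and the dataset size are illustrative choices.

```python
import random


def bootstrap_sample(dataset, rng):
    """Draw len(dataset) items uniformly at random with replacement."""
    return rng.choices(dataset, k=len(dataset))


rng = random.Random(0)
data = list(range(10_000))  # stand-in for the indices of the training molecules
sample = bootstrap_sample(data, rng)

# Fraction of distinct training items that made it into this bootstrap sample.
unique_fraction = len(set(sample)) / len(data)
```

Roughly a third of the training set is therefore unseen by each ensemble member, which is the source of both the added diversity and the reduced strength of each individual model.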
Bootstrapping allows increasing the diversity in the trained instances, which, as previously discussed, is a key factor for ensembling performance. However, instead of relying on diversity in the models, bootstrapping relies on diversity in the datasets.
This approach has been successfully employed to increase the diversity in shallow ensembles, but its use within NNs might be less beneficial, since, given the dependence on a large amount of training data, each individual instance will be less powerful, thus affecting the whole ensemble performance^{lakshminarayanan2017simple}. Moreover, recent progress in NN understanding suggests these models are characterized by an extremely large number of equivalent local minima^{Goodfellow2016}, and the inherent stochasticity of SGD should already provide some degree of diversity even when trained on the same dataset.
Nonetheless, since bootstrapping has been recently described in the literature as an effective approach for NNs^{peterson2017addressing, Li2019}, we aim to compare it to full-dataset ensembling in different operating conditions to assess the differences with respect to the various evaluation metrics introduced.
A comparative overview of MC-Dropout, ensembling and bootstrapping is presented in Figure 3. As shown, each method relies on a set of predictions (explicit or implicit models), whose diversity is driven by different factors. The different predictions are used to estimate epistemic uncertainty as shown in Figure 4.
3.2.3 Total uncertainty
Aleatoric and epistemic uncertainty can be added to approximate the total uncertainty of a prediction^{gal2016uncertainty, Kendall2017}. The total uncertainty captures all the variability of the output $y$, which includes both the variability coming from our ignorance about the model (epistemic uncertainty) and the variability coming from the inherent randomness of the output (aleatoric uncertainty). We will evaluate both the separate contributions and the total uncertainty.
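One way to see why the two contributions add is the law of total variance. As a sketch, assume that each of $M$ sampled models predicts a Gaussian $\mathcal{N}(\hat{\mu}_m, \hat{\sigma}_m^2)$ and that the overall prediction is their uniform mixture (this mixture view and the symbols are assumptions made here for illustration):

```latex
\mathrm{Var}[y]
  = \underbrace{\frac{1}{M}\sum_{m=1}^{M}\hat{\sigma}_m^{2}}_{\text{aleatoric}}
  + \underbrace{\frac{1}{M}\sum_{m=1}^{M}\left(\hat{\mu}_m - \bar{\mu}\right)^{2}}_{\text{epistemic}},
  \qquad \bar{\mu} = \frac{1}{M}\sum_{m=1}^{M}\hat{\mu}_m
```

The first term is the average predicted noise and the second is the spread of the predicted means, so under this assumption their sum is exactly the variance of the mixture.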
3.3 Uncertainty Evaluation
In the following, several methods to evaluate the accuracy of uncertainty estimates are discussed. We start from existing techniques described in the literature, merging the contributions of different fields, and we extend them to account for specific features of chemical space. We aim at identifying a set of quantitative and complementary evaluation criteria. First, we introduce ranking-based methods, i.e. evaluation criteria based on the uncertainty’s capability of ordering predictions by their confidence. Secondly, we discuss calibration, i.e. “the property of predicting probability estimates representative of the true correctness likelihood”^{guo2017calibration}. Then, dispersion is introduced to complement calibration evaluation. Finally, we discuss uncertainty domain shift, i.e. the property of predicting reliable uncertainty estimates for molecules different from those seen during training.
3.3.1 Ranking based methods
A first class of evaluation indexes is based on the ranking defined by uncertainty estimates. This allows defining a confidence curve, which, in turn, allows defining several quantitative indices.
Confidence curve
One way to evaluate the uncertainty is by considering how the error varies as we remove the molecules with the highest uncertainty from the test dataset. Indeed, a meaningful uncertainty should lead to a lower error on a subset of high-confidence predictions. This concept is captured by the confidence curve, which shows how the error (with respect to a given metric, e.g. MAE or RMSE) varies as a function of the confidence percentile (or, in general, confidence quantile), i.e. the error on the subset of n% molecules (nth quantile) with the lowest uncertainty.
Ideally, we would expect a decreasing confidence curve for a meaningful uncertainty. The error corresponding to the leftmost point is simply the error on the complete test dataset; the following points correspond to the error on the subset of testing molecules belonging to the nth quantile. Other than being decreasing, another important feature of the confidence curve is its shape: a better uncertainty corresponds to a higher slope, because it allows decreasing the error faster for the same amount of removed molecules. For comparison, randomly sampling the molecules to be removed should lead to a more or less constant function.
What this kind of evaluation really assesses is the ordering of the predictions by their confidence. From this perspective, the best possible ordering is the one imposed by the true error, which has been named oracle ordering^{ilg2018uncertainty} in the literature. We can interpret the oracle ordering as an uncertainty lower bound, and the oracle confidence curve is the best confidence curve obtainable for a given model and test data.
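The construction of a confidence curve and of its oracle counterpart can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the use of MAE and the toy data are assumptions. Point i of the curve is taken to be the MAE after discarding the fraction i/n_quantiles of predictions with the highest uncertainty.

```python
def confidence_curve(errors, uncertainties, n_quantiles=100):
    """MAE on the retained predictions as the least confident ones are removed.

    Point i is the MAE after discarding the i/n_quantiles fraction of predictions
    with the highest uncertainty, so point 0 is the MAE on the full test set.
    """
    order = sorted(range(len(errors)), key=lambda k: uncertainties[k])
    ranked = [abs(errors[k]) for k in order]  # most confident predictions first
    curve = []
    for i in range(n_quantiles):
        keep = max(1, round(len(ranked) * (1 - i / n_quantiles)))
        curve.append(sum(ranked[:keep]) / keep)
    return curve

def oracle_curve(errors, n_quantiles=100):
    """Best achievable curve: predictions are ranked by their true absolute error."""
    abs_err = [abs(e) for e in errors]
    return confidence_curve(abs_err, abs_err, n_quantiles)

# Hypothetical signed errors and uncertainty estimates for four molecules
errors = [0.1, -0.4, 0.2, 0.8]
uncertainties = [0.2, 0.3, 0.5, 0.9]
curve = confidence_curve(errors, uncertainties, n_quantiles=4)
oracle = oracle_curve(errors, n_quantiles=4)
print(curve)
print(oracle)
```

Both curves start from the same value (the error on the whole test set), and the oracle curve never lies above the confidence curve, since at every point it retains the predictions with the smallest true errors.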
ConfidenceOracle error and AUCO
Since the oracle ordering corresponds to the lower bound, we can define the ConfidenceOracle error as the difference between the confidence curve for a given uncertainty estimation $u$, $\mathrm{CC}(u, q_i)$, and the oracle confidence curve, $\mathrm{CO}(q_i)$, where $q_i$ denotes the $i$th quantile. In general, we want this error to be as small as possible, therefore we introduce the Area Under the ConfidenceOracle error, AUCO, to quantify it in a single number (in the context of optical flow estimation in computer vision, the ConfidenceOracle error and the AUCO have been called Sparsification Error and Area Under the Sparsification Error curve, respectively^{ilg2018uncertainty}):
(6) $\mathrm{AUCO}(u) = \sum_{i} \left( \mathrm{CC}(u, q_i) - \mathrm{CO}(q_i) \right)$
This value allows an easy comparison between two uncertainty estimations and with respect to the oracle, where the smaller is better.
For this kind of comparison, it is important to highlight that every confidence curve depends not only on the uncertainty estimation, but also on the predictive model. Indeed, while the first defines the quantiles, the second provides the data for which each quantile error is calculated. It follows that it is not possible to directly compare two confidence curves obtained through different models to establish which uncertainty estimation is better. This is particularly relevant because often the uncertainty estimation and the predictive model are strongly tied: for example, ensembling is an uncertainty technique that also affects the predictive model.
In this regard, an added benefit of the ConfidenceOracle error is that, since it marginalizes out the oracle, it enables a fair comparison of uncertainty estimates based on different methods ^{ilg2018uncertainty}. Therefore, the ConfidenceOracle error and the AUCO will be used in the following for this purpose.
Notice that, since quantiles are used, any two uncertainty-imposed rankings that assign every prediction to the same quantile are equivalent from the point of view of the confidence curve, the ConfidenceOracle error and the AUCO, even if they differ in the relative position of the predictions inside each quantile. It follows that these are all affected by the choice of the number of quantiles. In the following, we will use percentiles, as commonly reported in the literature.
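Given a confidence curve and an oracle curve evaluated on the same quantiles, the AUCO reduces to a sum of pointwise gaps. The sketch below assumes an unnormalized sum over the quantiles; the function name and the toy curves are hypothetical.

```python
def auco(confidence, oracle):
    """Sum of the pointwise gaps between a confidence curve and its oracle curve."""
    if len(confidence) != len(oracle):
        raise ValueError("curves must be evaluated on the same quantiles")
    return sum(c - o for c, o in zip(confidence, oracle))

# Hypothetical curves on four quantiles: they differ only at the third point
confidence = [0.375, 0.30, 0.25, 0.10]
oracle = [0.375, 0.30, 0.15, 0.10]
print(auco(confidence, oracle))
```

Because the value is a plain sum, it grows with the number of quantiles, which is one way to see why results obtained with different quantile counts are not directly comparable.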
Error Drop
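As an additional quantitative measure of confidence curve quality that does not depend on the oracle, we introduce the Error Drop. This is defined as the ratio between the errors at the first and last quantiles, which should correspond to the curve’s maximum and minimum, respectively, if the confidence curve behaves correctly. Denoting by $\mathrm{CC}(u, q_i)$ the value of the confidence curve of the uncertainty estimation $u$ at quantile $q_i$, with $q_0$ the first quantile and $q_m$ the last:
(7) $\mathrm{ErrorDrop}(u) = \dfrac{\mathrm{CC}(u, q_0)}{\mathrm{CC}(u, q_m)}$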
This index measures the relative performance improvement of the model obtainable by considering only the most confident predictions instead of the entire dataset. Being a ratio, we can use it to directly compare different methods.
Decreasing coefficient
A limitation of the AUCO and Error Drop indices is that they do not take into account the monotonicity of the confidence curve. We observe that in existing evaluations this property is usually qualitatively considered but not quantitatively measured, and therefore we introduce a Decrease Ratio to capture it. Given a confidence curve $\mathrm{CC}(u, q_i)$ evaluated at the quantiles $q_0, \dots, q_m$:
(8) $\mathrm{DR}(u) = \dfrac{1}{m} \sum_{i=1}^{m} \mathbb{1}\left[ \mathrm{CC}(u, q_i) \le \mathrm{CC}(u, q_{i-1}) \right]$
where $\mathrm{DR}(u) = 1$ corresponds to a perfectly nonincreasing curve.
Rather than being a measure of uncertainty quality on its own, this coefficient captures the noise in the confidence curve and should be used in combination with the other metrics for a more comprehensive analysis.
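Both oracle-independent indices can be computed directly from the list of confidence curve values. The sketch below is an illustration under assumptions: the function names are hypothetical, and the Decrease Ratio is taken to be the fraction of consecutive steps along the curve that do not increase, which requires a curve with at least two points.

```python
def error_drop(curve):
    """Ratio between the error on the full test set and on the most confident subset."""
    return curve[0] / curve[-1]

def decrease_ratio(curve):
    """Fraction of consecutive steps along the curve that do not increase."""
    steps = list(zip(curve, curve[1:]))
    return sum(nxt <= cur for cur, nxt in steps) / len(steps)

# Hypothetical confidence curve with one noisy (increasing) step
curve = [0.375, 0.23, 0.25, 0.10]
print(error_drop(curve), decrease_ratio(curve))
```

In this toy curve the error on the most confident subset is 3.75 times smaller than on the full set, while one of the three steps increases, so the Decrease Ratio is 2/3.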
3.3.2 Uncertainty Calibration
One limitation of the evaluation methods introduced up to now is that they are all orderbased, and therefore they only take into account the ranking imposed by uncertainty estimates and true errors. While this is crucial to distinguish among various degrees of model confidence, it does not take into consideration the actual values expressed by uncertainty.
Indeed, another important aspect of uncertainty is more strictly related to the actual values it expresses, and is referred to as calibration. In general, calibration of a model refers to the property of outputting probability distributions which are consistent with the observed empirical frequencies.
Calibration evaluation of neural networks gained interest in the last two years, since it has been shown that modern neural networks, while being more accurate on one side, are less calibrated on the other^{guo2017calibration}, thus encouraging more research on the topic^{Kendall2017, lakshminarayanan2017simple}. Indeed, model calibration is orthogonal with respect to model accuracy^{lakshminarayanan2017simple}. Calibrated confidence is important for model interpretability and to establish trustworthiness with the user^{guo2017calibration}, since it allows providing uncertainty estimates which are informative not only relatively to other estimates, but also on their own with respect to model’s predictions.
Model calibration can be easily defined in the classification setting, since, given an input $x$, an output $y$ and a confidence vector $\hat{p}(x)$ over the set of classes $C$, the model is considered perfectly calibrated when the following holds:
(9) $P\left( y = c \mid \hat{p}_c(x) = p \right) = p, \quad \forall p \in [0, 1], \; \forall c \in C$
where $\hat{p}_c(x)$ is the confidence associated to the class $c$. This means that the confidence assigned to each class is consistent with the probability that a prediction belongs to that specific class.
In practice, over a finite number of samples, calibration can be captured by a Calibration Plot^{Kendall2017}, also called Reliability Diagram^{guo2017calibration}. To obtain such a plot the model predictions for all samples and classes in the test set are split into bins (each bin being a subset of predictions) in the range $[0, 1]$ and the frequency of correctly predicted labels for each bin is plotted^{niculescu2005predicting}. Perfect calibration corresponds to a diagonal line.
Calibration can vary within the same uncertainty estimator when considering different uncertainty intervals. This could happen, for example, if a model has well-calibrated low uncertainty but ill-calibrated high uncertainty, or vice versa. Such cases are highlighted by a Calibration Plot which diverges from the diagonal line in some specific confidence intervals but not in others.
Calibration in regression
Uncertainty calibration is a wellstudied topic in the context of classification, both in its traditional domain of weather forecasting^{degroot1983comparison} and, more recently, in deep learning^{guo2017calibration}. However, calibration for regression appears to be less investigated, and different solutions to evaluate it have been employed and discussed only recently^{Kendall2017, gustafsson2019evaluating, kuleshov2018accurate, levi2019evaluating}. Focusing on molecular property prediction, calibration for regression becomes crucial to account for scalar properties like formation enthalpies or energies. In the following, we will consider two different definitions which extend calibration in a regression setting: confidenceintervals based and error based calibration.

Confidence-based calibration (also called interval-based calibration)^{kuleshov2018accurate, gustafsson2019evaluating} interprets each prediction $\hat{y}$ and its uncertainty $\hat{\sigma}^2$ as the mean and the variance of a Gaussian distribution $\mathcal{N}(\hat{y}, \hat{\sigma}^2)$, respectively, and evaluates the confidence intervals thus defined. To do so, we consider symmetric intervals of varying confidence around the mean and compare them to the empirical probabilities of belonging to each interval. In a well-calibrated model, $p$% of the predictions should fall in the $p$% confidence interval. In practice, we discretize the confidence intervals and calculate the fraction of predictions falling in each interval. This allows obtaining a Calibration Plot in the $[0, 1]$ range, as in the classification case, where perfect calibration corresponds to a diagonal line.

Error-based calibration, originally described by levi2019evaluating, proposes to directly compare the uncertainty to the empirical error, as in Eq. (10).
(10) $\mathbb{E}\left[ (\hat{y} - y)^2 \mid \hat{\sigma}^2 = \sigma^2 \right] = \sigma^2, \quad \forall \sigma^2$
Here $\hat{y}$ is the prediction, $y$ the true value and $\hat{\sigma}^2$ the estimated uncertainty. This defines a perfectly calibrated model as one outputting an uncertainty matching the expected error. As in the classification case, in practice, to assess calibration it is necessary to split the test data ordered by estimated uncertainty into bins and average uncertainties and errors for each bin. It is then possible to define the Calibration Curve by plotting the MSE of the $j$th bin as a function of its average uncertainty. (In the original definition proposed in levi2019evaluating, the RMSE and the predicted standard deviations are used instead of MSE and variances. We use the latter for consistency with the other measures introduced.) Notice that, unlike the classification and confidence-interval calibration cases, here the Calibration Plot is not bounded in the $[0, 1]$ interval but ranges between 0 and the maximum uncertainty. As in the other cases, perfect calibration corresponds to a diagonal line.
Each of these two approaches has its pros and cons. Confidencebased calibration has the advantage of considering all the predictions to compute each point of the plot, thus resulting in more robust empirical calculations. However, as recently highlighted^{levi2019evaluating}, one can recalibrate practically any output distribution using this evaluation method — even an entirely uncorrelated uncertainty. While this is not a limitation for the present work, since we do not address uncertainty recalibration, it is something to be taken into consideration in general. The main advantage of errorbased calibration is that it directly ties computed uncertainty to expected error, thus reflecting what the user would expect. The main limitation is represented by the fact that, since only a fraction of uncertainty estimates contributes to each computed point, and the uncertainty estimates are not uniformly distributed, the subsets used to compute the different points are not homogeneous.
Independently from which method is used to form a Calibration Plot, it is then possible to define some metrics over it to quantify calibration performance, as discussed in the next paragraphs.
Calibration Error Curve and AUCE
We can evaluate uncertainty calibration by computing the absolute difference of the Calibration Plot with respect to perfect calibration, thus obtaining the Calibration Error Curve. This difference can be quantified by considering the area under this curve, which has been referred to as the Area Under the Calibration Error Curve, AUCE metric^{gustafsson2019evaluating}. This is a cumulative metric accounting for the total calibration error.
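For confidence-based calibration, the Calibration Plot and the AUCE can be sketched as below. The function names, the number of confidence levels and the synthetic data are assumptions, and the AUCE is approximated by an unnormalized sum of absolute gaps over the confidence levels. The toy data are drawn from the predicted Gaussians, so the empirical coverage should stay close to the diagonal.

```python
import random
from statistics import NormalDist

def interval_calibration(y_true, y_pred, variances, n_levels=100):
    """Empirical coverage of symmetric Gaussian intervals at each confidence level."""
    std_normal = NormalDist()
    levels = [i / n_levels for i in range(1, n_levels)]
    # Standardized absolute residuals |y - y_pred| / sigma
    z = [abs(t - p) / v ** 0.5 for t, p, v in zip(y_true, y_pred, variances)]
    coverage = []
    for level in levels:
        # The interval y_pred +/- half_width * sigma has nominal probability `level`
        half_width = std_normal.inv_cdf(0.5 + level / 2)
        coverage.append(sum(zi <= half_width for zi in z) / len(z))
    return levels, coverage

def auce(levels, coverage):
    """Sum of the absolute gaps between empirical coverage and the diagonal."""
    return sum(abs(c - p) for p, c in zip(levels, coverage))

# Hypothetical, well-calibrated data: errors sampled from the predicted Gaussians
rng = random.Random(0)
sigma2 = [rng.uniform(0.5, 2.0) for _ in range(2000)]
y_pred = [0.0] * len(sigma2)
y_true = [rng.gauss(0.0, s ** 0.5) for s in sigma2]
levels, coverage = interval_calibration(y_true, y_pred, sigma2)
print(auce(levels, coverage))
```

Note that every prediction contributes to every point of the plot, which is what makes this evaluation comparatively stable.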
ECE, MCE and ENCE
Rather than considering the total error, it is possible to define the Expected Calibration Error (ECE) and the Maximum Calibration Error (MCE) as follows (for the simpler binary classification case)^{naeini2015obtaining, guo2017calibration}:
(11) $\mathrm{ECE} = \sum_{i} \dfrac{|B_i|}{n} \left| \mathrm{acc}(B_i) - \mathrm{conf}(B_i) \right|, \qquad \mathrm{MCE} = \max_{i} \left| \mathrm{acc}(B_i) - \mathrm{conf}(B_i) \right|$
where $B_i$ is the $i$th bin, $|B_i|/n$ is the fraction of the $n$ predictions that fall into the bin, and acc and conf are the accuracy (i.e., the fraction of times a class is correctly predicted) and the average confidence for the bin. ECE corresponds to the average over the Calibration Error Curve, weighted by the fraction of predictions which contribute to each bin, and MCE to its maximum. MCE is especially important in high-risk applications, since it models the worst-case scenario^{guo2017calibration}.
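For the binary classification case, ECE and MCE can be computed with a few lines of code. The sketch below assumes equal-width confidence bins, as is common in the cited literature; the function name, the bin count and the toy data are hypothetical.

```python
def ece_mce(confidences, correct, n_bins=10):
    """Expected and maximum calibration error over equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, hit))
    n = len(confidences)
    ece, mce = 0.0, 0.0
    for members in bins:
        if not members:
            continue  # empty bins do not contribute
        acc = sum(hit for _, hit in members) / len(members)
        avg_conf = sum(conf for conf, _ in members) / len(members)
        gap = abs(acc - avg_conf)
        ece += len(members) / n * gap  # weighted by the fraction of predictions in the bin
        mce = max(mce, gap)            # worst bin
    return ece, mce

# Hypothetical confidences and correctness flags (1 = correct prediction)
confidences = [0.95, 0.92, 0.85, 0.65, 0.55]
correct = [1, 1, 0, 1, 0]
ece, mce = ece_mce(confidences, correct)
print(ece, mce)
```

In this toy example the worst bin is the one holding the single wrong prediction made with confidence 0.85, which therefore determines the MCE.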
This definition can be extended for regression. For confidence-interval based calibration we can compare the prediction accuracy (i.e. the fraction of times a prediction falls into the confidence interval) to the confidence. In this case the bins are equally weighted, since all the predictions contribute to all the bins. For error-based calibration acc and conf are substituted by the RMSE and the root mean uncertainty, respectively, and this discrepancy is further normalized by the uncertainty over the bin, since the error is expected to be naturally higher as the uncertainty increases^{levi2019evaluating}, thus defining the Expected Normalized Calibration Error (ENCE).
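The ENCE can be sketched as follows, under stated assumptions: predictions are sorted by estimated variance, split into bins of (nearly) equal size, and the gap between the root mean variance and the RMSE of each bin is divided by the root mean variance and averaged over the bins. The function name, the bin count and any final scaling are assumptions, so absolute values need not match those reported in the tables.

```python
from math import sqrt

def ence(y_true, y_pred, variances, n_bins=10):
    """Average over bins of |RMV - RMSE| / RMV, with bins ordered by uncertainty."""
    n = len(variances)
    order = sorted(range(n), key=lambda k: variances[k])
    total, used = 0.0, 0
    for b in range(n_bins):
        members = order[b * n // n_bins:(b + 1) * n // n_bins]
        if not members:
            continue  # can happen when there are fewer samples than bins
        rmv = sqrt(sum(variances[k] for k in members) / len(members))
        rmse = sqrt(sum((y_true[k] - y_pred[k]) ** 2 for k in members) / len(members))
        total += abs(rmv - rmse) / rmv
        used += 1
    return total / used

# Hypothetical data: the predicted standard deviations are half the actual errors
y_true = [0.0, 0.0, 0.0, 0.0]
y_pred = [1.0, 1.0, 2.0, 2.0]
underestimated = [0.25, 0.25, 1.0, 1.0]
print(ence(y_true, y_pred, underestimated, n_bins=2))
```

Plotting the per-bin RMSE (or MSE) against the per-bin root mean uncertainty (or mean variance) gives the error-based Calibration Plot itself.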
3.3.3 Sharpness and dispersion
Calibration by itself could be insufficient to fully evaluate an uncertainty estimator. Indeed, if the model always outputs the same constant uncertainty which matches the empirical accuracy over the entire distribution, we obtain a perfectly calibrated uncertainty but not a very useful one, since it does not depend on the input data at all. This concept is captured by sharpness, an uncertainty’s property orthogonal and complementary to calibration ^{gneiting2007probabilistic}. Originally defined in the classification settings, it intuitively refers to outputting probabilities which are as much as possible concentrated around specific classes (for example, in a binary setting, probabilities close to zero or to one). From another perspective, it rewards inputdependent uncertainty estimates.
This notion has been recently extended for regression^{kuleshov2018accurate, levi2019evaluating}. Following the definition introduced in levi2019evaluating, in the following the dispersion of an uncertainty estimator is defined as the coefficient of variation $c_v$ of its uncertainty estimates (interpreted as standard deviations). A higher $c_v$ corresponds to more heterogeneous estimates for different inputs.
It should be noted that, for different reasons, dispersion cannot be used as an absolute measure to quantify the performance of a given uncertainty estimator on a given dataset. First of all, a higher $c_v$ by itself does not necessarily translate into more accurate confidence estimates. Secondly, the “true” dispersion depends on the dataset and could also be naturally low for homogeneous datasets. Moreover, being a normalized measure, $c_v$ does not take into consideration the absolute uncertainty values but only their dispersion around the mean. Nonetheless, dispersion represents a useful metric to be taken into account along with calibration when comparing different methods. In particular, we are interested in verifying that an improvement in calibration of an uncertainty estimator with respect to another one does not originate from a reduction in dispersion.
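A minimal sketch of the dispersion computation is given below, assuming that the uncertainty estimates are available as variances and that the sample standard deviation is used in the numerator; the function name is hypothetical.

```python
from statistics import fmean, stdev

def dispersion(variances):
    """Coefficient of variation of the predicted standard deviations."""
    sigmas = [v ** 0.5 for v in variances]  # interpret uncertainties as standard deviations
    return stdev(sigmas) / fmean(sigmas)

print(dispersion([1.0, 4.0, 9.0]))  # heterogeneous estimates
print(dispersion([2.0, 2.0, 2.0]))  # constant uncertainty, no dispersion
```

A constant uncertainty yields zero dispersion regardless of how well it matches the average error, which is the degenerate case that calibration alone cannot detect.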
To the best of our knowledge dispersion has not been taken into account before in comparative evaluations of deep learning uncertainty estimation frameworks ^{guo2017calibration, ilg2018uncertainty, gustafsson2019evaluating, beluch2018power} or in the context of deep molecular property prediction^{Ryu2018, Zhang2019}, thus further motivating its experimental evaluation in the following.
3.3.4 Domain shift
An important feature that should characterize a wellbehaving uncertainty estimate is its ability to correctly manage domain shifts, i.e., its performance in an outofdomain context, which corresponds to a test set that is markedly different to the one seen during training. While this behavior — which implies a low variance of the model — is of first importance for every model’s output, it becomes even more crucial for uncertainty estimates. Indeed, it is well known that every learned model will degrade at some point on unseen samples as they become more and more different with respect to those seen during training, but a wellcalibrated uncertainty should be able to correctly identify this “knowledge boundary” and to assess if and to what extent the model predictions can be considered reliable. This property is orthogonal to the other uncertainty evaluation metrics and therefore needs to be separately evaluated.
The importance of calibration with respect to domain shifts has been highlighted in other contexts^{lakshminarayanan2017simple}, but its role in the chemical domain is even more prominent. Indeed, generalization power is a requirement in key applications such as drug discovery, and the intrinsic high variability of chemical space makes it challenging to fulfill this requirement. Despite this prominent role, the evaluation of outofdomain uncertainty performance in the chemistry field appears to be absent^{Zhang2019} or very limited^{Ryu2018}, thus demanding a more extensive analysis.
To achieve this goal, we employ the recently introduced scaffold splitting technique^{wu2018moleculenet, Yang2019}. Molecules are split into bins based on their Murcko scaffold, with each bin belonging to only one among training, validation and test set^{Yang2019}. Scaffold splitting has been successfully used to evaluate models under the more realistic assumption of significantly diverse training and testing distributions, thus overcoming the traditional random splitting. It has been demonstrated to be more challenging for a model and capable of simulating the chronological split which characterizes real scenarios of molecular property prediction^{Yang2019}. To the best of our knowledge, scaffold splitting has never been used to evaluate outofdomain uncertainty estimation procedures before.
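The grouping logic behind scaffold splitting can be sketched as below. The scaffold strings are assumed to be precomputed (one common way is RDKit's `MurckoScaffold.MurckoScaffoldSmiles`, which is not called here), and the greedy assignment of the largest scaffold groups to the training set is an assumption that may differ from the implementation used in the cited works. The function name and toy inputs are hypothetical.

```python
from collections import defaultdict

def scaffold_split(scaffolds, fractions=(0.8, 0.1, 0.1)):
    """Assign whole scaffold groups to train/validation/test index lists."""
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)
    # Largest groups first, so that big scaffold families end up in training (assumption)
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(scaffolds)
    train, valid, test = [], [], []
    for members in ordered:
        if len(train) + len(members) <= fractions[0] * n:
            train.extend(members)
        elif len(valid) + len(members) <= fractions[1] * n:
            valid.extend(members)
        else:
            test.extend(members)
    return train, valid, test

# Hypothetical scaffold labels for ten molecules
scaffolds = ["a"] * 8 + ["b", "c"]
train, valid, test = scaffold_split(scaffolds)
print(train, valid, test)
```

Since groups are never divided, no scaffold appears in more than one of the three sets, which is what produces the shift between training and test distributions.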
More specifically, we are interested in reevaluating all the already introduced metrics — AUCO, AUCE, etc. — also in the outofdomain context obtained through scaffoldsplitting. We will pay particular attention to outofdomain calibration, since it can measure to what extent a model knows what it does not know. We are interested in quantifying domain shift uncertainty performance, i.e., the ratio between indomain and outofdomain metrics, also in relation to domain shift error (the ratio between indomain and outofdomain error) to assess if and to what extent error generalization and uncertainty generalization are characterized by the same behavior.
4 Experiments
We first describe the target dataset, followed by a description of the experimental procedure.
4.1 Data
The formation enthalpies of 131,722 stable organic molecules composed of C, H, O, and N atoms were used to train and test the model. These reference data were derived from the QM9 dataset, which was calculated at the B3LYP/6-31G(2df,p) level of theory with the rigid rotor-harmonic oscillator approximation (RRHO).^{Ramakrishnan2014} As discussed in previous work, these calculated enthalpies are themselves associated with significant errors, primarily due to weaknesses of B3LYP such as the absence of long-range dispersion interaction but also the lack of rotor or conformer corrections in the calculations.^{cohen_challenges_2012, simm_systematic_2016, proppe_uncertainty_2017, li_thermodynamics_2016} We note that it is possible to use a small amount of high-accuracy coupled cluster training data via a transfer learning approach to minimize the influence of DFT errors. Interested readers are referred to the recent work of Grambow et al.^{grambow_accurate_2019}. In this work, we use the QM9 data as is without any attempt to correct its errors in order to investigate the effects of aleatoric uncertainties. The enthalpy values used for training and testing can be found in the Supporting Information.
We used an 80:10:10 split for training, validation, and test sets, both in the in-domain and out-of-domain settings. Random splitting has been used for in-domain analysis, while, as previously discussed, scaffold splitting has been used for out-of-domain analysis. In both cases, the same split has been employed to test all the methods.
4.2 Experimental Procedure
We evaluated the uncertainty estimation techniques previously reviewed using the methods previously introduced. Other than including diagrams, we evaluated the considered methods quantitatively, as follows:

For rankingbased evaluation we use the Area Under the ConfidenceOracle error (AUCO) as a measure of total discrepancy with respect to the best possible ranking, the Error Drop as a measure of total error reduction for highconfident predictions and the Decrease Ratio to assess the monotonicity of confidence curves.

For confidence-based calibration we use the Area Under the Calibration Error Curve (AUCE) as a measure of total discrepancy with respect to perfect calibration and the Maximum Calibration Error (MCE) to account for the worst-case scenario. We did not use the Expected Calibration Error (ECE) in our tests because it does not add significant information to AUCE for confidence-based calibration.

For errorbased calibration we use the Expected Normalized Calibration Error (ENCE) as a measure of the (normalized) total discrepancy with respect to perfect calibration.

For dispersion evaluation we use the coefficient of variation $c_v$.

For domainshift performance we evaluated and compared all the above metrics also in an outofdomain setting obtained using scaffoldsplitting, as previously detailed.
We focused on the evaluation of complete and scalable uncertainty frameworks, therefore we compared MCDropout (with Concrete Dropout, as previously discussed), ensembling and bootstrapping. As previously mentioned, these approaches have been designed to model NNweight uncertainties, therefore they are directly related to epistemic uncertainty estimation. However, they have been used and described in the literature in conjunction with aleatoric uncertainty estimation to form complete frameworks ^{gal2016dropout, lakshminarayanan2017simple}, and this is the way we tested them in this work. In addition to evaluating total uncertainty, we have also separately evaluated aleatoric and epistemic uncertainty for each methodology. All the different methods use the same aleatoric approximation scheme but the way epistemic uncertainty is modeled affects also aleatoric uncertainty results, thus resulting in different outputs (ref. Eq. (4)). This also allows drawing conclusions about aleatoric uncertainty which do not depend on the uncertainty model used for the NNweights.
4.2.1 Implementation and experimental setting
We implemented the tested uncertainty estimation methods starting from the base model made available in Yang2019, based on the PyTorch framework.
We performed hyperparameter optimization using the hyperopt package (https://github.com/hyperopt/hyperopt) on the base model and we used the same hyperparameters for all the uncertainty methods tested. The hyperparameters are: depth size for the convolutional layer , depth size for the fully connected layer , hidden size . The number of instances is 15 for ensembling and bootstrapping and 150 for MCDropout. MCDropout employs weight sharing between different instances and does not require a separate training for each one, allowing the usage of more instances in practice; therefore, this difference in the number of instances reflects realistic conditions of use.
All the results obtained are inevitably a function of the number of instances used, since the approximation performance of all the tested methods depends on it. The number of instances chosen for the experiments is in line with what has been described in the literature; additionally, preliminary experiments varying the number of instances did not report significant variations in the outcomes, except for an asymptotically smaller general improvement in all the metrics for all the tested methods.
5 Results
We first detail error performance for the considered models. Next, we present results for uncertainty estimation evaluation.
5.1 Error
Table 1 lists the mean absolute error (MAE) for the considered models both in the indomain and outofdomain settings.
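Table 1: MAE (kcal/mol) of the considered models in the in-domain and out-of-domain settings.
Model           In-domain   Out-of-domain
Base model      1.04        1.77
MCDropout       0.97        1.49
Ensembling      0.74        1.21
Bootstrapping   0.89        1.43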
The baseline is the chemprop model^{Yang2019} without any uncertainty estimation. We notice how extending it to include uncertainty always leads to reductions in MAE, regardless of the approximation method used (MCDropout, ensembling and bootstrapping). These improvements, often underestimated, are due to both aleatoric and epistemic estimation in the model. Indeed, modelling aleatoric uncertainty implicitly reduces the impact of noisy training samples, thus improving predictive performance. Modelling epistemic uncertainty allows averaging multiple weight configurations, avoiding overfitting, and overconfident estimations, with a positive impact on predictions. These two contributions can independently reduce the overall MAE but act synergistically when both are modeled. We can notice that, independently from the model, the reduction in outofdomain error is higher than indomain error.
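The statement that the out-of-domain reduction exceeds the in-domain one can be checked directly from the values in Table 1. The short script below computes the relative MAE reduction with respect to the base model; the variable names are illustrative.

```python
# MAE in kcal/mol from Table 1: (in-domain, out-of-domain)
mae = {
    "Base model": (1.04, 1.77),
    "MCDropout": (0.97, 1.49),
    "Ensembling": (0.74, 1.21),
    "Bootstrapping": (0.89, 1.43),
}

base_in, base_out = mae["Base model"]
reductions = {}
for name, (m_in, m_out) in mae.items():
    if name == "Base model":
        continue
    # Relative MAE reduction with respect to the base model
    reductions[name] = (1 - m_in / base_in, 1 - m_out / base_out)
    print(f"{name}: {reductions[name][0]:.1%} in-domain, "
          f"{reductions[name][1]:.1%} out-of-domain")
```

The relative reductions are roughly 7% versus 16% for MCDropout, 29% versus 32% for ensembling and 14% versus 19% for bootstrapping, so the out-of-domain gain is the larger one for every method.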
The analysis of improvements in MAE is not the main goal of the present paper, but its assessment is useful for the following discussion and should be kept in consideration as an important byproduct of Bayesian uncertainty modelling.
5.2 Uncertainty estimation
5.2.1 Rankingbased evaluation
The confidence curves for the different methods and the related ConfidenceOracle errors are shown in Fig. 5 and Fig. 6, respectively. The derived AUCO, Error Drop and Decrease Ratio metrics for each case are reported in the first three lines of Table 2.
We can observe that all the curves are mostly decreasing, therefore each method can establish a qualitatively meaningful ranking of the predictions by their uncertainty. However, as also highlighted by the Decrease Ratio, MCDropout does not lead to perfectly nonincreasing curves, especially for epistemic uncertainty and at high percentiles.
In absolute terms, ensembling allows reaching the lowest MAE in the highest percentiles in both the components and the total uncertainty. Interestingly, the epistemic uncertainty estimated by bootstrapping allows reaching a MAE comparable to ensembling in the highest percentiles (0.21 versus 0.19 kcal/mol in the top 5%), even if the initial MAE on the whole dataset is significantly worse (0.89 versus 0.74 kcal/mol). This is quantitatively measured by a higher or similar error drop of bootstrapping, despite the overall higher MAE.
To compare the relative performance of the different approaches we need to consider the ConfidenceOracle errors and the AUCO. Globally, ensembling results in the lowest errors, even if the epistemic uncertainty estimated by bootstrapping leads to comparable performance. In contrast, the aleatoric component of bootstrapping leads to a significantly worse performance than ensembling. MCDropout results in larger errors with respect to the other considered approaches, in particular for epistemic uncertainty.
The total uncertainty does not always result in a lower (i.e., better) AUCO than the two separate contributions. While this is true for ensembling, it is not true in the other cases. In general, in ranking-based evaluation, if the epistemic uncertainty is much larger than the aleatoric one or vice versa, the total uncertainty curve will approximate the dominant contribution. Nevertheless, as we can observe, in these cases the total uncertainty appears to approximate the best performing one in terms of AUCO.
5.2.2 Calibration and dispersion
Confidencebased calibration
The confidence-based calibration plots are shown in Fig. 7. The derived AUCE and MCE metrics are reported in lines four and five of Table 2, respectively.
Results for epistemic uncertainty vary. Ensembling is characterized by calibrated empirical coverages in the low probability range (), but increasingly underestimated coverages in the high probability range. Bootstrapping has a similar pattern but is better calibrated overall, with a broader interval of calibrated empirical coverages () and less underestimated coverages for higher values. This is quantified by the AUCE, which captures the overall behavior and is halved for bootstrapping with respect to ensembling. MCDropout epistemic uncertainty is largely underestimated.
In general, aleatoric uncertainty appears to be underestimated, independently from the underlying uncertainty model of the NN weights. The possible reasons for a miscalibrated aleatoric uncertainty are discussed in the last section.
Total uncertainty does not result in significant improvements to AUCE compared to considering epistemic uncertainty only in any of the cases, leading instead to slightly worse performance for ensembling and bootstrapping. By contrast, MCE is improved in those cases due to the combination of an underestimated aleatoric uncertainty and an overestimated epistemic uncertainty, which results in more stable curves. This also highlights the need of multiple metrics to quantify calibration.
Errorbased calibration
The error-based calibration plots are shown in Fig. 8. The derived ENCE is reported in line six of Table 2.
These plots offer a complementary view of uncertainty performance with respect to the confidencebased plots already shown. Indeed, rather than considering all the predictions at the same time, each dot only represents a subset of predictions in direct relation with the average error.
Aleatoric uncertainty on its own significantly underestimates the error in all the cases. Epistemic uncertainty appears to be a better error approximator for ensembling and bootstrapping, with a lead of the latter (30.5 vs 64.7 ENCE), but not for MCDropout. Total uncertainty always reports a better ENCE than the two individual contributions. Uncertainty tends to be underestimated in all of the considered cases.
Compared to confidence-based calibration, this kind of plot is less stable, especially for high values of uncertainty. This is due to i) the fact that the error is expected to be naturally higher as uncertainty increases (a property already taken into account in the ENCE computation) and ii) the fact that high uncertainty values are more sparse. Overall, error-based calibration confirms the main results of confidence-based calibration: bootstrapping estimates appear to be better calibrated and the total uncertainty is a better error approximator.
Interestingly, we notice that all the plots, independently from their distance to the diagonal line, are characterized by strongly correlated patterns (correlation for ensembling and bootstrapping, for MCDropout).
Dispersion
The dispersion coefficient $c_v$ is reported in the last line of Table 2. Results show no significant variations between the different methods, except for a slightly higher $c_v$ for MCDropout epistemic estimates. In general, epistemic uncertainty appears to be more dispersed than aleatoric uncertainty for all the considered methods.
Table 2: Uncertainty evaluation metrics in the in-domain setting (random splitting) for epistemic (Epi.), aleatoric (Ale.) and total (Tot.) uncertainty.
              MCDropout                  Ensembling                 Bootstrapping
Metric        Epi.     Ale.     Tot.     Epi.     Ale.     Tot.     Epi.     Ale.     Tot.
AUCO          46.72    31.55    31.72    18.79    20.83    17.03    19.18    25.08    19.05
Error drop    1.67     2.55     2.62     6.72     4.93     7.40     6.85     5.23     6.85
Decr. Ratio   0.95     0.98     0.96     1.0      0.99     1.0      1.0      1.0      1.0
AUCE          44.79    29.44    28.74    2.62     19.90    3.62     1.36     31.31    1.69
MCE           0.85     0.50     0.48     0.087    0.33     0.061    0.051    0.53     0.044
ENCE          4001.6   416.5    394.7    64.7     291.4    34.0     30.5     554.6    24.8
$c_v$         0.97     0.50     0.49     0.74     0.51     0.67     0.74     0.45     0.71
5.2.3 Outofdomain uncertainty
The same plots already discussed for random splitting are shown for the outofdomain case. The derived metrics are summarized in Table 3. In the following, the main differences with respect to random splitting are highlighted.
Confidence curves and ConfidenceOracle errors for the outofdomain case are reported in Fig. 9 and Fig. 10, respectively. In absolute terms, as expected all the related outofdomain indices (AUCO, error drop and decrease ratio) have deteriorated with respect to indomain indices for all the considered methods. The relative performance of MCDropout with respect to ensembling and bootstrapping are comparable, with these last two outperforming the first. The relative comparison between ensembling and bootstrapping results in qualitatively similar trends but quantitative differences which turn out to be strongly reduced. Ensembling has the lowest AUCO for both epistemic and aleatoric uncertainty, bootstrapping has comparably low scores and it also has comparably or higher error drops. The results for these two methods turn out to be more similar than in the indomain setting. In general, the rankingbased evaluation in the outofdomain setting does not highlight drastic changes other than an expected worsening of all the indices for all the methods.
The calibration-confidence analysis (Fig. 11 and Fig. 12) highlights a drastic change with respect to in-domain results for epistemic estimates using ensembling and bootstrapping. In particular, while in-domain empirical coverages tend to be calibrated or slightly overestimated, except at high confidence levels, out-of-domain empirical coverages tend to be always underestimated. This means that, on average, uncertainty estimates in an out-of-domain setting are lower than they should be, while in-domain uncertainty estimates appear to be better calibrated or slightly higher than they should be. Aleatoric estimates are less affected than epistemic ones in terms of AUCE and MCE for all the considered methods. Calibration-error analysis confirms the underestimation trend of out-of-domain epistemic estimates, particularly affecting high-error predictions. The impact of out-of-domain uncertainty underestimation is further discussed in the next section.
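The link between underestimated uncertainty and low empirical coverage can be made concrete with a short sketch. It assumes a Gaussian predictive distribution for each molecule, so that the central interval at confidence level p is the prediction plus or minus z times the predicted standard deviation, and it summarizes the gap between empirical coverage and p by its mean (an AUCE-like quantity) and its maximum (an MCE-like quantity). The scaling of the indices reported in the tables may differ, and all names below are illustrative.

```python
from statistics import NormalDist


def coverage_curve(y_true, y_pred, sigma, levels):
    """Fraction of targets inside the central Gaussian interval at each level."""
    std_normal = NormalDist()
    coverages = []
    for p in levels:  # each level must lie strictly between 0 and 1
        # Half-width of the central p-interval in units of sigma.
        z = std_normal.inv_cdf(0.5 + p / 2)
        inside = sum(
            1 for y, m, s in zip(y_true, y_pred, sigma) if abs(y - m) <= z * s
        )
        coverages.append(inside / len(y_true))
    return coverages


def calibration_gaps(levels, coverages):
    """Return (mean absolute gap, maximum absolute gap) between coverage and level."""
    gaps = [abs(c - p) for p, c in zip(levels, coverages)]
    return sum(gaps) / len(gaps), max(gaps)


# Toy data: the same errors evaluated with adequate and with too-small sigmas.
targets = [0.1, 0.5, 1.0, 2.0]
predictions = [0.0, 0.0, 0.0, 0.0]
levels = [0.5, 0.9]
adequate = coverage_curve(targets, predictions, [1.0] * 4, levels)
too_small = coverage_curve(targets, predictions, [0.1] * 4, levels)
```

In this toy case the too-small standard deviations give coverages below every confidence level, which is the pattern described above for out-of-domain epistemic estimates.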
Overall, bootstrapping has a slight advantage over ensembling in terms of AUCE, MCE and ENCE, driven both by better epistemic uncertainty estimates (even if the magnitude of the difference is smaller than in-domain) and by better aleatoric uncertainty estimates (in contrast to in-domain results). This highlights another difference with respect to the in-domain analysis, which is further discussed in the next section.
An additional difference pointed out by calibration analysis concerns the total uncertainty. While in-domain total uncertainty turns out to be similar to or slightly worse than the two individual components, out-of-domain total calibration appears to be better than that of the two individual components for all the considered metrics.
In terms of dispersion, we observe a global increase for all the methods and uncertainty types.
                    MC-Dropout                 Ensembling                 Bootstrapping
                    Epi.     Ale.     Tot.     Epi.     Ale.     Tot.     Epi.     Ale.     Tot.
AUCO                73.35    52.29    52.93    34.12    38.68    33.38    36.27    40.05    35.81
Error drop          1.64     1.67     1.75     3.02     2.02     2.88     2.91     3.18     2.86
Decr. Ratio         0.86     0.90     0.91     0.99     0.95     0.98     0.98     0.99     1.0
AUCE                47.36    37.13    36.70    10.18    32.81    8.10     9.62     32.57    7.50
MCE                 0.92     0.65     0.64     0.16     0.56     0.13     0.14     0.55     0.11
ENCE                13936.2  707.5    687.2    78.2     480.4    61.2     65.1     429.9    50.9
Dispersion coeff.   1.63     0.76     0.75     0.65     0.84     0.65     0.81     0.71     0.79
6 Discussion
The goal of this section is to analyze and discuss the results presented in the previous section, focusing on conclusions that can be drawn by comparing and integrating outcomes related to different uncertainty models and evaluation metrics.
Results show that ensembling and bootstrapping consistently outperform MC-Dropout both in the in-domain and out-of-domain scenarios for all the considered metrics. This is in line with results already presented for image classification/regression^{lakshminarayanan2017simple, beluch2018power} and optical flow estimation^{ilg2018uncertainty, gustafsson2019evaluating}, confirming this trend also for GCNN-based molecular property prediction. In contrast to previous comparisons, which used the “base” version of MC-Dropout^{lakshminarayanan2017simple, ilg2018uncertainty, gustafsson2019evaluating}, we employed Concrete MC-Dropout, which was independently shown to be superior to standard MC-Dropout^{gal2017concrete, mukhoti2018evaluating} but has not been directly compared to ensembling and bootstrapping before.
The comparison between ensembling and bootstrapping requires a deeper analysis and raises multiple interesting observations. On the one side, ensembling has an advantage for total MAE, AUCO and aleatoric calibration, especially in the in-domain setting. On the other side, bootstrapping often leads to higher error drops (i.e. it allows reducing the MAE proportionally more when we consider small percentages of high-confidence predictions), has an advantage in epistemic calibration in the in-domain setting and is characterized by an overall better calibration in the out-of-domain setting. This behavior can be explained by considering the effects of substituting each training dataset with a bootstrap sample. Each network only sees a fraction of the starting training dataset, thus increasing individual and ensembled MAE. Since aleatoric uncertainty is estimated from data, it follows a trend similar to MAE and it degrades. However, bootstrapping promotes diversity in ensembled models, which is key for epistemic uncertainty estimation, thus improving its calibration. We can argue that as training size increases (as long as the target molecular space is kept unchanged) bootstrapping becomes more advantageous, because each bootstrap sample becomes a better approximation of the underlying distribution, thus avoiding losses in MAE and aleatoric calibration in each single instance and in the ensembled model, while keeping an advantage in epistemic calibration. Moreover, as we have observed, bootstrapping becomes globally more calibrated than ensembling in the out-of-domain setting. This can be explained by a gain in generalization power given by the additional diversity of bootstrapping.
Interestingly, this generalization power especially translates into calibration performance, and only to a lesser extent into ranking-based indices and total MAE, which turn out to be relatively improved in the out-of-domain setting with respect to ensembling, but not better than the latter in absolute terms. Dispersion analysis allows checking that improvements in calibration are not the result of losses in uncertainty heterogeneity.
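The statement that each bootstrapped network sees only a fraction of the training set can be checked numerically. The sketch below assumes that a bootstrap sample is drawn with replacement and has the same size as the original training set, which is the usual convention but is an assumption here. Under that assumption the expected fraction of distinct molecules in a sample is 1 - (1 - 1/n)^n, which approaches 1 - 1/e (about 0.632) for large n. The names and the seed are illustrative.

```python
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def unique_fraction(indices, n_samples):
    """Fraction of the original dataset that appears at least once."""
    return len(set(indices)) / n_samples


rng = random.Random(0)  # fixed seed, only for reproducibility of the sketch
n = 10_000
# One bootstrap sample per (hypothetical) ensemble member.
fractions = [unique_fraction(bootstrap_indices(n, rng), n) for _ in range(5)]
# Probability that a given molecule is drawn at least once in n draws.
expected = 1 - (1 - 1 / n) ** n
```

Each ensemble member therefore trains on roughly two thirds of the distinct molecules, which is consistent with the loss in MAE and the gain in diversity discussed above.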
In previous studies for CNN-based image regression/classification, bootstrapping did not report significant improvements over ensembling^{lakshminarayanan2017simple}. We can speculate that this difference is due to i) the peculiarities of the chemical space, characterized by a larger intrinsic variability that can be exploited by bootstrapping, and ii) variations in the training size, as previously discussed. Results obtained for bootstrapping justify its recent use in active learning methodologies for molecular property prediction^{Li2019}, where model uncertainty (epistemic uncertainty) and generalization power are required.
Even if the methods investigated in this work jointly model aleatoric and epistemic uncertainties, their separate evaluation carried out in the previous section allows directly comparing the two. Both appear to be effective for ranking-based evaluation, with a potential complementary improvement of total uncertainty. From a calibration point of view, good performance has been reached using epistemic uncertainty alone, while aleatoric uncertainty individually turns out to always be largely underestimated, even if it is characterized by a high correlation with error. In any case, total uncertainty is as calibrated as the individual components, and even more calibrated in the out-of-domain setting. We can explain this behavior of calibration as follows.
Aleatoric uncertainty should correlate with the noise in the observed variable, while epistemic uncertainty with the error in the trained function. However, the only observable error (MSE) includes both these contributions. Therefore, we can speculate that in this specific case epistemic uncertainty appears to be more calibrated than aleatoric uncertainty individually because the total error is primarily due to the model’s approximating function rather than the noise in the data. In other contexts, the individual contributions to total error could vary, and the situation could be reversed, but MSE should always be better approximated by total uncertainty. Evaluating the individual contributions can be helpful in pinpointing their relative importance in different settings. Moreover, even if MSE is better approximated by total uncertainty, applications could require taking into account only one of the two components for its specific meaning or to maximize some specific metric. This kind of analysis is not the main goal of this work and deserves further investigation.
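The relation between the two components and the total uncertainty can be sketched for a single molecule. The sketch assumes that each of the M sampled or ensembled networks outputs a predictive mean and a predictive variance, that aleatoric uncertainty is the average of the predicted variances, that epistemic uncertainty is the variance of the predicted means, and that the total is their sum (the variance of an equally weighted Gaussian mixture). This is the common convention for such models and may differ in detail from the framework used in this work. The function name is illustrative.

```python
from statistics import fmean, pvariance


def decompose_uncertainty(member_means, member_variances):
    """Split the predictive variance of one molecule into its two components.

    member_means: predicted mean from each network.
    member_variances: predicted (data-noise) variance from each network.
    """
    prediction = fmean(member_means)
    # Average noise level predicted by the networks.
    aleatoric = fmean(member_variances)
    # Disagreement between the networks about the mean (population variance).
    epistemic = pvariance(member_means)
    return prediction, aleatoric, epistemic, aleatoric + epistemic


# Three hypothetical networks predicting the same molecule.
pred, ale, epi, tot = decompose_uncertainty([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

Because the total is the sum of the two terms, comparing only one of them with the observed squared error leaves out the other contribution, which is one way to read the individual underestimation discussed above.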
Domain shift analysis is characterized by mixed results. On the one side, ranking-based performance does not appear to be particularly affected by out-of-domain molecules: the AUCO decreases proportionally to the (inevitable) decrease in total MAE, while the error drop is even larger than in the in-domain setting. On the other side, calibration performance drastically changes and out-of-domain uncertainty appears to be consistently underestimated. The latter result is in line with what has been recently observed^{Li2019}, but the analysis carried out in this work has allowed the quantification of this behavior and its confirmation in a more general setting with multiple uncertainty methods being employed. As the model is tested on molecules that differ from those seen during training, the error increases without the uncertainty being able to fully capture this rise, thus leading to lower than expected estimates in this case. Out-of-domain uncertainty calibration should be a major focus of future development in uncertainty estimation methodologies for molecular property prediction.
Up to now, we mainly compared uncertainty models. However, the obtained results also allow for the comparison of different evaluation methods in terms of what they capture about uncertainty, to discuss if and to what degree they are all necessary and complementary. Taking calibration into consideration allows identifying several patterns that do not emerge from confidence curves only, such as the discrepancy in ensembling epistemic and aleatoric uncertainties or some differences between ensembling and bootstrapping, thus highlighting its important role in comparisons. By contrast, even recent work that seeks to obtain “uncertainty-calibrated prediction of molecular properties”^{Zhang2019} does not take calibration evaluation into consideration in the results. The discrepancy between results obtained based on the two different definitions of calibration is more subtle. Qualitatively, the main conclusions derived by confidence-based calibration, such as the largely underestimated aleatoric uncertainty in all the experiments, are also reflected in error-based calibration. Quantitatively, the ratios of the indices obtained through these two methods do not always overlap, but they always rank models in the same order. Based on the obtained results, it is not possible to state if and when quantitative indices based on one of the two definitions outperform the other. The results obtained for these two different definitions of calibration also confirm the earlier comparative discussion of the two. In particular, even if error-based calibration directly relates error and uncertainty according to the definition, the inherent non-uniformity of uncertainty estimates makes it difficult to obtain reliable statistics in some uncertainty ranges (high uncertainty ranges in our experiments), with less stable results. This also prevents assessing if the error in these ranges is due to uncertainty estimates themselves or to insufficient data for computing reliable statistics.
Therefore, we can conclude that the choice between these two evaluation techniques depends on the context. If the dataset is large enough to enable meaningful estimates for all the bins, error-based calibration should be preferred because it allows for a more direct comparison and it avoids issues when recalibration techniques are employed^{levi2019evaluating}. Instead, if the uncertainty distribution is highly skewed and few samples are available in some ranges, as is the case in our experiments, confidence-based calibration can overcome this and results in less noisy plots.
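The sensitivity of error-based calibration to sparsely populated bins can be illustrated with a short sketch. It assumes the usual form of the expected normalized calibration error: predictions are binned by predicted standard deviation, and for each bin the root mean predicted variance (RMV) is compared with the empirical RMSE, averaging |RMV - RMSE| / RMV over the non-empty bins. Equal-width bins over the range of predicted standard deviations are an assumption made here to expose the skewness issue, and the binning used in the experiments may differ. All names are illustrative.

```python
import math


def ence_equal_width(errors, sigmas, n_bins=5):
    """Return (ENCE, samples per bin); sigmas must be strictly positive."""
    lo, hi = min(sigmas), max(sigmas)
    # Fall back to a unit width when all sigmas are identical.
    width = (hi - lo) / n_bins or 1.0
    bins = [[] for _ in range(n_bins)]
    for e, s in zip(errors, sigmas):
        idx = min(int((s - lo) / width), n_bins - 1)
        bins[idx].append((e, s))
    terms, counts = [], []
    for chunk in bins:
        counts.append(len(chunk))
        if not chunk:
            continue  # empty bins carry no statistic
        rmv = math.sqrt(sum(s * s for _, s in chunk) / len(chunk))
        rmse = math.sqrt(sum(e * e for e, _ in chunk) / len(chunk))
        terms.append(abs(rmv - rmse) / rmv)
    return sum(terms) / len(terms), counts


# Skewed toy data: most predictions have low uncertainty, one has high.
errors = [0.1, 0.2, 0.3, 1.0]
score, counts = ence_equal_width(errors, errors, n_bins=3)
```

In the toy data the highest-uncertainty bin contains a single prediction and the middle bin is empty, so the statistic for that range rests on one sample, which is the instability discussed above.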
7 Conclusion and Future Work
In this paper we compared three state-of-the-art approaches for uncertainty estimation in neural networks in the context of GCNNs for molecular property prediction: MC-Dropout with Concrete Dropout, ensembling, and bootstrapping. We selected those approximate Bayesian inference techniques satisfying some specific application-oriented criteria: scalability, lack of hyperparameters, and independence from the underlying network architecture. These techniques have been first reviewed in a unified framework that separates aleatoric and epistemic uncertainty, also in the light of recent interpretations given to ensembling, and then experimentally compared on the QM9 dataset based on a set of introduced criteria. Those criteria have been selected to evaluate uncertainty from different perspectives: based on its ability to define a ranking of most confident predictions, based on uncertainty calibration (two different recent definitions for regression have been employed), based on dispersion that measures estimated heterogeneity, and based on robustness to domain shift in the test set with respect to the training set, with scaffold splitting being employed.
The obtained results lead to multiple interesting conclusions. First of all, ensembling and bootstrapping appear to consistently outperform MC-Dropout, confirming the results recently presented for other domains and different network types also for GCNN-based molecular property prediction. The comparison between ensembling and bootstrapping leads to more mixed results. Even though ensembling is better with respect to most of the considered metrics, including overall MAE, bootstrapping appears to outperform ensembling for others, notably epistemic uncertainty calibration and overall out-of-domain calibration. This is not in line with what has been previously described in the context of image regression/classification, highlighting an interesting property of the chemical space and/or the chemical dataset analyzed. Furthermore, the results presented have led to a better understanding of the role of aleatoric/epistemic uncertainty, with an interesting method based on calibration plots to pinpoint the relative contribution of the two kinds of uncertainty to the total error.
The latter is one of the directions that should be further investigated in the future, with a deeper analysis of the uncertainty components, also in relation to the specific features of the datasets. In addition, taking into consideration how approximate methods interfere with their independent calculation would be of crucial importance in applications. Another important direction concerns the improvement of uncertainty estimation methods. To accomplish this, a promising direction, especially for epistemic and out-of-domain uncertainty, is to increase diversity in the ensembled networks. This diversity need not result from the data, as in bootstrapping, but could instead come from the model itself^{lee2015m, pearce2018bayesian}. Balancing diversity, training data size and number of hyperparameters appears to be a challenging trade-off. One of the main limitations of all the uncertainty estimation methods is out-of-domain uncertainty calibration, and overcoming this weakness should be a major goal of future developments in uncertainty-aware molecular property prediction.