Confidence Calibration and Predictive Uncertainty Estimation for Deep Medical Image Segmentation
Abstract
Fully convolutional neural networks (FCNs), and in particular U-Nets, have achieved state-of-the-art results in semantic segmentation for numerous medical imaging applications. Moreover, batch normalization and Dice loss have been used successfully to stabilize and accelerate training. However, these networks are poorly calibrated, i.e., they tend to produce overconfident predictions for both correct and erroneous classifications, making them unreliable and hard to interpret. In this paper, we study predictive uncertainty estimation in FCNs for medical image segmentation. We make the following contributions: 1) We systematically compare cross entropy loss with Dice loss in terms of segmentation quality and uncertainty estimation of FCNs; 2) We propose model ensembling for confidence calibration of FCNs trained with batch normalization and Dice loss; 3) We assess the ability of calibrated FCNs to predict the segmentation quality of structures and to detect out-of-distribution test examples. We conduct extensive experiments across three medical image segmentation applications of the brain, the heart, and the prostate to evaluate our contributions. The results of this study offer considerable insight into predictive uncertainty estimation and out-of-distribution detection in medical image segmentation and provide practical recipes for confidence calibration. Moreover, we consistently demonstrate that model ensembling improves confidence calibration.
I Introduction
Fully convolutional neural networks (FCNs), and in particular the U-Net [ronneberger2015u], have become a de facto standard for semantic segmentation in general and for medical image segmentation tasks in particular. The U-Net architecture has been used for segmentation of both normal organs and lesions and has achieved top-ranking results in several international segmentation challenges [kuijf2019standardized, kaggle_salt, mrbrains18]. Despite numerous applications of U-Nets, very few works have studied the capability of these networks to capture predictive uncertainty. Predictive uncertainty, or prediction confidence, is described as the ability of a decision-making system to provide an expectation of success (i.e., correct classification) or failure for the test examples at inference time. Using a frequentist interpretation of uncertainty, the predictions (i.e., class probabilities) of a well-calibrated model should match the probability of success of those inferences in the long run [guo2017calibration]. For instance, if a well-calibrated brain tumor segmentation model classifies 100 pixels as cancer, each with a probability of 0.7, we expect 70 of those pixels to be correctly classified as cancer. A poorly calibrated model with similar classification probabilities, however, is expected to yield many more or many fewer correctly classified pixels. Miscalibration frequently occurs in modern neural networks (NNs) that are trained with advanced optimization methods [guo2017calibration]. Poorly calibrated NNs are often highly confident in their misclassifications [amodei2016concrete]. In some applications, for example medical image analysis or automated driving, overconfidence can be dangerous.
Batch normalization (BN) [ioffe2015batch] and Dice loss [sudre2017generalised] have made FCN optimization seamless. BN effectively stabilizes convergence and also improves the performance of networks for natural image classification tasks [ioffe2015batch]. The addition of BN to the U-Net has also improved optimization and segmentation quality [cciccek20163d]. Dice loss is robust to class imbalance and has been successfully applied in many segmentation problems [sudre2017generalised]. However, it has been reported that both BN and Dice loss have adverse effects on calibration quality [guo2017calibration, sander2019towards, bertels2019optimization]. Consequently, FCNs trained with BN and Dice loss do not produce well-calibrated probabilities, leading to poor uncertainty estimation. In contrast to Dice loss, cross entropy loss provides better calibrated predictions and uncertainty estimates, as it is a strictly proper scoring rule [gneiting2007strictly]. Yet, the use of cross entropy as the loss function for training FCNs can be challenging in situations where there is a high class imbalance, e.g., where most of an image is considered background [sudre2017generalised]. Hence, it is of great interest to study methods for confidence calibration of FCNs trained with BN and Dice loss.
Figure 1 (panels: uncalibrated, calibrated)
Another important aspect of uncertainty estimation is the ability of a predictive model to distinguish in-distribution test examples (i.e., those similar to the training data) from out-of-distribution test examples (i.e., those that do not fit the distribution of the training data) [hendrycks2016baseline]. The ability of models to detect out-of-distribution inputs is especially important for medical imaging applications, as deep networks are sensitive to domain shift, which is a recurring situation in the medical imaging domain [ghafoorian2017transfer]. For instance, networks trained on one MRI protocol often do not perform satisfactorily on images obtained with slightly different parameters or on out-of-distribution test images. Hence, in the face of an out-of-distribution sample, an ideal model knows and announces "I do not know" and seeks human intervention, if possible, instead of failing silently. Figure 1 shows inferences from a U-Net model that was trained with BN and Dice loss for prostate segmentation before and after confidence calibration.
II Related Works
There has been growing recent interest in uncertainty estimation and confidence measurement with deep NNs. Although most studies on uncertainty estimation have been done through Bayesian modeling of the NN, there has been some recent interest in non-Bayesian approaches such as ensembling methods. Here, we first briefly review Bayesian and non-Bayesian methods and then review the recent literature on uncertainty estimation for semantic segmentation applications.
In the Bayesian approach, the deterministic parameters of the NN are replaced by prior probability distributions. Using Bayesian inference, given the data samples, a posterior probability distribution over the parameters is calculated. At inference time, instead of a single scalar probability, the Bayesian NN gives probability distributions over the output label probabilities [mackay1992practical], which model the NN's predictive uncertainty. Gal and Ghahramani proposed to use dropout [srivastava2014dropout] as a Bayesian approximation [gal2015dropout]. They proposed Monte Carlo dropout (MC dropout), in which dropout layers are applied before every weight layer, together with nonlinearities, which provides an approximation to a probabilistic Gaussian process. MC dropout is straightforward to implement and has been applied in several application domains, including medical imaging [leibig2017leveraging]. In a similar Bayesian approach, Teye et al. [teye2018MCBN] showed that training NNs with BN [ioffe2015batch] can be used to approximate inference in Bayesian NNs. For networks with BN and without dropout, Monte Carlo Batch Normalization (MCBN) can be considered an alternative to MC dropout. In another Bayesian work, Heo et al. [heo2018uncertainty] proposed a method that allows the attention model to leverage uncertainty. By learning the Uncertainty-aware Attention (UA) with variational inference, they improved both model calibration and performance in attention models. Seo et al. [seo2019learning] proposed a variance-weighted loss function that enables learning single-shot calibration scores. In combination with stochastic depth and dropout, their method can improve confidence calibration and classification accuracy. Recently, Liao et al. [liao2019modelling] proposed a method for modeling uncertainty due to intra-observer variability in 2D echocardiography using their proposed Cumulative Density Function Probability method.
Non-Bayesian approaches have also been proposed for probability calibration and uncertainty estimation. Guo et al. [guo2017calibration] studied the problem of confidence calibration in deep NNs. Through experiments, they analyzed different parameters such as depth, width, weight decay, and BN and their effect on calibration. They also used temperature scaling to easily calibrate trained models. Following the success of ensembling methods [dietterich2000ensemble] in improving baseline performance, Lakshminarayanan et al. proposed Deep Ensembles, in which model averaging is used to estimate predictive uncertainty [lakshminarayanan2017simple]. By training collections of models with random initialization of parameters and adversarial training, they provided a simple approach to assess uncertainty. Unlike MC dropout, Deep Ensembles do not require network modification, and they outperformed MC dropout on two image classification problems. On the downside, the method requires retraining a model from scratch for every ensemble member, which is computationally expensive for large datasets and complex models.
Predictive uncertainty estimation has been studied specifically for the problem of semantic segmentation with deep NNs. Bayesian SegNet [kendall2015bayesian] was among the first works to address uncertainty estimation in FCNs by using MC dropout. The authors applied MC dropout by adding dropout layers after the pooling and upsampling blocks of the three innermost layers of the encoder and decoder sections of the SegNet architecture. Using similar approaches for uncertainty estimation, Kwon et al. [kwon2018uncertainty] and Sedai et al. [sedai2018joint] used Bayesian NNs for uncertainty quantification in segmentation of ischemic stroke lesions and visualization of retinal layers, respectively. Sander et al. [sander2019towards] applied MC dropout to capture instance segmentation uncertainty in ambiguous regions and compared different loss functions in terms of the resultant miscalibration. Kohl et al. [kohl2018probabilistic] proposed a Probabilistic U-Net that combines an FCN with a conditional variational autoencoder to provide multiple segmentation hypotheses for ambiguous images. In similar work, Hu et al. [hu2019supervised] studied uncertainty quantification in the presence of multiple annotations resulting from inter-observer disagreement. They used a probabilistic U-Net to quantify uncertainty in the segmentation of lung abnormalities. Rottmann and Schubert [rottmann2019uncertainty] proposed a prediction quality rating method for segmentation of nested multi-resolution street scene images by measuring both pixel-wise and segment-wise measures of uncertainty as predictive metrics for segmentation quality. Recently, Karimi et al. [karimi2019accurate] used ensembling for uncertainty estimation of difficult-to-segment regions and used this information to improve clinical target volume estimation in prostate ultrasound images. In another recent work, Jungo and Reyes [jungo2019assessing] studied uncertainty estimation for brain tumor and skin lesion segmentation tasks.
In conjunction with uncertainty estimation and confidence calibration, several works have studied out-of-distribution detection [hendrycks2016baseline, liang2017enhancing, lee2017training, devries2018learning, shalev2018out]. In a non-Bayesian approach, Hendrycks and Gimpel [hendrycks2016baseline] used the softmax prediction probability as a baseline to effectively detect misclassified and out-of-distribution test examples. Liang et al. [liang2017enhancing] used temperature scaling and input perturbations to enhance the baseline method of Hendrycks and Gimpel [hendrycks2016baseline]. In the context of a generative NN scheme, Lee et al. used a loss function that encourages confidence calibration [lee2017training], which resulted in improvements in out-of-distribution detection. Similarly, DeVries and Taylor [devries2018learning] proposed a hybrid with a confidence term to improve out-of-distribution detection. Shalev et al. [shalev2018out] used multiple semantic dense representations of the target labels to detect misclassified and adversarial examples.
III Contributions
In this work, we study predictive uncertainty estimation for semantic segmentation with FCNs and propose ensembling for confidence calibration and reliable predictive uncertainty estimation of segmented structures. In summary, we make the following contributions:

We analyze the choice of loss function for semantic segmentation in FCNs. We compare the two most commonly used loss functions in training FCNs for semantic segmentation: cross entropy loss and Dice loss. We train models with these loss functions and compare the resulting segmentation quality and predictive uncertainty estimation. We observe that FCNs trained with Dice loss produce significantly better segmentations than those trained with cross entropy, but at the cost of poor calibration.

We propose model ensembling [lakshminarayanan2017simple] for confidence calibration of FCNs trained with Dice loss and batch normalization. By training collections of FCNs with random initialization of parameters and random shuffling of training data, we create an ensemble that improves both segmentation quality and uncertainty estimation. We empirically quantify the effect of number of models on calibration and segmentation quality.

We propose to use the average entropy over the predicted segmented object as a metric to predict the segmentation quality of foreground structures, which can be further used to detect out-of-distribution test inputs. Our results demonstrate that object segmentation quality correlates inversely with the average entropy over the segmented object and can be used effectively for detecting out-of-distribution inputs.

We demonstrate our method for uncertainty estimation and confidence calibration on three different segmentation tasks from MRI images of the brain, the heart, and the prostate. Where appropriate, we report the statistical significance of our findings.
IV Applications & Data
Table I shows the number of patient images in each dataset and how we split these into training, validation, and test sets. In the following subsections, we briefly describe each segmentation task, the data characteristics, and the preprocessing.
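Table I: Number of patient images in each dataset and their split into training, validation, and test sets.

| Application | Brain | Brain | Heart | Prostate | Prostate |
|---|---|---|---|---|---|
| Dataset | CBICA | TCIA | ACDC | PROSTATEx | PROMISE12* |
| # Training | 66 | - | 40 | 16 | - |
| # Validation | 22 | - | 10 | 4 | - |
| # Test | - | 102 | 50 | 20 | 35 |

*Used only for out-of-distribution detection experiments.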
IV-A Brain Tumor Segmentation Task
For brain tumor segmentation, data from the MICCAI 2017 BraTS challenge [bakas2017advancing, menze2015multimodal] was used. This is a four-class segmentation task; multiparametric MRI of brain tumor patients are to be segmented into enhancing tumor, non-enhancing tumor, edema, and background. The training dataset consists of 190 multiparametric MRI (T1-weighted, contrast-enhanced T1-weighted, T2-weighted, and FLAIR sequences) from brain tumor patients. The dataset is further subdivided into two sets: CBICA and TCIA. The images in the CBICA set were acquired at the Center for Biomedical Image Computing and Analytics (CBICA) at the University of Pennsylvania [bakas2017advancing]. The images in the TCIA set were acquired across multiple institutions and are hosted by the National Cancer Institute's The Cancer Imaging Archive (TCIA). The CBICA subset was used for training and validation, and the TCIA subset was reserved as the test set.
IV-B Ventricular Segmentation Task
For heart ventricle segmentation, data from the MICCAI 2017 ACDC challenge for automated cardiac diagnosis was used [wolterink2017automatic]. This is a four-class segmentation task; cine MR images (CMRI) of patients are to be segmented into the left ventricle, the endocardium, the right ventricle, and background. This dataset consists of end-diastole (ED) and end-systole (ES) images of 100 patients. We used only the ED images in our study.
IV-C Prostate Segmentation Task
For prostate segmentation, the public datasets PROSTATEx [Litjens2014prostatex] and PROMISE12 [litjens2014evaluation] were used. This is a two-class segmentation task; axial T2-weighted images of men suspected of having prostate cancer are to be segmented into the prostate gland and background. For the PROSTATEx dataset, 40 images with annotations from Meyer et al. [Meyer2018lo] were used. All these images were acquired at the same institution. The PROSTATEx dataset was used for both training and testing, and the PROMISE12 dataset was set aside for testing only. PROMISE12 is a heterogeneous multi-institutional dataset acquired using different MR scanners and acquisition parameters. We used the 50 training images for which ground truth is available.
IV-D Data Preprocessing
Prostate and cardiac images were resampled to the common inplane resolution of mm and mm, respectively. Brain images were resampled to the resolution of mm. All axial slices were then cropped at the center to create images of size pixels as the input size of the FCN. Image intensities were normalized to be within the range of [0,1].
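The cropping and intensity normalization steps can be sketched as follows. This is a minimal NumPy illustration, not the authors' pipeline: the function names `center_crop` and `normalize_01` are hypothetical, the crop size is left as a parameter because it is not recoverable from the text, and both the zero-padding of slices smaller than the crop and the per-image min-max scaling are assumptions. Resampling to a common resolution is omitted.

```python
import numpy as np


def center_crop(slice_2d, size):
    """Crop a 2D slice to size x size around its center.

    Slices smaller than the target are zero-padded first (an assumption;
    the text does not say how undersized slices are handled).
    """
    h, w = slice_2d.shape
    pad_h, pad_w = max(size - h, 0), max(size - w, 0)
    padded = np.pad(
        slice_2d,
        ((pad_h // 2, pad_h - pad_h // 2), (pad_w // 2, pad_w - pad_w // 2)),
    )
    h, w = padded.shape
    top, left = (h - size) // 2, (w - size) // 2
    return padded[top:top + size, left:left + size]


def normalize_01(image):
    """Min-max normalize intensities to [0, 1]; a constant image maps to zeros."""
    image = np.asarray(image, dtype=np.float64)
    lo, hi = image.min(), image.max()
    if hi == lo:
        return np.zeros_like(image)
    return (image - lo) / (hi - lo)
```

Applying `normalize_01(center_crop(axial_slice, size))` to each axial slice would yield network inputs with a fixed size and intensities in [0,1].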
V Methods
V-A Model
Semantic segmentation can be formulated as a pixel-level classification problem, which can be solved by convolutional neural networks [litjens2017survey]. The pixels in the training image and label pairs can be considered as $N$ i.i.d. data points $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^{M}$ is the input $M$-dimensional feature map and $y_i$ can be one and only one of the $K$ possible classes, $y_i \in \{1, \dots, K\}$. The use of FCNs for image segmentation allows for end-to-end learning, with each pixel of the input image being mapped by the FCN to the output segmentation map. Compared to FCNs, patch-based NNs are much slower at inference time as they require sliding window mechanisms for predicting each pixel [long2015fully]. Moreover, it is more straightforward to implement segment-level loss functions such as Dice loss in FCN architectures. FCNs for segmentation usually consist of an encoder (contracting) path and a decoder (expanding) path [long2015fully, ronneberger2015u]. FCNs with skip connections are able to combine high-level abstract features with low-level high-resolution features, which has been shown to be successful in segmentation tasks [ronneberger2015u, cciccek20163d]. NNs can be formulated as parametric conditional probability models, $p(y_i \mid x_i, \theta)$, and the parameter set $\theta$ is chosen to minimize a loss function. Both cross entropy (CE) and the negative of the Dice Similarity Coefficient (DSC), known as Dice loss, have been used as loss functions for training FCNs. Class weights are used to aid optimization convergence and to deal with class imbalance. With CE loss, the parameter set is chosen to maximize the average log likelihood over the training data:
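$$\hat{\theta} = \arg\max_{\theta} \; \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} \omega_k \, \mathbb{1}(y_i = k) \, \ln p(y_i = k \mid x_i, \theta) \tag{1}$$
where $p(y_i = k \mid x_i, \theta)$ is the probability of pixel $i$ belonging to class $k$, $\mathbb{1}(y_i = k)$ is the binary indicator which denotes if the class label $k$ is the correct class of the $i$th pixel, $\omega_k$ is the weight for class $k$, and $N$ is the number of pixels that are used in each mini-batch. With the Dice loss, the parameter set is chosen to minimize the negative of the weighted Dice of the different structures: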
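$$\hat{\theta} = \arg\min_{\theta} \; -\sum_{k=1}^{K} \omega_k \, \frac{2 \sum_{i=1}^{N} \mathbb{1}(y_i = k) \, p(y_i = k \mid x_i, \theta) + \epsilon}{\sum_{i=1}^{N} \left[ \mathbb{1}(y_i = k) + p(y_i = k \mid x_i, \theta) \right] + \epsilon} \tag{2}$$
where $p(y_i = k \mid x_i, \theta)$ is the probability of pixel $i$ belonging to class $k$, $\mathbb{1}(y_i = k)$ is the binary indicator which denotes if the class label $k$ is the correct class of the $i$th pixel, $\omega_k$ is the weight for class $k$, $N$ is the number of pixels that are used in each mini-batch, and $\epsilon$ is the smoothing factor to make the loss function differentiable. Subsequently, $p(y \mid x, \hat{\theta})$ is used for inference, where $\hat{\theta}$ is the optimized parameter set.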
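To make the two training objectives concrete, the sketch below evaluates a class-weighted cross entropy and a class-weighted soft Dice loss on a mini-batch of flattened pixels. It is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the Dice denominator uses the common "sum of targets plus sum of probabilities" form, and the default smoothing factor `eps=1.0` is an arbitrary choice.

```python
import numpy as np


def weighted_cross_entropy(probs, onehot, class_weights, tiny=1e-12):
    """Average class-weighted negative log likelihood.

    probs, onehot: arrays of shape (N, K); class_weights: shape (K,).
    Minimizing this value is equivalent to maximizing the weighted
    average log likelihood.
    """
    log_p = np.log(np.clip(probs, tiny, 1.0))
    return -np.mean(np.sum(class_weights * onehot * log_p, axis=1))


def weighted_dice_loss(probs, onehot, class_weights, eps=1.0):
    """Negative class-weighted soft Dice (one common formulation, assumed here)."""
    intersection = np.sum(onehot * probs, axis=0)            # shape (K,)
    denominator = np.sum(onehot, axis=0) + np.sum(probs, axis=0)
    dice_per_class = (2.0 * intersection + eps) / (denominator + eps)
    return -np.sum(class_weights * dice_per_class)
```

With a perfect prediction (`probs` equal to `onehot`), the cross entropy is 0 and the Dice loss reaches its minimum of minus the sum of the class weights. Note that the Dice loss depends on the probabilities only through sums over the whole batch, whereas cross entropy penalizes every pixel's probability individually.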
V-B Calibration Metrics
The output of an FCN for each input pixel $x_i$ is a class prediction $\hat{y}_i$ and its associated class probability $p(\hat{y}_i \mid x_i, \theta)$. The class probability can be considered the model confidence or probability of correctness and can be used as a measure of predictive uncertainty at the pixel level. Strictly proper scoring rules are used to assess the calibration quality of predictive models [gneiting2007strictly]. In general, scoring rules assess the quality of uncertainty estimation in models by rewarding well-calibrated probabilistic forecasts. Negative log likelihood (NLL) and Brier score [brier1950verification] are both strictly proper scoring rules that have been used in several studies for evaluating predictive uncertainty [guo2017calibration, lakshminarayanan2017simple, gal2015dropout]. In a segmentation problem, for a collection of $N$ pixels, NLL is calculated as:
$$\mathrm{NLL} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} \mathbb{1}(y_i = k) \, \ln p(y_i = k \mid x_i, \theta) \tag{3}$$
Brier score (Br) measures the accuracy of probabilistic predictions:
$$\mathrm{Br} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{K} \sum_{k=1}^{K} \left[ p(y_i = k \mid x_i, \theta) - \mathbb{1}(y_i = k) \right]^2 \tag{4}$$
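Both scoring rules are simple to compute from an array of per-pixel class probabilities. The following NumPy sketch assumes the per-pixel averaged forms of NLL and of the Brier score (averaged over both pixels and classes); the function names and the clipping constant `tiny` are illustrative choices.

```python
import numpy as np


def negative_log_likelihood(probs, labels, tiny=1e-12):
    """Mean negative log probability assigned to the true class.

    probs: (N, K) class probabilities; labels: (N,) integer class labels.
    """
    p_true = probs[np.arange(len(labels)), labels]
    return -np.mean(np.log(np.clip(p_true, tiny, 1.0)))


def brier_score(probs, labels):
    """Mean squared difference between probabilities and one-hot labels."""
    onehot = np.eye(probs.shape[1])[labels]
    return np.mean(np.mean((probs - onehot) ** 2, axis=1))
```

For both metrics lower is better, and a confident wrong prediction is penalized much more heavily by NLL (unbounded) than by the Brier score (bounded), which is one reason the two can rank models differently.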
In addition to NLL and Brier score, we directly assess the predictive power of a model by comparing the confidence values of test examples with their measured expected accuracy. To do so, we use reliability diagrams as visual representations of model calibration and Expected Calibration Error (ECE) as a summary statistic for calibration [guo2017calibration, naeini2015obtaining]. Reliability diagrams plot expected accuracy as a function of class probability (confidence). The reliability diagram of a perfectly calibrated model is the identity function. For expected accuracy measurement, the samples are binned into $N$ groups (here $N$ denotes the number of bins) and the accuracy and confidence for each group are computed. Assuming $D_m$ to be the indices of samples whose confidence predictions are in the range $\left(\frac{m-1}{N}, \frac{m}{N}\right]$, the expected accuracy of $D_m$ is $\mathrm{Acc}(D_m) = \frac{1}{|D_m|} \sum_{i \in D_m} \mathbb{1}(\hat{y}_i = y_i)$. The average confidence on bin $D_m$ is calculated as $\mathrm{Conf}(D_m) = \frac{1}{|D_m|} \sum_{i \in D_m} p(\hat{y}_i \mid x_i, \theta)$. ECE is calculated as the weighted average, over the bins, of the differences between accuracy and average confidence:
$$\mathrm{ECE} = \sum_{m=1}^{N} \frac{|D_m|}{n} \left| \mathrm{Acc}(D_m) - \mathrm{Conf}(D_m) \right| \tag{5}$$
where $n$ is the total number of samples. In other words, ECE is the weighted average of the gaps on the reliability diagram.
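The binning procedure can be sketched in a few lines of NumPy. This is a minimal illustration with equal-width, right-closed bins; the function name and the default of 10 bins are assumptions, not values taken from the paper.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins.

    confidences: probability of the predicted class for each sample.
    correct: 1 if the prediction was right, 0 otherwise.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Right-closed bins ((m-1)/n_bins, m/n_bins]; a confidence of 0 goes to the first bin.
    bin_ids = np.clip(np.ceil(confidences * n_bins).astype(int) - 1, 0, n_bins - 1)
    ece = 0.0
    for m in range(n_bins):
        in_bin = bin_ids == m
        if not in_bin.any():
            continue
        accuracy = correct[in_bin].mean()
        avg_confidence = confidences[in_bin].mean()
        # in_bin.mean() is the fraction of all samples that fall in this bin.
        ece += in_bin.mean() * abs(accuracy - avg_confidence)
    return ece
```

A set of predictions made with confidence 0.75 of which three quarters are correct gives an ECE of 0, while predictions made with confidence 0.95 of which only half are correct give an ECE of 0.45. The per-bin `accuracy` and `avg_confidence` pairs are also exactly the points that a reliability diagram plots.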
V-C Confidence Calibration with Ensembling
We propose to use ensembling [dietterich2000ensemble] for confidence calibration of FCNs trained with Dice loss. We hypothesize that an ensemble of poorly calibrated FCNs trained with the Dice loss function produces high-quality predictive uncertainty estimates, i.e., ensembling calibrates FCNs trained with Dice loss. To this end, similar to the Deep Ensembles method [lakshminarayanan2017simple], we train $M$ FCNs with random initialization of the network parameters and random shuffling of the training dataset in mini-batch stochastic gradient descent. However, unlike the Deep Ensembles method, we do not use any form of adversarial training. We train each of the $M$ models in the ensemble from scratch and then compute the probability of the ensemble, $p_E$, as the average of the baseline probabilities:
$$p_E(y_i = k \mid x_i) = \frac{1}{M} \sum_{m=1}^{M} p(y_i = k \mid x_i, \hat{\theta}_m) \tag{6}$$
where $p(y_i = k \mid x_i, \hat{\theta}_m)$ are the individual probabilities of the $M$ baseline models.
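In code, the ensemble prediction is a plain average over the member axis. The sketch below assumes the member outputs have already been stacked into one array; the function name and the array layout are illustrative.

```python
import numpy as np


def ensemble_probabilities(member_probs):
    """Average the class probabilities of M independently trained models.

    member_probs: array of shape (M, ...), e.g. (M, N, K) for M models,
    N pixels, and K classes. Returns an array of shape (...).
    """
    member_probs = np.asarray(member_probs, dtype=float)
    return member_probs.mean(axis=0)


# Two overconfident members that disagree on a pixel yield a less extreme average.
members = np.array([[[0.9, 0.1]],
                    [[0.3, 0.7]]])   # shape (M=2, N=1, K=2)
averaged = ensemble_probabilities(members)   # approximately [[0.6, 0.4]]
```

Where the members agree, the averaged probability stays close to the individual ones; where they disagree, it moves toward the middle, which is the mechanism by which averaging can temper overconfident predictions.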
Table II: Calibration quality of the baselines trained with cross entropy (CE) and Dice loss and of the ensembles, computed over the whole volume and over the bounding boxes around the segments. Values are averages with 95% CI in parentheses.

| Application (Model) | Whole volume: NLL (95% CI) | Whole volume: Brier (95% CI) | Whole volume: ECE% (95% CI) | Bounding boxes: NLL (95% CI) | Bounding boxes: Brier (95% CI) | Bounding boxes: ECE% (95% CI) | Failure Rate |
|---|---|---|---|---|---|---|---|
| Brain (CE) | 0.06 (0.01-0.23) | 0.03 (0.01-0.10) | 0.78 (0.13-3.46) | 0.43 (0.15-1.19) | 0.19 (0.07-0.49) | 6.75 (1.40-20.07) | 20.9% |
| Brain (Dice) | 0.14 (0.03-0.42) | 0.02 (0.00-0.05) | 0.83 (0.17-2.50) | 1.47 (0.44-3.58) | 0.16 (0.06-0.35) | 8.34 (2.82-21.45) | 16.8% |
| Brain (Ensemble) | 0.03 (0.01-0.07) | 0.01 (0.00-0.02) | 0.44 (0.02-1.33) | 0.25 (0.10-0.77) | 0.09 (0.04-0.18) | 2.33 (0.29-8.07) | 18.5% |
| Heart (CE) | 0.03 (0.01-0.08) | 0.01 (0.01-0.03) | 0.37 (0.14-0.95) | 0.32 (0.16-0.73) | 0.16 (0.09-0.30) | 5.26 (1.42-12.03) | 2.2% |
| Heart (Dice) | 0.04 (0.01-0.15) | 0.02 (0.00-0.04) | 0.94 (0.10-2.73) | 0.52 (0.17-1.49) | 0.22 (0.06-0.46) | 12.80 (2.60-31.58) | 2.5% |
| Heart (Ensemble) | 0.02 (0.01-0.06) | 0.01 (0.01-0.02) | 0.20 (0.07-0.79) | 0.25 (0.16-0.50) | 0.13 (0.08-0.23) | 3.08 (0.93-8.05) | 2.1% |
| Prostate (CE) | 0.08 (0.04-0.16) | 0.04 (0.02-0.09) | 2.17 (0.51-7.20) | 0.40 (0.22-0.79) | 0.25 (0.13-0.47) | 8.10 (1.60-25.69) | 0.0% |
| Prostate (Dice) | 0.26 (0.10-0.58) | 0.04 (0.02-0.08) | 1.97 (0.97-4.13) | 0.75 (0.33-1.67) | 0.11 (0.07-0.27) | 5.75 (3.32-13.12) | 0.0% |
| Prostate (Ensemble) | 0.05 (0.02-0.09) | 0.02 (0.01-0.04) | 0.65 (0.13-1.26) | 0.15 (0.07-0.24) | 0.07 (0.04-0.14) | 2.01 (0.48-3.65) | 0.0% |
V-D Segment-level Predictive Uncertainty Estimation
For segmentation applications, besides the pixel-level confidence metric, it is desirable to have a confidence metric that captures model uncertainty at the segment level. Such a metric would be very useful in clinical applications for decision making. For a well-calibrated system, we anticipate that a segment-level confidence metric can predict the segmentation quality in the absence of ground truth. The metric can be used to detect out-of-distribution samples and hard or ambiguous cases. Such metrics have been previously proposed for street scene segmentation [rottmann2019uncertainty]. Given the pixel-level class predictions $\hat{y}_i$ and their associated ground truth classes $y_i$ for a predicted segment $\hat{S}_k$, we propose to use the average of the pixel-wise entropy values over the predicted foreground segment as a scalar metric for the volume-level confidence of that segment:
$$\bar{H}(\hat{S}_k) = -\frac{1}{|\hat{S}_k|} \sum_{i \in \hat{S}_k} \left[ p(\hat{y}_i \mid x_i, \theta) \ln p(\hat{y}_i \mid x_i, \theta) + \left(1 - p(\hat{y}_i \mid x_i, \theta)\right) \ln\left(1 - p(\hat{y}_i \mid x_i, \theta)\right) \right].$$
In calculating the average entropy of $\hat{S}_k$, we assumed binary classification: the probability of belonging to class $k$, $p(\hat{y}_i \mid x_i, \theta)$, and the probability of belonging to the other classes, $1 - p(\hat{y}_i \mid x_i, \theta)$.
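A minimal NumPy sketch of this segment-level metric is given below. The function name, the explicit `segment_mask` argument (the pixels predicted as the class of interest), and the clipping constant are illustrative assumptions; the binary-entropy form follows the description above.

```python
import numpy as np


def average_segment_entropy(prob_k, segment_mask, tiny=1e-12):
    """Mean binary entropy of the class-k probability over a predicted segment.

    prob_k: array of per-pixel probabilities for class k.
    segment_mask: boolean array of the same shape marking the predicted segment.
    Returns NaN when the predicted segment is empty.
    """
    prob_k = np.asarray(prob_k, dtype=float)
    segment_mask = np.asarray(segment_mask, dtype=bool)
    if not segment_mask.any():
        return float("nan")
    p = np.clip(prob_k[segment_mask], tiny, 1.0 - tiny)
    pixel_entropy = -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))
    return float(pixel_entropy.mean())
```

The value ranges from 0 (every pixel in the segment predicted with probability close to 1) to ln 2, about 0.693 (every pixel at probability 0.5), so larger values indicate a less confident segment.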
VI Experiments
VI-A Training Baselines
For all of the experiments, we used a baseline FCN model similar to the two-dimensional U-Net architecture [ronneberger2015u]: it has the same number of layers as the original U-Net but fewer kernels at each layer. The input and output of the FCN have a size of pixels. The brain tumor segmentation task used a three-channel input (T1CE, T2, FLAIR); for the other problems the input was a single channel. The numbers of kernels for the encoder section of the U-Net were 8, 8, 16, 16, 32, 32, 64, 64, 128, and 128. The parameters of the convolutional layers were initialized randomly from a Gaussian distribution [he2015delving]. For each of the three segmentation problems, the model was trained 100 times with cross entropy and 100 times with Dice loss, each time with random weight initialization and random shuffling of the training data. For the models trained with Dice loss, the softmax activation function of the last layer was replaced with a sigmoid function, as this improved convergence substantially. For optimization, stochastic gradient descent with the Adam update rule [kingma2014adam] was used. During training, we used mini-batches of 16 examples for prostate segmentation and 32 examples for the brain tumor and cardiac segmentation tasks. The initial learning rate was set to and it was reduced by a factor of if the average of validation Dice score did not improve by in 10 epochs. We trained the models for up to 1000 epochs with an early stopping policy. For each run, the model checkpoint was saved at the epoch where the validation DSC was the highest.
VI-B Cross Entropy vs. Dice
CE loss aims to minimize the average negative log likelihood over the pixels, while Dice loss directly improves segmentation quality in terms of the Dice coefficient. As a result, we expect models trained with CE to achieve a lower NLL and models trained with Dice loss to achieve better Dice coefficients. Here, our main focus is on the segmentation quality, in terms of Dice coefficient, of models trained with cross entropy and on the calibration quality of models trained with Dice loss. We compare models trained with cross entropy with those trained with Dice loss on three segmentation tasks. For statistical tests and for calculating 95% confidence intervals (CI), we used bootstrapping (n=100).
VI-C Confidence Calibration
We use ensembling (Equation 6) to calibrate batch-normalized FCNs trained with Dice loss. For the three segmentation problems, we make ensemble predictions and compare them with the baselines in terms of calibration and segmentation quality. For calibration quality, we compare NLL, Brier score, and ECE%. For segmentation quality, we compare the Dice coefficient and the 95th percentile Hausdorff distance. Moreover, for calibration quality assessment we calculate the metrics on two sets of samples from the held-out test datasets: 1) the whole test dataset (all pixels of the test volumes), and 2) pixels belonging to dilated bounding boxes around the foreground segments. The foreground segments and the adjacent background around them usually have the highest uncertainty and difficulty. At the same time, background pixels far from the foreground segments show less uncertainty but outnumber the foreground pixels. Using bounding boxes removes most of the correct (certain) background predictions from the statistics and better highlights the differences among models. For all three problems, we construct bounding boxes of the foreground structures. The boxes are then dilated by 8 mm in each direction of the in-plane axes and by 2 slices (which translates to 4 mm to 20 mm) in each direction of the out-of-plane axis.
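The construction of a dilated bounding-box mask can be sketched as follows. This is an illustrative NumPy version in which the margins are given in voxels per axis; converting the 8 mm in-plane margin to voxels requires the image spacing and is left to the caller. The function name and the (slice, row, column) axis order are assumptions.

```python
import numpy as np


def dilated_bounding_box_mask(foreground, margin):
    """Boolean mask of the foreground bounding box, enlarged by `margin` voxels.

    foreground: boolean array (e.g. slices x rows x columns).
    margin: one non-negative integer per axis, e.g. (2, m, m) for 2 slices
            out-of-plane and m voxels in-plane (hypothetical axis order).
    The box is clipped at the volume borders.
    """
    foreground = np.asarray(foreground, dtype=bool)
    mask = np.zeros_like(foreground)
    if not foreground.any():
        return mask
    box = []
    for axis, (indices, m) in enumerate(zip(np.nonzero(foreground), margin)):
        lo = max(int(indices.min()) - m, 0)
        hi = min(int(indices.max()) + m + 1, foreground.shape[axis])
        box.append(slice(lo, hi))
    mask[tuple(box)] = True
    return mask
```

Calibration metrics for the bounding-box setting would then be computed only over pixels where the mask is true, for example `probs[mask]` and `labels[mask]`.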
We also measured the effect of the ensemble size by calculating the ensemble probability (Equation 6) for ensembles with a number of models (M) of 1, 2, 5, 10, 25, and 50. To provide better statistics and reduce the effect of chance in the reported performance, for each M we constructed several subsets of ensembles from the baseline models and then reported the mean NLL and Dice for that specific M. For the prostate and heart segmentation tasks the number of subsets was set to 50, and for brain tumor segmentation it was set to 10. Finally, for calculating 95% CIs and statistical significance tests, we created 100 bootstraps by sampling 100 instances of random models and test examples with replacement. For each bootstrap, the calibration metrics and Dice scores were calculated for the baselines and for the models calibrated with ensembling.
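A simplified version of the bootstrap can be sketched as below. It resamples a one-dimensional array of per-case metric values and reports a percentile interval of the resampled means; the joint resampling of models and test examples described above, and the exact interval definition used for the tables, are not reproduced here. The function name, the seed, and the percentile method are assumptions.

```python
import numpy as np


def bootstrap_ci(values, n_boot=100, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`.

    Returns (mean, lower, upper), where lower and upper are the alpha/2 and
    1 - alpha/2 percentiles of the means of n_boot resamples drawn with
    replacement.
    """
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    resample_idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[resample_idx].mean(axis=1)
    lower, upper = np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(values.mean()), float(lower), float(upper)
```

For example, `bootstrap_ci(dice_scores)` with a hypothetical array `dice_scores` of per-volume Dice coefficients returns the mean Dice together with a 95% interval based on 100 resamples.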
VI-D Segment-level Predictive Uncertainty
For each of the segmentation problems, we calculated the volume-level confidence, i.e., the average entropy over the predicted segment (the equation in Section V-D), for each of the foreground labels and compared it with the Dice coefficient. For prostate segmentation, we are also interested in the difference between the PROSTATEx test set (which comes from the same domain as the training data, i.e., the source domain) and the PROMISE12 set (which can be considered a target domain).
VII Results
Table III: Segmentation quality of the baselines trained with cross entropy (CE) and Dice loss and of the ensembles. Values are averages with 95% CI in parentheses.

| Organ (Model) | Dice: Seg. #1* | Dice: Seg. #2* | Dice: Seg. #3* | Hausdorff distance (95th percentile): Seg. #1* | Hausdorff distance (95th percentile): Seg. #2* | Hausdorff distance (95th percentile): Seg. #3* |
|---|---|---|---|---|---|---|
| Brain (CE) | 0.45 (0.11-0.85) | 0.51 (0.14-0.82) | 0.65 (0.19-0.87) | 52.28 (5.00-99.34) | 48.87 (6.71-80.50) | 50.44 (3.00-96.34) |
| Brain (Dice) | 0.53 (0.12-0.89) | 0.64 (0.20-0.90) | 0.72 (0.29-0.91) | 38.78 (4.00-92.03) | 36.12 (3.00-77.31) | 36.33 (2.00-93.94) |
| Brain (Ensemble) | 0.61 (0.15-0.94) | 0.72 (0.25-0.92) | 0.79 (0.48-0.92) | 16.20 (2.45-79.13) | 19.45 (2.00-64.50) | 26.49 (2.00-92.98) |
| Heart (CE) | 0.79 (0.46-0.91) | 0.74 (0.55-0.86) | 0.92 (0.78-0.97) | 23.49 (7.21-117.66) | 18.34 (4.00-126.00) | 21.74 (2.00-151.91) |
| Heart (Dice) | 0.86 (0.59-0.96) | 0.82 (0.64-0.90) | 0.93 (0.81-0.97) | 13.80 (2.00-51.49) | 9.31 (2.00-69.91) | 12.59 (2-120.40) |
| Heart (Ensemble) | 0.90 (0.78-0.96) | 0.85 (0.71-0.91) | 0.95 (0.89-0.98) | 9.44 (2.00-26.40) | 4.43 (2.00-11.83) | 5.02 (2.00-19.01) |
| Prostate (CE) | 0.83 (0.63-0.91) | - | - | 11.77 (5.00-25.67) | - | - |
| Prostate (Dice) | 0.88 (0.73-0.93) | - | - | 8.26 (3.64-20.30) | - | - |
| Prostate (Ensemble) | 0.90 (0.76-0.95) | - | - | 5.73 (3.16-18.72) | - | - |

*For brain application structures, #1, #2, and #3 correspond to non-enhancing tumor, enhancing tumor, and edema, respectively. For heart application structures, #1, #2, and #3 correspond to the left ventricle, the endocardium, and the right ventricle, respectively. For the prostate application, #1 corresponds to the prostate gland.
Table II compares the averages and 95% CI values for NLL, Brier score, and ECE% for the whole volume and the bounding boxes around the segments. Predictions where the Dice score was less than were considered failures in segmentation and not included in the statistics. The failure rates are provided in Table II. For all three segmentation tasks and all seven foreground labels, calibration quality was significantly better in terms of NLL and ECE% for models trained with cross entropy compared to those trained with Dice loss. However, the direction of change for the Brier score was not consistent between models trained with CE and models trained with Dice loss. For the bounding boxes of brain tumor and prostate segmentation, the Brier scores were significantly better for models trained with Dice loss than for those trained with CE, while for heart segmentation the opposite was true. The ensemble models show significantly better calibration quality for all metrics across all tasks.
Table III compares the averages and 95% CI values of Dice coefficients of foreground segments for baselines trained with cross entropy loss, Dice loss, and baselines calibrated with ensembling (M=50). For all tasks across all segments, baselines trained with Dice loss outperform those trained with CE loss, and ensemble models outperform both baselines.
Figure 2 shows the improvement in calibration and segmentation quality as a function of the number of models in the ensemble, M. For the prostate, the heart, and the brain tumor segmentation tasks, an ensemble of even five models (M=5) already reduces the NLL.
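The effect of ensemble size can be illustrated with a small synthetic sketch. It assumes that the ensemble prediction is the average of the members' probability maps, and it replaces trained networks with noisy, overconfident stand-ins; the noise model and all names are hypothetical and serve only to show the mechanics.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def binary_nll(probs, labels, eps=1e-12):
    """Mean negative log-likelihood of binary labels under foreground probabilities."""
    p = np.clip(probs, eps, 1.0 - eps)
    return float(-np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p)))

def ensemble_mean(member_probs):
    """Average the members' foreground probability maps over the model axis."""
    return np.mean(member_probs, axis=0)

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=(32, 32))

# Synthetic stand-in for M independently trained models (an assumption):
# a weak shared signal plus large member-specific logit noise, which
# yields confident but frequently wrong per-model predictions.
M = 50
signal = 2.0 * labels - 1.0
logits = signal[None, :, :] + rng.normal(0.0, 3.0, size=(M, 32, 32))
member_probs = sigmoid(logits)

mean_member_nll = float(np.mean([binary_nll(p, labels) for p in member_probs]))
print("mean single-model NLL:", round(mean_member_nll, 4))
for m in (1, 5, 10, 50):
    nll_m = binary_nll(ensemble_mean(member_probs[:m]), labels)
    print("M =", m, "ensemble NLL:", round(nll_m, 4))
```

Because the negative logarithm is convex, the NLL of the averaged prediction can never exceed the average NLL of the individual members; how quickly it decreases with M depends on how diverse the members are.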
Figure 3 visually compares the baselines trained with cross entropy and with Dice loss against those calibrated with ensembling, using representative examples from the three segmentation tasks. For each prediction map, a reliability diagram over the whole volume is provided.
Figure 3 columns: Ground Truth, Baseline (CE), Baseline (Dice), Ensemble (M=50)
Figure 4 provides scatter plots of Dice coefficient vs. the proposed segment-level predictive uncertainty metric (Equation VD). For better visualization, Dice values were logit transformed as in [niethammer2017active]. In all three segmentation tasks, we observed a strong correlation between the logit of the Dice coefficient and the average entropy over the predicted segment. For the prostate segmentation task, the test cases from the source domain (PROSTATEx dataset) and those from the target domain (PROMISE12) form clearly separate clusters. Investigation of individual cases reveals that most of the poorly segmented cases, whose poor quality was correctly predicted by the metric, can be considered out-of-distribution examples as they were imaged with endorectal coils.
Figure 4 panels: Prostate Segmentation, Brain Tumor Segmentation, Cardiac Segmentation
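The quantities plotted in Figure 4 can be sketched as follows for a single binary structure. This is an illustrative reading of "average entropy over the predicted segment", not the exact definition used in this study: the 0.5 threshold that defines the predicted segment, the binary entropy, and all function names are assumptions.

```python
import numpy as np

def binary_entropy(probs, eps=1e-12):
    """Pixel-wise entropy (in nats) of a foreground probability map."""
    p = np.clip(probs, eps, 1.0 - eps)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def segment_avg_entropy(probs, threshold=0.5):
    """Mean entropy over the predicted segment (pixels with p >= threshold)."""
    mask = probs >= threshold
    if not mask.any():
        return float("nan")  # no predicted segment, metric undefined
    return float(binary_entropy(probs)[mask].mean())

def dice(pred_mask, true_mask):
    """Dice coefficient between two boolean masks."""
    inter = np.logical_and(pred_mask, true_mask).sum()
    denom = pred_mask.sum() + true_mask.sum()
    return float(2.0 * inter / denom) if denom > 0 else 1.0

def logit(x, eps=1e-6):
    """Logit transform used to spread Dice values close to 0 or 1."""
    x = np.clip(x, eps, 1.0 - eps)
    return float(np.log(x / (1.0 - x)))

# Toy example: a confident prediction and an uncertain one for the same truth.
truth = np.array([[1, 1], [0, 0]], dtype=bool)
confident = np.array([[0.95, 0.90], [0.05, 0.10]])
uncertain = np.array([[0.60, 0.40], [0.55, 0.10]])
for name, probs in (("confident", confident), ("uncertain", uncertain)):
    d = dice(probs >= 0.5, truth)
    print(name, "avg entropy:", segment_avg_entropy(probs), "logit(Dice):", logit(d))
```

At test time only the average entropy is available, since the Dice coefficient requires the ground truth; the scatter plots examine how well the former tracks the latter.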

VIII Discussion
Through extensive experiments, we have rigorously assessed uncertainty estimation for medical image segmentation with FCNs. Furthermore, we proposed ensembling for confidence calibration of FCNs trained with Dice loss. We performed these assessments on three common medical image segmentation tasks to ensure generalizability of the findings. The results consistently show that cross entropy loss yields better uncertainty estimates than Dice loss in terms of NLL and ECE%, but falls short in segmentation quality. We then showed that ensembling notably calibrates the confidence of models trained with Dice loss. Importantly, in addition to reducing NLL, ensembling also improved segmentation accuracy in terms of Dice coefficient, which is consistent with the results of previous studies [kuijf2019standardized]. The results of our experiments comparing cross entropy with Dice loss are in line with those of Sander et al. [sander2019towards].
Importantly, we demonstrated the feasibility of constructing metrics that can capture the predictive uncertainty of individual segments. We showed that the average entropy of segments can predict the quality of the segmentation in terms of Dice coefficient. Preliminary results suggest that calibrated FCNs have the potential to detect out-of-distribution samples. Specifically, for prostate segmentation the ensemble correctly predicted the cases where it failed due to differences in imaging parameters (such as different imaging coils). However, this is an early attempt to capture segment-level segmentation quality, and the results thus need to be interpreted with caution. The proposed metric can be improved by adding prior knowledge about the labels. Furthermore, the proposed metric does not encompass any information on the number of samples used in that estimation.
As introduced in the methods section, some loss functions are "proper scoring rules", a desirable quality that promotes well-calibrated probabilistic predictions. The Deep Ensembles method requires the baseline loss function to be a proper scoring rule [lakshminarayanan2017simple]. The question arises: is the Dice loss a proper scoring rule? Here, we argue that there is a fundamental mismatch in using the Dice loss as a scoring rule. Scoring rules are functions that compare a probabilistic prediction with an outcome. In the context of binary segmentation, an outcome corresponds to a binary vector of length n, where n is the number of pixels. The difficulty with using scoring rules here is that the corresponding probabilistic prediction is a distribution on binary vectors. However, the predictions made by deep segmenters are collections of n label probabilities, one per pixel. Distributions on binary vectors are more complex: in general they are probability mass functions with 2^n parameters, one for each of the 2^n possible outcomes (the number of possible binary segmentations). The essential problem is that deep segmenters do not predict distributions on outcomes (binary vectors). One potential workaround is to say that the network does predict the required distributions, by constructing them as the product of the marginal distributions. This, though, has the problem that the predicted product distributions treat pixels as independent and so will generally not resemble the more general data distributions; in that sense, they are bound to be poor predictions.
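The gap between a distribution on binary vectors and a product of per-pixel marginals can be seen in a two-pixel toy example. The data distribution below is hypothetical and chosen only so that the two pixels are perfectly correlated; the helper names are likewise illustrative.

```python
import itertools

def product_of_marginals(marginals):
    """Joint pmf over all 2**n binary vectors implied by independent pixels."""
    pmf = {}
    for outcome in itertools.product((0, 1), repeat=len(marginals)):
        p = 1.0
        for bit, q in zip(outcome, marginals):
            p *= q if bit == 1 else 1.0 - q
        pmf[outcome] = p
    return pmf

# Hypothetical data distribution on 2 pixels: both pixels always agree.
true_pmf = {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.5}

# Per-pixel marginals P(pixel i = 1); this is all a per-pixel predictor can express.
marginals = [sum(p for o, p in true_pmf.items() if o[i] == 1) for i in range(2)]

factorised = product_of_marginals(marginals)

# Total variation distance between the data distribution and its factorised version.
tv_distance = 0.5 * sum(abs(true_pmf[o] - factorised[o]) for o in true_pmf)

print("marginals:", marginals)
print("factorised pmf:", factorised)
print("total variation distance:", tv_distance)
```

Even with exactly correct marginals, the factorised distribution places half of its mass on the two segmentations that never occur, which is the sense in which product distributions are poor predictions of correlated outcomes.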
We used segmentation tasks in the brain, the heart and the prostate to assess uncertainty estimation. Although each of these tasks was performed on MRI images, there were subtle differences between them. The brain segmentation task was performed on three-channel input (T1 contrast enhanced, FLAIR, and T2) while the other two were performed on single-channel input (T2 for prostate and Cine images for heart). Moreover, the number of training samples, the size of the target segments, and the homogeneity of samples were different in each task. Only publicly available datasets were used in this study to allow others to easily reproduce these experiments and results. The ground truth was created by experts and independent test sets were used for all experiments. For prostate gland segmentation and brain tumor segmentation tasks, we used multi-scanner, multi-institution test sets. For all three tasks, boundaries of the target segments were commonly identified as areas of high uncertainty estimate.
Our focus was not on achieving state-of-the-art results on the three mentioned segmentation tasks, but on using these to understand and improve the uncertainty prediction capabilities of FCNs. Since we performed several rounds of training with different loss functions, we limited the number of parameters in the models in order to speed up each training round; we carried out experiments with 2D CNNs (not 3D), used fewer convolutional filters in our baseline compared to the original UNet, and performed limited (not exhaustive) hyperparameter tuning to allow reasonable convergence.
Although MC dropout has been applied in many uncertainty estimation studies, we chose to not include it in this study as MC dropout requires modification of the network architecture by adding dropout layers to specific locations [kendall2015bayesian]. Moreover, batch normalization removes the need for dropout in many applications [ioffe2015batch].
Further work needs to be carried out to establish the effect of loss function on confidence calibration for deep FCNs. In this study, we only focused on Dice loss and cross entropy loss functions. It would be interesting to investigate the calibration and segmentation quality of other loss functions such as combinations of Dice loss and cross entropy loss, as well as the recently proposed Lovász-Softmax loss [berman2018lovasz].
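As an illustration of what a combination of the two losses could look like, the sketch below forms a weighted sum of a soft Dice loss and binary cross entropy. This is not a loss evaluated in this study: the particular soft Dice formulation, the smoothing constant, and the weight alpha are assumptions, and the function names are hypothetical.

```python
import numpy as np

def cross_entropy_loss(probs, labels, eps=1e-12):
    """Mean binary cross entropy between foreground probabilities and labels."""
    p = np.clip(probs, eps, 1.0 - eps)
    return float(-np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p)))

def soft_dice_loss(probs, labels, smooth=1e-6):
    """One common soft Dice loss: 1 - (2*sum(p*g) + s) / (sum(p) + sum(g) + s)."""
    inter = np.sum(probs * labels)
    denom = np.sum(probs) + np.sum(labels)
    return float(1.0 - (2.0 * inter + smooth) / (denom + smooth))

def combined_loss(probs, labels, alpha=0.5):
    """Weighted sum of cross entropy and soft Dice; alpha is a hypothetical weight."""
    return alpha * cross_entropy_loss(probs, labels) + (1.0 - alpha) * soft_dice_loss(probs, labels)

# Toy example on a 2x2 label map.
labels = np.array([[1.0, 0.0], [0.0, 1.0]])
probs = np.array([[0.8, 0.3], [0.2, 0.6]])
print("CE:", cross_entropy_loss(probs, labels))
print("soft Dice:", soft_dice_loss(probs, labels))
print("combined:", combined_loss(probs, labels))
```

Whether such a combination retains the calibration of cross entropy together with the segmentation quality of Dice loss is precisely the open question raised above.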
There remains a need to study calibration methods that, unlike ensembling, do not require time-consuming training from scratch. In this work we only investigated uncertainty estimation for MR images. Although parameter changes occur more often in MRI than in computed tomography (CT), it would still be very interesting to study uncertainty estimation in CT images. Parameter changes in CT can also be a source of failure in CNNs. For instance, changes in slice thickness or use of contrast can result in failures in predictions, and it is highly desirable to predict such failures through model confidence. We believe that our research will serve as a base for future studies on uncertainty estimation and confidence calibration for medical image segmentation.
Further study of the sources of uncertainty in medical image segmentation is needed. Uncertainty has been classified as aleatoric or epistemic in medical applications [indrayan2012medical] and Bayesian modeling [kendall2017uncertainties]. Aleatoric uncertainty exists due to noise or the stochastic behavior of a system. In contrast, epistemic uncertainty is rooted in limited knowledge about the model or the data. In this study, we consistently observed higher levels of uncertainty at specific locations such as boundaries. For example, in the prostate segmentation task, inter- and intra-rater disagreement is often higher in the delineation of the base and apex of the prostate than at the mid-gland boundaries [litjens2014evaluation]. Such disagreements can leave their traces on models that are trained using ground truth labels with such discrepancies. It has been shown that with enough training data from multiple raters, deep models are able to surpass human agreement on segmentation tasks [litjens2017survey]. However, not much work has been done on the relationship between ground truth quality and the model uncertainty that results from rater disagreement.
We conclude that model ensembling can be used successfully for confidence calibration of FCNs trained with Dice loss. Also, the proposed average entropy metric can be used as an effective predictive metric for estimating the performance of the model at test time when the ground truth is unknown.