Confidence Calibration and Predictive Uncertainty Estimation for Deep Medical Image Segmentation


Alireza Mehrtash, William M. Wells III, Clare M. Tempany, Purang Abolmaesumi, and Tina Kapur* This work was supported by the US National Institutes of Health grants P41EB015898, Natural Sciences and Engineering Research Council (NSERC) of Canada, and the Canadian Institutes of Health Research (CIHR). Asterisk indicates corresponding author. A. Mehrtash is with the Department of Electrical and Computer Engineering, The University of British Columbia, Vancouver, BC, V6T 1Z4, Canada, and also with the Department of Radiology, Brigham and Women’s Hospital, Harvard Medical School, Boston, 02115, USA. P. Abolmaesumi is with the Department of Electrical and Computer Engineering, The University of British Columbia Vancouver, BC, V5T 1Z4, Canada. C. M. Tempany, W. M. Wells, and T. Kapur are with the Department of Radiology, Brigham and Women’s Hospital, Harvard Medical School, Boston, 02115, USA (e-mail:

Fully convolutional neural networks (FCNs), and in particular U-Nets, have achieved state-of-the-art results in semantic segmentation for numerous medical imaging applications. Moreover, batch normalization and Dice loss have been used successfully to stabilize and accelerate training. However, these networks are poorly calibrated, i.e., they tend to produce overconfident predictions for both correct and erroneous classifications, making them unreliable and hard to interpret. In this paper, we study predictive uncertainty estimation in FCNs for medical image segmentation. We make the following contributions: 1) We systematically compare cross entropy loss with Dice loss in terms of segmentation quality and uncertainty estimation of FCNs; 2) We propose model ensembling for confidence calibration of FCNs trained with batch normalization and Dice loss; 3) We assess the ability of calibrated FCNs to predict the segmentation quality of structures and to detect out-of-distribution test examples. We conduct extensive experiments across three medical image segmentation applications of the brain, the heart, and the prostate to evaluate our contributions. The results of this study offer considerable insight into predictive uncertainty estimation and out-of-distribution detection in medical image segmentation and provide practical recipes for confidence calibration. Moreover, we consistently demonstrate that model ensembling improves confidence calibration.

Uncertainty Estimation, Confidence Calibration, Out-of-distribution Detection, Semantic Segmentation, Fully Convolutional Neural Networks

I Introduction

Fully convolutional neural networks (FCNs), and in particular the U-Net [ronneberger2015u], have become a de facto standard for semantic segmentation in general and for medical image segmentation tasks in particular. The U-Net architecture has been used for segmentation of both normal organs and lesions and has achieved top-ranking results in several international segmentation challenges [kuijf2019standardized, kaggle_salt, mrbrains18]. Despite numerous applications of U-Nets, very few works have studied the capability of these networks to capture predictive uncertainty. Predictive uncertainty, or prediction confidence, is described as the ability of a decision-making system to provide an expectation of success (i.e., correct classification) or failure for the test examples at inference time. Using a frequentist interpretation of uncertainty, the predictions (i.e., class probabilities) of a well-calibrated model should match the probability of success of those inferences in the long run [guo2017calibration]. For instance, if a well-calibrated brain tumor segmentation model classifies 100 pixels as cancer, each with a probability of 0.7, we expect 70 of those pixels to be correctly classified as cancer. For a poorly calibrated model reporting the same probabilities, the number of correctly classified pixels can be far from 70, in either direction. Miscalibration frequently occurs in many modern neural networks (NNs) that are trained with advanced optimization methods [guo2017calibration]. Poorly calibrated NNs are often highly confident in their misclassifications [amodei2016concrete]. In some applications, for example medical image analysis or automated driving, overconfidence can be dangerous.

Batch normalization (BN) [ioffe2015batch] and Dice loss [sudre2017generalised] have made FCN optimization seamless. BN effectively stabilizes convergence and also improves the performance of networks for natural image classification tasks [ioffe2015batch]. The addition of BN to the U-Net has also improved optimization and segmentation quality [cciccek20163d]. Dice loss is robust to class imbalance and has been successfully applied in many segmentation problems [sudre2017generalised]. However, it has been reported that both BN and Dice loss have adverse effects on calibration quality [guo2017calibration, sander2019towards, bertels2019optimization]. Consequently, FCNs trained with BN and Dice loss do not produce well-calibrated probabilities, leading to poor uncertainty estimation. In contrast to Dice loss, cross entropy loss provides better calibrated predictions and uncertainty estimates, as it is a strictly proper scoring rule [gneiting2007strictly]. Yet, the use of cross entropy as the loss function for training FCNs can be challenging in situations where there is a high class imbalance, e.g., where most of an image is considered background [sudre2017generalised]. Hence, it is of great significance and interest to study methods for confidence calibration of FCNs trained with BN and Dice loss.

Figure 1: Examples of confidence calibration and out-of-distribution detection for models that were trained with MR images acquired using phased array coils. The results of inference are shown for two test examples: (top row) a test example imaged with a phased array coil (in-distribution example), and (bottom row) an MR image acquired using an endorectal coil (out-of-distribution example). The first column on the left shows T2-weighted MR images of the prostate with the boundary of the prostate drawn by an expert (white line). The second column shows MRIs overlaid with the uncalibrated segmentation predictions from an FCN that was trained with Dice loss. The third column shows MRIs overlaid with the calibrated segmentation predictions of an ensemble of FCNs trained with Dice loss. The last column on the right shows the histogram of class probabilities over the predicted prostate segment after calibration. Ensembling significantly improves both calibration and segmentation quality and captures the uncertainty of the model. In the second row, calibration reduces the average of the confidence probabilities over the predicted prostate gland segment, signaling to the user that the model is "not confident" about its prediction.

Another important aspect of uncertainty estimation is the ability of a predictive model to distinguish in-distribution test examples (i.e., those similar to the training data) from out-of-distribution test examples (i.e., those that do not fit the distribution of the training data) [hendrycks2016baseline]. The ability of models to detect out-of-distribution inputs is especially important for medical imaging applications, as deep networks are sensitive to domain shift, which is a recurring situation in the medical imaging domain [ghafoorian2017transfer]. For instance, networks trained on one MRI protocol often do not perform satisfactorily on images obtained with slightly different parameters or on out-of-distribution test images. Hence, in the face of an out-of-distribution sample, an ideal model knows and announces "I do not know" and seeks human intervention, if possible, instead of failing silently. Figure 1 shows inferences from a U-Net model that was trained with BN and Dice loss for prostate segmentation before and after confidence calibration.

II Related Works

There has been a recent growing interest in uncertainty estimation and confidence measurement with deep NNs. Although most studies on uncertainty estimation have been done through Bayesian modeling of the NN, there has been some recent interest in using non-Bayesian approaches such as ensembling methods. Here, we first briefly review Bayesian and non-Bayesian methods and then review the recent literature for uncertainty estimation for semantic segmentation applications.

In the Bayesian approach, the deterministic parameters of the NN are replaced by prior probability distributions. Using Bayesian inference, given the data samples, a posterior probability distribution over the parameters is calculated. At inference time, instead of a single scalar probability, the Bayesian NN gives probability distributions over the output label probabilities [mackay1992practical], which model the NN's predictive uncertainty. Gal and Ghahramani proposed to use dropout [srivastava2014dropout] as a Bayesian approximation [gal2015dropout]. They proposed Monte Carlo dropout (MC dropout), in which dropout layers are applied before every weight layer together with non-linearities, which provides an approximation to a probabilistic Gaussian process. MC dropout is straightforward to implement and has been applied in several application domains including medical imaging [leibig2017leveraging]. In a similar Bayesian approach, Teye et al. [teye2018MCBN] showed that training NNs with BN [ioffe2015batch] can be used to approximate inference of Bayesian NNs. For networks with BN and without dropout, Monte Carlo Batch Normalization (MCBN) can be considered an alternative to MC dropout. In another Bayesian work, Heo et al. [heo2018uncertainty] proposed a method that allows the attention model to leverage uncertainty. By learning the Uncertainty-aware Attention (UA) with variational inference, they improved both model calibration and performance in attention models. Seo et al. [seo2019learning] proposed a variance-weighted loss function that enables learning single-shot calibration scores. In combination with stochastic depth and dropout, their method can improve confidence calibration and classification accuracy. Recently, Liao et al. [liao2019modelling] proposed a method for modeling the uncertainty arising from intra-observer variability in 2D echocardiography using their proposed Cumulative Density Function Probability method.

Non-Bayesian approaches have also been proposed for probability calibration and uncertainty estimation. Guo et al. [guo2017calibration] studied the problem of confidence calibration in deep NNs. Through experiments, they analyzed different parameters such as depth, width, weight decay, and BN and their effect on calibration. They also used temperature scaling to easily calibrate trained models. Following the success of ensembling methods [dietterich2000ensemble] in improving baseline performance, Lakshminarayanan et al. proposed Deep Ensembles, in which model averaging was used to estimate predictive uncertainty [lakshminarayanan2017simple]. By training collections of models with random initialization of parameters and adversarial training, they provided a simple approach to assess uncertainty. Unlike MC dropout, Deep Ensembles do not require network modification, and they outperformed MC dropout on two image classification problems. On the downside, the method requires training several models from scratch, which is computationally expensive for large datasets and complex models.

Predictive uncertainty estimation has been studied specifically for the problem of semantic segmentation with deep NNs. Bayesian SegNet [kendall2015bayesian] was among the first works that addressed uncertainty estimation in FCNs by using MC dropout. The authors applied MC dropout by adding dropout layers after the pooling and upsampling blocks of the three innermost layers of the encoder and decoder sections of the SegNet architecture. Using similar approaches for uncertainty estimation, Kwon et al. [kwon2018uncertainty] and Sedai et al. [sedai2018joint] used Bayesian NNs for uncertainty quantification in segmentation of ischemic stroke lesions and visualization of retinal layers, respectively. Sander et al. [sander2019towards] applied MC dropout to capture instance segmentation uncertainty in ambiguous regions and compared different loss functions in terms of the resultant miscalibration. Kohl et al. [kohl2018probabilistic] proposed a Probabilistic U-Net that combined an FCN with a conditional variational autoencoder to provide multiple segmentation hypotheses for ambiguous images. In similar work, Hu et al. [hu2019supervised] studied uncertainty quantification in the presence of multiple annotations as a result of inter-observer disagreement. They used a probabilistic U-Net to quantify uncertainty in the segmentation of lung abnormalities. Rottmann and Schubert [rottmann2019uncertainty] proposed a prediction quality rating method for segmentation of nested multi-resolution street scene images by measuring both pixel-wise and segment-wise measures of uncertainty as predictive metrics for segmentation quality. Recently, Karimi et al. [karimi2019accurate] used ensembling for uncertainty estimation of difficult-to-segment regions and used this information to improve clinical target volume estimation in prostate ultrasound images. In another recent work, Jungo and Reyes [jungo2019assessing] studied uncertainty estimation for brain tumor and skin lesion segmentation tasks.

In conjunction with uncertainty estimation and confidence calibration, several works have studied out-of-distribution detection [hendrycks2016baseline, liang2017enhancing, lee2017training, devries2018learning, shalev2018out]. In a non-Bayesian approach, Hendrycks and Gimpel [hendrycks2016baseline] used the softmax prediction probability as a baseline to effectively predict misclassification and out-of-distribution test examples. Liang et al. [liang2017enhancing] used temperature scaling and input perturbations to enhance the baseline method of Hendrycks and Gimpel [hendrycks2016baseline]. In the context of a generative NN scheme, Lee et al. used a loss function that encourages confidence calibration [lee2017training], and this resulted in improvements in out-of-distribution detection. Similarly, DeVries and Taylor [devries2018learning] proposed a hybrid with a confidence term to improve out-of-distribution detection. Shalev et al. [shalev2018out] used multiple semantic dense representations of the target labels to detect misclassified and adversarial examples.

III Contributions

In this work, we study predictive uncertainty estimation for semantic segmentation with FCNs and propose ensembling for confidence calibration and reliable predictive uncertainty estimation of segmented structures. In summary, we make the following contributions:

  • We analyze the choice of loss function for semantic segmentation in FCNs. We compare the two most commonly used loss functions in training FCNs for semantic segmentation: cross entropy loss and Dice loss. We train models with these loss functions and compare the resulting segmentation quality and predictive uncertainty estimation. We observe that FCNs trained with Dice loss achieve significantly better segmentation quality than those trained with cross entropy, but at the cost of poorer calibration.

  • We propose model ensembling [lakshminarayanan2017simple] for confidence calibration of FCNs trained with Dice loss and batch normalization. By training collections of FCNs with random initialization of parameters and random shuffling of training data, we create an ensemble that improves both segmentation quality and uncertainty estimation. We empirically quantify the effect of number of models on calibration and segmentation quality.

  • We propose to use average entropy over the predicted segmented object as a metric to predict segmentation quality of foreground structures, which can be further used to detect out-of-distribution test inputs. Our results demonstrate that object segmentation quality correlates inversely with the average entropy over the segmented object and can be used effectively for detecting out-of-distribution inputs.

  • We demonstrate our method for uncertainty estimation and confidence calibration on three different segmentation tasks from MRI images of the brain, the heart, and the prostate. Where appropriate, we report the statistical significance of our findings.

IV Applications & Data

Table I shows the number of patient images in each dataset and how we split these into training, validation, and test sets. In the following subsections, we briefly describe each segmentation task, the data characteristics, and the pre-processing.

Application      Brain    Heart    Prostate
# Training       66       40       16
# Validation     22       10       4
# Test           102      50       20 and 35†
  • † Used only for out-of-distribution detection experiments.

Table I: Number of patients for training, validation, and test sets used in this study.

IV-A Brain Tumor Segmentation Task

For brain tumor segmentation, data from the MICCAI 2017 BraTS challenge [bakas2017advancing, menze2015multimodal] was used. This is a four-class segmentation task; multiparametric MRI of brain tumor patients are to be segmented into enhancing tumor, non-enhancing tumor, edema, and background. The training dataset consists of 190 multiparametric MRI scans (T1-weighted, contrast-enhanced T1-weighted, T2-weighted, and FLAIR sequences) from brain tumor patients. The dataset is further subdivided into two sets: CBICA and TCIA. The images in the CBICA set were acquired at the Center for Biomedical Image Computing and Analytics (CBICA) at the University of Pennsylvania [bakas2017advancing]. The images in the TCIA set were acquired across multiple institutions and are hosted by the National Cancer Institute, The Cancer Imaging Archive (TCIA). The CBICA subset was used for training and validation and the TCIA subset was reserved as the test set.

IV-B Ventricular Segmentation Task

For heart ventricle segmentation, data from the MICCAI 2017 ACDC challenge for automated cardiac diagnosis was used [wolterink2017automatic]. This is a four-class segmentation task; cine MR images (CMRI) of patients are to be segmented into the left ventricle, the endocardium, the right ventricle, and background. This dataset consists of end-diastole (ED) and end-systole (ES) images of 100 patients. We used only the ED images in our study.

IV-C Prostate Segmentation Task

For prostate segmentation, the public datasets PROSTATEx [Litjens2014-prostatex] and PROMISE12 [litjens2014evaluation] were used. This is a two-class segmentation task; axial T2-weighted images of men suspected of having prostate cancer are to be segmented into the prostate gland and background. For the PROSTATEx dataset, 40 images with annotations from Meyer et al. [Meyer2018-lo] were used. All these images were acquired at the same institution. The PROSTATEx dataset was used for both training and testing, and the PROMISE12 dataset was set aside for testing only. PROMISE12 is a heterogeneous multi-institutional dataset acquired using different MR scanners and acquisition parameters. We used the 50 training images for which ground truth is available.

IV-D Data Pre-processing

Prostate and cardiac images were resampled to the common in-plane resolution of mm and  mm, respectively. Brain images were resampled to the resolution of  mm. All axial slices were then cropped at the center to create images of size pixels as the input size of the FCN. Image intensities were normalized to be within the range of [0,1].
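The cropping and intensity normalization steps can be illustrated with a minimal NumPy sketch. The function names `center_crop` and `normalize_01` are hypothetical, the crop size is left as a parameter because it is not stated here, and min-max scaling is only one assumed way of mapping intensities into [0,1]; resampling is not covered.

```python
import numpy as np

def center_crop(slice_2d, size):
    """Crop a (H, W) slice to (size, size) around its center.

    Assumes both H and W are at least `size`.
    """
    h, w = slice_2d.shape
    top, left = (h - size) // 2, (w - size) // 2
    return slice_2d[top:top + size, left:left + size]

def normalize_01(image, tiny=1e-8):
    """Min-max scale intensities into [0, 1] (assumed normalization scheme)."""
    image = np.asarray(image, dtype=float)
    lo, hi = image.min(), image.max()
    # `tiny` guards against division by zero for constant images.
    return (image - lo) / max(hi - lo, tiny)
```

For example, `normalize_01(center_crop(axial_slice, size))` would be applied to each axial slice before it is passed to the network.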

V Methods

V-A Model

Semantic segmentation can be formulated as a pixel-level classification problem, which can be solved by convolutional neural networks [litjens2017survey]. The pixels in the training image and label pairs can be considered as $N$ i.i.d. data points $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^M$ is the input $M$-dimensional feature map and $y_i$ can be one and only one of the $K$ possible classes, $y_i \in \{1, \dots, K\}$. The use of FCNs for image segmentation allows for end-to-end learning, with each pixel of the input image being mapped by the FCN to the output segmentation map. Compared to FCNs, patch-based NNs are much slower at inference time as they require sliding window mechanisms for predicting each pixel [long2015fully]. Moreover, it is more straightforward to implement segment-level loss functions such as Dice loss in FCN architectures. FCNs for segmentation usually consist of an encoder (contracting) path and a decoder (expanding) path [long2015fully, ronneberger2015u]. FCNs with skip-connections are able to combine high-level abstract features with low-level high-resolution features, which has been shown to be successful in segmentation tasks [ronneberger2015u, cciccek20163d]. NNs can be formulated as parametric conditional probability models, $p(y_i \mid x_i, \theta)$, and the parameter set $\theta$ is chosen to minimize a loss function. Both cross entropy (CE) and the negative of the Dice Similarity Coefficient (DSC), known as Dice loss, have been used as loss functions for training FCNs. Class weights are used for optimization convergence and for dealing with the class imbalance issue. With CE loss, the parameter set is chosen to maximize the average weighted log likelihood over the training data, i.e., to minimize:

$$\mathcal{L}_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} \omega_k \, \mathbb{1}(y_i = k) \, \ln p(y_i = k \mid x_i, \theta), \tag{1}$$


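where $p(y_i = k \mid x_i, \theta)$ is the probability of pixel $i$ belonging to class $k$, $\mathbb{1}(y_i = k)$ is the binary indicator which denotes if the class label $k$ is the correct class of the $i$th pixel, $\omega_k$ is the weight for class $k$, and $N$ is the number of pixels that are used in each mini-batch. With the Dice loss, the parameter set is chosen to minimize the negative of the weighted Dice of the different structures:

$$\mathcal{L}_{DSC} = -\sum_{k=1}^{K} \omega_k \, \frac{2\sum_{i=1}^{N} \mathbb{1}(y_i = k)\, p(y_i = k \mid x_i, \theta) + \epsilon}{\sum_{i=1}^{N} \left[\mathbb{1}(y_i = k) + p(y_i = k \mid x_i, \theta)\right] + \epsilon}, \tag{2}$$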


where $p(y_i = k \mid x_i, \theta)$ is the probability of pixel $i$ belonging to class $k$, $\mathbb{1}(y_i = k)$ is the binary indicator which denotes if the class label $k$ is the correct class of the $i$th pixel, $\omega_k$ is the weight for class $k$, $N$ is the number of pixels that are used in each mini-batch, and $\epsilon$ is the smoothing factor to make the loss function differentiable. Subsequently, $p(y \mid x, \theta^{*})$ is used for inference, where $\theta^{*}$ is the optimized parameter set.
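To make the two losses concrete, the following NumPy sketch evaluates a class-weighted cross entropy and a class-weighted soft Dice loss on flattened pixel probabilities. It is an illustration under assumptions rather than the training code used in this work: the function names are hypothetical, the Dice term uses the common form (2 x intersection + eps) / (sum of labels and probabilities + eps), and `eps=1.0` is an arbitrary choice of smoothing factor.

```python
import numpy as np

def weighted_cross_entropy(probs, onehot, class_weights, tiny=1e-12):
    """Class-weighted CE averaged over pixels.

    probs, onehot: arrays of shape (N, K); class_weights: shape (K,).
    """
    log_p = np.log(np.clip(probs, tiny, 1.0))  # clip avoids log(0)
    return float(-np.mean(np.sum(class_weights * onehot * log_p, axis=1)))

def weighted_dice_loss(probs, onehot, class_weights, eps=1.0):
    """Negative class-weighted soft Dice (assumed smoothing placement)."""
    intersection = np.sum(onehot * probs, axis=0)   # shape (K,)
    denominator = np.sum(onehot + probs, axis=0)    # shape (K,)
    dice_per_class = (2.0 * intersection + eps) / (denominator + eps)
    return float(-np.sum(class_weights * dice_per_class))
```

Note that the Dice loss depends on the probabilities only through per-class sums, so it rewards overlap rather than per-pixel likelihood; this is one intuition for why it need not produce calibrated probabilities.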

V-B Calibration Metrics

The output of an FCN for each input pixel $x_i$ is a class prediction $\hat{y}_i$ and its associated class probability $p(\hat{y}_i \mid x_i, \theta)$. The class probability can be considered the model confidence or probability of correctness and can be used as a measure of predictive uncertainty at the pixel level. Strictly proper scoring rules are used to assess the calibration quality of predictive models [gneiting2007strictly]. In general, scoring rules assess the quality of uncertainty estimation in models by rewarding well-calibrated probabilistic forecasts. Negative log likelihood (NLL) and Brier score [brier1950verification] are both strictly proper scoring rules that have been previously used in several studies for evaluating predictive uncertainty [guo2017calibration, lakshminarayanan2017simple, gal2015dropout]. In a segmentation problem, for a collection of $N$ pixels, NLL is calculated as:

$$\mathrm{NLL} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} \mathbb{1}(y_i = k) \, \ln p(y_i = k \mid x_i, \theta). \tag{3}$$


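Brier score (Br) measures the accuracy of probabilistic predictions:

$$\mathrm{Br} = \frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} \left[p(y_i = k \mid x_i, \theta) - \mathbb{1}(y_i = k)\right]^2. \tag{4}$$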


In addition to NLL and Brier score, we directly assess the predictive power of a model by analyzing the confidence values of test examples versus their measured expected accuracy. To do so, we use reliability diagrams as visual representations of model calibration and Expected Calibration Error (ECE) as a summary statistic for calibration [guo2017calibration, naeini2015obtaining]. Reliability diagrams plot expected accuracy as a function of class probability (confidence). The reliability diagram of a perfectly calibrated model is the identity function. For expected accuracy measurement, the samples are binned into $N$ groups (here $N$ denotes the number of bins) and the accuracy and confidence for each group are computed. Assuming $B_m$ to be the indices of samples whose confidence predictions are in the range $\left(\frac{m-1}{N}, \frac{m}{N}\right]$, the expected accuracy of bin $B_m$ is $\mathrm{Acc}(B_m) = \frac{1}{|B_m|}\sum_{i \in B_m} \mathbb{1}(\hat{y}_i = y_i)$. The average confidence on bin $B_m$ is calculated as $\mathrm{Conf}(B_m) = \frac{1}{|B_m|}\sum_{i \in B_m} p(\hat{y}_i \mid x_i, \theta)$. ECE is calculated as the weighted average, over the bins, of the absolute differences between accuracy and average confidence:

$$\mathrm{ECE} = \sum_{m=1}^{N} \frac{|B_m|}{T} \left|\mathrm{Acc}(B_m) - \mathrm{Conf}(B_m)\right|, \tag{5}$$


where $T$ is the total number of samples. In other words, ECE is the weighted average of the gaps on the reliability diagram.
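The three calibration metrics can be sketched in a few lines of NumPy. This is an illustrative sketch under assumptions, not the evaluation code behind the reported numbers: the function names are hypothetical, NLL and Brier score are averaged over pixels, ECE uses 10 equal-width bins by default, and the confidence of a pixel is taken to be its largest class probability.

```python
import numpy as np

def nll(probs, labels, tiny=1e-12):
    """Average negative log likelihood. probs: (N, K); labels: (N,) ints."""
    labels = np.asarray(labels)
    p_true = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(np.clip(p_true, tiny, 1.0))))

def brier(probs, labels):
    """Mean squared distance between probabilities and one-hot labels."""
    onehot = np.eye(probs.shape[1])[np.asarray(labels)]
    return float(np.mean(np.sum((probs - onehot) ** 2, axis=1)))

def ece(probs, labels, n_bins=10):
    """Expected calibration error with equal-width confidence bins."""
    confidences = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == np.asarray(labels)).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)  # bin (lo, hi]
        if not in_bin.any():
            continue
        gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
        total += in_bin.mean() * gap  # weight by the fraction of samples
    return float(total)
```

With this sketch, 100 pixels predicted with confidence 0.7 of which 70 are correct give an ECE of approximately zero, which matches the frequentist description of a well-calibrated model given in the introduction.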

V-C Confidence Calibration with Ensembling

We propose to use ensembling [dietterich2000ensemble] for confidence calibration of FCNs trained with Dice loss. We hypothesize that an ensemble of poorly calibrated FCNs trained with the Dice loss function produces high-quality predictive uncertainty estimates, i.e., ensembling calibrates FCNs trained with Dice loss. To this end, similar to the Deep Ensembles method [lakshminarayanan2017simple], we train $M$ FCNs with random initialization of the network parameters and random shuffling of the training dataset in mini-batch stochastic gradient descent. However, unlike the Deep Ensembles method, we do not use any form of adversarial training. We train each of the $M$ models in the ensemble from scratch and then compute the probability of the ensemble as the average of the baseline probabilities as follows:

$$p_E(y \mid x) = \frac{1}{M}\sum_{m=1}^{M} p(y \mid x, \theta_m^{*}), \tag{6}$$


where $p(y \mid x, \theta_m^{*})$ are the individual probabilities.
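As a minimal sketch of the averaging step, assuming the per-model probability maps have already been computed and stacked along a leading model axis (the function name `ensemble_probabilities` is hypothetical):

```python
import numpy as np

def ensemble_probabilities(member_probs):
    """Average class probabilities over ensemble members.

    member_probs: array-like of shape (M, ..., K), one probability map
    per independently trained model. Returns an array of shape (..., K).
    """
    return np.mean(np.asarray(member_probs, dtype=float), axis=0)
```

When overconfident members disagree on a pixel, for instance probabilities of 0.99 and 0.01 for the same class, the averaged probability moves toward 0.5, which illustrates how the ensemble can express uncertainty that no individual member reports.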

                                 Calibration Quality (Whole Volume)                         Calibration Quality (Bounding Boxes)
Application (Model)              NLL (95% CI)       Brier (95% CI)     ECE% (95% CI)        NLL (95% CI)       Brier (95% CI)     ECE% (95% CI)         Failure Rate
Brain ($\mathcal{L}_{CE}$)       0.06 (0.01-0.23)   0.03 (0.01-0.10)   0.78 (0.13-3.46)     0.43 (0.15-1.19)   0.19 (0.07-0.49)   6.75 (1.40-20.07)     20.9%
Brain ($\mathcal{L}_{DSC}$)      0.14 (0.03-0.42)   0.02 (0.00-0.05)   0.83 (0.17-2.50)     1.47 (0.44-3.58)   0.16 (0.06-0.35)   8.34 (2.82-21.45)     16.8%
Brain (Ensemble)                 0.03 (0.01-0.07)   0.01 (0.00-0.02)   0.44 (0.02-1.33)     0.25 (0.10-0.77)   0.09 (0.04-0.18)   2.33 (0.29-8.07)      18.5%
Heart ($\mathcal{L}_{CE}$)       0.03 (0.01-0.08)   0.01 (0.01-0.03)   0.37 (0.14-0.95)     0.32 (0.16-0.73)   0.16 (0.09-0.30)   5.26 (1.42-12.03)     2.2%
Heart ($\mathcal{L}_{DSC}$)      0.04 (0.01-0.15)   0.02 (0.00-0.04)   0.94 (0.10-2.73)     0.52 (0.17-1.49)   0.22 (0.06-0.46)   12.80 (2.60-31.58)    2.5%
Heart (Ensemble)                 0.02 (0.01-0.06)   0.01 (0.01-0.02)   0.20 (0.07-0.79)     0.25 (0.16-0.50)   0.13 (0.08-0.23)   3.08 (0.93-8.05)      2.1%
Prostate ($\mathcal{L}_{CE}$)    0.08 (0.04-0.16)   0.04 (0.02-0.09)   2.17 (0.51-7.20)     0.40 (0.22-0.79)   0.25 (0.13-0.47)   8.10 (1.60-25.69)     0.0%
Prostate ($\mathcal{L}_{DSC}$)   0.26 (0.10-0.58)   0.04 (0.02-0.08)   1.97 (0.97-4.13)     0.75 (0.33-1.67)   0.11 (0.07-0.27)   5.75 (3.32-13.12)     0.0%
Prostate (Ensemble)              0.05 (0.02-0.09)   0.02 (0.01-0.04)   0.65 (0.13-1.26)     0.15 (0.07-0.24)   0.07 (0.04-0.14)   2.01 (0.48-3.65)      0.0%
Table II: Confidence calibration of baselines trained with Dice loss ($\mathcal{L}_{DSC}$) compared with that of baselines trained with cross entropy ($\mathcal{L}_{CE}$) and of models calibrated with ensembling.

V-D Segment-level Predictive Uncertainty Estimation

For segmentation applications, besides the pixel-level confidence metric, it is desirable to have a confidence metric that captures model uncertainty at the segment level. Such a metric would be very useful in clinical applications for decision making. For a well-calibrated system, we anticipate that a segment-level confidence metric can predict the segmentation quality in the absence of ground truth. The metric can be used to detect out-of-distribution samples and hard or ambiguous cases. Such metrics have been previously proposed for street scene segmentation [rottmann2019uncertainty]. Given the pixel-level class predictions $\hat{y}_i$ and their associated ground truth class $y_i$ for a predicted segment $\hat{S}_k$, we propose to use the average of pixel-wise entropy values over the predicted foreground segment as a scalar metric for the volume-level confidence of that segment:


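$$\overline{\mathcal{H}(\hat{S}_k)} = -\frac{1}{|\hat{S}_k|}\sum_{i \in \hat{S}_k} \left[ p(\hat{y}_i \mid x_i, \theta) \cdot \ln p(\hat{y}_i \mid x_i, \theta) + \left(1 - p(\hat{y}_i \mid x_i, \theta)\right) \cdot \ln\left(1 - p(\hat{y}_i \mid x_i, \theta)\right) \right]. \tag{7}$$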

In calculating the average entropy of $\hat{S}_k$, we assumed binary classification: the probability of belonging to class $k$, $p(\hat{y}_i \mid x_i, \theta)$, and the probability of belonging to the other classes, $1 - p(\hat{y}_i \mid x_i, \theta)$.
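A minimal NumPy sketch of this segment-level metric for a single foreground class is given below. The function name is hypothetical, and defining the predicted segment by thresholding the foreground probability at 0.5 is an assumption made for this illustration; in a multi-class setting the segment would instead be given by the predicted class labels.

```python
import numpy as np

def segment_average_entropy(fg_prob, threshold=0.5, tiny=1e-12):
    """Average binary entropy over the predicted foreground segment.

    fg_prob: array of per-pixel foreground probabilities (any shape).
    Returns NaN when no pixel is predicted as foreground.
    """
    fg_prob = np.asarray(fg_prob, dtype=float)
    segment = fg_prob >= threshold            # assumed segment definition
    if not segment.any():
        return float("nan")
    p = np.clip(fg_prob[segment], tiny, 1.0 - tiny)  # avoid log(0)
    entropy = -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))
    return float(entropy.mean())
```

The value ranges from 0 for a segment predicted with full confidence to ln 2 (about 0.693) for a segment whose pixels all have probability 0.5, so larger values indicate lower confidence in the segment.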

VI Experiments

VI-A Training Baselines

For all of the experiments, we used a baseline FCN model similar to the two-dimensional U-Net architecture [ronneberger2015u] but with fewer kernel filters at each layer. The input and output of the FCN have a size of pixels. Except for brain tumor segmentation, which used a three-channel input (T1CE, T2, FLAIR), the input was a single channel. The network has the same number of layers as the original U-Net. The numbers of kernels for the encoder section of the U-Net were 8, 8, 16, 16, 32, 32, 64, 64, 128, and 128. The parameters of the convolutional layers were initialized randomly from a Gaussian distribution [he2015delving]. For each of the three segmentation problems, the model was trained 100 times with cross entropy and 100 times with Dice loss, each time with random weight initialization and random shuffling of the training data. For the models that were trained with Dice loss, the softmax activation function of the last layer was replaced with a sigmoid function, as this improved convergence substantially. For optimization, stochastic gradient descent with the Adam update rule [kingma2014adam] was used. During training, we used a mini-batch of 16 examples for prostate segmentation and 32 examples for the brain tumor and cardiac segmentation tasks. The initial learning rate was set to and it was reduced by a factor of if the average of validation Dice score did not improve by in 10 epochs. We used 1000 epochs for training the models with an early stopping policy. For each run, the model checkpoint was saved at the epoch where the validation DSC was the highest.

VI-B Cross Entropy vs. Dice

CE loss aims to minimize the average negative log likelihood over the pixels, while Dice loss directly improves segmentation quality in terms of the Dice coefficient. As a result, we expect models trained with CE to achieve a lower NLL and models trained with Dice loss to achieve better Dice coefficients. Here, our main focus is on the complementary cases: the segmentation quality, in terms of the Dice coefficient, of models trained with cross entropy, and the calibration quality of models trained with Dice loss. We compare models trained with cross entropy with those trained with Dice loss on the three segmentation tasks. For statistical tests and for calculating 95% confidence intervals (CI), we used bootstrapping (n=100).

VI-C Confidence Calibration

We use ensembling (Equation 6) to calibrate batch-normalized FCNs trained with Dice loss. For the three segmentation problems, we make ensemble predictions and compare them with the baselines in terms of calibration and segmentation quality. For calibration quality, we compare NLL, Brier score, and ECE%. For segmentation quality, we compare Dice and the 95th percentile Hausdorff distance. Moreover, for calibration quality assessment we calculate the metrics on two sets of samples from the held-out test datasets: 1) the whole test dataset (all pixels of the test volumes), and 2) pixels belonging to dilated bounding boxes around the foreground segments. The foreground segments and the adjacent background around them usually have the highest uncertainty and difficulty. At the same time, background pixels far from the foreground segments show less uncertainty but outnumber the foreground pixels. Using bounding boxes removes most of the correct (certain) background predictions from the statistics and better highlights the differences among models. For all three problems, we construct bounding boxes of the foreground structures. The boxes are then dilated by 8 mm in each direction of the in-plane axes and by 2 slices (which translates to 4 mm to 20 mm) in each direction of the out-of-plane axis.
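The bounding-box restriction can be sketched as follows. This is an illustrative NumPy sketch under assumptions: the function name is hypothetical, the foreground mask is a boolean volume ordered (slice, row, column), and the margin is given in voxels, so the caller is responsible for converting the 8 mm in-plane margin using the pixel spacing of each volume.

```python
import numpy as np

def dilated_bounding_box(mask, margin):
    """Boolean mask of the foreground bounding box grown by `margin`.

    mask: boolean array of shape (Z, Y, X).
    margin: voxels added on each side per axis, e.g. (2, my, mx).
    The box is clipped to the volume bounds.
    """
    mask = np.asarray(mask, dtype=bool)
    box = np.zeros_like(mask, dtype=bool)
    coords = np.argwhere(mask)
    if coords.size == 0:
        return box  # no foreground, empty box
    margin = np.asarray(margin)
    lo = np.maximum(coords.min(axis=0) - margin, 0)
    hi = np.minimum(coords.max(axis=0) + margin + 1, mask.shape)
    box[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]] = True
    return box
```

Calibration metrics for the bounding-box setting would then be computed only on the pixels where the returned mask is true.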

We also measured the effect of ensemble size by calculating Equation 6 for ensembles with a number of models ($M$) of 1, 2, 5, 10, 25, and 50. To provide better statistics and reduce the effect of chance in reporting the performance, for each $M$ we constructed several subsets of ensembles from the baseline models and then reported the mean NLL and Dice for that specific $M$. The number of subsets was set to 50 for the prostate and heart segmentation tasks and to 10 for brain tumor segmentation. Finally, for calculating the 95% CI and for statistical significance tests, we created 100 bootstraps by sampling 100 instances of random models and test examples with replacement. For each bootstrap, the calibration metrics and Dice scores were calculated for the baselines and for the models calibrated with ensembling.
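The bootstrap idea can be illustrated with a simplified percentile bootstrap over per-case scores. This sketch is an assumption-laden simplification: the function name is hypothetical, it resamples only a one-dimensional array of scores (the procedure described above also resamples models), and it reports a confidence interval for the mean.

```python
import numpy as np

def bootstrap_ci(values, n_boot=100, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of `values`.

    Returns (mean, lower, upper) for a (1 - alpha) interval.
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement and record the mean of each resample.
    means = np.array([
        rng.choice(values, size=values.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    lower, upper = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(values.mean()), float(lower), float(upper)
```

For example, `bootstrap_ci(dice_scores)` with `n_boot=100` mirrors the number of bootstraps used here, where `dice_scores` is a hypothetical array of per-volume Dice values.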

VI-D Segment-level Predictive Uncertainty

For each of the segmentation problems, we calculated the volume-level confidence for each of the foreground labels (Equation V-D) and compared it against the Dice score. For prostate segmentation, we are also interested in the difference between the PROSTATEx test set (which comes from the source domain) and the PROMISE12 set (which can be considered a target domain).

VII Results

Dice: average Dice similarity score (95% CI). HD: average Hausdorff distance, 95th percentile (95% CI).

| Organ (Model) | Dice, Seg. #1* | Dice, Seg. #2* | Dice, Seg. #3* | HD, Seg. #1* | HD, Seg. #2* | HD, Seg. #3* |
|---|---|---|---|---|---|---|
| Brain (CE) | 0.45 (0.11-0.85) | 0.51 (0.14-0.82) | 0.65 (0.19-0.87) | 52.28 (5.00-99.34) | 48.87 (6.71-80.50) | 50.44 (3.00-96.34) |
| Brain (Dice) | 0.53 (0.12-0.89) | 0.64 (0.20-0.90) | 0.72 (0.29-0.91) | 38.78 (4.00-92.03) | 36.12 (3.00-77.31) | 36.33 (2.00-93.94) |
| Brain (Ensemble) | 0.61 (0.15-0.94) | 0.72 (0.25-0.92) | 0.79 (0.48-0.92) | 16.20 (2.45-79.13) | 19.45 (2.00-64.50) | 26.49 (2.00-92.98) |
| Heart (CE) | 0.79 (0.46-0.91) | 0.74 (0.55-0.86) | 0.92 (0.78-0.97) | 23.49 (7.21-117.66) | 18.34 (4.00-126.00) | 21.74 (2.00-151.91) |
| Heart (Dice) | 0.86 (0.59-0.96) | 0.82 (0.64-0.90) | 0.93 (0.81-0.97) | 13.80 (2.00-51.49) | 9.31 (2.00-69.91) | 12.59 (2-120.40) |
| Heart (Ensemble) | 0.90 (0.78-0.96) | 0.85 (0.71-0.91) | 0.95 (0.89-0.98) | 9.44 (2.00-26.40) | 4.43 (2.00-11.83) | 5.02 (2.00-19.01) |
| Prostate (CE) | 0.83 (0.63-0.91) | | | 11.77 (5.00-25.67) | | |
| Prostate (Dice) | 0.88 (0.73-0.93) | | | 8.26 (3.64-20.30) | | |
| Prostate (Ensemble) | 0.90 (0.76-0.95) | | | 5.73 (3.16-18.72) | | |
  • Similar to the results of Table II, predictions whose Dice score fell below the failure threshold were not included in this table. The failure rates are provided in Table II.

  • For brain application structures, #1, #2, and #3 correspond to non-enhancing tumor, enhancing tumor, and edema, respectively. For heart application structures, #1, #2, and #3 correspond to the left ventricle, the endocardium, and the right ventricle, respectively. For prostate application structure, #1 corresponds to the prostate gland.

Table III: Segmentation quality of baselines trained with Dice loss compared with baselines trained with cross entropy and with models calibrated with ensembling (M=50).

Table II compares the averages and 95% CI values of NLL, Brier score, and ECE% for the whole volume and for the bounding boxes around the segments. Predictions whose Dice score fell below a fixed threshold were considered segmentation failures and were excluded from the statistics; the failure rates are provided in Table II. For all three segmentation tasks and all seven foreground labels, calibration quality in terms of NLL and ECE% was significantly better for models trained with cross entropy (CE) than for those trained with Dice loss. However, the direction of change in Brier score was not consistent between models trained with CE and models trained with Dice loss. For the bounding boxes of brain tumor and prostate segmentation, the Brier scores were significantly better for models trained with Dice loss than for those trained with CE, whereas for heart segmentation the opposite held. The ensemble models show significantly better calibration for all metrics across all tasks.
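For readers who want to reproduce this kind of comparison, the three calibration metrics can be sketched for the binary (foreground versus background) case as follows. This is an illustrative simplification and may differ from the multi-class definitions used in the paper; the choice of ten equal-width bins and the function names are assumptions.

```python
import numpy as np

def nll(p, y, eps=1e-7):
    """Mean negative log-likelihood of binary labels y under probabilities p."""
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def brier(p, y):
    """Mean squared difference between predicted probability and label."""
    return float(np.mean((p - y) ** 2))

def ece_percent(p, y, n_bins=10):
    """Expected calibration error in percent, with equal-width confidence bins."""
    pred = (p >= 0.5).astype(int)
    conf = np.where(pred == 1, p, 1.0 - p)  # confidence of the predicted class
    correct = (pred == y).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bins = np.digitize(conf, edges[1:-1])  # bin index in 0 .. n_bins-1
    ece = 0.0
    for b in range(n_bins):
        in_bin = bins == b
        if in_bin.any():
            # |accuracy - mean confidence|, weighted by the fraction of pixels in the bin
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap
    return 100.0 * ece
```

All three functions take flat arrays of per-pixel foreground probabilities and binary labels, so they can be applied either to a whole volume or to the pixels inside a bounding box.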

Figure 2: Top row: calibration quality in terms of NLL as the number of models increases for the prostate, the heart, and the brain tumor segmentation. The red line shows the mean and the shaded area the 95% CI of the NLL. The NLL of the model trained with cross entropy is also given as a gray dot. For all three tasks, an ensemble of size M=10 outperforms the model trained with cross entropy in terms of NLL. Middle and bottom rows: a qualitative example for prostate segmentation shows T2-weighted MRI images of the prostate gland at different positions of the axial plane (apex, mid-gland, and base) overlaid with color-coded prostate probability maps for two different ensemble models. The overlaid probability maps in the middle and bottom rows show the results of inference for ensembles of sizes 2 and 50, respectively. The white line shows the ground truth boundary of the prostate. The improvement in segmentation and uncertainty estimation from the smaller to the larger ensemble is visible.

Table III compares the averages and 95% CI values of Dice coefficients of foreground segments for baselines trained with cross entropy loss, Dice loss, and baselines calibrated with ensembling (M=50). For all tasks across all segments, baselines trained with Dice loss outperform those trained with CE loss, and ensemble models outperform both baselines.

Figure 2 shows the improvement in the quality of calibration and segmentation as a function of the number of models in the ensemble, M. For the prostate, the heart, and the brain tumor segmentation tasks, even an ensemble of five models (M=5) reduces the NLL.

Figure 3 visually compares the baselines trained with cross entropy and with Dice loss against the models calibrated with ensembling, using representative examples from the three segmentation tasks. For each prediction map, a reliability diagram over the whole volume is provided.

Panels (left to right): Ground Truth, Baseline (cross entropy), Baseline (Dice loss), Ensemble (M=50)
Figure 3: Examples of uncertainty estimation quality of baselines trained with cross entropy loss, Dice loss, and models calibrated with ensembling. Calibration has been applied only to models trained with Dice loss. MRI images are overlaid with class probabilities, and reliability diagrams (together with ECE%, NLL, and Brier score) are given for that specific volume. In the reliability diagrams only the bins with greater than 1000 samples are shown. Top, middle, and bottom rows show the results for non-enhancing tumor, the right ventricle, and the prostate gland segmentation tasks, respectively. The color bar for the class probability values is given in Figure 1.

Figure 4 provides scatter plots of the Dice coefficient vs. the proposed segment-level predictive uncertainty metric (Equation V-D). For better visualization, Dice values were logit transformed as in [niethammer2017active]. In all three segmentation tasks, we observed a strong correlation between the logit of the Dice coefficient and the average entropy over the predicted segment (correlation coefficients and p-values are given in Figure 4). For the prostate segmentation task, the test cases from the source domain (PROSTATEx dataset) and those from the target domain (PROMISE12) form visibly separate clusters. Investigation of individual cases reveals that most of the poorly segmented cases, which the metric correctly flagged, can be considered out-of-distribution examples because they were imaged with endorectal coils.
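The analysis behind Figure 4 can be sketched as follows. The sketch assumes a binary foreground probability map, per-pixel binary entropy in nats, and a predicted segment defined by thresholding at 0.5; these choices and the function names are illustrative and may differ from the exact definition in Equation V-D.

```python
import numpy as np

def segment_mean_entropy(p, eps=1e-7):
    """Mean binary entropy (nats) over the pixels predicted as foreground."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    entropy = -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))
    segment = p >= 0.5  # assumed definition of the predicted segment
    return float(entropy[segment].mean()) if segment.any() else float("nan")

def dice(pred, truth):
    """Dice coefficient between two binary masks."""
    pred, truth = np.asarray(pred, dtype=bool), np.asarray(truth, dtype=bool)
    denom = pred.sum() + truth.sum()
    return 2.0 * np.logical_and(pred, truth).sum() / denom if denom else 1.0

def logit(x, eps=1e-6):
    """Logit transform, clipped so that Dice values of 0 or 1 stay finite."""
    x = np.clip(x, eps, 1.0 - eps)
    return np.log(x / (1.0 - x))

def pearson_r(a, b):
    """Pearson correlation coefficient between two 1-D sequences."""
    return float(np.corrcoef(a, b)[0, 1])
```

Given one mean-entropy value and one Dice value per test volume, `pearson_r(entropies, logit(np.array(dices)))` gives the kind of correlation reported in the scatter plots; a negative value means that higher entropy goes with lower Dice.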

Panels: (A) Prostate Segmentation, (B) Brain Tumor Segmentation, (C) Cardiac Segmentation
Figure 4: Segment-level predictive uncertainty estimation. Top row: scatter plots and linear regression between the Dice coefficient and the average entropy over the predicted segment. For each of the regression plots, Pearson's correlation coefficient (r) and the 2-tailed p-value for testing non-correlation are provided. Dice coefficients are logit transformed before plotting and regression analysis. For the majority of the cases in all three segmentation tasks, the average entropy correlates well with the Dice coefficient, meaning that it can be used as a reliable metric for predicting the segmentation quality at test time. Higher entropy means less confidence in the predictions and more inaccurate classifications, leading to poorer Dice coefficients. However, in all three tasks there are a few cases that can be considered outliers. (A) For prostate segmentation, samples are marked by their domain: PROSTATEx (source domain) and the multi-device, multi-institutional PROMISE12 dataset (target domain). As expected, on average, the source domain performs much better than the target domain, meaning that average entropy can be used to flag out-of-distribution samples. The two bottom rows correspond to two cases from the PROMISE12 dataset that are marked in (A): Case I and Case II. These show the prostate T2-weighted MRI at different locations of the same patient, with overlaid calibrated class probabilities (confidences) and histograms depicting the distribution of probabilities over the segmented regions. The white boundary overlaid on the prostate denotes the ground truth. The wider probability distribution in Case II is associated with a higher average entropy, which correlates with a lower Dice score. Case I was imaged with a phased-array coil (the same as the images that were used for training the models), while Case II was imaged with an endorectal coil (an out-of-distribution case in terms of imaging parameters). The samples in the scatter plots in (B) and (C) are marked by their associated foreground segments. The color bar for the class probability values is given in Figure 1.

VIII Discussion

Through extensive experiments, we have rigorously assessed uncertainty estimation for medical image segmentation with FCNs. Furthermore, we proposed ensembling for confidence calibration of FCNs trained with Dice loss. We performed these assessments on three common medical image segmentation tasks to support the generalizability of the findings. The results consistently show that cross entropy loss yields better uncertainty estimates than Dice loss in terms of NLL and ECE%, but falls short in segmentation quality. We then showed that ensembling notably calibrates the confidence of models trained with Dice loss. Importantly, in addition to reducing NLL, ensembling also improved segmentation accuracy in terms of Dice coefficient, which is consistent with the results of previous studies [kuijf2019standardized]. Our results comparing cross entropy with Dice loss are in line with those of Sander et al. [sander2019towards].

Importantly, we demonstrated the feasibility of constructing metrics that can capture the predictive uncertainty of individual segments. We showed that the average entropy of segments can predict the quality of the segmentation in terms of Dice coefficient. Preliminary results suggest that calibrated FCNs have the potential to detect out-of-distribution samples. Specifically, for prostate segmentation the ensemble correctly predicted the cases where it failed due to differences in imaging parameters (such as different imaging coils). However, this is an early attempt to capture segment-level segmentation quality, and the results need to be interpreted with caution. The proposed metric can be improved by adding prior knowledge about the labels. Furthermore, the proposed metric does not encompass any information on the number of samples used in the estimation.

As introduced in the methods section, some loss functions are “proper scoring rules”, a desirable quality that promotes well-calibrated probabilistic predictions. The Deep Ensembles method requires the baseline loss function to be a proper scoring rule [lakshminarayanan2017simple]. The question arises: “Is the Dice loss a proper scoring rule?” Here, we argue that there is a fundamental mismatch in the potential usage of the Dice loss for scoring rules. Scoring rules are functions that compare a probabilistic prediction with an outcome. In the context of binary segmentation, an outcome corresponds to a binary vector of length n, where n is the number of pixels. The difficulty with using scoring rules here is that the corresponding probabilistic prediction is a distribution on binary vectors. However, the predictions made by deep segmenters are collections of n label probabilities. This is in distinction to distributions on binary vectors, which are more complex; in general they are probability mass functions with 2^n parameters, one for each of the 2^n possible outcomes (the number of possible binary segmentations). The essential problem is that deep segmenters do not predict distributions on outcomes (binary vectors). One potential workaround is to say that the network does predict the required distributions, by constructing them as the product of the marginal distributions. This, though, has the problem that the predicted distributions will not be similar to the more general data distributions, so in that sense, they are bound to be poor predictions.
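The product-of-marginals workaround can be written out explicitly. The notation below is introduced here for illustration only: n is the number of pixels, p_i is the predicted foreground probability of pixel i, and y is a binary segmentation.

```latex
\hat{P}(y) \;=\; \prod_{i=1}^{n} p_i^{\,y_i}\,(1 - p_i)^{\,1 - y_i},
\qquad y \in \{0,1\}^n ,
\qquad
-\log \hat{P}(y) \;=\; -\sum_{i=1}^{n} \bigl[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\bigr].
```

This family is described by only n numbers, whereas a general distribution on binary vectors of length n assigns a probability to each of the 2^n outcomes. Under this construction the logarithmic score of a segmentation reduces to the sum of per-pixel cross entropy terms, which treats the pixels as independent and therefore cannot represent correlations between neighboring labels.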

We used segmentation tasks in the brain, the heart and the prostate to assess uncertainty estimation. Although each of these tasks was performed on MRI images, there were subtle differences between them. The brain segmentation task was performed on three channel input (T1 contrast enhanced, FLAIR, and T2) while the other two were performed on single channel input (T2 for prostate and Cine images for heart). Moreover, the number of training samples, the size of the target segments, and the homogeneity of samples were different in each task. Only publicly available datasets were used in this study to allow others to easily reproduce these experiments and results. The ground truth was created by experts and independent test sets were used for all experiments. For prostate gland segmentation and brain tumor segmentation tasks, we used multi-scanner, multi-institution test sets. For all three tasks, boundaries of the target segments were commonly identified as areas of high uncertainty estimate.

Our focus was not on achieving state-of-the-art results on the three mentioned segmentation tasks, but on using these to understand and improve the uncertainty prediction capabilities of FCNs. Since we performed several rounds of training with different loss functions, we limited the number of parameters in the models in order to speed up each training round; we carried out experiments with 2D CNNs (not 3D), used fewer convolutional filters in our baseline compared to the original U-Net, and performed limited (not exhaustive) hyperparameter tuning to allow reasonable convergence.

Although MC dropout has been applied in many uncertainty estimation studies, we chose to not include it in this study as MC dropout requires modification of the network architecture by adding dropout layers to specific locations [kendall2015bayesian]. Moreover, batch normalization removes the need for dropout in many applications [ioffe2015batch].

Further work needs to be carried out to establish the effect of loss function on confidence calibration for deep FCNs. In this study, we only focused on Dice loss and cross entropy loss functions. It would be interesting to investigate the calibration and segmentation quality of other loss functions such as combinations of Dice loss and cross entropy loss, as well as the recently proposed Lovász-Softmax loss [berman2018lovasz].

There remains a need to study calibration methods that, unlike ensembling, do not require time-consuming training from scratch. In this work we only investigated uncertainty estimation for MR images. Although parameter changes occur more often in MRI than in computed tomography (CT), it would still be very interesting to study uncertainty estimation in CT images. Parameter changes in CT can also be a source of failure in CNNs. For instance, changes in slice thickness or the use of contrast can result in failed predictions, and it is highly desirable to predict such failures through model confidence. We believe that our research will serve as a base for future studies on uncertainty estimation and confidence calibration for medical image segmentation.

Further study of the sources of uncertainty in medical image segmentation is needed. Uncertainty has been classified as aleatoric or epistemic in medical applications [indrayan2012medical] and Bayesian modeling [kendall2017uncertainties]. Aleatoric uncertainty exists due to noise or the stochastic behavior of a system. In contrast, epistemic uncertainty is rooted in limited knowledge about the model or the data. In this study, we consistently observed higher levels of uncertainty at specific locations such as boundaries. For example, in the prostate segmentation task, raters often show higher inter- and intra-rater disagreement in delineating the base and apex of the prostate than in delineating the mid-gland boundaries [litjens2014evaluation]. Such disagreements can leave their traces on models that are trained using ground truth labels with such discrepancies. It has been shown that, with enough training data from multiple raters, deep models are able to surpass human agreement on segmentation tasks [litjens2017survey]. However, not much work has been done on the correlation between ground truth quality and the model uncertainty that results from rater disagreements.

We conclude that model ensembling can be used successfully for confidence calibration of FCNs trained with Dice loss. Also, the proposed average entropy metric can be used as an effective predictive metric for estimating the performance of the model at test time, when the ground truth is unknown.

