Direct Uncertainty Prediction with Applications to Healthcare
Abstract
Large labeled datasets for supervised learning are frequently constructed by assigning each instance to multiple human evaluators, and this leads to disagreement in the labels associated with a single instance. Here we consider the question of predicting the level of disagreement for a given instance, and we find an interesting phenomenon: direct prediction of uncertainty performs better than the two-step process of training a classifier and then using the classifier outputs to derive an uncertainty. We show stronger performance for predicting disagreement via this direct method both in a synthetic setting whose parameters we can fully control, and in a paradigmatic healthcare application involving multiple labels assigned by medical domain experts. We further show implications for allocating additional labeling effort toward instances with the greatest levels of predicted disagreement.
1 Introduction
In the practice of machine learning, a crucial role is played by carefully curated sets of labels in the datasets used to develop classification algorithms. Such labels are often collected by assigning instances of the classification problem to multiple human evaluators, resulting in several labels per instance, which are later aggregated [14, 19]. Naturally, this process often involves disagreement in the labels assigned to a single instance.
This problem is greatly exacerbated in domains where labellers are specialists (highly trained human experts), as in the case of healthcare [5, 11]. Labels can be extremely time-consuming and expensive to collect, but the issue of label disagreement remains fundamental [2]. A key point in this setting is that because the disagreement arises from human judgment and bias, the instance itself contains features that may give rise to an intrinsic level of disagreement in human evaluation, separate from uniform noise [13] or noise dependent solely on ground truth labels [15].
This motivates the idea of trying to predict which instances are likely to be the source of greatest label disagreement. Accurate prediction of such instances could be a significant source of help in reasoning about curated labels for gold-standard evaluation sets, typically used to benchmark models trained with noisy data in sensitive domains [8]. It may also provide a means of understanding how to allocate additional effort in labeling instances of varying predicted disagreement. In particular, we are interested in the setting of learning to predict the label uncertainty for an instance from a set of noisy labels that we observe and can use as prediction targets. Additionally, in practice, each instance is likely to have very few noisy labels available, typically only two or three. So we are interested not only in predicting uncertainty, but in doing so in a scarce-label setting.
Our central finding is that in these circumstances, we observe an interesting phenomenon: direct prediction of uncertainty performs better than the two-step process of (i) training a classifier and (ii) using the classifier outputs to derive an uncertainty. The basic contrast is as shown in Figure 1: rather than estimating uncertainty as a post-processing step applied to a classifier for the labeling problem, we can do better by learning the uncertainty directly as a predicate of the instance.
We demonstrate this phenomenon in two ways, both through a formal framework and in the context of an application, and in both cases we show that direct prediction of variation can lead to better performance. Since the canonical setting for this question is one in which there is disagreement over the assignment of a label to an instance, we first consider an abstract setting in which points are drawn from a mixture of distributions. The standard goal in this setting would be to label each point with the distribution in the mixture that produced it; the disagreement problem that we consider here corresponds to predicting the uncertainty in the label assigned to the point. We show that for general settings of this problem, we can achieve higher performance on this disagreement task by predicting disagreement as a quantity directly, rather than first estimating a distribution over labels. The framework is clean enough that we are able to suggest general insights into the contrast between these approaches.
We then turn to a paradigmatic and complex application of these ideas, the labeling of medical images. Recent applications of machine learning have followed a structure in which multiple medical experts provide labels to individual instances [5], and thus it becomes important to understand whether the direct prediction of disagreement among these medical experts might reveal label variation more effectively than methods that operate through the classification problem. We find that the phenomenon holds here as well, and we obtain performance improvements from methods to predict disagreement directly from image data. We evaluate our methods both in terms of the underlying disagreement, and in terms of adjudicated labels that arise from explicit interaction among labelers. Finally, we consider how these estimates of disagreement might help guide the allocation of labeling effort, by highlighting instances where additional labels are needed.
2 Direct Uncertainty Prediction
The core problem in our setting, directly motivated both at a conceptual level and by applications of machine learning to healthcare [5, 11, 8, 16], can be described as follows. We have a dataset of datapoints, each of which belongs to some ground truth class. Unfortunately, we have no way to determine this ground truth label, and only have access to a few (often only two, maybe three) noisy labels for each datapoint. We can train a classifier on this dataset, either by first aggregating the labels for each datapoint to get a typical one-hot class label [5], or by creating an empirical probability distribution over classes (a histogram) as the label. But accurate model evaluation becomes critical, and especially so in sensitive domains.
Typically, model evaluation is done on a very small test set, on which multiple expert labels are painstakingly collected [11]. The best possible labels are those arising from an adjudication procedure [8], where experts first discuss the instance and come to a consensus on the appropriate label.
However, these test sets are extremely time-consuming and expensive to assemble, and often constitute a bottleneck in the overall process. We address this issue by using model uncertainty predictions to guide the choice of datapoints for additional labels. In particular, we develop and train models to perform direct uncertainty prediction on the noisily labelled dataset. Each datapoint in this dataset is assigned a binary agree/disagree label, indicating whether the noisy labels all agree or there is at least one disagreement. We can then use this disagreement model to predict whether unseen instances in our evaluation set will cause disagreement.
A natural baseline against which to compare our direct uncertainty prediction model is to instead (i) train a classifier on the noisily labelled dataset and then (ii) use the probability outputs of the classifier to predict the probability of disagreement. In particular, if we have $k$ classes, and our classifier outputs empirical probabilities $\hat{p}_1, \ldots, \hat{p}_k$ for instance $x$, then the predicted probability of disagreement can be computed as
$$P(D_x) = 1 - \sum_{i=1}^{k} \hat{p}_i^2 \qquad (1)$$
(in words, the probability that two draws from this distribution are not equal), where $D_x$ is the event that experts will disagree on $x$.
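As a small illustration of the classifier-based baseline, the sketch below computes the probability that two independent draws from a predicted class distribution differ. The function name and the example probability vectors are hypothetical and chosen only for illustration; they are not outputs of the models in this paper.

```python
def disagreement_probability(probs):
    """Probability that two independent draws from `probs` are not equal."""
    total = sum(probs)
    if abs(total - 1.0) > 1e-6:
        raise ValueError("class probabilities must sum to 1")
    # P(two draws agree) is the sum of squared class probabilities.
    return 1.0 - sum(p * p for p in probs)


# Hypothetical classifier outputs over five classes.
confident = disagreement_probability([0.9, 0.05, 0.03, 0.01, 0.01])
uncertain = disagreement_probability([0.2, 0.2, 0.2, 0.2, 0.2])
print(round(confident, 4), round(uncertain, 4))
```

A peaked distribution yields a low predicted disagreement, while a flat one yields a high value, which is the only signal the two-step baseline can use.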
We find that direct uncertainty prediction performs comprehensively better in numerous settings (Tables 2, 3, 4) than first training a classifier and then predicting uncertainty. In particular, our results suggest the following underlying phenomena:
When there is true signal in the data on difficulty, and labels are scarce, directly predicting uncertainty leads to better performance than the two-step process of (i) first learning a classifier and (ii) using classifier outputs to determine uncertainty.
Model Type  Two Labels  Three Labels 

Uncertainty via Classification  
Direct Uncertainty Prediction 
We first demonstrate this in a synthetic setting, where we can closely examine the underlying parameters of the generative model, before moving on to our main application in the next section. We assume our datagenerating process consists of a uniform mixture of multidimensional Gaussian distributions . For any datapoint sampled from , we get a natural noise distribution over possible classes, which we use to generate multiple labels for each with:
Each of the Gaussians has a covariance matrix of one of two forms, corresponding to high and low variance respectively. The means of the Gaussians are also drawn from a distribution so that the means corresponding to the low variance Gaussians form one cluster, and those corresponding to the high variance Gaussians form another cluster. (See Appendix for full details.) The result is that the data has meaningful signal about whether it is likely to be high or low variance.
Next, we train two types of models on this data: a disagreement model that directly predicts whether a datapoint will have labels that agree or disagree, and a classification model that predicts the empirical histogram of grades for each datapoint. We then evaluate both models on a test set: for unseen examples, the disagreement model outputs the probability of disagreement, and the predicted distribution of the classification model is used as in (1) to predict a disagreement probability. We then compute an AUC on these outputs. Our results, shown in Table 1, demonstrate that direct uncertainty prediction performs significantly better than uncertainty prediction via classification.
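To make the evaluation step concrete, the following sketch computes an AUC by direct pairwise comparison: the fraction of (disagree, agree) pairs in which the disagreeing example receives the higher score, with ties counted as one half. The scores and labels are made-up toy values, not results from our experiments, and the quadratic-time loop is chosen for clarity rather than efficiency.

```python
def auc(scores, labels):
    """Pairwise AUC: chance that a positive outscores a negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy held-out data (hypothetical): 1 = labels disagreed, 0 = labels agreed.
disagreed = [1, 0, 1, 0, 0, 1]
direct_scores = [0.9, 0.2, 0.7, 0.4, 0.1, 0.6]   # outputs of a disagreement model
derived_scores = [0.5, 0.3, 0.4, 0.6, 0.2, 0.7]  # disagreement derived from a classifier

print(auc(direct_scores, disagreed), auc(derived_scores, disagreed))
```

Both kinds of model are scored with the same function, so the comparison depends only on how well each ranking separates agreeing from disagreeing examples.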
3 Related Work
The challenge posed by noisy labels is a longstanding one, and prior work has put forward several approaches to address some of these issues. Under the assumption that the noise distribution is conditionally independent of the data instance given the true label, [10, 15, 12] provide theoretical analysis along with algorithms to denoise the labels as training progresses. However, the conditional independence assumption does not hold in our setting (Section 4, Figure 2(c)). Other work relaxes this assumption by defining a domain-specific generative model for how noise arises [9, 21, 17], with some methods using additional clean data to pretrain models to form a good prior for learning. Modeling uncertainty in the context of noisy data has also been looked at through Bayesian techniques [7], and (for different models) in the context of crowdsourcing by [20, 18]. A related line of work [3, 19] has studied per-labeler error rates, which also requires the additional information of labeler ids, an assumption we relax. Most related is [4], where a multiheaded neural network is used to model different labelers. Surprisingly, however, the best model there is independent of image features, which are the source of signal in our experiments.
4 Medical Problem Setting
The key motivation and application for our method is the medical domain, where labelling is expensive, and labels are often noisy and subjective [1]. It is also critical to have at least a small, good evaluation set [8]. Our dataset consists of retinal fundus images [5], large (587 by 587) images of the interior of the eye. These can be used to diagnose a variety of eye diseases, such as diabetic retinopathy, diabetic macular edema, age-related macular degeneration, and more. Here we focus on diabetic retinopathy (DR), which is graded according to a five-class scale: grade 1 corresponds to no DR, 2 to mild DR, 3 to moderate DR, 4 to severe DR and 5 to proliferative DR [2]. A key threshold is at grade 3: grades 3 and above are referable DR (requiring specialist attention), and grades 1 and 2 are nonreferable DR. Clinically, the most costly mistake is falsely diagnosing a referable DR patient with nonreferable DR.



Similar to Section 2 we have access to a large dataset , with each instance in typically having only one label, and a small fraction having two or more labels. (Figure 2(a)). The labels in are also highly noisy: as shown in Figure 2(b), if we sample labels, less than of the images have data with no disagreement, and have labels with a significant disagreement ( or more grades apart.)
We also have a small evaluation set with many more labels per image (on average ), made by specialists, who have a much higher agreement rate, around . Crucially, images in also have a single adjudicated label [8], determined via discussion by a group of experts. Treating the adjudicated grade as our ground truth, we see (Figure 2(c)) that the distribution of the noisy labels depends on both the true label (adjudicated grade) and the image instance.
The DR grades can be interpreted either as a continuous progression — from mild (grade 1) to proliferative (grade 5) — or as categorical classes as each grade has specific features associated with it, e.g. grade always indicates microaneurysms, while a grade of can refer to lesions or laser scars (from earlier treatments) [2]. Therefore, for direct uncertainty prediction, we look at models trained to predict a thresholded variance (in line with the continuous interpretation) as well as models trained to predict disagreement (in line with the categorical interpretation). See Appendix for further details. In the subsequent sections, we test the effectiveness of direct uncertainty prediction as follows:

We split the noisily labelled dataset into train and test sets, and train a direct uncertainty predictor (on thresholded variance or label disagreement) as well as a classifier (on histograms) on the train set. We test both on how well they can predict uncertainty (variance/disagreement) on the test set, finding that direct uncertainty prediction does much better.

We then test the effectiveness of these models on the adjudicated set. Here, we are interested in seeing how well the model’s predicted uncertainty highlights disagreements between the individual doctor labels and the adjudicated (ground truth proxy) grade. Note that all of these tests are transfer tasks, because the individual doctor labels have a different distribution on the adjudicated set than on the training data (Figure 2(a), 2(b)).
The simplest test is determining whether our uncertainty estimates are predictive of whether some aggregation of the individual doctor grades will agree with the adjudicated grade. We find that again, direct uncertainty prediction models significantly outperform uncertainty prediction via classification.

We then see if the uncertainty predictions correlate well (using Spearman’s rank correlation) with the distance between the distribution of individual doctor grades, and the adjudicated grade. Again, direct uncertainty prediction is much more helpful.

Finally, we use the uncertainty estimates of our models to determine how to budget labels, and see large improvements over a baseline of uniform budgeting.
5 First Experimental Results
We start with the first task, training models to predict uncertainty (a) directly via variance/disagreement (in line with the continuous/categorical interpretation of DR labels, Section 4, Appendix) and (b) by first training a classifier on label histograms. We do this on a train/test split of the noisily labelled dataset. Our results (percentage AUC, averaged over three runs) are shown in Table 2. Models are prefixed with their training targets, so Variance and Disagree models correspond to direct uncertainty predictions, while Histogram corresponds to classifier-based models. We evaluate the models on both predicting variance and disagreement of examples in the test split. We see that direct uncertainty models significantly outperform their classification counterparts. Some specific model details are given below:
Task  Model Type  Performance 

Variance Prediction  HistogramE2E  
Variance Prediction  HistogramPC  
Variance Prediction  VarianceE2E  
Variance Prediction  VarianceE2E2H  
Variance Prediction  VarianceP  
Variance Prediction  VariancePR  
Variance Prediction  VariancePRC  74.8% 
Variance Prediction  VarianceLR  
Variance Prediction  DisagreePC  
Disagreement Prediction  HistogramE2E  
Disagreement Prediction  HistogramPC  
Disagreement Prediction  DisagreeP  78.1% 
Disagreement Prediction  DisagreePC  78.1% 
Disagreement Prediction  DisagreeLR  
Disagreement Prediction  VariancePRC 
Baseline Model: HistogramE2E
Our baseline model is the convolutional neural network model used in [5], which is at its core an Inception-v3 model.
VarianceE2E and VarianceE2E2H
An analogous version of HistogramE2E is VarianceE2E, which is the same architecture trained end to end on variance. However, as most datapoints in the training data only have one label (Figure 2(a)), this significantly reduces the amount of data available to train the model and inspires the (P) models below. (See Appendix for details of a variant, VarianceE2E2H.)
Training from the Penultimate Layer (P)
The lack of data with multiple labels makes it challenging to train large models end to end on uncertainty, so we instead try training a smaller network from the penultimate layer: we take a pretrained model on DR classification, remove the final fully connected layer and then add a small neural network on top of this (typically with two hidden layers). This method also improves the baseline results. However, the strengthened baseline trained in this way (HistogramPC) is still beaten in performance by all corresponding direct uncertainty prediction models (VarianceP, VariancePR, VariancePRC, DisagreeP, DisagreePC).
Adding Calibration (C)
Informed by the work of [6], we try out a calibration technique, temperature scaling, based on a temperature parameter; see Appendix for details. This mostly further strengthens the baseline.
Regression head (R)
We also try adding a regression head to regularize the variance model, which gives a small improvement; more details are in the Appendix.
Do we need the Penultimate layer? Training on Logits (L)
We also evaluated the importance of training on the latent model representation (the pre-logit layer) compared to training directly on the logits of the model, which did not work nearly as well; more details and insights are in the Appendix.
In summary, despite also comparing to strengthened variants of the baseline with pre-logit fine-tuning and calibration (HistogramPC), we find that directly predicting uncertainty reliably achieves noticeably better performance (across all model comparisons and tasks).
6 Adjudicated Evaluation
We now evaluate the models trained in Section 5 on our adjudicated dataset . Each image in this dataset has many more labels (around ), and also a single adjudicated label, proposed by a group of experts after consensus via discussion. Most tasks in this section focus on determining agreement between the adjudicated grade and the distribution of individual grades. As the individual grades on have a very different distribution than (Figure 2(b), Section 4), this also demonstrates the robustness of the model predictions to data distribution shifts.
Model Type  Majority  Median  Majority  Median  Referable 

HistogramE2EVar  
HistogramE2EDisagree  
HistogramPCVar  
HistogramPCDisagree  
VariancePR  
VariancePRC  
DisagreeP  81.0%  84.6%  81.9%  86.2%  
DisagreePC  80.9%  86.2% 
We first investigate whether we can use our model uncertainty prediction outputs to identify datapoints in the adjudicated set where the aggregated individual doctor grades have minimal/maximal disagreement with the adjudicated grade. To do so, we consider two aggregation methods for the individual labels: the mode (termed the ‘majority vote’, also used in [5]), and the median. Our results are shown in Table 3. We explicitly define HistogramPCVar and HistogramPCDisagree to be strengthened baselines with outputs used to compute predicted variance and predicted disagreement respectively (similarly for HistogramE2EVar).
We also consider specific subcases of interest. First, the cases when Majority and Median are interesting: since is the grade indicating a judgment of no DR, these are instances where most of the individual specialists have missed the onset of the condition. Second, there is a natural distinction between grades 3, 4, and 5 (which are referable for further action) and grades 1 and 2 (nonreferable). We thus study this binary problem of determining whether an instance is referable. Again, we see that all direct uncertainty predictors beat the baselines (vanilla and strengthened) on all tasks.
Applying the Wasserstein distance.
The majority and median methods of aggregation are relatively insensitive to the individual grades. We thus also consider an aggregation measure that is more sensitively dependent on each individual grade, and more directly uses the continuous interpretation of the grades’ one-dimensional embedding.
For this purpose, we compute the Wasserstein distance between the distribution of all individual doctor grades and the adjudicated grade (treated as a point-mass distribution). We give background on the Wasserstein distance in the supplementary material; roughly speaking, for an underlying metric (the distance on labels in our case), it defines the distance between two distributions to be the minimum cost (under that metric) of moving the mass in the first distribution so that it matches the mass in the second.
For our purposes, we take the first distribution to be the empirical label distribution defined by the individual doctor grades for an image, and the second to be the point mass distribution corresponding to the adjudicated grade. We show
Theorem 1
If are (discrete) probability distributions and is a point mass distribution at , then is:
and apply this as is a pointmass, to obtain . We consider three distances on grades: the absolute value of the difference; an norm; and a 01 binary disagreement metric. (Further details are in the supplementary information.)
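When the target is a point mass, all of the mass in the empirical grade distribution has to be moved to the adjudicated grade, so the transport cost reduces to the average cost between each individual grade and the adjudicated grade. The sketch below illustrates this for three cost functions on grades. The function names and example grades are hypothetical, and taking the square root of the mean squared difference for the 2-Wasserstein variant is an assumption about the exact definition, which is given in the supplementary information.

```python
import math
from collections import Counter


def point_mass_transport_cost(grades, adjudicated, cost):
    """Average cost of moving every individual grade onto the adjudicated grade."""
    counts = Counter(grades)
    n = len(grades)
    return sum((c / n) * cost(g, adjudicated) for g, c in counts.items())


def absolute_cost(g, a):
    return abs(g - a)


def squared_cost(g, a):
    return (g - a) ** 2


def binary_cost(g, a):
    return 0.0 if g == a else 1.0


# Hypothetical image: four individual grades, adjudicated grade 3.
grades, adjudicated = [1, 1, 2, 3], 3
w_abs = point_mass_transport_cost(grades, adjudicated, absolute_cost)
# Assumed 2-Wasserstein form: root of the mean squared grade difference.
w_two = math.sqrt(point_mass_transport_cost(grades, adjudicated, squared_cost))
w_bin = point_mass_transport_cost(grades, adjudicated, binary_cost)
print(w_abs, w_two, w_bin)
```

With the binary cost, the distance is simply the fraction of individual grades that differ from the adjudicated grade.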
Prediction Type  Absolute Val  2Wasserstein  Binary Disagree 

HistogramE2EVar  
HistogramE2EDisagree  
HistogramPCVar  
HistogramPCDisagree  
VariancePR  
VariancePRC  
DisagreeP  0.682  0.670  0.676 
DisagreePC  
Doctors  
Doctors  
Doctors  
Doctors  
Doctors  0.728  0.712  0.718 
Using Spearman’s rank correlation coefficient, we compare how similar the ranking of instances by the Wasserstein distance is to the ranking using our uncertainty estimates. In Table 4 we see that over all metrics used, all direct uncertainty prediction models beat classifier-based uncertainty models.
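For readers unfamiliar with the measure, Spearman’s rank correlation is the Pearson correlation between the ranks of two score lists. The sketch below is a self-contained version with average ranks for ties; the per-image distances and uncertainty scores are made-up values used only to show the mechanics.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(a, b):
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(a), average_ranks(b))


# Hypothetical per-image values: distance to adjudicated grade vs. predicted uncertainty.
distances = [0.0, 0.4, 1.25, 0.75, 0.2]
uncertainty = [0.1, 0.3, 0.9, 0.8, 0.35]
rho = spearman(distances, uncertainty)
print(round(rho, 3))
```

Because only ranks matter, the uncertainty scores do not need to be on the same scale as the distances for the comparison to be meaningful.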
There is a natural scale against which to evaluate the performance of the rankings based on our uncertainty estimates: as a baseline we can compute the variance in each instance induced by the labels of only of the doctors, for , and ask how well the rankings according to these small sets of doctor labels compare to the Wasserstein ranking under Spearman’s rank correlation. In other words, how many doctor opinions would be required to match our ranking by estimated uncertainty? This comparison gives us a clean way to interpret our models – the best direct uncertainty models are as good in this setting as having doctor labels.
Budgeting Labels.
We now consider one final task: budgeting labels. Suppose we start with a set of images with one label each, and have a budget of additional labels that we can use on the images. We would like to know which images to assign the extra grades to, so that the aggregated individual grades are as close as possible to the adjudicated grades. Intuitively, we would like to allocate extra grades to the images for which there is the most uncertainty, which we cannot get from the one grade we have so far, but for which we can use our model predictions. For our evaluation of this task, we focus on the subset of images that are referable (grade 3 or above) according to the adjudicated grade.
Model Type  Extra Labels  Extra Labels  Extra Labels  Extra Labels 

No Models (Uniformly Random)  
HistogramE2EVar  
HistogramE2EDisagree  
HistogramPCVar  
HistogramPCDisagree  
VariancePR  
VariancePRC  
DisagreeP  
DisagreePC 
We study this question by first drawing one grade for each image in the adjudicated set to be the initial grade, and then determining (either randomly or using our model rankings) a subset of images to get another grade, which is drawn randomly from the remaining grades. (We allow at most one additional grade per image, and consider the case in which the budget is less than the total number of images; it is also interesting to consider cases with a higher budget and the freedom to determine how many additional labels each image gets.) Our results (percentage accuracy) are shown in Table 5. For full details of the method (very similar to our evaluation in Table 3), see the Appendix. One challenge we face in this setting is that the number of instances in the adjudicated set with a referable adjudicated grade is small. This, along with the process of only drawing one or two grades per image, introduces a large amount of variance (an order of magnitude higher than for our other results), and so we explicitly show the variance in the performance in the table. While this makes it difficult to draw clear conclusions on the advantage of any one uncertainty model over another, it is clear that any allocation of labels based on an uncertainty ranking arising from model predictions vastly outperforms a naive uniform allocation of labels.
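The allocation step described above can be sketched as follows: rank images by predicted uncertainty, give the top images within the budget one extra grade, and draw the grades without replacement from each image's pool of individual grades. This is a minimal sketch under stated assumptions; the function names, grade pools, and uncertainty scores are hypothetical, and the aggregation and accuracy computation described in the Appendix are not reproduced.

```python
import random


def allocate_extra_labels(uncertainty, budget):
    """Indices of the `budget` images with the highest predicted uncertainty."""
    ranked = sorted(range(len(uncertainty)), key=lambda i: uncertainty[i], reverse=True)
    return set(ranked[:budget])


def draw_grades(grade_pools, chosen, rng):
    """One grade per image, plus one extra grade for each chosen image."""
    drawn = []
    for i, pool in enumerate(grade_pools):
        k = 2 if i in chosen else 1
        # Sampling without replacement: the extra grade comes from the remaining grades.
        drawn.append(rng.sample(pool, k))
    return drawn


# Hypothetical pools of individual grades and model uncertainty scores for four images.
grade_pools = [[1, 1, 2, 1], [3, 2, 3, 4], [1, 1, 1, 1], [4, 3, 5, 3]]
uncertainty = [0.1, 0.8, 0.3, 0.9]

rng = random.Random(0)
chosen = allocate_extra_labels(uncertainty, budget=2)
drawn = draw_grades(grade_pools, chosen, rng)
print(sorted(chosen), drawn)
```

The uniform baseline corresponds to replacing `allocate_extra_labels` with a random choice of the same number of images.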
7 Discussion
In this paper, we propose direct uncertainty estimation, where models are trained to predict measures of uncertainty, such as disagreement/variance, directly from noisy labels on input instances, instead of deriving an uncertainty from a classifier. This method is inspired by a scenario (commonly arising in domains such as healthcare) where we have access to very few (typically just two) noisy labels for each input instance, with the instances themselves having features indicative of disagreement when labelling. We find that direct uncertainty estimation significantly outperforms classifier-based uncertainty in this setting, both in a synthetic setting with known distributions and in a large-scale medical imaging setting. Furthermore, in the latter, direct uncertainty models transfer this significant performance improvement to a variety of different tasks in a special evaluation set. Future work might look at mathematically characterizing the conditions where direct uncertainty estimation is most helpful, or transferring these techniques to different data modalities.
Acknowledgments
We thank Varun Gulshan and Arunachalam Narayanaswamy for detailed advice and discussion on the model, and Quoc Le, Martin Wattenberg, Jonathan Krause, Lily Peng and Dale Webster for useful comments and feedback. We also thank Naama Hammel and Zahra Rastegar for helpful medical insights.
References
 [1] L. S. Abrams, I. U. Scott, G. L. Spaeth, H. A. Quigley, and R. Varma. Agreement among optometrists, ophthalmologists, and residents in evaluating the optic disc for glaucoma. Ophthalmology, 101(10):1662–1667, 1994.
 [2] American Academy of Ophthalmology. International Clinical Diabetic Retinopathy Disease Severity Scale Detailed Table.
 [3] A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, pages 20–28, 1979.
 [4] M. Y. Guan, V. Gulshan, A. M. Dai, and G. E. Hinton. Who said what: Modeling individual labelers improves classification. CoRR, abs/1703.08774, 2017.
 [5] V. Gulshan, L. Peng, M. Coram, M. C. Stumpe, D. Wu, A. Narayanaswamy, S. Venugopalan, K. Widner, T. Madams, J. Cuadros, R. Kim, R. Raman, P. Q. Nelson, J. Mega, and D. Webster. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA, 316(22):2402–2410, 2016.
 [6] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. CoRR, abs/1706.04599, 2017.
 [7] A. Kendall and Y. Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 5580–5590, 2017.
 [8] J. Krause, V. Gulshan, E. Rahimy, P. Karth, K. Widner, G. S. Corrado, L. Peng, and D. R. Webster. Grader variability and the importance of reference standards for evaluating machine learning models for diabetic retinopathy. CoRR, abs/1710.01711, 2017.
 [9] V. Mnih and G. Hinton. Learning to label aerial images from noisy data. International Conference on Machine Learning, 2012.
 [10] N. Natarajan, I. S. Dhillon, P. K. Ravikumar, and A. Tewari. Learning with noisy labels. In Advances in Neural Information Processing Systems 26, pages 1196–1204. 2013.
 [11] P. Rajpurkar, J. Irvin, K. Zhu, B. Yang, H. Mehta, T. Duan, D. Ding, A. Bagul, C. Langlotz, K. Shpanskaya, M. P. Lungren, and A. Y. Ng. CheXNet: Radiologist-level pneumonia detection on chest X-rays with deep learning. CoRR, abs/1711.05225, 2017.
 [12] S. E. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, and A. Rabinovich. Training deep neural networks on noisy labels with bootstrapping. CoRR, abs/1412.6596, 2014.
 [13] D. Rolnick, A. Veit, S. J. Belongie, and N. Shavit. Deep learning is robust to massive label noise. CoRR, abs/1705.10694, 2017.
 [14] O. Russakovsky and L. Fei-Fei. Attribute learning in large-scale datasets. In European Conference of Computer Vision (ECCV), International Workshop on Parts and Attributes, 2010.
 [15] S. Sukhbaatar, J. Bruna, M. Paluri, L. Bourdev, and R. Fergus. Training convolutional networks with noisy labels. CoRR, abs/1406.2080, 2014.
 [16] A. V. Varadarajan, R. Poplin, K. Blumer, C. Angermüller, J. Ledsam, R. Chopra, P. A. Keane, G. Corrado, L. Peng, and D. R. Webster. Deep learning for predicting refractive error from retinal fundus images. CoRR, abs/1712.07798, 2017.
 [17] A. Veit, N. Alldrin, G. Chechik, I. Krasin, A. Gupta, and S. J. Belongie. Learning from noisy large-scale datasets with minimal supervision. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6575–6583, 2017.
 [18] F. L. Wauthier and M. I. Jordan. Bayesian bias mitigation for crowdsourcing. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 1800–1808. 2011.
 [19] P. Welinder and P. Perona. Online crowdsourcing: Rating annotators and obtaining cost-effective labels. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR Workshops 2010, San Francisco, CA, USA, 13-18 June, 2010, pages 25–32, 2010.
 [20] K. Werling, A. T. Chaganty, P. S. Liang, and C. D. Manning. On-the-job learning with Bayesian decision theory. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 3465–3473. 2015.
 [21] T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang. Learning from massive noisy labeled data for image classification. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2691–2699, 2015.
Supplementary material for “Direct Uncertainty Prediction with Applications to Healthcare”
Appendix A Details on Synthetic Data
For the synthetic dataset, we generated 10000 train data points and 2000 test data points from a uniform mixture of fifteen 20-dimensional Gaussian distributions. The means of the distributions were drawn so that they formed two clusters: one cluster with variance (the high variance cluster), and the other cluster with variance (the low variance cluster). This ensured that instances drawn from the clusters encoded information on whether they were high/low disagreement datapoints. There were centers of high variance and centers of low variance. The centers of high variance were drawn from a normal distribution with mean , variance (with being the number of dimensions) around the vector and the centers of low variance were drawn with the same distribution around .
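The generative process can be sketched as below. Only the number of components (fifteen) and the dimensionality (20) are taken from the text; everything else is an assumption made for illustration: the split into 8 high-variance and 7 low-variance components, the variance values, the placement of the cluster centers, the use of isotropic covariances, and the choice to sample noisy labels from the posterior over components given the point.

```python
import math
import random


def make_mixture(rng, n_high, n_low, dim, var_high, var_low, offset):
    """(mean, variance) pairs: high-variance centers near +offset, low-variance near -offset."""
    comps = []
    for count, var, sign in ((n_high, var_high, 1.0), (n_low, var_low, -1.0)):
        for _ in range(count):
            mean = [sign * offset + rng.gauss(0.0, 1.0) for _ in range(dim)]
            comps.append((mean, var))
    return comps


def posterior(x, comps):
    """Posterior over components for x under a uniform mixture of isotropic Gaussians."""
    dim = len(x)
    logs = []
    for mean, var in comps:
        sq = sum((a - b) ** 2 for a, b in zip(x, mean))
        logs.append(-0.5 * dim * math.log(2 * math.pi * var) - sq / (2 * var))
    top = max(logs)  # subtract the max before exponentiating for numerical stability
    weights = [math.exp(l - top) for l in logs]
    total = sum(weights)
    return [w / total for w in weights]


def sample_point(rng, comps, n_labels):
    """Draw a point from a random component, then n_labels noisy labels from its posterior."""
    mean, var = rng.choice(comps)
    x = [rng.gauss(m, math.sqrt(var)) for m in mean]
    post = posterior(x, comps)
    labels = rng.choices(range(len(comps)), weights=post, k=n_labels)
    return x, labels, post


rng = random.Random(0)
# All numeric settings below except dim=20 and 15 components are assumed values.
comps = make_mixture(rng, n_high=8, n_low=7, dim=20, var_high=4.0, var_low=0.25, offset=2.0)
x, labels, post = sample_point(rng, comps, n_labels=2)
print(labels, labels[0] != labels[1])
```

Under these assumptions, points from wide components have flatter posteriors, so their two sampled labels disagree more often, which is the signal a disagreement model can learn from the position of the point.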
Appendix B Details on setting up Uncertainty Prediction
As described in Section 4, we first form a train/test split on , and train both direct uncertainty prediction models and a classifier on . As DR grades can be interpreted both as a continuous progression and as categorical classes (Section 4), we train direct uncertainty models to predict both (thresholded) variance as well as binary disagreement.
To get a binary disagreement label, we look at the multiple noisy labels assigned to an image $x_i$, and output $0$ if they all agree, or $1$ if they disagree.
We also form a thresholded variance label as follows: we first compute the empirical variance of the image, resulting from the observed noisy labels. If the empirical label histogram has probabilities $p_j$ for the different grades $j$, we get an empirical variance of $\sum_j p_j j^2 - \left(\sum_j p_j j\right)^2$. We can then threshold this on a value to get low variance/high variance labels. We choose the threshold value to be approximately $2/9$. This is the variance value resulting from a single grade disagreement for an image with three grades: if two graders assign grade $g$ and one assigns $g + 1$, the variance is $\tfrac{2}{3} \cdot \tfrac{1}{9} + \tfrac{1}{3} \cdot \tfrac{4}{9} = \tfrac{2}{9}$. Values above this would have a difference of at least $2$ grades, which suggests a high variance.
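The two target labels can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the 0/1 encoding of agreement and disagreement is assumed, and the default threshold of 2/9 is the variance obtained when two of three graders agree and the third differs by one grade. Exact fractions are used so that the comparison against the threshold is not affected by floating-point rounding.

```python
from collections import Counter
from fractions import Fraction


def disagreement_label(grades):
    """Return 0 if all graders assigned the same grade, else 1 (assumed encoding)."""
    return 0 if len(set(grades)) == 1 else 1


def empirical_variance(grades):
    """Variance of the empirical label histogram: sum_j p_j j^2 - (sum_j p_j j)^2."""
    n = len(grades)
    histogram = Counter(grades)
    mean = sum(Fraction(count, n) * g for g, count in histogram.items())
    second_moment = sum(Fraction(count, n) * g * g for g, count in histogram.items())
    return second_moment - mean ** 2


def variance_label(grades, threshold=Fraction(2, 9)):
    """Return 1 (high variance) if the empirical variance is strictly above the threshold."""
    return int(empirical_variance(grades) > threshold)
```

With three graders, a single one-grade disagreement such as `[2, 2, 3]` sits exactly at the threshold and is labelled low variance, while `[1, 2, 3]` or `[0, 0, 2]` exceed it. Treating the threshold as a strict inequality is an assumption of this sketch.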
Both of these uncertainty labels also have clear baselines using the outputs of the classifier: if the classifier probability output for class $j$ is $\hat{p}_j$, we can estimate the predicted variance with $\sum_j \hat{p}_j j^2 - \left(\sum_j \hat{p}_j j\right)^2$, and the probability of disagreement with $1 - \sum_j \hat{p}_j^2$.
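A short sketch of these two classifier-derived baselines is given below. It assumes the classifier output is a list of class probabilities indexed by grade starting at 0 (the variance does not depend on this offset), and it takes the probability of disagreement to be the chance that two independent draws from the predicted distribution differ. The function names are hypothetical.

```python
def predicted_variance(probs):
    """Variance of the grade under the classifier's predicted distribution."""
    mean = sum(p * grade for grade, p in enumerate(probs))
    second_moment = sum(p * grade * grade for grade, p in enumerate(probs))
    return second_moment - mean ** 2


def predicted_disagreement(probs):
    """Probability that two independent draws from probs land on different classes."""
    return 1.0 - sum(p * p for p in probs)
```

Both quantities are zero for a one-hot prediction and grow as the predicted distribution spreads over more grades. The variance also grows with the distance between the grades that receive mass, which the disagreement probability ignores.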
Appendix C Additional Model Details
VarianceE2E We try a variant of VarianceE2E, VarianceE2E2H, which has one head for predicting variance and the other for classifying, to enable usage of all the data. We then evaluate the variance head on , but in fact notice a small drop in performance.
Details on Calibration Via Temperature Scaling We set the predictions of the model to be $f(z / T)$, where $f$ is the softmax function, applied pointwise, and $z$ are the logits. We initialize $T$ to $1$ and split the training data into two parts. We train as normal on the first part, with $T$ fixed at $1$, and then train on the second part, varying only the temperature $T$ and holding all other parameters fixed.
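The second stage, in which only the temperature is trained, can be sketched as below. This is a simplified illustration rather than the procedure used in the experiments: it assumes the logits of the already-trained model on the held-out part are available as plain lists, it minimizes the average negative log-likelihood by gradient descent, and it optimizes log T instead of T so that the temperature stays positive. The optimizer, step size, and function names are all assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, temperature):
    """Average negative log-likelihood of the labels at a given temperature."""
    losses = [-math.log(softmax(z, temperature)[y])
              for z, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, steps=500, lr=0.1):
    """Gradient descent on log T starting from T = 1; the logits stay fixed."""
    log_t = 0.0
    for _ in range(steps):
        t = math.exp(log_t)
        grad = 0.0
        for z, y in zip(logit_rows, labels):
            probs = softmax(z, t)
            expected_logit = sum(p * zj for p, zj in zip(probs, z))
            # d NLL / d log T = (z_y - E_p[z]) / T
            grad += (z[y] - expected_logit) / t
        log_t -= lr * grad / len(labels)
    return math.exp(log_t)
```

For an overconfident model, for example one that assigns logits `[6.0, 0.0]` to four examples of which only three belong to class 0, the fitted temperature rises above 1 and the held-out negative log-likelihood decreases relative to T = 1.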
Regression for Variance Prediction Model We also consider a two-head variance prediction model, where one head predicts a thresholded variance value as before, while the other head does regression on the numerical variance value. We evaluate on the thresholded variance head as usual. Note that doing only regression on the variance value fails to give a good uncertainty predictor, likely due to poor scaling (a few values with very high variance distort the mean squared error).
Do we need the penultimate layer? We tested whether we could match performance by training on the model logits instead of the representation at the hidden (penultimate) layer. We found that while this improved on the baseline results, it did not match the performance from training on the latent representation. (We also controlled for parameter difference by adding additional hidden layers, but this did not lead to improvements.) This suggests that some information is lost between the pre-logit and logit layers.
Appendix D Background on the Wasserstein Distance
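Given two probability distributions $p, q$, and letting $\Pi(p, q)$ be the set of all joint probability distributions with marginals $p$, $q$, the Wasserstein distance between $p, q$ is
$$W(p, q) = \inf_{\pi \in \Pi(p, q)} \mathbb{E}_{(x, y) \sim \pi}\left[ d(x, y) \right],$$
where $d$ is some metric. This distance has connections to optimal transport, and corresponds to the cost (with respect to $d$) of moving the mass in distribution $p$ so that it matches the mass in distribution $q$ as efficiently as possible. We can represent the amount of mass to move from $x$ to $y$ with $\pi(x, y)$; to be consistent with the mass at the start, $p(x)$, and the mass at the end, $q(y)$, we must have that $\sum_y \pi(x, y) = p(x)$ and $\sum_x \pi(x, y) = q(y)$.
In our setting we let $p^{(i)}$ be the empirical probability distribution over labels defined by the individual doctor grades for datapoint $i$, and $q^{(i)}$ the point mass distribution corresponding to the adjudicated grade. We wish to compute $W(p^{(i)}, q^{(i)})$ for each $i$, which is simplified by $q^{(i)}$ being a point mass.
In the main text, we asserted the following.
Theorem. If $p, q$ are (discrete) probability distributions and $q$ is a point mass distribution at $y_0$, then $W(p, q)$ is:
$$W(p, q) = \sum_x p(x)\, d(x, y_0).$$
The proof is direct: for $y \neq y_0$, we must have $\pi(x, y) = 0$, and so $\pi(x, y_0) = p(x)$. The coupling is therefore unique, and its cost is the sum above.
Thus, in our setting, letting $a_i$ denote the adjudicated grade for datapoint $i$, we have
$$W(p^{(i)}, q^{(i)}) = \sum_g p^{(i)}(g)\, d(g, a_i).$$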
We consider three different cost metrics $d$, each emphasizing different properties:

Absolute Value $d(x, y) = |x - y|$. This follows an interpretation in which the grades are equally spaced, so that all successive grade differences have the same weight.

2-Wasserstein Distance $d(x, y) = (x - y)^2$, and, to make the result into a metric, we take the square root of the resulting transport cost.
This adds a higher penalty for larger grade differences.

Binary Disagreement We set $d(x, y) = 0$ if $x = y$ and $d(x, y) = 1$ otherwise.
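The three costs above can be computed directly, since the distance to a point mass reduces to the expected cost between an individual grade and the adjudicated grade. The sketch below assumes integer grades and uses hypothetical function and argument names; for the squared cost it returns the square root of the expected squared difference, following the 2-Wasserstein convention.

```python
from collections import Counter


def wasserstein_to_point_mass(grades, adjudicated, cost="absolute"):
    """Distance between the empirical grade distribution and a point mass.

    grades: individual doctor grades for one image.
    adjudicated: the adjudicated grade for the same image.
    cost: "absolute", "squared" (2-Wasserstein), or "binary".
    """
    n = len(grades)
    histogram = Counter(grades)
    if cost == "absolute":
        return sum(c / n * abs(g - adjudicated) for g, c in histogram.items())
    if cost == "squared":
        expected = sum(c / n * (g - adjudicated) ** 2 for g, c in histogram.items())
        return expected ** 0.5
    if cost == "binary":
        # Fraction of individual grades that differ from the adjudicated grade.
        return sum(c / n for g, c in histogram.items() if g != adjudicated)
    raise ValueError(f"unknown cost: {cost}")
```

For grades `[1, 1, 3]` with adjudicated grade 1, a third of the mass sits two grades away, so the binary cost counts only that a disagreement occurred while the absolute and squared costs also reflect its size.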
Having computed the Wasserstein distance, obtaining a real-valued level of difference between the individual values and the adjudicated grade for each instance, we look at how well the ranking of instances by this distance corresponds to a ranking using our uncertainty estimates. We do this using Spearman’s rank correlation coefficient, a standard nonparametric measure producing a number in the interval $[-1, 1]$ that represents how well the relationship between two ranked variables can be represented by a monotonic function. Specifically, we first rank the images by Wasserstein distance, and then rank again by the uncertainty prediction of different models. We then apply Spearman’s rank correlation to determine how monotonic this relationship is.
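As a rough illustration of this ranking comparison, Spearman's coefficient can be computed as the Pearson correlation of the two rank vectors. The sketch below is a plain standard-library version with hypothetical function names; it assigns tied values their average rank, which is one common convention and an assumption here, and it does not handle the degenerate case where one input is constant.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Pearson correlation between the rank vectors of xs and ys."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

Here `xs` would hold the per-image Wasserstein distances and `ys` the uncertainty scores from one model. A value near 1 means that images with larger distances also tend to receive higher predicted uncertainty.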
Appendix E Details On Budgeting Evaluation
We evaluate the budgeting results by first binarizing the individual grades into referable/non-referable. If there is only a single grade, we directly check for agreement with the binarized adjudicated grade. If there are two grades, we label the image referable if at least one of the individual grades is referable, and then again check for agreement with the binarized adjudicated grade. This is consistent with the style of evaluation used in the referable column of Table 3.
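The evaluation rule above can be sketched as follows. The cutoff used to binarize a grade is an assumption of this sketch (grade 2 or higher on a 0 to 4 scale counts as referable), and the function names are hypothetical.

```python
REFERABLE_THRESHOLD = 2  # assumed cutoff: grades at or above this are referable


def is_referable(grade, threshold=REFERABLE_THRESHOLD):
    """Binarize a single grade into referable (True) or non-referable (False)."""
    return grade >= threshold


def budgeted_decision(individual_grades, threshold=REFERABLE_THRESHOLD):
    """Referable if at least one available grade is referable.

    With a single grade this reduces to binarizing that grade directly.
    """
    return any(is_referable(g, threshold) for g in individual_grades)


def agreement_rate(cases, threshold=REFERABLE_THRESHOLD):
    """Fraction of images whose budgeted decision matches the adjudicated one.

    cases: list of (individual_grades, adjudicated_grade) pairs.
    """
    matches = sum(
        budgeted_decision(grades, threshold) == is_referable(adjudicated, threshold)
        for grades, adjudicated in cases
    )
    return matches / len(cases)
```

Each case holds the one or two grades that the budget allows for an image, so the same function covers both the single-grade and the two-grade branches of the rule.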