# Characterizing Sources of Uncertainty to Proxy Calibration and Disambiguate Annotator and Data Bias

###### Abstract

Supporting model interpretability for complex phenomena where annotators can legitimately disagree, such as emotion recognition, is a challenging machine learning task. In this work, we show that explicitly quantifying the *uncertainty* in such settings has interpretability benefits. We apply a simple modification to classical network inference, Monte Carlo dropout, to obtain measures of epistemic and aleatoric uncertainty. We identify a significant correlation between aleatoric uncertainty and human annotator disagreement. Additionally, we demonstrate how difficult and subjective training samples can be identified using aleatoric uncertainty and how epistemic uncertainty can reveal data bias that could result in unfair predictions. We identify the total uncertainty as a suitable surrogate for model calibration, i.e. the degree to which we can trust a model’s predicted confidence. In addition to explainability benefits, we observe modest performance boosts from incorporating model uncertainty.

## 1 Introduction

Supporting interpretability of an automated prediction system in complex tasks where human experts disagree is a challenging machine learning problem. In such settings, answering the following questions can help in understanding a model’s predictions: Is the model uncertain because it captures annotator biases and their subjective perspectives? Or is it error-prone for a specific set of samples due to a distribution shift from the training data? Can the predicted confidence scores of the model be trusted? Do they represent the true likelihood, so that we can intuit and reason about the model’s results?

Emotion understanding is a prime example of a learning setting where label ambiguity abounds. Most researchers agree that emotion itself is nuanced and that the same input could be assigned different labels depending on contextual information or the perspective of the reviewer [7]. Thus, disambiguating annotator and data bias and quantifying how well predictive confidence can be trusted is crucial to supporting explainability in emotion classification.

In this work, we extend beyond deterministic modeling of affect using Monte Carlo (MC) dropout [8], a technique that requires no changes to the neural network architecture and only minimal changes at inference time. This approach augments the classifier’s per-class confidence scores with measures of uncertainty. We tease apart the components of the uncertainty estimates and investigate how each helps in interpreting model predictions and failure modes. We show that this decomposition yields a proxy for inter-rater disagreement that captures annotators’ bias, and a proxy that highlights bias in the data that could result in unfair predictions.

Given that humans have intuitive understanding of probability [3], to support this intuition and provide interpretable confidence scores, calibration is required. That is, we expect predicted probability estimates to represent the true likelihood of correctness [10]. Calibration can be seen as the degree of trust in predicted confidence scores of a classifier. We further investigate the relationship between uncertainty estimates and the degree of calibration, pinpointing samples where confidence predictions are (not) to be trusted.

While this technique helps interpretability by disambiguating sources of bias and relating them to calibration, it can also boost performance. We report significant improvement in Jensen-Shannon divergence (JSD) between predicted and true class probabilities. We show a strong correlation between total uncertainty and JSD, identifying total uncertainty as a proxy for performance. We also study the influence on accuracy, especially when the model is given the option to reject classifying samples where it lacks confidence.

To summarize, we use MC dropout with traditional neural network architectures and explore the benefits of resulting measures of uncertainty while disambiguating their source. Our contributions include: 1) introducing a proxy for inter-annotator disagreement, 2) demonstrating the power of such metrics in identifying difficult samples and bias in training data along with ways to alleviate them, 3) finding a surrogate for calibration, 4) showing improvements in performance in addition to interpretability benefits.

## 2 Background & Related Work

Understanding what a model does not know is especially important to explain and understand its predictions. State-of-the-art classification results are mostly achieved by deep neural networks (DNN)—such as AlexNet, VGGNet, ResNet, etc.—that are deterministic in nature and not designed to model uncertainty. Bayesian Neural Networks (BNN) have been an alternative to DNNs, providing a distribution over model parameters at an extra computational cost while increasing difficulty of conducting inference [5, 14]. These computational challenges hinder scalability of BNNs.

MC dropout [8] has been introduced as an approximation of BNNs that can be achieved by keeping the same architecture of a deterministic DNN and only making minimal changes at inference time. Dropout, i.e. randomly dropping weights at training time, is commonly used in DNNs as a regularization method. Drawing random dropout masks at test time can approximate a BNN. Recently, [12] demonstrated ways to additionally learn the observation noise parameter $\sigma$, thus modeling epistemic and aleatoric uncertainty in parallel. Epistemic uncertainty represents lack of confidence in one’s knowledge, attributed to missing information about the learning task. Aleatoric uncertainty is attributed to the stochastic behavior of observations. They evaluated the approach in depth on regression tasks for depth estimation. A partial version of the model, modeling only aleatoric uncertainty, was evaluated on classification for semantic segmentation.

These efforts show great potential for empowering deterministic DNNs with Bayesian properties at negligible computational overhead. We will show that complex, difficult tasks where reviewers disagree and data may not fully represent everyone, such as affect detection, can benefit from inferring different sources of uncertainty. Despite its importance, latent uncertainty quantification in emotion detection tasks is under-explored. There have, however, been a few efforts toward more realistic emotion recognition that incorporate explicit inter-annotator disagreement. For example, modeling the perception of uncertainty, measured by the standard deviation of labels captured from crowd-sourced annotations, has been studied in [11]. While such efforts are valuable in affective computing applications, these approaches are supervised, are prone to error when annotations are sparse and varied in number, and cannot capture uncertainty introduced by model parameters or by sources of noise other than human judgment.

## 3 Technical Approach

The underlying architecture of our model is an Inception-ResNet-v1 [23] for extracting facial features, followed by a multi-layer perceptron for emotion classification. We built upon an open-source implementation [20] of FaceNet [21]. We pre-trained the model up to the Mixed-8b layer using cross-entropy loss on face identity classes using the CASIA-WebFace dataset [24]. The pre-processing step included using a Multitask CNN [25] to detect facial landmarks and extract facial bounding boxes in the form of 182×182 pixel images. Since the purpose of this training mechanism is to identify faces, the later layers of the network learn to ignore features that are irrelevant to one’s identity, e.g. facial expressions, while the earlier layers represent lower-level features. Mixed-7a best encoded and retained emotionally-relevant information based on our experiments (see §A.1).

### 3.1 Baseline

After extracting features from layer Mixed-7a, a fully-connected network with two hidden layers was used to infer basic emotions. We refer to this model as Baseline. Facial Expression Recognition (FER) is an established emotion detection dataset [9]. FER+ is the same set of images, expanded to include at least 10 annotations from crowd-sourced taggers [1]. We used the FER+ train, private test, and public test subsets for training, hyperparameter tuning, and evaluation of our model performance, respectively. See §A.1 for details.

### 3.2 Epistemic & Aleatoric Uncertainties

For each input image $x$, Baseline predicts a length-$C$ logits vector $z$, which is then passed through a softmax operation to form a probability distribution $\mathbf{p}$ over a set of class labels. For our new model, we move away from pointwise predictions and put a Gaussian prior distribution over the network weights, $W \sim \mathcal{N}(0, I)$. To overcome the intractability of computing the posterior distribution $p(W \mid X, Y)$, we use MC dropout [8], performing dropout both during training and test time before each weight layer, and approximate the posterior with the simple distribution $q_\theta(W)$. Here, $q_\theta(W)$ is a mixture of two Gaussians, where the mean of one of the Gaussians is fixed at zero. We minimize the Kullback-Leibler (KL) divergence between $q_\theta(W)$ and $p(W \mid X, Y)$:

$$\mathcal{L}(\theta, p) = -\frac{1}{N}\sum_{i=1}^{N}\log p\big(y_i \mid f^{\widehat{W}_i}(x_i)\big) + \frac{1-p}{2N}\lVert\theta\rVert^2,$$

where $N$ is the number of data points, $p$ is the dropout probability, $q_\theta(W)$ is the dropout distribution, and $\widehat{W}_i \sim q_\theta(W)$. Using MC integration with $T$ sampled dropout masks, we have the approximation:

$$p(y = c \mid x, X, Y) \approx \frac{1}{T}\sum_{t=1}^{T}\mathrm{Softmax}\big(f^{\widehat{W}_t}(x)\big).$$
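The MC integration step above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the layer sizes, ReLU activations, random weights, and the helper name `mc_dropout_probs` are assumptions made purely for illustration. It only shows dropout being kept active at inference and the softmax outputs being averaged over sampled masks.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mc_dropout_probs(x, weights, biases, drop_prob, num_samples, rng):
    """Hypothetical helper: run `num_samples` stochastic forward passes.

    Returns class probabilities with shape (num_samples, batch, classes).
    """
    keep = 1.0 - drop_prob
    passes = []
    for _ in range(num_samples):
        h = x
        for i, (w, b) in enumerate(zip(weights, biases)):
            # Dropout before each weight layer, also at test time
            # (inverted-dropout scaling keeps the expected activation fixed).
            mask = rng.binomial(1, keep, size=h.shape) / keep
            h = (h * mask) @ w + b
            if i < len(weights) - 1:
                h = np.maximum(h, 0.0)  # ReLU on hidden layers (assumed)
        passes.append(softmax(h))
    return np.stack(passes)

# Toy setup: random features and weights stand in for a trained model.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))
sizes = [16, 128, 128, 8]
weights = [rng.normal(scale=0.1, size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

probs = mc_dropout_probs(x, weights, biases, drop_prob=0.5, num_samples=20, rng=rng)
mean_probs = probs.mean(axis=0)  # MC estimate of the predictive distribution
```

Keeping the per-pass probabilities, rather than only their mean, is what later allows the spread across passes to be measured.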

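Inspired by [15], we use entropy in the probability space as a proxy for classification uncertainty. To get an aggregate uncertainty measure, we marginalize over all parameters and use the entropy of the probability vector $\mathbf{p}$ over the $C$ classes: $H(\mathbf{p}) = -\sum_{c=1}^{C} p_c \log p_c$. We then quantify the total ($u_t$) and aleatoric ($u_a$) uncertainty using:

$$u_t = H\!\left(\frac{1}{T}\sum_{t=1}^{T}\mathbf{p}_t\right), \qquad u_a = \frac{1}{T}\sum_{t=1}^{T} H(\mathbf{p}_t),$$

where $\mathbf{p}_t$ is the softmax output of the $t$-th of $T$ stochastic forward passes.

The epistemic uncertainty is then defined as $u_e = u_t - u_a$. Note that $u_e$ represents mutual information between true values and model parameters and thus has a different scale compared to $u_t$ and $u_a$, which each represent the entropy of a probability distribution. For simplicity, we refer to this model as UncNet in the rest of the paper. The code is available at https://github.com/asmadotgh/unc-net.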
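As a rough sketch of this decomposition (not taken from the released code), total uncertainty can be computed as the entropy of the averaged prediction, aleatoric uncertainty as the average per-pass entropy, and epistemic uncertainty as their difference. The function names and the two toy inputs below are illustrative assumptions.

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    # Shannon entropy in nats; eps guards against log(0).
    return -np.sum(p * np.log(p + eps), axis=axis)

def decompose_uncertainty(probs):
    """probs: array of shape (T, N, C) with softmax outputs of T stochastic passes."""
    total = entropy(probs.mean(axis=0))      # entropy of the mean prediction
    aleatoric = entropy(probs).mean(axis=0)  # mean of the per-pass entropies
    epistemic = total - aleatoric            # mutual-information estimate
    return total, aleatoric, epistemic

# Passes agree on an ambiguous prediction: all uncertainty is aleatoric.
agree = np.array([[[0.5, 0.5]], [[0.5, 0.5]]])
# Passes are individually confident but contradict each other: epistemic.
disagree = np.array([[[1.0, 0.0]], [[0.0, 1.0]]])

t1, a1, e1 = decompose_uncertainty(agree)
t2, a2, e2 = decompose_uncertainty(disagree)
```

Both toy cases have the same total uncertainty (about `log 2`), which is why the split into its two components carries information that the total alone does not.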

## 4 Results & Discussion

We show that modeling and disambiguating different sources of uncertainty provides a means to identify data that are more difficult to classify, and we seek to provide interpretable reasons for why. Similar to [19], to represent task subjectivity, we compute the probability that two draws from the empirical histogram of human annotations disagree: $d_i = 1 - \sum_{c=1}^{C} p_{i,c}^2$, where $C$ is the number of classes and $p_{i,c}$ is the probability of image $i$ being rated as class $c$.
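A small sketch of this disagreement probability, assuming the two draws are independent and made with replacement from the per-image vote histogram; the vote counts below are made up for illustration.

```python
import numpy as np

def disagreement_probability(votes):
    """votes: (num_images, num_classes) array of annotation counts per class."""
    votes = np.asarray(votes, dtype=float)
    p = votes / votes.sum(axis=-1, keepdims=True)  # empirical label histogram
    # Probability that two independent draws land on different classes.
    return 1.0 - np.sum(p ** 2, axis=-1)

votes = np.array([
    [10, 0, 0],  # unanimous annotators
    [5, 5, 0],   # evenly split between two classes
    [4, 3, 3],   # spread across three classes
])
d = disagreement_probability(votes)
```

The score is zero for unanimous labels and grows as the votes spread over more classes.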

### 4.1 A Proxy for Inter-Rater Disagreement

Classification of perceived emotions is inherently a subjective task, with disagreement across human annotators. We hypothesize that aleatoric uncertainty is associated with inter-annotator disagreement. We used the Pearson correlation coefficient to assess the relationship between aleatoric uncertainty and disagreement probability, and found a significant correlation. This finding suggests aleatoric uncertainty as a tool for quantifying the degree of label subjectivity associated with an image.

Note that we observed no significant correlation between epistemic uncertainty and the annotators’ disagreement probability. This is aligned with our hypothesis that epistemic uncertainty captures the uncertainty introduced by model parameters and is not able to capture the nuance in subjective annotations.

### 4.2 Task Subjectivity, Difficulty & Bias in Training

Figure 1 shows samples with the highest and lowest uncertainties. On the left, extreme cases in terms of aleatoric uncertainty are listed. We observe that samples with low aleatoric uncertainty are stereotypical expressions of emotion where annotators (almost) unanimously agree on the assigned label. The fact that the “happiness” class is the second most common class in the dataset (after “neutral”) and has a stereotypical morphology in terms of the position of the eye corners, mouth, and teeth exposure may have contributed to its dominance among samples with low aleatoric uncertainty. On the other hand, we observe that samples with the highest aleatoric uncertainty represent either subjectivity in label assignment and a lack of annotator consensus, or low image quality, for example an occluded face or a drawing as opposed to a photograph.

Figure 1, on the right, shows extreme cases in terms of epistemic uncertainty. Samples with low epistemic uncertainty show similar patterns: stereotypical expressions of emotion that are common in the training set. On the other hand, we see different patterns in samples with high epistemic uncertainty. We observe that the model has low confidence in its predictions for dark-skinned subgroups. Our interpretation is that the CASIA-WebFace dataset that was used for pre-training the model is highly skewed. It contains faces of celebrities that IMDb lists as active between 1940 and 2014. Most of these celebrities are white. That may explain why the model has high epistemic uncertainty when making a prediction for non-white input images. We also see a sample that exemplifies a non-frontalized photo, to which the human annotators were able to unanimously assign a “neutral” label despite its atypical viewpoint in the dataset. Since the pre-training process included a frontalization pre-processing step, we believe the current model is not capable of finding meaningful representations for non-frontalized photos, which is why this sample has high epistemic uncertainty. Factors such as different illumination may also result in higher epistemic uncertainty.

### 4.3 A Proxy for Degree of Calibration

The Brier Calibration Error (BCE) [2, 6, 13] is a commonly-used metric for quantifying calibration: $\mathrm{BCE} = \frac{1}{N}\sum_{n=1}^{N}\sum_{c=1}^{C}\left(y_{n,c} - \hat{y}_{n,c}\right)^2$. Here, $N$ is the number of samples, $C$ is the number of classes, $y$ is a one-hot representation of true labels, and $\hat{y}$ is the predicted confidence scores. Additionally, variations of the reliability diagram have been used to show the discrepancy between confidence and accuracy [10, 17, 4]. Since we have multiple annotations per data point, each (annotation, sample) pair is treated separately.
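To make the per-pair treatment concrete, here is a small sketch of a Brier-style score in which a single prediction is repeated once per annotation. The normalization (sum over classes, mean over pairs) is the common multi-class convention and is an assumption here, as are the example numbers.

```python
import numpy as np

def brier_score(confidences, labels):
    """Mean over (annotation, sample) pairs of the squared error against one-hot labels."""
    confidences = np.asarray(confidences, dtype=float)
    one_hot = np.eye(confidences.shape[1])[np.asarray(labels)]
    return np.mean(np.sum((one_hot - confidences) ** 2, axis=1))

# One image with predicted confidences over three classes and three
# annotations (classes 0, 0, 1): each annotation forms its own pair.
pred = np.array([0.7, 0.2, 0.1])
annotations = [0, 0, 1]
conf = np.tile(pred, (len(annotations), 1))
bce = brier_score(conf, annotations)
```

Lower values indicate predicted confidences that sit closer to the observed annotations.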

Figure 2 (left) shows the reliability diagram for both Baseline and UncNet. As plotted, both models are close to the 45° line. This is aligned with previous research findings showing evidence of well-calibrated predictions when training with soft labels [16, 22]. While the near-perfect calibration of Baseline leaves little room for further improvement, the additional uncertainty estimates provide useful information about subgroups of images that may be more or less calibrated. On the right, the relationship between uncertainty and calibration is visualized. We sort samples based on predictive uncertainty estimates and plot BCE as a function of the percentile of total, aleatoric, and epistemic uncertainty. There was a significant Pearson correlation between BCE and each of these uncertainty measures. This suggests that lower uncertainty is associated with better calibration. In particular, aleatoric uncertainty plays a more significant role in identifying when a model’s predicted confidence score matches the true correctness likelihood. See §A.3 for details regarding other suggested calibration scores [10, 18].

### 4.4 Performance

We also hypothesized performance gains using UncNet. Due to task subjectivity and annotation spread (§A.2), we believe measures that rely on a binary true/false assumption for evaluation do not fully represent the nuance of our problem setting. Therefore, we use Jensen-Shannon divergence to quantify the distance between predicted and true class probabilities: $\mathrm{JSD}(p \,\|\, q) = \tfrac{1}{2}\mathrm{KL}(p \,\|\, m) + \tfrac{1}{2}\mathrm{KL}(q \,\|\, m)$. Here, $m$ is the point-wise mean of $p$ and $q$, and $\mathrm{KL}$ is the Kullback-Leibler divergence. Lower JSD represents better performance. A paired-samples t-test was conducted to compare the JSDs of Baseline and UncNet. There was a significant difference in JSD between Baseline and UncNet, confirming our hypothesis.
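For reference, a minimal sketch of the Jensen-Shannon divergence between a predicted distribution and an annotation histogram, using natural logarithms. The function names and example distributions are illustrative assumptions.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) in nats; eps avoids division by zero and log(0).
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sum(p * np.log((p + eps) / (q + eps)))

def jsd(p, q):
    # Symmetric divergence built from the point-wise mean of p and q.
    m = 0.5 * (np.asarray(p, dtype=float) + np.asarray(q, dtype=float))
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

predicted = np.array([0.7, 0.2, 0.1])
annotated = np.array([0.6, 0.3, 0.1])  # normalized annotator votes

same = jsd(predicted, predicted)
close = jsd(predicted, annotated)
opposite = jsd([1.0, 0.0], [0.0, 1.0])
```

Unlike KL divergence, this score is symmetric and bounded above by `log 2`, so it stays finite even when annotators assign zero votes to a class.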

We take a more granular look and hypothesize that samples with higher uncertainty have higher JSD. To test this, Pearson correlation coefficients were computed to assess the relationship between each uncertainty estimate and JSD in UncNet. Each pair showed a significant correlation. This finding further confirms our hypothesis: lower uncertainty is associated with a better match between prediction and ground truth. Similar to the findings of [12], we see that aleatoric uncertainty plays a more significant role in such identification.

Though accuracy may not fully represent this nuanced problem setting, we also checked how UncNet compared to Baseline as measured by accuracy. We observed that UncNet on its own improves accuracy modestly. However, if the model is given the option to reject up to 25% of samples, namely those it is least confident in, accuracy improves significantly, by as much as 8%. See §A.4 for details.
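The reject option can be sketched as follows: sort samples by an uncertainty estimate, keep the most certain fraction, and score accuracy on the retained subset only. This is an illustrative reading of the procedure; the helper name, the `keep_fraction` parameter, and the toy data are assumptions.

```python
import numpy as np

def accuracy_with_rejection(pred_probs, target_probs, uncertainty, keep_fraction=0.75):
    """Accuracy on the `keep_fraction` of samples with the lowest uncertainty."""
    pred = np.argmax(pred_probs, axis=1)
    target = np.argmax(target_probs, axis=1)  # annotated maximum-probability class
    n_keep = int(np.ceil(keep_fraction * len(pred)))
    keep = np.argsort(uncertainty)[:n_keep]   # most certain samples first
    return np.mean(pred[keep] == target[keep])

pred_probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]])
target_probs = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
uncertainty = np.array([0.1, 0.2, 0.3, 0.9])  # the last sample is the least certain

acc_all = accuracy_with_rejection(pred_probs, target_probs, uncertainty, keep_fraction=1.0)
acc_kept = accuracy_with_rejection(pred_probs, target_probs, uncertainty, keep_fraction=0.75)
```

In this toy case the only misclassified sample is also the most uncertain one, so rejecting it raises accuracy; on real data the gain depends on how well uncertainty tracks errors.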

## 5 Conclusion

We focused on the often subjective task of perceived emotion classification and demonstrated how a classical network architecture can be altered to predict measures of epistemic and aleatoric uncertainty, and how these measures can help interpret the model’s confidence scores. We presented evidence for aleatoric uncertainty being a proxy for inter-annotator disagreement and showcased how the measured aleatoric uncertainty can identify low-quality inputs or more subjective samples. Additionally, we explored how epistemic uncertainty can represent bias in training data and suggested directions to alleviate it. Our results suggest that the predicted total uncertainty can act as a surrogate for the degree of calibration, even on tasks without human-expert consensus. Finally, we showed that there are other benefits, such as potential performance improvements.

## Acknowledgments

We would like to thank Ardavan Saeedi and Suvrit Sra for insightful discussions, Jeremy Nixon for sharing calibration metrics code, MIT Stephen A. Schwarzman College of Computing, Machine Learning Across Disciplines Challenge and MIT Media Lab Consortium for supporting this research.

## References

- [1] E. Barsoum, C. Zhang, C. C. Ferrer, and Z. Zhang. Training deep networks for facial expression recognition with crowd-sourced label distribution. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, pages 279–283. ACM, 2016.
- [2] G. W. Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1–3, 1950.
- [3] L. Cosmides and J. Tooby. Are humans good intuitive statisticians after all? Rethinking some conclusions from the literature on judgment under uncertainty. Cognition, 58(1):1–73, 1996.
- [4] M. H. DeGroot and S. E. Fienberg. The comparison and evaluation of forecasters. Journal of the Royal Statistical Society: Series D (The Statistician), 32(1-2):12–22, 1983.
- [5] J. S. Denker and Y. Lecun. Transforming neural-net output levels to probability distributions. In Advances in neural information processing systems, pages 853–859, 1991.
- [6] E. Epstein. A scoring system for probability forecasts of ranked categories. J. Appl. Meteorol, 8(6):985–987, 1969.
- [7] C. Frings and D. Wentura. Trial-by-trial effects in the affective priming paradigm. Acta Psychologica, 128(2):318–323, 2008.
- [8] Y. Gal and Z. Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059, 2016.
- [9] I. J. Goodfellow, D. Erhan, P. L. Carrier, A. Courville, M. Mirza, B. Hamner, W. Cukierski, Y. Tang, D. Thaler, D.-H. Lee, et al. Challenges in representation learning: A report on three machine learning contests. In International Conference on Neural Information Processing, pages 117–124. Springer, 2013.
- [10] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1321–1330. JMLR. org, 2017.
- [11] J. Han, Z. Zhang, M. Schmitt, M. Pantic, and B. Schuller. From hard to soft: Towards more human-like emotion recognition by modelling the perception uncertainty. In Proceedings of the 25th ACM international conference on Multimedia, pages 890–897. ACM, 2017.
- [12] A. Kendall and Y. Gal. What uncertainties do we need in bayesian deep learning for computer vision? In Advances in neural information processing systems, pages 5574–5584, 2017.
- [13] S. Lichtenstein and B. Fischhoff. Training for calibration. Organizational Behavior and Human Performance, 26(2):149–171, 1980.
- [14] D. J. MacKay. A practical bayesian framework for backpropagation networks. Neural computation, 4(3):448–472, 1992.
- [15] A. Malinin and M. Gales. Predictive uncertainty estimation via prior networks. In Advances in Neural Information Processing Systems, pages 7047–7058, 2018.
- [16] R. Müller, S. Kornblith, and G. Hinton. When does label smoothing help? arXiv preprint arXiv:1906.02629, 2019.
- [17] A. Niculescu-Mizil and R. Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd international conference on Machine learning, pages 625–632. ACM, 2005.
- [18] J. Nixon, M. W. Dusenberry, L. Zhang, G. Jerfel, and D. Tran. Measuring calibration in deep learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 38–41, 2019.
- [19] M. Raghu, K. Blumer, R. Sayres, Z. Obermeyer, S. Mullainathan, and J. Kleinberg. Direct uncertainty prediction with applications to healthcare. arXiv preprint arXiv:1807.01771, 2018.
- [20] D. Sandberg. Open-source implementation of a face recognition model. https://github.com/davidsandberg/facenet/wiki, 2018. [Online; accessed 10-April-2019].
- [21] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015.
- [22] S. Seo, P. H. Seo, and B. Han. Learning for single-shot confidence calibration in deep neural networks through stochastic inferences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9030–9038, 2019.
- [23] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
- [24] D. Yi, Z. Lei, S. Liao, and S. Z. Li. Learning face representation from scratch. arXiv preprint arXiv:1411.7923, 2014.
- [25] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503, 2016.

## Appendix A Supplementary Material

### A.1 Model Architecture and Pre-Training Details

Figure 3 shows the detailed model architecture. Note that the modules on the top represent an Inception-ResNet-v1 architecture. We have used up to layer Mixed-7a for feature extraction from raw images and added a fully connected (FC) network with two hidden layers of size 128×128. This represents the Baseline architecture. The main difference in UncNet is adding a dropout mask before each FC layer, not only during training, but also at inference time.

For face similarity pre-training, we treated the intermediate layer that was used to export features from a raw image as a hyper-parameter that was tuned according to validation loss. Our experiments included Mixed-5a, Mixed-6a, Mixed-6b, Mixed-7a, Mixed-8a, and Mixed-8b layers. We observed that the Mixed-7a layer best encoded and retained emotional information from the input face crop. Table 1 summarizes our exploration results.

### A.2 Annotation Disagreement Details

Figure 4 shows the distribution of disagreement probability for all images in the training set. Histogram heights show a density rather than the absolute count, so that the area under the fitted curve is one.

### A.3 Detailed Calibration Results

In this section, we report a range of calibration scores for Baseline and UncNet. Further, we show how these scores are related to the predictive uncertainty estimates of UncNet.

Scholars have introduced a range of calibration scores. Maximum Calibration Error (MCE) and Expected Calibration Error (ECE) approximate calibration error by quantization of uncertainty bins and have been adopted in many recent publications [10]:

$$\mathrm{ECE} = \sum_{b=1}^{B}\frac{n_b}{N}\left|\mathrm{acc}(b) - \mathrm{conf}(b)\right|, \qquad \mathrm{MCE} = \max_{b \in \{1,\dots,B\}}\left|\mathrm{acc}(b) - \mathrm{conf}(b)\right|$$

Here $n_b$ is the number of predictions in bin $b$, $N$ is the number of samples, $\mathrm{acc}(b)$ is the accuracy of predictions in bin $b$, and $\mathrm{conf}(b)$ is the average prediction confidence score in bin $b$. Recently, new metrics have been proposed to overcome the limiting assumption of mutually exclusive classes and to improve robustness to label noise [18]. Static Calibration Error (SCE) is a metric where the predictions for all classes are taken into account, as opposed to only the argmax of the softmax outputs. Adaptive Calibration Error (ACE) is an extension of SCE where, instead of equidistant bins, confidence scores are sorted and their percentiles represent “ranges”, parallel to “bins” in SCE. Thresholded Adaptive Calibration Error (TACE) is an extension of ACE where only values with at least $\epsilon$ confidence are taken into account. SCE, ACE, and TACE can be formally defined as follows:

$$\mathrm{SCE} = \frac{1}{C}\sum_{c=1}^{C}\sum_{b=1}^{B}\frac{n_{b,c}}{N}\left|\mathrm{acc}(b,c) - \mathrm{conf}(b,c)\right|$$

$$\mathrm{ACE} = \frac{1}{CR}\sum_{c=1}^{C}\sum_{r=1}^{R}\left|\mathrm{acc}(r,c) - \mathrm{conf}(r,c)\right|$$

where $C$ is the number of classes, $B$ the number of bins, and $R$ the number of ranges. TACE has the same form as ACE, with the ranges computed only over predictions whose confidence is at least $\epsilon$.

Table 2 summarizes these metrics using $B = R = 10$. We did not observe any conclusive results comparing Baseline and UncNet conditions or using uncertainty quantiles. Our interpretation is that the close-to-perfect calibration with soft labels, as well as identified problems with the dependence of these metrics on quantization, may have led to a null result. Further study in this area is required to better understand what these metrics can and cannot capture.
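As a rough illustration of the two binned scores from [10] discussed in this section, the sketch below computes ECE and MCE with equal-width confidence bins. It is a simplified sketch under the usual top-label convention, not the calibration code used for Table 2; the function name and toy inputs are assumptions.

```python
import numpy as np

def ece_mce(confidences, correct, num_bins=10):
    """Expected and maximum calibration error over equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    # Interior edges map each confidence to a bin index in [0, num_bins - 1].
    bin_index = np.digitize(confidences, edges[1:-1])
    ece, mce = 0.0, 0.0
    for b in range(num_bins):
        in_bin = bin_index == b
        if not in_bin.any():
            continue  # empty bins contribute nothing
        gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
        ece += in_bin.mean() * gap  # weight by the fraction of samples in the bin
        mce = max(mce, gap)
    return ece, mce

# Top-class confidences and whether each prediction was correct.
confidences = [0.95, 0.95, 0.55, 0.55]
correct = [1, 1, 1, 0]
ece, mce = ece_mce(confidences, correct, num_bins=10)
```

Because both scores depend on where the bin edges fall, small changes in `num_bins` can shift the result, which is one form of the quantization dependence mentioned above.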

### A.4 Detailed Performance Metrics

In this section, we report accuracy for a random run on the test set. Accuracy is defined as the percentage of samples where the predicted maximum-probability class matches the annotated maximum-probability class. Table 3 summarizes our findings. In future work, we will add further performance metrics, such as average precision or per-class accuracy, and provide confidence bounds using bootstrapping.

Model | Evaluation Dataset | Accuracy (%) |
---|---|---|
Baseline | FER+ Test | 54.848 |
UncNet | FER+ Test | 56.943 |
UncNet - low | 75% of FER+ Test | 57.452 |
UncNet - low | 75% of FER+ Test | 62.481 |
UncNet - low | 75% of FER+ Test | 62.332 |