Operational Calibration: Debugging Confidence Errors for DNNs in the Field

Zenan Li, State Key Lab of Novel Software Technology, Nanjing University, Nanjing, China, lizenan@smail.nju.edu.cn
Xiaoxing Ma (ORCID 0000-0001-7970-1384), State Key Lab of Novel Software Technology, Nanjing University, Nanjing, China, xxm@nju.edu.cn
Chang Xu, State Key Lab of Novel Software Technology, Nanjing University, Nanjing, China, changxu@nju.edu.cn
Jingwei Xu, State Key Lab of Novel Software Technology, Nanjing University, Nanjing, China, jingweix@nju.edu.cn
Chun Cao, State Key Lab of Novel Software Technology, Nanjing University, Nanjing, China, caochun@nju.edu.cn
Jian Lü, State Key Lab of Novel Software Technology, Nanjing University, Nanjing, China, lj@nju.edu.cn

Trained DNN models are increasingly adopted as integral parts of software systems. However, they are often over-confident, especially in practical operation domains where slight divergence from their training data almost always exists. To minimize the loss due to inaccurate confidence, operational calibration, i.e., calibrating the confidence function of a DNN classifier against its operation domain, becomes a necessary debugging step in the engineering of the whole system.

Operational calibration is difficult considering the limited budget of labeling operation data and the weak interpretability of DNN models. We propose a Bayesian approach to operational calibration that gradually corrects the confidence given by the model under calibration with a small number of labeled operational data deliberately selected from a larger set of unlabeled operational data. Exploiting the locality of the learned representation of the DNN model and modeling the calibration as Gaussian Process Regression, the approach achieves impressive efficacy and efficiency. Comprehensive experiments with various practical data sets and DNN models show that it significantly outperformed alternative methods, and in some difficult tasks it eliminated about 71% to 97% high-confidence errors with only about 10% of the minimal amount of labeled operation data needed for practical learning techniques to barely work.

Operational Calibration, Deep Neural Networks, Gaussian Process
Conference: Manuscript; 2019; arXiv
CCS concepts: Software and its engineering → Software testing and debugging; Computing methodologies → Neural networks

1. Introduction

Deep learning (DL) has demonstrated near-human or even better-than-human performance in some difficult tasks, such as image classification and speech recognition (LeCun et al., 2015; Goodfellow et al., 2016). Deep Neural Network (DNN) models are increasingly adopted in high-stakes application scenarios such as medical diagnostics (Obermeyer and Emanuel, 2016) and self-driving cars (Bojarski et al., 2016). However, it is not uncommon for DNN models to perform poorly in the field (Riley, 2019). Interest in the quality assurance of DNN models as integral parts of software systems is surging in the software engineering community (Pei et al., 2017; Ma et al., 2019; Sun et al., 2018; Zhang et al., 2018; Kim et al., 2019; Zhang et al., 2019).

A particular problem of using a previously trained DNN model in an operation domain is that the model may not only make more mistakes than expected in its predictions, but also give erroneous confidence values for these predictions. Arguably the latter issue is more harmful, because with accurate confidence information the model could be at least partially usable by accepting only high-confidence predictions.

The problem comes from the frequently occurring divergence between the original data on which the model is trained and the data in the operation domain, which is often called domain shift or dataset shift (Ng, 2016) in the machine learning literature. It can be difficult and beyond the reach of usual machine learning tricks such as fine-tuning and transfer learning (Pan and Yang, 2009; Wang et al., 2014), because of two practical restrictions often encountered. First, when the DNN model is provided by a third party, its training data are sometimes unavailable due to privacy and proprietary limitations (Zhou, 2016; Shokri and Shmatikov, 2015; Konečnỳ et al., 2016). Second, one can only use a small number of labeled operation data, because it is expensive to label the data collected in the field. For example, in an AI-assisted clinical medicine scenario, surgical biopsies could be involved in the labeling of radiology or pathology images.

We consider operational calibration that corrects the error in the confidence for its prediction on each input in a given operation domain. It does not change the predictions made by a DNN model, but tells when the model works well and when not. In this sense, operational calibration is a necessary debugging step that should be incorporated in the engineering of the whole system. Operational calibration is challenging because what it fixes is a function, not just a value. It also needs to be efficient, i.e., reducing the effort in labeling operation data.

It is natural to model operational calibration as a kind of non-parametric Bayesian inference and solve it with Gaussian Process Regression (Rasmussen and Williams, 2005). We take the original confidence of the DNN model as the prior, and gradually calibrate the confidence with the evidence collected by selecting and labeling operation data. The key insight into effective and efficient regression comes from the following observations. First, the DNN model, although suffering from the domain shift, can be used as a feature extractor with which unlabeled operational data can be nicely clustered (Zhu, 2005; Shu et al., 2018). In each cluster, the prediction correctness of an example is correlated with that of another one. The correlation can be effectively estimated with the distance between the two examples in the feature space. Second, a Gaussian Process is able to quantify the uncertainty after each step, which can be used to guide the selection of operational data to label efficiently.

Systematic empirical evaluations showed that the approach was promising. It significantly outperformed existing calibration methods in both efficacy and efficiency in all settings we tested. In some difficult tasks it eliminated about 71% to 97% high-confidence errors with only about 10% of the minimal amount of labeled operation data needed for practical learning techniques to barely work.

In summary, the contributions of this paper are:

  • Posing the problem of operational calibration for DNN models in the field, and casting it into a Bayesian inference framework.

  • Proposing a Gaussian Process-based approach to operational calibration, which leverages the representation learned by the DNN model under calibration and the locality of confidence errors in this representation.

  • Evaluating the approach systematically. Experiments with various datasets and models confirmed the general efficacy and efficiency of our approach.

The rest of this paper is organized as follows. We first discuss the general need for operational quality assurance for DNNs in Section 2, and then focus on the problem of, and our approach to, operational calibration in Section 3. The approach is evaluated empirically in Section 4. We briefly overview related work and highlight their differences from ours in Section 5 before concluding the paper with Section 6.

2. DNN and operational quality assurance

Deep learning is intrinsically inductive (Goodfellow et al., 2016). However, conventional software engineering is mostly deductive, as evidenced by its fundamental principle of specification-implementation consistency. Adopting DNN models as integral parts of software systems poses new challenges for quality assurance. To provide the background for the work on operational calibration, we first briefly introduce DNN and its prediction confidence, and then discuss its quality assurance for given operation domains.

2.1. DNN classifier and prediction confidence

A deep neural network classifier contains multiple hidden layers between its input and output layers. A popular understanding (Goodfellow et al., 2016) of the role of these hidden layers is that they progressively extract abstract features (e.g., a wheel, human skin, etc.) from a high-dimensional low-level input (e.g., the pixels of an image). These features provide a relatively low-dimensional high-level representation $z$ for the input $x$, which makes the classification much easier, e.g., the image is more likely to be a car if wheels are present.

What a DNN classifier tries to learn from the training data is a posterior probability distribution, denoted as $p(y \mid x)$ (Bishop, 2006). For a $K$-classification problem, the distribution can be written as $p_i(x) = p(y = i \mid x)$, where $i = 1, \dots, K$. For each input $x$ whose representation is $z$, the output layer first computes the non-normalized prediction $\mathbf{o} = (o_1, \dots, o_K)$, whose element $o_i$ is often called the logit for the $i$-th class. The classifier then normalizes $\mathbf{o}$ with a softmax function to approximate the posterior probabilities

\[ \hat{p}_i(x) = \frac{e^{o_i}}{\sum_{j=1}^{K} e^{o_j}}, \quad i = 1, \dots, K. \tag{1} \]

Finally, to classify $x$, one just chooses the category corresponding to the maximum posterior probability, i.e.,

\[ \hat{y}(x) = \arg\max_{i} \hat{p}_i(x). \tag{2} \]

Obviously, this prediction is intrinsically uncertain. The confidence for this prediction, which quantifies the likelihood of correctness, can be naturally measured as the estimated posterior class probability

\[ \hat{c}(x) = \hat{p}_{\hat{y}(x)}(x) = \max_{i} \hat{p}_i(x). \tag{3} \]

Confidence plays an important role in decision-making. For example, if the loss due to an incorrect prediction is four times the gain of a correct prediction, one should not act on predictions with confidence less than 0.8, because the expected net gain $c - 4(1 - c)$ of acting on a prediction with confidence $c$ is non-negative only when $c \ge 0.8$. Inaccurate confidence could cause significant loss. For example, an over-confident benign prediction for a pathology image could mislead a doctor into overlooking a malignant tumor, while an under-confident benign prediction could result in unnecessary confirmatory testing.
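As a minimal sketch of how the prediction and its confidence are obtained from the logits (softmax, arg max, and maximum class probability), the following uses only the Python standard library. The function names and the example logits are illustrative assumptions, not part of the original implementation.

```python
import math

def softmax(logits):
    """Normalize logits into class probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(o - m) for o in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_with_confidence(logits):
    """Return (predicted class index, confidence), i.e. arg max and max of the softmax output."""
    probs = softmax(logits)
    y_hat = max(range(len(probs)), key=probs.__getitem__)
    return y_hat, probs[y_hat]

# Hypothetical logits for a 3-class problem.
logits = [2.0, 1.0, 0.1]
label, conf = predict_with_confidence(logits)
```

Here `conf` is the quantity that calibration later corrects; the predicted `label` itself is left untouched.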

Modern DNN classifiers are often inaccurate in confidence (Szegedy et al., 2016), because they overfit to the surrogate loss used in training (Guo et al., 2017; Tewari and Bartlett, 2007). Simply put, they are over-optimized toward the accuracy of classification, but not the accuracy of estimation for posterior probabilities. To avoid the potential loss caused by inaccurate confidence, confidence calibration can be employed in the learning process (Flach, 2016; Guo et al., 2017; Tewari and Bartlett, 2007). Usually the task is to find a function $R(\cdot)$ to correct the logit vector $\mathbf{o}$ such that

\[ \hat{c}(x) = \max_{i} \frac{e^{R(\mathbf{o})_i}}{\sum_{j=1}^{K} e^{R(\mathbf{o})_j}} \tag{4} \]

matches the real posterior probability $p_{\hat{y}(x)}(x)$. Notice that, in this setting the inaccuracy of confidence is viewed as a kind of systematic error or bias, not associated with particular inputs or domains. That is, $R$ does not take the input $x$ or its representation $z$ as input.

There exist different kinds of calibration methods, such as isotonic regression (Zadrozny and Elkan, 2002), histogram binning (Zadrozny and Elkan, 2002), and Platt scaling (Platt, 1999). However, according to a recent study (Guo et al., 2017), the most effective choice is often a simple method called Temperature Scaling (Hinton et al., 2015). The idea is to define the calibration function as

\[ R(\mathbf{o}) = \mathbf{o} / T, \tag{5} \]

where $\mathbf{o}$ is the logit vector and $T$ is a scalar parameter computed by minimizing the negative log likelihood (Hastie et al., 2009) on the validation dataset.
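To make the idea of Temperature Scaling concrete, here is a small sketch that divides the logits by a scalar temperature and picks the temperature minimizing the negative log likelihood on a validation set. The grid search is an assumption made for brevity (a gradient-based optimizer is the more common choice), and the validation data are made up.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(o - m) for o in logits]
    s = sum(exps)
    return [e / s for e in exps]

def nll(logit_rows, labels, T):
    """Average negative log likelihood of the true labels after dividing logits by T."""
    total = 0.0
    for row, y in zip(logit_rows, labels):
        probs = softmax([o / T for o in row])
        total -= math.log(probs[y])
    return total / len(labels)

def fit_temperature(logit_rows, labels, grid=None):
    """Pick T from a grid by minimizing validation NLL (a simple stand-in for an optimizer)."""
    grid = grid or [0.25 * k for k in range(1, 41)]  # 0.25, 0.5, ..., 10.0
    return min(grid, key=lambda T: nll(logit_rows, labels, T))

# Hypothetical validation set: confident logits, but one of four predictions is wrong.
val_logits = [[6.0, 0.0], [5.0, 0.0], [0.0, 7.0], [6.0, 0.0]]
val_labels = [0, 0, 1, 1]
T = fit_temperature(val_logits, val_labels)
```

Because every logit is divided by the same positive scalar, the arg max (and hence the prediction) is unchanged; only the confidence is softened when the fitted temperature exceeds 1.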

2.2. Operational quality assurance

Well-trained DNN models can provide marvellous capabilities, but unfortunately their failures in applications are also very common (Riley, 2019). When using a trained model as an integral part of a high-stakes software system, it is crucial to know quantitatively how well the model will work. Quality assurance that combines the viewpoints of software engineering and machine learning is needed, but largely missing.

The principle of software quality assurance is founded on the specifications for software artifacts and the deductive reasoning based on them. A specification defines the assumptions and guarantees for a software artifact. The artifact is expected to meet its guarantees whenever its assumptions are satisfied. Thus explicit specifications make software artifacts more or less domain independent. However, statistical machine learning does not provide this kind of specification. Essentially it tries to induce a model from its training data, which is intended to be general so that the model can give predictions on previously unseen inputs. Unfortunately the scope of generalization cannot be explicitly specified. As a result, a major problem comes from the divergence between the domain where the model was originally trained and the domain where it actually operates.

So the first requirement for the quality assurance of a DNN model is to focus on the concrete domain where the model actually operates. In theory, discussing the quality of a DNN model is pointless without considering its operation domain, and in practice the performance of a model may drop significantly with domain shift (Li et al., 2019b). On the other hand, focusing on the operation domain also relieves the DNN model of the dependence on its original training data. Apart from practical concerns such as protecting the privacy and property of the training data, decoupling a model from its training data and process will also be helpful for (re)using it as a commercial off-the-shelf (COTS) software product (Zhou, 2016). This is in contrast to machine learning techniques dealing with domain shift, such as transfer learning or domain adaptation, that heavily rely on the original training data or hyperparameters (Pan and Yang, 2009; Shu et al., 2018; Wang et al., 2019b). They need original training data because they try to generalize the scope of the model to include the new operation domain.

The second requirement is to embrace the uncertainty that is intrinsic in DNN models. A defect, or a “bug”, of a software artifact is a case in which it does not deliver its promise. Different from conventional software artifacts, a DNN model never promises to be certainly correct on any given input, and thus individual incorrect predictions should be regarded not as errors but, to some extent, as features (Ilyas et al., 2019). Nevertheless, the model statistically quantifies the uncertainty of its predictions. Collectively, it is measured with metrics such as accuracy or precision. Individually, it is stated by the confidence value about a prediction on each given input. These quantifications of uncertainty, as well as the predictions a model makes, should be subject to quality assurance. For example, given a DNN model and its operation domain, operational testing (Li et al., 2019b) examines to what degree the model’s overall accuracy is degraded by the domain shift.

Finally, operational quality assurance should prioritize the saving of human effort, which includes the cost of collecting, and especially labeling, the data in the operation domain. The labeling of operational data often involves physical interactions, such as surgical biopsies and destructive testing, and thus can be expensive and time-consuming. Without access to the original training data, fine-tuning a DNN model to an operation domain may require a tremendous amount of labeled examples to work. Quality assurance activities often have to work under a much tighter budget for labeling data.

Figure 1. Operational quality assurance

Figure 1 depicts the overall idea for operational quality assurance, which generalizes the process of operational testing proposed in (Li et al., 2019b). A DNN model, which is trained by a third party with the data from the origin domain, is to be deployed in an operation domain. It needs to be evaluated, and possibly adapted, with the data from the current operation domain. To reduce the effort of labeling, data selection can be incorporated in the procedure with the guidance of the information generated by the DNN model and the quality assurance activity. Only the DNN models that pass the assessments and are possibly equipped with the adaptations will be put into operation.

3. Operational Calibration of DNN Confidence

Now we focus on operational calibration as a specific quality assurance task for DNNs in the field.

3.1. Defining the problem

Given a domain where a previously trained DNN model is deployed, operational calibration identifies and fixes the model’s errors in the confidence of predictions on individual inputs in the domain. Operational calibration is conservative in that it does not change the predictions made by the model, but tries to give accurate estimations on the likelihood of the predictions being correct. With this information, a DNN model will be useful even though its prediction accuracy is severely affected by the domain shift. One may take only its predictions on inputs with high confidence, but switch to other models or other backup measures if unconfident.

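To quantify the accuracy of the confidence of a DNN model on a dataset $D = \{(x_i, y_i), i = 1, \dots, N\}$, one can use the Brier score (BS) (Brier, 1950), which is actually the mean squared error of the estimation:

\[ BS(D) = \frac{1}{N} \sum_{i=1}^{N} \left( \mathbb{I}(x_i) - \hat{c}(x_i) \right)^2, \tag{6} \]

where $\mathbb{I}(x)$ is the indicator function for whether the labeled input $x$ is correctly classified or not, i.e., $\mathbb{I}(x) = 1$ if $\hat{y}(x) = y(x)$, and $0$ otherwise.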
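The Brier score above is straightforward to compute once the correctness of each prediction is known. The following sketch, with made-up confidence values, shows how a single over-confident error dominates the score.

```python
def brier_score(confidences, correct):
    """Mean squared error between confidence and the 0/1 correctness indicator."""
    n = len(confidences)
    return sum((float(ok) - c) ** 2 for c, ok in zip(confidences, correct)) / n

# Hypothetical confidences and correctness flags for four predictions.
conf = [0.9, 0.8, 0.95, 0.6]
correct = [True, True, False, True]
bs = brier_score(conf, correct)
```

In this example the third prediction (confidence 0.95 but wrong) contributes 0.9025 of the total squared error of 1.1125.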

Now we formally define the problem of operational calibration:

Problem. Given $M$, a previously trained DNN classifier, $S$, a set of unlabeled examples collected from an operation domain $D$, and a budget $n$ for labeling the examples in $S$, the task of operational calibration is to find a confidence estimation function $\hat{c}(\cdot)$ for $M$ with minimal Brier score $BS(D)$.

Notice that operational calibration is different from the confidence calibration discussed in Section 2.1. The latter is domain-independent and usually included as a step in the training process of a DNN model, but the former is needed only when the model is deployed by a third party in a specific operation domain. Operational calibration cannot take the confidence error as a systematic error of the learning process, because the error is caused by the domain shift from the training data to the operational data, and it may depend on specific inputs from the operation domain.

3.2. Modeling with Gaussian Process

At first glance operational calibration seems a simple regression problem with BS as the loss function. However, a direct regression would not work because of the limited budget of labeled operation data. It is helpful to view the problem in a Bayesian way. At the beginning, we have a prior belief about the correctness of a DNN model’s predictions, which is the confidence output of the model. Once we observe some evidence that the model makes correct or incorrect predictions on some inputs, the belief should be adjusted accordingly. The challenge here is to strike a balance between the prior, which was learned from a huge training dataset but suffers from domain shift, and the evidence, which is collected from the operation domain but limited in volume.

It is natural to model the problem as a Gaussian Process (Rasmussen and Williams, 2005), because what we need is actually a function $c(\cdot)$. A Gaussian Process is a non-parametric Bayesian method that converts a prior over functions into a posterior over functions according to observed data.

For convenience, instead of estimating $c(x)$ directly, we consider

\[ h(x) = c(x) - \hat{c}_M(x), \tag{7} \]

where $\hat{c}_M(x)$ is the original confidence output of $M$ for input $x$. At the beginning, without any evidence against $\hat{c}_M(x)$, we assume that the prior distribution of $h(\cdot)$ is a zero-mean normal distribution

\[ h \sim \mathcal{N}(0, k(\cdot, \cdot)), \tag{8} \]

where $k(\cdot, \cdot)$ is the covariance (kernel) function, which intuitively describes the “smoothness” of $h(x)$ from point to point. In other words, the covariance function ensures that $h$ produces close outputs when inputs are close in the input space.

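Assume that we observe a set $I$ of independent and identically distributed (i.i.d.) labeled operational data $x_1, \dots, x_n$, in which the value $h(x_i)$ of each labeled input is observed. For notational convenience, let

\[ X = \{x_i \mid 1 \le i \le n\}, \quad \mathbf{h} = \{h(x_i) \mid 1 \le i \le n\} \]

be the observed data and their corresponding $h$-values, and let

\[ X' = \{x'_j \mid 1 \le j \le m\}, \quad \mathbf{h}' = \{h(x'_j) \mid 1 \le j \le m\} \]

be those for a set of i.i.d. predictive points. We have

\[ \begin{pmatrix} \mathbf{h} \\ \mathbf{h}' \end{pmatrix} \Bigm| X, X' \sim \mathcal{N}\left( \mathbf{0}, \begin{pmatrix} K_{XX} & K_{XX'} \\ K_{X'X} & K_{X'X'} \end{pmatrix} \right), \]

where $K$ is the kernel matrix. Therefore, the conditional probability distribution is

\[ \mathbf{h}' \mid X, \mathbf{h}, X' \sim \mathcal{N}(\mu', \Sigma'), \]
\[ \mu' = K_{X'X} K_{XX}^{-1} \mathbf{h}, \qquad \Sigma' = K_{X'X'} - K_{X'X} K_{XX}^{-1} K_{XX'}. \]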

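With this Gaussian Process, we can estimate the probability distribution of the operational confidence for any input $x$ as follows

\[ h(x) \mid x, I \sim \mathcal{N}(\mu, \sigma^2), \]
\[ \mu = K_{xX} K_{XX}^{-1} \mathbf{h}, \qquad \sigma^2 = K_{xx} - K_{xX} K_{XX}^{-1} K_{Xx}, \]

where $X$ and $\mathbf{h}$ denote the observed inputs and their $h$-values.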

Then, with Equation 7, we have the distribution of $c(x)$

\[ c(x) \mid x, I \sim \mathcal{N}(\hat{c}_M(x) + \mu, \sigma^2). \]

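Finally, because the value of confidence ranges from 0 to 1, we need to truncate the original normal distribution (Burkardt, 2014), i.e.,

\[ c(x) \mid x, I \sim \mathcal{TN}(\mu_c, \sigma^2; 0, 1), \qquad \mu_c = \hat{c}_M(x) + \mu, \]

whose probability density function on $[0, 1]$ is

\[ p(c \mid x, I) = \frac{\frac{1}{\sigma} \phi\left( \frac{c - \mu_c}{\sigma} \right)}{\Phi\left( \frac{1 - \mu_c}{\sigma} \right) - \Phi\left( \frac{0 - \mu_c}{\sigma} \right)}. \]

Here $\phi(\cdot)$ and $\Phi(\cdot)$ are the probability density function and the cumulative distribution function of the standard normal distribution, respectively.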

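With this Bayesian approach, we compute a distribution, rather than an exact value, for the confidence of each prediction. To compute the Brier score, we simply choose the maximum a posteriori (MAP), i.e., the mode of the distribution, as the calibrated confidence value. Here it is the mean of the truncated normal distribution

\[ \hat{c}(x) = \mu_c + \sigma \, \frac{\phi(\alpha) - \phi(\beta)}{\Phi(\beta) - \Phi(\alpha)}, \qquad \alpha = \frac{0 - \mu_c}{\sigma}, \quad \beta = \frac{1 - \mu_c}{\sigma}, \]

where $\mu_c$ and $\sigma^2$ are the mean and variance of the untruncated normal distribution of $c(x)$, and $\phi$ and $\Phi$ are the standard normal density and cumulative distribution functions.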

3.3. Clustering in representation space

Directly applying the above Gaussian Process to estimate $c(\cdot)$ would be ineffective and inefficient. It is difficult to specify a proper covariance function in Equation 8, because the correlation between the correctness of predictions on different examples in the very high-dimensional input space is difficult, if at all possible, to model.

Fortunately, we have the DNN model on hand, which can be used as a feature extractor, although it may suffer from the problem of domain shift and perform badly as a classifier (Bengio et al., 2012). In this way we transform each input $x$ from the input space to a corresponding point $z$ in the representation space, which is defined by the output of the neurons in the last hidden layer. It turns out that the correctness of $M$’s predictions has an obvious locality, i.e., a prediction is more likely to be correct/incorrect if it is near to a correct/incorrect prediction in the representation space. See Figure 2 for an intuitive example.

Figure 2. Locality of prediction correctness in the representation space. A DNN model trained on the MNIST dataset is applied to an operation domain of the USPS dataset. Despite the significant drop of accuracy from 97% to 68%, the model is still effective in grouping together examples with correct predictions (blue) and those with incorrect predictions (red). The representation space is reduced to a two-dimensional plane to visualize the effect.

Another insight for improving the efficacy and efficiency of the Gaussian Process is that the distribution of operational data in the sparse representation space is far from even. They can be nicely grouped into a small number (usually tens) of clusters, and the correlation of prediction correctness within a group is much stronger than that between groups. Consequently, instead of regression with a universal Gaussian Process, we carry out a Gaussian Process regression in each cluster.
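The grouping step can be illustrated with a tiny K-medoids style clustering, which we assume is the kind of method Algorithm 1 refers to. This is only a sketch: the naive initialization (first $k$ points), the Euclidean distance, and the toy 2-D points are assumptions, and it presumes the points are distinct so that no cluster becomes empty.

```python
import math

def euclid(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def k_medoids(points, k, dist=euclid, max_iter=50):
    """Alternate between assigning points to the nearest medoid and re-picking each medoid.
    Returns a dict mapping the index of each medoid to the indices of its cluster members."""
    medoids = list(range(k))  # naive deterministic start: the first k points
    clusters = {}
    for _ in range(max_iter):
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            nearest = min(medoids, key=lambda m: dist(p, points[m]))
            clusters[nearest].append(i)
        # The new medoid of a cluster is the member with the smallest total distance to the others.
        new_medoids = [
            min(members, key=lambda c: sum(dist(points[c], points[j]) for j in members))
            for members in clusters.values()
        ]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return clusters

# Hypothetical points in a 2-D representation space forming two obvious groups.
points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [0.0, 0.1], [5.0, 5.2]]
clusters = k_medoids(points, k=2)
```

The medoid indices (the dictionary keys) play the role of the cluster centers whose inputs are labeled first.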

This clustering not only reduces the computational cost of the Gaussian Processes, but also makes it possible to use different covariance functions for different clusters. This flexibility makes our estimation more accurate. Specifically, we use the RBF kernel

\[ k(z_1, z_2) = \exp\left( -\frac{\lVert z_1 - z_2 \rVert^2}{2 \ell^2} \right), \]

where the parameter $\ell$ (length scale) can be decided according to the distribution of original confidence produced by $M$.
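Putting Sections 3.2 and 3.3 together, the per-cluster computation can be sketched in plain Python as follows: an RBF kernel on representation vectors, a noise-free Gaussian Process posterior for $h$, and the mean of the resulting normal distribution truncated to $[0, 1]$. Several things here are assumptions for illustration: the observed $h$-value of a labeled input is taken as its 0/1 correctness minus the original confidence, a small jitter is added to the kernel matrix for numerical stability, and the length scale, data, and function names are made up.

```python
import math

def rbf(z1, z2, length_scale):
    """RBF kernel between two representation vectors."""
    sq = sum((a - b) ** 2 for a, b in zip(z1, z2))
    return math.exp(-sq / (2.0 * length_scale ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (small systems only)."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def gp_posterior(Z, h, z_new, length_scale=1.0, jitter=1e-8):
    """Posterior mean and variance of h(z_new) given observed h-values at the points Z."""
    n = len(Z)
    K = [[rbf(Z[i], Z[j], length_scale) + (jitter if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [rbf(z, z_new, length_scale) for z in Z]
    alpha = solve(K, h)        # K^{-1} h
    v = solve(K, k_star)       # K^{-1} k_*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = rbf(z_new, z_new, length_scale) - sum(k * w for k, w in zip(k_star, v))
    return mean, max(var, 0.0)

def phi(t):
    """Standard normal probability density function."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def Phi(t):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def truncated_mean(mu, sigma, lo=0.0, hi=1.0):
    """Mean of N(mu, sigma^2) truncated to [lo, hi]; assumes sigma > 0."""
    a, b = (lo - mu) / sigma, (hi - mu) / sigma
    return mu + sigma * (phi(a) - phi(b)) / (Phi(b) - Phi(a))

# Hypothetical cluster with two labeled points in a 2-D representation space.
Z = [[0.0, 0.0], [3.0, 0.0]]
h = [1.0 - 0.9, 0.0 - 0.8]  # first prediction correct (conf 0.9), second wrong (conf 0.8)
mean_near, var_near = gp_posterior(Z, h, [0.1, 0.0])
mean_far, var_far = gp_posterior(Z, h, [10.0, 10.0])
calibrated = truncated_mean(0.9 + mean_near, math.sqrt(var_near))
```

A point close to a labeled example inherits its correction with small variance, while a point far from all labeled examples keeps the prior (mean near 0, variance near 1), which is what makes the variance useful for deciding what to label next.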

3.4. Considering costs in decision

The cost of misclassification must be taken into account in real-world decision making. One can also measure how well a model is calibrated with the loss due to confidence error (LCE) against a given cost model.

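For example, let us assume a simple cost model in which the gain for a correct prediction is 1 and the loss for a false prediction is $u$. The net gain if we take action on a prediction for input $x$ will be $\mathbb{I}(x) - u \cdot (1 - \mathbb{I}(x))$, where $\mathbb{I}(x)$ is 1 if the prediction is correct and 0 otherwise. We further assume that there will be no cost to take no action when the expected net gain is negative. Then the actual gain for an input $x$ with estimated confidence $\hat{c}(x)$ will be

\[ g(x) = \begin{cases} \mathbb{I}(x) - u \cdot (1 - \mathbb{I}(x)) & \text{if } \hat{c}(x) \ge \lambda, \\ 0 & \text{if } \hat{c}(x) < \lambda, \end{cases} \]

where $\lambda = \frac{u}{1 + u}$ is the break-even threshold of confidence for taking action. On the other hand, if the confidence was perfect, i.e., $\hat{c}(x) = 1$ if the prediction is correct, and 0 otherwise, the total gain for dataset $D$ would be a constant $G_D = \sum_{i=1}^{N} \mathbb{I}(x_i)$. So the average LCE over a dataset $D$ with $N$ examples is $\mathrm{LCE}(D) = \frac{1}{N} \left( G_D - \sum_{i=1}^{N} g(x_i) \right)$:

\[ \mathrm{LCE}(D) = \frac{1}{N} \sum_{i=1}^{N} \Big( \mathbb{I}(x_i) \cdot \mathbf{1}[\hat{c}(x_i) < \lambda] + u \cdot (1 - \mathbb{I}(x_i)) \cdot \mathbf{1}[\hat{c}(x_i) \ge \lambda] \Big). \]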

With the Bayesian approach we do not have an exact $c(x)$ but a truncated normal distribution of it. If we take its expectation $\mathbb{E}[c(x)]$ as $\hat{c}(x)$, the above equations still hold. (Footnote 1: This is because here $\lambda$ is a constant. Things will be different if, for example, one puts higher stakes on higher confidence predictions. Considering the page limit, we will not elaborate on this issue, but the Bayesian approach allows for more flexibility in dealing with these cases.)

Cost-sensitive calibration aims at minimizing the LCE instead of the Brier score. Notice that calibrating confidence with Brier score generally reduces LCE. However, with a cost model, the optimization toward minimizing LCE can be more effective and efficient.
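The cost model of this section can be sketched in a few lines. We write the loss of a false prediction as `u` and the break-even threshold as `lam`; these names and the example data are assumptions for illustration. Acting on a prediction has non-negative expected net gain exactly when its confidence $c$ satisfies $c - u(1 - c) \ge 0$, i.e., $c \ge u / (1 + u)$.

```python
def break_even(u):
    """Confidence at or above which acting has non-negative expected net gain."""
    return u / (1.0 + u)

def average_lce(confidences, correct, u):
    """Average loss due to confidence error under the gain-1 / loss-u cost model."""
    lam = break_even(u)
    loss = 0.0
    for c, ok in zip(confidences, correct):
        acted = c >= lam
        if ok and not acted:
            loss += 1.0   # missed the gain of a correct prediction
        elif not ok and acted:
            loss += u     # paid the loss of a wrong prediction
    return loss / len(confidences)

# Hypothetical confidences and correctness flags, with a false prediction costing 4.
conf = [0.9, 0.7, 0.95, 0.6]
correct = [True, True, False, False]
lce = average_lce(conf, correct, u=4.0)
```

With `u = 4.0` the threshold is 0.8: the under-confident correct prediction (0.7) costs 1 and the over-confident wrong prediction (0.95) costs 4, giving an average LCE of 1.25.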

3.5. Selecting operational data to label

If the set of labeled operational data is given, we simply apply a Gaussian Process in each cluster in the representation space and get the posterior distribution for confidence $c(\cdot)$. However, if we can decide which operational data to label, we shall spend the budget for labeling more wisely.

Initially, we select the operational input at the center of each cluster to label, and apply a Gaussian Process in each cluster with this central input to compute the posterior probability distribution of the confidence. Then we shall select the most “helpful” input to label and repeat the procedure.

The insight for input selection is twofold. First, to reduce the uncertainty as much as possible, one should choose the input with maximal variance $\sigma^2(x)$. Second, to reduce the LCE as much as possible, one should pay more attention to those inputs with confidence near to the break-even threshold $\lambda$. So we choose $x^*$ as the next input to label:

\[ x^* = \arg\min_{x \in S} \frac{|\hat{c}(x) - \lambda|}{\sigma(x)}. \tag{19} \]

Putting all the ideas together, we have Algorithm 1 shown below. The algorithm is robust in that it does not rely on any hyperparameters except for the number of clusters. It is also conservative in that it does not change the predictions made by the model. As a result, it needs no extra validation data.

Input: Previously trained DNN model $M$, unlabeled dataset $S$ collected from operation domain $D$, and the budget $n$ for labeling inputs.
Output: Calibrated confidence function $\hat{c}(x)$ for $x \in D$.
Build Gaussian Process models:
1:  Divide dataset $S$ into $L$ clusters using the K-medoids method, and label the inputs $o_1, \dots, o_L$ that correspond to the centers of the clusters.
2:  Initialize the labeled set $T = \{o_1, \dots, o_L\}$.
3:  For each of the clusters, build a Gaussian Process model $gp_i$, $i = 1, \dots, L$.
4:  while $|T| < n$ do
5:     Select a new input $x^*$ for labeling, where $x^*$ is searched by Equation 19.
6:     Update the Gaussian Process corresponding to the cluster containing $x^*$.
7:     Update the labeled set $T \leftarrow T \cup \{x^*\}$.
8:  end while
Compute confidence value for input $x$:
9:  Find the Gaussian Process model corresponding to the cluster containing input $x$.
10:  Compute $c(x)$ according to Equation 14.
11:  Output the estimated calibrated confidence $\hat{c}(x)$.
Algorithm 1. Operational confidence calibration
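The labeling loop of Algorithm 1 can be sketched as a small skeleton in which the Gaussian Process fitting and prediction are passed in as callables. Everything named here is hypothetical; in particular, the selection score (distance of the calibrated confidence to the break-even threshold divided by the posterior standard deviation) is our reading of the two selection criteria in Section 3.5, not a confirmed reproduction of the original formula.

```python
def selection_score(conf, sigma, lam):
    """Lower is better: close to the break-even threshold and highly uncertain."""
    return abs(conf - lam) / sigma

def calibrate_loop(clusters, centers, oracle, fit_gp, predict, lam, budget):
    """Skeleton of the labeling loop.
    clusters: dict cluster_id -> list of input ids
    centers:  dict cluster_id -> id of the central input of that cluster
    oracle:   callable id -> bool, the costly label (is the prediction correct?)
    fit_gp:   callable (cluster_id, {id: bool}) -> model for that cluster
    predict:  callable (model, id) -> (calibrated confidence, posterior std)
    """
    # Steps 1-3: label the cluster centers and build one model per cluster.
    labeled = {cid: {centers[cid]: oracle(centers[cid])} for cid in clusters}
    models = {cid: fit_gp(cid, labeled[cid]) for cid in clusters}
    spent = len(clusters)
    # Steps 4-8: repeatedly label the most helpful input until the budget is used up.
    while spent < budget:
        pool = [(cid, x) for cid, xs in clusters.items() for x in xs if x not in labeled[cid]]
        if not pool:
            break

        def score(item):
            conf, sigma = predict(models[item[0]], item[1])
            return selection_score(conf, sigma, lam)

        cid, x_star = min(pool, key=score)
        labeled[cid][x_star] = oracle(x_star)
        models[cid] = fit_gp(cid, labeled[cid])  # only the affected cluster is refit
        spent += 1
    return models, labeled

# Toy run with stub components (all hypothetical).
clusters = {0: [0, 1, 2], 1: [3, 4]}
centers = {0: 1, 1: 3}
models, labeled = calibrate_loop(
    clusters, centers,
    oracle=lambda x: x % 2 == 0,
    fit_gp=lambda cid, lab: dict(lab),
    predict=lambda model, x: (0.45 + 0.1 * x, 0.1),
    lam=0.8, budget=4,
)
```

In the toy run, the two cluster centers consume the first two labels, and the remaining budget goes to the inputs whose stub confidence is closest to the threshold 0.8.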

3.6. Discussions

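To understand why our approach is more effective than conventional confidence calibration techniques, one can consider the three-part decomposition of the Brier score (Murphy, 1973)

\[ BS = \sum_{m=1}^{B} \frac{|D_m|}{N} \left( \mathrm{conf}(D_m) - \mathrm{acc}(D_m) \right)^2 - \sum_{m=1}^{B} \frac{|D_m|}{N} \left( \mathrm{acc}(D_m) - \mathrm{acc} \right)^2 + \mathrm{acc} \, (1 - \mathrm{acc}), \]

where $D_m$ is the set of inputs whose confidence falls into the interval $I_m = \left( \frac{m-1}{B}, \frac{m}{B} \right]$, and $\mathrm{acc}(D_m)$ and $\mathrm{conf}(D_m)$ are the expected accuracy and confidence in $D_m$, respectively. The $\mathrm{acc}$ is the accuracy of dataset $D$.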

In this decomposition, the first term is called reliability, which measures the distance between the confidence and the true posterior probabilities. The second term is resolution, which measures the distinctions of the predictive probabilities. The final term is uncertainty, which is only determined by the accuracy.
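The three terms can be computed directly by binning the confidence values. The sketch below uses equal-width bins that are closed on the left, which is an implementation convenience and an assumption; the decomposition reproduces the Brier score exactly only when all confidences within a bin are equal, as in the made-up data used here.

```python
def brier_decomposition(confidences, correct, n_bins=10):
    """Return (reliability, resolution, uncertainty) using equal-width confidence bins."""
    n = len(confidences)
    acc = sum(correct) / n
    bins = {}
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # a confidence of 1.0 goes to the last bin
        bins.setdefault(idx, []).append((c, ok))
    reliability = resolution = 0.0
    for members in bins.values():
        w = len(members) / n
        conf_m = sum(c for c, _ in members) / len(members)
        acc_m = sum(ok for _, ok in members) / len(members)
        reliability += w * (conf_m - acc_m) ** 2
        resolution += w * (acc_m - acc) ** 2
    return reliability, resolution, acc * (1.0 - acc)

# Hypothetical data: two confidence levels, each landing in its own bin.
conf = [0.95, 0.95, 0.95, 0.95, 0.65, 0.65]
correct = [1, 1, 1, 0, 1, 0]
rel, res, unc = brier_decomposition(conf, correct)
```

A lower reliability term and a higher resolution term both reduce the Brier score, while the uncertainty term depends only on the overall accuracy.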

In conventional confidence calibration, the model is assumed to be well trained and to work well in terms of accuracy. In addition, the grouping of $D_m$ is acceptable because the confidence error is regarded as a systematic error. So one only cares about minimizing the reliability. This is exactly what conventional calibration techniques such as Temperature Scaling are designed for.

However, in operational calibration, the model itself suffers from the domain shift, and thus may be less accurate than expected. Even worse, the grouping of $D_m$ is problematic because the confidence error is unsystematic and the inputs in $D_m$ are not homogeneous anymore. Consequently, we need to maximize the resolution and minimize the reliability at the same time. Our approach achieves these two goals with more discriminative calibration that is based on the features of individual inputs rather than their logits or confidence values.

This observation also indicates that the benefit of our approach over temperature scaling will diminish if the confidence error happens to be systematic. For example, if the only divergence of the data in the operation domain is that some part of an image is missing, our approach will perform similarly to or even slightly worse than temperature scaling. However, as can be seen from later experiments, most operational situations have more or less domain shifts that temperature scaling cannot handle well.

In addition, when the loss for a false prediction is very small (as observed from experiments in the next section), our approach will be ineffective in reducing LCE. This is expected because in this situation one should accept almost all predictions, even when their confidence values are low.

4. Empirical evaluation

We conducted a series of experiments to answer the following questions:

  1. Is our approach to operational calibration generally effective in different tasks?

  2. How effective is it, compared with alternative approaches?

  3. How efficient is it, in the sense of saving labeling effort?

We implemented our approach on top of the PyTorch 1.1.0 DL framework. The code, together with the experiment data, is available at https://figshare.com/s/5f6096ca8f413ef31eb4. The experiments were conducted on a GPU server with two Intel Xeon Gold 5118 CPUs @ 2.30GHz, 400GB RAM, and 10 GeForce RTX 2080 Ti GPUs. The server ran Ubuntu 16.04 with GNU/Linux kernel 4.4.0.

The execution time of our operational calibration depends on the size of the dataset used, and the architecture of the DNN model. For the tasks listed below, the execution time varied from about 3.5s to 50s, which we regard as totally acceptable.

4.1. Experiment tasks

To evaluate the general efficacy of our approach, we designed six tasks that differed in application domain (image recognition and natural language processing), operation dataset size (from hundreds to thousands), classification difficulty (from 2 to 1000 classes), and model complexity (number of parameters). To make our simulation of domain shifts realistic, in four tasks we adopted third-party operational datasets often used in transfer learning research, and in the other two tasks we used mutations that are also frequently made in the machine learning community. Figure 3 shows some example images from the origin and operation domains. Table 1 lists the settings of the six tasks.

No.  Model         Dataset (origin → operation)              Origin Acc. (%)  Operation Acc. (%)  Operation Size†
1    LeNet-5       Digit recognition (MNIST → USPS)          96.9             68.0                900
2    RNN           Polarity (v1.0 → v2.0)                    99.0             83.4                1,000
3    ResNet-18     Image classification (CIFAR-10 → STL-10)  93.2             47.1                5,000
4    VGG-19        CIFAR-100 (orig. → crop)                  72.0             63.6                5,000
5    ResNet-50     ImageCLEF (c → p)                         99.2             73.2                480
6    Inception-v3  ImageNet (orig. → down-sample)            77.5             45.3                5,000

  † Size refers to the maximum number of operation data available for labeling.

Table 1. Dataset and model settings of tasks

In Task 1 we applied a LeNet-5 model originally trained with the images from the MNIST dataset (LeCun et al., 1998) to classify images from the USPS dataset (Friedman et al., 2001). Both of them are popular handwritten digit recognition datasets consisting of single-channel images of size 16×16×1. The size of the training dataset was 2,000, and the size of the operation dataset was 1,800. We reserved 900 of the 1,800 operational data for testing, and used the other 900 for operational calibration.

Task 2 focused on natural language processing. Polarity is a dataset for sentiment analysis (Pang et al., 2002). It consists of sentences labeled with corresponding sentiment polarity (i.e., positive or negative). We chose Polarity-v1.0, which contained 1,400 movie reviews collected in 2002, as the training set. Polarity-v2.0, which contained 2,000 movie reviews collected in 2004, was used as the data from the operation domain. We also reserved half of the operation data for testing.

In Task 3 we used two classic image classification datasets, CIFAR-10 (Krizhevsky et al., 2009) and STL-10 (Coates et al., 2011). The former consists of 60,000 32×32×3 images in 10 classes, and each class contains 6,000 images. The latter has only 13,000 images, but the size of each image is 96×96×3. We used the whole CIFAR-10 dataset to train the model. The operation domain was represented by 8,000 images collected from STL-10, of which 5,000 were used for calibration, and the other 3,000 were reserved for testing.

Task 4 used the dataset CIFAR-100, which was more difficult than CIFAR-10 and contained 100 classes with 600 images in each. We trained the model with the whole training dataset of 50,000 images. To construct the operation domain, we randomly cropped the remaining 10,000 images. Half of these cropped images were used for calibration and the other half for testing.

Task 5 used the image classification dataset from the ImageCLEF 2014 challenge (Müller et al., 2010). It is organized with 12 common classes derived from three different domains: ImageNet ILSVRC 2012 (i), Caltech-256 (c), and Pascal VOC 2012 (p). We chose the dataset (c) as the origin domain and dataset (p) as the operation domain. Due to the extremely small size of the dataset, we divided the dataset (p) for calibration and testing by the ratio 4:1.

Finally, Task 6 dealt with an extremely difficult situation. ImageNet is a large-scale image classification dataset containing more than 1.2 million 224×224×3 images across 1,000 categories (Deng et al., 2009). The pre-trained model Inception-v3 was adopted for evaluation. The operation domain was constructed by down-sampling 10,000 images from the original test dataset. Again, half of the images were reserved for testing.

(a) CIFAR-10-origin
(b) CIFAR-10-operation
(c) ImageCLEF-origin
(d) ImageCLEF-operation
Figure 3. Examples of origin and operation domains. The left column shows images collected from CIFAR-10 and ImageCLEF-c, which are used as the original data. The right column shows images collected from STL-10 and ImageCLEF-p, which are used as the operational data.

4.2. Efficacy of operational calibration

Table 2 gives the Brier scores of the confidence before and after operational calibration. In these experiments all operational data listed in Table 1 (not including the reserved test data) were labeled and used in the calibration. The result unambiguously confirmed the general efficacy of our approach and its superiority over alternative approaches. In the following we elaborate on its performance in different situations and how it was compared with other approaches.

No.  Model         Orig.  GPR    RFR    SVR    TS     SAR
1    LeNet-5       0.207  0.114  0.126  0.163  0.183  0.320
2    RNN           0.203  0.102  0.107  0.202  0.185  0.175
3    ResNet-18     0.474  0.101  0.121  0.115  0.387  0.308
4    VGG-19        0.216  0.158  0.162  0.170  0.217  0.529
5    ResNet-50     0.226  0.179  0.204  0.245  0.556  0.364
6    Inception-v3  0.192  0.161  0.167  0.217  0.191  -
  • Orig.: before calibration; GPR: our Gaussian Process-based approach; RFR: Random Forest Regression in the representation space; SVR: Support Vector Regression in the representation space; TS: Temperature Scaling (Guo et al., 2017); SAR: regression with Surprise values (Kim et al., 2019). We failed to evaluate SAR on task 6 because it took too long to run on the huge dataset.

Table 2. Brier score of different calibration methods
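
As a reminder of what Table 2 measures, the Brier score can be read as the mean squared gap between the confidence attached to each prediction and the 0/1 indicator of whether that prediction was correct (lower is better). The sketch below assumes this predicted-label form of the score; the function name and the toy inputs are hypothetical and are not taken from the experiments.

```python
def brier_score(confidences, correct):
    """Mean squared gap between confidence and 0/1 correctness (lower is better)."""
    if len(confidences) != len(correct):
        raise ValueError("inputs must have the same length")
    n = len(confidences)
    return sum((c - float(y)) ** 2 for c, y in zip(confidences, correct)) / n


# Toy data (illustrative only): four predictions, the second one is wrong
# although the model reported a confidence of 0.8 for it.
conf = [0.9, 0.8, 0.95, 0.6]
hit = [1, 0, 1, 1]
score = brier_score(conf, hit)
print(f"Brier score: {score:.6f}")
```

In this toy case the single high-confidence error contributes most of the score, which is the kind of error operational calibration is meant to remove.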

4.2.1. Calibration when fine-tuning is ineffective

A machine learning engineer might first consider applying fine-tuning tricks to deal with the problem of domain shift. However, for non-trivial tasks, such as our tasks 4, 5, and 6, it can be very difficult, if at all possible, to fine-tune the DNN model with small operational datasets. Figure 4 shows the vain effort in fine-tuning the models with all the operational data (excluding test data). We tried all tricks including data augmentation, weight decay, and regularization to avoid over-fitting but failed to improve the test accuracy.

(a) Task-4
(b) Task-5
(c) Task-6
Figure 4. The fine-tuning of difficult tasks

Fortunately, our operational calibration worked quite well in these difficult situations. In addition to the Brier scores reported in Table 2, we can also see the saving of LCE for task 4 in Figure 5. Our approach reduced about a half of the LCE when , which indicates its capability in reducing high-confidence errors.

Figure 5. Loss due to confidence error

4.2.2. Calibration when fine-tuning is effective

In easier situations where fine-tuning works, we can still calibrate the model to give more accurate confidence. Note that effective fine-tuning does not necessarily provide accurate confidence. One can first apply fine-tuning until the test accuracy stops increasing, and then calibrate the fine-tuned model with the remaining operation data.

For example, we successfully fine-tuned the models in our tasks 1, 2, and 3. (Footnote 2: Here we used some information about the training process, such as the learning rates, weight decays, and training epochs. Fine-tuning could be more difficult because this information could be unavailable in real-world operation settings.) Task 1 was the easiest to fine-tune: its accuracy kept increasing until all 900 operational examples were exhausted. Task 2 was binary classification; in this case our calibration was actually an effective fine-tuning technique. Figure 6 shows that our approach was more effective and efficient than conventional fine-tuning as it converged more quickly. For task 3, with fine-tuning the accuracy stopped increasing at about 79%, with about 3,000 operational examples. Figure 7 shows that the Brier score would decrease more if we spent the remaining operational data on calibration rather than on continued fine-tuning.

Figure 6. Operational calibration vs. Fine-tuning: Task 2
Figure 7. Operational calibration after fine-tuning: Task 3

4.3. Comparing with other calibration methods

First, we found that our approach significantly outperformed Temperature Scaling (Hinton et al., 2015), which is reported to be the most effective conventional confidence calibration method (Guo et al., 2017). As shown in Table 2, Temperature Scaling was hardly effective, and it even worsened the confidence in tasks 4 and 5. We observed that its poor performance in these cases came from the significantly lowered resolution part of the Brier score, which confirmed the analysis in Section 3.6. For example, in task 3, with Temperature Scaling the reliability decreased from 0.196 to 0.138, but the resolution dropped from 0.014 to 0.0. In fact, in this case the confidence values were all very close to 0.5 after scaling. However, with our approach the reliability decreased to 0.107, and the resolution also increased to 0.154.
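
The reliability and resolution terms quoted here can be illustrated with the standard decomposition of the Brier score into reliability minus resolution plus uncertainty, where a smaller reliability term and a larger resolution term are better. The sketch below is a minimal illustration under the assumption that predictions are grouped by identical confidence values (a binned variant would group by confidence interval instead); the function name and toy inputs are hypothetical.

```python
from collections import defaultdict


def brier_decomposition(confidences, correct):
    """Return (reliability, resolution, uncertainty), grouping by confidence value.

    With this grouping, Brier score = reliability - resolution + uncertainty.
    """
    n = len(confidences)
    base_rate = sum(correct) / n
    groups = defaultdict(list)
    for c, y in zip(confidences, correct):
        groups[c].append(y)
    reliability = 0.0
    resolution = 0.0
    for c, ys in groups.items():
        acc = sum(ys) / len(ys)      # observed accuracy within the group
        weight = len(ys) / n         # share of predictions in the group
        reliability += weight * (c - acc) ** 2
        resolution += weight * (acc - base_rate) ** 2
    uncertainty = base_rate * (1.0 - base_rate)
    return reliability, resolution, uncertainty


# Every prediction gets the same confidence: reliability is perfect here,
# but the confidence cannot tell right from wrong, so resolution is zero.
flat = brier_decomposition([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])

# Confidence separates correct from false predictions: resolution is high.
sharp = brier_decomposition([0.9, 0.9, 0.1, 0.1], [1, 1, 0, 0])
```

The first toy case mirrors the situation described above, in which all confidence values collapse to roughly the same number: the reliability term can look good while the resolution term vanishes.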

Second, we also tried to calibrate confidence based on the surprise value, which measures the difference in a DL system's behavior between the input and the training data (Kim et al., 2019). We thought it could be effective because it also leveraged the distribution of examples in the representation space. We fitted a polynomial regression between the confidence adjustments and the likelihood-based surprise values. Unfortunately, it did not work for most of the cases. We believe the reason is that surprise values are scalars and cannot provide enough information for operational calibration.

Finally, to examine whether Gaussian Process Regression is the right choice for our purpose, we also experimented with two standard regression methods, viz. Random Forest Regression (RFR) and Support Vector Regression (SVR), in our framework. We used a linear kernel for SVR and ten decision trees for RFR. In most cases, the non-linear RFR performed better than the linear SVR, and both of them performed better than Temperature Scaling but worse than our approach. The result indicates that (1) calibration based on the features extracted by the model rather than the logits computed by the model is crucial, (2) the confidence error is non-linear and unsystematic, and (3) the Gaussian Process as a Bayesian method can provide better estimation of the confidence.
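
To give a feel for what regression in the representation space means, the following is a rough pure-Python sketch of Gaussian Process regression used to adjust confidence. It is not the implementation evaluated in this paper. It assumes a zero-mean prior on the confidence residual (observed correctness minus original confidence), a squared-exponential kernel with unit length scale, and a noise level of 0.1; all function names and the 2-D toy features are hypothetical. A real setting would use the model's learned feature vectors and a numerical library rather than the small linear solver included here for self-containment.

```python
import math


def rbf(a, b, length_scale=1.0):
    """Squared-exponential kernel between two feature vectors."""
    sq = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-0.5 * sq / length_scale ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def calibrate(feats, orig_conf, correct, query_feat, query_conf, noise=0.1):
    """Add the GP posterior mean of the confidence residual to the original confidence."""
    n = len(feats)
    K = [[rbf(feats[i], feats[j]) + (noise if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    resid = [y - c for y, c in zip(correct, orig_conf)]
    alpha = solve(K, resid)
    shift = sum(rbf(query_feat, f) * a for f, a in zip(feats, alpha))
    return min(1.0, max(0.0, query_conf + shift))  # keep the result in [0, 1]


# Toy labeled operational data: the model is wrong near (0, 0), right near (3, 3),
# yet reports confidence 0.9 everywhere.
feats = [(0.0, 0.0), (0.2, 0.1), (3.0, 3.0), (3.1, 2.9)]
orig_conf = [0.9, 0.9, 0.9, 0.9]
correct = [0, 0, 1, 1]

low = calibrate(feats, orig_conf, correct, (0.1, 0.0), 0.9)   # near the errors
high = calibrate(feats, orig_conf, correct, (3.0, 3.1), 0.9)  # near the hits
```

Two inputs with the same original confidence end up with very different calibrated values depending on where they fall in the feature space, which is something a single global rescaling of the logits cannot express.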

4.4. Efficiency of operational calibration

In the above we have already shown that our approach worked with small operation datasets that were insufficient for fine-tuning (Tasks 4, 5, and 6). In fact, the Gaussian Process-based approach has the nice property that it starts to work with very few labeled examples. We experimented with the approach using the input selection method presented in Section 3.5. We focused on the number of high-confidence false predictions, which decreased as more and more operational examples were labeled and used.

We experimented with all the tasks but labeled only 10% of the operational data. Table 3 shows the numbers of high-confidence false predictions before and after operational calibration. As a reference, we also include the numbers of high-confidence correct predictions. We can see that most of the high-confidence false predictions were eliminated. It is expected that there were fewer high-confidence correct predictions after calibration, because the actual accuracy of the models dropped. The much lowered LCE scores, which considered both the loss in lowering the confidence of correct predictions and the gain in lowering the confidence of false predictions, indicate that the overall improvements were significant.

No.  Model         Conf. >  Correct pred.   False pred.   LCE
1    LeNet-5       0.8      473 → 309.1     126 → 24.3    0.143 → 0.089
                   0.9      417 → 141.9     74 → 2.5      0.096 → 0.055
2    RNN           0.8      512 → 552.9     118 → 39.9    0.162 → 0.091
                   0.9      482 → 261.3     106 → 12.0    0.132 → 0.070
3    ResNet-18     0.8      1350 → 839.2    1372 → 59.7   0.370 → 0.054
                   0.9      1314 → 424.0    1263 → 9.4    0.358 → 0.041
4    VGG-19        0.8      1105 → 392.5    583 → 46.9    0.127 → 0.070
                   0.9      772 → 142.8     280 → 9.3     0.074 → 0.038
5    ResNet-50     0.8      53 → 26.9       16 → 5.2      0.162 → 0.136
                   0.9      46 → 26.9       10 → 2.0      0.108 → 0.064
6    Inception-v3  0.8      1160 → 692.0    265 → 63.6    0.087 → 0.073
                   0.9      801 → 554.1     137 → 40.2    0.054 → 0.041
  • Each entry gives the value before → after calibration. We ran each experiment 10 times and computed the average numbers.

Table 3. Reducing high-confidence false predictions with 10% operational data labeled

Note that for tasks 4, 5, and 6, the usual fine-tuning tricks did not work even with all the operational data labeled. With our operational calibration, using only about 10% of the data, one can avoid about 97%, 80%, and 71% of the high-confidence (above 0.9) errors, respectively.
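
The percentages quoted above can be checked with simple arithmetic. The sketch below assumes that they refer to the false-prediction counts in the 0.9 rows of Table 3, read as before and after pairs (280 and 9.3 for task 4, 10 and 2.0 for task 5, 137 and 40.2 for task 6); the variable and function names are hypothetical.

```python
# (before, after) counts of false predictions with confidence above 0.9,
# as read from Table 3 under the assumption stated in the lead-in.
false_pred_at_09 = {
    "Task 4 (VGG-19)": (280, 9.3),
    "Task 5 (ResNet-50)": (10, 2.0),
    "Task 6 (Inception-v3)": (137, 40.2),
}


def reduction_percent(before, after):
    """Share of high-confidence errors removed by calibration, in percent."""
    return 100.0 * (before - after) / before


reductions = {
    task: reduction_percent(before, after)
    for task, (before, after) in false_pred_at_09.items()
}
for task, pct in reductions.items():
    print(f"{task}: {pct:.1f}% of high-confidence errors avoided")
```

Rounded to whole percentages, these values agree with the figures stated in the text.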

For a visual illustration of the efficiency of our approach, Figure 8 plots the proportions of high-confidence false predictions in all predictions for Task 3. Other tasks are similar and omitted here to save space. It is interesting to see that: (1) most of the high-confidence false predictions were identified very quickly, and (2) the approach was conservative, but the conservativeness was gradually remedied as more labeled operational data were used.

(a) Task-1
(b) Task-2
(c) Task-3
(d) Task-4
(e) Task-5
(f) Task-6
Figure 8. The proportion curve of high confidence inputs

5. Related work

Operational calibration is generally related to the testing of deep learning systems in the software engineering community, and the confidence calibration, transfer learning, and active learning in the machine learning community. We briefly overview related work in these directions and highlight the connections and differences between our work and them.

5.1. Software testing for deep learning systems

Research in this area can be roughly classified into four categories according to the kind of defects targeted.

  • Defects in DL programs. This line of work focuses on the bugs in the code of DL frameworks. For example, Pham et al. proposed to test the implementation of deep learning libraries (TensorFlow, CNTK and Theano) through differential testing (Pham et al., 2019). Odena et al. used fuzzing techniques to expose numerical errors in matrix multiplication operations (Odena et al., 2019).

  • Defects in DL models. Regarding trained DNN models as pieces of software artifact, and borrowing the idea of structural coverage in conventional software testing, a series of coverage criteria have been proposed for the testing of DNNs, for example, DeepXplore (Pei et al., 2017), DeepGauge (Ma et al., 2018), DeepConcolic (Sun et al., 2018), and Surprise Adequacy (Kim et al., 2019), to name but a few.

  • Defects in training datasets. Another critical element in machine learning is the dataset. There exists research aimed at debugging and fixing errors in polluted training datasets. For example, PSI identifies root causes (e.g., incorrect labels) of data errors by efficiently computing the Probability of Sufficiency scores through probabilistic programming (Chakarov et al., 2016).

  • Defects due to improper inputs. A DNN model cannot handle well inputs outside the distribution for which it is trained. Thus a defensive approach is to detect such inputs. For example, Wang et al.'s approach checked whether an input is normal or adversarial by integrating statistical hypothesis testing and model mutation testing (Wang et al., 2019a). More work in this line can be found in the machine learning literature under the name of out-of-distribution detection (Shalev et al., 2018).

For a more comprehensive survey on the testing of machine learning systems, one can consult Zhang et al. (Zhang et al., 2019).

The major difference of our work, compared with this research, is that it is operational, i.e., focusing on how well a DNN model will work in a given operation domain. As discussed in Section 2, without considering the operation domain, it is often difficult to tell whether a phenomenon of a DNN model is a bug or a feature (Ilyas et al., 2019; Li et al., 2019a).

An exception is the recent proposal of operational testing for the efficient estimation of the accuracy of a DNN model in the field (Li et al., 2019b). Arguably operational calibration is more challenging and more rewarding than operational testing, because the latter only tells the overall performance of a model in an operation domain, but the former tells when it works well and when not.

5.2. Confidence calibration in DNN training

Confidence calibration is important for training high quality classifiers. There is a plethora of proposals on this topic in the machine learning community (Niculescu-Mizil and Caruana, 2005; Naeini et al., 2015; Zadrozny and Elkan, 2002; Flach, 2016; Guo et al., 2017). Apart from the Temperature Scaling discussed in Section 2.1, Isotonic regression (Zadrozny and Elkan, 2002), Histogram binning (Zadrozny and Elkan, 2001) and Platt scaling (Platt, 1999) are also often used. Isotonic regression is a non-parametric approach that employs the least square method with a non-decreasing and piecewise constant fitted function. Histogram binning divides confidences into mutually exclusive bins and assigns the calibrated confidences by minimizing the bin-wise squared loss. Platt scaling is a generalized version of Temperature Scaling. It adds a linear transformation between the logit layer and the softmax layer, and optimizes the parameters with the NLL loss. However, according to Guo et al., Temperature Scaling is often the most effective approach.

As discussed earlier in Section 3.6, the problem of these calibration methods is that they regard confidence errors as systematic errors, which is usually not the case in the operation domain. Technically, these calibration methods are effective in minimizing the reliability part of the Brier score, but ineffective in dealing with the problem in the resolution part.
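
To make the point about systematic errors concrete, the following minimal sketch implements Temperature Scaling on toy binary logits. Guo et al. fit the temperature by minimizing the NLL on held-out data; here a coarse grid search stands in for the optimizer, and the function names, grid, and toy logits are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature (T > 1 softens the output)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, temperature):
    """Average negative log-likelihood of the true labels at a given temperature."""
    loss = 0.0
    for logits, y in zip(logit_rows, labels):
        loss -= math.log(softmax(logits, temperature)[y])
    return loss / len(labels)


def fit_temperature(logit_rows, labels, grid=None):
    """Pick the NLL-minimizing temperature from a coarse grid (stand-in for an optimizer)."""
    grid = grid or [0.25 * k for k in range(1, 41)]  # 0.25, 0.50, ..., 10.0
    return min(grid, key=lambda t: nll(logit_rows, labels, t))


# Toy over-confident model: every prediction has the same large margin,
# but only three of the four predictions are correct.
logits = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 0.0]]
labels = [0, 0, 1, 1]

T = fit_temperature(logits, labels)
before = [max(softmax(z)) for z in logits]
after = [max(softmax(z, T)) for z in logits]
```

In this toy case the scaled confidences move from about 0.98 to roughly 0.75, which matches the 75% accuracy, so the systematic over-confidence is repaired. The false prediction, however, still receives exactly the same confidence as the three correct ones: a single global temperature cannot separate them, which is the resolution problem described above.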

In addition, Flach discussed the problem of confidence calibration from a decision-theoretic perspective (Flach, 2016). However, the confidence error caused by domain shift is not explicitly addressed.

5.3. Transfer learning and active learning

Our approach to operational calibration borrowed ideas from transfer learning (Pan and Yang, 2009) and active learning (Settles, 2009). Transfer learning (or domain adaptation) aims at training a model from a source domain (origin domain in our terms) that can be generalized to a target domain (operation domain), despite the dataset shift (Ng, 2016) between the domains. The key is to learn features that are transferable between the domains.

However, transfer learning techniques usually require data from both the source and target domains. In contrast, operational calibration often has to work with limited data from the operation domain and no data from the origin domain. It does not aim at improving prediction accuracy in the operation domain, but it may leverage the existing transferability of features learned by the DNN model. In addition, transfer learning, if applicable, does not necessarily produce well-calibrated models, and operational calibration can further improve the accuracy of confidence (cf. Figure 7).

Active learning aims at reducing the cost of labeling training data by deliberately selecting and labeling inputs from a large set of unlabeled data. For Gaussian Process Regression, there exist different input selection strategies (Seo et al., 2000; Kapoor et al., 2007; Pasolli and Melgani, 2011). We tried many of them, such as those based on uncertainty (Seo et al., 2000), on density (Zhu et al., 2009), and on disagreement (Pasolli and Melgani, 2011), but failed to find a universally effective strategy that can improve the data efficiency of our approach. They were sensitive to the choices of the initial inputs, the models, and the distribution of examples (Settles, 2009). However, we found that the combination of cost-sensitive sampling bias and uncertainty can help in reducing high-confidence false predictions, especially in a cost-sensitive setting.

6. Conclusion

Software quality assurance for systems incorporating DNN models is urgently needed. This paper focuses on the problem of operational calibration, which detects and fixes the errors in the confidence given by a DNN model for its predictions in a given operation domain. A Bayesian approach to operational calibration is given. It solves the problem with Gaussian Process Regression, which leverages the locality of the operational data, and also their prediction correctness, in the representation space. The approach achieved impressive efficacy and efficiency in experiments with popular datasets and DNN models.

Theoretical analysis on aspects such as the data efficiency and the convergence of our algorithm is left for future work. In addition, we plan to investigate operational calibration methods for real-world decisions with more complicated cost models.

