Operational Calibration: Debugging Confidence Errors for DNNs in the Field
Abstract.
Trained DNN models are increasingly adopted as integral parts of software systems. However, they are often overconfident, especially in practical operation domains where slight divergence from their training data almost always exists. To minimize the loss due to inaccurate confidence, operational calibration, i.e., calibrating the confidence function of a DNN classifier against its operation domain, becomes a necessary debugging step in the engineering of the whole system.
Operational calibration is difficult considering the limited budget for labeling operation data and the weak interpretability of DNN models. We propose a Bayesian approach to operational calibration that gradually corrects the confidence given by the model under calibration with a small number of labeled operational data deliberately selected from a larger set of unlabeled operational data. Exploiting the locality of the learned representation of the DNN model and modeling the calibration as Gaussian Process Regression, the approach achieves impressive efficacy and efficiency. Comprehensive experiments with various practical datasets and DNN models show that it significantly outperformed alternative methods, and in some difficult tasks it eliminated about 71% to 97% of high-confidence errors with only about 10% of the minimal amount of labeled operation data needed for practical learning techniques to barely work.
1. Introduction
Deep learning (DL) has demonstrated performance near to, or even better than, that of humans in some difficult tasks, such as image classification and speech recognition (LeCun et al., 2015; Goodfellow et al., 2016). Deep Neural Network (DNN) models are increasingly adopted in high-stakes application scenarios such as medical diagnostics (Obermeyer and Emanuel, 2016) and self-driving cars (Bojarski et al., 2016). However, it is not uncommon that DNN models perform poorly in the field (Riley, 2019). Interest in the quality assurance of DNN models as integral parts of software systems is surging in the software engineering community (Pei et al., 2017; Ma et al., 2019; Sun et al., 2018; Zhang et al., 2018; Kim et al., 2019; Zhang et al., 2019).
A particular problem of using a previously trained DNN model in an operation domain is that the model may not only make more mistakes than expected in its predictions, but also give erroneous confidence values for these predictions. Arguably the latter issue is more harmful, because with accurate confidence information the model could be at least partially usable by accepting only high-confidence predictions.
The problem comes from the frequently occurring divergence between the original data on which the model was trained and the data in the operation domain, often called domain shift or dataset shift (Ng, 2016) in the machine learning literature. It can be difficult to handle and beyond the reach of usual machine learning tricks such as fine-tuning and transfer learning (Pan and Yang, 2009; Wang et al., 2014), because of two practical restrictions often encountered. First, when the DNN model is provided by a third party, its training data are sometimes unavailable due to privacy and proprietary limitations (Zhou, 2016; Shokri and Shmatikov, 2015; Konečnỳ et al., 2016). Second, one can only use a small number of labeled operation data, because it is expensive to label the data collected in the field. For example, in an AI-assisted clinical medicine scenario, surgical biopsies could be involved in the labeling of radiology or pathology images.
We consider operational calibration, which corrects the errors in the confidence of a DNN model's predictions on individual inputs in a given operation domain. It does not change the predictions made by the model, but tells when the model works well and when it does not. In this sense, operational calibration is a necessary debugging step that should be incorporated in the engineering of the whole system. Operational calibration is challenging because what it fixes is a function, not just a value. It also needs to be efficient, i.e., to reduce the effort of labeling operation data.
It is natural to model operational calibration as a kind of nonparametric Bayesian inference and solve it with Gaussian Process Regression (Rasmussen and Williams, 2005). We take the original confidence of the DNN model as the prior, and gradually calibrate the confidence with the evidence collected by selecting and labeling operation data. The key insight into effective and efficient regression comes from the following observations. First, the DNN model, although suffering from the domain shift, can be used as a feature extractor with which unlabeled operational data can be nicely clustered (Zhu, 2005; Shu et al., 2018). In each cluster, the prediction correctness of an example is correlated with that of other examples, and the correlation can be effectively estimated with the distance between the two examples in the feature space. Second, the Gaussian Process is able to quantify the uncertainty after each step, which can be used to guide the selection of operational data to label efficiently.
Systematic empirical evaluations showed that the approach is promising. It significantly outperformed existing calibration methods in both efficacy and efficiency in all settings we tested. In some difficult tasks it eliminated about 71% to 97% of high-confidence errors with only about 10% of the minimal amount of labeled operation data needed for practical learning techniques to barely work.
In summary, the contributions of this paper are:

Posing the problem of operational calibration for DNN models in the field, and casting it into a Bayesian inference framework.

Proposing a Gaussian Process-based approach to operational calibration, which leverages the representation learned by the DNN model under calibration and the locality of confidence errors in this representation.

Evaluating the approach systematically. Experiments with various datasets and models confirmed the general efficacy and efficiency of our approach.
The rest of this paper is organized as follows. We first discuss the general need for operational quality assurance for DNNs in Section 2, and then focus on the problem of, and our approach to, operational calibration in Section 3. The approach is evaluated empirically in Section 4. We briefly overview related work and highlight their differences from ours in Section 5 before concluding the paper with Section 6.
2. DNN and operational quality assurance
Deep learning is intrinsically inductive (Goodfellow et al., 2016). However, conventional software engineering is mostly deductive, as evidenced by its fundamental principle of specification-implementation consistency. Adopting DNN models as integral parts of software systems thus poses new challenges for quality assurance. To provide the background for our work on operational calibration, we first briefly introduce DNNs and their prediction confidence, and then discuss their quality assurance for given operation domains.
2.1. DNN classifier and prediction confidence
A deep neural network classifier contains multiple hidden layers between its input and output layers. A popular understanding (Goodfellow et al., 2016) of the role of these hidden layers is that they progressively extract abstract features (e.g., a wheel, human skin) from a high-dimensional low-level input (e.g., the pixels of an image). These features provide a relatively low-dimensional high-level representation of the input, which makes the classification much easier, e.g., the image is more likely to be a car if wheels are present.
What a DNN classifier tries to learn from the training data is a posterior probability distribution, denoted as $p(y \mid x)$ (Bishop, 2006). For a $K$-classification problem, the distribution can be written as $p(y = i \mid x)$, $i = 1, \dots, K$. For each input $x$ whose representation is $r$, the output layer first computes the non-normalized prediction $z = (z_1, \dots, z_K)$, whose element $z_i$ is often called the logit for the $i$-th class. The classifier then normalizes $z$ with a softmax function to approximate the posterior probabilities

(1)  \hat{p}_i = \mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \quad i = 1, \dots, K.

Finally, to classify $x$, one just chooses the category corresponding to the maximum posterior probability, i.e.,

(2)  \hat{y} = \arg\max_{i} \hat{p}_i.

Obviously, this prediction is intrinsically uncertain. The confidence for this prediction, which quantifies the likelihood of correctness, can be naturally measured as the estimated posterior class probability

(3)  \hat{c}(x) = \max_{i} \hat{p}_i.
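For illustration, Equations 1–3 amount to only a few lines of code. The following NumPy sketch (a generic illustration, not tied to any particular model) computes the softmax probabilities, the predicted class, and its confidence from raw logits:

```python
import numpy as np

def softmax(z):
    """Normalize logits into estimated posterior probabilities (Eq. 1)."""
    e = np.exp(z - np.max(z))  # subtract the max logit for numerical stability
    return e / e.sum()

def predict_with_confidence(z):
    """Return the predicted class (Eq. 2) and its confidence (Eq. 3)."""
    p = softmax(np.asarray(z, dtype=float))
    return int(np.argmax(p)), float(np.max(p))

label, conf = predict_with_confidence([2.0, 1.0, 0.1])  # class 0, conf ~ 0.66
```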
Confidence plays an important role in decision-making. For example, if the loss due to an incorrect prediction is four times the gain of a correct prediction, one should not invest in predictions with confidence less than 0.8. Inaccurate confidence could cause significant loss. For example, an overconfident benign prediction for a pathology image could mislead a doctor into overlooking a malignant tumor, while an underconfident benign prediction could result in unnecessary confirmatory tests.
Modern DNN classifiers are often inaccurate in confidence (Szegedy et al., 2016), because they overfit to the surrogate loss used in training (Guo et al., 2017; Tewari and Bartlett, 2007). Simply put, they are over-optimized toward the accuracy of classification, but not toward the accuracy of the estimated posterior probabilities. To avoid the potential loss caused by inaccurate confidence, confidence calibration can be employed in the learning process (Flach, 2016; Guo et al., 2017; Tewari and Bartlett, 2007). Usually the task is to find a function $\varphi$ to correct the logits $z$ such that

(4)  \hat{c}_{\varphi}(x) = \max_i \mathrm{softmax}\big( \varphi(z) \big)_i

matches the real posterior probability $p(y = \hat{y} \mid x)$. Notice that, in this setting, the inaccuracy of confidence is viewed as a kind of systematic error or bias, not associated with particular inputs or domains. That is, $\varphi$ takes neither the input $x$ nor its representation $r$ as input.
There exist different kinds of calibration methods, such as isotonic regression (Zadrozny and Elkan, 2002), histogram binning (Zadrozny and Elkan, 2002), and Platt scaling (Platt, 1999). However, according to a recent study (Guo et al., 2017), the most effective choice is often a simple method called Temperature Scaling (Hinton et al., 2015). The idea is to define the calibration function as

(5)  \varphi(z) = z / T,

where $T$ is a scalar parameter computed by minimizing the negative log-likelihood (Hastie et al., 2009) on the validation dataset.
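To make Equation 5 concrete, the sketch below fits the temperature on synthetic validation logits by minimizing the negative log-likelihood over a simple grid (the grid search and the toy data are our simplifications; in practice $T$ is typically optimized by gradient descent):

```python
import numpy as np

def nll(logits, labels, T):
    """Negative log-likelihood of the labels under temperature-scaled softmax."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 91)):
    """Pick the scalar T of Eq. 5 that minimizes NLL on validation data."""
    return min(grid, key=lambda T: nll(logits, labels, T))

# toy validation set: predictions are mostly right, but margins are too sharp
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=200)
logits = rng.normal(0.0, 1.0, size=(200, 3))
logits[np.arange(200), labels] += 3.0  # overconfident correct class
T = fit_temperature(logits, labels)
```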
2.2. Operational quality assurance
Well-trained DNN models can provide marvellous capabilities, but unfortunately their failures in applications are also very common (Riley, 2019). When using a trained model as an integral part of a high-stakes software system, it is crucial to know quantitatively how well the model will work. A quality assurance combining the viewpoints of software engineering and machine learning is needed, but largely missing.
The principle of software quality assurance is founded on the specifications for software artifacts and the deductive reasoning based on them. A specification defines the assumptions and guarantees of a software artifact. The artifact is expected to meet its guarantees whenever its assumptions are satisfied. Thus explicit specifications make software artifacts more or less domain independent. However, statistical machine learning does not provide such a specification. Essentially it tries to induce a model from its training data, which is intended to be general so that the model can give predictions on previously unseen inputs. Unfortunately, the scope of generalization cannot be explicitly specified. As a result, a major problem comes from the divergence between the domain where the model was originally trained and the domain where it actually operates.
So the first requirement for the quality assurance of a DNN model is to focus on the concrete domain where the model actually operates. In theory, the quality of a DNN model is meaningless without considering its operation domain, and in practice the performance of a model may drop significantly under domain shift (Li et al., 2019b). On the other hand, focusing on the operation domain also relieves the DNN model of the dependence on its original training data. Apart from practical concerns such as protecting the privacy and property of the training data, decoupling a model from its training data and process will also be helpful for (re)using it as a commercial off-the-shelf (COTS) software product (Zhou, 2016). This is in contrast to machine learning techniques dealing with domain shift, such as transfer learning or domain adaptation, which rely heavily on the original training data or hyperparameters (Pan and Yang, 2009; Shu et al., 2018; Wang et al., 2019b). They need the original training data because they try to generalize the scope of the model to include the new operation domain.
The second requirement is to embrace the uncertainty that is intrinsic in DNN models. A defect, or a “bug”, of a software artifact is a case in which it does not deliver its promise. Different from conventional software artifacts, a DNN model never promises to be certainly correct on any given input, and thus individual incorrect predictions should not be regarded as errors, but to some extent as features (Ilyas et al., 2019). Nevertheless, the model statistically quantifies the uncertainty of its predictions. Collectively, it is measured with metrics such as accuracy or precision. Individually, it is stated by the confidence value about the prediction on each given input. These quantifications of uncertainty, as well as the predictions the model makes, should be subject to quality assurance. For example, given a DNN model and its operation domain, operational testing (Li et al., 2019b) examines to what degree the model’s overall accuracy is degraded by the domain shift.
Finally, operational quality assurance should prioritize the saving of human effort, which includes the cost of collecting, and especially labeling, the data in the operation domain. The labeling of operational data often involves physical interactions, such as surgical biopsies and destructive tests, and thus can be expensive and time-consuming. Without access to the original training data, fine-tuning a DNN model to an operation domain may require a tremendous amount of labeled examples to work. Quality assurance activities often have to work with a much tighter budget for labeling data.
Figure 1 depicts the overall idea for operational quality assurance, which generalizes the process of operational testing proposed in (Li et al., 2019b). A DNN model, which is trained by a third party with the data from the origin domain, is to be deployed in an operation domain. It needs to be evaluated, and possibly adapted, with the data from the current operation domain. To reduce the effort of labeling, data selection can be incorporated in the procedure with the guidance of the information generated by the DNN model and the quality assurance activity. Only the DNN models that pass the assessments and are possibly equipped with the adaptations will be put into operation.
3. Operational Calibration of DNN Confidence
Now we focus on operational calibration as a specific quality assurance task for DNNs in the field.
3.1. Defining the problem
Given a domain where a previously trained DNN model is deployed, operational calibration identifies and fixes the model’s errors in the confidence of its predictions on individual inputs from the domain. Operational calibration is conservative in that it does not change the predictions made by the model, but tries to give accurate estimates of the likelihood that the predictions are correct. With this information, a DNN model can be useful even though its prediction accuracy is severely affected by the domain shift. One may take only its predictions on inputs with high confidence, and switch to other models or backup measures otherwise.
To quantify the accuracy of the confidence of a DNN model on a dataset $D$, one can use the Brier score (BS) (Brier, 1950), which is actually the mean squared error of the estimation:

(6)  BS = \frac{1}{|D|} \sum_{(x, y) \in D} \big( c(x) - \mathbb{I}(x, y) \big)^2,

where $\mathbb{I}(x, y)$ is the indicator function for whether the labeled input $(x, y)$ is correctly classified, i.e., $\mathbb{I}(x, y) = 1$ if $\hat{y}(x) = y$, and $\mathbb{I}(x, y) = 0$ otherwise.
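Equation 6 is just the mean squared error between the confidence and the 0/1 correctness indicator, as the short sketch below shows (the numbers are made up):

```python
import numpy as np

def brier_score(confidence, correct):
    """Mean squared error between confidence and 0/1 correctness (Eq. 6)."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    return float(np.mean((confidence - correct) ** 2))

# a well-calibrated model scores lower than an overconfident one
bs_good = brier_score([0.9, 0.8, 0.2], [1, 1, 0])     # -> 0.03
bs_over = brier_score([0.99, 0.99, 0.99], [1, 1, 0])
```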
Now we formally define the problem of operational calibration:

Problem (Operational calibration). Given a previously trained DNN classifier $M$, a set $S$ of unlabeled examples collected from an operation domain, and a budget $n$ for labeling the examples in $S$, find a confidence estimation function $c(\cdot)$ for $M$ with minimal Brier score on the operation domain.
Notice that operational calibration is different from the confidence calibration discussed in Section 2.1. The latter is domainindependent and usually included as a step in the training process of a DNN model, but the former is needed only when the model is deployed by a third party in a specific operation domain. Operational calibration cannot take the confidence error as a systematic error of the learning process, because the error is caused by the domain shift from the training data to the operational data, and it may depend on specific inputs from the operation domain.
3.2. Modeling with Gaussian Process
At first glance, operational calibration seems a simple regression problem with the Brier score as the loss function. However, direct regression would not work because of the limited budget of labeled operation data. It is helpful to view the problem in a Bayesian way. At the beginning, we have a prior belief about the correctness of the DNN model’s predictions, which is given by the confidence output of the model. Once we observe evidence that the model makes correct or incorrect predictions on some inputs, the belief should be adjusted accordingly. The challenge here is to strike a balance between the prior, which was learned from a huge training dataset but suffers from domain shift, and the evidence, which is collected from the operation domain but limited in volume.
It is natural to model the problem as a Gaussian Process (Rasmussen and Williams, 2005), because what we need to estimate is actually a function over the input space. The Gaussian Process is a nonparametric Bayesian method, which converts a prior over functions into a posterior over functions according to the observed data.
For convenience, instead of estimating the calibrated confidence $c(x)$ directly, we consider the residual

(7)  f(x) = c(x) - \hat{c}(x),

where $\hat{c}(x)$ is the original confidence output of the model for input $x$. At the beginning, without any evidence against $\hat{c}$, we assume that the prior distribution of $f$ is a zero-mean Gaussian Process

(8)  f(x) \sim \mathcal{GP}\big( 0,\; k(x, x') \big),

where $k(\cdot, \cdot)$ is the covariance (kernel) function, which intuitively describes the “smoothness” of $f$ from point to point. In other words, the covariance function ensures that $f$ produces close outputs when inputs are close in the input space.
Assume that we observe a set of independent and identically distributed (i.i.d.) labeled operational data $\{(x_i, y_i)\}_{i=1}^{n}$, in which $f_i = \mathbb{I}(x_i, y_i) - \hat{c}(x_i)$. For notational convenience, let

X = (x_1, \dots, x_n), \quad \mathbf{f} = (f_1, \dots, f_n)^{\top}

be the observed data and their corresponding values, and let

X_* = (x_1^*, \dots, x_m^*), \quad \mathbf{f}_* = \big( f(x_1^*), \dots, f(x_m^*) \big)^{\top}

be those for a set of i.i.d. predictive points. We have

(9)  \begin{pmatrix} \mathbf{f} \\ \mathbf{f}_* \end{pmatrix} \sim \mathcal{N}\left( \mathbf{0},\; \begin{pmatrix} K(X, X) & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{pmatrix} \right),

where $K(\cdot, \cdot)$ is the kernel matrix with entries $K_{ij} = k(x_i, x_j)$. Therefore, the conditional probability distribution is

(10)  \mathbf{f}_* \mid \mathbf{f} \sim \mathcal{N}( \boldsymbol{\mu}_*, \Sigma_* ),

where

\boldsymbol{\mu}_* = K(X_*, X)\, K(X, X)^{-1} \mathbf{f}, \qquad \Sigma_* = K(X_*, X_*) - K(X_*, X)\, K(X, X)^{-1} K(X, X_*).
With this Gaussian Process, we can estimate the probability distribution of the residual for any input $x$ as

(11)  f(x) \mid \mathbf{f} \sim \mathcal{N}\big( \mu(x), \sigma^2(x) \big),

where

\mu(x) = k(x, X)\, K(X, X)^{-1} \mathbf{f}, \qquad \sigma^2(x) = k(x, x) - k(x, X)\, K(X, X)^{-1} k(X, x).

Then, with Equation 7, we have the distribution of the calibrated confidence

(12)  c(x) \mid \mathbf{f} \sim \mathcal{N}\big( \hat{c}(x) + \mu(x),\; \sigma^2(x) \big).
Finally, because the value of confidence ranges from 0 to 1, we need to truncate the normal distribution (Burkardt, 2014), i.e.,

(13)  c(x) \sim \mathcal{TN}\big( \tilde{\mu}(x), \sigma^2(x);\; 0, 1 \big), \quad \tilde{\mu}(x) = \hat{c}(x) + \mu(x),

where the truncated density is

(14)  p\big( c(x) = t \big) = \frac{1}{\sigma(x)}\, \phi\!\left( \frac{t - \tilde{\mu}(x)}{\sigma(x)} \right) \Big/ \big( \Phi(\beta) - \Phi(\alpha) \big), \quad t \in [0, 1], \quad \alpha = \frac{-\tilde{\mu}(x)}{\sigma(x)}, \; \beta = \frac{1 - \tilde{\mu}(x)}{\sigma(x)}.

Here $\phi$ and $\Phi$ are the probability density function and the cumulative distribution function of the standard normal distribution, respectively.
With this Bayesian approach, we compute a distribution, rather than an exact value, for the confidence of each prediction. To compute the Brier score, we simply choose the maximum a posteriori (MAP), i.e., the mode of the distribution, as the calibrated confidence value. Here it is the mean of the truncated normal distribution
(15)  c(x) = \tilde{\mu}(x) + \sigma(x)\, \frac{\phi(\alpha) - \phi(\beta)}{\Phi(\beta) - \Phi(\alpha)}.
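Putting Equations 10–15 together, the pipeline of this section can be sketched in plain NumPy: a Gaussian Process posterior for the residual, followed by the mean of the truncated normal. The RBF kernel, the toy feature vectors, the length scale, and the jitter term below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np
from math import erf, exp, pi, sqrt

def rbf(a, b, ell=1.0):
    """RBF covariance between two feature vectors (cf. Eq. 16)."""
    d = np.asarray(a, float) - np.asarray(b, float)
    return exp(-(d @ d) / (2 * ell ** 2))

def gp_posterior(X, f, x, ell=1.0, jitter=1e-8):
    """Posterior mean/variance of the residual at x given observations (Eq. 11)."""
    K = np.array([[rbf(xi, xj, ell) for xj in X] for xi in X])
    K += jitter * np.eye(len(X))              # numerical stabilization
    k_star = np.array([rbf(x, xi, ell) for xi in X])
    mu = k_star @ np.linalg.solve(K, np.asarray(f, float))
    var = rbf(x, x, ell) - k_star @ np.linalg.solve(K, k_star)
    return mu, max(var, 0.0)

def trunc_normal_mean(mu, sigma, lo=0.0, hi=1.0):
    """Mean of a normal(mu, sigma^2) truncated to [lo, hi] (Eq. 15)."""
    if sigma < 1e-12:
        return min(max(mu, lo), hi)
    pdf = lambda t: exp(-t * t / 2) / sqrt(2 * pi)
    cdf = lambda t: 0.5 * (1 + erf(t / sqrt(2)))
    a, b = (lo - mu) / sigma, (hi - mu) / sigma
    return mu + sigma * (pdf(a) - pdf(b)) / (cdf(b) - cdf(a))

# toy run: residuals (correctness minus confidence) observed at three points
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
f = [-0.3, -0.1, 0.2]
mu, var = gp_posterior(X, f, [0.1, 0.1])
c_hat = 0.9                                   # original model confidence
calibrated = trunc_normal_mean(c_hat + mu, sqrt(var))  # pulled below 0.9
```

Near the observed point with a negative residual, the posterior mean is negative, so the original overconfident 0.9 is pulled down while remaining a valid probability.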
3.3. Clustering in representation space
Directly applying the above Gaussian Process to estimate $f(x)$ would be ineffective and inefficient. It is difficult to specify a proper covariance function in Equation 8, because the correlation between the correctness of predictions on different examples is difficult, if not impossible, to model in the very high-dimensional input space.
Fortunately, we have the DNN model at hand, which can be used as a feature extractor even though it may suffer from domain shift and perform badly as a classifier (Bengio et al., 2012). In this way we transform each input from the input space to a corresponding point in the representation space, which is defined by the outputs of the neurons in the last hidden layer. It turns out that the correctness of the model’s predictions has an obvious locality, i.e., a prediction is more likely to be correct/incorrect if it is near a correct/incorrect prediction in the representation space. See Figure 2 for an intuitive example.
Another insight for improving the efficacy and efficiency of the Gaussian Process is that the distribution of operational data in the sparse representation space is far from even. The data can be nicely grouped into a small number (usually tens) of clusters, and the correlation of prediction correctness within a cluster is much stronger than that between clusters. Consequently, instead of regression with one universal Gaussian Process, we carry out a Gaussian Process regression in each cluster.
This clustering not only reduces the computational cost of the Gaussian Processes, but also makes it possible to use different covariance functions for different clusters. This flexibility makes our estimation more accurate. Specifically, we use the RBF kernel
(16)  k(x, x') = \exp\left( -\frac{\lVert r(x) - r(x') \rVert^2}{2 \ell^2} \right),
where $r(\cdot)$ denotes the representation of an input and the parameter $\ell$ (length scale) can be decided according to the distribution of the original confidence values produced by the model.
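To make the clustering step concrete, the sketch below groups representation vectors with a minimal 2-means routine; the farthest-point seeding and the toy blobs are our illustrative assumptions. A separate Gaussian Process with its own length scale would then be fitted inside each resulting cluster:

```python
import numpy as np

def two_means(R, iters=10):
    """Minimal 2-means over representation vectors R, seeded with the first
    point and the point farthest from it (deterministic for illustration)."""
    centers = np.vstack([R[0], R[((R - R[0]) ** 2).sum(axis=1).argmax()]])
    for _ in range(iters):
        d = ((R[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)             # nearest-center assignment
        centers = np.vstack([R[labels == j].mean(axis=0) for j in (0, 1)])
    return labels, centers

# two well-separated blobs standing in for clusters in representation space
rng = np.random.default_rng(1)
R = np.vstack([rng.normal(0.0, 0.1, (30, 2)), rng.normal(5.0, 0.1, (30, 2))])
labels, centers = two_means(R)
```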
3.4. Considering costs in decision
The cost of misclassification must be taken into account in real-world decision making. One can also measure how well a model is calibrated by the loss due to confidence error (LCE) against a given cost model.

For example, let us assume a simple cost model in which the gain for a correct prediction is 1 and the loss for a false prediction is $\lambda$. The expected net gain of acting on a prediction with confidence $c(x)$ is $c(x) - (1 - c(x))\lambda$. We further assume that there is no cost for taking no action when the expected net gain is negative. Then the actual gain $g(x)$ for an input $x$ with estimated confidence $c(x)$ will be

(17)  g(x) = \begin{cases} 1 & \text{if } c(x) \ge \theta \text{ and the prediction is correct}, \\ -\lambda & \text{if } c(x) \ge \theta \text{ and the prediction is incorrect}, \\ 0 & \text{if } c(x) < \theta, \end{cases}

where $\theta = \lambda / (1 + \lambda)$ is the break-even threshold of confidence for taking action. On the other hand, if the confidence were perfect, i.e., $c(x) = 1$ if the prediction is correct and $c(x) = 0$ otherwise, the total gain for dataset $D$ would be a constant, namely the number of correct predictions. So the average LCE over a dataset with $N$ examples is:

(18)  \mathrm{LCE} = \frac{1}{N} \Big( \sum_{(x, y) \in D} \mathbb{I}(x, y) - \sum_{(x, y) \in D} g(x) \Big).

With the Bayesian approach we do not have an exact $c(x)$ but a truncated normal distribution of it. If we take $c(x)$ to be its expectation (Equation 15), the above equations still hold, because $\theta$ is a constant. Things would be different if, for example, one put higher stakes on higher-confidence predictions. Considering the page limit, we do not elaborate on this issue here, but the Bayesian approach allows for more flexibility in dealing with such cases.

Cost-sensitive calibration targets minimizing the LCE instead of the Brier score. Notice that calibrating confidence against the Brier score generally reduces LCE as well. However, given a cost model, optimizing directly toward minimal LCE can be more effective and efficient.
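Under the simple cost model above, the gain of Equation 17 and the LCE of Equation 18 can be computed directly. In the sketch below, the correctness flags and λ = 4 (hence θ = 0.8) are made up for illustration:

```python
import numpy as np

def gain(conf, correct, lam):
    """Actual gain under the simple cost model (Eq. 17): act only when the
    confidence reaches the break-even threshold theta = lam / (1 + lam)."""
    theta = lam / (1.0 + lam)
    act = np.asarray(conf, float) >= theta
    payoff = np.where(np.asarray(correct, bool), 1.0, -lam)
    return float((act * payoff).sum())

def lce(conf, correct, lam):
    """Average loss due to confidence error (Eq. 18): the perfect-confidence
    gain (accept exactly the correct predictions) minus the actual gain."""
    correct = np.asarray(correct, bool)
    g_perfect = float(correct.sum())
    return (g_perfect - gain(conf, correct, lam)) / len(correct)

# one overconfident wrong prediction is costly when lam = 4 (theta = 0.8)
loss = lce([0.95, 0.9, 0.6], [True, False, True], lam=4.0)  # -> 5/3
```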
3.5. Selecting operational data to label
In case the set of labeled operational data is given, we simply apply a Gaussian Process in each cluster in the representation space and get the posterior distribution of the confidence. However, if we can decide which operational data to label, we should spend the labeling budget more wisely.

Initially, we select the operational input at the center of each cluster to label, and apply a Gaussian Process in each cluster with this central input to compute the posterior probability distribution of the confidence. Then we select the most “helpful” input to label and repeat the procedure.

The insight for input selection is twofold. First, to reduce the uncertainty as much as possible, one should choose the input with maximal variance $\sigma^2(x)$. Second, to reduce the LCE as much as possible, one should pay more attention to inputs whose confidence is near the break-even threshold $\theta$. So we choose the next input $x^*$ to label as:
(19)  x^* = \arg\max_{x \in S} \; \sigma^2(x) \,\big/\, \lvert c(x) - \theta \rvert.
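The selection rule can be sketched as follows. The exact way the two criteria are combined is a design choice; the ratio used below (with a small epsilon guarding against division by zero) is our illustrative instantiation, not necessarily the paper's exact form:

```python
import numpy as np

def select_next(variance, conf, theta=0.8, eps=1e-6):
    """Score unlabeled inputs: prefer large posterior variance and confidence
    close to the break-even threshold theta; return the best index."""
    variance = np.asarray(variance, float)
    conf = np.asarray(conf, float)
    score = variance / (np.abs(conf - theta) + eps)
    return int(score.argmax())

# input 1 is both uncertain and near the decision threshold
idx = select_next(variance=[0.01, 0.05, 0.04], conf=[0.20, 0.78, 0.99])
```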
Putting all the ideas together, we have Algorithm 1 shown below. The algorithm is robust in that it does not rely on any hyperparameters except for the number of clusters. It is also conservative in that it does not change the predictions made by the model. As a result, it needs no extra validation data.
3.6. Discussions
To understand why our approach is more effective than conventional confidence calibration techniques, one can consider the three-part decomposition of the Brier score (Murphy, 1973)

(20)  BS = \sum_{k} \frac{|B_k|}{N} \big( \mathrm{conf}(B_k) - \mathrm{acc}(B_k) \big)^2 \;-\; \sum_{k} \frac{|B_k|}{N} \big( \mathrm{acc}(B_k) - \mathrm{acc} \big)^2 \;+\; \mathrm{acc}\,(1 - \mathrm{acc}),

where $B_k$ is the set of inputs whose confidence falls into the $k$-th interval of a partition of $[0, 1]$, and $\mathrm{acc}(B_k)$ and $\mathrm{conf}(B_k)$ are the expected accuracy and confidence in $B_k$, respectively. The $\mathrm{acc}$ is the accuracy over the whole dataset $D$.
In this decomposition, the first term is called reliability, which measures the distance between the confidence and the true posterior probabilities. The second term is resolution, which measures the distinctions of the predictive probabilities. The final term is uncertainty, which is only determined by the accuracy.
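The decomposition can be checked numerically. The sketch below bins inputs by their (discrete) confidence values, so the decomposition is exact, and verifies that reliability − resolution + uncertainty reproduces the Brier score; the data are made up:

```python
import numpy as np

def brier_decomposition(conf, correct):
    """Three-part decomposition of the Brier score (Eq. 20), grouping inputs
    that share the same confidence value into one bin."""
    conf = np.asarray(conf, float)
    correct = np.asarray(correct, float)
    n, acc = len(conf), correct.mean()
    rel = res = 0.0
    for c in np.unique(conf):
        in_bin = conf == c
        acc_k = correct[in_bin].mean()
        rel += in_bin.sum() / n * (c - acc_k) ** 2      # reliability
        res += in_bin.sum() / n * (acc_k - acc) ** 2    # resolution
    return rel, res, acc * (1 - acc)                    # ..., uncertainty

conf = [0.9, 0.9, 0.9, 0.6, 0.6, 0.6]
correct = [1, 1, 0, 1, 0, 0]
rel, res, unc = brier_decomposition(conf, correct)
bs = float(np.mean((np.asarray(conf) - np.asarray(correct)) ** 2))
```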
In conventional confidence calibration, the model is assumed to be well trained and to achieve the expected accuracy. In addition, the grouping of inputs into the bins $B_k$ is acceptable because the confidence error is regarded as a systematic error. So one only cares about minimizing the reliability term, which is exactly what conventional calibration techniques such as Temperature Scaling are designed for.

However, in the operational setting, the model itself suffers from the domain shift, and thus may be less accurate than expected. Even worse, the grouping into bins $B_k$ is problematic, because the confidence error is unsystematic and the inputs in each $B_k$ are not homogeneous anymore. Consequently, we need to maximize the resolution and minimize the reliability at the same time. Our approach achieves these two goals with a more discriminative calibration that is based on the features of individual inputs rather than on their logits or confidence values.
This observation also indicates that the benefit of our approach over temperature scaling will diminish if the confidence error happens to be systematic. For example, in case that the only divergence of the data in the operation domain is that some part of an image is missing, our approach will perform similarly to or even slightly worse than temperature scaling. However, as can be seen from later experiments, most operational situations have more or less domain shifts that temperature scaling cannot handle well.
In addition, when the loss $\lambda$ for a false prediction is very small (as observed in the experiments reported in the next section), our approach is ineffective in reducing LCE. This is expected, because in such a situation one should accept almost all predictions, even those whose confidence values are low.
4. Empirical evaluation
We conducted a series of experiments to answer the following questions:

Is our approach to operational calibration generally effective in different tasks?

How effective is it, compared with alternative approaches?

How efficient is it, in the sense of saving labeling effort?
We implemented our approach on top of the PyTorch 1.1.0 DL framework. The code, together with the experiment data, is available at https://figshare.com/s/5f6096ca8f413ef31eb4. The experiments were conducted on a GPU server with two Intel Xeon Gold 5118 CPUs @ 2.30GHz, 400GB RAM, and 10 GeForce RTX 2080 Ti GPUs. The server ran Ubuntu 16.04 with GNU/Linux kernel 4.4.0.
The execution time of our operational calibration depends on the size of the dataset used, and the architecture of the DNN model. For the tasks listed below, the execution time varied from about 3.5s to 50s, which we regard as totally acceptable.
4.1. Experiment tasks
To evaluate the general efficacy of our approach, we designed six tasks that differed in application domain (image recognition and natural language processing), operation dataset size (from hundreds to thousands), classification difficulty (from 2 to 1,000 classes), and model complexity. To make our simulation of domain shifts realistic, in four tasks we adopted third-party operational datasets often used in transfer-learning research, and in the other two tasks we used mutations that are also frequently used in the machine learning community. Figure 3 demonstrates some example images from the origin and operation domains. Table 1 lists the settings of the six tasks.
No.  Model         Origin → Operation dataset                 Acc. (%)       Size*
1    LeNet-5       Digit recognition (MNIST → USPS)           96.9 → 68.0      900
2    RNN           Polarity (v1.0 → v2.0)                     99.0 → 83.4    1,000
3    ResNet-18     Image classification (CIFAR-10 → STL-10)   93.2 → 47.1    5,000
4    VGG-19        CIFAR-100 (orig. → crop)                   72.0 → 63.6    5,000
5    ResNet-50     ImageCLEF (c → p)                          99.2 → 73.2      480
6    Inception-v3  ImageNet (orig. → downsample)              77.5 → 45.3    5,000

* Size refers to the maximum number of operation data available for labeling.
In Task 1 we applied a LeNet-5 model originally trained with the images from the MNIST dataset (LeCun et al., 1998) to classify images from the USPS dataset (Friedman et al., 2001). Both of them are popular handwritten digit recognition datasets consisting of single-channel images of size 16×16×1. The size of the training dataset was 2,000, and the size of the operation dataset was 1,800. We reserved 900 of the 1,800 operational data for testing, and used the other 900 for operational calibration.
Task 2 was focused on natural language processing. Polarity is a dataset for sentiment analysis (Pang et al., 2002). It consists of sentences labeled with the corresponding sentiment polarity (i.e., positive or negative). We chose Polarity v1.0, which contains 1,400 movie reviews collected in 2002, as the training set. Polarity v2.0, which contains 2,000 movie reviews collected in 2004, was used as the data from the operation domain. We also reserved half of the operation data for testing.
In Task 3 we used two classic image classification datasets, CIFAR-10 (Krizhevsky et al., 2009) and STL-10 (Coates et al., 2011). The former consists of 60,000 32×32×3 images in 10 classes, with 6,000 images per class. The latter has only 13,000 labeled images, but the size of each image is 96×96×3. We used the whole CIFAR-10 dataset to train the model. The operation domain was represented by 8,000 images collected from STL-10, of which 5,000 were used for calibration and the other 3,000 were reserved for testing.
Task 4 used the dataset CIFAR-100, which is more difficult than CIFAR-10 and contains 100 classes with 600 images each. We trained the model with the whole training dataset of 50,000 images. To construct the operation domain, we randomly cropped the remaining 10,000 images. Half of these cropped images were used for calibration and the other half for testing.
Task 5 used the image classification dataset from the ImageCLEF 2014 challenge (Müller et al., 2010). It is organized into 12 common classes derived from three different domains: ImageNet ILSVRC 2012 (i), Caltech-256 (c), and Pascal VOC 2012 (p). We chose dataset (c) as the origin domain and dataset (p) as the operation domain. Due to the extremely small size of the dataset, we divided dataset (p) for calibration and testing by the ratio 4:1.
Finally, Task 6 dealt with an extremely difficult situation. ImageNet is a large-scale image classification dataset containing more than 1.2 million 224×224×3 images across 1,000 categories (Deng et al., 2009). The pretrained Inception-v3 model was adopted for evaluation. The operation domain was constructed by downsampling 10,000 images from the original test dataset. Again, half of the images were reserved for testing.


4.2. Efficacy of operational calibration
Table 2 gives the Brier scores of the confidence before and after operational calibration. In these experiments, all operational data listed in Table 1 (excluding the reserved test data) were labeled and used in the calibration. The results unambiguously confirm the general efficacy of our approach and its superiority over alternative approaches. In the following we elaborate on its performance in different situations and how it compares with other approaches.
Table 2. Brier scores before and after calibration.

No.  Model         Orig.  GPR    RFR    SVR    TS     SAR
1    LeNet-5       0.207  0.114  0.126  0.163  0.183  0.320
2    RNN           0.203  0.102  0.107  0.202  0.185  0.175
3    ResNet-18     0.474  0.101  0.121  0.115  0.387  0.308
4    VGG-19        0.216  0.158  0.162  0.170  0.217  0.529
5    ResNet-50     0.226  0.179  0.204  0.245  0.556  0.364
6    Inception-v3  0.192  0.161  0.167  0.217  0.191  --
Orig.–Before calibration; GPR–Our Gaussian Process-based approach; RFR–Random Forest Regression in the representation space; SVR–Support Vector Regression in the representation space; TS–Temperature Scaling (Guo et al., 2017); SAR–Regression with Surprise values (Kim et al., 2019). We failed to evaluate SAR on task 6 because it took too long to run on the huge dataset.
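The Brier scores in Table 2 measure the squared gap between reported confidence and actual correctness. As a reference, the top-label variant of the score can be computed in a few lines; this sketch is illustrative and not the evaluation code used in the experiments:

```python
import numpy as np

def brier_score(confidence, correct):
    """Brier score over top-label confidence: the mean squared gap
    between the reported confidence and the 0/1 correctness of the
    model's prediction."""
    c = np.asarray(confidence, dtype=float)
    y = np.asarray(correct, dtype=float)
    return float(np.mean((c - y) ** 2))

# High confidence on correct predictions gives a low (good) score.
print(brier_score([0.9, 0.8], [1, 1]))  # 0.025
```

Lower is better: a perfectly calibrated and accurate classifier scores 0, while confidently wrong predictions are penalized quadratically.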
4.2.1. Calibration when fine-tuning is ineffective
A machine learning engineer might first consider applying fine-tuning tricks to deal with the problem of domain shift. However, for nontrivial tasks such as our tasks 4, 5, and 6, it can be very difficult, if not impossible, to fine-tune the DNN model with small operational datasets. Figure 4 shows the vain effort of fine-tuning the models with all the operational data (excluding test data). We tried all the usual tricks, including data augmentation, weight decay, and regularization to avoid overfitting, but failed to improve the test accuracy.
Fortunately, our operational calibration worked quite well in these difficult situations. In addition to the Brier scores reported in Table 2, we can also see the saving of LCE for task 4 in Figure 5. Our approach reduced the LCE by about a half, which indicates its capability in reducing high-confidence errors.
4.2.2. Calibration when fine-tuning is effective
In easier situations where fine-tuning works, we can still calibrate the model to give more accurate confidence. Note that effective fine-tuning does not necessarily provide accurate confidence. One can first apply fine-tuning until the test accuracy stops increasing, and then calibrate the fine-tuned model with the remaining operational data.
For example, we successfully fine-tuned the models in our tasks 1, 2, and 3. (Here we used some information from the training process, such as the learning rates, weight decays, and training epochs. Fine-tuning could be more difficult in real-world operation settings where this information is unavailable.) Task 1 was the easiest to fine-tune: its accuracy kept increasing until all 900 operational examples were exhausted. Task 2 was binary classification, and in this case our calibration was actually an effective fine-tuning technique. Figure 6 shows that our approach was more effective and efficient than conventional fine-tuning, as it converged more quickly. For task 3, the fine-tuned accuracy stopped increasing at about 79%, with about 3,000 operational examples used. Figure 7 shows that the Brier score decreased more when the remaining operational data were spent on calibration rather than on continued fine-tuning.
4.3. Comparing with other calibration methods
First, we found our approach significantly outperformed Temperature Scaling (Hinton et al., 2015), which is reported to be the most effective conventional confidence calibration method (Guo et al., 2017). As shown in Table 2, Temperature Scaling was hardly effective, and it even worsened the confidence in tasks 4 and 5. We observed that its bad performance in these cases came from the significantly lowered resolution part of the Brier score, which confirms the analysis in Section 3.6. For example, in task 3, with Temperature Scaling the reliability decreased from 0.196 to 0.138, but the resolution dropped from 0.014 to 0.0; in fact, the confidence values were all very close to 0.5 after scaling. With our approach, the reliability decreased to 0.107, and the resolution also increased to 0.154.
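The reliability and resolution figures quoted above come from Murphy's decomposition of the Brier score (Brier = reliability - resolution + uncertainty). A minimal sketch of computing this decomposition with equal-width confidence bins, as our own illustrative code rather than the paper's implementation:

```python
import numpy as np

def brier_decomposition(confidence, correct, n_bins=10):
    """Murphy's decomposition: reliability - resolution + uncertainty,
    estimated over equal-width confidence bins."""
    c = np.asarray(confidence, dtype=float)
    y = np.asarray(correct, dtype=float)
    base = y.mean()  # overall accuracy
    bins = np.minimum((c * n_bins).astype(int), n_bins - 1)
    rel = res = 0.0
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue
        w = mask.mean()            # fraction of samples in this bin
        acc_b = y[mask].mean()     # empirical accuracy in the bin
        conf_b = c[mask].mean()    # average confidence in the bin
        rel += w * (conf_b - acc_b) ** 2   # miscalibration penalty
        res += w * (acc_b - base) ** 2     # discrimination reward
    unc = base * (1.0 - base)
    return rel, res, unc
```

A calibrator that squashes all confidences toward a constant (as Temperature Scaling did in task 3) can reduce reliability while destroying resolution, which is exactly the failure mode observed above.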
Second, we also tried to calibrate confidence based on the surprise value, which measures the difference in a DL system's behavior between the input and the training data (Kim et al., 2019). We expected it to be effective because it also leverages the distribution of examples in the representation space. We performed polynomial regression between the confidence adjustments and the likelihood-based surprise values. Unfortunately, it did not work in most of the cases. We believe the reason is that surprise values are scalars and cannot provide enough information for operational calibration.
Finally, to examine whether Gaussian Process Regression is the right choice for our purpose, we also experimented with two standard regression methods, viz. Random Forest Regression (RFR) and Support Vector Regression (SVR), in our framework. We used a linear kernel for SVR and ten decision trees for RFR. In most cases, the nonlinear RFR performed better than the linear SVR, and both performed better than Temperature Scaling but worse than our approach. The results indicate that (1) calibrating on the features extracted by the model, rather than on the logits it computes, is crucial, (2) the confidence error is nonlinear and unsystematic, and (3) the Gaussian Process, as a Bayesian method, can provide better estimates of the confidence.
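To make the comparison concrete, the following sketch shows the general shape of regression-based calibration in the representation space with a Gaussian Process: it fits pairs of (feature vector, 0/1 prediction correctness) and uses the clipped posterior mean as the calibrated confidence. This is a simplified stand-in with illustrative kernel and noise choices, not the exact implementation used in our framework:

```python
import numpy as np

def rbf_kernel(A, B, length=1.0):
    # Squared-exponential kernel between two sets of feature vectors.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length ** 2)

def gpr_confidence(X_train, correct, X_query, length=1.0, noise=1e-2):
    """GP posterior over (features -> 0/1 correctness); the clipped
    posterior mean serves as the calibrated confidence, and the
    posterior std quantifies how trustworthy that estimate is."""
    K = rbf_kernel(X_train, X_train, length) + noise * np.eye(len(X_train))
    Ks = rbf_kernel(X_query, X_train, length)
    Kss = rbf_kernel(X_query, X_query, length)
    alpha = np.linalg.solve(K, correct - correct.mean())
    mean = correct.mean() + Ks @ alpha
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
    std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
    return np.clip(mean, 0.0, 1.0), std
```

The locality exploited here is exactly why scalar surprise values fall short: the GP conditions on where a query lands in the full representation space, not on a single distance-like summary.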
4.4. Efficiency of operational calibration
We have shown above that our approach worked with small operational datasets that were insufficient for fine-tuning (tasks 4, 5, and 6). In fact, the Gaussian Process-based approach has the nice property that it starts to work with very few labeled examples. We experimented with the approach using the input selection method presented in Section 3.5. We focused on the number of high-confidence false predictions, which decreased as more and more operational examples were labeled and used.
We experimented with all the tasks but labeled only 10% of the operational data. Table 3 shows the numbers of high-confidence false predictions before and after operational calibration. As a reference, we also include the numbers of high-confidence correct predictions. We can see that most of the high-confidence false predictions were eliminated. It is expected that there were fewer high-confidence correct predictions after calibration, because the actual accuracy of the models dropped. The much lowered LCE scores, which account for both the loss of lowering the confidence of correct predictions and the gain of lowering the confidence of false predictions, indicate that the overall improvements were significant.
Table 3. Numbers of high-confidence predictions and LCE, before → after calibration.

No.  Model         Conf.  Correct pred.   False pred.   LCE
1    LeNet-5       0.8    473 → 309.1     126 → 24.3    0.143 → 0.089
                   0.9    417 → 141.9     74 → 2.5      0.096 → 0.055
2    RNN           0.8    512 → 552.9     118 → 39.9    0.162 → 0.091
                   0.9    482 → 261.3     106 → 12.0    0.132 → 0.070
3    ResNet-18     0.8    1350 → 839.2    1372 → 59.7   0.370 → 0.054
                   0.9    1314 → 424.0    1263 → 9.4    0.358 → 0.041
4    VGG-19        0.8    1105 → 392.5    583 → 46.9    0.127 → 0.070
                   0.9    772 → 142.8     280 → 9.3     0.074 → 0.038
5    ResNet-50     0.8    53 → 26.9       16 → 5.2      0.162 → 0.136
                   0.9    46 → 26.9       10 → 2.0      0.108 → 0.064
6    Inception-v3  0.8    1160 → 692.0    265 → 63.6    0.087 → 0.073
                   0.9    801 → 554.1     137 → 40.2    0.054 → 0.041
We ran each experiment 10 times and computed the average numbers.
Note that for tasks 4, 5, and 6, the usual fine-tuning tricks did not work even with all the operational data labeled. With our operational calibration, using only about 10% of the data, one can avoid about 97%, 80%, and 71% of high-confidence (0.9) errors, respectively.
For a visual illustration of the efficiency of our approach, Figure 8 plots the proportion of high-confidence false predictions among all predictions for task 3. The other tasks are similar and omitted here to save space. It is interesting to see that (1) most of the high-confidence false predictions were identified very quickly, and (2) the approach was conservative, but the conservativeness was gradually remedied as more labeled operational data were used.
5. Related work
Operational calibration is generally related to the testing of deep learning systems in the software engineering community, and to confidence calibration, transfer learning, and active learning in the machine learning community. We briefly overview related work in these directions and highlight the connections and differences between them and our work.
5.1. Software testing for deep learning systems
Research in this area can be roughly classified into four categories according to the kind of defects targeted.


Defects in DL programs. This line of work focuses on the bugs in the code of DL frameworks. For example, Pham et al. proposed to test the implementation of deep learning libraries (TensorFlow, CNTK and Theano) through differential testing (Pham et al., 2019). Odena et al. used fuzzing techniques to expose numerical errors in matrix multiplication operations (Odena et al., 2019).

Defects in DL models. Regarding trained DNN models as pieces of software artifact, and borrowing the idea of structural coverage in conventional software testing, a series of coverage criteria have been proposed for the testing of DNNs, for example, DeepXplore (Pei et al., 2017), DeepGauge (Ma et al., 2018), DeepConcolic (Sun et al., 2018), and Surprise Adequacy (Kim et al., 2019), to name but a few.

Defects in training datasets. Another critical element in machine learning is the dataset. There exist researches aimed at debugging and fixing errors in the polluted training dataset. For example, PSI identifies root causes (e.g., incorrect labels) of data errors by efficiently computing the Probability of Sufficiency scores through probabilistic programming (Chakarov et al., 2016).

Defects due to improper inputs. A DNN model cannot be expected to handle inputs that fall outside the distribution for which it was trained. Thus a defensive approach is to detect such inputs. For example, Wang et al.'s approach checks whether an input is normal or adversarial by integrating statistical hypothesis testing and model mutation testing (Wang et al., 2019a). More work in this line can be found in the machine learning literature under the name of out-of-distribution detection (Shalev et al., 2018).
For a more comprehensive survey on the testing of machine learning systems, one can consult Zhang et al. (Zhang et al., 2019).
The major difference between our work and these studies is that ours is operational, i.e., it focuses on how well a DNN model will work in a given operation domain. As discussed in Section 2, without considering the operation domain, it is often difficult to tell whether a phenomenon of a DNN model is a bug or a feature (Ilyas et al., 2019; Li et al., 2019a).
An exception is the recent proposal of operational testing for the efficient estimation of the accuracy of a DNN model in the field (Li et al., 2019b). Arguably, operational calibration is more challenging and more rewarding than operational testing, because the latter only gives the overall performance of a model in an operation domain, while the former tells when the model works well and when it does not.
5.2. Confidence calibration in DNN training
Confidence calibration is important for training high quality classifiers. There is a plethora of proposals on this topic in the machine learning community (NiculescuMizil and Caruana, 2005; Naeini et al., 2015; Zadrozny and Elkan, 2002; Flach, 2016; Guo et al., 2017). Apart from the Temperature Scaling discussed in Section 2.1, Isotonic regression (Zadrozny and Elkan, 2002), Histogram binning (Zadrozny and Elkan, 2001) and Platt scaling (Platt, 1999) are also often used. Isotonic regression is a nonparametric approach that employs the least square method with a nondecreasing and piecewise constant fitted function. Histogram binning divides confidences into mutually exclusive bins and assigns the calibrated confidences by minimizing the binwise squared loss. Platt scaling is a generalized version of Temperature Scaling. It adds a linear transformation between the logit layer and the softmax layer, and optimizes the parameters with the NLL loss. However, according to Guo et al., Temperature Scaling is often the most effective approach.
As discussed earlier in Section 3.6, the problem with these calibration methods is that they regard confidence errors as systematic errors, which is usually not the case in an operation domain. Technically, these methods are effective in minimizing the reliability part of the Brier score, but ineffective in dealing with problems in the resolution part.
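For concreteness, the Temperature Scaling baseline discussed above amounts to fitting a single scalar T on held-out logits, then dividing all logits by T before the softmax. A minimal sketch, using grid search in place of the usual gradient-based NLL optimization (illustrative, not the baseline implementation used in our experiments):

```python
import numpy as np

def softmax(z, T=1.0):
    # Numerically stable softmax with temperature T.
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 46)):
    """Pick the single temperature T that minimizes the negative
    log-likelihood on held-out (logits, labels) pairs."""
    labels = np.asarray(labels)
    def nll(T):
        p = softmax(logits, T)
        return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))
    return min(grid, key=nll)
```

Because T rescales every input's logits by the same factor, the method can only fix a systematic over- or under-confidence; it cannot raise confidence on some inputs while lowering it on others, which is what the resolution part of the Brier score demands.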
5.3. Transfer learning and active learning
Our approach to operational calibration borrowed ideas from transfer learning (Pan and Yang, 2009) and active learning (Settles, 2009). Transfer learning (or domain adaptation) aims at training a model from a source domain (origin domain in our terms) that can be generalized to a target domain (operation domain), despite the dataset shift (Ng, 2016) between the domains. The key is to learn features that are transferable between the domains.
However, transfer learning techniques usually require data from both the source and target domains. By contrast, operational calibration often has to work with limited data from the operation domain and no data from the origin domain. It does not aim at improving prediction accuracy in the operation domain, but it may leverage the existing transferability of the features learned by the DNN model. In addition, transfer learning, when applicable, does not necessarily produce well-calibrated models, and operational calibration can further improve the accuracy of confidence (cf. Figure 7).
Active learning aims at reducing the cost of labeling training data by deliberately selecting and labeling inputs from a large set of unlabeled data. For Gaussian Process Regression, there exist various input selection strategies (Seo et al., 2000; Kapoor et al., 2007; Pasolli and Melgani, 2011). We tried many of them, such as those based on uncertainty (Seo et al., 2000), on density (Zhu et al., 2009), and on disagreement (Pasolli and Melgani, 2011), but failed to find a universally effective strategy that improved the data efficiency of our approach; they were sensitive to the choice of initial inputs, the models, and the distribution of examples (Settles, 2009). However, we found that combining a cost-sensitive sampling bias with uncertainty helps reduce high-confidence false predictions.
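The cost-sensitive selection idea can be illustrated with a toy scoring rule that biases uncertainty sampling toward high-confidence predictions, where confidence errors are most costly. The function name, the additive combination, and the weight are our own illustrative choices, not the selection procedure of Section 3.5:

```python
import numpy as np

def select_next(conf, post_std, labeled_mask, weight=1.0):
    """Pick the unlabeled input maximizing posterior uncertainty plus a
    cost-sensitive bias toward high-confidence predictions."""
    score = post_std + weight * conf          # uncertainty + cost bias
    score = np.where(labeled_mask, -np.inf, score)  # skip labeled inputs
    return int(np.argmax(score))
```

Pure uncertainty sampling would query wherever the GP is least sure; adding the confidence term steers the labeling budget toward the inputs whose miscalibration would be most expensive if left uncorrected.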
6. Conclusion
Software quality assurance for systems incorporating DNN models is urgently needed. This paper focuses on the problem of operational calibration, which detects and fixes the errors in the confidence given by a DNN model for its predictions in a given operation domain. A Bayesian approach to operational calibration is given. It solves the problem with Gaussian Process Regression, which leverages the locality of the operational data, and also their prediction correctness, in the representation space. The approach achieved impressive efficacy and efficiency in experiments with popular datasets and DNN models.
Theoretical analysis on aspects such as the data efficiency and the convergence of our algorithm is left for future work. In addition, we plan to investigate operational calibration methods for realworld decisions with more complicated cost models.
References
 Bengio et al. (2012) Yoshua Bengio, Aaron C Courville, and Pascal Vincent. 2012. Unsupervised feature learning and deep learning: A review and new perspectives. CoRR abs/1206.5538 (2012).
 Bishop (2006) Christopher M Bishop. 2006. Pattern recognition and machine learning. Springer, New York, NY. http://cds.cern.ch/record/998831 Softcover published in 2016.
 Bojarski et al. (2016) Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D. Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, Xin Zhang, Jake Zhao, and Karol Zieba. 2016. End to End Learning for SelfDriving Cars. CoRR abs/1604.07316, Article None (2016), 9 pages. arXiv:1604.07316 http://arxiv.org/abs/1604.07316
 Brier (1950) Glenn W Brier. 1950. Verification of forecasts expressed in terms of probability. Monthly weather review 78, 1 (1950), 1–3.
 Burkardt (2014) John Burkardt. 2014. The truncated normal distribution. Technical report (2014), 32 pages.
 Chakarov et al. (2016) Aleksandar Chakarov, Aditya Nori, Sriram Rajamani, Shayak Sen, and Deepak Vijaykeerthy. 2016. Debugging machine learning tasks. arXiv preprint arXiv:1603.07292 None, Article None (2016), 23 pages.
 Coates et al. (2011) Adam Coates, Andrew Ng, and Honglak Lee. 2011. An analysis of singlelayer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics. aistats, None, 215–223.
 Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In CVPR 2009, 8 pages.
 Flach (2016) Peter A. Flach. 2016. Classifier Calibration. Springer US, Boston, MA, 1–8. https://doi.org/10.1007/9781489975027_9001
 Friedman et al. (2001) Jerome Friedman, Trevor Hastie, and Robert Tibshirani. 2001. The elements of statistical learning. Vol. 1. Springer series in statistics New York, None.
 Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press, New York, NY, USA. http://www.deeplearningbook.org.
 Guo et al. (2017) Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. On Calibration of Modern Neural Networks. In Proceedings of the 34th International Conference on Machine Learning  Volume 70 (ICML’17). JMLR.org, None, Article None, 10 pages. http://dl.acm.org/citation.cfm?id=3305381.3305518
 Hastie et al. (2009) T. Hastie, R. Tibshirani, and J.H. Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, None. https://books.google.com/books?id=eBSgoAEACAAJ
 Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 None, Article None (2015), 9 pages.
 Ilyas et al. (2019) Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry. 2019. Adversarial examples are not bugs, they are features. arXiv preprint arXiv:1905.02175 (2019).
 Kapoor et al. (2007) Ashish Kapoor, Kristen Grauman, Raquel Urtasun, and Trevor Darrell. 2007. Active learning with gaussian processes for object categorization. In 2007 IEEE 11th International Conference on Computer Vision. IEEE, IEEE, None, 1–8.
 Kim et al. (2019) Jinhan Kim, Robert Feldt, and Shin Yoo. 2019. Guiding Deep Learning System Testing Using Surprise Adequacy. In Proceedings of the 41st International Conference on Software Engineering (ICSE ’19). IEEE Press, Piscataway, NJ, USA, Article None, 11 pages. https://doi.org/10.1109/ICSE.2019.00108
 Konečnỳ et al. (2016) Jakub Konečnỳ, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. 2016. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492 None, 10 (2016), None.
 Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. 2009. Learning multiple layers of features from tiny images. Technical Report. Citeseer.
 LeCun et al. (2015) Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521 (2015), 436–444. https://doi.org/10.1038/nature14539
 LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. 1998. Gradientbased learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324.
 Li et al. (2019a) Zenan Li, Xiaoxing Ma, Chang Xu, and Chun Cao. 2019a. Structural Coverage Criteria for Neural Networks Could Be Misleading. In Proceedings of the 41st International Conference on Software Engineering: New Ideas and Emerging Results (ICSENIER ’19). IEEE Press, Piscataway, NJ, USA, Article None, 4 pages. https://doi.org/10.1109/ICSENIER.2019.00031
 Li et al. (2019b) Zenan Li, Xiaoxing Ma, Chang Xu, Chun Cao, Jingwei Xu, and Jian Lu. 2019b. Boosting Operational DNN Testing Efficiency through Conditioning. In Proceedings of the 27th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE ’19). ACM, Tallinn, Estonia, Article None, 12 pages. http://arxiv.org/abs/1906.02533
 Ma et al. (2019) L. Ma, F. JuefeiXu, M. Xue, B. Li, L. Li, Y. Liu, and J. Zhao. 2019. DeepCT: Tomographic Combinatorial Testing for Deep Learning Systems. In 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER). ACM, New York, NY, USA, Article None, 8 pages. https://doi.org/10.1109/SANER.2019.8668044
 Ma et al. (2018) Lei Ma, Felix JuefeiXu, Fuyuan Zhang, Jiyuan Sun, Minhui Xue, Bo Li, Chunyang Chen, Ting Su, Li Li, Yang Liu, Jianjun Zhao, and Yadong Wang. 2018. DeepGauge: Multigranularity Testing Criteria for Deep Learning Systems. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE 2018). ACM, New York, NY, USA, Article None, 12 pages. https://doi.org/10.1145/3238147.3238202
 Müller et al. (2010) Henning Müller, Paul Clough, Thomas Deselaers, and Barbara Caputo. 2010. ImageCLEF: Experimental Evaluation in Visual Information Retrieval (1st ed.). Springer Publishing Company, Incorporated.
 Murphy (1973) Allan H. Murphy. 1973. A New Vector Partition of the Probability Score. Journal of Applied Meteorology 12, 4 (1973), 595–600. https://doi.org/10.1175/15200450(1973)012<0595:ANVPOT>2.0.CO;2
 Naeini et al. (2015) Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. 2015. Obtaining well calibrated probabilities using bayesian binning. In TwentyNinth AAAI Conference on Artificial Intelligence. AAAI, None, Article None, 7 pages.
 Ng (2016) Andrew Ng. 2016. Nuts and bolts of building AI applications using Deep Learning. Keynote at NIPS 2016.
 NiculescuMizil and Caruana (2005) Alexandru NiculescuMizil and Rich Caruana. 2005. Predicting good probabilities with supervised learning. In Proceedings of the 22nd international conference on Machine learning. ACM, JMLR.org, None, 625–632.
 Obermeyer and Emanuel (2016) Ziad Obermeyer and Ezekiel J Emanuel. 2016. Predicting the future—big data, machine learning, and clinical medicine. The New England journal of medicine 375, 13 (2016), 1216.
 Odena et al. (2019) Augustus Odena, Catherine Olsson, David Andersen, and Ian Goodfellow. 2019. TensorFuzz: Debugging Neural Networks with CoverageGuided Fuzzing. In Proceedings of the 36th International Conference on Machine Learning (Proceedings of Machine Learning Research), Kamalika Chaudhuri and Ruslan Salakhutdinov (Eds.), Vol. 97. PMLR, Long Beach, California, USA, 4901–4911. http://proceedings.mlr.press/v97/odena19a.html
 Pan and Yang (2009) Sinno Jialin Pan and Qiang Yang. 2009. A survey on transfer learning. IEEE Transactions on knowledge and data engineering 22, 10 (2009), 1345–1359.
 Pang et al. (2002) Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs Up? Sentiment Classification Using Machine Learning Techniques. In Proceedings of EMNLP. EMNLP, None, 79–86.
 Pasolli and Melgani (2011) Edoardo Pasolli and Farid Melgani. 2011. Gaussian process regression within an active learning scheme. In 2011 IEEE International Geoscience and Remote Sensing Symposium. IEEE, IEEE, None, 3574–3577.
 Pei et al. (2017) Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana. 2017. DeepXplore: Automated Whitebox Testing of Deep Learning Systems. In Proceedings of the 26th Symposium on Operating Systems Principles (SOSP ’17). ACM, New York, NY, USA, Article None, 18 pages. https://doi.org/10.1145/3132747.3132785
 Pham et al. (2019) Hung Viet Pham, Thibaud Lutellier, Weizhen Qi, and Lin Tan. 2019. CRADLE: crossbackend validation to detect and localize bugs in deep learning libraries. In Proceedings of the 41st International Conference on Software Engineering. IEEE Press, ICSE’19, None, 1027–1038.
 Platt (1999) John C. Platt. 1999. Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. In ADVANCES IN LARGE MARGIN CLASSIFIERS. MIT Press, None, 61–74.
 Rasmussen and Williams (2005) Carl Edward Rasmussen and Christopher K. I. Williams. 2005. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, None.
 Riley (2019) Patrick Riley. 2019. Three pitfalls to avoid in machine learning. Nature 572, 7767 (Jul 2019), 27–29. https://doi.org/10.1038/d41586-019-02307-y
 Seo et al. (2000) Sambu Seo, Marko Wallat, Thore Graepel, and Klaus Obermayer. 2000. Gaussian process regression: Active data selection and test point rejection. In Mustererkennung 2000. Springer, None, 27–34.
 Settles (2009) Burr Settles. 2009. Active learning literature survey. Technical Report. University of WisconsinMadison Department of Computer Sciences.
 Shalev et al. (2018) Gabi Shalev, Yossi Adi, and Joseph Keshet. 2018. Outofdistribution detection using multiple semantic label representations. In Advances in Neural Information Processing Systems. None, None, 7375–7385.
 Shokri and Shmatikov (2015) Reza Shokri and Vitaly Shmatikov. 2015. PrivacyPreserving Deep Learning. In Proceedings of the 22Nd ACM SIGSAC Conference on Computer and Communications Security (CCS ’15). ACM, New York, NY, USA, Article None, 12 pages. https://doi.org/10.1145/2810103.2813687
 Shu et al. (2018) Rui Shu, Hung Bui, Hirokazu Narui, and Stefano Ermon. 2018. A DIRTT Approach to Unsupervised Domain Adaptation. In International Conference on Learning Representations. ICLR, None, Article None, 19 pages. https://openreview.net/forum?id=H1qTMAW
 Sun et al. (2018) Youcheng Sun, Min Wu, Wenjie Ruan, Xiaowei Huang, Marta Kwiatkowska, and Daniel Kroening. 2018. Concolic Testing for Deep Neural Networks. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE 2018). ACM, New York, NY, USA, Article None, 11 pages. https://doi.org/10.1145/3238147.3238172
 Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition. CVPR, None, 2818–2826.
 Tewari and Bartlett (2007) Ambuj Tewari and Peter L Bartlett. 2007. On the consistency of multiclass classification methods. Journal of Machine Learning Research 8, May (2007), 1007–1025.
 Wang et al. (2019b) Jindong Wang et al. 2019b. Everything about Transfer Learning and Domain Adapation. http://transferlearning.xyz.
 Wang et al. (2019a) Jingyi Wang, Guoliang Dong, Jun Sun, Xinyu Wang, and Peixin Zhang. 2019a. Adversarial sample detection for deep neural network through model mutation testing. In Proceedings of the 41st International Conference on Software Engineering. IEEE Press, ICSE’19, None, 1245–1256.
 Wang et al. (2014) Xuezhi Wang, TzuKuo Huang, and Jeff Schneider. 2014. Active Transfer Learning under Model Shift. In Proceedings of the 31st International Conference on Machine Learning (Proceedings of Machine Learning Research), Eric P. Xing and Tony Jebara (Eds.). PMLR, Bejing, China, 1305–1313. http://proceedings.mlr.press/v32/wangi14.html
 Zadrozny and Elkan (2001) Bianca Zadrozny and Charles Elkan. 2001. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In None. Citeseer, Citeseer, None, Article None, 7 pages.
 Zadrozny and Elkan (2002) Bianca Zadrozny and Charles Elkan. 2002. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, KDD, None, 694–699.
 Zhang et al. (2019) Jie M. Zhang, Mark Harman, Lei Ma, and Yang Liu. 2019. Machine Learning Testing: Survey, Landscapes and Horizons. CoRR abs/1906.10742, Article None (2019), 35 pages. arXiv:1906.10742 http://arxiv.org/abs/1906.10742
 Zhang et al. (2018) Mengshi Zhang, Yuqun Zhang, Lingming Zhang, Cong Liu, and Sarfraz Khurshid. 2018. DeepRoad: GANbased Metamorphic Testing and Input Validation Framework for Autonomous Driving Systems. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE 2018). ACM, New York, NY, USA, Article None, 11 pages. https://doi.org/10.1145/3238147.3238187
 Zhou (2016) ZhiHua Zhou. 2016. Learnware: On the Future of Machine Learning. Front. Comput. Sci. 10, 4, Article None (Aug. 2016), 2 pages. https://doi.org/10.1007/s1170401669063
 Zhu et al. (2009) Jingbo Zhu, Huizhen Wang, Benjamin K Tsou, and Matthew Ma. 2009. Active learning with sampling by uncertainty and density for data annotations. IEEE Transactions on audio, speech, and language processing 18, 6 (2009), 1323–1331.
 Zhu (2005) Xiaojin Jerry Zhu. 2005. Semisupervised learning literature survey. Technical Report. University of WisconsinMadison Department of Computer Sciences.