Quantile Regularization: Towards Implicit Calibration of Regression Models
Recent works have shown that most deep learning models are poorly calibrated, i.e., they may produce overconfident predictions that are wrong. It is therefore desirable to have models that produce reliable predictive uncertainty estimates. Several approaches have been proposed recently to calibrate classification models. However, there is relatively little work on calibrating regression models. We present a method for calibrating regression models based on a novel quantile regularizer defined as the cumulative KL divergence between two CDFs. Unlike most existing approaches for calibrating regression models, which are based on post-hoc processing of the model’s output and require an additional dataset, our method is trainable in an end-to-end fashion without requiring an additional dataset. The proposed regularizer can be used with any training objective for regression. We also show that post-hoc calibration methods like Isotonic Calibration sometimes compound miscalibration, whereas our method provides consistently better calibration. We provide empirical results demonstrating that the proposed quantile regularizer significantly improves calibration for regression models trained using approaches such as Dropout VI and Deep Ensembles.
Calibration measures how well a model’s confidence in its predictions matches the correctness of those predictions. For example, a binary classifier is considered perfectly calibrated if, among all predictions with probability score 0.9, 90% are correct [guo2017calibration]. Likewise, consider a Bayesian regression model that produces credible intervals. In this setting, the model is considered perfectly calibrated if the 90% credible interval contains 90% of the test points [kuleshov2018accurate]. Unfortunately, modern deep neural networks are known to be poorly calibrated [guo2017calibration].
While there has been a significant amount of recent work on calibrating classification models [guo2017calibration, kumar2018trainable], relatively little work exists on calibrating regression models. Recently, [kuleshov2018accurate] proposed a post-hoc method for calibrating regression models. Their approach is inspired by Platt scaling [platt1999probabilistic], commonly used for calibrating classification models. However, post-hoc methods like [kuleshov2018accurate] rely on the availability of large quantities of labeled i.i.d. data that is needed to achieve well-calibrated models.
In this work, we introduce quantile regularization, a method that can be trained in an end-to-end manner, unlike the post-hoc calibration methods that require large quantities of labeled data. The regularizer we propose is defined as the cumulative KL divergence between two CDFs. Moreover, our method is broadly applicable: it can be used with any regression model that produces a predictive mean and predictive variance, by augmenting its training objective with the proposed regularizer.
Before describing our approach, we first provide a brief overview of calibration approaches proposed for classification and regression models.
1.1 Classification Calibration
The notion of calibration was first considered in the meteorology literature [brier1950verification, murphy1972scalar, gneiting2007strictly] and saw one of its first prominent uses in the machine learning literature in [platt1999probabilistic], in the context of support vector machines (SVMs), to obtain probabilistic predictions from SVMs, which are non-probabilistic models. There has been renewed interest in calibration, especially for classification models, after [guo2017calibration] showed that modern classification networks are not well-calibrated.
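Currently there are three main notions of calibration for classification [kumar2019verified, vaicenavicius2019evaluating, kull2019beyond], listed below. For the rest of this section, assume $X, Y$ to be random variables on spaces $\mathcal{X}$ and $\mathcal{Y} = \{1, \dots, K\}$, $\mathbb{P}$ to be their true joint distribution, and $h$ to be the model that outputs a probability distribution on $\mathcal{Y}$. Therefore, we can represent the model as $h: \mathcal{X} \to \Delta^{K}$, where $\Delta^{K}$ is the probability simplex over the $K$ classes and $h_k(X)$ is the probability assigned to class $k$. The three notions are as follows:
Top-label calibration: $\mathbb{P}\big(Y = \arg\max_k h_k(X) \mid \max_k h_k(X) = p\big) = p$ for all $p \in [0,1]$. It says that, among all instances for which the model predicts the most probable class with confidence $p$, the proportion of instances that actually belong to the predicted class should be $p$.
Marginal calibration: $\mathbb{P}\big(Y = k \mid h_k(X) = p\big) = p$ for all classes $k$ and all $p \in [0,1]$. It says that, among all instances for which the model predicts class $k$ with confidence $p$, the proportion of instances that actually belong to class $k$ should be $p$.
Joint calibration: $\mathbb{P}\big(Y = k \mid h(X) = q\big) = q_k$ for all classes $k$ and all $q \in \Delta^{K}$. It says that, among all instances for which the distribution $q$ is predicted, the probability of belonging to class $k$ is actually $q_k$.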
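As a rough illustration of the first notion, the following sketch estimates a binned top-label calibration error. It is only an assumption-laden example: the function name `top_label_calibration_error`, the equal-width binning, and the toy data are hypothetical and are not taken from any of the cited works.

```python
import numpy as np

def top_label_calibration_error(probs, labels, n_bins=10):
    """Binned gap between top-label confidence and accuracy (hypothetical helper)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    conf = probs.max(axis=1)                 # confidence of the predicted class
    pred = probs.argmax(axis=1)              # predicted (most probable) class
    correct = (pred == labels).astype(float)
    # Equal-width confidence bins; confidence 1.0 falls into the last bin.
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    err = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            # Weight each bin by the fraction of samples it contains.
            err += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return err

# Toy data: four predictions, each with confidence 0.75 for class 0.
probs = np.array([[0.75, 0.25]] * 4)
calibrated = top_label_calibration_error(probs, [0, 0, 0, 1])     # 3 of 4 correct
miscalibrated = top_label_calibration_error(probs, [0, 1, 1, 1])  # 1 of 4 correct
```

In the first toy case accuracy matches confidence, so the error vanishes; in the second, accuracy is 0.25 against a confidence of 0.75.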
Most calibration methods [platt1999probabilistic, zadrozny2001obtaining, zadrozny2002transforming, guo2017calibration, kull2017beta, kull2019beyond] for classification models are post-hoc: they learn a calibration mapping using an additional dataset to recalibrate an already trained model. Recent work has shown that some of these popular post-hoc methods are either themselves miscalibrated or sample inefficient [kumar2019verified], and that they do not actually help the model output well-calibrated probabilities.
An alternative to post-hoc processing is to ensure that the model outputs well-calibrated probabilities during training itself. These are implicit calibration methods. Such an approach does not require an additional dataset to learn the calibration mapping. While almost all post-hoc calibration mechanisms can be seen as density estimation methods, existing implicit calibration methods are of various types. Several heuristics, like Mixup [zhang2017mixup, thulasidasan2019mixup] and Label Smoothing [szegedy2016rethinking, muller2019does], that were part of high-performance deep networks for classification were later shown empirically to improve calibration. [maddox2019simple] show that their optimization method improves calibration. [pereyra2017regularizing] found that penalizing high-confidence predictions acts as a regularizer. A more principled way of achieving calibration is to minimize a loss function that is tailored for calibration [kumar2018trainable]. This is similar in spirit to our proposed approach, which does the same for regression models.
1.2 Regression Calibration
There has been relatively little work on regression calibration. Among the early approaches, [gneiting2007probabilistic] were the first to address this issue by proposing a framework for calibration. However, they do not provide any procedure to correct a miscalibrated model. Recently, [kuleshov2018accurate] proposed Quantile Calibration, which intuitively says that the $p$ credible interval predicted by the model should contain the target variable with probability $p$. They also propose a post-hoc recalibration method based on isotonic regression [fawcett2007pav], which is a well-known recalibration technique for classification models. Recently, [DBLP:conf/icml/SongDKF19] proposed a much stronger notion of calibration called distributional calibration, which guarantees that, among all instances whose predicted PDF has mean $\mu$ and standard deviation $\sigma$, the actual distribution of the target variable should have mean $\mu$ and standard deviation $\sigma$. This can be seen as the regression analog of joint calibration for classification (Sec. 1.1). They too propose a post-hoc recalibration method, based on Gaussian processes. Among other work, [keren2018calibrated] consider a different setting in which neural networks for classification are used for regression problems, and show that temperature scaling [hinton2015distilling, guo2017calibration] and their proposed method based on empirical prediction intervals improve calibration. Again, these are post-hoc methods.
1.3 Quantile Calibration and Isotonic Regression
The notion of calibration that we consider in this work is quantile calibration. Isotonic regression is currently used for quantile calibration [kuleshov2018accurate]. However, isotonic regression has the following disadvantages:
It is a powerful nonparametric method that has a tendency to overfit, so much so that it passes exactly through the datapoints if they already satisfy the monotonicity constraint.
Using an isotonic calibration mapping will result in a non-smooth and piecewise linear calibrated CDF. Consequently, the calibrated PDF is discontinuous.
It is a post-hoc method and ideally requires an additional dataset to learn the calibration mapping.
Considering these shortcomings, we propose an end-to-end trainable loss function for quantile calibration. Our approach leverages a novel regularizer that is defined as a cumulative KL divergence (the KL divergence between two CDFs). With our approach, the smoothness of the PDF/CDF is maintained for well-calibrated probabilities. Moreover, our approach eliminates the need for a separate calibration dataset. To the best of our knowledge, this is the first trainable loss function for any notion of calibration in the regression setting.
The rest of the paper is organized as follows. Section 2 sets up the notation and background and presents the problem setting formally. In Section 3, we present our proposed method. Section 4 discusses the experimental analysis. In Section 5, we conclude and briefly discuss avenues for future work.
2 Background and Definitions
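Throughout the paper, $X$ and $Y$ will denote random variables on spaces $\mathcal{X}$ and $\mathcal{Y} \subseteq \mathbb{R}$ with true distribution $\mathbb{P}$, and $\{(x_i, y_i)\}_{i=1}^{N}$ will denote i.i.d. samples from this distribution. We assume that the CDFs of random variables are invertible.
Any probabilistic regression model can be seen as a conditional CDF, which gives a distribution function on $\mathcal{Y}$ corresponding to each instance from the input space $\mathcal{X}$. We represent the model as $F: \mathcal{X} \to (\mathcal{Y} \to [0,1])$ and write $F_x$ for the CDF predicted at input $x$.
Assume $F_x$ is the distribution function predicted for input $x$, corresponding to the true distribution function $G_x$. Ideally we want to predict the true distribution, i.e., $F_x = G_x$. This is equivalent to saying that $G_x\big(F_x^{-1}(p)\big) = p$ for all $p \in (0,1)$. Based on this, [gneiting2007probabilistic] propose the following definition.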
Definition 1 (Complete Probabilistic Calibration)
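Given a model $F$ and the true underlying model $G$, the model is said to be completely probabilistically calibrated iff, for every sequence of inputs $x_1, x_2, \dots$,
$$\frac{1}{T} \sum_{t=1}^{T} G_{x_t}\big(F_{x_t}^{-1}(p)\big) \to p \quad \text{as } T \to \infty, \text{ for all } p \in (0,1).$$
Since $G$ is unknown, [kuleshov2018accurate] propose a sufficient condition for the above definition that is useful in practice.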
Definition 2 (Quantile Calibration)
Given a model $F$ and $(X, Y)$ jointly distributed as $\mathbb{P}$, the model $F$ is said to be quantile calibrated iff
$$\mathbb{P}\big(F_X(Y) \le p\big) = p \quad \text{for all } p \in [0,1].$$
The key to understanding the above definition is the random variable under consideration, $F_X(Y)$. Note that $F_X(Y)$ is the cumulative probability that the model predicts for a random pair $(X, Y)$ whose underlying distribution is $\mathbb{P}$.
The importance of such a definition is that we get calibrated confidence/credible intervals, which are critical for reliable uncertainty estimates. Its usefulness was demonstrated empirically in [kuleshov2018accurate], who developed a post-hoc calibration method using the above notion of quantile calibration.
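To make the definition concrete, the sketch below checks quantile calibration empirically for a Gaussian predictive distribution by comparing the fraction of values $F_{x_i}(y_i)$ that fall below each level $p$ with $p$ itself. The helper names (`gaussian_cdf`, `empirical_coverage`), the synthetic data, and the choice of levels are assumptions made purely for illustration.

```python
import math
import numpy as np

def gaussian_cdf(y, mu, sigma):
    """Predicted CDF value F_x(y) for a Gaussian predictive N(mu, sigma^2)."""
    z = (np.asarray(y, dtype=float) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + np.vectorize(math.erf)(z))

def empirical_coverage(cdf_values, levels):
    """Empirical estimate of P(F_X(Y) <= p) for each level p."""
    cdf_values = np.asarray(cdf_values)
    return np.array([np.mean(cdf_values <= p) for p in levels])

rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, size=20000)          # targets drawn from N(0, 1)
levels = np.linspace(0.1, 0.9, 9)

# A model that predicts the true distribution is quantile calibrated.
cov_good = empirical_coverage(gaussian_cdf(y, 0.0, 1.0), levels)
# A model with too small a predictive standard deviation is overconfident.
cov_over = empirical_coverage(gaussian_cdf(y, 0.0, 0.5), levels)
```

For the well-specified model the empirical coverage tracks the levels closely, while the overconfident model deviates substantially (for example, far more than 10% of the values fall below the level 0.1).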
Existing calibration approaches can be divided into two types.
Post-hoc Calibration: This approach recalibrates a pre-trained model using a separate calibration dataset by learning the canonical calibration mapping [vaicenavicius2019evaluating].
Implicit Calibration: This approach ensures that the model is calibrated during training itself, without explicitly using a separate dataset.
2.1 Post-hoc calibration
The objective of post-hoc calibration is to calibrate a miscalibrated model $F$ by learning a mapping $R: [0,1] \to [0,1]$ such that $R \circ F$ is a calibrated model. One such mapping can be obtained from the definition of calibration itself. Setting $R(p) = \mathbb{P}\big(F_X(Y) \le p\big)$ makes $R \circ F$ a quantile calibrated model. [vaicenavicius2019evaluating] refer to an analogous mapping in the context of classification as the canonical calibration mapping. We use the same name for it in our regression setting.
For any model $F$ and its canonical calibration mapping $R$, the composed model $R \circ F$ is quantile calibrated.
The proof of this proposition can be found in Appendix A1.
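For intuition, the argument can be sketched in one chain of equalities. This is only a sketch under assumed notation, not the appendix proof: $R(p) = \mathbb{P}(F_X(Y) \le p)$ denotes the canonical calibration mapping, and $R$ is assumed to be strictly increasing and invertible, in line with the invertibility assumption on CDFs made above.

```latex
\begin{aligned}
\mathbb{P}\big(R(F_X(Y)) \le p\big)
  &= \mathbb{P}\big(F_X(Y) \le R^{-1}(p)\big) && \text{($R$ increasing and invertible)} \\
  &= R\big(R^{-1}(p)\big)                     && \text{(definition of $R$)} \\
  &= p .
\end{aligned}
```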
With this insight, and using the fact that the mapping $R$ is monotonically increasing, [kuleshov2018accurate] use isotonic regression to learn this mapping on the training dataset itself, without any separate dataset, claiming that it does not overfit much. Given $\{(p_i, \hat{p}_i)\}_{i=1}^{N}$, and assuming that $p_1 \le \dots \le p_N$, isotonic regression finds $R$ by minimizing the following objective:
$$\min_{R} \sum_{i=1}^{N} \big(R(p_i) - \hat{p}_i\big)^2 \quad \text{subject to } R(p_i) \le R(p_j) \text{ whenever } p_i \le p_j.$$
In isotonic calibration [kuleshov2018accurate], given training data $\{(x_i, y_i)\}_{i=1}^{N}$, the recalibration dataset is generated as $\{(F_{x_i}(y_i), \hat{P}(F_{x_i}(y_i)))\}_{i=1}^{N}$, where $\hat{P}(p) = \frac{1}{N}\,|\{y_i : F_{x_i}(y_i) \le p\}|$. The isotonic calibration mapping is then fit on this recalibration dataset. However, this approach can be prone to overfitting. One way to see why is that the recalibration dataset already satisfies the monotonicity constraint, because $\hat{P}(p_i) \le \hat{P}(p_j)$ if $p_i \le p_j$. So, to minimize the loss, the calibration mapping passes through the recalibration points exactly. It is also a non-parametric method, which can overfit when data is scarce. For this reason, [kuleshov2018accurate] used the training data itself in order to have plenty of data to learn the calibration mapping. Consequently, recalibrating a pre-trained model requires access to the data on which the model was trained. Another disadvantage is that the isotonic mapping is a piecewise linear monotonic function, with which the predicted CDF has to be composed at test time. This results in non-smooth CDFs, which may not be desirable.
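The interpolation behaviour described above can be reproduced with a small sketch. It uses a hand-written pool-adjacent-violators routine rather than any particular library, and the names `pav` and `recalibration_dataset`, as well as the Beta-distributed stand-in for the values $F_{x_i}(y_i)$ of a miscalibrated model, are hypothetical choices for illustration only.

```python
import numpy as np

def pav(targets):
    """Pool-adjacent-violators: least-squares nondecreasing fit to `targets`,
    which are assumed to be ordered by their x-coordinate."""
    blocks = []  # each block stores [mean, weight]
    for t in targets:
        blocks.append([float(t), 1])
        # Merge neighbouring blocks while the monotonicity constraint is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2 = blocks.pop()
            m1, w1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2])
    return np.concatenate([np.full(w, m) for m, w in blocks])

def recalibration_dataset(cdf_values):
    """Pairs (p_i, P_hat(p_i)) with p_i the sorted predicted CDF values."""
    p = np.sort(np.asarray(cdf_values, dtype=float))
    p_hat = np.array([np.mean(p <= v) for v in p])  # empirical P(F_X(Y) <= p_i)
    return p, p_hat

rng = np.random.default_rng(0)
cdf_values = rng.beta(2.0, 5.0, size=200)   # stand-in for a miscalibrated model
p, p_hat = recalibration_dataset(cdf_values)
fit = pav(p_hat)                             # isotonic fit on the recalibration set
```

Because the targets `p_hat` are already nondecreasing in `p`, no blocks are merged and the fit reproduces the recalibration points exactly, which is the interpolation behaviour discussed in the text.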
2.2 Implicit Calibration
In contrast to post-hoc calibration, implicit calibration ensures that the model is well-calibrated by imposing a strong inductive bias towards model parameters that yield well-calibrated predictions. Our approach can be seen as a regression analog of [kumar2018trainable], who designed a trainable loss function for classification by kernelizing the calibration error, and of [pereyra2017regularizing], who penalize low-entropy (over-confident) softmax outputs.
3 Quantile Regularization
Recall that, in quantile calibration, we want $\mathbb{P}\big(F_X(Y) \le p\big) = p$ for all $p \in [0,1]$. Note that both the left and the right hand sides can be seen as CDFs of some random variables. Let $H(p) = \mathbb{P}\big(F_X(Y) \le p\big)$ and $G(p) = p$. Here $H$ can be seen as the CDF of $S = F_X(Y)$, while $G$ can be seen as the CDF of Uniform[0,1]. So quantile calibration essentially wants the two CDFs to be equal. This is equivalent to saying that, for a perfectly quantile calibrated model, $S = F_X(Y)$ follows the Uniform[0,1] distribution. Our approach is based on this equivalence. Essentially, we penalize the model if the r.v. $S$ deviates from Uniform[0,1]. This property can be used to design a calibration penalty that is minimized jointly with the training loss, yielding a well-calibrated model during training itself.
One possible divergence that one could use is the KL divergence. The KL divergence between a distribution on $[0,1]$ and the uniform distribution is equal to the negative differential entropy of that distribution. This yields a very interpretable route to calibration: maximizing the differential entropy of $F_X(Y)$. However, in practice, this would require density estimation, for example with the Beta kernel [chen1999beta], followed by computing the entropy. Therefore, we use other divergences that result in loss functions that are simpler to train.
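The relation between the KL divergence to the uniform distribution and the differential entropy follows in one line. In the sketch below, $f$ is an assumed density on $[0,1]$, $u(s) = 1$ is the Uniform[0,1] density, and $h(f)$ denotes the differential entropy of $f$; these symbols are introduced here only for illustration.

```latex
\mathrm{KL}(f \,\|\, u)
  = \int_0^1 f(s) \log \frac{f(s)}{u(s)} \, ds
  = \int_0^1 f(s) \log f(s) \, ds
  = -h(f).
```

Driving this divergence to zero is therefore the same as pushing the differential entropy of $f$ up to its maximum value of zero on $[0,1]$, which is attained only by the uniform density.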
3.1 Cumulative KL divergence
Cumulative KL divergence (CKL) [baratpour2012testing] is based on cumulative residual entropy (CRE) [rao2004cumulative]. We derive an analytical closed-form expression for the CKL between a distribution with support on $[0,1]$ and Uniform[0,1], and use this divergence for our calibration method.
Definition 3 (Cumulative Residual Entropy)
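Let $Z$ be a non-negative r.v. with CDF $F$, and let $\bar{F} = 1 - F$ be its survival function. Then the cumulative residual entropy is defined as
$$\mathcal{E}(Z) = -\int_{0}^{\infty} \bar{F}(z) \log \bar{F}(z)\, dz.$$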
Definition 4 (Cumulative KL divergence)
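Let $Z, W$ be non-negative r.v.s with CDFs $F, G$, and let $\bar{F}, \bar{G}$ be the corresponding survival functions. Then the cumulative KL divergence between $F$ and $G$ is defined as
$$\mathrm{CKL}(F : G) = \int_{0}^{\infty} \bar{F}(z) \log \frac{\bar{F}(z)}{\bar{G}(z)}\, dz - \big(\mathbb{E}[Z] - \mathbb{E}[W]\big).$$
The cumulative KL divergence has properties similar to those of the standard KL divergence. In particular, for any CDFs $F, G$, $\mathrm{CKL}(F : G) \ge 0$, and $\mathrm{CKL}(F : G) = 0$ iff $F = G$.
Consider a random variable $S$ with CDF $F$ supported on $[0,1]$, and let $U \sim \text{Uniform}[0,1]$ with CDF $G$. Then the CKL in terms of the cumulative residual entropy is as follows:
$$\mathrm{CKL}(F : G) = -\mathcal{E}(S) - \int_{0}^{1} \bar{F}(z) \log(1 - z)\, dz - \mathbb{E}[S] + \frac{1}{2}.$$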
The proof of the above proposition can be found in Appendix A1.
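As a hedged sketch of how such an expression can be obtained (not the appendix proof), assume the CKL form of [baratpour2012testing], $\mathrm{CKL}(F : G) = \int \bar{F} \log(\bar{F}/\bar{G}) - (\mathbb{E}[S] - \mathbb{E}[U])$, and substitute the Uniform[0,1] survival function $\bar{G}(z) = 1 - z$ and $\mathbb{E}[U] = 1/2$. The symbols $S$, $U$, $\bar{F}$, and $\bar{G}$ are notation assumed here. The last step uses Fubini's theorem to rewrite the cross term as an expectation, which is convenient when working with samples.

```latex
\begin{aligned}
\mathrm{CKL}(F : G)
  &= \int_0^1 \bar{F}(z) \log \frac{\bar{F}(z)}{1 - z}\, dz - \Big(\mathbb{E}[S] - \tfrac{1}{2}\Big) \\
  &= \int_0^1 \bar{F}(z) \log \bar{F}(z)\, dz - \int_0^1 \bar{F}(z) \log(1 - z)\, dz - \mathbb{E}[S] + \tfrac{1}{2}, \\[4pt]
\int_0^1 \bar{F}(z) \log(1 - z)\, dz
  &= \mathbb{E}\Big[\int_0^{S} \log(1 - z)\, dz\Big]
   = -\,\mathbb{E}\big[(1 - S)\log(1 - S)\big] - \mathbb{E}[S], \\[4pt]
\mathrm{CKL}(F : G)
  &= \int_0^1 \bar{F}(z) \log \bar{F}(z)\, dz + \mathbb{E}\big[(1 - S)\log(1 - S)\big] + \tfrac{1}{2}.
\end{aligned}
```

As a sanity check, for $S \sim \text{Uniform}[0,1]$ both the integral and the expectation equal $-1/4$, so the expression evaluates to zero.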
Given i.i.d. samples $s_1, \dots, s_n$ of $S$, let $s_{(1)} \le \dots \le s_{(n)}$ denote the ordered samples; then the following is a consistent estimator of the above expression
The proof of the above proposition can be found in Appendix A1.
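One simple way to estimate the CKL to Uniform[0,1] from samples is to plug the empirical survival function into the definition. The sketch below does exactly that; it is an assumed plug-in construction and may differ in form from the estimator stated in the paper. The function name `ckl_to_uniform` is hypothetical.

```python
import numpy as np

def ckl_to_uniform(s):
    """Plug-in estimate of CKL(F_S : Uniform[0,1]) from samples s in [0, 1]."""
    s = np.sort(np.asarray(s, dtype=float))   # ordered samples s_(1) <= ... <= s_(n)
    n = s.size
    # The empirical survival function equals 1 - i/n on [s_(i), s_(i+1)), i = 1..n-1.
    surv = 1.0 - np.arange(1, n) / n
    gaps = np.diff(s)
    # Integral of S_n(z) log S_n(z); this term needs the ordered samples.
    cre_term = np.sum(gaps * surv * np.log(surv))
    # Sample mean of (1 - s) log(1 - s), with the convention 0 * log 0 = 0.
    one_minus = 1.0 - s
    safe = np.where(one_minus > 0, one_minus, 1.0)
    mean_term = np.mean(one_minus * np.log(safe))
    return cre_term + mean_term + 0.5

n = 1000
grid = (np.arange(1, n + 1) - 0.5) / n       # evenly spread, close to uniform
near_uniform = ckl_to_uniform(grid)
point_mass = ckl_to_uniform(np.full(100, 0.5))   # all mass at 0.5
```

Evenly spread samples give a value close to zero, whereas a point mass at 0.5 gives $0.5 + 0.5 \log 0.5$, which is what the definition yields for that distribution.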
3.2 Calibration loss function
In our case, the random variable is $S = F_X(Y)$, where $F$ is the model. Given i.i.d. samples $\{(x_i, y_i)\}_{i=1}^{N}$ in the training data, we need to generate samples $s_i = F_{x_i}(y_i)$ to compute the expression given in Eq. 4.
Note that we want to make this part of the training procedure to achieve implicit calibration. However, we are faced with a challenge here. In particular, we need ordered samples to compute the first summation in Eq. 4, whereas sorting is not a differentiable operation. There are many differentiable approximations to the sorting operation. We use NeuralSort [grover2019stochastic] in our experiments for its simplicity. The algorithm for computing the loss function is summarized below.
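To illustrate the idea of a differentiable relaxation of sorting, here is a forward-pass sketch of the NeuralSort relaxation in NumPy, where row $i$ of the relaxed permutation matrix is a softmax of $\big((n + 1 - 2i)\,s - A_s \mathbf{1}\big)/\tau$ with $A_s[j,k] = |s_j - s_k|$. The function name `neural_sort` and the temperature values are assumptions for illustration; in training one would implement the same expression in an automatic-differentiation framework so that gradients flow through it.

```python
import numpy as np

def neural_sort(s, tau=1.0):
    """Relaxed permutation matrix; row i softly selects the i-th largest entry of s."""
    s = np.asarray(s, dtype=float)
    n = s.size
    abs_diff_sum = np.abs(s[:, None] - s[None, :]).sum(axis=1)   # A_s 1
    coeff = n + 1 - 2 * np.arange(1, n + 1)                      # n + 1 - 2i
    logits = (coeff[:, None] * s[None, :] - abs_diff_sum[None, :]) / tau
    logits -= logits.max(axis=1, keepdims=True)                  # stable softmax
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

s = np.array([0.2, 0.9, 0.5])
perm = neural_sort(s, tau=0.01)
soft_sorted_desc = perm @ s           # approximately sorted in descending order
soft_sorted_asc = soft_sorted_desc[::-1]
```

With a small temperature the relaxed matrix is close to a hard permutation and the product recovers the sorted values; larger temperatures give smoother but less exact orderings.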
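The overall loss function with quantile regularization is as follows. Given training data $D = \{(x_i, y_i)\}_{i=1}^{N}$, let $s_i = F_{x_i}(y_i)$, let $w$ be the parameters of the model, let $\mathrm{NLL}(D, w)$ be the negative log-likelihood, and let $\mathcal{L}_{\text{cal}}(s_1, \dots, s_N)$ be the calibration loss computed by Algorithm 1. The training objective is
$$\min_{w} \ \mathrm{NLL}(D, w) + \lambda\, \mathcal{L}_{\text{cal}}(s_1, \dots, s_N),$$
where $\lambda \ge 0$ controls the strength of the regularization.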
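A minimal sketch of such a regularized objective for a Gaussian predictive distribution is given below. Everything in it is an assumption for illustration: the names `gaussian_nll`, `gaussian_pit`, `ckl_to_uniform`, `regularized_loss`, and `lam` are hypothetical, the calibration term is a plug-in CKL estimate rather than the paper's Algorithm 1, and it uses a hard sort, which would be replaced by a differentiable relaxation during training.

```python
import math
import numpy as np

def gaussian_nll(y, mu, sigma):
    """Average negative log-likelihood of y under N(mu, sigma^2)."""
    return np.mean(0.5 * np.log(2.0 * np.pi * sigma**2) + 0.5 * ((y - mu) / sigma) ** 2)

def gaussian_pit(y, mu, sigma):
    """Predicted CDF values s_i = F_{x_i}(y_i) for a Gaussian predictive."""
    z = (y - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + np.vectorize(math.erf)(z))

def ckl_to_uniform(s):
    """Plug-in CKL between the empirical CDF of s and Uniform[0,1] (hard sort)."""
    s = np.sort(s)
    n = s.size
    surv = 1.0 - np.arange(1, n) / n
    first = np.sum(np.diff(s) * surv * np.log(surv))
    rest = 1.0 - s
    second = np.mean(rest * np.log(np.where(rest > 0, rest, 1.0)))
    return first + second + 0.5

def regularized_loss(y, mu, sigma, lam):
    """NLL plus lam times the calibration penalty on the predicted CDF values."""
    return gaussian_nll(y, mu, sigma) + lam * ckl_to_uniform(gaussian_pit(y, mu, sigma))

rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, size=5000)
mu = np.zeros_like(y)
penalty_calibrated = ckl_to_uniform(gaussian_pit(y, mu, 1.0))
penalty_overconfident = ckl_to_uniform(gaussian_pit(y, mu, 0.3))
```

On this synthetic data the penalty is close to zero when the predictive standard deviation matches the data and clearly larger when it is too small, so increasing `lam` pushes the optimizer away from overconfident solutions.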
3.3 Sharpness with Calibrated Predictions
Note that calibration alone is not sufficient for predictions to be accurate; sharpness is needed too. Our method can be seen as naturally achieving both desiderata. While the usual negative log-likelihood (NLL) makes sure that the predictions are sharp, the quantile regularizer makes sure that those predictions are calibrated too, with $\lambda$ controlling the strength of the regularization. As our experiments show, the RMSE and NLL scores do not worsen much even for large values of $\lambda$.
We evaluate our approach on various regression datasets in terms of the calibration error as well as other standard metrics, such as root-mean-squared error (RMSE) and negative log-likelihood (NLL). We experiment with two base models, MC Dropout [gal2016dropout] and Deep Ensembles [lakshminarayanan2017simple], by augmenting their objective functions with our proposed quantile regularizer.
Quantile Calibration Error
Given any model , we define the calibration error as follows