Non-Parametric Calibration of Probabilistic Regression


The task of calibration is to retrospectively adjust the outputs from a machine learning model to provide better probability estimates on the target variable. While calibration has been investigated thoroughly in classification, it has not yet been well-established for regression tasks. This paper considers the problem of calibrating a probabilistic regression model to improve the estimated probability densities over the real-valued targets. We propose to calibrate a regression model through its cumulative distribution function, which can be derived from calibrating a multi-class classifier. We provide three non-parametric approaches to solve the problem, two of which provide empirical estimates while the third provides smooth density estimates. The proposed approaches are experimentally evaluated to show their ability to improve the performance of regression models on the predictive likelihood.

1 Introduction

In predictive machine learning, (probability) calibration refers to a set of techniques that apply post-hoc modelling to correct the outputs from trained classifiers, so that the final outputs are better probability estimates on the target variable. Given a probabilistic two-class classifier, an output $s$ for the positive class is calibrated if the following condition holds: among all the instances receiving this prediction value $s$, the probability of observing a positive label is $s$. From a frequentist point of view, an estimated probability of 50% for rain tomorrow is calibrated if, among all the days receiving this probability estimate, it indeed rained on half of those days. A Bayesian might say that this is calibrated if it covers a group of days about which we should be maximally uncertain whether it will rain or not.

Calibration helps to make optimal decisions (e.g. setting a threshold on the classifier’s score) in cost-sensitive classification [24] and allows adapting to changing cost parameters without re-training the model. However, being calibrated does not necessarily imply that the classifier has good performance. For instance, a constant classifier outputting the marginal target distribution is calibrated by definition, but it is not a good predictive classifier as it does not separate the classes. In general, given a trained uncalibrated classifier, applying calibration can help improve the estimated probabilities, but not to further separate the feature space. One of the well-known parametric approaches is Logistic calibration [17], which uses the logistic function to map the SVM margins into better calibrated probabilities. [9] proposed the Beta calibration approach that allows more flexible adjustment besides a simple Sigmoid shape.

Calibration can be beneficial for regression tasks as well. Consider the toy dataset with a univariate regression task in Figure 1. In this task the feature contains useful information only in about half of the data (instances on the ascending line) while being non-informative on the other half (instances on the flat line). The actual conditional density of the target variable given the feature is shown in the left figure using background colour. The middle figure shows the predicted conditional densities from the Ordinary Least Squares (OLS) regression algorithm, which can be interpreted as assuming Gaussian conditional distributions with a shared standard deviation. Clearly, OLS does not capture the shape of the distribution, and considerably over- or under-estimates the densities in some regions. While this can be fixed by applying a different model that suits the data distribution, doing so requires knowing or assuming the true distribution family, which is often not feasible for high-dimensional and complex datasets. Alternatively, one can adopt methods from the field of conditional density estimation [19, 1, 20] or quantile regression [7, 11], without knowing the exact parametric form of the target distribution. However, such approaches normally require additional kernels or basis functions to be selected, which can be problematic for certain feature spaces.

In this paper we instead propose to take the predicted densities and improve them using a calibration procedure. The proposed approaches enable us to directly take the outputs from well-founded regression models, and apply post-hoc calibration to obtain better conditional density estimates. The figure on the right shows the result of applying GPC calibration (one of the calibration methods proposed in this paper) to the OLS model.

Figure 1: The left figure shows the true conditional distribution of the target given the feature (in greyscale), and some data points drawn from the distribution (in yellow). The middle figure shows the densities predicted by the Ordinary Least Squares regression method (unimodal Gaussians). The right figure shows the results of calibrating these predicted densities using our proposed GPC calibration method.

There are several benefits from calibrating a regression model: (1) As in the case of classification, calibrating a regression model improves the probability estimates on the target variable, and hence reduces the uncertainties in decision making. (2) Calibration can help to correct a mismatched distribution assumption, e.g. when the regression model assumes that the residuals are Gaussian whereas actually they are not. On the other hand, calibration of regression models can be hard: (1) Existing calibration methods from classification cannot be directly applied, as the conditional distribution of the target variable given the features is continuous rather than categorical. (2) Simple parametric methods might be insufficient to model the richness of continuous distributions.

In this paper we generalise the concept of calibration to regression tasks. We contribute to the problem by first defining what calibrated regression is, and then demonstrate the relationship between calibrating density functions and cumulative distribution functions. Based on the theory, an empirical approach is proposed to adapt the existing framework of calibration for classification. Another approach is further proposed to provide smooth estimation on the densities, which is based on a Gaussian process classifier. The calibrated density functions are non-parametric and hence suitable for any potential target distributions. We experimentally show that our calibration method can indeed increase the performance of regression models on their estimated densities.

The structure of the paper is as follows. Section 2 introduces calibration of binary classifiers. Section 3 gives the definitions of calibration and empirical calibration of a regression, with further theorems on the link between calibrated density functions and calibrated cumulative distribution functions. Section 4 introduces a simple method to adopt existing binary approaches to calibrate a regression empirically. Section 5 shows the proposed non-parametric approach based on a Gaussian Process classifier. Section 6 shows the experiments and results and Section 7 concludes the paper.

2 Classifier Calibration

In this section we introduce the general concept of calibration in classification. We use $X$ (in space $\mathcal{X}$) and $Y$ (in space $\mathcal{Y}$) to represent the random variables for the feature vector and target value, respectively. In the case of $K$-class ($K \geq 2$) classification, we denote $\mathcal{Y} = \{1, \dots, K\}$. The lower-case $x$ and $y$ are used to denote an instance of the feature vector and the target. Additionally, we use the notation $P$ and $p$ to distinguish probability mass and probability density.

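A probabilistic classifier is defined as a function $f$ from the feature space $\mathcal{X}$ to the probability simplex over the $K$ classes, so that:

$$f(x) = \big(\hat{P}(Y = 1 \mid X = x), \dots, \hat{P}(Y = K \mid X = x)\big)$$

Hence, the model takes a feature vector $x$ and outputs an estimated probability mass function (e.g. a categorical likelihood) on the target variable $Y$.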

2.1 Binary Classification

With the notations above, we can now give the definition of calibration in binary classification.

Definition 1 (Calibrated Binary Classification)

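A classifier $f$, with $f(X)$ denoting its predicted probability of the positive class, is defined as calibrated if and only if, given any $s \in [0, 1]$, the following holds:

$$P(Y = 1 \mid f(X) = s) = s$$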

As discussed in previous work [9], even if a classifier shows good performance on metrics such as accuracy or F-score, its output might not be calibrated. Depending on the properties of the classifier, we have two kinds of calibration. The first kind arises when the model does not provide probability estimates on the target variable, and we therefore need to derive calibrated probabilities from its outputs. For instance, an SVM by default only predicts margins optimised with the hinge loss, and therefore requires calibration to produce probabilistic outputs. The second kind occurs when a model is already probabilistic but its outputs are inaccurate due to its assumptions or approximations, so that further calibration is needed to obtain better estimates. One example in this category is the Naive Bayes classifier. While it is probabilistic and optimised via maximum likelihood, the independence assumption among the features makes its outputs poorly calibrated in general.

To solve the issues above, different approaches have been introduced to post-calibrate such classifiers. In this paper we only focus on the second kind of calibration, and we define binary calibration as a function that maps the uncalibrated score produced for each feature vector to an estimate of the true calibrated probability. In [24, 2], the authors provide a list of properties and benefits of having a calibrated output. Empirical binning has been used as a baseline method, which estimates an empirical distribution on the predicted score [23]. Recently, [15] proposed a Bayesian binning approach to improve the estimates by performing inference on a hidden binning scheme. Isotonic regression with its related PAV algorithm is one of the major non-parametric calibration methods [24, 3]. The method calibrates a model by recursively averaging neighbouring non-monotonic scores, so that a piece-wise constant non-decreasing calibration map is obtained at the end.
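
The averaging step of isotonic calibration can be made concrete with a small sketch. The following is a minimal pool-adjacent-violators (PAV) routine in plain Python, written for illustration only and not taken from [24, 3]; the function name `pav_calibration` is hypothetical, and tied scores are not pooled, which a full implementation would do.

```python
def pav_calibration(scores, labels):
    """Pool Adjacent Violators: return sorted scores and non-decreasing fitted values."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    # Each block stores [sum of labels, count]; the block value is sum / count.
    blocks = []
    for i in order:
        blocks.append([float(labels[i]), 1])
        # Merge backwards while the previous block's mean exceeds the current one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return [scores[i] for i in order], fitted


# Illustrative scores and binary labels (made up for this sketch).
sorted_scores, calibrated = pav_calibration(
    [0.4, 0.1, 0.3, 0.2, 0.6, 0.5], [0, 0, 0, 1, 1, 1])
```

The returned values form a piece-wise constant, non-decreasing map from score to calibrated probability; predictions for unseen scores would typically be read off by interpolation or by looking up the enclosing block.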

Logistic calibration can be seen as a special case of 1-D logistic regression, where the input is the uncalibrated output, and the model is fitted against the target to predict the calibrated probabilities. In the case of a probabilistic model, denoting the uncalibrated score as $s$, logistic calibration is given as:

$$\mu_{logistic}(s; w, b) = \frac{1}{1 + \exp(-(w s + b))}$$

Here $w$, $b$ are the estimated parameters.

One way to interpret logistic calibration, or multivariate logistic regression in general, is through Linear Discriminant Analysis (LDA) and the corresponding Gaussian assumption [4, 14]. Mathematically, LDA and logistic regression share the same functional form when calculating the target distribution for a given feature. The difference is that, while LDA estimates the parameters from the class prior, the Gaussian means, and the shared covariance matrix, logistic regression directly fits the weight and bias parameters through numerical optimisation.

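While the Gaussian assumption is reasonable for an input defined on the real line, it becomes less appropriate when calibrating a probabilistic model, where the input instead lies in the interval $[0, 1]$. Kull et al. [9] therefore proposed to use the Beta distribution to model the class-conditional density $p(S = s \mid Y = j)$, $j \in \{0, 1\}$, with $S$ denoting the random variable for the score produced by a classifier:

$$p(S = s \mid Y = j) = \frac{s^{\alpha_j - 1} (1 - s)^{\beta_j - 1}}{B(\alpha_j, \beta_j)}$$

With $B(\alpha_j, \beta_j)$ denoting the Beta normalisation constant. Beta calibration can then be stated in the following form:

$$\mu_{beta}(s; a, b, c) = \frac{1}{1 + 1 \Big/ \left( e^{c} \, \frac{s^{a}}{(1 - s)^{b}} \right)}$$

Here $a = \alpha_1 - \alpha_0$, $b = \beta_0 - \beta_1$, and $c = \ln \frac{B(\alpha_0, \beta_0)}{B(\alpha_1, \beta_1)}$.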

As in the case of LDA and logistic regression, Beta calibration can also be fitted via a generative approach or a discriminative approach. In the discriminative case, the authors also show that the parameters of a Beta calibration can be fitted through logistic regression with the bi-variate input $(\ln s, -\ln(1 - s))$. Beta calibration improves on logistic calibration as the calibration map is not necessarily sigmoidal, and is hence more versatile for the general purpose of calibration. The experiments in [9, 10] show improvements in log-loss and Brier score of Beta calibration over other calibration approaches on a set of model classes, including Naive Bayes, Logistic Regression, Support Vector Machine (SVM), Random Forest (RF), Multi-Layer Perceptron (MLP), and two variants of AdaBoost.
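
To illustrate the discriminative fit, the sketch below trains a logistic regression on the two features $\ln s$ and $-\ln(1 - s)$ by plain gradient descent on the log-loss. It is a minimal illustration under our own assumptions: the function names, learning rate, and step count are hypothetical, and the non-negativity constraint on the two shape parameters used in [9] is not enforced here.

```python
import math


def fit_beta_calibration(scores, labels, lr=0.1, steps=5000, eps=1e-12):
    """Fit (a, b, c) of a Beta calibration map by gradient descent on log-loss."""
    # Bi-variate input: (ln s, -ln(1 - s)), clipped away from 0 and 1.
    feats = [(math.log(max(s, eps)), -math.log(max(1.0 - s, eps))) for s in scores]
    a, b, c = 1.0, 1.0, 0.0  # this starting point is the identity map
    n = len(scores)
    for _ in range(steps):
        ga = gb = gc = 0.0
        for (u, v), y in zip(feats, labels):
            p = 1.0 / (1.0 + math.exp(-(a * u + b * v + c)))
            ga += (p - y) * u
            gb += (p - y) * v
            gc += (p - y)
        a -= lr * ga / n
        b -= lr * gb / n
        c -= lr * gc / n
    return a, b, c


def beta_map(s, a, b, c, eps=1e-12):
    """Calibrated probability for an uncalibrated score s in [0, 1]."""
    z = a * math.log(max(s, eps)) - b * math.log(max(1.0 - s, eps)) + c
    return 1.0 / (1.0 + math.exp(-z))


# Illustrative data (made up): scores from some classifier and binary labels.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 0, 1, 1]
a, b, c = fit_beta_calibration(scores, labels, steps=500)
calibrated = [beta_map(s, a, b, c) for s in scores]
```

With $a = b = 1$ and $c = 0$ the map reduces to the identity, which is why the fit starts there: the calibrated model can only improve on the uncalibrated log-loss on the calibration data.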

2.2 Multi-class Classification

The concept of calibration can be generalised to multi-class classification as follows.

Definition 2 (Calibrated Multi-class Classification)

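Let $f$ be a classifier and let $S = f(X)$ denote the random variable for the predicted probability vector. We define $f$ to be calibrated if and only if, for every possible vector $s = (s_1, \dots, s_K)$ in the $K$-dimensional probability simplex and every class $k \in \{1, \dots, K\}$, the following holds:

$$P(Y = k \mid S = s) = s_k$$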

Therefore, multi-class calibration asks that, given a predicted $K$-dimensional probability vector, every dimension of the vector is calibrated with respect to the corresponding target class. The simplest approach to calibrating a multi-class classifier is to apply binary calibration on each target value with the one-vs-rest strategy, and then normalise the obtained probability vector. As a direct extension, multinomial logistic regression is also commonly used in this case.

3 Calibration of Probabilistic Regression

Now we move on to probabilistic regression models, where the target is real-valued ($\mathcal{Y} = \mathbb{R}$). We again define a probabilistic regression model as a function, denoted as $f$, with

$$f(x) = \hat{p}(Y \mid X = x)$$

In words, a trained probabilistic regression model provides a probability density function on the target variable $Y$ given a feature vector $x$.

As shown above, calibration is a property of the predicted probability mass. To define calibration in probabilistic regression, we therefore need to consider the integral of density functions. One common approach to generalise concepts from discrete to continuous, as in deriving limiting density of discrete points from Shannon Entropy or certain conditional density estimation approaches [5], is to apply binning on the continuous variable. To begin with, consider a multi-class scenario, given a set of values, , according to Definition 2, a probabilistic regression model is calibrated on these values if the following equation is satisfied:


We hence denote , with . Therefore, given a pair of , can provide a estimated probability mass:


Equation 5 can then be turned into the following definition via denoting :

Definition 3 (Empirically Calibrated Probabilistic Regression)

Denoting and as above, a probabilistic regression is said to be empirically calibrated on , if for , the following equation holds:


Here we omit and , as by definition we always have and .

It then makes sense to define calibrated regression as the limiting case where and , that is, we have a set of infinitely smooth values of . Therefore, the condition can be replaced with , with being an instance of a cumulative distribution function in . This leads to our definition of a calibrated probabilistic regression:

Definition 4 (Calibrated Probabilistic Regression)

With and as defined above, a probabilistic regression is said to be calibrated if for , , the following equation holds:


While the definition above is formalised through the predicted cumulative distribution in analogy with classification, we now show that being calibrated in this sense also leads to “calibrated” densities.

Lemma 1

If a regression model is calibrated, as defined in Definition 4, then for , the following holds:


Denoting and as above, , as a particular pair of PDF and CDF, so that , we have:


Now we can show:

An important consequence of this lemma is that calibrating a regression model through its predicted cumulative distributions also improves the estimated densities on the target: for all the instances receiving the same predicted distribution, the true conditional PDF of the target equals the predicted PDF. Therefore, we can use the log-likelihood of the predicted PDFs as a measure to examine whether a model is well calibrated.
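
The step from calibrated CDFs to calibrated densities can be sketched in one line of calculus. The notation below is an assumption made for this sketch and may differ from the symbols of the original definitions: $S$ stands for the random predicted CDF, $s$ for one particular CDF, and $s'$ for its derivative (the predicted PDF), which is assumed to exist.

```latex
% Assumed notation: S = random predicted CDF, s = a particular CDF, s' = ds/dy.
\begin{align*}
  P(Y \le y \mid S = s) &= s(y)
    && \text{(calibration of the CDF, for all } s \text{ and } y\text{)} \\
  p(Y = y \mid S = s) &= \frac{\partial}{\partial y} P(Y \le y \mid S = s)
    = \frac{\partial}{\partial y} s(y) = s'(y)
    && \text{(differentiating both sides in } y\text{)}
\end{align*}
```

Read this way, the lemma says that conditioning on the predicted distribution and differentiating the calibration condition in the target value yields the same condition on densities.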

We are finally in a position to define calibration of a regression model as a function which takes a target value $y$ and a predicted CDF, and outputs a calibrated probability that the target is at most $y$ given that predicted CDF. However, there is one major difficulty in designing such post-calibration approaches: the input space of calibration is a set of functions. In classification, as the inputs are probability vectors, it is simple to adopt certain existing models, such as logistic regression. The situation is even simpler for the binary case, where the input space is the interval $[0, 1]$, which supports univariate approaches such as beta calibration and isotonic regression. A simple solution here is to address the problem by calibrating the model empirically in a binary manner, which we discuss in the next section.

4 The Empirical Approach: Adopting Logistic Calibration and Beta Calibration

Our first proposed approach is to discretise the target variable and by this transform the regression task into a multi-class classification task. We can then apply two-class calibration methods in the one-vs-rest manner to obtain multi-class probability estimates, interpretable as a piecewise constant conditional density function for the original regression calibration task.

As in Equation 5 and Definition 3, we first discretise the target variable by introducing segments defined by a sorted set of thresholds. Fitting of the calibration map for the regression model is performed as follows:

  1. For each class corresponding to one of the discretised segments we build a training dataset for learning a one-vs-rest calibration model. Every instance in the calibration fold of the regression task is transformed into the estimated probability mass that the regression model assigns to the segment, and the binary ground truth label indicating whether the observed target falls into the segment.

  2. A binary calibration model is trained separately on each class using the training data from step 1.

The CDF output by the regression model on a test instance is calibrated as follows:

  1. For each segment we calculate the predicted probability mass that the regression model puts on that segment.

  2. We apply the one-vs-rest calibration maps to the respective predicted probabilities and renormalise the results to ensure they add up to one. The calibrated probability vector thus contains one calibrated probability per segment.
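
The fitting and prediction steps above can be sketched as follows for a base model with Gaussian outputs. This is a minimal illustration, not the authors' implementation: the helper names are hypothetical, the two outer segments are taken to be open-ended tails (an assumption), and the per-segment binary calibration maps are passed in as plain callables so that any binary method could be plugged in.

```python
import math


def gaussian_cdf(y, mean, std):
    """CDF of a Gaussian predictive distribution."""
    return 0.5 * (1.0 + math.erf((y - mean) / (std * math.sqrt(2.0))))


def bin_masses(cdf, thresholds):
    """Predicted mass on (-inf, t_1], (t_1, t_2], ..., (t_K, +inf)."""
    cum = [0.0] + [cdf(t) for t in thresholds] + [1.0]
    return [cum[k + 1] - cum[k] for k in range(len(cum) - 1)]


def bin_label(y, thresholds):
    """Index of the segment containing the observed target y."""
    return sum(1 for t in thresholds if y > t)


def calibrate_masses(masses, binary_maps):
    """Apply one binary calibration map per segment, then renormalise."""
    raw = [m(p) for m, p in zip(binary_maps, masses)]
    total = sum(raw)
    return [r / total for r in raw]


# Illustrative use with made-up numbers: one prediction N(0.2, 1.0), three thresholds.
thresholds = [-1.0, 0.0, 1.0]
masses = bin_masses(lambda y: gaussian_cdf(y, 0.2, 1.0), thresholds)
# Placeholder maps (square root); in practice these are fitted per segment.
calibrated = calibrate_masses(masses, [lambda p: p ** 0.5] * len(masses))
```

For fitting, each calibration instance contributes, for segment $k$, the pair (`masses[k]`, `1 if bin_label(y, thresholds) == k else 0`) to the training set of the $k$-th binary calibration map.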

Within this method we can use any two-class calibration method. In the experiments we will use logistic calibration and beta calibration. Beta calibration is more appropriate here because the input to the calibration method is already in the probability range $[0, 1]$, whereas logistic calibration, being derived from Gaussian assumptions, is best suited to the full real-valued scale. However, for reference we have still decided to include logistic calibration in the experiments. We will refer to the corresponding regression calibration methods as e-logistic and e-beta, where e- stands for empirical.

An example with both predictions from e-logistic and e-beta is given in Figure 3. Notice that the calibration map in the middle of the figure is drawn by putting the uncalibrated CDF as the horizontal coordinate and using the calibrated CDF as the vertical coordinate, hence we refer to it as marginal calibration map as it marginalises the effect of . As shown in the figure, the calibrated PDFs from e-logistic and e-beta are close to each other, and both show a bi-modal shape around the original estimated mean of the Gaussian. In this particular case, the true value indeed falls into one of the modes. The interpretation here is natural, while the predicted Gaussian distribution is optimised for least errors, its uni-modal assumption pushes it to lie around the mean of the training values. Hence, by adopting calibration methods, we show that the estimated PDFs are capable of generating a non-parametric shape of the predictive distribution from the original Gaussian, which captures the distribution of under-estimated values and over-estimated values around the original Gaussian mean.

Here both empirical approaches can be seen as non-parametric as the number of parameters increases with the number of target values, but not with the size of the dataset. Therefore both approaches take roughly linear time in the size of the dataset, and in the number of target values.

5 GPC: Using Gaussian Processes for Calibration

While the empirical approaches are quick to apply, they cannot provide a smooth estimate of the CDFs and PDFs on the target variable, and hence give limited information regarding the predicted distribution of the target. As introduced previously, both logistic calibration and Beta calibration are derived by assuming certain distributions on the predicted probabilities, which can then be optimised with a probabilistic objective function to approximate the calibrated probabilities. Intuitively, it would be ideal if we could also make such distributional assumptions in the regression case. Here we propose an approach based on the Gaussian Process Classifier (GPC) [22, 18] to achieve a smooth calibration function, which can be seen as modelling a latent Gaussian Process over the CDFs.

Following Definition 4, to calibrate a regression model we need a calibration function in the following form, denoting :

As discussed above, in general we cannot design a calibration function that uses a finite-dimensional vector to represent the predicted CDF, unless the CDF follows certain parametric assumptions. For instance, in the Gaussian case, we can use the mean and standard deviation to represent the function. However, as parametric assumptions can be a potential reason for yielding uncalibrated CDFs, here we deliberately avoid such approaches.

Therefore, we consider a non-parametric approach which does not require an explicit representation of the whole predicted CDF, but only takes in the value of the CDF at a single target value.

We view this as a two-class probability estimation task with two features. The features are the uncalibrated CDF value at a threshold and the threshold itself, and we want to predict the calibrated probability that the original regression target variable is below that threshold. To solve this task we use the Gaussian Process Classifier algorithm.

First, we need to build the training set for GPC. For this we consider the set of predicted CDFs on the calibration fold instances, and a set of target values (thresholds). Each training instance then pairs the value of one predicted CDF at one threshold with that threshold, representing a particular combination of the cumulative distribution and the corresponding target value; its label indicates whether the observed target of that instance is below the threshold. GPC models the probability estimator as a composition of two functions: a function which transforms the features into a hidden real-valued Gaussian-distributed variable encoding the confidence information, followed by a link function which transforms this confidence information into a probability. That is, it models a latent function over the training inputs, assuming that the function values are jointly Gaussian distributed, with a constant mean and a covariance matrix whose size equals the number of training instances. Hence, instead of fitting a distribution over the CDFs, we now have a distribution over the functions on a finite sample from the CDFs.

If we construct the covariance matrix via some covariance function (a positive definite kernel with its own parameters), so that each entry of the matrix is the kernel evaluated at a pair of training inputs, the Gaussian distribution can be generalised to any infinite set of dimensions, which can later be used for making predictions. The next step is to map the latent function values into the interval $[0, 1]$, which can then be used to compute an objective function with the binary target variable. The approach used in GPC is to adopt a link function, which is commonly constructed using the logistic function or the probit function. Training of GPC involves optimising the kernel parameters given the training inputs and labels, by marginalising out the latent function values:


Here denotes the likelihood function of the multivariate Gaussian. Regarding how prediction works using the GPC model please refer to [18].

As in common GPs, GPC is not sparse and hence has some computational difficulties. The most widely adopted approximations are the Laplace approximation [22, 18] and Expectation Propagation [13]. Both approaches are commonly seen in GP implementations such as scikit-learn [16], GPy [6], and Edward [21].

To train a GPC calibration we first require a set of target values (thresholds), with which we can construct the input variable (the predicted CDF value at each threshold together with the threshold) and the binary output variable (whether the observed target is below the threshold) from the data points in the calibration set. The next step is to train a GPC to predict the output from the input. While any positive definite kernel can potentially be applied, here we use the RBF kernel, a default option in many GP and SVM applications, given that our aim is to smoothly calibrate the CDFs with the provided training points. Two examples of the training points and estimated calibration map can be seen in Figure 2.
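
The construction of the GPC training set can be sketched as below. The function name `build_gpc_training_set` is hypothetical, the predicted CDFs are represented as plain Python callables, and the labelling convention (label 1 when the observed target is at or below the threshold) is our reading of the description above rather than a confirmed detail of the authors' code.

```python
def build_gpc_training_set(predicted_cdfs, targets, thresholds):
    """One training row per (calibration instance, threshold) pair."""
    features, labels = [], []
    for cdf, y in zip(predicted_cdfs, targets):
        for t in thresholds:
            features.append((cdf(t), t))       # (uncalibrated CDF value, threshold)
            labels.append(1 if y <= t else 0)  # indicator that the target is below t
    return features, labels


# Illustrative use with two made-up predicted CDFs (uniform on [0, 1] and on [0, 2]).
cdfs = [lambda t: min(max(t, 0.0), 1.0),
        lambda t: min(max(t / 2.0, 0.0), 1.0)]
features, labels = build_gpc_training_set(cdfs, [0.3, 1.5], [0.25, 0.5, 1.0])
```

The resulting two-column feature matrix and binary labels could then be handed to any GP classifier; scikit-learn's `GaussianProcessClassifier` with an `RBF` kernel is one readily available option. Note that the number of rows is the number of calibration instances times the number of thresholds, which is what drives the computational cost discussed later in this section.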

Figure 2: Two examples of the calibration map for the GPC approach using a RBF kernel, with base models outputting a Gaussian density. The blue and red points are corresponding to the training points of and respectively. 32 values of are selected uniformly.

Since the GPC model is continuous, the thresholds do not need to be the same on training and test data. Therefore, on test data one can use many more thresholds than were used on the training data. While computationally we cannot select an infinitely fine set of thresholds, the number can be chosen empirically as a trade-off between precision and computational speed, as in general approximation approaches. The following steps are again simple to perform. For a test feature vector and its uncalibrated CDF, we again construct the input features from the CDF value at each threshold and the threshold itself, and use the previously learned GPC to predict the calibrated CDF value. The estimated PDF can then be calculated directly from the calibrated CDF values.
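
One simple way to turn calibrated CDF values on a fine grid of thresholds into a density is a finite difference, sketched below. This is an assumption about how the last step might be carried out, not a statement of the authors' exact formula; the function name is hypothetical, and negative differences are clipped because a learnt calibration map is not guaranteed to be monotone in the threshold.

```python
import math


def pdf_from_cdf(calibrated_cdf, grid):
    """Finite-difference density estimate at the midpoints of a sorted grid."""
    values = [calibrated_cdf(t) for t in grid]
    mids, dens = [], []
    for k in range(len(grid) - 1):
        width = grid[k + 1] - grid[k]
        mids.append(0.5 * (grid[k] + grid[k + 1]))
        # Clip at zero in case the calibrated CDF is locally decreasing.
        dens.append(max(values[k + 1] - values[k], 0.0) / width)
    return mids, dens


# Illustrative use: a logistic CDF stands in for the calibrated CDF of one test
# instance, evaluated on 257 equally spaced thresholds (256 intervals).
grid = [-8.0 + 16.0 * k / 256 for k in range(257)]
mids, dens = pdf_from_cdf(lambda t: 1.0 / (1.0 + math.exp(-t)), grid)
```

A finer grid gives a smoother density at the price of more GPC predictions per test instance, which is the precision versus speed trade-off mentioned above.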

A result of GPC calibration can again be seen in Figure 3. As the figure indicates, GPC calibration captures a bi-modal shape of the PDF similar to those of e-logistic and e-beta, but provides a smooth estimate instead. In this particular case, the smooth estimate gives a higher likelihood to the ground truth, and hence a lower log-loss, which is a major benefit of having calibrated outputs.

The major drawback of the GPC approach comes from its computational cost. As in general GP approaches, the computation of a GPC requires numerical approximations involving the inversion of matrices. This makes GPC relatively slow compared to the empirical approaches, and intractable for larger datasets (calibration sets), where further sparse approximations are required.

Figure 3: An example of the PDFs, marginal calibration maps, and CDFs estimated on a test instance using e-logistic, e-beta, and GPC. The base model is estimated with Gaussian Process Regression. Here default model refers to the model fitted with the whole training set, base models refers to the model fitted with of the training set, and the rest of the training set is used to learn the calibration (with linearly mapped target values). The PDFs are obtained by consecutively applying the base model and calibration maps on the test feature. The ground-truth is given as the yellow vertical line.

6 Experimental Evaluation

(a) Bayesian Ridge Regression
(b) Gaussian Process Regression
Figure 4: Predicted densities on a toy dataset. The result on the left shows the densities predicted by the base model trained on a portion of the dataset. The rest of the results are obtained by using the remaining data to train a calibration method, and then applying it on top of the base model on the left. The white lines show the predicted mean from the corresponding regression models. For the calibration methods, 16 equally spaced target thresholds are applied on the y-axis, which hence provide 16 bins for the predictions from the empirical methods. For the GPC approach, while training also uses 16 target thresholds, at test time 256 target thresholds are specified to generate a smooth output.
(a) Bayesian Ridge Regression
(b) Gaussian Process Regression
Figure 5: Reliability diagrams on a toy dataset. Each dashed line is drawn by a particular target value , with estimated probability for on x-axis, and the actual relative frequency of on the y-axis (with bins on the x-axis).

In this section we experimentally examine the performance of our proposed methods, and compare them against different uncalibrated regression models. We first revisit the toy dataset used at the beginning of the paper. We then use 5 UCI datasets to compare multiple regression models.

As base models we selected three methods with Gaussian outputs: Ordinary Least Squares regression (OLS), Bayesian Ridge Regression (BRR), and Gaussian Process Regression (GPR). This choice is motivated by the following reasons. (1) Gaussian-output models are the most common among probabilistic regression methods, and have been used as baseline approaches in most regression problems. (2) These three models cover different aspects of a Gaussian-output method. OLS is optimised by squared error, which is equivalent to fitting a linear function to predict the mean of a Gaussian output with a shared standard deviation. While BRR is still a linear model, all its parameters are optimised through a posterior given certain priors (in this case uninformative priors are used, which act as regularisers). GPs can give non-linear predictions with certain kernels (in the following experiments RBF is used), and are optimised through a likelihood function. However, as stated previously, our proposed approaches are not limited to Gaussian-output methods; our main goal here is to compare performance among different model assumptions.

In terms of implementation, for all experiments we apply the same experimental design as in [17, 9], which runs 5-fold cross validation. Given a base model class and a calibration method, a calibrated regression model can be trained by separating the training set into a base set and a calibration set. For each execution, the training set is divided into another 3 folds to iteratively train the base models and the calibration methods, which provides three calibrated models. The base model is first fitted on the base set, and then used to provide predictions on the calibration set. The calibration is then learnt on the calibration set with these predictions from the base model. Finally, during testing, the predictions are obtained by applying the learnt base model and calibration consecutively, and the final predictions are given as the average prediction over all three calibrated models.
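
The nested split can be sketched with index sets only. The sketch assumes that, within each outer training set, one inner fold serves as the calibration set and the other two as the base set; it also omits shuffling for brevity. The function names are hypothetical.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def nested_splits(n, outer_k=5, inner_k=3):
    """Yield (test, [(base, calibration), ...]) index sets for each outer fold."""
    for test in k_fold_indices(n, outer_k):
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        inner = []
        for fold in k_fold_indices(len(train), inner_k):
            calib = [train[j] for j in fold]
            calib_set = set(calib)
            base = [i for i in train if i not in calib_set]
            inner.append((base, calib))
        yield test, inner


# Illustrative use: 30 instances give 5 outer folds, each with 3 (base, calibration) pairs.
splits = list(nested_splits(30))
```

For each outer fold, a base model would be fitted on every `base` set, a calibration map on the matching `calibration` set, and the three calibrated predictions on `test` averaged.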

6.1 The Toy Dataset

In Figure 1 we showed an initial example with OLS to demonstrate the motivation of applying calibration on regression tasks. Here we use the dataset again to compare our proposed methods and uncalibrated models. The dataset is generated as a mixture of two lines with a given Gaussian noise, with uniformly generated features on the horizontal axis. The 5-fold cross validation provides the following results.
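
A dataset of this kind can be generated in a few lines. All numeric settings below (slope, level of the flat line, noise level, mixing proportion of one half, feature range) are assumptions chosen for illustration, since the exact values used for the figures are not stated here; the function name is hypothetical.

```python
import random


def make_toy_dataset(n, slope=1.0, flat_level=0.0, noise_std=0.1, seed=0):
    """Mixture of an ascending line and a flat line with Gaussian noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(0.0, 1.0)            # uniformly generated feature
        on_ascending = rng.random() < 0.5    # assumed: about half on each line
        mean = slope * x if on_ascending else flat_level
        data.append((x, rng.gauss(mean, noise_std)))
    return data


toy = make_toy_dataset(500)
```

Under these settings the conditional density of the target becomes increasingly bi-modal as the feature grows, which is the property the calibration methods are expected to recover.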

Figure 4 shows the predicted densities from both BRR and GPR from a single training run, with the calibrated densities obtained from them using e-logistic, e-beta, and GPC respectively. We omit the results of OLS here as they are partly shown in Figure 1 and are close to the results of BRR in this particular case. In general, all three proposed approaches are able to capture the bi-modal shape of the data distribution towards larger input values, and can correct the base output to be closer to the true distribution as given in Figure 1. Both e-logistic and e-beta clearly show horizontal density bands across the figures, which is expected given their empirical nature. Notably, the calibrated results with GPR under-estimate the densities around the top right of the figure. The explanation can be obtained by checking the original output of the GPR, which shows a non-linear estimate by virtue of the RBF kernel, and also under-estimates the densities at the same x-location in the top-right area. As discussed previously, while calibration can help improve the probability estimates from a given model, it cannot further correct predictions that are already grouped together. In this case, the non-linear GPR provides the same density estimate for the top-right area as for many other low-density areas, meaning this area cannot be fixed simply by applying calibration.

Figure 5 shows the reliability diagram obtained by evaluating the training target values for each experiment. The reliability diagram is a widely adopted tool in binary classification for visualising whether a classifier gives calibrated probability estimates. The idea is to apply a set of bins on the probability estimates. Then, within each bin, we calculate the average value of the estimates as well as the relative frequency of the binary target. If we plot the two values in a 2-D space, a calibrated classifier will stay close to the ascending diagonal. In probabilistic regression we can obtain a set of lines with each being drawn as a binary task with the binary indicator . Both base models can be seen to be uncalibrated for certain values of , as there are multiple lines away from the ascending diagonal. All the calibration methods show improved performance, with most lines close to the ascending diagonal. The exception is the e-logistic approach with Bayesian ridge regression, where the approach created a few points further away from the diagonal. This is explainable because, by definition, logistic calibration is not designed for calibrating probabilistic models, and can lead to uncalibrated estimates for certain datasets and models [9, 10].
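One line of such a diagram can be computed with a few lines of code. The sketch below assumes that the binary task for a given target value is whether the observed target falls at or below that value, and that the model's probability estimate is its predicted CDF at that value; this is an assumption made for illustration. The name `reliability_curve`, the equal-width binning, and the default of 10 bins are also hypothetical choices.

```python
def reliability_curve(cdf_values, targets, threshold, n_bins=10):
    """Return (mean predicted probability, observed frequency) per non-empty bin.

    cdf_values[i] is the model's predicted CDF at `threshold` for instance i;
    the binary outcome is assumed to be 1 when targets[i] <= threshold.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(cdf_values, targets):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, 1.0 if y <= threshold else 0.0))
    curve = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            freq = sum(z for _, z in members) / len(members)
            curve.append((mean_p, freq))
    return curve
```

Repeating the call for several thresholds gives one line per threshold; a calibrated model yields points with `mean_p` close to `freq`.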

6.2 Experiments on UCI Data

Figure 6: Experiments with 5 UCI datasets. Panels: (a) Ordinary Least Squares regression, (b) Bayesian Ridge Regression, (c) Gaussian Process Regression. The x-axis indicates the number of target values used for training the calibration method. The y-axis shows the log-likelihood of the final estimate; higher values indicate better results. Each column of figures corresponds to one of the five UCI datasets.

While in the previous experiment we used artificial data to demonstrate a case where the true distribution is known, this experiment aims to investigate the performance of our methods with real datasets. We use the log-likelihood as the evaluation measure for our experiments as is common for predictive probabilistic approaches.
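As a rough illustration of this measure, the sketch below evaluates the log-likelihood of a single observed target under a CDF that is only known on a grid of target values. It assumes a piecewise-constant density obtained by finite differences of the CDF, with a small floor to avoid taking the logarithm of zero; neither choice is taken from the experiments, and `log_likelihood_from_cdf` is a hypothetical helper.

```python
import math


def log_likelihood_from_cdf(grid, cdf, y, floor=1e-12):
    """Log-density at y implied by CDF values given on an increasing grid.

    The density between two neighbouring grid points is approximated by
    the finite difference of the CDF; values outside the grid, or with
    zero estimated density, fall back to `floor`.
    """
    for left, right, c_left, c_right in zip(grid, grid[1:], cdf, cdf[1:]):
        if left <= y <= right:
            density = (c_right - c_left) / (right - left)
            return math.log(max(density, floor))
    return math.log(floor)
```

Averaging this quantity over the test instances gives one log-likelihood score per model.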

We selected five datasets from the UCI repository [12]: (1) Diabetes, (2) Boston, (3) Airfoil, (4) Forest Fire, (5) Compressive Strength. These five datasets are selected according to their size and formats. We selected the size to be no more than considering the speed of the GPC approach. Also, as later shown, we perform experiments to examine different numbers of target values, which is also time-consuming even on a single dataset. Regarding the formats, we selected datasets that have a single tabular file and contain ready-to-use feature and target instances, which makes the experiments simple to reproduce. The only pre-processing applied is to remove instances with missing feature values.

The experiments are organised as follows. At the top level, as in [17, 9], we run 10-times 5-fold cross-validation to provide the averaged results. For the experiments with GPR as the base model we only use the single feature with the largest variance to ensure the convergence of the optimised kernel parameters. At a detailed level, for all the calibration approaches, we select different numbers ( and ) of target values with equal distances, in order to test the effect of the number of target values. The prediction of GPC is set to have target values, again with equal distance among neighbouring values. The range of the target values is selected as , where and are the minimum value and maximum value of the target variable in the training set, and . This setting ensures the estimated PDFs can approximately cover most of the probability mass (hence the CDFs can be approximately seen as in ). To maintain the speed of the GPC approach, we use up to CDF values from the base model, which are uniformly selected from all the outputs within the calibration set.
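A grid of equally spaced target values over a slightly widened training range can be built as in the sketch below. The way the range is widened here (a fraction of the observed spread, controlled by the hypothetical parameter `margin_fraction`) is an assumption for illustration and is not taken from the setting described above.

```python
def target_grid(y_train, n_values, margin_fraction=0.1):
    """Return n_values equally spaced target values covering the training range.

    The range is extended on both sides by margin_fraction * (max - min);
    this widening rule is an illustrative assumption.
    """
    if n_values < 2:
        raise ValueError("n_values must be at least 2")
    y_min, y_max = min(y_train), max(y_train)
    margin = margin_fraction * (y_max - y_min)
    low, high = y_min - margin, y_max + margin
    step = (high - low) / (n_values - 1)
    return [low + i * step for i in range(n_values)]
```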

The results are depicted in Figure 6. Although the performance of our proposed methods varies across settings, it can be seen that there is always a calibrated method giving better estimates than the uncalibrated models. The exception is the setting with the smallest number of target values (, on the left), where the calibration methods mostly perform worse than the uncalibrated ones. This is reasonable as we only provide limited information from the CDFs to the calibration methods in this case.

With the empirical approaches, both e-logistic and e-beta outperform the uncalibrated models when the number of target values is around and , and the performance tends to drop as the number becomes larger. This drop can be explained by their empirical nature, where more empirical measurements can increase the variation of the output, hence increasing the potential for over-fitting. Furthermore, e-beta shows a better result than e-logistic in most cases. This is expected as e-beta is able to give estimates beyond the Sigmoid function, which is more suitable for probabilistic calibration, as shown in [10].

For most datasets and settings the GPC approach achieves top performance, mostly benefitting from a larger number of target values. However, several drops in performance can still be seen while target values are used. This can be considered as a consequence of setting the CDF values during the training process, which is equivalent to applying a naive sparse GPC, ending up with faster training but worse performance.

7 Conclusion and Future Work

We investigated the problem of calibrating a probabilistic regression model to provide better probability estimates. Compared to switching or improving the regression model itself, calibration provides an alternative approach to improve the original model directly. We first defined the concept of calibration in regression, and then illustrated that calibrated cumulative distribution predictions can lead to calibrated density predictions. One benefit of calibrating a model with CDFs is that we no longer require a parametric assumption on the density functions, which is useful if the distribution of the target is unknown. Two empirical approaches are proposed based on Logistic calibration and Beta calibration. These approaches are useful if one wants to quickly calibrate the shape of the predicted densities, without caring about a particular density value or cumulative density value. We further propose an approach based on the Gaussian process classifier, which can learn a smooth calibration function on the predicted cumulative densities. While the non-sparse property makes the approach relatively slow to train and unable to scale to larger datasets, it is useful for scenarios where calibrated cumulative densities are required for decision making, such as forecasting tasks in areas like medicine.

While we mainly investigate non-parametric methods given their versatility in the regression setting, parametric methods are still an alternative direction which is useful when the distribution of the target is indeed known, or can be approximated with reasonable uncertainty. Among our proposed approaches, the empirical approaches are currently implemented via one-vs-rest, where further strategies can be investigated to provide improved estimations, such as the Least Square Error-Correcting Output Codes (LS-ECOC) approach proposed in [8]. GPC can be developed further to incorporate large datasets, which can be linked to recent progress in the area of sparse Gaussian processes.

8 Acknowledgements

This work was supported by the SPHERE Interdisciplinary Research Collaboration, funded by the UK Engineering and Physical Sciences Research Council under grant EP/K031910/1. MK was supported by the Estonian Research Council under grant PUT1458.


References

  1. Bishop, C.M.: Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA (2006)
  2. Cohen, I., Goldszmidt, M.: Properties and benefits of calibrated classifiers. In: European Conference on Principles of Data Mining and Knowledge Discovery. pp. 125–136. Springer (2004)
  3. Fawcett, T., Niculescu-Mizil, A.: PAV and the ROC convex hull. Machine Learning 68(1), 97–106 (2007)
  4. Flach, P.: Machine learning: the art and science of algorithms that make sense of data. Cambridge University Press (2012)
  5. Frank, E., Bouckaert, R.R.: Conditional density estimation with class probability estimators. In: Zhou, Z.H., Washio, T. (eds.) Advances in Machine Learning. pp. 65–81. Springer Berlin Heidelberg, Berlin, Heidelberg (2009)
  6. GPy: GPy: A gaussian process framework in python. (since 2012)
  7. Koenker, R., Hallock, K.F.: Quantile regression. Journal of economic perspectives 15(4), 143–156 (2001)
  8. Kong, E.B., Dietterich, T.: Probability estimation via error-correcting output coding. In: Int. Conf. of Artificial Intelligence and Soft Computing. Citeseer (1997)
  9. Kull, M., Filho, T.S., Flach, P.: Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers. In: Singh, A., Zhu, J. (eds.) Proceedings of the 20th International Conference on Artificial Intelligence and Statistics. Proceedings of Machine Learning Research, vol. 54, pp. 623–631. PMLR, Fort Lauderdale, FL, USA (20–22 Apr 2017)
  10. Kull, M., Silva Filho, T.M., Flach, P., et al.: Beyond sigmoids: How to obtain well-calibrated probabilities from binary classifiers with beta calibration. Electronic Journal of Statistics 11(2), 5052–5080 (2017)
  11. Li, Y., Liu, Y., Zhu, J.: Quantile regression in reproducing kernel hilbert spaces. Journal of the American Statistical Association 102(477), 255–268 (2007)
  12. Lichman, M.: UCI machine learning repository (2013)
  13. Minka, T.P.: Expectation propagation for approximate Bayesian inference. In: Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence. pp. 362–369. Morgan Kaufmann Publishers Inc. (2001)
  14. Murphy, K.P.: Machine learning: A probabilistic perspective. MIT press (2012)
  15. Naeini, M.P., Cooper, G.F., Hauskrecht, M.: Obtaining well calibrated probabilities using bayesian binning. In: AAAI. pp. 2901–2907 (2015)
  16. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)
  17. Platt, J., et al.: Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers 10(3), 61–74 (1999)
  18. Rasmussen, C.E., Williams, C.K.: Gaussian processes for machine learning, vol. 1. MIT press Cambridge (2006)
  19. Schapire, R.E., Stone, P., McAllester, D., Littman, M.L., Csirik, J.A.: Modeling auction price uncertainty using boosting-based conditional density estimation. In: ICML. pp. 546–553 (2002)
  20. Sugiyama, M., Takeuchi, I., Suzuki, T., Kanamori, T., Hachiya, H., Okanohara, D.: Conditional density estimation via least-squares density ratio estimation. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. pp. 781–788 (2010)
  21. Tran, D., Kucukelbir, A., Dieng, A.B., Rudolph, M., Liang, D., Blei, D.M.: Edward: A library for probabilistic modeling, inference, and criticism. arXiv preprint arXiv:1610.09787 (2016)
  22. Williams, C.K., Barber, D.: Bayesian classification with gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(12), 1342–1351 (1998)
  23. Zadrozny, B., Elkan, C.: Obtaining calibrated probability estimates from decision trees and naive bayesian classifiers. In: ICML. vol. 1, pp. 609–616. Citeseer (2001)
  24. Zadrozny, B., Elkan, C.: Transforming classifier scores into accurate multiclass probability estimates. In: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 694–699. ACM (2002)