Dirichletbased Gaussian Processes
for Largescale Calibrated Classification
Abstract
In this paper, we study the problem of deriving fast and accurate classification algorithms with uncertainty quantification. Gaussian process classification provides a principled approach, but the corresponding computational burden is hardly sustainable in largescale problems and devising efficient alternatives is a challenge. In this work, we investigate if and how Gaussian process regression directly applied to the classification labels can be used to tackle this question. While in this case training time is remarkably faster, predictions need be calibrated for classification and uncertainty estimation. To this aim, we propose a novel approach based on interpreting the labels as the output of a Dirichlet distribution. Extensive experimental results show that the proposed approach provides essentially the same accuracy and uncertainty quantification of Gaussian process classification while requiring only a fraction of computational resources.
Dirichletbased Gaussian Processes
for Largescale Calibrated Classification
Dimitrios Milios EURECOM Sophia Antipolis, France milios@eurecom.fr Raffaello Camoriano University of Genoa Genoa, Italy raffaello.camoriano@iit.it Pietro Michiardi EURECOM Sophia Antipolis, France Pietro.Michiardi@eurecom.fr Lorenzo Rosasco University of Genoa LCSL, IIT & MIT lrosasco@mit.edu Maurizio Filippone EURECOM Sophia Antipolis, France maurizio.filippone@eurecom.fr
noticebox[b]\end@float
1 Introduction
Classification is a classic machine learning task. While the most basic performance measure is classification accuracy, in practice assigning a calibrated confidence to the predictions is often crucial [5]. For example, in image classification providing classification predictions with a calibrated score is important to avoid making overconfident decisions [6, 11, 14]. Several classification algorithms that output a continuous score are not necessarily calibrated (e.g., support vector machines [22]). Popular ways to calibrate classifiers use a validation set to learn a transformation of their output score that recovers calibration; these include Platt scaling [22] and isotonic regression [36]. Calibration can also be achieved if a sensible loss function is employed [12], for example the logistic/crossentropy loss, and it is known to be positively impacted if the classifier is well regularized [6].
Bayesian approaches provide a natural framework to tackle these kinds of questions, since quantification of uncertainty is of primary interest. In particular, Gaussian Processes Classification (gpc) [23, 34, 8] combines the flexibility of Gaussian Processes (gps) [23] and the regularization stemming from their probabilistic nature, with the use of the correct likelihood for classification, that is Bernoulli or multinomial for binary or multiclass classication, respectively. While we are not aware of empirical studies on the calibration properties of gpc, our results confirm the intuition that gpc is actually calibrated. The most severe drawback of gpc, however, is that it requires carrying out several matrix factorizations, making it unattractive for largescale problems.
In this paper, we study the question of whether Gaussian process approaches can be made efficient to find accurate and wellcalibrated classification rules. A simple idea is to use Gaussian process regression directly on classification labels. This idea is quite common in nonprobabilistic approaches [32, 25] and can be grounded from a decision theoretic point of view. Indeed, the Bayes’ rule minimizing the expected least squares is the the expected conditional probability, which in classification is directly related to the conditional probabilities of each class, see e.g. [29, 3]. Directly regressing the labels leads to fast training and excellent classification accuracies [27, 16, 10]. However, the corresponding predictions are not calibrated for uncertainty quantification. The question is then if calibration can be achieved while retaining speed.
The main contribution of our work is the proposal of a transformation of the classification labels, which turns the original problem into a regression problem without compromising calibration. For gps, this has the enormous advantage of bypassing the need for expensive posterior approximations, leading to a method that is as fast as a simple regression of the original labels. The proposed method is based on the interpretation of the labels as the output of a Dirichlet distribution, so we name it Dirichletbased gp classification (gpd). Through an extensive experimental validation, including largescale classification tasks, we demonstrate that gpd is calibrated and competitive in performance with stateoftheart gpc.
2 Related work
Calibration of classifiers:
Platt scaling [22] is a popular method to calibrate the output score of classifiers, as well as isotonic regression [36]. More recently, Beta calibration [12] and temperature scaling [6] have been proposed to extend the class of possible transformations and reduce the parameterization of the transformation, respectively. It is established that binary classifiers are calibrated when they employ the logistic loss; this is a direct consequence of the fact that the appropriate model for Bernoulli distributed variables is the one associated with this loss [12]. The extension to multiclass problems yields the socalled crossentropy loss, which corresponds to the multinomial likelihood. Not necessarily, however, the right loss makes classifiers well calibrated; recent works on calibration of convolutional neural networks for image classification show that depth negatively impacts calibration due to the introduction of a large number of parameters to optimize, and that regularization is important to recover calibration [6].
Kernelbased classification:
Performing regression on classification labels is also known as leastsquares classification [32, 25]. We are not aware of works that study gpbased leastsquares classification in depth; we could only find a few comments on it in [23] (Sec. 6.5). gpc is usually approached assuming a latent process, which is given a gp prior, that is transformed into a probability of class labels through a suitable squashing function [23]. Due to the nonconjugacy between the gp prior and the nonGaussian likelihood, applying standard Bayesian inference techniques in gpc leads to analytical intractabilities, and it is necessary to resort to approximations. Standard ways to approximate computations include the Laplace Approximation [34] and Expectation Propagation [18]; see, e.g., [15, 20] for a detailed review of these methods. More recently, there have been advancements in works that extend “sparse” gp approximations [33] to classification [9] in order to deal with the issues of scalability with the number of observations. All these approaches require iterative refinements of the posterior distribution over the latent gp, and this implies expensive algebraic operations at each iteration.
3 Background
Consider a multiclass classification problem. Given a set of training inputs and their corresponding labels , with onehot encoded classes denoted by the vectors , a classifier produces a predicted label as function of any new input .
In the literature, calibration is assessed through the Expected Calibration Error (ece) [6], which is the average of the absolute difference between accuracy and confidence:
(1) 
where the test set is divided into disjoint subsets , each corresponding to a given level of confidence predicted by the classifier , while is the classification accuracy of measured on the th subset. Other metrics used in this work to characterize the quality of a classifier are the mean negative loglikelihood (mnll) of the test set under the model, and the error rate on the test set. All metrics are defined so that lower values are better.
3.1 Kernel methods for classification
gp classification (gpc)
gpbased classification is defined by the following abstract steps:

A Gaussian prior is placed over a latent function . The gp prior is characterized by the mean function and the covariance function . The observable (nonGaussian) prior is obtained by transforming through a sigmoid function so that the sampled functions produce proper probability values. In the multiclass case, we consider independent priors over the vector of functions ; transformation to proper probabilities is achieved by applying the softmax function ^{1}^{1}1Softmax function s.t. for .

The observed labels are associated with a categorical likelihood with probability components , for any .

The latent posterior is obtained by means of Bayes’ theorem.

The latent posterior is transformed via , to obtain a distribution over class probabilities.
Throughout this work, we consider and . This choice of covariance function is also known as the rbf kernel; it is characterized by the and hyperparameters, interpreted as the gp marginal variance and lengthscale, respectively. The hyperparameters are commonly selected my maximizing the marginal likelihood of the model.
The major computational challenge of gpc can be identified in Step 3 described above. The categorical likelihood implies that the posterior stochastic process is not Gaussian and it cannot be calculated analytically. Therefore, different approaches resort to Gaussian approximations of the likelihood so that the resulting approximate Gaussian posterior has the following form:
(2) 
For example, in Expectation Propagation (EP, [18]) and are determined by the site parameters, which are learned via an iterative process. In the variational approach of [9], and are the variational parameters of the approximate Gaussian likelihoods. Despite being successful, approaches like these contribute significantly to the computational cost of classification, as they introduce a large number of parameters that need to be optimized. In this work, we explore a more straightforward Gaussian approximation to the likelihood that requires no significant computational overhead.
gp regression (gpr) on classification labels
A simple way to bypass the problem induced by categorical likelihoods is to perform leastsquares regression on the labels by ignoring their discrete nature. This implies considering a Gaussian likelihood , where is the observation noise variance. It is wellknown that if the observed labels are and , then the function that minimizes the mean squared error converges to the true class probabilities in the limit of infinite data [24]. Nevertheless, by not squashing through a softmax function, we can no longer guarantee that the resulting distribution of functions will lie within 0 and 1. For this reason, additional calibration steps are required (i.e. Platt scaling).
Kernel Ridge Regression (krr) for classification
The idea of directly regressing labels is quite common when gp estimators are applied within a frequentist context [25]. Here they are typically derived from a nonprobabilistic perspective based on empirical risk minimization, and the corresponding approach is dubbed Kernel Ridge Regression [7]. Taking this perspective, two comments can be made. The first is that the noise and covariance parameters are viewed as regularization parameters that need to be tuned, typically by crossvalidation. In our experiments, we compare this method with a canonical gpr approach. The second comment is that regressing labels with least squares can be justified from a decision theoretic point of view. The Bayes’ rule minimizing the expected least squares is the regression function (the expected conditional probability), which in binary classification is proportional to the conditional probability of one of the two classes [3] (similar reasoning applies to multiclass classification [19, 2]). From this perspective, one could expect a least squares estimator to be selfcalibrated, however this is typically not the case in practice, a feature imputed to the limited number of points and the choice of function models. In the following we breifly present Platt scaling, a simple and effective posthoc calibration method which can be seamlessly applied to both gpr and krrbased learning pipelines to obtain a calibrated model.
Platt scaling
Platt scaling [22] is an effective approach to perform posthoc calibration for different types of classifiers, such as svms [21] and neural networks [6]. Given a decision function , which is the result of a trained binary classifier, the class probabilities are given by the sigmoid transformation , where and are optimised over a separate validation set, so that the resulting model best explains the data. Although this parametric form may seem restrictive, Platt scaling has been shown to be effective for a wide range of classifiers [21].
3.2 A note on calibration properties
We advocate that two components are critical for wellcalibrated classifiers: regularization and the crossentropy loss. Previous work indicates that regularization has a positive effect on calibration [6]. Also, classifiers that rely on the crossentropy loss are reported to be wellcalibrated [21]. This form of loss function is equivalent to the negative Bernoulli loglikelihood (or categorical in the multiclass case), which is the proper interpretation of classification outcomes.
In Figure 1, we demonstrate the effects of regularization and crossentropy empirically: we summarize classification results on four synthetic datasets of increasing size. We assume that each class label is sampled from a Bernoulli distribution with probability given by the unknown function with input space of dimensionality . For a classifier to be wellcalibrated, it should accurately approximate . We fit three kinds of classifiers: a maximum likelihood (ML) classifier that relies on cross entropy loss (CE), a Bayesian classifier with MSE loss (i.e. gpr classification), and finally a Bayesian classifier that relies on CE (i.e. gpc). We report the averages over 1000 iterations and the average standard deviation. The Bayesian classifiers that rely on the cross entropy loss converge to the true solution at a faster rate, and they are characterized by smaller variance.
Although performing gpr on the labels induces regularization through the prior, the likelihood model is not appropriate. One possible solution is to employ meticulous likelihood approximations such as EP or variational gp classification [9], alas at an often prohibitive computational cost, especially for considerably large datasets. In the section that follows, we introduce a methodology that combines the best of both worlds. We propose to perform gp regression on labels transformed in such a way that a less crude approximation of the categorical likelihood is achieved.
4 gp regression on transformed Dirichlet variables
There is an obvious defect in gpbased leastsquares classification: each point is associated with a Gaussian likelihood, which is not the appropriate noise model for Bernoullidistributed variables. Instead of approximating the true nonGaussian likelihood, we propose to transform the labels in a latent space where a Gaussian approximation to the likelihood is more sensible.
For a given input, the goal of a Bayesian classifier is to estimate the distribution over its class probability vector; such a distribution is naturally represented by a Dirichletdistributed random variable. More formally, in a class classification problem each observation is a sample from a categorical distribution . The objective is to infer the class probabilities , for which we use a Dirichlet model: . In order to fully describe the distribution of class probabilities, we have to estimate the concentration parameters . Given an observation such that , our best guess for the values of will be: and . Note that it is necessary to add a small quantity , so as to have valid parameters for the Dirichlet distribution. Intuitively, we implicitly induce a Dirichlet prior so that before observing a data point we have the probability mass shared equally across classes; we know that we should observe exactly one count for a particular class, but we do not know which one. Most of the mass is concentrated on the corresponding class when is observed. This practice can be thought of as the categorical/Bernoulli analogue of the noisy observations in gp regression. The likelihood model is:
(3) 
It is wellknown that a Dirichlet sample can be generated by sampling from independent Gammadistributed random variables with shape parameters and rate ; realizations of the class probabilities can be generated as follows:
(4) 
Therefore, the noisy Dirichlet likelihood assumed for each observation translates to independent Gamma likelihoods with shape parameters either , if , or otherwise.
In order to construct a Gaussian likelihood in the logspace, we approximate each Gammadistributed with , which has the same mean and variance (i.e. moment matching):
Thus, for the parameters of the normally distributed logarithm we have:
(5) 
Note that this is the first approximation to the likelihood that we have employed so far. One could argue that a lognormal approximation to a Gammadistributed variable is reasonable, although it is not perfect for small values of the shape parameter . However, the most important implication is that we can now consider a Gaussian likelihood in the logspace. Assuming a vector of latent processes , we have:
(6) 
where the class label is now denoted by in the transformed logarithmic space. It is important to note that the noise parameter is different for each observation; we have a heteroskedastic regression model. In fact, the values (as well as ) solely depend on the Dirichlet pseudocount assumed in the prior, which only has two possible values. Given this likelihood approximation, it is straightforward to place a gp prior over and evaluate the posterior over latent processes exactly.
Remark: In the binary classification case, we still have to perform regression on two latent processes. The use of heteroskedastic noise model implies that one latent process is not a mirrored version of the other (see Figure 2), contrary to gpc.
4.1 From gp posterior to Dirichlet variables
The obtained gp posterior emulates the logarithm of a stochastic process with Gamma marginals that gives rise to the Dirichlet posterior distributions. It is straightforward to sample from the posterior lognormal marginals, which should behave as Gammadistributed samples to generate posterior Dirichlet samples as in Equation (4). It is easy to see that this corresponds to a simple application of the softmax function on the posterior gp samples. The expectation of class probabilities will be:
(7) 
which can be approximated by sampling from the Gaussian posterior .
Figure 2 is an example of Dirichlet regression for a onedimensional binary classification problem. The leftside panels demonstrate how the gp posterior approximates the transformed data; the error bars represent the standard deviation for each datapoint. Notice, that the posterior for class “0” (top) is not a mirror image of class “1” (bottom), because of the different noise terms assumed for each latent process. The rightside panels show results in the original output space, after applying softmax transformation; as expected in the binary case, one posterior process is a mirror image of the other.
4.2 Optimizing the Dirichlet prior
The performance of Dirichletbased classification is affected by the choice of , in addition to the usual gp hyperparameters. As approaches zero, converges to either or . It is easy to see that for the transformed “1” labels we have and in the limit. The transformed “0” labels, however, converge to infinity, and so do their variances. The role of is to make the transformed labels finite, so that it is possible to perform regression. The smaller the is, the further the transformed labels will be apart, but at the same time, the variance for the “0” label will be larger.
By increasing , the transformed labels of different classes tend to be closer. The marginal loglikelihood tends to be larger, as it is easier for a zeromean gp prior to fit the data. However, this behavior is not desirable for classification purposes. For this reason, the Gaussian marginal loglikelihood in the transformed space is not appropriate to determine the optimal value for .
Figure 3 demonstrates the effect of on classification accuracy, as reflected by the mnll metric. Each subfigure corresponds to a different dataset; mnll is reported for different choices of between and . As a general remark, it appears that there is no globally optimal parameter across datasets. However, the reported training and testing mnll curves appear to be in agreement regarding the optimal choice for . We therefore propose to select the value that minimizes the mnll on the training data.
5 Experiments
We experimentally evaluate the methodologies discussed on the datasets outlined in Table 1. For the implementation of gpbased models, we use and extend the algorithms available in the GPFlow library [17]. More specifically, for gpc we make use of variational sparse gp [8], while for gpr we employ sparse variational gp regression [33]. The latter is also the basis for our gpd implementation: we apply adjustments so that heteroskedastic noise is admitted, as dictated by the Dirichlet mapping. Concerning krr, in order to scale it up to largescale problems we use a subsamplingbased variant named Nyström krr (nkrr) [31, 35]. Nyströmbased approaches have been shown to achieve stateoftheart accuracy on largescale learning problems [13, 30, 26, 4, 28]. The number of inducing (subsampled) points used for each dataset is reported in Table 1.
Dataset  Classes  Training instances  Test instances  Dimensionality  Inducing points 

eeg  2  10980  4000  14  200 
htru2  2  12898  5000  8  200 
magic  2  14020  5000  10  200 
miniboo  2  120064  10000  50  400 
coverbin  2  522910  58102  54  500 
susy  2  4000000  1000000  18  200 
letter  26  15000  5000  16  200 
drive  11  48509  10000  48  500 
mocap  5  68095  10000  37  500 
The experiments have been repeated for 10 random training/test splits. For each iteration, inducing points are chosen by applying kmeans clustering on the training inputs. Exceptions are coverbin and susy, for which we used 5 splits and inducing points chosen uniformly at random. For gpr we further split each training dataset: 80% of which is used to train the model and the remaining 20% is used for calibration with Platt scaling. nkrr uses an 8020% split for fold crossvalidation and Platt scaling calibration, respectively. For each of the datasets, the parameter of gpd was selected according to the training mnll: we have for coverbin, for letter, drive and mocap, and for the remaining datasets.
In all experiments, we consider an isotropic rbf kernel; the kernel hyperparameters are selected by maximizing the marginal likelihood for the gpbased approaches, and by fold cross validation for nkrr (with for all datasets except from susy, for which ). In the case of gpr, we also optimize the noise variance jointly with all kernel parameters.
The performance of gpd, gpc, gpr and nkrr is compared in terms of various error metrics, including error rate, mnll and ece for a collection of datasets. The error rate, mnll and ece values obtained are summarised in Figure 4. The gpc method tends to outperform gpr in most cases. Regarding the gpd approach, its performance tends to lie between gpc and gpr; in some instances classification performance is better than gpc and nkrr. Most importantly, this performance is obtained at a fraction of the computational time required by the gpc method. Figure 5 summarizes the speedup achieved by the used of gpd as opposed to the variational gp classification approach.
This dramatic difference in computational efficiency has some interesting implications regarding the applicability of gpbased classification methods on large datasets. gpbased machine learning approaches are known to be computationally expensive; their practical application on large datasets demands the use of scalable methods to perform approximate inference. The approximation quality of sparse approaches depends on the number (and the selection) of inducing points. In the case of classification, the speedup obtained by gpd implies that the computational budget saved can be spent on a more finegrained sparse gp approximation. In Figure 5, we explore the effect of increasing the number of inducing points for three datasets: letter, miniboo and mocap; we see that the error rate drops below the target gpc with a fixed number of inducing points, and still at a fraction of the computational effort.
6 Conclusions
Most gpbased approaches to classification in the literature are characterized by a meticulous approximation of the likelihood. In this work, we experimentally show that such gp classifiers tend to be wellcalibrated, meaning that they estimate correctly classification uncertainty, as this is expressed through class probabilities. Despite this desirable property, their applicability is limited to small/moderate size of datasets, due to the high computational complexity of approximating the true posterior distribution.
Leastsquares classification, which may be implemented either as gpr or krr, is an established practice for more scalable classification. However, the crude approximation of a nonGaussian likelihood with a Gaussian one has a negative impact on classification quality, especially as this is reflected by the calibration properties of the classifier.
Considering the strengths and practical limitations of gps, we proposed a classification approach that is essentially an heteroskedastic gp regression on a latent space induced by a transformation of the labels, which are viewed as Dirichletdistributed random variables. This allowed us to convert class classification to a problem of regression for latent processes with Gamma likelihoods. We then proposed to approximate the Gammadistributed variables with lognormal ones, and thus we achieved a sensible Gaussian approximation in the logarithmic space. Crucially, this can be seen as a preprocessing step, that does not have to be learned, unlike gpc, where an accurate transformation is sought iteratively. Our experimental analysis shows that Dirichletbased gp classification produces wellcalibrated classifiers without the need for posthoc calibration steps. The performance of our approach in terms of classification accuracy tends to lie between properlyapproximated gpc and leastsquares classification, but most importantly it is orders of magnitude faster than gpc.
As a final remark, we note that the predictive distribution of the gpd approach is different from that obtained by gpc, as can be seen in the extended results in the appendix. An extended characterization of the predictive distribution for gpd is subject of future work.
Acknowledgments
DM and PM are partially supported by KPMG. RC would like to thank Luigi Carratino for the useful exchanges concerning largescale experiments. LR is funded by the Air Force project FA95501710390 (European Office of Aerospace Research and Development) and the RISE project NoMADS  DLV777826. MF gratefully acknowledges support from the AXA Research Fund.
References
 [1] A. Asuncion and D. J. Newman. UCI machine learning repository, 2007.
 [2] L. Baldassarre, L. Rosasco, A. Barla, and A. Verri. Multioutput learning via spectral filtering. Machine learning, 87(3):259–301, 2012.
 [3] P. Bartlett, M. Jordan, and J. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 2006.
 [4] R. Camoriano, T. Angles, A. Rudi, and L. Rosasco. NYTRO: When Subsampling Meets Early Stopping. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 1403–1411, 2016.
 [5] P. A. Flach. Classifier Calibration, pages 1–8. Springer US, Boston, MA, 2016.
 [6] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On Calibration of Modern Neural Networks. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1321–1330, International Convention Centre, Sydney, Australia, Aug. 2017. PMLR.
 [7] T. Hastie, R. Tibshirani, J. Friedman, and J. Franklin. The elements of statistical learning: data mining, inference and prediction. The Mathematical Intelligencer, 27(2):83–85, 2001.
 [8] J. Hensman, A. Matthews, and Z. Ghahramani. Scalable Variational Gaussian Process Classification. In G. Lebanon and S. V. N. Vishwanathan, editors, Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pages 351–360, San Diego, California, USA, May 2015. PMLR.
 [9] J. Hensman, A. G. Matthews, M. Filippone, and Z. Ghahramani. MCMC for Variationally Sparse Gaussian Processes. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1648–1656. Curran Associates, Inc., 2015.
 [10] P. Huang, H. Avron, T. N. Sainath, V. Sindhwani, and B. Ramabhadran. Kernel methods match deep neural networks on TIMIT. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2014.
 [11] A. Kendall and Y. Gal. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?, Mar. 2017.
 [12] M. Kull, T. S. Filho, and P. Flach. Beta calibration: a wellfounded and easily implemented improvement on logistic calibration for binary classifiers. In A. Singh and J. Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 623–631, Fort Lauderdale, FL, USA, Apr. 2017. PMLR.
 [13] S. Kumar, M. Mohri, and A. Talwalkar. Ensemble Nystrom Method. In NIPS, pages 1060–1068. Curran Associates, Inc., 2009.
 [14] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial examples in the physical world, Feb. 2017. arXiv:1607.02533.
 [15] M. Kuss and C. E. Rasmussen. Assessing Approximate Inference for Binary Gaussian Process Classification. Journal of Machine Learning Research, 6:1679–1704, 2005.
 [16] Z. Lu, A. May, K. Liu, A. B. Garakani, D. Guo, A. Bellet, L. Fan, M. Collins, B. Kingsbury, M. Picheny, and F. Sha. How to Scale Up Kernel Methods to Be As Good As Deep Neural Nets. CoRR, abs/1411.4000, 2014.
 [17] A. G. Matthews, M. van der Wilk, T. Nickson, K. Fujii, A. Boukouvalas, P. LeónVillagrá, Z. Ghahramani, and J. Hensman. GPflow: A Gaussian process library using TensorFlow. Journal of Machine Learning Research, 18(40):1–6, Apr. 2017.
 [18] T. P. Minka. Expectation Propagation for approximate Bayesian inference. In Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, UAI ’01, pages 362–369, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.
 [19] Y. Mroueh, T. Poggio, L. Rosasco, and J.J. Slotine. Multiclass learning with simplex coding. In Advances in Neural Information Processing Systems, pages 2789–2797, 2012.
 [20] H. Nickisch and C. E. Rasmussen. Approximations for Binary Gaussian Process Classification. Journal of Machine Learning Research, 9:2035–2078, Oct. 2008.
 [21] A. NiculescuMizil and R. Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22Nd International Conference on Machine Learning, ICML ’05, pages 625–632. ACM, 2005.
 [22] J. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3), 1999.
 [23] C. E. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
 [24] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
 [25] R. Rifkin, G. Yeo, T. Poggio, and Others. Regularized leastsquares classification. Nato Science Series Sub Series III Computer and Systems Sciences, 190:131–154, 2003.
 [26] A. Rudi, R. Camoriano, and L. Rosasco. Less is More: Nyström Computational Regularization. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1657–1665. Curran Associates, Inc., 2015.
 [27] A. Rudi, L. Carratino, and L. Rosasco. FALKON: An Optimal Large Scale Kernel Method. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 3888–3898. Curran Associates, Inc., 2017.
 [28] A. Rudi, L. Carratino, and L. Rosasco. Falkon: An optimal large scale kernel method. In Advances in Neural Information Processing Systems, pages 3891–3901, 2017.
 [29] J. ShaweTaylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, New York, NY, USA, 2004.
 [30] S. Si, C.J. Hsieh, and I. S. Dhillon. Memory Efficient Kernel Approximation. In ICML, volume 32 of JMLR Proceedings, pages 701–709. JMLR.org, 2014.
 [31] A. J. Smola and B. Schölkopf. Sparse Greedy Matrix Approximation for Machine Learning. In ICML, pages 911–918. Morgan Kaufmann, 2000.
 [32] J. A. K. Suykens and J. Vandewalle. Least Squares Support Vector Machine Classifiers. Neural Process. Lett., 9(3):293–300, June 1999.
 [33] M. K. Titsias. Variational Learning of Inducing Variables in Sparse Gaussian Processes. In D. A. Dyk and M. Welling, editors, Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, AISTATS 2009, Clearwater Beach, Florida, USA, April 1618, 2009, volume 5 of JMLR Proceedings, pages 567–574. JMLR.org, 2009.
 [34] C. K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20:1342–1351, 1998.
 [35] C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In T. K. Leen, T. G. Dietterich, V. Tresp, T. K. Leen, T. G. Dietterich, and V. Tresp, editors, NIPS, pages 682–688. MIT Press, 2000.
 [36] B. Zadrozny and C. Elkan. Transforming Classifier Scores into Accurate Multiclass Probability Estimates. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’02, pages 694–699, New York, NY, USA, 2002. ACM.
Appendix A Extended calibration results
Reliability diagrams offer a visual representation of calibration properties, where accuracy is plotted as a function of confidence for the subsets . For a perfectly calibrated classifier, the accuracy function should be equal to the identity line, implying that . Large deviations from the identity line mean that the class probabilities are either underestimated or overestimated.
In Figure 6 we summarize the reliability diagrams for a number of binary classification datasets. Each row of diagrams corresponds to a particular dataset; and each column to one of the gpbased classification approaches that we have discussed in this work. In particular we consider the variational gp classification algorithm of [9] (gpc), our Dirichletbased classification scheme (gpd), and gp regression on the labels without and with a Plattscaling posthot calibration step (gpr and gpr (platt)). Note that each one of these approaches produces a distribution of classifiers. Thus, in the diagrams of Figure 6 we show the reliability curve of the mean classifier (depicted as solid linespoints), along with the classifiers described by the upper and lower 95% quantiles of the predictive distribution (grey area). If a classifier is wellcalibrated, then its reliability curve should be close to the identity curve (dashed line); the latter should also lie within the limits of the grey area.
For the results of Figure 6, we have considered subsets for different levels of confidence. We note that deviations from the identity curve should not deemed important, if these are not backed by a sufficient number of samples. For some datasets there are certain levels of confidence (middle section of htru2 for example) that contain very few data. In order to reflect this behavior, we also plot the histograms showing the proportion of the test set that corresponds to each confidence level.
A careful inspection of Figure 6 suggests that both gpc and gpd produce wellcalibrated models. The gpr approach on the other hand tends to produce a sigmoidshaped reliability curve, which suggests that there is underestimation of the class probabilities. This behavior is cured however by performing calibration via Plattscaling, as we see for the gpr (platt) method.
These conclusions are further supported by Figure 7, which summarizes the reliability plots for a number of multiclass datasets. In the multiclass case, it is not obvious how to concisely summarise the effect of the predictive distribution, so we resort to simple reliability plots of the average classifier for each method.
As a final remark, we note that the predictive distribution of the gpd approach is different from that obtained by gpc. In fact, judging form the upper and lower quantile classifiers as presented in Figure 6, it appears that gpd results in a narrower predictive distribution, which is nevertheless wellcalibrated. An extended characterization of the predictive distribution for gpd is subject of future work.