Scalable MultiClass Gaussian Process Classification using Expectation Propagation
Abstract
This paper describes an expectation propagation (EP) method for multiclass classification with Gaussian processes that scales well to very large datasets. In such a method the estimate of the logmarginallikelihood involves a sum across the data instances. This enables efficient training using stochastic gradients and minibatches. When this type of training is used, the computational cost does not depend on the number of data instances . Furthermore, extra assumptions in the approximate inference process make the memory cost independent of . The consequence is that the proposed EP method can be used on datasets with millions of instances. We compare empirically this method with alternative approaches that approximate the required computations using variational inference. The results show that it performs similar or even better than these techniques, which sometimes give significantly worse predictive distributions in terms of the test loglikelihood. Besides this, the training process of the proposed approach also seems to converge in a smaller number of iterations.
1 Introduction
Gaussian processes (GPs) are nonparametric models that can be used to address multiclass classification problems (Rasmussen & Williams, 2006). These models become more expressive as the number of data instances grows. They are also very useful to introduce prior knowledge in the learning problem, as many properties of the model are specified by a covariance function. Moreover, GPs provide an estimate of the uncertainty in the predictions made which may be critical in some applications. Nevertheless, in spite of these advantages, GPs scale poorly to large datasets because their training cost is , where is the number of instances. An additional challenge is that exact inference in these models is generally intractable and one has to resort to approximate methods in practice.
Traditionally, GP classification has received more attention in the binary case than in the multiclass setting (Kuss & Rasmussen, 2005; Nickisch & Rasmussen, 2008). The reason is that approximate inference is more challenging in the multiclass case where there is one latent function per class. To this one has to add more complicated likelihood factors, which often have the form of softmax functions or intractable Gaussian integrals. In spite of these difficulties, there have been several works addressing multiclass GP classification (Williams & Barber, 1998; Kim & Ghahramani, 2006; Girolami & Rogers, 2006; Chai, 2012; Riihimäki et al., 2013). Nevertheless, most of the proposed methods do not scale well with the size of the training set.
In the literature there have been some efforts to scale up GPs. These techniques often introduce a set of inducing points whose location is learnt alongside with the other model hyperparameters. The use of inducing points in the model can be understood as an approximate GP prior with a lowrank covariance structure (QuiñoneroCandela & Rasmussen, 2005). When inducing points are considered, the training cost can be reduced to . This allows to address datasets with several thousands of instances, but not millions. The reason is the difficulty of estimating the model hyperparameters, which is often done by maximizing an estimate of the logmarginallikelihood. Because such an estimate does not involve a sum across the data instances, one cannot rely on efficient methods for optimization based on stochastic gradients and minibatches.
A notable exception is the work of (Hensman et al., 2015a) which uses variational inference to approximate the calculations. Such a method allows for stochastic optimization and can address datasets with millions of instances. In this work we propose an alternative based on expectation propagation (EP) (Minka, 2001) and recent advances on binary GP classification (HernándezLobato & HernándezLobato, 2016). The proposed approach also allows for efficient training using minibatches. This leads to a training cost that is , where is the number of classes. An experimental comparison with the variational approach and related methods from the literature shows that the proposed approach has benefits both in terms of the training speed and the accuracy of the predictive distribution.
2 Scalable Multiclass Classification
Here we describe multiclass Gaussian process classification and the proposed method. Such a method uses the expectation propagation algorithm whose original description is modified to be more efficient both in terms of memory and computational costs. For this, we consider stochastic gradients to update the hyperparameters and an approximate likelihood that avoids onedimensional quadratures.
2.1 Multiclass Gaussian Process Classification
We consider a dataset of instances in the form of a matrix of attributes with labels , where and is the total number of different classes. The task of interest is to predict the class label of a new data instance .
A typical approach in multiclass Gaussian process (GP) classification is to assume the following labeling rule for given : , for , where each is a nonlinear latent function (Kim & Ghahramani, 2006). Define and . The likelihood of , , is then a product of factors of the form:
(1) 
where is the Heaviside step function. This likelihood takes value one if can explain the observed data and zero otherwise. Potential classification errors can be easily introduced in (1) by considering that each has been contaminated with Gaussian noise with variance . That is, , where .
In multiclass GP classification a GP prior is assumed for each function (Rasmussen & Williams, 2006). Namely, , where is some covariance function with hyperparameters . Often these priors are assumed to be independent. That is, , where each is a multivariate Gaussian distribution. The task of interest is to make inference about and for that Bayes’ rule is used: , where is a normalization constant (the marginal likelihood) which can be maximized to find good hyperparameters , for . However, because the likelihood in (1) is nonGaussian, evaluating and is intractable. Thus, these computations must be approximated. Often, one computes a Gaussian approximation to (Kim & Ghahramani, 2006). This results in a nonparametric classifier with training cost , where is the number of data instances.
To reduce the computational cost of the method described a typical approach is to consider a sparse representation for each GP. With this goal, one can introduce datasets of inducting points , with associated values for (Snelson & Ghahramani, 2006; NaishGuzman & Holden, 2008). Given each the prior for is approximated as , in which the conditional Gaussian distribution has been approximated by the factorizing distribution . This approximation is known as the full independent training conditional (FITC) (QuiñoneroCandela & Rasmussen, 2005), and it leads to a Gaussian prior with a lowrank covariance matrix. This allows for approximate inference with cost . The inducing points can be regarded as hyperparameters and can be learnt by maximizing the estimate of the marginal likelihood .
2.2 Method Specification and Expectation Propagation
The formulation of the previous section is limited because the estimate of the logmarginallikelihood cannot be expressed as a sum across the data instances. This makes infeasible the use of efficient methods based on stochastic optimization for finding the model hyperparameters.
A recent work focusing on the binary case has shown that it is possible to obtain an estimate of that involves a sum across the data instances if the values associated to the inducing points are not marginalized (HernándezLobato & HernándezLobato, 2016). We follow that work and consider the posterior approximation , where , , we have defined , and is a Gaussian approximation to . This distribution is obtained in three steps. First, we use on the exact posterior the FITC approximation:
(2) 
where we have defined and
(3) 
with , where
(4)  
(5) 
In the previous expressions is the p.d.f. of a Gaussian with mean and variance . Furthermore, is a vector with the covariances between and ; is a matrix with the cross covariances between ; and, finally, is the prior variance of .
A practical difficulty is that the integral in (3) is intractable. Although it can be evaluated using onedimensional quadrature techniques (HernándezLobato et al., 2011), in this paper we follow a different approach. For that, we note that (3) is simply the probability that for , given . Let and . The second step consists in approximating (3) as follows:
(6) 
where , is the c.d.f. of a standard Gaussian and , with , , and defined in (5). We have omitted in (6) the dependence on to improve the readability. The quality of this approximation is supported by the good experimental results obtained in Section 4. When (6) is replaced in (2) we get an approximate posterior distribution in which we can evaluate all the likelihood factors:
(7) 
where we have defined .
The r.h.s. of (7) is intractable due to the nonGaussian form of the likelihood factors. The third and last step uses expectation propagation (EP) (Minka, 2001) to get a Gaussian approximation to (7). This approximation is obtained by replacing each with an approximate Gaussian factor :
(8) 
where , , , , and we have defined . In (18) , , , and are free parameters adjusted by EP. Because the precision matrices in (18) are onerank (see the supplementary material for details), we only have to store in memory parameters for each . The posterior approximation is obtained by replacing in (7) each exact factor by the corresponding approximate factor . That is, , where is a normalization constant that approximates the marginal likelihood . Because all the factors involved in the computation of are Gaussian, and we assume independence among the latent functions of different classes in (18), is a product of multivariate Gaussians (on per class) on dimensions.
In EP each is updated until convergence as follows: First, is removed from by computing . Because the Gaussian family is closed under the product and division operations, is also Gaussian with parameters given by the equations in (Roweis, 1999). Then, the KullbackLeibler divergence between and , i.e, , is minimized with respect to , where is the normalization constant of . This is done by matching the moments of . These moments can be obtained from the derivatives of with respect to the parameters of (Seeger, 2006). After updating , the new approximate factor is . We update all the approximate factors at the same time, and reconstruct afterwards by computing the product of all the and the prior, as in (HernándezLobato et al., 2011).
The EP approximation to the marginal likelihood is the normalization constant of , . The log of its value is:
(9) 
where ; , , and are the natural parameters of , and the prior, respectively; and is the lognormalizer of a multivariate Gaussian distribution with natural parameters .
It is possible to show that if EP converges, the gradient of w.r.t the parameters of each is zero. Thus, the gradient of w.r.t. a hyperparameter of the th covariance function (including the inducing points) is:
(10) 
where and are the expected sufficient statistics under and the prior, respectively. Importantly, only the direct dependency of on has to be taken into account. See (Seeger, 2006). The dependency through , i.e., the natural parameters of can be ignored.
After obtaining and finding the model hyperparameters by maximizing , one can get an approximate predictive distribution for the label of a new instance :
(11) 
where we have defined , and has the same form as the likelihood factor in (3). The resulting integral in (11) is again intractable. However, it can be approximated using a onedimensional quadrature. See the supplementary material.
Because some simplifications occur when computing the derivatives of w.r.t the inducing points, the total training time of EP is while the total memory cost is (Snelson, 2007).
2.3 Scalable Expectation Propagation
Traditionally, for finding the model hyperparameters with EP one reruns EP until convergence (using the previous solution as the starting point), after each gradient ascent update of the hyperparameters. The reason for this is that (10) is only true if EP has converged (i.e., the approximate factors do not change any more). This approach is particularly inefficient initially, when there are strong changes to the model hyperparameters, and EP may require several iterations to converge. Recently, a more efficient method has been proposed in (HernándezLobato & HernándezLobato, 2016). In that work the authors suggest to update both the approximate factors and the model hyperparameters at the same time. Because we do not wait for EP to converge, one should ideally add to (10) extra terms to get the gradient. These terms account for the mismatch between the moments of and . However, according to (HernándezLobato & HernándezLobato, 2016) these extra terms can be ignored and one can simply use (10) for an inner update of the hyperparameters.
Figure 1 shows, for the Vehicle dataset from UCI repository (Lichman, 2013), the estimate of the marginal likelihood with respect to the training time, for 250 updates of the hyperparameters, and . We compare three methods: (i) rerunning EP until convergence each time and using (10) to update the hyperparameters (EPouter); (ii) updating at the same time the approximate factors and the hyperparameters with (10) (EPinnerapprox); and (iii) the same approach as the previous one, but using the exact gradient for the update instead of (10) (EPinnerexact). All approaches successfully maximize . However, the inner updates are more efficient as they do not wait until EP converges. Moreover, using the approximate gradient is faster (it is cheaper to compute), and it gives almost the same results as the exact gradient.
2.3.1 Stochastic Expectation Propagation
The memory cost of EP can be significantly reduced by a technique called stochastic EP (SEP) (Li et al., 2015). In SEP all the approximate factors are tied. This means that instead of storing their individual parameters, what is stored is their product, i.e., . A consequence of this is that we no longer have direct access to their individual parameters. This only affects the computation of the cavity distribution which now is obtained in an approximate way , where is the total number of factors and approximates each individual factor. Thus, SEP reduces the memory costs of EP by a factor of . All the other steps are carried out as in the original EP algorithm, including the computation of and its gradients. Figure 2 shows the differences between EP and SEP on a toy example. When SEP is used in the proposed method, the memory cost is reduced to .
2.3.2 Training Using Minibatches
Both the estimate of the logmarginallikelihood in (9) and its gradient in (10) contain a sum across the data instances. This allows to write an EP algorithm that processes minibatches of data, as in (HernándezLobato & HernándezLobato, 2016). For this, the data are split in minibatches of size , where is the number of instances. Given a minibatch , we process all the approximate factors corresponding to that minibatch, i.e., . Then, we update the model hyperparameters using a stochastic approximation of (10):
(12) 
where . We reconstruct after each update of the approximate factors and each update of the hyperparameters. When using minibatches of data, we update more frequently and the hyperparameters. The consequence is that the training cost is , assuming a constant number of updates until convergence. This training scheme can handle datasets with millions of instances.
3 Related Work
The likelihood used in (1) was first considered for multiclass Gaussian process classification in (Kim & Ghahramani, 2006). That work considers full nonparametric GP priors, which lead to a training cost that is . The consequence is that it can only address small classification problems. It is, however, straight forward to replace the nonparametric GP priors with the FITC approximate priors (QuiñoneroCandela & Rasmussen, 2005). These priors are obtained by marginalizing the latent variables associated to the inducing points , as indicated in Section 2.1. This allows to address datasets with a few thousand instances. This is precisely the approach followed in (NaishGuzman & Holden, 2008) to address binary GP classification problems. We refer to such an approach as the generalized FITC approximation (GFITC). Nevertheless, such an approach cannot use stochastic optimization. The reason is that the estimate of the logmarginallikelihood (needed for hyperparameter estimation) does not contain a sum across the instances. Thus, GFITC cannot scale well to very large datasets. Nevertheless, unlike the proposed approach, it can run expectation propagation over the exact likelihood factors in (1). In GFITC we follow the traditional approach and run EP until convergence before updating the hyperparameters.
Multiclass GP classification for potentially huge datasets has also been considered in (Hensman et al., 2015b) using variational inference (VI). However, such an approach cannot use the likelihood in (1) since its logarithm is not well defined (note that it takes value zero for some values of ). As an alternative, Hensman et al. (2015b) have considered the robust likelihood of (HernándezLobato et al., 2011):
(13) 
where is the probability of a labeling error (in that case, is chosen at random from the potential class labels). In (Hensman et al., 2015b) it is suggested to set .
We now describe the VI approach in detail. Using (13) and the definitions of Section 2, we know that . If we take the log and use Jensen’s inequality we get the bound . Consider now a Gaussian approximation to . Then,
(14) 
where we have used Jensen’s inequality and KL is the Kullback Leibler divergence. If we use the first bound we get
(15) 
where and the corresponding marginal over is . Note that is Gaussian because involves a Gaussian convolution. As in the proposed approach, is assumed to be a Gaussian factorizing over the latent functions . However, its mean and covariance parameters, i.e., and are found by maximizing (15). The parameters of are:
(16)  
(17) 
Hensman et al. (2015b) consider Markov chain Monte Carlo (MCMC) to sample the hyperparameters. Here we simply maximize (15) to find the hyperparameters and the inducing points. The reason for this is that in very large datasets MCMC is not expected to give much better results. We refer to the described approach as VI. The objective in (15) contains a sum across the data instances. Thus, VI also allows for stochastic optimization and it results in the same cost as the proposed approach. However, the expectations in (15) must be approximated using onedimensional quadratures. This is a drawback with respect to the proposed method which is free of any quadrature. Finally, there are some methods related to the VI approach just described. Dezfouli & Bonilla (2015) assume that can be a mixture of Gaussians, and Chai (2012) uses a softmax likelihood (but does not consider stochastic optimization). Both works need to introduce extra approximations.
In the literature there are other research works addressing multiclass Gaussian process classification. Some examples include (Williams & Barber, 1998; Girolami & Rogers, 2006; HernándezLobato et al., 2011; Henao & Winther, 2012; Riihimäki et al., 2013). These works employ expectation propagation, variational inference or the Laplace approximation to approximate the computations. Nevertheless, the corresponding estimate of the logmarginallikelihood cannot be expressed as a sum across the data instances. This avoids using efficient techniques for optimization based on stochastic gradients. Thus, one cannot address very large datasets with these methods.
4 Experiments
We evaluate the performance of the method proposed in Section 2.2. We consider two versions of it. A first one, using expectation propagation (EP). A second, using the memory efficient stochastic EP (SEP). EP and SEP are compared with the methods described in Section 3. Namely, GFITC and VI. All methods are codified in the R language (the source code is in the supplementary material), and they consider the same initial values for the model hyperparameters (including the inducing points, that are chosen at random from the training instances). The hyperparameters are optimized by maximizing the estimate of the marginal likelihood. A Gaussian covariance function with automatic relevance determination, an amplitude parameter and an additive noise parameter is employed.
4.1 Performance on Datasets from the UCI Repository
We evaluate the performance of each method on 8 datasets from the UCI repository (Lichman, 2013). The characteristics of the datasets are displayed in Table 1. We use batch training in each method (i.e., we go through all the data to compute the gradients). Batch training does not scale to large datasets. However, it is preferred on small datasets like the ones considered here. We use 90% of the data for training and 10% for testing, expect for Satellite which is fairly big, where we use 20% for training and 80% for testing. In Waveform, which is synthetic, we generate 1000 instances and split them in 30% for training and 70% for testing. Finally, in Vowel we consider only the points that belong to the 6 first classes. All methods are trained for 250 iterations using gradient ascent (GFITC and VI use lBFGS, EP and SEP use an adaptive learning rate described in the supplementary material). We consider three values for , the number of inducing points. Namely , and of the number of training instances. We report averages over 20 repetitions of the experiments.
Dataset  #Instances  #Attributes  #Classes 

Glass  214  9  6 
Newthyroid  215  5  3 
Satellite  6435  36  6 
Svmguide2  391  20  3 
Vehicle  846  18  4 
Vowel  540  10  6 
Waveform  1000  21  3 
Wine  178  13  3 
Table 2 shows, for each value of , the average negative test loglikelihood of each method with the corresponding error bars (test errors are shown in the supplementary material). The average training time of each method is also displayed. The best method (the lower the better) for each dataset is highlighted in bold face. We observe that the proposed approach, EP, obtains very similar results to those of GFITC, and sometimes it obtains the best results. The memory efficient version of EP, SEP, seems to provide similar results without reducing the performance. Regarding the computational cost, SEP is the fastest method (between 2 and 3 times faster than GFITC). VI is slower as a consequence of the quadratures required by this method. VI also gives much worse results in some datasets, e.g., Glass, Svmguide2 and Waveform. This is related to the optimization of in (15), instead of , which is closer to the data loglikelihood. In the EP objective in (9), is probably more similar to . This explains the much better results obtained by EP and SEP.
Problem  GFITC  EP  SEP  VI  


Glass  0.61  0.05  0.78  0.06  0.77  0.07  2.45  0.14 
Newthyroid  0.06  0.01  0.11  0.03  0.06  0.01  0.09  0.02  
Satellite  0.33  0.00  0.31  0.00  0.33  0.00  0.61  0.01  
Svmguide2  0.63  0.06  0.63  0.06  0.67  0.06  1.03  0.08  
Vehicle  0.32  0.01  0.34  0.02  0.34  0.02  0.76  0.05  
Vowel  0.16  0.01  0.25  0.01  0.25  0.01  0.41  0.05  
Waveform  0.42  0.01  0.36  0.00  0.39  0.01  0.89  0.02  
Wine  0.08  0.02  0.07  0.01  0.08  0.01  0.08  0.02  
Avg. Time  131  3.11  53.8  0.19  48.5  0.97  157  0.59  

Glass  0.58  0.05  0.74  0.06  0.79  0.07  2.18  0.14 
Newthyroid  0.07  0.01  0.06  0.01  0.06  0.01  0.05  0.01  
Satellite  0.34  0.00  0.30  0.00  0.34  0.00  0.58  0.01  
Svmguide2  0.67  0.05  0.67  0.05  0.74  0.07  0.90  0.10  
Vehicle  0.33  0.01  0.33  0.02  0.34  0.02  0.72  0.04  
Vowel  0.14  0.01  0.19  0.01  0.19  0.01  0.30  0.04  
Waveform  0.42  0.01  0.36  0.01  0.41  0.01  0.85  0.01  
Wine  0.07  0.01  0.06  0.01  0.07  0.01  0.07  0.01  
Avg. Time  264  6.91  102  0.64  96.6  1.99  179  0.78  

Glass  0.6  0.07  0.75  0.06  0.81  0.07  2.30  0.15 
Newthyroid  0.07  0.01  0.06  0.01  0.05  0.01  0.05  0.01  
Satellite  0.34  0.01  0.30  0.00  0.36  0.00  0.53  0.01  
Svmguide2  0.67  0.05  0.65  0.06  0.74  0.07  0.94  0.08  
Vehicle  0.33  0.01  0.33  0.02  0.34  0.02  0.63  0.04  
Vowel  0.12  0.01  0.16  0.01  0.18  0.01  0.15  0.03  
Waveform  0.43  0.01  0.37  0.01  0.45  0.01  0.80  0.01  
Wine  0.07  0.01  0.05  0.01  0.06  0.01  0.06  0.02  
Avg. Time  683  17.3  228  0.78  216  2.88  248  0.66 
GFITC 


EP 

SEP 

VI 

4.2 Analysis of Inducing Point Learning
We generate a synthetic two dimensional problem with three classes by sampling the latent functions from the GP prior and applying the rule . The distribution of is uniform in the box . We consider 1000 training instances and a growing number of inducing points, i.e., to . The initial location of the inducing points is chosen at random and it is the same for all the methods. We are interested in the location of the inducing points after training. Thus, we set the other hyperparameters to their true values (specified before generating the data) and we keep them fixed. All methods but VI are trained using batch methods during 2000 iterations. VI is trained using stochastic gradients for 2000 epochs (the batch version often gets stuck in local optima). We use ADAM with the default settings (Kingma & Ba, 2015), and as the minibatch size.
Figure 3 shows the location learnt by each method for the inducing points. Blue, red and green points represent the training data, black lines are decision boundaries and black border points are the inducing points. As we increase the number of inducing points the methods become more accurate. However, GFITC, EP and SEP identify decision boundaries that are better with a smaller number of inducing points. VI fails in this task. This is probably because VI updates the inducingpoints with a bad estimate of during the initial iterations. VI uses gradient steps to update , which is less efficient than the EP updates (free of any learning rate). GFITC, EP and SEP overlap the inducing points, which can be seen as a pruning mechanism (if two inducing points are equal, it is like having only one). This has already been observed in regression problems (Bauer et al., 2016). By contrast, VI places the inducing points near the decision boundaries. This agrees with previous results on binary classification (Hensman et al., 2015a).
4.3 Performance as a Function of the Training Time
Figure 4 shows the negative test loglikelihood of each method as a function of the training time on the Satellite dataset (EP results are not shown since it performs equal to SEP). Training is done as in Section 4.1. We consider a growing number of inducing points and report averages over 100 repetitions of the experiments. In all methods we use batch training. We observe that SEP is the method with the best performance at the lowest cost. Again, it is faster than GFITC because it optimizes and the hyperparameters at the same time, while GFITC waits until EP has converged to update the hyperparameters. VI is not very efficient for small values of , due to the quadratures. It also takes more time to get a good estimate of , which is updated by gradient descent and is less efficient than the EP updates. Similar results are obtained in terms of the test error. See the supplementary material. However, in that case VI does not overfit the training data.
4.4 Performance When Using Stochastic Gradients
In very large datasets batch training is infeasible, and one must use minibatches to update and to approximate the required gradients. We evaluate the performance of each method on the MNIST dataset (LeCun et al., 1998) with inducing points and minibatches with instances. This dataset has instances for training and for testing. The learning rate of each method is set using ADAM with the default parameters (Kingma & Ba, 2015). GFITC does not allow for stochastic optimization. Thus, it is ignored in the comparison. The test error and the negative test loglikelihood of each method is displayed in Figure 5 (top) as a function of the training time. In this larger dataset all methods perform similarly. However, EP and SEP take less time to converge than VI. SEP obtains a test error that is and average negative test loglikelihood that is . The results of VI are and , respectively. These results are similar to the ones reported in (Hensman et al., 2015a) using .
A last experiment considers all flights within the USA between 01/2008 and 04/2008 (http://statcomputing.org/dataexpo/2009). The task is to classify the flights according to their delay using three classes: On time, more than minutes of delay, or more than minutes before time. We consider attributes: age of the aircraft, distance covered, airtime, departure time, arrival time, day of the week, day of the month and month. After removing all instances with missing data instances remain, from which are used for testing and the rest for training. We use the same settings as on the MNIST dataset and evaluate each method. The results obtained are shown in Figure 5 (bottom). We also report the performance of a logistic regression classifier. Again, all methods perform similarly in terms of test error. However, EP and SEP converge faster and quickly outperform the linear model. Importantly, the negative test loglikelihood of VI starts increasing at some point, which is again probably due to the optimization of in (15). The supplementary material has further evidence supporting this.
5 Conclusions
We have proposed the first method for multiclass classification with Gaussian processes, based on expectation propagation (EP), that scales well to very large datasets. Such a method uses the FITC approximation to reduce the number of latent variables in the model from to , where , and is the number of data instances. For this, inducing points are introduced for each latent function in the model. Importantly, the proposed method allows for stochastic optimization as the estimate of the logmarginallikelihood involves a sum across the data. Moreover, we have also considered a stochastic version of EP (SEP) to reduce the memory usage. When minibatches and stochastic gradients are used for training, the computational cost of the proposed approach is , with the number of classes. The memory cost is .
We have compared the proposed method with other approaches from the literature based on variational inference (VI) (Hensman et al., 2015b), and with the model considered by Kim & Ghahramani (2006), which has been combined with FITC approximate priors (GFITC) (QuiñoneroCandela & Rasmussen, 2005). The proposed approach outperforms GFITC in large datasets as this method does not allow for stochastic optimization, and in small datasets it produces similar results. The proposed method, SEP, is slightly faster than VI which also allows for stochastic optimization. In particular, VI requires onedimensional quadratures which in small datasets are expensive. We have also observed that SEP converges faster than VI. This is probably because the EP updates, free of any learning rate, are more efficient for finding a good posterior approximation than the gradient ascent updates employed by VI.
An important conclusion of this work is that VI sometimes gives bad predictive distributions in terms of the test loglikelihood. The EP and SEP methods do not seem to have this problem. Thus, if one cares about accurate predictive distributions, VI should be avoided in favor of the proposed methods. In our experiments we have also observed that the proposed approaches tend to place the inducing points one on top of each other, which can be seen as an inducing point pruning technique (Bauer et al., 2016). By contrast, VI tends to place them near the decision boundaries.
Acknowledgements
The authors gratefully acknowledge the use of the facilities of Centro de Computación Científica (CCC) at Universidad Autónoma de Madrid. The authors also acknowledge financial support from Spanish Plan Nacional I+D+i, Grants TIN201342351P, TIN201676406P, TIN201570308REDT and TEC201681900REDT (MINECO/FEDER EU), and from Comunidad de Madrid, Grant S2013/ICE2845.
References
 Bauer et al. (2016) Bauer, M., van der Wilk, M., and Rasmussen, C. E. Understanding probabilistic sparse Gaussian process approximations. In Advances in Neural Information Processing Systems 29, pp. 1533–1541. 2016.
 Chai (2012) Chai, K. M. A. Variational multinomial logit Gaussian process. Journal of Machine Learning Research, 13:1745–1808, 2012.
 Dezfouli & Bonilla (2015) Dezfouli, A. and Bonilla, E. V. Scalable inference for Gaussian process models with blackbox likelihoods. In Advances in Neural Information Processing Systems 28, pp. 1414–1422. 2015.
 Girolami & Rogers (2006) Girolami, M. and Rogers, S. Variational Bayesian multinomial probit regression with Gaussian process priors. Neural Computation, 18:1790–1817, 2006.
 Henao & Winther (2012) Henao, R. and Winther, O. Predictive active set selection methods for Gaussian processes. Neurocomputing, 80:10–18, 2012.
 Hensman et al. (2015a) Hensman, J., Matthews, A., and Ghahramani, Z. Scalable variational Gaussian process classification. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, pp. 351–360, 2015a.
 Hensman et al. (2015b) Hensman, J., Matthews, A. G., Filippone, M., and Ghahramani, Z. MCMC for variationally sparse Gaussian processes. In Advances in Neural Information Processing Systems 28, pp. 1648–1656. 2015b.
 HernándezLobato (2010) HernándezLobato, D. Prediction Based on Averages over Automatically Induced Learners: Ensemble Methods and Bayesian Techniques. PhD thesis, 2010.
 HernándezLobato & HernándezLobato (2016) HernándezLobato, D. and HernándezLobato, J. M. Scalable Gaussian process classification via expectation propagation. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pp. 168–176, 2016.
 HernándezLobato et al. (2011) HernándezLobato, D., ández Lobato, J.M., and Dupont, P. Robust multiclass Gaussian process classification. In Advances in Neural Information Processing Systems 24, pp. 280–288, 2011.
 Kim & Ghahramani (2006) Kim, H.C. and Ghahramani, Z. Bayesian Gaussian process classification with the EMEP algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28:1948–1959, 2006.
 Kingma & Ba (2015) Kingma, D. P. and Ba, J. ADAM: a method for stochastic optimization. In Inrernational Conference on Learning Representations, pp. 1–15, 2015.
 Kuss & Rasmussen (2005) Kuss, M. and Rasmussen, C. E. Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research, 6:1679–1704, 2005.
 LeCun et al. (1998) LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86:2278–2324, 1998.
 Li et al. (2015) Li, Y., HernándezLobato, J. M., and Turner, R. E. Stochastic expectation propagation. In Advances in Neural Information Processing Systems 28, pp. 2323–2331. 2015.
 Lichman (2013) Lichman, M. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.
 Minka (2001) Minka, T. Expectation propagation for approximate Bayesian inference. In Proceedings of the 17th Annual Conference on Uncertainty in Artificial Intelligence, pp. 362–36, 2001.
 NaishGuzman & Holden (2008) NaishGuzman, A. and Holden, S. The generalized FITC approximation. In Advances in Neural Information Processing Systems 20, pp. 1057–1064. 2008.
 Nickisch & Rasmussen (2008) Nickisch, H. and Rasmussen, C. E. Approximations for binary Gaussian process classification. Journal of Machine Learning Research, 9:2035–2078, 2008.
 Petersen & Pedersen (2012) Petersen, K. B. and Pedersen, M. S. The Matrix Cookbook. Technical University of Denmark, 2012.
 QuiñoneroCandela & Rasmussen (2005) QuiñoneroCandela, J. and Rasmussen, C. E. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939–1959, 2005.
 Rasmussen & Williams (2006) Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2006.
 Riihimäki et al. (2013) Riihimäki, J., Jylänki, P., and Vehtari, A. Nested expectation propagation for Gaussian process classification with a multinomial probit likelihood. Journal of Machine Learning Research, 14:75–109, 2013.
 Roweis (1999) Roweis, S. Gaussian identities. Technical report, New York University, 1999.
 Seeger (2006) Seeger, M. Expectation propagation for exponential families. Technical report, Department of EECS, University of California, Berkeley, 2006.
 Snelson (2007) Snelson, E. Flexible and efficient Gaussian process models for machine learning. PhD thesis, 2007.
 Snelson & Ghahramani (2006) Snelson, E. and Ghahramani, Z. Sparse Gaussian processes using pseudoinputs. In Advances in Neural Information Processing Systems 18, pp. 1257–1264, 2006.
 Williams & Barber (1998) Williams, C. K. I. and Barber, D. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20:1342–1351, 1998.
Appendix A Introduction
In this appendix we give all the details to implement the EP algorithm for the proposed method described in the main manuscript. In particular, we describe how to reconstruct the posterior approximation from the approximate factors and how to refine these factors. We also detail the computation of the EP approximation to the marginal likelihood and its gradients. Finally, we include some additional experimental results.
Appendix B Reconstruction of the posterior approximation
In this section we show how to obtain the posterior distribution by multiplying the approximate factors and the prior . From the main manuscript we know that these elements have the following form:
(18)  
(19) 
where