Laplace Approximation for Logistic Gaussian Process Density Estimation and Regression
Abstract
Logistic Gaussian process (LGP) priors provide a flexible alternative for modelling unknown densities. The smoothness properties of the density estimates can be controlled through the prior covariance structure of the LGP, but the challenge is the analytically intractable inference. In this paper, we present approximate Bayesian inference for LGP density estimation in a grid using Laplace’s method to integrate over the non-Gaussian posterior distribution of latent function values and to determine the covariance function parameters with type II maximum a posteriori (MAP) estimation. We demonstrate that Laplace’s method with MAP is sufficiently fast for practical interactive visualisation of 1D and 2D densities. Our experiments with simulated and real 1D data sets show that the estimation accuracy is close to a Markov chain Monte Carlo approximation and state-of-the-art hierarchical infinite Gaussian mixture models. We also construct a reduced-rank approximation to speed up the computations for dense 2D grids, and demonstrate density regression with the proposed Laplace approach.
Aki Vehtari aki.vehtari@aalto.fi
Department of Biomedical Engineering and Computational Science
Aalto University
P.O. Box 12200
FI-00076 Aalto
Finland
Keywords: Gaussian process, logistic transformation, density estimation, density regression, approximate inference, Laplace’s method
1 Introduction
Logistic Gaussian process (LGP) priors provide a flexible alternative for modelling unknown densities (Leonard, 1978). With LGP models, densities can be estimated without restriction to a specific parameterized form, and the smoothness properties of the estimates can be controlled through the prior covariance structure. The challenge with the LGP model is the analytically intractable inference that results from the normalization term required to construct the prior distribution over density functions. This paper focuses on finite-dimensional approximations, where the Gaussian process and the integral required to ensure normalization are evaluated in a grid as described by Tokdar (2007). The theoretical properties of LGP are discussed by Tokdar and Ghosh (2007), who establish conditions for posterior consistency of LGP density estimation, and by van der Vaart and van Zanten (2009), who consider posterior convergence rates. Tokdar (2007) shows that as the spacing between the grid points decreases, the Kullback–Leibler divergence from an infinite-dimensional model to the finite-dimensional approximation converges to zero. Related considerations about a finite element approach for LGP density estimation are also presented by Griebel and Hegland (2010). Densities are often estimated on bounded intervals, although the estimation can be extended to unbounded intervals by transforming them into bounded intervals as proposed by Tokdar (2007).
Tokdar (2007) integrated over the latent values with Metropolis–Hastings sampling and over the potential grid point sets with Metropolis–Hastings–Green sampling. The purpose of the latter part is to keep the number of active grid points small and automatically find the most important grid point locations. By replacing the sampling with an analytic approximation, inference can be made faster even if a fixed and finer grid is assumed. In this paper, we consider Laplace’s method (LA) to approximate the posterior inference for LGP density estimation in a grid. Our objective is to obtain an accurate and quick approximation that enables practical estimation of densities by focusing on efficient ways to approximate Bayesian inference for LGP density estimation.
The proposed LA approach is related to other approaches in the literature. Given fixed covariance function parameters, the LA approximation for the posterior distribution involves finding the posterior mode, which is equivalent to a penalized maximum likelihood estimator considered by Leonard (1978) and Thorburn (1986). The marginal likelihood can be approximated with LA, which enables a fast gradient-based type II maximum a posteriori (MAP) estimation of the covariance function parameters, or alternatively, the posterior distribution can be integrated over. Lenk (1991) uses a truncated Karhunen–Loève representation and derives the moments of the density process, from which it is also possible to obtain a marginal likelihood approximation. Lenk evaluates the marginal likelihood of hyperparameter values in a grid, while we use gradients and BFGS quasi-Newton optimization or Markov chain Monte Carlo (MCMC) integration. In Lenk’s approach the mean of the density process is guaranteed to be positive, but there is no such guarantee for the whole posterior. The spectral representation also restricts the selection of covariance functions. Lenk (2003) refines his approach with a Fourier series expansion and MCMC sampling of the hyperparameters.
The computational complexity of the proposed LA approach is dominated by the covariance matrix operations, which scale in a straightforward implementation as $\mathcal{O}(m^3)$, where $m$ is the number of grid points. However, because of the discretization, the computational complexity is independent of the number of observations. Applying the proposed approach with a default grid size of 400, 1D and 2D density estimation takes one to two seconds, which facilitates interactive visualization of data with density estimates and violin plots (see, e.g., Figures 2–5 and 8). Additionally, we consider the fast Fourier transform (FFT) to speed up the computations with an even grid and stationary covariance functions. To avoid the cubic computational scaling in $m$ with dense 2D grids, we also exploit Kronecker product computations to obtain a reduced-rank approximation of the exact prior covariance structure.
The number of grid points grows exponentially with the data dimension $D$. Although the Kronecker product approach is suitable for reducing the computation time when $D$ is small, the exponentially increasing number of latent function values makes the proposed approach impractical for more than three or four dimensions. Adams et al. (2009) propose an alternative GP approach called the Gaussian process density sampler (GPDS), in which the numerical approximation of the normalizing term in the likelihood is avoided by a conditioning set and an elaborate rejection sampling method. The conditioning set is generated by the algorithm, which automatically places points where they are needed, making the estimation in higher-dimensional spaces easier. However, the computational complexity of GPDS scales cubically in the number of data points.
The main contribution of this paper is the construction of a fast approximation for LGP density estimation and regression by combining various ideas from the literature. Using Laplace’s method to integrate over the latent values, we avoid the slow mixing of MCMC. We present FFT and reduced-rank based speedups tailored for LGP with LA. We demonstrate that LA can be further improved by rejection and importance sampling. We also show that using the type II MAP estimation for two to three hyperparameters gives good results compared to the integration over the hyperparameters.
In the next section, we review the basics of the logistic Gaussian processes. In Section 3, we present the LA approach for LGP density estimation. We also introduce the additional approximations for speeding up the inference and consider briefly a similar LA approach for LGP density regression. In Section 4, we demonstrate the LA approach with several experiments, and compare it against MCMC and hierarchical infinite Gaussian mixture models by Griffin (2010).
2 Density Estimation with Logistic Gaussian Process Priors
We consider the problem of computing a density estimate given $n$ independently drawn $D$-dimensional data points $\mathbf{x}_1, \ldots, \mathbf{x}_n$ from an unknown distribution in a finite region $\mathcal{V}$ of $\mathbb{R}^D$. In this paper, we focus only on $D \in \{1, 2\}$. To find an estimate for the unknown density $f$, we can maximize the following log-likelihood functional
$$\log p(\mathbf{x} \mid f) = \sum_{i=1}^{n} \log f(\mathbf{x}_i), \qquad (1)$$
with the constraints $f(\mathbf{x}) \geq 0$ for $\mathbf{x} \in \mathcal{V}$ and $\int_{\mathcal{V}} f(\mathbf{x})\, d\mathbf{x} = 1$. The limiting solution leads to a mixture of delta functions located at the observations (Leonard, 1978), which is why we need to set prior beliefs about the unknown density to obtain more realistic estimates.
To introduce the constraints of the density being nonnegative and that its integral over $\mathcal{V}$ is equal to one, we employ the logistic density transform (Leonard, 1978)
$$f(\mathbf{x}) = \frac{\exp(g(\mathbf{x}))}{\int_{\mathcal{V}} \exp(g(\mathbf{s}))\, d\mathbf{s}}, \qquad (2)$$
where $g$ is an unconstrained latent function. To smooth the density estimates, we place a Gaussian process prior on $g$, which enables us to set the prior assumptions about the smoothness properties of the unknown density via the covariance structure of the GP prior.
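On a finite grid, the transform (2) reduces to a softmax over the latent values at the cell centres, scaled by the cell volume. The following sketch (plain NumPy; the grid and latent values are illustrative only, not taken from the paper) shows how unconstrained latent values map to a valid density:

```python
import numpy as np

def logistic_density_transform(f, cell_volume):
    """Discretized logistic density transform (cf. Equation 2):
    map unconstrained latent values on a grid to a density that is
    non-negative and integrates to one over the gridded region."""
    # Subtract the maximum for numerical stability of the exponentials.
    e = np.exp(f - np.max(f))
    p = e / np.sum(e)          # probability mass of each grid cell
    return p / cell_volume     # density value at each cell centre

# A toy 1D grid on [0, 1] with 5 cells and arbitrary latent values:
f = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
w = 1.0 / 5
density = logistic_density_transform(f, w)
print(np.sum(density) * w)  # approximately 1
```

Whatever values the latent function takes, the resulting discretized density is non-negative and sums to one over the grid, which is exactly the role of the normalization term in (2).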
We assume a zero-mean Gaussian process $g(\mathbf{x}) \sim \mathcal{GP}\left(0, k(\mathbf{x}, \mathbf{x}')\right)$, where the covariance function is denoted with $k(\mathbf{x}, \mathbf{x}')$ for a pair of inputs $\mathbf{x}$ and $\mathbf{x}'$. An example of a widely used covariance function is the stationary squared exponential covariance function
$$k(\mathbf{x}_i, \mathbf{x}_j) = \sigma^2 \exp\left(-\frac{1}{2} \sum_{d=1}^{D} \frac{(x_{i,d} - x_{j,d})^2}{l_d^2}\right), \qquad (3)$$
where the hyperparameters $\theta = (\sigma^2, l_1, \ldots, l_D)$ govern the smoothness properties of $g$ (Rasmussen and Williams, 2006). The length-scale hyperparameters $l_1, \ldots, l_D$ control how fast the correlation decreases in different dimensions, and $\sigma^2$ is a magnitude hyperparameter.
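As a concrete reference, the squared exponential covariance of Equation (3) can be evaluated over a set of grid points as follows (a minimal NumPy sketch; the grid and hyperparameter values are arbitrary illustrations):

```python
import numpy as np

def sq_exp_cov(X1, X2, magnitude2, lengthscales):
    """Squared exponential covariance (cf. Equation 3) with per-dimension
    length-scales; X1 is n1 x D and X2 is n2 x D."""
    Z1 = X1 / lengthscales             # scale each input dimension
    Z2 = X2 / lengthscales
    # Squared distances via the expansion |a - b|^2 = |a|^2 + |b|^2 - 2 a'b.
    sq = (Z1**2).sum(1)[:, None] + (Z2**2).sum(1)[None, :] - 2 * Z1 @ Z2.T
    return magnitude2 * np.exp(-0.5 * np.maximum(sq, 0.0))

X = np.linspace(0, 1, 4)[:, None]
K = sq_exp_cov(X, X, magnitude2=1.0, lengthscales=np.array([0.3]))
print(K.shape)  # (4, 4)
```

The diagonal entries equal the magnitude hyperparameter, and shrinking the length-scale makes the off-diagonal correlations decay faster, i.e., rougher density estimates.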
For the latent function in Equation (2), we assume the model $g(\mathbf{x}) = g_c(\mathbf{x}) + \mathbf{h}(\mathbf{x})^\top \boldsymbol{\beta}$, where the GP prior $g_c(\mathbf{x}) \sim \mathcal{GP}\left(0, k(\mathbf{x}, \mathbf{x}')\right)$ is combined with the explicit basis functions $\mathbf{h}(\mathbf{x})$. Regression coefficients are denoted with $\boldsymbol{\beta}$, and by placing a Gaussian prior $\boldsymbol{\beta} \sim \mathcal{N}(\mathbf{b}, B)$ with the mean $\mathbf{b}$ and the covariance $B$, the parameters $\boldsymbol{\beta}$ can be integrated out from the model, which results in the following GP prior for $g$ (O’Hagan, 1978; Rasmussen and Williams, 2006):
$$g(\mathbf{x}) \sim \mathcal{GP}\left(\mathbf{h}(\mathbf{x})^\top \mathbf{b},\; k(\mathbf{x}, \mathbf{x}') + \mathbf{h}(\mathbf{x})^\top B\, \mathbf{h}(\mathbf{x}')\right). \qquad (4)$$
We assume second-order polynomials for the explicit basis functions. We use $\mathbf{h}(x) = (x, x^2)^\top$ in 1D and $\mathbf{h}(\mathbf{x}) = (x_1, x_2, x_1^2, x_2^2, x_1 x_2)^\top$ in 2D, which leads to a GP prior that can favour density estimates where the tails of the distribution go eventually to zero. The effect of the basis functions is demonstrated in an illustrative 1D example of Figure 1.
We discretize $\mathcal{V}$ into $m$ subregions (or intervals in 1D), and collect the coordinates of the subregions into an $m \times D$ matrix $X$, where the $i$’th row denotes the center point of the $i$’th subregion. Given $X$, the GP prior (4) results in the Gaussian distribution over the latent function values
$$p(\mathbf{f} \mid X) = \mathcal{N}\left(\mathbf{f} \mid H^\top \mathbf{b},\; K + H^\top B H\right), \qquad (5)$$
where $\mathbf{f} = (f_1, \ldots, f_m)^\top$ is a column vector of latent values associated with each subregion. The entries of the covariance matrix $K$ are determined by the input points $X$ and the covariance function. The matrix $H$, of size $2 \times m$ in 1D and $5 \times m$ in 2D, contains the values of the fixed basis functions evaluated at $X$. We assume a weakly informative prior distribution for the regression coefficients by fixing the mean $\mathbf{b}$ (a zero vector of length 2 in 1D and 5 in 2D) and the covariance $B$ to a broad multiple of the identity matrix $I$ of size $2 \times 2$ (in 1D) and $5 \times 5$ (in 2D). The regression coefficient for the quadratic term should be negative in order to make the tails of the distribution go towards zero. However, the prior distribution does not force the negativity of the coefficient. In our experiments, the posterior of the coefficient was clearly below zero, but in our implementation it is also possible to force the negativity of the coefficient by rejection sampling (in 1D).
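The prior construction above can be sketched as follows. The grid, length-scale, and the scale `tau2` of the coefficient covariance $B$ are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Grid of m = 50 cell centres on a normalized 1D interval.
m = 50
x = np.linspace(-2, 2, m)

# Second-order polynomial basis; H is 2 x m in 1D, with rows (x, x^2).
H = np.vstack([x, x**2])

# Squared exponential prior covariance on the grid (cf. Equation 3).
sq = (x[:, None] - x[None, :])**2
K = 1.0**2 * np.exp(-0.5 * sq / 0.5**2)

# Integrate out the basis coefficients beta ~ N(b, B) (cf. Equation 5);
# b = 0 and B = tau2 * I, with tau2 chosen here only for illustration.
b = np.zeros(2)
tau2 = 10.0
prior_mean = H.T @ b                           # zero vector of length m
prior_cov = K + H.T @ (tau2 * np.eye(2)) @ H   # K + H' B H
print(prior_cov.shape)  # (50, 50)
```

The quadratic basis contributes large prior variance far from the origin, which is what allows the model to pull the tails of the estimated density towards zero.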
After the discretization, the log-likelihood contribution of an observation belonging to the $i$’th subregion can be written as
$$\log p(\mathbf{x} \in \mathcal{V}_i \mid \mathbf{f}) = f_i - \log\left(\sum_{j=1}^{m} w_j \exp(f_j)\right), \qquad (6)$$
where the latent value $f_i$ is associated with the $i$’th subregion $\mathcal{V}_i$. Throughout this paper, we assume a regular grid, and therefore the weights $w_j$ all have the same value and can be omitted from (6). The number of observations that fall within the $i$’th subregion is denoted with $y_i$, and all the count observations with an $m \times 1$ vector $\mathbf{y}$. The overall log-likelihood contribution of the observations is given by
$$\log p(\mathbf{x} \mid \mathbf{f}) = \mathbf{y}^\top \mathbf{f} - n \log\left(\sum_{j=1}^{m} \exp(f_j)\right). \qquad (7)$$
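Numerically, Equation (7) is a multinomial-type log-likelihood of the grid counts and is best evaluated with a stabilized log-sum-exp; a small sketch with made-up counts:

```python
import numpy as np

def lgp_loglik(f, y):
    """Log-likelihood of Equation (7): y' f - n * log sum_j exp(f_j),
    where y holds the counts of observations in the grid cells."""
    n = y.sum()
    # log-sum-exp with max subtraction for numerical stability
    lse = f.max() + np.log(np.exp(f - f.max()).sum())
    return y @ f - n * lse

# Made-up counts of n = 10 observations over 5 grid cells:
y = np.array([1, 2, 4, 2, 1])
f = np.array([-1.0, 0.0, 1.0, 0.0, -1.0])
print(lgp_loglik(f, y))
```

Because every cell's latent value appears in the normalizing sum, each observation couples all $m$ latent values; this is why the Hessian of (7) is a full matrix, unlike in GP classification.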
The prior (5) and the likelihood (7) are combined by Bayes’ rule, which results in the conditional posterior distribution of the latent values
$$p(\mathbf{f} \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \mathbf{f})\, p(\mathbf{f} \mid X)}{p(\mathbf{x} \mid X)}, \qquad (8)$$
where $p(\mathbf{x} \mid X)$ is the marginal likelihood. Due to the non-Gaussian likelihood (7), the posterior distribution is also non-Gaussian, and approximate methods are needed to integrate over $\mathbf{f}$.
3 Approximate Inference
In this section, we discuss the implementation issues of Laplace’s method for logistic GP density estimation. In Section 3.1, we present an efficient approach for the mode finding and the computation of the marginal likelihood and predictions. Further speedups are obtained by using the fast Fourier transform for matrix–vector multiplications and by Kronecker product and reduced-rank approximations suitable for cases with dense grids (large $m$). In Section 3.2, we consider the LA approach for logistic GP density regression, and in Section 3.3, we give a brief description of inference with MCMC.
3.1 Inference with the Laplace Approximation
Our approach resembles the Laplace approximation for GP classification (Williams and Barber, 1998; Rasmussen and Williams, 2006) and GP point process intensity estimation (Cunningham et al., 2008), but the implementation is different because in LGP each term in the likelihood (7) depends on all the latent values $\mathbf{f}$.
The Laplace approximation is based on a second-order Taylor expansion of $\log p(\mathbf{f} \mid \mathbf{x})$ around the posterior mode $\hat{\mathbf{f}}$, which results in the Gaussian approximation
$$q(\mathbf{f} \mid \mathbf{x}) = \mathcal{N}\left(\mathbf{f} \mid \hat{\mathbf{f}}, \Sigma\right), \qquad (9)$$
where $\hat{\mathbf{f}} = \arg\max_{\mathbf{f}} p(\mathbf{f} \mid \mathbf{x})$. The covariance matrix $\Sigma$ is given by
$$\Sigma = \left(K^{-1} + W\right)^{-1}, \qquad (10)$$
where $K$ denotes the prior covariance matrix from (5) and $W = -\nabla\nabla_{\mathbf{f}} \log p(\mathbf{x} \mid \mathbf{f})\big|_{\mathbf{f} = \hat{\mathbf{f}}}$. The likelihood (7) leads to a full matrix $W$ with the following structure^1
$$W = n\left(\operatorname{diag}(\boldsymbol{\sigma}) - \boldsymbol{\sigma}\boldsymbol{\sigma}^\top\right), \qquad (11)$$
where the nonnegative entries of the vector $\boldsymbol{\sigma}$ are given by
$$\sigma_i = \frac{\exp(\hat{f}_i)}{\sum_{j=1}^{m} \exp(\hat{f}_j)}. \qquad (12)$$
^1 We use the following notation: diag($\mathbf{v}$) with a vector argument means a diagonal matrix with the elements of the vector $\mathbf{v}$ on its diagonal, and diag($A$) with a matrix argument means a column vector consisting of the diagonal elements of the matrix $A$.
In the implementation, forming the full matrix $W$ can be avoided by using the vector $\boldsymbol{\sigma}$ and the structure (11). Similarly to multiclass classification with the softmax likelihood (Williams and Barber, 1998; Rasmussen and Williams, 2006), $W$ is positive semidefinite, and because $K$ is positive definite, $\log p(\mathbf{f} \mid \mathbf{x})$ has a unique maximum.
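The structure (11) means that products with $W$ need only the vector $\boldsymbol{\sigma}$; a small sketch checking the matrix-free product against the explicit $W$ (the grid size, counts, and latent values are arbitrary illustrations):

```python
import numpy as np

def softmax(f):
    e = np.exp(f - f.max())
    return e / e.sum()

def W_matvec(f, y, v):
    """Multiply W = n (diag(sigma) - sigma sigma') by a vector v
    without forming the full m x m matrix (structure of Equation 11)."""
    n = y.sum()
    s = softmax(f)
    return n * (s * v - s * (s @ v))

# Sanity check against the explicit matrix on a small grid:
rng = np.random.default_rng(0)
f = rng.normal(size=6)
y = np.array([3, 1, 0, 2, 4, 1])
s = softmax(f)
W = y.sum() * (np.diag(s) - np.outer(s, s))
v = rng.normal(size=6)
print(np.allclose(W @ v, W_matvec(f, y, v)))        # True
print(np.min(np.linalg.eigvalsh(W)) >= -1e-10)      # positive semidefinite
```

The matrix-free product costs $\mathcal{O}(m)$ per application instead of $\mathcal{O}(m^2)$, which is what makes the conjugate gradient and FFT-based schemes below practical.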
3.1.1 Newton’s Method for Finding the Mode
We use Newton’s method for finding the mode $\hat{\mathbf{f}}$. At each iteration, we need to compute
$$\mathbf{f}^{\mathrm{new}} = \left(K^{-1} + W\right)^{-1} \left(W \mathbf{f} + \nabla \log p(\mathbf{x} \mid \mathbf{f})\right), \qquad (13)$$
where $\nabla \log p(\mathbf{x} \mid \mathbf{f}) = \mathbf{y} - n \boldsymbol{\sigma}$. To increase the numerical stability of the computations, we use the factorization $W = R R^\top$, where
$$R = \sqrt{n}\, \operatorname{diag}(\boldsymbol{\sigma})^{1/2} \left(I - \boldsymbol{\sigma}^{1/2} \left(\boldsymbol{\sigma}^{1/2}\right)^\top\right), \qquad (14)$$
and $\boldsymbol{\sigma}^{1/2}$ denotes the elementwise square root of $\boldsymbol{\sigma}$.
Instead of a direct implementation of (13), we can apply the matrix inversion lemma (see, e.g., Harville, 1997) to write the Newton step in a numerically more preferable way as
$$\mathbf{f}^{\mathrm{new}} = \left(K - K R \left(I + R^\top K R\right)^{-1} R^\top K\right) \left(W \mathbf{f} + \nabla \log p(\mathbf{x} \mid \mathbf{f})\right). \qquad (15)$$
The inversion in Equation (15) can be computed by solving $\mathbf{z}$ from the linear system $\left(I + R^\top K R\right) \mathbf{z} = R^\top K \mathbf{a}$, where $\mathbf{a} = W \mathbf{f} + \nabla \log p(\mathbf{x} \mid \mathbf{f})$, with the conjugate gradient method. As discussed, for example, by Cunningham et al. (2008), a stationary covariance function and evenly spaced grid points lead to a Toeplitz covariance matrix, which can be embedded in a larger circulant matrix, enabling efficient computations by using the fast Fourier transform. With a Toeplitz covariance matrix $K$, we can achieve a small speedup with larger grid sizes in the evaluation of the Newton step (15), because all the multiplications of $K$ and any vector can be done efficiently in the frequency domain. By using the FFT, these matrix–vector multiplications become convolution operations for which we are required only to form a single row of the embedded circulant matrix instead of the full matrix $K$. The rest of the matrix–vector multiplications in (15) are fast because the matrix $H^\top$ has only two (in 1D) or five (in 2D) columns, and instead of forming the full matrix $R$, we can use the vector $\boldsymbol{\sigma}$ and exploit the structure of $R$ from Equation (14).
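The circulant-embedding trick for the Toeplitz matrix–vector products can be sketched as follows (illustrative grid and length-scale; only the first row of $K$ is ever needed):

```python
import numpy as np

def toeplitz_matvec_fft(first_row, v):
    """Multiply a symmetric Toeplitz matrix by a vector in O(m log m)
    by embedding it in a circulant matrix and using the FFT."""
    m = len(v)
    # First column of the (2m-2)-circulant embedding:
    # [r0, r1, ..., r_{m-1}, r_{m-2}, ..., r1]
    c = np.concatenate([first_row, first_row[-2:0:-1]])
    vp = np.concatenate([v, np.zeros(len(c) - m)])
    prod = np.fft.ifft(np.fft.fft(c) * np.fft.fft(vp)).real
    return prod[:m]

# Toeplitz K from a stationary covariance on an even 1D grid:
x = np.linspace(0, 1, 64)
r = np.exp(-0.5 * (x - x[0])**2 / 0.1**2)   # first row of K
K = np.array([[r[abs(i - j)] for j in range(64)] for i in range(64)])
v = np.random.default_rng(1).normal(size=64)
print(np.allclose(K @ v, toeplitz_matvec_fft(r, v)))  # True
```

In the actual Newton iteration, this routine would replace the dense `K @ v` products inside the conjugate gradient solver, so the full $m \times m$ covariance matrix never needs to be stored.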
3.1.2 The Approximate Marginal Likelihood and Predictions
Approximating the integration over $\mathbf{f}$ with the LA approximation enables fast gradient-based type II MAP estimation for choosing the values of the covariance function hyperparameters $\theta$. After finding $\hat{\mathbf{f}}$, the approximate log marginal likelihood can be evaluated as
$$\log q(\mathbf{x} \mid X, \theta) = \log p(\mathbf{x} \mid \hat{\mathbf{f}}) - \frac{1}{2} \hat{\mathbf{f}}^\top K^{-1} \hat{\mathbf{f}} - \frac{1}{2} \log \left| I + K W \right| \qquad (16)$$
(see, e.g., Rasmussen and Williams, 2006). The first and second terms of (16) are fast to compute by using the results from Newton’s algorithm, but the determinant term is more difficult to evaluate.
Cunningham et al. (2008) show that for point process intensity estimation with GP priors, the evaluation of a corresponding determinant term can be done efficiently by exploiting a low-rank structure of the observation model. In density estimation, $W$ has rank $m - 1$ due to the normalization over $\mathcal{V}$, and a similar low-rank representation as in intensity estimation cannot be used for $W$. Therefore, we form the full matrix and compute the Cholesky decomposition, which scales as $\mathcal{O}(m^3)$. Although this is computationally burdensome for large $m$, it is required only once per evaluation of the marginal likelihood after the convergence of Newton’s algorithm. Typically, in 1D cases with the number of intervals $m$ being less than one thousand, the LA approach is sufficiently fast for practical purposes (in our implementation the default size for $m$ is 400). However, the computations can become restrictive in 2D with large $m$. In such cases, we consider a reduced-rank approximation for the prior covariance to find a faster way to compute the determinant term and the approximate posterior covariance, as will be discussed in Section 3.1.3. The gradients of the log marginal likelihood with respect to the hyperparameters can be computed by evaluating explicit and implicit derivatives similarly as shown by Rasmussen and Williams (2006).
We place a prior distribution on the hyperparameters to improve the identifiability of the ratio of the magnitude and length-scale parameters of the covariance function. We assume a weakly informative half Student-$t$ distribution, as recommended for hierarchical models by Gelman (2006), with one degree of freedom and a variance equal to ten in 1D and one thousand in 2D for the magnitude hyperparameter. For the length-scales, we assume otherwise the same prior but with a variance equal to one. The MAP estimate for the hyperparameters is found by maximizing the approximate marginal posterior $q(\theta \mid \mathbf{x}) \propto q(\mathbf{x} \mid X, \theta)\, p(\theta)$, for which we use the BFGS quasi-Newton optimization. In addition to the MAP estimate, we also approximate the integration over the hyperparameters with the central composite design (LA-CCD-LGP) scheme, similarly as proposed by Rue et al. (2009).
To compute the joint posterior predictive distribution, we marginalize over the latent values by Monte Carlo using 8000 latent samples drawn from the multivariate normal posterior predictive distribution. This sampling is needed only once after the MAP estimate for the hyperparameters has been found, and the time consumed in this step is negligible compared to the other parts of the computations. In many cases, we have observed that the posterior weights of the quadratic terms of the basis functions are automatically small, and thus the effect of the basis functions is not strong for densities not well represented by them. We consider an optional rejection sampling step based on finite differences to enforce the tails to go to zero (in 1D), if necessary (not used if the density is defined to be bounded).
To improve the Gaussian approximation of the posterior distribution, we also consider an additional importance sampling step (LA-IS-LGP). Following Geweke (1989), we use the multivariate split Gaussian density as an approximation for the exact posterior. The multivariate split Gaussian is based on the posterior mode and covariance, but the density is scaled along the principal component axes (in the positive and negative directions separately) adaptively to match the skewness of the true distribution (see also Villani and Larsson, 2006). To further improve the performance, the discontinuous split Gaussian used by Geweke (1989) was replaced with a continuous version. We observed that scaling is not needed along the principal component axes corresponding to the smallest eigenvalues. To speed up the computation, we scale only along the first 50 principal component axes. The importance sampling step took approximately 0.3 seconds of additional time, whereas direct sampling from the Gaussian posterior distribution (to compute the predictions) took about 0.05 seconds. Due to this small additional computational demand, we included the importance sampling correction also in the experiments. In all the experiments the estimated effective number of samples (Kong et al., 1994) was reasonable, but we also included a sanity check in the code that gives a warning and uses a soft thresholding of the importance weights if the estimated effective number of samples falls below a threshold. As a comparison, using scaled Metropolis–Hastings sampling to obtain 500 effectively independent samples from the latent posterior given fixed hyperparameters took 24 minutes (similar inefficiency in mixing was observed with the other MCMC methods usually used for GPs).
3.1.3 The Reduced-Rank Approximation for 2D Density Estimation
To speed up the inference with 2D grids when $m$ is large, we propose a reduced-rank approximation that can be formed efficiently by exploiting Kronecker product computations. For separable covariance functions, the covariance structure of evenly spaced grid points can be presented in a Kronecker product form $K = K_1 \otimes K_2$, where $K_d$ is an $m_d \times m_d$ matrix representing the covariances between the latent values associated with the grid points in the $d$’th dimension (so that $m = m_1 m_2$). For Kronecker products, many matrix operations scale efficiently: for example, the determinant can be computed by using the determinants of the smaller matrices $K_1$ and $K_2$ as $|K_1 \otimes K_2| = |K_1|^{m_2} |K_2|^{m_1}$ (Harville, 1997). To compute the approximate marginal likelihood (16), we need to evaluate the determinant term $\log|I + K W|$. Unfortunately, the Kronecker product covariance structure is not preserved under the multiplication and summation operations, leading to the unfavourable $\mathcal{O}(m^3)$ scaling. However, we can exploit the Kronecker product to obtain the eigendecomposition of $K$ efficiently, and then form a reduced-rank approximation for $K$ by using only the $p$ largest eigenvalues with the corresponding eigenvectors. The idea of using the eigendecomposition to construct a reduced-rank approximation for the covariance matrix has been previously mentioned by Rasmussen and Williams (2006, Chapter 8). Denoting an eigenvalue of $K_1$ with $\lambda^{(1)}$ and an eigenvector of $K_1$ with $\mathbf{v}^{(1)}$, and similarly an eigenvalue of $K_2$ with $\lambda^{(2)}$ and an eigenvector of $K_2$ with $\mathbf{v}^{(2)}$, then $\lambda^{(1)} \lambda^{(2)}$ is an eigenvalue of $K_1 \otimes K_2$ and $\mathbf{v}^{(1)} \otimes \mathbf{v}^{(2)}$ an eigenvector of $K_1 \otimes K_2$ corresponding to that eigenvalue (Harville, 1997). Thus, instead of computing the eigendecomposition of $K$, which is an $\mathcal{O}(m^3)$ operation, we can compute the eigendecompositions of $K_1$ and $K_2$, which scale as $\mathcal{O}(m_1^3)$ and $\mathcal{O}(m_2^3)$, and form a desired number of eigenvalues and eigenvectors with the Kronecker product computations to obtain the reduced-rank approximation for $K$.
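The eigen-relations for Kronecker products described above can be sketched as follows (small illustrative grids; `kron_top_eig` is a hypothetical helper name, not from the paper's implementation):

```python
import numpy as np

def kron_top_eig(K1, K2, p):
    """Top-p eigenpairs of K1 kron K2 from the eigendecompositions of
    the small factors: eigenvalues multiply, and the eigenvectors are
    Kronecker products of the factor eigenvectors."""
    w1, V1 = np.linalg.eigh(K1)
    w2, V2 = np.linalg.eigh(K2)
    # All m1*m2 candidate eigenvalue products; keep the p largest.
    prod = np.outer(w1, w2).ravel()
    idx = np.argsort(prod)[::-1][:p]
    i1, i2 = np.unravel_index(idx, (len(w1), len(w2)))
    lam = prod[idx]
    U = np.stack([np.kron(V1[:, a], V2[:, b]) for a, b in zip(i1, i2)],
                 axis=1)
    return lam, U

# Check against a direct eigendecomposition on a small grid:
x1 = np.linspace(0, 1, 6); x2 = np.linspace(0, 1, 5)
K1 = np.exp(-0.5 * (x1[:, None] - x1[None, :])**2 / 0.2**2)
K2 = np.exp(-0.5 * (x2[:, None] - x2[None, :])**2 / 0.2**2)
lam, U = kron_top_eig(K1, K2, p=4)
w = np.linalg.eigvalsh(np.kron(K1, K2))[::-1]
print(np.allclose(lam, w[:4]))  # True
```

Only the two small eigendecompositions and $p$ Kronecker products of vectors are formed, so the cost stays far below the $\mathcal{O}(m^3)$ of decomposing the full grid covariance.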
We approximate the exact prior covariance with
$$\tilde{K} = U_p \Lambda_p U_p^\top + \Delta, \qquad (17)$$
where $\Lambda_p$ is a diagonal matrix of size $p \times p$ having the $p$ largest eigenvalues on its diagonal and $U_p$ is an $m \times p$ matrix consisting of the corresponding eigenvectors. Similarly to the fully independent conditional (FIC) approximation (Snelson and Ghahramani, 2006), we use the exact full-rank diagonal by setting $\Delta = \operatorname{diag}\left(K - U_p \Lambda_p U_p^\top\right)$ in (17) to obtain a more accurate approximation.
With the approximate prior (17), the mode $\hat{\mathbf{f}}$ can be found using Newton’s method in a similar way as described in Section 3.1.1. To evaluate the marginal likelihood approximation, we need to compute efficiently the determinant term in (16). Because the approximate prior (17) is a sum of a diagonal matrix and a rank-$p$ term, the determinant term can be evaluated by applying the matrix determinant lemma (see, e.g., Harville, 1997) together with the matrix inversion lemma, so that only $p \times p$ factorizations and diagonal operations are required and no $m \times m$ matrix needs to be formed. With the prior (17), we can also compute the approximate gradients of the marginal likelihood with respect to the hyperparameters without resorting to any $m \times m$ matrix operations.
For large $m$, we use the same fixed grid for the observations and predictions. After we have found the MAP estimate for the hyperparameters, we need to draw samples from $q(\mathbf{f} \mid \mathbf{x})$ to marginalize over the latent values. The posterior covariance is approximated by
$$\Sigma = \left(\tilde{K}^{-1} + W\right)^{-1}, \qquad (18)$$
where $\tilde{K}$ is the approximate prior covariance (17) with the explicit basis functions evaluated at the grid points.
where is the approximate prior covariance with the explicit basis functions evaluated at the grid points. We can rewrite in (18) as
(19) 
where
(20) 
In (20) we have denoted which is diagonal and therefore fast to evaluate. The matrix in Equation (20) is an lower triangular matrix, and it can be computed using the Cholesky decomposition
(21) 
which scales as . When drawing samples from the Gaussian approximation given the covariance matrix (18), the structure of can be exploited to avoid forming any matrix.
With the reduced-rank approximation based on the eigendecomposition, we avoid choosing (or optimizing) the locations of inducing inputs, which is required in many sparse approximations, such as the fully independent conditional (FIC) approach (e.g., Snelson and Ghahramani, 2006; Quiñonero-Candela and Rasmussen, 2005). With the possibility to choose the locations of the inducing inputs, the reduced-rank approximation would be more expressive, although choosing the locations automatically can be challenging if the correlation structure of the GP prior is to be preserved with a smaller number of inducing inputs. In addition, the optimization of the locations of the inducing inputs is not trivial and can lead to overfitting (see, e.g., the discussion and visualizations of different correlation structures by Vanhatalo et al., 2010). A limitation of the reduced-rank approximation based on Kronecker products, however, is that the covariance function must be separable with respect to the inputs, which leads to a restricted class of possible covariance functions. In this paper, we have tested the reduced-rank approximation with the squared exponential covariance function.
3.2 Density Regression with the Laplace Approximation
Logistic Gaussian processes are also suitable for estimating conditional densities $f(t \mid \mathbf{x})$, where $t$ is a target variable (see, e.g., Tokdar et al., 2010). We discretize both the covariate and target space in a finite region to model the conditional densities with the logistic GP. In this paper, we focus on modelling the conditional density of $t$ given a univariate predictor $x$, which leads to a 2D grid similarly as in the case of 2D density estimation. We denote the number of intervals in the covariate space with $m_c$ and the number of intervals in the target space with $m_t$. To estimate the conditional densities, we write the log-likelihood contribution of all the observations as
$$\log p(\mathbf{x} \mid \mathbf{f}) = \sum_{i=1}^{m_c} \left( \mathbf{y}_i^\top \mathbf{f}_i - n_i \log \sum_{j=1}^{m_t} \exp(f_{i,j}) \right), \qquad (22)$$
where $f_{i,j}$ is the $j$’th element of $\mathbf{f}_i$. The vector $\mathbf{f}_i$ contains all the latent values associated with the subregions conditioned on the $i$’th covariate value, that is, the latent values associated with the $i$’th slice of the grid. Similarly, $\mathbf{y}_i$ is a vector of length $m_t$ that consists of the numbers of observations in the subregions of the $i$’th slice of the grid. The vector $\mathbf{f}$ contains all the latent values of the grid and $\mathbf{y}$ all the count observations. To approximate the resulting non-Gaussian posterior distribution, we use Laplace’s method. The likelihood (22) results in $W$ in Equation (10) being a block-diagonal matrix. Similarly as in Section 3.1.1, the $i$’th block of $W$ can be factorized into $R_i R_i^\top$, where
$$R_i = \sqrt{n_i}\, \operatorname{diag}(\boldsymbol{\sigma}_i)^{1/2} \left(I - \boldsymbol{\sigma}_i^{1/2} \left(\boldsymbol{\sigma}_i^{1/2}\right)^\top\right). \qquad (23)$$
The total number of observations in the $i$’th slice of the grid is denoted with $n_i$, and the vector $\boldsymbol{\sigma}_i$ is formed as in Equation (12), but by considering only the latent values $\mathbf{f}_i$. Because the LA approach for LGP density regression follows closely the derivation presented in Section 3.1, the implementation issues are omitted here. Density regression becomes computationally challenging with dense grids and when applied to a larger number of covariate dimensions, and therefore computational speedups are required, but we do not consider these in this paper.
3.3 Markov Chain Monte Carlo
MCMC sampling enables approximating the integration over the posterior distribution without restricting the approximation to a Gaussian form. MCMC can be extremely slow, but in the limit of a long run it provides an exact result against which the accuracy of the LA approach can be measured.
We approximate the posterior distribution by sampling alternately from the conditional posterior of the latent values by using scaled Metropolis–Hastings sampling (Neal, 1998) and from the conditional posterior of the hyperparameters by using the no-U-turn sampler (NUTS) (Hoffman and Gelman, 2013). In the experiments, this combination was more efficient than the other MCMC methods usually used for GPs.
A fixed length of the MCMC chains was determined by estimating the convergence and effective sample size for different data sets. The convergence of the MCMC chains was diagnosed with visual inspection and the potential scale reduction factor (Brooks and Gelman, 1998). The effective sample size was estimated with Geyer’s initial monotone sequence estimator (Geyer, 1992). As the sampling from the conditional posterior of the latent values was significantly faster, each meta-iteration consisted of one hundred sampling steps for the latent values and one step for the hyperparameters. The sampling was initialised by using the hyperparameter values at the mode of the LA-approximated marginal posterior, and the initial latent values were sampled from the Gaussian approximation given the initial hyperparameter values. The sampling consisted of 5100 meta-iterations of which 100 iterations were removed as burn-in. The effective sample sizes with different data sets and random number sequences were about 50–100, which gives sufficient accuracy for the posterior mean of the density estimate. For a grid size of 400, the computation time was about 2200 seconds.
4 Experiments
In this section, we examine the performance of the Laplace approximation for the logistic Gaussian process (LA-LGP) with several simulated and real data sets. We compare LA-LGP to the MCMC approximation (MCMC-LGP) and to Griffin’s (2010) Dirichlet process mixture of Gaussians with common component variance (CCV-mixture) and different component variance (DCV-mixture). The CCV model assumes that all mixture components have equal variances, while the DCV model allows different variances for the components. Computation for the Dirichlet process mixture models is done with Gibbs sampling. We compare LA-LGP only to advanced Bayesian kernel methods (Griffin, 2010), since Tokdar (2007) and Adams et al. (2009) have already shown that the logistic GP works better than simple kernel methods (such as the Parzen method). Griffin showed that the CCV and DCV-mixture models performed equally well or better than other default priors for mixture models, which is why the other priors are excluded from our comparisons. We do not compare the performance to the other MCMC-based LGP approaches by Tokdar (2007) and Adams et al. (2009), as the prior is the same and, if using the same grid, the difference would only be in the implementation speed and convergence speed of the different MCMC methods. LA-LGP and MCMC-LGP were implemented using the GPstuff toolbox^2 (Vanhatalo et al., 2013), and the CCV/DCV-mixtures were computed using Griffin’s code^3.
^2 http://mloss.org/software/view/451/, http://becs.aalto.fi/en/research/bayes/gpstuff/
^3 http://www.kent.ac.uk/ims/personal/jeg28/BDEcode.zip
The squared exponential covariance function was employed in all the experiments with LGP. We also tested the Matérn, exponential, and rational quadratic covariance functions and their additive combinations, but the results did not improve considerably with these different covariance functions. To ensure that the same prior is suitable for different scales, we normalized the grid to have a zero mean and unit variance in all the experiments. If not otherwise mentioned, we used LGP with a grid size of 400.
The rest of this section is divided into five parts. We compare the performances of LA-LGP, MCMC-LGP and the CCV/DCV-mixtures with simulated 1D data in Section 4.1, and with real 1D data in Section 4.2. In Section 4.3, we test how different grid sizes and the number of data points affect the density estimates. We illustrate density estimation with simulated 2D data in Section 4.4, and finally, we demonstrate density regression with one simulated data set in Section 4.5.
4.1 Simulated 1D Data
Figure 2 shows the density estimates and the corresponding 95% credible intervals for single random realisations from the simulated data sets.
The simulated data sets are: Student-$t$; a mixture of two Student-$t$ distributions; Gamma; and a truncated Gamma + Gaussian mixture.
Student’s $t$ distribution was chosen as a simple unimodal distribution with thicker tails than the Gaussian. The mixture of two $t$ distributions was chosen as a more challenging case, where there are two separate modes with different widths. In the second plot of Figure 2, it can be seen that the short length-scale required to model the narrow peak makes the density estimate bumpy also elsewhere. The Gamma distribution was chosen as a simple distribution with a mode on the boundary, and the truncated Gamma + Gaussian mixture as a more difficult case with one mode on the boundary and another mode in the middle of the distribution. The truncated Gamma + Gaussian mixture was also used by Tokdar (2007) and Adams et al. (2009).
Figure 3 shows the pairwise comparison of LA-LGP to LA-LGP without importance sampling (LA-LGP, no IS), LA-LGP with the CCD integration (LA-CCD-LGP), MCMC-LGP, the CCV mixture and the DCV mixture. We made 100 realisations of random samples from each true density and computed for each method the mean log-predictive density (MLPD) over the true distribution. The distributions of the differences between the MLPDs are plotted as violin plots generated using LA-LGP, along with the median and 95% lines. There are no statistically significant differences between the methods for the first three data sets. For the last data set, the Gaussian mixture models do not work well because of the mode on the boundary. The violin plots in Figures 3 and 5 also illustrate another practical application of density estimation with the LA approach, whose speed facilitates interactive visualisation.
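The MLPD criterion can be sketched as follows; the true density, grid and candidate estimate below are toy stand-ins (a Student-t(4) truth and a deliberately too-wide Gaussian estimate), not the ones used in the experiments:

```python
import numpy as np
from scipy.stats import t as student_t

# MLPD over the true distribution, E_p[log p_est], approximated by a
# Riemann sum on the evaluation grid (a quadrature sketch, not GPstuff code).
grid = np.linspace(-8.0, 8.0, 400)
dx = grid[1] - grid[0]
p_true = student_t.pdf(grid, df=4)

def mlpd(p_est):
    """Mean log-predictive density of p_est under the true density p_true."""
    return float(np.sum(p_true * np.log(np.maximum(p_est, 1e-300)) * dx))

# A mismatched estimate (N(0, 4)) should score lower than the truth itself,
# since E_p[log p] - E_p[log q] is the (nonnegative) KL divergence.
p_gauss = np.exp(-0.5 * grid**2 / 4.0) / np.sqrt(2 * np.pi * 4.0)
score_true, score_gauss = mlpd(p_true), mlpd(p_gauss)
```

In the experiments the average is taken over fresh samples from the true density rather than by quadrature, but the quantity estimated is the same.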
4.2 Real 1D Data
Figure 4 shows the density estimates and the corresponding 95% credible intervals for the following real data sets:

Galaxy

Enzyme

Log acidity

Sodium lithium,
which were studied by Griffin (2010). The density estimates are visually similar to the results by Griffin (2010, Figure 5).
Figure 5 shows the pairwise comparison of LA-LGP to LA-LGP (no IS), LA-CCD-LGP, MCMC-LGP, the CCV mixture and the DCV mixture on the real data sets. For each method, we computed the leave-one-out cross-validation mean log-predictive density (CV-MLPD). Samples from the distributions of the differences between the CV-MLPDs were obtained using the Bayesian bootstrap (Rubin, 1981; Vehtari and Lampinen, 2002), and the violin plots were generated using LA-LGP.
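The Bayesian bootstrap step can be sketched as follows; the per-observation differences are synthetic stand-ins, and the function name and replicate count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def bayesian_bootstrap(diffs, n_rep=4000):
    """Bayesian bootstrap (Rubin, 1981) samples of the mean of per-observation
    log-predictive-density differences between two methods: each replicate
    reweights the n observations with Dirichlet(1, ..., 1) weights."""
    n = len(diffs)
    w = rng.dirichlet(np.ones(n), size=n_rep)   # (n_rep, n) weight vectors
    return w @ diffs                            # weighted means, one per replicate

# Hypothetical leave-one-out log-predictive differences for 100 observations
diffs = rng.normal(0.05, 0.2, size=100)
samples = bayesian_bootstrap(diffs)
lo, hi = np.percentile(samples, [2.5, 97.5])    # 95% interval of the mean difference
```

The resulting samples are what the violin plots summarise, together with the median and 95% lines.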
There is no substantial difference between the performances of LA-LGP and MCMC-LGP across the data sets. Importance sampling improves the performance of the Laplace approximation for the Galaxy data set, but brings no improvement for the other data sets. CCD and MCMC improved the performance only for the Enzyme data set. The CCV and DCV mixtures perform similarly to LA-LGP for all the data sets except Sodium lithium, for which they have slightly worse performance.
Figure 6 shows the effect of the rejection sampling and the importance sampling on the density estimation for the Galaxy data set. The rejection sampling helps to enforce decreasing tails. The importance sampling makes the estimate lower in the areas with no observations and correspondingly higher at the modes. Among the MCMC posterior samples, the Galaxy data had the most skewed marginal posterior distributions of the latent values, which explains the benefit of the importance sampling.
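The self-normalised importance-sampling correction can be illustrated with a 1-D toy problem; the skewed target below is an arbitrary assumption standing in for a skewed marginal posterior, and the Gaussian proposal plays the role of the Laplace approximation at the mode:

```python
import numpy as np

rng = np.random.default_rng(1)

def log_p(f):
    """Unnormalised log-density of a toy left-skewed target (assumption):
    standard normal for f <= 0, extra cubic damping for f > 0."""
    return -0.5 * f**2 - 0.3 * f**3 * (f > 0)

# Gaussian "Laplace" proposal fitted at the mode (sketch values)
mode, sd = 0.0, 1.0
f = rng.normal(mode, sd, size=20000)
log_q = -0.5 * ((f - mode) / sd)**2          # unnormalised proposal log-density

# Self-normalised importance weights w ∝ p/q
log_w = log_p(f) - log_q
w = np.exp(log_w - log_w.max())
w /= w.sum()
is_mean = float(np.sum(w * f))               # IS-corrected posterior mean
```

Because the target has a lighter right tail than the proposal, the corrected mean shifts left of the mode, mirroring how IS lowers the estimate where the Gaussian approximation puts too much mass.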
4.3 The Effect of the Number of Data and Grid Points
Figure 7 illustrates the effect of the number of grid and data points on the accuracy of LA-LGP and MCMC-LGP. The number of grid points was 400 when the number of data points was varied, and the number of data points was 100 when the number of grid points was varied. For each combination, we made 100 realisations of random samples from the mixture of two t distributions and from the truncated Gamma + Gaussian distribution. For both data sets the KL divergence approaches zero as the number of data points increases. About 50–100 grid points seem sufficient for the mixture of two t distributions, whereas the truncated Gamma + Gaussian distribution is less sensitive to the number of grid points. There is no practical difference between the performances of LA-LGP and MCMC-LGP.
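The KL divergence from the true density to an estimate on a grid can be sketched as a simple quadrature; the densities below are toy examples (a standard normal truth and a shifted estimate), not the experimental distributions:

```python
import numpy as np

def kl_on_grid(p, q, dx):
    """KL(p || q) approximated by a Riemann sum over a regular grid."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], 1e-300)) * dx))

grid = np.linspace(-5.0, 5.0, 400)
dx = grid[1] - grid[0]
p = np.exp(-0.5 * grid**2); p /= p.sum() * dx           # true density, renormalised
q = np.exp(-0.5 * (grid - 0.5)**2); q /= q.sum() * dx   # shifted "estimate"
```

For two unit-variance Gaussians whose means differ by 0.5, the analytic KL divergence is 0.5²/2 = 0.125, which the grid sum reproduces closely.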
We measured the computation times for density estimation with different grid sizes using 100 random samples from the Student-t distribution. LA-LGP with the grid sizes took about seconds using four cores of an Intel(R) Xeon(R) 2.67GHz. If the matrix–vector multiplications in Newton’s algorithm were made with the FFT, the times were about seconds. MCMC-LGP with the same grid sizes took approximately seconds. Using Griffin’s code with the default options, the CCV mixture took about 200 seconds and the DCV mixture about 800 seconds. Note that these time comparisons are only approximate, and the computation times depend considerably on the specific implementation and on the chosen convergence criteria.
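The FFT speed-up rests on the fact that a stationary covariance on a regular grid is Toeplitz, so a matrix–vector product can be computed by circulant embedding in O(n log n). A minimal sketch of this standard trick (not the GPstuff implementation):

```python
import numpy as np

def toeplitz_matvec_fft(first_col, v):
    """Multiply a symmetric Toeplitz matrix, given by its first column,
    with a vector via circulant embedding and the FFT."""
    n = len(v)
    # Embed into a circulant of length 2n: [c0..c_{n-1}, 0, c_{n-1}..c1]
    c = np.concatenate([first_col, [0.0], first_col[-1:0:-1]])
    v_pad = np.concatenate([v, np.zeros(n)])
    y = np.fft.irfft(np.fft.rfft(c) * np.fft.rfft(v_pad), len(c))
    return y[:n]                                  # first n entries give T @ v

# Check against a dense squared-exponential Toeplitz matrix on a small grid
idx = np.arange(8)
first_col = np.exp(-0.5 * idx**2 / 4.0)
T = first_col[np.abs(idx[:, None] - idx[None, :])]
v = np.arange(8.0)
```

The dense product costs O(n²) per Newton iteration, whereas the embedded FFT version costs O(n log n), which is where the reported timing difference comes from.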
4.4 Simulated and Real 2D Data
The columns 1–4 of Figure 8 show the following four simulated 2D distributions:

Student-t

Mixture of two Gaussians

Banana

Ring.
Given a random sample from each distribution, we compute the 2D density estimates with LA-LGP. The mean estimates are illustrated in the lower row of Figure 8. The fifth column of Figure 8 shows two density estimates for the Old Faithful data. The upper plot shows the estimate with MCMC-LGP and the lower plot the estimate with LA-LGP. The density estimation for LA-LGP with a grid took about two seconds and for MCMC-LGP about 29 minutes.
We tested the reduced-rank approximation with two 2D data sets. Given observations from the Student-t and the mixture-of-two-Gaussians distributions, we measured the computation times and the KL divergences from the true density to the estimated density with different grid sizes. Figure 9 shows the elapsed times and the KL divergences as a function of the grid size for LA-LGP with the exact prior covariance (Full) and with the reduced-rank prior covariance (Kron) of Equation (17). We formed the reduced-rank approximation by excluding all the eigenvalues smaller than , or taking at most 50% of all the eigenvalues. The differences between the exact and the reduced-rank prior are small in the KL sense, but for grid sizes larger than , LA-LGP with the full prior covariance matrix becomes computationally more expensive than with the reduced-rank approximation.
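For a separable covariance on a regular 2D grid, the prior covariance has Kronecker structure, K = K1 ⊗ K2, so its eigenvalues are all pairwise products of the 1-D eigenvalues. The rank reduction can then be sketched as thresholding those products; the grid sizes, lengthscale and threshold below are illustrative assumptions, not the settings of Figure 9:

```python
import numpy as np

def se_cov(x, ell=1.0):
    """Squared exponential covariance matrix for 1-D points x."""
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / ell)**2)

# 30 x 30 grid: eigenvalues of kron(K1, K2) are outer products of the
# 1-D eigenvalues, so they can be computed from two 30 x 30 problems
# instead of one 900 x 900 problem.
x1 = np.linspace(-2.0, 2.0, 30)
x2 = np.linspace(-2.0, 2.0, 30)
e1 = np.linalg.eigvalsh(se_cov(x1))
e2 = np.linalg.eigvalsh(se_cov(x2))
e = np.outer(e1, e2).ravel()                 # all 900 eigenvalues of K

keep = e > 1e-6 * e.max()                    # drop negligible eigenvalues
rank = int(keep.sum())                       # rank of the reduced approximation
```

Because the squared exponential spectrum decays quickly, the retained rank is typically a small fraction of the full grid size, which is what makes the Kron variant cheaper for dense grids.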
4.5 Density Regression with Simulated Data
The estimation of a conditional density in a grid with the LA approach is essentially similar to 1D density modelling with LA-LGP. Therefore, we consider density regression with only one simulated data set in this paper.
We demonstrate density regression with a simulated case studied by Kundu and Dunson (2011). The first plot of Figure 10 shows the simulated density regression scheme, where the predictors are generated from a trimodal density and a univariate target is generated conditionally on the predictors. The second and third plots of Figure 10 show the estimates with the Laplace approximation (LA-DR) and with MCMC sampling (MCMC-DR) given a single realisation of samples. Density regression in a grid with LA-DR took about three seconds and with MCMC 1600 seconds. Finally, to show the differences between the estimates obtained with LA-DR and LA-LGP, we illustrate a conditional density estimate computed directly from a 2D density estimate obtained with LA-LGP (the fourth plot of Figure 10).
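The conditional estimate in the fourth plot amounts to renormalising each x-slice of a joint density estimate on the grid. A minimal sketch, with a toy Gaussian joint standing in for the LA-LGP estimate:

```python
import numpy as np

# Grid for predictor x and target y (sizes are arbitrary assumptions)
x = np.linspace(-3.0, 3.0, 80)
y = np.linspace(-3.0, 3.0, 100)
dy = y[1] - y[0]

# Toy stand-in for a joint density estimate p(x, y) on the grid, shape (80, 100)
X, Y = np.meshgrid(x, y, indexing="ij")
joint = np.exp(-0.5 * (X**2 + (Y - 0.5 * X)**2))

# Conditional density p(y | x): renormalise each row over y
cond = joint / (joint.sum(axis=1, keepdims=True) * dy)
```

Each row of `cond` then integrates to one over y, so `cond[i]` is a density estimate for the target given the i-th predictor value.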
5 Discussion
In this paper, we have proposed using Laplace’s method for fast logistic Gaussian process density estimation. The empirical results with 1D data sets indicate that the accuracy of the proposed LA approach with type-II MAP estimation is close to that of the state-of-the-art MCMC methods for density estimation. The logistic Gaussian process with Laplace’s method also avoids the sampling and convergence-assessment problems related to the highly multimodal posterior distributions of the mixture models.
Density estimation with LA-LGP and 400 grid points takes one to two seconds, which is a reasonable waiting time for interactive use. A finer grid or more dimensions would require more grid points, and this calls for additional approximations. In this paper, we have considered a reduced-rank approximation for LA-LGP that avoids the infamous cubic scaling of the basic Gaussian process computations. In addition, we have demonstrated the suitability of the Laplace approach for estimating conditional densities with one predictor variable.
Instead of Laplace’s method, other deterministic approximations could be applied to speed up the posterior inference compared to MCMC. Variational approximations, including variational bounding and factorized variational approximations, have been considered for GP models. Although these are close to LA in speed, and can be more accurate than LA with suitable hyperparameter settings, they can also have problems in estimating the hyperparameters (see, e.g., the comparisons between various Gaussian approximations in GP classification problems by Nickisch and Rasmussen, 2008). Expectation propagation (EP) has been shown to perform better than the Laplace or variational approximations for binary classification (Nickisch and Rasmussen, 2008) and for Student-t regression (Jylänki et al., 2011). However, for the Poisson model EP was only slightly better than the Laplace approximation (Vanhatalo et al., 2010). Because the Laplace approximation for the logistic Gaussian process performed almost as well as the MCMC approximation, we believe that EP could only slightly improve on the performance of LA. Moreover, the implementation of expectation propagation for a non-diagonal likelihood is nontrivial. A quadrature-free moment matching for EP could also be considered for density estimation, similarly to what was done for multiclass GP classification (Riihimäki et al., 2013), but in our preliminary testing the moment matching of the distributions turned out to be quite slow.
The Gaussian approximations can be improved by considering corrections for the marginal posterior distributions (Rue et al., 2009; Cseke and Heskes, 2011). These corrections can be challenging for the LGP model because the likelihood function cannot be factorized into terms depending only on a single latent value. In the future, it would be interesting to see whether corrections similar to those considered by Rue et al. (2009) could be extended to LA-LGP, where each likelihood term depends on multiple latent values.
The code for LA-LGP, MCMC-LGP and the violin plots is available as part of the free GPstuff toolbox for Matlab and Octave (http://becs.aalto.fi/en/research/bayes/gpstuff/ or http://mloss.org/software/view/451/).
Acknowledgments
The authors would like to thank Enrique Lelo de Larrea Andrade for his help with the implementation of the importance sampling, and Pasi Jylänki, Janne Ojanen, Tomi Peltola, Arno Solin and the anonymous reviewers for helpful comments on the manuscript. We acknowledge the computational resources provided by the Aalto Science-IT project. The research has been funded by the Academy of Finland (grant 218248).
References
 Adams (2009) Ryan Prescott Adams. Kernel Methods for Nonparametric Bayesian Inference of Probabilities and Point Processes. PhD thesis, University of Cambridge, 2009.
 Adams et al. (2009) Ryan Prescott Adams, Iain Murray, and David J.C. MacKay. The Gaussian process density sampler. In Advances in Neural Information Processing Systems 21, pages 9–16. NIPS foundation, 2009.
 Brooks and Gelman (1998) Stephen P. Brooks and Andrew Gelman. General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7(4):434–455, 1998.
 Cseke and Heskes (2011) Botond Cseke and Tom Heskes. Approximate marginals in latent Gaussian models. Journal of Machine Learning Research, 12:417–454, 2011.
 Cunningham et al. (2008) John P. Cunningham, Krishna V. Shenoy, and Maneesh Sahani. Fast Gaussian process methods for point process intensity estimation. In Andrew McCallum and Sam Roweis, editors, Proceedings of the 25th Annual International Conference on Machine Learning (ICML 2008), pages 192–199. Omnipress, 2008.
 Gelman (2006) Andrew Gelman. Prior distributions for variance parameters in hierarchical models. Bayesian Analysis, 1(3):515–533, 2006.
 Geweke (1989) John Geweke. Bayesian inference in econometric models using Monte Carlo integration. Econometrica, 57(6):1317–1339, 1989.
 Geyer (1992) Charles J. Geyer. Practical Markov chain Monte Carlo. Statistical Science, 7(4):473–483, 1992.
 Griebel and Hegland (2010) Michael Griebel and Markus Hegland. A finite element method for density estimation with Gaussian process priors. SIAM Journal on Numerical Analysis, 47(6):4759–4792, 2010.
 Griffin (2010) J.E. Griffin. Default priors for density estimation with mixture models. Bayesian Analysis, 5(1):45–64, 2010.
 Harville (1997) David A. Harville. Matrix Algebra From a Statistician’s Perspective. SpringerVerlag, 1997.
 Hoffman and Gelman (2013) Matthew D. Hoffman and Andrew Gelman. The no-U-turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 2013. In press.
 Jylänki et al. (2011) Pasi Jylänki, Jarno Vanhatalo, and Aki Vehtari. Robust Gaussian process regression with a Student-t likelihood. Journal of Machine Learning Research, 12:3227–3257, 2011.
 Kong et al. (1994) Augustine Kong, Jun S. Liu, and Wing Hung Wong. Sequential imputations and Bayesian missing data problems. Journal of the American Statistical Association, 89(425):278–288, 1994.
 Kundu and Dunson (2011) Suprateek Kundu and David B. Dunson. Latent factor models for density estimation, 2011. arXiv:1108.2720v2.
 Lenk (1991) Peter J. Lenk. Towards a practicable Bayesian nonparametric density estimator. Biometrika, 78(3):531–543, 1991.
 Lenk (2003) Peter J. Lenk. Bayesian semiparametric density estimation and model verification using a logisticGaussian process. Journal of Computational and Graphical Statistics, 12(3):548–565, 2003.
 Leonard (1978) Tom Leonard. Density estimation, stochastic processes, and prior information. Journal of the Royal Statistical Society: Series B (Methodological), 40(2):113–146, 1978.
 Neal (1998) Radford Neal. Regression and classification using Gaussian process priors. In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics 6, pages 475–501. Oxford University Press, 1998.
 Nickisch and Rasmussen (2008) Hannes Nickisch and Carl Edward Rasmussen. Approximations for binary Gaussian process classification. Journal of Machine Learning Research, 9:2035–2078, 2008.
 O’Hagan (1978) Anthony O’Hagan. Curve fitting and optimal design for prediction. Journal of the Royal Statistical Society. Series B (Methodological), 40(1):1–42, 1978.
 QuiñoneroCandela and Rasmussen (2005) Joaquin QuiñoneroCandela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6(3):1939–1959, 2005.
 Rasmussen and Williams (2006) Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, 2006.
 Riihimäki et al. (2013) Jaakko Riihimäki, Pasi Jylänki, and Aki Vehtari. Nested expectation propagation for Gaussian process classification with a multinomial probit likelihood. Journal of Machine Learning Research, 14:75–109, 2013.
 Rubin (1981) Donald B. Rubin. The Bayesian bootstrap. Annals of Statistics, 9(1):130–134, 1981.
 Rue et al. (2009) Håvard Rue, Sara Martino, and Nicolas Chopin. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 71(2):319–392, 2009.
 Snelson and Ghahramani (2006) Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18. The MIT Press, 2006.
 Thorburn (1986) Daniel Thorburn. A Bayesian approach to density estimation. Biometrika, 73(1):65–75, 1986.
 Tokdar (2007) Surya T. Tokdar. Towards a faster implementation of density estimation with logistic Gaussian process priors. Journal of Computational and Graphical Statistics, 16(3):633–655, 2007.
 Tokdar and Ghosh (2007) Surya T. Tokdar and Jayanta K. Ghosh. Posterior consistency of logistic Gaussian process priors in density estimation. Journal of Statistical Planning and Inference, 137(1):34–42, 2007.
 Tokdar et al. (2010) Surya T. Tokdar, Yu M. Zhu, and Jayanta K. Ghosh. Bayesian density regression with logistic Gaussian process and subspace projection. Bayesian Analysis, 5(2):319–344, 2010.
 van der Vaart and van Zanten (2009) A. W. van der Vaart and J. H. van Zanten. Adaptive Bayesian estimation using a Gaussian random field with inverse Gamma bandwidth. The Annals of Statistics, 37(5B):2655–2675, 2009.
 Vanhatalo et al. (2010) Jarno Vanhatalo, Ville Pietiläinen, and Aki Vehtari. Approximate inference for disease mapping with sparse Gaussian processes. Statistics in Medicine, 29(15):1580–1607, 2010.
 Vanhatalo et al. (2013) Jarno Vanhatalo, Jaakko Riihimäki, Jouni Hartikainen, Pasi Jylänki, Ville Tolvanen, and Aki Vehtari. GPstuff: Bayesian modeling with Gaussian processes. Journal of Machine Learning Research, 14:1175–1179, 2013.
 Vehtari and Lampinen (2002) Aki Vehtari and Jouko Lampinen. Bayesian model assessment and comparison using crossvalidation predictive densities. Neural Computation, 14(10):2439–2468, 2002.
 Villani and Larsson (2006) Mattias Villani and Rolf Larsson. The multivariate split normal distribution and asymmetric principal components analysis. Communications in Statistics: Theory & Methods, 35(6):1123–1140, 2006.
 Williams and Barber (1998) Christopher K. I. Williams and David Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351, 1998.