A determinant-free method to simulate the parameters of large Gaussian fields
Abstract
We propose a determinant-free approach for simulation-based Bayesian inference in high-dimensional Gaussian models. We introduce auxiliary variables with covariance equal to the inverse covariance of the model. The joint probability of the auxiliary model can be computed without evaluating determinants, which are often hard to compute in high dimensions. We develop a Markov chain Monte Carlo sampling scheme for the auxiliary model that requires no more than the application of inverse-matrix-square-roots and the solution of linear systems. These operations can be performed at large scales with rational approximations. We provide an empirical study on both synthetic and real-world data for sparse Gaussian processes and for large-scale Gaussian Markov random fields.
1 Introduction
Linear Gaussian models are one of the most fundamental tools for statistical analysis, with diverse applications such as spatial modelling, uncertainty quantification, dimensionality reduction, and time-series modelling, e.g. speech recognition or analysing high-dimensional recordings of neural activity (Roweis and Ghahramani, 1999; Rosti, 2004; Macke et al., 2011). A set of observations, $y$, is assumed to be generated from a set of latent Gaussian variables, $x$, which are subject to a linear transformation and corrupted by independent Gaussian noise $e$. Popular examples include non-parametric regression with Gaussian processes (GPs, Rasmussen and Williams, 2006) and modelling spatial phenomena using Gaussian Markov random fields (GMRFs, Rue and Held, 2005; Lindgren et al., 2011). GPs are often based on full covariance matrices, meaning none of the variables are marginally independent. In contrast, GMRFs directly encode conditional independence relationships via zero elements in the precision matrix, which as a result is often sparse.
We focus on Bayesian posterior inference for linear Gaussian models. It is possible to obtain closed-form expressions for key quantities such as the marginal likelihood and posterior predictions. However, computing these quantities in practice is difficult for large problems. For $n$ modelled variables, an evaluation of the (normalised or unnormalised) probability density function costs $\mathcal{O}(n^3)$ time and $\mathcal{O}(n^2)$ memory. These costs arise from the solution of a linear system and the evaluation of a matrix determinant. There is a large literature on computationally motivated approximations to the model itself, for example via subsampling or low-rank approximations (Quiñonero-Candela and Rasmussen, 2005); however, exact Bayesian inference for large models remains a challenging problem.
We can compute some properties of models with sparse covariance or precision matrices using two families of numerical linear algebra methods. These methods exploit the fact that large sparse matrices can be stored, and their products with vectors can be computed. Krylov methods, such as the conjugate gradient linear solver and multi-shift variants (Simpson, 2008; Jegerlehner, 1996), can solve linear systems with just matrix-vector products. Rational approximations reduce the computation of matrix functions, such as the determinant, inverse, or square root, to solving a family of linear equations, with error guarantees that are straightforward to control and can even be set to floating point precision (Higham and Lin, 2013; Kennedy, 2004). A combination of these methods has been successfully applied to sample from high-dimensional Gaussian distributions (Aune et al., 2013), and to perform maximum likelihood inference by estimating the log-determinant term (Aune et al., 2014). A number of numerical approximations are available for estimating the log-determinant term, for example those discussed by Bai et al. (1996), Han et al. (2015), and Saibaba et al. (2016). Recently, an inversion-free approach for inferring point estimates of covariance parameters of large-scale Gaussian distributions was proposed by Anitescu et al. (2017). The approach is based on an objective function different from the maximum likelihood, yet asymptotically converges to the same solution.
Bayesian inference of the covariance parameters, however, remains largely an open problem. Markov chain Monte Carlo (MCMC) methods are ‘asymptotically exact’, but most previous work has limited scalability, as the methods need to compute determinants (e.g., Murray and Adams, 2010). Lyne et al. (2015) realised that the log-determinant estimator by Aune et al. (2014) is unbiased, and can therefore be combined with the ‘exact-approximate’ framework within a pseudo-marginal MCMC algorithm (Andrieu and Roberts, 2009), as also used by Filippone and Girolami (2014). A complication is the need to transform an unbiased estimator of the log-determinant into an unbiased estimator of the determinant itself, which Lyne et al. (2015) achieved via random truncation (Russian roulette) of the infinite series expression of the exponential function. Apart from theoretical issues around the existence of positive unbiased estimators for Russian roulette (Jacob et al., 2015), for a real-world model of global ozone distributions (Lindgren et al., 2011) the combination of Russian roulette and pseudo-marginal MCMC turned out to be too fragile, and the resulting Markov chain in practice did not converge reliably.
In this paper we introduce an alternative MCMC approach to perform asymptotically exact inference on large-scale Gaussian models. We introduce auxiliary variables so that the joint distribution of the model and auxiliary variables contains no determinants. In return for removing determinants we need to update the auxiliary variables, which can be performed with the application of inverse-matrix-square-roots. When the inverse-matrix-square-root of the covariance is not available in closed form, we use rational approximations in the spirit of Aune et al. (2013) to perform tuning-free updates. Our scheme is considerably simpler than the approach taken by Lyne et al. (2015), and by avoiding the pseudo-marginal framework there is no need to tune internal unbiased estimators. Our approach scales well to high-dimensional models, provided that the application of an inverse-matrix-square-root can be carried out reliably. For a poorly conditioned model, the underlying Krylov method may converge slowly, or not at all to within the desired tolerance.
In the remainder of the paper, we give details on the linear Gaussian model itself and the computational challenges that arise when using MCMC for posterior inference. We introduce our auxiliary model, which avoids the need to compute matrix determinants, and describe a sampling scheme for the resulting joint distribution, including some necessary background on rational approximations. Finally, we empirically study the method on toy and real-world examples. For models with well-behaved covariance or precision matrices, our determinant-free method can outperform MCMC using standard Cholesky factorisations.
2 Background
We consider linear models of the form
$$y = A x + e. \qquad (1)$$
Here, observations $y$ are a linear transformation $A$ of latent Gaussian variables,
$$x \mid \theta \sim \mathcal{N}(0, \Sigma_\theta), \qquad (2)$$
with independent Gaussian additive noise $e \sim \mathcal{N}(0, \sigma^2 I)$ (for simplicity we include $\sigma$ in the parameters $\theta$, although $\sigma$ does not affect the covariance of the latents $x$) and covariance matrix $\Sigma_\theta$. Since the model is linear and Gaussian, the marginal likelihood, $p(y \mid \theta)$, is also Gaussian, i.e.
$$p(y \mid \theta) = \mathcal{N}(y; 0, S_\theta), \qquad (3)$$
with covariance $S_\theta = A \Sigma_\theta A^\top + \sigma^2 I$.
Two common examples with this setup are finite-dimensional realisations of Gaussian process (GP) and Gaussian Markov random field (GMRF) models, often with dense covariance and sparse precision matrices, respectively.
For these models to be of any practical use, it is necessary to determine suitable parameter values $\theta$. In a Bayesian setting, we define a prior density $p(\theta)$ and combine it with the likelihood in (3). Bayes’ theorem induces the posterior
$$p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta), \qquad (4)$$
which is intractable for most non-trivial applications. While there are many approximate inference schemes to explore the posterior (4), e.g. variational methods, we here focus on simulation via Markov chain Monte Carlo (MCMC), which has the advantage of being asymptotically exact. MCMC methods generate samples that represent the posterior, which can be used to estimate the posterior expectation of arbitrary functions, such as the posterior mean. In this paper we use the standard Metropolis–Hastings (MH) algorithm.
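As a point of reference for the updates used throughout, here is a minimal random-walk Metropolis–Hastings sketch; the target density, step size, and chain length are illustrative choices, not those used in the paper's experiments:

```python
import numpy as np

def metropolis_hastings(log_target, theta0, n_iters, step, rng):
    """Random-walk Metropolis-Hastings targeting exp(log_target)."""
    theta = np.asarray(theta0, dtype=float)
    log_p = log_target(theta)
    samples = np.empty((n_iters,) + theta.shape)
    n_accept = 0
    for i in range(n_iters):
        proposal = theta + step * rng.standard_normal(theta.shape)
        log_p_prop = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(theta)).
        if np.log(rng.uniform()) < log_p_prop - log_p:
            theta, log_p = proposal, log_p_prop
            n_accept += 1
        samples[i] = theta
    return samples, n_accept / n_iters

# Illustration: a one-dimensional standard normal target.
rng = np.random.default_rng(0)
samples, acc = metropolis_hastings(lambda t: -0.5 * t @ t,
                                   np.zeros(1), 20000, 2.4, rng)
```

The same accept/reject mechanism is reused later for the model parameters; only the (un-normalised) target density changes.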
2.1 Determinants in high dimensions
To evaluate the likelihood in (3), the standard approach is to compute the Cholesky factorisation $S_\theta = R^\top R$, where $R$ is upper triangular. This factorisation allows any linear system of the form $S_\theta v = b$ to be solved via cheap back-substitution, and provides the log-determinant as $\log |S_\theta| = 2 \sum_{i} \log R_{ii}$.
For sparse high-dimensional matrices $S_\theta$, however, this approach is unsuitable, as even storing the Cholesky factorisation is infeasible. Cholesky factors suffer from a so-called fill-in effect and are generally not sparse, so standard computers struggle to hold them in memory beyond moderate dimensions. Examples of large sparse covariance matrices are those of Gaussian processes with compactly supported covariance functions. Cholesky factorisations of the sparse precision matrices arising in GMRF models can also be used, although these too suffer from fill-in at scale.
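The standard dense-Cholesky evaluation can be sketched as follows; this is a generic implementation of the likelihood computation described above, with placeholder matrices:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gaussian_loglik(y, S):
    """log N(y; 0, S) via a Cholesky factorisation of the covariance S."""
    R, lower = cho_factor(S)                    # triangular factor of S
    alpha = cho_solve((R, lower), y)            # solves S alpha = y by back-substitution
    log_det = 2.0 * np.sum(np.log(np.diag(R)))  # log|S| from the factor's diagonal
    return -0.5 * (y @ alpha + log_det + y.size * np.log(2.0 * np.pi))
```

This is the routine that becomes infeasible at large $n$: the factorisation costs cubic time, and the (generally dense) factor must be stored.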
3 Methodology
We augment the state space of the linear Gaussian model by introducing auxiliary variables
$$z \mid \theta \sim \mathcal{N}\!\left(0, S_\theta^{-1}\right). \qquad (5)$$
The resulting joint posterior is then given by the product of the posterior in (4) and the distribution over the new auxiliary variables,
$$p(\theta, z \mid y) \propto p(y \mid \theta)\, p(\theta)\, \mathcal{N}\!\left(z; 0, S_\theta^{-1}\right). \qquad (6)$$
The normalising terms, i.e. determinants, of the original model and the augmented variables cancel, since $|S_\theta|^{-1/2} \cdot |S_\theta^{-1}|^{-1/2} = 1$, and (6) is left with only the quadratic forms for the latent and auxiliary Gaussian variables,
$$p(\theta, z \mid y) \propto p(\theta)\, \exp\!\left(-\tfrac{1}{2}\, y^\top S_\theta^{-1} y - \tfrac{1}{2}\, z^\top S_\theta z\right).$$
The original posterior in (4) is restored by marginalising over $z$, i.e. $p(\theta \mid y) = \int p(\theta, z \mid y)\, dz$.
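The cancellation can be checked numerically on a small dense example; the matrix below is an arbitrary stand-in for the marginal covariance $S_\theta$:

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(1)
n = 5
B = rng.standard_normal((n, n))
S = B @ B.T + n * np.eye(n)          # stand-in for the marginal covariance
S_inv = np.linalg.inv(S)
y = rng.standard_normal(n)           # "observations"
z = rng.standard_normal(n)           # auxiliary variables

# Log joint of the two Gaussian factors: N(y; 0, S) * N(z; 0, S^{-1}).
log_joint = (mvn(mean=np.zeros(n), cov=S).logpdf(y)
             + mvn(mean=np.zeros(n), cov=S_inv).logpdf(z))

# Determinant-free form: log|S| and log|S^{-1}| cancel, leaving quadratic forms.
det_free = -n * np.log(2.0 * np.pi) - 0.5 * (y @ S_inv @ y + z @ S @ z)
```

The two expressions agree exactly (up to floating point), even though the second never touches a determinant.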
In other words: in order to obtain posterior samples, we can sample from the augmented distribution and subsequently discard the auxiliary variables. Our MCMC scheme alternates between two updates:

1. Update the auxiliary variables $z$, keeping $\theta$ fixed, using an MCMC method for the target density (5).

2. Update the parameters $\theta$, keeping $z$ fixed, using an MCMC method with target proportional to (6).

We next specify how we implement these updates.
3.1 Model parameter updates
The model parameters $\theta$ can be updated with standard Metropolis–Hastings (MH) updates. We use a preliminary run to estimate the covariance of the posterior $p(\theta \mid y)$. In our experiments we use a random walk proposal that is tuned to achieve an acceptance rate in the standard recommended range (Rosenthal et al., 2011).
3.2 Auxiliary variable updates
The auxiliary variables $z$ could potentially be updated with MH or Hamiltonian Monte Carlo methods. However, tuning these updates would be challenging. Instead, we perform Gibbs updates, resampling the auxiliary variables directly from their conditional distribution (5). When the inverse-matrix-square-root of the covariance is not available in closed form, we cannot use a Cholesky factorisation to perform this update for large-scale problems. Instead, we employ a technique from numerical linear algebra: a combination of rational approximations and Krylov subspace methods (Aune et al., 2014; Kennedy, 2004; Higham and Lin, 2013; Jegerlehner, 1996). These methods, outlined in the next section, compute the product of the inverse-square-root of a matrix with arbitrary vectors, requiring only matrix-vector products involving the matrix itself. In particular, we can sample from high-dimensional Gaussian distributions as long as we can quickly apply the covariance or precision matrix to a vector, and the matrix is reasonably well-conditioned. We now distinguish the cases where we have direct access either to the sparse latent covariance matrix $\Sigma_\theta$ from equation (2), or to the precision $Q_\theta = \Sigma_\theta^{-1}$.
Sparse covariance
For the case where we have direct access to a sparse latent covariance matrix $\Sigma_\theta$, we directly compute $z = S_\theta^{-1/2} r$ for a standard normal $r$, i.e. we generate a sample with the desired auxiliary covariance $S_\theta^{-1}$ using only matrix-vector products of the form $S_\theta v$:
$$z = S_\theta^{-1/2} r, \qquad r \sim \mathcal{N}(0, I). \qquad (7)$$
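On a small dense example (with a placeholder linear map $A$ and noise level), one can verify that $z = S^{-1/2} r$ has the required covariance $S^{-1}$; at scale, the dense `sqrtm` reference below is replaced by the rational approximation of Section 3.3:

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(2)
m, n = 6, 4
A = rng.standard_normal((m, n))      # placeholder linear map
Sigma = np.eye(n)                    # placeholder latent covariance
S = A @ Sigma @ A.T + 0.5 * np.eye(m)

inv_sqrt = np.real(sqrtm(np.linalg.inv(S)))  # dense reference for S^{-1/2}
r = rng.standard_normal(m)
z = inv_sqrt @ r                             # one auxiliary draw

# The covariance of z is S^{-1/2} I S^{-T/2} = S^{-1}, as S^{-1/2} is symmetric.
cov_z = inv_sqrt @ inv_sqrt.T
```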
Sparse precision
For the case where we know a sparse precision $Q_\theta = \Sigma_\theta^{-1}$, we create a set of ‘fantasy observations’ $\tilde{y}$ from the model. For that, we first sample the latent variables from (2) with covariance $\Sigma_\theta = Q_\theta^{-1}$ by computing $x = Q_\theta^{-1/2} r$, using matrix-vector products of the form $Q_\theta v$, where $r$ is standard Gaussian. We then generate $\tilde{y} = A x + e$ with $e \sim \mathcal{N}(0, \sigma^2 I)$. Finally, we pre-multiply $\tilde{y}$, which has covariance $S_\theta$, with $S_\theta^{-1}$, to get
$$z = S_\theta^{-1} \tilde{y},$$
which has the desired auxiliary variable covariance $S_\theta^{-1} S_\theta S_\theta^{-1} = S_\theta^{-1}$. In practice, we employ the matrix inversion lemma:
$$S_\theta^{-1} = \sigma^{-2} I - \sigma^{-4} A \left(Q_\theta + \sigma^{-2} A^\top A\right)^{-1} A^\top,$$
which only requires a single additional linear solve, in terms of matrix-vector products involving the known $Q_\theta$. In summary, given a sparse precision, we compute
$$z = S_\theta^{-1} \left(A Q_\theta^{-1/2} r + e\right). \qquad (8)$$
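The matrix inversion lemma used here can be verified on a small dense example; the dimensions and noise variance are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 6, 4
A = rng.standard_normal((m, n))
G = rng.standard_normal((n, n))
Q = G @ G.T + n * np.eye(n)          # stands in for a sparse precision
s2 = 0.7                             # noise variance sigma^2

# Marginal covariance S = A Q^{-1} A^T + sigma^2 I ...
S = A @ np.linalg.solve(Q, A.T) + s2 * np.eye(m)
# ... and its inverse via the lemma: one solve against Q + A^T A / sigma^2.
S_inv_lemma = (np.eye(m) / s2
               - A @ np.linalg.solve(Q + (A.T @ A) / s2, A.T) / s2**2)
```

At scale the inner solve is performed iteratively with matrix-vector products, keeping everything sparse.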
3.3 Rational approximations and Krylov methods
We now briefly review the methodology required to apply the inverse-matrix-square-roots and solve the large sparse linear systems appearing in (7) and (8). Crucially, this is done without ever storing dense matrices, and only requires computing matrix-vector products $S_\theta v$ and $Q_\theta v$ respectively (we assume either the ability to compute $S_\theta v$ or that a sparse $Q_\theta$ is given). We mainly follow the approach taken by Aune et al. (2013). Rational approximations are used to construct matrix-vector products with matrix-inverse-square-roots, e.g. $S_\theta^{-1/2} r$, by solving a family of sparse linear systems. Krylov subspace methods may be used to solve these linear systems using only matrix-vector products.
The key identity, the Cauchy integral formula (Kennedy, 2004; Clark, 2006), relates a square matrix $B$ and a function $f$ that is analytic on and inside a closed contour $\Gamma$ enclosing the eigenvalues of $B$ as
$$f(B) = \frac{1}{2\pi i} \oint_\Gamma f(t)\, (t I - B)^{-1}\, dt.$$
Right-multiplying with a vector $v$ and applying a quadrature of the integral yields
$$f(B)\, v \approx \sum_{j=1}^{N} w_j \left(B - \sigma_j I\right)^{-1} v, \qquad (9)$$
where we have to pick integration weights $w_j$ and shifts $\sigma_j$ in order to ensure convergence. For positive definite $B$, $f(t) = t^{-1/2}$, and standard normal $v$, equation (9) can be used to sample from a multivariate Gaussian with covariance $B^{-1}$.
Integration weights and shifts in (9) can be efficiently computed from bounds on the spectrum of $B$, i.e. its smallest and largest eigenvalues $\lambda_{\min}$ and $\lambda_{\max}$, in a standard way, e.g. Aune et al. (2014). To our knowledge, this is the best known rational approximation (Hale et al., 2008). Crucially, the approximation error decays exponentially in the number of quadrature points $N$, at a rate that degrades only logarithmically with the condition number $\lambda_{\max}/\lambda_{\min}$, so we can choose $N$ to reach a desired accuracy. In practice, the contour should be chosen to enclose the smallest region that contains the spectral range of $B$ for optimal efficiency. We will see in the experiments that only a modest number of quadrature points is needed to reach floating point precision. Due to space constraints, we refer to the literature for further details.
Sparse (shifted family) linear systems
The conjugate gradient algorithm can solve the linear systems required in (9), using only sparse matrix-vector products. Conjugate gradient is guaranteed (in exact arithmetic) to converge after $n$ iterations for an $n$-dimensional system, where each matrix multiplication costs time proportional to the number of non-zero entries. Depending on the condition number of the underlying matrix, however, convergence up to a reasonable tolerance can happen in far fewer than $n$ iterations, and a preconditioning method can improve convergence rates (Benzi, 2002; Chow and Saad, 2014). The linear systems in (9) exhibit a special structure: only the diagonal term differs, arising from the various shifts $\sigma_j$. Shifted-family Krylov methods can solve these systems simultaneously at the cost of a single solve of an unshifted system (Freund, 1993; Clark, 2006; Aune et al., 2013; Jegerlehner, 1996). Alternatively, depending on the conditioning of the problem, it might be preferable to solve all systems separately, each with a different preconditioning approach (Simpson, 2008).
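The following sketch approximates $B^{-1/2} v$ as a weighted sum of shifted solves, in the spirit of (9). For simplicity it uses a basic exact integral representation with Gauss–Legendre quadrature rather than the optimal conformal-map quadrature of Hale et al. (2008), and direct sparse solves in place of a shifted-family Krylov solver; the scale parameter `lam` is a tuning choice:

```python
import numpy as np
from numpy.polynomial.legendre import leggauss
import scipy.sparse as sp
import scipy.sparse.linalg as spla
from scipy.linalg import sqrtm

def inv_sqrt_matvec(B, v, lam, n_quad=40):
    """Approximate B^{-1/2} v as a weighted sum of shifted linear solves.

    Based on the exact representation
        B^{-1/2} = (2 sqrt(lam) / pi) * int_0^{pi/2}
                   (cos(t)^2 B + lam sin(t)^2 I)^{-1} dt,
    valid for symmetric positive definite B and any lam > 0; lam is best
    placed inside the spectrum of B. Each quadrature node costs one
    shifted sparse solve, done directly here for simplicity.
    """
    nodes, weights = leggauss(n_quad)        # Gauss-Legendre nodes on [-1, 1]
    t = 0.25 * np.pi * (nodes + 1.0)         # map nodes to [0, pi/2]
    w = 0.25 * np.pi * weights
    I = sp.identity(B.shape[0], format="csc")
    out = np.zeros(B.shape[0])
    for ti, wi in zip(t, w):
        shifted = sp.csc_matrix(np.cos(ti) ** 2 * B + lam * np.sin(ti) ** 2 * I)
        out += wi * spla.spsolve(shifted, v)
    return (2.0 * np.sqrt(lam) / np.pi) * out
```

For a well-conditioned matrix, a few dozen quadrature nodes already reach solver precision; for scalar $a$ the representation integrates exactly to $a^{-1/2}$.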
4 Experiments
In this section we apply our methodology to selected linear Gaussian models; code used for the results in this section is available at https://github.com/lellam/det_free_method. We begin with a simple toy model with a random precision matrix. Our proposed methodology runs faster than a Cholesky-based approach for large models, and scales to model sizes where Cholesky factors are not practical at all due to their memory costs.
4.1 Random pattern precision matrices
Consider a zero-mean Gaussian model with a random precision matrix $Q_\theta$ depending on a scalar parameter $\theta$, where $Q_\theta$ is a random, symmetric and positive definite matrix, generated by applying a random Jacobi rotation to a positive definite diagonal matrix. By adding a constant to the diagonal, the precision is made diagonally dominant, and therefore its condition number is bounded. Each random precision matrix is sparse; an example sparsity pattern is illustrated in Figure 1. We generate the data for this example using a fixed, known value of $\theta$.
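A construction along these lines can be sketched as follows (dense for clarity; the sizes, rotation count, and diagonal ranges are illustrative assumptions):

```python
import numpy as np

def random_precision(n, n_rotations, rng):
    """Random symmetric positive definite, diagonally dominant matrix:
    rotate a positive diagonal matrix by random Givens (Jacobi) rotations,
    then boost the diagonal to enforce strict diagonal dominance."""
    Q = np.diag(rng.uniform(1.0, 2.0, size=n))
    for _ in range(n_rotations):
        i, j = rng.choice(n, size=2, replace=False)
        angle = rng.uniform(0.0, 2.0 * np.pi)
        c, s = np.cos(angle), np.sin(angle)
        G = np.eye(n)
        G[i, i] = G[j, j] = c
        G[i, j], G[j, i] = s, -s
        Q = G @ Q @ G.T              # orthogonal similarity: keeps symmetry and spectrum
    # Adding the absolute row sums to the diagonal makes Q strictly
    # diagonally dominant, bounding its condition number (Gershgorin).
    Q += np.diag(np.abs(Q).sum(axis=1))
    return Q
```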
To perform inference, we use a log-uniform prior on $\theta$. We run chains of length 10,000 for each example and tune the random walk proposal to obtain a standard acceptance rate. We initialise the chain near the true value for this example. We use a fixed number of terms in the rational approximation and run the shifted-family conjugate gradient solver (Simpson, 2008; Jegerlehner, 1996) to solve all linear systems in (9) in a single run. The covariance matrices in this example are well-conditioned, so the conjugate gradient method converges rapidly without the help of a preconditioner.
We compare our proposed determinant-free method to a standard random walk MH sampler in which we evaluate the model likelihood using a Cholesky factorisation. As the table in Figure 1 shows, the Cholesky-based approach produces more independent samples for a fixed number of MCMC iterations, as measured by effective sample size (ESS); dependencies between the model parameters and auxiliary variables are expected to cause the Markov chain to mix more slowly. However, computing Cholesky factorisations requires more time per iteration and scales worse with problem size, which means that our scheme produces more effective samples per unit time. For the largest problem sizes, a standard laptop does not have enough memory to run the Cholesky-based scheme at all, while our method scales to much larger problems.
Matrix size      (four cases, smallest to largest)

Standard MCMC
  ESS            1768     1529     NA       NA
  ESS/time       6.9009   0.0362   NA       NA

Our method
  ESS            883      543      613      1271
  ESS/time       7.6241   0.9460   0.3159   0.0558
4.2 Sparse covariances for spatial modelling of antisocial crime data
We begin by applying our method to model the spatial distribution of anti-social criminal activity in London, using count data obtained from the UK government website. Log-crime-rates were calculated for each Lower Layer Super Output Area (LSOA), defined as the number of crimes divided by the LSOA population, and each crime-rate was assigned to the central location of the LSOA. After preprocessing, our dataset consists of 4,826 log-crime-rates, one for each region. LSOA data were obtained from the Office for National Statistics. Figure 2 (top) illustrates the dataset.
Parameter        (three covariance and noise parameters)

Std. MCMC
  Mean           0.13    0.06    2.11
  SD             1.60    0.57    4.91
  ESS            868     930     956
  ESS/time       5.48    5.87    6.03
  Time           1.59    1.59    1.59

Det-free method
  Mean           0.16    0.06    2.10
  SD             2.33    0.54    4.61
  ESS            213     242     340
  ESS/time       1.72    1.96    2.75
  Time           1.24    1.24    1.24
The classical geostatistical model (Gelfand et al., 2010) decomposes observations $y_i$ (here one-dimensional) of a spatial stochastic process at locations $s_i$ as $y_i = m(s_i) + f(s_i) + \varepsilon_i$, where $m$ is a deterministic mean function, $f$ is a continuous zero-mean stochastic process and $\varepsilon_i$ is white noise.
Our choice of mean function is a constant plus four radial basis functions, where the constant represents the background crime-rate and the radial basis functions capture the general trend of an increase in crime in built-up areas. The five coefficients are held fixed at their maximum-likelihood values, after fitting a linear regression model. The radial basis functions are centred at locations found using the k-means algorithm, and the scaling is set equal to the smallest pairwise distance between the centres. Centring the GP around the linear combination of basis functions focuses the model’s attention on deviations from the general trend. The stochastic process $f$ is expected to capture localised crime ‘hotspots’. These tend to only influence surrounding neighbourhoods, a phenomenon that can be appropriately modelled with a compactly supported Wendland kernel (Rasmussen and Williams, 2006), which yields a sparse covariance matrix.
We use independent, weakly informative log-normal priors on the covariance and noise parameters. The posterior over all unknown parameters then follows Bayes’ theorem as in (4).
We explore the posterior using a standard random walk MH sampler, both with determinant computations and with our proposed augmented scheme. We run chains of length 10,000 with a burn-in of 3,000. The results and predictions are summarised in Figure 2. Due to the relatively small size of the dataset, the standard MCMC approach here produces slightly more independent samples per unit time: while our method runs significantly faster, it mixes more slowly due to the augmented sampling space.
We expect that our determinant-free method will provide more effective samples per unit time on larger systems with a greater spatial extent. To test this idea, we generated a larger set of data locations from the uniform distribution over the spatial domain and sampled observations from the predictive distribution with the covariance parameters set to their posterior means. We ran the above procedure and report the results for a representative model parameter: our method produced an order of magnitude more effective samples per unit time than standard MCMC.
4.3 Gaussian Markov random field models specified by a whitening matrix
We now present a case where the model’s precision matrix is known and is specified by a whitening matrix $L_\theta$:
$$L_\theta\, x = w, \qquad w \sim \mathcal{N}(0, I), \qquad (10)$$
$$x \mid \theta \sim \mathcal{N}\!\left(0, (L_\theta^\top L_\theta)^{-1}\right). \qquad (11)$$
This general setting is applicable to several models of interest, such as the stochastic partial differential equation models discussed in Kaipio and Somersalo (2006) and Lindgren et al. (2011). In this case, the GMRF has sparse precision $Q_\theta = L_\theta^\top L_\theta$, and the latent process in (2) has $\Sigma_\theta = Q_\theta^{-1}$. We may simulate from the centred latent process by drawing white noise $w$ and solving the sparse linear system in (10); the Krylov methods that raise concerns over conditioning are not required in this case. The covariance of the marginal likelihood in (3) is $S_\theta = A Q_\theta^{-1} A^\top + \sigma^2 I$, which is not necessarily sparse. To evaluate the marginal likelihood without having to work with non-sparse matrices, we can use the matrix inversion and determinant lemmas:
$$S_\theta^{-1} = \sigma^{-2} I - \sigma^{-4} A \left(Q_\theta + \sigma^{-2} A^\top A\right)^{-1} A^\top, \qquad \log |S_\theta| = \log\left|Q_\theta + \sigma^{-2} A^\top A\right| - \log |Q_\theta| + m \log \sigma^2. \qquad (12)$$
For moderate-scale models, the log-determinants in (12) can be evaluated with two Cholesky decompositions.
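The determinant identity in (12) can be checked on a small dense example; this is the two-Cholesky evaluation of $\log|S_\theta|$ with placeholder matrices:

```python
import numpy as np
from scipy.linalg import cho_factor

def logdet_chol(M):
    """log|M| from the diagonal of a Cholesky factor of M."""
    R, _ = cho_factor(M)
    return 2.0 * np.sum(np.log(np.diag(R)))

rng = np.random.default_rng(4)
m, n = 8, 5
A = rng.standard_normal((m, n))      # interpolation matrix stand-in
G = rng.standard_normal((n, n))
Q = G @ G.T + n * np.eye(n)          # precision stand-in (sparse at scale)
s2 = 0.3                             # noise variance sigma^2

# Dense reference: log|A Q^{-1} A^T + sigma^2 I|.
direct = np.linalg.slogdet(A @ np.linalg.solve(Q, A.T) + s2 * np.eye(m))[1]
# Determinant lemma: two log-determinants of sparse-structured matrices.
lemma = logdet_chol(Q + (A.T @ A) / s2) - logdet_chol(Q) + m * np.log(s2)
```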
In this example, we specify the precision via a smooth latent process, using a whitening matrix of shifted-Laplacian form with two scalar parameters, where the Laplacian is a discrete approximation with Dirichlet boundary conditions generated from the standard five-point stencil. We define the latent process over a uniform quadrilateral mesh and consider two cases with increasing numbers of nodes. We generate synthetic datasets at locations drawn uniformly over the domain. The matrix $A$ performs linear interpolation to estimate the latent process at the observation locations. We generate the true latent process using fixed, known parameter values in both cases; the latent process for the larger case is presented in Figure 3. We perform inference on the marginal posterior of the two parameters, using log-uniform priors for each.
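A whitening matrix of this kind can be assembled with sparse Kronecker products; the shifted-Laplacian parameterisation below (with hypothetical parameters `kappa` and `tau`) is an assumed illustrative form, not necessarily the exact one used in the experiment:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def laplacian_2d(k):
    """Five-point-stencil discrete Laplacian on a k x k interior grid with
    Dirichlet boundary conditions (unit grid spacing)."""
    T = sp.diags([[1.0] * (k - 1), [-2.0] * k, [1.0] * (k - 1)], [-1, 0, 1])
    return sp.kron(sp.identity(k), T) + sp.kron(T, sp.identity(k))

kappa, tau = 2.0, 1.0                # assumed shift and scale parameters
k = 16
L = (kappa * sp.identity(k * k) - tau * laplacian_2d(k)).tocsc()

# One latent draw: solve L x = w for white noise w, so that the process has
# sparse precision Q = L^T L without ever forming a dense covariance.
rng = np.random.default_rng(5)
w = rng.standard_normal(k * k)
x = spla.spsolve(L, w)
Q = (L.T @ L).tocsc()
```

Since the negated Laplacian is positive semi-definite, $L$ is symmetric positive definite for any positive shift, and the precision $Q$ stays sparse.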
Our preliminary runs revealed that the precision is poorly conditioned, with a condition number that is expected to increase with the mesh size. Unlike for models specified by covariance, Krylov methods must be applied directly to the latent process, and the observation noise does not help with conditioning as it does for standard GP models. As a result, Krylov methods are expected to require a large number of iterations. In this setting, we instead make use of the sparse linear system in (10), which can be solved using a banded linear system solver.
We again explore the posterior using a standard random walk MH sampler: first with determinant computations, using two Cholesky decompositions per iteration; and then with our proposed augmented scheme. We run chains initialised close to the true parameter values. The results in Figure 3 show that, for both examples, our method outperforms the standard MCMC approach in cost per iteration. For large-scale problems, our method can outperform standard MCMC in ESS/time, despite the slower mixing caused by adding a larger number of auxiliary variables to the model.
Grid             (two cases, smaller and larger; two parameters each)

Std MCMC
  Mean           1.99    1.05    1.98    0.99
  SD             0.008   0.050   0.011   0.037
  ESS            1078    1373    796     1567
  ESS/time       0.387   0.493   0.053   0.105

Det-free MCMC
  Mean           1.99    1.05    1.98    0.99
  SD             0.012   0.068   0.015   0.053
  ESS            512     636     414     786
  ESS/time       0.238   0.295   0.067   0.128
5 Discussion
We have shown how to introduce auxiliary random variables to replace the determinant computation arising in Gaussian models whose covariance is specified by unknown model parameters. The Markov chains for our method mix more slowly per iteration than a Cholesky-based approach, due to the additional auxiliary variables, but each iteration can be much cheaper. Our method can be fast because it exploits fast matrix-vector operations, such as for models specified by sparse matrices, and it never creates dense matrices. These properties are particularly beneficial when the Cholesky decomposition is prohibitively expensive due to fill-in, or where the Cholesky decomposition is otherwise expensive to obtain.
In practice, latent functions require a degree of smoothness, which forces the smallest eigenvalue in a large system close to zero and can result in a poorly conditioned system. The irony is that a poorly conditioned system is almost low-rank, and low-rank systems are more constrained than an arbitrary process, so they should be less expensive to work with. In our crime example we exploited the low-rank structure and worked with a reasonably well-conditioned covariance matrix. However, Krylov methods do not work as well with the extremely poorly conditioned precision matrices found in some GMRFs. In future work we will explore ways to exploit low-rank structure more generally, so that our auxiliary variable scheme can focus on exploring the degrees of freedom of the process that have significant posterior uncertainty. We are also exploring applications of the methodology to inverse problems arising in partial differential equation models.
Acknowledgements
L.E. was supported by the EPSRC. H.S. was supported by the Gatsby Charitable Foundation. M.G. was supported by EPSRC [EP/J016934/1, EP/K034154/1, EP/P020720/1] and an EPSRC Established Career Fellowship. This work was conducted when L.E. and I.M. were visiting the Alan Turing Institute, London.
References
 Andrieu and Roberts (2009) C. Andrieu and G.O. Roberts. The pseudomarginal approach for efficient Monte Carlo computations. The Annals of Statistics, 37(2):697–725, 2009.
 Anitescu et al. (2017) Mihai Anitescu, Jie Chen, and Michael L Stein. An inversion-free estimating equations approach for Gaussian process models. Journal of Computational and Graphical Statistics, 26(1):98–107, 2017.
 Aune et al. (2013) Erlend Aune, Jo Eidsvik, and Yvo Pokern. Iterative numerical methods for sampling from high dimensional Gaussian distributions. Statistics and Computing, 23(4):501–521, 2013.
 Aune et al. (2014) Erlend Aune, Daniel P Simpson, and Jo Eidsvik. Parameter estimation in high dimensional Gaussian distributions. Statistics and Computing, 24(2):247–263, 2014.
 Bai et al. (1996) Zhaojun Bai, Mark Fahey, and Gene Golub. Some large-scale matrix computation problems. Journal of Computational and Applied Mathematics, 74(1):71–89, 1996.
 Benzi (2002) Michele Benzi. Preconditioning techniques for large linear systems: a survey. Journal of Computational Physics, 182(2):418–477, 2002.
 Chow and Saad (2014) Edmond Chow and Yousef Saad. Preconditioned Krylov subspace methods for sampling multivariate Gaussian distributions. SIAM Journal on Scientific Computing, 36(2):A588–A608, 2014.
 Clark (2006) Michael Andrew Clark. The rational hybrid Monte Carlo algorithm. In Proceedings from Lattice, 2006. arXiv:heplat/0610048.
 Filippone and Girolami (2014) Maurizio Filippone and Mark Girolami. Pseudomarginal Bayesian inference for Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(11):2214–2226, 2014.
 Freund (1993) Roland W Freund. Solution of shifted linear systems by quasiminimal residual iterations. Numerical Linear Algebra, pages 101–121, 1993.
 Gelfand et al. (2010) Alan E Gelfand, Peter Diggle, Peter Guttorp, and Montserrat Fuentes. Handbook of Spatial Statistics. CRC press, 2010.
 Hale et al. (2008) Nicholas Hale, Nicholas J Higham, and Lloyd N Trefethen. Computing $A^\alpha$, $\log(A)$, and related matrix functions by contour integrals. SIAM Journal on Numerical Analysis, 46(5):2505–2523, 2008.
 Han et al. (2015) Insu Han, Dmitry Malioutov, and Jinwoo Shin. Largescale logdeterminant computation through stochastic Chebyshev expansions. arXiv preprint arXiv:1503.06394, 2015.
 Higham and Lin (2013) Nicholas J Higham and Lijing Lin. Matrix functions: A short course. Matrix Functions and Matrix Equations, 19:1–27, 2013.
 Jacob et al. (2015) Pierre E Jacob, Alexandre H Thiery, et al. On nonnegative unbiased estimators. The Annals of Statistics, 43(2):769–784, 2015.
 Jegerlehner (1996) Beat Jegerlehner. Krylov space solvers for shifted linear systems, 1996. arXiv:heplat/9612014.
 Kaipio and Somersalo (2006) Jari Kaipio and Erkki Somersalo. Statistical and computational inverse problems, volume 160. Springer Science & Business Media, 2006.
 Kennedy (2004) AD Kennedy. Approximation theory for matrices. Nuclear Physics B - Proceedings Supplements, 128:107–116, 2004.
 Lindgren et al. (2011) Finn Lindgren, Håvard Rue, and Johan Lindström. An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(4):423–498, 2011.
 Lyne et al. (2015) A.M. Lyne, M. Girolami, Y. Atchade, H. Strathmann, and D. Simpson. On Russian roulette estimates for Bayesian inference with doublyintractable likelihoods. Statistical Science, 30(4):443–467, 2015.
 Macke et al. (2011) Jakob H Macke, Lars Buesing, John P Cunningham, M Yu Byron, Krishna V Shenoy, and Maneesh Sahani. Empirical models of spiking in neural populations. In Advances in Neural Information Processing Systems, volume 24, pages 1350–1358, 2011.
 Murray and Adams (2010) Iain Murray and Ryan Prescott Adams. Slice sampling covariance hyperparameters of latent Gaussian models. In Advances in Neural Information Processing Systems, volume 23, pages 1723–1731, 2010.
 Quiñonero-Candela and Rasmussen (2005) Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939–1959, 2005.
 Rasmussen and Williams (2006) Carl Edward Rasmussen and Chris Williams. Gaussian processes for machine learning. MIT Press, 2006.
 Rosenthal et al. (2011) Jeffrey S Rosenthal et al. Optimal proposal distributions and adaptive MCMC. Handbook of Markov Chain Monte Carlo, pages 93–112, 2011.
 Rosti (2004) Antti-Veikko Ilmari Rosti. Linear Gaussian models for speech recognition. PhD thesis, University of Cambridge, 2004.
 Roweis and Ghahramani (1999) Sam Roweis and Zoubin Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 11(2):305–345, 1999.
 Rue and Held (2005) Håvard Rue and Leonhard Held. Gaussian Markov random fields: theory and applications. CRC Press, 2005.
 Saibaba et al. (2016) Arvind K Saibaba, Alen Alexanderian, and Ilse CF Ipsen. Randomized Matrixfree Trace and LogDeterminant Estimators. arXiv preprint arXiv:1605.04893, 2016.
 Simpson (2008) Daniel Peter Simpson. Krylov subspace methods for approximating functions of symmetric positive definite matrices with applications to applied statistics and anomalous diffusion. PhD thesis, Queensland University of Technology, 2008.