Spectral Mixture Kernels for
MultiOutput Gaussian Processes
Abstract
Early approaches to multipleoutput Gaussian processes (MOGPs) relied on linear combinations of independent, latent, singleoutput Gaussian processes (GPs). This resulted in crosscovariance functions with limited parametric interpretation, thus conflicting with the ability of singleoutput GPs to understand lengthscales, frequencies and magnitudes to name a few. On the contrary, current approaches to MOGP are able to better interpret the relationship between different channels by directly modelling the crosscovariances as a spectral mixture kernel with a phase shift. We extend this rationale and propose a parametric family of complexvalued crossspectral densities and then build on Cramér’s Theorem (the multivariate version of Bochner’s Theorem) to provide a principled approach to design multivariate covariance functions. The soconstructed kernels are able to model delays among channels in addition to phase differences and are thus more expressive than previous methods, while also providing full parametric interpretation of the relationship across channels. The proposed method is first validated on synthetic data and then compared to existing MOGP methods on two realworld examples.
Spectral Mixture Kernels for
MultiOutput Gaussian Processes
Gabriel Parra Department of Mathematical Engineering Universidad de Chile gparra@dim.uchile.cl Felipe Tobar Center for Mathematical Modeling Universidad de Chile ftobar@dim.uchile.cl
noticebox[b]31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\end@float
1 Introduction
The extension of Gaussian processes (GPs [1]) to multiple outputs is referred to as multioutput Gaussian processes (MOGPs). MOGPs model temporal or spatial relationships among infinitelymany random variables, as scalar GPs, but also account for the statistical dependence across different sources of data (or channels). This is crucial in a number of realworld applications such as fault detection, data imputation and denoising. For any two input points , the covariance function of an channel MOGP is a symmetric positivedefinite matrix of scalar covariance functions. The design of this matrixvalued kernel is challenging since we have to deal with the trade off between (i) choosing a broad class of crosscovariances and autocovariances, while at the same time (ii) ensuring positive definiteness of the symmetric matrix containing these covariance functions for any pair of inputs . In particular, unlike the widely available families of autocovariance functions (e.g., [2]), crosscovariances are not bound to be positive definite and therefore can be designed freely; the construction of these functions with interpretable functional form is the main focus of this article.
A classical approach to define crosscovariances for a MOGP is to linearly combine independent latents GPs, this is the case of the Linear Model of Coregionalization (LMC [3]) and the Convolution Model (CONV, [4]). In these cases, the resulting kernel is a function of both the covariance functions of the latent GPs and the parameters of the linear operator considered; this results in symmetric and centred crosscovariances. While these approaches are simple, they lack interpretability of the dependencies learnt and force the autocovariances to have similar behaviour across different channels. The LMC method has also inspired the CrossSpectral Mixture (CSM) kernel [5], which uses the Spectral Mixture (SM) kernel in [6] within LMC and model phase differences across channels by manually introducing a shift between the cosine and exponential factors of the SM kernel. Despite exhibiting improved performance wrt previous approaches, the addition of the shift parameter in CSM poses the following question: Can the spectral design of multiouput covariance functions be even more flexible?
We take a different approach to extend the spectral mixture concept to multiple outputs: Recall that for stationary scalarvalued GPs, [6] designs the power spectral density (PSD) of the process by a mixture of square exponential functions to then, supported by Bochner’s theorem [7], present the Spectral Mixture kernel via the inverse Fourier transform of the soconstructed PSD. Along the same lines, our main contribution is to propose an expressive family of complexvalued squareexponential crossspectral densities, and then build on Cramér’s theorem [8, 9], the multivariate extension of Bochner’s, to construct the MultiOutput Spectral Mixture kernel (MOSM). The proposed multivariate covariance function accounts for all the properties of the CrossSpectral Mixture kernel in [5] plus a delay component across channels and variable parameters for autocovariances of different channels. Additionally, the proposed MOSM provides clear interpretation of all the parameters in spectral terms. Our experimental contribution includes an illustrative example using a trivariate synthetic signal and validation against all the aforementioned literature using two realworld datasets.
2 Background
Definition 1.
A Gaussian process (GP) over the input set is a realvalued stochastic process such that for any finite subset of inputs , the random variables are jointly Gaussian. Without loss of generality we will choose .
A GP [1] defines a distribution over functions that is uniquely determined by its mean function , typically assumed , and its covariance function (also known as kernel) . We now equip the reader with the necessary background to follow our proposal: we first review a spectralbased approach to the design of scalarvalued covariance kernels and then present the definition of a multioutput GP.
2.1 The Spectral Mixture kernel
To bypass the explicit construction of positivedefinite functions within the design of stationary covariance kernels, it is possible to design the power spectral density (PSD) instead [6] and then transform it into a covariance function using the inverse Fourier transform. This is motivated by the fact that the strict positivity requirement of the PSD is much easier to achieve than the positive definiteness requirement of the covariance kernel. The theoretical support of this construction is given by the following theorem:
Theorem 1.
(Bochner’s theorem) An integrable^{1}^{1}1A function is said to be integrable if function is the covariance function of a weaklystationary meansquarecontinuous stochastic process if and only if it admits the following representation
(1) 
where is a nonnegative bounded function on and denotes the imaginary unit.
For a proof see [9]. The above theorem gives an explicit relationship between the spectral density and the covariance function of the stochastic process . In this sense, [6] proposed to model the spectral density as a weighted mixture of squareexponential functions, with weights , centres and diagonal covariance matrices , that is,
(2) 
Relying on Theorem 1, the kernel associated to the spectral density in eq. (2) is given the spectral mixture kernel defined as follows.
Definition 2.
A Spectral Mixture (SM) kernel is a positivedefinite stationary kernel given by
(3) 
where , and .
Due to the universal function approximation property of the mixtures of Gaussians (considered here in the frequency domain) and the relationship given by Theorem 1, the SM kernel is able to approximate continuous stationary kernels to an arbitrary precision given enough spectral components as is [10, 11]. This concept points in the direction of sidestepping the kernel selection problem in GPs and it will be extended to cater for multivariate GPs in Section 3.
2.2 MultiOutput Gaussian Processes
A multivariate extension of GPs can be constructed by considering an ensemble of scalarvalued stochastic processes where any finite collection of values across all such processes are jointly Gaussian. We formalise this definition as follows.
Definition 3.
An channel multioutput Gaussian process , , is an tuple of stochastic processes , such that for any (finite) subset of inputs , the random variables are jointly Gaussian for any choice of indices .
Recall that the construction of scalarvalued GPs requires choosing a scalarvalued mean function and a scalarvalued covariance function. Conversely, an channel MOGP is defined by an valued mean function, whose element denotes the mean function of the channel, and an valued covariance function, whose element denotes the covariance between the and channels. The symmetry and positivedefiniteness conditions of the MOGP kernel are defined as follows.
Definition 4.
A twoinput matrixvalued function defined elementwise by is a multivariate kernel (covariance function) if it is:
(i) Symmetric, i.e., , and
(ii) Positive definite, i.e., such that, , we have
(4) 
Furthermore, we say that a multivariate kernel is stationary if or equivalently , in this case, we denote
The design of the MOGP covariance kernel involves jointly choosing functions that model the covariance of each channel (diagonal elements in ) and functions that model the crosscovariance between different channels at different input locations (offdiagonal elements in ). Choosing these covariance functions is challenging when we want to be as expressive as possible and include, for instance, delays, phase shifts, negative correlations or to enforce specific spectral content while at the same time maintaining positive definiteness of . The reader is referred to [12, 13] for a comprehensive review of MOGP models.
3 Designing MultiOutput Gaussian Processes in the Fourier Domain
We extend the spectralmixture approach [6] to multioutput Gaussian processes relying on the multivariate version of Theorem 1 first proved by Cramér and thus referred to as Cramér’s Theorem [8, 9] given by
Theorem 2.
(Cramér’s Theorem) A family of integrable functions are the covariance functions of a weaklystationary multivariate stochastic process if and only if they (i) admit the representation
(5) 
where each is an integrable complexvalued function known as the spectral density associated to the covariance function , and (ii) fulfil the positive definiteness condition
(6) 
where denotes the complex conjugate of .
Note that eq. (5) states that each covariance function is the inverse Fourier transform of a spectral density , therefore, we will say that these functions are Fourier pairs. Accordingly, we refer to the set of arguments of the covariance function as time or space Domain depending of the application considered, and to the set of arguments of the spectral densities as Fourier or spectral domain. Furthermore, a direct consequence of the above theorem is that for any element in the Fourier domain, the matrix defined by is Hermitian, i.e., .
Theorem 2 gives the guidelines to construct covariance functions for MOGP by designing their corresponding spectral densities instead, i.e., the design is performed in the Fourier rather than the space domain. The simplicity of design in the Fourier domain stems from the positivedefiniteness condition of the spectral densities in eq. (6), which is much easier to achieve than that of the covariance functions in eq. (4). This can be understood through an analogy with the univariate model: in the singleoutput case the positivedefiniteness condition of the kernel only requires positivity of the spectral density, whereas in the multioutput case the positivedefiniteness condition of the multivariate kernel only requires that the matrix , is positive definite but there are no constraints on each function .
3.1 The MultiOutput Spectral Mixture kernel
We now propose a family of Hermitian positivedefinite complexvalued functions , thus fulfilling the requirements of Theorem 2, eq. (6), to use them as crossspectral densities within MOGP. This family of functions is designed with the aim of providing physical parametric interpretation and closedform covariance functions after applying the inverse Fourier transform.
Recall that complexvalued positivedefinite matrices can be decomposed in the form , meaning that the entry of can be expressed as ; where , is the column of , and denotes the Hermitian (transpose and conjugate) operator. Note that this factor decomposition fulfils eq. (6) for any choice of :
(7) 
We refer to as the rank of the decomposition, since by choosing the rank of can be at most . For ease of notation we choose^{2}^{2}2The extension to arbitrary will be presented at the end of this section. , where the columns of are complexvalued functions , and is modeled as a rankone matrix according to . Since Fourier transforms and multiplications of square exponential (SE) functions are also SE, we model as a complexvalued SE function so as to ensure closedform expression of its corresponding covariance kernel, that is,
(8) 
where , and . With this choice of the functions , the spectral densities are given by
(9) 
meaning that the crossspectral density between channels and is modeled as a complexvalued SE function with the following parameters:

covariance:

mean:

magnitude:

delay:

phase:
where the soconstructed magnitudes ensure positive definiteness and, in particular, the autospectral densities are realvalued SE functions (since ) as in the standard (scalarvalued) spectral mixture approach [6].
The power spectral density in eq. (9) corresponds to a complexvalued kernel and therefore to a complexvalued GP [14, 15] . In order to restrict this generative model only to realvalued GPs, the proposed power spectral density has to be symmetric with respect to [16], we then make symmetric simply by reassigning , this is equivalent to choosing to be a vector of two mirrored complex SE functions.
The resulting (symmetric with respect to ) crossspectral density between the and channels and its corresponding realvalued kernel are the following Fourier pairs
(10) 
where the magnitude parameter absorbs the constant resulting from the inverse Fourier transform.
We can again confirm that the autocovariances () are realvalued and contain squareexponential and cosine factors as in the scalar SM approach since and . Conversely, the proposed model for the crosscovariance between different channels () allows for (i) both negatively and positivelycorrelated signals (), (ii) delayed channels through the delay parameter and (iii) outofphase channels where the covariance is not symmetric with respect to the delay for . Fig. 1 shows crossspectral densities and their corresponding kernel for a choice of different delay and phase parameters.
The kernel in eq. (10) resulted from a low rank choice for the PSD matrix , therefore, increasing the rank in the proposed model for is equivalent to consider several kernel components. Arbitrarily choosing of these components yields the expression for the proposed multivariate kernel:
Definition 5.
The MultiOutput Spectral Mixture kernel (MOSM) has the form:
(11) 
where and the superindex denotes the parameter of the component of the spectral mixture.
This multivariate covariance function has spectralmixture positivedefinite kernels as autocovariances, while the crosscovariances are spectral mixture functions with different parameters for different output pairs, which can be (i) nonpositivedefinite, (ii) nonsymmetric, and (iii) delayed with respect to one another. Therefore, the MOSM kernel is a multioutput generalisation of the spectral mixture approach [6] where the positive definiteness is guaranteed by the factor decomposition of as shown in eq. (7).
3.2 Training the model and computing the predictive posterior
Fitting the model to observed data follows the same rationale of standard GP, that is, maximising logprobability of the data. Recall that the observations in the multioutput case consist of (i) a location , (ii) a channel identifier , and (iii) an observed value ; therefore, we denote observations as the set of 3tuples . As all observations are jointly Gaussian, we concatenate the observations into the three vectors , , and , to express the negative loglikelihood (NLL) by
(12) 
where all hyperparameters are denoted by , and is the covariance matrix of all observed samples, that is, the element is the covariance between the process at (location: , channel: ) and the process at (location: , channel: ). Recall that, under the proposed MOSM model, this covariance is given by eq. (11), that is, , where is a diagonal term to cater for uncorrelated observation noise. The NLL is then minimised with respect to , that is, the original parameters chosen to construct in Section 3.1, plus the noise hyperparameters.
Once the hyperparameters are optimised, computing the predictive posterior in the proposed MOSM follows the standard GP procedure with the joint covariances given by eq. (11).
3.3 Related work
Generalising the scalar spectral mixture kernel to MOGPs can be achieved from the LMC framework as pointed out in [5] (denoted SMLMC). As this formulation only considers realvalued cross spectral densities, the authors propose a multivariate covariance function by including a complex component to the cross spectral densities to cater for phase differences across channels, which they call the Cross Spectral Mixture kernel (denoted CSM). This multivariate covariance function can be seen as the proposed MOSM model with , , and for . As a consequence, the SMLMC is a particular case of the proposed MOSM model, where the parameters are restricted to be same for all channels and therefore no phase shifts and no delays are allowed—unlike the MOSM example in Fig. 1. Additionally, Cramér’s theorem has also been used in a similar fashion in [17] but only with realvalued tStudent crossspectral densities yielding crosscovariances that are either positivedefinite or negativedefinite.
4 Experiments
We show two sets of experiments. First, we validated the ability of the proposed MOSM model in the identification of known auto and crosscovariances of synthetic data. Second, we compared MOSM against the spectralmixture linear model of coregionalization (SMLMC, [3, 6, 5]), the Gaussian convolution model (CONV, [4]), and the crossspectral mixture model (CSM, [5]) in the estimation of missing realworld data in two different distributed settings: climate signals and metal concentrations. All models were implemented in Tensorflow [18] using GPflow [19] in order to make use of automatic differentiation to compute the gradients of the NLL. The performance of all the models in the experiments was measured by the mean absolute error given by
(13) 
where denotes the true value and the MOGP estimate.
4.1 Synthetic example: Learning derivatives and delayed signals
All models were implemented to recover the auto and crosscovariances of a threeoutput GP with the following components: (i) a reference signal sampled from a GP with spectral mixture covariance kernel and zero mean, (ii) its derivative , and (iii) a delayed version . The motivation for this illustrative example is that the covariances and cross covariances of the aforementioned processes are known explicitly (see [1, Sec. 9.4]) and we can therefore compare our estimates to the true model. The derivative was computed numerically (first order through finite differences) and the training samples were generated as follows: We chose samples from the reference function in the interval [20, 20], samples from the derivative signal in the interval [20, 0], and samples from the delayed signal in the interval [20, 0]. All samples were randomly uniformly chosen in the intervals mentioned and Gaussian noised was added to yield realistic observations. The experiment then consisted in the reconstruction the reference signal in the interval [20, 20], and the imputation of the derivative and delayed signals over the interval [0, 20].
Fig. 2 shows the ground truth and MOSM estimates for all three synthetic signals and the covariances (normalised), and Table 1 reports the MAE for all models over ten realisations of the experiment. Notice that the proposed model successfully learnt all crosscovariances and , and autocovariances without prior information about the delayed or the derivative relationship between the two channels. Furthermore, MOSM was the only model that successfully extrapolated the derivate signal and the delayed signal simultaneously, this is due the fact that the crosscovariances needed for this setting are not linear combinations of univariate kernels, hence models based on latent processes fail in this synthetic example.
Model  Reference  Derivative  Delayed 

CONV  0.211 0.085  0.759 0.075  0.524 0.097 
SMLMC  0.166 0.009  0.747 0.101  0.398 0.042 
CSM  0.148 0.010  0.262 0.032  0.368 0.089 
MOSM  0.127 0.011  0.223 0.015  0.146 0.017 
4.2 Climate data
The first realworld dataset contained measurements^{3}^{3}3The data can be obtained from www.cambermet.co.uk. and the sites therein. from a sensor network of four climate stations in the south on England: Cambermet, Chimet, Sotonmet and Bramblemet. We considered the normalised air temperature signal from 12 March, 2017 to 16 March, 2017, in 5minute intervals (5692 samples), from where we randomly chose samples for training. Following [4], we simulated a sensor failure by removing the second half of the measurements for one sensor and leaving the remaining three sensors operating correctly; we reproduced the same setup across all four sensors thus producing four experiments. All models considered had five latent signals/spectral components.
For all four models considered, Fig. 3 shows the estimates of missing data for the Cambermetfailure case. Table 2 shows the mean absolute error for all models and failure cases over the missing data region. Observe how all models were able to capture the behaviour of the signal in the missing range, this is because the considered climate signals are very similar to one another. This shows that the MOSM can also collapse to models that share parameters across pairs of outputs when required.
Model  Cambermet  Chimet  Sotonmet  Bramblemet 

CONV  0.098 0.008  0.192 0.015  0.211 0.038  0.163 0.009 
SMLMC  0.084 0.004  0.176 0.003  0.273 0.001  0.134 0.002 
CSM  0.094 0.003  0.129 0.004  0.195 0.011  0.130 0.004 
MOSM  0.097 0.006  0.137 0.007  0.162 0.011  0.129 0.003 
These results do not show a significant difference between the proposed model and the latent processes based models. In order to test for statistical significance, the KolmogorovSmirnov test [20, Ch. 7] was used with a significance level , concluding that for the Sotonmet sensor we can assure that the MOSM model yields the best results. Conversely, for the Cambermet, Chimet and Bramblemet sensors, MOSM and CSM provided similar results, though we cannot confirm their difference is statistically significant. However, given the high correlation of these signals and the similarity between the MOSM model and the CSM model, the close performance of these two models on this dataset is to be expected.
4.3 Heavy metal concentration
The Jura dataset [3] contains, in addition to other geological data, the concentration of seven heavy metals in a region of of the Swiss Jura, and it is divided into a training set (259 locations) and a validation set (100 locations). We followed [3, 4], where the motivation was to aid the prediction of a variable that is expensive to measure by using abundant measurements of correlated variables which are less expensive to acquire. Specifically, we estimated Cadmium and Copper at the validation locations using measurements of related variables at the training and test locations: Nickel and Zinc for Cadmium; and Lead, Nickel and Zinc for Copper. The MAE—see eq. (13)—is shown in Table 3, where the results for the CONV model were obtained from [4] and all models considered five latent signals/spectral components, except for the independent Gaussian process (denoted IGP).
Observe how the proposed MOSM model outperforms all other models over the Cadmium data, which is statistical significant with a significance level . Conversely, we cannot guarantee a statisticallysignificant difference between the CSM model and the MOSM in the Copper case. In both cases, testing for statistical significance against the CONV model was not possible since those results were obtained from [4]. On the other hand, the higher variability and nonGaussianity of the Copper data may be the reason of why the simplest MOGP model (SMLMC) achieves the best results.
Model  Cadmium  Copper 

IGP  
CONV  
SMLMC  
CSM  
MOSM 
5 Discussion
We have proposed the multioutput spectral mixture (MOSM) kernel to model rich relationships across multiple outputs within Gaussian processes regression models. This has been achieved by constructing a positivedefinite matrix of complexvalued spectral densities, and then transforming them via the inverse Fourier transform according to Cramér’s Theorem. The resulting kernel provides a clear interpretation from a spectral viewpoint, where each of its parameters can be identified with frequency, magnitude, phase and delay for a pair of channels. Furthermore, a key feature that is unique to the proposed kernel is the ability joint model delays and phase differences, this is possible due to the complexvalued model for the crossspectral density considered and validated experimentally using a synthetic example—see Fig. 2. The MOSM kernel has also been compared against existing MOGP models on two realworld datasets, where the proposed model performed competitively in terms of the mean absolute error. Further research should point towards a sparse implementation of the proposed MOGP which can build on [4, 21] to design inducing variables that exploit the spectral content of the processes as in [22, 23].
Acknowledgements
We thank Cristóbal Silva (Universidad de Chile) for useful recommendations about GPU implementation, Rasmus Bonnevie from the GPflow team for his assistance on the experimental MOGP module within GPflow, and the anonymous reviewers. This work was financially supported by Conicyt BasalCMM.
References
 [1] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. The MIT Press, 2006.
 [2] D. Duvenaud, “Automatic model construction with Gaussian processes,” Ph.D. dissertation, University of Cambridge, 2014.
 [3] P. Goovaerts, Geostatistics for natural resources evaluation. Oxford University Press on Demand, 1997.
 [4] M. A. Álvarez and N. D. Lawrence, “Sparse convolved Gaussian processes for multioutput regression,” in Advances in Neural Information Processing Systems 21, 2008, pp. 57–64.
 [5] K. R. Ulrich, D. E. Carlson, K. Dzirasa, and L. Carin, “GP kernels for crossspectrum analysis,” in Advances in Neural Information Processing Systems 28, 2015, pp. 1999–2007.
 [6] A. G. Wilson and R. P. Adams, “Gaussian process kernels for pattern discovery and extrapolation,” in Proceedings of the 30th International Conference on Machine Learning (ICML13), 2013, pp. 1067–1075.
 [7] S. Bochner, M. Tenenbaum, and H. Pollard, Lectures on Fourier Integrals, ser. Annals of mathematics studies. Princeton University Press, 1959.
 [8] H. Cramér, “On the theory of stationary random processes,” Annals of Mathematics, pp. 215–230, 1940.
 [9] A. Yaglom, Correlation Theory of Stationary and Related Random Functions, ser. Correlation Theory of Stationary and Related Random Functions. Springer, 1987, no. v. 1.
 [10] F. Tobar, T. D. Bui, and R. E. Turner, “Learning stationary time series using Gaussian processes with nonparametric kernels,” in Advances in Neural Information Processing Systems 28. Curran Associates, Inc., 2015, pp. 3501–3509.
 [11] F. Tobar and R. E. Turner, “Modelling time series via automatic learning of basis functions,” in Proc. of IEEE SAM, 2016, pp. 2209–2213.
 [12] M. A. Álvarez, L. Rosasco, and N. D. Lawrence, “Kernels for vectorvalued functions: A review,” Found. Trends Mach. Learn., vol. 4, no. 3, pp. 195–266, Mar. 2012.
 [13] M. G. Genton and W. Kleiber, “Crosscovariance functions for multivariate geostatistics,” Institute of Mathematical Statistics, vol. 30, no. 2, 2015.
 [14] F. Tobar and R. E. Turner, “Modelling of complex signals using Gaussian processes,” in Proc. of IEEE ICASSP, 2015, pp. 2209–2213.
 [15] R. BoloixTortosa, F. J. PayánSomet, and J. J. MurilloFuentes, “Gaussian processes regressors for complex proper signals in digital communications,” in Proc. of IEEE SAM, 2014, pp. 137–140.
 [16] S. M. Kay, Modern spectral estimation : Theory and application. Englewood Cliffs, N.J. : Prentice Hall, 1988.
 [17] T. Gneiting, W. Kleiber, and M. Schlather, “Matérn crosscovariance functions for multivariate random fields,” Journal of the American Statistical Association, vol. 105, no. 491, pp. 1167–1177, 2010.
 [18] M. Abadi et al., “TensorFlow: Largescale machine learning on heterogeneous systems,” 2015, software available from tensorflow.org. [Online]. Available: http://tensorflow.org/
 [19] A. G. d. G. Matthews, M. van der Wilk, T. Nickson, K. Fujii, A. Boukouvalas, P. LeónVillagrá, Z. Ghahramani, and J. Hensman, “GPflow: A Gaussian process library using TensorFlow,” 2016.
 [20] J. W. Pratt and J. D. Gibbons, Concepts of nonparametric theory. Springer Science & Business Media, 2012.
 [21] M. A. Álvarez, D. Luengo, M. K. Titsias, and N. D. Lawrence, “Efficient multioutput Gaussian processes through variational inducing kernels.” in AISTATS, vol. 9, 2010, pp. 25–32.
 [22] J. Hensman, N. Durrande, and A. Solin, “Variational Fourier features for Gaussian processes,” arXiv preprint arXiv:1611.06740, 2016.
 [23] F. Tobar, T. D. Bui, and R. E. Turner, “Design of covariance functions using interdomain inducing variables,” in NIPS 2015  Time Series Workshop, December 2015.