Spectral Mixture Kernels for
Multi-Output Gaussian Processes
Early approaches to multiple-output Gaussian processes (MOGPs) relied on linear combinations of independent, latent, single-output Gaussian processes (GPs). This resulted in cross-covariance functions with limited parametric interpretation, thus conflicting with the ability of single-output GPs to understand lengthscales, frequencies and magnitudes to name a few. On the contrary, current approaches to MOGP are able to better interpret the relationship between different channels by directly modelling the cross-covariances as a spectral mixture kernel with a phase shift. We extend this rationale and propose a parametric family of complex-valued cross-spectral densities and then build on Cramér’s Theorem (the multivariate version of Bochner’s Theorem) to provide a principled approach to design multivariate covariance functions. The so-constructed kernels are able to model delays among channels in addition to phase differences and are thus more expressive than previous methods, while also providing full parametric interpretation of the relationship across channels. The proposed method is first validated on synthetic data and then compared to existing MOGP methods on two real-world examples.
Spectral Mixture Kernels for
Multi-Output Gaussian Processes
Gabriel Parra Department of Mathematical Engineering Universidad de Chile email@example.com Felipe Tobar Center for Mathematical Modeling Universidad de Chile firstname.lastname@example.org
noticebox[b]31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\end@float
The extension of Gaussian processes (GPs ) to multiple outputs is referred to as multi-output Gaussian processes (MOGPs). MOGPs model temporal or spatial relationships among infinitely-many random variables, as scalar GPs, but also account for the statistical dependence across different sources of data (or channels). This is crucial in a number of real-world applications such as fault detection, data imputation and denoising. For any two input points , the covariance function of an -channel MOGP is a symmetric positive-definite matrix of scalar covariance functions. The design of this matrix-valued kernel is challenging since we have to deal with the trade off between (i) choosing a broad class of cross-covariances and auto-covariances, while at the same time (ii) ensuring positive definiteness of the symmetric matrix containing these covariance functions for any pair of inputs . In particular, unlike the widely available families of auto-covariance functions (e.g., ), cross-covariances are not bound to be positive definite and therefore can be designed freely; the construction of these functions with interpretable functional form is the main focus of this article.
A classical approach to define cross-covariances for a MOGP is to linearly combine independent latents GPs, this is the case of the Linear Model of Coregionalization (LMC ) and the Convolution Model (CONV, ). In these cases, the resulting kernel is a function of both the covariance functions of the latent GPs and the parameters of the linear operator considered; this results in symmetric and centred cross-covariances. While these approaches are simple, they lack interpretability of the dependencies learnt and force the auto-covariances to have similar behaviour across different channels. The LMC method has also inspired the Cross-Spectral Mixture (CSM) kernel , which uses the Spectral Mixture (SM) kernel in  within LMC and model phase differences across channels by manually introducing a shift between the cosine and exponential factors of the SM kernel. Despite exhibiting improved performance wrt previous approaches, the addition of the shift parameter in CSM poses the following question: Can the spectral design of multiouput covariance functions be even more flexible?
We take a different approach to extend the spectral mixture concept to multiple outputs: Recall that for stationary scalar-valued GPs,  designs the power spectral density (PSD) of the process by a mixture of square exponential functions to then, supported by Bochner’s theorem , present the Spectral Mixture kernel via the inverse Fourier transform of the so-constructed PSD. Along the same lines, our main contribution is to propose an expressive family of complex-valued square-exponential cross-spectral densities, and then build on Cramér’s theorem [8, 9], the multivariate extension of Bochner’s, to construct the Multi-Output Spectral Mixture kernel (MOSM). The proposed multivariate covariance function accounts for all the properties of the Cross-Spectral Mixture kernel in  plus a delay component across channels and variable parameters for auto-covariances of different channels. Additionally, the proposed MOSM provides clear interpretation of all the parameters in spectral terms. Our experimental contribution includes an illustrative example using a trivariate synthetic signal and validation against all the aforementioned literature using two real-world datasets.
A Gaussian process (GP) over the input set is a real-valued stochastic process such that for any finite subset of inputs , the random variables are jointly Gaussian. Without loss of generality we will choose .
A GP  defines a distribution over functions that is uniquely determined by its mean function , typically assumed , and its covariance function (also known as kernel) . We now equip the reader with the necessary background to follow our proposal: we first review a spectral-based approach to the design of scalar-valued covariance kernels and then present the definition of a multi-output GP.
2.1 The Spectral Mixture kernel
To bypass the explicit construction of positive-definite functions within the design of stationary covariance kernels, it is possible to design the power spectral density (PSD) instead  and then transform it into a covariance function using the inverse Fourier transform. This is motivated by the fact that the strict positivity requirement of the PSD is much easier to achieve than the positive definiteness requirement of the covariance kernel. The theoretical support of this construction is given by the following theorem:
(Bochner’s theorem) An integrable111A function is said to be integrable if function is the covariance function of a weakly-stationary mean-square-continuous stochastic process if and only if it admits the following representation
where is a non-negative bounded function on and denotes the imaginary unit.
For a proof see . The above theorem gives an explicit relationship between the spectral density and the covariance function of the stochastic process . In this sense,  proposed to model the spectral density as a weighted mixture of square-exponential functions, with weights , centres and diagonal covariance matrices , that is,
A Spectral Mixture (SM) kernel is a positive-definite stationary kernel given by
where , and .
Due to the universal function approximation property of the mixtures of Gaussians (considered here in the frequency domain) and the relationship given by Theorem 1, the SM kernel is able to approximate continuous stationary kernels to an arbitrary precision given enough spectral components as is [10, 11]. This concept points in the direction of sidestepping the kernel selection problem in GPs and it will be extended to cater for multivariate GPs in Section 3.
2.2 Multi-Output Gaussian Processes
A multivariate extension of GPs can be constructed by considering an ensemble of scalar-valued stochastic processes where any finite collection of values across all such processes are jointly Gaussian. We formalise this definition as follows.
An -channel multi-output Gaussian process , , is an -tuple of stochastic processes , such that for any (finite) subset of inputs , the random variables are jointly Gaussian for any choice of indices .
Recall that the construction of scalar-valued GPs requires choosing a scalar-valued mean function and a scalar-valued covariance function. Conversely, an -channel MOGP is defined by an -valued mean function, whose element denotes the mean function of the channel, and an -valued covariance function, whose element denotes the covariance between the and channels. The symmetry and positive-definiteness conditions of the MOGP kernel are defined as follows.
A two-input matrix-valued function defined element-wise by is a multivariate kernel (covariance function) if it is:
(i) Symmetric, i.e., , and
(ii) Positive definite, i.e., such that, , we have
Furthermore, we say that a multivariate kernel is stationary if or equivalently , in this case, we denote
The design of the MOGP covariance kernel involves jointly choosing functions that model the covariance of each channel (diagonal elements in ) and functions that model the cross-covariance between different channels at different input locations (off-diagonal elements in ). Choosing these covariance functions is challenging when we want to be as expressive as possible and include, for instance, delays, phase shifts, negative correlations or to enforce specific spectral content while at the same time maintaining positive definiteness of . The reader is referred to [12, 13] for a comprehensive review of MOGP models.
3 Designing Multi-Output Gaussian Processes in the Fourier Domain
We extend the spectral-mixture approach  to multi-output Gaussian processes relying on the multivariate version of Theorem 1 first proved by Cramér and thus referred to as Cramér’s Theorem [8, 9] given by
(Cramér’s Theorem) A family of integrable functions are the covariance functions of a weakly-stationary multivariate stochastic process if and only if they (i) admit the representation
where each is an integrable complex-valued function known as the spectral density associated to the covariance function , and (ii) fulfil the positive definiteness condition
where denotes the complex conjugate of .
Note that eq. (5) states that each covariance function is the inverse Fourier transform of a spectral density , therefore, we will say that these functions are Fourier pairs. Accordingly, we refer to the set of arguments of the covariance function as time or space Domain depending of the application considered, and to the set of arguments of the spectral densities as Fourier or spectral domain. Furthermore, a direct consequence of the above theorem is that for any element in the Fourier domain, the matrix defined by is Hermitian, i.e., .
Theorem 2 gives the guidelines to construct covariance functions for MOGP by designing their corresponding spectral densities instead, i.e., the design is performed in the Fourier rather than the space domain. The simplicity of design in the Fourier domain stems from the positive-definiteness condition of the spectral densities in eq. (6), which is much easier to achieve than that of the covariance functions in eq. (4). This can be understood through an analogy with the univariate model: in the single-output case the positive-definiteness condition of the kernel only requires positivity of the spectral density, whereas in the multioutput case the positive-definiteness condition of the multivariate kernel only requires that the matrix , is positive definite but there are no constraints on each function .
3.1 The Multi-Output Spectral Mixture kernel
We now propose a family of Hermitian positive-definite complex-valued functions , thus fulfilling the requirements of Theorem 2, eq. (6), to use them as cross-spectral densities within MOGP. This family of functions is designed with the aim of providing physical parametric interpretation and closed-form covariance functions after applying the inverse Fourier transform.
Recall that complex-valued positive-definite matrices can be decomposed in the form , meaning that the entry of can be expressed as ; where , is the column of , and denotes the Hermitian (transpose and conjugate) operator. Note that this factor decomposition fulfils eq. (6) for any choice of :
We refer to as the rank of the decomposition, since by choosing the rank of can be at most . For ease of notation we choose222The extension to arbitrary will be presented at the end of this section. , where the columns of are complex-valued functions , and is modeled as a rank-one matrix according to . Since Fourier transforms and multiplications of square exponential (SE) functions are also SE, we model as a complex-valued SE function so as to ensure closed-form expression of its corresponding covariance kernel, that is,
where , and . With this choice of the functions , the spectral densities are given by
meaning that the cross-spectral density between channels and is modeled as a complex-valued SE function with the following parameters:
where the so-constructed magnitudes ensure positive definiteness and, in particular, the auto-spectral densities are real-valued SE functions (since ) as in the standard (scalar-valued) spectral mixture approach .
The power spectral density in eq. (9) corresponds to a complex-valued kernel and therefore to a complex-valued GP [14, 15] . In order to restrict this generative model only to real-valued GPs, the proposed power spectral density has to be symmetric with respect to , we then make symmetric simply by reassigning , this is equivalent to choosing to be a vector of two mirrored complex SE functions.
The resulting (symmetric with respect to ) cross-spectral density between the and channels and its corresponding real-valued kernel are the following Fourier pairs
where the magnitude parameter absorbs the constant resulting from the inverse Fourier transform.
We can again confirm that the autocovariances () are real-valued and contain square-exponential and cosine factors as in the scalar SM approach since and . Conversely, the proposed model for the cross-covariance between different channels () allows for (i) both negatively- and positively-correlated signals (), (ii) delayed channels through the delay parameter and (iii) out-of-phase channels where the covariance is not symmetric with respect to the delay for . Fig. 1 shows cross-spectral densities and their corresponding kernel for a choice of different delay and phase parameters.
The kernel in eq. (10) resulted from a low rank choice for the PSD matrix , therefore, increasing the rank in the proposed model for is equivalent to consider several kernel components. Arbitrarily choosing of these components yields the expression for the proposed multivariate kernel:
The Multi-Output Spectral Mixture kernel (MOSM) has the form:
where and the superindex denotes the parameter of the component of the spectral mixture.
This multivariate covariance function has spectral-mixture positive-definite kernels as auto-covariances, while the cross-covariances are spectral mixture functions with different parameters for different output pairs, which can be (i) non-positive-definite, (ii) non-symmetric, and (iii) delayed with respect to one another. Therefore, the MOSM kernel is a multi-output generalisation of the spectral mixture approach  where the positive definiteness is guaranteed by the factor decomposition of as shown in eq. (7).
3.2 Training the model and computing the predictive posterior
Fitting the model to observed data follows the same rationale of standard GP, that is, maximising log-probability of the data. Recall that the observations in the multioutput case consist of (i) a location , (ii) a channel identifier , and (iii) an observed value ; therefore, we denote observations as the set of 3-tuples . As all observations are jointly Gaussian, we concatenate the observations into the three vectors , , and , to express the negative log-likelihood (NLL) by
where all hyperparameters are denoted by , and is the covariance matrix of all observed samples, that is, the element is the covariance between the process at (location: , channel: ) and the process at (location: , channel: ). Recall that, under the proposed MOSM model, this covariance is given by eq. (11), that is, , where is a diagonal term to cater for uncorrelated observation noise. The NLL is then minimised with respect to , that is, the original parameters chosen to construct in Section 3.1, plus the noise hyperparameters.
Once the hyperparameters are optimised, computing the predictive posterior in the proposed MOSM follows the standard GP procedure with the joint covariances given by eq. (11).
3.3 Related work
Generalising the scalar spectral mixture kernel to MOGPs can be achieved from the LMC framework as pointed out in  (denoted SM-LMC). As this formulation only considers real-valued cross spectral densities, the authors propose a multivariate covariance function by including a complex component to the cross spectral densities to cater for phase differences across channels, which they call the Cross Spectral Mixture kernel (denoted CSM). This multivariate covariance function can be seen as the proposed MOSM model with , , and for . As a consequence, the SM-LMC is a particular case of the proposed MOSM model, where the parameters are restricted to be same for all channels and therefore no phase shifts and no delays are allowed—unlike the MOSM example in Fig. 1. Additionally, Cramér’s theorem has also been used in a similar fashion in  but only with real-valued t-Student cross-spectral densities yielding cross-covariances that are either positive-definite or negative-definite.
We show two sets of experiments. First, we validated the ability of the proposed MOSM model in the identification of known auto- and cross-covariances of synthetic data. Second, we compared MOSM against the spectral-mixture linear model of coregionalization (SM-LMC, [3, 6, 5]), the Gaussian convolution model (CONV, ), and the cross-spectral mixture model (CSM, ) in the estimation of missing real-world data in two different distributed settings: climate signals and metal concentrations. All models were implemented in Tensorflow  using GPflow  in order to make use of automatic differentiation to compute the gradients of the NLL. The performance of all the models in the experiments was measured by the mean absolute error given by
where denotes the true value and the MOGP estimate.
4.1 Synthetic example: Learning derivatives and delayed signals
All models were implemented to recover the auto- and cross-covariances of a three-output GP with the following components: (i) a reference signal sampled from a GP with spectral mixture covariance kernel and zero mean, (ii) its derivative , and (iii) a delayed version . The motivation for this illustrative example is that the covariances and cross covariances of the aforementioned processes are known explicitly (see [1, Sec. 9.4]) and we can therefore compare our estimates to the true model. The derivative was computed numerically (first order through finite differences) and the training samples were generated as follows: We chose samples from the reference function in the interval [-20, 20], samples from the derivative signal in the interval [-20, 0], and samples from the delayed signal in the interval [-20, 0]. All samples were randomly uniformly chosen in the intervals mentioned and Gaussian noised was added to yield realistic observations. The experiment then consisted in the reconstruction the reference signal in the interval [-20, 20], and the imputation of the derivative and delayed signals over the interval [0, 20].
Fig. 2 shows the ground truth and MOSM estimates for all three synthetic signals and the covariances (normalised), and Table 1 reports the MAE for all models over ten realisations of the experiment. Notice that the proposed model successfully learnt all cross-covariances and , and autocovariances without prior information about the delayed or the derivative relationship between the two channels. Furthermore, MOSM was the only model that successfully extrapolated the derivate signal and the delayed signal simultaneously, this is due the fact that the cross-covariances needed for this setting are not linear combinations of univariate kernels, hence models based on latent processes fail in this synthetic example.
|CONV||0.211 0.085||0.759 0.075||0.524 0.097|
|SM-LMC||0.166 0.009||0.747 0.101||0.398 0.042|
|CSM||0.148 0.010||0.262 0.032||0.368 0.089|
|MOSM||0.127 0.011||0.223 0.015||0.146 0.017|
4.2 Climate data
The first real-world dataset contained measurements333The data can be obtained from www.cambermet.co.uk. and the sites therein. from a sensor network of four climate stations in the south on England: Cambermet, Chimet, Sotonmet and Bramblemet. We considered the normalised air temperature signal from 12 March, 2017 to 16 March, 2017, in 5-minute intervals (5692 samples), from where we randomly chose samples for training. Following , we simulated a sensor failure by removing the second half of the measurements for one sensor and leaving the remaining three sensors operating correctly; we reproduced the same setup across all four sensors thus producing four experiments. All models considered had five latent signals/spectral components.
For all four models considered, Fig. 3 shows the estimates of missing data for the Cambermet-failure case. Table 2 shows the mean absolute error for all models and failure cases over the missing data region. Observe how all models were able to capture the behaviour of the signal in the missing range, this is because the considered climate signals are very similar to one another. This shows that the MOSM can also collapse to models that share parameters across pairs of outputs when required.
|CONV||0.098 0.008||0.192 0.015||0.211 0.038||0.163 0.009|
|SM-LMC||0.084 0.004||0.176 0.003||0.273 0.001||0.134 0.002|
|CSM||0.094 0.003||0.129 0.004||0.195 0.011||0.130 0.004|
|MOSM||0.097 0.006||0.137 0.007||0.162 0.011||0.129 0.003|
These results do not show a significant difference between the proposed model and the latent processes based models. In order to test for statistical significance, the Kolmogorov-Smirnov test [20, Ch. 7] was used with a significance level , concluding that for the Sotonmet sensor we can assure that the MOSM model yields the best results. Conversely, for the Cambermet, Chimet and Bramblemet sensors, MOSM and CSM provided similar results, though we cannot confirm their difference is statistically significant. However, given the high correlation of these signals and the similarity between the MOSM model and the CSM model, the close performance of these two models on this dataset is to be expected.
4.3 Heavy metal concentration
The Jura dataset  contains, in addition to other geological data, the concentration of seven heavy metals in a region of of the Swiss Jura, and it is divided into a training set (259 locations) and a validation set (100 locations). We followed [3, 4], where the motivation was to aid the prediction of a variable that is expensive to measure by using abundant measurements of correlated variables which are less expensive to acquire. Specifically, we estimated Cadmium and Copper at the validation locations using measurements of related variables at the training and test locations: Nickel and Zinc for Cadmium; and Lead, Nickel and Zinc for Copper. The MAE—see eq. (13)—is shown in Table 3, where the results for the CONV model were obtained from  and all models considered five latent signals/spectral components, except for the independent Gaussian process (denoted IGP).
Observe how the proposed MOSM model outperforms all other models over the Cadmium data, which is statistical significant with a significance level . Conversely, we cannot guarantee a statistically-significant difference between the CSM model and the MOSM in the Copper case. In both cases, testing for statistical significance against the CONV model was not possible since those results were obtained from . On the other hand, the higher variability and non-Gaussianity of the Copper data may be the reason of why the simplest MOGP model (SM-LMC) achieves the best results.
We have proposed the multioutput spectral mixture (MOSM) kernel to model rich relationships across multiple outputs within Gaussian processes regression models. This has been achieved by constructing a positive-definite matrix of complex-valued spectral densities, and then transforming them via the inverse Fourier transform according to Cramér’s Theorem. The resulting kernel provides a clear interpretation from a spectral viewpoint, where each of its parameters can be identified with frequency, magnitude, phase and delay for a pair of channels. Furthermore, a key feature that is unique to the proposed kernel is the ability joint model delays and phase differences, this is possible due to the complex-valued model for the cross-spectral density considered and validated experimentally using a synthetic example—see Fig. 2. The MOSM kernel has also been compared against existing MOGP models on two real-world datasets, where the proposed model performed competitively in terms of the mean absolute error. Further research should point towards a sparse implementation of the proposed MOGP which can build on [4, 21] to design inducing variables that exploit the spectral content of the processes as in [22, 23].
We thank Cristóbal Silva (Universidad de Chile) for useful recommendations about GPU implementation, Rasmus Bonnevie from the GPflow team for his assistance on the experimental MOGP module within GPflow, and the anonymous reviewers. This work was financially supported by Conicyt Basal-CMM.
-  C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. The MIT Press, 2006.
-  D. Duvenaud, “Automatic model construction with Gaussian processes,” Ph.D. dissertation, University of Cambridge, 2014.
-  P. Goovaerts, Geostatistics for natural resources evaluation. Oxford University Press on Demand, 1997.
-  M. A. Álvarez and N. D. Lawrence, “Sparse convolved Gaussian processes for multi-output regression,” in Advances in Neural Information Processing Systems 21, 2008, pp. 57–64.
-  K. R. Ulrich, D. E. Carlson, K. Dzirasa, and L. Carin, “GP kernels for cross-spectrum analysis,” in Advances in Neural Information Processing Systems 28, 2015, pp. 1999–2007.
-  A. G. Wilson and R. P. Adams, “Gaussian process kernels for pattern discovery and extrapolation,” in Proceedings of the 30th International Conference on Machine Learning (ICML-13), 2013, pp. 1067–1075.
-  S. Bochner, M. Tenenbaum, and H. Pollard, Lectures on Fourier Integrals, ser. Annals of mathematics studies. Princeton University Press, 1959.
-  H. Cramér, “On the theory of stationary random processes,” Annals of Mathematics, pp. 215–230, 1940.
-  A. Yaglom, Correlation Theory of Stationary and Related Random Functions, ser. Correlation Theory of Stationary and Related Random Functions. Springer, 1987, no. v. 1.
-  F. Tobar, T. D. Bui, and R. E. Turner, “Learning stationary time series using Gaussian processes with nonparametric kernels,” in Advances in Neural Information Processing Systems 28. Curran Associates, Inc., 2015, pp. 3501–3509.
-  F. Tobar and R. E. Turner, “Modelling time series via automatic learning of basis functions,” in Proc. of IEEE SAM, 2016, pp. 2209–2213.
-  M. A. Álvarez, L. Rosasco, and N. D. Lawrence, “Kernels for vector-valued functions: A review,” Found. Trends Mach. Learn., vol. 4, no. 3, pp. 195–266, Mar. 2012.
-  M. G. Genton and W. Kleiber, “Cross-covariance functions for multivariate geostatistics,” Institute of Mathematical Statistics, vol. 30, no. 2, 2015.
-  F. Tobar and R. E. Turner, “Modelling of complex signals using Gaussian processes,” in Proc. of IEEE ICASSP, 2015, pp. 2209–2213.
-  R. Boloix-Tortosa, F. J. Payán-Somet, and J. J. Murillo-Fuentes, “Gaussian processes regressors for complex proper signals in digital communications,” in Proc. of IEEE SAM, 2014, pp. 137–140.
-  S. M. Kay, Modern spectral estimation : Theory and application. Englewood Cliffs, N.J. : Prentice Hall, 1988.
-  T. Gneiting, W. Kleiber, and M. Schlather, “Matérn cross-covariance functions for multivariate random fields,” Journal of the American Statistical Association, vol. 105, no. 491, pp. 1167–1177, 2010.
-  M. Abadi et al., “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, software available from tensorflow.org. [Online]. Available: http://tensorflow.org/
-  A. G. d. G. Matthews, M. van der Wilk, T. Nickson, K. Fujii, A. Boukouvalas, P. León-Villagrá, Z. Ghahramani, and J. Hensman, “GPflow: A Gaussian process library using TensorFlow,” 2016.
-  J. W. Pratt and J. D. Gibbons, Concepts of nonparametric theory. Springer Science & Business Media, 2012.
-  M. A. Álvarez, D. Luengo, M. K. Titsias, and N. D. Lawrence, “Efficient multioutput Gaussian processes through variational inducing kernels.” in AISTATS, vol. 9, 2010, pp. 25–32.
-  J. Hensman, N. Durrande, and A. Solin, “Variational Fourier features for Gaussian processes,” arXiv preprint arXiv:1611.06740, 2016.
-  F. Tobar, T. D. Bui, and R. E. Turner, “Design of covariance functions using inter-domain inducing variables,” in NIPS 2015 - Time Series Workshop, December 2015.