Bayesian Alignments of Warped Multi-Output Gaussian Processes
Bayesian Alignments of Warped Multi-Output Gaussian Processes
We present a Bayesian extension to convolution processes which defines a representation between multiple functions by an embedding in a shared latent space. The proposed model allows for both arbitrary alignments of the inputs and and also non-parametric output warpings to transform the observations. This gives rise to multiple deep Gaussian process models connected via latent generating processes. We derive an efficient variational approximation based on nested variational compression and show how the model can be used to extract shared information between dependent time series, recovering an interpretable functional decomposition of the learning problem.
zotero_export.bib \addbibresourceadditional.bib \pdfstringdefDisableCommands\pdfstringdefDisableCommands
Gaussian processes (GP) (Rasmussen:2005te) are flexible yet informative prior that allows to provide structure over the space of functions allowing us to learn from small amounts of data. Due to the infinte support of the Gaussian distribution a GP is a general function approximator. However for many types of problems it is challenging to use all our existing knowledge when considering only a single function as our knowledge is often comprised of information related to a single level of an hierarchical or a composite function. To still be able to use the benefits of a GP the priors can be composed hierarchically in what is known as a \citetitledamianou_deep_2012 (damianou_deep_2012). Importantly a hierarchical model does not provide additional representational power, it rather the opposite, with each addition of layers it allows us to specify an ever more restrictive priors providing additional structure on our solution space allowing us to learn from smaller amounts of data.
In a traditional GP setting the outputs are considered conditionally independent given the input which significantly reduces the computational cost. In many scenarios we want to be able to directly parametrise the interdependencies between the dimensions. Specific modelling of these allows us to infer dimensions when only partial observations exists. One approach is to consider a linear dependency (alvarez_kernels_2011) while a more general approach would be to consider full convolutions as in (boyle_dependent_2004).
The fundamental assumption underlying a GP is the assumption that the instansiations of the function is jointly Gaussian. Due to the marginalisation property of a Gaussian on one end this is very beneficial as it leads to the simple and elegant inference associated with the models but it can also be a challenge as it is a rather restrictive setting. One approach to proceed is to first map, or warp the data closer to the Gaussian assumption which is referred to as \citetitleNIPS2003_2481 (NIPS2003_2481; lazaro-gredilla_bayesian_2012).
In this paper we marry the benefits of all these approaches and propose a hierarchical, warped and multi-output Gaussian process. We derive an efficient learning scheme by an approximation to the marginal likelihood which allows us to fully exploit the regularisation provided by our structure. The model we propose is highly interpretable, able to learn from small amounts of data and generalises to a large range of different problem settings.
2 Model Definition
We are interested in formulating shared priors over multiple functions using Gaussian processes (GPs). There are multiple ways of formulating such priors besides assuming independence of the different and placing a separate GP prior on each of them. In a common approach known as the linear model of coregionalization (LMC) (alvarez_kernels_2011) the different functions are assumed to be linear combinations of one or more latent process on which independent GP priors are placed. Different functions in samples drawn from such models behave very similarly since for every , the are given by linear combinations of the (latent) point observations ,
where the are constant coefficients.
We seek to relax these assumptions in multiple ways. First, instead of considering point estimates we focus on convolution processes (CPs) as proposed by \textciteboyle_dependent_2004. In the CP framework, the output functions are the result of a convolution of the latent processes with smoothing kernel functions for each output and latent process ,
CPs are a generalization of the LMC which can be recovered when the are Dirac delta functions scaled by and, similar to the LMC, the different output functions are conditionally independent given the latent processes . If the are distributed according to a GP and the different smoothing kernels are finite energy kernels (i.e., ), the also follow a GP distribution. Since the smoothing kernels can be chosen independently, the CP model allows for much more loosely coupled functions .
In the standard CP model, for every observation the convolutions of the latent processes are all performed around the same point . This implies that the relative covariance structure of the different functions can only be influenced via specific choices of (possibly nonstationary) smoothing kernels. This can, for example, be used to model constant time lag in time-series observations (boyle_dependent_2004). Our second relaxation of the original model assumptions is that we allow more general alignments of the observed functions which depend on the position in the input space. Every function is augmented with an alignment function on which we place independent GP priors. These functions are used to move the center of convolution based on the input and allow for nonlinear and not necessarily monotonous alignments treated in a Bayesian manner.
Finally, we assume that the dependent functions are themselves latent and the data we observe is generated via independent noisy nonlinear transformations of their values. Similar to the construction of Bayesian warped GPs introduced by \textcitelazaro-gredilla_bayesian_2012 we also place independent GP priors on these warpings associated with their respective .
For notational simplicity, we assume that the observations for every output are evaluated at the same inputs . This can easily be generalized to different input sets for every output. We call the observations associated with the -th function and use the stacked vector to collect the data of all outputs. The final model is then given by
where and are the respective alignment and warping functions and is a noise term. Because we assume independence between the two functions across outputs, we use Gaussian process priors of the form
where denotes the identity function. By setting the prior mean to the identity function, the default assumption of this model is to fall back to the standard CP model. During learning, the model can choose the different and in a way to reveal the independent shared latent processes on which we also place Gaussian process priors
Under this prior, the are also Gaussian processes with zero mean and
see for example (alvarez_sparse_2009; alvarez_kernels_2011).
Similar to (boyle_dependent_2004), we assume the latent processes to be independent white Gaussian noise processes, that is, we assume that . Then, the covariance term above simplifies to
boyle_dependent_2004 showed that when using the -dimensional squared exponential kernel
as the smoothing kernel function for all , the integral has a closed form solution. With denoting the set of kernel hyperparameters associated with , this solution is given by
with componentwise vector-operations and
Because there is no covariance structure in the latent processes, their values can be marginalized analytically yielding the explicit formulation of the covariance function for the different outputs above. Their cross-covariance-terms closely resemble the original RBF kernel.
Figures 0(b) and 0(a) show different interpretations of the resulting full graphical model. Figure 0(a) emphasizes the generative nature of the model, where different output functions are generated using the shared latent processes . In order to allow for more flexibility, we added the alignment functions and the warpings . The alignment function (which we assume to be close to the identity function) models non-stationary local shifts between the different output functions and the warpings allow for the output functions to live on different scales and topologies, removing the constraint that the function must be linear combinations of the convolutions. This model can be interpreted as a shared and warped latent variable model with a very specific prior: The indices are part of the prior for the latent space and specify a sense of order for the different data points which is augmented with uncertainty by the alignment functions. Using this order, the convolution processes enforce the covariance structure for the different datapoints specified by the smoothing kernels.
In contrast, figure 0(b) shows that the presented model can also be interpreted as a group of deep GPs with a layer with shared information between the different functions, that is, it shows a transfer learning setting. The shared knowledge is represented in the latent processes which serve as communication points between the conditionally independent deep GPs. In this setting, the functions and can be interpreted as non-linear embeddings which allow the model to represent the shared information in a more convenient space, without a semantic interpretation of alignments or warpings. Note that neither the index set nor the observations need to live in the same space since all embeddings are independent and only the latent functions must have compatible (co-)domains.
3 Variational Approximation
Since exact inference in the model defined above is intractable, we discuss a variational approximation to the model’s true marginal likelihood in this section and discuss approximate predictions. Analogously to , we denote the random vectors of size which contain the function values of the respective functions and outputs as and . The joint probability distribution of the data can then be written as
Here, we use to refer to the Gram matrix corresponding to the kernel of the respective GP. Everything but the convolution processes, for which the kernel matrices have size , factorize over both the different levels of the model as well as the different outputs, leading to the kernel matrices of size . Since all but the likelihood terms are Gaussian processes, the model can be interpreted as a specific deep Gaussian process, for which exact inference is intractable, since the uncertainties in the inner GPs have to be propagated through the outer ones in the function composition.
damianou_deep_2012 applied variational compression (titsias_variational_2009) jointly to complete deep GP models to perform approximate inference. \Textcitehensman_nested_2014 proposed nested variational compression in which every GP in the hierarchy is handled independently. While this forces a variational approximation of all intermediate outputs of the stacked processes, it has the appealing property that it allows optimization via stochastic gradient descent (hensman_gaussian_2013) and the variational approximation can after training be used to approximate predictions for new data points without considering the original training data. We will use the second property in this paper to show that after training, predictions of the form can be calculated without having to consider any of the other outputs besides in the variational approximation of the shared GP .
3.1 Augmented Model
The sparse GP approaches mentioned above all focus on augmenting a full GP model by introducing sets of inducing variables with their inputs . Usually, the number of inducing variables is chosen such that . Those variables are assumed to be latent observations of the same functions and are thus jointly Gaussian with the observed data. For example, the augmentation of with and yields the joint density
where, after dropping the indices for clarity, denotes the function values without noise and we write the Gram matrices as . Also dropping the explicit conditioning on the inputs and (and possible kernel parameters ), this joint Gaussian distribution can be rewritten using its marginals (titsias_variational_2009) as
While the original model in equation 11 can be recovered exactly by marginalizing the inducing variables of the augmented model, considering a specific variational approximation of the joint will yield the desired lower bound in the next subsection. We introduce such inducing variables for every GP in the model, yielding the set for the complete model. Note that for the convolution process , we introduce one set of inducing variables per output . These inducing variables will always be considered together and play a crucial role in sharing information between the different outputs.
3.2 Variational Lower Bound
To derive the desired variational lower bound for the log marginal likelihood of the complete model, multiple steps are necessary. First, we will consider the innermost Gaussian processes describing the alignment functions. We derive the SVGP, a lower bound for this model part which can be calculated efficiently and can be used for stochastic optimization, first introduced by \textcitehensman_gaussian_2013. In order to apply this bound recursively, we will both show how to propagate the uncertainty through the subsequent layers and and how to avoid the inter-layer cross-dependencies using another variational approximation as presented by \textcitehensman_nested_2014. While hensman_nested_2014 considered standard deep GP models, we will show how to apply their results to convolution processes.
The First Layer
Since the inputs are fully known, we do not need to propagate uncertainty through the Gaussian processes . Instead, the uncertainty about the comes from the uncertainty about the correct functions and is introduced by the processes themselves, analogously to standard GP regression. To derive a variational lower bound on the marginal log likelihood of , we assume a variational distribution approximating and additionally assume that . After dropping the indices for clarity again, using Jensen’s inequality we get
where denotes the expected value with respect to the distribution and denotes the KL divergence, which, since both and are Gaussian, can be evaluated analytically.
To bound the required expectation, we use Jensen’s inequality again together with equation 13 which gives
We apply this bound to the expectation to get
Resubstituting this result into equation 14 yields the final bound
This bound, which depends on the hyperparameters of the kernel and likelihood and the variational parameters , can be calculated in time. Compared to the variational bound of \textcitetitsias_variational_2009, which is obtained by marginalizing the inducing points in equation 15, this bound factorizes along the data points which enables stochastic optimization.
In order to obtain a bound on the full model, we want to apply the same techniques to the other Gaussian processes. Since the different alignment processes are assumed to be independent we have
where every term can be approximated using the bound in equation 18. However, for all subsequent layers, the bound is not directly applicable, since the inputs are no longer known but instead are given by the outputs of the previous process. It is therefore neccesary to propagate their uncertainty and also handle the interdependencies between the layers introduced by the latent function values , and .
The Second and Third Layer
Our next goal is to derive a bound on the outputs of the second layer
that is, an expression in which the uncertainty about the different and the cross-layer dependencies on the are both marginalized. While on the first layer, the different are conditionally independent, the second layer explicitly models the cross-covariances between the different outputs via convolutions over the shared latent processes . We will therefore need to handle all of the different , together denoted as , at the same time.
where we write to incorporate the Gaussian noise in the latent space. In order to ensure cancellation we assume that
and use another variational approximation to marginalize . This yields
where we apply Fubini’s theorem to exchange the order of integration in the expected values. The expectations with respect to above involve expectations of kernel matrices, also called psi-statistics, in the same way as \textcitedamianou_deep_2012,hensman_nested_2014 and are given by
These psi-statistics can be computed analytically for multiple kernels, including the squared exponential kernel in equation 8 or the linear kernel. In our case, the implicit kernel of the multi output convolution process is governed by the choice of latent processes and the smoothing kernel functions . In section 3.3 we show closed-form solutions for the psi-statistics assuming white noise latent processes and squared exponential smoothing kernels. To obtain the final formulation of the desired bound for we substitute equation 24 into equation 23 and get the analytically tractable bound
The uncertainties in the first layer have been propagated variationally to the second layer. Besides the regularization terms, is a Gaussian distribution.
Because of their cross dependencies, the different outputs are considered in a common bound and do not factorize along dimensions. The third layer warpings however are conditionally independent given and can therefore be considered separately. In order to derive a bound for we apply the same steps as described above, resulting in the final bound:
This bound on the multi output marginal likelihood is parameterised by the model’s hyperparameters in the kernels and the variational parameters, the position of the pseudo inputs in the different layers and the variational distributions of their respective function values. Similar to SVGP (hensman_scalable_2014), every part of the bound factorizes along the data, allowing for stochastic optimization methods.
3.3 Convolution Kernel Expectations
In order to evaluate the bound in equation 26, we need to be able to compute certain expectations for the kernel matrices, the psi-statistics shown in equation 24. Our model uses two types of kernels, the explicitly defined squared exponential kernel and the kernel implicitly defined via the convolution process which depends on the choice of latent processes and smoothing kernel functions . In section 2 we assumed the latent processes to be Gaussian white noise and the smoothing kernel functions to be squared exponential kernels, leading to an explicit closed form formulation for the covariance between outputs shown in equation 9. This kernel can be understood as a generalization of the squared exponential kernel, for which a closed form for the psi-statistics can be found (titsias_bayesian_2010). In this section, we derive the psi-statistics for the generalized version.
The uncertainty about the first layer is captured by the variational distribution of the latent alignments given by
Every aligned point in corresponds to one output of and ultimately to one of the . Since the closed form of the multi output kernel depends on the choice of outputs, we will use the notation to denote such that is associated with output . As shown in section 3.2, nested variational compression propagates these uncertainties through the model via Gaussian messages — like — passing through the GPs, making it neccesary to consider expectations of kernel matrices and their products.
For notational simplicity, we only consider the case of one single latent process , that is, we assume . Since the latent processes are independent, the results can easily be generalized to multiple processes. We first consider given by
Similar to the notation , we use the notation to mean the variance term associated with the covariance function . The expectation connecting the alignments and the pseudo inputs is given by
where is the combined lengthscale corresponding to the same kernel as . Lastly, connects alignments and pairs of pseudo inputs with the closed form
Note that the psi-statistics factorize along the data and we only need to consider the diagonal entries of . If all the data belong to the same output, that is, for all and the parameters and are equal, the psi-statistics of the standard squared exponential kernel can be recovered as a special case. This special case is used to propagate the uncertainty through the third layer of conditional independent GPs, the output-specific warpings .
3.4 Approximative Predictions
Using the variational lower bound in equation 26, our model can be fitted to data, resulting in appropriate choices of the kernel hyperparameters and the variational parameters and corresponding distributions for the alignments , the shared layer and the output warpings . Now assume we want to predict approximate function values for previously unseen points associated with output , which are given by .
Because of the conditional independence assumptions in the model, other outputs only have to be considered in the shared layer . In this shared layer, the believe about the different outputs, the shared information and — implicitly — the latent processes is captured by the variational distribution . Given this variational distribution, the different outputs are conditionally independent of one another and thus, predictions for a single dimension in our model are equivalent to predictions in a single independent deep GP with nested variational compression (hensman_nested_2014).
4 Time Series Alignments
In this section we show how to apply the proposed model to the task of finding common structure in time series observations. In this setting, we observe multiple time series as pairs of time indicies and their corresponding noisy observations. To model the common structure, we assume that there exist latent time series which inform the observations. These types of dependencies between time series can for example be found in measurements of physical dynamic systems, where two sensors measure the same latent effect. Depending on the positioning of the sensors and the propagation properties of this latent effect through the system, the sensors’ measurements will usually not be aligned in time. Instead, the relative timings of the sensors may be influenced by the system’s dynamics and therefore be variable over time.
The goal of our model in this case is to recover both a representation of the shared latent information and the relative alignments of the time series, both of which give high-level insight into the physical system. The relevant part of the system’s structure is recovered purely from the data, without the need for a parametric physical model or numeric simulations. Starting with the different misaligned time indices , the model’s first layer’s task is to find a latent space of relative time indices such that the observed time series are aligned both with each other and the latent informing processes . The shared information is applied to the different time series via the convolution process . Note that for any , the covariance carries semantic value with respect to the original physical system, since it describes the amount of shared information between the sensors given they observe the same state of the system with respect to the latent effect. The variables can be thought of as the (aligned) latent effect observed by the different sensors. The last layer in the model, the conditionally independent processes , now models the interaction of this latent effect with the corresponding sensor and produces the actual measurements which are then observed noisily in the outputs .
In our toy example presented here, we define a system of dependent time series by first specifying a shared latent function generating the observations. We can then independently choose relative alignments and warpings for the different time series we want to generate. Note that we can always choose one alignment and one warping to be the identity function in order to constrain the shared latent spaces and and provide a reference the other alignments and warpings will be relative to. Our example consists of two time series and generated by a dampened sine function. We choose the alignment of and the warping of to be the identity in order to prevent us from directly observing the latent function and apply a sigmoid warping to . The alignment of is selected to be a quadratic function which is very hard to recover using standard methods from signal processing. Figure 2 shows a visualization of this decomposed system of dependent time series. To obtain training data we uniformly sampled 100 points from both the first and second time series with added Gaussian noise. We subsequently removed parts of the training sets to explore the generalization behaviour of our model. The complete training sets are shown in figure 3.
We use this setup to train our model using squared exponential kernels both in the conditionally independent GPs and and as smoothing kernels in . Since we assume our toy data simulates a physical system, we apply the prior knowledge that the alignment and warping processes have ?slower? dynamics compared to the shared latent function. In other words, we emphasize that most of the observed dynamics are introduced into the system via the shared latent effect. To model this we applied priors to the and which prefer longer lengthscales and smaller variances compared to the shared layer . Otherwise, the model could find local minima in which it, for example, ignores the upper two layers of the model and explains the two time series independently using only the third layer . Additionally, assuming identity functions as mean functions for the different Gaussian processes prevents pathological cases in which the complete model collapses to a constant function (salimbeni_doubly_2017).
The model’s joint predictions are shown in figure 3 and the recovered factorization of the system can be seen in figure 4. During learning, the model was able to recover a good representation of the original decomposition, that is, it recovered a shared latent dampened sine function, a sigmoid warping for the first time series and an approximate quadratic alignment function for the second time series. As a consequence, it is able to predict that the second time series must describe a full (quadratically sampled) sine wavelength in the interval , where no training data for is available. An independently trained GP, as seen in figure 5, reverts to the prior in this interval, yielding a much less informative model.
Inspecting the model’s recovered decomposition, it can be seen that it learns almost no uncertainty about the quadratic alignment of the second time series although there is no data available. Indeed, the model in general adds uncertainty to predictions as late in the hierarchy as possible. Intuitively, the model is highly incentivised to do this because in general, uncertainty introduced into a prediction is very hard to get rid of throughout the hierarchy. If the model is uncertain about the alignments, it will most likely be forced to be more uncertain about the shared function, which in the end penalizes the bound in equation 26 via the fit term for observed date in the first time series more than an overly confident model close to the prior. Also note that if the model were uncertain about the correct alignment, its pointwise predictions would be much closer to the independently trained GP since uncertainty about the alignment means uncertainty about the placement of the minima and maxima of the sine wave and therefore, the points in the interval could take a wide range of values. However, every sample from the model would still describe a more informative sine function compared to the marginalized predictions which revert to the prior.
We have presented how Gaussian processes with multiple outputs can be embedded in a hierarchy to find shared structure in latent spaces. We extended convolution processes (boyle_dependent_2004) with conditionally independent Gaussian processes on both the input and output sides, giving rise to a highly structured deep GP model. By applying nested variational compression (hensman_nested_2014) to inference in these models, we derived a variational lower bound which combines approximate Bayesian treatment of all parts of the model with scalability via factorization over data points.
One application of this work is to find alignments between time series generated by a shared source of information, like sensors in a physical system. We show empirical evidence that the model is able to recover the latent information, non-linear misalignments and sensor specific warpings. The model yields a human-interpretable decomposition of functions which can offer insight about the underlying system. Formulating priors and inference methods which capture and implement knowledge of domain experts for physical systems in this model is an interesting research direction for later work.