Structured Bayesian Gaussian process latent variable model


Steven Atkinson
Center for Informatics and Computational Science
University of Notre Dame
satkinso@nd.edu

Nicholas Zabaras
Center for Informatics and Computational Science
University of Notre Dame
nzabaras@nd.edu
zabaras.com
Abstract

We introduce a Bayesian Gaussian process latent variable model that explicitly captures spatial correlations in data using a parameterized spatial kernel and leveraging structure-exploiting algebra on the model covariance matrices for computational tractability. Inference is made tractable through a collapsed variational bound with similar computational complexity to that of the traditional Bayesian GP-LVM. Inference over partially-observed test cases is achieved by optimizing a “partially-collapsed” bound. Modeling high-dimensional time series systems is enabled through use of a dynamical GP latent variable prior. Examples imputing missing data on images and super-resolution imputation of missing video frames demonstrate the model.

 


Preprint. Work in progress.

1 Introduction

Probabilistic generative models are valuable for inferring structure in large sets of unlabeled data [1]. Bayesian generative models such as the Gaussian process latent variable model [2, 3, 1] and the related unsupervised deep Gaussian processes [4, 5] leverage the expressive, yet regularized flexibility of Gaussian processes within an unsupervised learning setting, producing low-dimensional nonlinear embeddings of data. Analytical tractability is achieved through the variational inference procedure first given by Titsias [6] and later extended by Titsias and Lawrence [1] for the case of unobserved inputs. The dynamical GP-LVM [7] constrains the learned latent variables through a Gaussian process prior parameterized by time points corresponding to the observed examples.

However, the generative models utilized by all of the above-mentioned approaches do not account for the possibility that the data being modeled exhibit correlations. From a modeling standpoint, selecting a prior that factorizes across dimensions is rather uninformative and adversely affects the model's sample efficiency. Instead, one might wish to encode knowledge about correlations within each high-dimensional observation through the model prior. Beyond considerations of sample efficiency, because no spatial information is incorporated explicitly into the model, one cannot use these approaches for inference at spatial locations not included in the training set.

At the same time, neural networks have made great progress as generative models in recent years via variational autoencoder architectures [8] and adversarial formulations [9]. A main focus of this work is sample efficiency within unsupervised learning; as of now, the above generally struggle to perform well with a limited supply of data. A reasonable means of addressing this challenge is carefully-chosen Bayesian regularization; however, we are still challenged to pose tractable, interpretable priors and methods for inference. While imposing Gaussian or Laplace priors over the model weights [10] is attractive because of its mathematical tractability [11], one may still question to what extent one can understand the implications of such priors on the resulting distribution over functions that the model inherits [12]. Indeed, merely interpreting workable neural network priors is a challenging task and is of considerable interest in its own right. In this work, our approach begins by defining the generative model's distribution over functions and then proceeds to derive a tractable means of inference.

The main contribution of this work is to extend the Bayesian GP-LVM by utilizing structure-exploiting algebra [13] to enable efficient inference of latent variables and a generative model with explicit spatial correlations modeled by a parameterized kernel function. We call this the structured Gaussian process latent variable model (SGPLVM). The learned generative model may be used to impute missing data in partially-observed test cases. Furthermore, the dynamical prior of Damianou et al. [7] is incorporated into the model to allow for interpolation of high-dimensional time series data. For tractable inference, we derive a collapsed variational bound that may be efficiently computed by exploiting Kronecker product structure, achieving the same computational complexity as a typical sparse Gaussian process model.

The model bears a resemblance to the linear model of coregionalization [14, 15]. However, our model extends this by allowing the inputs to be uncertain within the usual variational free energy approach first shown in [1]. Computational tractability is maintained by exploiting the structure of the covariance matrix, extending the structured GP methods first introduced by Saatçi [13]. Additionally, the modeled spatial correlations are expressed in terms of a familiar kernel function (see, e.g., [16]) whose parameterization is simple to optimize over, given high-dimensional observation data, and interpretable through its hyperparameters.

The rest of this paper is organized as follows. In Section 2, we define the structured GP-LVM, including its variational lower bound, predictive density, and algorithm for inference of latent variables. In Section 3, we apply our model to high-dimensional images with and without a dynamical prior and demonstrate its ability to impute missing data in several novel ways. Section 4 summarizes our work and discusses its implications.

1.1 Related work

Our model is based on the Bayesian Gaussian process latent variable model of Titsias and Lawrence [1]. Structured Gaussian process regression was first introduced by Saatçi [13], who considered inputs composed as a Cartesian product of one-dimensional inputs, though combinations of higher-dimensional inputs are also possible so long as the kernel is appropriately separable [17]. The insight to exploit Kronecker product structure in Gaussian process latent variable models was first shown in [18]. However, their variational formulation defines a posterior over the induced outputs and restricts its covariance matrix to possess Kronecker product structure; the variational parameters describing its mean and covariance must be learned through optimization (i.e., the bound is "uncollapsed" [7]). Here, we make no assumptions about the structure of the posterior covariance, instead deriving the analytical optimum and showing how the resulting "collapsed" variational bound can still be computed efficiently by exploiting eigendecomposition properties of Kronecker products. Inference over partially-observed data is still possible using a novel "mixed" bound combining collapsed and uncollapsed terms from the training and test data, respectively. We also make the connection with the dynamical GP-LVM of [7] for modeling temporally-correlated data.

2 Theory

In this section, we derive the structured GP-LVM (SGPLVM) model. Before proceeding, we quickly define some useful notation. Let A be a matrix with i-th row a_i, j-th column a_{:,j}, and (i, j) entry a_{ij}. The matrix A may be assembled from a set of vectors {a_i} as the rows of A.

2.1 Problem statement

Suppose we are given a set of data, each of which is observed simultaneously with channels at points in a -dimensional spatial domain. Represent these data as a matrix. The observations may also be dynamical in nature and indexed with time points. The spatial inputs are assembled as a matrix. Instead of modeling our observations as data that are -dimensional ("few data, many dimensions"), we may reshape them and model them as data that are -dimensional ("many data, few dimensions"). Suppose that each of the data are described by a latent variable; represent these as a matrix with a prior. The -dimensional inputs are defined as the Cartesian product of the spatial and latent variable inputs. (One can further impose structure within the spatial inputs.) We aim to learn a generative model

(1)

The model definition and training procedure are described in Sec. 2.2.1.
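To make the data layout concrete, the following sketch (in NumPy, with hypothetical sizes N, M, D, d_s, and d_h standing in for the number of examples, spatial points, channels, spatial dimensions, and latent dimensions) illustrates the reshape from N examples of dimension M·D into N·M rows of dimension D, and the Cartesian-product construction of the inputs; it is an illustration of the data layout only, not of the model itself.

    import numpy as np

    # Hypothetical sizes: N examples, M spatial points, D channels,
    # d_s spatial dimensions, d_h latent dimensions.
    N, M, D, d_s, d_h = 20, 100, 3, 2, 5

    Y_wide = np.random.randn(N, M * D)   # "few data, many dimensions"
    Y_tall = Y_wide.reshape(N * M, D)    # "many data, few dimensions" (channel-minor ordering assumed)

    X_s = np.random.rand(M, d_s)         # spatial inputs (e.g., pixel centers)
    H = np.random.randn(N, d_h)          # one latent variable per example

    # Cartesian product of latent and spatial inputs: row n*M + m pairs the
    # n-th latent variable with the m-th spatial location.
    X_full = np.hstack([np.repeat(H, M, axis=0), np.tile(X_s, (N, 1))])
    assert X_full.shape == (N * M, d_h + d_s)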

Given the learned model, we seek to carry out two types of predictions. First, given a test example observed at spatial points , we seek to infer the posterior of its latent variable, . Second, given and some (possibly different) test points in space , we wish to compute the predictive density for the corresponding outputs . These are addressed in Sec. 2.2.2.

2.2 Structured Bayesian Gaussian process latent variable model

In the following subsections, we explain how to train and make predictions with the structured Gaussian process latent variable model.

2.2.1 SGPLVM: model architecture and evidence lower bound

As mentioned in [18], inference for the generative model of Eq. (1) is challenging for two reasons. First, computing the marginal likelihood is analytically intractable. Second, the likelihood requires the inversion of a covariance matrix. Using the approach of Titsias and Lawrence, the model is augmented with inducing input-output pairs, assembled in matrices and ; we assume that these inducing pairs are modeled by the same generative process as the training inputs and latent outputs , allowing us to write the following joint probability:

(2)

where , , , and are kernel matrices evaluated on the training and inducing inputs. The conditional GP prior is given by

(3)

where the conditional mean and covariance take the usual forms from the projected process model [19, 16]:

(4)
(5)

Finally, we define the Gaussian variational posterior over the induced outputs

(6)

and pick a joint variational posterior that factorizes as

(7)

The form of will be discussed later.

Following the familiar variational approach [20, 1], we use Jensen's inequality to define the following lower bound on the logarithm of the model evidence:

(8)

After some manipulations, one arrives at the following uncollapsed lower bound:

(9)

where we have defined the kernel expectations

(10)

Using standard methods [20], one may find an analytic optimum for that maximizes Eq. (9) with respect to and :

(11)

where we have defined

(12)

for convenience. Substituting this into Eq. (9) and doing the required manipulations results in the “collapsed” lower bound:

(13)

where

(14)
(15)
(16)

and is a Cholesky decomposition. For modeling independent, identically-distributed data, we use the prior and variational posterior . For dynamical data indexed by times , we follow [21, 7] and use the GP prior and variational posterior , where is the covariance matrix formed on by a kernel and is a diagonal matrix.
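As a concrete illustration of the dynamical prior described above (a minimal sketch with an assumed squared-exponential kernel over time and hypothetical sizes; the kernel choice and symbols are ours, not prescribed by the text), each latent dimension receives a GP prior whose covariance matrix is built from the observation times, while the variational posterior carries a free mean and a diagonal covariance per dimension.

    import numpy as np

    def rbf_time(t1, t2, variance=1.0, lengthscale=0.2):
        # Squared-exponential kernel on scalar time points (an assumed choice).
        d = t1[:, None] - t2[None, :]
        return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

    N, d_h = 50, 3                            # hypothetical: N frames, d_h latent dimensions
    t = np.linspace(0.0, 1.0, N)              # observation times

    K_t = rbf_time(t, t) + 1e-6 * np.eye(N)   # prior covariance, shared across latent dimensions

    # Sample latent trajectories from the dynamical prior (one column per dimension).
    L = np.linalg.cholesky(K_t)
    H_sample = L @ np.random.randn(N, d_h)

    # Variational posterior per latent dimension: free mean plus diagonal covariance.
    mu_q = np.zeros((N, d_h))
    s_q_diag = np.ones((N, d_h))              # diagonal entries of each dimension's covariance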

To train the model, we maximize Eq. (13) over the variational parameters, the inducing inputs, and the model hyperparameters. However, a naïve implementation quickly becomes computationally intractable. Next, we show how the structure in the model inputs can be exploited so that the time and memory requirements associated with training become linear in the size of the data.

Similarly to [18], we impose structure on the inducing points: , where and . This implies that and .
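The key structural fact exploited here can be checked numerically: for a separable (product) kernel evaluated on a Cartesian product of spatial and latent inducing inputs, the full kernel matrix is exactly the Kronecker product of the factor kernel matrices. The sketch below uses assumed RBF factors and hypothetical sizes; it is a sanity check of the algebra, not the model's actual kernels.

    import numpy as np

    def rbf(A, B, lengthscale=1.0):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / lengthscale ** 2)

    M_x, M_h, d_s, d_h = 4, 3, 2, 5      # hypothetical inducing-point counts and dims
    Z_x = np.random.rand(M_x, d_s)       # spatial inducing inputs
    Z_h = np.random.randn(M_h, d_h)      # latent-space inducing inputs

    # Cartesian product of the latent and spatial inducing inputs.
    Z = np.hstack([np.repeat(Z_h, M_x, axis=0), np.tile(Z_x, (M_h, 1))])

    # Separable kernel: k((h, x), (h', x')) = k_h(h, h') * k_x(x, x').
    def k_sep(P, Q):
        return rbf(P[:, :d_h], Q[:, :d_h]) * rbf(P[:, d_h:], Q[:, d_h:])

    K_uu_direct = k_sep(Z, Z)
    K_uu_kron = np.kron(rbf(Z_h, Z_h), rbf(Z_x, Z_x))
    assert np.allclose(K_uu_direct, K_uu_kron)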

The statistics of Eq. (10) can then be written as

(17)

where , , and may be evaluated in the usual way (see [22]). The computational cost associated with evaluating the kernel expectations in Eq. (17) remains unchanged relative to the traditional Bayesian GP-LVM and still allows for parallelization with respect to latent variables of as originally discussed in [23, 24].
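To see informally why the kernel expectations inherit this structure (a hedged Monte Carlo illustration with assumed RBF factors, not the closed-form expressions of Eq. (17)): only the latent-variable factor of the separable kernel is uncertain, so its expectation under the variational posterior can be taken on its own and combined with the deterministic spatial factor.

    import numpy as np

    def rbf(A, B, lengthscale=1.0):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / lengthscale ** 2)

    rng = np.random.default_rng(0)
    M_x, M_h, d_s, d_h, S = 4, 3, 2, 5, 2000     # hypothetical sizes, S Monte Carlo samples
    X_s = rng.random((6, d_s))                   # spatial training inputs
    Z_x = rng.random((M_x, d_s))                 # spatial inducing inputs
    Z_h = rng.standard_normal((M_h, d_h))        # latent inducing inputs

    mu, var = rng.standard_normal(d_h), 0.1 * np.ones(d_h)   # q(h) for one example

    # The spatial factor of the cross-covariance is deterministic ...
    K_x_cross = rbf(X_s, Z_x)
    # ... while only the latent factor needs an expectation under q(h),
    # estimated here by simple Monte Carlo.
    h_samples = mu + np.sqrt(var) * rng.standard_normal((S, d_h))
    psi1_h = rbf(h_samples, Z_h).mean(axis=0)    # shape (M_h,)

    # Expected cross-covariance for this example factors as a Kronecker product
    # (ordering chosen to match the Cartesian-product layout used above).
    psi1_example = np.kron(psi1_h[None, :], K_x_cross)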

Finally, we show how the remaining terms of the variational bound may be computed efficiently without ever explicitly evaluating any of the full matrices in the bound. First, we note that and have Kronecker decompositions:

(18)
(19)

The eigendecomposition of is

(20)

so is orthogonal and is diagonal. These both also admit Kronecker decompositions:

(21)
(22)

The matrices on the right hand sides of Eqs. (21) and (22) are found by computing the eigendecomposition of and . It follows that

(23)

where, for notational convenience, we have defined . Note that is the sum of two diagonal matrices and thus contains nonzero entries; it is therefore at least as easy to store in memory as the data . We also note that expressing as in Eq. (23) makes the computation of in Eq. (13) straightforward. Next, we see that Eq. (12) can be rewritten as

(24)

and note that the matrix factors of

(25)

may be efficiently computed and stored since and are Kronecker products and the diagonal matrix can, of course, be inverted in an element-wise manner.
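A minimal numerical check of the identities used in this step (small random symmetric positive-definite matrices; the symbols are illustrative): the eigendecomposition of a Kronecker product is assembled from the factor eigendecompositions, so a matrix of the form K_1 ⊗ K_2 plus a diagonal has its eigendecomposition available in factored form. The sketch forms the full matrices only to verify the identities.

    import numpy as np

    rng = np.random.default_rng(1)

    def random_spd(n):
        A = rng.standard_normal((n, n))
        return A @ A.T + n * np.eye(n)

    K_1, K_2 = random_spd(4), random_spd(3)
    lam_1, Q_1 = np.linalg.eigh(K_1)
    lam_2, Q_2 = np.linalg.eigh(K_2)

    # Eigendecomposition of the Kronecker product from the factors.
    Q = np.kron(Q_1, Q_2)
    lam = np.kron(lam_1, lam_2)
    assert np.allclose(Q @ np.diag(lam) @ Q.T, np.kron(K_1, K_2))

    # Solve (K_1 ⊗ K_2 + sigma2 * I) v = b using only the factor decompositions.
    sigma2, b = 0.1, rng.standard_normal(12)
    v = Q @ ((Q.T @ b) / (lam + sigma2))
    assert np.allclose((np.kron(K_1, K_2) + sigma2 * np.eye(12)) @ v, b)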

Finally, the second trace term in the second line of Eq. (13) can be manipulated using Eq. (25) to obtain

(26)
(27)
(28)

The matrix is most efficiently computed by first evaluating the product , then multiplying against last. Computing this term takes time. Thus, we see that computing the bound is linear in the size of the training data in both time and memory.
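The linear-time and linear-memory claims rest on the standard primitive of multiplying a Kronecker product by a vector without constructing it, via the identity (A ⊗ B) vec(V) = vec(A V Bᵀ) with row-major vectorization; a small self-contained sketch:

    import numpy as np

    def kron_matvec(A, B, v):
        # Compute (A ⊗ B) v without forming the Kronecker product.
        m, n = A.shape
        p, q = B.shape
        V = v.reshape(n, q)            # "unvec" consistent with np.kron's row-major blocks
        return (A @ (V @ B.T)).reshape(m * p)

    rng = np.random.default_rng(2)
    A, B = rng.standard_normal((5, 4)), rng.standard_normal((3, 2))
    v = rng.standard_normal(4 * 2)
    assert np.allclose(kron_matvec(A, B, v), np.kron(A, B) @ v)

Each such product costs O(np(m + q)) operations instead of the O(mnpq) needed to apply the full matrix, which is the mechanism behind the linear scaling noted above.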

2.2.2 SGPLVM: predictions

Forward predictive density

Given test inputs , the predictive density follows from our assumption that the inducing points are sufficient statistics of the training examples [7]:

(29)
(30)
(31)

where is the covariance matrix computed on the test inputs and is the cross-covariance matrix between test and training inputs. Noticing that , , and substituting Eq. (25) into Eqs. (30) and (31), we obtain

(32)
(33)

where denotes the Schur (element-wise) product. These may be efficiently computed by exploiting Kronecker products and the fact that is diagonal. Importantly, we are able to use this model to predict at spatiotemporal resolutions different from that of our training data if desired. This is not possible without the parameterized kernel that captures spatiotemporal correlations in our model. The full covariance of Eq. (31) can be computed in time and memory; details are given in the supplementary material.
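To illustrate concretely why prediction at a new spatial resolution is possible (a sketch with an assumed RBF spatial kernel and hypothetical quantities standing in for the learned ones; it is not the full predictive computation): the cross-covariance between test and inducing inputs is simply the learned spatial kernel evaluated on the new, denser grid, combined with the latent-variable factor through a Kronecker product.

    import numpy as np

    def rbf(A, B, lengthscale=0.2):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / lengthscale ** 2)

    rng = np.random.default_rng(3)
    Z_x = rng.random((16, 2))               # hypothetical spatial inducing inputs
    Z_h = rng.standard_normal((8, 5))       # hypothetical latent inducing inputs
    h_star = rng.standard_normal((1, 5))    # (mean of) a test latent variable

    # Training images lived on a coarse grid; predict on a finer 40x40 grid.
    g = np.linspace(0.0, 1.0, 40)
    X_fine = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)

    # Cross-covariance between test and inducing inputs factors as a Kronecker
    # product of the latent and spatial cross-covariances.
    K_star_u = np.kron(rbf(h_star, Z_h), rbf(X_fine, Z_x))   # shape (1600, 128)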

If we have a Gaussian posterior , then we can readily compute

(34)

where . However, unlike in [1], the marginal covariance of the predictive density is intractable due to the modeled spatial correlations. One can approximate the density as a mixture of Gaussians by taking samples from and computing the Gaussian of Eq. (29) for each sample of .

For the dynamical model, the latent variable posterior at time can be computed as [7]

(35)

where is the cross-covariance vector computed on and .

Inference of latent variables

Here, we explain how the SGPLVM can be used to infer the variational posterior over the latent variables associated with test observations held out from the training set. Given some test observation observed at spatial points , we would like to infer the posterior . To do this, we optimize the augmented bound

(36)

where

(37)

and and are given by Eq. (11). Our approach may be seen as combining the "collapsed" bound of [1] to compute and the "uncollapsed" bound used in [18] for . Note that, since the test data are not included in , is no longer optimal, but the benefit of computational tractability greatly outweighs this drawback in practice. Equation (37) may be efficiently computed by exploiting Kronecker structure; details are given in the supplementary material. To infer , one optimizes over its variational parameters. In the interest of computational efficiency, we keep the variational parameters and kernel hyperparameters learned at training time constant.
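Schematically, the test-time procedure amounts to a small optimization over the variational parameters of the test latent variable with everything learned at training time held fixed. In the sketch below, partially_collapsed_bound is a placeholder standing in for Eq. (36) (its body is a dummy), the posterior is parameterized by a mean and a log-variance per latent dimension, and gradients are obtained by finite differences for simplicity.

    import numpy as np
    from scipy.optimize import minimize

    def partially_collapsed_bound(params, y_star_obs, fixed_model):
        # Placeholder for Eq. (36): collapsed training terms plus uncollapsed
        # terms for the partially-observed test case. Dummy value for illustration.
        mu, log_var = np.split(params, 2)
        return -np.sum(mu ** 2) - np.sum(np.exp(log_var))

    d_h = 5
    x0 = np.concatenate([np.zeros(d_h), np.log(0.1) * np.ones(d_h)])

    # Maximize the bound (minimize its negative); hyperparameters stay fixed.
    res = minimize(lambda p: -partially_collapsed_bound(p, None, None), x0,
                   method="L-BFGS-B", options={"maxiter": 100})
    mu_star, var_star = res.x[:d_h], np.exp(res.x[d_h:])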

3 Examples

3.1 Reconstruction of missing data

In this example, we demonstrate the SGPLVM's ability to reconstruct missing data from an image using the "Frey Faces" data set [25, 6], which contains grayscale face images; each pixel takes on a scalar value. (When modeling the data, we rescale them to have zero mean and unit variance; however, the results are reported in terms of the raw pixel values.) We randomly split the data set into training and test images and consider two training set sizes.

As described in Sec. 2.1, the training data are assembled into the training data matrix , where and since the images are grayscale. The centers of the pixels are assigned two-dimensional spatial inputs and assembled into the spatial input matrix (). The SGPLVM model has latent variable inducing inputs, spatial inducing inputs, and latent dimensions. Gaussian and Matérn kernels are used for the latent variable and spatial inputs, respectively.

At test time, we randomly remove 50% of the pixels in each held-out image. We impute them by inferring the test latent variable posterior following Sec. 2.2.2, then approximating the forward predictive density as a Gaussian with mean given by Eq. (34) and covariance obtained from a mixture of Gaussians as described in Sec. 2.2.2. We condition this Gaussian on the observed pixels in the test observation to repair some of the reconstruction error; this is only possible because of the spatial correlations modeled by the SGPLVM. The prediction accuracy is measured for each image by the root mean square error (RMSE) and median negative log probability (MNLP) over the imputed pixels. We compare our model to the Bayesian GP-LVM on the same data set (obtained by replacing the Matérn 3/2 kernel with a white kernel) as well as to a vanilla across-image GP regression with a Matérn 3/2 kernel trained on each image individually.
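The repair step is standard Gaussian conditioning of the predicted image on its observed pixels; a self-contained sketch with hypothetical predicted moments (the moments here are random stand-ins, not model output):

    import numpy as np

    rng = np.random.default_rng(5)
    P = 12                                        # hypothetical number of pixels
    mean = rng.standard_normal(P)                 # predicted mean over all pixels
    A = rng.standard_normal((P, P))
    cov = A @ A.T + 0.1 * np.eye(P)               # predicted covariance over all pixels

    obs = np.arange(P) % 2 == 0                   # mask of observed pixels
    y_obs = rng.standard_normal(obs.sum())        # observed pixel values

    # Condition the joint Gaussian on the observed pixels.
    K_oo = cov[np.ix_(obs, obs)]
    K_mo = cov[np.ix_(~obs, obs)]
    cond_mean = mean[~obs] + K_mo @ np.linalg.solve(K_oo, y_obs - mean[obs])
    cond_cov = cov[np.ix_(~obs, ~obs)] - K_mo @ np.linalg.solve(K_oo, K_mo.T)

The imputed values for the missing pixels are then read off from cond_mean, with cond_cov quantifying the remaining uncertainty.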

For , the SGPLVM attains a mean RMSE (th percentiles) of (, ) and MNLP of (, ); the Bayesian GP-LVM attains an RMSE of (, ) and MNLP of (, ). For , the SGPLVM attains a mean RMSE of (, ) and MNLP of (, ); the Bayesian GP-LVM attains an RMSE of (, ) and MNLP of (, ). The across-image GP obtains an RMSE of (, ) and MNLP of (, ). Our results imply that the modeled spatial correlations are helpful particularly when data examples are limited, even though the learned length scales are on the order of a pixel. However, the across-image GP’s performance shows that the latent variable modeling is essential for obtaining good predictions. This is expected, given the small size of the images. Figure 1 shows a few examples of the reconstructions given by the SGPLVM model.

Figure 1: (Reconstruction) Example reconstructions using a SGPLVM with . Top row: provided partially-observed realization. Middle: Reconstructed observation. Bottom: Ground truth.

3.2 Generation of high-resolution video

In this example, we explore the use of the SGPLVM with a dynamical prior to train on a low-resolution video sequence and impute missing frames at a higher resolution without ever showing the model any high-resolution examples. The video we use is available at https://pixabay.com/en/videos/eye-face-human-male-man-person-4376/ under the Creative Commons CC0 license and is also included in the supplementary material. We train our model on the video with resolution and test our predictions against the HD video. Pixels in the low-resolution video are assigned spatial locations at the center of the clusters of pixels in the high-resolution video. The video contains frames. We randomly select frames as training data and predict on the remaining frames. We train a dynamical SGPLVM model with latent variable inducing inputs, spatial inducing inputs, and latent dimensions. Thus, the SGPLVM models data. The spatial input structure is exploited by defining , where and . The outputs are RGB triplets; thus, . Predictions with the dynamical SGPLVM are compared against a structured GP regression model which predicts the RGB triplets as a function of time and spatial location (3 input dimensions). Prediction accuracy on each frame is measured via RMSE and MNLP. For the dynamical SGPLVM, we obtain an RMSE of (, ) and an MNLP of (, ); for the structured GP, we obtain an RMSE of (, ) and MNLP of (, ). Note that data are pre-processed to have zero mean and unit variance on each channel. Thus, we find that the latent variable modeling of the frames enables us to get substantially better predictive accuracy. Figure 2 shows the time-dependent latent variable posterior means, the latent dimensions’ inverse length scales, and a comparison between a frame from the HD video and the corresponding generated frame from the SGPLVM.

(a) Latent variables
(b) Inverse length scales
(c) (Left) Example SGPLVM-generated frame and (right) HD ground truth.
Figure 2: (HD Video) (a) SGPLVM latent variables as a function of time; each curve is a separate latent dimension and markers indicate observed frames. (b) Inverse length scales learned by the SGPLVM model. (c) An example generated image and the corresponding ground truth. See the supplementary material for full-resolution pictures.

4 Conclusions and Discussion

In this work, we derived a structured Gaussian process latent variable model, extending the Bayesian GP-LVM, that explicitly captures spatial correlations in high-dimensional observation data. Computational tractability is maintained by identifying Kronecker product structure in the model kernel matrices. The modeled spatiotemporal correlations are expressed in terms of interpretable parameterized kernel functions, allowing one to use the generative model at a higher resolution than that of the training data.

We envision several extensions to our work. First, the SGPLVM might be incorporated as a structured layer within a deep Gaussian process [4] with additional hidden layers preceding or following it. Second, we would like to explore the use of the SGPLVM as a data-driven surrogate model for modeling stochastic partial differential equations with high-dimensional nonstationary outputs, and as a data-driven high-dimensional stochastic input model (i.e. for dimensionality reduction).

Acknowledgments

This work was supported by the University of Notre Dame, the Center for Research Computing (CRC) and the Center for Informatics and Computational Science (CICS). Early developments were supported by the Computer Science and Mathematics Division of ORNL under the DARPA EQUiPS program. N.Z. thanks the Technische Universität München, Institute for Advanced Study for support through a Hans Fisher Senior Fellowship, funded by the German Excellence Initiative and the European Union Seventh Framework Programme under Grant Agreement No. 291763.

References
