Structured Bayesian Gaussian process latent variable model
Abstract
We introduce a Bayesian Gaussian process latent variable model that explicitly captures spatial correlations in data using a parameterized spatial kernel, leveraging structure-exploiting algebra on the model covariance matrices for computational tractability. Inference is made tractable through a collapsed variational bound with computational complexity similar to that of the traditional Bayesian GPLVM. Inference over partially-observed test cases is achieved by optimizing a "partially-collapsed" bound. Modeling high-dimensional time series systems is enabled through use of a dynamical GP latent variable prior. Examples imputing missing data on images and super-resolution imputation of missing video frames demonstrate the model.
Steven Atkinson (Center for Informatics and Computational Science, University of Notre Dame, satkinso@nd.edu) and Nicholas Zabaras (Center for Informatics and Computational Science, University of Notre Dame, nzabaras@nd.edu, zabaras.com)
Preprint. Work in progress.
1 Introduction
Probabilistic generative models are valuable for inferring structure in large sets of unlabeled data [1]. Bayesian generative models such as the Gaussian process latent variable model (GPLVM) [2, 3, 1] and the related unsupervised deep Gaussian processes [4, 5] leverage the expressive, yet regularized flexibility of Gaussian processes within an unsupervised learning setting, producing low-dimensional nonlinear embeddings of data. Analytical tractability is achieved through the variational inference procedure first given by Titsias [6] and later extended by Titsias and Lawrence [1] for the case of unobserved inputs. The dynamical GPLVM [7] constrains the learned latent variables through a Gaussian process prior parameterized by the time points corresponding to the observed examples.
However, the generative models utilized by all of the approaches mentioned above do not account for the possibility that the data being modeled exhibit spatial correlations. From a modeling standpoint, selecting a prior that factorizes across dimensions is rather uninformative and adversely affects the model's sample efficiency. Instead, one might wish to encode knowledge about correlations within each high-dimensional observation through the model prior. Beyond considerations of sample efficiency, because no spatial information is incorporated explicitly into the model, one cannot use these approaches for inference at spatial locations not included in the training set.
At the same time, neural networks have made great progress as generative models in recent years via variational autoencoder architectures [8] and adversarial formulations [9]. A main focus of this work is sample efficiency within unsupervised learning; as of now, the above generally struggle to perform well with a limited supply of data. A reasonable means of addressing this challenge is carefully-chosen Bayesian regularization; however, we are still challenged to pose tractable, interpretable priors and methods for inference. While imposing Gaussian or Laplace priors over the model weights [10] is attractive because of its mathematical tractability [11], one may still question to what extent one can understand the implications of such priors on the resulting distribution over functions that the model inherits [12]. Indeed, merely interpreting workable neural network priors is a challenging task and is of considerable interest in its own right. In this work, our approach begins by defining the generative model's distribution over functions and proceeds to derive a tractable means of inference.
The main contribution of this work is to extend the Bayesian GPLVM by utilizing structure-exploiting algebra [13] to enable efficient inference of latent variables and a generative model with explicit spatial correlations modeled by a parameterized kernel function. We call this the structured Gaussian process latent variable model (SGPLVM). The learned generative model may be used for data imputation on partially-observed test data. Furthermore, the dynamical prior of Damianou et al. [7] is also incorporated into the model to allow for interpolation on high-dimensional time series data. For tractable inference, we derive a collapsed variational bound that may be efficiently computed by exploiting Kronecker product structure, achieving the same computational complexity as a typical sparse Gaussian process model.
The model bears a resemblance to the linear model of coregionalization [14, 15]. However, our model extends this by allowing the inputs to be uncertain within the usual variational free energy approach first shown in [1]. Computational tractability is maintained by exploiting the structure of the covariance matrix, extending the structured GP methods first introduced by Saatçi [13]. Additionally, the modeled spatial correlations are expressed in terms of a familiar kernel function (see, e.g., [16]) with a low-dimensional parameterization that is simple to optimize given high-dimensional observation data and is interpretable through its hyperparameters.
The rest of this paper is organized as follows. In Section 2, we define the structured GPLVM including its variational lower bound, predictive density, and algorithm for inference of latent variables. In Section 3, we apply our model to model highdimensional images with and without a dynamical prior and demonstrate its ability to impute missing data in several novel ways. Section 4 summarizes our work and discusses its implications.
1.1 Related work
Our model is based on the Bayesian Gaussian process latent variable model of Titsias and Lawrence [1]. Structured Gaussian process regression was first introduced by Saatçi [13], who considered inputs composed as a Cartesian product of one-dimensional inputs, though combinations of higher-dimensional inputs are also possible so long as the kernel is appropriately separable [17]. The insight to exploit Kronecker product structure in Gaussian process latent variable models was first shown in [18]. However, their variational formulation defines a posterior over the induced outputs and restricts its covariance matrix to possess Kronecker product structure; the variational parameters describing its mean and covariance must be learned through optimization (i.e., the bound is "uncollapsed" [7]). Here, we make no assumptions about the structure of the posterior covariance, instead deriving the analytical optimum and showing how the resulting "collapsed" variational bound can still be computed efficiently by exploiting eigendecomposition properties of Kronecker products. Inference over partially-observed data is still possible using a novel "mixed" bound combining collapsed and uncollapsed terms from the training and test data, respectively. We also make the connection with the dynamical GPLVM of [7] for modeling temporally-correlated data.
2 Theory
In this section, we derive the structured GPLVM (SGPLVM) model. Before proceeding, we briefly define some useful notation: a matrix is denoted with its rows, columns, and individual entries indexed in the usual way, and may be assembled from a set of vectors as its rows.
2.1 Problem statement
Suppose we are given a set of data, each of which is observed simultaneously with channels at points in a spatial domain. Represent this data as a matrix. The observations may also be dynamical in nature and indexed with time points. The spatial inputs are assembled as a matrix. Alternatively, instead of modeling our observations as a few high-dimensional data ("few data, many dimensions"), we may reshape them into many low-dimensional data ("many data, few dimensions"). Suppose that each of the data are described by a latent variable; represent these as a matrix with a prior. The full inputs are defined as the Cartesian product of the spatial and latent variable inputs. (One can further impose structure within the spatial inputs.) We aim to learn a generative model
(1) 
The model definition and training procedure is described in Sec. 2.2.1.
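To make the data layout above concrete, the following NumPy sketch assembles a toy observation matrix and the Cartesian-product inputs. All sizes and variable names here are hypothetical, chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: N examples, each observed at M spatial points with d_y channels.
N, M, d_y = 16, 64, 1          # e.g. 16 grayscale images on an 8x8 pixel grid
d_x, d_h = 2, 3                # spatial dimension and latent dimension (illustrative)

# Observations: either "few data, many dimensions" (N x M*d_y) ...
Y_wide = rng.standard_normal((N, M * d_y))
# ... or reshaped to "many data, few dimensions" (N*M x d_y).
Y_tall = Y_wide.reshape(N * M, d_y)

# Spatial inputs: pixel centers on a regular grid.
side = int(np.sqrt(M))
gx, gy = np.meshgrid(np.linspace(0, 1, side), np.linspace(0, 1, side))
X_spatial = np.column_stack([gx.ravel(), gy.ravel()])      # (M, d_x)

# Latent inputs: one d_h-dimensional latent variable per example.
X_latent = rng.standard_normal((N, d_h))                   # (N, d_h)

# Full inputs are the Cartesian product of latent and spatial inputs: (N*M, d_h + d_x).
X_full = np.hstack([np.repeat(X_latent, M, axis=0),
                    np.tile(X_spatial, (N, 1))])
```

This Cartesian-product structure of the inputs is what later induces the Kronecker structure in the kernel matrices.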
Given the learned model, we seek to carry out two types of predictions. First, given a test example observed at spatial points , we seek to infer the posterior of its latent variable, . Second, given and some (possibly different) test points in space , we wish to compute the predictive density for the corresponding outputs . These are addressed in Sec. 2.2.2.
2.2 Structured Bayesian Gaussian process latent variable model
In the following subsections, we explain how to train and do predictions with the structured Gaussian process latent variable model.
2.2.1 SGPLVM: model architecture and evidence lower bound
As mentioned in [18], inference for the generative model of Eq. (1) is challenging for two reasons. First, computing the marginal likelihood is analytically intractable. Second, the likelihood requires the inversion of a large covariance matrix. Using the approach of Titsias and Lawrence [1], the model is augmented with inducing input-output pairs, assembled in matrices; we assume that these inducing pairs are modeled by the same generative process as the training inputs and latent outputs, allowing us to write the following joint probability:
(2) 
where , , , and are kernel matrices evaluated on the training and inducing inputs. The conditional GP prior is given by
(3) 
where the conditional mean and covariance take the usual forms from the projected process model [19, 16]:
(4)  
(5) 
Finally, we define the Gaussian variational posterior over the induced outputs
(6) 
and pick a joint variational posterior that factorizes as
(7) 
The form of will be discussed later.
Following the familiar variational approach [20, 1], we use Jensen's inequality to define the following lower bound on the logarithm of the model evidence:
(8) 
After some manipulations, one arrives at the following uncollapsed lower bound:
(9)  
where we have defined the kernel expectations
(10)  
Using standard methods [20], one may find an analytic optimum for that maximizes Eq. (9) with respect to and :
(11) 
where we have defined
(12) 
for convenience. Substituting this into Eq. (9) and doing the required manipulations results in the “collapsed” lower bound:
(13)  
where
(14)  
(15)  
(16) 
and is a Cholesky decomposition. For modeling independent, identically-distributed data, we use a factorized Gaussian prior and variational posterior. For dynamical data indexed by times, we follow [21, 7] and use a GP prior and variational posterior, where the covariance matrix is formed on the time points by a kernel and the posterior covariance is a diagonal matrix.
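To make the dynamical prior concrete, here is a minimal NumPy sketch of a GP prior over one latent coordinate indexed by time. The kernel choice, lengthscale, and number of time points are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel; a and b are (n, d) and (m, d) arrays."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq / lengthscale**2)

# Times at which the examples were observed (illustrative).
n_t = 20
t = np.linspace(0.0, 1.0, n_t)[:, None]
K_t = rbf_kernel(t, t, lengthscale=0.2) + 1e-8 * np.eye(n_t)  # jitter for stability

# Under the dynamical prior, each latent dimension is an independent GP over time;
# a draw from the prior gives one latent coordinate as a smooth trajectory.
rng = np.random.default_rng(0)
x_sample = rng.multivariate_normal(np.zeros(n_t), K_t)
```

Nearby time points thus receive correlated latent variables, which is what regularizes the embedding of temporally-adjacent frames.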
To train the model, we maximize Eq. (13) over the variational parameters of the inducing inputs, and the model hyperparameters . However, doing so naïvely quickly becomes computationally intractable. Next, we show how the structure in the model inputs can be exploited so that the time and memory requirements associated with training become linear in .
Similarly to [18], we impose structure on the inducing points: , where and . This implies that and .
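The storage savings from this Kronecker factorization of the inducing-point kernel matrix can be sketched as follows; the kernel, sizes, and inducing inputs are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
M_x, M_h = 10, 5   # hypothetical numbers of spatial and latent inducing points

def rbf(a, b, ell=1.0):
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / ell**2)

Z_x = rng.standard_normal((M_x, 2))   # spatial inducing inputs
Z_h = rng.standard_normal((M_h, 3))   # latent inducing inputs
K_x, K_h = rbf(Z_x, Z_x), rbf(Z_h, Z_h)

# Full inducing kernel matrix is the Kronecker product of the factors.
# In practice only K_x and K_h are stored: O(M_x^2 + M_h^2) instead of O(M_x^2 M_h^2).
K_uu = np.kron(K_x, K_h)              # formed here only to illustrate the structure
```

All subsequent algebra operates on the factors, never on the full product.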
The statistics of Eq. (10) can then be written as
(17) 
where the factors may be evaluated in the usual way (see [22]). The computational cost associated with evaluating the kernel expectations in Eq. (17) remains unchanged relative to the traditional Bayesian GPLVM and still allows for parallelization with respect to the latent variables, as originally discussed in [23, 24].
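To make the kernel-expectation ("psi statistics") definitions concrete, the following sketch estimates them by Monte Carlo for an RBF kernel over the latent-variable factor. Closed forms exist and are what is used in practice; sampling is shown only to make the definitions tangible, and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
N, d, M = 4, 2, 3                      # examples, latent dim, inducing points (toy)
mu = rng.standard_normal((N, d))       # variational means of q(X)
s2 = 0.1 * np.ones((N, d))             # variational (diagonal) variances
Z = rng.standard_normal((M, d))        # inducing inputs

def rbf(a, b, ell=1.0):
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / ell**2)

# Draw S samples from q(X) = prod_n N(mu_n, diag(s2_n)) and average kernel matrices.
S = 2000
X = mu[None] + np.sqrt(s2)[None] * rng.standard_normal((S, N, d))
Knm = np.stack([rbf(x, Z) for x in X])         # (S, N, M)

psi0 = N * 1.0                                 # E[tr K_ff]; unit-variance RBF diagonal
Psi1 = Knm.mean(0)                             # E[K_fu], shape (N, M)
Psi2 = np.einsum('snm,snk->mk', Knm, Knm) / S  # sum_n E[k(Z,x_n) k(x_n,Z)], (M, M)
```

The same quantities factor across the spatial and latent inputs under the Kronecker structure, which is what keeps their cost identical to the standard Bayesian GPLVM.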
Finally, we show how the remaining terms of the variational bound may be computed efficiently without having to ever explicitly evaluate any of the full matrices in the bound. First, we note that and have Kronecker decompositions:
(18)  
(19) 
The eigendecomposition of is
(20) 
so is orthogonal and is diagonal. These both also admit Kronecker decompositions:
(21)  
(22) 
The matrices on the right hand sides of Eqs. (21) and (22) are found by computing the eigendecomposition of and . It follows that
(23) 
where, for notational convenience, we have defined . Note that is the sum of two diagonals and thus contains nonzero entries. Thus, it is at least as easy to store in memory as the data . We also note that expressing as in Eq. (23) makes the computation of in Eq. (13) straightforward. Next, we see that Eq. (12) can be rewritten as
(24) 
and note that the matrix factors of
(25) 
may be efficiently computed and stored since and are Kronecker products and the diagonal matrix can, of course, be inverted in an elementwise manner.
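The eigendecomposition identities above can be checked numerically. This sketch, assuming generic symmetric positive-definite factors, also shows how a shifted Kronecker system is solved in the eigenbasis by an elementwise division, which is the workhorse behind the efficient bound evaluation.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4)); A = A @ A.T + 4 * np.eye(4)   # SPD factor
B = rng.standard_normal((3, 3)); B = B @ B.T + 3 * np.eye(3)   # SPD factor

la, Qa = np.linalg.eigh(A)
lb, Qb = np.linalg.eigh(B)

# Eigendecomposition of A ⊗ B is the Kronecker product of the factor decompositions.
Q = np.kron(Qa, Qb)                    # eigenvectors of A ⊗ B
lam = np.kron(la, lb)                  # eigenvalues of A ⊗ B

K = np.kron(A, B)
assert np.allclose(Q @ np.diag(lam) @ Q.T, K)

# Solving (K + sigma^2 I) v = b: rotate, divide elementwise, rotate back.
sigma2 = 0.1
b = rng.standard_normal(12)
v = Q @ ((Q.T @ b) / (lam + sigma2))
```

Only the small factor decompositions are ever computed; the full matrix K appears here solely to verify the identity.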
Finally, the second trace term in the second line of Eq. (13) can be manipulated using Eq. (25) to obtain
(26)  
(27)  
(28) 
The matrix is most efficiently computed by first evaluating the product , then multiplying against last. Computing this term takes time. Thus, we see that computing the bound is linear in the size of the training data in both time and memory.
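The linear-time matrix products above rest on the standard Kronecker "vec trick": a Kronecker matrix-vector product is computed as two small dense multiplications instead of forming the full product. With NumPy's row-major layout the identity reads (A ⊗ B) x = (A X B^T) flattened, where X is x reshaped; a minimal check:

```python
import numpy as np

rng = np.random.default_rng(3)
p, m, q, n = 5, 4, 3, 6
A = rng.standard_normal((p, m))
B = rng.standard_normal((q, n))
x = rng.standard_normal(m * n)

# Fast path: O(pmn + pqm) work, never materializing the (p*q x m*n) matrix.
y_fast = (A @ x.reshape(m, n) @ B.T).ravel()

# Reference path: explicit Kronecker product, O(pqmn) work and memory.
y_full = np.kron(A, B) @ x
```

Applying this trick to each Kronecker factor in the bound is what yields overall time and memory linear in the size of the training data.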
2.2.2 SGPLVM: predictions
Forward predictive density
Given test inputs, the predictive density follows from our assumption that the inducing points are sufficient statistics of the training examples [7]:
(29)  
(30)  
(31) 
where is the covariance matrix computed on the test inputs and is the crosscovariance matrix between test and training inputs. Noticing that , , and substituting Eq. (25) into Eqs. (30) and (31), we obtain
(32)  
(33) 
where denotes the Schur product. These may be efficiently computed by exploiting Kronecker products and that is diagonal. Importantly, we are able to use this model to predict at different spatiotemporal resolutions from that of our training data if desired. This cannot be done without the parameterized kernel that captures spatiotemporal correlations in our model. The full covariance of Eq. (31) can be computed in time and memory; details are given in the supplementary material.
If we have a Gaussian posterior , then we can readily compute
(34) 
where . However, unlike in [1], the marginal covariance of the predictive density is intractable due to the modeled spatial correlations. One can approximate the density as a mixture of Gaussians by taking samples from and computing the Gaussian of Eq. (29) for each sample of .
For the dynamical model, the latent variable posterior at time can be computed as [7]
(35) 
where is the crosscovariance vector computed on and .
Inference of latent variables
Here, we explain how the SGPLVM can be used to infer the variational posterior over the latent variables associated with test observations held out from the training set. Given some test observation observed at spatial points , we would like to infer the posterior . To do this, we optimize the augmented bound
(36) 
where
(37)  
and the optimal mean and covariance are given by Eq. (11). Our approach may be seen as combining the "collapsed" bound of [1] for the training data with the "uncollapsed" bound used in [18] for the test data. Note that, since the test data are not included in the collapsed terms, the posterior over the induced outputs is no longer optimal, but the benefit of computational tractability greatly outweighs this drawback in practice. Equation (37) may be efficiently computed by exploiting Kronecker structure; details are given in the supplementary material. To infer the test latent variable posterior, one optimizes the bound over its variational parameters. In the interest of computational efficiency, we keep the variational parameters and kernel hyperparameters learned at training time constant.
3 Examples
3.1 Reconstruction of missing data
In this example, we demonstrate the SGPLVM's ability to reconstruct missing data from an image using the "Frey Faces" data set [25, 6], which contains grayscale images in which each pixel takes on a scalar value. (When modeling the data, we rescale them to have zero mean and unit variance; however, the results are reported in terms of the raw pixel values.) We randomly split the data set into training and test images; we consider two training-set sizes.
As described in Sec. 2.1, the training data are assembled into the training data matrix with a single channel, since the images are grayscale. The centers of the pixels are assigned two-dimensional spatial inputs and assembled into the spatial input matrix. The SGPLVM model has latent variable inducing inputs, spatial inducing inputs, and latent dimensions. Gaussian and Matérn kernels are used for the latent variable and spatial inputs, respectively.
At test time, we randomly remove 50% of the pixels in each held-out image. We impute them by inferring the test latent variable posterior following Sec. 2.2.2, then approximating the forward predictive density as a Gaussian with mean given by Eq. (34) and covariance from a mixture of Gaussians as described in Sec. 2.2.2. We condition this Gaussian on the observed pixels in the test observation to repair some of the reconstruction error; this is only possible because of the spatial correlations modeled by the SGPLVM. The prediction accuracy is measured for each image by the root mean square error (RMSE) and median negative log probability (MNLP) over the imputed pixels. We compare our model to the Bayesian GPLVM on the same data set by replacing the Matérn 3/2 kernel with a white kernel, as well as to a vanilla across-image GP regression with a Matérn 3/2 kernel trained on each image individually.
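The conditioning step on the observed pixels is standard Gaussian conditioning on a partition of the joint over all pixels. A small self-contained sketch, with an arbitrary illustrative covariance in place of the SGPLVM predictive covariance:

```python
import numpy as np

rng = np.random.default_rng(5)
D = 6                                          # toy number of pixels
L = rng.standard_normal((D, D))
Sigma = L @ L.T + np.eye(D)                    # illustrative joint covariance (SPD)
mu = rng.standard_normal(D)                    # illustrative joint mean

obs = np.array([0, 2, 4])                      # indices of observed pixels
mis = np.array([1, 3, 5])                      # indices of missing pixels
y_obs = rng.standard_normal(3)                 # observed pixel values

# p(y_m | y_o) = N(mu_m + S_mo S_oo^{-1}(y_o - mu_o),  S_mm - S_mo S_oo^{-1} S_om)
S_oo = Sigma[np.ix_(obs, obs)]
S_mo = Sigma[np.ix_(mis, obs)]
S_mm = Sigma[np.ix_(mis, mis)]
gain = S_mo @ np.linalg.inv(S_oo)
cond_mean = mu[mis] + gain @ (y_obs - mu[obs])
cond_cov = S_mm - gain @ S_mo.T
```

A factorized (white-kernel) model has a diagonal predictive covariance, so the gain matrix vanishes and the observed pixels cannot inform the missing ones; modeled spatial correlations are what make this refinement possible.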
For the smaller training set, the SGPLVM attains a mean RMSE and MNLP lower than those of the Bayesian GPLVM; for the larger training set, the two models' RMSE and MNLP are closer, and the across-image GP performs worst. Our results imply that the modeled spatial correlations are helpful particularly when data examples are limited, even though the learned length scales are on the order of a pixel. However, the across-image GP's performance shows that the latent variable modeling is essential for obtaining good predictions. This is expected, given the small size of the images. Figure 1 shows a few examples of the reconstructions given by the SGPLVM model.
3.2 Generation of high-resolution video
In this example, we explore the use of the SGPLVM with a dynamical prior to train on a low-resolution video sequence and impute missing frames at a higher resolution without ever showing the model any high-resolution examples. The video we use is available at https://pixabay.com/en/videos/eyefacehumanmalemanperson4376/ under the Creative Commons CC0 license and is also included in the supplementary material. We train our model on the low-resolution video and test our predictions against the HD video. Pixels in the low-resolution video are assigned spatial locations at the centers of the corresponding clusters of pixels in the high-resolution video. We randomly select a subset of frames as training data and predict on the remaining frames. We train a dynamical SGPLVM model with latent variable inducing inputs, spatial inducing inputs, and latent dimensions. The spatial input structure is exploited by factorizing the spatial inputs across the two image axes. The outputs are RGB triplets. Predictions with the dynamical SGPLVM are compared against a structured GP regression model which predicts the RGB triplets as a function of time and spatial location (3 input dimensions). Prediction accuracy on each frame is measured via RMSE and MNLP; the dynamical SGPLVM obtains a substantially lower RMSE and MNLP than the structured GP. Note that data are preprocessed to have zero mean and unit variance on each channel. Thus, we find that the latent variable modeling of the frames enables us to obtain substantially better predictive accuracy. Figure 2 shows the time-dependent latent variable posterior means, the latent dimensions' inverse length scales, and a comparison between a frame from the HD video and the corresponding generated frame from the SGPLVM.

4 Conclusions and Discussion
In this work, we derived a structured Gaussian process latent variable model that explicitly captures spatial correlations in high-dimensional observation data. Computational tractability is maintained by identifying Kronecker product structure in the model kernel matrices. The modeled spatiotemporal correlations are expressed in terms of interpretable parameterized kernel functions, allowing one to use the generative model at a higher resolution than that of the training data.
We envision several extensions to our work. First, the SGPLVM might be incorporated as a structured layer within a deep Gaussian process [4] with additional hidden layers preceding or following it. Second, we would like to explore the use of the SGPLVM as a data-driven surrogate model for modeling stochastic partial differential equations with high-dimensional nonstationary outputs, and as a data-driven high-dimensional stochastic input model (i.e., for dimensionality reduction).
Acknowledgments
This work was supported by the University of Notre Dame, the Center for Research Computing (CRC), and the Center for Informatics and Computational Science (CICS). Early developments were supported by the Computer Science and Mathematics Division of ORNL under the DARPA EQUiPS program. N.Z. thanks the Technische Universität München, Institute for Advanced Study, for support through a Hans Fischer Senior Fellowship, funded by the German Excellence Initiative and the European Union Seventh Framework Programme under Grant Agreement No. 291763.
References

[1] M. K. Titsias, N. D. Lawrence, Bayesian Gaussian process latent variable model, in: Y. W. Teh, D. M. Titterington (Eds.), Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2010), Vol. 9, 2010, pp. 844–851.
[2] N. D. Lawrence, Gaussian process latent variable models for visualisation of high dimensional data, in: Advances in Neural Information Processing Systems, 2004, pp. 329–336.
[3] N. Lawrence, Probabilistic non-linear principal component analysis with Gaussian process latent variable models, J. Mach. Learn. Res. 6 (2005) 1783–1816.
[4] A. C. Damianou, N. D. Lawrence, Deep Gaussian processes, in: C. M. Carvalho, P. Ravikumar (Eds.), Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2013), Vol. 31, 2013, pp. 207–215.
[5] Z. Dai, A. Damianou, J. González, N. Lawrence, Variational auto-encoded deep Gaussian processes, arXiv preprint arXiv:1511.06455.
[6] M. K. Titsias, Variational learning of inducing variables in sparse Gaussian processes, in: Artificial Intelligence and Statistics 12, 2009, pp. 567–574.
[7] A. C. Damianou, M. K. Titsias, N. D. Lawrence, Variational inference for latent variables and uncertain inputs in Gaussian processes, J. Mach. Learn. Res. 17 (1) (2016) 1425–1486.
[8] D. P. Kingma, M. Welling, Auto-encoding variational Bayes, arXiv preprint arXiv:1312.6114.
[9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[10] C. Blundell, J. Cornebise, K. Kavukcuoglu, D. Wierstra, Weight uncertainty in neural networks, arXiv preprint arXiv:1505.05424.
[11] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, Cambridge, 2016.
[12] H. Salimbeni, M. Deisenroth, Doubly stochastic variational inference for deep Gaussian processes, in: Advances in Neural Information Processing Systems, 2017, pp. 4591–4602.
[13] Y. Saatçi, Scalable inference for structured Gaussian process models, Ph.D. thesis, University of Cambridge (2012).
[14] A. G. Journel, C. J. Huijbregts, Mining Geostatistics, Academic Press, 1978.
[15] P. Goovaerts, Geostatistics for natural resources characterization.
[16] C. E. Rasmussen, C. K. Williams, Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning), The MIT Press, 2006.
[17] I. Bilionis, N. Zabaras, B. A. Konomi, G. Lin, Multi-output separable Gaussian process: Towards an efficient, fully Bayesian paradigm for uncertainty quantification, Journal of Computational Physics 241 (2013) 212–239.
[18] A. Lopez, Z. Dai, N. D. Lawrence, Efficient modeling of latent information in supervised learning using Gaussian processes, in: Advances in Neural Information Processing Systems 30 (NIPS 2017), 2017.
[19] M. Seeger, C. Williams, N. Lawrence, Fast forward selection to speed up sparse Gaussian process regression, in: Artificial Intelligence and Statistics 9, 2003.
[20] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics), Springer-Verlag, New York, 2006.
[21] M. Opper, C. Archambeau, The variational Gaussian approximation revisited, Neural Comput. 21 (3) (2009) 786–792.
[22] A. Damianou, Deep Gaussian processes and variational propagation of uncertainty, Ph.D. thesis, University of Sheffield (2015).
[23] Z. Dai, A. Damianou, J. Hensman, N. Lawrence, Gaussian process models with parallelization and GPU acceleration, arXiv preprint arXiv:1410.4984.
[24] Y. Gal, M. van der Wilk, C. Rasmussen, Distributed variational inference in sparse Gaussian process regression and latent variable models, in: Advances in Neural Information Processing Systems, 2014, pp. 3257–3265.
[25] S. T. Roweis, L. K. Saul, G. E. Hinton, Global coordination of local linear models, in: Advances in Neural Information Processing Systems, 2002, pp. 889–896.