Deep Gaussian Processes with Decoupled Inducing Inputs

Marton Havasi
Department of Engineering
University of Cambridge

José Miguel Hernández-Lobato
Department of Engineering
University of Cambridge

Juan José Murillo-Fuentes
University of Seville
Abstract

Deep Gaussian Processes (DGP) are hierarchical generalizations of Gaussian Processes (GP) that have proven effective on multiple supervised regression tasks. They combine the well-calibrated uncertainty estimates of GPs with the great flexibility of multilayer models. In DGPs, given the inputs, the outputs of the layers are Gaussian distributions parameterized by their means and covariances. These layers are realized as sparse GPs, in which the training data is summarized by a small set of pseudo points. In this work, we show that the computational cost of DGPs can be reduced with no loss in performance by using a separate, smaller set of pseudo points when calculating the layerwise variance, while using a larger set of pseudo points when calculating the layerwise mean. This enables us to train larger models that have lower cost and better predictive performance.

 


1 Introduction

In this work, we consider doubly stochastic DGPs (Salimbeni and Deisenroth, 2017; Damianou and Lawrence, 2012). In this setting, each layer is a sparse GP (Titsias, 2009) whose inputs are sampled from the output distribution of the previous layer (the input data in the case of the first layer). These layers are parameterized by their covariance function and a set of inducing input-output pairs (previously referred to as pseudo points).

Generally, the inducing inputs are treated as hyperparameters and the inducing outputs (the latent variables) are variationally approximated by a multivariate Gaussian distribution. In a recent interpretation, Cheng and Boots (2017) showed the duality between GPs and Gaussian measures. As a result, they showed that the mean and the variance computations are not required to share the inducing inputs. Since calculating the variance is significantly more costly than calculating the mean, one can speed up the training process by using a second, reduced set of inducing inputs when computing the variance. This is referred to as a Decoupled GP.

Our contribution is to show that the decoupled approach is applicable to DGPs. While changes to the original parameterization proposed by Cheng and Boots (2017) were necessary, the resulting DGPs were both faster and had higher predictive performance than their non-decoupled counterparts.

2 Methods

This section gives a brief overview of variational inference for Deep Gaussian Processes and explains how decoupled inducing inputs can be incorporated into them.

2.1 Variational Inference for Deep Gaussian Processes

Consider a set of input-output pairs X = {x_n}_{n=1}^N, y = {y_n}_{n=1}^N. Let us denote the layer outputs by f and the inducing inputs and outputs by Z and u respectively. In a single-layer sparse Gaussian Process (GP) (Titsias, 2009; Snelson and Ghahramani, 2006), one can express the joint probability density as

p(y, f, u) = p(y | f) p(f | u) p(u)    (1)

The first term, p(y | f), is the likelihood. In our case, a Gaussian likelihood is used exclusively. The second part, p(f | u) p(u), is the sparse GP prior with p(u) = N(u | 0, K_{ZZ}), where k(·, ·) is the covariance function. We will be using the notation K_{AB} = k(A, B) from here onwards.

The conditional p(f | u) is given by the GP model (Williams and Rasmussen, 1996):

p(f | u) = N( f | K_{XZ} K_{ZZ}^{-1} u, K_{XX} - K_{XZ} K_{ZZ}^{-1} K_{ZX} )    (2)

When calculating the posterior of f and u given the data y, we are forced to approximate p(f, u | y) by q(f, u) = p(f | u) q(u) with q(u) = N(u | m, S) (Titsias, 2009) due to tractability problems. The advantage of this form is that u can be marginalized and the variational posterior is simply

q(f) = N( f | K_{XZ} K_{ZZ}^{-1} m, K_{XX} - K_{XZ} K_{ZZ}^{-1} (K_{ZZ} - S) K_{ZZ}^{-1} K_{ZX} )    (3)
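To make eq. 3 concrete, here is a minimal NumPy sketch (not the authors' implementation) that evaluates the marginal mean and variance of q(f) for a minibatch of inputs under an assumed RBF kernel; the helper names, kernel hyperparameter defaults and jitter value are our own placeholders.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    # k(a, b) = variance * exp(-||a - b||^2 / (2 * lengthscale^2))
    sq_dist = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

def sparse_gp_marginals(X, Z, m, S, jitter=1e-6):
    """Mean and marginal variance of q(f) in eq. 3, evaluated at X."""
    Kzz = rbf_kernel(Z, Z) + jitter * np.eye(len(Z))
    Kxz = rbf_kernel(X, Z)
    kxx = np.full(len(X), rbf_kernel(X[:1], X[:1])[0, 0])  # diagonal of K_XX (RBF)
    L = np.linalg.cholesky(Kzz)                  # K_ZZ = L L^T, O(M^3)
    A = np.linalg.solve(L, Kxz.T)                # L^-1 K_ZX, O(N_b M^2)
    mean = A.T @ np.linalg.solve(L, m)           # K_XZ K_ZZ^-1 m
    C = np.linalg.solve(L.T, A)                  # K_ZZ^-1 K_ZX
    var = kxx - np.sum(A * A, axis=0) + np.sum(C * (S @ C), axis=0)
    return mean, var
```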

Finally, following the principles of variational inference, the target of the optimization is the Evidence Lower Bound (ELBO):

ELBO = \sum_{n=1}^{N} E_{q(f_n)}[ \log p(y_n | f_n) ] - KL( q(u) || p(u) )    (4)

In the multilayer setting, we employ similar approximations:

q( {f^l, u^l}_{l=1}^{L} ) = \prod_{l=1}^{L} p( f^l | u^l; f^{l-1}, Z^l ) q( u^l )    (5)

where f^l is the output of the l-th layer (and the input of the (l+1)-st) for l = 1, ..., L. The input to the first layer is simply f^0 = X. The ELBO is analogously

ELBO = \sum_{n=1}^{N} E_{q(f_n^L)}[ \log p(y_n | f_n^L) ] - \sum_{l=1}^{L} KL( q(u^l) || p(u^l) )    (6)

where M is the number of inducing points in each layer (Z^l and u^l have M rows).
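The doubly stochastic evaluation of the expected log-likelihood term in eq. 6 can be sketched as follows, reusing the hypothetical sparse_gp_marginals helper above; every layer is taken to be single-output here purely to keep the example short, and the per-layer parameters are assumed to live in a list of dictionaries.

```python
def dgp_forward_sample(X, layers, rng=None):
    """Draw one sample of the final-layer output f^L for a minibatch X.

    `layers` is a list of dicts with keys 'Z', 'm', 'S' (one entry per layer).
    """
    rng = np.random.default_rng() if rng is None else rng
    F = X
    for layer in layers:
        mean, var = sparse_gp_marginals(F, layer['Z'], layer['m'], layer['S'])
        # Reparameterized sample from the layer marginal: f = mean + sqrt(var) * eps
        eps = rng.standard_normal(len(mean))
        F = (mean + np.sqrt(np.maximum(var, 0.0)) * eps)[:, None]
    return F[:, 0]
```

The expected log-likelihood is then estimated by Monte Carlo, averaging log p(y_n | f_n^L) over such samples, while the KL terms are computed analytically per layer.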

2.2 Decoupled Inducing Inputs

Using the dual formulation of a GP as a Gaussian measure, Cheng and Boots (2017) have shown that the mean and the covariance of q(f) do not necessarily have to be parameterized by the same set of inducing inputs Z.

Instead of eq. 3, a new parameterization is used that utilizes two different sets of inducing inputs: Z_a (of size M_a) for the mean and Z_b (of size M_b) for the covariance. The mean is parameterized by a vector a ∈ R^{M_a}, which is beneficial, since it does not require computing the inverse covariance matrix (K_{Z_a Z_a}^{-1}). The covariance is defined using a positive definite matrix B ∈ R^{M_b × M_b}:

q(f) = N( f | K_{X Z_a} a, K_{XX} - K_{X Z_b} (K_{Z_b Z_b} + B^{-1})^{-1} K_{Z_b X} )    (7)
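A minimal sketch of eq. 7 under the same assumed RBF kernel as above: the mean is a single matrix-vector product with K_{X Z_a} and needs no matrix inverse, while the variance involves only the smaller set Z_b and the positive definite parameter B.

```python
def decoupled_marginals(X, Za, a, Zb, B, jitter=1e-6):
    """Mean and marginal variance of the decoupled posterior of eq. 7."""
    mean = rbf_kernel(X, Za) @ a                         # O(N_b M_a), no inversion
    Kbb = rbf_kernel(Zb, Zb) + jitter * np.eye(len(Zb))
    Kxb = rbf_kernel(X, Zb)
    kxx = np.full(len(X), rbf_kernel(X[:1], X[:1])[0, 0])  # diagonal of K_XX (RBF)
    C = np.linalg.inv(Kbb + np.linalg.inv(B))            # (K_ZbZb + B^-1)^-1, O(M_b^3)
    var = kxx - np.sum(Kxb * (Kxb @ C), axis=1)          # O(N_b M_b^2)
    return mean, var
```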

Corresponding to the new parameterization, the ELBO takes a slightly different form:

ELBO = \sum_{n=1}^{N} E_{q(f_n)}[ \log p(y_n | f_n) ] - \frac{1}{2} ( a^T K_{Z_a Z_a} a + \log |I + K_{Z_b Z_b} B| - tr( (K_{Z_b Z_b} + B^{-1})^{-1} K_{Z_b Z_b} ) )    (8)

where the regularization term equals the KL divergence up to an additive constant.
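The regularization term of eq. 8 can be evaluated as in the sketch below, again under the assumed RBF kernel and placeholder jitter; it computes the KL divergence up to an additive constant.

```python
def decoupled_kl(Za, a, Zb, B, jitter=1e-6):
    """Regularization term of eq. 8 (KL divergence up to a constant):
    0.5 * (a^T Kaa a + log|I + Kbb B| - tr((Kbb + B^-1)^-1 Kbb))."""
    Kaa = rbf_kernel(Za, Za) + jitter * np.eye(len(Za))
    Kbb = rbf_kernel(Zb, Zb) + jitter * np.eye(len(Zb))
    quad = a @ (Kaa @ a)                                        # O(M_a^2)
    _, logdet = np.linalg.slogdet(np.eye(len(Zb)) + Kbb @ B)    # O(M_b^3)
    trace = np.trace(np.linalg.solve(Kbb + np.linalg.inv(B), Kbb))
    return 0.5 * (quad + logdet - trace)
```

The full objective subtracts this term (summed over layers in the deep case) from the Monte Carlo estimate of the expected log-likelihood.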

The reformulation greatly reduces the computational complexity. Per output dimension of a single layer, the time complexity of calculating the mean is O(M^3 + N_b M), that of the variance is O(M^3 + N_b M^2) and that of the KL divergence is O(M^3) (where M is the number of inducing points and N_b is the size of the minibatch). After the decoupling, the cost of the mean becomes O(N_b M_a), the variance becomes O(M_b^3 + N_b M_b^2) and the KL divergence becomes O(M_a^2 + M_b^3). This is due to the new parameterization not requiring the costly inversion of the covariance matrix K_{Z_a Z_a} for the mean. The overall cost reduces from O(L (M^3 + N_b M^2 H)) to O(L (M_b^3 + N_b M_b^2 H + N_b M_a H)) (where L is the number of layers and H is the width of the hidden layers). Note that the cost of inverting the covariance matrix does not scale with the layer width, since every node in the same layer shares the covariance matrix. This leads to a considerable improvement in training time when M_b ≪ M.

2.3 Alternative parameterizations

Unfortunately, the parameterization advocated by Cheng and Boots (2017) (eq. 7, referred to as m_CB and s_CB from here onwards) has poor convergence properties. The dependencies between the entries of a in the ELBO result in a highly non-convex optimization objective, which in turn leads to high-variance gradients. This impedes convergence even at small learning rates.

To combat this issue, we look at different parameterizations that lift these dependencies and achieve stable convergence. We consider the natural extension of the standard sparse GP parameterization to decoupled inducing inputs. For the mean, we have two alternatives. The first option is the standard sparse GP mean (defined in eq. 3, referred to as m_SGP), while the second option is to precondition the mean parameters with L_a (shown in Table 1, referred to as m_Chol), the Cholesky factor of K_{Z_a Z_a}. The goal of the preconditioner is to transform the prior distribution over the mean parameters into a standard Gaussian. This aids convergence, because the term for the mean in the ELBO (½ m^T K_{Z_a Z_a}^{-1} m) becomes ½ m̃^T m̃. As for the variance, the sparse GP parameterization offers only a single option (eq. 3, referred to as s_SGP), which is shown in Table 2 alongside s_CB.

As discussed earlier, m_CB shows unstable convergence. While the problem is fixed by m_SGP and m_Chol, this comes at a cost. Both of these require factorizing the covariance matrix for the mean (K_{Z_a Z_a}), which leads to an increased overall cost of O(L (M_a^3 + M_b^3 + N_b M_b^2 H + N_b M_a H)). Fortunately, this is still an improvement over the original cost of DGPs when M_b ≪ M. Note that the cost reduction is more significant when N_b and H are large, because the O(N_b M_b^2 H) term then dominates the overall cost. In our experiments, the difference in performance between m_SGP and m_Chol was limited.

For the variance, both s_CB and s_SGP show stable convergence and they exhibit similar performance.

Parameterization | Mean | Complexity | Convergence
m_CB | K_{X Z_a} a | O(N_b M_a) | unstable
m_SGP | K_{X Z_a} K_{Z_a Z_a}^{-1} m | O(M_a^3 + N_b M_a) | stable
m_Chol | K_{X Z_a} L_a^{-T} m̃ | O(M_a^3 + N_b M_a) | stable
Table 1: Different parameterizations of the mean. L_a refers to the Cholesky decomposition K_{Z_a Z_a} = L_a L_a^T.
Parameterization | Covariance | Complexity | Convergence
s_CB | K_{XX} - K_{X Z_b} (K_{Z_b Z_b} + B^{-1})^{-1} K_{Z_b X} | O(M_b^3 + N_b M_b^2) | stable
s_SGP | K_{XX} - K_{X Z_b} K_{Z_b Z_b}^{-1} (K_{Z_b Z_b} - S) K_{Z_b Z_b}^{-1} K_{Z_b X} | O(M_b^3 + N_b M_b^2) | stable
Table 2: Different parameterizations of the covariance. B is parameterized through its Cholesky factor B = L_B L_B^T and S through S = L_S L_S^T for computational stability.
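For illustration, here is a sketch of the Cholesky-preconditioned mean parameterization m_Chol from Table 1, again using the hypothetical rbf_kernel helper from Section 2.1; the point is that preconditioning turns the mean contribution to the KL term into a simple squared norm.

```python
def preconditioned_mean(X, Za, m_tilde, jitter=1e-6):
    """Mean K_XZa La^-T m_tilde, where K_ZaZa = La La^T (Cholesky).

    Under this parameterization the mean contribution to the KL term
    is 0.5 * ||m_tilde||^2 instead of 0.5 * m^T Kaa^-1 m.
    """
    Kaa = rbf_kernel(Za, Za) + jitter * np.eye(len(Za))
    La = np.linalg.cholesky(Kaa)                           # O(M_a^3)
    mean = rbf_kernel(X, Za) @ np.linalg.solve(La.T, m_tilde)
    kl_mean_term = 0.5 * m_tilde @ m_tilde
    return mean, kl_mean_term
```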

3 Experiments

The goal of the experiments is to show that the decoupled approach is not only more cost efficient but also achieves better predictive performance than the original, non-decoupled approach.

The experiments are conducted on three datasets: two benchmark UCI regression datasets, kin8nm and protein, as well as a third regression dataset, molecules, which has binary features and describes the energy conversion efficiency of molecules in solar panels.

In each run, we set aside a random 20% of the datapoints to serve as test data. The experiments were repeated 5 times. The optimization is carried out using Adam (Kingma and Ba, 2014) with a learning rate of 0.01 over 5,000 epochs with a fixed minibatch size N_b.

Two models are selected for our experiments: a full DGP with M inducing inputs per layer and a decoupled DGP with M_a mean and M_b variance inducing inputs per layer, where M follows the common choice for DGPs and M_a, M_b are chosen so that the decoupled model has a proportionate runtime. We use an RBF kernel with a separate lengthscale per dimension in order to have results comparable to Bui et al. (2016) and Salimbeni and Deisenroth (2017). The layer width is fixed at 10 and the depth varies from 0 to 4 hidden layers.

We add a static mean function to each layer, as suggested by Salimbeni and Deisenroth (2017). Adding the layer inputs to the layer outputs as a fixed mean function helps avoid degenerate covariance matrices and provides a decent initialization for the inducing inputs.
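A sketch of this fixed mean function, reusing the hypothetical sparse_gp_marginals helper from Section 2.1; the projection matrix W is a placeholder for the case where input and output widths differ, and reduces to a column of the identity when they match.

```python
def layer_with_fixed_mean(F_in, layer, W, rng=None):
    """One single-output DGP layer whose GP models the residual around a
    fixed (non-trained) mean function m(x) = x @ W, so the layer input is
    effectively added back to the layer output."""
    rng = np.random.default_rng() if rng is None else rng
    mean, var = sparse_gp_marginals(F_in, layer['Z'], layer['m'], layer['S'])
    sample = mean + np.sqrt(np.maximum(var, 0.0)) * rng.standard_normal(len(mean))
    return sample + F_in @ W        # fixed skip-connection-style mean
```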

The results are shown in Figures 1 and 2, with Table 3 containing the median runtimes. The exact numerical values can be found in Appendix A.

Figure 1: The test mean log-likelihood of the models. The confidence bands denote one standard deviation. Higher is better.
Figure 2: The test root-mean-square error of the models. The confidence bands denote one standard deviation. Lower is better.
Model | kin8nm | protein | molecules
DGP (no hidden layers) | 116 s | 587 s | 1699 s
Decoupled DGP (no hidden layers) | 310 s | 1210 s | 2443 s
DGP (1 hidden layer) | 481 s | 2567 s | 4159 s
Decoupled DGP (1 hidden layer) | 433 s | 2015 s | 3683 s
DGP (2 hidden layers) | 853 s | 4577 s | 6775 s
Decoupled DGP (2 hidden layers) | 611 s | 3038 s | 5121 s
DGP (3 hidden layers) | 1242 s | 6598 s | 9426 s
Decoupled DGP (3 hidden layers) | 811 s | 4111 s | 6596 s
DGP (4 hidden layers) | 1601 s | 8251 s | 12122 s
Decoupled DGP (4 hidden layers) | 1017 s | 5209 s | 8078 s
Table 3: Median runtimes. The decoupled model was more cost efficient in the presence of at least one hidden layer.

The decoupled DGP outperforms the baseline DGP on every dataset, with the exception of the DGP with two hidden layers on kin8nm. Moreover, its runtime is also lower for models with at least one hidden layer. It has a higher runtime in the case of the shallow model because, in the absence of hidden layers, the cost of the matrix inversion has a more significant impact on the overall runtime.

4 Conclusions

We showed that decoupling the inducing inputs for the mean and the variance is compatible with DGPs and that it improves performance. While we were not able to attain the O(L (M_b^3 + N_b M_b^2 H + N_b M_a H)) time complexity of the original decoupled parameterization due to its poor convergence characteristics, we attained a slightly higher O(L (M_a^3 + M_b^3 + N_b M_b^2 H + N_b M_a H)) bound, which approaches the former for large batch sizes and wide layers.

We demonstrated that decoupled DGPs are both faster and more accurate than full DGPs. The performance improvement was shown across three distinct datasets: the decoupled model ran faster and achieved a higher log-likelihood and a lower root-mean-square error.

In future work, we aim to devise a strategy for determining the inducing input locations. One issue that we uncovered is that the inducing inputs suffer from vanishing gradients. We want to experiment with approaches other than stochastic gradient descent to obtain inducing inputs that better describe the data.

Acknowledgment

The authors would like to acknowledge the generous support from EPSRC.

References

  • Bui et al. [2016] T. D. Bui, D. Hernández-Lobato, Y. Li, J. M. Hernández-Lobato, and R. E. Turner. Deep Gaussian Processes for Regression using Approximate Expectation Propagation. ArXiv e-prints, Feb. 2016.
  • Cheng and Boots [2017] C.-A. Cheng and B. Boots. Variational Inference for Gaussian Process Models with Linear Complexity. ArXiv e-prints, Nov. 2017.
  • Damianou and Lawrence [2012] A. C. Damianou and N. D. Lawrence. Deep Gaussian Processes. ArXiv e-prints, Nov. 2012.
  • Kingma and Ba [2014] D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. ArXiv e-prints, Dec. 2014.
  • Salimbeni and Deisenroth [2017] H. Salimbeni and M. Deisenroth. Doubly Stochastic Variational Inference for Deep Gaussian Processes. ArXiv e-prints, May 2017.
  • Snelson and Ghahramani [2006] E. Snelson and Z. Ghahramani. Sparse gaussian processes using pseudo-inputs. In Y. Weiss, B. Schölkopf, and J. C. Platt, editors, Advances in Neural Information Processing Systems 18, pages 1257–1264. MIT Press, 2006. URL http://papers.nips.cc/paper/2857-sparse-gaussian-processes-using-pseudo-inputs.pdf.
  • Titsias [2009] M. Titsias. Variational learning of inducing variables in sparse gaussian processes. In D. van Dyk and M. Welling, editors, Proceedings of the Twelth International Conference on Artificial Intelligence and Statistics, volume 5 of Proceedings of Machine Learning Research, pages 567–574, Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA, 16–18 Apr 2009. PMLR. URL http://proceedings.mlr.press/v5/titsias09a.html.
  • Williams and Rasmussen [1996] C. K. Williams and C. E. Rasmussen. Gaussian processes for regression. In Advances in neural information processing systems, pages 514–520, 1996.

Appendix A Numerical results

Tables 4, 5, 6 contain the numerical results of the experiments. The runtime is the median time of the 5 runs on a Tesla K80 GPU.

Model | Mean LL | RMSE | Runtime
DGP (no hidden layers) | 0.88 ± 0.02 | 0.10 ± 0.00 | 116 s
Decoupled DGP (no hidden layers) | 1.01 ± 0.01 | 0.09 ± 0.00 | 310 s
DGP (1 hidden layer) | 1.32 ± 0.03 | 0.07 ± 0.00 | 481 s
Decoupled DGP (1 hidden layer) | 1.34 ± 0.02 | 0.07 ± 0.00 | 433 s
DGP (2 hidden layers) | 1.38 ± 0.02 | 0.06 ± 0.00 | 853 s
Decoupled DGP (2 hidden layers) | 1.37 ± 0.01 | 0.07 ± 0.00 | 611 s
DGP (3 hidden layers) | 1.36 ± 0.03 | 0.07 ± 0.00 | 1242 s
Decoupled DGP (3 hidden layers) | 1.37 ± 0.01 | 0.07 ± 0.00 | 811 s
DGP (4 hidden layers) | 1.36 ± 0.01 | 0.07 ± 0.00 | 1601 s
Decoupled DGP (4 hidden layers) | 1.38 ± 0.03 | 0.06 ± 0.00 | 1017 s
Table 4: kin8nm (D: 9, N: 8,192)
Model | Mean LL | RMSE | Runtime
DGP (no hidden layers) | -2.93 ± 0.01 | 4.56 ± 0.06 | 587 s
Decoupled DGP (no hidden layers) | -2.90 ± 0.01 | 4.39 ± 0.05 | 1210 s
DGP (1 hidden layer) | -2.81 ± 0.01 | 4.13 ± 0.06 | 2567 s
Decoupled DGP (1 hidden layer) | -2.77 ± 0.01 | 4.04 ± 0.06 | 2015 s
DGP (2 hidden layers) | -2.76 ± 0.02 | 4.03 ± 0.07 | 4577 s
Decoupled DGP (2 hidden layers) | -2.71 ± 0.03 | 3.90 ± 0.10 | 3038 s
DGP (3 hidden layers) | -2.75 ± 0.01 | 4.00 ± 0.06 | 6598 s
Decoupled DGP (3 hidden layers) | -2.69 ± 0.03 | 3.87 ± 0.12 | 4111 s
DGP (4 hidden layers) | -2.73 ± 0.02 | 3.95 ± 0.07 | 8251 s
Decoupled DGP (4 hidden layers) | -2.68 ± 0.03 | 3.86 ± 0.11 | 5209 s
Table 5: protein (D: 9, N: 45,730)
Model | Mean LL | RMSE | Runtime
DGP (no hidden layers) | -1.79 ± 0.02 | 1.47 ± 0.04 | 1699 s
Decoupled DGP (no hidden layers) | -1.74 ± 0.02 | 1.41 ± 0.04 | 2443 s
DGP (1 hidden layer) | -1.51 ± 0.03 | 1.22 ± 0.04 | 4159 s
Decoupled DGP (1 hidden layer) | -1.39 ± 0.01 | 1.19 ± 0.03 | 3683 s
DGP (2 hidden layers) | -1.48 ± 0.04 | 1.31 ± 0.04 | 6775 s
Decoupled DGP (2 hidden layers) | -1.36 ± 0.02 | 1.25 ± 0.05 | 5121 s
DGP (3 hidden layers) | -1.46 ± 0.07 | 1.29 ± 0.08 | 9426 s
Decoupled DGP (3 hidden layers) | -1.36 ± 0.05 | 1.27 ± 0.05 | 6596 s
DGP (4 hidden layers) | -1.49 ± 0.07 | 1.32 ± 0.06 | 12122 s
Decoupled DGP (4 hidden layers) | -1.37 ± 0.05 | 1.26 ± 0.05 | 8078 s
Table 6: molecules (D: 512, N: 60,000)