Variational Implicit Processes
Abstract
This paper introduces variational implicit processes (VIPs), a Bayesian nonparametric method based on a class of highly flexible priors over functions. Similarly to Gaussian processes (GPs), an implicit process (IP) places an implicitly defined multivariate prior (data simulators, Bayesian neural networks, etc.) over any finite collection of random variables. A novel and efficient variational inference algorithm for IPs is derived using wake-sleep updates, which gives analytic solutions and allows scalable hyperparameter learning with stochastic optimization. Experiments on real-world regression datasets demonstrate that VIPs return better uncertainty estimates and superior performance over existing inference methods for GPs and Bayesian neural networks. With a Bayesian LSTM as the implicit prior, the proposed approach achieves state-of-the-art results on predicting the power conversion efficiency of molecules based on their raw chemical formulas.
Chao Ma, University of Cambridge, cm905@cam.ac.uk; Yingzhen Li, University of Cambridge, yl494@cam.ac.uk; José Miguel Hernández-Lobato, University of Cambridge, jmh233@cam.ac.uk
1 Introduction
Powerful models with implicit distributions as core components have recently attracted enormous interest in both the deep learning and the approximate Bayesian inference communities. In contrast to prescribed probabilistic models [22] that assign explicit densities to possible outcomes, implicit models assign probability measures implicitly, through the specification of a data generating process. One of the best known implicit distributions is the generator of generative adversarial nets (GANs) [30, 73, 2, 66, 8, 93], which transforms isotropic noise into high-dimensional data, say images, using neural networks. In the approximate inference context, implicit distributions have also been deployed to postulate flexible approximate posterior distributions [76, 60, 56, 87, 61, 90, 58]. However, such a generation process does not necessarily allow pointwise density evaluation, which becomes the main barrier for inference and learning.
This paper explores applications of implicit models to Bayesian nonparametric regression, which brings together the best of both worlds: the nonparametric modeling power, elegance of inference, and well-calibrated uncertainty of Bayesian nonparametric methods, combined with the strong representational power of implicit methods. Similarly to the construction of Gaussian processes (GPs), implicit stochastic processes (IPs) assign implicit distributions over any finite collection of random variables. With an IP as the prior, one can directly perform posterior inference over functions in a nonparametric fashion. Our approach therefore does not suffer from typical issues in parametric Bayesian modeling, e.g. symmetric modes in the posterior distribution of Bayesian neural network weights. However, due to the intractability of the underlying distribution of IPs, standard approximate Bayesian inference techniques cannot be directly applied. Concretely, our contributions are threefold:

We propose a generic framework for Bayesian nonparametrics with implicit stochastic process prior over functions. As IPs naturally include data simulators and Bayesian neural networks, our approach provides a new inference and learning framework for these important but challenging deep models.

We derive a novel and efficient variational framework that gives a closed-form approximation to the IP posterior. Our inference method is computationally cheap, and it allows scalable hyperparameter learning in IPs.

We conduct an extensive comparison between IPs with the proposed inference method, and GPs/Bayesian neural networks/Bayesian RNNs with existing modern variational approaches. Our method consistently outperforms the alternatives, and achieves state-of-the-art performance on the task of predicting the power conversion efficiency of molecules based on raw chemical formulas, by training an IP with a Bayesian LSTM prior.
2 Implicit stochastic processes
In this section, we generalize GPs to implicit stochastic processes. Recall that a GP defines the distribution of a random function f by placing a multivariate Gaussian distribution N(f; m_X, K_XX) over any finite collection of function values f = (f(x_1), …, f(x_n))ᵀ evaluated at any given finite collection of input locations X = (x_1, …, x_n).^1 Here (m_X)_i = m(x_i) and (K_XX)_{ij} = K(x_i, x_j), and following the Kolmogorov consistency theorem [42], the mean and covariance functions m(·), K(·,·) are shared across all such finite collections. An alternative parameterization of GPs defines the sampling process as z ∼ N(0, I), f = m_X + Lz, with L the Cholesky decomposition of the covariance matrix K_XX. Observing this, we propose a generalization of the generative process by replacing the linear transform of the latent variable z with a nonlinear one. This gives the formal definition of an implicit stochastic process as follows.

^1 The inputs x can also be unobserved, as in Gaussian process latent variable models [49].
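The reparameterization view above can be sketched in a few lines. The following minimal numpy example (the RBF kernel, lengthscale, and nonlinear map g are illustrative assumptions, not from the paper) draws a GP function via the Cholesky factor, then replaces the linear map Lz with a nonlinear g(x, z):

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    # K(x, x') = variance * exp(-(x - x')^2 / (2 * lengthscale^2)), 1-D inputs
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-d2 / (2 * lengthscale ** 2))

rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 50)                 # a finite collection of inputs
K = rbf_kernel(x, x) + 1e-6 * np.eye(len(x))   # jitter for numerical stability
L = np.linalg.cholesky(K)

# GP sample: f = m + L z, a *linear* transform of isotropic Gaussian noise z
z = rng.standard_normal(len(x))
f_gp = 0.0 + L @ z

# IP generalization: replace the linear map with a nonlinear g(x, z)
def g(x, z):
    # hypothetical nonlinear transform of a low-dimensional latent z
    return np.tanh(z[0] * x) + z[1] * np.sin(x)

f_ip = g(x, rng.standard_normal(2))
```

Both `f_gp` and `f_ip` are single function draws evaluated at the same finite collection of inputs; only the map from noise to function values differs.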
Definition 1 (noiseless implicit stochastic processes).
An implicit stochastic process (IP) is a collection of random variables f(·), such that any finite collection f = (f(x_1), …, f(x_n))ᵀ has a joint distribution implicitly defined by the following generative process:
z ∼ p(z),   f(x_i) = g_θ(x_i, z),   ∀ x_i ∈ X.   (1)
A function distributed according to the above IP is denoted as f(·) ∼ IP(g_θ(·,·), p_z).
Proposition 1.
There exists a unique stochastic process, such that any finite collection of random variables f = (f(x_1), …, f(x_n))ᵀ has a distribution implicitly defined by (1).
IPs form a rich class of priors over functions. Indeed, we visualize some examples of IPs in Figure 1 and discuss them below:
Example 1 (Data simulators).
Simulators, e.g. physics engines and climate models, are omnipresent in science and engineering. These models encode laws of physics in g_θ, use z to explain the remaining randomness, and evaluate the function at input locations x: f(x) = g_θ(x, z), z ∼ p(z). We define the neural sampler as a specific instance of this class. In this case g_θ is a neural network with weights θ, i.e., f(x) = NN_θ(x, z), and z ∼ p(z) = N(0, I).
Example 2 (Bayesian neural network).
In a Bayesian neural network the synaptic weights W are random variables (i.e., z = W) with a prior p(W) on them. A function is sampled by first drawing W ∼ p(W) and then setting f(x) = NN_W(x) for all x. In this case θ could include, e.g., the network architecture and additional hyperparameters.
Example 3 (Bayesian RNN).
Similar to Example 2, a Bayesian recurrent neural network (RNN) can be defined by considering its weights as random variables, and taking as function evaluation an output value generated by the RNN after processing the last symbol of an input sequence.
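Example 2 can be made concrete: one draw from a BNN-based IP samples all weights once and returns the resulting deterministic function, which can then be evaluated at any finite collection of inputs. A minimal numpy sketch (the one-hidden-layer architecture and unit Gaussian priors are illustrative assumptions, not the paper's experimental setup):

```python
import numpy as np

def sample_bnn_function(rng, hidden=10):
    # z = all synaptic weights, drawn once from the prior p(z) = N(0, I)
    W1 = rng.standard_normal((1, hidden))
    b1 = rng.standard_normal(hidden)
    W2 = rng.standard_normal((hidden, 1))
    b2 = rng.standard_normal(1)
    # the closure is one deterministic function f(x) = g_theta(x, z)
    def f(X):
        return (np.tanh(X @ W1 + b1) @ W2 + b2).ravel()
    return f

rng = np.random.default_rng(0)
X = np.linspace(-2.0, 2.0, 5)[:, None]
# three independent draws f ~ IP, each evaluated at the same inputs
draws = np.stack([sample_bnn_function(rng)(X) for _ in range(3)])
```

Each row of `draws` is one function sample; the randomness lives entirely in the weights, not in the evaluation.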
3 Variational implicit processes
Consider the following regression model with an IP prior over the function:
y = f(x) + ε,   ε ∼ N(0, σ²),   f(·) ∼ IP(g_θ(·,·), p_z).   (2)
Given an observed dataset {(x_n, y_n)}_{n=1}^N with inputs X and outputs y, and a set of test inputs X*, Bayesian predictive inference over the test outputs y* involves computing the predictive distribution p(y*|X*, X, y, θ), which itself requires the posterior p(f|X, y, θ). Besides prediction, we may also want to learn the prior parameters θ and noise variance σ² by maximizing the log marginal likelihood log p(y|X, θ) = log ∫ p(y|f)p(f|X, θ)df, with f the evaluation of f(·) on X. Unfortunately, both the prior p(f|X, θ) and the posterior are intractable, as the implicit process does not allow pointwise density evaluation, let alone the integration tasks. Therefore, to address these tasks, we must resort to approximate inference methods.
We propose a generalization of the wake-sleep algorithm [38] to handle both intractabilities. This method returns (1) an approximate posterior distribution which is later used for predictive inference, and (2) an approximation to the marginal likelihood for hyperparameter optimization. In this paper we choose the approximating distribution q to be the posterior distribution of a Gaussian process, q(f|X, y). A high-level summary of our algorithm is the following:

Sleep phase: sample function values f and noisy outputs y as indicated in (2). This dreamed data is then used as the maximum-likelihood (ML) target to fit a GP. This is equivalent to minimizing KL[p_θ(f, y) ∥ q(f, y)].

Wake phase: use the GP posterior q(f|X, y) to approximate the IP posterior p(f|X, y, θ), and obtain a variational approximation to log p(y|X, θ), which is then optimized with respect to θ.
Our approach has two key advantages. First, the algorithm involves no explicit sleep-phase computation, since the sleep-phase optimization has an analytic solution that can be directly plugged into the wake-phase objective. Second, the proposed wake-phase update is highly scalable, as it is equivalent to a Bayesian linear regression task with random features sampled from the implicit process. With our wake-sleep algorithm, the evaluation of the implicit prior density is no longer an obstacle for approximate inference. We call this inference framework the variational implicit process (VIP). In the following sections we give specific details on both the wake and sleep phases.
3.1 Sleep phase: GP posterior as variational distribution
This section addresses the first intractability by approximating the IP posterior p(f|X, y, θ) with a GP posterior q(f|X, y). A naive approach based on variational inference [44] would require computing the intractable joint distribution p_θ(f, y). However, sampling from this joint distribution is straightforward. Therefore, we leverage the idea of the sleep phase in the wake-sleep algorithm and approximate p_θ(f, y) with a simpler distribution q(f, y) instead. We use a GP prior for q with mean and covariance functions m(·) and K(·,·), respectively, and write the prior as q(f) = GP(f; m, K).
The sleep-phase update minimizes the KL divergence between the two joint distributions, which reduces to the following constrained optimization problem:
min_{m,K} KL[p_θ(f, y) ∥ q(f, y)],   q(f, y) = q(f)p(y|f),   q(f) = GP(f; m, K).   (3)
We further simplify the approximation by sharing the likelihood, q(y|f) = p(y|f), which reduces the objective to KL[p_θ(f) ∥ q(f)]; this objective is minimized when m and K are equal to the mean and covariance functions of the IP, respectively:
m*(x) = E_{p(z)}[g_θ(x, z)],   K*(x, x') = E_{p(z)}[(g_θ(x, z) − m*(x))(g_θ(x', z) − m*(x'))].   (4)
In the following we also write the optimal solution as q*(f; θ) = GP(f; m*, K*) to explicitly specify the dependency on the prior parameters θ. In practice, the mean and covariance functions are estimated by Monte Carlo, which leads to maximum-likelihood training for the GP with dreamed data. Assume S functions are drawn from the IP: f_s(·) = g_θ(·, z_s), z_s ∼ p(z), s = 1, …, S. The optimum of (3) is then approximated by the MLE solution:
m*(x) ≈ m̂(x) = (1/S) Σ_{s=1}^S f_s(x),   (5)
K*(x, x') ≈ K̂(x, x') = (1/S) Σ_{s=1}^S (f_s(x) − m̂(x))(f_s(x') − m̂(x')).   (6)
To reduce computational costs, the number S of dreamed samples is often small. Therefore, we perform maximum a posteriori/posterior mean estimation instead, by placing an inverse Wishart process prior [81] over the GP covariance function (see Appendix C.2).
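The Monte Carlo estimates (5)-(6) amount to averaging S dreamed draws for the mean and forming the empirical covariance of the centered draws. A self-contained numpy sketch (the toy "implicit process" of random sinusoids is a made-up example for illustration):

```python
import numpy as np

def empirical_gp(sample_f, X, S=20, rng=None):
    """Sleep-phase MLE fit of the GP from S dreamed function draws.

    sample_f(rng, X) returns one function draw evaluated at X, shape (N,).
    Returns the empirical mean function (5) and covariance (6).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    F = np.stack([sample_f(rng, X) for _ in range(S)])  # (S, N) dreamed data
    m_hat = F.mean(axis=0)                              # empirical mean function
    R = F - m_hat                                       # centered draws
    K_hat = R.T @ R / S                                 # empirical covariance, rank <= S
    return m_hat, K_hat

# usage with a toy implicit process: random affine-in-latent sinusoids
def sample_f(rng, X):
    a, b = rng.standard_normal(2)
    return a * np.sin(X) + b

X = np.linspace(0.0, 1.0, 4)
m_hat, K_hat = empirical_gp(sample_f, X, S=500)
```

`K_hat` is symmetric positive semi-definite by construction, which is why a small S keeps all downstream linear algebra cheap.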
The original sleep-phase algorithm in [38] also finds a posterior approximation by minimizing (3). However, the original approach would define the approximation as q(f, y) = q(f|y)p(y), which builds a recognition model q(f|y) that can be directly transferred for later inference. By contrast, we define q(f, y) = q(f)p(y|f), which corresponds to an approximation of the IP prior. In other words, we approximate an intractable generative model using another generative model with a GP prior, and later the resulting GP posterior is employed as the variational distribution. Importantly, we never explicitly perform the sleep-phase updates, as there is an analytic solution readily available, which can potentially save an enormous amount of computation in practice.
Another interesting observation is that the sleep-phase objective also provides an upper bound on the KL divergence between the posterior distributions, KL[p_θ(f|y) ∥ q(f|y)]. Notice that, according to the chain rule of the KL divergence, KL[p_θ(f, y) ∥ q(f, y)] can be rewritten as:
KL[p_θ(f, y) ∥ q(f, y)] = KL[p_θ(y) ∥ q(y)] + E_{p_θ(y)}[KL[p_θ(f|y) ∥ q(f|y)]].   (7)
By the non-negativity of the KL divergence, KL[p_θ(f, y) ∥ q(f, y)] is an upper bound on E_{p_θ(y)}[KL[p_θ(f|y) ∥ q(f|y)]], which justifies (3) as a valid variational objective for posterior approximation. Therefore, during the sleep phase, the posterior approximation gap is also decreased when the mean and covariance functions are optimized.
3.2 Wake phase: Bayesian linear regression over random functions
In the wake phase of the traditional wake-sleep algorithm, the prior parameters θ are optimized by maximizing the variational lower bound [44] to the log marginal likelihood log p(y|X, θ). Unfortunately, this requires evaluating the IP prior, which is again intractable. But recall from (7) that KL[p_θ(y) ∥ q(y)] is also minimized during the sleep phase. Therefore we directly approximate the log marginal likelihood using the optimal GP from the sleep phase, i.e.
log p(y|X, θ) ≈ log q*(y|X; θ).   (8)
This again demonstrates the key advantage of the proposed sleep-phase update via generative-model matching. It is also a sensible objective for predictive inference, as the GP returned by wake-sleep will be used at prediction time anyway.
Similarly to GP regression, optimizing (8) can be computationally expensive for large datasets. Sparse GP approximation techniques [84, 85, 35, 13] are applicable here, but we leave them to future work and instead consider an alternative approach related to random feature approximations of GPs [74, 29, 27, 3, 50]. Note that log q*(y|X; θ) can be approximated by the log marginal likelihood of a Bayesian linear regression model with randomly sampled dreamed functions f_s and coefficients a = (a_1, …, a_S)ᵀ:
log q*(y|X; θ) ≈ log ∫ Π_{n=1}^N N(y_n; f(x_n), σ²) p(a) da,   (9)
f(x) = m̂(x) + (1/√S) Σ_{s=1}^S a_s (f_s(x) − m̂(x)),   a_s ∼ N(0, 1).   (10)
It is then straightforward to apply variational inference again for scalable stochastic optimization, and we follow [57, 37, 54] to approximate (9) by the α-energy (see Appendix B):
L_α(q) = −(1/α) Σ_{n=1}^N log E_{q(a)}[p(y_n|x_n, a, θ)^α] + KL[q(a) ∥ p(a)].   (11)
When α → 0, minimizing the energy is equivalent to maximizing the variational lower bound, and empirically the energy has better approximation accuracy when α > 0 [57, 37, 54]. Also, since the prior p(a) is conjugate to the Gaussian likelihood, the exact posterior of a can be reached by a correlated Gaussian q(a) = N(μ_a, Σ_a). Stochastic optimization is applied to the energy w.r.t. θ, σ² and q(a) jointly, making our approach highly scalable to big datasets.
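Because the prior over a is conjugate to the Gaussian likelihood, the wake-phase regression has a closed-form posterior; the following numpy sketch applies standard Bayesian linear regression algebra to the random features φ_s(x) = (f_s(x) − m̂(x))/√S (the synthetic data at the bottom are placeholders for real dreamed draws):

```python
import numpy as np

def wake_phase_posterior(F, m, y, sigma2):
    """Exact posterior q(a) = N(mu_a, Sigma_a) for f = m + Phi a, a ~ N(0, I).

    F: (S, N) dreamed function draws at the training inputs
    m: (N,) mean function values, y: (N,) targets, sigma2: noise variance
    """
    S = F.shape[0]
    Phi = (F - m).T / np.sqrt(S)            # (N, S) random features
    A = np.eye(S) + Phi.T @ Phi / sigma2    # S x S system: cheap when S is small
    Sigma_a = np.linalg.inv(A)
    mu_a = Sigma_a @ Phi.T @ (y - m) / sigma2
    return mu_a, Sigma_a, Phi

# synthetic stand-ins for dreamed draws and observations
rng = np.random.default_rng(0)
S, N = 20, 100
F = rng.standard_normal((S, N))
m = F.mean(axis=0)
y = m + 0.1 * rng.standard_normal(N)
mu_a, Sigma_a, _ = wake_phase_posterior(F, m, y, sigma2=0.01)
```

All linear algebra is in the S-dimensional coefficient space, never the N-dimensional function space, which is what makes the wake phase scalable.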
3.3 Computational complexity and scalable predictive inference
Assume the evaluation of a sampled function value g_θ(x, z_s) for a given input takes O(C) time. The VIP then has time complexity O(SCM) per iteration in training, where M is the size of a minibatch and S is the number of random functions sampled from the IP. Note that approximate inference techniques in weight space, e.g. a mean-field Gaussian approximation to the posterior of Bayesian neural network weights [9, 37, 54], also take O(SCM) time when S Monte Carlo samples are used. Therefore, when C is large (typically the case for neural networks), the additional cost is often negligible, as S is usually significantly smaller than the typical number of inducing points in sparse GPs (S = 20 in the experiments).
Predictive inference follows the standard GP equations to compute q*(f*|X*, X, y) on test data X* containing K datapoints:
q*(f*|X*, X, y) = N(f*; μ*, Σ*),   μ* = m̂* + K̂*X (K̂XX + σ²I)⁻¹ (y − m̂X),   Σ* = K̂** − K̂*X (K̂XX + σ²I)⁻¹ K̂X*.   (12)
Recall that the optimal variational GP approximation has mean and covariance functions defined by (5) and (6), respectively, which means K̂ has rank at most S. Therefore naive predictive inference requires O((N + K)SC) function evaluations and an N × N matrix inversion, which costs O(N³) time. This complexity can be further reduced: note that the computational cost is dominated by the inversion of the matrix K̂XX + σ²I. Denote by Φ the N × S matrix with entries Φ_{ns} = (f_s(x_n) − m̂(x_n))/√S, so that K̂XX = ΦΦᵀ. It is straightforward to show that in the Bayesian linear regression problem (10) the exact posterior of a is q(a) = N(μ_a, Σ_a), with Σ_a = (I + σ⁻²ΦᵀΦ)⁻¹ and μ_a = σ⁻²Σ_aΦᵀ(y − m̂X). Therefore the parameters of the GP predictive distribution in (12) reduce to:
μ* = m̂* + Φ*μ_a,   Σ* = Φ*Σ_aΦ*ᵀ,   (13)
with Φ* the K × S feature matrix evaluated at the test inputs.
This reduces the prediction cost to O((N + K)S² + S³), which is on par with, e.g., conventional predictive inference techniques for Bayesian neural networks based on S forward passes. In practice we use the mean and covariance matrix of the exact posterior q(a) to compute the predictive distribution. Alternatively, one can directly sample a ∼ q(a) and compute f* = m̂* + Φ*a, which is also a valid inference approach^2 but is prone to higher variance.

^2 If q(a) is a mean-field Gaussian distribution then the cost is reduced to O((N + K)S).
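The steps above can be sketched end to end: fit q(a) once, then predict at K test points touching only S-dimensional quantities, as in (13). A self-contained numpy example (all data here are synthetic placeholders for dreamed draws and observations):

```python
import numpy as np

rng = np.random.default_rng(1)
S, N, K = 20, 100, 7
sigma2 = 0.01

# dreamed draws at the N training and K test inputs (synthetic stand-ins)
F = rng.standard_normal((S, N))
F_star = rng.standard_normal((S, K))
m, m_star = F.mean(axis=0), F_star.mean(axis=0)
y = m + 0.1 * rng.standard_normal(N)

# wake-phase posterior over the S coefficients a (conjugate Gaussian algebra)
Phi = (F - m).T / np.sqrt(S)                        # (N, S)
Sigma_a = np.linalg.inv(np.eye(S) + Phi.T @ Phi / sigma2)
mu_a = Sigma_a @ Phi.T @ (y - m) / sigma2

# predictive mean and marginal variances at the K test points, cf. (13)
Phi_star = (F_star - m_star).T / np.sqrt(S)         # (K, S)
pred_mean = m_star + Phi_star @ mu_a
# diag(Phi* Sigma_a Phi*^T) gives per-point function variance; add noise
pred_var = np.einsum('ks,st,kt->k', Phi_star, Sigma_a, Phi_star) + sigma2
```

Note that the N × N kernel matrix is never formed, and the per-test-point cost is quadratic in S only.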
4 Related research
In the world of nonparametric models, Gaussian processes (GPs) [75] provide a principled and flexible probabilistic framework for Bayesian inference over functions. Their Bayesian nature enables GPs to provide accurate uncertainty estimates on unseen data, all wrapped up in a single, exact, closed-form solution for posterior inference. Despite the success and popularity of GPs over the past decades, their O(N³) time and O(N²) space complexities make them impractical for large-scale datasets, so one often has to resort to complicated approximate methods [80, 72, 84, 85, 35, 13, 12, 77, 16, 88]. Another intrinsic issue is the limited representational power of GP kernels: it has been argued that the stationary kernels commonly used for nonlinear regression cannot obtain hierarchical representations of high-dimensional data, which limits the usefulness of GP methods [7].
In the world of parametric modeling, on the other hand, it is well known that deep neural networks [6, 39, 78, 48, 83, 23] are extremely flexible function approximators that enable learning from very high-dimensional and structured data. Deep learning is now applied to an enormous range of tasks, including computer vision [48, 83, 23] and speech recognition [51, 67, 45, 15]. As deep learning techniques start to be applied to critical applications such as automated driving and health care, uncertainty quantification for neural networks has become increasingly important. Although decent progress has been made [20, 41, 4, 69, 31, 9, 36, 54], uncertainty in deep learning remains an open challenge.
The Gaussian process-neural network correspondence has been extensively explored in order to improve the understanding of both worlds. Bayesian neural nets (BNNs) with infinitely wide hidden layers and certain prior distributions have been studied from a GP perspective, e.g. [68, 69, 91] for single hidden layers and [33, 27, 53, 62] for deeper nets. Notably, [68, 27] show that a one-layer Bayesian neural network with nonlinearity σ(·) and a mean-field Gaussian prior is approximately equivalent to a Gaussian process whose kernel is the expected inner product of the hidden-layer features.
Later, [53] and [62] generalized this result and proved that deep Bayesian neural networks are approximately equivalent to Gaussian processes with compositional kernels [14, 34, 19, 70] that mimic the deep net. These approaches allow us to construct expressive kernels for GPs [47], or conversely, to exploit exact Bayesian inference in GPs to perform exact Bayesian prediction for deep neural nets [53]. We compare the above kernels with equation (6) in Appendix D.
A number of alternative schemes have also been investigated to exploit deep structures in GP model design. These include: (1) deep GPs [18, 17, 11], where compositions of GP priors are proposed to represent priors over compositional functions; (2) the search and design of kernels for accurate and efficient learning [89, 24, 86, 3, 5, 79]; and (3) deep kernel learning, which uses deep neural network features as the inputs to GPs [40, 92, 1, 10, 43]. Frustratingly, the first two approaches still struggle to model high-dimensional structured data such as texts and images, and the third approach is not fully Bayesian, i.e. the model is Bayesian only w.r.t. the last output layer.
Our work is different in spirit from the above two lines: the intention is neither to understand BNNs as GPs nor to use deep learning concepts to aid GP design. Instead, we directly model a BNN as an instance of implicit processes (IPs), and the GP is used as a variational distribution to assist predictive inference.^3 Our approach therefore retains the Bayesian nonparametric nature of IPs. This variational approximation requires neither the assumptions of the GP-BNN correspondence literature (infinite hidden units, i.i.d. weight priors, etc.) [53, 62] nor the conditions of the compositional kernel literature. Instead, the optimal kernel (6) from the sleep phase applies to any IP, which includes BNNs and beyond. A very recent work [26] also minimizes the sleep-phase KL divergence (3), but w.r.t. the BNN prior: their goal is to regularize BNN priors and implant some smoothness properties of GPs into BNNs. By contrast, our approach takes advantage of the BNN prior over functions to better encode rich structure. Moreover, [26] still performs variational posterior inference in weight space, whereas our inference method operates in function space, which allows better uncertainty quantification.

^3 In principle any process can be used here as the variational distribution; we use GPs for the convenience of analytic approximations.
From a practical point of view, the proposed inference method is computationally cheap, and it allows scalable learning of hyperparameters. The additional cost is negligible when the computation is dominated by the evaluation of, e.g., BNN function samples. In the case where the GP approximation dominates the cost, our approach requires neither expensive matrix inversions nor complicated kernel compositions that only have analytic forms in restricted cases [53, 47].
5 Experiments
We evaluate VIPs on regression tasks using real-world data. For small datasets, we use the posterior GP equations (12) for prediction, and for large datasets, we use the approximation (13). We use S = 20 for VIP unless noted otherwise. When the VIP is equipped with a Bayesian NN/LSTM as the prior over functions, the prior parameters over each weight are untied and can thus be individually tuned. Fully Bayesian estimates of the prior parameters are used in experiments 5.3 and 5.4. We focus on comparing VIPs to other fully Bayesian approaches.
5.1 Toy example
We first evaluate the predictive mean and uncertainty of our method on a toy regression example. The training set is generated by first sampling 300 inputs x from a standard Gaussian distribution. Then, for each x obtained, the corresponding target y is sampled from the ground-truth noisy process. The test set consists of 1000 evenly spaced input points. We use an IP whose prior is a Bayesian neural network with two hidden layers of 10 units each (10-10-1). We use S = 20 for the wake-sleep training. We also compare with a full exact GP with RBF kernel and a variational Bernoulli dropout Bayesian neural network (VDO) with the same network structure. All methods (including GP hyperparameters) are trained using 500 full-batch epochs with the Adam optimizer (learning rate 0.01). The variational dropout baseline uses a fixed dropout rate and length scale.
Figure 2 shows the performance of each method. The VIP predictive mean is closer to the ground-truth function than those of variational dropout and the GP. More importantly, the VIP provides the best predictive uncertainty: it increases smoothly while approaching the boundary of the data, as expected since the inputs are sampled from a standard Gaussian, whereas the uncertainty of variational dropout does not. Although the uncertainty of the GP also increases where data become sparser, the GP slightly overfits the data and tends to extrapolate towards a zero-mean function where data are sparse. Test log-likelihood and RMSE support similar conclusions, see Table 1.
Method  VIP  VDO  GP
Test log likelihood  0.62
Test RMSE  0.140  0.187  0.144
5.2 Time series interpolation: Solar irradiance
We compare the VIP (S = 20) with a variational sparse GP, an exact GP, and Bayesian neural networks on the solar irradiance dataset [52]. The dataset is split according to [29], where 5 segments of length 20 are removed for interpolation. All inputs are then centered to zero mean, and the targets are scaled by dividing by the data standard deviation. Settings for all methods are the same as in Section 5.1, except that we run Adam with learning rate 0.001 for 5000 iterations. Predictive interpolations are shown in Figure 3. VIP and VDO give similar interpolation behaviors, while the sparse variational GP (with 100 inducing points) and the exact GP perform nearly the same. Table 2 gives the test log-likelihood and RMSE for these methods, showing that the VIP gives the best performance, indicating that it returns high-quality uncertainties and accurate mean predictions.
It is interesting to note that, compared with the VIP, VDO underestimates uncertainty, especially for data points inside the removed segments. VDO also gives a wrong mean estimate for the nearly constant data points in one of the intervals, whereas the VIP is able to recover the correct mean estimate around this interval with high confidence. Meanwhile, the GP methods underestimate the uncertainty on training data and recover the exact mean of the training data, but often give poor estimates of the predictive mean in the interpolation regions.
Method  VIP  VDO  SVGP  GP
Test log likelihood  0.126
Test RMSE  0.237  0.363  0.668  0.650
5.3 Multivariate regression
We perform additional experiments on real-life multivariate regression datasets from the UCI data repository [59]. We train a VIP equipped with a Bayesian neural net as the prior over latent functions (VIP-BNN), and compare it with other popular modern variational inference methods for Bayesian neural nets, including variational Gaussian inference with the reparameterization trick [9], variational dropout [27], and variational alpha inference by dropout [54]. A variational sparse GP [85] as well as a full GP [75] are also included for comparison. We also train a VIP with a neural sampler prior (VIP-NS), as defined in Section 2. All neural networks use two hidden layers of size 10. All methods are trained using 10 random splits (5 splits on the Protein Structure dataset) of each dataset and 1K epochs of full-batch training using the Adam optimizer (learning rate 0.01). The observational noise variance for the VIP and variational/alpha-dropout methods is tuned using fast grid search over a validation set, as detailed in Appendix E. The α value for both VIP and alpha-variational inference is fixed to 0.5, as suggested in [54, 37]. Full details are available in the appendix.
Dataset  N  D  VIP-BNN  VIP-NS  VI  VDO  α-VDO  SVGP  GP

boston  506  13  2.45  0.04  2.45  0.03  2.76  0.04  2.63  0.10  2.45  0.02  2.63  0.04  2.46  0.04 
concrete  1030  8  3.02  0.02  3.13  0.02  3.28  0.01  3.23  0.01  3.06  0.03  3.4  0.01  3.05  0.02 
energy  768  8  0.60  0.03  0.59  0.04  2.17  0.02  1.13  0.02  0.95  0.09  2.31  0.02  0.57  0.02 
kin8nm  8192  8  1.12  0.01  1.05  0.00  0.81  0.01  0.83  0.01  0.92  0.02  0.76  0.00  N/A  0.00 
power  9568  4  2.92  0.00  2.90  0.00  2.83  0.01  2.88  0.00  2.81  0.00  2.82  0.00  N/A  0.00 
protein  45730  9  2.87  0.00  2.96  0.02  3.00  0.00  2.99  0.00  2.90  0.00  3.01  0.00  N/A  0.00 
red wine  1588  11  0.97  0.02  1.20  0.04  1.01  0.02  0.97  0.02  1.01  0.02  0.98  0.02  0.26  0.03 
yacht  308  6  0.02  0.07  0.59  0.13  1.11  0.04  1.22  0.18  0.79  0.11  2.29  0.03  0.10  0.05 
naval  11934  16  5.62  0.04  4.11  0.00  2.80  0.00  2.80  0.00  2.97  0.14  7.81  0.00  N/A  0.00 
Avg.Rank  1.77  0.54  2.77  0.57  4.66  0.28  3.88  0.38  2.55  0.37  4.44  0.66  N/A  0.00 
Dataset  N  D  VIP-BNN  VIP-NS  VI  VDO  α-VDO  SVGP  GP

boston  506  13  2.88  0.14  2.78  0.12  3.85  0.22  3.15  0.11  3.06  0.09  3.30  0.21  2.95  0.12 
concrete  1030  8  4.81  0.13  5.54  0.09  6.51  0.10  6.11  0.10  5.18  0.16  7.25  0.15  5.31  0.15 
energy  768  8  0.45  0.01  0.45  0.05  2.07  0.05  0.74  0.04  0.51  0.03  2.39  0.06  0.45  0.01 
kin8nm  8192  8  0.07  0.00  0.08  0.00  0.10  0.00  0.10  0.00  0.09  0.00  0.11  0.01  N/A  0.00 
power  9568  4  4.11  0.05  4.11  0.04  4.11  0.04  4.38  0.03  4.08  0.00  4.06  0.04  N/A  0.00 
protein  45730  9  4.25  0.07  4.54  0.03  4.88  0.04  4.79  0.01  4.46  0.00  4.90  0.01  N/A  0.00 
red wine  1588  11  0.64  0.01  0.66  0.01  0.66  0.01  0.64  0.01  0.69  0.01  0.65  0.01  0.62  0.01 
yacht  308  6  0.32  0.06  0.54  0.09  0.79  0.05  1.03  0.06  0.49  0.04  2.25  0.13  0.35  0.04 
naval  11934  16  0.00  0.00  0.00  0.00  0.38  0.00  0.01  0.00  0.01  0.00  0.00  0.00  N/A  0.00 
Avg.Rank  1.33  0.23  2.22  0.36  4.66  0.33  4.00  0.44  3.11  0.42  4.44  0.72  N/A  0.00 
Results are shown in Tables 3 and 4, where the best results are boldfaced. Note that full GP models are only trained on the small datasets due to their expensive cost, and are therefore not included in the overall ranking. The results show that the VIP-based methods, especially VIP-BNN, consistently outperform the other methods on almost all datasets: VIP methods obtain the best results on 7 out of the 9 datasets in terms of average test log-likelihood, and 8 out of the 9 datasets in terms of average RMSE. Overall, VIP-BNN obtains the best ranking among the six ranked methods. It is also encouraging that, despite its general form, VIP-NS achieves the second-best average ranking, outperforming many specifically designed BNN algorithms. When compared with the full GP, the VIP outperforms it on 3 of the 5 evaluated datasets in terms of average test log-likelihood, and 4 of 5 in terms of average RMSE, despite the fact that only S = 20 samples are used.
5.4 Bayesian LSTM for predicting the power conversion efficiency of organic photovoltaic molecules
We perform additional experiments with data from the Harvard Clean Energy Project, the world's largest materials high-throughput virtual screening effort [32]. It has scanned a large number of organic photovoltaic molecules to find those with high power conversion efficiency (PCE) using quantum-chemical techniques. The target value in this dataset is the PCE of each molecule, and the input is the variable-length sequential character data that represents the molecule structure. Previous studies used expert-designed [71, 11, 37] or learned [25] fingerprint features that transform the raw string data into fixed-size features for prediction.
We use a VIP equipped with a prior defined by a Bayesian LSTM with latent state size 200, and set α = 0.5. We replicate the experimental settings in [11, 37], except that our method directly takes the raw sequential molecule structure data as input. We compare our approach with variational dropout for recurrent nets (VDO-LSTM) [28], alpha-variational inference LSTM [54], the BB-α Bayesian neural net [37], variational inference Bayesian neural nets, FITC GPs, and deep GPs trained with expectation propagation [11]. Results for the latter four methods are obtained directly from [37, 11]. Detailed configurations can be found in the appendix. The results in Table 5 show that the VIP significantly outperforms the baseline methods and achieves state-of-the-art results in both test log-likelihood and RMSE.
Metric  VIP  VDO-LSTM  α-LSTM  BB-α  VI-BNN  FITC-GP  EP-DGP
Test LLH  0.65  0.01  1.24  0.01  2.06  0.02  0.74  0.01  1.37  0.02  1.25  0.00  0.98  0.00 
Test RMSE  0.88  0.02  0.93  0.01  1.38  0.02  1.08  0.01  1.07  0.01  1.35  0.00  1.17  0.00 
6 Conclusions
We presented a variational approach for learning and inference in Bayesian nonparametric regression models with implicit process priors. It provides a powerful framework that combines the rich representational power of implicit models with the well-calibrated uncertainty estimates of Bayesian nonparametric models. With Bayesian neural networks as the implicit process prior, our approach outperformed many existing Gaussian process/Bayesian neural network methods and achieved significantly improved results on the molecule regression data. Many directions remain to be explored: classification models with implicit process priors; implicit process latent variable models, derived in a similar fashion to Gaussian process latent variable models [49]; and novel inference methods for models equipped with other implicit process priors, e.g. data simulators in astrophysics, ecology, and climate science.
Acknowledgements
We thank Jiri Hron, Rich Turner and Cheng Zhang for discussions. Chao Ma thanks Microsoft Research for supporting his research through a donation.
References
 [1] Maruan Al-Shedivat, Andrew Gordon Wilson, Yunus Saatchi, Zhiting Hu, and Eric P Xing. Learning scalable deep kernels with recurrent structure. The Journal of Machine Learning Research, 18(1):2850–2886, 2017.
 [2] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv:1701.07875, 2017.
 [3] Matej Balog, Balaji Lakshminarayanan, Zoubin Ghahramani, Daniel M Roy, and Yee Whye Teh. The Mondrian kernel. arXiv preprint arXiv:1606.05241, 2016.
 [4] David Barber and Christopher M Bishop. Ensemble learning in Bayesian neural networks. NATO ASI SERIES F COMPUTER AND SYSTEMS SCIENCES, 168:215–238, 1998.
 [5] Daniel Beck and Trevor Cohn. Learning kernels over strings using Gaussian processes. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), volume 2, pages 67–73, 2017.
 [6] Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
 [7] Yoshua Bengio, Olivier Delalleau, and Nicolas Le Roux. The curse of dimensionality for local kernel machines. Technical Report 1258, 2005.
 [8] David Berthelot, Tom Schumm, and Luke Metz. BEGAN: Boundary equilibrium generative adversarial networks. arXiv:1703.10717, 2017.
 [9] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. arXiv:1505.05424, 2015.
 [10] John Bradshaw, Alexander G de G Matthews, and Zoubin Ghahramani. Adversarial examples, uncertainty, and transfer testing robustness in Gaussian process hybrid deep networks. arXiv:1707.02476, 2017.
 [11] Thang Bui, Daniel Hernández-Lobato, José Hernández-Lobato, Yingzhen Li, and Richard Turner. Deep Gaussian processes for regression using approximate expectation propagation. In International Conference on Machine Learning, pages 1472–1481, 2016.
 [12] Thang D Bui and Richard E Turner. Tree-structured Gaussian process approximations. In Advances in Neural Information Processing Systems, pages 2213–2221, 2014.
 [13] Thang D Bui, Josiah Yan, and Richard E Turner. A unifying framework for sparse Gaussian process approximation using power expectation propagation. arXiv:1605.07066, 2016.
 [14] Youngmin Cho and Lawrence K Saul. Kernel methods for deep learning. In Advances in Neural Information Processing Systems, pages 342–350, 2009.
 [15] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM, 2008.
 [16] John P Cunningham, Krishna V Shenoy, and Maneesh Sahani. Fast Gaussian process methods for point process intensity estimation. In Proceedings of the 25th International Conference on Machine Learning, pages 192–199. ACM, 2008.
 [17] Kurt Cutajar, Edwin V Bonilla, Pietro Michiardi, and Maurizio Filippone. Random feature expansions for deep Gaussian processes. arXiv:1610.04386, 2016.
 [18] Andreas Damianou and Neil Lawrence. Deep Gaussian processes. In Artificial Intelligence and Statistics, pages 207–215, 2013.
 [19] Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances in Neural Information Processing Systems, pages 2253–2261, 2016.
 [20] John Denker and Yann Lecun. Transforming neural-net output levels to probability distributions. In Advances in Neural Information Processing Systems 3. Citeseer, 1991.
 [21] Stefan Depeweg, José Miguel Hernández-Lobato, Finale Doshi-Velez, and Steffen Udluft. Learning and policy search in stochastic dynamical systems with Bayesian neural networks. arXiv:1605.07127, 2016.
 [22] Peter J Diggle and Richard J Gratton. Monte Carlo methods of inference for implicit statistical models. Journal of the Royal Statistical Society. Series B (Methodological), pages 193–227, 1984.
 [23] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, pages 647–655, 2014.
 [24] David Duvenaud, James Robert Lloyd, Roger Grosse, Joshua B Tenenbaum, and Zoubin Ghahramani. Structure discovery in nonparametric regression through compositional kernel search. arXiv:1302.4922, 2013.
 [25] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pages 2224–2232, 2015.
 [26] Daniel Flam-Shepherd, James Requeima, and David Duvenaud. Mapping Gaussian process priors to Bayesian neural networks. NIPS Bayesian Deep Learning Workshop, 2017.
 [27] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.
 [28] Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1019–1027, 2016.
 [29] Yarin Gal and Richard Turner. Improving the Gaussian process sparse spectrum approximation by representing uncertainty in frequency inputs. In International Conference on Machine Learning, pages 655–664, 2015.
 [30] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
 [31] Alex Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 2348–2356, 2011.
 [32] Johannes Hachmann, Roberto Olivares-Amaya, Adrian Jinich, Anthony L Appleton, Martin A Blood-Forsythe, Laszlo R Seress, Carolina Roman-Salgado, Kai Trepte, Sule Atahan-Evrenk, Süleyman Er, et al. Lead candidates for high-performance organic photovoltaics from high-throughput quantum chemistry: the Harvard Clean Energy Project. Energy & Environmental Science, 7(2):698–704, 2014.
 [33] Tamir Hazan and Tommi Jaakkola. Steps toward deep kernel methods from infinite neural networks. arXiv:1508.05133, 2015.
 [34] Uri Heinemann, Roi Livni, Elad Eban, Gal Elidan, and Amir Globerson. Improper deep kernels. In Artificial Intelligence and Statistics, pages 1159–1167, 2016.
 [35] James Hensman, Nicolo Fusi, and Neil D Lawrence. Gaussian processes for big data. arXiv:1309.6835, 2013.
 [36] José Miguel Hernández-Lobato and Ryan P Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. arXiv:1502.05336, 2015.
 [37] José Miguel Hernández-Lobato, Yingzhen Li, Mark Rowland, Daniel Hernández-Lobato, Thang Bui, and Richard Eric Turner. Black-box α-divergence minimization. 2016.
 [38] Geoffrey E Hinton, Peter Dayan, Brendan J Frey, and Radford M Neal. The "wake-sleep" algorithm for unsupervised neural networks. Science, 268(5214):1158, 1995.
 [39] Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
 [40] Geoffrey E Hinton and Ruslan R Salakhutdinov. Using deep belief nets to learn covariance kernels for Gaussian processes. In Advances in Neural Information Processing Systems, pages 1249–1256, 2008.
 [41] Geoffrey E Hinton and Drew Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, pages 5–13. ACM, 1993.
 [42] Kiyosi Itô. An Introduction to Probability Theory. Cambridge University Press, 1984.
 [43] Tomoharu Iwata and Zoubin Ghahramani. Improving output uncertainty estimation and generalization in deep learning via neural network Gaussian processes. arXiv:1707.05922, 2017.
 [44] Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Machine learning, 37(2):183–233, 1999.
 [45] Nal Kalchbrenner and Phil Blunsom. Recurrent continuous translation models. In EMNLP, volume 3, page 413, 2013.
 [46] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv:1312.6114, 2013.
 [47] Karl Krauth, Edwin V Bonilla, Kurt Cutajar, and Maurizio Filippone. AutoGP: Exploring the capabilities and limitations of Gaussian process models. arXiv:1610.05392, 2016.
 [48] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
 [49] Neil D Lawrence. Gaussian process latent variable models for visualisation of high dimensional data. In Advances in Neural Information Processing Systems, pages 329–336, 2004.
 [50] Miguel Lázaro-Gredilla, Joaquin Quiñonero-Candela, Carl Edward Rasmussen, and Aníbal R Figueiras-Vidal. Sparse spectrum Gaussian process regression. Journal of Machine Learning Research, 11(Jun):1865–1881, 2010.
 [51] Quoc V Le. Building highlevel features using large scale unsupervised learning. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 8595–8598. IEEE, 2013.
 [52] Judith Lean, Juerg Beer, and Raymond Bradley. Reconstruction of solar irradiance since 1610: Implications for climate change. Geophysical Research Letters, 22(23):3195–3198, 1995.
 [53] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as Gaussian processes. arXiv:1711.00165, 2017.
 [54] Yingzhen Li and Yarin Gal. Dropout inference in Bayesian neural networks with alpha-divergences. arXiv:1703.02914, 2017.
 [55] Yingzhen Li, José Miguel Hernández-Lobato, and Richard E Turner. Stochastic expectation propagation. In Advances in Neural Information Processing Systems, pages 2323–2331, 2015.
 [56] Yingzhen Li and Qiang Liu. Wild variational approximations.
 [57] Yingzhen Li and Richard E Turner. Rényi divergence variational inference. In Advances in Neural Information Processing Systems, pages 1073–1081, 2016.
 [58] Yingzhen Li, Richard E Turner, and Qiang Liu. Approximate inference with amortised MCMC. arXiv:1702.08343, 2017.
 [59] Moshe Lichman et al. UCI machine learning repository, 2013.
 [60] Qiang Liu and Yihao Feng. Two methods for wild variational inference. arXiv:1612.00081, 2016.
 [61] Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pages 2370–2378, 2016.
 [62] Alexander G de G Matthews, Mark Rowland, Jiri Hron, Richard E Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. arXiv:1804.11271, 2018.
 [63] Alexander G de G Matthews, Mark Van Der Wilk, Tom Nickson, Keisuke Fujii, Alexis Boukouvalas, Pablo LeónVillagrá, Zoubin Ghahramani, and James Hensman. GPflow: A Gaussian process library using tensorflow. The Journal of Machine Learning Research, 18(1):1299–1304, 2017.
 [64] Thomas Minka. Power EP. Technical report, Microsoft Research, Cambridge, 2004.
 [65] Thomas P Minka. Expectation propagation for approximate Bayesian inference. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pages 362–369. Morgan Kaufmann Publishers Inc., 2001.
 [66] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv:1411.1784, 2014.
 [67] Abdel-rahman Mohamed, George E Dahl, and Geoffrey Hinton. Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):14–22, 2012.
 [68] Radford M Neal. Priors for infinite networks. In Bayesian Learning for Neural Networks, pages 29–53. Springer, 1996.
 [69] Radford M Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012.
 [70] Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems, pages 3360–3368, 2016.
 [71] Edward O Pyzer-Knapp, Kewei Li, and Alán Aspuru-Guzik. Learning from the Harvard Clean Energy Project: The use of neural networks to accelerate materials discovery. Advanced Functional Materials, 25(41):6495–6502, 2015.
 [72] Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939–1959, 2005.
 [73] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:1511.06434, 2015.
 [74] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2008.
 [75] Carl Edward Rasmussen and Christopher KI Williams. Gaussian processes for machine learning, volume 1. MIT press Cambridge, 2006.
 [76] Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv:1505.05770, 2015.
 [77] Yunus Saatçi. Scalable inference for structured Gaussian process models. PhD thesis, University of Cambridge, 2012.
 [78] Ruslan Salakhutdinov and Geoffrey E Hinton. Deep Boltzmann machines. In AISTATS, volume 1, page 3, 2009.
 [79] Yves-Laurent Kom Samo and Stephen Roberts. String Gaussian process kernels. arXiv:1506.02239, 2015.
 [80] Matthias Seeger, Christopher Williams, and Neil Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In Artificial Intelligence and Statistics 9, number EPFL-CONF-161318, 2003.
 [81] Amar Shah, Andrew Wilson, and Zoubin Ghahramani. Student-t processes as alternatives to Gaussian processes. In Artificial Intelligence and Statistics, pages 877–885, 2014.
 [82] Jiaxin Shi, Jianfei Chen, Jun Zhu, Shengyang Sun, Yucen Luo, Yihong Gu, and Yuhao Zhou. ZhuSuan: A library for Bayesian deep learning. arXiv:1709.05870, 2017.
 [83] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for largescale image recognition. arXiv:1409.1556, 2014.
 [84] Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pages 1257–1264, 2006.
 [85] Michalis K Titsias. Variational learning of inducing variables in sparse Gaussian processes. In International Conference on Artificial Intelligence and Statistics, pages 567–574, 2009.
 [86] Felipe Tobar, Thang D Bui, and Richard E Turner. Learning stationary time series using Gaussian processes with nonparametric kernels. In Advances in Neural Information Processing Systems, pages 3501–3509, 2015.
 [87] Dustin Tran, Rajesh Ranganath, and David M Blei. Deep and hierarchical implicit models. arXiv:1702.08896, 2017.
 [88] Richard E Turner and Maneesh Sahani. Statistical inference for single- and multi-band probabilistic amplitude demodulation. In Acoustics, Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on, pages 5466–5469. IEEE, 2010.
 [89] Mark van der Wilk, Carl Edward Rasmussen, and James Hensman. Convolutional Gaussian processes. In Advances in Neural Information Processing Systems, pages 2845–2854, 2017.
 [90] Dilin Wang and Qiang Liu. Learning to draw samples: With application to amortized MLE for generative adversarial learning. arXiv:1611.01722, 2016.
 [91] Christopher KI Williams. Computing with infinite networks. In Advances in Neural Information Processing Systems, pages 295–301, 1997.
 [92] Andrew G Wilson, Zhiting Hu, Ruslan R Salakhutdinov, and Eric P Xing. Stochastic variational deep kernel learning. In Advances in Neural Information Processing Systems, pages 2586–2594, 2016.
 [93] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI, pages 2852–2858, 2017.
 [94] Huaiyu Zhu and Richard Rohwer. Information geometric measurements of generalisation. 1995.
Appendix
Appendix A Brief review of Gaussian Processes
Gaussian Processes [75], a popular example of Bayesian nonparametrics, provide a principled probabilistic framework for nonparametric Bayesian inference over functions, by imposing rich and flexible nonparametric priors over the functions of interest. Beyond being flexible and interpretable function approximators, their Bayesian nature also enables Gaussian Processes to provide valuable uncertainty estimates for their predictions, all wrapped up in a single, exact, closed-form solution of posterior inference.
Despite the success and popularity of Gaussian Processes (as well as other Bayesian nonparametrics) over the past decades, their computation and storage complexities (cubic and quadratic in the number of data points, respectively) make it impractical to apply Gaussian Processes to large and complicated datasets. Therefore, practitioners often resort to approximate methods [80, 72, 84, 85, 35, 13, 12, 77, 16, 88].
Another issue that must be addressed is the representational power of GP kernels. It has been argued [7] that the local kernels commonly used for nonlinear regression cannot obtain hierarchical representations of high-dimensional data, which limits the usefulness of Bayesian nonparametric methods for complicated tasks. A number of novel schemes have been proposed to address this, including deep GPs [18, 17, 11], the design of expressive kernels [89, 24, 86], and hybrid models that use deep neural networks to compute input features for GPs [40, 92]. However, the first two approaches still struggle to model complex high-dimensional data such as text and images, and the third approach discards the merits of the fully Bayesian treatment.
We briefly introduce Gaussian Processes for regression. Assume that we have a set of observations $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^N$, where $\mathbf{x}_n \in \mathbb{R}^d$ is the $d$-dimensional input of the $n$-th data point, and $y_n \in \mathbb{R}$ is the corresponding scalar target of the regression problem. A Gaussian Process model assumes that $\mathcal{D}$ is generated according to the following procedure: first, a function $f$ is drawn from a Gaussian Process (to be defined later); then, for each input $\mathbf{x}_n$, the corresponding target $y_n$ is drawn according to
$$y_n = f(\mathbf{x}_n) + \epsilon_n, \quad \epsilon_n \sim \mathcal{N}(0, \sigma^2).$$
A Gaussian Process is a nonparametric distribution defined over the space of functions, such that:
Definition 2 (Gaussian Processes).
A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution. A Gaussian Process is fully specified by its mean function $m(\cdot)$ and covariance function $k(\cdot, \cdot)$, such that any finite collection of function values $\mathbf{f} = (f(\mathbf{x}_1), \dots, f(\mathbf{x}_N))^\top$ is distributed as a Gaussian $\mathcal{N}(\mathbf{f}; \mathbf{m}, \mathbf{K})$, where $\mathbf{m}_n = m(\mathbf{x}_n)$ and $\mathbf{K}_{nm} = k(\mathbf{x}_n, \mathbf{x}_m)$.
Now, given the observations $\mathcal{D}$, we are able to perform probabilistic inference and assign posterior probabilities to all plausible functions that might have generated the data. In the regression setting, given a new test input $\mathbf{x}_*$, we are interested in the posterior distribution over $f_* = f(\mathbf{x}_*)$. Fortunately, this posterior distribution admits a closed-form solution $p(f_* | \mathbf{x}_*, \mathcal{D}) = \mathcal{N}(f_*; \mu_*, \sigma_*^2)$:
$$\mu_* = \mathbf{k}_*^\top (\mathbf{K} + \sigma^2 \mathbf{I})^{-1} \mathbf{y}, \quad (A.1)$$
$$\sigma_*^2 = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^\top (\mathbf{K} + \sigma^2 \mathbf{I})^{-1} \mathbf{k}_*. \quad (A.2)$$
In our notation, $[\mathbf{k}_*]_n = k(\mathbf{x}_*, \mathbf{x}_n)$, $\mathbf{K}_{nm} = k(\mathbf{x}_n, \mathbf{x}_m)$, and $\mathbf{y} = (y_1, \dots, y_N)^\top$. Although the Gaussian Process regression framework is theoretically elegant, in practice its computational burden is prohibitive for large datasets, since the matrix inversion takes $\mathcal{O}(N^3)$ time due to the Cholesky decomposition. Once the matrix inversion is done, predictions at test time can be made in $\mathcal{O}(N)$ for the posterior mean $\mu_*$ and $\mathcal{O}(N^2)$ for the posterior uncertainty $\sigma_*^2$, respectively.
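As a concrete illustration of Eqs. (A.1)-(A.2), the following sketch implements exact GP regression in NumPy; the squared-exponential kernel and the toy data are our own illustrative choices, not part of the paper:

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel k(x, x') = s^2 exp(-||x - x'||^2 / (2 l^2))."""
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_predict(X, y, X_star, noise_var=0.01):
    """Exact GP posterior: O(N^3) Cholesky once, then O(N)/O(N^2) per test point."""
    K = rbf_kernel(X, X) + noise_var * np.eye(len(X))
    L = np.linalg.cholesky(K)                               # K + s^2 I = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))     # (K + s^2 I)^{-1} y
    K_star = rbf_kernel(X_star, X)
    mean = K_star @ alpha                                   # Eq. (A.1)
    v = np.linalg.solve(L, K_star.T)
    cov = rbf_kernel(X_star, X_star) - v.T @ v              # Eq. (A.2)
    return mean, np.diag(cov)

# Toy 1D regression problem
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (20, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)
mu, var = gp_predict(X, y, np.array([[0.0]]))
```

The Cholesky factor is reused for both the mean and the variance, which is the standard way to avoid forming the explicit inverse.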
Appendix B Brief review of variational inference and black-box α-divergence minimization
We give a brief review of modern variational techniques, including standard variational inference and black-box α-divergence minimization (BB-α), on which our methodology is heavily based. Consider the problem of finding the posterior distribution $p(\theta | \mathcal{D}) \propto p(\mathcal{D} | \theta) \, p_0(\theta)$, given the model likelihood $p(\mathcal{D} | \theta)$ and a prior distribution $p_0(\theta)$.
Variational inference (VI) [44] transforms the above inference problem into an optimization problem, by first proposing a class of approximate posteriors $q(\theta)$, and then minimizing the KL divergence $\mathrm{KL}[q(\theta) \| p(\theta | \mathcal{D})]$ from the approximate posterior to the true posterior. Equivalently, VI optimizes the following variational free energy:
$$\mathcal{F}_{\text{VI}}(q) = \mathrm{KL}[q(\theta) \| p_0(\theta)] - \mathbb{E}_{q(\theta)}\left[\log p(\mathcal{D} | \theta)\right].$$
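To make the free energy concrete, here is a minimal sketch (a toy example of our own, not code from the paper) that estimates the ELBO, i.e. the negative variational free energy, by Monte Carlo with reparameterized samples from a univariate Gaussian $q$:

```python
import numpy as np

def elbo_mc(q_mean, q_std, log_joint, n_samples=500, seed=0):
    """Monte Carlo estimate of the ELBO (negative variational free energy)
    for a univariate Gaussian q = N(q_mean, q_std^2):
        ELBO = E_q[log p(D, theta)] + H[q]."""
    rng = np.random.default_rng(seed)
    theta = q_mean + q_std * rng.standard_normal(n_samples)  # reparameterization
    entropy = 0.5 * np.log(2 * np.pi * np.e * q_std**2)      # Gaussian entropy
    return np.mean(log_joint(theta)) + entropy

# Toy target: log joint = log N(theta; 0, 1) (prior only, no data term)
log_joint = lambda t: -0.5 * np.log(2 * np.pi) - 0.5 * t**2
best = elbo_mc(0.0, 1.0, log_joint)   # q matches the target, so ELBO is near 0
worse = elbo_mc(2.0, 0.5, log_joint)  # mismatched q gives a lower ELBO
```

When $q$ equals the (normalized) target, the ELBO equals the log normalizer, here zero, which makes the estimator easy to sanity-check.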
Built upon the idea of VI, BB-α is a modern black-box variational inference framework that unifies, and interpolates between, variational Bayes [44] and expectation propagation-like algorithms [65, 55]. BB-α performs approximate inference by minimizing the following α-divergence [94]:
$$D_\alpha[p \| q] = \frac{1}{\alpha(1-\alpha)}\left(1 - \int p(\theta)^\alpha q(\theta)^{1-\alpha} \, d\theta\right).$$
The α-divergence is a generic class of divergences that includes the inclusive KL divergence (α = 1, corresponding to EP), the Hellinger distance (α = 0.5), and the exclusive KL divergence (α → 0, corresponding to VI) as special cases. Traditionally, power EP [64] optimizes an α-divergence locally with an exponential family approximation $q(\theta) \propto p_0(\theta) \prod_n f_n(\theta)$, via message passing. It converges to a fixed point of the so-called power EP energy, where each local factor $f_n$ has its own natural parameters. In contrast, BB-α directly optimizes the power EP energy with tied factors, to avoid prohibitively expensive local factor updates and storage for the whole dataset. This means $f_n(\theta) = f(\theta)$ for all $n$, so that $q(\theta) \propto p_0(\theta) f(\theta)^N$. Therefore, instead of parameterizing each factor, one can directly parameterize $q$ and replace all the local factors in the power EP energy function by $f(\theta) \propto (q(\theta)/p_0(\theta))^{1/N}$. After rearranging terms, this gives the BB-α energy, which, when the dataset is large, can be further approximated by [54]:
$$\mathcal{L}_\alpha(q) = \mathrm{KL}[q(\theta) \| p_0(\theta)] - \frac{1}{\alpha} \sum_{n=1}^N \log \mathbb{E}_{q(\theta)}\left[p(y_n | \theta)^\alpha\right].$$
The optimization of $\mathcal{L}_\alpha(q)$ can be performed in a black-box manner with the reparameterization trick [46] and Monte Carlo approximation. Empirically, it has been shown that BB-α with an intermediate setting such as α = 0.5 can return significantly better uncertainty estimates than variational Bayes, and it has been applied successfully in different scenarios [54, 21]. From the hyperparameter learning point of view, it is shown in [57] that the BB-α energy constitutes a better estimate of the log marginal likelihood than the variational free energy. Therefore, for both inference and learning, the BB-α energy is used extensively in this paper.
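The Monte Carlo treatment described above can be sketched as follows. The single-parameter linear model, the factorized Gaussian $q$, and all function names are our own illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def bb_alpha_energy(q_mean, q_std, X, y, alpha=0.5, noise_std=0.1,
                    prior_std=1.0, n_samples=200, seed=0):
    """Monte Carlo estimate of the large-data BB-alpha energy
       KL[q || p0] - (1/alpha) * sum_n log E_q[p(y_n | theta)^alpha]
    for a toy single-weight model y_n = theta * x_n + noise."""
    rng = np.random.default_rng(seed)
    theta = q_mean + q_std * rng.standard_normal(n_samples)  # samples from q
    # Closed-form KL between q = N(m, s^2) and p0 = N(0, prior_std^2)
    kl = (np.log(prior_std / q_std)
          + (q_std**2 + q_mean**2) / (2 * prior_std**2) - 0.5)
    # Per-sample, per-datapoint Gaussian log-likelihoods, shape (S, N)
    log_lik = (-0.5 * np.log(2 * np.pi * noise_std**2)
               - 0.5 * (y[None, :] - theta[:, None] * X[None, :])**2 / noise_std**2)
    # log E_q[p(y_n|theta)^alpha] via a numerically stable log-mean-exp
    a = alpha * log_lik
    log_e = a.max(axis=0) + np.log(np.mean(np.exp(a - a.max(axis=0)), axis=0))
    return kl - np.sum(log_e) / alpha

X = np.linspace(-1, 1, 10)
y = 2.0 * X                               # data generated with true theta = 2
good = bb_alpha_energy(2.0, 0.1, X, y)    # q centered at the true weight
bad = bb_alpha_energy(-2.0, 0.1, X, y)    # badly placed q has higher energy
```

The log-mean-exp stabilization matters in practice, since $p(y_n|\theta)^\alpha$ underflows easily for poorly fitting samples.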
Appendix C Derivations
c.1 Proof of Proposition 1
Proposition 1. There exists a unique stochastic process, with finite marginals that are distributed exactly according to Definition 1.
Proof: Generally, consider the following noisy IP model: a function $f$ is drawn from the implicit process of Definition 1, and each observation is generated as $y = f(\mathbf{x}) + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$. For any finite collection of random variables $\mathbf{y}_{1:n}$ at inputs $\mathbf{x}_{1:n}$, we denote the induced distribution as $p_\theta(\mathbf{y}_{1:n} | \mathbf{x}_{1:n})$. Note that $p_\theta(\mathbf{y}_{1:n} | \mathbf{x}_{1:n})$ can be represented as $\int p(\mathbf{y}_{1:n} | \mathbf{f}_{1:n}) \, p_\theta(\mathbf{f}_{1:n} | \mathbf{x}_{1:n}) \, d\mathbf{f}_{1:n}$. Therefore, for any $1 \le m < n$, we have
$$\int p_\theta(\mathbf{y}_{1:n} | \mathbf{x}_{1:n}) \, dy_{m+1} \cdots dy_n = \int p(\mathbf{y}_{1:m} | \mathbf{f}_{1:m}) \, p_\theta(\mathbf{f}_{1:m} | \mathbf{x}_{1:m}) \, d\mathbf{f}_{1:m} = p_\theta(\mathbf{y}_{1:m} | \mathbf{x}_{1:m}).$$
Note that the switch of the order of integration relies on the assumption that the integral is finite, which holds here since the integrand is a nonnegative probability density (Tonelli's theorem). Therefore, the marginal consistency condition of the Kolmogorov extension theorem is satisfied. Similarly, the permutation consistency condition of the Kolmogorov extension theorem can be proved as follows: let $\pi$ be a permutation of the indices $\{1, \dots, n\}$; then
$$p_\theta(y_{\pi(1)}, \dots, y_{\pi(n)} | \mathbf{x}_{\pi(1)}, \dots, \mathbf{x}_{\pi(n)}) = p_\theta(y_1, \dots, y_n | \mathbf{x}_1, \dots, \mathbf{x}_n),$$
since relabeling the integration variables $\mathbf{f}_{1:n}$ accordingly leaves the integral unchanged.
Therefore, by the Kolmogorov extension theorem, there exists a unique stochastic process whose finite marginals are distributed exactly according to Definition 1.
∎
c.2 Inverse Wishart process as prior for kernel function
Definition 3 (Inverse Wishart processes [81]).
Let $\Sigma$ be a random function $\Sigma(\cdot, \cdot): \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. A stochastic process defined on such functions is called an inverse Wishart process on $\mathcal{X}$ with parameter $\nu$ and base function $\Psi: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, if for any finite collection of input data $\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$, the corresponding matrix-valued evaluation $\Sigma(\mathbf{X}, \mathbf{X})$ is distributed according to an inverse Wishart distribution $\mathcal{IW}(\nu, \Psi(\mathbf{X}, \mathbf{X}))$. We denote $\Sigma \sim \mathcal{IWP}(\nu, \Psi)$.
Consider the problem in Section 3.1 of minimizing the objective with respect to the approximating GP. Since we use a Gaussian process $q_{\mathrm{GP}}$, this reduces to moment matching. To obtain the optimal solution with respect to the GP mean and covariance functions, it suffices to draw $S$ fantasy functions (each sample is a random function $f_s$) from the prior distribution and perform moment matching, which gives exactly the MLE solution, i.e., the empirical mean and covariance functions
$$m^*(\mathbf{x}) = \frac{1}{S} \sum_{s=1}^S f_s(\mathbf{x}), \quad (C.1)$$
$$k^*(\mathbf{x}, \mathbf{x}') = \frac{1}{S} \sum_{s=1}^S \left(f_s(\mathbf{x}) - m^*(\mathbf{x})\right)\left(f_s(\mathbf{x}') - m^*(\mathbf{x}')\right). \quad (C.2)$$
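Equations (C.1)-(C.2) amount to simple moment matching over sampled functions. A sketch of this step, with a toy implicit prior of our own choosing standing in for a Bayesian neural network:

```python
import numpy as np

def fantasy_moments(prior_sampler, X, n_fns=500, seed=0):
    """Estimate the empirical mean m*(x) and covariance k*(x, x') of an
    implicit prior by drawing fantasy functions and moment matching."""
    rng = np.random.default_rng(seed)
    # Each row: one sampled function evaluated at all inputs X
    F = np.stack([prior_sampler(X, rng) for _ in range(n_fns)])
    m = F.mean(axis=0)            # empirical mean function on X
    R = F - m                     # residuals around the mean
    K = R.T @ R / n_fns           # empirical covariance (MLE)
    return m, K

# A toy implicit prior: random sinusoid f(x) = a * sin(w * x + b)
def toy_prior(X, rng):
    w, b, a = rng.standard_normal(3)
    return a * np.sin(w * X + b)

X = np.linspace(-2, 2, 10)
m, K = fantasy_moments(toy_prior, X)
```

The resulting matrix is a Gram matrix of residuals, so it is symmetric and positive semi-definite by construction.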
In practice, to retain a computational advantage, the number $S$ of fantasy functions is often small; therefore, we further place an inverse Wishart process prior over the GP covariance function, i.e. $k \sim \mathcal{IWP}(\nu, \Psi)$. By doing so, we are able to give a MAP estimate instead of the MLE estimate. Since the inverse Wishart distribution is conjugate to the multivariate Gaussian distribution, the maximum a posteriori (MAP) solution is given by
$$k_{\mathrm{MAP}}(\mathbf{x}, \mathbf{x}') = \frac{\Psi(\mathbf{x}, \mathbf{x}') + S \, k^*(\mathbf{x}, \mathbf{x}')}{\nu + S + N + 1}, \quad (C.3)$$
where $N$ is the number of data points in the training set at which $m^*$ and $k^*$ are evaluated. Alternatively, one could also use the posterior mean estimator (an example of a Bayes estimator that minimizes the posterior expected squared loss)
$$k_{\mathrm{PM}}(\mathbf{x}, \mathbf{x}') = \frac{\Psi(\mathbf{x}, \mathbf{x}') + S \, k^*(\mathbf{x}, \mathbf{x}')}{\nu + S - N - 1}. \quad (C.4)$$
In the implementation of this paper, we choose this estimator. The hyperparameter $\nu$ is trained via a fast grid search, using the same procedure as for the noise variance parameter, as detailed in Appendix E.
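In matrix form, the MAP step can be sketched as below. The normalizing constant follows standard inverse Wishart conjugacy and should be treated as our assumption rather than the paper's exact implementation; the interpretation is a shrinkage of the empirical covariance toward the base matrix:

```python
import numpy as np

def iw_map_covariance(K_mle, Psi, nu, n_fns, jitter=1e-6):
    """MAP estimate of a GP covariance matrix under an inverse Wishart
    prior IW(nu, Psi), given n_fns fantasy functions: shrink the
    empirical covariance K_mle toward the base matrix Psi."""
    p = K_mle.shape[0]
    K_map = (Psi + n_fns * K_mle) / (nu + n_fns + p + 1)
    return K_map + jitter * np.eye(p)  # jitter for numerical stability

# With few fantasy functions, the base matrix dominates; with many, K_mle does.
K_map = iw_map_covariance(2 * np.eye(2), np.eye(2), nu=3, n_fns=10)
```

This behaves like a regularizer: when $S$ is small the estimate is pulled toward $\Psi$, counteracting the high variance of the MLE.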
c.3 Derivation of the upper bound for sleep phase
Applying the chain rule of the KL divergence, we have
$$\mathrm{KL}[p_\theta(\mathbf{y}, \mathbf{f}) \| q(\mathbf{y}, \mathbf{f})] = \mathrm{KL}[p_\theta(\mathbf{y}) \| q(\mathbf{y})] + \mathbb{E}_{p_\theta(\mathbf{y})}\left[\mathrm{KL}[p_\theta(\mathbf{f} | \mathbf{y}) \| q(\mathbf{f} | \mathbf{y})]\right].$$
Therefore, by the non-negativity of the KL divergence, we have
$$\mathrm{KL}[p_\theta(\mathbf{y}) \| q(\mathbf{y})] \le \mathrm{KL}[p_\theta(\mathbf{y}, \mathbf{f}) \| q(\mathbf{y}, \mathbf{f})].$$
Finally, since the sleep-phase solution is optimal for the joint KL divergence on the right-hand side, it is also optimal for this upper bound on the marginal KL divergence.
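The inequality above (marginal KL never exceeds joint KL) can be checked numerically for Gaussians, where both KL terms are available in closed form; the two-dimensional joint and its first-coordinate marginal below are a toy check of our own:

```python
import numpy as np

def kl_gauss(m0, S0, m1, S1):
    """KL[N(m0, S0) || N(m1, S1)] between multivariate Gaussians."""
    d = len(m0)
    S1_inv = np.linalg.inv(S1)
    diff = m1 - m0
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - d
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 2)); S_p = A @ A.T + np.eye(2)  # joint cov under p
B = rng.standard_normal((2, 2)); S_q = B @ B.T + np.eye(2)  # joint cov under q
m_p, m_q = np.array([0.5, -0.2]), np.zeros(2)
kl_joint = kl_gauss(m_p, S_p, m_q, S_q)
# Marginalizing a Gaussian just selects the corresponding mean/covariance block
kl_marginal = kl_gauss(m_p[:1], S_p[:1, :1], m_q[:1], S_q[:1, :1])
```

Here `kl_marginal` is guaranteed not to exceed `kl_joint`, mirroring the bound used in the sleep phase.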