Variational Implicit Processes
This paper introduces the variational implicit processes (VIPs), a Bayesian nonparametric method based on a class of highly flexible priors over functions. Similar to Gaussian processes (GPs), in implicit processes (IPs), an implicit multivariate prior (data simulators, Bayesian neural networks, etc.) is placed over any finite collections of random variables. A novel and efficient variational inference algorithm for IPs is derived using wake-sleep updates, which gives analytic solutions and allows scalable hyper-parameter learning with stochastic optimization. Experiments on real-world regression datasets demonstrate that VIPs return better uncertainty estimates and superior performance over existing inference methods for GPs and Bayesian neural networks. With a Bayesian LSTM as the implicit prior, the proposed approach achieves state-of-the-art results on predicting power conversion efficiency of molecules based on raw chemical formulas.
Variational Implicit Processes
Chao Ma University of Cambridge email@example.com Yingzhen Li University of Cambridge firstname.lastname@example.org José Miguel Hernández-Lobato University of Cambridge email@example.com
Powerful models with implicit distributions as core components have recently attracted enormous interest in both deep learning as well as the approximate Bayesian inference community. In contrast to prescribed probabilistic models that assign explicit densities to possible outcomes of the model, implicit models implicitly assign probability measures by the specification of the data generating process. One of the most well known implicit distributions is the generator of generative adversarial nets (GANs) [30, 73, 2, 66, 8, 93] that transforms isotropic noise into high dimensional data, say images, using neural networks. In approximate inference context, implicit distributions have also been deployed to postulate flexible approximate posterior distributions[76, 60, 56, 87, 61, 90, 58]. However, such generation process does not necessarily allow evaluation of densities point-wise, which becomes the main barrier for inference and learning.
This paper explores applications of implicit models to Bayesian nonparametric regression, which brings together the best of both worlds. That is, combining the nonparametric modeling power, elegance of inference, and the well-calibrated uncertainty of Bayesian nonparametric methods, with the strong representational power of implicit methods. Similar to the construction of Gaussian processes (GPs), implicit stochastic processes (IPs) assign implicit distributions over any finite collections of random variables. With an IP as the prior, one can directly perform posterior inference over functions in a non-parametric fashion. Therefore our approach does not suffer from typical issues in parametric Bayesian modeling, e.g. symmetric modes in the posterior distribution of Bayesian neural network weights. However, due to the intractability of the underlying distribution of IPs, standard approximate Bayesian inference techniques cannot be directly applied. Concretely, our contributions are threefold:
We propose a generic framework for Bayesian nonparametrics with implicit stochastic process prior over functions. As IPs naturally include data simulators and Bayesian neural networks, our approach provides a new inference and learning framework for these important but challenging deep models.
We derive a novel and efficient variational framework that gives a closed-form approximation to the IP posterior. Our inference method is computationally cheap, and it allows scalable hyper-parameter learning in IPs.
We conduct an extensive comparison between IPs with the proposed inference method, and GPs/ Bayesian neural networks/Bayesian RNNs with existing modern variational approaches. Our method consistently outperforms other methods, and achieves state-of-the-art performance on the task of predicting power conversion efficiency of molecules based on raw chemical formulas, by training an IP with Bayesian LSTM prior.
2 Implicit stochastic processes
In this section, we generalize GPs to implicit stochastic processes. Recall that a GP defines the distribution of a random function by placing a multivariate Gaussian distribution over any finite collection of function values evaluated at any given finite collection of input locations .111 can also be unobserved as in Gaussian process latent variable models. Here and , and following Kolmogorov consistency theorem , the mean and covariance functions , are shared across all such finite collections. An alternative parameterization of GPs defines the sampling process as , , with the Cholesky decomposition of the covariance matrix. Observing this, we propose a generalization of the generative process by replacing the linear transform of the latent variable with a nonlinear one. This gives the formal definition of implicit stochastic process as follows.
Definition 1 (noiseless implicit stochastic processes).
An implicit stochastic process (IP) is a collection of random variables , such that any finite collection has a joint distribution implicitly defined by the following generative process
A function distributed according to the above IP is denoted as .
There exists a unique stochastic process, such that any finite collection of random variables has distribution implicitly defined by (1).
IPs form a rich class of priors over functions. Indeed, we visualize some examples of IPs in Figure 1 with discussions as follows:
Example 1 (Data simulators).
Simulators, e.g. physics engines and climate models, are omnipresent in science and engineering. These models encode laws of physics in , use to explain the remaining randomness, and evaluate the function at input locations : . We define the neural sampler as a specific instance of this class. In this case is a neural network with weights , i.e., , and .
Example 2 (Bayesian neural network).
In a Bayesian neural network the synaptic weights are random variables (i.e., ) with a prior on them. A function is sampled by and then setting for all . In this case could include, e.g., the network architecture and additional hyper-parameters.
Example 3 (Bayesian RNN).
Similar to Example 2, a Bayesian recurrent neural network (RNN) can be defined by considering its weights as random variables, and taking as function evaluation an output value generated by the RNN after processing the last symbol of an input sequence.
3 Variational implicit processes
Consider the following regression model with an IP prior over the function:
Given an observational dataset and a set of test inputs , Bayesian predictive inference over involves computing the predictive distribution , which itself implies computing the posterior . Besides prediction, we may also want to learn the prior parameters and noise variance by maximizing the log marginal likelihood: , with the evaluation of on . Unfortunately, both the prior and the posterior are intractable as the implicit process does not allow point-wise density evaluation, let alone the integration tasks. Therefore, to address these tasks, we must resort to approximate inference methods.
We propose a generalization of the wake-sleep algorithm  to handle both intractabilities. This method returns (1) an approximate posterior distribution which is later used for predictive inference, and (2) an approximation to the marginal likelihood for hyper-parameter optimization. In this paper we choose to be the posterior distribution of a Gaussian process, . A high-level summary of our algorithm is the following:
Sleep phase: sample function values and noisy outputs as indicated in (2). This dreamed data is then used as the maximum-likelihood (ML) target to fit a GP. This is equivalent to minimizing .
Wake phase: use the GP posterior to approximate the IP posterior , and obtain a variational approximation to , which is then optimized with respect to .
Our approach has two key advantages. First, the algorithm has no explicit sleep phase computation, since the sleep phase optimization has an analytic solution that can be directly plugged into the wake-phase objective. Second, the proposed wake phase update is highly scalable, as it is equivalent to a Bayesian linear regression task with random features sampled from the implicit process. With our wake-sleep algorithm, the evaluation of the implicit prior density is no longer an obstacle for approximate inference. We call this inference framework the variational implicit processes (VIPs). In the following sections we give specific details on both the wake and sleep phases.
3.1 Sleep phase: GP posterior as variational distribution
This section addresses the first intractability issue by approximating the IP posterior with a GP posterior . A naive approach based on variational inference  would require computing the joint distribution that is again intractable. However, sampling from this joint distribution is straightforward. Therefore, we leverage the idea of the sleep phase in the wake-sleep algorithm to approximate with a simpler distribution instead. We use a GP prior for with mean and covariance functions and , respectively, and write the prior as .
The sleep-phase update minimizes the KL divergence between the two joint distributions, which reduces to the following constrained optimization problem:
We further simplify the approximation by using , which reduces to , and the objective is minimized when and are equal to the mean and covariance functions of the IP, respectively:
In the following we also write the optimal solution as to explicitly specify the dependency on prior parameters . In practice, the mean and covariance functions are estimated by by Monte Carlo, which leads to maximum likelihood training for the GP with dreamed data. Assume functions are drawn from the IP: . The optimum of is then approximated by the MLE solution:
To reduce computational costs, the number of dreamed samples is often small. Therefore, we perform maximum a posteriori/posterior mean estimation instead, by putting an inverse Wishart process prior over the GP covariance function (see Appendix C.2).
The original sleep phase algorithm in  also find a posterior approximation by minimizing (3). However, the original approach would define the distribution as , which builds a recognition model that can be directly transfered for later inference. By contrast, we define , which corresponds to an approximation of the IP prior. In other words, we approximate an intractable generative model using another generative model with a GP prior and later, the resulting GP posterior is employed as the variational distribution. Importantly, we never explicitly perform the sleep phase updates as there is an analytic solution readily available, which can potentially save an enormous amount of computation in practice.
Another interesting observation is that the sleep phase’s objective also provides an upper-bound to the KL divergence between the posterior distributions, . Notice that according to chain rule of KL divergence, can be rewritten as:
According the the non-negativity of KL divergence, we have that is an upper bound of , which justifies as a valid variational objective for posterior approximation. Therefore during the sleep phase, the gap is also decreased when the mean and covariance functions are optimized.
3.2 Wake phase: Bayesian linear regression over random functions
In the wake phase of traditional wake-sleep, the prior parameters are optimized by maximizing the variational lower-bound  to the log marginal likelihood . Unfortunately, this requires evaluating the IP prior which is again intractable. But recall from (7) that during the sleep phase is also minimized. Therefore we directly approximate the log marginal likelihood using the optimal GP from the sleep phase, i.e.
This again demonstrates the key advantage of the proposed sleep phase update via generative model matching. Also it becomes a sensible objective for predictive inference as the GP returned by wake-sleep will be used at prediction time anyway.
Similar to GP regression, optimizing can be computationally expensive for large datasets. Therefore sparse GP approximation techniques [84, 85, 35, 13] are applicable, but we leave them to future work and consider an alternative approach that is related to random feature approximations of GPs [74, 29, 27, 3, 50]. Note that can be approximated by the log marginal likelihood of a Bayesian linear regression model with randomly sampled dreamed functions, and coefficient :
When the -energy reduces to the variational lower-bound, and empirically the -energy has better approximation accuracy when [57, 37, 54]. Also since the prior is conjugate to the Gaussian likelihood , the exact posterior of can be reached by a correlated Gaussian . Stochastic optimization is applied to the -energy wrt. and jointly, making our approach highly scalable to big datasets.
3.3 Computational complexity and scalable predictive inference
Assume the evaluation of a sampled function value for a given input takes time. The VIP has time complexity in training, where is the size of a mini-batch, and is the number of random functions sampled from . Note that approximate inference techniques in space, e.g. mean-field Gaussian approximation to the posterior of Bayesian neural network weights [9, 37, 54], also takes time. Therefore when is large (typically the case for neural networks) the additional cost is often negligible, as is usually significantly smaller than the typical number of inducing points in sparse GP (20 in the experiments).
Predictive inference follows the standard GP equations to compute on test data that has datapoints: ,
Recall that the optimal variational GP approximation has mean and covariance functions defined as (5) and (6), respectively, which means has rank . Therefore predictive inference requires both function evaluations and matrix inversion, which costs time. This complexity can be further reduced: note that the computational cost is dominated by the inversion of matrix . Denote the Cholesky decomposition of the kernel matrix as before. It is straightforward to show that in the Bayesian linear regression problem (10) the exact posterior of is , with Therefore the parameters of the GP predictive distribution in (12) are reduced to:
This reduces the prediction cost to , which is on par with e.g. conventional predictive inference techniques for Bayesian neural networks that also cost . In practice we use the mean and covariance matrix from to compute the predictive distribution. Alternatively one can directly sample and compute , which is also an inference approach222If is a mean-field Gaussian distribution then the cost is . but is liable for higher variance.
4 Related research
In the world of nonparametric models, Gaussian Processes (GPs)  provides a principled and flexible probabilistic framework for Bayesian inference over functions. The Bayesian nature enables GPs to provide accurate uncertainty estimates on unseen data, all wrapped up in a single, exact closed form solution of posterior inference. Despite the success and popularity of GPs in the past decades, their time and space complexities make them impractical for large-scale datasets. Therefore, people often resort to complicated approximate methods [80, 72, 84, 85, 35, 13, 12, 77, 16, 88]. Another intrinsic issue is the limited representational power of GP kernels. It has been argued that stationary kernels commonly used for nonlinear regressions are not able to obtain hierarchical representations for high dimensional data, which limits the usefulness of GP methods .
On the contrary, in the world of parametric modeling, it is well-known that deep neural networks [6, 39, 78, 48, 48, 83, 23] are extremely flexible function approximators that enable learning from very high-dimensional and structured data. Nowadays Deep learning has been widely spread to an enormous amount of tasks, including computer vision [48, 83, 23] and speech recognition [51, 67, 45, 15]. As people starts to apply deep learning techniques to critical applications such as automatic driving and health care, uncertainty quantification of neural networks has become increasingly important. Although decent progress has been made [20, 41, 4, 69, 31, 9, 36, 54], uncertainty in deep learning still remains an open challenge.
Research in the Gaussian process-neural net correspondance has been extensively explored in order to improve the understandings of both worlds. Bayesian neural nets (BNNs) with infinitely wide hidden layers and certain prior distributions have been studied from a GP perspective. e.g. [68, 69, 91] for single-hidden layer, and [33, 27, 53, 62] for deeper nets. Notably, in [68, 27] a one-layer Bayesian neural network with non-linearity and mean-field Gaussian prior is approximately equivalent to a Gaussian process with kernel function
Later  and  generalized this result and proved that deep Bayesian neural networks is approximately equivalent to a Gaussian process with a compositional kernel [14, 34, 19, 70] that mimic the deep net. These approaches allow us to construct expressive kernels for GPs, or conversely, exploit the exact Bayesian inference on GPs to perform exact Bayesian prediction for deep neural nets. We will compare the above kernel with equation (6) in Appendix D
A number of alternative schemes have also been investigated to exploit deep structures for GP model design. These include: (1) deep GPs [18, 17, 11], where compositions of GP priors are proposed to represent prior over compositional functions; (2) the search and design of kernels for accurate and efficient learning [89, 24, 86, 3, 5, 79], and (3) deep kernel learning that uses deep neural net features as the inputs to GPs [40, 92, 1, 10, 43]. Frustratingly the first two approaches still struggle to model high-dimensional structured data such as texts and images; and the third approach is not fully Bayesian, i.e. the model is only Bayesian wrt. the last output layer.
Our work is in different spirit of the above two: the intension is not to understand BNNs as GPs nor to use the deep learning concept to help GP design. Instead we directly model a BNN as an instance of implicit processes (IPs), and the GP is used as a variational distribution to assist predictive inference.333In principle any process can be used here as the variational distribution, and we use GPs here for the convenience of analytic approximations. Therefore it also retains the Bayesian nonparametric nature of IPs. This variational approximation does not require previous assumptions in the GP-BNN correspondence literature (infinite hidden units, i.i.d. weight priors, etc) [53, 62] nor the conditions in compositional kernel literature (CITATION). Instead the optimal kernel (6) in the sleep phase applies to any IP that includes BNNs and beyond. A very recent work  also minimizes the sleep phase KL divergence (3) but wrt. the BNN prior, and their goal is to regularize BNN priors and implant some smoothness properties of GPs to BNNs. By contrast, our approach takes the advantage from the BNN prior over functions to better encode rich structures. Also  still performs variational posterior inference in weight space, and our inference method in function space also allows better uncertainty quantification.
From the practical point of view, the proposed inference method is computationally cheap, and it allows scalable learning of hyper-parameters. The additional cost is negligible when the computation is dominated by the evaluation of e.g. BNN function samples. In the case where GP approximations dominate the cost, our approach does not require expensive matrix inversions, nor complicated kernel compositions that only have analytic forms under restricted cases [53, 47].
We evaluate VIPs for regression tasks using real-world data. For small datasets, we use the posterior GP equations for prediction, and for large datasets, we use the approximation. We use for VIP unless noted otherwise. When the VIP is equipped with a Bayesian NN/LSTM as prior over functions, the prior parameters over each weight are untied, thus can be individually tuned. Fully Bayesian estimates of the prior parameters are used in experiments 5.3 and 5.4. We focus on comparing VIPs to other fully Bayesian approaches.
5.1 Toy example
We first evaluate the predictive mean and uncertainty of our method on a toy regression example. The training set is generated by first sampling 300 inputs according to a standard Gaussian distribution. Then, for each obtained, the corresponding target is sampled according to . Test set consists of 1000 evenly spaced points on . We use an IP with a Bayesian neural network with two hidden layers 10-10-1 as prior. We use for the wake-sleep training. We also compare with the full exact GP with RBF kernel and variational Bernoulli dropout Bayesian neural network (VDO) with the same network structure. All methods (including GP hyperparameters) are trained using 500 full batch epochs with Adam optimizer (learning rate = 0.01). The variational dropout uses a dropout rate and a length scale .
Figure 2 shows the performance of each method. VIP predictive mean is closer to groud truth function than variational dropout and GP. More importantly, VIP provides the best predictive uncertainty since it increases smoothly while approaching the boundary of the interval, which is as expected since input , while variational dropout does not. Although uncertainty of GP also increases when data becomes sparser, it slightly overfits the data, and tends to extrapolate a zero mean function where data is sparse. Log-likelihood and RMSE reveals similar conclusions, see Table 1.
|Test log likelihood||0.62|
5.2 Time series interpolation: Solar irradiance
We compare the VIP () with variational sparse GP, exact GP as well as Bayesian neural networks on the solar irradiance dataset . The dataset is split according to , where 5 segments of length 20 are removed for interpolation. All inputs are then centered to zero means, and targets are scaled, dividing by the data standard deviation. Settings for all methods are the same as in (5.1), except that we run Adam with learning rate = 0.001 for 5000 iterations. Predictive interpolations are shown in Figure (3). VIP and VDO give similar interpolation behaviors, while sparse variational GP (with 100 inducing points) and exact GP performs nearly the same. Table 2 gives the test log likelihood and rmse for these methods. It is shown that IP gives the best performance over other methods, indicating that its returns high-quality uncertainties and accurate mean predictions.
It is interesting to note that, compared with VIP, VDO under-estimates uncertainty, expecially for data points in . VDO also gives the wrong mean estimate for the nearly-constant data points in the interval around . On the contrary, VIP is able to recover the correct mean estimation around this interval with high confidence. A the meantime, GP methods under-estimates uncertainty of training data and recovers the exact mean of training data, but often gives poor estimate of predictive means for interpolation.
|Test log likelihood||0.126|
5.3 Multivariate regression
We perform additional experiments on real-life multivariate regression datasets from the UCI data repository. We train a VIP that is equipped with a Bayesian neural net as the prior over latent functions (VIP-BNN), and compare it with other popular modern variational inference methods of Bayesian neural nets, including variational Gaussian inference with reparameterization tricks , variational dropout , variational alpha inference by dropout. Variational sparse GP  as well as full GP are also included for comparison. We also train VIP with neural sampler prior (VIP-NS), as defined in section 2. All neural networks use a 10-10-1 structure with two hidden layers of size 10. All methods are trained using 10 random splits (5 splits on Protein Structure dataset) of all datasets and a 1K epochs of full batch training using Adam optimizer (learning rate = 0.01). Observational noise variance for VIP and variational/alpha-dropout methods are tuned using fast grid search over validation set, as detailed in Appendix E. The value for both VIP and alpha-variational inference are fixed to 0.5, as suggested in[54, 37]. Full details are available in the appendix.
Results are shown in Table 3 and 4, where best results are boldfaced. Note that full GP models are only trained for small datasets due to its expensive costs, therefore is not included for the overall ranking. The results show that the VIP based methods, especially VIP-BNN, consistently outperforms other methods in almost all datasets: VIP methods obtains best results in 7 out of the 9 datasets in terms of average test log-likelihood, and 8 out of the 9 datasets in terms of average RMSE. Generally, VIP-BNN obtains the best ranking among 6 methods. It is also encouraging to note that, despite its general form, the VIP-NS achieves the second best average ranking, outperforming many specifically designed BNN algorithms. When comparing with full GP, the VIP outperforms full GP in 3 out of 5 evaluated datasets in terms of average test log-likelihood, and 4 out of 5 in terms of average rates RMSE, despite the fact that only samples are used.
5.4 Bayesian LSTM for predicting power conversion efficiency of organic photovoltaics molecules
We perform additional experiments with data from the Harvard Clean Energy Project, which is the world’s largest materials high-throughput virtual screening effort. It has scanned a large number of molecules of organic photovoltaics to find those with high power conversion efficiency (PCE) using quantum-chemical techniques. The target value within this dataset is the PCE of each molecule, and the input is the variable-length sequential character data that represents the molecule structures. Previous studies expert-designed[71, 11, 37] or learned finger-print features that transforms the raw string data into fixed-size feature for prediction.
We use a VIP equipped with a prior defined by Bayesian LSTM, of which the latent state size is 200, and the alpha value . We replicate the experimental settings in[11, 37], except that our method directly takes raw sequential molecule structure data as input. We compare our approach with variational dropout for recurrent nets (VDO-LSTM) , alpha-variational inference LSTM, BB- Bayesian neural net, Variational inference Bayesian neural nets, FITC GPs and Deep GPs trained with expectation propagation. Results for the latter 4 methods are directly obtained from[37, 11]. Detailed configurations can be found in the appendix. Results in Table 5 shows that VIP significantly outperforms other baseline methods and hits a state-of-the-art result in both test likelihood and RMSE.
We presented a variational approach for learning and inference of Bayesian nonparametric regression models based on implicit process priors. It provides a powerful framework that combines the rich representational power of implicit models with the well-calibrated uncertainty estimates from Bayesian nonparametric models. With Bayesian neural networks as the implicit process prior, our approach outperformed many existing Gaussian process/Bayesian neural network methods and achieved significantly improved results on the molecule regression data. Many directions remain to be explored. Classification models with implicit process priors will be developed. Implicit process latent variable models will also be derived in a similar fashion as Gaussian process latent variable models . Future work will investigate novel inference methods to models equipped with other implicit process priors, e.g. data simulators in astrophysics, ecology and climate science.
We thank Jiri Hron, Rich Turner and Cheng Zhang for discussions. Chao Ma thanks Microsoft Research donation for supporting his research.
-  Maruan Al-Shedivat, Andrew Gordon Wilson, Yunus Saatchi, Zhiting Hu, and Eric P Xing. Learning scalable deep kernels with recurrent structure. The Journal of Machine Learning Research, 18(1):2850–2886, 2017.
-  Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv:1701.07875, 2017.
-  Matej Balog, Balaji Lakshminarayanan, Zoubin Ghahramani, Daniel M Roy, and Yee Whye Teh. The mondrian kernel. arXiv preprint arXiv:1606.05241, 2016.
-  David Barber and Christopher M Bishop. Ensemble learning in Bayesian neural networks. NATO ASI SERIES F COMPUTER AND SYSTEMS SCIENCES, 168:215–238, 1998.
-  Daniel Beck and Trevor Cohn. Learning kernels over strings using gaussian processes. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), volume 2, pages 67–73, 2017.
-  Yoshua Bengio. Learning deep architectures for ai. Foundations and trends® in Machine Learning, 2(1):1–127, 2009.
-  Yoshua Bengio, Olivier Delalleau, and Nicolas Le Roux. The curse of dimensionality for local kernel machines. Techn. Rep, 1258, 2005.
-  David Berthelot, Tom Schumm, and Luke Metz. Began: Boundary equilibrium generative adversarial networks. arXiv:1703.10717, 2017.
-  Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. arXiv:1505.05424, 2015.
-  John Bradshaw, Alexander G de G Matthews, and Zoubin Ghahramani. Adversarial examples, uncertainty, and transfer testing robustness in Gaussian process hybrid deep networks. arXiv:1707.02476, 2017.
-  Thang Bui, Daniel Hernández-Lobato, Jose Hernandez-Lobato, Yingzhen Li, and Richard Turner. Deep Gaussian processes for regression using approximate expectation propagation. In International Conference on Machine Learning, pages 1472–1481, 2016.
-  Thang D Bui and Richard E Turner. Tree-structured Gaussian process approximations. In Advances in Neural Information Processing Systems, pages 2213–2221, 2014.
-  Thang D Bui, Josiah Yan, and Richard E Turner. A unifying framework for sparse Gaussian process approximation using power expectation propagation. arXiv:1605.07066, 2016.
-  Youngmin Cho and Lawrence K Saul. Kernel methods for deep learning. In Advances in Neural Information Processing Systems, pages 342–350, 2009.
-  Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM, 2008.
-  John P Cunningham, Krishna V Shenoy, and Maneesh Sahani. Fast gaussian process methods for point process intensity estimation. In Proceedings of the 25th International Conference on Machine Learning, pages 192–199. ACM, 2008.
-  Kurt Cutajar, Edwin V Bonilla, Pietro Michiardi, and Maurizio Filippone. Random feature expansions for deep Gaussian processes. arXiv:1610.04386, 2016.
-  Andreas Damianou and Neil Lawrence. Deep gaussian processes. In Artificial Intelligence and Statistics, pages 207–215, 2013.
-  Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances in Neural Information Processing Systems, pages 2253–2261, 2016.
-  John Denker and Yann Lecun. Transforming neural-net output levels to probability distributions. In Advances in Neural Information Processing Systems 3. Citeseer, 1991.
-  Stefan Depeweg, José Miguel Hernández-Lobato, Finale Doshi-Velez, and Steffen Udluft. Learning and policy search in stochastic dynamical systems with Bayesian neural networks. arXiv:1605.07127, 2016.
-  Peter J Diggle and Richard J Gratton. Monte carlo methods of inference for implicit statistical models. Journal of the Royal Statistical Society. Series B (Methodological), pages 193–227, 1984.
-  Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In ICML, pages 647–655, 2014.
-  David Duvenaud, James Robert Lloyd, Roger Grosse, Joshua B Tenenbaum, and Zoubin Ghahramani. Structure discovery in nonparametric regression through compositional kernel search. arXiv:1302.4922, 2013.
-  David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pages 2224–2232, 2015.
-  Daniel Flam-Shepherd, James Requeima, and David Duvenaud. Mapping gaussian process priors to bayesian neural networks. NIPS Bayesian deep learning workshop, 2017.
-  Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.
-  Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1019–1027, 2016.
-  Yarin Gal and Richard Turner. Improving the Gaussian process sparse spectrum approximation by representing uncertainty in frequency inputs. In International Conference on Machine Learning, pages 655–664, 2015.
-  Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
-  Alex Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 2348–2356, 2011.
-  Johannes Hachmann, Roberto Olivares-Amaya, Adrian Jinich, Anthony L Appleton, Martin A Blood-Forsythe, Laszlo R Seress, Carolina Roman-Salgado, Kai Trepte, Sule Atahan-Evrenk, Süleyman Er, et al. Lead candidates for high-performance organic photovoltaics from high-throughput quantum chemistry–the harvard clean energy project. Energy & Environmental Science, 7(2):698–704, 2014.
-  Tamir Hazan and Tommi Jaakkola. Steps toward deep kernel methods from infinite neural networks. arXiv:1508.05133, 2015.
-  Uri Heinemann, Roi Livni, Elad Eban, Gal Elidan, and Amir Globerson. Improper deep kernels. In Artificial Intelligence and Statistics, pages 1159–1167, 2016.
-  James Hensman, Nicolo Fusi, and Neil D Lawrence. Gaussian processes for big data. arXiv:1309.6835, 2013.
-  José Miguel Hernández-Lobato and Ryan P Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. arXiv:1502.05336, 2015.
-  José Miguel Hernández-Lobato, Yingzhen Li, Mark Rowland, Daniel Hernández-Lobato, Thang Bui, and Richard Eric Turner. Black-box -divergence minimization. 2016.
-  Geoffrey E Hinton, Peter Dayan, Brendan J Frey, and Radford M Neal. The" wake-sleep" algorithm for unsupervised neural networks. Science, 268(5214):1158, 1995.
-  Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006.
-  Geoffrey E Hinton and Ruslan R Salakhutdinov. Using deep belief nets to learn covariance kernels for Gaussian processes. In Advances in Neural Information Processing Systems, pages 1249–1256, 2008.
-  Geoffrey E Hinton and Drew Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, pages 5–13. ACM, 1993.
-  Kiyosi Itô. An Introduction to Probability Theory. Cambridge University Press, 1984.
-  Tomoharu Iwata and Zoubin Ghahramani. Improving output uncertainty estimation and generalization in deep learning via neural network Gaussian processes. arXiv:1707.05922, 2017.
-  Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Machine learning, 37(2):183–233, 1999.
-  Nal Kalchbrenner and Phil Blunsom. Recurrent continuous translation models. In EMNLP, volume 3, page 413, 2013.
-  Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv:1312.6114, 2013.
-  Karl Krauth, Edwin V Bonilla, Kurt Cutajar, and Maurizio Filippone. Autogp: Exploring the capabilities and limitations of Gaussian process models. arXiv:1610.05392, 2016.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
-  Neil D Lawrence. Gaussian process latent variable models for visualisation of high dimensional data. In Advances in Neural Information Processing Systems, pages 329–336, 2004.
-  Miguel Lázaro-Gredilla, Joaquin Quiñonero-Candela, Carl Edward Rasmussen, and Aníbal R Figueiras-Vidal. Sparse spectrum gaussian process regression. Journal of Machine Learning Research, 11(Jun):1865–1881, 2010.
-  Quoc V Le. Building high-level features using large scale unsupervised learning. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 8595–8598. IEEE, 2013.
-  Judith Lean, Juerg Beer, and Raymond Bradley. Reconstruction of solar irradiance since 1610: Implications for climate change. Geophysical Research Letters, 22(23):3195–3198, 1995.
-  Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as Gaussian processes. arXiv:1711.00165, 2017.
-  Yingzhen Li and Yarin Gal. Dropout inference in Bayesian neural networks with alpha-divergences. arXiv:1703.02914, 2017.
-  Yingzhen Li, José Miguel Hernández-Lobato, and Richard E Turner. Stochastic expectation propagation. In Advances in Neural Information Processing Systems, pages 2323–2331, 2015.
-  Yingzhen Li and Qiang Liu. Wild variational approximations.
-  Yingzhen Li and Richard E Turner. Rényi divergence variational inference. In Advances in Neural Information Processing Systems, pages 1073–1081, 2016.
-  Yingzhen Li, Richard E Turner, and Qiang Liu. Approximate inference with amortised MCMC. arXiv:1702.08343, 2017.
-  Moshe Lichman et al. Uci machine learning repository, 2013.
-  Qiang Liu and Yihao Feng. Two methods for wild variational inference. arXiv:1612.00081, 2016.
-  Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pages 2370–2378, 2016.
-  Alexander G de G Matthews, Mark Rowland, Jiri Hron, Richard E Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. arXiv:1804.11271, 2018.
-  Alexander G de G Matthews, Mark Van Der Wilk, Tom Nickson, Keisuke Fujii, Alexis Boukouvalas, Pablo León-Villagrá, Zoubin Ghahramani, and James Hensman. GPflow: A Gaussian process library using tensorflow. The Journal of Machine Learning Research, 18(1):1299–1304, 2017.
-  Thomas Minka. Power EP. Technical report, Technical report, Microsoft Research, Cambridge, 2004.
-  Thomas P Minka. Expectation propagation for approximate Bayesian inference. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pages 362–369. Morgan Kaufmann Publishers Inc., 2001.
-  Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv:1411.1784, 2014.
-  Abdel-rahman Mohamed, George E Dahl, and Geoffrey Hinton. Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):14–22, 2012.
-  Radford M Neal. Priors for infinite networks. In Bayesian Learning for Neural Networks, pages 29–53. Springer, 1996.
-  Radford M Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012.
-  Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems, pages 3360–3368, 2016.
-  Edward O Pyzer-Knapp, Kewei Li, and Alan Aspuru-Guzik. Learning from the harvard clean energy project: The use of neural networks to accelerate materials discovery. Advanced Functional Materials, 25(41):6495–6502, 2015.
-  Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939–1959, 2005.
-  Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:1511.06434, 2015.
-  Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in neural information processing systems, pages 1177–1184, 2008.
-  Carl Edward Rasmussen and Christopher KI Williams. Gaussian processes for machine learning, volume 1. MIT press Cambridge, 2006.
-  Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv:1505.05770, 2015.
-  Yunus Saatçi. Scalable inference for structured Gaussian process models. PhD thesis, University of Cambridge, 2012.
-  Ruslan Salakhutdinov and Geoffrey E Hinton. Deep boltzmann machines. In AISTATS, volume 1, page 3, 2009.
-  Yves-Laurent Kom Samo and Stephen Roberts. String gaussian process kernels. arXiv preprint arXiv:1506.02239, 2015.
-  Matthias Seeger, Christopher Williams, and Neil Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In Artificial Intelligence and Statistics 9, number EPFL-CONF-161318, 2003.
-  Amar Shah, Andrew Wilson, and Zoubin Ghahramani. Student-t processes as alternatives to Gaussian processes. In Artificial Intelligence and Statistics, pages 877–885, 2014.
-  Jiaxin Shi, Jianfei Chen, Jun Zhu, Shengyang Sun, Yucen Luo, Yihong Gu, and Yuhao Zhou. ZhuSuan: A library for Bayesian deep learning. arXiv:1709.05870, 2017.
-  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
-  Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pages 1257–1264, 2006.
-  Michalis K Titsias. Variational learning of inducing variables in sparse Gaussian processes. In International Conference on Artificial Intelligence and Statistics, pages 567–574, 2009.
-  Felipe Tobar, Thang D Bui, and Richard E Turner. Learning stationary time series using Gaussian processes with nonparametric kernels. In Advances in Neural Information Processing Systems, pages 3501–3509, 2015.
-  Dustin Tran, Rajesh Ranganath, and David M Blei. Deep and hierarchical implicit models. arXiv:1702.08896, 2017.
-  Richard E Turner and Maneesh Sahani. Statistical inference for single-and multi-band probabilistic amplitude demodulation. In Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on, pages 5466–5469. IEEE, 2010.
-  Mark van der Wilk, Carl Edward Rasmussen, and James Hensman. Convolutional Gaussian processes. In Advances in Neural Information Processing Systems, pages 2845–2854, 2017.
-  Dilin Wang and Qiang Liu. Learning to draw samples: With application to amortized mle for generative adversarial learning. arXiv:1611.01722, 2016.
-  Christopher KI Williams. Computing with infinite networks. In Advances in Neural Information Processing Systems, pages 295–301, 1997.
-  Andrew G Wilson, Zhiting Hu, Ruslan R Salakhutdinov, and Eric P Xing. Stochastic variational deep kernel learning. In Advances in Neural Information Processing Systems, pages 2586–2594, 2016.
-  Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. Seqgan: Sequence generative adversarial nets with policy gradient. In AAAI, pages 2852–2858, 2017.
-  Huaiyu Zhu and Richard Rohwer. Information geometric measurements of generalisation. 1995.
Appendix A Brief review of Gaussian Processes
Gaussian Processes , as a popular example of Bayesian nonparametrics, provides a principled probabilistic framework for non-parametric Bayesian inference over functions, by imposing rich and flexible nonparametric priors over functions of interest. As flexible and interpretable function approximators, their Bayesian nature also enables Gaussian Processes to provide valuable information of uncertainties regarding predictions for intelligence systems, all wrapped up in a single, exact closed form solution of posterior inference.
Despite the success and popularity of Gaussian Processes (as well as other Bayesian nonparametrics) in the past decades, their computation and storage complexities make it impractical to apply Gaussian Processes to large and complicated datasets. Therefore, people often resort to complicated approximate methods [80, 72, 84, 85, 35, 13, 12, 77, 16, 88].
An inevitable issue that must also be addressed is the representational power of GP kernels. It has been argued that  local kernels commonly used for nonlinear regressions are not able to obtain hierarchical representations for high dimensional data, which limits the usefulness of Bayesian nonparametric methods for complicated tasks. A number of novel schemes were proposed, including deep GPs [18, 17, 11], the design of expressive kernels [89, 24, 86], and the hybrid models using deep neural nets as input features of GPs [40, 92]. However, the first two approaches still struggle to model complex high dimensional data such as texts and images easily; and in the third approach, the merits of fully Bayesian approach has been discarded.
We breifly introduce Gaussian Processes for regression. Assume that we have a set of observational data , where is the dimensional input of th data point, and is the corresponding scalar target of the regression problem. A Gaussian Process model assumes that is generated according the following procedure: firstly a function is drawn from a Gaussian Process (to be defined later). Then for each input data , the corresponding is then drawn according to:
A Gaussian Process is a nonparametric distribution defined over the space of functions, such that:
Definition 2 (Gaussian Processes).
A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distributions. A Gaussian Process is fully specified by its mean function and covariance function , such that any finite collection of function values are distributed as Gaussian distribution , where , .
Now, given a set of observational data , we are able to perform probabilistic inference and assign posterior probabilities over all plausible functions that might have generated the data. Under the setting of regression, given a new test point input data , we are interested in posterior distributions over . Fortunately, this posterior distribution of interest admits a closed form solution :
In our notation, , , and . Although the Gaussian Process regression framework is theoretically very elegant, in practice its computational burden is prohibitive for large datasets since the matrix inversion takes time due to Cholesky decomposition. Once matrix inversion is done, predictions in test time can be made in for posterior mean and for posterior uncertainty , respectively.
Appendix B Brief Review of Variational inference, and black-box -energy
We give a brief review of modern variational techniques, including standard variational inference and black-box -Divergence minimization (BB-), on which our methodology is heavily based. Considers the problem of finding the posterior distribution, , under the model likelihood and a prior distribution :
Variational inference  transfers the above inference problem to an optimization problem, by first proposing a class of approximate posterior , and then minimize the KL-divergence from the approximate posterior to the true posterior . Equivalently, VI optimizes the following variational free energy,
Built upon the idea of VI, BB- is a modern black-box variational inference framework that unifies and interpolates between Variational Bayes  and Expectation Propagation-like algorithms [65, 55]. BB- performs approximate inference by minimizing the following -divergence :
-divergence is a generic class of divergences that includes the inclusive KL-divergence (=1, corresponds to EP), Hellinger distance (=0.5), and the exclusive KL-divergence ( = 0, corresponds to VI) as special cases. Traditionally power EP optimizes an -divergence locally with exponential family approximation , via message passing. It converges to a fixed point of the so called power EP energy:
where is the natural parameter of . On the contrary, BB- directly optimizes with tied factors to avoid prohibitive local factor updates and storage on the whole dataset. This means for all and . Therefore instead of parameterizing each factors, one can directly parameterize and replace all the local factors in the power-EP energy function by . After re-arranging terms, this gives the BB- energy:
which can be further approximated by the following if the dataset is large :
The optimization of could be performed in a black-box manner with reparameterization trick  and MC approximation. Empirically, it has been shown that BB- with can return significantly better uncertainty estimation than VB, and has been applied successfully in different scenarios [54, 21]. From the hyperparameter learning (i.e., in ) point of veiw, it is shown in  that the BB- energy constitutes a better estimation of log marginal likelihood, when compared with the variational free energy. Therefore, for both inference and learning, BB- energy is extensively used in this paper.
Appendix C Derivations
c.1 Proof of Proposition 1
Proposition 1. There exists a unique stochastic process, with finite marginals that are distributed exactly according to Definition 1.
Proof: Generally, consider the following noisy IP model:
For any finite collection of random variables , we denote the induced distribution as . Note that can be represented as . Therefore for any , we have
Note that the switch of the order of integration relies on the assumption that the integral is finite. Therefore, the marginal consistency condition of Kolmogorov extension theorem is satisfied. Similarly, the permutation consistency condition of Kolmogorov extension theorem can be proved as follows: assume is a permutation of the indices , then
Therefore, by Kolmogorov extension theorem, there exists a unique stochastic process, with finite marginals that are distributed exactly according to Definition 1.
c.2 Inverse Wishart process as prior for kernel function
Definition 3 (Inverse Wishart processes).
Let be random function . A stochastic process defined on such functions is called the inverse Wishart process on with parameter and base function , if for any finite collection of input data , the corresponding matrix-valued evaluation is distributed according to an inverse Wishart distribution . We denote .
Consider the problem in Section 3.1 of minimizing the objective
Since we use , this reduces to . In order to obtain optimal solution wrt. , it sufficies to draw fantasy functions (each sample is a random function ) from the prior distribution , and perform moment matching, which gives exactly the MLE solution, i.e., empirical mean and covariance functions
In practice, in order to gain computational advantage, the number of fantasy functions is often small, therefore we further put an inverse wishart process prior over the GP covariance function, i.e. . By doing so, we are able to give MAP estimation instead of MLE estimation. Since inverse Wishart distribution is conjugate to multivariate Gaussian distribution, the Maximum A Posteriori(MAP) solution is given by
Where is the number of data points in the training set where and are evaluated. Alternatively, one could also use Posterior Mean Estimator (which is an example of Bayes estimator that minimizes posterior expected squared loss)
In the implementation of this paper, we choose estimator with and . The hyper parameter is trained using fast grid search using the same procedure for the noise variance parameter, as detailed in Appendix E.
c.3 Derivation of the upper bound for sleep phase
Applying the chaine rule of KL-divregence, we have
Therefore, by the non-negative property of KL divergence, we have
Finally, since and are the optimal solution of , it is also optimal for