The reparameterization trick for acquisition functions
Abstract
Bayesian optimization is a sample-efficient approach to solving global optimization problems. Along with a surrogate model, this approach relies on theoretically motivated value heuristics (acquisition functions) to guide the search process. Maximizing acquisition functions yields the best performance; unfortunately, this ideal is difficult to achieve since optimizing acquisition functions per se is frequently non-trivial. This statement is especially true in the parallel setting, where acquisition functions are routinely non-convex, high-dimensional, and intractable. Here, we demonstrate how many popular acquisition functions can be formulated as Gaussian integrals amenable to the reparameterization trick [17, 14] and, ensuingly, gradient-based optimization. Further, we use this reparameterized representation to derive an efficient Monte Carlo estimator for the upper confidence bound acquisition function [19] in the context of parallel selection.
1 Introduction
In Bayesian optimization (BO), acquisition functions, with few exceptions, amount to integrals defined in terms of a belief over the unknown outcomes y revealed when evaluating a black-box function f at corresponding input locations X. This formulation naturally occurs as part of a Bayesian approach whereby we would like to assess how valuable different queries X are to the optimization process by accounting for all conceivable realizations of y. Denoting by ℓ the function used to convey the value-added for observing a given realization, this paradigm gives rise to acquisition functions defined as

\alpha(X; \phi, \mathcal{D}) = \int_{\Omega} \ell(\mathbf{y}; \phi)\, p(\mathbf{y} \mid X, \mathcal{D})\, d\mathbf{y}    (1)

where integration region Ω represents the set of all possible outcomes y, φ denotes any additional parameters associated with integrand ℓ, and 𝒟 the available prior information. (Henceforth, we omit explicit reference to prior information 𝒟.) Without loss of generality, we express acquisition functions as q-dimensional integrals, where q denotes the total number of queries with unknown outcomes after each decision. For pool-size q = 1, we recover strictly sequential decision-making rules; whereas, for q > 1, we obtain strategies for parallel selection. (To avoid confusion when discussing SGD, we reserve the term batch-size for descriptions of minibatches.) As an exception to this rule, non-myopic acquisition functions, which assign value by further considering how different realizations of y impact our broader understanding of black-box f, generally correspond to higher-dimensional integrals. Specifically, non-myopic instances of the above formulation typically recurse, with the integrand ℓ amounting to an additional integral of the form (1). While closed-form solutions exist in a minority of cases, these integrals are generally intractable and therefore difficult to optimize.
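As a concrete instance of this formulation, single-query Expected Improvement admits both a closed form and a Monte Carlo estimate of the integral above. The following sketch (assuming NumPy; the function names `ei_closed_form` and `ei_monte_carlo` are illustrative, not from the paper) compares the two:

```python
import math

import numpy as np

def ei_closed_form(mu, sigma, alpha):
    """Closed-form single-query EI, E[max(y - alpha, 0)] for y ~ N(mu, sigma^2)."""
    u = (mu - alpha) / sigma
    cdf = 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))   # standard normal CDF
    pdf = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    return (mu - alpha) * cdf + sigma * pdf

def ei_monte_carlo(mu, sigma, alpha, n=200_000, seed=0):
    """Monte Carlo estimate of the same integral."""
    y = np.random.default_rng(seed).normal(mu, sigma, size=n)
    return float(np.maximum(y - alpha, 0.0).mean())
```

With enough samples, the two estimates agree to within Monte Carlo error.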
For this reason, a variety of methods have been proposed for evaluating intractable acquisition functions. These approaches have ranged from expectation-propagation-based approximations of Gaussian probabilities [5, 9, 10] to bespoke approximation strategies [4, 6] to sample-based Monte Carlo techniques [16, 18, 9].
The special case of parallel Expected Improvement (q-EI) has received considerable attention [7, 18, 3, 20]; however, excepting [20], proposed methods do not scale gracefully in pool-size q. Still within the context of q-EI and independent of our work, [20] derive results analogous to our own, but refer to the reparameterization trick (discussed below) as infinitesimal perturbation analysis [8].
In this work, we focus on the most common estimation technique: Monte Carlo integration. Despite their generality and myriad other desirable properties, Monte Carlo approaches have consistently been regarded as non-differentiable and, therefore, inefficient in practice given the need to optimize (1). However, it seems to have been overlooked that sample-based approaches can indeed be used to estimate gradients, well-known examples of which include stochastic backpropagation and the reparameterization trick [17, 14]. In the following, we exploit this insight to demonstrate gradient-based optimization of acquisition functions estimated via Monte Carlo integration.
The reparameterization trick is a way of rewriting functions of random variables that makes their differentiability w.r.t. the parameters of an underlying distribution transparent. The trick applies a deterministic mapping g_θ from random variables z drawn from a parameter-free base distribution to random variables y = g_θ(z) distributed according to the target distribution. This change of variables helps clarify that, if ℓ is a differentiable function of y, then by the chain rule of derivatives ∂ℓ/∂θ = (∂ℓ/∂y)(∂g_θ(z)/∂θ), i.e., we can use gradient information to optimize the target distribution's parameters θ. We now explore the importance of this fact for BO and, in particular, for parallel selection.
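This identity can be checked numerically with a toy integrand ℓ(y) = y², whose expectation under y ~ N(μ, σ²) is μ² + σ² (a minimal NumPy sketch; variable names are illustrative):

```python
import numpy as np

# Base samples z are parameter-free; y = mu + sigma * z then has the
# target distribution N(mu, sigma^2), so gradients w.r.t. (mu, sigma)
# pass through the deterministic mapping by the chain rule.
rng = np.random.default_rng(0)
z = rng.standard_normal(100_000)
mu, sigma = 1.5, 0.7
y = mu + sigma * z

# For ell(y) = y^2, E[ell] = mu^2 + sigma^2, so dE/dmu = 2*mu = 3.0
# and dE/dsigma = 2*sigma = 1.4; the sample averages below estimate both.
grad_mu = float((2.0 * y).mean())         # dell/dy * dy/dmu, with dy/dmu = 1
grad_sigma = float((2.0 * y * z).mean())  # dell/dy * dy/dsigma, with dy/dsigma = z
```

The Monte Carlo gradients recover the analytic values up to sampling error.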
2 Reparameterizing acquisition functions
Table 1: Examples of reparameterizable acquisition functions

| Acquisition function | Parameters φ | Integrand ℓ(y; φ) | Reparameterization ℓ(μ + Lz; φ) |
| --- | --- | --- | --- |
| Expected Improvement (EI) | α | max(ReLU(y − α)) | max(ReLU(μ + Lz − α)) |
| Probability of Improvement (PI) | α, τ | max(𝟙(y − α)) | max(Sig((μ + Lz − α)/τ)) |
| Upper Confidence Bound (UCB) | β | max(μ + √(βπ/2) abs(y − μ)) | max(μ + √(βπ/2) abs(Lz)) |
| Simple Regret (SR) | (none) | max(y) | max(μ + Lz) |
| Entropy Search (ES) | candidate set, τ | nested integrand containing 𝟙 (see [9]) | 𝟙 relaxed via Sig(·/τ) |
As a multivariate normal distribution is arguably the natural way of expressing uncertainty over interrelated values, beliefs over the outcomes y for pool X are typically defined as y ~ N(μ, Σ). In the context of the reparameterization trick, the corresponding deterministic mapping for Gaussian random variables is y = μ + Lz, where L denotes the Cholesky factor of Σ, s.t. LLᵀ = Σ, and z ~ N(0, I). Rewriting (1) as a Gaussian integral and reparameterizing, we have
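The Cholesky mapping can be checked empirically: samples μ + Lz with z ~ N(0, I) should exhibit the target mean and covariance (a minimal NumPy sketch; variable names are illustrative):

```python
import numpy as np

mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.6],
                  [0.6, 2.0]])
L = np.linalg.cholesky(Sigma)   # lower triangular, satisfies L @ L.T == Sigma

# z ~ N(0, I) is parameter-free; the deterministic map gives y ~ N(mu, Sigma).
rng = np.random.default_rng(1)
z = rng.standard_normal((200_000, 2))
y = mu + z @ L.T

sample_mean = y.mean(axis=0)
sample_cov = np.cov(y.T)
```

Both sample statistics match (μ, Σ) up to Monte Carlo error.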
\alpha(X; \phi) = \int \ell(\boldsymbol{\mu} + \mathbf{L}\mathbf{z}; \phi)\, \mathcal{N}(\mathbf{z}; 0, I)\, d\mathbf{z} = \mathbb{E}_{\mathbf{z}}\!\left[\ell(\boldsymbol{\mu} + \mathbf{L}\mathbf{z}; \phi)\right]    (2)

where each of the outcome terms in ℓ has been transformed as y = μ + Lz and where values in integration region Ω have similarly been mapped to the corresponding region in z-space. By taking the gradient of ℓ w.r.t. the model-based posterior terms μ and L and further differentiating through the model to inputs X, we can perform gradient ascent on acquisition values. (Parameters associated with the model itself are not differentiated through and are therefore omitted for clarity.)
When Monte Carlo integrating (2), an unbiased estimate of the acquisition gradients is then

\nabla \alpha(X; \phi) \approx \frac{1}{m} \sum_{k=1}^{m} \nabla \ell(\boldsymbol{\mu} + \mathbf{L}\mathbf{z}_k; \phi), \qquad \mathbf{z}_k \sim \mathcal{N}(0, I)    (3)

where, by minor abuse of notation, we have substituted y_k = μ + Lz_k in ℓ. The availability of gradient information is especially important for q > 1, both because parallel acquisition functions are generally intractable and because the dimensionality of the acquisition space scales linearly in q.
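For parallel EI, the estimator above takes a particularly simple form, since the subgradient of ℓ(y) = max(max(y) − α, 0) w.r.t. μ is a one-hot vector at the pool's argmax whenever the pool improves on α. A sketch of this estimator under those assumptions (NumPy; the function name is illustrative):

```python
import numpy as np

def qei_value_and_grad(mu, L, alpha, n=100_000, seed=2):
    """Monte Carlo q-EI and a subgradient estimate w.r.t. mu, using
    ell(y) = max(max(y) - alpha, 0) with y = mu + L z, z ~ N(0, I)."""
    q = len(mu)
    z = np.random.default_rng(seed).standard_normal((n, q))
    y = mu + z @ L.T
    best = y.max(axis=1)
    value = float(np.maximum(best - alpha, 0.0).mean())
    # Subgradient w.r.t. mu: one-hot at the argmax of y whenever the pool
    # improves on alpha, zero otherwise (dy/dmu = I).
    grad = np.zeros((n, q))
    grad[np.arange(n), y.argmax(axis=1)] = (best > alpha).astype(float)
    return value, grad.mean(axis=0)
```

For q = 1 with μ = 0.5, σ = 1, α = 0, the value should approach the closed-form EI and the gradient should approach Φ(0.5) ≈ 0.69.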
Examples of well-known acquisition functions amenable to this treatment are presented in Table 1. Figure 1 provides a visual example of the corresponding (stochastic) gradient ascent process for each of the five acquisition functions shown in the table. Before going further, several points of interest in Table 1 warrant attention:

Parallelizing UCB: To the best of our knowledge, the integral representation of UCB is novel and leads to the first truly parallel formulation (q-UCB). Relevantly, using the reparameterization trick greatly simplifies the associated derivation. As with the other acquisition functions discussed here, q-UCB can be efficiently estimated via Monte Carlo and optimized using gradients. For the complete derivation and related formulae, please refer to Appendix A.

Relaxing Heaviside step functions: Both Probability of Improvement (PI) and Entropy Search (ES) contain Heaviside step functions, whose derivatives are Dirac delta functions. Since these gradients are zero almost everywhere, we instead propose the use of a sigmoid function Sig(·/τ) with temperature parameter τ > 0. This combination has the appealing property that the resulting approximation becomes exact as τ → 0⁺, a property recently exploited in [11, 15]. To the extent that this soft approximation introduces an additional source of error, we argue that this downside is largely outweighed by the availability of informative gradients, which enable us to greatly reduce optimization error [2].

Differentiating through the max(): Many acquisition functions, such as EI, use the max operator. While not technically differentiable, this operator is known to be subdifferentiable and affords well-behaved (sub)gradients.
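The sigmoid relaxation described above can be sketched numerically: Sig(x/τ) approaches the Heaviside step as τ shrinks, while retaining a nonzero gradient Sig(x/τ)(1 − Sig(x/τ))/τ everywhere (a minimal NumPy example; `soft_step` is an illustrative name):

```python
import numpy as np

def soft_step(x, tau):
    """Sigmoid relaxation Sig(x / tau) of the Heaviside step function.
    The approximation tightens as tau -> 0+, yet stays differentiable."""
    return 1.0 / (1.0 + np.exp(-x / tau))

x = np.array([-2.0, -0.5, 0.5, 2.0])
hard = (x > 0).astype(float)   # Heaviside values away from the origin
soft = soft_step(x, tau=0.01)  # small temperature: nearly exact
```

At τ = 0.01 the relaxation already matches the step to many decimal places at these points, while soft_step(0, 1.0) = 0.5 shows the smooth transition that supplies usable gradients.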
3 Experiments
As baselines, we compared gradient-based approaches for optimizing acquisition functions with ones based on Random Search [1] and Dividing Rectangles [12]. For stochastic gradient descent (SGD), we experimented with several off-the-shelf optimizers; of these, Adam [13] produced the best results and is reported here. Similarly, we tested various batch-sizes and report results for the best-performing choice. For gradient descent (GD), we used a standard implementation of L-BFGS-B [21]. In both cases, gradient-based optimizers were run using 32 starting points sampled from the acquisition function. Finally, for both PI and ES, we used a fixed temperature τ; and, for UCB, a fixed confidence parameter β.
Prior to running our experiments, we configured each acquisition function optimizer such that its runtime approximately matched that of the others. Further details regarding our experiments, including individual runtimes, are provided in Appendix B.
To help reduce the number of potentially confounding variables, we experimented on 8-dimensional tasks drawn from a Gaussian process prior with known hyperparameters. For each combination of acquisition function and optimizer, trials began with randomly chosen observations and iterated by choosing q queries at a time. (Methods discussed here extend to the parallel asynchronous setting, but we did not explore this option.) Each such pair was run on a total of 16 sampled tasks, with results shown in Figure 2. Across acquisition functions, gradient-based strategies markedly outperformed gradient-free alternatives. Further, stochastic and deterministic gradient methods delivered comparable performance.
4 Conclusion
We show how many popular acquisition functions can be written as Gaussian integrals amenable to the reparameterization trick. By reparameterizing these integrals, we clarify the differentiability of their Monte Carlo estimates and, in turn, provide a generalized method for using gradients to optimize acquisition values. Our results clearly demonstrate the superiority of gradient-based approaches for optimizing acquisition functions, even in cases of modest dimensionality. Further, we show how, by viewing the associated integrals through the lens of the reparameterization trick, the difficult process of deriving theoretically sound acquisition functions may be greatly simplified.
Acknowledgments
The support of the EPSRC Centre for Doctoral Training in High Performance Embedded and Distributed Systems (reference EP/L016796/1) is gratefully acknowledged. This work has partly been supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme under grant no. 716721.
References
 Bergstra and Bengio [2012] J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 2012.
 Bousquet and Bottou [2008] O. Bousquet and L. Bottou. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems 20, 2008.
 Chevalier and Ginsbourger [2013] C. Chevalier and D. Ginsbourger. Fast computation of the multi-points expected improvement with applications in batch selection. In International Conference on Learning and Intelligent Optimization, 2013.
 Contal et al. [2013] E. Contal, D. Buffoni, A. Robicquet, and N. Vayatis. Parallel Gaussian process optimization with upper confidence bound and pure exploration. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2013.
 Cunningham et al. [2011] J. P. Cunningham, P. Hennig, and S. Lacoste-Julien. Gaussian probabilities and expectation propagation. arXiv preprint arXiv:1111.6832, 2011.
 Desautels et al. [2014] T. Desautels, A. Krause, and J. W. Burdick. Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. Journal of Machine Learning Research, 2014.
 Ginsbourger et al. [2010] D. Ginsbourger, R. Le Riche, and L. Carraro. Kriging is well-suited to parallelize optimization, chapter 6. Springer, 2010.
 Glasserman [2013] P. Glasserman. Monte Carlo methods in financial engineering. Springer, 2013.
 Hennig and Schuler [2012] P. Hennig and C. Schuler. Entropy search for information-efficient global optimization. Journal of Machine Learning Research, 2012.
 Hernández-Lobato et al. [2014] J. M. Hernández-Lobato, M. Hoffman, and Z. Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In Advances in Neural Information Processing Systems 27, 2014.
 Jang et al. [2016] E. Jang, S. Gu, and B. Poole. Categorical reparameterization with GumbelSoftmax. arXiv preprint arXiv:1611.01144, 2016.
 Jones et al. [1993] D. R. Jones, C. D. Perttunen, and B. E. Stuckman. Lipschitzian optimization without the Lipschitz constant. Journal of Optimization Theory and Applications, 1993.
 Kingma and Ba [2014] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kingma and Welling [2014] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations, 2014.
 Maddison et al. [2016] C. J. Maddison, A. Mnih, and Y. W. Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
 Osborne et al. [2009] M. A. Osborne, R. Garnett, and S. J. Roberts. Gaussian processes for global optimization. In Proceedings of the 3rd International Conference on Learning and Intelligent Optimization, 2009.
 Rezende et al. [2014] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, 2014.
 Snoek et al. [2012] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25, 2012.
 Srinivas et al. [2010] N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning, 2010.
 Wang et al. [2016] J. Wang, S. C. Clark, E. Liu, and P. I. Frazier. Parallel Bayesian global optimization of expensive functions. arXiv preprint arXiv:1602.05149, 2016.
 Zhu et al. [1997] C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software, 1997.
Appendix A Parallel Upper Confidence Bound (q-UCB)
Working backward through (2), we derive an exact expression for parallel UCB. In doing so, we begin with the definition

\mathrm{UCB}(x; \beta) = \mu + \sqrt{\beta\pi/2}\;\mathbb{E}\!\left[\,|\gamma|\,\right], \qquad \gamma \sim \mathcal{N}(0, \sigma^2)    (4)

where |·| denotes the (element-wise) absolute value operator. (This definition comes directly from the standard integral identity 𝔼[|γ|] = σ√(2/π), so that √(βπ/2) 𝔼[|γ|] = √β σ.) Using this fact and given y ~ N(μ, σ²), let γ = y − μ such that γ ~ N(0, σ²). Under this notation, marginal UCB can be expressed as
\mathrm{UCB}(x; \beta) = \mu + \sqrt{\beta\pi/2} \int |\gamma|\, \mathcal{N}(\gamma; 0, \sigma^2)\, d\gamma    (5)
= \int \left(\mu + \sqrt{\beta\pi/2}\, |\gamma|\right) \mathcal{N}(\gamma; 0, \sigma^2)\, d\gamma    (6)
= \mathbb{E}_{y}\!\left[\mu + \sqrt{\beta\pi/2}\, |y - \mu|\right]    (7)

where (μ, σ²) parameterize a Gaussian posterior over y = f(x). This integral form of UCB is advantageous precisely because it naturally lends itself to the generalized expression
\mathrm{UCB}(X; \beta) = \mathbb{E}_{\mathbf{y}}\!\left[\max\!\left(\boldsymbol{\mu} + \sqrt{\beta\pi/2}\, |\mathbf{y} - \boldsymbol{\mu}|\right)\right]    (8)
= \int \max\!\left(\boldsymbol{\mu} + \sqrt{\beta\pi/2}\, |\boldsymbol{\gamma}|\right) \mathcal{N}(\boldsymbol{\gamma}; 0, \Sigma)\, d\boldsymbol{\gamma}    (9)
= \mathbb{E}_{\mathbf{z}}\!\left[\max\!\left(\boldsymbol{\mu} + \sqrt{\beta\pi/2}\, |\mathbf{L}\mathbf{z}|\right)\right]    (10)

where γ = y − μ ~ N(0, Σ) and L denotes the Cholesky factor of Σ. This representation has the requisite property that, for any subset of the pool X, the value obtained when marginalizing out the remaining terms is that subset's UCB value.
Previous methods for parallelizing UCB have approached the problem by imitating a purely sequential strategy [4, 6]. Because a fully Bayesian approach to sequential selection generally involves an exponential number of posteriors, these works incorporate various well-chosen heuristics for the purpose of efficiently approximating parallel UCB. (Due to the stochastic nature of the mean updates, the number of posteriors grows exponentially in q.) By directly addressing the associated q-dimensional integral, however, Equation (10) avoids the need for such approximations and, instead, unbiasedly estimates the true value.
Finally, the special case of marginal UCB (6) can be further simplified as

\mathrm{UCB}(x; \beta) = \mathbb{E}\!\left[\,y' \mid y' \geq \mu\,\right], \qquad y' \sim \mathcal{N}\!\left(\mu, \tfrac{\beta\pi}{2}\sigma^2\right)    (11)

revealing an intuitive form, namely, the expectation of a Gaussian random variable (with rescaled variance) above its mean.
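Equation (10) translates directly into a Monte Carlo estimator, which for q = 1 should recover the familiar closed form μ + √β σ; e.g., μ = 0.2, σ = 0.5, β = 4 gives 0.2 + 2 · 0.5 = 1.2. A sketch under those assumptions (NumPy; the function name is illustrative):

```python
import numpy as np

def qucb_monte_carlo(mu, L, beta, n=200_000, seed=3):
    """Monte Carlo estimate of Eq. (10):
    E_z[max(mu + sqrt(beta * pi / 2) * |L z|)], z ~ N(0, I)."""
    q = len(mu)
    z = np.random.default_rng(seed).standard_normal((n, q))
    gamma = z @ L.T                                   # gamma ~ N(0, Sigma)
    vals = mu + np.sqrt(beta * np.pi / 2.0) * np.abs(gamma)
    return float(vals.max(axis=1).mean())
```

For q > 1, the same call evaluates the parallel acquisition value with no additional machinery.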
Appendix B Experiment details
Table 2: Runtimes of acquisition function optimizers

| Optimizer |
| --- |
| Random Search (RS) |
| Dividing Rectangles (DIRECT) |
| GD (L-BFGS-B) |
| SGD (Adam) |
To provide a fair comparison between acquisition function optimizers, efforts were made to approximately match their respective runtimes. First, Random Search was run using a set of uniformly random pools at each step during BO. Subsequently, RS's average runtime, measured over a handful of preliminary trials, was used as a target value when configuring the remaining optimizers. Table 2 provides individual runtimes for each combination of acquisition function and optimizer.
For stochastic gradient descent, we tested the following optimizers: SGD with momentum, RMSProp, and Adam. Trials were run using a range of batch-sizes, each time tuning the number of SGD steps for equivalent runtimes. Of the tested configurations, the best-performing one is reported in Section 3.