The reparameterization trick for acquisition functions

The reparameterization trick for acquisition functions

James T. Wilson1,2
j.wilson17@imperial.ac.uk
&
Riccardo Moriconi1
r.moriconi16@imperial.ac.uk
&
Frank Hutter2
fh@cs.uni-freiburg.de
&
Marc Peter Deisenroth1
m.deisenroth@imperial.ac.uk
   1Imperial College of London                         2University of Freiburg        
Abstract

Bayesian optimization is a sample-efficient approach to solving global optimization problems. Along with a surrogate model, this approach relies on theoretically motivated value heuristics (acquisition functions) to guide the search process. Maximizing acquisition functions yields the best performance; unfortunately, this ideal is difficult to achieve since optimizing acquisition functions per se is frequently non-trivial. This statement is especially true in the parallel setting, where acquisition functions are routinely non-convex, high-dimensional, and intractable. Here, we demonstrate how many popular acquisition functions can be formulated as Gaussian integrals amenable to the reparameterization trick [17, 14] and, ensuingly, gradient-based optimization. Further, we use this reparameterized representation to derive an efficient Monte Carlo estimator for the upper confidence bound acquisition function [19] in the context of parallel selection.

1 Introduction

In Bayesian optimization (BO), acquisition functions , with few exceptions, amount to integrals defined in terms of a belief over the unknown outcomes revealed when evaluating a black-box function at corresponding input locations . This formulation naturally occurs as part of a Bayesian approach whereby we would like to assess how valuable different queries are to the optimization process by accounting for all conceivable realizations of . Denoting by the function used to convey the value-added for observing a given realization, this paradigm gives rise to acquisition functions defined as

(1)

where integration region represents the set of all possible outcomes , any additional parameters associated with integrand , and the available prior information.111Henceforth, we omit explicit reference to prior information . Without loss of generality, we express acquisition functions as -dimensional integrals, where denotes the total number of queries with unknown outcomes after each decision. For pool-size , we recover strictly sequential decision-making rules; whereas, for , we obtain strategies for parallel selection.222To avoid confusion when discussing SGD, we reserve the term batch-size for description of minibatches. As an exception to this rule, non-myopic acquisition functions, which assign value by further considering how different realizations of impact our broader understanding of black-box , generally correspond to higher-dimensional integrals. Specifically, non-myopic instances of the above formulation typically recurse, with the integrand amounting to an additional integral of the form (1). While in a minority of cases closed-form solutions exist, these integrals are generally intractable and therefore difficult to optimize.

For this reason, a variety of methods have been proposed for evaluating intractable acquisition functions. These approaches have ranged from expectation propagation-based approximations of Gaussian probabilities [5, 9, 10] to bespoke approximation strategies [4, 6] to sample-based Monte Carlo techniques [16, 18, 9].

The special case of parallel Expected Improvement () has received considerable attention [7, 18, 3, 20]; however, excepting [20], proposed methods do not scale gracefully in pool-size . Still within the context of and independent of our work, [20] derive results analogous to our own, but refer to the reparameterization trick (discussed below) as infinitesimal perturbation analysis [8].

In this work, we focus on the most common estimation technique: Monte Carlo integration. Despite their generality and myriad other desirable properties, Monte Carlo approaches have consistently been regarded as non-differentiable and, therefore, inefficient in practice given the need to optimize (1). However, it seems to have been overlooked that sample-based approaches can indeed be used to estimate gradients, well-known examples of which include stochastic backpropagation and the reparameterization trick [17, 14]. In the following, we exploit this insight to demonstrate gradient-based optimization of acquisition functions estimated via Monte Carlo integration.

The reparameterization trick is a way of rewriting functions of random variables that makes their differentiability w.r.t. the parameters of an underlying distribution transparent. The trick applies a deterministic mapping from random variables with a parameter-free base distribution to random variables with the target distribution. This change of variables helps clarify that if is a differentiable function of then, by the chain rule of derivatives , i.e., we can use gradient information to optimize the target distribution’s parameters . We now explore the importance of this fact for BO and, in particular, for parallel selection.

2 Reparameterizing acquisition functions

Examples of reparameterizable acquisition functions
Acquisition function Parameters Integrand Reparameterization
Expected Improvement (EI)
Probability of Improvement (PI)
Upper Confidence Bound (UCB)
Simple Regret (SR)
Entropy Search (ES)
Table 1: Above, we use the following notation: Cholesky factor ; denotes the right-/left-continuous Heaviside step function; the sigmoid nonlinearity; the improvement threshold; the temperature parameter described in Section 2; and, random variables . For Entropy Search, a non-myopic acquisition function, only the innermost integrand (used to approximate ) and its corresponding reparameterization are shown.

As is arguably the natural way of expressing uncertainty over interrelated values, beliefs over the outcomes for pool are typically defined in terms of a multivariate normal distribution . In the context of the reparameterization trick, the corresponding deterministic mapping for Gaussian random variables is , where denotes the Cholesky factor of , s.t. and . Rewriting (1) as a Gaussian integral and reparameterizing, we have

(2)

where each of the terms in both and is transformed as and where values in have similarly been mapped to . By taking the gradient of w.r.t. model-based posterior and further differentiating through the model to inputs , we can perform gradient ascent on acquisition values.333Parameters associated with model are not differentiated through and are therefore omitted for clarity.

Figure 1: Top left: GP-based posterior over 1-dimensional black-box given three initial observations (orange dots). Remaining: Response surfaces of various acquisition functions for pool-size . From ‘’ to ‘’, paths explored by gradient descent (green) and stochastic gradient descent (pink) when optimizing the various acquisition functions. Dashed horizontal lines denote axes of symmetry and large ‘’ (yellow) indicate the global maximum of each acquisition function.

When Monte Carlo integrating (2), an unbiased estimate to the acquisition gradients is then

(3)

where, by minor abuse of notation, we have substituted in . The availability of gradient information is especially important for , both because parallel acquisition functions are generally intractable and because the dimensionality of the acquisition space scales linearly in .

Examples of well-known acquisition functions amenable to this treatment are presented in Table 1. Figure 1 provides a visual example of the corresponding (stochastic) gradient ascent process, for each of the five acquisition functions shown in the table. Before going further, several points of interest in Table 1 warrant attention:

  1. Parallelizing UCB: To the best of our knowledge, the integral representation of is novel and leads to the first truly parallel formulation of (). Relevantly, using the reparameterization trick greatly simplifies the associated derivation. As with other acquisition functions discussed here, can be efficiently estimated via Monte Carlo and optimized using gradients. For the complete derivation and related formulae, please refer to Appendix A.

  2. Relaxing Heaviside step functions: Both Probability of Improvement () and Entropy Search () contain Heaviside step functions, whose derivatives are Dirac delta functions. Since these gradients are zero a.e., we instead propose the use of a function with temperature parameter . This combination has the appealing property that the resulting approximation becomes exact as , a property recently exploited in [11, 15]. To the extent that this soft approximation introduces an additional source of error, we argue that this downside is largely outweighed by the availability of informative gradients, which enable us to greatly reduce optimization error [2].

  3. Differentiating though the max(): Many acquisition functions, such as , use the max operator. While not technically differentiable, this operator is known to be subdifferentiable and affords well-behaved (sub)gradients.

3 Experiments

Figure 2: Left: For equivalent runtimes, best average case performance of each acquisition function given 256 evaluations of 8-dimensional samples from a GP prior with known hyperparameters when choosing pool-size queries in parallel. Remaining: Performance of individual acquisition functions for different optimizers thereof.

As baselines, we compared gradient-based approaches to optimizing acquisition functions with Random Search [1] and Dividing Rectangles [12] based ones. For stochastic gradient descent (SGD), we experimented with several off-the-shelf optimizers; of these, Adam [13] produced the best results and is reported here. Similarly, we tested various batch-sizes and report results for . For gradient descent (GD), we used a standard implementation of L-BFGS-B [21]. In both cases, gradient-based optimizers were run using 32 starting points sampled from the acquisition function. Finally, for both and , we set the temperature ; and, for , we set the confidence parameter .

Prior to running our experiments, we configured each acquisition function optimizer such that its runtime approximately matched that of the others. Further details regarding our experiments, including individual runtimes, are provided in Appendix B.

To help reduce the number of potentially confounding variables, we experimented on 8-dimensional tasks drawn from a Gaussian process prior with known hyperparameters. For each combination of acquisition function and optimizer, trials began with randomly chosen observations and iterated by choosing queries at a time.444Methods discussed here extend to the parallel asynchronous setting; but, we did not explore this option. Each pair was run on a total of 16 sampled tasks, with results shown in Figure 2. Across acquisition functions, gradient-based strategies markedly outperformed gradient-free alternatives. Further, stochastic and deterministic gradient methods delivered comparable performance.

4 Conclusion

We show how many popular acquisition functions can be written as Gaussian integrals amenable to the reparameterization trick. By reparameterizing these integrals, we clarify the differentiability of their Monte Carlo estimates and, in turn, provide a generalized method for using gradients to optimize acquisition values. Our results clearly demonstrate the superiority of gradient-based approaches for optimizing acquisition functions, even in modest dimensional cases. Further, we show how, by looking at the associated integrals through the lens of the reparameterization trick, the difficult process of deriving theoretically sound acquisition functions may be greatly simplified.

Acknowledgments

The support of the EPSRC Centre for Doctoral Training in High Performance Embedded and Distributed Systems (reference EP/L016796/1) is gratefully acknowledged. This work has partly been supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme under grant no. 716721.

References

  • Bergstra and Bengio [2012] J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 2012.
  • Bousquet and Bottou [2008] O. Bousquet and L. Bottou. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems 22, 2008.
  • Chevalier and Ginsbourger [2013] C. Chevalier and D. Ginsbourger. Fast computation of the multi-points expected improvement with applications in batch selection. In International Conference on Learning and Intelligent Optimization, 2013.
  • Contal et al. [2013] E. Contal, D. Buffoni, A. Robicquet, and N. Vayatis. Parallel Gaussian process optimization with upper confidence bound and pure exploration. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2013.
  • Cunningham et al. [2011] J. P. Cunningham, P. Hennig, and S. Lacoste-Julien. Gaussian probabilities and expectation propagation. arXiv preprint arXiv:1111.6832, 2011.
  • Desautels et al. [2014] T. Desautels, A. Krause, and J. W. Burdick. Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. Journal of Machine Learning Research, 2014.
  • Ginsbourger et al. [2010] D. Ginsbourger, R. Le Riche, and L. Carraro. Kriging is well-suited to parallelize optimization, chapter 6. Springer, 2010.
  • Glasserman [2013] P. Glasserman. Monte Carlo methods in financial engineering. Springer, 2013.
  • Hennig and Schuler [2012] P. Hennig and C. Schuler. Entropy search for information-efficient global optimization. Journal of Machine Learning Research, 2012.
  • Hernández-Lobato et al. [2014] J. Hernández-Lobato, M. Hoffman, and Z. Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In Advances in Neural Information Processing Systems 27, 2014.
  • Jang et al. [2016] E. Jang, S. Gu, and B. Poole. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144, 2016.
  • Jones et al. [1993] D. R. Jones, C. D. Perttunen, and B. E. Stuckman. Lipschitzian optimization without the Lipschitz constant. Journal of Optimization Theory and Applications, 1993.
  • Kingma and Ba [2014] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kingma and Welling [2014] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations, 2014.
  • Maddison et al. [2016] C. J. Maddison, A. Mnih, and Y. W. Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
  • Osborne et al. [2009] M. A. Osborne, R. Garnett, and S. J. Roberts. Gaussian processes for global optimization. In Proceedings of the 3rd International Conference on Learning and Intelligent Optimization, 2009.
  • Rezende et al. [2014] D. J. Rezende, M. Shakir, and D. Wierstra. Stochastic backpropagation and variational inference in deep latent Gaussian models. In Proceedings of the 31st International Conference on Machine Learning, 2014.
  • Snoek et al. [2012] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25, 2012.
  • Srinivas et al. [2010] N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning, 2010.
  • Wang et al. [2016] J. Wang, S. C. Clark, E. Liu, and P. I. Frazier. Parallel Bayesian global optimization of expensive functions. arXiv preprint arXiv:1602.05149, 2016.
  • Zhu et al. [1997] C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software, 1997.

Appendix A Parallel Upper Confidence Bound ()

Working backward through (2), we derive an exact expression for parallel . In doing so, we begin with the definition

(4)

where denotes the (element-wise) absolute value operator.555This definition comes directly from the standard integral identity . Using this fact and given , let such that . Under this notation, marginal can be expressed as

(5)
(6)
(7)

where parameterize a Gaussian posterior over . This integral form of is advantageous precisely because it naturally lends itself to the generalized expression

(8)
(9)
(10)

where . This representation has the requisite property that, for any size subset of , the value obtained when marginalizing out the remaining terms is its -UCB value.

Previous methods for parallelizing have approached the problem by imitating a purely sequential strategy [4, 6]. Because a fully Bayesian approach to sequential selection generally involves an exponential number of posteriors, these works incorporate various well-chosen heuristics for the purpose of efficiently approximate parallel .666Due to the stochastic nature of the mean updates, the number of posteriors grows exponentially in .. By directly addressing the associated -dimensional integral however, Equation (10) avoids the need for such approximations and, instead, unbiasedly estimates the true value.

Finally, the special case of marginal  (6) can be further simplified as

(11)

revealing an intuitive form, namely, the expectation of a Gaussian random variable (with rescaled covariance) above its mean.

Appendix B Experiment details

Runtimes of acquisition function optimizers
Optimizer
Random Search (RS)
Dividing Rectangles (DIRECT)
GD (L-BFGS-B)
SGD (Adam)
Table 2: Average runtime in seconds for each combination of acquisition function and optimizer when choosing the next pool of inputs. Reported numbers denote the mean and standard deviation of recorded wall-clock times.

To provide fair comparison between acquisition function optimizers, efforts were made to approximately match their respective runtimes. First, Random Search was run using a set of uniform random pools , at each step during BO. Subsequently, RS’s average runtime, measured over a handful of preliminary trials, was used as a target value when configuring the remaining optimizers. Table 2 provides individual runtimes for each combination of acquisition function and optimizer.

For stochastic gradient descent, we tested the following optimizers: SGD with momentum, RMSProp, and Adam. Trials were run using batch-sizes , each time tuning the number of SGD steps for equivalent runtimes. Of the tested configurations, steps using delivered the best performance.

Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
""
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
   
Add comment
Cancel
Loading ...
16815
This is a comment super asjknd jkasnjk adsnkj
Upvote
Downvote
""
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters
Submit
Cancel

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test
Test description