Dynamic Control of Explore/Exploit Trade-Off In Bayesian Optimization


Dipti Jasrasaria and Edward O. Pyzer-Knapp
IBM Research, Hartree Centre, Sci-Tech Daresbury
Email: epyzerk3@uk.ibm.com
Abstract

Bayesian optimization offers the possibility of optimizing black-box functions that are not accessible through traditional techniques. The success of Bayesian optimization methods, such as those based on Expected Improvement (EI), is significantly affected by the degree of trade-off between exploration and exploitation. Too much exploration can lead to inefficient optimization protocols, whilst too much exploitation leaves the protocol open to strong initial biases and a high chance of getting stuck in a local minimum. Typically, a constant margin is used to control this trade-off, which results in yet another hyper-parameter to be optimized. We propose contextual improvement as a simple, yet effective heuristic to counter this, achieving a one-shot optimization strategy. Our proposed heuristic can be calculated swiftly and improves both the speed and robustness of discovery of optimal solutions. We demonstrate its effectiveness on both synthetic and real-world problems, and explore the uncertainty, usually unaccounted for, introduced by pre-determining the search hyperparameters that control the explore-exploit trade-off.

Keywords: Bayesian Optimization; Artificial Intelligence; Hyperparameter Tuning

I Introduction

Many important real-world global optimization problems involve so-called ‘black-box’ functions - that is, it is impossible, either mathematically or practically, to access the object of the optimization analytically; instead we are limited to querying the function at some point x and getting a (potentially noisy) answer in return. Typical examples of black-box situations are the optimization of machine-learning model hyper-parameters [1, 2], or the experimental design of new products or processes [3].

One popular framework for optimization of black-box functions is Bayesian optimization. [1, 4, 5, 6, 7]

In this framework, a Bayesian model (typically a Gaussian process [8, 1], although other models have been successfully used [9]) built on known responses of the black-box function is used as an ersatz, providing closed-form access to the marginal means and variances. The optimization is then performed upon this ’response surface’ in place of the true surface. The model’s prior distribution is refined sequentially as new data is gathered by conditioning it upon the acquired data, with the resulting posterior distribution then being sampled to determine the next point(s) to acquire. In this way, all else being equal, the response surface should increasingly resemble the true surface. This is, in fact, dependent upon some of the choices made in the construction of the Bayesian model; it is worth noting that a poor initial construction of the prior, through for instance an inappropriate kernel choice, will lead to a poor optimization protocol.

Since Bayesian optimization does not have analytical access to properties traditionally used in optimization, such as gradients, it relies upon an acquisition function to determine which points to select. This acquisition function takes the model means and variances derived from the posterior distribution and translates them into a measure of the predicted utility of acquiring a point. At each iteration of Bayesian optimization, the acquisition function is maximized, with the data points corresponding to maximal acquisition being selected for sampling.

Bayesian optimization has particular utility when the function to be optimized is expensive, and thus the number of iterations the optimizer can perform is low. It also has utility as a ’fixed-resource optimizer’ since - unlike traditional optimization methods - it is possible to set a strict bound on resources consumed without destroying convergence criteria. Indeed, in abstract, the Bayesian optimization protocol of observe, hypothesize, validate is much closer in spirit to the scientific method than other optimization procedures.

I-A Acquisition Functions

A good choice of acquisition function is critical for the success of Bayesian optimization, although it is often not clear a priori which strategy is best suited for the task. Typical acquisition strategies fall into one of two types - improvement based strategies, and information based strategies. An improvement based strategy is analogous to the traditional optimization task in that it seeks to locate the global minimum/maximum as quickly as possible. An information based strategy is aimed at making the response surface as close to the real function as quickly as possible through the efficient selection of representative data. Information based strategies are strictly exploratory and thus we focus our attention on improvement based strategies for the duration of this paper.

In general, we can define the improvement, I(x), provided by a given data-point, x, as

I(x) = max(0, μ(x) − f(x⁺))    (1)

for maximization, where f(x⁺) is the best target value observed so far, μ(x) is the predicted mean supplied by the Bayesian model, and σ²(x) is its corresponding variance.

Two typically used acquisition functions are the Probability of Improvement (PI) [10] and the Expected Improvement (EI).[5] In PI, the probability that sampling a given data-point, x, improves over the current best observation is maximized:

PI(x) = P(f(x) ≥ f(x⁺)) = Φ((μ(x) − f(x⁺)) / σ(x))    (2)

where Φ(·) is the CDF of the standard normal distribution.

One problem with the approach taken in PI is that it will, by its nature, prefer a point with a small but certain improvement over one which offers a far greater improvement, but at a slightly higher risk. In order to combat this effect, Mockus proposed the EI acquisition function.[5] A perfect acquisition function would minimize the expected deviation from the true optimum; however, since that is not known (why else would we be performing optimization?), EI instead maximizes the expected improvement over the current best known point:

EI(x) = (μ(x) − f(x⁺)) Φ(z) + σ(x) φ(z),  where z = (μ(x) − f(x⁺)) / σ(x)    (3)

where φ(·) denotes the PDF of the standard normal distribution.

By maximizing the expectation in this way, EI is able to more efficiently weigh the risk-reward balance of acquiring a data point, as it considers not just the probability that a data point offers an improvement over the current best, but also how large that improvement will be. Thus a larger, but more uncertain, reward can be preferred to a small but high-probability reward (which would have been selected using PI).
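As a concrete point of reference, the following is a minimal NumPy/SciPy sketch of the PI (Equation 2) and EI (Equation 3) acquisition functions for a maximization problem; the function names are illustrative rather than taken from the paper, and `mu` and `sigma` are assumed to be the posterior mean and standard deviation returned by the surrogate model.

```python
# Hedged sketch of the PI and EI acquisition functions (Equations 2 and 3),
# assuming a maximization problem and Gaussian predictive marginals.
import numpy as np
from scipy.stats import norm

def probability_of_improvement(mu, sigma, f_best):
    """PI: probability that a candidate improves on the current best, f_best."""
    z = (mu - f_best) / np.maximum(sigma, 1e-12)  # guard against zero variance
    return norm.cdf(z)

def expected_improvement(mu, sigma, f_best):
    """EI: expectation of the improvement max(0, f(x) - f_best)."""
    sigma = np.maximum(sigma, 1e-12)
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)
```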

EI has been shown to have strong theoretical guarantees [11] and empirical effectiveness [1], and so we use it throughout this study as the baseline.

II Contextual Improvement

II-A Exploration vs. Exploitation Trade-Off

As with any global optimization procedure, in Bayesian optimization there exists a tension between exploration (i.e. the acquisition of new knowledge) and exploitation (i.e. the use of existing knowledge to drive improvement). Too much exploration will lead to an inefficient search, whilst too much exploitation will likely lead to local optimization - potentially missing completely a much higher value part of the information space.

EI, in its naive setting, is known to be overly greedy, as it focuses too much effort on the area in which it believes the optimum to be, without efficiently exploring additional areas of the parameter space which may turn out to be more optimal in the long term. The addition of a margin to the improvement function in Equation 1 allows for some tuning in this regard. [12, 13] A margin, ε, specifies a minimum amount of improvement over the current best point, and is integrated into Equation 1 as follows:

I(x) = max(0, μ(x) − f(x⁺) − ε)    (4)

for maximization, where ε represents the degree of exploration. The higher ε, the more exploratory the search, since high values of ε require greater inclusion of the predicted variance in the acquisition function.
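Under this formulation, ε enters the acquisition simply as a shift of the incumbent value. A minimal sketch, reusing the illustrative `expected_improvement` helper from the previous section:

```python
# Hedged sketch of epsilon-EI (Equation 4): demand at least `epsilon`
# improvement over the current best before a candidate is considered useful.
def expected_improvement_with_margin(mu, sigma, f_best, epsilon=0.3):
    return expected_improvement(mu, sigma, f_best + epsilon)
```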

II-B Definition of Contextual Improvement

The use of modified acquisition functions such as Equation 4 has one significant drawback. Through their use of a constant, ε, whose value is determined at the start of sampling, they introduce an additional hyperparameter which itself needs tuning for optimized performance. Indeed, the choice of ε can be the defining feature for the performance of the search. As Jones notes in his 2001 paper [12]:

…the difficulty is that [the optimization method] is extremely sensitive to the choice of the target. If the desired improvement is too small, the search will be highly local and will only move on to search globally after searching nearly exhaustively around the current best point. On the other hand, if [the desired improvement] is set too high, the search will be excessively global, and the algorithm will be slow to fine-tune any promising solutions.

Given that Bayesian optimization is intended for functions whose evaluations are expensive, this sensitivity is clearly not desirable.

In order to combat this, we propose a modification of the improvement which is implicitly tied to the underlying model, and thus changes dynamically as the optimization progresses. Since the exploration/exploitation trade-off is now dependent upon the model’s state at any point in time, we call this contextual improvement, I_c(x):

I_c(x) = max(0, μ(x) − f(x⁺) − ζ)    (5)

for maximization, where ζ is the contextual variance, which can be written as:

ζ = σ̄²(X*) = (1/N) Σ_i σ²(x*_i),  i = 1, …, N    (6)

where σ̄²(X*) is the mean of the variances contained within the sampled posterior distribution, and should be distinguished from σ²(x*), the individual variance of the prediction for a particular point x* in the posterior.

This is an intuitive setting for improvement, as exploration is preferred when, on average, the model has high uncertainty, and exploitation is preferred when the predicted uncertainty is low. This can provide a regularization for the search, due to the effects an overly local search will have on the posterior variance. The rationale for this is as follows: since the posterior variance can be written as

cov(f*) = K(X*, X*) − K(X*, X) K(X, X)⁻¹ K(X, X*)    (7)

where X* represents a set of as-yet unsampled data-points (i.e. part of the posterior rather than the prior), K denotes the kernel function, so that K(X, X*) is the covariance matrix evaluated at all pairs of training and test points, and similarly for K(X, X), K(X*, X) and K(X*, X*) [8] - we can see that the variance depends only upon the feature space. If a search is overly local (i.e. stuck in a non-global minimum), it will produce a highly anisotropic variance distribution, with small variances close to the local minimum sampled and larger variances elsewhere in the information space. This results in a larger value for the average posterior variance, which in turn, through Equation 5, forces greater weighting of the variance (equivalent to an increase in ε). Since the variance is low in the locally sampled area, the acquisition function is depressed there. It is important to note the difference between this approach and an information-centered approach. Because Equation 5 works directly on the acquisition function, if there are no other areas with a high expectation of improvement (i.e. the local optimum is also predicted to be a strong global optimum beyond the range of variance) then that area will continue to be sampled - this is not the case in an information-centered approach.

When the acquisition function is optimized directly (using a global optimization technique such as DIRECT - DIviding RECTangles) [14], we suggest estimating the mean posterior variance required for Equation 5, σ̄²(X*), by sampling over the function bounds with a low-discrepancy sequence such as a Sobol or Halton sequence. Alternatively, if the manifold is not suited to this type of exploration, an MCMC-type sampling method such as slice sampling [15, 16] will also produce satisfactory results, albeit at greater computational expense.
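The sketch below illustrates one possible realisation of this procedure under the reconstruction of Equations 5 and 6 given above: the contextual variance is estimated as the mean posterior variance over a Sobol sample of the bounds, and then plays the role of a dynamic margin in an EI-style acquisition. The `predict` callable (returning a mean and a variance array) and the helper names are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch of adaptive EI (AEI) based on contextual improvement.
import numpy as np
from scipy.stats import norm, qmc  # qmc (Sobol) requires SciPy >= 1.7

def contextual_variance(predict, bounds, n_samples=1024, seed=0):
    """Mean posterior variance over a low-discrepancy sample of the bounds."""
    bounds = np.asarray(bounds)
    sampler = qmc.Sobol(d=bounds.shape[0], scramble=True, seed=seed)
    X = qmc.scale(sampler.random(n_samples), bounds[:, 0], bounds[:, 1])
    _, var = predict(X)                      # assumed to return (mean, variance)
    return float(np.mean(var))

def adaptive_expected_improvement(mu, sigma, f_best, zeta):
    """EI evaluated against the dynamically shifted target f_best + zeta."""
    sigma = np.maximum(sigma, 1e-12)
    z = (mu - f_best - zeta) / sigma
    return (mu - f_best - zeta) * norm.cdf(z) + sigma * norm.pdf(z)
```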

III Experiments

III-A Definition of Success Metrics

In order to separate the contribution of contextual improvement from other algorithmic contributions, we directly compare EI with traditional improvement (EI-0.0), ε-EI with a value of ε = 0.3 (a common choice for ε), and EI using contextual improvement, which we will denote as adaptive EI (AEI). Our metrics for success are twofold: firstly, we measure the performance of the search (i.e. which method finds, on average, the best value) - this is referred to in the results tables as Mean - and secondly we measure the robustness of the search (how much variance there is between repeat searches). The robustness is measured as the difference between the 10th and 90th percentile confidence bounds on the best value at the final (i.e. 50th) sampling point, as calculated using a bootstrap. Throughout this study, robustness is referred to in the results tables as CI.
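A minimal sketch of how such a robustness metric could be computed is given below, assuming `best_values` holds the best target value reached by each repeated run at the final sampling point; the helper name and bootstrap size are illustrative.

```python
# Hedged sketch of the CI robustness metric: the width between the 10th and
# 90th percentiles of bootstrapped means of the final best-found values.
import numpy as np

def bootstrap_ci_width(best_values, n_boot=10_000, seed=0):
    rng = np.random.default_rng(seed)
    best_values = np.asarray(best_values)
    means = np.array([
        rng.choice(best_values, size=best_values.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.percentile(means, [10, 90])
    return hi - lo
```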

III-B Experimental Details

For all experiments, we utilize a Gaussian process with a squared-exponential kernel with ARD, using the implementation provided in the GPflow package.[17] We optimized the hyperparameters of the Gaussian process at each sampling point by maximizing the log-marginal likelihood with respect to the currently observed data-points. The validity of the kernels was checked by testing for vanishing length-scales, as these are typically observed when the kernel is mis-specified. Each experiment was repeated 10 times, with confidence intervals estimated by bootstrapping the mean.
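A minimal sketch of this surrogate setup is shown below. It uses the modern GPflow 2.x API (which differs from the version available when this work was carried out), and the toy data is purely illustrative.

```python
# Hedged sketch: GP surrogate with a squared-exponential (ARD) kernel,
# hyperparameters fitted by maximizing the log-marginal likelihood (GPflow 2.x).
import numpy as np
import gpflow

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(20, 2))                      # observed inputs
Y = np.sin(3.0 * X[:, :1]) + 0.1 * rng.normal(size=(20, 1))  # noisy targets

kernel = gpflow.kernels.SquaredExponential(lengthscales=np.ones(X.shape[1]))  # ARD
model = gpflow.models.GPR(data=(X, Y), kernel=kernel)
gpflow.optimizers.Scipy().minimize(model.training_loss, model.trainable_variables)

mean, var = model.predict_f(rng.uniform(0.0, 1.0, size=(5, 2)))  # posterior marginals
```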

III-C Optimization of Synthetic Functions

One of the traditional ways of evaluating the effectiveness of Bayesian optimization strategies is to compare their performance on synthetic functions. These have the advantage of being very fast to evaluate, with well-known optima and bounds. Unfortunately, such functions are not necessarily representative of real-world problems, hence the inclusion of the other two categories. We have chosen to evaluate three well-known benchmarking functions: the Branin-Hoo function (2D, minimization), the 6-humped camelback function (2D, minimization), and the 6-dimensional Hartmann function (6D, maximization).
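For reference, a minimal implementation of the first of these benchmarks, the Branin-Hoo function, is given below; the other functions follow the same pattern and their standard forms are widely documented.

```python
# Branin-Hoo test function (2-D, minimization), usually evaluated on
# x1 in [-5, 10], x2 in [0, 15]; its global minimum value is ~0.397887.
import numpy as np

def branin(x1, x2):
    a, b, c = 1.0, 5.1 / (4.0 * np.pi ** 2), 5.0 / np.pi
    r, s, t = 6.0, 10.0, 1.0 / (8.0 * np.pi)
    return a * (x2 - b * x1 ** 2 + c * x1 - r) ** 2 + s * (1.0 - t) * np.cos(x1) + s

print(branin(np.pi, 2.275))  # one of the three global minima, ~0.3979
```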

III-D Tuning of Machine Learning Algorithms

A popular use for Bayesian optimization is in tuning the hyperparameters of other machine-learning algorithms. [1, 2] For this reason, the lack of dependence of contextual improvement on pre-set scheduling hyperparameters is particularly important. In order to test the effectiveness of contextual improvement for this task, we use it to determine optimal hyperparameters for a support vector machine for the abalone regression task.[18] In this context we have three hyperparameters to optimize - C (the regularization parameter), ε (the insensitive-loss width for regression) and γ (the RBF kernel parameter). For the actual prediction process, we utilize the support vector regression implementation in scikit-learn.[19] We also tune five hyperparameters of a 2-layer multi-layer perceptron to tackle the MNIST 10-class image classification problem[20] for handwritten digits. Here we tune the number of neurons in each layer, the level of dropout [21] in each layer, and the learning rate for the stochastic gradient descent, using the MLP implementation provided in the keras package,[22] which was used in conjunction with TensorFlow.[23]
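The SVM task can be phrased as a black-box objective mapping the three hyperparameters to a cross-validated error. A hedged sketch is shown below; the log-scale parameterization, the 5-fold split, and the data-loading step (any (X, y) regression set, such as abalone features and ages, can be substituted) are assumptions for illustration.

```python
# Hedged sketch of the SVM tuning objective: (C, epsilon, gamma) -> CV RMSE.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

def svr_objective(log10_C, log10_eps, log10_gamma, X, y):
    """Return the mean cross-validated RMSE, to be minimized by the optimizer."""
    model = SVR(kernel="rbf",
                C=10.0 ** log10_C,
                epsilon=10.0 ** log10_eps,
                gamma=10.0 ** log10_gamma)
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    return float(np.sqrt(-scores.mean()))
```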

III-E Experimental Design

An obvious use for Bayesian optimization is in experimental design, where each evaluation can be expensive in both time and money, and the targets can be noisy. For this experiment, we aim to design 2D aerofoils which optimize the lift-to-drag ratio as calculated using the JavaFoil program.[24] In order to specify the aerofoil design, we use the NACA 4-digit classification scheme, which encodes the thickness relative to the chord length, the camber relative to the chord length, and the position of the maximum camber along the chord; together with the angle of attack, this results in a 4-dimensional optimization problem. It is important to note that, due to the empirical treatment of the drag coefficient, unrealistically high values of the lift-to-drag ratio can be observed when using JavaFoil as the ground truth. We chose to simply optimize the ground-truth function as calculated, but note the potential to apply a constraint in the optimization to account for this. [25]

Fig. 1: Summary of the searches performed on synthetic functions. (a), (c) and (e) show the evolution of the best sampled value for Branin, camelback and 6-D Hartmann respectively, whilst (b), (d) and (f) show the evolving fragility of the model based upon the initial seed data. Values are constructed by bootstrapping the mean over each equivalent sampling position as the search progresses. If the performance of the model is strongly varying amongst the 10 trial runs performed then the value is large. Since we are aiming at developing a robust - ideally one-shot - framework, a small value is most desirable here.

IV Results and Discussion

IV-A Synthetic Functions

A graphical representation of the search, including the optimization progress and model fragility (the variance between runs) is shown in Figure 1. A numerical comparison is shown below in Table I.

         Branin (min)       Camelback (min)     Hartmann (max)
         Mean      CI       Mean       CI       Mean      CI
AEI      0.406     0.002    -1.000     0.000    3.074     0.122
EI-0.0   0.997     1.481    -0.9259    1.000    3.081     0.439
EI-0.3   0.702     1.185    -0.9499    0.816    3.0754    0.652
TABLE I: Summary of the results of experiments on synthetic functions. For the confidence intervals (CI), smaller values indicate greater reliability over multiple runs.

It can be seen that our contextual setting of EI produces superior search capability for the three synthetic functions studied. For all but the 6-dimensional Hartmann function, AEI on average produces the best results, and in all cases it achieves its result with the greatest reliability (smallest value for CI). This is due, in part, to its ability to extract itself from local minima: in the case of the Branin-Hoo function, the higher means for both static settings of EI arise from the algorithm getting stuck in a local minimum with a far worse value. Even in the one case in which AEI did not perform best - the 6-dimensional Hartmann function - the average result discovered is extremely close to the best discovered by EI, and AEI demonstrates superior reliability. It is also interesting to observe that, in general, the AEI search tracks with, or outperforms, whichever method is performing best in the early sampling. Given that this method does not require the tuning parameter of traditional improvement, this can be seen as a validation of the dynamic approach taken here.

IV-B Tuning of Machine Learning Algorithms

As previously described, we test our contextual improvement on two tasks - the tuning of three hyperparameters of a support vector machine for the abalone regression task, and the tuning of five parameters of a 2-hidden-layer multi-layer perceptron for the MNIST classification task. The results can be seen in Table II.

         SVM                 MLP
         Mean      CI        Mean      CI
AEI      1.940     0.006     0.253     0.086
EI-0.0   1.940     0.004     0.298     0.223
EI-0.3   1.940     0.004     0.1938    0.008
TABLE II: Summary of the results of experiments on the tuning of machine learning algorithms - a support vector machine, and a 2-layer multi-layer perceptron. For the confidence intervals (CI), smaller values indicate greater reliability over multiple runs.

For the SVM regression task, it can be seen that all methods arrive at the same result, on average, after 50 epochs, with very little difference in robustness, although AEI does perform slightly worse. This could be indicative of a funnel-shaped information landscape, in which one basin is both dominant and wide; this can be seen in Figure 2. This is an ideal case for hyperparameter setting, as the method used does not seem to particularly impact the results, although, as can be seen from the other experiments in this study, it is not a typical one. As the study in the next section clearly shows, however, this could also be due to fortunate choices of which values of ε to study, and the authors argue that in tasks such as hyperparameter searches, which can be critical to the success of tasks further down the pipeline, disconnecting confidence in the quality of the hyperparameters from the setting of a search hyperparameter such as ε should be considered a significant advantage of this method.

Fig. 2: Visualization of the search progress for EI with ε set to 0.0 and 0.3, and our Adaptive EI, which is based upon contextual improvement, for setting the hyperparameters of support vector machines performing the abalone regression experiment. Each experiment is performed 10 times with 3 different randomly selected data points, with confidence intervals produced by bootstrapping the mean.

The five-dimensional MLP-classification hyperparameter-setting task was more challenging for AEI, and the best performance was obtained using EI with ε = 0.3. It is worth noting, however, that for this task the performance of ε = 0.0 - significantly worse in both search results and CI - may suggest that the slightly worse performance of AEI is a price worth paying given the potential ramifications of choosing the wrong value for ε. Of course, this is said under the assumption that there is no a priori knowledge about this value; if this is not the case, then that knowledge should be taken into account when making risk-reward judgements. This is studied and discussed in more detail in the next section. The authors also recognise the possibility of building this knowledge into the contextual improvement framework, and this is an area under ongoing investigation.

IV-C Experimental Design

         Aerofoil
         Mean        CI
AEI      255.0327    183.7029
EI-0.0   234.0355    195.8036
EI-0.3   187.7445    165.3584
TABLE III: Summary of the results of experiments on the experimental design of 2D aerofoils, a maximization problem. For the confidence intervals (CI), smaller values indicate greater reliability over multiple runs.

This problem was selected to represent a real-world design problem. Experimental design is an area in which Bayesian optimization has the potential to provide powerful new capabilities, as traditional design of experiment (DoE) approaches are static and information centric (exploratory), and thus have the potential to be highly inefficient for design tasks. The performance of our AEI protocol here demonstrates the value of dynamic control of the explore/exploit trade-off. The results are shown in Table III. Unlike the other problems investigated thus far, the ε = 0.3 setting of EI is highly inefficient, producing the worst lift/drag ratios of the three protocols, although as a result of its exploratory nature it has better reproducibility (lower CI). As can be seen in Figure 3, AEI discovers the highest-performing aerofoils with more reliability than the next best, the ε = 0.0 setting of EI - demonstrating how the method balances the twin goals of performance and reproducibility.

Fig. 3: Visualization of the search progress for EI with ε set to 0.0 and 0.3, and our Adaptive EI, which is based upon contextual improvement. Each experiment is performed 10 times with 3 different randomly selected data points, with confidence intervals produced by bootstrapping the mean.

IV-D Overall Performance - Sensitivity to Hyperparameters

One way to measure the robustness of AEI is to compare the rankings of the search and CI metrics over the whole range of tasks performed in this study. Since raw rankings can be misleading (a close second ranks the same as a search in which the gap between methods was much wider), we utilize a normalized ranking, Z, computed as follows:

Z = (r − r_best) / (r_max − r_min)    (8)

where r represents the result of a particular strategy, r_best the result of the best strategy, and r_max − r_min the range of results encountered in the study.
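As a short worked example of Equation 8, the Branin means from Table I give Z = 0 for AEI (the best strategy), Z = 1 for EI-0.0 (the worst), and Z ≈ 0.50 for EI-0.3; the snippet below (illustrative only, without the averaging over experiments and metrics used to build Table IV) makes the calculation explicit.

```python
# Hedged sketch of the normalized ranking Z (Equation 8) for one experiment.
def normalized_rank(result, results, minimize=True):
    best = min(results) if minimize else max(results)
    return abs(result - best) / (max(results) - min(results))

branin_means = {"AEI": 0.406, "EI-0.0": 0.997, "EI-0.3": 0.702}
z = {k: normalized_rank(v, list(branin_means.values())) for k, v in branin_means.items()}
print(z)  # {'AEI': 0.0, 'EI-0.0': 1.0, 'EI-0.3': ~0.50}
```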

Calculating the average value of Z across each of the experiments performed in this study sheds light on the benefit provided by dynamic control of the explore-exploit trade-off (essentially, a dynamically varying ε). Our contextual-improvement based strategy (AEI) provides superior results for both the search results (i.e. the discoverability of desirable solutions) and the CI (i.e. the robustness of the search). Additionally, we can start to estimate the dependency of these metrics upon a good choice of epsilon by comparing the scores obtained using ε = 0.0 and ε = 0.3. Comparing the overall Z score (i.e. the combination of search and CI), we see that the difference between the two settings of epsilon is around 78% of the total value of our dynamic setting (Table IV), representing a significant degradation in performance.

         Z
         Search     CI         Overall
AEI      0.3910     0.3278     0.3594
EI-0.0   0.7187     0.7665     0.7426
EI-0.3   0.4854     0.4369     0.4611
TABLE IV: Summary of the results of the experiments performed during this study using the Z criterion in Equation 8 (lower is better). Bold indicates the best performing method.

IV-E The Importance of a One-Shot Technique

It is important to note here that the true apples-to-apples comparison is not really between AEI and any one value of ε, be it 0.0 or 0.3 (or even the difference between these two values), but instead between AEI and the behaviour over a wide range of ε, since the correct value cannot be determined a priori. In order to better illustrate this point, we perform two of the tasks described in the paper - the camelback minimization (a synthetic function) and a ’real world’ example of tuning the hyperparameters of an SVM for the abalone regression problem - over a range of values for ε from 0.0 to 1.0, with a resolution of 0.01 (i.e. 100 values of ε).

Fig. 4: Visualization of the search progress for EI with ε ranging between 0.0 and 1.0 (grey) and our Adaptive EI (black), which is based upon contextual improvement. (a) shows the effect of varying ε for the camelback minimization, whilst (b) shows the effect of varying ε for the SVM hyperparameter search experiment. Each experiment is performed 5 times with 3 different randomly selected data points, with confidence intervals produced by bootstrapping the mean.

The additional uncertainty associated with selecting a particular value of ε can clearly be seen in Figure 4. Whilst we can see from the previous experiments that it is possible to find a value of ε which performs as well as AEI, it is hard to know in advance what that value should be. Figure 4 shows the potential danger of using a poor value of ε, with Figure 4(b) in particular showing how costly a bad choice can be when the number of samples is low. In the typical Bayesian optimization setting this is particularly important, as each ground-truth evaluation can carry a significant financial or computational cost and sampling is therefore limited; a method which minimizes this risk has significant benefits. Additionally, since many decision-making exercises increasingly rely on deterministic (i.e. not Bayesian), but highly scalable, machine learning models, the potential consequences of not locating a good set of hyperparameters can be significant. ’One shot’ methods such as AEI afford the user a larger degree of confidence that the search has located a good set of parameters without the need to evaluate multiple search settings (as would be required with ε).

An approximation to the risk-reward trade-off can be made visually using Figure 4. Experiments in which the gamble failed to pay dividends (i.e. the performance of using a constant ε is worse than AEI) are represented by the shaded area above the black trace. This can be thought of as the set of situations in which AEI outperforms a static model. It can be seen that for both tasks evaluated there is a large density of experiments which fall into this ‘loss’ zone, especially when a small number of samples has been drawn. For an idea of the magnitude of the risk, one can compare the areas shaded grey above and below the black trace. Again, the expectation, given a random selection of ε, lies significantly in the ‘loss’ region, with this result being more pronounced at low numbers of samples.

V Conclusion

We present a simple, yet effective adaptation to the traditional formulation of improvement, which we call contextual improvement. This allows a Bayesian optimization protocol to dynamically schedule the trade-off between exploration and exploitation, resulting in a more efficient data-collection strategy. This is of critical importance in Bayesian optimization, which is typically used to optimize functions where each evaluation is expensive to acquire. We have demonstrated that EI based upon contextual improvement outperforms both EI using traditional improvement and improvement with a margin, across a range of tasks from synthetic functions to real-world problems such as the experimental design of 2-D NACA aerofoils and the tuning of machine learning algorithms. We also note that our proposed contextual improvement results in an expected-improvement search which is significantly more robust to the random seed data - a highly desirable property, since it allows the use of minimal seed data sets. In traditional Bayesian optimization settings, where each data point is expensive to acquire, this can result in significant savings in both time and financial outlay.

VI Acknowledgements

The authors thank Dr Kirk Jordan for helpful discussions.

References
