Knowing The What But Not The Where in Bayesian Optimization


Abstract

Bayesian optimization has demonstrated impressive success in finding the optimum input $x^*$ and output $f^* = f(x^*) = \max_x f(x)$ of a black-box function $f$. In some applications, however, the optimum output $f^*$ is known in advance and the goal is to find the corresponding optimum input $x^*$. In this paper, we consider a new setting in BO in which the knowledge of the optimum output $f^*$ is available. Our goal is to exploit the knowledge about $f^*$ to search for the input $x^*$ efficiently. To achieve this goal, we first transform the Gaussian process surrogate using the information about the optimum output. Then, we propose two acquisition functions, called confidence bound minimization and expected regret minimization. We show that our approaches work intuitively and give quantitatively better performance than standard BO methods. We demonstrate real applications in tuning a deep reinforcement learning algorithm on the CartPole problem and XGBoost on the Skin Segmentation dataset, in which the optimum values are publicly available.


1 Introduction

Bayesian optimization (BO) (Brochu_2010Tutorial; Shahriari_2016Taking; oh2018bock; frazier2018tutorial) is an efficient method for the global optimization of a black-box function. BO has been successfully employed in selecting chemical compounds (Hernandez_2017Parallel), in material design (Frazier_2016Bayesian; li2018accelerating_ICDM), and in searching for hyperparameters of machine learning algorithms (Snoek_2012Practical; klein2017fast; chen2018bayesian). These recent results suggest BO is more efficient than manual, random, or grid search.

Bayesian optimization finds the global maximizer $x^* = \arg\max_{x \in \mathcal{X}} f(x)$ of the black-box function $f$ by incorporating prior beliefs about $f$ and updating the prior with evaluations $y = f(x)$, where $\mathcal{X}$ is the search domain. The model used for approximating the black-box function is called the surrogate model. A popular choice of surrogate model is the Gaussian process (GP) (Rasmussen_2006gaussian), although there are alternative options, such as random forests (Hutter_2011Sequential), deep neural networks (Snoek_2015Scalable), Bayesian neural networks (Springenberg_2016Bayesian) and Mondrian trees (wang2018batched). This surrogate model is then used to define an acquisition function which determines the next query of the black-box function.

In some settings, the optimum output $f^*$ is known in advance. For example, the optimal reward is available for common reinforcement learning benchmarks. As another example, the optimum accuracy achievable when tuning a classification algorithm is known for some public datasets. The goal is then to efficiently find the best hyperparameters that produce the known best performance using the fewest number of queries.

In this paper, we give the first BO approach to this setting in which we know what we are looking for, but we do not know where it is. Specifically, we know the optimum output $f^*$ and aim to search for the unknown optimum input $x^*$ by utilizing the value of $f^*$.

We incorporate the information about $f^*$ into Bayesian optimization in the following ways. First, we use the knowledge of $f^*$ to build a transformed GP surrogate model. Our intuition in transforming a GP is based on the fact that the black-box function value should not be above the threshold $f^*$ (since $f(x) \le f^*$ for all $x$, by definition). As a result, the GP surrogate should also follow this property. Second, we propose two acquisition functions which make decisions informed by the $f^*$ value, namely confidence bound minimization and expected regret minimization.

We validate our model using benchmark functions and by tuning a deep reinforcement learning algorithm where we know the optimum value in advance. These experiments demonstrate that our proposed framework behaves intuitively and experimentally outperforms the baselines. Our main contributions are summarized as follows:

  • a first study of Bayesian optimization exploiting the known optimum output $f^*$;

  • a transformed Gaussian process surrogate using the knowledge of $f^*$; and

  • two novel acquisition functions to efficiently select the optimum location $x^*$ given $f^*$.

2 Preliminaries

In this section, we review some of the existing acquisition functions from the Bayesian optimization literature which can readily incorporate the known value $f^*$. Then, we summarize the possible transformation techniques used to control the Gaussian process using $f^*$.

2.1 Available acquisition functions for the known $f^*$

Bayesian optimization uses an acquisition function to make a query. Among many existing acquisition functions (Hennig_2012Entropy; Hernandez_2014Predictive; Wang_2016Optimization; letham2019constrained; astudillo2019bayesian), we review two acquisition functions which can incorporate the known optimum output $f^*$ directly in their forms. We then use these two acquisition functions as baselines for comparison.

Expected improvement with known incumbent $f^*$. EI (Mockus_1978Application) considers the expectation over the improvement function, defined over an incumbent $\xi$ as $I(x) = \max(f(x) - \xi, 0)$. One needs to define the incumbent to improve upon. Existing research has considered modifying this incumbent with various choices (Wang_2014Theoretical; Berk_2018Exploration_ECML). The typical choice of the incumbent is the best observed value so far, $\xi = \max_{i \le t} y_i$, where $D_t = \{x_i, y_i\}_{i=1}^{t}$ is the dataset up to iteration $t$. Given the known optimum output $f^*$, one can readily use it as the incumbent, i.e., setting $\xi = f^*$, to obtain the following form:

$\alpha^{\mathrm{EI}}(x) = [\mu(x) - f^*]\,\Phi(z) + \sigma(x)\,\phi(z)$   (1)

where $\mu(x)$ is the GP predictive mean, $\sigma^2(x)$ is the GP predictive variance, $z = \frac{\mu(x) - f^*}{\sigma(x)}$, $\phi$ is the standard normal p.d.f. and $\Phi$ is the c.d.f.
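
As a concrete reference, a minimal NumPy/SciPy sketch of Eq. (1) is given below; `mu` and `sigma` are assumed to be the GP predictive mean and standard deviation at a batch of candidate points, and the small `eps` guard is our addition to avoid division by zero.

```python
import numpy as np
from scipy.stats import norm

def ei_known_incumbent(mu, sigma, f_star, eps=1e-9):
    """Expected improvement of Eq. (1) with the known optimum f_star as incumbent."""
    sigma = np.maximum(sigma, eps)        # guard against zero predictive std
    z = (mu - f_star) / sigma
    return (mu - f_star) * norm.cdf(z) + sigma * norm.pdf(z)
```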

Output entropy search with known $f^*$. The second group of acquisition functions, which are ready to incorporate the known optimum, includes several approaches gaining information about the output, such as output-space PES (Hoffman_2015Output), MES (Wang_2017Max) and FITBO (ru2018fast). These approaches consider different ways to gain information about the optimum output $f^*$. When $f^*$ is not known in advance, Hoffman_2015Output; Wang_2017Max utilize Thompson sampling to sample $f^*$, or a collection of such samples, while ru2018fast consider $f^*$ as a hyperparameter. After generating optimum value samples, the above approaches consider different approximation strategies.

Since the optimum output $f^*$ is available in our setting, we can use it directly within the above approaches. We choose to review MES due to its simplicity and closed-form computation. Given the known $f^*$ value, MES approximates the posterior using a truncated Gaussian distribution such that the distribution of $f(x)$ satisfies $f(x) \le f^*$.

Let $\gamma(x) = \frac{f^* - \mu(x)}{\sigma(x)}$; we have the MES as

$\alpha^{\mathrm{MES}}(x) = \frac{\gamma(x)\,\phi(\gamma(x))}{2\,\Phi(\gamma(x))} - \log \Phi(\gamma(x)).$
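
A corresponding sketch of the MES score when the single max-value sample is the known $f^*$ (so no Thompson sampling of optimum values is needed); again `mu` and `sigma` denote the assumed GP predictive mean and standard deviation.

```python
import numpy as np
from scipy.stats import norm

def mes_known_fstar(mu, sigma, f_star, eps=1e-9):
    """MES acquisition when the known f_star replaces sampled max values."""
    sigma = np.maximum(sigma, eps)
    gamma = (f_star - mu) / sigma
    # norm.logcdf is numerically safer than log(norm.cdf(...)) for small values
    return gamma * norm.pdf(gamma) / (2.0 * norm.cdf(gamma)) - norm.logcdf(gamma)
```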

2.2 Gaussian process transformation for $f^*$

Figure 1: Comparison of the transformed GP with the vanilla GP on two different functions (left and right). The known output $f^*$ and unknown input $x^*$ are highlighted by horizontal and vertical red lines, respectively. Top: the GP is allowed to go above and below $f^*$. Bottom: the transformed GP lifts the surrogate model closer to the known optimum output (left) and does not go above $f^*$ (right).

We summarize several transformation approaches which could potentially be used to enforce that the function stays everywhere below the known upper bound $f^*$.

The first category is to use squashing functions such as sigmoid and tanh. However, there are two problems with such functions. The first problem is that they require knowledge of both the lower bound and the upper bound for normalization to the predefined ranges, i.e., $(0, 1)$ for sigmoid and $(-1, 1)$ for tanh. However, we do not know the lower bound in our setting. The second problem is that exact inference for a GP is analytically intractable under these transformations. In particular, this becomes the Gaussian process classification problem (nickisch2008approximations), where approximations must be made, such as expectation propagation (kuss2005assessing; riihimaki2013nested; hernandez2016scalable).

The second category is to transform the output of a GP using warping (mackay1998introduction; snelson2004warped). However, the warped GP is less efficient in the context of Bayesian optimization. This is because a warped GP requires more data points1 to learn the mapping from the original to the transformed space, while we only have a small number of observations in the BO setting.

The third category makes use of a linearization trick (osborne2012active; gunter2014sampling), as GPs are closed under linear transformations. This linearization ensures that we arrive at another GP after transforming our existing GP. In this paper, we follow this linearization trick to transform the surrogate model given $f^*$.

3 Bayesian Optimization When True Optimum Value Is Known

Figure 2: Illustration of the proposed acquisition functions CBM and ERM. A yellow star indicates the maximum of the acquisition function and thus the selected point. Using the knowledge of $f^*$, CBM and ERM can better identify $x^*$ while EI and UCB cannot.

We present a new approach for Bayesian optimization in situations where knowledge of the optimum output (value) $f^*$ is available. Our goal is to utilize this knowledge to improve BO performance in finding the unknown optimum input (location) $x^*$. We first encode $f^*$ into an informed GP surrogate model through transformation, and then we propose two acquisition functions which effectively exploit the knowledge of $f^*$.

3.1 Transformed Gaussian process

We make use of the knowledge about the optimum output $f^*$ to control the GP surrogate model through transformation. Our transformation starts with two key observations: firstly, the function value should reach the optimum output; secondly, it should never be greater than the optimal value $f^*$, by definition of $f^*$ being the maximum value. Therefore, the desired GP surrogate should not go above this threshold. Based on this intuition, we propose the GP transformation given $f^*$ as

$g(x) = \sqrt{2\,[f^* - f(x)]}$

and place a GP prior on $g$, so that $f(x) = f^* - \frac{1}{2}g^2(x) \le f^*$ by construction.

Our above transformation avoids the potential issues described in Sec. 2.2. That is, we do not need many samples to learn the transformation mapping, and the desired property $f(x) \le f^*$ always holds. The prior mean for $g$ can be set either to $0$ or to $\sqrt{2 f^*}$. These choices bring two different effects. A zero mean prior tends to lift the surrogate model closer to $f^*$, since $f(x) \to f^*$ when $g(x) \to 0$. On the other hand, the non-zero mean encourages the prior mean of $f$ to be closer to zero, as is common practice in GP modeling where the output is standardized around zero.

Given the observations $D_t = \{x_i, y_i\}_{i=1}^{t}$, we can compute the observations for $g$, i.e., $z_i = \sqrt{2\,(f^* - y_i)}$ where $y_i = f(x_i)$. Then, we can write the GP posterior of $g$ as $\mu_g(x) = m_g(x) + \mathbf{k}_*^{T} \mathbf{K}^{-1} (\mathbf{z} - \mathbf{m}_g)$ and $\sigma_g^2(x) = k(x, x) - \mathbf{k}_*^{T} \mathbf{K}^{-1} \mathbf{k}_*$, where $m_g$ is the prior mean of $g$.

Figure 3: We show that our model performs much better using the transformed Gaussian process (TGP) than the vanilla GP. The knowledge of $f^*$ is useful to inform the surrogate model for better optimization, especially for high dimensional functions.

We do not introduce any extra parameter for the above transformation. However, the transformation causes the distribution of $f(x) = f^* - \frac{1}{2}g^2(x)$ to become a non-central $\chi^2$ process, making the analysis intractable. To tackle this problem and obtain a posterior distribution that is also Gaussian, we employ an approximation technique presented in gunter2014sampling; ru2018fast. That is, we perform a local linearization of the transformation around $g_0$ and obtain $f(x) \approx f^* - \frac{1}{2}g_0^2(x) - g_0(x)\,[g(x) - g_0(x)]$, where the gradient is $\partial f / \partial g = -g$. Following gunter2014sampling; ru2018fast, we set $g_0$ to the mode of the posterior distribution, $g_0(x) = \mu_g(x)$, and obtain the expression $f(x) \approx f^* + \frac{1}{2}\mu_g^2(x) - \mu_g(x)\,g(x)$.

Since a linear transformation of a Gaussian process remains Gaussian, the predictive posterior distribution for $f$ now has a closed form, $p(f(x) \mid D_t) = \mathcal{N}(\mu_f(x), \sigma_f^2(x))$, where the predictive mean and variance are given by

$\mu_f(x) = f^* - \frac{1}{2}\mu_g^2(x)$   (2)
$\sigma_f^2(x) = \mu_g^2(x)\,\sigma_g^2(x)$   (3)

These Eqs. (2) and (3) are the key to computing our acquisition functions in the next sections. As an effect of the transformation, the predictive uncertainty of the transformed GP becomes larger than that of the vanilla GP at locations where $\mu_f(x)$ is low. This is because $\mu_g(x)$ is high when $\mu_f(x)$ is low, and thus $\sigma_f^2(x)$ is high in Eq. (3). This property may let other acquisition functions (e.g., UCB, EI) explore more aggressively than they should. We further examine these effects in the supplement.
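
To make the transformation concrete, here is a minimal sketch, under our reading of Eqs. (2)-(3), mapping raw outputs to transformed targets and mapping the predictive moments of the GP on $g$ to those of $f$; `mu_g` and `var_g` are the assumed posterior mean and variance of $g$ at the query points.

```python
import numpy as np

def transformed_observations(y, f_star):
    """Map raw outputs y to targets z for the GP on g(x) = sqrt(2 (f* - f(x)))."""
    return np.sqrt(2.0 * (f_star - np.asarray(y)))

def transformed_predictive(mu_g, var_g, f_star):
    """Predictive moments of f from the GP on g via linearization, Eqs. (2)-(3)."""
    mu_f = f_star - 0.5 * mu_g**2     # Eq. (2): never above f*
    var_f = (mu_g**2) * var_g         # Eq. (3): inflated where mu_f is low
    return mu_f, var_f
```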

We visualize the property of our transformed GP and compare it with the vanilla GP in Fig. 1. By transforming the GP using $f^*$, we encode the knowledge about $f^*$ into the surrogate model, and thus are able to enforce that the surrogate model gets close to, but never goes above, $f^*$, as desired, unlike the vanilla GP. In the supplement, we provide further illustration that transforming the surrogate model can help to find the optimum faster. We present a quantitative comparison of our transformed GP and the vanilla GP in Fig. 3 and in the supplement.

3.2 Confidence bound minimization

In this section, we introduce confidence bound minimization (CBM) to efficiently select the (unknown) optimum location $x^*$ given $f^*$. Our idea is based on the underlying concept of GP-UCB (Srinivas_2010Gaussian). We consider that the GP surrogate satisfies, at any location $x$, with high probability (w.h.p.),

$|\mu_f(x) - f(x)| \le \sqrt{\beta_t}\,\sigma_f(x)$   (4)

where $\beta_t$ is a hyperparameter. Given the knowledge of $f^*$, we can express this property at the optimum location $x^*$, where $f(x^*) = f^*$, to have w.h.p.

$|\mu_f(x^*) - f^*| \le \sqrt{\beta_t}\,\sigma_f(x^*).$

This is equivalent to writing $f^* \in \left[\mu_f(x^*) - \sqrt{\beta_t}\,\sigma_f(x^*),\; \mu_f(x^*) + \sqrt{\beta_t}\,\sigma_f(x^*)\right]$. Therefore, we can find the next point by minimizing the confidence bound around locations whose estimated value is close to the optimum value $f^*$. That is, we define

$\alpha^{\mathrm{CBM}}(x) = |\mu_f(x) - f^*| + \sqrt{\beta_t}\,\sigma_f(x)$

where $\mu_f(x)$ and $\sigma_f^2(x)$ are the GP mean and variance from Eq. (2) and Eq. (3), respectively. We select the next point by taking

$x_{t+1} = \arg\min_{x \in \mathcal{X}} \alpha^{\mathrm{CBM}}(x).$   (5)

In the above objective function, we aim to quickly locate the area potentially containing an optimum. Since the acquisition function is non-negative, $\alpha^{\mathrm{CBM}}(x) \ge 0$, it takes its minimum value at the ideal location where $\mu_f(x) = f^*$ and $\sigma_f(x) = 0$. When these two conditions are met, we can conclude that $f(x) = f^*$, and thus $x$ is the $x^*$ we are looking for, by the property of Eq. (4).
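
A short sketch of the CBM score of Eq. (5) under the above definitions; `mu_f` and `sigma_f` come from the transformed GP (Eqs. (2)-(3)), and the default `beta_t` is an illustrative placeholder rather than the paper's schedule.

```python
import numpy as np

def cbm(mu_f, sigma_f, f_star, beta_t=2.0):
    """Confidence bound minimization, Eq. (5); the next point minimizes this."""
    return np.abs(mu_f - f_star) + np.sqrt(beta_t) * sigma_f
```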

Because CBM involves a hyperparameter $\beta_t$ to which performance can be sensitive, we next propose another acquisition function that incorporates the knowledge of $f^*$ without any hyperparameter.

3.3 Expected regret minimization

We next develop our second acquisition function using $f^*$, called expected regret minimization (ERM). We start with the regret function $r(x) = f^* - f(x)$. The likelihood of the regret under the normal posterior distribution is as follows:

$p(r) = \frac{1}{\sqrt{2\pi}\,\sigma_f(x)} \exp\left( -\frac{[f^* - r - \mu_f(x)]^2}{2\,\sigma_f^2(x)} \right)$   (6)

As the end goal in optimization is to minimize the regret, we define our acquisition function to minimize this expected regret, $\alpha^{\mathrm{ERM}}(x) = \mathbb{E}[r(x)]$. Using the likelihood function in Eq. (6), we write the expected regret minimization acquisition function as

$\alpha^{\mathrm{ERM}}(x) = \int_0^{\infty} r\, p(r)\, dr.$

Let $v = \frac{f^* - \mu_f(x)}{\sigma_f(x)}$; we obtain the closed-form computation

$\alpha^{\mathrm{ERM}}(x) = [f^* - \mu_f(x)]\,\Phi(v) + \sigma_f(x)\,\phi(v)$   (7)

where $\phi$ and $\Phi$ are the standard normal pdf and cdf, respectively. To select the next point, we minimize this acquisition function, which is equivalent to minimizing the expected regret,

$x_{t+1} = \arg\min_{x \in \mathcal{X}} \alpha^{\mathrm{ERM}}(x).$   (8)

Our choice in Eq. (8) is the location that minimizes the expected regret. We can see that this acquisition function is always non-negative, $\alpha^{\mathrm{ERM}}(x) \ge 0$. It is minimized at the ideal location, i.e., $\alpha^{\mathrm{ERM}}(x) = 0$, when $\mu_f(x) = f^*$ and $\sigma_f(x) = 0$. This happens at the desired location where the GP predictive value is equal to the true $f^*$ with zero GP uncertainty.
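
The closed form of Eq. (7) translates directly into code; a minimal sketch, with the same assumed inputs as the CBM sketch above:

```python
import numpy as np
from scipy.stats import norm

def erm(mu_f, sigma_f, f_star, eps=1e-9):
    """Expected regret minimization, Eq. (7); the next point minimizes this."""
    sigma_f = np.maximum(sigma_f, eps)
    v = (f_star - mu_f) / sigma_f
    return (f_star - mu_f) * norm.cdf(v) + sigma_f * norm.pdf(v)
```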

Algorithm 1 BO with known optimum output.
Input: #iter $T$, optimum value $f^*$, initial data $D_0$.
For $t = 1, \dots, T$:
  1. Compute $z_i = \sqrt{2\,(f^* - y_i)}$ and construct a transformed Gaussian process surrogate model from $D_{t-1}$ and $f^*$.
  2. Estimate $\mu_f(x)$ and $\sigma_f^2(x)$ from Eqs. (2) and (3).
  3. Select $x_t = \arg\min_x \alpha^{\mathrm{CBM}}(x)$, or $x_t = \arg\min_x \alpha^{\mathrm{ERM}}(x)$, using the above transformed GP model.
  4. Evaluate $y_t = f(x_t)$ and augment $D_t = D_{t-1} \cup \{(x_t, y_t)\}$.

Although our ERM is inspired by EI in the way that we define a regret function and take its expectation, the resulting approach differs as follows. The original EI strategy balances exploration and exploitation, i.e., it prefers both high GP mean and high GP variance. In contrast, ERM does not directly encourage such a trade-off. Instead, ERM selects the point which minimizes the expected regret, with $\mu_f(x)$ close to the known $f^*$ while having low variance, to make sure that the GP estimation at the chosen location is correct. Then, if the chosen location turns out to be unexpected (e.g., a poor function value), the GP is updated and ERM moves to another place which minimizes the new expected regret. Therefore, the behaviors of EI and our ERM are radically different.

Algorithm. We summarize all steps in Algorithm 1. Given the original observations $D_t$ and $f^*$, we compute $z_i = \sqrt{2\,(f^* - y_i)}$, then build a transformed GP over these $z_i$. Using the transformed GP, we can predict the mean and uncertainty at any location from Eqs. (2) and (3), which are used to compute the CBM and ERM acquisition functions in Eq. (5) and Eq. (8). Our formulas are in closed form and the algorithm is easy to implement. In addition, our computational complexity is as cheap as that of GP-UCB and EI.
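
Putting the pieces together, the following is a hedged end-to-end sketch of Algorithm 1 with the ERM acquisition. It fits the GP on $g$ with scikit-learn and replaces the global optimization of the acquisition function with simple random candidate search; the objective `black_box`, the candidate count, and the kernel choice are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def bo_known_optimum(black_box, f_star, X_init, y_init, bounds, n_iter=30):
    """Sketch of Algorithm 1: BO with known optimum output f_star, using ERM."""
    X, y = [np.asarray(x, float) for x in X_init], list(y_init)
    dim = bounds.shape[0]
    for _ in range(n_iter):
        # step 1: transformed targets z_i = sqrt(2 (f* - y_i)) and GP on g
        z = np.sqrt(2.0 * (f_star - np.array(y)))
        gp = GaussianProcessRegressor(kernel=RBF(), normalize_y=True)
        gp.fit(np.vstack(X), z)
        # step 2: predictive moments of f via Eqs. (2)-(3), on random candidates
        cand = np.random.uniform(bounds[:, 0], bounds[:, 1], size=(2000, dim))
        mu_g, sig_g = gp.predict(cand, return_std=True)
        mu_f = f_star - 0.5 * mu_g**2                    # Eq. (2)
        sig_f = np.maximum(np.abs(mu_g) * sig_g, 1e-9)   # sqrt of Eq. (3)
        # step 3: ERM acquisition, Eq. (7), minimized over the candidates
        v = (f_star - mu_f) / sig_f
        erm = (f_star - mu_f) * norm.cdf(v) + sig_f * norm.pdf(v)
        x_next = cand[int(np.argmin(erm))]
        # step 4: evaluate and augment the data
        X.append(x_next)
        y.append(black_box(x_next))
    best = int(np.argmax(y))
    return X[best], y[best]
```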

Figure 4: Optimization comparison using benchmark functions of varying dimension. We demonstrate that the known optimum output $f^*$ significantly boosts performance in high dimensions, such as on the Alpine1 and gSobol functions.

Illustration of CBM and ERM

We illustrate in Fig. 2 our proposed CBM and ERM compared to the standard UCB and EI, with both the vanilla GP and transformed GP settings. Our acquisition functions make use of the knowledge about $f^*$ to make an informed decision about where we should query. That is, CBM and ERM select the location where the GP mean is close to the optimal value $f^*$ and we are highly certain about it, i.e., $\sigma_f(x)$ is low. On the other hand, GP-UCB and EI always keep exploring, following the explore-exploit principle, without using the knowledge of $f^*$. As a result, GP-UCB and EI cannot identify the unknown location $x^*$ as efficiently as our acquisition functions.

4 Experiments

The main goal of our experiments is to show that we can effectively exploit the known optimum output $f^*$ to improve Bayesian optimization performance. We first demonstrate the efficiency of our model on benchmark functions. Then, we perform hyperparameter optimization for an XGBoost classifier on the Skin Segmentation dataset and for a deep reinforcement learning task on the CartPole problem, where the optimum values are publicly available. We provide additional experiments in the supplement.

Settings. All implementations are in Python. The experiments are independently repeated. We run the deep reinforcement learning experiment on an NVIDIA GTX 2080 GPU machine. We use the squared exponential kernel, whose lengthscale is optimized from the GP marginal likelihood; the input is scaled and the output is standardized for robustness. We shall release the source code in the final version.

We follow Theorem 3 in Srinivas_2010Gaussian to specify $\beta_t$. We use one of the two prior mean choices of Sec. 3.1 for the TGP at earlier iterations and switch to the other at later iterations once the $f^*$ value has (nearly) been reached. This condition can be checked in each BO iteration using a global optimization toolbox. Our CBM and ERM use a transformed Gaussian process (Sec. 3.1) in all experiments. We find empirically that using a transformed GP as the surrogate boosts the performance of our CBM and ERM significantly over the case of using the vanilla GP. For the other baselines, we use both surrogates and report the best performance. We present further details of the experiments in the supplement.

Baselines. To the best of our knowledge, there is no existing baseline that directly uses the known optimum output $f^*$ for BO. We compare our model with vanilla BO methods that do not know the optimum value, including GP-UCB (Srinivas_2010Gaussian) and EI (Mockus_1978Application). In addition, we use the two other baselines that use $f^*$, described in Sec. 2.1.

4.1 Comparison on benchmark functions given $f^*$

We perform optimization tasks on common benchmark functions2. For these functions, we assume that the optimum value $f^*$ is available in advance and is given to the algorithm. We use the simple regret for comparison, defined as $r_t = f^* - \max_{i \le t} f(x_i)$ for a maximization problem.

The experimental results are presented in Fig. 4, which shows that our proposed CBM and ERM are among the best approaches over all problems considered. This is because our framework utilizes the additional knowledge of $f^*$ to build an informed surrogate model and informed decision functions. In particular, ERM outperforms all methods by a wide margin. While CBM can be sensitive to the hyperparameter $\beta_t$, ERM has no hyperparameter and is thus more robust.

Notably, our approaches with known $f^*$ perform significantly better than the baselines on the gSobol and Alpine1 functions. The results indicate that the knowledge of $f^*$ is particularly useful for high dimensional functions.

4.2 Tuning machine learning algorithms with $f^*$

A popular application of BO is hyperparameter tuning of machine learning models. Some machine learning tasks come with a known optimal value in advance. We consider tuning (1) a classification task using XGBoost on the Skin Segmentation dataset and (2) a deep reinforcement learning task on the CartPole problem (barto1983neuronlike). Further details of the experiments are described in the supplement.

Known optimum $f^*$ (Accuracy)

Variables            Min    Max    Found
min child weight
colsample bytree
max depth
subsample
alpha
gamma
Table 1: Hyperparameters for XGBoost, with their search ranges (Min, Max) and the found values.

XGBoost classification. We demonstrate a classification task using XGBoost (chen2016xgboost) on the Skin Segmentation dataset3, for which the best known accuracy is reported in Table 1 of Le_2016Nonparametric.

Figure 5: Tuning performance on Skin dataset.

The Skin Segmentation dataset is split into training and testing sets for a binary classification problem. The hyperparameters for XGBoost (chen2016xgboost) are summarized in Table 1. To optimize the integer (ordinal) variables, we round the scalars suggested in the continuous space to the nearest integer values, as sketched below. We present the result in Fig. 5. Our proposed ERM is the best approach, outperforming all the baselines by a wide margin. This demonstrates the benefit of exploiting the optimum value $f^*$ in BO.
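
A small sketch of this rounding treatment for mixed continuous/integer hyperparameters; which dimensions are treated as integers is an illustrative assumption here.

```python
import numpy as np

# indices of the ordinal dimensions (illustrative; e.g. max_depth)
INT_DIMS = [2]

def snap_to_grid(x):
    """Round integer dimensions of a continuous suggestion before evaluation."""
    x = np.array(x, dtype=float)
    x[INT_DIMS] = np.round(x[INT_DIMS])
    return x
```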

Deep reinforcement learning.
Figure 6: Hyperparameter tuning for a deep reinforcement learning algorithm, for which the optimum value $f^*$ is available. Left: points selected by our algorithm when tuning DRL; color indicates the reward value. Right: performance comparison with the baselines.

CartPole is a pendulum with its center of gravity above its pivot point. The goal is to keep the cartpole balanced by controlling the pivot point. The reward in CartPole is averaged over 100 consecutive trials. The maximum reward is known from the literature4 as $f^* = 200$.

We then use a deep reinforcement learning (DRL) algorithm to solve the CartPole problem and use Bayesian optimization to optimize its hyperparameters. In particular, we use the advantage actor critic (A2C) (Sutton_1998Reinforcement), which possesses three sensitive hyperparameters: the discount factor and the learning rates for the actor and critic models. We choose not to optimize the deep learning architecture, for simplicity. We use Bayesian optimization, given the known optimum output $f^* = 200$, to find the best hyperparameters for the A2C algorithm. We present the results in Fig. 6, where our ERM reaches the optimal performance within the iteration budget, outperforming all other baselines. In Fig. 6 (left), we visualize the points selected by our ERM acquisition function. Our ERM initially explores several places and then exploits the high value region (yellow dots).

4.3 What happens if we misspecify the optimum value $f^*$?

We now consider setting $f^*$ to a value which is not the true optimum of the black-box function. We show that misspecifying $f^*$ degrades our model, with different effects. Specifically, we set $f^*$ both larger (over-specify) and smaller (under-specify) than the true value in a maximization problem.

We experiment with our ERM under this misspecified setting of $f^*$ in Fig. 7.

Figure 7: Experiments with ERM in a maximization problem. Over-specifying means the value of $f^*$ is set larger than the true optimal value, and under-specifying means the value of $f^*$ is set smaller than the true one. Top: Hartmann with the true $f^*$. Bottom: gSobol with the true $f^*$. Both kinds of misspecification degrade the performance.

The results suggest that our algorithm using the true value of $f^*$ achieves the best performance. Both over-specifying and under-specifying return worse performance. These misspecified settings can potentially perform worse than the standard EI, which does not use any knowledge about $f^*$. In particular, the under-specified case results in worse performance than the over-specified case. This is because our acquisition function gets stuck at an area once it is wrongly identified as optimal. On the other hand, if we over-specify $f^*$, our algorithm continues exploring to find the optimum because it cannot find a point where both conditions $\mu_f(x) = f^*$ and $\sigma_f(x) = 0$ are met.

Discussion. We make the following observations. If we know the true value $f^*$, ERM returns the best result. If we do not know the exact value, the performance of our approach degrades, and we should instead use existing BO approaches, such as EI, for the best performance.

5 Conclusion and Future Work

In this paper, we have considered a new setting in Bayesian optimization with a known optimum output. We presented a transformed Gaussian process surrogate that models the objective function better by exploiting the knowledge of $f^*$. Then, we proposed two decision strategies which exploit the function optimum value to make informed decisions. Our approaches are intuitively simple and easy to implement. By using the extra knowledge of $f^*$, we demonstrated that our ERM can converge quickly to the optimum on benchmark functions and in real-world applications.

In future work, we can expand our algorithm to handle the batch setting for parallel evaluations. We can also extend this work to other classes of surrogate functions, such as Bayesian neural networks (neal2012bayesian) and deep Gaussian processes (damianou2013deep). Moreover, we can extend the model to handle the case where the supplied optimum is only known to lie within a range of the true output.

References

In the supplementary material, we provide the derivation for the expected regret minimization and additional details about the experiments.

Appendix A Derivation for Expected Regret Minimization

We are given an optimization problem $x^* = \arg\max_{x \in \mathcal{X}} f(x)$, where $f$ is a black-box function that we can evaluate pointwise. Let $D_t = \{x_i, y_i\}_{i=1}^{t}$ be the observation set, including inputs $x_i$ and outcomes $y_i = f(x_i)$, and let $\mathcal{X}$ be the bounded search space. We define the regret function $r(x) = f^* - f(x)$, where $f^*$ is the known global optimum value. The likelihood of the regret under the normal posterior distribution is as follows:

$p(r) = \frac{1}{\sqrt{2\pi}\,\sigma_f(x)} \exp\left( -\frac{[f^* - r - \mu_f(x)]^2}{2\,\sigma_f^2(x)} \right)$   (9)

The expected regret can be written using the likelihood function in Eq. (9); we obtain

$\mathbb{E}[r(x)] = \int_0^{\infty} r\, p(r)\, dr.$

As the ultimate goal in optimization is to minimize the regret, we define our acquisition function to minimize this expected regret, $\alpha^{\mathrm{ERM}}(x) = \mathbb{E}[r(x)]$. Let $u = \frac{r - f^* + \mu_f(x)}{\sigma_f(x)}$; then $r = f^* - \mu_f(x) + \sigma_f(x)\,u$ and $dr = \sigma_f(x)\,du$. We write

$\alpha^{\mathrm{ERM}}(x) = \int_{-\frac{f^* - \mu_f(x)}{\sigma_f(x)}}^{\infty} [f^* - \mu_f(x)]\,\phi(u)\,du + \int_{-\frac{f^* - \mu_f(x)}{\sigma_f(x)}}^{\infty} \sigma_f(x)\,u\,\phi(u)\,du$   (10)

We compute the first term in Eq. (10) as

$\int_{-\frac{f^* - \mu_f(x)}{\sigma_f(x)}}^{\infty} [f^* - \mu_f(x)]\,\phi(u)\,du = [f^* - \mu_f(x)]\,\Phi\!\left( \frac{f^* - \mu_f(x)}{\sigma_f(x)} \right).$

Next, we compute the second term in Eq. (10) as

$\int_{-\frac{f^* - \mu_f(x)}{\sigma_f(x)}}^{\infty} \sigma_f(x)\,u\,\phi(u)\,du = \sigma_f(x)\,\Big[-\phi(u)\Big]_{-\frac{f^* - \mu_f(x)}{\sigma_f(x)}}^{\infty} = \sigma_f(x)\,\phi\!\left( \frac{f^* - \mu_f(x)}{\sigma_f(x)} \right).$

Let $v = \frac{f^* - \mu_f(x)}{\sigma_f(x)}$; we obtain the acquisition function as follows:

$\alpha^{\mathrm{ERM}}(x) = [f^* - \mu_f(x)]\,\Phi(v) + \sigma_f(x)\,\phi(v)$   (11)

where $\phi$ is the standard normal pdf and $\Phi$ is the cdf. To select the next point, we minimize this acquisition function, which is equivalent to minimizing the expected regret:

$x_{t+1} = \arg\min_{x \in \mathcal{X}} \alpha^{\mathrm{ERM}}(x).$

We can see that this acquisition function is minimized when $\mu_f(x) = f^*$ and $\sigma_f(x) = 0$. Our chosen point is the one which offers the smallest expected regret. We aim to find the point with the desired property $f(x) = f^*$.
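
As a quick numerical sanity check of the closed form in Eq. (11), one can compare it against a Monte Carlo estimate of $\int_0^{\infty} r\,p(r)\,dr = \mathbb{E}[\max(r, 0)]$ under the Gaussian posterior; the particular values of $\mu_f$, $\sigma_f$ and $f^*$ below are arbitrary.

```python
import numpy as np
from scipy.stats import norm

mu_f, sigma_f, f_star = 0.3, 0.8, 1.0
r = f_star - np.random.normal(mu_f, sigma_f, size=10**6)   # regret samples
mc_estimate = np.maximum(r, 0.0).mean()                    # E[max(r, 0)]

v = (f_star - mu_f) / sigma_f
closed_form = (f_star - mu_f) * norm.cdf(v) + sigma_f * norm.pdf(v)  # Eq. (11)
print(mc_estimate, closed_form)   # the two values should closely agree
```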

Appendix B Additional Experiments

We first illustrate Bayesian optimisation with and without the knowledge of $f^*$. Then, we provide additional information about the deep reinforcement learning experiment in the main paper. Next, we examine our proposed acquisition functions with both the vanilla GP and the transformed GP. We show that our acquisition functions perform better with the transformed GP than with the vanilla GP. Although the transformed GP is ideal for our acquisition functions, we show that it may not be useful for EI and GP-UCB.

B.1 Illustration per iteration

We illustrate Bayesian optimization with and without the knowledge of $f^*$ for comparison in Figs. 8 and 9. We show the GP and EI on the left (without $f^*$) and the transformed GP and ERM on the right (with $f^*$). As an effect of the transformation using $f^*$, the transformed GP (right) lifts the surrogate model closer to the true value $f^*$ (red horizontal line), encouraging the acquisition function to select these potential locations. On the other hand, without $f^*$, the GP surrogate (left) is less informative. As a result, EI operating on the GP (left) is less efficient than ERM operating on the transformed GP. We demonstrate visually that with the TGP our model can find the optimum input within the evaluation budget, while the standard GP does not.

Figure 8: Illustration of the optimization process per iteration, given the same initialization. Left: BO using the GP as surrogate and EI as acquisition function. Right: BO using the TGP as surrogate and ERM as acquisition function. Given the known optimum value, the transformed GP lifts the surrogate model closer to the known value, and ERM makes informed decisions given $f^*$. We also show that EI may not make decisions as good as ERM's. Continued in the next figure.
Figure 9: Continuation of the previous figure, illustrating the optimization process per iteration given the same initialization. Left: BO using the GP as surrogate and EI as acquisition function. Right: BO using the TGP as surrogate and ERM as acquisition function.

B.2 Details of advantage actor critic on the CartPole problem

Figure 10: Left: visualization of the CartPole. Middle and right: visualization of the reward curve using the best found hyperparameter values. We used the Advantage Actor Critic (A2C) algorithm to solve the CartPole problem. The known optimum value is $f^* = 200$.

We use the advantage actor critic (A2C) (Sutton_1998Reinforcement) as the deep reinforcement learning algorithm to solve the CartPole problem (barto1983neuronlike). The A2C is implemented in Tensorflow (abadi2016tensorflow) and run on an NVIDIA GTX 2080 GPU machine. In A2C, we use two neural network models to learn the actor and the critic separately. In particular, we use a simple multi-layer feed-forward architecture for each network. The ranges of the hyperparameters used in A2C and the found optimal parameters are summarized in Table 2.

We illustrate the reward performance over training episodes using the found optimal parameter values in Fig. 10. In particular, we plot the raw reward and the average reward over 100 consecutive episodes; this average score is used as the evaluation output. Our A2C with the found hyperparameters eventually reaches the optimum value of $f^* = 200$.

Variables                      Min    Max    Best Parameter
discount factor
learning rate (actor model)
learning rate (critic model)
Table 2: Hyperparameters of the Advantage Actor Critic (A2C) algorithm.

B.3 Comparison using vanilla GP and transformed GP

In this section, we empirically compare the proposed transformed Gaussian process (using the knowledge of $f^*$), presented in Sec. 3.1 of the main paper, with the vanilla Gaussian process (Rasmussen_2006gaussian) as the surrogate model for Bayesian optimization. We then test our ERM and EI on the two surrogate models. From this experiment, we learn that the transformed GP is more suitable for our ERM, while it may not be ideal for EI.

ERM. We perform experiments with the ERM acquisition function using two surrogate models: the vanilla Gaussian process (GP) and the transformed Gaussian process (TGP). Our acquisition function performs better with the transformed GP. The TGP exploits the knowledge about the optimum value $f^*$ to construct the surrogate model. Thus, it is more informative and especially helpful for high dimensional functions, such as Alpine1 and gSobol, on which ERM with the TGP achieves much better performance than ERM with the GP. On the simpler functions, such as Branin and Hartmann, the transformed GP surrogate achieves comparable performance with the vanilla GP. We visualize all results in Fig. 11.

Figure 11: Experiments with the ERM acquisition function on the vanilla Gaussian process (GP) and the transformed Gaussian process (TGP). Our acquisition function using the transformed GP consistently performs better than using the vanilla GP. In particular, the TGP is more useful for the high-dimensional Alpine1 and gSobol functions, on which ERM with the TGP outperforms ERM with the GP by a wide margin.

Expected Improvement (EI). We then test the EI acquisition function on the two surrogate models, the vanilla Gaussian process and our transformed Gaussian process (using $f^*$), in Fig. 12. In contrast to the case of ERM above, we show that EI performs well on the vanilla GP, but not on the TGP. This can be explained by a side effect of the GP transformation as follows. From Eq. (2) in the main paper, when a location has a poor (low) prediction value $\mu_f(x)$, it has a large value of $\mu_g(x)$. As a result, this large value of $\mu_g(x)$ makes the uncertainty $\sigma_f^2(x)$ larger via Eq. (3) in the main paper. Therefore, the TGP creates additional uncertainty at locations where $\mu_f(x)$ is low.

Under this additional uncertainty effect of the TGP, expected improvement may spend more iterations exploring these uncertain areas and take longer to converge than when using the vanilla GP. We note that this effect also applies to GP-UCB and other acquisition functions which rely on the exploration-exploitation trade-off.

In the high dimensional gSobol function, the TGP makes EI explore aggressively due to the high uncertainty effect (described above) and thus results in worse performance. That is, EI keeps exploring poor regions in the first iterations (see the bottom row of Fig. 12).

Discussion. The transformed Gaussian process (TGP) surrogate takes into account the knowledge of the optimum value $f^*$ to inform the surrogate. However, this transformation may create additional uncertainty in areas where the function value is low. While our proposed acquisition functions ERM and CBM do not suffer from this effect, the existing acquisition functions EI and UCB do. Therefore, we only recommend using the TGP with our acquisition functions for the best optimization performance.

Figure 12: Experiments with the EI acquisition function using the GP and TGP surrogate models. Although the TGP exploits the knowledge about the optimum value $f^*$ to construct an informed surrogate model, the transformation has the side effect of adding GP predictive uncertainty. As a result, EI explores more aggressively with the TGP and thus obtains worse performance compared with the vanilla GP.

Appendix C Other known optimum value settings

To highlight the applicability of the proposed model, we list several other settings where the optimum values are known in Table 3.

Environment               Known optimum value    Source
Pong                      21                     Gym.OpenAI
Frozen Lake               1                      Gym.OpenAI
Inverted Pendulum v1      1000                   Gym.OpenAI
CartPole                  200                    Gym.OpenAI
Table 3: Examples of known optimum value settings.

Footnotes

  1. using the datasets with 800 to 1000 samples for learning.
  2. https://www.sfu.ca/~ssurjano/optimization.html
  3. https://archive.ics.uci.edu/ml/datasets/skin+segmentation
  4. https://gym.openai.com/envs/CartPole-v0/