A Simple Heuristic for
Bayesian Optimization with A Low Budget
Abstract
The aim of black-box optimization is to optimize an objective function within the constraints of a given evaluation budget. In this problem, it is generally assumed that the computational cost of evaluating a point is large; thus, it is important to search efficiently with as low a budget as possible. Bayesian optimization is an efficient method for black-box optimization and provides an exploration-exploitation trade-off by constructing a surrogate model that considers the uncertainty of the objective function. However, because Bayesian optimization must construct the surrogate model for the entire search space, it does not exhibit good performance when points cannot be sampled sufficiently. In this study, we develop a heuristic method that refines the search space for Bayesian optimization when the available evaluation budget is low. The proposed method refines a promising region by dividing the original region so that Bayesian optimization can be executed with the promising region as the initial search space. We confirm that Bayesian optimization with the proposed method outperforms Bayesian optimization alone and shows performance equal to or better than two search-space division algorithms through experiments on benchmark functions and on the hyperparameter optimization of machine learning algorithms.
Keywords: Bayesian Optimization, Hyperparameter Optimization
1 Introduction
Black-box optimization is the problem of optimizing an objective function within the constraints of a given evaluation budget. In other words, the objective is to obtain a point with the lowest possible evaluation value within the given number of function evaluations. In black-box optimization, no algebraic representation of the objective function is given, and no gradient information is available. Black-box optimization covers problems such as hyperparameter optimization of machine learning algorithms [1, 8, 13, 11], parameter tuning of agent-based simulations [32], and aircraft design [4].
In black-box optimization, it is generally assumed that the computational cost of evaluating a point is large; thus, it is important to search efficiently with as low a budget as possible. For example, it is reported that a hyperparameter optimization experiment for Online LDA takes about 12 days for 50 evaluations [25]. The performance of deep neural networks (DNNs) is known to be very sensitive to hyperparameters, and their hyperparameter optimization has been actively studied in recent years [2, 7, 13, 20, 19, 9]. Because training a DNN for even a single hyperparameter configuration also takes a long time, only a low evaluation budget can be used for hyperparameter optimization.
Bayesian optimization is an efficient method for black-box optimization. Bayesian optimization is executed by repeating the following steps: (1) based on the data observed thus far, it constructs a surrogate model that considers the uncertainty of the objective function; (2) using the surrogate model constructed in step (1), it computes an acquisition function that scores candidate points; (3) by maximizing the acquisition function, it determines the point to be evaluated next; (4) it evaluates that point, updates the surrogate model based on the newly obtained data, and returns to step (2).
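The four-step loop above can be sketched numerically. The following is a minimal, self-contained illustration assuming a one-dimensional toy objective, a Gaussian-process surrogate with a squared exponential kernel, and EI maximized over random candidate points; all names, kernel parameters, and the candidate-search strategy are illustrative choices, not the paper's implementation.

```python
import math
import numpy as np

def objective(x):
    return (x - 0.3) ** 2  # toy black-box function (minimization)

def gp_posterior(X, y, Xq, ell=0.2, noise=1e-6):
    # Gaussian-process posterior mean and std. dev. under an SE kernel
    k = lambda a, b: np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell ** 2))
    K = k(X, X) + noise * np.eye(len(X))
    K_inv = np.linalg.inv(K)
    Ks = k(Xq, X)
    mu = Ks @ K_inv @ y
    var = 1.0 - np.sum((Ks @ K_inv) * Ks, axis=1)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, y_best):
    # closed-form EI for minimization
    z = (y_best - mu) / sigma
    Phi = 0.5 * (1.0 + np.array([math.erf(v / math.sqrt(2)) for v in z]))
    phi = np.exp(-z ** 2 / 2) / math.sqrt(2 * math.pi)
    return (y_best - mu) * Phi + sigma * phi

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, 3)                 # step (1): initial design
y = objective(X)
for _ in range(10):
    cand = rng.uniform(0.0, 1.0, 200)        # candidate points
    mu, sigma = gp_posterior(X, y, cand)     # steps (1)-(2): fit surrogate, score
    x_next = cand[np.argmax(expected_improvement(mu, sigma, y.min()))]  # step (3)
    X = np.append(X, x_next)                 # step (4): evaluate and update
    y = np.append(y, objective(x_next))
```

After the loop, the best observed point lies close to the true minimizer at 0.3, even with only 13 evaluations.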
However, Bayesian optimization, which constructs a surrogate model for the entire search space, can perform poorly in the low budget setting, because the optimization method cannot sample points sufficiently. In the low budget setting, we believe that the search should be performed locally; in Bayesian optimization, however, points are sampled globally across the search space in order to estimate the uncertainty of the surrogate model. This lack of local search degrades the performance of Bayesian optimization. Moreover, when there is no prior knowledge of the problem, the search space tends to be defined widely, which makes the search even more global and further degrades performance.
In this study, we develop a heuristic method that refines the search space for Bayesian optimization when the evaluation budget is low. The proposed method performs division to reduce the volume of the search space, which makes it possible to perform Bayesian optimization within a local search space determined to be promising. We confirm that Bayesian optimization with the proposed method outperforms Bayesian optimization alone (that is, Bayesian optimization without the proposed method) through experiments on six benchmark functions and on the hyperparameter optimization of three machine learning algorithms (multilayer perceptron (MLP), convolutional neural network (CNN), and LightGBM). We also experiment with Simultaneous Optimistic Optimization (SOO) [21] and BaMSOO [30], which are search-space division algorithms, in order to confirm the validity of the refinement of the search space by the proposed method.
2 Background
2.1 Bayesian Optimization
Algorithm 1 shows the algorithm of Bayesian optimization, which samples and evaluates the initial points (line 1), constructs the surrogate model (line 3), finds the next point to evaluate by optimizing the acquisition function (line 4), evaluates the point selected and receives the evaluation value (line 5), and updates the data (line 6).
The main components of Bayesian optimization are the surrogate model and the acquisition function. In this section, we describe Bayesian optimization using a Gaussian process as the surrogate model and the expected improvement (EI) as the acquisition function.
2.1.1 Gaussian Process
A Gaussian process [24] is a probability distribution over the function space characterized by a mean function $m(x)$ and a covariance function $k(x, x')$. We assume that a data set $X_t = \{x_1, \ldots, x_t\}$ and observations $\mathbf{y}_t = (y_1, \ldots, y_t)^\top$ are obtained. The mean and variance of the predictive distribution of a Gaussian process at a point $x$ can be calculated using the kernel function $k$ as follows:

$$\mu(x) = \mathbf{k}(x)^\top K^{-1} \mathbf{y}_t \quad (1)$$
$$\sigma^2(x) = k(x, x) - \mathbf{k}(x)^\top K^{-1} \mathbf{k}(x) \quad (2)$$

Here,

$$\mathbf{k}(x) = \bigl(k(x, x_1), \ldots, k(x, x_t)\bigr)^\top \quad (3)$$
$$K = \bigl(k(x_i, x_j)\bigr)_{i,j=1}^{t} \quad (4)$$

The squared exponential kernel (Equation (5)) is one of the common kernel functions:

$$k(x, x') = \theta_1 \exp\left(-\frac{r^2}{2\theta_2^2}\right) \quad (5)$$
$$r = \|x - x'\| \quad (6)$$

Here, $\theta_1$ is a parameter that adjusts the scale of the whole kernel function, and $\theta_2$ is a parameter controlling the sensitivity to the difference between the two inputs $x$ and $x'$.
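As a numerical sketch of Equations (1)-(4) under the squared exponential kernel (5), the following computes the posterior mean and variance of a (noise-free) Gaussian process with NumPy; the kernel parameter values and function names are illustrative.

```python
import numpy as np

def se_kernel(a, b, theta1=1.0, theta2=0.3):
    r = np.abs(a[:, None] - b[None, :])                  # Eq. (6): r = |x - x'|
    return theta1 * np.exp(-r ** 2 / (2 * theta2 ** 2))  # Eq. (5)

def gp_predict(X, y, Xq, jitter=1e-10):
    K = se_kernel(X, X) + jitter * np.eye(len(X))        # Gram matrix, Eq. (4)
    K_inv = np.linalg.inv(K)
    kq = se_kernel(Xq, X)                                # rows are k(x), Eq. (3)
    mu = kq @ K_inv @ y                                  # Eq. (1)
    var = np.diag(se_kernel(Xq, Xq)) - np.sum((kq @ K_inv) * kq, axis=1)  # Eq. (2)
    return mu, var

X = np.array([0.0, 0.5, 1.0])
y = np.sin(X)
mu, var = gp_predict(X, y, np.array([0.5]))
# at a training input, the posterior mean recovers the observation
# and the posterior variance collapses toward zero
```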
2.1.2 Expected Improvement
The EI [16] is a typical acquisition function in Bayesian optimization; it represents the expected value of the amount of improvement over the best evaluation value at a candidate point. Let the best evaluation value so far be $y_{\mathrm{best}}$. The EI for a point $x$ is calculated as follows:

$$\mathrm{EI}(x) = \mathbb{E}\left[\max\bigl(y_{\mathrm{best}} - f(x),\, 0\bigr)\right] \quad (7)$$

When we assume that the objective function follows a Gaussian process, Equation (7) can be calculated analytically as follows:

$$\mathrm{EI}(x) = \bigl(y_{\mathrm{best}} - \mu(x)\bigr)\,\Phi(Z) + \sigma(x)\,\phi(Z), \qquad Z = \frac{y_{\mathrm{best}} - \mu(x)}{\sigma(x)} \quad (8)$$

Here, $\Phi$ and $\phi$ are the cumulative distribution function and probability density function of the standard normal distribution, respectively.
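The closed-form EI for minimization can be computed directly from the posterior mean and standard deviation; a small sketch using only the standard library (the normal CDF via `math.erf`):

```python
from math import erf, exp, pi, sqrt

def expected_improvement(mu, sigma, y_best):
    # Equation (8): EI = (y_best - mu) * Phi(Z) + sigma * phi(Z)
    z = (y_best - mu) / sigma
    Phi = 0.5 * (1.0 + erf(z / sqrt(2.0)))    # standard normal CDF
    phi = exp(-z * z / 2.0) / sqrt(2.0 * pi)  # standard normal PDF
    return (y_best - mu) * Phi + sigma * phi

print(expected_improvement(mu=0.0, sigma=1.0, y_best=0.0))  # = phi(0) ≈ 0.3989
```

EI is always non-negative and grows as the posterior mean drops below the best observed value, which is what drives the exploitation side of the trade-off.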
2.2 Related Work
2.2.1 Bayesian Optimization
In Bayesian optimization, the design of surrogate models and acquisition functions is actively studied. The Tree-structured Parzen Estimator (TPE) algorithm [1, 3], Sequential Model-based Algorithm Configuration (SMAC) [12], and Spearmint [25] are known as powerful Bayesian optimization methods. The TPE algorithm, SMAC, and Spearmint use a tree-structured Parzen estimator, a random forest, and a Gaussian process as the surrogate model, respectively. Popular acquisition functions in Bayesian optimization include the EI [16], the probability of improvement [14], the upper confidence bound (UCB) [26], mutual information (MI) [5], and the knowledge gradient (KG) [10].
However, there are few studies focusing on search spaces in Bayesian optimization. A prominent issue in Bayesian optimization is the boundary problem [28], in which sampled points concentrate near the boundary of the search space. Oh et al. addressed this boundary problem by transforming the ball geometry of the search space using a cylindrical transformation [23]. Wistuba et al. proposed using previous experimental results to prune the regions of the hyperparameter search space where there seems to be no good point [31]. In contrast to Wistuba's study, we propose a method to refine the search space without prior knowledge. Nguyen et al. dynamically expanded the search space to cope with cases where the search space specified in advance does not contain a good point [22]. In contrast to Nguyen's study, we focus on refining the search space rather than expanding it.
2.2.2 Search-Space Division Algorithms
The proposed method is similar to methods such as Simultaneous Optimistic Optimization (SOO) [21] and BaMSOO [30] in that it focuses on the division of the search space. SOO is an algorithm that generalizes the DIRECT algorithm [15], a Lipschitz optimization method; the search space is expressed as a tree structure, and the search is performed using hierarchical division. BaMSOO is a method that makes the auxiliary optimization of acquisition functions unnecessary by combining SOO with a Gaussian process. Wang et al. reported that BaMSOO shows better performance than SOO in experiments on some benchmark functions [30]. The motivation for division differs between the proposed method and the search-space division algorithms SOO and BaMSOO: the proposed method divides the search space to identify a promising initial region for Bayesian optimization, while the search-space division algorithms divide the search space to identify a good solution directly.
3 Proposed Method
In Bayesian optimization, there are many tasks with a low available evaluation budget. For example, in the hyperparameter optimization of machine learning algorithms, the budget is often limited in terms of computing resources and time. In this study, we focus on Bayesian optimization when there is not enough evaluation budget available.
Nguyen et al. state that Bayesian optimization using a Gaussian process as the surrogate model and UCB as the acquisition function has the following relationships between the volume of the search space and the cumulative regret (the sum of the differences between the optimum value and the evaluation value at each step) [22]: (i) a larger space has a larger (worse) regret bound; (ii) a low evaluation budget makes the difference in the regrets more significant. Nguyen et al. give this description for the cumulative regret [22], but converting it to simple regret is straightforward, as in [17]. We therefore believe that in the low budget setting, making the search space smaller is also important in terms of the regret for Bayesian optimization in general.
In this study, we try to improve the performance of Bayesian optimization in the low budget setting by introducing a heuristic method that refines a given search space. We assume that the search space is an arbitrary hypercube in $D$ dimensions ($D$: the number of dimensions). Our method refines the search space by division and outputs a region considered to be promising. As a result, Bayesian optimization can be executed with the refined search space as the initial search space instead of the original one.
3.1 Integrating with Bayesian Optimization
Algorithm 2 shows Bayesian optimization with the proposed method. This method calculates the budget for refining the search space from the whole budget (line 1), refines the search space to a promising region (line 2), and performs optimization with the search space refined in line 2 as the initial search space (line 3). We describe the refinement procedure (line 2) in Section 3.2.2.
3.2 Refining the Search Space
3.2.1 Calculation of the Budget
Corresponding to the whole budget $T$, we set the budget used for the proposed method to $T_{\mathrm{ref}}$ (Algorithm 2, line 1), calculated from $T$ with respect to the number of dimensions $D$. If the evaluation budget increases to infinity, there is no need to refine the search space. We note that $T_{\mathrm{ref}}$ is the maximum budget for the proposed method and is not necessarily used in full; $T_{\mathrm{ref}}$ is used for determining the division number. We give the details of how $T_{\mathrm{ref}}$ is used in Section 3.2.3.
3.2.2 Algorithm
The proposed method refines the promising region by dividing the region at equal intervals along each dimension. Figure 2 illustrates how the proposed method refines the search space for an example setting of the number of dimensions and the division number. The proposed method randomly selects a dimension without replacement, divides the region along that dimension into equal pieces, evaluates the center point of each piece, and keeps only the piece whose center point has the best evaluation value. The proposed method repeats this operation until the regions corresponding to all dimensions have been divided.
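The procedure above can be sketched as follows, under stated assumptions: the search space is a box given by per-dimension (low, high) bounds, dimensions are processed in random order, each chosen dimension is split into K equal intervals, the center point of each candidate sub-region is evaluated, and only the best sub-region is kept. Names are illustrative, and the odd-K center-reuse optimization described in Section 3.2.3 is omitted for simplicity, so each dimension here costs K evaluations.

```python
import numpy as np

def refine(bounds, objective, K=3, seed=0):
    rng = np.random.default_rng(seed)
    bounds = [list(b) for b in bounds]
    center = lambda: np.array([(lo + hi) / 2.0 for lo, hi in bounds])
    for d in rng.permutation(len(bounds)):   # dimensions in random order
        lo, hi = bounds[d]
        width = (hi - lo) / K
        best_val, best_i = float("inf"), 0
        for i in range(K):                   # evaluate the center of each piece
            x = center()
            x[d] = lo + (i + 0.5) * width
            v = objective(x)
            if v < best_val:
                best_val, best_i = v, i
        # keep only the piece whose center point evaluated best
        bounds[d] = [lo + best_i * width, lo + (best_i + 1) * width]
    return bounds

# shrink [0, 1]^2 toward the minimizer of a quadratic at (0.2, 0.7)
sphere = lambda x: float(np.sum((x - np.array([0.2, 0.7])) ** 2))
refined = refine([(0.0, 1.0), (0.0, 1.0)], sphere, K=3)
```

With K = 3 in two dimensions, the refined box has side length 1/3 and still contains the minimizer, so Bayesian optimization then starts from a region one ninth the original volume.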
3.2.3 Division Number
We need to set the division number $K$ to adjust how much the search space is refined. Let $D$ be the number of dimensions and $T_{\mathrm{ref}}$ the budget allocated to refinement. If we set $K$ to an even number, the evaluation budget for refining is $KD$, because the center point of every divided region must be newly evaluated. However, when $K$ is an odd number, the evaluation budget for refining is $(K-1)D + 1$, because the center point of the previously refined region can be reused as the center of the middle piece. Therefore, we set the division number to the odd number whose refinement budget approaches $T_{\mathrm{ref}}$ most closely, according to Equation (9):

$$K = \operatorname*{arg\,min}_{K' \in \{3, 5, 7, \ldots\}} \bigl| (K'-1)D + 1 - T_{\mathrm{ref}} \bigr| \quad (9)$$
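A sketch of this selection rule, assuming (as a reconstruction of the reuse argument above) that refining with an odd division number K in D dimensions costs (K - 1) * D + 1 evaluations; the helper name and the search cap of 99 are illustrative.

```python
def division_number(budget_ref, D):
    """Pick the odd K whose refinement cost is closest to budget_ref."""
    best_K, best_gap = 3, float("inf")
    for K in range(3, 101, 2):               # odd candidates 3, 5, 7, ...
        gap = abs((K - 1) * D + 1 - budget_ref)
        if gap < best_gap:
            best_K, best_gap = K, gap
    return best_K
```

For example, with D = 2 and a refinement budget of 13 evaluations, K = 7 matches exactly, since (7 - 1) * 2 + 1 = 13.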
4 Experiments
In this section, we assess the performance of the proposed method on benchmark functions and on the hyperparameter optimization of machine learning algorithms to confirm its effectiveness in the low budget setting.
4.1 Baseline Methods
We use GPEI (Bayesian optimization using a Gaussian process as the surrogate model and the EI as the acquisition function), TPE [1], and SMAC [12] as the baseline Bayesian optimization methods. In this experiment, we refer to GPEI with the proposed method as Ref+GPEI; likewise, we refer to TPE and SMAC with the proposed method as Ref+TPE and Ref+SMAC, respectively. We use the GPyOpt (https://github.com/SheffieldML/GPyOpt), Hyperopt (https://github.com/hyperopt/hyperopt), and SMAC3 (https://github.com/automl/SMAC3) libraries to obtain the results for GPEI, TPE, and SMAC, respectively. We set the parameters of GPEI, TPE, and SMAC to the default values of each library and use the center point of the search space as the initial starting point for SMAC. We also experiment with SOO [21] and BaMSOO [30], which are search-space division algorithms, in order to confirm the validity of the refinement of the search space by the proposed method.
4.2 Benchmark Functions
In the first experiment, we assess the performance of the proposed method on the benchmark functions that are often used in blackbox optimization. Table 1 shows the six benchmark functions used in this experiment.
Table 1: The six benchmark functions (Sphere, tablet, Rosenbrock-Chain, Branin, Shekel, Hartmann), with their definitions, dimensionalities, and search spaces, taken from [27].
4.2.1 Experimental Setting
We run 50 trials for each experiment and fix the evaluation budget in each trial. We assess the performance of each method using the mean and standard error of the best evaluation values over the 50 trials.
For SOO and BaMSOO, we set the division number to the same value as in [29]. For BaMSOO, we use the Matérn 5/2 kernel, which is one of the common kernel functions:

$$k(x, x') = \theta \left(1 + \sqrt{5}\,r + \frac{5}{3} r^2\right) \exp\left(-\sqrt{5}\,r\right), \quad \text{where } r = \frac{\|x - x'\|}{\ell}$$

We initialize the hyperparameters $\theta$ and $\ell$ and update them by maximizing the data likelihood after each iteration.
4.2.2 Results
Figure 3 and Table 2 show the mean and standard error of the best evaluation values over 50 trials on the six benchmark functions. Ref+GPEI and BaMSOO show competitive performance on the Rosenbrock-Chain and Branin functions, but Ref+GPEI shows better performance than all the other methods on the remaining benchmark functions. Furthermore, Ref+GPEI, Ref+TPE, and Ref+SMAC outperform GPEI, TPE, and SMAC, respectively, on all the benchmark functions.
Figure 4 shows the typical behavior of each method on the Hartmann function. Ref+GPEI, Ref+TPE, and Ref+SMAC sample many points with good evaluation values after refining the search space, whereas the other methods fail to sample enough points with good evaluation values even at the end of the search.
Table 2: Mean and standard error of the best evaluation values over 50 trials for GPEI, Ref+GPEI, TPE, Ref+TPE, SMAC, Ref+SMAC, SOO, and BaMSOO on the six benchmark functions (Sphere, tablet, Rosenbrock-Chain, Branin, Shekel, Hartmann).
4.3 Hyperparameter Optimization
In the second experiment, we assess the performance of the proposed method on the hyperparameter optimization of machine learning algorithms. We experiment with the following three machine learning algorithms, which are typically run in the low budget setting:

- MLP
- CNN
- LightGBM [18]
Table 4 shows the four hyperparameters of the MLP and their respective search spaces. The MLP consists of two fully-connected layers with SoftMax at the end. We set the maximum number of epochs during training to 20 and the mini-batch size to 128. We use the MNIST dataset, which consists of 28×28-pixel greyscale images of digits, each belonging to one of ten classes; it contains 60,000 training images and 10,000 testing images. In this experiment, we split the training images into a training dataset and a validation dataset.
The CNN consists of two convolutional layers with batch normalization and SoftMax at the end. Each convolutional layer is followed by a max-pooling layer. The two convolutional layers are followed by two fully-connected layers with ReLU activation. We use the same hyperparameters and search spaces as in the MLP problem above (Table 4). We set the maximum number of epochs during training to 10 and the mini-batch size to 128. We use the MNIST dataset and split it as in the MLP problem.
Table 4: Hyperparameters and their search spaces.
MLP: learning rate of SGD, momentum of SGD, number of hidden nodes, dropout rate.
LightGBM: learning rate, colsample bytree, reg lambda, max depth.
Table 4 shows the four hyperparameters of LightGBM and their respective search spaces. We use the Breast Cancer Wisconsin dataset [6]. In the experiment using this dataset, we use part of the data instances as the training dataset, and the evaluation value is calculated using cross-validation.
4.3.1 Experimental Setting
We run 50 trials for each experiment and fix the evaluation budget in each trial. For all experiments, we use the misclassification rate on the validation dataset as the evaluation value. For all the problems, we treat integer-valued hyperparameters as continuous variables and round them to integer values when evaluating.
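The rounding treatment described above can be sketched as follows: the optimizer proposes continuous values, and integer-valued hyperparameters are rounded only at evaluation time. The parameter names here are illustrative.

```python
def round_integers(params, integer_keys):
    """Round only the integer-valued hyperparameters, at evaluation time."""
    return {k: (int(round(v)) if k in integer_keys else v)
            for k, v in params.items()}

proposed = {"learning_rate": 0.01, "num_hidden_nodes": 127.6}
evaluated = round_integers(proposed, {"num_hidden_nodes"})
```

This keeps the search space continuous for the surrogate model while the model under evaluation always receives valid integer settings.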
4.3.2 Results
Figure 5 and Table 5 show the mean and standard error of the best evaluation values over 50 trials on the hyperparameter optimization of the three machine learning algorithms. As in the benchmark-function experiment, Ref+GPEI, Ref+TPE, and Ref+SMAC outperform GPEI, TPE, and SMAC, respectively, on all the hyperparameter optimization problems. Likewise, Ref+GPEI, Ref+TPE, and Ref+SMAC show performance equal to or better than SOO and BaMSOO on all problems.
Table 5: Mean and standard error of the best evaluation values over 50 trials for GPEI, Ref+GPEI, TPE, Ref+TPE, SMAC, Ref+SMAC, SOO, and BaMSOO on the MLP, CNN, and LightGBM problems.
5 Conclusion
In this study, we developed a simple heuristic method for Bayesian optimization in the low budget setting. The proposed method refines a promising region by dividing the region at equal intervals along each dimension. By refining the search space, Bayesian optimization can be executed with the promising region as its initial search space.
We experimented with the six benchmark functions and the hyperparameter optimization of the three machine learning algorithms (MLP, CNN, LightGBM). We confirmed that Bayesian optimization with the proposed method outperforms Bayesian optimization alone on all the problems, including both the benchmark functions and the hyperparameter optimization. Likewise, Bayesian optimization with the proposed method shows performance equal to or better than the two search-space division algorithms.
In future work, we plan to adapt the proposed method to noisy environments. Real-world problems such as hyperparameter optimization are often noisy; thus, making the optimization method robust is important. Furthermore, because we do not currently consider dependencies between variables, we plan to refine the search space while taking variable dependencies into consideration.
References
 [1] (2011) Algorithms for hyperparameter optimization. In NIPS, Cited by: §1, §2.2.1, §4.1.
 [2] (2012) Random search for hyperparameter optimization. Journal of Machine Learning Research 13, pp. 281–305. Cited by: §1.
 [3] (2013) Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures. In ICML, Cited by: §2.2.1.
 [4] (2005) High-fidelity multidisciplinary design optimization of wing shape for regional jet aircraft. In Evolutionary Multi-Criterion Optimization, pp. 621–635. Cited by: §1.
 [5] (2014) Gaussian process optimization with mutual information. In ICML, Cited by: §2.2.1.
 [6] (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. External Links: Link Cited by: §4.3.
 [7] (2015) Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In IJCAI, Cited by: §1.
 [8] (2013) Towards an empirical foundation for assessing bayesian optimization of hyperparameters. In NeurIPS workshop on Bayesian Optimization in Theory and Practice, Cited by: §1.
 [9] (2018) BOHB: robust and efficient hyperparameter optimization at scale. In ICML, Cited by: §1.
 [10] (2009) The knowledge-gradient policy for correlated normal beliefs. INFORMS Journal on Computing 21, pp. 599–613. Cited by: §2.2.1.
 [11] (2017) Google vizier: a service for blackbox optimization. In KDD, Cited by: §1.
 [12] (2011) Sequential model-based optimization for general algorithm configuration. In LION, Cited by: §2.2.1, §4.1.
 [13] (2017) Efficient hyperparameter optimization for deep learning algorithms using deterministic rbf surrogates.. In AAAI, Cited by: §1, §1.
 [14] (1964) A new method of locating the maximum point of an arbitrary multi-peak curve in the presence of noise. Journal of Basic Engineering 86. Cited by: §2.2.1.
 [15] (1993) Lipschitzian optimization without the lipschitz constant. Journal of Optimization Theory and Applications 79 (1), pp. 157–181. Cited by: §2.2.2.
 [16] (1998) Efficient global optimization of expensive black-box functions. Journal of Global Optimization 13, pp. 455–492. Cited by: §2.1.2, §2.2.1.
 [17] (2016) Gaussian process bandit optimisation with multi-fidelity evaluations. In ICML, Cited by: §3.
 [18] (2017) Lightgbm: a highly efficient gradient boosting decision tree. In NIPS, Cited by: 3rd item.
 [19] (2017) Fast bayesian optimization of machine learning hyperparameters on large datasets. In AISTATS, Cited by: §1.
 [20] (2017) Hyperband: a novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research 18, pp. 185:1–185:52. Cited by: §1.
 [21] (2011) Optimistic optimization of a deterministic function without the knowledge of its smoothness. In NIPS, Cited by: §1, §2.2.2, §4.1.
 [22] (2017) Bayesian optimization in weakly specified search space. In ICDM, Cited by: §2.2.1, §3.
 [23] (2018) BOCK : Bayesian optimization with cylindrical kernels. In ICML, Cited by: §2.2.1.
 [24] (2005) Gaussian processes for machine learning. The MIT Press. Cited by: §2.1.1.
 [25] (2012) Practical bayesian optimization of machine learning algorithms. In NIPS, Cited by: §1, §2.2.1.
 [26] (2010) Gaussian process optimization in the bandit setting: no regret and experimental design. In ICML, Cited by: §2.2.1.
 [27] (2019) Virtual library of simulation experiments: test functions and datasets. Note: Retrieved April 15, 2019, from http://www.sfu.ca/~ssurjano Cited by: Table 1.
 [28] (2017) Improving bayesian optimization for machine learning using expert priors. PhD thesis. Cited by: §2.2.1.
 [29] (2013) Stochastic simultaneous optimistic optimization. In ICML, Cited by: §4.2.1.
 [30] (2014) Bayesian multi-scale optimistic optimization. In AISTATS, Cited by: §1, §2.2.2, §4.1.
 [31] (2015) Hyperparameter search space pruning – a new component for sequential model-based hyperparameter optimization. In ECML, Cited by: §2.2.1.
 [32] (2009) Agent-based simulation on women's role in a family line on civil service examination in chinese history. J. Artificial Societies and Social Simulation 12. Cited by: §1.