Information-theoretic Bayesian optimisation techniques have demonstrated state-of-the-art performance in tackling important global optimisation problems. However, current information-theoretic approaches: require many approximations in implementation; limit the choice of kernels available to model the objective; and introduce often-prohibitive computational overhead. We develop a fast information-theoretic Bayesian Optimisation method, FITBO, that circumvents the need for sampling the global minimiser, thus significantly reducing computational overhead. Moreover, in comparison with existing approaches, our method faces fewer constraints on kernel choice and enjoys the merits of dealing with the output space. We demonstrate empirically that FITBO inherits the performance associated with information-theoretic Bayesian optimisation, while being even faster than simpler Bayesian optimisation approaches, such as Expected Improvement.
Fast Information-theoretic Bayesian Optimisation
\aistatsauthor{Binxin Ru & Michael Osborne & Mark McLeod}
\aistatsaddress{Department of Engineering Science, University of Oxford & Department of Engineering Science, University of Oxford & Department of Engineering Science, University of Oxford}
Optimisation problems arise in numerous fields, ranging from science and engineering to economics and management (Brochu et al., 2010). In classical optimisation tasks, the objective function is usually known and cheap to evaluate (Hennig and Schuler, 2012). However, in many situations we face tasks for which these assumptions do not apply. For example, in clinical trials, financial investments or the construction of a sensor network, it is very costly to draw a sample from the latent function underlying the real-world process (Brochu et al., 2010). The objective functions in such problems are generally non-convex, and their closed-form expressions and derivatives are unknown (Shahriari et al., 2016). Bayesian optimisation is a powerful tool for tackling such optimisation challenges (Brochu et al., 2010).
A core step in Bayesian optimisation is to define an acquisition function which uses the available observations effectively to recommend the next query location (Shahriari et al., 2016). There are many types of acquisition functions, such as Probability of Improvement (PI) (Kushner, 1964), Expected Improvement (EI) (Jones et al., 1998) and Gaussian Process Upper Confidence Bound (UCB) (Srinivas et al., 2009). The most recent type is based on information theory and offers a new perspective for efficiently selecting the sequence of sampling locations, based on the entropy of the distribution over the unknown minimiser (Shahriari et al., 2016). Information-theoretic approaches guide our evaluations to locations where we can maximise our learning about the unknown minimum, rather than to locations where we expect to obtain lower function values (Hennig and Schuler, 2012). Such methods have demonstrated impressive empirical performance and tend to outperform traditional methods in tasks with highly multimodal and noisy latent functions.
One popular information-based acquisition function is Predictive Entropy Search (PES) (Villemonteix et al., 2009; Hennig and Schuler, 2012; Hernández-Lobato et al., 2014). However, it is very slow to evaluate in comparison with traditional methods like EI, PI and GP-UCB, and it faces serious constraints in its application. For example, the implementation of PES requires the first and second partial derivatives as well as the spectral density of the Gaussian process kernel function, which limits our kernel choices. Moreover, PES works in the input space and is thus less efficient on higher dimensional problems (Wang and Jegelka, 2017). More recent methods such as Output-space Predictive Entropy Search (OPES) (Hoffman and Ghahramani, 2015) and Max-value Entropy Search (MES) (Wang and Jegelka, 2017) improve on PES by focusing on the information content in the output space instead of the input space. However, current entropy search methods, whether they deal with the minimiser or the minimum value, all involve two separate sampling processes: 1) sampling hyperparameters for marginalisation, and 2) sampling the global minimum for entropy computation. The second sampling process not only contributes significantly to the computational burden of these information-based acquisition functions but also requires the construction of a good approximation of the objective function (Hernández-Lobato et al., 2014), which introduces some kernel constraints.
In view of the limitations of the existing methods, we propose a fast information-theoretic Bayesian optimisation technique (FITBO). Inspired by the Bayesian integration work of Gunter et al. (2014), the central contribution of our technique is to approximate any black-box function in a parabolic form: $f(\mathbf{x}) = \eta + \frac{1}{2} g(\mathbf{x})^2$. The global minimum is explicitly represented by a hyperparameter $\eta$, which can be sampled together with the other hyperparameters. As a result, our approach has three major advantages. First, it circumvents the need to sample the global minimiser or minimum through the use of the parabolic approximation, saving much sampling effort and speeding up the evaluation of the acquisition function tremendously. Second, it faces fewer constraints on the choice of appropriate kernel functions for the Gaussian process prior. Third, similar to MES (Wang and Jegelka, 2017), it works on information in the output space and is thus more efficient on high dimensional problems.
2 Fast Information-Theoretic Bayesian Optimisation
Information-theoretic techniques aim to reduce the uncertainty about the location of the unknown global minimiser $\mathbf{x}_\star$ by selecting the query point that leads to the largest reduction in the entropy of the distribution $p(\mathbf{x}_\star \mid \mathcal{D}_n)$ (Hennig and Schuler, 2012). The acquisition function for such techniques has the form (Hennig and Schuler, 2012; Hernández-Lobato et al., 2014):
\[
\alpha_n(\mathbf{x}) = H\big[p(\mathbf{x}_\star \mid \mathcal{D}_n)\big] - \mathbb{E}_{p(y \mid \mathcal{D}_n, \mathbf{x})}\Big[H\big[p(\mathbf{x}_\star \mid \mathcal{D}_n \cup \{(\mathbf{x}, y)\})\big]\Big].
\]
PES makes use of the symmetry of mutual information and arrives at the following equivalent acquisition function:
\[
\alpha_n(\mathbf{x}) = H\big[p(y \mid \mathcal{D}_n, \mathbf{x})\big] - \mathbb{E}_{p(\mathbf{x}_\star \mid \mathcal{D}_n)}\Big[H\big[p(y \mid \mathcal{D}_n, \mathbf{x}, \mathbf{x}_\star)\big]\Big],
\]
where $p(y \mid \mathcal{D}_n, \mathbf{x}, \mathbf{x}_\star)$ is the predictive posterior distribution for $y$ conditioned on the observed data $\mathcal{D}_n$, the test location $\mathbf{x}$ and the global minimiser $\mathbf{x}_\star$ of the objective function.
FITBO harnesses the same information-theoretic thinking but measures the entropy of the latent global minimum $\eta = f(\mathbf{x}_\star)$ instead of that of the global minimiser $\mathbf{x}_\star$. The acquisition function of the FITBO method is thus the mutual information between the function minimum $\eta$ and the next query point (Wang and Jegelka, 2017). In other words, FITBO selects the next query point that minimises the entropy of the global minimum:
\[
\alpha_n(\mathbf{x}) = H\big[p(y \mid \mathcal{D}_n, \mathbf{x})\big] - \mathbb{E}_{p(\eta \mid \mathcal{D}_n)}\Big[H\big[p(y \mid \mathcal{D}_n, \mathbf{x}, \eta)\big]\Big].
\]
This idea of changing entropy computation from the input space to the output space is also shared by Hoffman and Ghahramani (2015) and Wang and Jegelka (2017). Hence, the acquisition function of the FITBO method is very similar to those of OPES (Hoffman and Ghahramani, 2015) and MES (Wang and Jegelka, 2017).
However, our novel contribution is to express the unknown objective function in the parabolic form $f(\mathbf{x}) = \eta + \frac{1}{2} g(\mathbf{x})^2$, thus representing the global minimum by the hyperparameter $\eta$. The FITBO acquisition function can then be reformulated as:
\[
\alpha_n(\mathbf{x}) = H\bigg[\int p(y \mid \mathcal{D}_n, \mathbf{x}, \eta)\, p(\eta \mid \mathcal{D}_n)\, \mathrm{d}\eta\bigg] - \int H\big[p(y \mid \mathcal{D}_n, \mathbf{x}, \eta)\big]\, p(\eta \mid \mathcal{D}_n)\, \mathrm{d}\eta.
\]
The intractable integral terms can be approximated by drawing samples of $\eta$ from the posterior distribution $p(\eta \mid \mathcal{D}_n)$ and using a Monte Carlo method (Hernández-Lobato et al., 2014). The predictive posterior distribution $p(y \mid \mathcal{D}_n, \mathbf{x}, \eta)$ can be turned into a neat Gaussian form by applying a local linearisation technique to our parabolic approximation, as described in Section 2.1. Thus, the first term in the above FITBO acquisition function is the entropy of a Gaussian mixture, which is intractable and demands approximation, as described in Section 2.3. The second term is the expected entropy of a one-dimensional Gaussian distribution and can be computed analytically, because the entropy of a Gaussian has the closed form $\frac{1}{2}\ln\big(2\pi e \sigma^2\big)$, where the variance $\sigma^2$ is the sum of the predictive variance of $f(\mathbf{x})$ given $\eta$ and the variance $\sigma_n^2$ of the observation noise.
2.1 Parabolic Approximation and Predictive Posterior Distribution
Gunter et al. (2014) use a square-root transformation on the integrand in their warped sequential active Bayesian integration method to ensure non-negativity. Inspired by this work, we express any unknown objective function in the parabolic form:
\[
f(\mathbf{x}) = \eta + \frac{1}{2}\, g(\mathbf{x})^2,
\]
where $\eta$ is the global minimum of the objective function. Given the noise-free observation data $\{(\mathbf{x}_i, f_i)\}_{i=1}^{n}$, the corresponding observation data on $g$ is $\{(\mathbf{x}_i, g_i)\}_{i=1}^{n}$, where $g_i = \sqrt{2\,(f_i - \eta)}$.
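As a concrete illustration of this transformation, the mapping from noise-free observations of $f$ to observations of $g$ can be sketched in a few lines of Python (the function name `to_g_space` is ours, purely for illustration):

```python
import numpy as np

def to_g_space(f_obs, eta):
    """Map noise-free observations f_i of the objective onto observations
    of the latent function g, using f(x) = eta + 0.5 * g(x)^2, so that
    g_i = sqrt(2 * (f_i - eta)). Valid only when eta <= min(f_obs)."""
    f_obs = np.asarray(f_obs, dtype=float)
    if eta > f_obs.min():
        raise ValueError("eta must lower-bound the observations")
    return np.sqrt(2.0 * (f_obs - eta))
```

Note that the mapping is only defined when $\eta$ lower-bounds the observations, which the prior on $\eta$ in Section 2.2 is designed to ensure.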
We impose a zero-mean Gaussian process prior on $g$, $g \sim \mathcal{GP}(0, k)$, so that the posterior distribution for $g$ conditioned on the observation data and the test point $\mathbf{x}$ also follows a Gaussian process, with posterior mean $\mu_g(\mathbf{x})$ and posterior variance $\sigma_g^2(\mathbf{x})$.
Due to the parabolic transformation, the distribution for $f(\mathbf{x})$ is now a non-central $\chi^2$ distribution, making the analysis intractable. In order to tackle this problem and obtain a posterior distribution that is also Gaussian, we resort to the linearisation technique proposed in (Gunter et al., 2014).
We perform a local linearisation of the parabolic transformation $f = \eta + \frac{1}{2}g^2$ around $g_0$ and obtain
\[
f(\mathbf{x}) \approx \eta + \frac{1}{2}\,g_0(\mathbf{x})^2 + g_0(\mathbf{x})\big(g(\mathbf{x}) - g_0(\mathbf{x})\big),
\]
where the gradient $\partial f / \partial g\,\big|_{g_0} = g_0$. By setting $g_0$ to the mode of the posterior distribution (i.e. $g_0(\mathbf{x}) = \mu_g(\mathbf{x})$), we obtain an expression for $f(\mathbf{x})$ which is linear in $g(\mathbf{x})$:
\[
f(\mathbf{x}) \approx \eta - \frac{1}{2}\,\mu_g(\mathbf{x})^2 + \mu_g(\mathbf{x})\, g(\mathbf{x}).
\]
Since an affine transformation of a Gaussian process remains a Gaussian process, the predictive posterior distribution for $f(\mathbf{x})$ now has the closed form:
\[
p(f \mid \mathcal{D}_n, \mathbf{x}, \eta) = \mathcal{N}\big(f;\ \eta + \tfrac{1}{2}\mu_g(\mathbf{x})^2,\ \mu_g(\mathbf{x})^2\, \sigma_g^2(\mathbf{x})\big).
\]
However, in real-world situations, we do not have access to the true function values but only to noisy observations of the function, $y(\mathbf{x}) = f(\mathbf{x}) + \epsilon$, where $\epsilon$ is assumed to be independently and identically distributed Gaussian noise with variance $\sigma_n^2$ (Rasmussen and Williams, 2006). Given noisy observation data $\mathcal{D}_n$, the predictive posterior distribution becomes:
\[
p(y \mid \mathcal{D}_n, \mathbf{x}, \eta) = \mathcal{N}\big(y;\ \eta + \tfrac{1}{2}\mu_g(\mathbf{x})^2,\ \mu_g(\mathbf{x})^2\, \sigma_g^2(\mathbf{x}) + \sigma_n^2\big).
\]
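Under this linearisation the predictive moments take a simple closed form. The following sketch (a hypothetical helper of our own naming, assuming the GP posterior moments $\mu_g(\mathbf{x})$ and $\sigma_g^2(\mathbf{x})$ of $g$ are already available) computes them:

```python
import numpy as np

def predictive_posterior(mu_g, var_g, eta, noise_var=0.0):
    """Gaussian predictive posterior for y(x) implied by linearising
    f = eta + 0.5*g^2 around the posterior mean mu_g of g:
      mean = eta + 0.5 * mu_g^2
      var  = mu_g^2 * var_g + noise_var
    mu_g and var_g are the GP posterior mean and variance of g at x."""
    mean = eta + 0.5 * mu_g ** 2
    var = mu_g ** 2 * var_g + noise_var
    return mean, var
```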
2.2 Hyperparameter Treatment
Hyperparameters $\boldsymbol{\theta}$ are the free parameters of the model, such as the output scale and characteristic length scales of the kernel function for the Gaussian process, as well as the noise variance. Recall that we introduce a new hyperparameter $\eta$ in our model to represent the global minimum. To ensure that $\eta$ is not greater than the minimum observation $y_{\min}$, we place a broad normal prior on $\eta$.
The most popular approach to hyperparameter treatment is to learn hyperparameter values via maximum likelihood estimation (MLE) or maximum a posteriori estimation (MAP). However, both MLE and MAP are undesirable, as they give point estimates and ignore our uncertainty about the hyperparameters (Hernández-Lobato et al., 2014). In a fully Bayesian treatment of the hyperparameters, we should consider all possible hyperparameter values. This can be done approximately by marginalising the acquisition function with respect to the posterior $p(\boldsymbol{\theta} \mid \mathcal{D}_n)$ (marginalising the acquisition function, while common for its convenience, is itself an approximation to the more principled marginalisation of the posteriors within the acquisition function):
\[
\alpha_n(\mathbf{x}) = \int \alpha(\mathbf{x} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \mathcal{D}_n)\, \mathrm{d}\boldsymbol{\theta}.
\]
Since complete marginalisation over the hyperparameters is analytically intractable, the integral must be approximated using a Monte Carlo method (Hoffman and Ghahramani, 2015; Snoek et al., 2012), leading to the final expression:
\[
\alpha_n(\mathbf{x}) \approx H\Bigg[\frac{1}{M}\sum_{j=1}^{M} p\big(y \mid \mathcal{D}_n, \mathbf{x}, \boldsymbol{\theta}^{(j)}, \eta^{(j)}\big)\Bigg] - \frac{1}{M}\sum_{j=1}^{M} H\Big[p\big(y \mid \mathcal{D}_n, \mathbf{x}, \boldsymbol{\theta}^{(j)}, \eta^{(j)}\big)\Big],
\]
where $\{(\boldsymbol{\theta}^{(j)}, \eta^{(j)})\}_{j=1}^{M}$ are samples drawn jointly from the hyperparameter posterior.
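Putting the pieces together, the Monte Carlo acquisition can be sketched as below, assuming the Gaussian predictive moments under each hyperparameter sample have already been computed. The mixture entropy here uses a fixed-grid composite Simpson's rule as a simple stand-in for the adaptive scheme described in Section 2.3, and all function names are ours:

```python
import numpy as np

def gaussian_mixture_entropy(means, variances, n_grid=4001):
    """Differential entropy of an equally weighted 1-D Gaussian mixture,
    computed by composite Simpson's rule on -p(y) log p(y)."""
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    sd = np.sqrt(variances)
    lo, hi = (means - 8 * sd).min(), (means + 8 * sd).max()
    y = np.linspace(lo, hi, n_grid)  # n_grid odd => even number of intervals
    # Mixture density p(y) = (1/M) * sum_j N(y; m_j, s_j^2)
    p = np.mean(
        np.exp(-0.5 * (y[:, None] - means) ** 2 / variances)
        / np.sqrt(2 * np.pi * variances),
        axis=1,
    )
    integrand = -p * np.log(p)
    h = (hi - lo) / (n_grid - 1)
    return h / 3 * (
        integrand[0] + integrand[-1]
        + 4 * integrand[1:-1:2].sum() + 2 * integrand[2:-1:2].sum()
    )

def fitbo_acquisition(means, variances, noise_var):
    """Monte Carlo FITBO acquisition at one test point: means[j] and
    variances[j] are the Gaussian predictive moments of
    p(y | D, x, theta_j, eta_j) under the j-th hyperparameter sample."""
    total_var = np.asarray(variances, dtype=float) + noise_var
    first = gaussian_mixture_entropy(means, total_var)            # mixture entropy
    second = np.mean(0.5 * np.log(2 * np.pi * np.e * total_var))  # mean Gaussian entropy
    return first - second
```

When all hyperparameter samples agree, the mixture collapses to a single Gaussian and the acquisition is zero; disagreement between samples (i.e. information to be gained) makes it positive.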
2.3 Approximation for the Entropy of A Gaussian Mixture
The entropy of a Gaussian mixture is intractable and can be estimated via a number of methods: the Taylor expansion proposed in (Huber et al., 2008), numerical integration and Monte Carlo integration. Of these three, our experimentation revealed that numerical integration (in particular, an adaptive Simpson’s method) was clearly the most performant for our application (see Supplementary Material). Note that our Gaussian mixture is univariate.
Alternatively, we can approximate the first entropy term by matching the first two moments of the Gaussian mixture. The mean and variance of a univariate Gaussian mixture $\frac{1}{M}\sum_{j=1}^{M} \mathcal{N}(y;\, m_j, \sigma_j^2)$ have the analytical form:
\[
m = \frac{1}{M}\sum_{j=1}^{M} m_j, \qquad \sigma^2 = \frac{1}{M}\sum_{j=1}^{M}\big(m_j^2 + \sigma_j^2\big) - m^2.
\]
By fitting a Gaussian to the Gaussian mixture, the first entropy term can be approximated with the analytical expression $\frac{1}{2}\ln\big(2\pi e \sigma^2\big)$. We compare the numerical integration and moment-matching approaches in our experiments in Section 3.1.
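The moment-matched approximation is a one-liner by comparison. A minimal sketch (the helper name is ours):

```python
import numpy as np

def moment_matched_entropy(means, variances):
    """Fit a single Gaussian to an equally weighted Gaussian mixture by
    matching its first two moments, then return the closed-form Gaussian
    entropy 0.5 * ln(2*pi*e*sigma^2)."""
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    mu = means.mean()
    var = (variances + means ** 2).mean() - mu ** 2
    return 0.5 * np.log(2 * np.pi * np.e * var)
```

Since the Gaussian is the maximum-entropy distribution for a given variance, this approximation upper-bounds the true mixture entropy.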
2.4 The Algorithm
The FITBO approach is summarised in Algorithm 1, and Figure 1 illustrates the sampling behaviour of FITBO on a simple 1-D Bayesian optimisation problem. The optimisation process starts with three initial observations. As more samples are taken, the mean of the posterior distribution over the objective function gradually comes to resemble the objective function, and the distribution of $\eta$ converges to the global minimum.
We conduct a series of experiments to test the empirical performance of FITBO and compare it with other popular acquisition functions. In this section, FITBO denotes the version using numerical integration to estimate the entropy of the Gaussian mixture, while FITBO-MM denotes the version using moment matching. We adopt a zero-mean Gaussian process prior with the squared exponential kernel in all experiments and use the Metropolis-Hastings algorithm (Bishop, 2006) to sample the hyperparameters, together with $\eta$ in the case of FITBO. For the implementation of EI, PI, UCB, MES and PES, we use the open-source Matlab code by Wang and Jegelka (2017) and Hernández-Lobato et al. (2014). We use the variant of MES that samples the minimum via functions generated by random features (Wang and Jegelka, 2017), which is also the minimiser sampling strategy used by PES (Hernández-Lobato et al., 2014). For both PES and MES, we draw a single minimum or minimiser sample for computing the acquisition function.
In evaluating the optimisation performance of the various methods, we use the two common metrics adopted by Hennig and Schuler (2012). The first metric is the immediate regret (IR), defined as:
\[
\mathrm{IR}_n = \big| f(\tilde{\mathbf{x}}_n) - f(\mathbf{x}_\star) \big|,
\]
where $\mathbf{x}_\star$ is the location of the true global minimiser and $\tilde{\mathbf{x}}_n$ is the location recommended by a Bayesian optimiser after $n$ iterations, which corresponds to the global minimiser of the posterior mean (Hernández-Lobato et al., 2014). The second metric is the Euclidean distance of the optimiser's recommendation $\tilde{\mathbf{x}}_n$ from the true global minimiser $\mathbf{x}_\star$, defined as:
\[
d_n = \big\| \tilde{\mathbf{x}}_n - \mathbf{x}_\star \big\|_2.
\]
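Both metrics are straightforward to compute given an optimiser's recommendation. A minimal sketch (function names are ours):

```python
import numpy as np

def immediate_regret(f, x_rec, f_min):
    """IR_n = |f(x_rec) - f_min|: gap between the objective value at the
    optimiser's recommendation x_rec and the true global minimum value."""
    return abs(f(np.asarray(x_rec, dtype=float)) - f_min)

def recommendation_distance(x_rec, x_min):
    """Euclidean distance between the recommended and true minimisers."""
    return float(np.linalg.norm(np.asarray(x_rec, dtype=float)
                                - np.asarray(x_min, dtype=float)))
```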
3.1 Runtime Tests
The first set of experiments measures and compares the runtime of evaluating the acquisition functions of UCB, PI, EI, PES, MES, FITBO and FITBO-MM. The runtime measured excludes the time taken to sample hyperparameters and to optimise the acquisition functions. The methodology of the tests can be summarised as follows:
Generate 10 initial observations from an N-dimensional test function and sample a set of M hyperparameters from the log posterior distribution using a Monte Carlo sampler.
Use this set of hyperparameters to evaluate all acquisition functions at 100 test points.
Compute and compare the mean and variance of the runtime taken for evaluating various acquisition functions at each test point.
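The steps above can be sketched as a minimal timing harness (our own naming throughout; the paper's actual benchmarking code is not shown):

```python
import time
import numpy as np

def time_acquisition(acq_fn, test_points, hyper_samples):
    """Per-point wall-clock timing of an acquisition function, following
    the protocol above: evaluate at each test point under one fixed set
    of hyperparameter samples and report the mean and variance of the
    per-point runtimes. acq_fn(x, hypers) stands in for any of the
    acquisition functions being compared."""
    runtimes = []
    for x in test_points:
        start = time.perf_counter()
        acq_fn(x, hyper_samples)
        runtimes.append(time.perf_counter() - start)
    return float(np.mean(runtimes)), float(np.var(runtimes))
```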
We did not include the time for sampling $\eta$ alone in the runtime of evaluating FITBO and FITBO-MM, because $\eta$ is sampled jointly with the other hyperparameters and does not significantly add to the overall sampling burden. In fact, we found that sampling $\eta$ adds 0.2 seconds on average when drawing 2000 samples with the M-H sampler and 2 seconds when drawing 20000 samples. Note further that we limit all methods to a fixed number of hyperparameter samples in both the runtime tests and the performance experiments: this imparts a slight performance penalty on our method, which must sample from a hyperparameter space of one additional dimension.
The above tests are repeated for different hyperparameter sample sizes (M = 10, 20, 50, 100, 200, 500) and test location data of different dimensions (N = 2, 4, 6). The results are presented graphically in Figure 2, and the exact numerical results for the methods that are very close in runtime are presented in Tables 1 and 2.
Figure 2 shows that FITBO is significantly faster to evaluate than PES and MES across the hyperparameter sample sizes used and across problems of different input dimensions. Moreover, the moment matching technique further enhances the speed of FITBO by a large margin, making FITBO-MM faster than EI and comparable with PI and GP-UCB. In addition, the runtime of evaluating FITBO-MM, EI, PI and UCB tends to remain constant regardless of the input dimension, while the runtime of PES, MES and FITBO tends to grow rapidly with input dimension.
3.2 Tests with Benchmark Functions
We perform optimisation tasks on three challenging benchmark functions: Branin-Hoo (defined in ), Eggholder (defined in ) and Hartmann 6D (defined in ). In all tests, we set the observation noise to a fixed level and resample all the hyperparameters after each function evaluation. The performance of FITBO and FITBO-MM is compared with that of EI, PES and MES. As in the experiments in (Hernández-Lobato et al., 2014) and (Hennig and Schuler, 2012), we compute the median IR and the median distance between the predicted global minimiser and the true global minimiser over a number of random initialisations. For the 2D benchmark problems, all Bayesian optimisation algorithms start from the same 3 random observations at each initialisation and use 100 hyperparameter samples for each function evaluation. For the Hartmann 6D problem, all methods start from 9 initial observations and use 200 hyperparameter samples for each function evaluation.
The results are presented in Figure 3. The plots on the left show the median of the immediate regret (IR) obtained by each Bayesian optimisation approach as more evaluation steps are taken. The plots on the right show the median of the Euclidean distance between each optimiser's recommended global minimiser and the true global minimiser. The shaded confidence bands correspond to one standard deviation.
When optimising Branin 2D, FITBO and FITBO-MM lose out to the other methods initially but surpass them, especially EI and PES, by a large margin after 50 evaluations. In the case of Eggholder 2D, which is more complicated and multimodal, FITBO and FITBO-MM do not perform as well as the other methods in finding lower function values. However, they outperform all the other methods in locating the global minimiser. For the higher dimensional problem, Hartmann 6D, the greedier improvement-based approach, EI, gains the advantage in searching for lower function values. FITBO and FITBO-MM outperform the other two information-based approaches in finding the global minimiser and manage to catch up with EI on the input-space metric.
One interesting point we would like to illustrate through the Branin problem is the fundamentally different mechanisms behind information-based approaches like FITBO and improvement-based approaches like EI. As shown in Figure 4, FITBO is much more explorative than EI in taking new evaluations. FITBO successfully finds all three global minimisers by maximising the information gain, whereas EI quickly concentrates its searches in regions of low function values, thus missing one of the global minimisers.
3.3 Test with a Real-world Problem
Finally, we experiment with the real-world optimisation task of tuning 3 parameters (the number of neurons, the damping-decrease factor and the damping-increase factor) of a 1-hidden-layer neural network (Wang and Jegelka, 2017). The neural network is trained and tested on the Boston Housing dataset (Bache and Lichman, 2013), and all algorithms start from 9 initial observations. We set the observation noise to a negligible level and repeat the experiment 20 times.
In this case, the ground truth is unknown, and our aim is to minimise the validation error after tuning the parameters. Thus, the resultant validation errors after each evaluation are compared across all the Bayesian optimisation algorithms. Figure 5 shows that FITBO and FITBO-MM perform competitively with MES and PES. However, all the information-based approaches seem to lose out to EI in this case, although PES and MES outperform EI in the similar experiments performed by Wang and Jegelka (2017) and Hernández-Lobato et al. (2014). This discrepancy may be due to the use of different values for the fixed parameters as well as different initial observations. The limited number of repeated experiments and function evaluations may also account for it.
We have proposed a novel information-theoretic approach for Bayesian optimisation, FITBO. Through the use of the parabolic approximation and the hyperparameter $\eta$, FITBO requires less sampling effort and a much simpler implementation than other information-based methods like PES and MES. As a result, its computational speed exceeds that of current information-based methods by a large margin, and even exceeds that of EI, putting it on par with PI and UCB. While requiring much lower runtime, it still achieves optimisation performance as good as, or better than, PES and MES on a variety of tasks. The FITBO approach therefore offers a very competitive alternative to existing Bayesian optimisation approaches.
- Bache and Lichman (2013) K. Bache and M. Lichman. UCI machine learning repository, 2013.
- Bishop (2006) C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
- Brochu et al. (2010) E. Brochu, V. M. Cora, and N. De Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599, 2010.
- Gunter et al. (2014) T. Gunter, M. A. Osborne, R. Garnett, P. Hennig, and S. J. Roberts. Sampling for inference in probabilistic models with fast Bayesian quadrature. In Advances in Neural Information Processing Systems, pages 2789–2797, 2014.
- Hennig and Schuler (2012) P. Hennig and C. J. Schuler. Entropy search for information-efficient global optimization. Journal of Machine Learning Research, 13(Jun):1809–1837, 2012.
- Hernández-Lobato et al. (2014) J. M. Hernández-Lobato, M. W. Hoffman, and Z. Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In Advances in neural information processing systems, pages 918–926, 2014.
- Hoffman and Ghahramani (2015) M. W. Hoffman and Z. Ghahramani. Output-space predictive entropy search for flexible global optimization. In the NIPS workshop on Bayesian optimization, 2015.
- Huber et al. (2008) M. F. Huber, T. Bailey, H. Durrant-Whyte, and U. D. Hanebeck. On entropy approximation for Gaussian mixture random vectors. In Multisensor Fusion and Integration for Intelligent Systems, 2008. MFI 2008. IEEE International Conference on, pages 181–188. IEEE, 2008.
- Jones et al. (1998) D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global optimization, 13(4):455–492, 1998.
- Kushner (1964) H. J. Kushner. A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering, 86(1):97–106, 1964.
- Rasmussen and Williams (2006) C. E. Rasmussen and C. K. Williams. Gaussian processes for machine learning, volume 1. MIT press Cambridge, 2006.
- Shahriari et al. (2016) B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016.
- Snoek et al. (2012) J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pages 2951–2959, 2012.
- Srinivas et al. (2009) N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.
- Villemonteix et al. (2009) J. Villemonteix, E. Vazquez, and E. Walter. An informational approach to the global optimization of expensive-to-evaluate functions. Journal of Global Optimization, 44(4):509–534, 2009. URL http://www.springerlink.com/index/T670U067V47922VK.pdf.
- Wang and Jegelka (2017) Z. Wang and S. Jegelka. Max-value entropy search for efficient Bayesian optimization. arXiv preprint arXiv:1703.01968, 2017.