Scalable Meta-Learning for Bayesian Optimization

Scalable Meta-Learning for Bayesian Optimization

Matthias Feurer, Benjamin Letham, Eytan Bakshy,
University of Freiburg
Facebook
feurerm@cs.uni-freiburg.de, bletham@fb.com, ebakshy@fb.com
Abstract

Bayesian optimization has become a standard technique for hyperparameter optimization, including data-intensive models such as deep neural networks that may take days or weeks to train. We consider the setting where previous optimization runs are available, and we wish to use their results to warm-start a new optimization run. We develop an ensemble model that can incorporate the results of past optimization runs, while avoiding the poor scaling that comes with putting all results into a single Gaussian process model. The ensemble combines models from past runs according to estimates of their generalization performance on the current optimization. Results from a large collection of hyperparameter optimization benchmark problems and from optimization of a production computer vision platform at Facebook show that the ensemble can substantially reduce the time it takes to obtain near-optimal configurations, and is useful for warm-starting expensive searches or running quick re-optimizations.

Scalable Meta-Learning for Bayesian Optimization


Matthias Feurer, Benjamin Letham, Eytan Bakshy, University of Freiburg Facebook feurerm@cs.uni-freiburg.de, bletham@fb.com, ebakshy@fb.com

1 Introduction

Bayesian optimization is a technique for solving black-box optimization problems with expensive function evaluations. It has been successfully applied in a number of domains, including: optimizing the hyperparameters of machine learning algorithms, where a function evaluation involves training the model (Snoek et al., 2012); A/B testing, where a function evaluation is a field experiment (Letham et al., 2017); and parameter estimation in black-box models, where a function evaluation involves running a lengthy simulation to evaluate model likelihood (Kandasamy et al., 2017). Given a small initial set of function evaluations, Bayesian optimization proceeds by fitting a surrogate model to those observations, typically a Gaussian process (GP), and then optimizing an acquisition function that balances exploration and exploitation in determining what point to evaluate next. When the surrogate function can be quickly evaluated, Bayesian optimization uses the cheap problem of optimizing the acquisition function to guide the expensive optimization problem we wish to solve.

The “black-box” nature of the optimization assumes that nothing is known about the problem besides its function evaluations, but there are settings in which ancillary information is available in the form of prior optimizations. Here we are particularly interested in two such settings.

  1. Re-optimizations: Production machine learning models are constantly re-trained as new data become available and underlying code bases are updated. The optimal hyperparameters may also change as the data and code change, and so should be frequently re-optimized. Although they may change significantly, most of the time the results of a re-optimization will be similar to those of a previous run.

  2. Very short runs: A typical run of Bayesian optimization uses a few dozen function evaluations. Models with very long training times, such as deep neural networks on large datasets which may take days or weeks to train, may not permit that many iterations. Suppose we have access to the outcomes of previous optimization runs of similar models. The hyperparameter surface for our current model may be similar to those of a past run, and identifying similar runs can guide the short run.

Several Bayesian optimization methods have been developed to borrow strength across runs – these are described in Section 2. Many of these methods are for settings where additional metafeatures are available for identifying which past runs are likely to be similar, such as statistics of the dataset. Such metafeatures are often not available for broad classes of settings in which Bayesian optimization can be applied, including simulations and A/B tests.

Furthermore, most methods for using past optimization tasks learn an adjustment for the results of past runs that allow them to be combined into the same model as observations from the current run. This approach does not scale well with the number of past runs. GP regression has complexity, and combining past runs with iterations into a single model incurs computational cost. Practically, GPs become unusable after a few thousand data points (Cunningham et al., 2008; Hensman et al., 2013) and their usefulness as a tractable surrogate function diminishes even earlier. This limit is easily reached for the problems described above. A machine learning platform that does hyperparameter optimization with 50 iterations as part of model fitting will reach 1,000 data points after only 20 models. Our experiments in Section 5.2 use 49 past runs of 50 iterations each, for a total of 2,450 data points – well beyond what can used for Bayesian optimization with a single GP.

The contribution of this paper is an ensemble method for warm-starting Bayesian optimization using past runs, called the ranking-weighted Gaussian process ensemble (RGPE). The method has complexity, allowing it to scale to a much larger number of past runs than methods which combine all past observations in a single model. If fitted base models were stored during past runs, they can be used directly without any refitting of prior results. The method does not require the existence of metafeatures, allowing it to be used for a broader set of Bayesian optimization applications. Finally, unlike other ensemble approaches, our method is naturally parallelizable. We evaluate its performance using a large collection of SVM hyperparameter optimization benchmark problems and then show its use on a real problem by optimizing the computer vision platform at Facebook.

2 Related Work

Borrowing strength from past runs is a form of meta-learning, and in the context of Bayesian optimization is often called transfer learning. A key requirement is to determine which past runs are similar to the current task. Several past methods have used manually defined metafeatures of the datasets to measure task similarity (Brazdil et al., 1994). Bardenet et al. (2013) simultaneously optimize several problems by using metafeatures of the data as features of the GP along with the hyperparameters to optimize. Observations from all runs are put on the same scale using an SVM model and then used in a single GP. Yogatama and Mann (2014) select similar past optimization runs based on the nearest neighbors in metafeature space. Observations from all similar runs are then combined in a single GP. Schilling et al. (2016) construct a GP for each past optimization run including both the past observations and those of the current task, with task similarity described by metafeatures. These models are combined using the product of GP experts model (Cao and Fleet, 2014). Feurer et al. (2015) use metafeature similarity to select initial points for the optimization as the best points from similar runs, and then proceed with usual single-task Bayesian optimization. These methods all require metafeatures, whereas here we seek to develop a method that does not require metafeatures.

Another set of methods attempts to learn the task similarity without the use of metafeatures – our work falls in this category. Swersky et al. (2013) use a multitask GP to jointly model all past runs and the current task. Poloczek et al. (2016) also use a kernel over all tasks with a structure that allows for modeling negative correlations. Multitask GPs suffer from the same poor scaling as putting all of the observations into a single GP, and cannot be used for the problem of Section 5.2. Furthermore, Swersky et al. (2013) sample a lower triangular matrix describing task correlations, which prohibits a large number of past runs.

Wistuba et al. (2016) develop the two-stage transfer surrogate model with rankings (TST-R) which avoids poor scaling with the number of past observations by combining GPs individually fit to the observations from each run. They use a Nadaraya-Watson kernel weighting to linearly combine the predictions from each GP by defining a distance metric across tasks. In particular, they consider the orderings of all pairs of observations points in the current task. The distance between the past task and the current task is taken as the proportion of discordant pairs when the past model is evaluated on configurations used on the current task (Wistuba, 2016). Weights for each model are then computed using a quadratic kernel with bandwidth parameter , which serves as a threshold for the similarity required to borrow strength from any prior task. The bandwidth must be chosen by the user, and their experiments used for some benchmarks and for another. The kernel is used to combine mean predictions of past models with the mean prediction of the current model, but variances are not combined – the combined model is given the variance of the current model and variances of past models are ignored. This means that the TST-R model is no longer a GP, and in particular does not have a valid posterior from which joint samples can be drawn. In our experiments here we use the method of Snoek et al. (2012) for batch optimization, which does a Monte Carlo integration over joint samples drawn from the model posterior. This approach cannot be used with TST-R. Like our model, TST-R does not require metafeatures and scales well with the number of past runs, and so we use it for comparison in our experiments of Section 5.

Theoretical support for transfer learning in Bayesian optimization is provided by Shilton et al. (2017), who study two particular strategies for transfer learning and show that they result in tighter regret bounds than without transfer learning. One of these strategies is to model the difference between the past run and the target task with a GP, which is then used to adjust the past run observations for inclusion in the current model. Golovin et al. (2017) take a similar approach for an ordered set of past runs. Rather than fitting a GP to each run separately, a GP is fit to the residuals of each run relative to the predictions of the previous model in the stack. This essentially uses the outcomes of previous runs as a mean prior for the next run. This method assumes an ordering to the runs, which is reasonable for re-optimizations but not for meta-learning from a collection of unrelated problems.

A common issue for many of these methods is that the response surfaces of different problems can have very different scales. A frequently used strategy for making the past runs more amenable to combining is to standardize the observations for each task separately to have zero mean and unit standard deviation (Yogatama and Mann, 2014; Wistuba et al., 2015, 2016). We use this strategy for our method as well.

Bayesian optimization has been done with models that scale better with the number of observations, such as random forest (Hutter et al., 2011), parzen estimators (Bergstra et al., 2011) and neural networks (Snoek et al., 2015; Springenberg et al., 2016). These models come with challenges of their own, such as poor uncertainty extrapolation with limited observations and hyperparameter sensitivity. There are also extensions of GPs for large datasets, most notably sparse GPs (Csató and Opper, 2002). Sparse GPs can also have poor uncertainty extrapolation (Bauer et al., 2016) and do not easily produce joint samples. GPs remain the standard model for practical Bayesian optimization.

In the related field of algorithm configuration (Hutter et al., 2009), Lindauer and Hutter (2018) propose a method to learn a weighted combination of random forests for runtime prediction, with each random forest being fitted on a previous algorithm configuration run. They consider a setting in which only a few auxiliary tasks are available, but the number of observations per task is on the order of several hundreds, far beyond what a GP can handle.

3 Background and Problem Setup

The goal of Bayesian optimization is to find a minimizer of a black-box function in a bounded space by iteratively querying the function at input configurations and observing the corresponding outputs . In each iteration we first fit a probabilistic model on observations made so far. We then use an acquisition function to select a promising configuration to evaluate next, balancing exploration and exploitation.

In general may be a noisy estimate of the function value, and the noise standard deviation may also be known. We estimate the underlying function with GP regression, yielding a posterior that has mean and variance , which are known analytically (Rasmussen and Williams, 2006). These quantities depend on the GP kernel, which has several hyperparameters that are inferred when the model is fit. The GP posterior at a collection of points has a multivariate normal distribution with mean and covariance matrix denoted as and .

In our experiments in Sections 5 and 6 we use the expected improvement (EI) acquisition function (Jones et al., 1998), a common choice because it can be computed in closed form and optimized with gradient-based methods. Let be the value of the best point observed so far: . The EI is

with . A thorough introduction to Bayesian optimization is given by Shahriari et al. (2016).

We suppose that runs of Bayesian optimization have been completed. Let be the function evaluations made at past optimization runs . We fit a GP model to the observations of each past run and refer to these models as base models. They have posterior , with mean and variance and respectively. They remain fixed throughout the optimization, inasmuch as we do not obtain new observations for them. The current optimization problem we are trying to solve is run . We fit a GP to observations from run and call it the target model. The target model is refit after each new function evaluation. We overload notation and define . Our goal is to minimize the target function using the base models and the target model.

4 Ranking-Weighted Gaussian Process Ensemble

Our strategy here is to estimate the target function as a weighted combination of the predictions of each base model and the target model itself:

A model of this form is preferred for several practical reasons. First, this ensemble model remains a GP, and in particular

This means that all of the usual computational machinery for Bayesian optimization with GPs remains valid, such as a closed-form expression for EI and the ability to draw joint samples for parallel optimization. Additionally, each base model remains unchanged throughout the optimization and can be loaded directly from the previous runs. The fitting cost is only the cost of fitting the target model and inferring the weights . Finally, predictions are made independently for each GP and we obtain complexity and a linear slowdown relative to no warm-starting. Following Yogatama and Mann (2014), we standardize each model prior to inclusion in the ensemble.

Our approach for computing ensemble weights follows the agnostic Bayesian ensemble of Lacoste et al. (2014), which weights predictors according to estimates of their generalization performance. In particular, each predictor in the ensemble is weighted according to the probability that it is the best predictor in the ensemble, under a desired loss function. We use a ranking loss to compute weights, and so call this method the ranking-weighted Gaussian process ensemble (RGPE).

Our model is similar to a mixture-of-GPs (Tresp, 2001), although there are important differences in the problem setup. A mixture-of-GPs learns which expert model is responsible for which data point, whereas in this setting we know which expert is responsible for each of the datasets . There is thus no need to perform an assignment of datapoints to base models or for an additional gating network to assign data points at prediction time. Our generative model here is that the outputs of run come from a GP using only , so the base model GPs are conditionally independent given their data.

We now discuss our loss function and approach for estimating generalization of each model.

4.1 Computing Ensemble Weights

Our goal in Bayesian optimization is to find the minimum function value. A model will be useful for optimization if it is able to correctly order observations according to their function value. For meta-learning, we wish to assess the ability of model to generalize to the target function, and so construct a loss function that measures the degree to which each model is able to correctly rank the target observations . Given target function evaluations, we define the loss as the number of misranked pairs:

(1)

where is the exclusive-or operator. The loss is a random variable inasmuch as is a random variable. We can sample from the posterior distribution of by evaluating it on samples from the GP posterior of .

For base models, this measures their ability to generalize to the target function. For the target model, this is an estimate of in-sample error and does not accurately reflect generalization. We estimate generalization in the target model using cross-validation, in practice with leave-one-out models. Let indicate the target model with observation left out. The loss for the target model is computed as

(2)

The leave-one-out model is constructed by removing the data point from the GP; kernel hyperparameters are not re-estimated. Fig. 1 provides an illustration of how misrankings are computed for base models and using the leave-one-out target model, showing the inner sum of (1) and (2). If is large, we can use -fold cross-validation here which maintains complexity for weight fitting.

Figure 1: An illustration of the ranking loss used to weight models in the ensemble. (Top) The target model with four observations highlighted by vertical lines. We illustrate measuring misrankings relative to the first, at . (Middle) A posterior draw from a base model has one misranking, at . (Bottom) A posterior draw from the leave-one-out target model misranks .

Ranking loss is more appropriate for estimating optimization performance than other natural choices such as squared error or model log-likelihood because the actual values of the predictions do not matter for optimization—we only need to identify the location of the optimum. It is easy to see that if all of the models are able to correctly order a set of points then the ensemble will also correctly order those points. That is, if and at least one weight is non-zero, implies .

We weight each model with the probability that it is the model in the ensemble with the lowest ranking loss. The posterior for at the target observations is a multivariate normal with mean and covariance . We draw samples from this posterior at the configurations evaluated so far and then obtain posterior samples of the ranking loss by evaluating (1) and (2) on the GP samples. We draw such samples: for and . Weight for model is then computed as

(3)

If the argmin is not unique due to a tie in the number of misrankings, the weight is given to the target model if it is part of the tie, otherwise the tie is broken randomly.

4.2 Preventing Weight Dilution

One challenge within this type of ensemble is preventing weight dilution by a large number of noisy models. Suppose a GP has high variance at the points in and so is able to produce arbitrary rankings of the points. This model is clearly not useful for making predictions, and if better models are present in the ensemble it will have a low probability of being the argmin in (3). However, if we have a very large number of such models in the ensemble, the chance of any one of them producing the correct ranking in a sample goes to as the number of noisy models increases. As a concrete example, with five target points, a model that produces a random ranking will have probability of producing the correct order. With such models, the probability that at least one produces a perfect ranking is greater than .

We prevent weight dilution by discarding models that are substantially worse than the target model. Model is discarded from the ensemble if the median of its loss samples is greater than the percentile of the target loss samples . In addition to preventing weight dilution, this strategy has computational benefits in that it results in fewer GP predictions for each ensemble model prediction. The choice of the percentile for the exclusion threshold could be considered a hyperparameter of the method, however we perform a sensitivity analysis in Section 5 and show that the precise value is not important.

4.3 Optimization with the Gaussian Process Ensemble

The RGPE retains the distributional properties of a GP, and so can be used with standard acquisition functions for Bayesian optimization. More specifically, instead of we use , and instead of we use , to compute EI.

In many applications of Bayesian optimization we have the ability to run multiple function evaluations in parallel, and parallelization is critical for the scalability of Bayesian optimization. We use the technique of Snoek et al. (2012) to parallelize EI by integrating over the posterior for the outcomes at pending function evaluations. Suppose we have pending evaluations at points , and a worker is available to evaluate an additional point. This point is chosen as the one that maximizes EI when integrated over the pending outcomes , …, :

In practice, this is done using a Monte Carlo approximation. We jointly sample “fantasies” from the posterior at , and then add these simulated observations to the GP and compute EI with the conditioned model. EI is averaged over several such fantasies. For RGPE, we sample from each model in the ensemble independently and condition each model on its sample to obtain the conditioned ensemble.

If observations are noisy or if there is uncertainty in base models at the current best, may be a random variable. This can occur when the locations of the observations in the base models do not overlap with those of the target model. Typical approaches for computing expected improvement with noisy observations can be used with the RGPE; we follow the strategy of Letham et al. (2017) and integrate over uncertainty in in the same way that we integrate over pending outcomes for parallelization.

5 Experiments

We present several sets of experiments to explore how RGPE performs in practice. We begin with a simple synthetic function in Section 5.1. Section 5.2 then provides a comprehensive study of RGPE performance using a large collection of hyperparameter optimization benchmark problems. These are followed up by the results of using warm-starting inside the computer vision platform at Facebook, which provide useful insight into its real-world application.

All GPs in these experiments used GPy and the ARD Matérn 5/2 kernel (GPy, since 2012). Kernel hyperparameters were set to their posterior means, inferred via MCMC with the NUTS sampler (Hoffman and Gelman, 2014).

5.1 Synthetic Function

Figure 2: (Left) The target function and base functions for the synthetic test problem. (Center) Optimization performance on the test problem, averaged over 100 runs with quasirandom initializations. Error bars show standard error of the mean. Warm-starting provided a clear benefit early on and RGPE quickly converged to the global optimum. (Right) Average weights on each model in the ensemble, averaged over runs. Weight is of the target model, and and correspond to the base functions with the smallest shifts from the target. RGPE relied heavily on the most similar base models.

We use a modification of the Alpine 1 function (Jamil and Yang, 2013) as a synthetic test case for warm-starting:

where is a shift parameter that is used for generating similar datasets. We used as the target function, and then created five base functions with varying degrees of similarity for meta-learning: . The target and base functions are shown in Fig. 2.

Base models were fit to 20 randomly selected points from each base function. Minimization of the target function began with three quasirandom points from a scrambled Sobol sequence, and then proceeded sequentially for a total of 20 function evaluations. The optimization was repeated 100 times, each with a different random selection of the points for the base functions. Five methods were evaluated on this problem: quasirandom points with no model (Sobol), standard Bayesian optimization with a GP fit only to observations made on the target task (GP), TST-R with bandwidth , TST-R with bandwidth , and RGPE. Fig. 2 shows the simple regret averaged over the 100 runs of the simulation.

RGPE used the warm-start provided by the base models to immediately begin sampling near the global optimum and quickly converge. The base model corresponding to the smallest shift () received the most weight in the ensemble of all base models ( in Fig. 2), and the two models with the largest shifts received no weight after iteration 7. Weight on the target model, , increased later in the optimization as it gained more predictive power.

TST-R also provided benefit over GP, but its performance depended on the value of the kernel bandwidth and RGPE generally performed better.

5.2 SVM Benchmark Problems

Our main experimental validation of RGPE uses a large set of hyperparameter optimization benchmark problems from Wistuba et al. (2015), and also used in Wistuba et al. (2016). They did hyperparameter searches for SVM on a diverse set of 50 datasets, with sizes ranging from 35 to 250000 training examples, and from 2 to 7000 features. For each dataset, test-set accuracy was measured on a grid of six parameters: three binary parameters indicating a linear, polynomial, or RBF kernel; the penalty parameter ; the degree of the polynomial kernel ( if unused); and the RBF kernel bandwidth ( if unused). The grid search resulted in 288 datapoints for each dataset. Note that this is a harder problem than the common 2-dimensional RBF SVM problem.

The goal is to optimize the hyperparameters over this grid of 288 points for each problem, while treating the remaining 49 problems as past runs for meta-learning. This is the same experimental setup used by Bardenet et al. (2013), except they optimized a 2-dimensional space of AdaBoost parameters on a smaller number of datasets.

Each optimization run was initialized with three randomly selected points, after which it proceeded sequentially for a total of 20 function evaluations. For meta-learning methods, we fit base models for each of the 49 other problems on a random sample of 50 points. Optimization was repeated 20 times for each of the 50 problems, for a total of 1000 optimization runs. Each optimization run used a different random initialization and a different random sample of 50 function evaluations for the base models. We benchmarked the same set of methods from Section 5.1, except random search over the grid replaced Sobol search.

Figure 3: Optimization performance on the SVM hyperparameter optimization benchmarks, evaluated over 20 runs for each of 50 problems. (Left) Simple regret averaged over runs, with bars showing standard error. Warm-starting provided an initial boost over GP, and RGPE achieved the lowest regret. (Middle) The average rank of each method, ranked by simple regret, shows that RGPE consistently performed the best (lower is better). (Right) The number of non-zero weights in the RGPE ensemble. With 49 base models, more than half were immediately given zero weight, and by the end of the optimization only 10 models had positive weight.

Fig. 3 shows the results of these experiments. The left panel shows that warm-starting provided RGPE with a significant, early drop in regret compared to GP, which it was able to sustain throughout the optimization. TST-R with had the quickest initial drop in regret, however by iteration 7 it was passed by RGPE, and by iteration 15 was passed even by GP. Following related work (Bardenet et al., 2013; Yogatama and Mann, 2014; Feurer et al., 2015), we compared the performance of the methods by computing at each iteration the average rank of each method, ranked by simple regret and averaged over optimization runs. Ranks were averaged in the case of ties. Fig. 3 shows that from iteration 5 and on, RGPE outperformed the other methods and had the lowest rank. TST-R initially outperformed GP, but by iteration 8 GP achieved a lower average rank than TST-R with , and was close to TST-R with by iteration 20. In addition to generally performing better, RGPE has the advantage of being parameter-free.

The right panel of Fig. 3 shows the number of non-zero weights in the RGPE. There were 50 models in the ensemble (49 base models and the target model), but on average less than half of them were ever used, since many of the base functions were dissimilar to the target function. Computationally, this means that instead of a RGPE function evaluation requiring 50 GP evaluations, it actually required many fewer – only about ten by the end of the optimization.

We tested sensitivity to the choice of 95 as the percentile threshold used in Section 4.2 to avoid weight dilution by running the SVM benchmarks with 80 and 99 as the threshold. At every iteration the mean regret for both 80 and 99 was within two standard errors of the mean regret with 95. We thus did not see evidence of sensitivity to this parameter across the large set of SVM benchmarks.

6 Optimizing a Computer Vision Platform

Lumos is a computer vision platform at Facebook that is used to train image classification models for a large variety of tasks and datasets. The final stage of the model is a logistic regression on top of convolutional neural network (CNN) features, for which hyperparameter optimization is done with each training. We used RGPE to accelerate the hyperparameter optimization by borrowing strength from previous runs on different datasets.

We optimized eight image classifiers trained on Lumos, each for a different task and to a different dataset. We optimized three parameters: learning rate for stochastic gradient descent, and two regularization parameters. Datasets ranged from ten thousand to two million images, for the largest of which GP Bayesian optimization required around 2,500 core hours. As base runs, we used the results of nine earlier GP Bayesian optimization sweeps on different datasets. These sweeps had 30 iterations each, for a total of 270 base iterations. Each optimization was begun with an initialization of three quasirandom points, after which it proceeded with two iterations asynchronously in parallel for a total of 20 function evaluations. Fig. 4 shows the results of these optimization runs. The true minimum is not known for these problems, so regret was measured as a percentage of the best point found by either method.

Figure 4: (Top) Optimization performance on the computer vision models, averaged over eight models (standard error in bars). With warm-starting, eight iterations were sufficient to achieve lower regret than the GP. (Bottom) Early on the RGPE borrowed heavily from base models, and later in the optimization began to concentrate weight on the target model.

As in the SVM benchmark problems, warm-starting provided a substantial boost in performance starting with the first optimized configuration in iteration 4. By iteration 9 RGPE achieved lower regret than the GP achieved with 20 iterations.

Fig. 4 also shows the RGPE target weight throughout the iterations. In early iterations the target model was unable to generalize and so most of the ensemble weight went to base models. With more iterations, the target model improved and was given more weight, capturing 50% by iteration 20.

7 Discussion

The Introduction described two settings in which warm-starting may be useful: re-optimization runs and very short runs. Both of these settings occur frequently in production machine learning systems at Facebook. Many models are periodically re-fit to track changes in the data, some as often as daily. There are also many models whose fitting times are too lengthy for full searches and so use only 5-10 iterations.

Our results here show that RGPE is a useful method for solving these problems, and also provide insight into how the model behaves. As the number of target function evaluations grows, base models accumulate misrankings while the target model begins to generalize better. At the same time, the target model will generalize better and will have relatively fewer misrankings. As a result the ensemble will eventually put all of its weight on the target model and will revert to standard Bayesian optimization. The ensemble thus provides a warm-start that improves performance while the target GP is weak, and is then faded out as the target GP becomes more useful.

Our results on the computer vision models showed that in practice there are often similar base runs which can greatly accelerate the optimization. On these models RGPE required less than half the computational resources to achieve the same results as regular Bayesian optimization.

Our goal in developing RGPE was to have a meta-learning method that avoids the scaling of putting all observations into a single GP, while maintaining the nice distributional properties of a GP. This allows RGPE to be directly substituted for a GP in a Bayesian optimization system. Closed-form acquisition functions and parallelization methods that have been developed for GPs can be used directly with RGPE. Our experiments showed that RGPE performed better than the alternative ensemble approach, and that it is a scalable and effective method for warm-starting GP Bayesian optimization.

Acknowledgements

Thanks to Till Varoquaux for support in developing the method, and to Alex Chen for supporting the Lumos experiments.

References

  • Bardenet et al. [2013] Rémi Bardenet, Mátyás Brendel, Balázs Kégl, and Michèle Sebag. Collaborative hyperparameter tuning. In Proceedings of the 30th International Conference on Machine Learning, ICML, 2013.
  • Bauer et al. [2016] Matthias Bauer, Mark van der Wilk, and Carl Edward Rasmussen. Understanding probabilistic sparse Gaussian process approximations. In Advances in Neural Information Processing Systems 29, NIPS, 2016.
  • Bergstra et al. [2011] James S. Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems 24, pages 2546–2554, 2011.
  • Brazdil et al. [1994] Pavel Brazdil, João Gama, and Bob Henery. Characterizing the applicability of classification algorithms using meta-level learning. In Proceedings of the European Conference on Machine Learning, ECML, pages 83–102, 1994.
  • Cao and Fleet [2014] Yanshuai Cao and David J. Fleet. Generalized product of experts for automatic and principled fusion of Gaussian process predictions. In Modern Nonparametrics 3: Automating the Learning Pipeline workshop at NIPS, 2014.
  • Csató and Opper [2002] Lehel Csató and Manfred Opper. Sparse on-line Gaussian processes. Neural Computation, 14(3):641–668, 2002.
  • Cunningham et al. [2008] John P. Cunningham, Krishna V. Shenoy, and Maneesh Sahani. Fast Gaussian process methods for point process intensity estimation. In Proceedings of the 25th International Conference on Machine Learning, ICML, 2008.
  • Feurer et al. [2015] Matthias Feurer, Jost Tobias Springenberg, and Frank Hutter. Initializing Bayesian hyperparameter optimization via meta-learning. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, AAAI, 2015.
  • Golovin et al. [2017] Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, Greg Kochanski, John Karro, and D. Sculley. Google vizier: A service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD international conference on Knowledge discovery and data mining, KDD, pages 1487–1496, 2017.
  • GPy [since 2012] GPy. GPy: A gaussian process framework in python. http://github.com/SheffieldML/GPy, since 2012.
  • Hensman et al. [2013] James Hensman, Nicolò Fusi, and Neil D. Lawrence. Gaussian processes for big data. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence, UAI, 2013.
  • Hoffman and Gelman [2014] Matthew D. Hoffman and Andrew Gelman. The No-U-Turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15:1351–1381, 2014.
  • Hutter et al. [2009] Frank Hutter, Holger H. Hoos, Kevin Leyton-Brown, and Thomas Stützle. ParamILS: an automatic algorithm configuration framework. Journal of Artificial Intelligence Research, 36:267–306, 2009.
  • Hutter et al. [2011] Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Proceedings of the 5th Conference on Learning and Intelligent Optimization, LION, pages 507 – 523, 2011.
  • Jamil and Yang [2013] Momin Jamil and Xin-She Yang. A literature survey of benchmark functions for global optimization problems. International Journal of Mathematical Modeling and Numerical Optimisation, 4(2):150–194, 2013.
  • Jones et al. [1998] Donald R. Jones, Matthias Schonlau, and William J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13:455–492, 1998.
  • Kandasamy et al. [2017] Kirthevasan Kandasamy, Gautam Dasarathy, Jeff Schneider, and Barnabás Póczos. Multi-fidelity Bayesian optimisation with continuous approximations. In Proceedings of the 34th International Conference on Machine Learning, ICML, 2017.
  • Lacoste et al. [2014] Alexandre Lacoste, Hugo Larochelle, Mario Marchand, and François Laviolette. Agnostic Bayesian learning of ensembles. In Proceedings of the 31st International Conference on Machine Learning, ICML, 2014.
  • Letham et al. [2017] Benjamin Letham, Brian Karrer, Guilherme Ottoni, and Eytan Bakshy. Constrained bayesian optimization with noisy experiments. arXiv:0706.1234 [stats.ML], 2017.
  • Lindauer and Hutter [2018] M. Lindauer and F. Hutter. Warmstarting of model-based algorithm configuration. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, AAAI, 2018.
  • Poloczek et al. [2016] Matthias Poloczek, Jialei Wang, and Peter I. Frazier. Warm starting Bayesian optimization. In Winter Simulation Conference, WSC, 2016. arXiv:1608.03585.
  • Rasmussen and Williams [2006] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, Cambridge, Massachusetts, 2006.
  • Schilling et al. [2016] Nicolas Schilling, Martin Wistuba, and Lars Schmidt-Thieme. Scalable hyperparameter optimization with products of Gaussian process experts. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, ECML PKDD, 2016.
  • Shahriari et al. [2016] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando de Freitas. Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE, 104(1):148–175, Jan 2016.
  • Shilton et al. [2017] Alistair Shilton, Sunil Gupta, Santu Rana, and Svetha Venkatesh. Regret bounds for transfer learning in Bayesian optimisation. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS, 2017.
  • Snoek et al. [2012] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25, NIPS, 2012.
  • Snoek et al. [2015] Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Md. Mostofa Ali Patwary, Prabhat, and Ryan P. Adams. Scalable Bayesian optimization using deep neural networks. In Proceedings of the 32nd International Conference on Machine Learning, ICML, 2015.
  • Springenberg et al. [2016] Jost Tobias Springenberg, Aaron Klein, Stefan Falkner, and Frank Hutter. Bayesian optimization with robust bayesian neural networks. In Advances in Neural Information Processing Systems 29, NIPS, 2016.
  • Swersky et al. [2013] Kevin Swersky, Jasper Snoek, and Ryan P. Adams. Multi-task Bayesian optimization. In Advances in Neural Information Processing Systems 26, NIPS, 2013.
  • Tresp [2001] Volker Tresp. Mixtures of gaussian processes. In Advances in Neural Information Processing Systems 13, pages 654–660, 2001.
  • Wistuba et al. [2015] Martin Wistuba, Nicolas Schilling, and Lars Schmidt-Thieme. Learning hyperparameter optimization initializations. In Proceedings of the IEEE International Conference on Data Science and Advanced Analytics, DSAA, 2015.
  • Wistuba et al. [2016] Martin Wistuba, Nicolas Schilling, and Lars Schmidt-Thieme. Two-stage transfer surrogate model for automatic hyperparameter optimization. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, ECML PKDD, 2016.
  • Wistuba [2016] Martin Wistuba. TST-R implementation, 2016. https://github.com/wistuba/TST.
  • Yogatama and Mann [2014] Dani Yogatama and Gideon Mann. Efficient transfer learning method for automatic hyperparameter tuning. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics, AISTATS, 2014.
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
""
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
   
Add comment
Cancel
Loading ...
38236
This is a comment super asjknd jkasnjk adsnkj
Upvote
Downvote
""
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters
Submit
Cancel

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test
Test description