Regret Bounds for Noise-Free Bayesian Optimization
Bayesian optimization is a powerful method for non-convex black-box optimization in low-data regimes. However, establishing tight upper bounds on the regret of common algorithms in the noiseless setting remains largely an open question. In this paper, we establish new regret bounds, the tightest known to date, for two algorithms, namely GP-UCB and Thompson sampling, under the assumption that the objective function is smooth in the sense of having a bounded norm in a Matérn RKHS. Importantly, unlike several related works, we do not assume perfect knowledge of the kernel of the Gaussian process emulator used within the Bayesian optimization loop. This allows us to provide results for practical algorithms that sequentially estimate the Gaussian process kernel parameters from the available data.
Keywords: Gaussian processes, Thompson sampling, Optimistic optimization, Cumulative regret
Consider a sequential decision-making problem over a set of infinitely many actions. The goal is to design a learning policy that selects an action at each discrete time instance , and receives the corresponding real-valued reward . For a fixed reward function , let be the optimal action (assuming it is achievable and unique). The objective is to minimize regret defined as the cumulative loss compared to a setting where and are known a priori. Specifically, regret is defined as
where is the sequence of actions, is a time horizon, and the expectation is taken with respect to the possible randomness in .
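For reference, the cumulative regret described above can be written in the following standard form. The symbols are assumptions introduced here for illustration, since the notation did not survive in the surrounding text: $f$ denotes the reward function, $x_t$ the action selected at time $t$, $x^*$ the optimal action, and $T$ the time horizon.

```latex
R(T; f) = \mathbb{E}\left[ \sum_{t=1}^{T} \bigl( f(x^*) - f(x_t) \bigr) \right]
```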
Based on the previous observations , the decision maker can increase their knowledge of the reward process and make more informed decisions aimed at increasing their overall reward. This problem captures the fundamental exploration-exploitation trade-off in learning where the decision maker must balance between learning about the underlying distribution of the reward process (exploration) and capitalizing on this information in order to gain higher rewards (exploitation).
It is infeasible to learn the global maxima of a reward function over a very large action space without any assumption on the structure of . For instance, a convexity assumption on can be interpreted as a global structure of the reward function (Agarwal et al., 2011), while a Lipschitz assumption imposes a local structure on it (Bubeck et al., 2011). A popular alternative is to rely on the Gaussian Process (GP) framework, which introduces a spectrum of flexible and mild (local and global) assumptions on the reward function, by means of a kernel function. In this framework, using a statistical point of view, is assumed to be a sample of a GP with a given kernel. However, one may alternatively treat as a fixed function, and only assume that it has a bounded norm in the Reproducing Kernel Hilbert Space (RKHS) corresponding to a kernel. In this work, we focus on the latter setting, but we also further discuss the interpretation of our results under the former.
Bayesian optimization (BO) is based on such probabilistic models. It leverages the fact that Gaussian process conditioning provides versatile and numerically efficient statistical emulators for the objective function. Starting from a GP prior over the reward function, usually encoded with parametric mean and kernel functions, the available observations of the reward function are used to build the posterior distribution, which is in turn used to determine the learning policy, balancing exploration (high GP posterior variance) and exploitation (high GP posterior mean). We consider here two popular instances of BO: Gaussian Process-Upper Confidence Bound (GP-UCB) and Gaussian Process-Thompson Sampling (GP-TS). GP-UCB uses the GP posterior to assign an upper confidence bound to the reward of each action and selects the action with the highest bound value (Srinivas et al., 2010). Thompson Sampling (TS, also referred to as posterior sampling or probability matching (Russo et al., 2014)), simply draws from the posterior distribution of . Chowdhury and Gopalan (2017) showed that the application of TS to GP models offers the same order of regret as the one for GP-UCB (up to multiplicative logarithmic terms in time and polynomial terms in the dimension of the search space). In both cases, BO requires estimating the parameters of the mean and kernel functions. This is commonly done repeatedly, each time a new reward observation is added to the dataset, typically by maximizing the marginal likelihood with respect to those parameters.
In this work, we provide regret analysis for both GP-UCB and GP-TS, when belongs to an RKHS with a Matérn kernel with a given smoothness parameter (see details in Sec. 2). We further assume that can be evaluated for any exactly (i.e. without noise). This is classical in computer experiments (see for instance Sacks et al., 1989; Jones et al., 1998). Regarding the GP emulator, we also assume a Matérn kernel, with a possibly different smoothness parameter, and with unknown lengthscale and variance parameters that must be estimated. We show that our results hold for various kernel parameter estimators for the models used within the BO loop. On a dimensional search space, for GP-UCB, we show an , an and a constant upper bound on regret, when , and , respectively. We also show similar upper bounds on the regret of GP-TS which are slightly larger than the ones for GP-UCB (by a factor of that is due to the random sampling in GP-TS). Both policies require the knowledge of the smoothness parameter of the RKHS and an upper bound on the RKHS norm of . Our results (given in Table 1) are the tightest regret bounds on BO found so far under the noise-free setting to the best of our knowledge.
A review of basic concepts from the GP literature is given in Sec. 2, followed by the exact procedures of the BO methods. The theorems on regret bounds and their analysis are provided in Sections 3 and 4, respectively, where the main contributions are presented. The proofs of the auxiliary lemmas are given in the appendix.
1.2 Related work
Srinivas et al. (2010) proved an upper bound (up to factors) on the regret of GP-UCB under noisy observations where is an upper bound on the information gain that depends on the kernel function and the time horizon . Chowdhury and Gopalan (2017) showed that a similar order of regret (up to multiplicative and poly factors) holds for GP-TS. However, the upper bounds on regret provided by both works do not apply to the noise-free setting (specifically, the information gain becomes degenerate as the level of noise goes to zero; see Eq. (7) and the following paragraph in Srinivas et al. (2010)). Our approach takes a different path and leverages the tight concentration inequalities on GP regression provided in Teckentrup (2018) (formally given in Lemma 3) to establish novel regret bounds. For a review of such concentration results, see Kanagawa et al. (2018) and references therein.
Under the noise-free setting, de Freitas et al. (2012) established exponential bounds on the simple regret (expected loss in the reward at the end of a time horizon ) of a branch and bound algorithm that is related to GP-UCB, under the assumptions that the kernel and its hyper-parameters are fully known and that the reward function has a quadratic behaviour in the neighborhood of its optimum. In another related work, Grünewälder et al. (2010) proved an upper bound on the simple regret of a pure exploration strategy where the kernel is Hölder continuous and is the exponent of the Hölder condition. Pure exploration clearly results in linear cumulative regret in time. In comparison, our bounds on cumulative regret of GP-UCB and GP-TS with unknown hyper-parameters are given specifically in terms of the smoothness of the space of functions and dimension of the search space.
Among other BO methods, Vazquez and Bect (2010) proved the convergence of the Expected Improvement (EI) method under the noise-free setting. Bull (2011) built upon this result to show an upper bound on the simple regret of EI on GP models with a Matérn kernel. They further showed that the combination of EI with a well-calibrated greedy strategy results in near-optimal simple regret (which results in the same order of cumulative regret as in our results) for some particular estimates of the unknown hyper-parameters. In comparison, in GP-UCB and GP-TS studied here, exploration and exploitation interleave naturally with no need to impose additional exploration. Furthermore, our results hold for various common hyper-parameter estimates, such as the plug-in maximum likelihood estimate. Several other BO methods are used in practice but have far fewer established theoretical results. For surveys of BO methods, their implementation and applications, see Shahriari et al. (2016); Frazier (2018), and references therein.
Both UCB and TS approaches to optimization were introduced much earlier for the class of multi-armed bandit problems, on which there is a large body of work (see Auer et al., 2002; Russo and Van Roy, 2016; Kaufmann et al., 2012; Vakili and Zhao, 2013, and references therein). In this context, the action space is typically discrete and finite (Auer et al., 2002), or the reward function is assumed to be convex (Agarwal et al., 2011) or Lipschitz (Bubeck et al., 2011). Such settings require different techniques compared to the approach we develop here for a modelling paradigm based on GPs.
2 Preliminaries and Bayesian Optimization
In order to specify our working hypothesis on the regularity of the reward function and our main theorems, we first recall here some of the main concepts from the GP literature.
2.1 Gaussian Processes, RKHS and Kernel Functions
A GP is a collection of (possibly infinitely many) random variables, every finite subset of which follows a multivariate Gaussian distribution. The distribution of a GP can be specified by its mean function (without loss of generality, the prior mean is typically assumed to be zero) and a positive definite kernel (or covariance function). Classic kernel functions usually depend on a set of hyper-parameters, such as variance and lengthscales (Rasmussen and Williams, 2006), that we will denote by .
A concept closely related to GPs is reproducing kernel Hilbert spaces (RKHSs). Given a kernel , the associated RKHS is defined as the Hilbert space of functions on equipped with an inner product such that , for all , and (reproducing property), for all and . An RKHS is completely specified by its kernel function and vice-versa. The RKHS norm can be interpreted as a measure of the complexity of . We focus on the popular class of Matérn kernels (Stein, 2012), defined as
with hyper-parameters (referred to as marginal variance and lengthscale, respectively), where denotes the Gamma function, denotes the modified Bessel function of the second kind and is a parameter governing the regularity of the samples. Important special cases of include that corresponds to the exponential kernel and that corresponds to the radial basis kernel. When , , the Matérn kernel can be expressed as a product of an exponential and a polynomial of order . A GP with Matérn kernel is times differentiable in the mean-square sense. For any given , the RKHS with Matérn kernel is equal to the Sobolev space on of order denoted by (). Furthermore, their norms are equivalent: there exists such that
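As an illustration of the half-integer case mentioned above, where the Matérn kernel reduces to an exponential times a polynomial, the following is a minimal sketch of the well-known closed forms for smoothness 1/2, 3/2 and 5/2. The function name `matern` and its argument names are hypothetical choices made for this example, not notation from the paper.

```python
import math


def matern(r, nu, variance=1.0, lengthscale=1.0):
    """Matérn kernel value at distance r for nu in {0.5, 1.5, 2.5}.

    Uses the standard closed forms: an exponential multiplied by a
    polynomial of order nu - 1/2 in the scaled distance.
    """
    s = math.sqrt(2.0 * nu) * abs(r) / lengthscale  # scaled distance
    if nu == 0.5:
        poly = 1.0                      # exponential kernel
    elif nu == 1.5:
        poly = 1.0 + s
    elif nu == 2.5:
        poly = 1.0 + s + s * s / 3.0
    else:
        raise ValueError("this sketch only handles nu in {0.5, 1.5, 2.5}")
    return variance * poly * math.exp(-s)
```

At zero distance the kernel equals the marginal variance, and for a fixed distance the value decays as the lengthscale shrinks; general non-half-integer smoothness would require the modified Bessel function and is not covered by this sketch.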
It was shown in Kanagawa et al. (2018) that the samples from a GP and the elements of an RKHS are very closely related. In particular, for the case of Matérn kernels, the samples of GP belong to a slightly larger space of functions that is a Sobolev Hilbert space of order . We formalize this observation in the following lemma.
Lemma 1 (Kanagawa et al. (2018)).
Let be a bounded open set such that the boundary is Lipschitz and an interior cone condition is satisfied
We leverage this lemma in Section 3 to generalise some results where is assumed to be an element of an RKHS to alternative settings where is a sample of a GP.
2.2 Bayesian optimization
Conditioning GPs on available observations provides us with powerful non-parametric Bayesian models over the space of functions. Consider a set of observations , where , with and for all . At time , conditioned on the set of past observations , the posterior of is a GP distribution with mean function and kernel function specified as follows:
where and is the by positive definite covariance matrix . The posterior variance of given the data is obtained by setting in the conditional covariance expression: .
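The noise-free conditioning step can be sketched in a few lines. The example below is an illustrative one-dimensional implementation of the standard GP posterior mean and standard deviation with exact observations; the Matérn-3/2 kernel, the fixed lengthscale, and the names `kernel`, `solve` and `posterior` are assumptions made for the sketch.

```python
import math


def kernel(x, y, lengthscale=0.5):
    """Matérn-3/2 kernel with unit variance (an assumed choice)."""
    s = math.sqrt(3.0) * abs(x - y) / lengthscale
    return (1.0 + s) * math.exp(-s)


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - tail) / M[i][i]
    return z


def posterior(x, X, y):
    """Posterior mean and standard deviation at x given exact data (X, y)."""
    K = [[kernel(a, b) for b in X] for a in X]
    k_x = [kernel(x, a) for a in X]
    w = solve(K, k_x)  # w = K^{-1} k(x); K is symmetric
    mean = sum(wi * yi for wi, yi in zip(w, y))
    var = kernel(x, x) - sum(wi * ki for wi, ki in zip(w, k_x))
    return mean, math.sqrt(max(var, 0.0))  # clip rounding noise
```

Because there is no observation noise, the posterior mean interpolates the data and the posterior standard deviation vanishes at observed points, which is the property the later analysis relies on.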
In a practical scenario, the parameters of the kernel corresponding to the true are unknown. A classical solution is to adopt a plug-in approach (also known as an empirical Bayes approach), in which estimates of the hyper-parameters are computed and then plugged into the predictive equations (2.2.1). For example, one might use the Maximum A Posteriori (MAP) estimate, when an informative prior distribution is available on , or the Maximum Likelihood Estimate (MLE), ( is chosen uniformly from its support for the case ). The distribution of is a GP with mean and kernel functions given in (2.2.1) with parameter .
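To make the plug-in idea concrete, here is a minimal sketch of a maximum likelihood estimate of a single lengthscale, obtained by maximizing the noise-free log marginal likelihood over a finite grid. Everything specific in it is an assumption for illustration: the Matérn-3/2 kernel, the unit variance (which is held fixed rather than estimated), the small diagonal jitter, and the names `log_marginal_likelihood` and `mle_lengthscale`. Restricting the search to a finite grid is one simple way of obtaining an estimate with bounded support.

```python
import math


def matern32(r, lengthscale):
    """Unit-variance Matérn-3/2 kernel at distance r (assumed choice)."""
    s = math.sqrt(3.0) * r / lengthscale
    return (1.0 + s) * math.exp(-s)


def log_marginal_likelihood(X, y, lengthscale, jitter=1e-10):
    """Log density of y under a zero-mean GP prior with exact observations."""
    n = len(X)
    K = [[matern32(abs(a - b), lengthscale) for b in X] for a in X]
    # Cholesky factor L of K, with a tiny jitter for numerical safety.
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = K[i][j] + (jitter if i == j else 0.0)
            s -= sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    # Forward substitution: z = L^{-1} y, so y^T K^{-1} y = z^T z.
    z = []
    for i in range(n):
        z.append((y[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    quad = sum(v * v for v in z)
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    return -0.5 * quad - 0.5 * log_det - 0.5 * n * math.log(2.0 * math.pi)


def mle_lengthscale(X, y, grid):
    """Grid-search MLE: the candidate with the largest marginal likelihood."""
    return max(grid, key=lambda ell: log_marginal_likelihood(X, y, ell))
```

In a BO loop this estimate would be recomputed each time a new observation is added; a gradient-based optimizer over a bounded interval would be the more common practical choice.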
It has, however, been shown that for some kernels (including Matérn) or may not converge to the true value as the number of observations grows to infinity. In general, it is thus not always possible to estimate the true hyper-parameters (Teckentrup, 2018).
The Bayesian Optimization Loop
In BO, the sequence of actions is obtained according to an acquisition rule, based on the GP posterior distribution, that varies from one algorithm to another. Every time a new action is selected and evaluated, the GP posterior is updated, which impacts the acquisition criteria over the actions.
GP-UCB relies on an optimistic estimate of the objective function. Under the Gaussian assumption, this estimate is simply
where is a constant (in our case, an upper bound on ). The GP-UCB acquisition rule is to select the UCB maximizer, i.e.
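A minimal sketch of this acquisition rule, under the simplifying assumption that the maximization is carried out over a finite list of candidate actions rather than the continuous domain. The function name `ucb_select` and the argument `B` (standing for the constant multiplying the posterior standard deviation) are hypothetical names chosen for the example.

```python
def ucb_select(candidates, mean, std, B):
    """Return the candidate maximizing mean + B * std, and its UCB score.

    `mean` and `std` hold the GP posterior mean and standard deviation
    at each candidate, in the same order as `candidates`.
    """
    scores = [m + B * s for m, s in zip(mean, std)]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best], scores[best]
```

With `B` set to zero the rule reduces to pure exploitation of the posterior mean; larger values of `B` favour actions with higher posterior uncertainty.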
GP-TS proceeds as follows. At each time , a sample is drawn from . Then is selected as the maximizer of the sample
The variance of the sample is increased by a factor of which guarantees the required level of exploration for convergence to the global optimum, as will become clear later in the analysis. In practice, for non-degenerate kernels such as those of the Matérn class, GP samples can only be drawn over a finite set of locations, which makes the above optimization over a continuous domain infeasible. A typical solution is to introduce at each time step a fine discretization of , then sample jointly for all and finally choose within by exhaustive search (see for instance Kandasamy et al., 2018; Chowdhury and Gopalan, 2017).
In this paper, we follow the setup of Chowdhury and Gopalan (2017); Srinivas et al. (2010) and assume that is chosen such that for all , where is the closest point in to . It has been shown (see previous references) that such a exists and its size satisfies where c is a constant independent of and .
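The discretized sampling step can be sketched as follows. This is an illustrative implementation under stated assumptions: the posterior mean vector and covariance matrix over the discretization are taken as given inputs, `scale` is a hypothetical name for the factor by which the posterior standard deviation is inflated, and a small jitter is added so that the Cholesky factorization does not fail on the near-singular covariances that arise without observation noise.

```python
import math
import random


def cholesky(A, jitter=1e-10):
    """Lower-triangular factor of A + jitter * I for symmetric PSD A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] + (jitter if i == j else 0.0)
            s -= sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(max(s, jitter))  # guard against rounding
            else:
                L[i][j] = s / L[j][j]
    return L


def thompson_select(candidates, mean, cov, scale, rng):
    """Draw one joint sample on the discretization and return its maximizer."""
    n = len(candidates)
    L = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # sample = mean + scale * L z, i.e. covariance inflated by scale**2
    sample = [
        mean[i] + scale * sum(L[i][k] * z[k] for k in range(i + 1))
        for i in range(n)
    ]
    best = max(range(n), key=lambda i: sample[i])
    return candidates[best]
```

Passing an explicit `random.Random` instance keeps the draw reproducible; the exhaustive search over the sample is what makes a finite discretization necessary.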
Both GP-TS and GP-UCB require an estimate of the kernel hyper-parameters at each time . MLE and MAP estimates are the most common. Our regret bounds hold for all hyper-parameter estimates with bounded support, . More details are provided in Sec. 3. We emphasize that both GP-UCB and GP-TS require knowledge of an upper bound on the RKHS norm of the reward function. The policies are presented in pseudo-code in Algorithms 1 and 2. The acquisition steps are illustrated in Figure 1.
3 Regret Bounds on Bayesian Optimization with Unknown Hyper-Parameters
In this section, we provide theoretical results on the regret of GP-UCB and GP-TS. While the estimates of hyper-parameters do not converge in general, we show that with a well calibrated exploration the predictive power of the Bayesian model is sufficiently good to provide us with sublinear regret orders. Hence, our regret bounds hold for all GP-UCB and GP-TS methods regardless of their hyper-parameter estimates as long as some mild conditions are satisfied.
In the BO problem of a given function with a time horizon , a fixed BO policy (that is either GP-UCB or GP-TS here) generates a possibly random sequence of actions. The sequence is determined by , which correspond to a sequence of (possibly random) mappings from the history of observations to a new action at each time : . In the statement of the theorems, we add the specification of the policy to the notation of regret given in (1).
Theorem 1.
Consider the Bayesian optimization problem of a function using a GP model emulator equipped with a Matérn kernel with smoothness parameter . Assume the search domain is the -dimensional ball centered at the origin with diameter . The regret of GP-UCB with parameter satisfies
Theorem 2.
Under the same assumptions as in Theorem 1, the regret of GP-TS with parameter satisfies
In addition to an upper bound on the RKHS norm of , both GP-UCB and GP-TS require some knowledge of , the smoothness parameter of the reproducing kernel of . In particular, when is known, both policies choose to obtain the lowest regret growth rate. Otherwise, the best known lower bound on is chosen as the parameter of the GP emulator.
Consider now the Bayesian setting where is assumed to be a sample from a GP equipped with a Matérn kernel . By Lemma 1, belongs to with probability . This observation, together with Theorems 1 and 2, yields the following corollary.
Corollary 1.
Consider the Bayesian optimization problem using a GP model emulator equipped with a Matérn kernel with smoothness parameter where is a sample from such that and the search domain is the -dimensional ball centered at the origin with diameter . With probability , the regrets of GP-UCB and GP-TS satisfy
More details on the exact bounds are given in Sec. 4 and the appendix. In this compact form, the effect of the hyper-parameter estimates on regret is not visible. However, the constant in Lemma 4, which appears directly as a multiplicative constant in Lemma 5 (in the appendix) and in the regret bounds, depends on the hyper-parameter estimate . As detailed in the appendix, depends on the minimum and the maximum ratios of the norm of and of the corresponding Sobolev space defined in (3). Thus, our analysis works for all hyper-parameter estimates for which this term is bounded above, which includes all hyper-parameter estimates with bounded support. However, an exact characterization of the more intricate dependency between the parameter estimation and regret is beyond the scope of the results presented here.
4.1 Auxiliary Lemmas and Definitions
We prove a lemma on the distance between and the predictive mean in terms of the predictive standard deviation. We then present an important result on the convergence of the predictive standard deviation. Alternatively to Eq. 2.2.1, the predictive mean and the predictive standard deviation of a GP model, given a history of observations , can be written as
See Kanagawa et al. (2018) for the proof. Hence, the predictive mean is the smoothest function in (for any kernel ) interpolating the observations . The predictive standard deviation is the largest possible difference between a function whose RKHS norm is not larger than and that is the predictive mean conditioned on observations
Lemma 2.
For any function , let and be the predictive mean and standard deviation of a GP emulator equipped with the kernel . Assume . The following inequality holds for all :
See appendix. ∎
Next, we introduce two lemmas which provide upper bounds on based on the distance between the actions. Given a set of points , we define the fill distance as the largest distance from any point in to the points in , that is:
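The fill distance can be approximated numerically by replacing the supremum over the domain with a maximum over a dense finite sample of it. The sketch below does exactly that; the name `fill_distance` and the use of a finite sample `domain_samples` in place of the full domain are assumptions made for illustration, so the returned value is only a lower approximation of the true fill distance.

```python
import math


def fill_distance(points, domain_samples):
    """Largest distance from any sampled domain location to its nearest point.

    `points` and `domain_samples` are sequences of equal-length coordinate
    tuples; math.dist computes the Euclidean distance between two of them.
    """
    return max(
        min(math.dist(x, p) for p in points)
        for x in domain_samples
    )
```

For instance, on the unit interval with observations at both endpoints the fill distance is one half, and adding the midpoint halves it; this shrinking quantity is what drives the bounds on the predictive standard deviation.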
Lemma 3.
For a function , let be the predictive standard deviation of a GP model emulator equipped with a Matérn kernel and , conditioned on observations at points . We have
where is a constant independent of , and , and and are the minimum and the maximum ratio between the norm of and that of , , as given in (3).
Now, we can use Lemma 3 to prove that is sufficiently small when the value of is observed at a sufficiently close point to :
Lemma 4.
Under the setting of Lemma 3, we have
where is a constant independent of .
See appendix. ∎
4.2 Proof of Theorem 1
Lemma 2 shows that is an upper bound on . Let us similarly define a lower bound on . We thus have
At step of GP-UCB, is selected as the maximizer of . Thus,
from which it is straightforward to conclude the following upper bound on regret:
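The chain of inequalities behind this step can be spelled out as follows. The notation is assumed for illustration: $\mu_{t-1}$ and $\sigma_{t-1}$ denote the predictive mean and standard deviation before the action $x_t$ is chosen, and $B$ is the constant multiplying the standard deviation in the confidence bounds. The first inequality uses the upper bound at $x^*$ and the lower bound at $x_t$; the second uses the fact that $x_t$ maximizes the upper confidence bound.

```latex
\begin{align*}
f(x^*) - f(x_t)
  &\le \bigl(\mu_{t-1}(x^*) + B\,\sigma_{t-1}(x^*)\bigr)
     - \bigl(\mu_{t-1}(x_t) - B\,\sigma_{t-1}(x_t)\bigr) \\
  &\le \bigl(\mu_{t-1}(x_t) + B\,\sigma_{t-1}(x_t)\bigr)
     - \bigl(\mu_{t-1}(x_t) - B\,\sigma_{t-1}(x_t)\bigr) \\
  &= 2B\,\sigma_{t-1}(x_t),
\end{align*}
\qquad\text{so that}\qquad
\sum_{t=1}^{T} \bigl( f(x^*) - f(x_t) \bigr) \le 2B \sum_{t=1}^{T} \sigma_{t-1}(x_t).
```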
Leveraging now Lemma 4, and noting that (the marginal standard deviation of the Matérn kernel), we obtain the following upper bound on regret:
Lemma 5.
For any sequence of points ,
See appendix. ∎
In summary, to prove Theorem 1, we first showed that the regret of GP-UCB is upper bounded by the cumulative value of the predictive standard deviation at the sequence of actions (up to constant factors) as a measure of uncertainty. We then used Lemma 4 to establish a geometric upper bound on based on the distance between the location of and the previous actions. The exponent of the diminishing distance and the dimension of the space determine the upper bound on given in Lemma 5 which is proven by, first, finding an upper bound on the minimum distance between two actions (that depends on the number of actions and the dimension of the space); then, showing a recursive relation in the values of the sum (based on number of the actions).
4.3 Proof of Theorem 2
The analysis consists of two parts. In the first part, mainly following the regret analysis of GP-TS under the noisy case given in Chowdhury and Gopalan (2017), we show that the regret grows by the sum of standard deviations at the selected actions. In the second part, we use our results given in Lemmas 4 and 5 to establish a new upper bound on the regret of GP-TS under the noise-free setting.
In GP-TS, at each time a sample is drawn from and is selected where is a discretization of satisfying . The following lemma establishes similar results as in Lemma 2 for , but in a high probability setting.
Lemma 6.
Define . For the sample at time , define the event . This event holds with high probability:
See appendix. ∎
In order to establish the relation between regret at time and uncertainty in the selected action , we partition the actions into two sets based on whether the difference in and is smaller or larger than a term. Define , .
Lemma 7.
In GP-TS, the following inequality holds for :
See appendix. ∎
Leveraging this lemma, we upper bound the expected loss at each time with a term depending on . Conditioned on being true, we have
where (18) comes from discretization, the definition of , and Lemma 2; (18) is a result of condition ; and (18) holds by the way of selecting in GP-TS. By taking the expectation with respect to the randomness in , when , we have
Summing both sides of (20) over for , we have
Leveraging recent results on the concentration of GP regression, we provided new regret bounds on the performance of GP-UCB and GP-TS. To cope with the noise-free setting, the approach we took differs from the existing ones based on information gain. To the best of our knowledge, the bounds we obtained are the tightest general regret bounds for BO under the noise-free setting. Furthermore, they do not require knowledge of the hyper-parameters of the kernel, which significantly broadens their applicability.
Appendix A Proof of Lemmas
Proof of Lemma 2.
Let . We have ; thus, by characterization of in (9), we have
Multiplying both sides by , we get
Proof of Lemma 3.
As a direct result of Theorem 3.4 in Teckentrup (2018), we have
where is the Sobolev Hilbert space of order on . As mentioned in Sec. 3.3 of Teckentrup (2018), we can further obtain the point-wise convergence of to , using the Sobolev embedding theorem (Brezis, 2011),
Finally, the characterization of in (9) implies
Proof of Lemma 4.
The proof follows from applying Lemma 3 to a different space. For , let be the closest point to : . Define , the dimensional hyper-ball centered at with radius . Let . The fill distance of the points in satisfies:
Define and . Let be the predictive standard deviation conditioned on observations . Applying Lemma 3 to , we have
The constants are distinguished from those in Lemma 3 as the space is different. Choose the constant as an upper bound on over all and over all hyper-balls with radius not bigger than . The existence of such a constant is a result of the equivalence of the RKHS of the Matérn kernel and the corresponding Sobolev space (see Teckentrup (2018)).
The lemma is proven by noticing that since extra observations available in evaluating can only decrease the uncertainty (provided that the same hyper-parameters are used). Specifically, this can be seen from the update rule given in (2.2.1) and the fact that the covariance matrix is positive definite. A more formal proof can be found in Chevalier et al. (2014), for instance. ∎
Proof of Lemma 5.
For simplicity of notation, for , define
and let be the maximum possible value of the sum.
We first show that in every set of points, there are at least two points whose distance is not greater than . Let be the smallest distance between two points. The dimensional balls centered at for , with radius , denoted by , are all mutually disjoint. The ball contains all ; thus, its dimensional volume is larger than the sum of the dimensional volumes of