1 Introduction

## Abstract

We consider the problem of optimising functions in the reproducing kernel Hilbert space (RKHS) of a Matérn kernel with smoothness parameter over the domain under noisy bandit feedback. Our contribution, the -GP-UCB algorithm, is the first practical approach with guaranteed sublinear regret for all and . Empirical validation suggests better performance and drastically improved computational scalablity compared with its predecessor, Improved GP-UCB.

## 1 Introduction

We consider the black-box function optimisation problem using the stochastic bandit formalism of sequential decision making (Robbins, 1952). Under this model, we consider an agent that sequentially selects an action from an action set at each time step for and observes

 y(xt)=f(xt)+ϵt,

where is the reward function, is the expected reward function and we assume is conditionally subGaussian given . The agent’s goal is to minimise regret, given by

 RT=T∑t=1f(x⋆)−f(xt),

where is the reward associated with an optimal arm. Regret is closely linked to bounds on the convergence of black-box optimisation in the presence of noise: bounding by a quantity sublinear in implies as , and so an algorithm achieving such a bound will converge to a subset of the global optima of .

A standard approach to the bandit problem is to construct a confidence upper bound for , of the form

 μt−1(x)+βt(δ)σt−1(x),

where and are the mean and standard deviation predictors for given by a suitable estimator based on observations prior to time , and is a confidence width multiplier that ensures the expression is an upper bound on with probability at least . Algorithms then select that maximises this bound (Auer, 2002; Auer et al., 2002; Auer and Ortner, 2010). This upper confidence bound (UCB) strategy naturally balances exploration, sampling in regions where the uncertainty is large, with exploitation, focusing on regions where the mean is large, and leads to algorithms with minimax optimal regret bounds for the case where is finite (Audibert and Bubeck, 2009).

From the perspective of black-box function optimisation, a particularly interesting bandit problem is the kernelised continuum-armed bandit (Srinivas et al., 2010). Here,  is assumed to be in the closure of functions on expressible as a linear combination of a feature embedding parameterised by a kernel . The properties of the functions in the resulting space, referred to as the RKHS of , are determined by the choice of the kernel. For example, the RKHS corresponding to the linear kernel contains linear functions, and in this case existing kernelised bandit algorithms recover bounds that match those of the relevant stochastic linear bandit algorithms (Chowdhury and Gopalan, 2017; Abbasi-Yadkori et al., 2011). For a squared exponential kernel, the corresponding RKHS contains only infinitely differentiable functions, and here the existing methods match known lower bounds up to polylogarithmic factors (Srinivas et al., 2010; Scarlett et al., 2017).

In this work, we focus on the RKHS associated with a Matérn kernel, parameterised by a smoothness parameter (Stein, 2012). For a given , the Matérn RKHS contains all -times continuously differentiable functions, and therefore for any , contains as a strict subset the RKHS of both the linear and the squared exponential kernels. The Matérn RKHS is of particular practical significance, since it offers a more suitable set of assumptions for the modelling and optimisation of physical quantities (Stein, 2012; Rasmussen and Williams, 2006). However, the theoretical guarantees offered for this class of functions by existing kernelised algorithms, such as KernelUCB (Valko et al., 2013), GP-UCB (Srinivas et al., 2010) and Improved GP-UCB (Chowdhury and Gopalan, 2017), are limited. Specifically, these guarantee that the regret after steps, , is bounded with high probability as1

leaving a large gap to the algorithm-agnostic lower bound for this problem (Scarlett et al., 2017). Since this bound is linear for , existing practical kernel-based algorithms are not guaranteed to converge for functions with fewer than derivatives.

Our main contribution is an algorithm that successfully tackles the Matérn RKHS bandit problem for all and all . The algorithm, Partitioned Improved GP-UCB (-GP-UCB), offers a high probability regret bound of order

 RT=˜O(Td(2d+3)+2νd(2d+4)+4ν),

and therefore guarantees convergence for once differentiable functions of any finite dimensionality. Our contribution is not limited to theory: -GP-UCB shows strong empirical performance on 3-dimensional functions in the Matérn RKHS (in contrast, Improved GP-UCB barely outperforms uniformly sampling of arms). Moreover, our experiments show that despite using Gaussian process regression to construct confidence intervals, -GP-UCB achieves an empirically near-linear runtime.

Our analysis also results in tighter bounds on the effective dimension associated with the Matérn kernel RKHS, an important quantity in the context of kernelised bandit problems which immediately improves bounds for a range of existing algorithms.

## 2 Background

Our work builds on Improved GP-UCB (IGP-UCB), and like IGP-UCB, it uses Gaussian process regression to construct confidence intervals. We briefly outline these topics and introduce the required notation.

#### Gaussian process regression

We use to denote a set of observations with , and

 DAt={(xAi,yAi)∈DXt:xAi∈A}

for to denote the subset of these located in . We denote the sequence of input locations within by and by the associated sequence of observations.

For of cardinality , , a kernel , assumed to be normalised such that , we define

 kAt(x)=[k(xA1,x),…,k(xAN,x)]T,

as well as

 KAt=[k(x,x′)]x,x′∈XAt  and  yA1:t=[yA1,…,yAN]T.

For a regularisation parameter , we define the Gaussian process regressor on by a mean,

 μAt(x)=kAt(x)T(KAt+αI)−1yA1:t,

and an associated predictive standard deviation,

 σAt(x)=√k(x,x)−kAt(x)T(KAt+αI)−1kAt(x),

for each . Note that is monotone decreasing in both and , meaning if and (Vivarelli, 1998).

#### Effective dimension

While Gaussian process regressors are often defined through an infinite dimensional feature embedding, due to finite data and regularisation, the number of features that have a noticeable impact on the regression model can be small. The effective dimension of the Gaussian process regressor on the set , defined as

 ˜dA=Tr(KAT(KAT+αI)−1)=α−1T∑t=1(σAT(xt))2,

provides an estimate of the number of relevant features used in the regression problem (Zhang, 2005; Valko et al., 2013), and frequently appears in the bounds on kernelised bandit algorithms. It is closely related to information gain, defined

 γAt=12log|I+α−1KAt|

for , with the two of the same order up to polylogarithmic factors (Calandriello et al., 2019, proposition 5). We shall use this relationship throughout.

#### Matérn kernel family

The Matérn family consists of kernels of the form where and is the Fourier transform of a Student’s t-density,

 λ(ω)=Cℓ,d(1+(ℓ∥ω∥2)2)−ν−d/2, (1)

for the -dimensional frequency vector and

 Cℓ,d=ℓdΓ(ν+d/2)πd/2Γ(ν)

for parameters of the kernel.

#### The Improved GP-UCB algorithm

The IGP-UCB algorithm is an approach to the kernelised bandit problem based on the classic GP-UCB Bayesian optimisation algorithm. For in the RKHS of a known kernel, it improves on the regret bounds of GP-UCB by a polylogarithmic factor and significantly improves empirical performance. For readers familiar with stochastic linear bandits, the step from GP-UCB to IGP-UCB mirrors that from ConfidenceBall (Dani et al., 2008) to OFUL (Abbasi-Yadkori et al., 2011).

IGP-UCB assumes a known bound on the RKHS norm of the function and a known bound on the subGaussianity constant of . Then, for a chosen and , IGP-UCB uses a Gaussian process regressor mean and standard deviation , along with a confidence width multiplier

 βZt(δ)=B+L√2(γZt+1+log(1/δ)), (2)

to construct probability confidence intervals for the restriction of to . Chowdhury and Gopalan (2017) choose to use , that is, consider the whole domain, and show that selecting observations that maximise this upper confidence bound leads to regret bounded as with probability .

Intuitively, the IGP-UCB algorithm needs strong regularity assumptions to guarantee sublinear regret because the information gain term, , contained within the confidence width parameter , grows too quickly with otherwise. Specifically, in the case of the Matérn kernel, can be lower and upper bounded as

 ˜O(Td(d+1)d(d+1)+2ν)andΩ(Tdd+2ν) (3)

respectively, where the former is given in Srinivas et al. (2010) and the latter can be deduced from combining the bounds in Valko et al. (2013) and Scarlett et al. (2017). Our main contribution can be understood as addressing this issue. We provide a construction which leads to confidence intervals that grow only polylogarithmically with .

## 3 Main contribution

We introduce Partitioned Improved GP-UCB (-GP-UCB), an algorithm that at each time step constructs a closed cover of the domain and selects points by taking a maximiser of the IGP-UCB upper confidence bound constructed independently on each cover element. Throughout the paper, we will make use of the following two constants,

 b=d+1d+2νandq=d(d+1)d(d+2)+2ν,

which depend on the RKHS only. Also, for a hypercube we will use denote its -diameter, i.e. the length of any of its sides.

#### The π-GP-UCB algorithm

Choose a confidence level. Let be any set of closed hypercubes overlapping at edges only, of cardinality at most , that covers the domain .2 At each time step select a query location and then construct a new cover as follows:

Point selection. Fit an independent Gaussian process with on each cover element , conditioned only on data within , and select the next point to evaluate by maximising

 UCBt(x)=maxA∈At:x∈AμAt−1(x)+ˆβAtσAt−1(x),

where , with .

Splitting rule. Split any element for which along the middle of each side, resulting in new hypercubes. Let be the set of the newly created hypercubes and the elements of which were not split.

#### Properties of algorithm

The construction of the cover ensures the following two properties hold:

###### Lemma 1.

Let be a subset of and suppose there exists a such that . Let . Then, for some , .

###### Lemma 2.

Let be the covering set at time . Suppose . Then, for sufficiently large,

 |∪t≤TAt|≤Cd,νTq,

where depends on and only.

That is, on all cover elements, information gain can be bounded polylogarithmically for all , and the cardinality of all the covering sets generated up to time is sublinear in . We will show that the regret of -GP-UCB after steps is bounded by

 RT=O(γAT√T|∪t≤TAt|)=˜O(√T|∪t≤TAt|)=o(T).

The formal statement of this result is:

###### Theorem 3.

Let , let be the RKHS of a Matérn kernel with parameter such that . Suppose satisfies for a known and we observe , where is sub-Gaussian. Then for any fixed , with probability at least the regret incurred by -GP-UCB is bounded by .

The properties of the cover also allows us to improve upon existing upper bounds for the information gain associated with a Matérn kernel. To do this, we bound the information gain as the sum of the information gain on each cover element , and therefore

 γXT=O(γAT|∪t≤TAt|)=˜O(|∪t≤TAt|),

which translates into the following bound:

###### Theorem 4.

Information gain associated with the Matérn kernel with parameter after steps can be bounded as .

This is a strict improvement over the bound presented in Srinivas et al. (2010), given here in equation 3.

#### Practical considerations

While the main purpose of the cover construction within -GP-UCB is to provide strong theoretical guarantees on regret, it also results in a significantly more scalable algorithm.

To see this, first note that fitting a Gaussian process requires updating the inverse of a kernel matrix . Our cover construction means that inverses are only performed on a kernel matrix of a subset of the data, reducing both compute and memory costs. Empirically, this allows -GP-UCB to scale much more favourably than GP-UCB. However, the construction offers no asymptotic improvement on computational complexity: once the algorithm comes close to convergence (enters the asymptotic regime), points are expected to land within small neighbourhoods of the optima, and therefore likely within the same cover element.

Moreover, running Gaussian process UCB algorithms requires maximising the UCB index across the domain. Whereas the classic GP-UCB algorithms require the maximiser to be recomputed across all of the domain, under -GP-UCB, the maximisation procedure need only be carried out over the cover elements containing the most recent observation and any newly created cover elements.

## 4 Proof of results

In order to prove theorem 3, we will use the following concentration inequality:

###### Lemma 5.

Given for all , for all , for all , we have

 |μAt−1(x)−f(x)|≤ˆβAtσAt−1(x),

with probability .

To prove creftype 5, we will use the following result, proven in section 4.3, which bounds the number of all cover elements that could ever be created within steps of running the algorithm by .

###### Lemma 6.

There exists a set such that and with probability one.

We present the full proof of creftype 5 in appendix A. Here, we prove a weaker result that admits a much shorter proof.

By theorem 2 in Chowdhury and Gopalan (2017), under the conditions of our theorem 3,

 |μAt−1(x)−f(x)|≤βAt(δ)σAt−1(x)

with probability for any compact. The weaker result then follows by taking a union bound over all , resulting in a confidence width multiplier of . The full proof in the appendix allows us to use instead.

To prove theorem 3 we will also need the following bound relating the sum of predictive variances on a subset of the domain to information gain:

###### Lemma 7.

For any , sequence , set and a Gaussian process estimator with , .

###### Proof.

Since for all , we have that , because for any , . Summing over , we have

 τ∑t=1 1{xt∈A}(σAt−1(xt))2 ≤τ∑t=1,xt∈A2αlog(1+α−1(σAt−1(xt))2)=4αγAτ,

where the final equality is by Lemma 5.3 in Srinivas et al. (2010). ∎

###### Proof of theorem 3.

By Cauchy-Schwarz, . It therefore suffices to bound .

For each let be an element of such that

 At(x)∈\operatornamewithlimitsargmaxA∈At:x∈AμAt−1(x)+ˆβAtσAt−1(x).

That is, is an element of on which the upper confidence bound associated with is the highest.

For any , we have that due to the manner in which points are selected. Expanding this expression and applying creftype 5, we bound as

 rt=f(x⋆)−f(xt)≤2ˆβAt(xt)tσAt(xt)t−1(xt)

with probability for all .

Denote , the set of all cover elements created until time , and define the initial time for an element , as and the terminal time as . We have

 T∑t=1r2t ≤4T∑t=1(ˆβAt(xt)t)2(σAt(xt)t−1(xt))2 ≤4T∑t=1∑A∈At1{xt∈A}(ˆβAt)2(σAt−1(xt))2 =4∑A∈˜ATτ′(A)∑t=τ(A)1{xt∈A}(ˆβAt)2(σAt−1(xt))2 ≤48∑A∈˜AT(ˆβAτ′(A))2γAτ′(A) (∗)

The final inequality uses monotonicity of , creftype 7 and . By creftype 2, the number of summands in is . As for all , creftype 1 completes the proof. ∎

###### Proof of theorem 4.

From proposition 5 in Calandriello et al. (2019),

 ˜dA≤log|I+α−1KAT| ≤˜dA(1+log(α−1∥KAT∥2+1)).

Noting that , and taking we have,

 ˜dX≤2γXT≤˜dX(1+log(α−1T+1)).

We now proceed to bound . At each time step choose to be as in the closed cover of from -GP-UCB. Let , as before.

For any , let . Then, because is in at least one of the for all and by monotonicity of predictive variance, we have

 ˜dX≤1αT∑t=1∑A∈At1{xt∈A}(σAτ′(A)−1(xt))2.

Interchanging the order of summation and applying creftype 7, we have that this is upper bounded by . Then, by creftype 1, we have that

 ˜dX≤4C|˜AT|logTloglogT.

By creftype 2, . Therefore

 γXT≤C′TqlogT(loglogT)(1+log(α−1T+1)),

for some independent of . ∎

### 4.1 Proof of creftype 1, information gain on a cover element

To prove creftype 1, we rely on the following theorem from Srinivas et al. (2010). It provides a bound on the maximum information gain after samples, , in terms of the operator spectrum of the kernel with respect to a uniform covariate distribution.3

###### Theorem 8.

(Srinivas et al., 2010, Theorem 8) Suppose that is compact, and is kernel continuously differentiable in a neighbourhood of . Let where is the operator spectrum of with respect to the uniform distribution over . Pick and let , where is the volume of . Then is bounded by

 C maxr=1,…,N[s0logrnNα+(4ζ+2)VAlogNα−1 ×(1−rN)(Nζ+1S(s0)+1)]+O(N1−ζ/d), (4)

where for any .

The operator spectrum of the Matérn kernel, required to use theorem 8, can be bounded using the following.

###### Theorem 9.

(Seeger et al., 2008, Theorem 2) Let be an isotropic covariance function on satisfying the conditions of Widom’s theorem (Widom, 1963), with a spectral density . Suppose that the covariate distribution has bounded support and a bounded density, in that for all and for all . Then,

 λs≤D(2π)dλ(CdR−1s1/d)(1+o(1))

asymptotically as , where .

The required spectral density of a Matérn kernel is given in equation 1. By Seeger et al. (2008), it satisfies the conditions of Widom’s theorem.

###### Proof of creftype 1.

As the information gain is a sum of non-negative elements, we have for any ,

 ^γ(A,M)≤^γ(A,N).

It therefore suffices to bound the maximum number of points that can fall in a partition before it would split. Let denote this quantity. The proof proceeds in two parts: first we bound as a function of , then we bound in terms of the horizon.

From equation 1, we have that the spectral density for the Matérn kernel satisfies

 λ(ω)=Cℓ,d(1+(ℓω)2)−ν−d/2≤C′ℓ,dω−(2v+d).

where . Utilising this within theorem 9 with a uniform covariate distribution,

 λs≤Cρ−dA(s1/dρ−1A)−(2v+d)=Cρ2νAs−(2ν+d)/d,

for some constant , where we used for all , as is a -dimensional cube.

As the bound on for large is monotonically decreasing, we can bound the tail of the Matérn kernel operator spectrum as

 ∑s>s0λs ≤Cρ2νA∫∞s=s0s−(2ν+d)/d=O(ρ2νA).

We now apply theorem 8 in order to bound . Choose , then

 Nd(2ν−1)2ν+dA

for sufficiently large. Choose .

For these parameter choices and for any ,

 s0logrnNAα=O(logNAloglogNA).

As , the second term in the maximum in equation 4 is . As the diameter satisfies by the definition of this term is . The final term in equation 4 is .

All that remains is to bound . We consider two cases: first, if then . Otherwise, was created by some set splitting, for which . Then,

 NA ≤ρ−1/bA=2−1/bρ−1/bA′ ≤2−1/b(NA′+1)<2−1/b(T+1).

In both cases, . ∎

### 4.2 Proof of creftype 2, bound on size of cover

###### Proof.

First, since any element was either in or was created by the splitting of a cover element into elements, we have that

 |˜AT|=|A1|+2d|˜AT∖AT|.

By assumption . Denote . We now upper bound . Take the diameter of any element to be (e.g.  if ). For any given , we have that

 |ΘT| ≤maxx1,…,xT|ΘT| =maxx1,…,xT∞∑i=0∑A∈ΘT1{ρA=2−iρ0}.

We upper bound the solution to this maximisation problem by considering just a subset of the constraints imposed by the splitting procedure.

First constraint: we have a budget constraint derived from placing points. Let and , and suppose there exists an for all . Then

 ∞∑i=0 M(2−iρ0)∑A∈ΘT1{ρA=2−iρ0} ≤∞∑i=0∑A∈ΘT(|XAτ′(A)|−|XAτ(A)|)1{ρA=2−iρ0} =∑A∈ΘT|XAτ′(A)|−|XAτ(A)| ≤∑A∈˜AT|XAτ′(A)|−|XAτ(A)| =∑A∈˜ATτ′(A)∑t=τ(A)1{xt∈A}=T∑t=1∑A∈AT1{xt∈A} =T∑t=1|{A∈At:xt∈A}| ≤2dT.

Now we find a suitable , which we shall refer to as the cost of splitting an element . Because split,

 |XAτ′(A)|+1>ρ−1/bA≥|XAτ′(A)|

Suppose that is the element that split to create . Then and . Therefore

 |XAτ′(A)|−|XAτ(A)| ≥|XAτ′(A)|−|XA′τ′(A′)|−1 ≥ρ−1/bA(1−2−1/b)−2=M(ρA).

Second constraint: a supply constraint. There are at most elements of diameter , and therefore at most elements of diameter can be split, leading to

Since increases with , the solution to the relaxed optimisation problem will be to buy all the available with smallest diameter, subject to the supply and budget constraints. Suppose the smallest split with this strategy has a diameter for some . Then, since the supply constraint is binding and budget constraint is satisfied, we have that

 z−1∑i=0M(2−iρ0)2di⌈ρ−d0⌉≤2dT.

Writing for some and and using the geometric series formula to solve for , we obtain , a quantity independent of . Counting all the cover elements of diameters , we have that , which is . Since this was a construction that maximises , we have that . ∎

### 4.3 Proof of creftypecap 6

###### Proof.

Let and define recursively to be the set of the hypercubes created by splitting each element in into hypercubes. Let be the diameter of elements in . We have that

 At⊂⋃i≥0Bi⟹At⊂⋃i≥0(Bi∩At).

Suppose there exists for some . By the splitting condition,

 Z∈At∖B0⟹ρZ>(|XZt|+1)−b≥(t+1)−b.

Also, implies . Therefore, so Let . We have for all and hence . We now bound the cardinality of . We have , so

 |Bt| =Jt∑i=0|Bi|=⌈ρ−d0⌉⌊Jt⌋∑i=02di≤⌈ρ−d0⌉2dJt+1 ≤4(t+1)db.\qed

## 5 Empirical validation

We present an empirical comparison of -GP-UCB and IGP-UCB on two types of functions: first, synthetic functions in the Matérn kernel RKHS, where the conditions of the theory for both algorithms are met; second, standard global optimisation baselines, corresponding to a more realistic setting where the RKHS norm of the target function is not known.

We run both IGP-UCB and -GP-UCB using a Matérn kernel with parameters and , and use a regularisation parameter . For simplicity, the problems are discretised onto a regular grid, such that . We use a confidence parameter , and compute the quantities and exactly at each time step. For our method, -GP-UCB, we use an initial partition of cardinality approximately .

#### Synthetic functions

We benchmark on a set of synthetic functions which satisfy the assumptions behind the regret bounds of both algorithms. We construct each function by sampling points, , uniformly on , and each independent uniform on and defining for all , where is a Matérn kernel with lengthscale . Both algorithms are given access to the exact RKHS norm of this function, computed as