Streaming kernel regression with provably adaptive mean, variance, and regularization

\name Audrey Durand \email audrey.durand.2@ulaval.ca
\AND \name Odalric-Ambrym Maillard \email odalric.maillard@inria.fr
\AND \name Joelle Pineau \email jpineau@cs.mcgill.ca
Abstract

We consider the problem of streaming kernel regression, when the observations arrive sequentially and the goal is to recover the underlying mean function, assumed to belong to an RKHS. The variance of the noise is not assumed to be known. In this context, we tackle the problem of tuning the regularization parameter adaptively at each time step, while maintaining tight confidence bound estimates on the value of the mean function at each point. To this end, we first generalize existing results for finite-dimensional linear regression with fixed regularization and known variance to the kernel setup with a regularization parameter allowed to be a measurable function of past observations. Then, using appropriate self-normalized inequalities, we build upper and lower bound estimates for the variance, leading to Bernstein-like concentration bounds. The latter are used in order to define the adaptive regularization. The bounds resulting from our technique are valid uniformly over all observation points and all time steps, and are compared against the literature with numerical experiments. Finally, the potential of these tools is illustrated by an application to kernelized bandits, where we revisit the Kernel UCB and Kernel Thompson Sampling procedures, and show the benefits of the novel adaptive kernel tuning strategy.

\editor{Kevin Murphy and Bernhard Schölkopf}

\begin{keywords}
kernel, regression, online learning, adaptive tuning, bandits
\end{keywords}

1 Introduction

Many applications require solving an online optimization problem for an unknown, noisy function defined over a possibly large domain space. Kernel regression methods can learn such possibly non-linear functions by sharing information gathered across observations. These techniques are being used in many fields where they serve a variety of applications like hyperparameter optimization (Snoek et al., 2012), active preference learning (Brochu et al., 2008), and reinforcement learning (Marchant and Ramos, 2014; Wilson et al., 2014). The idea is generally to rely on kernel regression to estimate a function that can be used for decision making and selecting the next observation point. Algorithmically speaking, standard kernel regression involves a regularization parameter that accounts for both the complexity of the unknown target function and the variance of the noise. While most theoretical approaches rely on a fixed regularization parameter, in practice, people have often used heuristics in order to tune this parameter adaptively with time.

This however comes at the price of losing theoretical guarantees. Indeed, in order for theoretical guarantees (based on concentration inequalities) to hold, existing approaches (Srinivas et al., 2010; Valko et al., 2013) require the regularization parameter in the kernel regression to be a fixed quantity. Further, they assume prior and tight knowledge of the variance of the noise, which is unrealistic in practice. The reason for this cumbersome assumption is to adjust the regularization parameter in the kernel regression based on this deterministic quantity, as such a choice of regularization conveys a natural Bayesian interpretation (Rasmussen and Williams, 2006). Following this intuition, given an empirical estimate of the function noise based on gathered observations, one should be able to tune the regularization automatically. This is however non-trivial, first due to the streaming nature of the data, which allows the noise to be a measurable function of the past observations, second because concentration bounds on the empirical variance are currently unknown in such a general kernel setup, and finally because all existing theoretical bounds require the regularization parameter to be a deterministic constant, while we require here a parameterization that explicitly depends on past observations. The goal of this work is to provide the rigorous tools for performing an online tuning of the kernel regularization while preserving theoretical guarantees and confidence intervals in the context of streaming kernel regression with unknown noise. We thus hope to provide a sound method for adaptive tuning that is both interesting from a practical perspective and retains theoretical guarantees.

We start our contributions with Theorem 2.1, which generalizes existing concentration results (such as in Abbasi-Yadkori et al. (2011); Wang and de Freitas (2014)), and is explicitly stated for a regularization parameter that may differ from the noise. This result paves the way to an even more general result (Theorem 2.2) that holds when the regularization is tuned online at each step. Afterwards, we introduce a streaming variance estimator (Theorem 3.1) that yields empirical upper and lower bounds on the function noise. Plugging in the resulting estimates leads to empirical Bernstein-like concentration results (Corollary 3.1) for the kernel regression, where we use the variance estimates in order to tune the regularization parameter. Section 4 presents an application to kernelized bandits, where regret bounds for Kernel UCB and Kernel Thompson Sampling procedures are derived. Section 5 discusses our results and compares them against other approaches. Finally, Section 6 shows the potential of all the previously introduced results while comparing them to existing alternatives through different numerical experiments. We postpone most of the proofs to the appendix.

2 Kernel streaming regression with a predictable noise process

Let us consider a sequential regression problem. At each time step $t$, a learner picks a point $x_t$ and gets the observation

\[
y_t = f_\star(x_t) + \xi_t,
\]

where is an unknown function assumed to belong to some function space , and is a random noise. We assume the process generating the observations is predictable in the sense that there is a filtration such that is -measurable and is -measurable. Such an example is given by . In the sub-Gaussian streaming predictable model, we assume that for some non-negative constant the following holds

\[
\forall t\in\Nat,\ \forall\gamma\in\Real,\quad \ln\Esp\big[\exp(\gamma\xi_t)\,\big|\,\cH_{t-1}\big] \le \frac{\gamma^2\sigma^2}{2}.
\]

Let be a kernel function (that is continuous, symmetric positive definite) on a compact set equipped with a positive finite Borel measure, and denote the corresponding RKHS. We first provide a result bounding the prediction error of a standard regularized kernel estimate, where the regularization is given by a fixed parameter .

Theorem 2.1 (Streaming Kernel Least-Squares)

Assume we are in the sub-Gaussian streaming predictable model. For a parameter , let us define the posterior mean and variances after observing as

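\[
\begin{cases}
f_{\lambda,t}(x) = k_t(x)^\top (K_t+\lambda I_t)^{-1} Y_t \\
s^2_{\lambda,t}(x) = \dfrac{\sigma^2}{\lambda}\, k_{\lambda,t}(x,x)
\end{cases}
\quad\text{with}\quad
k_{\lambda,t}(x,x) = k(x,x) - k_t(x)^\top (K_t+\lambda I_t)^{-1} k_t(x).
\]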

where is a (column) vector and . Then , with probability higher than , it holds simultaneously over all and ,

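\[
|f_\star(x) - f_{\lambda,t}(x)| \le \sqrt{\frac{k_{\lambda,t}(x,x)}{\lambda}} \Big[\sqrt{\lambda}\,\|f_\star\|_\cK + \sigma\sqrt{2\ln(1/\delta) + 2\gamma_t(\lambda)}\Big],
\]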

where the quantity is the information gain.

Remark 2.1

This result should be considered as an extension of (Abbasi-Yadkori et al., 2011, Theorem 2) from finite-dimensional to possibly infinite-dimensional function spaces. It is a non-trivial result as the Laplace method must be amended in order to be applied.

Remark 2.2

This result holds uniformly over all and most importantly over all , thanks to a random stopping time construction (related to the occurrence of bad events) and a self-normalized inequality handling this stopping time. This is in contrast with results such as Wang and de Freitas (2014), that are only stated separately for each .

Remark 2.3

The quantity directly generalizes the classical notion of information gain (Cover and Thomas, 1991), that is recovered for the choice of regularization .
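To make the quantities of Theorem 2.1 concrete, the following is a minimal Python sketch rather than the implementation used for the experiments. It assumes a Gaussian kernel on one-dimensional inputs with a placeholder length scale, and it takes the information gain to be $\frac12\ln\det(I_t + K_t/\lambda)$, which is the form we assume here because it reduces to the classical information gain in the case described by Remark 2.3. The function names, the synthetic data, and the value used for the RKHS norm are all hypothetical.

```python
import numpy as np

def gaussian_kernel(a, b, length_scale=0.3):
    """Gaussian kernel matrix between two 1-D arrays (length scale is a placeholder)."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def posterior(x_obs, y_obs, x_query, lam):
    """Return f_{lam,t}(x) and k_{lam,t}(x, x) at each query point x."""
    gram = gaussian_kernel(x_obs, x_obs) + lam * np.eye(len(x_obs))
    k_query = gaussian_kernel(x_obs, x_query)       # column j is k_t(x_j)
    mean = k_query.T @ np.linalg.solve(gram, y_obs)
    prior = np.ones(len(x_query))                   # k(x, x) = 1 for this kernel
    k_post = prior - np.sum(k_query * np.linalg.solve(gram, k_query), axis=0)
    return mean, np.clip(k_post, 0.0, None)

def information_gain(x_obs, lam):
    """Assumed form: gamma_t(lam) = 0.5 * ln det(I + K_t / lam)."""
    gram = gaussian_kernel(x_obs, x_obs)
    _, logdet = np.linalg.slogdet(np.eye(len(x_obs)) + gram / lam)
    return 0.5 * logdet

def confidence_width(k_post, lam, f_norm, sigma, delta, gamma):
    """Right-hand side of the bound of Theorem 2.1."""
    bracket = np.sqrt(lam) * f_norm + sigma * np.sqrt(2.0 * np.log(1.0 / delta) + 2.0 * gamma)
    return np.sqrt(k_post / lam) * bracket

# Synthetic usage: every numerical value below is illustrative only.
rng = np.random.default_rng(0)
x_obs = rng.uniform(0.0, 1.0, size=20)
y_obs = np.sin(2.0 * np.pi * x_obs) + 0.1 * rng.standard_normal(20)
x_query = np.linspace(0.0, 1.0, 50)
lam = 0.1
mean, k_post = posterior(x_obs, y_obs, x_query, lam)
width = confidence_width(k_post, lam, f_norm=1.0, sigma=0.1, delta=0.1,
                         gamma=information_gain(x_obs, lam))
```

The envelope `mean ± width` is then the confidence band at the query points; solving linear systems instead of forming the inverse explicitly is a numerical choice of this sketch.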

The case when is of special interest, since we get on the one hand

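\[
\begin{aligned}
f_{\star,t}(x) &= k_t(x)^\top (K_t+\lambda_\star I_t)^{-1} Y_t \\
s^2_{\star,t}(x) &= \|f\|^2_\cK\, k_t(x,x)
\end{aligned}
\quad\text{with}\quad
k_t(x,x) = k(x,x) - k_t(x)^\top (K_t+\lambda_\star I_t)^{-1} k_t(x)
\]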

and on the other hand . In practice however, neither nor may be known exactly. In this paper, we assume that an upper bound is given on . Then, we want to build an estimate of at each time in order to tune . Using a sequence of regularization parameters that is tuned adaptively based on the past observations requires modifying the previous theorem (which is only valid for a deterministic ) into the following more general statement:

Theorem 2.2 (Streaming Kernel Least-Squares with online tuning)

Under the same assumption as Theorem 2.1, let be a predictable positive sequence of parameters, that is is -measurable for each . Assume that for each , holds for a positive constant . Let us define the modified posterior mean and variances after observing as

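\[
\begin{cases}
f_{\lambda,t}(x) = k_t(x)^\top (K_t+\lambda_{t+1} I_t)^{-1} Y_t \\
s^2_{\lambda,t}(x) = \dfrac{\sigma^2}{\lambda_{t+1}}\, k_{\lambda_{t+1},t}(x,x)
\end{cases}
\quad\text{with}\quad
k_{\lambda,t}(x,x) = k(x,x) - k_t(x)^\top (K_t+\lambda I_t)^{-1} k_t(x),
\]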

where , and . Then for all , with probability higher than , it holds simultaneously over all and

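\[
|f_\star(x) - f_{\lambda,t}(x)| \le \sqrt{\frac{k_{\lambda_{t+1},t}(x,x)}{\lambda_{t+1}}} \Big[\sqrt{\lambda_{t+1}}\,\|f_\star\|_\cK + \sigma\sqrt{2\ln(1/\delta) + 2\gamma_t(\lambda_\star)}\Big].
\]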

The proof is presented in Appendix A.

The regularization parameter is therefore used in conjunction with previous data up to time to provide the posterior regression model (mean and variance) that is used in return to acquire the next observation on point .

Remark 2.4

Since is allowed to be -measurable, this gives theoretical guarantees for virtually any adaptive tuning procedure of the regularization parameter.

Remark 2.5

The assumption that will be naturally satisfied for the choice of regularization we consider.

3 Variance estimation

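We now focus on the estimation of the variance parameter of the noise in the case when it is unknown, or loosely known. Theorem 2.2 suggests defining the sequence by

\begin{equation}
\lambda_t = \sigma_{+,t-1}^2 / C^2 \quad\text{with}\quad \sigma_{+,t} = \min\{\tilde\sigma_{+,t},\, \sigma_{+,t-1}\} \quad\text{and}\quad \sigma_{+,0} = \sigma_+, \tag{1}
\end{equation}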

where is an initial loose upper bound on and is an upper-bound estimate on built from all observations gathered up to time (inclusively). This ensures that is measurable for all and satisfies with high probability, where . The crux is now to define the upper-bound estimate on . In order to get a variance estimate, one obviously requires more than the sub-Gaussian assumption, since the term has no reason to be tight (the inequality remains valid when is replaced with any larger value). In order to convey the minimality of , we assume that the noise sequence is both -sub-Gaussian and second-order\footnote{The term on the right-hand side corresponds to the cumulant generating function of the chi-squared distribution with 1 degree of freedom. This assumption naturally holds for Gaussian variables.} -sub-Gaussian, in the sense that

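\[
\forall t,\ \forall\gamma < \frac{1}{2\sigma^2},\quad \ln\Esp\big[\exp(\gamma\xi_t^2)\,\big|\,\cH_{t-1}\big] \le -\frac{1}{2}\ln(1-2\gamma\sigma^2).
\]

Remark 3.1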

To avoid any technicality, one may assume that is exactly , in which case it is trivially second-order -sub-Gaussian.
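As a sanity check of Remark 3.1, the sketch below numerically evaluates $\ln\Esp[\exp(\gamma\xi^2)]$ for a centered Gaussian $\xi$ with standard deviation $\sigma$ and compares it with the right-hand side of the second-order condition, for which equality is expected in the Gaussian case. It relies on Gauss-Hermite quadrature from NumPy; the function names, the number of quadrature nodes, and the tested values of $\gamma$ and $\sigma$ are choices made for this illustration.

```python
import numpy as np

def log_mgf_of_square(gamma, sigma, n_nodes=80):
    """ln E[exp(gamma * xi^2)] for xi ~ N(0, sigma^2), via Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    # For a Gaussian variable: E[g(xi)] = (1 / sqrt(pi)) * sum_i w_i * g(sqrt(2) * sigma * z_i)
    xi = np.sqrt(2.0) * sigma * nodes
    return np.log(np.sum(weights * np.exp(gamma * xi ** 2)) / np.sqrt(np.pi))

def second_order_bound(gamma, sigma):
    """Right-hand side of the second-order sub-Gaussian condition."""
    return -0.5 * np.log(1.0 - 2.0 * gamma * sigma ** 2)

sigma = 0.5
gammas = [-1.0, 0.5, 1.0]   # all satisfy gamma < 1 / (2 sigma^2) = 2
gaps = [abs(log_mgf_of_square(g, sigma) - second_order_bound(g, sigma)) for g in gammas]
```

The entries of `gaps` measure the discrepancy between both sides; the quadrature is only accurate when $2\gamma\sigma^2$ stays well below 1, which is why the tested values are kept moderate.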

Now let denote the (slightly biased) variance estimate for a regularization parameter .

Theorem 3.1 (Streaming Kernel variance estimate)

Assume we are in the predictable second-order -sub-Gaussian streaming regression model, with a predictable positive sequence such that holds for all . Let us introduce the following quantities

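\[
C_t(\delta) = \ln(e/\delta)\Big[1 + \ln\big(\pi^2\ln(t)/6\big)/\ln(1/\delta)\Big], \qquad D_{\lambda,t}(\delta) = 2\ln(1/\delta) + 2\gamma_t(\lambda)
\]
and finally
\[
\alpha = \max\bigg(1 - \sqrt{\frac{C_t(\delta')}{t}} - \sqrt{\frac{C_t(\delta') + 2D_{\lambda_\star,t}(\delta')}{t}},\ 0\bigg).
\]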

Then, let us introduce the following variance bounds, defined differently depending on whether a deterministic upper bound is known (case 1) or not (case 2).

Then with probability higher than , it holds simultaneously for all

\[
\sigma_{-,t}(\lambda_t) \le \sigma \le \sigma_{+,t}(\lambda_t,\lambda_\star).
\]

The proof is presented in Appendix B.
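The auxiliary quantities $C_t(\delta)$, $D_{\lambda,t}(\delta)$ and $\alpha$ of Theorem 3.1 are simple to evaluate once the information gain is available. The following Python sketch computes them from the displayed formulas; the information gain is passed in as a plain number and $\delta'$ as a free parameter, since neither is specified by this sketch. Note that the formula for $C_t$ involves $\ln(\ln t)$, so the code assumes $t \ge 2$. Function names are hypothetical.

```python
import math

def c_term(t, delta):
    """C_t(delta) = ln(e/delta) * [1 + ln(pi^2 ln(t) / 6) / ln(1/delta)], for t >= 2."""
    if t < 2:
        raise ValueError("this sketch assumes t >= 2 so that ln(ln(t)) is defined")
    inner = math.log(math.pi ** 2 * math.log(t) / 6.0)
    return math.log(math.e / delta) * (1.0 + inner / math.log(1.0 / delta))

def d_term(gamma_t, delta):
    """D_{lambda,t}(delta) = 2 ln(1/delta) + 2 gamma_t(lambda)."""
    return 2.0 * math.log(1.0 / delta) + 2.0 * gamma_t

def alpha(t, delta_prime, gamma_star):
    """alpha from Theorem 3.1; gamma_star stands for gamma_t(lambda_star)."""
    c = c_term(t, delta_prime)
    d = d_term(gamma_star, delta_prime)
    return max(1.0 - math.sqrt(c / t) - math.sqrt((c + 2.0 * d) / t), 0.0)

# Illustrative values: alpha is clipped to zero for very few observations
# and approaches one as t grows (for a fixed information gain).
alpha_small = alpha(2, 0.05, 0.0)
alpha_mid = alpha(1000, 0.05, 5.0)
alpha_large = alpha(10 ** 6, 0.05, 5.0)
```

In an actual run the information gain grows with $t$, so the fixed value used above only serves to show the qualitative behavior of $\alpha$.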

Remark 3.2

The case when absolutely no bound is known on the noise is challenging in practice. In this case, it is intuitive that one should not be able to recover the noise with too few samples. The bound stated in Theorem 3.1 (see Appendix B) supports this intuition, as when the number of observations is too small, then and the corresponding bound becomes trivial ().

Remark 3.3

In the variance bounds of Theorem B.1 the term appears systematically with the factor . This suggests we need to choose proportional to , which gives further justification to the target , where is a known upper bound on .

Remark 3.4

In practice, we advise choosing the best of case 1 and case 2 bounds when is known.

In order to estimate the upper bound , one needs at least a lower-bound on . Let us define

\begin{equation}
\sigma_{-,t} = \max\{\tilde\sigma_{-,t},\, \sigma_{-,t-1}\} \quad\text{with}\quad \sigma_{-,0} = \sigma_-, \tag{2}
\end{equation}

where is an initial lower bound on and is a lower-bound estimate on built from all observations gathered up to time (inclusively). Then, one way to proceed is, at each time step , to build an estimate , which in turn can be used to compute the lower quantity , and obtain the estimate . Then, we compute the predictable sequence as described by Equation 1. Further replacing the variance with its estimate using a union bound in the result of Theorem 2.2, we derive confidence bounds that are fully computable in the context where the regularization parameter is adaptively tuned and the function noise is unknown. This is summarized in the following empirical Bernstein-style inequality:

Corollary 3.1 (Kernel empirical-Bernstein inequality)

Assume that . Let us define the following noise lower-bound for each

\[
\sigma_{-,t} = \max\{\sigma_{-,t}(\lambda_{t-1}),\, \sigma_{-,t-1}\}
\]

and define as the corresponding lower bound on . Then, let us define the following noise upper bound for each

\[
\sigma_{+,t} = \min\{\sigma_{+,t}(\lambda_{t-1},\lambda_-),\, \sigma_{+,t-1}\}.
\]

Define the regularization parameterizing the regression model used for acquiring observation at time to be , according to Equation 1. Then with probability higher than , the following is valid simultaneously for all and ,

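\begin{equation}
\big|f_\star(x) - f_{\lambda_t,t}(x)\big| \le \sqrt{\frac{k_{\lambda_t,t}(x,x)}{\lambda_t}}\, B_{\lambda_t,t}(\delta) \quad\text{where}\quad B_{\lambda_t,t}(\delta) = \sqrt{\lambda_t}\,C + \sigma_{+,t}\sqrt{2\ln(1/\delta) + 2\gamma_t(\lambda_-)}. \tag{3}
\end{equation}

Remark 3.5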

This result is especially interesting since it provides a fully empirical confidence envelope function around . When an initial bound on the noise is known and considered to be tight, one may simply choose the constant deterministic sequence , in which case the same result holds for and .
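The bookkeeping behind Equations 1, 2 and 3 can be sketched as follows. The raw per-step estimates $\tilde\sigma_{-,t}$ and $\tilde\sigma_{+,t}$ are replaced here by made-up numbers, since computing them requires the variance bounds of Theorem 3.1; only the monotone updates, the predictable choice of $\lambda_t$, and the confidence term $B_{\lambda_t,t}(\delta)$ are illustrated. We read $C$ as the known upper bound on the RKHS norm of the target function; the function names and all numerical values are hypothetical.

```python
import math

def update_bounds(sigma_lo_prev, sigma_hi_prev, sigma_lo_raw, sigma_hi_raw):
    """Equations 1 and 2: keep the tightest lower and upper noise bounds seen so far."""
    return max(sigma_lo_raw, sigma_lo_prev), min(sigma_hi_raw, sigma_hi_prev)

def regularization(sigma_hi_prev, C):
    """Equation 1: lambda_t = sigma_{+,t-1}^2 / C^2, which only uses past information."""
    return sigma_hi_prev ** 2 / C ** 2

def confidence_term(lam_t, C, sigma_hi_t, delta, gamma_lo):
    """B_{lambda_t,t}(delta) of Equation 3; gamma_lo stands for gamma_t(lambda_-)."""
    return math.sqrt(lam_t) * C + sigma_hi_t * math.sqrt(
        2.0 * math.log(1.0 / delta) + 2.0 * gamma_lo)

# Made-up raw estimates (lower, upper) for t = 1, ..., 4.
raw_estimates = [(0.02, 0.9), (0.05, 1.2), (0.04, 0.4), (0.08, 0.3)]
C, sigma_lo, sigma_hi = 2.0, 0.01, 1.0   # norm bound and initial sigma_-, sigma_+
history = []
for lo_raw, hi_raw in raw_estimates:
    lam_t = regularization(sigma_hi, C)   # computed before seeing the new estimate
    sigma_lo, sigma_hi = update_bounds(sigma_lo, sigma_hi, lo_raw, hi_raw)
    history.append((lam_t, sigma_lo, sigma_hi))
```

Computing `lam_t` before the update is what keeps the sequence predictable: the regularization used at time $t$ depends only on the bounds available at time $t-1$.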

We observe from Theorem 3.1 that the tightness of the noise estimates depends on the parameter that is used for computing and . Since holds with high probability by construction, using such an adaptive should yield tighter bounds than using a fixed . This is supported by the numerical experiments of Section 6.2.

4 Application to kernelized bandits

Here is a direct application of our results in the framework of stochastic multi-armed bandits with structured arms embedded in an RKHS (Srinivas et al., 2010; Valko et al., 2013). At each time step , a bandit algorithm recommends a point to sample and observes a noisy outcome , where . Let be the optimal arm. The goal of an algorithm is to pick a sequence of points that minimizes the cumulative regret

\begin{equation}
\kR_T = \sum_{t=1}^{T} f_\star(\star) - f_\star(x_t). \tag{4}
\end{equation}

In this context, one needs to build tight confidence sets on the mean of each arm, and this will be given by Corollary 3.1. We illustrate our technique on two main bandit strategies: Upper Confidence Bound (UCB) (Auer et al., 2002) and Thompson Sampling (TS) (Thompson, 1933); both are adapted here to the kernel setting with unknown variance.
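For a finite set of arms, the cumulative regret of Equation 4 can be tracked as in the short Python sketch below. It assumes that the noiseless values of the target function at the arms are available, which is only the case in simulations; the function name and the example values are hypothetical.

```python
import numpy as np

def cumulative_regret(f_values, chosen):
    """Running sum of f(best arm) - f(chosen arm), following Equation 4."""
    f_values = np.asarray(f_values, dtype=float)
    gaps = np.max(f_values) - f_values[np.asarray(chosen)]
    return np.cumsum(gaps)

# Three arms with made-up mean values and a made-up sequence of pulls.
regret = cumulative_regret([0.1, 0.5, 0.3], chosen=[0, 2, 1, 1])
```

The last entry of `regret` is the cumulative regret at the final horizon, and the curve flattens once the optimal arm is selected.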

Definition 4.1 (Information gain with unknown variance)

We define the information gain at time for a regularization parameter to be

This definition directly extends the usual definition of information gain, that can be recovered by choosing . The following extension of Lemma 7 in Wang and de Freitas (2014) (see also Srinivas et al. (2012)) to the case when the variance is estimated plays an important role in the regret analysis of both algorithms.

Lemma 4.1 (From sum of variances to information gain)

Let us assume that the kernel is bounded by in the sense that . Let be any sequence such that . For instance, this is satisfied with high probability when using Equation 1. Then, it holds

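\[
\sum_{t=1}^{T} s^2_{\lambda,t-1}(x_t) = \sigma^2 \sum_{t=1}^{T} \frac{1}{\lambda_t}\, k_{\lambda_t,t-1}(x_t,x_t) \le \frac{2C^2}{\ln(1+C^2/\sigma^2)}\, \gamma_T(\sigma^2/C^2).
\]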
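The inequality of Lemma 4.1 can be checked numerically in a simplified setting. The Python sketch below makes several assumptions that go beyond the statement: the regularization is held constant at $\sigma^2/C^2$, the kernel is Gaussian and hence bounded by 1, and the information gain is taken to be $\frac12\sum_{t}\ln\big(1 + k_{\lambda,t-1}(x_t,x_t)/\lambda\big)$. Names and numerical values are hypothetical.

```python
import numpy as np

def gaussian_kernel(a, b, length_scale=0.3):
    """Gaussian kernel matrix between two 1-D arrays (bounded by 1)."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def posterior_kernel_along_path(x, lam):
    """k_{lam,t-1}(x_t, x_t) for t = 1..T, each computed from the t-1 earlier points."""
    out = np.empty(len(x))
    for t in range(len(x)):
        prior = gaussian_kernel(x[t:t + 1], x[t:t + 1])[0, 0]
        if t == 0:
            out[t] = prior                      # no observation yet
            continue
        gram = gaussian_kernel(x[:t], x[:t]) + lam * np.eye(t)
        k_vec = gaussian_kernel(x[:t], x[t:t + 1])[:, 0]
        out[t] = max(prior - k_vec @ np.linalg.solve(gram, k_vec), 0.0)
    return out

sigma, C = 0.5, 1.0                             # illustrative noise level and norm bound
lam = sigma ** 2 / C ** 2                       # constant regularization (assumption)
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=30)
k_path = posterior_kernel_along_path(x, lam)

lhs = sigma ** 2 * np.sum(k_path / lam)         # sum of posterior variances
gamma_T = 0.5 * np.sum(np.log1p(k_path / lam))  # assumed form of the information gain
rhs = 2.0 * C ** 2 / np.log(1.0 + C ** 2 / sigma ** 2) * gamma_T
```

In this setting `lhs <= rhs` follows term by term from the concavity of the logarithm, since each ratio `k_path / lam` lies between 0 and $C^2/\sigma^2$.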

In the sequel, it is useful to bound the confidence bound term from Equation 3.

Lemma 4.2 (Deterministic bound on the confidence bound)

Assume that we are given a constant , so that holds for all . Then for all , the confidence bound term is upper-bounded by the following deterministic quantity

\[
B_{\lambda_t,t}(\delta) \le \sigma_+\Big(1 + \sqrt{2\ln(1/\delta) + 2\gamma_T(\sigma_-^2/C^2)}\Big).
\]

Further, we have .

Remark 4.1

The term can be replaced with a more refined term thanks to the confidence bounds on the variance estimates.

Kernel UCB with unknown variance

The upper bound on the error can be used directly in order to build a UCB-style algorithm. Formally, the vanilla UCB algorithm (Auer et al., 2002) corresponding to our setting picks at time the arm

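\begin{equation}
x_t \in \operatorname*{argmax}_{x\in\cX} f^+_{\lambda_t,t-1}(x) \quad\text{where}\quad f^+_{\lambda,t}(x) = f_{\lambda,t}(x) + \sqrt{\frac{k_{\lambda,t}(x,x)}{\lambda}}\, B_{\lambda,t}(\delta). \tag{5}
\end{equation}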

Following the regret proof strategy of Abbasi-Yadkori et al. (2011), with some minor modifications, yields the following guarantee on the regret of this strategy:

Theorem 4.1 (Kernel UCB with unknown noise and adaptive regularization)

With probability higher than , the regret of Kernel UCB with adaptive regularization and variance estimation satisfies for all (recall that is defined in Equation 3):

\[
\kR_T \le 2\sum_{t=1}^{T} \sqrt{\frac{k_{\lambda_t,t-1}(x_t,x_t)}{\lambda_t}}\, B_{\lambda_t,t-1}(\delta/4).
\]

In particular, we have

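\[
\kR_T \le 2\,\frac{\sigma_+}{\sigma}\Big(1 + \sqrt{2\ln(4/\delta) + 2\gamma_T(\sigma_-^2/C^2)}\Big)\, C \sqrt{T\,\frac{2\gamma_T(\sigma^2/C^2)}{\ln(1+C^2/\sigma^2)}}.
\]

Remark 4.2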

This result, which holds simultaneously over all time horizons, extends that of Abbasi-Yadkori et al. (2011) first to kernel regression and then to the case when the variance of the noise is unknown. This should also be compared to Valko et al. (2013), which assumes bounded observations, implying a bounded noise (with known bound) and a bounded , and to Srinivas et al. (2010), which provides looser bounds.
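One selection step of the rule in Equation 5 over a finite set of arms can be sketched in Python as follows. This is an illustration under assumptions, not the code used in the experiments: the kernel is Gaussian with a placeholder length scale, at least one observation is assumed to be available, and the confidence term `B` is supplied as a number rather than computed from Equation 3. All names and values are hypothetical.

```python
import numpy as np

def gaussian_kernel(a, b, length_scale=0.3):
    """Gaussian kernel matrix between two 1-D arrays of points."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def kernel_ucb_choice(arms, x_obs, y_obs, lam, B):
    """Index of the arm maximizing f_{lam,t}(x) + sqrt(k_{lam,t}(x,x) / lam) * B."""
    gram = gaussian_kernel(x_obs, x_obs) + lam * np.eye(len(x_obs))
    k_arms = gaussian_kernel(x_obs, arms)                 # column j is k_t(arm j)
    mean = k_arms.T @ np.linalg.solve(gram, y_obs)
    k_post = 1.0 - np.sum(k_arms * np.linalg.solve(gram, k_arms), axis=0)
    ucb = mean + np.sqrt(np.clip(k_post, 0.0, None) / lam) * B
    return int(np.argmax(ucb))

# Two made-up observations on a grid of 11 arms.
arms = np.linspace(0.0, 1.0, 11)
x_obs = np.array([0.2, 0.8])
y_obs = np.array([0.0, 1.0])
greedy = kernel_ucb_choice(arms, x_obs, y_obs, lam=0.1, B=0.0)     # pure exploitation
explore = kernel_ucb_choice(arms, x_obs, y_obs, lam=0.1, B=100.0)  # exploration dominates
```

Setting `B` to zero recovers a greedy choice on the posterior mean, while a very large `B` drives the choice toward arms that are far from past observations; the adaptive scheme sits between these two extremes through $B_{\lambda_t,t-1}(\delta)$.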

Kernel TS with unknown variance

Another application of our confidence bounds is in the analysis of Thompson sampling in the kernel scenario. Before presenting the result, let us say a few words about the design of a TS algorithm in a kernel setting. Such an algorithm requires sampling from a posterior distribution over the arms. It is natural to consider a Gaussian posterior with posterior means and variances given by the kernel estimates. However, it has been noted in a series of papers (Agrawal and Goyal, 2014; Abeille and Lazaric, 2016) that, in order to obtain provable regret minimization guarantees, the posterior variance should be inflated (although in practice, the vanilla version without inflation may work better). Following these lines of research, and owing to our novel confidence bounds, we derive the following TS algorithm using a posterior variance inflation factor .

Remark 4.3

The algorithm does not know the variance of the noise, but uses an upper estimate .

Remark 4.4

We assume that the set of arms is discrete. This is merely for practical reasons since otherwise updating the estimate of in a RKHS requires memory and computational times that are unbounded with . This also simplifies the analysis.
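To fix ideas on the sampling step over a discrete set of arms, here is a hedged Python sketch of one round of a Thompson sampling rule with inflated posterior variance. It should not be read as Algorithm 1: the joint Gaussian sampling over arms, the way the inflation factor `v` and the noise upper estimate `sigma_hi` enter the covariance, and the Gaussian kernel with its length scale are all assumptions of this illustration, and every name is hypothetical.

```python
import numpy as np

def gaussian_kernel(a, b, length_scale=0.3):
    """Gaussian kernel matrix between two 1-D arrays of points."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def kernel_ts_choice(arms, x_obs, y_obs, lam, sigma_hi, v, rng):
    """Sample arm values from an inflated Gaussian posterior and return the argmax index."""
    gram = gaussian_kernel(x_obs, x_obs) + lam * np.eye(len(x_obs))
    k_arms = gaussian_kernel(x_obs, arms)
    mean = k_arms.T @ np.linalg.solve(gram, y_obs)
    cov = gaussian_kernel(arms, arms) - k_arms.T @ np.linalg.solve(gram, k_arms)
    cov = 0.5 * (cov + cov.T)                      # symmetrize against round-off
    scale = v ** 2 * sigma_hi ** 2 / lam           # assumed inflation of (sigma_+^2 / lam) * k
    sample = rng.multivariate_normal(mean, scale * cov, check_valid="ignore")
    return int(np.argmax(sample))

arms = np.linspace(0.0, 1.0, 11)
x_obs = np.array([0.2, 0.8])
y_obs = np.array([0.0, 1.0])
rng = np.random.default_rng(0)
no_inflation = kernel_ts_choice(arms, x_obs, y_obs, lam=0.1, sigma_hi=1.0, v=0.0, rng=rng)
sampled = kernel_ts_choice(arms, x_obs, y_obs, lam=0.1, sigma_hi=1.0, v=1.0, rng=rng)
```

With `v = 0` the sample collapses onto the posterior mean, so the rule becomes greedy; larger values of `v` spread the samples and therefore increase exploration.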

The following regret bound can then be obtained after some careful but easy adaptation of Agrawal and Goyal (2014). We provide the proof of this result in Appendix C, which can be of independent interest, being a more rigorous and somewhat simpler rewriting of the original proof technique from Agrawal and Goyal (2014).

Theorem 4.2 (Regularized Kernel TS with variance estimate)

Assume that the maximal instantaneous pseudo-regret is finite. Then, the regret of Kernel TS (Algorithm 1) with after episodes is with probability . More precisely, with probability , the regret is bounded for all :

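\[
\kR_T \le C_{1,T}\bigg(\sum_{t=1}^{T} \sqrt{\frac{k_{\lambda_t,t-1}(x_t,x_t)}{\lambda_t}}\, B_{\lambda_t,t-1}(\delta/4)\bigg) + C_2 R\sqrt{T\ln(1/\delta)} + 4\pi e R\delta,
\]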

where and .

Further, we have

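\[
\kR_T \le C_{1,T}\,\frac{\sigma_+}{\sigma}\Big(1 + \sqrt{2\ln(4/\delta) + 2\gamma_T(\sigma_-^2/C^2)}\Big)\, C \sqrt{T\,\frac{2\gamma_T(\sigma^2/C^2)}{\ln(1+C^2/\sigma^2)}} + C_2 R\sqrt{T\ln(1/\delta)} + 4\pi e R\delta.
\]

Remark 4.5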

As our confidence intervals do not require a bounded noise, likewise we can control the regret with high probability without requiring bounded observations, contrary to earlier works such as Valko et al. (2013).

5 Discussion and related works

Concentration results

Theorem 2.1 extends the self-normalized bounds of Abbasi-Yadkori et al. (2011) from the setting of linear function spaces to that of an RKHS with sub-Gaussian noise. Based on a nontrivial adaptation of the Laplace method, it yields self-normalized inequalities in a setting of possibly infinite dimension. It generalizes the following result of Wang and de Freitas (2014) to kernel regression with , which was already a generalization of a previous result by Srinivas et al. (2010) for bounded noise. It is also more general than the concentration result from Valko et al. (2013), for kernel regression with , which holds under the assumption of bounded observations.

Lemma 5.1 (Proposition 1 from Wang and de Freitas (2014))

Let denote a function in the RKHS induced by kernel and let us define the posterior mean and variances with , for (arbitrary) data . Assuming -sub-Gaussian noise variables, then for all we have that

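\[
\Pr\Big[\exists x\in\cX : |f_{\lambda,t}(x) - f_\star(x)| \ge \ell_{\lambda,t+1}(\delta')\, k^{1/2}_{\lambda,t}(x,x)\Big] \le \delta',
\]
where
\[
\ell^2_{\lambda,t}(\delta') = \|f\|^2_\cK + \sqrt{8\gamma_{t-1}(\lambda)\ln\frac{2}{\delta'}} + \sqrt{2\ln\frac{4}{\delta'}}\,\|f\|_\cK + 2\gamma_{t-1}(\lambda) + 2\sigma\ln\frac{2}{\delta'}
\]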

and is the information gain.

Remark 5.1

This result provides a bound that is valid for each , with probability higher than . In contrast, results from Abbasi-Yadkori et al. (2011), as well as Theorem 2.1, hold with probability higher than , uniformly for all , and are thus much stronger in this sense.

Theorem 2.2 extends Theorem 2.1 to the case when the regularization is tuned online based on gathered observations. To the best of our knowledge, no such result exists in the literature at the time of writing this paper. Moreover, Theorem 3.1 provides variance estimates with confidence bounds scaling with , in the spirit of the results from Maurer and Pontil (2009), that were provided in the i.i.d. case. Thus, Theorem 3.1 also appears to be new. Finally, Corollary 3.1 further specifies Theorem 2.2 to the situation where the regularization is tuned according to Theorem 3.1, yielding a fully adaptive regularization procedure with explicit confidence bounds.

Bandits optimization

When applied to the setting of multi-armed bandits, Theorems 4.2 and 4.1 respectively extend linear TS (Agrawal and Goyal, 2014; Abeille and Lazaric, 2016) and UCB (Li et al., 2010; Chu et al., 2011) to the RKHS setting. Similar extensions have been provided in the literature: GP-UCB (Srinivas et al., 2010) generalizes UCB from the linear to the RKHS setting through the use of Gaussian processes; this corresponds to the case when . The bounds they provide in the case when the target function belongs to an RKHS are however quite loose. KernelUCB (Valko et al., 2013) also generalizes UCB from the linear to the RKHS setting through the use of kernel regression. However, the analysis of this algorithm was out of reach of their proof technique (which requires independence between arms), and they analyze instead the arguably less appealing variant called SupKernelUCB. Also, the analyses of GP-UCB and SupKernelUCB in the agnostic setting are respectively limited to bounded noise and bounded observations.

6 Illustrative numerical experiments

In this section, we illustrate the results introduced in Sections 2 and 3 on a few examples. The first one is the concentration result on the mean from Theorem 2.1, the second one is the variance estimate from Theorem 3.1, and the last one combines the two by using the noise estimate to tune in Theorem 2.2, which corresponds to Corollary 3.1. We finally show the performance of kernelized bandits techniques using the provided variance estimates and adaptive regularization schemes.

We conduct the experiments using the function shown by Figure 1, which has norm in the RKHS induced by a Gaussian kernel with length scale . We consider the space and that the standard deviation of the noise is . All further experiments use the upper-bound on and the lower-bound on .

6.1 Kernel concentration bound

The following experiments compare the concentration result given by Theorem 2.1 with the kernel concentration bounds from Wang and de Freitas (2014) reported by Lemma 5.1. The true noise is assumed to be known and all observations are uniformly sampled from . In both cases, we use a fixed confidence level . Figure 2 shows that for , the result given by Theorem 2.1 recovers the confidence envelope of Wang and de Freitas (2014). Note however that the confidence bounds that we plot for Theorem 2.1 are valid uniformly over all time steps, while the one derived from Wang and de Freitas (2014) is only valid separately for each time. Further, Theorem 2.1 generalizes the latter result to the case where . For illustration, Figure 3 shows the confidence envelopes in the special case where , which also shows the potential benefit of such a tuning.

6.2 Empirical variance estimate

We now illustrate the convergence rate of the noise estimates and computed using Theorem 3.1, where and . All observations are uniformly sampled from . Section 3 suggests that should provide tighter bounds than a fixed . Figure 4 shows that this is indeed the case especially for large values of . We also see that the adaptive update of converges to the same value, whatever the initial bound . This is especially interesting when is a loose initial upper bound on .

In practice, the bound of Theorem 3.1 not using the knowledge of may be useful even when is known. This is illustrated by Figure 4(a), which plots the upper-bound variance estimate for in both cases. In practice, we suggest using the minimum of the bound using the knowledge of (case 1) and of the agnostic one (case 2) to set and the maximum for . Figure 4(b) shows the resulting noise estimate envelopes for different values (recall that ).

We now combine the previous experiments and use the estimated noise in order to tune the regularization. Recall that we consider , , and . At each time , we estimate the noise lower bound using Theorem 3.1 and set . We then compute the upper-bound noise estimate using Theorem 3.1 and set . We are now ready to compute the confidence interval given by Corollary 3.1. Note that is used everywhere and all observations are uniformly sampled from . Figure 6 illustrates the resulting confidence envelope of this fully empirical model for noise upper-bound (recall that the noise satisfies ) plotted against the confidence envelope obtained with Theorem 2.1 with fixed . We observe the improvement of the confidence intervals with the number of observations. Recall that this setting is especially challenging since the variance is unknown, the regularization parameter is tuned online, and the confidence bounds are valid uniformly over all time steps.

6.4 Kernelized bandits optimization

In this section, we now evaluate the potential of kernelized bandits algorithms with variance estimate. We consider as the linearly discretized space into 100 arms. Recall that the goal is to minimize the cumulative regret (Equation 4) and that we are optimizing the function shown by Figure 1 with . We evaluate Kernel UCB (Equation 5) and Kernel TS (Algorithm 1 with ) with three different configurations:

a) the oracle, that is with fixed , assuming knowledge of ;

b) the fixed , that is the best one can do without prior knowledge of ;

c) the adaptive regularization tuned with Corollary 3.1.

All configurations use . Kernel UCB uses and Kernel TS uses such that their regret bounds respectively hold with probability . Recall that observations are now sampled from using the bandits algorithms (they are not i.i.d.). Configurations b) and c) use , while the oracle a) uses . Figure 7 shows the cumulative regret averaged over 100 repetitions. Note that the oracle corresponds to the best performance that could be expected by Kernel UCB and Kernel TS given knowledge of the noise. The plots confirm that adaptively tuning the regularization using the variance estimates can lead to a major improvement compared to using a fixed, inaccurate guess: after an initial burn-in phase, the regret of the adaptively tuned algorithm increases at the same rate as that of the oracle algorithm knowing the noise exactly. The fact that Kernel UCB substantially outperforms Kernel TS suggests that inflating the variance in Kernel TS, as suggested by the theory presented previously, may not be optimal in practice. Further attention should be given to this question.

In order to evaluate the benefit of the concentration bound provided by Theorem 2.1, we compare the Kernel TS (Algorithm 1) oracle using and , where is given by Theorem 2.1, against where is given by Lemma 5.1 (Wang and de Freitas, 2014) with . Figure 8 shows that the concentration bound given by Theorem 2.1 improves the performance of Kernel TS compared with existing concentration results (Wang and de Freitas, 2014). It highlights the relevance of making the regularization parameter explicit, which allows us to take advantage of regularization rates that may be better adapted.

7 Conclusion

This work addresses two problems: the online tuning of the regularization parameter in streaming kernel regression and the online estimation of the noise variance. To this end, we introduce novel concentration bounds on the posterior mean estimate in streaming kernel regression with fixed and explicit regularization (Theorem 2.1), which we then extend to the setting where the regularization parameter is tuned (Theorem 2.2). We further introduce upper- and lower-bound estimates of the noise variance (Theorem 3.1). Putting these tools together, we show how the estimate of the noise variance can be used to tune the kernel regularization in an online fashion (Corollary 3.1) while retaining theoretical guarantees. We also show how to use the proposed results in order to derive kernelized variations of the most common bandits algorithms, UCB and Thompson sampling, for which regret bounds are also provided (Theorems 4.1 and 4.2).

All the proposed results and tools are illustrated through numerical experiments. The obtained results show the relevance of the introduced kernel regression concentration intervals for explicit regularization, which hold when the regularization does not correspond to the noise variance. The potential of the proposed regularization tuning procedure is illustrated through the application to kernelized bandits, where the benefit of adaptive regularization is clear when the noise variance is unknown (this is usually the case in practice). Finally, one must note that a major strength of the tools proposed in this work is to allow for an adaptively tuned regularization parameter while preserving theoretical guarantees, which is not the case when regularization is tuned for example by cross-validation.

Future work includes a natural extension of these techniques to obtain an empirical estimate of the kernel length scales. This information is often assumed to be known, while in practice it is often not available. Although some preliminary work has been done in that direction (Wang and de Freitas, 2014), designing theoretically motivated algorithms addressing these concerns would help to fill an important gap between theory and practice. On a different matter, the current work gives the basis for performing Thompson sampling in RKHS, and could be extended to the contextual setting in the near future, as was done with CGP-UCB (Krause and Ong, 2011; Valko et al., 2013).

\acks

This work was supported through funding from the Natural Sciences and Engineering Research Council of Canada (NSERC, Canada), the REPARTI strategic network (FRQ-NT, Québec), MITACS, and E Machine Learning Inc. O.-A. M. acknowledges the support of the French Agence Nationale de la Recherche (ANR), under grant ANR-16-CE40-0002 (project BADASS).

References

• Abbasi-Yadkori et al. (2011) Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems 24 (NIPS), pages 2312–2320, 2011.
• Abeille and Lazaric (2016) M. Abeille and A. Lazaric. Linear Thompson sampling revisited. arXiv preprint arXiv:1611.06534, 2016.
• Abramowitz and Stegun (1964) M. Abramowitz and I. A. Stegun. Handbook of mathematical functions: with formulas, graphs, and mathematical tables, volume 55. Courier Corporation, 1964.
• Agrawal and Goyal (2014) S. Agrawal and N. Goyal. Thompson sampling for contextual bandits with linear payoffs. arXiv preprint arXiv:1209.3352, 2014.
• Auer et al. (2002) P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.
• Brochu et al. (2008) E. Brochu, N. De Freitas, and A. Ghosh. Active preference learning with discrete choice data. In Advances in Neural Information Processing Systems 21 (NIPS), pages 409–416, 2008.
• Chu et al. (2011) W. Chu, L. Li, L. Reyzin, and R. E. Schapire. Contextual bandits with linear payoff functions. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 15, pages 208–214, 2011.
• Cover and Thomas (1991) T. M. Cover and J. A. Thomas. Elements of information theory. 1991.
• Krause and Ong (2011) A. Krause and C. S. Ong. Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems 24 (NIPS), pages 2447–2455, 2011.
• Li et al. (2010) L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web (WWW), pages 661–670, 2010.
• Maillard (2016) O.-A. Maillard. Self-normalization techniques for streaming confident regression. working paper or preprint, May 2016.
• Marchant and Ramos (2014) R. Marchant and F. Ramos. Bayesian optimisation for informative continuous path planning. In International Conference on Robotics and Automation (ICRA), pages 6136–6143. IEEE, 2014.
• Maurer and Pontil (2009) A. Maurer and M. Pontil. Empirical Bernstein bounds and sample variance penalization. In Proceedings of the 22nd Annual Conference on Learning Theory (COLT), 2009.
• Rasmussen and Williams (2006) C. E. Rasmussen and C. K. I. Williams. Gaussian processes for machine learning. MIT Press, 2006.
• Snoek et al. (2012) J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25 (NIPS), pages 2951–2959, 2012.
• Srinivas et al. (2010) N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning (ICML), 2010.
• Srinivas et al. (2012) N. Srinivas, A. Krause, S. M. Kakade, and M. W. Seeger. Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58(5):3250–3265, 2012.
• Thompson (1933) W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
• Valko et al. (2013) M. Valko, N. Korda, R. Munos, I. Flaounas, and N. Cristianini. Finite-time analysis of kernelised contextual bandits. In Proceedings of the 29th conference on Uncertainty In Artificial Intelligence (UAI), pages 654–665, 2013.
• Wang and de Freitas (2014) Z. Wang and N. de Freitas. Theoretical analysis of Bayesian optimisation with unknown Gaussian process hyper-parameters. arXiv preprint arXiv:1406.7758, 2014.
• Wilson et al. (2014) A. Wilson, A. Fern, and P. Tadepalli. Using trajectory data to improve Bayesian optimization for reinforcement learning. Journal of Machine Learning Research, 15:253–282, 2014.

Appendix A Laplace method for tuned kernel regression

In this section, we want to control the term simultaneously over all . To this end, we resort to a version of the Laplace method carefully extended to the RKHS setting.

Before proceeding, we note that since is a kernel function (that is continuous, symmetric positive definite) on a compact set equipped with a positive finite Borel measure , then there is an at most countable sequence where , and form an orthonormal basis of , such that

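\[
k(x,y)=\sum_{j=1}^{\infty}\sigma_j\psi_j(x)\psi_j(y)
\quad\text{and}\quad
\|f\|_{\cK}^2=\sum_{j=1}^{\infty}\frac{\langle f,\psi_j\rangle_{L_{2,\mu}}^2}{\sigma_j}\,.
\]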

Let . Note that , . Further, if , then and . In particular belongs to the RKHS if and only if . For and , we now denote for , by analogy with the finite dimensional case. Note that .
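As an illustration of the expansion above, the following sketch builds a truncated kernel from a hypothetical eigen-system on $[0,1]$: eigenvalues $\sigma_j=j^{-2}$ and cosine eigenfunctions $\psi_j(x)=\sqrt{2}\cos(\pi j x)$. Both are chosen here for convenience and are not taken from the paper. It checks numerically that, for $f=\sum_j\theta_j\sqrt{\sigma_j}\psi_j$, the RKHS norm formula reduces to the squared Euclidean norm of the coefficient sequence $\theta$.

```python
import math

J = 5    # truncation level (assumption: finitely many eigenpairs)
N = 64   # number of midpoint-quadrature nodes on [0, 1]
sigma = [1.0 / j**2 for j in range(1, J + 1)]   # hypothetical eigenvalues


def psi(j, x):
    """j-th eigenfunction (1-indexed), orthonormal in L2([0, 1])."""
    return math.sqrt(2.0) * math.cos(math.pi * j * x)


def kernel(x, y):
    """Truncated expansion k(x, y) = sum_j sigma_j psi_j(x) psi_j(y)."""
    return sum(s * psi(j, x) * psi(j, y) for j, s in enumerate(sigma, start=1))


def l2_inner(f, g):
    """Midpoint-rule approximation of the L2([0, 1]) inner product."""
    nodes = [(i + 0.5) / N for i in range(N)]
    return sum(f(x) * g(x) for x in nodes) / N


theta = [0.5, -1.0, 0.25, 0.0, 2.0]   # coefficients in the scaled basis


def f(x):
    """f = sum_j theta_j sqrt(sigma_j) psi_j, a function in the RKHS."""
    return sum(t * math.sqrt(s) * psi(j, x)
               for j, (t, s) in enumerate(zip(theta, sigma), start=1))


# RKHS norm via ||f||_K^2 = sum_j <f, psi_j>^2 / sigma_j.
rkhs_norm_sq = sum(
    l2_inner(f, lambda x, j=j: psi(j, x)) ** 2 / s
    for j, s in enumerate(sigma, start=1)
)
theta_norm_sq = sum(t * t for t in theta)
```

With these choices, `rkhs_norm_sq` and `theta_norm_sq` agree up to floating-point error, which is the finite-dimensional analogue of the statement that $f$ belongs to the RKHS exactly when its coefficient sequence is square-summable.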

In the sequel, the following martingale control will be a key component of the analysis.

Lemma A.1 (Hilbert Martingale Control)

Assume that the noise sequence is conditionally -sub-Gaussian

\[
\forall t\in\Nat,\ \forall\gamma\in\Real,\quad
\ln\Esp\big[\exp(\gamma\xi_t)\,\big|\,\cH_{t-1}\big]\le\frac{\gamma^2\sigma^2}{2}\,.
\]

Let be a stopping time with respect to the filtration generated by the variables . For any such that , and deterministic positive , let us denote

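\[
M^q_{m,\lambda}=\exp\bigg(\sum_{t=1}^{m}\frac{q^\top\phi(x_t)}{\sqrt{\lambda}}\,\xi_t
-\frac{\sigma^2}{2}\sum_{t=1}^{m}\frac{(q^\top\phi(x_t))^2}{\lambda}\bigg)
\]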

Then, for all such the quantity is well defined and satisfies

\[
\ln\Esp\big[M^q_{\tau,\lambda}\big]\le 0\,.
\]
{proof}

The only difficulty in the proof is to handle the stopping time. Indeed, for all , thanks to the conditional -sub-Gaussian property, it is immediate to show that is a non-negative super-martingale and actually satisfies .
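For a fixed horizon $m$, the bound $\Esp[M^q_{m,\lambda}]\le 1$ can be checked exactly on a toy instance. The sketch below assumes Rademacher noise ($\xi_t=\pm 1$ with equal probability, which satisfies the sub-Gaussian condition with $\sigma=1$) and a hypothetical deterministic sequence $a_t=q^\top\phi(x_t)/\sqrt{\lambda}$; neither choice comes from the paper. It enumerates all sign sequences, so the expectation is computed without sampling error.

```python
import itertools
import math

# Hypothetical values of a_t = q^T phi(x_t) / sqrt(lambda), for t = 1..m.
a = [0.3, -0.2, 0.5, 0.1, -0.7]
sigma = 1.0  # Rademacher noise: E[exp(g * xi)] = cosh(g) <= exp(g^2 / 2)


def martingale_value(noise):
    """M_m = exp(sum_t a_t xi_t - (sigma^2 / 2) sum_t a_t^2)."""
    linear = sum(a_t * xi for a_t, xi in zip(a, noise))
    quadratic = 0.5 * sigma**2 * sum(a_t**2 for a_t in a)
    return math.exp(linear - quadratic)


# Exact expectation: average over all 2^m equally likely sign sequences.
sequences = list(itertools.product([-1.0, 1.0], repeat=len(a)))
expected_M = sum(martingale_value(xi) for xi in sequences) / len(sequences)

# Closed form for comparison: prod_t cosh(a_t) * exp(-a_t^2 / 2).
closed_form = math.prod(math.cosh(a_t) * math.exp(-0.5 * a_t**2) for a_t in a)
```

Each factor $\cosh(a_t)\exp(-a_t^2/2)$ is at most one, so `expected_M` does not exceed 1. This covers only a deterministic horizon; the stopping time is the part that needs the additional argument in the proof.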

By the convergence theorem for nonnegative super-martingales, is almost surely well-defined, and thus is well-defined (whether or not) as well. In order to show that , we introduce a stopped version of . Now by Fatou’s lemma, which concludes the proof. We refer to (Abbasi-Yadkori et al., 2011) for further details.
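The stopping-time step follows the standard argument of Abbasi-Yadkori et al. (2011). A sketch of it is given below; the symbol $Q_m$ is introduced here for illustration, and the fixed-horizon super-martingale property of $M^q_{m,\lambda}$ is taken as given.

```latex
\begin{align*}
Q_m &= M^q_{\min\{\tau,m\},\lambda}
  && \text{(stopped process, still a non-negative super-martingale)}\\
\Esp[Q_m] &\le \Esp[Q_0] = 1
  && \text{(optional stopping at the bounded time } \min\{\tau,m\}\text{)}\\
\Esp\big[M^q_{\tau,\lambda}\big] &= \Esp\Big[\liminf_{m\to\infty} Q_m\Big]
  \le \liminf_{m\to\infty}\Esp[Q_m] \le 1
  && \text{(Fatou's lemma)}
\end{align*}
```

Taking logarithms in the last line gives $\ln\Esp[M^q_{\tau,\lambda}]\le 0$.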

We are now ready to prove the following result.

Proof of Theorem 2.2 (Streaming Kernel Least-Squares)  We make use of the features in an explicit way. Let . For , we denote its corresponding parameter sequence. We let be a matrix built from the features and introduce the bi-infinite matrix as well as the noise vector . In order to control the term , we first decompose the estimation term. Indeed, using the feature map, it holds that

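\begin{align*}
f_{\lambda,t}(x)
&= k_t(x)^\top(K_t+\lambda I_t)^{-1}Y_t\\
&= \phi(x)^\top\Phi_t^\top(\Phi_t\Phi_t^\top+\lambda I_t)^{-1}Y_t\\
&= \phi(x)^\top\Phi_t^\top\Big(\frac{I_t}{\lambda}-\frac{1}{\lambda}\Phi_t(\lambda I+\Phi_t^\top\Phi_t)^{-1}\Phi_t^\top\Big)Y_t\\
&= \phi(x)^\top(\Phi_t^\top\Phi_t+\lambda I)^{-1}\Phi_t^\top(\Phi_t\theta^\star+E_t)\,,
\end{align*}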

where in the third line, we used the Sherman-Morrison formula. From this, simple algebra yields

\[
f_{\lambda,t}(x)-f_\star(x)=\frac{1}{\lambda}\,\phi(x)^\top V_{\lambda,t}^{-1}\big(\Phi_t^\top E_t-\lambda\theta^\star\big)\,.
\]
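These identities can be checked on a finite-dimensional toy problem. The sketch below uses exact rational arithmetic with a hypothetical feature matrix, parameter, and noise vector, and it assumes $V_{\lambda,t}=I+\lambda^{-1}\Phi_t^\top\Phi_t$, the finite-dimensional analogue (treat this as an assumption if the paper's definition differs). It verifies that the dual form $k_t(x)^\top(K_t+\lambda I_t)^{-1}Y_t$ and the primal form $\phi(x)^\top(\Phi_t^\top\Phi_t+\lambda I)^{-1}\Phi_t^\top Y_t$ coincide, and that the estimation error matches the last display.

```python
from fractions import Fraction as F


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def add_ridge(A, lam):
    """Return A + lam * I."""
    return [[v + (lam if i == j else 0) for j, v in enumerate(row)]
            for i, row in enumerate(A)]


def solve(A, b):
    """Solve A z = b exactly by Gauss-Jordan elimination (A square, invertible)."""
    n = len(A)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        pivot = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[pivot] = M[pivot], M[c]
        M[c] = [v / M[c][c] for v in M[c]]
        for r in range(n):
            if r != c:
                factor = M[r][c]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[c])]
    return [M[r][n] for r in range(n)]


# Hypothetical toy data: t = 3 observations, d = 2 features.
Phi = [[F(1), F(2)], [F(0), F(1)], [F(3), F(-1)]]   # rows are phi(x_s)^T
theta_star = [F(1, 2), F(-1)]                        # true parameter
E = [F(1, 10), F(-1, 5), F(3, 10)]                   # noise vector E_t
phi_x = [F(2), F(1)]                                 # features of the query point x
lam = F(1, 2)

Y = [dot(row, theta_star) + e for row, e in zip(Phi, E)]   # Y_t = Phi theta* + E_t
PhiT = transpose(Phi)

# Dual (kernel) form: k_t(x)^T (K_t + lam I_t)^{-1} Y_t.
K = matmul(Phi, PhiT)
k_x = [dot(row, phi_x) for row in Phi]
f_dual = dot(k_x, solve(add_ridge(K, lam), Y))

# Primal (feature) form: phi(x)^T (Phi^T Phi + lam I)^{-1} Phi^T Y_t.
G = matmul(PhiT, Phi)
PhiT_Y = [dot(row, Y) for row in PhiT]
f_primal = dot(phi_x, solve(add_ridge(G, lam), PhiT_Y))

# Error form: (1/lam) phi(x)^T V^{-1} (Phi^T E_t - lam theta*), V = I + Phi^T Phi / lam.
V = add_ridge([[g / lam for g in row] for row in G], F(1))
rhs = [dot(row, E) - lam * th for row, th in zip(PhiT, theta_star)]
error_form = dot(phi_x, solve(V, rhs)) / lam

f_star_x = dot(phi_x, theta_star)
```

Because the arithmetic is exact, `f_dual == f_primal` and `f_primal - f_star_x == error_form` hold as equalities rather than up to a tolerance. In the RKHS setting the feature space is infinite-dimensional, so this only illustrates the algebra, not the functional-analytic details.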

We then obtain, from a simple Hölder inequality using the appropriate matrix norm, the following decomposition, that is valid provided that all terms involved are finite.

 |fλ,t(x)−f