Gaussian processes (GPs) are a popular Bayesian approach to the optimization of black-box functions. Despite their effectiveness on simple problems, GP-based algorithms hardly scale to complex high-dimensional functions, as their per-iteration time and space cost is at least quadratic in the number of dimensions d and iterations t. Given a set of A alternatives to choose from, the overall runtime \mathcal{O}(t^{3}A) quickly becomes prohibitive. In this paper, we introduce BKB (budgeted kernelized bandit), a novel approximate GP algorithm for optimization under bandit feedback that achieves near-optimal regret (and hence a near-optimal convergence rate) with near-constant per-iteration complexity and no assumption on the input space or the covariance of the GP.

Combining a kernelized linear bandit algorithm (GP-UCB) with a randomized matrix sketching technique (leverage score sampling), we prove that selecting inducing points based on their posterior variance gives an accurate low-rank approximation of the GP, preserving variance estimates and confidence intervals. As a consequence, BKB does not suffer from variance starvation, an important problem faced by many previous sparse GP approximations. Moreover, we show that our procedure selects at most \widetilde{\mathcal{O}}(d_{\text{eff}}) points, where d_{\text{eff}} is the effective dimension of the explored space, which is typically much smaller than both d and t. This greatly reduces the dimensionality of the problem, leading to an \mathcal{O}(TAd_{\text{eff}}^{2}) runtime and \mathcal{O}(Ad_{\text{eff}}) space complexity.


Gaussian Process Optimization with Adaptive Sketching: Scalable and No Regret

Daniele Calandriello (daniele.calandriello@iit.it), LCSL - Istituto Italiano di Tecnologia, Genova, Italy & MIT, Cambridge, USA
Luigi Carratino (luigi.carratino@dibris.unige.it), UNIGE - Università degli Studi di Genova, Genova, Italy
Alessandro Lazaric (lazaric@fb.com), FAIR - Facebook AI Research, Paris, France
Michal Valko (michal.valko@inria.fr), INRIA Lille - Nord Europe, SequeL team, Lille, France
Lorenzo Rosasco (lrosasco@mit.edu), LCSL - Istituto Italiano di Tecnologia, Genova, Italy & MIT, Cambridge, USA; UNIGE - Università degli Studi di Genova, Genova, Italy

Keywords: Gaussian Process Optimization; Kernelized Linear Bandit; Regret; Sketching; Bayesian Optimization; Black Box Optimization; Variance Starvation

1 Introduction

Efficiently selecting the best alternative out of a set of alternatives is an important problem in sequential decision making, with practical applications ranging from recommender systems (Li et al., 2010) to experimental design (Robbins, 1952). It is also the main focus of research in bandits (Lattimore and Szepesvári) and Bayesian optimization (Mockus, 1989; Pelikan, 2005; Snoek et al., 2012), which study optimization under bandit feedback. In this setting, a learning algorithm sequentially interacts with a reward/utility function f. Over T interactions, the algorithm chooses a point \mathbf{x}_{t} and has access only to a noisy black-box evaluation of f at \mathbf{x}_{t}. The goal of the algorithm is to minimize the cumulative regret, which compares the reward accumulated at the points selected over time (i.e., \sum_{t}f(\mathbf{x}_{t})) to the reward obtained by repeatedly selecting the optimum of the function (i.e., T\max_{x}f(x)). In this paper we focus on the Gaussian process optimization approach to this problem. In particular, we study the GP-UCB algorithm first introduced by Srinivas et al. (2010).

Starting from a Gaussian process prior over f, GP-UCB alternates between evaluating the function and using the evaluations to build a posterior of f. This posterior is composed of a mean function \mu that estimates the value of f and a variance function \sigma that captures the uncertainty of \mu. These two quantities are combined in a single upper confidence bound (UCB) that drives the selection of the evaluation points and trades off between evaluating high-reward points (i.e., exploitation) and testing possibly sub-optimal points to reduce the uncertainty on the function (i.e., exploration). While other approaches to select promising points exist, such as expected improvement (EI) and maximum probability of improvement, it is not known whether they can achieve low regret. The performance of GP-UCB has been extensively studied (Srinivas et al., 2010; Chowdhury and Gopalan, 2017), and it provably achieves low regret in both Bayesian and non-Bayesian frameworks. However, the main limiting factor to its applicability is its computational cost. When choosing between A alternatives, GP-UCB requires \Omega(At^{2}) time/space to select each new point, and therefore does not scale to complex functions that require many iterations to be optimized. Several approximations of GP-UCB have been suggested before (Quinonero-Candela et al., 2007; Liu et al., 2018):
Inducing points: The GP can be restricted to lie in the range of a small subset of inducing points. The subset should cover the whole space well for accuracy, but also be as small as possible for efficiency. Several methods, referred to as sparse GPs, have been proposed to select the inducing points and an approximation based on the subset. Popular instances of this approach are the subset of regressors (SoR, Wahba, 1990) and the deterministic training conditional (DTC, Seeger et al., 2003). While these methods are simple to interpret and efficient, they do not come with regret guarantees. Moreover, when the subset does not cover the space well, they suffer from variance starvation (Wang et al., 2018), as they underestimate the variance of points far away from the inducing points.
Random Fourier features: Another approach is to use explicit feature expansions to approximate the GP covariance function and embed the points in a low-dimensional space. These methods usually exploit Fourier expansions (Rahimi and Recht, 2007), which are possible only for specific covariances (i.e., stationary kernels), are not data adaptive, and usually scale poorly with the input dimension. Quadrature-based methods are the most recent example of this approach: Mutny and Krause (2018) proved that, by exploiting the structure of stationary kernels, an approximate GP-UCB can achieve low regret. However, their method does not extend to non-stationary kernels, and although its dependency on t is small, its cost scales exponentially with the input dimension.
Variational inference: Another approach is to replace the true GP likelihood with a variational approximation that can be optimized efficiently. Although recent methods provide guarantees on the approximate posterior mean and variance (Huggins et al., 2019), these guarantees apply only to GP regression and not to the harder optimization setting.
Linear case: In the linear bandit literature, several methods to reduce the complexity of algorithms such as LinUCB have been proposed. Kuzborskij et al. (2019) use Frequent Directions (FD, Ghashami et al., 2016) to project the design matrix onto a smaller subspace. Unfortunately, the size of the subspace has to be specified in advance, and when it is not sufficiently large the method can suffer linear regret. Prior to the FD approach, CBRAP (Yu et al., 2017) used random projections instead, but faced similar issues. This turns out to be a fundamental weakness of all approaches that do not adapt to the actual size of the space spanned by the sequence of points selected by the learning algorithm. Indeed, Ghosh et al. (2017) proved a lower bound for the linear case showing that as soon as a single arm is misspecified we can suffer linear regret.

1.1 Contributions

In this paper, we show a way to adapt the size of the projected space online and devise an algorithm, BKB (budgeted kernel bandit), that achieves near-optimal regret guarantees with a computational complexity that can be drastically smaller than GP-UCB's. This is achieved without assumptions on the dimensionality of the input or on the kernel function. BKB leverages several well-known tools from the literature: a DTC approximation of the posterior variance based on inducing points, and a confidence-interval construction based on state-of-the-art self-normalized concentration inequalities (Abbasi-Yadkori et al., 2011). It also introduces two novel tools: a provably accurate strategy to select inducing points based on ridge leverage score (RLS) sampling (Alaoui and Mahoney, 2015), and a novel approximate confidence interval for the function that is nearly as accurate as that of GP-UCB yet efficient to compute.

Ridge leverage score sampling was originally introduced for randomized kernel matrix approximation (Alaoui and Mahoney, 2015). In the context of GPs, RLSs correspond to the posterior variances of points, which allows adapting algorithms and their guarantees from the RLS sampling literature to the GP setting. This solves two critical problems in sparse GP and linear bandit approximations. First, BKB constructs estimates of the variance that are provably accurate, i.e., it does not suffer from variance starvation, which results in provably accurate confidence bounds as well. The only method with comparable guarantees, Thompson sampling with quadrature Fourier features (Mutny and Krause, 2018), applies only to stationary kernels, and only in extremely low dimension. Moreover, our approximation guarantees are qualitatively different, as they do not require a corresponding uniform approximation bound on the GP.
Second, BKB adaptively chooses the size of the inducing point set based on the effective dimension d_{\text{eff}} of the problem, also known as degrees of freedom of the model. This is crucial to achieve low regret, since fixed approximation schemes may suffer linear regret. Moreover, in a problem with A arms, using a set of \mathcal{O}(d_{\text{eff}}) inducing points results in an algorithm with \mathcal{O}(Ad_{\text{eff}}^{2}) per-step runtime and \mathcal{O}(Ad_{\text{eff}}) space, a significant improvement over the \mathcal{O}(At^{2}) time and \mathcal{O}(At) space cost of GP-UCB.

Finally, while in this work we only address kernelized (GP) bandits, our approach could be extended to more complex online learning problems, such as recent advances in kernelized reinforcement learning (Chowdhury and Gopalan, 2019). Moreover, inducing-point methods have clear model interpretability, and our analysis provides insight from both a bandit and a Bayesian optimization perspective, making it applicable to a wide range of downstream tasks.

2 Background

Notation. We use lower-case letters a for scalars, lower-case bold letters \mathbf{a} for vectors, and upper-case bold letters \mathbf{A} for matrices and operators, where [\mathbf{A}]_{ij} denotes the element (i,j). We denote by \|\mathbf{x}\|_{\mathbf{A}}^{2}=\mathbf{x}^{\mathsf{T}}\mathbf{A}\mathbf{x} the norm with metric \mathbf{A}, and \|\mathbf{x}\|=\|\mathbf{x}\|_{\mathbf{I}} with \mathbf{I} the identity. Finally, we denote the first T integers by [T]:=\{1,\ldots,T\}.

Online optimization under bandit feedback.

Let f:\mathcal{A}\rightarrow\mathbb{R} be a reward function that we wish to optimize over a set of decisions \mathcal{A}, also called actions or arms. For simplicity, we assume that \mathcal{A}=\{\mathbf{x}_{i}\}_{i=1}^{A} is a fixed finite set of A vectors in \mathbb{R}^{d}. We discuss how to relax these assumptions in Section 4. In optimization under bandit feedback, a learner aims to optimize f through a sequence of interactions. At each step t\in[T] the learner 1) chooses an arm \mathbf{x}_{t}\in\mathcal{A}, 2) receives reward y_{t}=f(\mathbf{x}_{t})+\eta_{t}, where \eta_{t} is a zero-mean noise, 3) updates its model of the problem.

The goal of the learner is to minimize its cumulative regret R_{T}=\sum_{t=1}^{T}f(\mathbf{x}_{*})-f(\mathbf{x}_{t}) w.r.t. the best arm \mathbf{x}_{*}=\operatorname*{arg\,max}_{\mathbf{x}_{i}\in\mathcal{A}}f(\mathbf{x}_{i}), where we assume a consistent and arbitrary tie-breaking strategy. In particular, the objective of a no-regret algorithm is to have R_{T}/T go to zero as fast as possible as T grows. We recall that the regret is strictly related to the convergence rate and the optimization performance. In fact, let \overline{\mathbf{x}}_{T} be an arm chosen uniformly at random from the sequence of arms (\mathbf{x}_{1},\ldots,\mathbf{x}_{T}) selected by the learner; then f(\mathbf{x}_{*})-\mathbb{E}[f(\overline{\mathbf{x}}_{T})]\leq R_{T}/T.
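The interaction protocol and the cumulative regret above can be sketched as a short loop. Everything in this snippet (the toy function, the arm grid, the uniform policy, and the `select_arm` interface) is an illustrative assumption, not part of the paper:

```python
import numpy as np

def run_bandit(f, arms, select_arm, T, noise_std=0.1, seed=0):
    """Generic optimization-under-bandit-feedback loop.

    At each step: 1) choose an arm, 2) observe a noisy evaluation,
    3) record it. Returns R_T = sum_t f(x*) - f(x_t).
    """
    rng = np.random.default_rng(seed)
    best = max(f(x) for x in arms)          # f(x*), used only to measure regret
    history, regret = [], 0.0
    for t in range(T):
        i = select_arm(history, arms, rng)                   # 1) choose x_t
        y = f(arms[i]) + noise_std * rng.standard_normal()   # 2) noisy feedback
        history.append((i, y))                               # 3) update history
        regret += best - f(arms[i])
    return regret

# A non-adaptive uniform policy does not learn, so its regret grows linearly in T.
f = lambda x: -float(np.sum((x - 0.5) ** 2))
arms = np.linspace(0.0, 1.0, 11).reshape(-1, 1)
uniform = lambda hist, arms, rng: rng.integers(len(arms))
R_T = run_bandit(f, arms, uniform, T=100)
```

Note that R_T is non-negative by construction, since f(\mathbf{x}_{*}) upper-bounds the reward of every arm.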

2.1 Gaussian process optimization and GP-UCB

A popular no-regret algorithm for optimization under bandit feedback is GP-UCB, introduced by Srinivas et al. (2010) in the context of Gaussian process optimization. We first introduce the formal definition of a Gaussian process, and then briefly present the GP-UCB algorithm.
A Gaussian process GP(\mu,k) (Rasmussen and Williams, 2006) is a generalization of the Gaussian distribution to a space of functions, defined by a mean function \mu(\cdot) and a covariance function k(\cdot,\cdot). W.l.o.g. we consider zero-mean GP(0,k) priors and bounded covariance k(\mathbf{x}_{i},\mathbf{x}_{i})\leq\kappa^{2} for all \mathbf{x}_{i}\in\mathcal{A}. An important property of Gaussian processes is that if we combine a prior f\sim GP(0,k) with zero-mean Gaussian observation noise (i.e., \eta_{t}\sim\mathcal{N}(0,\xi^{2})), then the posterior distribution of f conditioned on a set of observations \{(\mathbf{x}_{s},y_{s})\}_{s=1}^{t} is also a GP. More precisely, let \mathbf{X}_{t}=[\mathbf{x}_{1},\dots,\mathbf{x}_{t}]^{\mathsf{T}}\in\mathbb{R}^{t\times d} be the matrix of all arms selected so far and \mathbf{y}_{t}=[y_{1},\dots,y_{t}]^{\mathsf{T}} the corresponding observations; then the posterior is still a GP, and the mean and variance of the function at a test point \mathbf{x} are defined as

\displaystyle\mu_{t}(\mathbf{x}\mid\mathbf{X}_{t},\mathbf{y}_{t})=\mathbf{k}_{t}(\mathbf{x})^{\mathsf{T}}(\mathbf{K}_{t}+\lambda\mathbf{I})^{-1}\mathbf{y}_{t}, (1)
\displaystyle\sigma_{t}^{2}(\mathbf{x}\mid\mathbf{X}_{t})=k(\mathbf{x},\mathbf{x})-\mathbf{k}_{t}(\mathbf{x})^{\mathsf{T}}(\mathbf{K}_{t}+\lambda\mathbf{I})^{-1}\mathbf{k}_{t}(\mathbf{x}), (2)

where \lambda=\xi^{2}, \mathbf{K}_{t}\in\mathbb{R}^{t\times t} is the kernel matrix [\mathbf{K}_{t}]_{i,j}=k(\mathbf{x}_{i},\mathbf{x}_{j}) constructed from all pairs \mathbf{x}_{i},\mathbf{x}_{j} in \mathbf{X}_{t}, and \mathbf{k}_{t}(\mathbf{x})=[k(\mathbf{x}_{1},\mathbf{x}),\dots,k(\mathbf{x}_{t},\mathbf{x})]^{\mathsf{T}}. Notice that \mathbf{k}_{t}(\mathbf{x}) can be interpreted as an embedding of an arm \mathbf{x} “supported” over the arms \mathbf{x}_{1},\ldots,\mathbf{x}_{t} observed so far.
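Equations 1 and 2 translate directly into a few lines of linear algebra. The Gaussian kernel, its bandwidth, and the test values below are illustrative assumptions:

```python
import numpy as np

def gauss_kernel(A, B, bw=0.1):
    # toy squared-exponential covariance k(a, b) = exp(-||a-b||^2 / (2*bw))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bw))

def gp_posterior(k, X, y, x, lam):
    """Posterior mean (Eq. 1) and variance (Eq. 2) at a test point x, lam = xi^2."""
    t = X.shape[0]
    K = k(X, X)                                     # [K_t]_{ij} = k(x_i, x_j)
    kx = k(X, x[None, :])[:, 0]                     # k_t(x)
    w = np.linalg.solve(K + lam * np.eye(t), kx)    # (K_t + lam I)^{-1} k_t(x)
    mu = w @ y                                      # Eq. (1)
    var = k(x[None, :], x[None, :])[0, 0] - w @ kx  # Eq. (2)
    return mu, var

X = np.array([[0.0], [0.5]]); y = np.array([1.0, 0.0])
mu, var = gp_posterior(gauss_kernel, X, y, np.array([0.0]), lam=0.01)
# near an observed point, the mean tracks y and the variance shrinks toward 0
```

The `np.linalg.solve` call avoids forming the inverse explicitly, which is the standard numerically stable way to evaluate these formulas.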
The GP-UCB algorithm is a Bayesian optimization algorithm that uses a Gaussian process GP(0,k) as a prior for f. Inspired by the optimism-in-the-face-of-uncertainty principle, at each time step t, GP-UCB uses the posterior GP to compute the mean and variance of an arm \mathbf{x}_{i} and obtain the score

\displaystyle u_{t}(\mathbf{x}_{i})=\mu_{t}(\mathbf{x}_{i})+\beta_{t}\sigma_{t}(\mathbf{x}_{i}), (3)

where we use the short-hand notation \mu_{t}(\cdot)=\mu(\cdot\mid\mathbf{X}_{t},\mathbf{y}_{t}) and \sigma_{t}(\cdot)=\sigma(\cdot\mid\mathbf{X}_{t}). Finally, GP-UCB chooses the maximizer \mathbf{x}_{t+1}=\operatorname*{arg\,max}_{\mathbf{x}_{i}\in\mathcal{A}}u_{t}(\mathbf{x}_{i}) as the next arm to evaluate. According to the score u_{t}, an arm \mathbf{x} is likely to be selected if it has high mean reward \mu_{t} and high variance \sigma_{t}, i.e., its estimated reward \mu_{t}(\mathbf{x}) is very uncertain. As a result, selecting the arm \mathbf{x}_{t+1} with the largest score trades off between collecting (estimated) large reward (i.e., exploitation) and improving the accuracy of the posterior (i.e., exploration). The parameter \beta_{t} balances these two objectives and must be properly tuned to guarantee low regret. Srinivas et al. (2010) propose different approaches to tune \beta_{t} under different assumptions on f and \mathcal{A}.
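As a sketch, the score in Equation 3 can be computed for all arms at once. The kernel, data, and the β value below are toy assumptions, not the tuning prescribed by Srinivas et al. (2010):

```python
import numpy as np

def gauss_kernel(A, B, bw=0.1):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bw))

def ucb_scores(k, X, y, arms, lam, beta):
    """Vectorized GP-UCB scores u_t(x_i) = mu_t(x_i) + beta_t * sigma_t(x_i)."""
    Kinv = np.linalg.inv(k(X, X) + lam * np.eye(X.shape[0]))
    Kta = k(X, arms)                               # column i holds k_t(x_i)
    mu = Kta.T @ (Kinv @ y)                        # Eq. (1) for every arm
    # per-arm k_t(x_i)^T (K_t + lam I)^{-1} k_t(x_i) via einsum
    var = np.diag(k(arms, arms)) - np.einsum('ta,ts,sa->a', Kta, Kinv, Kta)
    return mu + beta * np.sqrt(np.maximum(var, 0.0))

arms = np.linspace(0.0, 1.0, 5).reshape(-1, 1)
X = np.array([[0.0]]); y = np.array([1.0])
u = ucb_scores(gauss_kernel, X, y, arms, lam=0.01, beta=0.0)
best = int(np.argmax(u))   # with beta = 0 the score is pure exploitation
```

With beta = 0 only the mean matters, so the arm closest to the high-reward observation wins; a larger beta shifts the argmax toward uncertain, unexplored arms.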

While GP-UCB is interpretable, simple to implement, and provably achieves low regret, it is computationally expensive. In particular, computing \sigma_{t}(\mathbf{x}) has a complexity of at least \Omega(t^{2}) for the matrix-vector product (\mathbf{K}_{t-1}+\xi^{2}\mathbf{I})^{-1}\mathbf{k}_{t-1}(\mathbf{x}). Over T iterations and A arms, this results in an overall \mathcal{O}(AT^{3}) computational cost, which does not scale to the large numbers of iterations T needed to optimize complex functions.

3 Budgeted Kernel Bandits

In this section, we introduce the BKB (budgeted kernel bandit) algorithm, a novel efficient approximation of GP-UCB, and we provide guarantees for its computational complexity. The analysis in Section 4 shows that BKB can be tuned to significantly reduce the complexity of GP-UCB with a negligible impact on the regret. We begin by introducing the two major contributions of this section: an approximation of the GP-UCB scores “supported” only by a small subset \mathcal{S}_{t} of inducing points, and a method to incrementally and adaptively construct an accurate subset \mathcal{S}_{t}.

3.1 The algorithm

The main complexity bottleneck in computing the scores in Equation 3 is that, after t steps, the posterior GP is “supported” on all t previously seen arms. As a consequence, evaluating Equations 2 and 1 requires computing a t-dimensional vector \mathbf{k}_{t}(\mathbf{x}) and a t\times t matrix \mathbf{K}_{t}, respectively. To avoid this dependency, we restrict both \mathbf{k}_{t} and \mathbf{K}_{t} to be supported on a subset \mathcal{S}_{t} of m arms. This approach is part of the sparse Gaussian process approximation framework (Quinonero-Candela et al., 2007), or equivalently a linear bandit constrained to a subspace (Kuzborskij et al., 2019).

Approximated GP-UCB scores. Consider a subset of arms \mathcal{S}_{t}=\{\mathbf{x}_{i}\}_{i=1}^{m} and denote by \mathbf{X}_{\mathcal{S}_{t}}\in\mathbb{R}^{m\times d} the matrix with all arms in \mathcal{S}_{t} as rows. Let \mathbf{K}_{\mathcal{S}_{t}}\in\mathbb{R}^{m\times m} be the matrix constructed by evaluating the covariance k between any pair of arms in \mathcal{S}_{t}, and \mathbf{k}_{\mathcal{S}_{t}}(\mathbf{x})=[k(\mathbf{x}_{1},\mathbf{x}),\dots,k(\mathbf{x}_{m},\mathbf{x})]^{\mathsf{T}}. The Nyström embedding \mathbf{z}_{t}(\cdot) associated with the subset \mathcal{S}_{t} is defined as the mapping (recall that in the exact version, \mathbf{k}_{t}(\mathbf{x}) can be seen as an embedding of any arm \mathbf{x} into the space induced by all t arms selected so far, i.e., using all selected points as inducing points)

\displaystyle\mathbf{z}_{t}(\cdot)=\big(\mathbf{K}_{\mathcal{S}_{t}}^{1/2}\big)^{+}\mathbf{k}_{\mathcal{S}_{t}}(\cdot):\mathbb{R}^{d}\rightarrow\mathbb{R}^{m},

where (\cdot)^{+} denotes the pseudo-inverse. We denote by \mathbf{Z}_{t}(\mathbf{X}_{t})=[\mathbf{z}_{t}(\mathbf{x}_{1}),\dots,\mathbf{z}_{t}(\mathbf{x}_{t})]^{\mathsf{T}}\in\mathbb{R}^{t\times m} the associated matrix of embedded points, and we define \mathbf{V}_{t}=\mathbf{Z}_{t}(\mathbf{X}_{t})^{\mathsf{T}}\mathbf{Z}_{t}(\mathbf{X}_{t})+\lambda\mathbf{I}. Then, we approximate the posterior mean and variance of the function at an arm \mathbf{x}_{i} as

\displaystyle\widetilde{\mu}_{t}(\mathbf{x}_{i})=\mathbf{z}_{t}(\mathbf{x}_{i})^{\mathsf{T}}\mathbf{V}_{t}^{-1}\mathbf{Z}_{t}(\mathbf{X}_{t})^{\mathsf{T}}\mathbf{y}_{t},
\displaystyle\widetilde{\sigma}_{t}^{2}(\mathbf{x}_{i})=\frac{1}{\lambda}\Big(k(\mathbf{x}_{i},\mathbf{x}_{i})-\mathbf{z}_{t}(\mathbf{x}_{i})^{\mathsf{T}}\mathbf{Z}_{t}(\mathbf{X}_{t})^{\mathsf{T}}\mathbf{Z}_{t}(\mathbf{X}_{t})\mathbf{V}_{t}^{-1}\mathbf{z}_{t}(\mathbf{x}_{i})\Big).

As a result, the approximate scores are obtained as

\widetilde{u}_{t}(\mathbf{x}_{i})=\widetilde{\mu}_{t}(\mathbf{x}_{i})+\widetilde{\beta}_{t}\widetilde{\sigma}_{t}(\mathbf{x}_{i}), (4)

where \widetilde{\beta}_{t} is appropriately tuned to achieve small regret in the theoretical analysis of Section 4. Finally, at each time step t, BKB selects arm \widetilde{\mathbf{x}}_{t+1}=\operatorname*{arg\,max}_{\mathbf{x}_{i}\in% \mathcal{A}}\widetilde{u}_{t}(\mathbf{x}_{i}).

Notice that in general, \widetilde{\mu}_{t} and \widetilde{\sigma}_{t} do not correspond to any GP posterior. In fact, if we simply replaced k(\mathbf{x}_{i},\mathbf{x}_{i}) in the expression of \widetilde{\sigma}_{t}^{2}(\mathbf{x}_{i}) by its value in the Nyström embedding, i.e., \mathbf{z}_{t}(\mathbf{x}_{i})^{\mathsf{T}}\mathbf{z}_{t}(\mathbf{x}_{i}), we would recover a classical sparse GP approximation, the subset of regressors. Using \mathbf{z}_{t}(\mathbf{x}_{i})^{\mathsf{T}}\mathbf{z}_{t}(\mathbf{x}_{i}) is known to cause variance starvation, i.e., it can severely underestimate the variance of a test point \mathbf{x}_{i} when it is far from the points in \mathcal{S}_{t}. Our formulation of \widetilde{\sigma}_{t} is known in the Bayesian literature as the deterministic training conditional (DTC), where it is used as a heuristic to prevent variance starvation. However, DTC does not correspond to a GP since it violates consistency (Quinonero-Candela et al., 2007). In this work, we justify this approach rigorously, showing that it is crucial to prove the approximation guarantees necessary both for the optimization process and for the construction of the set of inducing points.
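A minimal sketch of the Nyström embedding and the DTC variance estimate above. The toy Gaussian kernel is an illustrative assumption, and computing (\mathbf{K}_{\mathcal{S}_{t}}^{1/2})^{+} via SVD is our implementation choice:

```python
import numpy as np

def gauss_kernel(A, B, bw=0.1):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bw))

def nystrom_embed(k, S, X):
    """Rows are z_t(x) = (K_S^{1/2})^+ k_S(x) for each row x of X."""
    U, s, _ = np.linalg.svd(k(S, S))
    half_pinv = U @ np.diag(np.where(s > 1e-10, 1.0 / np.sqrt(s), 0.0)) @ U.T
    return (half_pinv @ k(S, X)).T

def dtc_variance(k, S, X, x, lam):
    """DTC estimate: keeps the exact k(x,x) term instead of z(x)^T z(x)."""
    Z = nystrom_embed(k, S, X)                # Z_t(X_t)
    z = nystrom_embed(k, S, x[None, :])[0]    # z_t(x)
    V = Z.T @ Z + lam * np.eye(Z.shape[1])    # V_t
    kxx = k(x[None, :], x[None, :])[0, 0]
    return (kxx - z @ (Z.T @ Z) @ np.linalg.solve(V, z)) / lam

S = np.array([[0.0], [0.1]])                  # inducing points clustered near 0
X = np.array([[0.0], [0.1], [0.05]])
far = dtc_variance(gauss_kernel, S, X, np.array([5.0]), lam=0.1)
near = dtc_variance(gauss_kernel, S, X, np.array([0.05]), lam=0.1)
```

For a point far from \mathcal{S}_{t}, k_{\mathcal{S}_t}(x) vanishes, so the DTC estimate stays near k(x,x)/\lambda rather than collapsing to zero: exactly the behavior that avoids variance starvation.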


Data: arm set \mathcal{A}, \overline{q}, \{\beta_{t}\}_{t=1}^{T}
Result: arm choices \mathcal{D}_{T}=\{(\widetilde{\mathbf{x}}_{t},y_{t})\}
Select \mathbf{x}_{1} uniformly at random and observe y_{1}; initialize \mathcal{S}_{1}=\{\mathbf{x}_{1}\}
For t=1,\dots,T-1:
  Compute \widetilde{\mu}_{t}(\mathbf{x}_{i}) and \widetilde{\sigma}_{t}^{2}(\mathbf{x}_{i}) for all \mathbf{x}_{i}\in\mathcal{A}
  Select \widetilde{\mathbf{x}}_{t+1}=\operatorname*{arg\,max}_{\mathbf{x}_{i}\in\mathcal{A}}\widetilde{u}_{t}(\mathbf{x}_{i})
  For i=1,\dots,t+1:
    Set \widetilde{p}_{t+1,i}=\overline{q}\cdot\widetilde{\sigma}_{t}^{2}(\widetilde{\mathbf{x}}_{i})
    Draw q_{t+1,i}\sim\text{Bernoulli}(\widetilde{p}_{t+1,i})
    If q_{t+1,i}=1, include \widetilde{\mathbf{x}}_{i} in \mathcal{S}_{t+1}

Fig. 1: BKB

Choosing the inducing points.

A critical aspect of keeping the complexity of BKB low while preserving regret guarantees is to carefully choose the inducing points to include in the subset \mathcal{S}_{t}. As the complexity of computing \widetilde{u}_{t} scales with the size m of \mathcal{S}_{t}, a smaller set gives a faster algorithm. Conversely, the deviation of \widetilde{\mu}_{t} and \widetilde{\sigma}_{t} from their exact counterparts depends on the accuracy of the embedding \mathbf{z}_{t}, which improves with the size of \mathcal{S}_{t}. Moreover, even for a fixed m, the quality of the embedding greatly depends on which inducing points are included. For instance, selecting the same arm as an inducing point twice, or two co-linear arms, does not improve accuracy, as the embedding space does not change. Finally, we need to take into account two important aspects of sequential optimization when choosing \mathcal{S}_{t}. First, we need to focus our approximation on regions of \mathcal{A} that are relevant to the optimization problem (i.e., high-reward arms). Second, as these regions change over time, we need to keep adapting the composition and size of \mathcal{S}_{t} accordingly.

Taking these objectives into consideration, we construct \mathcal{S}_{t} by randomly subsampling only the set of arms \widetilde{\mathbf{X}}_{t} evaluated so far. In particular, arms are selected for inclusion in \mathcal{S}_{t} with probability proportional to their posterior variance \sigma_{t} at step t. We report the selection procedure, together with the complete BKB algorithm, in Figure 1.

We initialize \mathcal{S}_{1}=\{\widetilde{\mathbf{x}}_{1}\} by selecting an arm uniformly at random. At each step t, after selecting \widetilde{\mathbf{x}}_{t+1}, we must re-generate \mathcal{S}_{t} to reflect the changes in \widetilde{\mathbf{X}}_{t+1}. Ideally, we would sample each arm in \widetilde{\mathbf{X}}_{t+1} proportionally to \sigma_{t+1}^{2}, but this would be too computationally expensive. For efficiency, we first approximate \sigma_{t+1}^{2} with \sigma_{t}^{2}. This is equivalent to ignoring the last arm and does not significantly impact the accuracy. We then replace \sigma_{t}^{2} with \widetilde{\sigma}_{t}^{2}, which was already efficiently computed when constructing Equation 4. Finally, given a parameter \overline{q}\geq 1, we set our approximate inclusion probability to \widetilde{p}_{t+1,i}=\overline{q}\,\widetilde{\sigma}_{t}^{2}(\widetilde{\mathbf{x}}_{i}). The parameter \overline{q} increases the inclusion probabilities in order to boost the overall success probability of the approximation procedure, at the expense of a small increase in the size of \mathcal{S}_{t+1}. Given \widetilde{p}_{t+1,i}, we start from an empty \mathcal{S}_{t+1} and iterate over all \widetilde{\mathbf{x}}_{i} for i\in[t+1], drawing q_{t+1,i} from a Bernoulli distribution with probability \widetilde{p}_{t+1,i}. If q_{t+1,i}=1, \widetilde{\mathbf{x}}_{i} is included in \mathcal{S}_{t+1}.
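The re-generation step above can be sketched in a few lines. Clipping the inclusion probability to [0,1] and the seeded generator are our assumptions for the illustration:

```python
import numpy as np

def resample_inducing(var_est, q_bar, seed=0):
    """Rebuild S_{t+1}: keep arm i with probability p_i = q_bar * sigma_tilde_t^2(x_i).

    var_est: DTC variance estimates of the t+1 arms selected so far.
    Returns the indices of the arms included in the new subset.
    """
    rng = np.random.default_rng(seed)
    p = np.clip(q_bar * np.asarray(var_est, dtype=float), 0.0, 1.0)
    q = rng.random(p.shape) < p          # q_{t+1,i} ~ Bernoulli(p_i)
    return np.flatnonzero(q)

# High-variance (poorly covered) arms are almost surely kept,
# well-covered arms (tiny variance) are almost surely dropped.
kept = resample_inducing([0.9, 0.9, 1e-6, 1e-6], q_bar=2.0)
```

Because each draw is an independent Bernoulli given the current variances, the expected subset size is \sum_{i}\widetilde{p}_{t+1,i}, which is what the analysis in Section 3.2 bounds.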

Notice that while constructing \mathcal{S}_{t} based on \sigma_{t}^{2} is a common heuristic in the sparse GP literature, it has not yet been rigorously justified. In the next section, we show that this approach is equivalent to \lambda-ridge leverage score (RLS) sampling (Alaoui and Mahoney, 2015), a well-studied tool in randomized linear algebra. We leverage known results from this field to prove both accuracy and efficiency guarantees for our selection procedure.

3.2 Complexity analysis

Fig. 2: We simulate a GP on [0,1]\subset\mathbb{R} using a Gaussian kernel with bandwidth \sigma^{2}=100. We draw f from the GP and give BKB t=\{6,63,215\} evaluations sampled uniformly in [0,0.5]. We plot f and \widetilde{\mu}_{t}\pm 3\widetilde{\sigma}_{t}.

Denote by m_{t}=|\mathcal{S}_{t}| the size of the set \mathcal{S}_{t} at step t. At each step, we first compute the embedding \mathbf{z}_{t}(\mathbf{x}_{i}) of all arms in \mathcal{O}(Am_{t}^{2}+m_{t}^{3}) time, which corresponds to one inversion of \mathbf{K}_{\mathcal{S}_{t}}^{1/2} and one matrix-vector product per arm. We then rebuild the matrix \mathbf{V}_{t} from scratch using all the arms observed so far. In general, it is sufficient to record the counters of the arms pulled so far, rather than the full list of arms, so that \mathbf{V}_{t} can be constructed in \mathcal{O}(\min\{t,A\}m_{t}^{2}) time. Then the inverse \mathbf{V}_{t}^{-1} is computed in \mathcal{O}(m_{t}^{3}) time. We can now efficiently compute \widetilde{\mu}_{t}, \widetilde{\sigma}_{t}, and \widetilde{u}_{t} for all arms in \mathcal{O}(Am_{t}^{2}), reusing the embeddings and \mathbf{V}_{t}^{-1}. Finally, computing all q_{t+1,i} and \mathcal{S}_{t+1} takes \mathcal{O}(\min\{t+1,A\}) using the estimated variances \widetilde{\sigma}_{t}^{2}. As a result, the per-step complexity is of order \mathcal{O}\big((A+\min\{t,A\})m_{T}^{2}\big); notice that m_{t}\leq\min\{t,A\}, so the \mathcal{O}(m_{t}^{3}) term is absorbed in the other terms. Space-wise, we only need to store the embedded arms and the matrix \mathbf{V}_{t}, which takes at most \mathcal{O}(Am_{T}) space.

Bounding the size of \mathcal{S}_{T}.

The size m_{t} of \mathcal{S}_{t} can be expressed using the random variables q_{t,i} as the sum m_{t}=\sum_{i=1}^{t}q_{t,i}. In order to bound the total number of inducing points, which directly determines the computational complexity of BKB, we go through three major steps.
The first is to show that, w.h.p., m_{t} is close to the sum \sum_{i=1}^{t}\widetilde{p}_{t,i}=\sum_{i=1}^{t}\overline{q}\widetilde{\sigma}_{t}^{2}(\widetilde{\mathbf{x}}_{i}), i.e., close to the sum of the probabilities used to sample each q_{t,i}. However, the various q_{t,i} are not independent, and each \widetilde{p}_{t,i} is itself a random variable. Nonetheless, all q_{t,i} are conditionally independent given the previous t-1 steps, and this is sufficient to obtain the result.
The second and the most complex step is to guarantee that the random sum \sum_{i=1}^{t}\widetilde{\sigma}_{t}^{2}(\widetilde{\mathbf{x}}_{i}) is close to \sum_{i=1}^{t}\sigma_{t}^{2}(\widetilde{\mathbf{x}}_{i}), and at a lower level that each individual estimate \widetilde{\sigma}_{t}^{2}(\cdot) is close to \sigma_{t}^{2}(\cdot). To achieve this we exploit the connection between ridge leverage scores and posterior variance \sigma_{t}^{2}. In particular, we show that the variance estimator \widetilde{\sigma}_{t}^{2}(\cdot) used by BKB is a variation of the RLS estimator recently introduced by Calandriello et al. (2017b) for RLS sampling. As a consequence, we can transfer the strong accuracy and size guarantees of RLS sampling to our optimization setting (see Appendix C).
The first two steps lead to m_{t}\approx\sum_{i=1}^{t}\sigma_{i}^{2}(\widetilde{\mathbf{x}}_{i}), for which we need to derive a more explicit bound. In the GP literature, this quantity is bounded using the maximal information gain \gamma_{T} after T rounds. Denote by \mathbf{X}_{\mathcal{A}}\in\mathbb{R}^{A\times d} the matrix with all arms as rows, by \mathcal{D} a subset of these rows, potentially with duplicates, and by {\mathbf{K}}_{\mathcal{D}} the associated kernel matrix. Then, Srinivas et al. (2010) define

\displaystyle\gamma_{T}=\max_{\mathcal{D}\subset\mathcal{A}:|\mathcal{D}|=T}\tfrac{1}{2}\operatorname*{log\,det}({\mathbf{K}}_{\mathcal{D}}/\lambda+\mathbf{I}),

and show that \sum_{i=1}^{t}\sigma_{i}^{2}(\widetilde{\mathbf{x}}_{i})\leq\gamma_{t}, and that \gamma_{T} itself can be bounded for specific \mathcal{A} and kernel functions, e.g., \gamma_{T}\leq\mathcal{O}(\log(T)^{d+1}) for Gaussian kernels. Using the equivalence between RLS and posterior variance \sigma_{t}^{2}, we can also relate the posterior variance \sigma_{t}^{2}(\widetilde{\mathbf{x}}_{i}) of the evaluated arms to the so-called GP’s effective dimension d_{\text{eff}} or degrees of freedom

\displaystyle d_{\text{eff}}(\lambda,\widetilde{\mathbf{X}}_{T})=\sum\nolimits_{i=1}^{T}\sigma_{T}^{2}(\widetilde{\mathbf{x}}_{i})=\operatorname*{Tr}\big({\mathbf{K}}_{T}({\mathbf{K}}_{T}+\lambda\mathbf{I})^{-1}\big), (5)

using the following inequality from Calandriello et al. (2017a):

\displaystyle\operatorname*{log\,det}\left({\mathbf{K}}_{T}/\lambda+\mathbf{I}\right)\leq\operatorname*{Tr}\big({\mathbf{K}}_{T}({\mathbf{K}}_{T}+\lambda\mathbf{I})^{-1}\big)\left(1+\log\left(\tfrac{\|{\mathbf{K}}_{T}\|}{\lambda}+1\right)\right). (6)

We will use both RLS and d_{\text{eff}} to describe the behaviour of BKB’s selection procedure.
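Equation 5 and inequality (6) are easy to check numerically on a small example. The random sample, kernel bandwidth, and \lambda below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 2))
# toy Gaussian kernel matrix on 30 random points
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
lam = 1.0

# left-hand side of (6): log det(K/lam + I)
logdet = np.linalg.slogdet(K / lam + np.eye(30))[1]
# effective dimension, Eq. (5): Tr(K (K + lam I)^{-1})
d_eff = np.trace(K @ np.linalg.inv(K + lam * np.eye(30)))
# right-hand side of (6), with ||K|| the spectral norm
rhs = d_eff * (1.0 + np.log(np.linalg.norm(K, 2) / lam + 1.0))
assert logdet <= rhs
```

Since the log-det upper-bounds 2\gamma_{T}-type quantities, this confirms that d_{\text{eff}} controls the information gain up to a logarithmic factor, as used in the analysis.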
Finally, we obtain the main result of this section.

Theorem. For desired 0<\varepsilon<1 and 0<\delta<1, let \alpha=(1+\varepsilon)/(1-\varepsilon). If we run BKB with \overline{q}\geq 6\alpha\log(4T/\delta)/\varepsilon^{2}, then with probability 1-\delta, for all t\in[T] and for all \mathbf{x}\in\mathcal{A} we have

\displaystyle\sigma_{t}^{2}(\mathbf{x})/\alpha\leq\widetilde{\sigma}_{t}^{2}(\mathbf{x})\leq\alpha\sigma_{t}^{2}(\mathbf{x})\quad\text{and}\quad|\mathcal{S}_{t}|\leq 3(1+\kappa^{2}/\lambda)\alpha\overline{q}\,d_{\text{eff}}(\lambda,\widetilde{\mathbf{X}}_{t}).

Computational complexity. We already showed that BKB’s implementation with Nyström embeddings requires \mathcal{O}(T(A+\min\{t,A\})m_{T}^{3}) time and \mathcal{O}(Am_{T}) space. Combining this with Section 3.2 and the bound m_{T}\leq\widetilde{\mathcal{O}}(d_{\text{eff}}), we obtain a \widetilde{\mathcal{O}}(T(A+\min\{t,A\})d_{\text{eff}}^{3}) time complexity. Whenever d_{\text{eff}}\ll T and T\ll A, this is essentially a quadratic \mathcal{O}(T^{2}) runtime, a large improvement over the quartic \mathcal{O}(T^{4})\leq\mathcal{O}(T^{3}A) runtime of GP-UCB.

Tuning \overline{q}. Note that although \overline{q} must satisfy the condition reported in Section 3.2 for the result to hold, it is quite robust to uncertainty on the desired horizon T. In particular, the bound holds for any \varepsilon>0, and even if we continue updating \mathcal{S}_{T} after the T-th step, the bound still holds by implicitly increasing the parameter \varepsilon. Alternatively, after the T-th iteration the user can suspend the algorithm, increase \overline{q} to suit the new desired horizon, and re-run only the subset selection subroutine on the arms selected so far.

Avoiding variance starvation. Another important consequence of Section 3.2 is that BKB’s variance estimates are always close to the exact ones, up to a small constant factor. To the best of our knowledge, this makes BKB the first efficient and general GP algorithm that provably avoids variance starvation, which can be caused by two sources of error. The first source is the degeneracy, i.e., low-rankness, of the GP approximation, which causes the estimate to grow over-confident when the number of observed points exceeds the degrees of freedom of the GP. BKB adaptively chooses its degrees of freedom, as the size of \mathcal{S}_{t} scales with the effective dimension. The second source of error arises when a point is far away from \mathcal{S}_{t}. Our use of a DTC variance estimator avoids under-estimation before we update the subset \mathcal{S}_{t}. Afterward, we can use guarantees on the quality of \mathcal{S}_{t} to ensure that we do not over-estimate the variance too much, exploiting an approach similar to the one used to guarantee accuracy in RLS estimation. Both problems, and BKB’s accuracy, are highlighted in Figure 2 using a benchmark experiment proposed by Wang et al. (2018).
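The mechanism behind the DTC estimator can be sketched numerically. In the decomposition below, the first term keeps the exact prior k(\mathbf{x},\mathbf{x}) for the part of \boldsymbol{\phi}(\mathbf{x}) outside the span of the inducing points, so arms far from \mathcal{S}_{t} retain a large, conservative variance (the kernel, inducing set, and all names are our illustrative choices, not the paper's implementation):

```python
import numpy as np

def rbf(A, B):
    """Gaussian kernel matrix between the rows of A and B (illustrative choice)."""
    return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / 2.0)

def dtc_variance(x, S, X, lam):
    """DTC-style variance sketch with inducing points S and observed arms X:
    sigma^2(x) = (k(x,x) - ||z(x)||^2)/lam + z(x)^T (Z^T Z + lam I)^{-1} z(x),
    where z(x) = K_SS^{-1/2} k_S(x) is a Nystrom embedding."""
    K_SS = rbf(S, S)
    w, V = np.linalg.eigh(K_SS)
    inv_sqrt = V @ np.diag(np.where(w > 1e-10, 1.0 / np.sqrt(np.maximum(w, 1e-10)), 0.0)) @ V.T
    z = inv_sqrt @ rbf(S, x[None, :]).ravel()      # embedding of the query arm
    Z = rbf(X, S) @ inv_sqrt                       # embeddings of all pulled arms
    residual = (rbf(x[None, :], x[None, :])[0, 0] - z @ z) / lam   # part of phi(x) outside span(S)
    return residual + z @ np.linalg.solve(Z.T @ Z + lam * np.eye(len(S)), z)

rng = np.random.default_rng(0)
S = rng.standard_normal((5, 2))              # inducing points
X = np.vstack([S] * 4)                       # observations clustered on them
far = np.array([25.0, 25.0])                 # an arm far from every inducing point
v_far = dtc_variance(far, S, X, lam=1.0)     # stays near the prior k(x,x)/lam = 1
v_near = dtc_variance(S[0], S, X, lam=1.0)   # well-covered arm: small variance
assert v_near < v_far
```

For the far arm the embedding z(x) vanishes, so the estimate falls back on the prior instead of growing over-confident, which is exactly the behaviour that avoids under-estimation.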

Incremental dictionary update. At each step t, BKB recomputes the dictionary \mathcal{S}_{t+1} from scratch by sampling each of the arms pulled so far with a suitable probability \widetilde{p}_{t+1,i}. A more efficient variant would be to build \mathcal{S}_{t+1} by adding the new point {\mathbf{x}}_{t+1} with probability \widetilde{p}_{t+1,t+1} and including the points in \mathcal{S}_{t} with probability \widetilde{p}_{t+1,i}/\widetilde{p}_{t,i}. This strategy is often used in the streaming setting to avoid storing all points observed so far and to incrementally update the dictionary (see, e.g., Calandriello et al., 2017b). There, however, the stream of points, although arbitrary, is assumed to be generated independently from the dictionary itself. In our bandit setting, on the other hand, the points \widetilde{{\mathbf{x}}}_{1},\widetilde{{\mathbf{x}}}_{2},\ldots are actually chosen by the learning algorithm depending on the dictionaries built over time, which creates a strong dependency between the stream of points and the dictionary itself. How to analyze this dependency, and whether the accuracy of the inducing points is preserved in this case, remains an open question. Finally, notice that despite being more elegant and efficient, such an incremental dictionary update would not significantly reduce the computational complexity, since computing the posterior variance for each arm would still dominate the overall runtime.
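For concreteness, the from-scratch resampling step can be sketched as a simple Bernoulli pass over the arms pulled so far (this is our minimal sketch; `q_bar` and the variance inputs stand in for BKB's actual \overline{q} and \widetilde{\sigma}_{t}^{2} values):

```python
import numpy as np

def resample_dictionary(var_estimates, q_bar, rng):
    """Rebuild S_{t+1} from scratch: arm i pulled so far is included with
    probability p_i = min(q_bar * sigma_tilde_t^2(x_i), 1).  Simplified
    Bernoulli sketch of BKB's variance-proportional sampling."""
    p = np.minimum(q_bar * np.asarray(var_estimates, dtype=float), 1.0)
    return np.flatnonzero(rng.random(len(p)) < p)

rng = np.random.default_rng(0)
sigmas = [1.0, 0.5, 0.0, 1.0]      # toy posterior-variance estimates
S_next = resample_dictionary(sigmas, q_bar=1.0, rng=rng)
# arms with variance estimate 1 are always kept; arm 2 (zero variance) never is
```

Because each arm is sampled independently, the whole pass costs \mathcal{O}(t), which is negligible next to the variance computations discussed above.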

4 Regret Analysis

We are now ready to present the second main contribution of this paper, a bound on the regret achieved by BKB. To prove our result we additionally assume that the reward function f has bounded norm, i.e., \|f\|_{\mathcal{H}}^{2}=\langle f,f\rangle<\infty. We use an upper bound \|f\|_{\mathcal{H}}\leq F to properly tune \widetilde{\beta}_{t} to the “range” of the reward. If F is not known in advance, standard guess-and-double techniques can be applied.


Theorem. Assume \|f\|_{\mathcal{H}}\leq F<\infty. For any desired 0<\varepsilon<1, 0<\delta<1, 0<\lambda, let \alpha=(1+\varepsilon)/(1-\varepsilon) and \overline{q}\geq 6\alpha\log(4T/\delta)/\varepsilon^{2}. If we run BKB with

\displaystyle\widetilde{\beta}_{t}=2\xi\sqrt{\alpha\log(\kappa^{2}t)\left(\sum\nolimits_{s=1}^{t}\widetilde{\sigma}_{t}^{2}(\mathbf{x}_{s})\right)+\log(1/\delta)}+\left(1+\tfrac{1}{\sqrt{1-\varepsilon}}\right)\sqrt{\lambda}F,

then, with prob. 1-\delta, BKB’s regret R_{T} is bounded by

\displaystyle R_{T}\leq 2(2\alpha)^{3/2}\sqrt{T}\left(\xi d_{\text{eff}}(\lambda,\widetilde{\mathbf{X}}_{T})\log(\kappa^{2}T)+\sqrt{\lambda F^{2}d_{\text{eff}}(\lambda,\widetilde{\mathbf{X}}_{T})\log(\kappa^{2}T)}+\xi\log(1/\delta)\right).

Section 4 shows that BKB achieves the same regret as exact GP-UCB, up to the small constant \alpha and a \log(\kappa^{2}T) multiplicative factor. (Here we derive a frequentist regret bound, and thus compare with the result of Chowdhury and Gopalan (2017) rather than with the original Bayesian analysis of Srinivas et al. (2010).) For instance, setting \varepsilon=1/2 results in a bound only 3\log(T) times larger than GP-UCB’s. At the same time, the choice \varepsilon=1/2 only accounts for a constant factor 12 in the per-step computational complexity, which is still dramatically reduced from t^{2}A to d_{\text{eff}}^{2}A. Moreover, it is easy to show that d_{\text{eff}}(\lambda,\widetilde{\mathbf{X}}_{T})\leq\operatorname*{log\,det}({\mathbf{K}}_{T}/\lambda+\mathbf{I}) (see Appendix B), so any bound on \operatorname*{log\,det}({\mathbf{K}}_{T}/\lambda+\mathbf{I}) available for GP-UCB directly applies to BKB. This means that, up to an extra \log(T) factor, we also match GP-UCB’s \widetilde{\mathcal{O}}(\log(T)^{2d}) rate for the Gaussian kernel, \widetilde{\mathcal{O}}(T^{\frac{1}{2}\frac{2\nu+3d^{2}}{2\nu+d^{2}}}) rate for the Matérn kernel, and \widetilde{\mathcal{O}}(d\sqrt{T}) rate for the linear kernel. While these bounds are not minimax optimal, they closely follow the lower bounds derived in Scarlett et al. (2017). On the other hand, in the case of the linear kernel (i.e., the linear bandit problem) we match the lower bound of Dani et al. (2008) up to logarithmic factors.

Another interesting aspect of BKB is that the trade-off parameter \widetilde{\beta}_{t} can be computed efficiently. Previous methods bounded this quantity with a loose (deterministic) upper bound (e.g., \mathcal{O}(\log(T)^{d}) for Gaussian kernels) to avoid the large cost of computing \operatorname*{log\,det}({\mathbf{K}}_{T}/\lambda+\mathbf{I}). In our \widetilde{\beta}_{t}, we bound the \operatorname*{log\,det} by d_{\text{eff}}, which is in turn bounded by \sum\nolimits_{s=1}^{t}\widetilde{\sigma}_{t}^{2}(\mathbf{x}_{s}) (see Thm. 3.2), where all the \widetilde{\sigma}_{t}^{2} are already efficiently computed at each step. While this is up to a \log(t) factor larger than the exact \operatorname*{log\,det}, it is still data adaptive and much smaller than the worst-case upper bounds.

This regret guarantee is crucially achieved without requiring an increasing accuracy in our approximation. One might expect that, to obtain sublinear regret, the error induced by the approximation should decrease as 1/T. Instead, in BKB the constants \varepsilon and \lambda that govern the accuracy level are fixed, and thus it is not possible to guarantee that \widetilde{\mu}_{t} will ever get close to \mu_{t} everywhere. Adaptivity is key here: we can afford the same approximation level at every step because accuracy is actually increased only on a specific part of the arm set. For example, if a suboptimal arm is selected too often due to a bad approximation, it will eventually be included in \mathcal{S}_{t}. After the inclusion, the approximation accuracy in the region of the suboptimal arm increases, and it will not be selected anymore. As the set of inducing points is updated fast enough, the impact of inaccurate approximations is limited in time, thus preventing large regret from accumulating. Note that this is a significant divergence from existing results. In particular, approximation bounds that are uniformly accurate for all \mathbf{x}_{i}\in\mathcal{A}, such as those obtained with Quadrature FF (Mutny and Krause, 2018), rely on packing arguments. Due to the nature of packing, this usually causes the runtime or regret to scale exponentially in the input dimension d, and requires the kernel k to possess a specific structure, e.g., to be stationary. Our new analysis avoids both of these problems.

Finally, we point out that the adaptivity of BKB draws an interesting connection between learning and computational complexity. In fact, both the regret and the computational cost of BKB scale with the log-determinant and effective dimension of {\mathbf{K}}_{T}, which are related to the effective dimension of the sequence of arms selected over time. As a result, if the problem is difficult from a learning point of view (i.e., the regret is large because of a large log-determinant), then BKB automatically adapts the set \mathcal{S}_{t} by including many more inducing points to guarantee the level of accuracy needed to solve the problem. Conversely, if the problem is simple (i.e., small regret), then BKB can greatly reduce the size of \mathcal{S}_{t} and still achieve the required level of accuracy.

4.1 Proof sketch

We build on the analysis of GP-UCB by Chowdhury and Gopalan (2017). Their analysis relies on a confidence-interval formulation of GP-UCB that is more conveniently expressed using an explicit feature-based representation of the GP. For any GP with covariance k, there is a corresponding RKHS \mathcal{H} with k as its kernel function. Furthermore, any kernel function k is associated to a non-linear feature map \boldsymbol{\phi}(\cdot):\mathbb{R}^{d}\rightarrow\mathcal{H} such that k(\mathbf{x},\mathbf{x}^{\prime})=\boldsymbol{\phi}(\mathbf{x})^{\mathsf{\scriptscriptstyle T}}\boldsymbol{\phi}(\mathbf{x}^{\prime}). As a result, any reward function f\in\mathcal{H} can be written as f(\mathbf{x})=\boldsymbol{\phi}(\mathbf{x})^{\mathsf{\scriptscriptstyle T}}\mathbf{w}_{*}, where \mathbf{w}_{*}\in\mathcal{H}.

Confidence interval view of GP-UCB.

Let \boldsymbol{\Phi}(\mathbf{X}_{t})=[\boldsymbol{\phi}(\mathbf{x}_{1}),\dots,\boldsymbol{\phi}(\mathbf{x}_{t})]^{\mathsf{\scriptscriptstyle T}} be the matrix \mathbf{X}_{t} after the application of \boldsymbol{\phi}(\cdot) to each row. The regularized design matrix is then defined as \mathbf{A}_{t}=\boldsymbol{\Phi}(\mathbf{X}_{t})^{\mathsf{\scriptscriptstyle T}}\boldsymbol{\Phi}(\mathbf{X}_{t})+\lambda\mathbf{I}, while the regularized least-squares estimate is computed as

\displaystyle\widehat{\mathbf{w}}_{t}=\operatorname*{arg\,min}_{\mathbf{w}\in\mathcal{H}}\sum\nolimits_{i=1}^{t}(y_{i}-\boldsymbol{\phi}(\mathbf{x}_{i})^{\mathsf{\scriptscriptstyle T}}\mathbf{w})^{2}+\lambda\|\mathbf{w}\|_{2}^{2}=\mathbf{A}_{t}^{-1}\boldsymbol{\Phi}(\mathbf{X}_{t})^{\mathsf{\scriptscriptstyle T}}\mathbf{y}_{t}.

We define the confidence interval C_{t} as the ellipsoid induced by \mathbf{A}_{t} with center \widehat{\mathbf{w}}_{t} and radius \beta_{t}

\displaystyle C_{t}=\left\{\mathbf{w}:\|\mathbf{w}-\widehat{\mathbf{w}}_{t}\|_{\mathbf{A}_{t}}\leq\beta_{t}\right\},\quad\quad\beta_{t}=\lambda^{1/2}F+R\sqrt{2(\operatorname*{log\,det}(\mathbf{A}_{t}/\lambda)+\log(1/\delta))}, (7)

where the radius \beta_{t} is such that \mathbf{w}_{*}\in C_{t} w.h.p. (Chowdhury and Gopalan, 2017). Finally, using Lagrange multipliers we can reformulate the GP-UCB scores as

\displaystyle u_{t}(\mathbf{x}_{i})=\max_{\mathbf{w}\in C_{t}}\boldsymbol{\phi}(\mathbf{x}_{i})^{\mathsf{\scriptscriptstyle T}}\mathbf{w}=\overbracket{\boldsymbol{\phi}(\mathbf{x}_{i})^{\mathsf{\scriptscriptstyle T}}\widehat{\mathbf{w}}_{t}}^{\mu_{t}(\mathbf{x}_{i})}+\beta_{t}\overbracket{\sqrt{\boldsymbol{\phi}(\mathbf{x}_{i})^{\mathsf{\scriptscriptstyle T}}\mathbf{A}_{t}^{-1}\boldsymbol{\phi}(\mathbf{x}_{i})}}^{\sigma_{t}(\mathbf{x}_{i})}. (8)
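In a finite-dimensional feature space this identity can be checked directly: the closed-form score \mu_{t}+\beta_{t}\sigma_{t} coincides with an explicit maximization over the ellipsoid C_{t}. The features, data, and value of \beta_{t} below are illustrative choices, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
Phi = rng.standard_normal((30, 4))            # Phi(X_t): one feature row per pull
y = Phi @ np.array([1.0, -0.5, 0.2, 0.0]) + 0.1 * rng.standard_normal(30)
lam, beta = 1.0, 2.0

A = Phi.T @ Phi + lam * np.eye(4)             # regularized design matrix A_t
w_hat = np.linalg.solve(A, Phi.T @ y)         # ridge estimate w_hat_t

phi_x = rng.standard_normal(4)                # features of one candidate arm
mu = phi_x @ w_hat                            # mu_t(x)
sigma = np.sqrt(phi_x @ np.linalg.solve(A, phi_x))   # sigma_t(x)
ucb = mu + beta * sigma                       # u_t(x), closed form of Eq. (8)

# the maximizer of phi_x^T w over ||w - w_hat||_A <= beta lies on the boundary
w_star = w_hat + beta * np.linalg.solve(A, phi_x) / sigma
assert abs(phi_x @ w_star - ucb) < 1e-8       # same value as the closed form
```

The boundary point w_star is exactly the Lagrange-multiplier solution mentioned above: moving \widehat{\mathbf{w}}_{t} by \beta_{t}\mathbf{A}_{t}^{-1}\boldsymbol{\phi}(\mathbf{x})/\sigma_{t}(\mathbf{x}) saturates the ellipsoid constraint.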

Approximating the confidence ellipsoid. Let us now focus on the subset of arms \mathcal{S}_{t}=\{\mathbf{x}_{i}\}_{i=1}^{m} chosen by BKB at each step, and denote by \mathbf{X}_{\mathcal{S}_{t}}\in\mathbb{R}^{m\times d} the matrix with all arms in \mathcal{S}_{t} as rows. We denote by \widetilde{\mathcal{H}}_{t}=\operatorname*{Im}(\boldsymbol{\Phi}(\mathbf{X}_{\mathcal{S}_{t}})) the smaller m-rank RKHS spanned by \boldsymbol{\Phi}(\mathbf{X}_{\mathcal{S}_{t}}), and by \mathbf{P}_{t} the symmetric orthogonal projection operator onto \widetilde{\mathcal{H}}_{t}. We can now define an approximate feature map \widetilde{\boldsymbol{\phi}}_{t}(\cdot)=\mathbf{P}_{t}\boldsymbol{\phi}(\cdot):\mathbb{R}^{d}\rightarrow\widetilde{\mathcal{H}}_{t} and the associated approximations of \mathbf{A}_{t} and \widehat{\mathbf{w}}_{t}

\displaystyle\widetilde{\mathbf{A}}_{t}=\widetilde{\boldsymbol{\Phi}}_{t}(\mathbf{X}_{t})^{\mathsf{\scriptscriptstyle T}}\widetilde{\boldsymbol{\Phi}}_{t}(\mathbf{X}_{t})+\lambda\mathbf{I} (9)
\displaystyle\widetilde{\mathbf{w}}_{t}=\operatorname*{arg\,min}_{\mathbf{w}\in\mathcal{H}}\sum\nolimits_{i=1}^{t}(y_{i}-\widetilde{\boldsymbol{\phi}}_{t}(\mathbf{x}_{i})^{\mathsf{\scriptscriptstyle T}}\mathbf{w})^{2}+\lambda\|\mathbf{w}\|_{2}^{2}=\widetilde{\mathbf{A}}_{t}^{-1}\widetilde{\boldsymbol{\Phi}}_{t}(\mathbf{X}_{t})^{\mathsf{\scriptscriptstyle T}}\mathbf{y}_{t}. (10)

This leads to an approximate confidence ellipsoid \widetilde{C}_{t}=\left\{\mathbf{w}:\|\mathbf{w}-\widetilde{\mathbf{w}}_{t}\|_{\widetilde{\mathbf{A}}_{t}}\leq\widetilde{\beta}_{t}\right\}.
A subtle element in these definitions is that while \widetilde{\boldsymbol{\Phi}}_{t}(\mathbf{X}_{t})^{\mathsf{\scriptscriptstyle T}}\widetilde{\boldsymbol{\Phi}}_{t}(\mathbf{X}_{t}) and \widetilde{\mathbf{w}}_{t} are now restricted to \widetilde{\mathcal{H}}_{t}, the identity operator \lambda\mathbf{I} in the regularization of \widetilde{\mathbf{A}}_{t} still acts over the whole \mathcal{H}, and therefore \widetilde{\mathbf{A}}_{t} does not belong to \widetilde{\mathcal{H}}_{t} and remains full-rank and invertible. This immediately leads to using the exact k(\mathbf{x}_{i},\mathbf{x}_{i}) in the definition of \widetilde{\sigma}_{t} in Eq. 4, instead of its approximate version based on the Nyström embeddings.

Bounding the regret. To find an appropriate \widetilde{\beta}_{t} we follow an approach similar to Abbasi-Yadkori et al. (2011). Exploiting the relationship y_{t}=\boldsymbol{\phi}(\widetilde{\mathbf{x}}_{t})^{\mathsf{\scriptscriptstyle T}}\mathbf{w}_{*}+\eta_{t}, we bound

\displaystyle\|\mathbf{w}_{*}-\widetilde{\mathbf{w}}_{t}\|_{\widetilde{\mathbf{A}}_{t}}\leq\overbracket{\lambda^{1/2}\|\mathbf{w}_{*}\|}^{(a)}+\overbracket{\|\widetilde{\boldsymbol{\Phi}}_{t}(\mathbf{X}_{t})\boldsymbol{\eta}_{t}\|_{\widetilde{\mathbf{A}}_{t}^{-1}}}^{(b)}+\overbracket{\|\boldsymbol{\Phi}(\mathbf{X}_{t})^{\mathsf{\scriptscriptstyle T}}\|_{\mathbf{I}-\mathbf{P}_{t}}\cdot\|\mathbf{w}_{*}\|}^{(c)}.

Both (a) and (b) are present in GP-UCB’s and OFUL’s analyses. Term (a) is due to the bias introduced in the least-squares estimator \widetilde{\mathbf{w}}_{t} by the regularization \lambda. Term (b) is due to the noise in the reward observations. Note that the same term (b) appears in GP-UCB’s analysis as \|\boldsymbol{\Phi}(\mathbf{X}_{t})\boldsymbol{\eta}_{t}\|_{\mathbf{A}_{t}^{-1}} and it is bounded by \operatorname*{log\,det}(\mathbf{A}_{t}/\lambda) using self-normalizing concentration inequalities (Chowdhury and Gopalan, 2017). However, our \|\widetilde{\boldsymbol{\Phi}}_{t}(\mathbf{X}_{t})\boldsymbol{\eta}_{t}\|_{\widetilde{\mathbf{A}}_{t}^{-1}} is a more complex object, since the projection \mathbf{P}_{t} contained in \widetilde{\boldsymbol{\Phi}}_{t}(\mathbf{X}_{t})=\mathbf{P}_{t}\boldsymbol{\Phi}(\mathbf{X}_{t}) depends on the whole process up to time t, and therefore \widetilde{\boldsymbol{\Phi}}_{t}(\mathbf{X}_{t}) also depends on the whole process, losing its martingale structure. To avoid this, we use Sylvester’s identity and the projection operator \mathbf{P}_{t} to bound

\displaystyle\operatorname*{log\,det}(\widetilde{\mathbf{A}}_{t}/\lambda)=\operatorname*{log\,det}\left(\tfrac{\boldsymbol{\Phi}(\mathbf{X}_{t})\mathbf{P}_{t}\boldsymbol{\Phi}(\mathbf{X}_{t})^{\mathsf{\scriptscriptstyle T}}}{\lambda}+\mathbf{I}\right)\leq\operatorname*{log\,det}\left(\tfrac{\boldsymbol{\Phi}(\mathbf{X}_{t})\boldsymbol{\Phi}(\mathbf{X}_{t})^{\mathsf{\scriptscriptstyle T}}}{\lambda}+\mathbf{I}\right)=\operatorname*{log\,det}(\mathbf{A}_{t}/\lambda).
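This log-determinant monotonicity is a pure linear-algebra fact (\mathbf{P}_{t}\preceq\mathbf{I} implies \boldsymbol{\Phi}\mathbf{P}_{t}\boldsymbol{\Phi}^{\mathsf{\scriptscriptstyle T}}\preceq\boldsymbol{\Phi}\boldsymbol{\Phi}^{\mathsf{\scriptscriptstyle T}}) and can be verified numerically; the sizes and the random subspace below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
Phi = rng.standard_normal((20, 6))                 # Phi(X_t) with t = 20 pulls
lam = 0.5

U, _ = np.linalg.qr(rng.standard_normal((6, 3)))   # basis of a random 3-dim subspace
P = U @ U.T                                        # orthogonal projection P_t
I_t = np.eye(20)

ld_proj = np.linalg.slogdet(Phi @ P @ Phi.T / lam + I_t)[1]   # logdet(A_tilde_t / lam)
ld_full = np.linalg.slogdet(Phi @ Phi.T / lam + I_t)[1]       # logdet(A_t / lam)
assert ld_proj <= ld_full + 1e-9
```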

In other words, restricting the problem to \widetilde{\mathcal{H}}_{t} acts as a regularization and reduces the variance of the martingale. Unfortunately, \operatorname*{log\,det}(\mathbf{A}_{t}/\lambda) is too expensive to compute, so we first bound it with d_{\text{eff}}(\lambda,\widetilde{\mathbf{X}}_{t})\log(\kappa^{2}t), and then we bound d_{\text{eff}}(\lambda,\widetilde{\mathbf{X}}_{t})\leq\alpha\sum\nolimits_{s=1}^{t}\widetilde{\sigma}_{t}^{2}(\mathbf{x}_{s}) (Section 3.2), which can be computed efficiently. Finally, a new bias term (c) appears. Combining Section 3.2 with results from Calandriello and Rosasco (2018) on a projection \mathbf{P}_{t} obtained using RLS sampling, we show that

\displaystyle\mathbf{I}-\mathbf{P}_{t}\preceq\frac{\lambda}{1-\varepsilon}\mathbf{A}_{t}^{-1}.

The combination of (a), (b), and (c) finally leads to the definition of \widetilde{\beta}_{t} and the final regret bound R_{T}\leq\sqrt{\widetilde{\beta}_{T}}\sqrt{\sum\nolimits_{t=1}^{T}\boldsymbol{\phi}(\mathbf{x}_{t})^{\mathsf{\scriptscriptstyle T}}\widetilde{\mathbf{A}}_{t}^{-1}\boldsymbol{\phi}(\mathbf{x}_{t})}. To conclude the proof, we bound \sum\nolimits_{t=1}^{T}\boldsymbol{\phi}(\mathbf{x}_{t})^{\mathsf{\scriptscriptstyle T}}\widetilde{\mathbf{A}}_{t}^{-1}\boldsymbol{\phi}(\mathbf{x}_{t}) with the following corollary of Section 3.2.

Corollary. Under the same conditions as Section 4, for all t\in[T] we have \mathbf{A}_{t}/\alpha\preceq\widetilde{\mathbf{A}}_{t}\preceq\alpha\mathbf{A}_{t}.


The novel bound \|\boldsymbol{\Phi}(\mathbf{X}_{t})^{\mathsf{\scriptscriptstyle T}}\|_{\mathbf{I}-\mathbf{P}_{t}}\leq\tfrac{\lambda}{1-\varepsilon}\|\boldsymbol{\Phi}(\mathbf{X}_{t})^{\mathsf{\scriptscriptstyle T}}\|_{\mathbf{A}_{t}^{-1}} has a crucial role in controlling the bias due to the projection \mathbf{P}_{t}. Note that the second term measures the error with the same metric \mathbf{A}_{t}^{-1} used by the variance martingale. In other words, the bias introduced by BKB’s approximation can be seen as a self-normalizing bias. It is larger along directions that have been sampled less frequently, and smaller along directions correlated with arms selected often (e.g., the optimal arm).
Our analysis bears some similarity with the one recently and independently developed by Kuzborskij et al. (2019). Nonetheless, our proof improves their result along two dimensions. First, we consider the more general (and challenging) GP optimization setting. Second, we do not fix the rank of our approximation in advance. While their analysis also exploits a self-normalized bias argument, this applies only to the k largest components. If the problem has an effective dimension larger than k, their radius and regret become essentially linear. In BKB we use our adaptive sampling scheme to include all necessary directions and to achieve the same regret rate as exact GP-UCB.

5 Discussion

As the literature on Bayesian optimization is vast and a complete review is out of the scope of this paper, we do not compare to alternative GP acquisition functions, such as GP-EI or GP-PI, and we focus on approximation techniques with theoretical guarantees. Similarly, we exclude scalable variational-inference-based methods, even when their approximate posterior is provably accurate, such as pF-DTC (Huggins et al., 2019), since they only provide guarantees for GP regression and not for the harder optimization setting. We also do not discuss KernelUCB (Valko et al., 2013), which has a tighter analysis than GP-UCB, since the algorithm’s construction is not efficient in practice.

Infinite arm sets. Looking at the proof of Section 3.2, the guarantees on \widetilde{u}_{t} hold over all of \mathcal{H}, and in Section 4 we only require that the maximizer \widetilde{\mathbf{x}}_{t+1}=\operatorname*{arg\,max}_{\mathbf{x}\in\mathcal{A}}\max_{\mathbf{w}\in\widetilde{C}_{t}}\boldsymbol{\phi}(\mathbf{x})^{\mathsf{\scriptscriptstyle T}}\mathbf{w} is returned. Therefore, the accuracy and regret guarantees also hold for an infinite set of arms \mathcal{A}. However, the search over \mathcal{A} can be difficult. In the general case, maximization of a GP posterior is an NP-hard problem, and we focused on the easier case of finite sets, where enumeration is sufficient. Note that this automatically introduces an \Omega(A) runtime dependency, which could be removed if the user can provide an efficient method to solve the maximization problem on a specific infinite set \mathcal{A}. However, known general maximization algorithms usually scale exponentially with the input dimension d and are not practical (Mutny and Krause, 2018). Finally, note that recomputing a new set \mathcal{S}_{t} still requires \mathcal{O}(\min\{A,t\}d_{\text{eff}}^{2}) time at each step. This represents a bottleneck in BKB independent of the arm selection problem, and must be addressed separately in future work.

Linear bandits with matrix sketching. Our analysis is closely related to CBRAP (Yu et al., 2017) and SOFUL (Kuzborskij et al., 2019). CBRAP uses Gaussian projections to embed all arms in a lower-dimensional space for efficiency. Unfortunately, their approach must either use an embedded space at least \Omega(T) large, which in most cases would be even slower than exact OFUL, or it incurs linear regret w.h.p. Another approach for Euclidean spaces based on matrix approximation is introduced by Kuzborskij et al. (2019). It uses Frequent Directions (Ghashami et al., 2016), a method similar to incremental PCA, to embed the arms into \mathbb{R}^{m}, where m is fixed in advance. They achieve a \widetilde{\mathcal{O}}(TAm^{2}) runtime and \widetilde{\mathcal{O}}((1+\varepsilon_{m})^{3/2}(d+m)\sqrt{T}) regret, where \varepsilon_{m} is the sum of the d-m smallest eigenvalues of \mathbf{A}_{T}. However, notice that if the spectral tail does not decrease quickly enough, this algorithm may suffer linear regret, and no adaptive way to tune m is provided. On the same task, BKB does achieve a \widetilde{\mathcal{O}}(d\sqrt{T}) regret, since it adaptively chooses the size of the embedding. Computationally, directly instantiating BKB with a linear kernel achieves a \widetilde{\mathcal{O}}(TAm_{t}^{2}) runtime, matching that of Kuzborskij et al. (2019).
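To make the comparison concrete, the sketching routine underlying SOFUL can be summarized in a few lines. This is our simplified sketch of Frequent Directions (after Ghashami et al., 2016), not SOFUL's actual implementation; sizes and names are illustrative:

```python
import numpy as np

def frequent_directions(X, ell):
    """Minimal Frequent Directions sketch: maintain an ell x d matrix B whose
    Gram matrix B^T B approximates X^T X from below with a fixed rank ell."""
    n, d = X.shape
    B = np.zeros((ell, d))
    for row in X:
        zero_rows = np.flatnonzero(~B.any(axis=1))
        if len(zero_rows) == 0:
            # buffer full: rotate to singular directions and shrink them all
            _, s, Vt = np.linalg.svd(B, full_matrices=False)
            s = np.sqrt(np.maximum(s ** 2 - s[-1] ** 2, 0.0))  # frees the last row
            B = np.diag(s) @ Vt
            zero_rows = np.flatnonzero(~B.any(axis=1))
        B[zero_rows[0]] = row
    return B

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
B = frequent_directions(X, ell=5)
# the sketch never over-estimates the covariance: X^T X - B^T B is PSD
gap = np.linalg.eigvalsh(X.T @ X - B.T @ B).min()
assert gap >= -1e-8
```

The repeated shrinking by the smallest singular value is exactly what produces the spectral-tail error \varepsilon_{m} discussed above: directions outside the top-\ell subspace are progressively discarded, whereas BKB instead grows its dictionary when those directions carry variance.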

Approximate GP with RFF. Traditionally, RFF approaches have been popular to transform GP optimization into a finite-dimensional problem and allow for scalability. Unfortunately, GP-UCB with traditional RFFs is not low-regret, as RFFs are well known to suffer from variance starvation (Wang et al., 2018), and unfeasibly large RFF embeddings would be necessary to prevent it. Recently, Mutny and Krause (2018) proposed an alternative approach based on Quadrature Fourier Features (QFF), a specialized random-feature construction for stationary kernels. They achieve the same regret rate as GP-UCB and BKB, with a \mathcal{O}(TA\log(T)^{d+1}) runtime. However, quadrature-based approaches apply to stationary kernels only, and require an \varepsilon-cover of \mathcal{A}; hence they cannot escape an exponential dependency on the dimensionality d. Conversely, BKB can be applied to any kernel function, and while not specifically designed for this task, it also achieves a close \widetilde{\mathcal{O}}(TA\log(T)^{3(d+1)}) runtime. Moreover, in practice the size of \mathcal{S}_{T} can be much smaller than exponential in d.

Acknowledgements   This material is based upon work supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216. L. R. acknowledges the financial support of the AFOSR projects FA9550-17-1-0390 and BAA-AFRL-AFOSR-2016-0007 (European Office of Aerospace Research and Development), and the EU H2020-MSCA-RISE project NoMADS - DLV-777826. The research presented was also supported by European CHIST-ERA project DELTA, French Ministry of Higher Education and Research, Nord-Pas-de-Calais Regional Council, Inria and Otto-von-Guericke-Universität Magdeburg associated-team north-European project Allocate, and French National Research Agency projects ExTra-Learn (n.ANR-14-CE24-0010-01) and BoB (n.ANR-16-CE23-0003). This research has benefited from the support of the FMJH Program PGMO and from the support to this program from Criteo.


  • Abbasi-Yadkori et al. (2011) Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
  • Alaoui and Mahoney (2015) Ahmed El Alaoui and Michael W. Mahoney. Fast randomized kernel methods with statistical guarantees. In Neural Information Processing Systems, 2015.
  • Calandriello and Rosasco (2018) Daniele Calandriello and Lorenzo Rosasco. Statistical and computational trade-offs in kernel k-means. In Neural Information Processing Systems, 2018.
  • Calandriello et al. (2017a) Daniele Calandriello, Alessandro Lazaric, and Michal Valko. Second-order kernel online convex optimization with adaptive sketching. In International Conference on Machine Learning, 2017a.
  • Calandriello et al. (2017b) Daniele Calandriello, Alessandro Lazaric, and Michal Valko. Distributed adaptive sampling for kernel matrix approximation. In AISTATS, 2017b.
  • Chowdhury and Gopalan (2017) Sayak Ray Chowdhury and Aditya Gopalan. On kernelized multi-armed bandits. In International Conference on Machine Learning, pages 844–853, 2017.
  • Chowdhury and Gopalan (2019) Sayak Ray Chowdhury and Aditya Gopalan. Online learning in kernelized Markov decision processes. In International Conference on Artificial Intelligence and Statistics, may 2019. URL http://arxiv.org/abs/1805.08052.
  • Dani et al. (2008) Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In COLT, 2008.
  • Ghashami et al. (2016) Mina Ghashami, Edo Liberty, Jeff M Phillips, and David P Woodruff. Frequent directions: Simple and deterministic matrix sketching. SIAM Journal on Computing, 45(5):1762–1792, 2016.
  • Ghosh et al. (2017) Avishek Ghosh, Sayak Ray Chowdhury, and Aditya Gopalan. Misspecified linear bandits. In AAAI, apr 2017. URL https://arxiv.org/abs/1704.06880.
  • Hazan et al. (2006) Elad Hazan, Adam Kalai, Satyen Kale, and Amit Agarwal. Logarithmic regret algorithms for online convex optimization. In Conference on Learning Theory. Springer, 2006.
  • Huggins et al. (2019) Jonathan H Huggins, Trevor Campbell, Mikołaj Kasprzak, and Tamara Broderick. Scalable gaussian process inference with finite-data mean and variance guarantees. International Conference on Artificial Intelligence and Statistics, 2019.
  • Kuzborskij et al. (2019) Ilja Kuzborskij, Leonardo Cella, and Nicolò Cesa-Bianchi. Efficient linear bandits through matrix sketching. In International Conference on Artificial Intelligence and Statistics, 2019.
  • Lattimore and Szepesvári (2020) Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
  • Li et al. (2010) Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. International World Wide Web Conference, 2010. URL http://rob.schapire.net/papers/www10.pdf.
  • Liu et al. (2018) Haitao Liu, Yew-Soon Ong, Xiaobo Shen, and Jianfei Cai. When gaussian process meets big data: A review of scalable gps. arXiv preprint arXiv:1807.01065, 2018.
  • Mockus (1989) Jonas Mockus. Global optimization and the Bayesian approach. 1989. doi: 10.1007/978-94-009-0909-0_1. URL http://www.springerlink.com/index/10.1007/978-94-009-0909-0_1.
  • Mutny and Krause (2018) Mojmir Mutny and Andreas Krause. Efficient High Dimensional Bayesian Optimization with Additivity and Quadrature Fourier Features. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 9019–9030. Curran Associates, Inc., 2018.
  • Pelikan (2005) Martin Pelikan. Hierarchical bayesian optimization algorithm. In Hierarchical Bayesian Optimization Algorithm, pages 105–129. Springer, 2005.
  • Quinonero-Candela et al. (2007) Joaquin Quinonero-Candela, Carl Edward Rasmussen, and Christopher KI Williams. Approximation methods for gaussian process regression. Large-scale kernel machines, pages 203–224, 2007.
  • Rahimi and Recht (2007) Ali Rahimi and Ben Recht. Random features for large-scale kernel machines. In Neural Information Processing Systems, 2007.
  • Rasmussen and Williams (2006) Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian processes for machine learning. Adaptive computation and machine learning. MIT Press, Cambridge, Mass, 2006. ISBN 978-0-262-18253-9. OCLC: ocm61285753.
  • Robbins (1952) Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematics Society, 58:527–535, 1952.
  • Scarlett et al. (2017) Jonathan Scarlett, Ilija Bogunovic, and Volkan Cevher. Lower bounds on regret for noisy gaussian process bandit optimization. In Conference on Learning Theory, pages 1723–1742, 2017.
  • Seeger et al. (2003) Matthias Seeger, Christopher Williams, and Neil Lawrence. Fast forward selection to speed up sparse gaussian process regression. In Artificial Intelligence and Statistics 9, number EPFL-CONF-161318, 2003.
  • Snoek et al. (2012) Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pages 2951–2959, 2012.
  • Srinivas et al. (2010) Niranjan Srinivas, Andreas Krause, Matthias Seeger, and Sham M Kakade. Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning, pages 1015–1022, 2010.
  • Tropp (2015) Joel A. Tropp. An introduction to matrix concentration inequalities. Foundations and Trends® in Machine Learning, 8(1-2):1–230, 2015. ISSN 1935-8237. doi: 10.1561/2200000048. URL http://dx.doi.org/10.1561/2200000048.
  • Valko et al. (2013) Michal Valko, Nathan Korda, Rémi Munos, Ilias Flaounas, and Nello Cristianini. Finite-time analysis of kernelised contextual bandits. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, pages 654–663. AUAI Press, 2013.
  • Wahba (1990) Grace Wahba. Spline models for observational data, volume 59. Siam, 1990.
  • Wang et al. (2018) Zi Wang, Clement Gehring, Pushmeet Kohli, and Stefanie Jegelka. Batched large-scale bayesian optimization in high-dimensional spaces. In International Conference on Artificial Intelligence and Statistics, pages 745–754, 2018.
  • Yu et al. (2017) Xiaotian Yu, Michael R. Lyu, and Irwin King. CBRAP: Contextual bandits with random projection. In AAAI Conference on Artificial Intelligence, 2017.

Appendix A Relaxing assumptions

In the derivations so far we made several assumptions. While some are necessary, others can be relaxed.

Assumptions on the noise. Throughout the paper, we assumed that the noise \eta_{t} is i.i.d. Gaussian. Since Chowdhury and Gopalan’s results hold for any \xi-sub-Gaussian noise that is measurable w.r.t. the filtration of the previous iterations, this assumption can be easily relaxed.

Assumptions on the arms. So far we considered a set of arms that is (a) in \mathbb{R}^{d}, (b) fixed for all t, and (c) finite. Relaxing (a) is easy: we make no assumption beyond boundedness on the kernel function k, and there are many bounded kernel functions for non-Euclidean spaces, e.g., strings or graphs. Relaxing (b) is also straightforward: we simply embed the changing arm sets as they are provided, and store and re-embed previously selected arms as necessary. The per-step time complexity then depends on the size of the arm set available at each step. Relaxing (c) is straightforward from a theoretical perspective, but has varying computational consequences. In particular, looking at the proof of Section 3.2, the guarantees on \widetilde{u}_{t} hold over all of \mathcal{H}, and in Section 4 we only require that the maximizer \widetilde{\mathbf{x}}_{t+1}=\operatorname*{arg\,max}_{\mathbf{x}\in\mathcal{A}}\max_{\mathbf{w}\in\widetilde{C}_{t}}\boldsymbol{\phi}(\mathbf{x})^{\mathsf{\scriptscriptstyle T}}\mathbf{w} is returned. Therefore, at least from the regret point of view, everything holds also for infinite \mathcal{A}. However, while the inner maximization over \widetilde{C}_{t} can be solved in closed form for a fixed \mathbf{x}, the same cannot be said of the search over \mathcal{A}. If the designer can provide an efficient method to solve the maximization problem on an infinite \mathcal{A}, e.g., linear bandit optimization over compact subsets of \mathbb{R}^{d}, then all BKB guarantees hold. But in the general case, maximization of a GP posterior, or of a general function over \mathcal{H}, is an NP-hard problem, and we focus on the easier case of finite sets, where enumeration is sufficient. Note, however, that recomputing a new set \mathcal{S}_{t} still requires \mathcal{O}(\min\{A,t\}d_{\text{eff}}^{2}) time at each step. This represents a bottleneck independent of the arm selection problem, and it must be addressed separately.

Appendix B Preliminary results on posterior variance

For convenience, we collect here several results on the posterior variance \sigma_{t}^{2}(\cdot) that we will use throughout. While most of these hold for generic RLS, we adapt them to our notation.


[Calandriello et al., 2017b]

\displaystyle\frac{1}{\kappa^{2}/\lambda+1}\sigma_{t-1}^{2}(\widetilde{\mathbf{x}}_{t})\leq\frac{1}{\sigma_{t-1}^{2}(\widetilde{\mathbf{x}}_{t})+1}\sigma_{t-1}^{2}(\widetilde{\mathbf{x}}_{t})\leq\sigma_{t}^{2}(\widetilde{\mathbf{x}}_{t})\leq\sigma_{t-1}^{2}(\widetilde{\mathbf{x}}_{t})

The leftmost inequality follows from \kappa^{2}/\lambda\geq\sigma_{0}^{2}(x) and \sigma^{2}_{a}(x)\geq\sigma^{2}_{b}(x) for all a\leq b; the others follow from Calandriello et al. (2017b).
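As a sanity check, the lemma can be verified numerically in a hypothetical finite-dimensional feature space (not from the paper: random unit-norm features so that \kappa^{2}=1, and \sigma_{t}^{2}(\mathbf{x})=\boldsymbol{\phi}(\mathbf{x})^{\mathsf{\scriptscriptstyle T}}(\boldsymbol{\Phi}_{t}\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}}+\lambda\mathbf{I})^{-1}\boldsymbol{\phi}(\mathbf{x}), matching the identity used later in Appendix C). At \widetilde{\mathbf{x}}_{t}, the middle inequality is in fact an equality by the Sherman–Morrison formula:

```python
import numpy as np

# Illustrative finite-dimensional check of the variance sandwich above.
rng = np.random.default_rng(0)
d, T, lam = 5, 30, 0.5
Phi = rng.standard_normal((d, T))
Phi /= np.linalg.norm(Phi, axis=0)       # unit-norm columns phi_i, so kappa^2 = 1
kappa2 = 1.0

def post_var(cols, phi):
    # sigma^2(x) = phi^T (Phi Phi^T + lam I)^{-1} phi
    A = cols @ cols.T + lam * np.eye(d)
    return phi @ np.linalg.solve(A, phi)

for t in range(1, T):
    phi_t = Phi[:, t]
    prev = post_var(Phi[:, :t], phi_t)       # sigma_{t-1}^2(x_t)
    curr = post_var(Phi[:, :t + 1], phi_t)   # sigma_t^2(x_t)
    assert prev / (kappa2 / lam + 1) <= curr + 1e-12
    # middle bound holds with equality at x_t (Sherman-Morrison rank-one update)
    assert abs(prev / (prev + 1) - curr) < 1e-9
    assert curr <= prev + 1e-12
```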


[Hazan et al., 2006; Calandriello et al., 2017a]

\displaystyle d_{\text{eff}}(\lambda,\widetilde{\mathbf{X}}_{T}) \displaystyle=\operatorname{Tr}\left({\mathbf{K}}_{T}({\mathbf{K}}_{T}+\lambda\mathbf{I})^{-1}\right)=\sum\nolimits_{t=1}^{T}\sigma_{T}^{2}(\widetilde{\mathbf{x}}_{t})
\displaystyle\overset{(1)}{\leq}\sum\nolimits_{t=1}^{T}\sigma_{t}^{2}(\widetilde{\mathbf{x}}_{t})
\displaystyle\overset{(2)}{\leq}\log\det\left({\mathbf{K}}_{T}/\lambda+\mathbf{I}\right)
\displaystyle\overset{(3)}{\leq}\operatorname{Tr}\left({\mathbf{K}}_{T}({\mathbf{K}}_{T}+\lambda\mathbf{I})^{-1}\right)\left(1+\log\left(\tfrac{\|{\mathbf{K}}_{T}\|}{\lambda}+1\right)\right).

Inequality (1) is due to Appendix B. Inequality (2) is due to Hazan et al. (2006). Inequality (3) is due to Calandriello et al. (2017a).
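The whole chain can be checked numerically on a random problem (an illustrative setup, not from the paper: a linear kernel {\mathbf{K}}_{T}=\boldsymbol{\Phi}^{\mathsf{\scriptscriptstyle T}}\boldsymbol{\Phi} with random d-dimensional features, so that \sigma_{t}^{2}(\widetilde{\mathbf{x}}_{t}) can be computed in the primal):

```python
import numpy as np

# Numeric check of d_eff <= sum_t sigma_t^2(x_t) <= logdet(K/lam + I) <= d_eff*(1+log(||K||/lam + 1))
rng = np.random.default_rng(1)
d, T, lam = 4, 25, 1.0
Phi = rng.standard_normal((d, T))        # columns phi_t
K = Phi.T @ Phi                           # kernel matrix K_T

d_eff = np.trace(K @ np.linalg.inv(K + lam * np.eye(T)))
online = sum(                             # sum_t sigma_t^2(x_t), primal form
    Phi[:, t] @ np.linalg.solve(
        Phi[:, :t + 1] @ Phi[:, :t + 1].T + lam * np.eye(d), Phi[:, t])
    for t in range(T))
logdet = np.linalg.slogdet(K / lam + np.eye(T))[1]
upper = d_eff * (1 + np.log(np.linalg.norm(K, 2) / lam + 1))

assert d_eff <= online + 1e-9
assert online <= logdet + 1e-9
assert logdet <= upper + 1e-9
```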

Appendix C Proof of Section 3.2

Let B_{t} be the bad event where the guarantees of Section 3.2 do not hold. Our goal in this proof is to show that the probability that B_{t} happens for some t\in[T] is at most \delta.

C.1 Notation

For notational simplicity, in the following we refer to \boldsymbol{\Phi}(\widetilde{\mathbf{X}}_{t}) as \boldsymbol{\Phi}_{t}, \widetilde{\boldsymbol{\Phi}}(\widetilde{\mathbf{X}}_{t}) as \widetilde{\boldsymbol{\Phi}}_{t}, and \boldsymbol{\phi}(\widetilde{\mathbf{x}}_{t}) as \boldsymbol{\phi}_{t}. When the subscript is clear from the context, we will omit it. Since we leverage several results of Calandriello et al. (2017a), we report here some necessary additional notation.

First we extend our notation for the subset \mathcal{S}_{t} to include a possible reweighting of the inducing points. We denote with \mathcal{S}_{t}=\{(\boldsymbol{\phi}_{j},s_{j})\}_{j=1}^{m_{t}} a weighted subset, i.e., a weighted dictionary, of columns from \boldsymbol{\Phi}_{t}, with positive weights s_{j}>0 that must be appropriately chosen.

Denote with i_{j}\in[t] the index of the sample \boldsymbol{\phi}_{j} as a column in \boldsymbol{\Phi}_{t}. Using a standard approach (Alaoui and Mahoney, 2015), we choose s_{j}=1/\sqrt{\widetilde{p}_{t,i_{j}}}, where \widetilde{p}_{t,i}=\overline{q}\widetilde{\sigma}_{t-1}^{2}(\widetilde{\mathbf{x}}_{i}) is the probability used by Figure 1 when sampling \boldsymbol{\phi}_{i_{j}} from \boldsymbol{\Phi}_{t}. (Note that \widetilde{p}_{t,i} might be larger than 1, but with a small abuse of notation and w.l.o.g. we will still refer to it as a probability.)

Let us now denote with {\mathbf{S}}_{t}\in\mathbb{R}^{t\times t} the diagonal matrix with q_{t,i}/\sqrt{\widetilde{p}_{t,i}} on the diagonal, where q_{t,i} are the \{0,1\} r.v. selected by Figure 1. Then we can see that

\displaystyle\sum_{j=1}^{m_{t}}\frac{1}{\widetilde{p}_{t,i_{j}}}\boldsymbol{\phi}_{i_{j}}\boldsymbol{\phi}_{i_{j}}^{\mathsf{\scriptscriptstyle T}}=\sum_{i=1}^{t}\frac{q_{t,i}}{\widetilde{p}_{t,i}}\boldsymbol{\phi}_{i}\boldsymbol{\phi}_{i}^{\mathsf{\scriptscriptstyle T}}=\boldsymbol{\Phi}_{t}{\mathbf{S}}_{t}{\mathbf{S}}_{t}^{\mathsf{\scriptscriptstyle T}}\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}}. (11)

Calandriello et al. (2017b) define \mathcal{S}_{t} to be an \varepsilon-accurate dictionary of \boldsymbol{\Phi}_{t} if it satisfies the following condition

\displaystyle(1-\varepsilon)\boldsymbol{\Phi}_{t}\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}}-\varepsilon\lambda\mathbf{I}\preceq\boldsymbol{\Phi}_{t}{\mathbf{S}}_{t}{\mathbf{S}}_{t}^{\mathsf{\scriptscriptstyle T}}\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}}\preceq(1+\varepsilon)\boldsymbol{\Phi}_{t}\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}}+\varepsilon\lambda\mathbf{I}. (12)

We can also now fully define the projection operator at time t (see Section 4.1 for more details) as

\displaystyle\mathbf{P}_{t}=\boldsymbol{\Phi}_{t}{\mathbf{S}}_{t}({\mathbf{S}}_{t}^{\mathsf{\scriptscriptstyle T}}\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}}\boldsymbol{\Phi}_{t}{\mathbf{S}}_{t})^{+}{\mathbf{S}}_{t}^{\mathsf{\scriptscriptstyle T}}\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}},

i.e., the projection matrix onto the span of the dictionary.
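As a quick finite-dimensional sanity check (dimensions and weights below are purely illustrative), \mathbf{P}_{t} is indeed the orthogonal projection onto the span of the weighted dictionary \boldsymbol{\Phi}_{t}{\mathbf{S}}_{t}; in particular \mathbf{P}_{t}\boldsymbol{\Phi}_{t}{\mathbf{S}}_{t}=\boldsymbol{\Phi}_{t}{\mathbf{S}}_{t}, the identity used in the proof of the next lemma:

```python
import numpy as np

# P = C (C^T C)^+ C^T with C = Phi S is the orthogonal projection onto span(C).
rng = np.random.default_rng(4)
d, t, m = 5, 40, 8
Phi = rng.standard_normal((d, t))        # columns phi_i
idx = rng.choice(t, size=m, replace=False)
S = np.zeros((t, m))
S[idx, np.arange(m)] = 2.0               # weights s_j = 1/sqrt(p), with p = 1/4 here
C = Phi @ S                              # weighted dictionary Phi S
P = C @ np.linalg.pinv(C.T @ C) @ C.T

assert np.allclose(P @ P, P)             # idempotent: a projection
assert np.allclose(P, P.T)               # symmetric: an orthogonal projection
assert np.allclose(P @ C, C)             # fixes the dictionary: P Phi S = Phi S
```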

C.2 Event decomposition

We can decompose Section 3.2 into an accuracy part, i.e., \mathcal{S}_{t} must induce accurate \widetilde{\sigma}_{t}, and an efficiency part, i.e., m_{t}\leq d_{\text{eff}}(t). We also connect the accuracy of \widetilde{\sigma}_{t} to the definition of \varepsilon-accuracy. {lemma} Let \alpha=\frac{1+\varepsilon}{1-\varepsilon}. If \mathcal{S}_{t} is \varepsilon-accurate w.r.t. \boldsymbol{\Phi}_{t}, then

\displaystyle\mathbf{A}_{t}/\alpha\preceq\widetilde{\mathbf{A}}_{t}\preceq\alpha\mathbf{A}_{t}

and \sigma_{t}^{2}(\mathbf{x})/\alpha\leq\min\left\{\widetilde{\sigma}_{t}^{2}(\mathbf{x}),1\right\}\leq\alpha\sigma_{t}^{2}(\mathbf{x}) for all \mathbf{x}\in\mathcal{A}. {proof} Inverting the bound in Equation 12, and using the fact that \mathbf{P}_{t}\boldsymbol{\Phi}_{t}{\mathbf{S}}_{t}=\boldsymbol{\Phi}_{t}{\mathbf{S}}_{t}, we have

\displaystyle\mathbf{P}_{t}\boldsymbol{\Phi}_{t}\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}}\mathbf{P}_{t} \displaystyle\preceq\frac{1}{1-\varepsilon}(\mathbf{P}_{t}\boldsymbol{\Phi}_{t}{\mathbf{S}}_{t}{\mathbf{S}}_{t}^{\mathsf{\scriptscriptstyle T}}\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}}\mathbf{P}_{t}+\varepsilon\lambda\mathbf{P}_{t})\preceq\frac{1}{1-\varepsilon}(\boldsymbol{\Phi}_{t}{\mathbf{S}}_{t}{\mathbf{S}}_{t}^{\mathsf{\scriptscriptstyle T}}\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}}+\varepsilon\lambda\mathbf{P}_{t})
\displaystyle\preceq\frac{1}{1-\varepsilon}((1+\varepsilon)\boldsymbol{\Phi}_{t}\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}}+\varepsilon\lambda\mathbf{I}+\varepsilon\lambda\mathbf{P}_{t})\preceq\frac{1+\varepsilon}{1-\varepsilon}\left(\boldsymbol{\Phi}_{t}\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}}+\frac{2\varepsilon}{1+\varepsilon}\lambda\mathbf{I}\right).

Repeating for the other side we obtain

\displaystyle\frac{1-\varepsilon}{1+\varepsilon}\left(\boldsymbol{\Phi}_{t}\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}}-\frac{2\varepsilon}{1-\varepsilon}\lambda\mathbf{I}\right)\preceq\mathbf{P}_{t}\boldsymbol{\Phi}_{t}\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}}\mathbf{P}_{t}\preceq\frac{1+\varepsilon}{1-\varepsilon}\left(\boldsymbol{\Phi}_{t}\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}}+\frac{2\varepsilon}{1+\varepsilon}\lambda\mathbf{I}\right).

Applying this to \widetilde{\mathbf{A}}_{t} we have

\displaystyle\widetilde{\mathbf{A}}_{t}=\mathbf{P}_{t}\boldsymbol{\Phi}_{t}\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}}\mathbf{P}_{t}+\lambda\mathbf{I}\succeq\frac{1-\varepsilon}{1+\varepsilon}\left(\boldsymbol{\Phi}_{t}\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}}-\frac{2\varepsilon}{1-\varepsilon}\lambda\mathbf{I}\right)+\lambda\mathbf{I}=\frac{1-\varepsilon}{1+\varepsilon}\left(\boldsymbol{\Phi}_{t}\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}}+\lambda\mathbf{I}\right)=\frac{1-\varepsilon}{1+\varepsilon}\mathbf{A}_{t},

which can again be applied on the other side to obtain our result. To prove the accuracy of the approximate posterior variance \widetilde{\sigma}_{t}^{2}(\mathbf{x}_{i}), we simply apply the definition

\displaystyle\frac{1-\varepsilon}{1+\varepsilon}\overbracket{\boldsymbol{\phi}_{i}^{\mathsf{\scriptscriptstyle T}}\mathbf{A}_{t}^{-1}\boldsymbol{\phi}_{i}}^{{\sigma}_{t}^{2}(\mathbf{x}_{i})}\leq\overbracket{\boldsymbol{\phi}_{i}^{\mathsf{\scriptscriptstyle T}}\widetilde{\mathbf{A}}_{t}^{-1}\boldsymbol{\phi}_{i}}^{\widetilde{\sigma}_{t}^{2}(\mathbf{x}_{i})}\leq\frac{1+\varepsilon}{1-\varepsilon}\overbracket{\boldsymbol{\phi}_{i}^{\mathsf{\scriptscriptstyle T}}\mathbf{A}_{t}^{-1}\boldsymbol{\phi}_{i}}^{{\sigma}_{t}^{2}(\mathbf{x}_{i})}.
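To make the \varepsilon-accuracy condition concrete, the following sketch (with illustrative dimensions and an oversampling parameter \overline{q} chosen for demonstration, not taken from the paper) samples columns with probabilities proportional to their ridge leverage scores \sigma_{t}^{2}(\widetilde{\mathbf{x}}_{i}), then directly measures the resulting \varepsilon in Equation 12:

```python
import numpy as np

rng = np.random.default_rng(2)
d, t, lam, qbar = 6, 2000, 1.0, 100.0
Phi = rng.standard_normal((d, t)) / np.sqrt(d)    # columns phi_i
A = Phi @ Phi.T + lam * np.eye(d)

# ridge leverage scores sigma^2(x_i) = phi_i^T A^{-1} phi_i
lev = np.einsum('ij,ij->j', Phi, np.linalg.solve(A, Phi))
p = np.minimum(qbar * lev, 1.0)                   # inclusion probabilities
q = rng.random(t) < p                             # Bernoulli selections q_i
S2 = np.where(q, 1.0 / p, 0.0)                    # squared weights q_i / p_i

# measured eps: operator norm of A^{-1/2} (Phi S S^T Phi^T - Phi Phi^T) A^{-1/2}
w, V = np.linalg.eigh(A)
A_inv_half = V @ np.diag(w ** -0.5) @ V.T
approx = Phi @ (S2[:, None] * Phi.T)              # Phi S S^T Phi^T
eps = np.linalg.norm(A_inv_half @ (approx - Phi @ Phi.T) @ A_inv_half, 2)

assert eps < 0.5          # comfortably eps-accurate at this oversampling level
assert q.sum() < t        # only a fraction of the columns is kept
```

At this oversampling level the measured \varepsilon is typically well below 1/2 while only a fraction of the t columns is retained, which is exactly the accuracy/efficiency trade-off quantified by the two events A_{t} and E_{t} below.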

Using Section C.2 we can now decompose our bad event as B_{t}=A_{t}\cup E_{t}, where A_{t} is the event where \mathcal{S}_{t} is not \varepsilon-accurate w.r.t. \boldsymbol{\Phi}_{t}, and E_{t} is the event where m_{t} is much larger than d_{\text{eff}}(\lambda,\widetilde{\mathbf{X}}_{t}). We can further decompose the event A_{t} as

\displaystyle A_{t} \displaystyle=(A_{t}\cap A_{t-1})\cup(A_{t}\cap A_{t-1}^{\complement})
\displaystyle\subseteq A_{t-1}\cup(A_{t}\cap A_{t-1}^{\complement})=A_{0}\cup\left(\bigcup_{s=1}^{t}(A_{s}\cap A_{s-1}^{\complement})\right)=\bigcup_{s=1}^{t}(A_{s}\cap A_{s-1}^{\complement}),

where A_{0} is the empty event, since \boldsymbol{\Phi}_{0} is empty and it is trivially well approximated by the empty \mathcal{S}_{0}. Moreover, we can simplify a part of the problem by noting that

\displaystyle B_{t}=A_{t}\cup E_{t}=A_{t}\cup(E_{t}\cap A_{t-1}^{\complement})\cup(E_{t}\cap A_{t-1})\subseteq A_{t}\cup A_{t-1}\cup(E_{t}\cap A_{t-1}^{\complement}),

which will help us when bounding the event E_{t}, where we will directly act as if A_{t-1} does not hold. Putting it all together,

\displaystyle\bigcup_{t=1}^{T}B_{t} \displaystyle=\bigcup_{t=1}^{T}(A_{t}\cup E_{t})\subseteq\bigcup_{t=1}^{T}\left(A_{t}\cup A_{t-1}\cup(E_{t}\cap A_{t-1}^{\complement})\right)
\displaystyle=\left(\bigcup_{t=1}^{T}A_{t}\right)\cup\left(\bigcup_{t=1}^{T}(E_{t}\cap A_{t-1}^{\complement})\right)
\displaystyle\subseteq\left(\bigcup_{t=1}^{T}\left(\bigcup_{s=1}^{t}(A_{s}\cap A_{s-1}^{\complement})\right)\right)\cup\left(\bigcup_{t=1}^{T}(E_{t}\cap A_{t-1}^{\complement})\right)
\displaystyle=\left(\bigcup_{t=1}^{T}(A_{t}\cap A_{t-1}^{\complement})\right)\cup\left(\bigcup_{t=1}^{T}(E_{t}\cap A_{t-1}^{\complement})\right).

C.3 Bounding \Pr(A_{t}\cap A_{t-1}^{\complement})

We can now bound the probability of the event A_{t}\cap A_{t-1}^{\complement}. Our first step is to formally define A_{t} using Equation 12. In particular, we can rewrite the \varepsilon-accuracy condition as

\displaystyle(1-\varepsilon)\boldsymbol{\Phi}_{t}\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}}-\varepsilon\lambda\mathbf{I}\preceq\boldsymbol{\Phi}_{t}{\mathbf{S}}_{t}{\mathbf{S}}_{t}^{\mathsf{\scriptscriptstyle T}}\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}}\preceq(1+\varepsilon)\boldsymbol{\Phi}_{t}\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}}+\varepsilon\lambda\mathbf{I}
\displaystyle\iff-\varepsilon(\boldsymbol{\Phi}_{t}\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}}+\lambda\mathbf{I})\preceq\boldsymbol{\Phi}_{t}{\mathbf{S}}_{t}{\mathbf{S}}_{t}^{\mathsf{\scriptscriptstyle T}}\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}}-\boldsymbol{\Phi}_{t}\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}}\preceq\varepsilon(\boldsymbol{\Phi}_{t}\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}}+\lambda\mathbf{I})
\displaystyle\iff-\varepsilon\mathbf{I}\preceq(\boldsymbol{\Phi}_{t}\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}}+\lambda\mathbf{I})^{-1/2}(\boldsymbol{\Phi}_{t}{\mathbf{S}}_{t}{\mathbf{S}}_{t}^{\mathsf{\scriptscriptstyle T}}\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}}-\boldsymbol{\Phi}_{t}\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}})(\boldsymbol{\Phi}_{t}\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}}+\lambda\mathbf{I})^{-1/2}\preceq\varepsilon\mathbf{I}
\displaystyle\iff\|(\boldsymbol{\Phi}_{t}\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}}+\lambda\mathbf{I})^{-1/2}(\boldsymbol{\Phi}_{t}{\mathbf{S}}_{t}{\mathbf{S}}_{t}^{\mathsf{\scriptscriptstyle T}}\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}}-\boldsymbol{\Phi}_{t}\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}})(\boldsymbol{\Phi}_{t}\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}}+\lambda\mathbf{I})^{-1/2}\|\leq\varepsilon.

We can now focus on this last reformulation, and frame it as a random matrix concentration problem in the RKHS \mathcal{H}. Let \mathbf{\psi}_{t,i}=(\boldsymbol{\Phi}_{t}\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}}+\lambda\mathbf{I})^{-1/2}\boldsymbol{\phi}_{i} and \mathbf{\Psi}_{t}=\boldsymbol{\Phi}_{t}(\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}}\boldsymbol{\Phi}_{t}+\lambda\mathbf{I})^{-1/2}=[\mathbf{\psi}_{t,1},\dots,\mathbf{\psi}_{t,t}]^{\mathsf{\scriptscriptstyle T}}, and define the operator \mathbf{G}_{t,i}=\left(\frac{q_{t,i}}{\widetilde{p}_{t,i}}-1\right)\mathbf{\psi}_{t,i}\mathbf{\psi}_{t,i}^{\mathsf{\scriptscriptstyle T}}. Then we can rewrite \varepsilon-accuracy as

\displaystyle\|(\boldsymbol{\Phi}_{t}\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}}+\lambda\mathbf{I})^{-1/2}\boldsymbol{\Phi}_{t}({\mathbf{S}}_{t}{\mathbf{S}}_{t}^{\mathsf{\scriptscriptstyle T}}-\mathbf{I})\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}}(\boldsymbol{\Phi}_{t}\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}}+\lambda\mathbf{I})^{-1/2}\|=\left\|\sum_{i=1}^{t}\left(\frac{q_{t,i}}{\widetilde{p}_{t,i}}-1\right)\mathbf{\psi}_{t,i}\mathbf{\psi}_{t,i}^{\mathsf{\scriptscriptstyle T}}\right\|=\left\|\sum_{i=1}^{t}\mathbf{G}_{t,i}\right\|\leq\varepsilon,

and the event A_{t} as the event where \left\|\sum_{i=1}^{t}\mathbf{G}_{t,i}\right\|\geq\varepsilon. Note that this reformulation exploits the fact that q_{t,i}=0 encodes the columns that are not selected in \mathcal{S}_{t} (see Equation 11). To study this random object, we begin by defining the filtration \mathcal{F}_{t}=\{q_{s,i},\eta_{s}\}_{s=1}^{t} at time t, containing all the randomness coming from the construction of the various \mathcal{S}_{s} and the noise \eta_{t} on the function. In particular, note that the \{0,1\} r.v. q_{t,i} used by Figure 1 are not Bernoulli r.v., since the probability \widetilde{p}_{t,i} used to select 0 or 1 is itself random. However, they become well-defined Bernoulli r.v. when conditioned on \mathcal{F}_{t-1}. Let \mathbb{I}\{\cdot\} denote the indicator function of an event. We can rewrite

\displaystyle\Pr(A_{t}\cap A_{t-1}^{\complement}) \displaystyle=\Pr\left(\left\|\sum_{i=1}^{t}\mathbf{G}_{t,i}\right\|\geq\varepsilon\cap\left\|\sum_{i=1}^{t}\mathbf{G}_{t-1,i}\right\|\leq\varepsilon\right)
\displaystyle=\operatorname*{\mathbb{E}}_{\mathcal{F}_{t}}\left[\mathbb{I}\left\{\left\|\sum_{i=1}^{t}\mathbf{G}_{t,i}\right\|\geq\varepsilon\cap\left\|\sum_{i=1}^{t}\mathbf{G}_{t-1,i}\right\|\leq\varepsilon\right\}\right]
\displaystyle=\operatorname*{\mathbb{E}}_{\mathcal{F}_{t-1}}\left[\operatorname*{\mathbb{E}}_{\eta_{t},\{q_{t,i}\}}\left[\mathbb{I}\left\{\left\|\sum_{i=1}^{t}\mathbf{G}_{t,i}\right\|\geq\varepsilon\cap\left\|\sum_{i=1}^{t}\mathbf{G}_{t-1,i}\right\|\leq\varepsilon\right\}\;\middle|\;\mathcal{F}_{t-1}\right]\right]
\displaystyle=\operatorname*{\mathbb{E}}_{\mathcal{F}_{t-1}}\left[\operatorname*{\mathbb{E}}_{\{q_{t,i}\}}\left[\mathbb{I}\left\{\left\|\sum_{i=1}^{t}\mathbf{G}_{t,i}\right\|\geq\varepsilon\cap\left\|\sum_{i=1}^{t}\mathbf{G}_{t-1,i}\right\|\leq\varepsilon\right\}\;\middle|\;\mathcal{F}_{t-1}\right]\right],

where the last step follows from the fact that \mathbf{G}_{t,i} is independent of \eta_{t}. Notice now that, conditioned on \mathcal{F}_{t-1}, the event A_{t-1}^{\complement} is not random anymore, and we can restrict our expectation to the outcomes where \left\|\sum_{i=1}^{t}\mathbf{G}_{t-1,i}\right\|\leq\varepsilon,

\displaystyle\Pr(A_{t}\cap A_{t-1}^{\complement}) \displaystyle=\operatorname*{\mathbb{E}}_{\mathcal{F}_{t-1}:\left\|\sum_{i=1}^{t}\mathbf{G}_{t-1,i}\right\|\leq\varepsilon}\left[\operatorname*{\mathbb{E}}_{\{q_{t,i}\}}\left[\mathbb{I}\left\{\left\|\sum_{i=1}^{t}\mathbf{G}_{t,i}\right\|\geq\varepsilon\right\}\;\middle|\;\mathcal{F}_{t-1}\right]\right].

Moreover, conditioned on \mathcal{F}_{t-1}, all the q_{t,i} become independent r.v., and we can use the following result of Tropp (2015). {proposition} Let \mathbf{G}_{1},\dots,\mathbf{G}_{t} be a sequence of independent self-adjoint random operators such that \operatorname*{\mathbb{E}}\mathbf{G}_{i}=\mathbf{0} and \left\|\mathbf{G}_{i}\right\|\leq R a.s.
Denote \sigma^{2}=\left\|\sum_{i=1}^{t}\operatorname*{\mathbb{E}}\mathbf{G}_{i}^{2}\right\|. Then for any \varepsilon\geq 0,

\displaystyle\Pr\left(\left\|\sum_{i=1}^{t}\mathbf{G}_{i}\right\|\geq\varepsilon\right)\leq 4t\exp\left(-\frac{\varepsilon^{2}/2}{\sigma^{2}+R\varepsilon/3}\right).

We begin by computing the mean of \mathbf{G}_{t,i}

\displaystyle\operatorname*{\mathbb{E}}_{q_{t,i}}\left[\mathbf{G}_{t,i}\;\middle|\;\mathcal{F}_{t-1}\right] \displaystyle=\operatorname*{\mathbb{E}}_{q_{t,i}}\left[\left(\frac{{q}_{t,i}}{\widetilde{p}_{t,i}}-1\right)\mathbf{\psi}_{t,i}\mathbf{\psi}_{t,i}^{\mathsf{\scriptscriptstyle T}}\;\middle|\;\mathcal{F}_{t-1}\right]
\displaystyle=\left(\frac{\operatorname*{\mathbb{E}}_{q_{t,i}}\left[{q}_{t,i}\;\middle|\;\mathcal{F}_{t-1}\right]}{\widetilde{p}_{t,i}}-1\right)\mathbf{\psi}_{t,i}\mathbf{\psi}_{t,i}^{\mathsf{\scriptscriptstyle T}}=\left(\frac{\widetilde{p}_{t,i}}{\widetilde{p}_{t,i}}-1\right)\mathbf{\psi}_{t,i}\mathbf{\psi}_{t,i}^{\mathsf{\scriptscriptstyle T}}=\mathbf{0},

where we use the fact that \widetilde{p}_{t,i} is fixed conditioned on \mathcal{F}_{t-1} and is the (conditional) expectation of {q}_{t,i}. Since \mathbf{G}_{t,i} is zero-mean, we can apply Section C.3. First we find R, upper bounding

\displaystyle\left\|\mathbf{G}_{t,i}\right\|=\left\|\left(\frac{{q}_{t,i}}{\widetilde{p}_{t,i}}-1\right)\mathbf{\psi}_{t,i}\mathbf{\psi}_{t,i}^{\mathsf{\scriptscriptstyle T}}\right\|\leq\left|\frac{{q}_{t,i}}{\widetilde{p}_{t,i}}-1\right|\|\mathbf{\psi}_{t,i}\mathbf{\psi}_{t,i}^{\mathsf{\scriptscriptstyle T}}\|\leq\frac{1}{\widetilde{p}_{t,i}}\|\mathbf{\psi}_{t,i}\mathbf{\psi}_{t,i}^{\mathsf{\scriptscriptstyle T}}\|.

Note that due to the definition of \mathbf{\psi}_{t,i}

\displaystyle\|\mathbf{\psi}_{t,i}\mathbf{\psi}_{t,i}^{\mathsf{\scriptscriptstyle T}}\|=\mathbf{\psi}_{t,i}^{\mathsf{\scriptscriptstyle T}}\mathbf{\psi}_{t,i}=\boldsymbol{\phi}_{i}^{\mathsf{\scriptscriptstyle T}}(\boldsymbol{\Phi}_{t}\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}}+\lambda\mathbf{I})^{-1}\boldsymbol{\phi}_{i}=\sigma_{t}^{2}(\widetilde{\mathbf{x}}_{i}).

Moreover, we are only considering outcomes of \mathcal{F}_{t-1} where \left\|\sum_{i=1}^{t}\mathbf{G}_{t-1,i}\right\|\leq\varepsilon, which implies that \mathcal{S}_{t-1} is \varepsilon-accurate, and by Section C.2 we have \widetilde{\sigma}_{t-1}^{2}(\widetilde{\mathbf{x}}_{i})\geq\sigma_{t-1}^{2}(\widetilde{\mathbf{x}}_{i})/\alpha. Finally, due to Appendix B we have \sigma_{t-1}^{2}(\widetilde{\mathbf{x}}_{i})\geq\sigma_{t}^{2}(\widetilde{\mathbf{x}}_{i}). Putting this all together, we can bound

\displaystyle\frac{1}{\widetilde{p}_{t,i}}\|\mathbf{\psi}_{t,i}\mathbf{\psi}_{t,i}^{\mathsf{\scriptscriptstyle T}}\|=\frac{\sigma_{t}^{2}(\widetilde{\mathbf{x}}_{i})}{\overline{q}\,\widetilde{\sigma}_{t-1}^{2}(\widetilde{\mathbf{x}}_{i})}\leq\frac{\alpha}{\overline{q}}:=R.

For the variance term, we expand

\displaystyle\sum_{i=1}^{t}\operatorname*{\mathbb{E}}_{q_{t,i}}\left[\mathbf{G}_{t,i}^{2}\;\middle|\;\mathcal{F}_{t-1}\right] \displaystyle=\sum_{i=1}^{t}\operatorname*{\mathbb{E}}_{q_{t,i}}\left[\left(\frac{{q}_{t,i}}{\widetilde{p}_{t,i}}-1\right)^{2}\;\middle|\;\mathcal{F}_{t-1}\right]\mathbf{\psi}_{t,i}\mathbf{\psi}_{t,i}^{\mathsf{\scriptscriptstyle T}}\mathbf{\psi}_{t,i}\mathbf{\psi}_{t,i}^{\mathsf{\scriptscriptstyle T}}
\displaystyle=\sum_{i=1}^{t}\left(\operatorname*{\mathbb{E}}_{q_{t,i}}\left[\frac{{q}_{t,i}^{2}}{\widetilde{p}_{t,i}^{2}}\;\middle|\;\mathcal{F}_{t-1}\right]-\operatorname*{\mathbb{E}}_{q_{t,i}}\left[2\frac{{q}_{t,i}}{\widetilde{p}_{t,i}}\;\middle|\;\mathcal{F}_{t-1}\right]+1\right)\mathbf{\psi}_{t,i}\mathbf{\psi}_{t,i}^{\mathsf{\scriptscriptstyle T}}\mathbf{\psi}_{t,i}\mathbf{\psi}_{t,i}^{\mathsf{\scriptscriptstyle T}}
\displaystyle=\sum_{i=1}^{t}\left(\operatorname*{\mathbb{E}}_{q_{t,i}}\left[\frac{{q}_{t,i}}{\widetilde{p}_{t,i}^{2}}\;\middle|\;\mathcal{F}_{t-1}\right]-1\right)\mathbf{\psi}_{t,i}\mathbf{\psi}_{t,i}^{\mathsf{\scriptscriptstyle T}}\mathbf{\psi}_{t,i}\mathbf{\psi}_{t,i}^{\mathsf{\scriptscriptstyle T}}
\displaystyle=\sum_{i=1}^{t}\left(\frac{1}{\widetilde{p}_{t,i}}-1\right)\mathbf{\psi}_{t,i}\mathbf{\psi}_{t,i}^{\mathsf{\scriptscriptstyle T}}\mathbf{\psi}_{t,i}\mathbf{\psi}_{t,i}^{\mathsf{\scriptscriptstyle T}}\preceq\sum_{i=1}^{t}\frac{1}{\widetilde{p}_{t,i}}\|\mathbf{\psi}_{t,i}\mathbf{\psi}_{t,i}^{\mathsf{\scriptscriptstyle T}}\|\mathbf{\psi}_{t,i}\mathbf{\psi}_{t,i}^{\mathsf{\scriptscriptstyle T}}\preceq\sum_{i=1}^{t}R\,\mathbf{\psi}_{t,i}\mathbf{\psi}_{t,i}^{\mathsf{\scriptscriptstyle T}},

where we used the fact that {q}_{t,i}^{2}={q}_{t,i} and \operatorname*{\mathbb{E}}_{q_{t,i}}[{q}_{t,i}\,|\,\mathcal{F}_{t-1}]=\widetilde{p}_{t,i}. We can now bound this quantity as

\displaystyle\left\|\sum_{i=1}^{t}\operatorname*{\mathbb{E}}_{q_{t,i}}\left[\mathbf{G}_{t,i}^{2}\;\middle|\;\mathcal{F}_{t-1}\right]\right\|\leq\left\|\sum_{i=1}^{t}R\,\mathbf{\psi}_{t,i}\mathbf{\psi}_{t,i}^{\mathsf{\scriptscriptstyle T}}\right\|=R\left\|\sum_{i=1}^{t}\mathbf{\psi}_{t,i}\mathbf{\psi}_{t,i}^{\mathsf{\scriptscriptstyle T}}\right\|=R\|\mathbf{\Psi}_{t}^{\mathsf{\scriptscriptstyle T}}\mathbf{\Psi}_{t}\|\leq R:=\sigma^{2}.

Therefore, we have \sigma^{2}\leq R with R=\alpha/\overline{q}, and applying Section C.3 and a union bound we conclude the proof.

C.4 Bounding \Pr(E_{t}\cap A_{t-1}^{\complement})

We will use the following concentration inequality for independent Bernoulli r.v. {proposition}[Calandriello et al., 2017b, App. D.4] Let \{q_{s}\}_{s=1}^{t} be independent Bernoulli random variables, each with success probability p_{s}, and denote the sum of the success probabilities as d=\sum_{s=1}^{t}p_{s}\geq 1. Then (this is a simple variant of the Chernoff bound where the Bernoulli random variables are not identically distributed),

\displaystyle\mathbb{P}\mathopen{}\mathclose{{}\left(\sum_{s=1}^{t}q_{s}\geq 3% d}\right)\leq\exp\{-3d(3d-(\log(3d)+1))\}\leq\exp\{-2d\}

We can also rigorously define the event E_{t} as the event where

\displaystyle\sum_{i=1}^{t}q_{t,i}\geq 3\alpha(1+\kappa^{2}/\lambda)\log(t/\delta)\sum_{i=1}^{t}\sigma_{t}^{2}(\widetilde{\mathbf{x}}_{i})=3\alpha(1+\kappa^{2}/\lambda)d_{\text{eff}}(\lambda,\widetilde{\mathbf{X}}_{t})\log(t/\delta).

Once again we use conditioning; in particular,

\displaystyle\Pr(E_{t}\cap A_{t-1}^{\complement}) \displaystyle=\operatorname*{\mathbb{E}}_{\mathcal{F}_{t-1}:\left\|\sum_{i=1}^{t}\mathbf{G}_{t-1,i}\right\|\leq\varepsilon}\left[\operatorname*{\mathbb{E}}_{\{q_{t,i}\}}\left[\mathbb{I}\left\{\sum_{i=1}^{t}q_{t,i}\geq 3\alpha(1+\kappa^{2}/\lambda)\log(t/\delta)\sum_{i=1}^{t}\sigma_{t}^{2}(\widetilde{\mathbf{x}}_{i})\right\}\;\middle|\;\mathcal{F}_{t-1}\right]\right].

Conditioned on \mathcal{F}_{t-1}, the r.v. q_{t,i} become independent Bernoulli with success probability \widetilde{p}_{t,i}=\overline{q}\widetilde{\sigma}_{t-1}^{2}(\widetilde{\mathbf{x}}_{i}). Since we restrict the outcomes to A_{t-1}^{\complement}, we can exploit Section C.2 and the guarantees of \varepsilon-accuracy to bound \widetilde{p}_{t,i}\leq\overline{q}\alpha\sigma_{t-1}^{2}(\widetilde{\mathbf{x}}_{i}). Then we can use Appendix B to bound \sigma_{t-1}^{2}(\widetilde{\mathbf{x}}_{i})\leq(1+\kappa^{2}/\lambda)\sigma_{t}^{2}(\widetilde{\mathbf{x}}_{i}). Therefore, the q_{t,i} are conditionally independent Bernoulli with success probability at most \overline{q}\alpha(1+\kappa^{2}/\lambda)\sigma_{t}^{2}(\widetilde{\mathbf{x}}_{i}). Applying a simple stochastic dominance argument and Section C.4, we obtain our result.

Appendix D Proof of Section 4

Following Abbasi-Yadkori et al. (2011), we divide the proof into two parts: bounding the approximate confidence ellipsoid, and bounding the regret.

D.1 Bounding the confidence ellipsoid

We begin by proving an intermediate result on the confidence ellipsoid. {theorem} Under the same assumptions as Section 4, with probability at least 1-\delta and for all t\geq 0, \mathbf{w}_{*} lies in the set

\displaystyle\widetilde{C}_{t} \displaystyle:=\left\{\mathbf{w}:\|\mathbf{w}-\widetilde{\mathbf{w}}_{t}\|_{\widetilde{\mathbf{A}}_{t}}\leq\widetilde{\beta}_{t}\right\}


\displaystyle\widetilde{\beta}_{t} \displaystyle:=2\xi\sqrt{\alpha\log(\kappa^{2}t)\left(\sum_{s=1}^{t}\widetilde{\sigma}_{t}^{2}(\mathbf{x}_{s})\right)+\log\left(\frac{1}{\delta}\right)}+\left(1+\frac{1}{\sqrt{1-\varepsilon}}\right)\sqrt{\lambda}F.

For simplicity, we now omit the subscript t. We begin by noticing that

\displaystyle(\widetilde{\mathbf{w}}-\mathbf{w}_{*})^{\mathsf{\scriptscriptstyle T}}\widetilde{\mathbf{A}}(\widetilde{\mathbf{w}}-\mathbf{w}_{*}) \displaystyle=(\widetilde{\mathbf{w}}-\mathbf{w}_{*})^{\mathsf{\scriptscriptstyle T}}\widetilde{\mathbf{A}}(\widetilde{\mathbf{A}}^{-1}\widetilde{\boldsymbol{\Phi}}^{\mathsf{\scriptscriptstyle T}}\mathbf{y}-\mathbf{w}_{*})
\displaystyle=(\widetilde{\mathbf{w}}-\mathbf{w}_{*})^{\mathsf{\scriptscriptstyle T}}\widetilde{\mathbf{A}}(\widetilde{\mathbf{A}}^{-1}\widetilde{\boldsymbol{\Phi}}^{\mathsf{\scriptscriptstyle T}}(\boldsymbol{\Phi}\mathbf{w}_{*}+\eta)-\mathbf{w}_{*})
\displaystyle=(\widetilde{\mathbf{w}}-\mathbf{w}_{*})^{\mathsf{\scriptscriptstyle T}}\widetilde{\mathbf{A}}(\underbrace{\widetilde{\mathbf{A}}^{-1}\widetilde{\boldsymbol{\Phi}}^{\mathsf{\scriptscriptstyle T}}\boldsymbol{\Phi}\mathbf{w}_{*}-\mathbf{w}_{*}}_{\text{bias}})+(\widetilde{\mathbf{w}}-\mathbf{w}_{*})^{\mathsf{\scriptscriptstyle T}}\widetilde{\mathbf{A}}^{1/2}\underbrace{\widetilde{\mathbf{A}}^{-1/2}\widetilde{\boldsymbol{\Phi}}^{\mathsf{\scriptscriptstyle T}}\eta}_{\text{variance}}.
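As a quick consistency check of this decomposition, the identity can be verified numerically in the exact (non-sketched) special case \widetilde{\boldsymbol{\Phi}}=\boldsymbol{\Phi}, \mathbf{P}=\mathbf{I}, where it reduces to the classical ridge-regression bias/variance split (the dimensions and noise level below are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
t, d, lam = 50, 4, 0.3
Phi = rng.standard_normal((t, d))          # rows are feature vectors phi_s^T
w_star = rng.standard_normal(d)
eta = 0.1 * rng.standard_normal(t)
y = Phi @ w_star + eta                     # noisy observations

A = Phi.T @ Phi + lam * np.eye(d)          # A = Phi^T Phi + lam I
w_hat = np.linalg.solve(A, Phi.T @ y)      # ridge estimate A^{-1} Phi^T y
delta = w_hat - w_star

lhs = delta @ A @ delta                              # ||w_hat - w_*||_A^2
bias = delta @ (Phi.T @ Phi @ w_star - A @ w_star)   # bias term; equals -lam * delta @ w_star
var = delta @ (Phi.T @ eta)                          # variance term
assert np.isclose(lhs, bias + var)
assert np.isclose(bias, -lam * (delta @ w_star))
```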

Bounding the bias. We first focus on the first term, which is made complicated by the mismatched product \widetilde{\boldsymbol{\Phi}}^{\mathsf{\scriptscriptstyle T}}\boldsymbol{\Phi}. We have

\displaystyle\widetilde{\mathbf{A}}(\widetilde{\mathbf{A}}^{-1}\widetilde{\boldsymbol{\Phi}}^{\mathsf{\scriptscriptstyle T}}\boldsymbol{\Phi}\mathbf{w}_{*}-\mathbf{w}_{*}) \displaystyle=\widetilde{\boldsymbol{\Phi}}^{\mathsf{\scriptscriptstyle T}}\boldsymbol{\Phi}\mathbf{w}_{*}-\widetilde{\boldsymbol{\Phi}}^{\mathsf{\scriptscriptstyle T}}\widetilde{\boldsymbol{\Phi}}\mathbf{w}_{*}-\lambda\mathbf{w}_{*}
\displaystyle=\widetilde{\boldsymbol{\Phi}}^{\mathsf{\scriptscriptstyle T}}\boldsymbol{\Phi}(\mathbf{I}-\mathbf{P})\mathbf{w}_{*}+\widetilde{\boldsymbol{\Phi}}^{\mathsf{\scriptscriptstyle T}}\boldsymbol{\Phi}\mathbf{P}\mathbf{w}_{*}-\widetilde{\boldsymbol{\Phi}}^{\mathsf{\scriptscriptstyle T}}\widetilde{\boldsymbol{\Phi}}\mathbf{w}_{*}-\lambda\mathbf{w}_{*}
\displaystyle=\widetilde{\boldsymbol{\Phi}}^{\mathsf{\scriptscriptstyle T}}\boldsymbol{\Phi}(\mathbf{I}-\mathbf{P})\mathbf{w}_{*}-\lambda\mathbf{w}_{*}


Therefore,

\displaystyle(\widetilde{\mathbf{w}}-\mathbf{w}_{*})^{\mathsf{\scriptscriptstyle T}}\widetilde{\mathbf{A}}(\widetilde{\mathbf{A}}^{-1}\widetilde{\boldsymbol{\Phi}}^{\mathsf{\scriptscriptstyle T}}\boldsymbol{\Phi}\mathbf{w}_{*}-\mathbf{w}_{*}) \displaystyle=(\widetilde{\mathbf{w}}-\mathbf{w}_{*})^{\mathsf{\scriptscriptstyle T}}\widetilde{\boldsymbol{\Phi}}^{\mathsf{\scriptscriptstyle T}}\boldsymbol{\Phi}(\mathbf{I}-\mathbf{P})\mathbf{w}_{*}-\lambda(\widetilde{\mathbf{w}}-\mathbf{w}_{*})^{\mathsf{\scriptscriptstyle T}}\mathbf{w}_{*}
\displaystyle\leq\|\widetilde{\mathbf{w}}-\mathbf{w}_{*}\|_{\widetilde{\mathbf{A}}}\left(\|\widetilde{\mathbf{A}}^{-1/2}\widetilde{\boldsymbol{\Phi}}^{\mathsf{\scriptscriptstyle T}}\boldsymbol{\Phi}(\mathbf{I}-\mathbf{P})\mathbf{w}_{*}\|+\lambda\|\mathbf{w}_{*}\|_{\widetilde{\mathbf{A}}^{-1}}\right)
\displaystyle\leq\|\widetilde{\mathbf{w}}-\mathbf{w}_{*}\|_{\widetilde{\mathbf{A}}}\left(\|\widetilde{\mathbf{A}}^{-1/2}\widetilde{\boldsymbol{\Phi}}^{\mathsf{\scriptscriptstyle T}}\boldsymbol{\Phi}(\mathbf{I}-\mathbf{P})\mathbf{w}_{*}\|+\sqrt{\lambda}\|\mathbf{w}_{*}\|\right),

where the last step uses \|\mathbf{w}_{*}\|_{\widetilde{\mathbf{A}}^{-1}}\leq\|\mathbf{w}_{*}\|/\sqrt{\lambda}, since \widetilde{\mathbf{A}}\succeq\lambda\mathbf{I}.

We have

\displaystyle\|\widetilde{\mathbf{A}}^{-1/2}\widetilde{\boldsymbol{\Phi}}^{\mathsf{\scriptscriptstyle T}}\boldsymbol{\Phi}(\mathbf{I}-\mathbf{P})\mathbf{w}_{*}\| \displaystyle\leq\|\widetilde{\mathbf{A}}^{-1/2}\widetilde{\boldsymbol{\Phi}}^{\mathsf{\scriptscriptstyle T}}\|\|\boldsymbol{\Phi}(\mathbf{I}-\mathbf{P})\|\|\mathbf{w}_{*}\|
\displaystyle\leq\sqrt{\lambda_{\max}(\widetilde{\boldsymbol{\Phi}}\widetilde{\mathbf{A}}^{-1}\widetilde{\boldsymbol{\Phi}}^{\mathsf{\scriptscriptstyle T}})}\sqrt{\lambda_{\max}(\boldsymbol{\Phi}(\mathbf{I}-\mathbf{P})^{2}\boldsymbol{\Phi}^{\mathsf{\scriptscriptstyle T}})}\|\mathbf{w}_{*}\|.

It is easy to see that

\displaystyle\lambda_{\max}(\widetilde{\boldsymbol{\Phi}}\widetilde{\mathbf{A}}^{-1}\widetilde{\boldsymbol{\Phi}}^{\mathsf{\scriptscriptstyle T}})=\lambda_{\max}(\widetilde{\boldsymbol{\Phi}}(\widetilde{\boldsymbol{\Phi}}^{\mathsf{\scriptscriptstyle T}}\widetilde{\boldsymbol{\Phi}}+\lambda\mathbf{I})^{-1}\widetilde{\boldsymbol{\Phi}}^{\mathsf{\scriptscriptstyle T}})\leq 1.

To bound the other term we use the following result from Calandriello and Rosasco (2018).

Proposition. If \mathcal{S}_{t} is \varepsilon-accurate w.r.t. \boldsymbol{\Phi}_{t}, then

\displaystyle\mathbf{I}-\mathbf{P}_{t}\preceq\mathbf{I}-\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}}{\mathbf{S}}_{t}({\mathbf{S}}_{t}^{\mathsf{\scriptscriptstyle T}}\boldsymbol{\Phi}_{t}\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}}{\mathbf{S}}_{t}+\lambda\mathbf{I})^{-1}{\mathbf{S}}_{t}^{\mathsf{\scriptscriptstyle T}}\boldsymbol{\Phi}_{t}\preceq\frac{\lambda}{1-\varepsilon}(\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}}\boldsymbol{\Phi}_{t}+\lambda\mathbf{I})^{-1}.

Since \mathcal{S}_{t} is \varepsilon-accurate (Section 3.2), the proposition above gives

\displaystyle\boldsymbol{\Phi}(\mathbf{I}-\mathbf{P})^{2}\boldsymbol{\Phi}^{\mathsf{\scriptscriptstyle T}}=\boldsymbol{\Phi}(\mathbf{I}-\mathbf{P})\boldsymbol{\Phi}^{\mathsf{\scriptscriptstyle T}}\preceq\frac{\lambda}{1-\varepsilon}\boldsymbol{\Phi}(\boldsymbol{\Phi}^{\mathsf{\scriptscriptstyle T}}\boldsymbol{\Phi}+\lambda\mathbf{I})^{-1}\boldsymbol{\Phi}^{\mathsf{\scriptscriptstyle T}}\preceq\frac{\lambda}{1-\varepsilon}\mathbf{I}.

Putting it all together

\displaystyle(\widetilde{\mathbf{w}}-\mathbf{w}_{*})^{\mathsf{\scriptscriptstyle T}}\widetilde{\mathbf{A}}(\widetilde{\mathbf{A}}^{-1}\widetilde{\boldsymbol{\Phi}}^{\mathsf{\scriptscriptstyle T}}\boldsymbol{\Phi}\mathbf{w}_{*}-\mathbf{w}_{*}) \displaystyle\leq\left(1+\frac{1}{\sqrt{1-\varepsilon}}\right)\|\widetilde{\mathbf{w}}-\mathbf{w}_{*}\|_{\widetilde{\mathbf{A}}}\sqrt{\lambda}\|\mathbf{w}_{*}\|.

Bounding the variance. We will need the following self-normalized martingale concentration inequality from Abbasi-Yadkori et al. (2011). It trivially extends to RKHSs in the case of finite sets such as our \mathcal{A}; readers interested in infinite sets can find a generalization with slightly worse constants in Chowdhury and Gopalan (2017).

Proposition (Abbasi-Yadkori et al., 2011). Let \{\mathcal{F}_{t}\}_{t=0}^{\infty} be a filtration. Let \{\eta_{t}\}_{t=1}^{\infty} be a real-valued stochastic process such that \eta_{t} is \mathcal{F}_{t}-measurable and zero-mean \xi-subgaussian. Let \{\boldsymbol{\Phi}_{t}\}_{t=1}^{\infty} be an \mathcal{H}-valued stochastic process such that \boldsymbol{\Phi}_{t} is \mathcal{F}_{t-1}-measurable. Let \mathbf{I} be the identity operator on \mathcal{H}. For any t\geq 1, define

\displaystyle\mathbf{A}_{t}=\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}}\boldsymbol{\Phi}_{t}+\lambda\mathbf{I},\qquad\mathbf{V}_{t}=\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}}\eta_{t}.

Then, for any \delta>0, with probability at least 1-\delta, for all t\geq 0,

\displaystyle\|\mathbf{V}_{t}\|_{\mathbf{A}_{t}^{-1}}^{2}\leq 2\xi^{2}\log\left(\det(\mathbf{A}_{t}/\lambda)/\delta\right).

Remembering the definition of \alpha\geq 1 from Section 3.2, we can reformulate

\displaystyle(\widetilde{\mathbf{w}}-\mathbf{w}_{*})^{\mathsf{\scriptscriptstyle T}}\widetilde{\mathbf{A}}^{1/2}\widetilde{\mathbf{A}}^{-1/2}\widetilde{\boldsymbol{\Phi}}^{\mathsf{\scriptscriptstyle T}}\eta \displaystyle\leq\|\widetilde{\mathbf{w}}-\mathbf{w}_{*}\|_{\widetilde{\mathbf{A}}}\|\widetilde{\boldsymbol{\Phi}}^{\mathsf{\scriptscriptstyle T}}\eta\|_{\widetilde{\mathbf{A}}^{-1}}
\displaystyle=\|\widetilde{\mathbf{w}}-\mathbf{w}_{*}\|_{\widetilde{\mathbf{A}}}\|\widetilde{\boldsymbol{\Phi}}^{\mathsf{\scriptscriptstyle T}}\eta\|_{(\widetilde{\boldsymbol{\Phi}}^{\mathsf{\scriptscriptstyle T}}\widetilde{\boldsymbol{\Phi}}+\lambda\mathbf{I})^{-1}}.

We now need a remark that temporarily reintroduces the time subscripts. Note that we cannot directly apply the proposition above to \widetilde{\boldsymbol{\Phi}}_{t}^{\mathsf{\scriptscriptstyle T}}\eta_{t}=\mathbf{P}_{t}\boldsymbol{\Phi}_{t}^{\mathsf{\scriptscriptstyle T}}\eta_{t}. In particular, for s<t the term \widetilde{\boldsymbol{\Phi}}_{s}^{\mathsf{\scriptscriptstyle T}}\eta_{s}=\mathbf{P}_{t}\boldsymbol{\Phi}_{s}^{\mathsf{\scriptscriptstyle T}}\eta_{s} is not \mathcal{F}_{s-1}-measurable, since \mathbf{P}_{t} depends on all randomness up to time t. However, since \mathbf{P}_{t} is always a projection matrix, the variance of the projected process is bounded by the variance of the original process:

\displaystyle\|\widetilde{\boldsymbol{\Phi}}^{\mathsf{\scriptscriptstyle T}}\eta\|_{(\widetilde{\boldsymbol{\Phi}}^{\mathsf{\scriptscriptstyle T}}\widetilde{\boldsymbol{\Phi}}+\lambda\mathbf{I})^{-1}} \displaystyle=\sqrt{\eta^{\mathsf{\scriptscriptstyle T}}\widetilde{\boldsymbol{\Phi}}(\widetilde{\boldsymbol{\Phi}}^{\mathsf{\scriptscriptstyle T}}\widetilde{\boldsymbol{\Phi}}+\lambda\mathbf{I})^{-1}\widetilde{\boldsymbol{\Phi}}^{\mathsf{\scriptscriptstyle T}}\eta}=\sqrt{\eta^{\mathsf{\scriptscriptstyle T}}\widetilde{\boldsymbol{\Phi}}\widetilde{\boldsymbol{\Phi}}^{\mathsf{\scriptscriptstyle T}}(\widetilde{\boldsymbol{\Phi}}\widetilde{\boldsymbol{\Phi}}^{\mathsf{\scriptscriptstyle T}}+\lambda\mathbf{I})^{-1}\eta}
\displaystyle\stackrel{(a)}{=}\sqrt{\eta^{\mathsf{\scriptscriptstyle T}}(\mathbf{I}-\lambda(\widetilde{\boldsymbol{\Phi}}\widetilde{\boldsymbol{\Phi}}^{\mathsf{\scriptscriptstyle T}}+\lambda\mathbf{I})^{-1})\eta}=\sqrt{\eta^{\mathsf{\scriptscriptstyle T}}(\mathbf{I}-\lambda(\boldsymbol{\Phi}\mathbf{P}\boldsymbol{\Phi}^{\mathsf{\scriptscriptstyle T}}+\lambda\mathbf{I})^{-1})\eta}
\displaystyle\stackrel{(b)}{\leq}\sqrt{\eta^{\mathsf{\scriptscriptstyle T}}(\mathbf{I}-\lambda(\boldsymbol{\Phi}\boldsymbol{\Phi}^{\mathsf{\scriptscriptstyle T}}+\lambda\mathbf{I})^{-1})\eta}\stackrel{(c)}{=}\|\boldsymbol{\Phi}^{\mathsf{\scriptscriptstyle T}}\eta\|_{(\boldsymbol{\Phi}^{\mathsf{\scriptscriptstyle T}}\boldsymbol{\Phi}+\lambda\mathbf{I})^{-1}},

where in (a) we added and subtracted \lambda\mathbf{I} from \widetilde{\boldsymbol{\Phi}}\widetilde{\boldsymbol{\Phi}}^{\mathsf{\scriptscriptstyle T}}, in (b) we used the fact that \|\mathbf{P}\|\leq 1 for any projection matrix, so that \boldsymbol{\Phi}\mathbf{P}\boldsymbol{\Phi}^{\mathsf{\scriptscriptstyle T}}\preceq\boldsymbol{\Phi}\boldsymbol{\Phi}^{\mathsf{\scriptscriptstyle T}}, and in (c) we reversed the reformulation from (a). We can finally apply the proposition above:
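Steps (a) and (b) can be checked numerically (an illustrative sketch only, with random Gaussian features standing in for \boldsymbol{\Phi} and a random orthogonal projection standing in for \mathbf{P}):

```python
import numpy as np

# (a) B(B + lam I)^{-1} = I - lam (B + lam I)^{-1} for any PSD B, and
# (b) replacing Phi P Phi^T by Phi Phi^T (P an orthogonal projection, P <= I)
#     can only increase the quadratic form eta^T (I - lam (B + lam I)^{-1}) eta.
rng = np.random.default_rng(1)
t, d, lam = 30, 8, 0.7                      # arbitrary sizes for the check
Phi = rng.standard_normal((t, d))
eta = rng.standard_normal(t)

Q, _ = np.linalg.qr(rng.standard_normal((d, 3)))
P = Q @ Q.T                                  # orthogonal projection in R^d

def inner(B):
    return eta @ (np.eye(t) - lam * np.linalg.inv(B + lam * np.eye(t))) @ eta

B_proj = Phi @ P @ Phi.T
B_full = Phi @ Phi.T
# (a): the two expressions agree exactly
lhs = B_full @ np.linalg.inv(B_full + lam * np.eye(t))
rhs = np.eye(t) - lam * np.linalg.inv(B_full + lam * np.eye(t))
assert np.allclose(lhs, rhs)
# (b): projecting the features can only shrink the quadratic form
assert inner(B_proj) <= inner(B_full) + 1e-8
```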

\displaystyle\|\boldsymbol{\Phi}^{\mathsf{\scriptscriptstyle T}}\eta\|_{(\boldsymbol{\Phi}^{\mathsf{\scriptscriptstyle T}}\boldsymbol{\Phi}+\lambda\mathbf{I})^{-1}} \displaystyle\leq\sqrt{2\xi^{2}\log\left(\det(\boldsymbol{\Phi}^{\mathsf{\scriptscriptstyle T}}\boldsymbol{\Phi}/\lambda+\mathbf{I})/\delta\right)}
\displaystyle=\sqrt{2\xi^{2}\log\left(\det(\mathbf{A}/\lambda)/\delta\right)}.

While this is now a valid bound on the radius of the confidence interval, it is not yet satisfactory. In particular, we can use Sylvester's determinant identity to reformulate

\displaystyle\operatorname*{log\,det}(\mathbf{A}/\lambda)=\operatorname*{log\,det}(\boldsymbol{\Phi}^{\mathsf{\scriptscriptstyle T}}\boldsymbol{\Phi}/\lambda+\mathbf{I})=\operatorname*{log\,det}(\boldsymbol{\Phi}\boldsymbol{\Phi}^{\mathsf{\scriptscriptstyle T}}/\lambda+\mathbf{I})=\operatorname*{log\,det}(\mathbf{K}/\lambda+\mathbf{I}).
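Sylvester's identity is easy to verify numerically (a sketch with arbitrary random features; sizes and \lambda are chosen only for the check):

```python
import numpy as np

# log det(Phi^T Phi / lam + I_d) = log det(Phi Phi^T / lam + I_t):
# the two matrices share the same nonzero eigenvalues, and the extra
# eigenvalues on either side are all equal to 1.
rng = np.random.default_rng(2)
t, d, lam = 40, 6, 2.0                       # arbitrary sizes for the check
Phi = rng.standard_normal((t, d))
ld_small = np.linalg.slogdet(Phi.T @ Phi / lam + np.eye(d))[1]
ld_big = np.linalg.slogdet(Phi @ Phi.T / lam + np.eye(t))[1]
assert abs(ld_small - ld_big) < 1e-8
```

Note that the left-hand side only requires a d-by-d determinant, which is exactly why the identity is useful here.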

Computing the radius this way would require constructing the matrix \mathbf{K}\in\mathbb{R}^{t\times t}, which is too expensive. Instead, we obtain a cheap but still small upper bound as follows:

\displaystyle\operatorname*{log\,det}(\mathbf{K}_{t}/\lambda+\mathbf{I}) \displaystyle\leq\operatorname*{Tr}(\mathbf{K}_{t}(\mathbf{K}_{t}+\lambda\mathbf{I})^{-1})(1+\log(\|\mathbf{K}_{t}\|+1))
\displaystyle\leq\operatorname*{Tr}(\mathbf{K}_{t}(\mathbf{K}_{t}+\lambda\mathbf{I})^{-1})(1+\log(\operatorname*{Tr}(\mathbf{K}_{t})+1))
\displaystyle\leq\operatorname*{Tr}(\mathbf{K}_{t}(\mathbf{K}_{t}+\lambda\mathbf{I})^{-1})(1+\log(\kappa^{2}t+1))
\displaystyle=(1+\log(\kappa^{2}t+1))\sum_{s=1}^{t}\sigma_{t}^{2}(\mathbf{x}_{s})
\displaystyle\leq\alpha(1+\log(\kappa^{2}t+1))\sum_{s=1}^{t}\widetilde{\sigma}_{t}^{2}(\mathbf{x}_{s})
\displaystyle\leq 2\alpha\log(\kappa^{2}t)\sum_{s=1}^{t}\widetilde{\sigma}_{t}^{2}(\mathbf{x}_{s}),

where \widetilde{\sigma}_{t}^{2}(\mathbf{x}_{s}) can be computed efficiently, and is already computed by the algorithm at every step. Putting it all together,

\displaystyle\|\widetilde{\mathbf{w}}-\mathbf{w}_{*}\|_{\widetilde{\mathbf{A}}} \displaystyle\leq 2\xi\sqrt{\alpha\log(\kappa^{2}t)\left(\sum_{s=1}^{t}\widetilde{\sigma}_{t}^{2}(\mathbf{x}_{s})\right)+\log(1/\delta)}+\left(1+\frac{1}{\sqrt{1-\varepsilon}}\right)\sqrt{\lambda}\|\mathbf{w}_{*}\|
\displaystyle\leq 2\xi\sqrt{\alpha\log(\kappa^{2}t)\left(\sum_{s=1}^{t}\widetilde{\sigma}_{t}^{2}(\mathbf{x}_{s})\right)+\log(1/\delta)}+\left(1+\frac{1}{\sqrt{1-\varepsilon}}\right)\sqrt{\lambda}F:=\widetilde{\beta}_{t}.
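The first log-det inequality used above can also be checked numerically, under the assumption \lambda\geq 1 (a sketch only; the random PSD matrix below stands in for the kernel matrix \mathbf{K}_{t}):

```python
import numpy as np

# Check: log det(K/lam + I) <= Tr(K (K + lam I)^{-1}) * (1 + log(||K|| + 1)),
# assuming lam >= 1. Per eigenvalue mu of K this follows from
# log(1 + x) <= x applied to x = mu/lam.
rng = np.random.default_rng(3)
t, lam = 25, 1.5                             # arbitrary size; lam >= 1 assumed
X = rng.standard_normal((t, 5))
K = X @ X.T                                  # a random PSD "kernel" matrix
ld = np.linalg.slogdet(K / lam + np.eye(t))[1]
deff = np.trace(K @ np.linalg.inv(K + lam * np.eye(t)))
bound = deff * (1 + np.log(np.linalg.norm(K, 2) + 1))
assert ld <= bound + 1e-8
```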

D.2 Bounding the regret

The regret analysis is straightforward. Assume that \mathbf{w}_{*}\in\widetilde{C}_{t} is satisfied (i.e., the event from Section D.1 holds), and recall that by definition \boldsymbol{\phi}_{t}=\operatorname*{arg\,max}_{\mathbf{x}_{i}\in\mathcal{A}}\max_{\mathbf{w}\in\widetilde{C}_{t}}\boldsymbol{\phi}_{i}^{\mathsf{\scriptscriptstyle T}}\mathbf{w}. We can also define \overline{\mathbf{w}}_{t,i}=\operatorname*{arg\,max}_{\mathbf{w}\in\widetilde{C}_{t}}\boldsymbol{\phi}_{i}^{\mathsf{\scriptscriptstyle T}}\mathbf{w} as the auxiliary vector which encodes the optimistic behaviour of the algorithm. With a slight abuse of notation, we also use * as a subscript to indicate the (unknown) index of the optimal arm, so that \overline{\mathbf{w}}_{t,*}=\operatorname*{arg\,max}_{\mathbf{w}\in\widetilde{C}_{t}}\boldsymbol{\phi}_{*}^{\mathsf{\scriptscriptstyle T}}\mathbf{w}. Since \mathbf{w}_{*}\in\widetilde{C}_{t},

\displaystyle\boldsymbol{\phi}_{t}^{\mathsf{\scriptscriptstyle T}}\overline{\mathbf{w}}_{t,t}\geq\boldsymbol{\phi}_{*}^{\mathsf{\scriptscriptstyle T}}\overline{\mathbf{w}}_{t,*}\geq\boldsymbol{\phi}_{*}^{\mathsf{\scriptscriptstyle T}}\mathbf{w}_{*}.

We can now bound the instantaneous regret r_{t} as

\displaystyle r_{t} \displaystyle=\boldsymbol{\phi}_{*}^{\mathsf{\scriptscriptstyle T}}\mathbf{w}_{*}-\boldsymbol{\phi}_{t}^{\mathsf{\scriptscriptstyle T}}\mathbf{w}_{*}\leq\boldsymbol{\phi}_{t}^{\mathsf{\scriptscriptstyle T}}\overline{\mathbf{w}}_{t,t}-\boldsymbol{\phi}_{t}^{\mathsf{\scriptscriptstyle T}}\mathbf{w}_{*}
\displaystyle=\boldsymbol{\phi}_{t}^{\mathsf{\scriptscriptstyle T}}(\overline{\mathbf{w}}_{t,t}-\widehat{\mathbf{w}}_{t})+\boldsymbol{\phi}_{t}^{\mathsf{\scriptscriptstyle T}}(\widehat{\mathbf{w}}_{t}-\mathbf{w}_{*})
\displaystyle=\boldsymbol{\phi}_{t}^{\mathsf{\scriptscriptstyle T}}\widetilde{\mathbf{A}}_{t}^{-1/2}\widetilde{\mathbf{A}}_{t}^{1/2}(\overline{\mathbf{w}}_{t,t}-\widehat{\mathbf{w}}_{t})+\boldsymbol{\phi}_{t}^{\mathsf{\scriptscriptstyle T}}\widetilde{\mathbf{A}}_{t}^{-1/2}\widetilde{\mathbf{A}}_{t}^{1/2}(\widehat{\mathbf{w}}_{t}-\mathbf{w}_{*})
\displaystyle\leq\sqrt{\boldsymbol{\phi}_{t}^{\mathsf{\scriptscriptstyle T}}\widetilde{\mathbf{A}}_{t}^{-1}\boldsymbol{\phi}_{t}}\left(\|\overline{\mathbf{w}}_{t,t}-\widehat{\mathbf{w}}_{t}\|_{\widetilde{\mathbf{A}}_{t}}+\|\widehat{\mathbf{w}}_{t}-\mathbf{w}_{*}\|_{\widetilde{\mathbf{A}}_{t}}\right)
\displaystyle\leq 2\widetilde{\beta}_{t}\sqrt{\boldsymbol{\phi}_{t}^{\mathsf{\scriptscriptstyle T}}\widetilde{\mathbf{A}}_{t}^{-1}\boldsymbol{\phi}_{t}}.

Summing over t and upper bounding \widetilde{\beta}_{t} by \widetilde{\beta}_{T},

\displaystyle R_{T} \displaystyle\leq 2\widetilde{\beta}_{T}\sum_{t=1}^{T}\sqrt{\boldsymbol{\phi}_{t}^{\mathsf{\scriptscriptstyle T}}\widetilde{\mathbf{A}}_{t}^{-1}\boldsymbol{\phi}_{t}}\leq 2\widetilde{\beta}_{T}\sqrt{T}\sqrt{\sum_{t=1}^{T}\boldsymbol{\phi}_{t}^{\mathsf{\scriptscriptstyle T}}\widetilde{\mathbf{A}}_{t}^{-1}\boldsymbol{\phi}_{t}}\leq 2\widetilde{\beta}_{T}\sqrt{T}\sqrt{\alpha\sum_{t=1}^{T}\boldsymbol{\phi}_{t}^{\mathsf{\scriptscriptstyle T}}\mathbf{A}_{t}^{-1}\boldsymbol{\phi}_{t}}.
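The second inequality above is Cauchy-Schwarz, which can be verified numerically (illustrative only; the random values below stand in for the terms \boldsymbol{\phi}_{t}^{\mathsf{\scriptscriptstyle T}}\widetilde{\mathbf{A}}_{t}^{-1}\boldsymbol{\phi}_{t}):

```python
import numpy as np

# Cauchy-Schwarz: sum_t sqrt(a_t) <= sqrt(T) * sqrt(sum_t a_t) for a_t >= 0.
rng = np.random.default_rng(4)
a = rng.uniform(0, 3, size=100)              # nonnegative stand-in values
lhs = np.sqrt(a).sum()
rhs = np.sqrt(len(a)) * np.sqrt(a.sum())
assert lhs <= rhs + 1e-9
```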

We can now use once again Appendix B to obtain

\displaystyle R_{T} \displaystyle\leq 2\widetilde{\beta}_{T}\sqrt{\alpha T\sum_{t=1}^{T}\boldsymbol{\phi}_{t}^{\mathsf{\scriptscriptstyle T}}\mathbf{A}_{t}^{-1}\boldsymbol{\phi}_{t}}=2\widetilde{\beta}_{T}\sqrt{\alpha T\sum_{t=1}^{T}\sigma_{t}^{2}(\mathbf{x}_{t})}\leq 2\widetilde{\beta}_{T}\sqrt{2\alpha Td_{\text{eff}}(\lambda,\widetilde{\mathbf{X}}_{T})\log(\kappa^{2}T)}.

We can also further upper bound \widetilde{\beta}_{T} as

\displaystyle\widetilde{\beta}_{T} \displaystyle=2\xi\sqrt{\alpha\log(\kappa^{2}T)\left(\sum_{s=1}^{T}\widetilde{\sigma}_{T}^{2}(\mathbf{x}_{s})\right)+\log(1/\delta)}+\left(1+\frac{1}{\sqrt{1-\varepsilon}}\right)\sqrt{\lambda}F
\displaystyle\leq 2\xi\sqrt{\alpha^{2}\log(\kappa^{2}T)\left(\sum_{s=1}^{T}\sigma_{T}^{2}(\mathbf{x}_{s})\right)+\log(1/\delta)}+\left(1+\frac{1}{\sqrt{1-\varepsilon}}\right)\sqrt{\lambda}F
\displaystyle\leq 2\xi\alpha\sqrt{d_{\text{eff}}(\lambda,\widetilde{\mathbf{X}}_{T})\log(\kappa^{2}T)+\log(1/\delta)}+\left(1+\frac{1}{\sqrt{1-\varepsilon}}\right)\sqrt{\lambda}F.

Putting it together we obtain

\displaystyle R_{T}\leq \displaystyle 2\left(2\xi\alpha\sqrt{d_{\text{eff}}(\lambda,\widetilde{\mathbf{X}}_{T})\log(\kappa^{2}T)+\log(1/\delta)}\right)\sqrt{2\alpha Td_{\text{eff}}(\lambda,\widetilde{\mathbf{X}}_{T})\log(\kappa^{2}T)}
\displaystyle+2\left(\left(1+\frac{1}{\sqrt{1-\varepsilon}}\right)\sqrt{\lambda}F\right)\sqrt{2\alpha Td_{\text{eff}}(\lambda,\widetilde{\mathbf{X}}_{T})\log(\kappa^{2}T)}
\displaystyle\leq \displaystyle 2\xi(2\alpha)^{3/2}\sqrt{T}\left(d_{\text{eff}}(\lambda,\widetilde{\mathbf{X}}_{T})\log(\kappa^{2}T)+\log(1/\delta)\right)
\displaystyle+2\left(2\sqrt{\alpha}\sqrt{\lambda}F\right)\sqrt{2\alpha Td_{\text{eff}}(\lambda,\widetilde{\mathbf{X}}_{T})\log(\kappa^{2}T)}
\displaystyle\leq \displaystyle 2(2\alpha)^{3/2}\left(\sqrt{T}\xi d_{\text{eff}}(\lambda,\widetilde{\mathbf{X}}_{T})\log(\kappa^{2}T)+\sqrt{T}\xi\log(1/\delta)+\sqrt{T\lambda F^{2}d_{\text{eff}}(\lambda,\widetilde{\mathbf{X}}_{T})\log(\kappa^{2}T)}\right).