
## Abstract

Robustness to distributional shift is one of the key challenges of contemporary machine learning. Attaining such robustness is the goal of distributionally robust optimization, which seeks a solution to an optimization problem that is worst-case robust under a specified distributional shift of an uncontrolled covariate. In this paper, we study such a problem when the distributional shift is measured via the maximum mean discrepancy (MMD). For the setting of zeroth-order, noisy optimization, we present a novel distributionally robust Bayesian optimization algorithm (DRBO). Our algorithm provably obtains sub-linear robust regret in various settings that differ in how the uncertain covariate is observed. We demonstrate the robust performance of our method on both synthetic and real-world benchmarks.

## 1 Introduction

Bayesian optimization (BO) is a framework for model-based sequential optimization of black-box functions that are expensive to evaluate and for which noisy point evaluations are available. Bayesian optimization algorithms have been successfully applied in a wide range of applications where the goal is to discover best-performing designs from a small number of trials, e.g., in vaccine and molecular design, gene optimization, automatic machine learning, robotics and control tasks, and many more.

In many practical tasks, the objective also depends on contextual covariates of the environment. If this context follows a known distribution, the setting is essentially that of stochastic optimization with the objective to maximize the expected pay-off. Often, however, there exists a distributional mismatch between the covariate distribution that the learner assumes, and the true distribution of the environment. Examples include automated machine learning, where hyperparameters are tuned on training data while the test distribution can differ; recommender systems, where the distribution of the users shifts with time; and robotics, where the simulated environmental variables are only an approximation of the real physical world. In particular, whenever there is a distributional mismatch between the true and the data distribution used at training time, the optimization solutions can result in inferior performance or even lead to unsafe/unreliable execution. The problem of distributional data shift has been recently identified as one of the most prevalent concrete challenges of modern AI safety (Amodei et al., 2016). While the connection of robust optimization (RO) and Bayesian optimization has recently been established by Bogunovic et al. (2018), robustness to distributional data shift remains unexplored in this field.

In this paper, we introduce the setting of distributionally robust Bayesian optimization (DRBO): The goal is to track the optimal input that maximizes the expected function value under the worst-case distribution of an external, contextual parameter. In distributionally robust optimization (DRO), such a worst-case distribution belongs to a known uncertainty set of distributions that is typically chosen as a ball centered around a given reference distribution. To measure the distance between distributions, in this work, we focus on the kernel-based maximum mean discrepancy (MMD) distance. This metric fits well with the kernel-based regularity assumptions on the unknown function that are typically made in Bayesian optimization.

### 1.1 Related Work

A large number of Bayesian optimization algorithms have been developed over the years (e.g., Srinivas et al., 2010; Wang and Jegelka, 2017; Hennig and Schuler, 2012; Chowdhury and Gopalan, 2017; Bogunovic et al., 2016b). Several practical variants of the standard setting have been addressed recently, including contextual (Krause and Ong, 2011; Valko et al., 2013; Lamprier et al., 2018; Kirschner and Krause, 2019) and time-varying (Bogunovic et al., 2016a) BO, high-dimensional BO (Djolonga et al., 2013; Kandasamy et al., 2015; Kirschner et al., 2019), BO with constraints (Gardner et al., 2014; Gelbart et al., 2014), heteroscedastic noise (Kirschner and Krause, 2018), and uncertain inputs (Oliveira et al., 2019).

Two classical objectives for optimization under uncertainty are stochastic optimization (SO) (Srinivas et al., 2010; Krause and Ong, 2011; Lamprier et al., 2018; Oliveira et al., 2019; Kirschner and Krause, 2019) and robust optimization (RO) (Bogunovic et al., 2018); see Table 1. SO asks for a solution that performs well in expectation over an uncontrolled, stochastic covariate. Here, the assumption is that the distribution of the contextual parameter is known, or that i.i.d. samples are provided. Some variants of SO have been considered in the related contextual Bayesian optimization works (Krause and Ong, 2011; Valko et al., 2013; Kirschner and Krause, 2019). RO aims at a solution that is robust with respect to the worst possible realization of the context parameter. The RO objective has recently been studied in Bayesian optimization by Bogunovic et al. (2018), who provide a robust BO algorithm and obtain strong regret guarantees. In many practical scenarios, however, the solution to the SO problem might be highly non-robust, while the worst-case RO solution might be overly pessimistic. This motivates us to consider distributionally robust optimization (DRO), which is a “middle ground” between SO and RO.

Distributionally robust optimization (DRO) dates back to the seminal work of Scarf (1957), and since then it has become an important topic in robust optimization (e.g., Bertsimas et al., 2018; Goh and Sim, 2010). It has recently received significant attention in machine learning, in particular due to its relation to regularization, adversarial learning, and generalization (Staib et al., 2018). The full literature on DRO is too vast to be adequately covered here, so we refer the interested reader to the recent review by Rahimian and Mehrotra (2019) and references within. For defining the uncertainty sets of distributions, different DRO works have studied $\phi$-divergences (Ben-Tal et al., 2013; Namkoong and Duchi, 2017), Wasserstein (Gao et al., 2017; Esfahani and Kuhn, 2018; Sinha et al., 2017) and MMD (Staib and Jegelka, 2019) distances. In this work, we focus on the kernel-based MMD distance, but unlike previous DRO works, we assume that the objective function is unknown, and only noisy point evaluations are available.

We conclude this section by mentioning other robust aspects and settings that have been previously considered in Bayesian optimization. BO with outliers has been considered by Martinez-Cantin et al. (2017), while the setting in which sampled points are subject to uncertainty has been studied by Nogueira et al. (2016); Beland and Nair (2017); Oliveira et al. (2019). These settings differ significantly from the one considered in this paper and they do not consider robustness under distributional shift. Finally, we note that another robust BO algorithm has been recently developed for playing unknown repeated games against non-cooperative agents (Sessa et al., 2019).

While this work was under submission, a related approach for distributionally robust Bayesian quadrature appeared online (Nguyen et al., 2020). The authors propose an approach based on Thompson sampling to solve a related robust objective for Bayesian quadrature. Our work captures this scenario in the “simulator setting”, detailed below. The main difference in the analysis is that we bound the worst-case frequentist regret as opposed to the expected Bayesian regret.

#### Contributions

We propose a novel distributionally robust Bayesian optimization (DRBO) algorithm. Our analysis shows that DRBO achieves sublinear robust regret in several variants of the setting. Finally, we demonstrate the robust performance of DRBO on synthetic and real-world benchmarks.

## 2 Problem Statement

Let $f : \mathcal{X} \times \mathcal{C} \to \mathbb{R}$ be an unknown reward function defined over a parameter space $\mathcal{X} \times \mathcal{C}$ with finite action and context sets, $\mathcal{X}$ and $\mathcal{C}$. The objective is to optimize $f$ from sequential and noisy point evaluations. In our main setup, at each time step $t$, the learner chooses $x_t \in \mathcal{X}$, whereas the environment provides the context $c_t \in \mathcal{C}$ together with the noisy function observation $y_t = f(x_t, c_t) + \xi_t$, where $\xi_t \sim \mathcal{N}(0, \sigma^2)$ with known $\sigma$ and independence between time steps. More generally, our results hold if the noise is $\sigma$-sub-Gaussian, which allows for non-Gaussian likelihoods (e.g., bounded noise). Further, we assume that $c_t$ is sampled independently from an unknown, time-dependent distribution $P_t^*$.

Optimization objective. We consider the distributionally robust optimization (DRO) (Scarf, 1957) objective, which asks to perform well simultaneously for a range of problems, each determined by a distribution in some uncertainty set. This is in contrast to SO, where we seek good performance against a single problem instance parametrized by a given distribution.

In DRO, the objective is to find that solves

$$\max_{x \in \mathcal{X}} \inf_{Q \in \mathcal{U}_t} \mathbb{E}_{c \sim Q}[f(x,c)]. \tag{1}$$

Here, $\mathcal{U}_t$ is a known uncertainty set of distributions over $\mathcal{C}$ that can depend on the step $t$ and contains the true distribution $P_t^*$. Typically, $\mathcal{U}_t$ is chosen as a ball of radius (or margin) $\epsilon_t$, centered around a given reference distribution $P_t$ on $\mathcal{C}$, i.e.,

$$\mathcal{U}_t = \{Q : d(Q, P_t) \le \epsilon_t\},$$

where $d$ measures the discrepancy between two distributions. A possible choice for the reference distribution $P_t$ is the empirical sample distribution $\hat{P}_t$, which is an instance of data-driven DRO (Bertsimas et al., 2018). Depending on the underlying function $f$ and the uncertainty set $\mathcal{U}_t$, the robust solution can significantly differ from the solution to the stochastic objective for a fixed (and typically known) distribution $P$. We illustrate such a case in Fig. 1.

Hence, at time step $t$, the learner receives a reference distribution $P_t$ and margin $\epsilon_t$. Our objective is to choose a sequence of actions $x_1, \dots, x_T$ that minimizes the robust cumulative regret:

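$$R_T = \sum_{t=1}^{T} \Big( \inf_{Q \in \mathcal{U}_t} \mathbb{E}_{Q}[f(x_t^*, c)] - \inf_{Q \in \mathcal{U}_t} \mathbb{E}_{Q}[f(x_t, c)] \Big), \tag{2}$$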

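where $x_t^* = \arg\max_{x \in \mathcal{X}} \inf_{Q \in \mathcal{U}_t} \mathbb{E}_{c \sim Q}[f(x,c)]$. The robust regret measures the cumulative loss of the learner on the chosen sequence of actions w.r.t. the worst-case distribution over $\mathcal{C}$.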

#### RKHS Regression.

The main regularity assumption of Bayesian optimization is that $f$ belongs to a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ with known kernel $k$. We denote the Hilbert norm by $\|\cdot\|_{\mathcal{H}}$ and assume $\|f\|_{\mathcal{H}} \le B$ for some known $B$. From the observed data $\{(x_i, c_i, y_i)\}_{i=1}^{t-1}$, we can compute a kernel ridge regression estimate with

$$\hat{f}_t = \arg\min_{g \in \mathcal{H}} \sum_{i=1}^{t-1} \big(g(x_i, c_i) - y_i\big)^2 + \|g\|_{\mathcal{H}}^2 . \tag{3}$$

The representer theorem provides the standard, closed-form solution for the least-squares estimate (Rasmussen and Williams, 2006). The next lemma is a standard result by Srinivas et al. (2010); Abbasi-Yadkori (2013). It provides a frequentist confidence interval of the form that contains the true function values with high probability. The exact definitions of and can be found in Appendix A; we just note here that and are the posterior mean and posterior variance functions of the corresponding Bayesian Gaussian process model (Rasmussen and Williams, 2006). We denote the data kernel matrix by , and assume that .

###### Lemma 1.

With probability at least , for any , at any time ,

$$|\hat{f}_t(x,c) - f(x,c)| \le \beta_t \sigma_t(x,c)$$

with .

We explicitly define the upper and lower confidence bounds for every and as follows:

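$$\begin{aligned}
\mathrm{ucb}_t(x,c) &:= \hat{f}_t(x,c) + \beta_t \sigma_t(x,c), \\
\mathrm{lcb}_t(x,c) &:= \hat{f}_t(x,c) - \beta_t \sigma_t(x,c).
\end{aligned}$$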

For a fixed , we use and to refer to the corresponding vectors in .
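The confidence bounds above can be computed numerically from the closed-form expressions in Appendix A. The following is a minimal sketch, assuming an RBF kernel on the joint input $(x,c)$ and unit regularization (the matrix $K_t + \mathbf{1}_t$); the helper names `rbf_kernel` and `confidence_bounds`, the lengthscale, and the value passed for `beta` are illustrative assumptions and not part of the original method description.

```python
import numpy as np


def rbf_kernel(A, B, lengthscale=1.0):
    """RBF kernel matrix between the rows of A (m x d) and B (k x d)."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-0.5 * np.maximum(sq_dists, 0.0) / lengthscale**2)


def confidence_bounds(Z_train, y_train, Z_query, beta, lengthscale=1.0):
    """Return (mean, sigma, ucb, lcb) at the query points.

    Each row of Z_train / Z_query is a joint input (x, c).
    Assumption: unit regularization, i.e. the matrix K_t + I is inverted.
    """
    K = rbf_kernel(Z_train, Z_train, lengthscale)
    k_q = rbf_kernel(Z_train, Z_query, lengthscale)  # shape (t, m)
    A = K + np.eye(len(Z_train))
    alpha = np.linalg.solve(A, y_train)  # (K_t + I)^{-1} y_t
    mean = k_q.T @ alpha
    # Posterior variance k(z, z) - k_t(z)^T (K_t + I)^{-1} k_t(z);
    # for the RBF kernel, k(z, z) = 1.
    var = 1.0 - np.sum(k_q * np.linalg.solve(A, k_q), axis=0)
    sigma = np.sqrt(np.maximum(var, 0.0))
    return mean, sigma, mean + beta * sigma, mean - beta * sigma
```

Evaluating `confidence_bounds` on the grid of all $(x,c)$ pairs and reshaping the result to a matrix with one row per action yields the UCB and LCB vectors used in the remainder of the paper.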

Finally, we introduce a sample complexity parameter, the maximum information gain:

$$\gamma_T := \max_{\{(x_t, c_t)\}_{t=1}^{T}} \log\det(\mathbf{1}_T + K_T) . \tag{4}$$

The information gain appears in the regret bounds for Bayesian optimization (Srinivas et al., 2010). Analytical upper bounds are known for a range of kernels, e.g., for the RBF kernel, if .
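As a small numerical illustration, the quantity inside the maximum in (4) can be evaluated for one fixed set of inputs. The sketch below uses NumPy; the function name `information_gain` is hypothetical, and the code does not perform the maximization over input sets that defines $\gamma_T$.

```python
import numpy as np


def information_gain(K):
    """Evaluate log det(I + K) for the kernel matrix K of a fixed set of inputs."""
    _, logdet = np.linalg.slogdet(np.eye(K.shape[0]) + K)
    return float(logdet)
```

For a fixed input set, this value is a lower bound on $\gamma_T$ with $T$ equal to the number of inputs.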

#### Maximum Mean Discrepancy (MMD).

MMD is a kernel-based discrepancy measure between distributions (e.g., Muandet et al. (2017)). It has been used in various applications, including generative modeling (Sutherland et al., 2016; Bińkowski et al., 2018), DRO (Staib and Jegelka, 2019) and kernel sample tests (Gretton et al., 2012; Chwialkowski et al., 2016). Let be an RKHS with corresponding kernel . For two distributions and over , the maximum mean discrepancy (MMD) is

$$d(P,Q) := \sup_{g \in \mathcal{H}_M :\, \|g\|_{\mathcal{H}_M} \le 1} \mathbb{E}_{c \sim P}[g(c)] - \mathbb{E}_{c \sim Q}[g(c)] . \tag{5}$$

Note that the kernel over that defines the MMD is different from the kernel over that is used for regression. An equivalent way of writing is via kernel mean embeddings (Muandet et al., 2017, Section 3.5). Specifically, any distribution over can be embedded into via the mean embedding , which satisfies for all . An equivalent expression for the MMD (5) is

$$d(P,Q) = \|m_P - m_Q\|_{\mathcal{H}_M} . \tag{6}$$

More explicitly, for finite context set and probability vectors and , the kernel mean embeddings are and , respectively. With the kernel matrix , the MMD becomes

$$d(P,Q) = \sqrt{(w - w')^\top M (w - w')} =: \|w - w'\|_M .$$
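For a finite context set, the MMD is therefore a weighted Euclidean distance between probability vectors. A minimal sketch in Python, where the function name `mmd` is an illustrative choice and `M` is assumed to be a symmetric positive semi-definite kernel matrix over the contexts:

```python
import numpy as np


def mmd(w, w_prime, M):
    """MMD between two probability vectors: sqrt((w - w')^T M (w - w'))."""
    diff = np.asarray(w, dtype=float) - np.asarray(w_prime, dtype=float)
    # Guard against tiny negative values caused by rounding.
    return float(np.sqrt(max(diff @ M @ diff, 0.0)))
```

With the identity matrix as `M`, this reduces to the Euclidean distance between the two probability vectors.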

## 3 Distributionally Robust Bayesian Optimization

We now introduce a Bayesian optimization algorithm for our main objective (2). We will start with a general formulation that allows for time-dependent reference distributions and margins . We then continue with data-driven DRO (Bertsimas et al., 2018), where we specialize the general setup and choose the empirical distribution as reference distribution. Hence, our algorithm chooses actions that are robust w.r.t. the estimation error of the true context distribution. Finally, we motivate and discuss the simulator setting, where the learner is allowed to choose the context and obtains the corresponding evaluation .

### 3.1 General DRBO

In our general DRBO formulation, the interaction protocol at time is specified by the following steps:

1. The environment chooses a reference distribution and margin . This defines the uncertainty set

   $$\mathcal{U}_t = \{Q : d(Q, P_t) \le \epsilon_t\} . \tag{7}$$

2. The learner observes and , and chooses a robust action .

3. The environment chooses a sampling distribution and the context is realized as an independent sample .

4. The learner observes the reward and .

We make no further assumptions on how the environment chooses the sequences and . The DRBO algorithm for this setting is given in Algorithm 1. Recall that is a distribution over the finite context set with elements, and we use to denote a probability vector with entries for every . With this, the inner adversarial problem for a fixed action can be equivalently written as:

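$$\inf_{Q :\, d(P_t, Q) \le \epsilon_t} \mathbb{E}_{c \sim Q}[f(x,c)] = \min_{\substack{w' :\ \|w'\|_1 = 1,\ 0 \le w'_j \le 1\ \forall j \in [n], \\ \|w' - w_t\|_M \le \epsilon_t}} \langle w', f_x \rangle, \tag{8}$$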

where , and with . In particular the solution to (8) is the worst-case distribution over for the objective if the learner chooses action . Since the constraints are convex, the program (8) can be solved efficiently by standard convex optimization solvers.

Since the true function values are unknown to the learner, we can only obtain an approximate solution to (8). In our algorithm, we hence use an optimistic upper bound instead. Specifically, we substitute for to compute the “optimistic” worst-case distribution for every action . Finally, at time , the learner chooses that maximizes the optimistic expected reward under the worst-case distribution.
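The inner problem (8) and the optimistic action choice can be sketched in a few lines. The following is an illustrative sketch only, not the authors' implementation: it assumes SciPy's general-purpose SLSQP solver as the convex solver, and the names `worst_case_weights` and `drbo_action` as well as the array layout (one row of UCB values per action) are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize


def worst_case_weights(values, w_ref, M, eps):
    """Approximately solve min_w <w, values> over the MMD ball around w_ref.

    Constraints: w lies on the probability simplex and
    (w - w_ref)^T M (w - w_ref) <= eps^2.
    """
    values = np.asarray(values, dtype=float)
    w_ref = np.asarray(w_ref, dtype=float)
    constraints = [
        {"type": "eq", "fun": lambda w: np.sum(w) - 1.0},
        # SLSQP expects inequality constraints in the form g(w) >= 0.
        {"type": "ineq", "fun": lambda w: eps**2 - (w - w_ref) @ M @ (w - w_ref)},
    ]
    result = minimize(
        lambda w: w @ values,
        x0=w_ref,
        jac=lambda w: values,
        bounds=[(0.0, 1.0)] * len(w_ref),
        constraints=constraints,
        method="SLSQP",
    )
    return result.x


def drbo_action(ucb, w_ref, M, eps):
    """Return argmax_x min_w <w, ucb[x]> and the robust value of every action.

    `ucb` has shape (num_actions, n), one row of optimistic values per action.
    """
    robust_values = [
        float(ucb[i] @ worst_case_weights(ucb[i], w_ref, M, eps))
        for i in range(len(ucb))
    ]
    return int(np.argmax(robust_values)), robust_values
```

The squared form of the MMD constraint is used because it is smooth at the reference vector, which is also the starting point of the solver. A dedicated conic solver may be preferable for larger context sets.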

The DRBO algorithm achieves the following regret bound.

###### Theorem 2.

The robust regret of Algorithm 1, with , is bounded with probability at least by

$$R_T \le 4 \beta_T \sqrt{T \Big( \gamma_T + 4 \log\big(\tfrac{12}{\delta}\big) \Big)} + 2 B' \sum_{t=1}^{T} \epsilon_t .$$

Here, is the maximum information gain defined in Eq. (4), and .

The complete proof is given in Appendix B.1, and we only sketch the main steps here. Denote by the probability vector of the true distribution at time , and by the solution to (8) at . The idea is to bound the instantaneous regret at time by

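$$\begin{aligned}
r_t &= \inf_{Q :\, d(P_t, Q) \le \epsilon_t} \mathbb{E}_Q[f(x_t^*, c)] - \inf_{Q :\, d(P_t, Q) \le \epsilon_t} \mathbb{E}_Q[f(x_t, c)] \\
&\overset{(i)}{\le} \langle w_t^*, \mathrm{ucb}^t_{x_t} \rangle - \langle w^{f}_{x_t}, f_{x_t} \rangle \\
&= \langle w_t^*, \mathrm{ucb}^t_{x_t} - f_{x_t} \rangle + \langle w_t^* - w^{f}_{x_t}, f_{x_t} \rangle \\
&\overset{(ii)}{\le} 2 \beta_t \langle w_t^*, \sigma_t(x_t, \cdot) \rangle + \|w_t^* - w^{f}_{x_t}\|_M \, \|f_{x_t}\|_{M^{-1}} \\
&\overset{(iii)}{\le} 2 \beta_T \langle w_t^*, \sigma_t(x_t, \cdot) \rangle + 2 \epsilon_t B' .
\end{aligned}$$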

For the first inequality (i), we used that , the definition of the UCB action and that . In step (ii), we use Cauchy-Schwarz and the confidence bounds, and step (iii) follows since . From here it remains to sum the instantaneous regret, where we rely on Lemma 3 in (Kirschner and Krause, 2018) to relate the expectation over the true sampling distribution to the observed values .
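The second summand in steps (ii) and (iii) can be unpacked as follows. This is a sketch under the assumptions that $M$ is positive definite, that both $w_t^*$ and $w^{f}_{x_t}$ lie in the uncertainty set around the reference vector $w_t$, and that $B'$ is an upper bound on $\|f_{x_t}\|_{M^{-1}}$, where $\|v\|_{M^{-1}} = \sqrt{v^\top M^{-1} v}$.

```latex
\begin{align*}
\langle w_t^* - w^{f}_{x_t}, f_{x_t} \rangle
  &= \langle M^{1/2} (w_t^* - w^{f}_{x_t}),\, M^{-1/2} f_{x_t} \rangle
   \le \|w_t^* - w^{f}_{x_t}\|_M \, \|f_{x_t}\|_{M^{-1}}, \\
\|w_t^* - w^{f}_{x_t}\|_M
  &\le \|w_t^* - w_t\|_M + \|w_t - w^{f}_{x_t}\|_M
   \le 2 \epsilon_t .
\end{align*}
```

The first line is the Cauchy-Schwarz inequality applied after rescaling by $M^{1/2}$, and the second line is the triangle inequality for the norm $\|\cdot\|_M$.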

In the regret bound in Theorem 2, the first term is the same as the standard regret bound for GP-UCB (Srinivas et al., 2010; Abbasi-Yadkori, 2013) and reflects the statistical convergence rate for estimating the RKHS function. The additional term (for ) is specific to our setting. First, the complexity parameter quantifies how much the distributional shift can increase the regret on the given objective . A crude upper bound is , but in general can be much smaller. The linear scaling of the regret bound is arguably unsatisfying, but seems unavoidable without further assumptions. A problematic case is when the true distribution is supported on a single context, e.g., , and the learner is not able to learn the function values at different contexts for . In this case, the learner can never infer the robust solution exactly from the data and consequently incurs constant regret of order per round. In practice, we do not expect that this severely affects the performance of our algorithm if the true distribution sufficiently covers the context space. We leave a precise formulation of this intuition for future work.

Instead, in the following sections we explore two different ways of controlling the additional regret that the learner incurs in the general DRBO setting. First, for the data-driven setting, we will set the reference distribution to the empirical distribution of the observed context samples. In this case, the margin is the distance to the true sampling distribution, which for the MMD is of order and results in . In the second variant, the learner is allowed to also choose , which circumvents the estimation problem outlined above and avoids the linear regret term.

### 3.2 Data-Driven DRBO

In data-driven DRBO, we assume there is a fixed but unknown distribution on . In each round, the learner first chooses an action , and then observes a context sample together with the corresponding observation . At the beginning of round , the learner computes the empirical distribution using the observed contexts . The objective is to choose a sequence of actions , which is robust to the estimation error in . This corresponds to minimizing the robust regret (2), where we set for every .

As the learner observes more context samples, she becomes more confident about the true unknown . It is therefore reasonable to shrink the uncertainty set of distributions over time. We make use of the following lemma.

###### Lemma 3 (Muandet et al. (2017), Theorem 3.4).

Assume for all . Let be the true context distribution over , and let be the empirical sample distribution. Then, with probability at least ,

$$d(P^*, \hat{P}_t) \le \frac{1}{\sqrt{t}} \Big( 2 + \sqrt{2 \log(1/\delta)} \Big).$$

Lemma 3 shows how to set the margin such that, at time , the true distribution is contained with high probability in the uncertainty set around the empirical distribution. The interaction protocol at time is then:

1. The learner computes the empirical distribution and corresponding margin according to Lemma 3, and defines the uncertainty set

   $$\mathcal{U}_t = \{Q : d(Q, \hat{P}_t) \le \epsilon_t\} .$$

2. The learner chooses a robust action .

3. The learner observes reward and context sample .

We follow Algorithm 1, and set the reference distribution and margin as outlined above. As a consequence of Theorem 2 we obtain the following regret bound.
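The two quantities that the data-driven variant feeds into Algorithm 1 are simple to compute. Below is a minimal sketch in plain Python, assuming at least one observed context and a hashable, finite context set; the function names are hypothetical, and `delta` is the per-round failure probability of Lemma 3 (Corollary 4 uses an adjusted confidence level so that the bound holds over all time steps).

```python
import math
from collections import Counter


def empirical_distribution(observed_contexts, context_set):
    """Probability vector of the empirical context distribution."""
    counts = Counter(observed_contexts)
    t = len(observed_contexts)  # assumed to be at least 1
    return [counts[c] / t for c in context_set]


def mmd_margin(t, delta):
    """Margin from Lemma 3: (2 + sqrt(2 log(1 / delta))) / sqrt(t)."""
    return (2.0 + math.sqrt(2.0 * math.log(1.0 / delta))) / math.sqrt(t)
```

The margin shrinks at rate $1/\sqrt{t}$, so the uncertainty set contracts around the empirical distribution as more contexts are observed.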

###### Corollary 4.

The robust regret of Algorithm 1, with and is bounded in the data-driven scenario with probability at least by

 RT ≤2βT√T(√γT(1+log(3/δ)) +4B′(2+√2log(6T2δ)) , (9)

where is the maximum information gain as defined in (4), and .

The proof can be found in Appendix B.2. We just note that we increased the value of such that Lemma 3 holds simultaneously over all time steps. In the data-driven contextual setting without the robustness requirement, several related approaches have been proposed (Lamprier et al., 2018; Kirschner and Krause, 2019). These are based on computing a UCB score directly at the kernel mean embedding of the empirical distribution . To account for the estimation error, an additional exploration bonus is added. We note that as and becomes an accurate estimate of , both robust and non-robust approaches converge to the stochastic solution. The advantage of the robust formulation is that we explicitly minimize the loss under the worst-case estimation error in the context distribution. As we demonstrate in our experiments (in Section 4), DRBO obtains significantly smaller regret when the robust and stochastic solutions are different.

### 3.3 Simulator DRBO

In our second variant of the general setup, the learner is allowed to choose in addition to and then obtains the observation .

One example of this setting, previously considered in the context of RO (Bogunovic et al., 2018), is when the learner tunes control parameters with a simulator of the environment (e.g., for a building heating system). The simulator gives the learner the ability to evaluate the objective at any specific context $c$. The objective is to simultaneously (or only at the final time $T$) deploy a robust solution on the real system, where the covariate is uncontrolled. Again, the learner’s objective is to be robust with respect to an uncertainty set of distributions on $\mathcal{C}$ in the real environment (e.g., for heating control, we want robustness to predicted weather conditions that affect the building’s state). With this motivation in mind, we refer to this setup as simulator DRBO. Formally, the interaction protocol is:

1. The environment provides a reference distribution , margin and uncertainty set as before.

2. The learner chooses an action and a context .

3. The learner observes reward from the simulator.

4. The learner deploys a robust action on the real system (or possibly only at the final step ).

We provide Algorithm 2 for this setting. As before, is an optimistic action under the worst-case distribution. In addition, the learner chooses as the context with the largest estimation uncertainty at . We bound the robust regret in the next theorem.
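The context choice described above amounts to a single argmax over the posterior standard deviations at the chosen action. A minimal sketch, assuming the posterior standard deviations are stored in a matrix with one row per action and one column per context (the function name and this layout are illustrative assumptions):

```python
import numpy as np


def most_uncertain_context(sigma, action_index):
    """Index of the context with the largest posterior standard deviation.

    `sigma` has shape (num_actions, n); row `action_index` holds the
    standard deviations at the chosen action for every context.
    """
    return int(np.argmax(sigma[action_index]))
```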

###### Theorem 5.

In the simulator setting, Algorithm 2, with , obtains bounded robust regret w.p. at least ,

$$R_T \le 2 \beta_T \sqrt{\gamma_T T} .$$

We provide the proof of Theorem 5 in Appendix B.3.

Perhaps surprisingly, this rate is the same as for GP-UCB in the standard setting (a similar result was obtained for RO (Bogunovic et al., 2018)). This is because now the learner can estimate globally at any input , and the sample complexity to infer the robust solution only depends on the sample complexity of estimating .

In the simulator setting, the performance of the final solution can be of significant interest if we aim to deploy the obtained parameter on the real system. To this end, we allow the final solution to be different from the last evaluation . The metric of interest is then the robust simple regret,

$$r_T = \max_{x \in \mathcal{X}} \inf_{Q} \mathbb{E}_{c \sim Q} f(x,c) - \inf_{Q} \mathbb{E}_{c \sim Q} f(\hat{x}_T, c).$$

To obtain a bound on the simple regret, we assume that the margin and the reference distribution are fixed. This is a natural requirement, which allows the learner to optimize the simple regret for the final solution w.r.t.  and . We choose the final solution among the iterates from Algorithm 2 with

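$$\hat{t} := \arg\max_{t = 1, \dots, T} \ \min_{\substack{w' :\ \|w'\|_1 = 1,\ 0 \le w'_j \le 1\ \forall j \in [n], \\ \|w' - w_t\|_M \le \epsilon}} \langle w', \mathrm{lcb}^t_{x_t} \rangle . \tag{10}$$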

The program computes the best robust solution among the iterates using the conservative function values of the corresponding time steps . It is easy to maintain iteratively by computing the conservative, worst-case payoff of the action and comparing to the previous solution .
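The iterative bookkeeping for (10) can be sketched as a running maximum. In the sketch below, `robust_lcb_value` stands for the conservative worst-case payoff of the current action, i.e. the inner minimum in (10) evaluated with the lower confidence bounds of the current step; how that value is computed is left to the inner solver, and the function name `update_best` is hypothetical.

```python
def update_best(best, t, action, robust_lcb_value):
    """Keep the iterate with the largest conservative worst-case payoff.

    `best` is either None or a tuple (t, action, value) from earlier steps.
    """
    if best is None or robust_lcb_value > best[2]:
        return (t, action, robust_lcb_value)
    return best
```

Calling `update_best` once per round requires only one additional solve of the inner problem per step and constant memory.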

###### Corollary 6 (Simple Regret).

With probability at least , the solution obtains simple regret

$$r_T \le 2 \beta_T \sqrt{\gamma_T / T} . \tag{11}$$

This result is a consequence of the fact that the simple regret of is upper bounded by the simple regret of each iterate . The guarantee then follows from the proof of Theorem 5. We provide the complete argument in Appendix B.4.

## 4 Experiments

We evaluate the proposed DRBO in the general, data-driven and simulator setting on two synthetic test functions, and on a recommender task based on a real-world crop yield data set. In our experiments, we compare to StableOpt (Bogunovic et al., 2018) and a stochastic UCB variant (Srinivas et al., 2010; Kirschner and Krause, 2019).

#### Baselines

The first baseline is a stochastic variant of the UCB approach (Srinivas et al., 2010; Kirschner and Krause, 2019), which chooses actions according to optimistic expected payoff w.r.t. the reference distribution,

$$x_t^{\mathrm{UCB}} = \arg\max_{x \in \mathcal{X}} \mathbb{E}_{P_t}[\mathrm{ucb}_t(x,c)] .$$

Our second baseline is StableOpt (Bogunovic et al., 2018), an approach for worst-case robust optimization. It chooses actions according to

$$x_t^{\mathrm{STABLE}} = \arg\max_{x \in \mathcal{X}} \min_{c \in \Delta_t} \mathrm{ucb}_t(x,c) ,$$

for a robustness set of possible context values . There is no canonical way of choosing in our setting, and we use . With the decreasing margin and the discretization of the context domain, it can happen that is an empty set. In this case we explicitly set .
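Both baseline rules reduce to simple array operations on a matrix of UCB values. A minimal sketch, assuming `ucb` has one row per action and one column per context, `w_ref` is the probability vector of the reference distribution, and `context_indices` lists the columns that belong to the robustness set $\Delta_t$ (passed in by the caller; the function names are hypothetical):

```python
import numpy as np


def ucb_action(ucb, w_ref):
    """Stochastic UCB: maximize the expected optimistic payoff under w_ref."""
    return int(np.argmax(ucb @ np.asarray(w_ref, dtype=float)))


def stableopt_action(ucb, context_indices):
    """StableOpt-style rule: maximize the worst optimistic payoff over the given contexts."""
    return int(np.argmax(np.min(ucb[:, context_indices], axis=1)))
```

On a small example with two actions, the two rules can disagree: an action with a high average payoff but one poor context is preferred by `ucb_action` and rejected by `stableopt_action`.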

UCB and StableOpt optimize for the stochastic and worst-case robust solutions, respectively, and can therefore exhibit linear robust regret (unless as in the data-driven setting). For all approaches we use the same RKHS hyper-parameters. In particular, we set , which is a common practice to improve performance over the (conservative) theoretical values.

#### Benchmarks

Our first synthetic benchmark is the function illustrated in the introduction. The reference distribution is and the true sampling distribution is . For simplicity, we set the margin to the exact MMD distance . On this function, the stochastic, worst-case robust and distributionally robust solution all differ, which leads to linear robust regret for UCB and StableOpt. The second synthetic benchmark is chosen such that stochastic, worst-case and distributionally robust solutions coincide, with the same choice of and as before. See Appendix C, Fig. 3(a) for a contour plot. Fig. 2 illustrates the results.

Further, we evaluate the methods on real-world wind power data (Data Package Time Series, 2019). Wind power forecasting is an important task (Wang et al., 2011) as power sources that can be effectively scheduled are valuable on the global energy market. In our problem setup, we take hourly recorded wind power data from 2013/14 and use a 48h sliding window to compute an empirical reference distribution for each time step. The decision variable is the amount of energy that is guaranteed to be delivered in the next hour after the end of the window. The contextual variable is the actual power generation which we take from the data set. We choose the reward (revenue) function:

$$f(x,c) = 0.1 \max(c - x, 0) + \min(x, c) - 5 \max(x - c, 0).$$

There is a reward of 0.1 per unit of energy that was not committed ahead of time, a reward of 1 per unit of committed energy, and a penalty of 5 per unit of committed energy that is not delivered (if the wind generation was too low). For each time step, we use the simulator scenario to compute the robust, stochastic, and stable solutions, and evaluate the performance on the data set. In Figure 3, we report the cumulative revenue of the different solutions deployed at each time step; this corresponds to the total revenue obtained during the year. The additional baseline is a “zero commitment strategy” ($x = 0$). The figure also shows cumulative robust regret. Clearly, the stochastic solution is different from the robust one, hence UCB obtains linear robust regret. In fact, in this case if the DRO objective is solved exactly for each step, the DRBO method would obtain zero robust regret (we compute the solution according to (10) after steps, therefore an optimization error may remain).
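The revenue function of the wind power benchmark translates directly into code. A minimal sketch in plain Python, where `x` is the committed energy and `c` is the realized generation (the function name `revenue` is an illustrative choice):

```python
def revenue(x, c):
    """Revenue for commitment x and realized wind generation c."""
    uncommitted = max(c - x, 0.0)  # generated energy that was not committed
    delivered = min(x, c)          # committed energy that is delivered
    shortfall = max(x - c, 0.0)    # committed energy that is not delivered
    return 0.1 * uncommitted + delivered - 5.0 * shortfall
```

The asymmetry between the reward for delivered energy and the penalty for a shortfall is what makes over-commitment costly, so a solution that is robust to shifts in the generation distribution tends to commit less than one tuned to the reference distribution alone.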

## 5 Conclusion

In this work, we introduced and studied distributionally robust Bayesian optimization, where the goal is to be robust against the worst-case contextual distribution among a specified uncertainty set of distributions. Specifically, we focused on uncertainty sets determined by the MMD distance. For a few settings of interest that differ in how the contextual parameter is realized, we provided the first DRBO algorithms with theoretical guarantees. In the experimental study, we demonstrated improvements in terms of robust expected regret over stochastic and worst-case BO baselines.

Our algorithms rely on solving the inner adversary problem, which, in our case, is a linear program with convex constraints. This program can be solved efficiently but is of size $|\mathcal{C}|$, which currently limits the method to relatively small context sets. The formulation and the theory continue to hold for large or continuous context sets, but finding a tractable algorithmic approximation is an interesting direction for future work. Finally, while the considered kernel-based MMD distance fits well with the kernel-based regularity assumptions used in BO, an interesting direction is to extend the ideas to other uncertainty sets used in machine learning, such as the ones defined by $\phi$-divergences and the Wasserstein distance. In fact, our approach is still applicable in the case of other divergences, as long as the uncertainty set of distributions is convex and the inner problem can be solved efficiently.

## Acknowledgement

This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme grant agreement No 815943, and NSF CAREER award 1553284. IB is supported by ETH Zürich Postdoctoral Fellowship 19-2 FEL-47.

## Appendix A RKHS Regression

Recall that at step , we have data . The kernel ridge regression estimate is defined by,

$$\hat{f}_t = \arg\min_{g \in \mathcal{H}} \sum_{i=1}^{t} \big(g(x_i, c_i) - y_i\big)^2 + \|g\|_{\mathcal{H}}^2 . \tag{12}$$

Denote by the vector of observations, the data kernel matrix, and the data kernel features. We then have

$$\hat{f}_t(x,c) = k_t(x,c)^\top (K_t + \mathbf{1}_t)^{-1} y_t . \tag{13}$$

We further have the posterior variance that determines the width of the confidence intervals,

$$\sigma_t(x,c)^2 = k(x,c,x,c) - k_t(x,c)^\top (K_t + \mathbf{1}_t)^{-1} k_t(x,c) . \tag{14}$$

## Appendix B Proofs

### B.1 Proof of Theorem 2

The robust cumulative regret is

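$$R_T = \sum_{t=1}^{T} \Big( \max_{x \in \mathcal{X}} \inf_{Q :\, d(Q, P_t) \le \epsilon_t} \mathbb{E}_Q[f(x,c)] - \inf_{Q :\, d(Q, P_t) \le \epsilon_t} \mathbb{E}_Q[f(x_t, c)] \Big). \tag{15}$$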

For the proof, we first bound the instantaneous robust regret,

$$r_t = \inf_{Q :\, d(P_t, Q) \le \epsilon_t} \mathbb{E}_Q[f(x_t^*, c)] - \inf_{Q :\, d(P_t, Q) \le \epsilon_t} \mathbb{E}_Q[f(x_t, c)] , \tag{16}$$

where we denote the true robust solution at time . We recall the following notation, , and are vectors in , and . Further, is a probability vector in , where is used to denote the size of the contextual set, i.e., . With this, note that

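$$\inf_{Q :\, d(P_t, Q) \le \epsilon_t} \mathbb{E}[f(x,c)] = \inf_{\substack{w' :\ \|w'\|_1 = 1,\ 0 \le w'_j \le 1\ \forall j \in [n], \\ \|w' - w_t\|_M \le \epsilon_t}} \langle w', f_x \rangle \tag{17}$$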

The solution to this linear program is the worst case distribution over if we choose action . Define worst-case distributions , and for exact, optimistic and pessimistic function values (the dependence on is implicit),

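$$\begin{aligned}
w^{f}_{x} &= \arg\min_{\substack{w' :\ \|w'\|_1 = 1,\ 0 \le w'_j \le 1\ \forall j \in [n], \\ \|w' - w_t\|_M \le \epsilon_t}} \langle w', f_x \rangle, \\
w^{\mathrm{lcb}^t}_{x} &= \arg\min_{\substack{w' :\ \|w'\|_1 = 1,\ 0 \le w'_j \le 1\ \forall j \in [n], \\ \|w' - w_t\|_M \le \epsilon_t}} \langle w', \mathrm{lcb}^t_x \rangle, \\
w^{\mathrm{ucb}^t}_{x} &= \arg\min_{\substack{w' :\ \|w'\|_1 = 1,\ 0 \le w'_j \le 1\ \forall j \in [n], \\ \|w' - w_t\|_M \le \epsilon_t}} \langle w', \mathrm{ucb}^t_x \rangle .
\end{aligned} \tag{18}$$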

By combining (8) and (18), we can upper and lower bound the objective as follows:

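$$\langle w^{\mathrm{lcb}^t}_{x}, \mathrm{lcb}^t_x \rangle \le \inf_{\substack{w' :\ \|w'\|_1 = 1,\ 0 \le w'_j \le 1\ \forall j \in [n], \\ \|w' - w_t\|_M \le \epsilon_t}} \langle w', f_x \rangle \le \langle w^{\mathrm{ucb}^t}_{x}, \mathrm{ucb}^t_x \rangle. \tag{19}$$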

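Recall that Algorithm 1 takes actions , and note that where is the probability vector from the true sampling distribution at time . For any , we proceed to bound the instantaneous regret,

\begin{align}
r_t &= \inf_{Q :\, d(P_t, Q) \le \epsilon_t} \mathbb{E}_Q[f(x_t^*, c)] - \inf_{Q :\, d(P_t, Q) \le \epsilon_t} \mathbb{E}_Q[f(x_t, c)] \tag{20} \\
&\overset{(i)}{\le} \langle w^{\mathrm{ucb}^t}_{x_t^*}, \mathrm{ucb}^t_{x_t^*} \rangle - \langle w^{f}_{x_t}, f_{x_t} \rangle \tag{21} \\
&\overset{(ii)}{\le} \langle w^{\mathrm{ucb}^t}_{x_t}, \mathrm{ucb}^t_{x_t} \rangle - \langle w^{f}_{x_t}, f_{x_t} \rangle \tag{22}
\end{align}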