
Abstract

We consider robust optimization problems, where the goal is to optimize an unknown objective function against the worst-case realization of an uncertain parameter. For this setting, we design a novel sample-efficient algorithm GP-MRO, which sequentially learns about the unknown objective from noisy point evaluations. GP-MRO seeks to discover a robust and randomized mixed strategy that maximizes the worst-case expected objective value. To achieve this, it combines techniques from online learning with nonparametric confidence bounds from Gaussian processes. Our theoretical results characterize the number of samples required by GP-MRO to discover a robust near-optimal mixed strategy for different GP kernels of interest. We experimentally demonstrate the performance of our algorithm on synthetic datasets and on human-assisted trajectory planning tasks for autonomous vehicles. In our simulations, we show that robust deterministic strategies can be overly conservative, while the mixed strategies found by GP-MRO significantly improve the overall performance.

1 Introduction

Many real-world problems require making decisions under uncertainty. The latter can manifest itself in the form of uncertain parameters, perturbations, or an adversary that can corrupt the decision (bertsimas2011theory). In such problems, one often seeks to optimize an objective function while being robust to the worst possible uncertainty realization. This can be achieved by phrasing such problems in the framework of Robust Optimization (RO) (ben2009robust). RO has found successful applications in various domains including supply chain management (bertsimas2004supplychain), portfolio optimization (bental2000portfolio), influence maximization (he2016robust), and robotics (bojorgensen2018), to name a few.

In various practical problems, however, the objective function to be optimized is a-priori unknown, and one can only learn about it from sequential and noisy point evaluations. Gaussian process (GP) optimization is an established framework for model-based sequential optimization of such unknown functions (srinivas2009gaussian). An array of algorithms that use Bayesian non-parametric GP models (rasmussen2006gaussian), and balance exploration (learning the function globally) and exploitation (maximizing the function) have been developed over the years, e.g., (srinivas2009gaussian; bogunovic2016truncated; chowdhury17kernelized; wang2017max; frazier2018).

In this paper, we study the robust optimization problem where (i) the objective function is unknown and (ii) the goal is to be robust against the worst possible realization of its uncertain parameter. This problem differs from the classical RO formulation where the objective function is assumed to be known, and is also different from the standard GP optimization where robustness requirement is typically not pursued.

Instead of finding a robust deterministic solution to this problem (as in (bogunovic2018adversarially)), we seek to discover a randomized, i.e., mixed, strategy from a relatively small number of noisy function evaluations. The primary motivation for seeking such strategies is that, in general, they can provide arbitrarily better worst-case expected performance than deterministic ones (krause2011robust; vorobeychik2014; sinha2018security): randomization prevents a potential adversary from knowing the actual decision until it is realized. Consequently, we design and use a novel GP-based sample-efficient algorithm to discover near-optimal mixed strategies. We empirically demonstrate the effectiveness of the identified robust mixed strategies in a trajectory planning task for autonomous vehicles, where deterministic strategies are shown to be overly conservative.

Related work. Over the past couple of years, robust optimization has been extensively studied in the machine learning community. While most works focus on convex settings (e.g., (shalev2016; namkoong2016)), more recent works also consider general non-convex objectives, e.g., (chen2017robust; sinha2017certifying; staib2018distributionally). Among those, chen2017robust provide robust algorithmic strategies that are shown to be successful in several learning tasks. Their algorithm is based on the idea of simulating a zero-sum game between a learner and an adversary. Similar strategies have also been explored in other adversarial settings, e.g., in submodular optimization (krause2011robust; kawase2019). Our approach builds on a similar algorithmic idea to chen2017robust, but unlike this and the other works mentioned above, which assume that the objective function is perfectly known (or that a maximization oracle is available), it also requires performing non-trivial function estimation.

In non-robust GP optimization, various optimization algorithms (srinivas2009gaussian; chowdhury17kernelized; bogunovic2016truncated; contal2013parallel; wang2017max) have been proposed to sequentially optimize the unknown function from noisy and zeroth-order observations. Similarly to these algorithms, our algorithm relies on a non-parametric GP model to obtain shrinking confidence bounds around the unknown objective function. Besides the standard problem, GP optimization has been considered in several other practical settings such as contextual (krause2011contextual), time-varying (bogunovic2016time), safe exploration (sui2015safe), etc.

Recently, a novel algorithm for robust GP optimization StableOpt has been proposed by bogunovic2018adversarially. StableOpt discovers a deterministic solution that is robust with respect to the worst-case realization of the uncertain parameter. This work is closest to ours, but instead of seeking deterministic solutions, our focus is on the mixed strategies which are preferable in certain scenarios (see Section 4.2), where deterministic solutions turn out to be overly conservative. We also note that other forms of robustness have been studied in GP optimization. For instance, nogueira2016unscented; oliveira2019 consider robustness against uncertain inputs (typical in robotics applications), sessa2019noregret study robust aspects in multi-agent unknown repeated games, williams2000; tesch2011 deal with uncontrolled environmental variables, while robustness with respect to outliers is addressed by martinez2018practical.

Contributions. We consider robust optimization of unknown and generally non-convex objectives.

• We propose an algorithm, GP-MRO, which returns a mixed strategy, i.e., a probability distribution over actions, that is robust against the worst-case realization of the uncertain parameter.

• Our theoretical analysis shows the number of samples required for GP-MRO to discover a near-optimal robust mixed strategy.

• We propose a variant of GP-MRO which can effectively trade-off worst-case and average-case performance.

• Finally, we consider the problem of trajectory planning in autonomous driving guided by user’s evaluations. In our experiments, we demonstrate the effectiveness of the robust mixed strategies discovered by GP-MRO in comparison to those identified by existing robust methods.

2 Problem Formulation

Let f : X × Θ → R be a reward function over the domain X × Θ, where X is a continuous and compact decision set and Θ = {θ_1, …, θ_m} is a finite set of parameter values. The reward function is unknown, and we learn about it from sequential noisy point observations, i.e., so-called bandit feedback. At each time step t, we choose x_t ∈ X and θ_t ∈ Θ, and observe a noisy sample y_t = f(x_t, θ_t) + ξ_t, where ξ_t ∼ N(0, σ²), and the ξ_t's are independent over time (our approach also allows for sub-Gaussian noise).

After T rounds (i.e., T samples), our goal is to report a strategy for selecting points in X that is robust against the worst-possible parameter value from Θ. We assume that during the optimization phase (i.e., training/simulation) one can choose θ_t, while later, during the implementation (i.e., test) phase, the parameter becomes uncontrollable. Hence, it is important to design a robust strategy for selecting x.

Optimization goal. Let Δ(X) denote the set of all probability distributions, or mixed strategies, on X. Our goal is to find a distribution in Δ(X) that achieves high reward in the worst case over Θ. The maximin optimal value is given by:

  τ* = max_{P∈Δ(X)} min_{θ∈Θ} E_{x∼P}[f(x,θ)],   (1)

and we seek to report a robust solution P^(T) that, for some specified accuracy value ε > 0, achieves

  min_{θ∈Θ} E_{x∼P^(T)}[f(x,θ)] ≥ τ* − ε.   (2)

Besides achieving (2), our goal is also to minimize the total number of required samples T.

We note that our optimization goal is different from computing a deterministic (pure strategy) solution and competing against max_{x∈X} min_{θ∈Θ} f(x,θ), as considered in (bogunovic2018adversarially). Our goal is to discover a randomized strategy and compete against τ* defined in (1), which can be arbitrarily larger than max_{x∈X} min_{θ∈Θ} f(x,θ). Hence, the mixed strategies considered in this work can provide arbitrarily better expected performance than such deterministic ones. Conceptually, randomization makes the decisions less predictable, a key feature in many applications including security games (sinha2018security), adversarial learning (vorobeychik2014), and sensing (krause2011robust). This is also the case in the autonomous driving scenario considered in Section 4.2, where we show that deterministic strategies can be overly conservative. Finally, we also note that the same objective (1) is considered in (chen2017robust), in the case where the reward functions f(·,θ), θ ∈ Θ, are known.
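The gap between the two benchmarks can be made concrete with a minimal numerical sketch (a toy payoff matrix of our own, not an example from the paper), comparing the deterministic maximin value with the value of a uniform mixed strategy:

```python
import numpy as np

# Toy payoff matrix f[x, theta]: 2 learner actions, 2 adversary values.
f = np.array([[1.0, 0.0],
              [0.0, 1.0]])

# Deterministic (pure) maximin value: max_x min_theta f(x, theta).
pure_value = f.min(axis=1).max()      # every row has a zero, so this is 0

# Worst-case expected value of the uniform mixed strategy P:
# min_theta E_{x~P}[f(x, theta)].
P = np.array([0.5, 0.5])
mixed_value = (P @ f).min()           # 0.5, strictly better than any pure action
```

Here randomization helps precisely because the adversary cannot anticipate which action will be realized; scaling the matrix shows the gap can be made arbitrarily large.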

Our Model. We assume that the unknown objective f is fixed and belongs to a Reproducing Kernel Hilbert Space (RKHS) H_k corresponding to a positive semi-definite kernel function k. Furthermore, we require f to have a bounded RKHS norm, i.e., ‖f‖_k ≤ B, where ‖·‖_k stands for the RKHS norm and B is a known positive constant. The RKHS norm represents a measure of the smoothness of f as measured by the corresponding kernel. We note that these are the standard assumptions used in GP optimization (see, e.g., (srinivas2009gaussian; chowdhury17kernelized; bogunovic2018adversarially)).

For the kernel function, we assume k((x,θ),(x′,θ′)) ≤ 1 for all pairs of inputs, which is without loss of generality if appropriate re-scaling is applied. Our setup also allows for composite kernels that can be constructed from individual kernels k_X and k_Θ to obtain, for example, the additive kernel k_X(x,x′) + k_Θ(θ,θ′) or the product kernel k_X(x,x′)·k_Θ(θ,θ′). Popularly used kernels are the linear, squared exponential (SE), and Matérn kernels:

  k_Lin(x,x′) = x^T x′,
  k_SE(x,x′) = exp( −‖x−x′‖² / (2l²) ),   and
  k_Mat(x,x′) = (2^{1−ν} / Γ(ν)) · (√(2ν)‖x−x′‖/l)^ν · J_ν(√(2ν)‖x−x′‖/l),

where l > 0 is the length-scale parameter and ν > 0 is a parameter that determines the smoothness (rasmussen2006gaussian).
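These kernels can be sketched in a few lines. The Matérn form below is implemented with the modified Bessel function of the second kind (scipy's `kv`), which is how the kernel is usually written in practice (rasmussen2006gaussian); `l` and `nu` are the length-scale and smoothness parameters:

```python
import numpy as np
from scipy.special import gamma, kv

def k_lin(x, xp):
    """Linear kernel x^T x'."""
    return float(np.dot(x, xp))

def k_se(x, xp, l=1.0):
    """Squared exponential kernel exp(-||x - x'||^2 / (2 l^2))."""
    r = np.linalg.norm(np.asarray(x) - np.asarray(xp))
    return float(np.exp(-r**2 / (2 * l**2)))

def k_matern(x, xp, l=1.0, nu=2.5):
    """Matern kernel via the modified Bessel function of the second kind."""
    r = np.linalg.norm(np.asarray(x) - np.asarray(xp))
    if r == 0.0:
        return 1.0  # limit as r -> 0
    z = np.sqrt(2 * nu) * r / l
    return float(2**(1 - nu) / gamma(nu) * z**nu * kv(nu, z))
```

For half-integer `nu` (e.g. ν = 5/2) the Bessel form collapses to a well-known polynomial-times-exponential closed form, which gives an easy correctness check.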

Under such assumptions, the uncertainty over f is naturally modeled as a Gaussian process GP(0, k). Further on, a Gaussian likelihood model for the observations can be used, assuming the noise is drawn, independently across t, from N(0, λ). Here, λ denotes a free hyper-parameter that may differ from the true noise variance σ². With this model in place, conditioned on the history of inputs {(x_1,θ_1), …, (x_t,θ_t)} and their noisy observations {y_1, …, y_t}, the posterior distribution under this prior is also Gaussian, with the closed-form posterior mean and variance:

  μ_t(x,θ) = k_t(x,θ)^T (K_t + λI_t)^{−1} y_t,   (3)
  σ_t²(x,θ) = k((x,θ),(x,θ)) − k_t(x,θ)^T (K_t + λI_t)^{−1} k_t(x,θ),   (4)

where y_t = [y_1, …, y_t]^T, k_t(x,θ) = [k((x,θ),(x_1,θ_1)), …, k((x,θ),(x_t,θ_t))]^T, and K_t = [k((x_i,θ_i),(x_j,θ_j))]_{i,j≤t} is the kernel matrix. As described below, we make use of this model in our algorithm to sequentially learn about the unknown objective function.
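Equations (3)–(4) transcribe directly into code; the sketch below assumes `kernel` is any function of two points and `lam` the regularization hyper-parameter:

```python
import numpy as np

def gp_posterior(X, y, Xstar, kernel, lam=1.0):
    """Posterior mean/variance per (3)-(4):
    mu = k_t^T (K_t + lam I)^{-1} y,
    sigma^2 = k(z, z) - k_t^T (K_t + lam I)^{-1} k_t."""
    n = len(X)
    K = np.array([[kernel(a, b) for b in X] for a in X])
    A = np.linalg.solve(K + lam * np.eye(n), np.eye(n))  # (K_t + lam I)^{-1}
    mus, vars_ = [], []
    for z in Xstar:
        kt = np.array([kernel(z, a) for a in X])
        mus.append(kt @ A @ np.asarray(y))
        vars_.append(kernel(z, z) - kt @ A @ kt)
    return np.array(mus), np.array(vars_)
```

The posterior variance shrinks near observed inputs and reverts to the prior variance far from the data, which is exactly the behavior the confidence bounds of the next section exploit.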

3 Proposed Algorithm and Theory

Our algorithm, GP-MRO, is shown in Algorithm 1. It can be interpreted as a zero-sum game between a simulated adversary and a learner. The adversary plays actions from the set Θ, while the learner plays actions from X. Because the true reward function is unknown, the algorithm maintains and makes use of an optimistic upper confidence bound (defined below) of the unknown reward function. We define the confidence bounds as follows:

  ucb_t(x,θ) := μ_t(x,θ) + β_{t+1} σ_t(x,θ),   (5)
  lcb_t(x,θ) := μ_t(x,θ) − β_{t+1} σ_t(x,θ),   (6)

where β_{t+1} is the confidence parameter that we set according to Lemma 1 below. We also define their truncated versions:

  ucb̄_t(x,θ) := min{ ucb_t(x,θ), 1 },   (7)
  lcb̄_t(x,θ) := max{ lcb_t(x,θ), 0 },   (8)

which we use in our algorithm. At every round t, GP-MRO simulates the adversary by selecting a distribution w_t over the values of Θ, i.e., w_t ∈ Δ(Θ), where w_t[i] denotes the probability of selecting θ_i. Subsequently, the learner best responds by selecting x_t ∈ X based on the knowledge of w_t. After T iterations, GP-MRO returns the uniform distribution over {x_1, …, x_T}, denoted by U^(T). Next, we explain how w_t and x_t are chosen in Algorithm 1.

The multiplicative weight updates (MWU) rule (freund1997) is used to select w_t at every round t. We note that this algorithm is a no-regret online learning algorithm that requires full-information feedback at every round, i.e., observations corresponding to every pair (x_t, θ), θ ∈ Θ. This is not possible in our setting, where the learner only receives a single noisy observation corresponding to the chosen pair (x_t, θ_t). To cope with this, we make use of the upper confidence bound functions to effectively emulate the full-information feedback.1 Hence, the MWU rule used in our algorithm is given by:

  w_t[i] ∝ exp( −η_T Σ_{j=1}^{t−1} ucb̄_{j−1}(x_j, θ_i) ),

where η_T is the learning rate parameter that we set in Theorem 2 below. An equivalent way of writing this rule is via the following recursive update:

  w_{t+1}[i] = w_t[i]·exp(−η_T·ucb̄_{t−1}(x_t, θ_i)) / Σ_{j=1}^m w_t[j]·exp(−η_T·ucb̄_{t−1}(x_t, θ_j)).
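The equivalence of the cumulative and recursive forms is easy to check numerically. In the sketch below, `ucb_hist[j]` stands for the loss vector fed to the adversary at round j+1, i.e., the truncated upper confidence bounds evaluated at the learner's chosen point:

```python
import numpy as np

def mwu_weights(ucb_hist, eta):
    """Cumulative MWU form: w[i] proportional to exp(-eta * sum_j ucb_hist[j, i])."""
    w = np.exp(-eta * np.asarray(ucb_hist).sum(axis=0))
    return w / w.sum()

def mwu_step(w, ucb_last, eta):
    """Recursive MWU form: multiply by exp(-eta * latest loss) and renormalize."""
    w = w * np.exp(-eta * np.asarray(ucb_last))
    return w / w.sum()
```

Starting from the uniform distribution and applying `mwu_step` once per round reproduces `mwu_weights` exactly, since normalization commutes with the running product of exponentials.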

The learner then observes w_t, and plays the best response obtained by using the upper confidence bound instead of the true unknown function:

  x_t = argmax_{x∈X} Σ_{i=1}^m w_t[i]·ucb̄_{t−1}(x, θ_i).   (9)

Finally, the unknown function is queried at (x_t, θ_t), where θ_t is selected as the parameter value with the highest uncertainty for the selected x_t, i.e.,

  θ_t ∈ argmax_{θ∈Θ} σ_{t−1}(x_t, θ).   (10)

The observed data point (x_t, θ_t, y_t) is then used to update the model via (3) and (4).
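Putting rules (9) and (10) together with the MWU update, one full run of GP-MRO can be sketched on a toy problem. Everything below is illustrative: the objective `f_true`, the grids, and the constant confidence width `beta` are our own assumptions (the theory sets β_t as in Lemma 1 and η_T as in Theorem 2):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem (all constants illustrative, not from the paper).
X = np.linspace(0.0, 1.0, 15)                      # decision grid
Theta = np.array([0.0, 0.5, 1.0])                  # finite parameter set
f_true = lambda x, th: 0.5 + 0.4 * np.sin(3 * x + 2 * th)  # "unknown" objective
kern = lambda a, b: np.exp(-np.sum((np.asarray(a) - np.asarray(b)) ** 2) / 0.2)
lam, beta, eta_T, T = 1.0, 2.0, 0.5, 20            # beta fixed for simplicity

m = len(Theta)
w = np.ones(m) / m                                 # adversary's distribution over Theta
data_z, data_y, queried_x = [], [], []

def posterior(z):
    """Posterior mean/std at z = (x, theta), per (3)-(4). Rebuilt from scratch
    each call for clarity (a real implementation would update incrementally)."""
    if not data_z:
        return 0.0, 1.0
    Z, y = np.array(data_z), np.array(data_y)
    K = np.array([[kern(a, b) for b in Z] for a in Z])
    kt = np.array([kern(z, a) for a in Z])
    sol = np.linalg.solve(K + lam * np.eye(len(Z)), np.column_stack([y, kt]))
    return float(kt @ sol[:, 0]), float(np.sqrt(max(kern(z, z) - kt @ sol[:, 1], 0.0)))

for t in range(T):
    # Truncated upper confidence bounds over the whole grid, per (7).
    stats = [[posterior((x, th)) for th in Theta] for x in X]
    ucb = np.array([[min(mu + beta * s, 1.0) for (mu, s) in row] for row in stats])
    xi = int(np.argmax(ucb @ w))                              # best response, rule (9)
    ti = int(np.argmax([stats[xi][j][1] for j in range(m)]))  # most uncertain theta, rule (10)
    y_obs = f_true(X[xi], Theta[ti]) + 0.05 * rng.standard_normal()
    data_z.append((X[xi], Theta[ti])); data_y.append(y_obs)
    queried_x.append(X[xi])
    w = w * np.exp(-eta_T * ucb[xi]); w = w / w.sum()         # MWU update

# Returned strategy U(T): uniform distribution over the queried x_1, ..., x_T.
support, counts = np.unique(np.round(queried_x, 8), return_counts=True)
P = counts / counts.sum()
```

Note how the single noisy query per round feeds both players: the learner's best response and the adversary's weights are computed from the same optimistic surrogate.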

3.1 Main result

To characterize our regret bounds, we make use of a suitable measure of the complexity of the function class, the so-called maximum information gain. It was introduced by srinivas2009gaussian and has subsequently been used in many different works on Bayesian (GP) optimization. At time t, it is defined as

  γ_t = max_{(x_1,θ_1),…,(x_t,θ_t)} (1/2) log det( I_t + λ^{−1} K_t ),   (11)

and it measures the maximal reduction in uncertainty about f after receiving t noisy observations corresponding to (x_1,θ_1), …, (x_t,θ_t). For a compact d-dimensional domain, this kernel-dependent quantity is sublinear in t for various kernel functions, e.g., γ_t = O((log t)^{d+1}) for the squared exponential kernel and γ_t = O(t^{d(d+1)/(2ν+d(d+1))} log t) for the Matérn kernel with ν > 1 (srinivas2009gaussian).
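The quantity inside the max in (11) is a one-liner for any fixed set of points; a sketch, assuming a kernel function `kernel` and regularizer `lam` as above:

```python
import numpy as np

def info_gain(Z, kernel, lam=1.0):
    """(1/2) log det(I_t + lam^{-1} K_t) for a fixed point set Z, per (11)."""
    K = np.array([[kernel(a, b) for b in Z] for a in Z])
    _, logdet = np.linalg.slogdet(np.eye(len(Z)) + K / lam)
    return 0.5 * logdet
```

γ_t is then the maximum of this quantity over all size-t point sets, which is why it upper-bounds the information gathered by any query sequence; note below that near-duplicate points contribute much less than well-separated ones.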

We use the following well-known result in GP optimization (srinivas2009gaussian; chowdhury17kernelized), which allows for the construction of statistical confidence bounds around the unknown function.

Lemma 1.

Let f ∈ H_k with ‖f‖_k ≤ B, and consider the sampling model

  y_t = f(x_t, θ_t) + ξ_t,   where ξ_t ∼ N(0, σ²).

If the confidence parameter β_t is set to

  β_t = B + σλ^{−1/2} √( 2(γ_{t−1} + ln(1/δ)) ),   (12)

then the following holds for every (x,θ) ∈ X × Θ and every t ≥ 1, with probability at least 1 − δ:

  |μ_{t−1}(x,θ) − f(x,θ)| ≤ β_t σ_{t−1}(x,θ),   (13)

where μ_{t−1}(x,θ) and σ_{t−1}(x,θ) are given in (3) and (4) with regularizer λ > 0.
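For reference, (12) translates directly into code (a sketch; `gamma_prev` stands for γ_{t−1}):

```python
import numpy as np

def beta_t(B, sigma, lam, gamma_prev, delta):
    """Confidence parameter per (12): B + sigma * lam^{-1/2} * sqrt(2*(gamma + ln(1/delta)))."""
    return B + sigma * lam ** (-0.5) * np.sqrt(2 * (gamma_prev + np.log(1 / delta)))
```

The width grows with the accumulated information gain and with the required confidence level, and shrinks as the likelihood regularizer λ grows.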

Given the definitions (7)–(8), and conditioning on the event that (13) in Lemma 1 holds true, we have:

  1 ≥ ucb̄_t(x,θ) ≥ f(x,θ) ≥ lcb̄_t(x,θ) ≥ 0,   (14)

for every pair (x,θ) ∈ X × Θ and every t ≥ 1.

Next, we state our main theorem in which we bound the performance of GP-MRO. All the proofs from this section are provided in the supplementary material.

Theorem 2.

Fix ε > 0, δ ∈ (0,1), B > 0, σ > 0, λ ≥ 1, and suppose the following holds:

  T ≥ (1/ε²)·( log(m)/2 + β_T √(32λγ_T log(m)) + 16β_T² λγ_T ),

for some T ∈ N. Then, for any f ∈ H_k such that ‖f‖_k ≤ B and f(x,θ) ∈ [0,1] for all (x,θ), GP-MRO with β_t set as in Lemma 1 and η_T = √(8 log(m)/T) achieves

  min_{θ∈Θ} E_{x∼U^(T)}[f(x,θ)] ≥ max_{P∈Δ(X)} min_{θ∈Θ} E_{x∼P}[f(x,θ)] − ε,

after T rounds, with probability at least 1 − δ, where U^(T) is the distribution returned by GP-MRO.

Our analysis is based on regret-bounding techniques for zero-sum games similar to (chen2017robust) (we bound the rate of convergence to an equilibrium of the game simulated by GP-MRO), but with additional non-trivial challenges in characterizing the excess regret due to the fact that f is unknown. The result in this theorem holds for general kernels, and it can be made more specific by substituting the bounds on γ_T for different kernels, e.g., the polylogarithmic bound available for the widely used squared exponential kernel. In the same setting, StableOpt (bogunovic2018adversarially) discovers a deterministic maximin strategy that is near-optimal with respect to a generally weaker benchmark. Finally, in comparison to the result of chen2017robust, where f is assumed to be known, our bound characterizes the additional number of samples required for estimating the unknown RKHS function.

3.1.1 Trading Off Worst-Case and Average-Case Performance

In many scenarios, one might care about the performance of the reported distribution in the worst case while also ensuring good performance on "average". A natural way to trade off these two quantities is to use the following objective:

  W(P) := (1−χ)·E_{θ∼Q, x∼P}[f(x,θ)] + χ·min_{θ∈Θ} E_{x∼P}[f(x,θ)],

for some fixed distribution Q ∈ Δ(Θ) (e.g., the uniform distribution) and trade-off parameter χ ∈ [0,1]. Note that by setting χ = 1, we recover the worst-case objective. Hence, our goal is to output P^(T) after T rounds such that, for some accuracy ε > 0,

  W(P^(T)) ≥ W(P*) − ε,   (15)

where P* ∈ argmax_{P∈Δ(X)} W(P).
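On a finite grid, the trade-off objective is easy to evaluate; a sketch in which `f` is an |X| × |Θ| payoff matrix and `P`, `Q` are probability vectors (toy inputs of our own):

```python
import numpy as np

def W(P, f, Q, chi):
    """Trade-off objective: (1-chi)*E_{theta~Q, x~P}[f] + chi*min_theta E_{x~P}[f]."""
    expected = P @ f                # E_{x~P}[f(x, theta)] for every theta
    return (1 - chi) * (expected @ Q) + chi * expected.min()
```

Setting `chi = 1` recovers the pure worst-case value and `chi = 0` the pure average case, matching the two extremes discussed above.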

Extending our algorithm to this case amounts to modifying the best-response rule (Line 3 of Algorithm 1) as:

  x_t = argmax_{x∈X} [ (1−χ)·E_{θ∼Q}[ucb̄_{t−1}(x,θ)] + χ·Σ_{i=1}^m w_t[i]·ucb̄_{t−1}(x,θ_i) ].   (16)

The theoretical guarantees of GP-MRO in this setting, with the best-response rule as given in (16), are provided in the following corollary.

Corollary 3.

Let Q be a fixed distribution in Δ(Θ) and let χ ∈ [0,1] be a trade-off parameter. Under the setup of Theorem 2, and when the following holds:

  T ≥ (1/ε²)·( χ² log(m)/2 + χβ_T √(32λγ_T log(m)) + 16β_T² λγ_T ),

for some T ∈ N, GP-MRO with the best-response rule as in (16) achieves

  W(U^(T)) ≥ W(P*) − ε,

after T rounds, with probability at least 1 − δ, where U^(T) is the returned uniform distribution over the queried points x_1, …, x_T.

The proof closely follows the one of Theorem 2. When χ = 1, we recover Theorem 2, while the bound clearly improves for smaller values of χ, i.e., when χ < 1. We also note that for χ = 0, our algorithm solves the standard stochastic optimization problem of maximizing the average-case term, and achieves the standard regret bound (as in (srinivas2009gaussian)), which is known to be nearly optimal for various kernels (see (scarlett2017lower)).

4 Experiments

In this section, we evaluate the performance of GP-MRO on synthetic benchmarks and demonstrate the applicability of GP-MRO in planning safe trajectories for autonomous vehicles guided by user’s preferences.

4.1 Synthetic Experiments

For a function f, we compute the performance of a mixed strategy P^(T) as:

  min_{θ∈Θ} E_{x∼P^(T)}[f(x,θ)].   (17)

In case the strategy is deterministic, i.e., a single point x̄, the performance is computed by considering the Dirac distribution centered at x̄. We compare the performance of GP-MRO with the following baselines:

• StableOpt (bogunovic2018adversarially) searches for the deterministic max-min point.

• GP-UCB (srinivas2009gaussian) seeks a non-robust global optimum, selecting the maximizer of the upper confidence bound at every round t. After T iterations, we consider its reported maximizer to be the returned point.

• RandMaxMin selects the point reported by StableOpt or GP-UCB with equal probability at every round, and returns a uniform distribution over these points.

We use the same fixed confidence parameter β_t for each of the above algorithms (we found the theoretical choice to be overly conservative, as also noted in previous works (srinivas2009gaussian; bogunovic2018adversarially)), while η_T is set according to Theorem 2. As an idealized benchmark, we also test against (chen2017robust, Algorithm 1) (which we name after the authors' surnames as CLSS), which assumes oracle access to f and thus upper-bounds the achievable performance.

In the first experiment, we fix finite sets X and Θ and sample a random function f from a GP with a known kernel. Moreover, we run the different baselines with the true prior and the true noise standard deviation.

In Figure 0(a), we show f as well as the strategies returned by StableOpt and GP-MRO after T iterations. StableOpt converges to the max-min point of f, while the distribution returned by GP-MRO spreads most of its probability mass over a few other points. As shown in Figure 0(b), this leads to higher performance compared to all the considered baselines.

Next, we consider the synthetic function from (bertsimas2010robust) and the robust optimization task from (bogunovic2018adversarially). The goal is to select points that maximize the function subject to the worst-case perturbation of the selected point. We map this problem to our setting by treating the perturbation as the uncertain parameter θ. The decision space X consists of a uniformly spaced grid of points, while the set Θ of perturbations is obtained by drawing random points from the unit ball centered at the origin.

We fix the noise standard deviation and run all the algorithms using a Matérn kernel for T iterations (kernel hyperparameters are found via the maximum-likelihood method). In Figure 1(a), we plot the function as well as the supports of the strategies returned by StableOpt (in black) and GP-MRO (in cyan). For GP-MRO, we plot only points selected with non-negligible probability mass. StableOpt is able to discover the max-min point, while GP-MRO randomizes between points in the max-min region and points close to the global optimum. This leads to higher performance compared to the other baselines, as shown in Figure 1(b).

4.2 Human-assisted trajectory planning for autonomous vehicles

We study the problem of planning safe trajectories for an Autonomous Vehicle (AV) driving on roads shared with human-driven vehicles (HVs). We consider the situation depicted in Figure 2(c), where the AV (in yellow) is approaching an HV (in red) driving at a constant speed. The intentions of the HV are uncertain, and this should be taken into account when planning the AV's trajectory.

In the context of autonomous driving and AV-HV interactions, deterministic strategies would make AVs' actions predictable, hence giving a significant advantage to HVs. We observe this fact in our simulations, where such strategies tend to be overly conservative and prevent the AV from completing the overtake manoeuvre. Similarly, we expect this to occur in many other challenging scenarios such as intersections (liu2018intersection), or when merging into dense lanes (bouton2019merging). Instead, we model this problem according to Section 2 and seek robust mixed strategies for the AV. This is in contrast with previous works (e.g., (fisac2018hierarchical; sadigh2016planning)) where deterministic strategies are found, assuming a specific behavioral model for the HV.

Further on, our goal is to plan trajectories for the AV which best reflect typical human driving preferences (e.g., driving styles, security measures, and safe behaviors that the AV should follow). For instance, in the specific situation of Figure 2(c), a good trajectory for the AV should depend on the importance that humans give to overtaking rather than braking behind the HV. We encode such driving preferences with an unknown scoring function. We assume we can learn such a function from sequential evaluations obtained by interacting with a user who assists our planning phase.

Computing such mixed strategies requires non-negligible computation and relies on sequential interactions with the user. Hence, after illustrating our approach, we propose an offline scheme to pre-compute a control policy for the AV using GP-MRO.

Decision sets. A strategy for the AV consists of selecting a steering angle and an acceleration, both assumed to be constant over the planning horizon. Hence, we let X be the set of such (steering angle, acceleration) pairs. Similarly, we assume the HV travels at a constant speed and can choose a steering angle θ. We discretize both X and Θ using uniform grids. Car trajectories (depicted in Figure 2(c)) are computed using the commonly used discrete-time bicycle model (polack2017bycicle).

Optimization goal. We let the scoring function f reflect the humans' driving preferences for the AV. As discussed later, f(x, θ) measures how rewarding it is for the AV to select a possible x when the HV decides to steer with angle θ. Our goal is to compute a robust mixed strategy which solves the problem in (1). More generally, according to Section 3.1.1, we can incorporate priors Q on the HV's behavior and find strategies that trade off worst-case and average-case performance via the trade-off parameter χ.

Scoring function. We assume that f is initially unknown but can be learned by iteratively querying the user. Querying f at a given point (x, θ) consists of: 1) forward-simulating the AV's and HV's trajectories corresponding to x and θ, and 2) presenting the outcome of such simulation to the user, who assigns a score to the considered trajectories. In this experiment, we assume such score is determined by a feature vector that can be extracted from the simulated trajectories. Such vector consists of: the longitudinal distance travelled by the AV, the AV's maximum absolute lateral position, and the minimum distance between the AV and the human-driven car. We use a model of the unknown f given by a weighted combination of these features, where the first feature rewards progress, the second penalizes exiting the road limits, and the third penalizes the AV if it gets too close to the human-driven car and therefore needs to activate emergency braking. In future work, we plan to replace our model and test our approach with scores coming from real users.


Closed-loop simulation of the AV (yellow) and human-driven car (red). At every iteration, the AV implements (a) the randomized policy found by GP-MRO or (b) the deterministic max-min strategy. The human-driven car follows the noisy rational Boltzmann policy (18). The robust deterministic strategies are overly conservative, while the GP-MRO policy allows the AV to safely overtake.

4.3 Illustration of the mixed strategies computed by GP-MRO

We consider the configuration in Figure 2(c) and compute a mixed strategy P^(T) for the AV by running GP-MRO for T iterations. To learn f, we fit a GP with a kernel function defined over the feature vector computed as explained above. In Figure 2(c), we depict (in blue) the support of the mixed strategy, where the color intensity of a trajectory is proportional to its probability. Additionally, we show (in dotted light-green) the trajectory corresponding to the robust deterministic max-min strategy. The mixed strategy randomizes between an overtake from the left or the right side. Instead, the deterministic strategy amounts to braking and thus never overtaking.

Our next goal is to find a strategy for the AV which can trade off worst-case with average-case performance. Let us assume that, with some known probability, the HV does not realize the presence of the AV and thus has no intention to steer. In this case, we can seek the optimal strategy for the AV by choosing χ accordingly and letting Q be a Dirac distribution corresponding to the HV proceeding straight. In Figure 2(c) we depict the strategy returned by GP-MRO, together with the corresponding deterministic trajectory. In this case, the mixed strategy favors an overtake from the right, while the deterministic strategy still leads to no overtaking.

4.4 Closed-loop simulations

We propose the following offline procedure to pre-compute a control policy for the AV. We consider a finite set S of possible scenarios, each describing the initial and relative positions and velocities of the two cars. We compute a mixed strategy for each scenario in S using GP-MRO. Moreover, to make our approach more tractable, we query f at chosen points (Line 5 in Algorithm 1) only if the posterior uncertainty at those points is sufficiently large. By doing so, we end up with a policy mapping scenarios to distributions after a total number of 136 queries of the unknown function.

We evaluate the policy online, in a receding-horizon fashion: starting from given initial positions and velocities, at a fixed replanning interval we map the cars' positions and velocities to the closest scenario in S (using a nearest-neighbour tree-based algorithm) and let the AV sample its trajectory from the corresponding mixed strategy. For the behavior of the HV, we implement a noisy rational Boltzmann policy (as in (fisac2018hierarchical)) where, in a given scenario s, θ_i is sampled with probability

  P[θ = θ_i | s] ∝ exp( E_{x∼U^(T)(s)}[f_H(θ_i, x)] ).   (18)

The function f_H rewards progress for the HV and penalizes exiting the road or getting too close to the AV, in the same way as f does for the AV.
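The noisy rational model (18) is a softmax over the HV's expected utilities; a small sketch, where the `utilities` array is a hypothetical stand-in for E_{x∼U^(T)(s)}[f_H(θ_i, x)]:

```python
import numpy as np

def boltzmann(utilities, rng=None):
    """Sample index i with probability proportional to exp(u_i), per (18)."""
    u = np.asarray(utilities, dtype=float)
    p = np.exp(u - u.max())      # subtract the max for numerical stability
    p = p / p.sum()
    idx = (rng or np.random.default_rng()).choice(len(p), p=p)
    return int(idx), p
```

Higher-utility steering angles are sampled more often, but every angle retains nonzero probability, which is what makes the modeled HV "noisily" rational rather than deterministic.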

In Figure 4.2, we plot several snapshots of a closed-loop simulation where the AV samples trajectories from the pre-computed policy (a), and where the AV chooses the max-min strategy at every iteration (b). As can be seen from Figure 4.2, the proposed approach allows the AV to safely overtake, while the robust deterministic strategy is too conservative and forces the AV to brake behind the HV. We repeat the closed-loop simulation multiple times (for fixed initial positions and velocities of the two cars). As reported in Table 1, the deterministic strategy never overtakes, while with the pre-computed randomized policy the AV successfully overtakes the human-driven car in most runs (in the remaining runs it brakes behind the HV), reaching a larger average final longitudinal position.

5 Conclusion

We have studied a robust optimization problem in which the objective function is unknown and depends on an uncertain parameter. For this problem, we have proposed a novel sample-efficient algorithm GP-MRO, which can discover a near-optimal randomized and robust strategy. We have established rigorous theoretical guarantees and designed a variant of GP-MRO that effectively trades off worst-case and average-case performance. In synthetic experiments and trajectory planning tasks, we have shown that our proposed algorithm significantly outperforms existing baselines.

Acknowledgments

This work was gratefully supported by the Swiss National Science Foundation, under the grant SNSF _, by the European Union’s ERC grant , and ETH Zürich Postdoctoral Fellowship 19-2 FEL-47.

References

Supplementary Material

Mixed Strategies for Robust Optimization of Unknown Objectives
Pier Giuseppe Sessa, Ilija Bogunovic, Maryam Kamgarpour, Andreas Krause (AISTATS 2020)

Appendix A Proof of Theorem 2

Proof.

In this proof, we condition on the event in Lemma 1 holding true, meaning that μ_{t−1} and σ_{t−1} provide valid confidence bounds as per (13). As stated in the lemma, this holds with probability at least 1 − δ.

Our main goal in this proof is to upper bound the difference:

  max_{P∈Δ(X)} min_{θ∈Θ} E_{x∼P}[f(x,θ)] − min_{θ∈Θ} (1/T) Σ_{t=1}^T f(x_t,θ).   (19)

To do so, we provide upper and lower bounds of the first and second terms, respectively, and then we upper bound their difference.

First, we show that the following holds:

  min_{θ∈Θ} (1/T) Σ_{t=1}^T f(x_t,θ) ≥ ( min_{θ∈Θ} (1/T) Σ_{t=1}^T ucb̄_{t−1}(x_t,θ) ) − 4β_T √(λγ_T/T),   (20)

where x_t is the point queried at time t.

To prove Eq. (20), we use the lower confidence bound and (14):

  min_{θ∈Θ} (1/T) Σ_{t=1}^T f(x_t,θ)
    ≥ min_{θ∈Θ} (1/T) Σ_{t=1}^T lcb_{t−1}(x_t,θ)   (21)
    = min_{θ∈Θ} (1/T) Σ_{t=1}^T [ ucb_{t−1}(x_t,θ) − 2β_t σ_{t−1}(x_t,θ) ]   (22)
    ≥ ( min_{θ∈Θ} (1/T) Σ_{t=1}^T ucb̄_{t−1}(x_t,θ) ) − max_{θ∈Θ} (1/T) Σ_{t=1}^T 2β_t σ_{t−1}(x_t,θ)   (23)
    ≥ ( min_{θ∈Θ} (1/T) Σ_{t=1}^T ucb̄_{t−1}(x_t,θ) ) − (2β_T/T) Σ_{t=1}^T max_{θ∈Θ} σ_{t−1}(x_t,θ)   (24)
    = ( min_{θ∈Θ} (1/T) Σ_{t=1}^T ucb̄_{t−1}(x_t,θ) ) − (2β_T/T) Σ_{t=1}^T σ_{t−1}(x_t,θ_t)   (25)
    ≥ ( min_{θ∈Θ} (1/T) Σ_{t=1}^T ucb̄_{t−1}(x_t,θ) ) − 4β_T √(γ_Tλ/T),   (26)

where (21) follows from (13), (22) follows from the definition of the confidence bounds in (5) and (6), (23) uses ucb_{t−1} ≥ ucb̄_{t−1}, (24) is due to the monotonicity of β_t, and (25) is by rule (10) used in Algorithm 1 to select θ_t. Finally, (26) is obtained via the standard result from (srinivas2009gaussian; chowdhury17kernelized)

  Σ_{t=1}^T σ_{t−1}(x_t,θ_t) ≤ √(4Tλγ_T),   (27)

which holds when λ ≥ 1.
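The bound (27) is deterministic, so it can be sanity-checked numerically: with k ≤ 1 and λ ≥ 1 it holds for any query sequence, with γ_T replaced by the information gain of the actually selected set (which lower-bounds the maximum in (11)). A sketch with an SE kernel and random queries:

```python
import numpy as np

rng = np.random.default_rng(1)
lam, T = 1.0, 40
kern = lambda a, b: np.exp(-0.5 * (a - b) ** 2)   # SE kernel, k(z, z) = 1

Z, lhs = [], 0.0                                  # queried points; running sum of sigmas
for t in range(T):
    z = rng.uniform(0.0, 5.0)
    if Z:
        K = np.array([[kern(a, b) for b in Z] for a in Z])
        kt = np.array([kern(z, a) for a in Z])
        var = 1.0 - kt @ np.linalg.solve(K + lam * np.eye(len(Z)), kt)
    else:
        var = 1.0                                 # prior variance
    lhs += np.sqrt(max(var, 0.0))
    Z.append(z)

# Information gain of the selected set: (1/2) log det(I + lam^{-1} K_T).
K = np.array([[kern(a, b) for b in Z] for a in Z])
gamma_hat = 0.5 * np.linalg.slogdet(np.eye(T) + K / lam)[1]
rhs = np.sqrt(4 * T * lam * gamma_hat)
```

The underlying argument is the usual one: σ² ≤ 2λ log(1 + λ^{−1}σ²) whenever λ^{−1}σ² ≤ 1, the logs telescope into the log-determinant, and Cauchy-Schwarz converts the sum of variances into the stated √T-scaled bound.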

Next, we show that the first term can be upper bounded as follows:

  max_{P∈Δ(X)} min_{θ∈Θ} E_{x∼P}[f(x,θ)] ≤ (1/T) Σ_{t=1}^T E_{θ∼w_t}[ucb̄_{t−1}(x_t,θ)].

To prove this, we start by upper bounding the minimum value of the inner objective:

  max_{P∈Δ(X)} min_{θ∈Θ} E_{x∼P}[f(x,θ)]
    ≤ max_{P∈Δ(X)} (1/T) Σ_{t=1}^T Σ_{i=1}^m w_t[i]·E_{x∼P}[f(x,θ_i)]   (28)
    ≤ max_{P∈Δ(X)} (1/T) Σ_{t=1}^T Σ_{i=1}^m w_t[i]·E_{x∼P}[ucb̄_{t−1}(x,θ_i)]   (29)
    = max_{P∈Δ(X)} (1/T) Σ_{t=1}^T E_{x∼P}[ Σ_{i=1}^m w_t[i]·ucb̄_{t−1}(x,θ_i) ]   (30)
    ≤ (1/T) Σ_{t=1}^T max_{P∈Δ(X)} E_{x∼P}[ Σ_{i=1}^m w_t[i]·ucb̄_{t−1}(x,θ_i) ]   (31)
    = (1/T) Σ_{t=1}^T max_{x∈X} Σ_{i=1}^m w_t[i]·ucb̄_{t−1}(x,θ_i)   (32)
    = (1/T) Σ_{t=1}^T Σ_{i=1}^m w_t[i]·ucb̄_{t−1}(x_t,θ_i).   (33)

We obtain Eq. (28) since the following trivially holds

  min_{θ∈Θ} E_{x∼P}[f(x,θ)] ≤ Σ_{i=1}^m w_t[i]·E_{x∼P}[f(x,θ_i)]

for each P ∈ Δ(X) and each t, and hence it also holds for the average value

  min_{θ∈Θ} E_{x∼P}[f(x,θ)] ≤ (1/T) Σ_{t=1}^T Σ_{i=1}^m w_t[i]·E_{x∼P}[f(x,θ_i)].

Eq. (29) follows from (14), (30) follows by the linearity of expectation, (31) holds since the maximum of an average is upper bounded by the average of the maxima, and (32) holds since the Dirac delta δ_x, for any x ∈ X, is in Δ(X). Finally, (33) follows by rule (9) used in Algorithm 1 to select x_t.

Next, we bound the difference in (19) by combining the bounds obtained in (26) and (33):

  max_{P∈Δ(X)} min_{θ∈Θ} E_{x∼P}[f(x,θ)] − min_{θ∈Θ} (1/T) Σ_{t=1}^T f(x_t,θ)
    ≤ (1/T) Σ_{t=1}^T E_{θ∼w_t}[ucb̄_{t−1}(x_t,θ)] − ( min_{θ∈Θ} (1/T) Σ_{t=1}^T ucb̄_{t−1}(x_t,θ) ) + 4β_T √(γ_Tλ/T)
    ≤ √(log(m)/(2T)) + 4β_T √(γ_Tλ/T),   (34)

where (34) follows by the guarantees of the no-regret online multiplicative weight updates algorithm played by the adversary, that is,

  (1/T) Σ_{t=1}^T E_{θ∼w_t}[ucb̄_{t−1}(x_t,θ)] − ( min_{θ∈Θ} (1/T) Σ_{t=1}^T ucb̄_{t−1}(x_t,θ) ) ≤ √(log(m)/(2T)),   (35)

with the learning rate set to η_T = √(8 log(m)/T). For more details on this result, see (cesa-bianchi_prediction_2006, Section 4.2), where the same online algorithm is considered. Specifically, the result above follows from (cesa-bianchi_prediction_2006, Theorem 2.2) by noting that ucb̄_{t−1}(x_t, θ_i) ∈ [0,1] for every t and i. In our case, the objective function changes with t but remains bounded, which allows the result to hold despite the changes (see the time-varying games extension in (cesa-bianchi_prediction_2006, Remark 7.3)).
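The regret guarantee (35) is non-asymptotic, so it can be verified directly: run the exponentially weighted forecaster with η_T = √(8 log(m)/T) on arbitrary [0,1]-bounded losses and compare the average regret against √(log(m)/(2T)). A sketch with random losses standing in for the ucb̄ values:

```python
import numpy as np

rng = np.random.default_rng(2)
m, T = 5, 200
eta = np.sqrt(8 * np.log(m) / T)             # learning rate from the proof
losses = rng.uniform(0.0, 1.0, size=(T, m))  # arbitrary losses in [0, 1]

w = np.ones(m) / m
cum = 0.0
for t in range(T):
    cum += w @ losses[t]                     # expected loss under w_t
    w = w * np.exp(-eta * losses[t]); w = w / w.sum()

avg_regret = cum / T - losses.mean(axis=0).min()
bound = np.sqrt(np.log(m) / (2 * T))
```

With typical random losses the realized regret sits well inside the bound; the theorem guarantees it can never exceed it.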

By rearranging (34) and letting U^(T) be the uniform distribution over the points x_1, …, x_T queried during the run of Algorithm 1, we obtain:

  min_{θ∈Θ} E_{x∼U^(T)}[f(x,θ)] ≥ max_{P∈Δ(X)} min_{θ∈Θ} E_{x∼P}[f(x,θ)] − √(log(m)/(2T)) − 4β_T √(γ_Tλ/T).

Finally, we require √(log(m)/(2T)) + 4β_T √(γ_Tλ/T) ≤ ε, which we obtain when

  T ≥ (1/ε²)·( log(m)/2 + β_T √(32λγ_T log(m)) + 16β_T² λγ_T ).
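For completeness, the algebra behind this final condition is a routine expansion. Writing a = √(log(m)/2) and b = 4β_T √(λγ_T), the requirement reads

```latex
\frac{a+b}{\sqrt{T}} \le \epsilon
\;\Longleftrightarrow\;
T \ge \frac{(a+b)^2}{\epsilon^2}
  = \frac{1}{\epsilon^2}\left(\frac{\log m}{2}
    + \beta_T\sqrt{32\lambda\gamma_T\log m}
    + 16\beta_T^2\lambda\gamma_T\right),
```

since the cross term is 2ab = 8β_T √(λγ_T · log(m)/2) = β_T √(32λγ_T log(m)).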

Appendix B Proof of Corollary 3

Proof.

The proof closely follows that of Theorem 2. The main changes are due to the modified best-response rule (16).

For a given distribution Q ∈ Δ(Θ) and trade-off parameter χ ∈ [0,1], we define the new function

  g(x,θ) := χ·f(x,θ) + (1−χ)·E_{θ′∼Q}[f(x,θ′)].   (36)

Same as before, our goal is to upper bound the difference:

  max_{P∈Δ(X)} min_{θ∈Θ} E_{x∼P}[g(x,θ)] − min_{θ∈Θ} (1/T) Σ_{t=1}^T g(x_t,θ),   (37)

where x_t is the point selected at time t by GP-MRO using the modified best-response rule (16).

Next, we condition on the event in Lemma 1 holding true, and we provide upper and lower bounds of the first and second term, respectively.

First, we show that the second term of (37) can be lower bounded as: