Abstract
We consider robust optimization problems, where the goal is to optimize an unknown objective function against the worst-case realization of an uncertain parameter. For this setting, we design a novel sample-efficient algorithm GP-MRO, which sequentially learns about the unknown objective from noisy point evaluations. GP-MRO seeks to discover a robust and randomized mixed strategy that maximizes the worst-case expected objective value. To achieve this, it combines techniques from online learning with nonparametric confidence bounds from Gaussian processes. Our theoretical results characterize the number of samples required by GP-MRO to discover a robust near-optimal mixed strategy for different GP kernels of interest. We experimentally demonstrate the performance of our algorithm on synthetic datasets and on human-assisted trajectory planning tasks for autonomous vehicles. In our simulations, we show that robust deterministic strategies can be overly conservative, while the mixed strategies found by GP-MRO significantly improve the overall performance.
1 Introduction
Many real-world problems require making decisions under uncertainty. The latter can manifest itself in the form of uncertain parameters, perturbations, or an adversary that can corrupt the decision (bertsimas2011theory). In such problems, one often seeks to optimize an objective function while being robust to the worst possible uncertainty realization. This can be achieved by phrasing such problems in the framework of Robust Optimization (RO) (ben2009robust). RO has found successful applications in various domains including supply chain management (bertsimas2004supplychain), portfolio optimization (bental2000portfolio), influence maximization (he2016robust), and robotics (bojorgensen2018), to name a few.
In various practical problems, however, the objective function to be optimized is a priori unknown, and one can only learn about it from sequential and noisy point evaluations. Gaussian process (GP) optimization is an established framework for model-based sequential optimization of such unknown functions (srinivas2009gaussian). An array of algorithms that use Bayesian nonparametric GP models (rasmussen2006gaussian), and balance exploration (learning the function globally) and exploitation (maximizing the function), have been developed over the years, e.g., (srinivas2009gaussian; bogunovic2016truncated; chowdhury17kernelized; wang2017max; frazier2018).
In this paper, we study the robust optimization problem where (i) the objective function is unknown and (ii) the goal is to be robust against the worst possible realization of its uncertain parameter. This problem differs from the classical RO formulation, where the objective function is assumed to be known, and is also different from standard GP optimization, where a robustness requirement is typically not pursued.
Instead of finding a robust deterministic solution to this problem (as in (bogunovic2018adversarially)), we seek to discover a randomized, i.e., mixed, strategy from a relatively small number of noisy function evaluations. The primary motivation for seeking such strategies is that, in general, they can provide arbitrarily better worst-case expected performance than deterministic ones (krause2011robust; vorobeychik2014; sinha2018security), i.e., randomization prevents a potential adversary from knowing the actual decision until it is realized. Consequently, we design and use a novel GP-based sample-efficient algorithm to discover near-optimal mixed strategies. We empirically demonstrate the effectiveness of the identified robust mixed strategies in a trajectory planning task for autonomous vehicles, where deterministic strategies are shown to be overly conservative.
Related work. Over the past couple of years, robust optimization has been extensively studied in the machine learning community. While most of the works focus on convex settings (e.g., (shalev2016; namkoong2016)), more recent works also consider general non-convex objectives, e.g., (chen2017robust; sinha2017certifying; staib2018distributionally). Among those, chen2017robust provide robust algorithmic strategies that are shown to be successful in several learning tasks. The proposed algorithm is based on the idea of simulating a zero-sum game between a learner and an adversary. Similar strategies have also been explored in other adversarial settings, e.g., in submodular optimization (krause2011robust; kawase2019). Our approach builds on the algorithmic idea of chen2017robust, but unlike this and the other works mentioned above, which assume that the objective function is perfectly known (or that a maximization oracle is available), it also requires performing a non-trivial function estimation.
In non-robust GP optimization, various optimization algorithms (srinivas2009gaussian; chowdhury17kernelized; bogunovic2016truncated; contal2013parallel; wang2017max) have been proposed to sequentially optimize the unknown function from noisy and zeroth-order observations. Similarly to these algorithms, our algorithm relies on a nonparametric GP model to obtain shrinking confidence bounds around the unknown objective function. Besides the standard problem, GP optimization has been considered in several other practical settings such as contextual (krause2011contextual), time-varying (bogunovic2016time), safe exploration (sui2015safe), etc.
Recently, a novel algorithm for robust GP optimization, StableOpt, has been proposed by bogunovic2018adversarially. StableOpt discovers a deterministic solution that is robust with respect to the worst-case realization of the uncertain parameter. This work is closest to ours, but instead of seeking deterministic solutions, our focus is on mixed strategies, which are preferable in certain scenarios (see Section 4.2) where deterministic solutions turn out to be overly conservative. We also note that other forms of robustness have been studied in GP optimization. For instance, nogueira2016unscented; oliveira2019 consider robustness against uncertain inputs (typical in robotics applications), sessa2019noregret study robust aspects in multi-agent unknown repeated games, williams2000; tesch2011 deal with uncontrolled environmental variables, while robustness with respect to outliers is addressed by martinez2018practical.
Contributions. We consider robust optimization of unknown and generally non-convex objectives.

We propose an algorithm, GP-MRO, which returns a mixed strategy, i.e., a probability distribution over actions, that is robust against the worst-case realization of the uncertain parameter.

Our theoretical analysis shows the number of samples required for GP-MRO to discover a near-optimal robust mixed strategy.

We propose a variant of GP-MRO which can effectively trade off worst-case and average-case performance.

Finally, we consider the problem of trajectory planning in autonomous driving guided by a user's evaluations. In our experiments, we demonstrate the effectiveness of the robust mixed strategies discovered by GP-MRO in comparison to those identified by existing robust methods.
2 Problem Formulation
Let $f : \mathcal{X} \times \Delta \to \mathbb{R}$ be a reward function over the domain $\mathcal{X} \times \Delta$, where $\mathcal{X}$ is a continuous and compact decision set and $\Delta$ is a finite set of parameter values. The reward function is unknown, and we learn about it from sequential noisy point observations, i.e., so-called bandit feedback. At each time step $t$, we choose $x_t \in \mathcal{X}$ and $\delta_t \in \Delta$, and observe a noisy sample $y_t = f(x_t, \delta_t) + \xi_t$, where $\xi_t \sim \mathcal{N}(0, \sigma^2)$, and the $\xi_t$'s are independent over time (our approach also allows for sub-Gaussian noise).
After $T$ rounds (i.e., $T$ samples), our goal is to report a strategy for selecting points in $\mathcal{X}$ that is robust against the worst-possible parameter value from $\Delta$. We assume that during the optimization phase (i.e., training/simulation) one can choose $\delta_t$, while later, during the implementation (i.e., test) phase, the parameter becomes uncontrollable. Hence, it is important to design a robust strategy for selecting the first argument $x$.
Optimization goal. Let $\Pi$ denote the set of all probability distributions, or mixed strategies, on $\mathcal{X}$. Our goal is to find a distribution in $\Pi$ that achieves high reward in the worst case over $\Delta$. The maximin optimal value is given by:
$$\mathrm{OPT} \;=\; \max_{\pi \in \Pi}\, \min_{\delta \in \Delta}\, \mathbb{E}_{x \sim \pi}\big[f(x,\delta)\big], \qquad (1)$$
and we seek to report a robust solution $\bar{\pi}$ that for some specified accuracy value $\epsilon > 0$ achieves
$$\min_{\delta \in \Delta}\, \mathbb{E}_{x \sim \bar{\pi}}\big[f(x,\delta)\big] \;\ge\; \mathrm{OPT} - \epsilon. \qquad (2)$$
Besides achieving (2), our goal is also to minimize the total number of required samples $T$.
We note that our optimization goal differs from computing a deterministic (pure-strategy) solution and competing against $\max_{x \in \mathcal{X}} \min_{\delta \in \Delta} f(x,\delta)$, as considered in (bogunovic2018adversarially). Our goal is to discover a randomized strategy and compete against $\mathrm{OPT}$, which can be arbitrarily larger than the deterministic benchmark. Hence, the mixed strategies considered in this work can provide arbitrarily better expected performance than such deterministic ones. Conceptually, randomization makes the decisions less predictable, a key feature necessary in many applications including security games (sinha2018security), adversarial learning (vorobeychik2014), and sensing (krause2011robust). This is also the case in the autonomous driving scenario considered in Section 4.2, where we show that deterministic strategies can be overly conservative. Finally, we also note that the same objective (1) is considered in (chen2017robust) in the case of known reward functions.
Our Model. We assume that the unknown objective $f$ is fixed and belongs to a Reproducing Kernel Hilbert Space (RKHS) corresponding to a positive semi-definite kernel function $k$. Furthermore, we require $f$ to have a bounded RKHS norm, i.e., $\|f\|_k \le B$, where $\|\cdot\|_k$ stands for the RKHS norm and $B$ is a known positive constant. The RKHS norm represents a measure of smoothness of $f$ as measured by the corresponding kernel. We note that these are standard assumptions in GP optimization (see, e.g., (srinivas2009gaussian; chowdhury17kernelized; bogunovic2018adversarially)).
For the kernel function, we assume $k(z, z') \le 1$ for all inputs $z, z'$, which is without loss of generality if appropriate rescaling is applied. Our setup also allows for composite kernels that can be constructed from individual kernels $k_1$ (over $\mathcal{X}$) and $k_2$ (over $\Delta$), to obtain, for example, the additive kernel $k_1 + k_2$ or the product kernel $k_1 \cdot k_2$. Popularly used kernels are the linear, squared exponential (SE), and Matérn kernels:
$$k_{\mathrm{lin}}(z,z') = z^\top z', \quad k_{\mathrm{SE}}(z,z') = \exp\Big(\!-\frac{\|z-z'\|^2}{2 l^2}\Big), \quad k_{\mathrm{Mat}}(z,z') = \frac{2^{1-\nu}}{\Gamma(\nu)} \Big(\frac{\sqrt{2\nu}\,\|z-z'\|}{l}\Big)^{\nu} B_\nu\Big(\frac{\sqrt{2\nu}\,\|z-z'\|}{l}\Big),$$
where $l > 0$ is the lengthscale parameter, $\nu > 0$ is a parameter that determines the smoothness, and $B_\nu$ is the modified Bessel function (rasmussen2006gaussian).
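To make the kernel choices concrete, here is a minimal Python/NumPy sketch of the SE and Matérn kernels and of a composite product kernel. The $\nu = 5/2$ closed form and the default lengthscale are illustrative choices, not values used in the paper:

```python
import numpy as np

def se_kernel(z1, z2, lengthscale=1.0):
    """Squared exponential kernel; note k(z, z) = 1, matching the
    normalization k <= 1 assumed in the text."""
    d = np.linalg.norm(np.atleast_1d(z1) - np.atleast_1d(z2))
    return np.exp(-d ** 2 / (2 * lengthscale ** 2))

def matern52_kernel(z1, z2, lengthscale=1.0):
    """Matern kernel for the common smoothness choice nu = 5/2, where the
    general Bessel-function form simplifies to a closed expression."""
    r = np.sqrt(5) * np.linalg.norm(np.atleast_1d(z1) - np.atleast_1d(z2)) / lengthscale
    return (1 + r + r ** 2 / 3) * np.exp(-r)

def product_kernel(k1, k2):
    """Composite product kernel over pairs z = (x, delta)."""
    return lambda z1, z2: k1(z1[0], z2[0]) * k2(z1[1], z2[1])
```

Both base kernels equal one at identical inputs and decay with distance, which is the normalization the analysis relies on.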
Under these assumptions, the uncertainty over $f$ is naturally modeled as a Gaussian process $GP(0, k(\cdot,\cdot))$. Further on, a Gaussian likelihood model for the observations can be used, assuming the noise is drawn, independently across $t$, from $\mathcal{N}(0, \lambda)$. Here, $\lambda$ denotes a free hyperparameter that may differ from the true noise variance $\sigma^2$. With this model in place, conditioned on the history of inputs $z_1, \dots, z_t$ (with $z_i = (x_i, \delta_i)$) and their noisy observations $y_1, \dots, y_t$, the posterior distribution under this prior is also Gaussian, with closed-form posterior mean and variance:
$$\mu_t(z) = \mathbf{k}_t(z)^\top \big(K_t + \lambda I_t\big)^{-1} \mathbf{y}_t, \qquad (3)$$
$$\sigma_t^2(z) = k(z,z) - \mathbf{k}_t(z)^\top \big(K_t + \lambda I_t\big)^{-1} \mathbf{k}_t(z), \qquad (4)$$
where $\mathbf{k}_t(z) = [k(z, z_1), \dots, k(z, z_t)]^\top$, $\mathbf{y}_t = [y_1, \dots, y_t]^\top$, and $K_t = [k(z_i, z_j)]_{i,j \le t}$ is the kernel matrix. As described below, we make use of this model in our algorithm to sequentially learn about the unknown objective function.
3 Proposed Algorithm and Theory
Our algorithm, GP-MRO, is shown in Algorithm 1. It can be interpreted as a zero-sum game between a simulated adversary and a learner. The adversary plays actions from the set $\Delta$, while the learner plays actions from $\mathcal{X}$. Because the true reward function is unknown, the algorithm maintains and makes use of an optimistic upper confidence bound (defined below) on the unknown reward function. We define the confidence bounds as follows:
$$\mathrm{ucb}_t(x,\delta) = \mu_{t-1}(x,\delta) + \beta_t\, \sigma_{t-1}(x,\delta), \qquad (5)$$
$$\mathrm{lcb}_t(x,\delta) = \mu_{t-1}(x,\delta) - \beta_t\, \sigma_{t-1}(x,\delta), \qquad (6)$$
where $\beta_t > 0$ is the confidence parameter that we set according to Lemma 1 below. We also define their truncated versions:
$$\overline{\mathrm{ucb}}_t(x,\delta) = \min\big\{\mathrm{ucb}_t(x,\delta),\, B\big\}, \qquad (7)$$
$$\overline{\mathrm{lcb}}_t(x,\delta) = \max\big\{\mathrm{lcb}_t(x,\delta),\, -B\big\}, \qquad (8)$$
which we use in our algorithm. At every round $t$, GP-MRO simulates the adversary by selecting a distribution $w_t$ over the values of $\Delta$, i.e., $w_t \in [0,1]^{|\Delta|}$ with $\sum_{\delta \in \Delta} w_t[\delta] = 1$, where $w_t[\delta]$ denotes the probability of selecting $\delta$. Subsequently, the learner best responds by selecting $x_t \in \mathcal{X}$ based on the knowledge of $w_t$. After $T$ iterations, GP-MRO returns the uniform distribution over $\{x_1, \dots, x_T\}$, denoted by $\bar{\pi}_T$. Next, we explain how $w_t$ and $x_t$ are chosen in Algorithm 1.
The multiplicative weight updates (MWU) rule (freund1997) is used to select $w_t$ at every round $t$. We note that this is a no-regret online learning algorithm that requires full-information feedback at every round, i.e., observations corresponding to every pair $(x_t, \delta)$, $\delta \in \Delta$. This is not possible in our setting, where the learner only receives a single noisy observation corresponding to the chosen pair $(x_t, \delta_t)$. To cope with this, we make use of the upper confidence bound functions to effectively emulate the full-information feedback.
Concretely, the adversary's distribution is set to
$$w_t[\delta] = \frac{\exp\big(-\eta \sum_{i=1}^{t-1} \overline{\mathrm{ucb}}_i(x_i, \delta)\big)}{\sum_{\delta' \in \Delta} \exp\big(-\eta \sum_{i=1}^{t-1} \overline{\mathrm{ucb}}_i(x_i, \delta')\big)},$$
where $\eta$ is the learning rate parameter that we set in Theorem 2 below. Another equivalent way of writing this rule is via the following recursive update:
$$w_t[\delta] \;\propto\; w_{t-1}[\delta] \cdot \exp\big(-\eta\, \overline{\mathrm{ucb}}_{t-1}(x_{t-1}, \delta)\big).$$
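A single MWU step of the simulated adversary can be sketched as follows (the function name and inputs are illustrative; since the adversary minimizes the learner's reward, parameter values with high upper-confidence reward are downweighted):

```python
import numpy as np

def mwu_step(w, ucb_row, eta):
    """One recursive multiplicative-weights update for the adversary:
    w_new[delta] is proportional to w[delta] * exp(-eta * ucb_row[delta]),
    where ucb_row holds the upper confidence bounds at the learner's
    last action, one entry per parameter value delta."""
    w_new = np.asarray(w, dtype=float) * np.exp(-eta * np.asarray(ucb_row, dtype=float))
    return w_new / w_new.sum()
```

Starting from the uniform distribution and applying this step once per round reproduces the closed-form exponential-weights rule.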
The learner then observes $w_t$, and plays the best response that is obtained by using the upper confidence bound instead of the true unknown function:
$$x_t \in \arg\max_{x \in \mathcal{X}} \sum_{\delta \in \Delta} w_t[\delta]\; \overline{\mathrm{ucb}}_t(x, \delta). \qquad (9)$$
Finally, the unknown function is queried at $(x_t, \delta_t)$, where $\delta_t$ is selected as the parameter value that has the highest uncertainty for the selected $x_t$, i.e.,
$$\delta_t \in \arg\max_{\delta \in \Delta} \sigma_{t-1}(x_t, \delta). \qquad (10)$$
The observed data $(x_t, \delta_t, y_t)$ is then used to update the GP model via (3) and (4).
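To illustrate the overall structure of the loop, here is a self-contained toy version on a finite decision set, with the GP confidence bounds replaced by simple count-based empirical bounds. This is a sketch of the algorithmic skeleton only, not of the actual GP-based implementation; the reward matrix and all constants are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
F = np.array([[1.0, 0.0],          # hypothetical reward f(x, delta):
              [0.0, 1.0]])         # each action is good against one parameter only
n_x, n_d = F.shape
T, eta, noise = 500, 0.2, 0.1

sums, counts = np.zeros((n_x, n_d)), np.zeros((n_x, n_d))
w = np.ones(n_d) / n_d             # adversary's distribution over Delta
picks = []

for t in range(T):
    mean = np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)
    bonus = np.sqrt(2.0 / np.maximum(counts, 1))   # optimism for rarely tried pairs
    ucb = np.minimum(mean + bonus, 1.0)            # truncated upper confidence bound
    x = int(np.argmax(ucb @ w))                    # learner best-responds to w
    picks.append(x)
    d = int(np.argmin(counts[x]))                  # most uncertain parameter for x
    sums[x, d] += F[x, d] + noise * rng.standard_normal()
    counts[x, d] += 1
    w = w * np.exp(-eta * ucb[x])                  # adversary's MWU step
    w = w / w.sum()

pi = np.bincount(picks, minlength=n_x) / T         # returned mixed strategy
worst_case = (pi @ F).min()                        # worst-case expected reward
```

On this matrix, any deterministic action has worst-case reward 0, while the near-uniform mixed strategy recovered by the loop approaches the game value of 0.5, which is the gap between pure and mixed strategies discussed in Section 2.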
3.1 Main result
To characterize our regret bounds, we make use of a suitable measure of complexity of the function class, the so-called maximum information gain. It was introduced by srinivas2009gaussian, and has subsequently been used in many different works on Bayesian (GP) optimization. At time $T$, it is defined as
$$\gamma_T = \max_{A \subseteq \mathcal{X} \times \Delta:\, |A| = T} \; \frac{1}{2} \log\det\big(I_T + \lambda^{-1} K_A\big), \qquad (11)$$
where $K_A$ denotes the kernel matrix over the points in $A$; it measures the maximal reduction in uncertainty about $f$ after receiving $T$ noisy observations. For a $d$-dimensional domain, this kernel-dependent quantity is sublinear in $T$ for various kernel functions, e.g., $\gamma_T = O\big((\log T)^{d+1}\big)$ for the squared exponential kernel and $\gamma_T = O\big(T^{d(d+1)/(2\nu + d(d+1))} \log T\big)$ for the Matérn kernel with $\nu > 1$ (srinivas2009gaussian).
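For intuition, on a finite candidate set the maximum information gain can be lower bounded by greedy selection (a sketch; the greedy rule is near-optimal here because the log-determinant objective is monotone submodular):

```python
import numpy as np

def greedy_info_gain(K, T, lam=1.0):
    """Greedily pick T points (near-)maximizing 0.5 * logdet(I + K_A / lam),
    a standard surrogate for the maximum information gain gamma_T.
    K is the full kernel matrix over all candidate points."""
    chosen, gain = [], 0.0
    for _ in range(T):
        best_i, best_gain = None, -np.inf
        for i in range(K.shape[0]):
            if i in chosen:
                continue
            idx = chosen + [i]
            sub = K[np.ix_(idx, idx)]
            _, logdet = np.linalg.slogdet(np.eye(len(idx)) + sub / lam)
            if 0.5 * logdet > best_gain:
                best_gain, best_i = 0.5 * logdet, i
        chosen.append(best_i)
        gain = best_gain
    return gain
```

With an identity kernel matrix (fully independent points), each of the $T$ selections contributes exactly $\frac{1}{2}\log 2$ (for $\lambda = 1$), so the gain grows linearly; correlated kernels yield the sublinear growth quoted above.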
We use the following well-known result in GP optimization (srinivas2009gaussian; chowdhury17kernelized), which allows for the construction of statistical confidence bounds around the unknown function.
Lemma 1.
(srinivas2009gaussian; chowdhury17kernelized) For a suitable choice of the confidence parameter $\beta_t$ (growing with the norm bound $B$, the noise level, and the maximum information gain $\gamma_{t-1}$), the following event holds with probability at least $1-\rho$: for all $t \ge 1$ and all $(x,\delta) \in \mathcal{X} \times \Delta$,
$$\mathrm{lcb}_t(x,\delta) \;\le\; f(x,\delta) \;\le\; \mathrm{ucb}_t(x,\delta). \qquad (13)$$
Given the definitions (7)–(8), and conditioning on the event (13) in Lemma 1 holding true, we have:
$$\overline{\mathrm{lcb}}_t(x,\delta) \;\le\; f(x,\delta) \;\le\; \overline{\mathrm{ucb}}_t(x,\delta) \qquad (14)$$
for every pair $(x,\delta) \in \mathcal{X} \times \Delta$ and every $t \ge 1$.
Next, we state our main theorem, in which we bound the performance of GP-MRO. All the proofs from this section are provided in the supplementary material.
Theorem 2.
Fix , , , , , and suppose the following holds
for some . For any , such that and , GP-MRO with $\beta_t$ set as in Lemma 1 and achieves
after $T$ rounds with probability at least , where $\bar{\pi}_T$ is the distribution returned by GP-MRO.
Our analysis is based on regret-bounding techniques for zero-sum games, similarly to (chen2017robust) (we bound the rate of convergence to an equilibrium of the game simulated by GP-MRO), but with additional non-trivial challenges in characterizing the excess regret due to the fact that $f$ is unknown. The result in this theorem holds for general kernels, and it can be made more specific by substituting the bounds on $\gamma_T$ for different kernels. For example, for the widely used squared exponential kernel, we obtain , for constant , where is used to hide dimension-independent factors. In the same setting, StableOpt (bogunovic2018adversarially) requires samples to discover a deterministic max-min strategy that is near-optimal with respect to a generally weaker benchmark. Finally, in comparison to the result of chen2017robust, where the reward function is assumed to be known, our bound characterizes the additional number of samples required for estimating the unknown RKHS function.
Trading Off Worst-Case and Average-Case Performance
In many scenarios, one might care about the performance of the reported distribution in the worst case while also ensuring good performance on "average". A natural problem to consider is to trade off these two quantities by using the following objective:
$$\max_{\pi \in \Pi}\; \alpha \min_{\delta \in \Delta} \mathbb{E}_{x \sim \pi}\big[f(x,\delta)\big] \;+\; (1-\alpha)\, \mathbb{E}_{\delta \sim P}\, \mathbb{E}_{x \sim \pi}\big[f(x,\delta)\big],$$
for some fixed distribution $P$ over $\Delta$ (e.g., the uniform distribution) and trade-off parameter $\alpha \in [0,1]$. Note that by setting $\alpha = 1$, we recover the worst-case objective. Hence, our goal is to output $\bar{\pi}_T$ after $T$ rounds, such that for some accuracy $\epsilon > 0$
$$\alpha \min_{\delta \in \Delta} \mathbb{E}_{x \sim \bar{\pi}_T}\big[f(x,\delta)\big] \;+\; (1-\alpha)\, \mathbb{E}_{\delta \sim P}\, \mathbb{E}_{x \sim \bar{\pi}_T}\big[f(x,\delta)\big] \;\ge\; \mathrm{OPT}_\alpha - \epsilon, \qquad (15)$$
where $\mathrm{OPT}_\alpha$ denotes the optimal value of the trade-off objective above.
Extending our algorithm to this case amounts to modifying the best response rule (Line 3 of Algorithm 1) as:
$$x_t \in \arg\max_{x \in \mathcal{X}} \; \sum_{\delta \in \Delta} \big(\alpha\, w_t[\delta] + (1-\alpha)\, P[\delta]\big)\; \overline{\mathrm{ucb}}_t(x,\delta). \qquad (16)$$
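On a finite decision set, the modified best response simply mixes the adversary's distribution with the fixed prior before scoring actions. A small sketch (the matrix `ucb` of upper confidence bounds and all inputs are hypothetical):

```python
import numpy as np

def best_response(ucb, w, prior, alpha):
    """Best response under the trade-off objective: score each action x by
    alpha * E_{delta ~ w}[ucb(x, delta)] + (1 - alpha) * E_{delta ~ prior}[ucb(x, delta)],
    where ucb[i, j] is the bound for action i and parameter j."""
    mix = alpha * np.asarray(w) + (1 - alpha) * np.asarray(prior)
    return int(np.argmax(np.asarray(ucb) @ mix))
```

With `alpha = 1` this reduces to the worst-case rule, and with `alpha = 0` the simulated adversary is ignored entirely, recovering the stochastic-optimization behavior mentioned below.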
The theoretical guarantees of GP-MRO in this setting, with the best-response rule as given in (16), are provided in the following corollary.
Corollary 3.
The proof closely follows that of Theorem 2. When $\alpha = 1$, we recover Theorem 2, while the guarantee clearly improves for smaller values of $\alpha$, i.e., when $\alpha < 1$. We also note that for $\alpha = 0$, our algorithm solves the standard stochastic optimization problem, and achieves the standard regret bound (as in (srinivas2009gaussian)), which is known to be nearly optimal for various kernels (see (scarlett2017lower)).
4 Experiments
In this section, we evaluate the performance of GP-MRO on synthetic benchmarks and demonstrate the applicability of GP-MRO in planning safe trajectories for autonomous vehicles guided by a user's preferences.
4.1 Synthetic Experiments
For a function $f$, we compute the performance of a mixed strategy $\pi$ as:
$$\min_{\delta \in \Delta} \mathbb{E}_{x \sim \pi}\big[f(x,\delta)\big]. \qquad (17)$$
In case the strategy is a deterministic point $x \in \mathcal{X}$, the performance is computed by considering the Dirac distribution centered at $x$. We compare the performance of GP-MRO with the following baselines:

StableOpt (bogunovic2018adversarially) searches for the deterministic max-min point.

GP-UCB (srinivas2009gaussian) seeks a non-robust global optimum and selects the maximizer of the upper confidence bound at every round $t$. After $T$ iterations, we consider the resulting point to be the returned point.

RandMaxMin selects the point reported by StableOpt or GPUCB with equal probability at every round, and returns a uniform distribution over these points.
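On a finite decision set, the performance measure (17) used for all baselines reduces to a minimum over a matrix–vector product. A small sketch with a made-up reward matrix:

```python
import numpy as np

def performance(pi, F):
    """Worst-case expected reward of mixed strategy pi, i.e.
    min over delta of E_{x ~ pi}[f(x, delta)], with F[i, j] = f(x_i, delta_j)."""
    return float((np.asarray(pi) @ np.asarray(F)).min())

def dirac(i, n):
    """Dirac distribution centered at action i, used to score
    deterministic strategies with the same measure."""
    e = np.zeros(n)
    e[i] = 1.0
    return e
```

On the diagonal reward matrix below, the uniform mixture scores 0.5 while every deterministic strategy scores 0, mirroring the gap between GP-MRO's mixed strategies and the deterministic baselines.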
We set for each of the above algorithms (we found the theoretical choice to be overly conservative, as also noted in previous works (srinivas2009gaussian; bogunovic2018adversarially)), while $\eta$ is set according to Theorem 2. As an idealized benchmark, we also test against (chen2017robust, Algorithm 1) (which we name after the authors' surnames as CLSS), which assumes oracle access to the true function and thus upper bounds the achievable performance.
In the first experiment, we let with , and , and sample a random function from a GP with kernel . Moreover, we run the different baselines with the true prior and noise standard deviation .
In Figure 0(a), we show the function as well as the strategies returned by StableOpt and GP-MRO after iterations. StableOpt converges to the max-min point, while the distribution returned by GP-MRO assigns most of the probability mass to two points. As shown in Figure 0(b), this leads to higher performance compared to all the considered baselines.
Next, we consider the synthetic function from (bertsimas2010robust), and the robust optimization task from (bogunovic2018adversarially). The goal is to select points that maximize the function subject to the worst-case perturbation. We map this problem to our setting by defining the parameter set accordingly. The decision space consists of a uniformly spaced grid of points, while the set of perturbations is obtained by drawing random points from the unit ball centered at the origin.
We set the noise standard deviation and run all the algorithms using the Matérn kernel for iterations (kernel hyperparameters are found via the maximum-likelihood method). In Figure 1(a), we plot the function as well as the support of the strategies returned by StableOpt (in black) and GP-MRO (in cyan). For GP-MRO, we plot only points selected with probability mass greater than the threshold. StableOpt is able to discover the max-min point, while GP-MRO randomizes between points in the max-min region and points close to the global optimum. This leads to higher performance compared to the other baselines, as shown in Figure 1(b).
4.2 Human-assisted trajectory planning for autonomous vehicles
We study the problem of planning safe trajectories for an Autonomous Vehicle (AV) driving on roads shared with human-driven vehicles (HVs). We consider the situation depicted in Figure 2(c), where the AV (in yellow) is approaching an HV (in red) driving at a constant speed. The intentions of the HV are uncertain, and this should be taken into account when planning the AV's trajectory.
In the context of autonomous driving and AV–HV interactions, deterministic strategies would make the AV's actions predictable, hence giving a significant advantage to HVs. We observe this fact in our simulations, where such strategies tend to be overly conservative and prevent the AV from completing the overtaking manoeuvre. Similarly, we expect this to occur in many other challenging scenarios such as intersections (liu2018intersection), or when merging into dense lanes (bouton2019merging). Instead, we model this problem according to Section 2 and seek robust mixed strategies for the AV. This is in contrast with previous works (e.g., (fisac2018hierarchical; sadigh2016planning)), where deterministic strategies are found assuming a specific behavioral model for the HV.
Further on, our goal is to plan trajectories for the AV which best reflect typical human driving preferences (e.g., driving styles, security measures, and safe behaviors that the AV should follow). For instance, in the specific situation of Figure 2(c), a good trajectory for the AV should depend on the importance that humans give to overtaking rather than braking behind the HV. We encode such driving preferences with an unknown scoring function. We assume we can learn this function via sequential evaluations obtained by interacting with a user who assists our planning phase.
Computing such mixed strategies requires substantial computation and relies on sequential interactions with the user. Hence, after illustrating our approach, we propose an offline scheme to precompute a control policy for the AV using GP-MRO.
Decision sets. A strategy for the AV consists of selecting a steering angle and an acceleration. Once chosen, both are assumed to be constant over the planning horizon. Hence, we let the decision set consist of such steering–acceleration pairs. Similarly, we assume the HV travels at a constant speed and can choose a steering angle. We discretize both decision sets using uniform grids of and points, respectively. Car trajectories (depicted in Figure 2(c)) are computed using the commonly used discrete-time bicycle model (polack2017bycicle) with time steps of .
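A discrete-time kinematic bicycle model of the kind cited above can be sketched as follows; the wheelbase `L` and the time step are illustrative values, not the ones used in the experiments:

```python
import math

def bicycle_step(state, accel, steer, dt=0.1, L=2.5):
    """One Euler step of the kinematic bicycle model.
    state = (x, y, heading, speed); steer is the front-wheel angle."""
    x, y, theta, v = state
    return (x + v * math.cos(theta) * dt,
            y + v * math.sin(theta) * dt,
            theta + v / L * math.tan(steer) * dt,
            v + accel * dt)

def rollout(state, accel, steer, steps, dt=0.1):
    """Trajectory under constant acceleration and steering angle,
    matching the constant-control assumption over the planning horizon."""
    traj = [state]
    for _ in range(steps):
        traj.append(bicycle_step(traj[-1], accel, steer, dt))
    return traj
```

With zero steering and zero acceleration the car travels in a straight line at constant speed, which is a quick sanity check of the dynamics.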
Optimization goal. We let the scoring function reflect humans' driving preferences for the AV. As discussed later, it measures how rewarding it is for the AV to select a given trajectory when the HV decides to steer with a given angle. Our goal is to compute a robust mixed strategy which solves the problem in (1). More generally, according to Section 3.1.1, we can incorporate priors on the HV's behavior and find strategies that trade off worst-case and average-case performance, for a given trade-off parameter $\alpha$.
Scoring function. We assume that the scoring function is initially unknown but can be learned by iteratively querying the user. Querying at a given point consists of: 1) forward simulating the AV's and HV's trajectories corresponding to the chosen controls, and 2) presenting the outcome of such a simulation to the user, who assigns a score to the considered trajectories. In this experiment, we assume such a score is determined by a feature vector that can be extracted from the simulated trajectories. This vector consists of: the longitudinal distance travelled by the AV, the AV's maximum absolute lateral position, and the minimum distance between the AV and the human-driven car. We use a model of the unknown function in which the first feature rewards progress, the second penalizes exiting the road limits, and the third penalizes the AV if it gets too close to the human-driven car and therefore needs to activate emergency braking. In future work, we plan to replace our model and test our approach with scores coming from real users.
4.3 Illustration of the mixed strategies computed by GP-MRO
We consider the configuration in Figure 2(c) and compute a mixed strategy for the AV by running GP-MRO for iterations. We set the trade-off parameter , , and . To learn the scoring function, we fit a GP with a kernel function over the feature vector computed as explained above. In Figure 2(c), we depict (in blue) the support of the mixed strategy, where the color intensity of a trajectory is proportional to its probability. Additionally, we show (in dotted light green) the trajectory corresponding to the robust deterministic strategy. The mixed strategy randomizes between an overtake from the left or the right side. Instead, the deterministic strategy amounts to braking and thus never overtaking.
Our next goal is to find a strategy for the AV which can trade off worst-case with average-case performance. Let us assume that, with some probability, the HV does not realize the presence of the AV and thus has no intention to steer. In this case, we can seek the optimal strategy for the AV by setting $\alpha$ accordingly and letting $P$ be a Dirac distribution corresponding to the HV proceeding straight. In Figure 2(c), we depict the strategy returned by GP-MRO, together with the deterministic trajectory. In this case, the mixed strategy favors an overtake from the right, while the deterministic strategy still leads to no overtaking.
4.4 Closedloop simulations
We propose the following offline procedure to precompute a control policy for the AV. We consider a finite set of possible scenarios, each describing the initial and relative positions and velocities of the two cars. We compute a mixed strategy for each scenario using GP-MRO with . Moreover, to make our approach more tractable, we query the function at the chosen points (Line 5 in Algorithm 1) only if the posterior uncertainty is greater than a threshold. By doing so, we end up with a policy mapping scenarios to distributions after a total number of 136 queries of the unknown function.
We evaluate the policy online, in a receding-horizon fashion: starting from given initial positions and velocities, at fixed time intervals we map the cars' positions and velocities to the closest scenario (using a nearest-neighbour tree-based algorithm) and let the AV sample its trajectory from the corresponding mixed strategy. For the behavior of the HV, we implement a noisily rational Boltzmann policy (as in (fisac2018hierarchical)) where, in a given scenario, each steering angle $\delta$ is sampled with probability
$$\Pr[\delta] \;=\; \frac{\exp\big(f_{\mathrm{HV}}(\delta)\big)}{\sum_{\delta' \in \Delta} \exp\big(f_{\mathrm{HV}}(\delta')\big)}. \qquad (18)$$
The function $f_{\mathrm{HV}}$ rewards progress for the HV and penalizes exiting the road or getting too close to the AV, in the same way the scoring function does for the AV.
Table 1.
                         GP-MRO    Deterministic max-min
# of overtakes
avg. final pos. AV
avg. final pos. human
In Figure 4.2, we plot several snapshots of a closed-loop simulation where the AV samples trajectories from the precomputed policy (a), and where the AV chooses the max-min strategy at every iteration (b). As can be seen from Figure 4.2, the proposed approach allows the AV to safely overtake, while the robust deterministic strategy is too conservative and forces the AV to brake behind the HV. We repeat the closed-loop simulation times (for fixed initial positions and velocities of the two cars). As reported in Table 1, the deterministic strategy is non-overtaking, and the AV reaches an average final longitudinal position of . Instead, using the precomputed randomized policy, the AV successfully overtakes the human-driven car in cases (in the remaining cases it brakes behind the HV), reaching an average final position of .
5 Conclusion
We have studied a robust optimization problem in which the objective function is unknown and depends on an uncertain parameter. For this problem, we have proposed a novel sample-efficient algorithm, GP-MRO, which can discover a near-optimal randomized and robust strategy. We have established rigorous theoretical guarantees and designed a variant of GP-MRO that effectively trades off worst-case and average-case performance. In synthetic experiments and trajectory planning tasks, we have shown that our proposed algorithm significantly outperforms existing baselines.
Acknowledgments
This work was gratefully supported by the Swiss National Science Foundation, under the grant SNSF _, by the European Union’s ERC grant , and ETH Zürich Postdoctoral Fellowship 192 FEL47.
References
Supplementary Material
Mixed Strategies for Robust Optimization of Unknown Objectives
Pier Giuseppe Sessa, Ilija Bogunovic, Maryam Kamgarpour, Andreas Krause (AISTATS 2020)
Appendix A Proof of Theorem 2
Proof.
In this proof, we condition on the event in Lemma 1 holding true, meaning that the upper and lower confidence bounds are valid as per (13). As stated in the lemma, this holds with probability at least $1 - \rho$.
Our main goal in this proof is to upper bound the difference:
(19) 
To do so, we provide upper and lower bounds of the first and second terms, respectively, and then we upper bound their difference.
First, we show that the following holds:
(20) 
where is the point queried at time .
To prove Eq. (20) we use the lower confidence bound and (14):
(21)  
(22)  
(23)  
(24)  
(25)  
(26) 
where (22) follows from the definition of the confidence bounds in (5) and (6), (24) is due to monotonicity, and (25) is by rule (10) used in Algorithm 1 to select $\delta_t$. Finally, (26) is obtained via the standard result from (srinivas2009gaussian; chowdhury17kernelized)
(27) 
when .
Next, we show that the first term can be upper bounded as follows:
To prove this, we start by upper bounding the minimum value of the inner objective:
(28)  
(29)  
(30)  
(31)  
(32)  
(33) 
We obtain Eq. (28) as the following trivially holds
for each and , and hence it also holds for the average value
Eq. (29) follows from (14), (30) follows by the linearity of expectation, and (32) holds since Dirac delta , , is in . Finally, (33) follows by rule (9) used in Algorithm 1 to select .
Next, we bound the difference in (19) by combining the bounds obtained in (26) and (33):
(34) 
where (34) follows from the guarantees of the no-regret online multiplicative weight updates algorithm played by the adversary, that is,
(35) 
with the learning rate set to . For more details on this result, see (cesabianchi_prediction_2006, Section 4.2), where the same online algorithm is considered. Specifically, the result above follows from (cesabianchi_prediction_2006, Theorem 2.2) by noting that , and for every . In our case, the objective function changes with the round but remains bounded, which allows the result to hold despite the changes (see the time-varying games extension in (cesabianchi_prediction_2006, Remark 7.3)).
Appendix B Proof of Corollary 3
Proof.
The proof closely follows that of Theorem 2. The main changes are due to the modified best-response rule from (16).
For a given distribution $P$ and trade-off parameter $\alpha$, we can define the new function
(36) 
Same as before, our goal is to upper bound the difference:
(37) 
where $x_t$ is the point selected at time $t$ by GP-MRO using the modified best-response rule as in (16).
Next, we condition on the event in Lemma 1 holding true, and we provide upper and lower bounds of the first and second term, respectively.
First, we show that the second term of (37) can be lower bounded as: