Strategizing against No-regret Learners
How should a player who repeatedly plays a game against a no-regret learner strategize to maximize his utility? We study this question and show that under some mild assumptions, the player can always guarantee himself a utility of at least what he would get in a Stackelberg equilibrium of the game. When the no-regret learner has only two actions, we show that the player cannot get any higher utility than the Stackelberg equilibrium utility. But when the no-regret learner has more than two actions and plays a mean-based no-regret strategy, we show that the player can get strictly higher than the Stackelberg equilibrium utility. We provide a characterization of the optimal game-play for the player against a mean-based no-regret learner as a solution to a control problem. When the no-regret learner’s strategy also guarantees him a no-swap regret, we show that the player cannot get anything higher than a Stackelberg equilibrium utility.
Consider a two-player bimatrix game with a finite number of actions for each player, repeated over $T$ rounds. When playing a repeated game, a widely adopted strategy is to employ a no-regret learning algorithm: a strategy that guarantees the player that, in hindsight, no single action played throughout the game would have performed significantly better. Knowing that one of the players (the learner) is playing a no-regret learning strategy, what is the optimal gameplay for the other player (the optimizer)? This question is the focus of our work.
If this were a single-shot strategic game where learning is not relevant, a (pure or mixed strategy) Nash equilibrium would be a reasonable prediction of the game's outcome. In the $T$-round game with learning, can the optimizer guarantee himself a per-round utility of at least what he could get in a single-shot game? Is it possible to get significantly more utility than this? Does this utility depend on the specific choice of learning algorithm of the learner? What gameplay should the optimizer adopt to achieve maximal utility? None of these questions is straightforward, and indeed none of them has an unconditional answer.
Central to our results is the idea of the Stackelberg equilibrium of the underlying game. The Stackelberg variant of our game is a single-shot two-stage game where the optimizer is the first player and can publicly commit to a mixed strategy; the learner then best responds to this strategy. The Stackelberg equilibrium is the resulting equilibrium of this game when both players play optimally. Note that the optimizer’s utility in the Stackelberg equilibrium is always weakly larger than his utility in any (pure or mixed strategy) Nash equilibrium, and is often strictly larger.
Let $V$ be the utility of the optimizer in the Stackelberg equilibrium. With some mild assumptions on the game, we show that the optimizer can always guarantee himself a utility of at least $(V - \epsilon)T - o(T)$ in $T$ rounds for any $\epsilon > 0$, irrespective of the learning algorithm used by the learner as long as it has the no-regret guarantee (see Theorem 4). This means that if one of the players is a learner, the other player can already profit over the Nash equilibrium regardless of the specifics of the learning algorithm employed or the structure of the game. Further, if any one of the following conditions is true:
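the game is a constant-sum game,
the learner's no-regret algorithm has the stronger guarantee of no-swap regret (see Section 2),
the learner has only two possible actions in the game,
then the optimizer cannot get a utility higher than $VT + o(T)$ (see Theorems 5, 6 and 7 respectively).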
If the learner employs a learning algorithm from a natural class of algorithms called mean-based learning algorithms (Braverman et al., 2018) (see Section 2), which includes popular no-regret algorithms like the Multiplicative Weights algorithm, the Follow-the-Perturbed-Leader algorithm, and the EXP3 algorithm, we show that there exist games where the optimizer can guarantee himself a utility of $V'T - o(T)$ for some $V' > V$ (see Theorem 8). We note the contrast between the cases of $2$ and $3$ actions for the learner: in the $2$-action case, even if the learner plays a mean-based strategy, the optimizer cannot get anything more than $VT + o(T)$ (Theorem 7), whereas with $3$ actions, there are games where he is able to guarantee a linearly higher utility.
Given this possibility of exceeding Stackelberg utility, our final result is on the nature and structure of the utility-optimal gameplay for the optimizer against a learner that employs a mean-based strategy. First, we give a crisp characterization of the optimizer's asymptotically optimal algorithm as the solution to a control problem (see Section 4.2) in $N$ dimensions, where $N$ is the number of actions for the learner. This characterization is predicated on the fact that just knowing the cumulative historical utilities of each of the learner's actions is essentially enough information to accurately predict the learner's next action in the case of a mean-based learner. These cumulative utilities thus form an $N$-dimensional "state" for the learner which the optimizer can manipulate via their choice of action. We then make several observations that simplify the solution space for this control problem. We leave computing or characterizing the optimal solution to this control problem as an interesting open question, and we provide one conjecture on a potential characterization.
Comparison to prior work.
The very recent work of Braverman et al. (2018) is the closest to ours. They study the specific $2$-player game of an auction between a single seller and a single buyer. The main difference from Braverman et al. (2018) is that they consider a Bayesian setting where the buyer's type is drawn from a distribution, whereas there is no Bayesian element in our setting. Beyond that, the correspondence is direct: the seller's choice of the auction represents his action, and the buyer's bid represents her action. They show that regardless of the specific algorithm used by the buyer, as long as the buyer plays a no-regret learning algorithm the seller can always earn at least the optimal revenue in a single-shot auction. Our Theorem 4 is a direct generalization of this result to arbitrary games without any structure. Further, Braverman et al. (2018) show that there exist no-regret strategies for the buyer that guarantee that the seller cannot get anything better than the single-shot optimal revenue. Our Theorems 5, 6 and 7 are both generalizations and refinements of this result, as they pinpoint both the exact learner's strategies and the kind of games that prevent the optimizer from going beyond the Stackelberg utility. Braverman et al. (2018) also show that when the buyer plays a mean-based strategy, the seller can design an auction to guarantee him a revenue beyond the per-round auction revenue. Our control problem can be seen as a rough parallel and generalization of this result.
Other related work.
The first notion of regret (without the swap qualification) we use in the paper is also referred to as external regret (see Hannan (1957), Foster and Vohra (1993), Littlestone and Warmuth (1994), Freund and Schapire (1997), Freund and Schapire (1999), Cesa-Bianchi et al. (1997)). The other notion of regret we use is swap regret. There is a slightly weaker notion of regret called internal regret that was defined earlier in Foster and Vohra (1998), which allows all occurrences of a given action $i$ to be replaced by another action $j$. Many no-internal-regret algorithms have been designed (see for example Hart and Mas-Colell (2000), Foster and Vohra (1997, 1998, 1999), Cesa-Bianchi and Lugosi (2003)). The stronger notion of swap regret was introduced in Blum and Mansour (2005), and it allows one to simultaneously swap several pairs of actions. Blum and Mansour show how to efficiently convert a no-regret algorithm to a no-swap-regret algorithm. One of the reasons behind the importance of internal and swap regret is their close connection to the central notion of correlated equilibrium introduced by Aumann (1974). In a general $n$-player game, a distribution over action profiles of all the players is a correlated equilibrium if every player has zero internal regret. When all players use algorithms with no-internal-regret guarantees, the time-averaged strategies of the players converge to a correlated equilibrium (see Hart and Mas-Colell (2000)). When all players simply use algorithms with no-external-regret guarantees, the time-averaged strategies of the players converge to the weaker notion of coarse correlated equilibrium. When the game is a zero-sum game, the time-averaged strategies of players employing no-external-regret dynamics converge to the Nash equilibrium of the game.
2 Model and Preliminaries
2.1 Games and equilibria
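Throughout this paper, we restrict our attention to simultaneous two-player bimatrix games $G$. We refer to the first player as the optimizer and the second player as the learner. We denote the set of actions available to the optimizer as $\mathcal{A} = \{a_1, \dots, a_M\}$ and the set of actions available to the learner as $\mathcal{B} = \{b_1, \dots, b_N\}$. If the optimizer chooses action $a_i$ and the learner chooses action $b_j$, then the optimizer receives utility $u_O(a_i, b_j)$ and the learner receives utility $u_L(a_i, b_j)$. We normalize the utilities such that $|u_O(a_i, b_j)| \le 1$ and $|u_L(a_i, b_j)| \le 1$. We write $\Delta(\mathcal{A})$ and $\Delta(\mathcal{B})$ to denote the sets of mixed strategies for the optimizer and learner respectively. When the optimizer plays $\alpha \in \Delta(\mathcal{A})$ and the learner plays $\beta \in \Delta(\mathcal{B})$, the optimizer's utility is denoted by $u_O(\alpha, \beta) = \sum_{i=1}^{M} \sum_{j=1}^{N} \alpha_i \beta_j u_O(a_i, b_j)$, and similarly for the learner's utility $u_L(\alpha, \beta)$.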
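We say that a strategy $b \in \mathcal{B}$ is a best-response to a strategy $\alpha \in \Delta(\mathcal{A})$ if $b \in \arg\max_{b' \in \mathcal{B}} u_L(\alpha, b')$. We are now ready to define Stackelberg equilibrium (Von Stackelberg, 2010).
Definition 1 (Stackelberg Equilibrium). The Stackelberg equilibrium of a game is a pair of strategies $\alpha \in \Delta(\mathcal{A})$ and $b \in \mathcal{B}$ that maximizes $u_O(\alpha, b)$ under the constraint that $b$ is a best-response to $\alpha$. We call the value $u_O(\alpha, b)$ the Stackelberg value of the game.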
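To make the definition concrete, here is a minimal sketch that approximates the Stackelberg value by grid search when the optimizer has exactly two actions. The function name `stackelberg_value`, the grid resolution, and the example payoff matrices are illustrative assumptions, not part of the paper; ties in the learner's best response are broken in the optimizer's favor, as is standard for Stackelberg equilibria.

```python
from fractions import Fraction

def stackelberg_value(u_O, u_L, grid=1000):
    """Grid-search the optimizer's commitment in a game with two optimizer actions.

    u_O[i][j], u_L[i][j]: utilities when the optimizer plays row i and the
    learner plays column j. Returns (value, p), where p is the probability
    placed on row 1. Exact rational arithmetic makes best-response ties exact.
    """
    n_cols = len(u_O[0])
    best = None
    for k in range(grid + 1):
        p = Fraction(k, grid)
        learner = [(1 - p) * u_L[0][j] + p * u_L[1][j] for j in range(n_cols)]
        optimizer = [(1 - p) * u_O[0][j] + p * u_O[1][j] for j in range(n_cols)]
        top = max(learner)
        # Among the learner's best responses, take the one the optimizer prefers.
        value = max(optimizer[j] for j in range(n_cols) if learner[j] == top)
        if best is None or value > best[0]:
            best = (value, p)
    return best

# Hypothetical 2x2 game: row 0 strictly dominates for the optimizer, so the
# only Nash equilibrium is (row 0, column 0) with optimizer utility 2.
u_O = [[2, 4], [1, 3]]
u_L = [[1, 0], [0, 1]]
value, p = stackelberg_value(u_O, u_L)
print(value, p)
```

In this hypothetical game, committing to an even mix makes the learner indifferent, and the optimizer's value rises above the Nash utility of 2, which illustrates why the Stackelberg value is weakly larger than any Nash utility.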
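A game is zero-sum if $u_O(a_i, b_j) + u_L(a_i, b_j) = 0$ for all $i \in [M]$ and $j \in [N]$; likewise, a game is constant-sum if $u_O(a_i, b_j) + u_L(a_i, b_j) = C$ for some fixed constant $C$ for all $i \in [M]$ and $j \in [N]$. Note that for zero-sum or constant-sum games, the Stackelberg equilibrium coincides with the standard notion of Nash equilibrium due to the celebrated minimax theorem (von Neumann, 1928). Moreover, throughout this paper, we assume that the learner does not have weakly dominated strategies: a strategy $b \in \mathcal{B}$ is weakly dominated if there exists $\beta \in \Delta(\mathcal{B} \setminus \{b\})$ such that for all $\alpha \in \Delta(\mathcal{A})$, $u_L(\alpha, \beta) \ge u_L(\alpha, b)$.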
We are interested in the setting where the optimizer and the learner repeatedly play the game $G$ for $T$ rounds. We will denote the optimizer's action at time $t$ as $a^t$; likewise we will denote the learner's action at time $t$ as $b^t$. Both the optimizer's and the learner's utilities are additive over rounds with no discounting.
The optimizer's strategy can be adaptive (i.e., $a^t$ can depend on the learner's earlier actions $b^1, \dots, b^{t-1}$) or non-adaptive (in which case it can be expressed as a sequence of mixed strategies $(\alpha^1, \alpha^2, \dots, \alpha^T)$). Unless otherwise specified, all positive results (results showing that the optimizer can guarantee some utility) apply to non-adaptive optimizers and all negative results apply even to adaptive optimizers. As the name suggests, the learner's (adaptive) strategy will be specified by some variant of a low-regret learning algorithm, as described in the next section.
2.2 No-regret learning and mean-based learning
In the classic multi-armed bandit problem with $T$ rounds, the learner selects one of $K$ options (a.k.a. arms) on round $t$ and receives a reward $r_{i,t} \in [0, 1]$ if he selects option $i$. The rewards can be chosen adversarially and the learner's objective is to maximize his total reward.
Let $i_t$ be the arm pulled by the learner at round $t$. The regret of a (possibly randomized) learning algorithm is defined as the difference between the performance of the best arm and that of the algorithm: $\mathrm{Reg} = \max_{i} \sum_{t=1}^{T} r_{i,t} - \sum_{t=1}^{T} r_{i_t,t}$. An algorithm for the multi-armed bandit problem is no-regret if the expected regret is sublinear in $T$, i.e., $\mathbb{E}[\mathrm{Reg}] = o(T)$. In addition to the bandits setting, in which the learner only learns the reward of the arm he pulls, our results also apply to the experts setting, in which the learner learns the rewards of all arms in every round. Simple no-regret strategies exist in both the bandits and the experts settings.
Among no-regret algorithms, we are interested in two special classes of algorithms. The first is the class of mean-based strategies:
Definition 2 (Mean-based Algorithm).
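Let $\sigma_{i,t} = \sum_{s=1}^{t} r_{i,s}$ be the cumulative reward for pulling arm $i$ for the first $t$ rounds. An algorithm is $\gamma$-mean-based if whenever $\sigma_{i,t} < \sigma_{j,t} - \gamma T$, the probability for the algorithm to pull arm $i$ on round $t$ is at most $\gamma$. An algorithm is mean-based if it is $\gamma$-mean-based for some $\gamma = o(1)$.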
Intuitively, mean-based strategies are strategies that play the arm that has historically performed the best. Braverman et al. (2018) show that many no-regret algorithms are mean-based, including commonly used variants of EXP3 (for the bandits setting), the Multiplicative Weights algorithm (for the experts setting) and the Follow-the-Perturbed-Leader algorithm (for the experts setting).
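As a rough illustration of why a Multiplicative Weights style rule is mean-based, the sketch below computes the sampling distribution from cumulative rewards and checks that an arm trailing the leader by more than $\gamma T$ receives probability at most $\gamma$. The step size $\eta = T^{-1/2}$, the choice $\gamma = T^{-1/4}$, and the function name are assumptions made for this example only; an arm trailing by a gap $g$ gets probability at most $e^{-\eta g}$.

```python
import math

def hedge_probabilities(cumulative, eta):
    """Multiplicative-weights distribution: arm i has weight exp(eta * cumulative[i])."""
    top = max(cumulative)  # subtract the maximum for numerical stability
    weights = [math.exp(eta * (s - top)) for s in cumulative]
    total = sum(weights)
    return [w / total for w in weights]

T = 10_000
eta = T ** -0.5      # assumed step size
gamma = T ** -0.25   # assumed mean-based parameter

# Arms 1 and 2 both trail arm 0 by more than gamma * T.
cumulative = [gamma * T + 5.0, 0.0, 3.0]
probs = hedge_probabilities(cumulative, eta)
print(probs)
```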
The second class is the class of no-swap-regret algorithms:
Definition 3 (No-Swap-Regret Algorithm).
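The swap regret of an algorithm is defined as
$$\mathrm{SwapReg} = \max_{\pi : [K] \to [K]} \sum_{t=1}^{T} r_{\pi(i_t),t} - \sum_{t=1}^{T} r_{i_t,t},$$
where the maximum is over all functions $\pi$ mapping options to options. An algorithm is no-swap-regret if the expected swap regret is sublinear in $T$, i.e., $\mathbb{E}[\mathrm{SwapReg}] = o(T)$.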
Intuitively, no-swap-regret strategies strengthen the no-regret criterion in the following way: no-regret guarantees the learning algorithm performs as well as the best possible arm overall, but no-swap-regret guarantees the learning algorithm performs as well as the best possible arm over each subset of rounds where the same action is played. Given a no-regret algorithm, a no-swap-regret algorithm can be constructed via a clever reduction (see Blum and Mansour (2005)).
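The difference between the two notions can be seen on a small example. The following sketch computes both quantities for a fixed reward table and sequence of pulls; the function names and the toy data are hypothetical. It uses the fact that the maximum over mappings decomposes arm by arm, since the replacement for each arm can be chosen independently.

```python
def external_regret(rewards, pulls):
    """Best fixed arm in hindsight minus the reward actually earned."""
    n_arms = len(rewards[0])
    earned = sum(rewards[t][arm] for t, arm in enumerate(pulls))
    best_fixed = max(sum(row[i] for row in rewards) for i in range(n_arms))
    return best_fixed - earned

def swap_regret(rewards, pulls):
    """Sum, over arms i, of the gain from the best replacement on rounds where i was pulled."""
    n_arms = len(rewards[0])
    total = 0.0
    for i in range(n_arms):
        rounds = [t for t, arm in enumerate(pulls) if arm == i]
        earned = sum(rewards[t][i] for t in rounds)
        best_swap = max(sum(rewards[t][j] for t in rounds) for j in range(n_arms))
        total += best_swap - earned
    return total

# Toy data: rewards[t][i] is the reward of arm i on round t.
rewards = [[1, 0, 0], [1, 0, 0], [0, 0.5, 1], [0, 0.5, 1]]
pulls = [0, 0, 1, 1]
print(external_regret(rewards, pulls), swap_regret(rewards, pulls))
```

Here no single arm beats the sequence of pulls, yet replacing arm 1 by arm 2 on the rounds where arm 1 was pulled would have helped, so the swap regret is positive while the external regret is not.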
3 Playing against no-regret learners
3.1 Achieving Stackelberg equilibrium utility
To begin with, we show that the optimizer can achieve an average utility per round arbitrarily close to the Stackelberg value against a no-regret learner.
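Theorem 4. Let $V$ be the Stackelberg value of the game $G$. If the learner is playing a no-regret learning algorithm, then for any $\epsilon > 0$, the optimizer can guarantee at least $(V - \epsilon)T - o(T)$ utility.
Proof. Let $(\alpha, b)$ be the Stackelberg equilibrium of the game $G$. Since $(\alpha, b)$ forms a Stackelberg equilibrium, $b \in \arg\max_{b' \in \mathcal{B}} u_L(\alpha, b')$. Moreover, by the assumption that the learner does not have a weakly dominated strategy, there does not exist $\beta \in \Delta(\mathcal{B} \setminus \{b\})$ such that for all $\alpha' \in \Delta(\mathcal{A})$, $u_L(\alpha', \beta) \ge u_L(\alpha', b)$. By Farkas's lemma (Farkas, 1902), there must exist an $\alpha' \in \Delta(\mathcal{A})$ such that for all $b' \in \mathcal{B} \setminus \{b\}$, $u_L(\alpha', b) \ge u_L(\alpha', b') + \kappa$ for some $\kappa > 0$.
Therefore, for any $\delta \in (0, 1)$, the optimizer can play the strategy $\alpha^* = (1 - \delta)\alpha + \delta\alpha'$, so that $b$ is the unique best response to $\alpha^*$ and playing any strategy $b' \ne b$ induces a utility loss of at least $\delta\kappa$ for the learner. As a result, since the learner is playing a no-regret learning algorithm, in expectation there are at most $o(T)$ rounds in which the learner plays some $b' \ne b$. It follows that the optimizer's utility is at least $VT - \delta T\,(V - u_O(\alpha', b)) - o(T)$. Thus, we can conclude our proof by setting $\epsilon = \delta\,(V - u_O(\alpha', b))$. ∎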
Next, we show that in the special class of constant-sum games, the Stackelberg value is the best that the optimizer can hope for when playing against a no-regret learner.
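Theorem 5. Let $G$ be a constant-sum game, and let $V$ be the Stackelberg value of this game. If the learner is playing a no-regret algorithm, then the optimizer receives no more than $VT + o(T)$ utility.
Proof. Let $\vec{a} = (a^1, \dots, a^T)$ be the sequence of the optimizer's actions. Moreover, let $\alpha^* \in \Delta(\mathcal{A})$ be a mixed strategy such that $\alpha^*$ plays $a_i$ with probability $\alpha^*_i = |\{t : a^t = a_i\}| / T$.
Since the learner is playing a no-regret learning algorithm, the learner's cumulative utility is at least $\max_{b' \in \mathcal{B}} T \cdot u_L(\alpha^*, b') - o(T) = CT - T \min_{b' \in \mathcal{B}} u_O(\alpha^*, b') - o(T)$, where $C$ is the constant sum, which implies that the optimizer's utility is at most
$$T \min_{b' \in \mathcal{B}} u_O(\alpha^*, b') + o(T) \le T \max_{\alpha \in \Delta(\mathcal{A})} \min_{b' \in \mathcal{B}} u_O(\alpha, b') + o(T) = VT + o(T),$$
where the equality follows from the fact that the Stackelberg value is equal to the minimax value, by the minimax theorem for a constant-sum game. ∎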
3.2 No-swap-regret learning
In this section, we show that if the learner is playing a no-swap-regret algorithm, the optimizer can only achieve their Stackelberg utility per round.
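Theorem 6. Let $V$ be the Stackelberg value of the game $G$. If the learner is playing a no-swap-regret algorithm, then the optimizer will receive no more than $VT + o(T)$ utility.
Proof. Let $\vec{a} = (a^1, \dots, a^T)$ be the sequence of the optimizer's actions and let $\vec{b} = (b^1, \dots, b^T)$ be the realization of the sequence of the learner's actions. Moreover, let $\Pr[\vec{b} \mid \vec{a}]$ be the probability that the learner (who is playing some no-swap-regret learning algorithm) plays $\vec{b}$ given that the adversary plays $\vec{a}$. Then, the marginal probability for the learner to play $b_j$ at round $t$ is
$$\Pr[b^t = b_j \mid \vec{a}] = \sum_{\vec{b} \,:\, b^t = b_j} \Pr[\vec{b} \mid \vec{a}].$$
Let $\alpha^{b_j} \in \Delta(\mathcal{A})$ be a mixed strategy such that $\alpha^{b_j}$ plays $a_i$ with probability
$$\alpha^{b_j}_i = \frac{\sum_{t \,:\, a^t = a_i} \Pr[b^t = b_j \mid \vec{a}]}{\sum_{t=1}^{T} \Pr[b^t = b_j \mid \vec{a}]}.$$
Let $B' = \{b_j \in \mathcal{B} : b_j \notin \arg\max_{b'} u_L(\alpha^{b_j}, b')\}$ and consider a mapping $\pi$ such that $\pi(b_j) \in \arg\max_{b'} u_L(\alpha^{b_j}, b')$. Then, the swap regret under $\pi$ is
$$\sum_{b_j \in B'} \Big(\sum_{t=1}^{T} \Pr[b^t = b_j \mid \vec{a}]\Big) \big(u_L(\alpha^{b_j}, \pi(b_j)) - u_L(\alpha^{b_j}, b_j)\big) \ge \delta \sum_{b_j \in B'} \sum_{t=1}^{T} \Pr[b^t = b_j \mid \vec{a}],$$
where $\delta = \min_{b_j \in B'} \big(u_L(\alpha^{b_j}, \pi(b_j)) - u_L(\alpha^{b_j}, b_j)\big) > 0$. Therefore, since the learner is playing a no-swap-regret algorithm, we have $\sum_{b_j \in B'} \sum_{t=1}^{T} \Pr[b^t = b_j \mid \vec{a}] = o(T)$.
Moreover, for $b_j \notin B'$, the action $b_j$ is a best response to $\alpha^{b_j}$, so the optimizer's utility when the learner plays $b_j$ is at most
$$u_O(\alpha^{b_j}, b_j) \sum_{t=1}^{T} \Pr[b^t = b_j \mid \vec{a}] \le V \sum_{t=1}^{T} \Pr[b^t = b_j \mid \vec{a}].$$
Thus, the optimizer's utility is at most
$$\sum_{b_j \notin B'} V \sum_{t=1}^{T} \Pr[b^t = b_j \mid \vec{a}] + \sum_{b_j \in B'} \sum_{t=1}^{T} \Pr[b^t = b_j \mid \vec{a}] \le VT + o(T). \qquad ∎$$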
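Theorem 7. Let $G$ be a game where the learner has $N = 2$ actions, and let $V$ be the Stackelberg value of this game. If the learner is playing a no-regret algorithm, then the optimizer receives no more than $VT + o(T)$ utility.
Proof. By Theorem 6, it suffices to show that when there are two actions for the learner, a no-regret learning algorithm is in fact a no-swap-regret learning algorithm.
When there are only two actions, there are three possible mappings from $\{b_1, b_2\}$ to $\{b_1, b_2\}$ other than the identity mapping. Let $\pi_1$ be a mapping such that $\pi_1(b_1) = b_1$ and $\pi_1(b_2) = b_1$, $\pi_2$ be a mapping such that $\pi_2(b_1) = b_2$ and $\pi_2(b_2) = b_2$, and $\pi_3$ be a mapping such that $\pi_3(b_1) = b_2$ and $\pi_3(b_2) = b_1$.
Writing $\mathrm{Reg}(\pi)$ for the learner's regret under the mapping $\pi$, since the learner is playing a no-regret learning algorithm, we have $\mathbb{E}[\mathrm{Reg}(\pi_1)] \le o(T)$ and $\mathbb{E}[\mathrm{Reg}(\pi_2)] \le o(T)$. Moreover, notice that
$$\mathrm{Reg}(\pi_3) = \mathrm{Reg}(\pi_1) + \mathrm{Reg}(\pi_2),$$
which concludes the proof. ∎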
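The additivity used in this last step can be verified by splitting the rounds according to the learner's action. As a short sketch, write $\mathrm{Reg}(\pi) = \sum_{t=1}^{T} \big(u_L(a^t, \pi(b^t)) - u_L(a^t, b^t)\big)$ for the regret under a mapping $\pi$ (notation assumed here for illustration), where $\pi_1$ sends both actions to $b_1$, $\pi_2$ sends both actions to $b_2$, and $\pi_3$ exchanges $b_1$ and $b_2$. Then $\pi_1$ only changes the rounds with $b^t = b_2$ and $\pi_2$ only changes the rounds with $b^t = b_1$, so

$$\mathrm{Reg}(\pi_3) = \underbrace{\sum_{t \,:\, b^t = b_2} \big(u_L(a^t, b_1) - u_L(a^t, b_2)\big)}_{\mathrm{Reg}(\pi_1)} + \underbrace{\sum_{t \,:\, b^t = b_1} \big(u_L(a^t, b_2) - u_L(a^t, b_1)\big)}_{\mathrm{Reg}(\pi_2)}.$$

Each of $\mathrm{Reg}(\pi_1)$ and $\mathrm{Reg}(\pi_2)$ is the regret against a fixed action, so both are bounded by the external regret.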
4 Playing against mean-based learners
From the results of the previous section, it is natural to conjecture that no optimizer can achieve more than the Stackelberg value per round when playing against a no-regret algorithm. After all, this is true for the subclass of no-swap-regret algorithms (Theorem 6) and is true for simple games: constant-sum games (Theorem 5) and games in which the learner only has two actions (Theorem 7).
In this section we show that this is not the case. Specifically, we show that there exist games where an optimizer can win strictly more than the Stackelberg value every round when playing against a mean-based learner. We emphasize that the same strategy for the optimizer will work against any mean-based learning algorithm the learner uses.
We then proceed to characterize the optimal strategy for a non-adaptive optimizer playing against a mean-based learner as the solution to an optimal control problem in $N$ dimensions (where $N$ is the number of actions of the learner), and make several preliminary observations about the structure an optimal solution to this control problem must possess. Understanding how to efficiently solve this control problem (or whether the optimal solution is even computable) is an intriguing open question.
4.1 Beating the Stackelberg value
We begin by showing it is possible for the optimizer to get significantly (linear in $T$) more utility when playing against a mean-based learner.
Theorem 8. There exists a game with Stackelberg value $V$ where the optimizer can receive utility at least $V'T - o(T)$ against a mean-based learner for some $V' > V$.
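Proof. Assume that the learner is using a $\gamma$-mean-based algorithm with $\gamma = o(1)$, and fix a small constant $\epsilon \in (0, 1)$. Consider the bimatrix game shown in Table 1, in which the optimizer is the row player (these utilities are bounded in $[-2, 2]$ instead of $[-1, 1]$ for convenience; we can divide through by $2$ to get a similar example where utility is bounded in $[-1, 1]$). We first argue that the Stackelberg value of this game is $0$. Notice that if the optimizer plays Bottom with probability more than $1/2$, then the learner's best response is to play Mid, resulting in a utility of $-2$ for the optimizer. However, if the optimizer plays Bottom with probability at most $1/2$, the expected utility for the optimizer from each column is at most $0$. Therefore, in the Stackelberg equilibrium, the optimizer will play Top and Bottom with probability $1/2$ each, and the learner will best respond by purely playing Right.
| | Left | Mid | Right |
| Top | $(0, \epsilon)$ | $(-2, -1)$ | $(-2, 0)$ |
| Bottom | $(0, -1)$ | $(-2, 1)$ | $(2, 0)$ |
Table 1: Utilities (optimizer, learner) of the example game; the optimizer is the row player.
However, the optimizer can obtain utility $(1 - \epsilon)T - o(T)$ by playing Top for the first $T/2$ rounds and then playing Bottom for the remaining $T/2$ rounds. Given the optimizer's strategy, during the first $T/2$ rounds the learner will play Left with probability at least $1 - 2\gamma$ after the first $\gamma T/\epsilon$ rounds, since from then on the cumulative utility of Left exceeds those of Mid and Right by more than $\gamma T$. For the remaining $T/2$ rounds, the learner will switch to playing Right with probability at least $1 - 2\gamma$ between the $(\frac{1}{2} + \frac{\epsilon}{2} + \gamma)T$-th round and the $(1 - \gamma)T$-th round, since in these rounds the cumulative utility for playing Left is at most $-\gamma T$ and the cumulative utility for playing Mid is at most $-\gamma T$, while the cumulative utility for playing Right is $0$.
Therefore, since the optimizer receives $0$ whenever the learner plays Left and at least $-2$ otherwise, the cumulative utility for the optimizer for the first $T/2$ rounds is at least
$$-2 \cdot \frac{\gamma T}{\epsilon} - 2 \cdot 2\gamma \cdot \frac{T}{2} = -o(T),$$
and, since Mid is played with probability at most $\gamma$ until the $(1 - \gamma)T$-th round, the cumulative utility for the optimizer for the remaining $T/2$ rounds is at least
$$2(1 - 2\gamma)\Big(\frac{1}{2} - \frac{\epsilon}{2} - 2\gamma\Big)T - o(T) = (1 - \epsilon)T - o(T).$$
Thus, the optimizer can obtain a total utility of $(1 - \epsilon)T - o(T)$, which is greater than $VT = 0$ for the Stackelberg value $V = 0$ in this game. ∎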
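The following simulation is a minimal sketch of this construction: the optimizer plays Top for the first half of the rounds and Bottom for the second half against a Multiplicative Weights learner in the experts setting. The learner's payoff `eps` in the Top/Left cell, the step size `eta`, and the function name are assumptions made for this illustration; the optimizer's utility is accumulated in expectation over the learner's randomness, so the run is deterministic.

```python
import math

def play_against_hedge(T, eps=0.1):
    """Average per-round utility of 'Top for T/2 rounds, then Bottom' against Hedge."""
    # Rows: Top, Bottom. Columns: Left, Mid, Right.
    u_O = [[0, -2, -2], [0, -2, 2]]
    u_L = [[eps, -1, 0], [-1, 1, 0]]   # eps: assumed small positive payoff
    eta = 1.0 / math.sqrt(T)           # assumed step size
    cumulative = [0.0, 0.0, 0.0]       # learner's cumulative utility per action
    total = 0.0
    for t in range(T):
        row = 0 if t < T // 2 else 1
        top = max(cumulative)
        weights = [math.exp(eta * (s - top)) for s in cumulative]
        norm = sum(weights)
        probs = [w / norm for w in weights]
        # Optimizer's expected utility this round under the learner's mixed play.
        total += sum(p * u_O[row][j] for j, p in enumerate(probs))
        # Experts setting: the learner observes the utility of every action.
        for j in range(3):
            cumulative[j] += u_L[row][j]
    return total / T

average = play_against_hedge(20_000)
print(average)
```

For moderate horizons the average stays clearly above the Stackelberg value of $0$; the remaining gap to $1 - \epsilon$ comes from the lower-order terms in the analysis.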
4.2 The geometry of mean-based learning
We have just seen that it is possible for the optimizer to get more than the Stackelberg value when playing against a mean-based learner. This raises an obvious next question: how much utility can an optimizer obtain when playing against a mean-based learner? What is the largest $V'$ such that an optimizer can always obtain utility $V'T - o(T)$ against a mean-based learner?
In this section, we will see how to reduce the problem of constructing the optimal gameplay of a non-adaptive optimizer to solving a control problem in $N$ dimensions. The primary insight is that a mean-based learner's behavior depends only on their historical cumulative utilities for each of their $N$ actions, and therefore we can characterize the essential "state" of the learner by a tuple of $N$ real numbers that represent the cumulative utilities for different actions. The optimizer can control the state of the learner by playing different actions, and in different regions of the state space the learner plays specific responses.
More formally, our control problem will involve constructing a path in $\mathbb{R}^N$ starting at the origin. For each $j \in [N]$, let $S_j$ equal the subset of points $(u_1, \dots, u_N) \in \mathbb{R}^N$ where $u_j = \max(u_1, \dots, u_N)$ (this will represent the subset of state space where the learner will play action $b_j$). Note that these sets (up to some intersection of measure 0) partition the entire space $\mathbb{R}^N$.
We represent the optimizer's strategy as a sequence of tuples $(\alpha_1, t_1), (\alpha_2, t_2), \dots, (\alpha_k, t_k)$ with $\alpha_i \in \Delta(\mathcal{A})$ and $t_i \in [0, 1]$ satisfying $\sum_{i=1}^{k} t_i = 1$. Here the tuple $(\alpha_i, t_i)$ represents the optimizer playing mixed strategy $\alpha_i$ for a $t_i$ fraction of the total rounds. This strategy evolves the learner's state as follows. The learner originally starts at the state $P_0 = 0$. After the $i$th tuple $(\alpha_i, t_i)$, the learner's state evolves according to $P_i = P_{i-1} + t_i \, (u_L(\alpha_i, b_1), u_L(\alpha_i, b_2), \dots, u_L(\alpha_i, b_N))$ (in fact, the state linearly interpolates between $P_{i-1}$ and $P_i$ as the optimizer plays this action). For simplicity, we will assume that positive combinations of vectors of the form $(u_L(\alpha, b_1), \dots, u_L(\alpha, b_N))$ can generate the entire state space $\mathbb{R}^N$.
To characterize the optimizer's reward, we must know which set $S_j$ the learner's state belongs to. For this reason, we will insist that for each $1 \le i \le k$, there exists a $j_i$ such that both $P_{i-1}$ and $P_i$ belong to the same region $S_{j_i}$. It is possible to convert any strategy into a strategy of this form by subdividing a step $(\alpha, t)$ that crosses a region boundary into two steps $(\alpha, t')$ and $(\alpha, t'')$ with $t = t' + t''$ so that the first step stops exactly at the region boundary. If there is more than one possible choice for $j_i$ (i.e., $P_{i-1}$ and $P_i$ lie on the same region boundary), then without loss of generality we let the optimizer choose $j_i$, since the optimizer can always modify the initial path slightly so that $P_{i-1}$ and $P_i$ both lie in a unique region.
Once we have done this, the optimizer's average utility per round is given by the expression:
$$U = \sum_{i=1}^{k} t_i \, u_O(\alpha_i, b_{j_i}).$$
Theorem 9. Let $U^* = \sup U$, where the supremum is over all valid strategies in this control game. Then
For any $\epsilon > 0$, there exists a non-adaptive strategy for the optimizer which guarantees expected utility at least $(U^* - \epsilon)T - o(T)$ when playing against any mean-based learner.
For any $\epsilon > 0$, there exists no non-adaptive strategy for the optimizer which can guarantee expected utility at least $(U^* + \epsilon)T$ when playing against any mean-based learner.
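The utility of a given control strategy can be evaluated mechanically by following the state path and splitting each step where it crosses a region boundary. The sketch below does this with exact rational arithmetic; the function name, the tie-breaking rule (at a boundary, prefer the coordinate that grows fastest, then the optimizer's preferred action), and the example game with an assumed payoff of $1/10$ for the learner in the Top/Left cell are illustrative choices rather than the paper's construction.

```python
from fractions import Fraction

def control_utility(u_O, u_L, steps):
    """Average utility of a control strategy given as [(alpha, t), ...]."""
    n = len(u_L[0])
    state = [Fraction(0)] * n
    total = Fraction(0)
    for alpha, t in steps:
        # Per-unit-time drift of the learner's state and optimizer payoff per column.
        d = [sum(a * u_L[i][j] for i, a in enumerate(alpha)) for j in range(n)]
        g = [sum(a * u_O[i][j] for i, a in enumerate(alpha)) for j in range(n)]
        remaining = Fraction(t)
        while remaining > 0:
            # Region just ahead: largest state, then fastest growth, then optimizer's choice.
            j = max(range(n), key=lambda c: (state[c], d[c], g[c]))
            # Times at which a faster-growing coordinate catches up with the leader.
            hits = [(state[j] - state[c]) / (d[c] - d[j]) for c in range(n) if d[c] > d[j]]
            dt = min([remaining] + [h for h in hits if h > 0])
            total += dt * g[j]
            state = [s + dt * r for s, r in zip(state, d)]
            remaining -= dt
    return total

eps = Fraction(1, 10)  # assumed Top/Left payoff for the learner
u_O = [[0, -2, -2], [0, -2, 2]]
u_L = [[eps, -1, 0], [-1, 1, 0]]
half = Fraction(1, 2)
switching = control_utility(u_O, u_L, [([1, 0], half), ([0, 1], half)])
committed = control_utility(u_O, u_L, [([half, half], 1)])
print(switching, committed)
```

On this example the two-phase strategy (Top, then Bottom) is worth $1 - \epsilon$ in the control problem, while committing to the even mix for the whole horizon is worth $0$.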
Understanding how to solve this control problem (even inefficiently, in finite time) is an interesting open problem. In the remainder of this section, we make some general observations which will let us cut down the strategy space of the optimizer even further, and propose a conjecture as to the form of the optimal strategy.
The first observation is that when the learner has $N$ actions, our state space is truly $N - 1$ dimensional, not $N$ dimensional. This is because in addition to the learner's actions only depending on the cumulative reward for each action, they in fact only depend on the differences between cumulative rewards for different actions (see Definition 2). This means we can represent the state of the learner as a vector $(x_1, x_2, \dots, x_{N-1})$, where $x_i = u_i - u_N$. The sets $S_i$ for $1 \le i \le N$ can be written in terms of the $x_i$ as
$$S_i = \{x : x_i = \max(x_1, \dots, x_{N-1}, 0)\} \ \text{for } i < N, \qquad S_N = \{x : \max(x_1, \dots, x_{N-1}) \le 0\}.$$
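The next observation is that if the optimizer makes several consecutive steps in the same region $S_j$, we can combine them into a single step. Specifically, assume $P_{i-1}$, $P_i$, and $P_{i+1}$ all belong to some region $S_j$, where $(\alpha_i, t_i)$ sends $P_{i-1}$ to $P_i$ and $(\alpha_{i+1}, t_{i+1})$ sends $P_i$ to $P_{i+1}$. Then replacing these two steps with $\left(\frac{t_i \alpha_i + t_{i+1} \alpha_{i+1}}{t_i + t_{i+1}},\ t_i + t_{i+1}\right)$ results in a strategy with the exact same reward $U$: the region $S_j$ is convex, so the combined step stays inside it, and utilities are linear in the mixed strategy. Applying this fact whenever possible, we can restrict our attention to strategies where all $P_i$ (with the possible exception of the final state $P_k$) lie on the boundary of two or more regions $S_j$.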
Finally, we observe that this control problem is scale-invariant; if
$$(\alpha_1, t_1), (\alpha_2, t_2), \dots, (\alpha_k, t_k)$$
is a valid policy that obtains utility $U$, then for any $\lambda > 0$,
$$(\alpha_1, \lambda t_1), (\alpha_2, \lambda t_2), \dots, (\alpha_k, \lambda t_k)$$
is another valid policy (with the exception that $\sum_i t_i = \lambda$, not $1$) which obtains utility $\lambda U$ (this is true since all the regions $S_j$ are cones with apex at the origin). This means we do not have to restrict to policies with $\sum_i t_i = 1$; we can choose a policy of any total time, as long as we normalize the utility by $\sum_i t_i$. This generalizes the strategy space, and is useful for the following reason. Consider a sequence of steps which starts at some point $P$ (not necessarily $0$) and ends at $P$. If $U'$ is the average utility of this cycle, then $U^* \ge U'$ (in particular, we can consider any policy which goes from $0$ to $P$ and then repeats this cycle many times). Likewise, if we have a sequence of steps which starts at some point $P$ and ends at $\lambda P$ for some $\lambda > 1$ and which achieves average utility $U'$, then again $U^* \ge U'$, by considering the policy which proceeds $0 \to P \to \lambda P \to \lambda^2 P \to \cdots$ (note that it is essential that $\lambda \ge 1$ to prevent this from converging back to $0$ in finite time).
These observations motivate the following conjecture.
The value $U^*$ is achieved by either:
The average utility of a policy starting at the origin and consisting of at most $N$ steps (in distinct regions).
The average utility of a path of at most $N$ steps (in distinct regions) which starts at some point $P$ and returns to $\lambda P$ for some $\lambda \ge 1$.
We leave it as an interesting open problem to compute the optimal solution to this control problem.
References
- Agrawal et al.  Shipra Agrawal, Constantinos Daskalakis, Vahab S. Mirrokni, and Balasubramanian Sivan. Robust repeated auctions under heterogeneous buyer behavior. In Proceedings of the 2018 ACM Conference on Economics and Computation, Ithaca, NY, USA, June 18-22, 2018, page 171, 2018.
- Aumann  Robert J. Aumann. Subjectivity and correlation in randomized strategies. Journal of Mathematical Economics, 1(1):67 – 96, 1974. ISSN 0304-4068.
- Blum and Mansour  Avrim Blum and Yishay Mansour. From external to internal regret. In Peter Auer and Ron Meir, editors, Learning Theory, 2005.
- Braverman et al.  Mark Braverman, Jieming Mao, Jon Schneider, and Matt Weinberg. Selling to a no-regret buyer. In Proceedings of the 2018 ACM Conference on Economics and Computation, pages 523–538. ACM, 2018.
- Cesa-Bianchi and Lugosi  Nicolò Cesa-Bianchi and Gábor Lugosi. Potential-based algorithms in on-line prediction and game theory. Machine Learning, 51(3):239–261, Jun 2003.
- Cesa-Bianchi et al.  Nicolò Cesa-Bianchi, Yoav Freund, David Haussler, David P. Helmbold, Robert E. Schapire, and Manfred K. Warmuth. How to use expert advice. J. ACM, 44(3):427–485, May 1997. ISSN 0004-5411.
- Farkas  Julius Farkas. Theorie der einfachen Ungleichungen. Journal für die reine und angewandte Mathematik, 124:1–27, 1902.
- Foster and Vohra  Dean P. Foster and Rakesh V. Vohra. A randomization rule for selecting forecasts. Operations Research, 41(4):704–709, 1993.
- Foster and Vohra  Dean P. Foster and Rakesh V. Vohra. Calibrated learning and correlated equilibrium. Games and Economic Behavior, 21(1):40 – 55, 1997.
- Foster and Vohra  Dean P. Foster and Rakesh V. Vohra. Asymptotic calibration. Biometrika, 85(2):379–390, 06 1998.
- Foster and Vohra  Dean P. Foster and Rakesh V. Vohra. Regret in the on-line decision problem. Games and Economic Behavior, 29(1):7 – 35, 1999.
- Freund and Schapire  Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119 – 139, 1997.
- Freund and Schapire  Yoav Freund and Robert E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1):79 – 103, 1999.
- Hannan  James Hannan. Approximation to Bayes risk in repeated plays. Contributions to the Theory of Games, 3:97–139, 1957.
- Hart and Mas-Colell  Sergiu Hart and Andreu Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000.
- Littlestone and Warmuth  N. Littlestone and M.K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212 – 261, 1994.
- von Neumann  John von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100(1):295–320, 1928.
- Von Stackelberg  Heinrich Von Stackelberg. Market structure and equilibrium. Springer Science & Business Media, 2010.
Proof of Theorem 9
Proof of Theorem 9.
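Part 1: Let $(\alpha_1, t_1), \dots, (\alpha_k, t_k)$ be a strategy for the control problem whose utility $U$ satisfies $U \ge U^* - \epsilon/2$. As suggested by this strategy, we will consider the strategy of the optimizer where for each $i$ (in order), the optimizer plays mixed strategy $\alpha_i$ for $t_i T$ rounds. We will show that this strategy guarantees an expected utility of at least $(U^* - \epsilon)T - o(T)$ for the optimizer.
Since the learner is mean-based, they are playing a $\gamma$-mean-based algorithm for some $\gamma = o(1)$. As in Definition 2, let $\sigma_{j,t}$ be the learner's cumulative utility from playing action $b_j$ for rounds $1$ through $t$, and write $\sigma_t = (\sigma_{1,t}, \dots, \sigma_{N,t})$. For $0 \le i \le k$, let $\tau_i = \sum_{i' \le i} t_{i'} T$ (with $\tau_0 = 0$). For $t \in [0, T]$, let $P(t)$ be the state of the control problem at time $t/T$ (linearly interpolating between $P_{i-1}$ and $P_i$ if $\tau_{i-1} \le t \le \tau_i$); note that $P(\tau_i) = P_i$. We will first show that with high probability, $\|\sigma_t - T \cdot P(t)\|_\infty = o(T)$ for all $t$; in other words, $T \cdot P(t)$ provides a good approximation of the true cumulative utilities of the learner in the repeated game.
To see this, we first claim that $\|\mathbb{E}[\sigma_t] - T \cdot P(t)\|_\infty = O(k)$. Fix any round $t$ in $[\tau_{i-1}, \tau_i]$; this means that the optimizer plays strategy $\alpha_i$ during round $t$, and therefore that $\mathbb{E}[\sigma_t - \sigma_{t-1}] = (u_L(\alpha_i, b_1), \dots, u_L(\alpha_i, b_N))$. If $t - 1$ also belongs to $[\tau_{i-1}, \tau_i]$ (so $t$ and $t - 1$ both belong to $[\tau_{i-1}, \tau_i]$), we also have that $T \cdot (P(t) - P(t-1)) = (u_L(\alpha_i, b_1), \dots, u_L(\alpha_i, b_N))$. Since there are only $k$ intervals, $t$ and $t - 1$ belong to the same interval for all but $k$ rounds, and since utilities are bounded by $1$ it follows that $\|\mathbb{E}[\sigma_t] - T \cdot P(t)\|_\infty = O(k)$. Now, we also claim that with high probability (at least $1 - 1/T$), for all $t$, $\|\sigma_t - \mathbb{E}[\sigma_t]\|_\infty = O(\sqrt{T \log(NT)})$. This follows simply from Hoeffding's inequality and a union bound, since each component of $\sigma_t$ is the sum of $t$ independent random variables bounded in $[-1, 1]$. Together, this implies that $\|\sigma_t - T \cdot P(t)\|_\infty = o(T)$.
We now claim that for sufficiently large $T$, the learner will play action $b_{j_i}$ with probability at least $1 - N\gamma$ in at least a $(1 - \epsilon/4)$ fraction of the rounds in $[\tau_{i-1}, \tau_i]$. To see this, recall that $S_{j_i}$ is the unique region containing both $P_{i-1}$ and $P_i$. Since regions are convex with disjoint interiors, this means that the segment connecting $P_{i-1}$ and $P_i$ lies in the interior of $S_{j_i}$, except possibly at its endpoints. By the definition of $S_{j_i}$, this implies that there exists some $\delta > 0$ such that for at least a $(1 - \epsilon/4)$ fraction of $t$ in the interval $[\tau_{i-1}, \tau_i]$, $P(t)$ satisfies $P(t)_{j_i} \ge P(t)_{j'} + \delta$ for all $j' \ne j_i$. Since $\|\sigma_t - T \cdot P(t)\|_\infty = o(T)$ for all $t$, this means that for at least a $(1 - \epsilon/4)$ fraction of rounds in $[\tau_{i-1}, \tau_i]$, we have that $\sigma_{j_i,t} \ge \sigma_{j',t} + \delta T - o(T)$ for all $j' \ne j_i$. For sufficiently large $T$, this gap is bigger than $\gamma T$ (which is also $o(T)$), so each action $b_{j'}$ with $j' \ne j_i$ is played with probability at most $\gamma$.
Therefore, for each $i$, for at least $(1 - \epsilon/4) t_i T$ rounds, the optimizer plays the mixed strategy $\alpha_i$ and the learner plays action $b_{j_i}$ with probability at least $1 - N\gamma$. Since utilities are bounded by $1$ in absolute value, each remaining round costs the optimizer at most $2$ relative to $u_O(\alpha_i, b_{j_i})$, so the optimizer's total expected utility is at least
$$\sum_{i=1}^{k} t_i T \, u_O(\alpha_i, b_{j_i}) - \frac{\epsilon}{2} T - o(T) = \Big(U - \frac{\epsilon}{2}\Big)T - o(T) \ge (U^* - \epsilon)T - o(T).$$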
Part 2: Assume there exists such a family (one for each $T$) of non-adaptive strategies $(\alpha^1, \alpha^2, \dots, \alpha^T)$ for the optimizer. Since this strategy must work against any mean-based learner, we will construct a bad mean-based learner for this strategy in the following way. Fix $\gamma = T^{-1/2}$ (any $\gamma = o(1)$ with $\gamma T \ge 2$ will work). At any time $t$, let $J_t$ be the set of actions for the learner whose historical performance is within $\gamma T$ of that of the best-performing action. The mean-based property requires the learner to play an action in $J_t$ with probability at least $1 - N\gamma$. Our mean-based learner will choose the worst action in $J_t$ for the optimizer; that is, the action $b_j$ with $j \in J_t$ which minimizes $u_O(\alpha^t, b_j)$.
Now, choose a sufficiently large $T$ such that this strategy achieves average utility at least $U^* + \epsilon$ for the optimizer against this mean-based learner. We now claim we can construct a solution to the control problem whose utility $U$ satisfies $U > U^*$, contradicting the optimality of $U^*$. Consider the protocol $(\alpha^1, 1/T), (\alpha^2, 1/T), \dots, (\alpha^T, 1/T)$. This is not a proper protocol, since some of the steps of this protocol might start in one region $S_j$ and end in a different region $S_{j'}$, but for any such steps we can divide them into substeps per region as described earlier.
We now claim that the step $(\alpha^t, 1/T)$ only passes through regions $S_j$ with $j$ in the set $J_t$. To see this, note that the states $P_{t-1}$ and $P_t$ before and after this step differ in each coordinate by at most $1/T$ (since all utilities are bounded by $1$). Therefore if the segment between $P_{t-1}$ and $P_t$ passes through a point on the boundary of $S_j$ and $S_{j'}$ (where the $j$th and $j'$th coordinates are equal and maximal), it must be the case that the $j$th and $j'$th coordinates of $P_{t-1}$ are both within $2/T$ of the largest coordinate of $P_{t-1}$. By construction $\gamma T \ge 2$, so this implies that $j \in J_t$, and therefore the region $S_j$ is allowed (similarly, $j' \in J_t$).
Now, if the step $(\alpha^t, 1/T)$ only passes through regions $S_j$ with $j$ in the set $J_t$, it obtains utility for the optimizer at least $\frac{1}{T} \min_{j \in J_t} u_O(\alpha^t, b_j)$, and thus
$$U \ge \frac{1}{T} \sum_{t=1}^{T} \min_{j \in J_t} u_O(\alpha^t, b_j).$$
But this sum is exactly the utility of the optimizer against our mean-based learner, which is at least $(U^* + \epsilon)T$. It follows that $U \ge U^* + \epsilon > U^*$, contradicting that $U^*$ is optimal.