
# Adversarial Bandits with Knapsacks

*Compared to a submission to ACM STOC 2019, this version adds extensions (Section 6) and an expanded discussion of related work and open questions.*

Nicole Immorlica (Microsoft Research, New York, NY; nicimm@microsoft.com)
Karthik A. Sankararaman (University of Maryland, College Park, MD; kabinav@cs.umd.edu; part of this work was done while an intern at Microsoft Research, New York, NY; supported in part by NSF Awards CNS 1010789, CCF 1422569)
Robert Schapire (Microsoft Research, New York, NY; schapire@microsoft.com)
Aleksandrs Slivkins (Microsoft Research, New York, NY; slivkins@microsoft.com)
November 2018
###### Abstract

We consider Bandits with Knapsacks (henceforth, BwK), a general model for multi-armed bandits under supply/budget constraints. In particular, a bandit algorithm needs to solve a well-known knapsack problem: find an optimal packing of items into a limited-size knapsack. The BwK problem is a common generalization of numerous motivating examples, which range from dynamic pricing to repeated auctions to dynamic ad allocation to network routing and scheduling. While the prior work on BwK focused on the stochastic version, we pioneer the other extreme in which the outcomes can be chosen adversarially. This is a considerably harder problem, compared to both the stochastic version and the “classic” adversarial bandits, in that regret minimization is no longer feasible. Instead, the objective is to minimize the competitive ratio: the ratio of the benchmark reward to the algorithm’s reward.

We design an algorithm with competitive ratio O(log T) relative to the best fixed distribution over actions, where T is the time horizon; we also prove a matching lower bound. The key conceptual contribution is a new perspective on the stochastic version of the problem. We suggest a new algorithm for the stochastic version, which builds on the framework of regret minimization in repeated games and admits a substantially simpler analysis compared to prior work. We then analyze this algorithm for the adversarial version, and use it as a subroutine to solve the latter.

## 1 Introduction

Multi-armed bandits is a simple abstraction for the tradeoff between exploration and exploitation, i.e., between making potentially suboptimal decisions for the sake of acquiring new information and using this information for making better decisions. Studied over many decades, multi-armed bandits is a very active research area spanning computer science, operations research, and economics [26, 17, 38, 21].

In this paper, we focus on bandit problems which feature supply or budget constraints, as is the case in many realistic applications. For example, a seller who experiments with prices may have a limited inventory, and a website optimizing ad placement may be constrained by the advertisers’ budgets. This general problem is called Bandits with Knapsacks (BwK) since, in this model, a bandit algorithm needs effectively to solve a knapsack problem (find an optimal packing of items into a limited-size knapsack) or generalization thereof. The BwK model was introduced in [14] as a common generalization of numerous motivating examples, ranging from dynamic pricing to ad allocation to repeated auctions to network routing/scheduling. Various special cases with budget/supply constraints were studied previously, e.g., [18, 12, 13, 64, 29].

In BwK, the algorithm is endowed with limited resources that are consumed by the algorithm. In each round, the algorithm chooses an action (arm) from a fixed set of actions, and the outcome consists of a reward and consumption of each resource; all are assumed to lie in [0, 1]. The algorithm observes bandit feedback, i.e., only the outcome of the chosen arm. The algorithm stops at time horizon T, or when the total consumption of some resource exceeds its budget. The goal is to maximize the total reward, denoted REW.

For a concrete example, consider dynamic pricing. (Footnote 2: See Section 8 in [14] for a detailed discussion of this and many other examples.) The algorithm is a seller with limited supply of some product. In each round, a new customer arrives, the algorithm chooses a price p, and the customer either buys one item at this price or leaves. A sale at price p implies reward of p and consumption of 1. This example easily extends to d products/resources. Now in each round the algorithm chooses the per-unit price for each resource, and the customer decides how much of each resource to buy at these prices.
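To make the mapping concrete, the round-level bookkeeping of this example can be sketched as follows; the function name and the threshold-buyer model are ours, purely for illustration:

```python
# A minimal sketch of how one dynamic-pricing round maps to a BwK outcome
# vector; the function name and the threshold-buyer model are illustrative.

def pricing_outcome(price, customer_buys):
    """Return (reward, inventory consumption) for one round."""
    if customer_buys:
        return (price, 1.0)   # sale at price p: reward p, one unit consumed
    return (0.0, 0.0)

# A customer with private value 0.7 buys iff the price is at most 0.7:
assert pricing_outcome(0.5, 0.5 <= 0.7) == (0.5, 1.0)
assert pricing_outcome(0.9, 0.9 <= 0.7) == (0.0, 0.0)
```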

Prior work on BwK focused on the stochastic version of the problem, called Stochastic BwK, where the outcome of each action is drawn from a fixed distribution. This problem has been solved optimally using three different techniques [14, 4], and extended in various directions in subsequent work [4, 15, 6, 5].

We go beyond the stochastic version, and instead study the most “pessimistic”, adversarial version, where the rewards and resource consumptions can be arbitrary. We call it adversarial bandits with knapsacks (Adversarial BwK), as it extends the classic model of “adversarial bandits” [10]. Bandits aside, this problem subsumes online packing problems [50, 24], where the algorithm observes full feedback (the outcomes of all possible actions) in each round, and observes it before choosing an action.

Hardness of the problem. Adversarial BwK is a much harder problem compared to Stochastic BwK. The new challenge is that the algorithm needs to decide how much budget to save for the future, without being able to predict it. (It is also the essential challenge in online packing problems, and it drives our lower bounds.) This challenge compounds the ones already present in Stochastic BwK: that exploitation may be severely limited by the resource consumption during exploration, that optimal per-round reward no longer guarantees optimal total reward, and that the best fixed distribution over arms may perform much better than the best fixed arm. Jointly, these challenges amount to the following. An algorithm for Adversarial BwK must compete, during any given time segment [t, t′], with a distribution over arms that maximizes the total reward on this time segment. However, this distribution may behave very differently, in terms of expected per-round outcomes, compared to the optimal distribution for some other time segment [s, s′].

In more concrete terms, let OPT_FD be the total expected reward of the best fixed distribution over arms. In Stochastic BwK (as well as in adversarial bandits) an algorithm can achieve sublinear regret: OPT_FD − E[REW] = o(T). (Footnote 3: More specifically, one can achieve regret Õ(√(KT)) for adversarial bandits [10], where K is the number of arms, as well as for Stochastic BwK if all budgets are Ω(T) [14]. One can achieve sublinear regret for Stochastic BwK whenever the budgets are not too small [14].) In contrast, in Adversarial BwK regret minimization is no longer possible, and we therefore are primarily interested in the competitive ratio OPT_FD / E[REW].

It is instructive to consider a simple example in which the competitive ratio is at least 3/2 for any algorithm. There are two arms and one resource with budget B = T/2. Arm 1 has zero rewards and zero consumption. Arm 2 has consumption 1 in each round, and offers reward 1/2 in each round of the first half-time (T/2 rounds). In the second half-time, it offers either reward 0 in all rounds, or reward 1 in all rounds. Thus, there are two problem instances that coincide for the first half-time and differ in the second half-time. The algorithm needs to choose how much budget to invest in the first half-time, without knowing what comes in the second. Any choice leads to competitive ratio at least 3/2 on one of the problem instances.
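Under the reconstructed parameters of this example (budget B = T/2, first-half reward 1/2 per unit of consumption, second-half reward either 0 or 1), a short brute-force search recovers the 3/2 bound. This is a numerical sanity check, not part of the formal argument:

```python
# Brute-force check of the two-instance example. The algorithm commits
# s units of budget to the first half; instance A makes the second half
# worthless, instance B makes it pay 1 per unit of budget.

T, B = 1000, 500

def worst_ratio(s):
    ratio_a = (B / 2) / (s / 2)            # instance A: OPT spends all early
    ratio_b = B / (s / 2 + (B - s))        # instance B: OPT spends all late
    return max(ratio_a, ratio_b)

best = min(worst_ratio(s) for s in range(1, B + 1))
assert abs(best - 1.5) < 0.01              # the 3/2 bound, up to the grid
```

The optimum commits s = (2/3)B to the first half, equalizing the two ratios at 3/2.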

Extending this idea, we prove an even stronger logarithmic lower bound on the competitive ratio:

 $\mathrm{OPT_{FD}} \,/\, \mathbb{E}[\mathrm{REW}] \ \ge\ \Omega(\log T).$ (1.1)

Like the simple example above, the lower-bounding construction involves only two arms and only one resource, and forces the algorithm to make a huge commitment without knowing the future.

Algorithmic contributions. Our main result is an algorithm for Adversarial BwK whose performance guarantee nearly matches the lower bound (1.1), achieving

 $\mathbb{E}[\mathrm{REW}] \ \ge\ \mathrm{OPT_{FD}} \,/\, O(\log T) \ -\ o(T).$ (1.2)

We put forward a new algorithm for BwK, called LagrangeBwK, that unifies the stochastic and adversarial versions. It has a natural game-theoretic interpretation for Stochastic BwK, and admits a simpler analysis compared to the three algorithms in prior work. For Adversarial BwK, we use LagrangeBwK as a subroutine, though with a different parameter and a different analysis, to derive two algorithms: a simple one that achieves (1.2), and a more involved one that achieves the same competitive ratio with high probability. Absent resource consumption, we recover the optimal Õ(√(KT)) regret for adversarial bandits.

LagrangeBwK is based on a new perspective on Stochastic BwK. We reframe a standard linear relaxation for Stochastic BwK in a way that gives rise to a repeated zero-sum game, where the two players choose among arms and resources, respectively, and the payoffs are given by the Lagrange function of the linear relaxation. Our algorithm consists of two online learning algorithms playing this repeated game. We analyze this algorithm for Stochastic BwK, building on the standard tools from regret minimization in stochastic games, and achieve a near-optimal regret bound when the optimal value and the budgets are Ω(T). (Footnote 4: This regime is of primary importance in prior work, e.g., [18, 72].)

We obtain several extensions, where we derive improved performance guarantees for some scenarios. These extensions showcase the modularity of LagrangeBwK, in the sense that the two players can be implemented as arbitrary algorithms for adversarial online learning that admit a given regret bound. Each extension follows from the main results, with a different choice of the players’ algorithms.

Discussion. LagrangeBwK has numerous favorable properties. As just discussed, it is simple, unifying, modular, and yields strong bounds on performance in multiple settings. It is the first “black-box reduction” from bandits to BwK: we take a bandit algorithm and use it as a subroutine for BwK. This is a very natural algorithm for the stochastic version once the single-shot game is set up; indeed, it is immediate from prior work that the repeated game converges (in some formal sense) to the optimal distribution over arms. Its regret analysis for Stochastic BwK is extremely clean. Compared to prior work [14, 4], we side-step the intricate analysis of sensitivity of the linear program to non-uniform stochastic deviations that arise from adaptive exploration.

LagrangeBwK has a primal-dual interpretation, as arms and resources correspond respectively to primal and dual variables in the linear relaxation. Two players in the repeated game can be seen as the respective primal algorithm and dual algorithm. Compared to the rich literature on primal-dual algorithms [74, 24, 50] (including the more recent literature on stochastic online packing problems [30, 7, 31, 34, 52]) LagrangeBwK has a very specific and modular structure dictated by the repeated game; in particular, dual variables are updated very explicitly.

Other benchmarks. Both our result and prior work on Stochastic BwK can be extended to a stronger benchmark, namely, the total expected reward of the best dynamic policy (the best algorithm, in hindsight, that is allowed to switch arms arbitrarily across time-steps), denoted OPT_DP. Likewise, prior work on online packing problems achieves a similar competitive ratio against this stronger benchmark. However, we argue that this benchmark is too strong for Adversarial BwK: we show a simple example with just one resource, where the competitive ratio relative to OPT_DP is at least Ω̃(√T) for any algorithm.

A traditional benchmark in multi-armed bandits is the expected reward of the best fixed arm, denoted OPT_FA. This is a much weaker benchmark for Stochastic BwK: there are simple examples with OPT_FD almost twice as large as OPT_FA in multiple special cases of interest [14]. (Footnote 5: E.g., we have two arms and two resources. In each round, each arm a ∈ {1, 2} collects reward 1, consumes 1 unit of resource a, and 0 units of the other resource. Both budgets are B = T/2, and the best fixed distribution plays both arms with equal probability. Then OPT_FA = T/2, whereas OPT_FD is close to T.) For the adversarial version, OPT_FA is both weaker and less interesting: we show that the competitive ratio relative to OPT_FA is at least K (the number of arms) in the worst case, and this is matched, in expectation, by a trivial algorithm that samples one arm at random and sticks with it forever.

Related work. The literature on regret-minimizing online learning algorithms is vast; see [26, 21, 43] for background. Most relevant are two algorithms for adversarial rewards/costs: Hedge for full feedback [37], and EXP3 for bandit feedback [10]; both are based on the weighted majority algorithm from [48].

Stochastic BwK was introduced and optimally solved in [14]. Subsequent work extended these results to convex optimization [4], combinatorial semi-bandits [60], and contextual bandits [15, 6, 5]. Several special cases with budget/supply constraints were studied previously: dynamic pricing [18, 12, 19, 72], dynamic procurement [13, 64] (a version of dynamic pricing where the algorithm is a buyer rather than a seller), dynamic ad allocation [65, 29], and a version with a single resource and unlimited time [41, 69, 70, 32]. While all this work is on regret minimization, [39, 40] studied closely related Bayesian formulations.

Stochastic BwK was optimally solved using three different algorithms [14, 4]. One of them is a primal-dual algorithm superficially similar to ours [14]. Indeed, it decouples into a “primal algorithm” that receives bandit feedback and controls arms, and a “dual algorithm” that receives full feedback, outputs distributions over resources, and can be implemented via Hedge. However, the two algorithms are not playing a game in any meaningful sense. Another difference is that the primal algorithm in [14] is very specific and inherently ‘stochastic’ in nature, whereas ours is generic and adversarial. (Footnote 6: The primal algorithm in [14] focuses on the expected reward to expected cost ratio of an arm, interpreting the distribution chosen by the dual player as a vector of costs over resources, and uses ‘optimism under uncertainty’ to choose among arms.) The other two algorithms for Stochastic BwK are also based on inherently ‘stochastic’ techniques, namely, successive elimination and optimism under uncertainty. None of the three algorithms appears to extend to the adversarial problem.

Our approach to using regret minimization in games can be traced to [35, 37] (see Ch. 6 in [61]), who showed how a repeated zero-sum game played by two agents yields an approximate Nash equilibrium. This approach has been used as a unifying algorithmic framework for several problems: boosting [35], linear programs [8], maximum flow [28], and convex optimization [1, 71]. While we use a result with the O(1/√T) convergence rate for the equilibrium property, recent literature obtains faster convergence for cumulative payoffs (but not for the equilibrium property) under various assumptions (e.g., [55, 66, 73]).

Repeated Lagrangian games, in conjunction with regret minimization in games, have been used in a series of recent papers [57, 44, 59, 46, 2, 58], as an algorithmic tool to solve convex optimization problems; application domains range from differential privacy to algorithmic fairness to learning from revealed preferences. All these papers deal with deterministic games (i.e., same game matrix in all rounds). Reframing the problem in terms of repeated Lagrangian games is a key technical insight in this work. Most related to our paper are [59, 58], where a repeated Lagrangian game is used as a subroutine (the “inner loop”) in an online algorithm; the other papers solve an offline problem. We depart from this prior work in several respects: we use a stochastic game, we deal with some subtleties specific to Stochastic BwK, and we provide a very different analysis for our main results on Adversarial BwK, where we cannot rely on the standard machinery.

Online packing problems (e.g., [25, 31], see [24] for a survey) can be seen as a special case of Adversarial BwK with a much more permissive feedback model: the algorithm observes full feedback (the outcomes for all actions) before choosing an action. Online packing subsumes various online matching problems, including the AdWords problem [51] motivated by ad allocation (see [50] for a survey). While we derive an O(log T) competitive ratio against OPT_FD, online packing admits a similar result against OPT_DP. (Footnote 7: An extension to contextual BwK allows us to treat the full feedback as a context, and compete against the best fixed distribution over policies (see Section 6.3). The latter benchmark lies between OPT_FD and OPT_DP, and it coincides with OPT_DP for the AdWords problem. However, for AdWords one can obtain a constant competitive ratio, whereas our results only imply an O(log T) ratio.)

## 2 Preliminaries

We use bold fonts to represent vectors and matrices. We use standard notation whereby, for a positive integer n, [n] stands for {1, 2, …, n}, and Δ_n denotes the set of all probability distributions on [n]. Some of the notation introduced later on is summarized in the appendix.

Bandits with Knapsacks (BwK). There are T rounds, K possible actions and d resources, indexed as [T], [K], [d], respectively. In each round t ∈ [T], the algorithm chooses an action a_t ∈ [K] and receives an outcome vector o_t = (r_t; c_{t,1}, …, c_{t,d}) ∈ [0,1]^{d+1}, where r_t is a reward and c_{t,i} is the consumption of each resource i ∈ [d]. Each resource i is endowed with budget B_i ≤ T. The game stops early, at some round τ, when/if the total consumption of some resource exceeds its budget. The algorithm’s objective is to maximize its total reward. Without loss of generality, all budgets are the same: B_1 = ⋯ = B_d = B. (Footnote 8: To see that this is indeed w.l.o.g., for each resource i, divide all per-round consumptions of this resource by B_i/B, where B = min_i B_i is the smallest budget. In the modified problem instance, all consumptions still lie in [0,1], and all the budgets are equal to B.)
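The rescaling in the footnote can be sketched as follows; a minimal illustration, with a function name of our choosing:

```python
# Sketch of the rescaling: divide resource-i consumption by B_i / B, where
# B = min_i B_i; consumptions stay in [0, 1] and budget violations are
# preserved. The function name is ours.

def normalize(consumption, budgets):
    """consumption: one round's vector, c[i] in [0, 1]; budgets: the B_i."""
    B = min(budgets)
    return [c_i / (B_i / B) for c_i, B_i in zip(consumption, budgets)]

budgets = [10.0, 40.0]
c = [0.8, 1.0]
assert normalize(c, budgets) == [0.8, 0.25]   # all entries remain in [0, 1]
# Exceeding B_i before rescaling is equivalent to exceeding B after:
total_rounds = 50
assert (total_rounds * c[1] > budgets[1]) == \
       (total_rounds * normalize(c, budgets)[1] > min(budgets))
```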

The outcome vectors are chosen as follows. In each round t, the adversary chooses the outcome matrix M_t ∈ [0,1]^{K×(d+1)}, whose rows correspond to actions. The outcome vector o_t is defined as the a_t-th row of this matrix, denoted M_t(a_t). Only this row is revealed to the algorithm. The adversary is deterministic and oblivious, meaning that the entire sequence M_1, …, M_T is chosen before round 1. A problem instance of BwK consists of the (known) parameters (d, K, T, B) and the (unknown) sequence M_1, …, M_T.

In the stochastic version of BwK, henceforth termed Stochastic BwK, each outcome matrix M_t is drawn independently from some fixed but unknown distribution D over the outcome matrices. An instance of this problem consists of the (known) parameters (d, K, T, B) and the (unknown) distribution D.

Following prior work [14, 4], we assume, w.l.o.g., that one of the resources is a dummy resource similar to time; formally, each action consumes B/T units of this resource per round (we only need this for Stochastic BwK). Further, we posit that one of the actions is a null action, which lets the algorithm skip a round: it has reward 0 and consumes 0 amount of each resource other than the dummy resource.

Benchmarks. Let REW = REW(ALG) be the total reward of algorithm ALG in the BwK problem. Our benchmark is the best fixed distribution: a distribution over actions which maximizes E[REW] for a particular problem instance. The expected total reward of this distribution is denoted OPT_FD.

For Stochastic BwK, one can compete with the best dynamic policy: an algorithm that maximizes E[REW] for a particular problem instance. Essentially, this algorithm knows the latent distribution D over outcome matrices. Its expected total reward is denoted OPT_DP.

Adversarial online learning. To state the framework of “regret minimization in games” below, we need to introduce the protocol of adversarial online learning, see Figure 1: in each round t ∈ [T], the adversary chooses a payoff vector f_t : A → [a, b] over a fixed action set A, the algorithm chooses a distribution p_t over A and draws an arm a_t from p_t, and then the payoff f_t(a_t) and, possibly, some auxiliary feedback are observed.

In this protocol, the adversary can use the previously chosen arms a_1, …, a_{t−1} to choose the payoff vector f_t, but not the algorithm’s random seed. The distribution p_t is chosen as a deterministic function of history. (The history at round t consists, for each round s < t, of the chosen action and the observed feedback in that round.) We focus on two feedback models: bandit feedback (no auxiliary feedback) and full feedback (the entire payoff vector f_t). The version for costs can be defined similarly, by setting the payoffs to be the negative of the costs.

We are interested in adversarial online learning algorithms with known upper bounds on regret,

 $\mathrm{OPT} - \sum_{t\in[\tau]} f_t(a_t), \quad\text{where}\quad \mathrm{OPT} = \max_{a\in A}\, \sum_{t\in[\tau]} f_t(a).$

Here OPT is the total payoff of the best arm, according to the payoff vectors actually chosen by the adversary. More precisely, we need regret bounds that hold with probability at least 1 − δ, for some known parameter δ > 0. The high-probability statement needs to hold simultaneously for all rounds τ ∈ [T]. (Footnote 9: This is w.l.o.g. because one can start with an algorithm that satisfies a regret bound only for τ = T, and then employ the standard “doubling trick” (e.g., see Section 2.3 in [26]) to obtain bounds that hold for all τ ∈ [T].) Thus, the regret bound is stated as follows, for some regret term R_δ(τ):

 $\Pr\left[\, \forall \tau \in [T]: \ \ \mathrm{OPT} - \sum_{t\in[\tau]} f_t(a_t) \ \le\ (b-a)\, R_\delta(\tau) \,\right] \ \ge\ 1 - \delta.$ (2.1)

The payoff range [a, b] can be given as a parameter to the algorithm. Prior work offers algorithms EXP3.P.1 for bandit feedback [10], and Hedge for full feedback [36]. (Footnote 10: When we refer to Hedge, we mean Hedge combined with the doubling trick so as to obtain bounds that hold for all τ ∈ [T].) They satisfy regret bound (2.1) with

 $R_\delta(\tau) = O\left(\sqrt{|A|\, \tau\, \log(T/\delta)}\right) \text{ for EXP3.P.1,} \qquad R_\delta(\tau) = O\left(\sqrt{\tau\, \log(|A|\, T/\delta)}\right) \text{ for Hedge.}$ (2.2)
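For concreteness, here is a minimal sketch of Hedge (exponential weights) with full feedback; the learning rate is a standard textbook choice and is not tuned to match the exact constants behind Eq. (2.2):

```python
import math, random

# Minimal Hedge (exponential weights) with full feedback on payoffs in [0, 1].
# The learning rate is a standard textbook choice; constants are illustrative.

def hedge(payoffs, eta, seed=0):
    """payoffs: per-round payoff vectors; returns the realized total payoff."""
    K = len(payoffs[0])
    w = [1.0] * K
    rng = random.Random(seed)
    total = 0.0
    for f in payoffs:
        z = sum(w)
        p = [wi / z for wi in w]               # play proportionally to weights
        a = rng.choices(range(K), weights=p)[0]
        total += f[a]
        w = [wi * math.exp(eta * fi) for wi, fi in zip(w, f)]  # full feedback
    return total

T, K = 5000, 3
rounds = [[1.0 if a == 0 else 0.4 for a in range(K)] for _ in range(T)]
rew = hedge(rounds, eta=math.sqrt(math.log(K) / T))
opt = float(T)                  # the best arm (arm 0) pays 1 every round
assert 0.0 <= rew <= opt
assert opt - rew <= 0.2 * T     # realized regret is a small fraction of T
```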

Regret minimization in games. We build on the framework of regret minimization in games. A zero-sum game (A_1, A_2, G) is a game between two players j ∈ {1, 2} with action sets A_1 and A_2 and payoff matrix G ∈ ℝ^{A_1 × A_2}. If each player j chooses an action a_j ∈ A_j, the outcome is the number G(a_1, a_2). Player 1 receives this number as reward, and player 2 receives it as cost. A repeated zero-sum game with action sets A_1, A_2, time horizon T and game matrices G_1, …, G_T is a game between two algorithms, ALG_1 and ALG_2, which proceeds over T rounds such that each round t is a zero-sum game (A_1, A_2, G_t). The goal of ALG_1 is to maximize the total reward, and the goal of ALG_2 is to minimize the total cost.

The game is called stochastic if the game matrix G_t in each round t is drawn independently from some fixed distribution. For such games, we are interested in the expected game, defined by the expected game matrix G = E[G_t]. We can relate the algorithms’ performance to the minimax value v* of G.

###### Lemma 2.3.

Consider a stochastic repeated zero-sum game between algorithms ALG_1 and ALG_2, with payoff range [a, b]. Let v* be the minimax value for the expected game G = E[G_t]. Assume that each ALG_j, j ∈ {1, 2}, is an adversarial online learning algorithm which satisfies a regret bound (2.1) with R_δ(·) = R_{j,δ}(·) and payoff range [a, b]. Here δ > 0 is some fixed parameter. Let A_1 be the action set of ALG_1, let p_t be the distribution chosen by ALG_1 in each round t, and let X̄_τ = (1/τ) Σ_{t∈[τ]} p_t be the corresponding average play distribution. Then with probability at least 1 − 2δ it holds that

 $\forall \tau \in [T],\ \forall \lambda \in \Delta(A_2): \quad \bar{X}_\tau^{\top}\, \mathbf{G}\, \lambda \ \ge\ v^* - (b-a)\cdot \frac{R_{1,\delta}(\tau) + R_{2,\delta}(\tau)}{\tau}.$ (2.4)

Eq. (2.4) states that the average play X̄_τ of player 1 is approximately optimal against any distribution λ chosen by player 2. (Footnote 11: If each player j chooses distribution p_j ∈ Δ(A_j), and the game matrix is G, then the expected reward/cost is p_1^⊤ G p_2.) This lemma is well-known for the deterministic case (i.e., when G_t = G for each round t), and folklore for the stochastic case. We provide a proof in the appendix for the sake of completeness.
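As a numerical sanity check of this phenomenon, two exponential-weights learners playing a repeated zero-sum game drive the row player's realized average payoff toward the minimax value. The 2×2 matrix below (matching-pennies-like, value 1/2) is a sketch of our choosing, not a construction from the paper:

```python
import math, random

# Two exponential-weights learners in a repeated zero-sum game; the row
# player maximizes G, the column player minimizes it. The matrix has
# minimax value 1/2, which the realized average payoff approaches.

G = [[0.0, 1.0],
     [1.0, 0.0]]
T = 20000
eta = math.sqrt(math.log(2) / T)
w_row, w_col = [1.0, 1.0], [1.0, 1.0]
rng = random.Random(1)
total = 0.0
for _ in range(T):
    p = [w / sum(w_row) for w in w_row]
    q = [w / sum(w_col) for w in w_col]
    i = rng.choices([0, 1], weights=p)[0]
    j = rng.choices([0, 1], weights=q)[0]
    total += G[i][j]
    # full feedback: row player sees column j, column player sees row i
    w_row = [w * math.exp(eta * G[a][j]) for a, w in enumerate(w_row)]
    w_col = [w * math.exp(-eta * G[i][b]) for b, w in enumerate(w_col)]

avg = total / T
assert abs(avg - 0.5) < 0.05   # close to the minimax value
```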

## 3 A new algorithm for Stochastic BwK

We present a new algorithm for Stochastic BwK, based on the framework of regret minimization in games. This is a very natural algorithm once the single-shot game is set up, and it allows for a very clean regret analysis. We will also use this algorithm as a subroutine for the adversarial version.

On a high level, we define a stochastic zero-sum game for which a mixed Nash equilibrium corresponds to an optimal solution for a linear relaxation of the original problem. Our algorithm consists of two regret-minimizing algorithms playing this game. The framework of regret minimization in games guarantees that the average primal and dual play distributions (X̄_τ and λ̄_τ in Lemma 2.3) approximate the mixed Nash equilibrium in the expected game, which correspondingly approximates the optimal solution.

### 3.1 Linear relaxation and Lagrange functions

We start with a linear relaxation of the problem that all prior work relies on. This relaxation is stated in terms of expected rewards/consumptions, i.e., implicitly, in terms of the expected outcome matrix M = E[M_t]. We explicitly formulate the relaxation in terms of M, and this is essential for the subsequent developments. For ease of notation, we write the a-th row of M, for each action a ∈ [K], as

 $\mathbf{M}(a) = \left(\, r_{\mathbf{M}}(a);\ c_{\mathbf{M},1}(a),\ \ldots,\ c_{\mathbf{M},d}(a) \,\right),$

so that r_M(a) is the expected reward and c_{M,i}(a) is the expected consumption of each resource i ∈ [d].

Essentially, the relaxation assumes that each instantaneous outcome matrix M_t is equal to the expected outcome matrix M, and that time is fractional. The relaxation seeks the best distribution over actions, focusing on a single round with budgets rescaled as B/T. This leads to the following linear program (LP):

 $\begin{aligned} \text{maximize} \quad & \sum_{a\in[K]} X(a)\, r_{\mathbf{M}}(a) \\ \text{such that} \quad & \sum_{a\in[K]} X(a) = 1, \\ & \sum_{a\in[K]} X(a)\, c_{\mathbf{M},i}(a) \ \le\ B/T \quad \forall i \in [d], \\ & 0 \le X(a) \le 1 \quad \forall a \in [K]. \end{aligned}$ (3.1)

We denote this LP by LP_{M,B,T}. The solution X* is the best fixed distribution over actions, according to the relaxation. The value of this LP, denoted OPT_LP = OPT_LP(M, B, T), is the expected per-round reward of this distribution. It is also the total reward of X* in the relaxation, divided by T. We know from [14] that

 $T \cdot \mathrm{OPT_{LP}}(\mathbf{M}, B, T) \ \ge\ \mathrm{OPT_{DP}} \ \ge\ \mathrm{OPT_{FD}},$ (3.2)

where OPT_DP and OPT_FD are the total expected rewards of, respectively, the best dynamic policy and the best fixed distribution. In words, OPT_DP is sandwiched between the total expected reward of the best fixed distribution and that of the linear relaxation.
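On a toy instance with two arms and two resources (each arm collects reward 1 and consumes one unit of "its own" resource, with both budgets equal to T/2), the LP (3.1) can be checked by brute force over distributions X = (x, 1 − x); a sketch of ours, not part of the paper:

```python
# Brute-force check of LP (3.1) on a toy instance: two arms, two resources,
# reward 1 per round for each arm, arm a consumes 1 unit of resource a,
# and B/T = 1/2. We search distributions X = (x, 1 - x) on a fine grid.

r = [1.0, 1.0]
c = [[1.0, 0.0],   # consumption of (resource 1, resource 2) by arm 1
     [0.0, 1.0]]   # ... by arm 2
ratio = 0.5        # B / T

def lp_value(grid=10001):
    best, best_x = 0.0, None
    for k in range(grid):
        x = k / (grid - 1)
        X = [x, 1.0 - x]
        feasible = all(
            sum(X[a] * c[a][i] for a in range(2)) <= ratio for i in range(2))
        if feasible:
            val = sum(X[a] * r[a] for a in range(2))
            if val > best:
                best, best_x = val, X
    return best, best_x

val, X = lp_value()
# The unique feasible (hence optimal) point is the uniform distribution:
assert abs(val - 1.0) < 1e-9 and abs(X[0] - 0.5) < 1e-9
```

Here OPT_LP = 1, so the relaxation's total reward is T, while any single arm exhausts its budget after T/2 rounds.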

Associated with the linear program LP_{M,B,T} is the Lagrange function L = L_{M,B,T}. It is a function L : Δ_K × ℝ^d_{≥0} → ℝ defined as

 $\mathcal{L}(X, \lambda)\ :=\ \sum_{a\in[K]} X(a)\, r_{\mathbf{M}}(a)\ +\ \sum_{i\in[d]} \lambda_i \left[\, 1 - \frac{T}{B} \sum_{a\in[K]} X(a)\, c_{\mathbf{M},i}(a) \,\right].$ (3.3)

The values λ_1, …, λ_d in Eq. (3.3) are called the dual variables, as they correspond to the variables in the dual LP. Lagrange functions are meaningful due to their max-min property (e.g., Theorem D.2.2 in [16]):

 $\min_{\lambda \ge 0}\ \max_{X \in \Delta_K}\ \mathcal{L}(X, \lambda)\ =\ \max_{X \in \Delta_K}\ \min_{\lambda \ge 0}\ \mathcal{L}(X, \lambda)\ =\ \mathrm{OPT_{LP}}(\mathbf{M}, B, T).$ (3.4)

This property holds for our setting because LP_{M,B,T} has at least one feasible solution (namely, one that puts probability one on the null action), and the optimal value of the LP is bounded.

###### Remark 3.5.

We use the linear program LP_{M,B,T} and the associated Lagrange function L_{M,B,T} throughout the paper. Both are parameterized by an outcome matrix M, budget B and time horizon T. In particular, we can plug in an arbitrary outcome matrix M, and we heavily use this ability throughout. For the adversarial version, it is essential to plug in a parameter T_0 instead of the time horizon T. For the analysis of the high-probability result in Adversarial BwK, we use a rescaled budget B_0 instead of the budget B.

### 3.2 Our algorithm: repeated Lagrangian game

The Lagrange function L = L_{M,B,T} from (3.3) defines the following zero-sum game: the primal player chooses an arm a ∈ [K], the dual player chooses a resource i ∈ [d], and the payoff is the number

 $\mathcal{L}(a, i)\ =\ r_{\mathbf{M}}(a)\ +\ 1\ -\ \frac{T}{B}\, c_{\mathbf{M},i}(a).$ (3.6)

The primal player receives this number as a reward, and the dual player receives it as cost. This game is termed the Lagrangian game induced by (M, B, T). This game will be crucial throughout the paper.

The Lagrangian game is related to the original linear program as follows:

###### Lemma 3.7.

Assume one of the resources is the dummy resource. Consider the linear program LP_{M,B,T}, for some outcome matrix M. Then the value of this LP equals the minimax value v* of the Lagrangian game induced by (M, B, T). Further, if (X*, λ*) is a mixed Nash equilibrium in the Lagrangian game, then X* is an optimal solution to the LP.

###### Proof.

Recall (3.4), the max-min property of the Lagrange function L. We observe that the second equality in (3.4) also holds when the dual vector λ is restricted to distributions:

 $\max_{X \in \Delta_K}\ \min_{\lambda \in \Delta_d}\ \mathcal{L}(X, \lambda)\ =\ \mathrm{OPT_{LP}}(\mathbf{M}, B, T).$ (3.8)

This holds because of the special structure of L; the proof can be found in Appendix A.2. The minimax value v* equals the left-hand side of (3.8) by the min-max theorem for zero-sum games. The second part of the lemma follows from the proof of Theorem D.4.1 in [16]. For completeness, we prove this in Appendix A.2. ∎
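As a numerical sanity check of Lemma 3.7, consider a toy instance with two arms, two resources plus a dummy resource, and T/B = 2: the minimax value of the Lagrangian game matches OPT_LP. The specific numbers below are illustrative, not from the paper:

```python
# Payoffs L(a, i) from Eq. (3.6) with T/B = 2; arms are rows, resources are
# columns (dummy resource first, consuming B/T = 1/2 per round for each arm).

r = [1.0, 1.0]
cons = [[0.5, 1.0, 0.0],   # arm 1: (dummy, resource 1, resource 2)
        [0.5, 0.0, 1.0]]   # arm 2
TB = 2.0                   # T / B
L = [[r[a] + 1.0 - TB * cons[a][i] for i in range(3)] for a in range(2)]

def minimax(grid=10001):
    # max over X = (x, 1-x) of min over resources i of (X^T L)_i
    best = float("-inf")
    for k in range(grid):
        x = k / (grid - 1)
        X = [x, 1.0 - x]
        payoff = min(sum(X[a] * L[a][i] for a in range(2)) for i in range(3))
        best = max(best, payoff)
    return best

# The LP optimum for this instance is OPT_LP = 1 (play each arm w.p. 1/2),
# and the game's minimax value matches it:
assert abs(minimax() - 1.0) < 1e-9
```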

Consider a repeated version of the Lagrangian game. Formally, the repeated Lagrangian game with parameters B_0 and T_0 is a repeated zero-sum game between the primal algorithm, which chooses among the K arms, and the dual algorithm, which chooses among the d resources. Each round t of this game is the Lagrangian game induced by (M_t, B_0, T_0), where M_t is the round-t outcome matrix; its Lagrange function is L_t = L_{M_t, B_0, T_0}. Note that we use parameters B_0, T_0 instead of the budget B and the time horizon T. (Footnote 12: These parameters are needed only for the adversarial version. For Stochastic BwK we use B_0 = B and T_0 = T.)

###### Remark 3.9.

Consider the repeated Lagrangian game for Stochastic BwK (with B_0 = B and T_0 = T). The payoffs in the expected game are defined by the expected Lagrange function E[L_t]. By linearity, it is the Lagrange function for the expected outcome matrix M = E[M_t]:

 $\mathcal{L}\ :=\ \mathbb{E}[\mathcal{L}_t]\ =\ \mathcal{L}_{\mathbf{M}, B, T}.$ (3.10)

Our algorithm, called LagrangeBwK, is very simple: it is a repeated Lagrangian game in which the primal algorithm receives bandit feedback, and the dual algorithm receives full feedback.

To set up the notation, let a_t and i_t be, respectively, the arm and the resource chosen in round t. The payoff is therefore L_t(a_t, i_t). It can be rewritten in terms of the observed outcome vector o_t = (r_t; c_{t,1}, …, c_{t,d}) (which corresponds to the a_t-th row of the instantaneous outcome matrix M_t):

 $\mathcal{L}_t(a_t, i_t)\ =\ r_t\ +\ 1\ -\ \frac{T_0}{B_0}\, c_{t, i_t}\ \in\ \left[\, 1 - \frac{T_0}{B_0},\ 2 \,\right].$ (3.11)

The range of the Lagrange function is critical, since it is the payoff range [a, b] for the algorithms ALG_1 and ALG_2 (and their regret bounds (2.1) scale linearly in b − a).

With this notation, the pseudocode for LagrangeBwK is summarized in Algorithm 1. The pseudocode is simple and self-contained, without referring to the formalism of repeated games and Lagrangian functions. Note that the algorithm is implementable, in the sense that the outcome vector o_t revealed in each round of the BwK problem suffices to generate full feedback for the dual algorithm.

###### Algorithm 1: LagrangeBwK, with parameters B_0, T_0.

Given: primal algorithm ALG_1 (chooses among the K arms, bandit feedback) and dual algorithm ALG_2 (chooses among the d resources, full feedback). In each round t:

1. ALG_1 chooses arm a_t, and ALG_2 chooses resource i_t.
2. Arm a_t is played; the outcome vector o_t = (r_t; c_{t,1}, …, c_{t,d}) is observed.
3. ALG_1 receives payoff L_t(a_t, i_t) = r_t + 1 − (T_0/B_0) c_{t,i_t} (bandit feedback).
4. ALG_2 receives cost L_t(a_t, i) = r_t + 1 − (T_0/B_0) c_{t,i} for each resource i ∈ [d] (full feedback).
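The loop can be sketched in Python as follows. This is a simplified, illustrative sketch: we use a basic EXP3-style primal learner (not the EXP3.P.1 of Theorem 3.12) and a Hedge dual, with learning rates, exploration parameter, and the toy stochastic instance all of our choosing:

```python
import math, random

# A simplified sketch of the LagrangeBwK loop: an EXP3-style primal over K
# arms (bandit feedback) and a Hedge dual over d resources (full feedback).
# Names and learning rates are illustrative.

def lagrange_bwk(K, d, B0, T0, draw_outcome, seed=0):
    rng = random.Random(seed)
    eta1 = math.sqrt(math.log(K) / (K * T0))    # primal rate (illustrative)
    eta2 = math.sqrt(math.log(max(d, 2)) / T0)  # dual rate (illustrative)
    gamma = 0.1                                 # exploration mixing for EXP3
    w1, w2 = [1.0] * K, [1.0] * d
    budget = [float(B0)] * d
    reward = 0.0
    for _ in range(T0):
        z1, z2 = sum(w1), sum(w2)
        p = [(1 - gamma) * w / z1 + gamma / K for w in w1]
        q = [w / z2 for w in w2]
        a = rng.choices(range(K), weights=p)[0]   # primal picks an arm
        i = rng.choices(range(d), weights=q)[0]   # dual picks a resource
        r, c = draw_outcome(a)                    # reward, consumption vector
        for j in range(d):
            budget[j] -= c[j]
        if min(budget) < 0:                       # some resource exhausted
            break
        reward += r
        # payoff (3.11): L_t(a_t, i_t) = r_t + 1 - (T0/B0) * c_{t, i_t}
        pay = r + 1.0 - (T0 / B0) * c[i]
        w1[a] *= math.exp(eta1 * pay / p[a])      # bandit: importance-weighted
        for j in range(d):                        # full feedback for the dual
            w2[j] *= math.exp(-eta2 * (r + 1.0 - (T0 / B0) * c[j]))
        m1, m2 = max(w1), max(w2)                 # renormalize for stability
        w1 = [w / m1 for w in w1]
        w2 = [w / m2 for w in w2]
    return reward

# Toy stochastic instance: resource 0 is the dummy (consumption B0/T0 = 0.5
# per round for every arm); resource 1 is the actual constraint.
def outcome(a):
    return [(0.9, [0.5, 1.0]),     # arm 0: high reward, heavy consumption
            (0.5, [0.5, 0.1]),     # arm 1: moderate reward, light consumption
            (0.0, [0.5, 0.0])][a]  # arm 2: the null action

T, B = 2000, 1000
rew = lagrange_bwk(K=3, d=2, B0=B, T0=T, draw_outcome=outcome)
assert 0.0 < rew <= T
```

Note how the dual's full feedback is generated from the single observed outcome vector o_t, exactly as claimed above.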

### 3.3 Performance guarantees

We consider algorithm LagrangeBwK with parameters B_0 = B and T_0 = T. We assume the existence of the dummy resource; this is to ensure that the crucial step, Eq. (3.20), works out even if the algorithm stops at time T, without exhausting any actual resources. We obtain a regret bound that is non-trivial whenever B = ω(√(KT)), and is optimal, up to log factors, in the regime when OPT_DP and B are Ω(T).

###### Theorem 3.12.

Consider Stochastic BwK with K arms, d resources, time horizon T, and budget B. Assume that one resource is the dummy resource (with consumption B/T for each arm). Fix the failure probability parameter δ > 0. Consider algorithm LagrangeBwK with parameters B_0 = B and T_0 = T.

If EXP3.P.1 and Hedge are used as the primal and the dual algorithms, respectively, then the algorithm achieves the following regret bound, with probability at least 1 − O(δ):

 $\mathrm{OPT_{DP}} - \mathrm{REW}(\mathtt{LagrangeBwK})\ \le\ O\!\left(\tfrac{T}{B}\right)\left( \sqrt{K T \log(T/\delta)} + \sqrt{T \log(d\, T/\delta)} \right)\ +\ O\!\left(\sqrt{T \log(K d/\delta)}\right).$ (3.13)

In general, suppose each algorithm ALG_j, j ∈ {1, 2}, satisfies a regret bound (2.1) with R_δ(·) = R_{j,δ}(·) and the payoff range from (3.11). Then with probability at least 1 − O(δ) it holds that

 $\mathrm{OPT_{DP}} - \mathrm{REW}(\mathtt{LagrangeBwK})\ \le\ O\!\left(\tfrac{T}{B}\right)\left( R_{1,\delta}(T) + R_{2,\delta}(T) \right)\ +\ O\!\left(\sqrt{T \log(K d/\delta)}\right).$ (3.14)
###### Remark 3.15.

To obtain (3.13) from the “black-box” result (3.14), we use regret bounds in Eq. (2.2).

###### Remark 3.16.

From [14], the optimal regret bound for Stochastic BwK is

 $\mathrm{OPT_{DP}} - \mathbb{E}[\mathrm{REW}]\ \le\ \tilde{O}\!\left( \sqrt{K\, \mathrm{OPT_{DP}}}\, \left( 1 + \sqrt{\mathrm{OPT_{DP}}/B} \right) \right).$

Thus, the regret bound (3.13) is near-optimal if OPT_DP and B are Ω(T), and non-trivial if B = ω(√(KT)).

We next prove the “black-box” regret bound (3.14). For the sake of analysis, consider a version of the repeated Lagrangian game that continues up to the time horizon . In what follows, we separate the “easy steps” from what we believe is the crux of the proof.

Notation. Let X_t ∈ Δ_K be the distribution chosen in round t by the primal algorithm ALG_1. Let X̄_τ = (1/τ) Σ_{t∈[τ]} X_t be the distribution of average play up to round τ. Let M = E[M_t] be the expected outcome matrix. Let r = ( r_M(a) : a ∈ [K] ) be the vector of expected rewards over the actions. Likewise, let c_i = ( c_{M,i}(a) : a ∈ [K] ) be the vector of expected consumption of each resource i ∈ [d].

Using the Azuma-Hoeffding inequality. Consider the first $\tau$ rounds, for some $\tau \in [T]$. The average reward and resource-$i$ consumption over these rounds are close to $\overline{X}_\tau \cdot r$ and $\overline{X}_\tau \cdot c_i$, respectively, with high probability. Specifically, a simple usage of the Azuma-Hoeffding inequality (Lemma A.1) implies that

\begin{align} \frac{1}{\tau}\sum_{t\in[\tau]} r_t &\;\ge\; \overline{X}_\tau \cdot r \;-\; R_0(\tau)/\tau, \tag{3.17}\\ \frac{1}{\tau}\sum_{t\in[\tau]} c_{i,t} &\;\le\; \overline{X}_\tau \cdot c_i \;+\; R_0(\tau)/\tau, \qquad \forall i\in[d], \tag{3.18} \end{align}

hold with probability at least , where .

Regret minimization in games. Let us apply the machinery from regret minimization in games to the repeated Lagrangian game. Consider the game matrix $G$ of the expected game. Using Eq. (3.10) and Lemma 3.7, we conclude that the minimax value of $G$ is $v^*$.

We apply Lemma 2: with probability at least ,

\[ \forall\, \tau\in[T],\ \lambda\in\Delta_d:\qquad \overline{X}_\tau^{\mathsf{T}} G\,\lambda \;\ge\; v^* \;-\; (b-a)\cdot\frac{R_{1,\delta}(\tau)+R_{2,\delta}(\tau)}{\tau}. \tag{3.19} \]

Here is the payoff range in the Lagrangian game, so .

Crux of the proof. Let us condition on the event that (3.17), (3.18), and (3.19) hold for each . By the union bound, this event holds with probability at least .

Let $\tau$ denote the stopping time of the algorithm: the first round when the total consumption of some resource exceeds its budget. Let $i$ be the resource for which this happens; hence,

\[ \sum_{t\in[\tau]} c_{i,t} \;>\; B. \tag{3.20} \]

Let us use Eq. (3.19) with $\lambda = \lambda^{(i)}$, the point distribution on this resource. Then

\begin{align*} \overline{X}_\tau^{\mathsf{T}} G\,\lambda^{(i)} &= \mathcal{L}_{M,B,T}\bigl(\overline{X}_\tau,\lambda^{(i)}\bigr) && \text{(by Eq.~(3.10))}\\ &= \overline{X}_\tau\cdot r \;+\; 1 \;-\; \tfrac{T}{B}\,\overline{X}_\tau\cdot c_i && \text{(by definition of the Lagrange function)}\\ &\le \tfrac{1}{\tau}\Bigl(\Bigl(\textstyle\sum_{t\in[\tau]} r_t\Bigr) \;-\; \tfrac{T}{B}\textstyle\sum_{t\in[\tau]} c_{i,t} \;+\; \tau \;-\; 2R_0(\tau)\Bigr) && \text{(plugging in (3.17) and (3.18))}\\ &\le \tfrac{1}{\tau}\Bigl(\Bigl(\textstyle\sum_{t\in[\tau]} r_t\Bigr) \;+\; \tau \;-\; T \;-\; 2R_0(\tau)\Bigr). && \text{(plugging in Eq.~(3.20))} \end{align*}

Plugging this into Eq. (3.19) and rearranging, we obtain

\[ \sum_{t\in[\tau]} r_t \;\ge\; \tau v^* \;+\; T \;-\; \tau \;-\; \mathtt{reg}(\tau), \]

where the regret term is .

Since and , it follows that

\[ \mathtt{REW}(\texttt{LagrangeBwK}) \;=\; \sum_{t\in[\tau]} r_t \;\ge\; T v^* \;-\; \mathtt{reg}(T). \]

The claimed regret bound (3.14) follows by Eq. (3.2), completing the proof of Theorem 3.12.

## 4 A simple algorithm for Adversarial BwK

We present and analyze an algorithm for Adversarial BwK which achieves competitive ratio, in expectation, up to a low-order additive term. Our algorithm is very simple: we randomly guess the value of and run LagrangeBwK with parameter driven by this guess. The analysis is very different, however, since we cannot rely on the machinery from regret minimization in stochastic games. The crux of the analysis (Lemma 4.12) is re-used to analyze the high-probability algorithm in the next section.

The intuition for our algorithm can be explained as follows. LagrangeBwK builds on adversarial online learning algorithms , and appears plausibly applicable to Adversarial BwK. We analyze it for Adversarial BwK, with an arbitrary parameter (see Lemma 4.12, the crux of our analysis), and find that it performs best when is tailored to up to a constant multiplicative factor. This is precisely what our algorithm achieves using the random guess.

\@float

algocf[h]\end@float

Our algorithm is presented as Algorithm LABEL:alg:LagrangianBwKAdv. We randomly guess the value of from within a specified range , up to the specified multiplicative factor of . We consider multiplicative scales , , and we guess uniformly at random among all possible . Our analysis works as long as and ; then we obtain competitive ratio up to a low-order additive term. As a corollary, we obtain competitive ratio with no assumptions.
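The guessing step admits a direct implementation. In the sketch below (the helper names guess_grid and sample_guess are ours), we build the multiplicative kappa-grid between g_min and g_max and draw one guess uniformly at random; any target value in the range is then kappa-approximated by some grid point, so a uniform draw is approximately correct with probability inversely proportional to the grid size.

```python
import math
import random

def guess_grid(g_min, g_max, kappa):
    """Grid of guesses g_min * kappa^j, j = 0, ..., ceil(log_kappa(g_max/g_min))."""
    n = math.ceil(math.log(g_max / g_min, kappa))
    return [g_min * kappa ** j for j in range(n + 1)]

def sample_guess(g_min, g_max, kappa, rng=random):
    """Draw one guess uniformly at random from the kappa-grid."""
    return rng.choice(guess_grid(g_min, g_max, kappa))
```

For any target value g in [g_min, g_max], some grid point g_hat satisfies g/kappa <= g_hat <= g, so a uniform draw is a suitable guess with probability at least one over the grid size.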

###### Theorem 4.1.

Consider Adversarial BwK with arms, resources, time horizon , and budget . Assume that one of the arms is a null arm that has zero reward and zero resource consumption. Consider Algorithm LABEL:alg:LagrangianBwKAdv with scale parameter . Suppose the primal and dual algorithms satisfy the regret bound (2.1) with and , for any known payoff range .

If then the expected reward of Algorithm LABEL:alg:LagrangianBwKAdv satisfies

\[ \operatorname{E}[\mathtt{REW}] \;\ge\; \bigl(\mathtt{OPT}_{\mathtt{FD}} - \mathtt{reg}\bigr) \Big/ \Bigl(\kappa^2\,\bigl\lceil \log_\kappa \tfrac{g_{\max}}{g_{\min}} \bigr\rceil\Bigr), \qquad\text{where}\quad \mathtt{reg} \;=\; 2 + \tfrac{\mathtt{OPT}_{\mathtt{FD}}}{\kappa B}\bigl(R_{1,\delta}(T)+R_{2,\delta}(T)\bigr). \tag{4.2} \]

In particular, if we take , then we obtain

 (4.3)
###### Remark 4.4.

In particular, one can use algorithms EXP3.P.1 for and Hedge for , with regret bounds given by Eq. (2.2), and achieve the regret term .

###### Remark 4.5.

We define the outcome matrices slightly differently compared to Section 3 in that we do not posit a dummy resource. Formally, we assume that the null arm has zero consumption in every resource. This is essential for the case in the analysis.

If a problem instance of Adversarial BwK is actually an instance of adversarial bandits, then we recover the optimal regret. (This easily follows by examining the proof of Lemma 4.12.)

###### Lemma 4.6.

Consider an instance of Adversarial BwK with zero resource consumption. Suppose the primal and dual algorithms satisfy the regret bound (2.1) with and , for any payoff range that is unknown to the algorithm (unknown payoff range is w.l.o.g., by a suitable application of the “doubling trick”; e.g., see Section 2 in [27]). Then LagrangeBwK with any parameters recovers the optimal regret. Accordingly, so does Algorithm LABEL:alg:LagrangianBwKAdv with any parameter .

### 4.1 Analysis: proof of Theorem 4.1 and Lemma 4.6

Stopped linear program. Let us set up a linear relaxation that is suitable to the adversarial setting. The expected outcome matrix is no longer available. Instead, we use average outcome matrices:

\[ \overline{M}_\tau \;=\; \frac{1}{\tau} \sum_{t\in[\tau]} M_t, \tag{4.7} \]

the average up to a given intermediate round $\tau$. Similar to the stochastic case, the relaxation assumes that each instantaneous outcome matrix $M_t$ is equal to the average outcome matrix $\overline{M}_\tau$. What is different now is that the relaxation depends on $\tau$: using $\overline{M}_\tau$ is tantamount to stopping precisely at round $\tau$.

With this intuition in mind, for a particular end-time $\tau$ we consider the linear program (3.1), parameterized by the time horizon $\tau$ and the average outcome matrix $\overline{M}_\tau$. Its value, $\mathtt{OPT}_{\mathtt{LP}}(\overline{M}_\tau, B, \tau)$, represents the per-round expected reward, so it needs to be scaled by the factor of $\tau$ to obtain the total expected reward. Finally, we maximize over $\tau$. Thus, our linear relaxation for Adversarial BwK is defined as follows:

\[ \mathtt{OPT}^{[T]}_{\mathtt{LP}} \;:=\; \max_{\tau\in[T]}\; \tau\cdot \mathtt{OPT}_{\mathtt{LP}}\bigl(\overline{M}_\tau, B, \tau\bigr) \;\ge\; \mathtt{OPT}_{\mathtt{FD}}. \tag{4.8} \]

The inequality in (4.8) is proved in the appendix (Section LABEL:sec:appxAdversarial).
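For a fixed stopping time, the relaxation is a small linear program, so the benchmark in (4.8) can be computed by solving one LP per candidate stopping time. The sketch below is our illustration, assuming SciPy's linprog and our own array conventions: opt_lp maximizes the per-round reward of a distribution over arms subject to each resource's average consumption staying within the per-round budget B/tau, and opt_lp_T takes the maximum of tau times that value over all tau.

```python
import numpy as np
from scipy.optimize import linprog

def opt_lp(r_bar, C_bar, B, tau):
    """Per-round LP value: max r_bar . X  subject to
    C_bar @ X <= (B / tau) * 1  and  X a distribution over the K arms."""
    K = len(r_bar)
    res = linprog(
        -r_bar,                                   # linprog minimizes, so negate
        A_ub=C_bar, b_ub=np.full(C_bar.shape[0], B / tau),
        A_eq=np.ones((1, K)), b_eq=np.ones(1),
        bounds=[(0, None)] * K,
    )
    return -res.fun

def opt_lp_T(rewards, consumption, B):
    """Benchmark of (4.8): max over tau of tau * OPT_LP(averages up to tau).
    rewards: (T, K); consumption: (T, d, K)."""
    T = rewards.shape[0]
    best = 0.0
    for tau in range(1, T + 1):
        r_bar = rewards[:tau].mean(axis=0)
        C_bar = consumption[:tau].mean(axis=0)
        best = max(best, tau * opt_lp(r_bar, C_bar, B, tau))
    return best
```

For instance, with one resource that only the rewarding arm consumes, the maximum is attained at the round where the per-round budget constraint becomes binding.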

Regret bounds for . Recall that each algorithm satisfies the regret bound (2.1) with and . Recall from (3.11) that the payoff range is . For succinctness, let $R_1(\tau\,|\,T_0)$ and $R_2(\tau\,|\,T_0)$ denote the respective regret terms in (2.1).

Let us apply these regret bounds to our setting. Let $a_t$ and $i_t$ be, resp., the chosen arm and resource in round $t$. We represent the outcomes as vectors over arms: let $r_t, c_{t,i} \in [0,1]^K$ denote, resp., the reward vector and the resource-$i$ consumption vector for a given round $t$. Recall that the round-$t$ payoffs in LagrangeBwK are given by the Lagrange function $\mathcal{L}_t$ such that

\[ \mathcal{L}_t(a,i) \;=\; r_t(a) \;+\; 1 \;-\; \tfrac{T_0}{B}\, c_{t,i}(a) \tag{4.9} \]

for each arm $a \in [K]$ and resource $i \in [d]$. Consider the total Lagrangian payoff up to a given round $\tau$:

\[ \sum_{t\in[\tau]} \mathcal{L}_t(a_t, i_t) \;=\; \mathtt{REW}_\tau \;+\; \tau \;-\; W_\tau, \tag{4.10} \]

where $\mathtt{REW}_\tau = \sum_{t\in[\tau]} r_t(a_t)$ is the total reward up to round $\tau$, and $W_\tau = \frac{T_0}{B} \sum_{t\in[\tau]} c_{t,i_t}(a_t)$ is the consumption term. The regret bounds sandwich (4.10) from above and below:

\[ \Bigl(\max_{a\in[K]} \sum_{t\in[\tau]} \mathcal{L}_t(a, i_t)\Bigr) - R_1(\tau\,|\,T_0) \;\le\; \mathtt{REW}_\tau + \tau - W_\tau \;\le\; \Bigl(\min_{i\in[d]} \sum_{t\in[\tau]} \mathcal{L}_t(a_t, i)\Bigr) + R_2(\tau\,|\,T_0). \tag{4.11} \]

This holds for all , with probability at least . The first inequality in (4.11) is due to the primal algorithm, and the second is due to the dual algorithm. Call them primal and dual inequality, respectively.

Crux of the proof. We condition on the event that (4.11) holds for all , which we call the clean event. The crux of the analysis is encapsulated in the following lemma, which analyzes an execution of LagrangeBwK with an arbitrary parameter under the clean event.

###### Lemma 4.12.

Consider an execution of LagrangeBwK with an arbitrary parameter $T_0$ such that the clean event holds. Fix an arbitrary round $\sigma \in [T]$, and consider the LP value relative to this round:

\[ f(\sigma) \;:=\; \mathtt{OPT}_{\mathtt{LP}}\bigl(\overline{M}_\sigma, B, \sigma\bigr). \tag{4.13} \]

The algorithm’s reward up to round $\sigma$ satisfies

\[ \mathtt{REW}_\sigma \;\ge\; \min\bigl(T_0,\; \sigma\cdot f(\sigma) - dT_0\bigr) \;-\; \bigl(R_1(\sigma\,|\,T_0)+R_2(\sigma\,|\,T_0)\bigr). \tag{4.14} \]

Taking $\sigma$ to be the maximizer in (4.8), the algorithm’s reward satisfies

\[ \mathtt{REW} \;\ge\; \min\bigl(T_0,\; \mathtt{OPT}_{\mathtt{FD}} - dT_0\bigr) \;-\; \bigl(R_1(\sigma\,|\,T_0)+R_2(\sigma\,|\,T_0)\bigr). \tag{4.15} \]
###### Proof.

Let $\tau$ be the stopping time of the algorithm. We consider two cases, depending on whether some resource is exhausted at time $\tau$. In both cases, we focus on one particular round.

For succinctness, denote the regret terms as $R_j(\tau) = R_j(\tau\,|\,T_0)$, for each $j \in \{1,2\}$.

Case 1: $\tau \le \sigma$ and some resource is exhausted. Let us focus on round $\tau$. If $i$ is the exhausted resource, then $\sum_{t\in[\tau]} c_{t,i}(a_t) > B$. Let us apply the dual inequality in (4.11) for this resource:

\begin{align*} \mathtt{REW}_\tau + \tau - W_\tau - R_2(\tau) &\;\le\; \sum_{t\in[\tau]} \mathcal{L}_t(a_t, i) \\ &\;=\; \mathtt{REW}_\tau + \tau - \tfrac{T_0}{B} \sum_{t\in[\tau]} c_{t,i}(a_t) \\ &\;\le\; \mathtt{REW}_\tau + \tau - T_0. \end{align*}

It follows that $W_\tau \ge T_0 - R_2(\tau)$.

Now, let us apply the primal inequality in (4.11) for the null arm. Recall that the reward and consumption for this arm is $0$, so $\mathcal{L}_t(\text{null}, i) = 1$ for each round $t$ and resource $i$. Therefore,

\[ \mathtt{REW}_\tau + \tau - W_\tau + R_1(\tau) \;\ge\; \sum_{t\in[\tau]} \mathcal{L}_t(\text{null}, i_t) \;=\; \tau. \]

We conclude that $\mathtt{REW}_\tau \ge W_\tau - R_1(\tau) \ge T_0 - R_1(\tau) - R_2(\tau)$.

Case 2: $\tau > \sigma$. Let us focus on round $\sigma$. Consider the linear program $\mathtt{OPT}_{\mathtt{LP}}(\overline{M}_\sigma, B, \sigma)$, and let $X^*$ be an optimal solution to this LP. The primal inequality in (4.11) implies that

\begin{align*} \mathtt{REW}_\sigma + \sigma - W_\sigma + R_1(\sigma) &\;\ge\; \max_{a\in[K]} \sum_{t\in[\sigma]} \mathcal{L}_t(a, i_t) \;\ge\; \sigma + \sum_{t\in[\sigma]} X^*\cdot r_t \;-\; \tfrac{T_0}{B} \sum_{t\in[\sigma]} X^*\cdot c_{t,i_t} \\ \mathtt{REW}_\sigma &\;\ge\; \sigma\cdot f(\sigma) \;-\; \tfrac{T_0}{B} \sum_{t\in[\sigma]} X^*\cdot c_{t,i_t} \;-\; R_1(\sigma) \tag{4.16} \end{align*}

In the last inequality we used the fact that $\sum_{t\in[\sigma]} X^*\cdot r_t = \sigma\, f(\sigma)$, which holds by optimality of $X^*$, along with $W_\sigma \ge 0$.

Moreover, $\sum_{t\in[\sigma]} X^*\cdot c_{t,i} \le B$ for each resource $i$, since $X^*$ is a feasible solution for $\mathtt{OPT}_{\mathtt{LP}}(\overline{M}_\sigma, B, \sigma)$. Therefore,

\[ \sum_{t\in[\sigma]} X^*\cdot c_{t,i_t} \;\le\; \sum_{i\in[d]} \sum_{t\in[\sigma]} X^*\cdot c_{t,i} \;\le\; dB. \tag{4.17} \]

Plugging (4.17) into (4.16), we conclude that $\mathtt{REW}_\sigma \ge \sigma\cdot f(\sigma) - dT_0 - R_1(\sigma)$.

Conclusions from the two cases imply (4.15), as claimed. ∎

Wrapping up. If $\mathtt{OPT}_{\mathtt{FD}}$ lies in the guess range $[g_{\min}, g_{\max}]$, then some guess $\hat g$ is approximately correct:

\[ \mathtt{OPT}_{\mathtt{FD}}/\kappa \;\le\; \hat g \;\le\; \mathtt{OPT}_{\mathtt{FD}}. \]

With such a guess , and provided that , we have , and

\[ \mathtt{OPT}_{\mathtt{FD}} - dT_0 \;\ge\; \mathtt{OPT}_{\mathtt{FD}}\bigl(1 - \tfrac{d}{\kappa}\bigr) \;\ge\; \mathtt{OPT}_{\mathtt{FD}}/\kappa. \]

So, by Lemma 4.12, the algorithm’s execution with this guess, assuming the clean event, satisfies (4.15) with and . To complete the proof of Theorem 4.1, we observe that we obtain a suitable guess with probability .

Proof Sketch of Lemma 4.6. Recall that in the adversarial bandit setting, resource consumption is zero for every arm and every round. We re-prove Lemma 4.12 with . Notice that Case 1 never occurs. Thus we obtain Eq. (4.16) in Case 2. Note that $W_T = 0$, since all consumption is zero. Therefore, we obtain

\[ \mathtt{REW}_T \;\ge\; T\cdot f(T) \;-\; R_1(T). \]

We now argue that $T\cdot f(T) \ge \mathtt{OPT}_{\mathtt{FD}}$. Let $X^*$ be the optimal distribution over the arms. Note that the only constraint on $X^*$ is that it lies in $\Delta_K$, since consumption is zero. Therefore the maximizer is a point distribution on the best arm.

The above proof did not rely on any specific value of the parameter, and thus works for every such value. Moreover, the range of the payoff is $[1,2]$, and thus when the primal algorithm is EXP3.P.1 we have .

## 5 High-probability algorithm for Adversarial BwK

In this section we recover the approximation ratio for Adversarial BwK, but with high probability rather than merely in expectation. Our algorithm uses LagrangeBwK as a subroutine, and re-uses the adversarial analysis thereof (Lemma 4.12). We do not attempt to optimize the regret term.

The algorithm is considerably more complicated compared to Algorithm LABEL:alg:LagrangianBwKAdv. Instead of making one random guess for the value of , we iteratively refine this guess over time. The algorithm proceeds in phases. In the beginning of each phase, we start a fresh instance of LagrangeBwK with parameter defined by the current value of the guess. (The idea of restarting the algorithm in each phase is similar to the standard “doubling trick” in the online machine-learning literature, but much more delicate in our setting.) We update the guess in each round (in a way specified later), and stop the phase once the guess becomes too large compared to its initial value in this phase. We invoke LagrangeBwK with a rescaled budget