Approximation Algorithms for Correlated Knapsacks and Non-Martingale Bandits


Anupam Gupta, Ravishankar Krishnaswamy (Department of Computer Science, Carnegie Mellon University, Pittsburgh PA 15213)    Marco Molinaro, R. Ravi (Tepper School of Business, Carnegie Mellon University, Pittsburgh PA 15213)
Abstract

In the stochastic knapsack problem, we are given a knapsack of size B, and a set of jobs whose sizes and rewards are drawn from a known probability distribution. However, the only way to know the actual size and reward of a job is to schedule it—when it completes, we get to know these values. How should we schedule jobs to maximize the expected total reward? We know constant-factor approximations for this problem when we assume that rewards and sizes are independent random variables, and that we cannot prematurely cancel jobs after we schedule them. What can we say when either or both of these assumptions are changed?

The stochastic knapsack problem is of interest in its own right, but techniques developed for it are applicable to other stochastic packing problems. Indeed, ideas for this problem have been useful for budgeted learning problems, where one is given several arms which evolve in a specified stochastic fashion with each pull, and the goal is to pull the arms a total of B times to maximize the reward obtained. Much recent work on this problem focuses on the case when the evolution of the arms follows a martingale, i.e., when the expected reward from the future is the same as the reward at the current state. What can we say when the rewards do not form a martingale?

In this paper, we give constant-factor approximation algorithms for the stochastic knapsack problem with correlations and/or cancellations, and also for budgeted learning problems where the martingale condition is not satisfied, using similar ideas. Indeed, we can show that previously proposed linear programming relaxations for these problems have large integrality gaps. We propose new time-indexed LP relaxations; using a decomposition and “gap-filling” approach, we convert these fractional solutions to distributions over strategies, and then use the LP values and the time ordering information from these strategies to devise a randomized adaptive scheduling algorithm. We hope our LP formulation and decomposition methods may provide a new way to address other correlated bandit problems with more general contexts.


1 Introduction

Stochastic packing problems seem to be conceptually harder than their deterministic counterparts—imagine a situation where some rounding algorithm outputs a solution in which the budget constraint has been exceeded by a constant factor. For deterministic packing problems (with a single constraint), one can simply pick the most profitable subset of the items which meets the packing constraint; this would give us a profit within a constant factor of the optimal value. The deterministic packing problems that are not well understood are those with multiple (potentially conflicting) packing constraints.

However, for stochastic problems, even a single packing constraint is not simple to handle. Even though such problems arise in diverse situations, the first study from an approximations perspective was in an important paper of Dean et al. [DGV08] (see also [DGV05, Dea05]). They defined the stochastic knapsack problem, where each job has a random size and a random reward, and the goal is to give an adaptive strategy for irrevocably picking jobs in order to maximize the expected value of those fitting into a knapsack of size B—they gave an LP relaxation and rounding algorithm which produced non-adaptive solutions whose performance was surprisingly within a constant factor of the best adaptive ones (resulting in a constant adaptivity gap, a notion they also introduced). However, the results required that (a) the random rewards and sizes for items were independent of each other, and (b) once a job was placed, it could not be prematurely canceled—it is easy to see that these assumptions change the nature of the problem significantly.

The study of the stochastic knapsack problem was very influential—in particular, the ideas here were used to obtain approximation algorithms for budgeted learning problems studied by Guha and Munagala [GM07b, GM07a, GM09] and Goel et al. [GKN09], among others. They considered problems in the multi-armed bandit setting with n arms, each arm evolving according to an underlying state machine with probabilistic transitions when pulled. Given a budget B, the goal is to pull arms up to B times to maximize the reward—payoffs are associated with states, and the reward is some function of the payoffs of the states seen during the evolution of the algorithm. (E.g., it could be the sum of the payoffs of all states seen, or the reward of the best final state, etc.) The above papers gave constant-factor approximations, index-based policies and adaptivity gaps for several budgeted learning problems. However, these results all required the assumption that the rewards satisfied a martingale property, namely, if an arm is in some state u, one pull of this arm would bring an expected payoff equal to the payoff of state u itself—the motivation for such an assumption comes from the fact that each arm is assumed to be associated with a fixed (but unknown) reward, of which we only have a prior distribution. Then, the expected reward from the next pull of the arm, conditioned on the previous pulls, forms a Doob martingale.

However, there are natural instances where the martingale property need not hold. For instance, the evolution of the prior could depend not just on the observations made but also on external factors (such as time). Or, in a marketing application, the evolution of a customer's state may require repeated "pulls" (or marketing actions) before the customer transitions to a high-reward state and makes a purchase, while the intermediate states may not yield any reward. This leads us to consider the following problem: there is a collection of n arms, each characterized by an arbitrary (known) Markov chain, and there are rewards associated with the different states. When we play an arm, it makes a state transition according to the associated Markov chain, and fetches the corresponding reward of the new state. What should our strategy be in order to maximize the expected total reward we can accrue by making at most B pulls in total?

1.1 Results

Our main results are the following: We give the first constant-factor approximations for the general version of the stochastic knapsack problem where rewards could be correlated with the sizes. Our techniques are general and also apply to the setting when jobs could be canceled arbitrarily. We then extend those ideas to give the first constant-factor approximation algorithms for a class of budgeted learning problems with Markovian transitions where the martingale property is not satisfied. We summarize these in Table 1.

Problem              | Restrictions                         | Paper
Stochastic Knapsack  | Fixed Rewards, No Cancellation       | [DGV05]
Stochastic Knapsack  | Correlated Rewards, No Cancellation  | Section 2
Stochastic Knapsack  | Correlated Rewards, Cancellation     | Section 3
Multi-Armed Bandits  | Martingale Assumption                | [GM07b]
Multi-Armed Bandits  | No Martingale Assumption             | Section 4
Table 1: Summary of Results

1.2 Why Previous Ideas Don’t Extend, and Our Techniques

One reason why stochastic packing problems are more difficult than their deterministic counterparts is that, unlike in the deterministic setting, here we cannot simply take a solution with some expected reward that packs into a knapsack of size 2B and convert it (by picking a subset of the items) into a solution which obtains a constant fraction of the reward whilst packing into a knapsack of size B. In fact, there are examples where a budget of 2B can fetch much more reward than a budget of size B can (see Appendix A.2). Another distinction from deterministic problems is that allowing cancellations can drastically increase the value of the solution (see Appendix A.1). The models used in previous works on stochastic knapsack and on budgeted learning circumvented both issues—in contrast, our model forces us to address them.

Stochastic Knapsack: Dean et al. [DGV08, Dea05] assume that the reward/profit of an item is independent of its stochastic size. Moreover, their model does not consider the possibility of canceling jobs in the middle. These assumptions simplify the structure of the decision tree and make it possible to formulate a (deterministic) knapsack-style LP, and round it. However, as shown in Appendix A, their LP relaxation performs poorly when either correlation or cancellation is allowed. This is the first issue we need to address.

Budgeted Learning: Obtaining approximations for budgeted learning problems is a more complicated task, since cancellations may be inherent in the problem formulation, i.e., any strategy would stop playing a particular arm and switch to another, and the rewards from playing any arm are naturally correlated with the (current) state and hence with the number of previous pulls made on the item/arm. The first issue is often tackled by using more elaborate LPs with a flow-like structure that compute a probability distribution over the different times at which the LP stops playing an arm (e.g., [GM07a]), but the latter issue is less understood. Indeed, several papers on this topic present strategies that fetch an expected reward which is a constant factor of an optimal solution's reward, but which may violate the budget by a constant factor. In order to obtain an approximate solution without violating the budget, they critically make use of the martingale property—with this assumption at hand, they can truncate the last arm played to fit the budget without incurring any loss in expected reward. However, such an idea fails when the martingale property is not satisfied, and these LPs now have large integrality gaps (see Appendix A.2).

At a high level, a major drawback of previous LP relaxations for both problems is that the constraints are local for each arm/job, i.e., they track the probability distribution over how long each item/arm is processed (either till completion or cancellation), and there is an additional global constraint binding the total number of pulls/total size across items. This results in two different issues. For the (correlated) stochastic knapsack problem, these LPs do not capture the case when the items have high contention, i.e., when many items all want to play early in order to collect profit. And for the general multi-armed bandit problem, we show that no local LP can be good, since such LPs do not capture the notion of preempting an arm, namely switching from one arm to another and possibly returning to the original arm later. Indeed, we show cases where any near-optimal strategy must switch between different arms (see Appendix A.3)—this is a major difference from previous work with the martingale property, where there exist near-optimal strategies that never return to any arm [GM09, Lemma 2.1]. At a high level, the lack of the martingale property means our algorithm needs to make adaptive decisions, where each move is a function of the previous outcomes; in particular, this may involve revisiting a particular arm several times, with interruptions in the middle.

We resolve these issues in the following manner: incorporating cancellations into stochastic knapsack can be handled by adapting the flow-like LPs from the multi-armed bandits case. To resolve the problems of contention and preemption, we formulate a global time-indexed relaxation that forces the LP solution to commit each job to begin at a specific time, and places constraints on the maximum expected reward that can be obtained if the algorithm begins an item at a particular time. Furthermore, the time-indexing also enables our rounding scheme to extract information about when to preempt an arm and when to re-visit it based on the LP solution; in fact, these decisions will possibly be different for different (random) outcomes of any pull, but the LP encodes the information for each possibility. We believe that our rounding approach may be of interest in other stochastic optimization problems.

Another important version of budgeted learning is when we are allowed to make up to B plays as usual, but can "exploit" only a bounded number of times: reward is only fetched when an arm is exploited, and again depends on its current state. There is a further constraint that once an arm is exploited, it must then be discarded. Our LP-based approach can be easily extended to that case as well.

1.3 Roadmap

We begin in Section 2 by presenting a constant-factor approximation algorithm for the stochastic knapsack problem (StocK) when rewards may be correlated with the sizes, but decisions are irrevocable, i.e., job cancellations are not allowed. Then, we build on these ideas in Section 3, and present our results for the (correlated) stochastic knapsack problem where job cancellation is allowed.

In Section 4, we move on to the more general class of multi-armed bandit (MAB) problems. For clarity of exposition, we present our algorithm for MAB assuming that the transition graph of each arm is an arborescence (i.e., a directed tree), and then generalize it to arbitrary transition graphs in Section 5.

We remark that while our LP-based approach for the budgeted learning problem implies approximation algorithms for the stochastic knapsack problem as well, the knapsack problem provides a gentler introduction to the issues—it motivates and gives insight into our techniques for MAB. Similarly, it is easier to understand our techniques for the MAB problem when the transition graph of each arm's Markov chain is a tree. Several illustrative examples are presented in Appendix A, e.g., illustrating why we need adaptive strategies for the non-martingale problems, and why some natural ideas do not work. Finally, the extension of our algorithm for MAB to the case when rewards are available only when the arms are explicitly exploited, with budgets on both the exploration and exploitation pulls, appears in Appendix F. Note that this algorithm strictly generalizes the previous work on budgeted learning for MAB with the martingale property [GM07a].

1.4 Related Work

Stochastic scheduling problems have been studied since the 1960s (e.g., [BL97, Pin95]); however, there are fewer papers on approximation algorithms for such problems. Kleinberg et al. [KRT00], and Goel and Indyk [GI99], consider stochastic knapsack problems with chance constraints: find the max-profit set which will overflow the knapsack with probability at most some given threshold. However, their results hold for deterministic profits and specific size distributions. Approximation algorithms for minimizing average completion times with arbitrary job-size distributions were studied by [MSU99, SU01]. The work most relevant to us is that of Dean, Goemans and Vondrák [DGV08, DGV05, Dea05] on stochastic knapsack and packing; apart from approximation algorithms (for independent rewards and sizes), they show the problem to be PSPACE-hard when correlations are allowed. [CR06] study stochastic flow problems. Recent work of Bhalgat et al. [BGK11] presents a PTAS that violates the capacity by a (1+ε) factor; they also get better constant-factor approximations without violations.

The general area of learning with costs is a rich and diverse one (see, e.g., [Ber05, Git89]). Approximation algorithms start with the work of Guha and Munagala [GM07a], who gave LP-rounding algorithms for some problems. Further papers by these authors [GMS07, GM09] and by Goel et al. [GKN09] give improvements, relate LP-based techniques and index-based policies, and also give new index policies. (See also [GGM06, GM07b].) [GM09] considers switching costs, and [GMP11] allows pulling many arms simultaneously, or settings with delayed feedback. All these papers assume the martingale condition.

2 The Correlated Stochastic Knapsack without Cancellation

We begin by considering the stochastic knapsack problem (StocK) when a job's reward may be correlated with its size. This generalizes the problem studied by Dean et al. [DGV05], who assume that the rewards are independent of the sizes of the jobs. We first explain why the LP of [DGV05] has a large integrality gap for our problem; this naturally motivates our time-indexed formulation. We then present a simple randomized rounding algorithm which produces a non-adaptive strategy, and show that it is a constant-factor approximation.

2.1 Problem Definitions and Notation

We are given a knapsack of total budget B and a collection of n stochastic items. For any item i ∈ [n], we are given a probability distribution over (size, reward) pairs specified as follows: for each integer size t ∈ [B], the pair (π_{i,t}, R_{i,t}) denotes the probability π_{i,t} that item i has size t, and the corresponding reward R_{i,t} collected in that case. Note that the reward of a job is now correlated with its size; however, these quantities for two different jobs are still independent of each other.

An algorithm that adaptively processes these items observes the following at the end of each timestep: (i) an item may complete at a certain size, giving us the corresponding reward, and the algorithm may choose a new item to start processing, or (ii) the knapsack becomes full, at which point the algorithm cannot process any more items, and any currently running job does not accrue any reward. The objective is to maximize the total expected reward obtained from all completed items. Notice that we do not allow the algorithm to cancel an item before it completes. We relax this requirement in Section 3.

2.2 LP Relaxation

The LP relaxation in [DGV05] was (essentially) a knapsack LP where the sizes of items are replaced by the expected sizes, and the rewards are replaced by the expected rewards. While this was sufficient when an item's reward is fixed (or chosen randomly but independently of its size), we give an example in Appendix A.2 where such an LP (and in fact, the class of more general LPs used for approximating MAB problems) would have a large integrality gap. As mentioned in Section 1.2, the reason why local LPs don't work is that there could be high contention for being scheduled early (i.e., there could be a large number of items which all fetch reward only if they instantiate to a large size, but these events occur with low probability). In order to capture this contention, we write a global time-indexed LP relaxation.

The variable x_{i,t} indicates that item i is scheduled at (global) time t; S_i denotes the random variable for the size of item i, and ER_i(t) = Σ_{s ≤ B−t} π_{i,s} · R_{i,s} captures the expected reward that can be obtained from item i if it begins at time t (no reward is obtained for sizes that cannot fit the remaining budget).

(LP_NoCancel)
max Σ_{i ∈ [n]} Σ_{t ∈ [B]} ER_i(t) · x_{i,t}
s.t. Σ_{t ∈ [B]} x_{i,t} ≤ 1    for all i ∈ [n]    (2.1)
Σ_{i ∈ [n]} Σ_{t' ≤ t} E[min(S_i, t)] · x_{i,t'} ≤ 2t    for all t ∈ [B]    (2.2)
x_{i,t} ∈ [0, 1]    for all i ∈ [n], t ∈ [B]    (2.3)

While the size of the above LP (and the running time of the rounding algorithm below) depends polynomially on B, i.e., it is pseudo-polynomial, it is possible to write a compact (approximate) LP and then round it; details of the polynomial-time implementation appear in Appendix B.2.
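To make the relaxation concrete, here is a minimal sketch that builds and solves the LP above using the PuLP modeling library. The data format (items[i][s] = (probability, reward) for each size s) and the exact truncation used in exp_reward are our own assumptions for illustration, not notation from the paper.

import pulp

def solve_lp_nocancel(items, B):
    """items[i][s] = (pi, R): probability that item i has size s, and the reward in that case."""
    n = len(items)
    times = range(1, B + 1)

    def exp_trunc_size(i, t):   # E[min(S_i, t)]
        return sum(pi * min(s, t) for s, (pi, _) in items[i].items())

    def exp_reward(i, t):       # ER_i(t): reward counted only if the size fits the leftover budget
        return sum(pi * R for s, (pi, R) in items[i].items() if s <= B - t)

    lp = pulp.LpProblem("StocK_NoCancel", pulp.LpMaximize)
    x = pulp.LpVariable.dicts("x", [(i, t) for i in range(n) for t in times],
                              lowBound=0, upBound=1)                              # (2.3)
    lp += pulp.lpSum(exp_reward(i, t) * x[i, t] for i in range(n) for t in times)
    for i in range(n):                                                            # (2.1)
        lp += pulp.lpSum(x[i, t] for t in times) <= 1
    for t in times:                                                               # (2.2)
        lp += pulp.lpSum(exp_trunc_size(i, t) * x[i, tp]
                         for i in range(n) for tp in times if tp <= t) <= 2 * t
    lp.solve(pulp.PULP_CBC_CMD(msg=False))
    return {(i, t): x[i, t].varValue for i in range(n) for t in times}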

Notice the constraints involving the truncated random variables min(S_i, t) in equation (2.2): these are crucial for showing the correctness of the rounding algorithm StocK-NoCancel. Furthermore, the ideas used here will appear subsequently in the MAB algorithm; for MAB, even though we can't explicitly enforce such a constraint in the LP, we will end up inferring a similar family of inequalities from a near-optimal LP solution.

Lemma 2.1

The relaxation (LP_NoCancel) is valid for the StocK problem when cancellations are not permitted, and has objective value LPOpt ≥ Opt, where Opt is the expected profit of an optimal adaptive policy.

Proof.

Consider an optimal policy Opt and let x*_{i,t} denote the probability that Opt schedules item i at time t. We first show that x* is a feasible solution for the LP relaxation (LP_NoCancel). It is easy to see that constraints (2.1) and (2.3) are satisfied. To prove that constraints (2.2) are also satisfied, consider some t ∈ [B] and some run (over random choices of item sizes) of the optimal policy. Let A_{i,t'} be the indicator variable that item i is scheduled at time t', and let S_i be the size of item i. Also, let L be the last item scheduled at or before time t in this run. Notice that L is the only item scheduled before or at time t whose execution may go over time t. Therefore, we get that

Σ_{i ≠ L} Σ_{t' ≤ t} A_{i,t'} · S_i ≤ t.

Including L in the summation and truncating the sizes by t, we immediately obtain

Σ_{i} Σ_{t' ≤ t} A_{i,t'} · min(S_i, t) ≤ 2t.

Now, taking expectation (over all of Opt's sample paths) on both sides and using linearity of expectation we have

Σ_{i} Σ_{t' ≤ t} E[A_{i,t'} · min(S_i, t)] ≤ 2t.

However, because Opt decides whether to schedule an item before observing the size it instantiates to, A_{i,t'} and S_i are independent random variables; hence, the LHS above can be re-written as

Σ_{i} Σ_{t' ≤ t} Pr[A_{i,t'} = 1] · E[min(S_i, t)] = Σ_{i} Σ_{t' ≤ t} x*_{i,t'} · E[min(S_i, t)].

Hence constraints (2.2) are satisfied. Now we argue that the expected reward of Opt is equal to the value of the solution x*. Let O_i be the random variable denoting the reward obtained by Opt from item i. Again, due to the independence between scheduling an item and the size it instantiates to, the expected reward that Opt gets from executing item i at time t is

Pr[A_{i,t} = 1] · E[R_i · 1(S_i ≤ B − t)] = x*_{i,t} · ER_i(t).

Thus the expected reward E[O_i] from item i is obtained by summing over all possible starting times for i:

E[O_i] = Σ_{t ∈ [B]} x*_{i,t} · ER_i(t).

This shows that (LP_NoCancel) is a valid relaxation for our problem and completes the proof of the lemma. ∎

We are now ready to present our rounding algorithm StocK-NoCancel (Algorithm 2.1). It is a simple randomized rounding procedure which (i) picks a start time for each item according to the corresponding distribution in the optimal LP solution, and (ii) plays the items in order of these (random) start times. To ensure that the budget is not violated, we also drop each item independently with some constant probability.

1:  for each item i, assign a random start-time D_i = t with probability x*_{i,t}/4; with probability 1 − Σ_t x*_{i,t}/4, completely ignore item i (D_i = ∞ in this case).
2:  for j from 1 to n do
3:     consider the item i which has the j-th smallest deadline D_i (among items with D_i ≠ ∞)
4:     if the items added so far to the knapsack occupy at most D_i space then
5:        add i to the knapsack.
Algorithm 2.1 Algorithm StocK-NoCancel
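The rounding in Algorithm 2.1 is short enough to simulate directly. The sketch below assumes the reconstruction above (damping factor 1/4, random start times D_i) and the same items[i][s] = (probability, reward) encoding as in the LP sketch; it returns the reward of one random run.

import random

def sample_size(dist):
    """Draw a size from a {size: (probability, reward)} distribution."""
    r, acc = random.random(), 0.0
    for s in sorted(dist):
        acc += dist[s][0]
        if r < acc:
            return s
    return max(dist)

def stock_nocancel_round(items, B, x):
    """One run of StocK-NoCancel given the LP values x[(i, t)]."""
    n, D = len(items), {}
    for i in range(n):                          # step 1: random start time, damped by 1/4
        r, acc = random.random(), 0.0
        for t in range(1, B + 1):
            acc += x.get((i, t), 0.0) / 4.0
            if r < acc:
                D[i] = t
                break                           # items never reaching this point are ignored

    reward, used = 0.0, 0
    for i in sorted(D, key=lambda j: D[j]):     # steps 2-3: consider items by increasing deadline
        if used <= D[i]:                        # step 4: enough of the knapsack is still free
            size = sample_size(items[i])        # step 5: add the item and observe its size
            if used + size <= B:                # reward only if it completes within the budget
                reward += items[i][size][1]
            used += size
    return reward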

Notice that the strategy obtained by the rounding procedure obtains reward from all items which are not dropped and which do not fail (i.e., which can start being scheduled by the start-time sampled in Step 1); we now bound the failure probability.

Lemma 2.2

For every item i and time t ∈ [B], the probability that item i is not added to the knapsack, conditioned on D_i = t, is at most 1/2.

Proof.

Consider an item i and time t, and condition on the event D_i = t. Consider the execution of the algorithm when it tries to add item i to the knapsack in steps 3-5. Let Z be a random variable denoting how much of the interval [0, t] of the knapsack is occupied by previously scheduled items at the time when i is considered for addition; since i does not fail when Z ≤ t, it suffices to prove that Pr[Z > t] ≤ 1/2.

For an item j ≠ i, let U_j be the indicator variable that D_j ≤ t; notice that, by the order in which algorithm StocK-NoCancel adds items into the knapsack, it is also the indicator that j was considered before i. In addition, let V_j be the indicator variable that j was added to the knapsack. Now, if Z_j denotes the total amount of the interval [0, t] that j occupies, we have Z_j ≤ U_j · V_j · min(S_j, t) ≤ U_j · min(S_j, t).

Now, using the independence of U_j and S_j, we have

E[Z_j] ≤ E[U_j] · E[min(S_j, t)] = Σ_{t' ≤ t} (x*_{j,t'}/4) · E[min(S_j, t)].    (2.4)

Since Z ≤ Σ_{j ≠ i} Z_j, we can use linearity of expectation and the fact that x* satisfies LP constraint (2.2) to get

E[Z] ≤ (1/4) Σ_{j} Σ_{t' ≤ t} x*_{j,t'} · E[min(S_j, t)] ≤ 2t/4 = t/2.

To conclude the proof of the lemma, we apply Markov's inequality to obtain Pr[Z > t] ≤ E[Z]/t ≤ 1/2. ∎

To complete the analysis, note that any item i chooses a random start time D_i = t with probability x*_{i,t}/4, and conditioned on this event, it is added to the knapsack with probability at least 1/2 by Lemma 2.2; in this case, we get an expected reward of at least ER_i(t) from it. The theorem below (formally proved in Appendix B.1) then follows by linearity of expectation.

Theorem 2.3

The expected reward of our randomized algorithm is at least 1/8 of LPOpt.

3 Stochastic Knapsack with Correlated Rewards and Cancellations

In this section, we present our algorithm for the stochastic knapsack problem (StocK) where we allow correlations between rewards and sizes, and also allow cancellation of jobs. The example in Appendix A.1 shows that there can be an arbitrarily large gap in the expected profit between strategies that can cancel jobs and those that can't. Hence we need to write new LPs to capture the benefit of cancellation, which we do in the following manner.

Consider any job i: we can create two jobs from it, the "early" version of the job, where we discard profits from any instantiation where the size of the job is more than B/2, and the "late" version of the job, where we discard profits from instantiations of size at most B/2. Hence, we can get at least half the optimal value by flipping a fair coin and collecting rewards from either the early or the late versions of jobs, based on the outcome. In the next section, we show how to obtain a constant-factor approximation for the first kind. For the second kind, we argue that cancellations don't help; we can then reduce it to StocK without cancellations (considered in Section 2).
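The split itself is a one-line transformation of the item distributions. The sketch below (same assumed encoding as before, with hypothetical solver callbacks solve_early and solve_late standing in for the two algorithms) shows the coin flip that loses at most half of the optimum in expectation.

import random

def split_item(item, B):
    """Early copy: rewards zeroed for sizes > B/2.  Late copy: rewards zeroed for sizes <= B/2.
    The size distribution itself is left untouched in both copies."""
    early = {s: (p, R if s <= B // 2 else 0.0) for s, (p, R) in item.items()}
    late = {s: (p, R if s > B // 2 else 0.0) for s, (p, R) in item.items()}
    return early, late

def solve_by_coin_flip(items, B, solve_early, solve_late):
    """Every reward lands in exactly one of the two copies, so running the algorithm for a
    uniformly random copy keeps at least half of the optimal expected reward."""
    early_items, late_items = zip(*(split_item(it, B) for it in items))
    if random.random() < 0.5:
        return solve_early(list(early_items), B)
    return solve_late(list(late_items), B)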

3.1 Case I: Jobs with Early Rewards

We begin with the setting in which only small-size instantiations of items may fetch reward, i.e., the rewards R_{i,t} of every item i are assumed to be 0 for t > B/2. In the following LP relaxation (LP_S), x_{i,t} tries to capture the probability with which Opt will process item i for at least t timesteps (in the following two sections, we use the word timestep to refer to processing one unit of some item), and s_{i,t} is the probability that Opt stops processing item i at exactly t timesteps. The time-indexed formulation causes the algorithm to have running time that is pseudo-polynomial in B—however, it is easy to write compact (approximate) LPs and then round them; we describe the changes needed to obtain a polynomial-time algorithm in Appendix C.2.

(LP_S)
max Σ_{i ∈ [n]} Σ_{t ∈ [B]} (π_{i,t} / Σ_{t' ≥ t} π_{i,t'}) · x_{i,t} · R_{i,t}
s.t. x_{i,t} = s_{i,t} + x_{i,t+1}    for all i ∈ [n], t ∈ [B]    (3.5)
s_{i,t} ≥ (π_{i,t} / Σ_{t' ≥ t} π_{i,t'}) · x_{i,t}    for all i ∈ [n], t ∈ [B]    (3.6)
Σ_{i ∈ [n]} Σ_{t ∈ [B]} t · s_{i,t} ≤ B    (3.7)
x_{i,1} ≤ 1    for all i ∈ [n]    (3.8)
x_{i,t}, s_{i,t} ∈ [0, 1]    for all i ∈ [n], t ∈ [B]    (3.9)
Theorem 3.1

The linear program (LP_S) is a valid relaxation for the StocK problem, and hence its optimal value is at least the total expected reward Opt of an optimal solution.

Proof.

Consider an optimal solution Opt and let x*_{i,t} and s*_{i,t} denote the probability that Opt processes item i for at least t timesteps, and the probability that Opt stops processing item i at exactly t timesteps, respectively. We now verify the constraints of (LP_S) one by one.

To this end, let P_i denote the random variable (over different executions of Opt) for the amount of processing done on job i. Notice that Pr[P_i ≥ t] = Pr[P_i = t] + Pr[P_i ≥ t+1]. But by definition we have x*_{i,t} = Pr[P_i ≥ t] and s*_{i,t} = Pr[P_i = t]. This shows that (x*, s*) satisfies constraints (3.5).

For the next constraint, observe that, conditioned on running an item for at least t timesteps, the probability of the item stopping due to its size having instantiated to exactly t is π_{i,t}/Σ_{t' ≥ t} π_{i,t'}; hence s*_{i,t} ≥ x*_{i,t} · π_{i,t}/Σ_{t' ≥ t} π_{i,t'}. This shows that (x*, s*) satisfies constraints (3.6).

Finally, to see why constraint (3.7) is satisfied, consider any particular run of the optimal algorithm and let 1^{stop}_{i,t} denote the indicator random variable of the event that item i is stopped at exactly t timesteps in this run. Then we have

Σ_{i ∈ [n]} Σ_{t ∈ [B]} t · 1^{stop}_{i,t} ≤ B.

Now, taking expectation over all runs of Opt, using linearity of expectation and the fact that E[1^{stop}_{i,t}] = s*_{i,t}, we get constraint (3.7). As for the objective function, we again consider a particular run of the optimal algorithm, and now let 1^{proc}_{i,t} denote the indicator random variable for the event that item i is processed for at least t timesteps, and 1^{size}_{i,t} denote the indicator variable for whether the size of item i is instantiated to exactly t in this run. Then the total reward collected by Opt in this run is exactly

Σ_{i ∈ [n]} Σ_{t ∈ [B]} R_{i,t} · 1^{proc}_{i,t} · 1^{size}_{i,t}.

Now, we simply take the expectation of the above random variable over all runs of Opt, and then use the following fact about Opt:

E[1^{proc}_{i,t} · 1^{size}_{i,t}] = x*_{i,t} · π_{i,t} / Σ_{t' ≥ t} π_{i,t'}.

We thus get that the expected reward collected by Opt is exactly equal to the objective value of (LP_S) for the solution (x*, s*). ∎

Our rounding algorithm is very natural, and simply tries to mimic the probability distribution (over when to stop each item) suggested by the optimal LP solution. To this end, let (x*, s*) denote an optimal fractional solution. The reason we introduce some damping (in the selection probabilities) up-front is to make sure that we can appeal to Markov's inequality and ensure that the knapsack does not get violated with good probability.

1:  for each item i do
2:     ignore item i with probability 3/4 (i.e., do not schedule it at all).
3:     for t = 1 to B do
4:        cancel item i at this step with a probability computed from (x*, s*), chosen so that the stopping-time distribution of Lemma 3.2 below is realized, and continue to the next item.
5:        process item i for its t-th timestep.
6:        if item i terminates after being processed for exactly t timesteps then
7:           collect a reward of R_{i,t} from this item; continue on to the next item.
Algorithm 3.1 Algorithm StocK-Small
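Below is a single-item simulation of Algorithm 3.1 under the reconstruction above. The cancellation probability used here is not the paper's (that expression is elided above); it is one choice we derived to be consistent with Lemma 3.2, in that it makes the probability of being processed for at least t timesteps equal to x*_{i,t}. The dict-based inputs pi, R, x are our encoding.

import random

def sample_size(pi):
    r, acc = random.random(), 0.0
    for s in sorted(pi):
        acc += pi[s]
        if r < acc:
            return s
    return max(pi)

def run_stock_small_item(pi, R, x, B):
    """Simulate one item that survived the damping in step 2.
    pi[t]: probability the item has size t; R[t]: its reward; x[t]: LP value x*_{i,t}.
    Returns (reward collected, number of timesteps the item was processed)."""
    tail = [0.0] * (B + 2)                   # tail[t] = Pr[size >= t]
    for t in range(B, 0, -1):
        tail[t] = tail[t + 1] + pi.get(t, 0.0)

    size = sample_size(pi)                   # the (hidden) instantiated size
    reach = 1.0                              # target Pr[the item is processed for >= t timesteps]
    for t in range(1, B + 1):
        # step 4: cancel so that Pr[processed for >= t timesteps] becomes x[t] (our derived choice)
        cancel = 1.0 - (x.get(t, 0.0) / reach if reach > 0 else 0.0)
        if random.random() < min(1.0, max(0.0, cancel)):
            return 0.0, t - 1
        if size == t:                        # steps 5-7: the t-th unit of processing completes it
            return R.get(t, 0.0), t
        # survived this timestep: update the target for the next iteration, per the LP's conditioning
        reach = x.get(t, 0.0) * (tail[t + 1] / tail[t] if tail[t] > 0 else 0.0)
    return 0.0, B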

Notice that while we let the algorithm proceed even if its budget is violated, we will collect reward only from items that complete before time B. This simplifies the analysis a fair bit, both here and for the MAB algorithm. In Lemma 3.2 below (proof in Appendix C), we show that for any item that is not dropped in step 2, its probability distribution over stopping times is identical to that of the optimal LP solution (x*, s*). We then use this to argue that the expected reward of our algorithm is a constant fraction of the LP value.

Lemma 3.2

Consider an item i that was not dropped in step 2. Then, for any timestep t ∈ [B], the following hold:

  • The probability (including cancellation and completion) of stopping at timestep t for item i is s*_{i,t}.

  • The probability that item i gets processed for its t-th timestep is exactly x*_{i,t}.

  • If item i has been processed for t timesteps, the probability of completing successfully at timestep t is π_{i,t}/Σ_{t' ≥ t} π_{i,t'}.

Theorem 3.3

The expected reward of our randomized algorithm is at least 1/8 of the optimal value of (LP_S).

Proof.

Consider any item i. In the worst case, it is processed after all other items. Then the total expected size occupied by the other items is at most Σ_{j ≠ i} Y_j · Σ_t t · s*_{j,t}, where Y_j is the indicator random variable denoting whether item j is not dropped in step 2. Here we have used Lemma 3.2 to argue that if an item is selected, its stopping-time distribution follows s*_j. Taking expectation over the randomness in step 2, the expected space occupied by the other items is at most (1/4) Σ_j Σ_t t · s*_{j,t} ≤ B/4, by LP constraint (3.7). Markov's inequality implies that this is at most B/2 with probability at least 1/2. In this case, if item i is not dropped (which happens w.p. 1/4), it runs without violating the knapsack whenever it fetches reward (recall that rewards accrue only at sizes at most B/2), with expected reward Σ_t R_{i,t} · x*_{i,t} · π_{i,t}/Σ_{t' ≥ t} π_{i,t'}; the total expected reward is then at least 1/8 of the LP objective. ∎

3.2 Case II: Jobs with Late Rewards

Now we handle instances in which only large-size instantiations of items may fetch reward, i.e., the rewards R_{i,t} of every item i are assumed to be 0 for t ≤ B/2. For such instances, we now argue that cancellation is not helpful. As a consequence, we can use the results of Section 2 and obtain a constant-factor approximation algorithm!

To see why, intuitively, as an algorithm processes a job for its t-th timestep for t ≤ B/2, it gets no more information about the reward than when it started (since all rewards are at sizes larger than B/2). Furthermore, there is no benefit in canceling a job once it has run for more than B/2 timesteps—we can't collect any reward by starting some other item, since no rewarding instantiation of another item would fit in the remaining budget.

More formally, consider a (deterministic) strategy S which in some state makes the decision of scheduling item i and halting its execution if it takes more than t timesteps. First suppose that t ≤ B/2; since this job will then not be able to reach any size larger than B/2, no reward will be accrued from it, and hence we can change this strategy by skipping the scheduling of i without altering its total reward. Now consider the case where t > B/2. Consider the strategy S' which behaves as S except that it does not preempt i in this state but lets i run to completion. We claim that S' obtains at least as much expected reward as S. First, whenever item i has size at most t, then S and S' obtain the same reward. Now suppose that we are in a scenario where i reached size t and was halted by S. Then S cannot obtain any other reward in the future, since no item that can fetch any reward would complete before the budget runs out; in the same situation, strategy S' obtains a non-negative reward. Using this argument we can eliminate all the cancellations of a strategy without decreasing its expected reward.

Lemma 3.4

There is an optimal solution in this case which does not cancel.

As mentioned earlier, we can now appeal to the results of Section 2 and obtain a constant-factor approximation for the large-size instances. We can then combine the algorithms that handle the two different scenarios (choosing one of them uniformly at random and running it), and get a constant fraction of the expected reward that an optimal policy fetches.

4 Multi-Armed Bandits

We now turn our attention to the more general Multi-Armed Bandits problem (MAB). In this framework, there are n arms: arm i has a collection of states denoted by S_i and a starting state ρ_i ∈ S_i; without loss of generality, we assume that the state sets of different arms are disjoint. Each arm i also has a transition graph T_i, which is given as a polynomial-size (weighted) directed tree rooted at ρ_i; we will relax the tree assumption later. If there is an edge u → v in T_i, then the edge weight p_{u,v} denotes the probability of making a transition from u to v if we play arm i when its current state is node u; hence Σ_{v: (u,v) ∈ T_i} p_{u,v} = 1. Each time we play an arm, we get a reward whose value depends on the state from which the arm is played. Let us denote the reward at a state u by r_u. Recall that the martingale property on rewards requires that Σ_v p_{u,v} · r_v = r_u for all states u.

Problem Definition. For a concrete example, we consider the following budgeted learning problem on tree transition graphs. Each of the n arms starts at its start state ρ_i. We get a reward from each of the states we play, and the goal is to maximize the total expected reward while not exceeding a pre-specified number B of plays across all arms. The framework described below can handle other problems (like the explore/exploit kind) as well, and we discuss this in Appendix F.

Note that the stochastic knapsack problem considered in the previous section is a special case of this problem: each item corresponds to an arm, and the evolution of the arm's states corresponds to the explored size of the item. Rewards are associated with each stopping size, which can be modeled by end states reachable from the states of the corresponding size, with the probability of this transition being the (conditional) probability of the item taking this size. Thus the resulting trees are paths of length up to the maximum size, with transitions to reward-carrying end states for each item size. For example, the transition graph in Figure 4.1 corresponds to an item with three possible sizes, where each size is taken with the indicated probability and fetches the indicated reward at the corresponding end state. Notice that the reward for stopping at all intermediate nodes is 0, and such an instance therefore does not satisfy the martingale property. Even though the rewards in this example are obtained on reaching a state rather than playing it, it is not hard to modify our methods for this version as well.
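A sketch of this reduction for a single item, in the same dict encoding as the earlier sketches; the node names ("run", t) / ("end", t) and the adjacency-list format are ours. Each play from a "run" state either reveals the item's size (moving to a reward-carrying end state with the conditional probability of that size) or continues to the next processing state.

def knapsack_item_to_arm(pi, R, B):
    """Build the path-plus-end-states transition tree of Figure 4.1 for one stochastic item.
    pi[t], R[t]: size distribution and rewards.  Returns (root, tree, rewards) where
    tree[u] is a list of (child, transition probability) and rewards[u] is collected at u."""
    tail = {t: sum(pi.get(s, 0.0) for s in range(t, B + 1)) for t in range(1, B + 2)}
    tree, rewards = {}, {}
    for t in range(1, B + 1):
        cur, nxt, end = ("run", t), ("run", t + 1), ("end", t)
        p_end = pi.get(t, 0.0) / tail[t] if tail[t] > 0 else 0.0   # Pr[size = t | size >= t]
        children = []
        if p_end > 0:
            children.append((end, p_end))
        if t < B and p_end < 1:
            children.append((nxt, 1.0 - p_end))
        tree[cur] = children
        rewards[cur] = 0.0          # intermediate nodes carry no reward (hence non-martingale)
        rewards[end] = R.get(t, 0.0)
    return ("run", 1), tree, rewards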

Figure 4.1: Reducing Stochastic Knapsack to MAB
Notation.

The transition graph T_i for arm i is an out-arborescence defined on the states S_i, rooted at ρ_i. Let depth(u) of a node u be the depth of u in its tree, where the root has depth 0. The unique parent of a node u in T_i is denoted by parent(u). Let S = ∪_i S_i denote the set of all states in the instance, and arm(u) denote the arm to which state u belongs, i.e., the index i such that u ∈ S_i. Finally, for u ∈ S_i, we refer to the act of playing arm i when it is in state u as "playing state u", or simply "playing u" if the arm is clear from context.

4.1 Global Time-indexed LP

In the following, the variable z_{u,t} indicates that the algorithm plays state u ∈ S at time t. For a non-root state u and time t, w_{u,t} indicates that the corresponding arm first enters state u at time t: this happens if and only if the algorithm played parent(u) at time t − 1 and the arm made a transition into state u.

(LP_MAB)
max Σ_{u ∈ S} Σ_{t ∈ [B]} r_u · z_{u,t}
s.t. w_{u,t} = p_{parent(u),u} · z_{parent(u),t−1}    for all t ∈ [B], u ∈ S \ {ρ_1, ..., ρ_n}    (4.10)
Σ_{t' ≤ t} z_{u,t'} ≤ Σ_{t' ≤ t} w_{u,t'}    for all t ∈ [B], u ∈ S    (4.11)
Σ_{u ∈ S} z_{u,t} ≤ 1    for all t ∈ [B]    (4.12)
w_{ρ_i,1} = 1    for all i ∈ [n]    (4.13)
z_{u,t}, w_{u,t} ∈ [0, 1]    for all u ∈ S, t ∈ [B]
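For concreteness, here is a PuLP sketch of the LP as reconstructed above, reusing the (root, tree, rewards) arm encoding from the previous sketch; the variable names z and w mirror that reconstruction and are our own choice.

import pulp

def solve_lp_mab(arms, B):
    """arms[i] = (root, tree, rewards); tree[u] lists (child v, p_{u,v}); rewards[u] = r_u."""
    states = [(i, u) for i, (_, _, rewards) in enumerate(arms) for u in rewards]
    parent = {(i, v): ((i, u), p) for i, (_, tree, _) in enumerate(arms)
              for u in tree for (v, p) in tree[u]}
    T = range(1, B + 1)
    lp = pulp.LpProblem("MAB", pulp.LpMaximize)
    z = pulp.LpVariable.dicts("z", [(s, t) for s in states for t in T], 0, 1)
    w = pulp.LpVariable.dicts("w", [(s, t) for s in states for t in T], 0, 1)
    lp += pulp.lpSum(arms[i][2][u] * z[(i, u), t] for (i, u) in states for t in T)
    for s in states:
        for t in T:
            if s in parent:                                  # (4.10): enter s at t iff parent played at t-1
                par, p = parent[s]
                lp += w[s, t] == (p * z[par, t - 1] if t > 1 else 0)
            else:                                            # (4.13): each root is entered at time 1 only
                lp += w[s, t] == (1 if t == 1 else 0)
            lp += (pulp.lpSum(z[s, tp] for tp in range(1, t + 1))        # (4.11)
                   <= pulp.lpSum(w[s, tp] for tp in range(1, t + 1)))
    for t in T:                                              # (4.12): at most one play per timestep
        lp += pulp.lpSum(z[s, t] for s in states) <= 1
    lp.solve(pulp.PULP_CBC_CMD(msg=False))
    return {(s, t): z[s, t].varValue for s in states for t in T}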
Lemma 4.1

The value of an optimal solution to (LP_MAB) is at least Opt, the expected reward of an optimal adaptive strategy.

Proof.

We adopt the convention that Opt starts playing at time 1. Let z*_{u,t} denote the probability that Opt plays state u at time t, namely, the probability that arm(u) is in state u at time t and Opt plays it at time t. Also let w*_{u,t} denote the probability that Opt "enters" state u at time t, and further let w*_{ρ_i,1} = 1 for all i.

We first show that (z*, w*) is a feasible solution for (LP_MAB) and later argue that its LP objective is at least Opt. Consider constraint (4.10) for some non-root state u and time t. The probability of entering state u at time t conditioned on playing parent(u) at time t − 1 is p_{parent(u),u}. In addition, the probability of entering state u at time t conditioned on not playing parent(u) at time t − 1 is zero. Since z*_{parent(u),t−1} is the probability that Opt plays parent(u) at time t − 1, we remove the conditioning to obtain w*_{u,t} = p_{parent(u),u} · z*_{parent(u),t−1}.

Now consider constraint (4.11) for some state u and time t. For any outcome of the algorithm (denoted by a sample path ω), let Z^ω_{u,t'} be the indicator variable that Opt plays state u at time t' and let W^ω_{u,t'} be the indicator variable that Opt enters state u at time t'. Since T_{arm(u)} is acyclic, state u is played at most once in ω and is also entered at most once in ω. Moreover, whenever u is played before or at time t, it must be that u was also entered before or at time t, and hence Σ_{t' ≤ t} Z^ω_{u,t'} ≤ Σ_{t' ≤ t} W^ω_{u,t'}. Taking expectation on both sides and using the facts that E[Z^ω_{u,t'}] = z*_{u,t'} and E[W^ω_{u,t'}] = w*_{u,t'}, linearity of expectation gives Σ_{t' ≤ t} z*_{u,t'} ≤ Σ_{t' ≤ t} w*_{u,t'}.

To see that constraints (4.12) are satisfied, notice that we can play at most one arm (or alternatively one state) in each time step, hence Σ_{u ∈ S} Z^ω_{u,t} ≤ 1 holds for all ω and t; the claim then follows by taking expectations on both sides as in the previous paragraph. Finally, constraints (4.13) are satisfied by definition of the start states.

To conclude the proof of the lemma, it suffices to show that the LP objective of (z*, w*) is at least the expected reward of Opt. Since Opt obtains reward r_u whenever it plays state u, its reward is given by Σ_{u,t} r_u · Z^ω_{u,t}; taking expectations we get that the expected reward of Opt equals Σ_{u,t} r_u · z*_{u,t}, which is the LP objective of (z*, w*). ∎

4.2 The Rounding Algorithm

In order to best understand the motivation behind our rounding algorithm, it would be useful to go over the example in Appendix A.3, which illustrates the necessity of preemption (repeatedly switching back and forth between the different arms).

At a high level, the rounding algorithm proceeds as follows. In Phase I, given an optimal LP solution, we decompose the fractional solution for each arm into a convex combination of integral "strategy forests" (depicted in Figure 4.2): each of these tells us at what times to play the arm, and in which states to abandon the arm. (Strictly speaking, we do not get convex combinations that sum to one; for each arm the combination sums to the value the LP assigned to playing the root of that arm over all possible start times, which is at most one.) Now, if we sample a random strategy forest for each arm from this distribution, we may end up scheduling multiple arms to play at some of the timesteps, and hence we need to resolve these conflicts. A natural first approach might be to (i) sample a strategy forest for each arm, (ii) play these arms in a random order, and (iii) for any arm, follow the decisions (about whether to abort or continue playing) suggested by the sampled strategy forest. In essence, we would be ignoring the times at which the sampled strategy forest has scheduled the plays of this arm and instead playing this arm continually until the sampled forest abandons it. While such a non-preemptive strategy works when the martingale property holds, the example in Appendix A.3 shows that preemption is unavoidable.

Another approach would be to try to play the sampled forests at their prescribed times; if multiple forests want to play in the same time slot, we round-robin over them. The expected number of plays in each timestep is at most 1, and the hope is that round-robin will not hurt us much. However, if some arm needs B contiguous steps to get to a state with high reward, and a single play of some other arm gets scheduled by bad luck in some timestep, we would end up getting nothing!

Guided by these bad examples, we try to use the continuity information in the sampled strategy forests—once we start playing some contiguous component (where the strategy forest plays the arm in every consecutive time step), we play it to the end of the component. The naïve implementation does not work, so we first alter the LP solution to get convex combinations of "nice" forests—loosely, these are forests that play contiguously in almost all timesteps, or in at least half the timesteps. This alteration is done in Phase II, the actual rounding in Phase III, and the analysis appears in Section 4.2.3.

4.2.1 Phase I: Convex Decomposition

In this step, we decompose the fractional solution into a convex combination of "forest-like strategies" {T(i,j)}_j for each arm i, where T(i,j) denotes the j-th strategy forest for arm i. We first formally define what these forests look like: a strategy forest T(i,j) for arm i is an assignment of values time(i,j,u) and prob(i,j,u) to each state u ∈ S_i such that:

  • For u ∈ S_i and v = parent(u), it holds that time(i,j,u) ≥ 1 + time(i,j,v), and

  • For u ∈ S_i and v = parent(u), if time(i,j,u) = ∞ then prob(i,j,u) = 0; else if time(i,j,u) < ∞ then prob(i,j,u) = p_{v,u} · prob(i,j,v).

We call the triple (u, time(i,j,u), prob(i,j,u)) a tree-node of T(i,j). When i and j are understood from the context, we identify the tree-node with the state u, and simply write time(u) and prob(u).

For any state u, the values time(i,j,u) and prob(i,j,u) denote the time at which the arm is played at state u, and the probability with which the arm is played at u, according to the strategy forest T(i,j). The probability values are particularly simple: if time(i,j,u) = ∞ then this strategy does not play the arm at u, and hence the probability is zero; else prob(i,j,u) is equal to the probability of reaching u over the random transitions according to T_i if we play the root with probability prob(i,j,ρ_i). Hence, we can compute prob(i,j,u) just given prob(i,j,ρ_i) and whether or not time(i,j,u) = ∞. Note that the time values of a state and its parent are not necessarily consecutive; plotting these values on a timeline and connecting a state to its parent only when they are played in consecutive timesteps (as in Figure 4.2) gives us a forest, hence the name.

Figure 4.2: Strategy forests and how to visualize them: grey blobs are connected components. Panel (a) shows a strategy forest (the numbers are the time values); panel (b) shows the same strategy forest laid out on a timeline.

The algorithm to construct such a decomposition proceeds in rounds for each arm i; in a particular round, it "peels off" one strategy forest as described above, and ensures that the residual fractional solution continues to satisfy the LP constraints, guaranteeing that we can repeat this process; this is similar to (but slightly more involved than) performing flow decompositions. The decomposition lemma is proved in Appendix D.1:

Lemma 4.2

Given a solution to (LP_MAB), there exists a collection of polynomially many strategy forests {T(i,j)} such that, for every state u and time t, Σ_{j: time(i,j,u) = t} prob(i,j,u) = z_{u,t}. (To reiterate, even though we call this a convex decomposition, the sum of the probability values prob(i,j,ρ_i) of the root state of any arm is at most one by constraint (4.12), and hence the sum of the probabilities of the root over the decomposition could be less than one in general.) Hence, Σ_j prob(i,j,u) = Σ_t z_{u,t} for all states u.

For any T(i,j), these prob values satisfy a "preflow" condition: the in-flow at any node is always at least the out-flow, namely prob(i,j,u) ≥ Σ_{v: parent(v)=u} prob(i,j,v). This leads to the following simple but crucial observation.

Observation 4.3

For any arm i and index j, for any set of states X ⊆ S_i such that no state in X is an ancestor of another state in X in the transition tree T_i, and for any state u that is an ancestor of all states in X, prob(i,j,u) ≥ Σ_{v ∈ X} prob(i,j,v).

More generally, given similar conditions on X, if Y is a set of states such that for any v ∈ X there exists w ∈ Y such that w is an ancestor of v, then Σ_{w ∈ Y} prob(i,j,w) ≥ Σ_{v ∈ X} prob(i,j,v).
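A small sketch, in our own encoding (child lists, per-edge transition probabilities, and a played-predicate), of the bookkeeping behind these definitions: it recomputes the prob values of a strategy forest from its root value, and checks the "preflow" condition that the observation above rests on.

def forest_probs(root, children, p, root_prob, played):
    """prob(u) = root_prob times the product of transition probabilities down to u when u is
    played (time(u) != infinity), and 0 otherwise.  children[u]: list of children of u;
    p[(u, v)]: p_{u,v}; played(u): whether the forest plays u."""
    prob, stack = {root: root_prob if played(root) else 0.0}, [root]
    while stack:
        u = stack.pop()
        for v in children.get(u, []):
            prob[v] = p[(u, v)] * prob[u] if played(v) and prob[u] > 0 else 0.0
            stack.append(v)
    return prob

def check_preflow(children, prob):
    """In-flow at every node is at least the out-flow to its children; applying this repeatedly
    along ancestor-free sets under a common ancestor gives the observation above."""
    return all(prob.get(u, 0.0) + 1e-9 >= sum(prob.get(v, 0.0) for v in kids)
               for u, kids in children.items())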

4.2.2 Phase II: Eliminating Small Gaps

While Appendix A.3 shows that preemption is necessary to remain competitive with respect to Opt, we also should not get "tricked" into switching arms during very short breaks taken by the LP. For example, say an arm's plays form two long contiguous segments separated by a small gap. In this case, we should not lose out on profit from this arm by starting some other arm's plays during the break. To handle this issue, whenever some path in the strategy forest is almost contiguous—i.e., the gaps on it are relatively small—we make these portions completely contiguous. Note that we will not make the entire tree contiguous, but just combine some sections together.

Before we make this formal, here is some useful notation. Given a tree-node u of T(i,j), let head(u) be its ancestor node of least depth such that the plays from head(u) through u occur in consecutive time values. More formally, the path (head(u) = v_1, v_2, ..., v_k = u) in T(i,j) is such that time(v_l) = time(v_{l−1}) + 1 for all l ∈ {2, ..., k}. We also define the connected component of a node u, denoted by comp(u), as the set of all nodes v such that head(v) = head(u). Figure 4.2 shows the connected components and heads.

The main idea of our gap-filling procedure is the following: if a head state h is played at a time time(h) ≤ 2·depth(h), then we "advance" the component comp(h) and get rid of the gap between h and its parent (and recursively apply this rule). The intuition is that such vertices have only a small gap in their play and should rather be played contiguously. The procedure is described in more detail as follows.

1:  for each arm i and each strategy forest T(i,j) do
2:     while there exists a tree-node u such that time(head(u)) ≤ 2·depth(head(u)) and head(u) is not the root of T(i,j) do
3:        let h ← head(u).
4:        if h is not the root of T(i,j) then
5:           let v ← parent(h).
6:           advance the component rooted at h so that time(h) = time(v) + 1, to make comp(h) contiguous with the component of v, forming one larger component. Also alter the time values of the other nodes of comp(h) appropriately to maintain contiguity with h (and now with v).
Algorithm 4.1 Gap Filling Algorithm GapFill
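The following sketch implements one reading of GapFill, under our reconstruction of the rule above (advance a component whose head h is not the root and satisfies time(h) ≤ 2·depth(h)); the dict-based encoding (parent and depth maps, and a time map containing only played nodes) is ours.

def heads(parent, time):
    """head(u): the least-depth ancestor reached from u through consecutive time values."""
    h = {}
    for u in sorted(time, key=lambda v: time[v]):        # parents come before children
        p = parent.get(u)
        h[u] = h[p] if p is not None and p in time and time[u] == time[p] + 1 else u
    return h

def gap_fill(parent, depth, time, root):
    """Repeatedly advance a non-root component whose head h satisfies time(h) <= 2 * depth(h),
    closing the gap above h; every advance merges two components, so this terminates."""
    while True:
        h = heads(parent, time)
        bad = [hd for hd in set(h.values()) if hd != root and time[hd] <= 2 * depth[hd]]
        if not bad:
            return time
        hd = bad[0]
        shift = time[hd] - (time[parent[hd]] + 1)        # size of the gap above the head
        for v in time:
            if h[v] == hd:                               # advance the whole component of hd
                time[v] -= shift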

One crucial property is that these "advances" do not increase the number of plays that occur at any given time t by much. Essentially this is because if, for some time slot t, we "advance" a set of components that were originally scheduled after t so that they now cross time slot t, these components moved because their ancestor paths (fractionally) used up at least half of the time slots before t; since there are only t time slots before t, each used to at most unit extent, at most 2 units' worth of components can be moved across t. Hence, in the following, we assume that our T(i,j)'s satisfy the properties in the following lemma.

Lemma 4.4

Algorithm GapFill produces a modified collection of T(i,j)'s such that

  • For each tree-node u whose head is not the root of its tree, time(head(u)) > 2·depth(head(u)).

  • The total extent of plays at any time t, i.e., Σ_{(i,j,u): time(i,j,u) = t} prob(i,j,u), is at most 3.

The proof appears in Appendix D.2.

4.2.3 Phase III: Scheduling the Arms

Having done this preprocessing, the rounding algorithm is simple: it first randomly selects at most one strategy forest T(i, σ(i)) from the collection for each arm i. It then picks the active arm whose current connected component is earliest (i.e., the one whose head has the smallest time value), plays it to the end—which either results in terminating the arm, or making a transition to a state played much later in time, and repeats. The formal description appears in Algorithm 4.2. (If there are ties in Step 5, we choose the arm with the smallest index.) Note that the algorithm runs as long as there is some active arm, regardless of whether or not we have run out of plays (i.e., the budget is exceeded)—however, we only count the profit from the first B plays in the analysis.

1:  for each arm i, sample strategy T(i,j) with probability prob(i,j,ρ_i)/γ, where γ is a suitable constant damping factor; ignore arm i with the remaining probability.
2:  let A ← the set of "active" arms which chose a strategy in the random process.
3:  for each i ∈ A, let σ(i) ← the index of the chosen strategy forest, and let currstate(i) ← ρ_i.
4:  while there are active arms (A ≠ ∅) do
5:     let i* ← the active arm whose current state is played earliest in the LP (i.e., i* = argmin_{i ∈ A} time(i, σ(i), currstate(i))).
6:     let u ← currstate(i*).
7:     while currstate(i*) ≠ ∅ and currstate(i*) ∈ comp(u) do
8:        play arm i* at state currstate(i*).
9:        update currstate(i*) to be the new state of arm i*; let currstate(i*) ← ∅ if the new state is not played by T(i*, σ(i*)) (i.e., its time is ∞).
10:     if currstate(i*) = ∅ then
11:        let A ← A \ {i*}.
Algorithm 4.2 Scheduling the Connected Components: Algorithm AlgMAB
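A compact simulation of AlgMAB under our reconstruction and encoding: forests[i] is the list of candidate strategy forests for arm i, each given as {"time": {state: t}, "prob": {state: value}} with unplayed states absent from "time"; the damping constant is a placeholder, since the paper's constant is not shown above.

import random

def alg_mab(arms, forests, B, damping=4.0):
    """arms[i] = (root, tree, rewards) as in the earlier sketches.  Profit is counted only for
    the first B plays, mirroring the analysis."""
    chosen, state = {}, {}
    for i, flist in enumerate(forests):                 # step 1: sample at most one forest per arm
        r, acc = random.random(), 0.0
        for j, f in enumerate(flist):
            acc += f["prob"][arms[i][0]] / damping
            if r < acc:
                chosen[i], state[i] = j, arms[i][0]
                break

    def transition(i, u):                               # one random transition of arm i from u
        r, acc = random.random(), 0.0
        for v, p in arms[i][1].get(u, []):
            acc += p
            if r < acc:
                return v
        return None                                     # leaf: the arm is exhausted

    reward, plays = 0.0, 0
    while chosen:                                       # step 4
        f = {i: forests[i][chosen[i]] for i in chosen}
        i = min(chosen, key=lambda a: f[a]["time"][state[a]])    # step 5: earliest current head
        while i in chosen:                                       # steps 7-9: play one component
            if plays < B:
                reward += arms[i][2][state[i]]
            plays += 1
            prev, nxt = state[i], transition(i, state[i])
            if nxt is None or nxt not in f[i]["time"]:
                del chosen[i]                                    # the forest abandons the arm
            else:
                state[i] = nxt
                if f[i]["time"][nxt] != f[i]["time"][prev] + 1:
                    break                                        # new component: back to step 5
    return reward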

Observe that Steps 7-9 play a connected component of a strategy forest contiguously. In particular, this means that all the states considered in Step 5 are head vertices of the corresponding strategy forests. These facts will be crucial in the analysis.

Lemma 4.5

For any arm i and strategy index j, conditioned on σ(i) = j after Step 1 of AlgMAB, the probability of playing state u ∈ S_i is prob(i,j,u)/prob(i,j,ρ_i), where the probability is over the random transitions of arm i.

The above lemma is relatively simple, and is proved in Appendix D.3. The rest of the section proves that, in expectation, we collect a constant fraction of the LP reward of each strategy forest before running out of budget; the analysis is inspired by our rounding procedure. We mainly focus on the following lemma.

Lemma 4.6

Consider any arm i and strategy index j. Then, conditioned on σ(i) = j and on the algorithm playing state u, the probability that this play happens within a constant factor of time(i,j,u) is at least a positive constant.

Proof.

Fix an arm i and an index j for the rest of the proof. Given a state u ∈ S_i, let E_u denote the event that σ(i) = j and the algorithm plays state u. Also, let h = head(u) be the head of the connected component containing u in T(i,j). Let the random variable τ_u (respectively τ_h) be the actual time at which state u (respectively state h) is played—these random variables take the value ∞ if the arm is not played in these states. Then

τ_u = τ_h + depth(u) − depth(h),    (4.14)

because the time between playing h and u is exactly depth(u) − depth(h), as Steps 7-9 play the component comp(u) contiguously.