
Approximation Algorithms for Restless Bandit Problems

This paper combines and generalizes results presented in two papers [25] and [28] that appeared in the FOCS ’07 and SODA ’09 conferences, respectively.

Sudipto Guha. Department of Computer and Information Sciences, University of Pennsylvania, Philadelphia PA 19104-6389. Email: sudipto@cis.upenn.edu. Research supported in part by an Alfred P. Sloan Research Fellowship and an NSF CAREER Award CCF-0644119.
Kamesh Munagala. Department of Computer Science, Duke University, Durham NC 27708-0129. Email: kamesh@cs.duke.edu. Research supported by NSF via a CAREER award and grant CNS-0540347.
Peng Shi. Duke University, Durham NC 27708. Email: peng.shi@duke.edu. This research was supported by the Duke University Work-Study Program and by NSF award CNS-0540347.
Abstract

The restless bandit problem is one of the most well-studied generalizations of the celebrated stochastic multi-armed bandit problem in decision theory. In its ultimate generality, the restless bandit problem is known to be PSPACE-Hard to approximate to any non-trivial factor, and little progress has been made on this problem despite its significance in modeling activity allocation under uncertainty.

We consider a special case that we call Feedback MAB, where the reward obtained by playing each of n independent arms varies according to an underlying on/off Markov process whose exact state is only revealed when the arm is played. The goal is to design a policy for playing the arms in order to maximize the infinite horizon time average expected reward. This problem is also an instance of a Partially Observable Markov Decision Process (POMDP), and is widely studied in wireless scheduling and unmanned aerial vehicle (UAV) routing. Unlike the stochastic MAB problem, the Feedback MAB problem does not admit greedy index-based optimal policies. The state of the system at any time step encodes the beliefs about the states of different arms, and the policy decisions change these beliefs – this aspect complicates the design and analysis of simple algorithms.

We develop a novel and fairly general duality-based algorithmic technique that yields a surprisingly simple and intuitive 2-approximate greedy policy for this problem. We then define a general sub-class of restless bandit problems that we term Monotone bandits, for which our policy is a 2-approximation. Our technique is robust enough to handle generalizations of these problems that incorporate various side-constraints such as blocking plays and switching costs. This technique is also of independent interest for other restless bandit problems, and we provide an example in non-preemptive machine replenishment. We finally show that our policies are closely related to the Whittle index, which is widely used for its simplicity and efficiency of computation. In fact, not only is our policy just as efficient to compute as the Whittle index, but in addition, it provides surprisingly strong constant-factor guarantees even in cases where the Whittle index is provably polynomially worse.

By presenting the first (and efficient) approximations for non-trivial instances of restless bandits as well as of POMDPs, our work initiates the study of approximation algorithms in both these contexts.

1 Introduction

The celebrated multi-armed bandit problem (MAB) models the central trade-off in decision theory between exploration and exploitation, or in other words between learning about the state of a system and utilizing the system. In this problem, there are n competing options, referred to as “arms,” yielding unknown rewards. Playing an arm yields a reward drawn from an underlying distribution, and the observed reward partially resolves this distribution. The goal is to sequentially play the arms in order to maximize the reward obtained over some time horizon.

Typically, the multi-armed bandit problem is studied under one of two assumptions:

  1. The underlying reward distribution for each arm is fixed but unknown, and a prior of this distribution is specified as input (stochastic multi-armed bandits [2, 10, 42, 48]); or

  2. The underlying rewards can vary with time in an adversarial fashion, and the comparison is against an optimal strategy that always plays one arm, albeit with the benefit of hindsight (adversarial multi-armed bandits [7, 15, 17]).

Relaxing both assumptions simultaneously leads to the notorious restless bandit problem in decision theory, which in its ultimate generality is PSPACE-hard to even approximate [41]. In the last two decades, in spite of the growth of approximation algorithms and the numerous applications of restless bandits [3, 19, 20, 21, 37, 40, 13, 49, 52], the approximability of these problems has remained unexplored. In this paper, we provide a general algorithmic technique that yields the first approximations to a large class of these problems that are commonly studied in practice.

An important subclass of restless bandit problems comprises situations where the system is agnostic of the exploration – the exploration gives us information about the state of the system but does not interfere with the evolution of the system. One such problem is the Feedback MAB, which models opportunistic multi-channel access at a wireless node [25, 30, 54]: the bandit corresponds to a wireless node with access to multiple noisy channels (arms). The state of an arm is the state (good/bad) of the corresponding channel, which varies according to a bursty 2-state Markov process. Playing the arm corresponds to transmitting on the channel, yielding reward if the transmission is successful (good channel state), and at the same time revealing to the transmitter the current state of the channel. This corresponds to the Gilbert-Elliot model [33] of channel evolution. The goal is to find a transmission policy that chooses one channel to transmit on every time step, so as to maximize the long-term transmission rate. Feedback MAB also models Unmanned Aerial Vehicle (UAV) routing [40]: the arms are locations of possibly interesting events, and whether a location is interesting or uninteresting follows a 2-state Markov process. Visiting a location with the UAV corresponds to playing the arm, and yields reward if an interesting event is detected. The goal is to find a routing policy that maximizes the long-term average reward from interesting events.

This problem is also a special case of Partially Observable Markov Decision Processes or POMDPs [31, 44, 45]. The state of each arm evolves according to a Markov chain whose state is only observed when the arm is played. The player’s partial information, encapsulated by the last observed state and the number of steps since last playing, yields a belief on the current state. (This belief is simply a probability distribution for the arm being good or bad.) The player uses this partial information in making the decision about which arm to play next, which in turn affects the information at future times. While such POMDPs are widely used in control theory, they are in general notoriously intractable [10, 31]. In this paper we provide the first approximation for the Feedback MAB and a number of its important extensions. This represents the first approximation guarantee for a POMDP, and the first guarantee for a MAB problem with time-varying rewards that compares to an optimal solution allowed to switch arms at will.

Before we present the problem statements formally, we survey literature on the stochastic multi-armed bandit problem. (We discuss adversarial MAB after we present our model and results.)

1.1 Background: Stochastic MAB and Restless Bandits

The stochastic MAB was first formulated by Arrow et al. [2] and Robbins [42]. It is set in a Bayesian (or decision-theoretic) framework: we successively choose between several options given some prior information (specified by distributions), and our beliefs are updated via Bayes’ rule conditioned on the results of our choices (observed rewards).

More formally, we are given a “bandit” with n independent arms. Each arm can be in one of several states drawn from a finite set. At any time step, the player can play one arm. If an arm in state u is played, it transitions in a Markovian fashion to a state v with a specified probability, and yields a reward associated with u. The states of arms that are not played stay the same. The initial state models the prior knowledge about the arm, and the states in general capture the posterior conditioned on the observations from sequential plays. The problem is: given the initial states of the arms, find a policy for playing the arms in order to maximize either the infinite-horizon discounted reward or the infinite-horizon time-average reward, where in the former case future rewards are geometrically discounted by a specified factor per step. A policy is a (possibly implicit) specification, fixed up front, of which arm (or distribution over arms) to play for every possible joint state of the arms.

It is well known that Bellman’s equations [10] yield the optimal policy by dynamic programming. The main issue in the stochastic setting is in efficiently computing and succinctly specifying the optimal policy: the input to an algorithm specifies the rewards and transition probabilities for each arm, and thus has size polynomial in the number of arms and states, but the joint state space is exponential in the number of arms n. We seek polynomial-time algorithms (in terms of the input size) that compute (near-)optimal policies with poly-size specifications. Moreover, we require the policies to be executable in poly-time per step.

Note that since a policy is a fixed (possibly randomized) mapping from the exponential-size joint state space to a set of actions, ensuring poly-time computation and execution often requires simplifying the description of the optimal policy using the problem structure. The stochastic MAB problem is the most well-known decision problem for which such structure is known: the optimal policy is a greedy policy termed the Gittins index policy [18, 46, 10]. In general, an index policy specifies a single number called an “index” for each state of each arm, and at every time step plays the arm whose current state has the highest index. Index policies are desirable since they can be compactly represented, so they are the heuristic method of choice for several MDP problems. In addition, index policies are also optimal for several generalizations of the stochastic MAB, such as arm-acquiring bandits [51] and branching bandits [50]. In fact, a general characterization of problems for which index policies are optimal is now known [12].
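
As a small illustration (ours, not from the literature), the entire online behavior of an index policy is captured by a one-line decision rule once the index table has been computed offline:

def index_policy_step(index, current_state):
    # index[i][s] is the precomputed index of arm i in state s;
    # current_state[i] is the current state of arm i.
    # An index policy simply plays the arm whose current state has the highest index.
    return max(range(len(current_state)), key=lambda i: index[i][current_state[i]])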

Restless Bandits.

In the stochastic MAB problem, the underlying reward distributions for each arm are fixed but unknown. However, if the rewards can vary with time, the problem stops admitting optimal index policies or efficient solutions. The problem now needs to be modeled as a restless bandit problem, first proposed by Whittle [52]. The problem statement of restless bandits is similar to stochastic MAB, except that when an arm in state u is not played, its state also evolves, to a state v with a specified probability. Therefore, the state of each arm varies according to an active transition matrix when the arm is played, and according to a passive transition matrix when it is not played. Unlike the stochastic MAB problem, which is interesting only in the discounted reward setting (playing the arm with the highest long-term average reward exclusively is the trivial optimal policy for stochastic MAB in the infinite-horizon average reward setting), the restless bandit problem is interesting even in the infinite-horizon average reward setting; this is the setting in which the problem has typically been studied, and so we limit ourselves to this setting in this paper. It is relatively straightforward to show that no index policy can be optimal for these problems; in fact, Papadimitriou and Tsitsiklis [41] show that even when all active and passive transition probabilities are either 0 or 1 (deterministic transitions), computing the optimal policy is a PSPACE-hard problem. Their proof in fact shows that deciding whether the optimal reward is non-zero is also PSPACE-hard, hence ruling out any approximation algorithm as well.

On the positive side, Whittle [52] presents a poly-size LP relaxation of the problem. In this relaxation, the constraint that exactly one arm is played per time step is replaced by the constraint that one arm on average is played per time step. In the LP, this is the only constraint connecting the arms. (Such decision problems have been termed weakly coupled systems [29, 1].) Based on the Lagrangean of this relaxation, Whittle [52] defines a heuristic index that generalizes the Gittins index; this is termed the Whittle index (see Section 3). Though this index is widely used in practice and has excellent empirical performance [3, 19, 20, 21, 37, 40, 49], the known theoretical guarantees [49, 19] are very weak. In summary, despite being very well-motivated and extensively studied, restless bandit problems have almost no positive results on approximation guarantees.

1.2 Results and Roadmap

We provide the first approximation algorithm for both a restless bandit problem and a partially observable Markov decision problem, by providing a 2-approximate index policy for the Feedback MAB problem, which belongs to both classes. We show several other results; however, before presenting the specifics, we place our contribution in the context of existing techniques in control theory.

Technical Contributions.

Our algorithmic technique for this problem (Section 2) involves solving (in polynomial time) the Lagrangean of Whittle’s LP relaxation for a suitable (and subtle) “balanced” choice of the Lagrange multiplier, converting this into a feasible index policy, and using an amortized accounting of the reward for the analysis. We show that this technique is closely related to the Whittle index [52, 40, 30], and in fact provide the first approximation analysis of (a subtle variant of) the Whittle index, which is widely used in the control theory literature in the context of Feedback MAB problems (Section 3). We believe that interest in analyzing the performance guarantees of the numerous indices used in the literature will only increase, and that our analysis provides a useful template.

However, the key difference between Whittle’s index and our index policy is the following: The former chooses one Lagrange multiplier (or index) per state of each arm, with the policy playing the arm with the largest index. This has the advantage of separate efficient computations for different arms; and in addition, such a policy (the Gittins index policy [18]) is known to be optimal for the stochastic MAB. However, it is well-known [4, 8, 14] that this intuition about playing the arm with the largest index being optimal becomes increasingly invalid when complicated side-constraints such as time-varying rewards (Feedback MAB), blocking plays, and switching costs are introduced. In fact, we show a concrete problem in Section 8 where the Whittle index has a performance gap.

In contrast to the Whittle index, our technique chooses a single global Lagrange multiplier via a careful accounting of the reward, and develops a feasible policy from it. Unlike the Whittle index, this technique is sufficiently robust to encompass a large number of often-used variants of Feedback MAB problems: plays with varying duration (Section 5), switching costs (Section 6), and observation costs (Section 7). In fact, we identify a general Monotone condition on restless bandit problems under which our technique applies (Section 4). Furthermore, our technique provides approximations to other classic restless bandit problems even when Whittle’s index is polynomially sub-optimal: we show an example in the non-preemptive machine replenishment problem (Section 8). Finally, since our technique is based on solving the Lagrangean (just like the Whittle index), the computation time is comparable to that for such indices. (The Lagrangean is explicit in Sections 2 and 3; in Sections 4–8 we present our algorithms in terms of first solving a linear program, but it is easy to see that this is equivalent to solving the Lagrangean, and hence to the computation required for Whittle’s index. The details are quite standard and can be reconstructed from those in Sections 2 and 3.)

In summary, our technique succeeds in finding the first provably approximate policies for widely-studied control problems, without sacrificing efficiency in the process. We believe that the generality of this technique will be useful for exploring other useful variations of these problems as well as providing an alternate algorithm for practitioners.

Specific Results.

In terms of specific results, the paper is organized as follows:

  • We begin by presenting a 2-approximation for Feedback bandits in Section 2. We also provide an integrality gap instance showing that our analysis is nearly tight.

  • In Section 3 we show that our analysis technique can be used to prove that a thresholded variant of the Whittle index is a 2-approximation. We also show instances where the reward of any index policy falls short of the reward of the optimal policy by a multiplicative factor bounded away from 1. Therefore, although the Whittle index is not optimal, our result sheds light on its observed superior performance in this specific context.

  • In Section 4 we generalize the result of Section 2 to define a general sub-class of restless bandit problems based on a critical set of properties: separability and monotonicity. For this subclass, termed Monotone bandits (which generalizes Feedback MAB), we provide a 2-approximation by generalizing the technique in Section 2. Our technique now introduces a balance constraint in the dual of the natural LP relaxation, and constructs the index policy from the optimal dual solution. We further show that in the absence of monotonicity the problem is NP-hard to approximate, and that in the absence of separability the relaxation has an unbounded integrality gap.

  • In Section 5 we extend Feedback MAB (as well as Monotone bandits) to consider multiple simultaneous blocking plays of varying durations.

  • In Section 6 we extend Feedback MAB (and Monotone bandits) to consider switching costs.

  • In Section 7 we extend Feedback MAB to a variant where the information acquisition is varied, namely, an arm has to be explicitly probed at some cost to obtain its state.

  • In Section 8, we derive a 2-approximation for a classic restless bandit problem called non-preemptive machine replenishment [10, 23, 39]. We also show that the Whittle index for this problem can have performance that is polynomially worse than that of the optimal policy. Thus the technique introduced in this paper can be superior to the Whittle index and similar policies.

1.3 Related Work

Contrast with the Adversarial MAB Problem. While our problem formulations are based on the stochastic MAB problem, one might be interested in a formulation based on the adversarial MAB [7, 34]. Such a formulation might be to assume that rewards can vary adversarially, and that the objective is to compete with a restricted optimal solution that always plays the same arm but with the benefit of hindsight.

These different formulations result in fundamentally different problems. Under our formulation, the difficulty is computational: we want to compute policies for playing the arms, assuming stochastic models of how the system varies with time. Under the adversarial formulation, the difficulty is informational: we would be interested in the regret of not having the benefit of hindsight. A sequence of papers show near-tight regret bounds in fairly general settings [5, 6, 7, 15, 17, 34, 36]. However, applying this framework here is not satisfying: it is straightforward to show that a policy for Feedback MAB that is allowed to switch arms can obtain far larger reward, by a factor that grows with the number of arms, than a policy that is not allowed to do so (even assuming hindsight). Another approach would be to define each policy as an “expert” and use the low-regret experts algorithm [15]; however, the number of policies is super-exponentially large, which would lead to weak regret bounds, along with exponential-size policy descriptions and exponential per-step execution time.

We note that developing regret bounds in the presence of changing environments has received significant recent interest in computational learning [5, 7, 16, 32, 43]; however, this direction requires strong assumptions such as bounded switching between arms [5] or slowly varying environments [32, 43], neither of which applies to Feedback MAB. In independent work, Slivkins and Upfal [43] consider a modification of Feedback MAB where the underlying state of each arm varies according to a reflected Brownian motion with bounded variance. As discussed in [43], this problem is technically very different from ours, even requiring different performance metrics.

Other Related Work.

The results in [24, 22, 26, 27] consider variants of the stochastic MAB where the underlying reward distribution does not change and only a limited time is allotted to learning about this environment. Although several of these results use LP rounding, they have little connection to the duality based framework considered here.

Our duality-based framework yields a 2-approximate index policy for non-preemptive machine replenishment (Section 8). Elsewhere, Munagala and Shi [39] considered the special case of preemptive machine replenishment, for which the Whittle index is equivalent to a simple greedy scheme. They show that this greedy policy, though not optimal, is a constant-factor approximation. However, the techniques there are based on queuing analysis, and do not extend to the non-preemptive case, where the Whittle index can be an arbitrarily poor approximation (as shown in Section 8).

Our solution technique differs from primal-dual approximation algorithms [47] and online algorithms [53], which relax either the primal or the dual complementary slackness conditions using a careful dual-growing procedure. Our index policy and associated potential function analysis crucially exploit the structure of the optimal dual solution that is gleaned using both the exact primal as well as dual complementary slackness conditions. Furthermore, our notion of dual balancing is very different from that used by Levi et al [35] for designing online algorithms for stochastic inventory management.

2 The Feedback MAB Problem

In this problem, first formulated independently in [25, 54, 30, 40], there is a bandit with n independent arms. Arm i has two states: the good state g yields reward r_i, and the bad state b yields no reward. The evolution of the state of each arm follows a bursty 2-state Markov process which does not depend on whether the arm is played or not at a time slot. Let s_i(t) denote the state of arm i at time t. Denote the transition probabilities of the Markov chain as follows: α_i = Pr[s_i(t+1) = g | s_i(t) = b] and β_i = Pr[s_i(t+1) = b | s_i(t) = g]. The values r_i, α_i, β_i are specified as input. The “burstiness” assumption simply means α_i + β_i ≤ 1 − δ for some small δ > 0 specified as part of the input. The evolution of states for different arms is independent. Any policy chooses at most one arm to play every time slot. Each play is of unit duration, yields reward depending on the state of the arm, and reveals to the policy the current state of that arm. When an arm is not played, its true underlying state cannot be observed, which makes the problem a POMDP. The goal is to find a policy for playing the arms in order to maximize the infinite-horizon average reward.
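
To make the model concrete, the following Python sketch (our illustration; the class and function names are not from the paper) simulates a Feedback MAB instance: each arm is a 2-state Markov chain that evolves every step whether or not it is played, and a play collects the reward of the current state and reveals that state to the policy.

import random

class FeedbackArm:
    # One arm of Feedback MAB: alpha = Pr[b -> g], beta = Pr[g -> b], r = reward in state g.
    def __init__(self, alpha, beta, r, rng):
        assert alpha + beta < 1.0, "burstiness assumption"
        self.alpha, self.beta, self.r, self.rng = alpha, beta, r, rng
        # start the chain at its stationary distribution
        self.state = 'g' if rng.random() < alpha / (alpha + beta) else 'b'

    def evolve(self):
        # the state evolves every time slot, independent of plays
        if self.state == 'g' and self.rng.random() < self.beta:
            self.state = 'b'
        elif self.state == 'b' and self.rng.random() < self.alpha:
            self.state = 'g'

    def play(self):
        # playing yields reward r in the good state and reveals the current state
        return (self.r if self.state == 'g' else 0.0), self.state

def simulate(arms, policy, horizon):
    # policy maps the list of observations (s, t) or None to an arm index or None
    total, obs = 0.0, [None] * len(arms)
    for _ in range(horizon):
        choice = policy(obs)
        if choice is not None:
            reward, state = arms[choice].play()
            total += reward
            obs[choice] = (state, 0)
        for i, arm in enumerate(arms):
            arm.evolve()
            if obs[i] is not None:
                obs[i] = (obs[i][0], obs[i][1] + 1)
    return total / horizon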

First observe that we can change the reward structure of Feedback MAB so that when an arm is played, we obtain reward from the last-observed state instead of the currently observed state. This does not change the average reward of any policy. This allows us to encode all the state of each arm as follows.

Proposition 2.1.

From the perspective of any policy, the state of any arm can be encoded as (s, t), which denotes that the arm was last observed t ≥ 1 steps ago to be in state s ∈ {g, b}.

Note that any policy maps each possible joint state of the n arms into an action of which arm to play. Such a mapping has size exponential in n. The standard heuristic is to consider index policies: policies which define an “index” or number for each state of each arm, and play the arm with the highest current index. The following theorem shows that playing the arm with the highest myopic reward does not work, and that index policies in general are non-optimal. Therefore, our problem is interesting, and the best we can hope for with index policies is a constant-factor approximation.

Theorem 2.1.

(Proved in Appendix A) For Feedback MAB, the reward of the optimal policy has a multiplicative gap against that of the myopic index policy, and a multiplicative gap against that of the optimal index policy.

Roadmap.

In this section, we show that a simple index policy is a 2-approximation. This is based on a natural LP relaxation suggested by Whittle, which we discuss in Section 2.1; this formulation has infinitely many constraints. We then consider the Lagrangean of this formulation in Section 2.2, and analyze its structure via duality, which enables computing its optimal solution in polynomial time. At this point, we deviate significantly from previous literature and present our main contribution in Section 2.3: a subtle and powerful “balanced” choice of the Lagrange multiplier, which enables the design of an intuitive index policy, BalancedIndex, along with an equally intuitive analysis. We use duality and potential function arguments to show that the policy is a 2-approximation. We conclude by showing that the gap of Whittle’s relaxation is close to 2, indicating that our analysis is reasonably tight. This analysis technique generalizes easily (as explored in Sections 4–8) and has rich connections to other index policies, most notably the Whittle index (explored in Section 3).

2.1 Whittle’s LP

Whittle’s LP is obtained by effectively replacing the hard constraint of playing one arm per time step, with allowing multiple plays per step but requiring one play per step on average. Hence, the LP is a relaxation of the optimal policy.

Definition 1.

For an arm with transition probabilities α and β, let v_t be the probability of the arm being in state g when it was last observed in state g exactly t steps ago. Let u_t be the same probability when the last observed state was b. We have:

v_t = α/(α+β) + (β/(α+β))·(1 − α − β)^t,    u_t = (α/(α+β))·(1 − (1 − α − β)^t).

Fact 2.2.

The functions u_t and 1 − v_t are monotonically increasing, concave functions of t.
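
These closed forms are standard 2-state Markov chain algebra; the short Python check below (ours) evaluates them and verifies numerically that, for example, u_t increases with shrinking increments, consistent with Fact 2.2.

def beliefs(alpha, beta, t):
    # v_t = Pr[good now | observed good t steps ago], u_t = Pr[good now | observed bad t steps ago]
    pi_g = alpha / (alpha + beta)           # stationary probability of the good state
    decay = (1.0 - alpha - beta) ** t       # positive under the burstiness assumption
    return pi_g + (1.0 - pi_g) * decay, pi_g * (1.0 - decay)

if __name__ == "__main__":
    alpha, beta = 0.1, 0.3                  # example parameters with alpha + beta < 1
    us = [beliefs(alpha, beta, t)[1] for t in range(1, 60)]
    assert all(b > a for a, b in zip(us, us[1:]))                    # u_t is increasing
    gaps = [b - a for a, b in zip(us, us[1:])]
    assert all(d2 <= d1 + 1e-12 for d1, d2 in zip(gaps, gaps[1:]))   # and concave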

We now present Whittle’s LP, and interpret it in the lemma that immediately follows.

Lemma 2.3.

The optimal objective of Whittle’s LP, which we denote OPT, is at least the value of the optimal policy.

Proof.

Consider the optimal policy. In the execution of this policy, for each arm i and state (s, t) with s ∈ {g, b} and t ≥ 1, let the variable x_i^{(s,t)} denote the probability (or fraction of time steps) of the event: arm i is in state (s, t) and gets played. Let y_i^{(s,t)} correspond to the probability of the event that the state is (s, t) and the arm is not played. Since the underlying Markov chains are ergodic, the optimal policy when executed is ergodic, and the above probabilities are well-defined.

Now, at any time step, some arm i in some state (s, t) is played (at most one, by the definition of a policy), which implies that the values x_i^{(s,t)} are probabilities of mutually exclusive events. This implies they satisfy the first constraint in the LP. Similarly, for each arm i, at any step, this arm is in some state (s, t) and is either played or not played, so that the values x_i^{(s,t)} and y_i^{(s,t)} correspond to mutually exclusive events. This implies that, for each i, they satisfy the second constraint. For any arm i and state (s, t), the LHS of the third constraint is the probability of being in this state, while the RHS is the probability of entering this state; these are clearly identical in the steady state. For arm i, the LHS of the fourth (resp. fifth) constraint is the probability of being in the state entered immediately after an observation of g (resp. b), and the RHS is the probability of entering this state; again, these are identical.

This shows that the probability values defined for the execution of the optimal policy are feasible for the constraints of the LP. The value of the optimal policy is precisely its per-step reward Σ_i Σ_{t ≥ 1} r_i x_i^{(g,t)}, which is at most OPT, the maximum possible objective for the LP. ∎

The above LP encodes in a single variable x_i^{(s,t)} the probability that arm i is in state (s, t) and gets played; however, we note that in the optimal policy, this decision to play actually depends on the joint state of all the arms. This separation of the joint probabilities into individual probabilities effectively relaxes the condition of having one play per step, to allowing multiple plays per step but requiring one play per step on average. While the “policy” suggested by Whittle’s LP is therefore infeasible, the relaxation allows us to compute an upper bound on the value of the optimal feasible policy.

It is convenient to eliminate the y variables by substitution, whereupon the last two constraints collapse into the same constraint. Thus, we have the natural LP formulation shown in Figure 1. We note that the first constraint can either be an inequality (at most one play per step on average) or an equality; w.l.o.g. we use equality, since we can add a dummy arm that does not yield any reward on playing.

Figure 1: The linear program (Whittle) for the feedback MAB problem.

From now on, let OPT denote the value of the optimal solution to (Whittle). The LP in its current form has infinitely many constraints (one for every t ≥ 1); we will now show that this LP can be solved in polynomial time to arbitrary precision by finding structure in the Lagrangean.

2.2 Decoupling Arms via the Lagrangean

In (Whittle), the only constraint connecting different arms is the constraint requiring one play per step on average:

Σ_i Σ_{s ∈ {g,b}} Σ_{t ≥ 1} x_i^{(s,t)} = 1.

We absorb this constraint into the objective via a Lagrange multiplier λ ≥ 0; we call the resulting relaxed program LPLagrange.

Through the Lagrangean, we have effectively removed the only constraint that connected multiple arms. LPLagrange now yields n disjoint maximization problems, one for each arm i: at any time step, arm i can be played (and reward obtained from it), or not played. Whenever the arm is played, we incur a penalty of λ in addition to the reward. The goal is to maximize the expected reward minus penalty. Note that if the penalty is zero, the arm is played every step, and if the penalty is sufficiently large, the optimal solution is to never play the arm.
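
For concreteness, in the notation of the proof of Lemma 2.3 (where x_i^{(s,t)} is the long-run probability that arm i is in state (s, t) and is played), the relaxed objective can be written as

  maximize   λ + Σ_{i=1}^{n} [ Σ_{t ≥ 1} r_i x_i^{(g,t)} − λ Σ_{s ∈ {g,b}} Σ_{t ≥ 1} x_i^{(s,t)} ],

subject only to the per-arm constraints of (Whittle). This display is our reconstruction of the decomposition rather than a verbatim copy of the original program; the bracketed term is precisely the reward minus penalty of a single-arm policy for arm i, which is why the program decouples across arms.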

Definition 2.

For each arm i, let P_i* denote the optimal single-arm policy under penalty λ, and let H_i(λ) denote the corresponding optimal reward minus penalty. Note that the global reward minus penalty is simply the sum over the arms, Σ_i H_i(λ).

2.2.1 Characterizing the Optimal Single-arm Policy

We first show that the optimal policy for any arm i belongs to the class of policies P_i(t) for t ≥ 1, whose specification is presented in Figure 2. Intuitively, step (1) corresponds to exploitation, and step (2) to exploration. Set P_i(∞) to be the policy that never plays the arm.

Policy P_i(t): 1. If the arm was just observed to be in state g, then play the arm. 2. If the arm was just observed to be in state b, wait t steps and then play the arm.

Figure 2: The policy P_i(t).

To show this, we work with the dual of LPLagrange, which we denote Whittle-Dual.

The fact that the optimal single-arm policy belongs to the class {P_i(t)} comes from part (5) of the following lemma.

Lemma 2.4.

For any λ ≥ 0, in the optimal solution to Whittle-Dual, the following hold for any arm i with H_i(λ) > 0:

  1. The contribution of arm i to the optimal dual objective of Whittle-Dual is exactly H_i(λ).

  2. .

  3. For some t_i ≥ 1, the dual constraints corresponding to states (b, t_i) and (g, 1) are tight.

  4. and .

  5. The optimal single-arm policy for arm i is P_i(t_i).

Proof.

The first part follows from strong duality: ignoring the constant λ in the objective, the problem LPLagrange separates into n independent LPs, one for each arm. The dual objective for arm i must therefore equal its primal objective, which is precisely H_i(λ).

If H_i(λ) > 0, the solution to the LP for arm i corresponds to a policy of the form P_i(t_i). In order to have non-zero H_i(λ), such a policy must play the arm in some state (b, t_i), as well as in state (g, 1). Since x_i^{(s,t)} is the probability that this policy plays in state (s, t), this implies x_i^{(b,t_i)} > 0 and x_i^{(g,1)} > 0.

Since these primal variables are strictly positive, by complementary slackness the corresponding dual constraints are tight. Since the LHS of each such constraint is at least zero, this proves parts (2) and (3).

To see part (4), observe that for the set of constraints , since is a monotonically increasing function of , the RHS is monotonically decreasing in . Since the LHS is monotonically increasing, if the LHS and RHS are equal, they have to be so for . Now, since , by complementary slackness, . By the above argument, , which completes the proof of part (4).

Since the dual constraints corresponding to states (b, t_i) and (g, 1) are tight, the optimal policy for arm i plays the arm in state (b, t_i) and in state (g, 1), which is precisely the description of P_i(t_i). This proves part (5). ∎

It is instructive to interpret the problem as follows: amortize the reward so that for each play, the arm yields a steady reward of λ. The goal is to find the single-arm policy that maximizes the excess reward per step over and above this amortized reward per play. As we have shown above, the optimal value for this problem is precisely H_i(λ), and the policy that achieves it belongs to the class {P_i(t)}.

2.2.2 Solving LPLagrange

Having decomposed the program LPLagrange into independent maximization problems, one for each arm, and having characterized the optimal single-arm policies, we can now solve the program in polynomial time. It turns out that this reduces to simple function maximization via closed-form expressions.

Definition 3.

For policy P_i(t), let R_i(t) denote the expected per-step reward, and let Q_i(t) denote the expected rate of play, so that R_i(t) − λ Q_i(t) is the value (reward minus penalty) of P_i(t). Also define:

t_i = argmax_{t ≥ 1} ( R_i(t) − λ Q_i(t) ).    (1)

Finally, let H_i(λ) = max( 0, R_i(t_i) − λ Q_i(t_i) ) and P_i* = P_i(t_i) whenever H_i(λ) > 0 (and the never-play policy otherwise).

Note that the optimal reward minus penalty for arm i is simply H_i(λ). Since each P_i(t) corresponds to a Markov chain, it is straightforward to obtain closed-form expressions for R_i(t) and Q_i(t).

Figure 3: Markov chain for the policy P_i(t).
Lemma 2.5.

In playing an arm with reward r and transition probabilities α (bad to good) and β (good to bad), the policy P(t) yields average reward R(t) = r·u_t/(u_t + tβ), and expected rate of play Q(t) = (u_t + β)/(u_t + tβ). Recall that u_t is the probability that the arm is good given that it was observed to be bad t steps ago.

Proof.

The Markov chain describing the policy P(t) is shown in Figure 3, and has t + 1 states, which we denote g, b_1, b_2, …, b_t. The state g corresponds to the arm having last been observed in state g, and the state b_k corresponds to the arm having been observed in state b exactly k steps ago. The transition probability from state g to itself is 1 − β, from state g to state b_1 is β, from state b_k to state b_{k+1} (for k < t) is 1, and from state b_t to state g is u_t (with the remaining probability 1 − u_t going to b_1). Let π_g, π_{b_1}, …, π_{b_t} denote the steady-state probabilities of these states. This Markov chain is easy to solve. We have π_{b_1} = π_{b_2} = ⋯ = π_{b_t}, so that the first identity is: π_g + t·π_{b_t} = 1. Furthermore, by considering transitions into and out of g, we obtain: β·π_g = u_t·π_{b_t}. Combining these, we obtain: π_g = u_t/(u_t + tβ), and π_{b_t} = β/(u_t + tβ). Now we have: R(t) = r·π_g = r·u_t/(u_t + tβ) and Q(t) = π_g + π_{b_t} = (u_t + β)/(u_t + tβ). ∎
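
The closed forms can be sanity-checked by direct simulation. The sketch below (our code; the horizon and parameter values are arbitrary choices) estimates R(t) and Q(t) for the policy P(t) under the last-observed-state reward convention of Section 2 and compares them to the formulas above.

import random

def closed_form(r, alpha, beta, t):
    # R(t) = r*u_t/(u_t + t*beta) and Q(t) = (u_t + beta)/(u_t + t*beta), as derived above
    u_t = (alpha / (alpha + beta)) * (1.0 - (1.0 - alpha - beta) ** t)
    return r * u_t / (u_t + t * beta), (u_t + beta) / (u_t + t * beta)

def simulate_P(r, alpha, beta, t, horizon=1_000_000, seed=1):
    rng = random.Random(seed)
    good, last_obs_good, wait = True, True, 0
    reward, plays = 0.0, 0
    for _ in range(horizon):
        if not last_obs_good:
            wait += 1
        if last_obs_good or wait >= t:        # P(t): play in g, or t steps after observing b
            plays += 1
            if last_obs_good:
                reward += r                   # reward of the last-observed state
            last_obs_good, wait = good, 0     # the play reveals the current state
        # the underlying chain evolves every step, played or not
        good = (rng.random() >= beta) if good else (rng.random() < alpha)
    return reward / horizon, plays / horizon

if __name__ == "__main__":
    r, alpha, beta, t = 1.0, 0.2, 0.3, 4
    print(closed_form(r, alpha, beta, t))     # exact (R(4), Q(4))
    print(simulate_P(r, alpha, beta, t))      # Monte-Carlo estimate; should agree closely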

Lemma 2.6.

(Proved in Appendix A) For each arm i, the optimal reward minus penalty of the single-arm policy for arm i is H_i(λ) = max( 0, max_{t ≥ 1} ( R_i(t) − λ Q_i(t) ) ).

The maximizing value t_i satisfies the following:

  1. If λ ≥ r_i, then H_i(λ) = 0, and the optimal single-arm policy never plays the arm.

  2. Otherwise, t_i (and hence H_i(λ)) can be computed in time polynomial in the input size by binary search.
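
A minimal computational sketch of H_i(λ) (ours): restate the closed forms and scan candidate values of t up to a truncation horizon t_max. The truncation is a simplification made by this sketch; the lemma instead locates the maximizer by binary search.

def u(alpha, beta, t):
    return (alpha / (alpha + beta)) * (1.0 - (1.0 - alpha - beta) ** t)

def R(r, alpha, beta, t):
    ut = u(alpha, beta, t)
    return r * ut / (ut + t * beta)

def Q(alpha, beta, t):
    ut = u(alpha, beta, t)
    return (ut + beta) / (ut + t * beta)

def H(r, alpha, beta, lam, t_max=10_000):
    # H(lam) = max(0, max_t R(t) - lam*Q(t)); also return the maximizing t (None = never play)
    best_t, best_val = None, 0.0
    for t in range(1, t_max + 1):
        val = R(r, alpha, beta, t) - lam * Q(alpha, beta, t)
        if val > best_val:
            best_t, best_val = t, val
    return best_val, best_t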

2.3 The BalancedIndex Policy

Though we could now use LPLagrange to solve Whittle’s LP by finding the λ that minimizes λ + Σ_i H_i(λ) (refer to Appendix A.3 for details), our 2-approximation policy will not be based on this approach. For our analysis to work, we must make a subtle but crucial modification: we will instead choose λ to be the sum of the excess rewards of all the single-arm policies, i.e., λ = Σ_i H_i(λ). (Recall that we can interpret λ as a penalty per play, so that in the optimal single-arm policy for arm i, H_i(λ) is the average reward minus penalty.) Note that by Lemma 2.7, this implies 2λ ≥ OPT and 2Σ_i H_i(λ) ≥ OPT. Intuitively, we are forcing the Lagrangean to balance short-term reward (represented by λ, the amortized reward per play) with long-term average reward (represented by Σ_i H_i(λ), the total excess reward per step). Our balancing technique generalizes to many other restless bandit problems (see Sections 4–8).

We first show how to compute this value of λ in polynomial time. We begin by presenting the connection between λ, the excess rewards H_i(λ), and OPT, the value of the optimal solution to (Whittle).

Lemma 2.7.

For any λ ≥ 0, we have: OPT ≤ λ + Σ_i H_i(λ).

Proof.

By Lemma 2.4, part (1), λ + Σ_i H_i(λ) is the objective value of a feasible solution to the dual of (Whittle), which implies the lemma by weak duality. ∎

Lemma 2.8.

For each arm i, H_i(λ) is a non-increasing function of λ.

Proof.

Recall from Lemma 2.4 that H_i(λ) is the optimal value, over the policies P_i(t), of the reward minus penalty R_i(t) − λ Q_i(t) (or 0 if this is negative). For any fixed t, this value decreases as λ increases; hence H_i(λ), a maximum of non-increasing functions of λ, is also a non-increasing function of λ. ∎

Lemma 2.9.

In polynomial time, we can find a λ so that Σ_i H_i(λ) ≥ λ and 2λ ≥ OPT.

Proof.

First note by Lemma 2.8 that Σ_i H_i(λ) is monotonically non-increasing in λ. Therefore, start with λ = Σ_i r_i (for which every H_i(λ) = 0), and repeatedly scale λ down by a small factor until Σ_i H_i(λ) ≥ λ. Note that for any λ, the value of H_i(λ) can be computed in poly-time by Lemma 2.6. At this point, since Σ_i H_i(λ) ≥ λ, Lemma 2.7 gives OPT ≤ λ + Σ_i H_i(λ) ≤ 2Σ_i H_i(λ). Further, the previous (slightly larger) value λ' satisfied Σ_i H_i(λ') < λ', so again by Lemma 2.7, OPT ≤ λ' + Σ_i H_i(λ') < 2λ'; choosing a fine enough scaling factor makes λ' arbitrarily close to λ. ∎
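
One simple way to locate the balanced multiplier numerically (our sketch, not the original procedure) is bisection on the monotone quantity λ − Σ_i H_i(λ); the proof above uses a geometric scale-down instead, but any search to the desired precision suffices.

def balanced_lambda(arms, iters=60, t_max=2_000):
    # arms is a list of (r, alpha, beta); returns lambda with sum_i H_i(lambda) ~ lambda
    def H(r, a, b, lam):
        pig, best = a / (a + b), 0.0
        for t in range(1, t_max + 1):
            ut = pig * (1.0 - (1.0 - a - b) ** t)
            best = max(best, (r * ut - lam * (ut + b)) / (ut + t * b))   # R(t) - lam*Q(t)
        return best
    lo, hi = 0.0, sum(r for r, _, _ in arms)       # at hi, every H_i is 0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if sum(H(r, a, b, mid) for r, a, b in arms) >= mid:
            lo = mid                               # balance not yet reached; increase lambda
        else:
            hi = mid
    return lo

# example: balanced_lambda([(1.0, 0.2, 0.3), (0.5, 0.05, 0.1)])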

2.3.1 Index Policy

We start with the value of λ from Lemma 2.9. The policy only works with the subset S of arms such that for i ∈ S, we have H_i(λ) > 0. For this λ, the solution to LPLagrange yields one policy P_i(t_i) of value H_i(λ) for each arm i ∈ S (see Lemma 2.4). Recall that if an arm was last observed in state s some t steps ago, then its state is denoted (s, t). We call an arm in state (g, t) good; in state (b, t) for t ≥ t_i ready; and in state (b, t) for t < t_i bad. The policy is shown in Figure 4.

BalancedIndex Policy for Feedback MAB. Consider only arms with H_i(λ) > 0; denote these as set S. Play arms in S according to the following priority scheme: 1. Exploit: Play any arm in a good state (g, t). 2. If condition (1) does not hold: (a) Explore: Play any arm in a ready state (b, t) with t ≥ t_i. (b) Idle: If no arm is good or ready (all arms bad), do not play at this step.

Figure 4: The BalancedIndex Policy for Feedback MAB.

Note that, the way the scheme works, at most one arm can be in a good state (g, t) at any time step, and if such an arm exists, this arm is played at the current step (and in future steps until it switches out of this state). The above can be thought of as executing the single-arm policies P_i(t_i) for the arms i ∈ S independently and, in case of simultaneous attempts to play, resolving conflicts according to the above priority scheme.

Though the above policy is not written as an index policy, it is equivalent to the following index: there is a dummy arm that does not yield reward on playing. If H_i(λ) = 0, the index of every state of arm i is below that of the dummy arm. For arms with H_i(λ) > 0, the index of the good states (g, t) is the highest; the index of the ready states (b, t) with t ≥ t_i is next highest (and above the dummy arm); and the index of the bad states (b, t) with t < t_i is below the dummy arm.
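
A single decision step of BalancedIndex is easy to state in code (our sketch; the bookkeeping of observations and the computation of λ and the t_i are assumed to be done as in the previous sketches):

def balanced_index_step(obs, in_S, t_play):
    # obs[i] = (s, t): arm i was last observed in state s ('g' or 'b') t steps ago
    # in_S[i]: whether H_i(lambda) > 0; t_play[i]: the exploration delay t_i of arm i
    good  = [i for i, (s, t) in enumerate(obs) if in_S[i] and s == 'g']
    ready = [i for i, (s, t) in enumerate(obs) if in_S[i] and s == 'b' and t >= t_play[i]]
    if good:
        return good[0]      # 1. exploit
    if ready:
        return ready[0]     # 2(a). explore
    return None             # 2(b). idle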

2.3.2 Analysis

We now prove that the BalancedIndex policy is in fact a 2-approximation. The proof is based on the fact that the Lagrangean and the excess rewards give us a way of accounting for the average reward, together with the fact that, by Lemma 2.7 and the balance condition, 2λ ≥ OPT and 2Σ_i H_i(λ) ≥ OPT, which links the reward of our policy to the LP optimum.

Theorem 2.10.

The BalancedIndex policy is a 2-approximation to Feedback MAB. Furthermore, this policy can be computed in polynomial time.

Proof.

Recall that the reward of the optimal single-arm policy P_i(t_i) is R_i(t_i) = H_i(λ) + λ·Q_i(t_i), so that this reward can be accounted as H_i(λ) per step plus λ per play. We use this amortization of rewards to show that the average reward of our index policy is at least λ ≥ OPT/2.

Focus on any arm i ∈ S; we call a step blocked for this arm if the arm is ready for play – its state is (b, t) with t ≥ t_i – but some other arm is played at the current step. Consider only the time steps which are not blocked for arm i. For these time steps, the arm behaves as follows: it is continuously played in state (g, 1). Then it transitions to state (b, 1) and moves in t_i − 1 steps to state (b, t_i). After this the arm might be blocked, and the next state that is not blocked is (b, t) for some t ≥ t_i, at which point the arm is played. Using the formula for R_i(t) from Lemma 2.5, and since u_t ≥ u_{t_i} for t ≥ t_i, we have

r_i·u_t / (u_t + t_i·β_i) ≥ r_i·u_{t_i} / (u_{t_i} + t_i·β_i) = R_i(t_i),

which implies that the per-step reward of this single-arm policy for arm i, restricted to the non-blocked time steps, is at least the per-step reward of the optimal single-arm policy P_i(t_i). Therefore, for these non-blocked steps, the reward we get is at least H_i(λ) per step, and at least λ per play.

Now, on steps where no arm is played, none of the arms is blocked by definition, so our amortization yields a per-step reward of at least Σ_{i ∈ S} H_i(λ) = Σ_i H_i(λ) ≥ λ. On steps when some arm is played, the arm that is played is by definition not blocked, so we get a reward of at least λ for this step. Hence the average reward is at least λ ≥ OPT/2, which completes the proof. ∎

2.3.3 Alternate Analysis

The above analysis is very intuitive. We now present an alternative way to analyze the policy that leads to a more generalizable technique, via a Lyapunov (potential) function argument. Recall from Lemma 2.4 the structure of the optimal dual solution; further recall that λ = Σ_i H_i(λ). Define the potential for each arm at any time as follows:

Definition 4.

If arm moved to state some steps ago (), the potential is . In the state the potential is . Recall that is the optimal dual variable in Whittle-Dual.

Let Φ denote the total potential (the sum of the per-arm potentials) at any step, and let R denote the total reward accrued until that step. Define the function L = R + Φ, and let ΔR and ΔΦ denote the expected one-step changes in R and Φ respectively.

Lemma 2.11.

L is a Lyapunov function for the BalancedIndex policy, i.e., its expected one-step increase is at least λ. Equivalently, at any step: E[ΔR + ΔΦ] ≥ λ.

Proof.

At a given step, suppose the policy does nothing, then all arms are “not ready”. The total increase in potential is precisely

On the other hand, suppose that the policy plays arm , which has last been observed in state and has been in that state for steps. With probability the observed state is and the change in reward and the change in potential is . With probability the observed state is and the change in potential is (and there is no change in reward). Thus in this case since , and , we have:

The penultimate inequality follows from Lemma 2.4, part (3). Note that the potentials of arms not played cannot decrease, so that the first inequality is valid.

Finally supposing the policy plays an arm which was last observed in state and played in the last step, with probability the increase in reward is and the potential is unchanged. With probability the potential will decrease by . Therefore in this case, by Lemma 2.4, part (4).

By their definition, the potentials are bounded independently of the time horizon, so by a telescoping summation the above lemma implies that the long-run average reward is at least λ ≥ OPT/2. This proves Theorem 2.10.

Gap of Whittle’s LP.

The following theorem shows that our analysis is almost tight (considering that our 2-approximation is measured against Whittle’s LP).

Theorem 2.12.

(Proved in Appendix A) The gap of Whittle’s LP is arbitrarily close to 2.

3 Analyzing the Whittle Index for Feedback MAB

Before generalizing our 2-approximation algorithm to a larger subclass of restless bandit problems, we explore the connection between our analysis and the well-known Whittle index used in practice. This section can be skipped without losing continuity of the paper.

A well-studied index policy for restless bandit problems is the Whittle index [52]. In the context of Feedback MAB, this index has been independently studied by Le Ny et al. [40] and subsequently by Liu and Zhao [37]. Both these works give closed-form expressions for this index and show near-optimal empirical performance. Our main result in this section justifies this empirical performance by showing that a simple but very natural modification of the index, made in order to favor myopic exploitation, yields a 2-approximation. The modification simply involves giving additional priority to arms in a good state whose myopic expected next-step reward is at least a threshold value.

3.1 Description of the Whittle Index

Defined in general, the Whittle index of a state is the largest penalty per play λ such that the optimal single-arm policy is indifferent between playing and not playing in that state. In our specific problem, the current state of each arm is captured by the tuple (s, t): the arm was last seen in state s (good or bad) exactly t steps ago. The Whittle index is a non-negative real number computed as follows: using the notation from Section 2.2, for any penalty per play λ, there is a single-arm policy that maximizes the average reward minus penalty (excess reward) over the infinite horizon. When λ is sufficiently large, the optimal policy never plays; when λ = 0, the optimal policy plays in every state. As λ is decreased from a large value, at some value of λ the decision in state (s, t) changes from “not play” to “play”. The Whittle index of (s, t) is precisely this value of λ. The Whittle index policy always plays the arm whose current state has the highest Whittle index (Fig. 5).

Whittle Index Policy: Play the arm whose current state has the highest Whittle index.

Figure 5: The Whittle Index Policy.
Remarks.

The Whittle index is strongly decomposable, i.e., it can be computed separately for each arm. Further, we have defined λ as a penalty (or amortized reward) per play, while Whittle defines it as a reward for not playing (which he terms the subsidy for passivity); it is easy to see that both formulations are equivalent. Finally, for Feedback MAB, it can be shown [40, 37] that for any state (s, t), there is a unique λ at which the decision switches between “play” and “not play”, i.e., the decision is monotone in λ. Strictly speaking, the Whittle index is defined only for such systems (termed indexable by Whittle [52]); we define this issue away by insisting that the index is the largest value of λ at which a switch happens.
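
Numerically, the Whittle index of a state (b, t) can be recovered by bisection over the penalty λ, using the single-arm optimization of Section 2.2: the index is the largest λ for which the optimal single-arm policy still plays in (b, t). The sketch below (ours, with a truncation horizon as a simplifying assumption) does exactly this; [40, 37] instead give closed-form expressions.

def best_play_delay(r, alpha, beta, lam, t_max=10_000):
    # delay t attaining max_t R(t) - lam*Q(t); None means never playing is optimal
    pig, best_t, best_val = alpha / (alpha + beta), None, 0.0
    for t in range(1, t_max + 1):
        ut = pig * (1.0 - (1.0 - alpha - beta) ** t)
        val = (r * ut - lam * (ut + beta)) / (ut + t * beta)   # R(t) - lam*Q(t)
        if val > best_val:
            best_t, best_val = t, val
    return best_t

def whittle_index_bad(r, alpha, beta, t, iters=60):
    # largest lambda for which the optimal single-arm policy plays in state (b, t)
    lo, hi = 0.0, r
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        t_star = best_play_delay(r, alpha, beta, mid)
        if t_star is not None and t_star <= t:     # the policy plays in (b, t) under penalty mid
            lo = mid
        else:
            hi = mid
    return lo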

We present an explicit connection of Whittle’s index to LPLagrange.

Lemma 3.1.

(Proved in Appendix A) Recall the notation and from Section 2.2. The following hold for :

  1. for all states where and .

  2. , and for all .

  3. , and is a monotonically non-decreasing function of .

Though Whittle’s index is widely used, it is not clear how to analyze it since it leads to complicated priorities between arms. We now show that our balancing technique also implies an analysis for a slight but non-trivial modification to Whittle’s index.

3.2 The Threshold-Whittle Policy

We now show that modifying the index slightly, to exploit the myopic next-step reward in good states, yields a 2-approximation. Note that the myopic next-step reward of an arm i in state (g, t) is precisely r_i·v_t. The modification essentially favors exploiting such a “good” state if this myopic reward is at least a certain threshold value. In particular, we analyze the policy Threshold-Whittle shown in Figure 6, where the threshold is determined by the balanced multiplier λ of Section 2.3, i.e., the value at which λ = Σ_i H_i(λ).

Threshold-Whittle. At any time step: if there is an arm i in a good state (g, t) that meets the threshold condition, then play arm i; else play the arm with the highest Whittle index.

Figure 6: Policy Threshold-Whittle. It exploits arm i if the myopic reward in state (g, t) is at least the threshold.

Note that the above policy can be restated as playing the arm with the highest modified index, which is computed as follows: for arm i, if state (g, t) meets the threshold condition, the modified index for that state is infinity; otherwise, the modified index is the same as the Whittle index.
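
In code, the modification is a thin wrapper around any routine that computes the Whittle index (our sketch; the exact threshold value is the one specified in the text above and is passed in as a parameter):

def threshold_whittle_step(obs, params, whittle_index, threshold):
    # obs[i] = (s, t); params[i] = (r, alpha, beta); whittle_index(i, s, t) is arm i's index
    def v(alpha, beta, t):                       # Pr[good | observed good t steps ago]
        pig = alpha / (alpha + beta)
        return pig + (1.0 - pig) * (1.0 - alpha - beta) ** t
    for i, (s, t) in enumerate(obs):             # exploit any good arm that clears the threshold
        r, a, b = params[i]
        if s == 'g' and r * v(a, b, t) >= threshold:
            return i
    return max(range(len(obs)), key=lambda i: whittle_index(i, *obs[i]))   # plain Whittle rule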

Theorem 3.2.

Threshold-Whittle is a 2-approximation for Feedback MAB. Here, λ satisfies the balance condition λ = Σ_i H_i(λ) (refer to Section 2.3).

3.3 Proof of Theorem 3.2

We now prove the above result by modifying our analysis of the BalancedIndex policy (from Figure 4). Recall that S is the set of arms with H_i(λ) > 0 in the optimal solution to Whittle-Dual. For such arms, t_i is the first time instant at which the corresponding dual constraint is tight. For an arm i ∈ S, state (s, t) is good if s = g; ready if s = b and t ≥ t_i; and bad otherwise. The index policy from Figure 4 favors good over ready states, and does not play any arm in bad states.

Claim 3.3.

For any arm , exactly one of the following is true for Whittle-Dual and LPLagrange.

  1. The constraint is first tight at . Then, and . Further, and .

  2. The constraint is not tight for any . Then, for all , and .

Proof.

The optimal solution to LPLagrange finds the policy for every arm with . Therefore, by Lemma 3.1, we must have , and for all . Furthermore, since the variable in the optimal solution to LPLagrange is first non-zero at , this implies the constraint is first tight at by complementary slackness (Lemma 2.4). Further, if this constraint is tight at , since is monotonically increasing, the constraint is feasible for all only if . Finally, follows from Lemma 3.1.

Suppose now that is not tight for any . Then, by complementary slackness, we have for all , which implies for all . Therefore, the policy never plays arm . This implies for all . Since the excess reward of is zero, we have . (This can also be shown by complementary slackness.) ∎

3.3.1 Types of Arms

We next classify the arms as follows. In Claim 3.3, let the arms satisfying the first condition of the claim be denoted Type (1), and the remaining arms, which satisfy the second condition, be denoted Type (2). Note that Type (1) arms are precisely the arms in the set S of Fig. 4, so the BalancedIndex policy only plays Type (1) arms.

Type (1): Arms in S.

We consider the behavior of Threshold-Whittle restricted to just these arms. Since the Whittle index is monotonically increasing in t, by Claim 3.3 we have the following for the states used by the policy of Fig. 4: if the arm is ready, its Whittle index is at least λ; if the arm is bad, its index is at most λ; and finally, if the arm is good, its modified Whittle index is infinity.

Therefore, Threshold-Whittle confined to these arms gives priority to good over ready over bad arms. The only difference with the policy in Fig. 4 is that instead of idling when all arms are bad, the policy Threshold-Whittle will play some bad arm. We now show that this is better than idling.

Claim 3.4.

Threshold-Whittle executed just over the Type (1) arms yields average reward at least λ.

Proof.

Consider the alternate analysis presented in Section 2.3.3. The Index policy from Fig. 4 does not play an arm in bad state, and achieves change in potential of exactly . All we need to show is that if the arm is played instead, the expected change in potential is still at least . The rest of the proof is the same as that of Lemma 2.11. Suppose the arm is played after steps. The expected change in potential is: . We further have by definition of that . We therefore have . Since is a concave function of with , the above implies that for every , we must have . Therefore, . ∎

Type (2): Arms not in S.

The only catch now is that Threshold-Whittle can sometimes play a Type (2) arm in place of a Type (1) arm. For such arms, we count their reward and ignore the change in potential.

Lemma 3.5.

In Threshold-Whittle, if a Type (2) arm preempts the play of a Type (1) arm, then either the reward from the former is at least λ, or the increase in potential of the latter is at least λ.

Proof.

Suppose that for a Type (2) arm, the myopic reward in its current good state is at least the threshold. Denote such a state as nice; a nice Type (2) arm has modified index of infinity