An Asymptotically Optimal Index Policy
for Finite-Horizon Restless Bandits
We consider the restless multi-armed bandit (RMAB) problem with a finite horizon and multiple pulls per period. Leveraging the Lagrangian relaxation, we approximate the problem with a collection of single-arm problems. We then propose an index-based policy that uses optimal solutions of the single-arm problems to index individual arms, and prove that it is asymptotically optimal as the number of arms tends to infinity. We also use simulation to show that this index-based policy performs better than state-of-the-art heuristics in various problem settings.
Keywords: Restless Bandits; Constrained Control Process
We consider the restless multi-armed bandit (RMAB) problem with a finite horizon and multiple pulls per period. In the RMAB, we have a collection of “arms”, each of which is endowed with a state that evolves independently. If the arm is “pulled” or “engaged” in a time period then it advances stochastically according to one transition kernel, and if not then it advances according to a different kernel. Rewards are generated with each transition, and our goal is to maximize the expected total reward over a finite horizon, subject to a constraint on the number of arms pulled in each time period.
The RMAB generalizes the multi-armed bandit (MAB) by allowing arms that are not engaged to change state and by allowing multiple pulls per period. This extends the applicability of the MAB problem to a broader range of settings, including the submarine tracking problem and the project assignment problem, as well as contemporary applications such as the following:
Facebook displays ads in the suggested posts section every time its users browse their personal pages. Among the ads that have been shown, some are known to attract more clicks than others. But there are also many ads which have yet to be shown and they may attract even more clicks. Given that the slots for display are limited, a policy is required to select ads to maximize total clicks.
In a multi-stage clinical trial, a medical group starts with a number of new treatments and an existing treatment with reliable performance. In each stage, a few treatments are selected from the pool to test, with the goal of identifying, with high confidence, the new treatments that perform better than the existing one. A strategy is required to select which treatments to test at every stage to most effectively support their judgment at the end of the trial.
A data analyst wishes to label a large number of images using crowdsourced effort from low-cost but potentially inaccurate workers. Each label given by the crowdworkers comes with a cost and the analyst has a limited budget. Hence she needs to carefully assign tasks so as to maximize the likelihood of correct labeling.
The infinite-horizon MAB with one pull per time period is famously known to have a tractable-to-compute optimal policy, called the Gittins index policy. This policy is appealing because it can be computed by considering the state space of only a single arm, making it computationally tractable for problems with many arms. The policy loses its optimality, however, when the problem is modified in any of the following ways: allowing arms that are not engaged to change state; moving to a finite horizon; or allowing multiple pulls per period. Thus, the Gittins index does not apply to our problem setting. Moreover, while optimal policies for RMABs with multiple pulls per period or finite horizons are characterized by the dynamic programming equations, the so-called ‘curse of dimensionality’ prevents computing them: the dimension of the state space grows linearly with the number of arms, so the number of joint states grows exponentially.
Although the RMAB is not known to have a computable optimal policy, Whittle proposed a heuristic called the Whittle index for the infinite-horizon RMAB with multiple pulls per period, which is well-defined when arms satisfy an indexability condition. This policy is derived by considering a Lagrangian relaxation of the RMAB in which the constraint on the number of arms pulled is replaced by a penalty paid for pulling an arm. An arm’s Whittle index is then the penalty that makes a rational player indifferent between pulling and not pulling that arm. The Whittle index policy then pulls those arms with the highest Whittle indices. Appealingly, the Whittle index and the Gittins index are identical when applied to the MAB problem with a single pull per period.
Whittle further conjectured that if the number of arms and the number of pulls in each time period go to infinity at the same rate in an infinite-horizon RMAB, then the Whittle index policy is asymptotically optimal when arms are indexable. [14, 15] proved Whittle’s conjecture under a difficult-to-verify condition: that the fluid approximation has a globally asymptotically stable equilibrium point. This condition was shown to hold when each arm’s state space has at most states, but this condition does not hold in general and  provides a counterexample with states.
Our contribution in this paper is to (1) create an index policy for finite-horizon RMABs with multiple pulls per period, and (2) show that it is asymptotically optimal in the same limit considered by Whittle. Like the Whittle index, our approach is computationally appealing because it requires considering the state space of only a single arm, and its computational complexity does not grow with the number of arms. Unlike the Whittle index, our index policy does not require an indexability condition to hold in order to be well-defined, and in contrast with [14, 15] our proof of asymptotic optimality holds regardless of the number of states. We further demonstrate our index policy numerically on problems from the literature that can be formulated as finite-horizon RMABs, and show that it provides finite-sample performance that improves over the state of the art.
In addition to building on [16, 14, 15], our work builds on the literature on weakly coupled dynamic programs (WCDPs), which itself builds on RMABs. Indeed, at the end of his paper, Whittle pointed out that his relaxation technique can be applied to a more general class of problems in which sub-problems are linked by constraints on actions but are otherwise independent. Hawkins, in his thesis, formally termed these problems (with a more general type of constraint) WCDPs and proposed a general decoupling technique. He also proposed index-based policies for solving both finite- and infinite-horizon WCDPs and proved that his policy, when applied to the infinite-horizon multi-armed bandit (MAB) problem, is equivalent to the Gittins index policy. Our work is similar to Hawkins’ in that we consider Lagrange multipliers of the same functional form when computing indices. However, Hawkins does not specify what the coefficients of the function should be, or give a tie-breaking rule for the case when multiple arms have the same index. We obtain an asymptotically optimal policy by addressing both of these issues. The differences will be discussed in greater detail after we formally introduce our index policy.
Another major work on WCDPs shows that the approximate dynamic programming (ADP) relaxation is tighter than the Lagrangian relaxation but is also computationally more expensive. It gives necessary and sufficient conditions for the Lagrangian relaxation to be tight and proves that the optimality gap is bounded by a constant when the Lagrange multipliers are allowed to be state-dependent. This last result implies that the per-arm gap goes to zero as the number of arms grows. We achieve a similar result in our paper by showing that the per-arm reward of our index-based heuristic policy converges to the per-arm reward of the Lagrangian bound, even though our Lagrange multipliers are not state-dependent. We conjecture that this is because our constraint, which is a function of the action only and not the state, is less general than the constraints considered in WCDPs, which depend on both the action and the state. Moreover, the focuses of the two works differ: while our work focuses on offering an asymptotically optimal heuristic policy, that work examines the ordering and tightness of different bounds. The heuristic proposed there is based on an ADP technique and also differs from our index-based policy.
Other work on WCDPs includes a proposal for an even tighter bound that incorporates information relaxation of the non-anticipativity constraints in addition to the existing relaxation methods. A further study considers two classes of large-scale WCDPs in which the state and action spaces of each sub-problem also grow exponentially, and uses an ADP technique to approximate the value functions of individual sub-MDPs in addition to employing Lagrangian relaxation for the overall problem.
The remainder of this paper is outlined as follows: Section 2 formulates the problem. Section 3 discusses the Lagrangian relaxation of the problem. Section 4 states our index-based policy and provides computation methods. Section 5 gives a proof of asymptotic optimality. Section 6 numerically evaluates our index policy. Section 8 concludes the paper.
2 Problem Description and Notation
We consider an MDP
which is created by a collection of sub-processes . The sub-processes are independent of each other except that they are bound by the joint decisions on their actions at each time step. These sub-processes are also referred to as arms in the bandit literature and shall be indexed by . Following a standard construction for MDPs, both the larger joint MDP and the sub-processes will be constructed on the same measurable space . Random variables on this measurable space will correspond to states, actions, rewards, and each policy will induce a probability measure over this space.
We describe the MDP to consider formally as follows:
The time horizon .
The state space is the cross product of sub-process state space . is assumed to be finite. We use to denote an element in and when the state is random. We also use to emphasize that the state is at time . Likewise, we use to denote an element in and or when it is random.
The action space is the cross product of sub-processes action space , which is set equal to . We use to denote a generic element of , and when it is random. We use to denote a generic element in and to denote an action when it is random. In the context of bandit problems, is called “pulling” an arm (sub-process).
The reward function for each . , where is the reward obtained by a sub-process when action is taken in state at time . We assume rewards are non-negative and finite.
The transition kernel , where is the probability of a sub-process transitioning from to if action is taken, i.e., . The product form implies that the sub-processes evolve independently. We also point out that RMABs differ from MABs in that MABs require while RMABs allow . Since we are considering both cases, we do not restrict the value of .
Next we describe the set of policies for our MDP problem. Since the state and action spaces defined above are finite, it is sufficient to consider the set of Markov policies . Define a policy as a function that determines the probability of choosing action in state at time , that is, . Subsequently we have , . A policy and the transition kernel together define a probability distribution on all possible paths of the process . Starting at a fixed state , i.e., , we have the conditional distributions of and defined recursively by and .
The MDP we are considering allows exactly sub-processes to be set active at each time step. Hence a feasible policy, , has to satisfy that , . Here we use as an operator that sums all the elements in a vector.
The objective of our MDP is as follows:
Since we will discuss other MDPs in the process of solving this one, (2.1) will be referred to as the original MDP in the rest of the paper to avoid confusion. For convenience, we summarize our notation in Appendix A. We note that the original MDP (2.1) suffers from the ‘curse of dimensionality’, and hence solving it is computationally intractable. In the remainder of the paper we seek to build a computationally feasible index-based heuristic with a performance guarantee.
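As a rough illustration of the model described in this section, the following Python sketch simulates one period of the joint process. The names `is_feasible`, `step`, `kernel`, and `reward` are illustrative assumptions and not notation from the paper; sub-process states are encoded as integers, and `kernel[a][s]` is assumed to hold the row of transition probabilities out of state `s` under action `a`.

```python
import random


def is_feasible(actions, m):
    """A joint action is feasible when exactly m sub-processes are active."""
    return all(a in (0, 1) for a in actions) and sum(actions) == m


def step(states, actions, kernel, reward, t, rng):
    """Advance every sub-process independently for one period.

    kernel[a][s] is a list of next-state probabilities (assumed layout).
    reward(t, s, a) returns the one-period reward of a single sub-process.
    Returns (next_states, total_reward).
    """
    next_states, total = [], 0.0
    for s, a in zip(states, actions):
        total += reward(t, s, a)
        probs = kernel[a][s]
        # Each sub-process draws its next state from its own kernel row.
        next_states.append(rng.choices(range(len(probs)), weights=probs)[0])
    return next_states, total
```

Because each sub-process draws its next state separately, the joint kernel is the product of the single-process kernels, and the only coupling between sub-processes is the feasibility check on the joint action.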
3 Lagrangian Relaxation and Upper Bounds
In this section we discuss the Lagrangian relaxation of the original MDP and the corresponding single process problems. These single process problems together with the Lagrange multipliers form the building blocks of our index-based policy, which will be formally introduced in Section 4. Lagrangian relaxation removes the binding constraints and allows the problem to be decomposed into tractable MDPs. It works as follows: for any , we consider an unconstrained problem whose objective is obtained by augmenting the objective in (2.1):
This unconstrained problem forms the Lagrangian relaxation of (2.1), and has the following property:
For any , is an upper bound to the optimal value of the original MDP.
This Lagrangian relaxation then decomposes into smaller MDPs, which we can easily solve to optimality. To elaborate on this idea of decomposition, we construct a sub-MDP problem based on tuple . Again we consider only the set of Markov policies, , for this problem. Similarly a policy is a function that determines the probability of choosing action in state at time , i.e., . The sub-MDP starts at a fixed state . Subsequently we can define distributions of and under in a similar manner as we did for and in the previous section. The objective of the sub-MDP is defined as follows:
We are now ready to present the decomposition of the Lagrangian relaxation.
The optimal value of the relaxed problem satisfies
Prior work also gave a proof of Lemma 2, and we again provide a different proof in Appendix C. Since the state space of the sub-MDP is much smaller, we can solve it directly by using backward induction on the optimality equations. The existence of an optimal Markov deterministic policy is implied by the finiteness of the state and action spaces of the sub-MDP. Let be the set of optimal Markov deterministic policies of the sub-MDP for a given . The relaxed problem can be solved by combining the solutions of individual sub-MDPs, that is, we can construct an optimal policy of the relaxed problem by setting , where is an element in . Moreover, is convex and piecewise linear in .
4 An Index Based Heuristic Policy
Our index based heuristic policy assigns an index to each sub-process, based upon its state and current time. At each time step, we set active the m sub-processes with the highest indices. Before carrying out the process of sequential decision-making, our index policy calls for pre-computation of 1) , as defined in Section 3; 2) a set of indices, , that will later be used for decision-making at every time step; 3) an optimal policy for the sub-MDP problem in (3.2). In the first part of this section we discuss how we carry out such computations.
4.1.1 Dual optimal
To compute this sub-gradient, we compute a policy in and then use exact computation or simulation with a large number of replications to compute . To compute a policy in , we first compute the value function of sub-MDP . We accomplish this using backward induction :
Then, any and all policies in are constructed by determining for each and the action whose one-step lookahead value is equal to , and then setting for this . For those and for which both actions have one-step lookahead values equal to , one may set for either such action. Thus, the cardinality of is 2 raised to the power of the number of for which the one-step lookahead values for playing and not playing are tied.
When we construct a policy in for the purpose of computing a sub-gradient of , we choose to play in those with tied one-step lookahead values. While our sub-gradient descent algorithm would converge for other choices, making this choice better supports computation of indices in Section 4.1.2.
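The backward induction and tie-breaking convention described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `solve_sub_mdp`, the array layouts `P[a][s][s2]` and `r[t][s][a]`, and the zero terminal value are all illustrative choices, and the multiplier `lam[t]` is assumed to enter as a penalty subtracted from the reward of the active action.

```python
def solve_sub_mdp(P, r, lam):
    """Backward induction for a single sub-process with activation penalties.

    P[a][s][s2]: transition probability under action a (0 passive, 1 active).
    r[t][s][a]:  one-period reward; lam[t]: penalty charged when a = 1.
    Returns (V, policy): V[t][s] is the optimal value from period t onward,
    policy[t][s] is the chosen action in {0, 1}.
    """
    T, n = len(lam), len(P[0])
    V = [[0.0] * n for _ in range(T + 1)]  # V[T] is the zero terminal value
    policy = [[0] * n for _ in range(T)]
    for t in reversed(range(T)):
        for s in range(n):
            q = []
            for a in (0, 1):
                cont = sum(P[a][s][s2] * V[t + 1][s2] for s2 in range(n))
                q.append(r[t][s][a] - lam[t] * a + cont)
            # Ties between the two lookahead values go to the active action.
            policy[t][s] = 1 if q[1] >= q[0] else 0
            V[t][s] = max(q)
    return V, policy
```

The cost is one pass over periods, states, and next states for a single sub-process, so it does not grow with the number of sub-processes.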
Define vector to be , that is, the vector with the element replaced by . We define the index of state at time as
Instead of computing the entire set , we only need to compute a policy in using the method discussed in Section 4.1.1, i.e., always choosing the active action when there are ties. Intuitively, this index is the maximum price we are willing to pay to set a sub-process active in state at . By leveraging the monotonicity of optimal actions with respect to rewards, as shown in Lemma 8 in Appendix G, we compute via bisection search in interval , where upper bounds the largest possible value of . For example, we can set as when (we show in Appendix F that cannot be greater than this value in this case). We pre-compute the set before running the actual algorithm.
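The bisection search for an index can be sketched in Python as follows. This is a hedged illustration only: the names `active_is_optimal` and `index_by_bisection`, the array layouts, and the tolerance are assumptions, and the search interval `[lo, hi]` is left as an input because the bounds used in the paper are not reproduced here. The sketch replaces the multiplier at period `t` by a trial price and checks whether the active action is still optimal, with ties going to the active action.

```python
def active_is_optimal(P, r, lam, t, s):
    """True if the active action is (weakly) optimal in state s at period t.

    P[a][s][s2] and r[t][s][a] follow an assumed layout; lam[u] is the
    penalty for the active action at period u.
    """
    n, T = len(P[0]), len(lam)
    V = [0.0] * n  # value after the last period
    for u in reversed(range(t + 1, T)):
        V = [max(r[u][x][a] - lam[u] * a
                 + sum(P[a][x][y] * V[y] for y in range(n))
                 for a in (0, 1))
             for x in range(n)]
    q = [r[t][s][a] - lam[t] * a + sum(P[a][s][y] * V[y] for y in range(n))
         for a in (0, 1)]
    return q[1] >= q[0]


def index_by_bisection(P, r, lam, t, s, lo, hi, tol=1e-9):
    """Largest price in [lo, hi] at which activating (t, s) stays optimal."""
    def ok(beta):
        trial = list(lam)
        trial[t] = beta  # replace only the period-t multiplier
        return active_is_optimal(P, r, trial, t, s)

    if not ok(lo):
        return lo  # index lies below the interval; clamp to lo
    if ok(hi):
        return hi  # index lies above the interval; clamp to hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if ok(mid):
            lo = mid
        else:
            hi = mid
    return lo
```

The search relies on the monotonicity noted in the text: once the price is high enough to make the passive action strictly better, raising it further cannot make the active action optimal again.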
4.1.3 Occupation measure and its corresponding optimal policy
Our tie-breaking policy involves constructing an optimal Markov policy for the sub-MDP such that , . The existence of is shown in Appendix E. To compute , we borrow the idea of occupation measure . Define occupation measure, , induced by a policy to be the probability of being in state and taking action given time under . Subsequently can be solved by the following linear program (LP):
where , . The first constraint ensures that . The second constraint ensures flow balance. The third constraint shows that we start at state . The second and third constraints together imply , i.e., that is a probability distribution for each t. The fourth and fifth constraints ensure that is a valid probability measure.
Let be an optimal solution to , can then be constructed by
Here we also make an observation that can be computed by solving (4.3) with replaced by .
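To make the notion of an occupation measure concrete, the following Python sketch computes the measure induced by a given Markov policy through forward propagation, and then recovers activation probabilities from a measure. It does not solve the linear program above; it only illustrates the flow-balance relationship between consecutive periods. The names `occupation_measure` and `policy_from_measure`, the array layouts, and the convention of defaulting to the passive action in unreachable states are assumptions made for illustration.

```python
def occupation_measure(P, policy, s1):
    """Occupation measure rho[t][s][a] = Prob(state s, action a at period t).

    P[a][s][s2]: transition probability under action a (assumed layout).
    policy[t][s]: probability of the active action in state s at period t.
    s1: the fixed initial state.
    """
    T, n = len(policy), len(P[0])
    dist = [1.0 if s == s1 else 0.0 for s in range(n)]  # start at s1
    rho = []
    for t in range(T):
        rho_t = [[dist[s] * (1 - policy[t][s]), dist[s] * policy[t][s]]
                 for s in range(n)]
        rho.append(rho_t)
        # Flow balance: mass entering s2 equals mass leaving all (s, a).
        dist = [sum(rho_t[s][a] * P[a][s][s2]
                    for s in range(n) for a in (0, 1))
                for s2 in range(n)]
    return rho


def policy_from_measure(rho):
    """Recover activation probabilities from an occupation measure.

    States with zero mass are assigned the passive action (an assumption).
    """
    out = []
    for rho_t in rho:
        row = []
        for passive, active in rho_t:
            tot = passive + active
            row.append(active / tot if tot > 0 else 0.0)
        out.append(row)
    return out
```

Summing `rho[t][s][1]` over states gives the probability that a single sub-process is active at period `t`, which is the quantity the first constraint of the linear program pins down.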
4.2 Index policy
Let be the indices associated with the K sub-processes at time . We define to be the largest value in such that at least sub-processes have indices of at least . Our index policy then sets the sub-processes with indices strictly greater than active, and those with indices strictly less than inactive. When more than sub-processes have indices greater than or equal to , a tie-breaking rule is needed. Our tie-breaking rule allocates remaining resources (the remaining sub-processes to be set active) across tied states according to the probability distribution induced by over at time . This tie-breaking ensures asymptotic optimality of the index policy as it enforces that the fraction of sub-processes in each state is equal to the distribution induced by in the limit. This idea shall become clear in Section 5 where the proof of asymptotic optimality is presented.
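The threshold step of the index policy can be sketched in Python as follows. The function name `split_by_threshold` and its return layout are illustrative assumptions; the sketch takes the threshold to be the m-th largest index, which is the largest value such that at least m sub-processes have an index at least that large, and it assumes 1 <= m <= K.

```python
def split_by_threshold(indices, m):
    """Split sub-processes into surely-active and tied groups.

    indices[k] is the index of sub-process k at the current period.
    Returns (threshold, sure_active, tied, remaining), where `remaining`
    is the number of activations still to be assigned among `tied`.
    """
    threshold = sorted(indices, reverse=True)[m - 1]  # m-th largest index
    sure_active = [k for k, b in enumerate(indices) if b > threshold]
    tied = [k for k, b in enumerate(indices) if b == threshold]
    remaining = m - len(sure_active)
    return threshold, sure_active, tied, remaining
```

When `remaining` equals the number of tied sub-processes, all of them are activated and no tie-breaking is needed; otherwise the remaining activations are distributed by the tie-breaking rule.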
To further illustrate how our tie-breaking rule works, let be the set of states occupied by the tied sub-processes, we allocate
fraction of the remaining resources to each tied state, when , where is a solution to (4.3). In cases when , we break ties according to the number of sub-processes that are currently in each of the tied states.
We then use the function Rounding(total, frac, avail) in Algorithm 2 to deal with the situations where the products of the desired fractions and the remaining resources are not integers. Here total represents the number of remaining resources, frac is a vector of fractions to be allocated to each tied state, and avail is a vector of the number of sub-processes in each tied state. The function also takes care of the corner cases in which the number of sub-processes in a tied state is less than the number of resources we would like to assign to it according to the fraction in (4.5). We note the following property of this function Rounding, which we will rely on in our proof in Section 5.
When total, avail, frac satisfy , the output vector satisfies for all .
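Since Algorithm 2 is not reproduced in this section, the following Python sketch is only a hypothetical stand-in with the same signature, not the authors' procedure. It assumes a largest-remainder scheme: each tied state first receives the floor of its target share, capped by the number of sub-processes available there, and leftover units go one at a time to the state with the largest unmet target that still has room.

```python
import math


def rounding(total, frac, avail):
    """Integer allocation of `total` activations across tied states.

    Hypothetical stand-in for the paper's Rounding(total, frac, avail).
    frac[i] is the desired fraction for tied state i; avail[i] is the
    number of sub-processes currently in that state.
    """
    alloc = [min(math.floor(total * f), c) for f, c in zip(frac, avail)]
    left = total - sum(alloc)
    while left > 0:
        room = [i for i in range(len(alloc)) if alloc[i] < avail[i]]
        if not room:
            break  # fewer sub-processes available than activations requested
        # Give the next unit to the state furthest below its target.
        i = max(room, key=lambda j: total * frac[j] - alloc[j])
        alloc[i] += 1
        left -= 1
    return alloc
```

When every state has enough sub-processes, each allocation in this sketch stays within one unit of its fractional target; when a state is short, its surplus spills over to the other tied states.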
Hawkins proposed a minimum-lambda policy, which, when translated into our setting, finds the largest Lagrange multiplier of the form for which an optimal solution of the relaxed problem is feasible for the original MDP. The policy then sets active those sub-processes which would be set active in the relaxed problem. However, Hawkins did not specify what and should be, thus limiting the policy’s applicability to finite horizon settings. Our policy is similar to that of Hawkins in that 1) setting active the sub-processes with the largest indices in our policy is equivalent to finding the largest that satisfies the constraints of the original MDP and setting the corresponding sub-processes active, and 2) we also limit the values of the Lagrange multiplier considered to a ray, as can be written in the form of , for . However, unlike Hawkins’ policy, our policy defines the starting point and the direction of the ray, along with a tie-breaking policy that ensures asymptotic optimality.
5 Proof of Asymptotic Optimality
Our index policy achieves asymptotic optimality when we let the number of sub-processes go to infinity, while holding constant. We write for the expected reward of the original MDP obtained by policy , to emphasize the dependence of this quantity on and . We use to denote the set of all feasible Markov policies for the original MDP with sub-processes and a budget of activations per period. Lastly, it should be understood that whenever we use to denote our index policy there is a dependency of on and that is not explicitly stated. We are now ready to state the main result of this paper, which shows that the per-arm gap between the upper bound and the index policy goes to zero under the limit assumption:
For any ,
To formalize the notations that will be used throughout the proofs, we augment to to indicate the values of and assumed in the Lagrangian relaxation. We use to denote one and any element in and let be the optimal policy constructed in (4.4) using , which satisfies . Note and depend on only (not on ).
As before, we let be the number of sub-processes in state at time under . We additionally define to be the number of sub-processes in state at time that are set active by . These quantities depend on and , but for simplicity we do not include this dependence in the notation: they always assume and we rely on context to make clear the value of assumed. We also define to be the set of states with the same index value as , including , and to be the set of states with index value greater than that of , for each time . These quantities depend on but not on or .
We prove Theorem 1 by first demonstrating below in Theorem 2 that for each time , the proportion of the sub-processes that are in state under our index policy , , approaches as . In other words, our index policy recreates the behavior of in the large limit.
For every and ,
At time , for all , we have
If , then .
If , then .
For any state and time ,
If , then
If , then
Now we are ready to prove Theorem 2.
When , all sub-processes start in state , and we have
By the set-up of the original MDP, , and we have
so we have proved the base case of the induction.
Now assume (5.2) and (5.3) hold up until time . Fix a state and time , define to be the number of sub-processes set active by in at time which transition to state at time , and to be the number of sub-processes set inactive by in at time which transition to at time . Note that and also depend on . We can subsequently express as
Dividing both sides by , and taking to a limit, we get
Note is a binomial random variable with trials and success probability Similarly, is a binomial random variable with trials and success probability . We can rewrite the RHS of (5.4) by applying Lemma 9, which is stated at the end of the section:
The last equality follows as we have exhausted all the ways of getting to at time . Hence we have shown (5.2) holds for time .
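The intuition behind this limit can be illustrated with a small Python simulation. Lemma 9 itself is not reproduced here, so this sketch only shows the law-of-large-numbers behaviour the step relies on: when each of many sub-processes independently makes a given transition with the same probability, the fraction that does so concentrates around that probability. The function name `transition_fraction` and its parameters are illustrative assumptions.

```python
import random


def transition_fraction(n_arms, p, seed=0):
    """Fraction of n_arms independent sub-processes that make a transition.

    Each sub-process transitions with probability p, so the count is a
    binomial random variable with n_arms trials and success probability p.
    """
    rng = random.Random(seed)
    moved = sum(1 for _ in range(n_arms) if rng.random() < p)
    return moved / n_arms
```

Calling `transition_fraction` with increasing `n_arms` and a fixed `p` gives values that settle near `p`, mirroring the way the binomial counts in the proof, once divided by the number of sub-processes, converge to their success probabilities.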
To show (5.3) holds for time , define sets , and . We use notation for the set which consists of all elements in divided by . Define function to represent the number of sub-processes set active at time in state , that is,
where represent the number of sub-processes set active when tie-breaking is needed, that is,
where and are random variables due to the rounding rules in Algorithm 2, and are dependent on . We also define function
Combining the three lemmas above we have
Proof of Theorem 1 implies . Thus,
On the other hand,
Here, the third line follows by Theorem 2 and the fact that both and are bounded and hence uniformly integrable random variables (for uniformly integrable random variables, convergence almost surely implies convergence in expectation). The fourth line holds because takes the active action at each time with probability . The fifth line follows from Lemma 2, where we have augmented the notation for to include the values of and assumed. The sixth line follows from Lemma 1.
Finally, sandwiching the two inequalities gives the desired result. ∎
6 Numerical Experiments
In this section we present numerical experiments for two problems: the finite-horizon multi-armed bandit with multiple pulls per period, and subset selection [4, 9]. These experiments illustrate numerically the asymptotic optimality of our index policy. We also compare the finite-time performance of our policy to other policies from the literature. Although our previously provided theoretical results do not apply to finite , we see that our index policy performs strictly better than all benchmarks considered in both problems.
6.1 Multi-armed bandit
In our first experiment, we consider a Bernoulli multi-armed bandit problem with a finite time horizon , and multiple pulls per time period. A player is presented with arms and may select of them to pull at every time step. Each arm pulled returns a reward of or . The player’s goal is to maximize her total expected reward. We take a Bayesian-optimal approach and impose a Beta(1,1) prior on each arm. The values of the state then correspond to the posterior parameters of the K arms.
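The state dynamics of a single arm in this Bayesian setting can be sketched in Python as follows. The sketch uses the standard Beta-Bernoulli conjugate update, encodes the state as the pair of posterior parameters `(a, b)`, and assumes rewards of 1 for a success and 0 for a failure; the function names `pull_transitions` and `expected_reward` are illustrative and not from the paper.

```python
from fractions import Fraction


def pull_transitions(state):
    """Possible outcomes of pulling an arm with Beta(a, b) posterior.

    Returns a list of (next_state, probability, reward) triples.
    """
    a, b = state
    p = Fraction(a, a + b)  # posterior mean of the success probability
    return [((a + 1, b), p, 1), ((a, b + 1), 1 - p, 0)]


def expected_reward(state):
    """Expected one-period reward of pulling an arm in the given state."""
    a, b = state
    return Fraction(a, a + b)
```

Starting from the Beta(1,1) prior, `pull_transitions((1, 1))` yields the states `(2, 1)` and `(1, 2)` with probability one half each. An arm that is not pulled keeps its state, which is what makes this experiment an MAB rather than a general RMAB.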
For comparison, we include results from an upper confidence bound (UCB) algorithm with pre-trained confidence width. At every time step, we compute for each arm , where and are the sample mean and standard deviation of arm . We pre-train by running the UCB algorithm on a different set of data (but simulated with the same distribution) with values of ranging from 0 to 5 and then set to the value that gives the best performance.
Figure 1 plots the reward per arm (expected total reward divided by ) against , for . The red dashed line represents the upper bound computed using