An Asymptotically Optimal Index Policy
for Finite-Horizon Restless Bandits
Abstract
We consider a restless multi-armed bandit (RMAB) problem with a finite horizon and multiple pulls per period. Leveraging a Lagrangian relaxation, we approximate the problem with a collection of single-arm problems. We then propose an index-based policy that uses optimal solutions of the single-arm problems to index individual arms, and we prove that it is asymptotically optimal as the number of arms tends to infinity. We also use simulation to show that this index-based policy outperforms state-of-the-art heuristics in a variety of problem settings.
Keywords: Restless Bandits; Constrained Control Process
1 Introduction
We consider the restless multi-armed bandit (RMAB) problem [16] with a finite horizon and multiple pulls per period. In the RMAB, we have a collection of “arms”, each of which is endowed with a state that evolves independently. If an arm is “pulled” or “engaged” in a time period, it advances stochastically according to one transition kernel; if not, it advances according to a different kernel. Rewards are generated with each transition, and our goal is to maximize the expected total reward over a finite horizon, subject to a constraint on the number of arms pulled in each time period.
The RMAB generalizes the multi-armed bandit (MAB) [12] by allowing arms that are not engaged to change state and by allowing multiple pulls per period. This extends the applicability of the MAB to a broader range of settings, including the submarine tracking and project assignment problems in [16], as well as contemporary applications such as the following:

Facebook displays ads in the suggested posts section every time its users browse their personal pages. Among the ads that have been shown, some are known to attract more clicks than others, but many ads that have yet to be shown may attract even more clicks. Given that the display slots are limited, a policy is required to select ads so as to maximize total clicks.

In a multi-stage clinical trial, a medical group starts with a number of new treatments and an existing treatment with reliable performance. In each stage, a few treatments are selected from the pool to test, with the goal of identifying, with high confidence, the new treatments that perform better than the existing one. A strategy is required for selecting which treatments to test at every stage so as to most effectively support the group’s judgment at the end of the trial.

A data analyst wishes to label a large number of images using crowdsourced effort from low-cost but potentially inaccurate workers. Each label given by the crowd workers comes with a cost, and the analyst has a limited budget. Hence she needs to assign tasks carefully so as to maximize the likelihood of correct labeling.
The infinite-horizon MAB with one pull per time period famously has a tractable-to-compute optimal policy, the Gittins index policy [6]. This policy is appealing because it can be computed by considering the state space of only a single arm, making it computationally tractable for problems with many arms. The policy loses its optimality, however, when the problem is modified in any of several dimensions: when arms that are not engaged may change state; when the horizon is finite [3]; or when multiple pulls per period are allowed. Thus, the Gittins index does not apply to our problem setting. Moreover, while optimal policies for RMABs with multiple pulls per period or finite horizons are characterized by the dynamic programming equations [11], the so-called ‘curse of dimensionality’ [10] prevents computing them: the dimension of the state space grows linearly with the number of arms, so the number of joint states grows exponentially.
Thus, while the RMAB is not known to have a computable optimal policy, [16] proposed a heuristic called the Whittle index for the infinite-horizon RMAB with multiple pulls per period, which is well-defined when arms satisfy an indexability condition. This policy is derived by considering a Lagrangian relaxation of the RMAB in which the constraint on the number of arms pulled is replaced by a penalty paid for pulling an arm. An arm’s Whittle index is then the penalty that makes a rational player indifferent between pulling and not pulling that arm. The Whittle index policy pulls the arms with the highest Whittle indices. Appealingly, the Whittle index and the Gittins index coincide on the MAB problem with a single pull per period.
[16] further conjectured that if the number of arms and the number of pulls in each time period go to infinity at the same rate in an infinite-horizon RMAB, then the Whittle index policy is asymptotically optimal when arms are indexable. [14, 15] proved Whittle’s conjecture under a difficult-to-verify condition: that the fluid approximation has a globally asymptotically stable equilibrium point. This condition was shown to hold when each arm’s state space is small enough, but it does not hold in general, and [14] provides a counterexample.
Our contribution in this paper is to (1) create an index policy for finite-horizon RMABs with multiple pulls per period, and (2) show that it is asymptotically optimal in the same limit considered by Whittle. Like the Whittle index, our approach is computationally appealing because it requires considering the state space of only a single arm, and its computational complexity does not grow with the number of arms. Unlike the Whittle index, our index policy does not require an indexability condition to hold in order to be well-defined, and in contrast with [14, 15] our proof of asymptotic optimality holds regardless of the number of states. We further demonstrate our index policy numerically on problems from the literature that can be formulated as finite-horizon RMABs, and show that its finite-sample performance improves over the state of the art.
In addition to building on [16, 14, 15], our work builds on the literature on weakly coupled dynamic programs (WCDPs), which itself builds on RMABs. Indeed, at the end of his paper, Whittle pointed out that his relaxation technique can be applied to a more general class of problems in which subproblems are linked by constraints on actions but are otherwise independent. Hawkins, in his thesis [8], formally termed these problems WCDPs (allowing a more general type of constraint) and proposed a general decoupling technique. He also proposed index-based policies for solving both finite- and infinite-horizon WCDPs and proved that his policy, when applied to the infinite-horizon multi-armed bandit (MAB) problem, is equivalent to the Gittins index policy. Our work is similar to Hawkins’ in that we consider Lagrange multipliers of the same functional form when computing indices. However, Hawkins does not specify what the coefficients of the function should be, nor does he give a tie-breaking rule for the case when multiple arms have the same index. We obtain an asymptotically optimal policy by addressing both of these issues. The differences are discussed in greater detail after we formally introduce our index policy.
Another major work on WCDPs is [1], which shows that the ADP relaxation is tighter than the Lagrangian relaxation but also computationally more expensive. It gives necessary and sufficient conditions for the Lagrangian relaxation to be tight and proves that the optimality gap is bounded by a constant when the Lagrange multipliers are allowed to be state-dependent. This last result implies that the per-arm gap goes to zero as the number of arms grows. We achieve a similar result in this paper by showing that the per-arm reward of our index-based heuristic policy converges to the per-arm reward of the Lagrangian bound, even though our Lagrange multipliers are not state-dependent. We conjecture that this is possible because our constraints, which depend on the action and not the state, are less general than the constraints considered in WCDPs, which depend on both the action and the state. Moreover, the two works differ in focus: while ours offers an asymptotically optimal heuristic policy, [1] examines the ordering and tightness of different bounds. The heuristic proposed in [1], which is based on an ADP technique, also differs from our index-based policy.
Other work on WCDPs includes [17], who proposes an even tighter bound by incorporating an information relaxation of the non-anticipativity constraints on top of the existing relaxation methods. [7] considers two classes of large-scale WCDPs in which the state and action spaces of each subproblem also grow exponentially, and uses an ADP technique to approximate the value functions of the individual sub-MDPs in addition to employing a Lagrangian relaxation for the overall problem.
The remainder of this paper is organized as follows: Section 2 formulates the problem. Section 3 discusses the Lagrangian relaxation of the problem. Section 4 states our index-based policy and provides computational methods. Section 5 gives a proof of asymptotic optimality. Section 6 numerically evaluates our index policy. Section 8 concludes the paper.
2 Problem Description and Notation
We consider an MDP created from a collection of subprocesses. The subprocesses are independent of each other except that they are bound by joint decisions on their actions at each time step. These subprocesses are also referred to as arms in the bandit literature and are indexed by k = 1, …, K. Following a standard construction for MDPs, both the larger joint MDP and the subprocesses are constructed on the same measurable space. Random variables on this measurable space correspond to states, actions, and rewards, and each policy induces a probability measure over this space. We describe the MDP formally as follows:

The time horizon, which is finite.

The state space, which is the cross product of the subprocess state spaces and is assumed to be finite. We use one symbol to denote a fixed element of the state space and another when the state is random, adding a time subscript to emphasize the state at a given time; the same conventions apply to the subprocess states.

The action space, which is the cross product of the subprocess action spaces, each of which is binary. We use one symbol to denote a generic joint action and another when it is random, with analogous notation for the subprocess actions. In the context of bandit problems, taking the active action is called “pulling” an arm (subprocess).

The reward function of each subprocess, which gives the reward obtained when an action is taken in a given state at a given time. We assume rewards are nonnegative and finite.

The transition kernel, which is a product of subprocess kernels giving the probability of a subprocess transitioning from one state to another when an action is taken. The product form implies that the subprocesses evolve independently. We also point out that RMABs differ from MAB problems in that MABs require subprocesses that are not pulled to remain frozen in their current state, while RMABs allow them to change state. Since we are considering both cases, we do not restrict the passive transition kernel.
Next we describe the set of policies for our MDP. Since the state and action spaces defined above are finite, it is sufficient to consider the set of Markov policies [11]. Define a policy as a function that determines the probability of choosing each action in each state at each time. A policy and the transition kernel together define a probability distribution on all possible paths of the process. Starting at a fixed initial state, the conditional distributions of the states and actions are defined recursively.
The MDP we consider requires exactly m subprocesses to be set active at each time step. Hence a feasible policy must choose, at every time, a joint action vector whose elements sum to m.
The objective of our MDP is as follows:
(2.1)  
subject to 
Since we will discuss other MDPs in the process of solving this one, (2.1) will be referred to as the original MDP in the rest of the paper to avoid confusion. For convenience, we summarize our notation in Appendix A. We note that the original MDP (2.1) suffers from the ‘curse of dimensionality’, and hence solving it directly is computationally intractable. In the remainder of the paper we build a computationally feasible index-based heuristic with a performance guarantee.
3 Lagrangian Relaxation and Upper Bounds
In this section we discuss the Lagrangian relaxation of the original MDP and the corresponding single-process problems. These single-process problems, together with the Lagrange multipliers, form the building blocks of our index-based policy, which is formally introduced in Section 4. Lagrangian relaxation removes the binding constraints and allows the problem to be decomposed into tractable MDPs [1]. It works as follows: for any value of the multiplier, we consider an unconstrained problem whose objective is obtained by augmenting the objective in (2.1):
(3.1) 
This unconstrained problem forms the Lagrangian relaxation of (2.1), and has the following property:
Lemma 1.
For any value of the multiplier, the relaxed objective is an upper bound on the optimal value of the original MDP.
[1] proved Lemma 1 using the Bellman equations. We provide a more straightforward proof by viewing the relaxed objective as the Lagrange dual function of a relaxed version of the original MDP; see Appendix B.
This Lagrangian relaxation decomposes into smaller MDPs, which we can easily solve to optimality. To elaborate on this decomposition, we construct a sub-MDP from a single subprocess. Again we consider only the set of Markov policies for this problem; a policy is a function that determines the probability of choosing each action in each state at each time. The sub-MDP starts at a fixed state, and the distributions of its states and actions under a policy are defined in the same manner as in the previous section. The objective of the sub-MDP is defined as follows:
(3.2) 
We are now ready to present the decomposition of the Lagrangian relaxation.
Lemma 2.
The optimal value of the relaxed problem satisfies
(3.3) 
[1] also proved Lemma 2; we again provide a different proof in Appendix C. Since the state space of a sub-MDP is much smaller, we can solve it directly by backward induction on the optimality equations. The existence of an optimal Markov deterministic policy is implied by the finiteness of the state and action spaces of the sub-MDP [11]. Consider the set of optimal Markov deterministic policies of the sub-MDP for a given multiplier. The relaxed problem can then be solved by combining the solutions of the individual sub-MDPs; that is, we can construct an optimal policy of the relaxed problem by concatenating optimal policies of the sub-MDPs. Moreover, the Lagrangian is convex and piecewise linear in the multiplier [1].
4 An Index Based Heuristic Policy
Our index-based heuristic policy assigns an index to each subprocess based upon its state and the current time. At each time step, we set active the m subprocesses with the highest indices. Before carrying out the process of sequential decision-making, our index policy calls for precomputation of 1) an optimal dual multiplier, as defined in Section 3; 2) a set of indices that will later be used for decision-making at every time step; and 3) an optimal policy for the sub-MDP problem in (3.2). In the first part of this section we discuss how to carry out these computations.
4.1 Precomputations
4.1.1 Dual optimal solution
We use subgradient descent to compute an optimal dual multiplier; the method converges by convexity of the Lagrangian (Theorem 7.4 in [13]). By (3.2), a subgradient of the Lagrangian with respect to the multiplier is given by the slack in the activation constraint under any optimal sub-MDP policy.
To compute this subgradient, we first compute an optimal policy of the sub-MDP and then use exact computation or simulation with a large number of replications to evaluate the slack. To find such a policy, we first compute the value function of the sub-MDP. We accomplish this using backward induction [11]:
(4.1) 
Then, any and all optimal policies are constructed by determining, for each state and time, the action whose one-step lookahead value attains the maximum, and setting the policy to choose that action. For those states and times at which both actions have one-step lookahead values equal to the maximum, one may choose either action. Thus, the cardinality of the set of optimal policies is 2 raised to the power of the number of state-time pairs at which the one-step lookahead values for playing and not playing are tied.
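As a concrete illustration, the backward induction above can be sketched as follows. This is a minimal sketch under our own assumptions, not the paper's notation: a two-action sub-MDP with time-homogeneous transition matrices `P[0]`, `P[1]`, reward vectors `r[0]`, `r[1]`, and a scalar penalty `lam` charged per activation.

```python
import numpy as np

def solve_sub_mdp(P, r, lam, T):
    """Backward induction for one sub-MDP under penalty lam.

    P[a] : |S| x |S| transition matrix for action a in {0, 1}
    r[a] : length-|S| reward vector for action a (time-homogeneous here)
    lam  : Lagrange penalty paid per activation
    T    : horizon length
    Returns the value function V ((T+1) x |S|) and a deterministic
    policy pi (T x |S|), breaking ties in favor of the active action.
    """
    S = P[0].shape[0]
    V = np.zeros((T + 1, S))
    pi = np.zeros((T, S), dtype=int)
    for t in range(T - 1, -1, -1):
        q0 = r[0] + P[0] @ V[t + 1]        # one-step value of staying passive
        q1 = r[1] - lam + P[1] @ V[t + 1]  # active action pays the penalty
        pi[t] = (q1 >= q0).astype(int)     # ties resolved to "active"
        V[t] = np.maximum(q0, q1)
    return V, pi
```

Resolving ties toward the active action here matches the convention discussed in the surrounding text.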
When we construct an optimal policy for the purpose of computing a subgradient, we choose to play in those state-time pairs with tied one-step lookahead values. While our subgradient descent algorithm would converge for other choices, this choice better supports the computation of indices in Section 4.1.2.
4.1.2 Indices
Define a modified multiplier vector to be the optimal multiplier with a single element replaced. We define the index of a state at a time as
(4.2) 
Instead of computing the entire set of optimal policies, we only need to compute one policy using the method discussed in Section 4.1.1, i.e., always choosing the active action when there are ties. Intuitively, the index is the maximum price we are willing to pay to set a subprocess active in a given state at a given time. By leveraging the monotonicity of optimal actions with respect to rewards, as shown in Lemma 8 in Appendix G, we compute each index via bisection search on an interval whose upper endpoint bounds the largest possible index value; Appendix F shows how to choose this bound. We precompute the full set of indices before running the actual algorithm.
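Under the same assumed two-action setup as before, the bisection described above can be sketched as follows; `lam_max` plays the role of the a-priori upper bound on the index, and the monotonicity being exploited is that of Lemma 8.

```python
import numpy as np

def active_is_optimal(P, r, lam, T, t, s):
    """Backward induction; True if the active action is (weakly) optimal
    in state s at time t under penalty lam."""
    S = P[0].shape[0]
    V = np.zeros(S)
    for u in range(T - 1, t - 1, -1):
        q0 = r[0] + P[0] @ V
        q1 = r[1] - lam + P[1] @ V
        if u == t:
            return q1[s] >= q0[s]
        V = np.maximum(q0, q1)

def index(P, r, T, t, s, lam_max, tol=1e-6):
    """Largest penalty at which (s, t) is still activated, by bisection;
    relies on monotonicity of the optimal action in the penalty."""
    lo, hi = 0.0, lam_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if active_is_optimal(P, r, mid, T, t, s):
            lo = mid
        else:
            hi = mid
    return lo
```

Each bisection step re-solves the sub-MDP from the horizon back to time t, so computing one index costs a logarithmic number of backward inductions.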
4.1.3 Occupation measure and its corresponding optimal policy
Our tie-breaking rule involves constructing an optimal Markov policy for the sub-MDP whose expected activation level matches the relaxed budget at every time; the existence of such a policy is shown in Appendix E. To compute it, we borrow the idea of an occupation measure [5]. Define the occupation measure induced by a policy to be the probability of being in a given state and taking a given action at a given time under that policy. The desired policy can then be found by solving the following linear program (LP):
(4.3)  
subject to  
where the constraints run over all states, actions, and times. The first constraint ensures that the expected activation level matches the relaxed budget at every time. The second constraint ensures flow balance. The third constraint encodes that we start at the fixed initial state. The second and third constraints together imply that the occupation measure is a probability distribution at each time. The fourth and fifth constraints ensure that the occupation measure is a valid probability measure.
Given an optimal solution to the LP, the policy can then be constructed by
(4.4) 
for all states, actions, and times.
We also observe that this quantity can be computed by solving (4.3) with a suitable substitution.
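The LP above can be sketched with `scipy.optimize.linprog`. The encoding below is our reading of the constraints: variables `rho[t, s, a]` stacked into a flat vector, a per-period activation budget `alpha` (interpreted as m/K), flow balance, and a fixed start state; nonnegativity is handled by the variable bounds, which together with the balance equations play the role of the last two constraints.

```python
import numpy as np
from scipy.optimize import linprog

def occupation_lp(P, r, s0, T, alpha):
    """Occupation-measure LP (sketch): maximize total expected reward
    subject to (i) activation mass alpha at each time, (ii) flow balance,
    (iii) a fixed start state."""
    S = P[0].shape[0]
    n = T * S * 2
    idx = lambda t, s, a: (t * S + s) * 2 + a
    c = np.zeros(n)
    for t in range(T):
        for s in range(S):
            for a in (0, 1):
                c[idx(t, s, a)] = -r[a][s]           # linprog minimizes
    A, b = [], []
    for t in range(T):                                # (i) activation budget
        row = np.zeros(n)
        for s in range(S):
            row[idx(t, s, 1)] = 1.0
        A.append(row); b.append(alpha)
    for s in range(S):                                # (iii) start state
        row = np.zeros(n)
        row[idx(0, s, 0)] = row[idx(0, s, 1)] = 1.0
        A.append(row); b.append(1.0 if s == s0 else 0.0)
    for t in range(T - 1):                            # (ii) flow balance
        for s2 in range(S):
            row = np.zeros(n)
            row[idx(t + 1, s2, 0)] = row[idx(t + 1, s2, 1)] = 1.0
            for s in range(S):
                for a in (0, 1):
                    row[idx(t, s, a)] -= P[a][s, s2]
            A.append(row); b.append(0.0)
    res = linprog(c, A_eq=np.array(A), b_eq=np.array(b), bounds=(0, None))
    return res.x.reshape(T, S, 2), -res.fun
```

The randomized policy of (4.4) is then recovered by normalizing `rho[t, s, :]` by its sum wherever that sum is positive.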
4.2 Index policy
Consider the indices associated with the K subprocesses at a given time. We define a threshold as the largest value such that at least m subprocesses have indices of at least that value. Our index policy then sets active the subprocesses with indices strictly greater than the threshold, and sets inactive those with indices strictly less than it. When more than m subprocesses have indices greater than or equal to the threshold, a tie-breaking rule is needed. Our tie-breaking rule allocates the remaining resources (the remaining subprocesses to be set active) across tied states according to the probability distribution induced by the LP policy at that time. This tie-breaking ensures asymptotic optimality of the index policy, as it enforces that the fraction of subprocesses in each state matches the distribution induced by the LP policy in the limit. This idea becomes clear in Section 5, where the proof of asymptotic optimality is presented.
To further illustrate how our tie-breaking rule works, let the tied set denote the states occupied by the tied subprocesses. To each tied state we allocate the
(4.5) 
fraction of the remaining resources, where the fraction is computed from a solution to (4.3). In the degenerate case where this fraction is undefined, we break ties according to the number of subprocesses that are currently in each of the tied states.
We then use the function Rounding(total, frac, avail) in Algorithm 2 to handle situations where the products of the desired fractions and the remaining resources are not integers. Here total represents the number of remaining resources, frac is a vector of fractions to be allocated to each tied state, and avail is a vector of the numbers of subprocesses in each tied state. The function also takes care of the corner case in which the number of subprocesses in a tied state is less than the number of resources we would like to assign to it according to the fraction in (4.5). We note the following property of the function Rounding, which we will rely on in the proof in Section 5.
Remark 1.
When total, avail, frac satisfy , the output vector satisfies for all .
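Since Algorithm 2 is not reproduced in this excerpt, the following is only our guess at a rounding routine consistent with the description: floor the fractional allocations, cap each by availability, then hand out the remainder by largest unmet fractional demand. The paper's actual Rounding function may differ.

```python
import math

def rounding(total, frac, avail):
    """Hypothetical sketch of Rounding(total, frac, avail): turn the
    fractional targets total * frac[i] into integers that sum to total
    without exceeding the avail[i] subprocesses in each tied state."""
    want = [total * f for f in frac]
    out = [min(int(math.floor(w)), a) for w, a in zip(want, avail)]
    rem = total - sum(out)
    # give leftover units to states with the largest unmet fractional demand
    order = sorted(range(len(frac)), key=lambda i: want[i] - out[i],
                   reverse=True)
    for i in order:
        if rem == 0:
            break
        extra = min(avail[i] - out[i], rem)
        out[i] += extra
        rem -= extra
    return out
```

By construction the output is integral, componentwise at most avail, and sums to total whenever total does not exceed the total availability.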
Remark 2.
[8] proposed a minimum-lambda policy which, when translated into our setting, finds the largest Lagrange multiplier of a given parametric form for which an optimal solution of the relaxed problem is feasible for the original MDP. The policy then sets active those subprocesses that would be set active in the relaxed problem. However, Hawkins did not specify the parameters of this form, limiting the policy’s applicability in finite-horizon settings. Our policy is similar to that of Hawkins in that 1) setting active the subprocesses with the largest indices is equivalent to finding the largest multiplier that satisfies the constraints of the original MDP and setting the corresponding subprocesses active; and 2) we also limit the values of the Lagrange multiplier considered to a ray. Unlike Hawkins’ policy, however, our policy specifies the starting point and the direction of the ray, along with a tie-breaking rule that ensures asymptotic optimality.
5 Proof of Asymptotic Optimality
Our index policy achieves asymptotic optimality when we let the number of subprocesses go to infinity while holding the ratio of the budget to the number of subprocesses constant. We write the expected reward of the original MDP under a policy so as to emphasize its dependence on K and m, and we consider the set of all feasible Markov policies for the original MDP with K subprocesses and a budget of m activations per period. Lastly, it should be understood that whenever we refer to our index policy there is a dependence on K and m that is not explicitly stated. We are now ready to state the main result of this paper, which shows that the per-arm gap between the upper bound and the index policy vanishes in this limit:
Theorem 1.
For any ,
(5.1) 
To formalize the notation used throughout the proofs, we augment the Lagrangian notation to indicate the values of K and m assumed in the relaxation. We fix one element of the set of optimal sub-MDP policies and let the optimal policy constructed in (4.4) be defined from it, satisfying the budget constraint in expectation. Note that these quantities depend only on the ratio of m to K (not on K itself).
As before, we count the number of subprocesses in each state at each time under our index policy. We additionally count the number of subprocesses in each state at each time that are set active by the policy. These quantities depend on K and m, but for simplicity we do not include this dependence in the notation; we rely on context to make clear the values assumed. We also define, for each time, the set of states with the same index value as a given state (including that state itself), and the set of states with index value greater than that of the given state. These latter quantities depend on the time but not on K or m.
We prove Theorem 1 by first demonstrating below in Theorem 2 that, for each time, the proportion of subprocesses in each state under our index policy approaches the probability of that state under the relaxed-optimal policy as K grows. In other words, our index policy recreates the behavior of the relaxed solution in the large-K limit.
Theorem 2.
For every state and time,
(5.2) 
and
(5.3) 
Before proving Theorem 2, we first present two intermediate results, whose proofs are given in Appendices H and I.
Lemma 3.
At time , for all , we have

If , then .

If , then .
Lemma 4.
For any state and time ,

If , then

If , then
We will also require a further technical result (Lemma 9) in the proof of Theorem 2; again, the proof is offered in Appendix J.
Now we are ready to prove Theorem 2.
Proof.
At the initial time, all subprocesses start in the fixed initial state, so the empirical state proportions are deterministic; by the setup of the original MDP they match the distribution induced by the relaxed-optimal policy, so we have proved the base case of the induction.
Now assume (5.2) and (5.3) hold up until a given time. Fix a state and the next time period, and define two counts: the number of subprocesses set active by our index policy at the current time that transition to the fixed state in the next period, and the number set inactive at the current time that transition to the same state. Note that both counts depend on K. We can then express the number of subprocesses in the fixed state at the next time as
Dividing both sides by K and taking the limit, we get
(5.4) 
Note that the first count is a binomial random variable whose number of trials is the number of active subprocesses and whose success probability is the corresponding transition probability. Similarly, the second count is a binomial random variable over the inactive subprocesses. We can rewrite the RHS of (5.4) by applying Lemma 9, which is stated at the end of the section:
(5.5)  
(5.6) 
The last equality follows because we have exhausted all the ways of reaching the fixed state at the next time. Hence we have shown that (5.2) holds for that time.
To show (5.3) holds as well, define the relevant sets of states. For a set, we write the set divided by K to mean the set consisting of all its elements divided by K. Define a function to represent the number of subprocesses set active at a given time in a given state, that is,
(5.7) 
where the second term represents the number of subprocesses set active when tie-breaking is needed, that is,
(5.8) 
where the allocation counts are random variables due to the rounding rules in Algorithm 2 and depend on K. We also define the function
(5.9) 
This proof is completed by the following three lemmas, whose proofs are given in Appendices K, L, and M.
Lemma 5.
.
Lemma 6.
Lemma 7.
Combining the three lemmas above, we have
∎
Proof of Theorem 1.
Since our index policy is feasible for the original MDP, its expected reward is at most the optimal value. Thus,
On the other hand,
Here, the third line follows from Theorem 2 and the fact that the relevant quantities are bounded and hence uniformly integrable (for uniformly integrable random variables, almost sure convergence implies convergence in expectation). The fourth line holds because the relaxed-optimal policy takes the active action at each time with the probability prescribed by the occupation measure. The fifth line follows from Lemma 2, where we have augmented the notation to include the values of K and m assumed. The sixth line follows from Lemma 1.
Finally, sandwiching the two inequalities gives the desired result. ∎
6 Numerical Experiments
In this section we present numerical experiments for two problems: the finite-horizon multi-armed bandit with multiple pulls per period, and subset selection [4, 9]. These experiments demonstrate numerically that our index policy is indeed asymptotically optimal. We also compare the finite-time performance of our policy to other policies from the literature. Although our theoretical results apply only in the limit, our index policy performs strictly better than all benchmarks considered in both problems, even for finite K.
6.1 Multiarmed bandit
In our first experiment, we consider a Bernoulli multi-armed bandit problem with a finite time horizon and multiple pulls per time period. A player is presented with K arms and may select m of them to pull at every time step. Each arm pulled returns a reward of 1 or 0. The player’s goal is to maximize her total expected reward. We take a Bayesian-optimal approach and impose a Beta(1,1) prior on each arm. The values of the state then correspond to the posterior parameters of the K arms.
For comparison, we include results from an upper confidence bound (UCB) algorithm with a pretrained confidence width. At every time step, we compute for each arm an index equal to its sample mean plus a multiple of its sample standard deviation. We pretrain the multiplier by running the UCB algorithm on a different set of data (simulated from the same distribution) with values ranging from 0 to 5, and then set it to the value that gives the best performance.
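For concreteness, the UCB benchmark can be sketched as follows. The exact form of the confidence width is elided in the text above, so the plug-in Bernoulli standard deviation and the optimistic default score for unpulled arms are our assumptions; `c` is the pretrained multiplier.

```python
import numpy as np

def ucb_scores(successes, failures, c):
    """UCB benchmark score: sample mean plus c times a plug-in Bernoulli
    standard deviation (our assumed form of the confidence width).
    successes/failures are counts of observed 1 and 0 rewards per arm."""
    n = successes + failures
    mean = np.where(n > 0, successes / np.maximum(n, 1), 1.0)  # optimism for unpulled arms
    std = np.sqrt(mean * (1 - mean))
    return mean + c * std

def select_arms(successes, failures, c, m):
    """Pull the m arms with the highest UCB scores."""
    scores = ucb_scores(np.asarray(successes, float),
                        np.asarray(failures, float), c)
    return np.argsort(-scores)[:m]
```

Pretraining then amounts to sweeping `c` over a grid on a separate simulated run and keeping the value with the highest total reward.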
Figure 1 plots the reward per arm (expected total reward divided by K) against K. The red dashed line represents the upper bound computed using