Multiuser Scheduling in a Markov-modeled Downlink using Randomly Delayed ARQ Feedback
Abstract
In this paper, we focus on the downlink of a cellular system, which corresponds to the bulk of the
data transfer in such wireless systems. We address the problem of opportunistic multiuser scheduling
under imperfect channel state information, by exploiting the memory inherent in the channel. In our
setting, the channel between the base station and each user is modeled by a two-state
Markov chain and the scheduled user sends back an ARQ feedback signal that arrives at the scheduler
with a random delay that is i.i.d. across users and time. The scheduler indirectly estimates the channel via accumulated delayed-ARQ
feedback and uses this information to make scheduling decisions.
We formulate a throughput maximization problem as a partially observable Markov decision process
(POMDP). For the case of two users in the system, we
show that a greedy policy is sum-throughput optimal for any distribution on the ARQ feedback delay.
For the case of more than two users, we prove that the greedy policy is suboptimal and demonstrate, via
numerical studies, that it has near-optimal performance. We show that the greedy policy can be implemented by a simple algorithm that does not
require the statistics of the underlying Markov channel or the ARQ feedback delay, thus making it
robust against errors in system parameter estimation. Establishing an equivalence between the
two-user system and a genie-aided system, we obtain a simple closed-form expression for the sum
capacity of the Markov-modeled downlink. We further derive inner and outer bounds on the capacity
region of the Markov-modeled downlink and tighten these bounds for special cases of the system parameters.
Index Terms – Opportunistic multiuser scheduling, cellular downlink, Markov channel, ARQ feedback, delay, greedy policy, sum capacity, capacity region.
Sugumar Murugesan, Member, IEEE, Philip Schniter, Senior Member, IEEE and Ness B. Shroff, Fellow, IEEE
This work was supported by the NSF CAREER grant 237037, the Office of Naval Research grant N000140710209, NSF grants CNS0721236, CNS0626703, ARO W911NF0810238, CNS1065136 and CNS1012700.
Murugesan was with the Department of ECE, The Ohio State University, and is currently with the Department of ECEE, Arizona State University; Schniter is with the Department of ECE, The Ohio State University; and Shroff holds a joint appointment in both the Department of ECE and the Department of CSE at The Ohio State University. (Email: sugumar.murugesan@asu.edu, schniter@ece.osu.edu, shroff@ece.osu.edu)
I Introduction
With the ever-increasing demand for high data rates, opportunistic multiuser scheduling, introduced by Knopp and Humblet in [1] and defined as allocating the resources to the user experiencing the most favorable channel conditions, has gained immense popularity among wireless network designers. Opportunistic multiuser scheduling essentially exploits the multiuser diversity in the system and has motivated several researchers (e.g., [2]-[6]) to study the performance gains obtained by opportunistic scheduling under various scenarios. While the i.i.d. flat-fading model is used in these works to model time-varying channels (for a general treatment of opportunistic scheduling with minimal assumptions on the channel, see [7]), it fails to capture the memory in the channel observed in realistic scenarios. Hence, more recently, opportunistic scheduling has also been investigated by modeling the channels by Markov chains (e.g., [8, 9, 10, 11, 12, 13]). However, in these works, the channel state information that is crucial for the success of any opportunistic scheduling scheme is assumed to be readily available at the scheduler. This is a simplifying assumption that does not hold in reality, where a nontrivial amount of resource must be spent in gathering the information on the channel state. Another line of work (e.g., [14, 15]) attempts to exploit the memory in the Markov-modeled channels to gather this information. Specifically, Automatic Repeat reQuest (ARQ) feedback, which is traditionally used for error control (e.g., [16, 17, 18, 19]) at the data link layer, is used to estimate the state of the Markov-modeled channels.
These two lines of work can be combined to create a new design paradigm: exploit multiuser diversity in Markov-modeled channels (e.g., [8, 9, 10, 11, 12, 13]) and use the already existing ARQ feedback mechanism to estimate the state of these Markov-modeled channels (e.g., [14, 15]). Assuming instantaneous ARQ feedback (i.e., it arrives at the end of the slot) and an ON-OFF Markov channel model (the Gilbert-Elliott model [20]), this problem was addressed in independent works [21, 22]. In [21], the authors studied opportunistic spectrum access in a cognitive radio setting (a setup mathematically equivalent to instantaneous-ARQ-based opportunistic scheduling in a Markov-modeled downlink) and showed that a simple greedy scheduling policy is optimal. In [22], we directly addressed the instantaneous-ARQ-based downlink scheduling problem. By identifying a special mathematical structure in the problem, we derived a closed-form expression for the two-user sum capacity of the downlink and obtained bounds on the system stability region.
In this paper, we model the downlink channels by two-state (ON-OFF) Markov chains and study the ARQ-based joint channel learning-scheduling problem when the ARQ feedback arrives at the scheduler with a random delay that is i.i.d. across users and time. The delay in the feedback channel is an important consideration that cannot be overlooked in realistic scenarios. The effect of feedback delay on channel resource allocation has been studied under various settings in the past (e.g., [23, 24, 25, 26]). While these works assume deterministic delay, we consider random, i.i.d. feedback delay. The feedback delay can be i.i.d., for instance, when it is due to the propagation time of the feedback signal and the feedback channel environment changes drastically due to high user mobility. In essence, by modeling the feedback delay as random, we attempt to capture the effect of the non-idealities of the feedback channel on the joint channel learning-scheduling problem in a more general framework.
It turns out that, despite the random delay, the ARQ feedback can be used for opportunistic scheduling to achieve performance gains. A sample of this gain is illustrated in Fig. 1 for a specific set of system parameters to be defined in the next section. Fig. 1 plots the sum (over all the downlink users) rate of successful transmission of packets over a length of slots under optimal opportunistic scheduling when the scheduler has (a) randomly delayed channel state information (CSI) from all the downlink users, (b) randomly delayed CSI from the scheduled user, i.e., randomly delayed ARQ feedback, and (c) no CSI, i.e., random scheduling. We make two observations from the figure: (1) using delayed ARQ feedback for opportunistic scheduling can achieve performance close to opportunistic scheduling using delayed CSI from all users, and (2) a gain (when ) in the sum rate is associated with opportunistic scheduling using delayed ARQ over random scheduling. These observations motivate our approach: exploit multiuser diversity in Markov-modeled downlink channels using the already existing (albeit delayed) ARQ feedback mechanisms.
Compared to the instantaneous ARQ case, the randomly delayed ARQ case adds layers of complexity to the scheduling problem, making it different from and far more challenging than the former. However, we show that, when there are two users in the system, for any ARQ delay distribution, the greedy policy that was optimal in the instantaneous ARQ case [21] is also optimal in the delayed ARQ case. For more than two users, however, we show using a counterexample that the greedy policy is not, in general, optimal. Despite this suboptimality, extensive numerical experiments suggest that the greedy policy has near-optimal performance. Encouraged by this insight, we study the structure of the greedy policy and show that it can be implemented via a simple algorithm that is immune to errors in the estimates of the Markov channel parameters and the ARQ delay statistics. We also study the fundamental limits of the Markov-modeled downlink with randomly delayed ARQ feedback. By establishing an equivalence between the two-user downlink and a genie-aided system, we derive a simple closed-form expression for the sum capacity of the two-user downlink, while obtaining bounds on the sum capacity for a larger number of users. We further derive inner and outer bounds on the capacity region of the downlink and tighten these bounds for special cases of the system parameters.
The rest of the paper is organized as follows. The problem setup is described in Section II, followed by a study of the optimality properties of the greedy policy in Section III-A. Section III-B contains a numerical performance analysis of the greedy policy. In Section III-C, we discuss the implementation structure of the greedy policy. We then study the sum capacity and the capacity region of the Markov-modeled downlink in Section IV, followed by concluding remarks in Section V.
II Problem Setup
II-A Channel Model
We consider downlink transmissions with users. For each user, there is an associated queue at the base station that accumulates packets intended for that user. We assume that each queue is infinitely backlogged. The channel between the base station and each user is modeled by an i.i.d. two-state Markov chain. Each state corresponds to the degree of decodability of the data sent through the channel. State (ON) corresponds to full decodability, while state (OFF) corresponds to zero decodability. Time is slotted, and the channel of each user remains fixed for a slot and moves into another state in the next slot following the state transition probability of the Markov chain. The time slots of all users are synchronized. The two-state Markov channel is characterized by a probability transition matrix
(1) 
where
The states can be interpreted as a quantized representation of the underlying channel strength, which lies on a continuum. It is known from classic works [27, 28] that the fading channel can be modeled, with reasonable accuracy, by finite-state Markov chains and that, in reality, the fading process is observed to be gradual enough that the state transitions/crossovers can be restricted to adjacent states of the Markov model. With the top ‘half’ of the states in these models cumulatively represented by the ON state and the rest by the OFF state in our two-state model, we see that, in realistic scenarios, the crossover from the ON to the OFF state (respectively, OFF to ON) is less likely to occur than staying in the ON state (respectively, the OFF state). This is positive correlation, i.e., . Motivated by this, we restrict our attention to throughout this work.
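As an illustration of the channel model, the following minimal Python sketch simulates one ON-OFF Markov channel. The parameter names `p` (probability of being ON in the next slot given ON now) and `r` (probability of being ON in the next slot given OFF now) are labels assumed here for illustration, with positive correlation corresponding to `p > r`. The initial state is drawn from the steady-state distribution `r / (1 - p + r)` of such a two-state chain, which requires `p - r < 1`.

```python
import random


def simulate_channel(p, r, num_slots, seed=0):
    """Simulate a two-state (ON=1 / OFF=0) Markov channel.

    p: assumed P(ON in next slot | ON now)
    r: assumed P(ON in next slot | OFF now)
    Requires p - r < 1 so that the steady state is well defined.
    """
    rng = random.Random(seed)
    stationary_on = r / (1.0 - p + r)  # steady-state P(ON) of the chain
    state = 1 if rng.random() < stationary_on else 0
    states = []
    for _ in range(num_slots):
        states.append(state)
        prob_on = p if state == 1 else r
        state = 1 if rng.random() < prob_on else 0
    return states


def empirical_stats(states):
    """Return (fraction of ON slots, fraction of slots repeating the previous state)."""
    on_fraction = sum(states) / len(states)
    repeats = sum(a == b for a, b in zip(states, states[1:]))
    return on_fraction, repeats / (len(states) - 1)
```

For example, `empirical_stats(simulate_channel(0.9, 0.2, 200_000))` should give an ON fraction near 2/3 and a repeat fraction well above one half, which is the memory that the scheduler exploits.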
II-B Scheduling Problem
The base station (henceforth referred to as the scheduler) is the central controller that controls the transmission to the users in each slot. In any time slot, the scheduler does not know the exact channel state of the users, and it must schedule the transmission of the head-of-line packet of exactly one user. Thus, TDMA-style scheduling is performed here. The power spent in each transmission is fixed. At the beginning of a time slot, the head-of-line packet of the scheduled user is transmitted. The scheduled user attempts to decode the received packet and, based on the decodability of the packet, sends back an ACK(bit )/NACK(bit ) feedback signal to the scheduler at the end of the time slot, over an error-free feedback channel. The feedback channel is assumed to suffer from a random delay that is i.i.d. across users and time. This delayed feedback information, along with the label of the time slot from which it is acquired, will be used by the scheduler in scheduling decisions. The scheduler aims to maximize the sum of the rates of successful transmission of packets to all the users in the system. We formally define the problem below.
II-C Formal Problem Definition
Since the scheduler must make scheduling decisions based only on a partial observation of the underlying Markov chain (in this case, the set of time-stamped binary delayed feedback on the channels), the scheduling problem can be represented by a partially observable Markov decision process (POMDP). See [29] for an overview of POMDPs. We now formulate our problem in the language of POMDPs. The key quantities used throughout this paper are summarized in Appendix E.
Horizon: The number of consecutive slots over which scheduling is performed is the horizon. We index the time slots in decreasing order with slot corresponding to the end of the horizon. Throughout this paper, the horizon is denoted by , i.e., the scheduling process begins at slot .
Feedback arriving at slot : For some slot , , let be the number of ARQ feedback bits () arriving at the end of slot from the users scheduled in the previous slots. Due to the random nature of the feedback delay, can take values in the set . Let represent all the ARQ feedback arriving at the end of slot . Thus , if and , if . The ARQ feedback is time-stamped and thus, since the scheduler has a record of which users were scheduled in the past slots, it can map the feedback bits to the users and slots they originated from. Let be the feedback that originated during slot , where . Note that since in each slot one and only one user is scheduled, is neither empty nor has multiple values, i.e., with bit mapped to NACK and bit to ACK feedback.
Delay of feedback from user in slot : Let be the random variable corresponding to the delay, in number of slots, experienced by the feedback sent by user in slot . Let correspond to the case when the ARQ feedback originating from user in slot arrives at the scheduler at the end of the same slot . We assume the distribution of to be i.i.d across users and time throughout this work, and let , denote the probability mass function of .
Belief value of user in slot  : This represents the probability that the channel of user , in slot , is in the ON state, given all the past feedback about the channel. Define , for , as the step belief evolution operator given by with and for . Now suppose that, at the end of slot , the arriving feedback contains the ARQ feedback from user from slot , i.e., . If is the latest slot from which an ARQ feedback from user has arrived, then is obtained by applying the 1-step belief evolution operator repeatedly over all the time slots between ‘now’ (slot ) and slot , i.e.,
(2) 
where we have used . If is not the latest slot from which an ARQ feedback from user has arrived (possible since the random nature of the feedback delay can result in out-of-turn arrival of ARQ feedback), then, due to the first-order Markovian nature of the channels, this ARQ feedback does not have any new information to affect the belief value, and so . Similarly, if does not contain any feedback from user , then .
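The belief bookkeeping described above can be sketched in Python as follows. This is a minimal illustration under assumed notation, not the paper's implementation: `p` and `r` denote the assumed ON-to-ON and OFF-to-ON transition probabilities, the one-step operator is taken to be `x*p + (1-x)*r`, bit 1 stands for ACK and bit 0 for NACK, and, unlike the paper's decreasing slot index, slots are counted forward in time. The class name `BeliefTracker` and its method are hypothetical.

```python
def evolve(x, p, r, steps=1):
    """Apply the one-step belief update x -> x*p + (1-x)*r `steps` times."""
    for _ in range(steps):
        x = x * p + (1.0 - x) * r
    return x


def evolve_closed_form(x, p, r, steps):
    """Same result via the fixed point of the one-step map (requires p - r < 1)."""
    fixed_point = r / (1.0 - p + r)
    return fixed_point + (p - r) ** steps * (x - fixed_point)


class BeliefTracker:
    """Tracks P(ON) for one user from time-stamped ACK/NACK bits."""

    def __init__(self, p, r, initial_belief):
        self.p, self.r = p, r
        self.belief = initial_belief
        self.latest_slot = None  # originating slot of the freshest feedback seen

    def end_of_slot(self, now, arrivals):
        """Move the belief from slot `now` to slot `now + 1`.

        arrivals: list of (origin_slot, bit) pairs for this user that reached
        the scheduler at the end of slot `now`; bit 1 = ACK, bit 0 = NACK.
        """
        fresh = [a for a in arrivals
                 if self.latest_slot is None or a[0] > self.latest_slot]
        if fresh:
            # Only the freshest feedback matters for a first-order Markov chain.
            origin, bit = max(fresh, key=lambda a: a[0])
            self.latest_slot = origin
            self.belief = evolve(float(bit), self.p, self.r,
                                 steps=now + 1 - origin)
        else:
            # Stale or absent feedback: just age the current belief by one slot.
            self.belief = evolve(self.belief, self.p, self.r)
        return self.belief
```

Feedback that arrives out of turn (with an originating slot older than one already processed) is ignored, matching the observation that it carries no new information about the current state.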
Reward structure: In any slot , a reward of is accrued at the scheduler when the channel of the scheduled user is found to be in the ON state, else is accrued.
Scheduling Policy : A scheduling policy in slot is a mapping from all the information available at the scheduler in slot along with the slot index to a scheduling decision . Formally,
(3)  
where are the past scheduling decisions and are the belief values of the channels of all users, corresponding to slots , held by the scheduler at the moment (slot ).
Net expected reward in slot , : With the scheduling policy, , fixed, the net expected reward in slot , i.e., , is the sum of the reward expected in the current slot and the net reward expected in all the future slots . Formally, with denoting the scheduling decision in slot ,
where is the expected immediate reward and the expectation in the future reward is over the feedback received in slot , i.e., , along with the originating slot indices. Note that the belief vector is up to date with respect to all previous scheduling decisions and the ARQ feedback received before slot . With the reward structure defined earlier, the expected immediate reward can be written as
Performance Metric: For a given scheduling policy , the performance metric is given by the sum throughput (sum rate of successful transmission) over a finite horizon, :
(5) 
where is the initial belief values of the channels.
III Greedy Policy - Optimality, Performance Evaluation and the Implementation Structure
III-A On the Optimality of the Greedy Policy
Consider the following policy:
Since the policy given above attempts to maximize the expected immediate reward, without any regard to the expected future reward, it is fundamentally greedy in nature. We henceforth call the greedy policy and let denote the scheduling decision in slot under the greedy policy. We now proceed to establish the optimality of the greedy policy when . We first introduce the following lemma.
Lemma 1
For any and any with ,
(7) 
The results of Lemma 1 can be explained intuitively. Note that is the belief value of the channel (probability that the channel is in the ON state) in the current slot given the belief value, slots earlier, was . Also note that (similarly ) gives the belief value in the current slot given the channel was in the ON state (similarly OFF state) slots earlier. Now, since the Markov channel is positively correlated (), the probability that the channel is in the ON state in the current slot given it was in the ON state slots earlier () is at least as high as the probability that the channel is ON in the current slot given it was ON with probability , slots earlier (). This explains the first inequality in Lemma 1. The second and third inequalities can be explained along similar lines. Regarding the last inequality, consider slots such that . Due to the Markovian nature of the channel, the closer slot is to , the stronger is the memory, i.e., the dependency of the channel state in on that of . Now, since the channel is positively correlated, if the channel was in the ON state in slot , the closer is to , the higher is the probability that the channel is ON in slot . By definition, this probability is given by with . Thus monotonically decreases with . Using a similar explanation, monotonically increases with . The limiting value of both these functions, as , is the probability that the channel is ON when no information on the past channel states is available. This is given by the steady state probability (discussed in Section IV). This explains for any . A formal proof of Lemma 1 can be found in Appendix A.
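The orderings in the intuition above can be checked numerically with a short Python sketch. It is only a sanity check under assumed notation, not a substitute for the proof in Appendix A: `p` and `r` denote the assumed ON-to-ON and OFF-to-ON transition probabilities, the one-step belief update is taken to be `x*p + (1-x)*r`, and the steady-state ON probability is taken to be `r / (1 - p + r)`. The function `check_ordering` is a hypothetical helper.

```python
def evolve(x, p, r, steps=1):
    """Apply the one-step belief update x -> x*p + (1-x)*r `steps` times."""
    for _ in range(steps):
        x = x * p + (1.0 - x) * r
    return x


def check_ordering(p, r, max_steps=30, grid=11, tol=1e-12):
    """Return True if the orderings described in the text hold numerically."""
    steady = r / (1.0 - p + r)
    xs = [i / (grid - 1) for i in range(grid)]
    prev_on, prev_off = 1.0, 0.0
    for k in range(1, max_steps + 1):
        from_on = evolve(1.0, p, r, k)    # belief k slots after a known ON
        from_off = evolve(0.0, p, r, k)   # belief k slots after a known OFF
        # Any starting belief stays between the OFF and ON extremes.
        for x in xs:
            if not (from_off - tol <= evolve(x, p, r, k) <= from_on + tol):
                return False
        # The ON curve does not rise and the OFF curve does not fall with k.
        if from_on > prev_on + tol or from_off < prev_off - tol:
            return False
        # The steady-state probability lies between the two curves.
        if not (from_off - tol <= steady <= from_on + tol):
            return False
        prev_on, prev_off = from_on, from_off
    return True
```

With positive correlation (for instance `check_ordering(0.9, 0.2)`) the check passes, whereas swapping the two parameters to obtain negative correlation makes it fail, which is consistent with the restriction adopted in the channel model.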
Proposition 2
For , the sum throughput, , of the system is maximized by the greedy policy for any ARQ delay distribution.
Proof: Consider a slot . Fix a sequence of scheduling decisions . Recall the definition of , the feedback arriving at the end of slot , from Section II-C. Let denote the originating slots corresponding to feedback , i.e., if the feedback from users and , for , both arrive at slot , then and . Also define as the latest slot from which the ARQ feedback of user is available at the scheduler by (the beginning of) slot . Formally, if at least one ARQ feedback from user 1 has arrived at the scheduler by slot , then
If no ARQ feedback from user has arrived by slot , i.e., if a such that ‘’, then . Let , when , be a measure of ‘freshness’ of the latest feedback from user . Let when . Similarly define for user . With these definitions, the proof proceeds in two steps: In step , we show that the greedy decision in slot , given the ARQ feedback and the scheduling decision from slot , is independent of the feedback and scheduling decision corresponding to slot . In step , we show that, if the greedy policy is implemented in slot , then the expected immediate reward in slot is independent of the scheduling decisions . We then provide induction based arguments to establish the proposition.
Step 1: Let and . The greedy decision in slot , conditioned on the past feedback and scheduling decisions is given by
(9) 
The preceding equation comes directly from the first order Markovian property of the underlying channels. Consider the case when () or (). The belief values in slot as a function of feedback and is given below:
Using Lemma 1, the greedy decision can be written as
(11)  
Thus the greedy decision is independent of feedback if . We now proceed to generalize equation (11). Let denote the latest slot for which an ARQ feedback is available from one of the users by slot , i.e.,
(12) 
Let for and for be a measure of freshness of the latest ARQ feedback. Thus, using the preceding discussion, we have
where is the user not scheduled in slot . This completes step of the proof.
Step 2: If the greedy policy is implemented in slot , the immediate reward expected in slot , conditioned on scheduling decisions and initial belief can be rewritten as
(14)  
where is defined after (12). Note that
(15) 
since, with , i.e., no past feedback at the scheduler, the belief values at slot is independent of the past scheduling decisions and is simply given by . Now rewriting the second part of (14),
(16)  
Consider . From the first step of the proof, the greedy decision in slot can be made solely based on the latest feedback, i.e., . This was recorded in (IIIA). Thus, if the feedback is an ACK (occurs with probability ) reschedule the user in slot . Conditioned on , the belief value and hence the expected immediate reward in slot is given by . If the feedback is a NACK, schedule the other user denoted by . Conditioned on , the belief value and hence the expected immediate reward in slot is given by . Averaging over , we have
where is the state of the channel of user in slot . From (16),
We have used the following argument in the last equality: the event is controlled by the underlying Markov dynamics and is independent of the scheduling decisions . Likewise, this event is independent of the value of since we have assumed that the feedback channel and the forward channel are independent.
Recall is the random variable indicating the delay incurred by the ARQ feedback sent by user in slot . Let be the random variable corresponding to the quantity , the degree of freshness of the latest ARQ feedback, and be the probability mass function of . Therefore, for ,
where we have used the independence between the forward and the feedback channel to remove the condition on in the second equality. The last equality comes from the assumption that the ARQ delay is i.i.d. across users and time (note that here we do not require the ARQ delay to be identically distributed across time). Similarly
Applying the preceding equations in (14), we have
The expected reward in slot is thus independent of the sequence of actions if the greedy policy is implemented in slot . By extension, the total reward expected from slot until the horizon is independent of the scheduling vector if the greedy policy is implemented in slots , i.e.,
Thus, if the greedy policy is optimal in slots , then, it is also optimal in slot . Since is arbitrary and since the greedy policy is optimal at the horizon, by induction, the greedy policy is optimal in every slot . This establishes the proposition.
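The two-user rule used in the proof (act on the freshest feedback only: stay with the user that produced it after an ACK, switch to the other user after a NACK) can be exercised with a small Monte Carlo sketch in Python. Everything named here is an assumption made for illustration: `p` and `r` are the assumed ON-to-ON and OFF-to-ON probabilities, `delay_pmf` maps a delay in slots to its probability, bit 1 stands for ACK, slots are counted forward in time, and user 0 is scheduled until the first feedback arrives. Note that the scheduling rule itself never reads `p`, `r`, or the delay statistics; they only drive the simulated environment.

```python
import random


def step_channel(state, p, r, rng):
    """Advance one ON(1)/OFF(0) channel by one slot."""
    return 1 if rng.random() < (p if state == 1 else r) else 0


def simulate_two_users(p, r, delay_pmf, num_slots, policy="greedy", seed=0):
    """Return the fraction of slots with a successful transmission."""
    rng = random.Random(seed)
    delays, weights = zip(*sorted(delay_pmf.items()))
    stationary_on = r / (1.0 - p + r)
    channels = [1 if rng.random() < stationary_on else 0 for _ in range(2)]
    in_flight = []   # (arrival_slot, origin_slot, user, bit)
    latest = None    # (origin_slot, user, bit) of the freshest feedback received
    successes = 0
    for t in range(num_slots):
        if policy == "random":
            user = rng.randrange(2)
        elif latest is None:
            user = 0                      # no feedback yet: arbitrary choice
        else:
            _, fed_user, fed_bit = latest
            # ACK: reschedule the same user; NACK: switch to the other one.
            user = fed_user if fed_bit == 1 else 1 - fed_user
        bit = channels[user]              # ACK (1) if the channel is ON
        successes += bit
        delay = rng.choices(delays, weights=weights)[0]
        in_flight.append((t + delay, t, user, bit))
        # Collect feedback that reaches the scheduler at the end of slot t.
        arrived = [f for f in in_flight if f[0] <= t]
        in_flight = [f for f in in_flight if f[0] > t]
        for _, origin, fed_user, fed_bit in arrived:
            if latest is None or origin > latest[0]:
                latest = (origin, fed_user, fed_bit)
        channels = [step_channel(s, p, r, rng) for s in channels]
    return successes / num_slots
```

Comparing, for instance, `simulate_two_users(0.9, 0.1, {0: 0.5, 1: 0.3, 2: 0.2}, 20000)` with the same call using `policy="random"` gives a rough feel for the gain of feedback-driven scheduling over random scheduling on one sample path.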
Remarks: When the Markov channels are negatively correlated, i.e., (a case of limited practical significance), arguments similar to those in the preceding proof show that the greedy policy is optimal when , for any ARQ delay distribution. We record this below.
Corollary 3
When the Markov channels are negatively correlated, i.e., , and when , the sum throughput, , of the system is maximized by the greedy policy for any ARQ delay distribution.
A formal proof can be found in Appendix B.
Returning to the original positive correlation setup, the arguments in the proof of Proposition 2 hold true even when the ARQ delay is not identically distributed across time. Thus, the greedy policy is optimal for even when the ARQ delay distribution is time-varying. Also, since is arbitrary, the greedy policy maximizes the sum throughput over an infinite horizon. We record this below.
Corollary 4
For , the greedy policy is optimal when the performance metric is the sum throughput over an infinite horizon, i.e.,
(23) 
for any initial belief .
The optimality of the greedy policy does not extend to the case . We record this in the following proposition.
Proposition 5
The greedy policy is not, in general, optimal when there are more than two users in the downlink.
Proof outline: We establish the proposition using a counterexample with deterministic ARQ delay of , i.e., , and arbitrary values of and . We construct a variant of the greedy policy that schedules a non-greedy user in a specific time slot under a specific sample path of the past channel states observable by the scheduler. In the rest of the slots and under other realizations, the constructed policy performs greedy scheduling. We explicitly evaluate the difference in the rewards corresponding to the constructed policy and the greedy policy and show that there exist system parameters such that the constructed policy has a reward strictly larger than the greedy policy. Thus the greedy policy is, in general, not optimal when . A formal proof can be found in Appendix C.
Remarks: Note that, in contrast, it has been shown in [21] that the greedy policy is optimal for any number of users when the ARQ feedback is instantaneous, i.e., . To summarize, the optimality of the greedy policy vanishes (a) when the ARQ delay is increased from zero to higher values, with the number of users unconstrained, or (b) when the number of users is increased from two to higher values, with the ARQ delay being random and unconstrained.
These observations point to the volatile nature of the underlying dynamics of the scheduling problem, with respect to the greedy policy optimality.
It would be interesting to see how the optimality properties of the greedy policy extend to more general channel models. Considering multi-rate channels, i.e., when the number of states is greater than two, the special ‘toggle’ structure that led to the optimality of the greedy policy in the ON-OFF channel vanishes. In fact, we have shown [30] that, even when the number of states is increased by 1, the general greedy policy optimality vanishes and the optimality can be shown to hold only under very restrictive conditions on the Markov channel statistics. Now, consider the case when the two-state Markov channels are non-identical across users. In this setup, we can show that the greedy policy is not, in general, optimal, even when the ARQ feedback is instantaneous. We record this below.
Proposition 6
The greedy policy is not, in general, optimal when the Markov channels are not identical across users, even when and the ARQ feedback is instantaneous.
The proposition is established using counterexamples. The proof is available in Appendix D.
In summary, continuing our discussion before Proposition 6, the optimality of the greedy policy vanishes even under minimal deviations from the original setup. These observations further indicate the volatile nature of the underlying scheduling problem dynamics.
Returning to the original setup, numerical results suggest that the greedy policy, despite not being optimal in general, has near-optimal performance. We discuss this next.
III-B Performance Evaluation of the Greedy Policy
Delay  subopt  
3  
4  
3  
4 
N=10  Delay  

genie  genie  
5.3908  5.2912  1.8470  5.5279  5.2067  5.8109  
5.6547  5.4281  4.0072  5.9195  5.4119  8.5741  
5.7867  5.4987  4.9771  6.1152  5.5208  9.7203  
5.9187  5.5712  5.8703  6.3110  5.6353  10.7070  
N=20  Delay  
genie  genie  
8.8565  8.8254  0.3504  3.4487  3.4371  0.3368  
8.9715  8.9291  0.4723  3.5525  3.4661  2.4315  
9.0290  8.9820  0.5203  3.6043  3.4807  3.4300  
9.0865  9.0357  0.5593  3.6562  3.4955  4.3967 
N=10  Delay  

genie  genie  
2.0196  2.0162  0.1716  6.2768  6.2571  0.3131  
2.1261  2.0384  4.1241  6.4895  6.3813  1.6663  
2.2152  2.0577  7.1089  6.6375  6.4743  2.4587  
2.3018  2.0772  9.7568  6.7764  6.5677  3.0792  
N=20  Delay  
genie 