Stochastic Optimization for Markov Modulated Networks with Application to Delay Constrained Wireless Scheduling

Michael J. Neely, Sucha Supittayapornpong

This material was presented in part at the IEEE Conf. on Decision and Control (CDC), Shanghai, China, Dec. 2009. The authors are with the Electrical Engineering department at the University of Southern California, Los Angeles, CA. This material is supported in part by one or more of the following: the DARPA IT-MANET program grant W911NF-07-0028, the NSF Career grant CCF-0747525, NSF grant 0540420, the Network Science Collaborative Technology Alliance sponsored by the U.S. Army Research Laboratory W911NF-09-2-0053.

We consider a wireless system with a small number of delay constrained users and a larger number of users without delay constraints. We develop a scheduling algorithm that reacts to time varying channels and maximizes throughput utility (to within a desired proximity), stabilizes all queues, and satisfies the delay constraints. The problem is solved by reducing the constrained optimization to a set of weighted stochastic shortest path problems, which act as natural generalizations of max-weight policies to Markov decision networks. We also present approximation results for the corresponding shortest path problems, and discuss the additional complexity and delay incurred as compared to systems without delay constraints. The solution technique is general and applies to other constrained stochastic decision problems.


Constrained Markov Decision Processes, Queueing Systems, Dynamic Scheduling

I Introduction

This paper considers delay-aware scheduling in a multi-user wireless uplink or downlink with delay-constrained users and delay-unconstrained users, each with different transmission channels. The system operates in slotted time with normalized slots . Every slot, a random number of new packets arrive from each user. Packets are queued for eventual transmission, and every slot a scheduler looks at the queue backlog and the current channel states and chooses one channel to serve. The number of packets transmitted over that channel depends on its current channel state. The goal is to stabilize all queues, satisfy average delay constraints for the delay-constrained users, and drop as few packets as possible.

Without the delay constraints, this problem is a classical opportunistic scheduling problem, and can be solved with efficient max-weight algorithms based on Lyapunov drift and Lyapunov optimization (see [1] and references therein). The delay constraints make the problem a much more complex Markov Decision Problem (MDP). While general methods for solving MDPs exist (see, for example, [2][3][4]), they typically suffer from a curse of dimensionality. Specifically, the number of queue state vectors grows exponentially in the number of queues. Thus, a general problem with many queues has an intractably large state space. This creates non-polynomial implementation complexity for offline approaches such as linear programming [2][3], and non-polynomial complexity and/or learning time for online or quasi online/offline approaches such as Q-learning [5][6].

We do not solve this fundamental curse of dimensionality. Rather, we avoid this difficulty by focusing on the special structure that arises in a wireless network with a relatively small number of delay-constrained users, but with an arbitrarily large number of users without delay constraints. This is an important scenario, particularly in cases when the number of “best effort” users in a network is much larger than the number of delay-constrained users. We develop a solution that, on each slot, requires a computation whose complexity depends exponentially on the number of delay-constrained users, but only polynomially on the number of users without delay constraints. Further, the resulting convergence times and delays are fully polynomial in the total number of queues. Our solution uses a concept of forced renewals that introduces a deviation from optimality that can be made arbitrarily small with a corresponding polynomial tradeoff in convergence time. Finally, we show that a simple Robbins-Monro iteration can be used to approximate the required computations when channel and traffic statistics are unknown. Our methods are general and can be applied to other MDPs for networks with similar structure.

Related prior work on delay optimality for multi-user opportunistic scheduling under special symmetric assumptions is developed in [7][8][9], and single-queue delay optimization problems are treated in [10][11][12][13] using dynamic programming and Markov Decision theory. Approximate dynamic programming algorithms are applied to multi-queue switches in [14] and shown to perform well in simulation. Optimal asymptotic energy-delay tradeoffs are developed for single queue systems in [15], and optimal energy-delay and utility-delay tradeoffs for multi-queue systems are treated in [16][17]. The algorithms of [16][17] have very low complexity and provably converge quickly even for large networks, although the tradeoff-optimal delay guarantees they achieve do not necessarily optimize the coefficient multiplier in the delay expression.

Our approach in the present paper treats the MDP problem associated with delay constraints using Lyapunov drift and Lyapunov optimization theory [1]. This theory has been used to stabilize queueing networks [7] and provide utility optimization [18][19][20][21][1] via simple max-weight principles. We extend the max-weight principles to treat networks with Markov decisions, where the network costs depend on both the control actions taken and the current state (such as the queue state) the system is in. For each cost constraint we define a virtual queue, and show that the constrained MDP can be solved using Lyapunov drift theory implemented over a variable-length frame, where “max-weight” rules are replaced with weighted stochastic shortest path problems. This is similar to the Lagrange multiplier approaches used in the related works [12][13] that treat power minimization for single-queue wireless links with an average delay constraint. The work in [12] uses stochastic approximation with a 2-timescale argument and a limiting ordinary differential equation. The work in [13] treats a single-queue MIMO system using primal-dual updates [22]. Our virtual queues are similar to the Lagrange Multiplier updates in [12][13]. However, we treat multi-queue systems, and we use a different analytical approach that emphasizes stochastic shortest paths over variable length frames. Because of this, our approach can be used in conjunction with a variety of existing techniques for solving shortest path problems (see, for example, [5]). We use a Robbins-Monro technique that is adapted to this context, together with a delayed queue analysis to uncorrelate past samples from current queue states. Our resulting algorithm has an implementation complexity that grows exponentially in the number of delay-constrained queues , but polynomially in the number of delay-unconstrained queues . Further, we obtain polynomial bounds on convergence times and delays.

The next section describes the network model. Section III presents the weighted stochastic shortest-path algorithm. Section IV describes approximate implementations, and Section V presents a simple simulation example.

II Network Model

Consider a wireless queueing network that operates in discrete time with timeslots . The network has delay-constrained queues and stability-constrained queues, for a total of queues indexed by sets and . The queues store fixed-length packets for transmission over their wireless channels. Every timeslot, new packets randomly arrive to each queue, and we let represent the random packet arrival vector. The stability-constrained queues have an infinite buffer space. The delay-constrained queues have a finite buffer space that can store packets (for some positive integer ). The network channels can vary from slot to slot, and we let be the channel state vector on slot , representing conditions that affect transmission rates. We assume the stacked vector is independent and identically distributed (i.i.d.) over slots, with possibly correlated entries on the same slot.

Every slot , the network controller observes the channel states and chooses a transmission rate vector , being a vector of non-negative integers. The choice of is constrained to a set that depends on the current . A simple example is a system with ON/OFF channels where the controller can transmit a single packet over at most one ON channel per slot, as in [7]. In this example, is a binary vector of channel states, and restricts to be a binary vector with at most one non-zero entry and with whenever . We assume that for each possible channel state vector , the set has the property that for any , the vector is also in , where is formed from by setting one or more entries to . In addition to constraining to take values in every slot , we shall soon also restrict the values for the delay-constrained queues to be at most the current number of packets in queue . This is a natural restriction, although we do not place such a restriction on the stability-constrained queues . This is a technical detail that will be important later, when we show that the effective dimension of the resulting Markov decision problem is , independent of the number of stability-constrained queues .

Let represent the vector of current queue backlogs, and define . The queue dynamics for the stability-constrained queues are as follows (footnote 1: for simplicity of exposition later, we have allowed the stability-constrained queues to serve newly arriving data; this can be modified easily by introducing a delay by one slot, so that the “new arrivals” to the stability-constrained queues actually arrived one slot ago):


where the operation allows, in principle, a service variable to be independent of whether or not is empty.
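As an illustration, the one-slot update for a stability-constrained queue can be sketched in Python as follows. The sketch assumes the update has the form next backlog = max(backlog + arrivals - service, 0), which is consistent with the remark above that newly arriving data may be served. The function and argument names are hypothetical and do not appear in the original text.

```python
def stability_queue_update(backlog: int, arrivals: int, service: int) -> int:
    """One-slot update for a stability-constrained queue.

    Assumed form: Q(t+1) = max(Q(t) + A(t) - mu(t), 0).
    The service value may exceed the number of packets present,
    in which case the queue simply empties.
    """
    return max(backlog + arrivals - service, 0)
```

Because of the truncation at zero, the service decision can be made without checking whether the queue is empty, which is the property the text relies on later.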

The delay-constrained queues have a different queue dynamic. Because of the finite buffer, we must allow packet dropping. Let be the number of dropped packets on slot . The queue dynamics for the delay-constrained queues are given by:


Note that this does not have any operation, because we will force the and decisions to be such that we never serve or drop packets that we do not have. The precise constraints on these decision variables are given after the introduction of a forced renewal event, defined in the next subsection.

II-A Forced Renewals

To force the delay-constrained queues to repeatedly visit a renewal state of being simultaneously empty, at the end of every slot, with probability we independently drop all unserved packets in all delay constrained queues . The stability-constrained queues do not experience such forced drops. Specifically, let be an i.i.d. Bernoulli process that is with probability every slot , and otherwise. Assume is independent of . If , we say slot experiences a forced renewal event. The decision options for and for are then additionally constrained as follows: If , then:

so that during normal operation, we can serve at most packets from queue (so new arrivals cannot be served), and we can drop only new arrivals, necessarily dropping any new arrivals that would exceed the finite buffer capacity. However, if we have:

so that is constrained as before, but is then equal to the remaining packets (if any) at the end of the slot.
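The decision constraints described above for a delay-constrained queue can be sketched as a one-slot update in Python. This is only one reading of the prose: the function name, its arguments, and the exact form of the drop bounds are assumptions introduced for illustration, since the displayed constraints are not reproduced here.

```python
def delay_queue_update(backlog, arrivals, service, drops, buffer_size, renewal):
    """Return (next_backlog, drops_applied) for one delay-constrained queue.

    Assumed reading of the constraints described in the text:
      * service may not exceed the backlog held at the start of the slot,
        so new arrivals cannot be served in the slot they arrive;
      * without a forced renewal, only new arrivals may be dropped, and any
        arrivals that would overflow the finite buffer must be dropped;
      * with a forced renewal, everything left at the end of the slot is
        dropped and the `drops` argument is ignored.
    """
    if not 0 <= backlog <= buffer_size:
        raise ValueError("backlog must lie within the finite buffer")
    if not 0 <= service <= backlog:
        raise ValueError("cannot serve more packets than are queued")
    remaining = backlog - service + arrivals
    if renewal:
        # Forced renewal: the queue is emptied at the end of the slot.
        return 0, remaining
    overflow = max(remaining - buffer_size, 0)
    if not overflow <= drops <= arrivals:
        raise ValueError("drops must cover the overflow and come from new arrivals")
    return remaining - drops, drops
```

Under these constraints no truncation at zero is needed, since the update never serves or drops packets that are not present.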

We shall optimize the system under the assumption that the forced renewal process is uncontrollable. This provides an analyzable system that lends itself to simple approximations, as shown in later parts of the paper. While these forced renewals create inefficiency in the system, the rate of dropped packets due to forced renewals is at most , which assumes the worst case of dropping a full buffer plus all new arrivals every renewal event. This value can be made arbitrarily small with a small choice of . For problems such as minimizing the average drop rate subject to delay constraints in the delay-constrained queues and stability in the stability-constrained queues, it can be shown that this term bounds the gap between system optimality without forced renewals and system optimality with forced renewals. Formally, this can be shown by a simple sample path argument: A system optimized without forced renewals has a performance that is no better than a system with forced renewals, but where all “drops” from forced renewals are counted as delivered throughput, and where all other decisions mimic those of the prior system. We omit a formal argument for brevity. In Theorem 1 we show that the disadvantage of using a small value of is that our average queue bounds for the stability-constrained queues are .

Define a renewal frame as the sequence of slots starting just after a renewal event and ending at the next renewal event. Assume that , so that time starts the first renewal frame. Define , and let and for represent the sequence that marks the beginning of each renewal frame. For , define as the duration of the th renewal frame. Note that are i.i.d. geometric random variables with .
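A short simulation can make the frame structure concrete. The sketch below draws renewal frame durations by flipping an independent coin every slot, and is only an illustration: the name `phi` for the forced renewal probability and the helper name are assumptions. Since each slot ends a frame with probability `phi`, the durations are geometric on {1, 2, ...} with mean 1/phi.

```python
import random


def sample_frame_lengths(phi, num_frames, seed=0):
    """Draw renewal frame durations for forced renewal probability `phi`.

    Each slot independently ends the frame with probability `phi`, so a
    frame lasts at least one slot and its duration is geometric.
    """
    rng = random.Random(seed)  # local generator, so results are repeatable
    lengths = []
    for _ in range(num_frames):
        slots = 1
        while rng.random() >= phi:  # no renewal on this slot, frame continues
            slots += 1
        lengths.append(slots)
    return lengths
```

For example, with `phi = 0.2` the empirical mean of a large sample should be close to 5 slots per frame.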

II-B Markov Decision Notation

Define as the observed arrivals and channels of the network on slot , and define the random network event . Then is i.i.d. over slots. We can summarize the control decision constraints of the previous section with the following simple notation: Let be the -dimensional state space for the delay-constrained queues, and let represent the current state of these queues. Every slot , the controller observes the random event and the queue state , and makes a control action , which determines all decision variables for and for , chosen in a set that depends on and . Note that, indeed, all of our decision variables as described in the previous subsection are constrained only in terms of and , and in particular the queue states for do not constrain our decisions.

Recall that . The , , together affect the vector through a deterministic function :


Further, , , together define the transition probabilities from to , defined for all states and in :


From the equation (2) we find that , so that next states are deterministic given , , . Finally, we define a general penalty vector , for some integer , where penalties are deterministic functions of , , :


For example, penalty can be defined as the total number of dropped packets on slot by defining , which is indeed a function of , , .

We assume throughout that all of the above deterministic functions are bounded, so that there is a finite constant such that for all , all , and all slots we have:


II-C The Optimization Problems

A control policy is a method for choosing actions over slots . We restrict to causal policies that make decisions with knowledge of the past but without knowledge of the future. Suppose a particular control policy is given. Define time averages and for and by:

Our goal is to design a control policy to solve the following stochastic optimization problem:

Minimize: (7)
Subject to: (8)

That is, we desire to minimize the time average of the penalty, subject to time average constraints on the other penalties, and subject to queue stability (called strong stability) for all stability-constrained queues. The general structure (7)-(10) fits a variety of network optimization problems. For example, if we define as the sum packet drops , define , and define for all (for some positive constant ), then the problem (7)-(10) seeks to minimize the total packet drop rate, subject to an average backlog of at most in all delay-constrained queues , and subject to stability of all stability-constrained queues .

Alternatively, to enforce an average delay constraint at all queues (for some positive number ), we can define penalties:

Note that the time average of is the number , the average arrival rate of (non-dropped) packets to queue . Hence, the constraint is equivalent to:

However, by Little’s theorem [23] we have , where is the average delay for queue , and so the constraint ensures (assuming ).
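The conversion from a delay constraint to a time average penalty constraint can be checked numerically. The sketch below assumes the per-slot penalty takes the form backlog minus (admitted arrivals times the delay target), which matches the description above but is not copied from the original displayed equation. All names are hypothetical.

```python
def delay_penalty(backlog, arrivals, drops, delay_target):
    """Per-slot penalty, assumed form: Q(t) - (A(t) - D(t)) * delay_target."""
    return backlog - (arrivals - drops) * delay_target


def little_delay_estimate(backlogs, arrivals, drops):
    """Average delay via Little's theorem: average backlog / admitted rate."""
    admitted = sum(a - d for a, d in zip(arrivals, drops))
    return sum(backlogs) / admitted


def average_penalty(backlogs, arrivals, drops, delay_target):
    """Time average of the per-slot penalty over a finite trace."""
    total = sum(
        delay_penalty(q, a, d, delay_target)
        for q, a, d in zip(backlogs, arrivals, drops)
    )
    return total / len(backlogs)
```

On any finite trace with a positive admitted rate, the average penalty is non-positive exactly when the Little's theorem delay estimate is at most the target, which is the equivalence used in the text.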

In the following, we develop a dynamic algorithm that can come arbitrarily close to solving the problem (7)-(10). Our solution is general and applies to any other discrete time Markov decision problem on a general finite state space , random events (for forced renewal process ), control actions in a general set , queue equations (1) with given in the form (3), transition probabilities in the form (4), and penalties in the form (5).

II-D Slackness Assumptions

Suppose the problem (7)-(10) is feasible, so that there exists a policy that satisfies the constraints. It can be shown that the constraint implies that [24], and so the following modified problem is feasible whenever the original one is:

Minimize: (11)
Subject to: (12)

Define as the infimum of for the problem (11)-(14), necessarily being less than or equal to the corresponding infimum of the original problem (7)-(10) (footnote 2: recall that is defined assuming forced renewals of probability ; thus, is within a gap of of the minimum cost without such forced renewals). We show in Theorem 1 that, under a suitable slackness condition, the value of can be approached arbitrarily closely while maintaining for all queues . Thus, under that slackness condition, is also the infimum of for the original problem (7)-(10).

The problem (11)-(14) is a constrained Markov decision problem (MDP) with state . Under mild assumptions (such as this state space being finite, and the action space being finite for each ) the MDP has an optimal stationary policy that chooses actions every slot as a stationary and possibly randomized function of the state only. We call such policies -only policies. Because this system experiences regular renewals, the performance of any -only policy can be characterized by ratios of expectations over one renewal frame. Thus, we make the following assumption.

Assumption 1: There is an -only policy that satisfies the following over any renewal frame:


where is the size of the renewal frame, with , and , are values under the policy on slot of the renewal frame.

We emphasize that Assumption 1 is mild and holds whenever the problem (11)-(14) is feasible and has an optimal stationary policy (i.e., an optimal -only policy). We now make the following stronger assumption that there exists an -only policy that can meet the constraints (16)-(17) with “-slackness,” without caring what average value of this policy generates. This assumption is related to standard “Slater-type” assumptions in optimization theory [22].

Assumption 2: There is a value and an -only policy (typically different from policy in Assumption 1) that satisfies the following over any renewal frame:


We show in Theorem 1 that systems that satisfy Assumption 2 with larger values of can operate with smaller average queue sizes in the stability-constrained queues.

III The Dynamic Control Algorithm

To solve the problem (7)-(10), we extend the framework of [1] to a case of variable length frames. Specifically, for each of the penalty constraints , we define a virtual queue that is initialized to zero and that has dynamic update equation:


where is the th penalty incurred on slot by a particular action . The intuition is that if the virtual queue is stable, then the time average of must be non-positive. This turns the time average constraint into a simple queue stability problem.
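The intuition can be illustrated with a small sketch. It assumes the virtual queue update has the standard form next value = max(current value + penalty, 0) with an initial value of zero. The function name is hypothetical.

```python
def run_virtual_queue(penalties):
    """Return the virtual queue trajectory for a sequence of penalties.

    Assumed update: Z(t+1) = max(Z(t) + x(t), 0), with Z(0) = 0.
    """
    z = 0.0
    history = [z]
    for x in penalties:
        z = max(z + x, 0.0)
        history.append(z)
    return history
```

Since the truncation at zero can only increase the queue, the value after t slots is at least the sum of the first t penalties. Dividing by t shows that the time average penalty is at most Z(t)/t, so a queue that grows sublinearly forces the time average to be non-positive in the limit.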

III-A Lyapunov Drift

Define as a vector of all virtual queues for . Define as the combined vector of all virtual queues and all stability-constrained queues:

Assume all queues are initially empty, so that . Define the following quadratic function:

Let be the start of a renewal frame, with duration . Define the frame-based conditional Lyapunov drift as follows:


Note that is a function of the initial state and the policy implemented during the frame, where expectations are with respect to the random events that can take place and the possibly random control actions made. The explicit conditioning on in (21) will be suppressed in the remainder of this paper, as this conditioning is implied given that starts a renewal frame.

It is important to note the following subtlety: The implemented policy may not be stationary and/or may depend on the queue values (which can be different on each renewal interval), and so actual system events are not necessarily i.i.d. over different renewal frames. However, these frames are useful because we will analytically compare the Lyapunov drift of the actual implemented policy over a frame to the corresponding drifts of the -only policies of Assumptions 1 and 2.

Lemma 1

(Lyapunov Drift) Under any network control policy that chooses for all slots during a renewal frame , and for any initial queue values , we have:


where is defined:


and where is a finite constant defined:

where we recall is the bound in (6).


For any and any we have by squaring (20):

where the final inequality holds because the change in on any slot is at most , as is the magnitude of . Summing the above over and dividing by yields:

where the last step uses the identity:

Similarly, it can be shown for any :


Summing the preceding virtual queue bound and (26) over , , taking conditional expectations, and noting that the second moment of a geometric random variable with success probability is given by proves the result.
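The second moment used in the last step can be checked numerically. For a geometric random variable on {1, 2, ...} with success probability p, the variance is (1 - p)/p^2 and the mean is 1/p, so the second moment is (2 - p)/p^2. The sketch below compares this closed form against a truncated series. The function names are hypothetical.

```python
def geometric_second_moment_series(p, terms=5000):
    """Approximate E[T^2] for T geometric on {1, 2, ...} by a truncated sum."""
    return sum(k * k * p * (1.0 - p) ** (k - 1) for k in range(1, terms + 1))


def geometric_second_moment_closed_form(p):
    """Closed form E[T^2] = Var(T) + E[T]^2 = (1 - p)/p^2 + 1/p^2."""
    return (2.0 - p) / (p * p)
```

For p = 0.25 the closed form gives 28, and the truncated series agrees to within numerical precision.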

III-B The Frame-Based Drift-Plus-Penalty Algorithm

Let be a non-negative parameter that we use to affect proximity to the optimal solution. Our dynamic algorithm initializes all virtual and actual queue states to 0, and designates as the start of the first renewal frame. Then:

  • For each frame , observe the vector of virtual and actual queues and implement a policy over the course of the frame to minimize the following “drift-plus-penalty” expression:

  • During the course of the frame, update virtual and actual queues every slot by (1) and (20), and update state by (4). At the end of the frame, go back to the preceding step.

The decision rule (27) generalizes the drift-plus-penalty rule in [1][25] to a variable frame system. The problem of designing a policy to minimize (27) is a weighted stochastic shortest path problem, where weights are virtual and actual queue backlogs at the start of the frame. Finding such a policy is non-trivial, and often can only be done in an approximate context. In the next sub-section, we present the performance of the algorithm, under the assumption that we have an algorithm to approximate (27). In Section IV we consider various such approximation methods.
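To indicate what solving such a weighted stochastic shortest path problem involves, the following is a minimal value iteration sketch for a small finite problem. It is not the approximation method of Section IV. It assumes that the random event and the renewal indicator are observed before acting, that the frame ends after a renewal slot, and that the queue weights have already been folded into a per-slot `cost` function. All names (`value_iteration`, `phi`, `cost`, `next_state`) are hypothetical.

```python
def value_iteration(states, events, actions, cost, next_state, phi,
                    tol=1e-9, max_iter=100000):
    """Expected cost-to-renewal J(z) for each state z of a finite problem.

    events:      dict mapping each random event w to its probability
    actions:     actions(z, w) -> iterable of feasible actions
    cost:        cost(z, w, a) -> weighted one-slot cost
    next_state:  next_state(z, w, a) -> next state (assumed deterministic)
    phi:         per-slot forced renewal probability (0 < phi <= 1)
    """
    J = {z: 0.0 for z in states}
    for _ in range(max_iter):
        J_new = {}
        for z in states:
            total = 0.0
            for w, prob in events.items():
                feasible = list(actions(z, w))
                # Renewal slot: pay the cheapest cost, then the frame ends.
                stop = min(cost(z, w, a) for a in feasible)
                # Normal slot: pay the cost and continue from the next state.
                go = min(cost(z, w, a) + J[next_state(z, w, a)]
                         for a in feasible)
                total += prob * (phi * stop + (1.0 - phi) * go)
            J_new[z] = total
        gap = max(abs(J_new[z] - J[z]) for z in states)
        J = J_new
        if gap < tol:
            break
    return J
```

The update is a contraction with modulus 1 - phi, so the iteration converges geometrically. As a sanity check, a single state with unit cost every slot yields an expected cost of 1/phi, the mean frame length.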

III-C Performance Theorem

For constants , , define a -approximation of (27) to be a policy for choosing over a frame (consisting of slots ) that yields a total drift-plus-penalty that is less than or equal to that of any other policy, plus an error term parameterized by and :


where and represent (23) and (5), respectively, under any alternative algorithm that can be implemented during the slots of the frame. Note that an exact minimization of the stochastic shortest path problem (27) is a -approximation for .

Theorem 1

Suppose Assumptions 1 and 2 hold for a given . Fix , , , and suppose we use a -approximation every frame. If , then all virtual and actual queues are strongly stable, and so all desired constraints (8)-(10) are satisfied. In particular, for all positive integers , the average queue sizes satisfy:


Further, the time average penalty satisfies:


Suppose our implementation of the stochastic shortest path problem every frame is accurate enough to ensure . Then from (30) and (29), the time average of can be made arbitrarily close to (or below) as is increased, with a tradeoff in average queue size that is linear in . The dependence on the parameter is also apparent: While we desire to be small to minimize the disruptions due to forced renewals, a small value of implies a larger value of in (30) and (29). Note also that the average size of each stability-constrained queue affects its average delay, and the average size of each virtual queue affects the convergence time required for its constraint to be closely met.

III-D Proof of Theorem 1

We first prove (29), and then (30).


(Theorem 1 part 1—Queue Bounds) Let be the start of a renewal time. From (28) and (22) we have:


where and are for any alternative policy . Using the fact that for all , and , we have:


Now consider the -only policy from Assumption 2, which makes decisions independent of to yield (using the definition of in (23)):

Substituting the above into the right-hand-side of (32) gives:


Taking expectations of the above and using the definition of gives:

Summing the above over (for some positive integer ), dividing by , and using the fact that gives:

Rearranging terms and using and proves (29). While (29) samples only at the start of renewal frames, it can easily be used to show all queues are strongly stable (recall that the maximum queue change over any slot is bounded, and frame sizes are geometrically distributed with average ). Hence, by stability theory in [24] we know all desired inequality constraints are met.


(Theorem 1 part 2 — Performance Bound) Define probability . This is a valid probability because by assumption. We consider a new policy implemented over the frame . The policy is a randomized mixture of the -only policies from Assumptions 1 and 2: At the start of the frame, independently flip a biased coin with probabilities and , and carry out one of the two following policies for the full duration of the renewal interval:

  • With probability : Use policy from Assumption 2 for the duration of the renewal frame, which yields (18)-(19).

  • With probability : Use policy from Assumption 1 for the duration of the renewal frame, which yields (15)-(17).

Note that this policy is independent of . With , from (15) we have:


We also have from (16)-(17) and (18)-(19):


Plugging (34)-(35) into (31) yields:

Taking expectations gives:

Summing over and dividing by gives the following for all :

Using shows the right-hand-side of the above inequality is the same as the right-hand-side of the desired inequality (30). Finally, because for all , and are i.i.d. geometric random variables with mean , it can be shown that (see Appendix A):

IV Approximating the Stochastic Shortest Path Problem

Consider now the stochastic shortest path problem (27). Here we describe several approximation options. For simplicity, assume the state space is finite, and the action space is finite for all . Without loss of generality, assume we start at time and have (possibly non-zero) backlogs . Let be the renewal interval size. For every step , define as the incurred cost assuming that the queue state at the beginning of the renewal is :


Let denote the optimal control action on slot for solving the stochastic shortest path problem, given that the controller first observes and . Define , where we have added a new state “” to represent the renewal state, which is the termination state of the stochastic shortest path problem. Appropriately adjust the transition probabilities to account for this new state [26][5]. Define as a vector of optimal costs, where is the minimum expected sum cost to the renewal state given that we start in state , and . By basic dynamic programming theory [26][5], the optimal control action on each slot (given and ) is: