# Delay-Optimal Probabilistic Scheduling with Arbitrary Arrival and Adaptive Transmission

Xiang Chen,  Wei Chen,  Joohyun Lee,  and Ness B. Shroff,  X. Chen and W. Chen are with the Department of Electronic Engineering and Tsinghua National Laboratory for Information Science and Technology (TNList), Tsinghua University. E-mail: chen-xiang12@mails.tsinghua.edu.cn, wchen@tsinghua.edu.cn. J. Lee is with the Department of ECE at The Ohio State University. E-mail: lee.7119@osu.edu. Ness B. Shroff holds a joint appointment in both the Department of ECE and the Department of CSE at The Ohio State University. E-mail: shroff@ece.osu.edu.
###### Abstract

In this paper, we aim to obtain the optimal delay-power tradeoff and the corresponding optimal scheduling policy for arbitrary i.i.d. arrival process and adaptive transmissions. The number of backlogged packets at the transmitter is known to a scheduler, who has to determine how many backlogged packets to transmit during each time slot. The power consumption is assumed to be convex in transmission rates. Hence, if the scheduler transmits faster, the delay will be reduced but with higher power consumption. To obtain the optimal delay-power tradeoff and the corresponding optimal policy, we model the problem as a Constrained Markov Decision Process (CMDP), where we minimize the average delay given an average power constraint. By steady-state analysis and Lagrangian relaxation, we can show that the optimal tradeoff curve is decreasing, convex, and piecewise linear, and the optimal policy is threshold-based. Based on the revealed properties of the optimal policy, we develop an algorithm to efficiently obtain the optimal tradeoff curve and the optimal policy. The complexity of our proposed algorithm is much lower than a general algorithm based on Linear Programming. We validate the derived results and the proposed algorithm through Linear Programming and simulations.

Cross-layer design, Queueing, Scheduling, Markov Decision Process, Energy efficiency, Average delay, Delay-power tradeoff, Linear programming.

## I Introduction

In this paper, we study an important problem of how to schedule the number of packets to transmit over a link taking into account both the delay and the power cost. This is an important problem because delay is a vital metric for many emerging applications (e.g., instant messenger, social network service, streaming media, and so on), and power consumption is critical to battery life of various mobile devices. In other words, we are studying the tradeoff between the timeliness and greenness of the communication service.

Such a delay-power scheduling problem can be formulated using a Markov Decision Process (MDP). The authors in [1] were among the earliest to study this type of scheduling problem. Specifically, they considered a two-state channel and a finite time horizon. The dual problem was solved based on results derived by Dynamic Programming and induction. Follow-up papers [2, 3, 4, 5] extended this study in various directions. The optimal delay-power tradeoff curve is proven to be nonincreasing and convex in [2]. The existence of a stationary optimal policy and the structure of the optimal policy are further investigated in [3]. Different types of power/rate control policies are studied in [4]. In [5], the asymptotic small-delay regime is investigated. In [6], a piecewise linear delay-power tradeoff curve was obtained along with an approximate closed-form expression.

If one can show monotonicity or a threshold-type structure of the optimal policy for MDPs, it helps to substantially reduce the computational complexity of finding the optimal policy. Indeed, the optimal scheduling policies are shown to be threshold-based or monotone in [1, 3, 5, 7, 8, 9, 10], proven by studying the convexity, superadditivity / subadditivity, or supermodularity / submodularity of expected cost functions by induction using dynamic programming. However, most of these results are limited to the unconstrained Lagrangian relaxation problem. In [3, 10], some properties of the optimal policy for the constrained problem are described based on the results for the unconstrained problem. Detailed analysis of the optimal policy for the constrained problem is conducted in [8, 9]. In [8], properties such as unichain policies and multimodularity of costs are assumed to hold so that monotone optimal policies can be proven. In [9], the transmission action is either 1 or 0, i.e., to transmit or not. In order to obtain the detailed structure of the solution to the constrained problem, we believe that the analysis of the Lagrangian relaxation problem and the analysis of the structure of the delay-power tradeoff curve should be combined.

In [11], we studied the optimal delay-power tradeoff problem. In particular, we minimized the average delay given an average power constraint, considering Bernoulli arrivals and adaptive transmissions. Some technical details are given in [12], where we proved, via a Constrained Markov Decision Process formulation and steady-state analysis, that the optimal tradeoff curve is convex and piecewise linear, and the optimal policies are threshold-based. In this paper, we substantially generalize the Bernoulli arrival process to an arbitrary i.i.d. distribution. We show that the optimal policies for this generalized model are still threshold-based. Furthermore, we develop an efficient algorithm to find the optimal policy and the optimal delay-power tradeoff curve.

The remainder of this paper is organized as follows. The system model and the constrained problem are introduced in Section II. We show that the optimal policy is threshold-based in Section III by using steady-state analysis and Lagrangian relaxation. Based on theoretical results, we propose an efficient algorithm in Section IV to obtain the optimal tradeoff curve and the corresponding policies. In Section V, theoretical results and the proposed algorithm are verified by simulations. Section VI concludes the paper.

## II System Model

The system model is shown in Fig. 1. We assume there are $a[n]$ data packet(s) arriving at the end of the $n$th timeslot. The number $a[n]$ is i.i.d. for different values of $n$ and its distribution is given by $\alpha_k=\Pr\{a[n]=k\}$, where $0\le k\le A$, $\alpha_k\ge 0$, and $\sum_{k=0}^{A}\alpha_k=1$. Therefore the expected number of packets arriving in each timeslot is given by $E_a=\sum_{k=0}^{A}k\alpha_k$.

Let $s[n]$ denote the number of data packets transmitted in timeslot $n$. Assume that at most $S$ packets can be transmitted in each timeslot because of the constraints of the transmitter, i.e., $0\le s[n]\le S$. Let $p[n]$ denote the transmission power consumed in timeslot $n$. Assume transmitting $s$ packet(s) will cost power $P_s$, where $0\le s\le S$. Transmitting $0$ packets will cost no power, hence $P_0=0$. In typical communications, the power efficiency decreases as the transmission rate increases, hence we assume that $P_s$ is convex in $s$. Detailed explanations can be found in the Introduction section of [12]. The convexity of the power consumption function will be utilized in Theorem 2 to prove that the optimal policy for the unconstrained problem is threshold-based.

Backlogged packets are stored in a buffer with size $Q$. Let $q[n]$ denote the queue length at the beginning of timeslot $n$. Since data arrive at the end of the timeslot, in order to avoid buffer overflow (i.e. $q[n]>Q$) and underflow (i.e. $q[n]<0$), we should have $0\le q[n]-s[n]\le Q-A$. Therefore the dynamics of the buffer are given as

$$q[n+1]=q[n]-s[n]+a[n].\tag{1}$$
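As a minimal illustration of the recursion in (1), the sketch below simulates the buffer under a hypothetical greedy policy that always transmits $\min(q,S)$ packets; all parameter values here are ours, not from the paper.

```python
import random

# Illustrative simulation of the buffer dynamics q[n+1] = q[n] - s[n] + a[n].
# The greedy action s = min(q, S) is a hypothetical policy used only to show
# that the recursion keeps the queue within [0, Q] for this instance.
Q, S, A = 10, 3, 3                   # buffer size, max service, max arrivals
alpha = [0.4, 0.3, 0.2, 0.1]         # assumed arrival distribution (sums to 1)

def simulate(n_slots, seed=0):
    rng = random.Random(seed)
    q = 0
    for _ in range(n_slots):
        s = min(q, S)                                    # greedy transmission
        a = rng.choices(range(A + 1), weights=alpha)[0]  # i.i.d. arrivals
        q = q - s + a                                    # buffer dynamics (1)
        assert 0 <= q <= Q                               # no under-/overflow
    return q

final_q = simulate(10_000)
```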

In timeslot $n$, we can decide how many packets to transmit based on the buffer state $q[n]$. It can be seen that this is a Markov Decision Process (MDP), where the queue length $q[n]$ is the state of the MDP, and the number of packets transmitted, $s[n]$, is the action we take in each timeslot $n$. The probability distribution of the next state is given by

$$\Pr\{q[n+1]=j \mid q[n]=q, s[n]=s\}=\begin{cases}\alpha_{j-q+s} & 0\le j-q+s\le A,\\ 0 & \text{otherwise.}\end{cases}\tag{2}$$
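The kernel in (2) translates directly to code: the next state $j$ is reachable only through exactly $a=j-q+s$ arrivals. The parameter values below are illustrative.

```python
# Transition kernel of (2): Pr{q[n+1]=j | q[n]=q, s[n]=s} = alpha_{j-q+s}
# whenever 0 <= j-q+s <= A, and 0 otherwise. Values are illustrative.
A = 3
alpha = [0.4, 0.3, 0.2, 0.1]       # assumed arrival distribution

def transition_prob(j, q, s):
    a = j - q + s                  # arrivals needed to move from q-s to j
    return alpha[a] if 0 <= a <= A else 0.0
```

For any fixed $(q,s)$ the probabilities over $j$ sum to one, since exactly the arrival counts $0,\dots,A$ are feasible.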

We minimize the average queueing delay given an average power constraint, which makes it a Constrained Markov Decision Process (CMDP). For an infinite-horizon CMDP with stationary parameters, according to [13, Theorem 11.3], stationary policies are complete, which means stationary policies can achieve the optimal performance. Therefore we only need to consider stationary policies in this problem. Let $f_{q,s}$ denote the probability of transmitting $s$ packet(s) when $q[n]=q$, i.e.,

$$f_{q,s}=\Pr\{s[n]=s \mid q[n]=q\}.\tag{3}$$

Then we have $\sum_{s=0}^{S}f_{q,s}=1$ for $0\le q\le Q$. Since we guarantee that the transmission strategy will avoid overflow or underflow, we set

$$f_{q,s}=0 \quad\text{if } q-s<0 \text{ or } q-s>Q-A.\tag{4}$$

Let $F$ denote a matrix whose element in the $(q+1)$th row and the $(s+1)$th column is $f_{q,s}$. Therefore matrix $F$ can represent a stationary transmission policy. Let $P_F$ and $D_F$ denote the average power consumption and the average queueing delay under policy $F$. Let $\mathcal{F}$ denote the set of all feasible stationary policies that guarantee no queue overflow or underflow. Let $\mathcal{F}_D$ denote the set of all stationary and deterministic policies which can guarantee no overflow or underflow. Thus to obtain the optimal tradeoff curve, we can minimize the average delay given an average power constraint, shown as

$$\min_{F\in\mathcal{F}}\quad D_F \tag{5a}$$
$$\text{s.t.}\quad P_F\le P_{th}. \tag{5b}$$

From another perspective, policy $F$ determines a point $(P_F,D_F)$ in the delay-power plane. Define $\mathcal{R}$ as the set of all feasible points in the delay-power plane. Intuitively, since the power consumption for each data packet increases if we want to transmit faster, there is a tradeoff between the average queueing delay and the average power consumption. Thus the optimal delay-power tradeoff curve $\mathcal{L}$ is the lower boundary of $\mathcal{R}$.

If we fix a stationary policy for a Markov Decision Process, the Markov Decision Process will degenerate to a Markov Reward Process (MRP). Let $\lambda_{i,j}$ denote the transition probability from state $i$ to state $j$. According to the system model, because of the constraints of the transmission and arrival processes, the state transition probability can be derived as

$$\lambda_{i,j}=\sum_{s=\max\{0,\,i+A-Q,\,i-j\}}^{\min\{S,\,i,\,i-j+A\}}\alpha_{j-i+s}\,f_{i,s}.\tag{6}$$
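The summation limits in (6) simply collect all (action, arrival) pairs that lead from state $i$ to state $j$; the sketch below builds the same matrix by direct enumeration, under an assumed greedy policy and illustrative parameters.

```python
import numpy as np

# Build the transition matrix lambda_{i,j} induced by a stationary policy
# F (F[q, s] = f_{q,s}) by enumerating actions and arrivals directly;
# this is equivalent to the closed-form summation limits in (6).
# All values are illustrative, not from the paper.
Q, S, A = 6, 2, 2
alpha = np.array([0.5, 0.3, 0.2])          # arrival distribution
F = np.zeros((Q + 1, S + 1))
for q in range(Q + 1):
    F[q, min(q, S)] = 1.0                  # hypothetical greedy policy

def transition_matrix(F):
    lam = np.zeros((Q + 1, Q + 1))
    for i in range(Q + 1):
        for s in range(S + 1):
            if F[i, s] == 0.0:
                continue
            for a in range(A + 1):
                lam[i, i - s + a] += alpha[a] * F[i, s]   # j = i - s + a
    return lam

lam = transition_matrix(F)
```

Each row of the resulting matrix is a probability distribution, since for fixed $i$ the pairs $(s,a)$ sweep all arrival outcomes.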

An example of the transition diagram is shown in Fig. 2, where some transition probabilities are omitted to keep the diagram legible.

The Markov chain could have more than one closed communication class under certain transmission policies. Under this circumstance, the limiting probability distribution and the average cost depend on the initial state and the sample paths. In Appendix A, it is proven that we only need to consider the cases where the Markov chain has only one closed communication class, which is called a unichain. Because of this key result, we focus only on the unichain cases in the following.

## III Optimal Threshold-Based Policy for the Constrained Markov Decision Process

In this section, we will demonstrate that the optimal policy for the Constrained MDP problem is threshold-based. In other words, for an optimal policy, more data will be transmitted if the queue is longer. We give the rigorous definition of a stationary threshold-based policy: there exist thresholds $q^{(1)}\le q^{(2)}\le\cdots\le q^{(S)}$, such that $f_{q,s}>0$ only when $q^{(s)}\le q\le q^{(s+1)}$ (set $q^{(0)}=0$ and $q^{(S+1)}=Q$ for simplicity of notation). According to this definition, under policy $F$, when the queue state $q$ is larger than threshold $q^{(s)}$ and smaller than $q^{(s+1)}$, it transmits $s$ packet(s). When the queue state is equal to threshold $q^{(s)}$, it transmits $s-1$ or $s$ packet(s). Note that under this definition, probabilistic policies can also be threshold-based.

In the following, we first conduct the steady-state analysis of the Markov process, based on which we show the properties of the feasible delay-power region and the optimal delay-power tradeoff. Then, by proving that the Lagrangian relaxation problem has a deterministic and threshold-based optimal policy, we finally show that the optimal policy for the constrained problem is threshold-based.

### III-A Steady-State Analysis

Since we can focus on unichain cases, which contain a single recurrent class plus possibly some transient states, the steady-state probability distribution exists for the Markov process. Let $\pi_F(q)$ denote the steady-state probability for state $q$ when applying policy $F$. Set $\pi_F=[\pi_F(0),\pi_F(1),\cdots,\pi_F(Q)]^T$. Define $\Lambda_F$ as a matrix whose element in the $(j+1)$th column and the $(i+1)$th row is $\lambda_{i,j}$, which is determined by policy $F$. Set $I$ as the identity matrix. Define $G_F=\Lambda_F^T-I$, and let $H_F$ be the matrix obtained from $G_F$ by replacing its last row with an all-one row. Set $c=[0,\cdots,0,1]^T$.

According to the definition of the steady-state distribution, we have $\Lambda_F^T\pi_F=\pi_F$ and $\sum_{q=0}^{Q}\pi_F(q)=1$. For a unichain, the rank of $G_F$ is $Q$. Therefore, $H_F$ is invertible and

$$H_F\pi_F=c.\tag{7}$$

For state $q$, transmitting $s$ packet(s) will cost power $P_s$ with probability $f_{q,s}$. Define $p_F=\left[\sum_{s=0}^{S}P_sf_{0,s},\cdots,\sum_{s=0}^{S}P_sf_{Q,s}\right]^T$, which is a function of $F$. The average power consumption can be expressed as

$$P_F=\sum_{q=0}^{Q}\pi_F(q)\sum_{s=0}^{S}P_sf_{q,s}=p_F^T\pi_F.\tag{8}$$

Similarly, define $d=\frac{1}{E_a}[0,1,\cdots,Q]^T$. According to Little's Law, the average delay under policy $F$ is

$$D_F=\frac{1}{E_a}\sum_{q=0}^{Q}q\,\pi_F(q)=d^T\pi_F.\tag{9}$$
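To make the pipeline around (7)-(9) concrete, the following sketch solves $H_F\pi_F=c$ for a small instance and reads off the average power and delay; the greedy policy and all numerical values are illustrative, not from the paper.

```python
import numpy as np

# Steady-state analysis sketch: build the chain induced by a policy,
# replace the last balance equation with the normalization row (H, c),
# solve H pi = c, then compute average power (8) and delay (9).
Q, S, A = 6, 2, 2
alpha = np.array([0.5, 0.3, 0.2])          # arrival distribution
P = np.array([0.0, 1.0, 2.5])              # assumed convex power levels
F = np.zeros((Q + 1, S + 1))
for q in range(Q + 1):
    F[q, min(q, S)] = 1.0                  # hypothetical greedy policy

lam = np.zeros((Q + 1, Q + 1))             # transition matrix under F
for i in range(Q + 1):
    for s in range(S + 1):
        if F[i, s] > 0.0:
            for a in range(A + 1):
                lam[i, i - s + a] += alpha[a] * F[i, s]

H = lam.T - np.eye(Q + 1)                  # balance equations (Lambda^T - I) pi = 0
H[-1, :] = 1.0                             # normalization: entries of pi sum to 1
c = np.zeros(Q + 1); c[-1] = 1.0
pi = np.linalg.solve(H, c)                 # steady-state distribution

E_a = float(np.arange(A + 1) @ alpha)      # mean arrival rate
P_F = float(pi @ (F @ P))                  # average power, cf. (8)
D_F = float(np.arange(Q + 1) @ pi) / E_a   # average delay via Little's law, cf. (9)
```

Under this greedy policy every arriving packet is served in the next slot, so `D_F` evaluates to exactly 1.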

The following theorem describes the structure of the feasible delay-power region and the optimal delay-power tradeoff curve.

###### Theorem 1.

The set of all feasible points in the delay-power plane, $\mathcal{R}$, and the optimal delay-power tradeoff curve $\mathcal{L}$ satisfy the following:

1. The set $\mathcal{R}$ is a convex polygon.

2. The curve $\mathcal{L}$ is piecewise linear, decreasing, and convex.

3. Vertices of $\mathcal{R}$ and $\mathcal{L}$ are all obtained by deterministic scheduling policies.

4. The policies corresponding to adjacent vertices of $\mathcal{R}$ and $\mathcal{L}$ take different actions in only one state.

###### Proof:

See Appendix B. ∎

### III-B Optimal Deterministic Threshold-Based Policy for the Lagrangian Relaxation Problem

In (5), we formulated the optimization problem as a Constrained MDP, which is difficult to solve in general. Let $\mu\ge 0$ denote the Lagrange multiplier. Consider the Lagrangian relaxation of (5):

$$\min_{F\in\mathcal{F}}\ D_F+\mu P_F-\mu P_{th}.\tag{10}$$

In (10), the term $\mu P_{th}$ is constant. Therefore, the Lagrangian relaxation problem is minimizing the weighted average cost $D_F+\mu P_F$, which becomes an unconstrained infinite-horizon Markov Decision Process with an average cost criterion. It is proven in [14, Theorem 9.1.8] that there exists an optimal stationary deterministic policy. Moreover, the optimal policy for the relaxation problem has the following property.

###### Theorem 2.

An optimal policy for the unconstrained Markov Decision Process is threshold-based. That is to say, there exist thresholds $q_F(0)\le q_F(1)\le\cdots\le q_F(S-1)$, such that

$$f_{q,s}=\begin{cases}1 & q_F(s-1)<q\le q_F(s)\\ 0 & \text{otherwise,}\end{cases}\tag{11}$$

where $q_F(-1)=-1$ and $q_F(S)=Q$.

###### Proof:

See Appendix C. ∎
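Under the threshold convention used in (11), a deterministic threshold-based policy can be materialized as a policy matrix; the threshold values below are illustrative.

```python
# Sketch: build a deterministic threshold-based policy of the form in
# Theorem 2 as a matrix, under the convention that s packets are
# transmitted when q_F(s-1) < q <= q_F(s), with q_F(-1) = -1 and
# q_F(S) = Q. The threshold values are illustrative.
Q, S = 6, 2
thresholds = [1, 3]                         # q_F(0), q_F(1)

def threshold_policy(thresholds):
    bounds = [-1] + list(thresholds) + [Q]  # q_F(-1), q_F(0), ..., q_F(S)
    F = [[0.0] * (S + 1) for _ in range(Q + 1)]
    for q in range(Q + 1):
        for s in range(S + 1):
            if bounds[s] < q <= bounds[s + 1]:
                F[q][s] = 1.0               # transmit s packets in state q
    return F

F = threshold_policy(thresholds)
actions = [row.index(1.0) for row in F]     # action taken in each state
```

Each row is one-hot (the policy is deterministic), and the action is non-decreasing in the queue length, which is exactly the threshold structure.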

### III-C Optimal Threshold-Based Policy for the Constrained Problem

From another perspective, $D_F+\mu P_F$ can be seen as the inner product of vector $(\mu,1)$ and $(P_F,D_F)$. Since $\mathcal{L}$ is piecewise linear, decreasing and convex, the point $(P_F,D_F)$ minimizing the inner product will be obtained at the vertices of $\mathcal{L}$, as can be observed in Fig. 3. Since the conclusion in Theorem 2 holds for any $\mu\ge 0$, the vertices of the optimal tradeoff curve can all be obtained by optimal policies for the Lagrangian relaxation problem, which are deterministic and threshold-based. Moreover, from Theorem 1, the adjacent vertices of $\mathcal{L}$ are obtained by policies which take different actions in only one state. Therefore, we have the following theorem.

###### Theorem 3.

Given an average power constraint, the scheduling policy minimizing the average delay takes the following form: there exist thresholds $q_F(0)\le q_F(1)\le\cdots\le q_F(S-1)$, one of which we name $q^*$, such that

$$\begin{cases}f_{q,s}=1 & q_F(s-1)<q\le q_F(s),\ q\ne q^*\\ f_{q^*,s^*}=1-\eta & \\ f_{q^*,s^*+1}=\eta & \\ f_{q,s}=0 & \text{otherwise,}\end{cases}\tag{12}$$

where $q_F(-1)=-1$, $q_F(S)=Q$, $s^*$ is the action satisfying $q_F(s^*-1)<q^*\le q_F(s^*)$, and $\eta\in[0,1]$.

###### Proof:

Since the optimal tradeoff curve is piecewise linear, assume the point achieving the power constraint with equality lies on the line segment between two adjacent vertices. According to Theorem 2, the optimal policies $F'$ and $F''$ corresponding to these vertices of the optimal tradeoff curve satisfy (11). Moreover, according to Theorem 1, the policies corresponding to adjacent vertices of $\mathcal{L}$ take different actions in only one state. Define the thresholds for $F'$ as $q_{F'}(0),\cdots,q_{F'}(S-1)$; then the thresholds for $F''$ are the same sequence with exactly one threshold changed, so that the two policies take different actions only in one state $q^*$. Since the policy obtaining a point on the line segment between the two vertices is a convex combination of $F'$ and $F''$ (by Lemma 1), it must have the form shown in (12). ∎

We can see that the optimal policy for the Constrained Markov Decision Process may not be deterministic. At most two elements in the policy matrix $F$, i.e. $f_{q^*,s^*}$ and $f_{q^*,s^*+1}$, can be fractional, while the other elements are either 0 or 1. Policies in this form also satisfy our definition of a stationary threshold-based policy given at the beginning of Section III.

## IV Algorithm to Efficiently Obtain the Optimal Tradeoff Curve

We design Algorithm 1 to efficiently obtain the optimal delay-power tradeoff curve and the corresponding optimal policies. Similar to [12], this algorithm takes advantage of the properties we have shown, i.e., the optimal delay-power tradeoff curve is piecewise linear, the vertices are obtained by deterministic threshold-based policies, and policies corresponding to two adjacent vertices take different actions in only one state. Therefore given the optimal policy for a certain vertex, we can narrow down the alternatives of optimal policies for its adjacent vertex. The policies corresponding to points between two adjacent vertices can also be easily generated.

Our proposed iterative algorithm starts from the bottom-right vertex of the optimal tradeoff curve, whose corresponding policy is known to transmit as much as possible. Then for each vertex we have determined, we enumerate the candidates for the next vertex. According to the properties we have obtained, we only need to search over deterministic threshold-based policies which differ in only one threshold. By comparing all the candidates, the next vertex is determined by the policy candidate whose connecting line with the current vertex has the minimum absolute slope and the minimum length. Note that a vertex can be obtained by more than one policy; therefore we use two lists to store all policies corresponding to the previous and the current vertices.

The complexity of this algorithm is much smaller than that of general methods. Since during each iteration one of the thresholds of the optimal policy is decreased by 1, the maximum number of iterations is $SQ$. Within each iteration, we have at most $S$ threshold candidates to try. For each candidate, the most time-consuming operation, i.e. the matrix inversion, costs $O(Q^3)$. Therefore the complexity of the algorithm is $O(S^2Q^4)$.
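The structural facts the algorithm relies on can be sanity-checked by brute force on a tiny instance: enumerate every feasible deterministic threshold policy, evaluate its (power, delay) point by steady-state analysis, and confirm that the minimum-delay point is the transmit-as-much-as-possible policy with delay 1, which is where Algorithm 1 starts. All numbers below are illustrative.

```python
import itertools
import numpy as np

# Brute-force sketch: enumerate feasible deterministic threshold policies
# on a tiny instance and compute each (P_F, D_F) point. The minimum
# achievable delay should be 1 (the "transmit as much as possible" vertex).
Q, S, A = 4, 2, 2
alpha = np.array([0.5, 0.3, 0.2])
P = np.array([0.0, 1.0, 2.5])

def evaluate(F):
    lam = np.zeros((Q + 1, Q + 1))
    for i in range(Q + 1):
        for s in range(S + 1):
            if F[i, s] > 0.0:
                for a in range(A + 1):
                    lam[i, i - s + a] += alpha[a] * F[i, s]
    H = lam.T - np.eye(Q + 1)
    H[-1, :] = 1.0                       # normalization row
    c = np.zeros(Q + 1); c[-1] = 1.0
    pi = np.linalg.solve(H, c)
    E_a = float(np.arange(A + 1) @ alpha)
    return float(pi @ (F @ P)), float(np.arange(Q + 1) @ pi) / E_a

points = []
for t0, t1 in itertools.combinations_with_replacement(range(Q + 1), 2):
    bounds = [-1, t0, t1, Q]             # threshold boundaries
    F = np.zeros((Q + 1, S + 1))
    feasible = True
    for q in range(Q + 1):
        s = next(s for s in range(S + 1) if bounds[s] < q <= bounds[s + 1])
        if q - s < 0 or q - s > Q - A:   # overflow/underflow check, cf. (4)
            feasible = False
        F[q, s] = 1.0
    if feasible:
        points.append(evaluate(F))

min_delay = min(d for _, d in points)
```

The lower-left frontier of these points is the piecewise linear curve that Algorithm 1 traces without exhaustive enumeration.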

In comparison, we also formulate a Linear Programming (LP) problem to obtain the optimal tradeoff curve. As demonstrated in [13, Chapter 11.5], all CMDP problems with infinite horizon and average cost can be formulated as Linear Programming. In our case, by taking $x_{q,s}=\pi(q)f_{q,s}$ as variables, we can formulate an LP with $(Q+1)(S+1)$ variables to minimize the average delay given a certain power constraint. Due to space limitations, we provide the LP without explanation.

$$\min\quad \frac{1}{E_a}\sum_{q=0}^{Q}q\sum_{s=0}^{S}x_{q,s} \tag{13a}$$
$$\text{s.t.}\quad \sum_{q=0}^{Q}\sum_{s=0}^{S}P_s x_{q,s}\le P_{th} \tag{13b}$$
$$\sum_{l=\max\{0,q-A\}}^{q-1}\sum_{a=0}^{A}\sum_{s=0}^{l+a-q}\alpha_a x_{l,s}=\sum_{r=q}^{\min\{q+S-1,Q\}}\sum_{a=0}^{A}\sum_{s=r+a-q+1}^{S}\alpha_a x_{r,s},\quad q=1,\cdots,Q \tag{13c}$$
$$\sum_{q=0}^{Q}\sum_{s=0}^{S}x_{q,s}=1 \tag{13d}$$
$$x_{q,s}=0\quad\forall\ q-s<0 \text{ or } q-s>Q-A \tag{13e}$$
$$x_{q,s}\ge 0\quad\forall\ 0\le q-s\le Q-A. \tag{13f}$$

By solving the LP, we can obtain one point on the optimal tradeoff curve. If we apply the ellipsoid algorithm to solve the LP problem, the computational complexity is polynomial in the number of variables $(Q+1)(S+1)$ and exceeds that of our proposed algorithm. It means that the computation required to obtain one point on the optimal tradeoff curve by LP is larger than that of obtaining the entire curve with our proposed algorithm. This demonstrates the inherent advantage of exploiting the revealed properties of the optimal tradeoff curve and the optimal policies.

## V Numerical Results

In this section, we validate our theoretical results and the proposed algorithm by LP numerical computation and simulations. We consider a practical scenario with adaptive M-PSK transmissions. The optional modulations are BPSK, QPSK, and 8-PSK. Assume the bandwidth is 1 MHz, the length of a timeslot is 10 ms, and the target bit error rate is fixed. Assume a data packet contains 10,000 bits, and in each timeslot the number of arriving packets can be 0, 1, 2, or 3, i.e., $A=3$. Then by adaptively applying BPSK, QPSK, or 8-PSK, we can respectively transmit 1, 2, or 3 packets in a timeslot, which means $S=3$. Assume the one-sided noise power spectral density is -150 dBm/Hz. The transmission powers $P_1$, $P_2$, and $P_3$ (in Joules per timeslot) for the different transmission rates can then be calculated accordingly, with $P_0=0$. Set the buffer size as $Q$.

The optimal delay-power tradeoff curves are shown in Fig. 4 and Fig. 5. In each figure, we vary the arrival process to get different tradeoff curves. As can be observed, the tradeoff curves generated by Algorithm 1 perfectly match the Linear Programming and simulation results. As proven in Theorem 1, the optimal tradeoff curves are piecewise linear, decreasing, and convex. The vertices of the curves obtained by Algorithm 1 are marked by squares. The corresponding optimal policies can be checked to be threshold-based. The minimum average delay is 1 for all curves, because when we transmit as much as we can, all data packets stay in the queue for exactly one timeslot. In Fig. 4, as the average arrival rate increases, the curve gets higher because of the heavier workload. In Fig. 5, the three arrival processes have the same average arrival rate but different variances. When the variance gets larger, it is more likely that the queue grows long within a short time, which leads to higher delay. It would be interesting to characterize the effect of the variance of the arrival process, which we leave as future work.

## VI Conclusion

In this paper, we extend our previous work to obtain the optimal delay-power tradeoff and the corresponding optimal scheduling policy considering arbitrary i.i.d. arrivals and adaptive transmissions. The scheduler optimizes the transmission in each timeslot according to the buffer state. We formulate this problem as a CMDP, and minimize the average delay to obtain the optimal tradeoff curve. By studying the steady-state properties and the Lagrangian relaxation of the CMDP problem, we prove that the optimal delay-power tradeoff curve is convex and piecewise linear, on which adjacent vertices are obtained by policies taking different actions in only one state. Based on this, the optimal policies are proven to be threshold-based. We also design an efficient algorithm to obtain the optimal tradeoff curve and the optimal policies. Linear Programming and simulations are conducted to confirm the theoretical results and the proposed algorithm.

## Appendix A Proof of the Equivalency of Reducing to Unichain Cases

We claim that we can focus only on the unichain cases, because for any Markov process with multiple recurrent classes determined by a certain policy, we can design a policy which leads to a unichain Markov process having the same performance as any given recurrent class. We state this claim formally as the proposition below, and give the detailed proof.

###### Proposition 1.

In the Markov Decision Process with arbitrary arrival and adaptive transmission, if there is more than one closed communication class in the Markov chain generated by policy $F$, which we denote as $C_1,C_2,\cdots,C_M$, where $M\ge 2$, then for any $C_m$, there exists a policy $F'$ under which the Markov chain has $C_m$ as its only closed communication class. Furthermore, the steady-state distribution and the average cost of the Markov chain under $F'$ are the same as the steady-state distribution and the average cost of the Markov chain under $F$ starting from a state in $C_m$.

###### Proof:

Define the set of those transient states that have access to $C_m$ as $T_m$. Define the set of states which do not have access to $C_m$ as $N_m$, which contains the other closed classes and the remaining transient states. Therefore $\{C_m, T_m, N_m\}$ is a partition of the states of the MDP. There must exist at least one state $i\in N_m$ which is next to a state in $C_m\cup T_m$. We can always change the action in state $i$ such that state $i$ can access the set $C_m\cup T_m$. After the modification, state $i$ will be a transient state which has access to $C_m$. The states which communicate with $i$ will also become transient states which have access to $C_m$.

We update the partition of states since the policy has changed. According to the above description, the set $C_m$ will not change, while the cardinality of $T_m$ will be strictly increasing. Hence, by repeating the above operation a finite number of times, every state of the MDP will be partitioned into either $C_m$ or $T_m$. The Markov chain generated by the modified policy has $C_m$ as its only closed communication class, and the modified policy is the $F'$ we request.

Since the actions of the states in $C_m$ are the same under policies $F$ and $F'$, the steady-state distribution and the average cost corresponding to policy $F'$ are the same as those under policy $F$ starting from a state in $C_m$. ∎

## Appendix B Proof of Theorem 1

In order to prove Theorem 1, we will first prove a lemma showing that the mapping from $F$ to $(P_F,D_F)$ has a partially linear property in the first subsection. In the second subsection, we will prove that the set $\mathcal{R}$ is a convex polygon, whose vertices are all obtained by deterministic scheduling policies, and that the policies corresponding to adjacent vertices of $\mathcal{R}$ take different actions in only one state. In the third subsection, we will prove that the curve $\mathcal{L}$ is piecewise linear, decreasing, and convex, that its vertices are obtained by deterministic scheduling policies, and that the policies corresponding to adjacent vertices of $\mathcal{L}$ take different actions in only one state.

In correspondence with Theorem 1, conclusion 1) in the theorem is proven in Subsection B, conclusion 2) is proven in Subsection C, and conclusions 3) and 4) are proven by combining the results in Subsections B and C.

### B-A Partially Linear Property of Scheduling Policies

###### Lemma 1.

$F$ and $F'$ are two policies that differ only when the queue state is $q$, i.e., these two matrices are different only in the $(q+1)$th row. Denote $F''=F+\epsilon(F'-F)$, where $\epsilon\in[0,1]$. Then
1) There exists a certain $\eta\in[0,1]$ so that $P_{F''}-P_F=\eta(P_{F'}-P_F)$ and $D_{F''}-D_F=\eta(D_{F'}-D_F)$. Furthermore, the parameter $\eta$ is a continuous non-decreasing function of $\epsilon$.
2) When $\epsilon$ changes from 0 to 1, the point $(P_{F''},D_{F''})$ moves on the line segment from $(P_F,D_F)$ to $(P_{F'},D_{F'})$.

###### Proof:

In the following, the two conclusions of the lemma will be proven one by one.

1) According to the definitions of $H_F$ and $p_F$, we have $H_{F''}=H_F+\epsilon\Delta H$ and $p_{F''}=p_F+\epsilon\Delta p$, where $\Delta H=H_{F'}-H_F$ and $\Delta p=p_{F'}-p_F$. Since $F$ and $F'$ are different only in the $(q+1)$th row, it can be derived that the $(q+1)$th column of $\Delta H$ is the only column that can contain non-zero elements, and the $(q+1)$th element of $\Delta p$ is its only non-zero element. Therefore $\Delta H$ can be expressed as $\Delta H=\delta_q e_q^T$, where $\delta_q$ is its $(q+1)$th column and $e_q$ is the unit column vector whose $(q+1)$th element is 1, and $\Delta p$ can be expressed as $\Delta p=\zeta_q e_q$, where $\zeta_q$ is its $(q+1)$th element. Based on this, we set

$$H_F^{-1}=\left[h_0^T,h_1^T,\cdots,h_Q^T\right]^T.\tag{14}$$

Hence

$$(H_F^{-1}\Delta H)H_F^{-1}=\begin{bmatrix}(h_0^T\delta_q)h_q^T\\ (h_1^T\delta_q)h_q^T\\ \vdots\\ (h_Q^T\delta_q)h_q^T\end{bmatrix}.\tag{15}$$

By mathematical induction, we can prove that for $i\ge 1$,

$$\begin{aligned}(H_F^{-1}\Delta H)^iH_F^{-1}&=\begin{bmatrix}(h_0^T\delta_q)(h_q^T\delta_q)^{i-1}h_q^T\\ (h_1^T\delta_q)(h_q^T\delta_q)^{i-1}h_q^T\\ \vdots\\ (h_Q^T\delta_q)(h_q^T\delta_q)^{i-1}h_q^T\end{bmatrix} &\text{(16)}\\ &=(h_q^T\delta_q)^{i-1}(H_F^{-1}\Delta H)H_F^{-1} &\text{(17)}\end{aligned}$$

and

$$\Delta p^TH_F^{-1}(H_F^{-1}\Delta H)^{i-1}=\zeta_q(h_q^T\delta_q)^{i-1}h_q^T.\tag{18}$$

Therefore,

$$\begin{aligned}(H_F+\epsilon\Delta H)^{-1}&=\sum_{i=0}^{+\infty}(-\epsilon)^i(H_F^{-1}\Delta H)^iH_F^{-1} &\text{(19)}\\ &=H_F^{-1}+\sum_{i=1}^{+\infty}(-\epsilon)^i(h_q^T\delta_q)^{i-1}(H_F^{-1}\Delta H)H_F^{-1}. &\text{(20)}\end{aligned}$$

We have $P_{F''}=(p_F+\epsilon\Delta p)^T(H_F+\epsilon\Delta H)^{-1}c$ and $D_{F''}=d^T(H_F+\epsilon\Delta H)^{-1}c$. Hence

$$\begin{aligned}\frac{P_{F''}-P_F}{P_{F'}-P_F}&=\frac{(p_F+\epsilon\Delta p)^T(H_F+\epsilon\Delta H)^{-1}c-p_F^TH_F^{-1}c}{(p_F+\Delta p)^T(H_F+\Delta H)^{-1}c-p_F^TH_F^{-1}c} &\text{(21)}\\
&=\frac{p_F^T\left[(H_F+\epsilon\Delta H)^{-1}-H_F^{-1}\right]c+\epsilon\Delta p^T(H_F+\epsilon\Delta H)^{-1}c}{p_F^T\left[(H_F+\Delta H)^{-1}-H_F^{-1}\right]c+\Delta p^T(H_F+\Delta H)^{-1}c} &\text{(22)}\\
&=\frac{p_F^T\left[\sum_{i=1}^{+\infty}(-\epsilon)^i(h_q^T\delta_q)^{i-1}(H_F^{-1}\Delta H)H_F^{-1}\right]c-\Delta p^T\left[\sum_{i=1}^{+\infty}(-\epsilon)^i(H_F^{-1}\Delta H)^{i-1}H_F^{-1}\right]c}{p_F^T\left[\sum_{i=1}^{+\infty}(-1)^i(h_q^T\delta_q)^{i-1}(H_F^{-1}\Delta H)H_F^{-1}\right]c-\Delta p^T\left[\sum_{i=1}^{+\infty}(-1)^i(H_F^{-1}\Delta H)^{i-1}H_F^{-1}\right]c} &\text{(23)}\\
&=\frac{\sum_{i=1}^{+\infty}(-\epsilon)^i(h_q^T\delta_q)^{i-1}p_F^T(H_F^{-1}\Delta H)H_F^{-1}c-\sum_{i=1}^{+\infty}(-\epsilon)^i\zeta_q(h_q^T\delta_q)^{i-1}h_q^Tc}{\sum_{i=1}^{+\infty}(-1)^i(h_q^T\delta_q)^{i-1}p_F^T(H_F^{-1}\Delta H)H_F^{-1}c-\sum_{i=1}^{+\infty}(-1)^i\zeta_q(h_q^T\delta_q)^{i-1}h_q^Tc} &\text{(24)}\\
&=\frac{\sum_{i=1}^{+\infty}(-\epsilon)^i(h_q^T\delta_q)^{i-1}}{\sum_{i=1}^{+\infty}(-1)^i(h_q^T\delta_q)^{i-1}} &\text{(25)}\\
&=\frac{\epsilon+\epsilon h_q^T\delta_q}{1+\epsilon h_q^T\delta_q} &\text{(26)}\end{aligned}$$

and

$$\begin{aligned}\frac{D_{F''}-D_F}{D_{F'}-D_F}&=\frac{d^T(H_F+\epsilon\Delta H)^{-1}c-d^TH_F^{-1}c}{d^T(H_F+\Delta H)^{-1}c-d^TH_F^{-1}c} &\text{(27)}\\
&=\frac{d^T\left[\sum_{i=1}^{+\infty}(-\epsilon)^i(h_q^T\delta_q)^{i-1}(H_F^{-1}\Delta H)H_F^{-1}\right]c}{d^T\left[\sum_{i=1}^{+\infty}(-1)^i(h_q^T\delta_q)^{i-1}(H_F^{-1}\Delta H)H_F^{-1}\right]c} &\text{(28)}\\
&=\frac{\sum_{i=1}^{+\infty}(-\epsilon)^i(h_q^T\delta_q)^{i-1}}{\sum_{i=1}^{+\infty}(-1)^i(h_q^T\delta_q)^{i-1}} &\text{(29)}\\
&=\frac{\epsilon+\epsilon h_q^T\delta_q}{1+\epsilon h_q^T\delta_q}. &\text{(30)}\end{aligned}$$

Hence $\frac{P_{F''}-P_F}{P_{F'}-P_F}=\frac{D_{F''}-D_F}{D_{F'}-D_F}=\eta$, where $\eta=\frac{\epsilon+\epsilon h_q^T\delta_q}{1+\epsilon h_q^T\delta_q}$, so that $P_{F''}-P_F=\eta(P_{F'}-P_F)$ and $D_{F''}-D_F=\eta(D_{F'}-D_F)$. Furthermore, it can be seen that $\eta$ is a continuous non-decreasing function of $\epsilon$.

2) From the first part, we proved that $P_{F''}-P_F=\eta(P_{F'}-P_F)$ and $D_{F''}-D_F=\eta(D_{F'}-D_F)$, where $\eta$ is a continuous non-decreasing function of $\epsilon$. When $\epsilon=0$, we have $\eta=0$. When $\epsilon=1$, we have $\eta=1$. Therefore when $\epsilon$ changes from 0 to 1, the point $(P_{F''},D_{F''})$ moves on the line segment from $(P_F,D_F)$ to $(P_{F'},D_{F'})$. The slope of the line can be expressed as $\frac{D_{F'}-D_F}{P_{F'}-P_F}$. ∎
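Lemma 1 can also be checked numerically: mixing two policies that differ in a single row traces a straight segment between their delay-power points. The tiny instance below is illustrative, not from the paper.

```python
import numpy as np

# Numerical check of Lemma 1: policies F and F2 differ only in the row of
# state q = 2; the points of the mixtures F + eps*(F2 - F) should be
# colinear with the two endpoint policies' (power, delay) points.
Q, S, A = 4, 2, 2
alpha = np.array([0.5, 0.3, 0.2])
P = np.array([0.0, 1.0, 2.5])

def evaluate(F):
    lam = np.zeros((Q + 1, Q + 1))
    for i in range(Q + 1):
        for s in range(S + 1):
            if F[i, s] > 0.0:
                for a in range(A + 1):
                    lam[i, i - s + a] += alpha[a] * F[i, s]
    H = lam.T - np.eye(Q + 1)
    H[-1, :] = 1.0
    c = np.zeros(Q + 1); c[-1] = 1.0
    pi = np.linalg.solve(H, c)
    E_a = float(np.arange(A + 1) @ alpha)
    return float(pi @ (F @ P)), float(np.arange(Q + 1) @ pi) / E_a

def greedy():
    F = np.zeros((Q + 1, S + 1))
    for q in range(Q + 1):
        F[q, min(q, S)] = 1.0
    return F

F, F2 = greedy(), greedy()
F2[2, :] = 0.0
F2[2, 1] = 1.0                    # the two policies differ only in state 2

(p0, d0), (p1, d1) = evaluate(F), evaluate(F2)
cross = []                        # signed-area test for colinearity
for eps in (0.25, 0.5, 0.75):
    p, d = evaluate(F + eps * (F2 - F))
    cross.append((p - p0) * (d1 - d0) - (d - d0) * (p1 - p0))
```

Every `cross` entry is (numerically) zero, confirming that the mixture points lie on the segment between the two endpoints.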

### B-B Properties of the Set $\mathcal{R}$

In this subsection, we will prove that $\mathcal{R}$, the set of all feasible points in the delay-power plane, is a convex polygon whose vertices are all obtained by deterministic scheduling policies. Moreover, the policies corresponding to adjacent vertices of $\mathcal{R}$ take different actions in only one state.

Define $\mathcal{R}_D$ as the convex hull of the points corresponding to deterministic scheduling policies in the delay-power plane. Hence we will show that $\mathcal{R}$ is a convex polygon whose vertices are all obtained by deterministic scheduling policies by proving $\mathcal{R}=\mathcal{R}_D$.

The proof is made up of three parts. In Part I, we will prove $\mathcal{R}\subseteq\mathcal{R}_D$ by a construction method. Part II is the most difficult part. We will first define the concepts of basic polygons and compound polygons, then prove their convexity, based on which $\mathcal{R}_D\subseteq\mathcal{R}$ can be proven. By combining the results from Parts I and II, we will have $\mathcal{R}=\mathcal{R}_D$. Finally, in Part III, it will be shown that policies corresponding to adjacent vertices of $\mathcal{R}$ are different in only one state.

Part I. Prove $\mathcal{R}\subseteq\mathcal{R}_D$

For any probabilistic policy $F$ where $0<f_{q^*,s^*}<1$, we construct

$$F'=\begin{cases}f'_{q,s}=1 & q=q^*,\,s=s^*\\ f'_{q,s}=0 & q=q^*,\,s\ne s^*\\ f'_{q,s}=f_{q,s} & \text{else}\end{cases}\tag{33}$$

and

$$F''=\begin{cases}f''_{q,s}=0 & q=q^*,\,s=s^*\\ f''_{q,s}=\dfrac{f_{q,s}}{1-f_{q^*,s^*}} & q=q^*,\,s\ne s^*\\ f''_{q,s}=f_{q,s} & \text{else.}\end{cases}\tag{34}$$

Since $F=f_{q^*,s^*}F'+(1-f_{q^*,s^*})F''$, and $F'$ and $F''$ are different only in the $(q^*+1)$th row, it must hold that