Optimizing Quality of Experience of Dynamic Video Streaming over Fading Wireless Networks
Abstract
We address the problem of streaming video packets from an Access Point (AP) to multiple clients over a shared fading wireless channel. In such systems, each client maintains a buffer of packets from which to play the video, and an outage occurs in the streaming whenever the buffer is empty. Clients can switch to a lower quality of video packet, or request packet transmission at a higher energy level, in order to minimize a weighted sum of the number of outages, the number of outage periods, and the number of low-quality video packets streamed, subject to an average power constraint on the AP. We pose the problem of choosing the video quality and transmission power as a Constrained Markov Decision Process (CMDP). We show that the problem involving N clients decomposes into N MDPs, each involving only a single client, and furthermore that the optimal policy has a threshold structure, in which the decision to choose the video quality and power level of transmission depends solely on the buffer level.
I Introduction
Scheduling packets for video streaming over a shared wireless downlink has been attracting increasing attention [18]. Predominantly, this problem has been addressed with the goal of minimizing the average number of outages, i.e., timeslots during which a client has no packet to play [1, 2, 3, 4, 5, 6, 7, 8, 9]. However, the models considered in these works do not incorporate the communication constraints imposed by the network over which the streaming occurs. Typically, clients streaming video files share a common wireless channel, which in turn typically has a constraint on the average power. The access point (AP) has to choose the power level at which to transmit individual packets to each client so as to maximize the total Quality of Experience (QoE) of the clients. The system also has an additional degree of freedom in that the AP can transmit lower-quality packets on occasion, leading to a softer loss of video quality than an abrupt outage. Another important aspect is that the quality of video streaming experienced by a client depends not only on the number of outages, but also on the number of "outage periods", i.e., the number of interruption periods. Thus a single outage lasting several timeslots is not the same as several outages each lasting one timeslot. The QoE experienced by a client thus has to take into account several metrics: the average number of outages, the average number of outage periods, and the quality of the video packets streamed. In this paper we address this overall problem. While we focus here on the single last-hop case for ease of exposition and brevity, our results can be generalized to multi-hop networks as well. In order to provide an uninterrupted video streaming experience to the clients, the AP has to guarantee some sort of service regularity, i.e., it has to ensure that packet deliveries to the clients do not occur in a bursty fashion.
References [19, 20, 21, 22, 23] develop a framework to design policies which provide services to clients in a regular fashion, though not in a video streaming context.
II System Description
Consider a system in which a wireless channel is shared by N clients for the purpose of streaming video packets.
It is assumed that the system evolves over discrete timeslots, and one timeslot is taken by the access point (AP) for attempting one packet transmission.
Client i maintains a buffer of B_i packets, and plays each packet for a duration of ℓ timeslots. Once it has finished playing a video packet, it looks for the next packet in the buffer. In case the buffer is empty, there is an "outage", meaning that the video streaming is interrupted, and the client has to wait for a packet to be delivered to its buffer before it can resume the video streaming.
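The buffer dynamics just described can be sketched in a short simulation. The sketch below is illustrative only: the per-packet playtime, buffer capacity, and delivery probability are hypothetical values, and delivery is modeled as an i.i.d. coin flip rather than the controlled transmissions studied in this paper.

```python
import random

def simulate_buffer(T, p_success, playtime=4, capacity=20, seed=0):
    """Simulate one client's playout buffer over T timeslots.

    p_success: probability that a transmitted packet is delivered in a slot.
    playtime:  timeslots of video carried by one packet (illustrative value).
    capacity:  buffer capacity, measured in timeslots of playtime.
    Returns (number of outage slots, number of outage periods).
    """
    rng = random.Random(seed)
    b = 0                  # remaining playtime in the buffer
    outages = 0            # slots in which the buffer is empty
    periods = 0            # transitions from non-outage to outage
    prev_outage = False
    for _ in range(T):
        in_outage = (b == 0)
        outages += in_outage
        periods += in_outage and not prev_outage
        prev_outage = in_outage
        b = max(b - 1, 0)  # one slot of playtime is consumed
        # a packet is attempted only if it fits in the buffer
        if b <= capacity - playtime and rng.random() < p_success:
            b += playtime
    return outages, periods
```

With a perfectly reliable channel only the initial start-up slot is an outage, while a dead channel yields one long outage period.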
The wireless channels connecting the clients to the AP are assumed to be random. For ease of exposition, we will derive the results for the case when the channel conditions are fixed. These results carry over to the case of fading channels in a straightforward manner. Later, in Section VIII, we will outline the results for the case of fading channels.
There are K_i different video qualities of packets that can be transmitted to client i, with class 1 providing the best viewing experience. Similarly, there are several power levels at which the packets for client i can be transmitted; we allow the power level u = 0, i.e., a client may choose not to request a packet in a timeslot. The probability p_i(k, u) that a packet for client i is successfully delivered upon a transmission attempt depends on the amount of power u used in the packet transmission and the quality k of the video packet that was attempted. We also incorporate an average power constraint P̄ on the AP.
The basic problem considered is that of scheduling the AP's packet transmissions to the clients so as to maximize the combined Quality of Experience (QoE) of the clients. The QoE of a single client depends on multiple factors:

1) The average number of outages.

2) How "often" the video gets interrupted, i.e., the number of outage periods, or the number of timeslots in which a transition from "non-outage" to outage occurs.

3) The number of packets of each quality type that are streamed.
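These three metrics can be combined into a single cost. The helper below is an illustrative sketch: the outage-period weight `alpha` and the per-class quality penalties are hypothetical tuning parameters, not values prescribed by the paper.

```python
def qoe_penalty(outage, deliveries, quality_penalty, alpha=1.0):
    """Combine the three QoE metrics into one per-client cost.

    outage:          list of 0/1 outage indicators, one per timeslot.
    deliveries:      delivered quality class per slot (None if nothing arrived).
    quality_penalty: map from quality class to penalty, nondecreasing in class.
    alpha:           weight on the start of each new outage period.
    """
    n_outages = sum(outage)
    # a new outage period begins on every 0 -> 1 transition (slot 0 included)
    n_periods = sum(1 for t, o in enumerate(outage)
                    if o == 1 and (t == 0 or outage[t - 1] == 0))
    q_cost = sum(quality_penalty[q] for q in deliveries if q is not None)
    return n_outages + alpha * n_periods + q_cost
```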
III Problem Formulation
We denote by O_i(t) the random variable that assumes the value 1 if the i-th client faces an outage at time t, and 0 otherwise, and by U_i(t) the transmission power utilized for the i-th client in timeslot t. Also, let D_{i,k}(t) be the random variable that takes the value 1 if a packet of quality k is delivered to client i in timeslot t, and 0 otherwise.
The Constrained Markov Decision Process (CMDP) of interest is then to choose the quality of video packets and the transmission power for each client, in order to

(1) min_π limsup_{T→∞} (1/T) E Σ_{t=1}^{T} Σ_{i=1}^{N} [ O_i(t) + α O_i(t)(1 − O_i(t−1)) + Σ_k c_k D_{i,k}(t) ]

s.t. limsup_{T→∞} (1/T) E Σ_{t=1}^{T} Σ_{i=1}^{N} U_i(t) ≤ P̄. (Primal MDP)
Note that the term O_i(t)(1 − O_i(t−1)) assumes the value 1 if timeslot t is the beginning of an outage period for client i, and is 0 otherwise. It thereby measures the number of outage periods incurred. The parameters α and c_k are employed for tuning the QoE to account for the relative importance placed on each of the objectives. We note that c_k ≤ c_{k+1} for all k, since we assumed that the video quality of a packet is lower if the packet belongs to a higher-valued class.
Thus the above problem is a CMDP in which the system state at time t is described by the N-dimensional vector (b_1(t), …, b_N(t)), where b_i(t) is the amount of play time remaining in the buffer of client i at time t.
The central difficulty which arises is that the cardinality of the state space of the system increases exponentially with the number of clients N, and thus the problem as formulated above is computationally infeasible.
We show that the problem of serving N clients can be decomposed into N separate problems, each involving only a single client. Thus the computational complexity of the problem grows linearly in the number of clients. Moreover, we show that the optimal policy is easily implementable since it has a simple threshold structure.
IV The Dual MDP
The Lagrangian associated with a policy π for the system (1) is given by,

(2) L(π, λ) = limsup_{T→∞} (1/T) E Σ_{t=1}^{T} Σ_{i=1}^{N} [ O_i(t) + α O_i(t)(1 − O_i(t−1)) + Σ_k c_k D_{i,k}(t) + λ U_i(t) ] − λ P̄,

where λ ≥ 0 is the Lagrange multiplier associated with the average power constraint. The associated Lagrange dual is,

(3) D(λ) = min_π L(π, λ).
Next we present a useful bound on the dual, the proof of which follows from the superadditivity of the liminf and the subadditivity of the limsup operations.
Lemma 1
(4) D(λ) ≥ Σ_{i=1}^{N} D_i(λ) − λ P̄,

where D_i(λ) denotes the optimal value of the single-client problem for client i, defined in Section V.
V Single Client Problem
We now consider minimizing the bound obtained in Lemma 1. Observing the bound, we find that we have decomposed the original problem (1) into N single-client problems, i.e., the expression on the r.h.s. of (4) is the sum of the costs of N clients, in which the cost of a single client depends only on the actions chosen for it in each timeslot.
The problem for a single client is described as follows; we omit the client subscript i in the following discussion. The channel connecting the client to the AP is random. The client maintains a buffer of capacity B timeslots of playtime (this assumption is equivalent to the assumption of maintaining a buffer of packets, since each packet is played for ℓ timeslots), and in each timeslot, the AP has to choose two quantities, which together comprise the control action chosen for the client:

1) The video quality k.

2) The power level u at which to carry out the packet transmission.
The state of the client is thus described by b(t), the playtime duration of the packets present in the buffer at time t. If the client is scheduled a packet transmission of quality k at power level u at time t, and the remaining playtime b(t) is less than or equal to B − ℓ, then the system state at time t+1 is (b(t) − 1)^+ + ℓ with probability p(k, u), while it is (b(t) − 1)^+ with probability 1 − p(k, u). However, if the remaining playtime b(t) is strictly greater than B − ℓ, then no new packet can be accommodated, and the system state at time t+1 is b(t) − 1 with probability 1.
We let

(5) S(b) := (b − 1)^+ + ℓ,

(6) F(b) := (b − 1)^+,

be the transitions of the remaining playtime associated with a successful and a failed packet transmission, respectively. The control action at time t will be denoted a(t) = (k(t), u(t)), where k(t), u(t) are the video quality and transmission power level chosen at time t.
A transmission at power level u incurs a cost of λu. There is a penalty of 1 unit upon an outage at time t. A penalty of c_k units is imposed if a packet of quality k is delivered to the client, while a penalty of α units is imposed at time t in case there was no outage in timeslot t − 1 and an outage occurs in timeslot t, i.e., if a new outage period begins at time t.
Since the probability distribution of the system state at time t + 1 is completely determined by the system state at time t and the action chosen at time t, i.e., the requested video quality and the power level at which transmission occurs, the single-client problem is a Markov Decision Process (MDP) involving only a finite number of actions and states, and is thus solved by a stationary Markov policy [12].
Denote by π a policy for the client. The single-client problem is to solve,

(7) min_π limsup_{T→∞} (1/T) E^π Σ_{t=1}^{T} [ O(t) + α O(t)(1 − O(t−1)) + Σ_k c_k D_k(t) + λ U(t) ].

Denote by π_i*(λ) the optimal policy which solves the single-client problem for client i. We also let

(8) D_i(λ) := min_π C_i(π, λ)

be the optimal cost, where C_i(π, λ) is the cost (7) associated with a policy π.
VI Threshold Structure of the Optimal Policy for the Single-Client Problem
We will suppress the subscript i in the following discussion, and begin with the discounted infinite-horizon cost problem for the single client. Let

(9) V_γ(b) := min_π E^π Σ_{t=1}^{∞} γ^{t−1} c(b(t), a(t)), b(1) = b,

be the minimum discounted infinite-horizon cost for the system starting in state b, where b can assume values in the set {0, 1, …, B}, γ ∈ (0, 1) is the discount factor, and c(b, a) is the one-step cost defined below in (11). The function V_γ^n(b) is similarly defined to be the minimum discounted cost incurred in n timeslots for the system starting in state b, i.e.,

V_γ^n(b) := min_π E^π Σ_{t=1}^{n} γ^{t−1} c(b(t), a(t)),

where π is a policy for the n-horizon discounted problem. These quantities should not be confused with those defined in the previous section. We have,
(10) V_γ^{n+1}(b) = min_{(k,u)} [ c(b, (k, u)) + γ ( p(k, u) V_γ^n(S(b)) + (1 − p(k, u)) V_γ^n(F(b)) ) ], for b ≤ B − ℓ,

where

(11) c(b, (k, u)) = λu + c_k p(k, u) + 1{b = 0} + α (1 − p(k, u)) 1{b = 1}

is the one-step cost associated with the action (k, u) in state b (the last term charges, in expectation, the start of a new outage period when the buffer is about to empty), and, for b > B − ℓ,

(12) V_γ^{n+1}(b) = min_u [ λu ] + γ V_γ^n(b − 1) = γ V_γ^n(b − 1),

since any packet received in such a state is lost due to buffer overflow.
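The backups (10)–(12) can be carried out numerically. The sketch below is a minimal implementation under the model of this section: the action set, success probabilities, and one-step cost are supplied by the caller, so any specific numbers used with it are illustrative assumptions rather than values from the paper.

```python
def value_iteration(B, ell, actions, p_bar, cost, gamma=0.95, n_iter=500):
    """Run n_iter discounted backups for the single-client problem.

    B:       buffer capacity in timeslots of playtime.
    ell:     playtime added by one delivered packet.
    actions: list of (quality, power) pairs.
    p_bar:   map (quality, power) -> success probability.
    cost:    cost(b, action) -> one-step cost (a modeling choice of the caller).
    Returns the value function V and a greedy policy (None = no transmission).
    """
    V = [0.0] * (B + 1)
    policy = [None] * (B + 1)
    for _ in range(n_iter):
        V_new = [0.0] * (B + 1)
        for b in range(B + 1):
            if b > B - ell:
                # no room for a new packet: transmit nothing, cf. (12)
                V_new[b] = gamma * V[b - 1]
                policy[b] = None
                continue
            succ = max(b - 1, 0) + ell   # S(b)
            fail = max(b - 1, 0)         # F(b)
            best, best_a = float("inf"), None
            for a in actions:
                p = p_bar[a]
                q = cost(b, a) + gamma * (p * V[succ] + (1 - p) * V[fail])
                if q < best:
                    best, best_a = q, a
            V_new[b] = best
            policy[b] = best_a
        V = V_new
    return V, policy
```

In line with the monotonicity established later in this section, the computed value function is nonincreasing in the buffer level.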
We assume that a lower video-quality packet, or a higher-power packet transmission, leads to an increase in the probability p(k, u) of a successful packet transmission, i.e., an increase in the one-step cost is associated with a higher transmission success probability.
Definition 1
We say a policy is of threshold type if it satisfies the following for each stage n:

1) Fix any power level u. If the policy chooses the action (k, u) in state b, then it does not choose an action (k′, u) with k′ > k in any state b′ > b.

2) Fix any video quality k. If the policy chooses the action (k, u) in state b, then it does not choose an action (k, u′) with u′ > u in any state b′ > b.
If b_1, b_2 are such that b_1 ≤ b_2, let a_1, a_2 be the actions chosen by a threshold policy in states b_1 and b_2 respectively. Then it is easily verified that p(a_1) ≥ p(a_2).
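The consequence just noted can be checked mechanically: under a threshold policy, the success probability of the chosen action never increases with the buffer level. The representation below (a list of actions indexed by buffer level, with None for levels where no transmission occurs) is a hypothetical encoding chosen for illustration.

```python
def is_threshold(policy, p_bar):
    """Return True if the action's success probability never increases with b.

    policy: action (quality, power) for each buffer level b, or None.
    p_bar:  map from action to success probability (0 when not transmitting).
    """
    probs = [p_bar[a] if a is not None else 0.0 for a in policy]
    return all(probs[b] >= probs[b + 1] for b in range(len(probs) - 1))
```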
Next we present a useful lemma that is easily proved. In the following, (a, π) is the policy that follows the action a in the first slot, and then follows the policy π, while V_γ^n(b; (a, π)) is the cost achieved under the policy (a, π) in n timeslots for the system starting in state b.
Lemma 2
Let a_1, a_2 be two actions where p(a_1) ≥ p(a_2), or equivalently, where a_1 uses a lower video quality or a higher power level than a_2. Then, for any policy π and any b ≤ B − ℓ,

V_γ^n(b; (a_1, π)) − V_γ^n(b; (a_2, π)) = c(b, a_1) − c(b, a_2) + γ (p(a_1) − p(a_2)) [ V_γ^{n−1}(S(b); π) − V_γ^{n−1}(F(b); π) ].
Lemma 3
For every n ≥ 0, the function V_γ^n(b) is decreasing in b for b ∈ {0, 1, …, B}.
Within this proof, let π be the optimal policy for the discounted n-timeslot problem, and let (a, π) be the policy for n + 1 timeslots which takes the action a in the first timeslot, and then follows the policy π. In order to prove the claim, we will use induction on n, the number of timeslots.
Let us assume that the statement is true for the functions V_γ^m, for all m ≤ n. In particular, the function

(13) V_γ^n(b)

is decreasing in b for b ∈ {0, 1, …, B}.
First we will prove the decreasing property of V_γ^{n+1} for b ∈ {1, …, B}. Now the induction hypothesis (13), and the Bellman equation (10), together imply that the optimal policy at stage n + 1 is of threshold type.
Fix a b ∈ {1, …, B − 1} and denote by a_1, a_2 the optimal actions at stage n + 1 in the states b and b + 1 respectively. Note that the threshold nature of the stage-(n + 1) policy implies that p(a_1) ≥ p(a_2). This is true because as the value of the state decreases, a threshold policy switches to an action that has a higher transmission success probability. So, using Lemma 2 and the induction hypothesis, it follows that

V_γ^{n+1}(b + 1) ≤ V_γ^{n+1}(b + 1; (a_1, π)) ≤ V_γ^{n+1}(b; (a_1, π)) = V_γ^{n+1}(b),

where the first inequality follows since the (possibly suboptimal) action a_1 in the state b + 1 cannot decrease the cost-to-go for n + 1 timeslots, the second inequality is a consequence of the induction hypothesis that the functions V_γ^m, m ≤ n, are decreasing, while the equality follows from the optimality of a_1 in the state b. Thus we have proved the decreasing property of V_γ^{n+1} for b ∈ {1, …, B}, and it remains to show that V_γ^{n+1}(0) ≥ V_γ^{n+1}(1).
Once again, let a_0, a_1 be the optimal actions at stage n + 1 in the states 0 and 1 respectively. Using the same argument as above (i.e., applying the same action a_0 in both the states 0 and 1), it follows that

V_γ^{n+1}(1) ≤ V_γ^{n+1}(1; (a_0, π)).

However, then V_γ^{n+1}(1; (a_0, π)) ≤ V_γ^{n+1}(0; (a_0, π)) = V_γ^{n+1}(0) (for the subsequent stages, apply the same actions for the system starting in state 1 as for the system starting in state 0, and note that the two systems couple when the system started in state 1 hits the state 0 at some stage; the hitting stage is of course random, and until coupling the system started in state 0 incurs at least as much cost). This gives us,

V_γ^{n+1}(0) ≥ V_γ^{n+1}(1),
and thus we conclude that the function V_γ^{n+1} is decreasing for b ∈ {0, 1, …, B}. In order to complete the proof, we notice that for n = 0 we have V_γ^0 ≡ 0, and thus the assertion of the Lemma is true for n = 0.
Theorem 1
Consider the single-client problem discussed in Section V. There is a threshold policy π* that is Blackwell optimal [17], i.e., there is a γ_0 < 1 such that π* is optimal for all values of γ ∈ (γ_0, 1), and π* is also optimal for the average-cost problem. Thus π*(λ) is of threshold type and can be obtained by comparing the costs of the finitely many threshold-type policies.
Fix a power level u and let k_1, k_2 be two video-quality levels. Without loss of generality, let p(k_1, u) ≥ p(k_2, u). Clearly c(b, (k_1, u)) ≥ c(b, (k_2, u)) by (11). In the Bellman equation (10), consider the term depending on the success probability, i.e., the term p(k, u)[ V_γ^n(S(b)) − V_γ^n(F(b)) ]. For b_1 ≤ b_2, we have,

V_γ^n(S(b_1)) − V_γ^n(F(b_1)) ≤ V_γ^n(S(b_2)) − V_γ^n(F(b_2)) ≤ 0,

where the last inequality follows from Lemma 3 since S(b) ≥ F(b). Thus it follows that if action (k_1, u) is preferred over action (k_2, u) in a state b_2, then (k_1, u) will also be preferred over (k_2, u) in any state b_1 ≤ b_2. Finally, note that it follows from the Bellman equation (12) and (5) that the optimal action in the states b > B − ℓ is to let u = 0 (since any packet that is received will be lost due to buffer overflow). The proof for variations in power levels is similar. Thus it follows from the definition of a threshold policy that the optimal policy is of threshold type.
Finally, note that the statement regarding Blackwell optimality follows from the result in the above paragraph, and because the state and action spaces are finite.
VII Solution of the Primal MDP
We now present the solution of the Primal Problem.
Lemma 4
D(λ) = Σ_{i=1}^{N} D_i(λ) − λ P̄.
Let π* be the policy obtained by following the policy π_i*(λ) for each client i. Then, from the definition of the dual function (3), the Lagrangian (2), the cost (7) associated with a policy, and Lemma 1, we have

(14) D(λ) ≥ Σ_{i=1}^{N} D_i(λ) − λ P̄.

However, since the policy π* is stationary (all the limsups become limits in the definition of its Lagrangian, and the limsup costs in the single-client problems likewise become limits), we have that

L(π*, λ) = Σ_{i=1}^{N} D_i(λ) − λ P̄ ≥ D(λ),

which, along with (14), gives us D(λ) = Σ_{i=1}^{N} D_i(λ) − λ P̄.
Theorem 2

Suppose P̄ > 0, and let λ* ≥ 0 maximize the dual function D(λ). Then the policy π* obtained by letting each client i follow its single-client threshold policy π_i*(λ*) is optimal for the Primal MDP (1).

We observe that there is a one-to-one correspondence between any stationary randomized policy and the measure it induces on the state-action space, and thus the Primal MDP can be posed as a linear program [13, 14]. Thus it follows from Slater's condition [15] that strong duality holds for the Primal MDP if there exists a policy that satisfies the average power constraint strictly. However, the policy which never schedules any packets incurs a net power expenditure of 0, and thus Slater's condition holds for the Primal MDP if P̄ > 0. The claim of the Theorem then follows from Lemma 4. We note that the policy π* is a decentralized policy. That is, the decision to choose the video quality and power level at each time t for client i, i.e., (k_i(t), u_i(t)), can be taken by client i itself, and doesn't require the AP to coordinate the clients. Thus a client need not know the state values b_j(t) of the other clients j ≠ i, nor does the AP need to know the values b_i(t). Thus the policy π* is easy to implement.
VII-A Obtaining λ* iteratively in a decentralized fashion
We note that in order to implement the optimal policy of Theorem 2, we need to find the optimal value λ* of the price. We iterate on the price using the subgradient method [16], and since the dual function is concave, the prices converge to the optimal value λ*. Moreover, the iterations involving price updates are decentralized, i.e., the clients need only the knowledge of the current price in order to carry out an iteration.
Now since D(λ) = Σ_{i=1}^{N} D_i(λ) − λ P̄, we have,

(15) ∂D(λ) ∋ Ū(λ) − P̄,

where Ū(λ) := limsup_{T→∞} (1/T) E Σ_{t=1}^{T} Σ_{i=1}^{N} U_i(t), evaluated under the policy π*(λ), is the expected power expenditure over all the users. This is the total "congestion" at the AP. The iteration for the price is,

λ_{m+1} = [ λ_m + ε_m ( Ū(λ_m) − P̄ ) ]^+,

where Ū(λ_m) − P̄ is the subgradient evaluated as in (15), and ε_m > 0 are step sizes.
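One step of this iteration can be sketched as a projected subgradient update; the step size and the estimate of the average power consumed under the current policy are inputs, and the specific numbers in the test of the sketch below are illustrative.

```python
def update_price(lam, avg_power, power_budget, step):
    """One projected-subgradient step on the dual price, cf. (15).

    avg_power:    average power expenditure under the current policy.
    power_budget: the AP's average power constraint.
    step:         subgradient step size (a tunable parameter).
    The updated price is projected back onto [0, infinity).
    """
    return max(lam + step * (avg_power - power_budget), 0.0)
```

The price rises when the AP is congested (average power above budget), and otherwise falls, but never below zero.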
VIII Fading Channels
The results in the previous sections can be extended in a straightforward manner to the case of fading channels. Let the channel conditions for client i be described by a Markov process c_i(t) evolving on finitely many states with a transition matrix P_i. The state of client i is then described by the vector (b_i(t), c_i(t)), where b_i(t) is the playtime duration of the packets present in the buffer at time t, and c_i(t) is the channel condition at time t. If the client is scheduled a packet transmission of quality k at power level u at time t, then (for b_i(t) ≤ B − ℓ) the playtime at time t + 1 is S(b_i(t)) with probability p(k, u, c_i(t)), while it is F(b_i(t)) with probability 1 − p(k, u, c_i(t)).
However, now the cost associated with an action also depends on the channel condition, i.e.,

(16) c(b, c, (k, u)) = λu + c_k p(k, u, c) + 1{b = 0} + α (1 − p(k, u, c)) 1{b = 1},
and a threshold policy will have a threshold structure for each value of the channel condition (as in Definition 1 of Section VI).
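The finite-state Markov channel assumed in this section can be sampled as follows; the transition matrix is supplied by the caller, and starting the chain in state 0 is an arbitrary choice made for illustration.

```python
import random

def simulate_channel(P, T, seed=0):
    """Sample T steps of a finite-state Markov channel.

    P: row-stochastic transition matrix, given as a list of lists.
    Returns the sequence of visited channel states, starting from state 0.
    """
    rng = random.Random(seed)
    c, path = 0, []
    for _ in range(T):
        path.append(c)
        r, acc = rng.random(), 0.0
        for j, pj in enumerate(P[c]):   # inverse-CDF sampling of the next state
            acc += pj
            if r < acc:
                c = j
                break
    return path
```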
IX Concluding Remarks
We have formulated the problem of dynamically choosing the qualities and power levels for packet transmissions across unreliable wireless channels, so as to maximize the Quality of Experience of video streaming, as an MDP. Using Lagrangian techniques, we have shown that the problem admits a decentralized solution, wherein the clients can dynamically decide these quantities on their own using only their local information, i.e., the channel state and the amount of playtime remaining in their buffers. Thus the optimal policy can be obtained in time linear in the number of users.
Furthermore we have shown that the optimal policy has a threshold structure, thus further reducing the complexity of searching for the optimal policy. Moreover due to the threshold nature of the policy, it is easy to implement.
References
 [1] A. ParandehGheibi, M. Medard, A. E. Ozdaglar, and S. Shakkottai, “Avoiding interruptions  a QoE reliability function for streaming media applications,” IEEE Journal on Selected Areas in Communication, vol. 29, no. 5, pp. 1064–1074, 2011.
 [2] G. Liang, "Effect of delay and buffering on jitter-free streaming over random VBR channels," IEEE Transactions on Multimedia, vol. 10, no. 6, pp. 1128–1141.
 [3] Ankit Singh Rawat and Emina Soljanin, "Dynamic control of video quality for AVS," 2014 IEEE International Symposium on Information Theory.
 [4] Y. Xu, E. Altman, R. El-Azouzi, M. Haddad, S. Elayoubi and T. Jimenez, "Analysis of buffer starvation with application to objective QoE optimization of streaming services," IEEE Transactions on Multimedia, vol. 16, no. 3, pp. 813–827, April 2014.
 [5] G. Tian and Y. Liu, "Towards agile and smooth video adaptation in dynamic HTTP streaming," in Proceedings of CoNEXT, 2012, pp. 109–120.
 [6] L. De Cicco, S. Mascolo, and V. Palmisano, "Feedback control for adaptive live video streaming," in Proceedings of MMSys, 2011, pp. 145–156.
 [7] T. Hossfeld, S. Egger, R. Schatz, M. Fiedler, K. Masuch, and C. Lorentzen, "Initial delay vs. interruptions: between the devil and the deep blue sea," in Proc. of QoMEX, 2012.
 [8] J. De Vriendt, D. De Vleeschauwer, and D. Robinson, "Model for estimating QoE of video delivered using HTTP adaptive streaming," in Proceedings of the IFIP/IEEE International Symposium on Integrated Network Management (IM 2013), 2013.
 [9] Y. Xu, S. Elayoubi, E. Altman and R. El-Azouzi, "Impact of flow-level dynamics on QoE of video streaming in wireless networks," in Proceedings of IEEE INFOCOM, April 2013, pp. 2715–2723.
 [10] Eitan Altman, Constrained Markov Decision Processes. Taylor & Francis, 1999.
 [11] Frederick J. Beutler and Keith W. Ross, "Optimal policies for controlled Markov chains with a constraint," Journal of Mathematical Analysis and Applications, vol. 112, no. 1, pp. 236–252, 15 November 1985.
 [12] Martin L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1994.
 [13] Alan S. Manne, "Linear programming and sequential decisions," Management Science, pp. 259–267, 1960.
 [14] Vivek S. Borkar, "Control of Markov chains with long-run average cost criterion," in Stochastic Differential Systems, Stochastic Control Theory and Applications, pp. 57–77, 1988.
 [15] D. P. Bertsekas, Nonlinear Programming, Athena Scientific, 1999
 [16] N. Z. Shor, Minimization Methods for Non-Differentiable Functions, translated by K. C. Kiwiel and A. Ruszczyński, Springer-Verlag New York, Inc., 1985.
 [17] David Blackwell, "Discrete dynamic programming," Annals of Mathematical Statistics, vol. 33, pp. 719–726, 1962.
 [18] Cisco Visual Networking Index (VNI): http://www.cisco.com/c/en/us/solutions/collateral/serviceprovider/visualnetworkingindexvni/white_paper_c11520862.pdf
 [19] Rahul Singh, I-Hong Hou and P. R. Kumar, "Fluctuation analysis of debt based policies for wireless networks with hard delay constraints," IEEE INFOCOM, 2014, pp. 2400–2408.
 [20] Rahul Singh, I-Hong Hou and P. R. Kumar, "Pathwise performance of debt based policies for wireless networks with hard delay constraints," IEEE 52nd Annual Conference on Decision and Control (CDC), 2013, pp. 7838–7843.
 [21] Rahul Singh, Xueying Guo and P. R. Kumar, "Index policies for optimal mean-variance trade-off of inter-delivery times in real-time sensor networks," IEEE INFOCOM, 2015.
 [22] Xueying Guo, Rahul Singh, P. R. Kumar and Zhisheng Niu, "A high reliability asymptotic approach for packet inter-delivery time optimization in cyber-physical systems," ACM MobiHoc, 2015, pp. 197–206.
 [23] Rahul Singh and Alexander Stolyar, "MaxWeight scheduling: asymptotic behavior of unscaled queue-differentials in heavy traffic," ACM SIGMETRICS, 2015.