Buffer-aware Wireless Scheduling based on Deep Reinforcement Learning


Chen Xu¹, Jian Wang¹, Tianhang Yu¹, Chuili Kong¹, Yourui Huangfu¹, Rong Li¹, Yiqun Ge², Jun Wang¹
Emails: {xuchen14, wangjian23, yutianhang, kongchuili, huangfuyourui, lirongone.li, yiqun.ge, justin.wangjun}@huawei.com
¹Hangzhou Research Center, Huawei Technologies, Hangzhou, China
²Ottawa Research Center, Huawei Technologies, Ottawa, Canada

In this paper, the downlink packet scheduling problem for cellular networks is modeled as a problem that jointly optimizes throughput, fairness and packet drop rate. Two genie-aided heuristic search methods are employed to explore the solution space. A deep reinforcement learning (DRL) framework based on the advantage actor-critic (A2C) algorithm is proposed for the optimization problem. Several methods are used in the framework to improve the sampling and training efficiency and to adapt the algorithm to the specific scheduling problem. Numerical results show that DRL outperforms the baseline algorithm and achieves performance similar to that of the genie-aided methods without using future information.

radio resource scheduling, deep reinforcement learning, cellular networks, multi-objective optimization

I Introduction

The communication field has developed for decades under the guidance of information theory, and various advanced architectures and algorithms have been proposed. However, such approaches are normally designed under traditional optimization frameworks that operate closer and closer to the Shannon spectral efficiency limit.

In recent years, deep learning (DL) has been widely applied in almost every industry and research domain, such as computer vision and natural language processing. Thanks to increasing computation power, many researchers have turned to DL for gains beyond traditional methodology. For example, instead of optimizing multiple signal processing blocks independently as in traditional communication systems, an autoencoder-based system is introduced to obtain a joint design that achieves better end-to-end performance [10]. On the other hand, conventional communication systems are characterized by rigid mathematical models that are generally linear and Gaussian to facilitate analysis. However, neither real-world imperfections nor non-linearity can be fully represented by these linear models. To address this issue, a DL-based sequence detection algorithm is proposed for molecular communication [5].

Owing to the aforementioned advantages of DL, deep reinforcement learning (DRL) has also been widely employed to solve decision-making problems, turning out to be another promising technique for future communication systems [9]. The work in [3] proposes an interference-aware path planning scheme based on DRL for a network of cellular-connected unmanned aerial vehicles (UAVs). Such a scheme is able to predict the dynamics of the network and improves the tradeoff among energy efficiency, wireless latency and caused interference. A DRL-based solution for multi-user computation offloading and resource allocation with mobile edge computing (MEC) is presented in [8], where time and energy costs are jointly minimized. Numerical results reveal that Q-learning and deep Q-network (DQN) achieve better sum cost reduction than the baselines. In [1], both safety and quality of service (QoS) are addressed with DQN in a green vehicle-to-infrastructure (V2I) communication scenario. In [6], another DQN-based central scheduler is proposed to obtain the optimal user selection policy in cache-enabled opportunistic interference alignment (IA) networks. A relay scheduler is introduced for the cognitive Internet of Things (CIoT) in [14], and a Q-learning algorithm is designed to jointly reduce power consumption and packet loss. In [13], the coexistence of artificial intelligence (AI) and conventional modules is proposed, and the learning ability of DRL in the cellular network scheduling problem is verified both with and without the help of expert knowledge.

Although intensive efforts have been made on DRL-based scheduling, most works consider rather complicated scenarios that lack a standard baseline or rely on impractical assumptions. Motivated by this, in this paper we focus on the packet scheduling problem in cellular networks to provide a more practical paradigm of AI-enabled wireless networks. Finite buffer size and maximum delay time are taken into consideration. More precisely, a DRL-based scheduler is proposed, in which a DRL agent interacts with the environment by jointly considering throughput, fairness and packet drop rate. Our main contributions are highlighted as follows.

  • Downlink packet scheduling in cellular networks is formulated as a multi-objective optimization problem under the conditions of limited transmission buffer size, limited queuing delay and link adaptation.

  • Two genie-aided heuristic search methods are proposed to probe the performance gain in a fixed time window.

  • Modifications of the DRL method that adapt it to the considered scheduling problem are presented.

  • Numerical results show that the DRL algorithm obtains gains over the baseline in throughput, fairness and packet drop rate, achieving performance similar to the genie-aided methods without using future information.

II Problem Formulation

In a cellular network, the scheduler is of critical importance because it allocates radio resources among user equipments (UEs) while balancing throughput and fairness [2]. As shown in Fig. 1, active UEs in the system wait to be scheduled, and each traffic flow from the upper layer arrives and is stored in the transmission buffer. Packet arrival is modeled as a Poisson process with arrival rate $\lambda$. The scheduler allocates channel resources to UEs in each transmission time interval (TTI) according to the channel condition and buffer state. Then the head-of-line (HoL) packet in the corresponding buffer is sent to the physical layer for transmission.
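The buffer behaviour described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the simulator used in the paper: the class name `UEBuffer`, the parameters `max_packets` and `max_delay`, and the Knuth-style Poisson sampler are all introduced here for the example.

```python
import math
import random
from collections import deque


def poisson_sample(rate, rng):
    """Draw one Poisson-distributed packet count (Knuth's method, small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


class UEBuffer:
    """Finite FIFO buffer; each entry is the arrival TTI of one packet."""

    def __init__(self, max_packets, max_delay):
        self.max_packets = max_packets  # hypothetical buffer size (packets)
        self.max_delay = max_delay      # hypothetical maximum queuing delay (TTIs)
        self.queue = deque()
        self.arrived = 0
        self.dropped = 0
        self.transmitted = 0

    def step(self, now, new_packets):
        """Expire stale packets, then enqueue arrivals, dropping on overflow."""
        while self.queue and now - self.queue[0] > self.max_delay:
            self.queue.popleft()
            self.dropped += 1
        for _ in range(new_packets):
            self.arrived += 1
            if len(self.queue) < self.max_packets:
                self.queue.append(now)
            else:
                self.dropped += 1

    def send_hol(self):
        """Transmit the head-of-line packet; return False if the UE is inactive."""
        if self.queue:
            self.queue.popleft()
            self.transmitted += 1
            return True
        return False
```

In each TTI a simulation loop would call `step(now, poisson_sample(rate, rng))` for every UE and `send_hol()` for the UE chosen by the scheduler; a UE with an empty queue is inactive in that TTI.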

Fig. 1: Transmission system model.

For conventional broadband systems, throughput (THP) and fairness (indicated by Jain's fairness index (JFI) [7]) are two key performance indicators (KPIs). Besides these, it is worth noting that packet loss due to buffer overflow and expiration is also a key factor for practical finite-buffer systems. Thus, we consider the packet drop rate (PDR) as the third KPI. The three metrics are computed from the following quantities: the set of UEs and the set of resource block groups (RBGs); the achievable rate of each UE on each RBG at each time step; the scheduler's decision on whether a given RBG is allocated to a given UE; and the numbers of arrived and transmitted packets of each UE. All three metrics are computed statistically over a sufficiently long period, so the scheduling algorithm should pay more attention to long-term reward than to short-term reward.
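As a rough sketch of how the three KPIs could be computed from logged quantities, the functions below use the standard form of Jain's fairness index. The nested-list layout `rate[t][u][b]` and `alloc[t][u][b]`, the function names, and in particular the PDR definition (the share of arrived packets that were not transmitted) are assumptions made for this example.

```python
def jain_fairness_index(rates):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2), in (0, 1]."""
    total = sum(rates)
    squares = sum(r * r for r in rates)
    if squares == 0:
        return 1.0  # convention chosen here: all-zero rates count as fair
    return total * total / (len(rates) * squares)


def throughput(rate, alloc):
    """Sum of rate[t][u][b] over every (t, u, b) where alloc[t][u][b] == 1."""
    return sum(
        rate[t][u][b] * alloc[t][u][b]
        for t in range(len(rate))
        for u in range(len(rate[t]))
        for b in range(len(rate[t][u]))
    )


def packet_drop_rate(arrived, transmitted):
    """Assumed definition: share of arrived packets that were not transmitted."""
    total_arrived = sum(arrived)
    if total_arrived == 0:
        return 0.0
    return 1.0 - sum(transmitted) / total_arrived
```

For instance, four UEs with equal rates give a fairness index of 1, while serving only one of four UEs gives 0.25.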

Then the scheduling can be formulated as a multi-objective optimization problem whose three objectives are THP maximization, JFI maximization and PDR minimization. The problem is NP-hard, so an optimal solution is difficult to find.

III Genie-aided Scheduling

The three objectives are intertwined with each other and hard to optimize independently. Consequently, a trade-off among them is needed. Pareto optimization, a classical multi-objective optimization approach, can theoretically collect all the non-dominated trade-offs into a Pareto frontier. An optimal trade-off can then be selected for a typical circumstance. However, for a long-term performance optimization problem, it is rather difficult in practice to find the complete Pareto frontier of the scheduling problem because of the huge solution space.

In order to explore the gain space of the scheduling problem, we introduce two genie-aided methods based on Pareto optimization in this section. Genie-aided methods assume that the scheduler is given information that it would not have access to in practice. For example, the scheduler knows all the information in a scheduling window of $T$ TTIs. More precisely, we assume that the scheduler has obtained the channel state information (CSI) and the packet arrivals for the future $T$-TTI duration. The single-RBG scheduling problem over $T$ TTIs for $N$ UEs is then equivalent to a search problem: find one sequence of actions among $N^T$ candidate action groups, as shown in Fig. 2(a), where dark circles denote the scheduled UEs. It is computationally prohibitive to find the optimal solution in this immense search space. Thus, we present only two heuristic algorithms here, namely the genetic algorithm (GA) and the Pareto list algorithm (PLA).

III-A Genetic Algorithm

Fig. 2: Search problem.

Genetic algorithms (GAs) are well suited to search problems involving several, often conflicting, objectives. The main goal is to obtain multiple good action groups that have high objective values and at the same time to maintain diversity. We adopt the nondominated sorting genetic algorithm II (NSGA-II) approach, which uses a fast nondominated sorting procedure, a diversity-preserving operator, and an elitist-preserving approach (see [4] for details).

Fast nondominated sorting procedure: For each solution $p$, two entities are calculated: 1) $n_p$, the number of solutions that dominate the solution $p$; 2) $S_p$, the set of solutions that the solution $p$ dominates. Based on the values of $n_p$ and $S_p$, the population is sorted into different nondomination levels.
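The sorting step can be sketched as follows, keeping a domination count and a dominated set per solution as in NSGA-II [4]. The function names and the convention that every objective is maximized (a minimized metric such as PDR would be negated first) are assumptions of this sketch.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and better in one.

    All objectives are assumed to be maximized.
    """
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def fast_nondominated_sort(points):
    """Return a list of fronts, each a list of indices into points."""
    n = len(points)
    dominated_sets = [[] for _ in range(n)]  # solutions that p dominates
    counts = [0] * n                         # number of solutions dominating p
    for p in range(n):
        for q in range(n):
            if p == q:
                continue
            if dominates(points[p], points[q]):
                dominated_sets[p].append(q)
            elif dominates(points[q], points[p]):
                counts[p] += 1
    # Level 1 holds every solution that nothing dominates.
    fronts = [[p for p in range(n) if counts[p] == 0]]
    while fronts[-1]:
        next_front = []
        for p in fronts[-1]:
            for q in dominated_sets[p]:
                counts[q] -= 1
                if counts[q] == 0:
                    next_front.append(q)
        fronts.append(next_front)
    return fronts[:-1]  # drop the trailing empty front
```

With the objective vectors `(3, 3)`, `(1, 1)`, `(2, 4)` and `(0, 0)`, the first level contains the first and third vectors, because neither dominates the other.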

Diversity-preserving operator: For the members in the same nondomination level, we use the diversity-preserving operator, which consists of a fast crowded distance estimator and a simple crowded comparison procedure, to guide the selection toward a uniformly spread-out Pareto-optimal front.
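A minimal sketch of the two pieces of the diversity-preserving operator, following the usual NSGA-II formulation [4], is given below. The function names are chosen for this example, and objective vectors are plain tuples.

```python
def crowding_distance(front):
    """Crowding distance of each objective vector in one nondomination level."""
    n = len(front)
    distance = [0.0] * n
    if n == 0:
        return distance
    for m in range(len(front[0])):
        order = sorted(range(n), key=lambda i: front[i][m])
        low, high = front[order[0]][m], front[order[-1]][m]
        # Boundary solutions are always kept, so they get infinite distance.
        distance[order[0]] = distance[order[-1]] = float("inf")
        if high == low:
            continue
        for k in range(1, n - 1):
            gap = front[order[k + 1]][m] - front[order[k - 1]][m]
            distance[order[k]] += gap / (high - low)
    return distance


def crowded_less(rank_a, dist_a, rank_b, dist_b):
    """Crowded comparison: prefer lower rank, then larger crowding distance."""
    return rank_a < rank_b or (rank_a == rank_b and dist_a > dist_b)
```

A solution in a sparsely populated region of the front receives a larger distance and therefore wins ties within the same nondomination level.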

Elitist-preserving approach: Elitism helps in achieving better convergence and performance. At each generation, a combined population of parents and offspring is sorted according to both the nondomination levels and the crowded distance, and then the first members, up to the population size, are chosen sequentially.

For GA, the length of the chromosome equals the number of TTIs in the scheduling window, and each gene takes its value from the set of UE indices. The crossover and mutation operators are applied to introduce the possibility of generating new and better action groups.

III-B Pareto List Algorithm

The Pareto list algorithm executes path expansion, sorting and pruning TTI by TTI, constraining the complexity to a fixed level, i.e., the maximum list size.

Firstly, at each TTI, a path is expanded according to the number of active UEs. Note that different paths may lead to different active UEs and be expanded in different ways, because paths affect the states, such as the UE buffers. After expansion, each path records the throughput, fairness and packet drop rate of all the UEs up to the current TTI.

When the number of paths exceeds the maximum list size, path sorting and pruning are needed to bring the number of paths back within that limit. As shown in Fig. 2(b), the gray lines denote pruned paths and the dark lines represent the preserved paths. Unlike the conventional list-based algorithm, in which the path metric is a scalar, path sorting and pruning in the scheduling problem is a multi-objective problem and can be handled in a Pareto-based way. In order to guarantee global convergence and avoid being trapped in a local optimum, both the optimality and the diversity of the preserved paths are needed. Inspired by the NSGA-II algorithm, we propose a modified sorting method that can further improve the path diversity. In the scheduling problem, a large number of paths may result in the same states, gathering at one point. Therefore, in addition to the non-dominated sorting and crowded distance sorting, the paths with the same states are removed in advance.

After the path expansion, sorting and pruning of the last TTI have been executed, a final scheduling path is selected from the path list according to the preference or the typical circumstance to satisfy the system requirements.
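The expand, sort and prune loop can be sketched as below. This is a simplified illustration rather than the authors' implementation: paths are dictionaries with a hashable `"state"` and an `"objectives"` tuple (all maximized), the `expand` callback is supplied by the caller, and the truncation inside the last admitted nondomination level is done in list order instead of by crowded distance. All of these names and simplifications are assumptions.

```python
def dominates(a, b):
    """True if a is no worse than b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def prune_paths(paths, max_list_size):
    """Keep at most max_list_size paths, best nondomination levels first."""
    unique = {}
    for path in paths:
        unique.setdefault(path["state"], path)  # drop paths with repeated states
    remaining = list(unique.values())
    kept = []
    while remaining and len(kept) < max_list_size:
        front = [p for p in remaining
                 if not any(dominates(q["objectives"], p["objectives"])
                            for q in remaining)]
        kept.extend(front[:max_list_size - len(kept)])
        remaining = [p for p in remaining if not any(p is f for f in front)]
    return kept


def pareto_list_search(initial, expand, num_ttis, max_list_size):
    """Expand every surviving path TTI by TTI, pruning after each expansion."""
    paths = [initial]
    for tti in range(num_ttis):
        paths = [child for path in paths for child in expand(path, tti)]
        paths = prune_paths(paths, max_list_size)
    return paths
```

The list returned after the last TTI is the candidate set from which a final scheduling path would be picked according to the system's preference.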

IV Deep Reinforcement Learning based Scheduling

IV-A Markov Decision Process

In wireless networks, most decision-making problems can be modeled as a Markov decision process (MDP). An MDP is typically defined by a tuple $(\mathcal{S}, \mathcal{A}, P, R)$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, $P(s' \mid s, a)$ is the transition probability from state $s$ to $s'$ due to action $a$, and $R$ is the immediate reward when the transition happens.

To tackle the scheduling problem with DRL method, we first define state, action and reward as follows:

State. The input state contains all UEs’ observations. The estimated instantaneous rate, averaged rate, the spare space in the buffer and the waiting time of the HoL packet are concatenated to form the observation of each UE, since the agent should have some knowledge of the buffer state to be buffer-aware.

Action. The action set consists of one-hot codes, each indicating which UE is selected for a transmission. Note that we reuse the same policy network for different RBGs to avoid an exponential increase of the action space, which would incur significant training costs and might prevent convergence.

Reward. As throughput, fairness and packet drop rate are the three KPIs of concern, a straightforward reward function is defined as a linear weighted sum of three terms: the total throughput per TTI, the JFI per TTI, and the total number of dropped packets normalized over the same interval, each scaled by its own weighting factor. Although linear scalarization of a multi-objective problem cannot reach solutions on non-convex parts of the Pareto frontier, we find it still efficient for obtaining a satisfactory result in our training framework.
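A linear weighted-sum reward of this kind could look like the sketch below. The function name, the default weights and the sign convention (the dropped-packet term is subtracted so that positive weights penalize drops) are assumptions for illustration; the weights used in the paper are not reproduced here.

```python
def scalarized_reward(thp, jfi, dropped, weights=(1.0, 1.0, 1.0)):
    """Linear weighted sum of the three per-TTI KPI terms.

    thp, jfi and dropped are assumed to be pre-normalized to comparable
    scales. The dropped-packet term is subtracted so that positive weights
    reward fewer drops (a convention chosen for this sketch).
    """
    w_thp, w_jfi, w_drop = weights
    return w_thp * thp + w_jfi * jfi - w_drop * dropped
```

Changing the weights moves the learned policy along the trade-off: a larger drop weight, for example, makes the agent favor UEs whose packets are close to expiry.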

IV-B Deep Reinforcement Learning based Method

The MDP is usually solved by classical dynamic programming algorithms, e.g., value iteration or policy iteration, if the state transition probability is perfectly known. When the problem becomes complicated and large-scale, model-free DRL methods are alternatives for handling such situations. At each time step $t$, the DRL agent observes state $s_t$ from the environment, makes decision $a_t$, then receives reward $r_t$ and next state $s_{t+1}$. The goal of DRL is to find, through such interaction with the environment, a policy that maximizes the accumulated (discounted) reward.

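The advantage actor-critic (A2C) algorithm is employed to solve the scheduling problem in this paper. Actor-critic is essentially a policy-based DRL algorithm that directly parameterizes the policy as $\pi_\theta(a \mid s)$. The parameters $\theta$ can be updated by gradient ascent on the expected return $J(\theta)$:

$$\nabla_\theta J(\theta) = \nabla_\theta\, \mathbb{E}_{\pi_\theta}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_t\Big], \tag{3}$$

where $\gamma$ is the discount factor, which determines the importance of future reward. The gradient in (3) can be further represented as [11]:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\Big[\sum_{t=0}^{\infty} \Psi_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big], \tag{4}$$

$$A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t), \tag{5}$$

where $\Psi_t$ is the weight applied to each log-probability gradient. The advantage function in (5) shows whether the action $a_t$ is better than the average performance of the current policy $\pi$. If it is, the gradient update should increase the probability of this action.

The estimated gradient has much lower variance if $\Psi_t = A^{\pi}(s_t, a_t)$; however, the advantage function needs to be estimated with another parameterized value function $V_w(s)$:

$$\hat{A}(s_t, a_t) = r_t + \gamma V_w(s_{t+1}) - V_w(s_t). \tag{6}$$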

Fig. 3: Framework of the DRL method.

The framework of the DRL method is shown in Fig. 3. The DRL agent learns from training data generated by interacting with the simulator. Training and evaluation data are obtained by setting a certain sampling window in which both the DRL and the reference performance are collected from the same environment state for a fair comparison. To make the original A2C algorithm applicable to our scheduling problem, we make the following modifications:

$n$-step return. The estimate of the advantage function in (6) is called the one-step temporal difference (TD). In our framework we implement the more general $n$-step TD. Instead of sampling a single transition from $s_t$ to $s_{t+1}$, more actions are taken to collect rewards along the trajectory. This helps improve the advantage estimation by averaging out some variance during gradient updates and leads to more stable training.

$$\hat{A}(s_t, a_t) = \sum_{k=0}^{n-1} \gamma^{k} r_{t+k} + \gamma^{n} V_w(s_{t+n}) - V_w(s_t), \tag{7}$$

where $V_w$ is the parameterized value function.
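The multi-step advantage estimate can be computed with a single backward pass over a rollout, as in the sketch below. It follows a common A2C implementation pattern in which step $t$ of a length-$n$ rollout uses the remaining $n - t$ rewards before bootstrapping from the critic; the function and argument names are assumptions for this example.

```python
def n_step_advantages(rewards, values, bootstrap_value, gamma):
    """Multi-step advantage estimates for one rollout of length len(rewards).

    values[t] is the critic's estimate for the state at step t, and
    bootstrap_value is the critic's estimate for the state reached after
    the last action of the rollout.
    """
    advantages = [0.0] * len(rewards)
    ret = bootstrap_value
    for t in reversed(range(len(rewards))):
        ret = rewards[t] + gamma * ret  # discounted return from step t
        advantages[t] = ret - values[t]
    return advantages
```

With a rollout of length one this reduces to the one-step TD error: reward plus the discounted next-state value minus the current-state value.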


Entropy regularization. Although several environments provide uncorrelated experiences at the same time, the policy still easily converges to a deterministic local optimum. Here, the entropy regularization in (8) is applied to the policy network to enhance the exploration ability of the DRL agent, since exploration comes only from sampling on the policy distribution.

$$H\big(\pi_\theta(\cdot \mid s_t)\big) = -\sum_{a} \pi_\theta(a \mid s_t) \log \pi_\theta(a \mid s_t). \tag{8}$$
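For a discrete action distribution, the entropy term is a one-line computation. The sketch below uses natural logarithms and skips zero-probability actions; the function name is an assumption.

```python
import math


def policy_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

A deterministic policy has zero entropy, and a uniform policy over the candidate UEs has the largest, so adding this term to the training objective rewards policies that keep exploring.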


Action masking. The policy and value networks in our A2C agent are both fully connected networks. A special structure, shown in Fig. 4, is applied to the policy network to handle the inactive-UE problem introduced by the non-full-buffer scenario. A mask is generated from the states and prohibits the policy from choosing inactive UEs, e.g., by subtracting a large value from the corresponding logits; a softmax activation function then outputs the probability distribution.
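The masking step can be illustrated without any deep learning library. In this sketch, `active` is a list of booleans derived from buffer occupancy and `penalty` is the large value subtracted from the logits of inactive UEs; both names and the default penalty are assumptions.

```python
import math


def masked_softmax(logits, active, penalty=1e9):
    """Softmax over UE logits with inactive UEs pushed to ~zero probability."""
    masked = [z if on else z - penalty for z, on in zip(logits, active)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in masked]
    total = sum(exps)
    return [e / total for e in exps]
```

Because the mask is applied before the softmax, the probabilities of the active UEs still sum to one and sampling never selects a UE with an empty buffer.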

Fig. 4: Neural network structure.

Multiple-RBG scheduling. Multiple-RBG scheduling is also considered in our scheme on the basis of the single-RBG case. Assuming that there are $B$ RBGs and $N$ UEs, the maximum dimension of the action space is $N^B$. An iterative reuse of the policy network among the RBGs is adopted in this paper to deal with the curse of dimensionality. A policy network for $N$-UE scheduling is designed, where the state within the same time step is adjusted by the possible influence of the actions already carried out.
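The iterative reuse can be sketched as a loop that calls one per-RBG policy repeatedly and adjusts the state after each decision. The callbacks `policy` and `update_state`, and the toy backlog-based stand-ins below, are hypothetical; in the paper the policy is the trained network and the state adjustment reflects the influence of earlier allocations.

```python
def schedule_rbgs(policy, state, num_rbgs, update_state):
    """Allocate RBGs one by one with a single shared per-RBG policy."""
    allocation = []
    for rbg in range(num_rbgs):
        ue = policy(state, rbg)               # same policy reused for every RBG
        allocation.append(ue)
        state = update_state(state, ue, rbg)  # reflect the decision just made
    return allocation, state


def greedy_backlog_policy(backlog, rbg):
    """Toy stand-in for the policy network: pick the largest backlog."""
    return max(range(len(backlog)), key=lambda u: backlog[u])


def drain_one_packet(backlog, ue, rbg):
    """Toy state adjustment: the chosen UE has one packet fewer to send."""
    updated = list(backlog)
    updated[ue] = max(0, updated[ue] - 1)
    return updated
```

The per-step action space stays at the number of UEs, at the cost of one policy evaluation per RBG.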

To sum up, the full algorithm is described in Alg. 1.

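Algorithm 1 A2C algorithm
Initialize all environments
Initialize actor network $\pi_\theta$ and critic network $V_w$
Initialize experience buffer $\mathcal{D}$
for iteration = 1, 2, … do
     Sample_Batch()
     Update discounted reward $R_i$ for the $i$th experience
     Policy objective: mean of $\log \pi_\theta(a_i \mid s_i)\,(R_i - V_w(s_i))$
     Entropy term: mean entropy of $\pi_\theta(\cdot \mid s_i)$
     MSE of value: mean of $(R_i - V_w(s_i))^2$
     SGD with the combined policy, entropy and value terms
end for
function Sample_Batch()
     for t = 1, …, $n$ do
         Choose action $a_t \sim \pi_\theta(\cdot \mid s_t)$
         Take action $a_t$, observe $r_t$ and $s_{t+1}$
         Store ($s_t, a_t, r_t, s_{t+1}$) into $\mathcal{D}$
     end for
end function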

V Numerical Results and Discussions

Fig. 5: System-level simulator.

An LTE-based system-level simulator is used to evaluate the proposed DRL method, as shown in Fig. 5. The scheduler periodically allocates physical resources for packet delivery to UEs. A feedback mechanism is established between transmitter and receiver so that link adaptation techniques can be applied at the transmitter. Herein, the LTE standard adaptive modulation and coding (AMC) function is adopted, where modulation and coding schemes (MCSs) are chosen in order to achieve a target block error ratio (BLER). Outer loop link adaptation (OLLA) is also implemented to provide fixed-step compensation for feedback imperfection based on the ACK/NACK signaling that UEs report. A DRL module is integrated in the simulator to make better decisions. Proportional fairness (PF) scheduling [12] is the baseline algorithm in the system:

$$u^{*} = \arg\max_{u} \frac{r_u(t)}{\bar{T}_u(t)}, \tag{9}$$

where $r_u(t)$ is the estimated instantaneous rate and $\bar{T}_u(t)$ is the exponential moving average throughput of the $u$th user.
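A compact sketch of the PF baseline for one RBG is given below. The function names, the `forgetting` factor of the moving average and the small `eps` that guards against division by zero are assumptions for illustration, not values taken from the simulator.

```python
def pf_select(inst_rates, avg_thp, active, eps=1e-9):
    """Return the active UE with the largest rate / average-throughput ratio."""
    candidates = [u for u, on in enumerate(active) if on]
    if not candidates:
        return None  # no UE has data to send in this TTI
    return max(candidates, key=lambda u: inst_rates[u] / (avg_thp[u] + eps))


def update_average(avg_thp, served_rates, forgetting=0.01):
    """Exponential moving average of each UE's served throughput."""
    return [(1.0 - forgetting) * avg + forgetting * served
            for avg, served in zip(avg_thp, served_rates)]
```

A UE with a modest instantaneous rate can still win if its average throughput is low, which is how PF trades throughput for fairness.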

To improve the sampling efficiency of the on-policy A2C algorithm, 56 simulators serving the same DRL agent are launched simultaneously for quick and adequate exploration of the state and action space. The UE deployment and random seed of each simulator are different in order to decrease the data correlation within a batch, which is harmful to DRL training. In addition, averaging over simulators with different random seeds also improves generalization and makes the performance evaluation more reasonable.

Fig. 6: Training of the DRL.

The simulation is conducted with UEs and scheduling window TTIs, the maximum number of RBGs is . The neural networks (NNs) used in the DRL agent are fully connected ones with 2 hidden layers, each of which contains 640 neurons. ReLU function is used as the activation function for all the hidden layers. The policy network outputs decision with softmax and linear activation is used for value function.

Fig. 6(a) shows the reward that the DRL agent obtains over the training iterations. We also plot the reward value of the PF algorithm for reference. After 5000 updates, DRL obtains the same reward value as the PF algorithm and converges. The learning rate is further decayed at iteration 5000 so that DRL finally achieves a larger reward than the PF algorithm. The variation of the KPI values during the same training is recorded in Fig. 6(b). The performance is evaluated every 50 updates, and the normalized performance gaps between DRL and the PF algorithm are shown. It is interesting to see that the agent quickly learns a scheduling policy similar to MAX C/I, then slowly converges to a policy that outperforms the PF algorithm in all three KPIs.

After training, we fix the parameters of the NN model and run a performance evaluation. The average performance relative to the baseline over 20000 TTIs and 56 UE deployments is shown in Fig. 7 and Fig. 9. For GA, we use simulated binary crossover operator and polynomial mutation with the crossover probability of and a mutation probability of . Also, the distribution indices for crossover and mutation operators are and , respectively. We set the population size as and generations to conduct the GA. For PLA, the maximum list number of PLA is . Both GA and PLA consider only the single-RBG case due to the complexity.

For the single-RBG case in Fig. 7, we can see that the performances of GA and PLA are similar in our configuration. DRL obtains nearly the same throughput but slightly better JFI and PDR. It should be noted that the DRL algorithm has only one policy for all UE deployments and has no future information when making decisions, which is very different from the genie-aided methods. The gain of DRL comes merely from its ability to learn by exploring and collecting rewards. Fig. 8 plots the performance of DRL in all 56 deployments, where we can see that the THP and PDR gains are obvious across all seeds while JFI stays almost the same as the baseline. We argue that some THP values, e.g., for seed 11 and seed 45, are not failures because these seeds have large JFI, which means they are still somewhere near the Pareto frontier. We believe that in the real world, deployment-specific KPI weightings will help them converge quickly to the required performance.

As for multiple-RBG scheduling, two methods have been tried: a) transfer learning, i.e., the NNs trained for a single RBG with 1.4 MHz bandwidth are directly reused in the 10-RBG system with 20 MHz bandwidth without any retraining; b) training a new model that fits the 10-RBG system. Both results are illustrated in Fig. 9. We find that the model in the first method exhibits great generalization capability when transferred to a system with more resources: the performance is mildly degraded but still better than the baseline. The retraining method further exploits the learning ability of the agent and achieves performance similar to the single-RBG case.

Fig. 7: Performance metrics of DRL and genie-aided methods.
Fig. 8: Performance metrics for DRL of each seed.
Fig. 9: Performance of DRL for multiple RBGs.

VI Conclusion and Future Work

In this paper, we propose a DRL method to solve the scheduling problem in cellular networks. The practical scheduling problem is modeled as a multi-objective optimization problem consisting of long-term throughput maximization, fairness maximization and packet drop rate minimization. Two genie-aided methods are employed to probe the performance gain space. Then a modified A2C algorithm is proposed to solve the considered scheduling problem. The results show that DRL can outperform the baseline PF algorithm and achieve performance similar to the genie-aided methods without using future information. In the future, a multi-agent reinforcement learning (MARL) structure will be considered to further improve the spectral efficiency, e.g., by providing the ability of intelligent inter-cell interference cancellation.


References

  • [1] R. Atallah, C. Assi, and M. Khabbaz (2017) Deep reinforcement learning-based scheduling for roadside communication networks. In 2017 15th International Symposium on Modeling and Optimization in Mobile, Ad Hoc, and Wireless Networks (WiOpt), pp. 1–8.
  • [2] F. Capozzi, G. Piro, L. A. Grieco, G. Boggia, and P. Camarda (2013) Downlink packet scheduling in LTE cellular networks: key design issues and a survey. IEEE Communications Surveys & Tutorials 15 (2), pp. 678–700.
  • [3] U. Challita, W. Saad, and C. Bettstetter (2018-05) Deep reinforcement learning for interference-aware path planning of cellular-connected UAVs. In 2018 IEEE International Conference on Communications (ICC), pp. 1–7. ISSN 1938-1883.
  • [4] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan (2002-04) A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 6 (2), pp. 182–197.
  • [5] N. Farsad and A. Goldsmith (2017) Detection algorithms for communication systems using deep learning. arXiv preprint arXiv:1705.08044.
  • [6] Y. He, Z. Zhang, F. R. Yu, N. Zhao, H. Yin, V. C. Leung, and Y. Zhang (2017) Deep-reinforcement-learning-based optimization for cache-enabled opportunistic interference alignment wireless networks. IEEE Transactions on Vehicular Technology 66 (11), pp. 10433–10445.
  • [7] R. K. Jain, D. W. Chiu, and W. R. Hawe (1984) A quantitative measure of fairness and discrimination. Eastern Research Laboratory, Digital Equipment Corporation, Hudson, MA.
  • [8] J. Li, H. Gao, T. Lv, and Y. Lu (2018-04) Deep reinforcement learning based computation offloading and resource allocation for MEC. In 2018 IEEE Wireless Communications and Networking Conference (WCNC), pp. 1–6. ISSN 1558-2612.
  • [9] N. C. Luong, D. T. Hoang, S. Gong, D. Niyato, P. Wang, Y. Liang, and D. I. Kim (2019) Applications of deep reinforcement learning in communications and networking: a survey. IEEE Communications Surveys & Tutorials, pp. 1–1. ISSN 1553-877X.
  • [10] T. O’Shea and J. Hoydis (2017-12) An introduction to deep learning for the physical layer. IEEE Transactions on Cognitive Communications and Networking 3 (4), pp. 563–575. ISSN 2332-7731.
  • [11] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel (2015) High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.
  • [12] D. Tse (2001) Multiuser diversity in wireless networks. In Wireless Communications Seminar, Stanford University.
  • [13] J. Wang, C. Xu, Y. Huangfu, R. Li, Y. Ge, and J. Wang (2019) Deep reinforcement learning for scheduling in cellular networks. arXiv preprint arXiv:1905.05914.
  • [14] J. Zhu, Y. Song, D. Jiang, and H. Song (2017) A new deep-Q-learning-based transmission scheduling mechanism for the cognitive Internet of Things. IEEE Internet of Things Journal 5 (4), pp. 2375–2385.