Deep Reinforcement Learning for Real-Time Optimization in NB-IoT Networks


Nan Jiang, Student Member, IEEE, Yansha Deng, Member, IEEE, Arumugam Nallanathan, Fellow, IEEE, and Jonathon A. Chambers, Fellow, IEEE
N. Jiang and A. Nallanathan are with the School of Electronic Engineering and Computer Science, Queen Mary University of London, London E1 4NS, UK (e-mail: {nan.jiang, a.nallanathan}). Y. Deng is with the Department of Informatics, King’s College London, London WC2R 2LS, UK (Corresponding author: Yansha Deng). J. A. Chambers is with the Department of Engineering, University of Leicester, Leicester LE1 7RH, UK.

NarrowBand-Internet of Things (NB-IoT) is an emerging cellular-based technology that offers a range of flexible configurations for massive IoT radio access from groups of devices with heterogeneous requirements. A configuration specifies the amount of radio resource allocated to each group of devices for random access and for data transmission. Assuming no knowledge of the traffic statistics, an important challenge is “how to determine the configuration that maximizes the long-term average number of served IoT devices at each Transmission Time Interval (TTI) in an online fashion”. Given the complexity of searching for the optimal configuration, we first develop real-time configuration selection based on tabular Q-learning (tabular-Q), Linear Approximation based Q-learning (LA-Q), and Deep Neural Network based Q-learning (DQN) in the single-parameter single-group scenario. Our results show that the proposed reinforcement learning based approaches considerably outperform the conventional heuristic approaches based on load estimation (LE-URC) in terms of the number of served IoT devices. This result also indicates that LA-Q and DQN can be good alternatives to tabular-Q, achieving almost the same performance with much less training time. We further advance LA-Q and DQN via Actions Aggregation (AA-LA-Q and AA-DQN) and via Cooperative Multi-Agent learning (CMA-DQN) for the multi-parameter multi-group scenario, thereby solving the problem that Q-learning agents do not converge in high-dimensional configurations. In this scenario, the superiority of the proposed Q-learning approaches over the conventional LE-URC approach grows significantly as the configuration dimension increases, and the CMA-DQN approach outperforms the other approaches in both throughput and training efficiency.

I Introduction

To effectively support the emerging massive Internet of Things (mIoT) ecosystem, the 3rd Generation Partnership Project (3GPP) partners have standardized a new radio access technology, namely NarrowBand-IoT (NB-IoT) [1]. NB-IoT is expected to provide reliable wireless access for IoT devices with various types of data traffic, and to meet the requirement of extended coverage. As most mIoT applications favor delay-tolerant data traffic with small size, such as data from alarms, meters, and monitors, the key target of NB-IoT design is to deal with the sporadic uplink transmissions of massive IoT devices [2].

NB-IoT is built from the legacy Long-Term Evolution (LTE) design, but is deployed only in a narrow bandwidth (180 kHz) for Coverage Enhancement (CE) [3]. Different from legacy LTE, NB-IoT defines only two uplink physical channel resources to perform all uplink transmission: the Random Access CHannel (RACH) resource (i.e., using the NarrowBand Physical Random Access CHannel, a.k.a. NPRACH) for RACH preamble transmission, and the data resource (i.e., using the NarrowBand Physical Uplink Shared CHannel, a.k.a. NPUSCH) for control information and data transmission. To support various traffic with different coverage requirements, NB-IoT supports up to three CE groups of IoT devices sharing the uplink resource in the same band. Each group serves IoT devices with different coverage requirements, which are distinguished based on the same broadcast signal from the evolved Node B (eNB) [3]. At the beginning of each uplink Transmission Time Interval (TTI), the eNB selects a system configuration that specifies the radio resource allocated to each group in order to accommodate the RACH procedure along with the remaining resource for data transmission. The key challenge is to optimally balance the allocation of channel resource between the RACH procedure and data transmission so as to maximize successful accesses and transmissions in massive IoT networks. Allocating too much resource to RACH enhances the random access performance, but leaves insufficient resource for data transmission.

Unfortunately, dynamic optimization of the RACH and data transmission resource configuration is an untreated problem in NB-IoT. Generally speaking, the eNB observes the reception outcomes of both RACH (e.g., the numbers of successfully received preambles and collisions) and data transmission (e.g., the numbers of scheduled and unscheduled devices) for all groups at the end of each TTI. This historical information can potentially be used to predict traffic from all groups and to facilitate the optimization of future TTIs’ configurations. Even if one knew all the relevant statistics, tackling this problem in an exact manner would result in a Partially Observable Markov Decision Process (POMDP) with large state and action spaces, which would be generally intractable. The complexity of the problem is compounded by the lack of a priori knowledge at the eNB regarding the stochastic traffic and unobservable channel statistics (i.e., random collisions, and effects of the physical radio including path-loss and fading). The related works are briefly introduced in the following two subsections.

I-1 Related works on real-time optimization in cellular-based networks

In light of this POMDP challenge, prior works [4, 5] studied real-time resource configuration of the RACH procedure and/or data transmission by proposing dynamic Access Class Barring (ACB) schemes to optimize the number of served IoT devices. These optimization problems have been tackled under the simplified assumptions that at most two configurations are allowed and that the optimization is executed for a single group without considering errors due to wireless transmission. In order to consider more complex and practical formulations, Reinforcement Learning (RL) emerges as a natural solution given its capability to interact with the practical environment and to learn from feedback in the form of the number of successful and unsuccessful transmissions per TTI. Distributed RL based on tabular Q-learning (tabular-Q) has been proposed in [6, 7, 8, 9]. In [6, 7, 8], the authors studied distributed tabular-Q in slotted-Aloha networks, where each device learns how to avoid collisions by finding a proper time slot to transmit packets. In [9], the authors implemented tabular-Q agents at the relay nodes for cooperatively selecting their transmit power and transmission probability to optimize the total number of useful received packets per consumed energy. Centralized RL has also been studied in [10, 11, 12], where the RL agent was implemented at the base station site. In [10], a learning-based scheme was proposed for radio resource management in multimedia wide-band code-division multiple access systems to improve spectrum utilization. In [11, 12], the authors studied tabular-Q based ACB schemes in cellular networks, where a Q-agent was implemented at an eNB to select the optimal ACB factor that maximizes the access success probability of the RACH procedure.

I-2 Related works on optimization in NB-IoT

In NB-IoT networks, most existing studies focused either on the resource allocation during the RACH procedure [13, 14], or on that during the data transmission [15, 16]. For the RACH procedure, the access success probability was statistically optimized in [13] using exhaustive search, and the authors in [14] studied fixed-size data resource scheduling for various resource requirements. For the data transmission, [15] presented an uplink data transmission time slot and power allocation scheme to optimize the overall channel gain, and [16] proposed a link adaptation scheme, which dynamically selects the modulation and coding level and the repetition value according to the acknowledgment/negative-acknowledgment feedback to reduce the uplink data transmission block error ratio. More importantly, these works ignore the time-varying heterogeneous traffic of massive IoT devices, and consider a snapshot [13, 15, 16] or the steady-state behavior [14] of NB-IoT networks. The work most relevant to ours is [17], where the authors studied the steady-state behavior of NB-IoT networks from the perspective of a single device. Optimizing some of the parameters of the NB-IoT configuration, namely the repetition value (to be defined below) and the time intervals between two consecutive schedulings of NPRACH and NPDCCH, was carried out in terms of latency and power consumption in [17] using a queuing framework.

Unfortunately, the tabular-Q framework in [11, 12] cannot be used to solve the multi-parameter multi-group optimization problem in uplink resource configuration of NB-IoT networks, due to its inability to handle a high-dimensional state space and variable selection. More importantly, whether the RL-based resource configuration approaches proposed in [11, 12] outperform the conventional resource configuration approaches [4, 5] is still unknown. In this paper, we develop RL-based uplink resource configuration approaches to dynamically optimize the number of served IoT devices in NB-IoT networks. To showcase their efficiency, we compare the proposed RL-based approaches with the conventional heuristic uplink resource allocation approaches. The contributions can be summarized as follows:

  • We develop an RL-based framework to optimize the number of served IoT devices by adaptively configuring uplink resource in NB-IoT networks. The uplink communication procedure in NB-IoT is simulated by taking into account the heterogeneous IoT traffics, the CE group selection, the RACH procedure, and the uplink data transmission resource scheduling. This generated simulation environment is used for training the RL-based agents before deployment, and these agents will be updated according to the real traffic in practical NB-IoT networks in an online manner.

  • We first study a simplified NB-IoT scenario considering the single parameter and the single CE group, where a basic tabular-Q was developed to compare with the revised conventional Load Estimation based Uplink Resource Configuration (LE-URC) scheme. The tabular-Q is further advanced by implementing function approximators with different computational complexities, namely, Linear Approximator (LA-Q) and Deep Neural Networks (Deep Q-Network, a.k.a. DQN) to elaborate their capability and efficiency in dealing with high-dimensional state space.

  • We then study a more practical NB-IoT scenario with multiple parameters and multiple CE groups, where direct implementation of LA-Q or DQN is not feasible due to the increasing size of the parameter combinations. To solve this, we propose Action Aggregation approaches based on LA-Q and DQN, namely, AA-LA-Q and AA-DQN, which guarantee convergence capability by sacrificing certain accuracy in the parameter selection. Finally, a Cooperative Multi-Agent learning approach based on DQN (CMA-DQN) is developed to break down the selection of high-dimensional parameters into multiple parallel sub-tasks, in which a number of DQN agents are cooperatively trained to produce each parameter for each CE group.

  • In the simplified scenario, our results show that the number of served IoT devices with tabular-Q considerably outperforms that with LE-URC, while LA-Q and DQN achieve almost the same performance as that of tabular-Q using much less training time. In the practical scenario, the superiority of Q-learning based approaches over LE-URC significantly improves. Especially, CMA-DQN outperforms all other approaches in terms of both throughput and training efficiency, which is mainly due to the use of DQN enabling operation over a large state space and the use of multiple agents dealing with the large dimensionality of parameters selection.

The rest of the paper is organized as follows. Section II provides the problem formulation and system model. Section III presents the preliminaries and the conventional LE-URC. Section IV proposes Q-learning based uplink resource configuration approaches in the single-parameter single-group scenario. Section V presents the advanced Q-learning based approaches in the multi-parameter multi-group scenario. Section VI elaborates the numerical results, and finally, Section VII concludes the paper.

II Problem Formulation and System Model

As illustrated in Fig. 1(a), we consider a single-cell NB-IoT network composed of an eNB located at the center of the cell, and a set of IoT devices randomly located in an area of the plane, which remain spatially static once deployed. The devices are divided into three CE groups as further discussed below, and the eNB is unaware of the status of these IoT devices, hence no uplink channel resource is scheduled to them in advance. In each IoT device, uplink data is generated according to random inter-arrival processes over the TTIs, which are Markovian and possibly time-varying.

Fig. 1: (a) Illustration of system model; (b) Uplink channel frame structure.

II-A Problem Formulation

With packets waiting for service, an IoT device executes the contention-based RACH procedure in order to establish a Radio Resource Control (RRC) connection with the eNB. The contention-based RACH procedure consists of four steps, where an IoT device transmits a randomly selected preamble, for a given number of times according to the repetition value [1], to initiate the RACH procedure in step 1, and exchanges control information with the eNB in the next three steps [18]. The RACH process can fail if: (i) a collision occurs when two or more IoT devices select the same preamble; or (ii) there is no collision, but the eNB cannot detect a preamble due to low SNR. Note that a collision can still be detected in step 3 of RACH when the collided preambles are not detected in step 1 of RACH, following the 3GPP report [19]. This assumption is different from our previous works [20, 21], which only focus on the preamble detection analysis in step 1 of RACH.

As shown in Fig. 1(b), for each TTI and for each CE group , in order to reduce the chance of a collision, the eNB can increase the number of RACH periods in the TTI or the number of preambles available in each RACH period [22]. Furthermore, in order to mitigate the SNR outage, the eNB can increase the number of times that a preamble transmission is repeated by a device in group in one RACH period [22] of the TTI.

After the RRC connection is established, the IoT device requests uplink channel resource from the eNB for control information and data transmission. As shown in Fig. 1(b), given a total number of resources for uplink transmission in the TTI, the number of available resources for data transmission is written as , where is the overall number of Resource Elements (REs)¹ allocated for the RACH procedure. This can be computed as , where measures the number of REs required for one preamble transmission.

¹ The uplink channel consists of 48 sub-carriers within 180 kHz bandwidth. With a 3.75 kHz tone spacing, one RE is composed of one time slot of 2 ms and one sub-carrier of 3.75 kHz [1]. Note that NB-IoT also supports 12 sub-carriers with 15 kHz tone spacing for NPUSCH, but NPRACH only supports 3.75 kHz tone spacing [1].
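The resource split described above can be illustrated with a short sketch. The names `RachConfig`, `rach_resource`, `data_resource`, `r_uplink`, and `b_rach` are hypothetical, and counting the RACH allocation as the REs per preamble times the per-group product of RACH periods, preambles per period, and repetitions is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RachConfig:
    """RACH configuration of one CE group in one TTI (hypothetical names)."""
    n_rach: int  # number of RACH periods in the TTI
    f_prea: int  # number of preambles available in each RACH period
    n_repe: int  # number of repetitions of each preamble transmission


def rach_resource(configs: Sequence[RachConfig], b_rach: int) -> int:
    """REs consumed by RACH, assuming b_rach REs per single preamble transmission."""
    return b_rach * sum(c.n_rach * c.f_prea * c.n_repe for c in configs)


def data_resource(r_uplink: int, configs: Sequence[RachConfig], b_rach: int) -> int:
    """REs left for data transmission after the RACH allocation."""
    r_rach = rach_resource(configs, b_rach)
    if r_rach > r_uplink:
        raise ValueError("RACH allocation exceeds the uplink resource of the TTI")
    return r_uplink - r_rach
```

The sketch makes the trade-off explicit: every additional RACH period, preamble, or repetition in any group is paid for directly out of the data resource of the same TTI.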

In this work, we tackle the problem of optimizing the RACH configuration defined by parameters for each th group in an online manner for every TTI . In order to make this decision at the beginning of every TTI , the eNB accesses all prior history in TTIs consisting of the following variables: the number of the collided preambles , the number of the successfully received preambles , and the number of idle preambles of the th CE group in the th TTI for the RACH, as well as the number of IoT devices that have successfully sent data and the number of IoT devices that are waiting for being allocated data resource . We denote as the observed history of all such measurements and past actions.

The eNB aims at maximizing the long-term average number of devices that successfully transmit data with respect to the stochastic policy that maps the current observation history to the probabilities of selecting each possible configuration . This problem can be formulated as the optimization


where is the discount rate for the performance in future TTIs and index runs over the CE groups. Since the dynamics of the system is Markovian over the TTI and is defined by the NB-IoT protocol to be further discussed below, this is a POMDP problem that is generally intractable. Approximate solutions will be discussed in Sections III, IV, and V.
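To make the objective concrete, the sketch below evaluates the discounted number of served devices along one observed trajectory; the optimization itself takes the expectation of this quantity over the policy and the traffic. The function name and the layout of its arguments are hypothetical.

```python
from typing import Sequence


def discounted_served(served: Sequence[Sequence[int]], gamma: float) -> float:
    """Discounted sum of served devices over future TTIs and CE groups.

    served[k][i] is the number of devices of CE group i served k TTIs
    after the current one (a single sample path, not an expectation).
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("discount rate must lie in [0, 1]")
    return sum((gamma ** k) * sum(groups) for k, groups in enumerate(served))
```

A small discount rate makes the eNB favor the devices served in the current TTI, whereas a value close to one weights future TTIs almost equally.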

II-B NB-IoT Access Network

We now provide additional details on the model and on the NB-IoT protocol. To capture the effects of the physical radio, we consider the standard power-law path-loss model that the path-loss attenuation is , with the propagation distance and the path-loss exponent . The system is operated in a Rayleigh flat-fading environment, where the channel power gains are exponentially distributed (i.i.d.) random variables with unit mean. Fig. 2 presents the uplink data transmission procedure from the perspective of an IoT device in NB-IoT networks, which consists of the four stages that are explained in the following four subsections to introduce the system model.

Fig. 2: Uplink data transmission procedure from the perspective of an IoT device in NB-IoT networks.

II-B1 Traffic Inter-Arrival

We consider two types of IoT devices with different traffic models, namely periodical traffic and bursty traffic, which together form a heterogeneous traffic scenario for diverse IoT applications [23, 24]. The periodical traffic coming from periodic uplink reporting tasks, such as metering or environmental monitoring, is the most common traffic model in NB-IoT networks [25]. The bursty traffic due to emergency events, such as fire alarms and earthquake alarms, captures the complementary scenario in which a massive number of IoT devices try to establish an RRC connection with the eNB [19]. Due to the nature of slotted-Aloha, an IoT device can only transmit a preamble at the beginning of a RACH period, which means that the IoT devices executing RACH in a RACH period are those that received a packet arrival within the interval since the last RACH period. For the periodical traffic, the first packet is generated using a Uniform distribution over (ms), and is then repeated every ms. The packet inter-arrival rate measured in each RACH period at each IoT device is hence expressed by


where is the number of RACH periods in the th TTI, is the duration between neighboring RACH periods. The bursty traffic is generated within a short period of time starting from a random time . The traffic instantaneous rate in packets in a period is described by a function so that the packets arrival rate in the th RACH period of the th TTI is given by


where is the starting time of the th RACH period in the th TTI, , and the distribution follows the time limited Beta profile given as [19, Section 6.1.1]


In (4), is the Beta function with the constant parameters and [26].
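As an illustration of the bursty traffic model, the sketch below evaluates a time-limited Beta profile and integrates it over one RACH period. The function names are hypothetical, and the default shape parameters 3 and 4 are an assumption based on the values commonly used with the 3GPP Beta traffic model, since the parameter values are not legible in the text above.

```python
import math


def beta_profile(t: float, T: float, alpha: float = 3.0, beta: float = 4.0) -> float:
    """Time-limited Beta density on [0, T]; zero outside the burst window."""
    if T <= 0.0:
        raise ValueError("burst duration T must be positive")
    if not 0.0 <= t <= T:
        return 0.0
    # Beta function expressed through Gamma functions.
    b = math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)
    return t ** (alpha - 1) * (T - t) ** (beta - 1) / (T ** (alpha + beta - 1) * b)


def arrival_probability(t_start: float, t_end: float, T: float,
                        alpha: float = 3.0, beta: float = 4.0,
                        steps: int = 1000) -> float:
    """Probability that a device's bursty packet arrives in [t_start, t_end].

    Uses the midpoint rule; times are measured from the start of the burst.
    """
    lo, hi = max(0.0, t_start), min(T, t_end)
    if hi <= lo:
        return 0.0
    h = (hi - lo) / steps
    return h * sum(beta_profile(lo + (k + 0.5) * h, T, alpha, beta)
                   for k in range(steps))
```

Evaluating `arrival_probability` between the start times of two neighboring RACH periods gives the per-device arrival rate seen by the later period, under the assumptions stated above.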

II-B2 CE Group Determination

Once an IoT device is backlogged, it first determines its associated CE group by comparing the received power of the broadcast signal to the Reference Signal Received Power (RSRP) thresholds according to the rule [27]


In (5), the received power of broadcast signal is expressed as


where is the device’s distance from the eNB, and is the broadcast power of eNB [27]. Note that is obtained by averaging the small-scale Rayleigh fading of the received power [27].
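The CE group rule can be sketched as a comparison of the fading-averaged received broadcast power against two RSRP thresholds. The function and parameter names are hypothetical, and the treatment of values that fall exactly on a threshold is an assumption.

```python
def received_power(p_broadcast: float, distance: float, pathloss_exp: float) -> float:
    """Fading-averaged received broadcast power under power-law path loss."""
    if distance <= 0.0:
        raise ValueError("distance must be positive")
    return p_broadcast * distance ** (-pathloss_exp)


def ce_group(rsrp: float, thr_high: float, thr_low: float) -> int:
    """Map a received power to CE group 0 (best coverage), 1, or 2 (worst)."""
    if thr_low > thr_high:
        raise ValueError("thr_low must not exceed thr_high")
    if rsrp > thr_high:
        return 0
    if rsrp >= thr_low:
        return 1
    return 2
```

Because the small-scale fading is averaged out, the group of a static device depends only on its distance to the eNB, so the initial group assignment does not change between TTIs.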

II-B3 RACH Procedure

After CE group determination, each backlogged IoT device in group repeats a randomly selected preamble times in the next RACH period by using a pseudo-random frequency hopping schedule. The pseudo-random hopping rule is based on the current repetition time as well as the Narrowband Physical Cell ID, and in one repetition, a preamble consists of four symbol groups, which are transmitted with fixed size frequency hopping [28, 20, 1]. Therefore, a preamble is successfully detected if at least one preamble repetition succeeds, which in turn happens if all of its four symbol groups are correctly decoded [20]. Assuming that correct detecting is determined by the SNR level for the th repetition and the symbol group, the correct detecting event can be expressed as


where is the index of symbol group in the th repetition, is the repetition value of the th CE group in the th TTI, means that the preamble symbol group is successfully decoded when its received SNR is above a threshold , and is expressed as


In (8), is the Euclidean distance between the IoT device and the eNB, is the path-loss attenuation factor, is the Rayleigh fading channel power gain from the IoT device to the eNB, is the noise power, and is the preamble transmit power in the th CE group defined as


where is the index of CE groups, IoT devices in the CE group 0 () transmit preambles using full path-loss inversion power control [27], which keeps the received signal power at the eNB from IoT devices at different distances equal to the same threshold , and is the maximal transmit power of an IoT device. The IoT devices in CE group 1 and group 2 always transmit preambles using the maximum transmit power [27].
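The detection rule and the power control described above can be sketched as a small Monte Carlo routine. All names are hypothetical, and drawing an independent unit-mean exponential fading gain for every symbol group of every repetition is an assumption of this sketch.

```python
import random


def transmit_power(group: int, distance: float, pathloss_exp: float,
                   target_rx_power: float, p_max: float) -> float:
    """Group 0 inverts the path loss (capped at p_max); groups 1 and 2 use p_max."""
    if group == 0:
        return min(target_rx_power * distance ** pathloss_exp, p_max)
    return p_max


def preamble_detected(p_tx: float, distance: float, pathloss_exp: float,
                      noise_power: float, snr_threshold: float, n_repe: int,
                      rng: random.Random, n_symbol_groups: int = 4) -> bool:
    """True if at least one repetition has all of its symbol groups decoded."""
    mean_snr = p_tx * distance ** (-pathloss_exp) / noise_power
    for _ in range(n_repe):
        # A repetition succeeds only if every symbol group clears the threshold.
        if all(mean_snr * rng.expovariate(1.0) >= snr_threshold
               for _ in range(n_symbol_groups)):
            return True
    return False
```

Averaging `preamble_detected` over many trials shows how a larger repetition value reduces SNR outage, at the cost of the additional RACH resource that each repetition occupies.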

As shown in the RACH procedure of Fig. 2, if a RACH fails, the IoT device reattempts the procedure until receiving a positive acknowledgement that the RRC connection is established, or exceeding RACH attempts while being part of one CE group. If these attempts exceed , the device switches to a higher CE group if possible [29]. Moreover, the IoT device is allowed to attempt the RACH procedure no more than times before dropping its packets. These two features are counted by and , respectively.

II-B4 Data Resource Scheduling

After the RACH procedure succeeds, the RRC connection is successfully established, and the eNB schedules resource from the data channel resource to the associated IoT device for control information and data transmission, as shown in Fig. 1(b). To allocate data resource among these devices, we adopt a basic random scheduling strategy, whereby an ordered list of all devices that have successfully completed the RACH procedure but have not received a data channel is compiled in random order. In each TTI, devices in the list are considered in order for access to the data channel until the data resource is insufficient to serve the next device in the list. The remaining RRC connections between the unscheduled IoT devices and the eNB will be preserved within at most subsequent TTIs, counted by , and attempts will be made to schedule the device’s data during these TTIs [30, 29]. The condition that the data resource is sufficient in TTI is expressed as


where is the number of scheduled devices limited by the upper bound denoted by IoT devices with successful RACH in the current TTI as well as unscheduled IoT devices in the last TTI , is the number of required REs for serving one IoT device within the th CE group, and is the number of REs per repetition for control signal and data transmission². Note that is the repetition value for the th CE group in the th TTI, which is the same as for preamble transmission [1].

² The basic scheduling unit of NPUSCH is the resource unit (RU), which has two formats. NPUSCH format 1 (NPUSCH-1) uses 16 REs for data transmission, and NPUSCH format 2 (NPUSCH-2) uses 4 REs for carrying control information [3, 22].
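A minimal sketch of the random scheduling strategy follows. The function name, the `(device_id, ce_group)` representation of a pending device, and the per-device cost of the group's repetition value times `b_data` REs are assumptions made for illustration.

```python
import random
from typing import Dict, List, Tuple

Device = Tuple[int, int]  # (device_id, ce_group), a hypothetical representation


def schedule(pending: List[Device], r_data: int, n_repe: Dict[int, int],
             b_data: int, rng: random.Random) -> Tuple[List[Device], List[Device]]:
    """Serve pending devices in random order until the next one no longer fits.

    Returns (scheduled, unscheduled); unscheduled devices keep their RRC
    connection and can be reconsidered in a later TTI.
    """
    order = list(pending)
    rng.shuffle(order)
    scheduled: List[Device] = []
    used = 0
    for idx, (dev, group) in enumerate(order):
        need = n_repe[group] * b_data  # REs required by this device
        if used + need > r_data:
            # Stop at the first device that cannot be served.
            return scheduled, order[idx:]
        used += need
        scheduled.append((dev, group))
    return scheduled, []
```

Note that the loop stops at the first device that does not fit rather than searching the rest of the list for a cheaper device, which follows the description of considering devices in order.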

III Preliminary and Conventional Solutions

III-A Preliminary

The long-term optimization of the number of served IoT devices given in Eq. (1) is highly complex and cannot be easily solved via conventional uplink resource configuration approaches. Therefore, most prior works simplified the objective to dynamically optimizing a single parameter to achieve the maximum number of served IoT devices in a single group without consideration of future performance [4, 5], which is expressed as


where is the optimized single parameter.

To maximize the number of served IoT devices in the th TTI, the configuration is expected to be dynamically adjusted according to the actual number of IoT devices that will execute RACH attempts , which refers to the current load of the network. Note that in practice, this load information cannot be observed at the eNB. Thus, it is necessary to estimate the load based on the previous transmission receptions from the th to th TTI before the uplink resource configuration in the th TTI.

In [5], the authors designed a dynamic ACB scheme to optimize the problem given in Eq. (1) via adjusting the ACB factor. The ACB factor is adapted based on the knowledge of traffic load, which is estimated via moment matching. The estimated number of RACH attempting IoT devices in the th TTI is expressed as:


where is the number of allocated preambles in the th TTI, and is the estimated number of devices performing RACH attempts in the th TTI given as


In Eq. (13), , , and are the ACB factor, the number of preambles and the observed number of collided preambles in the th TTI, and is an estimated factor given in [5, Eq. (32)].

In Eq. (12), is the difference between the estimated numbers of RACH requesting IoT devices in the th and the th TTIs, which is obtained by assuming that the number of successful RACH IoT devices does not change significantly in these two TTIs [5].

This dynamic control approach is designed for an ACB scheme, which is only triggered when the exact traffic load is bigger than the number of preambles (i.e., ). Accordingly, the related backlog estimation approach is only used when . However, it cannot estimate the load when , which is required in our problem.

III-B Resource Configuration in Single Parameter Single CE Group Scenario

In this subsection, we modify the load estimation approach given in [5] by estimating based on the last number of collided preambles and the previous numbers of idle preambles . We then propose an uplink resource configuration approach based on this revised load estimation, namely, LE-URC.

III-B1 Load Estimation

By definition, is the set of valid numbers of preambles that the eNB can choose, where each IoT device selects a RACH preamble from available preambles with an equal probability given by . For a given preamble transmitted to the eNB, let denote the number of IoT devices that select the preamble . The probability that no IoT device selects preamble is


The expected number of preambles experiencing idles in the th TTI is given by


Because the actual number of preambles experiencing idles can be observed at the eNB, the number of RACH attempting IoT devices in the th TTI can be estimated as


To obtain the estimated number of RACH attempting IoT devices in the th TTI , we also need to know the difference between the estimated numbers of RACH attempting IoT devices in the th and the th TTIs, denoted by , where for , and . However, cannot be obtained before the th TTI. To solve this, we can assume according to [5]. This is because the time between two consecutive TTIs is small and the available preambles are updated gradually, so that the number of successful RACH IoT devices does not change significantly between these two TTIs [5]. Therefore, the number of RACH attempting IoT devices in the th time slot is estimated as


where represents that there are at least number of IoT devices colliding in the last TTI.
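The idle-preamble inversion behind this estimate can be sketched as follows. With `n_preambles` preambles and `D` contending devices, the expected number of idle preambles is `n_preambles * (1 - 1/n_preambles)**D`, which is inverted for `D`. The function name is hypothetical, the sketch omits the correction term between consecutive TTIs, and both the clamp for the case of no idle preamble and the lower bound of two devices per collided preamble are assumptions.

```python
import math


def estimate_load(n_preambles: int, n_idle: float, n_collided: int) -> float:
    """Estimate the number of RACH-attempting devices from the idle preambles."""
    if n_preambles < 2:
        raise ValueError("need at least two preambles to invert the idle probability")
    if not 0 <= n_idle <= n_preambles:
        raise ValueError("idle count must lie between 0 and the number of preambles")
    # Assumption: with no idle preamble the inversion diverges, so clamp to one.
    idle = max(n_idle, 1.0)
    inverted = math.log(idle / n_preambles) / math.log(1.0 - 1.0 / n_preambles)
    # Each collided preamble implies at least two colliding devices.
    return max(inverted, 2.0 * n_collided)
```

For instance, five devices contending for ten preambles leave about 5.9 preambles idle on average, and feeding that value back into `estimate_load` recovers a load of five.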

III-B2 Uplink Resource Configuration Based on Load Estimation

In the following, we propose LE-URC by taking into account the resource condition given in Eq. (10). The number of RACH periods and the repetition value are fixed, and only the number of preambles in each RACH period is dynamically configured in each TTI. Using the estimated number of RACH attempting IoT devices in the th TTI , the probability that only one IoT device selects preamble (i.e., no collision occurs) is expressed as


The expected number of RACH attempting IoT devices in the th TTI is derived as


Based on (19), the expected number of IoT devices requesting uplink resource in the th TTI is derived as


where is the number of unscheduled IoT devices in the last TTI. Note that can be observed.

However, if the data resource is not sufficient (i.e., occurs when Eq. (10) is invalid), some IoT devices may not be scheduled in the th TTI. The upper bound of the number of scheduled IoT devices is expressed as


where is the total number of REs reserved for uplink transmission in a TTI, is the uplink resource configured for RACH in the th TTI, and is the number of REs required for serving one IoT device given in Eq. (10).

According to (20) and (21), the expected number of the successfully served IoT devices is given by


The maximal expected number of the successfully served IoT devices is obtained by selecting the number of preambles using


The LE-URC approach based on the estimated load is detailed in Algorithm 1. For comparison, we consider an ideal scenario in which the actual number of RACH requesting IoT devices is available at the eNB, namely, Full State Information based URC (FSI-URC). FSI-URC still configures using the approach given in Eq. (23), while the load estimation approach given in Section III-B1 is not required.

input : The set of the number of preambles in each RACH period , Number of IoT devices , Operation Iteration .
for Iteration to  do
    Initialization of , , , , and bursty traffic arrival rate ;
    for  to  do
        Generate using Eq. (3);
        The eNB observes and , and calculates using Eq. (16);
        Estimate the number of RACH requesting IoT devices using Eq. (17);
        Select the number of preambles using Eq. (23) based on the estimated load ;
        The eNB broadcasts , and backlogged IoT devices attempt communication in the th TTI;
        Update .
    end for
end for
Algorithm 1 Load Estimation Based Uplink Resource Configuration (LE-URC)
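The selection step of LE-URC can be sketched as an exhaustive search over the candidate numbers of preambles. All names are hypothetical. The sketch assumes that the expected number of devices passing RACH equals the expected number of preambles chosen by exactly one device, `load * (1 - 1/f)**(load - 1)`, and that the served devices are capped by the REs remaining after the RACH allocation.

```python
from typing import Iterable, Tuple


def select_preambles(load: float, unscheduled: int, preamble_set: Iterable[int],
                     r_uplink: int, b_rach: int, n_rach: int, n_repe: int,
                     r_per_device: int) -> Tuple[int, float]:
    """Return (number of preambles, expected served devices) for one TTI."""
    best = None
    for f in sorted(preamble_set):
        if f < 2:
            raise ValueError("each candidate must offer at least two preambles")
        r_rach = b_rach * n_rach * n_repe * f  # REs spent on RACH
        if r_rach > r_uplink:
            continue  # configuration does not fit into the TTI
        # Expected devices that select a preamble nobody else selected.
        success = load * (1.0 - 1.0 / f) ** (load - 1.0)
        requesting = success + unscheduled
        # Upper bound on schedulable devices given the remaining REs.
        cap = (r_uplink - r_rach) // r_per_device
        served = min(requesting, cap)
        if best is None or served > best[1]:
            best = (f, served)
    if best is None:
        raise ValueError("no candidate configuration fits the uplink resource")
    return best
```

With ample uplink resource the search favors many preambles to avoid collisions, whereas with scarce resource it settles on fewer preambles so that the devices passing RACH can still be scheduled.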

III-B3 LE-URC for Multiple CE Groups

We slightly revise the single-parameter single-group LE-URC approach (given in Section III-B) to dynamically configure resource for multiple CE groups. Note that the repetition value in the LE-URC approach is still defined as a constant to enable the load estimation in Eq. (17). Recall that the principle of the LE-URC approach is to optimize the expected number of successfully served IoT devices while balancing and with limited uplink resource . In the multiple CE groups scenario, the resource is allocated to IoT devices in any CE group without bias, but is specifically allocated to each CE group.

Under this condition, the expected number of successfully served IoT devices given in Eq. (22) needs to be modified by taking into account multiple variables, which makes it non-convex and greatly complicates the optimization problem. To solve it, we use a sub-optimal solution by artificially setting an uplink resource constraint for each CE group (). Each CE group can then independently allocate the resource between and according to the approach given in Eq. (23).

IV Q-Learning Based Resource Configuration in Single-Parameter Single-Group Scenario

RL approaches are well known for addressing dynamic control problems in complex POMDPs [31]. Nevertheless, they have rarely been studied for resource configuration in slotted-Aloha based wireless communication systems. Therefore, it is worthwhile to first evaluate the capability of RL in the single-parameter single-group scenario, so that it can be compared with conventional heuristic approaches. In this section, we consider a single CE group with fixed RACH periods as well as a fixed repetition value , and dynamically configure only the number of preambles at the beginning of each TTI. In the following, we first study tabular-Q based on the tabular representation of the value function, which is the simplest form of Q-learning with guaranteed convergence [31], but requires an extremely long training time. We then study Q-learning with function approximators to improve training efficiency, where LA-Q and DQN will be used to construct an approximation of the desired value function.

IV-A Q-Learning and Tabular Value Function

Considering a Q-agent deployed at the eNB to optimize the number of successfully served IoT devices in real-time, the Q-agent needs to explore the environment in order to choose appropriate actions progressively leading to the optimization goal. We define , , and as any state, action, and reward from their corresponding sets, respectively. At the beginning of the th TTI (), the Q-agent first observes the current state corresponding to a set of previous observations (=}) in order to select a specific action . The action corresponds to the number of preambles in each RACH period in the single CE group scenario.

Fig. 3: The Tabular-Q agent and environment interaction in the POMDP.

As shown in Fig. 3, we consider a basic state function in the single CE group scenario, where is a set of indices mapping to the current observed information . With the knowledge of the state , the Q-agent chooses an action from the set , which is a set of indices mapped to the set of the number of available preambles . Once an action is performed, the Q-agent will receive a scalar reward , and observe a new state . The reward indicates to what extent the executed action can achieve the optimization goal, which is determined by the newly observed state . As the optimization goal is to maximize the number of successfully served IoT devices, we define the reward as a function that is positively proportional to the observed number of successfully served IoT devices , which is defined as


where is a constant used to normalize the reward function.

Q-learning is a value-based RL approach [31, 32], where the policy of states to actions mapping is learned using a state-action value function to determine an action for the state . We first use a lookup table to represent the state-action value function (tabular-Q), which consists of value scalars for all the state and action spaces. To obtain an action , we select the highest value scalar from the numerical value vector , which maps all possible actions under to the Q-value table .

Accordingly, our objective is to find an optimal Q-value table with optimal policy that can select actions to dynamically optimize the number of served IoT devices. To do so, we train an initial Q-value table in the environment using the Q-learning algorithm, where is immediately updated using the current observed reward after each action as


where is a constant step-size learning rate that affects how fast the algorithm adapts to a new environment, is the discount rate that determines how current rewards affect the value function updating, approximates the value in the optimal Q-value table via the up-to-date Q-value table and the obtained new state . Note that in Eq. (25) is a scalar, which means that we can only update one value scalar in the Q-value table with one received reward .
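The one-step update described above can be sketched as follows. This is a minimal illustration of the standard tabular Q-learning rule rather than the authors' implementation; the function name `q_update`, the list-of-lists table layout, and the hyperparameter values are assumptions made for the example.

```python
def q_update(q_table, state, action, reward, next_state, lr=0.01, gamma=0.5):
    """One-step Q-learning update; only the entry (state, action) changes."""
    best_next = max(q_table[next_state])      # greedy value estimate of the new state
    td_error = reward + gamma * best_next - q_table[state][action]
    q_table[state][action] += lr * td_error   # move the entry towards the target
    return td_error


# Toy table (assumed sizes): 3 states x 2 actions, initialised to zero.
q = [[0.0, 0.0] for _ in range(3)]
q_update(q, state=0, action=1, reward=1.0, next_state=2, lr=0.1, gamma=0.5)
```

After this single call only `q[0][1]` has changed, which mirrors the remark that one received reward updates one value scalar.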

As shown in Fig. 3, we consider -greedy approach to balance exploitation and exploration in the Actor of the Q-Agent, where is a positive real number and . In each TTI , the Q-agent randomly generates a probability to compare with . Then, with the probability , the algorithm randomly chooses an action from the remaining feasible actions to improve the estimate of the non-greedy action’s value. With the probability , the algorithm exploits the current knowledge of the Q-value table to choose the action that maximizes the expected reward.
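A minimal sketch of this exploration rule is given below. The helper name `epsilon_greedy` is hypothetical, and the exploration branch samples uniformly over all actions, which is a simplifying assumption about how "the remaining feasible actions" are drawn.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        # Exploration: uniform draw over all action indices (assumption).
        return rng.randrange(len(q_values))
    # Exploitation: index of the largest Q-value for the current state.
    return max(range(len(q_values)), key=q_values.__getitem__)
```

With `epsilon=0.0` the function always returns the greedy index, and with `epsilon=1.0` it always explores.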

Particularly, the learning rate is suggested to be set to a small number (e.g., ) to guarantee the stable convergence of the Q-value table in this NB-IoT communication system. This is because a single reward in a specific TTI can be severely biased, as the state function is composed of multiple pieces of unobserved information with unpredictable distributions (e.g., an action allows for the setting with a large number of preambles , but massive random collisions accidentally occur, which leads to an unusually low reward). In the following, the implementation of uplink resource configuration using tabular-Q based real-time optimization is shown in Algorithm 2.

input : Valid numbers of preambles , Number of IoT devices , Operation Iteration .
1 Algorithm hyperparameters: learning rate , discount rate , -greedy rate ;
2 Initialization of the Q-value table with value scalars;
3 for Iteration to  do
4        Initialization of by executing a random action and bursty traffic arrival rate ;
5        for  to  do
6               Update using Eq. (3);
7               if  then select a random action from ;
8               else select ;
9               The eNB broadcasts and backlogged IoT devices attempt communication in the th TTI;
10               The eNB observes , calculates the related using Eq. (24), and updates using Eq. (25).
11        end for
12 end for
Algorithm 2 Tabular-Q Based Uplink Resource Configuration
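The loop structure of Algorithm 2 can be illustrated with the sketch below. The environment `toy_tti` is a deliberately simplified, hypothetical stand-in (a fixed number of devices, uniform preamble choice, and a linear RACH/data trade-off), not the NB-IoT model of this paper; the state definition, the reward normalisation, and all default values are likewise assumptions.

```python
import random


def toy_tti(n_preambles, n_devices, total_res, rng):
    """Hypothetical one-TTI model: returns the number of served devices."""
    picks = [rng.randrange(n_preambles) for _ in range(n_devices)]
    # A preamble succeeds only if exactly one device picked it.
    singles = sum(1 for p in set(picks) if picks.count(p) == 1)
    data_slots = max(total_res - n_preambles, 0)  # resource left for data
    return min(singles, data_slots)


def train_tabular_q(preamble_set, n_devices=8, total_res=20, episodes=200,
                    ttis=50, lr=0.05, gamma=0.5, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    n_actions = len(preamble_set)
    # State (assumption): number of devices served in the previous TTI.
    q = [[0.0] * n_actions for _ in range(n_devices + 1)]
    for _ in range(episodes):
        # Initialise the state by executing a random action.
        state = toy_tti(rng.choice(preamble_set), n_devices, total_res, rng)
        for _ in range(ttis):
            if rng.random() < epsilon:
                action = rng.randrange(n_actions)                    # explore
            else:
                action = max(range(n_actions), key=q[state].__getitem__)
            served = toy_tti(preamble_set[action], n_devices, total_res, rng)
            reward = served / n_devices                              # normalised reward
            target = reward + gamma * max(q[served])
            q[state][action] += lr * (target - q[state][action])
            state = served
    return q
```

Calling `train_tabular_q([4, 8, 16])` returns a table with one row per state and one column per candidate number of preambles.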

IV-B Value Function Approximation

Since tabular-Q needs each of its elements to be updated in order to converge, searching for an optimal policy can be difficult with limited time and computational resources. To solve this problem, we use a value function approximator instead of the Q-value table to find a sub-optimal approximated policy. Generally, selecting an efficient approximation approach to represent the value function for different learning scenarios is a common problem in RL [31, 33, 34, 35]. A variety of function approximation approaches can be used, such as LA, DNNs, and tree search, and the choice of approach can critically influence the success of learning [31, 34, 35]. The function approximation should fit the complexity of the desired value function, and be efficient in obtaining good solutions. Unfortunately, most function approximation approaches require specific design for different learning problems, and there is no basis function that is both reliable and efficient for all learning problems.

In this subsection, we first focus on the linear function approximation for Q-learning, due to its simplicity, efficiency, and guaranteed convergence [31, 36, 37]. We then employ a DNN for Q-learning as a more effective but more complicated function approximator, which is also known as DQN [32]. We use DQN for the following reasons: 1) the DNN function approximation is able to deal with several kinds of partially observable problems [31, 32]; 2) DQN has the potential to accurately approximate the desired value function while addressing a problem with very large state spaces [32], which is favorable for learning in the multiple CE group scenario; 3) DQN is highly scalable, in that the scale of its value function can be easily adapted to a more complicated problem; 4) a variety of libraries have been established to facilitate building DNN architectures and accelerate experiments, such as TensorFlow, PyTorch, Theano, and Keras.

IV-B1 Linear Approximation

LA-Q uses a linear weight matrix to approximate the value function with feature vector corresponding to the state . The dimensions of the weight matrix are , where is the total number of all available actions and is the size of the feature vector . Here, we consider polynomial regression (as [31, Eq. 9.17]) to construct the real-valued feature vector due to its efficiency (footnote 3: The polynomial case is the most well understood feature constructor and always performs well in practice with appropriate setting [31, 33]. Furthermore, the results in [38] show that there is a rough correspondence between a fitted neural network and a fitted ordinary parametric polynomial regression model. These reasons encourage us to compare the polynomial based LA-Q with DQN.). In the training process, the exploration is the same as in tabular Q-learning, i.e., by generating random actions, but the exploitation is calculated using the weight matrix of the value function. In detail, to predict an action using the LA value function with state in the th TTI, the approximated value function scalars for each action are obtained by taking the inner product between the weight matrix and the feature vector as:


By searching for the maximal value function scalar in given in Eq. (26), we can obtain the matched action to maximize future rewards.

To obtain the optimal policy, we update the weight matrix in the value function using Stochastic Gradient Descent (SGD) [31, 39]. SGD minimizes the error on predictions of observation after each example, where the error is reduced by a small amount following the direction to the optimal target policy . As it is infeasible to obtain the optimal target policy by summing over all states, we instead estimate the desired action-value function by simply considering one learning sample [31]. In each TTI, the weight matrix is updated following


where is the learning rate. is the gradient of the loss function used to train the Q-function approximator. This is given as


where is the weight matrix, is the feature matrix with the same shape as . is constructed by zeros and the feature vector located in the row corresponding to the index of the action selected in the th TTI . Note that is a scalar. The learning procedure follows Algorithm 2 by changing the Q-table to the LA value function with linear weight matrix , and updating with SGD given in (28) in step 10 of Algorithm 2.
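The prediction and update steps of LA-Q can be sketched as below, assuming a polynomial basis in the spirit of [31, Eq. 9.17] and a standard semi-gradient Q-learning step. The function names, the polynomial order, and the hyperparameter values are illustrative assumptions; since the feature matrix is zero outside the row of the selected action, the sketch simply updates that row.

```python
from itertools import product


def poly_features(state, order):
    """Polynomial basis: one feature per exponent tuple with entries in 0..order."""
    feats = []
    for exps in product(range(order + 1), repeat=len(state)):
        value = 1.0
        for s, c in zip(state, exps):
            value *= s ** c
        feats.append(value)
    return feats


def q_values(weights, feats):
    """Q(s, a) for every action: inner product of each weight row with the features."""
    return [sum(w * x for w, x in zip(row, feats)) for row in weights]


def la_q_update(weights, state, action, reward, next_state, order, lr, gamma):
    """Semi-gradient SGD step; only the row of the selected action is modified."""
    x = poly_features(state, order)
    next_q = q_values(weights, poly_features(next_state, order))
    td_error = reward + gamma * max(next_q) - q_values(weights, x)[action]
    weights[action] = [w + lr * td_error * xi for w, xi in zip(weights[action], x)]
    return td_error


# Toy usage (assumed sizes): 2 actions, scalar state, second-order polynomial.
weights = [[0.0] * 3 for _ in range(2)]
err = la_q_update(weights, (0.5,), 0, 1.0, (1.0,), order=2, lr=0.1, gamma=0.5)
```

For the scalar state 0.5 the feature vector is `[1, 0.5, 0.25]`, so a single update with unit error scales exactly these features into the first row.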

IV-B2 Deep Q-Network

The DQN agent parameterizes the action-state value function by using a function , where represents the weights matrix of a DNN with multiple layers. We consider the conventional DNN, where neurons between two adjacent layers are fully pairwise connected, namely fully-connected layers. The input of the DNN is given by the variables in state ; the intermediate hidden layers are Rectified Linear Units (ReLUs), which use the function max(0, x); while the output layer is composed of linear units, which are in one-to-one correspondence with all available actions in . (Footnote 4: Linear activation is used here according to [32]. Note that Q-learning is value-based, thus the desired value function given in Eq. (15) can be bigger than 1, rather than a probability, and thus an activation function with a bounded return value (such as the sigmoid function and the tanh function) can lead to convergence difficulty.)
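The forward pass of such a network can be sketched in plain Python as follows. The layer sizes, the He-style random initialisation, and the function names are assumptions for illustration; a practical implementation would rely on one of the DNN libraries mentioned earlier.

```python
import random


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully-connected layer."""
    scale = (2.0 / n_in) ** 0.5          # He-style scaling (assumption)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(layers, state):
    """ReLU on hidden layers, linear (identity) activation on the output layer."""
    activ = list(state)
    for i, (weights, biases) in enumerate(layers):
        activ = [sum(w * a for w, a in zip(row, activ)) + b
                 for row, b in zip(weights, biases)]
        if i < len(layers) - 1:          # hidden layer: apply max(0, x)
            activ = [max(0.0, z) for z in activ]
    return activ


# Toy usage (assumed sizes): 4 state variables, two hidden layers, 3 actions.
rng = random.Random(0)
sizes = [4, 16, 16, 3]
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
q_out = forward(layers, [0.1, 0.2, 0.3, 0.4])   # one Q-value per action
```

Because the output units are linear, the returned Q-values are unbounded and may be negative or larger than one.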

Fig. 4: The DQN agent and environment interaction in the POMDP.
input : The set of numbers of preambles in each RACH period , the number of IoT devices , and operation iteration .
1 Algorithm hyperparameters: learning rate , discount rate , -greedy rate , target network update frequency ;
2 Initialization of replay memory to capacity , the primary Q-network , and the target Q-network ;
3 for Iteration to  do
4        Initialization of by executing a random action and bursty traffic arrival rate ;
5        for  to  do
6               Update using Eq. (3);
7               if  then select a random action from ;
8               else select ;
9               The eNB broadcasts and backlogged IoT devices attempt communication in the th TTI;
10               The eNB observes , and calculates the related using Eq. (24);
11               Store transition in replay memory ;
12               Sample random minibatch of transitions from replay memory ;
13               Perform a gradient descent step for using Eq. (30);
14               Every steps update target Q-network .
15        end for
16 end for
Algorithm 3 DQN Based Uplink Resource Configuration

The exploitation is obtained by performing forward propagation of the Q-function with respect to the observed state . The weights matrix is updated online along each training episode by using double deep Q-learning (DDQN) [40], which to some extent reduces the substantial overestimations of the value function (footnote 5: Overestimation refers to the fact that some suboptimal actions are regularly given higher Q-values than optimal actions, which can negatively influence the convergence capability and training efficiency of the algorithm [40, 34].). Accordingly, learning takes place over multiple training episodes, with each episode of duration TTI periods. In each TTI, the parameter of the Q-function approximator is updated using SGD as


where is the RMSProp learning rate [41], and is the gradient of the loss function used to train the Q-function approximator. This is given as


where the expectation is taken with respect to a so-called minibatch, which consists of randomly selected previous samples for some , with being the replay memory [32]. When is negative, this is interpreted as including samples from the previous episode. The use of a minibatch, instead of a single sample, to update the value function improves the convergence reliability of the value function [32]. Furthermore, following DDQN [40], in (30), is a so-called target Q-network that is used to estimate the future value of the Q-function in the update rule. This parameter is periodically copied from the current value and kept fixed for a number of episodes [40].
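The replay memory and the double-Q target can be sketched as follows, assuming the standard DDQN rule in which the primary network selects the next action and the target network evaluates it. `ReplayMemory`, `ddqn_targets`, and the callables `q_primary` and `q_target` (each mapping a state to a list of Q-values) are hypothetical names introduced for this example.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity store of (state, action, reward, next_state) transitions."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped first

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        """Uniformly sample a minibatch without replacement."""
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))


def ddqn_targets(batch, q_primary, q_target, gamma):
    """Double-DQN targets: primary network picks the action, target network scores it."""
    targets = []
    for _state, _action, reward, next_state in batch:
        primary_next = q_primary(next_state)
        best = max(range(len(primary_next)), key=primary_next.__getitem__)
        targets.append(reward + gamma * q_target(next_state)[best])
    return targets
```

The squared difference between these targets and the primary network's Q-values for the stored actions is then averaged over the minibatch to form the loss; decoupling selection from evaluation is what mitigates overestimation.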

V Q-Learning Based Resource Configuration in Multi-Parameter Multi-Group Scenario

Practically, NB-IoT is always deployed with multiple CE groups to serve IoT devices with various coverage requirements. In this section, we study the problem (1) of optimizing the resource configuration for three CE groups each with parameters . This joint optimization by configuring each parameter in each CE group can improve the overall data access and transmission performance. Note that each CE group shares the uplink resource in the same bandwidth, and the eNB schedules data resource to all RRC connected IoT devices without the CE group bias as introduced in Sec. II.B.4). To optimize the number of served IoT devices in real-time, the eNB should not only balance the uplink resource between RACH and data, but also balance them among each CE group.

The Q-learning algorithms for the single CE group provided in Sec. IV are model-free, and thus their learning structure can be directly used in this multi-parameter multi-group scenario. However, considering multiple CE groups enlarges the observation space, which exponentially increases the size of the state space. To train a Q-agent under this expansion, the requirements on time and computational resources greatly increase. In such a case, tabular-Q would be extremely inefficient, as not only does the state-action value table require a large memory, but it is also impossible to repeatedly experience every state to achieve convergence within limited time. In view of this, we only study Q-learning with value function approximation (LA-Q and DQN) to design uplink resource configuration approaches for the multi-parameter multi-group scenario.

LA-Q and DQN are highly capable of handling massive state spaces, and thus we can considerably enrich the state space with more observed information to support the optimization of the Q-agent. Here, we define the current state to include information about the last TTIs (). This design improves the Q-agent by enabling it to estimate the trend of traffic. As our goal is to optimize the number of served IoT devices, the reward function should be defined according to the number of successfully served IoT devices of each CE group, which is expressed as


Like the state space, the action space also increases exponentially with the number of adjustable configurations. The number of available actions corresponds to the possible combinations of configurations (i.e., denotes the number of elements in any vector , is the set of actions, , , and are the sets of the number of RACH periods, the repetition value, and the number of preambles in each RACH period). Unfortunately, it is extremely hard to optimize the system under such a large action space (i.e., can be over fifty thousand), because the system will fall into updating the policy with only a small part of the actions in , which finally leads to convergence difficulty. To solve this problem, we provide two approaches that reduce the dimension of the action space to enable LA-Q and DQN in the multi-parameter multi-group scenario.

V-A Actions Aggregated Approach

We first provide AA based Q-learning approaches, which guarantee convergence capability by sacrificing the accuracy of action selection (footnote 6: Action aggregation has rarely been evaluated, but the same idea, namely state aggregation, which is a basic function approximation approach, has been well studied [31].). In detail, the specific action selection can be converted to the selection of an increasing or decreasing trend. Instead of selecting the exact values from the sets of , , and , we convert it to a single-step ascent/descent based on the last action, which is represented by , , and for the number of RACH periods , the repetition values , and the number of preambles in each RACH period in the th TTI. Consequently, the size of total action spaces for the three CE groups is reduced to ==. By doing so, the algorithms for training with the LA function approximator and DQN in the multi-parameter multi-group scenario can be deployed following Algorithm 2 and Algorithm 3, respectively.
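A minimal sketch of this aggregation is given below, under the assumption that each of the nine configuration variables (three parameters for each of three CE groups) moves exactly one position up or down in its ordered value set per TTI, and that a step beyond either end of the set is clamped. Both the binary trend and the clamping rule are assumptions of this example, as are the function names.

```python
from itertools import product


def apply_trend(index, step, set_size):
    """Move one position up (+1) or down (-1), clamped to the valid index range."""
    return min(max(index + step, 0), set_size - 1)


def aggregated_actions(n_variables):
    """All joint up/down trends: 2 ** n_variables combinations."""
    return list(product((-1, +1), repeat=n_variables))


# Toy usage: nine variables, each with an assumed value set of four entries.
actions = aggregated_actions(9)
indices = [1] * 9
indices = [apply_trend(i, s, 4) for i, s in zip(indices, actions[0])]
```

Under these assumptions the joint aggregated action space has 2^9 = 512 entries, compared with the tens of thousands of exact configuration combinations mentioned above.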

V-B Cooperative Multi-agent Learning Approach

Although the uplink resource configuration is managed by a central authority, identifying the control of each parameter as one sub-task that is cooperatively handled by independent Q-agents is sufficient to deal with the problem with unsolvable action spaces [42]. As shown in Fig. 5, we consider multiple DQN agents centralized at the eNB with the same structure of value function approximator following Section IV.B.2) (footnote 7: The structures of the value function approximator can also be specifically designed for RL agents with sub-tasks of significantly different complexity. However, there is no such requirement in our problem, so it will not be considered.). We break down the action space by considering nine separate action variables in , where each DQN agent controls its own action variable as shown in Fig. 5. Recall that we have three variables for each group , namely , , and .

Fig. 5: The CMA-DQN agents and environment interaction in the POMDP.

We introduce a separate DQN agent for each output variable in defined as action selected by the th agent, where each th agent is responsible for updating the value of action in shared state . The DQN agents are trained in parallel and receive the same reward signal given in Eq. (31) at the end of each TTI as per problem (1). The use of this common reward signal ensures that all DQN agents aim to cooperatively increase the objective in (1). Note that the approach can be interpreted as applying a factorization of the overall value function akin to the approach proposed in [43] for multi-agent systems.

The challenge of this approach is how to evaluate each action according to the common reward function. For each DQN agent, the received reward is corrupted by massive noise, where its own effect on the reward is deeply hidden in the effects of all other DQN agents. For instance, a positive action can receive a mismatched low reward due to other DQN agents' negative actions. Fortunately, in our scenario, all DQN agents are centralized at the eNB, which means that all DQN agents can have full information about each other. Accordingly, we adopt the action selection histories of each DQN agent as part of the state function, so that the agents are able to learn how the reward is influenced by different combinations of actions (footnote 8: The state function can be designed to collect more information according to the complexity requirements, such as sharing the value function between each DQN agent [42].). To do so, we define state variable as


where is the number of stored observations, is the set of selected action of each DQN agent in the th TTI corresponding to , , and for the th CE group, and is the set of observed transmission receptions.
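One way to maintain such a state is a sliding window over the most recent joint actions and observations, as sketched below. The class name `HistoryState`, the flat-vector layout, and the zero padding before enough history is available are assumptions of this example.

```python
from collections import deque


class HistoryState:
    """Keeps the last `depth` (actions, observations) frames as a flat state vector."""

    def __init__(self, depth, n_actions, n_obs):
        pad = [0.0] * (n_actions + n_obs)           # zero padding (assumption)
        self.frames = deque([pad] * depth, maxlen=depth)

    def push(self, actions, observations):
        """Append the newest frame; the oldest one is discarded automatically."""
        self.frames.append(list(actions) + list(observations))

    def vector(self):
        """Concatenate all stored frames, oldest first."""
        return [v for frame in self.frames for v in frame]
```

For the scenario above, `n_actions` would be nine (one entry per DQN agent) and the resulting vector would be shared by all agents as their common input.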

In each TTI, the parameters of the Q-function approximator are updated using SGD at all agents as in Eq. (29). The learning algorithm can be implemented following Algorithm 3. Different from the single-parameter single-group scenario, we need to first initialize nine primary networks , target networks , and replay memories for each DQN agent. In step 11 of Algorithm 3, the current transitions of each DQN agent should be stored in their own memory separately. In steps 12 and 13 of Algorithm 3, the minibatch of transitions should be separately sampled from each memory to train the corresponding DQN agent.
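The per-agent bookkeeping described above can be sketched as follows. The class `Agent`, the function `cma_step`, and the callable `env_step` (which maps a joint action to a common reward and the next shared state) are hypothetical; the Q-function of each agent is reduced to a placeholder callable so that the example stays self-contained.

```python
import random
from collections import deque


class Agent:
    """One agent per action variable, with its own replay memory."""

    def __init__(self, n_actions, capacity, q_fn=None):
        self.n_actions = n_actions
        self.memory = deque(maxlen=capacity)
        # Placeholder Q-function (assumption): all-zero Q-values.
        self.q_fn = q_fn or (lambda state: [0.0] * n_actions)

    def act(self, state, epsilon, rng):
        if rng.random() < epsilon:
            return rng.randrange(self.n_actions)      # explore
        q = self.q_fn(state)
        return max(range(self.n_actions), key=q.__getitem__)


def cma_step(agents, state, env_step, epsilon, rng):
    """One TTI: all agents act, share one reward, and store transitions separately."""
    joint_action = [agent.act(state, epsilon, rng) for agent in agents]
    reward, next_state = env_step(joint_action)       # common reward signal
    for agent, action in zip(agents, joint_action):
        agent.memory.append((state, action, reward, next_state))
    return joint_action, reward, next_state
```

Each agent would then sample a minibatch from its own memory and apply the DQN update of Algorithm 3 independently of the others.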

VI Simulation Results

In this section, we evaluate the performance of the proposed Q-learning approaches and compare them with the conventional LE-URC and FSI-URC described in Sec. III via numerical experiments. We adopt the standard network parameters listed in Table I following [1, 3, 25, 29, 22], and the hyperparameters for Q-learning listed in Table II. Accordingly, one episode consists of 937 TTIs (i.e., 10 minutes). The RL agents are first trained in a so-called learning phase, and after convergence, their performance is compared with LE-URC and FSI-URC in a so-called testing phase. All testing performance results are obtained by averaging over 1000 episodes. In the following, we present our simulation results for the single-parameter single-group scenario and the multi-parameter multi-group scenario in Section VI-A and Section VI-B, respectively.

Parameters | Setting | Parameters | Setting
Path-loss exponent | 4 | Noise power | -138 dBm
eNB broadcast power | 35 dBm | Path-loss inverse power control threshold | 120 dB
Maximal preamble transmit power | 23 dBm | The received SNR threshold | 0 dB
Duration of periodic traffic | 1 hour | TTI | ms
Duration of bursty traffic | 10 minutes | Set of number of preambles | {}
Maximum allowed resource requests | 5 | Set of repetition value | {}
Maximum RACH attempts | 10 | Set of number of RACH periods | {}
Maximum allowed RACH in one CE | 5 | REs required for | 4
Bursty traffic parameter Beta() | (3,4) | REs required for |