Deep Reinforcement Learning for Real-Time Optimization in NB-IoT Networks
Abstract
NarrowBand Internet of Things (NB-IoT) is an emerging cellular-based technology that offers a range of flexible configurations for massive IoT radio access from groups of devices with heterogeneous requirements. A configuration specifies the amount of radio resource allocated to each group of devices for random access and for data transmission. Assuming no knowledge of the traffic statistics, there exists an important challenge in “how to determine the configuration that maximizes the long-term average number of served IoT devices at each Transmission Time Interval (TTI) in an online fashion”. Given the complexity of searching for the optimal configuration, we first develop real-time configuration selection based on tabular Q-learning (tabular-Q), Linear Approximation based Q-learning (LA-Q), and Deep Neural Network based Q-learning (DQN) in the single-parameter single-group scenario. Our results show that the proposed reinforcement learning based approaches considerably outperform the conventional heuristic approaches based on load estimation (LE-URC) in terms of the number of served IoT devices. This result also indicates that LA-Q and DQN can be good alternatives to tabular-Q, achieving almost the same performance with much less training time. We further advance LA-Q and DQN via Action Aggregation (AA-LA-Q and AA-DQN) and via Cooperative Multi-Agent learning (CMA-DQN) for the multi-parameter multi-group scenario, thereby solving the problem that Q-learning agents do not converge in high-dimensional configurations. In this scenario, the superiority of the proposed Q-learning approaches over the conventional LE-URC approach significantly improves with the increase of configuration dimensions, and the CMA-DQN approach outperforms the other approaches in both throughput and training efficiency.
I Introduction
To effectively support the emerging massive Internet of Things (mIoT) ecosystem, the 3rd Generation Partnership Project (3GPP) partners have standardized a new radio access technology, namely NarrowBand-IoT (NB-IoT) [1]. NB-IoT is expected to provide reliable wireless access for IoT devices with various types of data traffic, and to meet the requirement of extended coverage. As most mIoT applications favor delay-tolerant data traffic with small payloads, such as data from alarms, meters, and monitors, the key target of NB-IoT design is to handle the sporadic uplink transmissions of massive IoT devices [2].
NB-IoT is built on the legacy Long-Term Evolution (LTE) design, but is deployed in a narrow bandwidth (180 kHz) for Coverage Enhancement (CE) [3]. Different from legacy LTE, NB-IoT defines only two uplink physical channel resources to perform all uplink transmissions: the Random Access CHannel (RACH) resource (i.e., the NarrowBand Physical Random Access CHannel, a.k.a. NPRACH) for RACH preamble transmission, and the data resource (i.e., the NarrowBand Physical Uplink Shared CHannel, a.k.a. NPUSCH) for control information and data transmission. To support various traffic with different coverage requirements, NB-IoT supports up to three CE groups of IoT devices sharing the uplink resource in the same band. Each group serves IoT devices with different coverage requirements, distinguished based on the same broadcast signal from the evolved Node B (eNB) [3]. At the beginning of each uplink Transmission Time Interval (TTI), the eNB selects a system configuration that specifies the radio resource allocated to each group in order to accommodate the RACH procedure, with the remaining resource used for data transmission. The key challenge is to optimally balance the allocation of channel resource between the RACH procedure and data transmission so as to maximize successful accesses and transmissions in massive IoT networks: allocating too much resource to RACH enhances random access performance, but leaves insufficient resource for data transmission.
Unfortunately, dynamic RACH and data transmission resource configuration optimization is an untreated problem in NB-IoT. Generally speaking, the eNB observes the transmission receptions of both RACH (e.g., the numbers of successfully received preambles and of collisions) and data transmission (e.g., the numbers of scheduled and unscheduled devices) for all groups at the end of each TTI. This historical information can potentially be used to predict traffic from all groups and to facilitate the optimization of future TTIs’ configurations. Even if one knew all the relevant statistics, tackling this problem in an exact manner would result in a Partially Observable Markov Decision Process (POMDP) with large state and action spaces, which is generally intractable. The complexity of the problem is compounded by the lack of prior knowledge at the eNB regarding the stochastic traffic and unobservable channel statistics (i.e., random collisions, and effects of the physical radio including path-loss and fading). Related works are briefly introduced in the following two subsections.
I-A Related works on real-time optimization in cellular-based networks
In light of this POMDP challenge, prior works [4, 5] studied real-time resource configuration of the RACH procedure and/or data transmission by proposing dynamic Access Class Barring (ACB) schemes to optimize the number of served IoT devices. These optimization problems have been tackled under the simplified assumptions that at most two configurations are allowed and that the optimization is executed for a single group without considering errors due to wireless transmission. To handle more complex and practical formulations, Reinforcement Learning (RL) emerges as a natural solution given its capability to interact with the practical environment and feedback in the form of the number of successful and unsuccessful transmissions per TTI. Distributed RL based on tabular Q-learning (tabular-Q) has been proposed in [6, 7, 8, 9]. In [6, 7, 8], the authors studied distributed tabular-Q in slotted-Aloha networks, where each device learns how to avoid collisions by finding a proper time slot to transmit packets. In [9], the authors implemented tabular-Q agents at the relay nodes to cooperatively select their transmit power and transmission probability so as to optimize the total number of useful received packets per unit of consumed energy. Centralized RL has also been studied in [10, 11, 12], where the RL agent was implemented at the base station site. In [10], a learning-based scheme was proposed for radio resource management in multimedia wideband code-division multiple access systems to improve spectrum utilization. In [11, 12], the authors studied tabular-Q based ACB schemes in cellular networks, where a Q-agent was implemented at an eNB aiming to select the optimal ACB factor that maximizes the access success probability of the RACH procedure.
I-B Related works on optimization in NB-IoT
In NB-IoT networks, most existing studies focused either on resource allocation during the RACH procedure [13, 14], or on that during data transmission [15, 16]. For the RACH procedure, the access success probability was statistically optimized in [13] using exhaustive search, and the authors in [14] studied fixed-size data resource scheduling for various resource requirements. For data transmission, [15] presented an uplink data transmission time slot and power allocation scheme to optimize the overall channel gain, and [16] proposed a link adaptation scheme, which dynamically selects the modulation and coding level and the repetition value according to acknowledgment/negative-acknowledgment feedback to reduce the uplink data transmission block error ratio. More importantly, these works ignore the time-varying heterogeneous traffic of massive IoT devices, and considered a snapshot [13, 15, 16] or the steady-state behavior [14] of NB-IoT networks. The work most relevant to ours is [17], where the authors studied the steady-state behavior of NB-IoT networks from the perspective of a single device. Optimizing some of the parameters of the NB-IoT configuration, namely the repetition value (to be defined below) and the time intervals between two consecutive schedulings of NPRACH and NPDCCH, was carried out in terms of latency and power consumption in [17] using a queuing framework.
Unfortunately, the tabular-Q framework in [11, 12] cannot be used to solve the multi-parameter multi-group optimization problem in uplink resource configuration of NB-IoT networks, due to its inability to handle a high-dimensional state space and variable selection. More importantly, whether the proposed RL-based resource configuration approaches [11, 12] outperform the conventional resource configuration approaches [4, 5] is still unknown. In this paper, we develop RL-based uplink resource configuration approaches to dynamically optimize the number of served IoT devices in NB-IoT networks. To showcase their efficiency, we compare the proposed RL-based approaches with conventional heuristic uplink resource allocation approaches. The contributions can be summarized as follows:

We develop an RL-based framework to optimize the number of served IoT devices by adaptively configuring uplink resource in NB-IoT networks. The uplink communication procedure in NB-IoT is simulated by taking into account the heterogeneous IoT traffic, the CE group selection, the RACH procedure, and the uplink data transmission resource scheduling. This simulation environment is used for training the RL-based agents before deployment, and these agents are then updated according to the real traffic in practical NB-IoT networks in an online manner.

We first study a simplified NB-IoT scenario considering a single parameter and a single CE group, where a basic tabular-Q is developed and compared with the revised conventional Load Estimation based Uplink Resource Configuration (LE-URC) scheme. The tabular-Q is further advanced by implementing function approximators with different computational complexities, namely a Linear Approximator (LA-Q) and Deep Neural Networks (Deep Q-Network, a.k.a. DQN), to evaluate their capability and efficiency in dealing with a high-dimensional state space.

We then study a more practical NB-IoT scenario with multiple parameters and multiple CE groups, where direct implementation of LA-Q or DQN is not feasible due to the increasing size of the parameter combinations. To solve this, we propose Action Aggregation approaches based on LA-Q and DQN, namely AA-LA-Q and AA-DQN, which guarantee convergence by sacrificing a certain accuracy in the parameter selection. Finally, Cooperative Multi-Agent learning based on DQN (CMA-DQN) is developed to break down the selection of high-dimensional parameters into multiple parallel sub-tasks, in which a number of DQN agents are cooperatively trained to produce each parameter for each CE group.

In the simplified scenario, our results show that the number of served IoT devices with tabular-Q considerably outperforms that with LE-URC, while LA-Q and DQN achieve almost the same performance as tabular-Q with much less training time. In the practical scenario, the superiority of the Q-learning based approaches over LE-URC significantly improves. In particular, CMA-DQN outperforms all other approaches in terms of both throughput and training efficiency, mainly due to the use of DQN enabling operation over a large state space and the use of multiple agents dealing with the large dimensionality of the parameter selection.
The rest of the paper is organized as follows. Section II provides the problem formulation and system model. Section III presents the preliminaries and the conventional LE-URC. Section IV proposes Q-learning based uplink resource configuration approaches in the single-parameter single-group scenario. Section V presents the advanced Q-learning based approaches in the multi-parameter multi-group scenario. Section VI elaborates on the numerical results, and finally, Section VII concludes the paper.
II Problem Formulation and System Model
As illustrated in Fig. 1(a), we consider a single-cell NB-IoT network composed of an eNB located at the center of the cell and a set of IoT devices randomly located in an area of the plane, which remain spatially static once deployed. The devices are divided into three CE groups as further discussed below, and the eNB is unaware of the status of these IoT devices, hence no uplink channel resource is scheduled to them in advance. In each IoT device, uplink data is generated according to random inter-arrival processes over the TTIs, which are Markovian and possibly time-varying.
II-A Problem Formulation
With packets waiting for service, an IoT device executes the contention-based RACH procedure in order to establish a Radio Resource Control (RRC) connection with the eNB. The contention-based RACH procedure consists of four steps, where an IoT device transmits a randomly selected preamble, for a given number of times according to the repetition value [1], to initiate the RACH procedure in step 1, and exchanges control information with the eNB in the next three steps [18]. The RACH process can fail if: (i) a collision occurs when two or more IoT devices select the same preamble; or (ii) there is no collision, but the eNB cannot detect a preamble due to low SNR. Note that a collision can still be detected in step 3 of RACH when the collided preambles are not detected in step 1 of RACH, following the 3GPP report [19]. This assumption differs from our previous works [20, 21], which only focus on the preamble detection analysis in step 1 of RACH.
As shown in Fig. 1(b), for each TTI and for each CE group , in order to reduce the chance of a collision, the eNB can increase the number of RACH periods in the TTI or the number of preambles available in each RACH period [22]. Furthermore, in order to mitigate the SNR outage, the eNB can increase the number of times that a preamble transmission is repeated by a device in group in one RACH period [22] of the TTI.
After the RRC connection is established, the IoT device requests uplink channel resource from the eNB for control information and data transmission. As shown in Fig. 1(b), given a total number of resources for uplink transmission in the TTI, the number of available resources for data transmission is the total uplink resource minus the overall number of Resource Elements (REs)^1 allocated for the RACH procedure; the latter can be computed from the number of REs required for one preamble transmission. [Footnote 1: The uplink channel consists of 48 subcarriers within a 180 kHz bandwidth. With a 3.75 kHz tone spacing, one RE is composed of one time slot of 2 ms and one subcarrier of 3.75 kHz [1]. Note that NB-IoT also supports 12 subcarriers with 15 kHz tone spacing for NPUSCH, but NPRACH only supports 3.75 kHz tone spacing [1].]
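The resource split just described can be sketched as follows; all names (e.g., `res_per_preamble`) are our own illustrative choices, and the assumption is that the NPRACH share scales with the number of RACH periods, preambles, repetitions, and REs per preamble:

```python
def data_resource(total_res, n_periods, n_preambles, repetition, res_per_preamble):
    """REs left for NPUSCH in one TTI after reserving NPRACH resource.

    Parameter names are illustrative; the NPRACH share is assumed to be
    periods * preambles * repetitions * REs-per-preamble.
    """
    rach_res = n_periods * n_preambles * repetition * res_per_preamble
    return max(total_res - rach_res, 0)
```

Enlarging any RACH parameter directly shrinks the data budget, which is exactly the trade-off the configuration must balance.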
In this work, we tackle the problem of optimizing the RACH configuration, defined by the parameters of each CE group, in an online manner for every TTI. In order to make this decision at the beginning of every TTI, the eNB accesses all prior history over past TTIs, consisting of the following variables: the number of collided preambles, the number of successfully received preambles, and the number of idle preambles of each CE group in each TTI for the RACH, as well as the number of IoT devices that have successfully sent data and the number of IoT devices that are waiting to be allocated data resource. We denote accordingly the observed history of all such measurements and past actions.
The eNB aims at maximizing the long-term average number of devices that successfully transmit data with respect to the stochastic policy that maps the current observation history to the probabilities of selecting each possible configuration. This problem can be formulated as the optimization
(1) 
where the discount rate weighs the performance in future TTIs and the index runs over the CE groups. Since the dynamics of the system are Markovian over the TTIs and are defined by the NB-IoT protocol to be further discussed below, this is a POMDP problem that is generally intractable. Approximate solutions will be discussed in Sections III, IV, and V.
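As a minimal sketch of the discounted objective in (1), the following computes the discounted sum of served-device counts over TTIs; the expectation over traffic and the sum over CE groups are omitted, and the variable names are our own:

```python
def discounted_served(served_per_tti, gamma=0.5):
    """Discounted sum of served-device counts; gamma is the discount rate."""
    return sum(gamma ** t * v for t, v in enumerate(served_per_tti))
```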
II-B NB-IoT Access Network
We now provide additional details on the model and on the NB-IoT protocol. To capture the effects of the physical radio, we consider the standard power-law path-loss model, in which the path-loss attenuation depends on the propagation distance and the path-loss exponent. The system operates in a Rayleigh flat-fading environment, where the channel power gains are exponentially distributed i.i.d. random variables with unit mean. Fig. 2 presents the uplink data transmission procedure from the perspective of an IoT device in NB-IoT networks, which consists of four stages explained in the following four subsections.
II-B1 Traffic Inter-Arrival
We consider two types of IoT devices with different traffic models, namely periodical traffic and bursty traffic, which together form a heterogeneous traffic scenario for diverse IoT applications [23, 24]. Periodical traffic, coming from periodic uplink reporting tasks such as metering or environmental monitoring, is the most common traffic model in NB-IoT networks [25]. Bursty traffic, due to emergency events such as fire alarms and earthquake alarms, captures the complementary scenario in which a massive number of IoT devices tries to establish an RRC connection with the eNB [19]. Due to the slotted-Aloha nature of the access, an IoT device can only transmit a preamble at the beginning of a RACH period, which means that the IoT devices executing RACH in a given RACH period are those whose packets arrived during the interval since the last RACH period. For the periodical traffic, the first packet is generated using a uniform distribution over an initial window (ms), and then repeated periodically. The packet inter-arrival rate measured in each RACH period at each IoT device is hence expressed by
(2) 
where the first quantity is the number of RACH periods in the th TTI and the second is the duration between neighboring RACH periods. The bursty traffic is generated within a short period of time starting from a random time. The instantaneous traffic rate, in packets, is described by a function such that the packet arrival rate in the th RACH period of the th TTI is given by
(3) 
where is the starting time of the th RACH period in the th TTI, , and the distribution follows the time limited Beta profile given as [19, Section 6.1.1]
(4) 
In (4), is the Beta function with the constant parameters and [26].
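A sketch of the time-limited Beta profile in (4) follows. The parameters α = 3, β = 4 and the activation window T = 10 s are the values commonly assumed in the 3GPP bursty traffic model; since the paper's exact constants are not reproduced here, treat them as assumptions:

```python
from math import gamma as gamma_fn

def beta_profile(t, T=10.0, a=3.0, b=4.0):
    """Density of the time-limited Beta(a, b) arrival profile over [0, T]."""
    if not 0.0 <= t <= T:
        return 0.0
    beta_ab = gamma_fn(a) * gamma_fn(b) / gamma_fn(a + b)  # Beta function B(a, b)
    return t ** (a - 1) * (T - t) ** (b - 1) / (T ** (a + b - 1) * beta_ab)
```

Integrating this profile over a RACH period, as in (3), gives the expected number of burst arrivals falling in that period.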
II-B2 CE Group Determination
Once an IoT device is backlogged, it first determines its associated CE group by comparing the received power of the broadcast signal to the Reference Signal Received Power (RSRP) thresholds according to the rule [27]
(5) 
II-B3 RACH Procedure
After CE group determination, each backlogged IoT device in a group repeats a randomly selected preamble a number of times in the next RACH period by using a pseudo-random frequency hopping schedule. The pseudo-random hopping rule is based on the current repetition time as well as the Narrowband Physical Cell ID, and in one repetition, a preamble consists of four symbol groups, which are transmitted with fixed-size frequency hopping [28, 20, 1]. A preamble is successfully detected if at least one preamble repetition succeeds, which in turn happens if all of its four symbol groups are correctly decoded [20]. Assuming that correct detection is determined by the SNR level for each repetition and symbol group, the correct detection event can be expressed as
(7) 
where the first index refers to the symbol group in the th repetition, the repetition value is that of the th CE group in the th TTI, the decoding condition means that a preamble symbol group is successfully decoded when its received SNR is above a threshold, and the SNR is expressed as
(8) 
In (8), is the Euclidean distance between the IoT device and the eNB, is the pathloss attenuation factor, is the Rayleigh fading channel power gain from the IoT device to the eNB, is the noise power, and is the preamble transmit power in the th CE group defined as
(9) 
where the index runs over the CE groups; IoT devices in CE group 0 transmit preambles using full path-loss inversion power control [27], which maintains the received signal power at the eNB from IoT devices at different distances at the same threshold, subject to the maximal transmit power of an IoT device. The IoT devices in CE group 1 and group 2 always transmit preambles using the maximum transmit power [27].
As shown in the RACH procedure of Fig. 2, if a RACH attempt fails, the IoT device reattempts the procedure until receiving a positive acknowledgement that the RRC connection is established, or until exceeding the allowed number of RACH attempts while being part of one CE group. If these attempts are exceeded, the device switches to a higher CE group if possible [29]. Moreover, the IoT device is allowed to attempt the RACH procedure no more than a maximum number of times before dropping its packets. These two features are tracked by their corresponding counters, respectively.
II-B4 Data Resource Scheduling
After the RACH procedure succeeds, the RRC connection is successfully established, and the eNB schedules resource from the data channel to the associated IoT device for control information and data transmission, as shown in Fig. 1(b). To allocate data resource among these devices, we adopt a basic random scheduling strategy, whereby an ordered list of all devices that have successfully completed the RACH procedure but have not received a data channel is compiled using a random order. In each TTI, devices in the list are considered in order for access to the data channel until the data resource is insufficient to serve the next device in the list. The remaining RRC connections between the unscheduled IoT devices and the eNB are preserved for at most a given number of subsequent TTIs, tracked by a counter, and attempts are made to schedule the device’s data during these TTIs [30, 29]. The condition that the data resource is sufficient in a TTI is expressed as
(10) 
where the number of scheduled devices is limited by an upper bound given by the IoT devices with successful RACH in the current TTI together with the unscheduled IoT devices from the last TTI, and the number of REs required for serving one IoT device within a CE group depends on the number of REs per repetition for control signaling and data transmission^2. Note that the repetition value of each CE group in the TTI is the same as that used for preamble transmission [1]. [Footnote 2: The basic scheduling unit of NPUSCH is the resource unit (RU), which has two formats: NPUSCH format 1 (NPUSCH-1) with 16 REs for data transmission, and NPUSCH format 2 (NPUSCH-2) with 4 REs for carrying control information [3, 22].]
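The random scheduling strategy can be sketched as below; the device identifiers and the per-device RE cost table are illustrative assumptions:

```python
import random

def schedule(waiting, data_res, res_per_device):
    """Serve RRC-connected devices in random order until the next one no
    longer fits in the remaining data REs (basic random scheduling)."""
    order = list(waiting)
    random.shuffle(order)
    served, used = [], 0
    for dev in order:
        need = res_per_device[dev]
        if used + need > data_res:
            break  # remaining resource insufficient for the next device
        served.append(dev)
        used += need
    return served
```

Devices left in `waiting` but absent from the returned list would keep their RRC connection for a bounded number of subsequent TTIs, as described above.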
III Preliminaries and Conventional Solutions
III-A Preliminaries
The long-term objective in Eq. (1) is highly complicated and cannot be easily solved via conventional uplink resource configuration approaches. Therefore, most prior works simplified the objective to dynamically optimizing a single parameter to achieve the maximum number of served IoT devices in a single group without consideration of future performance [4, 5], which is expressed as
(11) 
where is the optimized single parameter.
To maximize the number of served IoT devices in the th TTI, the configuration is expected to be dynamically adjusted according to the actual number of IoT devices that will execute RACH attempts, which constitutes the current load of the network. Note that, in practice, this load information cannot be directly observed at the eNB. Thus, it is necessary to estimate the load based on the previous transmission receptions in earlier TTIs before configuring the uplink resource in the th TTI.
In [5], the authors designed a dynamic ACB scheme to optimize the problem given in Eq. (1) via adjusting the ACB factor. The ACB factor is adapted based on the knowledge of traffic load, which is estimated via moment matching. The estimated number of RACH attempting IoT devices in the th TTI is expressed as:
(12) 
where is the number of allocated preambles in the th TTI, and is the estimated number of devices performing RACH attempts in the th TTI given as
(13) 
In Eq. (13), , , and are the ACB factor, the number of preambles and the observed number of collided preambles in the th TTI, and is an estimated factor given in [5, Eq. (32)].
In Eq. (12), is the difference between the estimated numbers of RACH requesting IoT devices in the th and the th TTIs, which is obtained by assuming that the number of successful RACH IoT devices does not change significantly in these two TTIs [5].
This dynamic control approach is designed for an ACB scheme, which is only triggered when the exact traffic load is larger than the number of preambles. Accordingly, the related backlog estimation approach is only used in that overload regime. However, it cannot estimate the load when the traffic load does not exceed the number of preambles, which is required in our problem.
III-B Resource Configuration in the Single-Parameter Single-CE-Group Scenario
In this subsection, we modify the load estimation approach given in [5] by estimating the load based on the last number of collided preambles and the previous numbers of idle preambles. We then propose an uplink resource configuration approach based on this revised load estimation, namely LE-URC.
III-B1 Load Estimation
By definition, the eNB chooses the number of preambles from a set of valid values, and each IoT device selects a RACH preamble from the available preambles with equal probability. For a given preamble transmitted to the eNB, consider the number of IoT devices that select that preamble. The probability that no IoT device selects the preamble is
(14) 
The expected number of idle preambles in the th TTI is given by
(15) 
Since the actual number of idle preambles can be observed at the eNB, the number of RACH attempting IoT devices in the th TTI can be estimated as
(16) 
To obtain the estimated number of RACH attempting IoT devices in the th TTI, we also need to know the difference between the estimated numbers of RACH attempting IoT devices in the two preceding TTIs. However, this difference cannot be obtained before the th TTI. To solve this, we can assume it remains unchanged, following [5]. This is because the time between two consecutive TTIs is small and the available preambles are gradually updated, so that the number of successful RACH IoT devices does not change significantly between these two TTIs [5]. Therefore, the number of RACH attempting IoT devices in the th TTI is estimated as
(17) 
where the lower bound reflects that each collided preamble implies at least two IoT devices colliding in the last TTI.
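The idle-preamble estimator of (15)-(17) can be sketched as follows, inverting the expected idle count E[idle] = F(1 − 1/F)^n and applying the collision lower bound; the variable names and the zero-idle fallback are our own assumptions:

```python
from math import log

def estimate_load(n_preambles, n_idle, n_collided):
    """Estimate the number of RACH contenders from observed idle and collided
    preambles, assuming uniform selection among F preambles."""
    F = n_preambles
    if n_idle <= 0:
        # no idle preambles observed: the inversion is undefined, so fall
        # back to a heuristic floor (an assumption, not from the paper)
        return max(float(F), 2.0 * n_collided)
    n_hat = log(n_idle / F) / log(1.0 - 1.0 / F)
    # each collided preamble implies at least two contenders
    return max(n_hat, 2.0 * n_collided)
```

For example, with F = 10 preambles and 5 contenders, the expected idle count is 10(0.9)^5, and inverting it recovers a load estimate of 5.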
III-B2 Uplink Resource Configuration Based on Load Estimation
In the following, we propose LE-URC by taking into account the resource condition given in Eq. (10). The number of RACH periods and the repetition value are fixed, and only the number of preambles in each RACH period is dynamically configured in each TTI. Using the estimated number of RACH attempting IoT devices in the th TTI, the probability that exactly one IoT device selects a given preamble (i.e., no collision occurs on it) is expressed as
(18) 
The expected number of IoT devices that successfully complete RACH in the th TTI is derived as
(19) 
Based on (19), the expected number of IoT devices requesting uplink resource in the th TTI is derived as
(20) 
where is the number of unscheduled IoT devices in the last TTI. Note that can be observed.
However, if the data resource is not sufficient (i.e., Eq. (10) does not hold), some IoT devices may not be scheduled in the th TTI. The upper bound on the number of scheduled IoT devices is expressed as
(21) 
where the first quantity is the total number of REs reserved for uplink transmission in a TTI, the second is the uplink resource configured for RACH in the th TTI, and the last is the number of REs required for serving one IoT device, given in Eq. (10).
According to (20) and (21), the expected number of the successfully served IoT devices is given by
(22) 
The maximal expected number of successfully served IoT devices is obtained by selecting the number of preambles using
(23) 
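The LE-URC selection rule of (18)-(23) can be sketched as a search over candidate preamble numbers, trading NPRACH resource against NPUSCH capacity. All parameter names and the RACH-resource accounting below are illustrative assumptions:

```python
def pick_preambles(n_hat, unscheduled, total_res, n_periods, repetition,
                   res_per_preamble, res_per_device, preamble_options):
    """Choose the number of preambles F maximizing the expected number of
    served devices, given an estimated load n_hat (sketch)."""
    best_F, best_served = None, -1.0
    for F in preamble_options:
        # expected collision-free successes with n_hat contenders over F preambles
        succ = n_hat * (1.0 - 1.0 / F) ** (n_hat - 1)
        demand = succ + unscheduled          # devices requesting data REs
        rach_res = n_periods * F * repetition * res_per_preamble
        capacity = max(total_res - rach_res, 0) // res_per_device
        served = min(demand, capacity)
        if served > best_served:
            best_F, best_served = F, served
    return best_F
```

With a generous RE budget the search favors more preambles (fewer collisions); as the budget tightens, larger F starves the data channel and a smaller F wins.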
The LE-URC approach based on the estimated load is detailed in Algorithm 1. For comparison, we consider an ideal scenario in which the actual number of RACH requesting IoT devices is available at the eNB, namely Full State Information based URC (FSI-URC). FSI-URC still configures the number of preambles using the approach given in Eq. (23), but the load estimation approach given in Section III-B1 is not required.
III-B3 LE-URC for Multiple CE Groups
We slightly revise the single-parameter single-group LE-URC approach (given in Section III-B) to dynamically configure resource for multiple CE groups. Note that the repetition value in the LE-URC approach is still kept constant to preserve the validity of the load estimation in Eq. (17). Recall that the principle of the LE-URC approach is to optimize the expected number of successfully served IoT devices while balancing the RACH and data allocations under the limited uplink resource budget. In the multiple CE group scenario, the data resource is allocated to IoT devices in any CE group without bias, but the RACH resource is specifically allocated to each CE group.
Under this condition, the expected number of successfully served IoT devices given in Eq. (22) needs to be modified to take into account multiple variables, which makes the optimization problem non-convex and considerably more complicated. To solve it, we use a sub-optimal solution by artificially setting an uplink resource constraint for each CE group. Each CE group can then independently allocate its resource between RACH and data according to the approach given in Eq. (23).
IV Q-Learning Based Resource Configuration in the Single-Parameter Single-Group Scenario
RL approaches are well-known for addressing dynamic control problems in complex POMDPs [31]. Nevertheless, they have rarely been studied for resource configuration in slotted-Aloha based wireless communication systems. Therefore, it is worthwhile to first evaluate the capability of RL in the single-parameter single-group scenario, in comparison with conventional heuristic approaches. In this section, we consider a single CE group with a fixed number of RACH periods and a fixed repetition value, and only the number of preambles is dynamically configured at the beginning of each TTI. In the following, we first study tabular-Q based on the tabular representation of the value function, which is the simplest Q-learning form with guaranteed convergence [31], but requires extremely long training time. We then study Q-learning with function approximators to improve training efficiency, where LA-Q and DQN are used to construct an approximation of the desired value function.
IV-A Q-Learning and Tabular Value Function
Consider a Q-agent deployed at the eNB to optimize the number of successfully served IoT devices in real time; the Q-agent needs to explore the environment in order to choose appropriate actions that progressively lead to the optimization goal. We define a state, an action, and a reward taken from their corresponding sets. At the beginning of the th TTI, the Q-agent first observes the current state, corresponding to a set of previous observations, in order to select a specific action. The action corresponds to the number of preambles in each RACH period in the single CE group scenario.
As shown in Fig. 3, we consider a basic state function in the single CE group scenario, where the state is a set of indices mapping to the currently observed information. With the knowledge of the state, the Q-agent chooses an action from the action set, which is a set of indices mapped to the set of available preamble numbers. Once an action is performed, the Q-agent receives a scalar reward and observes a new state. The reward indicates to what extent the executed action achieves the optimization goal, and is determined by the newly observed state. As the optimization goal is to maximize the number of successfully served IoT devices, we define the reward as a function proportional to the observed number of successfully served IoT devices, which is defined as
(24) 
where is constant used to normalize the reward function.
Q-learning is a value-based RL approach [31, 32], in which the policy mapping states to actions is learned via a state-action value function Q(s, a) that determines the action for each state. We first use a lookup table to represent the state-action value function (tabularQ), which consists of one value scalar for every state-action pair. To obtain an action a_t, we select the entry with the highest value from the vector Q(s_t, ·), which maps all possible actions under s_t to their values in the Q-value table.
Accordingly, our objective is to find an optimal Q-value table whose policy selects actions that dynamically optimize the number of served IoT devices. To do so, we train an initial Q-value table in the environment using the Q-learning algorithm, where the table is updated immediately after each action using the currently observed reward r_t as
\[ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \lambda \big[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \big], \tag{25} \]

where \lambda is a constant step-size learning rate that affects how fast the algorithm adapts to a new environment, \gamma is the discount rate that determines how current rewards affect the value function update, and \max_{a} Q(s_{t+1}, a) approximates the value in the optimal Q-value table via the up-to-date Q-value table and the newly obtained state s_{t+1}. Note that the update in Eq. (25) is scalar, which means that only one value in the Q-value table can be updated with each received reward r_t.
As shown in Fig. 3, we consider the \epsilon-greedy approach to balance exploitation and exploration in the Actor of the Q-agent, where \epsilon is a positive real number no greater than 1. In each TTI t, the Q-agent generates a random number and compares it with \epsilon. With probability \epsilon, the algorithm randomly chooses an action from the remaining feasible actions in order to improve its estimates of the non-greedy actions' values; with probability 1 - \epsilon, it exploits the current knowledge of the Q-value table to choose the action that maximizes the expected reward.
In particular, the learning rate \lambda should be set to a small number to guarantee stable convergence of the Q-value table in this NBIoT communication system. This is because a single reward in a specific TTI can be severely biased: the state is composed of multiple unobserved quantities with unpredictable distributions (e.g., an action may allow a large number of preambles, yet massive random collisions accidentally occur, leading to an unusually low reward). The implementation of uplink resource configuration using tabularQ based real-time optimization is shown in Algorithm 2.
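The tabularQ loop described above can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 2: the state/action space sizes, the reward normalization, and all hyperparameter values are assumptions chosen for readability.

```python
import numpy as np

NUM_STATES = 64      # number of discretized observation indices (assumed)
NUM_ACTIONS = 8      # number of candidate preamble settings (assumed)
LAMBDA = 0.01        # small step-size learning rate for stable convergence
GAMMA = 0.9          # discount rate
EPSILON = 0.1        # exploration probability for epsilon-greedy

rng = np.random.default_rng(0)
Q = np.zeros((NUM_STATES, NUM_ACTIONS))  # Q-value table, one scalar per (s, a)

def select_action(state):
    """Epsilon-greedy: explore with probability EPSILON, else exploit the table."""
    if rng.random() < EPSILON:
        return int(rng.integers(NUM_ACTIONS))
    return int(np.argmax(Q[state]))

def update(state, action, reward, next_state):
    """One-step Q-learning update in the form of Eq. (25)."""
    td_target = reward + GAMMA * np.max(Q[next_state])
    Q[state, action] += LAMBDA * (td_target - Q[state, action])
```

Because only one scalar Q[state, action] changes per TTI, every state-action pair must be visited repeatedly before the table converges, which is exactly the training-time limitation motivating the function approximators below.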
IV-B Value Function Approximation
Since tabularQ requires each of its elements to be updated in order to converge, searching for an optimal policy can be difficult under limited time and computational resources. To address this, we use a value function approximator instead of the Q-value table to find a sub-optimal approximated policy. Selecting an efficient approximation to represent the value function in different learning scenarios is a common problem in RL [31, 33, 34, 35]. A variety of function approximation approaches can be used, such as LA, DNNs, and tree search, and the choice of approach can critically influence learning success [31, 34, 35]. The function approximator should fit the complexity of the desired value function and be efficient in obtaining good solutions. Unfortunately, most function approximation approaches require problem-specific design, and there is no basis function that is both reliable and efficient for all learning problems.
In this subsection, we first focus on linear function approximation for Q-learning, due to its simplicity, efficiency, and guaranteed convergence [31, 36, 37]. We then employ a DNN as a more effective but more complicated function approximator for Q-learning, also known as DQN [32]. The reasons for adopting DQN are that: 1) DNN function approximation is able to deal with several kinds of partially observable problems [31, 32]; 2) DQN has the potential to accurately approximate the desired value function in problems with very large state spaces [32], which is favorable for learning in the multiple CE group scenario; 3) DQN offers high scalability, as the scale of its value function can easily be fitted to a more complicated problem; and 4) a variety of libraries have been established to facilitate building DNN architectures and accelerate experiments, such as TensorFlow, PyTorch, Theano, and Keras.
IV-B1 Linear Approximation
LAQ uses a linear weight matrix W to approximate the value function with a feature vector x(s_t) corresponding to the state s_t. The dimension of W is |A| × d, where |A| is the total number of available actions and d is the size of the feature vector. Here, we consider polynomial regression (as in [31, Eq. 9.17]) to construct the real-valued feature vector, due to its efficiency^3 (footnote 3: The polynomial case is the most well understood feature constructor and generally performs well in practice with appropriate settings [31, 33]. Furthermore, the results in [38] show that there is a rough correspondence between a fitted neural network and a fitted ordinary parametric polynomial regression model. These reasons encourage us to compare the polynomial-based LAQ with DQN). In the training process, exploration is performed as in tabular Q-learning by generating random actions, but exploitation is computed using the weight matrix of the value function. In detail, to predict an action for state s_t in the tth TTI, the vector of approximated values, one per action, is obtained by taking the inner product of the weight matrix and the feature vector:
\[ \hat{Q}(s_t, \cdot\,; W) = W\, x(s_t). \tag{26} \]

By searching for the maximal value scalar in the vector given in Eq. (26), we obtain the matching action that maximizes future rewards.
To obtain the optimal policy, we update the weight matrix of the value function using Stochastic Gradient Descent (SGD) [31, 39]. SGD minimizes the prediction error on each observed example, reducing the error by a small amount in the direction of the optimal target policy. As it is infeasible to obtain the optimal target policy by summing over all states, we instead estimate the desired action-value function by considering one learning sample at a time [31]. In each TTI, the weight matrix is updated following
\[ W_{t+1} = W_t - \lambda\, \nabla \mathcal{L}(W_t), \tag{27} \]

where \lambda is the learning rate and \nabla \mathcal{L}(W_t) is the gradient of the loss function used to train the Q-function approximator. This is given as

\[ \nabla \mathcal{L}(W_t) = -\big( r_t + \gamma \max_{a} [\,W_t\, x(s_{t+1})\,]_a - [\,W_t\, x(s_t)\,]_{a_t} \big)\, X_t, \tag{28} \]

where W_t is the weight matrix and X_t is a feature matrix with the same shape as W_t, constructed from zeros except for the feature vector x(s_t) placed in the row corresponding to the index of the action a_t selected in the tth TTI. Note that the term in parentheses, the temporal-difference error, is a scalar. The learning procedure follows Algorithm 2, replacing the Q-table with the LA value function with linear weight matrix W, and updating W with the SGD rule given in (28) in step 10 of Algorithm 2.
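The LAQ forward pass and semi-gradient update of Eqs. (26)-(28) can be sketched as below. The feature construction (a scalar observation expanded into polynomial features) and all sizes are illustrative assumptions; the update touches only the row of the selected action, matching the structure of X_t.

```python
import numpy as np

NUM_ACTIONS = 8   # |A|, number of available actions (assumed)
FEAT_DIM = 4      # d, size of the feature vector (assumed polynomial degree 3)
LAMBDA = 0.005    # SGD learning rate (assumed)
GAMMA = 0.9       # discount rate

W = np.zeros((NUM_ACTIONS, FEAT_DIM))  # linear weight matrix, shape |A| x d

def features(obs):
    """Polynomial feature vector x(s) of a scalar observation (assumed form)."""
    return np.array([1.0, obs, obs**2, obs**3])

def q_values(x):
    """Eq. (26): one approximated Q-value per action via inner products."""
    return W @ x

def sgd_update(x, action, reward, x_next):
    """Eqs. (27)-(28): semi-gradient step; only the selected action's row moves."""
    td_error = reward + GAMMA * np.max(q_values(x_next)) - q_values(x)[action]
    W[action] += LAMBDA * td_error * x
```

Writing the gradient as a rank-one update on a single row is equivalent to subtracting \lambda \nabla \mathcal{L}(W_t) with the sparse feature matrix X_t of Eq. (28).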
IV-B2 Deep Q-Network
The DQN agent parameterizes the action-state value function by a function Q(s, a; \theta), where \theta represents the weights of a DNN with multiple layers. We consider a conventional DNN in which the neurons of adjacent layers are fully pairwise connected, namely fully-connected layers. The input of the DNN is given by the variables in state s_t; the intermediate hidden layers are Rectified Linear Units (ReLUs) using the function f(x) = max(0, x); and the output layer is composed of linear units, which are in one-to-one correspondence with the available actions^4 (footnote 4: Linear activation is used here following [32]. Note that Q-learning is value-based, so the desired value function given in Eq. (15) can be larger than 1 rather than a probability; activation functions with bounded return values, such as the sigmoid and tanh functions, can therefore lead to convergence difficulty).
Exploitation is performed via forward propagation of the Q-function with respect to the observed state s_t. The weights \theta are updated online along each training episode using double deep Q-learning (DDQN) [40], which to some extent reduces the substantial overestimation^5 of the value function (footnote 5: Overestimation refers to sub-optimal actions regularly being given higher Q-values than optimal actions, which can negatively influence the convergence capability and training efficiency of the algorithm [40, 34]). Accordingly, learning takes place over multiple training episodes, each lasting a fixed number of TTI periods. In each TTI, the parameters of the Q-function approximator are updated using SGD as
\[ \theta_{t+1} = \theta_t - \lambda_{\mathrm{RMS}}\, \nabla \mathcal{L}(\theta_t), \tag{29} \]

where \lambda_{\mathrm{RMS}} is the RMSProp learning rate [41] and \nabla \mathcal{L}(\theta_t) is the gradient of the loss function used to train the Q-function approximator. This is given as

\[ \nabla \mathcal{L}(\theta_t) = -\mathbb{E}\Big[ \big( r + \gamma\, Q\big(s', \arg\max_{a'} Q(s', a'; \theta_t);\, \theta^{-} \big) - Q(s, a; \theta_t) \big)\, \nabla_{\theta} Q(s, a; \theta_t) \Big], \tag{30} \]

where the expectation is taken with respect to a so-called minibatch of randomly selected previous samples (s, a, r, s') drawn from the replay memory [32]. When a sample index is negative, this is interpreted as including samples from the previous episode. The use of a minibatch, instead of a single sample, to update the value function improves the convergence reliability of the value function [32]. Furthermore, following DDQN [40], \theta^{-} in (30) is a so-called target Q-network that is used to estimate the future value of the Q-function in the update rule. This parameter is periodically copied from the current value \theta and kept fixed for a number of episodes [40].
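The DDQN target inside Eq. (30) can be sketched numerically. This is a toy illustration with numpy arrays standing in for network outputs: q_online_next plays the role of Q(s', ·; \theta) and q_target_next the role of the periodically copied target network Q(s', ·; \theta^-); batch shapes are assumptions.

```python
import numpy as np

GAMMA = 0.9  # discount rate

def ddqn_targets(q_online_next, q_target_next, rewards):
    """Per-sample DDQN targets: r + gamma * Q_target(s', argmax_a Q_online(s', a)).

    q_online_next, q_target_next: arrays of shape (batch, num_actions)
    rewards: array of shape (batch,)
    """
    # Action selection uses the online network (reduces overestimation) ...
    best_actions = np.argmax(q_online_next, axis=1)
    # ... while action evaluation uses the frozen target network.
    future = q_target_next[np.arange(len(rewards)), best_actions]
    return rewards + GAMMA * future
```

The squared difference between these targets and Q(s, a; \theta) is the loss whose gradient appears in Eq. (30); in vanilla DQN the argmax would instead be taken over q_target_next itself.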
V QLearning Based Resource Configuration in MultiParameter MultiGroup Scenario
Practically, NBIoT is always deployed with multiple CE groups to serve IoT devices with various coverage requirements. In this section, we study problem (1) of optimizing the resource configuration for three CE groups, each with its own set of parameters. This joint optimization, configuring each parameter in each CE group, can improve the overall data access and transmission performance. Note that all CE groups share the uplink resource in the same bandwidth, and the eNB schedules data resource to all RRC-connected IoT devices without CE group bias, as introduced in Sec. II.B.4). To optimize the number of served IoT devices in real time, the eNB must balance the uplink resource not only between RACH and data, but also among the CE groups.
The Q-learning algorithms for the single CE group provided in Sec. IV are model-free, so their learning structure can be directly reused in this multi-parameter multi-group scenario. However, considering multiple CE groups enlarges the observation space, which exponentially increases the size of the state space. Training a Q-agent under this expansion greatly increases the required time and computational resources. In this case, tabularQ would be extremely inefficient: not only does the state-action value table require large memory, but it is also impossible to repeatedly experience every state within limited time to achieve convergence. In view of this, we study only Q-learning with value function approximation (LAQ and DQN) to design uplink resource configuration approaches for the multi-parameter multi-group scenario.
LAQ and DQN are capable of handling massive state spaces, so we can considerably enrich the state space with more observed information to support the optimization of the Q-agent. Here, we define the current state s_t to include information about the last M TTIs. This design enables the Q-agent to estimate the trend of the traffic. As our goal is to optimize the number of served IoT devices, the reward function is defined according to the number of successfully served IoT devices in each CE group, which is expressed as

\[ r_t = \frac{1}{\phi} \sum_{i=1}^{3} D_t^{i}, \tag{31} \]

where D_t^{i} is the observed number of successfully served IoT devices in the ith CE group and \phi is a normalization constant as in Eq. (24).
Like the state space, the available action space also grows exponentially with the number of adjustable configurations. The number of available actions corresponds to the number of possible combinations of configurations, where the candidate sets are the numbers of RACH periods, the repetition values, and the numbers of preambles in each RACH period for each CE group. Unfortunately, it is extremely hard to optimize the system over such a large action space (which can exceed fifty thousand actions), because the system would update its policy using only a small part of the actions, eventually leading to convergence difficulty. To solve this problem, we provide two approaches that reduce the dimension of the action space, enabling LA and DQN in the multi-parameter multi-group scenario.
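The exponential growth of the joint action space can be seen by simply counting combinations. The candidate-set sizes below are illustrative assumptions, not the paper's exact configuration ranges; the point is only the multiplicative blow-up across parameters and CE groups.

```python
# Illustrative candidate-set sizes (assumed, not the paper's exact values):
num_rach_periods = 4   # candidate numbers of RACH periods per group
num_repetitions = 4    # candidate repetition values per group
num_preambles = 6      # candidate preamble numbers per group
num_groups = 3         # CE groups

# Joint configurations per group, then across all three groups:
per_group = num_rach_periods * num_repetitions * num_preambles
total_actions = per_group ** num_groups
print(total_actions)   # 96**3 = 884736 joint actions
```

Even these modest per-parameter ranges yield hundreds of thousands of joint actions, which is why the two dimension-reduction approaches below are needed.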
V-A Actions Aggregated Approach
We first provide the AA based Q-learning approaches, which guarantee convergence by sacrificing the accuracy of action selection^6 (footnote 6: Action aggregation has rarely been evaluated, but the same idea under the name state aggregation has been well studied as a basic function approximation approach [31]). In detail, the selection of a specific value is converted to the selection of an increasing or decreasing trend. Instead of selecting exact values from the candidate sets, each parameter takes a single-step ascent/descent relative to the last action, applied to the number of RACH periods, the repetition value, and the number of preambles in each RACH period in the tth TTI. Consequently, the size of the total action space for the three CE groups is greatly reduced. The algorithms for training with the LA function approximator and DQN in the multi-parameter multi-group scenario can then be deployed following Algorithm 2 and Algorithm 3, respectively.
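The single-step ascent/descent mechanics can be sketched as follows. The candidate preamble set and the trend encoding {-1, +1} are illustrative assumptions; the essential idea is that an aggregated action moves one position along an ordered candidate list rather than naming an exact value.

```python
import numpy as np

# Assumed ordered candidate set for the number of preambles in one CE group.
PREAMBLE_SET = [4, 8, 12, 16, 24, 32, 48, 64]

def step_index(current_idx, trend, set_size):
    """Apply an aggregated action: move one step up (+1) or down (-1)
    along the candidate list, clipped to the valid index range."""
    return int(np.clip(current_idx + trend, 0, set_size - 1))

# Example: from preamble index 3 (value 16), the ascent action selects 24.
idx = step_index(3, +1, len(PREAMBLE_SET))
print(PREAMBLE_SET[idx])  # 24
```

With two trend choices per parameter and nine parameters across the three CE groups, the joint action space shrinks from the full combinatorial product to a small fixed set of trend combinations.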
V-B Cooperative Multi-Agent Learning Approach
Although the uplink resource configuration is managed by a central authority, identifying the control of each parameter as a subtask handled cooperatively by independent Q-agents is sufficient to deal with the otherwise intractable action space [42]. As shown in Fig. 5, we consider multiple DQN agents centralized at the eNB, each with the same structure of value function approximator^7 following Section IV.B.2) (footnote 7: The structures of the value function approximators can also be designed specifically for RL agents whose subtasks differ significantly in complexity; as there is no such requirement in our problem, this is not considered). We break down the action space by considering nine separate action variables, where each DQN agent controls its own action variable as shown in Fig. 5. Recall that we have three variables for each group, namely the number of RACH periods, the repetition value, and the number of preambles.
We introduce a separate DQN agent for each output variable, where the kth agent is responsible for updating the value of its own action under the shared state. The DQN agents are trained in parallel and receive the same reward signal, given in Eq. (31), at the end of each TTI as per problem (1). The use of this common reward signal ensures that all DQN agents aim at cooperatively increasing the objective in (1). Note that this approach can be interpreted as applying a factorization of the overall value function akin to the approach proposed in [43] for multi-agent systems.
The challenge of this approach is how to evaluate each action under the common reward function. For each DQN agent, the received reward is corrupted by massive noise: the agent's own effect on the reward is deeply hidden among the effects of all other DQN agents. For instance, a positive action can receive a mismatched low reward due to other DQN agents' negative actions. Fortunately, in our scenario, all DQN agents are centralized at the eNB, which means they can share full information with each other. Accordingly, we adopt the action selection histories of each DQN agent as part of the state function^8, so that the agents are able to learn how the reward is influenced by different combinations of actions (footnote 8: The state function can be designed to collect more information according to the complexity requirements, such as sharing the value function among the DQN agents [42]). To do so, we define the state variable as
\[ s_t = \big\{\, A_{t-i},\; O_{t-i} \,\big\}_{i=1}^{M}, \tag{32} \]

where M is the number of stored observations, A_{\tau} is the set of actions selected by the DQN agents in the \tau th TTI, corresponding to the number of RACH periods, the repetition value, and the number of preambles for each CE group, and O_{\tau} is the set of observed transmission receptions.
In each TTI, the parameters of the Q-function approximators are updated at all agents using SGD as in Eq. (29). The learning algorithm can be implemented following Algorithm 3. Different from the single-parameter single-group scenario, we first initialize nine primary networks, target networks, and replay memories, one set per DQN agent. In step 11 of Algorithm 3, the current transitions of each DQN agent are stored in its own memory separately. In steps 12 and 13 of Algorithm 3, a minibatch of transitions is sampled separately from each memory to train the corresponding DQN agent.
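The per-agent replay bookkeeping described above can be sketched as below. The memory capacity and batch size are assumed values; the structure simply mirrors steps 11-13, with each of the nine agents owning its transitions and its minibatch.

```python
import random
from collections import deque

NUM_AGENTS = 9       # one DQN agent per action variable (3 parameters x 3 groups)
MEMORY_SIZE = 10000  # replay memory capacity per agent (assumed)
BATCH_SIZE = 32      # minibatch size (assumed)

# One independent replay memory per agent.
memories = [deque(maxlen=MEMORY_SIZE) for _ in range(NUM_AGENTS)]

def store(agent_id, state, action, reward, next_state):
    """Step 11: store each agent's own transition in its own memory.
    Note that state and reward are shared across agents; only action differs."""
    memories[agent_id].append((state, action, reward, next_state))

def sample_minibatch(agent_id):
    """Steps 12-13: sample a minibatch from the corresponding memory
    to train that agent's primary network."""
    mem = memories[agent_id]
    return random.sample(mem, min(BATCH_SIZE, len(mem)))
```

Keeping the memories separate lets each agent relate the shared reward to its own action history, which is the information the state variable in Eq. (32) is designed to expose.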
VI Simulation Results
In this section, we evaluate the performance of the proposed Q-learning approaches and compare them with the conventional LEURC and FSIURC approaches described in Sec. III via numerical experiments. We adopt the standard network parameters listed in Table I following [1, 3, 25, 29, 22], and the hyperparameters for Q-learning listed in Table II. Accordingly, one epoch consists of 937 TTIs (i.e., 10 minutes). The RL agents are first trained in a so-called learning phase; after convergence, their performance is compared with LEURC and FSIURC in a so-called testing phase. All testing results are obtained by averaging over 1000 episodes. In the following, we present the simulation results for the single-parameter single-group scenario and the multi-parameter multi-group scenario in Section VI-A and Section VI-B, respectively.
Parameters | Setting | Parameters | Setting
Path-loss exponent | 4 | Noise power | -138 dBm
eNB broadcast power | 35 dBm | Path-loss inverse power control threshold | 120 dB
Maximal preamble transmit power | 23 dBm | Received SNR threshold | 0 dB
Duration of periodic traffic | 1 hour | TTI | ms
Duration of bursty traffic | 10 minutes | Set of numbers of preambles | {}
Maximum allowed resource requests | 5 | Set of repetition values | {}
Maximum RACH attempts | 10 | Set of numbers of RACH periods | {}
Maximum allowed RACH in one CE | 5 | REs required for | 4
Bursty traffic parameter Beta | (3, 4) | REs required for |