Power Allocation in Multi-User Cellular Networks: Deep Reinforcement Learning Approaches
Abstract
The model-based power allocation algorithm has been investigated for decades, but it requires the mathematical models to be analytically tractable and it usually has high computational complexity. Recently, data-driven model-free machine learning approaches are being rapidly developed to obtain near-optimal performance with affordable computational complexity, and deep reinforcement learning (DRL) is regarded as having great potential for future intelligent networks. In this paper, DRL approaches are considered for power control in multi-user wireless cellular networks. Considering cross-cell cooperation, offline/online centralized training and distributed execution, we present a mathematical analysis for the DRL-based top-level design. The concrete DRL design is further developed on this foundation, and policy-based REINFORCE, value-based deep Q learning (DQL) and actor-critic deep deterministic policy gradient (DDPG) algorithms are proposed. Simulation results show that the proposed data-driven approaches outperform the state-of-the-art model-based methods in sum-rate performance, with good generalization power and faster processing speed. Furthermore, the proposed DDPG outperforms REINFORCE and DQL in terms of both sum-rate performance and robustness, and can be incorporated into existing resource allocation schemes due to its generality.
I Introduction
Wireless data transmission has experienced tremendous growth in past years and will continue to grow in the future. When large numbers of terminals such as mobile phones and wearable devices are connected to the networks, the density of access points (APs) has to be increased. Dense deployment of small cells such as picocells and femtocells has become the most effective solution to accommodate the critical demand for spectrum [1]. With denser APs and smaller cells, the whole communication network is flooded with wireless signals, and thus the intra-cell and inter-cell interference problems are severe [2]. Therefore, power allocation and interference management are crucial and challenging [3], [4].
Numerous model-oriented algorithms have been developed to cope with interference management [5, 6, 7, 8, 9], and the existing studies mainly focus on suboptimal or heuristic algorithms, whose performance gaps to the optimal solution are typically difficult to quantify. Besides, the mathematical models are usually assumed to be analytically tractable, but these models are not always accurate because both hardware and channel imperfections can exist in practical communication environments. When specific hardware components and realistic transmission scenarios are considered, such as low-resolution A/D converters, nonlinear amplifiers and realistic user distributions, signal processing techniques are challenging to develop with model-driven tools. Moreover, the computational complexity of these algorithms is high, which makes concrete implementation impractical. Meanwhile, machine learning (ML) [10] algorithms are potentially useful techniques for future intelligent wireless communications. These methods are usually model-free and data-driven [11], [12], and the solutions are obtained through data learning instead of model-oriented analysis and design.
Two main branches of ML are supervised learning and reinforcement learning (RL). With available training input/output pairs, supervised learning is simple but efficient, especially for classification tasks such as modulation recognition [13] and signal detection [14], [15]. However, the correct output data sets or optimal guidance solutions can be difficult to obtain. Meanwhile, RL [16] has been developed as a goal-oriented algorithm, aiming to learn a better policy through exploration of uncharted territory and exploitation of current knowledge. RL is concerned with how agents ought to take actions in an environment so as to maximize some notion of cumulative reward, and the environment is typically formulated as a Markov decision process (MDP) [17]. Therefore, many RL algorithms [16] have been developed using dynamic programming (DP) techniques. In classic RL, a value function or a policy is stored in a tabular form, which leads to the curse of dimensionality and a lack of generalization. To balance generality and efficiency, function approximation is proposed to replace the table, and it can be realized by a neural network (NN) or deep NN (DNN) [18]. Combining RL with DNNs, deep RL (DRL) has been created and widely investigated, and it has achieved stunning performance in a number of noted projects [19] such as the game of Go [20] and Atari video games [21].
The DRL algorithms can be categorized into three groups [19]: value-based, policy-based and actor-critic methods. A value-based DRL algorithm derives the optimal action from the state-action value function, and the most widely used algorithms include deep Q learning (DQL) and Sarsa. A policy-based algorithm such as REINFORCE directly generates a stochastic policy. Both of these two methods have the following defects in general:

Value-based: The action space must be discrete, which introduces quantization error for tasks with continuous action spaces. The output dimension increases exponentially for multi-action issues or joint optimizations.

Policy-based: It is difficult to achieve a balance between exploration and exploitation, and the algorithm usually converges to a suboptimal solution. The variance of the estimated gradient is high. In addition, the action space is still discrete.
The actor-critic algorithm is developed to overcome the aforementioned drawbacks as a hybrid of the value-based and policy-based methods. It consists of two components: an actor to generate the policy and a critic to assess the policy. A better solution is learned by settling a multi-objective optimization problem and updating the parameters of the actor and the critic alternately.
In a communication system where multiple users share a common frequency band, the problem of choosing transmit powers dynamically in response to physical channel conditions in order to maximize the downlink sum-rate under maximum power constraints is NP-hard [3]. Two advanced model-based algorithms, namely fractional programming (FP) [5] and weighted minimum mean squared error (WMMSE) [6], are regarded as benchmarks in the simulation comparisons. Supervised learning is studied in [22], where a DNN is utilized to mimic the guidance algorithm and accelerate the processing speed with acceptable performance loss. An ensemble of DNNs is also proposed to further improve the performance in [23]. As for interference management/power allocation with DRL approaches, current research mainly concentrates on value-based methods. QL or DQL is widely applied in various communication scenarios by a number of articles, such as HetNets [24, 25, 26, 27], cellular networks [28], [29] and V2V broadcasting [30]. To the best of the authors' knowledge, the classic policy-based approach has seldom been considered for this issue [31]. An actor-critic algorithm has been applied to power allocation [32], where a Gaussian probability distribution is used to formulate a stochastic policy.
In this paper, we consider an interfering multiple-access channel (IMAC) scenario similar to [22]. We focus on system-level optimization and aim at maximizing the overall sum-rate through inter-cell interference coordination. This is actually a static optimization problem, where the target is a multivariate ordinary function. While the standard DRL tools are designed for DP problems that can be settled recursively, a direct utilization of these tools to tackle the static optimization problem suffers some performance degradation. In our previous work [33], we verified through simulations that the widely applied standard DQL algorithm suffers sum-rate performance degradation on power allocation. In this work, we explain the reasons for this degradation and revise the DRL algorithms to eliminate it, by developing a theoretical analysis on general DRL approaches to the static optimization problem. On this theoretical basis, three simplified but efficient algorithms, namely policy-based REINFORCE, value-based DQL and actor-critic-based deep deterministic policy gradient (DDPG) [34], are proposed. Simulation results show that the proposed DQL achieves a higher sum-rate than the standard DQL, and our DRL approaches also outperform the state-of-the-art model-based methods. The contributions of this manuscript are summarized as follows:

We develop a mathematical analysis on the proper application of general DRL algorithms to static optimization problems, and we consider dynamic power allocation in multi-user cellular networks.

The training procedure of the proposed DRL algorithms is centralized and the learned model is executed distributively. Both offline and online training are introduced, and an environment tracking mechanism is proposed to dynamically control the online learning.

The logarithmic representation of channel gain and power is used to resolve numerical problems in DNNs and improve training efficiency. Besides, a sorting preprocessing technique is proposed to accommodate varying user densities and reduce the computational load.

On the basis of the proposed general DRL framework for static optimization, the concrete DRL design is further introduced and we propose three novel algorithms, namely REINFORCE, DQL and DDPG, which are respectively policy-based, value-based and actor-critic-based. Comparative simulations on sum-rate performance, generalization ability and computational complexity are also presented.
The remainder of this paper is organized as follows. Section II outlines the power control problem in the wireless cellular network with IMAC. In Section III, the top-level DRL design for the static optimization problem is analyzed and introduced. In Section IV, our proposed DRL approaches are presented in detail. Then, the DRL methods are compared along with benchmark algorithms in different scenarios, and the simulation results are demonstrated in Section V. Conclusions and discussion are given in Section VI.
II System Model
We investigate cross-cell dynamic power allocation in a wireless cellular network with IMAC. The network system is composed of $N$ cells, and a base station (BS) with one transmitter is deployed at each cell center. Assuming shared frequency bands, $K$ users are simultaneously served by the center BS in each cell.
II-A Problem Formulation
At time slot $t$, the independent channel gain between BS $m$ and the $k$-th user in cell $n$ is denoted by $g^t_{m,n,k}$, and can be presented as
(1)  $g^t_{m,n,k} = \left|h^t_{m,n,k}\right|^2 \beta_{m,n,k},$
where $|\cdot|$ is the absolute value operation; $h^t_{m,n,k}$ is a complex Gaussian random variable with Rayleigh distributed magnitude; $\beta_{m,n,k}$ is the large-scale fading component, taking both geometric attenuation and shadow fading into account, and it is assumed to be invariant over the considered time slots. According to the Jakes' model [35], the small-scale flat fading is modeled as a first-order complex Gauss-Markov process
(2)  $h^t_{m,n,k} = \rho\, h^{t-1}_{m,n,k} + \sqrt{1-\rho^2}\, e^t_{m,n,k},$
where $h^0_{m,n,k} \sim \mathcal{CN}(0,1)$ and $e^t_{m,n,k} \sim \mathcal{CN}(0,1)$. The correlation $\rho$ is determined by
(3)  $\rho = J_0\!\left(2\pi f_d T_s\right),$
where $J_0(\cdot)$ is the zero-order Bessel function of the first kind, $f_d$ is the maximum Doppler frequency, and $T_s$ is the time interval between adjacent instants.
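As an illustration of the fading model in (2)-(3), the following Python sketch generates correlated small-scale fading samples; the Doppler frequency and slot interval used here are placeholder values, not the paper's settings.

```python
import numpy as np
from scipy.special import j0  # zero-order Bessel function of the first kind, J_0

def simulate_fading(num_slots, f_d=10.0, t_s=0.02, shape=(), rng=None):
    """Correlated small-scale fading per the Gauss-Markov model in (2)-(3).
    f_d (Hz) and t_s (s) are placeholder values, not the paper's settings."""
    rng = np.random.default_rng() if rng is None else rng
    rho = j0(2.0 * np.pi * f_d * t_s)                       # time correlation, Eq. (3)
    cn = lambda: (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2.0)
    h = cn()                                                # h^0 ~ CN(0, 1)
    samples = [h]
    for _ in range(num_slots - 1):
        h = rho * h + np.sqrt(1.0 - rho ** 2) * cn()        # Eq. (2)
        samples.append(h)
    return np.stack(samples)                                # shape: (num_slots, *shape)
```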
The downlink from the $n$-th BS to its $k$-th serving AP is denoted by $(n,k)$. Supposing that the signals from different transmitters are independent of each other and that the channels remain constant within each time slot, the signal-to-interference-plus-noise ratio (SINR) of link $(n,k)$ in time slot $t$ can be formulated as
(4)  $\mathrm{SINR}^t_{n,k} = \dfrac{g^t_{n,n,k}\, p^t_{n,k}}{g^t_{n,n,k} \sum_{j \neq k} p^t_{n,j} + \sum_{m \in \mathcal{I}_n} g^t_{m,n,k} \sum_{j} p^t_{m,j} + \sigma^2},$
where $\mathcal{I}_n$ is the set of interference cells around the $n$-th cell, $p^t_{n,k}$ is the emitting power of transmitter $n$ to its receiver $(n,k)$ at slot $t$, and $\sigma^2$ denotes the additive noise power. The first and second terms of the denominator represent the intra-cell and inter-cell interference power, respectively. With normalized bandwidth, the downlink rate of $(n,k)$ in time slot $t$ is expressed as
(5)  $C^t_{n,k} = \log_2\!\left(1 + \mathrm{SINR}^t_{n,k}\right).$
Under the maximum power constraint of each transmitter, our goal is to find the optimum powers that maximize the sum-rate objective function. The optimization problem is given as
(6)  $\max_{\mathbf{p}^t}\; C^t\!\left(\mathbf{p}^t, \mathbf{g}^t\right) \quad \mathrm{s.t.}\;\; 0 \le p^t_{n,k} \le P_{\max},\;\; \forall n, k,$
where $P_{\max}$ denotes the maximum emitting power; the power set $\mathbf{p}^t$, the channel gain set $\mathbf{g}^t$, and the sum-rate $C^t$ are respectively defined as
(7)  $\mathbf{p}^t = \left\{ p^t_{n,k} \right\}_{n=1,\dots,N;\; k=1,\dots,K},$
(8)  $\mathbf{g}^t = \left\{ g^t_{m,n,k} \right\}_{m,n=1,\dots,N;\; k=1,\dots,K},$
(9)  $C^t\!\left(\mathbf{p}^t, \mathbf{g}^t\right) = \sum_{n=1}^{N} \sum_{k=1}^{K} C^t_{n,k}.$
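For concreteness, a minimal numerical sketch of the SINR and sum-rate computation in (4), (5) and (9) is given below; the index layout g[m, n, k] (gain from BS m to user k of cell n) is an assumption made for illustration.

```python
import numpy as np

def sum_rate(g, p, noise_power):
    """Sum-rate of an N-cell, K-user-per-cell IMAC following (4), (5) and (9).
    g[m, n, k]: gain from BS m to user k of cell n (assumed layout); p[n, k]: power."""
    N, K = p.shape
    total = 0.0
    for n in range(N):
        for k in range(K):
            signal = g[n, n, k] * p[n, k]
            intra = g[n, n, k] * (p[n].sum() - p[n, k])               # same-cell interference
            inter = sum(g[m, n, k] * p[m].sum() for m in range(N) if m != n)
            sinr = signal / (intra + inter + noise_power)
            total += np.log2(1.0 + sinr)                              # Eq. (5), unit bandwidth
    return total

# toy usage with random gains (illustrative only)
rng = np.random.default_rng(0)
g = rng.exponential(size=(3, 3, 2))
p = rng.uniform(0.0, 1.0, size=(3, 2))
print(sum_rate(g, p, noise_power=1e-3))
```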
The problem (6) is non-convex and NP-hard. As for the model-based methods, the performance gaps to the optimal solution are typically difficult to quantify, and their practical implementation is restricted by high computational complexity. More importantly, model-oriented approaches cannot accommodate future heterogeneous service requirements and randomly evolving environments, and thus data-driven DRL algorithms are discussed and studied in the following sections.
III Deep Reinforcement Learning
III-A Problem Formulation
A general MDP problem concerns a single agent or multiple agents interacting with an environment. In each interaction, the agent takes an action $a_t$ according to its policy $\pi$ based on the observed state $s_t$, and then receives a feedback reward $r_t$ and a new state $s_{t+1}$ from the environment. The agent aims to find an optimal policy that maximizes the cumulative reward over the continuing interactions, and DRL algorithms are developed for such problems.
To facilitate the analysis, the discrete-time model-based MDP is considered, and the action and state spaces are assumed to be finite. The tuple $\left(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}\right)$ is known, where the elements are

$\mathcal{S}$, a finite set of states,

$\mathcal{A}$, a finite set of actions,

$\mathcal{P}$, the transition probabilities, where $p\!\left(s_{t+1} \mid s_t, a_t\right)$ is the probability that action $a_t$ in state $s_t$ will lead to state $s_{t+1}$,

$\mathcal{R}$, a finite set of immediate rewards, where element $r\!\left(s_t, a_t, s_{t+1}\right)$ denotes the reward obtained after transitioning from state $s_t$ to state $s_{t+1}$, due to action $a_t$.
Under a stochastic policy $\pi$, the $T$-step cumulative reward and the discounted cumulative reward are considered as the state value function. With initial state $s_0$, they are defined as
(10)  $V^{T}_{\pi}(s_0) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T-1} r_t \,\middle|\, s_0\right]$
and
(11)  $V^{\gamma}_{\pi}(s_0) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0\right],$
where $\gamma \in [0,1)$ denotes a discount factor that trades off the importance of immediate and future rewards, and $\mathbb{E}[\cdot]$ is the expectation operation. For an initial state-action pair $(s_0, a_0)$, the state-action value functions, namely the Q functions, are defined as
(12)  $Q^{T}_{\pi}(s_0, a_0) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T-1} r_t \,\middle|\, s_0, a_0\right]$
and
(13)  $Q^{\gamma}_{\pi}(s_0, a_0) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0, a_0\right].$
Starting from the perspective of the MDP, the following conclusions are given when the environment satisfies certain conditions.
Theorem 1.
When the environment transition is independent of the action, and the reward at each instant is determined only by the current state and action, then the optimal policy for maximization of the cumulative reward is equivalent to a combination of single-step reward maximizations.
Proof.
First, we focus on (10) and it is expanded as
(14) 
The assumed conditions can be mathematically formulated as
(15)  $p\!\left(s_{t+1} \mid s_t, a_t\right) = p\!\left(s_{t+1} \mid s_t\right),$
(16)  $r_t = r\!\left(s_t, a_t\right).$
Without loss of generality, for the probability mass functions of the policy and the state transition, clearly we have
(17)  $\sum_{a_t \in \mathcal{A}} \pi\!\left(a_t \mid s_t\right) = 1,$
(18)  $\sum_{s_{t+1} \in \mathcal{S}} p\!\left(s_{t+1} \mid s_t\right) = 1.$
From (15), (16), (17) and (18), the state value function (14) can be rewritten as
(19) 
The full unrolling of (19) is given as
(20) 
Since the state transition is irrelevant to the action, the state sequence can be sampled independently of the policy.
Lemma 1.
With the state sequence $\{s_0, s_1, \dots\}$, the maximization of (20) with respect to the policy can be decomposed into the following subproblem:
(21) 
Proof.
(22) 
∎
Obviously, with Lemma 1, the maximization of (20) with respect to the policy can be decomposed into per-step subproblems:
(23) 
Besides, the equivalence proof for the discounted cumulative reward is similar.
∎
Since the channel is modeled as a first-order Markov process, the environment satisfies the two conditions in Theorem 1. Then, by letting the state, action and reward correspond to the channel gains, the emitting powers and the sum-rate, respectively, and together with the power constraints, the optimization problem (6) with the DRL approach is equivalent to (22).
Although the equivalence is mathematically proved and does not depend on the value of $\gamma$ or $T$, several facts must be observed when an improper hyperparameter is adopted. We take the value-based method as an example, and the optimal Q function associated with the Bellman equation is given as
(24)  $Q^{*}(s_t, a_t) = \mathbb{E}\!\left[ r_t + \gamma \max_{a_{t+1}} Q^{*}\!\left(s_{t+1}, a_{t+1}\right) \,\middle|\, s_t, a_t \right].$
This function must be estimated precisely to achieve the optimal action. Here we list two issues caused by $\gamma > 0$:

The Q value is overestimated by an action-independent bias. This effect actually has little or no influence on the final performance, since the deviation does not depend on the action $a_t$.

The variance of the Q value is enlarged, and grows as $\gamma$ increases. During training, this additional noise on the data can slow down the convergence speed and can also deteriorate the performance of the learned DNN.
In [33], we verified in simulations that an increasing $\gamma$ has a negative influence on the sum-rate performance of the DQN, as shown in Fig. 1. Therefore, we suggest using the hyperparameter $\gamma = 0$ or $T = 1$ in this specific scenario, and thus the Q function reduces to the reward function. In the remainder of this article, we make this adjustment to the standard DRL algorithms, and particularly claim that the Q function is equal to the reward function. The aforementioned analysis and discussion provide the design guidance for the DRL algorithms that follow.
III-B Centralized Training & Distributed Execution
In (6), only a single center agent is trained and then implemented. Under this framework, the current local channel state information (CSI) is first estimated and transmitted to the center agent for further processing. The decisions on allocated powers are then broadcast to the corresponding transmitters and executed. However, several defects of the centralized framework with a massive number of cells must be observed:

Space explosion: The cardinalities of the DNN input/output are proportional to the cell number $N$, and training such a DNN is difficult since the state-action space increases exponentially with the I/O dimensions. Additionally, exploration in a high-dimensional space is inefficient, and thus the learning can be impractical.

Delivery pressure: The center agent requires the full CSI of the communication network at the current time. When the cell number is large and low-latency service is required, both transmitting the CSI to the agent and broadcasting the allocation scheme to each transmitter are challenging.
In [36], a framework of centralized training and distributed execution was proposed to address these challenges. The power allocation scheme is decentralized: the transmitter of each link is regarded as an agent, and all agents in the communication network operate synchronously and distributively. Meanwhile, each agent consumes only partial channel information and outputs its own power $p^t_{n,k}$, where the local observation is defined as
(25)
The multi-objective programming is established as
(26)
However, multi-agent training is still difficult, since it requires much more learning data, training time and DNN parameters. Besides, links in distinct areas are approximately identical since their characteristics are location-invariant and the network is large. To simplify this issue, all agents are treated as the same agent: the same policy is shared and is learned with data collected from all links. Therefore, the training is centralized, while the execution is distributed. The detailed design of the DRL algorithms will be introduced in the following section.
III-C Online Training
In our previously proposed model-free two-step training framework [33], the DNN is first pre-trained offline in simulated wireless communication scenarios. This procedure reduces the online training stress, due to the large data requirement of data-driven algorithms by nature. Second, with transfer learning, the offline learned DNN can be deployed in real networks. However, it will suffer from the imperfections of real implementations, the dynamic wireless channel environment and other unknown issues. Therefore, the agent must be trained online in the initial deployment, in order to adapt to actual unknown issues that cannot be simulated. To prevent a prolonged degradation of the system performance, updating the DNN parameters to accommodate environment changes is also necessary.
One simple but brute-force approach is continuous regular training, which leads to a great waste of network performance and computation resources. Online training is costly for several reasons. First, interaction with the real environment is required, and this exploration degrades the sum-rate performance of the communication system to some extent. Second, the training requires high-performance computing to reduce the time cost, while the hardware is expensive and power-hungry. Moreover, regular training is unnecessary when the environment fluctuation is negligible, yet it cannot respond in time to abrupt changes.
Therefore, we propose an environment tracking mechanism as an efficient approach to dynamically control the agent training. For DRL algorithms, a shift of the environment indicates that the reward function has changed, and thus the policy or Q function must be adjusted correspondingly to avoid performance degradation. Hence, the Q value needs to approximate the reward value as accurately as possible. We define the normalized critic loss as
(27)
where $\theta$ denotes the DNN parameter and the loss is evaluated over an observation window; it is an index of the accuracy of the Q-function approximation to the actual environment. Once this index exceeds some fixed threshold, the training of the DNN is initiated to track the current environment; otherwise, the learning procedure is omitted. The introduced tracking mechanism achieves a balance between performance and efficiency. With online training, the DRL is model-free and data-driven in a true sense.
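A minimal sketch of such a tracking trigger is shown below; the exact normalization in (27) is not recovered from the text, so the mean-square form and the threshold value here are assumptions.

```python
import numpy as np

def should_retrain(predicted_q, observed_r, threshold=0.05):
    """Environment-tracking trigger for the online training in Section III-C.
    predicted_q / observed_r: critic outputs and rewards over the observation window.
    The mean-square normalization and the threshold are assumptions, not Eq. (27)."""
    predicted_q = np.asarray(predicted_q, dtype=float)
    observed_r = np.asarray(observed_r, dtype=float)
    err = np.mean((predicted_q - observed_r) ** 2)
    norm = np.mean(observed_r ** 2) + 1e-12                 # normalize by the reward power
    return err / norm > threshold                           # retrain only when tracking fails
```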
IV DRL Algorithm Design
IV-A Concrete DRL Design
In the previous section we discussed the DRL design on a macro level; the concrete design of several DRL algorithms, namely REINFORCE, DQL and DDPG, is introduced in this section. First, the descriptions of the state, reward and action are given, as an expansion of Section III-B.
IV-A1 State
The selection of environment information is significant, and obviously the current partial CSI is the most critical feature. It is inappropriate to directly use the raw channel gains as DNN input due to numerical issues. In [33], a logarithmic normalized expression is proposed, and it is given as
(28)
where $\otimes$ is the Kronecker product, and $\mathbf{1}$ is a vector filled with ones. The channel gain elements are normalized by the direct downlink gain, and the logarithmic representation is preferred since the amplitudes often vary by orders of magnitude. The cardinality of this feature changes with varying AP densities. First, we define the sorting function
(29)
where the input set is sorted in decreasing order and the first elements are selected as a new set, whose component indices are also returned. To further reduce the input dimension and accommodate different AP densities, the new set and its indices are obtained by (29) with a constant number of retained elements.
The channel is modeled as a Markov process and is correlated in the time domain, and thus the solutions of the last slot provide both a better initialization for the current slot and interference information. In correspondence with the sorted channel set, the last power set is defined as
(30)
Irrelevant or weakly correlated input elements consume more computational resources and can even lead to performance degradation, but some auxiliary information can improve the sum-rate performance of the DNN. Similar to (30), the auxiliary feature is given by
(31)
Two types of feature are considered, and they are written as
(32)
(33)
The partially observed state for the DRL algorithms can be either of the two, and their performance will be compared in the simulation section. Moreover, the cardinalities of the states, i.e., the input dimensions, are determined by the number of retained interferers.
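The sketch below illustrates one possible realization of this preprocessing (log-normalized gains, top-c sorting, and last-slot power/rate features); the exact compositions of (28)-(33) are not recovered from the text, so the feature layout, the name build_state and the constant c are assumptions.

```python
import numpy as np

def build_state(g_local, p_last, r_last, c=16, eps=1e-12):
    """Hypothetical local-state preprocessing for one link (Section IV-A1).
    g_local: channel gains seen by the link, with the direct gain first;
    p_last / r_last: last-slot powers and rates of the same interfering links;
    c: number of retained strongest interferers (a placeholder constant)."""
    direct = float(g_local[0])
    interferers = np.asarray(g_local[1:], dtype=float)
    order = np.argsort(interferers)[::-1][:c]                              # sorting, cf. Eq. (29)
    log_gain = 10.0 * np.log10(interferers[order] / (direct + eps) + eps)  # log-normalized, cf. (28)
    p_sel = np.asarray(p_last, dtype=float)[order]                         # last powers, cf. (30)
    r_sel = np.asarray(r_last, dtype=float)[order]                         # auxiliary rates, cf. (31)
    return np.concatenate(([10.0 * np.log10(direct + eps)], log_gain, p_sel, r_sel))
```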
IV-A2 Reward
According to our investigation, there is little work on strict design criteria for the reward function due to the problem complexity. In general, the reward function is elaborately designed to improve the agent's transmission rate and also to mitigate its interference to neighbouring links [25, 26, 27, 28, 29, 30]. In our previous work, we used the averaged sum-rate (9) as the reward, so that the sum of all rewards equals the network sum-rate. However, rates from remote cells are then introduced, and they have little relationship with the decision of the local action. These irrelevant elements enlarge the variance of the reward function, and thus the DNN becomes hard to train when the network becomes large. Therefore, the localized reward function is proposed as
(34)
where $\lambda \in \mathbb{R}^{+}$ is a weight coefficient of the interference effect. The sum of the local rewards is proportional to the sum-rate,
(35)
when the cell number is sufficiently large.
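As an illustration only, a localized reward in the spirit of (34) can be computed as below; the exact penalty term in the paper is not recovered, so the rate-loss penalty and the weight value are assumptions.

```python
import numpy as np

def local_reward(own_rate, neighbor_rate_loss, lam=1.0):
    """Localized reward in the spirit of Eq. (34): the agent's own rate minus a weighted
    penalty for the rate loss it causes at neighboring links. The penalty form and the
    weight value lam are assumptions; the paper's exact expression may differ."""
    return float(own_rate) - lam * float(np.sum(neighbor_rate_loss))
```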
IV-A3 Action
The downlink power $p^t_{n,k}$ is a non-negative continuous scalar, limited by the maximum power $P_{\max}$. Since the action space must be finite for certain algorithms such as DQL and REINFORCE, the possible emitting power is quantized into discrete levels. The allowed power set is given as
(36)
where $P_{\min}$ is the non-zero minimum emitting power. Discretization of a continuous variable results in quantization error. Meanwhile, the actor of DDPG directly outputs a deterministic action, and this constrained continuous scalar is generated by a scaled sigmoid function:
(37)  $p^t_{n,k} = \dfrac{P_{\max}}{1 + e^{-z}},$
where $z$ is the pre-activation output. Besides eliminating the quantization error, DDPG has great potential for multi-action tasks: its output dimension grows linearly with the number of actions, while for both DQL and REINFORCE the output dimension grows exponentially, since every combination of quantized levels must be enumerated. The application of such algorithms to multi-action tasks is therefore impractical.
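The sketch below shows the two action representations: a discrete power set for DQL/REINFORCE in the spirit of (36) (the logarithmic spacing is an assumption) and the scaled sigmoid output of the DDPG actor in (37).

```python
import numpy as np

def discrete_power_levels(p_min, p_max, num_levels):
    """Discrete action set for DQL/REINFORCE in the spirit of Eq. (36): zero power plus
    levels spaced logarithmically between p_min and p_max (the spacing is an assumption)."""
    return np.concatenate(([0.0], np.logspace(np.log10(p_min), np.log10(p_max), num_levels - 1)))

def ddpg_power(z, p_max):
    """Scaled sigmoid mapping of the DDPG actor pre-activation output z, cf. Eq. (37)."""
    return p_max / (1.0 + np.exp(-z))
```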
IV-A4 Experience Replay
The concept of "experience replay" is proposed to deal with the following problem: the data is correlated and non-stationarily distributed in MDPs, while the training samples for a DNN should be independent and identically distributed (i.i.d.). In our investigated problem, the data correlation in the time domain is not strong, and this technique is optional.
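A minimal replay buffer sketch is given below; since the discount factor is set to zero in this work, each stored transition only needs the state, action and reward (capacity and batch size here are illustrative).

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience replay (Section IV-A4); capacity and batch size are illustrative."""
    def __init__(self, capacity=50000):
        self.buf = deque(maxlen=capacity)

    def push(self, state, action, reward):
        # With gamma = 0 the learning target is the reward itself, so no next state is stored.
        self.buf.append((state, action, reward))

    def sample(self, batch_size=128):
        return random.sample(list(self.buf), min(batch_size, len(self.buf)))
```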
IV-B Policy-based: REINFORCE
REINFORCE is derived as a Monte-Carlo policy-gradient learning algorithm [37], [38]. Policy-based algorithms directly generate a stochastic policy instead of an indirect Q valuation, and the policy is parameterized by a policy network with parameter $\theta_\pi$, as shown in Fig. 2. The overall strategy of stochastic gradient ascent requires a way to obtain samples such that the expectation of the sample gradient is proportional to the actual gradient of the performance measure as a function of the parameter. The goal of REINFORCE is to maximize the expected reward under the policy:
(38)  $J(\theta_\pi) = \mathbb{E}_{\pi_{\theta_\pi}}\!\left[\, r \,\right],$
where $\pi_{\theta_\pi}$ denotes the policy network and $\theta_\pi$ is its parameter. The gradient of (38) with Monte-Carlo sampling is presented as
(39)  $\nabla_{\theta_\pi} J(\theta_\pi) \approx \frac{1}{B} \sum_{b=1}^{B} r^{(b)}\, \nabla_{\theta_\pi} \log \pi_{\theta_\pi}\!\left(a^{(b)} \mid s^{(b)}\right),$
where $\nabla$ is the gradient operation and $B$ is the number of sampled transitions. The complete derivation is presented in [16]. Since the policy network directly generates a stochastic policy, the optimal action is selected with the maximum probability:
(40)  $a^{*} = \arg\max_{a \in \mathcal{A}} \pi_{\theta_\pi}\!\left(a \mid s\right),$
and the optimal action value is obtained by a mapping table. Besides, during exploration a random action is sampled following the stochastic policy $\pi_{\theta_\pi}$.
In practical training, the algorithm is susceptible to reward scaling. We can alleviate this dependency by whitening the rewards before computing the gradients, and the normalization of the reward is given as
(41)  $\tilde{r} = \frac{r - \mu_r}{\sigma_r},$
where $\mu_r$ and $\sigma_r$ are the mean value and standard deviation of the reward $r$, respectively. The proposed REINFORCE algorithm is stated in Algorithm 1.
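A compact PyTorch sketch of one REINFORCE update under the single-step (gamma = 0) setting is shown below; the policy_net interface (states in, action logits out) and the batch shapes are assumptions, not the paper's implementation.

```python
import torch

def reinforce_step(policy_net, optimizer, states, actions, rewards):
    """One REINFORCE update with single-step returns (gamma = 0) and the reward
    whitening of Eq. (41). policy_net (states -> action logits) is an assumed interface."""
    r = torch.as_tensor(rewards, dtype=torch.float32)
    r = (r - r.mean()) / (r.std() + 1e-8)                    # reward whitening, Eq. (41)
    logits = policy_net(torch.as_tensor(states, dtype=torch.float32))
    log_prob = torch.log_softmax(logits, dim=-1)
    chosen = log_prob.gather(1, torch.as_tensor(actions).long().view(-1, 1)).squeeze(1)
    loss = -(chosen * r).mean()                              # ascend E[r * grad log pi], cf. (39)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```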
IV-C Value-based: DQL
DQL is one of the most popular value-based off-policy DRL algorithms. As shown in Fig. 2, the topologies of DQL and REINFORCE are the same, and the Q values are estimated by a deep Q network (DQN) with parameter $\theta_Q$. The selection of a good action is based upon accurate estimation, and thus DQL aims to search for the optimal parameter that minimizes the loss:
(42)  $L(\theta_Q) = \mathbb{E}\!\left[ \left( r - Q_{\theta_Q}(s, a) \right)^{2} \right].$
The gradient with respect to $\theta_Q$ is given as
(43)  $\nabla_{\theta_Q} L(\theta_Q) = -2\, \mathbb{E}\!\left[ \left( r - Q_{\theta_Q}(s, a) \right) \nabla_{\theta_Q} Q_{\theta_Q}(s, a) \right].$
The optimal action is selected to maximize the Q value, and it is given by
(44)  $a^{*} = \arg\max_{a \in \mathcal{A}} Q_{\theta_Q}\!\left(s, a\right).$
During training, a dynamic $\epsilon$-greedy policy is adopted to control the exploration probability, and it is defined as
(45)
where $i$ denotes the episode index, and $\epsilon_{\mathrm{init}}$ and $\epsilon_{\mathrm{fin}}$ are the initial and final exploration probabilities, respectively. A detailed description of our DQL algorithm is presented in Algorithm 2.
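The following PyTorch sketch performs one DQL update under the reward-as-target setting of Section III (gamma = 0); the q_net interface is an assumed one for illustration.

```python
import torch

def dql_step(q_net, optimizer, states, actions, rewards):
    """One DQL update with gamma = 0, so the regression target is the immediate reward
    rather than a bootstrapped Bellman target. q_net (states -> Q values over the
    discrete power levels) is an assumed interface."""
    s = torch.as_tensor(states, dtype=torch.float32)
    a = torch.as_tensor(actions).long().view(-1, 1)
    r = torch.as_tensor(rewards, dtype=torch.float32)
    q = q_net(s).gather(1, a).squeeze(1)                     # Q(s, a) of the taken actions
    loss = torch.nn.functional.mse_loss(q, r)                # cf. the loss in (42)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```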
IV-D Actor-Critic: DDPG
DDPG is an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces. As shown in Fig. 3, an actor generates a deterministic action from the observation through a mapping network $\mu_{\theta_\mu}(s)$, where $\theta_\mu$ denotes the actor parameter. The critic predicts the Q value of an action-state pair through a critic network $Q_{\theta_c}(s_c, a)$, where $\theta_c$ denotes the critic parameter and $s_c$ is the critic state. The critic and the actor work cooperatively, and the optimal deterministic policy is achieved by solving the following joint optimization problem:
(46)  $\max_{\theta_\mu}\; \mathbb{E}\!\left[ Q_{\theta_c}\!\left( s_c, \mu_{\theta_\mu}(s) \right) \right],$
(47)  $\min_{\theta_c}\; \mathbb{E}\!\left[ \left( r - Q_{\theta_c}\!\left( s_c, a \right) \right)^{2} \right].$
The actor strives to maximize the evaluation from the critic, and the critic aims to make this assessment precise. Both the actor and the critic are differentiable, and using the chain rule their gradients are given as
(48)  $\nabla_{\theta_\mu} J = \mathbb{E}\!\left[ \nabla_{a} Q_{\theta_c}\!\left( s_c, a \right)\big|_{a = \mu_{\theta_\mu}(s)}\, \nabla_{\theta_\mu} \mu_{\theta_\mu}(s) \right],$
(49)  $\nabla_{\theta_c} L = -2\, \mathbb{E}\!\left[ \left( r - Q_{\theta_c}\!\left( s_c, a \right) \right) \nabla_{\theta_c} Q_{\theta_c}\!\left( s_c, a \right) \right].$
The deterministic action is directly obtained by the actor:
(50)  $a_t = \mu_{\theta_\mu}\!\left( s_t \right).$
Similar to the dynamic $\epsilon$-greedy policy, the exploration action in episode $i$ is defined as
(51)  $a_t = \mu_{\theta_\mu}\!\left( s_t \right) + n_i,$
where $n_i$ is an additive noise that follows a uniform distribution:
(52)
and the action is bounded by the interval $\left[0, P_{\max}\right]$.
The critic can be regarded as an auxiliary network that transfers the gradient during learning, and it is not needed in testing. The critic must be differentiable, but not necessarily trainable. The critic is model-based in this approach, since the evaluating rules are available through (4), (5) and (34) in offline training.
However, such a model-based critic is fixed and cannot accommodate the unknown issues in online training. Meanwhile, a complex reward function is difficult to approximate accurately with pure NN parameters. Therefore, a semi-model-free critic is suggested, which utilizes both prior knowledge and the flexibility of the NN. Similar to the preprocessing of the actor state, the state for the critic is obtained by (4), (5), (29) and (31). The detailed DDPG algorithm is introduced in Algorithm 3.
The policy gradient algorithm is developed with a stochastic policy, but sampling in continuous or high-dimensional action spaces is inefficient; the deterministic policy gradient is proposed to overcome this problem. On the other hand, in contrast with the value-based DQL, the critic and the Q-value estimator are similar in terms of function. The difference is that the critic takes both the state and the action as input and then predicts a single Q value, whereas the DQN estimates the Q values of all actions with only the state as input.
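The PyTorch sketch below performs one joint actor-critic update in the style of (46)-(49) under the gamma = 0 setting; the actor/critic interfaces and the separate critic state are assumptions about the implementation, not code from the paper.

```python
import torch

def ddpg_step(actor, critic, actor_opt, critic_opt, states, critic_states, actions, rewards):
    """One joint actor-critic update in the style of (46)-(49) with gamma = 0: the critic
    regresses the immediate reward, and the actor ascends the critic's evaluation of its
    own action. All network interfaces here are assumptions, not the paper's code."""
    s = torch.as_tensor(states, dtype=torch.float32)
    sc = torch.as_tensor(critic_states, dtype=torch.float32)
    a = torch.as_tensor(actions, dtype=torch.float32).view(len(rewards), -1)
    r = torch.as_tensor(rewards, dtype=torch.float32)

    # Critic update: make Q(s_c, a) match the observed reward, cf. (47).
    q = critic(sc, a).squeeze(-1)
    critic_loss = torch.nn.functional.mse_loss(q, r)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: maximize the critic's assessment of the actor's own action, cf. (46).
    actor_loss = -critic(sc, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    return critic_loss.item(), actor_loss.item()
```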
V Simulation Results
V-A Simulation Configuration
In the training procedure, a cellular network with $N$ cells is considered. In each cell, APs are located uniformly and randomly between an inner protection radius and the half cell-to-cell distance. A fixed Doppler frequency $f_d$ and time period $T_s$ are adopted to simulate the fading effects. According to the LTE standard, the large-scale fading is modeled as
(53)
where the log-normal shadowing variable follows a zero-mean Gaussian distribution in dB, and $d$ is the transmitter-to-receiver distance. The additive white Gaussian noise (AWGN) power, the emitting power constraints $P_{\min}$ and $P_{\max}$, and the maximal SINR are fixed to constant values.
The cardinality of adjacent cells, the number of retained interferers, the number of power levels and the weight coefficient $\lambda$ are also constants, which determine the input state dimensions of the two features. The large-scale fading is invariant within each episode; the number of episodes is chosen large to overcome the generalization problem, while the number of time slots per episode is kept small to reduce overfitting to a specific scenario. Adam [39] is adopted as the optimizer for all DRLs. In Table I, the architectures of all DNNs and the hyperparameter settings are listed in detail; the left and right parts of each layer entry are the activation function and the neuron number, respectively. These default settings will be clarified once changed in the following simulations. The training procedure is independently repeated several times for each algorithm design, and the testing result is obtained from independently generated scenarios. The simulation codes are available at https://github.com/mengxiaomao/DRL_PA.
V-B DRL Algorithm Comparison
In this subsection, the sum-rate performance of REINFORCE, DQL and DDPG is studied in terms of experience replay, feature selection and quantization error. Three metrics are used over the independent repetitive experiments: the variance of the sum-rate, the average sum-rate, and the average sum-rate of the top-performing runs; the last one is an indicator of the performance of the well-trained algorithms.
V-B1 Experience Replay
Since the parameter initialization and data generation are stochastic, the performance of the DRL algorithms can be influenced to varying degrees. As shown in Table II (the proposed DDPG is not applicable for experience replay and thus the corresponding simulation result is omitted), REINFORCE and experience replay are abbreviated as RF and ER, respectively. Generally, experience replay helps the DRLs reduce the variance of the sum-rate and improve the average sum-rate, but its influence on the best results is negligible.
The variance of REINFORCE is the highest, and we find it difficult to stabilize the training results even with experience replay and the normalization in (41). In contrast, DQL is much more stable, while the variance of DDPG is the lowest, up to one or more orders of magnitude lower than that of REINFORCE. This indicates that DDPG has strong robustness to random issues. Moreover, DDPG achieves the highest average sum-rate. In general, the performance of REINFORCE and DQL is almost the same; REINFORCE performs slightly better than DQL but has weaker stability. DDPG outperforms both algorithms in terms of both sum-rate performance and robustness.
V-B2 Feature Engineering
Next we compare the performance with the two features. As shown in Table II and Fig. 4, the auxiliary information generally improves the average sum-rate. Besides, the improvement on the best results is notable, especially for the DDPG algorithm. We speculate that the corresponding mapping function is hard to approximate for a simple NN, due to the multiplication, division and exponentiation operations in (4) and (5). Meanwhile, the variance is increased by the additional feature. To achieve the highest sum-rate score by repetitive training, this feature is important, although the improved performance comes at the cost of an enlarged input dimension and longer training. On the other hand, the simplified feature state is meaningful for online training, since data and computational resources can be restricted and costly.
V-B3 Quantization Error
In principle, the quantization error can be gradually reduced by increasing the number of quantization levels. Therefore, the number of power levels is varied in this experiment, and the best result is used as the measurement. As illustrated in Fig. 5, the best results of REINFORCE and DQL both rise slightly as the number of power levels increases at first. However, a further increase of the output dimension cannot improve the sum-rate performance: the best result of DQL drops slowly, while that of REINFORCE experiences a dramatic decline as the action space keeps growing. This indicates that a huge action space can lead to difficulties in practical training, especially for REINFORCE, and also that full elimination of the quantization error is infeasible by simply enlarging the action space. In addition, DDPG needs no discretization of the action space by nature, and it outperforms both the DQL and REINFORCE algorithms.
[Table II: sum-rate statistics of RF, RF-ER, DQL, DQL-ER and DDPG; numerical entries not recovered.]
V-C Generalization Performance
For the following simulations, the learned models with the best result and its corresponding feature are selected for further study. In the previous subsection, we mainly focused on comparisons between different DRL algorithms, where the training and testing sets are i.i.d. However, the statistical characteristics of real scenarios vary over time, and tracking the environment with frequent online training is impractical. Therefore, a good generalization ability is essential for robustness against such changes. The FP, WMMSE, maximum power and random power schemes are considered as benchmarks to evaluate our proposed DRL algorithms.
V-C1 Cell Range
In this part, the half cell-to-cell range is regarded as a variable. Nowadays, cells are getting smaller, and thus a set of decreasing cell ranges is considered. As shown in Fig. 6, the intra-/inter-cell interference generally becomes stronger as the cell range shrinks, and thus the average sum-rate decreases. The sum-rate performance of random/maximum power is the lowest, while FP and WMMSE achieve much higher spectral efficiency; the performances of these two algorithms are comparable, with WMMSE performing slightly better than FP. In contrast, all the data-driven algorithms outperform the model-driven methods, and the proposed actor-critic-based DDPG achieves the highest sum-rate. Additionally, the learned models are obtained in a simulation environment with a fixed cell range, but no performance degradation is found in these unseen scenarios. Therefore, our learned data-driven models show good generalization ability with respect to varying cell ranges.
V-C2 User Density
In a practical scenario, the user density can change over time and location, so it is considered in this simulation. The user density is varied through the number of APs per cell. As plotted in Fig. 7, the average sum-rate drops as the users become denser, and all the algorithms follow a similar trend. Apparently, the DRL approaches outperform the other schemes, and DDPG again achieves the best sum-rate performance. Hence, the simulation result shows that the learned data-driven models also generalize well to different user densities.
V-C3 Doppler Frequency
The Doppler frequency is a significant variable related to the small-scale fading. Since the information from the last instant is utilized for the current power allocation, fast fading can lead to performance degradation for our proposed data-driven models, whereas the model-driven algorithms are not influenced by the Doppler frequency by nature. The Doppler frequency is sampled over a range of values, and the simulation results in Fig. 8 show that the average sum-rates of the data-driven algorithms drop only slowly in this range. This indicates that the data-driven models are also robust against the Doppler frequency.
V-D Computation Complexity
Low computational complexity is crucial for algorithm deployment and it is considered here. The simulation platform is as follows: CPU Intel i7-6700 and GPU Nvidia GTX 1070 Ti. For the simulated cellular network, the time cost per execution of our proposed distributed algorithms and of the centralized model-based methods is listed in Table III. Interestingly, the calculation time with the GPU is higher than that with the CPU, and we consider that the GPU cannot be fully utilized with small-scale DNNs and distributed execution (the common batch operation cannot be used under distributed execution in a real scenario). It can be seen that the time costs of the three DRLs are almost the same due to their similar DNN models, and in terms of CPU time alone they are considerably faster than FP and WMMSE. The fast execution speed with DNN tools can be explained by several points:

The execution of our proposed algorithms is distributed, and thus the time expense remains constant as the total number of users increases, at the cost of more computing devices (equal in number to the links).

Most of the operations in DNNs involve matrix multiplication and addition, which can be accelerated by parallel computation. Besides, the simple but efficient activation function ReLU, $\max(0, x)$, is adopted.
In summary, the low computational time cost of the proposed DRLs can be attributed to the distributed execution framework, the parallel computing architecture, and simple but efficient functions.
[Table III: time cost per execution of RF, DQL, DDPG, FP and WMMSE on CPU and GPU; numerical entries not recovered.]
VI Conclusions & Discussions
The distributed power allocation with the proposed DRL algorithms in wireless cellular networks with IMAC was investigated. We presented a mathematical analysis on the proper design and application of DRL algorithms at a systematic level by considering inter-cell cooperation, offline/online training and distributed execution. The concrete algorithm design was further introduced. In theory, the sum-rate performances of the DQL and REINFORCE algorithms are the same with proper training, and DDPG outperforms these two methods by eliminating the quantization error. The simulation results agree with our expectation, and DDPG performs the best in terms of both sum-rate performance and robustness. Besides, all the data-driven approaches outperform the state-of-the-art model-based methods, and also show good generalization ability and low computational time cost in a series of experiments.
The data-driven algorithm, especially DRL, is a promising technique for future intelligent networks, and the proposed DDPG algorithm can be applied to general tasks with discrete/continuous state/action spaces and to joint optimization problems with multiple variables. Specifically, the algorithm can be applied to problems such as user scheduling, channel management and power allocation in various communication networks.
VII Acknowledgments
This work was supported in part by the National Natural Science Foundation of China (Grant No. 61801112, 61601281), the Natural Science Foundation of Jiangsu Province (Grant No. BK20180357), the Open Program of State Key Laboratory of Millimeter Waves (Southeast University, Grant No. Z201804).
References
 [1] R. Q. Hu and Y. Qian, “An energy efficient and spectrum efficient wireless heterogeneous network framework for 5G systems,” IEEE Commun. Mag., vol. 52, no. 5, pp. 94–101, May 2014.
 [2] H. Zhang, X. Chu, W. Guo, and S. Wang, “Coexistence of wifi and heterogeneous small cell networks sharing unlicensed spectrum,” IEEE Commun. Mag., vol. 53, no. 3, pp. 158–164, Mar. 2015.
 [3] Z. Q. Luo and S. Zhang, “Dynamic spectrum management: Complexity and duality,” IEEE J. Sel. Topics Signal Process., vol. 2, no. 1, pp. 57–73, Feb. 2008.
 [4] F. Boccardi, R. W. Heath, A. Lozano, T. L. Marzetta, and P. Popovski, “Five disruptive technology directions for 5G,” IEEE Commun. Mag., vol. 52, no. 2, pp. 74–80, Feb. 2013.
 [5] K. Shen and W. Yu, “Fractional programming for communication systems  Part I: Power control and beamforming,” IEEE Trans. Signal Process., vol. 66, no. 10, pp. 2616–2630, May 2018.
 [6] Q. Shi, M. Razaviyayn, Z. Q. Luo, and C. He, “An iteratively weighted MMSE approach to distributed sum-utility maximization for a MIMO interfering broadcast channel,” in IEEE Int. Conf. Acoust., Speech, Signal Process., 2011, pp. 4331–4340.
 [7] M. Chiang, P. Hande, T. Lan, and C. W. Tan, “Power control in wireless cellular networks,” Found. Trends Netw., vol. 2, no. 4, pp. 381–533, 2008.
 [8] H. Zhang, L. Venturino, N. Prasad, P. Li, S. Rangarajan, and X. Wang, “Weighted sumrate maximization in multicell networks via coordinated scheduling and discrete power control,” IEEE J. Sel. Areas Commun., vol. 29, no. 6, pp. 1214–1224, Jun. 2011.
 [9] W. Yu, T. Kwon, and C. Shin, “Multicell coordination via joint scheduling, beamforming, and power spectrum adaptation,” IEEE Trans. Wireless Commun., vol. 12, no. 7, pp. 3300–3313, Jul. 2013.
 [10] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). SpringerVerlag New York, Inc., 2006.
 [11] T. O’Shea and J. Hoydis, “An introduction to deep learning for the physical layer,” IEEE Trans. Cogn. Commun. Netw., vol. 3, no. 4, pp. 563–575, Dec. 2017.
 [12] T. Wang, C. K. Wen, H. Wang, F. Gao, T. Jiang, and S. Jin, “Deep learning for wireless physical layer: Opportunities and challenges,” China Commun., vol. 14, no. 11, pp. 92–111, Nov. 2017.
 [13] F. Meng, P. Chen, L. Wu, and X. Wang, “Automatic modulation classification: A deep learning enabled approach,” IEEE Trans. Veh. Technol., vol. 67, no. 11, pp. 10 760–10 772, Nov. 2018.
 [14] H. Ye, G. Y. Li, and B. Juang, “Power of deep learning for channel estimation and signal detection in OFDM systems,” IEEE Wireless Commun. Lett., vol. 7, no. 1, pp. 114–117, Feb. 2018.
 [15] F. Meng, P. Chen, and L. Wu, “NNbased IDF demodulator in bandlimited communication system,” IET Commun., vol. 12, no. 2, pp. 198–204, Feb. 2018.
 [16] R. Sutton and A. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
 [17] L. R. Busoniu, R. Babuska, B. D. Schutter, and D. Ernst, Reinforcement Learning and Dynamic Programming Using Function Approximators. CRC Press, Inc., 2010.
 [18] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. The MIT Press, 2016.
 [19] Y. Li, “Deep reinforcement learning: An overview,” CoRR, vol. abs/1701.07274, 2017. [Online]. Available: http://arxiv.org/abs/1701.07274
 [20] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, and A. Bolton, “Mastering the game of Go without human knowledge,” Nature, vol. 550, no. 7676, pp. 354–359, Oct. 2017.
 [21] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, and G. Ostrovski, “Humanlevel control through deep reinforcement learning.” Nature, vol. 518, no. 7540, p. 529, Feb. 2015.
 [22] H. Sun, X. Chen, Q. Shi, M. Hong, X. Fu, and N. D. Sidiropoulos, “Learning to optimize: Training deep neural networks for interference management,” IEEE Trans. Signal Process., vol. 66, no. 20, pp. 5438–5453, Oct. 2018.
 [23] F. Liang, C. Shen, W. Yu, and F. Wu, “Towards optimal power control via ensembling deep neural networks,” CoRR, vol. abs/1807.10025, 2018.
 [24] M. Bennis and D. Niyato, “A Q-learning based approach to interference avoidance in self-organized femtocell networks,” in 2010 IEEE Globecom Workshops, 2010, pp. 706–710.
 [25] M. Simsek, A. Czylwik, A. Galindo-Serrano, and L. Giupponi, “Improved decentralized Q-learning algorithm for interference reduction in LTE-femtocells,” in 2011 Wireless Adv., 2011, pp. 138–143.
 [26] M. Simsek, M. Bennis, and I. Güvenç, “Learning based frequency- and time-domain inter-cell interference coordination in HetNets,” IEEE Trans. Veh. Technol., vol. 64, no. 10, pp. 4589–4602, Oct. 2015.
 [27] R. Amiri, H. Mehrpouyan, L. Fridman, R. K. Mallik, A. Nallanathan, and D. Matolak, “A machine learning approach for power allocation in hetnets considering QoS,” in 2018 IEEE Int. Conf. Commun (ICC), 2018, pp. 1–7.
 [28] E. Ghadimi, F. D. Calabrese, G. Peters, and P. Soldati, “A reinforcement learning approach to power control and rate adaptation in cellular networks,” in 2017 IEEE Int. Conf. Commun (ICC), May 2017, pp. 1–7.
 [29] Y. S. Nasir and D. Guo, “Deep reinforcement learning for distributed dynamic power allocation in wireless networks,” CoRR, vol. abs/1808.00490, 2018. [Online]. Available: http://arxiv.org/abs/1808.00490
 [30] H. Ye and G. Y. Li, “Deep reinforcement learning based distributed resource allocation for V2V broadcasting,” in 2018 14th Int. Wireless Commun. Mobile Comput. Conf. (IWCMC), 2018, pp. 440–445.
 [31] N. H. Viet, N. A. Vien, and T. Chung, “Policy gradient SMDP for resource allocation and routing in integrated services networks,” in 2008 IEEE Int. Conf. Netw., Sens., Control, 2008, pp. 1541–1546.
 [32] Y. Wei, F. R. Yu, M. Song, and Z. Han, “User scheduling and resource allocation in hetnets with hybrid energy supply: An actorcritic reinforcement learning approach,” IEEE Trans. Wireless Commun., vol. 17, no. 1, pp. 680–692, Jan. 2018.
 [33] F. Meng, P. Chen, and L. Wu, “Power allocation in multiuser cellular networks with deep Q learning approach,” CoRR, vol. abs/1812.02979, 2018. [Online]. Available: http://arxiv.org/abs/1812.02979
 [34] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” Comput. Sci., vol. 8, no. 6, p. A187, 2015.
 [35] P. Dent, G. E. Bottomley, and T. Croft, “Jakes fading model revisited,” Electron. Lett., vol. 29, no. 13, pp. 1162–1163, Jun. 1993.
 [36] F. D. Calabrese, L. Wang, E. Ghadimi, G. Peters, L. Hanzo, and P. Soldati, “Learning radio resource management in RANs: Framework, opportunities, and challenges,” IEEE Commun. Mag., vol. 56, no. 9, pp. 138–145, Sep. 2018.
 [37] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” Adv. Neural Inf. Process. Syst., pp. 1057–1063, 2000.
 [38] P. S. Thomas and E. Brunskill, “Policy gradient methods for reinforcement learning with function approximation and actiondependent baselines,” CoRR, vol. abs/1706.06643, 2017. [Online]. Available: http://arxiv.org/abs/1706.06643
 [39] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014. [Online]. Available: http://arxiv.org/abs/1412.6980