Deep Reinforcement Learning for Unmanned Aerial Vehicle-Assisted Vehicular Networks
Abstract
Unmanned aerial vehicles (UAVs) are envisioned to complement the 5G communication infrastructure in future smart cities. Hot spots easily appear at road intersections, where effective communication among vehicles is challenging. UAVs may serve as relays with the advantages of low price, easy deployment, line-of-sight links, and flexible mobility. In this paper, we study a UAV-assisted vehicular network where the UAV jointly adjusts its transmission control (power and channel) and 3D flight to maximize the total throughput. First, we formulate a Markov decision process (MDP) problem by modeling the mobility of the UAV/vehicles and the state transitions. Secondly, we solve the target problem using a deep reinforcement learning method, namely, the deep deterministic policy gradient (DDPG), and propose three solutions with different control objectives. Considering the energy consumption of 3D flight, we extend the proposed solutions to maximize the total throughput per energy unit by encouraging or discouraging the UAV’s mobility; to achieve this goal, we modify the DDPG framework. Thirdly, in a simplified model with small state and action spaces, we verify the optimality of the proposed algorithms. Compared with two baseline schemes, the proposed algorithms prove effective in a realistic model.
I Introduction
Intelligent transportation systems [1] [2] [3] [4] are a key component of smart cities; they employ real-time data communication for traffic monitoring, path planning, entertainment, and advertisement [5]. High-speed vehicular networks [6] emerge as a key component of intelligent transportation systems, providing cooperative communications to improve data transmission performance.
With the increasing number of vehicles, the current communication infrastructure may not satisfy data transmission requirements, especially when hot spots (e.g., road intersections) appear during rush hours. Unmanned aerial vehicles (UAVs) or drones [7] can complement the 4G/5G communication infrastructure, including vehicle-to-vehicle (V2V) communications and vehicle-to-infrastructure (V2I) communications. Qualcomm has received a certification of authorization allowing for UAV testing below 400 feet [8]; Huawei will cooperate with China Mobile to build the first cellular test network for regional logistics UAVs [9].
A UAV-assisted vehicular network, as shown in Fig. 1, has several advantages. First, the path loss is much lower since the UAV can move nearer to vehicles than stationary base stations can. Secondly, the UAV is flexible in adjusting the transmission control [10] based on the mobility of vehicles. Thirdly, the quality of UAV-to-vehicle links is generally better than that of terrestrial links [11], since they are mostly line-of-sight (LoS).
Maximizing the total throughput of UAV-to-vehicle links has several challenges. First, the communication channels vary with the UAV’s three-dimensional (3D) position. Secondly, the joint adjustment of the UAV’s 3D flight and transmission control (e.g., power control) cannot be solved directly using conventional optimization methods, especially when the environment is unknown. Thirdly, the channel conditions are hard to acquire, e.g., the path loss from the UAV to vehicles is closely related to the height/density of buildings and the street width.
In this paper, we propose deep reinforcement learning [12] [13] based algorithms to maximize the total throughput of UAV-to-vehicle communications; the algorithms jointly adjust the UAV’s 3D flight and transmission control by learning through interaction with the environment. The main contributions of this paper can be summarized as follows: 1) We formulate the problem as a Markov decision process (MDP) problem to maximize the total throughput under the constraints of total transmission power and total number of channels; 2) We apply a deep reinforcement learning method, the deep deterministic policy gradient (DDPG), to solve the problem. DDPG is suitable for MDP problems with continuous states and actions. We propose three solutions with different control objectives to jointly adjust the UAV’s 3D flight and transmission control. Then we extend the proposed solutions to maximize the total throughput per energy unit. To encourage or discourage the UAV’s mobility, we modify the reward function and the DDPG framework; 3) We verify the optimality of the proposed solutions using a simplified model with small state and action spaces. Finally, we provide extensive simulation results to demonstrate the effectiveness of the proposed solutions compared with two baseline schemes.
II Related Works
The dynamic control of UAV-assisted vehicular networks includes flight control and transmission control. Flight control mainly includes the planning of flight path, time, and direction. Yang et al. [14] proposed a joint genetic algorithm and ant colony optimization method to obtain the best UAV flight paths to collect sensory data in wireless sensor networks. To further minimize the UAVs’ travel duration under certain constraints (e.g., energy limitations, fairness, and collision), Garraffa et al. [15] proposed a two-dimensional (2D) path planning method based on a column generation approach. Liu et al. [16] proposed a deep reinforcement learning approach to control a group of UAVs by optimizing the flying directions and distances to achieve the best communication coverage in the long run with limited energy consumption.
The transmission control of UAVs mainly concerns resource allocation, e.g., access selection, transmission power, and bandwidth/channel allocation. Wang et al. [17] presented a power allocation strategy for UAVs considering communications, caching, and energy transfer. In a UAV-assisted communication network, Yan et al. [18] studied a UAV access selection and base station bandwidth allocation problem, where the interaction among UAVs and base stations was modeled as a Stackelberg game, and the uniqueness of a Nash equilibrium was obtained.
Joint control of both UAVs’ flight and transmission has also been considered. Wu et al. [19] considered maximizing the minimum achievable rates from a UAV to ground users by jointly optimizing the UAV’s 2D trajectory and power allocation. Zeng et al. [20] proposed a convex optimization method to optimize the UAV’s 2D trajectory to minimize its mission completion time while ensuring each ground terminal recovers the file with high probability when the UAV disseminates a common file to them. Zhang et al. [21] considered the UAV mission completion time minimization by optimizing its 2D trajectory with a constraint on the connectivity quality from base stations to the UAV. However, most existing research works neglected adjusting UAVs’ height to obtain better link quality by avoiding various obstructions or non-line-of-sight (NLoS) links.
Fan et al. [22] optimized the UAV’s 3D flight and transmission control together; however, the 3D position optimization was converted to a 2D position optimization by the LoS link requirement. The existing deep reinforcement learning based methods only handle UAVs’ 2D flight and simple transmission control decisions. For example, Challita et al. [23] proposed a deep reinforcement learning based method for a cellular UAV network by optimizing the 2D path and cell association to achieve a tradeoff between maximizing energy efficiency and minimizing both wireless latency and the interference on the path. A similar scheme is applied to provide intelligent traffic light control in [24].
In addition, most existing works assumed that the ground terminals are stationary, whereas in reality some ground terminals move with certain patterns, e.g., vehicles move under the control of traffic lights. This work studies a UAV-assisted vehicular network where the UAV’s 3D flight and transmission control can be jointly adjusted, considering the mobility of vehicles in a road intersection.
III System Models and Problem Formulation
In this section, we first describe the traffic model and communication model, and then formulate the target problem as a Markov decision process. The variables in the communication model are listed in Table I for easy reference.
III-A Traffic Model
We start with a one-way-two-flow road intersection, as shown in Fig. 2, while a much more complicated scenario in Fig. 7 will be described in Section V-B. Five blocks are numbered as 0, 1, 2, 3, and 4, where block 0 is the intersection. We assume that each block contains at most one vehicle, indicated by one binary variable per block. There are two traffic flows in Fig. 2,

“Flow 1”: ;

“Flow 2”: .
Table I: Variables in the communication model
Channel power gain and channel state from the UAV to a vehicle in a given block in a given time slot.
SINR from the UAV to a vehicle in a given block in a given time slot.
Horizontal distance and Euclidean distance between the UAV and a vehicle in a given block.
Total transmission power, total number of channels, and bandwidth of each channel.
Transmission power and number of channels allocated for the vehicle in a given block in a given time slot.
The traffic light has four configurations:

: red light for flow 1 and green light for flow 2;

: red light for flow 1 and yellow light for flow 2;

: green light for flow 1 and red light for flow 2;

: yellow light for flow 1 and red light for flow 2.
Time is partitioned into slots with equal duration. A green or red light lasts a fixed number of time slots, and a yellow light lasts one time slot, as shown in Fig. 3. We assume that each vehicle moves one block in a time slot if the traffic light is green.
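The slotted traffic light described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the configuration indices 0 to 3 (assigned in the order the four configurations are listed), the cyclic ordering, and the names `light_sequence`, `flow_has_green`, and `green_red_len` are all assumptions made for this example.

```python
def light_sequence(n_slots, green_red_len):
    """Return the traffic light configuration (0-3) for each time slot.

    Assumed cycle (hypothetical indices and ordering):
      0: red for flow 1, green for flow 2   (green_red_len slots)
      1: red for flow 1, yellow for flow 2  (1 slot)
      2: green for flow 1, red for flow 2   (green_red_len slots)
      3: yellow for flow 1, red for flow 2  (1 slot)
    """
    cycle = [0] * green_red_len + [1] + [2] * green_red_len + [3]
    return [cycle[t % len(cycle)] for t in range(n_slots)]


def flow_has_green(config, flow):
    """True if the given flow (1 or 2) sees a green light under `config`."""
    return (flow == 2 and config == 0) or (flow == 1 and config == 2)


# Twelve slots with a three-slot green/red phase.
seq = light_sequence(n_slots=12, green_red_len=3)
```

Under these assumptions, a vehicle in a flow advances one block in exactly those slots where `flow_has_green` is true for its flow.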
III-B Communication Model
We focus on the downlink communications (UAV-to-vehicle), since they are directly controlled by the UAV. There are two channel states for each UAV-to-vehicle link, line-of-sight (LoS) and non-line-of-sight (NLoS). The UAV’s position is described by its block (horizontal position) and its height, where the block is one of the five blocks in Fig. 2 and the height is discretized to multiple values. Next, we describe the communication model, including the channel power gain, the signal to interference and noise ratio (SINR), and the total throughput.
First, the channel power gain between the UAV and a vehicle in block in time slot is with a channel state . is formulated as [10] [25]
(1) 
where is the Euclidean distance between the UAV and the vehicle in block in time slot , is the path loss exponent, and is an additional attenuation factor caused by NLoS connections.
The probabilities of LoS and NLoS links between the UAV and a vehicle in block in time slot are [26]
(2)  
(3) 
where and are system parameters depending on the environment (height/density of buildings, and street width, etc.), and is the horizontal distance in time slot . The angle is measured in “degrees” with the range . Both and are discrete variables, therefore, is also a discrete variable.
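As an illustration of how an elevation-angle-based LoS probability behaves, the sketch below uses the widely used sigmoid form with two environment parameters. That (2) takes exactly this form is an assumption, and the function names and the parameter values `a = 9.61`, `b = 0.16` are illustrative choices only, not values taken from this paper.

```python
import math


def elevation_angle_deg(height, horizontal_dist):
    """Elevation angle in degrees, in the range [0, 90]."""
    if horizontal_dist == 0:
        return 90.0
    return math.degrees(math.atan(height / horizontal_dist))


def p_los(height, horizontal_dist, a, b):
    """LoS probability under the assumed sigmoid model.

    `a` and `b` are environment-dependent parameters (assumed names).
    """
    theta = elevation_angle_deg(height, horizontal_dist)
    return 1.0 / (1.0 + a * math.exp(-b * (theta - a)))


def p_nlos(height, horizontal_dist, a, b):
    """NLoS probability is the complement of the LoS probability."""
    return 1.0 - p_los(height, horizontal_dist, a, b)


# Illustrative parameters (assumption): a higher, closer UAV sees a vehicle
# under a larger elevation angle and therefore with a higher LoS probability.
A, B = 9.61, 0.16
low_angle = p_los(height=10.0, horizontal_dist=100.0, a=A, b=B)
high_angle = p_los(height=100.0, horizontal_dist=10.0, a=A, b=B)
```

Because the UAV's block and height are both discrete, the angle and hence the probability take only finitely many values, which is what makes the channel state transition tractable in the MDP.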
Secondly, the SINR in time slot from the UAV to a vehicle in block is characterized as [27]
(4) 
where is the equal bandwidth of each channel, and are the allocated transmission power and number of channels for the vehicle in block in time slot , respectively, and is the additive white Gaussian noise (AWGN) power spectrum density, and is formulated by (1). We assume that the UAV employs orthogonal frequency division multiple access (OFDMA) [28]; therefore, there is no interference among these channels.
Thirdly, the total throughput (reward) of UAV-to-vehicle links is formulated as [29]
(5) 
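The chain from channel gain to SINR to total throughput can be sketched as follows. The exact forms of (1), (4), and (5) are not reproduced here; the sketch assumes a power-law path loss with an extra multiplicative factor for NLoS links, a noise power proportional to the allocated bandwidth, and a Shannon-type rate. All function and parameter names are hypothetical.

```python
import math


def channel_gain(dist, is_los, path_loss_exp, nlos_attenuation):
    """Power-law gain; NLoS links get an additional attenuation factor."""
    gain = dist ** (-path_loss_exp)
    return gain if is_los else nlos_attenuation * gain


def sinr(power, n_channels, gain, bandwidth, noise_psd):
    """Received power over noise power in the allocated bandwidth.

    With OFDMA there is no inter-channel interference, so only noise
    appears in the denominator. Zero channels means no link.
    """
    if n_channels == 0:
        return 0.0
    return power * gain / (bandwidth * n_channels * noise_psd)


def total_throughput(links, bandwidth, noise_psd):
    """Sum of Shannon rates over all links.

    `links` is a list of (power, n_channels, gain) tuples, one per block.
    """
    total = 0.0
    for power, n_channels, gain in links:
        s = sinr(power, n_channels, gain, bandwidth, noise_psd)
        total += bandwidth * n_channels * math.log2(1.0 + s)
    return total
```

For example, a single link with unit power, one channel, unit gain, unit bandwidth, and unit noise density has an SINR of 1 and therefore a rate of one bit per second per hertz.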
III-C MDP Formulation
The UAV aims to maximize the total throughput with the constraints of total transmission power and total channels:
where is the total transmission power, is the total number of channels, is the maximum power allocated to a vehicle, is the maximum number of channels allocated to a vehicle, is a discrete variable, and is a nonnegative integer variable.
The UAV-assisted communication is modeled as a Markov decision process (MDP). On one hand, from (2) and (3), we know that the channel state of UAV-to-vehicle links follows a stochastic process. On the other hand, the arrival of vehicles follows a stochastic process under the control of the traffic light, e.g., (12) and (13).
Under the MDP framework, the state space , action space , reward , policy , and state transition probability of our problem are defined as follows.

State , where is the traffic light state, is the UAV’s 3D position with being the block and being the height, and is the channel state from the UAV to each block with . Let , where and are the UAV’s minimum and maximum height, respectively. The block is the location projected from UAV’s 3D position to the road.

Action denotes the action set. is the UAV’s 3D flight, where denotes the horizontal flight and denotes the vertical flight. We see that in Fig. 4. We assume
(6) which means that the UAV can fly downward 5 meters, fly horizontally, or fly upward 5 meters in a time slot. The UAV’s height changes as
(7) and are the transmission power and channel allocation actions for those five blocks, respectively. At the end of time slot , the UAV moves to a new 3D position according to action , and over time slot , the transmission power and number of channels are and , respectively.

Reward is the total throughput after a transition from state to taking action . Note that the total throughput over the th time slot is measured at the state .

Policy is the strategy for the UAV, which maps states to a probability distribution over the actions , where denotes probability distribution. In time slot , the UAV’s state is , and its policy outputs the probability distribution over the action . We see that the policy indicates the action preference of the UAV.

State transition probability formulated in (8) is the probability of the UAV entering the new state , after taking the action at the current state . At the current state , after taking the 3D flight and transmission control , the UAV moves to the new 3D position , and the channel state changes to , with the traffic light changes to and the number of vehicles in each block changes to .
The state transitions of the traffic light over time are shown in Fig. 3. The transition of the channel state for UAV-to-vehicle links is a stochastic process, which is reflected by (2) and (3).
Next, we discuss the MDP in three aspects: the state transition probability, the state transitions of the number of vehicles in each block, and the UAV’s 3D position. Note that the transmission power control and channel control do not affect the traffic light, the channel state, the number of vehicles, and the UAV’s 3D position.
First, we discuss the state transition probability. The UAV’s 3D flight only affects the UAV’s 3D position state and the channel state; the traffic light state of the next time slot relies on the current traffic light state; and the number of vehicles in each block in the next time slot relies on the current number of vehicles and the traffic light state. Therefore, the state transition probability is
(8) 
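where the position transition probability is easily obtained from the 3D position state transition based on the UAV’s flight actions in Fig. 4, the channel state transition probability is easily obtained by (2) and (3), the traffic light transition probability is obtained from the traffic light state transition in Fig. 3, and the vehicle number transition probability is easily obtained by (9)-(13).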
Secondly, we discuss the state transitions of the number of vehicles in each block. It is a stochastic process. The UAV’s states and actions do not affect the number of vehicles of all blocks. Let and be the probabilities of the arrivals of new vehicles in flow 1 and 2, respectively.
The state transitions for the number of vehicles in block 0, 3, and 4 are
(9) 
(10) 
(11) 
The transition probability is 1 in (9), (10), and (11) since the transitions are deterministic in blocks 0, 3, and 4. In contrast, the state transitions for the number of vehicles in blocks 1 and 2 are nondeterministic, and both are affected by the current number of vehicles and the traffic light. Taking block 1 under one traffic light state as an example, the probability for the number of vehicles is
(12)  
(13) 
When and , the probability for the number of vehicles will be obtained in a similar way.
Algorithm 1: Q-learning-based algorithm
Input: the number of episodes, the learning rate, and the exploration parameter of (15).
1: Initialize all states. Initialize the Q-value for all state-action pairs randomly.
2: for each episode
3: Observe the initial state.
4: for each time slot in the episode
5: Select the UAV’s action from the current state using (15).
6: Execute the UAV’s action, receive the reward, and observe a new state from the environment.
7: Update the Q-value function by (14).
Thirdly, we discuss the state transition of the UAV’s 3D position. It includes block transitions and height transitions. The UAV’s height transition is formulated in (7). If the UAV’s height is fixed, the corresponding position state transition diagram is shown in Fig. 4, where denotes the block of the UAV: denotes staying in the current block; denotes a flight from block 0 to the other blocks (1, 2, 3, and 4); denotes an anticlockwise flight; denotes a flight from block 1, 2, 3, or 4 to block 0; denotes a clockwise flight.
IV Proposed Solutions
In this section, we first present an overview of Q-learning and the deep deterministic policy gradient algorithm, then propose solutions with different control objectives, and finally present an extension of the solutions that takes into account the energy consumption of 3D flight.
IV-A Q-learning
The state transition probabilities of the MDP are unknown in our problem, since some of the underlying variables are unknown. Our problem cannot be solved directly using conventional MDP solutions, e.g., dynamic programming algorithms, policy iteration, and value iteration algorithms. Therefore, we apply the reinforcement learning (RL) approach. The return from a state is defined as the sum of discounted future rewards over the total number of time slots, where a discount factor diminishes the future reward and ensures that the sum of an infinite number of rewards is still finite. Let the Q-value represent the expected return after taking a given action in a given state under a given policy. The Bellman equation gives the optimality condition in conventional MDP solutions [30]:
Q-learning [31] is a classical model-free RL algorithm [32]. By balancing exploration and exploitation, Q-learning aims to maximize the expected return by interacting with the environment. The update of the Q-value is
(14) 
where is a learning rate.
Q-learning uses the ε-greedy strategy [33] to select an action, so that the agent behaves greedily most of the time, but selects randomly among all the actions with a small probability ε. The ε-greedy strategy is defined as follows
(15) 
The Q-learning algorithm [30] is shown in Alg. 1. Line 1 is initialization. In each episode, the inner loop is executed in lines 4-7. Line 5 selects an action using (15), and then the action is executed in line 6. Line 7 updates the Q-value.
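The two core steps of Alg. 1, action selection and the Q-value update, can be sketched in tabular form. This is a generic textbook sketch under the standard one-step Q-learning rule, assumed here to correspond to (14) and (15); the names `epsilon_greedy`, `q_update`, `alpha`, `gamma`, and `epsilon` are chosen for this example.

```python
import random
from collections import defaultdict


def epsilon_greedy(q, state, actions, epsilon, rng=random):
    """Explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])


def q_update(q, state, action, reward, next_state, actions, alpha, gamma):
    """One-step Q-learning update toward reward + discounted best next value."""
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# A toy table with two actions; unseen state-action pairs default to 0.
q = defaultdict(float)
actions = [0, 1]
q_update(q, "s0", 1, reward=1.0, next_state="s1",
         actions=actions, alpha=0.5, gamma=0.9)
greedy_choice = epsilon_greedy(q, "s0", actions, epsilon=0.0)
```

The table grows with every distinct state-action pair, which is one way to see why this approach does not scale to large or continuous spaces.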
Q-learning cannot solve our problem because of several limitations. 1) Q-learning can only solve MDP problems with small state and action spaces, whereas the state space and action space of our problem are very large. 2) Q-learning cannot handle continuous state or action spaces, whereas the UAV’s transmission power allocation actions are continuous in reality. If we discretize the transmission power allocation actions and use Q-learning to solve the problem, the result may be far from the optimum. 3) Q-learning converges slowly and uses too many computational resources [30], which is not practical in our problem. Therefore, we adopt the deep deterministic policy gradient algorithm to solve our problem.
IV-B Deep Deterministic Policy Gradient
The deep deterministic policy gradient (DDPG) method [34] uses deep neural networks to approximate both the action policy and the value function. This method has two advantages: 1) it uses neural networks as approximators, essentially compressing the state and action space to a much smaller latent parameter space, and 2) the gradient descent method can be used to update the network weights, which greatly speeds up the convergence and reduces the computational time. Therefore, memory and computational resources are largely saved. In practice, DDPG exploits the techniques introduced in AlphaGo Zero [35] and Atari game playing [36], including the experience replay buffer, the actor-critic approach, soft update, and exploration noise.
1) Experience replay buffer: the buffer stores transitions that will be used to update network parameters. At each time slot, a transition is stored in the buffer. After a certain number of time slots, each iteration samples a minibatch of transitions from the buffer to train the neural networks. The experience replay buffer has two advantages: 1) enabling the stochastic gradient descent method [37]; and 2) removing the correlations between consecutive transitions.
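A fixed-capacity replay buffer with uniform sampling can be sketched as below. The class name, its methods, and the choice to discard the oldest transition when full are assumptions for this illustration; the paper does not specify the buffer's eviction rule.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen silently drops the oldest item when full.
        self._buffer = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def store(self, state, action, reward, next_state):
        self._buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Uniformly sample a minibatch without replacement.

        Requires batch_size <= len(self).
        """
        return self._rng.sample(list(self._buffer), batch_size)

    def __len__(self):
        return len(self._buffer)


# Store five transitions in a buffer of capacity three, then sample two.
buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.store(state=t, action=0, reward=float(t), next_state=t + 1)
batch = buffer.sample(batch_size=2)
```

Sampling uniformly from stored transitions, rather than training on the most recent ones in order, is what breaks the correlation between consecutive samples.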
2) Actor-critic approach: the critic approximates the Q-value, and the actor approximates the action policy. The critic has two neural networks: the online Q-network and the target Q-network, each with its own parameters. The actor has two neural networks: the online policy network and the target policy network, each with its own parameters. The training of these four neural networks is discussed in the next subsection.
3) Soft update with a low learning rate is introduced to improve the stability of learning. The soft updates of the target Qnetwork and the target policy network are as follows
(16)  
(17) 
4) Exploration noise is added to the actor’s target policy to output a new action
(18) 
There is a tradeoff between exploration and exploitation, and the exploration is independent of the learning process. Adding exploration noise in (18) ensures that the UAV has a certain probability of exploring new actions besides the one predicted by the current policy, and prevents the UAV from being trapped in a local optimum.
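The soft update and the exploration noise can be sketched on plain lists of numbers. The sketch assumes the standard DDPG forms: the target parameters move a small fraction `tau` toward the online parameters, and zero-mean Gaussian noise is added to the policy output. Clipping the noisy action to a valid range is an extra assumption of this example, and all names are hypothetical.

```python
import random


def soft_update(target_params, online_params, tau):
    """Move each target parameter a fraction tau toward its online value."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_params, target_params)]


def noisy_action(policy_action, noise_std, low, high, rng=random):
    """Add Gaussian exploration noise, then clip to [low, high] (assumed)."""
    noisy = [a + rng.gauss(0.0, noise_std) for a in policy_action]
    return [min(max(a, low), high) for a in noisy]


# With tau = 0.1 the target moves 10% of the way toward the online weights.
new_target = soft_update(target_params=[0.0, 1.0],
                         online_params=[1.0, 1.0], tau=0.1)
explored = noisy_action([0.2, 0.9], noise_std=0.5, low=0.0, high=1.0,
                        rng=random.Random(0))
```

A small `tau` makes the target networks change slowly, which is what stabilizes the values the critic is trained to fit.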
Algorithm 2: Channel allocation in a time slot
Input: the power allocation, the number of vehicles in all blocks, the maximum number of channels allocated to a vehicle, and the total number of channels.
Output: the channel allocation for all blocks.
1: Initialize the remaining total number of channels.
2: Calculate the average allocated power for each vehicle in all blocks by (19).
3: Sort the average allocated powers in descending order, and obtain a sequence of block indices.
4: for each block in the sorted sequence
5: Assign the maximum number of channels to the current block, limited by the remaining total number of channels.
6: Update the remaining total number of channels.
7: Return the channel allocation.
Algorithm 3: DDPG-based algorithms: PowerControl, FlightControl, and JointControl
Input: the number of episodes, the number of time slots in an episode, the minibatch size, and the learning rate.
1: Initialize all states, including the traffic light state, the UAV’s 3D position, and the number of vehicles and the channel state in all blocks.
2: Randomly initialize the critic’s online Q-network parameters and the actor’s online policy network parameters, and initialize the critic’s target Q-network parameters and the actor’s target policy network parameters.
3: Allocate an experience replay buffer.
4: for each episode
5: Initialize a random process (a standard normal distribution) for the UAV’s action exploration.
6: Observe the initial state.
7: for each time slot in the episode
8: Select the DDPG-controlled action according to the policy and the exploration noise.
9: if PowerControl
10: Combine the channel allocation in Alg. 2 and the DDPG-controlled action as the UAV’s action at a fixed 3D position.
11: if FlightControl
12: Combine the equal transmission power, the equal channel allocation, and the DDPG-controlled action (3D flight) as the UAV’s action.
13: if JointControl
14: Combine the 3D flight action, the channel allocation in Alg. 2, and the DDPG-controlled action as the UAV’s action.
15: Execute the UAV’s action, receive the reward, and observe the new state from the environment.
16: Store the transition in the UAV’s experience replay buffer.
17: Sample the buffer to obtain a random minibatch of transitions.
18: The critic’s target Q-network calculates the target value by (20) and outputs it to the critic’s online Q-network.
19: Update the critic’s online Q-network to make its Q-value fit the target value by minimizing the loss function (21).
20: Update the actor’s online policy network based on the input from the critic’s online Q-network using the policy gradient by the chain rule (22).
21: Soft update the critic’s target Q-network and the actor’s target policy network by (16) and (17) to make the evaluation of the UAV’s actions and the UAV’s policy more stable.
IV-C Deep Reinforcement Learning-based Solutions
The UAV has two transmission controls, power and channel. We use the power allocation as the main control objective for two reasons. 1) Once the power allocation is determined, the channel allocation is easily obtained in OFDMA. According to Theorem 4 of [38], in OFDMA, if all links have equal weights, as in our reward function (5), the transmitter should send messages to the receiver with the strongest channel in each time slot. In our problem, the strongest channel is not known in advance since the channel state (LoS or NLoS) is a random process. DDPG tends to allocate more power to the channels that are strongest with high probability; therefore, the channel allocation is easily obtained from the power allocation actions. 2) Power allocation is continuous, and DDPG is suitable for handling such actions. However, if we use DDPG for the channel allocation, the number of action variables will be very large and the convergence will be very slow, since the channel allocation is discrete and the number of channels is generally large (e.g., 200). Considering the 3D flight, we assume that DDPG can control the power allocation, the 3D flight, or both. We then propose three algorithms:

PowerControl: the UAV adjusts the transmission power allocation using the actor network at a fixed 3D position, and the channels are allocated to vehicles by Alg. 2 in each time slot.

FlightControl: the UAV adjusts its 3D flight using the actor network, and the transmission power and channels are equally allocated to each vehicle in each time slot.

JointControl: the UAV adjusts its 3D flight and the transmission power allocation using the actor network, and the channels are allocated to vehicles by Alg. 2 in each time slot.
To allocate channels among blocks, we introduce a variable denoting the average allocated power of a vehicle in each block:
(19) 
The channel allocation algorithm is shown in Alg. 2, which is executed after obtaining the power allocation actions. Line 1 is the initialization. Lines 2-3 calculate and sort the average allocated powers. Line 5 assigns the maximum number of channels to the current possibly strongest channel, and line 6 updates the remaining total number of channels.
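The greedy allocation in Alg. 2 can be sketched as follows. The sketch assumes that the average allocated power is the block's power divided by its number of vehicles, that a block with several vehicles may receive up to the per-vehicle maximum for each of them, and that empty blocks receive no channels. These details and all names are assumptions of this example.

```python
def allocate_channels(power, n_vehicles, c_max, c_total):
    """Greedy channel allocation by descending average power per vehicle.

    power:      power allocated to each block
    n_vehicles: number of vehicles in each block
    c_max:      maximum number of channels per vehicle
    c_total:    total number of channels available
    """
    remaining = c_total
    # Average allocated power per vehicle; empty blocks count as zero.
    avg_power = [p / n if n > 0 else 0.0 for p, n in zip(power, n_vehicles)]
    order = sorted(range(len(power)), key=lambda i: avg_power[i], reverse=True)

    channels = [0] * len(power)
    for i in order:
        if n_vehicles[i] == 0:
            continue
        # Give the block its maximum, limited by what is left.
        channels[i] = min(c_max * n_vehicles[i], remaining)
        remaining -= channels[i]
    return channels


# Five blocks, three of them occupied; ten channels, at most four per vehicle.
allocation = allocate_channels(power=[3.0, 1.0, 2.0, 0.0, 0.0],
                               n_vehicles=[1, 1, 1, 0, 0],
                               c_max=4, c_total=10)
```

In this example the two blocks with the highest average power receive four channels each, and the third occupied block receives the remaining two.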
The DDPG-based algorithms are given in Alg. 3. The algorithm has two parts: the initializations and the main process. First, we describe the initializations in lines 1-3. In line 1, all states are initialized: the traffic light is initialized as 0, the number of vehicles in all blocks is 0, the UAV’s block and height are randomized, and the channel state for each block is set as LoS or NLoS with the same probability. Note that the action space that DDPG controls differs among PowerControl, FlightControl, and JointControl. Line 2 initializes the parameters of the critic and actor. Line 3 allocates an experience replay buffer.
Secondly, we present the main process. Line 5 initializes a random process for action exploration. Line 6 receives an initial state. We distinguish the action that DDPG controls from the UAV’s complete action. Line 8 selects the DDPG-controlled action according to the policy and an exploration noise. Lines 9-10 combine the channel allocation actions in Alg. 2 and the DDPG-controlled action as the complete action at a fixed 3D position in PowerControl. Lines 11-12 combine the equal transmission power, the equal channel allocation actions, and the DDPG-controlled action (3D flight) as the complete action in FlightControl. Lines 13-14 combine the 3D flight action, the channel allocation actions in Alg. 2, and the DDPG-controlled action as the complete action in JointControl. Line 15 executes the UAV’s action, and then the UAV receives a reward and all states are updated. Line 16 stores a transition into the experience replay buffer. In line 17, a random minibatch of transitions is sampled from the buffer. Line 18 sets the target value for the critic’s online Q-network. Lines 19-21 update all network parameters.
The DDPG-based algorithms in Alg. 3 are in essence an approximation of the Q-learning method in Alg. 1. The exploration noise in line 8 approximates the second case of (15) in Q-learning. Lines 18-19 in Alg. 3 make the Q-value updated in line 7 of Alg. 1 converge. Line 20 of Alg. 3 approximates the first case of (15) in Q-learning, since both aim to obtain the policy with the maximum Q-value. The soft update in line 21 of Alg. 3 corresponds to (14) in Q-learning, where the coefficients in both updates are learning rates. Next, we discuss the training and test stages of the proposed solutions.
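1) In the training stage, we train the actor and the critic, and store the parameters of their neural networks. Fig. 5 illustrates the data flow and parameter update process. The training stage has two parts. First, the online Q-network and the online policy network are trained through a random minibatch of transitions sampled from the experience replay buffer. Secondly, the target Q-network and the target policy network are trained through soft update.
The training process is as follows. A minibatch of transitions is sampled from the experience replay buffer, and three data flows leave the buffer. The target policy network outputs its action to the target Q-network to calculate the target value. The online policy network outputs its action to the online Q-network. Then the online Q-network calculates the gradient of the Q-value with respect to the action and outputs it to the online policy network, which updates its parameters by (22). Then two soft updates are executed for the target Q-network and the target policy network in (16) and (17), respectively.
The data flows of the critic’s target Q-network and online Q-network are as follows. The target Q-network takes the sampled next state and the action produced by the target policy network as the input and outputs the target value to the online Q-network, where the target value is calculated by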
(20) 
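The online Q-network takes the sampled state and action as the input and outputs the gradient of the Q-value with respect to the action to the online policy network for updating parameters in (22).
The data flows of the actor’s online policy network and target policy network are as follows. The online policy network first takes the sampled state as input and outputs its action to the online Q-network. After the online Q-network returns the gradient, the online policy network updates its parameters by (22). The target policy network takes the sampled next state as the input and outputs its action to the target Q-network for calculating the target value in (20).
The parameters of the four neural networks are updated as follows. The online Q-network updates its parameters by minimizing the norm loss function to make its Q-value fit the target value: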
(21) 
The target Q-network updates its parameters by (16). The online policy network updates its parameters following the chain rule with respect to its own parameters:
(22) 
The target policy network updates its parameters by (17).
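The critic's side of this update can be sketched with plain Python callables standing in for the four networks. The sketch assumes that the target value in (20) has the standard DDPG form (reward plus the discounted target Q-value of the next state under the target policy) and that the loss in (21) is a mean squared error over the minibatch. The function names and the toy networks are hypothetical.

```python
def critic_targets(batch, target_actor, target_critic, gamma):
    """Target value per transition: r + gamma * Q'(s', mu'(s'))."""
    return [r + gamma * target_critic(s_next, target_actor(s_next))
            for (_, _, r, s_next) in batch]


def critic_loss(batch, targets, online_critic):
    """Mean squared error between the targets and the online Q-values."""
    errors = [(y - online_critic(s, a)) ** 2
              for (s, a, _, _), y in zip(batch, targets)]
    return sum(errors) / len(errors)


# Toy stand-ins for the networks (assumptions for illustration only).
target_actor = lambda s: 1.0         # always proposes action 1.0
target_critic = lambda s, a: s + a   # toy Q-value
online_critic = lambda s, a: 2.0     # constant toy Q-value

# One transition: (state, action, reward, next_state).
batch = [(0.0, 0.0, 1.0, 2.0)]
targets = critic_targets(batch, target_actor, target_critic, gamma=0.5)
loss = critic_loss(batch, targets, online_critic)
```

Computing the targets from the slowly changing target networks, rather than from the online networks being trained, keeps the regression target from shifting at every gradient step.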
In each time slot, the current state from the environment is delivered to the actor’s policy network, which calculates the UAV’s target policy. Finally, an exploration noise is added to the policy output to get the UAV’s action in (18).
2) In the test stage, we restore the actor’s target policy network based on the stored parameters. This way, there is no need to store transitions in the experience replay buffer. Given the current state, we use the restored network to obtain the UAV’s optimal action. Note that no noise is added to the action, since all neural networks have been trained and the UAV obtains the optimal action from the restored network. Finally, the UAV executes the action.
IV-D Extension on Energy Consumption of 3D Flight
The UAV’s energy is used in two parts, communication and 3D flight. The solutions proposed above in Alg. 3 do not consider the energy consumption of 3D flight. In this subsection, we discuss how to incorporate it into Alg. 3. To encourage or discourage the UAV’s 3D flight actions in different directions, which consume different amounts of energy, we modify the reward function and the DDPG framework.
The UAV aims to maximize the total throughput per energy unit since the UAV’s battery has limited capacity. For example, the UAV DJI Mavic Air [39] with full energy can only fly 21 minutes. Given that the UAV’s energy consumption of 3D flight is much larger than that of communication, we only use the former part as the total energy consumption. Thus, the reward function (5) is modified as follows
(23) 
where the energy term is the energy consumption of taking the flight action in the time slot. Our energy consumption setup follows the UAV DJI Mavic Air [39]. The UAV has three vertical flight actions per time slot, as in (6). If the UAV keeps moving downward, horizontally, or upward until the energy for 3D flight is used up, the flight time is assumed to be 27, 21, and 17 minutes, respectively. If the duration of a time slot is set to 6 seconds, the UAV can fly for 270, 210, and 170 time slots, respectively. Therefore, the energy consumption is given by
(24) 
where is the total energy if the UAV’s battery is full.
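The arithmetic behind the per-slot energy can be sketched in Python. This is a minimal illustration, assuming that the full battery is spent evenly over the number of time slots each flight direction can sustain, and that the modified reward is the throughput divided by the energy of the chosen action; the function names and the normalized `full_energy=1.0` are hypothetical.

```python
SLOT_SECONDS = 6
# Flight time (minutes) on a full battery for each vertical flight action.
FLIGHT_MINUTES = {"down": 27, "horizontal": 21, "up": 17}

def slots_until_empty(minutes, slot_seconds=SLOT_SECONDS):
    """Number of time slots the UAV can fly before the battery is empty."""
    return minutes * 60 // slot_seconds

def energy_per_slot(action, full_energy=1.0):
    """Energy spent in one time slot (assumption: even spending per slot)."""
    return full_energy / slots_until_empty(FLIGHT_MINUTES[action])

def throughput_per_energy(throughput, action, full_energy=1.0):
    """Assumed form of the modified reward: throughput per energy unit."""
    return throughput / energy_per_slot(action, full_energy)

slots = {a: slots_until_empty(m) for a, m in FLIGHT_MINUTES.items()}
```

Under these assumptions, moving upward costs the most energy per slot, so the same throughput yields a smaller reward when it is obtained while climbing.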
Let be a prediction error as follows
(25) 
where evaluates the difference between the actual reward and the expected return . To make the UAV learn from the prediction error , rather than from the difference between the new Q-value and the old Q-value in (14), the Q-value is updated by the following rule
(26) 
where is a learning rate similar to (14).
We introduce and to represent the learning rate when and , respectively. Therefore, the UAV can choose to be active or inactive by properly setting the values of and . The update of the Q-value in Q-learning is modified as follows, inspired by [40]
(27) 
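The idea of a sign-dependent learning rate can be sketched for a tabular Q-function. This is a minimal illustration, assuming the prediction error is the actual reward minus the current Q-value and that a non-negative error selects the first learning rate; the names `alpha_pos` and `alpha_neg` and the toy table are hypothetical.

```python
def asymmetric_q_update(q, state, action, reward, alpha_pos, alpha_neg):
    """Update q[state][action] in place with a sign-dependent learning rate."""
    # Assumption: prediction error = actual reward - expected return.
    delta = reward - q[state][action]
    # Assumption: delta >= 0 uses alpha_pos, delta < 0 uses alpha_neg.
    alpha = alpha_pos if delta >= 0 else alpha_neg
    q[state][action] += alpha * delta
    return delta

# Toy Q-table with one state and two hypothetical actions.
q_table = {"s0": {"move": 1.0, "stay": 1.0}}
d_up = asymmetric_q_update(q_table, "s0", "move", 2.0, alpha_pos=0.5, alpha_neg=0.25)
d_down = asymmetric_q_update(q_table, "s0", "stay", 0.0, alpha_pos=0.5, alpha_neg=0.25)
```

With a larger rate for non-negative errors, better-than-expected outcomes move the Q-value more than worse-than-expected ones, which is how the two rates can bias the learned behavior.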
We define the prediction error as the difference between the actual reward and the output of the critic’s online Q-network :
(28) 
We use and to denote the weights when and , respectively. The update of the critic’s target Q-network is
(29) 
The update of the actor’s target policy network is
(30) 
If , the UAV is active and prefers to move. If , the UAV is inactive and prefers to stay. If , the UAV is neither active nor inactive. To approximate the Q-value, we introduce similar to (20) and then make the critic’s online Q-network fit it. We optimize the loss function
(31) 
where .
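The sign-dependent update of the target networks can be sketched as a soft update whose weight depends on the prediction error. This is a minimal illustration, assuming the usual DDPG soft update (a convex combination of online and target parameters) and a scalar prediction error, for example averaged over a mini-batch; the names `tau_pos` and `tau_neg` and the flat parameter lists are hypothetical.

```python
def soft_update(target, online, tau):
    """Standard soft update: target <- tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * wt for w, wt in zip(online, target)]

def asymmetric_soft_update(target, online, delta, tau_pos, tau_neg):
    """Pick the update weight from the sign of the prediction error delta."""
    # Assumption: delta >= 0 selects tau_pos, delta < 0 selects tau_neg.
    tau = tau_pos if delta >= 0 else tau_neg
    return soft_update(target, online, tau)

online_params = [1.0, 2.0]
target_params = [0.0, 0.0]
after_pos = asymmetric_soft_update(target_params, online_params, 1.0, 0.5, 0.25)
after_neg = asymmetric_soft_update(target_params, online_params, -1.0, 0.5, 0.25)
```

In this sketch the same rule would apply to both the critic's target Q-network and the actor's target policy network, each with its own parameter list.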
We modify the MDP, the DDPG framework, and the DDPG-based algorithms to account for the energy consumption of 3D flight:

The MDP is modified as follows. The state space , where is the energy in the UAV’s battery. The energy changes as follows
(32) 
The other parts of the MDP formulation and the state transitions are the same as in Section III-C.
V Performance Evaluation
For the one-way-two-flow road intersection in Fig. 2, we first verify the optimality of the deep reinforcement learning algorithms. Then, we study a more realistic road intersection, shown in Fig. 7, and present our simulation results.
Our simulations are executed on a server with Linux OS, 200 GB memory, two Intel(R) Xeon(R) Gold 5118 CPUs @ 2.30 GHz, a Tesla V100-PCIE GPU, and four RTX 2080 Ti GPUs.
The implementation of Alg. 3 includes two parts: building the environment (including traffic and communication models) for our scenarios, and using the DDPG algorithm in TensorFlow [41].
V-A Optimality Verification of Deep Reinforcement Learning
The parameter settings are summarized in Table II. In the simulations, there are three types of parameters: DDPG algorithm parameters, communication parameters, and UAV/vehicle parameters.
First, we describe the DDPG algorithm parameters. The number of episodes is 256, and the number of time slots in an episode is 256, so the number of total time slots is 65,536. The experience replay buffer capacity is 10,000, and the learning rate of target networks is 0.001. The minibatch size is .
Secondly, we describe the communication parameters. and are set to 9.6 and 0.28, which are common values in urban areas [42]. is 3, and is 0.01, which are widely used in path loss modeling. The duration of a time slot is set to 6 seconds, and a red or green traffic light lasts 10 time slots, i.e., 60 seconds constitute a red/green duration. These values are common in cities and ensure that the vehicles in a block can reach the next block within a time slot. The white noise power spectral density is set to -130 dBm/Hz. The total UAV transmission power is set to W in consideration of the limited communication ability. The total number of channels is 10, and the bandwidth of each channel is 100 kHz; therefore, the total bandwidth of all channels is 1 MHz. The maximum power allocated to a vehicle is 3 W, and the maximum number of channels allocated to a vehicle is 5. We assume the power control for each vehicle has 4 discrete values (0, 1, 2, 3).
Thirdly, we describe the UAV/vehicle parameters. is set to 0.1 to 0.7. The total number of channels is 200. The length of a road block is set to 3 meters. The distances between blocks are easily calculated as follows: , and , where is the Euclidean distance from block to block . We assume that the arrival of vehicles in blocks 1 and 2 follows a binomial distribution with the same parameter in the range . The discount factor is 0.9.
The assumptions of the simplified scenario in Fig. 2 are as follows. To keep the state space small for verification purposes, we assume that the channel states of all communication links are LoS and that the UAV’s height is fixed at 150 meters, so that the UAV can only adjust its horizontal flight control and transmission control. The traffic light state is assumed to have two values (red or green).
The configuration of the neural networks in the proposed solutions is based on the configuration of the DDPG action space. A neural network consists of an input layer, fully-connected layers, and an output layer. The number of fully-connected layers in the actor is set to 8.
Table II: Parameter settings
9.6 | 0.28 | 3 | 0.01 | -130 dBm/Hz | 3
1 to 6 | 100 to 200 | 10 | 10 | 200
512 | 0.1 to 0.7 | 0.4 | 0.3 | 0.3 | 5 kHz
0.001 | 0.9 W | 50
Theoretically, it is well known that deep reinforcement learning algorithms (including DDPG algorithms) solve MDP problems and achieve the optimal results with much less memory and computational resources. We verify the optimality of the DDPG-based algorithms in Alg. 3 for the one-way-two-flow road intersection in Fig. 2. The reasons are as follows: (i) the MDP problem in such a simplified scenario is explicitly defined, and the theoretically optimal policy can be obtained using the Python MDP Toolbox [43]; and (ii) this optimality verification also serves as a good code debugging process before we apply the DDPG algorithm in TensorFlow [41] to the more realistic road intersection scenario in Fig. 7.
The result of the DDPG-based algorithms matches that of the policy iteration algorithm using the Python MDP Toolbox [43] (serving as the optimal policy). The total throughput obtained by the policy iteration algorithm and by the DDPG-based algorithms is shown as dashed lines and solid lines in Fig. 6, respectively. Therefore, the DDPG-based algorithms achieve near-optimal policies. We see that the total throughput of JointControl is the largest, much higher than that of PowerControl and FlightControl. This is consistent with our belief that the joint control of power and flight outperforms the control of either one alone. The performance of PowerControl is better than that of FlightControl. The throughput increases with the vehicle arrival probability in all algorithms, and it saturates when due to traffic congestion.
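To make the reference solution concrete, policy iteration can be sketched on a toy MDP. This is a self-contained pure-Python illustration, not the Python MDP Toolbox and not the intersection MDP of Fig. 2; the two-state MDP, the array layout `P[a][s][s2]` and `R[s][a]`, and all names are hypothetical.

```python
def policy_iteration(P, R, gamma, tol=1e-10):
    """P[a][s][s2]: transition probability; R[s][a]: expected reward."""
    n_actions, n_states = len(P), len(R)

    def q_value(s, a, V):
        return R[s][a] + gamma * sum(P[a][s][s2] * V[s2] for s2 in range(n_states))

    policy = [0] * n_states
    while True:
        # Policy evaluation: sweep until the value function stops changing.
        V = [0.0] * n_states
        while True:
            diff = 0.0
            for s in range(n_states):
                v = q_value(s, policy[s], V)
                diff = max(diff, abs(v - V[s]))
                V[s] = v
            if diff < tol:
                break
        # Policy improvement: switch to a strictly better action if one exists.
        stable = True
        for s in range(n_states):
            best = max(range(n_actions), key=lambda a: q_value(s, a, V))
            if q_value(s, best, V) > q_value(s, policy[s], V) + 1e-9:
                policy[s], stable = best, False
        if stable:
            return policy, V

# Toy MDP: action 0 keeps the state, action 1 switches it.
P_toy = [[[1.0, 0.0], [0.0, 1.0]],   # action 0: stay
         [[0.0, 1.0], [1.0, 0.0]]]   # action 1: switch
R_toy = [[0.0, 0.0],                 # state 0: no reward
         [1.0, 0.0]]                 # state 1: reward 1 for staying
opt_policy, opt_values = policy_iteration(P_toy, R_toy, gamma=0.9)
```

In this toy case the optimal policy switches out of state 0 and stays in state 1. A learned policy can be compared against such a reference by evaluating both on the same MDP.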
V-B More Realistic Traffic Model
We consider a more realistic road intersection model in Fig. 7. There are 33 blocks in total, with four entrances (blocks 26, 28, 30, and 32) and four exits (blocks 25, 27, 29, and 31). Vehicles in block go straight, turn left, or turn right with the probabilities , , and , such that . We assume vehicles can turn right when the traffic light is green.
Now, we describe the settings that differ from the last subsection. The discount factor is . The total UAV transmission power is set to W. The total number of channels is 100 to 200 and the bandwidth of each channel is 5 kHz; therefore, the total bandwidth of all channels is 0.5 to 1 MHz. The maximum power allocated to a vehicle is 0.9 W, and the maximum number of channels allocated to a vehicle is 50. The minimum and maximum heights of the UAV are 10 meters and 200 meters. The probabilities of a vehicle going straight, turning left, and turning right (, , and ) are set to 0.4, 0.3, and 0.3, respectively, and each of them is assumed to be the same in blocks 2, 4, 6, and 8. We assume that the arrival of vehicles in blocks 26, 28, 30, and 32 follows a binomial distribution with the same parameter in the range .
The UAV’s horizontal and vertical flight actions are as follows. We assume that the UAV’s block is 0 to 8, since the number of vehicles in the intersection block 0 is generally the largest and the UAV will not move to blocks far from the intersection block. Moreover, we assume that within a time slot the UAV can only stay or move to an adjacent block. The UAV’s vertical flight action is set by (6). In PowerControl, the UAV stays at block 0 at a height of 150 meters.
V-C Baseline Schemes
We compare with two baseline schemes. Equal allocation of transmission power and channels is common in communication systems for fairness; therefore, it is used in both baseline schemes.
The first baseline scheme is Cycle, i.e., the UAV cycles anticlockwise at a fixed height (e.g., 150 meters), and the UAV allocates the transmission power and channels equally to each vehicle in each time slot. The UAV moves along the fixed trajectory periodically, without considering the vehicle flows.
The second baseline scheme is Greedy, i.e., at a fixed height (e.g., 150 meters), the UAV greedily moves to the block with the largest number of vehicles. If a non-adjacent block has the largest number of vehicles, the UAV has to move to block 0 first and then move to that block. The UAV also allocates the transmission power and the channels equally to each vehicle in each time slot. The UAV thus tries to serve the block with the largest number of vehicles by moving nearer to them.
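The two baseline schemes can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the block adjacency and the cycle route are hypothetical (the real layout follows Fig. 7), channels are split by integer division, and the per-vehicle caps on power and channels are ignored.

```python
def equal_allocation(total_power, total_channels, num_vehicles):
    """Split power and channels equally among the vehicles."""
    if num_vehicles == 0:
        return 0.0, 0
    # Assumption: channels are indivisible, so leftovers stay unused.
    return total_power / num_vehicles, total_channels // num_vehicles

def cycle_next_block(current, route):
    """Cycle: follow a fixed periodic route, ignoring the vehicle flows."""
    return route[(route.index(current) + 1) % len(route)]

def greedy_next_block(current, vehicle_counts, adjacency):
    """Greedy: head for the block with the most vehicles, via block 0 if needed."""
    target = max(vehicle_counts, key=vehicle_counts.get)
    if target == current:
        return current
    if target in adjacency[current]:
        return target
    return 0  # non-adjacent target: pass through the intersection block 0

# Hypothetical layout: block 0 in the center, blocks 1 to 4 around it.
adjacency = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
counts = {0: 2, 1: 1, 2: 5, 3: 0, 4: 0}
first_hop = greedy_next_block(1, counts, adjacency)
second_hop = greedy_next_block(first_hop, counts, adjacency)
```

Starting from block 1, the Greedy UAV needs two time slots to reach the busiest block 2 in this layout, because it must pass through block 0.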
V-D Simulation Results
We first show the convergence of the loss functions, then show the total throughput vs. the discount factor, total transmission power, total number of channels, and vehicle arrival probability, and finally present the total throughput and the UAV’s flight time vs. the energy percent for 3D flight.
The convergence of the loss functions in the training stage for PowerControl, FlightControl, and JointControl indicates that the neural networks are well trained. It is shown in Fig. 8 when , , and during time slots 10,000 to 11,000. The first 10,000 time slots are not shown since, during time slots 0 to 10,000, the experience replay buffer has not reached its capacity. We see that the loss functions of the three algorithms converge after time slot 11,000. The other metrics in the paper are measured in the test stage by default.
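The role of the buffer capacity can be illustrated with a small replay buffer sketch. It assumes a fixed-capacity first-in-first-out buffer and that mini-batch sampling starts only once the buffer is full, which is one reading of why the first 10,000 time slots are omitted; the class and method names are hypothetical.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO buffer of transitions."""

    def __init__(self, capacity):
        # deque drops the oldest transition once maxlen is reached.
        self.buffer = deque(maxlen=capacity)

    def store(self, transition):
        self.buffer.append(transition)

    def is_full(self):
        return len(self.buffer) == self.buffer.maxlen

    def sample(self, batch_size, rng=random):
        """Draw a mini-batch uniformly at random without replacement."""
        return rng.sample(list(self.buffer), batch_size)

# Small capacity for illustration; the paper's setting is 10,000.
buf = ReplayBuffer(capacity=3)
full_flags = []
for t in range(5):
    buf.store(("state", "action", float(t), "next_state"))
    full_flags.append(buf.is_full())
batch = buf.sample(2)
```

Here the buffer becomes full at the third stored transition, after which each new transition replaces the oldest one.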
Total throughput vs. discount factor is drawn in Fig. 9 when , , and . We can see that, when changes, the throughput of the three algorithms is steady, and JointControl achieves higher total throughput than both PowerControl and FlightControl. PowerControl achieves higher throughput than FlightControl because PowerControl allocates power and channels to the strongest channels, whereas FlightControl only adjusts the UAV’s 3D position to enhance the strongest channel, and equal power and channel allocation is far from the best strategy in OFDMA.
Total throughput vs. total transmission power (