UAV Coverage Path Planning under Varying Power Constraints using Deep Reinforcement Learning
CPP is the task of designing a trajectory that enables a mobile agent to travel over every point of an area of interest. We propose a new method to control an \acUAV carrying a camera on a \acCPP mission with random start positions and multiple options for landing positions in an environment containing no-fly zones. While numerous approaches have been proposed to solve similar CPP problems, we leverage end-to-end \acRL to learn a control policy that generalizes over varying power constraints for the UAV. Despite recent improvements in battery technology, the maximum flying range of small UAVs is still a severe constraint, which is exacerbated by variations in the UAV’s power consumption that are hard to predict. By using map-like input channels to feed spatial information through convolutional network layers to the agent, we are able to train a \acDDQN to make control decisions for the \acUAV, balancing limited power budget and coverage goal. The proposed method can be applied to a wide variety of environments and harmonizes complex goal structures with system constraints.
CPP short = CPP , long = coverage path planning \DeclareAcronymUAV short = UAV , long = unmanned aerial vehicle \DeclareAcronymRL short = RL , long = reinforcement learning \DeclareAcronymDQN short = DQN , long = deep Q-network \DeclareAcronymDDQN short = DDQN , long = double deep Q-network \DeclareAcronymNN short = NN , short-plural = NNs , long = neural network , long-plural = neural networks \DeclareAcronymFoV short = FoV , long = field of view \DeclareAcronymTD short = TD , long = temporal difference
Whereas the \acCPP problem for ground-based robotics has already found its way into our everyday life in the form of vacuum cleaning robots , autonomous coverage with UAVs, while not yet having attained the same level of prominence, is being considered for a wide range of applications, such as photogrammetry, smart farming and especially disaster management . UAVs can be deployed rapidly to gather initial or continuous survey data of areas hit by natural disasters, or mitigate their consequences. In the aftermath of the 2019-20 Australian bushfire season, wildlife officers inventively used quadcopter drones with infrared sensors to conduct a search-and-rescue operation for koalas affected by the blaze .
As its name suggests, covering all points inside an area of interest with CPP is related to conventional path planning where the goal is to find a path between start and goal positions. In general, CPP aims to cover as much of the target area as possible within given energy or path-length constraints while avoiding obstacles or no-fly zones. Due to the limitations in battery energy density, available power limits mission duration for quadcopter UAVs severely. Finding a CPP control policy that generalizes over varying power constraints and setting a specific movement budget can be seen as a way to model the variations in actual power consumption during the mission, e.g. caused by environmental factors that are hard to predict. Similar to conventional path planning, CPP can usually be reduced to some form of the travelling salesman problem, which is NP-hard . Lawn-mowing and milling  are other examples of closely related problems.
The most recent survey of UAV coverage path planning is given by Cabreira et al. . Galceran and Carreras  provide a survey of general (ground robotics) approaches to CPP. Autonomous UAVs for applications in wireless communications have also sparked a lot of interest recently. Some scenarios, e.g. deep RL trajectory planning for UAVs providing wireless connectivity under battery power constraints, are related to CPP. An overview of UAV applications in wireless communications can be found in .
To guarantee complete coverage, most existing CPP approaches split the target area and surrounding free space into cells, by means of exact or approximate cellular decomposition. Choset and Pignon  proposed the classical boustrophedon (”the way of the ox”, back and forth motion) cellular decomposition, an exact decomposition method that guarantees full coverage but offers no bounds on path-length. This algorithm was extended by Mannadiar and Rekleitis  through encoding the cells of the boustrophedon decomposition as a Reeb graph and then constructing the Euler tour that covers every edge in the graph exactly once. Cases where the mobile agent does not have enough power to cover the whole area are not considered. The authors in  adapted this method for use in a non-holonomic, fixed-wing UAV and conducted extensive experimental validation. Two other approaches combining CPP and the travelling salesman problem to find near-optimal solutions for coverage of target regions enclosed by non-target areas are proposed by the authors in : grid-based and dynamic programming-based, respectively. Both approaches suffer from exponential increase in time complexity with the number of target regions and do not consider obstacles or UAV power limitations.
Non-standard approaches have made use of neural networks (NNs) before. The authors in  design a network of neurons with only lateral connections that each represent one grid cell in a cleaning robot’s non-stationary 2D environment. The path planning is directly based on the neural network’s activity landscape, which is computationally simple and can support changing environments, but does not take path-length or power constraints into account.
Reinforcement learning with deep neural networks has only recently started to be considered for UAV path planning. Maciel-Pearson et al.  proposed a method using an extended double deep Q-network (EDDQN) to explore and navigate from a start to a goal position in outdoor environments by combining map and camera information from the drone. Their approach is focused on obstacle avoidance under changing weather conditions. The authors in  investigate the CPP-related drone patrolling problem where a UAV patrols an area optimizing the relevance of its observations through the use of a single-channel relevance map fed into a convolutional layer of a DDQN agent. However, there is no consideration for power constraints and the relevance map is preprocessed showing only local information. To the best of our knowledge, deep RL has not been considered for UAV control in coverage path planning under power constraints before.
The main contributions of this paper are the following:
Introduction of a novel UAV control method for coverage path planning based on double deep Q-learning;
The usage of map-like channels to feed spatial information into convolutional network layers of the agent;
Learning a control policy that generalizes over random start positions and varying power constraints and decides between multiple landing positions.
Ii Problem Formulation
The sensors of the UAV forming the input of the reinforcement learning agent are depicted in Figure 1: camera and GPS receiver. The camera gives a periodic frame of the current coverage view and the GPS yields the drone’s position. Power constraints determined by external factors are modelled as a movement budget for the drone that is fixed at mission start. Two additional software components are running on the UAV. The first is the mission algorithm which is responsible for the analysis of the camera data. We assume that any mission algorithm can give feedback on the area that was already covered. The second component is a safety controller that evaluates the proposed action of the agent and accepts or rejects it based on the safety constraints (entering into no-fly zones or landing in unsuitable areas). Note that the safety controller does not assist the agent in finding the landing area. The last component is a map which is provided by the operator on the ground. While this map could be dynamic throughout the mission, we focus on static maps for the duration of one mission in this paper.
Ii-B 3-Channel Map
The coverage problem to be solved can be represented by a two dimensional grid map with three channels. Each cell in the grid represents a square area of the coverage region. The three channels describe starting and landing zones, target zones, and no-fly zones. The start and landing zones are areas the agent can start from and land on after finishing a coverage path. Target zones have to be covered at least once by the field of view (FoV) of the UAV’s camera. No-fly zones represent areas which the drone is prohibited from entering. Note that it is possible that a cell is declared as none, or more than one of these zones, with the exception that starting and landing zones can not be no-fly zones at the same time.
Ii-C Markov Decision Process
In order to solve the described coverage path planning problem with reinforcement learning, it is converted into a Markov decision process (MDP). An MDP is described by the tuple , with the set of possible states , the set of possible actions , the reward function , and the deterministic state transition function .
In a grid, the state space has the following dimensions:
where is the Boolean domain . The action space contains the following five actions:
The reward function , mapping the current state and current action to a real-valued reward, consists of multiple components:
(positive) coverage reward for each target cell that is covered by the UAV’s field of view for the first time;
(negative) safety penalty in case the safety controller (SC) rejects the agent’s proposed action;
(negative) constant movement penalty that is applied for every unit of the movement budget the UAV uses
(negative) penalty in case the UAV runs out of movement budget without having safely landed in a landing zone.
Reinforcement learning, in general, proceeds in a cycle of interactions between an agent and its environment. At time , the agent observes a state , performs an action and subsequently receives a reward . The time index is then incremented and the environment propagates the agent to a new state , from where the cycle restarts. The goal of the agent is to maximize the discounted cumulative return from the current state up to a terminal state at time . It is given as
with being the discount factor, balancing the importance of immediate and future rewards. The return is maximized by adapting the agent’s behavioral policy . The policy can be deterministic with such that , or probabilistic with such that , yielding a probability distribution over the action space for each .
To find a policy which maximizes the return, we utilize Q-learning, a model-free reinforcement learning approach. It is based on learning the state-action-value function, or Q-function , defined as
Q-learning relies on iteratively updating the current knowledge of the Q-function. When the optimal Q-function is known, it is easy to construct an optimal policy by taking actions that maximize the Q-function. For convenience, and are abbreviated to and and and to and in the following.
Iii-B Deep Q-Learning
The Q-function from (2) can be represented through a table of Q-values with the dimension . This is not feasible for large state or action spaces, but it is possible to approximate the Q-function by a neural network in those cases. A deep Q-network (DQN) parameterizing the Q-function with the parameter vector is trained to minimize the expected temporal difference (TD) error given by
with the target value
While a DQN is significantly more data efficient compared to a Q-table due to its ability to generalize, the deadly triad  of function approximation, bootstrapping and off-policy training can make its training unstable and cause divergence. In 2015, Mnih et al.  presented a methodology to stabilize the DQN learning process. Their training approach makes use of an experience replay memory which stores experience tuples collected by the agent during each interaction with the environment. Training the agent on uniformly sampled batches from the replay memory decorrelates the individual samples and rephrases the TD-error as
Additionally, Mnih et al. used a separate target network for the estimation of the next maximum Q-value changing the target value to
with representing the parameters of the target network. The parameters of the target network can either be updated as a periodic hard copy of or as a soft update with
after each update of . is the update factor determining the adaptation pace. The combination of replay memory and target network separation to stabilize the training process laid the groundwork for the rise in popularity of DQN methods.
An additional improvement was proposed by Van Hasselt et al. , who showed that under certain conditions, action values in (6) get overestimated. To solve this issue, the double deep Q-network (DDQN) was introduced. The target value is then given by
and the corresponding loss function
in which the overestimation of action values is reduced by selecting the best action using but estimating the value of that action using . When calculating the target value is taken as is, hence, the back-propagating gradient is stopped before .
Iii-C Neural Network Model and Training Procedure
The DQN solving the MDP from Section II consists of convolutional and fully-connected layers. It is visualized in Figure 2. The UAV’s own position is converted to a 2D one-hot representation, i.e. the encoding of the occupied cell inside the whole grid. With the position encoded in this way, it can be stacked with the three-channel map and the coverage grid to form the five-channel input of the network’s convolutional layers. The kernels of the convolutional layers are then able to form direct spatial connections between the current position and nearby cells. The remaining movement budget is fed into the network after the convolutional layers.
The convolutional layers are padded so that their output shape remains the same as their input shape. All layers are zero-padded for all channels, with the exception of the first layer’s no-fly zone channel, which is one-padded. This is an explicit representation of the no-fly zone surrounding the mission grid. The rectified linear unit (ReLU) is chosen as activation function for the convolutional layers. The last layer of the convolutional network is flattened and concatenated with the movement budget input. Fully-connected layers with ReLU activation are attached to this flatten layer.
The last fully-connected layer is of size and has no activation function. It directly represents the Q-values for each action given the input state. Choosing the of the Q-values is called the greedy policy and exploits already learned knowledge. The greedy policy given by
is deterministic and used when evaluating the agent’s learning progress. During training, the sampled soft-max policy for exploration of the state and action space is used instead. It is given by
with the temperature parameter scaling the balance of exploration versus exploitation. When is increased so does exploration. The limit of the soft-max policy (11) is the greedy policy (10). The soft-max policy was chosen over the -greedy policy because it offers variable exploration based on the relative difference of Q-values and does not depend on the number of training steps or episodes. This appeared to be beneficial for this particular problem.
Algorithm 1 describes the training procedure for the double deep Q-network in more detail. After replay memory and network parameters are initialized, a new training episode begins with resetting the state, choosing a random UAV starting position and random movement budget . The episode continues as long as the movement budget is greater than zero and the UAV has not landed. A new action is chosen according to (11) and the subsequent experience stored in the replay memory buffer .
Sampling a minibatch of size from the replay memory, the primary network parameters are updated by performing a gradient step using the Adam optimizer. Subsequently, the target network parameters are updated using the soft update (7) and the movement budget is decremented. The episode ends when either the drone lands or the movement budget decreases to zero. Then, a new episode starts unless the maximum number of episodes is reached. The hyperparameters that were used during training are listed in Table I.
Iv-a Simulation Setup
The agent can move in a two dimensional grid through action commands in if accepted by the safety controller. Each action, no matter if accepted or rejected, consumes one unit of movement budget since energy is spent during hovering as well. The agent’s initial state consists of a fixed map, a zero-initialized coverage grid and a position, which is uniformly sampled from the starting and landing zone of the map. Additionally, the initial movement budget is uniformly sampled from a movement budget range , which is set to 25-75 for the purpose of this evaluation. The value of the safety flag in is initialized to zero and the UAV’s camera field of view (FoV) is set to a fixed 3-by-3-cell area centered underneath the agent. After each step the mission algorithm marks the FoV as seen in the coverage grid map.
Three evaluation scenarios were chosen, each with a unique problem for the agent to solve. Map A depicted in Figure 5 (a)-(c) has a large starting and landing zone, which yields high variation during training. Additionally, the shape of the target area is challenging to cover. The difficulty of map B in Figure 5 (d)-(f) lies in the yellow area that is marked as target zone, but also marked as a no-fly zone, and therefore must be covered by flying adjacent to it. Map C, in Figure 5 (g)-(i) with a narrow passage between no-fly zones, while easy to cover is very difficult for training as discussed later.
After being trained on their respective scenario with varying movement budgets and varying starting positions under the exploration policy from (11), the agents are evaluated under their exploitation policy from (10). Their performance during coverage for the full movement budget range and all possible starting positions is evaluated. The performance is described through Figures 5 and 5 and Table II.
The agents’ ability to plan a trajectory that ends with a safe landing inside the landing zone over the full movement budget range and starting at each possible position is evaluated through Table II, showing the ratio of landing for all scenario variations. Despite the agent’s good landing performance, the safety controller’s capabilities on a real-world UAV would likely be extended to force navigation to the closest landing zone in the rare cases when the RL agent misses the right moment to return.
To evaluate the impact of movement budget on the achieved coverage ratio, the starting position was fixed. For the selected starting positions the agents successfully landed after completing a trajectory for each movement budget. Figure 5 shows the coverage ratio of each agent with respect to initial movement budget. Selected trajectories for each map under three different movement budgets are depicted in Figure 5. Whereas the movement budget is increasing from left to right, the agent does not necessarily utilize the whole allocated budget if it determines that there is a risk of not returning to the landing area in time or the coverage goal is already fulfilled. It can be seen that the agent finds a trajectory balancing the goals of safe landing and maximum coverage ratio.
Figure 5 shows the training process of an agent on map C. The curve describes the cumulative reward of the exploitation strategy when evaluated during the training process. Three major phases are highlighted in the graph. In phase one the agent learns to land safely, but does not venture far enough from the landing zone to find the target zone. When transitioning to phase two, the agent discovers the target zone, yielding high immediate reward. Due to the focus on mid-term reward through the choice of discount factor , this strategy represents a local optimum. In phase three the agent discovers the path back to the landing zone, avoiding the crashing penalty . After refining the trajectory, the agent finds the optimal path at the end of phase three. The phase transitions are highly dependent on the exploration strategy. Soft-max exploration appeared to be more effective than the -greedy policy to guide these transitions. The basic pattern of this incremental learning process is also visible when applied to other maps, albeit with less pronounced transitions due to bigger variations in coverage ratios.
We have introduced a new deep reinforcement learning approach for the longstanding problem of coverage path planning. While existing approaches might offer guarantees on the (near)-optimality of their solutions, the case where available power constrains the path planning is usually not considered. By feeding spatial information through map-like input channels to the agent, we train a double deep Q-network to learn a UAV control policy that generalizes over varying starting positions and varying power constraints modelled as movement budgets. Using this method, we observed an incremental learning process that successfully balances safe landing and coverage of the target area on three different environments, each with unique challenges.
In the future we will investigate the possibilities of transfer learning for this set of problems. At first we will train the agents on a curriculum of problems based on individual map channels to further accelerate the training process described in this work. From there we will examine approaches to transfer the two dimensional grid agent to higher dimensions and dynamics, e.g. adding altitude and heading. To this effect, it might be beneficial to investigate other RL techniques such as actor-critic methods and policy optimization. The proposed approach can also be seen as an initial step for handling variable power consumption in a real-world scenario. Through these steps an application on physical hardware will likely be within reach.
Marco Caccamo was supported by an Alexander von Humboldt Professorship endowed by the German Federal Ministry of Education and Research. Harald Bayerlein and David Gesbert are supported by the PERFUME project funded by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement no. 670896).
- (2000) Approximation algorithms for lawn mowing and milling. Computational Geometry 17 (1-2), pp. 25–50. Cited by: §I.
- (2019) Survey on coverage path planning with unmanned aerial vehicles. Drones 3 (1). Cited by: §I, §I.
- (1998) Coverage path planning: the boustrophedon cellular decomposition. In Field and service robotics, pp. 203–209. Cited by: §I.
- (2013) A survey on coverage path planning for robotics. Robotics and Autonomous Systems 61 (12), pp. 1258–1276. Cited by: §I, §I, §I.
- (2020-10 Feb) Drones and thermal imaging: saving koalas injured in the bushfires - [news]. The Guardian. External Links: Cited by: §I.
- (2019) Online deep reinforcement learning for autonomous UAV navigation and exploration of outdoor environments. arXiv preprint arXiv:1912.05684. Cited by: §I.
- (2010) Optimal coverage of a known arbitrary environment. In 2010 IEEE International conference on robotics and automation, pp. 5525–5530. Cited by: §I.
- (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. Cited by: §III-B.
- (2019) Drone patrolling with reinforcement learning. In Proceedings of the 13th International Conference on Distributed Smart Cameras, External Links: Cited by: §I.
- (2018) Reinforcement learning: an introduction. second edition, MIT Press. Cited by: §III-B.
- (2016) Deep reinforcement learning with double Q-learning. In Thirtieth AAAI conference on artificial intelligence, pp. 2094–2100. Cited by: §III-B.
- (2018) An integrated traveling salesman and coverage path planning problem for unmanned aircraft systems. IEEE control systems letters 3 (1), pp. 67–72. Cited by: §I.
- (2011) Optimal complete terrain coverage using an unmanned aerial vehicle. In 2011 IEEE International conference on robotics and automation, pp. 2513–2519. Cited by: §I.
- (2004) A neural network approach to complete coverage path planning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 34 (1), pp. 718–724. Cited by: §I.
- (2019) Accessing from the sky: a tutorial on UAV communications for 5G and beyond. Proceedings of the IEEE 107 (12), pp. 2327–2375. Cited by: §I.