# Deep Reinforcement Learning Attitude Control of Fixed-Wing UAVs Using Proximal Policy Optimization

###### Abstract

Contemporary autopilot systems for unmanned aerial vehicles (UAVs) are far more limited in their flight envelope than experienced human pilots, which restricts the conditions UAVs can operate in and the types of missions they can accomplish autonomously. This paper proposes a deep reinforcement learning (DRL) controller to handle the nonlinear attitude control problem, enabling extended flight envelopes for fixed-wing UAVs. A proof-of-concept controller using the proximal policy optimization (PPO) algorithm is developed, and is shown to be capable of stabilizing a fixed-wing UAV from a large set of initial conditions to reference roll, pitch and airspeed values. The training process is outlined and key factors for its progression rate are considered; the most important factors were found to be limiting the number of variables in the observation vector and including values from several previous time steps for these variables. The trained reinforcement learning (RL) controller is compared to a proportional-integral-derivative (PID) controller, and is found to converge in more cases than the PID controller, with comparable performance. Furthermore, the RL controller is shown to generalize well to unseen disturbances in the form of wind and turbulence, even in severe disturbance conditions.


## I Introduction

Unmanned aerial vehicles (UAVs) are employed extensively to increase safety and efficiency in a plethora of tasks such as infrastructure inspection, forest monitoring, and search and rescue missions. Many tasks cannot, however, be accomplished fully autonomously, due to several limitations of autopilot systems. Low-level stabilization of the UAV's attitude, provided by the inner control loops, becomes increasingly difficult due to various nonlinearities as the attitude and airspeed deviate from stable, level conditions. The outer control layers providing path planning and guidance have to account for this, and settle for non-agile and safe plans. Equipping the autopilot with the stabilization skills of an experienced pilot would allow fully autonomous operation in turbulent or otherwise troublesome environments, such as search and rescue missions in extreme weather conditions. It would also increase the usefulness of the UAV, for instance by allowing it to fly closer to its targets for inspection purposes.

Autopilots for fixed-wing UAVs, as illustrated in Figure 1, are typically designed using cascaded single-variable loops under assumptions of decoupled longitudinal and lateral motion, using classical linear control theory [Beard]. The dynamics of fixed-wing aircraft, including UAVs, are however strongly coupled and nonlinear. Nonlinear terms in the equations of motion include kinematic nonlinearities (rotations and Coriolis effects), actuator saturation and aerodynamic nonlinearities, which are also uncertain and difficult to model. The decoupled and linear designs are reliable and well-tested for nominal flight, but also require conservative safety limits in the allowable range of flight conditions and maneuvers (flight envelope protection), because linear controllers applied to nonlinear systems typically result in a limited region of attraction [Khalil2001]. This motivates the use of state-of-the-art nonlinear control algorithms.

Examples of nonlinear control methods applied to UAVs include gain scheduling [Girish2014], linear parameter varying (LPV) control [Rotondo2017], dynamic inversion (feedback linearization) [Kawakami2017], adaptive backstepping [Ren2005], sliding mode control [Castaneda2017], nonlinear model predictive control [Mathisen2016], nonlinear H-infinity control [Garcia2017], dynamic inversion combined with mu-synthesis [Michailidis2017], model reference adaptive control [EugeneLavretsky2012] and L1 adaptive control [kaminer]. Automated agile and aerobatic maneuvering is treated in [Levin2019] and [Bulka2019]. Several of these methods require a more or less accurate aerodynamic model of the UAV. A model-free method based on fuzzy logic can be found in [Kurnaz2008]. Fuzzy control falls under the category of intelligent control systems, which also includes the use of neural networks. An adaptive backstepping controller using a neural network to compensate for aerodynamic uncertainties is given in [Lee2001], while a genetic neuro-fuzzy approach for attitude control is taken in [Oliveira2017]. The state of the art in intelligent flight control of small UAVs is discussed in [Santoso2018].

Control of small UAVs requires making very fast control decisions with limited computational power available. Sufficiently sophisticated models incorporating aerodynamic nonlinearities and uncertainties with the necessary accuracy to enable robust real-time control may not be viable under these constraints. Biology suggests that a bottom-up approach to control design might be a more feasible option. Birds perform elegant and marvelous maneuvers and are able to land abruptly with pinpoint accuracy utilizing stall effects. Insects can hover and zip around with astonishing efficiency, in part due to exploiting unsteady, turbulent aerodynamic flow effects [Beard]. These creatures have developed their skills not through careful consideration and modeling, but through an evolutionary trial-and-error process driven by randomness, with mother nature as a ruthless arbiter of control design proficiency. In similar bottom-up fashion, machine learning (ML) methods have shown great promise in uncovering intricate models from data and representing complex nonlinear relations from their inputs to their outputs. ML can offer an additional class of designs through learning that are not easily accessible through first-principles modeling, exhibiting antifragile properties where unexpected events and stressors provide data to learn and improve from, instead of invalidating the design. It can harbor powerful predictive powers allowing proactive behaviour, while meeting the strict computation time budget in fast control systems.

Reinforcement learning (RL) [suttonbarto] is a subfield of machine learning (ML) concerned with how agents should act in order to maximize some measure of utility, and how they can learn this behaviour from interacting with their environment. Control has historically been viewed as a difficult application of RL due to the continuous nature of these problems' state and action spaces. Furthermore, the task has to be sufficiently nonlinear and complex for RL to be an appropriate consideration over conventional control methods in the first place. To apply tabular methods, one would have to discretize the state and action spaces, and thus suffer from the curse of dimensionality at the discretization resolution needed to achieve acceptable control. The alternative to tabular methods requires function approximation, which has to be sophisticated enough to handle the dynamics of the task, while having a sufficiently stable and tractable training process to offer convergence. Neural networks (NNs) are one of few models which satisfy these criteria: they can certainly be made expressive enough for many tasks, but achieving a stable training phase can be a great challenge. Advances in computation capability and algorithmic progress in RL, reducing the variance in parameter updates, have made deep neural networks (DNNs) applicable to RL algorithms, spawning the field of deep reinforcement learning (DRL). DNNs in RL algorithms provide end-to-end learning of appropriate representations and features for the task at hand, allowing algorithms to solve classes of problems previously deemed unfit for RL. DRL has been applied to complex control tasks such as motion control of robots [Zhang] as well as other tasks where formalizing a strategy by other means is difficult, e.g. game playing [mnih_human-level_2015].

A central challenge with RL approaches to control is the low sample efficiency of these methods, meaning they need a large amount of data before they can become proficient. Allowing the algorithm full control to learn from its mistakes is often not a viable option due to operational constraints such as safety, and simulations are therefore usually the preferred option. The simulation is merely an approximation of the true environment. The model errors, i.e. the differences between the simulator and the real world, are called the reality gap. If the reality gap is small, then the low sample efficiency of these methods is less of a concern, and the agent might exhibit great skill the first time it is applied to the real world.

The current state-of-the-art RL algorithms in continuous state and action spaces, notably deep deterministic policy gradient (DDPG) [lillicrap_continuous_2015], trust region policy optimization (TRPO) [schulman_trust_2015], proximal policy optimization (PPO) [schulman_proximal_2017] and soft actor-critic (SAC) [haarnoja_soft_2018], are generally policy-gradient methods, where some parameterization of the policy is iteratively optimized through estimating the gradients. They are model-free, meaning they make no attempt at estimating the state-transition function. Thus they are very general and can be applied to many problems with little effort, at the cost of lower sample efficiency. These methods generally follow the actor-critic architecture, wherein the actor module, i.e. the policy, chooses actions for the agent and the critic module evaluates how good these actions are, i.e. it estimates the expected long-term reward, which reduces the variance of the gradient estimates.

The premise of this research was to explore the application of RL methods to low-level control of fixed-wing UAVs, in the hope of producing a proof-of-concept RL controller capable of stabilizing the attitude of the UAV to a given attitude reference. To this end, an OpenAI Gym environment [brockman_openai_2016] with a flight simulator tailored to the Skywalker X8 flying wing was implemented, in which the RL controller is tasked with controlling the attitude (the roll and pitch angles) as well as the airspeed of the aircraft. Aerodynamic coefficients for the X8 are given in [Gryte]. The flight simulator was designed with the goal of being valid for a wide array of flight conditions, and therefore includes additional nonlinear effects in the aerodynamic model. The software has been made openly available [pfly, pfly_env]. Key factors impacting the final performance of the controller as well as the rate of progression during training were identified. To the best of the authors' knowledge, this is the first reported work to use DRL for attitude control of fixed-wing UAVs.

The rest of the paper is organized as follows. First, previous applications of RL algorithms to UAVs are presented in Section II, and the aerodynamic model of the Skywalker X8 fixed-wing UAV is then introduced in Section III. Section IV outlines the approach taken to develop the RL controller, presenting the configuration of the RL algorithm and the key design decisions taken, and finally how the controller is evaluated. In Section V, the training process and its major aspects are presented and discussed, and the controller is evaluated in light of the approach described in the preceding section. Finally, Section VI offers some final remarks and suggestions for further work.

## II Related Work

In general, the application of RL to UAV platforms has been limited compared to other robotics applications, because data collection with UAV systems carries significant risk of fatal damage to the aircraft. RL has been proposed as a solution to many high-level tasks for UAVs, such as path planning and guidance, alongside tried and tested traditional controllers providing low-level stabilization. In [gandhiLearning2017], a UAV is trained to navigate in an indoor environment by gathering a sizable dataset consisting of crashes, giving the UAV ample experience of how NOT to fly. In [Han], the authors tackle the data collection problem by constructing a pseudo flight environment in which a fixed-wing UAV and the surrounding area are fitted with magnets, allowing for adjustable magnetic forces and moments in each degree of freedom (DOF). In this way, the UAV can be propped up as one would do when teaching a baby to walk, and thereby experiment without fear of crashing in a setting more realistic than simulations.

The authors of [Imanberdiyev2016] developed a model-based RL algorithm called TEXPLORE to efficiently plan trajectories in unknown environments subject to constraints such as battery life. In [ZhangMPC], the authors use a model predictive controller (MPC) to generate training data for an RL controller, thereby guiding the policy search and avoiding the potentially catastrophic early phase before an effective policy is found. Their controller generalizes to avoid multiple obstacles, compared to the single obstacle avoided by the MPC in training, does not require full state information like the MPC does, and is computed in a fraction of the time. With the advent of DRL, RL has also been used for more advanced tasks such as enabling intelligent cooperation between multiple UAVs [Hung2017], and for specific control problems such as landing [Polvara2018]. RL algorithms have also been proposed for attitude control of other autonomous vehicles, including satellites [Xu2018] and underwater vehicles. The authors of [Carlucho2018] apply an actor-critic DRL algorithm to low-level attitude control of an autonomous underwater vehicle (AUV), similar to the method proposed in this paper, and find that the derived control law transfers well from simulation to real-world experiments.

Of work addressing problems more similar in nature to the one in this paper, i.e. low-level attitude control of UAVs, one can trace the application of RL methods back to the works of Bagnell and Ng, both focusing on helicopter UAVs. Both employed methods based on offline learning from data gathered by an experienced pilot, as opposed to the online self-learning approach proposed in this paper. The former focuses on control of a subset of the controllable states while keeping the others fixed, whereas the latter work extends the control to all six degrees of freedom. In both cases, the controllers exhibit control performance exceeding that of the original pilot when tested on real UAVs. In [Abbeel], the latter work was further extended to include difficult aerobatic maneuvers such as forward flips and sideways rolls, significantly improving upon the state of the art. The authors of [Cory2008] present experimental data of a fixed-wing UAV perching maneuver using an approximate optimal control solution. The control is calculated using a value iteration algorithm on a model obtained using nonlinear function approximators and unsteady system identification based on motion capture data. The authors of [Bou-Ammar2010] compared an RL algorithm using fitted value iteration (FVI) for approximation of the value function to a nonlinear controller based on feedback linearization, on their proficiency in stabilizing a quadcopter UAV after an input disturbance. They find the feedback-linearized controller to have superior performance. Recently, the authors of [koch_reinforcement_2018] applied three state-of-the-art RL algorithms to control the angular rates of a quadcopter UAV. They found PPO to perform the best of the RL algorithms, outperforming the PID controller on nearly every metric.

## III UAV Model

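Following [Beard], the UAV is modeled as a rigid body of mass $m$ with inertia tensor $\mathbf{I}$ and a body frame $\{b\}$ rigidly attached to its center of mass, moving relative to a north-east-down (NED) frame $\{n\}$ assumed to be inertial. To allow for arbitrary attitude maneuvers during simulation, the attitude is represented using unit quaternions $\mathbf{q} = [\eta \;\; \epsilon_1 \;\; \epsilon_2 \;\; \epsilon_3]^T$ where $\mathbf{q}^T\mathbf{q} = 1$. The time evolution of the position $\mathbf{p} = [x \;\; y \;\; z]^T$ and attitude $\mathbf{q}$ of the UAV is governed by the kinematic equations

$$
\dot{\mathbf{p}} = \mathbf{R}^n_b(\mathbf{q})\,\mathbf{v} \tag{1}
$$

$$
\dot{\mathbf{q}} = \frac{1}{2}\begin{bmatrix} 0 & -\boldsymbol{\omega}^T \\ \boldsymbol{\omega} & -\mathbf{S}(\boldsymbol{\omega}) \end{bmatrix}\mathbf{q} \tag{2}
$$

where $\mathbf{v} = [u \;\; v \;\; w]^T$ and $\boldsymbol{\omega} = [p \;\; q \;\; r]^T$ are the linear and angular velocities, respectively, and $\mathbf{S}(\boldsymbol{\omega})$ is the skew-symmetric matrix

$$
\mathbf{S}(\boldsymbol{\omega}) = \begin{bmatrix} 0 & -r & q \\ r & 0 & -p \\ -q & p & 0 \end{bmatrix} \tag{3}
$$

The attitude can also be represented using Euler angles $\boldsymbol{\Theta} = [\phi \;\; \theta \;\; \psi]^T$, where $\phi$, $\theta$, $\psi$ are the roll, pitch and yaw angles respectively. Euler angles will be used for plotting purposes in later sections, and also as inputs to the controllers. Algorithms to convert between unit quaternions and Euler angles can be found in [Beard].

The rotation matrix $\mathbf{R}^n_b(\mathbf{q})$ transforms vectors from $\{b\}$ to $\{n\}$ and can be calculated from $\mathbf{q}$ using [Fossen2011]

$$
\mathbf{R}^n_b(\mathbf{q}) = \mathbf{I}_3 + 2\eta\,\mathbf{S}(\boldsymbol{\epsilon}) + 2\,\mathbf{S}^2(\boldsymbol{\epsilon}) \tag{4}
$$

where $\mathbf{I}_3$ is the 3 by 3 identity matrix and $\boldsymbol{\epsilon} = [\epsilon_1 \;\; \epsilon_2 \;\; \epsilon_3]^T$.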
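As an illustration of how the body-to-NED rotation matrix can be computed from the attitude quaternion, the following minimal sketch assumes a scalar-first unit quaternion $[\eta, \epsilon_1, \epsilon_2, \epsilon_3]$ and the standard expression $\mathbf{I}_3 + 2\eta\,\mathbf{S}(\boldsymbol{\epsilon}) + 2\,\mathbf{S}^2(\boldsymbol{\epsilon})$. The function names are illustrative and are not taken from the simulator code.

```python
import numpy as np

def skew(v):
    """Return the skew-symmetric matrix S(v), so that S(v) @ x equals np.cross(v, x)."""
    x, y, z = v
    return np.array([[0.0, -z, y],
                     [z, 0.0, -x],
                     [-y, x, 0.0]])

def rotation_from_quaternion(q):
    """Body-to-NED rotation matrix from a scalar-first quaternion [eta, eps1, eps2, eps3]."""
    q = np.asarray(q, dtype=float)
    q = q / np.linalg.norm(q)      # guard against numerical drift from unit length
    eta, eps = q[0], q[1:]
    S = skew(eps)
    return np.eye(3) + 2.0 * eta * S + 2.0 * S @ S

# Example: a 90 degree yaw rotation maps the body x-axis onto the east axis.
q_yaw90 = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])
R = rotation_from_quaternion(q_yaw90)
```

Renormalizing the quaternion inside the function is a design choice of this sketch; a simulator may instead renormalize once per integration step.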

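The rates of change of the velocities $\mathbf{v}$ and $\boldsymbol{\omega}$ are given by the Newton-Euler equations of motion:

$$
m\left(\dot{\mathbf{v}} + \boldsymbol{\omega} \times \mathbf{v}\right) = \mathbf{R}^n_b(\mathbf{q})^T m\,\mathbf{g}^n + \mathbf{F}_{prop} + \mathbf{F}_{aero} \tag{5}
$$

$$
\mathbf{I}\dot{\boldsymbol{\omega}} + \boldsymbol{\omega} \times \mathbf{I}\boldsymbol{\omega} = \mathbf{M}_{prop} + \mathbf{M}_{aero} \tag{6}
$$

where $\mathbf{g}^n = [0 \;\; 0 \;\; g]^T$ and $g$ is the acceleration of gravity. Apart from gravity, the UAV is affected by forces and moments due to aerodynamics ($\mathbf{F}_{aero}$, $\mathbf{M}_{aero}$) and propulsion ($\mathbf{F}_{prop}$, $\mathbf{M}_{prop}$), which are explained in the next sections. All velocities, forces and moments are represented in the body frame.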

### III-A Aerodynamic Forces and Moments

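The UAV is flying in a wind field decomposed into a steady part $\mathbf{v}^n_{w_s}$ and a stochastic part $\mathbf{v}^b_{w_g}$ representing gusts and turbulence. The steady part is represented in $\{n\}$, while the stochastic part is represented in $\{b\}$. Similarly, rotational disturbances are modeled through the wind angular velocity $\boldsymbol{\omega}_w$. The velocities of the UAV relative to the surrounding air mass are then defined as:

$$
\mathbf{v}_r = \mathbf{v} - \mathbf{R}^n_b(\mathbf{q})^T \mathbf{v}^n_{w_s} - \mathbf{v}^b_{w_g} = [u_r \;\; v_r \;\; w_r]^T \tag{7}
$$

$$
\boldsymbol{\omega}_r = \boldsymbol{\omega} - \boldsymbol{\omega}_w = [p_r \;\; q_r \;\; r_r]^T \tag{8}
$$

From the relative velocity we can calculate the airspeed $V_a$, angle of attack $\alpha$ and sideslip angle $\beta$:

$$
V_a = \sqrt{u_r^2 + v_r^2 + w_r^2} \tag{9}
$$

$$
\alpha = \tan^{-1}\left(\frac{w_r}{u_r}\right) \tag{10}
$$

$$
\beta = \sin^{-1}\left(\frac{v_r}{V_a}\right) \tag{11}
$$

The stochastic components of the wind, given by $\mathbf{v}^b_{w_g}$ and $\boldsymbol{\omega}_w$, are generated by passing white noise through shaping filters given by the Dryden velocity spectra [dryden][dryden_matlab].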
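The air data quantities can be computed from the body-frame relative velocity components as in the sketch below. It assumes the standard definitions (airspeed as the Euclidean norm, angle of attack from the vertical and forward components, sideslip from the lateral component), and uses `atan2` in place of the plain inverse tangent, which agrees with it for forward flight and stays well defined when the forward component is zero. The function name is illustrative.

```python
import math

def air_data(u_r, v_r, w_r):
    """Airspeed, angle of attack and sideslip angle from relative velocity components."""
    V_a = math.sqrt(u_r**2 + v_r**2 + w_r**2)
    alpha = math.atan2(w_r, u_r)                        # angle of attack [rad]
    beta = math.asin(v_r / V_a) if V_a > 0.0 else 0.0   # sideslip angle [rad]
    return V_a, alpha, beta

# Example: level flight at 18 m/s with no relative crosswind or vertical flow.
V_a, alpha, beta = air_data(18.0, 0.0, 0.0)
```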

The aerodynamic forces and moments are formulated in terms of aerodynamic coefficients that are, in general, nonlinear functions of , and , as well as control surface deflections. Aerodynamic coefficients are taken from [Gryte], based on wind tunnel experiments of the Skywalker X8 flying wing as well as a Computational Fluid Dynamics (CFD) code. The X8 is equipped with right and left elevon control surfaces. Note that there is no tail or rudder. Even though the vehicle under consideration has elevons, in [Gryte] the model is parameterized in terms of ”virtual” aileron and elevator deflections and . These are related to elevon deflections through the transformation

(12) |

where and are right and left elevon deflections, respectively.

The aerodynamic forces are described by

(13) | ||||

(14) | ||||

(15) |

where is the density of air, is the wing planform area, is the aerodynamic chord, and the wingspan of the UAV. The rotation matrix transforming the drag force , side force and lift force from the wind frame to the body frame is given by:

(16) |

The model in [Gryte] has similar structure to the linear coefficients in [Beard], but has added quadratic terms in and to the drag coefficient . In addition, is quadratic in the elevator deflection . In this paper, as an attempt to extend the range of validity of the model, the lift, drag and pitch moment coefficients in [Gryte] are extended using nonlinear Newtonian flat plate theory from [Beard] and [Gryte2015]. This makes the lift, drag and pitch coefficients nonlinear in angle of attack by blending between the linear models which are valid for small angles, and the flat plate models which are only valid for large angles. While the linear models are based on physical wind-tunnel experiments and CFD, the nonlinear models have not been validated experimentally.

### III-B Propulsion Forces and Moments

Assuming the propeller thrust is aligned with the x-axis of , we can write

(17) |

The propeller thrust is given by [fitzpatrick] as presented in [beard2]:

(18) | ||||

(19) |

where is the discharge velocity of air from the propeller, is a motor constant, is the propeller disc area, is an efficiency factor, and is the throttle. The parameters in (18) and (19) for a typical X8 motor/propeller configuration are given in [Coates], which are based on wind tunnel experiments.

The propeller moments are given by

(20) |

where and , which are based on the same experimental data used in [Coates]. Gyroscopic moments are assumed negligible.

### III-C Actuator Dynamics and Constraints

Denoting commands with superscript c, the elevon control surface dynamics are modeled by rate limited and saturated second-order integrators similar to [prasad_pradeep]:

(21) |

for , where and . The angular deflections and rates are constrained to degrees and degrees per second, respectively.

The throttle dynamics are given by the first order transfer function [Gryte2015]

(22) |

where .

## IV Method

PPO was the chosen RL algorithm for the attitude controller for several reasons. First, PPO was found to be the best-performing algorithm for attitude control of quadcopters in [koch_reinforcement_2018]. Second, PPO's hyperparameters are robust for a large variety of tasks, and it has high performance and low computational complexity. It is therefore the default choice of algorithm in OpenAI's projects.

The objective is to control the UAV's attitude, so a natural choice of controlled variables is the roll, pitch and yaw angles. However, the yaw angle of the aircraft is typically not controlled directly, but through the yaw rate, which depends on the roll angle. In addition, it is desirable to stay close to some nominal airspeed to ensure energy-efficient flight, to avoid stall, and to maintain control surface effectiveness, which is proportional to airspeed squared. The RL controller is therefore tasked with controlling the roll and pitch angles, $\phi$ and $\theta$, and the airspeed $V_a$ to desired reference values. At each time step the controller receives an immediate reward, and it aims at developing a control law that maximizes the sum of future discounted rewards.

The action space of the controller is three dimensional, consisting of commanded virtual elevator and aileron angles as well as the throttle. Elevator and aileron commands are mapped to commanded elevon deflections using the inverse of the transformation given by (12).

The observation vector (i.e. the input to the RL algorithm) contains information obtained directly from state feedback of states typically measured by standard sensor suites. No sensor noise is added. To promote smooth actions it also includes a moving average of previous actuator setpoints. Moreover, since the policy network is a feed-forward network with no memory, the observation vector at each time step consists of these values for several previous time steps to facilitate learning of the dynamics.

### IV-A The Proximal Policy Optimization Algorithm

PPO is a model-free, on-policy, actor-critic, policy-gradient method. It aims to retain the reliable performance of TRPO algorithms, which guarantee monotonic improvements by considering the Kullback-Leibler (KL) divergence of policy updates, while only using first-order optimization. In this section, $\pi_\theta$ is the policy network (that is, the control law), which is optimized with respect to its parameterization $\theta$, in this case the NN weights. (The symbol $\theta$ is used in this section as it is the established nomenclature in the machine learning field, but in the rest of the article it refers to the pitch angle.) The policy network takes the state $s$, i.e. the observation vector, as its input, and outputs an action $a$, consisting of the elevator, aileron and throttle setpoints. For continuous action spaces, the policy network is tasked with outputting the moments of a probability distribution, in this case the means and variances of a multivariate Gaussian, from which actions are drawn. During training, actions are randomly sampled from this distribution to increase exploration, while the mean is taken as the action when training is completed.

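Policy gradient algorithms work by estimating the policy gradient, and then applying a gradient ascent algorithm to the gradient estimate. The gradients are estimated in a Monte Carlo (MC) fashion by running the policy in the environment to obtain samples of the policy loss $J(\theta)$ and its gradient [suttonbarto], where $\tau$ represents trajectories of the form $(s_0, a_0, s_1, a_1, \ldots, s_T, a_T)$, $r_t$ is the reward received at time step $t$ and $\gamma$ is the discount factor:

$$
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^{t} r_t\right] \tag{23}
$$

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{t'=t}^{T} \gamma^{t'} r_{t'}\right] \tag{24}
$$

In practice, these gradients are obtained with automatic differentiation software on a surrogate loss objective, whose gradients are the same as (24), and are then backpropagated through the NN to update $\theta$.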

The central challenge in policy gradient methods lies in reducing the variance of the gradient estimates, such that consistent progress towards better policies can be made. The actor-critic architecture makes a significant impact in this regard, by reformulating the reward signals in terms of advantage:

$$
Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\left[\sum_{t'=t}^{T} \gamma^{t'-t} r_{t'} \,\middle|\, s_t, a_t\right] \tag{25}
$$

$$
V^{\pi}(s_t) = \mathbb{E}_{\pi}\left[\sum_{t'=t}^{T} \gamma^{t'-t} r_{t'} \,\middle|\, s_t\right] \tag{26}
$$

$$
A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t) \tag{27}
$$

The advantage function (27) measures how good an action is compared to the other actions available in the state, such that good actions have positive rewards, and bad actions have negative rewards. One thus has to be able to estimate the average reward of the state, i.e. the value function $V^{\pi}(s)$. (The value function is the expected long-term reward of being in state $s$ and then following policy $\pi$, as opposed to the $Q$-function, which focuses on the long-term reward of taking a specific action in the state and then following the policy.) This is the job of the critic network, a separate NN trained in a supervised manner to predict the value function with ground truth from the reward values in the gathered samples. Several improvements such as generalized advantage estimation (GAE) are further employed to reduce the variance of the advantage estimates. PPO also makes use of several actors simultaneously gathering samples with the policy, to increase the sample batch size.
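To make the role of the critic concrete, the sketch below computes advantage estimates with the standard GAE recursion from a single rollout of rewards and critic value predictions. The discount factor `gamma` and the GAE parameter `lam` are common default values chosen for illustration, not the settings used in this work, and the sketch assumes the rollout contains no episode boundaries.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one rollout without episode boundaries.

    rewards[t] and values[t] belong to step t; last_value is the critic's
    prediction for the state reached after the final step.
    """
    values = list(values) + [last_value]
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference residual of the critic.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future residuals.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With `lam=1` and an all-zero critic the estimate reduces to the discounted reward-to-go, while `lam=0` gives the one-step residual; intermediate values trade bias against variance.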

PPO maximizes the surrogate objective function

$$
L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\; \mathrm{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_t\right)\right] \tag{28}
$$

in which $\hat{A}_t$ and $\hat{\mathbb{E}}_t$ denote the empirically obtained estimates of the advantage function and expectation, respectively, and $r_t(\theta)$ is the probability ratio

$$
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} \tag{29}
$$

Vanilla policy gradients require samples from the policy being optimized, which after a single optimization step are no longer usable for the improved policy. For increased sample efficiency, PPO uses importance sampling to obtain the expectation of samples gathered from an old policy $\pi_{\theta_{old}}$ under the new policy $\pi_\theta$ we want to refine. In this way, each sample can be used for several gradient ascent steps. As the new policy is refined, the two policies will diverge, increasing the variance of the estimation, and the old policy is therefore periodically updated to match the new policy. For this approach to be valid, the state transition function must be similar between the two policies, which is ensured by clipping the probability ratio (29) to the region $[1-\epsilon, 1+\epsilon]$. (The clip operator saturates the variable in the first argument between the values supplied by the two following arguments.) This also gives a first-order approach to trust region optimization: the algorithm is not too greedy in favoring actions with positive advantage, and not too quick to avoid actions with a negative advantage function from a small set of samples. The minimum operator ensures that the surrogate objective function is a lower bound on the unclipped objective, and eliminates increased preference for actions with negative advantage function. PPO is outlined in Algorithm 1.
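The effect of the clip and minimum operators can be seen in the following minimal sketch, which evaluates the clipped surrogate objective for a batch of samples given log-probabilities under the new and old policies. The clipping parameter value of 0.2 is a commonly used default and an assumption here; in an actual implementation the same expression is built in an automatic differentiation framework so that it can be maximized with respect to the policy parameters.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantages, epsilon=0.2):
    """Mean clipped surrogate objective over a batch of samples."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)                  # probability ratio
        clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
        total += min(ratio * adv, clipped_ratio * adv)     # pessimistic bound
    return total / len(advantages)
```

For a sample with positive advantage, the contribution stops growing once the ratio exceeds `1 + epsilon`; for a sample with negative advantage and a ratio above `1 + epsilon`, the unclipped and more negative term is kept.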

### IV-B Action Space

A known issue in optimal control is that while continually switching between maximum and minimum input is often optimal in the sense of maximizing the objective function, in practice it wears unnecessarily on the actuators. Since PPO samples its outputs from a Gaussian distribution during training, a high variance will generate highly fluctuating actions. This is not much of a problem in a simulator environment, but could be an issue if the controller is trained online on a real aircraft. PPO's hyperparameters are tuned with respect to a symmetric action space with a small range (e.g. -1 to 1). Adhering to this design also has the benefit of increased generality, training the controller to output actions as a fraction of maximal and minimal setpoints. The actions produced by the controller are therefore clipped to this range, and subsequently scaled to fit the actuator ranges as described in Section III.
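The clipping and scaling step can be sketched as follows. The actuator limits in the example are hypothetical placeholders chosen for illustration, not the values used in the simulator, and the function name is likewise illustrative.

```python
def scale_action(action, low, high):
    """Clip each normalized action to [-1, 1] and map it linearly onto [low, high]."""
    scaled = []
    for a, lo, hi in zip(action, low, high):
        a = max(-1.0, min(1.0, a))                  # saturate the policy output
        scaled.append(lo + 0.5 * (a + 1.0) * (hi - lo))
    return scaled

# Hypothetical limits: elevator and aileron in degrees, throttle as a fraction.
LOW = [-30.0, -30.0, 0.0]
HIGH = [30.0, 30.0, 1.0]
commands = scale_action([0.0, 2.0, -1.0], LOW, HIGH)
```

Here the out-of-range second action is saturated at the upper limit before scaling, so the policy can never command a setpoint outside the actuator range.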

### IV-C Training of Controller

Variable | Initial Condition | Target |
---|---|---|

- | ||

- | ||

- | ||

- | ||

The PPO RL controller was initialized with the default hyperparameters in the OpenAI Baselines implementation [stable-baselines], and run with 6 parallel actors. The controller policy is an extended version of the default multilayer perceptron (MLP) policy with two hidden layers of 64 nodes each: the observation vector is first processed in a convolutional layer with three filters spanning the time dimension for each component, before being fed to the default policy. This allows the policy to construct functions on the time evolution of the observation vector, while scaling more favorably in parameter count with increasing observation vector size compared to a fully connected input layer.

The controller is trained in an episodic manner to assume control of an aircraft in motion and orient it towards some new reference attitude. Although the task at hand is not truly episodic in the sense of having natural terminal states, episodic training allows one to adjust episode conditions to suit the agent's proficiency, and also admits greater control of the agent's exploration of the state space. The initial state and reference setpoints for the aircraft are randomized in the ranges shown in Table I. Episode conditions are progressively made more difficult as the controller improves, beginning close to target setpoints and in stable conditions, until finally spanning the entirety of Table I. The chosen ranges allow the RL controller to demonstrate that it is capable of attitude control, and facilitate comparison with the PID controller, as it is expected to perform well in this region. According to [Beard], a typical sampling frequency for autopilots is 100 Hz, and the simulator therefore advances 0.01 seconds at each time step. Each episode is terminated after a maximum of 2000 time steps, corresponding to 20 seconds of flight time. No wind or turbulence forces are enabled during training of the controller.

In accordance with traditional control theory, where one usually considers costs to be minimized rather than rewards to be maximized, the immediate rewards returned to the RL controller are all negative, in the normalized range of -1 to 0:

(30) |

In this reward function, was chosen as the distance metric between the current state and the desired state (denoted with superscript d).^{5}^{5}5The distance has the advantage of punishing small errors harsher than the distance, and therefore encourages eliminating small steady-state errors. Furthermore, a cost is attached to changing the actuator setpoints to address oscillatory control behaviour. Commanded control setpoint of actuator at time step is denoted , where . The importance of each component of the reward function is weighted through the factors. To balance the disparate scales of the different components, the values are divided by the variables approximate dynamic range, represented by the factors.

The components of the observation vector are expressed in different units and also have differing dynamic ranges. NNs are known to converge faster when the input features share a common scale, such that the network does not need to learn this scaling itself. The observation vector should therefore be normalized. This is accomplished with the VecNormalize class of [stable-baselines], which estimates a running mean and variance of each observation component and normalizes based on these estimates.
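The idea behind running normalization can be illustrated with the following minimal sketch, which maintains a per-component running mean and variance using Welford's update. This is not the VecNormalize implementation of [stable-baselines]; the class and method names are hypothetical and the sketch omits features such as clipping of the normalized values.

```python
import numpy as np

class RunningNormalizer:
    """Per-component running mean and (population) variance of observations."""

    def __init__(self, size, eps=1e-8):
        self.mean = np.zeros(size)
        self.var = np.ones(size)
        self.count = 0
        self.eps = eps  # avoids division by zero for constant components

    def update(self, x):
        x = np.asarray(x, dtype=float)
        self.count += 1
        delta = x - self.mean
        self.mean = self.mean + delta / self.count
        # Welford's update written directly in terms of the variance.
        self.var = self.var + (delta * (x - self.mean) - self.var) / self.count

    def normalize(self, x):
        return (np.asarray(x, dtype=float) - self.mean) / np.sqrt(self.var + self.eps)

# Example: two components with very different scales.
normalizer = RunningNormalizer(size=2)
for obs in ([1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]):
    normalizer.update(obs)
```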

### IV-D Evaluation

Representing the state of the art in model-free control, fixed-gain PID controllers for roll, pitch and airspeed were implemented to provide a baseline comparison for the RL controller:

(31) | ||||

(32) | ||||

(33) |

The throttle is used to control airspeed, while virtual aileron and elevator commands are calculated to control roll and pitch, respectively. The PID controllers were manually tuned using a trial-and-error approach until achieving acceptable transient responses and low steady-state errors for a range of initial conditions and setpoints. Wind was turned off in the simulator during tuning. The integral terms in (31)-(33) are implemented numerically using forward Euler. Controller gains are given in Table II.
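A fixed-gain PID loop with a forward Euler integral term can be sketched as below. The structure is an assumption made for illustration: the error is taken as reference minus measurement, the derivative action acts on a measured rate rather than on the differentiated error, the output is saturated, and no anti-windup is included. The class name and the gains used in the example are hypothetical and do not correspond to Table II.

```python
class PID:
    """Fixed-gain PID controller with forward Euler integration of the error."""

    def __init__(self, kp, ki, kd, dt, u_min, u_max):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.dt = dt
        self.u_min, self.u_max = u_min, u_max
        self.integral = 0.0

    def step(self, error, rate):
        """Return the saturated command for the current error and measured rate."""
        u = self.kp * error + self.ki * self.integral - self.kd * rate
        # Forward Euler: the integral state used at the next step is advanced
        # with the error observed at the current step.
        self.integral += self.dt * error
        return max(self.u_min, min(self.u_max, u))

# Hypothetical gains for a roll loop running at 100 Hz with a normalized output.
roll_pid = PID(kp=1.0, ki=2.0, kd=0.5, dt=0.01, u_min=-1.0, u_max=1.0)
first_command = roll_pid.step(error=0.5, rate=0.0)
```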

Parameter | Value | Parameter | Value |
---|---|---|---|
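A fixed-gain PID loop with a forward-Euler integral term can be sketched as follows. This is a generic discrete PID written for illustration, not the control laws of (31)-(33): the class name, the sign convention (error defined as setpoint minus measurement), the optional rate input for the derivative term and all gain values are assumptions.

```python
class PID:
    """Discrete fixed-gain PID controller with a forward-Euler integrator."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0

    def step(self, error, error_rate=0.0):
        """Return the command for the current error, then advance the integral."""
        command = self.kp * error + self.ki * self.integral + self.kd * error_rate
        # Forward Euler: I[k+1] = I[k] + dt * e[k], so the current error
        # only affects the integral term from the next time step onward.
        self.integral += error * self.dt
        return command
```

One instance per loop (airspeed, roll, pitch) would be stepped once per simulation time step, for example `PID(kp=2.0, ki=1.0, kd=0.0, dt=0.01).step(error)` with hypothetical gains.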

The same aerodynamic model that is used for training is also used for evaluation purposes, with the addition of disturbances in the form of wind to test generalization capabilities. The controllers are compared in four distinct wind and turbulence regions: light, moderate, severe and no turbulence. Each setting consists of a steady wind component, with randomized orientation and a magnitude of 7 m/s, 15 m/s, 23 m/s and 0 m/s respectively, and additive turbulence given by the Dryden turbulence model [dryden]. Note that a wind speed of 23 m/s is a substantial disturbance, especially when considering the Skywalker X8’s nominal airspeed of 18 m/s. For each wind setting, 100 sets of initial conditions and target setpoints are generated, spanning the ranges shown in Table I. The reference setpoints are set to 20-30 degrees and 3-4 m/s deviation from the initial state for the angle variables and airspeed, respectively. Each evaluation scenario is run for a maximum of 1500 time steps, corresponding to 15 seconds of flight time, which should be sufficient time to allow the controller to regulate to the setpoint.

The reward function does not merely measure the proficiency of the \glsrl controller; it is also designed to facilitate learning. To compare, rank and evaluate different controllers, additional evaluation criteria are therefore needed. The controllers are evaluated on the following criteria:

- Success/failure: whether the controller succeeds in controlling the state to within some bound of the setpoint. The state must remain within the bound for at least 100 consecutive time steps to be counted as a success. Separate bounds were chosen for the roll and pitch angles and for the airspeed.
- Rise time: the time it takes the controller to reduce the error from 90 % to 10 % of its initial value. As these control scenarios are not simple step responses, and the error may cross these thresholds several times during an episode, the rise time is measured from the first time the error falls to 90 % of its initial value until the first time it falls to 10 %.
- Settling time: the time it takes the controller to settle within the success bound of the setpoint and never leave it again.
- Overshoot: the peak value reached on the opposite side of the setpoint with respect to the initial error, expressed as a percentage of the initial error.
- Control variation: the average change in actuator commands per second, where the average is taken over time steps and actuators.

Rise time, settling time, overshoot and control variation are only measured when the episode is counted as a success. When comparing controllers, the success criterion is the most important, as it is indicative of stability as well as of achieving the control objective. Secondly, low control variation is important to avoid unnecessary wear and tear on the actuators. While success or failure is a binary variable, rise time, settling time and overshoot give additional quantitative information on the average performance in the successful scenarios.
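The criteria above can be computed from a logged trajectory roughly as in the following sketch. It is one possible reading of the definitions, not the authors' evaluation code: the function names, the return format and the handling of edge cases (for instance a zero initial error) are assumptions, and the default `dt=0.01` simply matches 1500 time steps per 15 seconds.

```python
def evaluate_response(values, setpoint, bound, dt=0.01, hold_steps=100):
    """Success, rise time, settling time and overshoot for one state trajectory."""
    errors = [v - setpoint for v in values]
    e0 = abs(errors[0])
    inside = [abs(e) <= bound for e in errors]

    # Success: at least `hold_steps` consecutive samples inside the bound.
    run = longest = 0
    for ok in inside:
        run = run + 1 if ok else 0
        longest = max(longest, run)
    if longest < hold_steps:
        return {"success": False}

    def first_below(fraction):
        # Index of the first sample whose error is at most fraction * initial error.
        return next((i for i, e in enumerate(errors) if abs(e) <= fraction * e0), None)

    i90, i10 = first_below(0.9), first_below(0.1)
    rise = (i10 - i90) * dt if i90 is not None and i10 is not None else None

    # Settling time: first instant after which the state never leaves the bound.
    last_outside = max((i for i, ok in enumerate(inside) if not ok), default=-1)
    settling = (last_outside + 1) * dt if last_outside + 1 < len(values) else None

    # Overshoot: largest excursion on the far side of the setpoint, in percent.
    sign = 1.0 if errors[0] >= 0 else -1.0
    peak = max(max(-sign * e for e in errors), 0.0)
    overshoot = 100.0 * peak / e0 if e0 > 0 else None

    return {"success": True, "rise_time": rise,
            "settling_time": settling, "overshoot": overshoot}


def control_variation(commands, dt=0.01):
    """Average absolute change in actuator commands per second.

    `commands` holds one tuple of actuator commands per time step.
    """
    total = 0.0
    for prev, cur in zip(commands, commands[1:]):
        total += sum(abs(c - p) for p, c in zip(prev, cur))
    n_changes = (len(commands) - 1) * len(commands[0])
    return total / (n_changes * dt)
```

Each state (roll, pitch, airspeed) would be passed through `evaluate_response` separately with its own bound, and an episode counted as an overall success only if all three succeed.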

## V Results and Discussion

The controller was trained on a desktop computer with an i7-9700K CPU and an RTX 2070 GPU. The model converges after around two million time steps of training, which on this hardware takes about an hour. This is relatively little compared to other applications of \glsdrl, and suggests that the \glsrl controller has additional capacity to master more difficult tasks. Inference with the trained model takes microseconds on this hardware, meaning that the \glsrl controller could reasonably be expected to operate at the assumed autopilot sampling frequency of 100 Hz in flight.

### V-A Key Factors Impacting Training

The choice of observation vector supplied to the \glsrl controller proved to be significant for its rate of improvement during training and its final performance. Reducing the observation vector to only the essential components, i.e. the current airspeed and roll and pitch angles, the current angular velocities of the \glsuav, and the state errors, helped the \glsrl controller improve significantly faster than with other, larger observation vectors. (Essential here refers to the factors' impact on performance for this specific control task; one would expect further variables to become essential when operating in the more extreme and nonlinear regions of the state space.) Including values for several previous time steps (five was found to be a good choice) further accelerated training, as this makes learning the dynamics easier for the memoryless feed-forward policy.
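Stacking the last few observations into one input vector can be sketched with a fixed-length queue. The class name `ObservationStack`, the zero-filled initial history and the oldest-to-newest ordering are assumptions made for this example; the paper does not specify these details here.

```python
from collections import deque


class ObservationStack:
    """Keeps the most recent observations and returns them as one flat vector."""

    def __init__(self, obs_size, history=5):
        # Assumption: the history is padded with zeros until real data arrives.
        self.frames = deque(
            ([0.0] * obs_size for _ in range(history)), maxlen=history
        )

    def push(self, obs):
        """Add the newest observation and return the stacked vector."""
        self.frames.append(list(obs))  # the oldest frame is dropped automatically
        return [x for frame in self.frames for x in frame]
```

With a history of five, the policy input is five times the length of a single observation, which gives a feed-forward network access to short-term trends such as rates of change without requiring internal memory.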

The reward function is one of the major ways the designer can influence and direct the behaviour of the agent. One of the more popular alternatives to the $L_1$ norm and clipping for achieving saturated rewards is the class of exponential reward functions, notably the Gaussian reward function as in [Carlucho2018]. Different choices of reward function were not analyzed in depth, as the original choice gave satisfactory results.
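To make the difference between the two saturating shapes concrete, the sketch below compares a clipped absolute-error penalty with a generic Gaussian reward. Both functions are illustrative: the names, the `scale` and `sigma` parameters and the particular Gaussian form are assumptions, and the latter is not necessarily the exact function used in [Carlucho2018].

```python
import math


def clipped_l1_reward(error, scale, max_cost=1.0):
    """Penalty that grows linearly with the error and saturates at -max_cost."""
    return -min(abs(error) / scale, max_cost)


def gaussian_reward(error, sigma):
    """Reward that peaks at 1 for zero error and decays smoothly towards 0."""
    return math.exp(-0.5 * (error / sigma) ** 2)
```

Both shapes saturate for large errors, but they differ near zero: the clipped absolute-error penalty keeps a constant slope, whereas the Gaussian is flat at its peak and therefore gives less incentive to remove a small remaining error.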

Setting | Controller | Success $\phi$ (%) | Success $\theta$ (%) | Success $V_a$ (%) | Success all (%) | Rise time $\phi$ (s) | Rise time $\theta$ (s) | Rise time $V_a$ (s) | Settling time $\phi$ (s) | Settling time $\theta$ (s) | Settling time $V_a$ (s) | Overshoot $\phi$ (%) | Overshoot $\theta$ (%) | Overshoot $V_a$ (%) | Control variation (1/s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
No turbulence | RL | 100 | 100 | 100 | 100 | 0.265 | 0.661 | 0.825 | 1.584 | 1.663 | 2.798 | 21 | 24 | 31 | 0.517 |
No turbulence | PID | 100 | 100 | 98 | 98 | 1.344 | 0.228 | 0.962 | 2.050 | 1.364 | 2.198 | 4 | 17 | 35 | 0.199 |
Light turbulence | RL | 100 | 100 | 100 | 100 | 0.210 | 0.773 | 0.744 | 1.676 | 1.806 | 2.738 | 28 | 33 | 36 | 0.748 |
Light turbulence | PID | 100 | 100 | 99 | 99 | 1.081 | 0.294 | 0.863 | 2.057 | 1.638 | 2.369 | 6 | 20 | 43 | 0.457 |
Moderate turbulence | RL | 100 | 100 | 98 | 98 | 0.192 | 1.474 | 0.934 | 2.167 | 2.438 | 4.085 | 54 | 54 | 74 | 0.913 |
Moderate turbulence | PID | 100 | 93 | 90 | 87 | 0.793 | 0.525 | 0.864 | 2.764 | 2.563 | 3.460 | 34 | 35 | 70 | 0.781 |
Severe turbulence | RL | 100 | 100 | 92 | 92 | 0.166 | 1.792 | 1.585 | 2.903 | 3.280 | 7.049 | 122 | 93 | 156 | 1.083 |
Severe turbulence | PID | 99 | 96 | 87 | 86 | 0.630 | 0.945 | 1.343 | 3.576 | 5.256 | 5.470 | 92 | 80 | 122 | 1.117 |

### V-B Evaluation of Controller

The \glsrl controller generalizes well to situations and tasks not encountered during training. Even though the controller is trained with a single setpoint for each episode, Figure 2 shows that it is capable of adapting to new setpoints during flight. This result was also found by [koch_reinforcement_2018] for quadcopters. The generalization capability also holds for unmodeled wind and turbulence forces. The controller is trained with no wind estimates in the observation vector and no wind forces enabled in the simulator, but as Table III shows, it still tracks the setpoint when steady wind and turbulence are enabled in the test environment. Table III should be read as a quantitative analysis of performance in conditions similar to normal operating conditions, while Figures 2 and 3 qualitatively show the capabilities of the controllers on significantly more challenging tasks.

Table III shows that both controllers are generally capable of achieving convergence to the target for the evaluation tasks, with neither controller clearly outperforming the other. The \glsrl controller has an advantage over the \glspid controller on the success criterion, and seems to be more robust to the turbulence disturbance. It achieves convergence in the attitude states in all situations, unlike the \glspid controller, and is also notably more successful in moderate and severe turbulence (98 % and 92 % overall success, against 87 % and 86 % for the \glspid controller). The \glspid controller has considerably lower control variation for the simple settings with little or no wind, but its control variation grows quickly with increasing disturbance. In severe turbulence the \glsrl controller has the lower control variation of the two (1.083 against 1.117).

The two controllers perform similarly with respect to settling time and rise time, each having the edge in different states under various conditions, while the \glspid controller performs favorably when measured on overshoot. All in all, this is an encouraging result for the \glsrl controller, as it performs similarly to the established \glspid controller in the latter's preferred domain, while the \glsrl controller is expected to make its greatest contribution in the more nonlinear regions of the state space.

A comparison of the two controllers is shown in Figure 3 on a scenario involving fairly aggressive maneuvers, which both are able to execute. Figures 2 and 3 illustrate an interesting result: the \glsrl controller is able to eliminate steady-state errors. While the \glspid controller has integral action to mitigate steady-state errors, the control law of the \glsrl controller is only a function of the last few states and references. This might suggest that the \glsrl controller has learned some feed-forward action, including nominal inputs in each equilibrium state, thus removing steady-state errors in most cases. Another possibility is that steady-state errors are greatly reduced through the use of high-gain feedback, but the low control variation shown for severe turbulence in Table III indicates that the gain is not excessive. Future work should include integral error states in the observations and evaluate the implications for training and flight performance.

## VI Conclusions

The ease with which the proof-of-concept \glsrl controller learns to control the \glsuav for the tasks presented in this paper, and its ability to generalize to turbulent wind conditions, suggest that \glsdrl is a good candidate for nonlinear flight control design. A central unanswered question is the severity of the reality gap, in other words how transferable the strategies learned in simulation are to real-world flight. Future work should evaluate the controller’s robustness to parametric and structural aerodynamic uncertainties; this is essential before undertaking any real-life flight experiments. For more advanced maneuvers, e.g. aerobatic flight or recovering from extreme situations, the controller should be given more freedom in adjusting the airspeed, possibly by treating it as an uncontrolled state.

There is still much potential left to harness for this class of controllers. The policy network used to represent the control law is small and simple; more complex architectures such as \glslstm could be used to make a dynamic \glsrl controller. Training, experiments and reward structures can be designed to facilitate learning of more advanced behavior, tighter control or better robustness. Should the reality gap prove to be a major obstacle for the success of the \glsrl controller in the real world, one should look to the class of off-policy algorithms such as \glssac. These algorithms are able to learn offline from gathered data, and thus might be more suited for \glsuav applications.

## Acknowledgments

The first author is financed by “PhD Scholarships at SINTEF” from the Research Council of Norway (grant no. 272402). The second and fourth authors were partially supported by the Research Council of Norway at the Norwegian University of Science and Technology (grants no. 223254 NTNU AMOS and no. 261791 AutoFly).