Generalized Policy Iteration for Optimal Control in Continuous Time
This paper proposes the Deep Generalized Policy Iteration (DGPI) algorithm to find the infinite horizon optimal control policy for general nonlinear continuous-time systems with known dynamics. Unlike existing adaptive dynamic programming algorithms for continuous-time systems, DGPI requires neither an “admissible” initial policy nor input-affine system dynamics for convergence. Our algorithm employs the actor-critic architecture to iteratively solve the Hamilton-Jacobi-Bellman equation, with both the policy and the value function approximated by deep neural networks. Given any arbitrary initial policy, the proposed DGPI algorithm eventually converges to an admissible, and subsequently an optimal, policy for an arbitrary nonlinear system. We also relax the update termination conditions of both the policy evaluation and improvement processes, which leads to faster convergence than conventional Policy Iteration (PI) methods for the same architecture of function approximators. We further prove the convergence and optimality of the algorithm with a Lyapunov analysis, and demonstrate its generality and efficacy using two detailed numerical examples.
Dynamic Programming (DP) offers a theoretical and systematic way to solve Continuous Time (CT) infinite horizon optimal control problems with known dynamics for unconstrained linear systems, by employing the principle of Bellman optimality via the solution of the underlying Hamilton-Jacobi-Bellman (HJB) equation [bertsekas2005DP&OC]. This yields the celebrated Linear Quadratic Regulator, where the optimal control policy is an affine state feedback [pappas1980numerical]. However, if the system is subject to operating constraints, or is modeled by nonlinear dynamics, solving an infinite horizon optimal control problem analytically is a challenging task. It is difficult to obtain an analytical solution of the HJB equation (typically a nonlinear partial differential equation [lewis2012OptimalControl]) by applying traditional DP, because the computation grows exponentially with the dimensionality of the system, a phenomenon known as the curse of dimensionality [wang2009adaptive].
To find a suboptimal approximation to the optimal control policy for nonlinear dynamics, Werbos defined a family of actor-critic algorithms, which he termed Adaptive Dynamic Programming (ADP) algorithms [werbos1974phD, werbos1992approximate]. Another well-known name for this kind of algorithm, especially in the field of machine learning, is Reinforcement Learning (RL) [sutton2018reinforcement, liu2017ADP]. A distinct feature of the ADP method is that it employs a parameterized critic function, such as a Neural Network (NN), for value function approximation and a parameterized actor function for policy approximation. To find suitable approximations of both the value function and the policy, most ADP methods adopt an iterative technique called Policy Iteration (PI) [howard1964PI]. PI refers to a class of algorithms built as a two-step iteration: 1) policy evaluation, in which the value function associated with an admissible control policy is evaluated, and 2) policy improvement, in which the policy is updated to optimize the corresponding value function, using Bellman’s principle of optimality.
Over the last few decades, numerous ADP (or RL) methods and the inherent analyses have appeared in literature for controlling autonomous systems [doya2000reinforcement, abbeel2004apprenticeship, peters2008natural, levine2013guided, silver2014deterministic, duan2016benchmarking, recht2019tour, schulman2015TRPO], including for CT systems. Some of these algorithms for CT systems are also called approximate DP or neuro-DP [powell2007approximate, al2008discrete, bertsekas1995neuroDP, kamalapurkar2016model]. Abu-Khalaf and Lewis (2005) proposed an ADP algorithm to find nearly optimal constrained control state feedback laws for general nonlinear systems by introducing a non-quadratic cost function [lewis2005definition]. The value function represented by a linear combination of artificially designed basis functions is trained by least-squares method at the policy evaluation step, while the policy is directly derived from the value function. Utilizing the same single approximator scheme, Dierks and Jagannathan (2010) derived a novel online parameter tuning law that not only ensures the optimal value function and policy are achieved, but also ensures the system states remain bounded during the online learning process [dierks2010admissible]. Vamvoudakis and Lewis (2012) proposed a synchronous PI algorithm implemented as actor-critic architecture for nonlinear CT systems without control constraints. Both the value function and policy are approximated by linear methods and tuned simultaneously online [Vamvoudakis2010OnlineAC]. Furthermore, Vamvoudakis (2014) presented an event-triggered ADP algorithm that reduces the computation cost by updating the policy only when an event-triggered condition was violated [vamvoudakis2014linearmethod]. Dong et al. (2017) extended this idea to nonlinear systems with saturated actuators [dong2017eventtrigger]. 
In addition, the ADP method has also been widely applied in the optimal control of incompletely known dynamic systems [modares2013adaptive, yang2014admissible, vrabie2009admissiblerequire, vrabie2009adaptive, jiang2015global] and multi-agent systems [Vamvoudakis2012MultiADP, li2017off].
It should be pointed out that most existing ADP techniques for CT systems are valid on the basis of one or both of the following two assumptions:
A1: Admissibility of Initial Policy: The infinite horizon value function can be evaluated only in the case of stabilizing control policies. Hence, the initial policy must be “admissible”, that is, it has to stabilize the system (detailed in Definition 1). However, in practical situations, especially for complex systems, it is often difficult to obtain an admissible policy.
A2: Input-Affine Nature of System: Most ADP methods are restricted to input-affine systems, because they require the optimal policy to be represented directly in terms of the value function. This means that the minimizer of the Hamiltonian must be obtainable analytically once the value function is given. For non input-affine systems, solving for the optimal policy in this way is often intractable.
In this paper, we propose a Deep Generalized Policy Iteration (DGPI) algorithm with proof of convergence and optimality, for solving optimal control problems of general nonlinear CT systems with known dynamics to overcome the limitation of the above two central assumptions. Both the actor and critic are approximated by deep NNs which build a map from the system states to action and value function respectively. Our main contributions can be summarized as follows:
Given any arbitrary initial policy, the proposed DGPI algorithm is proven to converge to an admissible policy by continuously minimizing the square of the Hamiltonian. This relaxes the requirement of A1.
We prove faster convergence speeds of DGPI than corresponding PI methods, due to novel update termination conditions of both the policy evaluation and improvement processes. The policy network is updated by directly minimizing the associated Hamiltonian, and the tuning rules are generalized to arbitrary nonlinear, non input-affine dynamics. This relaxes the requirement of A2.
The paper is organized as follows. In Section II, we provide the formulation of the optimal control problem, followed by a general description of the PI and DPI algorithms. In Section III, we describe the DGPI algorithm and analyze its convergence and optimality. In Section IV, we present simulation examples that show the generality and effectiveness of the DGPI algorithm for CT systems. Section V concludes this paper.
II Mathematical Preliminaries
II-A HJB Equation of the Continuous-time Optimal Control Problem
Consider the general time-invariant dynamical system given by
with state , control input , and . We assume that is Lipschitz continuous on a compact set that contains the origin, and that the system is stabilizable on , i.e., there exists a continuous policy , where , such that the system is asymptotically stable on . The system dynamics are assumed to be known; they can be given as nonlinear and non input-affine analytic functions, Neural Networks (NNs), or even a MATLAB/Simulink model (only if is available). Moreover, the system input can be either constrained or unconstrained. Given the policy , define its associated infinite horizon value function
Therefore, given a policy , the value function in (2) associated with the system (1) can be found by solving the Lyapunov equation. The optimal control problem for the CT system can now be formulated as finding a policy such that the value function associated with the system in (1) is minimized. The minimized, or optimal, value function defined by
satisfies the Hamilton-Jacobi-Bellman (HJB) equation [lewis2012OptimalControl]
Meanwhile, the optimal control for every state can be derived as
Inserting this optimal control policy and optimal value function in the Lyapunov equation, we obtain the formulation of the HJB equation in terms of and [lewis2012OptimalControl]
Existence and uniqueness of the value function have been shown in [lyashevskiy1996unique]. In order to find the optimal policy for CT systems, one only needs to solve the HJB equation (5) for the value function and then substitute the solution into (6) to obtain the optimal control. However, due to the nonlinear nature of the HJB equation, finding its solution is generally difficult or impossible.
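For reference, the objects named in this subsection can be written in the standard textbook form sketched below. The symbols $f$, $l$, $\pi$, $V^{\pi}$, $V^{*}$ and $H$ are notation assumed for this sketch, and the displays are left unnumbered so that they are not confused with the numbered equations cited in the text.

```latex
\begin{align*}
\dot{x}(t) &= f\big(x(t), u(t)\big)
  && \text{(system dynamics)} \\
V^{\pi}\big(x(t)\big) &= \int_{t}^{\infty} l\big(x(\tau), \pi(x(\tau))\big)\,\mathrm{d}\tau
  && \text{(value function of policy } \pi\text{)} \\
H\Big(x, u, \tfrac{\partial V}{\partial x}\Big) &= l(x, u) + \frac{\partial V(x)}{\partial x^{\top}}\, f(x, u)
  && \text{(Hamiltonian)} \\
H\Big(x, \pi(x), \tfrac{\partial V^{\pi}}{\partial x}\Big) &= 0, \quad V^{\pi}(0) = 0
  && \text{(Lyapunov equation for a fixed policy)} \\
\min_{u} H\Big(x, u, \tfrac{\partial V^{*}}{\partial x}\Big) &= 0, \quad V^{*}(0) = 0
  && \text{(HJB equation)} \\
\pi^{*}(x) &= \arg\min_{u} H\Big(x, u, \tfrac{\partial V^{*}}{\partial x}\Big)
  && \text{(optimal control)}
\end{align*}
```

Here $l$ denotes a positive definite utility function. The Lyapunov equation is linear in $V^{\pi}$ for a fixed policy, whereas the minimization over $u$ is what makes the HJB equation nonlinear.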
II-B Policy Iteration
The algorithm proposed in this paper for CT systems is motivated by Policy Iteration (PI) [sutton2018reinforcement], which we therefore describe in this section. PI is an iterative RL method for finding the optimal policy of CT or discrete-time systems, and it cycles between policy evaluation based on (4) and policy improvement based on (6). The pseudo-code of PI is shown in Algorithm 1.
As Algorithm 1 shows, the first step of PI is to find an initial policy because the associated value function is finite only when the system is asymptotically stable. Algorithm 1 will iteratively converge to the optimal control policy and value function . Proofs of convergence and optimality have been given in [lewis2005definition].
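The two-step cycle can be illustrated on a problem small enough to solve by hand. The sketch below is not Algorithm 1 itself: it assumes a hypothetical scalar plant x' = a*x + b*u with utility q*x^2 + r*u^2, a linear policy u = -k*x and a quadratic value function V(x) = p*x^2, so that policy evaluation reduces to a scalar Lyapunov equation and policy improvement to a closed-form minimization of the Hamiltonian. All names and numerical values are illustrative assumptions.

```python
import math


def policy_evaluation(k, a, b, q, r):
    """Solve the scalar Lyapunov equation 2*(a - b*k)*p + q + r*k**2 = 0 for p."""
    closed_loop = a - b * k
    if closed_loop >= 0:
        # The infinite horizon cost diverges, so the policy is not admissible.
        raise ValueError("policy u = -k*x does not stabilize the system")
    return (q + r * k**2) / (-2.0 * closed_loop)


def policy_improvement(p, b, r):
    """Minimize the Hamiltonian over u for V(x) = p*x**2, giving u = -(b*p/r)*x."""
    return b * p / r


def policy_iteration(k0, a=1.0, b=1.0, q=1.0, r=1.0, iters=20):
    """Alternate full policy evaluation and policy improvement from gain k0."""
    k = k0
    for _ in range(iters):
        p = policy_evaluation(k, a, b, q, r)
        k = policy_improvement(p, b, r)
    return p, k


# k0 = 2 stabilizes x' = x + u; k0 = 0 would raise ValueError instead.
p, k = policy_iteration(k0=2.0)
p_star = 1.0 + math.sqrt(2.0)  # positive root of 2*a*p - b**2*p**2/r + q = 0
```

In this toy setting the iterates approach the Riccati solution `p_star`, while an inadmissible starting gain such as `k0 = 0` makes the very first policy evaluation fail, which is the role the admissibility requirement plays in PI.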
II-C Value Function and Policy Approximation
In previous adaptive dynamic programming (ADP) research for CT systems, the value function and policy are usually approximated by linear methods, which require a large number of manually designed basis functions [jiang2015global]. In recent years, deep NNs have been favored in many fields such as RL and machine learning due to their better generality and higher fitting ability [mnih2015human, lecun2015deep]. In our work, both the value function and policy are approximated by deep NNs, called respectively the value network (or critic network) ( for short) and the policy network (or actor network) ( for short), where and are network parameters. These two networks directly build a map from the raw system states to the approximated value function and control inputs respectively; in this case, no hand-crafted basis function is needed.
Inserting the value and policy network in (3), we obtain the formulation of approximate Hamiltonian in terms of and
We refer to the algorithm combining PI and deep NN approximators as Deep PI (DPI), which involves alternately tuning each of the two networks to find optimal parameters and such that , .
The policy evaluation process of DPI proceeds by tuning the value network by solving (7). Given any policy , it is desired to find parameters to minimize the critic loss function
Note that can be easily guaranteed by selecting a proper activation function for the value network. Based on (8), the policy improvement process is carried out by tuning the policy network to minimize the expectation of the Hamiltonian over the states, which is called the actor loss function here
Many off-the-shelf NN optimization methods can be used to tune these two NNs, such as Stochastic Gradient Descent (SGD), RMSProp, Levenberg-Marquardt or Adam [ruder2016optimizationmethod]. In practice, the value network and policy network usually require multiple updating iterations to make (7) and (8) hold respectively. Therefore, compared with the PI algorithm described above, two inner updating loops are introduced to update the value network and the policy network respectively until convergence. Taking the SGD optimization method as an example, the pseudo-code of DPI is shown in Algorithm 2.
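The critic and actor losses described above can be made concrete with a deliberately small stand-in for the two networks. The sketch below assumes a hypothetical scalar plant x' = A*x + B*u with utility Q*x^2 + R*u^2, a one-parameter "value network" V(x) = w*x^2 and a one-parameter "policy network" u = -theta*x; the deep NNs, the sampling distribution and any scaling constants of the actual losses are not reproduced.

```python
import math
import random

A, B, Q, R = 1.0, 1.0, 1.0, 1.0  # hypothetical scalar plant and cost weights


def hamiltonian(x, w, theta):
    """H = l(x, u) + dV/dx * f(x, u) with V(x) = w*x**2 and u = -theta*x."""
    u = -theta * x
    utility = Q * x**2 + R * u**2
    dV_dx = 2.0 * w * x
    return utility + dV_dx * (A * x + B * u)


def critic_loss(states, w, theta):
    """Mean squared Hamiltonian: zero when V solves the Lyapunov equation."""
    return sum(hamiltonian(x, w, theta) ** 2 for x in states) / len(states)


def actor_loss(states, w, theta):
    """Mean Hamiltonian: lowering it over theta improves the policy for fixed V."""
    return sum(hamiltonian(x, w, theta) for x in states) / len(states)


random.seed(0)
states = [random.uniform(-1.0, 1.0) for _ in range(256)]  # sampled training states
w_star = theta_star = 1.0 + math.sqrt(2.0)  # optimum of this scalar problem
```

At `(w_star, theta_star)` the critic loss vanishes up to rounding, and for a fixed `w` the actor loss is smallest at `theta = B * w / R`, which mirrors the policy evaluation and policy improvement targets in this simplified setting.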
III Deep Generalized Policy Iteration Algorithm
Algorithm 2 proceeds by alternately updating the value and policy network by minimizing (9) and (10) respectively. Note that while one NN is being tuned, the other is held constant. Besides, each NN usually requires multiple updating iterations to satisfy the terminal conditions, which is the so-called protracted iterative computation problem [sutton2018reinforcement]. This problem usually leads to the admissibility requirement because the initial policy network needs to satisfy to have a finite and converged value function . Many previous studies used a trial-and-error process to obtain the range of the initial weights for the policy network that keeps the system stable [liu2017ADP, vamvoudakis2014linearmethod]. However, this method usually takes a lot of time, especially for complex systems. On the other hand, the protracted problem also often results in slower learning [sutton2018reinforcement].
III-A Description of the DGPI Algorithm
Inspired by the generalized PI framework, which is typically utilized in discrete-time dynamic RL problems [sutton2018reinforcement], we present the Deep Generalized PI (DGPI) algorithm for CT systems to relax the requirement A1 (from the Introduction) and improve the learning speed by truncating the inner loops (relaxing the requirement A2) of Algorithm 2 without losing the convergence guarantees. The pseudo-code of the DGPI algorithm is shown in Algorithm 3.
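The effect of truncating the inner loops can be sketched on a toy problem. The code below is not Algorithm 3: its termination conditions and the way it handles inadmissible initial policies are not reproduced. It only illustrates the generalized PI idea of taking a single gradient step on the critic and a single gradient step on the actor per iteration, assuming a hypothetical scalar plant x' = A*x + B*u, V(x) = w*x^2, u = -theta*x, illustrative step sizes `alpha` and `beta`, and a stabilizing starting gain.

```python
import math

A, B, Q, R = 1.0, 1.0, 1.0, 1.0  # hypothetical scalar plant and cost weights


def h(w, theta):
    """Hamiltonian coefficient: H(x) = h(w, theta) * x**2 for V = w*x**2, u = -theta*x."""
    return Q + R * theta**2 + 2.0 * w * (A - B * theta)


def generalized_pi(w, theta, alpha=0.05, beta=0.2, iters=2000):
    """Alternate ONE critic step and ONE actor step per iteration."""
    for _ in range(iters):
        # Critic: one gradient step on 0.5 * h**2 with respect to w.
        w -= alpha * h(w, theta) * 2.0 * (A - B * theta)
        # Actor: one gradient step on h with respect to theta.
        theta -= beta * (2.0 * R * theta - 2.0 * B * w)
    return w, theta


# theta = 2 stabilizes x' = x + u; this sketch assumes such a starting gain.
w, theta = generalized_pi(w=1.0, theta=2.0)
optimum = 1.0 + math.sqrt(2.0)  # Riccati solution of this scalar problem
```

Neither network is trained to convergence before the other is updated, yet both parameters drift toward `optimum` together, which is the behavior that truncated inner loops are meant to exploit.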
III-B Convergence and Optimality Analysis
The solution to (7) may not be smooth for general nonlinear non input-affine systems. However, in keeping with other work in the literature [Vamvoudakis2010OnlineAC] we make the following assumption.
Assumption 1: The solution to (7) is smooth if , i.e. [lewis2005definition, Vamvoudakis2010OnlineAC].
In recent years, many experimental results and theoretical proofs have shown simple optimization algorithms such as SGD can find global minima on the training objective of deep NNs in polynomial time if the network is over-parameterized (i.e., the number of hidden neurons is sufficiently large) [allen2018convergence, du2018gradient]. Based on this fact, our second assumption is:
Next, the convergence property of Algorithm 3 will be established. As the iteration index tends to infinity, we will show that the optimal value function and optimal policy can be achieved using Algorithm 3. Before the main theorem, some lemmas are necessary at this point.
Lemma 1 (Universal Approximation Theorem). For any continuous function on a compact set , there exists a feed-forward NN, having only a single hidden layer, which uniformly approximates and its gradient to within arbitrarily small error on [Hornik1990Universal].
The following lemma shows how Algorithm 3 can be used to obtain a policy given any initial policy .
Lemma 3. Consider the CT dynamic optimal control problem for (1) and (2). The value function and policy are represented by over-parameterized NNs. The parameters and are initialized randomly, i.e., the initial policy can be inadmissible. These two NNs are updated with Algorithm 3. Let Assumptions 1 and 2 hold, and suppose all the hyper-parameters (such as , and ) and the NN optimization method are properly selected. The NN approximation errors are ignored according to Lemma 1. Suppose all the activation functions and biases of the value network are set to and , and the output layer activation function satisfies . We have that: , if , then for systems (1) on .
which implies that the global minima of loss function is equal to 0, corresponding to the Hamiltonian vanishing for all states . From Lemma 1, utilizing the fact that global minima of can be obtained, one has
Take the time derivative of to obtain
As the utility function is positive definite, it follows
Since , and , we have
From (17) and (18), we infer that the is positive definite. Then, according to (16), is a Lyapunov function for closed loop dynamics obtained from (1) when policy is used. Therefore, the policy for the system in (1) on [Lyapunov1993stability], that is, it is a stabilizing admissible policy.
Again, from Lemma 1, utilizing the fact that global minima of can be obtained, we get
This implies that Hamiltonian can be taken to global minimum, for any value of , by minimizing over . Then, we can also find through (12), such that
This implies that like the case with , is also a Lyapunov function. So, . Extending this for all subsequent time steps, is a Lyapunov function for all , and it is obvious that
This proves Lemma 3. We have thus proven that starting from any arbitrary initial policy, the DGPI algorithm in Algorithm 3 converges to an admissible policy. As claimed previously, this relaxes the requirement A1, which is typical to most other ADP algorithms.
We now present our main result. It is shown in the following theorem that the value function and policy converge to optimum uniformly by applying DGPI Algorithm 3.
Definition 2 (Uniform Convergence). A sequence of functions converges uniformly to on a set if , .
For arbitrary and , if these two NNs are updated with Algorithm 3, , uniformly on as goes to .
From Newton-Leibniz formula,
As such, is pointwise convergent as . We can write . Because is compact, then uniform convergence follows immediately from Dini’s theorem [bartle2011Dinistheorem].
From Definition 2, given arbitrarily small , , such that
Since , we have
So, it is true that
Therefore, and are the solution of the Lyapunov equation (4), and it follows that
Policy for , therefore the state trajectories generated by it are unique due to the local Lipschitz continuity assumption on the dynamics [lewis2005definition]. Since converges uniformly to , this implies that the system trajectories converge for all . Therefore, also converges uniformly to on . From (24), it is also obvious that
According to (24), (25) and Lemma 2, it follows that and . Therefore, we can conclude that and uniformly on as i goes to . Thus we have proven that the DGPI Algorithm 3 converges uniformly, to , to the optimal policy . As claimed previously, this also relaxes the requirement A2.
Since the state is continuous, it is usually intractable to check the value of every . Therefore, in practical applications, we usually use the expected value of to judge whether each termination condition in Algorithm 3 is satisfied. So, the DGPI Algorithm 3 can also be formulated as Algorithm 4. Fig. 1 shows the frameworks of DPI Algorithm 2 and DGPI Algorithm 4.
In previous analysis, the is limited to a positive definite function, i.e., the equilibrium state (denoted by ) of the system must be . If we take as the input of value network , the DGPI Algorithm 4 can be extended to problems with non-zero , where only when . The corresponding convergence and optimality analysis is similar to the problems of .
According to Lemma 3, all activation functions and biases of are set to and to ensure . To remove these restrictions for value networks, we propose another effective method that drives the to gradually approach 0 by adding an equilibrium term to the critic loss function (9)
where is the hyper-parameter that trades off the importance of the Hamiltonian term and equilibrium term.
To support the proposed DGPI Algorithm 4, we offer two simulation examples, one with a linear system and the other with a nonlinear non input-affine system. We apply Algorithm 4 and Algorithm 2 to solve for the optimal policy and value function of these two systems. The simulation results show that our algorithm performs better than Algorithm 2 in both cases.
IV-A Example 1: Linear Time Invariant System
IV-A1 Problem Description
Consider the CT aircraft plant control problem used in [stevens2015aircraft, Vamvoudakis2010OnlineAC, vamvoudakis2014linearmethod], which can be formulated as
where and are identity matrices of appropriate dimensions. In this linear case, the optimal analytic strategy and optimal value function can be easily found by solving the algebraic Riccati equation, where
IV-A2 Algorithm Details
This system is special in that, if the parameters of the policy network are randomly initialized around 0, which is a very common initialization method, then the initialized policy . Therefore, to compare the learning speed of Algorithm 2 and Algorithm 4, both algorithms are implemented to find the optimal policy and value function. The value function and policy are represented by 3-layer fully-connected NNs, which have the same architecture except for the output layers. For each network, the input layer is composed of the states, followed by 2 hidden layers using exponential linear units (ELUs) as activation functions with units per layer. The outputs of the value and policy network are and , using a softplus unit and a linear unit as activation functions respectively. The training set consists of states which are randomly selected from the compact set at each iteration. The learning rates and are both set to and the Adam update rule is used to minimize the loss functions.
IV-A3 Result Analysis
Each algorithm was run 20 times and the mean and 95% confidence interval of the training performance are shown in Fig. 2. We plot the policy error and value error of Algorithm 2 and Algorithm 4 at each iteration, which are solved by
where is the test set which contains states randomly selected from the compact set at the beginning of each simulation. We also draw violin plots at different iterations to show the precision distribution and quartiles. Note that one iteration in Fig. 2 corresponds to one NN update.
It is clear from Fig. 2 that both algorithms reduce the value and policy network approximation errors ( and ) as the iterations proceed. After iterations, both errors of Algorithm 4 are less than . This indicates that Algorithm 4 can drive the value function and policy to optimality. In addition, the t-test results in Fig. 2 show that both and of Algorithm 4 are significantly smaller than those of Algorithm 2 () under the same number of iterations. From the perspective of convergence speed, Algorithm 4 requires only about iterations to make both approximation errors less than 0.03, while Algorithm 2 requires about steps. Based on this, Algorithm 4 is about 10 times faster than Algorithm 2. To summarize, Algorithm 4 can converge to the optimal value function and policy, and its convergence speed is significantly higher than that of Algorithm 2.
IV-B Example 2: Nonlinear and Non Input-Affine System
IV-B1 Problem Description
Consider the vehicle trajectory tracking control problem with a non input-affine nonlinear vehicle system derived as in [kong2015kinematic, li2017sharecontrol]. The desired velocity is 12 m/s and the desired vehicle trajectory is shown in Fig. 4. The system states and control inputs of this problem are listed in Table I, and the vehicle parameters are listed in Table II. The vehicle is controlled by a saturating actuator, where and . The dynamics of the vehicle, along with detailed state and input descriptions, are given as
where and are the lateral tire forces of the front and rear tires respectively. The lateral tire forces are usually approximated according to the Fiala tire model:
where is the tire slip angle, is the tire load, is the lateral friction coefficient, and the subscript represents the front or rear tires. The slip angles can be calculated from the geometric relationship between the front/rear axle and the center of gravity (CG):
Assuming that the rolling resistance is negligible, the lateral friction coefficient of the front/rear wheel is:
where and are the longitudinal tire forces of the front and rear tires respectively, calculated as
The loads on the front and rear tires can be approximated by
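The Fiala lateral force mentioned above can be sketched in its commonly used textbook form. The function below is an illustration under assumptions: the sign convention (force opposing the slip angle), the argument names and the example tire load are hypothetical and may differ from the exact expressions used for this vehicle model.

```python
import math


def fiala_lateral_force(alpha, f_z, mu, c_alpha):
    """Lateral tire force [N] from the standard Fiala brush model.

    alpha: slip angle [rad], f_z: tire load [N], mu: lateral friction
    coefficient, c_alpha: cornering stiffness [N/rad].
    """
    t = math.tan(alpha)
    t_max = 3.0 * mu * f_z / c_alpha  # tan of the slip angle where full sliding starts
    if abs(t) >= t_max:
        # Fully sliding tire: the force saturates at the friction limit.
        return -mu * f_z * math.copysign(1.0, alpha)
    return (-c_alpha * t
            + c_alpha**2 / (3.0 * mu * f_z) * abs(t) * t
            - c_alpha**3 / (27.0 * mu**2 * f_z**2) * t**3)


# Front tire with the cornering stiffness from Table II; the 8000 N load is a
# hypothetical value chosen only for illustration.
force = fiala_lateral_force(alpha=0.02, f_z=8000.0, mu=1.0, c_alpha=88000.0)
```

The cubic branch meets the saturated branch continuously at the sliding threshold, so the force never exceeds `mu * f_z` in magnitude. This saturation is one reason the vehicle dynamics are nonlinear in the steering input.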
The control objective is to minimize the output tracking errors. Hence, the optimal control problem is given by
Table I: State and control input
| Type | Description | Unit |
|---|---|---|
| state | Yaw rate at center of gravity (CG) | [rad/s] |
| state | Yaw angle between vehicle & trajectory | [rad] |
| state | Distance between CG & trajectory | [m] |
| input | Front wheel angle | [rad] |

Table II: Vehicle parameters
| Parameter | Value |
|---|---|
| Front wheel cornering stiffness | 88000 [N/rad] |
| Rear wheel cornering stiffness | 94000 [N/rad] |
| Distance from CG to front axle | 1.14 [m] |
| Distance from CG to rear axle | 1.40 [m] |
| Polar moment of inertia at CG | 2420 [kg·m²] |
| Tire-road friction coefficient | 1.0 |
IV-B2 Algorithm Details
We use 6-layer fully-connected NNs to approximate and , and the state input layer of each NN is followed by 5 fully-connected hidden layers, units per layer. The selection of activation functions is similar to that of Example 1, except that the output layer of the policy network is set as a layer with two units, multiplied by the vector to handle bounded controls. Inspired by the ideas used in multi-threaded variants of Deep RL, the training set consists of the current states of parallel independent vehicles with different initial states, thereby obtaining a more realistic state distribution [mnih2016A3C]. We use the Adam method to update the two NNs, and the learning rates of the value network and policy network are set to and respectively. Besides, we use to trade off the Hamiltonian term and equilibrium term of the critic loss function (Remark 3).
IV-B3 Result Analysis
Fig. 3 shows the evolution of the average absolute Hamiltonian of random states and the training performance of 20 different runs. The shaded area represents the 95% confidence interval. The policy performance at each iteration is measured by the accumulated cost over a 20 s horizon
where initial state is randomly selected for each run. Since the initial policy is not admissible, that is, , Algorithm 2 can never make close to 0, hence the terminal condition of policy evaluation can never be satisfied. Therefore, the finite horizon cost has no change during the entire learning process, i.e., Algorithm 2 can never converge to an admissible policy if .
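A finite horizon performance measure of this kind can be approximated by rolling the closed-loop system forward and accumulating the utility. The sketch below uses forward-Euler integration on a hypothetical scalar plant x' = x + u with utility x^2 + u^2; the step size, the plant and the function names are assumptions made for illustration and are unrelated to the vehicle model.

```python
import math


def finite_horizon_cost(policy, dynamics, utility, x0, horizon=20.0, dt=1e-3):
    """Accumulate utility along a forward-Euler rollout of the closed-loop system."""
    x, cost = x0, 0.0
    for _ in range(int(round(horizon / dt))):
        u = policy(x)
        cost += utility(x, u) * dt
        x += dynamics(x, u) * dt
    return cost


def plant(x, u):
    """Hypothetical open-loop unstable scalar plant."""
    return x + u


def quadratic_utility(x, u):
    """Hypothetical quadratic utility."""
    return x**2 + u**2


k_star = 1.0 + math.sqrt(2.0)  # optimal gain of this scalar LQ problem
cost_optimal = finite_horizon_cost(lambda x: -k_star * x, plant, quadratic_utility, x0=1.0)
cost_unstable = finite_horizon_cost(lambda x: 0.0, plant, quadratic_utility, x0=1.0)
```

For the stabilizing gain the accumulated cost stays close to the infinite horizon optimum `k_star * x0**2`, whereas the non-stabilizing policy `u = 0` produces a cost that grows by many orders of magnitude over the same horizon. This is why such a measure separates admissible from inadmissible policies so clearly.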
On the other hand, of Algorithm 4 can gradually converge to 0, while the finite horizon cost is also reduced to a small value during the learning process. Fig. 4 shows the state trajectory controlled by one of the trained DGPI policies. The learned policy can make the vehicle reach the equilibrium state very quickly, which takes less than 0.5s for the case in Fig. 4. The results of Example 2 show that Algorithm 4 can solve the CT dynamic optimal control problem for general non input-affine nonlinear CT systems with saturated actuators and handle inadmissible initial policies.
In conclusion, these two examples demonstrate that the proposed DGPI algorithm can converge to the optimal policy and value function for general nonlinear and non input-affine CT systems without reliance on initial admissible policy. In addition, if the initial policy , the learning speed of Algorithm 4 is also faster than that of Algorithm 2.
The paper presented the Deep Generalized Policy Iteration (DGPI) Algorithm 4, along with proof of convergence and optimality, for solving optimal control problems of general nonlinear CT systems with known dynamics. The proposed algorithm can circumvent the requirements of “admissibility” and input-affine system dynamics (described in A1 and A2 of Introduction), quintessential to previously proposed counterpart algorithms. As a result, given any arbitrary initial policy, the DGPI algorithm is shown to eventually converge to an admissible and optimal policy, even for general nonlinear non input-affine system dynamics. The convergence and optimality were mathematically proven by using detailed Lyapunov analysis. We further demonstrated the efficacy and theoretical accuracy of our algorithm via two numerical examples, which yielded faster learning speed of the optimal policy starting from an admissible initialization, as compared to conventional Deep Policy Iteration (DPI) algorithm (Algorithm 2).
We would like to acknowledge Prof. Francesco Borrelli, Ms. Ziyu Lin, Dr. Yiwen Liao, Dr. Xiaojing Zhang and Ms. Jiatong Xu for their valuable suggestions throughout this research.