# HJB Optimal Feedback Control with Deep Differential Value Functions and Action Constraints

###### Abstract

Learning optimal feedback control laws capable of executing optimal trajectories is essential for many robotic applications. Such policies can be learned using reinforcement learning or planned using optimal control. While reinforcement learning is sample inefficient, optimal control only plans an optimal trajectory from a specific starting configuration. In this paper we propose deep optimal feedback control to learn an optimal feedback policy rather than a single trajectory. By exploiting the inherent structure of the robot dynamics and a strictly convex action cost, we can derive principled cost functions such that the optimal policy naturally obeys the action limits and is globally optimal and stable on the training domain given the optimal value function. The corresponding optimal value function is learned end-to-end by embedding a deep differential network in the Hamilton-Jacobi-Bellman differential equation and minimizing the error of this equality while simultaneously decreasing the discounting from short- to far-sighted to enable the learning. Our proposed approach enables us to learn an optimal feedback control law in continuous time that, in contrast to existing approaches, generates an optimal trajectory from any point in state-space without the need for replanning. The resulting approach is evaluated on non-linear systems and achieves optimal feedback control, where standard optimal control methods require frequent replanning.


## 1 Introduction

Specifying a task through a reward function and letting an agent autonomously discover a corresponding controller promises to simplify the programming of complex robotic behaviors by reducing the amount of manual engineering required. Previous research demonstrated that such an approach can successfully generate robot controllers capable of performing dexterous manipulation [mordatch2012contact, andrychowicz2018learning, toussaint2018differentiable] and locomotion [mordatch2012discovery, heess2017emergence]. These controllers were obtained via reinforcement learning or trajectory optimization. While reinforcement learning optimizes a possibly non-linear policy under the assumption of unknown rewards, dynamics and actuation limits, trajectory optimization plans a sequence of actions and states using a known model, reward function, initial state and actuator limits. When applied to the physical system, the planned trajectories must be augmented with a hand-tuned tracking controller to compensate for modeling errors.

To obtain a globally optimal feedback policy that naturally obeys the actuator limits without randomly sampling actions on the system as in reinforcement learning, we propose to incorporate the actuator limits within the cost function. The corresponding optimal feedback controller is obtained by embedding a deep differential network in the Hamilton-Jacobi-Bellman (HJB) differential equation and learning the network weights using a curricular learning scheme. The curricular learning scheme adapts the discounting from short- to far-sighted to ensure learning of the optimal policy despite the multiple spurious solutions of the HJB. Assuming the inherent structure of most robotic tasks, i.e., control-affine dynamics due to holonomicity of mechanical systems, and perfect approximation of the value function, the learned policy is globally optimal on the state domain $\Omega$, guaranteed to be stable and does not require any replanning or hand-tuning of the feedback gains. Incorporating the actuation limits within the cost function transforms the constrained optimization problem into an unconstrained problem, thereby enabling the learning of feasible feedback policies using gradient descent.

Our technical contributions are the following. First, the derivation of strictly convex cost functions that make common classes of controllers, including torque-limited policies, optimal. Second, the derivation of deep optimal control and the corresponding curricular learning scheme, enabling robust learning of the viscosity solution of the HJB, which implies closed-loop policies that are guaranteed to be optimal and stable on the domain. Furthermore, we provide intuitive explanations for the convergence of the proposed curricular learning scheme, which is simpler than previous approaches that require significantly more manual engineering.

In the following, Section 2 introduces the HJB differential equation using the inherent structure of robotic problems. Afterwards, the principled cost functions that enable torque-limited policies to be optimal are derived in Section 3. HJB optimal feedback control and the corresponding curriculum learning scheme are derived in Section 4. Finally, our proposed algorithm is applied to non-linear systems and compared to trajectory optimization in Section 5, and the differences to existing literature are highlighted in Section 6.

## 2 The Hamilton-Jacobi-Bellman Differential Equation

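The optimal policy $\pi^*$ solves a task by choosing the optimal actions $\mathbf{u}$ given the current state $\mathbf{x}$, such that the total accumulated cost is minimized. Therefore, the optimal policy is described by

$$\pi^*(\mathbf{x}_0) = \arg\min_{\mathbf{u}} \int_0^T e^{-\rho t}\, c(\mathbf{x}_t, \mathbf{u}_t)\, dt \quad \text{s.t.} \quad \dot{\mathbf{x}} = f(\mathbf{x}, \mathbf{u}), \quad \mathbf{x} \in \Omega, \quad -\mathbf{u}_{\max} \le \mathbf{u} \le \mathbf{u}_{\max} \tag{1}$$

with the action $\mathbf{u}$, the cost function $c$, the dynamics $f$, the discounting factor $\rho$, the state domain $\Omega$, the positive actuator limit $\mathbf{u}_{\max}$ and the time horizon $T$. Let the optimal value function $V^*$ be defined as the minimum accumulated cost-to-go, i.e.,

$$V^*(\mathbf{x}(t), t) = \min_{\mathbf{u}} \int_t^T e^{-\rho (\tau - t)}\, c(\mathbf{x}_\tau, \mathbf{u}_\tau)\, d\tau \tag{2}$$

For the infinite time horizon, i.e., $T = \infty$, the optimal value function is independent of time [doya2000reinforcement]. The Hamilton-Jacobi-Bellman (HJB) differential equation, the continuous counterpart of the discrete Bellman equation, can be derived by substituting the value function at the time $t' = t + \Delta t$ into Equation 2, approximating $V^*(\mathbf{x}(t'))$ with its first-order Taylor expansion and taking the limit $\Delta t \to 0$ [doya2000reinforcement]. Therefore, the HJB differential equation is described by

$$\rho V^*(\mathbf{x}) = \min_{\mathbf{u}} \left[ c(\mathbf{x}, \mathbf{u}) + f(\mathbf{x}, \mathbf{u})^\top \frac{\partial V^*}{\partial \mathbf{x}} \right] \tag{3}$$

In the following, we will refer to $\partial V^* / \partial \mathbf{x}$ as $\nabla_{\mathbf{x}} V^*$.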

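The HJB equation has multiple solutions and only incorporating the boundary condition

$$f(\mathbf{x}, \mathbf{u}^*)^\top \mathbf{n}(\mathbf{x}) \le 0 \quad \forall\, \mathbf{x} \in \partial\Omega \tag{4}$$

with the outward pointing normal vector $\mathbf{n}$ defined on the state domain boundary $\partial\Omega$ makes the solution unique [fleming2006controlled]. This boundary condition implies that the optimal action always prevents the system from leaving the state domain. Within the LQR setting, this boundary condition implies the positive-definiteness of the quadratic value function. Using the control-affine dynamics of mechanical systems with holonomic constraints, i.e., $\dot{\mathbf{x}} = a(\mathbf{x}) + B(\mathbf{x})\mathbf{u}$, Equation 3 and Equation 4 can be further simplified to

$$\rho V^*(\mathbf{x}) = \min_{\mathbf{u}} \left[ c(\mathbf{x}, \mathbf{u}) + \left(a(\mathbf{x}) + B(\mathbf{x})\mathbf{u}\right)^\top \nabla_{\mathbf{x}} V^* \right], \qquad \left(a(\mathbf{x}) + B(\mathbf{x})\mathbf{u}^*\right)^\top \mathbf{n}(\mathbf{x}) \le 0 \quad \forall\, \mathbf{x} \in \partial\Omega \tag{5}$$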

## 3 Incorporating Action Constraints within the Cost Function

Assuming a separable cost, consisting of the terms for the state and action, we show that the optimization problem on the right-hand side of the HJB in Equation 5 can be performed in closed form if the action cost is strictly convex. Moreover, action constraints are naturally accommodated if the action cost has a barrier shape. Thereby, the constrained optimization in Equation 1 is transformed into an unconstrained problem, with the optimal policy inherently obeying the action limits.

### 3.1 Strictly Convex Action Cost

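The only requirement needed to get rid of the nested optimization in Equation 5 is that the cost splits into the sum of terms $c(\mathbf{x}, \mathbf{u}) = q(\mathbf{x}) + g(\mathbf{u})$, where $q$ can be chosen depending on the task and $g$ must be a strictly convex function. The strict convexity of $g$ generalizes the positive definiteness assumption on the action cost required by LQR. Under these assumptions, the HJB equation becomes

$$\rho V^*(\mathbf{x}) = q(\mathbf{x}) + a(\mathbf{x})^\top \nabla_{\mathbf{x}} V^* + \min_{\mathbf{u}} \left[ g(\mathbf{u}) + \mathbf{u}^\top B(\mathbf{x})^\top \nabla_{\mathbf{x}} V^* \right] \tag{6}$$

The optimal action $\mathbf{u}^*$ and the optimal policy $\pi^*$ can be computed in closed form by employing the convex conjugate function $g^*$ and exploiting its defining property $\nabla g^* = (\nabla g)^{-1}$,

$$\mathbf{u}^* = \pi^*(\mathbf{x}) = \nabla g^*\!\left(-B(\mathbf{x})^\top \nabla_{\mathbf{x}} V^*\right) \tag{7}$$

The strict convexity of $g$ assures that Equation 7 provides a unique global minimum [boyd2004convex]. Importantly, Equation 7 describes the optimal policy in closed form and therefore no learning of the policy is required. The value function is also a Lyapunov function and hence, the policy is stable [liberzon2011calculus]. Substituting the optimal action into the HJB and using the Fenchel-Young identity $g(\mathbf{u}^*) + g^*(\mathbf{w}) = \mathbf{w}^\top \mathbf{u}^*$ with $\mathbf{w} = -B(\mathbf{x})^\top \nabla_{\mathbf{x}} V^*$, we arrive at the final form of the HJB equation

$$\rho V^*(\mathbf{x}) = q(\mathbf{x}) + a(\mathbf{x})^\top \nabla_{\mathbf{x}} V^* - g^*\!\left(-B(\mathbf{x})^\top \nabla_{\mathbf{x}} V^*\right) \tag{8}$$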

Notably, the HJB equation in the form of Equation 8 is a straightforward equality and does not contain a nested optimization problem in contrast to the original problem in Equation 5. Therefore, one only needs to find the optimal value function and then the optimal feedback controller is directly given.
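As a worked instance of this closed-form solution, consider the quadratic action cost familiar from LQR. The notation here is an assumption made for illustration: $R \succ 0$ is an action cost matrix, $a(\mathbf{x})$ and $B(\mathbf{x})$ denote the drift and control matrix of the control-affine dynamics, and $\nabla_{\mathbf{x}} V^*$ is the gradient of the value function. Under these assumptions, the conjugate, the policy and the resulting HJB equality read

$$
\begin{aligned}
g(\mathbf{u}) &= \tfrac{1}{2}\, \mathbf{u}^\top R\, \mathbf{u}, &
\nabla g(\mathbf{u}) &= R\, \mathbf{u}, \\
g^*(\mathbf{w}) &= \tfrac{1}{2}\, \mathbf{w}^\top R^{-1} \mathbf{w}, &
\nabla g^*(\mathbf{w}) &= R^{-1} \mathbf{w}, \\
\mathbf{u}^* &= -R^{-1} B(\mathbf{x})^\top \nabla_{\mathbf{x}} V^*, \\
\rho V^* &= q(\mathbf{x}) + a(\mathbf{x})^\top \nabla_{\mathbf{x}} V^* - \tfrac{1}{2}\, \nabla_{\mathbf{x}} V^{*\top} B(\mathbf{x}) R^{-1} B(\mathbf{x})^\top \nabla_{\mathbf{x}} V^*.
\end{aligned}
$$

For linear dynamics and a quadratic value function, the last line reduces to the familiar Riccati-type equation, which is the sense in which the strictly convex cost generalizes the LQR setting.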

Policy Name | Action Range | Action Cost | Policy | HJB Nonlinear Term |
---|---|---|---|---|
Linear | | | | |
Logistic | | | | |
Atan | | | | |
Action-Scaled | | | | |
Cost-Scaled | | | | |
Action-Shifted | | | | |
Tanh | | | | |
TanhActScaled | | | | |
AtanActScaled | | | | |
Bang-Bang | - charact. fun. | | | |
Bang-Lin | - Huber loss | | | |

### 3.2 Torque Limited Optimal Policies

Exploiting the closed-form solution for the optimal policy given in Equation 7 and the convex conjugacy, one can define cost functions such that the standard controllers, including torque-limited controllers, become optimal. This approach is favorable compared to the naive quadratic action cost because such a cost can lead to unbounded actions. Clipping the unbounded actions is only optimal for linear systems [de2000elucidation], and increasing the action cost to enforce the action limits leads to over-conservative behavior and underuse of the control range.

The shape of the optimal policy is determined by the monotone function $\nabla g^*$. Therefore, one can define any desired monotone shape and determine the corresponding strictly convex cost by inverting $\nabla g^*$ to compute $\nabla g$ and integrating $\nabla g$ to obtain the strictly convex cost function $g$. For example, the linear policy is optimal with respect to the quadratic action cost, whereas the logistic policy is optimal with respect to the binary cross-entropy cost. The full generality of this concept based on convex conjugacy is shown in Table 1, which lists the corresponding cost functions for the Linear, Logistic, Atan, Tanh and Bang-Bang controllers. Furthermore, using the rules from convex analysis [boyd2004convex], the effects of scaling the action limits, shifting the action range, or scaling the action cost can be succinctly described, as shown by the Action-Scaled, Action-Shifted, and Cost-Scaled rows in Table 1. This enables quick experimentation by mixing and matching costs. For example, the action cost corresponding to the Tanh policy is straightforwardly derived using the well-known relationship between $\tanh$ and the logistic sigmoid $\sigma$ given by $\tanh(x) = 2\sigma(2x) - 1$. Note that a formula for general invertible affine transformations can be derived, not only for scalar scaling and shifting. Classical types of hard nonlinearities [ching2010quasilinear] can be derived as limiting cases of smooth solutions. For example, taking the Tanh action cost and scaling it by a factor that tends to zero, i.e., putting a very small cost on actions but nevertheless preserving the action limits, results in the Bang-Bang control shape. Taking a different limit of the Tanh policy, in which scaling is performed simultaneously with respect to the action and the cost, yields the shape that we call Bang-Lin, which corresponds to a function that is linear around zero and saturates for larger input values.
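The construction above (pick a monotone policy shape, invert it, integrate to get the cost) can be checked numerically. The following is a minimal sketch for the Tanh policy with unit action limits. The specific cost `g`, whose gradient is `atanh`, and its conjugate `log(cosh(w))` are worked out here for illustration and are not copied from Table 1. All function names are hypothetical.

```python
import math

def g(u):
    """Assumed Tanh action cost on (-1, 1); obtained by integrating atanh(u)."""
    return 0.5 * ((1 + u) * math.log(1 + u) + (1 - u) * math.log(1 - u))

def grad_g(u):
    """Gradient of the action cost, i.e. the inverse of the policy shape."""
    return math.atanh(u)

def policy(w):
    """Gradient of the convex conjugate: the Tanh policy shape."""
    return math.tanh(w)

def g_conj(w):
    """Closed-form convex conjugate of g."""
    return math.log(math.cosh(w))

def g_conj_numeric(w, n=20001):
    """Brute-force conjugate: maximize w*u - g(u) over a grid inside (-1, 1)."""
    best = -math.inf
    for i in range(1, n):
        u = -1.0 + 2.0 * i / n
        best = max(best, w * u - g(u))
    return best

for w in (-2.0, -0.5, 0.0, 1.0, 3.0):
    # The grid maximum should agree with the closed form, and the policy
    # output should invert the cost gradient.
    print(w, g_conj(w), g_conj_numeric(w), grad_g(policy(w)))
```

The brute-force maximization is only a sanity check; in the method itself the conjugate is used in closed form.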

It is worth emphasizing that the choice of the action cost has a big effect on the optimization dynamics even if the shapes of the optimal policies appear similar. A striking example is the difference between the costs corresponding to the Tanh and Atan policies, because these costs behave very differently at the action limits. The Tanh action cost is not defined on the action limits but tends towards a finite value while the Atan action cost tends towards infinity. These different limits imply different optimal control strategies. While the finite limit of the Tanh allows the usage of the maximum action, the infinite limit of the Atan cost leads to the avoidance of the maximum action.
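The different behavior at the action limits can be made concrete with a small numerical sketch. It assumes unit action limits and the cost expressions below, which were derived for this illustration by integrating the inverse policies `atanh(u)` and `tan(pi * u / 2)`; they are not taken from Table 1, and the function names are hypothetical.

```python
import math

def tanh_cost(u):
    """Assumed cost whose optimal policy is tanh(w); finite as |u| -> 1."""
    return 0.5 * ((1 + u) * math.log(1 + u) + (1 - u) * math.log(1 - u))

def atan_cost(u):
    """Assumed log-cosine cost whose optimal policy is (2/pi)*atan(w)."""
    return -(2.0 / math.pi) * math.log(math.cos(0.5 * math.pi * u))

for eps in (1e-1, 1e-3, 1e-6, 1e-9):
    u = 1.0 - eps
    # tanh_cost approaches log(2), while atan_cost keeps growing without bound.
    print(f"u = 1 - {eps:g}: tanh cost {tanh_cost(u):.4f}, atan cost {atan_cost(u):.4f}")
```

Under these assumptions the Tanh cost saturates at `log(2)`, so applying the maximum action has a bounded price, whereas the log-cosine cost acts as a barrier.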

## 4 Learning the Value Function with Differential Networks

To obtain the optimal policy, one must solve the differential equation described in Equation 8 to obtain the optimal value function. Learning this value function is non-trivial because the equation contains both the value function and its Jacobian and has multiple solutions. The first problem can be addressed by using a differential network, which previous research used to learn the parameters of the Euler-Lagrange differential equation [lutter2018deep], while the latter prevents the naive optimization of the HJB. Therefore, we first introduce the differential network and the naive optimization loss in Section 4.1 and describe the curricular optimization scheme to learn the unique solution in Section 4.2.

### 4.1 Deep Differential Network

Deep networks are fully differentiable, and one can compute their partial derivatives w.r.t. the network inputs at machine precision [raissi2018hidden]. Therefore, deep networks are well suited for being embedded within differential equations and learning the solution end-to-end. The deep differential network architecture, initially introduced by [lutter2018deep], computes the function value and the Jacobian w.r.t. the network inputs within a single forward pass by adding an additional graph within each layer that directly computes the Jacobian using the chain rule. Therefore, this architecture can naturally be used to model $V^*$ and $\nabla_{\mathbf{x}} V^*$ and be learned end-to-end by minimizing the error of the HJB using standard deep learning techniques. In addition, this architecture computes the Jacobian fast enough that the network has been used in real-time control loops at up to 500Hz [Lutter2019Energy]. Let $\hat{V}(\mathbf{x}; \theta)$ be the deep network representing the approximated value function, $\nabla_{\mathbf{x}} \hat{V}(\mathbf{x}; \theta)$ the approximated partial derivative and $\theta$ the network weights and biases. Then this deep network can be trained by minimizing the loss describing the violation of the HJB equality (3), described by

(9) |

where $\mathbf{x}$ is uniformly sampled from the state domain $\Omega$.
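The idea of propagating the Jacobian alongside the activations, and of penalizing the HJB violation at sampled states, can be sketched in plain Python. This is an illustrative toy, not the architecture of [lutter2018deep]. The layer sizes, the initialization, the squared-error norm, the double-integrator drift and the `log(cosh)` conjugate are all assumptions, and every function name is hypothetical.

```python
import math
import random

def init_layers(sizes, rng):
    """Random dense layers, e.g. sizes = [2, 16, 1]; illustrative initialization."""
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        W = [[rng.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)]
             for _ in range(n_out)]
        layers.append((W, [0.0] * n_out))
    return layers

def value_and_gradient(x, layers):
    """Single forward pass returning V(x) and dV/dx via the chain rule."""
    n = len(x)
    h = list(x)
    # J[i][j] = d h_i / d x_j, starting from the identity.
    J = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k, (W, b) in enumerate(layers):
        z = [sum(w * hi for w, hi in zip(row, h)) + bi for row, bi in zip(W, b)]
        Jz = [[sum(row[m] * J[m][j] for m in range(len(h))) for j in range(n)]
              for row in W]
        if k < len(layers) - 1:
            h = [math.tanh(zi) for zi in z]          # hidden activation
            J = [[(1.0 - hi * hi) * Jz[i][j] for j in range(n)]
                 for i, hi in enumerate(h)]          # tanh'(z) * W * J
        else:
            h, J = z, Jz                             # linear output layer
    return h[0], J[0]

def hjb_residual(x, layers, rho, q, a, B, g_conj):
    """Violation of rho*V = q + a^T dV - g*(-B^T dV) for a scalar action."""
    V, dV = value_and_gradient(x, layers)
    drift = sum(ai * gi for ai, gi in zip(a(x), dV))
    w = -sum(bi * gi for bi, gi in zip(B(x), dV))
    return rho * V - q(x) - drift + g_conj(w)

def hjb_loss(states, layers, **problem):
    """Mean squared HJB residual; the squared norm is an assumption."""
    return sum(hjb_residual(x, layers, **problem) ** 2 for x in states) / len(states)

rng = random.Random(0)
layers = init_layers([2, 16, 1], rng)
problem = dict(
    rho=1.0,
    q=lambda x: x[0] ** 2 + x[1] ** 2,        # illustrative state cost
    a=lambda x: [x[1], 0.0],                  # double-integrator drift
    B=lambda x: [0.0, 1.0],                   # control column
    g_conj=lambda w: math.log(math.cosh(w)),  # conjugate of a Tanh-type cost
)
states = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(64)]
loss = hjb_loss(states, layers, **problem)
print("initial HJB loss:", loss)
```

In practice one would compute the same quantities with an automatic differentiation framework and minimize the loss over the weights; the explicit loops here only make the layer-wise Jacobian propagation visible.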

### 4.2 Constrained Optimization of the Value Function

The naive optimization of the loss of Equation 9 using gradient descent is not sufficient because the HJB has multiple solutions. Even for the one-dimensional linear system with quadratic rewards and a quadratic value function, the HJB equation has two solutions (see the Appendix).

The problem of multiple solutions can be addressed from two perspectives. First, one can enforce the boundary constraint to make the solution unique. Second, one can use a curricular learning scheme to change the discounting factor during learning from short- to far-sighted. In addition, one can exploit the knowledge of terminal states and locally clamp the value function at the terminal states to the cost function. Similar approaches have been suggested in [riedmiller2005neural, tassa2007least].

#### 4.2.1 Boundary Constraint

The solution of the HJB for the 1d linear system is unique when the boundary condition of Equation 4 is included. Therefore, incorporating this constraint within the learning objective should enforce the desired optimal solution. The boundary constraint can be incorporated as an additional penalty term, i.e.,

(10) |

and optimizing this objective yields the desired optimal solution. For the integrator dynamics, the optimization landscape is smoothed and only one minimum remains given a sufficiently large penalty term (Fig. 1 a). Despite the well-posed optimization surface, such a constraint is hard to formulate even for slightly more complex systems such as the double integrator. Defining the state domain, which covers both position and velocity, is non-trivial because the boundary constraint implies that the system must be controllable on the given domain boundary. Therefore, one must manually engineer the maximum region of attraction, which depends on the cost terms and their relative magnitudes. E.g., the relation of action cost to velocity cost determines the magnitude of deceleration for a moving object and hence the slope of the boundary constraining positions and velocities. Therefore, a specific domain boundary might be feasible for one cost function but not for others. Furthermore, one must account for states in which the system is not controllable. If wrong boundary conditions are incorporated, the optimization objectives contradict each other and the optimization converges prematurely. Furthermore, this boundary constraint is only enforced locally and hence, the deep network may be attracted to other solutions inside the domain. Both drawbacks, the necessary engineering and the locality, render the boundary constraint not useful for the learning of the HJB.

#### 4.2.2 Changing the discounting from short to far-sighted

Besides adding the boundary constraint, one can continuously decrease the discounting such that the value function changes from short- to far-sighted solutions. This can be achieved by first initializing $\rho$ with a large value and slowly decreasing $\rho$ once the value function is sufficiently learned. Thereby, the value function is initially attracted to a single solution and follows this solution closely through the parameter space when the discounting factor is decreased. This initial solution is unique, because taking the limit $\rho \to \infty$ for Equation 3 shows that only one finite value function, i.e., $V^* = 0$, exists. Therefore, the deep network is initially attracted to the desired optimal value function and follows this solution closely when decreasing $\rho$. Applying this to the 1d linear system example shows that the undesired solution diverges to $-\infty$, while the desired optimal solution approaches $0$. This divergence is also shown in Figure 1 c. Therefore, the value function is attracted to the desired optimal solution and follows this solution to the undiscounted infinite horizon value function. This learning scheme of changing the parameter $\rho$ during optimization can be interpreted as a continuation method [allgower2012numerical] and as curricular learning [bengio2009curriculum]. Both approaches gradually increase the task complexity to achieve faster convergence (curriculum learning) or to avoid bad local optima (continuation methods). In particular, continuation methods solve an initially simplified non-linear optimization problem and track its solution through parameter space as the task complexity increases. Therefore, gradually decreasing $\rho$ and tracking the short-sighted solution through weight space to the potentially discontinuous undiscounted infinite horizon value function is comparable to continuation methods. We only use a heuristic of decreasing $\rho$ when an error or epoch threshold is reached and aim to apply the principled methods of continuation methods for adapting $\rho$ in future work.
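The tracking argument can be illustrated on a scalar toy problem. The sketch below assumes the 1d system $\dot{x} = ax + bu$, the cost $qx^2 + ru^2$ and the value function $V = px^2$; this parametrization is an assumption and may differ from the one used in the Appendix by constant factors. Instead of gradient descent on network weights, it uses Newton iterations on the single parameter `p`, warm-started from the previous discount stage. The integrator values `a=0`, `b=q=r=1` are placeholders.

```python
import math

def residual(p, rho, a=0.0, b=1.0, q=1.0, r=1.0):
    """HJB residual for V = p*x^2 divided by x^2 (assumed parametrization)."""
    return (b * b / r) * p * p + (rho - 2.0 * a) * p - q

def residual_slope(p, rho, a=0.0, b=1.0, r=1.0):
    """Derivative of the residual with respect to p."""
    return 2.0 * (b * b / r) * p + (rho - 2.0 * a)

def solve(p, rho, steps=50):
    """Newton iterations started from p; converges to the nearby root."""
    for _ in range(steps):
        p -= residual(p, rho) / residual_slope(p, rho)
    return p

def both_roots(rho, a=0.0, b=1.0, q=1.0, r=1.0):
    """Closed-form roots (undesired negative root first, desired root second)."""
    s = math.sqrt((rho - 2.0 * a) ** 2 + 4.0 * b * b * q / r)
    scale = r / (2.0 * b * b)
    return scale * (2.0 * a - rho - s), scale * (2.0 * a - rho + s)

def continuation(rhos, p0=0.0):
    """Track the solution while the discount is decreased stage by stage."""
    p, path = p0, []
    for rho in rhos:
        p = solve(p, rho)        # warm start from the previous stage
        path.append((rho, p))
    return path

rhos = [10.0 * 0.5 ** k for k in range(20)] + [0.0]
path = continuation(rhos)
print("tracked solution at rho = 0:", path[-1][1])
print("roots at rho = 10:", both_roots(10.0))
print("naive solve at rho = 0 from p = -0.5:", solve(-0.5, 0.0))
```

With these placeholder values, the warm-started path ends at the positive root, while a direct solve at zero discount from an unlucky initialization lands on the negative one. This mirrors the argument above, although Newton steps on one scalar are of course far simpler than training a network.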

### 4.3 Terminal States

One can exploit the knowledge of the terminal states to direct the learning of the value function. For every terminal state , the value function must be equal to the cost function, i.e. . This known point of the value function can be incorporated by extending the original loss and is described by

(11) |

This clamping through equality constraints can also be used for any known regularities, e.g., constraining the values at multiples of $2\pi$ of a continuous revolute joint to be identical.

## 5 Experiments

The proposed approach for learning optimal feedback control is evaluated on two tasks: stabilization of a one-dimensional linear system and nonlinear swing-up control of a torque-limited pendulum. For each system, the optimal controller for the quadratic and the log-cosine action cost is learned. For the quadratic action cost, the actions are not limited, while the log-cosine action cost implicitly limits the actions. In the following, we will refer to our proposed approach of learning the optimal value function and applying the closed-form policy as *HJB control*. The performance is compared to LQR and to shooting methods. Single and multiple shooting are implemented using CasADi [Andersson2018] and augmented with hand-tuned tracking controllers.

### 5.1 Linear System

For the linear system, the dynamics are described by and the state cost is quadratic . For the experiments, a simple one-dimensional integrator with the parameters , and the domain is used. For the quadratic cost function the actions are unconstrained, while for the log-cosine cost the actions are constrained to .

The learned value function and achieved control cost for closed-loop control with Hz are shown in Figure 2. For the quadratic cost function, HJB control, LQR and single shooting obtain identical state and action trajectories (Fig. 2 b-c). For randomly sampled starting configurations, the learned value function and the accumulated cost of the HJB controller match the cost of LQR and single shooting (Fig. 2 a). For the log-cosine cost, only HJB control and single shooting achieve optimal performance on the complete state domain while LQR only achieves comparable cost for limited starting configurations, when the action limits are not active. This can be seen in Figure 2 d, where the expected control cost of the learned value function and the accumulated cost of the HJB controller match the cost of single shooting. In contrast, LQR achieves comparable cost for but significantly larger cost for .

### 5.2 Torque Limited Pendulum

The torque limited pendulum is a one-degree-of-freedom non-linear system with the joint position , velocity and torque . The continuous time dynamics and canonical system representation with the state and being the upward pointing pendulum are described by

(12) |

with the pendulum mass kg, length m and the gravity m/s. The state cost is given by . For the log-cosine cost, the torque is constrained to N/m.

The learned value function and control performance of the HJB controller for both the quadratic and the log-cosine cost are shown in Figure 3. For the quadratic action cost, the value function is continuous and only locally quadratic in the surroundings of the balancing point. The trajectories of HJB control, multiple shooting and LQR from 300 randomly sampled starting configurations are shown in Figure 3 (a-c). The trajectories from HJB control match the trajectories of multiple shooting and hence are optimal for the non-linear system. In contrast, the LQR controller applies unnecessarily large actions because the system linearization at the balancing point over-estimates the system dynamics. Furthermore, the corresponding cost distribution for the starting configurations is very similar for the HJB controller and multiple shooting, while the LQR cost distribution shows that LQR incurs a larger cost for some starting configurations (Fig. 3 d). For the log-cosine cost, which implicitly limits the feasible actions, the value function is discontinuous and contains two ridges leading to the balancing point, because the balancing point cannot be reached directly from every point within the state domain. The learned value function and the trajectories of HJB control, multiple shooting and LQR are shown in Figure 3 e-g. Multiple shooting and HJB control achieve the swing-up from every starting configuration, while LQR cannot swing up the pendulum and only achieves balancing from starting configurations close to the upright position. Furthermore, the cost distributions for HJB control and multiple shooting match closely (Fig. 3 h). This close similarity between the cost distributions shows that our proposed approach learned the optimal value function and that the corresponding feedback policy achieves optimal feedback control.

## 6 Related Work

Non-linear control problems are normally solved by trajectory optimization with explicit inequality constraints on the actions [bryson1975applied]. These approaches yield a single optimal trajectory that needs to be replanned for every initial configuration and augmented with a non-optimal tracking controller. Locally optimal tracking controllers can be obtained when using iterative linear quadratic programming (iLQR) with action constraints [tassa2014control] or guided policy search (GPS) [levine2013guided]. In contrast to these approaches, our proposed method transforms the constrained problem into an unconstrained problem by using a principled cost function and provides the globally optimal feedback controller on the domain $\Omega$.

The transformation of the inequality constraints to the principled cost function has been explored by a number of authors. Historically, the first mention of generalized convex action costs goes back to [lyshevski1998optimal]. Thereafter, the non-linearity derived from the corresponding action cost has been commonly used in the adaptive dynamic programming literature [abu2005nearly, yang2014reinforcement, modares2014online]. Furthermore, [doya2000reinforcement] and [tassa2007least] present a derivation similar to ours, with a generic convex action cost. However, these papers do not arrive at the general convex conjugate theory, treating only a special case, and do not derive the HJB in the form of Equation 8 as we derived it with the convex conjugate $g^*$.

Globally optimal feedback controllers were previously learned by using the least squares solution of the HJB differential equation [doya2000reinforcement, tassa2007least, yang2014reinforcement, liu2014neural]. Early on, [doya2000reinforcement] used radial-basis-function networks to learn the HJB. Similarly, [yang2014reinforcement, liu2014neural] learned polynomial value functions using gradient descent. Most similar to our approach, [tassa2007least] used neural networks of the Pineda architecture to learn the average cost solution of the HJB. To achieve convergence, the authors added domain constraints, adapted the discounting horizon and added stochasticity to the dynamics to smooth the value function. In contrast to the previous works, our proposed approach learns the value function using a generic deep differential network capable of representing discontinuous value functions and only requires the adaptation of the discounting factor to achieve robust convergence. Furthermore, we provide intuitive insights that show how changing the discounting factor enables the learning of the optimal value function.

## 7 Conclusion

In this paper, we showed that constrained optimization problems for finding action-limited optimal policies can be transformed into an unconstrained problem by designing principled strictly convex cost functions, which guarantee optimal policies that naturally obey the actuator limits. Exploiting this transformation as well as the control-affine dynamics of mechanical systems with holonomic constraints, we showed that the optimal policy can be described in closed form given the differentiable value function. The differentiable value function can be learned by embedding a deep differential network within this HJB and learning the solution using a curricular learning approach that changes the discounting from short- to far-sighted. The curriculum learning is necessary to achieve robust convergence to the desired value function and to avoid the solutions of the HJB that violate the boundary conditions. The experiments demonstrated that the learned optimal feedback controller achieves performance similar to shooting methods on linear and non-linear systems.

In future work we plan to extend this work to model-based reinforcement learning by combining the HJB optimal feedback control with model learning in the form of Deep Lagrangian Networks [lutter2018deep], which ensures learning control-affine dynamics. By iterating between optimal control and model learning, the optimal policy and the model are learned simultaneously. While the global optimality is favourable for low-dimensional systems, it prevents scaling to high-dimensional problems, as the domain grows exponentially. Therefore, we plan to develop an iterative sampling approach to construct a domain limited to the surroundings of the trajectories obtained from the optimal policy.

## Acknowledgement

This project has received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No. 640554 (SKILLS4ROBOTS). Furthermore, this research was supported by grants from ABB AG, NVIDIA and the NVIDIA DGX Station.

## References

## Appendix

### Analytic Derivation of the 1d LQR System

Let the system be , the cost be and the value function be defined as and . Then Equation 8 is described by

Solving the quadratic equation yields the optimal parameter of the value function described by
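As a sketch of this derivation, assume the parametrization $\dot{x} = ax + bu$, $c(x, u) = qx^2 + ru^2$ and $V(x) = px^2$ with $\nabla_x V = 2px$. These symbols and constant factors are assumptions made for illustration and may differ from the original notation. With $g(u) = ru^2$, the convex conjugate is $g^*(w) = w^2 / (4r)$, and the HJB equality in the form of Equation 8 gives

$$
\begin{aligned}
\rho\, p\, x^2 &= q\, x^2 + 2 a\, p\, x^2 - \frac{b^2 p^2}{r}\, x^2
\quad\Longrightarrow\quad
\frac{b^2}{r}\, p^2 + (\rho - 2a)\, p - q = 0, \\
p_{\pm} &= \frac{r}{2 b^2} \left[ (2a - \rho) \pm \sqrt{(2a - \rho)^2 + \frac{4 b^2 q}{r}} \right].
\end{aligned}
$$

For $q, r > 0$ the root $p_+$ is positive and behaves like $q / (\rho - 2a) \to 0$ as $\rho \to \infty$, while $p_-$ is negative and diverges to $-\infty$, which is consistent with the behavior described in Section 4.2.2.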