Sequential Action Control: Closed-Form Optimal Control for Nonlinear and Nonsmooth Systems

Alex Ansari and Todd Murphey

Manuscript received October 27, 2015; revised June 4, 2016; accepted July 9, 2016. A. Ansari is with the Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213 USA (e-mail: aansari1@andrew.cmu.edu). T. Murphey is with the Department of Mechanical Engineering, Northwestern University, Evanston, IL 60208 USA (e-mail: t-murphey@northwestern.edu).
Abstract

This paper presents a new model-based algorithm that computes predictive optimal controls on-line and in closed loop for traditionally challenging nonlinear systems. Examples demonstrate the same algorithm controlling hybrid impulsive, underactuated, and constrained systems using only high-level models and trajectory goals. Rather than iteratively optimize finite horizon control sequences to minimize an objective, this paper derives a closed-form expression for individual control actions, i.e., control values that can be applied for short duration, that optimally improve a tracking objective over a long time horizon. Under mild assumptions, actions become linear feedback laws near equilibria that permit stability analysis and performance-based parameter selection. Globally, optimal actions are guaranteed to exist and to be unique. By sequencing these actions on-line, in receding horizon fashion, the proposed controller provides a min-max constrained response to state that avoids the overhead typically required to impose control constraints. Benchmark examples show the approach can avoid local minima and outperform nonlinear optimal controllers and recent, case-specific methods in terms of tracking performance, and at speeds orders of magnitude faster than traditionally achievable.

real-time optimal control; nonlinear control systems; hybrid systems; impacting systems; closed loop systems.

I Introduction

Model-based control techniques like dynamic programming or trajectory optimization can generate highly efficient motions that leverage, rather than fight, the dynamics of robotic mechanisms. Yet, these tools are often difficult to apply since even basic robot locomotion and manipulation tasks often yield complex nonlinear, hybrid, underactuated, and high-dimensional constrained control problems. For instance, consider the spring-loaded inverted pendulum (SLIP) model in Fig. 1. With only a point-mass body and spring leg, the model provides one of the most idealized locomotion templates that remains popular in robotics for capturing the center of mass (COM) running dynamics of a wide range of species [1, 2]. However, even this simple model exhibits nonlinear, underactuated, and hybrid phenomena that affect optimizations in most model-based methods.

Fig. 1: Time lapse (0.5s) of a spring-loaded inverted pendulum (SLIP) reactively hopping up stairs using SAC.

For example, a trajectory optimization approach could derive the dashed solution in Fig. 1, but would usually require a good initial guess (and special care to account for discontinuities). Initialization is important because the nonlinear dynamics imply the constrained objective is non-convex and subject to potentially poor local minima. An additional regulating controller, capable of tracking the trajectory through impacts, would also be necessary to track the resulting solution. Still, the approach would not be able to adapt the trajectory to accommodate a dynamically changing environment.

As an alternative, a (nonlinear) receding horizon control approach would compute an optimal trajectory, follow it for a single time step, and then iterate to construct a closed-loop response. However, each receding horizon problem would still be non-convex, requiring computationally expensive iterative optimization [3]. In addition to local minima issues, the approach would be limited to lower bandwidth scenarios with slowly-varying environmental/dynamic conditions [4, 3].

To address these problems, this paper proposes a new model-based algorithm, which we refer to as Sequential Action Control (SAC), that makes strategic trade-offs for computational gain and improved generality. That is, rather than solving for full control curves that minimize a non-convex objective over each receding horizon, SAC finds a single optimal control value and time to act that maximally improves performance. (SAC also uses a line search [5] to specify a short duration, usually a single discrete time-step, to apply each control and improve performance.) For instance, in the SLIP example SAC waits until the flight phase to tilt the leg backwards, predicting that the action will drive the robot farther up the steps (after the leg re-contacts) for some specified horizon. As in receding horizon control, SAC incorporates feedback and repeats these calculations at each time step as the horizon recedes. The resulting process computes a real-time, closed-loop response that reacts to terrain and continually drives the robot up the steps. Figure 2 provides an overview of the SAC process.

There are several advantages to the trade-offs SAC makes, i.e., computing individual control actions at each time step that improve performance rather than curves that directly optimize a performance objective. These advantages include: 1) SAC controls can be rapidly computed on-line from a closed-form expression with guaranteed optimality, existence, and uniqueness. (There are algorithms other than SAC that yield controls in closed form. However, we are unaware of any methods, particularly model/optimization-based methods, that provide comparable constrained closed-form controls on-line for examples such as those in Section IV and accommodate nonlinear hybrid/impulsive robots.) 2) SAC controls can be directly saturated to obey min-max constraints without any computational overhead or specialized solvers. 3) SAC's control synthesis process is unaffected by discontinuities in dynamics and so applies to challenging hybrid and impulsive robots. 4) In spite of sacrificing the multi-step planning process of trajectory optimization, benchmark examples demonstrate a final, unintuitive finding – SAC can avoid poor local minima that trap nonlinear optimal control. To illustrate this last point, Section IV includes a number of robotics-related control examples that show SAC outperforms case-specific methods and popular optimal control algorithms (sequential quadratic programming [5] and iLQG [6]). Compared to these alternatives, SAC computes high-bandwidth (1 KHz) closed-loop trajectories with equivalent or better final cost in less time (milliseconds/seconds vs. hours).

Fig. 2: An overview of the SAC control process including possible open-source interfaces, e.g., ROS [7] and trep [8].

To sum up, SAC provides a model-based control response to state that is easily implemented and efficiently computed for most robotic systems, including those that are saturated, underactuated, nonlinear, hybrid/impulsive, and high dimensional. This paper introduces SAC in two parts. Part I focuses on robots with differentiable nonlinear dynamics, and Part II considers hybrid impulsive robots. Benchmark examples and relevant background material are introduced in the context of each. Table I includes notation used throughout this paper.

Fig. 3: Following the cyclic process in Fig. 2, SAC computes a schedule, u₂*(t), providing the value of optimal actions that maximally improve a tracking objective over the current (receding) horizon. Next, SAC selects an application time, τ, and u₂*(τ) becomes the value of the next SAC action (blue shaded bar). A line search sets the duration, λ. Previously computed actions are applied while current calculations complete. After incorporating feedback, SAC repeats the process at the next sample time.

Part I: SAC for Differentiable Systems

II Control Synthesis

symbol       description
D_x f        partial derivative of f with respect to x
‖v‖²_M       norm where M provides the metric
             (e.g., ‖u₂‖²_R = u₂ᵀ R u₂)
≜            equivalent to (defined as)
M > 0        indicates M is positive definite (M ≥ 0 for semi-definite)
TABLE I: Notation

The subsequent sections detail SAC control synthesis, following the cyclic process in Fig. 2. We describe how each cycle of the SAC process computes an optimal action – defined by the triplet consisting of a control's value, u₂*(τ), a short application duration, λ, and an application time, τ (see the blue shaded bar in Fig. 3) – that is sent to a robot.

II-A Prediction

At fixed sampling times every t_s seconds, SAC measures the current (initial) state, x(t₀), and begins control synthesis by predicting the nominal motion of a robotic system over a receding horizon. Prediction starts at the current (initial) time, t₀, and extends to a final time, t_f, with horizon length T = t_f − t₀. So, for example, SAC produces the 0.5 s SLIP trajectory in Fig. 1 by cycling through the synthesis process at each sampling time, with each prediction phase extending T seconds into the future. In each cycle, SAC computes an action that improves the predicted trajectory. Repeating the process, SAC generates a piecewise continuous response.

In Part I, the dynamics,

ẋ(t) = f(x(t), u(t), t),    (1)

are nonlinear in the state, x. Though these methods apply more broadly, we derive controls for the case where (1) is linear with respect to the control, u, satisfying the control-affine form,

f(x(t), u(t), t) = g(x(t), t) + h(x(t), t) u(t).    (2)

The time dependence in (1) and (2) will be dropped for brevity.

The prediction phase simulates motion resulting from some choice of nominal control, u₁. Thus, the nominal predicted motion corresponds to,

x(t) = x(t₀) + ∫_{t₀}^{t} f(x(s), u₁(s)) ds.

Although the nominal control may be chosen arbitrarily, all examples here use a null nominal control, u₁ = 0. Hence, in the SLIP example, SAC seeks actions that improve performance relative to doing nothing, i.e., letting the SLIP fall.

With x and u₁ defined, the cost functional,

J₁ = ∫_{t₀}^{t_f} l₁(x(t)) dt + m(x(t_f)),    (3)

measures trajectory performance to gauge the improvement provided by SAC actions. (Though not required, (3) should be non-negative if it is to provide a performance measure in the formal sense.) The following assumptions further clarify the systems and cost functionals addressed.

Assumption 1.

The elements of the dynamics, f, are real, bounded, C¹ in x, and C⁰ in u and t.

Assumption 2.

The terminal cost, m, is real and differentiable with respect to x. The incremental cost, l₁, is real, Lebesgue integrable, and C¹ in x.

The prediction phase concludes with simulation of (1) and (3).
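To make the prediction phase concrete, the sketch below simulates a nominal (null-control) trajectory of the reduced pendulum model used later in (13) and accumulates a quadratic cost in the form of (3). It is a minimal illustration with assumed weights, step size, and Euler integration, not the authors' implementation; `predict`, `l1`, and the parameter values are all hypothetical.

```python
import numpy as np

g, l = 9.81, 2.0             # gravity [m/s^2] and pendulum length [m] (assumed values)
dt, T = 0.01, 1.0            # Euler step and prediction horizon [s] (assumed)
Q = np.diag([1000.0, 10.0])  # incremental cost weights (assumed)

def f(x, u):
    """Pendulum portion of (13): x = (theta, theta_dot), u = cart acceleration."""
    theta, dtheta = x
    return np.array([dtheta, (g * np.sin(theta) - u * np.cos(theta)) / l])

def l1(x):
    """Incremental cost l1(x); the origin is the inverted equilibrium."""
    return 0.5 * x @ Q @ x

def predict(x0, u1=0.0):
    """Simulate the nominal motion under u1 and accumulate the cost (3) (m = 0 here)."""
    ts = np.arange(0.0, T + dt, dt)
    xs = np.empty((len(ts), 2))
    xs[0] = x0
    for k in range(len(ts) - 1):
        xs[k + 1] = xs[k] + f(xs[k], u1) * dt   # forward Euler step
    J1 = sum(l1(x) for x in xs) * dt            # integral cost, rectangle rule
    return ts, xs, J1

ts, xs, J1_init = predict(np.array([np.pi, 0.0]))  # start hanging at the stable equilibrium
```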

II-B Computing Optimal Actions

Since SAC has not yet decided when to act, it derives a schedule (curve), u₂*(t), providing the value of the optimal action at every moment along the predicted motion. For instance, in the SLIP example SAC may determine that 1 N-m of torque at the current time will tilt the leg backwards sufficiently for the SLIP to bounce forward and improve (3). The same strategy may be optimal at a later time, e.g., just before leg touchdown, but require 10 N-m to accelerate the leg into position before impact. In this scenario, u₂*(t) would provide these optimal torque values at each time and SAC would choose one “best” action to take. At every sample time, SAC would update u₂*(t) and choose another action.

Modeling a SAC action as a short perturbation in the predicted trajectory's nominal control, this section derives the optimal action to apply at a given time by finding the perturbation that optimizes trajectory improvement. Given the application time, τ, a (short) duration, λ, and the optimal action value, u₂*(τ), the perturbed control signal is piecewise continuous based on Def. 1 and Assump. 3.

Definition 1.

Piecewise continuous functions will be referred to simply as piecewise continuous. These functions are defined according to one of their one-sided limits at discontinuities.

Assumption 3.

SAC control signals, u(t), are real, bounded, and piecewise continuous such that

u(t) = u₂(τ) for t ∈ [τ − λ/2, τ + λ/2], and u(t) = u₁(t) otherwise,

with a nominal control, u₁, that is continuous in t. (The dynamics and nominal control may instead be piecewise continuous in t if application times, τ, exclude points of discontinuity in u₁.)

Hence, over each receding horizon, SAC assumes the system evolves according to the nominal dynamics, f₁ ≜ f(x, u₁), except for a brief duration, λ, where it switches to the alternate mode,

f₂ ≜ f(x, u₂(τ)).

SAC seeks the vector u₂(τ) that optimally improves cost (3).

In Part II, we derive a local model of the change in cost resulting from the perturbed SAC control signal and solve for actions that optimize improvement. In the present case of differentiable dynamics (1), however, the local model of the change in cost corresponds to an existing term from mode scheduling literature. That is, we can re-interpret the problem of finding the change in cost (3) due to short application of u₂(τ), as one of finding the change in cost due to inserting a new dynamic mode, f₂, into the nominal trajectory for a short duration around τ. In this case, the mode insertion gradient [9, 10],

dJ₁/dλ₊ (τ) = ρ(τ)ᵀ [f₂(τ) − f₁(τ)],    (4)

provides a first-order model of the change in cost (3) relative to the duration, λ, of mode f₂. The model is local to the neighborhood where the duration of the switch to/from f₂ approaches zero, λ → 0⁺. Note that (4) assumes the state in f₁ and f₂ is defined from the nominal control and ρ is the adjoint variable calculated from the nominal trajectory,

ρ̇ = −D_x l₁(x)ᵀ − D_x f(x, u₁)ᵀ ρ,    (5)

with ρ(t_f) = D_x m(x(t_f))ᵀ. (As opposed to traditional fixed-horizon optimal control methods [11, 12], this adjoint is easily computed because it does not depend on the closed-loop, optimal state.)
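For illustration, the adjoint (5) can be integrated backward along a stored nominal trajectory. The sketch below assumes the reduced pendulum model and quadratic incremental cost from the previous listing; the names, weights, and Euler stepping are hypothetical.

```python
import numpy as np

g, l, dt = 9.81, 2.0, 0.01       # assumed model and integration constants
Q = np.diag([1000.0, 10.0])      # quadratic incremental cost weights (assumed)

def dfdx(x, u1=0.0):
    """Linearization D_x f of the reduced pendulum dynamics about the nominal state."""
    theta = x[0]
    return np.array([[0.0, 1.0],
                     [(g * np.cos(theta) + u1 * np.sin(theta)) / l, 0.0]])

def adjoint(xs):
    """Integrate (5) backward from rho(tf) = D_x m(x(tf)) = 0 (no terminal cost)."""
    rhos = np.zeros_like(xs)
    for k in range(len(xs) - 1, 0, -1):
        rhodot = -Q @ xs[k] - dfdx(xs[k]).T @ rhos[k]   # D_x l1(x)^T = Q x here
        rhos[k - 1] = rhos[k] - rhodot * dt             # step backward in time
    return rhos
```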

The mode insertion gradient is typically used in mode scheduling [13, 10, 14, 15] to determine the optimal time to insert control modes assuming the modes are known a priori. In this section, we use the mode insertion gradient to solve for new optimal modes (optimal actions) at each instant. (Section V shows a local hybrid cost model yields a generalized version of (4) for hybrid impulsive dynamical systems with resets and objectives that depend on the control; a discussion is in Appendix B-B.)

One way to interpret the mode insertion gradient is as a sensitivity. That is, the mode insertion gradient (4) indicates the sensitivity of cost (3) to an action's application duration at any potential application time, τ ∈ [t₀, t_f]. To achieve a desired degree of cost improvement with each action, SAC uses a control objective to select optimal actions that drive the cost sensitivity (4) toward a desired negative value, α_d < 0. At any potential application time τ, the action value, u₂(τ), that minimizes,

l₂(t) = ½ [ dJ₁/dλ₊ (t) − α_d ]² + ½ ‖u₂(t)‖²_R,    (6)

minimizes control authority in achieving the desired sensitivity. The matrix R = Rᵀ > 0 provides a metric on control effort. Because the space of positive semi-definite / definite cones is convex [16], (6) is convex with respect to action values, u₂(t).

With Assumps. 1-3, the mode insertion gradient exists, is bounded, and (6) can be minimized with respect to u₂(t). The following theorem, which stems from early work in [17], finds this minimum to compute the schedule of optimal action values.

Theorem 1.

Define Λ ≜ h(x)ᵀ ρ ρᵀ h(x). The schedule providing the value of the optimal action,

u₂*(t) ≜ argmin_{u₂(t)} l₂(t),    (7)

to which cost (3) is optimally sensitive at any time t ∈ [t₀, t_f], is

u₂*(t) = (Λ + Rᵀ)⁻¹ [Λ u₁(t) + h(x)ᵀ ρ α_d].    (8)
Proof:

Evaluated at any time t, u₂*(t) provides the value of the optimal action that minimizes (6) at that time. The schedule therefore also minimizes the (infinite) sum of the costs (6) associated with the optimal action values at every time t ∈ [t₀, t_f]. Hence, (7) can be obtained by minimizing

J₂ = ∫_{t₀}^{t_f} l₂(t) dt.    (9)

Because the sum of convex functions is convex, and ρ in (4) depends only on u₁, minimizing (9) with respect to u₂ is convex and unconstrained. It is necessary and sufficient for (global) optimality to find the u₂* for which the first variation of (9) is 0. Using the Gâteaux derivative and the definition of the functional derivative,

δJ₂ = (d/dε) J₂(u₂ + ε η) |_{ε=0} = ∫_{t₀}^{t_f} ( [ρᵀ h(x) (u₂ − u₁) − α_d] ρᵀ h(x) + u₂ᵀ R ) η(t) dt,    (10)

where ε is a scalar and η is an arbitrary control perturbation.

A generalization of the Fundamental Lemma of Variational Calculus [18] implies the term multiplying η(t) in (10) must be 0 at the optimizer. Solving

[ρᵀ h(x) (u₂ − u₁) − α_d] h(x)ᵀ ρ + Rᵀ u₂ = 0    (11)

in terms of u₂ confirms the optimal schedule is (8). ∎

To summarize, in computing optimal actions, SAC calculates a schedule providing the value of the optimal action at every possible application time along the predicted trajectory. These values optimize a local model of the change in cost relative to control duration at each application time. The model is provided by the mode insertion gradient (4). As a benefit of SAC, the schedule of optimal action values can be computed from a closed-form expression, (8), using the nominal state and adjoint (5), even for non-convex tracking costs (3).
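Continuing the running sketch, the listing below evaluates the closed-form schedule (8) pointwise from stored nominal state and adjoint samples. The control vector field `h`, the metric `R`, and `alpha_d` are assumed values; this is a sketch of the formula, not the authors' implementation.

```python
import numpy as np

R = np.array([[0.3]])   # control-effort metric R (assumed value)
l = 2.0                 # pendulum length (assumed)

def h(x):
    """Control vector field of the control-affine form (2) for the reduced pendulum."""
    return np.array([[0.0], [-np.cos(x[0]) / l]])

def schedule(xs, rhos, alpha_d, u1=0.0):
    """Evaluate (8): u2*(t) = (Lambda + R^T)^-1 (Lambda u1 + h^T rho alpha_d)."""
    u2 = np.empty(len(xs))
    for k, (x, rho) in enumerate(zip(xs, rhos)):
        hTr = h(x).T @ rho                  # h(x)^T rho, shape (1,)
        Lam = np.outer(hTr, hTr)            # Lambda = h^T rho rho^T h(x)
        u2[k] = np.linalg.solve(Lam + R.T, Lam @ np.array([u1]) + hTr * alpha_d)[0]
    return u2
```

In practice α_d would be updated each cycle, e.g., as a negative multiple of the current nominal cost as in Algorithm 1, and the schedule would then be saturated before use.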

II-C Deciding When to Act

Fig. 4: Configuration variables for the cart-pendulum system.

Assuming control calculations require some time, t_calc, SAC searches for application times, τ > t₀ + t_calc (SAC implements the previous action while current calculations complete), that optimize an objective to find the most effective time to act over each predicted trajectory. We use

J_τ(t) = ‖u₂*(t)‖ + (t − t₀)    (12)

to balance a trade-off between control efficiency and the cost of waiting, though there are many other choices of objective. (Implementation examples apply (12) as a balance between time and control effort in achieving tracking tasks, but any choice of J_τ will work.)
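As a sketch, selecting the application time amounts to a one-dimensional search over the schedule. The listing below assumes the simple effort-plus-waiting form of (12) above, together with stand-in data; all names are illustrative.

```python
import numpy as np

def choose_tau(ts, u2_star, t0, t_calc):
    """Pick tau minimizing the assumed objective (12) over feasible times."""
    J_tau = np.abs(u2_star) + (ts - t0)   # ||u2*(t)|| + (t - t0)
    ok = ts >= t0 + t_calc                # cannot act before calculations finish
    return ts[ok][np.argmin(J_tau[ok])]

ts = np.linspace(0.0, 1.0, 101)           # times along the predicted horizon
u2 = np.cos(6.0 * ts)                     # stand-in schedule u2*(t)
tau = choose_tau(ts, u2, t0=0.0, t_calc=0.01)
```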

Consider, for example, inverting a simple cart-pendulum with state, x = (θ, θ̇, x_c, ẋ_c), as in Fig. 4, acceleration control, u, and underactuated dynamics,

ẋ = f(x, u) = [ θ̇, (g sin θ − u cos θ) / l, ẋ_c, u ]ᵀ,    (13)

with pendulum length, l, and gravitational acceleration, g. Fig. 5 shows a schedule, u₂*(t), computed for (13) at a time t₀ partway into an example closed-loop SAC trajectory. At every time, the action in u₂*(t) drives the mode insertion gradient (purple curve) toward α_d. The mode insertion gradient is 0 when the pendulum is horizontal, i.e., θ = ±π/2 rad, since the control enters (13) through a cos θ term that vanishes there, so no action can push the pendulum toward the inverted equilibrium at those instants. The mode insertion gradient also goes to 0 toward the end of the horizon, since no finite control action can improve (3) at the final time. The curve of J_τ versus time (blue) indicates that, to optimize the trade-off between wait time and effectiveness of the action (as in (12)), SAC should do nothing until an optimal time, τ*. (The next SAC synthesis cycle, i.e., the next sampling time, may begin and conclude before the action at τ* is applied. In such cases, SAC often computes a similar τ*, so it would likely continue to wait until τ* to act.)

Fig. 5: A schedule of optimal actions, u₂*(t), is depicted (red curve) for a predicted trajectory of the cart-pendulum system (13) starting at the current time, t₀. These actions minimize acceleration in driving the mode insertion gradient toward α_d. The (purple) mode insertion gradient curve approximates the change in cost (3) achievable by short application of u₂*(t) at different times. The objective, J_τ (blue curve), is minimized to find an optimal time, τ*, to act. By waiting to act at τ* rather than at t₀, SAC generates greater cost reduction using less effort.

II-D Deciding How Long to Act

Temporal continuity of x, ρ, and u₁ provided by Assumps. 1-3 ensures the mode insertion gradient is continuous with respect to duration around λ = 0. Therefore, there exists a neighborhood, V, around λ = 0 where the sensitivity indicated by (4) models the change in cost relative to application duration to first order (see [13, 10] and the generalized derivation in Section V). For finite durations, λ ∈ V, the change in cost (3) is locally modeled as

ΔJ₁ ≈ (dJ₁/dλ₊)(τ) λ.    (14)

As α_d regulates dJ₁/dλ₊, (14) becomes ΔJ₁ ≈ α_d λ. Thus the choice of α_d and λ allows the control designer to specify the desired degree of change provided by actions, u₂*(τ). We use a line search with a simple descent condition to find a λ that yields the desired change [5]. (Because the pair (α_d, λ) determines the change in cost each action can provide, it is worth noting that a sufficient decrease condition similar to the one proposed in [13] can be applied to the choice of λ.)
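The duration line search can be sketched as simple backtracking, mirroring the inner loop of Algorithm 1; `resim_cost` is a hypothetical helper that re-simulates (1) with the action applied for the trial duration and returns the cost (3).

```python
def choose_duration(resim_cost, J1_init, dJ_min, dt_init=0.05, beta=0.5, k_max=10):
    """Backtrack over durations lam = beta**k * dt_init until the re-simulated
    cost satisfies the descent condition J1_new - J1_init <= dJ_min (dJ_min < 0).

    resim_cost(lam) is assumed to re-simulate the trajectory with the chosen
    action applied for duration lam around tau and return the new cost (3).
    """
    for k in range(k_max + 1):
        lam = (beta ** k) * dt_init
        if resim_cost(lam) - J1_init <= dJ_min:
            return lam          # first duration achieving the desired change
    return 0.0                  # no acceptable duration; skip acting this cycle
```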

Upon selection of the application duration, , the SAC action is fully specified and sent to the robot. The process iterates and the next cycle begins when SAC incorporates new state feedback at the subsequent sampling time.

III Special Properties of SAC Control

In addition to providing a closed-form solution for the entire schedule of optimal actions (8), SAC controls inherit powerful guarantees. Appendix A includes derivations that show: 1) the schedule (8) globally optimizes (7); 2) around equilibria, SAC controls simplify to linear state feedback laws permitting local stability analysis and performance-based parameter selection, e.g., of the parameters of (3), R, or α_d; and 3) actions computed from (8) can be saturated to satisfy min-max constraints using quadratic programming, by scaling the control vector, or by scaling components of the control vector. (Proofs are included for each with a diagonal R, as in all the examples in this paper. All examples enforce constraints using the component scaling approach.)
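For instance, the component-scaling option reduces, for symmetric per-component bounds, to the short sketch below. This is an assumed form for illustration; the formal saturation results are in Appendix A.

```python
import numpy as np

def saturate(u2, u_max):
    """Scale each component of u2* into [-u_max, u_max].

    With a null nominal control and a diagonal R, each component of u2*
    contributes a non-positive term to the mode insertion gradient, so
    shrinking a component toward zero (clipping) preserves improvement.
    """
    return np.clip(u2, -u_max, u_max)
```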

For an overview of the SAC approach outlining the calculations required for on-line synthesis of constrained optimal actions, selection of actuation times, and resolution of control durations, refer to Algorithm 1.

IV Example Systems

The following section provides simulation examples that apply SAC on-line in benchmark underactuated control tasks. (We also have trajectory tracking results, e.g., for differential drive robots, but cannot include them due to space constraints.) Each example emphasizes a different performance-related aspect of SAC and results are compared to alternative methods.

Algorithm 1: Sequential Action Control

Initialize α_d, minimum change in cost ΔJ_min, current time t_curr, default control duration Δt_init, nominal control u₁, scale factor β ∈ (0, 1), prediction horizon T, sampling time t_s, the max time for iterative control calculations t_calc, and the max backtracking iterations k_max.

while true do
    (t₀, t_f) = (t_curr, t_curr + T)
    Use feedback to initialize x(t₀)
    Simulate (x, ρ) from f₁ for t ∈ [t₀, t_f]
    Compute initial cost J₁,init
    Specify α_d  [In all examples, we specify α_d as a feedback law, α_d = γ J₁,init, with γ < 0.]
    Compute u₂*(t) from (x, ρ) using Theorem 1
    Specify / search for a time, τ > t₀ + t_calc, to apply u₂*(τ)
    Saturate u₂*(τ)
    Initialize k = 0, J₁,new = ∞
    while J₁,new − J₁,init > ΔJ_min and k ≤ k_max do
        λ = β^k Δt_init
        (τ₀, τ_f) = (τ − λ/2, τ + λ/2)
        Re-simulate x applying u₂*(τ) for t ∈ [τ₀, τ_f]
        Compute new cost J₁,new
        k = k + 1
    end while
    Send updated (u₂*(τ), τ₀, τ_f) to robot
    while t_curr < t₀ + t_s do
        Wait
    end while
end while

Algorithm 1: At sampling intervals of t_s, SAC incorporates feedback and simulates the system with a nominal (typically null) control. Optimal alternative actions are computed as a closed-form function of time. A time is chosen to apply the control action. A line search provides a duration that reduces cost.
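As a compact, end-to-end illustration of one Algorithm 1 cycle, the sketch below chains the prediction, adjoint, schedule, time-search, and duration-search steps for the reduced pendulum model. Every model, parameter value, and helper here is an assumption for illustration, not the authors' C++ implementation.

```python
import numpy as np

g, l, dt, T = 9.81, 2.0, 0.01, 1.0
Q, R = np.diag([1000.0, 10.0]), 0.3
u_max, gamma = 10.0, -5.0            # saturation bound and alpha_d gain (assumed)

f = lambda x, u: np.array([x[1], (g * np.sin(x[0]) - u * np.cos(x[0])) / l])
dfdx = lambda x, u: np.array([[0.0, 1.0],
                              [(g * np.cos(x[0]) + u * np.sin(x[0])) / l, 0.0]])
h = lambda x: np.array([0.0, -np.cos(x[0]) / l])
l1 = lambda x: 0.5 * x @ Q @ x

def sac_cycle(x0, u1=0.0, t_calc=0.01, dJ_min=-1e-3, dt_init=0.1, beta=0.5, k_max=8):
    ts = np.arange(0.0, T + dt, dt); n = len(ts)
    # Predict: simulate nominal state and compute cost (3), with m = 0.
    xs = np.empty((n, 2)); xs[0] = x0
    for k in range(n - 1):
        xs[k + 1] = xs[k] + f(xs[k], u1) * dt
    J1_init = sum(l1(x) for x in xs) * dt
    # Adjoint (5), integrated backward from rho(tf) = 0.
    rho = np.zeros((n, 2))
    for k in range(n - 1, 0, -1):
        rho[k - 1] = rho[k] + (Q @ xs[k] + dfdx(xs[k], u1).T @ rho[k]) * dt
    # Schedule (8), scalar-input case, with alpha_d = gamma * J1_init; saturate.
    alpha_d = gamma * J1_init
    w = np.array([h(x) @ r for x, r in zip(xs, rho)])       # h(x)^T rho
    u2 = np.clip((w * w * u1 + w * alpha_d) / (w * w + R), -u_max, u_max)
    # Application time: minimize an assumed effort-plus-waiting objective (12).
    ok = ts >= t_calc
    i_tau = np.where(ok)[0][np.argmin(np.abs(u2[ok]) + ts[ok])]
    tau, u_act = ts[i_tau], u2[i_tau]
    # Duration line search: backtrack on lambda until cost improves enough.
    def resim(lam):
        xs2 = np.empty((n, 2)); xs2[0] = x0
        for k in range(n - 1):
            u = u_act if tau - lam / 2 <= ts[k] <= tau + lam / 2 else u1
            xs2[k + 1] = xs2[k] + f(xs2[k], u) * dt
        return sum(l1(x) for x in xs2) * dt
    for k in range(k_max + 1):
        lam = (beta ** k) * dt_init
        if resim(lam) - J1_init <= dJ_min:
            return u_act, tau, lam
    return u1, tau, 0.0              # no improving action found this cycle

u_act, tau, lam = sac_cycle(np.array([np.pi, 0.0]))
```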
Fig. 6: SAC inverts the cart-pendulum at a low sampling and control sequencing frequency, comparable in order to the natural frequency of the equivalent simple pendulum at equilibrium. This low-frequency control signal (Fig. 6(b)) illustrates how individual actions are sequenced. SAC maintains the cart within its position constraints during inversion. Figure 6(b) also shows SAC automatically develops an energy pumping strategy to invert the pendulum.

IV-A Cart-Pendulum

First, we present examples where SAC is applied to the nonlinear cart-pendulum (13) in simulated constrained swing-up. Performance of SAC is demonstrated using the cart-pendulum as it provides a well understood underactuated control problem that has long served as a benchmark for new control methodologies (see [19, 20, 21, 22, 23, 24]).

IV-A1 Low-Frequency Constrained Inversion

This example uses SAC to invert the cart-pendulum (13) with low-frequency feedback and control action sequencing to highlight the control synthesis process.

Fig. 7: SAC can provide control solutions on-line and in closed loop (these results reflect 1 KHz feedback) that achieve performance comparable to or better than solutions from nonlinear optimal control. For the trajectory depicted, SAC achieves the same final cost as SQP and iLQG.

Saturation constraints on the control show SAC can find solutions that require multiple swings to invert. We use a quadratic tracking cost (31) with state-dependent weights that impose a barrier / penalty function (see [16, 25]) constraining the cart's position. Terminal and control costs in (6) and (31) are defined with fixed weights and a fixed prediction horizon. (All examples use wrapped angles.)

Results in Fig. 6 correspond to an initial condition with the pendulum hanging at the stable equilibrium with zero initial velocity. The red curve shows the penalty function successfully keeps the cart position within the imposed bounds. The simulated trajectory is included in the video attachment.

IV-A2 High-Frequency Constrained Inversion

In this example, SAC performs on-line swing-up and cart-pendulum inversion with high-frequency feedback (1 KHz). To gauge the quality of the inversion strategy, we compare the on-line, closed-loop SAC control to off-line trajectory optimization using MATLAB's sequential quadratic programming (SQP) and iLQG implementations. The SQP method is widely used and underlies the approach to optimization in [26, 27, 28, 29, 30, 31]. The iLQG algorithm [32, 33] is a state-of-the-art variant of differential dynamic programming (DDP). While early versions did not accommodate control constraints, iLQG achieves a tenfold speed improvement over DDP in simulations [34] and has since been applied for real-time humanoid control [32]. This section compares to a recent variant that incorporates control constraints through a new active-set method [6]. We use a publicly available MATLAB iLQG implementation developed by its authors (available at http://www.mathworks.com/matlabcentral/fileexchange/52069-ilqg-ddp-trajectory-optimization).

To highlight the sensitivity of optimal control, i.e., iLQG and SQP, to local minima even on simple nonlinear problems (and to speed SQP computations), this example uses a low-dimensional cart-pendulum model. The simplified model leaves the cart position and velocity unconstrained and ignores their error weights, such that the dynamics are represented by the first two components of (13). In this case, the goal is to compute controls that minimize a norm on the cart's acceleration while driving the pendulum angle toward the origin (inverted equilibrium). We compare performance of the trajectories produced by each algorithm over a fixed time horizon, T, based on the objective,

J = ∫_{0}^{T} ½ x(t)ᵀ Q x(t) + ½ R u(t)² dt + ½ x(T)ᵀ P₁ x(T),    (15)

with fixed weights Q, R, and P₁. All algorithms are constrained to provide controls within the same saturation bounds.

             T = – s          T = – s          T = – s
  dt         min.   iters     min.   iters     min.   iters
  – s  SQP   13     1,234     22     1,052     46     1,346
       iLQG  2      737       9      2,427     13     3,108
  – s  SQP   169    2,465     32     201       105    251
       iLQG  5      908       56     8,052     5      622
  – s  SQP   689    2,225     817    853       1,286  933
       iLQG  9      1,007     28     2,423     9      688
TABLE II: SQP versus iLQG for swing-up of the cart-pendulum under varying optimization horizon, T, and discretization, dt. All solutions converge to the same optimizer, except the gray results, which converged to low-performance local minima. For each parameter combination, columns indicate the number of iterations (iters) and time in minutes (min.) for convergence.

Both SQP and iLQG directly optimize a discretized version of (15) to derive their optimal trajectories. For comparison, results are provided for different choices of discretization, dt, and optimization horizon, T. (Horizons are based on the assumed time for pendulum inversion, and discretizations on assumed frequency requirements and linearization accuracy.) In contrast, SAC computes a trajectory of the same duration by deriving actions from a receding state tracking cost (31) (see Appendix A-A) with quadratic state norms similar to the one in (15). Although SAC runs at 1 KHz, optimal control results are limited to coarser discretizations, as SQP computations become infeasible and consume all computational resources below them. (All results were obtained on the same laptop with an Intel Core i7-4702HQ CPU @ 2.20 GHz × 8 and 16 GB RAM.) Table II provides the time and number of optimization iterations required for each parameter combination.

The parameter combinations in Table II that do not correspond to gray data converged to the same (best case) optimal trajectory, which inverts the pendulum within the horizon. (The cost of the optimal solution is the same when measured over the longer horizons since the incremental cost in (15) is negligible after inversion.) Gray data indicate convergence to an alternate local minimum with significantly worse cost. For the longest horizons, SQP converges to local minima with notably higher costs. While iLQG tends to be less sensitive to local minima, it converges to the worst local minimum for both finer discretizations at the longest horizon.

Since varying dt has no effect on SAC control synthesis (other than to specify the duration of the resulting trajectory), SAC control simulations included a variety of additional parameter combinations, including a range of receding horizons and different synthesis frequencies. These solutions yield a narrow range of costs, with the majority of solutions close or equal to the best case. The SAC solution depicted in Fig. 7 achieves the best-case cost. SAC's on-line controls perform constrained inversion as well as the best solutions from off-line optimal control. Also, local minima significantly affect SQP and iLQG, while SAC tends to be less sensitive.

Considering the simplicity of this nonlinear example, it is noteworthy that both optimal control algorithms require significant time to converge. While iLQG ranges from minutes to an hour with a discretization as coarse as SAC's, SQP requires hours to compute the single, open-loop trajectory in Fig. 7 using 4 CPU cores. Our C++ implementation of SAC obtains a solution equivalent to the best results on-line, with 1 KHz feedback, in seconds using 1 CPU core. (As the MATLAB SQP and iLQG implementations utilize compiled and parallelized libraries, it is unclear how to provide a side-by-side comparison to the timing results in Table II. To illustrate that SAC is still fast in slower, interpreted code, we also implemented SAC in Mathematica; computation times remain modest and scale linearly with respect to the horizon, T, and discretization, dt.) Computing optimal actions in closed form, SAC achieves dramatic gains and avoids the iterative optimization process, which requires thousands of variables and constraints in SQP / iLQG.

Finally, we emphasize the closed-loop nature of SAC compared to SQP, which provides an open-loop trajectory, and iLQG, which yields an affine controller with both feedforward and feedback components. As the affine controller from iLQG is only valid near the optimal solution (SAC provides feedback from arbitrary states), SQP or iLQG must be applied in receding-horizon fashion for feedback comparable to SAC. For improved speed, [6] recommends a receding horizon implementation using suboptimal solutions from a fixed number (one) of iterations. However, in this simple nonlinear example, SQP / iLQG trajectories only resemble the final solution a few iterations before convergence. Hence, receding horizon implementations would likely result in poor local solutions.

IV-A3 Sensitivity to Initial Conditions

Using a fixed horizon, SAC was applied to invert the same, reduced cart-pendulum system from a variety of initial conditions. Simulations used the quadratic tracking cost (31) and weight matrices from (15). Initial angles, θ(0), uniformly sampled over a full revolution, were paired with initial angular velocities uniformly sampled over a symmetric range.

To gauge performance, a closed-loop trajectory was constructed from each of the sampled initial conditions, and the state at the final time measured. If the final angle was within a small tolerance of the inverted position and the absolute value of the angular velocity was below a corresponding threshold, the trajectory was judged to have successfully converged to the inverted equilibrium. Tests confirmed the SAC algorithm was able to successfully invert the pendulum from all sampled initial conditions, with modest average computation time per trajectory on the test laptop.

Fig. 8: Configuration of the acrobot and pendubot systems.

IV-B Pendubot and Acrobot

This section applies SAC for swing-up control of the pendubot [35, 36, 37] and acrobot [38, 39, 40]. The pendubot is a two-link pendulum with an input torque that can be applied about the joint constraining the first (base) link. The acrobot is identical except the input torque acts about the second joint. The nonlinear dynamics and pendubot model parameters match those from simulations in [35] and experiments in [36]. The acrobot model parameters and dynamics are from simulations in [39] and in seminal work [38]. Figure 8 depicts the configuration variables, and the model parameters are listed below. Each system's state vector comprises the two joint angles and velocities, with the relevant joint torque as the control, u.

pendubot: m₁ = 1.0367 kg     m₂ = 0.5549 kg
          l₁ = 0.1508 m      l₂ = 0.2667 m
          l_c1 = 0.1206 m    l_c2 = 0.1135 m
          I₁ = 0.0031 kg·m²  I₂ = 0.0035 kg·m²
acrobot:  m₁ = 1 kg          m₂ = 1 kg
          l₁ = 1 m           l₂ = 2 m
          l_c1 = 0.5 m       l_c2 = 1 m
          I₁ = 0.083 kg·m²   I₂ = 0.33 kg·m²

Due to their underactuated dynamics and many local minima, the pendubot and acrobot provide challenging test systems for control. As a popular approach, researchers often apply energy based methods for swing-up control and switch to LQR controllers for stabilization in the vicinity of the inverted equilibrium (see [35, 41, 42, 38, 37, 39, 40]). We also use LQR controllers to stabilize the systems once near the inverted equilibrium. However, the results here show SAC can swing-up both systems without special energy optimizing methods. The algorithm utilizes the quadratic state error based cost functional (31), without modification.

While the pendubot simulations in [35] require relatively large control torques for inversion, the experimental results in [36] perform inversion with more restricted motor torques. Hence, the pendubot inputs are constrained to the experimentally motivated bounds. The acrobot torques are constrained so the system inverts using less torque than required in [39].

Example simulations initialize each system at the downward, stable equilibrium, and the desired position is the fully inverted equilibrium. Results are based on fixed feedback sampling rates and weight matrices for the pendubot and the acrobot. The LQR controllers for final stabilization were derived off-line about the inverted equilibria of the pendubot and acrobot, respectively. We selected a threshold on the state error as the switching condition for pendubot stabilization. (More formally, a supervisory controller can switch between swing-up and stabilization based on the stabilizing region of attraction [43, 44].) The acrobot switched once all its configuration variables were within a specified tolerance of the inverted equilibrium.

Figure 9 shows the pendubot trajectory (the acrobot and pendubot solutions are in the video attachment). In both cases, SAC swings each system close enough for successful stabilization. SAC inverts the pendubot using the same peak effort as in experiments from [36] and less than half that from simulations in [35]. Also, SAC inverts in less time than the simulations in [35] required. Where [35] switches between separately derived controllers for pumping energy into, out of, and inverting the system before final stabilization, SAC performs all these tasks without any change in parameters and with the simple state tracking norm in (31). In the case of the acrobot, SAC inverts the system within the desired peak torque magnitude, a fraction of the torque required in simulations from [39]. These closed-loop results were computed on-line and required only 1.23 s and a comparable time to compute the trajectories for the pendubot and acrobot systems, respectively.

To invert the pendubot and acrobot in minimal time and under the tight input constraints, the two most important parameters for tuning are the horizon length, T, and the desired change in cost due to each control actuation, α_d. All examples specify α_d iteratively based on the current initial trajectory cost under the nominal (null) control, as α_d = γ J₁,init for a negative scale factor, γ. Generally, because of the speed of SAC computations, good parameter values can be found relatively quickly using sampling. These pendubot and acrobot results use the same γ and similar horizons.

Fig. 9: SAC swings up the pendubot close enough for final stabilization by the LQR controller, which takes effect near the end of the maneuver. The algorithm inverts the system using less peak control effort and in less time than existing methods from the literature, with the same parameters throughout.

As mentioned earlier, optimal controllers typically use energy metrics for swing-up of the pendubot and acrobot, as simple state-tracking objectives yield local minima and convergence to undesirable solutions. It is noteworthy that SAC is able to invert both systems on-line and at high frequency considering optimal controllers (SQP/iLQG) generally fail under the same objective (31).

Part II: Extension to Hybrid Impulsive Systems

Part II of this paper extends SAC to systems with hybrid impulsive dynamics. These systems model a more general class of robotics problems in locomotion and manipulation, which involve contact and impacts. Such systems are challenging in optimal control and require specialized treatment and optimality conditions [45, 46, 32]. By planning each control action in a neighborhood of 0 duration, SAC avoids these issues and does not need to optimize control curves over discontinuous segments of trajectory.

V Control Synthesis for Hybrid Systems

The SAC algorithm introduced in Section II is limited to differentiable nonlinear systems because the mode insertion gradient (4) is subject to Assump. 1. Rather than rely on (4), this section directly develops a first-order approximation of the variation in state and cost due to the perturbation in nominal control generated by each SAC action. We show the change in cost due to short SAC actions corresponds to the same mode insertion gradient formula (4), but in terms of an adjoint variable derived for hybrid impulsive systems. As a result (and a benefit of SAC), the SAC process described in Algorithm 1 remains unchanged for hybrid impulsive systems.

Section VI demonstrates the hybrid calculations on an example system and then illustrates SAC in simulated on-line control of a bouncing ball. The section concludes with the spring-loaded inverted pendulum (SLIP) example from the introduction.

V-A Prediction

As in Part I, SAC predicts the nominal motion of hybrid robotic systems and computes actions that improve trajectory cost over (receding) horizons. However, this section introduces new notation more appropriate for hybrid impulsive systems. Specifically, the classes of hybrid systems considered here are similar to those in [46] and are defined such that (we assume actions are not applied at switching times, exclude Zeno behavior, and allow only a single guard to be active at a time to exclude simultaneous events and potentially indeterminate behavior; these and the continuity assumptions guarantee a local neighborhood exists such that perturbed system trajectories evolve through the same nominal location sequence, as in [46]):

  1. Q is a finite set of locations, with index q.

  2. {M_q} is a family of state space manifolds indexed by q ∈ Q.

  3. {U_q} is a family of control spaces.

  4. {f_q} is a family of maps to the tangent bundle, f_q : M_q × U_q → T M_q. The maps are the dynamics at location q.

  5. A family of sets of admissible control mappings is associated with each location.

  6. {I_q} is a family of consecutive subintervals corresponding to the time spent at each location q.

  7. The series of guards, Φ_{q,q+1}, indicates transitions between locations q and q+1 when Φ_{q,q+1}(x(t)) = 0. The state transitions according to a series of corresponding reset maps, Ω_{q,q+1}.

For clarity, we avoid using numerical subscripts for the nominal control. Instead, SAC predicts nominal motions assuming a (possibly null) nominal control, u, is defined for every location, q. So, for instance, in the SLIP example, one nominal control is defined in stance with another, possibly identical, control in flight. Note that as a hybrid robotic system applies controls, it evolves through an ordered sequence of locations, e.g., from flight, to stance, to flight again, for a hopping SLIP.

With the initial location, q₀, state, x(t₀), and the collection of dynamics, guards, and reset maps, SAC's prediction phase simulates

ẋ(t) = f_q(x(t), u(t)),    (16)

starting with q = q₀, to obtain the nominal state. Guards indicate when a transition should occur, i.e., they specify the end of each interval, I_q, and the next location based on which guard becomes 0. Reset maps define the initial condition in the next location as x(τ⁺) = Ω_{q,q+1}(x(τ⁻)). Through this process, the prediction phase defines the nominal location sequence, the intervals, I_q, and the resulting nominal trajectory, x(t), for t ∈ [t₀, t_f].

As before, SAC's prediction phase concludes after computing the performance of the nominal trajectory. However, in this hybrid case we use an objective,

J₁ = Σ_q ∫_{I_q} l_q(x(t), u(t)) dt + m(x(t_f)),    (17)

with incremental and terminal costs defined in each location and satisfying smoothness assumptions analogous to Assump. 2. Also, (17) is more general than (3), as it may depend on the control. Given the trajectory resulting from the nominal control, u, (17) can be evaluated along the hybrid trajectory.
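A hybrid prediction phase in the spirit of (16) can be sketched with a guard check and reset inside the integration loop. The bouncing ball below, with an assumed restitution coefficient, Euler stepping, and naive zero-crossing detection rather than exact event location, is illustrative only.

```python
import numpy as np

dt = 1e-3

def f_flight(x):
    """Single-location dynamics: ballistic ball, x = (height, vertical velocity)."""
    return np.array([x[1], -9.81])

def guard(x):
    """Phi(x) = 0 signals a transition (ball reaches the ground)."""
    return x[0]

def reset(x, e=0.8):
    """Omega: impact map reversing and damping the vertical velocity."""
    return np.array([0.0, -e * x[1]])

def predict_hybrid(x0, T=2.0):
    """Simulate (16), applying the reset whenever the guard crosses zero."""
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(int(T / dt)):
        x = xs[-1] + f_flight(xs[-1]) * dt   # Euler step in the current location
        if guard(x) <= 0.0:                  # guard crossing detected
            x = reset(x)                     # apply the reset map
        xs.append(x)
    return np.array(xs)

traj = predict_hybrid([1.0, 0.0])            # drop the ball from 1 m
```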

V-B Computing Optimal Actions

Recall that each cycle of SAC seeks an action that improves nominal trajectory performance. This section defines the perturbed signal from an arbitrary SAC action of value u₂(τ) as

u_ε(t) = u₂(τ) for t ∈ (τ − ε λ̄, τ], and u_ε(t) = u(t) otherwise,

assuming a short duration, λ = ε λ̄. In this case, the magnitude of λ is specified by ε and the direction by an arbitrary positive scalar, λ̄. Because the perturbed system will eventually be evaluated as ε → 0, assume the perturbation occurs when the nominal state, x, is in the arbitrary location q, so that τ ∈ I_q. (In the limit as ε → 0, the SAC action is a needle perturbation [12].) Figure 10 depicts the perturbed control and the corresponding perturbed state.

To derive actions that maximally improve the nominal trajectory, Sec. II-B used the mode insertion gradient (4) to model the change in cost (3) relative to control duration. To accommodate the discontinuous trajectories of hybrid robotic systems, this section derives a model of the change in nominal cost (17) resulting from the perturbed control, u_ε, by first modeling the effect of the control perturbation on the state trajectory. To these ends, we define the first-order perturbed state model,

x_ε(t) = x(t) + ε Ψ(t) + o(ε).    (18)

(The little-o notation, o(ε), indicates terms that are higher than first order in ε; these terms go to zero faster than the first-order terms in (18) as ε → 0.) The term Ψ is known as the variational equation [11, 12]. It is the direction of the state variation at time t, and ε is the magnitude. The following proposition provides formulas to compute the variational equation along hybrid impulsive trajectories. The derivation is in Appendix B-A.

Fig. 10: A perturbed control (top) and the corresponding state variation (bottom) for a hybrid system. The nominal system switches locations at one time, and the perturbed system switches at a slightly shifted time. Taken in the limit as ε → 0, the control perturbation is a needle perturbation, which is equivalent to an infinitesimal duration action in SAC.
Proposition 1.

Assume the state, x, of a hybrid system evolves from location q to q+1 with transition time, t_r ∈ (τ, t_f). If a control perturbation occurs at τ, as in Fig. 10, state variations propagate according to

Ψ(τ) = [ f_q(x(τ), u₂(τ)) − f_q(x(τ), u(τ)) ] λ̄,
Ψ̇(t) = D_x f_q(x(t), u(t)) Ψ(t),   t ∈ (τ, t_r),
Ψ(t_r⁺) = Π Ψ(t_r⁻),
Ψ̇(t) = D_x f_{q+1}(x(t), u(t)) Ψ(t),   t ∈ (t_r, t_f],    (19)

with the linear variational reset map,

Π ≜ D_x Ω(x(t_r⁻)) + [ f_{q+1}(x(t_r⁺)) − D_x Ω(x(t_r⁻)) f_q(x(t_r⁻)) ] (D_x Φ(x(t_r⁻))) / (D_x Φ(x(t_r⁻)) f_q(x(t_r⁻))),    (20)

where Ω and Φ are the reset map and guard for the transition, Ψ(t_r⁻) and Ψ(t_r⁺) denote the variation just before and after the reset, and each D_x term is the linearization about the (known) nominal state trajectory at t_r.

Note that if a nominal trajectory evolves through more than two locations, each transition requires a reset of Ψ at the transition time according to (20). Variations then continue according to the dynamics linearized about the nominal trajectory. Repeating the computations in the rows of (19) between consecutive locations, variations can be propagated to t_f.
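The propagation in Prop. 1 can be sketched directly. The listing below implements the standard saltation-matrix construction, which we take (20) to correspond to, applied to a bouncing-ball transition; all names and values are assumptions for illustration.

```python
import numpy as np

def variational_reset(DOmega, dPhi, f_minus, f_plus):
    """Pi = DOmega + (f_plus - DOmega f_minus) dPhi / (dPhi . f_minus).

    DOmega:  Jacobian of the reset map Omega at x(t_r-)
    dPhi:    gradient (row) of the guard Phi at x(t_r-)
    f_minus: dynamics f_q(x(t_r-), u) just before the transition
    f_plus:  dynamics f_{q+1}(x(t_r+), u) just after the transition
    """
    return DOmega + np.outer(f_plus - DOmega @ f_minus, dPhi) / (dPhi @ f_minus)

# Bouncing-ball transition: Omega(x) = (x1, -e x2), Phi(x) = x1.
e = 0.8
x_minus = np.array([0.0, -3.0])                 # state just before impact
DOmega = np.array([[1.0, 0.0], [0.0, -e]])
dPhi = np.array([1.0, 0.0])
f_m = np.array([x_minus[1], -9.81])             # pre-impact dynamics
x_plus = DOmega @ x_minus
f_p = np.array([x_plus[1], -9.81])              # post-impact dynamics
Pi = variational_reset(DOmega, dPhi, f_m, f_p)  # maps Psi(t_r-) to Psi(t_r+)
```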

With Prop. 1 to compute the perturbed state (18), the following section derives the cost variation resulting from the perturbed control, u_ε. We will show the formula is a generalization of the mode insertion gradient (4) that applies to a larger class of hybrid and impulsive systems.

V-B1 Modeling the Cost Variation

To first order, the perturbed cost can be modeled as

J₁(u_ε) = J₁(u) + ε ν + o(ε),    (21)

where ν is the direction of the variation in the cost function and ε is the magnitude. To simplify the derivation of ν, we translate the hybrid system to Mayer form by appending the incremental costs, l_q, to the dynamics vectors, f_q, in each location. Objects with a bar refer to appended versions of the hybrid system, such that x̄ ≜ (x⁰, x) and f̄_q ≜ (l_q, f_q).

In Mayer form, the first component of the perturbed appended state, x̄_ε⁰(t_f), is the perturbed integral cost in (17). Hence, the perturbed cost model (21) can be written as a sum,

J₁(u_ε) = x̄_ε⁰(t_f) + m(x_ε(t_f)),

which includes the perturbed terminal cost. The direction of the variation in the cost is ν = (d/dε) [ x̄_ε⁰(t_f) + m(x_ε(t_f)) ] |_{ε=0}. Evaluating the derivative yields

ν = Ψ̄⁰(t_f) + D_x m(x(t_f)) Ψ(t_f),    (22)

where Ψ̄ ≜ (Ψ⁰, Ψ) is the appended state variation.

Note that ν provides the same information as the mode insertion gradient in (4) but applies to hybrid impulsive systems with resets. That is, ν provides the sensitivity of the cost, (17), to applying an action at τ as ε → 0. Given a control perturbation at an arbitrary time τ, one can calculate ν from (22) by propagating the appended state variation forward from τ to t_f using Prop. 1. However, in searching for an optimal time to act, SAC needs to compare the cost variation produced by taking action, i.e., applying a control perturbation, at different times (see Sec. II-C). The process is computationally intensive if ν is naively computed from the state variation. That is, considering two possible times, τ₁ and τ₂, when the control perturbation may be applied would require separate simulations of Ψ̄ from τ₁ and from τ₂. (One may also apply linear transformations to the variational system simulated from the perturbation at one time based on superposition of the initial condition at the other; variational reset maps would require similar transformation.)

Since the mode insertion gradient (4) does not require re-simulation to consider different application times in optimizing (12), we seek to express ν in a form that more closely resembles (4). To these ends, we now re-write ν in terms of a system, ρ̄, adjoint to the variational system, Ψ̄. (The adjoint belongs to the cotangent bundle, so the inner product ⟨ρ̄, Ψ̄⟩ is defined; see [11] for a similar derivation of an adjoint in the context of continuous variations.) The systems are adjoint [11] if

(d/dt) ⟨ρ̄(t), Ψ̄(t)⟩ = 0.    (23)

That is, we can derive ρ̄ by ensuring ⟨ρ̄, Ψ̄⟩ is constant. Note also that by choosing the terminal condition,

ρ̄(t_f) = ( 1, D_x m(x(t_f)) )ᵀ,    (24)

(22) allows us to express ν in terms of the adjoint at the terminal time as ν = ⟨ρ̄(t_f), Ψ̄(t_f)⟩. If we enforce (23) in deriving the adjoint, the inner product will be constant and equal to ν at times subsequent to the control perturbation, and ρ̄(t) can be interpreted as the sensitivity of (17) to a state variation at time t.

Assuming the system is at the (arbitrary) location q at the perturbation time τ, the inner product in (23) yields

ν = ⟨ρ̄(τ), Ψ̄(τ)⟩ = ρ̄(τ)ᵀ [ f̄_q(x(τ), u₂(τ)) − f̄_q(x(τ), u(τ)) ] λ̄,    (25)

which no longer depends on forward simulation of Ψ̄. The initial time, τ, of the control perturbation is arbitrary. As with (4), once the adjoint, ρ̄, is computed over the horizon, (25) can be evaluated at any number of different times, τ, to provide the cost sensitivity, ν, to the control perturbation in each case.

The following proposition derives an adjoint formula, ρ̄, that maintains its interpretation as the cost sensitivity to state variations, as in (25).

Proposition 2.

Assuming Ψ̄ flows between the locations q and q+1 with a control perturbation as in Prop. 1, the adjoint defined by

(d/dt) ρ̄(t) = −D_x̄ f̄_q(x(t), u(t))ᵀ ρ̄(t),   t ∈ (τ, t_r),
ρ̄(t_r⁻) = Π̄ᵀ ρ̄(t_r⁺),
(d/dt) ρ̄(t) = −D_x̄ f̄_{q+1}(x(t), u(t))ᵀ ρ̄(t),   t ∈ (t_r, t_f],    (26)

with terminal condition (24) and Π̄ the appended version of (20), satisfies the adjoint relation (23) and yields (25).

Proof:

The adjoint is simulated backwards from a terminal condition (26) because this choice of terminal conditions yields