Sequential Action Control: ClosedForm Optimal Control for Nonlinear and Nonsmooth Systems
Abstract
This paper presents a new modelbased algorithm that computes predictive optimal controls online and in closed loop for traditionally challenging nonlinear systems. Examples demonstrate the same algorithm controlling hybrid impulsive, underactuated, and constrained systems using only highlevel models and trajectory goals. Rather than iteratively optimize finite horizon control sequences to minimize an objective, this paper derives a closedform expression for individual control actions, i.e., control values that can be applied for short duration, that optimally improve a tracking objective over a long time horizon. Under mild assumptions, actions become linear feedback laws near equilibria that permit stability analysis and performancebased parameter selection. Globally, optimal actions are guaranteed existence and uniqueness. By sequencing these actions online, in receding horizon fashion, the proposed controller provides a minmax constrained response to state that avoids the overhead typically required to impose control constraints. Benchmark examples show the approach can avoid local minima and outperform nonlinear optimal controllers and recent, casespecific methods in terms of tracking performance, and at speeds orders of magnitude faster than traditionally achievable.
I Introduction
Modelbased control techniques like dynamic programming or trajectory optimization can generate highly efficient motions that leverage, rather than fight, the dynamics of robotic mechanisms. Yet, these tools are often difficult to apply since even basic robot locomotion and manipulation tasks often yield complex nonlinear, hybrid, underactuated, and highdimensional constrained control problems. For instance, consider the springloaded inverted pendulum (SLIP) model in Fig. 1. With only a pointmass body and spring leg, the model provides one of the most idealized locomotion templates that remains popular in robotics for capturing the center of mass (COM) running dynamics of a wide range of species [1, 2]. However, even this simple model exhibits nonlinear, underactuated, and hybrid phenomena that affect optimizations in most modelbased methods.
For example, a trajectory optimization approach could derive the dashed solution in Fig. 1, but would usually require a good initial guess (and special care to account for discontinuities). Initialization is important because the nonlinear dynamics imply the constrained objective is nonconvex and subject to potentially poor local minima. An additional regulating controller, capable of tracking the trajectory through impacts, would also be necessary to track the resulting solution. Still, the approach would not be able to adapt the trajectory to accommodate a dynamically changing environment.
As an alternative, a (nonlinear) receding horizon control approach would compute an optimal trajectory, follow it for a single time step, and then iterate to construct a closedloop response. However, each receding horizon problem would still be nonconvex, requiring computationally expensive iterative optimization [3]. In addition to local minima issues, the approach would be limited to lower bandwidth scenarios with slowlyvarying environmental/dynamic conditions [4, 3].
To address these problems, this paper proposes a new modelbased algorithm, which we refer to as Sequential Action Control (SAC), that makes strategic tradeoffs for computational gain and improved generality. That is, rather than solving for full control curves that minimize a nonconvex objective over each receding horizon, SAC finds a single optimal control value and time to act that maximally improves performance.^{1}^{1}1SAC also uses a line search [5] to specify a short duration (usually a single discrete timestep) to apply each control and improve performance. For instance, in the SLIP example SAC waits until the flight phase to tilt the leg backwards, predicting that the action will drive the robot farther up the steps (after the leg recontacts) for some specified horizon. As in receding horizon control, SAC incorporates feedback and repeats these calculations at each time step as the horizon recedes. The resulting process computes a realtime, closedloop response that reacts to terrain and continually drives the robot up the steps. Figure 2 provides an overview of the SAC process.
There are several advantages to the tradeoffs SAC makes, i.e., computing individual control actions at each time step that improve performance rather than curves that directly optimize a performance objective. These advantages include: 1) SAC controls can be rapidly computed online from a closedform expression with guaranteed optimality, existence, and uniqueness.^{2}^{2}2There are algorithms other than SAC that yield controls in closed form. However, we are unaware of any methods (particularly model/optimizationbased methods) that provide comparable constrained closedform controls online for examples such as those in Section IV and accommodate nonlinear hybrid/impulsive robots. 2) SAC controls can be directly saturated to obey minmax constraints without any computational overhead or specialized solvers. 3) SAC’s control synthesis process is unaffected by discontinuities in dynamics and so applies to challenging hybrid and impulsive robots. 4) In spite of sacrificing the multistep planning process of trajectory optimization, benchmark examples demonstrate a final, unintuitive finding – SAC can avoid poor local minima that trap nonlinear optimal control. To illustrate this last point, Section IV includes a number of roboticsrelated control examples that show SAC outperforms casespecific methods and popular optimal control algorithms (sequential quadratic programming [5] and iLQG [6]). Compared to these alternatives, SAC computes highbandwidth ( KHz) closedloop trajectories with equivalent or better final cost in less time (milliseconds/seconds vs. hours).
To sum up, SAC provides a modelbased control response to state that is easily implemented and efficiently computed for for most robotic systems, including those that are saturated, underactuated, nonlinear, hybrid/impulsive, and high dimensional. This paper introduces SAC in two parts. Part I focuses on robots with differentiable nonlinear dynamics, and Part II considers hybrid impulsive robots. Benchmark examples and relevant background material are introduced in the context of each. Table I includes notation used throughout this paper.
Part I: SAC for Differentiable Systems
Ii Control Synthesis
symbol  description 

partial derivative  
norm where provides the metric  
(e.g., )  
equivalent to  
indicates is positive definite ( for semidefinite) 
The subsequent sections detail SAC control synthesis, following the cyclic process in Fig. 2. We describe how each cycle of the SAC process computes an optimal action – defined by the triplet consisting of a control’s value, , a short application duration, , and application time, (see the blue shaded bar in Fig. 3) – that is sent to a robot.
Iia Prediction
At fixed sampling times every seconds, SAC measures the current (initial) state, , and begins control synthesis by predicting the nominal motion of a robotic system over a receding horizon. Prediction starts at the current (initial) time, , and extends to final time, , with a horizon length, . So, for example, SAC produces the s SLIP trajectory in Fig. 1 by cycling through a synthesis process every s ( Hz), with each prediction phase lasting s. In each cycle, SAC computes an action that improves the s predicted trajectory. Repeating the process, SAC generates a piecewise continuous response.
In Part I, the dynamics,
(1) 
are nonlinear in state . Though these methods apply more broadly, we derive controls for the case where (1) is linear with respect to the control, , satisfying controlaffine form,
(2) 
The time dependence in (1) and (2) will be dropped for brevity.
The prediction phase simulates motion resulting from some choice of nominal control, . Thus, the nominal predicted motion corresponds to,
Although the nominal control may be chosen arbitrarily, all examples here use a null nominal control, . Hence, in the SLIP example, SAC seeks actions that improve performance relative to doing nothing, i.e., letting the SLIP fall.
With and , the cost functional,
(3) 
measures trajectory performance to gauge the improvement provided by SAC actions.^{3}^{3}3Though not required, (3) should be nonnegative if it is to provide a performance measure in the formal sense. The following assumptions further clarify the systems and cost functionals addressed.
Assumption 1.
The elements of the dynamics, , are real, bounded, in , and in and .
Assumption 2.
The terminal cost, , is real and differentiable with respect to . Incremental cost is real, Lebesgue integrable, and in .
IiB Computing Optimal Actions
Since SAC has not yet decided when to act, it derives a schedule (curve), , providing the value of the optimal action at every moment along the predicted motion. For instance, in the SLIP example SAC may determine that 1 Nm of torque at the current time will tilt the leg backwards sufficiently for the SLIP to bounce forward and improve (3). The same strategy may be optimal at a later time, e.g., just before leg touchdown, but require 10 Nm to accelerate the leg into position before impact. In this scenario, would provide these optimal torque values at each time and SAC would choose one “best” action to take. At every sample time, SAC would update and choose another action.
Modeling a SAC action as a short perturbation in the predicted trajectory’s nominal control, this section derives the optimal action to apply at a given time by finding the perturbation that optimizes trajectory improvement. Given the application time, , a (short) duration, , and the optimal action value, , the perturbed control signal is piecewise continuous based on Def. 1 and Assump. 3.
Definition 1.
Piecewise continuous functions will be referred to as . These functions will be defined according to one of their onesided limits at discontinuities.
Assumption 3.
SAC control signals, , are real, bounded, and such that
with nominal control, , that is in .^{4}^{4}4The dynamics and nominal control can be in if application times, , exclude points of discontinuity in .
Hence, over each receding horizon, SAC assumes the system evolves according to nominal dynamics, , except for a brief duration, where it switches to the alternate mode,
SAC seeks the vector that optimally improves cost (3).
In Part II, we derive a local model of the change in cost resulting from the perturbed SAC control signal and solve for actions that optimize improvement. In the present case of differentiable dynamics (1) however, the local model of the change in cost corresponds to an existing term from mode scheduling literature. That is, we can reinterpret the problem of finding the change in cost (3) due to short application of , as one of finding the change in cost due to inserting a new dynamic mode, , into the nominal trajectory for a short duration around . In this case, the mode insertion gradient [9, 10],
(4) 
provides a firstorder model of the change in cost (3) relative to the duration of mode . The model is local to the neighborhood where the duration of the switch to/from approaches zero, . Note that (4) assumes the state in and is defined from the nominal control and is the adjoint variable calculated from the nominal trajectory,^{5}^{5}5As opposed to traditional fixedhorizon optimal control methods [11, 12], this adjoint is easily computed because it does not depend on the closedloop, optimal state .
(5) 
with .
The mode insertion gradient is typically used in mode scheduling [13, 10, 14, 15] to determine the optimal time to insert control modes assuming the modes are known a priori. In this section, we use the mode insertion gradient to solve for new optimal modes (optimal actions) at each instant.^{6}^{6}6Also, Section V shows a local hybrid cost model yields a generalized version of (4) for hybrid impulsive dynamical systems with resets and objectives that depend on the control. A discussion is in Appendix BB.
One way to interpret the mode insertion gradient is as a sensitivity. That is, the mode insertion gradient (4) indicates the sensitivity of cost (3) to an action’s application duration at any potential application time, . To achieve a desired degree of cost improvement with each action, SAC uses a control objective to select optimal actions that drive the cost sensitivity (4) toward a desired negative value, . At any potential application time , the action value, , that minimizes,
(6) 
minimizes control authority in achieving the desired sensitivity. The matrix provides a metric on control effort. Because the space of positive semidefinite / definite cones is convex [16], (6) is convex with respect to action values, .
With Assumps. 13, the mode insertion gradient exists, is bounded, and (6) can be minimized with respect to . The following theorem, which stems from early work in [17], finds this minimum to compute the schedule of optimal action values.
Theorem 1.
Define . The schedule providing the value of the optimal action,
(7) 
to which cost (3) is optimally sensitive at any time is
(8) 
Proof:
Evaluated at any time , provides the value of the optimal action that minimizes (6) at that time. The schedule therefore also minimizes the (infinite) sum of costs (6) associated with the optimal action values at every time . Hence, (7) can be obtained by minimizing
(9) 
Because the sum of convex functions is convex, and in (4) depends only on , minimizing (9) with respect to is convex and unconstrained. It is necessary and sufficient for (global) optimality to find the for which the first variation of (9) is . Using the Gâteaux derivative and the definition of the functional derivative,
(10) 
where is a scalar and .
To summarize, in computing optimal actions, SAC calculates a schedule providing the value of the optimal action at every possible application time along the predicted trajectory. These values optimize a local model of the change in cost relative to control duration at each application time. The model is provided by the mode insertion gradient (4). As a benefit of SAC, the schedule of optimal action values can be computed from a closedform expression, (8), of the nominal state and adjoint (5) even for nonconvex tracking costs (3).
IiC Deciding When to Act
Assuming control calculations require some time, , SAC searches for application times, ,^{7}^{7}7SAC implements the previous action while current calculations complete. that optimize an objective to find the most effective time to act over each predicted trajectory. We use
(12) 
to balance a tradeoff between control efficiency and the cost of waiting, though there are many other choices of objective.^{8}^{8}8Implementation examples apply as a balance between time and control effort in achieving tracking tasks, but any choice of will work.
Consider, for example, inverting a simple cartpendulum with state, as in Fig. 4, acceleration control, , and underactuated dynamics,
(13) 
with length, m, and gravity, . Fig. 5 shows a schedule, , computed for (13) starting at into an example closedloop SAC trajectory. At every time, the action in drives the mode insertion gradient (purple curve) toward . The mode insertion gradient is at s when the pendulum is horizontal, i.e., rad., since no action can push toward at that time. The mode insertion gradient also goes to toward the end of the horizon since no finite control action can improve (3) at the final time. The curve of vs time (blue) indicates, to optimize the tradeoff between wait time and effectiveness of the action (as in (12)), SAC should do nothing until optimal time s.^{9}^{9}9The next SAC synthesis cycle, i.e., the next sampling time, may begin (and conclude) before the action at is applied. In such cases, SAC often computes a similar . So, in this case, SAC would likely continue to wait until s to act.
IiD Deciding How Long to Act
Temporal continuity of , , and provided by Assump. 13 ensures the mode insertion gradient is continuous with respect to duration around where . Therefore, there exists a neighborhood, , where the sensitivity indicated by (4) models the change in cost relative to application duration to firstorder (see [13, 10] and the generalized derivation in Section V). For finite durations, , the change in cost (3) is locally modeled as
(14) 
As regulates , (14) becomes . Thus the choice of and allows the control designer to specify the desired degree of change provided by actions, . We use a line search with a simple descent condition to find a that yields the desired change [5].^{10}^{10}10Because the pair determines the change in cost each action can provide, it is worth noting that a sufficient decrease condition similar to the one proposed in [13] can be applied to the choice of .
Upon selection of the application duration, , the SAC action is fully specified and sent to the robot. The process iterates and the next cycle begins when SAC incorporates new state feedback at the subsequent sampling time.
Iii Special Properties of SAC Control
In addition to providing a closedform solution for the entire schedule of optimal actions (8), SAC controls inherit powerful guarantees. Appendix A includes derivations that show 1) the schedule (8) globally optimizes (7). 2) Around equilibria, SAC controls simplify to linear state feedback laws permitting local stability analysis and parameter selection, e.g., parameters of (3), , or . 3) Finally, actions computed from (8) can be saturated to satisfy minmax constraints using quadratic programming, by scaling the control vector, or by scaling components of the control vector.^{11}^{11}11Proofs are included for each with , as in all the examples in this paper. All examples enforce constraints using the component scaling approach.
For an overview of the SAC approach outlining the calculations required for online synthesis of constrained optimal actions, selection of actuation times, and resolution of control durations, refer to Algorithm 1.
Iv Example Systems
The following section provides simulation examples that apply SAC online in benchmark underactuated control tasks.^{12}^{12}12We also have trajectory tracking results, e.g, for differential drive robots, but cannot include them due to space constraints. Each example emphasizes a different performancerelated aspect of SAC and results are compared to alternative methods.
Iva CartPendulum
First, we present examples where SAC is applied to the nonlinear cartpendulum (13) in simulated constrained swingup. Performance of SAC is demonstrated using the cartpendulum as it provides a well understood underactuated control problem that has long served as a benchmark for new control methodologies (see [19, 20, 21, 22, 23, 24]).
IvA1 LowFrequency Constrained Inversion
This example uses SAC to invert the cartpendulum (13) with low frequency ( Hz) feedback and control action sequencing to highlight the control synthesis process.
Control constraints, , show SAC can find solutions that require multiple swings to invert. We use a quadratic tracking cost (31) with the state dependent weights, . to impose a barrier / penalty function (see [16, 25]) that constrains the cart’s state so . Terminal and control costs in (6) and (31) are defined using , , and a horizon of s.^{14}^{14}14All examples use wrapped angles .
Results in Fig. 6 correspond to an initial condition with the pendulum hanging at the stable equilibrium and zero initial velocity, . The red curve shows the penalty function successfully keeps the cart position within m. The simulated trajectory is included in the video attachment.
IvA2 HighFrequency Constrained Inversion
In this example, SAC performs online swingup and cartpendulum inversion with highfrequency feedback ( KHz). To gauge the quality of the inversion strategy, we compare the online, closedloop SAC control to offline trajectory optimization using MATLAB’s sequential quadratic programming (SQP) and iLQG implementations. The SQP method is widely used and underlies the approach to optimization in [26, 27, 28, 29, 30, 31]. The iLQG algorithm [32, 33] is a stateoftheart variant of differential dynamic programming (DDP). While early versions did not accommodate control constraints, iLQG achieves a tenfold speed improvement over DDP in simulations [34] and has since been applied for realtime humanoid control [32]. This section compares to a recent variant that incorporates control constraints through a new activeset method [6]. We use a publicly available MATLAB iLQG implementation developed by its authors.^{15}^{15}15Available at http://www.mathworks.com/matlabcentral/fileexchange/52069ilqgddptrajectoryoptimization.
To highlight the sensitivity of optimal control, i.e., iLQG and SQP, to local minima even on simple nonlinear problems (and to speed SQP computations), this example uses a lowdimensional cartpendulum model. The simplified model leaves the cart position and velocity unconstrained and ignores their error weights, such that dynamics are represented by the first two components of (13). In this case, the goal is to compute controls that minimize a norm on the cart’s acceleration while driving the pendulum angle toward the origin (inverted equilibrium). We compare performance of the trajectories produced by each algorithm over a fixed time horizon, , based on the objective,
(15) 
with and . All algorithms are constrained to provide controls .
s  s  s  

dt  min.  iters  min.  iters  min.  iters  
s  SQP  13  1,234  22  1052  46  1,346 
iLQG  2  737  9  2,427  13  3,108  
s  SQP  169  2,465  32  201  105  251 
iLQG  5  908  56  8,052  5  622  
s  SQP  689  2,225  817  853  1,286  933 
iLQG  9  1,007  28  2,423  9  688 
Both SQP and iLQG directly optimize a discretized version of (15) to derive their optimal trajectories. For comparison, results are provided for different choices of discretization, dt, and optimization horizons, .^{16}^{16}16Horizons are based on the assumed time for pendulum inversion, and discretizations on assumed frequency requirements and linearization accuracy. In contrast, SAC computes a trajectory of duration by deriving actions from a receding state tracking cost (31) (see Appendix AA) with quadratic state norms similar to the one in (15). Although SAC runs at KHz, optimal control results are limited to dts, as SQP computations become infeasible and consume all computational resources below this.^{17}^{17}17All results were obtained on the same laptop with Intel Core^{TM} i74702HQ CPU @ 2.20GHz 8 and 16GB RAM. Table II provides the time and number of optimization iterations required for each parameter combination.
The parameter combinations in Table II that do not correspond to gray data converged to the same (best case) optimal trajectory, which inverts the pendulum in s with . ^{18}^{18}18The cost of the optimal solution is the same when measured for horizons s since the incremental cost in (15) is negligible after inversion at s. Gray data indicate convergence to an alternate local minima with significantly worse cost. In all cases with s, SQP converges to local minima with costs . While iLQG tends to be less sensitive to local minima, it converges to the worst local minima with for both finer discretizations when s.
Since varying has no affect on SAC control synthesis (other than to specify the duration of the resulting trajectory), SAC control simulations included a variety of additional parameter combinations including receding horizons from s and different synthesis frequencies. These solutions yield costs ranging from , with the majority of solutions close or equal to . The SAC solution depicted in Fig. 7 achieves the best case cost of from receding horizons of s, with parameters and in (31), and with . SAC’s online controls perform constrained inversion as well as the best solutions from offline optimal control. Also, local minima significantly affect SQP and iLQG, while SAC tends to be less sensitive.
Considering the simplicity of this nonlinear example, it is noteworthy that both optimal control algorithms require significant time to converge. While iLQG ranges from minutes to an hour, with a discretization as coarse as SAC, SQP requires hours to compute the single, openloop trajectory in Fig. 7 using 4 CPU cores. Our C++ implementation of SAC obtains a solution equivalent to the best results online, with KHz feedback, in less than s using 1 CPU core.^{19}^{19}19As the MATLAB SQP and iLQG implementations utilize compiled and parallelized libraries, it is unclear how to provide a sidebyside comparison to the timing results in Table II. To illustrate that SAC is still fast in slower, interpreted code, we also implemented SAC in Mathematica. Computations require s and are linear w.r.t. to horizon, , and discretization, . Computing optimal actions in closedform, SAC achieves dramatic gains and avoids the iterative optimization process, which requires thousands of variables and constraints in SQP / iLQG.
Finally, we emphasize the closedloop nature of SAC compared to SQP, which provides an openloop trajectory, and iLQG, which yields an affine controller with both feedforward and feedback components. As the affine controller from iLQG is only valid near the optimal solution (SAC provides feedback from arbitrary states), SQP or iLQG must be applied in receding horizon for feedback comparable to SAC. For improved speed, [6] recommends a receding horizon implementation using suboptimal solutions from a fixed number (one) of iterations. However, in this simple nonlinear example, SQP / iLQG trajectories only resemble the final solution a few iterations before convergence. Hence, receding horizon implementations would likely result in poor local solutions.
IvA3 Sensitivity to Initial Conditions
Using a horizon of s, SAC was applied to invert the same, reduced cartpendulum system from a variety of initial conditions. Simulations used the quadratic tracking cost (31) and weight matrices from (15). A total of initial conditions for , uniformly sampled over rad, were paired with initial angular velocities at points uniformly sampled over .
To gauge performance, a s closedloop trajectory was constructed from each of the sampled initial conditions, and the state at the final time measured. If the final state was within rad of the inverted position and the absolute value of angular velocity was , the trajectory was judged to have successfully converged to the inverted equilibrium. Tests confirmed the SAC algorithm was able to successfully invert the pendulum within s from all initial conditions. The average computation time was s for each s trajectory on the test laptop.
IvB Pendubot and Acrobot
This section applies SAC for swingup control of the pendubot [35, 36, 37] and acrobot [38, 39, 40]. The pendubot is a twolink pendulum with an input torque that can be applied about the joint constraining the first (base) link. The acrobot is identical except the input torque acts about the second joint. The nonlinear dynamics and pendubot model parameters match those from simulations in [35] and experiments in [36]. The acrobot model parameters and dynamics are from simulations in [39] and in seminal work [38]. Figure 8 depicts the configuration variables and the model parameters are below. Each system’s state vector is with the relevant joint torque control, .
pendubot:  m = 1.0367 kg  m = 0.5549 kg 

l = 0.1508 m  l = 0.2667 m  
l = 0.1206 m  l = 0.1135 m  
I = 0.0031 kg m  I = 0.0035 kg m  
acrobot:  m = 1 kg  m = 1 kg 
l = 1 m  l = 2 m  
l = 0.5 m  l = 1 m  
I = 0.083 kg m  I = 0.33 kg m 
Due to their underactuated dynamics and many local minima, the pendubot and acrobot provide challenging test systems for control. As a popular approach, researchers often apply energy based methods for swingup control and switch to LQR controllers for stabilization in the vicinity of the inverted equilibrium (see [35, 41, 42, 38, 37, 39, 40]). We also use LQR controllers to stabilize the systems once near the inverted equilibrium. However, the results here show SAC can swingup both systems without special energy optimizing methods. The algorithm utilizes the quadratic state error based cost functional (31), without modification.
While the pendubot simulations in [35] require control torques up to a magnitude of for inversion, the experimental results in [36] perform inversion with motor torques restricted to . Hence, the pendubot inputs are constrained to . The acrobot torques are constrained with to invert the system using less than the required in [39].
Example simulations initialize each system at the downward, stable equilibrium and the desired position is the fully inverted equilibrium. Results are based on a feedback sampling rate of Hz for the pendubot with , , and and Hz for the acrobot with , , and . The LQR controllers derived offline for final stabilization, and were calculated about the inverted equilibrium to stabilize the pendubot and acrobot, respectively. We selected as the switching condition for pendubot stabilization.^{20}^{20}20More formally, a supervisory controller can switch between swingup and stabilizing based on the stabilizing region of attraction [43, 44]. The acrobot switched once all its configuration variables were .
Figure 9 shows the pendubot trajectory (the acrobot and pendubot solutions are in video attachment). In both cases, SAC swings each system close enough for successful stabilization. SAC inverts the pendubot using the same peak effort as in experiments from [36] and less than half that from simulations in [35]. Also, SAC requires only s to invert, while simulations in [35] needed s. Where [35] switches between separately derived controllers for pumping energy into, out of, and inverting the system before final stabilization, SAC performs all these tasks without any change in parameters and with the simple state tracking norm in (31). In the case of the acrobot, SAC inverts the system with the desired peak torque magnitude of ( the torque required in simulations from [39]). These closedloop results were computed online and required only 1.23 and s to compute s trajectories for the pendubot and acrobot systems, respectively.
To invert the pendubot and acrobot in minimal time and under the tight input constraints, the two most important parameters for tuning are the horizon length, , and the desired change in cost due to each control actuation, . All examples specify iteratively based on the current initial trajectory cost under the nominal (null) control as . Generally, because of the speed of SAC computations, good parameters values can be found relatively quickly using sampling. These pendubot and acrobot results use and similar horizons of s and s, respectively.
As mentioned earlier, optimal controllers typically use energy metrics for swingup of the pendubot and acrobot, as simple statetracking objectives yield local minima and convergence to undesirable solutions. It is noteworthy that SAC is able to invert both systems online and at high frequency considering optimal controllers (SQP/iLQG) generally fail under the same objective (31).
Part II: Extension to Hybrid Impulsive Systems
Part II of this paper extends SAC to systems with hybrid impulsive dynamics. These systems model a more general class of robotics problems in locomotion and manipulation, which involve contact and impacts. Such systems are challenging in optimal control and require specialized treatment and optimality conditions [45, 46, 32]. By planning each control action in a neighborhood of 0 duration, SAC avoids these issues and does not need to optimize control curves over discontinuous segments of trajectory.
V Control Synthesis for Hybrid Systems
The SAC algorithm introduced in Section II is limited to differentiable nonlinear systems because the mode insertion gradient (4) is subject to Assump. 1. Rather than rely on (4), this section directly develops a firstorder approximation of the variation in state and cost due to the perturbation in nominal control generated by each SAC action. We show the change in cost due to short SAC actions corresponds to the same mode insertion gradient formula (4), but in terms of an adjoint variable derived for hybrid impulsive systems. As a result (and a benefit of SAC), the SAC process described in Algorithm 1 remains unchanged for hybrid impulsive systems.
Section VI demonstrates the hybrid calculations on a system and then illustrates SAC in simulated online control of a bouncing ball. The section concludes with the springloaded inverted pendulum (SLIP) example from the introduction.
Va Prediction
As in Part I, SAC predicts the nominal motion of hybrid robotic systems and computes actions that improve trajectory cost over (receding) horizons. However, this section introduces new notation more appropriate for hybrid impulsive systems. Specifically, the classes of hybrid systems considered here are similar to those in [46] and are defined such that:^{21}^{21}21We assume actions are not applied at switching times, exclude Zeno behavior, and allow only a single element of to be active to exclude simultaneous events and potentially indeterminate behavior. These (and continuity) assumptions guarantee a local neighborhood exists such that perturbed system trajectories evolve through the same nominal location sequence (as in [46]).

is a finite set of locations.

is a family of state space manifolds indexed by .

is a family of control spaces.

is a family of maps to the tangent bundle, . The maps are the dynamics at .

is a family of sets of admissible control mappings.

is a family of consecutive subintervals corresponding to the time spent at each location .

The series of guards, , indicates transitions between locations and when . The state transitions according to a series of corresponding reset maps, .
For clarity, we avoid using numerical subscripts for the nominal control, . Instead, SAC predicts nominal motions assuming a (possibly null) nominal control, , is defined for every location, . So, for instance, in the SLIP example, one nominal control is defined in stance with another, possibly identical, control in flight. Note that as a hybrid robotic system applies controls, it evolves through an ordered sequence of locations, , e.g., from flight, to stance, to flight again, for a hopping SLIP.
With the initial location as , state and the collection , SAC’s prediction phase simulates
(16) 
starting with , to obtain the nominal state. Guards indicate when a transition should occur, i.e., they specify the end of each interval , and the next location, , based on which guard becomes . Reset maps define the initial condition in as . Through this process, the prediction phase defines the nominal location sequence, , intervals, , and the resulting nominal trajectory,
As before, SAC’s prediction phase concludes after computing the performance of the nominal trajectory. However, in this hybrid case we use an objective,
(17) 
with incremental and terminal costs defined in each location, and , such that and . Also, (17) is more general than (3), as it may depend on a control. Given resulting from nominal control, , (17) can be evaluated along the hybrid trajectory.
VB Computing Optimal Actions
Recall that each cycle of SAC seeks an action that improves nominal trajectory performance. This section defines the perturbed signal from an arbitrary SAC action of value as
assuming a short duration, . In this case, the magnitude of is specified as and the direction by an arbitrary positive scalar, . Because the perturbed system will eventually be evaluated as , assume the perturbation occurs when the nominal state, , is in the arbitrary location so that .^{22}^{22}22In the limit as , the SAC action is a needle perturbation [12]. Figure 10 depicts the perturbed control and the corresponding perturbed state.
To derive actions that maximally improve the nominal trajectory, Sec. IIB used the mode insertion gradient (4) to model the change in cost (3) relative to control duration. To accommodate the discontinuous trajectories of hybrid robotic systems, this section derives a model of the change in nominal cost (17) resulting from the perturbed, , by first modeling the effect of the control perturbation on state trajectory. To these ends, we define the firstorder perturbed state model,^{23}^{23}23The litteo notation, , indicates terms that are higher than first order in . These terms go to zero faster than firstorder terms in (18) as .
(18) 
The term is known as the variational equation [11, 12]. It is the direction of the state variation at time and is the magnitude. The following proposition provides formulas to compute the variational equation along hybrid impulsive trajectories. The derivation is in Appendix BA.
Proposition 1.
Assume the state, , of a hybrid system evolves from location to with the transition time, . If a control perturbation occurs at , as in Fig. 10, state variations propagate according to
(19) 
with the linear variational reset map,
(20)  
, , and is the linearization about the (known) nominal state trajectory at .
Note that if a nominal trajectory evolves through more than two locations, each transition requires reset of at transition times according to (20). Variations continue according to the dynamics linearized about the nominal trajectory. Repeating computations in rows of (19) between consecutive locations, variations can be propagated to .
With Prop. 1 to compute the perturbed state (18), the following section derives the cost variation resulting from the perturbed control, . We will show the formula is a generalization of the mode insertion gradient (4) that applies to a larger class of hybrid and impulsive systems.
VB1 Modeling the Cost Variation
To firstorder, the perturbed cost can be modeled as,
(21) 
where is the direction of variation in the cost function and is the magnitude. To simplify derivation of , we translate the hybrid system to Mayer form by appending the incremental costs, , to the dynamics vectors, , in each location. Objects with a bar refer to appended versions of hybrid system such that and
In Mayer form, the first component of the perturbed appended state, , is the perturbed integral cost in (17). Hence, the perturbed cost model (21) can be written as a sum,
which includes the perturbed terminal cost. The direction of variation in the cost is . Evaluating the derivative yields
(22) 
Note that provides the same information as the mode insertion gradient in (4) but applies to hybrid impulsive systems with resets. That is, provides the sensitivity of a cost, , to applying an action at as . Given a control perturbation at arbitrary time , one can calculate from (VB1) by propagating the appended state variation forward from to using Prop. 1. However, in searching for an optimal time to act, SAC needs to compare the cost variation produced by taking action, i.e., applying a control perturbation, at different times, (see Sec. IIC). The process is computationally intensive if is naively computed from the state variation. That is, considering two possible times, , when control perturbation may be applied, would require separate simulations of from and .^{24}^{24}24One may also apply linear transformations to the variational system simulated from the perturbation at based on superposition of the initial condition at . Variational reset maps would require similar transformation.
Since the mode insertion gradient (4) does not require resimulation to consider different application times in optimizing (12), we seek to express in a form that more closely resembles (4). To these ends, we now rewrite in terms of an adjoint system, ,^{25}^{25}25The adjoint belongs to the cotangent bundle, , such that . to the variational system .^{26}^{26}26See [11] for a similar derivation of an adjoint in the context of continuous variations. The systems are adjoint [11] if
(23) 
That is, we can derive by ensuring is constant. Note also that by choosing the terminal condition,
(24) 
(VB1) allows us to express in terms of the adjoint at the terminal time as . If we enforce (23) in deriving the adjoint, the inner product will be constant and equal to at times subsequent to the control perturbation, , and can be interpreted as the sensitivity of (17) to a state variation at time .
Assuming the system is at the (arbitrary) location at the perturbation time , the inner product in (23) yields
(25) 
which no longer depends on forward simulation of . The initial time, , of the control perturbation is arbitrary. Like in (4), once the adjoint, , is computed over , (VB1) can be evaluated at any number of different times, , to provide the cost sensitivity, , to the control perturbation in each case.
The following proposition derives an adjoint formula, , that maintains its interpretation as the cost sensitivity to state variations, as in .
Proposition 2.
Proof:
The adjoint is simulated backwards from a terminal condition (26) because this choice of terminal conditions yields