Curious iLQR: Resolving Uncertainty in Modelbased RL
Abstract
Curiosity as a means to explore during reinforcement learning problems has recently become very popular. However, very little progress has been made in utilizing curiosity for learning control. In this work, we propose a modelbased reinforcement learning (MBRL) framework that combines Bayesian modeling of the system dynamics with curious iLQR , an iterative LQR approach that considers model uncertainty. During trajectory optimization the curious iLQR attempts to minimize both the taskdependent cost and the uncertainty in the dynamics model. We demonstrate the approach on reaching tasks with 7DoF manipulators in simulation and on a real robot. Our experiments show that MBRL with curious iLQR reaches desired endeffector targets more reliably and with less system rollouts when learning a new task from scratch, and that the learned model generalizes better to new reaching tasks.
Exploration, Robots, Modelbased RL
1 Introduction
Modelbased reinforcement learning holds promise for sampleefficient learning on real robots [atkeson1997comparison]. The hope is that a model learned on a set of tasks can be used to learn to achieve new tasks faster. A challenge is then to ensure that the learned model generalizes beyond the specific tasks used to learn it. We believe that curiosity, as means of exploration, can help with this challenge. Though curiosity has been defined in various ways, it is generally considered a fundamental building block of human behaviour [loewenstein1994psychology] and essential for the development of autonomous behaviour [white1959motivation].
In this work, we take inspiration from [Kagan1972], which defines curiosity as motivation to resolve uncertainty in the environment. Following this definition, we postulate that by seeking out uncertainties, a robot is able to learn a model faster and therefore achieve lower costs more quickly compared to a noncurious robot. Keeping real robot experiments in mind, our goal is to develop a modelbased reinforcement learning (MBRL) algorithm that optimizes action sequences to not only minimize a task cost but also to reduce model uncertainty.
Specifically, our MBRL algorithm iterates between learning a probabilistic model of the robot dynamics and using that model to optimize local control policies (i.e. desired joint trajectories and feedback gains) via a curious version of the iterative Linear Quadratic Regulator (iLQR) [Tassa2014]. These policies are executed on the robot to gather new data to improve the dynamics model, closing the loop, as summarized in Figure 1.
In a nutshell, our curious iLQR aims at optimizing local policies that minimize the cost and explore parts of the model with high uncertainty. In order to encourage actions that explore states for which the dynamics model is uncertain, we incorporate the variance of the model predictions into the cost function evaluation. We propose a computationally efficient approach to incorporate this uncertainty by leveraging results on risksensitive optimal control [Jacobson1973, Farshidian]. [Jacobson1973] showed that optimizing actions with respect to the expected exponentiated cost directly takes into account higher order moments of the cost distribution while affording the explicit computation of the optimal control through Riccati equations. A risksensitive version of iLQR was recently proposed in [Farshidian]. While in these approaches the dynamic model is typically considered known and uncertainty comes from external disturbances, we propose to instead explicitly incorporate model uncertainty in the algorithm to favor the exploration of uncertain parts of the model. The proposed coupling between model learning and risksensitive control explicitly favours actions that resolve the uncertainty in the model while minimizing a taskrelated cost.
The contributions of this work are as follows: 1) We present a MBRL algorithm that learns a global probabilistic model of the dynamics of the robot from data and show how to utilize the uncertainty of the model for exploration through our curious iLQR optimization. 2) We demonstrate that our MBRL algorithm can scale to seven degree of freedom (DoF) manipulation platform in the real world without requiring demonstrations to initialize the MBRL loop. 3) The results show that using curiosity not only learns a better model faster on the initial task, but also that this model generalizes to new tasks more reliably. We perform an extensive evaluation in both simulation and on hardware.
2 Background
The goal of MBRL is to solve a task through learning a model of the true dynamics of the system that is subsequently used to solve an optimal control problem. The dynamics are described through where and are the state and action of the current time step, and the state at the next time step. represents the learned model of the dynamics. MBRL seeks to find a policy that minimizes a cost describing the desired behavior. Policy optimization can be performed in various ways such as trajectory sampling approaches as summarized and evaluated in [chua2018deep], random shooting methods, where trajectories are randomly chosen and evaluated with the learned model, or iterative LQG approaches, as in [levine2013guided]. Model learning also can be tackled with various methods. [levine2014learning] proposes learning linear models of the forward dynamics. In [chua2018deep] the dynamics are learned with an ensemble of neural networks. In general, the learned model of dynamics can be deterministic as in [levine2014learning] or probabilistic as in [deisenroth2011pilco, chua2018deep].
In MBRL, the learned model is used to simulate the robot behaviour when optimizing a trajectory or control policy. The learned model and the optimizer are task independent; this independence promises sample efficiency and generalization capabilities, as an already learned model can be reused for new tasks. As a side effect, however, the learned models quality can drastically affect the computed solution, as pointed out in [schaal1997learning, atkeson1997comparison, deisenroth2010efficient], since the policy is optimized given the current learned model and not by interacting with the robot. This effect is called model bias [deisenroth2010efficient] and can lead to a policy with drastically lower performance on the real robot. We argue that exploration can alleviate this modelbias. Resolving model uncertainty while optimizing for a task can encourage visiting states which resolve ambiguities in the learned model and therefore lead to both better models and control policies.
2.1 Intrinsic motivation for RL
The concept of curiosity has also been explored within the reinforcement learning literature from various angles. For example, a first attempt towards intrinsically motivated agents consisted in rewarding agents to minimize prediction errors of sensory events [Barto2004, Singh2004, Singh2010]. This initial work was designed for lowdimensional and discrete stateandaction spaces. Recently, curiosity as a means to better explore was also investigated for highdimensional continuous state spaces [Bellemare2016, Pathak2017]. Most of this work, including recent efforts towards curiosity driven robot learning [tanneberg2019intrinsic, laversanne2018curiosity], has defined curiosity as a function of model prediction error and within a modelfree reinforcement learning framework. In MBRL, [shyammax] recently proposed a measure of disagreement as exploration signal. [levine2016end] propose a maximum entropy exploration behaviour. Other algorithms which take uncertainty into account have been presented as well [deisenroth2011pilco, williams2017information, chua2018deep, boedecker2014approximate]. They differ in their choice of policy optimization, dynamics model representation and how they incorporate uncertainty. While [deisenroth2011pilco, chua2018deep] utilize model uncertainty to generate trajectory distributions, the uncertainty does not play an explicit role in the cost. Thus, these approaches do not explicitly optimize actions that resolve uncertainty in the current model of the dynamics, which is in contrast to the approach we propose in this paper.
2.2 Risk Sensitive stochastic optimal control
Risksensitive optimal control has a long history [Jacobson1973, WHITTLE:1981cd]. The central idea is to not only minimize the expectation of the performance objective under the stochastic dynamics but to also take into account higherorder moments of the cost distribution. The objective function takes the form of an exponential transformation of the performance criteria [Jacobson1973]. Here, is the performance index, which is a random variable, and a functional of the policy . is the expected value of over stochastic trajectories induced by the policy . accounts for the sensitivity of the cost to higher order moments (variance, skewness, etc.). Notably, from [Farshidian], the cost is , where and stand for variance and skewness and is the optimal task cost. When the optimal control will be riskaverse, favoring low costs with low variance but when the optimal control will be riskseeking, favoring low costs with high variance. , reduces to the standard, riskneutral, optimal control problem. Jacobson [Jacobson1973] originally demonstrated that for linear dynamics and quadratic costs the optimal control could be computed as the solution of a Riccati equation. Leveraging this result, [Farshidian] recently proposed a risksensitive extension of iLQR and [Ponton] further extended the approach to explicitly incorporate measurement noise.
3 MBRL via Curious iLQR
We present our approach to incorporate curious behaviour into a robot’s learning control loop. We are interested in minimizing a performance objective to achieve a desired robot behavior and approximate the true dynamics of the system with a discretetime dynamical system
(1) 
where denotes the state of the system at time step and represents the unknown model of the dynamics of the system and needs to be learned to achieve the desired task. The hypothesis we seek to confirm is that, by trying to explore uncertain parts of the model, our MBRL algorithm can learn a good dynamics model more quickly and find behaviors with higher performance. Our algorithm learns a probabilistic model of the system dynamics while concurrently optimizing a desired cost objective (Figure 1). It combines i) a riskseeking iLQR algorithm and ii) a probabilistic model of the dynamics. We describe the algorithm in the following. In particular, we show how to incorporate model uncertainty in risksensitive optimal control. Algorithm 1 shows the complete algorithm.
3.1 Risksensitive iLQR
Consider the following general nonlinear stochastic difference equation
(2) 
where maps a Brownian motion , with 0 mean and covariance , to system states. and the nonlinear map , typically model an unknown physical disturbance, while assuming a known model of the dynamics. When considering the exponentiated performance criteria (see 2 for more details), it has been shown that iLQR [Tassa2014] can be extended to risksensitive stochastic nonlinear optimal control problems [Farshidian]. The algorithm begins with a nominal state and control input trajectory and . The dynamics and cost are approximated to first and second order respectively along the nominal trajectories , in terms of state and control deviations , . Given a quadratic control cost, the locally optimal control law will be of the form . The underlying optimal control problem can be solved by using Bellman equation
(3) 
where is the quadratic cost, and by making the following quadratic approximation of the value function where and are functions of the partial derivatives of the value function.
Using the (timevarying) linear dynamics, the quadratic cost and the quadratic approximation of , and solving for the optimal control, we get
(4) 
where , , are given by
(5) 
where , and , , , and are the coefficients of the Taylor expansion of the cost function around the nominal trajectory. The corresponding backward recursions are
(6) 
(7) 
We note that this Riccati recursion is different from usual iLQR ([Tassa2014]) due to the presence of the covariance : the locally optimal control law explicitly depends on the noise uncertainty.
3.2 Curious iLQR: seeking out uncertainties
We use Gaussian Process (GP) regression to learn a probabilistic model of the dynamics in order to include the predictive variance from the model into the risksensitive iLQR algorithm. This predictive variance will then capture both model as well as measurement uncertainty. Specifically, we set where , are joint position and velocity vectors respectively. We let denote the vector of commanded torques. After each system rollout, we get a new set of tuples of states and actions as inputs and , joint accelerations at the next time step, as outputs which we add to our dataset on which we retrain the probabilistic dynamics model (see Algorithm 1). Once trained, the model produces a one step prediction of the joint accelerations of the robot as a probability distribution of the form
(8) 
where is the mean vector and the covariance matrix of the predictive distribution evaluated at . The outputs is the acceleration at the next time step which is numerically integrated to velocity and position This results in a Gaussian predictive distribution of the system dynamics
(9) 
It is the covariance matrix of this distribution that is incorporated into the Riccati equations from above. Specifically, during each MBRL iteration we optimize a new local feedback policy under the current dynamics model , via Algorithm 3. Each outer loop of the optimization, relinearizes with respect to the current nominal trajectories , in the backwardpass:
(10) 
with , and , where and are the analytical gradients of the probabilistic model prediction at each time step and weights how the uncertainty is propagated trough the system. We utilize the Riccati equations from Section 3.1, Equations (5) and (6), to optimize a new local feedback policy that utilizes the models predictive covariance . During the shooting phase of the algorithm, we integrate the nonlinear model from the GP and, to guarantee convergence to lower costs, we use a line search approach during the optimization. We leverage the riskseeking capabilities of the optimization by setting . The algorithm then favors costs with higher variance which is related to exploring regions of the state space with higher uncertainty in the dynamics. As a result, the agent is encouraged to select actions that explore uncertain regions of the dynamic model while still trying to reduce the task specific error. With the agent will ignore any uncertainty in the environment and therefore not explore. This is equivalent to standard iLQR optimization which ignores higher order statistics of the cost function. An overview of curious iLQR is given in Algorithm 3.
4 Illustration: Curious iLQR
In this section, we want to illustrate the advantages of using the motivation to resolve model uncertainty as an exploration tool. The objectives of this section is to give an intuitive example of the effect of our MBRL loop. In the following, and throughout the paper, we will refer to the agent that tries to resolve the uncertainty in its environment as curious and the one that is not following the uncertainty but only optimizes for the task related cost as normal.
The experimental platform is the OpenAI Gym Reacher environment [gym], a two degrees of freedom arm attached at the center of the scene. The goal of the task is to reach a target placed in the environment. In the experiments presented here, actions were optimized as described in section 3. The probabilistic model was learned with Gaussian Process (GP) regression using the GPy library [gpy2014]. The intuition behind this experiment is that, if an agent is driven to resolve uncertainty in its model, a better model of the system dynamics can be learned and therefore used to optimize a control sequence more reliably. Our hypothesis is that, the model learned by the curious agent is better by the end of learning and therefore we expect it to perform better when using it to solve new reaching tasks.
In Figure 2 we show the resulting endeffector trajectories of 8 consecutive MBRL iterations when optimizing to reach 2 different targets in sequence. We compare the behavior of the curious and normal agent in orange and blue, respectively. The targets are represented by the black dot. The curious agent tries to resolve the uncertainty within the model; the normal agent optimizes only for the task related cost. The normal agent seemingly reaches the first target after the second learning iteration; the curious agent only manages to reach the target during the third iteration. Interestingly, the exploration of the curious agent leverages the arm to reach the second target immediately and continues to reach it consistently thereafter. Figure 4 confirms the intuition that the curious agent has learned a better model than the normal agent. The figure shows the uncertainty and the prediction error (in endeffector space) of the model learned by the normal and the curious agent respectively. With curiosity, the learned model has overall lower uncertainty and prediction error values over the whole state space. We also compare our MBRL loop via curious iLQR optimization to: normal iLQR, a random exploration controller that adds Gaussian noise to the actions with mean and variance , a maximum entropy exploration behaviour following the approach proposed in [levine2016end] and PILCO [deisenroth2011pilco], in Figure 3. For these experiments, we initialize the model with only two data points collected randomly during motor babbling. We report the mean and the standard deviation across 10 trials, where each trial starts from a different initial joint configuration and is initialized with a different initial torque trajectory for optimization. In this scenario, with a very poor initial model quality, PILCO could not perform comparably to our MBRL loop. MBRL via curious iLQR outperforms all the other approaches. Furthermore it converges to solutions more reliably, as the variance between trials is lowest.
5 Experiments on highdimensional problems
Finally, the goal of this work is to learn motor skills on a torquecontrolled manipulator. Our experimental platform is the Sawyer robot from Rethink Robotics [sawyer], a 7 degrees of freedom manipulator. We start with experiments performed in the PyBullet physics simulator [pybullet]. In the next Section, we present results on the Sawyer robot arm hardware. Previous work such as [Farshidian] and [Ponton], which use risksensitive control variations of iLQR, primarily deal with simplified, low dimensional problems. Our experiments are conducted on a 7 degree of freedom robot, and the higher dimensional system adds some complexities to the approach: the gradients in Section 3.1 of the value function (Equations (6), (7)) tend to suffer from numerical illconditioning in highdimensions. We account for this issue with Tikhonov regularization: before inversion for calculating the optimal control we add a diagonal matrix to from Equation (5). The regularization parameter and the line search parameter are adapted following the Levenberg Marquardt heuristic [Tassa2014].
The goal of these experiments is to reach a desired target joint configuration . We show results for dynamics learned with GP regression (GPR), as well as initial results on ensemble of probabilistic neural networks (EPNN) following the approach presented in [lakshminarayanan2017simple]. When using GPs, a separate GP is trained for each output dimension.
We perform two sets of experiments, both in simulation and on hardware, to analyze the effect of using curiosity. Specifically, we believe that curiosity helps to find better solutions faster, because it guides exploration within the MBRL loop. Intuitively, curiosity helps to observe more diverse data samples during each rollout such that the model learns more about the dynamics.
We start with evaluation in simulation. Throughout all of the simulation experiments the optimization horizon was 150 time steps long at a sampling rate of Hz. Motor babbling was performed at the beginning for 0.5 by commanding random torques in each joint.
5.1 Reaching task from scratch
During the first set of experiments, we compare the performance when learning to reach a given target configuration from scratch. We compare our MBRL loop, as before, when using our curious iLQR optimization, regular iLQR, a random exploration controller and a maximum entropy exploration behaviour as described previously. PILCO was not able to learn the reaching movement on the 7DoF manipulator, so we exclude the results from the analysis. We perform this experiments for each kind of controller 5 times. Each run slightly perturbs the initial joint positions, and uses a random initial torque trajectories for the optimizer. For a given target joint configuration, iterations of optimizing the trajectory, running the trajectory on the system and updating the dynamics model, were performed. We perform this experiment for different target joint configurations. The following results are averaged across the runs ( runs per target).
The left most plot in Figure 5 compares performance of curious iLQR when using EPNN vs GPR for dynamics model learning, with and without curiosity. Our analysis shows that MBRL via curious iLQR improves performance over regular iLQR, for both model architectures. While the EPNN is more promising in scaling our approach, it currently requires more data to train. For this reason we will focus on the GP model for the remainder of our experimental section. In the 2nd to 4th plot of Figure 5, we compare the performance of curious iLQR against the above mentioned baselines for exploration during policy optimization, when using GPR for model learning. We compare the methods with respect to 3 metrics: final Euclidean endeffector distance (plot 2), iLQR cost (plot 3) and the predictive performance of the model on each rollout (plot 4). We can consistently see that, on average, MBRL via curious iLQR outperforms the other approaches: the error/cost is smaller and the solutions are more consistent across trials as the standard deviation is lower. This shows that curiosity can lead to faster learning of a new task, when learning from scratch. The results on the predictive performance of the model suggest that the quality of the model learned via curious iLQR might be better in terms of generalization. In the next section we present results that investigate this assumption.
5.2 Optimizing towards new targets after model learning
To confirm the hypothesis that the models learned by MBRL with curious iLQR generalize better, because they have explored the state space better, we decided to evaluate the learned dynamics models on a second set of experiments in which the robot tries to reach new, unseen targets. In this experiment we take the GP models learned during experiment 1 in Section 5.1 and use them to optimize trajectories to reach new targets that were not seen during training of the model. The results are shown in Figure 6, where four randomly chosen targets were set and the trajectory was optimized with regular iLQR. Note, that here we use regular iLQR to optimize for the trajectory so that we can better compare the models learned with/without curiosity in the previous set of experiments. Figure 6 shows the trajectory in end effector space for each coordinate dimension, together with the target end effector position as a solid horizontal line. The results are averaged across trials. The trials correspond to using one of the dynamics models at the end of Experiment 1 in Section 5.1. For each trial, the initial torque trajectory was initialized randomly, and the initial joint configuration slightly perturbed. The mean and the standard deviation of the optimized trajectories are computed across the models learned via MBRL with curious iLQR (first col), MBRL with normal iLQR (second col), iLQR with random exploration (third col) and iLQR with maximum entropy exploration bonus (fourth col.). We see that MBRL with curious iLQR results in a model that performs better when presented with a new target. The new targets are reached more reliably and precisely.
6 Real hardware experiments
Target  Distance to Target in m (Learning Iteration)  

Curious  Normal  
1  0.05 (6)  0.09 (2)  0.09 (3)  0.07 (3.67)  0.37 (8)  0.08 (2)  0.18 (8)  0.21 (7.0) 
2  0.05 (3)  0.09 (4)  0.09 (4)  0.07 (3.67)  0.20 (8)  0.08 (3)  0.09 (5)  0.12 (5.3) 
3  0.09 (6)  0.09 (4)  0.09 (3)  0.09 (4.33)  0.17 (8)  0.16 (8)  0.11 (8)  0.15 (8.0) 
4  0.04 (2)  0.07 (2)  0.07 (2)  0.06 (2.33)  0.04 (3)  0.08 (3)  0.05 (3)  0.06 (3.0) 
0.07 (3.5)  0.14 (5.9) 



Target  Curious  Normal  
1  0.20  0.67  
2  0.26  0.61  
3  0.25  1.06  
4  0.24  0.67  
5  0.37  0.49  
0.26  0.7 
The experimental platform for our hardware experiments is the Sawyer Robot [sawyer]. The purpose of the experiments was to demonstrate the applicability and the benefits of our algorithm on real hardware. We perform reaching experiments for 4 different target locations. Each experiment is started from scratch with no prior data, and the number of hardware experiments needed to reach the target are compared. The results are summarized in Table 1 and show the number of learning iterations needed in order to reach the target together with the precision in endeffector space. If the target was reached with a precision of below 10 cm, we would consider the task as achieved; if the target was not reached after the 8th learning iteration we would stop the experiment and consider the last endeffector position. We decided to terminate our experiments after the eight iteration as running the experiment on hardware was a lengthy process, as the GP training and the rollout would happen iteratively and GP training time increases with growing amount of data. Also, the reaching precision that we were able to achieve on hardware was significantly lower, compared to the simulation experiments.
We believe this is due to the data collected from the Sawyer robot, as we could only control the robot at which introduces inaccuracies when reading the effects of the sent torque command. We repeated each experiment three times to demonstrate the repeatability of our method as we expected measurement noise to affect solutions. From the table we can see that MBRL with curious iLQR would reach a target on average after 3.5 iterations with an average precision of 7 cm, compared to MBRL with regular iLQR that needed 5.9 iterations (often not ever reaching the target after eight iterations with the desired precision), with a precision of 14cm on average. As in simulation, similar to Experiment 5.2 we wanted to evaluate the quality of the learned models on new target positions. The results are summarized in Table 2 and are similar to what we observe in simulation: the models learned with curiosity, when used to optimize for new targets, can achieve higher precision than when using the models learned without curiosity.
7 Conclusion and future work
In this work, we presented a modelbased reinforcement learning algorithm that uses an optimal control framework to tradeoff between optimizing for a task specific cost and exploring around a locally optimal trajectory. Our algorithm explicitly encourages actions that seek out uncertainties in our model by incorporating them into the cost. By doing so, we are able to learn a model of the dynamics that achieves the task faster than MBRL with standard iLQR, and also transfers well to other tasks. We present experiments on a Sawyer robot in simulation and on hardware. In both sets of experiments, MBRL with curious iLQR (our approach) not only learns to achieve the specified task faster, but also generalizes to new tasks and initial conditions. All this points towards the conclusion that resolving dynamics uncertainty during modelbased reinforcement learning is indeed a powerful tool. As [loewenstein1994psychology] states, curiosity is a superficial affection: it can arise, diverge and end promptly. We were able to observe similar behaviour in our experiments as well, as can be seen in Figure 5: towards the end of learning, the exploration signal around the trajectory decreases and the robot would explore, deviate from the task slightly, before going back to exploiting once it is fairly certain about the dynamics. In the future, we would like to explore this direction by considering how to maintain exploration strategies. This could be helpful if the robot is still certain about a task, even though the environment or task has changed.
The authors would like to thank Stefan Schaal for his advise throughout the project. The authors thank the International Max Planck Research School for Intelligent Systems (IMPRSIS) for supporting Sarah Bechtle. Part of this work was supported by the MaxPlanck Society, New York University, the European Unions Horizon 2020 research and innovation program (grant agreement 780684 and European Research Councils grant 637935) and a Google Faculty Research Award.
References
Appendix A Background: Uncertainty in Optimal Control
In this appendix we review the fundamentals on which our work is built on. An approach enabling the inclusion of higher order statistics in the performance measure while keeping computations tractable, at least in the linear case, is to use exponential costs, as introduced by Jacobson [Jacobson1973]. [Farshidian] extended this work by deriving an iterative algorithm for continuoustime stochastic nonlinear optimal control problems called iterative Linear ExponentialQuadratic Optimal Control under Gaussian Process Noise (iLEG). Ponton et al. [Ponton] extended the work from [Farshidian] to cases where not only process noise is present but also measurement noise has to be taken into consideration. Next, we briefly present the details of the risksensitive iLQR algorithm, following [Farshidian], [Ponton] and [Jacobson1973].
a.1 Risksensitive iLQR
To include stochastic processes when optimizing a trajectory, it is necessary to consider a nonlinear optimal control problem where the system dynamics are defined by the following stochastic differential equation
(11) 
where represents the dynamics of the system and the stochasticity of the problem. is a Brownian motion with zero mean and covariance . Following the idea of [Jacobson1973] to include higher order momenta of the cost function, the objective function takes the form of an exponential transformation of the performance criteria :
(12) 
where is the performance index, which is a random variable, and a functional of the policy . is the expected value of over stochastic trajectories induced by the policy . accounts for the sensitivity of the cost to higher order moments (variance, skewness, etc). Notably from [Farshidian], the cost is
(13) 
where and stand for variance and skewness and is the optimal task cost. When the optimal control will be riskaverse, favoring low costs with low variance but when the optimal control will be riskseeking, favoring low costs with high variance. reduces to the standard, riskneutral, optimal control problem. We will exploit this property to create policies that explore regions with high uncertainty in the next sections.
Let the performance criteria be
(14) 
where
(15) 
is composed of the state cost and a quadratic control cost. is the cost at the final state.
a.2 Algorithm derivation
The algorithm begins with a nominal state and control input trajectory and . The dynamics are linearized and the cost is quadratized along , in terms of state and control deviations , leading to the linear dynamics approximation:
(16) 
where and
represents how the uncertainty propagates through the system. And the quadratic approximation of the cost
(17) 
with final cost
(18) 
where , , , and are the coefficients of the Taylor expansion of the cost function around the nominal trajectory. Given the quadratic control cost, the locally optimal control law will be of the form and the underlying optimal control problem can be solved by using Bellman equation
(19) 
which becomes is:
(20) 
and making the following quadratic approximation of the value function :
(21) 
where and are functions of the partial derivatives of the value function. Using the linear dynamics, the quadratic cost and the quadratic approximation of , and solving for the optimal control, we get
(22) 
where , , are given by
(23) 
The corresponding backward recursions are
(24) 
(25) 
With the recursions revert to the usual Ricatti recursions for iLQR [Tassa2014].
Appendix B Details for Experiments
Throughout the experiments we chose for the curious robot, and for the normal robot. The iLQR position error weight the velocity weight and the torque error weight . The regularization parameter was initialized with , the scaling factor for was and the maximum value allowed was .