AdaPT: Zero-Shot Adaptive Policy Transfer for Stochastic Dynamical Systems

Abstract

Model-free policy learning has enabled good performance on complex tasks that were previously intractable with traditional control techniques. However, the very high sample complexity of model-free methods prevents training directly on a physical target system, so such methods require an accurate model of the environment for training, which is rarely available for physical systems. Model mismatch due to dynamics parameter differences and unmodeled dynamics errors may therefore cause suboptimal or unsafe behavior upon direct transfer. We introduce the Adaptive Policy Transfer for Stochastic Dynamics (AdaPT) algorithm, which achieves provably safe and robust, dynamically-feasible zero-shot transfer of RL policies to new domains with dynamics error. AdaPT combines the strengths of offline policy learning in a black-box source simulator with online tube-based MPC to attenuate bounded dynamics mismatch between the source and target dynamics. AdaPT allows online transfer of policies trained solely in simulation, offline, to a family of unknown targets without fine-tuning. We also formally show that (i) AdaPT guarantees bounded state and control deviation through state-action tubes under relatively weak technical assumptions and (ii) AdaPT results in a bounded loss of reward accumulation relative to a policy trained and evaluated in the source environment. We evaluate AdaPT on two continuous, non-holonomic simulated dynamical systems with four different disturbance models, and find that AdaPT improves mean reward accrual by between 50% and 300% over direct policy transfer.

1 Introduction

Deep reinforcement learning (RL) has achieved remarkable advances in sequential decision making in recent years, often outperforming humans on tasks such as Atari games mnih2015human. However, model-free variants of deep RL are not directly applicable to physical systems because they exhibit poor sample complexity, often requiring millions of training examples on an accurate model of the environment. One approach to using model-free RL methods on robotic systems is thus to train in a relatively accurate simulator (a source domain) and transfer the policy to the physical robot (a target domain). In practice, this naive transfer may perform arbitrarily poorly, so online fine-tuning is often performed abbeel2006using. During this fine-tuning, however, the robot may behave unsafely. It is therefore desirable for a system to train in a simulator with slight model inaccuracies and still perform well on the target system on the first iteration. We refer to this as the zero-shot policy transfer problem.

The zero-shot transfer problem involves training a policy on a system possessing different dynamics than the target system, and evaluating performance as the average initial return in the target domain, without training in the target domain. This problem is challenging for robotic systems since simplified simulated models may not accurately capture all relevant dynamics phenomena, such as friction, structural compliance, and turbulence, as well as parametric uncertainty in the model. In spite of the renewed focus on this problem, few studies in deep policy adaptation offer insightful analysis or guarantees regarding feasibility, safety, and robustness in policy transfer.

In this paper, we introduce a new algorithm, which we refer to as AdaPT, that achieves provably safe and robust, dynamically-feasible zero-shot direct transfer of RL policies to new domains with dynamics mismatch. The key insight is to combine the global optimality of the learned policy with local stabilization from MPC-based methods to ensure dynamic feasibility, thereby building on the strengths of the two approaches. In the offline stage, AdaPT first computes a nominal trajectory (without disturbance) by executing the learned policy on the simulator dynamics. Then, in the online stage, AdaPT adapts the nominal trajectory to the target dynamics with an auxiliary MPC controller.

\runinhead

Statement of Contributions

  1. We develop the AdaPT algorithm, which allows online transfer of a policy trained solely in simulation offline to a family of unknown targets without fine-tuning.

  2. We also formally show that (i) AdaPT guarantees state and control safety through state-action tubes under the assumption of Lipschitz continuity of the divergence in dynamics, and (ii) AdaPT results in a bounded loss of reward accumulation relative to the nominal policy evaluated in the source environment.

  3. We evaluate AdaPT on two continuous, non-holonomic simulated dynamical systems with four different disturbance models, and find that AdaPT improves mean reward accrual by between 50% and 300% relative to direct policy transfer.

\runinhead

Organization This paper is structured as follows. In Section 2 we review related work in robust control, robust reinforcement learning, and transfer learning. In Section 3 we formally state the policy transfer problem. In Section 4 we present AdaPT and discuss algorithmic design features. In Section 5 we prove the accrued reward for AdaPT is lower bounded. In Section 6 we present experimental results on a simulated car environment and a two-link robotic manipulator, as well as present results for AdaPT with robust policy learning methods. Finally, in Section 7 we draw conclusions and discuss future directions.

2 Related Work and Background

A plethora of work in both learning and control theory has addressed the problem of varying system dynamics, especially in the context of safe policy transfer and robust control.

\runinhead

Transfer in reinforcement learning The problem of high sample complexity in reinforcement learning has generated considerable interest in policy transfer. Taylor et al. provide an excellent review of approaches to the transfer learning problem Taylor2009TransferLF. A series of approaches focused on reducing the number of rollouts performed on a physical robot, by alternating between policy improvement in simulation and physical rollouts abbeel2006using; levine2016end. In those works, a time-dependent term is added to the dynamics after each physical rollout to account for unmodeled error. This approach, however, does not address robustness in the initial transfer, and the system could sustain or cause damage before the online learning model converges.

The EPOpt algorithm rajeswaran2016epopt randomly samples dynamics parameters from a Gaussian distribution prior to each training run, and optimizes the reward for the worst-performing $\epsilon$-fraction of dynamics parameters. However, it is not clear how robust it is against disturbances not explicitly experienced in training. This approach is conceptually similar to that in mordatch2015ensemble, in which more traditional trajectory optimization methods are used with an ensemble of models to increase robustness. Similarly, mandlekar2017arpl and pinto2017robust use adversarial disturbances instead of random dynamics parameters for robust policy training. Tobin et al. tobin2017domain and Peng et al. peng2017sim randomize visual inputs and dynamics parameters, respectively. Bousmalis et al. bousmalis2017using meanwhile adapt rendered visual inputs to reality using a framework based on generative adversarial networks, as opposed to strictly randomizing them. While this may improve adaptation to a target environment in which these parameters are varied, it may not improve performance on dynamics changes outside of those varied; in effect, it does not mitigate errors due to the “unknown unknowns”.

Christiano et al. christiano2016transfer approach the transfer problem by training an inverse dynamics model on the target system and generating a nominal trajectory of states. The inverse dynamics model then generates actions to connect these states. However, there are no guarantees that an action exists in the target dynamics to connect two learned adjacent states. Moreover, this requires training on the target environment; in this work we consider zero-shot learning where this is not possible. Recently, the problem of transfer has been addressed in part by rapid test adaptation devin2016learning; rusu2016progressive. These approaches have focused on training modular networks that have both “task-specific” and “robot-specific” modules. This then allows the task-specific module to be efficiently swapped out and retrained. However, it is unclear how error in the learned model affects these methods.

In this work we aim to perform zero-shot policy transfer, and thus efficient model-based approaches are not directly applicable. However, our approach uses an auxiliary control scheme that leverages model learning for an approximate dynamics model. When online learning is possible, sample-efficient model-based reinforcement learning approaches can dramatically improve sample complexity, largely by leveraging tools from planning and optimal control kober2013reinforcement. However, these models require an accurate estimate of the true system dynamics in order to learn an effective policy. A variety of model classes have been used to represent system dynamics, such as neural networks heess2015learning, Gaussian processes deisenroth2011pilco, and local linear models gu2016continuous; levine2016end.

\runinhead

Robust control Trajectory optimization methods have been widely used for robotic control tassa2012synthesis. Among these optimization methods, model predictive control (MPC) is a class of online methods that perform trajectory optimization in a receding-horizon fashion neunert2016fast. This receding-horizon approach, in which a finite-horizon, open-loop trajectory optimization problem is continuously re-solved, results in an online control algorithm that is robust to disturbances. Several works have attempted to combine trajectory optimization methods with dynamics learning mitrovic2010adaptive and policy learning kahn2016plato. In this work, we develop an auxiliary robust MPC-based controller to guarantee robustness and performance for learned policies. Our method combines the strengths of deep policy networks schulman2015trust and tube-based MPC mayne2011tube to offer a controller with good performance as well as robustness guarantees.

3 Problem Setup and Preliminaries

Consider a finite-horizon Markov Decision Process (MDP) defined as a tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, r, P, T)$. Here $\mathcal{S}$ and $\mathcal{A}$ represent continuous, bounded state and action spaces for the agent, $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function that maps a state-action tuple to a scalar, and $T$ is the problem horizon. Finally, $P(s_{t+1} \mid s_t, a_t)$ is the transition distribution that captures the state transition dynamics in the environment and is a distribution over states conditioned on the previous state and action. The goal is to find a policy $\pi$ that maximizes the expected cumulative reward over the choice of policy:

\[
\pi^* = \arg\max_{\pi} \; \mathbb{E}\left[ \sum_{t=0}^{T} r(s_t, a_t) \right], \qquad a_t \sim \pi(\cdot \mid s_t), \quad s_{t+1} \sim P(\cdot \mid s_t, a_t). \tag{1}
\]

The above reflects a standard setup for policy optimization in continuous state and action spaces. In this work, we are interested in the case in which we only have an approximately correct environment, which we refer to as the source environment (e.g. a physics simulator). We may sample this simulator an unlimited number of times, but we wish to maximize performance on the first execution in a target environment. Without any assumptions on the correctness of the simulator, this problem is of course intractable as the two sets of dynamics may be arbitrarily different. However, relatively loose assumptions about the correctness of the simulator are very reasonable, based on the modeling fidelity of the simulator. We assume the simulator (denoted $\mathcal{M}_s$) has deterministic, twice continuously-differentiable dynamics $s_{t+1} = f(s_t, a_t)$. Then, let the dynamics of the target environment (denoted $\mathcal{M}_t$) be $s_{t+1} = f(s_t, a_t) + w_t$, for i.i.d. additive noise $w_t$ with compact, convex support $\mathcal{W}$ that contains the origin. Generally, the noise distribution may be state and action dependent, so this formulation reduces to standard formulations in both robust and stochastic control zhou1996robust. We assume all other components of the MDPs defining the source and target environments are the same (e.g. reward function). Finally, we assume the reward function is Lipschitz continuous, an assumption that we discuss in more detail in Section 5. Based on the above definitions, we can now state the problem we aim to solve.

\runinhead

Problem Statement Given the simulator dynamics $f$ and the problem defined by the MDP $\mathcal{M}_s$, we wish to learn a policy $\pi$ to maximize the reward accrued during operation in the target system, $\mathcal{M}_t$. Formally, if we write the realization of the disturbance at time $t$ as $w_t$, we wish to solve the problem:

\[
\max_{\pi} \; \sum_{t=0}^{T} r(s_t, a_t) \quad \text{s.t.} \quad s_{t+1} = f(s_t, a_t) + w_t, \quad a_t = \pi(s_t), \tag{2}
\]

while only having access to the simulator, $\mathcal{M}_s$, for training.
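To make the source/target distinction concrete, the following minimal Python sketch (ours, not from the paper) contrasts a deterministic simulator with a target whose dynamics add bounded i.i.d. noise; the toy single-integrator dynamics and the noise bound are purely illustrative assumptions.

```python
import numpy as np

def f(s, a):
    """Deterministic source (simulator) dynamics s_{t+1} = f(s_t, a_t)."""
    return s + 0.1 * a  # toy single-integrator with dt = 0.1

def f_target(s, a, rng, w_max=0.02):
    """Target dynamics: the source dynamics plus bounded additive noise w_t,
    drawn i.i.d. from a compact set containing the origin."""
    w = rng.uniform(-w_max, w_max, size=np.shape(s))
    return f(s, a) + w

rng = np.random.default_rng(0)
s_sim = s_real = np.zeros(2)
a = np.ones(2)
print(f(s_sim, a), f_target(s_real, a, rng))  # rollouts drift apart by at most w_max per step
```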

4 AdaPT: Adaptive Policy Transfer for Stochastic Dynamics

In this section we present the AdaPT algorithm for zero-shot transfer. A high-level view of the algorithm is presented in Algorithm 1. First, we assume that a policy is trained in simulation. Our approach is to first compute a nominal trajectory (without disturbance) by continuously executing the learned policy on the simulator dynamics. Then, when transferred to the target environment, we use an auxiliary model predictive control (MPC) controller to stabilize around this nominal trajectory. In this work, we use a reward formulation for operation in the primary environment (i.e., the aim is to maximize reward), and a cost formulation for the auxiliary controller (i.e., the aim is to minimize cost and thus minimize deviation from the nominal trajectory). This is in part to disambiguate the primary and auxiliary optimization problems.

\runinhead

Policy Training We use model-free policy optimization on the black-box simulated model. Our theoretical guarantees rely on the auxiliary controller avoiding saturation. Therefore, if a policy would operate near the limits of its control authority, and thus cause the auxiliary controller to saturate when used on the target environment, the policy is instead trained using restricted state and action spaces $\bar{\mathcal{S}} \subseteq \mathcal{S}$, $\bar{\mathcal{A}} \subseteq \mathcal{A}$. We let $\bar{\mathcal{M}}_s$ denote an MDP with restricted state and action spaces. This follows the approach of mayne2011tube, where it is used to prevent auxiliary controller saturation. Intuitively, restricting the state and action space ensures any nominal trajectory in those spaces can be stabilized by the auxiliary controller. Conversely, if saturation is rare, restricting these sets is unnecessary.

AdaPT is invariant to the choice of policy optimization method. During online operation, a nominal trajectory $\{\bar{s}_t, \bar{a}_t\}_{t=0}^{T}$ is generated by rolling out the policy on the simulator dynamics, $f$. The auxiliary controller then tracks this trajectory in the target environment.
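A minimal sketch of this nominal rollout is given below; the placeholder linear policy, toy dynamics, and horizon are illustrative assumptions and stand in for the trained policy and the simulator.

```python
import numpy as np

def rollout_nominal(policy, f, s0, T):
    """Roll out the learned policy on the simulator dynamics f (no disturbance)
    to obtain the nominal trajectory {bar_s_t, bar_a_t} tracked by the auxiliary
    controller."""
    states, actions = [np.asarray(s0, dtype=float)], []
    for _ in range(T):
        a = policy(states[-1])
        actions.append(a)
        states.append(f(states[-1], a))
    return np.stack(states), np.stack(actions)

# Placeholder linear policy and toy dynamics purely for illustration.
policy = lambda s: -0.5 * s
f = lambda s, a: s + 0.1 * a
bar_s, bar_a = rollout_nominal(policy, f, s0=np.array([1.0, -2.0]), T=100)
```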

\runinhead

Approximate Dynamics Model Because the model of the simulator is treated as a black box, it is impractical to use within the auxiliary controller's optimal control framework. As such, we rely on an approximate model of the dynamics, separate from the simulator dynamics $f$, which we refer to as $\hat{f}$. The specific representation of the model (e.g. linear model, feedforward neural network, etc.) depends on both the accuracy required as well as the method used to solve the auxiliary control problem. This model may be either learned from the simulator, or based on prior knowledge. A substantial body of literature exists on dynamics model learning from black-box systems moerland2017learning. Alternatively, this model may be based on external knowledge, either from learning a dynamics model in advance from the target system or from, for example, a physical model of the system.
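As one deliberately simple instance of such a learned model, the sketch below fits an affine dynamics model by least squares to transitions sampled from a black-box simulator; the toy simulator and sample count are assumptions for illustration.

```python
import numpy as np

def fit_linear_dynamics(S, A, S_next):
    """Least-squares fit of an affine model s_{t+1} ~ A s_t + B a_t + c from
    transitions sampled from the black-box simulator."""
    X = np.hstack([S, A, np.ones((S.shape[0], 1))])   # regressors [s, a, 1]
    W, *_ = np.linalg.lstsq(X, S_next, rcond=None)    # shape (n + m + 1, n)
    n, m = S.shape[1], A.shape[1]
    return W[:n].T, W[n:n + m].T, W[-1]               # A (n x n), B (n x m), c (n,)

# Sample transitions by querying the simulator at random states and actions.
rng = np.random.default_rng(0)
sim = lambda s, a: s + 0.1 * a                        # stand-in for the black-box simulator
S = rng.normal(size=(500, 2))
A = rng.normal(size=(500, 2))
S_next = np.array([sim(s, a) for s, a in zip(S, A)])
A_mat, B_mat, c = fit_linear_dynamics(S, A, S_next)
```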

\runinhead

Auxiliary MPC Controller Our auxiliary nonlinear MPC controller is based on that of mayne2011tube. Specifically, we write the auxiliary control problem:

\[
\min_{a_t, \ldots, a_{t+H-1}} \;\; \sum_{k=t}^{t+H-1} \Big[ (s_k - \bar{s}_k)^\top Q \, (s_k - \bar{s}_k) + (a_k - \bar{a}_k)^\top R \, (a_k - \bar{a}_k) \Big] \quad \text{s.t.} \quad s_{k+1} = \hat{f}(s_k, a_k), \tag{3}
\]

where $H$ is the MPC horizon, $Q$ and $R$ are positive definite cost matrices for the state deviation and control deviation respectively, and $\hat{f}$ is the approximate dynamics model. In some cases, this problem is convex, but generally it may not be. In our experiments, this optimization problem is solved with iterative relinearization based on todorov2005generalized. However, whereas they iteratively linearize the nonlinear optimal control problem and solve an LQR problem over the full horizon of the problem, we explicitly solve the problem over the MPC horizon. We do not consider terminal state costs or constraints. This formulation of the auxiliary controller by mayne2011tube allows us to guarantee, under our assumptions, that our true state stays in a tube around the nominal trajectory, where the tube is defined by level sets of the value function (the details of this are addressed in Section 5).

0:  Input: Source Env: $\mathcal{M}_s$, Target Env: $\mathcal{M}_t$, Initial State: $s_0$
Offline:
1:  Compute constrained state and action spaces $\bar{\mathcal{S}} \subseteq \mathcal{S}$, $\bar{\mathcal{A}} \subseteq \mathcal{A}$
2:  Train a policy $\pi$ for $\mathcal{M}_s$ using the constrained $\bar{\mathcal{S}}$, $\bar{\mathcal{A}}$
3:  Fit an approximate dynamics model $\hat{f}$ for $\mathcal{M}_s$
Online:
4:  Roll out $\pi$ on $f$ from $s_0$ to get the nominal trajectory $\{\bar{s}_t, \bar{a}_t\}_{t=0}^{T}$
5:  Initialize the target state to $s_0$
6:  for $t = 0, \ldots, T-1$ do
7:      Solve the auxiliary MPC problem (3) from the current state with iterative linearization to get $a_{t:t+H-1}$
8:      Roll out the first action of the sequence, $a_t$, on $\mathcal{M}_t$
9:  end for
Algorithm 1 Adaptive Policy Transfer for Stochastic Dynamics (AdaPT)
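A minimal Python sketch of the online loop of Algorithm 1, under our own assumptions: `aux_mpc` stands in for the auxiliary controller of Eq. (3), and the toy target dynamics and open-loop stub are purely illustrative.

```python
import numpy as np

def adapt_online(target_step, aux_mpc, bar_s, bar_a, s0, T):
    """Online stage of Algorithm 1: at each step, solve the auxiliary MPC
    problem to track the nominal trajectory from the current state, then apply
    only the first action of the returned sequence on the target system."""
    s = np.asarray(s0, dtype=float)
    realized = [s]
    for t in range(T):
        a_seq = aux_mpc(s, bar_s[t:], bar_a[t:])   # tracking problem over the MPC horizon
        s = target_step(s, a_seq[0])               # execute first action on the target
        realized.append(s)
    return np.stack(realized)

# Toy usage with an open-loop stub in place of the MPC solver.
target_step = lambda s, a: s + 0.1 * a + 0.01
mpc_stub = lambda s, bs, ba: ba
bar_s, bar_a = np.zeros((11, 1)), np.zeros((10, 1))
traj = adapt_online(target_step, mpc_stub, bar_s, bar_a, s0=np.zeros(1), T=10)
```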

The solution to the MPC problem is iterative. First, we linearize around the nominal trajectory $\{\bar{s}_k, \bar{a}_k\}$. We introduce the notation $(s_k^j, a_k^j)$ for the solution from the previous iteration $j$; these are initialized as $s_k^0 = \bar{s}_k$ and $a_k^0 = \bar{a}_k$. Then, we introduce the deviations from this solution as

\[
\delta s_k = s_k - s_k^j, \qquad \delta a_k = a_k - a_k^j. \tag{4}
\]

Then, taking the linearization of our dynamics

\[
\delta s_{k+1} = A_k \, \delta s_k + B_k \, \delta a_k, \qquad A_k = \frac{\partial \hat{f}}{\partial s}\bigg|_{(s_k^j, a_k^j)}, \quad B_k = \frac{\partial \hat{f}}{\partial a}\bigg|_{(s_k^j, a_k^j)}, \tag{5}
\]

we can rewrite the MPC problem as:

\[
\min_{\delta a_t, \ldots, \delta a_{t+H-1}} \;\; \sum_{k=t}^{t+H-1} \Big[ (s_k^j + \delta s_k - \bar{s}_k)^\top Q \, (s_k^j + \delta s_k - \bar{s}_k) + (a_k^j + \delta a_k - \bar{a}_k)^\top R \, (a_k^j + \delta a_k - \bar{a}_k) \Big] \quad \text{s.t. Eq. (5)}. \tag{6}
\]

Note that the optimization is over the action deviations $\delta a_k$. Once this problem is solved, we use the update rule $s_k^{j+1} = s_k^j + \delta s_k$, $a_k^{j+1} = a_k^j + \delta a_k$. The dynamics are then relinearized, and this is iterated until convergence. Because we use iterative linearization to solve the nonlinear program, it is necessary to choose a dynamics representation that is efficiently linearizable. In our experiments, we use an analytical nonlinear dynamics representation for which the linearization can be computed analytically (see webb2013kinodynamic for details), as well as fit a time-varying linear model. Choices such as, e.g., a Gaussian process representation, may be expensive to linearize.
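The sketch below illustrates a single linearize-and-solve pass of this procedure under simplifying assumptions of our own: the subproblem is treated as unconstrained so it can be solved by least squares (rather than the constrained formulation above), Jacobians are taken by finite differences, and the toy usage at the end is illustrative.

```python
import numpy as np

def sqrt_psd(M):
    """Symmetric square root of a positive semidefinite matrix."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

def jacobians(f_hat, s, a, eps=1e-5):
    """Central-difference Jacobians A = df/ds, B = df/da of the approximate model."""
    n, m = len(s), len(a)
    A, B = np.zeros((n, n)), np.zeros((n, m))
    for i in range(n):
        d = np.zeros(n); d[i] = eps
        A[:, i] = (f_hat(s + d, a) - f_hat(s - d, a)) / (2 * eps)
    for i in range(m):
        d = np.zeros(m); d[i] = eps
        B[:, i] = (f_hat(s, a + d) - f_hat(s, a - d)) / (2 * eps)
    return A, B

def mpc_iteration(f_hat, s_ref, a_ref, bar_s, bar_a, Q, R):
    """One linearize-and-solve pass: linearize f_hat around the previous iterate
    (s_ref, a_ref), then solve the resulting unconstrained quadratic tracking
    problem in the action deviations by least squares. Repeating this pass until
    the deviations are small gives the MPC action sequence; s_ref[0] plays the
    role of the current measured state."""
    H, m = a_ref.shape
    n = s_ref.shape[1]
    sqQ, sqR = sqrt_psd(Q), sqrt_psd(R)
    M = np.zeros((n, H * m))      # sensitivity of delta_s_{k+1} w.r.t. stacked delta_a
    v = np.zeros(n)               # affine part of delta_s_{k+1}
    rows, offs = [], []
    for k in range(H):
        A, B = jacobians(f_hat, s_ref[k], a_ref[k])
        d = f_hat(s_ref[k], a_ref[k]) - s_ref[k + 1]          # linearization defect
        Mn = A @ M
        Mn[:, k * m:(k + 1) * m] += B
        vn = A @ v + d
        # state-deviation residual at step k+1 and control-deviation residual at step k
        rows.append(sqQ @ Mn); offs.append(sqQ @ (vn + s_ref[k + 1] - bar_s[k + 1]))
        U = np.zeros((m, H * m)); U[:, k * m:(k + 1) * m] = np.eye(m)
        rows.append(sqR @ U); offs.append(sqR @ (a_ref[k] - bar_a[k]))
        M, v = Mn, vn
    C, c0 = np.vstack(rows), np.concatenate(offs)
    da = np.linalg.lstsq(C, -c0, rcond=None)[0].reshape(H, m)
    return a_ref + da                                          # updated action iterate

# One pass on a toy problem (the first iterate is initialized at the nominal trajectory).
f_hat = lambda s, a: s + 0.1 * a
bar_s, bar_a = np.zeros((6, 2)), np.zeros((5, 2))
s_ref, a_ref = bar_s + 0.1, bar_a.copy()
a_new = mpc_iteration(f_hat, s_ref, a_ref, bar_s, bar_a, np.eye(2), 1e-3 * np.eye(2))
```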

5 AdaPT: Analysis

The following section develops the main theoretical analysis of this study. We will first show that AdaPT results in bounded deviation from the nominal trajectory under a set of technical assumptions. This result is then used to show that the deviation between the cumulative reward of the realized rollout on the target system and the cumulative reward of the nominal trajectory on the source environment is upper bounded. That is to say, the decrease in performance below the ideal case is bounded.

5.1 Safety Analysis in AdaPT

Using the notation from Eq. (3), let us denote the optimal value of the auxiliary problem at time $t$ as $J_t^*(s_t)$, for MPC horizon $H$. This is the minimum cost associated with the finite-horizon problem that is solved iteratively in the MPC framework. Note that this problem is solved with the approximate dynamics model; in the case where the approximate dynamics model exactly matches the target environment model, the solution to this problem would have value zero, as the trajectory would be tracked exactly. We denote by $a_t^*$ the action at time $t$ from the solution to the MPC problem. Then, let $\mathcal{L}_t = \{ s : J_t^*(s) \le d \}$ denote the $d$-level set of the cost function, for some constant $d$ (see mayne2011tube), at time $t$.

We assume the error between the approximate dynamics representation and the simulator dynamics is outer approximated by a compact, convex set $\mathcal{D}$ that contains the origin. Therefore, $f(s, a) - \hat{f}(s, a) \in \mathcal{D}$ for all state-action pairs $(s, a)$. In the case where the state and action spaces are bounded, there always exists an outer approximation which satisfies this assumption. However, in practice, it is likely considerably smaller than this worst case.

Let $\mathcal{T} = \{ \mathcal{L}_t \}_{t=0}^{T}$ denote a state tube defined by the time-dependent level sets of the auxiliary cost function. We may now state our first result, noting that the auxiliary stabilizing policy is the result of the MPC optimization problem relying solely on the approximate dynamics $\hat{f}$.

Theorem 5.1.

Every state trajectory generated by the target dynamics $f(s, a) + w$, $w \in \mathcal{W}$, with initial state $s_0$, lies in the state tube $\mathcal{T}$.

Proof.

Note that $\mathcal{W} \oplus \mathcal{D}$, where $\oplus$ denotes a Minkowski sum, is compact, convex, and contains the origin. Then, the result follows from Theorem 1 of mayne2011tube by replacing their set of disturbances with $\mathcal{W} \oplus \mathcal{D}$. ∎

The above result, combined with Proposition 2i of mayne2011tube, which shows that $\| s_t - \bar{s}_t \| \le c \sqrt{J_t^*(s_t)}$ for some constant $c$, gives insight into the safety of AdaPT. In particular, note that for an arbitrarily long trajectory, the realized trajectory stays in a region around the nominal trajectory despite using an inaccurate dynamics representation in the MPC optimization problem. While this result shows that the deviation from the nominal trajectory is bounded, it does not allow construction of explicit tubes in the state space, and thus cannot be used directly for guarantees on obstacle avoidance. Recent work by Singh et al. singh2017robust establishes tubes of this form, and this is thus a promising extension of the AdaPT framework.

5.2 Robustness Analysis in AdaPT

We will now show that, due to the boundedness of the state deviation, the deviation in the total accrued reward over a rollout on the target system is bounded. Let $V^{\pi}(s)$ and $V^{\mathrm{AdaPT}}(s)$ denote the value functions associated with some state $s$ and the primary policy executed on the source environment, and the AdaPT policy on the target environment, respectively.

Theorem 5.2.

Under the technical assumptions made in Sections 3 and 5.1, $V^{\pi}(s_0) - V^{\mathrm{AdaPT}}(s_0) \le \kappa \, T \sqrt{d}$, where $\kappa$ is some constant and $d$ is the level-set constant from Section 5.1.

Proof.

First, note that $V^{\pi}(s_0) - V^{\mathrm{AdaPT}}(s_0) = \sum_{t=0}^{T} \big( r(\bar{s}_t, \bar{a}_t) - r(s_t, a_t) \big)$, where $\{\bar{s}_t, \bar{a}_t\}$ is the nominal trajectory rolled out on the source environment and $\{s_t, a_t\}$ is the realized trajectory on the target environment. Additionally, letting $\delta s_t = s_t - \bar{s}_t$ and $\delta a_t = a_t - \bar{a}_t$, note that similarly to Proposition 2i of mayne2011tube, we can establish a bound on the action deviation from the nominal trajectory in terms of the auxiliary cost function, $\| \delta a_t \|^2 \le J_t^*(s_t) / \lambda$ for all $t$ (where the norm is in the Euclidean sense), by taking $\lambda$ as the minimum eigenvalue of $R$. By the Lipschitz continuity of the reward function, and writing the Lipschitz constant of the reward function as $L_r$, we have

\[
V^{\pi}(s_0) - V^{\mathrm{AdaPT}}(s_0) \le \sum_{t=0}^{T} L_r \big( \| \delta s_t \| + \| \delta a_t \| \big). \tag{7}
\]

Then, noting that the quadratic auxiliary cost function is always positive, the result is proved by applying Proposition 2i of mayne2011tube and the bound on action deviation from the nominal to the right hand side of Equation 7. ∎

This result may then be restated in terms of the disturbance sets. Let $\bar{w} = \max_{w \in \mathcal{W} \oplus \mathcal{D}} \| w \|$.

Theorem 5.3.

Under the same technical assumptions as Theorem 5.2, the following inequality holds for some constant $\kappa'$:

\[
V^{\pi}(s_0) - V^{\mathrm{AdaPT}}(s_0) \le \kappa' \, T \, \bar{w}. \tag{8}
\]
Proof.

The result follows from combining Theorem 5.2 with Proposition 4ii of mayne2011tube. ∎

These results show that, along with guarantees on spatial deviation from the nominal trajectory, we may also establish bounds on the accrued reward relative to what is received with the nominal policy in the source environment, in effect demonstrating that zero-shot transfer is possible. The Lipschitz continuity of the reward function is essential to this result, and this illustrates several aspects of the policy transfer problem.

The AdaPT algorithm is based on tracking a nominal rollout in simulation. Critical to the success of this approach is gradual variation of the reward function. Sparse reward structures are likely to fail with this approach to transfer, as tracking the nominal trajectory, even relatively closely, may result in poor reward. On the other hand, a slowly varying reward function, even if tracked relatively roughly, may result in accrued reward close to that of the nominal rollout on the source environment.

6 Experimental Evaluation

Figure 1: Mean cumulative cost over the length of an episode for 50 episodes on the kinematic car environment. The confidence intervals are standard error. The costs are normalized to the cost of the naive policy being rolled out on the simulated environment from the same initial state, to allow more direct comparison across episodes. The naive rollout is the nominal policy executed on the target environment. The disturbances tested are a) a hill landscape, b) additive control error, c) process noise, and d) dynamics parameter error.

We implemented AdaPT on a nonlinear, non-holonomic kinematic car model with a 5-dimensional state space, as well as on the Reacher environment in OpenAI's Gym brockman2016. We train policies using Trust Region Policy Optimization (TRPO) schulman2015trust. The policy is parameterized as a neural network with two hidden layers, each with 64 units and ReLU nonlinearities. In all of our experiments, we report normalized cost. This is the cost (negative reward) realized by a trial in the target environment, divided by the cost of the nominal policy rolled out on the simulated environment from the same initial state. This allows more direct comparison between episodes for environments with stochastic initial states. We generally compare the naive trial, which is the nominal policy rolled out on the target environment (i.e., standard transfer with no adaptation), to AdaPT.
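For concreteness, the sketch below computes normalized cumulative-cost curves of this form; the exact aggregation (cumulative ratio, standard error across episodes) reflects our reading of the figure captions rather than released code.

```python
import numpy as np

def normalized_cost_curve(costs_target, costs_nominal_sim):
    """Normalized cumulative cost: the running cost of each trial on the target
    environment divided by the running cost of the nominal policy rolled out on
    the simulator from the same initial state.
    Inputs are per-step costs with shape (num_episodes, episode_len)."""
    ratio = np.cumsum(costs_target, axis=1) / np.cumsum(costs_nominal_sim, axis=1)
    mean = ratio.mean(axis=0)
    stderr = ratio.std(axis=0, ddof=1) / np.sqrt(ratio.shape[0])
    return mean, stderr

# Illustrative usage with random positive per-step costs for 50 episodes of 100 steps.
rng = np.random.default_rng(0)
mean, se = normalized_cost_curve(1.0 + rng.random((50, 100)), 1.0 + rng.random((50, 100)))
```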

6.1 Environment I: 5-D Car

Figure 2: a) The car environment with the paths for the ideal case (nominal policy on simulated environment), the naive case (nominal policy on the target environment), and the AdaPT case (AdaPT on the target environment). The contour plot shows the height of the added hills. Figures (b) and (c) show the normalized cost for varying disturbances due to additive control error and dynamics parameter error for b) the naive case and c) AdaPT (lower is better). In addition to the listed disturbances, disturbances due to hills are also added for all trials. Each grid cell is the mean of 50 trials.

We implemented AdaPT on a nonlinear, nonholonomic 5-dimensional kinematic car model that has been used previously in the motion planning literature webb2013kinodynamic. Specifically, the car has state $s = (x, y, \theta, v, \kappa)$, where $x$ and $y$ denote coordinates in the plane, $\theta$ denotes heading angle, $v$ denotes speed, and $\kappa$ denotes trajectory curvature. The system has dynamics $\dot{s} = (v \cos\theta, \; v \sin\theta, \; v\kappa, \; a_v, \; a_\kappa)$, where $a_v$ and $a_\kappa$ are the controlled acceleration and curvature derivative. The policy is trained to minimize a quadratic cost in the state and action, which results in policies that drive to the origin. In each trial, the vehicle is initialized in a random state, with random planar position and heading, and zero velocity and curvature.

Our auxiliary controller used an MPC horizon of 2 seconds (20 timesteps). Our state deviation penalty matrix, $Q$, has value 1 along the diagonal for the position terms, and zero elsewhere. Thus, the MPC controller penalizes only deviation in position. The matrix $R$ had small terms along the diagonal to slightly penalize control deviations. In practice, this mostly acts as a small regularizing term to prevent large oscillatory control inputs by the auxiliary controller. The behavior of the auxiliary controller is dependent on the matrices $Q$ and $R$, but in practice good performance may be achieved across environments with fixed values. Because of the relatively high quadratic penalty on control in policy training, the nominal policy rarely approaches the control limits. Thus, we can set $\bar{\mathcal{A}} = \mathcal{A}$, and we set $\bar{\mathcal{S}} = \mathcal{S}$. For our dynamics model, we use the linearization reported in webb2013kinodynamic.
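A sketch of this environment and the auxiliary-controller weights, under assumptions of our own: Euler integration with dt = 0.1 s (inferred from the 2 s, 20-timestep horizon) and an illustrative small value on the diagonal of R, which is not specified above.

```python
import numpy as np

def car_dynamics(s, a):
    """State derivative of the 5-D kinematic car: s = (x, y, theta, v, kappa),
    a = (acceleration, curvature derivative)."""
    x, y, th, v, ka = s
    return np.array([v * np.cos(th), v * np.sin(th), v * ka, a[0], a[1]])

def car_step(s, a, dt=0.1):
    """Euler integration step; dt = 0.1 s follows from the 2 s / 20-step MPC horizon."""
    return s + dt * car_dynamics(s, a)

# Auxiliary-controller weights for this environment: penalize only position
# deviation in Q; R has small diagonal terms (the exact value here is illustrative).
Q = np.diag([1.0, 1.0, 0.0, 0.0, 0.0])
R = 1e-3 * np.eye(2)
```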

6.2 Disturbance Models

We investigate four disturbance types:

  1. Environmental Uncertainty: We add randomly-generated hills to the target environment such that the car experiences accelerations due to gravity. This noise is therefore state-dependent. Figure 2(a) shows a randomly generated landscape. We randomly sample 20 hills in the workspace, each of which is circular and has varying radius and height. The vehicle experiences an additive longitudinal acceleration proportional to the landscape slope at its current location, and no lateral acceleration.

  2. Control noise: Nonzero-mean additive control error drawn from a uniform distribution.

  3. Process noise: Additive, zero-mean noise added to the state. Disturbances are drawn from a uniform distribution.

  4. Dynamics parameter error: We add a scaling factor $\alpha$ to the control of the system, such that the applied control is $\alpha a_t$.

For the last three, the noise terms were drawn i.i.d. from a uniform distribution at each time $t$. These disturbances were investigated both independently (Figure 1) and simultaneously (Figure 2). Figure 1 shows the normalized cost of the naive transfer and AdaPT for each of the four disturbances individually.
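The sketch below gives illustrative implementations of the last three disturbance types; the magnitudes, control-error bias, and scaling factor are assumptions, not the values used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def control_noise(a, bias=0.05, w_max=0.1):
    """Disturbance 2: nonzero-mean additive control error, drawn per step."""
    return a + bias + rng.uniform(-w_max, w_max, size=np.shape(a))

def process_noise(s, w_max=0.01):
    """Disturbance 3: zero-mean additive state noise, drawn per step."""
    return s + rng.uniform(-w_max, w_max, size=np.shape(s))

def parameter_error(a, alpha=1.2):
    """Disturbance 4: scaling factor applied to the control."""
    return alpha * np.asarray(a)

# Disturbance 1 (hills) is state-dependent: a longitudinal acceleration
# proportional to the local slope of the sampled hill landscape.
```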

In our experiments, AdaPT substantially outperforms naive transfer, achieving normalized costs 1.5-5x smaller. Additionally, the variance of the naive transfer is considerably higher, whereas the realized cost for AdaPT is clustered relatively tightly around one (i.e., approximately equal cost to the ideal case). In Figure 1d, the normalized cost of AdaPT is actually below one, implying that the transferred policy performs better than the ideal policy. In fact, this is because the dynamics parameter error in this trial results in oversteer, and so the agent accumulates less cost to turn to face the goal than in the nominal environment. Thus, pointing toward the goal is more “cost-efficient” in the target environment. The performance of direct transfer and AdaPT with varying parameter error may be seen in Figure 2(b) and Figure 2(c). In Figure 2(a), a case is presented where the direct policy transfer fails to make it up a hill, whereas the AdaPT policy tracks the nominal trajectory well.

6.3 AdaPT with Robust Offline Policy

Figure 3: Mean cumulative cost over the length of an episode for 50 episodes on the 5-D car environment, using an EPOpt-1 robust policy. The confidence intervals are standard error. The disturbances tested are a) a hill landscape, b) additive control error, c) process noise, and d) dynamics parameter error. The details of each noise source are presented in the supplementary materials.

Whereas AdaPT's approach to policy transfer relies primarily on stabilization in the target environment, recent work has focused on training robust policies in the source domain and then performing direct transfer. In the EPOpt policy training framework rajeswaran2016epopt, an agent is trained over a family of MDPs in which model parameters are drawn from distributions before each training rollout. Then, a Conditional Value-at-Risk (CVaR) objective function is optimized, as opposed to an expectation over all training runs. We apply AdaPT on top of an EPOpt-1 policy (equivalent to optimizing expected reward, with model parameters varying), and find that for disturbances explicitly varied during training, the performance of EPOpt-only transfer and AdaPT are comparable. We add multiplicative scaling parameters to the components of the state derivative. Each of these is drawn from a Gaussian distribution before each training run, and is fixed during the training run. Although some of these parameters do not have a physical interpretation, the resulting policies are still robust to both parametric error and process noise. In these experiments, an MPC horizon of 1 second (10 timesteps) was used. The matrices $Q$ and $R$ were set as in Section 6.1.
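A sketch of this per-rollout parameter sampling; the parameter names, means, and standard deviations are illustrative assumptions.

```python
import numpy as np

def sample_dynamics_params(rng, means, stds):
    """Draw one set of scaling parameters for the state derivative before a
    training rollout; the draw is held fixed for the whole rollout."""
    return {name: rng.normal(mu, stds[name]) for name, mu in means.items()}

rng = np.random.default_rng(0)
means = {"xdot_scale": 1.0, "ydot_scale": 1.0, "thetadot_scale": 1.0,
         "vdot_scale": 1.0, "kappadot_scale": 1.0}
stds = {name: 0.1 for name in means}
for episode in range(3):
    params = sample_dynamics_params(rng, means, stds)
    print(params)   # one training rollout would be run with these perturbed dynamics
```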

In Figure 3, the comparison between the direct transfer of EPOpt policies and AdaPT policies is presented. We can see that, for disturbances that are explicitly considered in training (specifically, model parameter error), naive transfer performs slightly better, albeit with higher variance. For other disturbances, like the addition of hills or control noise, AdaPT significantly outperforms the directly-transferred policy. Indeed, while the performance of the AdaPT policy is comparable to direct transfer for disturbances directly considered in training, unmodelled disturbances are handled substantially better by AdaPT. Thus, to extract the best performance, we recommend applying the two approaches in tandem.

6.4 Environment II: 2-Link Planar Robot Arm

We next evaluate the performance of AdaPT on the Reacher environment of Gym brockman2016. This environment is a two-link robotic arm that receives reward for proximity to a goal in the workspace, and is penalized for control effort. The state is a vector of the sine and cosine of the joint angles, the joint angular velocities, the goal position, and the distance from the arm end-effector to the goal. In our tests, we fix one goal location and one starting state across all trials to more directly compare between them. As such, the variance in normalized cost in these experiments is much smaller than in the car experiments. For these experiments, the same noise models were used as in the previous section, with the exception of the “hills” disturbance.

As an approximate dynamics model used for the auxiliary controller, we use the time-varying linear dynamics from levine2014learning. This model is fit from rollouts in simulation. Since this model is linear, the MPC problem is convex, and the iterative MPC converges in one iteration. These dynamics are only valid in a local region, and thus must be fit for each desired policy rollout in the target environment. However, since the model is fit from simulation data, it is generated quickly and inexpensively.
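A simplified sketch of fitting such a time-varying linear model by per-timestep least squares; note that levine2014learning fit linear-Gaussian dynamics with additional structure (e.g., a GMM prior), which is omitted here.

```python
import numpy as np

def fit_time_varying_linear(S, A):
    """Fit a separate affine model s_{t+1} ~ A_t s_t + B_t a_t + c_t for every
    timestep, from a batch of simulator rollouts.
    S: (N, T+1, n) states, A: (N, T, m) actions, for N rollouts of length T."""
    N, T1, n = S.shape
    m = A.shape[2]
    models = []
    for t in range(T1 - 1):
        X = np.hstack([S[:, t], A[:, t], np.ones((N, 1))])
        W, *_ = np.linalg.lstsq(X, S[:, t + 1], rcond=None)
        models.append((W[:n].T, W[n:n + m].T, W[-1]))   # (A_t, B_t, c_t)
    return models

# Example: fit from 20 rollouts of a toy simulator with random actions.
rng = np.random.default_rng(0)
sim = lambda s, a: s + 0.1 * a
S = np.zeros((20, 11, 2)); A = rng.normal(size=(20, 10, 2))
S[:, 0] = rng.normal(size=(20, 2))
for t in range(10):
    S[:, t + 1] = sim(S[:, t], A[:, t])
models = fit_time_varying_linear(S, A)
```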

The results for normalized cost comparisons between naive transfer and AdaPT are presented in Figure 4. We note that AdaPT achieves significantly lower cost for additive control error and process noise, but achieves comparable cost for parameter error. The parameter varied in these experiments was the mass of the links of the arm. The effect of this change is to increase the inertia of the manipulator as a whole. In fact, this can be seen in Figure 4c. While the cost of the naive transfer increases slowly, the cost of the AdaPT trials spikes partway through the episode. As AdaPT is tracking the nominal trajectory, it increases the torque applied, thus suffering a penalty for the increased control action, but resulting in better tracking of the nominal trajectory.

A similar effect can be observed in Figure 4a. The added control error actually drives the manipulator toward the goal, resulting in the dip in the normalized cost for both trajectories. However, the naive policy overshoots the goal substantially, and thus accrues substantially higher normalized cost than the AdaPT experiments.

Figure 4: Mean cumulative cost over the length of an episode for 50 episodes on the reacher environment. The confidence intervals are standard error. The costs are normalized to the cost of the naive policy being rolled out on the simulated environment from the same initial state, to allow more direct comparison across episodes. The naive rollout is the nominal policy executed on the target environment. The disturbances tested are a) additive control error, b) process noise, and c) dynamics parameter error.

7 Conclusion and Outlook

We have presented the AdaPT algorithm for robust transfer of learned policies to target environments with unmodeled disturbances or mismatched model parameters. We have also provided guarantees lower-bounding the reward accrued in the target environment by a policy transferred with AdaPT. Our results were demonstrated on two different environments, with four disturbance models investigated. We additionally discussed the use of robust policies with AdaPT. The results presented demonstrate that this method improves performance on unmodeled disturbances by 50-300%.

In this work, we construct our analysis on the Lipschitz continuity of the dynamics. Indeed, the smoothness of the deviation in dynamics is fundamental to the guarantees we establish. An immediate avenue of future investigation is, therefore, extending the work presented here to environments with discrete and discontinuous dynamics, such as contact. Recently, Farshidian et al. farshidian2016sequential have extended an iteratively linearized nonlinear MPC, similar to ours, to switching linear systems, which may serve as a foundation on which to develop a contact-capable formulation of AdaPT. Additionally, recent work has developed robust, receding-horizon tube controllers that allow the establishment of explicit tubes in the state space singh2017robust. This approach has the potential to establish explicit safety constraints for operation in cluttered environments. Finally, these methods will also be evaluated on physical systems.

\printbibliography

Footnotes

  1. email: jharrison@stanford.edu
  2. email: pavone@stanford.edu