TAMPC: A Controller for Escaping Traps in Novel Environments


We propose an approach to online model adaptation and control in the challenging case of hybrid and discontinuous dynamics where actions may lead to difficult-to-escape “trap” states. We first learn dynamics for a given system from training data which does not contain unexpected traps (since we do not know what traps will be encountered online). These “nominal” dynamics allow us to perform tasks under ideal conditions, but when unexpected traps arise in execution, we must find a way to adapt our dynamics and control strategy and continue attempting the task. Our approach, Trap-Aware Model Predictive Control (TAMPC), is a two-level hierarchical control algorithm that reasons about traps and non-nominal dynamics to decide between goal-seeking and recovery policies. An important requirement of our method is the ability to recognize nominal dynamics even when we encounter data that is out-of-distribution with respect to the training data. We achieve this by learning a representation for dynamics that exploits invariance in the nominal environment, thus allowing better generalization. We evaluate our method on simulated planar pushing and peg-in-hole as well as real robot peg-in-hole problems against adaptive control and reinforcement learning baselines, where traps arise due to unexpected obstacles that we only observe through contact. Our results show that our method significantly outperforms the baselines in all tested scenarios.


Machine Learning for Robot Control, Reactive and Sensor-Based Control.

I Introduction

In this paper, we study the problem of controlling robots in environments with unforeseen traps. Informally, traps are states in which the robot cannot make progress towards its goal and is effectively “stuck”. Traps are common in robotics and can arise due to a wide variety of factors including geometric constraints imposed by obstacles, frictional locking effects, and nonholonomic dynamics leading to dropped degrees of freedom [4][10][15]. In this paper, we consider instances of trap dynamics in planar pushing with walls and peg-in-hole with unmodeled obstructions to the goal.

Developing generalizable algorithms that rapidly adapt to handle the wide variety of traps encountered by robots is important to their deployment in the real world. Two central challenges in online adaptation to environments with traps are the data-efficiency requirements and the lack of progress towards the goal for actions inside of traps. In this paper, our key insight is that we can address these challenges by explicitly reasoning over different dynamic modes, in particular traps, together with contingent recovery policies, organized as a hierarchical controller. We introduce an online modeling and control method that balances naive optimism and pessimism when encountering novel dynamics. Our method learns a dynamics representation that infers underlying invariances and exploits it when possible (optimism) while treading carefully to escape and avoid potential traps in non-nominal dynamics (pessimism). Specifically, we:

  1. Introduce a novel representation architecture for generalizing dynamics and show how it allows our method to achieve good performance on out-of-distribution data;

  2. Introduce Trap-Aware Model Predictive Control (TAMPC), a novel control algorithm that reasons about non-nominal dynamics and traps to reach goals in novel environments with traps;

  3. Evaluate our method on real robot and simulated peg-in-hole, and simulated planar pushing tasks with traps where adaptive control and reinforcement learning baselines achieve less than 5% success rate while our method achieves 75% success rate (85% with tuning) averaged across environments and tasks.

Our approach addresses limitations in state-of-the-art techniques [12][13] that cannot be expected to perform well in these scenarios because they have little to no contingencies for handling traps – action sequences that escape traps incur high short-term costs and are much less likely to be discovered.

Fig. 1: TAMPC on peg-in-hole tasks with obstacles near the goal. The robot has no visual sensing (cannot anticipate walls) and has not encountered walls in the training data. Path segments show (1) the initial direct trajectory to goal, (2) detecting non-nominal dynamics from reaction forces and exploration of it by sliding along the wall, (3) detecting a trap due to the inability to make progress using non-nominal dynamics, (4) recovery to nominal dynamics, (5) going around seen traps to goal, (6) spiraling to find the precise location of the hole, and (7) sliding against the wall (non-nominal) towards the goal.

II Problem Statement

Let ( dimensional) denote the state, ( dimensional) denote the control, and denote the change in state. We assume that we are given a state distance function and a control similarity function . The state distance function measures changes in state and progress towards the goal implies . Let denote the nominal dynamics (i.e. for the system under ideal conditions) and denote the novel dynamics in a novel environment. We assume they are discrete-time in the form:


where denotes the error dynamics. We assume the error dynamics are relatively small (with respect to the nominal dynamics) except for non-nominal regions for which:


We are given some goal-directed cost function which induces trap states . These states are basins of attraction for inside of which no action sequence below some length can decrease the cost where can be considered a given trap’s difficulty. Note that traps are defined by both the environment and cost function.

In trap states, the controller benefits from considering a horizon greater than and may need to incur high short-term cost to escape and have a chance at eventually reaching the goal. This is especially difficult when the trap dynamics are unknown; i.e. when . We consider the case where traps are unforeseen and novel environments where may be discontinuous, such as in the case of unexpected contact during manipulation. Partial observability and limited sensing significantly increase the difficulty of the problem because they prevent the anticipation and preemptive avoidance of novel traps. The key challenges to accomplishing a task in this scenario are 1) the identification of traps; and 2) a reactive control scheme that can escape from traps and make progress toward the goal.

To learn the nominal dynamics , we assume that some data has been collected in a nominal environment. We are given transition sequences from the nominal environment , . Starting from an initial state in a novel environment, which may contain traps that were not present in the nominal environment, our objective is to minimize the number of steps to navigate to a given goal set :


III Related Work

In this section, we review related work to the two main components of this paper: handling traps and generalizing models to out-of-distribution (OOD) dynamics.

Handling Traps: Traps can arise due to many factors including nonholonomic dynamics, frictional locking, and geometric constraints [4][10][15]. In the control literature, viability control [22] can be applied to nonholonomic systems in the case where the dynamics of entering traps (leaving the viability set) is known. While viability control enforces staying inside the viability set as a hard constraint, our method can be interpreted as online learning of the non-viable set with a policy for returning to the viable set.

Another way to handle traps is with exploration, such as through intrinsic curiosity (based on model predictability) [25][23][5], state counting [26][2], or stochastic policies [20]. However, trap dynamics can be difficult to escape from and can require returning to dynamics the model predicts well (which therefore receive little exploration reward). We show in an ablation test how random actions are insufficient for escaping traps in the tasks we consider. Similar to [9], we remember interesting past states and attempt to recover to them before resuming our control algorithm. However, [9] requires a simulator to reset to the previous state and designs domain-specific state interest scores. We do not require resetting, and effectively allow for online adaptation of the state score based on how much movement the induced policy generates while inside a trap.

Generalizing models to OOD Dynamics: Actor-critic methods have long been used to control nonlinear systems with unknown dynamics online [8][3]. However, they tend to do poorly in discontinuous dynamics and are sample inefficient compared to our method as we show in experimentation. Another approach to handle novel environments online is with locally-fitted models [17], which [12] shows could be mixed with a global model and used in model predictive control (MPC). Similar to this approach, our method adapts a nominal model to local dynamics; however, we do not always exploit the dynamics to reach the goal.

One goal of our method is to generalize the nominal dynamics to OOD novel environments. A popular approach for doing this is explicitly learning to be robust to expected variations across training and test environments. This includes methods such as meta-learning [11][18], domain randomization [27][21], sim-to-real [24][14], and other transfer learning [29] methods. These methods are unsuitable for this problem because our training data contains only nominal dynamics, whereas they need a diverse set of non-nominal dynamics. Instead, we learn a robust, or “disentangled”, representation [6] of the system under which models can generalize. This idea is active in computer vision, where learning based on invariance has become popular [1][16]. Using similar ideas, we present a novel architecture for learning invariant representations for dynamics models.

IV Method

Our approach is composed of two components: offline representation learning and online control. First, we present how we learn a representation that allows for generalization by exploiting inherent invariances inferred from the nominal data, shown in Fig. 2. Second, we present Trap-Aware Model Predictive Control (TAMPC), a two-level hierarchical MPC method shown in Fig. 3. The high-level controller explicitly reasons about non-nominal dynamics and traps, deciding when to exploit the dynamics and when to recover to familiar ones by outputting the model and cost function the low-level controller uses to compute control signals.

IV-A Offline: Invariant Representation for Dynamics

Fig. 2: Architecture for learning an invariant representation. Grey boxes are given data, green boxes are parameterized transforms, white boxes are computed values, and red dotted lines are losses.

In this section, our objective is to exploit potential underlying invariances in state-control space when predicting system dynamics to achieve better generalization to unseen data. More formally, our representation of dynamics is composed of an invariant transformation and a predictive module, shown in Fig. 2. The invariant transformation maps the state-action space into a latent space that the predictive module operates on to produce a latent output that is then mapped back into the state space using . We parameterize the transformations with neural networks and build in two mechanisms to promote a meaningful and descriptive latent space:

First, we impose to create an information bottleneck. This encourages the latent space to throw out information not relevant for predicting dynamics and to discover invariances—variations in the original state-action space that can be safely ignored. Further, we limit the capacity of to be significantly smaller than that of such that the dynamics take on a simple form.

Second, we partially decouple and by learning in an autoencoder fashion: we require that can reconstruct with partial information from . Passing in at reconstruction allows the dynamics pathway to ignore some information in . Much like the autoencoder architecture, we require the encoded to match the output from dynamics. To further improve generalization, we restrict information passed from to when reconstructing with a dimension reducing transform . This mechanism has the effect of reducing compounding errors when is OOD. These two mechanisms yield the following expressions:

and their associated batch-based reconstruction loss and matching loss:

These losses are ratios relative to the norm of the quantity we are trying to match to avoid decreasing loss by scaling the representation. In addition to these two losses, we apply Variance Risk Extrapolation (V-REx [16]) to explicitly penalize the variance in loss across the trajectories:


We train on Eq. (4) using gradient descent with an annealing strategy for suggested by [16]. For minibatches, we adjust Eq. (4) to be over only the trajectories that are in the batch.
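As a rough illustration of the two ideas above (losses expressed as ratios, and a penalty on the variance of the loss across trajectories), the following sketch uses plain Python. The function names, the squared-norm form of the ratio, and the use of the mean (rather than the sum) of per-trajectory losses are assumptions made for the example and are not taken from the training code.

```python
from statistics import mean, pvariance


def relative_loss(prediction, target, eps=1e-12):
    """Squared error divided by the squared norm of the target.

    Dividing by the target norm means the loss cannot be reduced
    simply by shrinking the scale of the representation.
    """
    error = sum((p - t) ** 2 for p, t in zip(prediction, target))
    scale = sum(t ** 2 for t in target) + eps
    return error / scale


def vrex_objective(trajectory_losses, beta):
    """Average loss plus beta times the variance of the loss across trajectories.

    Each entry of trajectory_losses is the loss on one training trajectory;
    beta is the (assumed) weight on the variance penalty.
    """
    return mean(trajectory_losses) + beta * pvariance(trajectory_losses)
```

With equal per-trajectory losses the penalty vanishes, so the objective only differs from the average loss when some trajectories are fit noticeably better than others.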

After learning the transforms, we replace with a higher capacity model and fine-tune it on the nominal data with just . Learning the invariant representation this way avoids compounding errors from OOD inputs while allowing our model to be robust to variations unnecessary for dynamics. For further details, please see App. C.

IV-B Online: Trap-Aware MPC

Online, we require a controller that has two important properties. First, it should incorporate strategies to escape from and avoid detected traps. Second, it should iteratively improve its dynamics representation, in particular when encountering previously unseen modes. To address these challenges, our approach uses a two-level hierarchical controller where the high-level controller described in Alg. 1 explicitly reasons about non-nominal dynamics and traps, outputting the dynamics model and cost function that the low-level controller uses to compute control. This structure allows a variety of low-level MPC designs that can be specialized to the task or dynamics if domain knowledge is available.

Fig. 3: High-level control architecture.

TAMPC operates in either an exploit or recovery mode, judiciously chosen based on the type of dynamics encountered. When in nominal dynamics, the algorithm exploits its confidence in predictions. When encountering non-nominal dynamics, it attempts to exploit a local approximation built online until it detects entering a trap, at which time recovery is triggered. This approach attempts to balance between a potentially overly-conservative treatment of all non-nominal dynamics as traps and an overly-optimistic approach of exploiting all non-nominal dynamics assuming goal progress is always possible.

The first step in striking this balance is identifying non-nominal dynamics. Here, we evaluate the nominal model prediction error against observed states (“nominal model accuracy” block in Fig. 3 and the nominal-dynamics checks in Alg. 1):


where is a designed tolerance threshold and is the expected model error per dimension computed on the training data. To handle jitter, we consider transitions from consecutive time steps.
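A minimal sketch of such a check is given below. It assumes the prediction error is divided per dimension by the expected model error and then compared, through its 2-norm, against the tolerance; the function names and argument layout are hypothetical. The default tolerance of 10 and window of 3 are the block-pushing values of "nominal error tolerance" and "nominal window" in Tab. II.

```python
import math


def normalized_error(observed_dx, predicted_dx, expected_error):
    """2-norm of the prediction error, scaled per dimension by the expected error."""
    return math.sqrt(sum(((o - p) / e) ** 2
                         for o, p, e in zip(observed_dx, predicted_dx, expected_error)))


def is_nominal(observed_dx, predicted_dx, expected_error, tolerance=10.0):
    """A single transition counts as nominal if its scaled error is within tolerance."""
    return normalized_error(observed_dx, predicted_dx, expected_error) <= tolerance


def recently_nominal(nominal_flags, window=3):
    """Guard against jitter: require the last `window` transitions to all be nominal."""
    return len(nominal_flags) >= window and all(nominal_flags[-window:])
```

For example, an observed change of 0.1 where the model predicted 0 and expects errors of about 0.05 yields a scaled error of 2, which is well within a tolerance of 10.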

When in non-nominal dynamics, the controller needs to differentiate between dynamics it can navigate to reach the goal vs. traps and adapt its dynamics models accordingly. By definition, a trap is the inability to make progress towards the goal despite the MPC taking actions according to a goal-directed cost. The controller detects this by monitoring the maximum one-step state distance in nominal dynamics, , and comparing it against the average state distance to recent states: (depicted in “movement fast enough” block of Fig. 3). Here, is the current time, is time measured from the start of non-nominal dynamics or the end of last recovery, whichever is more recent, up to to handle jitter by ensuring state distances are measured over windows of at least size . is how much slower the controller tolerates moving in non-nominal dynamics. For more details see Alg. 2.
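The trap test described above can be sketched as follows, under the assumption that the average state distance from a past state to the current state is the distance between them divided by the number of elapsed steps. The function signature and variable names are hypothetical; the defaults of 5 and 0.6 are the "min dynamics window" and "trap tolerance" values listed in Tab. II.

```python
def entering_trap(states, start_index, nominal_step_distance, distance,
                  min_window=5, trap_tolerance=0.6):
    """Return True if average movement over any long-enough recent window is too slow.

    states: visited states, with the current state last.
    start_index: index of the start of non-nominal dynamics or the end of the
        last recovery, whichever is more recent.
    nominal_step_distance: maximum one-step state distance seen in nominal dynamics.
    distance: state distance function taking two states.
    """
    t = len(states) - 1
    # Only compare against states at least min_window steps old to handle jitter.
    for a in range(start_index, t - min_window + 1):
        average_step = distance(states[t], states[a]) / (t - a)
        if average_step < trap_tolerance * nominal_step_distance:
            return True
    return False
```

With a one-dimensional state and `distance = lambda a, b: abs(a - b)`, a trajectory that keeps advancing one unit per step is not flagged, whereas one that stalls after a few steps is.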

Given :  cost, , MPC, parameters from Tab. II
1 control mode set to nominal,  ,  ,  ,  ,  ,   MAB arms random convex combs.
  while  acceptable threshold do
2       if control mode is nominal then
3             if not nominal from Eq. 5 then
4                   control mode set to non-nominal; initialize GP with
5             else
                   // anneal trap cost
8       else
             fit to include; was nominal last steps // Eq. 5
9             if EnteringTrap() then
10                   control mode set to recovery; expand according to Eq. 9
11             if control mode is recovery then
12                   if n or Recovered() then
13                         control mode set to non-nominal; normalize so
14                   else if  steps since last arm pull then
15                         reward last arm pulled with; Thompson sample an arm
17             if n then
18                   control mode set to nominal
      MPC.model if control mode is nominal else
      MPC.cost if is else // Eq. 11 and 10
20       , apply and observe from env
Algorithm 1 TAMPC high-level control loop

Our model adaptation strategy, for both non-nominal dynamics and traps, is to mix the nominal model with an online fitted local model. Rather than the linear models considered in prior work [12], we add an estimate of the error dynamics represented as a Gaussian Process (GP) to the output of the nominal model. Using a GP provides a sample-efficient model that captures non-linear dynamics. To mitigate over-generalizing local non-nominal dynamics to where nominal dynamics holds, we fit it to only the last points since entering non-nominal dynamics. We also avoid over-generalizing the invariance that holds in nominal dynamics by constructing the GP model in the original state-control space. Our total dynamics is then
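To make the mixing concrete, the sketch below adds the posterior mean of a zero-mean GP with an RBF kernel, fit to the nominal model's residuals over a sliding window, to the nominal prediction. It is a pure-Python illustration for a single scalar output with fixed hyperparameters, not the gpytorch model described in App. C; the class and function names are hypothetical, and the window default of 50 is the "local model window" value from Tab. II.

```python
import math


def rbf(a, b, lengthscale=1.0):
    """RBF kernel between two state-control vectors."""
    sq = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-0.5 * sq / lengthscale ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


class ResidualGP:
    """Zero-mean GP on the nominal model's error, fit to the most recent points."""

    def __init__(self, window=50, noise=1e-4):
        self.window, self.noise = window, noise
        self.inputs, self.residuals, self.alpha = [], [], []

    def add(self, xu, residual):
        # Keep only the last `window` points so old data does not leak in.
        self.inputs = (self.inputs + [xu])[-self.window:]
        self.residuals = (self.residuals + [residual])[-self.window:]
        K = [[rbf(a, b) + (self.noise if i == j else 0.0)
              for j, b in enumerate(self.inputs)]
             for i, a in enumerate(self.inputs)]
        self.alpha = solve(K, self.residuals)

    def mean(self, xu):
        return sum(w * rbf(xu, a) for w, a in zip(self.alpha, self.inputs))


def total_dynamics(nominal, gp, xu):
    """Nominal prediction plus the GP estimate of the error dynamics."""
    return nominal(xu) + gp.mean(xu)
```

Because the GP has zero prior mean, its correction fades far from the fitted points and the prediction falls back to the nominal model there, which is one way to limit over-generalizing local non-nominal dynamics.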


When exploiting dynamics to navigate towards the goal, we regularize the goal-directed cost with a trap set cost to avoid previously seen traps (see the annealed trap cost in Alg. 1). This trap set is expanded whenever we detect entering a trap. We add to it the transition with the lowest ratio of actual to expected movement (from one-step prediction) since the end of last recovery:


To handle traps close to the goal, we only penalize revisiting trap states if similar actions are to be taken. With the control similarity function we formulate the cost
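As a rough illustration of a cost with this property, the sketch below sums, over stored trap transitions, the control similarity divided by the squared state distance. This specific functional form, the clipping of cosine similarity at zero, and all names are assumptions made for the example.

```python
import math


def control_similarity(u, v):
    """Cosine similarity between controls, clipped to [0, 1] (assumed form)."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    cosine = sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)
    return max(0.0, cosine)


def trap_cost(x, u, trap_set, distance, eps=1e-6):
    """Penalize being near a stored trap state while taking a similar control.

    trap_set: list of (state, control) transitions recorded when entering traps.
    distance: state distance function taking two states.
    """
    return sum(control_similarity(u, trap_u) / (distance(x, trap_x) ** 2 + eps)
               for trap_x, trap_u in trap_set)
```

Under this form, approaching a stored trap state with the same control is heavily penalized, while passing through the same region with an opposing or orthogonal control incurs no penalty, which is what allows reaching goals that lie close to traps.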


We switch from exploit to recovery mode when detecting a trap, but it is not obvious what the recovery policy should be. Driven by the online setting and our objective of data-efficiency, we make three design choices. First, we restrict the recovery policy to be one induced by running the low-level MPC on some cost function other than the one used in exploit mode. Second, we propose hypothesis cost functions and consider only convex combinations of them. Without domain knowledge, one hypothesis is to return to one of the last visited nominal states. However, the dynamics may not always allow this. Another hypothesis is to return to a state that allowed for the most one-step movement. Both of these are implemented in terms of the following cost, where is a state set and we pass in either , the set of last visited nominal states, or , the set of states that allowed for the most single step movement since entering non-nominal dynamics:


Third, we formulate learning the recovery policy online as a non-stationary multi-arm bandit (MAB) problem. We initialize bandit arms, each a random convex combination of our hypothesis cost functions. Every steps in recovery mode, we pull an arm to select and execute a recovery policy. After executing control steps, we update that arm’s estimated value with a movement reward: . When in a trap, we assume any movement is good, even away from the goal. The normalization makes tuning easier across environments. To accelerate learning, we exploit the correlation between arms, calculated as the cosine similarity between the cost weights. Our formulation fits the problem from [19] and we implement their framework for non-stationary correlated multi-arm bandits.
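The sketch below illustrates the general idea with a deliberately simplified bandit: independent Gaussian arms, Thompson sampling, and a variance inflation step to track non-stationary rewards. It does not implement the correlated-arm framework of [19], and the class, its parameters, and the exact form of the normalized movement reward are assumptions made for the example.

```python
import random


def random_convex_weights(count, rng):
    """Non-negative weights over the hypothesis costs that sum to one."""
    raw = [rng.random() for _ in range(count)]
    total = sum(raw)
    return [r / total for r in raw]


def movement_reward(distance_moved, steps, nominal_step_distance):
    """Movement since the arm pull, normalized by what nominal dynamics would allow."""
    return distance_moved / (steps * nominal_step_distance)


class ThompsonBandit:
    """Independent Gaussian arms; each arm is a convex combination of costs."""

    def __init__(self, num_arms, num_costs, rng,
                 prior_var=1.0, obs_var=0.1, drift_var=0.01):
        self.rng = rng
        self.arms = [random_convex_weights(num_costs, rng) for _ in range(num_arms)]
        self.mean = [0.0] * num_arms
        self.var = [prior_var] * num_arms
        self.obs_var, self.drift_var = obs_var, drift_var

    def pull(self):
        """Sample a value for every arm and return the index of the best sample."""
        samples = [self.rng.gauss(m, v ** 0.5) for m, v in zip(self.mean, self.var)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, reward):
        """Kalman-style update of one arm; all arms grow more uncertain over time."""
        self.var = [v + self.drift_var for v in self.var]
        gain = self.var[arm] / (self.var[arm] + self.obs_var)
        self.mean[arm] += gain * (reward - self.mean[arm])
        self.var[arm] *= 1.0 - gain


def combined_cost(weights, hypothesis_costs, x):
    """Recovery cost induced by an arm: weighted sum of the hypothesis costs."""
    return sum(w * cost(x) for w, cost in zip(weights, hypothesis_costs))
```

In use, the selected arm's weights would define the recovery cost handed to the low-level MPC, and after the chosen number of control steps the arm would be updated with the observed movement reward.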

Finally, we return to exploit mode after a fixed number of steps , if we returned to nominal dynamics, or if we stopped moving after leaving the initial trap state. For details see Alg. 3.

V Results

In this section, we first evaluate our dynamics representation learning approach, in particular how well it generalizes out-of-distribution. Second, we compare TAMPC against baselines on tasks with traps in two environments.

V-A Experiment Environments

Fig. 4: Annotated simulation environments of (left) planar pushing, and (right) peg-in-hole.

Our two tasks are quasi-static planar pushing and peg-in-hole. Both tasks are evaluated in simulation using PyBullet [7] and the latter is additionally evaluated empirically using a KUKA LBR iiwa arm depicted in Fig. 1. In planar pushing, the goal is to push a block to a known desired position. In peg-in-hole, the goal is to place the peg into a hole with approximately known location. In both environments, the robot has access to its own pose and senses the reaction force at the end-effector. Thus the robot cannot perceive the obstacle geometry visually; it only perceives contact through reaction force. During offline learning of nominal dynamics, there are no obstacles or traps. During online task completion, obstacles are introduced in the environment, inducing unforeseen traps. See Fig. 4 for a depiction of the environments, Fig. 6 for typical traps from tasks in these environments, and App. B for environment details.

In planar pushing, the robot controls a cylindrical pusher restricted to push a square block from a fixed side. Traps introduced by walls are shown in Fig. 6. Frictional contact with the wall limits sliding along the wall and causes most actions to rotate the block into the wall. The state is where is the block pose, and is the reaction force the pusher feels, both in world frame. Control is , where is where along the side to push, is the push distance, and is the push direction relative to side normal. The state distance is the 2-norm of the pose, with yaw normalized by the block’s radius of gyration. The control similarity is where cossim is cosine similarity.
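One way to read the pose distance above is sketched below: the yaw difference is wrapped to [-pi, pi] and multiplied by the radius of gyration so that it has units of length before the 2-norm is taken. This interpretation of "normalized", the use of a uniform square (whose radius of gyration about its center is the side length divided by the square root of 6), and the function names are assumptions; the side length is left as a parameter.

```python
import math


def square_radius_of_gyration(side_length):
    """Radius of gyration of a uniform square about the axis through its center."""
    return side_length / math.sqrt(6.0)


def wrap_angle(angle):
    """Wrap an angle difference to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def pushing_state_distance(pose_a, pose_b, side_length):
    """2-norm of the (x, y, yaw) difference, with yaw converted to a length."""
    dx = pose_a[0] - pose_b[0]
    dy = pose_a[1] - pose_b[1]
    dyaw = wrap_angle(pose_a[2] - pose_b[2])
    k = square_radius_of_gyration(side_length)
    return math.sqrt(dx ** 2 + dy ** 2 + (k * dyaw) ** 2)
```

Scaling yaw this way puts rotation and translation on a comparable footing, so a block that only spins in place against a wall still registers as having moved.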

In peg-in-hole, we control a gripper holding a peg (square in simulation and circular on the real robot) that is constrained to slide along a surface. Traps in this environment geometrically block the shortest path to the hole. The state is and control is , the distance to move in and . We execute these on the real robot using a Cartesian impedance controller. The state distance is the 2-norm of the position and the control similarity is . The goal-directed cost for both environments is in the form . The MPC assigns a terminal multiplier of 50 at the end of the horizon on the state cost. See Tab. III for the cost parameters of each environment.

The nominal data for simulated environments consists of trajectories each with transitions (all motion is collision-free as there are no obstacles). For the real robot, we use , . We generate them by uniformly sampling starts from ( for planar pushing is also uniformly sampled) and applying actions uniformly sampled from .

V-B Offline: Learning Invariant Representations

In this section we evaluate if our representation can learn useful invariances from the offline training data. We expect nominal dynamics (in freespace) in both environments to be invariant to translation. Since the training data has positions around the range , we evaluate this as the learning losses on the validation set translated by . As Fig. 5b shows, we achieve good performance on the translated validation set even with ablations learned without REx. This could be due to our low dimensional state space whereas REx was developed for high dimensional image representations. All representations (for both planar pushing and peg-in-hole) use , and implement the transforms with shallow fully connected networks. For learning details see App. C.

Fig. 5c shows performance on a test set with OOD reaction forces, which we do not expect invariance over. Since the nominal data has no obstacles, we expect dynamics to do poorly (high match loss), but we see that with the transform we still learn to reconstruct the output.

Fig. 5: Learning curves on the (left) validation, (mid) translated validation, and (right) test set for planar pushing representation. Mean across 10 runs is solid while standard deviation is shaded.

V-C Online: Tasks in Novel Environments

Fig. 6: (top) Initial condition and (bottom) typical traps for planar pushing and peg-in-hole tasks. Our method has no visual sensing and is pre-trained only on environments with no obstacles.

We evaluate TAMPC against baselines and ablations on the tasks shown in Fig. 1 and Fig. 6. For TAMPC’s low-level MPC, we use a modified model predictive path integral controller (MPPI) [28] where we take the expected cost across rollouts for each control trajectory sample to account for stochastic dynamics. See Alg. 4 for our modifications to MPPI.
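The modification can be sketched for a scalar control as follows: each sampled control sequence is rolled out several times through a stochastic model, and its cost is the average over those rollouts before the usual softmax weighting. This is a simplified illustration rather than the procedure in Alg. 4; the function name, argument names, and default values are hypothetical.

```python
import math
import random


def mppi_controls(x0, nominal, sample_dynamics, cost, rng,
                  num_samples=100, num_rollouts=5, sigma=0.3, lam=0.01,
                  bounds=(-1.0, 1.0)):
    """Return an updated control sequence using expected cost across rollouts."""
    horizon = len(nominal)
    perturbations, costs = [], []
    for _ in range(num_samples):
        noise = [rng.gauss(0.0, sigma) for _ in range(horizon)]
        # Clip perturbed controls to the control bounds.
        controls = [min(max(u + e, bounds[0]), bounds[1])
                    for u, e in zip(nominal, noise)]
        total = 0.0
        for _ in range(num_rollouts):
            x = x0
            for u in controls:
                x = sample_dynamics(x, u, rng)  # one stochastic rollout
                total += cost(x, u)
        perturbations.append([c - u for c, u in zip(controls, nominal)])
        costs.append(total / num_rollouts)  # mean cost across rollouts
    best = min(costs)
    weights = [math.exp(-(c - best) / lam) for c in costs]
    norm = sum(weights)
    # Softmax-weighted mix of the (clipped) perturbations.
    return [u + sum(w * p[t] for w, p in zip(weights, perturbations)) / norm
            for t, u in enumerate(nominal)]
```

Averaging over rollouts means a control sequence is not favored merely because one lucky noise draw made it look cheap, at the price of `num_rollouts` times more model evaluations.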

Baselines: We compare against three baselines. The first is online model adaptation from [12], which does iLQR control on a linearized global model mixed with a locally-fitted linear model. This represents the overly-optimistic approach of always exploiting non-nominal dynamics to head towards the goal. iLQR (code provided by [12]’s author) performs poorly in freespace of the planar pushing environment, possibly due to the non-linearity and noise from reaction force dimensions. We instead do control with MPPI using a locally-fitted GP model (effectively an ablation of TAMPC with control mode fixed to non-nominal). This is termed “adaptive baseline++”. Our second baseline is model-free reinforcement learning with Soft Actor-Critic (SAC) [13]. Here, a nominal policy is learned offline for 1,000,000 steps on the nominal environment, which is used to initialize the policy at test time. Online, the policy is retrained after every control step on the dense environment reward. Lastly, our “non-adaptive” baseline runs MPPI on the nominal model.

Fig. 7: Minimum distance along the trajectory to the goal accounting for walls (computed with Dijkstra’s algorithm). Median across 10 runs is in solid while the 20th–80th percentile is shaded. Task success on block tasks is achieving distance and achieving 0 distance for peg-in-hole.

VI Discussion

Fig. 7 shows that TAMPC or its variants outperform baselines on all tasks in median distance to goal after 500 control steps. Our method is empirically robust to parameter value choices, as we use the same values for different tasks, with many also shared across environments; see Tab. II. We further improve performance on Peg-U and Peg-I by tuning only three independent parameters. We control the exploration of non-nominal dynamics with . For cases like Peg-U where the goal is surrounded by non-nominal dynamics, we increase exploration by increasing with the trade-off of staying longer in traps. Independently, we control the expected trap difficulty with the MPC horizon . Intuitively, we increase to match higher (more difficult), as in Peg-I, at the cost of more computation. Lastly, controls how quickly we decrease the weight of the trap cost. Too low a value prevents escape from difficult traps while values close to 1 lead to waiting longer in cost function local minima. For Peg-U, the “TAMPC tuned” runs used , while for Peg-I, the “TAMPC tuned” runs used .

On the real robot, the Cartesian impedance controller did not account for joint limits; instead, these limits were handled naturally as traps by TAMPC. The only scenario in which the baselines achieved some level of success was the real Peg-U, where these approaches generated control signals that always ended up sliding the peg along the same side of the U. This may be due to imperfect construction leading one corner of the obstacle to be closer to the goal, or to the joint configuration favouring movement in that direction. This movement bias may have contributed to the adaptive baseline’s success. In contrast, the non-adaptive baseline did not exhibit any movement biases and stayed at the bottom of the U, while TAMPC trajectories explored both sides.

A common trend we identify from Fig. 7 is that the baselines tend to plateau after initially decreasing distance to goal. Through inspection, we found that this plateau did indeed correspond to being caught in traps. We highlight that the adaptive baseline also gets caught in traps. This may be due to a combination of insufficient knowledge of dynamics around traps, over-generalization of trap dynamics, and using too short an MPC horizon. SAC likely stays inside of traps because no action immediately decreases cost, and action sequences that eventually decrease cost have high short-term cost and are thus unlikely to be sampled in 500 steps. TAMPC escapes traps because it uses the signal from state distance to explicitly reason about traps and switch to a recovery policy. This added structure is useful in tasks with traps because it decreases the degree of learning and planning the algorithm has to do.

In more detail, the ablation variants demonstrate the value of TAMPC components. Our structured recovery policy is highlighted on the block tasks. TAMPC random (recovery policy is uniform random actions until dynamics is nominal) performs significantly worse than our full method here because escaping traps in planar pushing requires long sequences exploiting the local non-nominal dynamics to rotate the block against the wall. This is unlike the peg environment where the gripper can directly move away from the wall and the random recovery policy performs well. The Peg-T(T) task highlights our learned dynamics representation. It is a copy of Peg-T translated 10 units in . Using the invariant representation, we maintain similar performance to Peg-T whereas performance degrades under a nominal model in the original space. This is because annealing the trap set cost requires being in recognized nominal dynamics, without which it is easy to get stuck in local minima.

VI-A Limitations and Future Work

The main failure mode of TAMPC is oscillation between previously visited traps. Future work could break the symmetry between traps by weighting them on time of visit. A limitation of our method is that we need to be given a state space with a distance function. Future work could attempt to learn this from nominal data, for example by assuming linearity of state distance to control effort. Our experiments focus on rigid-body interactions that can be succinctly represented with low-dimensional state-spaces. Future work could apply TAMPC to non-rigid manipulation problems, where the higher dimensional state space could be computationally challenging for the local GP model.

Finally, future work could improve the detection of non-nominal dynamics. For certain classes of problems, there could be more justified measures than the 2-norm on model prediction error.

Appendix A Parameters and Algorithm Details

Parameter    block    peg     real peg
samples      500      500     500
horizon      40       10      15
rollouts     10       10      10
             0.01     0.01    0.01
TABLE I: MPPI parameters for different environments.

Making U-turns in planar pushing requires . We shorten the horizon to 5 and remove the terminal state cost of 50 when in recovery mode to encourage immediate progress.

Parameter                     block    peg     real peg
trap cost annealing rate      0.97     0.9     0.8
recovery cost weight          2000     1       1
nominal error tolerance       10       15      15
trap tolerance                0.6      0.6     0.6
min dynamics window           5        5       2
nominal window                3        3       3
steps for bandit arm pulls    3        3       3
number of bandit arms         100      100     100
max steps for recovery        20       20      20
local model window            50       50      50
converged threshold           0.05     0.05    0.05
move threshold                1        1       1
TABLE II: TAMPC parameters for different environments.

depends on how accurately we need to model trap dynamics to escape them. Increasing leads to selecting only the best sampled action while a lower value leads to more exploration by taking sub-optimal actions.

Nominal error tolerance depends on the variance of the model prediction error in nominal dynamics. A higher variance requires a higher . We use a higher value for peg-in-hole because of simulation quirks in measuring reaction force from the two fingers gripping the peg. Block pushing has greater stability in measuring reaction force since there is only one point of contact between the pusher and block.
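The role of the nominal error tolerance can be illustrated with a short sketch. It assumes, as a simplification, that a transition counts as nominal when the 2-norm of the model prediction error stays within the tolerance multiplied by a typical nominal error magnitude. The names `prediction_error`, `is_nominal`, and `error_scale` are hypothetical and do not come from the paper; the exact test used by \controllerabv may differ.

```python
import math


def prediction_error(observed_dx, predicted_dx):
    """2-norm of the difference between observed and predicted state change."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed_dx, predicted_dx)))


def is_nominal(observed_dx, predicted_dx, error_scale, tolerance):
    """Return True if the prediction error is within tolerance * error_scale.

    error_scale is an assumed estimate of the typical prediction error
    measured on nominal data; a noisier nominal model needs a larger tolerance.
    """
    return prediction_error(observed_dx, predicted_dx) <= tolerance * error_scale
```

With a larger tolerance, more prediction noise is accepted before a transition is flagged as non-nominal, which matches the higher value used for peg-in-hole in Table II.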

Term block peg real peg
TABLE III: Goal cost parameters for each environment.
Given :  since end of last recovery or start of local dynamics, whichever is more recent, max nominal per-step distance, ,
1 for  to  do
2       if  then
3             return True
return False
Algorithm 2 EnteringTrap
Given :  since start recovery, , parameters from Tab.II
1 if  then
2       return False
3 else if  then
4       return True
converged away return converged and away
Algorithm 3 Recovered
Given : , , , , Tab. I parameters
// \us perturbation for steps
clip to control bounds   //
1   for  to  do
2       for  to  do
             // sample rollout
  mean across softmax mix perturbations return
Algorithm 4 MPPI with multiple rollouts. Differences from standard MPPI are highlighted.
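
The multi-rollout variant named in Algorithm 4 can be sketched as follows. This is a minimal illustration for a scalar state and a scalar control, not the authors' implementation: the function name `mppi_step`, its arguments, and all default values are assumptions. It follows the steps named in the listing: perturbations are clipped to the control bounds, each sampled control sequence is rolled out several times through a stochastic model, the cost is averaged across rollouts, and the perturbations are mixed with softmax weights.

```python
import math
import random


def mppi_step(x0, u_nominal, dynamics, cost, n_samples=50, n_rollouts=3,
              noise_sigma=0.5, temperature=1.0, u_min=-1.0, u_max=1.0, rng=None):
    """One MPPI update of a nominal control sequence (scalar state and control).

    dynamics(x, u, rng) returns a sampled next state; cost(x, u) returns a
    per-step cost. Both are supplied by the caller (hypothetical interfaces).
    """
    rng = rng or random.Random(0)
    horizon = len(u_nominal)
    perturbations, mean_costs = [], []
    for _ in range(n_samples):
        # Perturb the nominal sequence, then clip to the control bounds.
        eps = [rng.gauss(0.0, noise_sigma) for _ in range(horizon)]
        u = [min(max(un + e, u_min), u_max) for un, e in zip(u_nominal, eps)]
        # Keep the effective perturbation after clipping.
        perturbations.append([ui - un for ui, un in zip(u, u_nominal)])
        # Roll the same control sequence out several times and average the cost.
        total = 0.0
        for _ in range(n_rollouts):
            x = x0
            for ui in u:
                x = dynamics(x, ui, rng)
                total += cost(x, ui)
        mean_costs.append(total / n_rollouts)
    # Softmax weights on negative cost, shifted by the minimum for stability.
    c_min = min(mean_costs)
    weights = [math.exp(-(c - c_min) / temperature) for c in mean_costs]
    z = sum(weights)
    # Mix the perturbations into the nominal sequence.
    return [un + sum(w * p[t] for w, p in zip(weights, perturbations)) / z
            for t, un in enumerate(u_nominal)]
```

Because the weights sum to one and every sampled sequence is clipped, the updated sequence is a convex combination of in-bounds sequences and therefore stays within the control bounds. This sketch omits the terminal state cost mentioned above.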

Appendix B Environment details

The planar pusher is a cylinder with radius 0.02m to push a square with side length m. We have , , , and . All distances are in meters and all angles are in radians. The state distance function is the 2-norm of the pose, where yaw is normalized by the block’s radius of gyration. . The sim peg-in-hole peg is square with side length 0.03m, and control is limited to . These values are internally normalized so MPPI outputs control in the range .
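
As an illustration of the state distance described above, the following sketch computes the 2-norm of a planar pose difference with the yaw term scaled by a radius of gyration so that it has units of length. The function names are hypothetical, the yaw wrapping is an added assumption, and the multiplication of yaw by the radius is one reading of "normalized". The helper for a square uses the standard result for a uniform square plate rotating about its center, k = s / sqrt(6).

```python
import math


def square_radius_of_gyration(side):
    """Radius of gyration of a uniform square plate about its center axis."""
    return side / math.sqrt(6.0)


def state_distance(pose_a, pose_b, radius_of_gyration):
    """Distance between two planar poses (x, y, yaw).

    Yaw differences are wrapped to [-pi, pi) and converted to a length by
    multiplying with the radius of gyration (an assumed convention).
    """
    dx = pose_a[0] - pose_b[0]
    dy = pose_a[1] - pose_b[1]
    dyaw = (pose_a[2] - pose_b[2] + math.pi) % (2.0 * math.pi) - math.pi
    return math.sqrt(dx ** 2 + dy ** 2 + (radius_of_gyration * dyaw) ** 2)
```

Scaling yaw this way makes a rotation cost comparable to a translation of similar effort, so a single scalar threshold can be applied to the combined distance.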

Appendix C Representation learning & GP

Each of the transforms is represented by 2 hidden layer multilayer perceptrons (MLP) activated by LeakyReLU and implemented in PyTorch. They each have (16, 32) hidden units except for the simple dynamics which has (16, 16) hidden units. We optimize for 3000 epochs using Adam with default settings (learning rate 0.001), , and for the first 1000 epochs, then afterwards. For training with V-REx, we use a batch size of 2048, and a batch size of 500 otherwise. We use the GP implementation of gpytorch with an RBF kernel, zero mean, and independent output dimensions. On every transition, we retrain for 15 iterations on the last 50 transitions since entering non-nominal dynamics to only fit non-nominal data.
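
The sliding window of training data for the local model can be sketched with a bounded buffer. This is a minimal illustration under assumptions: the class `LocalTransitionBuffer` and its methods are hypothetical names, and the GP fitting itself is omitted. The buffer is reset when non-nominal dynamics are entered, so it only ever holds non-nominal transitions, and it keeps at most the last 50 of them.

```python
from collections import deque


class LocalTransitionBuffer:
    """Holds the most recent transitions seen since entering non-nominal dynamics."""

    def __init__(self, window=50):
        # A deque with maxlen drops the oldest transition automatically.
        self._buffer = deque(maxlen=window)

    def reset(self):
        """Call when non-nominal dynamics are newly entered."""
        self._buffer.clear()

    def add(self, state, action, next_state):
        """Record one observed transition."""
        self._buffer.append((state, action, next_state))

    def training_set(self):
        """Return the transitions the local model would be refit on."""
        return list(self._buffer)
```

After each call to `add`, the local model would be retrained for a fixed number of iterations on `training_set()`.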


References

  1. M. Arjovsky, L. Bottou, I. Gulrajani and D. Lopez-Paz (2019) Invariant risk minimization. arXiv preprint arXiv:1907.02893.
  2. M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton and R. Munos (2016) Unifying count-based exploration and intrinsic motivation. In NeurIPS, pp. 1471–1479.
  3. S. Bhasin, R. Kamalapurkar, M. Johnson, K. G. Vamvoudakis, F. L. Lewis and W. E. Dixon (2013) A novel actor–critic–identifier architecture for approximate optimal control of uncertain nonlinear systems. Automatica 49 (1), pp. 82–92.
  4. J. Borenstein, G. Granosik and M. Hansen (2005) The OmniTread serpentine robot: design and field performance. In Unmanned Ground Vehicle Technology VII, Vol. 5804, pp. 324–332.
  5. Y. Burda, H. Edwards, A. Storkey and O. Klimov (2018) Exploration by random network distillation. arXiv preprint arXiv:1810.12894.
  6. X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever and P. Abbeel (2016) InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In NeurIPS, pp. 2172–2180.
  7. E. Coumans and Y. Bai (2016) PyBullet, a Python module for physics simulation for games, robotics and machine learning.
  8. T. Dierks and S. Jagannathan (2012) Online optimal control of affine nonlinear discrete-time systems with unknown internal dynamics by using time-based policy update. IEEE Trans Neural Netw Learn Syst 23 (7), pp. 1118–1129.
  9. A. Ecoffet, J. Huizinga, J. Lehman, K. O. Stanley and J. Clune (2019) Go-Explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995.
  10. I. Fantoni and R. Lozano (2002) Non-linear control for underactuated mechanical systems. Springer Science.
  11. C. Finn, P. Abbeel and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, pp. 1126–1135.
  12. J. Fu, S. Levine and P. Abbeel (2016) One-shot learning of manipulation skills with online dynamics adaptation and neural network priors. In IROS, pp. 4019–4026.
  13. T. Haarnoja, A. Zhou, P. Abbeel and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In ICML, pp. 1861–1870.
  14. S. James, P. Wohlhart, M. Kalakrishnan, D. Kalashnikov, A. Irpan, J. Ibarz, S. Levine, R. Hadsell and K. Bousmalis (2019) Sim-to-real via sim-to-sim: data-efficient robotic grasping via randomized-to-canonical adaptation networks. In CVPR, pp. 12627–12637.
  15. D. E. Koditschek, R. J. Full and M. Buehler (2004) Mechanical aspects of legged locomotion control. Arthropod Structure & Development 33 (3), pp. 251–272.
  16. D. Krueger, E. Caballero, J. Jacobsen, A. Zhang, J. Binas, R. L. Priol and A. Courville (2020) Out-of-distribution generalization via risk extrapolation (REx). arXiv preprint arXiv:2003.00688.
  17. S. Levine and P. Abbeel (2014) Learning neural network policies with guided policy search under unknown dynamics. In NeurIPS, pp. 1071–1079.
  18. D. Li, Y. Yang, Y. Song and T. M. Hospedales (2018) Learning to generalize: meta-learning for domain generalization. In AAAI.
  19. D. McConachie and D. Berenson (2020) Bandit-based model selection for deformable object manipulation. In Algorithmic Foundations of Robotics XII, pp. 704–719.
  20. I. Osband, C. Blundell, A. Pritzel and B. Van Roy (2016) Deep exploration via bootstrapped DQN. In NeurIPS, pp. 4026–4034.
  21. S. J. Pan, I. W. Tsang, J. T. Kwok and Q. Yang (2010) Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks 22 (2), pp. 199–210.
  22. D. Panagou and K. J. Kyriakopoulos (2013) Viability control for a class of underactuated systems. Automatica 49 (1), pp. 17–29.
  23. D. Pathak, P. Agrawal, A. A. Efros and T. Darrell (2017) Curiosity-driven exploration by self-supervised prediction. In CVPR Workshops, pp. 16–17.
  24. X. B. Peng, M. Andrychowicz, W. Zaremba and P. Abbeel (2018) Sim-to-real transfer of robotic control with dynamics randomization. In ICRA, pp. 1–8.
  25. J. Schmidhuber (1991) A possibility for implementing curiosity and boredom in model-building neural controllers. In Proc. of the International Conference on Simulation of Adaptive Behavior: From Animals to Animats, pp. 222–227.
  26. H. Tang, R. Houthooft, D. Foote, A. Stooke, O. X. Chen, Y. Duan, J. Schulman, F. DeTurck and P. Abbeel (2017) #Exploration: a study of count-based exploration for deep reinforcement learning. In NeurIPS, pp. 2753–2762.
  27. J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba and P. Abbeel (2017) Domain randomization for transferring deep neural networks from simulation to the real world. In IROS, pp. 23–30.
  28. G. Williams, N. Wagener, B. Goldfain, P. Drews, J. M. Rehg, B. Boots and E. A. Theodorou (2017) Information theoretic MPC for model-based reinforcement learning. In ICRA, pp. 1714–1721.
  29. A. Zhang, H. Satija and J. Pineau (2018) Decoupling dynamics and reward for transfer learning. arXiv preprint arXiv:1804.10689.