TAMPC: A Controller for Escaping Traps in Novel Environments
Abstract
We propose an approach to online model adaptation and control in the challenging case of hybrid and discontinuous dynamics, where actions may lead to difficult-to-escape “trap” states. We first learn dynamics for a given system from training data which does not contain unexpected traps (since we do not know what traps will be encountered online). These “nominal” dynamics allow us to perform tasks under ideal conditions, but when unexpected traps arise in execution, we must find a way to adapt our dynamics and control strategy and continue attempting the task. Our approach, Trap-Aware Model Predictive Control (TAMPC), is a two-level hierarchical control algorithm that reasons about traps and non-nominal dynamics to decide between goal-seeking and recovery policies. An important requirement of our method is the ability to recognize nominal dynamics even when we encounter data that is out-of-distribution w.r.t. the training data. We achieve this by learning a representation for dynamics that exploits invariance in the nominal environment, thus allowing better generalization. We evaluate our method on simulated planar pushing and peg-in-hole as well as real-robot peg-in-hole problems against adaptive control and reinforcement learning baselines, where traps arise due to unexpected obstacles that we only observe through contact. Our results show that our method significantly outperforms the baselines in all tested scenarios.
Machine Learning for Robot Control, Reactive and Sensor-Based Control.
I Introduction
In this paper, we study the problem of controlling robots in environments with unforeseen traps. Informally, traps are states in which the robot cannot make progress towards its goal and is effectively “stuck”. Traps are common in robotics and can arise due to a wide variety of factors including geometric constraints imposed by obstacles, frictional locking effects, and nonholonomic dynamics leading to dropped degrees of freedom [4], [10], [15]. In this paper, we consider instances of trap dynamics in planar pushing with walls and peg-in-hole with unmodeled obstructions to the goal.
Developing generalizable algorithms that rapidly adapt to handle the wide variety of traps encountered by robots is important to their deployment in the real world. Two central challenges in online adaptation to environments with traps are the data-efficiency requirements and the lack of progress towards the goal for actions inside of traps. In this paper, our key insight is that we can address these challenges by explicitly reasoning over different dynamics modes, in particular traps, together with contingent recovery policies, organized as a hierarchical controller. We introduce an online modeling and control method that balances naive optimism and pessimism when encountering novel dynamics. Our method learns a dynamics representation that infers underlying invariances and exploits them when possible (optimism) while treading carefully to escape and avoid potential traps in non-nominal dynamics (pessimism). Specifically, we:

Introduce a novel representation architecture for generalizing dynamics and show how it allows our method to achieve good performance on out-of-distribution data;

Introduce Trap-Aware Model Predictive Control (TAMPC), a novel control algorithm that reasons about non-nominal dynamics and traps to reach goals in novel environments with traps;

Evaluate our method on real-robot and simulated peg-in-hole, and simulated planar pushing tasks with traps, where adaptive control and reinforcement learning baselines achieve less than 5% success rate while our method achieves 75% success rate (85% with tuning) averaged across environments and tasks.
Our approach addresses limitations in state-of-the-art techniques [12], [13] that cannot be expected to perform well in these scenarios because they have little to no contingencies for handling traps – action sequences that escape traps incur high short-term costs and are much less likely to be discovered.
II Problem Statement
Let $x \in \mathbb{R}^{n_x}$ denote the state, $u \in \mathbb{R}^{n_u}$ denote the control, and $\Delta x$ denote the change in state. We assume that we are given a state distance function $d(x, x')$ and a control similarity function $s(u, u') \in [0, 1]$. The state distance function measures changes in state, and progress towards the goal implies decreasing distance to it. Let $f$ denote the nominal dynamics (i.e. for the system under ideal conditions) and $\hat{f}$ denote the novel dynamics in a novel environment. We assume they are discrete-time, in the form:

(1) $x_{t+1} = x_t + \hat{f}(x_t, u_t), \qquad \hat{f}(x, u) = f(x, u) + e(x, u)$

where $e$ denotes the error dynamics. We assume the error dynamics are relatively small (w.r.t. the nominal) except for non-nominal regions, for which:

(2) $\|e(x, u)\| > \eta \|f(x, u)\|$ for some threshold $\eta$
We are given some goal-directed cost function $c(x)$, which induces trap states. These states are basins of attraction for the dynamics $\hat{f}$, inside of which no action sequence below some length $k$ can decrease the cost, where $k$ can be considered a given trap’s difficulty. Note that traps are defined by both the environment and cost function.
In trap states, the controller benefits from considering a horizon greater than $k$ and may need to incur high short-term cost to escape and have a chance at eventually reaching the goal. This is especially difficult when the trap dynamics are unknown, i.e. when the error dynamics $e$ are large and unmodeled. We consider the case where traps are unforeseen and novel environments where $\hat{f}$ may be discontinuous, such as in the case of unexpected contact during manipulation. Partial observability and limited sensing significantly increase the difficulty of the problem because they prevent the anticipation and preemptive avoidance of novel traps. The key challenges to accomplishing a task in this scenario are 1) the identification of traps; and 2) a reactive control scheme that can escape from traps and make progress toward the goal.
To learn the nominal dynamics $f$, we assume that some data has been collected in a nominal environment: transition sequences $\{(x_1, u_1), \dots, (x_T, u_T)\}$ without unexpected traps. Starting from an initial state $x_0$ in a novel environment, which may contain traps that were not present in the nominal environment, our objective is to minimize the number of steps $T$ to navigate to a given goal set $\mathcal{G}$:

(3) $\min_{u_0, u_1, \dots} T \quad \text{s.t.} \quad x_T \in \mathcal{G}$
III Related Work
In this section, we review work related to the two main components of this paper: handling traps and generalizing models to out-of-distribution (OOD) dynamics.
Handling Traps: Traps can arise due to many factors including nonholonomic dynamics, frictional locking, and geometric constraints [4], [10], [15]. In the control literature, viability control [22] can be applied to nonholonomic systems in the case where the dynamics of entering traps (leaving the viability set) are known. While they enforce staying inside the viability set as a hard constraint, our method can be interpreted as online learning of the non-viable set with a policy for returning to the viable set.
Another way to handle traps is with exploration, such as through intrinsic curiosity (based on model predictability) [25], [23], [5], state counting [26], [2], or stochastic policies [20]. However, trap dynamics can be difficult to escape from and can require returning to dynamics the model predicts well (which therefore receive little exploration reward). We show in an ablation test how random actions are insufficient for escaping traps in the tasks we consider. Similar to [9], we remember interesting past states and attempt to recover to them before resuming our control algorithm. They require a simulator to reset to the previous state and design domain-specific state interest scores. We do not require resetting, and effectively allow for online adaptation of the state score based on how much movement the induced policy generates while inside a trap.
Generalizing models to OOD Dynamics: Actor-critic methods have long been used to control nonlinear systems with unknown dynamics online [8], [3]. However, they tend to do poorly in discontinuous dynamics and are sample-inefficient compared to our method, as we show in experimentation. Another approach to handle novel environments online is with locally-fitted models [17], which [12] shows can be mixed with a global model and used in model predictive control (MPC). Similar to this approach, our method adapts a nominal model to local dynamics; however, we do not always exploit the dynamics to reach the goal.
One goal of our method is to generalize the nominal dynamics to OOD novel environments. A popular approach for doing this is explicitly learning to be robust to expected variations across training and test environments. This includes methods such as meta-learning [11], [18], domain randomization [27], [21], sim-to-real [24], [14], and other transfer learning [29] methods. These methods are unsuitable for this problem because our training data contains only nominal dynamics, whereas they need a diverse set of non-nominal dynamics. Instead, we learn a robust, or “disentangled” representation [6] of the system under which models can generalize. This idea is active in computer vision, where learning based on invariance has become popular [1], [16]. Using similar ideas, we present a novel architecture for learning invariant representations for dynamics models.
IV Method
Our approach is composed of two components: offline representation learning and online control. First, we present how we learn a representation that allows for generalization by exploiting inherent invariances inferred from the nominal data, shown in Fig. 2. Second, we present Trap-Aware Model Predictive Control (TAMPC), a two-level hierarchical MPC method shown in Fig. 3. The high-level controller explicitly reasons about non-nominal dynamics and traps, deciding when to exploit the dynamics and when to recover to familiar ones by outputting the model and cost function the low-level controller uses to compute control signals.
IV-A Offline: Invariant Representation for Dynamics
In this section, our objective is to exploit potential underlying invariances in state-control space when predicting system dynamics, to achieve better generalization to unseen data. More formally, our representation of dynamics is composed of an invariant transformation and a predictive module, shown in Fig. 2. The invariant transformation $\rho$ maps the state-action space into a latent space $z$ that the predictive module $h$ operates on to produce a latent output $v$, which is then mapped back into the state space using a decoder $\delta$. We parameterize the transformations with neural networks and build in two mechanisms to promote a meaningful and descriptive latent space:

First, we impose $\dim(z) < \dim(x) + \dim(u)$ to create an information bottleneck. This encourages the latent representation $z$ to throw out information not relevant for predicting dynamics and to discover invariances—variations in the original state-action space that can be safely ignored. Further, we limit the capacity of $h$ to be significantly smaller than that of $\rho$, such that the dynamics take on a simple form.

Second, we partially decouple $\rho$ and $h$ by learning in an autoencoder fashion: we require that the decoder $\delta$ can reconstruct $\Delta x$ with only partial information from $z$. Passing in the state at reconstruction allows the dynamics pathway to ignore some information in $z$. Much like the autoencoder architecture, we require the encoding $\hat{v}$ of the observed $\Delta x$ to match the output from the dynamics. To further improve generalization, we restrict the information passed from $z$ to $\delta$ when reconstructing with a dimension-reducing transform $\pi$. This mechanism has the effect of reducing compounding errors when the input is OOD. These two mechanisms yield the following expressions:

$z = \rho(x, u), \qquad v = h(z), \qquad \Delta x \approx \delta(\pi(z), v)$

and their associated batch-based reconstruction loss and matching loss:

$\mathcal{L}_{\mathrm{rec}} = \frac{\|\delta(\pi(z), \hat{v}) - \Delta x\|^2}{\|\Delta x\|^2}, \qquad \mathcal{L}_{\mathrm{match}} = \frac{\|h(z) - \hat{v}\|^2}{\|\hat{v}\|^2}$
These losses are ratios relative to the norm of the quantity we are trying to match, to avoid decreasing the loss by scaling the representation. In addition to these two losses, we apply Variance Risk Extrapolation (V-REx [16]) to explicitly penalize the variance in loss across the trajectories:

(4) $\mathcal{L} = \sum_{i} \left(\mathcal{L}_{\mathrm{rec},i} + \mathcal{L}_{\mathrm{match},i}\right) + \beta\, \mathrm{Var}_i\!\left(\mathcal{L}_{\mathrm{rec},i} + \mathcal{L}_{\mathrm{match},i}\right)$

where $i$ indexes training trajectories and $\beta$ weights the variance penalty.
We train on Eq. (4) using gradient descent, with an annealing strategy for the penalty weight suggested by [16]. For minibatches, we adjust Eq. (4) to be over only the trajectories that are in the batch.
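The combination of the ratio losses and the V-REx variance penalty can be sketched compactly. This is a minimal numpy sketch, not the paper's implementation (which uses PyTorch and learned networks); the function names and the exact normalization are assumptions.

```python
import numpy as np

def relative_loss(pred, target):
    # Squared error normalized by the target's squared norm, so the
    # representation cannot reduce the loss simply by shrinking its scale.
    num = np.sum((pred - target) ** 2, axis=-1)
    den = np.maximum(np.sum(target ** 2, axis=-1), 1e-8)
    return num / den

def vrex_objective(per_traj_losses, beta):
    # V-REx: mean loss across trajectories plus a penalty on its variance,
    # pushing the learned invariance to hold uniformly across trajectories.
    losses = np.asarray(per_traj_losses, dtype=float)
    return losses.mean() + beta * losses.var()
```

With `beta = 0` this reduces to ordinary empirical risk; annealing `beta` upward over training matches the strategy referenced from [16].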
After learning the transforms, we replace the predictive module with a higher-capacity model and fine-tune it on the nominal data with just the matching loss. Learning the invariant representation this way avoids compounding errors from OOD inputs while allowing our model to be robust to variations unnecessary for dynamics. For further details, please see App. C.
IV-B Online: Trap-Aware MPC
Online, we require a controller that has two important properties. First, it should incorporate strategies to escape from and avoid detected traps. Second, it should iteratively improve its dynamics representation, in particular when encountering previously unseen modes. To address these challenges, our approach uses a two-level hierarchical controller where the high-level controller described in Alg. 1 explicitly reasons about non-nominal dynamics and traps, outputting the dynamics model and cost function that the low-level controller uses to compute control. This structure allows a variety of low-level MPC designs that can be specialized to the task or dynamics if domain knowledge is available.
TAMPC operates in either an exploit or a recovery mode, judiciously chosen based on the type of dynamics encountered. When in nominal dynamics, the algorithm exploits its confidence in predictions. When encountering non-nominal dynamics, it attempts to exploit a local approximation built online until it detects entering a trap, at which time recovery is triggered. This approach attempts to balance between a potentially overly-conservative treatment of all non-nominal dynamics as traps and an overly-optimistic approach of exploiting all non-nominal dynamics assuming goal progress is always possible.
The first step in striking this balance is identifying non-nominal dynamics. Here, we evaluate the nominal model prediction error against observed states (“nominal model accuracy” block in Fig. 3 and Alg. 1):

(5) $\left\| f(x_t, u_t) - \Delta x_t \right\| > \alpha\, \epsilon$

where $\alpha$ is a designed tolerance threshold and $\epsilon$ is the expected model error per dimension computed on the training data. To handle jitter, we consider transitions from multiple consecutive time steps.
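The detection test above can be sketched as follows. This is a minimal sketch; the names `alpha` (tolerance) and `eps` (per-dimension training error) stand in for symbols stripped from the source, and the exact norm used is an assumption.

```python
import numpy as np

def is_nonnominal(predicted_dx, observed_dx, eps, alpha):
    # Prediction error of the nominal model, normalized per dimension by
    # the expected error on training data, compared against a tolerance.
    normalized_err = np.linalg.norm((predicted_dx - observed_dx) / eps)
    return bool(normalized_err > alpha)

def nonnominal_persistent(flags, window):
    # Require several consecutive flagged transitions to reject jitter.
    return len(flags) >= window and all(flags[-window:])
```

Requiring persistence over a small window prevents a single noisy transition from flipping the controller into non-nominal mode.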
When in non-nominal dynamics, the controller needs to differentiate between dynamics it can navigate to reach the goal vs. traps, and adapt its dynamics models accordingly. By definition, a trap is the inability to make progress towards the goal despite the MPC taking actions according to a goal-directed cost. The controller detects this by monitoring the maximum one-step state distance observed in nominal dynamics, $d_{\max}$, and comparing it against the average state distance from the current state to recent states (depicted in the “movement fast enough” block of Fig. 3): we are in a trap when this average falls below $\nu\, d_{\max}$. The recent window starts from the beginning of non-nominal dynamics or the end of the last recovery, whichever is more recent, and is extended backwards as needed to handle jitter by ensuring state distances are measured over windows of at least a minimum size. $\nu$ is how much slower the controller tolerates moving in non-nominal dynamics. For more details see Alg. 2.
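The movement-based trap check can be sketched as below. This is a minimal sketch under assumptions: `nu` is the tolerated slowdown factor, `d_max` the largest one-step movement seen in nominal dynamics, and `min_window` the minimum window size used to reject jitter (these names stand in for symbols stripped from the source).

```python
import numpy as np

def in_trap(states, t_start, d_max, nu, min_window):
    # A trap is flagged when the average distance from the current state
    # to recent states is less than a fraction nu of d_max.
    states = [np.asarray(s, dtype=float) for s in states]
    t = len(states) - 1
    # measure over a window of at least min_window steps to reject jitter
    start = max(min(t_start, t - min_window), 0)
    dists = [np.linalg.norm(states[t] - s) for s in states[start:t]]
    return float(np.mean(dists)) < nu * d_max
```

A controller stuck oscillating in place accumulates small distances to its recent history and is flagged, while steady progress keeps the average distance large.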
Our model adaptation strategy, for both non-nominal dynamics and traps, is to mix the nominal model with an online-fitted local model. Rather than the linear models considered in prior work [12], we add an estimate of the error dynamics, represented as a Gaussian Process (GP), to the output of the nominal model. Using a GP provides a sample-efficient model that captures nonlinear dynamics. To mitigate over-generalizing local non-nominal dynamics to where nominal dynamics hold, we fit it to only the most recent points since entering non-nominal dynamics. We also avoid over-generalizing the invariance that holds in nominal dynamics by constructing the GP model in the original state-control space. Our total dynamics is then
(6) $e_{\mathrm{GP}}(x, u) \sim \mathcal{GP}\big(0,\; k\big((x, u), (x', u')\big)\big)$

(7) $r_t = \Delta x_t - f(x_t, u_t)$ (residuals used to fit the GP)

(8) $\tilde{f}(x, u) = f(x, u) + \mathbb{E}\left[e_{\mathrm{GP}}(x, u)\right]$
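The mixed model can be sketched with a toy GP. The paper uses gpytorch (App. C); this small numpy RBF-kernel GP is a stand-in for illustration only, with assumed hyperparameter names.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    # squared-exponential kernel between rows of A and B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

class ResidualGP:
    def __init__(self, XU, residuals, noise=1e-6):
        # XU: recent (state, control) pairs in the ORIGINAL space;
        # residuals: observed_dx minus the nominal model's prediction
        self.XU = XU
        K = rbf_kernel(XU, XU) + noise * np.eye(len(XU))
        self.alpha = np.linalg.solve(K, residuals)

    def mean(self, XUq):
        # posterior mean of the error dynamics at query points
        return rbf_kernel(XUq, self.XU) @ self.alpha

def total_dynamics(nominal, gp, xu):
    # nominal prediction plus estimated error dynamics
    return nominal(xu) + gp.mean(xu)
```

Fitting only the last few transitions since entering non-nominal dynamics keeps the local correction from bleeding into regions where the nominal model is accurate.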
When exploiting dynamics to navigate towards the goal, we regularize the goal-directed cost with a trap set cost to avoid previously seen traps (Alg. 1). This trap set $\mathcal{T}$ is expanded whenever we detect entering a trap. We add to it the transition with the lowest ratio of actual to expected movement (from one-step prediction) since the end of the last recovery:

(9) $(x^*, u^*) = \operatorname*{arg\,min}_{(x_t, u_t)} \frac{d(x_t, x_{t+1})}{d\big(x_t,\; x_t + \tilde{f}(x_t, u_t)\big)}$
To handle traps close to the goal, we only penalize revisiting trap states if similar actions are to be taken. With the control similarity function $s$ we formulate the cost

(10) $c_{\mathrm{trap}}(x, u) = \sum_{(x_i, u_i) \in \mathcal{T}} \frac{s(u, u_i)}{d(x, x_i)}$
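The action-conditioned trap penalty can be sketched as below. This is a minimal sketch: the inverse-distance weighting and the particular distance/similarity functions are assumptions standing in for details stripped from the source.

```python
import numpy as np

def trap_set_cost(x, u, trap_set, state_dist, control_sim):
    # Penalize proximity to a remembered trap state only in proportion to
    # how similar the proposed control is to the one taken at that trap,
    # so traps near the goal can still be passed with different actions.
    return sum(
        control_sim(u, u_i) / max(state_dist(x, x_i), 1e-6)
        for x_i, u_i in trap_set
    )
```

With a cosine-based similarity, retrying the same push near a known trap is expensive, while approaching the same region with a different action is nearly free.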
We switch from exploit to recovery mode when detecting a trap, but it is not obvious what the recovery policy should be. Driven by the online setting and our objective of data-efficiency: First, we restrict the recovery policy to be one induced by running the low-level MPC on some cost function other than the one used in exploit mode. Second, we propose hypothesis cost functions and consider only convex combinations of them. Without domain knowledge, one hypothesis is to return to one of the last visited nominal states. However, the dynamics may not always allow this. Another hypothesis is to return to a state that allowed for the most one-step movement. Both of these are implemented in terms of the following cost, where $S$ is a state set and we pass in either $S_{\mathrm{nom}}$, the set of last visited nominal states, or $S_{\mathrm{move}}$, the set of states that allowed for the most single-step movement since entering non-nominal dynamics:

(11) $c_{\mathrm{rec}}(x; S) = \min_{x' \in S} d(x, x')$
Third, we formulate learning the recovery policy online as a non-stationary multi-armed bandit (MAB) problem. We initialize a set of bandit arms, each a random convex combination of our hypothesis cost functions. Every few steps in recovery mode, we pull an arm to select and execute a recovery policy. After executing those control steps, we update that arm’s estimated value with a movement reward: the state distance traveled, normalized by the largest one-step movement observed in nominal dynamics. When in a trap, we assume any movement is good, even away from the goal. The normalization makes tuning easier across environments. To accelerate learning, we exploit the correlation between arms, calculated as the cosine similarity between their cost weights. Our formulation fits the problem from [19] and we implement their framework for non-stationary correlated multi-armed bandits.
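The recovery bandit can be sketched as below. This simplified version uses epsilon-greedy selection and omits the correlated-arm reward sharing from [19]; the class and parameter names are assumptions.

```python
import numpy as np

class RecoveryBandit:
    # Each arm is a random convex combination of hypothesis recovery
    # costs; values are updated with a movement reward and discounted
    # to handle non-stationarity.
    def __init__(self, n_arms, n_hypotheses, discount=0.9, eps=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        w = self.rng.random((n_arms, n_hypotheses))
        self.weights = w / w.sum(axis=1, keepdims=True)  # rows sum to 1
        self.values = np.zeros(n_arms)
        self.discount = discount
        self.eps = eps

    def pull(self):
        # epsilon-greedy stand-in for the correlated-bandit selection rule
        if self.rng.random() < self.eps:
            return int(self.rng.integers(len(self.values)))
        return int(np.argmax(self.values))

    def update(self, arm, movement_reward):
        self.values *= self.discount          # forget stale estimates
        self.values[arm] += movement_reward
```

The selected arm's weight vector defines the convex combination of hypothesis costs handed to the low-level MPC for the next few recovery steps.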
Finally, we return to exploit mode after a fixed maximum number of recovery steps, if we have returned to nominal dynamics, or if we have stopped moving after leaving the initial trap state. For details see Alg. 3.
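The high-level mode switching described above (Alg. 1 and Alg. 3) can be summarized as a small state machine. This is a sketch; the condition names stand in for the detectors described in the text and are assumptions, not the paper's exact interface.

```python
def next_mode(mode, nominal_now, trap_detected, recovery_steps,
              max_recovery_steps, stopped_moving):
    # exploit: pursue the goal until a trap is detected
    if mode == "exploit":
        return "recover" if trap_detected else "exploit"
    # recover: return to exploit after a step budget, on re-entering
    # nominal dynamics, or once movement stalls after leaving the trap
    if recovery_steps >= max_recovery_steps or nominal_now or stopped_moving:
        return "exploit"
    return "recover"
```

The mode then determines which dynamics model (nominal, or nominal plus local GP) and which cost function (goal-directed plus trap cost, or recovery cost) are handed to the low-level MPC.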
V Results
In this section, we first evaluate our dynamics representation learning approach, in particular how well it generalizes out-of-distribution. Second, we compare TAMPC against baselines on tasks with traps in two environments.
V-A Experiment Environments
Our two tasks are quasi-static planar pushing and peg-in-hole. Both tasks are evaluated in simulation using PyBullet [7], and the latter is additionally evaluated empirically using a KUKA LBR iiwa arm depicted in Fig. 1. In planar pushing, the goal is to push a block to a known desired position. In peg-in-hole, the goal is to place the peg into a hole with approximately known location. In both environments, the robot has access to its own pose and senses the reaction force at the end-effector. Thus the robot cannot perceive the obstacle geometry visually; it only perceives contact through reaction force. During offline learning of nominal dynamics, there are no obstacles or traps. During online task completion, obstacles are introduced in the environment, inducing unforeseen traps. See Fig. 4 for a depiction of the environments and Fig. 6 for typical traps from tasks in these environments, and App. B for environment details.
In planar pushing, the robot controls a cylindrical pusher restricted to push a square block from a fixed side. Traps introduced by walls are shown in Fig. 6. Frictional contact with the wall limits sliding along the wall and causes most actions to rotate the block into the wall. The state consists of the block’s planar pose together with the reaction force the pusher feels, both in the world frame. The control specifies where along the side to push, the push distance, and the push direction relative to the side normal. The state distance is the 2-norm of the pose, with yaw normalized by the block’s radius of gyration. The control similarity is based on the cosine similarity (cossim) between controls, rescaled to $[0, 1]$.
In peg-in-hole, we control a gripper holding a peg (square in simulation and circular on the real robot) that is constrained to slide along a surface. Traps in this environment geometrically block the shortest path to the hole. The state is the planar position together with the reaction force, and the control is the distance to move in $x$ and $y$. We execute these on the real robot using a Cartesian impedance controller. The state distance is the 2-norm of the position, and the control similarity is defined as in planar pushing. The goal-directed cost for both environments is a weighted quadratic cost on state and control. The MPC assigns a terminal multiplier of 50 at the end of the horizon on the state cost. See Tab. III for the cost parameters of each environment.
The nominal data for the simulated environments consists of a set of trajectories of fixed length (all motion is collision-free as there are no obstacles); we use fewer, shorter trajectories for the real robot. We generate them by uniformly sampling start states (the push location along the side is also uniformly sampled for planar pushing) and applying uniformly sampled actions.
V-B Offline: Learning Invariant Representations
In this section we evaluate whether our representation can learn useful invariances from the offline training data. We expect nominal dynamics (in free space) in both environments to be invariant to translation. Since the training data covers only a limited range of positions, we evaluate this as the learning losses on a validation set translated far outside that range. As Fig. 5b shows, we achieve good performance on the translated validation set, even with ablations learned without REx. This could be due to our low-dimensional state space, whereas REx was developed for high-dimensional image representations. All representations (for both planar pushing and peg-in-hole) use the same latent dimensions and implement the transforms with shallow fully-connected networks. For learning details see App. C.
Fig. 5c shows performance on a test set with OOD reaction forces, which we do not expect invariance over. Since the nominal data has no obstacles, we expect the dynamics to do poorly (high match loss), but we see that with the dimension-reducing transform we still learn to reconstruct the output.
V-C Online: Tasks in Novel Environments
We evaluate TAMPC against baselines and ablations on the tasks shown in Fig. 1 and Fig. 6. For TAMPC’s low-level MPC, we use a modified model predictive path integral controller (MPPI) [28], where we take the expected cost across rollouts for each control trajectory sample to account for stochastic dynamics. See Alg. 4 for our modifications to MPPI.
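The expected-cost modification can be sketched as follows. This is a minimal sketch of the scoring step only, not the full MPPI loop of Alg. 4; names and the weighting details are assumptions.

```python
import numpy as np

def expected_costs(control_samples, rollout_cost, n_rollouts):
    # With stochastic dynamics, score each sampled control sequence by its
    # expected cost over several rollouts instead of a single rollout.
    # rollout_cost(controls) returns the total cost of one rollout.
    return np.array([
        np.mean([rollout_cost(c) for _ in range(n_rollouts)])
        for c in control_samples
    ])

def mppi_weights(costs, temperature):
    # standard MPPI exponential weighting (shifted for numerical stability)
    w = np.exp(-(costs - costs.min()) / temperature)
    return w / w.sum()
```

Averaging over rollouts prevents a single lucky sample of the stochastic dynamics from dominating the control update.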
Baselines: We compare against three baselines. The first is online model adaptation from [12], which performs iLQR control on a linearized global model mixed with a locally-fitted linear model. This represents the overly-optimistic approach of always exploiting non-nominal dynamics to head towards the goal. iLQR (code provided by [12]’s author) performs poorly in the free space of the planar pushing environment, possibly due to the nonlinearity and noise from the reaction force dimensions. We instead do control with MPPI using a locally-fitted GP model (effectively an ablation of TAMPC with the control mode fixed to exploiting non-nominal dynamics). This is termed “adaptive baseline++”. Our second baseline is model-free reinforcement learning with Soft Actor-Critic (SAC) [13]. Here, a nominal policy is learned offline for 1,000,000 steps on the nominal environment, which is used to initialize the policy at test time. Online, the policy is retrained after every control step on the dense environment reward. Lastly, our “non-adaptive” baseline runs MPPI on the nominal model.
VI Discussion
Fig. 7 shows that TAMPC or its variants outperform the baselines on all tasks in median distance to goal after 500 control steps. Our method is empirically robust to parameter value choices, as we use the same values for different tasks, with many also shared across environments; see Tab. II. We further improve performance on Peg-U and Peg-I by tuning only three independent parameters. We control the exploration of non-nominal dynamics with the trap tolerance. For cases like Peg-U where the goal is surrounded by non-nominal dynamics, we increase exploration by adjusting it, with the trade-off of staying longer in traps. Independently, we control the expected trap difficulty with the MPC horizon. Intuitively, we increase the horizon to match more difficult traps, as in Peg-I, at the cost of more computation. Lastly, the trap cost annealing rate controls how quickly we decrease the weight of the trap cost. Too low a value prevents escape from difficult traps, while values close to 1 lead to waiting longer in cost function local minima. The “TAMPC tuned” runs adjust these parameters for Peg-U and Peg-I respectively.
On the real robot, the Cartesian impedance controller did not account for joint limits; instead, these limits were handled naturally as traps by TAMPC. The only scenario in which the baselines achieved some level of success was the real Peg-U, where these approaches generated control signals that always ended up sliding the peg along the same side of the U. This may be due to imperfect construction leading one corner of the obstacle to be closer to the goal, or to a joint configuration that favoured moving in that direction. This movement bias may have contributed to the adaptive baseline’s success. In contrast, the non-adaptive baseline did not exhibit any movement biases and stayed at the bottom of the U, while TAMPC trajectories explored both sides.
A common trend we identify from Fig. 7 is that the baselines tend to plateau after initially decreasing the distance to goal. Through inspection, we found that this plateau did indeed correspond to being caught in traps. We highlight that the adaptive baseline also gets caught in traps. This may be due to a combination of insufficient knowledge of the dynamics around traps, over-generalization of trap dynamics, and too short an MPC horizon. SAC likely stays inside traps because no action immediately decreases cost, and action sequences that eventually decrease cost have high short-term cost and are thus unlikely to be sampled in 500 steps. TAMPC escapes traps because it uses the signal from state distance to explicitly reason about traps and switch to a recovery policy. This added structure is useful in tasks with traps because it decreases the degree of learning and planning the algorithm has to do.
In more detail, the ablation variants demonstrate the value of TAMPC’s components. Our structured recovery policy is highlighted on the block tasks. TAMPC random (whose recovery policy is uniform random actions until dynamics are nominal) performs significantly worse than our full method here, because escaping traps in planar pushing requires long action sequences that exploit the local non-nominal dynamics to rotate the block against the wall. This is unlike the peg environment, where the gripper can directly move away from the wall and the random recovery policy performs well. The Peg-T(T) task highlights our learned dynamics representation. It is a copy of Peg-T translated 10 units away. Using the invariant representation, we maintain similar performance to Peg-T, whereas performance degrades under a nominal model in the original space. This is because annealing the trap set cost requires being in recognized nominal dynamics, without which it is easy to get stuck in local minima.
VI-A Limitations and Future Work
The main failure mode of TAMPC is oscillation between previously visited traps. Future work could break the symmetry between traps by weighting them by time of visit. A limitation of our method is that we need to be given a state space with a distance function. Future work could attempt to learn this from nominal data, for example by assuming linearity of state distance to control effort. Our experiments focus on rigid-body interactions that can be succinctly represented with low-dimensional state spaces. Future work could apply TAMPC to non-rigid manipulation problems, where the higher-dimensional state space could be computationally challenging for the local GP model.

Finally, future work could improve the detection of non-nominal dynamics. For certain classes of problems, there could be more justified measures than the 2-norm on model prediction error.
Appendix A Parameters and Algorithm Details
Parameter  block  peg  real peg
samples  500  500  500
horizon  40  10  15
rollouts  10  10  10
temperature  0.01  0.01  0.01
Making U-turns in planar pushing requires a long horizon. We shorten the horizon to 5 and remove the terminal state cost multiplier of 50 when in recovery mode to encourage immediate progress.
Parameter  block  peg  real peg 
trap cost annealing rate  0.97  0.9  0.8 
recovery cost weight  2000  1  1 
nominal error tolerance  10  15  15 
trap tolerance  0.6  0.6  0.6 
min dynamics window  5  5  2 
nominal window  3  3  3 
steps for bandit arm pulls  3  3  3 
number of bandit arms  100  100  100 
max steps for recovery  20  20  20 
local model window  50  50  50 
converged threshold  0.05  0.05  0.05 
move threshold  1  1  1 
The local model window depends on how accurately we need to model trap dynamics to escape them. The MPPI temperature trades off between selecting only the best sampled action and exploring by weighting suboptimal actions more heavily.
The nominal error tolerance depends on the variance of the model prediction error in nominal dynamics. A higher variance requires a higher tolerance. We use a higher value for peg-in-hole because of simulation quirks in measuring the reaction force from the two fingers gripping the peg. Block pushing has greater stability in measuring reaction force, since there is only one point of contact between the pusher and block.
Term  block  peg  real peg 
Appendix B Environment details
The planar pusher is a cylinder with radius 0.02 m pushing a square block. All distances are in meters and all angles are in radians. The state distance function is the 2-norm of the pose, where yaw is normalized by the block’s radius of gyration. The simulated peg-in-hole peg is square with side length 0.03 m, and control magnitudes are limited. These values are internally normalized so that MPPI outputs control in a fixed normalized range.
Appendix C Representation learning & GP
Each of the transforms is represented by a two-hidden-layer multilayer perceptron (MLP) activated by LeakyReLU and implemented in PyTorch. They each have (16, 32) hidden units, except for the simple dynamics module, which has (16, 16) hidden units. We optimize for 3000 epochs using Adam with default settings (learning rate 0.001), annealing the V-REx penalty weight: a small value for the first 1000 epochs, then a larger value afterwards. For training with V-REx, we use a batch size of 2048, and a batch size of 500 otherwise. We use the GP implementation of gpytorch with an RBF kernel, zero mean, and independent output dimensions. On every transition, we retrain for 15 iterations on the last 50 transitions since entering non-nominal dynamics, to fit only non-nominal data.
References
[1] (2019) Invariant risk minimization. arXiv preprint arXiv:1907.02893.
[2] (2016) Unifying count-based exploration and intrinsic motivation. In NeurIPS, pp. 1471–1479.
[3] (2013) A novel actor–critic–identifier architecture for approximate optimal control of uncertain nonlinear systems. Automatica 49(1), pp. 82–92.
[4] (2005) The OmniTread serpentine robot: design and field performance. In Unmanned Ground Vehicle Technology VII, Vol. 5804, pp. 324–332.
[5] (2018) Exploration by random network distillation. arXiv preprint arXiv:1810.12894.
[6] (2016) InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In NeurIPS, pp. 2172–2180.
[7] (2016) PyBullet, a Python module for physics simulation for games, robotics and machine learning.
[8] (2012) Online optimal control of affine nonlinear discrete-time systems with unknown internal dynamics by using time-based policy update. IEEE Trans. Neural Netw. Learn. Syst. 23(7), pp. 1118–1129.
[9] (2019) Go-Explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995.
[10] (2002) Nonlinear Control for Underactuated Mechanical Systems. Springer Science.
[11] (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, pp. 1126–1135.
[12] (2016) One-shot learning of manipulation skills with online dynamics adaptation and neural network priors. In IROS, pp. 4019–4026.
[13] (2018) Soft Actor-Critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In ICML, pp. 1861–1870.
[14] (2019) Sim-to-real via sim-to-sim: data-efficient robotic grasping via randomized-to-canonical adaptation networks. In CVPR, pp. 12627–12637.
[15] (2004) Mechanical aspects of legged locomotion control. Arthropod Structure & Development 33(3), pp. 251–272.
[16] (2020) Out-of-distribution generalization via risk extrapolation (REx). arXiv preprint arXiv:2003.00688.
[17] (2014) Learning neural network policies with guided policy search under unknown dynamics. In NeurIPS, pp. 1071–1079.
[18] (2018) Learning to generalize: meta-learning for domain generalization. In AAAI.
[19] (2020) Bandit-based model selection for deformable object manipulation. In Algorithmic Foundations of Robotics XII, pp. 704–719.
[20] (2016) Deep exploration via bootstrapped DQN. In NeurIPS, pp. 4026–4034.
[21] (2010) Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks 22(2), pp. 199–210.
[22] (2013) Viability control for a class of underactuated systems. Automatica 49(1), pp. 17–29.
[23] (2017) Curiosity-driven exploration by self-supervised prediction. In CVPR Workshops, pp. 16–17.
[24] (2018) Sim-to-real transfer of robotic control with dynamics randomization. In ICRA, pp. 1–8.
[25] (1991) A possibility for implementing curiosity and boredom in model-building neural controllers. In Proc. of the International Conference on Simulation of Adaptive Behavior: From Animals to Animats, pp. 222–227.
[26] (2017) #Exploration: a study of count-based exploration for deep reinforcement learning. In NeurIPS, pp. 2753–2762.
[27] (2017) Domain randomization for transferring deep neural networks from simulation to the real world. In IROS, pp. 23–30.
[28] (2017) Information theoretic MPC for model-based reinforcement learning. In ICRA, pp. 1714–1721.
[29] (2018) Decoupling dynamics and reward for transfer learning. arXiv preprint arXiv:1804.10689.