Relational Mimic for Visual Adversarial Imitation Learning

Abstract

In this work, we introduce a new method for imitation learning from video demonstrations. Our method, Relational Mimic (RM), improves on previous visual imitation learning methods by combining generative adversarial networks and relational learning. RM is flexible and can be used in conjunction with other recent advances in generative adversarial imitation learning [26] to better address the need for more robust and sample-efficient approaches. In addition, we introduce a new neural network architecture that improves upon the previous state-of-the-art in reinforcement learning and illustrate how increasing the relational reasoning capabilities of the agent enables it to reach increasingly higher performance in a challenging locomotion task with pixel inputs. Finally, we study the effects and contributions of relational learning in policy evaluation, policy improvement and reward learning through ablation studies.

1 Introduction

Reinforcement learning (RL) [59] has received considerable attention for training intelligent agents capable of solving sequential decision-making tasks under uncertainty. While RL was long hindered by the lack of strong function approximators, the gains in model expressiveness brought by deep neural networks have enabled deep reinforcement learning to solve increasingly challenging tasks [43, 58, 47]. Despite alleviating the burden of hand-crafting task-relevant features, deep RL is constrained by the need for reward shaping [45], that is, designing a reward signal that will guide the learning agent to solve the end-task.

Instead of relying on reward signals requiring considerable engineering efforts to align [36] with the desired goal, we use imitation learning (IL) [6] in which the agent is provided with expert demonstrations before the training procedure starts. The agent does not receive any external reward signal while it interacts with its environment. The behavior that emerges from the mimicking agent should resemble the behavior demonstrated by the expert. Demonstrations have successfully helped mitigate side effects due to reward hand-crafting, commonly referred to as reward hacking [21], and induce risk-averse, safer behaviors [2, 34, 37].

When the demonstrated trajectories contain both the states visited by the expert and the controls performed by the expert in each state, the imitation learning task can be framed as a supervised learning task, where the agent’s objective consists of learning the mapping between states and controls (commonly denoted as actions in the RL framework). This supervised approach is referred to as behavioral cloning (BC), and has enabled advances in autonomous driving [48, 49, 11] and robotics [51, 3, 65]. However, BC remains extremely brittle in the absence of abundant data. In particular, because errors compound over time, the cloning agent can only recover from past mistakes if corrective behavior appears in the provided demonstrations. This phenomenon, known as covariate shift [52, 53], showcases the fragility of supervised learning approaches in interactive, dynamically-entangled, sequential problems.

In contrast to BC, Apprenticeship learning (AL) [1] tackles IL problems without attempting to map states to actions in a supervised learning fashion. AL tries to first recover the reward signal which explains the behavior observed in the demonstrations, an approach called inverse reinforcement learning (IRL) [46, 6], and subsequently uses the learned reward to train the agent by regular RL. Assuming the recovered reward is the reinforcement signal that was optimized by the expert demonstrator, learning a policy by RL from this signal will yield a policy mimicking the demonstrated behavior. While training models with data interactively collected from the environment mitigates the compounding of errors, solving an RL problem every iteration (for every new reward update) is expensive, and recovering the expert’s exact reward is an ill-posed problem [73], often requiring various relaxations [44, 60, 61, 62, 27].

Generative adversarial imitation learning (GAIL) [26] addresses these limitations by jointly learning a similarity metric with a generative adversarial network (GAN) [19] and optimizing a policy by RL using the learned similarity metric as a surrogate reward. In contrast to AL, GAIL does not try to recover the reward signal assumed to have been optimized by the expert when demonstrating the target behavior. Instead, GAIL learns a surrogate signal that, when learned jointly with the policy which uses it as a reward, yields a robust, high-performance imitation policy. Interestingly, in environments that are close to deterministic (which applies to every current benchmark environment), GAIL still performs well when provided with expert demonstrations containing only the states visited by the expert, without the actions that led the expert from one state to the next [64]. This is important because video sharing platforms provide numerous instances of state-only demonstrations. Being able to leverage these, by designing imitation methods which use videos as input, is an important imitation learning milestone. In this work, we introduce RM, a video imitation learning method that builds on GAIL.

We focus on locomotion tasks from the MuJoCo [63] suite of continuous control environments [12]. In addition to the control sensitivity inherent to continuous action spaces, locomotion tasks require agents to maintain dynamic equilibria while solving the task. Learning to preserve balance is not a universal skill and depends both on the agent’s body configuration and the goal task. When the proprioceptive state representation of the agent is available (joint angles, positions and velocities), introducing a structural bias by explicitly modeling the morphology of the agent to learn the relationships between joints as edges in a graph can yield significant gains in performance and robustness [68]. Rather than being provided with proprioceptive states, our agents are solely provided with high-dimensional visual state representations, preventing us from modeling the agent’s structure explicitly. In this work, we introduce a self-attention mechanism in the convolutional perception stack of our agents, drawing inspiration from relational modules [69, 54, 66] to give them the ability to perform visual relational reasoning from raw visual state representations. By working over sequences of consecutive frames, our agents are able to recover motion information and consequently can draw long-range relationships across time and space (Figure 1(b)). As we show in Section 5, RM achieves state-of-the-art performance in state-only visual imitation in complex locomotion environments.

Figure 1: (a) Example frame for learning locomotion policies from pixel input in the Walker2d-v3 environment from the MuJoCo benchmark. (b) Relational learning capabilities of our approach, with examples of possible relationships depicted by double-ended green arrows.

2 Background

We model the stateful environment, $\mathcal{E}$, with a Markov decision process (MDP), formally represented by the tuple $(\mathcal{S}, \mathcal{A}, \rho_0, p, r, \gamma)$, containing: the state space, $\mathcal{S}$, the action space, $\mathcal{A}$, the initial state density, $\rho_0$, and transition distribution with conditional density, $p(s_{t+1} \mid s_t, a_t)$, jointly defining the world dynamics, the reward function, $r(s_t, a_t)$, and the discount factor, $\gamma$, as traditionally seen in infinite-horizon MDPs. Since we work in the episodic setting, we assume that every agent-generated trajectory ends in an absorbing state at time horizon $T$, which triggers episode termination by zeroing out the reward once reached. The environment $\mathcal{E}$ is unknown to the agent and can only be queried via interaction.

The agent follows its policy, $\pi_\theta$, modeled via a neural network parametrized by $\theta$, to sequentially determine which decision (action), $a_t \in \mathcal{A}$, to make in the current situation (state), $s_t \in \mathcal{S}$. The decision process is dictated by the conditional density $\pi_\theta(a_t \mid s_t)$. In contrast with traditional RL settings, our agent is not rewarded upon interaction and therefore does not benefit from external supervision towards completing the task at hand. The agent however has access to a set of trajectories collected by an expert policy, $\pi_e$, which demonstrates how to complete the task. Our objective is to train agents capable of displaying behaviors resembling the ones shown by the expert in the demonstrations with high fidelity and in a robust fashion. Since we consider the problem of imitation learning from states without the associated actions performed by the expert, we define demonstrations as sequences of states visited by the expert during an episode, $\tau = (s_0, s_1, \ldots, s_T)$. Finally, we introduce the return, defined as the sum of discounted future rewards collected from time $t$ to episode termination, $R_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$, and the state value, defined as the expected return when starting from state $s_t$ and following policy $\pi_\theta$ thereafter, $V^{\pi_\theta}(s_t) = \mathbb{E}[R_t \mid s_t]$. Since we are interested in the representations learned for policy evaluation, we learn $V^{\pi_\theta}$ (as opposed to simply computing it, e.g., by Monte-Carlo estimation). We model $V^{\pi_\theta}$ via a neural network, $V_\phi$, parametrized by $\phi$.

3 Relational Mimic

We introduce Relational Mimic (RM), capable of learning efficient imitation policies from visual input in complex locomotion environments. To learn in complex environments, agents have to deal with many limbs, connected by complex joints (with potentially many degrees-of-freedom), which calls for greater coordination skills. Our experiments show that using relational modules in our agent is key to the success of our method.

We first focus on a new architecture for visual continuous control which will serve as a building block throughout the remainder of the paper. This architecture takes advantage of non-local modeling, which has arguably been overlooked in recent successful deep vision-purposed architectures and has very recently been revived in [69]. When used in a reinforcement learning scenario, we show that an agent using the proposed architecture not only improves on previous canonical architectures for control (Figures 2 and 5), but also that enriching the architecture with a (slightly modified) non-local block [69] (Figure 2) yields even further improvements in the locomotion tasks tackled. For legibility purposes, we will hereafter refer to the agent built with the complete architecture as non-local agent (Table 1(a)), and to the agent without non-local block as local agent. We then introduce our main contribution, RM, which builds on recent advances in both relational reasoning and GANs.

Visual inputs. As previously mentioned, our agents do not make use of proprioceptive feedback, which provides the agent with detailed, salient information about its current state (joint angles and velocities). Instead, our agents perceive only a visual (pixel) representation of the state (see Figure 1(a)), resulting from passing the proprioceptive state into a renderer before giving it to the agent (more details found in Section 5 describing the experimental setup). While the joint positions can be recovered with relative ease depending on the resolution of the rendered image, joint velocities are not represented in the rendered pixel state. In effect, the state is now partially-observable. However, to act optimally, the agent needs to know its current joint velocities. This hindrance has previously been addressed either by using frame-stacking or by using a recurrent model. Agents modeled with recurrent models learn a state embedding from single input frames and work over sequences of such embeddings to make a decision. With frame-stacking however, agents learn a state sequence embedding directly, albeit over usually shorter sequence lengths. Recent work reports similar performance with either approach [14]. We opt for frame-stacking to learn convolution-based relational embeddings over sequences of states, a key contribution of our approach. We denote by $k$ the number of frames stacked in the input (corresponding to $k-1$ back-steps), and define a stacked state as the tuple $\bar{s}_t = (s_{t-k+1}, \ldots, s_t)$.
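
As a concrete illustration of this frame-stacking scheme, the sketch below maintains the $k$ most recent gray-scale frames; it is not the authors' code, and the class name, the frame resolution, and the choice of replicating the first frame at episode start are assumptions made for the example.

```python
from collections import deque

import numpy as np


class FrameStacker:
    """Maintain the k most recent gray-scale frames as one stacked state."""

    def __init__(self, k: int):
        self.k = k
        self.frames = deque(maxlen=k)  # oldest frame is evicted automatically

    def reset(self, first_frame: np.ndarray) -> np.ndarray:
        # At episode start, replicate the first frame k times so the stacked
        # state always has the fixed shape (k, H, W).
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(first_frame)
        return self.state()

    def step(self, new_frame: np.ndarray) -> np.ndarray:
        # Push the newest frame; the oldest one drops out of the window.
        self.frames.append(new_frame)
        return self.state()

    def state(self) -> np.ndarray:
        # Channels-first stacked state, ready for a 2D convolutional stack.
        return np.stack(self.frames, axis=0)


# Usage sketch with k = 4 and an assumed 100x100 rendering resolution.
stacker = FrameStacker(k=4)
s = stacker.reset(np.zeros((100, 100), dtype=np.float32))
s = stacker.step(np.ones((100, 100), dtype=np.float32))
assert s.shape == (4, 100, 100)
```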

Architecture. Relational learning capabilities are given to our method by integrating self-attention into the architecture. In [69], self-attention lets the enriched model reason in both time and space by capturing long-range dependencies both spatially within the images and temporally across the sequences of consecutive frames. Non-local operations capture global long-range dependencies by directly computing interactions between two positions, irrespective of their spatial distance and temporal delay. As a result, models augmented with self-attention mechanisms are naturally able to capture relationships between locations in and across frames, which imbues the system with the ability to perform relational reasoning about the objects evolving in the input. The architecture of our non-local agent is described in Table 1(a). The careful use of feature-pooling (rows 3 and 7 in Table 1(a)) and the introduction of a self-attentive relational block (row 5 in Table 1(a)) are the key features of our approach, and are described in dedicated sub-sections.

In recent years, it has been common practice to use a single network with two heads, an action head and a value head, to parameterize the policy and value networks respectively. While reducing the computational cost, weight sharing relies upon the assumption that the value and policy are both optimal under the same feature representation. However, [70] shows with saliency maps that policy and value networks learn different hidden representations and pay attention to different cues in the visual input state. Inspired by this observation, we separate the policy and value into distinct networks, which will also enable us to conduct an ablation study on the effect of self-attention.

0 Input, size:
1 Standardize input:
2 Conv2D, , , , ReLU post
3 MaxPool3D, , ,
4 Residual Block
5 Relational Block
6 Residual Block, ReLU post
7 MaxPool3D, , ,
8 Residual Block
9 Residual Block, ReLU post
10 FullyConnected,
(a) Architecture of our Non-Local Agent. Removing the fifth row yields the Local Agent.
0 Input, inner feature maps
1 Conv2D, , , , ReLU pre
2 Conv2D, , , , ReLU pre
3 Conv2D, , , , ReLU pre
4 SkipConnection, add input to the output
(b) Residual Block [24] with ReLU pre-activations.
Table 1: Non-Local Agent architecture description. Indices in the left column are an indication of depth (the higher the index, the deeper the associated layer is in the architecture). The mention "pre" indicates that the non-linearity is applied right before the assigned row (pre-activation). Analogously, "post" designates post-activations. The number of channels used has been fine-tuned such that the resulting model is comparable with the baselines in terms of convolutional stack depth, number of parameters and forward pass computational cost.
Architecture               Parameters   Flops    Convolutional layers
Nature [43]                1.276M       16.62M   3
Large Impala [15]          0.5002M      84.52M   15
Local Agent (ours)         0.4575M      13.71M   13
Non-Local Agent (ours)     0.4577M      13.84M   15
Table 2: Architecture comparison between the introduced architectures and notable convolutional baselines from the Reinforcement Learning literature. Criteria are, ordered from left to right, 1) the total number of model parameters including the fully connected layers following the convolutional stack, 2) the computational cost of one forward pass through the model, expressed in flops, and 3) the depth of the perception stack, corresponding to the maximum number of consecutive convolutional layers used in the network.

Feature pooling. In locomotion tasks, uncertainties in the visual state representation can prevent the agent from inferring its position accurately. This phenomenon is exacerbated when the agent also has to infer its velocity from sequences of visual states. However, both position (spatial) and velocity (temporal) are crucial to predict sensible continuous actions. In order to propagate this spatial and temporal information as the visual input sequence is fed through the layers, our architecture only involves one spatial pooling layer (row 3 in Table 1(a)), as opposed to three in the large Impala network [15]. Note, the non-local and large Impala agents both have 15-layer deep convolutional stacks. In order to remain competitive with the architectures in Table 2 in terms of number of parameters and computational cost, we use two feature pooling layers, at different depth levels (rows 3 and 7 in Table 1(a), where row 3 performs both spatial and feature pooling).
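
To make the feature-pooling operation concrete, the hedged sketch below pools over the channel (feature) axis, and optionally over space, with a single MaxPool3D call by viewing the channel axis as the depth axis of a 3D pooling; the kernel sizes, strides and channel counts are illustrative, since the exact values of Table 1(a) are not reproduced here.

```python
import torch
import torch.nn as nn


def feature_pool(x: torch.Tensor, channel_k: int = 2, spatial_k: int = 1) -> torch.Tensor:
    """Max-pool a (N, C, H, W) feature map over channels (and optionally space).

    The channel axis is temporarily treated as the depth axis of a 3D pooling,
    so one MaxPool3d call can pool features alone (spatial_k = 1) or features
    and spatial locations jointly (spatial_k > 1).
    """
    pool = nn.MaxPool3d(kernel_size=(channel_k, spatial_k, spatial_k),
                        stride=(channel_k, spatial_k, spatial_k))
    return pool(x.unsqueeze(1)).squeeze(1)


x = torch.randn(8, 32, 25, 25)
y = feature_pool(x, channel_k=2, spatial_k=2)  # joint feature and spatial pooling (row 3)
z = feature_pool(x, channel_k=2, spatial_k=1)  # feature-only pooling (row 7)
print(y.shape, z.shape)  # torch.Size([8, 16, 12, 12]) torch.Size([8, 16, 25, 25])
```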

Relational block. Our relational block, shown in Figure 2, is based on [69], which implements the non-local mean operation [13] as follows:

$y_i = \frac{1}{\mathcal{C}(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)$   (1)

where $x$ denotes the input feature map, $y$ the output, $i$ the index of the considered output position, and $j$ ranges over all positions.

We seek a pairwise similarity measure, $f$, that considers the relationships between every pair of features $(x_i, x_j)$, with $i$ and $j$ indexing positions in the feature maps. Instead of comparing $x_i$ to $x_j$ directly, we compare embeddings derived from these features via respective convolutional layers, $u$ and $v$. We follow non-local modeling [13], and similarly to [69], use an exponentiated dot-product similarity (Gaussian radial basis function) between embeddings, $f(x_i, x_j) = \exp(u(x_i)^\top v(x_j))$. By formalizing the normalization factor as $\mathcal{C}(x) = \sum_{\forall j} f(x_i, x_j)$, we use a softmax operation and implement the self-attention formulation proposed in [66]:

$y_i = \sum_{\forall j} \mathrm{softmax}_j\big(u(x_i)^\top v(x_j)\big)\, g(x_j)$   (2)

where $g$ denotes a position-wise linear embedding on $x_j$ computed via convolutions. We lastly introduce an embedding $h$ (similar to $g$) on the output of the non-local mean operation and add a residual connection [24] to encourage the module to focus on local features before establishing distant dependencies:

$z_i = h(y_i) + x_i$   (3)

The embeddings $u$ and $v$ involved in the pairwise function $f$ use half the number of channels used in $g$. We did not use batch normalization [29] in the relational block as early experiments showed it reduced performance when used in the policy network. This degradation was not as significant when used in the value network. Additionally, preliminary experiments showed a slight decrease in performance when using the plain dot-product for $f$. To reiterate, the relational block combined with frame stacking enables the model to consider relationships between entities across space and time, while using computationally-efficient 2D convolutions.

Figure 2: Relational Block. “matmul”, “softmax”, and “add” respectively denote the matrix multiplication, row-wise softmax, and addition operations.
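
A minimal PyTorch-style sketch of such a relational block is shown below. It follows Equations (1)-(3), with the pairwise embeddings using half the channels of $g$ as described above; the class and variable names and the use of $1 \times 1$ convolutions for the embeddings are illustrative assumptions rather than the exact layers of Table 1(a).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelationalBlock(nn.Module):
    """Self-attentive non-local block over a (N, C, H, W) feature map.

    Every position attends to every other position; with stacked frames as
    input, the attended relationships span both space and time while only
    relying on 2D convolutions.
    """

    def __init__(self, channels: int):
        super().__init__()
        inner = channels // 2  # u and v use half the channels of g
        self.u = nn.Conv2d(channels, inner, kernel_size=1)
        self.v = nn.Conv2d(channels, inner, kernel_size=1)
        self.g = nn.Conv2d(channels, channels, kernel_size=1)
        self.h = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, height, width = x.shape

        q = self.u(x).flatten(2).transpose(1, 2)    # (N, P, C/2), P = H * W
        k = self.v(x).flatten(2)                    # (N, C/2, P)
        val = self.g(x).flatten(2).transpose(1, 2)  # (N, P, C)

        attn = F.softmax(torch.bmm(q, k), dim=-1)   # (N, P, P), row-wise softmax
        y = torch.bmm(attn, val)                    # (N, P, C), Eq. (2)
        y = y.transpose(1, 2).reshape(n, c, height, width)

        return self.h(y) + x                        # output embedding + residual, Eq. (3)


block = RelationalBlock(channels=32)
out = block(torch.randn(4, 32, 12, 12))
assert out.shape == (4, 32, 12, 12)
```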

Reward learning.

0 Input, size:
1 Standardize input:
2 Conv2D, , , , LReLU post
3 Conv2D, , , , LReLU post
4 Conv2D, , , , LReLU post
5 Relational Block
6 Conv2D, , , , LReLU post
7 Relational Block
8 Conv2D, , , , LReLU post
9 FullyConnected,
Table 3: Reward network architecture used in RM. LReLU designates leaky ReLU activations [40] with a small leak coefficient (slope for negative inputs).

Generative adversarial imitation learning (GAIL) [26] proposes a new approach to apprenticeship learning [1, 44, 60, 61, 62, 27] by deriving a surrogate reward instead of trying to recover the true reward explaining the expert behavior. The GAIL framework involves a generative adversarial network (GAN) [19], opposing two networks, a generator and a discriminator, in a two-player zero-sum game. The role of the generator is played by the policy, $\pi_\theta$, which attempts to pick expert-like actions to arrive at expert-like states. The role of the discriminator is played by a binary classifier, $D_\varphi$, with parameter vector denoted as $\varphi$. The discriminator observes samples from both expert and policy and tries to tell them apart. In GAIL, the success of a policy, $\pi_\theta$, is measured by its ability to fool $D_\varphi$ into outputting that state-action pairs collected by $\pi_\theta$ were collected by $\pi_e$. This setup can be formulated as a minimax problem $\min_\theta \max_\varphi V(\theta, \varphi)$, where

$V(\theta, \varphi) = \mathbb{E}_{\pi_e}\big[\log D_\varphi(s, a)\big] + \mathbb{E}_{\pi_\theta}\big[\log\big(1 - D_\varphi(s, a)\big)\big]$   (4)

Observing that $D_\varphi$ measures how much a state-action pair generated by the policy resembles the expert’s state-action pairs, RM builds on GAIL by making use of the learned $D_\varphi$ to craft a surrogate reward and training $\pi_\theta$ to maximize this surrogate reward by policy optimization. The introduced parameterized reward signal is trained by minimizing the regularized cross-entropy loss $\ell(\varphi)$:

$\ell(\varphi) = -\mathbb{E}_{\pi_e}\big[\log D_\varphi(s, a)\big] - \mathbb{E}_{\pi_\theta}\big[\log\big(1 - D_\varphi(s, a)\big)\big] + \lambda\, \mathcal{R}(\varphi)$   (5)

where $\mathcal{R}(\varphi)$ denotes the usual gradient penalty regularizer [20] and $\lambda$ its coefficient, encouraging the discriminator to be near-linear and consequently near-convex. The gradient penalty was originally used to mitigate destructive updates in Wasserstein GANs [4]; many works have since reported analogous benefits when it is used in the original JS-GANs [16, 39]. Constraining the layer-wise spectral norm as in [42] yields similar stability advantages and is less computationally intensive than the gradient penalty. We use spectral normalization [42] and gradient penalization [20] simultaneously, as advocated in [33], to train $D_\varphi$ in RM.

We align the surrogate reward signal, $r_\varphi$, with the confusion of the classifier. If the policy manages to fool the classifier, the policy is rewarded for its prediction. To ensure numerical stability, we define the reward as $r_\varphi(\cdot) = -\log(1 - D_\varphi(\cdot) + \epsilon)$, with $\epsilon$ a small positive constant. The parameters $\varphi$ are updated every iteration by descending along gradients of $\ell(\varphi)$ evaluated alternately on a) the minibatch collected by $\pi_\theta$ earlier in the iteration and b) the expert demonstrations. The demonstrations however do not contain actions. Nevertheless, in environments with near-deterministic dynamics, the action $a_t$ can be approximately inferred from the state $s_t$ and the next state $s_{t+1}$. Consequently, the pair $(s_t, s_{t+1})$ constitutes a good proxy for $(s_t, a_t)$. Note, $s_{t+1}$ is available to the agent as soon as the control, $a_t$, is executed in $s_t$. Since we defined the stacked state as the sequence of the $k$ latest observations, $\bar{s}_t = (s_{t-k+1}, \ldots, s_t)$, the two most recent observations in $\bar{s}_{t+1}$ are $s_t$ and $s_{t+1}$. We can therefore evaluate $D_\varphi(s_t, s_{t+1})$ by evaluating $D_\varphi(\bar{s}_{t+1})$, and define the reward over stacked states:

$r_\varphi(\bar{s}_{t+1}) = -\log\big(1 - D_\varphi(\bar{s}_{t+1}) + \epsilon\big)$   (6)

The architecture of the $\varphi$-parameterized reward network is depicted in Table 3, and draws inspiration from the one in [72]. The network features two relational blocks, which make it capable of performing message passing between distant locations (in time, space, or both) in the input.
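
The reward-learning update can be sketched as follows. This is an illustrative implementation rather than the authors' code: the tiny discriminator stub stands in for the network of Table 3, the gradient-penalty coefficient, leak slope and $\epsilon$ are placeholder values, and the loss and reward mirror Equations (5) and (6) with layer-wise spectral normalization applied as discussed above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils import spectral_norm


def make_discriminator(in_channels: int) -> nn.Module:
    # Tiny stand-in for the reward network of Table 3, with spectral
    # normalization [42] applied to every weight layer.
    return nn.Sequential(
        spectral_norm(nn.Conv2d(in_channels, 32, kernel_size=3, stride=2)),
        nn.LeakyReLU(0.1),
        spectral_norm(nn.Conv2d(32, 32, kernel_size=3, stride=2)),
        nn.LeakyReLU(0.1),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        spectral_norm(nn.Linear(32, 1)),
    )


def gradient_penalty(disc: nn.Module, expert_s: torch.Tensor, policy_s: torch.Tensor,
                     coeff: float = 10.0) -> torch.Tensor:
    # One-centered gradient penalty [20] on random interpolates of the two minibatches.
    alpha = torch.rand(expert_s.size(0), 1, 1, 1, device=expert_s.device)
    interp = (alpha * expert_s + (1.0 - alpha) * policy_s).requires_grad_(True)
    grads = torch.autograd.grad(disc(interp).sum(), interp, create_graph=True)[0]
    return coeff * (grads.flatten(1).norm(2, dim=1) - 1.0).pow(2).mean()


def reward_loss(disc: nn.Module, expert_s: torch.Tensor, policy_s: torch.Tensor) -> torch.Tensor:
    # Regularized cross-entropy of Eq. (5): expert stacked states labeled 1, policy ones 0.
    expert_logits, policy_logits = disc(expert_s), disc(policy_s)
    bce = (F.binary_cross_entropy_with_logits(expert_logits, torch.ones_like(expert_logits))
           + F.binary_cross_entropy_with_logits(policy_logits, torch.zeros_like(policy_logits)))
    return bce + gradient_penalty(disc, expert_s, policy_s)


def surrogate_reward(disc: nn.Module, stacked_state: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Surrogate reward of Eq. (6), aligned with the classifier's confusion.
    with torch.no_grad():
        d = torch.sigmoid(disc(stacked_state))
    return -torch.log(1.0 - d + eps)


# Usage sketch with k = 4 stacked frames and an assumed 64x64 resolution.
disc = make_discriminator(in_channels=4)
expert_batch, policy_batch = torch.randn(16, 4, 64, 64), torch.randn(16, 4, 64, 64)
loss = reward_loss(disc, expert_batch, policy_batch)
rewards = surrogate_reward(disc, policy_batch)
```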

RM training. The value and policy networks are trained with the PPO algorithm [55]. Algorithm 1 provides more details on the procedure.

Initialize network parameters ($\theta$, $\phi$, $\varphi$)
for each training iteration do
       for $t = 0, \ldots, T-1$ do
              Observe $\bar{s}_t$ in environment $\mathcal{E}$, perform action $a_t \sim \pi_\theta(\cdot \mid \bar{s}_t)$, observe $\bar{s}_{t+1}$ returned by $\mathcal{E}$
              Augment the transition with reward $r_\varphi(\bar{s}_{t+1})$ and store it in the rollout buffer $\mathcal{R}$
       end for
       for each optimization epoch do
              for each reward update do
                    Sample uniformly a minibatch of stacked states from $\mathcal{R}$
                    Sample uniformly a minibatch of stacked states from the expert dataset $\mathcal{D}_e$, with the same minibatch size
                    Update reward parameters $\varphi$ with the equal mixture by following the gradient of the loss in Eq. (5)
              end for
              for each policy update do
                    Sample uniformly a minibatch of reward-augmented transitions from $\mathcal{R}$
                    Update policy parameters $\theta$ and value parameters $\phi$ with PPO [55] with the minibatch
              end for
       end for
       Flush $\mathcal{R}$
end for
Algorithm 1 Relational Mimic

4 Related Work

Self-Attention. Self-attention has allowed significant advances in neural machine translation [66] and has improved upon previous state-of-the-art methods, which often relied on recurrent models, in the various disciplines of Machine Learning [50, 67, 9, 72]. Not only does self-attention enable state-of-the-art results in sequence prediction and generation, but it can also be formalized as a non-local operation (closely related to the non-local mean [13]). This operation is straightforward to integrate into deep convolutional architectures to solve video prediction tasks, alleviating the need to introduce difficult-to-train and tedious-to-parallelize recurrent models [69].

Adversarial Imitation. While extending the generative adversarial imitation learning paradigm to learn imitation policies from proprioceptive state-only demonstrations has been previously explored [41], to the best of our knowledge only [64] has dealt with visual state representation in this setting. Generative adversarial imitation from observation [64] reports state-of-the-art performance in state-only imitation against non-adversarial baselines. We will therefore compare our models to this baseline in Section 5. Orthogonal to the control tasks considered in this work, Self-Attention GAN (SAGAN) [72] reports state-of-the-art results in image generation tasks by making both the generator and discriminator self-attentive. Our method can be viewed as a bridge between SAGAN and GAIL, which we enriched by adding temporal relational learning capabilities by working over sequences of states, in addition to the spatial relational aspect. Overcoming GAIL’s challenges such as sample-inefficiency [10, 32] and its propensity to mode collapse [23, 38] has been the focus of GAIL research in recent years. These advances are orthogonal to our approach. We believe our work to be the first to apply relational learning via non-local modeling to adversarial imitation learning.

Relational learning. Techniques aiming to exploit the inherent structure of the state in domains or applications where relational inductive biases [7] can be leveraged have recently gained in popularity. Due to their flexibility, graphs constitute the archetype structure at the root of many approaches explicitly looking at pairwise relationships between objects present in the state. Graph neural networks (GNNs) are the most common deep structures in the field. GNNs have been successfully used to model locomotion policies able to transfer across different body shapes and impairment conditions [68], to model interactions in physical systems [8], to improve inter-agent communications in multi-agent scenarios [30, 28], for gait generation [31], for skeleton-based action recognition [56, 57] and to enhance grasp stability in dexterous robotics [18]. While GNNs have proven to be effective at learning relationships between objects, their explicitly defined structure cannot perform object discovery directly from visual states in practice. Self-attention provides a solution as it considers the relationships between every pair of atomic entities in the input features (e.g., convolutional feature maps). By using a self-attention mechanism, relation networks [54] learn a pairwise function of feature embeddings for every pair of input regions (pixels or receptive fields) to discover the relationships that are the most useful to solve question answering tasks. In control, self-attention has been used to play StarCraft II mini-games from raw pixel input, a challenging multi-agent coordination task [71]. Instead, we focus on locomotion and learn limb coordination from visual input with the additional difficulty of having to deal with continuous action spaces and third-person side view camera state spaces with a dynamic background. Our agent's architecture differs from [71] on several points. We deal with spatial and temporal relationships jointly via a combination of convolution and self-attention, whereas [71] relies on a large stack of fully-connected layers to perform non-spatial relational learning or alternatively, on a recurrent model with many parameters to perform temporal relational reasoning. Our method is lightweight in comparison. Additionally, by using non-local modeling with a skip connection, our model is able to attend to local regions. Besides providing improvements in the RL setting, our method yields state-of-the-art results in imitation learning, as we describe next.

5 Results

Experimental setup. In every reported experiment, we use a gradient averaging distributed learning scheme that spawns 4 parallel learners differing only by their random seeds and uses an average of their gradients for every learner update. This orchestration choice does not cause a decrease in sample-efficiency like traditional multi-actor, massively-distributed frameworks would, which is desirable considering the poor sample-efficiency vanilla GAIL already suffers from. We repeat every experiment with 10 random seeds. The same seeds are used for all our experiments. Each learning curve is depicted with a solid curve corresponding to the mean episodic return across the 10 seeds, surrounded by an envelope of width equal to the associated standard deviation. We enforce this methodology to ensure that the drawn comparisons are fair [25].
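
As an illustration of this gradient-averaging orchestration, the sketch below assumes a torch.distributed process group with one process per learner; the paper does not specify a communication backend, so this is only one possible realization.

```python
import torch
import torch.distributed as dist


def average_gradients(model: torch.nn.Module, world_size: int = 4) -> None:
    """Average gradients across the parallel learners before each update.

    Every learner runs the same training code with its own random seed; after
    the local backward pass, gradients are summed with all_reduce and divided
    by the number of learners, so every learner applies the same averaged
    update. Call it between loss.backward() and optimizer.step().
    """
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad.div_(world_size)
```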

We report the performance of our methods and baselines (described later per setting) in locomotion environments from the OpenAI Gym [12] library based on the MuJoCo physics engine [63]. Simulator states are rendered using the OpenAI Gym headless renderer with the default camera view for the given environment to ensure the reproducibility of the entirety of our experiments. The rendered frames are converted to gray-scale, and the $k$ most recent ones are stacked to form the pixel input state.

In addition to the mean episodic return across random seeds, we also report the Complementary Cumulative Distribution Function (CCDF), also sometimes called the survival function, for every experiment. The CCDF shows an estimate of how often the return is above a given threshold. In contrast with the mean, the CCDF does not taint the performance of the best seeds with the performance of the worst ones, which is an especially desirable property since we work over many seeds.
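
For concreteness, one way to estimate the area under the CCDF from episodic returns pooled over seeds is sketched below; the threshold grid and the pooling of evaluation episodes across the 10 seeds are our assumptions, since the exact estimator is not detailed here.

```python
import numpy as np


def area_under_ccdf(returns: np.ndarray, num_thresholds: int = 1000) -> float:
    """Estimate the area under the empirical CCDF (survival function) of returns.

    For each threshold, the CCDF is the fraction of evaluation episodes whose
    return exceeds it; integrating over the return range yields a single score
    that is not dragged down by averaging the best seeds with the worst ones.
    """
    thresholds = np.linspace(0.0, returns.max(), num_thresholds)
    ccdf = np.array([(returns > t).mean() for t in thresholds])
    return float(np.trapz(ccdf, thresholds))


# Usage sketch: returns pooled over 10 seeds and 50 evaluation episodes each.
rng = np.random.default_rng(0)
pooled_returns = rng.normal(loc=2500.0, scale=400.0, size=10 * 50).clip(min=0.0)
print(area_under_ccdf(pooled_returns))
```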

We trained the expert policies used in this work in the environments our agents interact with. Experts are trained using the proprioceptive state representation with the PPO algorithm [55] as it provides the best wall-clock time, being an on-policy model-free continuous control method. Expert training was set to proceed for 10 million timesteps. To collect demonstrations from the trained experts, we ran each expert in its environment while rendering every visited state with the same rendering scheme used for the policy input, and saved these visual state representations as demonstrations. A demonstration is therefore a video of the expert interacting with its environment for an episode. The frames are saved with the same frame-rate as when interacting, which is an important property to preserve in order for the agent to match the expert velocity. For reproducibility purposes, we used the default settings of the simulator (MuJoCo version 2.0 binaries).

Note, we used layer normalization [5] in the dense layers of every network. See supplementary for more details.

Method             Area under CCDF (Hopper-v3)   Area under CCDF (Walker2d-v3)
Baseline           524,189                       479,430
RM-L-L (ours)      731,871                       1,539,345
RM-L-NL (ours)     771,666                       1,444,834
RM-NL-NL (ours)    645,558                       1,640,749
Figure 3: Imitation learning performance comparison with 8 demonstrations and $k=4$ stacked input frames.
Method             Area under CCDF (Hopper-v3)   Area under CCDF (Walker2d-v3)
Baseline           467,803                       457,018
RM-L-L (ours)      892,730                       846,887
RM-L-NL (ours)     866,267                       1,604,128
RM-NL-NL (ours)    846,395                       1,603,569
Figure 4: Imitation learning performance comparison with 8 demonstrations and $k=8$ stacked input frames.

Imitation learning results. In Figures 3 and 4, we compare the performance of three different network configurations of RM (RM-L-L: local policy, local value; RM-L-NL: local policy, non-local value; RM-NL-NL: non-local policy, non-local value) against the baseline. The RM configurations only differ by their use of relational blocks in the reward, policy and value modules, and are summarized in Table 4. The baseline closely resembles [64], the variant of GAIL [26] reporting state-of-the-art performance in imitation from demonstrations without actions. The only differences from [64] are that we modulate the number of stacked frames in the visual input state, and that we adopt network architectures that make the comparisons against RM fair. The baseline corresponds to RM without relational modules, as summarized in Table 4. Note, [64] only reports results for environments less complex than Walker2d in the studied setting.

Method             Non-local reward   Non-local policy   Non-local value
Baseline           no                 no                 no
RM-L-L (ours)      yes                no                 no
RM-L-NL (ours)     yes                no                 yes
RM-NL-NL (ours)    yes                yes                yes
Table 4: Use of relational blocks in the different modules.

In Figure 3, we observe that: a) non-local modeling has the most significant effect when used in the reward module compared to other modules, b) all the RM variants perform similarly, and c) the baseline does not take off in the more complex Walker2d environment, unlike RM.

Additionally, Figure 4 shows that, by increasing the input sequence length from $k=4$ to $k=8$, RM achieves better performance in the Hopper environment, with a +21% increase for all methods. In the Walker2d environment however, while RM-L-NL benefits from a +11% increase in performance, RM-L-L suffers from a -45% decrease, but still scores well above the baseline. This shows that using relational learning in the value and (or) policy can help deal with longer input sequences, a particularly valuable observation for POMDPs that require such a long input history to alleviate poor state observability. Finally, the baseline suffers from a -7% performance drop in every environment when increasing $k$ from 4 to 8, further widening the gap with RM.

Method                  Area under CCDF (Walker2d-v3)
Nature [43]             6,582,843
LargeImpala [15]        7,112,912
LocalAgent (ours)       8,774,574
NonLocalAgent (ours)    9,652,209
Figure 5: RL performance comparison with .

Reinforcement learning results. In Figure 5, we compare the performance of several architectures, previously described in Table 2. We train these by RL, with PPO [55], using the reward from the environment. The results show that the LocalAgent outperforms the Nature [43] and LargeImpala [15] baselines. The performance is further increased by the NonLocalAgent, using relational modules in both the policy and value (10% increase).

6 Conclusion

In this work, we introduced RM, a new method for visual imitation learning from observations based on GAIL [26], which we enriched with the capability to consider spatial and temporal long-range relationships in the input, allowing our agents to perform relational learning. Since the significant gains in sample-efficiency and overall performance enabled by our method stem from an architecture enrichment, RM can be directly combined with methods addressing GAIL sample-inefficiency by algorithmic enhancements. The obtained results are in line with our initial conjecture about the usefulness of self-attention to solve locomotion tasks. Our method is able to work with high-dimensional state spaces, such as video sequences, and shows resilience to periodic limb obstruction on the pixel input and to video demonstration misalignment. Finally, we studied the effect of self-attention in the different components of our model and reported outcomes on policy improvement, policy evaluation, and reward learning. The most significant impact was observed when we used self-attention for reward learning.

7 Future Work

In future work, an investigation of visual relational learning could help agents better cope with induced simulated body impairments [68] in locomotion tasks and predict the impact of proposed changes on the ensuing walking gait [35] when working solely with visual state representations. Another avenue of improvement could be to leverage other modalities relevant for skeleton-based locomotion (e.g., limb morphology, kinematics) to solve the upstream task of learning an accurate inverse dynamics model [22, 65, 17]. Using this model, one could then learn a mimic on the richer, action-augmented demonstrations.

References

  1. P. Abbeel and A. Y. Ng (2004) Apprenticeship Learning via Inverse Reinforcement Learning. In International Conference on Machine Learning (ICML), External Links: Link Cited by: §1, §3.
  2. D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman and D. Mané (2016-06) Concrete Problems in AI Safety. External Links: Link, 1606.06565 Cited by: §1.
  3. B. D. Argall, S. Chernova, M. Veloso and B. Browning (2009-05) A survey of robot learning from demonstration. Robotics and Autonomous Systems 57 (5), pp. 469–483. External Links: Link Cited by: §1.
  4. M. Arjovsky, S. Chintala and L. Bottou (2017-01) Wasserstein GAN. External Links: Link, 1701.07875 Cited by: §3.
  5. J. L. Ba, J. R. Kiros and G. E. Hinton (2016-07) Layer Normalization. External Links: Link, 1607.06450 Cited by: §5.
  6. J. A. Bagnell (2015) An invitation to imitation. Technical report Carnegie Mellon, Robotics Institute, Pittsburgh. External Links: Link Cited by: §1, §1.
  7. P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, C. Gulcehre, F. Song, A. Ballard, J. Gilmer, G. Dahl, A. Vaswani, K. Allen, C. Nash, V. Langston, C. Dyer, N. Heess, D. Wierstra, P. Kohli, M. Botvinick, O. Vinyals, Y. Li and R. Pascanu (2018-06) Relational inductive biases, deep learning, and graph networks. External Links: Link, 1806.01261 Cited by: §4.
  8. P. W. Battaglia, R. Pascanu, M. Lai, D. Rezende and K. Kavukcuoglu (2016-12) Interaction Networks for Learning about Objects, Relations and Physics. External Links: Link, 1612.00222 Cited by: §4.
  9. I. Bello, B. Zoph, A. Vaswani, J. Shlens and Q. V. Le (2019-04) Attention Augmented Convolutional Networks. External Links: Link, 1904.09925 Cited by: §4.
  10. L. Blondé and A. Kalousis (2019) Sample-Efficient Imitation Learning via Generative Adversarial Nets. In International Conference on Artificial Intelligence and Statistics (AISTATS), External Links: Link Cited by: §4.
  11. M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao and K. Zieba (2016-04) End to End Learning for Self-Driving Cars. External Links: Link, 1604.07316 Cited by: §1.
  12. G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang and W. Zaremba (2016-06) OpenAI Gym. External Links: Link, 1606.01540 Cited by: §1, §5.
  13. A. Buades, B. Coll and J. Morel (2005) A non-local algorithm for image denoising. In Conference on Computer Vision and Pattern Recognition (CVPR), External Links: Link Cited by: §3, §4.
  14. K. Cobbe, O. Klimov, C. Hesse, T. Kim and J. Schulman (2018-12) Quantifying Generalization in Reinforcement Learning. External Links: Link, 1812.02341 Cited by: §3.
  15. L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg and K. Kavukcuoglu (2018-02) IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures. External Links: Link, 1802.01561 Cited by: Table 2, §3, Figure 5, §5.
  16. W. Fedus, M. Rosca, B. Lakshminarayanan, A. M. Dai, S. Mohamed and I. Goodfellow (2018) Many Paths to Equilibrium: GANs Do Not Need to Decrease a Divergence At Every Step. In International Conference on Learning Representations (ICLR), External Links: Link Cited by: §3.
  17. Y. Ganin, T. Kulkarni, I. Babuschkin, S. M. Ali Eslami and O. Vinyals (2018-04) Synthesizing Programs for Images using Reinforced Adversarial Learning. External Links: Link, 1804.01118 Cited by: §7.
  18. A. Garcia-Garcia, B. S. Zapata-Impata, S. Orts-Escolano, P. Gil and J. Garcia-Rodriguez (2019-01) TactileGCN: A Graph Convolutional Network for Predicting Grasp Stability with Tactile Sensors. External Links: Link, 1901.06181 Cited by: §4.
  19. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville and Y. Bengio (2014) Generative Adversarial Nets. In Neural Information Processing Systems (NIPS), pp. 2672–2680. External Links: Link Cited by: §1, §3.
  20. I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin and A. Courville (2017) Improved Training of Wasserstein GANs. In Neural Information Processing Systems (NIPS), External Links: Link Cited by: §3.
  21. D. Hadfield-Menell, S. Milli, P. Abbeel, S. Russell and A. Dragan (2017-11) Inverse Reward Design. External Links: Link, 1711.02827 Cited by: §1.
  22. J. P. Hanna and P. Stone (2017) Grounded Action Transformation for Robot Learning in Simulation. In AAAI Conference on Artificial Intelligence, External Links: Link Cited by: §7.
  23. K. Hausman, Y. Chebotar, S. Schaal, G. Sukhatme and J. Lim (2017) Multi-Modal Imitation Learning from Unstructured Demonstrations using Generative Adversarial Nets. In Neural Information Processing Systems (NIPS), External Links: Link Cited by: §4.
  24. K. He, X. Zhang, S. Ren and J. Sun (2015-12) Deep Residual Learning for Image Recognition. External Links: Link, 1512.03385 Cited by: 1(b), §3.
  25. P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup and D. Meger (2018) Deep Reinforcement Learning that Matters. In AAAI Conference on Artificial Intelligence, External Links: Link Cited by: §5.
  26. J. Ho and S. Ermon (2016) Generative Adversarial Imitation Learning. In Neural Information Processing Systems (NIPS), External Links: Link Cited by: Relational Mimic for Visual Adversarial Imitation Learning, §1, §3, §5, §6.
  27. J. Ho, J. K. Gupta and S. Ermon (2016-05) Model-Free Imitation Learning with Policy Optimization. External Links: Link, 1605.08478 Cited by: §1, §3.
  28. Y. Hoshen (2017) VAIN: Attentional Multi-agent Predictive Modeling. In Neural Information Processing Systems (NeurIPS), External Links: Link Cited by: §4.
  29. S. Ioffe and C. Szegedy (2015-02) Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. External Links: Link, 1502.03167 Cited by: §3.
  30. J. Jiang, C. Dun and Z. Lu (2018-10) Graph Convolutional Reinforcement Learning for Multi-Agent Cooperation. External Links: Link, 1810.09202 Cited by: §4.
  31. T. Kipf, E. Fetaya, K. Wang, M. Welling and R. Zemel (2018) Neural Relational Inference for Interacting Systems. In International Conference on Machine Learning (ICML), External Links: Link Cited by: §4.
  32. I. Kostrikov, K. K. Agrawal, D. Dwibedi, S. Levine and J. Tompson (2019) Discriminator-Actor-Critic: Addressing Sample Inefficiency and Reward Bias in Adversarial Imitation Learning. In International Conference on Learning Representations (ICLR), External Links: Link Cited by: §4.
  33. K. Kurach, M. Lucic, X. Zhai, M. Michalski and S. Gelly (2018-07) The GAN Landscape: Losses, Architectures, Regularization, and Normalization. External Links: Link, 1807.04720 Cited by: §3.
  34. J. Lacotte, M. Ghavamzadeh, Y. Chow and M. Pavone (2019) Risk-Sensitive Generative Adversarial Imitation Learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), External Links: Link Cited by: §1.
  35. S. Lee, M. Park, K. Lee and J. Lee (2019-08) Scalable Muscle-Actuated Human Simulation and Control. ACM Trans. Graph.. Cited by: §7.
  36. J. Leike, D. Krueger, T. Everitt, M. Martic, V. Maini and S. Legg (2018-11) Scalable agent alignment via reward modeling: a research direction. External Links: Link, 1811.07871 Cited by: §1.
  37. J. Leike, M. Martic, V. Krakovna, P. A. Ortega, T. Everitt, A. Lefrancq, L. Orseau and S. Legg (2017-11) AI Safety Gridworlds. External Links: Link, 1711.09883 Cited by: §1.
  38. Y. Li, J. Song and S. Ermon (2017) InfoGAIL: Interpretable Imitation Learning from Visual Demonstrations. In Neural Information Processing Systems (NIPS), External Links: Link Cited by: §4.
  39. M. Lucic, K. Kurach, M. Michalski, S. Gelly and O. Bousquet (2017-11) Are GANs Created Equal? A Large-Scale Study. External Links: Link, 1711.10337 Cited by: §3.
  40. A. L. Maas, A. Y. Hannun and A. Y. Ng (2013) Rectifier Nonlinearities Improve Neural Network Acoustic Models. In International Conference on Machine Learning (ICML), External Links: Link Cited by: Table 3.
  41. J. Merel, Y. Tassa, T. B. Dhruva, S. Srinivasan, J. Lemmon, Z. Wang, G. Wayne and N. Heess (2017-07) Learning human behaviors from motion capture by adversarial imitation. External Links: Link, 1707.02201 Cited by: §4.
  42. T. Miyato, T. Kataoka, M. Koyama and Y. Yoshida (2018) Spectral Normalization for Generative Adversarial Networks. In International Conference on Learning Representations (ICLR), External Links: Link Cited by: §3.
  43. V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg and D. Hassabis (2015-02) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533 (en). External Links: Link Cited by: §1, Table 2, Figure 5, §5.
  44. G. Neu and C. Szepesvari (2012-06) Apprenticeship Learning using Inverse Reinforcement Learning and Gradient Methods. External Links: Link, 1206.5264 Cited by: §1, §3.
  45. A. Y. Ng, D. Harada and S. Russell (1999) Policy invariance under reward transformations: Theory and application to reward shaping. In International Conference on Machine Learning (ICML), pp. 278–287. External Links: Link Cited by: §1.
  46. A. Y. Ng and S. J. Russell (2000) Algorithms for Inverse Reinforcement Learning. In International Conference on Machine Learning (ICML), pp. 663–670. External Links: Link Cited by: §1.
  47. OpenAI, M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, J. Schneider, S. Sidor, J. Tobin, P. Welinder, L. Weng and W. Zaremba (2018-08) Learning Dexterous In-Hand Manipulation. External Links: Link, 1808.00177 Cited by: §1.
  48. D. Pomerleau (1989) ALVINN: An Autonomous Land Vehicle in a Neural Network. In Neural Information Processing Systems (NIPS), pp. 305–313. External Links: Link Cited by: §1.
  49. D. Pomerleau (1990) Rapidly Adapting Artificial Neural Networks for Autonomous Navigation. In Neural Information Processing Systems (NIPS), pp. 429–435. External Links: Link Cited by: §1.
  50. P. Ramachandran, N. Parmar, A. Vaswani, I. Bello, A. Levskaya and J. Shlens (2019-06) Stand-Alone Self-Attention in Vision Models. External Links: Link, 1906.05909 Cited by: §4.
  51. N. Ratliff, J. A. Bagnell and S. S. Srinivasa (2007-11) Imitation learning for locomotion and manipulation. In IEEE-RAS International Conference on Humanoid Robots, pp. 392–397. External Links: Link Cited by: §1.
  52. S. Ross and J. A. Bagnell (2010) Efficient Reductions for Imitation Learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), External Links: Link Cited by: §1.
  53. S. Ross, G. J. Gordon and J. A. Bagnell (2011) A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), External Links: Link Cited by: §1.
  54. A. Santoro, D. Raposo, D. G. T. Barrett, M. Malinowski, R. Pascanu, P. Battaglia and T. Lillicrap (2017-06) A simple neural network module for relational reasoning. External Links: Link, 1706.01427 Cited by: §1, §4.
  55. J. Schulman, F. Wolski, P. Dhariwal, A. Radford and O. Klimov (2017-07) Proximal Policy Optimization Algorithms. External Links: Link Cited by: §3, §5, §5, 1.
  56. L. Shi, Y. Zhang, J. Cheng and H. Lu (2019) Skeleton-Based Action Recognition with Directed Graph Neural Networks. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7912–7921. External Links: Link Cited by: §4.
  57. L. Shi, Y. Zhang, J. Cheng and H. Lu (2019) Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), External Links: Link Cited by: §4.
  58. D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel and D. Hassabis (2016-01) Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484–489 (en). External Links: Link Cited by: §1.
  59. R. S. Sutton and A. G. Barto (1998) Reinforcement Learning: An Introduction. Cited by: §1.
  60. U. Syed, M. Bowling and R. E. Schapire (2008) Apprenticeship Learning Using Linear Programming. In International Conference on Machine Learning (ICML), pp. 1032–1039. External Links: Link Cited by: §1, §3.
  61. U. Syed and R. E. Schapire (2008) A Game-Theoretic Approach to Apprenticeship Learning. In Neural Information Processing Systems (NIPS), pp. 1449–1456. External Links: Link Cited by: §1, §3.
  62. U. Syed and R. E. Schapire (2010) A Reduction from Apprenticeship Learning to Classification. In Neural Information Processing Systems (NIPS), pp. 2253–2261. External Links: Link Cited by: §1, §3.
  63. E. Todorov, T. Erez and Y. Tassa (2012-10) MuJoCo: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5026–5033. External Links: Link Cited by: §1, §5.
  64. F. Torabi, G. Warnell and P. Stone (2018-07) Generative Adversarial Imitation from Observation. External Links: Link, 1807.06158 Cited by: §1, §4, §5.
  65. F. Torabi, G. Warnell and P. Stone (2018) Behavioral Cloning from Observation. In International Joint Conference on Artificial Intelligence (IJCAI), External Links: Link Cited by: §1, §7.
  66. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser and I. Polosukhin (2017) Attention Is All You Need. In Neural Information Processing Systems (NIPS), External Links: Link Cited by: §1, §3, §4.
  67. P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò and Y. Bengio (2018) Graph Attention Networks. In International Conference on Machine Learning (ICML), External Links: Link Cited by: §4.
  68. T. Wang, R. Liao, J. Ba and S. Fidler (2018-02) NerveNet: Learning Structured Policy with Graph Neural Networks. In International Conference on Learning Representations (ICLR), External Links: Link Cited by: §1, §4, §7.
  69. X. Wang, R. Girshick, A. Gupta and K. He (2018) Non-local Neural Networks. In Conference on Computer Vision and Pattern Recognition (CVPR), External Links: Link Cited by: §1, §3, §3, §3, §4.
  70. Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot and N. de Freitas (2016) Dueling Network Architectures for Deep Reinforcement Learning. In International Conference on Machine Learning (ICML), pp. 1995–2003. External Links: Link Cited by: §3.
  71. V. Zambaldi, D. Raposo, A. Santoro, V. Bapst, Y. Li, I. Babuschkin, K. Tuyls, D. Reichert, T. Lillicrap, E. Lockhart, M. Shanahan, V. Langston, R. Pascanu, M. Botvinick, O. Vinyals and P. Battaglia (2019) Deep reinforcement learning with relational inductive biases. In International Conference on Learning Representations (ICLR), External Links: Link Cited by: §4.
  72. H. Zhang, I. Goodfellow, D. Metaxas and A. Odena (2019) Self-Attention Generative Adversarial Networks. In International Conference on Machine Learning (ICML), External Links: Link Cited by: §3, §4, §4.
  73. B. D. Ziebart, A. L. Maas, J. A. Bagnell and A. K. Dey (2008) Maximum Entropy Inverse Reinforcement Learning. In AAAI Conference on Artificial Intelligence, pp. 1433–1438. External Links: Link Cited by: §1.