Learning to Follow Language Instructions with Adversarial Reward Induction



Recent work has shown that deep reinforcement-learning agents can learn to follow language-like instructions from infrequent environment rewards. However, for many real-world natural language commands that involve a degree of underspecification or ambiguity, such as tidy the room, it would be challenging or impossible to program an appropriate reward function. To overcome this, we present a method for learning to follow commands from a training set of instructions and corresponding example goal-states, rather than an explicit reward function. Importantly, the example goal-states are not seen at test time. The approach effectively separates the representation of what instructions require from how they can be executed. In a simple grid world, the method enables an agent to learn a range of commands requiring interaction with blocks and understanding of spatial relations and underspecified abstract arrangements. We further show the method allows our agent to adapt to changes in the environment without requiring new training examples.



Dzmitry Bahdanau (MILA, University of Montreal; work done during an internship at DeepMind), dimabgv@gmail.com
Felix Hill (DeepMind), felixhill@google.com
Jan Leike (DeepMind), leike@google.com
Edward Hughes (DeepMind), edwardhughes@google.com
Pushmeet Kohli (DeepMind), pushmeet@google.com
Edward Grefenstette (DeepMind), etg@google.com


Preprint. Work in progress.

1 Introduction

Developing agents that can learn to follow user instructions pertaining to an environment is a longstanding goal of AI research [29]. This challenge is complicated by the large degree of vagueness (under-specification) and ambiguity inherent in natural language. Recent work has shown deep reinforcement learning (RL) to be a promising paradigm for learning to follow language-like instructions in both 2D and 3D worlds (e.g. [11, 6], see Section 4 for a review). However, in each of these cases, a reward function instantiated in the environment is programmed to evaluate whether an instruction—such as find the red tomato—has been successfully executed. This approach is viable if the environment can unambiguously report whether a red tomato has been found. However, many language instructions in complex environments (or, indeed, the real world) could not plausibly be checked in this way. For instance, it is hard to imagine a hard-coded reward function for the everyday chores fold the towels, arrange the flowers or set the table, even though human users would have little problem judging whether these tasks had been carried out correctly.

In this work, we take a step towards learning to execute a much wider class of underspecified and partially ambiguous instructions. We focus on the case of declarative commands that implicitly characterize a set of goal-states (e.g. "arrange the red blocks in a circle"). Given a dataset of instructions and a subset of the (multiple) viable goal-states for each instruction, provided by an expert, we jointly train a discriminator network and a policy network, which focus on the "what to do" and "how to do it" aspects of the tasks, respectively. The discriminator predicts whether a given state is a goal-state for an instruction or not. Meanwhile, the policy network maximizes the frequency with which it confuses the discriminator. We call our approach Adversarial Goal-Induced Learning from Examples (AGILE). AGILE is strongly inspired by Inverse Reinforcement Learning (IRL; 19, 32) methods in general, and Generative Adversarial Imitation Learning [12] in particular. However, it develops these methods to enable language learning: the policy and the discriminator are conditioned on an instruction, and the training data contains goal-states rather than complete trajectories.

We first verify that our method works in settings where a comparison with deep RL is possible, to which end we implement a programmatic reward function in the environment. In this setting, we show that the learning speed and performance of AGILE are superior to the standard A3C policy-gradient algorithm, and comparable to A3C supplemented by auxiliary unsupervised reward prediction, all without using the formal reward that RL relies on. To simulate an instruction-learning setting in which RL would be problematic, we then construct a dataset of instructions and goal-states for the task of building colored, orientation-invariant arrangements of blocks. On this task, without us ever having to implement the reward function, the AGILE agent learns to consistently construct arrangements as instructed. Finally, we study how well the AGILE agent generalizes beyond the examples on which it was trained. We find that the discriminator can be reused to allow the policy to adapt to changes in the environment.

2 Adversarial Goal-Induced Learning from Examples

Figure 1: Information flow during AGILE training. The policy acts conditioned on the instruction and receives the reward from the discriminator. The discriminator is trained to distinguish between “A”, the (instruction, state) pairs from the agent’s experience, and “B”, the (instruction, goal-state) pairs from the dataset.

Our algorithm trains a policy π_θ with parameters θ on training instances (instructions c paired with initial states s_0), using only a dataset G of (instruction, goal-state) examples to convey the semantics of the instructions. In order to train the agent without an explicit reward, using just the examples in G, we introduce an additional network D_φ, the discriminator, whose purpose is to define a meaningful reward function for training π_θ. Specifically, the discriminator is trained to predict whether a state s is a goal-state for an instruction c. The discriminator's positive examples are fetched from G, whereas its negative examples come from the agent's attempts to solve the training instances. Formally, the policy is trained to maximize a return R(θ; φ) and the discriminator is trained to minimize a cross-entropy loss L(φ; θ), the equations for which are:

R(θ; φ) = E_{(c, s_0)} E_{s_{1:T} ~ π_θ} [ Σ_{t=1}^{T} γ^t [D_φ(c, s_t) > 0.5] + α H(π_θ) ]   (1)

L(φ; θ) = E_{(c, s) ~ B} [ −log(1 − D_φ(c, s)) ] + E_{(c, g) ~ G} [ −log D_φ(c, g) ]   (2)

In the equations above, the Iverson bracket [·] maps truth to 1 and falsehood to 0, e.g. [D_φ(c, s) > 0.5] = 1 iff D_φ(c, s) > 0.5 and 0 otherwise. γ is the discount factor. With s_{1:T} we denote a state trajectory obtained by sampling a training instance (c, s_0) and running π_θ conditioned on c starting from s_0. B denotes a replay buffer to which (c, s_t) pairs from T-step episodes are added; i.e., it approximates the undiscounted occupancy measure over the first T steps. D_φ(c, s) is the probability of a positive label according to the discriminator. H(π_θ) is the policy's entropy, and α is a hyperparameter. The approach is illustrated in Fig 1. Pseudocode is available in Appendix B.
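As a concrete illustration, the two training signals can be sketched in a few lines of Python. This is our simplification, not the paper's code: the function names and the scalar-probability interface are assumptions.

```python
import math

def agile_reward(goal_prob):
    """Binary AGILE reward: 1 if the discriminator assigns the
    (instruction, state) pair a goal probability above 0.5, else 0."""
    return 1.0 if goal_prob > 0.5 else 0.0

def discriminator_loss(neg_probs, pos_probs):
    """Cross-entropy loss in the shape of Equation (2): negatives are
    (c, s) pairs drawn from the replay buffer B, positives are (c, g)
    pairs from the goal-state dataset G; both lists hold the
    discriminator's output probabilities for those pairs."""
    eps = 1e-12  # numerical guard against log(0)
    neg = -sum(math.log(1.0 - p + eps) for p in neg_probs) / len(neg_probs)
    pos = -sum(math.log(p + eps) for p in pos_probs) / len(pos_probs)
    return neg + pos
```

A well-trained discriminator drives both loss terms toward zero while the policy, symmetrically, tries to reach states where `agile_reward` fires.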

Dealing with False Negatives

Compared to the ideal case, in which a pair (c, s) ∈ B would be labelled positive iff s is a goal-state for c, the labelling of examples implied by Equation (2) can be very noisy, because for a significant share of the pairs in B the intermediate state s could already be a goal-state for c. Depending on the task and the episode length, the share of such false negatives could be big enough to hurt the discriminator's training. We therefore propose the following simple heuristic to approximately identify the false negatives. We rank the examples in B according to the discriminator's output D_φ(c, s) and discard the top (100 − ρ) percent as potential false negatives. Only the remaining ρ percent are used as negative examples for the discriminator. Formally speaking, the first term in Equation (2) becomes E_{(c, s) ~ B_ρ} [−log(1 − D_φ(c, s))], where B_ρ stands for the ρ percent of B selected as described above. We will henceforth refer to ρ as the anticipated negative rate. Setting ρ to 100% means using all of B as in Equation (2), but our preliminary experiments showed clearly that this inhibits the discriminator's ability to correctly learn a reward function. Using too small a value for ρ, on the other hand, may deprive the discriminator of the most informative negative examples. We thus recommend tuning ρ as a hyperparameter on a task-specific basis.
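The filtering heuristic amounts to sorting the replay buffer by discriminator score and keeping only the lowest-scoring fraction as negatives. A minimal sketch, assuming a list-based buffer and a scoring callback (both our simplifications):

```python
def select_negatives(buffer, disc_prob, rho):
    """Rank (instruction, state) pairs by the discriminator's goal
    probability and keep only the bottom rho fraction as negative
    examples; the top (1 - rho) fraction is discarded as potential
    false negatives. rho is the anticipated negative rate in (0, 1]."""
    ranked = sorted(buffer, key=lambda cs: disc_prob(*cs))  # ascending score
    keep = max(1, int(round(rho * len(ranked))))
    return ranked[:keep]
```

With `rho=1.0` this degenerates to using the whole buffer, i.e. Equation (2) unmodified.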

Reusability of the Discriminator

An appealing advantage of AGILE is the fact that the discriminator and the policy learn two related but distinct aspects of an instruction: the discriminator focuses on recognizing the goal-states (what should be done), whereas the policy learns what to do in order to get to a goal-state (how it should be done). The intuition motivating this design is that the knowledge about how instructions define goals should generalize more strongly than the knowledge about which behavior is needed to execute instructions. Following this intuition, we propose to reuse the discriminator of a trained AGILE agent as a reward function for training or fine-tuning policies.

3 Experiments

We experiment with AGILE in a grid environment that we call GridLU, short for Grid Language Understanding and after the famous SHRDLU world [29]. GridLU is a fully observable grid world in which the agent can walk around the grid (moving up, down, left or right), pick blocks up and drop them at new locations (see Figure 2 for an illustration and Appendix C for a detailed description of the environment).

3.1 Models

All our models receive the world state as a 56x56 RGB image. Because the language of our instructions is generated from a simple grammar, we perform most of our experiments using policy and discriminator networks that are constructed using the Neural Module Network (NMN, [3]) paradigm. NMN is an elegant architecture for grounded language processing in which a tree of neural modules is constructed based on the language input. The visual input is then fed to the leaf modules, which send their outputs to their parent modules, a process that is repeated up to the root of the tree. We mimic the structure of the instructions when constructing the tree of modules; for example, the NMN corresponding to the instruction c = NorthFrom(Color('red', Shape('circle', SCENE)), Color('blue', Shape('square', SCENE))) performs a computation h_NorthFrom(h_Color('red')(h_Shape('circle')(h_s)), h_Color('blue')(h_Shape('square')(h_s))), where h_x denotes the module corresponding to the token x, and h_s is a representation of the state s. Each module performs a convolution (weights shared by all modules) followed by token-specific Feature-Wise Linear Modulation (FiLM) [22]: m(x_l, x_r) = ReLU((W * [x_l; x_r]) ⊙ γ ⊕ β), where x_l and x_r are module inputs, γ is a vector of FiLM multipliers, β a vector of FiLM biases, ⊙ and ⊕ are element-wise multiplication and addition with broadcasting, * denotes convolution, and [·; ·] denotes concatenation. The representation h_s is produced by a convnet. The NMN's output undergoes max-pooling and is fed through a 1-layer MLP to produce action probabilities or the discriminator's output. Note that while our policy and discriminator are mostly similar structure-wise, they do not share parameters.
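The FiLM operation itself is simple. Stripped of the shared convolution and of spatial broadcasting, and applied to a flat channel vector, it is just a per-channel scale-and-shift followed by a ReLU (this reduced form is our illustration, not the paper's implementation):

```python
def film(features, gammas, betas):
    """Feature-wise linear modulation: each channel is multiplied by a
    token-specific gamma, shifted by a token-specific beta, then passed
    through a ReLU. In the full model this follows a shared convolution
    and broadcasts over spatial positions; here features is flat."""
    return [max(0.0, f * g + b) for f, g, b in zip(features, gammas, betas)]
```

Conditioning on language then amounts to predicting the gamma and beta vectors from the instruction (per token in the NMN, per layer in the FiLM-LSTM variant below).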

NMN is an excellent model when the language structure is known, but this may not be the case for natural language. To showcase AGILE's generality, we also experiment with a very basic structure-agnostic architecture. We use FiLM to condition a standard convnet on an instruction representation produced by an LSTM. The k-th layer of the convnet performs a computation h_k = ReLU((W_k * h_{k−1}) ⊙ γ_k ⊕ β_k), where γ_k and β_k are layer-specific FiLM multipliers and biases predicted from the LSTM's instruction representation. The same procedure as described above for the FiLM-NMN is used to produce the network outputs from the output of the final layer of the convnet.

In the rest of the paper we will refer to the architectures described above as FiLM-NMN and FiLM-LSTM respectively. FiLM-NMN will be the default model in all experiments unless explicitly specified otherwise. Detailed information about network architectures can be found in Appendix G.

3.2 Training Details

For the baseline RL experiments and for training the policy component of AGILE we used the Asynchronous Advantage Actor-Critic (A3C; 18) algorithm, with a discount factor γ and with the full Monte-Carlo return as the baseline target, i.e. without temporal-difference learning for the baseline network. The length of an episode was 30, but we trained the agent on advantage-estimation rollouts of length 15. Every experiment was repeated 5 times. We considered an episode to be a success if the final state was a goal-state as judged by a task-specific success criterion, which we describe for the individual tasks below. We use the success rate (i.e. the percentage of successful episodes) as our main performance metric for the agents. Unless otherwise specified, we use the NMN-based policy and discriminator in our experiments. Full experimental details can be found in Appendix B.

Figure 2: Initial state and goal state for GridLU-Relations (top-left) and GridLU-Arrangements episodes (bottom-left), and the complete GridLU-Arrangements vocabulary (right), each with examples of some possible goal-states.

3.3 GridLU-Relations

Our first task, GridLU-Relations, is an adaptation of the SHAPES visual question answering dataset [3] in which the blocks can be moved around freely. GridLU-Relations requires the agent to induce the meaning of spatial relations such as above or right of, and to manipulate the world in order to instantiate these relationships. The task involves five spatial relationships (NorthFrom, SouthFrom, EastFrom, WestFrom, SameLocation), whose arguments can be either the blocks, which are referred to by their shapes and colors, or the agent itself. To generate the full set of possible instructions spanned by these relations and our grid objects, we define a formal grammar that generates strings such as:

NorthFrom(Color(‘red’, Shape(‘circle’, SCENE)), Color(‘blue’, Shape(‘square’, SCENE))) (3)

This string carries the meaning 'put a red circle north from (above) a blue square'. In general, when a block is the argument to a relation, it can be referred to by specifying both the shape and the color, as in the example above, or by specifying just one of these attributes. In addition, the AGENT constant can be an argument to all relations, in which case the agent itself must move into a particular spatial relation with an object. Figure 2 shows two examples of GridLU-Relations instructions and their respective goal-states. There are 990 possible instructions in the GridLU-Relations task, and the number of distinct training instances can be loosely lower-bounded (see Appendix E for details).
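To make the combinatorics concrete, a toy enumerator of instruction strings in the spirit of this grammar could look as follows. The vocabulary lists and the pruning rule are our assumptions, so the resulting count will not match the paper's figure of 990 exactly:

```python
import itertools

# Hypothetical vocabularies; the paper's exact sets may differ.
SHAPES = ["circle", "square", "triangle"]
COLORS = ["red", "blue", "green"]
RELATIONS = ["NorthFrom", "SouthFrom", "EastFrom", "WestFrom", "SameLocation"]

def block_references():
    """A block can be referenced by shape, by color, or by both."""
    refs = [f"Shape('{s}', SCENE)" for s in SHAPES]
    refs += [f"Color('{c}', SCENE)" for c in COLORS]
    refs += [f"Color('{c}', Shape('{s}', SCENE))"
             for c in COLORS for s in SHAPES]
    return refs

def instructions():
    """Every relation applied to two distinct arguments, where an
    argument is a block reference or the AGENT constant."""
    args = block_references() + ["AGENT"]
    return [f"{rel}({a}, {b})"
            for rel in RELATIONS
            for a, b in itertools.product(args, repeat=2) if a != b]
```

Even this toy version generates over a thousand strings, which conveys why a hand-written reward function per instruction would not scale.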

Notice that, even for the highly concrete spatial relationships in the GridLU-Relations language, the instructions are underspecified and somewhat ambiguous—is a block in the top-right corner of the grid above a block in the bottom left corner? We therefore decided (arbitrarily) to consider all relations to refer to immediate adjacency (so that Instruction (3) is satisfied if and only if there is a red circle in the location immediately above a blue square). Notice that the commands are still underspecified in this case (since they refer to the relationship between two entities, not their absolute positions), even if the degree of ambiguity in their meaning is less than in many real-world cases. AGILE then has to infer this specific sense of what these spatial relations mean from goal-state examples, while the RL agent is allowed to access our programmed ground-truth reward. The binary ground truth reward (true if the state is a goal state) is also used as the success criterion for evaluating AGILE.
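Under this immediate-adjacency reading, the ground-truth check for Instruction (3) is a straightforward scan over the grid. The dictionary-based grid encoding and the convention that y grows downward are our assumptions:

```python
def north_from_holds(grid, is_top, is_bottom):
    """True iff some block satisfying is_top sits directly above
    (north of) a block satisfying is_bottom. grid maps (x, y) cells
    to block descriptions; y increases downward, so the cell directly
    below (x, y) is (x, y + 1)."""
    return any(
        is_top(block) and is_bottom(grid.get((x, y + 1), ""))
        for (x, y), block in grid.items()
    )
```

A checker of this shape is exactly the privileged reward that the RL baseline receives and that AGILE must instead induce from goal-state examples.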

Having formally defined the semantics of the relationships and programmed a reward function, we compared the performance of AGILE against a vanilla RL baseline (which has privileged access to ground-truth reward) on the GridLU-Relations task. Interestingly, we found that AGILE learned the task more easily than standard A3C, a policy-gradient algorithm that has demonstrated very strong performance on a range of complex RL tasks [18]. We hypothesize this is because the AGILE policy objective of fooling the discriminator is easy at first and becomes more difficult as the discriminator slowly improves. This naturally emerging curriculum expedites learning in the AGILE policy when compared to an A3C policy that only receives signal upon reaching a perfect goal state.

We did observe, however, that the A3C algorithm could be improved significantly by applying the auxiliary task of reward prediction (RP; 13), which was applied to language learning tasks by [11] (see the RL and RL-RP curves in Figure 3). This objective reinforces the association between instructions and states by having the agent replay the states immediately prior to a non-zero reward and predict whether the reward was positive (i.e. the states match the instruction) or not. This mechanism made a significant difference to the A3C performance. AGILE also achieved nearly perfect performance. We found this to be a very promising result, since AGILE has to induce the reward function from a limited set of examples. The best results with AGILE were obtained using a small value of the anticipated negative rate ρ. When we used larger values of ρ, AGILE training started quicker, but after 100-200 million steps the performance started to deteriorate (see the AGILE curves in Figure 3), while it remained stable with the smaller ρ.

Data efficiency

These results suggest that the AGILE discriminator was able to induce a near-perfect reward function from a limited set of (instruction, goal-state) pairs. We therefore explored how small this training set of examples could be while still achieving reasonable performance. We found that, with a training set of only 8000 examples, the AGILE agent could reach a success rate of 60% (massively above chance). However, the optimal performance was achieved with more than 100,000 examples. The full results are available in Appendix D.

AGILE with Structure-Agnostic Models

We report the results for AGILE with a structure-agnostic FiLM-LSTM model in Figure 3 (middle). AGILE with this architecture achieves a high success rate, and notably it trains almost as fast as an RL-RP agent with the same architecture.

Figure 3: Left: learning curves for RL, RL-RP and AGILE with different values of the anticipated negative rate ρ on the GridLU-Relations task. We report success rate (see Section 3.2). Middle: learning curves for RL and AGILE with different model architectures. Right: the discriminator's accuracy for different values of ρ.

Analyzing the Discriminator

We compare the binary reward provided by the discriminator with the ground truth from the environment during training on the GridLU-Relations task. With the best value of ρ, the accuracy of the discriminator peaks at 99.5%. As shown in Figure 3 (right), the discriminator learns faster at the beginning with larger values of ρ but then deteriorates, which confirms our intuition about why ρ is an important hyperparameter and is aligned with the success-rate learning curves in Figure 3 (left). We also observe during training that the false-negative rate is always kept reasonably low (<3% of rewards), whereas the discriminator is initially more generous with false positives (20-50%, depending on ρ, during the first 20M steps of training) and produces an increasing number of false positives for insufficiently small values of ρ (see plots in Appendix E). We hypothesize that early false positives may facilitate the policy's training by providing it with a sort of curriculum, possibly explaining the improvement over vanilla RL shown above.

Discriminator as general reward function

An instruction-following agent should be able to carry out known instructions in a range of different contexts, not just settings that match identically the specific setting in which those skills were learned. To test whether the AGILE agent is robust to (semantically unimportant) changes in the environment dynamics, we first trained it as normal and then modified the effective physics of the world by making all red square objects immovable. In this case, following instructions correctly is still possible in almost all cases, but not all solutions available during training are available at test time. As expected, this change impaired the policy, and the agent's success rate on the instructions referring to a red square dropped substantially. However, after fine-tuning the policy (additional training of the policy on the test episodes using the reward from the previously-trained-then-frozen discriminator), most of the original success rate was recovered (Figure 4).
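The fine-tuning loop can be caricatured with a tabular softmax policy over candidate final states, updated by REINFORCE using only the frozen discriminator's binary verdict as reward. Everything here (the tabular policy, the learning rate, the episode abstraction) is our simplification of the procedure, not the paper's implementation:

```python
import math
import random

def finetune(prefs, frozen_disc, instruction, steps=500, lr=0.5, seed=0):
    """Toy fine-tuning: sample a final state from a softmax over the
    preference table, score it with the frozen discriminator, and apply
    a policy-gradient update. No new goal-state examples are needed."""
    rng = random.Random(seed)
    states = list(prefs)
    for _ in range(steps):
        z = sum(math.exp(prefs[s]) for s in states)
        probs = [math.exp(prefs[s]) / z for s in states]
        chosen = rng.choices(states, weights=probs)[0]
        reward = 1.0 if frozen_disc(instruction, chosen) else 0.0
        for s, p in zip(states, probs):
            grad = (1.0 if s == chosen else 0.0) - p
            prefs[s] += lr * reward * grad  # REINFORCE on a softmax policy
    return prefs
```

The point of the sketch is that the discriminator acts as a drop-in reward function, so the policy can adapt to new dynamics without any new supervision.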

This experiment suggests that the AGILE discriminator learns useful and generalisable linguistic knowledge. The knowledge can be applied to help policies adapt in scenarios where the high-level meaning of commands is familiar but the low-level physical dynamics is not.

Figure 4: Fine-tuning for an immovable red square.

3.4 GridLU-Arrangements Task

The experiments thus far demonstrate that, even without directly using the reward function, AGILE performs comparably to deep RL. However, the principal motivation for the AGILE algorithm is to provide a means of training language-learning agents in cases where a reward function is not available, but where it may be feasible to have humans provide or identify what the world might look like if an instruction were followed correctly. To model this setting more explicitly, we developed the task GridLU-Arrangements, in which each instruction is associated with multiple viable goal-states that share some (more abstract) common form. The complete set of instructions and forms is illustrated in Figure 2. To obtain training data, we built a generator to produce random instantiations (i.e. any translation, rotation, reflection or color mapping of the illustrated forms) of these goal-state classes, as positive examples for the discriminator. In the real world, this process of generating goal-states could be replaced by finding, or having humans annotate, labelled images. In total, there are 36 possible instructions in GridLU-Arrangements, which together refer to a total of 390 million correct goal-states (see Appendix F for details). Despite this enormous space of potentially correct goal-states, we found that for good performance it was sufficient to train AGILE on only 100,000 (less than 0.3%) of these goal-states, sampled from the same distribution as observed in the episodes. To replicate real-world conditions as closely as possible, we did not write a reward function for GridLU-Arrangements (even though it would have been theoretically possible), and instead carried out all evaluation manually.
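The generator's orientation invariance can be captured by enumerating the rotations and reflections of a cell-set and normalizing each to the origin. This helper is our sketch of the idea, not the paper's code:

```python
def arrangement_variants(cells):
    """Return the set of distinct normalized forms of an arrangement
    under the 4 rotations and their reflections (translations are
    factored out by shifting the minimum coordinates to the origin)."""
    def normalize(pts):
        mx = min(x for x, _ in pts)
        my = min(y for _, y in pts)
        return frozenset((x - mx, y - my) for x, y in pts)
    variants = set()
    pts = list(cells)
    for _ in range(4):
        pts = [(y, -x) for x, y in pts]                     # rotate 90 degrees
        variants.add(normalize(pts))
        variants.add(normalize([(-x, y) for x, y in pts]))  # mirror
    return variants
```

Sampling any variant at any grid position (and recoloring it) yields the kind of positive goal-state examples the discriminator is trained on.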

The training regime for GridLU-Arrangements involved two classes of episodes (and instructions). Half of the episodes began with four square blocks (all of the same color) and the agent in random unique positions, and an instruction sampled uniformly from the list of possible arrangement words. In the other half of the episodes, four square blocks of one color and four square blocks of a different color were each initially positioned randomly. The instruction in these episodes specified one of the two colors together with an arrangement word. We started 10 AGILE seeds for each level and selected the best based on how well the policy fooled the discriminator. We then manually assessed the final state of each of 200 evaluation episodes, using human judgement of whether the correct arrangement had been produced as the success criterion. We found that the agent made the correct arrangement in 58% of the episodes. The failure cases were almost always episodes involving eight blocks (the agent succeeded in 92% of four-block episodes but only 24% of eight-block episodes). In these cases, the AGILE agent tended towards building the correct arrangement, but was impeded by the randomly positioned non-target-color blocks and could not recover. Nonetheless, these scores, and the compelling behaviour observed in the video (https://www.youtube.com/watch?v=07S-x3MkEoQ, anonymous account), demonstrate the potential of AGILE for teaching agents to execute semantically vague or underspecified instructions.

4 Related Work

Our work can be categorized as apprenticeship learning, which studies learning to perform tasks from demonstrations and feedback. Many approaches to apprenticeship learning are variants of inverse reinforcement learning (IRL), which recover a reward function from expert demonstrations [1, 32]. Others involve training a policy [15, 26] or a reward function [27, 8] directly from human feedback.

Most closely related to AGILE is generative adversarial imitation learning (GAIL; 12), which trains a reward function and a policy. The former is trained to distinguish between the expert's and the policy's trajectories, while the latter is trained to maximize the modelled reward. GAIL differs from AGILE in a number of important respects. First, AGILE is conditioned on instructions, so a single AGILE agent can learn combinatorially many skills rather than just one. Second, in AGILE the discriminator observes only states (either goal-states from an expert, or states from the agent acting on the environment) rather than complete traces, learning to reward the agent based on "what" needs to be done rather than according to "how" it must be done. Finally, in AGILE the policy's reward is the thresholded probability [D_φ(c, s) > 0.5], as opposed to the log-probability used in GAIL. We considered this change of objective necessary because the GAIL-style reward would take arbitrarily low values for intermediate states visited by the agent, as the discriminator becomes confident that those are not goal-states. The binary reward in AGILE carries a clear message to the policy that all non-goal states are equally undesirable. (We tried values other than 0.5 for the binarization threshold, as well as not binarizing and using D_φ(c, s) directly as the reward; we got similar but slightly worse results.)
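The difference between the two reward shapes is easy to see numerically: for a confidently rejected intermediate state, a GAIL-style log-probability reward is unboundedly negative, while the AGILE reward is simply zero. The function names below are ours, and this is only an illustration of the two formulas:

```python
import math

def gail_style_reward(goal_prob):
    """Log-probability reward in the style of GAIL; it plunges toward
    -inf as the discriminator grows confident the state is not a goal."""
    return math.log(max(goal_prob, 1e-12))

def agile_style_reward(goal_prob):
    """Thresholded AGILE reward: all non-goal states are equally bad."""
    return 1.0 if goal_prob > 0.5 else 0.0
```

Under the log-probability reward, the policy is punished more for some non-goal states than others; under the binary reward, it is only ever steered toward goal states.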

To our knowledge, ours is the first work to apply an IRL-like method to instruction-following and highlight the generalization differences between the reward model (which is the discriminator in our case) and the policy. Several recent imitation learning works considered using goal-states directly for defining the task [10, 21]. AGILE differs from these approaches in that goal-states are only used to explain instructions at training time and instructions alone are used at test time.

Learning to follow language instructions has been approached in many different ways, for example by reinforcement learning using a reward function programmed by a system designer. [14, 20, 11, 6, 9, 31] consider instruction-following in 2D or 3D environments and reward the agent for arriving at the correct location or object. [14] and [17] train RL agents to produce goal-states given instructions. As discussed, these approaches are constrained by the difficulty of programming language-related reward functions, a task that requires a programming expert, detailed access to the state of the environment, and hard choices about how language should map to the world. Agents can also be trained to follow instructions using complete demonstrations, that is, sequences of correct actions describing instruction execution for given initial states. [7, 4] train semantic parsers to produce a formal representation of the query that, when fed to a predefined execution model, matches exactly the sequence of actions from the demonstration. [2, 16] sidestep the intermediate formal representation and train a Conditional Random Field (CRF) and a sequence-to-sequence neural model, respectively, to directly predict the actions from the demonstrations. An underlying assumption behind all these approaches is that the agent and the demonstrator share the same actuation model, which might not always be the case. In the case of navigational instructions, the trajectories of the agent and the demonstrator can sometimes be compared without relying on the actions, as in e.g. [25], but for other types of instructions such a hard-coded comparison may be infeasible. [24] train a log-linear model to map instruction constituents into their groundings, which can be objects, places, state sequences, etc. Their approach requires access to a structured representation of the world environment as well as intermediate supervision for grounding the constituents.

5 Discussion

We have proposed AGILE, an approach to training instruction-following agents from examples of corresponding goal-states rather than explicit reward functions. This opens up new possibilities for training language-aware agents, because in the real world, and even in rich simulated environments [5, 30], acquiring such data via human annotation would often be much more viable than defining and implementing reward functions programmatically. Indeed, programming rewards to teach robust and general instruction-following may ultimately be as challenging as writing a program to interpret language directly, an endeavour that is notoriously laborious [28], and some say, ultimately futile [29].

As well as a means to learn from a potentially more prevalent form of data, our experiments demonstrate that AGILE performs comparably with and can learn as fast as RL with an auxiliary task. Our analysis of the discriminator’s classifications gives a sense of how this is possible; the false positive decisions that it makes early in the training help the policy to start learning. As the policy improves, false negatives can instead cause the discriminator accuracy to deteriorate. We determined a simple method to mitigate this, however, leading to robust training that is comparable to RL with reward prediction and unlimited access to a perfect reward function.

An attractive aspect of AGILE is that learning "what should be done" and "how it should be done" is performed by two different model components. Our experiments confirm that the "what" kind of knowledge generalizes better to different environments. When the dynamics of the environment changed at test time, fine-tuning against a frozen discriminator allowed the policy to recover some of its original capability in the new setting.

It is interesting to consider how AGILE could be applied to more realistic learning settings, for instance involving first-person vision of 3D environments. Two issues would need to be dealt with: training the agent to factor out the difference in perspective between the expert data and the agent's observations, and training the agent to ignore its own body parts if they are visible in the observations. Future work could apply the third-person imitation learning methods recently proposed by Stadie et al. [23] to learn the aforementioned invariances. Most of our experiments were conducted with a formal language with a known structure; however, AGILE also performed very well when we used a structure-agnostic FiLM-LSTM model which processed the instruction as a plain sequence of tokens. This result suggests that in future work AGILE could be used with natural language instructions.


  • Abbeel and Ng [2004] Pieter Abbeel and Andrew Y. Ng. Apprenticeship Learning via Inverse Reinforcement Learning. In Proceedings of the Twenty-first International Conference on Machine Learning, ICML ’04, 2004. URL http://doi.acm.org/10.1145/1015330.1015430.
  • Andreas and Klein [2015] Jacob Andreas and Dan Klein. Alignment-Based Compositional Semantics for Instruction Following. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015.
  • Andreas et al. [2016] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural Module Networks. In Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. URL http://arxiv.org/abs/1511.02799.
  • Artzi and Zettlemoyer [2013] Yoav Artzi and Luke Zettlemoyer. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics, 1:49–62, 2013.
  • Brodeur et al. [2017] Simon Brodeur, Ethan Perez, Ankesh Anand, Florian Golemo, Luca Celotti, Florian Strub, Jean Rouat, Hugo Larochelle, and Aaron Courville. HoME: a Household Multimodal Environment. arXiv:1711.11017 [cs, eess], November 2017. URL http://arxiv.org/abs/1711.11017. arXiv: 1711.11017.
  • Chaplot et al. [2018] Devendra Singh Chaplot, Kanthashree Mysore Sathyendra, Rama Kumar Pasumarthi, Dheeraj Rajagopal, and Ruslan Salakhutdinov. Gated-Attention Architectures for Task-Oriented Language Grounding. In Proceedings of 32nd AAAI Conference on Artificial Intelligence, 2018. URL http://arxiv.org/abs/1706.07230.
  • Chen and Mooney [2011] David L. Chen and Raymond J. Mooney. Learning to Interpret Natural Language Navigation Instructions from Observations. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, pages 859–865, 2011. URL http://dl.acm.org/citation.cfm?id=2900423.2900560.
  • Christiano et al. [2017] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pages 4302–4310, 2017.
  • Denil et al. [2017] Misha Denil, Sergio Gómez Colmenarejo, Serkan Cabi, David Saxton, and Nando de Freitas. Programmable Agents. arXiv:1706.06383 [cs, stat], June 2017. URL http://arxiv.org/abs/1706.06383.
  • Ganin et al. [2018] Yaroslav Ganin, Tejas Kulkarni, Igor Babuschkin, S. M. Ali Eslami, and Oriol Vinyals. Synthesizing Programs for Images using Reinforced Adversarial Learning. arXiv:1804.01118 [cs, stat], April 2018. URL http://arxiv.org/abs/1804.01118.
  • Hermann et al. [2017] Karl Moritz Hermann, Felix Hill, Simon Green, Fumin Wang, Ryan Faulkner, Hubert Soyer, David Szepesvari, Wojciech Marian Czarnecki, Max Jaderberg, Denis Teplyashin, Marcus Wainwright, Chris Apps, Demis Hassabis, and Phil Blunsom. Grounded Language Learning in a Simulated 3D World. arXiv:1706.06551 [cs, stat], June 2017. URL http://arxiv.org/abs/1706.06551.
  • Ho and Ermon [2016] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573, 2016.
  • Jaderberg et al. [2016] Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z. Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement Learning with Unsupervised Auxiliary Tasks. In ICLR, November 2016. URL http://arxiv.org/abs/1611.05397.
  • Janner et al. [2017] Michael Janner, Karthik Narasimhan, and Regina Barzilay. Representation Learning for Grounded Spatial Reasoning. Transactions of the Association for Computational Linguistics, July 2017. URL http://arxiv.org/abs/1707.03938.
  • Knox and Stone [2009] W Bradley Knox and Peter Stone. Interactively shaping agents via human reinforcement: The TAMER framework. In International Conference on Knowledge Capture, pages 9–16, 2009.
  • Mei et al. [2016] Hongyuan Mei, Mohit Bansal, and Matthew R. Walter. Listen, Attend, and Walk: Neural Mapping of Navigational Instructions to Action Sequences. In Proceedings of the AAAI Conference on Artificial Intelligence, 2016. URL http://arxiv.org/abs/1506.04089.
  • Misra et al. [2017] Dipendra Misra, John Langford, and Yoav Artzi. Mapping Instructions and Visual Observations to Actions with Reinforcement Learning. arXiv:1704.08795 [cs], April 2017. URL http://arxiv.org/abs/1704.08795.
  • Mnih et al. [2016] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
  • Ng and Russell [2000] Andrew Y. Ng and Stuart Russell. Algorithms for Inverse Reinforcement Learning. In Proceedings of the 17th International Conference on Machine Learning, pages 663–670. Morgan Kaufmann, 2000.
  • Oh et al. [2017] Junhyuk Oh, Satinder Singh, Honglak Lee, and Pushmeet Kohli. Zero-Shot Task Generalization with Multi-Task Deep Reinforcement Learning. In Proceedings of the 34th International Conference on Machine Learning, June 2017. URL http://arxiv.org/abs/1706.05064.
  • Pathak et al. [2018] Deepak Pathak, Parsa Mahmoudieh, Guanghao Luo, Pulkit Agrawal, Dian Chen, Yide Shentu, Evan Shelhamer, Jitendra Malik, Alexei A. Efros, and Trevor Darrell. Zero-shot visual imitation. arXiv preprint arXiv:1804.08606, 2018.
  • Perez et al. [2017] Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual Reasoning with a General Conditioning Layer. In Proceedings of the AAAI Conference on Artificial Intelligence, 2017. URL http://arxiv.org/abs/1709.07871.
  • Stadie et al. [2017] Bradly C. Stadie, Pieter Abbeel, and Ilya Sutskever. Third-Person Imitation Learning. In ICLR, March 2017. URL http://arxiv.org/abs/1703.01703.
  • Tellex et al. [2011] Stefanie Tellex, Thomas Kollar, Steven Dickerson, Matthew R. Walter, Ashis Gopal Banerjee, Seth Teller, and Nicholas Roy. Understanding Natural Language Commands for Robotic Navigation and Mobile Manipulation. In Twenty-Fifth AAAI Conference on Artificial Intelligence, August 2011. URL https://www.aaai.org/ocs/index.php/AAAI/AAAI11/paper/view/3623.
  • Vogel and Jurafsky [2010] Adam Vogel and Dan Jurafsky. Learning to Follow Navigational Directions. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 806–814. Association for Computational Linguistics, 2010. URL http://dl.acm.org/citation.cfm?id=1858681.1858764.
  • Warnell et al. [2017] Garrett Warnell, Nicholas Waytowich, Vernon Lawhern, and Peter Stone. Deep TAMER: Interactive agent shaping in high-dimensional state spaces. arXiv preprint arXiv:1709.10163, 2017.
  • Wilson et al. [2012] Aaron Wilson, Alan Fern, and Prasad Tadepalli. A Bayesian approach for policy learning from trajectory preference queries. In Advances in Neural Information Processing Systems, pages 1133–1141, 2012.
  • Winograd [1971] Terry Winograd. Procedures as a representation for data in a computer program for understanding natural language. Technical report, Massachusetts Institute of Technology, Project MAC, 1971.
  • Winograd [1972] Terry Winograd. Understanding natural language. Cognitive Psychology, 3(1):1–191, 1972. doi: 10.1016/0010-0285(72)90002-3. URL http://linkinghub.elsevier.com/retrieve/pii/0010028572900023.
  • Wu et al. [2018] Yi Wu, Yuxin Wu, Georgia Gkioxari, and Yuandong Tian. Building Generalizable Agents with a Realistic and Rich 3D Environment. arXiv:1801.02209 [cs], January 2018. URL http://arxiv.org/abs/1801.02209.
  • Yu et al. [2018] Haonan Yu, Haichao Zhang, and Wei Xu. Interactive Grounded Language Acquisition and Generalization in 2D Environment. In ICLR, 2018. URL https://openreview.net/forum?id=H1UOm4gA-&noteId=H1UOm4gA-.
  • Ziebart et al. [2008] Brian D. Ziebart, Andrew Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum Entropy Inverse Reinforcement Learning. In Proc. AAAI, pages 1433–1438, 2008.

Appendix A AGILE Pseudocode

0:  The policy network π_θ, the discriminator network D_φ, the anticipated negative rate ρ, a dataset 𝒟 of (instruction, goal-state) pairs, a replay buffer R, the batch size B, a stream of training instances, the episode length T, the rollout length N.
1:  while Not Converged do
2:     Sample a training instance (c, s_0).
3:     t ← 0
4:     while t < T do
5:        Act with π_θ and produce a rollout (c, s_t), …, (c, s_{t+N}).
6:        Add the (c, s) pairs from the rollout to the replay buffer R. Remove old pairs from R if it is overflowing.
7:        Sample a batch K⁺ of B positive examples from 𝒟.
8:        Sample a batch K⁻ of B/(1−ρ) negative examples from R.
9:        Compute D_φ(c, s) for all (c, s) ∈ K⁻ and reject the 100ρ percent of K⁻ with the highest D_φ(c, s). The resulting K⁻ will contain B examples.
10:        Compute the loss L_φ = −(Σ_{(c,s)∈K⁺} log D_φ(c, s) + Σ_{(c,s)∈K⁻} log(1 − D_φ(c, s))).
11:        Compute the gradient ∂L_φ/∂φ and use it to update φ.
12:        Synchronise θ and φ with other workers.
13:        t ← t + N
14:     end while
15:  end while
Algorithm 1 AGILE Discriminator Training
0:  The policy network π_θ, the discriminator network D_φ, a dataset 𝒟 of (instruction, goal-state) pairs, a replay buffer R, a stream of training instances, the episode length T, the rollout length N.
1:  while Not Converged do
2:     Sample a training instance (c, s_0).
3:     t ← 0
4:     while t < T do
5:        Act with π_θ and produce a rollout (c, s_t), …, (c, s_{t+N}).
6:        Use the discriminator to compute the rewards r_i = [D_φ(c, s_i) > 0.5].
7:        Perform an RL update for θ using the rewards r_i.
8:        Synchronise θ and φ with other workers.
9:        t ← t + N
10:     end while
11:  end while
Algorithm 2 AGILE Policy Training
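The core of the discriminator update (steps 7–11 of Algorithm 1) can be sketched in plain Python. This is a minimal illustration with toy stand-ins: `disc` is any callable scoring (instruction, state) pairs in (0, 1), and plain lists play the role of the dataset and replay buffer; a real implementation would use a neural network and automatic differentiation.

```python
import numpy as np

def discriminator_step(disc, positives, replay_negatives, batch_size, rho):
    """One loss computation following Algorithm 1, steps 7-11.

    Sample batch_size positives from the dataset and batch_size/(1-rho)
    negatives from the replay buffer, reject the rho fraction of the
    negatives that the discriminator scores highest (presumed to be
    unlabelled goal-states), and compute binary cross-entropy on the rest.
    """
    k_pos = positives[:batch_size]
    n_neg = int(round(batch_size / (1.0 - rho)))
    k_neg = replay_negatives[:n_neg]
    scores = np.array([disc(c, s) for c, s in k_neg])
    keep = np.argsort(scores)[:batch_size]      # drop the top-rho scorers
    k_neg = [k_neg[i] for i in keep]
    pos = np.array([disc(c, s) for c, s in k_pos])
    neg = np.array([disc(c, s) for c, s in k_neg])
    loss = -(np.log(pos).sum() + np.log(1.0 - neg).sum())
    return loss, len(k_neg)

# Toy stand-ins: a fixed "discriminator" and hand-made (c, s) pairs.
toy_disc = lambda c, s: 0.9 if s == "goal" else 0.1
positives = [("go north", "goal")] * 4
replay = [("go north", "mid")] * 6 + [("go north", "goal")] * 2
loss, n_kept = discriminator_step(toy_disc, positives, replay,
                                  batch_size=4, rho=0.25)
```

Rejecting the highest-scoring negatives is what realises the anticipated negative rate ρ: the replay-buffer states the discriminator is most confident about are presumed to be unlabelled goal-states and are excluded from the negative batch.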

Appendix B Training Details

We trained the policy and the discriminator concurrently using RMSProp as the optimizer and Asynchronous Advantage Actor-Critic (A3C) [18] as the RL method. A baseline predictor (see Appendix G for details) was trained to predict the discounted return by minimizing the mean squared error. The RMSProp hyperparameters were different for the policy and the discriminator; see Table 1. A designated worker was used to train the discriminator (see Algorithm 1); the other workers trained only the policy (see Algorithm 2). We tried having all workers write to the replay buffer used for the discriminator training and found that this gave the same performance as using pairs produced by the discriminator worker only. We found it crucial to regularize the discriminator by clipping the columns of all weight matrices to have an L2 norm of at most 1. We linearly rescaled the policy’s rewards for both RL and AGILE (see the reward scale in Table 1). When using RL with reward prediction we fetch a batch from the replay buffer and compute the extra gradient for every rollout.
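The column-norm clipping used to regularize the discriminator can be sketched as follows (a minimal numpy sketch; the function name is ours):

```python
import numpy as np

def clip_column_norms(w, max_norm=1.0):
    """Rescale any column of `w` whose L2 norm exceeds `max_norm`."""
    norms = np.linalg.norm(w, axis=0, keepdims=True)
    scale = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
    return w * scale

w = np.array([[3.0, 0.1],
              [4.0, 0.2]])           # column L2 norms: 5.0 and ~0.22
w_clipped = clip_column_norms(w)     # only the first column is rescaled
```

Columns already within the norm bound are left untouched, so the operation is a projection rather than a uniform shrinkage.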

For the exact values of hyperparameters for the GridLU-Relations task we refer the reader to Table 1. The hyperparameters for GridLU-Arrangements were mostly the same, with the exception of the episode length and the rollout length, which were 45 and 30 respectively. For training the RL baseline for GridLU-Relations we used the same hyperparameter settings as for the AGILE policy.

Group Hyperparameter Policy Discriminator
RMSProp learning rate
decay 0.99 0.9
grad. norm threshold 40 25
batch size 1 256
RL rollout length 15
episode length 30
discount 0.99
reward scale 0.1
baseline cost 1.0
reward prediction cost (when used) 1.0
reward prediction batch size 4
num. workers training 15 1
AGILE size of replay buffer 100000
num. workers training 1
Regularization entropy weight
max. column norm 1
Table 1: Hyperparameters for the policy and the discriminator for the GridLU-Relations task.

Appendix C GridLU Environment

The GridLU world is a 5x5 gridworld surrounded by walls. The cells of the grid can be occupied by blocks of 3 possible shapes (circle, triangle, and square) and 3 possible colors (red, blue, and green). The grid also contains an agent sprite. The agent may carry a block; when it does so, the agent sprite changes color (we wanted to make sure that the world state is fully observable, hence the agent’s carrying state is explicitly color-coded). When the agent is free, i.e. when it does not carry anything, it is able to enter cells with blocks. A free agent can pick up a block in the cell where both are situated. An agent that carries a block cannot enter non-empty cells, but it can drop the block that it carries in any empty cell. Both picking up and dropping are realized by the INTERACT action. The other available actions are LEFT, RIGHT, UP, DOWN and NOOP. The GridLU agent can be seen as a cursor (and this is also how it is rendered) that can be moved to select a block or a position where the block should be released. Figure 5 illustrates the GridLU world and its dynamics. We render the state of the world as a color image by displaying each cell as a patch (the relatively high resolution was necessary to let the network discern the shapes) and stitching these patches into a single image (the image is slightly larger than the grid because the walls surrounding the GridLU world are also displayed). All neural networks take this image as input.
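The pick-up/drop dynamics described above can be sketched with a toy environment class (the names and the minimal state representation are our own, not the paper’s implementation):

```python
class GridLUSketch:
    """Toy sketch of the GridLU cursor-agent dynamics."""
    MOVES = {"LEFT": (0, -1), "RIGHT": (0, 1), "UP": (-1, 0), "DOWN": (1, 0)}

    def __init__(self, size=5, blocks=None):
        self.size = size
        self.blocks = dict(blocks or {})  # (row, col) -> block description
        self.agent = (0, 0)
        self.carrying = None

    def step(self, action):
        if action in self.MOVES:
            dr, dc = self.MOVES[action]
            r, c = self.agent[0] + dr, self.agent[1] + dc
            if not (0 <= r < self.size and 0 <= c < self.size):
                return  # blocked by the surrounding walls
            if self.carrying is not None and (r, c) in self.blocks:
                return  # a carrying agent cannot enter non-empty cells
            self.agent = (r, c)
        elif action == "INTERACT":
            if self.carrying is None and self.agent in self.blocks:
                self.carrying = self.blocks.pop(self.agent)  # pick up
            elif self.carrying is not None and self.agent not in self.blocks:
                self.blocks[self.agent] = self.carrying      # drop
                self.carrying = None
        # NOOP (or any unknown action) changes nothing

env = GridLUSketch(blocks={(0, 1): "red circle"})
env.step("RIGHT")     # a free agent may enter the block's cell
env.step("INTERACT")  # pick the block up
env.step("RIGHT")     # carry it one cell east
env.step("INTERACT")  # drop it in the (empty) current cell
```

The single INTERACT action is context-dependent, exactly as described: it picks up when the free agent shares a cell with a block, and drops when the carrying agent stands on an empty cell.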

Figure 5: The dynamics of the GridLU world illustrated by a 6-step trajectory. The order of the states is indicated by arrows. The agent’s actions are written above arrows.

Appendix D Experiment Details

Every experiment was repeated 5 times and the average result is reported.


All agents were trained for steps.

Data Efficiency

We trained AGILE policies with datasets of different sizes in order to measure how many examples of instructions and goal-states are required to learn the semantics of the GridLU-Relations instruction language. For each policy we report the maximum success rate that it showed in the course of training. The results are reported in Figure 6: the AGILE-trained agent succeeds in more than 50% of cases already with the smaller datasets, but the larger datasets are required for the best performance.

Model Selection

We trained the agent for 100M time steps, saving checkpoints periodically, and selected the checkpoint that best fooled the discriminator according to the agent’s internal reward.

Figure 6: Performance of AGILE for different sizes of the dataset of instructions and goal-states. For each dataset size we report the best average success rate over the course of training.

Appendix E Analysis of the GridLU-Relations Task

E.1 GridLU Relations Instance Generator

All GridLU instructions can be generated from <instruction> using the following Backus-Naur form, with one exception: The first expansion of <obj> must not be identical to the second expansion of <obj> in <bring_to_instruction>.

<shape> ::= circle | rect | triangle
<color> ::= red | green | blue

<relation1> ::= NorthFrom | SouthFrom | EastFrom | WestFrom
<relation2> ::= <relation1> | SameLocation

<obj> ::= Color(<color>, <obj_part2>) | Shape(<shape>, SCENE)
<obj_part2> ::= Shape(<shape>, SCENE) | SCENE

<go_to_instruction> ::= <relation2>(AGENT, <obj>) | <relation2>(<obj>, AGENT)
<bring_to_instruction> ::= <relation1>(<obj>, <obj>)
<instruction> ::= <go_to_instruction> | <bring_to_instruction>

There are 15 unique possibilities to expand the nonterminal <obj>, so there are 2 · 5 · 15 = 150 unique possibilities to expand <go_to_instruction> and 4 · 15 · 14 = 840 unique possibilities to expand <bring_to_instruction> (after excluding the exceptions mentioned above). Hence there are 990 unique instructions in total. However, several syntactically different instructions can be semantically equivalent, such as EastFrom(AGENT, Shape(rect, SCENE)) and WestFrom(Shape(rect, SCENE), AGENT).
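The expansion counts implied by the grammar can be checked by enumerating it directly (our own sketch, using plain string templates for the expansions):

```python
from itertools import product

shapes = ["circle", "rect", "triangle"]
colors = ["red", "green", "blue"]
rel1 = ["NorthFrom", "SouthFrom", "EastFrom", "WestFrom"]
rel2 = rel1 + ["SameLocation"]

# <obj> ::= Color(<color>, <obj_part2>) | Shape(<shape>, SCENE)
obj_part2 = [f"Shape({s}, SCENE)" for s in shapes] + ["SCENE"]
objs = [f"Color({c}, {p})" for c, p in product(colors, obj_part2)] \
     + [f"Shape({s}, SCENE)" for s in shapes]

go_to = [f"{r}(AGENT, {o})" for r, o in product(rel2, objs)] \
      + [f"{r}({o}, AGENT)" for r, o in product(rel2, objs)]
# The two <obj> expansions in a bring-to instruction must differ.
bring_to = [f"{r}({a}, {b})" for r, a, b in product(rel1, objs, objs) if a != b]
```

Counting the generated strings recovers 15 object expansions, 150 go-to instructions and 840 bring-to instructions (990 in total).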

Every instruction partially specifies what kind of objects need to be available in the environment. For go-to-instructions we generate one object and for bring-to-instructions we generate two objects according to this partial specification (unspecified shapes or colors are picked uniformly at random). Additionally, we generate one “distractor object”. This distractor object is drawn uniformly at random from the 9 possible objects. All of these objects and the agent are each placed uniformly at random into one of 25 cells in the 5x5 grid.

The instance generator does not sample an instruction uniformly at random from a list of all possible instructions. Instead, it generates the environment at the same time as the instruction according to the procedure above. Afterwards we impose two ‘sanity checks’: no two objects may occupy the same location, and the objects must not all be identical. If either check fails, the instance is discarded and we start over with a new instance.
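The rejection-sampling loop can be sketched as follows; the helper names and the toy generator are our own, not the paper’s code:

```python
import random

def sample_instance(generate, checks, max_tries=1000):
    """Draw instances from `generate` until every sanity check passes."""
    for _ in range(max_tries):
        instance = generate()
        if all(check(instance) for check in checks):
            return instance
    raise RuntimeError("no valid instance found")

def no_overlap(objects):
    positions = [pos for pos, _ in objects]
    return len(set(positions)) == len(positions)

def not_all_identical(objects):
    return len({kind for _, kind in objects}) > 1

# Toy generator: three objects with random cells and colors on a 5x5 grid.
rng = random.Random(0)
generate = lambda: [((rng.randrange(5), rng.randrange(5)), rng.choice("rgb"))
                    for _ in range(3)]
objects = sample_instance(generate, [no_overlap, not_all_identical])
```

Because invalid draws are discarded and regenerated, the accepted distribution differs from the raw generative distribution, which is why the instruction-type frequencies shift as noted below.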

Because of this rejection sampling technique, the proportion of go-to-instructions among accepted instances differs from their share of all possible instructions.

The number of different initial arrangements of three objects can be lower-bounded by (25 · 24 · 23)/3! = 2300 if we disregard their permutation. Hence every bring-to-instruction has at least 2300 associated initial arrangements, and the total number of task instances can be lower-bounded accordingly, disregarding the initial position of the agent.

E.2 Discriminator Evaluation

During the training on GridLU-Relations we compared the predictions of the discriminator with those of the ground-truth reward checker. This allowed us to monitor several performance indicators of the discriminator, see Figure 7.

Figure 7: The discriminator’s errors in the course of training. Left: percentage of false positives. Right: percentage of false negatives.

Appendix F Analysis of the GridLU-Arrangements Task

Instruction Syntax

We used two types of instructions in the GridLU-Arrangements task, those referring only to the arrangement and others that also specified the color of the blocks. Examples Connected(AGENT, SCENE) and Snake(AGENT, Color(’yellow’, SCENE)) illustrate the syntax that we used for both instruction types.

Number of Distinct Goal-States

Table 2 presents our computation of the number of distinct goal-states in the GridLU-Arrangements Task.

Arrangement Possible arrangement positions Possible colors Possible agent positions Possible distractor positions Possible distractor colors Total goal states
Square 16 3 25 5985 2 14,364,000
Line 40 3 25 5985 2 35,910,000
Dline 8 3 25 5985 2 7,182,000
Triangle 48 3 25 5985 2 43,092,000
Circle 9 3 25 5985 2 8,079,750
Eel 48 3 25 5985 2 43,092,000
Snake 48 3 25 5985 2 43,092,000
Connected 200 3 25 5985 2 179,550,000
Disconnected 17 3 25 5985 2 15,261,750
Total 389M
Table 2: Number of unique goal-states in GridLU-Arrangements task.
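The entries in Table 2 are straight products of the per-column counts, which can be verified directly:

```python
# Per-arrangement position counts from Table 2; the remaining factors
# (colors, agent positions, distractor positions and colors) are shared.
arrangement_positions = {
    "Square": 16, "Line": 40, "Dline": 8, "Triangle": 48, "Circle": 9,
    "Eel": 48, "Snake": 48, "Connected": 200, "Disconnected": 17,
}
COLORS, AGENT_CELLS = 3, 25
DISTRACTOR_POSITIONS, DISTRACTOR_COLORS = 5985, 2

goal_states = {
    name: n * COLORS * AGENT_CELLS * DISTRACTOR_POSITIONS * DISTRACTOR_COLORS
    for name, n in arrangement_positions.items()
}
total = sum(goal_states.values())
```

The grand total comes to roughly 389M distinct goal-states, matching the last row of the table.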

Appendix G Models

Figure 8: Our policy and discriminator networks with a Neural Module Network (NMN) as the core component. The NMN’s structure corresponds to an instruction WestFrom(Color(‘red’, Shape(‘rect’, SCENE)), Color(‘yellow’, Shape(‘triangle’, SCENE))). The modules are depicted as blue rectangles. Subexpressions Color(’red’, …), Shape(’rect’, …), etc. are depicted as “red” and “rect” to save space. The bottom left of the figure illustrates the computation of a module in our variant of NMN.

In this section we explain in detail the neural architectures that we used in our experiments. We use ∗ to denote convolution, and ⊕, ⊙ to denote element-wise addition and multiplication of a vector to a 3D tensor with broadcasting (i.e. the same vector is added or multiplied at each location of the feature map). We used ReLU as the nonlinearity in all layers except the LSTM.


We first describe the FiLM-NMN discriminator. The discriminator takes an RGB image as the representation of the state. The image is fed through a stem convnet consisting of a convolution with 16 kernels followed by a convolution with 64 kernels; the resulting 3D feature tensor is the input to the NMN modules described below.

As a Neural Module Network [3], the FiLM-NMN is constructed from modules. The module corresponding to a token takes a left-hand side input and a right-hand side input and performs the following computation with them:


where γ and β are the FiLM coefficients [22] corresponding to the token, and W is a weight tensor for a convolution with 128 input features and 64 output features. Zero-padding is used to ensure that the output of the module has the same shape as its inputs. The equation above describes a binary module that takes two operands. For the unary modules that receive only one input (e.g. the color and shape modules), we present the input as the left-hand operand and zero out the right-hand one. This way we are able to use the same set of weights for all modules. We have 12 modules in total: 3 for color words, 3 for shape words, 5 for relation words, and one module used in go-to instructions. The modules are selected and connected based on the instruction, and the output of the root module is used for further processing. For example, the following computation would be performed for the instruction NorthFrom(Color(‘red’, Shape(‘circle’, SCENE)), Color(‘blue’, Shape(‘square’, SCENE))):


and the following one for NorthFrom(AGENT, Shape(‘triangle’, SCENE)):


Finally, the output of the discriminator is computed by max-pooling the output of the FiLM-NMN across spatial dimensions and feeding it to an MLP with a hidden layer of 100 units:


where W1, b1, W2 and b2 are the weights and biases of the MLP, and σ is the sigmoid function.
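The discriminator head just described (spatial max-pooling followed by a 100-unit MLP and a sigmoid) can be sketched in numpy; the weight shapes here are illustrative:

```python
import numpy as np

def discriminator_head(h, w1, b1, w2, b2):
    """Map an (H, W, C) feature map to a probability in (0, 1)."""
    pooled = h.max(axis=(0, 1))                 # global spatial max-pooling
    hidden = np.maximum(0.0, pooled @ w1 + b1)  # 100-unit ReLU hidden layer
    logit = hidden @ w2 + b2
    return 1.0 / (1.0 + np.exp(-logit))         # sigmoid output

rng = np.random.default_rng(0)
features = rng.standard_normal((5, 5, 64))
# Small random weights keep the toy logit in a numerically safe range.
p = discriminator_head(features,
                       rng.standard_normal((64, 100)) * 0.01, np.zeros(100),
                       rng.standard_normal(100) * 0.01, 0.0)
```

Max-pooling over the spatial dimensions makes the output invariant to where in the grid the instruction-relevant configuration appears.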

The policy network is similar to the discriminator network. The only differences are that (1) it outputs softmax probabilities for 5 actions instead of one real number, and (2) it uses an additional convolutional layer to combine the output of the FiLM-NMN with its other inputs:


the output of which is then used in the policy network.

Figure 8 illustrates our FiLM-NMN policy and discriminator networks.
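The per-module computation (a convolution over the stacked operands followed by the token’s FiLM coefficients and a ReLU) can be sketched in numpy. For simplicity the convolution is reduced to a 1×1 convolution, i.e. a per-location linear map; the shapes and names are illustrative, not the paper’s exact implementation:

```python
import numpy as np

def film_module(h_left, h_right, w, gamma, beta):
    """One NMN module: conv over stacked operands, FiLM, then ReLU.

    h_left, h_right: (H, W, C) operands; for unary modules pass zeros
    as h_right. w: (2C, C_out) 1x1-convolution weights. gamma, beta:
    (C_out,) FiLM coefficients for the module's token, broadcast over
    all spatial locations.
    """
    h = np.concatenate([h_left, h_right], axis=-1)  # stack the operands
    h = h @ w                                       # 1x1 convolution
    return np.maximum(0.0, gamma * h + beta)        # FiLM + ReLU

rng = np.random.default_rng(0)
h_l = rng.standard_normal((5, 5, 4))
h_r = rng.standard_normal((5, 5, 4))
w = rng.standard_normal((8, 6))
out = film_module(h_l, h_r, w, gamma=np.ones(6), beta=np.zeros(6))
```

Because only γ and β depend on the token, a single set of convolutional weights can serve every module, as described above.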


For our structure-agnostic models we use an LSTM with 100 hidden units to predict FiLM biases and multipliers for a 5-layer convnet. More specifically, let h be the final state of the LSTM after it consumes the instruction c. We compute the FiLM coefficients for the k-th layer as follows:


and use them as described by the equation below:


where W_k are the convolutional weights and the input to the first layer is the pixel-level representation of the world state. The padding strategies of the 5 layers were VALID, VALID, SAME, SAME and SAME respectively; layers with VALID padding do not pad their input, whereas in layers with SAME padding zeros are added in order to produce an output with the same shape as the input. Layer 5 is also connected to layer 3 by a residual connection. Similarly to the FiLM-NMN, the output of the convnet is max-pooled and fed into an MLP with 100 hidden units to produce the outputs:


Baseline prediction

In all policy networks the baseline predictor is a linear layer that takes the same input as the softmax layer. The gradients of the baseline predictor are allowed to propagate through the rest of the network.

Reward prediction

We use the result of the max-pooling operation (which is part of all models we considered) as the input to the reward-prediction pathway of our model. It is fed through a linear layer and a softmax to produce the probabilities of the reward being positive or zero (the reward is never negative in AGILE).

Weight Initialization

We use the standard initialisation methods from the Sonnet library (https://github.com/deepmind/sonnet/). Bias vectors are initialised with zeros. Weights of fully-connected layers are sampled from a truncated normal distribution whose standard deviation is determined by N, the number of input units of the layer. Convolutional weights are sampled from a truncated normal distribution whose standard deviation is determined by N, the product of kernel width, kernel height and the number of input features.
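A sketch of this fan-in-scaled truncated-normal initialisation, assuming a standard deviation of 1/sqrt(N) and truncation at two standard deviations (both are assumptions here, matching the common Sonnet defaults as we understand them):

```python
import numpy as np

def truncated_normal(shape, fan_in, rng, num_sigmas=2.0):
    """Sample from N(0, (1/sqrt(fan_in))^2), redrawing any sample that
    falls outside num_sigmas standard deviations (assumed convention)."""
    std = 1.0 / np.sqrt(fan_in)
    w = rng.standard_normal(shape) * std
    while True:
        bad = np.abs(w) > num_sigmas * std
        if not bad.any():
            return w
        w[bad] = rng.standard_normal(int(bad.sum())) * std

rng = np.random.default_rng(0)
# Fully-connected layer: fan_in is the number of input units.
w_fc = truncated_normal((64, 100), fan_in=64, rng=rng)
# Convolutional layer: fan_in is kernel_h * kernel_w * input features.
w_conv = truncated_normal((3, 3, 16, 32), fan_in=3 * 3 * 16, rng=rng)
```

Scaling the standard deviation by the fan-in keeps the variance of each layer’s pre-activations roughly constant at initialisation.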
