Exploiting Language Instructions for Interpretable and Compositional Reinforcement Learning


In this work, we present an alternative approach to making an agent compositional through the use of a diagnostic classifier. Because of the need for explainable agents in automated decision processes, we attempt to interpret the latent space of an RL agent to identify its current objective in a complex language instruction. Results show that the classification process causes changes in the hidden states that make them more easily interpretable, but also causes a shift in zero-shot performance on novel instructions. Lastly, we limit the supervisory signal on the classification, and observe a similar but less notable effect.


1 Introduction

As AI becomes more widespread in the real world, and the drive towards universal AI gains more traction, the need for interpretable and general agents increases. Since more systems are making automated decisions, humans require those systems to explain their behavior, and expect them to work in unknown scenarios.

In this paper, we investigate whether it is possible to train a Reinforcement Learning (RL) agent to operate in a virtual environment while being interpretable in its ‘intentions’, and how its interpretability helps in finding more compositional solutions. More specifically, while training a neural agent to follow some navigation instructions, we require it to spell out what is, at each time-step, its current objective. To accomplish that, we use the recently introduced diagnostic classifier (Hupkes et al., 2018), a linear classifier which assesses the presence of some specific information in a neural network by trying to predict it from its hidden states. In our case, we use it at training time to predict the current objective of the RL agent.

Our approach is inspired by how humans learn. In traditional RL, the objective is defined in terms of a single goal, expressed through some reward function (Sutton and Barto, 2018). When we teach humans to follow instructions, however, not only do we check for accurate execution, but we also make sure that the instruction, usually expressed in natural language, is correctly understood. Are all word meanings in the instruction known? Is it clear how to segment the instruction such that it can be decomposed into sub-tasks, encouraging efficient sub-task separation (Gopalan et al., 2017)? In this paper, we account for some of this extra supervision and measure its impact on learning efficiency.

2 Related Work

2.1 Following language instructions

One of the first attempts at following language instructions is SHRDLU (Winograd, 1971). It was designed to understand natural language by relating to a physical world. However, its apparent success stemmed from handwritten rules in a finite grammar, which is unsustainable in natural language. In attempts to deal with incomplete information, probabilistic methods extract cues from the instruction to improve the agent’s learning capabilities (Kollar et al., 2010; Vogel and Jurafsky, 2010; Tellex et al., 2011; Dzifcak et al., 2009).

Recently, following language instructions has been actively researched, with the introduction of artificial environments (Bisk et al., 2016; Hermann et al., 2017; Wu et al., 2018). The trend is now leaning towards deep reinforcement learning agents, in the hope of fulfilling the promise of generalizable agents that exploit the instruction (Misra et al., 2017; Bahdanau et al., 2018; Yu et al., 2018).

We aim to recover a more human-like learning environment. As humans provide linguistic and non-linguistic cues about how to segment instructions (e.g., during execution, by asking the learner to explain some of her actions), we probe the artificial learner for its focus using information from the language instruction.

2.2 Compositionality

Just as humans exploit the abstract information that a natural-language instruction provides (Werning et al., 2012), artificial agents should intuitively also be able to benefit from the compositionality of language. If the agent is instructed to perform an action on an object that it has never seen before, but does know how to execute the action, it could reuse its knowledge and require less training before it can successfully complete the instruction.

In the context of following navigation commands, Lake and Baroni (2018) introduced the SCAN task, which is designed to test for compositional abilities in neural networks. The authors show how sequence-to-sequence models are generally able to learn navigation commands, but, as soon as they are tested on instructions which require compositional generalization, they fail miserably.

Additionally, intrinsic motivation aids in scaling RL agents through the use of an internal supervisory signal, representing a reward from performing “interesting” actions (e.g., Chentanez et al., 2005; Şimşek and Barto, 2006). These signals, obtained during unsupervised traversal of the environment, are used to help the agent form a set of skills by exploration. Later, these skills can be reused and employed when optimizing for a task. Intrinsic motivation has been successfully applied in cases such as efficient learning with sparse rewards (Pathak et al., 2017), or in the development of an embodied robotics actor (Frank et al., 2014).

Finally, compositionality is also at the core of curriculum learning for RL (e.g., Narvekar et al., 2017; Florensa et al., 2017). The idea is to design and solve a sequence of tasks with increasing complexity and reuse the skills acquired in these tasks to solve the target task. However, when prior knowledge is unavailable, designing the curricula may be as hard as, or even harder than, directly solving the target tasks. It is crucial that the appropriate design is chosen, but experiments show that curriculum learning can be beneficial in scaling the training of RL agents (Wu and Tian, 2016; Gupta et al., 2017).

Our approach can be seen as an instance of curriculum learning where the prior knowledge is the task instruction and the curriculum leverages the sequential structure of the task.

2.3 Understanding black box models

Presently, deep neural networks are mostly black boxes, and creating an understanding of their internal mechanisms remains a shot in the dark. Fortunately, recent work in explainable AI (XAI) attempts to increase the transparency of these models. An overview by Biran and Cotton (2017) distinguishes between two notions of explainability: justification and interpretability. Justifications are reasons for decisions an agent might make, but are not necessarily connected to the workings of the agent itself. This means they can be generated for non-interpretable systems, and require no retraining of the original model. Interpretations, on the other hand, are inherent to the agent, and should reflect how the agent arrived at its decision through its internal workings.

Recent developments for generating interpretations include using t-SNE plots to visualise the latent space of agents (Zahavy et al., 2016; Jaderberg et al., 2018), examining the attention patterns when agents make decisions (Greydanus et al., 2017), including a human in the loop to help a model’s interpretability (Lage et al., 2018) and using ‘diagnostic classifiers’ to decode which specific information is encoded in the network (Hupkes et al., 2018).

In this work, we encourage the agent to develop a more interpretable policy, which, at any time step, is able to report its current objective. Additionally, we investigate the compositionality after training the interpretable policy.

3 Approach

This section describes the environment, model and setup used in the experiments.

3.1 BabyAI game

As a testbed for the learning process we make use of the BabyAI platform (Chevalier-Boisvert et al., 2018), which consists of a grid world environment in which the agent is presented with a structured language instruction. The platform contains different levels, which increase in complexity through a combination of distractors, composite instructions, and sparse rewards. The observation presented to the agent is a 7x7 grid, a 2D representation of the agent’s surroundings. This ego-centric view contains a symbolic representation of objects, walls, doors and their colors. The agent has access to actions such as picking up objects and walking around. The compact representation of the grid world allows for fast processing of the observations.

The instruction is given in the Baby Language, a well-defined subset of the English language, which is simple yet diverse. For all our experiments, we develop customized levels which spin off from the original GoTo level. We choose GoTo because it is the least complex instruction and therefore the easiest to learn. The atomic instruction is formed by selecting a color and object type at random, specifying a target for the agent. Optionally, the modifier twice or thrice can be added to an atomic instruction, much like in the SCAN dataset (Lake and Baroni, 2017). Example atomic instructions include go to the red ball, go to a blue box twice and go to the yellow key thrice.

In the case an agent is instructed to visit an object multiple times, upon arriving at a target object the objects are shuffled around the environment. The agent has to visit the same object respectively one or two more times in order to complete the instruction correctly. To prevent infinite length episodes, every instruction has an associated maximum number of steps, corresponding to the complexity of the instruction.

Atomic instructions are subsequently combined through the use of various task connectors. By means of these operators, a compound instruction can be made consisting of atomic instructions A and B. We consider the following task connectors:

  • Before: Complete A before completing B. If the agent completes instruction B first, the compound instruction fails, and no reward is given.

  • After: Complete A after completing B. If the agent completes instruction A first, the compound instruction fails, and no reward is given.

Besides combining atomic instructions, the connectors apply to complex instructions as well. In this case, the connectors are left-associative.

For example, a compound instruction is go to the blue box twice before go to the yellow key. An overview of all levels considered in this work is given in Table 1, and a visual example of the setup is given in Figure 1.
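The ordering semantics of the connectors can be illustrated with a small parser. This is only a sketch under stated assumptions: the function names `parse_atomic` and `visit_order`, the regular expression, and the way left-associativity is resolved into a flat visiting order are hypothetical and are not taken from the BabyAI code base.

```python
import re

# Hypothetical helper names; this is not the BabyAI implementation.
REPEATS = {"twice": 2, "thrice": 3}
ATOMIC = re.compile(r"go to (?:the|a) (\w+) (\w+)(?: (twice|thrice))?$")


def parse_atomic(text):
    """Return the (color, type) target, repeated as often as the modifier asks."""
    match = ATOMIC.match(text.strip())
    if match is None:
        raise ValueError(f"not an atomic GoTo instruction: {text!r}")
    color, obj_type, modifier = match.groups()
    return [(color, obj_type)] * REPEATS.get(modifier, 1)


def visit_order(instruction):
    """Fold the connectors from left to right into one required visiting order."""
    # re.split with a capturing group keeps the connectors in the result list.
    parts = re.split(r" (before|after) ", instruction)
    order = parse_atomic(parts[0])
    for connector, atomic in zip(parts[1::2], parts[2::2]):
        targets = parse_atomic(atomic)
        if connector == "before":
            # Everything parsed so far must be completed before the new targets.
            order = order + targets
        else:
            # "after": the new targets must be completed first.
            order = targets + order
    return order
```

Under this reading, `visit_order("go to the blue box twice before go to the yellow key")` lists the blue box twice, followed by the yellow key. An episode would then fail as soon as the agent reaches a target that is not the next entry in the list.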

Level name        Connectors      Num. targets   Other
Before            Before          2              None
Mixed-2           Before, After   2              None
Before (repeat)   Before          2              Twice/Thrice modifier
Mixed-3           Before, After   3              None
Table 1: Overview of all levels. The last column denotes any special feature in each level.

3.2 Model

For our base agent, we select the Small BabyAI model, originally introduced by Chevalier-Boisvert et al. (2018). This model combines the language instruction and world representation in an actor-critic architecture (Szepesvári, 2010). The instruction is encoded by a GRU over a fixed vocabulary, after which it is combined with the observation through two FiLM (Perez et al., 2018) layers. The output generated by these layers is passed into an LSTM to allow for temporal feedback connections. Ultimately, the LSTM’s output is used in an actor network to generate actions and a critic network to generate state values. The agent is optimized using Proximal Policy Optimization (PPO, Schulman et al., 2017), a sample-efficient actor-critic approach.
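To make the role of the FiLM layers concrete, the sketch below shows the feature-wise affine modulation that FiLM performs: each feature channel of the observation is scaled and shifted by values computed from the instruction embedding. The helper names `linear` and `film`, and the use of plain Python lists in place of tensors, are assumptions for illustration; the actual model uses learned convolutional features and is not reproduced here.

```python
def linear(weights, bias, x):
    """Plain affine map: one output value per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def film(features, gamma, beta):
    """Scale and shift every value of channel c by gamma[c] and beta[c].

    `features` is a list of channels, each channel a 2D grid (list of rows).
    """
    return [[[gamma[c] * value + beta[c] for value in row] for row in channel]
            for c, channel in enumerate(features)]
```

In such a setup, `gamma` and `beta` would each come from a separate `linear` map applied to the instruction embedding, so that the same observation is processed differently depending on the instruction.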

3.3 Diagnostic classification

As an extension to the base model, the model is made interpretable through the addition of a diagnostic classifier. This classifier is tasked with providing an intuitive explanation of the agent’s behavior when asked, making it more interpretable for humans. It does so by generating, at every time step, the current target for the agent. While this does not directly give a justification for individual moves, it does give an idea of the current focus of the agent. Since we consider complex instructions, there are at least two subtasks to be completed, and through the classification the agent signifies its current objective (e.g., “I’m trying to complete the first subtask”). By means of this extra task, we aim to make the agent aware of the compositional nature of the instruction. The agent now has access to a signal that indicates the separation between two objects in its environment, and it is up to the agent to learn to compose previously learned behavior, and become more efficient.

To create the labels for the diagnostic classifier, we exploit the temporal relation between the subtasks. This way, the agent is trained to visit the objectives in order, and the focus of the agent should follow this same order. The levels contain a fixed set of unique object type/color combinations, as can be generated by the Baby Language. By enumerating all combinations, a mapping to class labels can be created. Subsequently, the labeler takes the language instruction and the current status of visits (e.g., whether the agent has already visited the first target), and uses this mapping to generate a label. Since only the final label, and not the grammar or task status, is exposed to the agent, we avoid providing further external information.
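The labeling procedure can be sketched as follows. The color and type vocabularies, the names `CLASS_INDEX` and `diagnostic_label`, and the representation of the visit status as a count of completed visits are all assumptions made for illustration; the input is simply a list of (color, type) targets in the order in which they must be visited.

```python
from itertools import product

# Assumed vocabulary; the real set is whatever the Baby Language can generate.
COLORS = ["red", "green", "blue", "purple", "yellow", "grey"]
TYPES = ["ball", "box", "key"]

# Enumerate every color/type combination to obtain a fixed class index.
CLASS_INDEX = {combo: i for i, combo in enumerate(product(COLORS, TYPES))}


def diagnostic_label(visit_order, visits_completed):
    """Class index of the target the agent should currently be heading for."""
    if not 0 <= visits_completed < len(visit_order):
        raise ValueError("no pending target for this visit count")
    return CLASS_INDEX[visit_order[visits_completed]]
```

For the instruction shown in Figure 1, the label would be the class of the green ball until the ball is reached, and the class of the green box afterwards.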

Finally, the labels are used to train the diagnostic classifier, which is a linear mapping from the LSTM’s hidden state to the unique object/color combinations. Because of the classification task, a cross-entropy term, weighted by a coefficient, is added to the PPO reward function. This results in Equation 1, which takes the class labels and the classifier’s output probabilities.
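A plausible form of such a combined objective is sketched below. All symbols are assumptions chosen for illustration: $L_{\text{PPO}}$ is taken to be the PPO loss being minimized, $\lambda$ the coefficient, $y_c$ the one-hot class label, $\hat{y}_c$ the classifier’s output probability for class $c$, and $C$ the number of object/color classes. The sign convention assumes that the cross-entropy term enters as a penalty on the minimized quantity.

```latex
L = L_{\text{PPO}} + \lambda \, L_{\text{CE}},
\qquad
L_{\text{CE}} = -\sum_{c=1}^{C} y_c \log \hat{y}_c
```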


Note that this differs from regularized approaches for RL where the regularization term is computed w.r.t. the current policy estimate (e.g., Neu et al., 2017). This regularization term can be interpreted as a form of reward shaping (Ng et al., 1999).

“go to the green ball before go to the green box”

Figure 1: Visual overview of the environment. The light gray area is currently in view for the agent, represented by the red triangle. The green ball and green square are the two objectives. Every small arrow is a future action taken by the agent, which simultaneously provides an object classification. For the white arrows, the correct label is “green ball.” For the blue arrows, the correct label is “green box.”

4 Experiments

Below, four experiments are outlined. The experiments are designed to quantify how the additional classifier in the agent is affecting its interpretability, and to check whether it has impacted the agent’s compositionality.

As a measure of the agent’s performance over time, different metrics are used. These metrics show how proficient an agent is in completing the overall instruction, or how consistently it can complete levels. The following are used:

  • Diagnostic accuracy: The average accuracy of the diagnostic object prediction.

  • Success rate: The fraction of all episodes that end with a positive reward. In other words: the ratio of episodes that ended within the maximum number of steps and did not end in a failure.

  • Episode length: The average number of steps required for the completion of a level. At most, this is the maximum number of steps defined for each level.

  • Failure rate: The ratio of episodes that end with the agent failing a task. Since the connectors impose a temporal ordering, the agent is not allowed to visit the target objects out of order. Similarly, if the agent fails to obey the twice/thrice modifier, it can fail the task by arriving at the next object too early.

  • Timeout rate: The ratio of episodes that end without the agent completing the whole instruction, reaching the maximum number of steps.
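The episode-level metrics above can be computed from a log of finished episodes, as in the following sketch. The `Episode` record and its `outcome` strings are hypothetical, and the episode length is assumed to be averaged over all episodes, including failed and timed-out ones.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    steps: int     # number of environment steps taken
    outcome: str   # "success", "failure", or "timeout"


def summarize(episodes):
    """Aggregate a list of finished episodes into the reported metrics."""
    n = len(episodes)

    def rate(outcome):
        return sum(e.outcome == outcome for e in episodes) / n

    return {
        "success_rate": rate("success"),
        "failure_rate": rate("failure"),
        "timeout_rate": rate("timeout"),
        "episode_length": sum(e.steps for e in episodes) / n,
    }
```

By construction the three rates sum to one, which matches the rows of the source-level results where success, fail and timeout rates are reported side by side.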

Unless otherwise specified, we report the mean and standard deviation over at least three different seeds to account for randomness in the network initialization, the environment generation and the optimization process.

4.1 Diagnostic training

In this initial experiment, we add the diagnostic classifier to the agent (Aware model), and look at differences in how the training of the two models (Baseline and Aware) develops. For the Aware model, we also record the diagnostic classifier’s accuracy during training.

Furthermore, we perform an offline training test to check whether the hidden states of the agent are affected by diagnostic classification. Both the Baseline and Aware converged models are put in inference mode, and run for a fixed number of episodes. For all frames in these episodes, the hidden states and the correct diagnostic target are recorded. Together, they form an offline dataset, which we can use to train a new classifier, identical to the one used in the RL training. We then compare performance of the new classifier trained on the Baseline- vs. Aware-generated datasets.
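The offline test amounts to fitting a linear softmax classifier on recorded (hidden state, target) pairs. A minimal sketch using full-batch gradient descent is given below. The function names, learning rate and number of epochs are assumptions, and plain Python lists stand in for the recorded LSTM states.

```python
import math


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_for(W, b, h):
    return [sum(w * x for w, x in zip(W[c], h)) + b[c] for c in range(len(b))]


def train_linear_probe(states, labels, n_classes, epochs=200, lr=0.5):
    """Fit weights W and biases b of a linear softmax classifier."""
    dim, n = len(states[0]), len(states)
    W = [[0.0] * dim for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        grad_W = [[0.0] * dim for _ in range(n_classes)]
        grad_b = [0.0] * n_classes
        for h, y in zip(states, labels):
            probs = softmax(logits_for(W, b, h))
            for c in range(n_classes):
                # Gradient of the cross-entropy with respect to logit c.
                err = probs[c] - (1.0 if c == y else 0.0)
                grad_b[c] += err
                for j in range(dim):
                    grad_W[c][j] += err * h[j]
        for c in range(n_classes):
            b[c] -= lr * grad_b[c] / n
            for j in range(dim):
                W[c][j] -= lr * grad_W[c][j] / n
    return W, b


def accuracy(W, b, states, labels):
    """Fraction of states whose highest-scoring class equals the label."""
    correct = 0
    for h, y in zip(states, labels):
        scores = logits_for(W, b, h)
        correct += max(range(len(scores)), key=scores.__getitem__) == y
    return correct / len(labels)
```

Training one such probe on states recorded from the Baseline model and another on states from the Aware model, then comparing their accuracies, mirrors the comparison described above.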

4.2 Source-level performance

Here, we observe the performance of the two models on the levels they have been trained on. Since the Aware model has an added task of making its hidden states explainable for a small classifier, convergence might take longer than the Baseline model. Furthermore, the base performance on the two novel complex levels can be examined using the source-level performance.

4.3 Zero-shot generalization

Next, we check whether the Aware agent can use the extra training signal for separating subtasks. By introducing an unseen characteristic to objects in the environment, the agent now has to identify which object it does know, and generalize learned behavior to the unknown object. Being able to isolate single objects in the environment should help the agent in this type of generalization.

Specifically, we consider the following cases:

  • Color: One object’s color is replaced with an unknown color.

  • Object: One object’s type is replaced with an unknown type.

  • ColorObject: One object’s type and color are both replaced with unknowns.

In all cases, we only change a single object in the environment, such that the agent should be able to deduce which object is altered. This aids the agent in completing the given instruction, whereas changing multiple objects could lead to the agent visiting the objects in the wrong order more often.
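The three transfer cases can be sketched as a function that alters exactly one object in a level. The function name, the placeholder values `"unknown_color"` and `"unknown_type"`, and the uniform choice of which object to alter are assumptions; how the unseen color or type is actually represented in the observation is not specified here.

```python
import random

CASES = ("Color", "Object", "ColorObject")


def make_transfer_case(objects, case, rng=None,
                       unknown_color="unknown_color",
                       unknown_type="unknown_type"):
    """Return a copy of `objects` with one (color, type) pair altered."""
    if case not in CASES:
        raise ValueError(f"unknown transfer case: {case!r}")
    rng = rng or random.Random()
    altered = list(objects)
    index = rng.randrange(len(altered))  # only a single object is changed
    color, obj_type = altered[index]
    if case in ("Color", "ColorObject"):
        color = unknown_color
    if case in ("Object", "ColorObject"):
        obj_type = unknown_type
    altered[index] = (color, obj_type)
    return altered, index
```

Because every other object keeps its known color and type, the agent can in principle identify the altered object by elimination.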

4.4 Sparse classification

Lastly, an attempt is made to make the guiding signal more realistic. In the original setting, we ask the agent for its current objective at every timestep in the environment. For a human learner, however, this would intuitively be too frequent, and the question should only be asked occasionally.

Therefore, we lower the frequency of the diagnostic classification. Instead of at every frame, a classification is requested at most three times per game episode. The extra signal is now much weaker: the classifier might need more time to reach convergence, and the feedback on the agent’s hidden states is less dominant. We explore whether this is beneficial to the agent, considering the criteria from before. Because training the agent takes significantly longer, only the Before and Mixed-2 levels are considered.
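One way to realize such a sparse signal is to pick a handful of time steps per episode and apply the classification loss only there. The sketch below assumes the query steps are drawn uniformly at random, which is an illustrative choice and not necessarily how the steps were selected in the experiments; the function names are hypothetical.

```python
import random


def sparse_query_steps(episode_length, max_queries=3, rng=None):
    """Choose at most `max_queries` distinct time steps of one episode."""
    rng = rng or random.Random()
    k = min(max_queries, episode_length)
    return sorted(rng.sample(range(episode_length), k))


def masked_classification_loss(step_losses, query_steps):
    """Average the per-step classification loss over the queried steps only."""
    if not query_steps:
        return 0.0
    return sum(step_losses[t] for t in query_steps) / len(query_steps)
```

All other time steps contribute nothing to the classification term, so the policy is still trained on every frame while the diagnostic signal arrives only occasionally.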

5 Results

Below, we give an overview of the results per experiment. First, the performance of the diagnostic classification task in general is presented. Second, all zero-shot experiments are shown. Finally, we elaborate on observations made during the sparse classification task.

5.1 Diagnostic training

Figure 2: Diagnostic classification accuracy (and standard deviations) of the Baseline and Aware model on all levels.

See Figure 2. The Baseline model only has access to a classifier trained on the offline collected dataset, while the Aware model was evaluated at two different stages: once after training with RL, and once after retraining on the offline dataset.

Across all cases, the Aware model is able to predict the correct objective consistently. In the RL stage, the classifier is successfully trained, which indicates that the agent is still able to converge to a stable optimum. Furthermore, the subsequent difference between the offline trained classifiers shows that the hidden states are positively affected by the training process: the Aware model’s states are better suited for retraining the same classifier using a restricted dataset, and thus are more easily interpretable than the Baseline. Since this dataset is only a fraction of the number of frames that the agent observed during the RL stage, performance is slightly lower. Still, this shows that only a limited dataset is required before a classifier can be trained for the Aware model.

5.2 Source-level performance

Model      Level             Frames   Episode length   Success rate   Fail rate   Timeout
Baseline   Before            3k       10.9             1.00           0.00        0.00
Baseline   Mixed-2           3k       10.6             1.00           0.00        0.00
Baseline   Before (repeat)   15k      50.1             0.71           0.08        0.21
Baseline   Mixed-3           6k       19.4             0.98           0.02        0.01
Aware      Before            3k       11.1             1.00           0.00        0.00
Aware      Mixed-2           3k       10.5             1.00           0.00        0.00
Aware      Before (repeat)   11k      35.7             0.81           0.10        0.09
Aware      Mixed-3           5k       17.7             0.98           0.02        0.01
Table 2: Performance of trained models in the source levels.

See Table 2. In the two simplest levels, there is little difference between the models, as both converge on a seemingly optimal policy. However, on the complex levels, the two models show different behavior. Especially on the Before (repeat) level, the Aware model is able to reach a faster policy. Repeating a subtask is easier if the agent learns to disentangle objects better, and the increase in success rate shows that the Aware model is able to complete the compound instruction more often.

Figure 3: Training progress for the two models over the episode length metric, for two different levels. The dashed line indicates the Baseline, the solid indicates the Aware model. Each line is an average over multiple individual runs.

In Figure 3, the training progress is plotted over the episode length metric for the two simple levels. Even though both models reach the same performance, there is a slight difference in their speed. Rather than taking longer because of the classification task, the Aware model can exploit the additional signal to learn slightly faster.

5.3 Zero-shot generalization

Model      Level             Transfer   Frames   Episode length   Success rate
Baseline   Before            Color      16k      17.4             0.76
Baseline   Before            Object     53k      57.8             0.22
Baseline   Before            ColObj     47k      51.9             0.31
Baseline   Mixed-2           Color      20k      22.4             0.67
Baseline   Mixed-2           Object     70k      77.6             0.12
Baseline   Mixed-2           ColObj     65k      71.6             0.14
Baseline   Before (repeat)   Color      54k      59.9             0.45
Baseline   Before (repeat)   Object     75k      82.9             0.17
Baseline   Before (repeat)   ColObj     69k      75.6             0.28
Baseline   Mixed-3           Color      40k      44.0             0.48
Baseline   Mixed-3           Object     79k      86.8             0.07
Baseline   Mixed-3           ColObj     64k      70.4             0.13
Aware      Before            Color      16k      17.3             0.77
Aware      Before            Object     49k      54.4             0.35
Aware      Before            ColObj     43k      47.6             0.41
Aware      Mixed-2           Color      19k      21.2             0.68
Aware      Mixed-2           Object     53k      58.6             0.25
Aware      Mixed-2           ColObj     47k      40.5             0.40
Aware      Before (repeat)   Color      45k      49.4             0.59
Aware      Before (repeat)   Object     88k      96.0             0.10
Aware      Before (repeat)   ColObj     89k      97.4             0.11
Aware      Mixed-3           Color      41k      45.1             0.54
Aware      Mixed-3           Object     75k      82.4             0.24
Aware      Mixed-3           ColObj     61k      67.5             0.33
Table 3: Performance of a trained model on the source levels, applied in the new transfer learning setting. Here, there are three new scenarios: 1) a novel color, 2) a new type of object, 3) a combination of both.

See Table 3. In this case, we see an improvement for the Aware model in the two simple levels. Both in episode length and in success rate, the Aware model outperforms the baseline. Here, the Mixed-2 level shows a larger difference than the easier Before level. This suggests that some complexity is needed before the agent is able to exploit the language instruction fully.

However, for the complex levels, this difference is less visible, although the Aware model still holds up to the baseline. When presented with only a new color, the Aware agent is significantly faster, but in all other cases performance is comparable. Interestingly, the Aware model fails the whole instruction less often, but instead times out in both levels. This shift in termination reason is most likely due to the agent understanding that the known object in the level should not be visited yet, while failing to identify the unknown object. Upon inspection of the learned policy, the agent is actively avoiding the known object, but does not reach the other object in most cases. This shows that the training procedure did aid the agent in understanding its environment better: previously seen objects are more successfully identified, and the agent seems to know about their visiting order.

5.4 Sparse classification

Figure 4: Diagnostic classification results for the standard and sparse versions of the Aware model. All values are averaged over at least two runs, with a standard deviation under 0.05.

Model      Level     EL     SR
Standard   Before    11.1   1.00
Sparse     Before    10.9   1.00
Standard   Mixed-2   10.6   1.00
Sparse     Mixed-2   10.5   1.00
Table 4: Intra-level results for the standard and sparse versions of the Aware model (EL: episode length; SR: success rate). BA is the Mixed-2 level. The policies learned by all agents were comparable and did not differ significantly over multiple runs.

Model      Level     Transfer   EL     SR
Standard   Before    Color      17.3   0.77
Standard   Before    Object     54.4   0.35
Standard   Before    ColObj     47.6   0.41
Standard   Mixed-2   Color      21.2   0.68
Standard   Mixed-2   Object     58.6   0.25
Standard   Mixed-2   ColObj     40.5   0.40
Sparse     Before    Color      16.1   0.81
Sparse     Before    Object     31.3   0.42
Sparse     Before    ColObj     24.7   0.52
Sparse     Mixed-2   Color      18.3   0.63
Sparse     Mixed-2   Object     50.7   0.33
Sparse     Mixed-2   ColObj     41.0   0.43
Table 5: Zero-shot performance of the sparsely trained models, compared to the standard Aware model.

Lastly, we present the results for the sparse diagnostic classification in Figure 4, Table 4 and Table 5.

In comparison with the Baseline and Aware model, learning a policy for traversing the two simple levels does not take longer and reaches the same optimum as before (see Table 4). This is because the RL agent itself is unaffected by the changes in the classification procedure. However, the impact on the hidden states is considerably lower, as can be seen in Figure 4. Here, the offline retrained classifier is not as easy to train as it is for the standard Aware model. Still, compared to the earlier Baseline results, the sparse classification is able to instigate some changes to the latent space.

In the zero-shot experiments, there are some slight improvements in episode lengths and success rates. The hidden states may now strike a balance between interpretability, as they can be organized by the retrained classifier to a certain degree, and efficiency, as the agent generalizes them to unseen situations.

6 Conclusion

In this paper, we explored the addition of a simple classification task to a complex instruction-following RL problem. Through this addition, the agent was intended to become both more interpretable, and more aware of the compositional nature of the instructions. The results indicate that the agent is able to provide its current objective consistently, while having a minimal impact on the policy itself. Furthermore, these modified agents can be shown to be more general in zero-shot settings, suggesting that the added training signal helps in disentangling objects.

Future research should focus on expanding the level set that the agent was trained and evaluated on. Other types of instructions from the BabyAI environment, such as Pick up or Put next, add more complexity to the task that the agent has to accomplish, and could also benefit from the improvements in object disentanglement. Additionally, adding obstacles such as separate rooms connected by doors, or distractor objects, can interfere with the current setup. These situations form an interesting case for testing the diagnostic classification.

Finally, creating a more explicit hierarchical structure for the agent could make it more efficient in composing learned skills (e.g. Sutton et al., 1999). Such a hierarchical approach could use the training signal to train elementary skills and compose them more efficiently than in the current model.


References

  1. Learning to understand goal specifications by modelling reward.
  2. Explanation and justification in machine learning: a survey. In IJCAI-17 workshop on explainable AI (XAI), Vol. 8, pp. 1.
  3. Natural language communication with robots. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 751–761.
  4. Intrinsically motivated reinforcement learning. In Advances in neural information processing systems, pp. 1281–1288.
  5. BabyAI: first steps towards grounded language learning with a human in the loop. arXiv preprint arXiv:1810.08272.
  6. What to do and how to do it: translating natural language directives into temporal and dynamic logic representation for goal management and action execution. In 2009 IEEE International Conference on Robotics and Automation, pp. 4163–4168.
  7. Reverse curriculum generation for reinforcement learning. In CoRL, Proceedings of Machine Learning Research, Vol. 78, pp. 482–495.
  8. Curiosity driven reinforcement learning for motion planning on humanoids. Frontiers in neurorobotics 7, pp. 25.
  9. Planning with abstract Markov decision processes. In Twenty-Seventh International Conference on Automated Planning and Scheduling.
  10. Visualizing and understanding Atari agents. arXiv preprint arXiv:1711.00138.
  11. Cooperative multi-agent control using deep reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems, pp. 66–83.
  12. Grounded language learning in a simulated 3D world. arXiv preprint arXiv:1706.06551.
  13. Visualisation and ‘diagnostic classifiers’ reveal how recurrent and recursive neural networks process hierarchical structure. Journal of Artificial Intelligence Research 61, pp. 907–926.
  14. Human-level performance in first-person multiplayer games with population-based deep reinforcement learning. arXiv preprint arXiv:1807.01281.
  15. Toward understanding natural language directions. In Proceedings of the 5th ACM/IEEE international conference on Human-robot interaction, pp. 259–266.
  16. Human-in-the-loop interpretability prior. In Advances in Neural Information Processing Systems, pp. 10159–10168.
  17. Generalization without systematicity: on the compositional skills of sequence-to-sequence recurrent networks. arXiv preprint arXiv:1711.00350.
  18. Generalization without systematicity: on the compositional skills of sequence-to-sequence recurrent networks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 2879–2888.
  19. Mapping instructions and visual observations to actions with reinforcement learning. arXiv preprint arXiv:1704.08795.
  20. Autonomous task sequencing for customized curriculum design in reinforcement learning. In IJCAI, pp. 2536–2542.
  21. A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798.
  22. Policy invariance under reward transformations: theory and application to reward shaping. In ICML, Vol. 99, pp. 278–287.
  23. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 16–17.
  24. FiLM: visual reasoning with a general conditioning layer. In Thirty-Second AAAI Conference on Artificial Intelligence.
  25. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  26. An intrinsic reward mechanism for efficient exploration. In Proceedings of the 23rd international conference on Machine learning, pp. 833–840.
  27. Reinforcement learning: an introduction. MIT press.
  28. Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning. Artificial intelligence 112 (1-2), pp. 181–211.
  29. Algorithms for reinforcement learning. Synthesis lectures on artificial intelligence and machine learning 4 (1), pp. 1–103.
  30. Understanding natural language commands for robotic navigation and mobile manipulation. In Twenty-Fifth AAAI Conference on Artificial Intelligence.
  31. Learning to follow navigational directions. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 806–814.
  32. The Oxford handbook of compositionality. Oxford Handbooks in Linguistic.
  33. Procedures as a representation for data in a computer program for understanding natural language. Technical report MASSACHUSETTS INST OF TECH CAMBRIDGE PROJECT MAC.
  34. Building generalizable agents with a realistic and rich 3D environment. arXiv preprint arXiv:1801.02209.
  35. Training agent for first-person shooter game with actor-critic curriculum learning.
  36. Interactive grounded language acquisition and generalization in a 2D world. arXiv preprint arXiv:1802.01433.
  37. Graying the black box: understanding DQNs. In International Conference on Machine Learning, pp. 1899–1908.