Learn How to Cook a New Recipe in a New House: Using Map Familiarization, Curriculum Learning, and Common Sense to Learn Families of Text-Based Adventure Games
We consider the task of learning to play families of text-based computer adventure games, i.e., fully textual environments with a common theme (e.g. cooking) and goal (e.g. prepare a meal from a recipe) but with different specifics; new instances of such games are relatively straightforward for humans to master after a brief exposure to the genre but have been curiously difficult for computer agents to learn. We find that the deep Q-learning strategies that have been successfully leveraged for superhuman performance in single-instance action video games can be applied to learn families of text video games when adopting simple strategies that correlate with human-like learning behavior. Specifically, we build agents that learn to tackle simple scenarios before more complex ones (curriculum learning), that are equipped with the contextualized semantics of BERT (and we demonstrate that this provides a measure of common sense), and that familiarize themselves in an unfamiliar environment by navigating before acting. We demonstrate faster training convergence and improved task completion rates over reasonable baselines.
Building agents able to play text-based adventure games is a useful proxy task for learning open-world goal-oriented problem-solving dialogue agents. Via an alternating sequence of natural language descriptions given by the game and natural language commands given by the player, a player-agent navigates an environment, discovers and interacts with entities, and accomplishes a goal, receiving explicit rewards for doing so. Human players are skilled at text games when they understand the situation they are placed in and can make rational decisions based on their life and game playing experience. For example, in the classic text game Zork Infocom (2001), the adventurer discovers an air pump and an uninflated plastic boat; common sense leads human players to inflate the boat with the pump.
Games such as Zork are very complicated and are designed to be played repeatedly until all the puzzles contained within have been solved; in this way, they are not very similar to real human experiences. Another kind of text game, as exemplified by the TextWorld learning environment Côté et al. (2018) and competition, expects agents to learn a particular task theme (such as rescuing victims from a burning building or preparing a meal) but evaluates on never-before-seen instances of that theme. This is a much more realistic scenario. A person who has never cooked a meal before would no doubt flounder when asked to prepare one. In order to learn to cook, one does not begin by learning to make Coq au Vin, but rather starts simply and works up to more complicated tasks. However, once the cooking skill is learned, one would reasonably expect to be able to prepare a new recipe the first time it is seen. Furthermore, even if the recipe was prepared in a somewhat unfamiliar location (say, the kitchen of a vacation home), a reasonable person would explore the new space, recognize the familiar rooms and elements, and then begin cooking.
In this work, we approach this more-realistic scenario and consider how we might train models to learn to play familiar but unseen text games by adopting a training regimen and knowledge set that mirror human skill acquisition. Specifically, we make the following contributions in our text game agent learning models:
|You find yourself in a kitchen. You make out a fridge. The fridge is empty. You see a cookbook on the table. you see a counter. the counter is vast. on the counter you can make out a knife.|
|You open the cookbook and start reading. ‘Recipe 1: ingredients: red potato. directions: slice the red potato. roast the red potato. prepare meal’|
|take knife from counter|
|You take the knife from the counter.|
|slice red potato with knife|
|You slice the potato. Your score has gone up by 1 point.|
We build agents that can play unseen text-based games, by transferring learned knowledge instead of by simply overfitting on a single trained game.
We show how the proper use of domain-aware curriculum learning strategies can lead to a better learned agent.
We distinguish between universal knowledge (e.g., that cooking can be done in the kitchen) and instance knowledge (e.g., that the kitchen is east of the bedroom); the former can be usefully learned from training data, but the latter cannot. We show how environment familiarization through construction of a knowledge graph improves learning.
We show that the incorporation of a pre-trained contextualized large language model speeds up training convergence. We also demonstrate that this is because it provides external common sense knowledge that otherwise must be learned through trial and error, or not at all.
2 Reinforcement learning for text game models
The influential Deep Q-Network (DQN) approach to learning simple action video games pioneered by Mnih et al. (2015) has motivated research into the limits of this technique when applied to other kinds of games. We follow recent work that ports this approach to text-based games Narasimhan et al. (2015); He et al. (2016); Fulda et al. (2017); Zahavy et al. (2018); Ansari et al. (2018); Kostka et al. (2017); Yuan et al. (2018); Ammanabrolu and Riedl (2018); Yin and May (2019). The core approach of DQN as described by Mnih et al. (2015) is to build a replay memory of partial games with associated scores, and use this to learn a function $Q(s, a)$, where $Q(s, a)$ is the expected reward (a.k.a. Q-value) obtained by choosing action $a$ when in state $s$; from $Q$, choosing $\arg\max_a Q(s, a)$ affords the optimal action policy and this is used at inference time. As in the original work, a key innovation is using the appropriate input to determine the game state; for video games, it is a sequence of images from the game display, while for text games we use a history of system description-player action sequences, which we call a trajectory; an abbreviated example is given in Figure 1. A means of efficiently representing the infinite space of possible states is necessary; most related work uses LSTMs Narasimhan et al. (2015); Ammanabrolu and Riedl (2018); Yuan et al. (2018); Kostka et al. (2017); Ansari et al. (2018), though we follow Zahavy et al. (2018), which uses CNNs, to achieve greater speed in training. The DQN is trained in an exploration-exploitation method ($\epsilon$-search): with probability $\epsilon$, the agent chooses a random action (explores), and otherwise the agent chooses the action that maximizes the DQN function. The hyperparameter $\epsilon$ usually decays from 1 to 0 during the training process.
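The exploration-exploitation rule described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in this work: the function names `linear_epsilon` and `epsilon_greedy`, the dictionary of Q-values, and the linear shape of the decay are all assumptions made for the example.

```python
import random

def linear_epsilon(step, total_steps, start=1.0, end=0.0):
    """Linearly decay epsilon from `start` to `end` over `total_steps` (assumed schedule)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the argmax of Q."""
    actions = list(q_values)
    if rng.random() < epsilon:
        return rng.choice(actions)           # explore
    return max(actions, key=q_values.get)    # exploit

# Hypothetical Q-values for three admissible actions in some state.
q = {
    "take knife from counter": 0.2,
    "slice red potato with knife": 0.9,
    "open fridge": 0.1,
}
print(epsilon_greedy(q, epsilon=0.0))
```

With `epsilon=0.0` the agent always exploits, so the call above returns the highest-valued action; with `epsilon=1.0` it samples uniformly from the admissible actions.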
Much game-learning research is concerned with the optimization of a single game, e.g. applying DQN repeatedly on Pac-Man with the goal of learning to be very good at playing Pac-Man. While this is a realistic goal when strictly limited to the domain of video game play
2.1 Handling unbounded action representations
A consequence of learning to play a game that has not been seen before is that actions not seen in training may be necessary at test time. Vanilla DQNs as introduced by Mnih et al. (2015) are incompatible with this modification; they presume a predefined finite action space and were designed for a space of up to 18 (each of nine joystick directions and a potential button push). Additionally, vanilla DQNs presume no semantic relatedness among actions, while in text games it would make sense for, e.g., open the door to be semantically closer to shut the door than dice the carrot. In our experiments we assume a game’s action set is fully known at inference time but not beforehand, and that actions have some relatedness.
where is a learned weight matrix. In preliminary experiments we found that LSTMs worked better than CNNs on the small and similar actions in our space such as take yellow potato from fridge and dice purple potato.
We use the games released by Microsoft for the ‘First TextWorld Problems’
The games are divided into 222 different types, with 20 games per type. A type is a set of attributes that increase the complexity of a game. These attributes include the number of ingredients, the set of necessary actions, and the number of rooms in the environment. One example of such a type is recipe3 + take3 + open + drop + go9, which indicates that the recipe contains three ingredients and that players need to find and take all three items. In the process of finding these items, there could be doors to open, e.g. the door of a fridge or the door of a room. The agent may also need to drop something in hand before taking another item. Finally, go9 means there are nine different rooms in the game. A constant reward (i.e. one point) is given for each acquisition or proper preparation of a necessary ingredient as well as for accomplishing the goal (preparing the correct recipe). Each game has a different maximum score, so we report aggregate scores as a percentage of achievable points.
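To make the type notation concrete, the following sketch splits a type string into its attributes. The function name `parse_game_type` and the choice to represent count-free attributes such as open and drop as boolean flags are assumptions for illustration; the released games may encode types differently.

```python
import re

def parse_game_type(type_str):
    """Split a type string such as 'recipe3 + take3 + open + drop + go9'
    into a dict mapping attribute name -> count (or True for plain flags)."""
    attrs = {}
    for token in type_str.split("+"):
        token = token.strip()
        match = re.fullmatch(r"([a-z]+)(\d+)?", token)
        if match is None:
            raise ValueError(f"unrecognized attribute: {token!r}")
        name, count = match.groups()
        attrs[name] = int(count) if count else True
    return attrs

example = parse_game_type("recipe3 + take3 + open + drop + go9")
print(example)
```

For the example type, the result records three recipe ingredients, three items to take, the open and drop flags, and nine rooms.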
3.1 Levels of difficulty
Game types naturally cluster into tiers of increasing difficulty. The easiest games take place inside a single room and require only one (tier-1), two (tier-2), or three (tier-3) ingredients. More complicated are the multi-room games; these may have six (tier-4), nine (tier-5), or twelve (tier-6) rooms. Intuitively, it should be very easy to learn a tier-1 game. Adding additional ingredients requires knowing how to prepare each ingredient correctly, and adding additional rooms requires finding the kitchen and other locations. Table 1 contains per-tier details.
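The tier definitions above amount to a small lookup, sketched here. The function name `tier_of` and its two-argument signature are hypothetical; the sketch assumes single-room games are distinguished by ingredient count and multi-room games by room count only, as the paragraph describes.

```python
def tier_of(num_ingredients, num_rooms=1):
    """Map a game's ingredient and room counts to its difficulty tier (1-6)."""
    if num_rooms == 1:
        # Single-room games: the tier equals the number of ingredients.
        if num_ingredients in (1, 2, 3):
            return num_ingredients
        raise ValueError(f"unexpected ingredient count: {num_ingredients}")
    # Multi-room games: the tier depends on the number of rooms.
    rooms_to_tier = {6: 4, 9: 5, 12: 6}
    if num_rooms not in rooms_to_tier:
        raise ValueError(f"unexpected room count: {num_rooms}")
    return rooms_to_tier[num_rooms]

print(tier_of(2), tier_of(3, num_rooms=9))
```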
4.1 Curriculum learning
Correctly training a DQN-like model to play even a single game can take millions of training steps Mnih et al. (2015) due to the need for heavy exploration. If our models are able to learn critical general skills in the early parts of training, they can focus on more fine-grained skills later on. For example, recognizing that the action cook potato with stove matches the cookbook instruction fry potato allows generalization to, e.g., fry eggplant. This skill is needed across all games. More specific skills, like knowing to drop items before picking up other items are less commonly used.
Curriculum learning Bengio et al. (2009) is a good way of structuring our learning to capture core skills first and gradually build in more complicated knowledge. We initially train only with tier-1 training data. After convergence we then use the best model to initialize the model of tier-2, and so on. Because tiers 1–3 differ significantly from tiers 4–6 (the latter involve movement and have more games per tier), we alter our approach slightly as training proceeds. We start training tier-1 with the games of tier-1 only. When we train tier-2, we mix the games of tier-1 and tier-2 in order to make the agent perform well on both tiers. We then mix tier-3 data in. But for tier-4 to tier-6, we only use the data for the specific stage of training, and do not mix in data from previous tiers. For each stage of curriculum learning we initialize $\epsilon$ to 1 and decay it evenly to 0.0001 across a maximum of 2,000,000 steps. In ablation experiments without curriculum learning we instead decay over 10,000,000 steps.
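The staging just described can be summarized as a schedule of which tiers supply training games at each stage, together with the per-stage decay. This is a sketch under assumptions: the names `curriculum_stages` and `stage_epsilon` are hypothetical, and "decay evenly" is read here as a linear schedule.

```python
def curriculum_stages():
    """Return a list of (tier being trained, tiers whose games are sampled)."""
    stages = []
    for tier in range(1, 7):
        if tier <= 3:
            data = list(range(1, tier + 1))  # single-room tiers: mix in earlier tiers
        else:
            data = [tier]                    # multi-room tiers: current tier only
        stages.append((tier, data))
    return stages

def stage_epsilon(step, max_steps=2_000_000, start=1.0, end=0.0001):
    """Epsilon at `step` within one stage, decayed linearly from start to end."""
    frac = min(step, max_steps) / max_steps
    return start + frac * (end - start)

for tier, data in curriculum_stages():
    print(f"stage tier-{tier}: train on tiers {data}")
```

Each stage would be initialized from the best model of the previous stage and would restart the schedule at `stage_epsilon(0)`.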
4.2 Learning universally from local information
Since knowledge like the connection between the behavior of fry and using a stove can be learned from past experience and applied to future scenarios, we call this universal knowledge. Other knowledge that is specific to a particular scenario and not reusable we term instance knowledge. In a specific game from our data set, for example, the player may have to go north to reach the kitchen. However, this will not be the case in general. Thus, naively learning a policy for the action go east given a particular state is likely to be suboptimal. We would like to ensure that training does not overfit by mistaking instance knowledge for universal knowledge.
As it turns out, in the domain we are studying, learning that we must go from the room we are in (generally to reach the kitchen or a room containing missing ingredients) is universal knowledge. A simple way to remove instance knowledge, which we call random-go, is to conflate all actions of the form go direction into a single go action, but then randomly choose a cardinal direction.
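A minimal sketch of random-go follows. The helper names `conflate_go` and `realize` are hypothetical, and the sketch assumes the random direction is drawn from the go actions that are admissible in the current room, which is one plausible reading of "randomly choose a cardinal direction".

```python
import random

def conflate_go(actions):
    """Collapse every 'go <direction>' action into a single abstract 'go' action."""
    out = []
    for action in actions:
        action = "go" if action.startswith("go ") else action
        if action not in out:
            out.append(action)
    return out

def realize(action, admissible, rng=random):
    """Turn the abstract 'go' back into a concrete, randomly chosen direction."""
    if action != "go":
        return action
    options = [a for a in admissible if a.startswith("go ")]
    return rng.choice(options)

admissible = ["go east", "open fridge", "go north"]
print(conflate_go(admissible))
print(realize("go", admissible))
```

The agent scores only the conflated action set, so no direction-specific (instance) knowledge can be attached to a Q-value.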
Since the room we are trying to reach is more universally important than the direction chosen in a particular game, another approach to converting instance to universal knowledge is to augment directions with the name of the room that will be reached before encoding actions. If, in a particular game, the hallway is east of the bedroom, the action go east taken in the bedroom is modified during training to be go east to hallway, enabling the action representation to incorporate the more globally useful room type. At inference time we build a simple knowledge graph with this information by a series of initial random walks.
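The go-room idea can be sketched as two steps: random walks that record which room each direction leads to, and a rewrite of go actions using those records. Everything named here is an assumption for illustration: the `world` dictionary stands in for a real game, and `familiarize` and `augment` are hypothetical helpers, not the code used in this work.

```python
import random

def familiarize(world, start, walks=20, walk_len=10, rng=random):
    """Random-walk the map and record (room, direction) -> destination edges.
    `world[room][direction]` gives the next room and stands in for the game."""
    graph = {}
    for _ in range(walks):
        room = start
        for _ in range(walk_len):
            direction = rng.choice(sorted(world[room]))
            destination = world[room][direction]
            graph[(room, direction)] = destination
            room = destination
    return graph

def augment(action, room, graph):
    """Rewrite 'go east' as 'go east to hallway' when the edge is known."""
    if action.startswith("go "):
        destination = graph.get((room, action[3:]))
        if destination is not None:
            return f"{action} to {destination}"
    return action

# Hypothetical three-room layout.
world = {
    "bedroom": {"east": "hallway"},
    "hallway": {"west": "bedroom", "north": "kitchen"},
    "kitchen": {"south": "hallway"},
}
graph = familiarize(world, "bedroom", rng=random.Random(0))
print(augment("go east", "bedroom", graph))
```

Actions whose destination has not yet been observed are left unchanged, so the agent degrades to plain cardinal directions in unexplored parts of the map.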
4.3 Learning with common sense knowledge
Humans play games both by learning from failures and by using common sense. Common sense knowledge, such as that a closed door should be opened, that it is helpful to light a lamp in a dark dungeon, or that one can fry on a stove, is helpful a priori knowledge that allows agents to learn to play faster. An agent that does not have this knowledge could conceivably, through reward signal and enough random exploration, learn these associations, but humans playing these games will be extremely unlikely to attempt to fry using, e.g., a fridge. We incorporate BERT Devlin et al. (2018), a large pre-trained contextualized language model, in our system, as a source of common sense.
While it is rather controversial to claim that a model trained only to predict missing words in context has common sense akin to that of a human, the fact remains that an adequately fine-tuned BERT has been shown to answer multiple choice questions from the Situations With Adversarial Generations (SWAG) dataset Zellers et al. (2018), among others, at near-human levels. Such ‘weak common sense’ knowledge may be enough for our use case, which also may be expressed as a multiple-choice test given textual context. To save time during training, we use the first layer of BERT as an embedding-level feature extractor, and fine-tune this layer during the learning procedure. In ablation studies we compare this to a randomly initialized simple (non-contextualized) embedding baseline.
5 Experiments and discussion
We hold out a selection of 10% of the games and divide this portion into two separate test sets, each consisting of 222 games, one from each type. We randomly select an additional 400 games as a dev set and keep the remaining games for training. We consider an episode to be a play-through of a game; multiple episodes of each game are run during training, and test scores are taken over a 10-episode run of each game. An episode is run until a loss (an ingredient is damaged or the maximum of 100 steps is reached) or a win (the recipe is completed successfully). Apart from the inherent game reward, we add a negative reward (i.e. punishment) to every step, to encourage more direct gameplay. Also, if the game stops early because of a loss, we set the instant reward to a negative value to penalize the last action.
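The split sizes follow from the counts given in the text (222 types, 20 games per type), as the short calculation below shows. The function name `split_sizes` is hypothetical, and the training-set size is derived here rather than stated in the text.

```python
def split_sizes(num_types=222, games_per_type=20, dev_games=400):
    """Derive the dataset split sizes from the stated counts."""
    total = num_types * games_per_type        # all games
    held_out = total // 10                    # 10% held out for testing
    num_test_sets = held_out // num_types     # one game per type in each test set
    train = total - held_out - dev_games      # remainder is used for training
    return {
        "total": total,
        "held_out": held_out,
        "test_sets": num_test_sets,
        "dev": dev_games,
        "train": train,
    }

print(split_sizes())
```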
During training, we use 50,000 observation steps, 500,000 replay memory entries, and decay $\epsilon$ from 1 to 0.0001 in 10,000,000 steps for training with all games in training data.
From a training run, we select the model with the highest score on the dev set for test inference. We run 10 episodes for each game during the test phase with a nonzero $\epsilon$, allowing for some stochasticity. The maximum total steps of evaluating on one test set is thus $222 \times 10 \times 100 = 222{,}000$. The maximum total score differs across games, since different games can have different maximum scores. We therefore use the percentage of scores and steps as the evaluation criteria in the following sections. The higher the score, the better the agent. A lower percentage of steps means a better policy when scores tie; we show the percentage of wins alongside steps; if steps decrease and wins do not, this indicates an improving policy.
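The evaluation quantities described above can be computed as follows. This is an illustrative sketch: the function names and the `(score, max_score, steps)` tuple layout are assumptions, and the two sample episodes are made-up inputs rather than results.

```python
GAMES_PER_TEST_SET = 222
EPISODES_PER_GAME = 10
MAX_STEPS_PER_EPISODE = 100

def max_total_steps():
    """Upper bound on the number of steps spent evaluating one test set."""
    return GAMES_PER_TEST_SET * EPISODES_PER_GAME * MAX_STEPS_PER_EPISODE

def percentages(episodes):
    """episodes: list of (score, max_score, steps) tuples, one per episode.
    Returns (score %, steps %) aggregated over all episodes."""
    earned = sum(score for score, _, _ in episodes)
    achievable = sum(max_score for _, max_score, _ in episodes)
    steps = sum(n for _, _, n in episodes)
    step_budget = MAX_STEPS_PER_EPISODE * len(episodes)
    return 100.0 * earned / achievable, 100.0 * steps / step_budget

print(max_total_steps())
print(percentages([(3, 4, 20), (1, 4, 100)]))
```

Aggregating earned and achievable points before dividing weights each game by its maximum score, which matches reporting a percentage of total achievable points.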
We use a CNN with 32 of each size-3, 4, 5 convolutional filters, followed by a max-pooling layer. The LSTM action encoder contains 32 units in a single-layer. We use the last LSTM hidden state as the encoded action state. We initialize our models with random word embeddings and position embeddings. We use a fixed embedding size of 64. At every training step, we draw a minibatch of 32 samples and use a learning rate of with the Adam optimizer. We trim trajectories to contain no more than 21 sentences to avoid unnecessarily long concatenated strings.
5.1 Core results
We primarily report results as a percentage of total achievable points on the test sets. Core findings are shown in Table 2. For a simple, training-free baseline, we choose a random action from the set of admissible actions at each state. Our main comparisons are that of curriculum learning (curric) as described in Section 4.1 to the default (mixed), and between the three different approaches to handling instance knowledge as described in Section 4.2. We next take a more in-depth look at the differences in learning behavior.
5.2 Curriculum analysis
Table 3 breaks down the test results ‘mixed go-room’ and ‘curric go-room’ by tier, evaluating after all training is complete. Here we can see that a) curriculum training is generally helpful at every tier, and that b) the ability to reach 100% of score generally decreases by tier. The training behavior of ‘mixed go-room’ is shown in Figure 3. As training proceeds, the total score percentage on dev should go up, and as long as the percentage of wins is not decreasing, the total steps percentage should go down, indicating fewer unnecessary steps. Indeed, this is what we see; the total score gradually increases during training and finally is stable at 54%.
Training graphs for ‘curric go-room’ broken down by tier are shown in Figure 4. For tier-1 (Figure 4(a)) we converge to almost 100% of total score after 140 epochs, which means our agent grasps basic cooking abilities. However, the results of tier-2 (Figure 4(b)) and tier-3 (Figure 4(c)) are flat, indicating that there is minor ingredient confusion that is never resolved. For tiers 4 through 6 (Figures 4(d) to 4(f)), scores generally improve from 40% to roughly 60%, indicating progressive ability to learn to navigate rooms.
5.3 Analysis of universal information conversion
Table 4 breaks down performance of each strategy for dealing with instance information in each tier that requires resolution of this information. It is clear that ‘go-cardinal,’ which does not convert any instance information, is less able to learn than the other methods at any tier. As the number of rooms to navigate grows from tier-4 to tier-6, the random navigation strategy becomes less effective, such that the ‘go-room’ transferring from instance-level cardinal information into universal-level room transition information is the most effective at navigating the large twelve-room games of tier-6.
Table 5 shows that there is a correlation between the most recently trained tier and performance on test data from that tier; we run ‘curric go-room’ but stop after the tier indicated, then subdivide test data per-tier. We see strongest performance on the main diagonal. This is reasonable because the six-room games of tier-4 use the same six rooms each time and so on; the extra rooms of tier-6 aren’t known during tier-4 training, and some decay of tier-4 rooms is observed as learning is rededicated to new rooms. Nevertheless, by training on all tiers we get best overall performance on Test 1.
5.4 Common sense analysis
We replace the randomly initialized simple type-based embedding used heretofore with the lowest layer of pre-trained BERT-base-uncased Devlin et al. (2018) to investigate the a priori knowledge that this large language model brings to aid in game playing. Figure 5 shows the training result on tier-1 with BERT. Compared with the baseline agent, the BERT agent converges to the optimal scores and steps on tier-1 in far fewer training steps. After 260,000 steps of training on tier-1, the baseline model scores 24% on Test 1 and 29% on Test 2, while the BERT-enhanced model scores 34% on Test 1 and 35% on Test 2.
Figure 6 shows the comparison of grasping key knowledge in games of the BERT and the CNN agents; for each of six key cooking verbs, we show the percentage of games in the dev set requiring these verbs where they are used correctly. While both BERT and CNN learn to use take fairly quickly, BERT masters the remaining five much more quickly than the baseline. As both models have the same exploration ability and the contextualized representations afforded by BERT are not particularly useful in the fairly simple language of this genre, we can conclude that equipping our model with BERT gives it sufficient common sense to associate these key verbs with clues that indicate their use placed in the game text.
A major downside of replacing simple 64-dimension type-based embeddings with the contextualized BERT model is that, even though we restrict our usage to a single layer, fine-tuning this model, especially given the long trajectories used as input, becomes cumbersome. An epoch of training using BERT takes 1093 seconds on a Tesla P100, vs. 118 seconds for the baseline. As full curriculum learning through tier-6 on the baseline model on our P100 takes about 2 days per tier, it is not currently practical to fully train using BERT.
|You find yourself in a kitchen. You make out a fridge. The fridge contains a yellow bell pepper. You see a cookbook and a knife on the table.|
|You open the cookbook and start reading. ‘Recipe 1: ingredients: red apple. directions: slice the red apple. prepare meal’|
|You are carrying: a red apple, a red onion.|
|drop red onion.|
|you drop the red onion on the ground.|
|take knife from table.|
|you take the knife from the table.|
|slice red apple with knife.|
|you slice the red apple. your score has just gone up by one point.|
|adding the meal to your inventory. your score has just gone up by one point.|
|you eat the meal. your score has just gone up by one point. you win.|
|cook red apple with stove|
|You fried the red apple. You lost!|
6 Related Work
Many recent works Narasimhan et al. (2015); He et al. (2016); Li et al. (2016); Ansari et al. (2018); Fulda et al. (2017); Côté et al. (2018); Kostka et al. (2017) on building agents of text-based games apply the DQN Mnih et al. (2015) from playing video games or its variants. Different aspects of DQN have been presented, such as action reduction with language correlation Fulda et al. (2017), a bounding method Zahavy et al. (2018), the introduction of a knowledge graph Ammanabrolu and Riedl (2018), text understanding with dependency parsing Yin and May (2019) and an entity relation graph Ammanabrolu and Riedl (2018).
However, previous work is chiefly focused on learning to self-train on games and then do well on the same games, instead of playing unseen games. A rare exception, Yuan et al. (2018) work on generalization of agents on variants of a very simple coin-collecting game. The simplicity of their games enables them to use an LSTM-DQN method with a counting-based reward. Ammanabrolu and Riedl (2018) use a knowledge graph as a persistent memory to encode states, while we use a knowledge graph to make actions more informative. Our work is closely related to task-oriented dialogue studies He et al. (2017); Rajendran et al. (2018); Bordes et al. (2017), though these are generally not directly transferable to our scenario, because they use customized models and rely on training data.
In this paper, we train agents to play a family of text-based games. Instead of repeatedly optimizing on a single game, we train agents to play familiar but unseen games. We show that curriculum learning helps the agent learn better. We convert instance knowledge into universal knowledge via map familiarization. We also show how the incorporation of an external knowledge source (BERT) leads the agent to learn in far fewer epochs.
- occasional stochasticity notwithstanding
- This is itself still a simplification, as many text games allow open text generation and thus infinite action space. Our approach does not preclude abandoning this simplification, but the difficulty of the problem is sufficient to leave this for future work.
- An even more pertinent strategy would be to label directions by their ability to get to key destination rooms, i.e. the kitchen and supermarket, but these strategies would not necessarily transfer well to a new domain.
- Prithviraj Ammanabrolu and Mark O. Riedl. 2018. Playing text-adventure games with graph-based deep reinforcement learning. CoRR, abs/1812.01628.
- Ghulam Ahmed Ansari, Sagar J. P, Sarath Chandar, and Balaraman Ravindran. 2018. Language expansion in text-based games. CoRR, abs/1805.07274.
- Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages 41–48, New York, NY, USA. ACM.
- Antoine Bordes, Y.-Lan Boureau, and Jason Weston. 2017. Learning end-to-end goal-oriented dialog. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.
- Marc-Alexandre Côté, Ákos Kádár, Xingdi Yuan, Ben Kybartas, Tavian Barnes, Emery Fine, James Moore, Matthew J. Hausknecht, Layla El Asri, Mahmoud Adada, Wendy Tay, and Adam Trischler. 2018. Textworld: A learning environment for text-based games. CoRR, abs/1806.11532.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Nancy Fulda, Daniel Ricks, Ben Murdoch, and David Wingate. 2017. What can you do with a rock? affordance extraction via word embeddings. In IJCAI, pages 1039–1045. ijcai.org.
- He He, Anusha Balakrishnan, Mihail Eric, and Percy Liang. 2017. Learning symmetric collaborative dialogue agents with dynamic knowledge graph embeddings. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1766–1776, Vancouver, Canada. Association for Computational Linguistics.
- Ji He, Jianshu Chen, Xiaodong He, Jianfeng Gao, Lihong Li, Li Deng, and Mari Ostendorf. 2016. Deep reinforcement learning with a natural language action space. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1621–1630. Association for Computational Linguistics.
- Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9:1735–80.
- Infocom. 2001. Zork I Manual. http://infodoc.plover.net/manuals/zork1.pdf.
- Bartosz Kostka, Jaroslaw Kwiecien, Jakub Kowalski, and Pawel Rychlikowski. 2017. Text-based adventures of the Golovin AI agent. In CIG, pages 181–188. IEEE.
- Jiwei Li, Will Monroe, Alan Ritter, Dan Jurafsky, Michel Galley, and Jianfeng Gao. 2016. Deep reinforcement learning for dialogue generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202, Austin, Texas. Association for Computational Linguistics.
- Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. 2015. Human-level control through deep reinforcement learning. Nature, 518:529–533.
- Karthik Narasimhan, Tejas Kulkarni, and Regina Barzilay. 2015. Language understanding for text-based games using deep reinforcement learning. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1–11. Association for Computational Linguistics.
- Janarthanan Rajendran, Jatin Ganhotra, Satinder Singh, and Lazaros Polymenakos. 2018. Learning end-to-end goal-oriented dialog with multiple answers. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3834–3843, Brussels, Belgium. Association for Computational Linguistics.
- Xusen Yin and Jonathan May. 2019. Comprehensible context-driven text game playing. CoRR, abs/1905.02265.
- Xingdi Yuan, Marc-Alexandre Côté, Alessandro Sordoni, Romain Laroche, Remi Tachet des Combes, Matthew J. Hausknecht, and Adam Trischler. 2018. Counting to explore and generalize in text-based games. CoRR, abs/1806.11525.
- Tom Zahavy, Matan Haroush, Nadav Merlis, Daniel J. Mankowitz, and Shie Mannor. 2018. Learn what not to learn: Action elimination with deep reinforcement learning. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada., pages 3566–3577.
- Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 93–104, Brussels, Belgium. Association for Computational Linguistics.