Learn How to Cook a New Recipe in a New House: Using Map Familiarization, Curriculum Learning, and Bandit Feedback to Learn Families of Text-Based Adventure Games

Xusen Yin & Jonathan May
Information Sciences Institute / University of Southern California
4676 Admiralty Way, Suite 1001
Marina del Rey, California 90292
{xusenyin,jonmay}@isi.edu
Abstract

We consider the task of learning to play families of text-based computer adventure games, i.e., fully textual environments with a common theme (e.g. cooking) and goal (e.g. prepare a meal from a recipe) but with different specifics; new instances of such games are relatively straightforward for humans to master after a brief exposure to the genre but have been curiously difficult for computer agents to learn. We find that the deep Q-learning strategies that have been successfully leveraged for superhuman performance in single-instance action video games can be applied to learn families of text games when adopting simple strategies that correlate with human-like learning behavior. Specifically, we build agents that learn to tackle simple scenarios before more complex ones using curriculum learning, that familiarize themselves in an unfamiliar environment by navigating before acting, and that explore uncertain environments more thoroughly using multi-armed bandit decision policies. We demonstrate improved task completion rates over reasonable baselines when evaluating on never-before-seen games of that theme.

1 Introduction

Building agents able to play text-based adventure games is a useful proxy task for learning open-world goal-oriented problem-solving dialogue agents. Via an alternating sequence of natural language descriptions given by the game and natural language commands given by the player, a player-agent navigates an environment, discovers and interacts with entities, and accomplishes a goal, receiving explicit rewards for doing so. Human players are skilled at text games when they understand the situation they are placed in and can make rational decisions based on their life and game playing experience. For example, in the classic text game Zork [Lebling, Blank, and Anderson1979], the adventurer discovers an air pump and an uninflated plastic boat; common sense leads human players to inflate the boat with the pump.

Games such as Zork are very complicated and are designed to be played repeatedly until all the puzzles contained within have been solved; in this way, they are not very similar to real human experiences. Another kind of text game, as exemplified by the TextWorld learning environment [Côté et al.2018] and competition, expects agents to learn a particular task theme (such as rescuing victims from a burning building or preparing a meal) but evaluates on never-before-seen instances of that theme in a zero-shot evaluation setting. This is a much more realistic scenario. A person who has never cooked a meal before would no doubt flounder when asked to prepare one. In order to learn to cook, one does not begin by learning to make Coq au Vin, but rather starts simply and works up to more complicated tasks. However, once the cooking skill is learned, one would reasonably expect to be able to prepare a new recipe the first time it is seen. Furthermore, even if the recipe was prepared in a somewhat unfamiliar location (say, the kitchen of a vacation home), a reasonable person would explore the new space, recognize the familiar rooms and elements, and then begin cooking.

In this work, we approach this more-realistic scenario and consider how we might train models to learn to play familiar but unseen text games by adopting a training regimen that mirrors human skill acquisition. We additionally show that, by exploring the search space more thoroughly and evenly by leveraging multi-armed bandit feedback, an agent can reach higher scores in the zero-shot evaluation setting. Specifically, we make the following contributions in our text game agent learning models:

  • We build agents that can play unseen text-based games by transferring learned knowledge, rather than by simply overfitting to a single training game.

  • We show how the proper use of domain-aware curriculum learning strategies can lead to a better learned agent.

  • We draw a distinction between universal knowledge (e.g., that cooking can be done in the kitchen) and instance knowledge (e.g., that the kitchen is east of the bedroom); the former can be usefully learned from training data, but the latter cannot. We show how environment familiarization through construction of a knowledge graph improves learning.

  • We show that incorporating bandit feedback in both training and evaluation leads to a better agent by exploring environments more thoroughly, especially in a zero-shot test with new games.

Master | Player
You find yourself in a kitchen. You make out a fridge. The fridge is empty. You see a cookbook on the table. you see a counter. the counter is vast. on the counter you can make out a knife.
examine cookbook
You open the cookbook and start reading. ‘Recipe 1: ingredients: red potato. directions: slice the red potato. roast the red potato. prepare meal’
take knife from counter
You take the knife from the counter.
slice red potato with knife
You slice the potato. Your score has gone up by 1 point.
Figure 1: Truncated example of dialogue from the First TextWorld Problems challenge, showing a portion of a ‘tier-1’ game as described in Section 3. The concatenation of all master-player sequences constitutes a trajectory as described in Section 2.

2 Reinforcement Learning for Text Games

The influential Deep Q-Network (DQN) approach to learning simple action video games pioneered by Mnih et al. (2015) has motivated research into the limits of this technique when applied to other kinds of games. We follow recent work that ports this approach to text-based games [Narasimhan, Kulkarni, and Barzilay2015, He et al.2016, Fulda et al.2017, Zahavy et al.2018, Ansari et al.2018, Kostka et al.2017, Yuan et al.2018, Ammanabrolu and Riedl2018, Yin and May2019]. The core approach of DQN as described by Mnih et al. (2015) is to build a replay memory of partial games with associated scores, and to use it to learn a function Q(s, a), where Q(s, a) is the expected reward (a.k.a. Q-value) obtained by choosing action a when in state s; given Q, choosing argmax_a Q(s, a) affords the optimal action policy, and this is used at inference time. As in the original work, a key innovation is choosing the appropriate input to represent the game state: for video games, it is a sequence of images (e.g. 4 frames in [Mnih et al.2015]) from the game display, while for text games we use a history of system description-player action sequences, which we call a trajectory; an abbreviated example is given in Figure 1. A means of efficiently representing the unbounded space of states is necessary; most related work uses LSTMs [Narasimhan, Kulkarni, and Barzilay2015, Ammanabrolu and Riedl2018, Yuan et al.2018, Kostka et al.2017, Ansari et al.2018], though we follow [Zahavy et al.2018, Yin and May2019], which use CNNs, to achieve greater speed in training. The DQN is trained with an exploration-exploitation method (ε-greedy): with probability ε, the agent chooses a random action (explores); otherwise the agent chooses the action that maximizes the DQN function (exploits). The hyperparameter ε usually decays from 1 to 0 during training.
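To make the ε-greedy schedule concrete, the sketch below shows linearly decayed ε-greedy action selection over a replay memory. It is a minimal Python illustration; the class, its method names, and the linear decay schedule are our own choices rather than details taken from the implementation described here.

```python
import random
from collections import deque

class EpsilonGreedyAgent:
    """Minimal sketch of epsilon-greedy action selection with linear decay."""

    def __init__(self, q_function, eps_start=1.0, eps_end=0.0,
                 decay_steps=2_000_000, replay_capacity=500_000):
        self.q_function = q_function          # maps (trajectory, action) -> estimated Q-value
        self.eps_start, self.eps_end = eps_start, eps_end
        self.decay_steps = decay_steps
        self.replay = deque(maxlen=replay_capacity)  # replay memory of (s, a, r, s') tuples
        self.step = 0

    def epsilon(self):
        # Linear decay from eps_start to eps_end over decay_steps.
        frac = min(self.step / self.decay_steps, 1.0)
        return self.eps_start + frac * (self.eps_end - self.eps_start)

    def act(self, trajectory, admissible_actions):
        self.step += 1
        if random.random() < self.epsilon():
            return random.choice(admissible_actions)   # explore
        # Exploit: pick the action with the highest estimated Q-value.
        return max(admissible_actions, key=lambda a: self.q_function(trajectory, a))

    def remember(self, state, action, reward, next_state):
        self.replay.append((state, action, reward, next_state))
```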

Much game-learning research is concerned with the optimization of a single game, e.g. applying DQN repeatedly on Pac-Man with the goal of learning to be very good at playing Pac-Man. While this is a realistic goal when strictly limited to the domain of video game play (occasional stochasticity notwithstanding), single-game optimization is rather unsatisfying. It is difficult to tell if a single game-trained model has managed to simply overfit on its target or if it has learned something general about the task it is trying to complete. More concretely, if we consider game playing as a proxy for real-world navigation (in the action game genre) or task-oriented dialogue (in the text genre), it is clear that a properly trained agent should be able to succeed in a new, yet familiar environment. We thus depart from the single-game approach taken by others [Narasimhan, Kulkarni, and Barzilay2015, He et al.2016, Ammanabrolu and Riedl2018, Zahavy et al.2018] and evaluate principally on games that are in the same genre as those seen in training, but that have not previously been played during training.

Figure 2: The architecture of the DRRN model. Trajectories and actions are encoded by a CNN and an LSTM into hidden states and hidden actions, followed by a dense layer to compute the Q-vector. We construct a knowledge graph from trajectories to add information to encoded actions.

2.1 Handling Unbounded Action Representations

A consequence of learning to play a game that has not been seen before is that actions not seen in training may be necessary at test time. Vanilla DQNs as introduced by Mnih et al. (2015) are incompatible with this modification; they presume a predefined finite action space and were designed for a space of at most 18 actions (each of nine joystick directions, with or without a button push). Additionally, vanilla DQNs presume no semantic relatedness among actions, while in text games it would make sense for, e.g., open the door to be semantically closer to shut the door than to dice the carrot. In our experiments we assume a game’s action set is fully known at inference time but not beforehand, and that actions have some relatedness. (This is itself still a simplification, as many text games allow open text generation and thus an infinite action space. Our approach does not preclude abandoning this simplification, but the difficulty of the problem is sufficient to leave this for future work.) We thus represent actions using Deep Reinforcement Relevance Networks (DRRN) (Figure 2) [He et al.2016], a modification of the standard DQN. Actions are encoded via an LSTM [Hochreiter and Schmidhuber1997] and scored against state representations according to this equation:

Q(s, a) = h_s^\top W h_a ,

where W is a learned weight matrix, h_s is the encoded state, and h_a is the encoded action. In preliminary experiments we found that LSTMs worked better than CNNs on the short and similar actions in our space, such as take yellow potato from fridge and dice purple potato.
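The bilinear scoring above can be sketched in a few lines of PyTorch, assuming the trajectory and action encoders already produce fixed-size vectors; the dimensions (96 for states, 32 for actions) and initialization are illustrative rather than the exact configuration used here.

```python
import torch
import torch.nn as nn

class DRRNScorer(nn.Module):
    """Sketch of DRRN-style scoring: Q(s, a) = h_s^T W h_a."""

    def __init__(self, state_dim=96, action_dim=32):
        super().__init__()
        # Learned weight matrix W relating encoded states and encoded actions.
        self.W = nn.Parameter(torch.randn(state_dim, action_dim) * 0.01)

    def forward(self, h_state, h_actions):
        # h_state: (state_dim,) encoded trajectory; h_actions: (num_actions, action_dim).
        # Returns one Q-value per candidate action.
        return h_actions @ (self.W.t() @ h_state)

# Usage: score each admissible action against the current trajectory encoding.
scorer = DRRNScorer()
q_values = scorer(torch.randn(96), torch.randn(5, 32))  # 5 candidate actions
best_action_index = int(q_values.argmax())
```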

3 Games

We use the games released by Microsoft for the ‘First TextWorld Problems’ competition (https://www.microsoft.com/en-us/research/project/textworld). The competition provides 4,440 cooking games generated by the TextWorld framework [Côté et al.2018]. The goal of each game is to prepare a recipe. The action space is simple, yet expressive, and has a fairly large, though domain-limited, vocabulary. A portion of a simple example is shown in Figure 1.

The games are divided into 222 different types, with 20 games per type. A type is a set of attributes that increase the complexity of a game. These attributes include the number of ingredients, the set of necessary actions, and the number of rooms in the environment. One example of such a type is recipe3 + take3 + open + drop + go9, which implies that the game’s recipe contains three ingredients, and that players need to find and take the three corresponding items. In the process of finding these items, there could be doors to open, e.g. the door of a fridge or the door of a room. The agent may also need to drop an item it is holding before taking another. Finally, go9 means there are nine different rooms in the game. A constant reward (i.e. one point) is given for each acquisition or proper preparation of a necessary ingredient as well as for accomplishing the goal (preparing the correct recipe). Each game has a different maximum score, so we report aggregate scores as a percentage of achievable points.
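For illustration, a game type string such as recipe3 + take3 + open + drop + go9 can be parsed into its attributes as below; the attribute names and defaults are our own conventions and are not part of the TextWorld API.

```python
import re

def parse_game_type(type_string):
    """Parse a type string such as 'recipe3 + take3 + open + drop + go9'."""
    attrs = {"recipe": 1, "take": 0, "open": False, "drop": False, "go": 1}
    for token in (t.strip() for t in type_string.split("+")):
        match = re.fullmatch(r"([a-z]+)(\d*)", token)
        if not match:
            continue
        name, count = match.group(1), match.group(2)
        if name in ("recipe", "take", "go"):
            attrs[name] = int(count) if count else 1
        elif name in ("open", "drop"):
            attrs[name] = True
    return attrs

# Three ingredients, three items to take, doors to open, drop needed, nine rooms.
print(parse_game_type("recipe3 + take3 + open + drop + go9"))
```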

3.1 Levels of Difficulty

We divide the game types into six tiers of increasing difficulty. The easiest games take place inside a single room and require only one (tier-1), two (tier-2), or three (tier-3) ingredients. More complicated are the multi-room games; these may have six (tier-4), nine (tier-5), or twelve (tier-6) rooms. Intuitively, it should be very easy to learn a tier-1 game. Adding additional ingredients requires knowing how to prepare each ingredient correctly, and adding additional rooms requires finding the kitchen and other locations. Table 1 contains per-tier details.

tier #ingredients #rooms #games
1 1 1 420
2 2 1 420
3 3 1 420
4 3 6 1040
5 3 9 1040
6 3 12 1040
Table 1: Tiers of games. The tiers are selected by the difficulty level of games. Tier-1 is the simplest, containing only one ingredient in a recipe and one room to explore per game. Tier-6 is the most difficult, including up to three ingredients in a recipe, and twelve rooms to explore per game. The first three tiers only contain one room, which means there need be no go actions involved in these games.

4 Methods

4.1 Curriculum Learning

Correctly training a DQN-like model to play even a single game can take millions of training steps [Mnih et al.2015] due to the need for heavy exploration. If our models are able to learn critical general skills in the early parts of training, they can focus on more fine-grained skills later on. For example, recognizing that the action cook potato with stove matches the cookbook instruction fry potato allows generalization to, e.g., fry eggplant. This skill is needed across all games. More specific skills, like knowing to drop items before picking up others, are less commonly used.

Curriculum learning [Bengio et al.2009] is a good way of structuring our learning to capture core skills first and gradually build in more complicated knowledge. We initially train only with tier-1 training data. After convergence we then use the best model to initialize the model of tier-2, and so on. Because tiers 1–3 differ significantly from tiers 4–6 (the latter involve movement between rooms and have more games per tier), we alter our approach slightly as training proceeds. We start training tier-1 with the games of tier-1 only. When we train tier-2, we mix the games of tier-1 and tier-2 in order to make the agent perform well on both tiers. We then mix tier-3 data in. For tier-4 to tier-6, however, we only use the data for the specific stage of training, and do not mix in data from previous tiers. For each stage of curriculum learning we initialize ε to 1 and decay it evenly to its final value across a maximum of two million steps. In ablation experiments without curriculum learning we instead decay ε over 10 million steps.
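A sketch of this schedule is given below, assuming a user-supplied train_stage callback (hypothetical) that trains on a set of games, warm-starts from a given model, and returns the best model of that stage; the tier-mixing rule follows the description above.

```python
def curriculum_schedule(games_by_tier):
    """Yield (tier, training games) stages following the curriculum above.

    games_by_tier: dict mapping tier number (1-6) to a list of game files.
    Tiers 1-3 accumulate games from earlier tiers; tiers 4-6 use only their own games.
    """
    for tier in range(1, 7):
        if tier <= 3:
            stage_games = [g for t in range(1, tier + 1) for g in games_by_tier[t]]
        else:
            stage_games = list(games_by_tier[tier])
        yield tier, stage_games

def train_with_curriculum(agent, games_by_tier, train_stage, max_stage_steps=2_000_000):
    """Run each curriculum stage, warm-starting from the previous stage's best model."""
    best_model = None
    for tier, stage_games in curriculum_schedule(games_by_tier):
        # Each stage restarts epsilon at 1 and decays it over at most max_stage_steps.
        best_model = train_stage(agent, stage_games,
                                 init_model=best_model,
                                 eps_start=1.0,
                                 decay_steps=max_stage_steps)
    return best_model
```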

4.2 Learning Universally from Local Information

Since knowledge like the connection between the behavior of fry and using a stove can be learned from past experience and applied to future scenarios, we call this universal knowledge. Other knowledge that is specific to a particular scenario and not reusable we term instance knowledge. In a specific game from our data set, for example, the player may have to go north to reach the kitchen. However, this will not be the case in general. Thus, naively learning a policy for the action go north given a particular state is likely to be sub-optimal. We would like to ensure that training does not overfit, by turning instance knowledge into universal knowledge.

As it turns out, in the domain we are studying, learning that we must go from the room we are in (generally to reach the kitchen or a room containing missing ingredients) is universal knowledge. A simple way to remove instance knowledge, which we call random-go, is to conflate all actions of the form go direction into a single go action, but then randomly choose a cardinal direction.

Since the room we are trying to reach is more universally important than the direction chosen in a particular game, another approach to converting instance to universal knowledge is to augment directions with the name of the room that will be reached before encoding actions. If, in a particular game, the bedroom is east of the hallway, the action go east issued from the hallway is modified during training to be go east to bedroom, enabling the action representation to incorporate the more globally useful room-type context. At inference time we build a simple knowledge graph with this information via a series of initial random walks.
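A sketch of the map familiarization and action augmentation steps follows; the game_env methods (current_room, available_directions, go) are hypothetical placeholders rather than the TextWorld API, and the walk counts are arbitrary.

```python
import random

def explore_map(game_env, num_walks=10, walk_length=20):
    """Build a simple map graph {(room, direction): destination} by random walks.

    game_env is assumed to expose current_room(), available_directions(), and
    go(direction); these method names are illustrative only.
    """
    graph = {}
    for _ in range(num_walks):
        game_env.reset()
        for _ in range(walk_length):
            room = game_env.current_room()
            direction = random.choice(game_env.available_directions())
            game_env.go(direction)
            graph[(room, direction)] = game_env.current_room()
    return graph

def augment_go_action(action, current_room, graph):
    """Rewrite 'go east' as 'go east to <destination>' when the map graph knows it."""
    if action.startswith("go "):
        direction = action.split()[1]
        destination = graph.get((current_room, direction))
        if destination is not None:
            return f"{action} to {destination}"
    return action
```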

4.3 Learning and Evaluation with Uncertainty

A DQN agent tends to repeat one or a few actions because of learned sub-optimal policies [Yin and May2019]. Since the policy is derived from the function Q(s, a), the problem can also be aggravated by bias in the state and action representations learned by encoding trajectories and actions with neural networks. Estimates for infrequently seen state-action pairs may have high variance, which leads to the selection of incorrect Q-values and hence sub-optimal policies. The phenomenon becomes more severe in the zero-shot evaluation setting, especially when encoding long unseen trajectories.

The ε-greedy method for exploration and exploitation that is widely used with DQNs is thus unable to solve this problem. At inference time, repeated actions that are essentially randomly chosen can have dire results. For example, deciding to cook an ingredient that has already been cooked destroys that ingredient and causes game failure. We call such actions dangerous actions. The small ε used at inference time in many works [Narasimhan, Kulkarni, and Barzilay2015, Yin and May2019, Zahavy et al.2018, Yuan et al.2018] gives too little chance of jumping out of these loops, but a large ε can lead to direct failure by choosing dangerous actions. A more nuanced approach is needed.

We instead model the uncertainty of choosing actions by employing multi-armed bandit feedback in both the training and evaluation phases. We handle the two phases in different ways, depending on whether the state representation changes or not. Zahavy et al. (2018) use the linear upper confidence bound (LinUCB) algorithm [Auer2003, Abe, Biermann, and Long2003, Abbasi-yadkori, Pál, and Szepesvári2011] to learn action elimination signals that delete inadmissible actions during the DQN training phase. We use the same LinUCB algorithm with two major differences from Zahavy et al. (2018): first, we apply LinUCB to directly predict Q-values, while they apply it to reduce the action space; second, we only use LinUCB during the evaluation phase, since LinUCB requires the encoded states to be unchanged. Zahavy et al. (2018), in contrast, use a batch-update framework in order to use LinUCB in the training phase.

More specifically, we compute a confidence bound for each game. We assume that the Q-value of each action a is a linear function of the encoded state s_t plus some noise η_t drawn from an R-sub-Gaussian distribution with mean 0, i.e.

Q_t(a) = \theta_a^\top s_t + \eta_t ,

where θ_a is an unknown weight vector with ‖θ_a‖₂ ≤ S and ‖s_t‖₂ ≤ L as upper bounds. The covariance matrix is defined such that

\bar{V}_{t,a} = \lambda I + \sum_{i=1}^{t-1} s_i s_i^\top .

By solving for θ_a with ridge regression we can say that, with probability 1 − δ, the confidence bound for action a at step t is

\hat{\theta}_a^\top s_t + \alpha_t \sqrt{s_t^\top \bar{V}_{t,a}^{-1} s_t} ,

where

\alpha_t = R \sqrt{d \log\left(\frac{1 + t L^2 / \lambda}{\delta}\right)} + \lambda^{1/2} S .

Then we choose an action at step t by

a_t = \operatorname*{argmax}_a \left( \hat{\theta}_a^\top s_t + \alpha_t \sqrt{s_t^\top \bar{V}_{t,a}^{-1} s_t} \right) .
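For concreteness, a per-action LinUCB estimator over fixed encoded states might look like the numpy sketch below; here the exploration coefficient alpha is treated as a single tunable constant standing in for the full theoretical bound, which is a simplification of the procedure described above.

```python
import numpy as np

class LinUCBAction:
    """Per-action LinUCB estimator over fixed encoded states (evaluation-time only)."""

    def __init__(self, state_dim, ridge_lambda=1.0, alpha=1.0):
        self.A = ridge_lambda * np.eye(state_dim)  # regularized design (covariance) matrix
        self.b = np.zeros(state_dim)               # accumulated reward-weighted states
        self.alpha = alpha                         # exploration coefficient

    def ucb(self, state):
        A_inv = np.linalg.inv(self.A)
        theta_hat = A_inv @ self.b                 # ridge-regression estimate of theta_a
        bonus = self.alpha * np.sqrt(state @ A_inv @ state)
        return float(theta_hat @ state + bonus)

    def update(self, state, reward):
        self.A += np.outer(state, state)
        self.b += reward * state

def choose_action(state, estimators):
    """Pick the action whose LinUCB upper confidence bound is highest."""
    return max(estimators, key=lambda a: estimators[a].ucb(state))
```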

Since the encoded states change during training, and our goal is to make predictions on new games with different actions, we use a simpler method than LinUCB during the training phase. At every episode, we count the frequency with which each action is used at each step, and then penalize Q-values according to this frequency. Our intuition is that less frequently used actions should be oversampled in order to explore the uncertain environment more thoroughly. In practice, we apply this method in an ε-greedy manner.
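A sketch of this frequency-penalized ε-greedy selection is shown below; the linear penalty on per-episode action counts is an illustrative choice, since the text specifies only that Q-values are penalized according to action frequency.

```python
import random
from collections import Counter

def bandit_adjusted_action(q_values, admissible_actions, action_counts,
                           epsilon=0.1, penalty_weight=0.1):
    """Pick an action from Q-values penalized by per-episode usage counts.

    q_values: dict mapping each action to its estimated Q-value for the current state.
    action_counts: Counter of how often each action has been used in this episode.
    """
    if random.random() < epsilon:
        choice = random.choice(admissible_actions)
    else:
        choice = max(admissible_actions,
                     key=lambda a: q_values[a] - penalty_weight * action_counts[a])
    action_counts[choice] += 1
    return choice

# Usage within one episode:
counts = Counter()
action = bandit_adjusted_action({"go east": 0.4, "open fridge": 0.35},
                                ["go east", "open fridge"], counts)
```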

5 Experiments and Discussion

We hold out a selection of 10% of the games and divide this portion into two separate test sets, each consisting of 222 games, one from each type. We randomly select an additional 400 games as a dev set and keep the remaining games for training. We consider an episode to be a play-through of a game; multiple episodes of each game are run during training, and test scores are taken over a 10-episode run of each game. An episode runs until a loss (an ingredient is damaged or the maximum of 100 steps is reached) or a win (the recipe is completed successfully). Apart from the inherent game reward, we add a small negative reward (i.e. a punishment) to every step to encourage more direct gameplay. Also, if the game stops early because of a loss, we set the instant reward of the last action to a negative value to penalize it.
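The reward shaping described above can be summarized in a small helper; the penalty magnitudes here are placeholders, since the exact values are not reproduced in this text.

```python
def shaped_reward(game_reward, done, lost, step_penalty=-0.1, loss_penalty=-1.0):
    """Add a per-step penalty, and replace the last reward with a penalty on a loss.

    step_penalty and loss_penalty are illustrative placeholder values.
    """
    reward = game_reward + step_penalty
    if done and lost:
        reward = loss_penalty
    return reward
```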

During training, we use 50,000 observation steps, a replay memory of 500,000 entries, and decay ε from 1 to its final value over 10 million steps when training with all games in the training data.

From a training run, we select the model with the highest dev-set score for test inference. We run 10 episodes for each game during the test phase with a small nonzero ε, allowing for some stochasticity. The maximum total number of steps when evaluating on one test set is thus 222 × 10 × 100 = 222,000. The maximum total score is not fixed, since different games can have different maximum scores. We therefore use percentages of achievable score and of maximum steps as the evaluation criteria in the following sections. The higher the score percentage, the better the agent. When scores tie, a lower percentage of steps indicates a better policy; we show the percentage of wins alongside steps, and if steps decrease while wins do not, this indicates an improving policy.
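The aggregate evaluation criteria can be computed as in the following sketch; the dictionary layout of per-episode results is our own convention.

```python
def evaluate(results, max_episode_steps=100):
    """Aggregate per-episode results into score%, steps%, and wins%.

    results: list of dicts with keys 'score', 'max_score', 'steps', 'won',
    one entry per episode (10 episodes per game at test time).
    """
    total_score = sum(r["score"] for r in results)
    max_score = sum(r["max_score"] for r in results)
    total_steps = sum(r["steps"] for r in results)
    max_steps = max_episode_steps * len(results)
    wins = sum(1 for r in results if r["won"])
    return {"score%": 100 * total_score / max_score,
            "steps%": 100 * total_steps / max_steps,
            "wins%": 100 * wins / len(results)}
```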

We use a CNN with 32 convolutional filters of each of sizes 3, 4, and 5, followed by a max-pooling layer. The LSTM action encoder contains 32 units in a single layer, and we use the last LSTM hidden state as the encoded action. We initialize our models with random word embeddings and position embeddings, with a fixed embedding size of 64. At every training step, we draw a minibatch of 32 samples and train with the Adam optimizer at a fixed learning rate. We trim trajectories to contain no more than 11 sentences to avoid unnecessarily long concatenated strings.
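Under the stated hyperparameters, the two encoders can be sketched in PyTorch as follows; the vocabulary size is a placeholder, and position embeddings are omitted for brevity.

```python
import torch
import torch.nn as nn

class TrajectoryEncoder(nn.Module):
    """CNN trajectory encoder: 32 filters each of widths 3, 4, 5, then max-pooling."""

    def __init__(self, vocab_size=10_000, embed_dim=64, num_filters=32):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, kernel_size=k) for k in (3, 4, 5)])

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) trajectory tokens, trimmed to ~11 sentences.
        x = self.embedding(token_ids).transpose(1, 2)   # (batch, embed_dim, seq_len)
        pooled = [conv(x).max(dim=2).values for conv in self.convs]
        return torch.cat(pooled, dim=1)                 # (batch, 3 * num_filters)

class ActionEncoder(nn.Module):
    """Single-layer LSTM action encoder; the last hidden state is the action vector."""

    def __init__(self, vocab_size=10_000, embed_dim=64, hidden=32):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)

    def forward(self, token_ids):
        _, (h_n, _) = self.lstm(self.embedding(token_ids))
        return h_n[-1]                                  # (batch, hidden)
```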

5.1 Core Results

Experiment Score %
Test 1 Test 2
random action 14 14
curric go-cardinal 50 52
curric go-random 55 57
curric go-room 55 58
mixed go-room 50 54
fine-tuning 64 64
fine-tuning & LinUCB 71 67
Table 2: Core overall results on unseen games of various difficulty levels. The random action baseline gives predictably poor results. Casting directions in terms of the room destination (go-room) generalizes better than learning specific cardinal directions (go-cardinal), but the alternative of picking a direction at random (go-random) appears surprisingly competitive. Using curriculum learning (curric) is preferred to training with all games simultaneously (mixed). Fine-tuning with bandit feedback and evaluation with LinUCB can further improve scores by more thorough exploration.

We primarily report results as a percentage of total achievable points on the test sets. Core findings are shown in Table 2. For a simple, training-free baseline, we choose a random action from the set of admissible actions at each state. Our main comparisons are that of curriculum learning (curric) as described in Section 4.1 to the default (mixed), and between the three different approaches to handling instance knowledge as described in Section 4.2. The fine-tuning with bandit feedback and LinUCB methods are described in Section 4.3. We next take a more in-depth look at the differences in learning behavior.

5.2 Curriculum Analysis

Figure 3: The training process of ‘mixed go-room’ (Table 2); all 3,596 training games without curriculum learning and with room destination. We evaluate on the dev set at every epoch (10,000 steps). The total score converges around 54% after 500 epochs of training.

Table 3 breaks down the test results of ‘mixed go-room’ and ‘curric go-room’ by tier, evaluating after all training is complete. Here we can see that a) curriculum training is generally helpful at every tier, and that b) the ability to reach 100% of the score generally decreases by tier. The training behavior of ‘mixed go-room’ is shown in Figure 3. As training proceeds, the total score percentage on dev should go up, and as long as the percentage of wins is not decreasing, the total steps percentage should go down, indicating fewer unnecessary steps. Indeed, this is what we see; the total score gradually increases during training and finally stabilizes at 54%.

Training graphs for ‘curric go-room’ broken down by tier are shown in Figure 4. For tier-1 we converge to almost 100% of total score after 140 epochs, which means our agent grasps basic cooking abilities. However, the results of tier-2 and tier-3 are flat, indicating there is minor ingredient confusion but it is never resolved. For tiers 4 through 6, scores generally improve from 40% to roughly 60%, indicating progressive ability to learn to navigate rooms.

Tier Test 1 Test 2
mixed curric mixed curric
1 88 95 85 94
2 53 58 53 55
3 57 55 54 55
4 55 56 57 58
5 40 49 55 60
6 36 47 41 45
All 50 55 54 58
Table 3: Comparing the evaluation results of training all tiers together (mixed) and training with curriculum learning (curric) on the two separate test sets. Rows 1-6 show the breakdown of total scores and steps on each tier. The curriculum learning method generally shows better results on both test sets.
Figure 4: The training process of ‘curric go-room’ broken down by tier. Results on tier-specific dev sets are shown. Each tier is trained starting with the best model of its previous tier. The learning is generally rational (scores go up) but is less effective in tiers 2 and 3.

5.3 Analysis of Universal Information Conversion

Table 4 breaks down the performance of each strategy for dealing with instance information in each relevant tier. It is clear that ‘go-cardinal,’ which does not convert any instance information, is less able to learn than the other methods at any tier. As the number of rooms to navigate grows from tier-4 to tier-6, the random navigation strategy becomes less effective, such that ‘go-room,’ which transfers instance-level cardinal information into universal-level room-transition information, is the most effective at navigating the large twelve-room games of tier-6. (An even more pertinent strategy would be to label directions by their ability to reach key destination rooms, i.e. the kitchen and supermarket, but such a strategy would not necessarily transfer well to a new domain.)

Tier go-cardinal go-random go-room
4 49 58 56
5 40 48 49
6 36 44 47
All 50 55 55
Table 4: Breakdown of information conversion strategies by tier on Test 1; the ‘go-random’ approach is less effective as map size increases.

Table 5 shows that there is a correlation between the most recently trained tier and performance on test data from that tier; we run ‘curric go-room’ but stop after the tier indicated, then subdivide test data per-tier. We see strongest performance on the main diagonal. This is reasonable because the six-room games of tier-4 use the same six rooms each time and so on; the extra rooms of tier-6 aren’t known during tier-4 training, and some decay of tier-4 rooms is observed as learning is rededicated to new rooms. Nevertheless, by training on all tiers we get best overall performance on Test 1.

Test \ Train Tier 4 Tier 5 Tier 6
Tier 4 62 59 56
Tier 5 41 50 49
Tier 6 26 35 47
All 51 53 55
Table 5: Recency effect of curriculum learning (using go-room) on Test 1; performance on tier-specific subsets is best on the last tier used for training, though training on the entire set gives the overall best result.

5.4 Generalization Ability

To analyze the ability of our models to generalize, we test each model on its train/dev/test sets with 10 episodes per game. Table 6 shows the results of models trained from tier-1 to tier-6 with no fine-tuning. The evaluation result on the training set of tier-1 is 98%, which means that the agent can learn to play a game optimally by repeatedly running on it. When applying what is learned in training to the unseen test set of tier-1, the score earned drops to 88%; we lose 10% of the score when generalizing to unseen tier-1 games. For tier-2, there is a 13% drop in score from the training set (71%) to the test set (58%). Tier-3 likewise has a 10% drop from the training set (64%) to the test set (54%).

The evaluation results on the training sets of tier-2 and tier-3 are 71% and 64%, which means our agent cannot play tier-2 and tier-3 games as well as tier-1 games, even though it was trained on them. This is also confirmed by the training graphs for tier-2 and tier-3 in Figure 4. Since tier-2 and tier-3 introduce more ingredients and cooking steps, the agent may be confused about the relationship between the ingredients and the cooking methods they require.

On tier-4, the agent using the go-random strategy has the best results on train and dev sets, while the agent using go-room shows the best result on the test set. For tier-5 and tier-6, the go-room agents have the best results on train/dev/test sets. The overall generalization score drop is also around 10%.

Train \ go-strategy train dev test
Tier-1 - 98 93 88
Tier-2 - 71 62 58
Tier-3 - 64 69 54
Tier-4 go-cardinal 56 53 45
go-random 68 65 58
go-room 66 63 62
Tier-5 go-cardinal 45 48 36
go-random 56 55 46
go-room 60 58 50
Tier-6 go-cardinal 35 34 36
go-random 47 51 44
go-room 60 52 47
Table 6: Generalization ability analysis of tiers 1-6. Twelve models are trained in a curriculum learning style across the tiers. For tiers 4-6, we also show the results of using different go-strategies. We evaluate each model on the train/dev/test sets (test set 1) of the last tier it was trained on. There is about a 10 percentage point drop from training to test sets for every tier.

5.5 Improvement from Uncertain Exploration

Figure 5: Fine-tuning starts from the best model of curriculum learning with bandit feedback. In the fine-tuning process we use all 3,596 training games together. We evaluate on the dev set at every epoch (10,000 steps). The total score converges at 71% in 260 epochs.
Tier Test 1 Test 2
curric ft Lin curric ft Lin
1 95 96 100 94 100 100
2 58 75 75 55 70 67
3 55 61 64 55 71 71
4 56 68 76 58 69 70
5 49 64 68 60 63 68
6 47 46 60 45 43 51
All 55 64 71 58 64 67
Table 7: We compare the evaluation results of curriculum learning with go-room (curric), fine-tuning with bandit feedback (ft), and LinUCB during evaluation (Lin) on two test sets. By exploring uncertainty in both training and evaluation phases, the agent increases scores by around 10%.

We fine-tune the DRRN model with all training games, starting from the best model of curriculum learning, using a mixture of ε-greedy exploration and bandit feedback. ε decays from 0.5 to its final value over 200 epochs, with 10,000 steps per epoch. Other hyper-parameters are unchanged. The evaluation score on the dev set increases from around 60% to 71% during training (Figure 5).

At inference time, we avoid random action choices, since randomly picking an action usually does not help, especially when the action space is quite large, and can easily select dangerous actions that lead to direct failure, as discussed in Section 4.3. Evaluating on the two test sets (Table 7), the fine-tuned model outperforms the curriculum learning results on both test sets, with up to a 9% increase on test-1 and 6% on test-2. With LinUCB during evaluation, the scores increase by another 7% and 3%, respectively. Moreover, tiers 4-6 benefit more from fine-tuning and LinUCB than tiers 1-3.

6 Related Work

Many recent works on building agents for text-based games [Narasimhan, Kulkarni, and Barzilay2015, He et al.2016, Li et al.2016, Ansari et al.2018, Fulda et al.2017, Côté et al.2018, Kostka et al.2017] apply DQNs [Mnih et al.2015] or variants. Narasimhan, Kulkarni, and Barzilay (2015) use the vanilla DQN setting employed by Mnih et al. (2015) but use an LSTM with a mean-pooling layer to encode text trajectories, and generate two-word actions in a verb+noun format, while He et al. (2016) extend the DQN framework by encoding actions into representations, and use distributional Q-values [Bellemare, Dabney, and Munos2017] when choosing actions. For video games that require the understanding of a range of frames, such as shooting games, Lample and Chaplot (2017) also use LSTMs to encode frames of images for scene understanding. Different aspects of DQN have been explored, such as action reduction with language correlation [Fulda et al.2017], and action elimination with a linear upper confidence bounding method [Zahavy et al.2018]. Language features have been explored via the introduction of a knowledge graph [Ammanabrolu and Riedl2018], and via text understanding with dependency parsing [Yin and May2019]. Tang et al. (2017) and Yuan et al. (2018) use a count-based method to shape the instant reward to encourage agents to explore new scenarios.

However, previous work chiefly focuses on learning to self-train on games and then do well on the same games, rather than on playing unseen games. A rare exception, Yuan et al. (2018) work on generalization of agents across variants of a very simple coin-collecting game. The simplicity of their games enables them to use an LSTM-DQN method with a counting-based reward. Ammanabrolu and Riedl (2018) use a knowledge graph as a persistent memory to encode states, while we use a knowledge graph to make actions more informative. Our work is closely related to task-oriented dialogue studies [He et al.2017, Rajendran et al.2018, Bordes, Boureau, and Weston2017], though these are generally not directly transferrable to our scenario, because they use customized models and rely on labeled training data.

7 Conclusion

In this paper, we train agents to play a family of text-based games. Instead of repeatedly optimizing on a single game, we train agents to play familiar but unseen games. We show that curriculum learning helps the agent learn better. We convert instance knowledge into universal knowledge via map familiarization. We also show how the incorporation of bandit feedback to both training and evaluation phases leads the agent to explore more thoroughly and reach higher scores.

References

  • [Abbasi-yadkori, Pál, and Szepesvári2011] Abbasi-yadkori, Y.; Pál, D.; and Szepesvári, C. 2011. Improved algorithms for linear stochastic bandits. In Shawe-Taylor, J.; Zemel, R. S.; Bartlett, P. L.; Pereira, F.; and Weinberger, K. Q., eds., Advances in Neural Information Processing Systems 24. Curran Associates, Inc. 2312–2320.
  • [Abe, Biermann, and Long2003] Abe, N.; Biermann, A. W.; and Long, P. M. 2003. Reinforcement learning with immediate rewards and linear hypotheses. Algorithmica 37:263–293.
  • [Ammanabrolu and Riedl2018] Ammanabrolu, P., and Riedl, M. O. 2018. Playing text-adventure games with graph-based deep reinforcement learning. CoRR abs/1812.01628.
  • [Ansari et al.2018] Ansari, G. A.; P, S. J.; Chandar, S.; and Ravindran, B. 2018. Language expansion in text-based games. CoRR abs/1805.07274.
  • [Auer2003] Auer, P. 2003. Using confidence bounds for exploitation-exploration trade-offs. J. Mach. Learn. Res. 3:397–422.
  • [Bellemare, Dabney, and Munos2017] Bellemare, M. G.; Dabney, W.; and Munos, R. 2017. A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, 449–458. JMLR.org.
  • [Bengio et al.2009] Bengio, Y.; Louradour, J.; Collobert, R.; and Weston, J. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, 41–48. New York, NY, USA: ACM.
  • [Bordes, Boureau, and Weston2017] Bordes, A.; Boureau, Y.; and Weston, J. 2017. Learning end-to-end goal-oriented dialog. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.
  • [Côté et al.2018] Côté, M.; Kádár, Á.; Yuan, X.; Kybartas, B.; Barnes, T.; Fine, E.; Moore, J.; Hausknecht, M. J.; Asri, L. E.; Adada, M.; Tay, W.; and Trischler, A. 2018. Textworld: A learning environment for text-based games. CoRR abs/1806.11532.
  • [Fulda et al.2017] Fulda, N.; Ricks, D.; Murdoch, B.; and Wingate, D. 2017. What can you do with a rock? affordance extraction via word embeddings. In Sierra, C., ed., IJCAI, 1039–1045. ijcai.org.
  • [He et al.2016] He, J.; Chen, J.; He, X.; Gao, J.; Li, L.; Deng, L.; and Ostendorf, M. 2016. Deep reinforcement learning with a natural language action space. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1621–1630. Association for Computational Linguistics.
  • [He et al.2017] He, H.; Balakrishnan, A.; Eric, M.; and Liang, P. 2017. Learning symmetric collaborative dialogue agents with dynamic knowledge graph embeddings. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1766–1776. Vancouver, Canada: Association for Computational Linguistics.
  • [Hochreiter and Schmidhuber1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural computation 9:1735–80.
  • [Kostka et al.2017] Kostka, B.; Kwiecieli, J.; Kowalski, J.; and Rychlikowski, P. 2017. Text-based adventures of the golovin AI agent. In CIG, 181–188. IEEE.
  • [Lample and Chaplot2017] Lample, G., and Chaplot, D. S. 2017. Playing fps games with deep reinforcement learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI’17, 2140–2146. AAAI Press.
  • [Lebling, Blank, and Anderson1979] Lebling; Blank; and Anderson. 1979. Special feature zork: A computerized fantasy simulation game. Computer 12(4):51–59.
  • [Li et al.2016] Li, J.; Monroe, W.; Ritter, A.; Jurafsky, D.; Galley, M.; and Gao, J. 2016. Deep reinforcement learning for dialogue generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 1192–1202. Austin, Texas: Association for Computational Linguistics.
  • [Mnih et al.2015] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; Petersen, S.; Beattie, C.; Sadik, A.; Antonoglou, I.; King, H.; Kumaran, D.; Wierstra, D.; Legg, S.; and Hassabis, D. 2015. Human-level control through deep reinforcement learning. Nature 518:529 EP –.
  • [Narasimhan, Kulkarni, and Barzilay2015] Narasimhan, K.; Kulkarni, T.; and Barzilay, R. 2015. Language understanding for text-based games using deep reinforcement learning. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 1–11. Association for Computational Linguistics.
  • [Rajendran et al.2018] Rajendran, J.; Ganhotra, J.; Singh, S.; and Polymenakos, L. 2018. Learning end-to-end goal-oriented dialog with multiple answers. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 3834–3843. Brussels, Belgium: Association for Computational Linguistics.
  • [Tang et al.2017] Tang, H.; Houthooft, R.; Foote, D.; Stooke, A.; Xi Chen, O.; Duan, Y.; Schulman, J.; DeTurck, F.; and Abbeel, P. 2017. #exploration: A study of count-based exploration for deep reinforcement learning. In Guyon, I.; Luxburg, U. V.; Bengio, S.; Wallach, H.; Fergus, R.; Vishwanathan, S.; and Garnett, R., eds., Advances in Neural Information Processing Systems 30. Curran Associates, Inc. 2753–2762.
  • [Yin and May2019] Yin, X., and May, J. 2019. Comprehensible context-driven text game playing. CoRR abs/1905.02265.
  • [Yuan et al.2018] Yuan, X.; Côté, M.; Sordoni, A.; Laroche, R.; des Combes, R. T.; Hausknecht, M. J.; and Trischler, A. 2018. Counting to explore and generalize in text-based games. CoRR abs/1806.11525.
  • [Zahavy et al.2018] Zahavy, T.; Haroush, M.; Merlis, N.; Mankowitz, D. J.; and Mannor, S. 2018. Learn what not to learn: Action elimination with deep reinforcement learning. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada., 3566–3577.