Generalization of Reinforcement Learners with Working and Episodic Memory



Memory is an important aspect of intelligence and plays a role in many deep reinforcement learning models. However, little progress has been made in understanding when specific memory systems help more than others and how well they generalize. The field also lacks a widely adopted, consistent and rigorous approach for evaluating agent performance on holdout data. In this paper, we aim to develop a comprehensive methodology to test different kinds of memory in an agent and assess how well the agent can apply what it learns in training to a holdout set that differs from the training set along dimensions that we suggest are relevant for evaluating memory-specific generalization. To that end, we first construct a diverse set of memory tasks that allow us to evaluate test-time generalization across multiple dimensions. Second, we develop and perform multiple ablations on an agent architecture that combines multiple memory systems, observe its baseline models, and investigate its performance against the task suite.

1 Introduction

Humans use memory to reason, imagine, plan, and learn. Memory is a foundational component of intelligence, and enables information from past events and contexts to inform decision-making in the present and future. Recently, agents that utilize memory systems have advanced the state of the art in various research areas including reasoning, planning, program execution and navigation, among others (Graves et al., 2016; Zambaldi et al., 2018; Santoro et al., 2018; Banino et al., 2018; Vaswani et al., 2017; Sukhbaatar et al., 2015).

Memory has many aspects, and having access to different kinds allows intelligent organisms to bring the most relevant past information to bear on different sets of circumstances. In cognitive psychology and neuroscience, two commonly studied types of memory are working and episodic memory. Working memory (Miyake and Shah, 1999) is a short-term temporary store with limited capacity.

In contrast, episodic memory (Tulving and Murray, 1985) is typically a larger autobiographical database of experience (e.g. recalling a meal eaten last month) that lets one store information over a longer time scale and compile sequences of events into episodes (Tulving, 2002). Episodic memory has been shown to help reinforcement learning agents adapt more quickly and thereby boost data efficiency (Blundell et al., 2016; Pritzel et al., 2017; Hansen et al., 2018). More recently, Ritter et al. (2018) show how episodic memory can be used to provide agents with context-switching abilities in contextual bandit problems. The transformer (Vaswani et al., 2017) can be viewed as a hybrid of working memory and episodic memory that has been successfully applied to many supervised learning problems.

In this work, we explore adding such memory systems to agents and propose a consistent and rigorous approach for evaluating whether an agent demonstrates generalization-enabling memory capabilities similar to those seen in animals and humans.

One fundamental principle in machine learning is to train on one set of data and test on an unseen holdout set, but to date it has been common in reinforcement learning to evaluate agent performance solely on the training set, which is suboptimal for testing generalization (Pineau, 2018). Also, though advances have recently been made on evaluating generalization in reinforcement learning (Cobbe et al., 2018), these have not been specific to memory.

Our approach is to construct a train-holdout split where the holdout set differs from the training set along axes that we propose are relevant specifically to memory, i.e. the scale of the task and precise objects used in the task environments. For instance, if an agent learns in training to travel to an apple placed in a room, altering the room size or apple color as part of a generalization test should ideally not throw it off.

We propose a set of environments that possess such a split and test different aspects of working and episodic memory, to help us better understand when different kinds of memory systems are most helpful and identify memory architectures in agents with memory abilities that cognitive scientists and psychologists have observed in humans.

Alongside these tasks, we develop a benchmark memory-based agent, the Memory Recall Agent (MRA), that brings together previously developed systems thought to mimic working memory and episodic memory. It combines a controller that models working memory, an external episodic memory, and an architecture that encourages long-term representational credit assignment via an auxiliary unsupervised loss and backpropagation through time that can ‘jump’ over several time-steps. This combination obtains better performance than baselines across the suite. In particular, episodic memory and learning good representations both prove crucial and in some cases stack synergistically.

To summarize, our contribution is to:

  • Introduce a suite of tasks that require an agent to utilize fundamental functional properties of memory in order to solve in a way that generalizes to holdout data.

  • Develop an agent architecture that explicitly models the operation of memory by integrating components that functionally mimic humans’ episodic and working memory.

  • Show that different components of our agent’s memory have different effectiveness in training and in generalizing to holdout sets.

  • Show that none of the models fully generalize outside of the train set on the more challenging tasks, and that extrapolation incurs greater degradation than interpolation.

2 Task suite overview

We define a suite of 13 tasks designed to test different aspects of memory, with train-test splits that test for generalization across multiple dimensions. These include cognitive psychology tasks adapted from PsychLab (Leibo et al., 2018) and DMLab (Beattie et al., 2016), and new tasks built with the Unity 3D game engine (27) that require the agent to 1) spot the difference between two scenes; 2) remember the location of a goal and navigate to it; or 3) infer an indirect transitive relation between objects. Videos with task descriptions are at

2.1 PsychLab

Four tasks in the Memory Tasks Suite use the PsychLab environment (Leibo et al., 2018), which simulates a psychology laboratory in first-person. The agent is presented with a set of one or multiple consecutive images, where each set is called a ‘trial’. Each episode has multiple trials.

In Arbitrary Visuomotor Mapping (AVM) a series of objects is presented, each with an associated look-direction (e.g. up, left). The agent is rewarded if it looks in the associated direction the next time it sees a given object in the episode (Fig 8(a) in App. B). Continuous Recognition presents a series of images with rewards given for correctly indicating whether an image has been previously shown in the episode (Fig 8(b) in App. B). In Change Detection the agent sees two consecutive images, separated by a variable-length delay, and has to correctly indicate if the two images differ (Fig 8(c) in App. B). In What Then Where the agent is shown a single ‘challenge’ MNIST digit, then an image of that digit with three other digits, each placed along an edge of the rectangular screen. It next has to correctly indicate the location of the ‘challenge’ digit (Fig 8(d) in App. B).

2.2 3D tasks

(a) Spot the Difference basic
(b) Navigate to Goal
(c) Transitive Inference
Figure 1: Task layouts for Spot the Difference, Goal Navigation, and Transitive Inference. In (a), the agent has to identify the difference between the two rooms. In (b), the agent has to go to the goal, which is represented by an oval symbol here and may or may not be visible to the agent. In (c), the agent has to go to the higher-valued object in each pair. The value order is given by the transitive chain outside the room. It is shown here solely for illustration; the agent cannot see it.

Spot the Difference: This tests whether the agent can correctly identify the difference between two nearly identical scenes (Figure 1(a)). The agent has to move from the first to the second room, with a ‘delay’ corridor in between. See Fig. 2 for the four different variants.

(a) Spot the Difference Basic
(b) Spot the Difference Passive
(c) Spot the Difference Multi-Object
(d) Spot the Difference Motion
Figure 2: Spot the Difference tasks. (a) All the tasks in this family are variants of this basic setup, where each room contains two blocks. (b) By placing Room 1’s blocks right next to the corridor entrance, we guarantee that the agent will always see them. (c) The number of objects varies. (d) Instead of differing in color between rooms, the altered block follows a different motion pattern.

Goal Navigation: This task family was inspired by the Morris Watermaze (Miyake and Shah, 1999) setup used with rodents in behavioral neuroscience. The agent is rewarded every time it successfully reaches the goal; once it gets there it is respawned randomly in the arena and has to find its way back to the goal. The goal location is re-randomized at the start of each episode (Fig. 1(b), Fig. 3).

(a) Invisible Goal Empty Arena
(b) Invisible Goal, With Buildings
(c) Visible Goal With Buildings
(d) Visible Goal Procedural Maze
Figure 3: Goal Navigation tasks. (a) The arena has no buildings, agent must navigate by skybox. (b) There are rectangular buildings at fixed, non-randomized locations in the arena. (c) As in (b), but the goal appears as an oval. (d) A visible goal in a procedurally generated maze.

Transitive Inference:

This task tests whether an agent can learn an overall transitive ordering over a chain of objects by being presented with ordered pairs of adjacent objects (see Fig. 1(c) and App. B).

2.3 Scale and Stimulus Split

To test how well the agent can generalize to holdout data after training, we create per-task holdout levels that differ from the training level along a scale and a stimulus dimension. The scale dimension is intended to capture something about the memory demand of the task: e.g., a task with a longer time delay between events that must be related should be harder than one with a short delay. The stimulus dimension is to guard against trivial overfitting to the particular visual input presented to the agent: the memory representation should be more abstract than the particular colour of an object.

The training level comprises a ‘small’ and ‘large’ scale version of the task. When training the agent we uniformly sample between these two scales. As for the holdout levels, one of them – ‘holdout-interpolate’ – corresponds to an interpolation between those two scales (call it ‘medium’) and the other, ‘holdout-extrapolate’, corresponds to an extrapolation beyond the ‘large’ scale (call it ‘extra-large’). Alterations made for each task split and their settings are in Table 2 in App. A.
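The split described above can be sketched in a few lines. This is only an illustrative sketch: the dictionary names and the numeric scale values are hypothetical placeholders, not the settings used in the paper (those are in Table 2 and App. A).

```python
import random

# Hypothetical scale values for illustration only; the real per-task
# settings are listed in App. A.
TRAIN_SCALES = {"small": 2, "large": 8}
HOLDOUT_SCALES = {"holdout-interpolate": 5, "holdout-extrapolate": 12}


def sample_training_scale(rng):
    """Uniformly pick between the 'small' and 'large' training scales."""
    name = rng.choice(sorted(TRAIN_SCALES))
    return name, TRAIN_SCALES[name]


def is_valid_split(train, holdout):
    """Check that interpolation lies between the two training scales
    and extrapolation lies beyond the larger one."""
    lo, hi = sorted(train.values())
    return lo < holdout["holdout-interpolate"] < hi < holdout["holdout-extrapolate"]
```

Sampling the scale per episode, rather than fixing it, means the agent sees both ends of the training range, which is what makes the ‘medium’ holdout level an interpolation rather than a second extrapolation.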

3 The Memory Recall Agent

Our agent, the Memory Recall Agent (MRA), incorporates five components: 1) a pixel-input convolutional, residual network, 2) a working memory, 3) a slot-based episodic memory, 4) an auxiliary contrastive loss for representation learning (van den Oord et al., 2018), 5) a jumpy backpropagation-through-time training regime. Our agent architecture is shown in Figure 4(a). The overall agent is built on top of the IMPALA model (Espeholt et al., 2018) and is trained in the same way with the exceptions described below. Component descriptions are below.

Pixel Input Pixel input is fed to a convolutional neural network, as is common in recent agents, followed by a residual block (He et al., 2015). The precise hyper-parameters are given in App. C.2: we use three convolutional layers followed by two residual layers. The output of this process is $p_t$ in Figure 4(a) and serves as input to three other parts of the network: 1) as part of the input to the working memory module, 2) in the formation of keys and queries for the episodic memory, 3) as part of the target for the contrastive predictive coding.

Working Memory Working memory is often realized through latent recurrent neural networks (RNNs) with some form of gating, such as LSTMs and Relational Memory architectures (Hochreiter and Schmidhuber, 1997; Santoro et al., 2018). These working memory models calculate the next set of hidden units using the current input and the previous hidden units. Although models which rely on working memory can perform well on a variety of problems, their ability to tackle dependencies and represent variables over long time periods is limited. The short-term nature of working memory is pragmatically, and perhaps unintentionally, reflected in the use of truncated backprop through time and the tendency for gradients through these RNNs to explode or vanish. Our agent uses an LSTM as a model of working memory. As we shall see in experiments, this module is able to perform working memory–like operations on tasks: i.e., learn calculations involving short-term memory. As depicted in Figure 4(a), the LSTM takes as input $p_t$ from the pixel input network and $m_t$ from the episodic memory module. As in Espeholt et al. (2018), the LSTM has two heads as output, producing the policy $\pi$ and the baseline value function $V$. In our architecture these are derived from the output of the LSTM, $h_t$. $h_t$ is also used to form episodic memories, as described below.

Episodic Memory (MEM) If our agent only consisted of the working memory and pixel input described above, it would be almost identical to the model in IMPALA (Espeholt et al., 2018), an already powerful RL agent. But MRA also includes a slot-based episodic memory module, as it can store values more reliably and longer-term than an LSTM, is less susceptible to the intricacies of gradient propagation, and its fundamental operations afford the agent different abilities (as observed in our experiments). The MEM in MRA has a key-value structure which the agent reads from and writes to at every time-step (see Fig. 4(a)). MRA implements a mechanism to learn how to store summaries of past experiences and retrieve relevant information when it encounters similar contexts. The reads from memory are used as additional inputs to the neural network (controller), which produces the model predictions. This effectively augments the controller’s working memory capabilities with experiences from different time scales retrieved from the MEM, which facilitate learning long-term dependencies, a difficult task when relying entirely on backpropagation in recurrent architectures (Hochreiter and Schmidhuber, 1997; Graves et al., 2016; Vaswani et al., 2017).

(a) Architecture of the MRA.
(b) Contrastive Predictive Coding loss for MRA.
Figure 4: The Memory Recall Agent (MRA) architecture. Here $p_t$ is the pixel input embedding from step $t$, and $v_t$ is the LSTM hidden state $h_t$. $k_t$ is the key used for reading; we compute it from $p_t$ and $v_t$. $q_t$ is the query we use to compare against keys to find nearest neighbors.

The MEM has a number of slots, indexed by $i$. Each slot stores activations from the pixel input network and LSTM from previous time-steps. The MEM acts as a fixed-size circular (first-in-first-out) buffer: new keys and values are added, overwriting the least recently added entry if there are no unused slots available. The contents of the episodic memory buffer are wiped at the end of each episode.
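The buffer behaviour described above can be illustrated with a minimal sketch. The class name `EpisodicMemory`, its methods, and the triplet labels `p`, `v`, `k` (pixel embedding, stored LSTM activation, key) are names assumed here for illustration; this is not the authors' implementation.

```python
class EpisodicMemory:
    """Fixed-size first-in-first-out store of (p, v, k) slots (a sketch)."""

    def __init__(self, num_slots):
        self.num_slots = num_slots
        self.slots = [None] * num_slots  # None marks an unused slot
        self.next_index = 0              # slot that the next write will use

    def write(self, p, v, k):
        """Store one triplet; once full, this overwrites the oldest entry."""
        i = self.next_index
        self.slots[i] = (p, v, k)
        self.next_index = (i + 1) % self.num_slots
        return i

    def entries(self):
        """Return the filled slots, ordered by slot index (not by age)."""
        return [s for s in self.slots if s is not None]

    def reset(self):
        """Wipe the buffer, as is done at the end of each episode."""
        self.slots = [None] * self.num_slots
        self.next_index = 0
```

Because writes advance a single index modulo the buffer size, unused slots are filled first and the least recently added entry is the one replaced afterwards, which matches the first-in-first-out description.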

Memory Writing

Crucially, writing to episodic memory is done without gradients. At each step a free slot is chosen for writing, denoted $i$. Next, the following is stored:

$$(p_i, v_i, k_i) \leftarrow (p_t, h_t, W[p_t, h_t] + b) \qquad (1)$$

where $p_t$ is the pixel input embedding from step $t$ and $h_t$ is the LSTM hidden state (if the working memory is something else, e.g. a feedforward network, this would be the output activations). $k_i$ is the key, used for reading (described below), computed as a simple linear function of the other two values stored. Caching the key speeds up memory reads significantly. However, the key can become stale as the weights and biases, $W$ and $b$, are learnt (the procedure for learning them is described below under Jumpy Backpropagation). In our experiments we did not see an adverse effect of this staleness.

Memory Reading

The agent uses a form of dot-product attention (Bahdanau et al., 2015) over its MEM, to select the most relevant events to provide as input $m_t$ to the LSTM. The query $q_t$ is a linear transform of the pixel input embedding $p_t$ and the LSTM hidden state from the previous time-step $h_{t-1}$, with weight $W_q$ and bias $b_q$.

$$q_t = W_q[p_t, h_{t-1}] + b_q \qquad (2)$$

The query $q_t$ is then compared against the keys in MEM as in Pritzel et al. (2017): Let $(p_j, v_j, k_j)$, $1 \le j \le K$, be the $K$ nearest neighbors to $q_t$ from MEM, under an L2 norm between $k_j$ and $q_t$.

$$m_t = \frac{\sum_{j=1}^{K} v_j \,/\, \lVert q_t - W[p_j, v_j] - b \rVert_2^2}{\sum_{j=1}^{K} 1 \,/\, \lVert q_t - W[p_j, v_j] - b \rVert_2^2} \qquad (3)$$

We compute a weighted aggregate of the values ($v_j$) of the $K$ nearest neighbors, weighted by the inverse of each neighbor-key’s distance to the query. Note that the distance is re-calculated from values stored in the MEM, via the linear projection in (1). We concatenate the resulting weighted aggregate memory $m_t$ with the embedded pixel input $p_t$, and pass it as input to the working memory as shown in Figure 4(a).
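The read described above can be sketched in plain Python. Several details here are assumptions made for illustration rather than facts from the text: the function names, the use of squared L2 distance with a small `eps` in the denominator (in the style of the kernel in Pritzel et al., 2017), and the normalisation of the weights so they sum to one. Vectors are plain lists and `p + v` denotes list concatenation.

```python
def linear(W, b, x):
    """Compute W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def sq_dist(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def read_memory(query, slots, W, b, num_neighbors, eps=1e-8):
    """Inverse-distance weighted read over the nearest stored entries.

    `slots` is a non-empty list of (p, v, cached_key) triplets.
    """
    # 1) Select neighbours using the cached (possibly stale) keys.
    nearest = sorted(slots, key=lambda s: sq_dist(query, s[2]))[:num_neighbors]
    # 2) Recompute each neighbour's key from the stored p and v with the
    #    current W and b, then weight by inverse distance to the query.
    weights = [1.0 / (sq_dist(query, linear(W, b, p + v)) + eps)
               for p, v, _ in nearest]
    total = sum(weights)
    dim = len(nearest[0][1])
    # 3) Normalised weighted average of the stored values.
    return [sum(w * v[i] for w, (_, v, _) in zip(weights, nearest)) / total
            for i in range(dim)]
```

Only the selected neighbours have their keys recomputed, so the extra cost per read grows with the number of neighbours rather than with the size of the memory.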

Jumpy backpropagation We now turn to how gradients flow into memory writes. Full backpropagation can become computationally infeasible as this would require backpropagation into every write that is read from and so on. Thus as a new $(p, v, k)$-triplet is added to the MEM, there are trade-offs to be made regarding computational complexity versus performance of the agent. To make it more computationally tractable, we place a stop-gradient in the memory write. In particular, the write operation for the key in (1) becomes:

$$k_i = W[SG(p_t), SG(h_t)] + b \qquad (4)$$

where $SG(\cdot)$ denotes that the gradients are stopped. This allows the parameters $W$ and $b$ to receive gradients from the loss during writing and reading, while at the same time bounding the computational complexity as the gradients do not flow back into the recurrent working memory (or via that back into the MEM). To re-calculate the distances, we want to use these learnt parameters rather than, say, random projection, so we need to store the arguments $p_t$ and $h_t$ of the key-generating linear transform for all previous time-steps. Thus in the MEM we store the full $(p_i, v_i, k_i)$-triplet, where $v_i = h_t$, and $t$ is the step at which the write was made. We call this technique ‘jumpy backpropagation’ because the intermediate steps between the current time-step and the memory write step are not taken into account in the gradient updates.

This approach is similar to Sparse Attentive Backtracking (Ke et al., 2018, SAB) which uses sparse replay by passing gradients only through memories selected as relevant at each step. Our model differs in that it does not have a fixed chunking scheme and does not do full backpropagation through the architecture (which in our case becomes quickly intractable). Our approach has minimal computational overhead as we only recompute the keys for the nearest neighbors.

Auxiliary Unsupervised Losses An agent with good memory provides a good basis for forming a rich representation of the environment, as it captures a history of the states visited by the agent. This is the primary basis for many rich probabilistic state representations in reinforcement learning such as belief states and predictive state representations (Littman and Sutton, 2002). Auxiliary unsupervised losses can significantly improve agent performance (Jaderberg et al., 2016). Recently it has been shown that agents augmented with one-step contrastive predictive coding (van den Oord et al., 2018, CPC) can learn belief state representations of the environment (Guo et al., 2018). Thus in MRA we combine the working and episodic memory mechanisms listed above with a CPC unsupervised loss to imbue the agent with a rich state representation. The CPC auxiliary loss is added to the usual RL losses, and is of the following form:


$$\sum_{k=1}^{N} \text{CPCLoss}[h_t; p_{t+k}] \qquad (5)$$

where CPCLoss is from van den Oord et al. (2018), $h_t$ is the working memory hidden state, and $p_{t+k}$ is the encoded pixel input $k$ steps in the future. $N$ is the number of CPC steps (typically or in our experiments). See Figure 4(b) for an illustration, and App. C.3 for further details and equations elaborating on this loss.
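To make the shape of this loss concrete, the following is a simplified sketch of a contrastive (InfoNCE-style) term summed over future steps. It is not the loss used by MRA, whose details are in App. C.3. In particular, the plain dot-product score, the shared set of negatives, and the function names `info_nce` and `cpc_loss` are all assumptions made here for illustration; CPC as published uses a learned predictor per step.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(h, positive, negatives):
    """Negative log-softmax score of the true future embedding
    among the candidates (the positive plus the negatives)."""
    scores = [dot(h, positive)] + [dot(h, n) for n in negatives]
    m = max(scores)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_norm - scores[0]


def cpc_loss(h_t, futures, negatives):
    """Sum the contrastive term over the future embeddings provided,
    one per step ahead of the current time-step."""
    return sum(info_nce(h_t, p, negatives) for p in futures)
```

The loss is small when the hidden state scores the true future embedding above the negatives, which is what pushes the working memory state to carry information that is predictive of upcoming observations.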

Reconstruction losses have also been used as an auxiliary task (Jaderberg et al., 2016; Wayne et al., 2018) and we include this as a baseline in our experiments. Our reconstruction baseline minimizes the L2 distance between the predicted reward and predicted pixel input and the true reward and pixel input, using the working memory state as input. Details of this baseline are given in App. C.4.

4 Experiments

Setup We ran 10 ablations on the MRA architecture, on the training and the two holdout levels:

  • Working Memory component: Either feedforward neural network (‘FF’ for short) or LSTM. The LSTM-only baseline corresponds to IMPALA (Espeholt et al., 2018).

  • With or without using episodic memory module (‘MEM’).

  • With or without auxiliary unsupervised loss (either CPC or reconstruction loss (‘REC’)).

  • With or without jumpy backpropagation, for MRA (i.e. LSTM + MEM + CPC)

Given that the experiments are computationally demanding, we only performed small variations within as part of our hyper-parameter tuning process for each task (see App. D).

We hypothesize that in general the agent should perform the best in training, somewhat worse on the holdout-interpolation level and the worst on the holdout-extrapolation level. That is, we expect to see a generalization gap. Our results validated this hypothesis for the tasks that were much harder for agents than for humans.

4.1 Full comparison

We computed human-normalized scores (details in App. B) and plotted them in a heatmap (Fig 5), sorted such that the model with the highest train scores on average is the top row and the task with highest train scores on average is the leftmost column. The heatmap suggests that the MRA architecture, LSTM + MEM + CPC, broadly outperforms the other models (App. B Table 3). This ranking was almost always maintained across train and holdout levels, despite MRA performing worse than the LSTM-only baseline on What Then Where. What Then Where was one of the tasks where all models did poorly, along with Spot the Difference: Multi-Object (rightmost columns in heatmap). At the other end of the difficulty spectrum, LSTM + MEM had superhuman scores on Visible Goal Procedural Maze in training and on Transitive Inference in training and holdout, and further adding CPC or REC boosted the scores even higher.

Figure 5: Heatmap of ablations per task sorted by normalized score for Train, Holdout-Interpolate, Holdout-Extrapolate. The same plot with standard errors is in App. B Fig. 14.
Figure 6: Normalized scores averaged across tasks.

4.2 Results

Different memory systems worked best for different kinds of tasks, but the MRA architecture’s combination of LSTM + MEM + CPC did the best overall on training and holdout (Fig. 6). Removing jumpy backpropagation from MRA hurt performance in five Memory Suite tasks (App. B Fig. 10), while performance was the same in the remaining ones (App. B Fig. 11 and 12).

Generalization gap widens as task difficulty increases The hypothesized generalization gap was minimal for some tasks e.g. AVM and Continuous Recognition but significant for others e.g. What Then Where and Spot the Difference: Multi-Object (Fig 7). We observed that the gap tended to be wider as the task difficulty went up, and that in PsychLab, the two tasks where the scale was the number of trials seemed to be easier than the other two tasks where the scale was the delay duration.

MEM critical on some tasks, is enhanced by auxiliary unsupervised loss Adding MEM improved scores on nine tasks in training, six in holdout-interpolate, and six in holdout-extrapolate. Adding MEM alone, without an auxiliary unsupervised loss, was enough to improve scores on AVM and Continuous Recognition, all Spot the Difference tasks except Spot the Difference: Multi-Object, all Goal Navigation tasks except Visible Goal Procedural Maze, and also for Transitive Inference.

Adding MEM helped to significantly boost holdout performance for Transitive Inference, AVM, and Continuous Recognition. For the two PsychLab tasks this finding was in line with our expectations, since they both can be solved by memorizing single images and determining exact matches and thus an external episodic memory would be the most useful. For Transitive Inference, in training MEM helped when the working memory was FF but made little difference on an LSTM, but on holdout MEM helped noticeably for both FF and LSTM. In Change Detection and Multi-Object, adding MEM alone had little or no effect but combining it with CPC or REC provided a noticeable boost.

Synergistic effect of MEM + CPC, for LSTM On average, adding either the MEM + CPC stack or MEM + REC stack to any working memory appeared to improve the agent’s ability to generalize to holdout levels (Fig. 6). Interestingly, on several tasks we found that combining MEM + CPC had a synergistic effect when the working memory was LSTM: The performance boost from adding MEM + CPC was larger than the sum of the boost from adding MEM or CPC alone. We observed this phenomenon in seven tasks in training, six in holdout-interpolate, and six in holdout-extrapolate. Among these, the tasks where there was MEM + CPC synergy across training, holdout-interpolate, and holdout-extrapolate were: the easiest task, Visible Goal Procedural Maze; Visible Goal with Buildings; Spot the Difference: Basic; and the hardest task, Spot the Difference: Multi-Object.
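The synergy criterion used above (the joint boost exceeds the sum of the individual boosts) can be written as a small check. The function names and the example scores in the usage note are hypothetical and are not results from the paper.

```python
def boost(score_with, score_base):
    """Improvement of an ablation over the baseline it extends."""
    return score_with - score_base


def is_synergistic(base, with_mem, with_cpc, with_both):
    """True when adding MEM + CPC together helps more than the sum of
    the gains from adding MEM alone and CPC alone."""
    joint = boost(with_both, base)
    separate = boost(with_mem, base) + boost(with_cpc, base)
    return joint > separate
```

For example, with made-up scores of 20 for the LSTM baseline, 30 with MEM, 25 with CPC and 60 with both, the joint boost of 40 exceeds the summed individual boosts of 15, so the combination would count as synergistic under this definition.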

CPC vs. REC CPC was better than REC on all Spot the Difference tasks, and the two harder PsychLab tasks Change Detection and What Then Where. On the other two PsychLab tasks there was no difference between CPC and REC. However, REC was better on all Goal Navigation tasks except Invisible Goal Empty Arena. When averaged out, REC was more useful when the working memory was FF, but CPC was more useful for an LSTM working memory.

Figure 7: Generalization gap is smaller for AVM and Continuous Recognition, larger for What Then Where and Spot the Difference: Multi-Object. Dotted lines indicate human baseline scores. See other curves in App. B Fig. 13.

5 Discussion & Future Work

We constructed a diverse set of environments to test memory-specific generalization, based on tasks designed to identify working memory and episodic memory in humans, and also developed an agent that demonstrates many of these cognitive abilities. We propose both a testbed and benchmark for further work on agents with memory, and demonstrate how better understanding the memory and generalization abilities of reinforcement learning agents can point to new avenues of research to improve agent performance and data efficiency. There is still room for improvement on the trickiest tasks in the suite where the agent fared relatively poorly. In particular, solving Spot the Difference: Motion might need a generative model that enables forward planning to imagine how future motion unrolls (e.g., Racanière et al., 2017). Our results indicate that adding an auxiliary loss such as CPC or reconstruction loss to an architecture that already has an external episodic memory improves generalization performance on holdout sets, sometimes synergistically. This suggests that existing agents that use episodic memory, such as DNC and NEC, could potentially boost performance by implementing an additional auxiliary unsupervised loss.


We would like to thank Jessica Hamrick, Jean-Baptiste Lespiau, Frederic Besse, Josh Abramson, Oriol Vinyals, Federico Carnevale, Charlie Beattie, Piotr Trochim, Piermaria Mendolicchio, Aaron van den Oord, Chloe Hillier, Tom Ward, Ricardo Barreira, Matthew Mauger, Thomas Köppe, Pauline Coquinot and many others at DeepMind for insightful discussions, comments and feedback on this work.

Appendix A Level descriptions and further experimental findings

As described in Section 2.3, for each task in the Suite we construct a small training level, a large training level, a ‘holdout-interpolation’ level and a ‘holdout-extrapolation’ level.

During training the environment uniformly samples from the small and large training levels. The interpolation level has a scale somewhere in between ‘small’ and ‘large’ while the extrapolation level corresponds to ‘extra-large’ (Table 1). A summary of the alterations made for each task split is in Table 2. The settings used in each level per task are described below.

Scale \ Stimuli | Training set | Holdout set
Small | Used for training | —
Medium | — | Used for interpolation
Large | Used for training | —
Extra-large | — | Used for extrapolation
Table 1: Overall structure for scale and stimulus split.

The dashed (‘—’) settings in Table 1 are not reported nor used, since they lack a clear interpretation in terms of generalization.

Task | Scale | Stimulus
AVM | Number of trials | Image
Continuous Recognition | Number of trials | Image
Change Detection | Delay study/test | Color
What Then Where | Delay study/query | Digit image
Spot Diff Basic | Corridor delay | Color
Spot Diff Passive | Corridor delay duration | Color
Spot Diff Multi-object | Number of objects | Color
Spot Diff Motion | Corridor delay | Motion pattern
All Goal Navigation tasks | Arena size | Goal spawn
Transitive Inference | Length of transitive chain | Object color
Table 2: Scale and stimulus alterations across task families

A.1 PsychLab

Our Memory Tasks Suite has four PsychLab tasks: Arbitrary Visuomotor Mapping (AVM), Continuous Recognition, Change Detection and What Then Where. The description of each task is found in Figure 8. Videos with agent play and the

(a) Arbitrary Visuomotor Mapping (AVM)
(b) Continuous Recognition
(c) Change Detection
(d) What Then Where
Figure 8: All PsychLab tasks have multiple trials within an episode. Each trial consists of a single image being displayed on the panel. In (a), when the agent sees an image for the first time, the associated direction is indicated on the screen (green box on the left). By executing the indicated pattern, the agent receives a reward. When the agent is presented with an image it has already seen during the episode, the associated direction is no longer indicated (middle), and the agent must remember it from its previous experience in order to get a reward (right). In (b), the agent indicates if it has seen the image in the current episode by looking left or right, respectively. In (c), the agent is shown a pattern (left), and after a delay (middle), a second pattern is shown (right). The agent has to indicate if there was a change between the two patterns or not by looking right or left, respectively. The delay period separating the two patterns varies in length. In (d), in the ‘what’ study phase, an MNIST digit is displayed (left). In the ‘where’ study period, four distinct MNIST digits are displayed including the one from the ‘what’ period (middle). In the test phase (right), the agent must remember what digit was displayed in the ‘what’ period, see where it is located during the ‘where’ study period, and then respond by looking to that location. In this example it has to look left.

Scale Either number of trials per episode or delay duration.

For Arbitrary Visuomotor Mapping and Continuous Recognition, every episode lasts at most 300 seconds, except for the Extrapolate level where the cap is set to 450 seconds to accommodate the larger number of trials. In Change Detection an episode lasts at most 300 seconds, while for What Then Where it is 600 seconds.

Scale \ Task | AVM and Cont. Recog.: trials per episode | Change Detection: delay (seconds) | What Then Where: delay (seconds)
Small | 50 | 2, 4, 8 | 4, 8
Interpolate | 40 | 16, 32 | 16, 64
Large | 50 | 64, 128 | 32, 128
Extrapolate | 75 | 130, 150, 200, 250 | 132, 156, 200, 256

Stimulus Either color set or image set.

Task | AVM and Cont. Recog. | Change Detection | What Then Where
Stimulus | Different images | Color set | MNIST digits
Training | Images with even ID | Amethyst, Caramel, Honeydew, Jade, Mallow | 0, 1, 2, 3, 4
Holdout | Images with odd ID | Yellow, Lime, Pink, Sky, Violet | 5, 6, 7, 8, 9

PsychLab: main experimental findings

AVM: in this task, the agent must remember associations between images and specific movement patterns (Figure 8 (a)).

The most useful component turned out to be MEM. This is in line with earlier findings that an external episodic memory is a prerequisite for solving AVM [Wayne et al., 2018]. Adding an auxiliary loss helped when the controller was FF but made no difference for an LSTM. Also, choosing between CPC and REC as the auxiliary unsupervised loss did not make a major difference for either controller.

Continuous Recognition: in this task, the agent must indicate, by looking left or right, whether it has seen a particular image before (Figure 8 (c)).

MEM was the most useful component when added to an LSTM, but made no difference when added alone to an FF controller. However, adding a stack of MEM plus either CPC or REC provided a substantial performance boost for both FF and LSTM.

Change Detection: in this task, the agent sees two images separated by a delay and has to correctly indicate whether the two images are different (Figure 8 (b)).

CPC brought the largest benefit. Interestingly, the addition of MEM to the FF baseline actually hurt performance slightly, and made no difference for LSTM.

What Then Where: this task consists of a ‘what’ and ‘where’ study phase, followed by a test phase where the agent must remember what image was displayed and where it was located (Figure 8 (d)).

This was the trickiest task in the PsychLab family. It was an outlier in the sense that, unlike any other task in the suite, the LSTM baseline beat all other models. The worst additional component was REC, which dragged performance down to below random.

A.2 Spot the Difference (SD)

The tasks were built in Unity, and each episode lasts 120 seconds except for Spot the Difference: Motion which has a 240-second timeout.

Scale: Either corridor delay duration or number of objects in room.

In Spot the Difference Multi-Object, Room 2 has the exact same number of objects as Room 1.

| Scale | SD Basic, Passive and Motion: corridor delay (seconds) | SD Multi-Object: number of objects in Room 1 |
| --- | --- | --- |
| Small | 0 | 2 or 3 |
| Interpolate | 5 | 4 |
| Large | 10 | 5 or 6 |
| Extrapolate | 15 | 7 |

Stimulus: Either color set or motion pattern set.

| Task | SD Basic, Passive and Motion | SD Multi-Object |
| --- | --- | --- |
| Stimulus | Color set | Motion pattern set |
| Training | Red, Green, Blue, White, Slate | Circle, Square, Five-point star, Hexagon, Linear along X-axis, Linear along Y = X diagonal |
| Holdout | Yellow, Brown, Pink, Orange, Purple | No motion, Triangle, Pentagon, Figure-eight, Linear along Y-axis, Linear along Y = -X diagonal |

Spot the Diff: main experimental findings

Every task in this family consists of two rooms connected by a short corridor. There is a set of gates in the middle of the corridor that can trap the agent there for a configurable delay duration.

Basic: In the basic Spot the Difference task, where the agent is not forced to see any of the blocks in Room 1 before it goes to the next room, adding MEM alone to the controller had minimal effect, and using REC with MEM also did not make much difference. Adding CPC to an LSTM helped performance, but the combination of MEM + CPC provided the biggest gain and was synergistic.

Passive: In this task the agent is guaranteed to see the two blocks in the first room before it enters the second room. Adding MEM alone to the controller made the biggest positive difference, which makes sense since it would hypothetically allow the agent to solve the task by remembering a single snapshot. CPC helped when added to FF together with MEM, but hurt when added to LSTM alone. REC helped performance when added to FF + MEM, but not as much as CPC did in that case, and actually hurt performance when added to LSTM + MEM.

Motion: No model did well on the train or holdout sets, and learning curves took longer to take off in general. This is likely due to the highly challenging nature of the task, which requires the agent to memorize 3D motion patterns traced out over some time period by multiple objects and then compare motion patterns against each other. Results could potentially be improved by hyperparameter tuning or further improvements to the agent architecture.

Multi-object: This was the hardest task in the family, and no model did well here either. This could be due to there being a variable number of objects in each room, rather than always exactly two objects per room. When added by itself to a controller, MEM either had no effect or hurt performance. The combined synergistic stack of MEM + CPC was the most useful addition on this task when the working memory was an LSTM. That said, no models fared well on Holdout-Interpolate and Holdout-Extrapolate for this task.

A.3 Navigate to Goal

These tasks are in Unity and have an episode timeout of 200 seconds, except Visible Goal Procedural Maze which is a modification of DMLab’s Explore Goal Locations task and has episodes lasting 120 seconds each.

Scale: Size of square arena, in terms of in-game metric units.

| Scale | Visible Goal Procedural Maze: arena size | All but Visible Goal Procedural Maze: arena size |
| --- | --- | --- |
| Small | 11 × 11 | 10 × 10 |
| Interpolate | 15 × 15 | 15 × 15 |
| Large | 21 × 21 | 20 × 20 |
| Extrapolate | 27 × 27 | 25 × 25 |

Stimulus: Goal spawn region.

| Stimulus | Visible Goal Procedural Maze: goal spawn region | All but Visible Goal Procedural Maze: goal spawn region |
| --- | --- | --- |
| Training | North half | Northwest and southeast quadrants |
| Holdout | South half | The other two quadrants |

Navigate to Goal: main experimental findings

Using an auxiliary unsupervised reconstruction loss to learn high-quality representations turned out to be the most useful component for this task family.

We also observed that in successful models such as LSTM + MEM + CPC, which is the MRA architecture, the agent is able to do better than simply memorizing a route to the invisible goal. Rather, it learns the location of the goal, and the time it takes to reach the goal location grows shorter every time it respawns within an episode (see example trajectory in Fig 9(a) and time-to-goal plot in Fig 9(b)).

Visible Goal Procedural Maze: Using REC with LSTM + MEM performed the best here, and FF + MEM + REC was the next best. The MEM + CPC stack was a distant runner-up compared with the MEM + REC stack for both controllers.

Visible Goal With Buildings: As in the other Visible Goal task, LSTM + MEM + REC was the most successful model. MEM was slightly more helpful than CPC when used in conjunction with an LSTM (we did not have bandwidth to run the FF + CPC ablation). MEM + CPC also had a synergistic effect when stacked with an LSTM.

Invisible Goal With Buildings: Adding MEM + REC was the most useful, for both FF and LSTM.

Invisible Goal Empty Arena: This task can be expected to be the most difficult in the family due to the relative sparsity of visual spatial cues. Adding MEM alone to a controller always helped slightly. REC helped more than CPC did when used with an FF controller, but for an LSTM controller CPC had a slight edge.

(a) Routes taken by MRA agent in one episode
(b) Timesteps taken to reach goal
Figure 9: Trajectories and time-to-goal for Invisible Goal with Buildings. In (a), our MRA (LSTM + MEM + CPC) agent learns to take increasingly shorter routes to the goal. Note: The end-points of each trial trajectory appear to be in slightly different locations. This is because the goal is on a map tile rather than a single coordinate, and also due to manual adjustments we made to account for the agent avatar in Unity continuing to move for a small number of frames immediately after reaching the goal but before it is respawned. In (b), the number of time-steps taken per trial is plotted for Train, Holdout-Interpolate, Holdout-Extrapolate, along with standard error bars. Note: Some points at the rightmost end of each curve will have no error bar if there was only one data point.

A.4 Transitive Inference

The task was built in Unity and has an episode timeout of 200 seconds.

Scale: Number of objects in transitive chain. Stimulus: Color set.

| Scale | Transitive chain length |
| --- | --- |
| Small | 5 |
| Interpolate | 6 |
| Large | 7 |
| Extrapolate | 8 |

| Stimulus | Color set |
| --- | --- |
| Training | Red, Green, Blue, White, Black, Pink, Orange, Purple, Grey, Tan |
| Holdout | Slate, Yellow, Brown, Lime, Magenta, Mint, Navy, Olive, Teal, Turquoise |

Transitive Inference: main experimental findings

Transitive inference is a form of reasoning where one infers a relation between items that have not been directly compared to each other. In humans, performance on probe pairs and anchor pairs with symbolic distance of greater than one excluding anchor objects tends to correlate with awareness of the implied hierarchy [Smith and Squire, 2005].

As an illustrative example: Given a ‘transitive chain’ of five objects A, B, C, D, E where we assume A is the lowest-valued object and E the highest, we begin with a demonstration phase in which we present the agent with pairs of adjacent objects <A, B>, <B, C>, <C, D>, <D, E>.

In this demo phase we scramble the order in which the pairs are presented and also scramble the objects in the pair such that an agent may see <D, C> followed by <A, B>, etc. The pairs are presented one at a time, and the agent needs to correctly identify the higher-valued object in the current pair in order to proceed to seeing the next pair.

Once the demo phase is completed, we show the agent a single, possibly-scrambled challenge pair. This challenge pair always consists of the object second from the left and the object second from the right in the transitive chain, in this case <B, D>. The agent’s task is again to go to the higher-valued object.
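The demonstration and challenge phases described above can be sketched in a few lines of Python. This is only an illustration of the pair-construction logic, not the Unity task code: the function names, the use of a seeded `random.Random`, and the minimum chain length check (mirroring the smallest chain in the suite) are assumptions made for the example.

```python
import random


def make_transitive_episode(chain, rng):
    """Build scrambled demo pairs and one challenge pair.

    chain: objects ordered from lowest-valued to highest-valued.
    rng:   a random.Random instance (hypothetical helper argument).
    """
    if len(chain) < 5:
        # Assumption: the shortest chain in the task suite has five objects.
        raise ValueError("chain must contain at least five objects")

    # Demonstration phase: every pair of adjacent objects in the chain.
    demo_pairs = [(chain[i], chain[i + 1]) for i in range(len(chain) - 1)]
    rng.shuffle(demo_pairs)  # scramble the presentation order
    # Scramble the left/right placement within each pair.
    demo_pairs = [p if rng.random() < 0.5 else p[::-1] for p in demo_pairs]

    # Challenge pair: second object from the left and second from the right.
    challenge = (chain[1], chain[-2])
    if rng.random() < 0.5:
        challenge = challenge[::-1]
    return demo_pairs, challenge


def higher_valued(pair, chain):
    """Return the object in `pair` that ranks higher in `chain`."""
    rank = {obj: i for i, obj in enumerate(chain)}
    return max(pair, key=rank.__getitem__)


chain = ["A", "B", "C", "D", "E"]
demo, challenge = make_transitive_episode(chain, random.Random(0))
print(demo, challenge, higher_valued(challenge, chain))
```

For the five-object chain, the challenge pair is always some ordering of B and D, and the correct choice is D, even though B and D are never shown together in the demonstration phase.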

In our results, we found that stacking MEM with an auxiliary loss was crucial. For an FF controller CPC was more useful than REC, but for LSTM it was the other way round. Also, although LSTM + MEM + CPC and LSTM + MEM + REC achieved normalized scores that were not too far apart, LSTM + MEM + REC was more data-efficient and took off earlier than LSTM + MEM + CPC. We observed a synergistic effect when combining MEM with CPC for an LSTM, but that was still outdone by using MEM + REC.

A.5 Jumpy Backpropagation (JB) ablation

We studied the impact of Jumpy Backpropagation (JB), as described in Section 3. Fig 10 shows the set of tasks where adding JB improves performance both at training time and on the holdout test levels. Figures 11 and 12 show the performance on the remaining levels from the Memory Task Suite, where having the JB feature did not hurt performance. We conclude that JB is an important component of the MRA architecture.

Figure 10: Comparison between MRA (LSTM + MEM + CPC) and its version without the jumpy backpropagation feature on MEM: LSTM + MEM (no JB) + CPC. Here we show the tasks where JB yields improvements on performance both at training time and on the holdout test levels. The dotted lines indicate human baseline scores for each task.
Figure 11: [1/2] Comparison between MRA and its version without the jumpy backpropagation (JB) feature. Here we show the tasks where JB makes little difference on performance. The dotted lines indicate human baseline scores for each task.
Figure 12: [2/2] Comparison between MRA and its version without the jumpy backpropagation (JB) feature. Here we show the tasks where JB makes little difference on performance. The dotted lines indicate human baseline scores for each task.

A.6 Agent Performance Curves

In this section we show training and test curves for all models in all tasks. The dotted lines indicate human baseline scores for each task.

Figure 13: Training and test curves for all models in all tasks. Dotted lines indicate human baseline scores for each task.
Figure 14: Heatmap of ablations per task including standard errors. Tasks are sorted by normalized score across models during training, such that the task with the highest mean scores in training is in the leftmost column, and the model that had the highest mean scores in training is at the top row.

Appendix B Human-Normalized Scores and Episode Rewards

We used one action set across all PsychLab tasks, and another across the 3D tasks.

In PsychLab we used a set of five actions: look left, look right, look up, look down, do nothing.

For the rest, we used a set of eight actions: move forward, move backward, strafe left, strafe right, look left, look right, look left while moving forward, look right while moving forward.

In Figure 13 we show the training and test curves for each of our ablation models on all tasks. The curves in bold correspond to the median score across three random seeds, and the corresponding confidence intervals are shown in lighter shades.

B.1 Human-Normalized Score Computation

We computed the Human-Normalized Scores used in our heatmap via the following procedure. In our reported results we used three seeds, and took a rolling average as described below.

  1. For each seed, apply smoothing in the form of an exponentially weighted moving average³.

  2. For each seed, take a further rolling average of the episode reward, over a window of 10.

  3. Among these rolling reward windows, find the highest window value over the course of training. The mean over the seeds corresponds to $R_{\text{train}}$.

  4. For each seed, find the time-step that corresponds to $R_{\text{train}}$, to use as a snapshot point for comparison against the holdout levels.

  5. At this snapshot point, record the seed-averaged rolling episode reward for the two holdout levels, $R_{\text{holdout-interpolate}}$ and $R_{\text{holdout-extrapolate}}$.

  6. Obtain the episode reward of a random agent, $R_{\text{random}}$, and the episode reward achieved by a human, $R_{\text{human}}$.

  7. For Train, Holdout-Interpolate, and Holdout-Extrapolate, compute the human-normalized score, with corresponding standard error:

$$\text{Score}_{\text{level}} = 100 \times \frac{R_{\text{level}} - R_{\text{random}}}{R_{\text{human}} - R_{\text{random}}}$$
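The procedure above can be sketched in NumPy as follows. This is a minimal illustration rather than the evaluation code used for the reported results: the function names, the initialization of the moving average at the first reward, and the way the standard error is taken over per-seed scores are all assumptions, and the normalization uses the usual human-normalized form (agent minus random, divided by human minus random, in percentage points).

```python
import numpy as np


def ewma(x, alpha):
    """Exponentially weighted moving average, initialized at x[0] (assumption)."""
    out = np.empty(len(x), dtype=float)
    acc = float(x[0])
    for i, v in enumerate(x):
        acc = alpha * float(v) + (1.0 - alpha) * acc
        out[i] = acc
    return out


def rolling_mean(x, window=10):
    """Mean over every full window of `window` consecutive values."""
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="valid")


def human_normalized_score(reward, random_reward, human_reward):
    """Map random-agent reward to 0 and human reward to 100."""
    return 100.0 * (reward - random_reward) / (human_reward - random_reward)


def train_score(seed_curves, random_reward, human_reward, alpha=0.05, window=10):
    """Seed-averaged peak training score and its standard error.

    seed_curves: one sequence of episode rewards per seed (hypothetical input).
    """
    peaks = []
    for curve in seed_curves:
        smoothed = rolling_mean(ewma(np.asarray(curve, dtype=float), alpha), window)
        peaks.append(smoothed.max())  # highest rolling window for this seed
    scores = human_normalized_score(np.asarray(peaks), random_reward, human_reward)
    sem = scores.std(ddof=1) / np.sqrt(len(scores))  # assumption: SEM over seeds
    return scores.mean(), sem


# An episode reward of 50.00 with random reward 0.06 and human reward 50.00
# normalizes to 100 percentage points.
print(human_normalized_score(50.00, 0.06, 50.00))
```

The holdout scores would be obtained the same way, except that the reward is read off at the training snapshot point instead of at the peak of the holdout curve.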


Results are shown ranked (best at top) in Figure 3.

Average Human-Normalized Score (percentage points)

| Model | Train | Holdout-Interpolate | Holdout-Extrapolate |
| --- | --- | --- | --- |
| MRA: LSTM + MEM + CPC | 92.9 ± 3.9 | 56.2 ± 5.8 | 52.6 ± 6.5 |
| LSTM + MEM + REC | 82.2 ± 5.2 | 54.2 ± 2.3 | 51.4 ± 4.9 |
| LSTM + MEM | 78.7 ± 5.8 | 50.0 ± 3.1 | 45.8 ± 4.5 |
| LSTM + CPC | 77.6 ± 4.6 | 42.7 ± 2.8 | 37.7 ± 5.3 |
| FF + MEM + REC | 63.1 ± 9.6 | 45.4 ± 3.7 | 45.4 ± 14.2 |
| FF + MEM + CPC | 62.6 ± 5.9 | 45.4 ± 7.3 | 41.3 ± 4.0 |
| LSTM | 73.0 ± 6.9 | 40.2 ± 4.3 | 35.6 ± 5.9 |
| FF + MEM | 42.3 ± 5.8 | 27.8 ± 6.7 | 27.0 ± 6.6 |
| FF | 33.9 ± 3.3 | 23.0 ± 3.5 | 19.7 ± 4.2 |

Table 3: Ranking of ablation models, sorted by overall task-averaged human-normalized score.

B.2 Episode Rewards

Absolute episode rewards per task per level, obtained by the trained agents as well as by a random agent and a human, with standard errors⁴. See Tables 4 to 16.

| Model | Train | Holdout-Interpolate | Holdout-Extrapolate |
| --- | --- | --- | --- |
| FF | 25.90 ± 0.32 | 19.37 ± 0.43 | 35.74 ± 0.37 |
| FF + MEM | 43.14 ± 6.12 | 33.34 ± 5.92 | 60.16 ± 13.56 |
| FF + MEM + CPC | 49.98 ± 0.00 | 38.86 ± 0.32 | 73.66 ± 0.63 |
| FF + MEM + REC | 50.00 ± 0.00 | 39.76 ± 0.14 | 73.17 ± 0.46 |
| LSTM | 33.35 ± 0.24 | 27.20 ± 0.25 | 32.34 ± 1.20 |
| LSTM + CPC | 30.75 ± 0.21 | 25.64 ± 0.26 | 35.50 ± 1.09 |
| LSTM + MEM | 50.00 ± 0.00 | 39.99 ± 0.01 | 72.32 ± 0.98 |
| LSTM + MEM + REC | 50.00 ± 0.00 | 39.63 ± 0.36 | 73.91 ± 0.28 |
| MRA: LSTM + MEM + CPC | 50.00 ± 0.00 | 39.99 ± 0.00 | 74.32 ± 0.13 |
| Random | 0.06 ± 0.00 | 0.06 ± 0.00 | 0.06 ± 0.00 |
| Human | 50.00 ± 0.00 | 40.00 ± 0.00 | 75.00 ± 0.00 |

Table 4: Episode reward: PsychLab - AVM

| Model | Train | Holdout-Interpolate | Holdout-Extrapolate |
| --- | --- | --- | --- |
| FF | 26.90 ± 0.28 | 20.62 ± 0.28 | 38.45 ± 0.34 |
| FF + MEM | 26.51 ± 0.06 | 20.54 ± 0.49 | 37.96 ± 0.51 |
| FF + MEM + CPC | 49.60 ± 0.01 | 39.51 ± 0.15 | 71.40 ± 0.15 |
| FF + MEM + REC | 49.78 ± 0.08 | 39.90 ± 0.03 | 65.57 ± 0.56 |
| LSTM | 27.11 ± 0.29 | 20.92 ± 0.11 | 37.28 ± 0.18 |
| LSTM + CPC | 26.25 ± 0.26 | 20.11 ± 0.55 | 37.46 ± 0.36 |
| LSTM + MEM | 42.18 ± 6.93 | 39.68 ± 0.06 | 56.59 ± 9.84 |
| LSTM + MEM + REC | 49.78 ± 0.08 | 39.90 ± 0.03 | 65.57 ± 0.56 |
| MRA: LSTM + MEM + CPC | 49.92 ± 0.03 | 39.83 ± 0.00 | 72.52 ± 0.25 |
| Random | 0.04 ± 0.00 | 0.05 ± 0.00 | 0.05 ± 0.00 |
| Human | 49.40 ± 0.24 | 39.40 ± 0.40 | 74.20 ± 0.58 |

Table 5: Episode reward: PsychLab - Continuous Recognition

| Model | Train | Holdout-Interpolate | Holdout-Extrapolate |
| --- | --- | --- | --- |
| FF | 26.40 ± 0.08 | 24.17 ± 0.10 | 24.73 ± 0.48 |
| FF + MEM | 25.76 ± 0.16 | 24.95 ± 0.12 | 24.97 ± 0.36 |
| FF + MEM + CPC | 44.76 ± 0.05 | 36.07 ± 0.46 | 36.95 ± 0.40 |
| FF + MEM + REC | 25.82 ± 0.22 | 23.99 ± 0.29 | 24.89 ± 0.23 |
| LSTM | 26.39 ± 0.21 | 25.24 ± 0.43 | 25.37 ± 0.22 |
| LSTM + CPC | 48.37 ± 0.39 | 41.43 ± 0.12 | 42.72 ± 0.11 |
| LSTM + MEM | 26.21 ± 0.10 | 24.77 ± 0.28 | 24.63 ± 0.52 |
| LSTM + MEM + REC | 39.12 ± 5.88 | 42.12 ± 1.97 | 37.31 ± 6.38 |
| MRA: LSTM + MEM + CPC | 49.14 ± 0.24 | 42.24 ± 0.07 | 43.00 ± 0.39 |
| Random | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 |
| Human | 47.60 ± 0.40 | 48.80 ± 0.58 | 46.80 ± 1.07 |

Table 6: Episode reward: PsychLab - Change Detection

| Model | Train | Holdout-Interpolate | Holdout-Extrapolate |
| --- | --- | --- | --- |
| FF | 12.71 ± 0.06 | 12.19 ± 0.06 | 8.39 ± 0.16 |
| FF + MEM | 12.11 ± 0.14 | 12.05 ± 0.12 | 8.34 ± 0.11 |
| FF + MEM + CPC | 12.92 ± 0.32 | 12.21 ± 0.20 | 7.73 ± 0.29 |
| FF + MEM + REC | 6.54 ± 0.50 | 6.10 ± 0.35 | 6.30 ± 0.10 |
| LSTM | 37.18 ± 0.14 | 25.06 ± 0.34 | 17.51 ± 0.38 |
| LSTM + CPC | 24.21 ± 6.04 | 20.68 ± 3.89 | 12.99 ± 2.79 |
| LSTM + MEM | 26.19 ± 6.23 | 23.74 ± 2.65 | 14.72 ± 1.22 |
| LSTM + MEM + REC | 2.96 ± 0.04 | 1.71 ± 0.27 | 2.34 ± 0.22 |
| MRA: LSTM + MEM + CPC | 24.22 ± 5.45 | 23.10 ± 1.82 | 15.54 ± 1.39 |
| Random | 0.02 ± 0.00 | 0.03 ± 0.00 | 0.01 ± 0.00 |
| Human | 50.00 ± 0.00 | 50.00 ± 0.00 | 49.60 ± 0.24 |

Table 7: Episode reward: PsychLab - What Then Where
| Model | Train | Holdout-Interpolate | Holdout-Extrapolate |
| --- | --- | --- | --- |
| FF | 0.46 ± 0.00 | 0.43 ± 0.00 | 0.43 ± 0.01 |
| FF + MEM | 0.46 ± 0.01 | 0.45 ± 0.01 | 0.44 ± 0.00 |
| FF + MEM + CPC | 0.93 ± 0.02 | 0.71 ± 0.04 | 0.69 ± 0.05 |
| FF + MEM + REC | 0.44 ± 0.01 | 0.45 ± 0.02 | 0.45 ± 0.00 |
| LSTM | 0.46 ± 0.01 | 0.45 ± 0.01 | 0.46 ± 0.01 |
| LSTM + CPC | 0.54 ± 0.03 | 0.48 ± 0.03 | 0.49 ± 0.01 |
| LSTM + MEM | 0.47 ± 0.00 | 0.45 ± 0.01 | 0.45 ± 0.00 |
| LSTM + MEM + REC | 0.46 ± 0.01 | 0.44 ± 0.01 | 0.45 ± 0.00 |
| MRA: LSTM + MEM + CPC | 0.90 ± 0.07 | 0.81 ± 0.00 | 0.78 ± 0.00 |
| Random | 0.05 ± 0.00 | 0.04 ± 0.00 | 0.04 ± 0.00 |
| Human | 1.00 ± 0.00 | 1.00 ± 0.00 | 1.00 ± 0.00 |

Table 8: Episode reward: Spot Diff Basic

| Model | Train | Holdout-Interpolate | Holdout-Extrapolate |
| --- | --- | --- | --- |
| FF | 0.22 ± 0.09 | 0.23 ± 0.00 | 0.14 ± 0.11 |
| FF + MEM | 0.54 ± 0.05 | 0.46 ± 0.00 | 0.49 ± 0.04 |
| FF + MEM + CPC | 0.80 ± 0.01 | 0.66 ± 0.03 | 0.68 ± 0.01 |
| FF + MEM + REC | 0.45 ± 0.00 | 0.45 ± 0.00 | 0.45 ± 0.01 |
| LSTM | 0.95 ± 0.00 | 0.85 ± 0.01 | 0.84 ± 0.00 |
| LSTM + CPC | 0.91 ± 0.01 | 0.77 ± 0.00 | 0.75 ± 0.00 |
| LSTM + MEM | 0.97 ± 0.01 | 0.78 ± 0.04 | 0.83 ± 0.03 |
| LSTM + MEM + REC | 0.74 ± 0.12 | 0.54 ± 0.06 | 0.52 ± 0.09 |
| MRA: LSTM + MEM + CPC | 0.96 ± 0.01 | 0.82 ± 0.01 | 0.78 ± 0.01 |
| Random | 0.03 ± 0.00 | 0.03 ± 0.00 | 0.02 ± 0.00 |
| Human | 1.00 ± 0.00 | 1.00 ± 0.00 | 1.00 ± 0.00 |

Table 9: Episode reward: Spot Diff Passive

| Model | Train | Holdout-Interpolate | Holdout-Extrapolate |
| --- | --- | --- | --- |
| FF | 0.02 ± 0.01 | 0.01 ± 0.00 | 0.00 ± 0.00 |
| FF + MEM | 0.01 ± 0.00 | 0.01 ± 0.00 | 0.00 ± 0.00 |
| FF + MEM + CPC | 0.12 ± 0.08 | 0.11 ± 0.06 | 0.04 ± 0.02 |
| FF + MEM + REC | 0.18 ± 0.02 | 0.17 ± 0.01 | 0.01 ± 0.01 |
| LSTM | 0.52 ± 0.20 | 0.14 ± 0.07 | 0.05 ± 0.02 |
| LSTM + CPC | 0.58 ± 0.03 | 0.24 ± 0.01 | 0.09 ± 0.00 |
| LSTM + MEM | 0.39 ± 0.05 | 0.14 ± 0.07 | 0.05 ± 0.03 |
| LSTM + MEM + REC | 0.18 ± 0.04 | 0.15 ± 0.02 | 0.07 ± 0.00 |
| MRA: LSTM + MEM + CPC | 0.69 ± 0.02 | 0.27 ± 0.00 | 0.10 ± 0.00 |
| Random | 0.01 ± 0.00 | 0.01 ± 0.00 | 0.00 ± 0.00 |
| Human | 1.00 ± 0.00 | 1.00 ± 0.00 | 1.00 ± 0.00 |

Table 10: Episode reward: Spot Diff Multi-Object

| Model | Train | Holdout-Interpolate | Holdout-Extrapolate |
| --- | --- | --- | --- |
| FF | 0.12 ± 0.04 | 0.00 ± 0.01 | 0.00 ± 0.00 |
| FF + MEM | 0.08 ± 0.05 | 0.07 ± 0.05 | 0.08 ± 0.06 |
| FF + MEM + CPC | 0.13 ± 0.09 | 0.24 ± 0.10 | 0.23 ± 0.11 |
| FF + MEM + REC | 0.46 ± 0.01 | 0.42 ± 0.01 | 0.43 ± 0.00 |
| LSTM | 0.45 ± 0.00 | 0.44 ± 0.01 | 0.43 ± 0.01 |
| LSTM + CPC | 0.46 ± 0.00 | 0.43 ± 0.00 | 0.44 ± 0.01 |
| LSTM + MEM | 0.45 ± 0.01 | 0.46 ± 0.00 | 0.45 ± 0.00 |
| LSTM + MEM + REC | 0.44 ± 0.02 | 0.44 ± 0.01 | 0.42 ± 0.01 |
| MRA: LSTM + MEM + CPC | 0.47 ± 0.01 | 0.45 ± 0.00 | 0.46 ± 0.01 |
| Random | 0.02 ± 0.00 | 0.02 ± 0.00 | 0.02 ± 0.00 |
| Human | 1.00 ± 0.00 | 1.00 ± 0.00 | 1.00 ± 0.00 |

Table 11: Episode reward: Spot Diff Motion
| Model | Train | Holdout-Interpolate | Holdout-Extrapolate |
| --- | --- | --- | --- |
| FF | 12.27 ± 0.83 | 3.74 ± 1.70 | 0.14 ± 0.14 |
| FF + MEM | 13.58 ± 1.45 | 1.52 ± 0.28 | 0.47 ± 0.02 |
| FF + MEM + CPC | 31.87 ± 0.25 | 22.99 ± 0.70 | 9.83 ± 0.40 |
| FF + MEM + REC | 31.01 ± 0.77 | 23.42 ± 0.33 | 9.84 ± 1.09 |
| LSTM | 28.72 ± 0.34 | 11.37 ± 0.00 | 2.66 ± 0.00 |
| LSTM + CPC | 29.46 ± 0.23 | 11.52 ± 0.08 | 2.52 ± 0.02 |
| LSTM + MEM | 30.92 ± 0.28 | 11.74 ± 0.22 | 3.29 ± 0.10 |
| LSTM + MEM + REC | 35.95 ± 0.28 | 25.16 ± 0.05 | 12.54 ± 0.34 |
| MRA: LSTM + MEM + CPC | 32.45 ± 0.39 | 12.66 ± 0.06 | 3.38 ± 0.05 |
| Random | 1.08 ± 0.02 | 0.58 ± 0.01 | 0.22 ± 0.01 |
| Human | 23.60 ± 1.69 | 23.50 ± 0.75 | 14.30 ± 0.60 |

Table 12: Episode reward: Visible Goal With Buildings

| Model | Train | Holdout-Interpolate | Holdout-Extrapolate |
| --- | --- | --- | --- |
| FF | 9.30 ± 0.33 | 1.80 ± 0.01 | 0.28 ± 0.02 |
| FF + MEM | 9.95 ± 0.65 | 1.54 ± 0.02 | 0.48 ± 0.00 |
| FF + MEM + CPC | 10.65 ± 0.28 | 1.52 ± 0.10 | 0.31 ± 0.01 |
| FF + MEM + REC | 12.29 ± 0.24 | 0.70 ± 0.05 | 0.20 ± 0.00 |
| LSTM | 27.22 ± 0.48 | 2.01 ± 0.19 | 0.79 ± 0.01 |
| LSTM + CPC | 28.46 ± 0.63 | 2.25 ± 0.01 | 0.86 ± 0.01 |
| LSTM + MEM | 30.15 ± 0.04 | 3.10 ± 0.08 | 1.17 ± 0.08 |
| LSTM + MEM + REC | 32.10 ± 0.05 | 2.75 ± 0.24 | 1.17 ± 0.09 |
| MRA: LSTM + MEM + CPC | 30.51 ± 0.21 | 2.39 ± 0.09 | 1.09 ± 0.03 |
| Random | 0.95 ± 0.02 | 0.53 ± 0.01 | 0.20 ± 0.01 |
| Human | 17.37 ± 1.91 | 12.40 ± 1.45 | 4.90 ± 0.71 |

Table 13: Episode reward: Invisible Goal With Buildings

| Model | Train | Holdout-Interpolate | Holdout-Extrapolate |
| --- | --- | --- | --- |
| FF | 1.78 ± 0.11 | 0.38 ± 0.04 | 0.05 ± 0.00 |
| FF + MEM | 2.21 ± 0.05 | 0.28 ± 0.02 | 0.07 ± 0.01 |
| FF + MEM + CPC | 2.37 ± 0.07 | 0.22 ± 0.02 | 0.05 ± 0.00 |
| FF + MEM + REC | 3.25 ± 0.25 | 0.27 ± 0.02 | 0.06 ± 0.01 |
| LSTM | 7.60 ± 0.14 | 0.14 ± 0.01 | 0.05 ± 0.01 |
| LSTM + CPC | 10.48 ± 0.35 | 0.12 ± 0.02 | 0.03 ± 0.01 |
| LSTM + MEM | 10.32 ± 0.12 | 0.19 ± 0.02 | 0.03 ± 0.01 |
| LSTM + MEM + REC | 12.40 ± 0.08 | 0.08 ± 0.01 | 0.04 ± 0.00 |
| MRA: LSTM + MEM + CPC | 13.04 ± 0.60 | 0.23 ± 0.04 | 0.07 ± 0.01 |
| Random | 0.15 ± 0.01 | 0.15 ± 0.01 | 0.03 ± 0.00 |
| Human | 4.90 ± 1.32 | 1.70 ± 0.67 | 0.30 ± 0.30 |

Table 14: Episode reward: Invisible Goal Empty Arena
| Model | Train | Holdout-Interpolate | Holdout-Extrapolate |
| --- | --- | --- | --- |
| FF | 174.63 ± 4.27 | 43.55 ± 3.08 | 11.93 ± 2.17 |
| FF + MEM | 224.53 ± 11.31 | 37.80 ± 2.28 | 8.52 ± 0.87 |
| FF + MEM + CPC | 272.99 ± 5.31 | 33.38 ± 1.51 | 9.99 ± 1.10 |
| FF + MEM + REC | 607.48 ± 36.98 | 59.64 ± 9.41 | 43.07 ± 19.35 |
| LSTM | 463.43 ± 12.84 | 27.09 ± 1.69 | 6.52 ± 1.86 |
| LSTM + CPC | 473.42 ± 3.84 | 19.72 ± 0.98 | 4.90 ± 0.36 |
| LSTM + MEM | 523.14 ± 10.52 | 22.39 ± 3.65 | 7.83 ± 1.25 |
| LSTM + MEM + REC | 655.10 ± 11.65 | 49.53 ± 24.78 | 57.21 ± 23.92 |
| MRA: LSTM + MEM + CPC | 546.08 ± 2.26 | 40.64 ± 0.00 | 14.21 ± 0.00 |
| Random | Small: 7.79 ± 0.14; Large: 1.97 ± 0.06 | 3.89 ± 0.10 | 1.19 ± 0.05 |
| Human | Small: 364.00 ± 43.20; Large: 104.00 ± 23.58 | 198.00 ± 24.98 | 86.00 ± 20.15 |

Table 15: Episode reward: Visible Goal Procedural Maze

| Model | Train | Holdout-Interpolate | Holdout-Extrapolate |
| --- | --- | --- | --- |
| FF | 3.71 ± 0.07 | 3.52 ± 0.06 | 3.73 ± 0.04 |
| FF + MEM | 4.34 ± 0.51 | 4.09 ± 0.48 | 5.76 ± 0.18 |
| FF + MEM + CPC | 4.64 ± 0.76 | 5.16 ± 0.01 | 5.62 ± 0.41 |
| FF + MEM + REC | 0.47 ± 0.22 | 0.30 ± 0.01 | 0.69 ± 0.12 |
| LSTM | 8.67 ± 0.55 | 5.31 ± 0.10 | 7.32 ± 0.38 |
| LSTM + CPC | 9.65 ± 0.53 | 5.59 ± 0.02 | 7.77 ± 0.08 |
| LSTM + MEM | 8.98 ± 0.42 | 6.76 ± 0.75 | 8.84 ± 0.41 |
| LSTM + MEM + REC | 10.86 ± 0.02 | 8.88 ± 0.11 | 9.80 ± 0.16 |
| MRA: LSTM + MEM + CPC | 10.34 ± 0.10 | 7.21 ± 0.19 | 9.81 ± 0.09 |
| Random | Small: 1.44 ± 0.02; Large: 1.44 ± 0.02 | 1.42 ± 0.02 | 1.43 ± 0.02 |
| Human | Small: 5.40 ± 2.20; Large: 6.60 ± 2.69 | 6.00 ± 2.45 | 7.20 ± 2.94 |

Table 16: Episode reward: Transitive Inference

Appendix C Model

C.1 Importance Weighted Actor-Learner Architecture

We use the Importance Weighted Actor-Learner Architecture (IMPALA) [Espeholt et al., 2018] in our work. IMPALA uses an off-policy actor-critic approach where decoupled actors communicate experience to a learner. The actor-to-learner relationship is many-to-one. Each actor generates a batched trajectory, or episode, of experience and sends the state-action-reward traces ($x_t, a_t, r_t$) to its learner. The learner gathers trajectories from each actor and continuously computes gradients to update the model parameters. As actors finish processing a trajectory, they receive parameter updates from the learner and then continue to generate trajectories.

Under this scheme the actor and learner policies fall out of sync between parameter updates. The actor’s behaviour policy, $\mu$, is said to have policy lag with respect to the target policy of the learner, $\pi$. To correct for this effect, importance-weighted V-trace targets are computed for each step:

$$v_t = V(x_t) + \sum_{s=t}^{t+n-1} \gamma^{s-t} \Big( \prod_{i=t}^{s-1} c_i \Big) \, \rho_s \big( r_s + \gamma V(x_{s+1}) - V(x_s) \big)$$

where $\gamma$ is a discount factor, $x_t$ and $r_t$ are the state and reward at time-step $t$, and $\rho_s = \min\big(\bar{\rho}, \pi(a_s \mid x_s) / \mu(a_s \mid x_s)\big)$ and $c_i = \min\big(\bar{c}, \pi(a_i \mid x_i) / \mu(a_i \mid x_i)\big)$ are truncated importance sampling weights. These V-trace targets are used to compute gradients for the policy approximation in the learner. This enables observations and parameters to each flow in a single direction, allowing for high data efficiency and resource allocation in comparison to other approaches, such as asynchronous advantage actor-critic (A3C) [Mnih et al., 2016].
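As a rough illustration of how V-trace targets can be computed for a single trajectory, the NumPy sketch below follows the formulation in the IMPALA paper [Espeholt et al., 2018]. It is not the agent's training code: the function name, argument names, and default truncation levels of 1.0 are assumptions made for this example.

```python
import numpy as np


def vtrace_targets(behaviour_logp, target_logp, rewards, values,
                   bootstrap_value, gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Return V-trace targets for one trajectory of length n.

    behaviour_logp: log-probabilities of the taken actions under the actor policy.
    target_logp:    log-probabilities of the same actions under the learner policy.
    rewards:        reward at each step, shape (n,).
    values:         learner value estimates at each step, shape (n,).
    bootstrap_value: value estimate for the state after the last step.
    """
    behaviour_logp = np.asarray(behaviour_logp, dtype=float)
    target_logp = np.asarray(target_logp, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)

    ratios = np.exp(target_logp - behaviour_logp)  # importance ratios
    rhos = np.minimum(rho_bar, ratios)             # truncated weights on the TD error
    cs = np.minimum(c_bar, ratios)                 # truncated trace-cutting weights

    next_values = np.append(values[1:], bootstrap_value)
    deltas = rhos * (rewards + gamma * next_values - values)

    # Backward recursion: each correction is the current weighted TD error
    # plus the discounted, trace-weighted correction of the following step.
    corrections = np.zeros_like(values)
    acc = 0.0
    for t in reversed(range(len(rewards))):
        acc = deltas[t] + gamma * cs[t] * acc
        corrections[t] = acc
    return values + corrections


# On-policy check: with equal policies the targets reduce to n-step returns.
print(vtrace_targets([0, 0, 0], [0, 0, 0], [1, 1, 1], [0, 0, 0], 0.0, gamma=0.5))
```

When the actor and learner policies coincide, all ratios equal one and the targets are ordinary discounted n-step returns; the truncation only matters once the policies have drifted apart.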

C.2 Residual Network Architecture

To process the pixel input, the Memory Recall Agent and the other baselines reported in this work use a residual network [He et al., 2015] with an architecture similar to the one in [Espeholt et al., 2018]. This consists of three convolutional blocks with feature map counts of 16, 32, and 32; each block has a convolutional layer with kernel size 3x3, followed by max pooling with kernel size 3x3 and stride 2x2, followed by two residual subblocks. The output from the top residual block is followed by a 256-unit MLP to generate latent representations to be passed to the working memory and query network.

C.3 Contrastive Predictive Coding

We use the encoder already present in the agent’s architecture, the convolutional neural network that takes the input frame ($x_t$) and converts it to the embedded visual input $e_t$. The auto-regressive component is the working memory itself, which takes $e_t$ as input and outputs $h_t$, which can be used to predict future steps in latent space: $e_{t+k}$ for $k = 1, \dots, N$, where $N$ is the number of CPC steps. Figure 4(b) illustrates the CPC approach (van den Oord et al. [2018]).

To introduce a noise-contrastive loss, we maximize the mutual information (Eq. 8) between the target encoded representations, $e_{t+k}$, and the contexts, which in our case are the memory states $h_t$. For each sample, a positive real score is then generated via $f_k$, a log-bilinear density function (Eq. 9), by taking the current output from the working memory, $h_t$, and the latent vector of the $(t+k)$-th step, $e_{t+k}$.

$$I(e; h) = \sum_{e, h} p(e, h) \log \frac{p(e \mid h)}{p(e)} \tag{8}$$

$$f_k(e_{t+k}, h_t) = \exp\big( e_{t+k}^{\top} W_k h_t \big) \tag{9}$$

Given a sample trajectory of length $T$ and a fixed maximum number of CPC steps $N$, predictions are computed for each of the $k$-step predictive models ($k = 1, \dots, N$). For timestep $t$ ($1 \le t \le T - k$) and predictive model $k$, let $S_{t,k}$ denote a set of samples from which a contrastive noise estimate is derived. Each set may be split into two subsets, a single positive sample and the negative samples: $S_{t,k} = \{e_{t+k}\} \cup S^{-}_{t,k}$, such that $e_{t+k} \notin S^{-}_{t,k}$. The noise-contrastive loss is then determined by computing the categorical cross-entropy over the sample sets $S_{t,k}$ of the trajectory. Details can be seen in Eq. 10.

$$\mathcal{L}_{\text{CPC}} = -\sum_{k=1}^{N} \sum_{t=1}^{T-k} \log \frac{f_k(e_{t+k}, h_t)}{\sum_{e_j \in S_{t,k}} f_k(e_j, h_t)} \tag{10}$$
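To make the noise-contrastive term concrete, here is a small NumPy sketch of the loss for a single timestep and a single prediction step, following the standard log-bilinear formulation of van den Oord et al. [2018]. It is illustrative only: the function name, argument names and shapes are assumptions, and the real agent batches this over timesteps and prediction steps.

```python
import numpy as np


def info_nce_loss(context, positive, negatives, W):
    """Categorical cross-entropy of picking the positive sample.

    context:   memory/context vector, shape (d_c,).
    positive:  the true future latent, shape (d_z,).
    negatives: distractor latents, shape (n, d_z).
    W:         bilinear weights for this prediction step, shape (d_z, d_c).
    """
    # Row 0 holds the positive sample, the remaining rows the negatives.
    candidates = np.vstack([positive[None, :], negatives])
    logits = candidates @ W @ context       # log of the log-bilinear score
    logits = logits - logits.max()          # stabilize the softmax
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[0]


context = np.array([1.0, 0.0])
negatives = np.array([[0.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
# A positive sample aligned with the context gives a loss well below log(4),
# which is the loss of an uninformative predictor over four candidates.
print(info_nce_loss(context, np.array([5.0, 0.0]), negatives, np.eye(2)))
```

With all-zero weights every candidate scores the same, so the loss equals the log of the number of candidates; training pushes the positive sample's score above those of the negatives.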


C.4 Reconstruction

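Action and reward reconstructions are linear projections, while reconstructions of the image input are generated via a transposed residual network. Sum-of-squared-error losses are used for the prior-step reward and prior-step action, while sigmoid cross-entropy is used for the image reconstruction. Each loss is scaled by its own cost hyper-parameter, and the scaled losses are summed to produce the full reconstruction loss for the model, $\mathcal{L}_{\text{REC}}$. See equations 11 to 14 below for more details ($\sigma$ is the sigmoid function).

$$\mathcal{L}_{\text{image}} = -\sum_{i} \big[ x_{t,i} \log \sigma(\hat{x}_{t,i}) + (1 - x_{t,i}) \log\big(1 - \sigma(\hat{x}_{t,i})\big) \big] \tag{11}$$

$$\mathcal{L}_{\text{reward}} = \big( \hat{r}_{t-1} - r_{t-1} \big)^2 \tag{12}$$

$$\mathcal{L}_{\text{action}} = \lVert \hat{a}_{t-1} - a_{t-1} \rVert_2^2 \tag{13}$$

$$\mathcal{L}_{\text{REC}} = c_{\text{image}} \mathcal{L}_{\text{image}} + c_{\text{reward}} \mathcal{L}_{\text{reward}} + c_{\text{action}} \mathcal{L}_{\text{action}} \tag{14}$$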


In our experiments we set ===1.0 for all tasks, except in AVM, Continuous Recognition and Change Detection, where and 3, respectively. We did not tune for this hyper-parameter, we used first guess or previous work (such as [Wayne et al., 2018]) for choosing it.
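A minimal NumPy sketch of such a combined reconstruction loss is given below. It assumes image targets scaled to [0, 1], image predictions given as pre-sigmoid logits, and one cost per term; the function and argument names are hypothetical and chosen only for this example.

```python
import numpy as np


def sigmoid_cross_entropy(logits, targets):
    """Numerically stable sigmoid cross-entropy, summed over all elements."""
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return np.sum(np.maximum(logits, 0.0) - logits * targets
                  + np.log1p(np.exp(-np.abs(logits))))


def reconstruction_loss(image_logits, image, action_pred, action,
                        reward_pred, reward,
                        c_image=1.0, c_action=1.0, c_reward=1.0):
    """Weighted sum of image, action and reward reconstruction losses."""
    l_image = sigmoid_cross_entropy(image_logits, image)
    l_action = np.sum((np.asarray(action_pred, dtype=float)
                       - np.asarray(action, dtype=float)) ** 2)
    l_reward = np.sum((np.asarray(reward_pred, dtype=float)
                       - np.asarray(reward, dtype=float)) ** 2)
    return c_image * l_image + c_action * l_action + c_reward * l_reward


# Four "pixels" predicted with zero logits, a one-hot action missed by one
# unit, and a reward error of 0.5.
loss = reconstruction_loss(np.zeros(4), np.full(4, 0.5),
                           [1.0, 0.0], [0.0, 0.0], 0.5, 0.0)
print(loss)
```

Raising `c_image` above 1.0 weights the image term more heavily relative to the action and reward terms, which is the role of the per-task cost settings described above.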

Appendix D Hyper-parameter Tuning

All experiments used three seeds, with identical hyper-parameters each. Given the scope of the experiments undertaken, all hyper-parameter tuning was preliminary and not exhaustive.

Initial hyper-parameters were either inherited from the IMPALA paper or given an arbitrary first-guess value that seemed reasonable. Whatever tuning was done was performed in a relatively systematic way: hyper-parameters were shared across all model variations, and tuned with the objective of getting as many model variations as possible to achieve adequate performance on the training tasks.

The PsychLab tasks were the ones with the most tuning. For PsychLab, we performed a manual sweep over arbitrary reasonable-seeming values when train performance wasn’t getting off the floor or was too noisy. We had a preference for hyper-parameters that fared well across all models (e.g. choosing a bigger hidden size of 1024 rather than 512 for the controller so that FF models would have capacity).

For the other tasks, very minimal tuning occurred and hyper-parameters were first-guess. With Spot the Difference, we tried two different discount rates and went with the better one. For Goal Navigation and Transitive Inference tasks, we stuck to a standardized discount rate of 0.99.

We did not perform any tuning for REC throughout.

Fixed hyper-parameters (see Table 17): For optimizers, whenever we used Adam we standardized the discount rate to 0.98, and whenever we used RMSProp the discount rate was mostly 0.99, except in certain cases where we were also able to try 0.999 and found that it did better. Whenever we used an external episodic memory module (‘MEM’) we used the fixed hyper-parameters in Table 17.

For individual task hyper-parameter configurations see Table 18.

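| Component | Hyper-parameter | Value |
| --- | --- | --- |
| Adam | Beta1 | 0.9 |
| Adam | Beta2 | 0.999 |
| Adam | Epsilon | 1e-4 |
| RMSProp | Epsilon | 0.1 |
| RMSProp | Momentum (inherited from IMPALA paper) | 0.0 |
| RMSProp | Decay | 0.99 |
| MEM | Number of k-nearest neighbors to retrieve from MEM | 10 |
| MEM | MEM key size (and accordingly, query size) | 128 |
| MEM | Capacity (max number of timesteps storable) | 2048 for Unity levels, else 1024 |

Table 17: Fixed hyper-parameters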
| Task | Hidden size | Baseline cost⁵ | Entropy | Batch size | Unroll length | Discount | Optimizer | Learning rate | Num CPC steps | CPC weight |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AVM | 512 | 0.5 | 0.00520⁶ | 16 | 50 | 0.98 | Adam | 1e-5 | 10 | 10 |
| Cont. Recognition | 1024 | 0.5 | 0.01 | 16 | 50 | 0.98 | Adam | 1e-5 | 10 | 10 |
| Change Detection | 512 | 0.5 | 0.01 | 16 | 50 | 0.98 | Adam | 1e-5 | 10 | 10 |
| What Then Where | 1024 | 2.0 (sweep [0.5, 1.0, 2.0]) | 0.01 | 32 | 100 | 0.98 | Adam | 1e-5 | 10 | 30 (sweep [10, 30]) |
| Spot Diff⁷: Basic | 1024 | 0.5 | 0.003 | 16 | 200 | 0.99 | RMSProp | 1e-4 | 50 | 20 |
| Spot Diff⁷: Passive | 1024 | 0.5 | 0.003 | 16 | 200 | 0.999 | RMSProp | 1e-4 | 50 | 20 |
| Spot Diff⁷: Multi-object | 1024 | 0.5 | 0.003 | 16 | 200 | 0.99 | RMSProp | 1e-4 | 50 | 20 |
| Spot Diff⁷: Motion | 1024 | 0.5 | 0.003 | 16 | 200 | 0.99 | RMSProp | 1e-4 | 50 | 20 |
| Goal Navigation⁸: Visible Goal, Proced. Maze | 512 | 0.5 | 0.00520⁹ | 16 | 50 | 0.98 | Adam | 1e-5 | 10 | 5 |
| Goal Navigation⁸: Visible Goal, With Buildings | 1024 | 0.5 | 0.003 | 16 | 200 | 0.99 | RMSProp | 1e-4 | 50 | 20 |
| Goal Navigation⁸: Invisible Goal With Buildings | 1024 | 0.5 | 0.003 | 16 | 200 | 0.99 | RMSProp | 1e-4 | 50 | 20 |
| Goal Navigation⁸: Invisible Goal Empty Arena | 1024 | 0.5 | 0.003 | 16 | 200 | 0.99 | RMSProp | 1e-4 | 50 | 20 |
| Transitive | 1024 | 0.5 | 0.003 | 16 | 200 | 0.98 | Adam | 1e-4 | 50 | 20 |

Table 18: Hyper-parameters


  1. Videos available at
  2. Available at
  3. For PsychLab tasks and Visible Goal Procedural Maze, alpha = 0.05. For the rest, alpha = 0.001.
  4. Computed over three seeds for trained agent and for random agent. For human scores, all levels had five trials each except the following: 10 for Visible Goal with Buildings and Invisible Goal Empty Arena, 19 for the Train level of Invisible Goal with Buildings and 20 for the other two levels. The difference was due to time constraints.
  5. Inherited from IMPALA paper, except for What Then Where.
  6. Copied from previous work, not tuned for this paper. 0.01 was slower and noisier.
  7. Sweep over [0.99, .999] throughout.
  8. Sweep over [0.98, .99] throughout.
  9. Copied from AVM


  1. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, External Links: Link Cited by: §3.
  2. Vector-based navigation using grid-like representations in artificial agents. Nature 557 (7705), pp. 429. Cited by: §1.
  3. DeepMind lab. CoRR abs/1612.03801. Cited by: §2.
  4. Model-free episodic control. arXiv preprint arXiv:1606.04460. Cited by: §1.
  5. Quantifying generalization in reinforcement learning. arXiv preprint arXiv:1812.02341. Cited by: §1.
  6. IMPALA: scalable distributed deep-rl with importance weighted actor-learner architectures. CoRR abs/1802.01561. External Links: Link, 1802.01561 Cited by: §C.1, §C.2, §3, §3, §3, 1st item.
  7. Hybrid computing using a neural network with dynamic external memory. Nature 538 (7626), pp. 471–476. External Links: Link, Document Cited by: §1, §3.
  8. Neural predictive belief representations. CoRR abs/1811.06407. External Links: Link, 1811.06407 Cited by: §3.
  9. Fast deep reinforcement learning using online adjustments from the past. In Advances in Neural Information Processing Systems, pp. 10567–10577. Cited by: §1.
  10. Deep residual learning for image recognition. CoRR abs/1512.03385. External Links: Link, 1512.03385 Cited by: §C.2, §3.
  11. Long short-term memory. Neural Comput. 9 (8), pp. 1735–1780. External Links: ISSN 0899-7667, Link, Document Cited by: §3, §3.
  12. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397. Cited by: §3, §3.
  13. Sparse attentive backtracking: temporal creditassignment through reminding. CoRR abs/1809.03702. External Links: Link, 1809.03702 Cited by: §3.
  14. Psychlab: A psychology laboratory for deep reinforcement learning agents. CoRR abs/1801.08116. Cited by: §2.1, §2.
  15. Predictive representations of state. In Advances in neural information processing systems, pp. 1555–1561. Cited by: §3.
  16. Models of working memory: mechanisms of active maintenance and executive control. Cambridge University Press. External Links: Document Cited by: §1, §2.2.
  17. Asynchronous methods for deep reinforcement learning. CoRR abs/1602.01783. Cited by: §C.1.
  18. Reproducible, reusable, and robust reinforcement learning (invited talk). Advances in Neural Information Processing Systems, 2018. Cited by: §1.
  19. Neural episodic control. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2827–2836. Cited by: §1, §3.
  20. Imagination-augmented agents for deep reinforcement learning. In Advances in neural information processing systems, pp. 5690–5701. Cited by: §5.
  21. Been there, done that: meta-learning with episodic recall. arXiv preprint arXiv:1805.09692. Cited by: §1.
  22. Relational recurrent neural networks. CoRR abs/1806.01822. External Links: Link, 1806.01822 Cited by: §1, §3.
  23. Declarative memory, awareness, and transitive inference. Journal of Neuroscience 25 (44), pp. 10138–10146. External Links: Document, ISSN 0270-6474, Link Cited by: §A.4.1.
  24. Weakly supervised memory networks. CoRR abs/1503.08895. External Links: Link, 1503.08895 Cited by: §1.
  25. Elements of episodic memory. Canadian Psychology 26 (3), pp. 235–238. Cited by: §1.
  26. Episodic memory: from mind to brain. Annual Review of Psychology 53 (1), pp. 1–25. Note: PMID: 11752477 External Links: Document, Link Cited by: §1.
  27. Unity. Cited by: §2.
  28. Representation learning with contrastive predictive coding. CoRR abs/1807.03748. External Links: Link, 1807.03748 Cited by: §C.3, §3, §3.
  29. Attention is all you need. CoRR abs/1706.03762. External Links: Link, 1706.03762 Cited by: §1, §1, §3.
  30. Unsupervised predictive memory in a goal-directed agent. CoRR abs/1803.10760. Cited by: §A.1.1, §C.4, §3.
  31. Relational deep reinforcement learning. CoRR abs/1806.01830. External Links: Link, 1806.01830 Cited by: §1.