# Is prioritized sweeping the better episodic control?

## Abstract

Episodic control has been proposed as a third approach to reinforcement learning, besides model-free and model-based control, by analogy with the three types of human memory, i.e. episodic, procedural and semantic memory. But the theoretical properties of episodic control are not well investigated. Here I show that in deterministic tree Markov decision processes, episodic control is equivalent to a form of prioritized sweeping in terms of sample efficiency as well as memory and computation demands. For general deterministic and stochastic environments, prioritized sweeping performs better even when memory and computation demands are restricted to be equal to those of episodic control. These results suggest generalizations of prioritized sweeping to partially observable environments, its combined use with function approximation and the search for possible implementations of prioritized sweeping in brains.

## 1 Introduction

Single experiences can drastically change subsequent decision making. Discovering a new passage in my home town, for example, may immediately affect my policy to navigate and shorten path lengths. A software developer may discover a tool that increases productivity and never go back to the old workflow. And on the level of the policy of a state, Vasco da Gama’s discovery of the sea route to India had an almost immediate and long-lasting effect for the Portuguese.

Given the importance of such single experiences, it is not surprising that humans and some animals use an episodic(-like) memory system devoted to single experiences [1]. But how exactly should episodic memory influence decision making? Lengyel and Dayan [1] proposed “episodic control”, where

[..] each time the subject experiences a reward that is considered large enough (larger than expected a priori) it stores the specific sequence of state-action pairs leading up to this reward, and tries to follow such a sequence whenever it stumbles upon a state included in it. If multiple successful sequences are available for the same state, the one that yielded maximal reward is followed.

This form of episodic control was found to be beneficial in the initial phase of learning for stochastic tree Markov Decision Processes (tMDP) [1] and in the domain of simple video games [3]. But it is unclear whether there are conditions under which episodic control converges to a (nearly) optimal solution, and how fast it does so.

An alternative way to link episodic memory to decision making is to compute Q-values at decision time based on a set of retrieved episodes using a mechanism similar to N-step backups [4]. In fact, if all episodes starting in a given state-action pair are retrieved, the result is equivalent to TD-learning with N-step backups and decaying learning rates. More interesting is the case where the retrieved episodes are cleverly selected, but how exactly this selection mechanism should work remains an open question.

While speed of learning in the sense of sample efficiency is one important aspect of a learning algorithm, a second criterion is computational efficiency [1]. Without this second desideratum, the canonical choice would be model-based reinforcement learning, where the agent additionally learns how its perception of the environment changes in response to actions and how reward depends on state-action pairs. Model-based reinforcement learning is well known for high sample efficiency [5]. But it is often considered to be computationally challenging. In fact, one demanding way to use the model is to perform a forward search before every decision, similar to a chess player who plans the next moves. Instead of forward search at decision time, however, one can use backward search whenever an important discovery is made. Prioritized sweeping [6] implements this backward search efficiently, in particular with small backups [8]. In the words of Moore and Atkeson [6]:

It is justified to complain about the indiscriminate use of combinatorial search or matrix inversion prior to each supposedly real-time decision. However, models need not be used in such an extravagant fashion. The prioritized sweeping algorithm is just one example of a class of algorithms that can easily operate in real time and also derive great power from a model.

Prioritized sweeping converges to the optimal policy for arbitrary Markov Decision Processes (MDP) [8]. Furthermore, for deterministic environments or at the initial phase of learning in stochastic environments, prioritized sweeping relies in a similar way on single experiences as episodic control.

In this paper I address the question of whether prioritized sweeping or episodic control is preferable in terms of sample efficiency and computational efficiency. I introduce a variant of prioritized sweeping with model reset at the end of each episode and show that it has the same computational complexity as episodic control, equivalent sample efficiency in deterministic tree Markov Decision Processes and better sample efficiency in general deterministic MDPs or stochastic MDPs. Thus, it appears as a third and promising candidate algorithm to link episodic memory to decision making.

### 1.1 Review of prioritized sweeping

Prioritized sweeping [6] is an efficient method for doing approximate full backups in reinforcement learning. A typical setting of reinforcement learning consists of an agent who observes state $s_t$, performs action $a_t$ and receives reward $r_t$ at each time step $t$. It is assumed that the state transitions and reward emissions are governed by a stationary stochastic process with transition probabilities $T_{sas'} = P(s_{t+1} = s' \mid s_t = s, a_t = a)$ and reward probabilities $P(r_t = r \mid s_t = s, a_t = a)$ with mean $R_{sa}$. The agent’s goal is to find a policy $\pi$ that maximizes for each state $s$ the expected future discounted reward $V_s = \mathbb{E}\left[\sum_{k \ge 0} \gamma^k r_{t+k} \mid s_t = s, \pi\right]$, where $\gamma \in [0, 1]$ is a discount factor. With $Q$-values $Q_{sa} = \mathbb{E}\left[\sum_{k \ge 0} \gamma^k r_{t+k} \mid s_t = s, a_t = a, \pi\right]$ we see that the optimal policy should satisfy the Bellman equations $V_s = \max_a Q_{sa}$. Dropping for notational simplicity the conditioning on the policy and using the definition of the $Q$-values this can be written as

$$V_s = \max_a \Big( R_{sa} + \gamma \sum_{s'} T_{sas'} V_{s'} \Big).$$

From this equation, we see that a change in $V_{s'}$ affects the values $V_s$ of possible predecessor states $s$ if and only if $V_s = Q_{sa}$ for an action $a$ with $T_{sas'} > 0$, before or after the change.

In reinforcement learning the agent is assumed not to know the true values of the parameters $T_{sas'}$ and $R_{sa}$. The idea of prioritized sweeping is to maintain a maximum likelihood estimate of $T_{sas'}$ and $R_{sa}$ and a priority queue of predecessor states to be updated next, with priority given by the absolute value of the change in $Q$-value. The limited impact of $Q$-value changes observed above has an important implication in the reinforcement learning setting: if an agent who acts in a large, familiar environment discovers a novel value for $T_{sas'}$ or $R_{sa}$, this may change the $Q$-values of some predecessor states of $s$, but generally it does not lead to a change of all $Q$-values. Hence, a model-based approach that recomputes all $Q$-values, like value iteration, is computationally inefficient in the reinforcement learning setting. Empirically it is found that a small and constant number of $Q$-value updates in the order of the priority queue is sufficient to learn considerably faster, i.e. with fewer interactions with the environment, than model-free methods like Q-learning or policy gradient learning, while the memory complexity is only slightly higher in general, or equal for deterministic environments [5]. Prioritized sweeping is an instance of the Dyna architecture, where models, i.e. estimates of $T_{sas'}$ and $R_{sa}$, are explicitly learned through interactions with the environment and then used to update the $Q$-values offline, i.e. without interaction with the environment (“trying things in your head” [9]). Importantly, at decision time, this form of model-based reinforcement learning has the same complexity as any model-free method that relies on learned $Q$-values.
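The mechanism reviewed above can be sketched in a few lines of Python for the deterministic case. This is a minimal illustration under stated assumptions, not the small-backups implementation [8] used in the simulations: the class name `DeterministicPrioritizedSweeping`, the threshold `theta`, and the convention that `None` marks a terminal state are all hypothetical choices made for this sketch. Novel state-action pairs start at a Q-value of 0, and the value of a state is the maximum over the actions tried so far.

```python
import heapq
from collections import defaultdict


class DeterministicPrioritizedSweeping:
    """Tabular prioritized sweeping for deterministic environments (sketch)."""

    def __init__(self, gamma=0.9, n_backups=3, theta=1e-8):
        self.gamma = gamma            # discount factor
        self.n_backups = n_backups    # backups per real time step
        self.theta = theta            # minimal priority to enter the queue (assumption)
        self.q = defaultdict(float)   # Q-values; novel pairs start at 0
        self.model = {}               # (s, a) -> (reward, next state or None)
        self.predecessors = defaultdict(set)  # next state -> {(s, a)}
        self.actions = defaultdict(set)       # state -> actions tried so far
        self.queue = []               # max-heap emulated with negated priorities

    def value(self, s):
        """Value of a state: max over tried actions, 0 for terminal or unseen states."""
        if s is None:
            return 0.0
        return max((self.q[(s, a)] for a in self.actions[s]), default=0.0)

    def _target(self, s, a):
        r, s_next = self.model[(s, a)]
        return r + self.gamma * self.value(s_next)

    def _push(self, s, a):
        # Priority is the absolute change a backup of (s, a) would cause.
        priority = abs(self._target(s, a) - self.q[(s, a)])
        if priority > self.theta:
            heapq.heappush(self.queue, (-priority, s, a))

    def observe(self, s, a, r, s_next):
        """Update the model with one transition, then do a few prioritized backups."""
        self.model[(s, a)] = (r, s_next)
        self.actions[s].add(a)
        if s_next is not None:
            self.predecessors[s_next].add((s, a))
        self._push(s, a)
        for _ in range(self.n_backups):
            if not self.queue:
                break
            _, s_b, a_b = heapq.heappop(self.queue)
            self.q[(s_b, a_b)] = self._target(s_b, a_b)
            # The backed-up state may have changed value: revisit its predecessors.
            for s_p, a_p in self.predecessors[s_b]:
                self._push(s_p, a_p)
```

The queue may contain stale duplicates of a state-action pair; popping one simply recomputes an up-to-date target, which costs a backup but does not corrupt the Q-values. States must be mutually comparable (e.g. all integers) so that ties in priority can be broken by the heap.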

## 2 Results

### 2.1 Episodic control is equivalent to prioritized sweeping for deterministic tree Markov Decision Processes

[Figure: panel A; panel B, only terminal rewards; panel C, with intermittent rewards]

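For deterministic Markov decision processes with transition probabilities $T_{sas'} \in \{0, 1\}$ and rewards $R_{sa}$ the Bellman equations reduce to

$$Q_{sa} = R_{sa} + \gamma \max_{a'} Q_{s'a'},$$

where $s'$ is the unique successor state of the state-action pair $(s, a)$ and $\gamma$ is the discount factor.

In tree Markov Decision Processes (cf. panel A of the tMDP figures) each state has only one predecessor state and thus prioritized sweeping never branches during the backups. After the first episode in a deterministic tMDP of depth $D$, the $Q$-values learned by prioritized sweeping are given by

$$Q_{s_t a_t} = \sum_{k=t}^{D} \gamma^{k-t} r_k$$

if $D$ backup steps are performed. When in a later episode at time step $t$ a state-action pair $(s_t, a_t)$ is reexperienced, the $Q$-values are updated according to

$$Q_{s_t a_t} \leftarrow \max\Big( Q_{s_t a_t},\ \sum_{k=t}^{D} \gamma^{k-t} r_k \Big).$$

This learning rule for the $Q$-values is identical to the episodic control rule used by Blundell et al. [3], which is a formalization of the proposition by Lengyel and Dayan [1] quoted in the introduction.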
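As a hedged illustration of this learning rule, the following sketch applies the episodic control update to one episode at a time. The function name `episodic_control_update` and the episode format, a list of `(state, action, reward)` tuples, are assumptions made for this example. Following Blundell et al. [3], a novel state-action pair simply stores the observed discounted return.

```python
def episodic_control_update(q, episode, gamma=1.0):
    """Update the Q-table `q` in place from one episode.

    `episode` is a list of (state, action, reward) tuples in the order
    they were experienced. The discounted return is accumulated backwards
    and each visited pair keeps the maximum return seen so far.
    """
    ret = 0.0
    for s, a, r in reversed(episode):
        ret = r + gamma * ret
        key = (s, a)
        # Novel pairs store the return; known pairs keep the maximum.
        q[key] = ret if key not in q else max(q[key], ret)
    return q


q = {}
episodic_control_update(q, [("s0", "left", 0.0), ("s1", "left", 1.0)], gamma=0.5)
episodic_control_update(q, [("s0", "left", 0.0), ("s1", "right", 4.0)], gamma=0.5)
```

After the second episode the entry for `("s0", "left")` has grown from 0.5 to 2.0, because the newly discovered action in `s1` yields a larger discounted return. Replaying the first episode again would leave all entries unchanged.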

For deterministic tMDPs an implementation of prioritized sweeping does not need to maintain the reward values and the transition table beyond the end of an episode. It thus has no influence on the performance if the maximum likelihood estimates of the model parameters are reset after each episode. I call this algorithm prioritized sweeping with model reset. In the simulations on deterministic tMDPs, the performance curves of prioritized sweeping with model reset after each episode (orange curves, see also Section 2.3) are perfectly covered by the ones of episodic control (red curves). With model reset, the memory requirements are identical to those of episodic control. There is also no additional cost in maintaining a priority queue, since the queue will contain only one item at each moment in time: the predecessor of the currently backed up state.

To see why prioritized sweeping leads to the update rule above, we note that $Q_{s_t a_t}$ will change when a novel action $a_{t+1}$ is chosen in the subsequent state $s_{t+1}$ with $Q_{s_{t+1} a_{t+1}} > Q_{s_{t+1} a}$ for all $a \neq a_{t+1}$, and thus $\max_a Q_{s_{t+1} a}$ increases (cf. the Bellman equations for deterministic environments). This update propagates backwards along the path the agent took in this episode until the $Q$-value of a certain state-action pair is larger than the discounted reward. In the simulations we see that episodic control (red curves) performs equivalently to prioritized sweeping (solid blue curves, covered by the red curves).

We assumed above that the $Q$-values are initialized in a way such that for all novel actions $a$, i.e. actions that were never chosen before in state $s$, the initial value $Q_{sa}$ is low enough that it never exceeds the $Q$-value of an action already tried in state $s$. Exploration is not dramatically hampered by this choice if novel actions are selected whenever possible. If, on the other hand, a maximally exploratory behavior is enforced with a large initial value that depends on the depth $D$ of the tree, then $Q_{sa}$ may decrease over time and thus prioritized sweeping also performs updates other than the ones given by the max update rule. In deterministic tMDPs this rapidly leads to a full exploration of the decision tree at the cost of low rewards in the initial phase of learning (dashed blue curves). Only bootstrapping methods, like Q-learning or prioritized sweeping, can be forced to aggressive exploration with large initial values. Episodic control is not of this type; the max operation in its update rule would just keep the $Q$-values constant at the large initialization value.

[Figure: panels A and B]

### 2.2 Prioritized sweeping with model reset outperforms episodic control in general environments

For optimal performance in stochastic (t)MDPs or in deterministic MDPs without tree structure, prioritized sweeping needs to store the reward values and the transition table. This means that full prioritized sweeping requires more memory and computation than episodic control. However, an interesting competitor of episodic control in terms of memory and computation requirements is a learner that resets the model after each episode but uses prioritized sweeping for the backups. If the episodes are long enough that the model learned within each episode is more than a simple tree, this learner may still have an advantage over episodic control.
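The learner described above can be sketched as follows. This is an illustration under assumptions, not the code used for the simulations: the function name `ps_with_reset_episode` and the episode format, a list of `(state, action, reward, next_state)` tuples with `None` marking termination, are hypothetical. The Q-values persist across episodes, whereas the model and the predecessor table live only inside the function call and are discarded when it returns.

```python
import heapq


def ps_with_reset_episode(q, episode, gamma):
    """Update the persistent Q-table `q` from one episode, then forget the model."""
    model = {}          # (s, a) -> (reward, next state); valid for this episode only
    predecessors = {}   # next state -> {(s, a)}; valid for this episode only
    for s, a, r, s_next in episode:
        model[(s, a)] = (r, s_next)
        predecessors.setdefault(s_next, set()).add((s, a))
        q.setdefault((s, a), 0.0)

    def value(s):
        # Bootstraps from all Q-values remembered for state s, also from older episodes.
        return max((v for (s2, _), v in q.items() if s2 == s), default=0.0)

    def priority(key):
        r, s_next = model[key]
        return abs(r + gamma * value(s_next) - q[key])

    queue = [(-priority(k), k) for k in model if priority(k) > 0]
    heapq.heapify(queue)
    # At most one backup per time step of the episode.
    for _ in range(len(episode)):
        if not queue:
            break
        _, key = heapq.heappop(queue)
        r, s_next = model[key]
        q[key] = r + gamma * value(s_next)
        for pred in predecessors.get(key[0], ()):
            if priority(pred) > 0:
                heapq.heappush(queue, (-priority(pred), pred))
    return q


ep1 = [("A", "go", 0.0, "M"), ("M", "bad", 1.0, None)]
ep2 = [("B", "go", 0.0, "M"), ("M", "good", 10.0, None)]
q = {}
for episode in (ep1, ep2, ep1):
    ps_with_reset_episode(q, episode, gamma=0.9)
```

In this toy example, repeating the first episode after the better action in `M` has been discovered raises the Q-value of `("A", "go")` from 0.9 to 9.0, because the backup bootstraps from the remembered Q-values in `M`. The max rule of episodic control would keep the value at 0.9, since the return of the repeated episode is unchanged.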

In stochastic environments there needs to be some averaging somewhere, e.g. explicitly averaging over episodes in Monte Carlo control, choosing a small learning rate in Q-learning or maintaining a maximum likelihood estimate of the transition probabilities and rewards in prioritized sweeping. As already observed by Lengyel & Dayan [1], algorithms that do not properly average, like episodic control or prioritized sweeping with model reset, may perform well in the initial phase of learning but quickly lead to suboptimal performance (cf. panel B, red and orange curves). While I do not see a clear theoretical advantage of prioritized sweeping with model reset over episodic control in such environments, it still performs slightly better in stochastic tMDPs (see panel B, orange above red curve), but overall clearly worse than normal prioritized sweeping.
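A toy calculation illustrates why the lack of averaging hurts in stochastic environments. The returns below are hypothetical numbers chosen for this example, not data from the simulations: a "risky" action that occasionally pays a lot and a "safe" action with a constant payoff.

```python
from statistics import mean

# Hypothetical returns observed for two actions in the same state.
returns = {
    "risky": [10.0, 0.0, 0.0, 0.0],
    "safe": [3.0, 3.0, 3.0, 3.0],
}


def greedy_action(estimates):
    """Return the action with the largest estimated value."""
    return max(estimates, key=estimates.get)


# Episodic control keeps the maximal return seen for each action.
max_estimates = {a: max(rs) for a, rs in returns.items()}
# Averaging methods estimate the expected return instead.
mean_estimates = {a: mean(rs) for a, rs in returns.items()}

print(greedy_action(max_estimates), greedy_action(mean_estimates))
```

With the max rule the single lucky outcome dominates forever and the greedy policy prefers the risky action, although its average return (2.5) is lower than that of the safe action (3.0).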

[Figure: panels A and B]

In deterministic environments without tree structure, states can have multiple predecessor states. In this case, prioritized sweeping also performs backups along “virtually composed” episodes, i.e. episodes that were not yet experienced in their entirety, but whose possibility is known to the agent from experiencing episodes with shared states. This is similar to a crossroad where the origin of some roads is known, or to the example of the newly discovered passage in the introduction. The deterministic maze environment in panel A is of this type: with prioritized sweeping (blue curves) the agent learns rapidly to navigate from any starting position to the goal (red square), even if the model is reset after each episode (orange curves). Episodic control does not branch during the backups. While it performs well initially (red curves), it is quickly outperformed even by the model-free Q-learner with exploration bias (dashed green curves).
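The effect of virtually composed episodes can be reproduced in a minimal example with two episodes that share one state. The sketch below compares the episodic control rule with backups on a learned deterministic model. For brevity, repeated sweeps over the model stand in for prioritized backups run to convergence, which is an assumption of this sketch; the function names and the tiny environment are hypothetical.

```python
def episodic_control(episodes, gamma):
    """Keep, for each visited pair, the maximal discounted return experienced."""
    q = {}
    for episode in episodes:
        ret = 0.0
        for s, a, r, _ in reversed(episode):
            ret = r + gamma * ret
            q[(s, a)] = max(q.get((s, a), float("-inf")), ret)
    return q


def model_based_backups(episodes, gamma, sweeps=50):
    """Learn a deterministic model from all episodes and back up until convergence."""
    model = {}
    for episode in episodes:
        for s, a, r, s_next in episode:
            model[(s, a)] = (r, s_next)
    q = {key: 0.0 for key in model}
    for _ in range(sweeps):
        for (s, a), (r, s_next) in model.items():
            # Value of the successor: max over its known actions, 0 if terminal.
            v_next = max((q[k] for k in q if k[0] == s_next), default=0.0)
            q[(s, a)] = r + gamma * v_next
    return q


# Two starts, A and B, lead to the shared state M; None marks termination.
episodes = [
    [("A", "go", 0.0, "M"), ("M", "bad", 1.0, None)],
    [("B", "go", 0.0, "M"), ("M", "good", 10.0, None)],
]
q_ec = episodic_control(episodes, gamma=0.9)
q_mb = model_based_backups(episodes, gamma=0.9)
```

The path from `A` via `M` to the large reward was never experienced in one piece. Episodic control therefore values `("A", "go")` at 0.9, the discounted return of the only episode that started in `A`, whereas the model-based backups assign 9.0 by composing the first half of one episode with the second half of the other.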

### 2.3 Simulation details

In all simulations the reward rate is measured as the mean reward obtained with an $\epsilon$-greedy policy in adjacent windows of time steps. The normalized reward rate is obtained by an affine transformation of the reward rate such that 0 corresponds to the value of the uniform random policy and 1 corresponds to the policy with the optimal Q-values (obtained by policy iteration) and the same $\epsilon$-greedy action selection. In all figures, the thick curves are obtained by running simulations with different random seeds on samples of the MDP in question and averaging the results. The thin curves are 5 random samples of the simulations. The small backups implementation of prioritized sweeping [8] was used with 3 backups per time step. In the variant of prioritized sweeping with model reset (P.S. with reset), no backups were performed until the end of the episode, at which point $L$ backups were performed, where $L$ is the length of the episode. This is the same number of backups as in episodic control (cf. the update rule in Section 2.1). After the end of each episode all transition counts and observed rewards are reset to the initial value in prioritized sweeping with reset. The parameters of the other algorithms were chosen after a short search in the parameter space and are probably close to but not exactly optimal for Q-learning in stochastic tMDPs, n-step TD and Q($\lambda$) in all environments. See the table below for the parameter choices. The code is available at http://github.com/jbrea/episodiccontrol.
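The performance measure described above amounts to a windowed mean followed by an affine rescaling. A minimal sketch, assuming hypothetical function names and a `window` parameter whose value is not taken from the paper:

```python
def windowed_reward_rate(rewards, window):
    """Mean reward in adjacent, non-overlapping windows of `window` time steps."""
    return [
        sum(rewards[i:i + window]) / window
        for i in range(0, len(rewards) - window + 1, window)
    ]


def normalized_reward_rate(reward_rate, random_rate, optimal_rate):
    """Affine map sending the random policy to 0 and the optimal policy to 1."""
    if optimal_rate == random_rate:
        raise ValueError("optimal and random reward rates must differ")
    return (reward_rate - random_rate) / (optimal_rate - random_rate)


rates = windowed_reward_rate([1, 0, 1, 1, 0, 0], window=2)
normalized = [normalized_reward_rate(r, random_rate=0.0, optimal_rate=1.0) for r in rates]
```

Values above 1 or below 0 are possible in individual windows, because the reference rates are expectations while the measured rate in a short window is noisy.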

environment |
Q - learning | |||||||||||

det tMDP Fig. | 1.0 | 0.1 | 200 | 100 | 5 | 0.08 | 5 | 1.0 | 0.2 | 1.0 | 5.0 | 5.0 |

stoch tMDP Fig. | 1.0 | 0.1 | 100 | 100 | 50 | 0.05 | 5 | 0.1 | 0.2 | 0.1 | 5.0 | 5.0 |

maze Fig. | 0.99 | 0.1 | 50 | 2 | 0.01 | 50 | 0.005 | 0.2 | 1.0 | 5.0 | 0.005 |

## 3 Discussion

Prioritized sweeping allows efficient learning from a small number of experiences. In deterministic environments the demands on memory and computation are low and they are expected to increase gracefully with increasing stochasticity. Episodic control is equivalent to prioritized sweeping in the case of deterministic tree Markov Decision Processes, because in this case it is not necessary to maintain a model beyond one episode. The observation that episodic control performs surprisingly well in the initial phase of some Atari games [3] may be a hint that these games are close to deterministic tree Markov Decision Processes.

I suggested model reset after the end of each episode. While this is a rather ruthless form of forgetting, the simulations show that it is still possible to learn fairly well in the maze task. Future work will be needed to explore more elaborate forgetting heuristics that keep the memory load low while preserving high performance.

It should be noted that prioritized sweeping with a limited number of backups per time step is not necessarily the most sample efficient way to learn: in large environments with a lot to learn the priority queue may become longer and longer. In this case it may be better to set an upper bound to the length of the priority queue and, whenever the agent has sufficient computational resources, the model can be used to do Monte Carlo tree search to refine action selection at decision time. I think it is an interesting question how to optimally combine such forward and backward search and whether biological agents make use of such a combination.

The current formulation of prioritized sweeping requires a tabular environment, i.e. a representation of states by integer numbers, and its superiority to episodic control in deterministic environments becomes evident only in a Markovian setting where states have multiple predecessors. However, tabular environments contrast with the natural continuity of space and sensor values encountered by biological agents and robots. Additionally, the sensory input may contain details that are irrelevant to the task at hand, e.g. the pedestrians I encounter every time I walk by the same place in my home town, which are irrelevant to my task of walking from A to B. Also, in many tasks the state is not fully observable and state aliasing may occur, i.e. different states have the same observation and can only be distinguished by taking the history of the agent into account.

This raises the question: how can we find functions that map the sensory input to a representation useful for prioritized sweeping? Unfortunately, prioritized sweeping does not immediately lend itself to function approximation, in contrast to model-free learning methods like episodic control, Q-learning or policy gradient learning [5]. But an interesting approach could be to learn a function that maps the high dimensional sensory input to a discrete representation, either using unsupervised learning, e.g. a variational autoencoder [10], or using also reward information to shape the map in a similar spirit as the neural episodic control model [11]. It remains to be seen, however, if this is more efficient than e.g. prioritized experience replay [12].

How can we generalize prioritized sweeping to partially observable environments? There does not seem to be an obvious answer to this question, but memories of sequences that lead to reward, together with a mechanism to attach converging sequences, could make it possible to go beyond updating single sequences, as in episodic control, and to make use of branching during backups, as in prioritized sweeping. It is an open question whether this can be made to perform better than the average over smartly selected episodes proposed by Gershman and Daw [4].

Prioritized sweeping relies on learning the model of the environment, i.e. the reward table and the transition table. This may look far removed from anything we think of as episodic memory. But in regions of the transition table with few branching states, but many chains of $(s, a, s')$-triplets that are experienced only once, replaying such chains is like recalling an episode from episodic memory. With increasing experience, it may become impossible to identify all the episodes that contributed to the transition table. This may lead to a gradual shift from episodic memory to semantic memory.

Is prioritized sweeping occurring in brains? The reverse replay of navigation episodes observed in the hippocampus of rats [13] can be seen as a signature of both episodic control and prioritized sweeping. But the observation of “never-experienced novel-path sequences” in hippocampal sharp wave ripples of rats [14] is a feature of prioritized sweeping only.

At least on a sufficiently abstract level, many tasks solved by humans seem to be tabular and fairly deterministic, and it is in this setting that prioritized sweeping excels. This led me to the question of whether prioritized sweeping is the better episodic control. If the defining characteristic of episodic control is sample-efficient learning with a moderate memory and computation footprint, I would argue that the answer is yes.

## 4 Acknowledgements

I thank Vasiliki Liakoni, Marco Lehmann, Dane Corneil and Wulfram Gerstner for helpful discussions and critical feedback on earlier versions of the manuscript. This research was supported by the Swiss National Science Foundation, grant 200020_165538.

### References

[1] **Hippocampal contributions to control: The third way.** Máté Lengyel and Peter Dayan. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, *Advances in Neural Information Processing Systems 20*, pages 889–896. Curran Associates, Inc., 2008.

[2] **Episodic memory.** Nicola S. Clayton, Lucie H. Salwiczek, and Anthony Dickinson. *Current Biology*, 17(6):R189–R191, Mar 2007.

[3] **Model-Free Episodic Control.** C. Blundell, B. Uria, A. Pritzel, Y. Li, A. Ruderman, J. Z Leibo, J. Rae, D. Wierstra, and D. Hassabis. *ArXiv e-prints*, June 2016, 1606.04460.

[4] **Reinforcement learning and episodic memory in humans and animals: An integrative framework.** Samuel J. Gershman and Nathaniel D. Daw. *Annual Review of Psychology*, 68(1):101–128, Jan 2017.

[5] *Reinforcement Learning: An Introduction*. Richard S. Sutton and Andrew G. Barto. MIT Press, Cambridge, MA, 2017.

[6] **Prioritized sweeping: Reinforcement learning with less data and less time.** Andrew W. Moore and Christopher G. Atkeson. *Machine Learning*, 13(1):103–130, Oct 1993.

[7] **Efficient learning and planning within the dyna framework.** Jing Peng and R. J. Williams. *Adaptive Behavior*, 1(4):437–454, Mar 1993.

[8] **Planning by prioritized sweeping with small backups.** Harm Van Seijen and Rich Sutton. In Sanjoy Dasgupta and David McAllester, editors, *Proceedings of the 30th International Conference on Machine Learning*, volume 28 of *Proceedings of Machine Learning Research*, pages 361–369, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.

[9] **Dyna, an integrated architecture for learning, planning, and reacting.** Richard S. Sutton. *ACM SIGART Bulletin*, 2(4):160–163, Jul 1991.

[10] **The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables.** C. J. Maddison, A. Mnih, and Y. Whye Teh. *ArXiv e-prints*, November 2016, 1611.00712.

[11] **Neural Episodic Control.** A. Pritzel, B. Uria, S. Srinivasan, A. Puigdomènech, O. Vinyals, D. Hassabis, D. Wierstra, and C. Blundell. *ArXiv e-prints*, March 2017, 1703.01988.

[12] **Prioritized Experience Replay.** T. Schaul, J. Quan, I. Antonoglou, and D. Silver. *ArXiv e-prints*, November 2015, 1511.05952.

[13] **Reverse replay of behavioural sequences in hippocampal place cells during the awake state.** David J. Foster and Matthew A. Wilson. *Nature*, 440(7084):680–683, Feb 2006.

[14] **Hippocampal replay is not a simple function of experience.** Anoopum S. Gupta, Matthijs A.A. van der Meer, David S. Touretzky, and A. David Redish. *Neuron*, 65(5):695–705, Mar 2010.