# Probabilistic Successor Representations with Kalman Temporal Differences

## 1 Abstract

The effectiveness of Reinforcement Learning (RL) depends on an animal’s ability to assign credit for rewards to the appropriate preceding stimuli. One aspect of understanding the neural underpinnings of this process involves understanding what sorts of stimulus representations support generalisation. The Successor Representation (SR), which enforces generalisation over states that predict similar outcomes, has become an increasingly popular model in this line of inquiry. Another dimension of credit assignment involves understanding how animals handle uncertainty about learned associations, using probabilistic methods such as Kalman Temporal Differences (KTD). Combining these approaches, we propose using KTD to estimate a distribution over the SR. KTD-SR captures uncertainty about the estimated SR as well as covariances between different long-term predictions. We show that, because of this, KTD-SR exhibits partial transition revaluation without additional replay, as humans do in a transition revaluation experiment and unlike the standard TD-SR algorithm. We conclude by discussing future applications of the KTD-SR as a model of the interaction between predictive and probabilistic animal reasoning.

Keywords: Reinforcement Learning; Successor Representation; Kalman Filter; Transition Revaluation

## 2 Introduction

An impressive signature of animal behavior is the capacity to flexibly learn relationships between the environment and reward. One approach to understanding this behavior involves investigating how the brain represents different stimuli such that credit for reward is generalised appropriately. Predictive representations, like the Successor Representation (SR) Dayan1993ImprovingRepresentation, generalise over stimuli that predict similar futures and can provide a useful balance between efficiency and flexibility Gershman2018TheSubstrates; Russek2016a. SR learning adapts to change faster than model-free (MF) learning, particularly to changes in reward location, and supports more efficient state evaluation than model-based (MB) algorithms, which use time-consuming forward simulations to evaluate states. Since this efficiency depends on caching long-term expected state occupancies, however, the SR is worse than MB at handling changes in the environment’s transition structure. In neuroscience and psychology, the SR offers a compelling explanation for a range of behavioural and neural findings Momennejad2016; Stachenfeld2017TheMap; Gardner2018RethinkingError; Garvert2017.

While the SR offers a solution to some of the shortcomings of model-free learning, existing methods for estimating the SR, such as temporal difference (TD) learning, do not take uncertainty into account. Here, we attempt to rectify this by drawing on the Kalman TD (KTD) method for value learning Geist2010KalmanDifferences, which explains a range of animal conditioning phenomena that standard TD cannot explain Gershman2015ALearning. KTD-SR gives the agent an estimate of its uncertainty in the SR as well as the covariance between different entries of the SR. We show how this augments the SR's capacity to support revaluation following changes in transition structure.

## 3 Results

### 3.1 The successor representation

We define an RL environment to be a Markov Decision Process consisting of *states* $s$ the agent can occupy, *transition probabilities* $T(s' \mid s)$ of moving from state $s$ to state $s'$ given the agent’s policy $\pi(a \mid s)$ over actions $a$, and the reward $R(s)$ available at each state, for which $\bar{R}(s)$ denotes the expectation. An RL agent is tasked with finding a policy that maximises its expected discounted total future reward, or *value*:

$$V(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t) \,\middle|\, s_0 = s\right] \tag{1}$$

where $t$ indexes timestep and $\gamma$, where $0 \le \gamma < 1$, is a discount factor that down-weights distal rewards.

The value function $V$ can be decomposed into a product of the reward function and the SR matrix $M$ Dayan1993ImprovingRepresentation:

$$V(s) = \sum_{s'} M(s, s')\, \bar{R}(s') \tag{2}$$

where $\bar{R}(s')$ is the expected reward in state $s'$. $M$ is defined such that each entry $M(s, s')$ gives the expected discounted future number of times the agent will visit $s'$ from starting state $s$, under the current policy Dayan1993ImprovingRepresentation:

$$M(s, s') = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\, \mathbb{I}(s_t = s') \,\middle|\, s_0 = s\right] \tag{3}$$

where $\mathbb{I}(s_t = s') = 1$ if $s_t = s'$ and 0 otherwise. Each row in this matrix constitutes the SR for some state $s$, thus representing each state as a vector over future “successor states.” Factorising value into an SR term and a reward term permits greater flexibility because if one term changes, it can be relearned while the other remains intact Dayan1993ImprovingRepresentation; Gershman2018TheSubstrates.
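As a minimal sketch of this factorisation, the snippet below builds the SR of a small environment and evaluates states by multiplying it with a reward vector. It relies on the standard identity $M = \sum_t \gamma^t T^t = (I - \gamma T)^{-1}$ for a fixed policy, which is not stated in the text above; the three-state ring, the reward vectors and the name `successor_matrix` are hypothetical and chosen only for illustration.

```python
import numpy as np

def successor_matrix(T, gamma):
    """Closed-form SR under a fixed policy: M = sum_t gamma^t T^t = (I - gamma T)^-1."""
    n = T.shape[0]
    return np.linalg.inv(np.eye(n) - gamma * T)

gamma = 0.9
# Hypothetical 3-state ring 0 -> 1 -> 2 -> 0 (not an environment from the paper).
T = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
R_bar = np.array([0.0, 0.0, 1.0])   # expected reward in each state

M = successor_matrix(T, gamma)
V = M @ R_bar                        # value as SR times expected reward

# Reward revaluation: only the reward vector changes; M is reused unchanged.
R_bar_new = np.array([1.0, 0.0, 0.0])
V_new = M @ R_bar_new
```

Because `M` is reused when the reward vector changes, re-evaluating all states costs a single matrix-vector product, which is the flexibility the factorisation is meant to provide. A change in `T`, by contrast, would invalidate `M` itself.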

We first consider the SR in a tabular setting with deterministic transitions and a fixed, deterministic policy. This means that there is only one possible state $s_{t+1}$ following any predecessor state $s_t$. In this setting, the SR matrix rows of two temporally adjacent states can be recursively related as follows:

$$M^\top \phi(s_t) = \phi(s_t) + \gamma\, M^\top \phi(s_{t+1}) \tag{4}$$

where $\phi(s_t)$ is the feature vector (of length $N$, the number of features) observed by the agent in state $s_t$.
In this article, we consider problems with discrete state spaces, for which the feature vector is a one-hot vector with an entry for every state and a 1 only in the $s$-th position, so that $M^\top \phi(s)$ is the row of $M$ belonging to state $s$.
Equation 4 is analogous to the Bellman equation for value widely used in RL Sutton1998ReinforcementIntroduction, with the vector-valued $\phi(s_t)$ in lieu of scalar $R(s_t)$.

We can express the estimated current one-hot state vector $\hat{\phi}(s_t)$ (based on the SR) as the difference between two successive SRs:

$$\hat{\phi}(s_t) = M^\top \phi(s_t) - \gamma\, M^\top \phi(s_{t+1}) = M^\top h_t \tag{5}$$

where we have defined $h_t = \phi(s_t) - \gamma\, \phi(s_{t+1})$: the discounted temporal difference between state features. The (vector-valued) successor prediction error, used to update the SR in TD methods, is then given by $\delta_t = \phi(s_t) - \hat{\phi}(s_t) = \phi(s_t) - M^\top h_t$.
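The recursion and prediction error above can be sketched as a tabular TD update. This is an illustrative reading, not the authors' code: the learning rate `alpha`, the function name `td_sr_update` and the three-state ring are assumptions, and the closed-form comparison uses the standard identity $M = (I - \gamma T)^{-1}$.

```python
import numpy as np

def td_sr_update(M, s, s_next, gamma, alpha):
    """One tabular TD update of the SR row of state s (modifies M in place)."""
    n = M.shape[0]
    phi = np.eye(n)[s]                        # one-hot feature vector of s
    # Successor prediction error: phi - M^T h with h = phi(s) - gamma * phi(s_next)
    delta = phi + gamma * M[s_next] - M[s]
    M[s] += alpha * delta                     # only the row of the visited state moves
    return delta

gamma, alpha = 0.9, 0.1     # alpha is an assumed TD learning rate
n_states = 3
M = np.eye(n_states)        # assumed initial SR
s = 0
for _ in range(6000):       # walk around a hypothetical ring 0 -> 1 -> 2 -> 0
    s_next = (s + 1) % n_states
    td_sr_update(M, s, s_next, gamma, alpha)
    s = s_next

# Closed-form SR of the same ring, for comparison.
T = np.roll(np.eye(n_states), 1, axis=1)
M_exact = np.linalg.inv(np.eye(n_states) - gamma * T)
```

Each update changes a single row of `M`, so information about a changed transition reaches predecessor states only when those states are visited again. This locality is the property that the probabilistic treatment in the next section relaxes.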

### 3.2 Learning a probabilistic SR using a Kalman Filter

The algorithm described above produces a point estimate of the SR. While useful for approximating expected value, it cannot express uncertainty in these estimates.
In order to derive a probabilistic interpretation of the SR, we assume that the agent has an internal generative model of how sensory data are generated from the SR parameters that can be learned with KTD Geist2010KalmanDifferences; Gershman2015ALearning.
This model consists of a *prior distribution* on the (hidden) parameters, $p(m_0)$ (where $m$ is the SR matrix $M$ reshaped into a vector), an *evolution process* on the parameters, $p(m_t \mid m_{t-1})$, and a distribution $p(\phi_t \mid m_t, h_t)$ of observed (one-hot) feature vectors $\phi_t$ given the current parameters and the observed feature difference $h_t = \phi_t - \gamma\, \phi_{t+1}$. As with earlier work on KTD, we assume a Gaussian model: $m_0 \sim \mathcal{N}(\hat{m}_0, \Sigma_0)$, $m_t \mid m_{t-1} \sim \mathcal{N}(m_{t-1}, P_v)$ and $\phi_t \mid m_t, h_t \sim \mathcal{N}(M_t^\top h_t, P_n)$,
where $\Sigma_0$ is the prior covariance between SR matrix entries, $P_v$ is the process covariance, describing how the evolution of different parameters covaries, and $P_n$ is the observation covariance, describing covariance in the observations. $\Sigma_0$, $P_v$ and $P_n$ are set by the practitioner (see Table 1).

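The purpose of the Kalman Filter is to infer a posterior distribution over the hidden state $m_t$ given the observations $\phi_{1:t}$:

$$p(m_t \mid \phi_{1:t}) \propto p(\phi_t \mid m_t)\, p(m_t \mid \phi_{1:t-1}) \tag{6}$$

Under the Gaussian model described above, this posterior distribution is Gaussian with mean $\hat{m}_{t|t}$ and covariance $\Sigma_{t|t}$, parameters which will be estimated by the Kalman Filter. To set up the filter, we specify an *evolution equation* describing how the hidden parameters (the SR) evolve over time and an *observation equation* describing how the observation $\phi_t$ relates to our hidden parameters. These two equations comprise the *state-space formulation* for KTD-SR:

$$\begin{aligned} m_t &= m_{t-1} + v_t \\ \phi_t &= (h_t^\top \otimes I)\, m_t + n_t \end{aligned} \tag{7}$$

where $v_t$ is the *process noise* and $n_t$ the *observation noise*, $\otimes$ denotes the Kronecker product and $I$ the identity matrix. With $m_t$ formed by stacking the rows of $M_t$, the observation term $(h_t^\top \otimes I)\, m_t$ equals $M_t^\top h_t$. We will start from the assumption that the process noise is white, meaning that $\hat{m}_{t|t-1} = \hat{m}_{t-1|t-1}$, i.e. the expected mean SR at time $t$ equals the estimated SR at time $t-1$.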
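The role of the Kronecker product in the observation equation can be checked numerically. The sketch below assumes that the SR matrix is vectorised by stacking its rows (NumPy's default `reshape` order) and that the observation matrix is $h_t^\top \otimes I$; under those assumptions the product with the vectorised SR reproduces the difference between two successive SR rows. The random matrix `M` is a stand-in for an arbitrary SR estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 4, 0.9
M = rng.normal(size=(n, n))           # arbitrary stand-in for an SR estimate
m = M.reshape(-1)                     # rows of M stacked into one vector

phi_t, phi_next = np.eye(n)[1], np.eye(n)[2]   # one-hot features of states 1 and 2
h = phi_t - gamma * phi_next          # discounted feature difference
H = np.kron(h[None, :], np.eye(n))    # observation matrix, shape (n, n * n)

pred_kron = H @ m                     # prediction via the vectorised form
pred_direct = M.T @ h                 # same prediction, i.e. M[1] - gamma * M[2]
```

Writing the observation as a linear map of the vectorised SR is what lets a single covariance matrix over all $N^2$ entries enter the filter.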

The Kalman Filter keeps track of the mean and covariance of the posterior (6). At each timestep, the parameters of the posterior are updated using the Kalman Filter equations:

$$K_t = P_{m\delta_t}\, P_{\delta_t}^{-1} \tag{8}$$

$$\hat{m}_{t|t} = \hat{m}_{t|t-1} + K_t\, \delta_t \tag{9}$$

$$\Sigma_{t|t} = \Sigma_{t|t-1} - K_t\, P_{\delta_t}\, K_t^\top \tag{10}$$

where $K_t$ is the Kalman gain, $\delta_t = \phi_t - (h_t^\top \otimes I)\, \hat{m}_{t|t-1}$ is the successor prediction error, $P_{m\delta_t}$ is the covariance between the parameters and the prediction error, and $P_{\delta_t}$ is the covariance of the prediction error. The notation $\Sigma_{t|t-1}$ means that the estimate of the parameter covariance is conditioned on all observations until time $t-1$ (see Geist2010KalmanDifferences). Importantly, and in contrast to standard TD updates for the SR Dayan1993ImprovingRepresentation, the Kalman gain is stimulus specific, as it depends on the ratio between the covariance in the parameters and the covariance in the observations. This allows for a principled weighting of prior knowledge and incoming data. Note also that the Kalman gain is a matrix, meaning that each entry of the successor representation gets updated with a different gain. See Algorithm 1 for a full description of the method, including how these quantities are computed.

In summary, we have introduced a method of handling uncertainty over SR estimates. This allows for an efficient combination of prior knowledge and incoming information when updating the SR estimates. Furthermore, it allows us to estimate dependencies between different entries in the SR that inform SR updates. This permits non-local updates which, in the case of KTD for value estimation, have proven to better explain animal behaviour than the strictly local updates of vanilla TD Gershman2015ALearning. We explore a possible role for non-local updates in the following section.
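A single filter step can be sketched as follows. This is a hedged illustration rather than the paper's Algorithm 1: it assumes the standard linear Kalman filter expressions for the two covariances (parameter covariance times the transposed observation matrix, and its projection plus observation noise), a row-stacked SR vector, and hypothetical values for the prior, process and observation covariances. The function name `ktd_sr_step` and the three-state ring are likewise illustrative.

```python
import numpy as np

def ktd_sr_step(m, Sigma, s, s_next, gamma, P_v, P_n):
    """One Kalman update of the vectorised SR (linear-Gaussian sketch)."""
    n = P_n.shape[0]
    phi = np.eye(n)[s]
    h = phi - gamma * np.eye(n)[s_next]
    H = np.kron(h[None, :], np.eye(n))       # observation matrix, shape (n, n * n)
    Sigma_pred = Sigma + P_v                 # random-walk evolution inflates uncertainty
    delta = phi - H @ m                      # successor prediction error
    P_md = Sigma_pred @ H.T                  # covariance of parameters and prediction error
    P_d = H @ P_md + P_n                     # covariance of the prediction error
    K = P_md @ np.linalg.inv(P_d)            # Kalman gain, shape (n * n, n)
    m_new = m + K @ delta                    # posterior mean
    Sigma_new = Sigma_pred - K @ P_d @ K.T   # posterior covariance
    return m_new, Sigma_new, K

n, gamma = 3, 0.9
P_v = 1e-3 * np.eye(n * n)   # assumed process covariance
P_n = np.eye(n)              # assumed observation covariance
m0 = np.eye(n).reshape(-1)   # assumed prior SR (identity), rows stacked
Sigma0 = np.eye(n * n)       # assumed prior covariance

# A single observed transition 0 -> 1.
m1, Sigma1, K = ktd_sr_step(m0, Sigma0, 0, 1, gamma, P_v, P_n)
M0, M1 = m0.reshape(n, n), m1.reshape(n, n)

# Repeated updates on a hypothetical ring 0 -> 1 -> 2 -> 0.
m, Sigma, s = m0, Sigma0, 0
for _ in range(9000):
    s_next = (s + 1) % n
    m, Sigma, _ = ktd_sr_step(m, Sigma, s, s_next, gamma, P_v, P_n)
    s = s_next

M_exact = np.linalg.inv(np.eye(n) - gamma * np.roll(np.eye(n), 1, axis=1))
err_init = np.linalg.norm(M0 - M_exact)
err_final = np.linalg.norm(m.reshape(n, n) - M_exact)
```

In this sketch the single transition from state 0 to state 1 changes the SR rows of both states, whereas a tabular TD update would change only the row of state 0. Once off-diagonal covariance has accumulated, the same mechanism can also move rows of states that are not part of the observed transition.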

### 3.3 Partial Transition Revaluation Simulations

A key prediction of standard TD-SR learning is that “reward revaluation” should be supported while “transition revaluation” should not. Momennejad2016 tested this in humans. In the first phase of their experiment, participants learned two different sequences of states terminating in different reward amounts: 2→4→6→\$1 and 1→3→5→\$10 (see Figure 1B). In the next stage, half of the participants were exposed to the transition revaluation condition, observing novel 4→5→\$10 and 3→6→\$1 transitions. The other half experienced “reward revaluation” in the form of novel reward amounts 6→\$10 and 5→\$1 (Figure 1A). Importantly, the novel experiences start from intermediate states such that transitions from 1 or 2 are not seen following phase 1. While participants were significantly better at reward revaluation than transition revaluation, they were capable of some transition revaluation as well (Figure 1C). Accordingly, the authors proposed a hybrid SR model: an SR-TD agent that is also endowed with capacity for replaying experienced transitions (Figure 1F). This permits updating of the SR vectors of states 1 and 2 through simulated experience.

Table 1: Model parameters.

Name | Symbol | Value |
---|---|---|
Discount factor | $\gamma$ | 0.9 |
Process covariance | | |
Observation covariance | | |
Prior covariance | | |
Prior SR | | |
Rescorla Wagner learning rate | | 0.1 |
Number of trials per phase | | 50 |

Here, we simulate this experiment and find that the probabilistic KTD-SR accounts for partial transition revaluation even without replay (Figure 1D). KTD-SR correctly learns the SR matrix after phase 1 (Figure 1E) as well as an estimate of the covariance between all entries in the SR matrix, $\Sigma$. Unlike TD-SR, KTD-SR uses the covariance matrix to estimate the Kalman gain and uses that to update the whole matrix. This means that after seeing the transition 3→6, it updates not just $M(3,6)$ but also $M(1,6)$ because these entries have historically covaried (same for $M(4,5)$ and $M(2,5)$) (Figure 1F). To estimate direct reward $R(s)$, the agent uses a Rescorla-Wagner rule Rescorla1972ANonreinforcement. Model parameters are listed in Table 1 and experimental parameters are kept the same as in Momennejad2016.
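The structure of this simulation can be sketched as below. It is a simplified illustration under stated assumptions, not a reproduction of the reported results: the covariance values, the identity prior SR and the TD-SR learning rate `alpha` are placeholders (the corresponding Table 1 entries are not available in this text), rewards and the Rescorla-Wagner stage are omitted, states 1 to 6 are indexed 0 to 5, and the end of a trial is modelled by dropping the successor term. The linear Kalman update and all function names are assumptions as well.

```python
import numpy as np

GAMMA, N = 0.9, 6   # states 1..6 of the task are indexed 0..5 here

def h_vec(s, s_next):
    """Discounted feature difference; s_next=None marks the end of a trial."""
    h = np.eye(N)[s].copy()
    if s_next is not None:
        h[s_next] -= GAMMA
    return h

def td_step(M, s, s_next, alpha=0.1):
    """Tabular TD-SR update (alpha is an assumed learning rate)."""
    M[s] += alpha * (np.eye(N)[s] - M.T @ h_vec(s, s_next))

def ktd_step(m, Sigma, s, s_next, P_v, P_n):
    """Linear-Gaussian Kalman update of the row-stacked SR vector."""
    H = np.kron(h_vec(s, s_next)[None, :], np.eye(N))
    Sigma_pred = Sigma + P_v
    delta = np.eye(N)[s] - H @ m
    P_md = Sigma_pred @ H.T
    P_d = H @ P_md + P_n
    K = P_md @ np.linalg.inv(P_d)
    return m + K @ delta, Sigma_pred - K @ P_d @ K.T

def run_phase(paths, M_td, m, Sigma, P_v, P_n, n_trials=50):
    """Expose both learners to the same state sequences."""
    for _ in range(n_trials):
        for path in paths:
            for s, s_next in zip(path, path[1:] + [None]):
                td_step(M_td, s, s_next)
                m, Sigma = ktd_step(m, Sigma, s, s_next, P_v, P_n)
    return m, Sigma

phase1 = [[0, 2, 4], [1, 3, 5]]   # 1->3->5 and 2->4->6
phase2 = [[2, 5], [3, 4]]         # transition revaluation: 3->6 and 4->5

M_td = np.eye(N)
m, Sigma = np.eye(N).reshape(-1), np.eye(N * N)   # assumed priors
P_v, P_n = 1e-3 * np.eye(N * N), np.eye(N)        # assumed covariances

m, Sigma = run_phase(phase1, M_td, m, Sigma, P_v, P_n)
M_td_1, M_ktd_1 = M_td.copy(), m.reshape(N, N).copy()
m, Sigma = run_phase(phase2, M_td, m, Sigma, P_v, P_n)
M_td_2, M_ktd_2 = M_td.copy(), m.reshape(N, N)

# How much did the SR rows of the two start states change during phase 2?
td_change = np.abs(M_td_2[:2] - M_td_1[:2]).max()
ktd_change = np.abs(M_ktd_2[:2] - M_ktd_1[:2]).max()
```

Because the start states are never visited in phase 2, the TD-SR rows for them cannot change, so `td_change` is exactly zero. The Kalman learner moves those rows through the covariance accumulated in phase 1, so `ktd_change` is nonzero. How closely the size and direction of that change match the behavioural data depends on the parameter values, which are assumed here.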

## 4 Discussion

The SR constitutes a middle ground between model-based and model-free RL algorithms by separating reward representations from cached long-run state predictions. Here we learn a probabilistic SR model using KTD that supports principled handling of uncertainty about state predictions and inter-dependencies between these predictions. We exploit this feature to show that, unlike standard TD-SR, KTD-SR can perform partial transition revaluation. In later work, we plan to test our model on other tasks that could benefit from KTD-SR in a similar way, such as policy revaluation (a well-known weak spot of TD-SR; Barreto2016).

We note the relative strengths and weaknesses of KTD-SR when compared to a hybrid MB-SR approach. Replay requires a buffer to store experienced episodes and a sufficient number of replays for information to propagate throughout the SR model. While KTD-SR can incorporate information about long-range dependencies in a single update, it must learn and store a large covariance matrix (although dimensionality reduction can reduce this burden; Fisher1998DevelopmentFilter). There is compelling evidence in favor of both replay Carr2011HippocampalRetrieval; Olafsdottir2018ThePlanning and probabilistic representations Ma2006BayesianCodes driving behavior. Future work will consider how the relative tradeoffs of these approaches constrain hypotheses.

Probabilistic models provide a number of advantages for RL in terms of optimal credit assignment Kruschke2008BayesianLearning, uncertainty-minimising exploration Dearden1998BayesianQ-learning, and arbitration between competing models Daw2005Uncertainty-basedControl. Distributional RL-trained neural network agents achieve state-of-the-art performance Bellemare2017ALearning. Furthermore, a range of animal learning findings suggest that animals are capable of probabilistic reasoning Gershman2015ALearning; Kruschke2008BayesianLearning; Courville2006BayesianWorld. Future work will involve exploring these advantages in the context of SR learning Gardner2018RethinkingError.

We make several assumptions in order to make this model tractable. The Gaussian assumption is clearly violated in the case of one-hot state vectors (i.e. neither $\phi$ nor $M$ should have negative entries). However, the model is sufficiently expressive that a good approximation can still be found, and a “successor feature” model could be applied over arbitrary features for which the Gaussian assumption might hold. The random walk process noise is useful for capturing slow changes in the environment, but might be ill-suited for step changes or sub-optimal when the dynamics are predictable. While we assume deterministic transitions and linear function approximation in this work, it is straightforward to extend the model to stochastic transitions and nonlinear function approximation with a “coloured noise” approach Geist2010KalmanDifferences.

## 5 Acknowledgments

This work is funded by the Gatsby Foundation and the Wellcome Trust. We thank Samuel Gershman, Talfan Evans, Eszter Vértes, Steven Hansen and Matthew Botvinick for helpful comments and suggestions.