Organizing Experience: A Deeper Look at Replay Mechanisms for Sample-based Planning in Continuous State Domains
Model-based strategies for control are critical to obtain sample efficient learning. Dyna is a planning paradigm that naturally interleaves learning and planning, by simulating one-step experience to update the action-value function. This elegant planning strategy has been mostly explored in the tabular setting. The aim of this paper is to revisit sample-based planning, in stochastic and continuous domains with learned models. We first highlight the flexibility afforded by a model over Experience Replay (ER). Replay-based methods can be seen as stochastic planning methods that repeatedly sample from a buffer of recent agent-environment interactions and perform updates to improve data efficiency. We show that a model, as opposed to a replay buffer, is particularly useful for specifying which states to sample from during planning, such as predecessor states that propagate information in reverse from a state more quickly. We introduce a semi-parametric model learning approach, called Reweighted Experience Models (REMs), that makes it simple to sample next states or predecessors. We demonstrate that REM-Dyna exhibits similar advantages over replay-based methods in learning in continuous state problems, and that the performance gap grows when moving to stochastic domains, of increasing size.
Experience replay has become nearly ubiquitous in modern large-scale, deep reinforcement learning systems [Schaul et al., 2016]. The basic idea is to store an incomplete history of previous agent-environment interactions in a transition buffer. During planning, the agent selects a transition from the buffer and updates the value function as if the sample were generated online; that is, the agent replays the transition. There are many potential benefits of this approach, including stabilizing potentially divergent non-linear Q-learning updates, and mimicking the effect of multi-step updates as in eligibility traces.
Experience replay (ER) is like a model-based RL system, where the transition buffer acts as a model of the world [Lin, 1992]. Using the data as your model avoids model errors that can cause bias in the updates (cf. [Bagnell and Schneider, 2001]). One of ER’s most distinctive attributes as a model-based planning method is that it does not perform multistep rollouts of hypothetical trajectories according to a model; rather, previous agent-environment transitions are replayed randomly or with priority from the transition buffer. Trajectory sampling approaches such as PILCO [Deisenroth and Rasmussen, 2011], Hallucinated-Dagger [Talvitie, 2017], and CPSRs [Hamilton et al., 2014], unlike ER, can roll out unlikely trajectories, ending up in hypothetical states that do not match any real state in the world when the model is wrong [Talvitie, 2017]. ER’s stochastic one-step planning approach was later adopted by Sutton’s Dyna architecture [Sutton, 1991].
Despite the similarities between Dyna and ER, there have been no comprehensive, direct empirical comparisons of the two and their underlying design decisions. ER maintains a buffer of transitions for replay, and Dyna a search-control queue composed of stored states and actions from which to sample. There are many possibilities for how to add, remove and select samples from either ER’s transition buffer or Dyna’s search-control queue. It is not hard to imagine situations where a Dyna-style approach could be better than ER. For example, because Dyna models the environment, predecessors (states leading into high-priority states) can be added to the queue, unlike in ER. Additionally, Dyna can choose to simulate on-policy samples, whereas ER can only replay (likely off-policy) samples previously stored. In non-stationary problems, small changes can be quickly recognized and corrected in the model. On the other hand, these small changes might result in wholesale changes to the policy, potentially invalidating many transitions in ER’s buffer. It remains to be seen if these differences manifest empirically, or if the additional complexity of Dyna is worthwhile.
In this paper, we develop a novel semi-parametric Dyna algorithm, called REM-Dyna, that provides some of the benefits of both Dyna-style planning and ER. We highlight criteria for learned models used within Dyna, and propose Reweighted Experience Models (REMs) that are data-efficient, efficient to sample and can be learned incrementally. We investigate the properties of both ER and REM-Dyna, and highlight cases where ER can fail, but REM-Dyna is robust. Specifically, this paper contributes both (1) a new method extending Dyna to continuous-state domains, significantly outperforming previous attempts [Sutton et al., 2008], and (2) a comprehensive investigation of the design decisions critical to the performance of one-step, sample-based planning methods for reinforcement learning with function approximation. An Appendix is publicly available on arXiv, with the theorem proof and additional algorithmic and experimental details.
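We formalize an agent’s interaction with its environment as a discrete time Markov Decision Process (MDP). On each time step $t$, the agent observes the state of the MDP $S_t \in \mathcal{S}$, and selects an action $A_t \in \mathcal{A}$, causing a transition to a new state $S_{t+1} \in \mathcal{S}$ and producing a scalar reward on the transition $R_{t+1} \in \mathbb{R}$. The agent’s objective is to find an optimal policy $\pi: \mathcal{S} \times \mathcal{A} \to [0, 1]$, which maximizes the expected return $Q_\pi(s, a)$ for all $s, a$, where $G_t \doteq R_{t+1} + \gamma(S_t, A_t, S_{t+1})\, G_{t+1}$, $\gamma: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$, and $Q_\pi(s, a) = \mathbb{E}[G_t \mid S_t = s, A_t = a; A_{t+1:\infty} \sim \pi]$, with future states and rewards sampled according to the one-step dynamics of the MDP. The generalization to a discount function allows for a unified specification of episodic and continuing tasks [White, 2017], both of which are considered in this work.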
In this paper we are concerned with model-based approaches to finding optimal policies. In all approaches we consider here the agent forms an estimate of the value function from data: $\hat{q}_\pi(s, a, \theta) \approx \mathbb{E}[G_t \mid S_t = s, A_t = a]$. The value function is parameterized by $\theta \in \mathbb{R}^n$, allowing both linear and non-linear approximations. We consider sample models that, given an input state and action, need only output one possible next state and reward, sampled according to the one-step dynamics of the MDP: $M: \mathcal{S} \times \mathcal{A} \to \mathcal{S} \times \mathbb{R}$.
In this paper, we focus on stochastic one-step planning methods, where one-step transitions are sampled from a model to update an action-value function. The agent interacts with the environment on each time step, selecting actions according to its current policy (e.g., $\epsilon$-greedy with respect to $\hat{q}_\pi$), observing next states and rewards, and updating $\hat{q}_\pi$. Additionally, the agent also updates a model with these observed sample transitions on each time step. After updating the value function and the model, the agent executes a fixed number of planning steps. On each planning step, the agent samples a start state and action in some way (called search control), then uses the model to simulate the next state and reward. Using this hypothetical transition the agent updates $\hat{q}_\pi$ in the usual way. In this generic framework, the agent can interleave learning, planning, and acting, all in real time. Two well-known implementations of this framework are ER [Lin, 1992], and the Dyna architecture [Sutton, 1991].
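The generic loop described above can be sketched in a few lines for the tabular case. This is a minimal illustration, not the paper's algorithm: the function name `dyna_q`, the `env_step` callback, the deterministic dictionary model, and uniform search control over visited state-action pairs are all assumptions made for the example.

```python
import random
from collections import defaultdict


def dyna_q(env_step, start_state, actions, n_steps=2000, planning_steps=5,
           alpha=0.5, gamma=0.95, epsilon=0.1, seed=0):
    """Tabular one-step planning loop: act, learn, update model, then plan."""
    rng = random.Random(seed)
    q = defaultdict(float)   # action values q[(s, a)]
    model = {}               # assumed deterministic model: (s, a) -> (s', r, done)
    visited = []             # search-control store of visited (s, a) pairs

    def greedy(s):
        best = max(q[(s, a)] for a in actions)
        return rng.choice([a for a in actions if q[(s, a)] == best])

    def update(s, a, s2, r, done):
        # One-step Q-learning update, used for both real and simulated transitions.
        target = r if done else r + gamma * max(q[(s2, b)] for b in actions)
        q[(s, a)] += alpha * (target - q[(s, a)])

    s = start_state
    for _ in range(n_steps):
        a = rng.choice(actions) if rng.random() < epsilon else greedy(s)
        s2, r, done = env_step(s, a)
        update(s, a, s2, r, done)            # learning from real experience
        if (s, a) not in model:
            visited.append((s, a))
        model[(s, a)] = (s2, r, done)        # model learning
        for _ in range(planning_steps):      # planning
            ps, pa = rng.choice(visited)     # search control (uniform here)
            update(ps, pa, *model[(ps, pa)])
        s = start_state if done else s2
    return q


def chain_step(s, a):
    """Hypothetical 5-state chain: reward 1 on reaching the terminal state 4."""
    s2 = min(max(s + a, 0), 4)
    done = s2 == 4
    return s2, (1.0 if done else 0.0), done
```

Calling `dyna_q(chain_step, 0, [-1, 1])` returns the learned action values. Replacing the uniform choice in the planning loop with a prioritized or predecessor-based choice changes only the search-control line.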
3 One-step Sample-based Planning Choices
There are subtle design choices in the construction of stochastic, one-step, sample-based planning methods that can significantly impact performance. These include how to add states and actions to the search-control queue for Dyna, how to select states and actions from the queue, and how to sample next states. These choices influence the design of our REM algorithm, and so we discuss them in this section.
One important choice for Dyna-style methods is whether to sample a next state, or compute an expected update over all possible transitions. A sample-based planner samples $s', r, \gamma$, given $s, a$, and stochastically updates $\hat{q}_\pi(s, a, \theta)$. An alternative is to approximate full dynamic programming updates, to give an expected update, as done by stochastic factorization approaches [Barreto et al., 2011; Kveton and Theocharous, 2012; Barreto et al., 2014; Yao et al., 2014; Barreto et al., 2016; Pires and Szepesvári, 2016], kernel-based RL (KBRL) [Ormoneit and Sen, 2002], and kernel mean embeddings (KME) for RL [Grunewalder et al., 2012; Van Hoof et al., 2015; Lever et al., 2016]. Linear Dyna [Sutton et al., 2008] computes an expected next reward and expected next feature vector for the update, which corresponds to an expected update when $\hat{q}_\pi$ is a linear function of features. We advocate for a sampled update, because approximate dynamic programming updates, such as KME and KBRL, are typically too expensive, couple the model and value function parameterization, and are designed for a batch setting. Computation can be more effectively used by sampling transitions.
There are many possible refinements to the search-control mechanism, including prioritization and backward search. For tabular domains, it is feasible to simply store all possible states and actions from which to simulate. In continuous domains, however, care must be taken to order and delete stored samples. A basic strategy is to simply store recent transitions for the transition buffer in ER, or states and actions for the search-control queue in Dyna. This, however, provides little information about which samples would be most beneficial for learning. Prioritizing how samples are drawn, based on absolute TD-error $|\delta|$, has been shown to be useful for both tabular Dyna [Sutton and Barto, 1998], and ER with function approximation [Schaul et al., 2016]. When the buffer or search-control queue gets too large, one then must also decide whether to delete transitions based on recency or priority. In the experiments, we explore the efficacy of recency versus priorities for adding and deleting.
ER is limited in using alternative criteria for search-control, such as backward search. A model allows more flexibility in obtaining useful states and actions to add to the search-control queue. For example, a model can be learned to simulate predecessor states: states leading into (a high-priority) $s'$ for a given action $a$. Predecessor states can be added to the search-control queue during planning, facilitating a type of backward search. The ideas of backward search and prioritization were introduced together for tabular Dyna [Peng and Williams, 1993; Moore and Atkeson, 1993]. Backward search can only be applied in ER in a limited way because its buffer is unlikely to contain transitions from multiple predecessor states to the current state in planning. Schaul et al. [2016] proposed a simple heuristic to approximate prioritization with predecessors, by updating the priority of the most recent transition on the transition buffer to be at least as large as the transition that came directly after it. This heuristic, however, does not allow a systematic backward search.
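To make the combination of prioritization and backward search concrete, the following sketch runs prioritized planning on a small tabular problem and pushes the predecessors of each updated state onto the queue. It is an illustration in the spirit of tabular prioritized sweeping, not the paper's continuous-state method; the function name, the deterministic `model` dictionary, the explicit `predecessors` map and the threshold are assumptions.

```python
import heapq
from collections import defaultdict


def plan_with_predecessors(q, model, predecessors, actions, start_pair,
                           planning_steps=10, alpha=1.0, gamma=0.9,
                           threshold=1e-6):
    """Prioritized planning in which predecessors of updated states join the queue.

    model[(s, a)] -> (s', r) is an assumed deterministic tabular model and
    predecessors[s'] lists the (s, a) pairs known to lead into s'.
    """
    def td_error(s, a):
        s2, r = model[(s, a)]
        return r + gamma * max(q[(s2, b)] for b in actions) - q[(s, a)]

    # heapq is a min-heap, so priorities are negated to pop the largest first.
    queue = [(-abs(td_error(*start_pair)), start_pair)]
    updated = []
    for _ in range(planning_steps):
        if not queue:
            break
        _, (s, a) = heapq.heappop(queue)
        q[(s, a)] += alpha * td_error(s, a)
        updated.append((s, a))
        # Backward search: pairs leading into s may now have a large TD error.
        for ps, pa in predecessors[s]:
            priority = abs(td_error(ps, pa))
            if priority > threshold:
                heapq.heappush(queue, (-priority, (ps, pa)))
    return updated
```

On a chain where only the last transition is rewarded, three planning steps starting from the rewarded transition propagate value all the way back to the first state, which uniform replay would achieve only by chance.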
A final possibility we consider is using the current policy to select the actions during search control. Conventionally, Dyna draws the action from the search-control queue using the same mechanism used to sample the state. Alternatively, we can sample the state $s$ via priority or recency, and then query the model using the action the learned policy would select in that state, $\pi(s)$. This approach has the advantage that planning focuses on actions that the agent currently estimates to be the best. In the tabular setting, this on-policy sampling can result in dramatic efficiency improvements for Dyna [Sutton and Barto, 1998], while Gu et al. [2016] report improvement from on-policy sampling of transitions, in a setting with multi-step rollouts. ER cannot emulate on-policy search control because it replays full transitions $(s, a, s', r, \gamma)$, and cannot query for an alternative transition if a different action than $a$ is taken.
4 Reweighted Experience Models for Dyna
In this section, we highlight criteria for selecting amongst the variety of available sampling models, and then propose a semi-parametric model—called Reweighted Experience Models—as one suitable model that satisfies these criteria.
4.1 Generative Models for Dyna
Generative models are a fundamental tool in machine learning, providing a wealth of possible model choices. We begin by specifying our desiderata for online sample-based planning and acting. First, the model learning should be incremental and adaptive, because the agent incrementally interleaves learning and planning. Second, the models should be data-efficient, in order to achieve the primary goal of improving data-efficiency of learning value functions. Third, due to policy non-stationarity, the models need to be robust to forgetting: if the agent stays in a part of the world for quite some time, the learning algorithm should not overwrite—or forget—the model in other parts of the world. Fourth, the models need to be able to be queried as conditional models. Fifth, sampling should be computationally efficient, since a slow sampler will reduce the feasible number of planning steps.
Density models are typically learned as a mixture of simpler functions or distributions. In the most basic case, a simple distributional form can be used, such as a Gaussian distribution for continuous random variables, or a categorical distribution for discrete random variables. For conditional distributions, $p(s' \mid s, a)$, the parameters to these distributions, like the mean and variance of $s'$, can be learned as a (complex) function of $s, a$. More general distributions can be learned using mixtures, such as mixture models or belief networks. A Conditional Gaussian Mixture Model, for example, could represent $p(s' \mid s, a) = \sum_{i=1}^{b} \alpha_i(s, a)\, \mathcal{N}(s' \mid \mu_i(s, a), \Sigma_i(s, a))$, where $\alpha_i$, $\mu_i$ and $\Sigma_i$ are (learned) functions of $s, a$. In belief networks, such as Boltzmann distributions, the distribution is similarly represented as a sum over hidden variables, but for more general functional forms over the random variables, such as energy functions. To condition on $s, a$, those variables in the network are fixed both for learning and sampling.
Kernel density estimators (KDE) are similar to mixture models, but are non-parametric: means in the mixture are the training data, with a uniform weighting: $\alpha_i = 1/T$ for $T$ samples. KDE and conditional KDE are consistent [Holmes et al., 2007], since the model is a weighting over observed data, providing low model bias. Further, KDE is data-efficient, easily enables conditional distributions, and is well understood theoretically and empirically. Unfortunately, it scales linearly in the data, which is not compatible with online reinforcement learning problems. Mixture models, on the other hand, learn a compact mixture and could scale, but are expensive to train incrementally and have issues with local minima.
Neural network models are another option, such as Generative Adversarial Networks [Goodfellow et al., 2014] and Stochastic Neural Networks [Sohn et al., 2015; Alain et al., 2016]. Many of the energy-based models, however, such as Boltzmann distributions, require computationally expensive sampling strategies [Alain et al., 2016]. Other networks, such as Variational Auto-encoders, sample inputs from a given distribution, to enable the network to sample outputs. These neural network models, however, have issues with forgetting [McCloskey and Cohen, 1989; French, 1999; Goodfellow et al., 2013], and require more intensive training strategies, often requiring experience replay themselves.
4.2 Reweighted Experience Models
We propose a semi-parametric model to take advantage of the properties of KDE and still scale with increasing experience. The key properties of REMs are that 1) it is straightforward to specify and sample both forward and reverse models for predecessors, $p(s' \mid s, a)$ and $p(s \mid s', a)$, using essentially the same model (the same prototypes); 2) they are data-efficient, requiring few parameters to be learned; and 3) they can provide sufficient model complexity, by allowing for a variety of kernels or metrics defining similarity.
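REMs consist of a subset of prototype transitions $\{(s_i, a_i, s'_i, r_i, \gamma_i)\}_{i=1}^{b}$, chosen from all $T$ transitions experienced by the agent, and their corresponding weights $\{c_i\}_{i=1}^{b}$. These prototypes are chosen to be representative of the transitions, based on a similarity given by a product kernel $k$:
$$p(s, a, s', r, \gamma \mid s_i, a_i, s'_i, r_i, \gamma_i) \doteq k\big((s, a, s', r, \gamma), (s_i, a_i, s'_i, r_i, \gamma_i)\big) \doteq k_s(s, s_i)\, k_a(a, a_i)\, k_{s',r,\gamma}\big((s', r, \gamma), (s'_i, r_i, \gamma_i)\big). \quad (1)$$
A product kernel is a product of separate kernels. It is still a valid kernel, but simplifies dependences and simplifies computing conditional densities, which are key for Dyna, both for forward and predecessor models. They are also key for obtaining a consistent estimate of the $\{c_i\}_{i=1}^{b}$, described below.
We first consider Gaussian kernels for simplicity. For states,
$$k_s(s, s_i) \doteq (2\pi)^{-d/2}\, |\mathbf{H}_s|^{-1/2} \exp\big(-\tfrac{1}{2}(s - s_i)^\top \mathbf{H}_s^{-1} (s - s_i)\big)$$
with covariance $\mathbf{H}_s$. For discrete actions, the similarity is an indicator: $k_a(a, a_i) = 1$ if $a = a_i$ and otherwise $0$. For next state, reward and discount, a Gaussian kernel is used for $k_{s',r,\gamma}$ with covariance $\mathbf{H}_{s',r,\gamma}$. We set the covariance matrix $\mathbf{H}_s$ from $\Sigma_s$, where $\Sigma_s$ is a sample covariance, and use a conditional covariance for $(s', r, \gamma)$.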
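As a rough illustration of a product kernel over transitions, the sketch below multiplies a Gaussian kernel on states, an indicator kernel on discrete actions, and a Gaussian kernel on the outcome vector (next state, reward, discount). The diagonal bandwidths, the omission of the Gaussian normalizing constant, and all function names are assumptions made for the example.

```python
import math


def gaussian_kernel(x, center, bandwidths):
    """Unnormalized Gaussian kernel with an assumed diagonal covariance.

    Returns a value in [0, 1]; it equals 1 when x coincides with center.
    """
    sq = sum((xi - ci) ** 2 / h for xi, ci, h in zip(x, center, bandwidths))
    return math.exp(-0.5 * sq)


def action_kernel(a, a_i):
    """Indicator similarity for discrete actions."""
    return 1.0 if a == a_i else 0.0


def product_kernel(transition, prototype, h_s, h_out):
    """Similarity of a transition (s, a, out) to a prototype (s_i, a_i, out_i).

    `out` stacks next state, reward and discount into one tuple.
    """
    s, a, out = transition
    s_i, a_i, out_i = prototype
    return (gaussian_kernel(s, s_i, h_s)
            * action_kernel(a, a_i)
            * gaussian_kernel(out, out_i, h_out))
```

Because the kernel factorizes, the state-action part can be evaluated on its own when only a conditional query is needed, which is the property the reweighting scheme relies on.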
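First consider a KDE model, for comparison, where all experience is used to define the distribution
$$p(s, a, s', r, \gamma) = \frac{1}{T} \sum_{i=1}^{T} k\big((s, a, s', r, \gamma), (s_i, a_i, s'_i, r_i, \gamma_i)\big).$$
This estimator puts higher density around more frequently observed transitions. A conditional estimator is similarly intuitive, and also a consistent estimator [Holmes et al., 2007],
$$p(s', r, \gamma \mid s, a) = \frac{\sum_{i=1}^{T} k_s(s, s_i)\, k_a(a, a_i)\, k_{s',r,\gamma}\big((s', r, \gamma), (s'_i, r_i, \gamma_i)\big)}{\sum_{i=1}^{T} k_s(s, s_i)\, k_a(a, a_i)}.$$
The experience $(s_i, a_i)$ similar to $(s, a)$ has higher weight in the conditional estimator: distributions centered at $(s'_i, r_i, \gamma_i)$ contribute more to specifying $p(s', r, \gamma \mid s, a)$. Similarly, it is straightforward to specify the conditional density $p(s \mid s', a)$.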
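The weighting in a conditional KDE can be illustrated with scalar states. The sketch below computes the normalized weight each stored transition receives for a query state and action; the scalar state, the single bandwidth `h`, the unnormalized Gaussian, and the function name are assumptions for the example.

```python
import math


def conditional_kde_weights(s, a, data, h):
    """Normalized weights over stored transitions (s_i, a_i, s'_i) for a query (s, a).

    Each weight is proportional to a Gaussian similarity between s and s_i,
    and is zero when the stored action differs from the queried one.
    """
    raw = [math.exp(-0.5 * (s - s_i) ** 2 / h) if a == a_i else 0.0
           for (s_i, a_i, _) in data]
    total = sum(raw)
    if total == 0.0:
        raise ValueError("no stored transition is similar to this query")
    return [w / total for w in raw]
```

Every stored transition must be visited for each query, which is the linear scaling in the data noted earlier and the motivation for keeping only a fixed budget of prototypes.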
When only prototype transitions are stored, joint and conditional densities can be similarly specified, but prototypes must be weighted to reflect the density in that area. We therefore need a method to select prototypes and to compute weightings. Selecting representative prototypes or centers is a very active area of research, and we simply use a recent incremental and efficient algorithm designed to select prototypes [Schlegel et al., 2017]. For the reweighting, however, we can design a more effective weighting exploiting the fact that we will only query the model using conditional distributions.
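Reweighting approach. We develop a reweighting scheme that takes advantage of the fact that Dyna only requires conditional models. Because $p(s', r, \gamma \mid s, a) = p(s, a, s', r, \gamma) / p(s, a)$, a simple KDE strategy is to estimate coefficients $p_i$ on the entire transition $(s_i, a_i, s'_i, r_i, \gamma_i)$ and $q_i$ on $(s_i, a_i)$, to obtain accurate densities $p(s, a, s', r, \gamma)$ and $p(s, a)$. However, there are several disadvantages to this approach. The $p_i$ and $q_i$ need to constantly adjust, because the policy is changing. Further, when adding and removing prototypes incrementally, the other $p_i$ and $q_i$ need to be adjusted. Finally, $p_i$ and $q_i$ can get very small, depending on visitation frequency to a part of the environment, even if $p_i / q_i$ is not small. Rather, by directly estimating the conditional coefficients $c_i = p(s'_i, r_i, \gamma_i \mid s_i, a_i)$, we avoid these problems. The distribution $p(s', r, \gamma \mid s, a)$ is stationary even with a changing policy; each $c_i$ can converge even during policy improvement and can be estimated independently of the other $c_j$.
We can directly estimate $c_i$, because of the conditional independence assumption made by product kernels. To see why, for prototype $i$ in the product kernel in Equation (1),
$$p(s, a, s', r, \gamma \mid s_i, a_i, s'_i, r_i, \gamma_i) = p(s, a \mid s_i, a_i)\, p(s', r, \gamma \mid s'_i, r_i, \gamma_i).$$
Rewriting $p(s_i, a_i, s'_i, r_i, \gamma_i) = c_i\, p(s_i, a_i)$ and because $p(s, a \mid s_i, a_i)\, p(s_i, a_i) = p(s_i, a_i \mid s, a)\, p(s, a)$, we can rewrite the probability as
$$p(s, a, s', r, \gamma) = \sum_{i=1}^{b} c_i\, p(s_i, a_i)\, p(s, a \mid s_i, a_i)\, p(s', r, \gamma \mid s'_i, r_i, \gamma_i) = p(s, a) \sum_{i=1}^{b} c_i\, p(s_i, a_i \mid s, a)\, p(s', r, \gamma \mid s'_i, r_i, \gamma_i),$$
giving
$$p(s', r, \gamma \mid s, a) = \sum_{i=1}^{b} c_i\, p(s_i, a_i \mid s, a)\, k_{s',r,\gamma}\big((s', r, \gamma), (s'_i, r_i, \gamma_i)\big).$$
Now we simply need to estimate $c_i$. Again using the conditional independence property, we can prove the following.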
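Theorem 1. Let $\rho(t, i) \doteq k_s(s_t, s_i)\, k_a(a_t, a_i)$ be the similarity of $(s_t, a_t)$ for sample $t$ to $(s_i, a_i)$ for prototype $i$. Then
$$c_i = \arg\min_{c} \sum_{t=1}^{T} \Big(c - k_{s',r,\gamma}\big((s'_t, r_t, \gamma_t), (s'_i, r_i, \gamma_i)\big)\Big)^2 \rho(t, i)$$
is a consistent estimator of $p(s'_i, r_i, \gamma_i \mid s_i, a_i)$.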
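The proof for this theorem, and a figure demonstrating the difference between KDE and REM, are provided in the appendix. Though there is a closed form solution to this objective, we use an incremental stochastic update to avoid storing additional variables and for the model to be more adaptive. For each transition $(s_t, a_t, s'_t, r_t, \gamma_t)$, the $c_i$ are updated for each prototype as
$$c_i \leftarrow (1 - \rho(t, i))\, c_i + \rho(t, i)\, k_{s',r,\gamma}\big((s'_t, r_t, \gamma_t), (s'_i, r_i, \gamma_i)\big),$$
where $\rho(t, i) = k_s(s_t, s_i)\, k_a(a_t, a_i)$ is the state-action similarity of sample $t$ to prototype $i$.
The resulting REM model is
$$p(s', r, \gamma \mid s, a) \doteq \sum_{i=1}^{b} \beta_i(s, a)\, k_{s',r,\gamma}\big((s', r, \gamma), (s'_i, r_i, \gamma_i)\big), \quad \beta_i(s, a) \doteq \frac{c_i}{N(s, a)}\, k_s(s, s_i)\, k_a(a, a_i), \quad N(s, a) \doteq \sum_{j=1}^{b} c_j\, k_s(s, s_j)\, k_a(a, a_j).$$
To sample predecessor states, with $p(s \mid s', a)$, the same set of $b$ prototypes can be used, with a separate set of conditional weightings estimated as $c^r_i \leftarrow (1 - \rho^r(t, i))\, c^r_i + \rho^r(t, i)\, k_s(s_t, s_i)$ for $\rho^r(t, i) \doteq k_s(s'_t, s'_i)\, k_a(a_t, a_i)$.
Sampling from REMs. Conveniently, to sample from the REM conditional distribution, the similarity across next states and rewards need not be computed. Rather, only the coefficients $\beta_i(s, a)$ need to be computed. A prototype is sampled with probability $\beta_i(s, a)$; if prototype $j$ is sampled, then the density (Gaussian) centered around $(s'_j, r_j, \gamma_j)$ is sampled.
In the implementation, the normalization terms $(2\pi)^{-d/2} |\mathbf{H}|^{-1/2}$ in the Gaussian kernels are omitted, because as fixed constants they can be normalized out. All kernel values then are in $[0, 1]$, providing improved numerical stability and the straightforward initialization $c_i = 1$ for new prototypes. REMs are linear in the number of prototypes, for learning and sampling, with complexity per-step independent of the number of samples.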
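The learning and sampling steps can be sketched together for scalar states and next states. This toy class is not the authors' implementation: it assumes that each coefficient is moved toward the next-state kernel value with a step size equal to the state-action similarity, that new prototypes start with a coefficient of 1, that kernels are unnormalized Gaussians with fixed bandwidths, and it ignores rewards, discounts and prototype selection. The class and method names are hypothetical.

```python
import math
import random


class ScalarREM:
    """Toy reweighted experience model over scalar states (illustrative only)."""

    def __init__(self, h_s=0.01, h_next=0.01, seed=0):
        self.h_s, self.h_next = h_s, h_next
        self.prototypes = []            # list of (s_i, a_i, s'_i)
        self.c = []                     # conditional coefficients c_i
        self.rng = random.Random(seed)

    @staticmethod
    def _k(x, y, h):
        return math.exp(-0.5 * (x - y) ** 2 / h)   # unnormalized, in [0, 1]

    def _rho(self, s, a, i):
        s_i, a_i, _ = self.prototypes[i]
        return self._k(s, s_i, self.h_s) if a == a_i else 0.0

    def add_prototype(self, s, a, s_next):
        self.prototypes.append((s, a, s_next))
        self.c.append(1.0)              # assumed initialization for new prototypes

    def update(self, s, a, s_next):
        """Move each c_i toward the next-state similarity, weighted by rho."""
        for i, (_, _, sn_i) in enumerate(self.prototypes):
            rho = self._rho(s, a, i)
            k_next = self._k(s_next, sn_i, self.h_next)
            self.c[i] = (1.0 - rho) * self.c[i] + rho * k_next

    def beta(self, s, a):
        """Normalized prototype probabilities for the query (s, a), or None."""
        raw = [c_i * self._rho(s, a, i) for i, c_i in enumerate(self.c)]
        total = sum(raw)
        return [w / total for w in raw] if total > 0 else None

    def sample_next(self, s, a):
        """Pick a prototype by beta, then sample around its next state."""
        weights = self.beta(s, a)
        if weights is None:
            return None
        i = self.rng.choices(range(len(weights)), weights=weights)[0]
        return self.rng.gauss(self.prototypes[i][2], math.sqrt(self.h_next))
```

Note that sampling touches only the state-action similarities and the coefficients; the next-state kernel is needed only during learning. A reverse model for predecessors could reuse the same prototypes with a second coefficient list and the roles of state and next state swapped.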
Addressing issues with scaling with input dimension. In general, any nonnegative kernel that integrates to one is possible. There are realistic low-dimensional physical systems for which Gaussian kernels have been shown to be highly effective, such as in robotics [Deisenroth and Rasmussen, 2011]. Kernel-based approaches can, however, extend to high-dimensional problems with specialized kernels. For example, convolutional kernels for images have been shown to be competitive with neural networks [Mairal et al., 2014]. Further, learned similarity metrics or embeddings enable data-driven models, such as neural networks, to improve performance, by replacing the Euclidean distance. This combination of probabilistic structure from REMs and data-driven similarities from neural networks is a promising next step.
We first empirically investigate the design choices for ER’s buffer and Dyna’s search-control queue in the tabular setting. Subsequently, we examine the utility of REM-Dyna, our proposed model-learning technique, by comparing it with ER and other model learning techniques in the function approximation setting. Maintaining the buffer or queue involves determining how to add and remove samples, and how to prioritize samples. All methods delete the oldest samples. Our experiments (not shown here) showed that deleting samples of lowest priority, computed from TD error, is not effective in the problems we studied. We investigate three different settings:
1) Random: samples are drawn randomly.
2) Prioritized: samples are drawn probabilistically according to the absolute TD error of the transitions [Schaul et al., 2016, Equation 1] (exponent = 1).
3) Predecessors: same as Prioritized, and predecessors of the current state are also added to the buffer or queue.
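The Prioritized setting can be illustrated as follows. The sketch draws an index with probability proportional to the absolute TD error raised to an exponent, which is the proportional form of prioritized replay with exponent 1 as the default. The small constant `eps`, which keeps zero-error samples drawable, and the function names are assumptions.

```python
import random


def sampling_probabilities(td_errors, exponent=1.0, eps=1e-6):
    """Probabilities proportional to (|delta_i| + eps) ** exponent."""
    priorities = [(abs(d) + eps) ** exponent for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def sample_index(td_errors, rng, exponent=1.0, eps=1e-6):
    """Draw one buffer or queue index according to the priorities."""
    probs = sampling_probabilities(td_errors, exponent, eps)
    return rng.choices(range(len(probs)), weights=probs)[0]
```

With `eps=0.0`, the TD errors `[1.0, -3.0, 0.0]` give probabilities `[0.25, 0.75, 0.0]`, so a sample with zero error would never be replayed until its priority changes.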
We also test using On-policy transitions for Dyna, where only the state $s$ is stored on the queue and actions are simulated according to the current policy; the queue is maintained using priorities and predecessors. In Dyna, we use the learned model to sample predecessors of the current state, for all actions $a$, and add them to the queue. In ER, with no environment model, we use a simple heuristic which adds the priority of the current sample to the preceding sample in the buffer [Schaul et al., 2016]. Note that van Seijen and Sutton [2015] relate Dyna and ER, but specifically for a theoretical equivalence in policy evaluation based on a non-standard form of replay related to true online methods, and thus we do not include it.
Experimental settings: All experiments are averaged over many independent runs, with the randomness controlled based on the run number. All learning algorithms use $\epsilon$-greedy action selection () and Q-learning to update the value function in both learning and planning phases. The step-sizes are swept in . The size of the search-control queue and buffer was fixed to 1024, large enough for the microworlds considered, and the number of planning steps was fixed to 5.
A natural question is whether the conclusions from experiments in the microworlds below extend to larger environments. Microworlds are specifically designed to highlight phenomena in larger domains, such as creating difficult-to-reach, high-reward states in River Swim described below. The computation and model size are correspondingly scaled down, to reflect realistic limitations when moving to larger environments. The trends obtained when varying the size and stochasticity of these environments provide insights into making such changes in larger environments. Experiments in microworlds, then, enable a more systematic, issue-oriented investigation and suggest directions for further investigation for use in real domains.
Results in the Tabular Setting: To gain insight into the differences between Dyna and ER, we first consider them in the deterministic and stochastic variants of a simple gridworld with increasing state space size. ER has largely been explored in deterministic problems, and most work on Dyna has only considered the tabular setting. The gridworld is discounted with , and episodic with obstacles and one goal, with a reward of 0 everywhere except the transition into goal, in which case the reward is +100. The agent can take four actions. In the stochastic variant each action takes the agent to the intended next state with probability 0.925, or one of the other three adjacent states with probability 0.025. In the deterministic setting, Dyna uses a table to store next state and reward for each state and action; in stochastic, it estimates the probabilities of each observed transition via transition counts.
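A count-based tabular model of the kind described for the stochastic gridworld can be sketched as below. The class name, the hashable state and action keys, and treating each (next state, reward) pair as one outcome are assumptions for the example.

```python
import random
from collections import Counter, defaultdict


class CountModel:
    """Tabular sample model that estimates transition probabilities from counts."""

    def __init__(self):
        self.counts = defaultdict(Counter)   # (s, a) -> Counter of (s', r) outcomes

    def update(self, s, a, s_next, r):
        self.counts[(s, a)][(s_next, r)] += 1

    def probabilities(self, s, a):
        """Empirical outcome probabilities; empty if (s, a) was never observed."""
        outcomes = self.counts[(s, a)]
        n = sum(outcomes.values())
        return {o: k / n for o, k in outcomes.items()}

    def sample(self, s, a, rng):
        """Sample an observed outcome for (s, a) in proportion to its count."""
        outcomes = self.counts[(s, a)]
        if not outcomes:
            raise KeyError("state-action pair has not been observed")
        keys = list(outcomes)
        return rng.choices(keys, weights=[outcomes[o] for o in keys])[0]
```

In the deterministic case every counter holds a single outcome, so the same structure reduces to the lookup table mentioned above.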
Figure 1 shows the reward accumulated by each agent over time-steps. We observe that: 1) Dyna with priorities and predecessors outperformed all variants of ER, and the performance gap increases with gridworld size. 2) TD-error based prioritization on Dyna’s search-control queue improved performance only when combined with the addition of predecessors; otherwise, unprioritized variants outperformed prioritized variants. We hypothesize that this could be due to out-dated priorities, previously suggested to be problematic [Peng and Williams, 1993; Schaul et al., 2016]. 3) ER with prioritization performs slightly worse than unprioritized ER variants for the deterministic setting, but its performance degrades considerably in the stochastic setting. 4) On-Policy Dyna with priorities and predecessors outperformed the regular variant in the stochastic domain with a larger state space. 5) Dyna with similar search-control strategies to ER, such as recency and priorities, does not outperform ER; only with the addition of improved search-control strategies is there an advantage. 6) Deleting samples from the queue or transitions from the buffer according to recency was always better than deleting according to priority for both Dyna and ER.
Results for Continuous States. We recreate the above experiments for continuous states, and additionally explore the utility of REMs for Dyna. We compare to using a Neural Network model—with two layers, trained with the Adam optimizer on a sliding buffer of 1000 transitions—and to a Linear model predicting features-to-expected next features rather than states, as in Linear Dyna. We improved upon the original Linear Dyna by learning a reverse model and sweeping different step-sizes for the models and updates to .
We conduct experiments in two tasks: a Continuous Gridworld and River Swim. Continuous Gridworld is a continuous variant of a domain introduced by Peng and Williams [1993], with , a sparse reward of 1 at the goal, and a long wall with a small opening. Agents can choose to move 0.05 units up, down, left, right, which is executed successfully with probability and otherwise the environment executes a random move. Each move has noise . River Swim is a difficult exploration domain, introduced as a tabular domain [Strehl and Littman, 2008], as a simple simulation of a fish swimming up a river. We modify it to have a continuous state space . On each step, the agent can go right or left, with the river pushing the agent towards the left. The right action succeeds with low probability depending on the position, and the left action always succeeds. There is a small reward at the leftmost state (close to ), and a relatively large reward at the rightmost state (close to ). The optimal policy is to constantly select right. Because exploration is difficult in this domain, instead of $\epsilon$-greedy, we induced a bit of extra exploration by initializing the weights to . For both domains, we use a coarse tile-coding, similar to state-aggregation.
REM-Dyna obtains the best performance on both domains, in comparison to the ER variants and other model-based approaches. For search-control in the continuous state domains, the results in Figure 2 parallel the conclusions from the tabular case. For the alternative models, REMs outperform both Linear models and NN models. For Linear models, the model accuracy was quite low and the step-size selection sensitive. We hypothesize that this additional tuning inadvertently improved the Q-learning update, rather than gaining from Dyna-style planning; in River Swim, Linear Dyna did poorly. Dyna with NNs performs poorly because the NN model is not data-efficient; after several thousand more learning steps, however, the model does finally become accurate. This highlights the necessity for data-efficient models, for Dyna to be effective. In River Swim, no variant of ER was within 85% of optimal in 20,000 steps, whereas all variants of REM-Dyna were, once again particularly REM-Dyna with Predecessors.
In this work, we developed a semi-parametric model learning approach, called Reweighted Experience Models (REMs), for use with Dyna for control in continuous state settings. We revisited a few key dimensions for maintaining the search-control queue for Dyna, to decide how to select states and actions from which to sample. These included understanding the importance of using recent samples, prioritizing samples (with absolute TD-error), generating predecessor states that lead into high-priority states, and generating on-policy transitions. We compared Dyna to the simpler alternative, Experience Replay (ER), and considered similar design decisions for its transition buffer. We highlighted several criteria for the model to be useful in Dyna, for one-step sampled transitions, namely being data-efficient, robust to forgetting, enabling conditional models and being efficient to sample. We developed a new semi-parametric model, REM, that uses similarities to a representative set of prototypes, and requires only a small set of coefficients to be learned. We provided a simple learning rule for these coefficients, taking advantage of a conditional independence assumption and that we only require conditional models. We thoroughly investigate the differences between Dyna and ER, in several microworlds for both tabular and continuous states, showing that Dyna can provide significant gains through the use of predecessors and on-policy transitions. We further highlight that REMs are an effective model for Dyna, compared to using a Linear model or a Neural Network model.
Appendix A Consistency of conditional probability estimators
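Theorem 1. Let $\rho(t, i) \doteq k_s(s_t, s_i)\, k_a(a_t, a_i)$ be the similarity of $(s_t, a_t)$ for sample $t$ to $(s_i, a_i)$ for prototype $i$. Then
$$c_i = \arg\min_{c} \sum_{t=1}^{T} \Big(c - k_{s',r,\gamma}\big((s'_t, r_t, \gamma_t), (s'_i, r_i, \gamma_i)\big)\Big)^2 \rho(t, i)$$
is a consistent estimator of $p(s'_i, r_i, \gamma_i \mid s_i, a_i)$.
Proof. Write $k_t \doteq k_{s',r,\gamma}\big((s'_t, r_t, \gamma_t), (s'_i, r_i, \gamma_i)\big)$. The closed-form solution for this objective is $c_i = \sum_{t=1}^{T} \rho(t, i)\, k_t \big/ \sum_{t=1}^{T} \rho(t, i)$, so that
$$\lim_{T \to \infty} c_i = \lim_{T \to \infty} \frac{\frac{1}{T} \sum_{t=1}^{T} \rho(t, i)\, k_t}{\frac{1}{T} \sum_{t=1}^{T} \rho(t, i)} = \frac{\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \rho(t, i)\, k_t}{\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \rho(t, i)} = \frac{\mathbb{E}[\rho(t, i)\, k_t]}{\mathbb{E}[\rho(t, i)]}$$
with expectation according to $p(s, a, s', r, \gamma)$. The second equality holds because (a) all three limits exist and (b) the limit of the denominator is not zero: $\mathbb{E}[\rho(t, i)] > 0$.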
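Expanding out these expectations, where $k_s(s, s_i)\, k_a(a, a_i) = k_s(s_i, s)\, k_a(a_i, a) = p(s_i, a_i \mid s, a)$ by the symmetry of the kernel, we get
$$\mathbb{E}[\rho(t, i)] = \int p(s_i, a_i \mid s, a)\, p(s, a)\, d(s, a) = p(s_i, a_i),$$
$$\mathbb{E}[\rho(t, i)\, k_t] = \int p(s_i, a_i \mid s, a)\, p(s'_i, r_i, \gamma_i \mid s', r, \gamma)\, p(s, a, s', r, \gamma)\, d(s, a, s', r, \gamma) = p(s_i, a_i, s'_i, r_i, \gamma_i),$$
where $k_t$ denotes the similarity of $(s'_t, r_t, \gamma_t)$ to $(s'_i, r_i, \gamma_i)$, and so $c_i$ converges to $p(s_i, a_i, s'_i, r_i, \gamma_i) / p(s_i, a_i) = p(s'_i, r_i, \gamma_i \mid s_i, a_i)$ as $T \to \infty$. ∎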
Appendix B REM Algorithmic Details
Algorithm 1 summarizes REM-Dyna, our online algorithm for learning, acting, and sample-based planning. Supporting pseudocode, for sampling and updating REMs, is given in Section B.3 below. We include Experience Replay with Priorities in Algorithm 2, for comparison. We additionally include a diagram highlighting the difference between KDE and REM, to approximate densities, in Figure 3.
For the queue and buffer, we maintain a circular array. When a sample is added to the array, with priority , it is placed in the spot with the oldest transition. When a state-action or transition is sampled with priority from the array, it is used to update the weights and its priority in the array is updated with its new priority. Therefore, it is not removed from the array, simply updated. Array elements are only removed once they are the oldest, implemented by incrementing the index each time a new point is added to the array.
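The circular array described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name, method names, and the choice to sample with probability proportional to priority are assumptions made for the example.

```python
import random


class CircularPriorityArray:
    """Fixed-capacity array that overwrites its oldest entry when full.

    Hypothetical sketch: names and the proportional sampling rule are
    assumptions, not taken from the paper.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = [None] * capacity
        self.priorities = [0.0] * capacity
        self.next_index = 0  # next free slot, or the oldest entry once full
        self.size = 0

    def add(self, item, priority):
        # Place the new entry in the slot holding the oldest one,
        # then advance the index so the next-oldest slot is overwritten next.
        self.items[self.next_index] = item
        self.priorities[self.next_index] = priority
        self.next_index = (self.next_index + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, rng=random):
        # Assumption: sample an index with probability proportional to priority;
        # fall back to uniform sampling if all priorities are zero.
        weights = self.priorities[:self.size]
        if sum(weights) <= 0.0:
            return rng.randrange(self.size)
        return rng.choices(range(self.size), weights=weights, k=1)[0]

    def update_priority(self, index, new_priority):
        # A sampled entry is not removed; only its priority is overwritten.
        self.priorities[index] = new_priority
```

A caller would use `sample` to obtain an index, perform the update with `items[index]`, and then call `update_priority(index, ...)` with the new priority. Entries leave the array only when `add` overwrites them.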
B.1 Computing Conditional Covariances
To sample from REMs, as in Algorithm 5, we need to be able to compute the conditional covariance. Recall that
This is not necessarily the true conditional distribution over , but it is the conditional distribution under our model.
It is straightforward to sample from this model, using as the coefficients, shown in Algorithm 5. The key detail is computing a conditional covariance, described below.
Given sample , the conditional mean is
Similarly, we can compute the conditional covariance
This conditional covariance matrix more accurately reflects the distribution over , given . A covariance over would be significantly larger than this conditional covariance, since it would reflect the variability across the whole state space, rather than for a given . For example, in a deterministic domain, the conditional covariance is zero, whereas the covariance of across the space is not. If one consistent covariance is desired for , a reasonable choice is to compute a running average of conditional covariances across observed .
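As a rough illustration of these quantities, the sketch below computes a weighted mean and covariance of the next states stored with the prototypes, together with a running average of covariances. It assumes that the model assigns each prototype a nonnegative weight for the given conditioning sample and that the weights are normalized to sum to one. The function names and arguments are hypothetical and are not the paper's notation.

```python
import numpy as np


def conditional_moments(weights, next_states):
    """Weighted mean and covariance of prototype next states.

    Assumption (not from the paper's text): `weights` holds one nonnegative
    weight per prototype for the conditioning sample, and `next_states`
    holds the next state stored with each prototype, shape (b, d).
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                              # normalize weights to sum to one
    x = np.asarray(next_states, dtype=float)
    mean = w @ x                                 # weighted mean, shape (d,)
    centered = x - mean                          # deviations from the mean, (b, d)
    cov = (centered * w[:, None]).T @ centered   # weighted covariance, (d, d)
    return mean, cov


def update_running_average(avg_cov, new_cov, count):
    """Incremental mean of covariances; `count` includes the new matrix."""
    return avg_cov + (new_cov - avg_cov) / count
```

When every prototype with nonzero weight stores the same next state, as in a deterministic domain, the returned covariance is the zero matrix, matching the remark above.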
B.2 Details on Prototype Selection
We use a prototype selection strategy that maximizes [\citeauthoryearSchlegel et al.2017]. There are a number of parameters, but they are intuitive to set and did not require sweeps. The algorithm begins by adding the first prototypes, to fill up the budget of prototypes. Then, it starts to swap out the least useful prototypes as new transitions are observed. The algorithm adds in new prototypes if they are sufficiently different from previous prototypes and increase the diversity of the set. The utility increase threshold is set to ; this threshold simply avoids swapping too frequently, which is computationally expensive, rather than having much impact on the quality of the solution.
A component of this algorithm is a k-means clustering algorithm, to make the update more efficient. We perform k-means clustering using the distance metric , where is the empirical covariance matrix for transitions . The points are clustered into blocks, to speed up the computation of the log-determinant. The clustering is re-run every swaps, but this is efficient, since it is started from the previous clustering and only a few iterations need to be executed.
B.3 Additional Pseudocode for REMs
The pseudocode for the remaining algorithms is included below, in Algorithms 3-6. Note that the implementation for REMs can be made much faster by using KD-trees to find nearest points, but we do not include those details here.
Appendix C Domain Details
Tabular Gridworld. The tabular grid is a discounted episodic problem with obstacles and one goal, as illustrated in Figure 4. The reward is 0 everywhere except on the transition into the goal, where the reward is +100. The discount rate is . The agent can take four actions. In the stochastic variant, each action takes the agent to the intended next state with probability 0.925, or to each of the other three adjacent states with probability 0.025.
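The stochastic transition rule can be sketched as below. This is only an illustration of the 0.925 / 0.025 split, read as 0.025 for each of the three unintended directions so that the probabilities sum to one. The action names and coordinate offsets are hypothetical, and obstacles, grid boundaries, and the goal are deliberately left out.

```python
import random

# Hypothetical action encoding: (dx, dy) offsets on the grid.
ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

P_INTENDED = 0.925  # probability of moving in the intended direction
P_OTHER = 0.025     # probability of each of the three other directions


def sample_direction(action, rng=random):
    """Return the direction actually taken when `action` is selected."""
    others = [a for a in ACTIONS if a != action]
    return rng.choices(
        [action] + others,
        weights=[P_INTENDED, P_OTHER, P_OTHER, P_OTHER],
        k=1,
    )[0]


def step_reward(next_state, goal_state):
    """Reward is +100 on the transition into the goal and 0 otherwise."""
    return 100.0 if next_state == goal_state else 0.0
```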
The modelling choices in Dyna are different for the deterministic and stochastic variants. In the deterministic setting, Dyna uses a table to store the next state and reward for each state and action; in the stochastic setting, it estimates the probability of each observed transition via transition counts.
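A count-based tabular model of this kind might look like the following sketch. The class and method names are hypothetical. The deterministic table is the special case in which every state-action pair has a single observed outcome.

```python
import random
from collections import defaultdict


class CountModel:
    """Tabular model that estimates transition probabilities from counts.

    Hypothetical sketch; the names are not taken from the paper.
    """

    def __init__(self):
        # counts[(s, a)][(r, s_next)] = number of times the outcome was observed
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, s, a, r, s_next):
        self.counts[(s, a)][(r, s_next)] += 1

    def probabilities(self, s, a):
        # Empirical probability of each observed (reward, next state) outcome.
        outcomes = self.counts.get((s, a), {})
        total = sum(outcomes.values())
        return {o: c / total for o, c in outcomes.items()}

    def sample(self, s, a, rng=random):
        # Assumes (s, a) has been observed at least once.
        outcomes = self.counts[(s, a)]
        keys = list(outcomes)
        weights = [outcomes[k] for k in keys]
        return rng.choices(keys, weights=weights, k=1)[0]
```

During planning, a state-action pair drawn from the search-control queue would be passed to `sample` to produce a simulated reward and next state for the value update.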
Continuous Gridworld. The Continuous Gridworld domain is an episodic problem with a long wall as the obstacle between the start state and goal state, where the wall has one hole for the agent to pass through. The state space consists of points . The wall in the middle has a width of . The hole on the wall is located at with height , hence the hole area is . At each step the agent can choose from actions to move. The successful move probability is ; otherwise a random action is taken, with stepsize where is Gaussian noise with mean and variance . The agent always starts at and the goal area is . The reward is in the goal area, and otherwise is zero.
River Swim. The continuous River Swim domain is a modified version of the tabular River Swim domain proposed in [\citeauthoryearStrehl and Littman2008]. The state space is and the action space is . The move step size is , changing the state each time by unit. The goal of the agent is to swim upstream (move right) towards the end of the chain, but the dynamics of the environment push the agent downstream, to the beginning of the chain. The agent receives a reward of if it is at the beginning of the chain, in region , but receives a much larger reward of if it manages to get to the end of the chain, in region . Otherwise, it receives a reward of zero. The task is continuing, with . The left action always succeeds, taking the agent left. The right action, however, can fail. Once a right action is chosen, if it is at the beginning of the chain, the agent moves to the right with probability 0.4 and otherwise stays at original position; if it is at the end of the chain, it moves left with probability 0