# Importance Resampling for Off-policy Prediction

###### Abstract

Importance sampling (IS) is a common reweighting strategy for off-policy prediction in reinforcement learning. While it is consistent and unbiased, it can result in high-variance updates to the weights for the value function. In this work, we explore a resampling strategy as an alternative to reweighting. We propose Importance Resampling (IR) for off-policy prediction, which resamples experience from a replay buffer and applies standard on-policy updates. The approach avoids using importance sampling ratios in the update, instead correcting the distribution before the update. We characterize the bias and consistency of IR, particularly compared to Weighted IS (WIS). We demonstrate in several microworlds that IR has improved sample efficiency and lower-variance updates, as compared to IS and several variance-reduced IS strategies, including variants of WIS and V-trace, which clips IS ratios. We also provide a demonstration showing IR improves over IS for learning a value function from images in a racing car simulator.


Matthew Schlegel (University of Alberta, mkschleg@ualberta.ca), Wesley Chung (University of Alberta, wchung@ualberta.ca), Daniel Graves (Huawei, daniel.graves@huawei.com), Jian Qian (University of Alberta, jq1@ualberta.ca), Martha White (University of Alberta, whitem@ualberta.ca)

Preprint. Under review.

## 1 Introduction

An emerging direction for reinforcement learning systems is to learn many predictions, formalized as value function predictions contingent on many different policies. The idea is that such predictions can provide a powerful abstract model of the world. Some examples of systems that learn many value functions are the Horde architecture composed of General Value Functions (GVFs) (Sutton et al., 2011; Modayil et al., 2014), systems that use options (Sutton et al., 1999; Schaul et al., 2015a), predictive representation approaches (Sutton et al., 2005; Schaul and Ring, 2013; Silver et al., 2017) and systems with auxiliary tasks (Jaderberg et al., 2017). Off-policy learning is critical for learning many value functions with different policies, because it enables data to be generated from one behavior policy to update the values for each target policy in parallel.

The typical strategy for off-policy learning is to reweight updates using importance sampling (IS). For a given state $s$, with action $a$ selected according to behavior policy $\mu$, the IS ratio is the ratio between the probability of the action under the target policy $\pi$ and the behavior: $\rho(s,a) = \frac{\pi(a|s)}{\mu(a|s)}$. The update is multiplied by this ratio, adjusting the action probabilities so that the expectation of the update is as if the actions had been sampled according to the target policy $\pi$. Though the IS estimator is unbiased and consistent (Kahn and Marshall, 1953; Rubinstein and Kroese, 2016), it can suffer from high or even infinite variance due to large-magnitude IS ratios, both in theory (Andradottir et al., 1995) and in practice (Precup et al., 2001; Mahmood et al., 2014, 2017).
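As a concrete reference, the IS-weighted TD(0) update described above can be sketched for a linear value function. This is a minimal illustration; the function and argument names are our own, not from the paper:

```python
import numpy as np

def is_td_update(w, phi, phi_next, cumulant, gamma, pi_prob, mu_prob, alpha=0.1):
    # Importance sampling ratio: target probability over behavior probability.
    rho = pi_prob / mu_prob
    # On-policy TD(0) error for a linear value estimate v(s) = w . phi(s).
    delta = cumulant + gamma * (w @ phi_next) - (w @ phi)
    # The whole update is reweighted by rho, so a large ratio scales the
    # entire step, which is the source of the variance discussed above.
    return w + alpha * rho * delta * phi
```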

There have been some attempts to modify off-policy prediction algorithms to mitigate this variance.^{1}^{1}1There is substantial literature on variance reduction for another area called off-policy policy evaluation, which estimates only a single number or value for a policy (e.g., see (Thomas and Brunskill, 2016)). The resulting algorithms differ substantially, and are not appropriate for learning the value function.
Weighted IS (WIS) algorithms have been introduced (Precup et al., 2001; Mahmood et al., 2014; Mahmood and Sutton, 2015), which normalize each update by the sample average of the ratios. These algorithms improve learning over standard IS strategies, but are not straightforward to extend to nonlinear function approximation. In the offline setting, a reweighting scheme, called importance sampling with unequal support (Thomas and Brunskill, 2017), was introduced to account for samples where the ratio is zero, in some cases significantly reducing variance.
Another strategy is to rescale or truncate the IS ratios, as used by V-trace (Espeholt et al., 2018) for learning value functions and by Tree-Backup (Precup et al., 2000), Retrace (Munos et al., 2016) and ABQ (Mahmood et al., 2017) for learning action-values. Truncation of IS ratios in V-trace can incur significant bias, and this additional truncation parameter needs to be tuned.

An alternative to reweighting updates is to instead correct the distribution before updating the estimator using weighted bootstrap sampling: resampling a new set of data from the previously generated samples (Smith et al., 1992; Arulampalam et al., 2002). Consider a setting where a buffer of data is stored, generated by a behavior policy $\mu$. Samples for the target policy $\pi$ can be obtained by resampling from this buffer, proportionally to $\rho(s,a) = \frac{\pi(a|s)}{\mu(a|s)}$ for state-action pairs in the buffer. In the sampling literature, this strategy has been proposed under the name Sampling Importance Resampling (SIR) (Rubin, 1988; Smith et al., 1992; Gordon et al., 1993), and has been particularly successful for Sequential Monte Carlo sampling (Gordon et al., 1993; Skare et al., 2003). Such resampling strategies have also been popular in classification, with over-sampling or under-sampling typically being preferred to weighted (cost-sensitive) updates (Lopez et al., 2013).

A resampling strategy has several potential benefits for off-policy prediction.^{2}^{2}2We explicitly use the term prediction rather than policy evaluation to make it clear that we are not learning value functions for control. Rather, our goal is to learn value functions solely for the sake of prediction.
Resampling could have even larger benefits for learning approaches, as compared to averaging or numerical integration problems, because updates accumulate in the weight vector and change the optimization trajectory of the weights. For example, very large importance sampling ratios could destabilize the weights. This problem does not occur for resampling: instead, the same transition is resampled multiple times, spreading a large-magnitude update across multiple updates. At the other extreme, IS wastes updates on transitions with very small IS ratios, which contribute little to learning.
By correcting the distribution before updating, standard on-policy updates can be applied.
The magnitudes of the updates also vary less—because updates are not multiplied by very small or very large importance sampling ratios—potentially reducing the variance of stochastic updates and simplifying learning rate selection.
We hypothesize that resampling (a) learns in fewer updates to the weights, because it focuses computation on samples that are likely under the target policy, and (b) is less sensitive to learning parameters and to target and behavior policy specification.

In this work, we investigate the use of resampling for online off-policy prediction for known, unchanging target and behavior policies. We first introduce Importance Resampling (IR), which samples transitions from a buffer of (recent) transitions according to IS ratios. These sampled transitions are then used for on-policy updates. We show that IR has the same bias as WIS, and that it can be made unbiased and consistent with the inclusion of a batch correction term—even under a sliding window buffer of experience. We provide additional theoretical results characterizing when we might expect the variance to be lower for IR than IS. We then empirically investigate IR on three microworlds and a racing car simulator, learning from images, highlighting that (a) IR is less sensitive to learning rate than IS and V-trace (IS with clipping) and (b) IR converges more quickly in terms of the number of updates.

## 2 Background

We consider the problem of learning General Value Functions (GVFs) (Sutton et al., 2011). The agent interacts in an environment defined by a set of states $\mathcal{S}$, a set of actions $\mathcal{A}$ and Markov transition dynamics $P(s'|s,a)$, the probability of transitioning to state $s'$ when taking action $a$ in state $s$. A GVF is defined for a policy $\pi$, cumulant $c$ and continuation function $\gamma$, with $C_{t+1}$ and $\gamma_{t+1}$ the (random) cumulant and continuation for a transition $(S_t, A_t, S_{t+1})$. The value for a state $s$ is

$$V(s) \doteq \mathbb{E}_\pi\!\left[\,\sum_{t=0}^{\infty}\Big(\prod_{j=1}^{t}\gamma_{j}\Big)\, C_{t+1}\ \Big|\ S_0 = s\right].$$

The operator $\mathbb{E}_\pi$ indicates an expectation with actions selected according to policy $\pi$. GVFs encompass standard value functions, where the cumulant is a reward. Otherwise, GVFs enable predictions about discounted sums of other signals into the future, when following a target policy $\pi$. These values are typically estimated using parametric function approximation, with weights $\mathbf{w}$ defining approximate values $\hat{V}(s; \mathbf{w})$.

In off-policy learning, transitions are sampled according to a behavior policy $\mu$, rather than the target policy $\pi$. To get an unbiased sample of an update to the weights, the action probabilities need to be adjusted. Consider on-policy temporal difference (TD) learning, with update $\alpha \delta_t \nabla_{\mathbf{w}} \hat{V}(S_t; \mathbf{w})$ for a given $(S_t, A_t, S_{t+1})$, for learning rate $\alpha > 0$ and TD-error $\delta_t = C_{t+1} + \gamma_{t+1}\hat{V}(S_{t+1}; \mathbf{w}) - \hat{V}(S_t; \mathbf{w})$. If actions are instead sampled according to a behavior policy $\mu$, then we can use importance sampling (IS) to modify the update, giving the off-policy TD update $\alpha \rho_t \delta_t \nabla_{\mathbf{w}} \hat{V}(S_t; \mathbf{w})$ for IS ratio $\rho_t \doteq \frac{\pi(A_t|S_t)}{\mu(A_t|S_t)}$. Given state $S_t$, if $\mu(a|S_t) > 0$ whenever $\pi(a|S_t) > 0$, then the expected values of these two updates are equal. To see why, notice that

$$\mathbb{E}_\mu\!\left[\rho_t \delta_t \nabla_{\mathbf{w}} \hat{V}(S_t; \mathbf{w}) \mid S_t = s\right] = \sum_{a} \mu(a|s)\, \frac{\pi(a|s)}{\mu(a|s)}\, \mathbb{E}\!\left[\delta_t \nabla_{\mathbf{w}} \hat{V}(S_t; \mathbf{w}) \mid S_t = s, A_t = a\right],$$

which equals $\mathbb{E}_\pi\!\left[\delta_t \nabla_{\mathbf{w}} \hat{V}(S_t; \mathbf{w}) \mid S_t = s\right]$ because the ratios cancel the behavior probabilities:

$$\sum_{a} \pi(a|s)\, \mathbb{E}\!\left[\delta_t \nabla_{\mathbf{w}} \hat{V}(S_t; \mathbf{w}) \mid S_t = s, A_t = a\right] = \mathbb{E}_\pi\!\left[\delta_t \nabla_{\mathbf{w}} \hat{V}(S_t; \mathbf{w}) \mid S_t = s\right].$$

Though unbiased, IS can be high-variance. A lower-variance alternative is Weighted IS (WIS). For a batch consisting of transitions $\{(s_j, a_j, s'_j)\}_{j=1}^{n}$, batch WIS normalizes the update by the sum of the ratios. For example, an offline batch WIS TD algorithm, denoted WIS-Optimal below, would use the update $\alpha \frac{\sum_{j=1}^{n} \rho_j \delta_j \nabla_{\mathbf{w}} \hat{V}(s_j; \mathbf{w})}{\sum_{j=1}^{n} \rho_j}$. Obtaining an efficient WIS update is not straightforward, however, when learning online, and has resulted in algorithms specialized to the tabular setting (Precup et al., 2001) or to linear functions (Mahmood et al., 2014; Mahmood and Sutton, 2015). We nonetheless use WIS as a baseline in the experiments and theory.
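A batch WIS-Optimal style update of this form can be sketched as follows. This is a minimal illustration with assumed data structures (each batch element holds features, next features, cumulant, continuation, and ratio), not the paper's implementation:

```python
import numpy as np

def wis_optimal_update(w, batch, alpha=0.1):
    # batch: list of (phi, phi_next, cumulant, gamma, rho) tuples.
    num = np.zeros_like(w)
    denom = 0.0
    for phi, phi_next, c, g, rho in batch:
        delta = c + g * (w @ phi_next) - (w @ phi)
        num += rho * delta * phi   # IS-weighted TD update for this sample
        denom += rho               # normalizer: sum of the ratios
    return w + alpha * num / denom
```

Normalizing by the summed ratios keeps the step size bounded even when individual ratios are large, at the cost of bias.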

## 3 Importance Resampling

In this section, we introduce Importance Resampling (IR) for off-policy prediction and characterize its bias and variance.
A resampling strategy requires a buffer of samples, from which we can resample. Replaying experience from a buffer was introduced as a biologically plausible way to reuse old experience (Lin, 1992, 1993), and has become common for improving sample efficiency, particularly for control (Mnih et al., 2015; Schaul et al., 2015b). In the simplest case—which we assume here—the buffer is a sliding window of the $n$ most recent transitions, $B_t = \{(s_i, a_i, s'_i)\}_{i=t-n}^{t-1}$, at time step $t$.
We assume samples are generated by taking actions according to behavior policy $\mu$. The transitions are generated with probability $d_\mu(s)\,\mu(a|s)\,P(s'|s,a)$, where $d_\mu$ is the stationary state distribution for policy $\mu$. The goal is to obtain samples according to $d_\mu(s)\,\pi(a|s)\,P(s'|s,a)$, as if we had taken actions according to policy $\pi$ from states $s \sim d_\mu$.^{3}^{3}3The assumption that states are sampled from $d_\mu$ underlies most off-policy learning algorithms. Only a few attempt to adjust the state probabilities to $d_\pi$, either by multiplying IS ratios before a transition (Precup et al., 2001) or by directly estimating state distributions (Hallak and Mannor, 2017; Liu et al., 2018). In this work, we focus on using resampling to correct the action distribution—the standard setting. We expect, however, that some insights will extend to how to use resampling to correct the state distribution, particularly because wherever IS ratios are used it should be straightforward to use our resampling approach.

The IR algorithm is simple: on each step, resample a mini-batch of size $k$ from the buffer of size $n$, proportionally to the ratios $\rho_i = \frac{\pi(a_i|s_i)}{\mu(a_i|s_i)}$ in the buffer. Standard on-policy updates, such as on-policy TD or on-policy gradient TD, can then be used on this resample. The key difference from IS and WIS is that the distribution itself is corrected before the update, whereas IS and WIS correct the update itself. This small difference, however, can have larger ramifications practically, as we show in this paper.
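One IR step, under the assumption of a fixed buffer of linear-function-approximation transitions, can be sketched as follows. The structures and names here are illustrative assumptions, not the paper's code:

```python
import numpy as np

def ir_step(w, buffer, rhos, rng, k=4, alpha=0.1):
    # Resample k transition indices proportionally to the IS ratios.
    p = rhos / rhos.sum()
    idx = rng.choice(len(buffer), size=k, p=p)
    # Apply plain on-policy TD updates: no ratio appears in the update itself.
    for i in idx:
        phi, phi_next, c, g = buffer[i]
        delta = c + g * (w @ phi_next) - (w @ phi)
        w = w + (alpha / k) * delta * phi
    return w
```

Note that a high-ratio transition influences learning by being sampled often, rather than by producing a single very large update.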

We consider two variants of IR: with and without a bias correction. For the $j$-th point sampled from the buffer, with index $i_j$, let $\Delta_{i_j}$ be the on-policy update for that transition. For example, for TD, $\Delta_i = \big(c_{i+1} + \gamma_{i+1}\hat{V}(s'_i; \mathbf{w}) - \hat{V}(s_i; \mathbf{w})\big)\nabla_{\mathbf{w}} \hat{V}(s_i; \mathbf{w})$. The first step for either variant is to sample a mini-batch of size $k$ from the buffer, proportionally to $\rho_i$. Bias-Corrected IR (BC-IR) additionally pre-multiplies with the average ratio in the buffer $\bar{\rho} = \frac{1}{n}\sum_{i=1}^{n}\rho_i$, giving the following estimators for the update direction:

$$X_{\text{IR}} = \frac{1}{k}\sum_{j=1}^{k} \Delta_{i_j}, \qquad\qquad X_{\text{BC}} = \bar{\rho}\,\frac{1}{k}\sum_{j=1}^{k} \Delta_{i_j}.$$

BC-IR negates bias introduced by the average ratio $\bar{\rho}$ in the buffer deviating significantly from the true mean $\mathbb{E}_\mu[\rho] = 1$. For reasonably large buffers, $\bar{\rho}$ will be close to 1, making IR and BC-IR have near-identical updates. Nonetheless, they do have different theoretical properties, particularly for small buffer sizes $n$, so we characterize both.
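The two estimators can be computed directly from a buffer of stored per-transition updates. In this hypothetical sketch, `deltas` holds the on-policy update vectors and `rhos` the ratios (both names are our own):

```python
import numpy as np

def ir_estimators(deltas, rhos, rng, k=32):
    # Sample a mini-batch of indices proportionally to the ratios.
    p = rhos / rhos.sum()
    idx = rng.choice(len(rhos), size=k, p=p)
    x_ir = deltas[idx].mean(axis=0)   # IR: plain average of resampled updates
    x_bc = rhos.mean() * x_ir         # BC-IR: pre-multiply by the average ratio
    return x_ir, x_bc
```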

Across all results, we make the following assumption.

###### Assumption 1.

Transition tuples $(s_i, a_i, s'_i)$ are sampled i.i.d. according to the distribution $d_\mu(s)\,\mu(a|s)\,P(s'|s,a)$, for $i = 1, \ldots, n$.

To denote expectations under $\pi$ and $\mu$, we overload the notation from above, using operators $\mathbb{E}_\pi$ and $\mathbb{E}_\mu$ respectively. To reduce clutter, we write $\mathbb{E}$ to mean $\mathbb{E}_\mu$, because most expectations are under the sampling distribution. All proofs can be found in Appendix B.

### 3.1 Bias of IR

We first show that IR is biased, and that its bias is in fact the same as that of WIS-Optimal, in Theorem 3.1.

###### Theorem 3.1.

[Bias for a fixed buffer of size $n$] Assume a buffer $B$ of $n$ transitions is sampled i.i.d., as in Assumption 1. Let $X_{\text{WIS}^*}$ be the WIS-Optimal estimator of the update. Then,

$$\mathbb{E}[X_{\text{IR}} \mid B] = X_{\text{WIS}^*},$$

and so the bias of $X_{\text{IR}}$ is proportional to

$$\mathbb{E}_\pi[\Delta]\, \sigma^2_{\bar{\rho}} \;-\; \sigma_{\bar{\rho},\bar{X}} \qquad (1)$$

where $\mathbb{E}_\pi[\Delta]$ is the expected update across all transitions, with actions taken by the target policy $\pi$; $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} \rho_i \Delta_i$ is the average IS-weighted update in the buffer; $\sigma^2_{\bar{\rho}}$ is the variance of the average ratio $\bar{\rho}$; and $\sigma_{\bar{\rho},\bar{X}}$ is the covariance between $\bar{\rho}$ and $\bar{X}$.

This bias of IR will be small for reasonably large $n$, both because it is proportional to these variance terms and because larger $n$ results in lower variance of the average ratios and average update for the buffer in Equation (1). In particular, as $n$ grows, these variances decay proportionally to $\frac{1}{n}$. Nonetheless, for smaller buffers, such bias could have an impact. We can, however, easily mitigate this bias with a bias-correction term, as shown in the next corollary and proven in Appendix B.2.

###### Corollary 3.1.1.

BC-IR is unbiased: $\mathbb{E}[X_{\text{BC}}] = \mathbb{E}_\pi[\Delta]$.

### 3.2 Consistency of IR

Consistency of IR in terms of an increasing buffer, with $n \to \infty$, is a relatively straightforward extension of prior results for SIR, with or without the bias correction, and from the derived bias of both estimators (see Theorem B.1 in Appendix B.3). More interesting, and more reflective of practice, is consistency with a fixed-length buffer and increasing interactions with the environment, $t \to \infty$. IR, without bias correction, is asymptotically biased in this case; in fact, its asymptotic bias is the one characterized above for a fixed-length buffer in Theorem 3.1. BC-IR, on the other hand, is consistent, even with a sliding window, as we show in the following theorem.

###### Theorem 3.2.

Let $B_t$ be the buffer of the $n$ most recent transitions sampled by time $t$, i.i.d. as specified in Assumption 1. Let $X_{\text{BC}}^{(t)}$ be the bias-corrected IR estimator, with $k$ samples from buffer $B_t$. Define the sliding-window estimator $\bar{X}_T = \frac{1}{T}\sum_{t=1}^{T} X_{\text{BC}}^{(t)}$. Assume there exists a constant $c < \infty$ such that $\mathbb{E}\big[\|X_{\text{BC}}^{(t)}\|_2^2\big] \le c$. Then, as $T \to \infty$, $\bar{X}_T$ converges in probability to $\mathbb{E}_\pi[\Delta]$.

### 3.3 Variance of Updates

It might seem that resampling avoids high variance in updates, because it does not reweight with large-magnitude IS ratios. The notion of effective sample size from statistics, however, provides some intuition about why large-magnitude IS ratios can also negatively affect IR, not just IS. Effective sample size is between 1 and $n$, with one estimator being $\frac{\left(\sum_{i=1}^{n}\rho_i\right)^2}{\sum_{i=1}^{n}\rho_i^2}$ (Kong et al., 1994; Martino et al., 2017). When the effective sample size is low, this indicates that most of the probability is concentrated on a few samples. For high-magnitude ratios, IR will repeatedly sample the same transitions, and potentially never sample some of the transitions with small IS ratios.
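The ESS estimate above is easy to compute from a buffer's ratios. A small sketch, with `rhos` an array of IS ratios (the name is ours):

```python
import numpy as np

def effective_sample_size(rhos):
    # (sum rho)^2 / sum rho^2: equals n for uniform ratios and
    # approaches 1 when a single ratio dominates the buffer.
    rhos = np.asarray(rhos, dtype=float)
    return rhos.sum() ** 2 / np.square(rhos).sum()
```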

Fortunately, we find that, despite this dependence on effective sample size, IR can significantly reduce variance over IS. In this section, we characterize the variance of the BC-IR estimator. We choose this variant of IR because it is unbiased, and so characterizing its variance is a fairer comparison to IS. We define the mini-batch IS estimator $X_{\text{IS}} = \frac{1}{k}\sum_{j=1}^{k} \rho_{i_j} \Delta_{i_j}$, where the indices $i_j$ are sampled uniformly from $\{1, \ldots, n\}$. This contrasts with the indices for $X_{\text{IR}}$, which are sampled proportionally to $\rho_i$.

We begin by characterizing the variance under a fixed dataset $B$. For convenience, let $\Delta_j = \Delta_{i_j}$. We characterize the sum of the variances of each component in the update estimator, which equivalently corresponds to the normed deviation of the update from its mean,

$$\mathbb{V}[X] \doteq \mathbb{E}\big[\|X - \mathbb{E}[X]\|_2^2\big],$$

for an unbiased stochastic update $X$. We show two theorems establishing that BC-IR has lower variance than IS, under two different conditions on the norm of the update. We first start with the more general conditions, and then provide a theorem for conditions that are likely only true in early learning.

###### Theorem 3.3.

Assume that, for a given buffer $B$, $\|\Delta_i\|_2 \le b$ for samples where $\rho_i < \bar{\rho}$, and that $\|\Delta_i\|_2 \ge b$ for samples where $\rho_i \ge \bar{\rho}$, for some $b > 0$. Then the BC-IR estimator has lower variance than the IS estimator: $\mathbb{V}[X_{\text{BC}}] < \mathbb{V}[X_{\text{IS}}]$.

The conditions in Theorem 3.3 preclude having update norms for samples with small $\rho_i$ be quite large—larger than a number $b$—and a small norm for samples with large $\rho_i$. These conditions can be relaxed to a statement on average, where the cumulative weighted magnitude of the update norm for samples with $\rho_i$ below $\bar{\rho}$ needs to be smaller than for samples with $\rho_i$ above $\bar{\rho}$ (see the proof in Appendix B.5).

We next consider a setting where the magnitude of the update is independent of the given state and action. We expect this condition to hold in early learning, where the weights are randomly initialized and so potentially randomly incorrect across the state-action space. As learning progresses, and value estimates become more accurate in some states, it is unlikely for this condition to hold.

###### Theorem 3.4.

Assume that the ratio $\rho$ and the magnitude of the update $\|\Delta\|_2$ are independent.

Then the BC-IR estimator will have equal or lower variance than the IS estimator.

These results have focused on the variance of each estimator for a fixed buffer, which provides insight into the variance of updates when executing the algorithms. We would, however, also like to characterize variability across buffers, especially for smaller buffers. Fortunately, such a characterization is a simple extension of the above results, because variability for a given buffer already demonstrates variability due to different samples. It is easy to check that $\mathbb{E}[X_{\text{BC}}] = \mathbb{E}\big[\mathbb{E}[X_{\text{BC}} \mid B]\big]$. The variances can be written using the law of total variance

$$\mathbb{V}[X] = \mathbb{E}\big[\mathbb{V}[X \mid B]\big] + \mathbb{V}\big[\mathbb{E}[X \mid B]\big],$$

with the outer expectation and variance taken across buffers. Therefore, the analysis of $\mathbb{V}[X \mid B]$ directly applies.
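The total-variance decomposition used here can be checked numerically with a toy simulation, where each "buffer" induces a conditional mean for the update. The distributions below are purely illustrative assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
n_buffers = 200_000

# Each buffer B induces a conditional mean E[X | B] ~ N(0, 1),
# and the estimator given the buffer is X | B ~ N(E[X | B], 1).
cond_mean = rng.normal(0.0, 1.0, size=n_buffers)
x = cond_mean + rng.normal(0.0, 1.0, size=n_buffers)

within = 1.0               # E[ Var[X | B] ], known by construction
between = cond_mean.var()  # Var[ E[X | B] ], estimated across buffers
total = x.var()            # should be close to within + between
```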

## 4 Empirical Results

We investigate the two hypothesized benefits of resampling as compared to reweighting: improved sample efficiency and reduced variance. These benefits are tested in two microworld domains—a Markov chain and the Four Rooms domain—where exhaustive experiments can be conducted. We also provide a demonstration that IR reduces sensitivity over IS and V-trace in a car simulator, TORCS, when learning from images.

We compare IR and BC-IR
against several reweighting strategies, including importance sampling (IS); two online approaches to weighted important sampling, WIS-Minibatch with weighting and WIS-Buffer with weighting ; and V-trace^{4}^{4}4Retrace, ABQ and TreeBackup also use clipping to reduce variance. But, they are designed for learning action-values and for mitigating variance in eligibility traces. When trace parameter —as we assume here—there are no IS ratios and these methods become equivalent to using Sarsa(0) for learning action-values.
, which corresponds to clipping importance weights (Espeholt et al., 2018). Where appropriate, we also include baselines using On-policy sampling; WIS-Optimal which uses the whole buffer to get an update; and Sarsa(0) which learns action-values—which does not require IS ratios—and then produces estimate . WIS-Optimal is included as an optimal baseline, rather than as a competitor, as it estimates the update using the whole buffer on every step.
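The Sarsa(0) baseline's value estimate is a simple expectation of the learned action-values under the target policy. A one-line sketch, where the array shapes and names are assumptions of the illustration:

```python
import numpy as np

def value_from_action_values(q_s, target_probs):
    # V(s) = sum_a pi(a | s) * Q(s, a): no IS ratios are needed,
    # since the expectation over actions is computed explicitly.
    return float(np.dot(target_probs, q_s))
```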

In all the experiments, the data is generated off-policy. We compute the absolute value error (AVE) on every training step. The error bars represent the standard error over runs, shown on every plot, although not visible in some instances. For the microworlds, the true value function is computed using dynamic programming with a small convergence threshold, and we compute AVE over all states. For TORCS and continuous Four Rooms, the true value function is approximated using rollouts from a random subset of states generated when running the behavior policy. A tabular representation is used in the microworld experiments, tile-coded features with 64 tilings and 8 tiles are used in continuous Four Rooms, and a convolutional neural network is used for TORCS, with the same architecture as previously defined for self-driving cars (Bojarski et al., 2016).

### 4.1 Investigating Convergence Rate

We first investigate the convergence rate of IR. We report learning curves in Four Rooms, as well as sensitivity to the learning rate. The Four Rooms domain (Stolle and Precup, 2002) has four rooms in an 11x11 grid world. The four rooms are positioned in a grid pattern, with each room having two adjacent rooms. Adjacent rooms are separated by a wall with a single connecting hallway. The target policy takes the down action deterministically. The cumulant for the value function is 1 when the agent hits a wall and 0 otherwise, with termination of the continuation function when the agent hits a wall. The resulting value function can be thought of as a distance to the bottom wall. The behavior policy is uniform random everywhere except for 25 randomly selected states, in which it takes the down action with probability 0.05, with the remaining probability split equally amongst the other actions. This choice of behavior and target policies induces high-magnitude IS ratios.

As shown in Figure 1, IR has noticeable improvements over the reweighting strategies tested. The fact that IR resamples more important transitions from the replay buffer appears to significantly increase learning speed. Further, IR has a wider range of usable learning rates. The same effect is seen even as we reduce the total number of updates, where the uniform-sampling methods perform significantly worse as the number of interactions between updates increases—suggesting improved sample efficiency. WIS-Buffer performs almost equivalently to IS because, for reasonably sized buffers, its normalization factor $\frac{1}{n}\sum_{i=1}^{n}\rho_i \approx 1$, since $\mathbb{E}_\mu[\rho] = 1$. WIS-Minibatch and V-trace both reduce the variance significantly, with their bias having only a limited impact on final performance compared to IS. Even the most aggressive clipping parameter for V-trace—a clipping of 1.0—outperforms IS. The bias may have limited impact because the target policy is deterministic, so updates occur for exactly one action in each state. Sarsa—which is the same as Retrace(0)—performs similarly to the reweighting strategies.

The above results highlight the convergence rate improvements from IR, in terms of number of updates, without generalization across values. Conclusions might actually be different with function approximation, when updates for one state can be informative for others. For example, even if in one state the target policy differs significantly from the behavior policy, if they are similar in a related state, generalization could overcome effective sample size issues. We therefore further investigate if the above phenomena arise under function approximation with RMSProp learning rate selection.

We conduct a similar experiment to the above, in a continuous-state variant of Four Rooms. The agent is a circle with radius 0.1, and the state consists of the continuous x and y coordinates of the agent's center point. The agent takes an action in one of the 4 cardinal directions, moving in that direction with random drift in the orthogonal direction. The target policy, as before, deterministically takes the down action. The representation is a tile-coded feature vector with 64 tilings and 8 tiles.

We find that generalization can mitigate some of the differences between IR and IS in some settings, but in others the difference remains just as stark. If we use the behavior policy from the tabular domain, which skews the behavior in a sparse set of states, generalization from nearby states mitigates this skew. However, if we use a behavior policy that selects all actions uniformly, then IR again obtains noticeable gains over IS and V-trace in reducing the required number of updates, as shown in Figure 2. Expanded results can be found in Appendix C.2.

### 4.2 Investigating Variance

To better investigate update variance, we use a Markov chain, where we can more easily control the dissimilarity between $\pi$ and $\mu$, and so control the magnitude of the IS ratios. The Markov chain is composed of 8 non-terminating states and 2 terminating states at the ends of the chain, with a cumulant of 1 on the transition into the right-most terminal state and 0 everywhere else. We consider policies whose [left, right] probabilities are the same in all states; the specific policy settings can be found in Appendix C.1.

We first measure the variance of the updates for fixed buffers. We compute the variance of the update—from a given weight vector—by simulating the many possible updates that could have occurred. We are interested in the variance of updates both for early learning—when the weight vector is quite incorrect and updates are large—and for later learning. To obtain a sequence of such weight vectors, we use the sequence of weights generated by WIS-Optimal. As shown in Figure 5, the variance of IR is lower than that of IS, particularly in early learning, where the difference is stark. Once the weight vector has largely converged, the variances of IR and IS are comparable and near zero.

We can also evaluate the variance by proxy using learning-rate sensitivity curves. As seen in Figure 5 (a) and (b), IR has the lowest sensitivity to learning rates, on par with On-policy sampling. IS has the highest sensitivity, along with WIS-Buffer and WIS-Minibatch. Various clipping parameters for V-trace are also tested: V-trace does provide some level of variance reduction, but incurs more bias as the clipping becomes more aggressive.

### 4.3 Demonstration on a Car Simulator

We use the TORCS racing-car simulator to perform scaling experiments with neural networks, comparing IR, IS, and V-trace. The simulator produces 64x128 cropped grayscale images. We use an underlying deterministic steering controller that produces steering actions, and take an action with probability defined by a Gaussian around that steering action. The target policy is a Gaussian corresponding to steering left. Pseudo-termination (i.e., $\gamma = 0$) occurs when the car nears the center of the road, at which point the cumulant becomes 1. Otherwise, the cumulant is zero. The policies are specified using continuous action distributions, which results in very large IS ratios and high-variance updates for IS.

Again, we can see in Figure 4 that IR provides benefits over IS and V-trace. There is even more generalization from the neural network in this domain than in Four Rooms, where we found generalization did reduce some of the differences between IR and IS. Yet, IR still obtains the best performance, and avoids some of the variance seen in IS for two of the learning rates. Additionally, BC-IR actually performs differently here, having worse performance for the largest learning rate. This suggests that resampling itself has an effect in reducing variance.

## 5 Conclusion

In this paper we introduced a new approach to off-policy learning: Importance Resampling. We showed that IR is consistent, and that its bias is the same as for Weighted Importance Sampling. We also provided an unbiased variant of IR, called Bias-Corrected IR. We empirically showed that (a) IR has lower learning rate sensitivity than IS and V-trace, which is IS with varying clipping thresholds; (b) the variance of updates for IR is much lower than for IS in early learning; and (c) IR converges faster than IS and the other competitors, in terms of the number of updates. These results confirm the theory presented for IR, which states that the variance of updates for IR is lower than for IS in two settings, one being an early-learning setting. Such lower variance also explains why IR can converge faster in terms of number of updates, for a given buffer of data.

The algorithm and results in this paper suggest new directions for off-policy prediction, particularly for faster convergence. Resampling is promising for scaling to learning many value functions in parallel, because many fewer updates can be made for each value function. A natural next step is a demonstration of IR, in such a parallel prediction system. Resampling from a buffer also opens up questions about how to further focus updates. One such option is using an intermediate sampling policy. Another option is including prioritization based on error, such as was done for control with prioritized sweeping (Peng and Williams, 1993) and prioritized replay (Schaul et al., 2015b).

## References

- Andradottir et al. [1995] Sigrun Andradottir, Daniel P Heyman, and Teunis J Ott. On the Choice of Alternative Measures in Importance Sampling with Markov Chains. Operations Research, 1995.
- Arulampalam et al. [2002] M S Arulampalam, S Maskell, N Gordon, and T Clapp. A Tutorial on Particle Filters for Online Nonlinear/Non-Gaussian Bayesian Tracking. IEEE Transactions on Signal Processing, 2002.
- Bojarski et al. [2016] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.
- Espeholt et al. [2018] Lasse Espeholt, Hubert Soyer, Rémi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, and others. IMPALA: Scalable distributed Deep-RL with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.
- Gordon et al. [1993] N J Gordon, D J Salmond, and A F M Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F (Radar and Signal Processing), 1993.
- Hallak and Mannor [2017] Assaf Hallak and Shie Mannor. Consistent on-line off-policy evaluation. arXiv preprint arXiv:1702.07121, 2017.
- Jaderberg et al. [2017] Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement Learning with Unsupervised Auxiliary Tasks. In International Conference on Representation Learning, 2017.
- Kahn and Marshall [1953] H Kahn and A W Marshall. Methods of Reducing Sample Size in Monte Carlo Computations. Journal of the Operations Research Society of America, 1953.
- Kong et al. [1994] Augustine Kong, Jun S Liu, and Wing Hung Wong. Sequential imputations and Bayesian missing data problems. Journal of the American Statistical Association, 1994.
- Lin [1992] Long-Ji Lin. Self-Improving Reactive Agents Based On Reinforcement Learning, Planning and Teaching. Machine Learning, 1992.
- Lin [1993] Long-Ji Lin. Reinforcement Learning for Robots Using Neural Networks. PhD thesis, Carnegie Mellon University, 1993.
- Liu et al. [2018] Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31. Curran Associates, Inc., 2018.
- Lopez et al. [2013] Victoria Lopez, Alberto Fernandez, Salvador Garcia, Vasile Palade, and Francisco Herrera. An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Information Sciences, 2013.
- Mahmood and Sutton [2015] A R Mahmood and R.S. Sutton. Off-policy learning based on weighted importance sampling with linear computational complexity. In Conference on Uncertainty in Artificial Intelligence, 2015.
- Mahmood et al. [2014] A Rupam Mahmood, Hado P van Hasselt, and Richard S Sutton. Weighted importance sampling for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems, 2014.
- Mahmood et al. [2017] Ashique Rupam Mahmood, Huizhen Yu, and Richard S Sutton. Multi-step Off-policy Learning Without Importance Sampling Ratios. arXiv:1702.03006, 2017.
- Martino et al. [2017] Luca Martino, Víctor Elvira, and Francisco Louzada. Effective sample size for importance sampling based on discrepancy measures. Signal Processing, 2017.
- Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 2015.
- Modayil et al. [2014] Joseph Modayil, Adam White, and Richard S Sutton. Multi-timescale nexting in a reinforcement learning robot. Adaptive Behavior - Animals, Animats, Software Agents, Robots, Adaptive Systems, 2014.
- Munos et al. [2016] Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc G Bellemare. Safe and Efficient Off-Policy Reinforcement Learning. Advances in Neural Information Processing Systems, 2016.
- Owen [2013] Art B. Owen. Monte Carlo theory, methods and examples. 2013.
- Peng and Williams [1993] Jing Peng and Ronald J Williams. Efficient Learning and Planning Within the Dyna Framework. Adaptive Behavior, 1993.
- Precup et al. [2000] Doina Precup, Richard S Sutton, and Satinder P Singh. Eligibility Traces for Off-Policy Policy Evaluation. ICML, 2000.
- Precup et al. [2001] Doina Precup, Richard S Sutton, and Sanjoy Dasgupta. Off-Policy Temporal-Difference Learning with Function Approximation. ICML, 2001.
- Rubin [1988] Donald B Rubin. Using the SIR algorithm to simulate posterior distributions. Bayesian statistics, 1988.
- Rubinstein and Kroese [2016] Reuven Y Rubinstein and Dirk P Kroese. Simulation and the Monte Carlo Method. John Wiley & Sons, 2016.
- Schaul and Ring [2013] Tom Schaul and Mark Ring. Better generalization with forecasts. In International Joint Conference on Artificial Intelligence, 2013.
- Schaul et al. [2015a] Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal Value Function Approximators. In International Conference on Machine Learning, 2015a.
- Schaul et al. [2015b] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized Experience Replay. arXiv:1511.05952 [cs], 2015b.
- Silver et al. [2017] David Silver, Hado van Hasselt, Matteo Hessel, Tom Schaul, Arthur Guez, Tim Harley, Gabriel Dulac-Arnold, David P Reichert, Neil C Rabinowitz, André Barreto, and Thomas Degris. The Predictron - End-To-End Learning and Planning. In AAAI Conference on Artificial Intelligence, 2017.
- Skare et al. [2003] Øivind Skare, Erik Bølviken, and Lars Holden. Improved Sampling-Importance Resampling and Reduced Bias Importance Sampling. Scandinavian Journal of Statistics, 2003.
- Smith and Gelfand [1992] A. F. M. Smith and A. E. Gelfand. Bayesian statistics without tears: a sampling–resampling perspective. The American Statistician, 1992.
- Stolle and Precup [2002] Martin Stolle and Doina Precup. Learning Options in Reinforcement Learning. In International Symposium on Abstraction, Reformulation, and Approximation, 2002.
- Sutton et al. [1999] Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 1999.
- Sutton et al. [2005] Richard S Sutton, Eddie J Rafols, and Anna Koop. Temporal Abstraction in Temporal-difference Networks. In Advances in Neural Information Processing Systems, 2005.
- Sutton et al. [2011] Richard S Sutton, J Modayil, M Delp, T Degris, P.M. Pilarski, A White, and D Precup. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In International Conference on Autonomous Agents and Multiagent Systems, 2011.
- Thomas and Brunskill [2016] Philip Thomas and Emma Brunskill. Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning. In AAAI Conference on Artificial Intelligence, 2016.
- Thomas and Brunskill [2017] Philip S Thomas and Emma Brunskill. Importance Sampling with Unequal Support. In AAAI Conference on Artificial Intelligence, 2017.

## Appendix A Weighted Importance Sampling

We consider three weighted importance sampling (WIS) updates as competitors to IR, where $n$ is the size of the experience replay buffer and $k$ is the size of a single mini-batch. WIS-Minibatch and WIS-Buffer both follow a similar protocol to IS, in that they uniformly sample a mini-batch from the experience replay buffer and use it to update the value function. The difference comes in the scaling of the update. The first, WIS-Minibatch, normalizes by the sum of the importance ratios in the sampled mini-batch, while WIS-Buffer normalizes by the sum of importance ratios in the experience replay buffer. WIS-Buffer is also scaled by the buffer size $n$, which brings it to the same effective scale as the other updates because $\frac{1}{n}\sum_{i=1}^{n}\rho_i \approx \mathbb{E}_\mu[\rho] = 1$. WIS-Optimal follows a different approach and performs the best possible version of WIS, where the gradient descent update is calculated from the whole experience replay buffer. We do not provide analysis of the bias or consistency of WIS-Minibatch or WIS-Buffer, but they are natural versions of WIS that one might try.
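The three scalings can be sketched as follows. This is a minimal sketch with synthetic ratios and updates: the names `rho` and `delta`, and the exponential/normal stand-ins, are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

n, k = 1000, 32                       # buffer size n and mini-batch size k
rho = rng.exponential(1.0, size=n)    # stand-in importance ratios pi/mu (mean 1)
delta = rng.normal(0.0, 1.0, size=n)  # stand-in per-transition updates

idx = rng.integers(0, n, size=k)      # uniformly sampled mini-batch, as for IS

# WIS-Minibatch: normalize by the summed ratios of the sampled mini-batch.
wis_minibatch = np.sum(rho[idx] * delta[idx]) / np.sum(rho[idx])

# WIS-Buffer: normalize by the summed ratios of the whole buffer, rescaled by
# the buffer size n; n / sum(rho) is near 1 because E[rho] = 1 under mu.
wis_buffer = (n / np.sum(rho)) * np.mean(rho[idx] * delta[idx])

# WIS-Optimal: the weighted-IS update computed from the entire buffer.
wis_optimal = np.sum(rho * delta) / np.sum(rho)
```

Note that WIS-Optimal touches every transition in the buffer per update, which is what makes it a best-case baseline rather than a practical competitor.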

## Appendix B Additional Theoretical Results and Proofs

### b.1 Bias of IR

Theorem 3.1 (Bias for a fixed buffer of size $n$) Assume a buffer $B$ of $n$ transitions is sampled i.i.d., according to $d_\mu$. Let $X_{\text{WIS}} = \frac{\sum_{i=1}^{n} \rho_i \delta_i}{\sum_{i=1}^{n} \rho_i}$ be the WIS-Optimal estimator of the update. Then,

$$\mathbb{E}[X_{\text{IR}}] = \mathbb{E}[X_{\text{WIS}}],$$

and so the bias of $X_{\text{IR}}$ is proportional to

$$\mathbb{E}[X_{\text{IR}}] - \delta_\pi \;\propto\; \frac{1}{n}\left(\delta_\pi\, \sigma_\rho^2 - \sigma_{\rho,\rho\delta}\right),$$

where $\delta_\pi = \mathbb{E}_\pi[\delta]$ is the expected update across all transitions, with actions taken by the target policy $\pi$; $\rho = \frac{\pi(a|s)}{\mu(a|s)}$; $\sigma_\rho^2 = \mathrm{Var}[\rho]$; and covariance $\sigma_{\rho,\rho\delta} = \mathrm{Cov}(\rho, \rho\delta)$.

###### Proof.

Notice first that when we weight $\delta_i$ with $\rho_i = \frac{\pi(a_i|s_i)}{\mu(a_i|s_i)}$, this is equivalent to weighting with $\frac{d_\mu(s_i)\,\pi(a_i|s_i)}{d_\mu(s_i)\,\mu(a_i|s_i)}$, and so $\rho_i$ is the correct IS ratio for the transition. For a fixed buffer $B$ with $\bar\rho = \frac{1}{n}\sum_{i=1}^{n}\rho_i$, the expected IR update is

$$\mathbb{E}[X_{\text{IR}} \mid B] = \sum_{i=1}^{n} \frac{\rho_i}{n\bar\rho}\,\delta_i = \frac{\sum_{i=1}^{n} \rho_i \delta_i}{\sum_{i=1}^{n} \rho_i} = X_{\text{WIS}},$$

and so $\mathbb{E}[X_{\text{IR}}] = \mathbb{E}[X_{\text{WIS}}]$. The bias of $X_{\text{IR}}$ is therefore the same as that of $X_{\text{WIS}}$, which is characterized in Owen [2013], completing the proof. ∎

### b.2 Proof of Unbiasedness of BC-IR

Corollary 3.1.1 BC-IR is unbiased: $\mathbb{E}[X_{\text{BC}}] = \delta_\pi$.

###### Proof.

$$\mathbb{E}[X_{\text{BC}}] = \mathbb{E}\!\left[\bar\rho\,\mathbb{E}[X_{\text{IR}} \mid B]\right] = \mathbb{E}\!\left[\bar\rho\,\frac{\sum_{i=1}^{n}\rho_i\delta_i}{n\bar\rho}\right] = \mathbb{E}\!\left[\frac{1}{n}\sum_{i=1}^{n}\rho_i\delta_i\right] = \mathbb{E}_\mu[\rho\,\delta] = \delta_\pi.$$

∎
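The corollary can be checked numerically. The sketch below uses hypothetical stand-ins for the ratios and updates, with `delta` deliberately correlated with `rho`, and averages the bias-corrected IR estimate over many independent buffers:

```python
import numpy as np

rng = np.random.default_rng(1)

def bc_ir_estimate(n=50, k=10):
    # One buffer of n i.i.d. "transitions": rho are stand-in importance
    # ratios (mean 1) and delta are updates correlated with rho.
    rho = rng.exponential(1.0, size=n)
    delta = 2.0 * rho + rng.normal(0.0, 0.1, size=n)
    idx = rng.choice(n, size=k, p=rho / rho.sum())  # importance resampling
    return rho.mean() * delta[idx].mean()           # bias correction by rho-bar

# Average over many independent buffers; the target is
# E[rho * delta] = 2 E[rho^2] = 4 for Exp(1)-distributed ratios.
est = np.mean([bc_ir_estimate() for _ in range(5000)])
```

The average should match $\mathbb{E}_\mu[\rho\,\delta]$ up to Monte Carlo error, with no dependence on the buffer size $n$.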

### b.3 Consistency of the resampling distribution with a growing buffer

We show that the distribution under a resampling strategy is consistent: as $n \to \infty$, the resampling distribution converges to the true distribution. Our approach closely follows that of Smith and Gelfand [1992], but we recreate it here for convenience.

###### Theorem B.1.

Let $B = \{x_i\}_{i=1}^{n}$ be a buffer of data sampled i.i.d. according to a proposal density $g$. Let $f$ be some target density of interest with associated random variable $X$, and assume the proposal distribution samples everywhere that $f$ is non-zero, i.e., $f(x) > 0 \implies g(x) > 0$. Also, let $X_n^*$ be a discrete random variable taking values $x_i$ with probability $\frac{f(x_i)/g(x_i)}{\sum_{j=1}^{n} f(x_j)/g(x_j)}$. Then, $X_n^*$ converges in distribution to $X$ as $n \to \infty$.

###### Proof.

Let $w(x) = \frac{f(x)}{g(x)}$. From the probability mass function of $X_n^*$, we have that, for any $a \in \mathbb{R}$:

$$P(X_n^* \le a) = \frac{\frac{1}{n}\sum_{i=1}^{n} w(x_i)\,\mathbb{1}\{x_i \le a\}}{\frac{1}{n}\sum_{i=1}^{n} w(x_i)} \;\xrightarrow{\text{a.s.}}\; \frac{\mathbb{E}_g\!\left[w(X)\,\mathbb{1}\{X \le a\}\right]}{\mathbb{E}_g[w(X)]} = \int_{-\infty}^{a} f(x)\,dx,$$

by the strong law of large numbers, which is the cumulative distribution function associated with $f$.

∎

This means a resampling strategy effectively changes the distribution of the random variable from that of the proposal $g$ to that of the target $f$, meaning we can use samples from $g$ to build statistics about the target distribution. This result motivates using resampling to correct the action distribution in off-policy learning. It can also be used to show that the IR estimators are consistent, with $X_{\text{IR}} \to \delta_\pi$ as $n \to \infty$.
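A minimal sampling–importance-resampling sketch illustrates the theorem; the proposal and target densities below are chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

n = 200_000
x = rng.normal(0.0, 1.0, size=n)  # buffer sampled from the proposal g = N(0, 1)

def f(y):  # target density: N(2, 0.5^2)
    return np.exp(-0.5 * ((y - 2.0) / 0.5) ** 2) / (0.5 * np.sqrt(2.0 * np.pi))

def g(y):  # proposal density: N(0, 1)
    return np.exp(-0.5 * y ** 2) / np.sqrt(2.0 * np.pi)

w = f(x) / g(x)                                      # importance weights
resampled = rng.choice(x, size=50_000, p=w / w.sum())
# The resampled points are distributed approximately as the target f.
```

As $n$ grows, the empirical mean and standard deviation of `resampled` approach those of the target, mirroring the convergence in distribution above.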

### b.4 Consistency under a sliding window

Theorem 3.2 Let $B_t$ be the buffer of the $n$ most recent transitions sampled by time $t$, i.i.d. as specified in Assumption 1. Let $X_{\text{BC}}^{(t)}$ be the bias-corrected IR estimator, with $k$ samples from buffer $B_t$. Define the sliding-window estimator $\bar{X}_T := \frac{1}{T}\sum_{t=1}^{T} X_{\text{BC}}^{(t)}$. Assume there exists a $\sigma^2 < \infty$ such that $\mathrm{Var}[X_{\text{BC}}^{(t)}] \le \sigma^2$ for all $t$. Then, as $T \to \infty$, $\bar{X}_T$ converges in probability to $\delta_\pi$.

###### Proof.

Notice first that $X_{\text{BC}}^{(t)}$ is random because $B_t$ is random and because transitions are sampled from $B_t$. Therefore, given $B_t$, $X_{\text{BC}}^{(t)}$ is independent of the other random variables $B_{t'}$ and $X_{\text{BC}}^{(t')}$ for $t' \ne t$.

Now, using Corollary 3.1.1, we can show that BC-IR is unbiased for a sliding window:

$$\mathbb{E}[\bar{X}_T] = \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\big[X_{\text{BC}}^{(t)}\big] = \delta_\pi.$$

Next, we show that $\mathrm{Cov}\big(X_{\text{BC}}^{(t)}, X_{\text{BC}}^{(t')}\big) = 0$ for $|t - t'| \ge n$. For $|t - t'| \ge n$, $B_t$ and $B_{t'}$ are independent, because they are disjoint sets of i.i.d. random variables. Correspondingly, $\mathbb{E}[X_{\text{BC}}^{(t)} \mid B_t]$ is independent of $\mathbb{E}[X_{\text{BC}}^{(t')} \mid B_{t'}]$. Explicitly, using the law of total covariance, we get that

$$\mathrm{Cov}\big(X_{\text{BC}}^{(t)}, X_{\text{BC}}^{(t')}\big) = \mathbb{E}\Big[\mathrm{Cov}\big(X_{\text{BC}}^{(t)}, X_{\text{BC}}^{(t')} \mid B_t, B_{t'}\big)\Big] + \mathrm{Cov}\Big(\mathbb{E}\big[X_{\text{BC}}^{(t)} \mid B_t\big], \mathbb{E}\big[X_{\text{BC}}^{(t')} \mid B_{t'}\big]\Big).$$

The first term is zero because $X_{\text{BC}}^{(t)}$ is independent of $X_{\text{BC}}^{(t')}$ given $(B_t, B_{t'})$, and the second term is zero because $\mathbb{E}[X_{\text{BC}}^{(t)} \mid B_t]$ and $\mathbb{E}[X_{\text{BC}}^{(t')} \mid B_{t'}]$ are independent. Therefore, $\mathrm{Cov}\big(X_{\text{BC}}^{(t)}, X_{\text{BC}}^{(t')}\big) = 0$.

Using the assumption on the variance, we can apply Lemma B.2 to $\{X_{\text{BC}}^{(t)}\}$ to get the desired result. ∎

###### Lemma B.2.

Let $X_1, X_2, \ldots$ be random variables with mean $\mu$. Suppose there exists a $\sigma^2 < \infty$ such that $\mathrm{Var}[X_t] \le \sigma^2$ for all $t$, and that $\mathrm{Cov}(X_t, X_{t'}) \to 0$ as $|t - t'| \to \infty$. Then, as $T \to \infty$, $\bar{X}_T := \frac{1}{T}\sum_{t=1}^{T} X_t$ converges in probability to $\mu$.

###### Proof.

Let $\bar{X}_T = \frac{1}{T}\sum_{t=1}^{T} X_t$. Then

$$\mathrm{Var}[\bar{X}_T] = \frac{1}{T^2}\sum_{t=1}^{T} \mathrm{Var}[X_t] + \frac{1}{T^2}\sum_{t \ne t'} \mathrm{Cov}(X_t, X_{t'}).$$

The first term is bounded by $\frac{\sigma^2}{T}$ from our assumption on the variance. Now, to bound the second term.

Fix $\epsilon > 0$ and choose $m$ such that $|\mathrm{Cov}(X_t, X_{t'})| < \epsilon$ for all $|t - t'| > m$ (such an $m$ must exist since $\mathrm{Cov}(X_t, X_{t'}) \to 0$ as $|t - t'| \to \infty$). Assuming that $T > m$, we can decompose the second term into

$$\frac{1}{T^2}\sum_{t \ne t'} \mathrm{Cov}(X_t, X_{t'}) = \frac{1}{T^2}\sum_{0 < |t - t'| \le m} \mathrm{Cov}(X_t, X_{t'}) + \frac{1}{T^2}\sum_{|t - t'| > m} \mathrm{Cov}(X_t, X_{t'}).$$

By the Cauchy–Schwarz inequality and our variance assumption, $|\mathrm{Cov}(X_t, X_{t'})| \le \sqrt{\mathrm{Var}[X_t]\,\mathrm{Var}[X_{t'}]} \le \sigma^2$. So, since there are at most $2mT$ pairs with $0 < |t - t'| \le m$ and at most $T^2$ pairs with $|t - t'| > m$, we get

$$\frac{1}{T^2}\sum_{0 < |t - t'| \le m} \mathrm{Cov}(X_t, X_{t'}) \le \frac{2m\sigma^2}{T}, \qquad \frac{1}{T^2}\sum_{|t - t'| > m} \mathrm{Cov}(X_t, X_{t'}) < \epsilon.$$

Altogether, our upper bound is

$$\mathrm{Var}[\bar{X}_T] \le \frac{(2m + 1)\sigma^2}{T} + \epsilon.$$

Finally, we apply Chebyshev’s inequality. For a fixed $\epsilon' > 0$,

$$P\big(|\bar{X}_T - \mu| \ge \epsilon'\big) \le \frac{\mathrm{Var}[\bar{X}_T]}{\epsilon'^2} \le \frac{1}{\epsilon'^2}\left(\frac{(2m + 1)\sigma^2}{T} + \epsilon\right).$$

Since we can choose $\epsilon$ to be arbitrarily small, the right-hand side can be made arbitrarily close to $0$ as $T \to \infty$, so $\bar{X}_T$ converges in probability to $\mu$, concluding the proof. ∎
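The lemma's setting can be simulated with a hypothetical construction: each $X_t$ averages a sliding window of $m$ i.i.d. draws, so covariances vanish for $|t - t'| \ge m$, and the long-run average still concentrates on the mean.

```python
import numpy as np

rng = np.random.default_rng(3)

# Each X_t averages a sliding window of m i.i.d. N(1, 1) draws, so X_t and
# X_{t'} are correlated for |t - t'| < m and independent otherwise.
m, T = 20, 200_000
z = rng.normal(1.0, 1.0, size=T + m - 1)
x = np.convolve(z, np.ones(m) / m, mode="valid")  # X_1, ..., X_T
running_mean = x.mean()                           # should approach mu = 1
```

This mirrors the sliding-window estimator in Theorem 3.2, where consecutive $X_{\text{BC}}^{(t)}$ share buffer contents but distant ones do not.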

### b.5 Variance of BC-IR and IS

This lemma characterizes the variance of the BC-IR and IS estimators for a fixed buffer.

###### Lemma B.3.

For a fixed buffer $B$ of size $n$ and mini-batches of size $k$, the conditional variances of the IS and BC-IR estimators are

$$\mathrm{Var}[X_{\text{IS}} \mid B] = \frac{1}{k}\left(\frac{1}{n}\sum_{i=1}^{n} \rho_i^2 \delta_i^2 - \Big(\frac{1}{n}\sum_{i=1}^{n} \rho_i \delta_i\Big)^{2}\right), \qquad \mathrm{Var}[X_{\text{BC}} \mid B] = \frac{1}{k}\left(\frac{\bar\rho}{n}\sum_{i=1}^{n} \rho_i \delta_i^2 - \Big(\frac{1}{n}\sum_{i=1}^{n} \rho_i \delta_i\Big)^{2}\right).$$

###### Proof.

Because we have $k$ independent samples in the mini-batch,

$$\mathrm{Var}[X_{\text{IS}} \mid B] = \frac{1}{k}\,\mathrm{Var}[\rho_{i_1}\delta_{i_1} \mid B], \quad i_1 \sim \mathrm{Uniform}(\{1, \ldots, n\}),$$

and similarly $\mathrm{Var}[X_{\text{BC}} \mid B] = \frac{\bar\rho^2}{k}\,\mathrm{Var}[\delta_{i_1} \mid B]$, with $i_1$ sampled proportionally to $\rho_{i_1}$. We can further simplify these expressions. For the IS estimator,

$$\mathrm{Var}[\rho_{i_1}\delta_{i_1} \mid B] = \frac{1}{n}\sum_{i=1}^{n} \rho_i^2 \delta_i^2 - \Big(\frac{1}{n}\sum_{i=1}^{n} \rho_i \delta_i\Big)^{2},$$

and for the BC-IR estimator, where recall $\bar\rho = \frac{1}{n}\sum_{i=1}^{n} \rho_i$,

$$\bar\rho^2\,\mathrm{Var}[\delta_{i_1} \mid B] = \bar\rho^2\left(\sum_{i=1}^{n} \frac{\rho_i}{n\bar\rho}\,\delta_i^2 - \Big(\sum_{i=1}^{n} \frac{\rho_i}{n\bar\rho}\,\delta_i\Big)^{2}\right) = \frac{\bar\rho}{n}\sum_{i=1}^{n} \rho_i \delta_i^2 - \Big(\frac{1}{n}\sum_{i=1}^{n} \rho_i \delta_i\Big)^{2}.$$

∎
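As a sanity check, the conditional variances of the IS and BC-IR estimators for a single fixed buffer can be computed in closed form and compared against Monte Carlo over repeated mini-batches. This is a sketch with synthetic ratios and updates; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

n, k = 200, 16
rho = rng.exponential(1.0, size=n)    # stand-in importance ratios
delta = rng.normal(0.0, 1.0, size=n)  # stand-in scalar updates
rho_bar = rho.mean()

# Closed-form conditional variances for this fixed buffer.
var_is = (np.mean(rho**2 * delta**2) - np.mean(rho * delta) ** 2) / k
var_bc = (rho_bar * np.mean(rho * delta**2) - np.mean(rho * delta) ** 2) / k

# Monte Carlo over repeated mini-batches drawn from the same buffer.
reps = 20_000
idx_is = rng.integers(0, n, size=(reps, k))                 # uniform, as in IS
mc_var_is = (rho[idx_is] * delta[idx_is]).mean(axis=1).var()
idx_bc = rng.choice(n, size=(reps, k), p=rho / rho.sum())   # resampling
mc_var_bc = (rho_bar * delta[idx_bc].mean(axis=1)).var()
```

The Monte Carlo variances should match the closed forms up to simulation error.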

The following two theorems present certain conditions when the BC-IR estimator would have lower variance than the IS estimator.

Theorem 3.3 Assume that $\delta_i^2 \ge c$ for samples where $\rho_i > \bar\rho$, and that $\delta_i^2 \le c$ for samples where $\rho_i \le \bar\rho$, for some $c > 0$. Then the BC-IR estimator has lower variance than the IS estimator.

###### Proof.

We show $\mathrm{Var}[X_{\text{IS}} \mid B] - \mathrm{Var}[X_{\text{BC}} \mid B] \ge 0$:

$$\mathrm{Var}[X_{\text{IS}} \mid B] - \mathrm{Var}[X_{\text{BC}} \mid B] = \frac{1}{kn}\sum_{i=1}^{n} \rho_i(\rho_i - \bar\rho)\,\delta_i^2 \;\ge\; \frac{c}{kn}\sum_{i=1}^{n} \rho_i(\rho_i - \bar\rho) = \frac{c}{k}\left(\frac{1}{n}\sum_{i=1}^{n} \rho_i^2 - \bar\rho^2\right) \ge 0,$$

where the first inequality uses the assumption, since $\rho_i(\rho_i - \bar\rho)\,\delta_i^2 \ge c\,\rho_i(\rho_i - \bar\rho)$ both when $\rho_i > \bar\rho$ and when $\rho_i \le \bar\rho$, and the last inequality holds because $\frac{1}{n}\sum_{i=1}^{n} \rho_i^2 \ge \bar\rho^2$.

∎

Theorem 3.4 Assume and the magnitude of the update