# Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation

###### Abstract

We consider off-policy estimation of the expected reward of a target policy using samples collected by a different behavior policy. Importance sampling (IS) has been a key technique for deriving (nearly) unbiased estimators, but is known to suffer from an excessively high variance in long-horizon problems. In the extreme case of infinite-horizon problems, the variance of an IS-based estimator may even be unbounded. In this paper, we propose a new off-policy estimator that applies IS directly on the stationary state-visitation distributions to avoid the exploding variance faced by existing methods. Our key contribution is a novel approach to estimating the density ratio of two stationary state distributions, with trajectories sampled from only the behavior distribution. We develop a mini-max loss function for the estimation problem, and derive a closed-form solution for the case of RKHS. We support our method with both theoretical and empirical analyses.

## 1 Introduction

Reinforcement learning (RL) [36] is one of the most successful approaches to artificial intelligence, and has found successful applications in robotics, games, dialogue systems, and recommendation systems, among others. One of the key problems in RL is policy evaluation: given a fixed policy, estimate the average reward garnered by an agent that runs this policy in the environment. In this paper, we consider the off-policy estimation problem, in which we want to estimate the expected reward of a given target policy with samples collected by a different behavior policy. This problem is of great practical importance in many application domains where deploying a new policy can be costly or risky, such as medical treatments [26], econometrics [13], recommender systems [19], education [23], Web search [18], advertising and marketing [4, 5, 38, 40]. It can also be used as a key component for developing efficient off-policy policy optimization algorithms [7, 14, 18, 39].

Most state-of-the-art off-policy estimation methods are based on importance sampling (IS) [e.g., 22]. A major limitation, however, is that this approach can become inaccurate due to the high variance introduced by the importance weights, especially when the trajectory is long. Indeed, most existing IS-based estimators compute the weight as the product of the importance ratios of many steps in the trajectory. Variances in individual steps accumulate multiplicatively, so that the overall IS weight of a random trajectory can have an exponentially high variance, resulting in an unreliable estimator. In the extreme case when the trajectory length is infinite, as in infinite-horizon average-reward problems, some of these estimators are not even well-defined. Ad hoc approaches can be used, such as truncating the trajectories, but they often lead to a hard-to-control bias in the final estimate. Analogous to the well-known “curse of dimensionality” in dynamic programming [2], we call this problem the “curse of horizon” in off-policy learning.

In this work, we develop a new approach that tackles the curse of horizon. The key idea is to apply importance sampling on the average visitation distribution of single steps of state-action pairs, instead of the much higher dimensional distribution of whole trajectories. This avoids the cumulative product across time in the density ratio, substantially decreasing its variance and eliminating the estimator’s dependence on the horizon.

Our key challenge, of course, is to estimate the importance ratios of average visitation distributions. In practice, we often have access to both the target and behavior policies to compute their importance ratio of an action conditioned on a given state. But we typically have no access to transition probabilities of the environment, so estimating importance ratios of state visitation distributions has been very difficult, especially when only off-policy samples are available. In this paper, we develop a mini-max loss function for estimating the true stationary density ratio, which yields a closed-form representation similar to maximum mean discrepancy [9] when combined with a reproducing kernel Hilbert space (RKHS). We study the theoretical properties of our loss function, and demonstrate its empirical effectiveness on long-horizon problems.

## 2 Background

#### Problem Definition

Consider a Markov decision process (MDP) [31] with state space $\mathcal{S}$, action space $\mathcal{A}$, reward function $r$, and transition probability function $T$. Assume the environment is initialized at state $s_0 \in \mathcal{S}$, drawn from an unknown distribution $d_0(\cdot)$. At each time step $t$, an agent observes the current state $s_t$, takes an action $a_t$ according to a possibly stochastic policy $\pi(\cdot \mid s_t)$, receives a reward $r_t$ whose expectation is $r(s_t, a_t)$, and transitions to a next state $s_{t+1}$ according to transition probabilities $T(\cdot \mid s_t, a_t)$. To simplify exposition and avoid unnecessary technicalities, we assume $\mathcal{S}$ and $\mathcal{A}$ are finite unless otherwise specified, although our method extends to continuous spaces straightforwardly, as demonstrated in experiments.

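We consider the infinite-horizon problem in which the MDP continues without termination. Let $p_\pi(\cdot)$ be the distribution of trajectory $\tau = \{s_t, a_t, r_t\}_{t=0}^{\infty}$ under policy $\pi$. The expected reward of $\pi$ is

$$R_\pi := \lim_{T\to\infty} \mathbb{E}_{\tau\sim p_\pi}\left[R^T(\tau)\right], \qquad R^T(\tau) := \frac{\sum_{t=0}^{T}\gamma^t r_t}{\sum_{t=0}^{T}\gamma^t},$$

where $R^T(\tau)$ is the reward of trajectory $\tau$ up to time $T$. Here, $\gamma \in (0,1]$ is a discount factor. We distinguish two reward criteria, the average reward ($\gamma = 1$) and discounted reward ($0 < \gamma < 1$):

$$\text{Average: } R(\tau) := \lim_{T\to\infty}\frac{1}{T+1}\sum_{t=0}^{T} r_t, \qquad \text{Discounted: } R(\tau) := (1-\gamma)\sum_{t=0}^{\infty}\gamma^t r_t,$$

where $(1-\gamma) = 1/\sum_{t=0}^{\infty}\gamma^t$ is a normalization factor. The problem of off-policy value estimation is to estimate the expected reward $R_\pi$ of a given target policy $\pi$, when we only observe a set of trajectories $\tau^i = \{s_t^i, a_t^i, r_t^i\}_{t=0}^{T}$ generated by following a different behavior policy $\pi_0$.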

#### Bellman Equation

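We briefly review the Bellman equation and the notation of value functions, for both average and discounted reward criteria. In the discounted case ($0 < \gamma < 1$), the value $V^\pi(s)$ is the expected total discounted reward when the initial state $s_0$ is fixed to be $s$: $V^\pi(s) = \mathbb{E}_{\tau\sim p_\pi}\left[\sum_{t=0}^{\infty}\gamma^t r_t \mid s_0 = s\right]$. Note that we do not normalize $V^\pi$ by $(1-\gamma)$ in our notation. For the average reward ($\gamma = 1$) case, the expected average reward does not depend on the initial state if the Markov process is ergodic [31]. Instead, the value function $V^\pi(s)$ in the average case measures the average adjusted sum of reward: $V^\pi(s) = \lim_{T\to\infty}\mathbb{E}_{\tau\sim p_\pi}\left[\sum_{t=0}^{T}(r_t - R_\pi) \mid s_0 = s\right]$. It represents the relative difference in total reward gained from starting in state $s_0 = s$ as opposed to $R_\pi$.

Under these definitions, $V^\pi$ is the fixed-point solution to the respective Bellman equations:

$$\text{Average:}\quad V^\pi(s) = \mathbb{E}_{a\sim\pi(\cdot\mid s),\ s'\sim T(\cdot\mid s,a)}\left[r(s,a) + V^\pi(s') - R_\pi\right], \tag{1}$$

$$\text{Discounted:}\quad V^\pi(s) = \mathbb{E}_{a\sim\pi(\cdot\mid s),\ s'\sim T(\cdot\mid s,a)}\left[r(s,a) + \gamma V^\pi(s')\right]. \tag{2}$$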

#### Importance Sampling

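IS represents a major class of approaches to off-policy estimation, which, in principle, only applies to the finite-horizon reward $R^T_\pi := \mathbb{E}_{\tau\sim p_\pi}[R^T(\tau)]$ when the trajectory is truncated at a finite time step $T < \infty$. IS-based estimators are based on the following change-of-measure equality:

$$R^T_\pi = \mathbb{E}_{\tau\sim p_{\pi_0}}\left[w_{0:T}(\tau)\,R^T(\tau)\right], \quad\text{with}\quad w_{0:T}(\tau) := \frac{p_\pi(\tau_{0:T})}{p_{\pi_0}(\tau_{0:T})} = \prod_{t=0}^{T}\beta_{\pi/\pi_0}(a_t\mid s_t), \tag{3}$$

where $\beta_{\pi/\pi_0}(a\mid s) := \pi(a\mid s)/\pi_0(a\mid s)$ is the single-step density ratio of policies $\pi$ and $\pi_0$ evaluated at a particular state-action pair $(s,a)$, and $w_{0:T}$ is the density ratio of the trajectory $\tau$ up to time $T$. Methods based on (3) are called trajectory-wise IS, or weighted IS (WIS) when the weights are self-normalized [22, 30]. It is possible to improve trajectory-wise IS with the so-called step-wise, or per-decision, IS/WIS, which uses weight $w_{0:t}$ for reward $r_t$ at time $t$, yielding smaller variance [30]. More details about these estimators are given in Appendix A.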
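To make the trajectory-wise estimators concrete, the following minimal sketch computes IS and WIS estimates from logged action probabilities. The trajectory representation (a list of `(p_target, p_behavior, reward)` triples) and all function names are illustrative assumptions, not part of the paper.

```python
def cumulative_weight(steps):
    """Product of single-step ratios pi(a_t|s_t) / pi_0(a_t|s_t) along one trajectory."""
    weight = 1.0
    for p_target, p_behavior, _ in steps:
        weight *= p_target / p_behavior
    return weight


def normalized_return(steps, gamma):
    """Discount-normalized reward: sum_t gamma^t r_t / sum_t gamma^t."""
    numerator = sum(gamma ** t * r for t, (_, _, r) in enumerate(steps))
    denominator = sum(gamma ** t for t in range(len(steps)))
    return numerator / denominator


def trajectory_is(trajectories, gamma=1.0, self_normalize=False):
    """Trajectory-wise IS estimate; self_normalize=True gives the WIS variant."""
    weights = [cumulative_weight(tr) for tr in trajectories]
    returns = [normalized_return(tr, gamma) for tr in trajectories]
    total = sum(w * g for w, g in zip(weights, returns))
    denominator = sum(weights) if self_normalize else len(trajectories)
    return total / denominator


# Two hypothetical logged trajectories of length 2.
example = [
    [(0.5, 0.25, 1.0), (0.5, 0.25, 1.0)],  # cumulative weight 4
    [(0.5, 0.75, 0.0), (0.5, 0.75, 0.0)],  # cumulative weight 4/9
]
is_estimate = trajectory_is(example)
wis_estimate = trajectory_is(example, self_normalize=True)
```

Even in this tiny example a single trajectory carries almost all of the weight, which is the effect that compounds as the horizon grows.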

#### The Curse of Horizon

The importance weight $w_{0:T}$ is a product of $T$ density ratios, whose variance can grow exponentially with $T$. Thus, IS-based estimators have not been widely successful in long-horizon problems, let alone infinite-horizon ones where $w_{0:\infty}$ may not even be well-defined. While WIS estimators often have reduced variance, the exponential dependence on horizon is unavoidable in general. We call this phenomenon in IS/WIS-based estimators the curse of horizon.

Not all hope is lost, however. To see this, consider an MDP with $n$ states and 2 actions, where states are arranged on a circle (see figure on the right). The two actions deterministically move the agent from the current state to the neighboring state counterclockwise and clockwise, respectively. Suppose we are given two policies with opposite effects: the behavior policy $\pi_0$ moves the agent clockwise with probability $\rho$, and the target policy $\pi$ moves the agent counterclockwise with probability $\rho$, for some constant $\rho \in (0,1)$. As shown in Appendix B, IS and WIS estimators suffer from exponentially large variance when estimating the average reward of $\pi$. However, a keen reader will realize that the two policies are symmetric, and thus their stationary state visitation distributions are identical. As we show in the sequel, this allows us to estimate the expected reward using a much more efficient importance sampling, whose importance weight equals the single-step density ratio $\beta_{\pi/\pi_0}(a_t\mid s_t)$, instead of the cumulative product weight $w_{0:T}$ in (3), allowing us to significantly reduce the variance. Such an observation inspired the approach developed in this paper.
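The circle example can be checked numerically. The sketch below is an illustration under stated assumptions: the behavior policy moves clockwise with probability `rho`, the target policy moves counterclockwise with probability `rho`, and the action choice is independent of the state, so the second moment of the cumulative weight factorizes across steps. Function names and the values `rho = 0.8`, `n = 6` are hypothetical.

```python
def second_moment_of_weight(rho, num_steps):
    """E under the behavior policy of the squared cumulative weight over num_steps steps.

    Single-step ratio: (1 - rho) / rho for a clockwise move, rho / (1 - rho) otherwise.
    """
    per_step = (1.0 - rho) ** 2 / rho + rho ** 2 / (1.0 - rho)
    return per_step ** num_steps


def push_forward(dist, p_clockwise):
    """Apply one step of the circle chain to a state distribution."""
    n = len(dist)
    out = [0.0] * n
    for s, mass in enumerate(dist):
        out[(s + 1) % n] += p_clockwise * mass          # clockwise neighbor
        out[(s - 1) % n] += (1.0 - p_clockwise) * mass  # counterclockwise neighbor
    return out


rho, n = 0.8, 6
uniform = [1.0 / n] * n
after_behavior = push_forward(uniform, rho)      # behavior: clockwise w.p. rho
after_target = push_forward(uniform, 1.0 - rho)  # target: counterclockwise w.p. rho

# The cumulative weight has mean 1, so its variance is the second moment minus 1.
variances = [second_moment_of_weight(rho, T) - 1.0 for T in (1, 10, 50)]
```

The uniform distribution is invariant under both policies, so the state density ratio is identically 1, while the variance of the trajectory weight grows geometrically in the number of steps.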

## 3 Off-Policy Estimation via Stationary State Density Ratio Estimation

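As shown in the example above, a significant decrease in estimation variance is possible when we apply importance weighting on the state space, rather than the trajectory space. It eliminates the dependency on the trajectory length and is much better suited for long- or infinite-horizon problems. To realize this, we need to introduce an alternative representation of the expected reward. Denote by $d_{\pi,t}(\cdot)$ the distribution of state $s_t$ when we execute policy $\pi$ starting from an initial state $s_0$ drawn from an initial distribution $d_0(\cdot)$. We define the average visitation distribution to be

$$d_\pi(s) = \lim_{T\to\infty}\frac{\sum_{t=0}^{T}\gamma^t d_{\pi,t}(s)}{\sum_{t=0}^{T}\gamma^t}. \tag{4}$$

We always assume the limit exists in this work. When $\gamma\in(0,1)$ in the discounted case, $d_\pi$ is a discounted average of $d_{\pi,t}$, that is, $d_\pi(s) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^t d_{\pi,t}(s)$; when $\gamma = 1$ in the average reward case, $d_\pi$ is the stationary distribution of $s_t$ as $t\to\infty$ under policy $\pi$, that is, $d_\pi(s) = \lim_{T\to\infty}\frac{1}{T+1}\sum_{t=0}^{T} d_{\pi,t}(s) = \lim_{t\to\infty} d_{\pi,t}(s)$.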

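Following definition (4), it can be verified that $R_\pi$ can be expressed alternatively as

$$R_\pi = \sum_{s,a} d_\pi(s)\,\pi(a\mid s)\,r(s,a) = \mathbb{E}_{(s,a)\sim d_\pi}\left[r(s,a)\right], \tag{5}$$

where, abusing notation slightly, we use $(s,a)\sim d_\pi$ to denote draws from distribution $d_\pi(s,a) := d_\pi(s)\,\pi(a\mid s)$. Our idea is to construct an IS estimator based on (5), where the importance ratio is computed on state-action pairs rather than on trajectories:

$$R_\pi = \sum_{s,a} d_{\pi_0}(s)\,\pi_0(a\mid s)\,\frac{d_\pi(s)\,\pi(a\mid s)}{d_{\pi_0}(s)\,\pi_0(a\mid s)}\,r(s,a) \tag{6}$$

$$\phantom{R_\pi} = \mathbb{E}_{(s,a)\sim d_{\pi_0}}\left[w_{\pi/\pi_0}(s)\,\beta_{\pi/\pi_0}(a\mid s)\,r(s,a)\right], \tag{7}$$

where $\beta_{\pi/\pi_0}(a\mid s) = \pi(a\mid s)/\pi_0(a\mid s)$ and $w_{\pi/\pi_0}(s) := d_\pi(s)/d_{\pi_0}(s)$ is the density ratio of the visitation distributions $d_\pi$ and $d_{\pi_0}$; here, $w_{\pi/\pi_0}$ is not known directly but can be estimated, as shown later. Eq. (7) allows us to construct a (weighted-)IS estimator by approximating $\mathbb{E}_{(s,a)\sim d_{\pi_0}}[\cdot]$ with data $\{s_t^i, a_t^i, r_t^i\}$ obtained when running policy $\pi_0$,

$$\hat R_\pi = \sum_{i}\sum_{t} w_t^i\,r_t^i, \quad\text{where}\quad w_t^i := \frac{\gamma^t\,w_{\pi/\pi_0}(s_t^i)\,\beta_{\pi/\pi_0}(a_t^i\mid s_t^i)}{\sum_{i',t'}\gamma^{t'}\,w_{\pi/\pi_0}(s_{t'}^{i'})\,\beta_{\pi/\pi_0}(a_{t'}^{i'}\mid s_{t'}^{i'})}. \tag{8}$$

This IS estimator works in the space of $(s,a)$, instead of trajectories $\tau$, leading to a potentially significant variance reduction. Returning to the example in Section 2 (see also Appendix B), since the two policies are symmetric and lead to the same stationary distributions, that is, $d_\pi = d_{\pi_0}$, the importance weight in (7) is simply $\beta_{\pi/\pi_0}(a\mid s)$, independent of the trajectory length. This avoids the excessive variance in long-horizon problems. In Appendix A, we provide a further discussion, showing that our estimator can be viewed as a type of Rao-Blackwellization of the trajectory-wise and step-wise estimators.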
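A minimal sketch of the resulting self-normalized estimator is given below. It assumes each logged step is stored as `(state, p_target, p_behavior, reward)` and that an estimate of the state density ratio is available as a callable `w_hat`; both conventions, and the function name, are hypothetical choices made for illustration.

```python
def stationary_ratio_estimate(trajectories, w_hat, gamma=1.0):
    """Self-normalized estimate of the expected reward using state density ratios.

    Each step contributes weight gamma^t * w_hat(s_t) * pi(a_t|s_t) / pi_0(a_t|s_t);
    no product over time steps is formed.
    """
    numerator = 0.0
    denominator = 0.0
    for trajectory in trajectories:
        for t, (state, p_target, p_behavior, reward) in enumerate(trajectory):
            weight = gamma ** t * w_hat(state) * p_target / p_behavior
            numerator += weight * reward
            denominator += weight
    return numerator / denominator


# One hypothetical trajectory over two states with identical policies.
logged = [[(0, 0.5, 0.5, 1.0), (1, 0.5, 0.5, 3.0)]]
plain_average = stationary_ratio_estimate(logged, lambda s: 1.0)
discounted = stationary_ratio_estimate(logged, lambda s: 1.0, gamma=0.5)
reweighted = stationary_ratio_estimate(logged, lambda s: {0: 2.0, 1: 1.0}[s])
```

With a unit density ratio and identical policies the estimate reduces to the plain (discount-weighted) average of observed rewards, which is a convenient sanity check.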

### 3.1 Average Reward Case

The key technical challenge remaining is estimating the density ratio $w_{\pi/\pi_0}(s)$, which we address in this section. To simplify the presentation, we start with estimating $w_{\pi/\pi_0}(s)$ for the average reward case and discuss the discounted case in Section 3.2.

Let $T_\pi(s'\mid s) := \sum_a T(s'\mid s,a)\,\pi(a\mid s)$ be the transition probability from $s$ to $s'$ following policy $\pi$. In the average reward case, $d_\pi$ equals the stationary distribution of $T_\pi$, satisfying

$$d_\pi(s') = \sum_{s} T_\pi(s'\mid s)\,d_\pi(s), \quad \forall s'. \tag{9}$$

Assuming the Markov chain of $T_\pi$ is finite-state and ergodic, $d_\pi$ is also the unique distribution that satisfies (9). This simple fact can be leveraged to derive the following key property of $w_{\pi/\pi_0}$.

Theorem. In the average reward case ($\gamma = 1$), assume $d_\pi$ is the unique invariant distribution of $T_\pi$ and $d_{\pi_0}(s) > 0$, $\forall s$. Then a function $w(s)$ equals $w_{\pi/\pi_0}(s) := d_\pi(s)/d_{\pi_0}(s)$ (up to a constant factor) if and only if it satisfies

$$\mathbb{E}_{(s,a)\mid s'\sim d_{\pi_0}}\left[w(s)\,\beta_{\pi/\pi_0}(a\mid s) - w(s') \,\middle|\, s'\right] = 0, \quad \forall s', \tag{10}$$

where $(s,a)\mid s'\sim d_{\pi_0}$ denotes the conditional distribution $d_{\pi_0}(s,a\mid s')$ related to the joint distribution $d_{\pi_0}(s,a,s') := d_{\pi_0}(s)\,\pi_0(a\mid s)\,T(s'\mid s,a)$. Note that this is a time-reversed conditional probability, since it is the conditional distribution of $(s,a)$ given that their next state is $s'$ following policy $\pi_0$.

Because the conditional distribution is time reversed, it is difficult to directly estimate the conditional expectation $\mathbb{E}_{(s,a)\mid s'\sim d_{\pi_0}}[\cdot\mid s']$ for a given $s'$. This is because we usually can observe only a single data point from $d_{\pi_0}(\cdot\mid s')$ of a fixed $s'$, given that it is difficult to see by chance two different $(s,a)$ pairs transit to the same $s'$. This causes a biased gradient problem if we directly minimize the L2 error of (10) in order to estimate $w$. To be more specific, denoting by $\epsilon(w; s')$ the LHS of Eq. (10), a naive approach to estimating $w$ would be to minimize the mean squared loss $\mathbb{E}_{s'\sim d_{\pi_0}}[\epsilon(w; s')^2]$. Unfortunately, it is hard to construct an unbiased estimator of this loss (and its gradient), because of the difficulty of estimating the conditional expectation that appears inside the square function. This problem is also known as the double-sample issue in the RL literature [baird95residual].

This problem can be addressed by introducing a discriminator function and constructing a mini-max loss function. Specifically, multiplying (10) with a function $f(s')$ and averaging under $s'\sim d_{\pi_0}$ gives

$$L(w,f) := \mathbb{E}_{(s,a,s')\sim d_{\pi_0}}\left[\Delta(w; s,a,s')\,f(s')\right], \qquad \Delta(w; s,a,s') := w(s)\,\beta_{\pi/\pi_0}(a\mid s) - w(s'). \tag{11}$$

Following Theorem 3.1, we have $w \propto w_{\pi/\pi_0}$ if and only if $L(w,f) = 0$ for any function $f$. This motivates us to estimate $w$ with a mini-max problem:

$$\min_{w}\Big\{ D(w) := \max_{f\in\mathcal{F}} L(w/z_w, f)^2 \Big\}, \tag{12}$$

where $\mathcal{F}$ is a set of discriminator functions and $z_w := \mathbb{E}_{s\sim d_{\pi_0}}[w(s)]$ normalizes $w$ to avoid the trivial solution $w \equiv 0$. We shall assume $\mathcal{F}$ to be rich enough following the conditions to be discussed in Section 3.3. A promising choice of a rich function class is neural networks, for which the mini-max problem (12) can be solved numerically in a fashion similar to generative adversarial networks (GANs) [8]. Alternatively, we can take $\mathcal{F}$ to be a ball of a reproducing kernel Hilbert space (RKHS), which enables a closed-form representation of $D(w)$ as we show in the following.
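The characterization above can be verified on a small tabular problem. The sketch below builds a hypothetical 3-state, 2-action MDP (all numbers are made up), computes both stationary distributions by power iteration, and evaluates the population loss $L(w,f) = \mathbb{E}_{(s,a,s')\sim d_{\pi_0}}[(w(s)\,\pi(a\mid s)/\pi_0(a\mid s) - w(s'))\,f(s')]$ for indicator test functions. It uses the true transition model, which the estimator in the paper does not have access to, purely to check the identity.

```python
def policy_chain(T, pi):
    """State-to-state transition matrix induced by policy pi: P[s][s2]."""
    n = len(T)
    return [[sum(pi[s][a] * T[s][a][s2] for a in range(len(pi[s])))
             for s2 in range(n)] for s in range(n)]


def stationary(P, iters=2000):
    """Stationary distribution of an ergodic chain via power iteration."""
    n = len(P)
    d = [1.0 / n] * n
    for _ in range(iters):
        d = [sum(d[s] * P[s][s2] for s in range(n)) for s2 in range(n)]
    return d


def loss(w, f, T, pi, pi0, d_pi0):
    """Population value of L(w, f) under the behavior policy's stationary law."""
    n = len(T)
    total = 0.0
    for s in range(n):
        for a in range(len(pi0[s])):
            beta = pi[s][a] / pi0[s][a]
            for s2 in range(n):
                prob = d_pi0[s] * pi0[s][a] * T[s][a][s2]
                total += prob * (w[s] * beta - w[s2]) * f[s2]
    return total


# T[s][a] is the next-state distribution for action a in state s.
T = [
    [[0.1, 0.9, 0.0], [0.5, 0.0, 0.5]],
    [[0.0, 0.2, 0.8], [0.7, 0.3, 0.0]],
    [[0.6, 0.0, 0.4], [0.0, 0.5, 0.5]],
]
pi = [[0.9, 0.1]] * 3   # target policy
pi0 = [[0.4, 0.6]] * 3  # behavior policy

d_pi = stationary(policy_chain(T, pi))
d_pi0 = stationary(policy_chain(T, pi0))
w_true = [p / q for p, q in zip(d_pi, d_pi0)]
indicators = [[1.0 if s == k else 0.0 for s in range(3)] for k in range(3)]

loss_true = [loss(w_true, f, T, pi, pi0, d_pi0) for f in indicators]
loss_scaled = [loss([3.0 * x for x in w_true], f, T, pi, pi0, d_pi0) for f in indicators]
loss_ones = [loss([1.0] * 3, f, T, pi, pi0, d_pi0) for f in indicators]
```

The loss vanishes for the true ratio and for any rescaling of it, which is why the normalization in the mini-max objective is needed, and it is clearly nonzero for the constant guess $w \equiv 1$.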

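We start with a brief introduction of RKHS. A symmetric function $k(s,\bar s)$ is called positive definite if all matrices of form $[k(s_i, s_j)]_{ij}$ are positive definite for any $\{s_i\}\subseteq\mathcal{S}$.
Related to every positive definite kernel $k(s,\bar s)$ is a unique RKHS $\mathcal{H}$, which is the closure of functions of form $f(s) = \sum_i a_i\,k(s, s_i)$, $\forall a_i\in\mathbb{R}$, $s_i\in\mathcal{S}$, equipped with inner product $\langle f, g\rangle_{\mathcal{H}} = \sum_{ij} a_i\,k(s_i, s_j)\,b_j$ for $g(s) = \sum_i b_i\,k(s, s_i)$.
A key property of RKHS is the so-called reproducing property, which says $f(s) = \langle f, k(\cdot, s)\rangle_{\mathcal{H}}$.

Theorem. Assume $\mathcal{H}$ is an RKHS of functions $f(s)$ with a positive definite kernel $k(s,\bar s)$, and define $\mathcal{F} := \{f\in\mathcal{H} : \|f\|_{\mathcal{H}}\le 1\}$ to be the unit ball of $\mathcal{H}$.
We have

$$\max_{f\in\mathcal{F}} L(w,f)^2 = \mathbb{E}_{d_{\pi_0}}\left[\Delta(w; s,a,s')\,\Delta(w; \bar s,\bar a,\bar s')\,k(s',\bar s')\right], \tag{13}$$

where $(s,a,s')$ and $(\bar s,\bar a,\bar s')$ are independent transition pairs obtained when running policy $\pi_0$, and $\Delta(w; s,a,s') := w(s)\,\beta_{\pi/\pi_0}(a\mid s) - w(s')$ is defined in (11). See Appendix C for more background on RKHS. In practice, we approximate the expectation in (13) using the discounted empirical distribution of the transition pairs, yielding consistent estimates following standard results on V-statistics [33].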
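The kernel form of the loss can be estimated from logged transitions with a V-statistic, as sketched below. The sketch assumes scalar states, a Gaussian kernel with a hypothetical `bandwidth` parameter, transitions stored as `(s, beta, s_next)` with `beta` the single-step policy ratio, and uniform (undiscounted) weighting of the transitions; none of these conventions come from the paper.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian RBF kernel on scalar states."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def kernel_loss(transitions, w, kernel=gaussian_kernel):
    """V-statistic estimate of max_f L(w / z_w, f)^2 over the RKHS unit ball.

    transitions: list of (s, beta, s_next) tuples.
    w: callable mapping a state to a candidate density ratio value.
    """
    n = len(transitions)
    z = sum(w(s) for s, _, _ in transitions) / n  # empirical normalizer z_w
    deltas = [(w(s) * beta - w(s_next)) / z for s, beta, s_next in transitions]
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += deltas[i] * deltas[j] * kernel(transitions[i][2], transitions[j][2])
    return total / n ** 2


on_policy = [(0.0, 1.0, 1.0), (1.0, 1.0, 0.0)]
loss_on_policy = kernel_loss(on_policy, lambda s: 1.0)
loss_single = kernel_loss([(0.0, 2.0, 0.0)], lambda s: 1.0)
loss_generic = kernel_loss([(0.0, 2.0, 1.0), (1.0, 0.5, 0.0), (0.5, 1.5, 0.5)],
                           lambda s: 1.0 + s)
```

The double loop costs quadratic time in the number of transitions, so in practice one would likely evaluate it on mini-batches.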

### 3.2 Discounted Reward Case

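We now discuss the extension to the discounted case of $\gamma\in(0,1)$. Similar to the average reward case, we start with a recursive equation that characterizes $d_\pi(s)$ in the discounted case.

Lemma. Following the definition of $d_\pi$ in (4), for any $\gamma\in(0,1]$, we have

$$\gamma\sum_{s} T_\pi(s'\mid s)\,d_\pi(s) - d_\pi(s') + (1-\gamma)\,d_0(s') = 0, \quad \forall s'. \tag{14}$$

Denote by $(s,a,s')\sim d_\pi$ draws from $d_\pi(s)\,\pi(a\mid s)\,T(s'\mid s,a)$. For any function $f$, we have

$$\gamma\,\mathbb{E}_{(s,a,s')\sim d_\pi}\left[f(s')\right] - \mathbb{E}_{s\sim d_\pi}\left[f(s)\right] + (1-\gamma)\,\mathbb{E}_{s\sim d_0}\left[f(s)\right] = 0. \tag{15}$$

One may view $d_\pi$ as the invariant distribution of an induced Markov chain with transition probability of $(1-\gamma)\,d_0(s') + \gamma\,T_\pi(s'\mid s)$, which follows $T_\pi$ with probability $\gamma$, and restarts from initial distribution $d_0(s')$ with probability $1-\gamma$. We can show that $d_\pi$ exists and is unique under mild conditions [31].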
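For intuition, the recursion can be derived in one line. The following is a sketch for $\gamma\in(0,1)$, writing $T_\pi(s'\mid s)$ for the state transition under $\pi$, $d_{\pi,t}$ for the state distribution at time $t$, and assuming $d_{\pi,t+1}(s') = \sum_s T_\pi(s'\mid s)\,d_{\pi,t}(s)$ and $d_\pi = (1-\gamma)\sum_{t\ge 0}\gamma^t d_{\pi,t}$:

$$\gamma\sum_{s} T_\pi(s'\mid s)\,d_\pi(s) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^{t+1}\sum_{s} T_\pi(s'\mid s)\,d_{\pi,t}(s) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^{t+1}\,d_{\pi,t+1}(s') = d_\pi(s') - (1-\gamma)\,d_0(s').$$

Rearranging gives $\gamma\sum_s T_\pi(s'\mid s)\,d_\pi(s) - d_\pi(s') + (1-\gamma)\,d_0(s') = 0$, and multiplying by a test function $f(s')$ and summing over $s'$ gives the expectation form.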

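Theorem. Assume $d_\pi$ is the unique solution of (14), and $d_{\pi_0}(s) > 0$, $\forall s$. Define

$$L(w,f) := \gamma\,\mathbb{E}_{(s,a,s')\sim d_{\pi_0}}\left[\Delta(w; s,a,s')\,f(s')\right] + (1-\gamma)\,\mathbb{E}_{s\sim d_0}\left[(1 - w(s))\,f(s)\right], \tag{16}$$

with $\Delta(w; s,a,s') := w(s)\,\beta_{\pi/\pi_0}(a\mid s) - w(s')$ as in (11). Assume $\gamma\in(0,1]$, then $w = w_{\pi/\pi_0}$ if and only if $L(w,f) = 0$ for any test function $f$. When $\gamma = 1$, the definition in (16) reduces to the average reward case in (11). A subtle difference is that $L(w,f) = 0$ only ensures $w \propto w_{\pi/\pi_0}$ when $\gamma = 1$, while $w = w_{\pi/\pi_0}$ when $\gamma\in(0,1)$. This is because the additional term $(1-\gamma)\,\mathbb{E}_{s\sim d_0}[(1 - w(s))\,f(s)]$ in (16) forces $w$ to be normalized properly. In practice, however, we still find it works better to pre-normalize $w$ to $w/z_w$, and optimize the objective $L(w/z_w, f)$.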
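The discounted characterization can also be checked numerically. The sketch below reuses a hypothetical 3-state, 2-action MDP (made-up numbers), computes the discounted visitation distributions by truncating the series $(1-\gamma)\sum_t\gamma^t d_{\pi,t}$, and evaluates the discounted loss with the true transition model. The model is used only to verify the identity; the names and the truncation horizon are assumptions.

```python
def policy_chain(T, pi):
    """State-to-state transition matrix induced by policy pi."""
    n = len(T)
    return [[sum(pi[s][a] * T[s][a][s2] for a in range(len(pi[s])))
             for s2 in range(n)] for s in range(n)]


def discounted_visitation(P, d_init, gamma, horizon=2000):
    """Truncated series (1 - gamma) * sum_t gamma^t d_t for chain P started at d_init."""
    n = len(P)
    d_t = list(d_init)
    d = [0.0] * n
    for t in range(horizon):
        for s in range(n):
            d[s] += (1.0 - gamma) * gamma ** t * d_t[s]
        d_t = [sum(d_t[s] * P[s][s2] for s in range(n)) for s2 in range(n)]
    return d


def discounted_loss(w, f, T, pi, pi0, d_pi0, d_init, gamma):
    """gamma * E[(w(s) beta - w(s')) f(s')] + (1 - gamma) * E_{d_init}[(1 - w) f]."""
    n = len(T)
    total = 0.0
    for s in range(n):
        for a in range(len(pi0[s])):
            beta = pi[s][a] / pi0[s][a]
            for s2 in range(n):
                prob = d_pi0[s] * pi0[s][a] * T[s][a][s2]
                total += gamma * prob * (w[s] * beta - w[s2]) * f[s2]
        total += (1.0 - gamma) * d_init[s] * (1.0 - w[s]) * f[s]
    return total


T = [
    [[0.1, 0.9, 0.0], [0.5, 0.0, 0.5]],
    [[0.0, 0.2, 0.8], [0.7, 0.3, 0.0]],
    [[0.6, 0.0, 0.4], [0.0, 0.5, 0.5]],
]
pi, pi0 = [[0.9, 0.1]] * 3, [[0.4, 0.6]] * 3
gamma, d_init = 0.9, [1.0, 0.0, 0.0]

P_pi = policy_chain(T, pi)
d_pi = discounted_visitation(P_pi, d_init, gamma)
d_pi0 = discounted_visitation(policy_chain(T, pi0), d_init, gamma)
w_true = [p / q for p, q in zip(d_pi, d_pi0)]

# Residual of the recursion: gamma * sum_s P(s'|s) d(s) - d(s') + (1 - gamma) d_init(s').
residual = [gamma * sum(P_pi[s][s2] * d_pi[s] for s in range(3)) - d_pi[s2]
            + (1.0 - gamma) * d_init[s2] for s2 in range(3)]

indicators = [[1.0 if s == k else 0.0 for s in range(3)] for k in range(3)]
loss_true = [discounted_loss(w_true, f, T, pi, pi0, d_pi0, d_init, gamma) for f in indicators]
loss_scaled = [discounted_loss([2.0 * x for x in w_true], f, T, pi, pi0, d_pi0, d_init, gamma)
               for f in indicators]
loss_ones = [discounted_loss([1.0] * 3, f, T, pi, pi0, d_pi0, d_init, gamma) for f in indicators]
```

Unlike the average reward case, rescaling the true ratio no longer gives zero loss here, which illustrates how the extra term pins down the normalization.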

### 3.3 Further Theoretical Analysis

In this section, we develop further theoretical understanding on the loss function $L(w,f)$. Lemma 3.3 below reveals an interesting connection between $L(w,f)$ and the Bellman equation, allowing us to bound the estimation error of density ratio and expected reward with the mini-max loss when the discriminator space $\mathcal{F}$ is chosen properly (Theorems 3.3 and 3.3). The results in this section apply to both discounted and average reward cases.

Given in (16), and assuming in the average reward case, we have

(17) | |||

(18) |

Note that equals the left hand side of the Bellman equations (1) and (2), when . Lemma 3.3 represents as an inner product between and (under base measure ). This provides an alternative proof of Theorem 3.2, since implies that is orthogonal with all and hence when is sufficiently rich.

In order to make orthogonal to a given function , it requires “reversing” operator : finding a function which solves for given . Observing that can be viewed as a Bellman equation (Eqs. (1)–(2)) when taking and to be the reward and value functions, respectively, we can derive an explicit representation of (Lemma C in Appendix). This allows one to gain insights into what discriminator set would be a good choice, so that minimizing yields good estimation with desirable properties. In the following, by taking , , we can characterize the conditions on under which the mini-max loss upper bounds the estimation error of or .

Theorem. Let be the -step transition probability of . For , define

(19) |

Assume Lemma 3.3 holds. We have

Since our main goal is to estimate the expected total reward instead of the density ratio , it is of interest to select to directly bound the estimation error of the total reward. Interestingly, this can be achieved once includes the true value function .

Theorem. Define to be the reward estimate using estimated density ratio (which may not equal the true ratio ) and infinite number of trajectories from , that is,

Assume is properly normalized such that , we have Therefore, if , we have

The above analysis gives the theoretical property of the expected objective function, assuming infinite data. In practice, an empirical loss function is used based on samples of finite size. Specifically, let the empirical estimator that we constructed based on using (7). We have

where the first term accounts for the deterministic error due to the estimation of , and the second term is due to the use of empirical averaging, and decreases to zero as the data size increases. A full analysis of the error bound can be established using standard concentration bounds, which we leave for future work.

## 4 Related Work

Our off-policy setting is related to, but different from, off-policy value-function learning [30, 29, 37, 12, 25, 21]. Our goal is to estimate a single scalar that summarizes the quality of a policy (a.k.a. off-policy value estimation as called by some authors [20]). However, our idea can be extended to estimating value functions as well, by using estimated density ratios to weight observed transitions (cf. the distribution in LSTDQ [16]). We leave this as future work.

IS-based off-policy value estimation has seen a lot of interest recently for short-horizon problems, including contextual bandits [26, 13, 7, 42], and achieved many empirical successes [7, 34]. When extended to long-horizon problems, it faces an exponential blowup of variance, and variance-reduction techniques are used to improve the estimator [14, 39, 10, 42]. However, it can be proved that in the worst case, the mean squared error of any estimator has to depend exponentially on the horizon [20, 10]. Fortunately, many problems encountered in practical applications may present structures that enable more efficient off-policy estimation, as tackled by the present paper. An interesting open direction is to characterize theoretical conditions that can ensure tractable estimation for long horizon problems.

Few prior works directly target infinite-horizon problems. There exist approaches that use simulated samples to estimate stationary state distributions [1, Chapter IV]. However, they need a reliable model to draw such simulations, a requirement that is not satisfied in many real-world applications. To the best of our knowledge, the recently developed COP-TD algorithm [11] is the only work that attempts to estimate $w_{\pi/\pi_0}$ as an intermediate step of estimating the value function of a target policy $\pi$. They take a stochastic-approximation approach and show asymptotic consistency. However, extending their approach to continuous state/action spaces appears challenging.

Finally, there is a comprehensive literature of two-sample density ratio estimation [e.g., 27, 35], which estimates the density ratio of two distributions from pairs of their samples. Our problem setting is different in that we only have data from $d_{\pi_0}$, but not from $d_\pi$; this makes the traditional density ratio estimators inapplicable to our problem. Our method is made possible by taking the special temporal structure of MDP into consideration.

## 5 Experiment

In this section, we conduct experiments on different environmental settings to compare our method with existing off-policy evaluation methods. We compare with the standard trajectory-wise and step-wise IS and WIS methods. We do not report the results of unnormalized IS because they are generally significantly worse than WIS methods [30, 22]. In all the cases, we also compare with an on-policy oracle and a naive averaging baseline, which estimate the reward using direct averaging over the trajectories generated by the target policy and behavior policy, respectively. For problems with discrete action and state spaces, we also compare with a standard model-based method, which estimates the transition and reward model and then calculates expected reward explicitly using the model up to the desired truncation length. When applying our method on problems with finite and discrete state space, we optimize $w$ and $f$ in the space of all possible functions (corresponding to using a delta kernel in terms of RKHS). For continuous state space, we assume $w$ is a standard feed-forward neural network, and $\mathcal{F}$ is an RKHS with a standard Gaussian RBF kernel whose bandwidth equals the median of the pairwise distances between the observed data points.
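The median bandwidth rule mentioned above can be sketched as follows. The kernel parameterization $\exp(-\|x-y\|^2/(2h^2))$ and the function names are assumptions made for illustration; the text only specifies that the bandwidth equals the median pairwise distance.

```python
import math
from statistics import median


def median_bandwidth(points):
    """Median of all pairwise Euclidean distances between distinct data points."""
    distances = [math.dist(p, q)
                 for i, p in enumerate(points)
                 for q in points[i + 1:]]
    return median(distances)


def rbf_kernel(x, y, bandwidth):
    """Gaussian RBF kernel exp(-||x - y||^2 / (2 * bandwidth^2))."""
    return math.exp(-math.dist(x, y) ** 2 / (2.0 * bandwidth ** 2))


states = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]  # hypothetical observed states
h = median_bandwidth(states)
k_value = rbf_kernel(states[0], states[1], h)
```

For large datasets the pairwise distances are usually computed on a random subsample, since the full computation is quadratic in the number of points.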

Because we cannot simulate truly infinite steps in practice, we use the behavior policy to generate trajectories of length $T$, and evaluate the algorithms based on the mean squared error (MSE) w.r.t. the $T$-step rewards of a large number of trajectories of length $T$ from the target policy. We expect that our method gets better as $T$ increases, since it is designed for infinite-horizon problems, while the IS/WIS methods incur large variance and deteriorate as $T$ increases.

Figure 1 (panel titles): (a) # of Trajectories; (b) Different Behavior Policies; (c) Truncated Length; (d) Estimated vs. True Density Ratio Plot; (e) Training Iteration.

#### Taxi Environment

Taxi [6] is a 2D grid world simulating taxi movement along the grids. A taxi moves North, East, South, West or attempts to pick up or drop off a passenger. It receives a reward of when it successfully picks up a passenger or drops her off at the right place, and otherwise a reward of -1 every time step. The original taxi environment would stop when the taxi successfully picks up a passenger and drops her off at the right place. We modify the environment to make it infinite horizon, by allowing passengers to randomly appear and disappear at every corner of the map at each time step. We use a grid size of , which yields states in total (, corresponding to taxi locations, passenger appearance status and taxi status (empty or with one of 4 destinations)).

Figure 2 (panel titles): (a) # of Trajectories; (b) Different Behavior Policies; (c) Truncated Length; (d) Discount Factor.

To construct target and behavior policies for testing our algorithm, we set our target policy to be the final policy after running Q-learning for iterations, and set another policy after iterations. The behavior policy is , where is a mixing ratio that can be varied.

#### Results in Taxi Environment

Figure 1(a)–(b) show results with average reward. We can see our method performs almost as well as the on-policy oracle, outperforming all the other methods. To evaluate the approximation error of the estimated density ratio $\hat w$, we plot in Figure 1(c) the weighted total variation (TV) distance between $\hat w$ and the true density ratio $w_{\pi/\pi_0}$ as we optimize the loss function. Figure 1(d) shows a scatter plot of the estimated versus true density ratios at convergence, indicating our method correctly estimates the true density ratio over the state space.

Figure 2 shows similar results for discounted reward. From Figure 2(c) and (d), we can see that typical IS methods deteriorate as the trajectory length $T$ and discount factor $\gamma$ increase, respectively, which is expected since their variance grows exponentially with $T$. In contrast, our density ratio method performs better as trajectory length $T$ increases, and is robust as $\gamma$ increases.

#### Pendulum Environment

The Taxi environment features discrete action and state spaces. We now test Pendulum, which has a continuous state space of and action space of . In this environment, we want to control the pendulum to make it stand up as long as possible (for the average case), or as fast as possible (for small discounted case). The policy is taken to be a truncated Gaussian whose mean is a neural network of the states and variance a constant.

We train a near-optimal policy using REINFORCE and set it to be the target policy. The behavior policy is set to be , where is a mixing ratio, and is another policy from REINFORCE when it has not converged. Our results are shown in Figure 3, where we again find that our method generally outperforms the standard trajectory-wise and step-wise WIS, and works favorably in long-horizon problems (Figure 3(b)).

Figure 3 (panel titles): (a) Mixing Ratio, Average Reward Case; (b) Truncated Length, Average Reward Case; (c) Mixing Ratio, Discounted Reward Case; (d) Discount Factor, Discounted Reward Case.

Figure 4 (panel titles): (a) Environment; (b) # of Trajectories; (c) Different Behavior Policies; (d) Truncated Length.

#### SUMO Traffic Simulator

SUMO [15] is an open source traffic simulator; see Figure 4(a) for an illustration. We consider the task of reducing traffic congestion by modelling traffic light control as a reinforcement learning problem [41]. We use TraCI, a built-in “Traffic Control Interface”, to interact with the SUMO simulator. Full details of our environmental settings can be found in Appendix E. Our results are shown in Figure 4, where we again find that our method is consistently better than standard IS methods.

## 6 Conclusions

We study the off-policy estimation problem in infinite-horizon problems and develop a new algorithm based on direct estimation of the stationary state density ratio between the target and behavior policies. Our mini-max objective function enjoys nice theoretical properties and yields an intriguing connection with Bellman equations that is worth further investigation. Future directions include scaling our method to larger scale problems and extending it to estimate value functions and leverage off-policy data in policy optimization.

## Acknowledgement

This work is supported in part by NSF CRII 1830161. We would like to acknowledge Google Cloud for their support.

## References

- [1] Søren Asmussen and Peter W. Glynn. Stochastic Simulation: Algorithms and Analysis, volume 57 of Probability Theory and Stochastic Processes. Springer-Verlag, 2007.
- [2] Richard E. Bellman. Dynamic Programming. Princeton University Press, 1957.
- [3] Alain Berlinet and Christine Thomas-Agnan. Reproducing kernel Hilbert spaces in probability and statistics. Springer Science & Business Media, 2011.
- [4] Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis Xavier Charles, D. Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research, 14:3207–3260, 2013.
- [5] Olivier Chapelle, Eren Manavoglu, and Romer Rosales. Simple and scalable response prediction for display advertising. ACM Transactions on Intelligent Systems and Technology, 5(4):61:1–61:34, 2014.
- [6] Thomas G Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.
- [7] Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine Learning (ICML), pages 1097–1104, 2011.
- [8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27 (NIPS), pages 2672–2680, 2014.
- [9] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.
- [10] Zhaohan Guo, Philip S. Thomas, and Emma Brunskill. Using options and covariance testing for long horizon off-policy policy evaluation. In Advances in Neural Information Processing Systems 30 (NIPS), pages 2489–2498, 2017.
- [11] Assaf Hallak and Shie Mannor. Consistent on-line off-policy evaluation. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1372–1383, 2017.
- [12] Assaf Hallak, Aviv Tamar, Remi Munos, and Shie Mannor. Generalized emphatic temporal difference learning: Bias-variance analysis. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, pages 1631–1637, 2016.
- [13] Keisuke Hirano, Guido W Imbens, and Geert Ridder. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71(4):1161–1189, 2003.
- [14] Nan Jiang and Lihong Li. Doubly robust off-policy evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 652–661, 2016.
- [15] Daniel Krajzewicz, Jakob Erdmann, Michael Behrisch, and Laura Bieker. Recent development and applications of SUMO - Simulation of Urban MObility. International Journal On Advances in Systems and Measurements, 5(3&4), 2012.
- [16] Michail G. Lagoudakis and Ronald Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003.
- [17] David A Levin and Yuval Peres. Markov chains and mixing times, volume 107. American Mathematical Soc., 2017.
- [18] Lihong Li, Shunbao Chen, Ankur Gupta, and Jim Kleban. Counterfactual analysis of click metrics for search engine optimization: A case study. In Proceedings of the 24th International World Wide Web Conference (WWW), Companion Volume, pages 929–934, 2015.
- [19] Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the 4th International Conference on Web Search and Data Mining (WSDM), pages 297–306, 2011.
- [20] Lihong Li, Rémi Munos, and Csaba Szepesvári. Toward minimax off-policy value estimation. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 608–616, 2015.
- [21] Hao Liu, Yihao Feng, Yi Mao, Dengyong Zhou, Jian Peng, and Qiang Liu. Action-dependent control variates for policy optimization via Stein identity. In Proceedings of the 6th International Conference on Learning Representations (ICLR), 2018.
- [22] Jun S. Liu. Monte Carlo Strategies in Scientific Computing. Springer Series in Statistics. Springer-Verlag, 2001.
- [23] Travis Mandel, Yun-En Liu, Sergey Levine, Emma Brunskill, and Zoran Popovic. Offline policy evaluation across representations with applications to educational games. In Proceedings of the 13th International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), pages 1077–1084, 2014.
- [24] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, Bernhard Schölkopf, et al. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends® in Machine Learning, 10(1-2):1–141, 2017.
- [25] Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc G. Bellemare. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems 29 (NIPS), pages 1046–1054, 2016.
- [26] Susan A. Murphy, Mark van der Laan, and James M. Robins. Marginal mean models for dynamic regimes. Journal of the American Statistical Association, 96(456):1410–1423, 2001.
- [27] XuanLong Nguyen, Martin J. Wainwright, and Michael I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.
- [28] Art B. Owen. Monte Carlo Theory, Methods and Examples. 2013. http://statweb.stanford.edu/~owen/mc.
- [29] Doina Precup, Richard S. Sutton, and Sanjoy Dasgupta. Off-policy temporal-difference learning with function approximation. In Proceedings of the 18th International Conference on Machine Learning (ICML), pages 417–424, 2001.
- [30] Doina Precup, Richard S. Sutton, and Satinder P. Singh. Eligibility traces for off-policy policy evaluation. In Proceedings of the 17th International Conference on Machine Learning (ICML), pages 759–766, 2000.
- [31] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience, New York, 1994.
- [32] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.
- [33] Robert J Serfling. Approximation theorems of mathematical statistics, volume 162. John Wiley & Sons, 2009.
- [34] Alexander L. Strehl, John Langford, Lihong Li, and Sham M. Kakade. Learning from logged implicit exploration data. In Advances in Neural Information Processing Systems 23 (NIPS-10), pages 2217–2225, 2010.
- [35] Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density ratio estimation in machine learning. Cambridge University Press, 2012.
- [36] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, March 1998.
- [37] Richard S. Sutton, A. Rupam Mahmood, and Martha White. An emphatic approach to the problem of off-policy temporal-difference learning. Journal of Machine Learning Research, 17(73):1–29, 2016.
- [38] Liang Tang, Romer Rosales, Ajit Singh, and Deepak Agarwal. Automatic ad format selection via contextual bandits. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management (CIKM), pages 1587–1594, 2013.
- [39] Philip S. Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 2139–2148, 2016.
- [40] Philip S. Thomas, Georgios Theocharous, Mohammad Ghavamzadeh, Ishan Durugkar, and Emma Brunskill. Predictive off-policy policy evaluation for nonstationary decision problems, with applications to digital marketing. In Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI), pages 4740–4745, 2017.
- [41] Elise Van der Pol and Frans A Oliehoek. Coordinated deep reinforcement learners for traffic light control. In NIPS Workshop on Learning, Inference and Control of Multi-Agent Systems, 2016.
- [42] Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudík. Optimal and adaptive off-policy evaluation in contextual bandits. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 3589–3597, 2017.

## Appendix A Several Variants of IS- and WIS-based Estimators


Denote by for notational simplicity. Define

Then we have the following two key formulas, which derive the trajectory-wise, and step-wise importance sampling (IS) estimators, respectively.

(20) | ||||

(21) |

where the only difference between (20) and (21) is that (21) replaces the in (20) with , yielding smaller variance without changing the expectation. This is made possible because . Therefore, the step-wise estimator can be viewed as Rao-Blackwellizing each term in (20) by conditioning on .
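For reference, the identity behind the trajectory-wise and step-wise forms can be written out in standard importance-sampling notation. The symbols below ($\beta$, $w_{0:t}$, $r_t$, $\gamma$, $T$, $\pi$, $\pi_0$) are assumed here for illustration and may differ from the notation used elsewhere in the paper; any normalization of the return is omitted, and $r_t$ is assumed to depend only on the trajectory up to step $t$.

$$
\begin{aligned}
\beta(a \mid s) &= \frac{\pi(a \mid s)}{\pi_0(a \mid s)}, \qquad
w_{0:t} = \prod_{t'=0}^{t} \beta(a_{t'} \mid s_{t'}), \\
\mathbb{E}_{\tau \sim p_{\pi}}\Big[\sum_{t=0}^{T} \gamma^{t} r_t\Big]
&= \mathbb{E}_{\tau \sim p_{\pi_0}}\Big[\sum_{t=0}^{T} \gamma^{t}\, w_{0:T}\, r_t\Big]
= \mathbb{E}_{\tau \sim p_{\pi_0}}\Big[\sum_{t=0}^{T} \gamma^{t}\, w_{0:t}\, r_t\Big].
\end{aligned}
$$

The last equality holds because $\mathbb{E}_{a \sim \pi_0(\cdot \mid s)}[\beta(a \mid s)] = 1$ for every state $s$ (provided $\pi_0$ covers the support of $\pi$), so by the tower property the product of the ratios after step $t$ has conditional expectation one given the trajectory up to step $t$.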

Given a set of observed trajectories , , drawn from , the trajectory-wise and step-wise estimators are

Trajectory-wise: | Step-wise: |

where and is a normalization constant of the importance weights: when , the corresponding estimators (called Trajectory-wise IS and Step-wise IS, respectively) provide unbiased estimates of ; when , the corresponding estimators are weighted (or self-normalized) importance sampling (called Trajectory-wise WIS and Step-wise WIS, respectively), which introduce bias but often have lower variance. It has been shown that the Step-wise WIS often performs the best among all these variants [30, 22].
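The four variants described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `is_estimates`, the array layout (one row per trajectory, one column per time step), the normalization of returns by the sum of discounts, and the per-step normalizer used for step-wise WIS are all assumptions made for this sketch.

```python
import numpy as np


def is_estimates(rewards, ratios, gamma=1.0, weighted=False):
    """Return (trajectory-wise, step-wise) importance sampling estimates.

    rewards[i, t] is the reward at step t of trajectory i; ratios[i, t] is
    the per-step policy ratio pi(a|s) / pi0(a|s) at that step.
    """
    rewards = np.asarray(rewards, dtype=float)
    ratios = np.asarray(ratios, dtype=float)
    m, horizon = rewards.shape
    discounts = gamma ** np.arange(horizon)

    cum_w = np.cumprod(ratios, axis=1)  # cumulative weights w_{0:t}
    traj_w = cum_w[:, -1]               # full-trajectory weights w_{0:T}

    # Assumed convention: returns are normalized by the sum of discounts.
    returns = (rewards * discounts).sum(axis=1) / discounts.sum()

    if weighted:
        # Self-normalized (WIS): divide by the sum of the importance weights.
        traj_z = traj_w.sum()
        step_z = cum_w.sum(axis=0)
    else:
        # Plain IS: divide by the number of trajectories.
        traj_z = float(m)
        step_z = np.full(horizon, float(m))

    traj_est = (traj_w * returns).sum() / traj_z
    per_step = (cum_w * rewards).sum(axis=0) / step_z
    step_est = (discounts * per_step).sum() / discounts.sum()
    return traj_est, step_est
```

When every policy ratio equals one (target and behavior policies coincide), all four variants reduce to the plain Monte Carlo average of the normalized returns.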

In comparison, our method can be viewed as a further Rao-Blackwellization of the step-wise estimators. Define

Then we have

(22) |

where we replace in (21) with , based on Rao-Blackwellization by conditioning on . This gives an empirical estimator:

where and or . Comparing this with the trajectory-wise and step-wise estimators, one would expect it to yield smaller variance when the estimation error of is ignored.
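A minimal sketch of an estimator of this form follows, assuming the stationary state density ratio has already been estimated and evaluated at every visited state. The function name `stationary_ratio_estimate`, the argument layout, and the two choices of normalizer (sum of the weights, or number of trajectories times the sum of discounts) are assumptions for illustration rather than the paper's exact definitions.

```python
import numpy as np


def stationary_ratio_estimate(rewards, state_ratios, policy_ratios,
                              gamma=1.0, weighted=True):
    """Reward estimate using per-step weights gamma^t * w(s_t) * beta(a_t|s_t).

    All inputs have shape (m, T + 1): one row per trajectory, one column
    per time step. state_ratios holds the (estimated) stationary density
    ratio at each visited state; policy_ratios holds pi(a|s) / pi0(a|s).
    """
    rewards = np.asarray(rewards, dtype=float)
    state_ratios = np.asarray(state_ratios, dtype=float)
    policy_ratios = np.asarray(policy_ratios, dtype=float)
    m, horizon = rewards.shape
    discounts = gamma ** np.arange(horizon)

    # Each weight depends only on the current step, not on the whole prefix.
    weights = discounts * state_ratios * policy_ratios

    # Assumed normalizers: self-normalized, or the total discount mass.
    z = weights.sum() if weighted else m * discounts.sum()
    return (weights * rewards).sum() / z
```

Because each weight involves a single state and action rather than a product over the trajectory prefix, its magnitude does not compound with the horizon.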

## Appendix B A Motivating Example

Here we provide an example in which is exponential in the trajectory length , yielding high variance for the trajectory-wise and step-wise estimators in long-horizon problems, while the variance of our importance weight, based on the stationary density ratio, stays constant as increases.

The MDP has states: , arranged on a circle (see the figure on the right), where is an odd number. There are two actions, left () and right (). The left action moves the agent from the current state counterclockwise to the next state, and the right action has the opposite (clockwise) effect. The deterministic reward is if taking action and otherwise. In summary, we have for any and that

Suppose we are given two policies. The behavior policy and target policy choose action with probability and , respectively. We focus on the average reward here.

#### Claim #1. Stationary density ratio stays constant as .

First, note that the MDP is ergodic under either policy, as is odd. Since and are symmetric, their stationary distributions are identical, that is, . In fact, both are uniform over . Therefore,

and similarly . Both ratios are independent of the trajectory length, and have zero variance.
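The uniformity of the stationary distribution can be checked numerically. The sketch below builds the transition matrix of a random walk on a ring and solves for its stationary distribution; the helper names, the indexing of states as `0..n-1`, the mapping of "right" to `s + 1`, and the example probabilities are assumptions for illustration, not values taken from the paper.

```python
import numpy as np


def ring_transition_matrix(n, p_right):
    """Transition matrix of a walk on n states arranged on a circle.

    From state s the walk moves to (s + 1) % n with probability p_right
    and to (s - 1) % n otherwise.
    """
    P = np.zeros((n, n))
    for s in range(n):
        P[s, (s + 1) % n] += p_right
        P[s, (s - 1) % n] += 1.0 - p_right
    return P


def stationary_distribution(P):
    """Solve d P = d together with sum(d) = 1 in the least-squares sense."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    d, *_ = np.linalg.lstsq(A, b, rcond=None)
    return d


# Hypothetical action probabilities; any value in (0, 1) gives the same result.
d_behavior = stationary_distribution(ring_transition_matrix(5, 0.3))
d_target = stationary_distribution(ring_transition_matrix(5, 0.7))
```

The transition matrix is doubly stochastic for any action probability, so both distributions come out uniform and the state density ratio equals one at every state.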

#### Claim #2. Variance of trajectory-wise IS weight grows exponentially in .

Proposition. Under the setting above, let be a trajectory drawn from the behavior policy . Then we have