DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections


Ofir Nachum* Yinlam Chow* Bo Dai Lihong Li
Google AI
{ofirnachum,yinlamchow,bodai,lihong}@google.com
*Equal contribution.
Abstract

In many real-world reinforcement learning applications, access to the environment is limited to a fixed dataset, instead of direct (online) interaction with the environment. When using this data for either evaluation or training of a new policy, accurate estimates of discounted stationary distribution ratios — correction terms which quantify the likelihood that the new policy will experience a certain state-action pair normalized by the probability with which the state-action pair appears in the dataset — can improve accuracy and performance. In this work, we propose an algorithm, DualDICE, for estimating these quantities. In contrast to previous approaches, our algorithm is agnostic to knowledge of the behavior policy (or policies) used to generate the dataset. Furthermore, it eschews any direct use of importance weights, thus avoiding potential optimization instabilities endemic of previous methods. In addition to providing theoretical guarantees, we present an empirical study of our algorithm applied to off-policy policy evaluation and find that our algorithm significantly improves accuracy compared to existing techniques.

 


1 Introduction

Reinforcement learning (RL) has recently demonstrated a number of successes in various domains, such as games [31], robotics [1], and conversational systems [15, 24]. These successes have often hinged on the use of simulators to provide large amounts of experience necessary for RL algorithms. While this is reasonable in game environments, where the game is often a simulator itself, and some simple real-world tasks can be simulated to an accurate enough degree, in general one does not have such direct or easy access to the environment. Furthermore, in many real-world domains such as medicine [32], recommendation [25], and education [30], the deployment of a new policy, even just for the sake of performance evaluation, may be expensive and risky. In these applications, access to the environment is usually in the form of off-policy data [46], logged experience collected by potentially multiple and possibly unknown behavior policies.

State-of-the-art methods which consider this more realistic setting — either for policy evaluation or policy improvement — often rely on estimating (discounted) stationary distribution ratios or corrections. For each state and action in the environment, these quantities measure the likelihood that one’s current target policy will experience the state-action pair normalized by the probability with which the state-action pair appears in the off-policy data. Proper estimation of these ratios can improve the accuracy of policy evaluation [27] and the stability of policy learning [16, 18, 28, 47]. In general, these ratios are difficult to compute, let alone estimate, as they rely not only on the probability that the target policy will take the desired action at the relevant state, but also on the probability that the target policy’s interactions with the environment dynamics will lead it to the relevant state.

Several methods to estimate these ratios have been proposed recently [16, 18, 27], all based on the steady-state property of stationary distributions of Markov processes [19]. This property may be expressed locally with respect to state-action-next-state tuples, and is therefore amenable to stochastic optimization algorithms. However, these methods possess several issues when applied in practice: First, these methods require knowledge of the probability distribution used for each sampled action appearing in the off-policy data. In practice, these probabilities are usually not known and difficult to estimate, especially in the case of multiple, non-Markovian behavior policies. Second, the loss functions of these algorithms involve per-step importance ratios (the ratio of action sample probability with respect to the target policy versus the behavior policy). Depending on how far the behavior policy is from the target policy, these quantities may have large variance, and thus have a detrimental effect on stochastic optimization algorithms.

In this work, we propose Dual stationary DIstribution Correction Estimation (DualDICE), a new method for estimating discounted stationary distribution ratios. It is agnostic to the number or type of behavior policies used for collecting the off-policy data. Moreover, the objective function of our algorithm does not involve any per-step importance ratios, and so our solution is less likely to be affected by their high variance. We provide theoretical guarantees on the convergence of our algorithm and evaluate it on a number of off-policy policy evaluation benchmarks. We find that DualDICE can consistently, and often significantly, improve performance compared to previous algorithms for estimating stationary distribution ratios.

2 Background

We consider a Markov Decision Process (MDP) setting [39], in which the environment is specified by a tuple $M = \langle S, A, R, T, \beta \rangle$, consisting of a state space, an action space, a reward function, a transition probability function, and an initial state distribution. A policy $\pi$ interacts with the environment iteratively, starting with an initial state $s_0 \sim \beta$. At step $t = 0, 1, \dots$, the policy produces a distribution $\pi(\cdot \mid s_t)$ over the actions $A$, from which an action $a_t$ is sampled and applied to the environment. The environment stochastically produces a scalar reward $r_t \sim R(s_t, a_t)$ and a next state $s_{t+1} \sim T(s_t, a_t)$. In this work, we consider infinite-horizon environments and the $\gamma$-discounted reward criterion for $\gamma \in [0, 1)$. It is clear that any finite-horizon environment may be interpreted as infinite-horizon by considering an augmented state space with an extra terminal state which continually loops onto itself with zero reward.

2.1 Off-Policy Policy Evaluation

Given a target policy $\pi$, we are interested in estimating its value, defined as the normalized expected per-step reward obtained by following the policy:

$\rho(\pi) := (1-\gamma)\,\mathbb{E}\big[\sum_{t=0}^{\infty}\gamma^{t} r_t \mid s_0\sim\beta,\ \forall t,\ a_t\sim\pi(s_t),\ r_t\sim R(s_t,a_t),\ s_{t+1}\sim T(s_t,a_t)\big].$   (1)

The off-policy policy evaluation (OPE) problem studied here is to estimate $\rho(\pi)$ using a fixed dataset $\mathcal{D}$ of transitions $(s, a, r, s')$ sampled in a certain way. This is a very general scenario: $\mathcal{D}$ can be collected by a single behavior policy (as in most previous work), multiple behavior policies, or an oracle sampler, among others. In the special case where $\mathcal{D}$ contains entire trajectories collected by a known behavior policy $\mu$, one may use importance sampling (IS) to estimate $\rho(\pi)$. Specifically, given a finite-length trajectory $\tau = (s_0, a_0, r_0, \dots, s_{H-1}, a_{H-1}, r_{H-1})$ collected by $\mu$, the IS estimate of $\rho(\pi)$ based on $\tau$ weights the trajectory's discounted return by the product of per-step importance ratios $\prod_{t=0}^{H-1} \pi(a_t \mid s_t)/\mu(a_t \mid s_t)$ [38]. Although many improvements exist [13, 21, 38, 50], importance-weighting the entire trajectory can suffer from exponentially high variance, which is known as “the curse of horizon” [26, 27].
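To make the trajectory-wise IS estimator concrete, here is a minimal sketch; the normalization by $(1-\gamma)/(1-\gamma^{H})$ and the function names are our illustrative choices rather than a prescription from the references. The single product of per-step ratios is what drives the exponential-in-horizon variance.

```python
import numpy as np

def is_estimate(trajectories, pi_prob, mu_prob, gamma):
    """Trajectory-wise importance sampling estimate of the normalized value.

    trajectories: list of [(s, a, r), ...] lists collected under mu;
    pi_prob(s, a), mu_prob(s, a): action probabilities under pi and mu.
    """
    estimates = []
    for traj in trajectories:
        H = len(traj)
        # Single product of per-step ratios -- the source of exponential variance.
        ratio = np.prod([pi_prob(s, a) / mu_prob(s, a) for s, a, _ in traj])
        ret = sum(gamma ** t * r for t, (_, _, r) in enumerate(traj))
        estimates.append(ratio * (1 - gamma) / (1 - gamma ** H) * ret)
    return float(np.mean(estimates))
```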

To avoid exponential dependence on trajectory length, one may weight the states by their long-term occupancy measure. First, observe that the policy value may be re-expressed as,

$\rho(\pi) = \mathbb{E}_{(s,a)\sim d^{\pi}}\big[R(s,a)\big],$

where

$d^{\pi}(s,a) := (1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\,\Pr\big(s_t = s, a_t = a \mid s_0\sim\beta,\ \forall k<t,\ a_k\sim\pi(s_k),\ s_{k+1}\sim T(s_k,a_k)\big)$   (2)

is the normalized discounted stationary distribution over state-actions with respect to $\pi$. One may define the discounted stationary distribution over states analogously, and we slightly abuse notation by denoting it as $d^{\pi}(s)$; note that $d^{\pi}(s,a) = d^{\pi}(s)\,\pi(a\mid s)$. If $\mathcal{D}$ consists of trajectories collected by a behavior policy $\mu$, then the policy value may be estimated as,

$\rho(\pi) = \mathbb{E}_{(s,a)\sim d^{\mu}}\big[w_{\pi/\mu}(s,a)\,R(s,a)\big],$

where $w_{\pi/\mu}(s,a) := d^{\pi}(s,a)/d^{\mu}(s,a)$ is the discounted stationary distribution correction. The key challenge is in estimating these correction terms using data drawn from $d^{\mu}$.
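If the corrections $w_{\pi/\mu}$ were known, the value estimate would reduce to a simple per-transition average; a minimal sketch (the callable `w` is hypothetical):

```python
import numpy as np

def correction_weighted_estimate(transitions, w):
    """Estimate rho(pi) from transitions (s, a, r, s') drawn off-policy,
    given corrections w(s, a) = d^pi(s, a) / d^mu(s, a)."""
    return float(np.mean([w(s, a) * r for s, a, r, _ in transitions]))
```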

2.2 Learning Stationary Distribution Corrections

We provide a brief summary of previous methods for estimating the stationary distribution corrections. The ones that are most relevant to our work are a suite of recent techniques [16, 18, 27], which are all essentially based on the following steady-state property of stationary Markov processes:

$d^{\pi}(s') = (1-\gamma)\,\beta(s') + \gamma\sum_{s,a} T(s'\mid s,a)\,\pi(a\mid s)\,d^{\pi}(s), \quad \forall s' \in S,$   (3)

where we have simplified the identity by restricting to discrete state and action spaces. This identity simply reflects the conservation of flow of the stationary distribution: At each timestep, the flow out of $s'$ (the LHS) must equal the flow into $s'$ (the RHS). Given a behavior policy $\mu$, equation 3 can be equivalently rewritten in terms of the stationary distribution corrections, i.e., for any given $s' \in S$,

$\mathbb{E}_{(s,a,s')\sim d^{\mu}}\big[\mathrm{TD}(s, a, s' \mid w_{\pi/\mu})\big] = 0,$   (4)

where $\mathrm{TD}(s, a, s' \mid w)$ involves the per-step importance ratio $\pi(a\mid s)/\mu(a\mid s)$, provided that $\mu(a\mid s) > 0$ whenever $\pi(a\mid s) > 0$. The quantity TD can be viewed as a temporal difference associated with $w_{\pi/\mu}$. Accordingly, previous works optimize loss functions which minimize this TD error using samples from $d^{\mu}$. We emphasize that although $w_{\pi/\mu}$ is associated with a temporal difference, it does not satisfy a Bellman recurrence in the usual sense [3]. Indeed, note that equation 3 is written “backwards”: The occupancy measure of a state is written as a (discounted) function of previous states, as opposed to vice-versa. This will serve as a key differentiator between our algorithm and these previous methods.

2.3 Off-Policy Estimation with Multiple Unknown Behavior Policies

While the previous algorithms are promising, they have several limitations when applied in practice:

  • The off-policy experience distribution $d^{\mu}$ is with respect to a single, Markovian behavior policy $\mu$, and this policy must be known during optimization. In practice, off-policy data often comes from multiple, unknown behavior policies.

  • Computing the TD error in equation 4 requires the use of per-step importance ratios $\pi(a\mid s)/\mu(a\mid s)$ at every state-action sample $(s,a)$. Depending on how far the behavior policy is from the target policy, these quantities may have high variance, which can have a detrimental effect on the convergence of any stochastic optimization algorithm that is used to estimate $w_{\pi/\mu}$.

The method we derive below will be free of the aforementioned issues, avoiding unnecessary requirements on the form of the off-policy data collection as well as explicit uses of importance ratios. Rather, we consider the general setting where $\mathcal{D}$ consists of transitions $(s, a, r, s')$ sampled in an unknown fashion. Since $\mathcal{D}$ contains rewards and next states, we will often slightly abuse notation and write not only $(s,a)\sim d^{\mathcal{D}}$ but also $(s,a,r)\sim d^{\mathcal{D}}$ and $(s,a,r,s')\sim d^{\mathcal{D}}$, where the notation $d^{\mathcal{D}}$ emphasizes that, unlike $d^{\mu}$ previously, the data distribution is not the result of a single, known behavior policy. The target policy's value can be equivalently written as,

$\rho(\pi) = \mathbb{E}_{(s,a,r)\sim d^{\mathcal{D}}}\big[w_{\pi/\mathcal{D}}(s,a)\,r\big],$   (5)

where the correction terms are given by $w_{\pi/\mathcal{D}}(s,a) := d^{\pi}(s,a)/d^{\mathcal{D}}(s,a)$, and our algorithm will focus on estimating these correction terms. Rather than relying on the assumption that $\mathcal{D}$ is the result of a single, known behavior policy, we instead make the following regularity assumption:

Assumption 1 (Reference distribution property).

For any $(s,a) \in S\times A$, $d^{\pi}(s,a) > 0$ implies $d^{\mathcal{D}}(s,a) > 0$. Furthermore, the correction terms are bounded by some finite constant $C$: $\|w_{\pi/\mathcal{D}}\|_{\infty} \le C$.

3 DualDICE

We now develop our algorithm, DualDICE, for estimating the discounted stationary distribution corrections $w_{\pi/\mathcal{D}}$. In the OPE setting, one does not have explicit knowledge of the distribution $d^{\mathcal{D}}$, but rather only access to samples $(s,a,r,s')\sim d^{\mathcal{D}}$. Similar to the TD methods described above, we also assume access to samples from the initial state distribution $\beta$. We begin by introducing a key result, which we will later derive and use as the crux for our algorithm.

3.1 The Key Idea

Consider optimizing a (bounded) state-action function $\nu: S\times A\to\mathbb{R}$ for the following objective:

$\min_{\nu}\ J(\nu) := \frac{1}{2}\,\mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}\big[(\nu - \mathcal{B}^{\pi}\nu)(s,a)^{2}\big] - (1-\gamma)\,\mathbb{E}_{s_0\sim\beta,\,a_0\sim\pi(s_0)}\big[\nu(s_0,a_0)\big],$   (6)

where we use $\mathcal{B}^{\pi}$ to denote the expected Bellman operator with respect to policy $\pi$ and zero reward: $\mathcal{B}^{\pi}\nu(s,a) := \gamma\,\mathbb{E}_{s'\sim T(s,a),\,a'\sim\pi(s')}\big[\nu(s',a')\big]$. The first term in equation 6 is the expected squared Bellman error with zero reward. This term alone would lead to a trivial solution $\nu \equiv 0$, which can be avoided by the second term that encourages $\nu(s_0, a_0) > 0$. Together, these two terms result in an optimal $\nu^{*}$ that places some non-zero amount of Bellman residual at state-action pairs sampled from $d^{\mathcal{D}}$.

Perhaps surprisingly, as we will show, the Bellman residuals of $\nu^{*}$ are exactly the desired distribution corrections:

$(\nu^{*} - \mathcal{B}^{\pi}\nu^{*})(s,a) = w_{\pi/\mathcal{D}}(s,a).$   (7)

This key result provides the foundation for our algorithm, since it provides us with a simple objective (relying only on samples from $d^{\mathcal{D}}$, $\beta$, and $\pi$) which we may optimize in order to obtain estimates of the distribution corrections. In the text below, we will show how we arrive at this result. We provide one additional step which allows us to efficiently learn a parameterized $\nu$ with respect to equation 6. We then generalize our results to a family of similar algorithms and lastly present theoretical guarantees.

3.2 Derivation

A Technical Observation

We begin our derivation of the algorithm for estimating $w_{\pi/\mathcal{D}}$ by presenting the following simple technical observation: For arbitrary scalars $m \in (0,\infty)$ and $n \in \mathbb{R}$, the optimizer of the convex problem $\min_{x}\ \frac{1}{2}m x^{2} - n x$ is unique and given by $x^{*} = n/m$. Using this observation, and letting $\Omega$ be some bounded subset of $\mathbb{R}$ which contains $[0, C]$, one immediately sees that the optimizer of the following problem,

$\min_{x: S\times A\to\Omega}\ \frac{1}{2}\,\mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}\big[x(s,a)^{2}\big] - \mathbb{E}_{(s,a)\sim d^{\pi}}\big[x(s,a)\big],$   (8)

is given by $x^{*}(s,a) = w_{\pi/\mathcal{D}}(s,a)$ for any $(s,a) \in S\times A$. This result provides us with an objective that shares the same basic form as equation 6. The main distinction is that the second term relies on an expectation over $d^{\pi}$, which we do not have access to.
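Spelling out the step from the scalar observation to equation 8 (restricting to discrete spaces, as above), the objective decouples over state-action pairs:

$\frac{1}{2}\,\mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}\big[x(s,a)^{2}\big] - \mathbb{E}_{(s,a)\sim d^{\pi}}\big[x(s,a)\big] = \sum_{s,a}\Big(\tfrac{1}{2}\,d^{\mathcal{D}}(s,a)\,x(s,a)^{2} - d^{\pi}(s,a)\,x(s,a)\Big),$

so applying the observation pointwise with $m = d^{\mathcal{D}}(s,a)$ and $n = d^{\pi}(s,a)$ gives $x^{*}(s,a) = d^{\pi}(s,a)/d^{\mathcal{D}}(s,a) = w_{\pi/\mathcal{D}}(s,a)$.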

Change of Variables

In order to transform the second expectation in equation 8 to be over the initial state distribution $\beta$, we perform the following change of variables: Let $\nu: S\times A\to\mathbb{R}$ be an arbitrary state-action value function that satisfies,

$\nu(s,a) := x(s,a) + \gamma\,\mathbb{E}_{s'\sim T(s,a),\,a'\sim\pi(s')}\big[\nu(s',a')\big] = x(s,a) + \mathcal{B}^{\pi}\nu(s,a).$   (9)

Since $x$ is bounded and $\gamma < 1$, the variable $\nu$ is well-defined and bounded. By applying this change of variables, the objective function in 8 can be re-written in terms of $\nu$, and this yields our previously presented objective from equation 6. Indeed, define,

$d_t^{\pi}(s,a) := \Pr\big(s_t = s, a_t = a \mid s_0\sim\beta,\ \forall k<t,\ a_k\sim\pi(s_k),\ s_{k+1}\sim T(s_k,a_k)\big)$

to be the state-action visitation probability at step $t$ when following $\pi$. Clearly, $d^{\pi}(s,a) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^{t} d_t^{\pi}(s,a)$. Then,

$\mathbb{E}_{(s,a)\sim d^{\pi}}\big[x(s,a)\big] = (1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\,\mathbb{E}_{d_t^{\pi}}\big[(\nu - \mathcal{B}^{\pi}\nu)(s,a)\big] = (1-\gamma)\,\mathbb{E}_{s_0\sim\beta,\,a_0\sim\pi(s_0)}\big[\nu(s_0,a_0)\big],$

where the last equality follows by telescoping, since $\mathbb{E}_{(s,a)\sim d_t^{\pi}}\big[\mathcal{B}^{\pi}\nu(s,a)\big] = \gamma\,\mathbb{E}_{(s',a')\sim d_{t+1}^{\pi}}\big[\nu(s',a')\big]$. The Bellman residuals of the optimum of this objective give the desired off-policy corrections:

$w_{\pi/\mathcal{D}}(s,a) = x^{*}(s,a) = (\nu^{*} - \mathcal{B}^{\pi}\nu^{*})(s,a).$   (10)

Equation 6 provides a promising approach for estimating the stationary distribution corrections, since the first expectation is over state-action pairs sampled from $d^{\mathcal{D}}$, while the second expectation is over initial states $s_0 \sim \beta$ and actions sampled from $\pi$, both of which we have access to. Therefore, in principle we may solve this problem with respect to a parameterized value function $\nu$, and then use the optimized $\nu^{*}$ to deduce the corrections. In practice, however, the objective in its current form presents two difficulties:

  • The quantity $(\nu - \mathcal{B}^{\pi}\nu)(s,a)^{2}$ involves a conditional expectation inside a square. In general, when environment dynamics are stochastic and the action space may be large or continuous, this quantity may not be readily optimized using standard stochastic techniques. (For example, when the environment is stochastic, its Monte-Carlo sample gradient is generally biased.)

  • Even if one has computed the optimal value $\nu^{*}$, the corrections $w_{\pi/\mathcal{D}} = \nu^{*} - \mathcal{B}^{\pi}\nu^{*}$, due to the same argument as above, may not be easily computed, especially when the environment is stochastic or the action space continuous.

Exploiting Fenchel Duality

We solve both difficulties listed above in one step by exploiting Fenchel duality [42]: Any convex function $f(x)$ may be written as $f(x) = \max_{\zeta}\ x\zeta - f^{*}(\zeta)$, where $f^{*}$ is the Fenchel conjugate of $f$. In the case of $f(x) = \frac{1}{2}x^{2}$, the Fenchel conjugate is given by $f^{*}(\zeta) = \frac{1}{2}\zeta^{2}$. Thus, we may express our objective as,

$\min_{\nu}\ J(\nu) = \mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}\Big[\max_{\zeta}\ (\nu - \mathcal{B}^{\pi}\nu)(s,a)\,\zeta - \tfrac{1}{2}\zeta^{2}\Big] - (1-\gamma)\,\mathbb{E}_{s_0\sim\beta,\,a_0\sim\pi(s_0)}\big[\nu(s_0,a_0)\big].$

By the interchangeability principle [8, 41, 43], we may replace the inner max over the scalar $\zeta$ with a max over functions $\zeta: S\times A\to\mathbb{R}$ and obtain a min-max saddle-point optimization:

$\min_{\nu}\max_{\zeta}\ J(\nu,\zeta) := \mathbb{E}_{(s,a,s')\sim d^{\mathcal{D}},\,a'\sim\pi(s')}\Big[\big(\nu(s,a) - \gamma\,\nu(s',a')\big)\,\zeta(s,a) - \tfrac{1}{2}\zeta(s,a)^{2}\Big] - (1-\gamma)\,\mathbb{E}_{s_0\sim\beta,\,a_0\sim\pi(s_0)}\big[\nu(s_0,a_0)\big].$   (11)

Using the KKT condition of the inner optimization problem (which is a concave quadratic in $\zeta$), for any $\nu$ the optimal $\zeta_{\nu}^{*}(s,a)$ is equal to the Bellman residual, $(\nu - \mathcal{B}^{\pi}\nu)(s,a)$. Therefore, the desired stationary distribution correction can then be found from the saddle-point solution $(\nu^{*}, \zeta^{*})$ of the minimax problem in equation 11 as follows:

$w_{\pi/\mathcal{D}}(s,a) = (\nu^{*} - \mathcal{B}^{\pi}\nu^{*})(s,a) = \zeta^{*}(s,a).$   (12)

Now we finally have an objective which is well-suited for practical computation. First, unbiased estimates of both the objective and its gradients are easy to compute using stochastic samples from $d^{\mathcal{D}}$, $\beta$, and $\pi$, all of which we have access to. Secondly, notice that the min-max objective function in equation 11 is linear in $\nu$ and concave in $\zeta$. Therefore, in certain settings, one can provide guarantees on the convergence of optimization algorithms applied to this objective (see Section 3.4). Thirdly, the optimizer of the objective in equation 11 immediately gives us the desired stationary distribution corrections through the values of $\zeta^{*}$, with no additional computation.
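For instance, a minibatch estimate of the objective in equation 11 can be computed from sampled transitions, target-policy actions at the next states, and sampled initial state-actions; a minimal sketch, with array names that are ours rather than the paper's:

```python
import numpy as np

def saddle_point_loss(nu_sa, nu_next, nu_init, zeta_sa, gamma):
    """Unbiased minibatch estimate of J(nu, zeta) in eq. (11) with f(x) = x^2 / 2.

    nu_sa, zeta_sa: values nu(s, a), zeta(s, a) on transitions (s, a, s') ~ D;
    nu_next: values nu(s', a') with a' ~ pi(s');
    nu_init: values nu(s0, a0) with s0 ~ beta, a0 ~ pi(s0).
    All inputs are 1-D arrays of equal batch size.
    """
    return float(np.mean((nu_sa - gamma * nu_next) * zeta_sa - 0.5 * zeta_sa ** 2)
                 - (1 - gamma) * np.mean(nu_init))
```

With any differentiable parameterization of $\nu$ and $\zeta$, automatic differentiation of this quantity yields unbiased stochastic gradients for descent in $\nu$ and ascent in $\zeta$.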

3.3 Extension to General Convex Functions

Besides a quadratic penalty function, one may extend the above derivations to a more general class of convex penalty functions. Consider a generic convex penalty function $f:\mathbb{R}\to\mathbb{R}$. Recall that $\Omega$ is a bounded subset of $\mathbb{R}$ which contains the interval $[0, C]$ of stationary distribution corrections. If $[0, C]$ is contained in the range of $f'$, then the optimizer of the scalar convex problem $\min_{x\in\Omega}\ m f(x) - n x$, for $n/m \in [0, C]$, satisfies the following KKT condition: $f'(x^{*}) = n/m$. Analogously, the optimizer of,

$\min_{x: S\times A\to\Omega}\ J(x) := \mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}\big[f\big(x(s,a)\big)\big] - \mathbb{E}_{(s,a)\sim d^{\pi}}\big[x(s,a)\big],$   (13)

satisfies the equality $f'\big(x^{*}(s,a)\big) = w_{\pi/\mathcal{D}}(s,a)$.

With the change of variables $\nu(s,a) := x(s,a) + \mathcal{B}^{\pi}\nu(s,a)$ (i.e., $x = \nu - \mathcal{B}^{\pi}\nu$), the above problem becomes,

$\min_{\nu}\ J(\nu) := \mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}\big[f\big((\nu - \mathcal{B}^{\pi}\nu)(s,a)\big)\big] - (1-\gamma)\,\mathbb{E}_{s_0\sim\beta,\,a_0\sim\pi(s_0)}\big[\nu(s_0,a_0)\big].$   (14)

Applying Fenchel duality to $f$ in this objective further leads to the following saddle-point problem:

$\min_{\nu}\max_{\zeta}\ J(\nu,\zeta) := \mathbb{E}_{(s,a,s')\sim d^{\mathcal{D}},\,a'\sim\pi(s')}\Big[\big(\nu(s,a) - \gamma\,\nu(s',a')\big)\,\zeta(s,a) - f^{*}\big(\zeta(s,a)\big)\Big] - (1-\gamma)\,\mathbb{E}_{s_0\sim\beta,\,a_0\sim\pi(s_0)}\big[\nu(s_0,a_0)\big].$   (15)

By the KKT condition of the inner optimization problem, for any $\nu$ the optimizer $\zeta_{\nu}^{*}$ satisfies,

$(\nu - \mathcal{B}^{\pi}\nu)(s,a) = (f^{*})'\big(\zeta_{\nu}^{*}(s,a)\big).$   (16)

Therefore, using the fact that the derivative $f'$ of a convex function is the inverse function of the derivative $(f^{*})'$ of its Fenchel conjugate, our desired stationary distribution corrections are found by computing the saddle-point $(\nu^{*}, \zeta^{*})$ of the above problem:

$w_{\pi/\mathcal{D}}(s,a) = f'\big((\nu^{*} - \mathcal{B}^{\pi}\nu^{*})(s,a)\big) = \zeta^{*}(s,a).$   (17)

Amazingly, despite the generalization beyond the quadratic penalty function, the optimization problem in equation 15 retains all the computational benefits that make this method very practical for learning $w_{\pi/\mathcal{D}}$: All quantities and their gradients may be unbiasedly estimated from stochastic samples; the objective is linear in $\nu$ and concave in $\zeta$, and thus is well-behaved; and the optimizer of this problem immediately provides the desired stationary distribution corrections through the values of $\zeta^{*}$, without any additional computation.

This generalized derivation also provides insight into the initial technical result: It is now clear that the objective in equation 13 is the negative Fenchel dual (variational) form of the Ali-Silvey or $f$-divergence, which has been used in previous work to estimate divergences and data likelihood ratios [33]. Despite their similar formulations, we emphasize that the aforementioned dual form of the $f$-divergence is not immediately applicable to estimation of off-policy corrections in the context of RL, due to the fact that samples from the distribution $d^{\pi}$ are unobserved. Indeed, our derivations hinged on two additional key steps: (1) the change of variables from $x$ to $\nu$; and (2) the second application of duality to introduce $\zeta$. Due to these repeated applications of duality in our derivations, we term our method Dual stationary DIstribution Correction Estimation (DualDICE).
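As one concrete instance of this family (and the one explored empirically in Section 5.3), a power penalty $f(x) = \frac{1}{p}|x|^{p}$ with $p > 1$ has Fenchel conjugate $f^{*}(\zeta) = \frac{1}{q}|\zeta|^{q}$ with $1/p + 1/q = 1$. The small sketch below verifies that $f'$ and $(f^{*})'$ are indeed inverses, which is the property used in equation 17; it is an illustration we provide, not code from the paper.

```python
import numpy as np

def make_penalty(p):
    """Power penalty f(x) = |x|**p / p, its conjugate f*(z) = |z|**q / q
    with 1/p + 1/q = 1, and their derivatives."""
    q = p / (p - 1.0)
    f = lambda x: np.abs(x) ** p / p
    f_conj = lambda z: np.abs(z) ** q / q
    f_prime = lambda x: np.sign(x) * np.abs(x) ** (p - 1)
    f_conj_prime = lambda z: np.sign(z) * np.abs(z) ** (q - 1)
    return f, f_conj, f_prime, f_conj_prime

# Sanity check: f'((f*)'(z)) == z, so zeta* recovers the corrections (eq. 17).
_, _, f_prime, f_conj_prime = make_penalty(1.5)
z = np.linspace(-2.0, 2.0, 9)
assert np.allclose(f_prime(f_conj_prime(z)), z)
```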

3.4 Theoretical Guarantees

In this section, we consider the theoretical properties of DualDICE in the setting where we have a dataset formed by empirical samples $\{(s_i, a_i, r_i, s_i')\}_{i=1}^{N} \sim d^{\mathcal{D}}$, initial states $\{s_{0,i}\}_{i=1}^{N} \sim \beta$, and target actions $a_i' \sim \pi(s_i')$, $a_{0,i} \sim \pi(s_{0,i})$ for $i = 1, \dots, N$. (For the sake of simplicity, we consider the batch learning setting with i.i.d. samples as in [48]; the results can be easily generalized to a single sample path with dependent samples (see Appendix).) We will use the shorthand notation $\hat{\mathbb{E}}_{\mathcal{D}}$ to denote an average over these empirical samples. Although the proposed estimator can adopt a general convex $f$, for simplicity of exposition we restrict to $f(x) = \frac{1}{2}x^{2}$. We consider using an algorithm $\mathrm{OPT}$ (e.g., stochastic gradient descent/ascent) to find the optimal $(\nu^{*}, \zeta^{*})$ of equation 15 within some parameterization families $(\mathcal{F}, \mathcal{H})$, respectively. We denote by $(\hat{\nu}, \hat{\zeta})$ the outputs of $\mathrm{OPT}$. We have the following guarantee on the quality of $(\hat{\nu}, \hat{\zeta})$ with respect to the off-policy policy estimation (OPE) problem.

Theorem 2.

(Informal) Under some mild assumptions, the mean squared error (MSE) associated with using $(\hat{\nu}, \hat{\zeta})$ for OPE (estimating $\rho(\pi)$ via $\hat{\rho}(\pi) := \hat{\mathbb{E}}_{\mathcal{D}}[\hat{\zeta}(s,a)\,r]$) can be bounded as,

$\mathbb{E}\Big[\big(\hat{\rho}(\pi) - \rho(\pi)\big)^{2}\Big] = O\big(\epsilon_{\mathrm{approx}} + \epsilon_{\mathrm{opt}} + \epsilon_{\mathrm{stat}}(N)\big),$   (18)

where the outer expectation is with respect to the randomness of the empirical samples and the algorithm $\mathrm{OPT}$, $\epsilon_{\mathrm{opt}}$ denotes the optimization error, $\epsilon_{\mathrm{approx}}$ denotes the approximation error due to the parameterization families $(\mathcal{F}, \mathcal{H})$, and $\epsilon_{\mathrm{stat}}(N)$ denotes the statistical error.

The sources of estimation error are explicit in Theorem 2. As the number of samples $N$ increases, the statistical error $\epsilon_{\mathrm{stat}}(N)$ approaches zero. Meanwhile, there is an implicit trade-off between $\epsilon_{\mathrm{opt}}$ and $\epsilon_{\mathrm{approx}}$. With flexible function spaces $\mathcal{F}$ and $\mathcal{H}$ (such as the space of neural networks), the approximation error can be further decreased; however, optimization becomes complicated and it is difficult to characterize $\epsilon_{\mathrm{opt}}$. On the other hand, with a linear parameterization of $(\nu, \zeta)$, under some mild conditions, a provably fast rate for $\epsilon_{\mathrm{opt}}$ can be achieved after a finite number of iterations, at the cost of potentially increased approximation error. See the Appendix for the precise theoretical results, proofs, and further discussion.

4 Related Work

Density Ratio Estimation

Density ratio estimation is an important tool for many machine learning and statistics problems. Other than the naive approach (i.e., calculating the density ratio by estimating the densities in the numerator and denominator separately, which may magnify the estimation error), various direct ratio estimators have been proposed [44], including the moment matching approach [17], the probabilistic classification approach [4, 7, 40], and the ratio matching approach [22, 33, 45].

The proposed DualDICE algorithm, as a direct approach for density ratio estimation, bears some similarities to ratio matching [33], which is also derived by exploiting the Fenchel dual representation of $f$-divergences. However, compared to the existing direct estimators, the major difference lies in the requirement of samples from the stationary distribution. Specifically, the existing estimators require access to samples from both $d^{\pi}$ and $d^{\mathcal{D}}$, whereas samples from $d^{\pi}$ are unavailable in the off-policy learning setting. Therefore, DualDICE is uniquely applicable to the more difficult RL setting.

Off-policy Policy Evaluation

The problem of off-policy policy evaluation has been heavily studied in contextual bandits [12, 49, 52] and in the more general RL setting [14, 21, 26, 29, 34, 36, 37, 50, 51]. Several representative approaches can be identified in the literature. The Direct Method (DM) learns a model of the system and then uses it to estimate the performance of the evaluation policy. This approach often has low variance but its bias depends on how well the selected function class can express the environment dynamics. Importance sampling (IS) [38] uses importance weights to correct the mismatch between the distributions of the system trajectory induced by the target and behavior policies. Its variance can be unbounded when there is a big difference between the distributions of the evaluation and behavior policies, and grows exponentially with the horizon of the RL problem. Doubly Robust (DR) is a combination of DM and IS, and can achieve the low variance of DM and no (or low) bias of IS. Other than DM, all the methods described above require knowledge of the policy density ratio, and thus the behavior policy. Our proposed algorithm avoids this necessity.
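As a reference point for the IS and DR baselines mentioned above, one common recursive form of the step-wise doubly robust estimator [21] is sketched below; `q_hat`, `v_hat`, and the policy-probability callables are placeholders we introduce for illustration.

```python
def doubly_robust(traj, pi_prob, mu_prob, q_hat, v_hat, gamma):
    """Step-wise doubly robust estimate of a trajectory's value.

    traj: [(s, a, r), ...] collected under the behavior policy;
    q_hat(s, a), v_hat(s): learned value estimates used as control variates;
    pi_prob / mu_prob: target / behavior action probabilities.
    """
    v_dr = 0.0
    for s, a, r in reversed(traj):
        rho = pi_prob(s, a) / mu_prob(s, a)
        v_dr = v_hat(s) + rho * (r + gamma * v_dr - q_hat(s, a))
    return v_dr
```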

5 Experiments

We evaluate our method applied to off-policy policy evaluation (OPE). We focus on this setting because it is a direct application of stationary distribution correction estimation, without many additional tunable parameters, and it has been previously used as a test-bed for similar techniques [27]. In each experiment, we use a behavior policy to collect some number of trajectories, each for some number of steps. This data is used to estimate the stationary distribution corrections, which are then used to estimate the average step reward with respect to a target policy $\pi$. We focus our comparisons here on a TD-based approach [16] and weighted step-wise IS (as described in [27]), which we and others have generally found to work best relative to common IS variants [30, 38]. See the Appendix for additional results and implementation details.

We begin in a controlled setting with an evaluation agnostic to optimization issues, where we find that, absent these issues, our method is competitive with TD-based approaches (Figure 1). However, as we move to more difficult settings with complex environment dynamics, the performance of TD methods degrades dramatically, while our method is still able to provide accurate estimates (Figure 2). Finally, we provide an analysis of the optimization behavior of our method on a simple control task across different choices of the convex function $f$ (Figure 3). Interestingly, although the choice $f(x) = \frac{1}{2}x^{2}$ is most natural, we find the empirically best performing choice to be $f(x) = \frac{2}{3}|x|^{3/2}$. All results are summarized over 20 random seeds, with the median plotted and error bars at lower and upper percentiles.

5.1 Estimation Without Function Approximation

Figure 1: We perform OPE on the Taxi domain [10]. The plots show the log RMSE of the estimator across different numbers of trajectories and different trajectory lengths (x-axis: trajectory length; y-axis: log RMSE). For this domain, we avoid any potential issues in optimization by solving for the optimum of the objectives exactly using standard matrix operations. Thus, we are able to see that our method and the TD method are competitive with each other.

We begin with a tabular task, the Taxi domain [10]. In this task, we evaluate our method in a manner agnostic to optimization difficulties: The objective 6 is quadratic in $\nu$, and thus may be solved exactly using matrix operations. The Bellman residuals (equation 7) may then be estimated via an empirical average over the transitions appearing in the off-policy data. In a similar manner, TD methods for estimating the correction terms may also be solved using matrix operations [27]. In this controlled setting, we find that, as expected, TD methods can perform well (Figure 1), and our method achieves competitive performance. As we will see in the following results, the good performance of TD methods quickly deteriorates as one moves to more complex settings, while our method is able to maintain good performance, even when using function approximation and stochastic optimization.
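The exact-solve procedure can be reproduced on a small synthetic MDP (our own toy construction, not the Taxi domain itself): build the transition matrix induced by $\pi$, minimize the quadratic objective of equation 6 in closed form, and check that the Bellman residuals of the solution recover $d^{\pi}/d^{\mathcal{D}}$ as in equation 7.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 2, 0.9
n = nS * nA

# Random dynamics T[s, a, s'], initial distribution beta, and two policies.
T = rng.random((nS, nA, nS)); T /= T.sum(-1, keepdims=True)
beta = rng.random(nS); beta /= beta.sum()
pi = rng.random((nS, nA)); pi /= pi.sum(-1, keepdims=True)
mu = rng.random((nS, nA)); mu /= mu.sum(-1, keepdims=True)

def occupancy(policy):
    """Discounted stationary state-action occupancy of a policy."""
    P = np.einsum('xas,sb->xasb', T, policy).reshape(n, n)  # (s,a) -> (s',a')
    b = (beta[:, None] * policy).reshape(n)
    return (1 - gamma) * np.linalg.solve(np.eye(n) - gamma * P.T, b)

d_pi, d_D = occupancy(pi), occupancy(mu)
P_pi = np.einsum('xas,sb->xasb', T, pi).reshape(n, n)
b_pi = (beta[:, None] * pi).reshape(n)

# Closed-form minimizer of eq. (6): (M^T diag(d_D) M) nu = (1 - gamma) beta_pi.
M = np.eye(n) - gamma * P_pi
nu_star = np.linalg.solve(M.T @ np.diag(d_D) @ M, (1 - gamma) * b_pi)

# Bellman residuals of nu* recover the corrections d_pi / d_D (eq. 7).
assert np.allclose(M @ nu_star, d_pi / d_D)
```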

5.2 Control Tasks


Figure 2: We perform OPE on control tasks. Each plot shows the estimated average step reward over training and for different behavior policies (a higher $\alpha$ corresponds to a behavior policy closer to the target policy). We find that in all cases, our method is able to approximate these desired values well, with accuracy improving with a larger $\alpha$. On the other hand, the TD method performs poorly, even more so when the behavior policy is unknown and must be estimated. While on Cartpole it can start to approach the desired value for large $\alpha$, on the more complicated Reacher task (which involves continuous actions) its learning is too unstable to learn anything at all.

We now move on to more difficult control tasks: a discrete-control task, Cartpole, and a continuous-control task, Reacher [6]. In these tasks, observations are continuous, and thus we use neural network function approximators with stochastic optimization. Figure 2 shows the results of our method compared to the TD method. We find that in this setting, DualDICE is able to provide good, stable performance, while the TD approach suffers from high variance, and this issue is exacerbated when we attempt to estimate the behavior policy $\mu$ rather than assume it as given. See the Appendix for additional baseline results.

5.3 Choice of Convex Function

Figure 3: We compare the OPE error (log RMSE) when using different forms of $f$ to estimate stationary distribution ratios with function approximation, which are then applied to OPE on a simple continuous grid task. In this setting, optimization stability is crucial, and this heavily depends on the form of the convex function $f$. We plot the results of using $f(x) = \frac{1}{p}|x|^{p}$ for several choices of $p$. We also show the results of TD and IS methods on this task for comparison. We find that $p = 3/2$ consistently performs the best, often providing significantly better results.

We analyze the choice of the convex function $f$. We consider a simple continuous grid task where an agent may move left, right, up, or down and is rewarded for reaching the bottom right corner of a square room. We plot the estimation errors of using DualDICE for off-policy policy evaluation on this task, comparing against different choices of convex functions of the form $f(x) = \frac{1}{p}|x|^{p}$. Interestingly, although the choice $f(x) = \frac{1}{2}x^{2}$ is most natural, we find the empirically best performing choice to be $f(x) = \frac{2}{3}|x|^{3/2}$. Thus, this is the form of $f$ we used in our experiments for Figure 2.

6 Conclusions

We have presented DualDICE, a method for estimating off-policy stationary distribution corrections. Compared to previous work, our method is agnostic to knowledge of the behavior policy used to collect the off-policy data and avoids the use of importance weights in its losses. These advantages have a profound empirical effect: our method provides significantly better estimates compared to TD methods, especially in settings which require function approximation and stochastic optimization.

Future work includes (1) incorporating the DualDICE algorithm into off-policy training, (2) further understanding the effects of the choice of $f$ on the performance of DualDICE (in terms of approximation error of the distribution corrections), and (3) evaluating DualDICE on real-world off-policy evaluation tasks.

References

  • [1] Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177, 2018.
  • [2] András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129, 2008.
  • [3] Richard Ernest Bellman. Dynamic Programming. Dover Publications, Inc., New York, NY, USA, 2003.
  • [4] Steffen Bickel, Michael Brückner, and Tobias Scheffer. Discriminative learning for differing training and test distributions. In Proceedings of the 24th international conference on Machine learning, pages 81–88. ACM, 2007.
  • [5] Stephane Boucheron, Gabor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2016.
  • [6] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
  • [7] Kuang Fu Cheng, Chih-Kang Chu, et al. Semiparametric density estimation under a two-sample density ratio model. Bernoulli, 10(4):583–604, 2004.
  • [8] Bo Dai, Niao He, Yunpeng Pan, Byron Boots, and Le Song. Learning from conditional distributions via dual embeddings. arXiv preprint arXiv:1607.04579, 2016.
  • [9] Bo Dai, Albert Shaw, Lihong Li, Lin Xiao, Niao He, Zhen Liu, Jianshu Chen, and Le Song. Sbeed: Convergent reinforcement learning with nonlinear function approximation. arXiv preprint arXiv:1712.10285, 2017.
  • [10] Thomas G Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.
  • [11] Simon S Du, Jianshu Chen, Lihong Li, Lin Xiao, and Dengyong Zhou. Stochastic variance reduction methods for policy evaluation. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1049–1058. JMLR. org, 2017.
  • [12] Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. arXiv preprint arXiv:1103.4601, 2011.
  • [13] Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. More robust doubly robust off-policy evaluation. arXiv preprint arXiv:1802.03493, 2018.
  • [14] Raphael Fonteneau, Susan A. Murphy, Louis Wehenkel, and Damien Ernst. Batch mode reinforcement learning based on the synthesis of artificial trajectories. Annals of Operations Research, 208(1):383–416, 2013.
  • [15] Jianfeng Gao, Michel Galley, and Lihong Li. Neural approaches to Conversational AI. Foundations and Trends in Information Retrieval, 13(2–3):127–298, 2019.
  • [16] Carles Gelada and Marc G Bellemare. Off-policy deep reinforcement learning by bootstrapping the covariate shift. AAAI, 2018.
  • [17] Arthur Gretton, Alex J Smola, Jiayuan Huang, Marcel Schmittfull, Karsten M Borgwardt, and Bernhard Schölkopf. Covariate shift by kernel mean matching. In Dataset shift in machine learning, pages 131–160. MIT Press, 2009.
  • [18] Assaf Hallak and Shie Mannor. Consistent on-line off-policy evaluation. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1372–1383. JMLR. org, 2017.
  • [19] W Keith Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.
  • [20] David Haussler. Sphere packing numbers for subsets of the boolean n-cube with bounded vapnik-chervonenkis dimension. Journal of Combinatorial Theory, Series A, 69(2):217–232, 1995.
  • [21] Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, pages 652–661, 2016.
  • [22] Takafumi Kanamori, Shohei Hido, and Masashi Sugiyama. A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10(Jul):1391–1445, 2009.
  • [23] Alessandro Lazaric, Mohammad Ghavamzadeh, and Rémi Munos. Finite-sample analysis of least-squares policy iteration. Journal of Machine Learning Research, 13(Oct):3041–3074, 2012.
  • [24] Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541, 2016.
  • [25] Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the fourth ACM international conference on Web search and data mining, pages 297–306. ACM, 2011.
  • [26] Lihong Li, Rémi Munos, and Csaba Szepesvàri. Toward minimax off-policy value estimation. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, pages 608–616, 2015.
  • [27] Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, pages 5356–5366, 2018.
  • [28] Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Off-policy policy gradient with state distribution correction. In Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence, 2019. To appear.
  • [29] A. Mahmood, H. van Hasselt, and R. Sutton. Weighted importance sampling for off-policy learning with linear function approximation. In Proceedings of the 27th International Conference on Neural Information Processing Systems, 2014.
  • [30] Travis Mandel, Yun-En Liu, Sergey Levine, Emma Brunskill, and Zoran Popovic. Offline policy evaluation across representations with applications to educational games. In Proceedings of the 2014 international conference on Autonomous agents and multi-agent systems, pages 1077–1084. International Foundation for Autonomous Agents and Multiagent Systems, 2014.
  • [31] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • [32] Susan A Murphy, Mark J van der Laan, James M Robins, and Conduct Problems Prevention Research Group. Marginal mean models for dynamic regimes. Journal of the American Statistical Association, 96(456):1410–1423, 2001.
  • [33] XuanLong Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.
  • [34] C. Paduraru. Off-policy Evaluation in Markov Decision Processes. PhD thesis, McGill University, 2013.
  • [35] D Pollard. Convergence of Stochastic Processes. David Pollard, 1984.
  • [36] D. Precup, R. Sutton, and S. Dasgupta. Off-policy temporal difference learning with function approximation. In Proceedings of the 18th International Conference on Machine Learning, pages 417–424, 2001.
  • [37] D. Precup, R. Sutton, and S. Singh. Eligibility traces for off-policy policy evaluation. In Proceedings of the 17th International Conference on Machine Learning, pages 759–766, 2000.
  • [38] Doina Precup. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, page 80, 2000.
  • [39] Martin L Puterman. Markov decision processes: Discrete stochastic dynamic programming. 1994.
  • [40] Jing Qin. Inferences for case-control and semiparametric two-sample density ratio models. Biometrika, 85(3):619–630, 1998.
  • [41] R Tyrrell Rockafellar and Roger J-B Wets. Variational analysis, volume 317. Springer Science & Business Media, 2009.
  • [42] Ralph Tyrell Rockafellar. Convex analysis. Princeton university press, 2015.
  • [43] Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczyński. Lectures on stochastic programming: modeling and theory. SIAM, 2009.
  • [44] Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density ratio estimation in machine learning. Cambridge University Press, 2012.
  • [45] Masashi Sugiyama, Taiji Suzuki, Shinichi Nakajima, Hisashi Kashima, Paul von Bünau, and Motoaki Kawanabe. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60(4):699–746, 2008.
  • [46] Richard S Sutton and Andrew G Barto. Introduction to reinforcement learning, volume 135. MIT Press, 1998.
  • [47] Richard S Sutton, A Rupam Mahmood, and Martha White. An emphatic approach to the problem of off-policy temporal-difference learning. The Journal of Machine Learning Research, 17(1):2603–2631, 2016.
  • [48] Richard S Sutton, Csaba Szepesvári, Alborz Geramifard, and Michael Bowling. Dyna-style planning with linear function approximation and prioritized sweeping. In Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence, pages 528–536. AUAI Press, 2008.
  • [49] A. Swaminathan, A. Krishnamurthy, A. Agarwal, M. Dudík, J. Langford, D. Jose, and I. Zitouni. Off-policy evaluation for slate recommendation. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 3635–3645, 2017.
  • [50] P. Thomas and E. Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, pages 2139–2148, 2016.
  • [51] P. Thomas, G. Theocharous, and M. Ghavamzadeh. High confidence off-policy evaluation. In Proceedings of the 29th Conference on Artificial Intelligence, 2015.
  • [52] Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudik. Optimal and adaptive off-policy evaluation in contextual bandits. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3589–3597. JMLR. org, 2017.
  • [53] Bin Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, pages 94–116, 1994.

Appendix A Pseudocode

  Inputs: Convex function $f$ and its Fenchel conjugate $f^{*}$, off-policy data $\mathcal{D} = \{(s^{(i)}, a^{(i)}, r^{(i)}, s'^{(i)})\}$, sampled initial states $\{s_0^{(i)}\}$, target policy $\pi$, networks $\nu_{\theta}, \zeta_{\phi}$, learning rates $\eta_{\nu}, \eta_{\zeta}$, number of iterations $T$, batch size $B$.
  for $t = 1, \dots, T$ do
     Sample batch $\{(s^{(i)}, a^{(i)}, s'^{(i)})\}_{i=1}^{B}$ from $\mathcal{D}$.
     Sample batch $\{s_0^{(i)}\}_{i=1}^{B}$ from the sampled initial states.
     Sample actions $a'^{(i)} \sim \pi(s'^{(i)})$, for $i = 1, \dots, B$.
     Sample actions $a_0^{(i)} \sim \pi(s_0^{(i)})$, for $i = 1, \dots, B$.
     Compute empirical loss $\hat{J} = \frac{1}{B}\sum_{i=1}^{B}\big[\big(\nu_{\theta}(s^{(i)}, a^{(i)}) - \gamma\,\nu_{\theta}(s'^{(i)}, a'^{(i)})\big)\,\zeta_{\phi}(s^{(i)}, a^{(i)}) - f^{*}\big(\zeta_{\phi}(s^{(i)}, a^{(i)})\big) - (1-\gamma)\,\nu_{\theta}(s_0^{(i)}, a_0^{(i)})\big]$.
     Update $\theta \leftarrow \theta - \eta_{\nu}\nabla_{\theta}\hat{J}$.
     Update $\phi \leftarrow \phi + \eta_{\zeta}\nabla_{\phi}\hat{J}$.
  end for
  Return $\hat{w}_{\pi/\mathcal{D}} := \zeta_{\phi}$.
Algorithm 1 DualDICE
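For reference, below is a minimal, self-contained sketch of Algorithm 1 for the quadratic penalty $f(x) = \frac{1}{2}x^{2}$ with linearly parameterized $\nu$ and $\zeta$ and plain gradient descent-ascent; the feature map, step sizes, and all names are illustrative assumptions rather than the configuration used in the experiments.

```python
import numpy as np

def dualdice(data, init_states, sample_pi, featurize, gamma,
             lr_nu=0.01, lr_zeta=0.01, iters=5000, batch=64, seed=0):
    """Gradient descent-ascent on eq. (15) with f(x) = x^2 / 2.

    data: list of (s, a, r, s') transitions; init_states: list of s0;
    sample_pi(s): samples an action from the target policy pi;
    featurize(s, a): returns a 1-D numpy feature vector for (s, a).
    Returns a callable estimating w_{pi/D}(s, a) via the learned zeta.
    """
    rng = np.random.default_rng(seed)
    dim = featurize(data[0][0], data[0][1]).shape[0]
    theta = np.zeros(dim)  # parameters of nu
    w = np.zeros(dim)      # parameters of zeta
    for _ in range(iters):
        idx = rng.integers(len(data), size=batch)
        idx0 = rng.integers(len(init_states), size=batch)
        phi = np.stack([featurize(data[i][0], data[i][1]) for i in idx])
        phi_next = np.stack([featurize(data[i][3], sample_pi(data[i][3])) for i in idx])
        phi_init = np.stack([featurize(init_states[i], sample_pi(init_states[i])) for i in idx0])
        zeta = phi @ w
        bellman_res = phi @ theta - gamma * (phi_next @ theta)
        # Gradients of the empirical saddle-point loss: descend in nu, ascend in zeta.
        grad_theta = ((phi - gamma * phi_next) * zeta[:, None]).mean(axis=0) \
            - (1 - gamma) * phi_init.mean(axis=0)
        grad_w = (phi * (bellman_res - zeta)[:, None]).mean(axis=0)
        theta -= lr_nu * grad_theta
        w += lr_zeta * grad_w
    return lambda s, a: float(featurize(s, a) @ w)
```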

Appendix B Additional Results


Figure 4: We perform OPE on control tasks using our method compared to a number of additional baselines: doubly-robust (DR), in which one learns a value function in order to reduce the variance of an IS estimate of the evaluation; the direct method (DM), in which one learns a model of the dynamics and reward of the environment and performs Monte Carlo rollouts using the model in order to estimate the value of the target policy; and a value-based baseline, in which one learns $Q$-values via Bellman error minimization over the off-policy data and uses the initial $Q$-values as estimates of the policy value (these estimates fall below the plotted range for Reacher).

Appendix C Experimental Details

C.1 Taxi

For the Taxi domain, we follow the same protocol as used in [27]. In this tabular, exact-solve setting, the TD methods [16] are equivalent to their kernel-based TD method. We fix the discount factor $\gamma$. The behavior and target policies are also taken from [27] (referred to in their work as the behavior policy for a specific $\alpha$).

In this setting, we solve for the optimal empirical $\nu^{*}$ exactly using matrix operations. Since [27] perform a similar exact solve for state-dependent variables $w(s)$, for better comparison we also perform our exact solve with respect to state-dependent variables $\nu(s)$. Specifically, one may follow the same derivations for DualDICE with respect to learning $\nu(s)$. The final objective will require knowledge of the importance weights $\pi(a\mid s)/\mu(a\mid s)$.

C.2 Control Tasks

We use the Cartpole and Reacher tasks as given by OpenAI Gym [6]. In these tasks we use COP-TD [16] for the TD method ([27] requires a proper kernel, which is not readily available for these tasks). When assuming an unknown behavior policy $\mu$, we learn a neural network policy $\hat{\mu}$ using behavior cloning, and use its probabilities for computing the importance weights $\pi(a\mid s)/\hat{\mu}(a\mid s)$. All neural networks are feed-forward with two hidden layers.

We modify the Cartpole task to be infinite horizon: We use the same dynamics as in the original task but change the reward to be a negative constant if the original task returns a termination (when the pole falls below some threshold) and a positive constant otherwise. We train a policy on this task until convergence. We then define the target policy as a weighted combination of this pre-trained policy and a uniformly random policy. The behavior policy for a specific $\alpha$ is taken to be a weighted combination of the pre-trained policy (weight $\alpha$) and a uniformly random policy (weight $1-\alpha$). We generate an off-policy dataset by running the behavior policy for a number of episodes, each of a fixed number of steps. We train each stationary distribution correction estimation method using the Adam optimizer with minibatches, with learning rates chosen using a hyperparameter search.

For the Reacher task, we train a deterministic policy until convergence. We define the target policy as a Gaussian with mean given by the pre-trained policy and a fixed standard deviation. The behavior policy for a specific $\alpha$ is taken to be a Gaussian with mean given by the pre-trained policy and a standard deviation that depends on $\alpha$. We generate an off-policy dataset by running the behavior policy for a number of episodes, each of a fixed number of steps. We train each stationary distribution correction estimation method using the Adam optimizer with minibatches, with learning rates chosen using a hyperparameter search.

C.3 Continuous Grid

For this task, we create a square grid which the agent can traverse by moving left, right, up, or down. The observations are the coordinates of the square the agent is on. The reward at each step is based on the agent's proximity to the bottom-right corner of the room. The target policy is taken to be the optimal policy for this task plus some weight on uniform exploration. The behavior policy is taken to be the optimal policy plus some weight on uniform exploration. We train using the Adam optimizer with minibatches, with learning rates chosen separately for $\nu$ and $\zeta$.

Appendix D Proofs

We provide the proof for Theorem 2. We first decompose the error in Section D.1. Then, we analyze the statistical error and the optimization error in Sections D.2 and D.4, respectively. The total error is discussed in Section D.3.

Although the proposed estimator can use any general convex function $f$, as a first step towards a more complete theoretical understanding, we consider the special case of $f(x) = \frac{1}{2}x^{2}$. Clearly, $f$ is now 1-strongly convex. Under Assumption 1, we need only consider arguments in a bounded set $\Omega$ containing $[0, C]$, which implies that $f$ is Lipschitz continuous on $\Omega$. Similarly, $f^{*}$ ($= f$ in this case) is Lipschitz continuous on $\Omega$. The following assumption will be needed.

Assumption 3 (MDP regularity).

We assume the observed reward is uniformly bounded, i.e., $|r| \le R_{\max}$ for some constant $R_{\max} > 0$. It follows that the reward's mean and variance are both bounded (by $R_{\max}$ and $R_{\max}^{2}$, respectively).

For convenience, the objective function of DualDICE is repeated here:

$J(\nu,\zeta) = \mathbb{E}_{(s,a,s')\sim d^{\mathcal{D}},\,a'\sim\pi(s')}\Big[\big(\nu(s,a) - \gamma\,\nu(s',a')\big)\,\zeta(s,a) - \tfrac{1}{2}\zeta(s,a)^{2}\Big] - (1-\gamma)\,\mathbb{E}_{s_0\sim\beta,\,a_0\sim\pi(s_0)}\big[\nu(s_0,a_0)\big].$

We will also make use of the objective in the form prior to the introduction of $\zeta$, which we denote as $J(\nu)$:

$J(\nu) = \frac{1}{2}\,\mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}\big[(\nu - \mathcal{B}^{\pi}\nu)(s,a)^{2}\big] - (1-\gamma)\,\mathbb{E}_{s_0\sim\beta,\,a_0\sim\pi(s_0)}\big[\nu(s_0,a_0)\big].$

Let $\hat{J}$ denote the empirical surrogate of $J$, with optimal solution denoted $(\hat{\nu}^{*}, \hat{\zeta}^{*})$. We denote the primal objectives by $J(\nu) := \max_{\zeta} J(\nu,\zeta)$ and $\hat{J}(\nu) := \max_{\zeta} \hat{J}(\nu,\zeta)$, and the dual objectives by $J(\zeta) := \min_{\nu} J(\nu,\zeta)$ and $\hat{J}(\zeta) := \min_{\nu} \hat{J}(\nu,\zeta)$. We apply some optimization algorithm $\mathrm{OPT}$ for optimizing $\hat{J}(\nu,\zeta)$ with samples $\{(s_i, a_i, r_i, s_i')\}_{i=1}^{N}$, $\{s_{0,i}\}_{i=1}^{N}$, and target actions $a_i' \sim \pi(s_i')$, $a_{0,i} \sim \pi(s_{0,i})$ for $i = 1, \dots, N$. We denote the outputs of $\mathrm{OPT}$ by $(\hat{\nu}, \hat{\zeta})$.

d.1 Error Decomposition

Let

Observe that

We begin by considering the estimation error induced by using the empirical Bellman residuals $(\hat{\nu} - \hat{\mathcal{B}}^{\pi}\hat{\nu})(s,a)$ as estimates of $w_{\pi/\mathcal{D}}(s,a)$, where $\hat{\mathcal{B}}^{\pi}$ denotes the empirical Bellman backup with respect to samples from $\mathcal{D}$. We will subsequently reconcile this with the true implementation of DualDICE, which uses $\hat{\zeta}(s,a)$ as estimates of $w_{\pi/\mathcal{D}}(s,a)$.

The mean squared error of the policy value estimate when using these estimates in place of $w_{\pi/\mathcal{D}}$ can be decomposed into three terms: one induced by the randomness of the observed rewards, one induced by the optimization error, and a remaining term.

The first term is induced by the randomness in the observed rewards, and will be discussed in Section D.2.

The second term is the error induced by the optimization algorithm $\mathrm{OPT}$.

For the last term , we have