DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections
Abstract
In many real-world reinforcement learning applications, access to the environment is limited to a fixed dataset, instead of direct (online) interaction with the environment. When using this data for either evaluation or training of a new policy, accurate estimates of discounted stationary distribution ratios (correction terms which quantify the likelihood that the new policy will experience a certain state-action pair, normalized by the probability with which the state-action pair appears in the dataset) can improve accuracy and performance. In this work, we propose an algorithm, DualDICE, for estimating these quantities. In contrast to previous approaches, our algorithm is agnostic to knowledge of the behavior policy (or policies) used to generate the dataset. Furthermore, it eschews any direct use of importance weights, thus avoiding potential optimization instabilities endemic to previous methods. In addition to providing theoretical guarantees, we present an empirical study of our algorithm applied to off-policy policy evaluation and find that our algorithm significantly improves accuracy compared to existing techniques.
Ofir Nachum†, Yinlam Chow, Bo Dai, Lihong Li
Google AI
{ofirnachum,yinlamchow,bodai,lihong}@google.com
†Equal contribution.
1 Introduction
Reinforcement learning (RL) has recently demonstrated a number of successes in various domains, such as games [31], robotics [1], and conversational systems [15, 24]. These successes have often hinged on the use of simulators to provide large amounts of experience necessary for RL algorithms. While this is reasonable in game environments, where the game is often a simulator itself, and some simple real-world tasks can be simulated to an accurate enough degree, in general one does not have such direct or easy access to the environment. Furthermore, in many real-world domains such as medicine [32], recommendation [25], and education [30], the deployment of a new policy, even just for the sake of performance evaluation, may be expensive and risky. In these applications, access to the environment is usually in the form of off-policy data [46], logged experience collected by potentially multiple and possibly unknown behavior policies.
State-of-the-art methods which consider this more realistic setting, either for policy evaluation or policy improvement, often rely on estimating (discounted) stationary distribution ratios or corrections. For each state and action in the environment, these quantities measure the likelihood that one’s current target policy will experience the state-action pair, normalized by the probability with which the state-action pair appears in the off-policy data. Proper estimation of these ratios can improve the accuracy of policy evaluation [27] and the stability of policy learning [16, 18, 28, 47]. In general, these ratios are difficult to compute, let alone estimate, as they rely not only on the probability that the target policy will take the desired action at the relevant state, but also on the probability that the target policy’s interactions with the environment dynamics will lead it to the relevant state.
Several methods to estimate these ratios have been proposed recently [16, 18, 27], all based on the steady-state property of stationary distributions of Markov processes [19]. This property may be expressed locally with respect to state-action-next-state tuples, and is therefore amenable to stochastic optimization algorithms. However, these methods have several issues when applied in practice. First, they require knowledge of the probability distribution used for each sampled action appearing in the off-policy data. In practice, these probabilities are usually not known and are difficult to estimate, especially in the case of multiple, non-Markovian behavior policies. Second, the loss functions of these algorithms involve per-step importance ratios (the ratio of action sample probability with respect to the target policy versus the behavior policy). Depending on how far the behavior policy is from the target policy, these quantities may have large variance, and thus have a detrimental effect on stochastic optimization algorithms.
In this work, we propose Dual stationary DIstribution Correction Estimation (DualDICE), a new method for estimating discounted stationary distribution ratios. It is agnostic to the number or type of behavior policies used for collecting the off-policy data. Moreover, the objective function of our algorithm does not involve any per-step importance ratios, and so our solution is less likely to be affected by their high variance. We provide theoretical guarantees on the convergence of our algorithm and evaluate it on a number of off-policy policy evaluation benchmarks. We find that DualDICE can consistently, and often significantly, improve performance compared to previous algorithms for estimating stationary distribution ratios.
2 Background
We consider a Markov Decision Process (MDP) setting [39], in which the environment is specified by a tuple $\mathcal{M} = \langle S, A, R, T, \beta \rangle$, consisting of a state space, an action space, a reward function, a transition probability function, and an initial state distribution. A policy $\pi$ interacts with the environment iteratively, starting with an initial state $s_0 \sim \beta$. At step $t = 0, 1, \ldots$, the policy produces a distribution $\pi(\cdot|s_t)$ over the actions $A$, from which an action $a_t$ is sampled and applied to the environment. The environment stochastically produces a scalar reward $r_t \sim R(s_t, a_t)$ and a next state $s_{t+1} \sim T(s_t, a_t)$. In this work, we consider infinite-horizon environments and the $\gamma$-discounted reward criterion for $\gamma \in [0, 1)$. It is clear that any finite-horizon environment may be interpreted as infinite-horizon by considering an augmented state space with an extra terminal state which continually loops onto itself with zero reward.
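2.1 Off-Policy Policy Evaluation
Given a target policy $\pi$, we are interested in estimating its value, defined as the normalized expected per-step reward obtained by following the policy:
$$\rho(\pi) := (1-\gamma)\cdot\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^t r_t \;\middle|\; s_0\sim\beta,\ \forall t,\ a_t\sim\pi(s_t),\ r_t\sim R(s_t,a_t),\ s_{t+1}\sim T(s_t,a_t)\right]. \tag{1}$$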
The off-policy policy evaluation (OPE) problem studied here is to estimate $\rho(\pi)$ using a fixed set $D$ of transitions $(s, a, r, s')$ sampled in a certain way. This is a very general scenario: $D$ can be collected by a single behavior policy (as in most previous work), multiple behavior policies, or an oracle sampler, among others. In the special case where $D$ contains entire trajectories collected by a known behavior policy $\pi_b$, one may use importance sampling (IS) to estimate $\rho(\pi)$. Specifically, given a finite-length trajectory $\tau = (s_0, a_0, r_0, \ldots, s_H)$ collected by $\pi_b$, the IS estimate of $\rho$ based on $\tau$ is given by [38]:
$$\hat\rho(\pi) = (1-\gamma)\left(\prod_{t=0}^{H-1}\frac{\pi(a_t|s_t)}{\pi_b(a_t|s_t)}\right)\left(\sum_{t=0}^{H-1}\gamma^t r_t\right).$$
Although many improvements exist [13, 21, 38, 50], importance-weighting the entire trajectory can suffer from exponentially high variance, which is known as “the curse of horizon” [26, 27].
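To make the trajectory-wise IS estimate concrete, the following is a minimal sketch, assuming the standard form in which the product of per-step probability ratios multiplies the normalized discounted return. The function name and argument layout are illustrative choices, not taken from the paper.

```python
import numpy as np

def trajectory_is_estimate(target_probs, behavior_probs, rewards, gamma):
    """Trajectory-wise importance-sampling estimate of the normalized value.

    target_probs[t]   = pi(a_t | s_t)    under the target policy
    behavior_probs[t] = pi_b(a_t | s_t)  under the (known) behavior policy
    rewards[t]        = r_t
    All names here are hypothetical, chosen for illustration.
    """
    target_probs = np.asarray(target_probs, dtype=float)
    behavior_probs = np.asarray(behavior_probs, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    # One weight for the whole trajectory: the product of per-step ratios.
    weight = np.prod(target_probs / behavior_probs)
    discounts = gamma ** np.arange(len(rewards))
    return (1.0 - gamma) * weight * np.sum(discounts * rewards)

# When the two policies agree, the weight is 1 and the estimate is the
# normalized discounted return of the trajectory.
on_policy = trajectory_is_estimate([0.5, 0.5], [0.5, 0.5], [1.0, 1.0], gamma=0.5)

# A per-step ratio of 2 over three steps already gives a weight of 2**3 = 8,
# which illustrates how the weight can grow exponentially with the horizon.
off_policy = trajectory_is_estimate([1.0, 1.0, 1.0], [0.5, 0.5, 0.5],
                                    [1.0, 0.0, 0.0], gamma=0.9)
```

Because the weight is a product over all steps, its variance can grow quickly with trajectory length, which is the motivation for weighting state-action pairs by occupancy ratios instead.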
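To avoid exponential dependence on trajectory length, one may weight the states by their long-term occupancy measure. First, observe that the policy value may be re-expressed as,
$$\rho(\pi) = \mathbb{E}_{(s,a)\sim d^\pi,\ r\sim R(s,a)}[r],$$
where
$$d^\pi(s,a) := (1-\gamma)\sum_{t=0}^{\infty}\gamma^t \Pr\left(s_t = s, a_t = a \;\middle|\; s_0\sim\beta,\ \forall t,\ a_t\sim\pi(s_t),\ s_{t+1}\sim T(s_t,a_t)\right) \tag{2}$$
is the normalized discounted stationary distribution over state-actions with respect to $\pi$. One may define the discounted stationary distribution over states analogously, and we slightly abuse notation by denoting it as $d^\pi(s)$; note that $d^\pi(s,a) = d^\pi(s)\,\pi(a|s)$. If $D$ consists of trajectories collected by a behavior policy $\pi_b$, then the policy value may be estimated as,
$$\rho(\pi) = \mathbb{E}_{(s,a)\sim d^{\pi_b},\ r\sim R(s,a)}\left[w_{\pi/\pi_b}(s,a)\cdot r\right],$$
where $w_{\pi/\pi_b}(s,a) = d^\pi(s,a) / d^{\pi_b}(s,a)$ is the discounted stationary distribution correction. The key challenge is in estimating these correction terms using data drawn from $d^{\pi_b}$.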
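For a small tabular MDP the discounted occupancy can be computed exactly, which gives a way to sanity-check the re-weighting identity. The sketch below assumes a randomly generated MDP with deterministic rewards `R[s, a]`; the helper `discounted_occupancy` and all variable names are hypothetical and chosen only for illustration.

```python
import numpy as np

def discounted_occupancy(T, policy, beta, gamma):
    """Exact normalized discounted occupancy d(s, a) of a tabular MDP.

    T[s, a, s'] are transition probabilities, policy[s, a] are action
    probabilities, and beta[s] is the initial state distribution.
    """
    n_s, n_a = policy.shape
    n = n_s * n_a
    # P[(s,a), (s',a')] = T(s'|s,a) * policy(a'|s')
    P = (T[:, :, :, None] * policy[None, None, :, :]).reshape(n, n)
    mu0 = (beta[:, None] * policy).reshape(n)
    # d = (1 - gamma) * sum_t gamma^t (P^T)^t mu0 = (1 - gamma) (I - gamma P^T)^{-1} mu0
    d = (1.0 - gamma) * np.linalg.solve(np.eye(n) - gamma * P.T, mu0)
    return d.reshape(n_s, n_a)

rng = np.random.default_rng(0)
n_s, n_a, gamma = 4, 2, 0.9
T = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # shape (S, A, S)
R = rng.uniform(size=(n_s, n_a))                   # expected reward per (s, a)
beta = rng.dirichlet(np.ones(n_s))
pi = rng.dirichlet(np.ones(n_a), size=n_s)         # target policy
pi_b = rng.dirichlet(np.ones(n_a), size=n_s)       # behavior policy

d_pi = discounted_occupancy(T, pi, beta, gamma)
d_b = discounted_occupancy(T, pi_b, beta, gamma)
w = d_pi / d_b                                     # stationary distribution corrections

rho_direct = np.sum(d_pi * R)                      # E_{d^pi}[r]
rho_corrected = np.sum(d_b * w * R)                # E_{d^{pi_b}}[w * r]
```

In this exact setting the two quantities coincide by construction; the practical difficulty is that `w` must be estimated from samples of the behavior data alone.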
2.2 Learning Stationary Distribution Corrections
We provide a brief summary of previous methods for estimating the stationary distribution corrections. The ones that are most relevant to our work are a suite of recent techniques [16, 18, 27], which are all essentially based on the following steady-state property of stationary Markov processes:
$$d^\pi(s') = (1-\gamma)\,\beta(s') + \gamma\sum_{s\in S}\sum_{a\in A} d^\pi(s)\,\pi(a|s)\,T(s'|s,a), \quad \forall s'\in S, \tag{3}$$
where we have simplified the identity by restricting to discrete state and action spaces. This identity simply reflects the conservation of flow of the stationary distribution: At each timestep, the flow out of $s'$ (the LHS) must equal the flow into $s'$ (the RHS). Given a behavior policy $\pi_b$, equation 3 can be equivalently rewritten in terms of the stationary distribution corrections, i.e., for any given $s' \in S$,
(4) 
where
provided that whenever . The quantity TD can be viewed as a temporal difference associated with $w_{\pi/\pi_b}$. Accordingly, previous works optimize loss functions which minimize this TD error using samples from $d^{\pi_b}$. We emphasize that although $w_{\pi/\pi_b}$ is associated with a temporal difference, it does not satisfy a Bellman recurrence in the usual sense [3]. Indeed, note that equation 3 is written “backwards”: The occupancy measure of a state is written as a (discounted) function of previous states, as opposed to vice versa. This will serve as a key differentiator between our algorithm and these previous methods.
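The flow-conservation reading of the steady-state property can be checked numerically on a small tabular MDP. The sketch below assumes the discounted form of the identity, in which the occupancy of a state equals $(1-\gamma)$ times its initial probability plus $\gamma$ times the occupancy flowing in from predecessor states. The MDP is randomly generated and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_s, n_a, gamma = 5, 3, 0.95
T = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))  # T[s, a, s']
pi = rng.dirichlet(np.ones(n_a), size=n_s)        # pi[s, a]
beta = rng.dirichlet(np.ones(n_s))                # beta[s]

# State-to-state kernel under pi: P_pi[s, s'] = sum_a pi(a|s) T(s'|s,a).
P_pi = np.einsum("sa,sat->st", pi, T)

# Discounted state occupancy from its definition, as a truncated series:
# d(s') = (1 - gamma) * sum_t gamma^t Pr(s_t = s').
d_state = np.zeros(n_s)
mu_t = beta.copy()
for t in range(2000):
    d_state += (1.0 - gamma) * gamma**t * mu_t
    mu_t = mu_t @ P_pi

# Flow out of each state (LHS) versus flow into it (RHS).
flow_out = d_state
flow_in = (1.0 - gamma) * beta + gamma * d_state @ P_pi
max_violation = np.max(np.abs(flow_out - flow_in))
```

Note that the right-hand side expresses the occupancy of a state in terms of its predecessors, which is the "backwards" direction discussed in the text.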
2.3 Off-Policy Estimation with Multiple Unknown Behavior Policies
While the previous algorithms are promising, they have several limitations when applied in practice:

The off-policy experience distribution $d^{\pi_b}$ is with respect to a single, Markovian behavior policy $\pi_b$, and this policy must be known during optimization. In practice, off-policy data often comes from multiple, unknown behavior policies.

Computing the TD error in equation 4 requires the use of per-step importance ratios $\pi(a_t|s_t)/\pi_b(a_t|s_t)$ at every state-action sample $(s_t, a_t)$. Depending on how far the behavior policy is from the target policy, these quantities may have high variance, which can have a detrimental effect on the convergence of any stochastic optimization algorithm that is used to estimate $w_{\pi/\pi_b}$.
The method we derive below will be free of the aforementioned issues, avoiding unnecessary requirements on the form of the off-policy data collection as well as explicit uses of importance ratios. Rather, we consider the general setting where $D = \{(s, a, r, s')\}$ consists of transitions sampled in an unknown fashion. Since $D$ contains rewards and next states, we will often slightly abuse notation and write not only $(s,a)\sim d^D$ but also $(s,a,r)\sim d^D$ and $(s,a,s')\sim d^D$, where the notation $d^D$ emphasizes that, unlike previously, $D$ is not the result of a single, known behavior policy. The target policy’s value can be equivalently written as,
$$\rho(\pi) = \mathbb{E}_{(s,a,r)\sim d^D}\left[w_{\pi/D}(s,a)\cdot r\right], \tag{5}$$
where the correction terms are given by $w_{\pi/D}(s,a) := d^\pi(s,a)/d^D(s,a)$, and our algorithm will focus on estimating these correction terms. Rather than relying on the assumption that $D$ is the result of a single, known behavior policy, we instead make the following regularity assumption:
Assumption 1 (Reference distribution property).
For any $(s,a)$, $d^\pi(s,a) > 0$ implies $d^D(s,a) > 0$. Furthermore, the correction terms are bounded by some finite constant $C$: $\|w_{\pi/D}\|_\infty \le C$.
3 DualDICE
We now develop our algorithm, DualDICE, for estimating the discounted stationary distribution corrections $w_{\pi/D}(s,a) = d^\pi(s,a)/d^D(s,a)$. In the OPE setting, one does not have explicit knowledge of the distribution $d^D$, but rather only access to samples $D = \{(s,a,r,s')\} \sim d^D$. Similar to the TD methods described above, we also assume access to samples from the initial state distribution $\beta$. We begin by introducing a key result, which we will later derive and use as the crux for our algorithm.
3.1 The Key Idea
Consider optimizing a (bounded) function $\nu: S\times A\to\mathbb{R}$ for the following objective:
$$\min_{\nu: S\times A\to\mathbb{R}} J(\nu) := \frac{1}{2}\,\mathbb{E}_{(s,a)\sim d^D}\left[(\nu - \mathcal{B}^\pi\nu)(s,a)^2\right] - (1-\gamma)\,\mathbb{E}_{s_0\sim\beta,\ a_0\sim\pi(s_0)}\left[\nu(s_0,a_0)\right], \tag{6}$$
where we use $\mathcal{B}^\pi$ to denote the expected Bellman operator with respect to policy $\pi$ and zero reward: $\mathcal{B}^\pi\nu(s,a) = \gamma\,\mathbb{E}_{s'\sim T(s,a),\ a'\sim\pi(s')}[\nu(s',a')]$. The first term in equation 6 is the expected squared Bellman error with zero reward. This term alone would lead to a trivial solution $\nu^* \equiv 0$, which can be avoided by the second term that encourages $\nu^* > 0$. Together, these two terms result in an optimal $\nu^*$ that places some non-zero amount of Bellman residual at state-action pairs sampled from $d^D$.
Perhaps surprisingly, as we will show, the Bellman residuals of $\nu^*$ are exactly the desired distribution corrections:
$$(\nu^* - \mathcal{B}^\pi\nu^*)(s,a) = w_{\pi/D}(s,a). \tag{7}$$
This key result provides the foundation for our algorithm, since it provides us with a simple objective (relying only on samples from $d^D$, $\beta$, $\pi$) which we may optimize in order to obtain estimates of the distribution corrections. In the text below, we will show how we arrive at this result. We provide one additional step which allows us to efficiently learn a parameterized $\nu$ with respect to equation 6. We then generalize our results to a family of similar algorithms and lastly present theoretical guarantees.
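In the tabular case the key result can be checked directly with matrix operations. The sketch below assumes the objective is one half of the expected squared zero-reward Bellman residual under the data distribution, minus $(1-\gamma)$ times the expected value of $\nu$ at initial state-action pairs, as described above. The data distribution `d_D` is an arbitrary strictly positive distribution over state-action pairs, standing in for data from unknown behavior policies; all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma = 4, 2, 0.9
n = n_s * n_a
T = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))  # T[s, a, s']
pi = rng.dirichlet(np.ones(n_a), size=n_s)        # target policy pi[s, a]
beta = rng.dirichlet(np.ones(n_s))                # initial state distribution
d_D = rng.dirichlet(np.ones(n))                   # arbitrary data distribution over (s, a)

# P[(s,a), (s',a')] = T(s'|s,a) pi(a'|s'), so the zero-reward Bellman
# operator acts as (B nu) = gamma * P @ nu.
P = (T[:, :, :, None] * pi[None, None, :, :]).reshape(n, n)
mu0 = (beta[:, None] * pi).reshape(n)             # beta(s0) * pi(a0|s0)
A = np.eye(n) - gamma * P                         # maps nu to nu - B nu

# J(nu) = 0.5 * (A nu)^T diag(d_D) (A nu) - (1 - gamma) * mu0^T nu.
# Setting the gradient to zero gives a linear system for the minimizer.
nu_star = np.linalg.solve(A.T @ np.diag(d_D) @ A, (1.0 - gamma) * mu0)
residual = A @ nu_star                            # Bellman residuals of nu_star

# Ground truth for comparison: d^pi / d^D, with d^pi computed exactly.
d_pi = (1.0 - gamma) * np.linalg.solve(A.T, mu0)
w_true = d_pi / d_D
```

The check never uses a behavior policy: only the data distribution over state-action pairs, the transition structure, and the target policy enter the computation.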
3.2 Derivation
A Technical Observation
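We begin our derivation of the algorithm for estimating $w_{\pi/D}$ by presenting the following simple technical observation: For arbitrary scalars $m\in\mathbb{R}_{>0}$, $n\in\mathbb{R}_{\ge 0}$, the optimizer of the convex problem $\min_x J(x) := \frac{1}{2}mx^2 - nx$ is unique and given by $x^* = \frac{n}{m}$. Using this observation, and letting $\mathcal{C}$ be some bounded subset of $\mathbb{R}$ which contains $[0, C]$, one immediately sees that the optimizer of the following problem,
$$\min_{x: S\times A\to\mathcal{C}} J_1(x) := \frac{1}{2}\,\mathbb{E}_{(s,a)\sim d^D}\left[x(s,a)^2\right] - \mathbb{E}_{(s,a)\sim d^\pi}\left[x(s,a)\right], \tag{8}$$
is given by $x^*(s,a) = w_{\pi/D}(s,a)$ for any $(s,a)\in S\times A$. This result provides us with an objective that shares the same basic form as equation 6. The main distinction is that the second term relies on an expectation over $d^\pi$, which we do not have access to.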
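The step from the scalar observation to the functional objective can be spelled out as follows. This is a short worked derivation for discrete state and action spaces, assuming the first expectation is over the data distribution $d^D$ and the second over $d^\pi$, and that Assumption 1 keeps the unconstrained minimizer inside the bounded set.

```latex
\begin{align*}
J_1(x) &= \sum_{s,a}\Big[\tfrac{1}{2}\, d^{D}(s,a)\, x(s,a)^2 - d^{\pi}(s,a)\, x(s,a)\Big],\\
\frac{\partial J_1}{\partial x(s,a)} &= d^{D}(s,a)\, x(s,a) - d^{\pi}(s,a) = 0
\;\Longrightarrow\; x^*(s,a) = \frac{d^{\pi}(s,a)}{d^{D}(s,a)}.
\end{align*}
```

Each summand is an instance of the scalar problem with $m = d^D(s,a)$ and $n = d^\pi(s,a)$, so the objective decouples across state-action pairs and the minimizer is the ratio of the two distributions.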
Change of Variables
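In order to transform the second expectation in equation 8 to be over the initial state distribution $\beta$, we perform the following change of variables: Let $\nu: S\times A\to\mathbb{R}$ be an arbitrary state-action value function that satisfies,
$$\nu(s,a) := x(s,a) + \gamma\,\mathbb{E}_{s'\sim T(s,a),\ a'\sim\pi(s')}\left[\nu(s',a')\right], \quad \forall (s,a)\in S\times A. \tag{9}$$
Since $x(s,a)\in\mathcal{C}$ is bounded and $\gamma < 1$, the variable $\nu(s,a)$ is well-defined and bounded. By applying this change of variables, the objective function in equation 8 can be rewritten in terms of $\nu$, and this yields our previously presented objective from equation 6. Indeed, define,
$$\beta_t(s) := \Pr\left(s = s_t \;\middle|\; s_0\sim\beta,\ a_k\sim\pi(s_k),\ s_{k+1}\sim T(s_k,a_k)\ \text{for}\ 0\le k<t\right),$$
to be the state visitation probability at step $t$ when following $\pi$. Clearly, $\beta_0 = \beta$. Then,
$$\begin{aligned}
\mathbb{E}_{(s,a)\sim d^\pi}[x(s,a)] &= (1-\gamma)\sum_{t=0}^{\infty}\gamma^t\,\mathbb{E}_{s\sim\beta_t,\ a\sim\pi(s)}\left[\nu(s,a) - \gamma\,\mathbb{E}_{s'\sim T(s,a),\ a'\sim\pi(s')}[\nu(s',a')]\right] \\
&= (1-\gamma)\sum_{t=0}^{\infty}\gamma^t\,\mathbb{E}_{s\sim\beta_t,\ a\sim\pi(s)}[\nu(s,a)] - (1-\gamma)\sum_{t=0}^{\infty}\gamma^{t+1}\,\mathbb{E}_{s\sim\beta_{t+1},\ a\sim\pi(s)}[\nu(s,a)] \\
&= (1-\gamma)\,\mathbb{E}_{s\sim\beta,\ a\sim\pi(s)}[\nu(s,a)].
\end{aligned}$$
The Bellman residuals of the optimum of this objective give the desired off-policy corrections:
$$x^*(s,a) = (\nu^* - \mathcal{B}^\pi\nu^*)(s,a) = w_{\pi/D}(s,a). \tag{10}$$
Equation 6 provides a promising approach for estimating the stationary distribution corrections, since the first expectation is over state-action pairs sampled from $d^D$, while the second expectation is over $\beta$ and actions sampled from $\pi$, both of which we have access to. Therefore, in principle we may solve this problem with respect to a parameterized value function $\nu$, and then use the optimized $\nu^*$ to deduce the corrections. In practice, however, the objective in its current form presents two difficulties: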

The quantity $(\nu - \mathcal{B}^\pi\nu)(s,a)^2$ involves a conditional expectation inside a square. In general, when environment dynamics are stochastic and the action space may be large or continuous, this quantity may not be readily optimized using standard stochastic techniques. (For example, when the environment is stochastic, its Monte-Carlo sample gradient is generally biased.)

Even if one has computed the optimal value $\nu^*$, the corrections $(\nu^* - \mathcal{B}^\pi\nu^*)(s,a)$, due to the same argument as above, may not be easily computed, especially when the environment is stochastic or the action space continuous.
Exploiting Fenchel Duality
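We solve both difficulties listed above in one step by exploiting Fenchel duality [42]: Any convex function $f(x)$ may be written as $f(x) = \max_\zeta\ x\cdot\zeta - f^*(\zeta)$, where $f^*$ is the Fenchel conjugate of $f$. In the case of $f(x) = \frac{1}{2}x^2$, the Fenchel conjugate is given by $f^*(\zeta) = \frac{1}{2}\zeta^2$. Thus, we may express our objective as,
$$\min_{\nu: S\times A\to\mathbb{R}} J(\nu) := \mathbb{E}_{(s,a)\sim d^D}\left[\max_\zeta\ (\nu - \mathcal{B}^\pi\nu)(s,a)\cdot\zeta - \frac{1}{2}\zeta^2\right] - (1-\gamma)\,\mathbb{E}_{s_0\sim\beta,\ a_0\sim\pi(s_0)}\left[\nu(s_0,a_0)\right].$$
By the interchangeability principle [8, 41, 43], we may replace the inner max over scalar $\zeta$ with a max over functions $\zeta: S\times A\to\mathbb{R}$ and obtain a min-max saddle-point optimization:
$$\min_{\nu: S\times A\to\mathbb{R}}\ \max_{\zeta: S\times A\to\mathbb{R}} J(\nu,\zeta) := \mathbb{E}_{(s,a,s')\sim d^D,\ a'\sim\pi(s')}\left[(\nu(s,a) - \gamma\nu(s',a'))\,\zeta(s,a) - \zeta(s,a)^2/2\right] - (1-\gamma)\,\mathbb{E}_{s_0\sim\beta,\ a_0\sim\pi(s_0)}\left[\nu(s_0,a_0)\right]. \tag{11}$$
Using the KKT condition of the inner optimization problem (which is convex and quadratic in $\zeta$), for any $\nu$ the optimal value $\zeta^*_\nu$ is equal to the Bellman residual, $\nu - \mathcal{B}^\pi\nu$. Therefore, the desired stationary distribution correction can then be found from the saddle-point solution $(\nu^*, \zeta^*)$ of the minimax problem in equation 11 as follows:
$$\zeta^*(s,a) = (\nu^* - \mathcal{B}^\pi\nu^*)(s,a) = w_{\pi/D}(s,a). \tag{12}$$
Now we finally have an objective which is well-suited for practical computation. First, unbiased estimates of both the objective and its gradients are easy to compute using stochastic samples from $d^D$, $\beta$, and $\pi$, all of which we have access to. Secondly, notice that the min-max objective function in equation 11 is linear in $\nu$ and concave in $\zeta$. Therefore in certain settings, one can provide guarantees on the convergence of optimization algorithms applied to this objective (see Section 3.4). Thirdly, the optimizer of the objective in equation 11 immediately gives us the desired stationary distribution corrections through the values of $\zeta^*(s,a)$, with no additional computation.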
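A sample-based estimate of the saddle-point objective might look like the sketch below, assuming the quadratic case in which each transition contributes the single-sample residual times $\zeta$ minus $\zeta^2/2$. Tabular arrays stand in for parameterized functions, and the function name and argument layout are illustrative choices rather than the authors' implementation.

```python
import numpy as np

def dualdice_objective(nu, zeta, s, a, s_next, a_next, s0, a0, gamma):
    """Sample estimate of J(nu, zeta) for the quadratic penalty.

    nu, zeta     : arrays of shape (n_states, n_actions), tabular stand-ins
    s, a, s_next : integer arrays holding transitions from the dataset
    a_next       : actions sampled from the target policy at s_next
    s0, a0       : initial states and target-policy actions at those states
    """
    s, a, s_next, a_next = map(np.asarray, (s, a, s_next, a_next))
    s0, a0 = np.asarray(s0), np.asarray(a0)
    # Single-sample residual nu(s,a) - gamma * nu(s',a'). It enters linearly,
    # so averaging over samples gives an unbiased estimate of the objective.
    residual = nu[s, a] - gamma * nu[s_next, a_next]
    data_term = np.mean(residual * zeta[s, a] - 0.5 * zeta[s, a] ** 2)
    init_term = (1.0 - gamma) * np.mean(nu[s0, a0])
    return data_term - init_term

nu = np.array([[1.0, 2.0], [3.0, 4.0]])
zeta = np.array([[0.5, 1.0], [1.5, 2.0]])
batch = dict(s=[0], a=[1], s_next=[1], a_next=[0], s0=[0], a0=[0], gamma=0.5)

j_value = dualdice_objective(nu, zeta, **batch)

# For this single transition the residual is 2 - 0.5 * 3 = 0.5. Setting
# zeta(s, a) to the residual maximizes the data term, which then equals
# 0.5 * residual**2.
zeta_best = zeta.copy()
zeta_best[0, 1] = 0.5
j_best = dualdice_objective(nu, zeta_best, **batch)
```

Because the residual appears linearly rather than inside a square, no second independent next-state sample is needed, which is the practical benefit the dual form provides over the squared-residual objective.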
3.3 Extension to General Convex Functions
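Besides a quadratic penalty function, one may extend the above derivations to a more general class of convex penalty functions. Consider a generic convex penalty function $f:\mathbb{R}\to\mathbb{R}$. Recall that $\mathcal{C}$ is a bounded subset of $\mathbb{R}$ which contains the interval $[0, C]$ of stationary distribution corrections. If $\mathcal{C}$ is contained in the range of $f'$, then the optimizer of the convex problem, $\min_x J(x) := m\cdot f(x) - nx$ for $\frac{n}{m}\in\mathcal{C}$, satisfies the following KKT condition: $f'(x^*) = \frac{n}{m}$. Analogously, the optimizer $x^*$ of,
$$\min_{x: S\times A\to\mathcal{C}} J_1(x) := \mathbb{E}_{(s,a)\sim d^D}\left[f(x(s,a))\right] - \mathbb{E}_{(s,a)\sim d^\pi}\left[x(s,a)\right], \tag{13}$$
satisfies the equality
$$f'(x^*(s,a)) = w_{\pi/D}(s,a).$$
With change of variables $\nu := x + \mathcal{B}^\pi\nu$, the above problem becomes,
$$\min_{\nu: S\times A\to\mathbb{R}} J(\nu) := \mathbb{E}_{(s,a)\sim d^D}\left[f\big((\nu - \mathcal{B}^\pi\nu)(s,a)\big)\right] - (1-\gamma)\,\mathbb{E}_{s_0\sim\beta,\ a_0\sim\pi(s_0)}\left[\nu(s_0,a_0)\right]. \tag{14}$$
Applying Fenchel duality to $f$ in this objective further leads to the following saddle-point problem:
$$\min_{\nu: S\times A\to\mathbb{R}}\ \max_{\zeta: S\times A\to\mathbb{R}} J(\nu,\zeta) := \mathbb{E}_{(s,a,s')\sim d^D,\ a'\sim\pi(s')}\left[(\nu(s,a) - \gamma\nu(s',a'))\,\zeta(s,a) - f^*(\zeta(s,a))\right] - (1-\gamma)\,\mathbb{E}_{s_0\sim\beta,\ a_0\sim\pi(s_0)}\left[\nu(s_0,a_0)\right]. \tag{15}$$
By the KKT condition of the inner optimization problem, for any $\nu$ the optimizer $\zeta^*_\nu$ satisfies,
$$f^{*\prime}(\zeta^*_\nu(s,a)) = (\nu - \mathcal{B}^\pi\nu)(s,a). \tag{16}$$
Therefore, using the fact that the derivative $f'$ of a convex function $f$ is the inverse function of the derivative of its Fenchel conjugate $f^{*\prime}$, our desired stationary distribution corrections are found by computing the saddle-point $(\zeta^*, \nu^*)$ of the above problem:
$$\zeta^*(s,a) = f'\big((\nu^* - \mathcal{B}^\pi\nu^*)(s,a)\big) = w_{\pi/D}(s,a). \tag{17}$$
Amazingly, despite the generalization beyond the quadratic penalty function $f(x) = \frac{1}{2}x^2$, the optimization problem in equation 15 retains all the computational benefits that make this method very practical for learning $w_{\pi/D}(s,a)$: All quantities and their gradients may be unbiasedly estimated from stochastic samples; the objective is linear in $\nu$ and concave in $\zeta$, and thus is well-behaved; and the optimizer of this problem immediately provides the desired stationary distribution corrections through the values of $\zeta^*(s,a)$, without any additional computation.
This generalized derivation also provides insight into the initial technical result: It is now clear that the objective in equation 13 is the negative Fenchel dual (variational) form of the Ali-Silvey or $f$-divergence, which has been used in previous work to estimate divergence and data likelihood ratios [33]. Despite their similar formulations, we emphasize that the aforementioned dual form of the $f$-divergence is not immediately applicable to estimation of off-policy corrections in the context of RL, due to the fact that samples from distribution $d^\pi$ are unobserved. Indeed, our derivations hinged on two additional key steps: (1) the change of variables from $x$ to $\nu$; and (2) the second application of duality to introduce $\zeta$. Due to these repeated applications of duality in our derivations, we term our method Dual stationary DIstribution Correction Estimation (DualDICE).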
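The two facts about conjugates used in this section can be checked numerically for a concrete penalty. The sketch below assumes, purely for illustration, the power family $f(x) = |x|^p / p$ with $p > 1$, whose Fenchel conjugate is $f^*(y) = |y|^q / q$ with $1/p + 1/q = 1$. The choice $p = 1.5$ and all function names are hypothetical.

```python
import numpy as np

def f(x, p):
    """Power penalty f(x) = |x|^p / p (an illustrative choice)."""
    return np.abs(x) ** p / p

def f_prime(x, p):
    return np.sign(x) * np.abs(x) ** (p - 1.0)

def f_conj(y, p):
    """Fenchel conjugate of f: |y|^q / q with 1/p + 1/q = 1."""
    q = p / (p - 1.0)
    return np.abs(y) ** q / q

def f_conj_prime(y, p):
    q = p / (p - 1.0)
    return np.sign(y) * np.abs(y) ** (q - 1.0)

p = 1.5
ys = np.linspace(-3.0, 3.0, 13)

# Inverse-derivative property: applying f' after (f*)' returns the input.
round_trip = f_prime(f_conj_prime(ys, p), p)

# Conjugate definition f*(y) = max_x (x * y - f(x)), checked on a fine grid.
grid = np.linspace(-50.0, 50.0, 200001)
y = 2.0
conj_numeric = np.max(grid * y - f(grid, p))
conj_closed_form = f_conj(y, p)
```

With $p = 2$ the family reduces to the quadratic penalty, for which the penalty and its conjugate coincide.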
3.4 Theoretical Guarantees
In this section, we consider the theoretical properties of DualDICE in the setting where we have a dataset formed by empirical samples $\{(s_i, a_i, r_i, s'_i)\}_{i=1}^N \sim d^D$, $\{s_0^i\}_{i=1}^N \sim \beta$, and target actions $a'_i \sim \pi(s'_i)$, $a_0^i \sim \pi(s_0^i)$ for $i = 1, \ldots, N$. (Footnote 3: For the sake of simplicity, we consider the batch learning setting with i.i.d. samples as in [48]. The results can be easily generalized to a single sample path with dependent samples; see the Appendix.) We will use the shorthand notation $\hat{\mathbb{E}}_{d^D}$ to denote an average over these empirical samples. Although the proposed estimator can adopt general $f$, for simplicity of exposition we restrict to $f(x) = \frac{1}{2}x^2$. We consider using an algorithm OPT (e.g., stochastic gradient descent/ascent) to find optimal $\nu, \zeta$ of equation 15 within some parameterization families $\mathcal{F}, \mathcal{H}$, respectively. We denote by $\hat\nu, \hat\zeta$ the outputs of OPT. We have the following guarantee on the quality of $\hat\nu, \hat\zeta$ with respect to the off-policy policy estimation (OPE) problem.
Theorem 2.
(Informal) Under some mild assumptions, the mean squared error (MSE) associated with using $\hat\nu, \hat\zeta$ for OPE can be bounded as,
(18) 
where the outer expectation is with respect to the randomness of the empirical samples and OPT, $\epsilon_{\mathrm{opt}}$ denotes the optimization error, and $\epsilon_{\mathrm{approx}}(\mathcal{F}, \mathcal{H})$ denotes the approximation error due to $(\mathcal{F}, \mathcal{H})$.
The sources of estimation error are explicit in Theorem 2. As the number of samples $N$ increases, the statistical error approaches zero. Meanwhile, there is an implicit trade-off between $\epsilon_{\mathrm{approx}}(\mathcal{F}, \mathcal{H})$ and $\epsilon_{\mathrm{opt}}$. With flexible function spaces $\mathcal{F}$ and $\mathcal{H}$ (such as the space of neural networks), the approximation error can be further decreased; however, optimization will be complicated and it is difficult to characterize $\epsilon_{\mathrm{opt}}$. On the other hand, with linear parameterization of $(\nu, \zeta)$, under some mild conditions, after iterations we achieve a provably fast rate, for and for , at the cost of potentially increased approximation error. See the Appendix for the precise theoretical results, proofs, and further discussions.
4 Related Work
Density Ratio Estimation
Density ratio estimation is an important tool for many machine learning and statistics problems. Other than the naive approach (i.e., the density ratio is calculated via estimating the densities in the numerator and denominator separately, which may magnify the estimation error), various direct ratio estimators have been proposed [44], including the moment matching approach [17], the probabilistic classification approach [4, 7, 40], and the ratio matching approach [22, 33, 45].
The proposed DualDICE algorithm, as a direct approach for density ratio estimation, bears some similarities to ratio matching [33], which is also derived by exploiting the Fenchel dual representation of the $f$-divergences. However, compared to the existing direct estimators, the major difference lies in the requirement of the samples from the stationary distribution. Specifically, the existing estimators require access to samples from both $d^D$ and $d^\pi$, which is impractical in the off-policy learning setting. Therefore, DualDICE is uniquely applicable to the more difficult RL setting.
Off-Policy Policy Evaluation
The problem of off-policy policy evaluation has been heavily studied in contextual bandits [12, 49, 52] and in the more general RL setting [14, 21, 26, 29, 34, 36, 37, 50, 51]. Several representative approaches can be identified in the literature. The Direct Method (DM) learns a model of the system and then uses it to estimate the performance of the evaluation policy. This approach often has low variance but its bias depends on how well the selected function class can express the environment dynamics. Importance sampling (IS) [38] uses importance weights to correct the mismatch between the distributions of the system trajectory induced by the target and behavior policies. Its variance can be unbounded when there is a big difference between the distributions of the evaluation and behavior policies, and grows exponentially with the horizon of the RL problem. Doubly Robust (DR) is a combination of DM and IS, and can achieve the low variance of DM and no (or low) bias of IS. Other than DM, all the methods described above require knowledge of the policy density ratio, and thus the behavior policy. Our proposed algorithm avoids this necessity.
5 Experiments
We evaluate our method applied to off-policy policy evaluation (OPE). We focus on this setting because it is a direct application of stationary distribution correction estimation, without many additional tunable parameters, and it has been previously used as a testbed for similar techniques [27]. In each experiment, we use a behavior policy $\pi_b$ to collect some number of trajectories, each for some number of steps. This data is used to estimate the stationary distribution corrections, which are then used to estimate the average step reward, with respect to a target policy $\pi$. We focus our comparisons here on a TD-based approach [16] and weighted step-wise IS (as described in [27]), which we and others have generally found to work best relative to common IS variants [30, 38]. See the Appendix for additional results and implementation details.
We begin in a controlled setting with an evaluation agnostic to optimization issues, where we find that, absent these issues, our method is competitive with TD-based approaches (Figure 1). However, as we move to more difficult settings with complex environment dynamics, the performance of TD methods degrades dramatically, while our method is still able to provide accurate estimates (Figure 2). Finally, we provide an analysis of the optimization behavior of our method on a simple control task across different choices of function $f$ (Figure 3). Interestingly, although the choice of $f(x) = \frac{1}{2}x^2$ is most natural, we find the empirically best performing choice to be . All results are summarized for 20 random seeds, with median plotted and error bars at and percentiles.
5.1 Estimation Without Function Approximation
[Figure 1: y-axis: log RMSE; x-axis: trajectory length.]
We begin with a tabular task, the Taxi domain [10]. In this task, we evaluate our method in a manner agnostic to optimization difficulties: The objective in equation 6 is quadratic in $\nu$, and thus may be solved by matrix operations. The Bellman residuals (equation 7) may then be estimated via an empirical average of the transitions appearing in the off-policy data. In a similar manner, TD methods for estimating the correction terms may also be solved using matrix operations [27]. In this controlled setting, we find that, as expected, TD methods can perform well (Figure 1), and our method achieves competitive performance. As we will see in the following results, the good performance of TD methods quickly deteriorates as one moves to more complex settings, while our method is able to maintain good performance, even when using function approximation and stochastic optimization.
5.2 Control Tasks
[Figure 2: six panels, three for Cartpole and three for Reacher.]
We now move on to difficult control tasks: a discrete-control task, Cartpole, and a continuous-control task, Reacher [6]. In these tasks, observations are continuous, and thus we use neural network function approximators with stochastic optimization. Figure 2 shows the results of our method compared to the TD method. We find that in this setting, DualDICE is able to provide good, stable performance, while the TD approach suffers from high variance, and this issue is exacerbated when we attempt to estimate $\pi_b$ rather than assume it as given. See the Appendix for additional baseline results.
5.3 Choice of Convex Function
[Figure 3: y-axis: log RMSE.]
We analyze the choice of the convex function $f$. We consider a simple continuous grid task where an agent may move left, right, up, or down and is rewarded for reaching the bottom right corner of a square room. We plot the estimation errors of using DualDICE for off-policy policy evaluation on this task, comparing different choices of convex functions of the form . Interestingly, although the choice of $f(x) = \frac{1}{2}x^2$ is most natural, we find the empirically best performing choice to be . Thus, this is the form of $f$ we used in our experiments for Figure 2.
6 Conclusions
We have presented DualDICE, a method for estimating off-policy stationary distribution corrections. Compared to previous work, our method is agnostic to knowledge of the behavior policy used to collect the off-policy data and avoids the use of importance weights in its losses. These advantages have a profound empirical effect: our method provides significantly better estimates compared to TD methods, especially in settings which require function approximation and stochastic optimization.
Future work includes (1) incorporating the DualDICE algorithm into off-policy training, (2) further understanding the effects of $f$ on the performance of DualDICE (in terms of approximation error of the distribution corrections), and (3) evaluating DualDICE on real-world off-policy evaluation tasks.
References
 [1] Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous inhand manipulation. arXiv preprint arXiv:1808.00177, 2018.
 [2] András Antos, Csaba Szepesvári, and Rémi Munos. Learning nearoptimal policies with bellmanresidual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129, 2008.
 [3] Richard Ernest Bellman. Dynamic Programming. Dover Publications, Inc., New York, NY, USA, 2003.
 [4] Steffen Bickel, Michael Brückner, and Tobias Scheffer. Discriminative learning for differing training and test distributions. In Proceedings of the 24th international conference on Machine learning, pages 81–88. ACM, 2007.
 [5] Stephane Boucheron, Gabor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2016.
 [6] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
 [7] Kuang Fu Cheng, ChihKang Chu, et al. Semiparametric density estimation under a twosample density ratio model. Bernoulli, 10(4):583–604, 2004.
 [8] Bo Dai, Niao He, Yunpeng Pan, Byron Boots, and Le Song. Learning from conditional distributions via dual embeddings. arXiv preprint arXiv:1607.04579, 2016.
 [9] Bo Dai, Albert Shaw, Lihong Li, Lin Xiao, Niao He, Zhen Liu, Jianshu Chen, and Le Song. Sbeed: Convergent reinforcement learning with nonlinear function approximation. arXiv preprint arXiv:1712.10285, 2017.
 [10] Thomas G Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.
 [11] Simon S Du, Jianshu Chen, Lihong Li, Lin Xiao, and Dengyong Zhou. Stochastic variance reduction methods for policy evaluation. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pages 1049–1058. JMLR. org, 2017.
 [12] Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. arXiv preprint arXiv:1103.4601, 2011.
 [13] Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. More robust doubly robust offpolicy evaluation. arXiv preprint arXiv:1802.03493, 2018.
 [14] Raphael Fonteneau, Susan A. Murphy, Louis Wehenkel, and Damien Ernst. Batch mode reinforcement learning based on the synthesis of artificial trajectories. Annals of Operations Research, 208(1):383–416, 2013.
 [15] Jianfeng Gao, Michel Galley, and Lihong Li. Neural approaches to Conversational AI. Foundations and Trends in Information Retrieval, 13(2–3):127–298, 2019.
 [16] Carles Gelada and Marc G Bellemare. Offpolicy deep reinforcement learning by bootstrapping the covariate shift. AAAI, 2018.
 [17] Arthur Gretton, Alex J Smola, Jiayuan Huang, Marcel Schmittfull, Karsten M Borgwardt, and Bernhard Schöllkopf. Covariate shift by kernel mean matching. In Dataset shift in machine learning, pages 131–160. MIT Press, 2009.
 [18] Assaf Hallak and Shie Mannor. Consistent online offpolicy evaluation. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pages 1372–1383. JMLR. org, 2017.
 [19] W Keith Hastings. Monte carlo sampling methods using markov chains and their applications. 1970.
 [20] David Haussler. Sphere packing numbers for subsets of the boolean ncube with bounded vapnikchervonenkis dimension. Journal of Combinatorial Theory, Series A, 69(2):217–232, 1995.
 [21] Nan Jiang and Lihong Li. Doubly robust offpolicy value evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, pages 652–661, 2016.
 [22] Takafumi Kanamori, Shohei Hido, and Masashi Sugiyama. A leastsquares approach to direct importance estimation. Journal of Machine Learning Research, 10(Jul):1391–1445, 2009.
 [23] Alessandro Lazaric, Mohammad Ghavamzadeh, and Rémi Munos. Finitesample analysis of leastsquares policy iteration. Journal of Machine Learning Research, 13(Oct):3041–3074, 2012.
 [24] Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541, 2016.
 [25] Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pages 297–306. ACM, 2011.
 [26] Lihong Li, Rémi Munos, and Csaba Szepesvári. Toward minimax off-policy value estimation. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, pages 608–616, 2015.
 [27] Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, pages 5356–5366, 2018.
 [28] Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Off-policy policy gradient with state distribution correction. In Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence, 2019. To appear.
 [29] A. Mahmood, H. van Hasselt, and R. Sutton. Weighted importance sampling for off-policy learning with linear function approximation. In Proceedings of the 27th International Conference on Neural Information Processing Systems, 2014.
 [30] Travis Mandel, Yun-En Liu, Sergey Levine, Emma Brunskill, and Zoran Popovic. Offline policy evaluation across representations with applications to educational games. In Proceedings of the 2014 International Conference on Autonomous Agents and Multi-Agent Systems, pages 1077–1084. International Foundation for Autonomous Agents and Multiagent Systems, 2014.
 [31] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
 [32] Susan A Murphy, Mark J van der Laan, James M Robins, and Conduct Problems Prevention Research Group. Marginal mean models for dynamic regimes. Journal of the American Statistical Association, 96(456):1410–1423, 2001.
 [33] XuanLong Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.
 [34] C. Paduraru. Off-policy Evaluation in Markov Decision Processes. PhD thesis, McGill University, 2013.
 [35] D Pollard. Convergence of Stochastic Processes. David Pollard, 1984.
 [36] D. Precup, R. Sutton, and S. Dasgupta. Off-policy temporal difference learning with function approximation. In Proceedings of the 18th International Conference on Machine Learning, pages 417–424, 2001.
 [37] D. Precup, R. Sutton, and S. Singh. Eligibility traces for off-policy policy evaluation. In Proceedings of the 17th International Conference on Machine Learning, pages 759–766, 2000.
 [38] Doina Precup. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, page 80, 2000.
 [39] Martin L Puterman. Markov decision processes: Discrete stochastic dynamic programming. 1994.
 [40] Jing Qin. Inferences for case-control and semiparametric two-sample density ratio models. Biometrika, 85(3):619–630, 1998.
 [41] R Tyrrell Rockafellar and Roger JB Wets. Variational analysis, volume 317. Springer Science & Business Media, 2009.
 [42] Ralph Tyrrell Rockafellar. Convex analysis. Princeton University Press, 2015.
 [43] Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczyński. Lectures on stochastic programming: modeling and theory. SIAM, 2009.
 [44] Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density ratio estimation in machine learning. Cambridge University Press, 2012.
 [45] Masashi Sugiyama, Taiji Suzuki, Shinichi Nakajima, Hisashi Kashima, Paul von Bünau, and Motoaki Kawanabe. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60(4):699–746, 2008.
 [46] Richard S Sutton and Andrew G Barto. Introduction to reinforcement learning, volume 135.
 [47] Richard S Sutton, A Rupam Mahmood, and Martha White. An emphatic approach to the problem of off-policy temporal-difference learning. The Journal of Machine Learning Research, 17(1):2603–2631, 2016.
 [48] Richard S Sutton, Csaba Szepesvári, Alborz Geramifard, and Michael Bowling. Dyna-style planning with linear function approximation and prioritized sweeping. In Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence, pages 528–536. AUAI Press, 2008.
 [49] A. Swaminathan, A. Krishnamurthy, A. Agarwal, M. Dudík, J. Langford, D. Jose, and I. Zitouni. Off-policy evaluation for slate recommendation. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 3635–3645, 2017.
 [50] P. Thomas and E. Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, pages 2139–2148, 2016.
 [51] P. Thomas, G. Theocharous, and M. Ghavamzadeh. High confidence off-policy evaluation. In Proceedings of the 29th Conference on Artificial Intelligence, 2015.
 [52] Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudik. Optimal and adaptive off-policy evaluation in contextual bandits. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 3589–3597. JMLR.org, 2017.
 [53] Bin Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, pages 94–116, 1994.
Appendix A Pseudocode
Appendix B Additional Results
[Figure: additional results, three panels for Cartpole and three panels for Reacher.]
Appendix C Experimental Details
C.1 Taxi
For the Taxi domain, we follow the same protocol as used in [27]. In this tabular, exact-solve setting, the TD methods [16] are equivalent to their kernel-based TD method. We fix to . The behavior and target policies are also taken from [27] (referred to in their work as the behavior policy for ).
In this setting, we solve for the optimal empirical exactly using matrix operations. Since [27] perform a similar exact solve for variables , for better comparison we also perform our exact solve with respect to variables . Specifically, one may follow the same derivations for DualDICE with respect to learning . The final objective will require knowledge of the importance weights .
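As a rough illustration of what an exact tabular solve can look like, the sketch below assumes the quadratic choice f(x) = x²/2 and works over state-action pairs with exact model quantities rather than empirical counts. Under that assumption the objective is ½ E_D[(ν − γ P^π ν)²] − (1 − γ) E_0[ν], whose minimizer has a closed form. The names `P_pi`, `d_data`, and `mu0` are hypothetical and introduced only for this example. It is not the state-only variant described above.

```python
import numpy as np


def exact_tabular_corrections(P_pi, d_data, mu0, gamma):
    """Closed-form minimizer of a quadratic DualDICE-style objective (sketch).

    P_pi:   (n, n) row-stochastic transition matrix over state-action pairs
            under the target policy (hypothetical input).
    d_data: (n,) dataset distribution over state-action pairs, all entries > 0.
    mu0:    (n,) initial state-action distribution under the target policy.
    gamma:  discount factor in [0, 1).
    Returns the estimated correction ratios, one per state-action pair.
    """
    n = P_pi.shape[0]
    A = np.eye(n) - gamma * P_pi            # maps nu to nu - gamma * P_pi @ nu
    D = np.diag(d_data)
    # First-order optimality condition: A^T D A nu = (1 - gamma) * mu0
    nu = np.linalg.solve(A.T @ D @ A, (1.0 - gamma) * mu0)
    return A @ nu                            # Bellman residual of nu = corrections


def discounted_occupancy(P_pi, mu0, gamma):
    """Discounted stationary distribution of the target policy (for checking)."""
    n = P_pi.shape[0]
    return (1.0 - gamma) * np.linalg.solve(np.eye(n) - gamma * P_pi.T, mu0)
```

On a small random problem, multiplying the returned corrections by `d_data` should recover the output of `discounted_occupancy` up to numerical error, which is a convenient sanity check for this kind of solve.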
C.2 Control Tasks
We use the Cartpole and Reacher tasks as given by OpenAI Gym [6]. In these tasks we use COPTD [16] for the TD method ([27] requires a proper kernel, which is not readily available for these tasks). When assuming an unknown , we learn a neural network policy using behavior cloning, and use its probabilities for computing importance weights . All neural networks are feedforward with two hidden layers of dimension and activations.
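The behavior-cloning step can be sketched roughly as follows for discrete actions. To keep the example self-contained it swaps the two-hidden-layer neural network for a linear-softmax policy fitted by plain gradient descent in NumPy. The function names, learning rate, and step count are assumptions made for illustration, not settings from the experiments.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def fit_behavior_cloning(states, actions, num_actions, lr=0.5, steps=500):
    """Maximum-likelihood fit of a linear-softmax policy to logged (state, action) pairs."""
    n, dim = states.shape
    W = np.zeros((dim, num_actions))
    b = np.zeros(num_actions)
    onehot = np.eye(num_actions)[actions]
    for _ in range(steps):
        probs = softmax(states @ W + b)
        grad_logits = (probs - onehot) / n   # gradient of the mean negative log-likelihood
        W -= lr * states.T @ grad_logits
        b -= lr * grad_logits.sum(axis=0)
    return W, b


def importance_weights(target_probs, states, actions, W, b):
    """Ratio of target-policy to cloned-behavior probabilities on the logged actions.

    target_probs: (n, num_actions) target-policy probabilities at each logged state.
    """
    cloned_probs = softmax(states @ W + b)
    idx = np.arange(len(actions))
    return target_probs[idx, actions] / cloned_probs[idx, actions]
```

The ratios returned by `importance_weights` are only as good as the cloned policy, so a poorly fitted behavior model carries its error directly into any estimator that uses them.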
We modify the Cartpole task to be infinite horizon: we use the same dynamics as in the original task but change the reward to be if the original task returns a termination (when the pole falls below some threshold) and otherwise. We train a policy on this task until convergence. We then define the target policy as a weighted combination of this pretrained policy (weight ) and a uniformly random policy (weight ). The behavior policy for a specific is taken to be a weighted combination of the pretrained policy (weight ) and a uniformly random policy (weight ). We use , which yields an average step reward of for and for with . We generate an off-policy dataset by running the behavior policy for episodes, each of length steps. We train each stationary distribution correction estimation method using the Adam optimizer with batches of size and learning rates chosen using a hyperparameter search (the optimal learning rate found for either method was ).
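The weighted combination of a pretrained policy and a uniformly random policy can be sketched as below for a discrete action space. The sketch assumes the two weights sum to one, so a single `pretrained_weight` argument determines both. The function names are hypothetical, and no specific weight values from the experiments are implied.

```python
import numpy as np


def mixture_action_probs(pretrained_probs, pretrained_weight):
    """Action probabilities of a pretrained policy mixed with a uniform policy.

    pretrained_probs:  (..., num_actions) probabilities from the pretrained policy.
    pretrained_weight: weight on the pretrained policy; the uniform policy is
                       assumed to receive the remaining 1 - pretrained_weight.
    """
    pretrained_probs = np.asarray(pretrained_probs, dtype=float)
    num_actions = pretrained_probs.shape[-1]
    uniform = np.full_like(pretrained_probs, 1.0 / num_actions)
    return pretrained_weight * pretrained_probs + (1.0 - pretrained_weight) * uniform


def sample_action(rng, probs):
    """Draw one discrete action index from a probability vector."""
    return int(rng.choice(len(probs), p=probs))
```

Using the same helper with two different weights gives a target and a behavior policy that share the pretrained component and differ only in how much uniform exploration they add.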
For the Reacher task, we train a deterministic policy until convergence. We define the target policy as a Gaussian with mean given by the pretrained policy and standard deviation given by . The behavior policy for a specific is taken to be a Gaussian with mean given by the pretrained policy and standard deviation given by . We use , which yields an average step reward of for and for with . We generate an off-policy dataset by running the behavior policy for episodes, each of length steps. We train each stationary distribution correction estimation method using the Adam optimizer with batches of size and learning rates chosen using a hyperparameter search (the optimal learning rate found for either method was ).
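A Gaussian policy centered on a deterministic pretrained policy can be sketched as follows. The sketch assumes an isotropic Gaussian with one scalar standard deviation shared across action dimensions. The function names are hypothetical, and the standard deviations are left as arguments rather than set to experimental values.

```python
import numpy as np


def gaussian_policy_sample(rng, mean_action, std):
    """Sample an action from N(mean_action, std^2 * I)."""
    mean_action = np.asarray(mean_action, dtype=float)
    return mean_action + std * rng.standard_normal(mean_action.shape)


def gaussian_log_prob(action, mean_action, std):
    """Log-density of an isotropic Gaussian policy at the given action."""
    action = np.asarray(action, dtype=float)
    mean_action = np.asarray(mean_action, dtype=float)
    k = action.shape[-1]
    sq_dist = np.sum((action - mean_action) ** 2, axis=-1)
    return -0.5 * sq_dist / std**2 - k * np.log(std) - 0.5 * k * np.log(2.0 * np.pi)


def log_importance_ratio(action, mean_action, target_std, behavior_std):
    """log pi_target(a|s) - log pi_behavior(a|s) when both share the same mean."""
    return (gaussian_log_prob(action, mean_action, target_std)
            - gaussian_log_prob(action, mean_action, behavior_std))
```

Working with log-densities and exponentiating only at the end keeps the per-step ratio numerically stable when the two standard deviations differ substantially.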
C.3 Continuous Grid
For this task, we create a grid which the agent can traverse by moving left/right/up/down. The observations are the coordinates of the square the agent is on. The reward at each step is given by . We use . The target policy is taken to be the optimal policy for this task plus weight on uniform exploration. The behavior policy is taken to be the optimal policy plus weight on uniform exploration. We train using the Adam optimizer with batches of size and learning rates for and for .
Appendix D Proofs
We provide the proof for Theorem 2. We first decompose the error in Section D.1. Then, we analyze the statistical error and optimization error in Section D.2 and Section D.4, respectively. The total error is discussed in Section D.3.
Although the proposed estimator can use any general convex function , as a first step towards a more complete theoretical understanding, we consider the special case of . Clearly, now is strongly convex with . Under Assumption 1, we need only consider , which implies that , and that is Lipschitz continuous with . Similarly, is Lipschitz continuous with on . The following assumption will be needed.
Assumption 3 (MDP regularity).
We assume the observed reward is uniformly bounded, i.e., for some constant . It follows that the reward’s mean and variance are both bounded in .
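The step from a bounded reward to bounded moments can be spelled out as follows. The symbol $R_{\max}$ is a placeholder introduced here for the bounding constant and $r$ stands for the observed reward. Neither is notation confirmed by the text above, so this is a sketch of the standard argument rather than the exact statement used in the proof.

```latex
|r| \le R_{\max}
\;\Longrightarrow\;
\bigl|\mathbb{E}[r]\bigr| \le \mathbb{E}\bigl[|r|\bigr] \le R_{\max},
\qquad
\operatorname{Var}(r) = \mathbb{E}[r^2] - \bigl(\mathbb{E}[r]\bigr)^2
\le \mathbb{E}[r^2] \le R_{\max}^2 .
```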
For convenience, the objective function of DualDICE is repeated here:
We will also make use of the objective in the form prior to introduction of , which we denote as :
Let denote the empirical surrogate of , with optimal solution . We denote and . We denote and as the primal objectives, and , as the dual objectives. We apply some optimization algorithm for optimizing with samples , , and target actions for . We denote the outputs of by .
D.1 Error Decomposition
Let
Observe that
We begin by considering the estimation error induced by using as estimates of , where denotes the empirical Bellman backup with respect to samples from . We will subsequently reconcile this with the true implementation of DualDICE, which uses as estimates of .
The mean squared error of the policy value estimate when using in place of can be decomposed as
(19)  
(23)  
The first term, , is induced by the randomness in observed reward, and we have
which will be discussed in Section D.2.
We consider the as
which is the error induced by optimization .
For the last term , we have