Off-Policy Policy Gradient with State Distribution Correction
Abstract
We study the problem of off-policy policy optimization in Markov decision processes, and develop a novel off-policy policy gradient method. Prior off-policy policy gradient approaches have generally ignored the mismatch between the distribution of states visited under the behavior policy used to collect data and the distribution of states that would be visited under the learned policy. Here we build on recent progress in estimating the ratio of Markov chain stationary state distributions for policy evaluation, and present an off-policy policy gradient optimization technique that accounts for this mismatch in distributions. We present an illustrative example of why this is important, a theoretical convergence guarantee for our approach, and empirical simulations that highlight the benefits of correcting this distribution mismatch.
1 Introduction
The ability to use data about prior decisions and their outcomes to make counterfactual inferences about how alternative decision policies might perform is a cornerstone of intelligent behavior. It also has immense practical potential: it can enable the use of electronic medical record data to infer better treatment decisions for patients, the use of prior product recommendations to inform more effective strategies for presenting recommendations, and the use of previously collected data from students using educational software to better teach those and future students. Such counterfactual reasoning, particularly when one is deriving decision policies that will be used to make not one but a sequence of decisions, is important since online sampling during a learning procedure is both costly and dangerous, and not practical in many of the applications above. While amply motivated, such counterfactual reasoning is also challenging because the data is censored: we can only observe the result of providing a particular chemotherapy treatment policy to a particular patient, not the counterfactual outcome had we instead started with a radiation sequence.
We focus on the problem of performing such counterfactual inferences in the context of sequential decision making in a Markov decision process (MDP). We assume that data has been previously collected using some fixed and known behavior policy, and our goal is to learn a new decision policy with good performance for future use. This problem is often known as batch off-policy policy optimization. We assume that the behavior policy used to gather the data is stochastic: if it is deterministic then, without any additional assumptions, we will not be able to estimate the performance of any other policy.
In this paper we consider how to perform batch off-policy policy optimization (OPPO) using a policy gradient method. While there has been increasing interest in batch off-policy reinforcement learning (RL) over the last few years (Thomas et al., 2015; Jiang and Li, 2016; Thomas and Brunskill, 2016), much of this has focused on off-policy policy evaluation, where the goal is to estimate the performance of a particular given target decision policy. Ultimately, we are frequently interested in the optimization question, which requires us to determine a good new policy for future potential deployment, given a fixed batch of prior data.
To do batch off-policy policy optimization, model-free methods (like deep Q-learning (Mnih et al., 2015) or fitted Q iteration (Ernst et al., 2005)) can be used alone, but there are many cases where we might prefer to focus on policy gradient or actor-critic methods. Policy gradient methods have seen substantial success in the last few years (Schulman et al., 2015) in the on-policy setting, and they can be particularly appealing for cases where it is easier to encode inductive bias in the policy space, or when the actions are continuous (see e.g. Abbeel and Schulman (2016) for more discussion). However, existing approaches to incorporating offline information into online policy gradients have shown limited benefit (Gu et al., 2017b, a), in part because of the variance in gradients incurred by incorporating off-policy data. One approach is to correct exactly for the difference between the sampling data distribution and the target policy data distribution, by using importance sampling to reweight every sample according to the likelihood ratio of behavior policy and evaluation policy up to that step. Unfortunately, the variance of this importance sampling ratio grows exponentially with the problem horizon.
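To make the exponential growth concrete, the following sketch computes the exact variance of the cumulative importance ratio in a deliberately simplified setting. It assumes state-independent policies, so that the per-step ratios are independent with mean one; the function name and the example probabilities are illustrative and are not taken from the paper.

```python
import numpy as np

def cumulative_ratio_variance(pi, mu, horizon):
    """Exact variance of prod_{t < horizon} pi(a_t) / mu(a_t) with a_t ~ mu i.i.d.

    Assumption (illustrative): both policies ignore the state, so the
    per-step ratios are independent and each has mean 1 under mu.
    """
    pi = np.asarray(pi, dtype=float)
    mu = np.asarray(mu, dtype=float)
    # E_mu[(pi/mu)^2] = sum_a pi(a)^2 / mu(a); equals 1 only when pi == mu.
    second_moment = float(np.sum(pi ** 2 / mu))
    # A product of independent mean-one ratios has second moment second_moment**horizon.
    return second_moment ** horizon - 1.0

pi = [0.8, 0.2]  # hypothetical target policy
mu = [0.5, 0.5]  # hypothetical uniform behavior policy
variances = [cumulative_ratio_variance(pi, mu, h) for h in (1, 10, 50, 200)]
```

In this toy case the per-step second moment is 1.36, so the variance of the trajectory-level weight is 1.36 raised to the horizon, minus one.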
To avoid introducing variance in gradients, off-policy actor-critic (OffPAC) (Degris et al., 2012) ignores the stationary state distribution difference between the behavior policy and the target policy, and instead uses only one step of importance sampling to reweight the action distributions. Many practical off-policy policy optimization algorithms, including DDPG (Silver et al., 2014), ACER (Wang et al., 2016), and OffPAC with emphatic weightings (Imani et al., 2018), are based on the gradient expression in the OffPAC algorithm (Degris et al., 2012). However, as we will demonstrate, not correcting for this mismatch in state distributions can result in poor performance in general, both in theory and empirically.
Instead, here we introduce an off-policy policy gradient algorithm that can be used with batch data and that accounts for the difference in the state distributions between the current target and behavior policies during each gradient step. Our approach builds on recent approaches for policy evaluation that avoid the exponential blow-up in importance sampling weights by instead computing a direct ratio over the stationary distribution of state visitations under the target and behavior policy (Hallak and Mannor, 2017; Liu et al., 2018a; Gelada and Bellemare, 2019). We incorporate these ideas within an off-policy actor-critic method to do batch policy optimization. We first provide an illustrative example to demonstrate the benefit of this approach over OffPAC (Degris et al., 2012), and show that correcting for the mismatch in state distributions of the behavior policy and the target policy can be critical for getting good estimates of the policy gradient. We also provide convergence guarantees for our algorithm under certain assumptions. We then compare our approach and OffPAC experimentally on two simulated domains, cart pole and an HIV patient simulator (Ernst et al., 2005). Our results show that our approach is able to learn a substantially higher performing policy than both OffPAC and the behavior policy that is used to gather the batch data. We further demonstrate that we can use the recently proposed off-policy evaluation technique of Liu et al. (2018a) to reliably identify good policies found during the policy gradient optimization run. Our results suggest that directly accounting for the state distribution mismatch can be done without prohibitively increasing the variance during policy gradient evaluations, and that doing so can yield significantly better policies.
These results are promising for enabling us to learn better policies given batch data or improving the sample efficiency of online policy gradient methods by being able to better incorporate past data.
Related Work
Many prior works focus on the policy evaluation problem, as it is a foundation for downstream policy learning problems. These approaches often build on importance sampling techniques to correct for distribution mismatch in the trajectory space, pioneered by the early work on eligibility traces (Precup et al., 2000), and further enhanced with a variety of variance reduction techniques (Thomas et al., 2015; Jiang and Li, 2016; Thomas and Brunskill, 2016). Some authors consider model-based approaches to off-policy policy evaluation (OPPE) (Farajtabar et al., 2018; Liu et al., 2018b), which usually perform better than importance sampling approaches empirically in policy evaluation settings. However, those methods do not extend easily to our OPPO setting: they introduce additional challenges due to bias in the models and typically require fitting a separate model for each target policy. The recent work of Liu et al. (2018a) partially alleviates the variance problem for model-free OPPE by reweighting the state visitation distributions; the resulting variance can be just as high in the worst case, but is often much smaller. Our work incorporates this recent estimator in policy optimization methods to enable learning from data collected off-policy.
In the off-policy policy optimization setting, many works study value-function-based approaches (like fitted Q iteration (Ernst et al., 2005) and DQN (Mnih et al., 2015)), as they are known to be more robust to distribution mismatch. Some recent works aim to further incorporate reweighting techniques within off-policy value function learning (Hallak and Mannor, 2017; Gelada and Bellemare, 2019). These methods hint at the intriguing potential of value-function-based techniques for off-policy learning, and we are interested in similarly understanding the viability of using direct policy optimization techniques in the off-policy setting.
Off-policy actor-critic methods (Degris et al., 2012; Imani et al., 2018) propose an answer to this question by learning the critic in an off-policy way and reweighting actor gradients by correcting the conditional action probabilities, but they ignore the mismatch between the state visitation distributions of the data collection policy and the learned policies. A different research thread on trust region policy optimization (Schulman et al., 2015), while requiring the on-policy setting, incorporates robustness to the mismatch between the data collection and gradient evaluation policies. However, this is not a fully off-policy scenario, and learning from an offline dataset is still strongly motivated by many applications. Many recent methods (Silver et al., 2014; Wang et al., 2016; Gu et al., 2017a, b; Lillicrap et al., 2015) are derived from the policy gradient form in Degris et al. (2012), and some also combine it with the trust region idea to improve empirical sample efficiency by using more off-policy samples from previous iterations. In this work, we demonstrate a basic weakness of the policy gradient definition in Degris et al. (2012), and show how to correct it.
2 Preliminaries
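We consider finite horizon MDPs $M = \langle \mathcal{S}, \mathcal{A}, P, r \rangle$, with a continuous state space $\mathcal{S}$, a discrete action space $\mathcal{A}$, a transition probability distribution $P(s' \mid s, a)$ and an expected reward function $r(s, a)$. We observe tuples of state, action, reward and next state: $(s, a, r, s')$, where $s$ is drawn from an initial state distribution $p_0$, action $a$ is drawn from a stochastic behavior policy $\mu(a \mid s)$ and the reward and next state are generated by the MDP. Given a discount factor $\gamma \in (0, 1]$, the goal is to maximize the expected return of policy $\pi$:
$$R^{\pi}_{M} = \lim_{T \to \infty} \frac{\mathbb{E}\left[\sum_{t=0}^{T} \gamma^{t} r_{t}\right]}{\sum_{t=0}^{T} \gamma^{t}} \qquad (1)$$
When $\gamma = 1$ this becomes the average reward case and $\gamma \in (0, 1)$ is called the discounted reward case. Given any fixed policy $\pi$ the MDP becomes a Markov chain and we can define the state distribution at time step $t$: $d^{\pi}_{t}(s)$, and the stationary state distribution across time: $$d^{\pi}(s) = \lim_{T \to \infty} \frac{\sum_{t=0}^{T} \gamma^{t} d^{\pi}_{t}(s)}{\sum_{t=0}^{T} \gamma^{t}}$$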
To make sure the optimal policy is learnable from collected data, we assume the following about the support set of behavior policy:
Assumption 1.
For at least one optimal policy , for all such that , and for all such that when .
3 An Off-Policy Policy Gradient Estimator
Note that Assumption 1 is quite weak when designing a policy evaluation or optimization scheme, since it only guarantees that adequately visits all the states and actions visited by some . However, a policy optimization algorithm might require off-policy policy gradient estimates at the arbitrary intermediate policies it produces along the way, which might visit states not reached by . A strong assumption to handle such scenarios is that Assumption 1 holds not just for some , but for any possible policy . Instead of making such a strong assumption, we start by defining an augmented MDP where Assumption 1 suffices for obtaining pessimistic estimates of policy values and gradients.
3.1 Constructing an Augmented MDP
Given a data collection policy , let its support set be and . Consider a modified MDP . Any stateaction pairs not in will essentially transition to which is a new absorbing state where all actions will lead to a zero reward selfloop. Concretely, and for any . For all other states, the transition probabilities and rewards are defined as: For , for all , and . For all but , . for , and otherwise. First we prove that the optimal policy of the original MDP remains optimal in augmented MDP as a consequence of Assumption 1.
Theorem 1.
The expected return of all policies in the original MDP is larger than the expected return in the new MDP: . For any optimal that satisfies Assumption 1 we have that
That is, policy optimization in has at least one optimal solution identical to the original MDP with the same policy value since lower bounds the policy value in , so suboptimal policies remain suboptimal.
Proof.
For any trajectory sampled from policy , if every then . If not, let be the first stateaction pair that is not in . Then . Dividing the accumulated rewards by and taking the limit of , then taking the expectation over trajectories induced by , we have that: . For , since covers all stateaction pairs reachable by , so the expected return remains the same. ∎
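The construction in Section 3.1 can be illustrated in a tabular setting. The sketch below is a simplification under stated assumptions: the paper works with continuous states, whereas here states are indexed `0..S-1`, the support of the behavior policy is passed in as a boolean mask, and the function name `augment_mdp` is hypothetical. Unsupported state-action pairs are redirected to a new absorbing state with zero reward, as described in the prose above.

```python
import numpy as np

def augment_mdp(P, r, support):
    """Build a tabular augmented MDP with one extra absorbing state.

    P: (S, A, S) transition probabilities, r: (S, A) expected rewards,
    support: (S, A) boolean mask, True where (s, a) is covered by the data.
    Returns P_aug of shape (S+1, A, S+1) and r_aug of shape (S+1, A);
    the absorbing state has index S.
    """
    S, A, _ = P.shape
    P_aug = np.zeros((S + 1, A, S + 1))
    r_aug = np.zeros((S + 1, A))
    # Supported pairs keep their original dynamics and rewards.
    P_aug[:S, :, :S] = P
    r_aug[:S] = r
    # Unsupported pairs jump to the absorbing state and earn zero reward.
    for s, a in zip(*np.nonzero(~support)):
        P_aug[s, a, :] = 0.0
        P_aug[s, a, S] = 1.0
        r_aug[s, a] = 0.0
    # Every action in the absorbing state is a zero-reward self-loop.
    P_aug[S, :, S] = 1.0
    return P_aug, r_aug

P = np.full((2, 2, 2), 0.5)                      # toy dynamics
r = np.ones((2, 2))                              # toy rewards
support = np.array([[True, False], [True, True]])
P_aug, r_aug = augment_mdp(P, r, support)
```

Because rewards are only ever removed, never added, the return of any policy in the augmented model cannot exceed its return in the original model, which is the pessimism that Theorem 1 relies on.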
3.2 Off-Policy Policy Gradient in Augmented MDP
We will now use the expected return in the modified MDP, , as a surrogate for deriving policy gradients. According to the policy gradient theorem in Sutton et al. (2000), for a parametric policy with parameters :
From here on, is with respect to the new MDP. The definition of in both the average and discounted reward cases follows Sutton et al. (2000). (Footnote 1: For the discounted case, our definition of expected return differs from the definition of Sutton et al. (2000) by a normalization factor $1 - \gamma$. This is because the definitions of stationary distributions are scaled differently in the two cases.)
Now we will show that we can get an unbiased estimator of this gradient using importance sampling from the stationary state distribution and the action distribution . According to the definition of , we have that for all such that , is not in . Hence for any policy since will receive zero reward and lead to a zero reward selfloop. So we have:
(2) 
Note that according to the definition of , the Markov chain induced by and is exactly the same as and . Thus the distribution of generated by executing in is the same as executing in . So, we can estimate this policy gradient using the data we collected from in . We conclude the section by pointing out that working in the augmented MDP allows us to construct a reasonable off-policy policy gradient estimator under the mild Assumption 1, while all prior works in this vein either explicitly or implicitly require the coverage of all possible policies.
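As a sanity check of the reweighting idea behind (2), the sketch below evaluates a state-distribution-corrected gradient in closed form on a small random tabular MDP and compares it with a finite-difference gradient of the normalized discounted return. Everything here is an illustrative assumption: the tabular softmax policy, the symbol names (`mu` for the behavior policy, `w` for the state distribution ratio), and the use of exact expectations in place of samples. It is not the authors' implementation.

```python
import numpy as np

def softmax_policy(theta):
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def state_dist(P, pi, p0, gamma):
    """Normalized discounted state distribution (1 - gamma) * sum_t gamma^t d_t."""
    P_pi = np.einsum("sa,sat->st", pi, P)
    return (1 - gamma) * np.linalg.solve(np.eye(len(p0)) - gamma * P_pi.T, p0)

def q_values(P, r, pi, gamma):
    P_pi = np.einsum("sa,sat->st", pi, P)
    v = np.linalg.solve(np.eye(P.shape[0]) - gamma * P_pi, (pi * r).sum(axis=1))
    return r + gamma * P @ v

def normalized_return(P, r, theta, p0, gamma):
    pi = softmax_policy(theta)
    return float(state_dist(P, pi, p0, gamma) @ (pi * r).sum(axis=1))

def corrected_gradient(P, r, theta, mu, p0, gamma):
    """E_{s ~ d_mu, a ~ mu}[ w(s) * pi/mu * Q(s, a) * grad log pi(a|s) ], computed exactly."""
    pi = softmax_policy(theta)
    d_mu = state_dist(P, mu, p0, gamma)
    w = state_dist(P, pi, p0, gamma) / d_mu      # state distribution ratio
    q = q_values(P, r, pi, gamma)
    grad = np.zeros_like(theta)
    n_states, n_actions = r.shape
    for s in range(n_states):
        for a in range(n_actions):
            grad_log_pi = -pi[s].copy()          # softmax score: onehot(a) - pi(.|s)
            grad_log_pi[a] += 1.0
            prob = d_mu[s] * mu[s, a]            # probability of (s, a) in behavior data
            grad[s] += prob * w[s] * (pi[s, a] / mu[s, a]) * q[s, a] * grad_log_pi
    return grad

rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))       # shape (S, A, S)
r = rng.uniform(size=(S, A))
p0 = np.full(S, 1.0 / S)
mu = np.full((S, A), 1.0 / A)                    # uniform behavior policy
theta = rng.normal(size=(S, A))

grad = corrected_gradient(P, r, theta, mu, p0, gamma)

# Central finite differences of the normalized return, for comparison.
fd = np.zeros_like(theta)
eps = 1e-6
for idx in np.ndindex(*theta.shape):
    step = np.zeros_like(theta)
    step[idx] = eps
    fd[idx] = (normalized_return(P, r, theta + step, p0, gamma)
               - normalized_return(P, r, theta - step, p0, gamma)) / (2 * eps)
```

Setting `w` to all ones in `corrected_gradient` would give an OffPAC-style expression that weights states by the behavior distribution only; in general that no longer matches the finite-difference gradient.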
Note that in the average reward case, such an augmented MDP would not be helpful for policy optimization since all policies that potentially reach will have a value of zero, and the stationary state distribution will be a single mass in the absorbing state. That would not induce a practical policy optimization algorithm. In the average reward case, either we need a stronger assumption that covers the entire state-action space or we must approximate the problem by setting a discount factor for the policy optimization algorithm, which is a common approach for deriving practical algorithms in an average reward (episodic) environment.
4 Algorithm: OPPOSD
Given the off-policy policy gradient derived in (2), how can we efficiently estimate it from samples collected from ? Notice that most quantities in the gradient estimator (2) are quite intuitive and also present in prior works such as OffPAC. The main difference is the state distribution reweighting , which we would like to estimate using samples collected with . For estimating this ratio of state distributions, we build on the recent work of Liu et al. (2018a), which we describe next.
For a policy , let us define the shorthand . Further given a function , define . Then we have the following result.
Theorem 2 (Liu et al., 2018a).
Given any , assume that for all and define
Then if and only if for any measurable test function . (Footnote 2: When , is only determined up to normalization, and hence an additional constraint is required to obtain the conclusion .)
This result suggests a constructive procedure for estimating the state distribution ratio using samples from , by finding a function over the states which minimizes . Since the maximization over all measurable functions as per Theorem 2 is intractable, Liu et al. (2018a) suggest restricting the maximization to a unit ball in an RKHS, which has an analytical solution to the maximization problem, and we use the same procedure to approximate density ratios in our algorithm.
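The fixed-point character of this result can be checked numerically on a tiny chain. The sketch below assumes the average-reward form of the condition, with the per-transition quantity taken to be `w(s) * pi(a|s) / mu(a|s) - w(s')`; that form is our reading of Liu et al. (2018a) and is labeled here as an assumption, since the theorem's formulas are not reproduced above. Indicator functions of the next state serve as test functions, and all names are illustrative.

```python
import numpy as np

def stationary_dist(P_pi):
    """Stationary distribution d with d @ P_pi == d and sum(d) == 1."""
    n = P_pi.shape[0]
    lhs = np.vstack([P_pi.T - np.eye(n), np.ones((1, n))])
    rhs = np.concatenate([np.zeros(n), [1.0]])
    d, *_ = np.linalg.lstsq(lhs, rhs, rcond=None)
    return d

def ratio_residual(P, pi, mu, d_mu, w):
    """E_{(s, a, s') ~ d_mu x mu x P}[ (w(s) pi/mu - w(s')) * 1{s' = j} ] for each j."""
    joint = d_mu[:, None, None] * mu[:, :, None] * P          # P(s, a, s') under behavior data
    delta = (w[:, None] * pi / mu)[:, :, None] - w[None, None, :]
    return (joint * delta).sum(axis=(0, 1))

# Two states, two actions; action j always moves to state j.
P = np.zeros((2, 2, 2))
P[:, 0, 0] = 1.0
P[:, 1, 1] = 1.0
mu = np.full((2, 2), 0.5)                   # uniform behavior policy
pi = np.array([[0.9, 0.1], [0.9, 0.1]])     # hypothetical target policy

d_mu = stationary_dist(np.einsum("sa,sat->st", mu, P))
d_pi = stationary_dist(np.einsum("sa,sat->st", pi, P))
w_true = d_pi / d_mu

res_true = ratio_residual(P, pi, mu, d_mu, w_true)      # vanishes for the true ratio
res_ones = ratio_residual(P, pi, mu, d_mu, np.ones(2))  # nonzero when w ignores the mismatch
```

With a kernel over states, a loss of the kind described above would combine these residuals through the kernel matrix instead of inspecting them one test function at a time.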
Applying Theorem 2 requires overcoming one final obstacle. The theorem presupposes for all . In case where we can directly apply the theorem. Otherwise in the MDP , this assumption indeed holds for all states, but never visits the absorbing state , or any transitions leading into this state. However, since we know this special state, as well as the dynamics leading in and out of it, we can simulate some samples for this state, effectively corresponding to a slight perturbation of to cover . Concretely, we first choose a small smoothing factor . For any sample in our data set, if there exist actions such that , then we will keep the old samples with probability and sample any one of the actions with probability uniformly and change the next state to . If we sampled , consequently, we would also change all samples after this transition to a selfloop in . Thus we create samples drawn according to a new behavior policy which covers all the state action pairs: where is a uniform distribution over the actions not chosen by in state . Now we can use Theorem 2 and the algorithm from Liu et al. (2018a) to estimate . Note that the propensity scores and policy gradients computed on this new dataset correspond to the behaviour policy and not . Formally, in place of using (2), we now estimate:
(3) 
Note that we can estimate the expectation in (3) from the smoothed dataset by construction, since the ratio in all states are known.
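The dataset smoothing step can be sketched for a single logged trajectory as follows. This is a minimal illustration under assumptions that go beyond the text: states are hashable labels, `unsupported` maps a state to the list of actions the behavior policy never takes there, the absorbing state is represented by a sentinel string, the smoothing factor is called `eps`, and the logged action is simply carried along once the trajectory has been absorbed. None of these names come from the paper.

```python
import random

ABSORBING = "s_abs"  # sentinel for the absorbing state (hypothetical label)

def smooth_trajectory(traj, unsupported, eps, rng):
    """Perturb one trajectory of (s, a, r, s_next) tuples toward the absorbing state.

    With probability eps, at a state that has unsupported actions, replace the
    logged action by a uniformly chosen unsupported one and jump to ABSORBING
    with zero reward; all later steps become zero-reward self-loops.
    """
    out = []
    absorbed = False
    for s, a, r, s_next in traj:
        if absorbed:
            out.append((ABSORBING, a, 0.0, ABSORBING))
            continue
        bad_actions = unsupported.get(s, [])
        if bad_actions and rng.random() < eps:
            out.append((s, rng.choice(bad_actions), 0.0, ABSORBING))
            absorbed = True
        else:
            out.append((s, a, r, s_next))
    return out

traj = [("s0", 0, 1.0, "s1"), ("s1", 1, 1.0, "s2"), ("s2", 0, 1.0, "s3")]
unsupported = {"s1": [0]}
unchanged = smooth_trajectory(traj, unsupported, eps=0.0, rng=random.Random(0))
diverted = smooth_trajectory(traj, unsupported, eps=1.0, rng=random.Random(0))
```

The propensities used downstream would then be those of the smoothed behavior policy, consistent with the remark above that the ratio between the two behavior policies is known in every state.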
Now that we have an algorithm for estimating policy gradients from (3), we can plug this into any policy gradient optimization method. Following prior work, we incorporate our off-policy gradients into an actor-critic algorithm. For learning the critic , we can use any off-policy Temporal Difference (Bhatnagar et al., 2009; Maei, 2011) or Q-learning algorithm (Watkins and Dayan, 1992). In our algorithm, we fit an approximate value function (Footnote 3: For simplicity, Eqn 4 views in the tabular setting. See Line 14 in Alg 1 for the function approximation case.) by:
(4) 
where is the stepsize for critic updates and is the offpolicy return:
and is generated by executing . After we learn , serves the role of in our algorithm.
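For intuition about off-policy critic learning in the tabular setting, the sketch below shows a one-step importance-weighted TD(0) update. This is a swapped-in simplification and not the return used in (4): the paper's update targets an off-policy return whose exact form is not reproduced here. The function name and arguments are hypothetical, with `rho` denoting the action probability ratio between target and behavior policy for the logged action.

```python
import numpy as np

def td0_update(V, s, r, s_next, rho, alpha, gamma):
    """One importance-weighted TD(0) step on a tabular value function.

    V: 1-D array of state values (modified in place),
    rho: pi(a|s) / mu(a|s) for the logged action,
    alpha: critic step size, gamma: discount factor.
    """
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * rho * td_error
    return V

V = np.zeros(2)
td0_update(V, s=0, r=1.0, s_next=1, rho=1.0, alpha=0.5, gamma=0.9)
```

A multi-step or lambda-return target trades some of the bias of this one-step target for additional variance from products of importance ratios.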
Given the estimates of the state distribution ratio from Liu et al. (2018a) and the critic updates from (4), we can now update the policy by plugging these quantities in (3). It remains to specify the initial conditions to start the algorithm. Since we have data collected from a behavior policy, it is natural to also warmstart the actor policy in our algorithm to be the same as the behavior policy and correspondingly the critic and ’s to be the value function and distribution ratios for the behavior policy. This can be particularly useful in situations where the behavior policy, while suboptimal, still gets to states with high rewards with a reasonable probability. Hence we use behavior cloning to warmstart the policy parameters for the actor, use onpolicy value function learning for the critic and also fit the state ratios for the actor obtained by behavior cloning. Note that while the ratio will be identically equal to 1 if our behavior cloning was perfect, we actually estimate the ratio to better handle imperfections in the learned actor initialization.
Full pseudocode of our algorithm, which we call OPPOSD for Off-Policy Policy Optimization with State Distribution Correction, is presented in Algorithm 1. We mention a couple of implementation details which we found helpful in improving the convergence properties of the algorithm. Typical actor-critic algorithms update the critic once per actor update in the on-policy setting. However, in the off-policy setting, we find that performing multiple critic updates before an actor update is helpful, since the off-policy TD learning procedure can have high variance. Secondly, the computation of the state distribution ratio is done in an online manner similar to the critic updates and, analogous to the critic, we always retain the state of the optimizer for across the actor updates (rather than learning the from scratch after each actor update). Similar to the critic, we also perform multiple updates after each actor update. These choices are intuitively reasonable, as the standard two-timescale asymptotic analysis of actor-critic methods (Borkar, 2009) requires the critic to converge faster than the actor.
5 Convergence Result
In this section, we present two main results to demonstrate the theoretical advantage of our algorithm. First we present a simple scenario where the prior approach of OffPAC yields an arbitrarily biased gradient estimate, despite having access to a perfect critic. In contrast OPPOSD estimates the gradients correctly whenever the distribution ratios in (2) and the critic are estimated perfectly, by definition. We will further provide a convergence result for OPPOSD to a stationary point of the expected reward function.
A hard example for OffPAC
Many prior offpolicy policy gradient methods use the policy gradient estimates proposed in Degris et al. (2012).
Notice that, in contrast to the exact policy gradient, the expectation over states is taken with respect to the behavior policy distribution instead of . In tabular settings this can lead to correct policy updates, as proved by Degris et al. (2012). We now present an example where the policy gradient computed this way is problematic when using function approximators. Consider the problem instance shown in Figure 1, where the behavior policy is given as: . Thus gives us good coverage over all states and actions. Now we consider policies parameterized by a parameter where has the following structure:
Thus aliases the states and as a manifestation of imperfect representation which is typical with large state spaces. The true state value function of , satisfies that:
Now we define our policy class . Clearly the optimal policy is as it completely eliminates the ill effects of state aliasing. We now study the OffPAC gradient estimator in an idealized setting where the critic is perfectly known. As per Equation 5 of Degris et al. (2012), we have
That is, the gradient vanishes for any policy , meaning that the algorithm can be arbitrarily suboptimal at any point during policy optimization. We note that this does not contradict the previous OffPAC theorems as the policy class is not fully expressive in our example, a requirement for their convergence results. Our gradient estimator (2) instead evaluates to , which is correctly maximized at .
Convergence results for OPPOSD
We next ask whether OPPOSD converges, given reasonable estimates for the density ratio and the critic. To this end, we need to introduce some additional notations and assumptions. Suppose we run OPPOSD over some parametric policy class with . In the sequel, we use subscripts and superscripts by to mean the corresponding quantities with to ease the notation. We begin by describing an abstract set of assumptions and a general result about the convergence of OPPOSD, when we run it over the policies given data collected with an arbitrary policy , before instantiating our assumptions for the specific structure of used in our algorithm.
Definition 1.
A function is smooth when
Assumption 2.
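For all state-action pairs, and a data collection policy , we assume that the MDP guarantees:
1. The gradient of the policy distribution with respect to the policy parameters is bounded.
2. The value functions are bounded.
3. The action probabilities of the data collection policy are bounded from below.
4. The state distribution ratios are bounded, and the function approximator for the ratio has bounded outputs.
5. The expected return of : is a differentiable, Lipschitz and smooth function w.r.t. .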
Theorem 3.
Assume an MDP, a data collection policy and function classes and satisfy Assumption 2. Suppose OPPOSD with policy parameters at iteration is provided critic estimates and distribution ratio estimates satisfying and for iterations . Then
(5) 
That is, when Assumption 2 holds, the scheme converges to an approximate stationary point given estimators and with a small average MSE across the iterations under . An immediate consequence of the theorem above is that as long as we guarantee that , which a reasonable online critic and learning algorithm can guarantee, we have:
which implies the procedure will converge to a stationary point where the true policy gradient is zero.
We now discuss the validity of Assumption 2 in the specific context of the data collection policy used in OPPOSD as well as the augmented MDP . The first assumption, that the gradient of the policy distribution is bounded, can be achieved by an appropriate policy parametrization such as a linear or a differentiable neural-network-based scoring function composed with a softmax link. The second assumption on bounded value functions is standard in the literature. In particular, both these assumptions are crucial for the convergence of policy gradient methods even in an on-policy setting. The third assumption on lower bounded action probabilities holds by construction for the policy due to the smoothing. The fourth assumption on bounded distribution ratios can be ensured if . Technically, this might not hold for in as some states in might be reached with tiny probabilities, but we can instead define to be the set of all the states with . With this change, and given a suitably large , always satisfies the fourth assumption in the MDP . We note that the assumption also requires the outputs of the function approximator to be bounded, which might require additional clipping or regularization in the algorithm. In Algorithm 1, we instead use a weighted importance sampling version of which normalizes the value in by its mean in one batch, which ensures that the largest value of is at most the minibatch size . Finally, the regularity assumption on the smoothness of the reward function is again standard for policy gradient methods even in an on-policy setting.
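The batch normalization of the ratio estimates mentioned above can be written in a few lines. The sketch assumes nonnegative ratio estimates with a positive batch mean; the function name is hypothetical.

```python
import numpy as np

def normalize_ratios(w):
    """Weighted-importance-sampling style normalization within one minibatch.

    Assumes w >= 0 elementwise and w.mean() > 0. After dividing by the batch
    mean, the values average to 1 and none can exceed the batch size.
    """
    w = np.asarray(w, dtype=float)
    return w / w.mean()

batch = np.array([0.0, 0.0, 0.0, 4.0])   # extreme case: all mass on one sample
normalized = normalize_ratios(batch)
```

The upper bound follows because a single nonnegative entry can account for at most the whole batch sum, which is the batch size times the mean.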
Thus we find that under standard assumptions for policy gradient methods, along with some reasonable additional conditions, we expect OPPOSD to have good convergence properties in theory.
6 Experimental Evaluation
In this section we study the empirical properties of OPPOSD, with an eye towards two questions:
1. Does the state distribution correction help improve the performance of off-policy policy optimization?
2. Can we identify the best policy from the optimization path using off-policy policy evaluation?
Baseline and implementation details
To answer the first question, we compare OPPOSD with its closest prior work, but without the state distribution correction, that is the OffPAC algorithm (Degris et al., 2012).
We implement both OPPOSD and OffPAC using feedforward neural networks for the actor and critic, with ReLU hidden layers. For state distribution ratio , we also use a neural network with ReLU hidden layers, with the last activation function to guarantee that for any input. To make a fair comparison, we keep the implementation of OffPAC as close as possible to OPPOSD other than the use of . Concretely, we also equip OffPAC with the enhancements that we find improve empirical performance, such as warm start of the actor and critic, as well as several critic updates per actor update. We use the same off-policy critic learning algorithm for OffPAC and OPPOSD. To learn , we use Algorithm 1 (average reward) in Liu et al. (2018a) with an RBF kernel for the CartPole-v0 experiment, and Algorithm 2 (discounted reward) in Liu et al. (2018a) with an RBF kernel for the HIV experiment. We normalize the input to the networks to have zero mean and unit standard deviation, and in each minibatch we normalize the kernel loss of fitting by the mean of the kernel matrix elements, to minimize the effect of kernel hyperparameters on the learning rate. Implementation details omitted here are provided in the Appendix.
Simulation domains
We compare the algorithms in two simulation domains. The first domain is the cart pole control problem, where an agent needs to balance a mass attached to a pole in an upright position, by applying one of two sideways movements to a cart on a frictionless track. The state space is continuous and describes the position and velocity of cart and pole. The action space consists of applying a unit force in one of two directions. The horizon is fixed to 200. If the trajectory ends in fewer than 200 steps, we pad the episode by continuing to sample actions and repeating the last state. We use a uniformly random policy to collect trajectories as off-policy data, which is a very challenging data set for off-policy policy optimization methods to learn from, as this policy does not attain the desired upright configuration for any prolonged period of time. We use neural networks with a 32-unit hidden layer to fit the stationary distribution ratio, actor and critic.
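The padding rule for short cart pole episodes can be sketched as below. The helper name and the list-based episode format are hypothetical; actions are sampled uniformly to match the uniformly random data collection policy, and reward handling for the padded steps is left out because the text does not specify it.

```python
import random

def pad_episode(states, actions, horizon, n_actions, rng):
    """Pad an episode to a fixed horizon by repeating the last state.

    New actions are drawn uniformly at random from range(n_actions),
    mirroring a uniformly random behavior policy (an assumption of this sketch).
    """
    states, actions = list(states), list(actions)
    while len(states) < horizon:
        states.append(states[-1])
        actions.append(rng.randrange(n_actions))
    return states, actions

padded_states, padded_actions = pad_episode(
    ["s0", "s1"], [0, 1], horizon=5, n_actions=2, rng=random.Random(0)
)
```

Fixing the episode length this way keeps every trajectory at the same horizon, so that all time steps contribute samples to the batch.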
The second domain is an HIV treatment simulation described in Ernst et al. (2006). Here the states are six-dimensional real-valued vectors, which model the response of numbers of cells/virus to a treatment. Each action corresponds to whether or not to apply each of two types of drug, leading to a total of 4 actions. The transition dynamics are modeled by an ODE system in Ernst et al. (2006). The reward consists of a small negative reward for deploying each type of drug, and a positive reward based on the HIV-specific cytotoxic T-cells, which will increase with a proper treatment schedule. To maximize the total reward in this simulator, algorithms need to do structured treatment interruption (STI), which aims to achieve a balance between treatment and the adverse effect of abusing drugs. The horizon of this domain is 200 and the discount factor is set by the simulator to . Each trajectory simulates a treatment period of 1000 days for one patient, and each step corresponds to a 5-day interval in the ODE system. We represent the state by taking the logarithm of the state features and divide the reward by to ensure they are in a reasonable range to fit the neural network models. A uniformly random policy does not visit any rewarding states often enough to collect useful data for off-policy learning. To simulate an imperfect but reasonable data collection policy, we first train an on-policy actor-critic method to learn a reasonable (but still far from optimal) policy . We then use the data collection policy , where is the uniformly random policy, to collect trajectories. We use neural networks with three 16-unit hidden layers to fit the actor and state distribution ratio, and a neural network with four 32-unit hidden layers for the critic.
Though in both domains our data collection policy is eventually able to cover the whole state-action space, the situation under finite amounts of data is different. In cart pole, since an optimal policy can control the cart to stay in a small region, it is relatively easy for the uniform random policy to cover the states visited by the optimal policy. In the HIV treatment domain, it is unlikely that the logged data will cover the desirable state space.
Impact of state reweighting on policy optimization
In Figures 1(a) and 1(b), we plot the on-policy evaluation values of the policies produced by OPPOSD and OffPAC during the actor updates across 10 runs. Each run uses a different data set collected using the behavior policy as well as a different random seed for the policy optimization procedure. In each run we use the same dataset for each method to allow paired comparisons. We evaluate the policy after every 100 actor updates using on-policy Monte Carlo evaluation over 20 trajectories. The results are averaged over 10 runs and error bars show the standard deviation. Along the x-axis, the plot shows how the policy value changes as we take policy gradient steps.
At a high level, we see that in both domains our algorithm significantly improves upon the behavior policy, and eventually outperforms OffPAC consistently. Zooming in a bit, we see that for the initial iterates on the left of the plots, the gap between OPPOSD and OffPAC is small, as the state distribution of the learned policies is likely close enough to that of the behavior policy for the distribution mismatch not to matter significantly. This effect is particularly pronounced in Figure 1(a). However, the gap quickly widens as we move to the right in both figures. In particular, OffPAC barely improves over the behavior policy in Figure 1(b), while OPPOSD finds a significantly better policy. Overall, we find that these results are an encouraging validation of our intuition about the importance of correcting the state distribution mismatch.
Identifying Best Policy by Off-Policy Evaluation
While OPPOSD consistently outperforms Off-PAC in average performance across 10 runs in both domains, there is still significant variance in both methods across runs. Given this variance, a natural question is whether we can identify the best-performing policies, during and across multiple runs of OPPOSD for a single dataset. To answer this question, we checkpoint all the policies produced by OPPOSD after every 1000 actor updates, across 5 runs of our algorithm with the same input dataset generated in the HIV domain. We then evaluate these policies using the off-policy policy evaluation (OPPE) method in Liu et al. (2018a). The evaluation is performed with an additional dataset sampled from the behavior policy. This corresponds to the typical practice of sample splitting between optimization and evaluation.
We show the quality of the OPPE estimates against the true policy values, for two different OPPE datasets sampled from the behavior policy, in the two panels of Figure 3. In each plot, the X-axis shows the true values obtained by on-policy Monte-Carlo evaluation and the Y-axis shows the OPPE estimates. We find that the OPPE estimates are generally well correlated with the on-policy values, and picking the policy with the best OPPE estimate results in a true value substantially better than both the best Off-PAC result and the behavior policy. A closer inspection also reveals the importance of this validation step. The red squares correspond to the final iterate of OPPOSD in each of the 5 runs, which has a very high value in some runs but is somewhat worse in others. Using OPPE to robustly select a good policy adds a layer of additional assurance to our policy optimization procedure.
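The selection step can be sketched in a few lines. The snippet below is only an illustration: the function names are hypothetical, and the numbers are made-up toy values rather than results from our experiments. It picks the checkpoint with the highest OPPE estimate and, when true values happen to be available, reports how well the estimates track them.

```python
import numpy as np

def select_checkpoint(oppe_estimates):
    """Return the index of the checkpoint with the highest OPPE estimate."""
    return int(np.argmax(np.asarray(oppe_estimates, dtype=float)))

def pearson_correlation(x, y):
    """Pearson correlation between OPPE estimates and true policy values."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

# Toy values for four hypothetical checkpoints (not experimental results).
oppe_estimates = [1.2, 3.4, 2.9, 0.5]
true_values = [1.0, 3.0, 3.1, 0.7]

best = select_checkpoint(oppe_estimates)
# Gap between the best true value and the true value of the selected checkpoint.
selection_gap = max(true_values) - true_values[best]
correlation = pearson_correlation(oppe_estimates, true_values)
```

In practice only the OPPE estimates are available at selection time; the true values appear here purely to show how the quality of the selection could be checked in simulation.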
7 Discussion and Conclusion
We presented a new off-policy actor-critic algorithm, OPPOSD, based on a recently proposed stationary state distribution ratio estimator. There exist many interesting next steps, including different critic learning methods which may also leverage the state distribution ratio, and exploring alternative methods for policy evaluation or alternative stationary state distribution ratio estimators (Hallak and Mannor, 2017; Gelada and Bellemare, 2019). Another interesting direction is to improve the sample efficiency of online policy gradient algorithms by using our corrected gradient estimates.
There are also many algorithms, such as DDPG and ACER, that have been built on the previous off-policy policy gradient framework (Off-PAC) and improve Off-PAC in different directions. They are orthogonal to our work, and our state distribution correction techniques are composable with these further improvements in the Off-PAC framework. To understand the impact of correcting the stationary distribution, in the experiment section of this work we therefore focus on an ablation comparison with Off-PAC. It would be interesting to combine our work with the additional contributions of DDPG, ACER, etc. to derive improved variants of each of those algorithms.
In parallel with our work, Zhang et al. (2019) have presented a different approach for off-policy policy gradient, motivated by a similar recognition of the bias in the Off-PAC gradient estimator. While similarly motivated, the two works have important differences. On the methodological side, Zhang et al. (2019) start from an off-policy objective function and derive a gradient for it. In contrast, we compute an off-policy estimator for the gradient of the on-policy objective function. The latter leads to a much simpler method, both conceptually and computationally, as we do not need to compute the gradients of the visitation distribution. On the other hand, Zhang et al. (2019) focus on incorporating more general interest functions in the off-policy objective, and use the emphatic weighting machinery for obtaining the gradient of their off-policy objective. In terms of settings, our approach works in the offline setting (though it is easily extended to the online one), while they require an online setting in order to compute the gradients of the propensity score function. Finally, we present convergence results quantifying the error in our critic and propensity score computations, while Zhang et al. (2019) assume a perfect oracle for both and rely on a truly unbiased gradient estimator for the convergence results.
To conclude, our algorithm fixes the bias in off-policy policy gradient estimates introduced by the behavior policy’s stationary state distribution. We prove that, under certain assumptions, our algorithm is guaranteed to converge. We also show that ignoring the bias due to the mismatch in state distributions can make an off-policy policy gradient algorithm fail even in a simple illustrative example, and that by accounting for this mismatch our approach yields significantly better performance in two simulation domains.
References
 Abbeel and Schulman (2016) P. Abbeel and J. Schulman. Deep reinforcement learning through policy optimization. NeurIPS Tutorial, 2016. URL https://nips.cc/Conferences/2016/Schedule?showEvent=6198.
 Barto et al. (1983) A. G. Barto, R. S. Sutton, and C. W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, pages 834–846, 1983.
 Bhatnagar et al. (2009) S. Bhatnagar, D. Precup, D. Silver, R. S. Sutton, H. R. Maei, and C. Szepesvári. Convergent temporal-difference learning with arbitrary smooth function approximation. In Advances in Neural Information Processing Systems, pages 1204–1212, 2009.
 Borkar (2009) V. S. Borkar. Stochastic approximation: a dynamical systems viewpoint, volume 48. Springer, 2009.
 Brockman et al. (2016) G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym, 2016.
 Degris et al. (2012) T. Degris, M. White, and R. S. Sutton. Off-policy actor-critic. arXiv preprint arXiv:1205.4839, 2012.
 Ernst et al. (2005) D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503–556, 2005.
 Ernst et al. (2006) D. Ernst, G.-B. Stan, J. Goncalves, and L. Wehenkel. Clinical data based optimal STI strategies for HIV: a reinforcement learning approach. In Proceedings of the 45th IEEE Conference on Decision and Control, pages 667–672. IEEE, 2006.
 Farajtabar et al. (2018) M. Farajtabar, Y. Chow, and M. Ghavamzadeh. More robust doubly robust off-policy evaluation. In Proceedings of the 35th International Conference on Machine Learning, pages 1446–1455, 2018.
 Gelada and Bellemare (2019) C. Gelada and M. G. Bellemare. Off-policy deep reinforcement learning by bootstrapping the covariate shift. In AAAI, 2019.
 Gu et al. (2017a) S. Gu, T. Lillicrap, Z. Ghahramani, R. E. Turner, and S. Levine. Q-Prop: Sample-efficient policy gradient with an off-policy critic. In International Conference on Learning Representations, 2017a.
 Gu et al. (2017b) S. S. Gu, T. Lillicrap, R. E. Turner, Z. Ghahramani, B. Schölkopf, and S. Levine. Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 3846–3855, 2017b.
 Hallak and Mannor (2017) A. Hallak and S. Mannor. Consistent online off-policy evaluation. In Proceedings of the 34th International Conference on Machine Learning, pages 1372–1383, 2017.
 Imani et al. (2018) E. Imani, E. Graves, and M. White. An off-policy policy gradient theorem using emphatic weightings. In Advances in Neural Information Processing Systems, pages 96–106, 2018.
 Jiang and Li (2016) N. Jiang and L. Li. Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, volume 48, pages 652–661, 2016.
 Lillicrap et al. (2015) T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 Liu et al. (2018a) Q. Liu, L. Li, Z. Tang, and D. Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, pages 5361–5371, 2018a.
 Liu et al. (2018b) Y. Liu, O. Gottesman, A. Raghu, M. Komorowski, A. A. Faisal, F. Doshi-Velez, and E. Brunskill. Representation balancing MDPs for off-policy policy evaluation. In Advances in Neural Information Processing Systems, pages 2649–2658, 2018b.
 Maei (2011) H. R. Maei. Gradient Temporal-Difference Learning Algorithms. PhD thesis, University of Alberta, 2011.
 Mnih et al. (2015) V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 Precup et al. (2000) D. Precup, R. S. Sutton, and S. P. Singh. Eligibility traces for off-policy policy evaluation. In Proceedings of the 17th International Conference on Machine Learning, pages 759–766, 2000.
 Schulman et al. (2015) J. Schulman, S. Levine, P. Moritz, M. Jordan, and P. Abbeel. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, pages 1889–1897, 2015.
 Silver et al. (2014) D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning, 2014.
 Sutton et al. (2000) R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 2000.
 Thomas and Brunskill (2016) P. Thomas and E. Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pages 2139–2148, 2016.
 Thomas et al. (2015) P. Thomas, G. Theocharous, and M. Ghavamzadeh. High confidence policy improvement. In Proceedings of the 32nd International Conference on Machine Learning, pages 2380–2388, 2015.
 Wang et al. (2016) Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. de Freitas. Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224, 2016.
 Watkins and Dayan (1992) C. J. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
 Zhang et al. (2019) S. Zhang, W. Boehmer, and S. Whiteson. Generalized off-policy actor-critic. arXiv preprint arXiv:1903.11329, 2019.
Appendix A Proof of Theorem 3
We first state and prove an abstract result. Suppose we have a function which is differentiable, Lipschitz and smooth, and attains a finite minimum value . Suppose we have access to a noisy gradient oracle which returns a vector given a query point . We say that the vector is accurate for parameter if for all , the quantity satisfies
(6) 
Notice that the expectations above are only with respect to any randomness in the oracle, while holding the query point fixed. Suppose we run the stochastic gradient descent algorithm using the oracle responses, that is, we update . While several results for the convergence of stochastic gradient descent to a stationary point of a smooth, nonconvex function are well-known, we could not find a result for the biased oracle assumed here and hence we provide a result from first principles. We have the following guarantee on the convergence of the sequence to an approximate stationary point of .
Theorem 4.
Suppose is differentiable and smooth, and the approximate gradient oracle satisfies the conditions (6) with parameters at iteration . Then stochastic gradient descent with the oracle, with an initial solution and stepsize satisfies after iterations:
Proof.
Since is smooth, we have
Here the first equality follows from our update rule while the remaining simply use the definition of along with algebraic manipulations. Now taking expectations of both sides, we obtain
where we have invoked the properties of the oracle to bound the last two terms. Summing over iterations , we obtain
Rearranging terms, and using that , we obtain
Now using the choice and simplifying, we obtain the statement of the theorem. ∎
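For readers who want the shape of this argument written out, the following is a generic sketch of the descent-lemma calculation for stochastic gradient descent with a biased oracle. All notation here is assumed for illustration: $L$ is the smoothness constant, $G$ the Lipschitz constant, $\eta < 1/L$ a fixed step size, $\delta_t$ the oracle error at iteration $t$, $B_t$ a bound on its conditional bias and $S_t^2$ a bound on its conditional second moment. The constants are not claimed to match those of Theorem 4.

```latex
% Assumed setup (illustrative notation, not necessarily that of Theorem 4):
%   x_{t+1} = x_t - \eta g_t, \qquad g_t = \nabla f(x_t) + \delta_t,
%   \|\mathbb{E}[\delta_t \mid x_t]\| \le B_t, \qquad
%   \mathbb{E}[\|\delta_t\|^2 \mid x_t] \le S_t^2, \qquad
%   \|\nabla f(x)\| \le G \text{ for all } x.
\begin{align*}
f(x_{t+1})
  &\le f(x_t) + \langle \nabla f(x_t),\, x_{t+1} - x_t \rangle
     + \tfrac{L}{2}\,\|x_{t+1} - x_t\|^2 \\
  &= f(x_t) - \eta\,\|\nabla f(x_t)\|^2
     - \eta\,\langle \nabla f(x_t),\, \delta_t \rangle
     + \tfrac{L\eta^2}{2}\,\|\nabla f(x_t) + \delta_t\|^2 \\
  &\le f(x_t) - \eta(1 - L\eta)\,\|\nabla f(x_t)\|^2
     - \eta\,\langle \nabla f(x_t),\, \delta_t \rangle
     + L\eta^2\,\|\delta_t\|^2 .
\end{align*}
% Taking expectations and using the bias and second-moment bounds:
\begin{align*}
\mathbb{E}\, f(x_{t+1})
  &\le \mathbb{E}\, f(x_t) - \eta(1 - L\eta)\,\mathbb{E}\,\|\nabla f(x_t)\|^2
     + \eta\, G\, B_t + L\eta^2 S_t^2 .
\end{align*}
% Summing over t = 1, \dots, T and using f(x_{T+1}) \ge f^*:
\begin{align*}
\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\,\|\nabla f(x_t)\|^2
  &\le \frac{f(x_1) - f^*}{\eta(1 - L\eta)\,T}
     + \frac{1}{(1 - L\eta)\,T}\sum_{t=1}^{T}\bigl(G\, B_t + L\eta\, S_t^2\bigr).
\end{align*}
```

The last step of the first display uses $\|a + b\|^2 \le 2\|a\|^2 + 2\|b\|^2$. The bias enters through the inner product term, bounded by Cauchy-Schwarz and the Lipschitz bound on the gradient, while the second moment enters through the quadratic term.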
The theorem tells us that if we pick an iterate uniformly at random from , then it is an approximate stationary point in expectation, up to an accuracy which is determined by the bias and variance of the stochastic gradient oracle.
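A small numerical experiment can make this concrete. The sketch below is an illustration under assumed choices, not part of our algorithm: it runs stochastic gradient descent on the toy objective f(x) = 0.5 ||x||^2 with a hypothetical oracle that adds a constant bias and Gaussian noise to the true gradient, and compares the average squared gradient norm with and without the oracle error.

```python
import numpy as np

def biased_sgd(grad_fn, x0, step_size, n_iters, bias, noise_std, rng):
    """SGD with a noisy, biased oracle g(x) = grad_fn(x) + bias + Gaussian noise.

    Returns an array holding the query points x_1, ..., x_T (one per row).
    """
    x = np.asarray(x0, dtype=float)
    iterates = []
    for _ in range(n_iters):
        iterates.append(x.copy())
        noise = noise_std * rng.standard_normal(x.shape)
        x = x - step_size * (grad_fn(x) + bias + noise)
    return np.stack(iterates)

def quadratic_grad(x):
    """Gradient of the toy objective f(x) = 0.5 * ||x||^2, which is 1-smooth."""
    return x

def mean_sq_grad_norm(grad_fn, iterates):
    """Average of ||grad f(x_t)||^2 over the iterates."""
    return float(np.mean([np.sum(grad_fn(x) ** 2) for x in iterates]))

rng = np.random.default_rng(0)
x0 = np.array([5.0, -3.0])
exact = biased_sgd(quadratic_grad, x0, 0.1, 2000, np.zeros(2), 0.0, rng)
biased = biased_sgd(quadratic_grad, x0, 0.1, 2000, np.array([1.0, 0.0]), 0.1, rng)

# Pick an iterate uniformly at random, as in the discussion of the theorem.
random_iterate = biased[rng.integers(len(biased))]

print(mean_sq_grad_norm(quadratic_grad, exact))
print(mean_sq_grad_norm(quadratic_grad, biased))
```

In this toy run the exact oracle drives the gradient norm to zero, while the biased oracle settles near the point where the true gradient cancels the bias, so the squared gradient norm plateaus at roughly the squared norm of the bias.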
Given this abstract result, we can now prove Theorem 3 by instantiating the errors in the gradient oracle as a function of our assumptions.
Proof of Theorem 3
We now instantiate the result and assumptions for the case of the off-policy policy gradient method. First, note that the algorithm is stochastic gradient ascent for maximizing the expected return . Thus we can apply Theorem 4 with , so that where is an upper bound on the value of any policy in the MDP. attains a finite minimum value since the expected return has a finite maximum value. We focus on quantifying the bias in terms of errors in the critic and propensity score computations first. We first introduce some additional notation. Suppose and are the true propensity (in terms of state distributions, relative to ) and value functions for a policy . Let . Suppose we are given estimators and for and respectively. Then our estimated and true off-policy policy gradients can be written as:
Now the bias can be bounded as
How we simplify further depends on the assumptions we make on the errors in and . As a natural assumption, suppose that the relative errors are bounded in MSE, that is and . Then by the Cauchy-Schwarz inequality, we can simplify the above bias term as
where the operations of squaring and square root are applied elementwise to the vector . By Assumption 2 we have , for all , and
Then the bound on the bias further simplifies to
Similarly, for the variance we have
Hence, the RHS of Theorem 4 simplifies to
where and are the error parameters in the propensity scores and critic at the iteration of our algorithm. Since we update these quantities online along with the policy parameters, we expect and to decrease as increases. That is, assuming that satisfies the coverage assumptions with finite upper bounds on the propensities and the policy class is Lipschitz continuous in its parameters, the scheme converges to an approximate stationary point given estimators and with a small average MSE across the iterations under .
Appendix B Details for Experiments
This section gives implementation details and hyperparameter settings for our experiments. We use three separate neural networks, one each for the actor, the critic, and the state distribution ratio model w. We use the Adam optimizer for all of them. We also use entropy regularization for the actor. We warm start the actor by maximizing the log-likelihood of the actor on the collected dataset. For the critic, we use the same critic algorithm as in Algorithm 1, except that there is no importance sampling ratio (as it is on-policy for the warm start). For the warm start of w, we fit w for several iterations using the warm-start policy found for the actor. Warm start uses the same learning rates as normal training. For the critic and w, we also keep the optimizer state when we start normal training.
The table below lists the hyperparameter settings used in both domains:
| Hyperparameter | cart pole | HIV |
| --- | --- | --- |
|  | 1.0 | 0.98 |
|  | 0. | 0. |
| entropy coefficient | 0.01 | 0.03 |
| learning rate (actor) | 1e-3 | 5e-6 |
| learning rate (critic) | 1e-3 | 1e-3 |
| learning rate (w) | 1e-3 | 3e-4 |
| batch size (actor) | 5000 | 5000 |
| batch size (critic) | 5000 | 5000 |
| batch size (w) | 200 | 200 |
| number of iterations (critic) | 10 | 10 |
| number of iterations (w) | 50 | 50 |
| weight decay (w) | 1e-5 | 1e-5 |
| behavior cloning number of iterations | 2000 | 2000 |
| warm start number of iterations (critic) | 500 | 2500 |
| warm start number of iterations (w) | 500 | 2500 |
We also follow Algorithm 1 and Algorithm 2 of Liu et al. (2018a) to learn w. We scale the inputs to w so that the whole off-policy dataset has zero mean and a standard deviation of 1 along each dimension of the state space. We use the RBF kernel to compute the loss function for w. For the cart pole simulator, the kernel bandwidth is set to the median of the state distances. If computing this median state distance over the whole off-policy dataset is computationally too expensive, it can be approximated using a minibatch. In the HIV domain the bandwidth is set to 1. When we compute the loss of w, we need to sample two minibatches independently to get an unbiased estimate of the quadratic loss. The loss for each pair of minibatches is normalized by the sum of the kernel matrix elements computed from them.
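The preprocessing and kernel computations described above can be sketched as follows. This is a minimal illustration under stated assumptions, not our training code: the function names are hypothetical, the RBF kernel is assumed to take the form exp(-||x - y||^2 / (2 h^2)) with bandwidth h, and the residual vectors passed to the loss are assumed to be computed elsewhere from w and the policy ratio as in Liu et al. (2018a).

```python
import numpy as np

def standardize(states):
    """Scale states to zero mean and unit standard deviation per dimension."""
    mean = states.mean(axis=0)
    std = states.std(axis=0)
    std = np.where(std > 0, std, 1.0)  # guard against constant dimensions
    return (states - mean) / std

def pairwise_sq_dists(x, y):
    """Matrix of squared Euclidean distances between rows of x and rows of y."""
    diff = x[:, None, :] - y[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def median_bandwidth(states):
    """Median pairwise Euclidean distance between distinct states."""
    dists = np.sqrt(pairwise_sq_dists(states, states))
    upper = np.triu_indices(len(states), k=1)
    return float(np.median(dists[upper]))

def rbf_kernel(x, y, bandwidth):
    """Assumed kernel form: exp(-||x - y||^2 / (2 * bandwidth^2))."""
    return np.exp(-pairwise_sq_dists(x, y) / (2.0 * bandwidth ** 2))

def normalized_quadratic_loss(res_a, res_b, states_a, states_b, bandwidth):
    """Quadratic form res_a^T K res_b over two independent minibatches,
    normalized by the sum of the kernel matrix elements."""
    kernel = rbf_kernel(states_a, states_b, bandwidth)
    return float(res_a @ kernel @ res_b / kernel.sum())

rng = np.random.default_rng(0)
raw_states = rng.normal(size=(50, 3)) * np.array([1.0, 10.0, 100.0])
states = standardize(raw_states)
bandwidth = median_bandwidth(states[:20])  # minibatch approximation of the median
```

Using two independently sampled minibatches keeps the quadratic form an unbiased estimate of the population loss, since the product of residuals is then taken across independent samples.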
B.1 Choice of Algorithm with Discounted Reward
In discounted reward settings, the state distribution is also defined with respect to the discount factor γ, and Liu et al. (2018a) introduce an algorithm to learn the state distribution ratio in this setting. However, we notice that in on-policy policy learning, though the policy gradient theorem (Sutton et al., 2000) requires samples from the stationary state distribution defined using the discount factor, it is common to directly use the collected samples to compute the policy gradient without resampling or reweighting the samples according to the discounted stationary distribution. This might be driven by sample efficiency concerns, as samples at later timesteps in the discounted stationary distribution receive exponentially small probability, meaning they are not leveraged as effectively by the algorithm. Given this, we compare three different variants of our algorithm in the HIV experiment with discounted reward. The first variant (OPPOSD average) uses the algorithm for the average reward setting, but evaluates its discounted reward. The second (OPPOSD disc) learns the state distribution ratio in the discounted case (Algorithm 2 in Liu et al., 2018a), but still samples from the undiscounted distribution to compute the gradient. The third (OPPOSD) learns the state distribution ratio in the discounted case and also resamples the samples according to the discounted distribution. In the main body of the paper, we select the third one, as it follows most naturally from the problem definition and the policy gradient theorem. Results for the three variants are shown in Figure 4; they do not differ significantly in this experiment.
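The resampling step in the third variant can be sketched as below. This is an illustrative assumption about how such resampling might be done, not a transcript of our implementation: each logged sample is assumed to carry its timestep t within its trajectory, and samples are redrawn with probability proportional to γ^t. The function name and the example value of gamma are hypothetical.

```python
import numpy as np

def discounted_resample_indices(timesteps, gamma, n_samples, rng):
    """Resample dataset indices with probability proportional to gamma ** t.

    timesteps: 1-D integer array, timestep of each logged sample in its trajectory.
    Returns the resampled indices and the sampling probabilities.
    """
    timesteps = np.asarray(timesteps)
    weights = gamma ** timesteps.astype(float)
    probs = weights / weights.sum()
    indices = rng.choice(len(timesteps), size=n_samples, replace=True, p=probs)
    return indices, probs

rng = np.random.default_rng(0)
# Three hypothetical trajectories of horizon 200, flattened into one dataset.
timesteps = np.tile(np.arange(200), 3)
indices, probs = discounted_resample_indices(timesteps, 0.98, 10000, rng)
resampled_mean_t = float(timesteps[indices].mean())
```

Early timesteps dominate the resampled batch, which illustrates the sample efficiency concern raised above: samples late in a trajectory are drawn only rarely.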